Worksheet 1: Exploring Data with R
Learning Objectives 🎯
By completing this worksheet, you will:
Load and explore datasets in R using basic commands
Distinguish between qualitative and quantitative variables
Classify variables by measurement scale (nominal, ordinal, interval, ratio)
Create and interpret univariate visualizations (histograms, boxplots)
Compute grouped statistics using functional programming
Build comparative visualizations for multivariate exploration
Practice effective prompting strategies for AI assistance
Prerequisites 📚
Basic familiarity with RStudio interface
Understanding of variable types and measurement scales
Ability to run R commands and interpret output
Introduction
In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined how the grass species Echinochloa crus-galli responds to environmental changes. This ecophysiological study provides an excellent opportunity to practice core statistical skills using real experimental data.
Study Design:
Species: Echinochloa crus-galli (barnyard grass)
Locations: Quebec, Canada (northern) and Mississippi, USA (southern)
Sample Size: 12 plants total (6 from each location)
Treatment: Overnight chilling (half received treatment, half remained unchilled)
Measurements: CO₂ uptake rates at 7 ambient CO₂ concentration levels per plant
Total Observations: 84 (12 plants × 7 measurements each)
This balanced experimental design allows us to explore how geographic origin and temperature treatment affect plant CO₂ uptake across varying atmospheric conditions.
Note
Citation: Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) “The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures”, Ecology, 71, 1389–1400.
Part 1: Loading and Understanding the Dataset
The CO2 dataset is included in base R, making it readily accessible for analysis. This is one of several built-in datasets that R provides for teaching and demonstration purposes.
# Load the dataset from the datasets package
data("CO2", package = "datasets")
Understanding R’s Lazy Loading
After running this command, an object named CO2 should appear in your environment (visible in the top-right pane of RStudio) as a promise. A promise is part of R's lazy loading mechanism that delays evaluation until the object is actually needed. This improves performance by not loading data into memory until it's accessed.
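As a quick illustration (not required for the exercises), touching the object forces the promise to be evaluated. A minimal sketch, assuming the data() call above has already been run:
# Accessing the object evaluates the promise and loads the data into memory
head(CO2)   # first six rows; CO2 now behaves like an ordinary data frame
nrow(CO2)   # 84 observations, as described in the introduction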
The help() command provides comprehensive documentation for R's functionality. It's your first resource for understanding datasets, functions, and packages.
# View documentation for the CO2 dataset
help(CO2)
# Alternative syntax
?CO2
Question 1: Using the help documentation (which appears in the bottom-right pane of RStudio), answer the following:
List all variables in the dataset
For each variable, specify:
The variable name
What it measures or represents
Its measurement units (where applicable, e.g., mL/L, μmol/m²/s)
Any additional context provided in the documentation
Documentation Skills 📖
Learning to read R documentation effectively is a crucial skill. Pay attention to:
Description: Overview of the dataset
Format: Structure and variable definitions
Details: Additional context and methodology
Source: Original reference for the data
Examples: Sample code for using the data
Part 2: Initial Data Exploration
The View() command provides a spreadsheet-like interface for exploring data visually. This is particularly useful for getting an initial sense of your data's structure.
# Open dataset in RStudio's data viewer
View(CO2)
# Note: View() is capitalized - view() won't work!
Exploring Data Structure
In addition to View(), consider these commands for initial exploration:
# Display structure of the dataset
str(CO2)
# First few rows
head(CO2)
# Basic dimensions
nrow(CO2) # Number of rows
ncol(CO2) # Number of columns
dim(CO2) # Both dimensions
Question 2: Based on your exploration, provide a comprehensive analysis:
Dataset dimensions:
How many observations (rows) are in the CO2 dataset?
How many variables (columns) are there?
Variable classification:
Which variables are qualitative (categorical)?
Which variables are quantitative (numerical)?
Qualitative variable analysis:
For each qualitative variable, determine:
Is it nominal (categories have no natural order) or ordinal (categories have a meaningful order)?
List all possible values (categories/levels)
Explain your reasoning for the classification
Quantitative variable analysis:
For each quantitative variable, determine:
Is it interval (no true zero) or ratio (has true zero) scale?
Provide justification based on:
Whether zero represents absence of the quantity
Whether ratios are meaningful (e.g., “twice as much”)
The nature of the measurement
Measurement Scales Review 📏
Qualitative (Categorical) Variables:
Nominal: Categories with no inherent order (e.g., colors, treatment groups)
Ordinal: Categories with meaningful order but unequal intervals (e.g., satisfaction ratings)
Quantitative (Numerical) Variables:
Interval: Equal intervals but no true zero (e.g., temperature in Celsius)
Ratio: Equal intervals with true zero (e.g., height, weight, concentration)
Part 3: Frequency Tables
The table() function generates frequency counts for categorical variables, providing insight into the distribution of categories in your data.
# Basic usage of table()
table(CO2$Type)
table(CO2$Treatment)
# For more detailed output
summary(CO2$Type)
Question 3: Frequency analysis:
Categorical variables:
Run table() on each qualitative variable
Report the frequency counts in a clear format
Comment on whether the design appears balanced
Numerical variables:
Run table() on both quantitative variables
For conc: Explain why you see exactly 12 observations at each level
For uptake: Describe the pattern you observe
Based on the experimental design described in the introduction, hypothesize why conc shows this specific pattern
Understanding Experimental Design 🔬
The pattern in conc reflects the repeated measures design: each plant was measured at the same 7 CO₂ concentration levels (a quick check after this list illustrates this structure). This is crucial for understanding:
Why certain statistical methods are appropriate
How to properly visualize the data
The structure of dependencies in the data
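One quick way to confirm this structure is a two-way frequency table of plants against concentration levels; every cell should be 1, reflecting one measurement per plant per level (table() is base R, so no extra packages are needed):
# Cross-tabulate plants against concentration levels: every cell is 1
table(CO2$Plant, CO2$conc)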
Part 4: Univariate Analysis of Uptake
The uptake variable represents the CO₂ uptake rate (μmol/m²/s) and is our primary response variable. A thorough univariate analysis helps us understand its distribution before examining relationships.
Question 4: Perform a comprehensive univariate analysis:
Numerical summaries:
Compute and report the following statistics using proper notation:
Central tendency: sample mean (\(\bar{x}\)) and median (\(\tilde{x}\))
Variability: sample standard deviation (\(s\)) and IQR
Five-number summary using summary()
# Hint: Key functions for numerical summaries
mean()    # Calculate arithmetic mean
median()  # Calculate median
sd()      # Calculate standard deviation
IQR()     # Calculate interquartile range
summary() # Five-number summary plus mean
Graphical analysis:
Create the following visualizations:
Histogram: Use an appropriate number of bins
Modified boxplot: Include potential outliers
Plotting Hints 📊
For histograms, use hist() with arguments for:
main: Plot title
xlab: X-axis label
col: Bar color
breaks: Number of bins or breakpoints
For boxplots, use boxplot() with arguments for:
main: Plot title
ylab or xlab: Axis labels
horizontal: TRUE/FALSE for orientation
A minimal sketch using these arguments follows this box.
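Here is that sketch; the bin count, colors, and labels are only suggestions and can be adjusted:
# Histogram of CO2 uptake; the number of breaks is a judgment call
hist(CO2$uptake,
     main   = "Distribution of CO2 Uptake",
     xlab   = "Uptake (umol/m^2/s)",
     col    = "lightblue",
     breaks = 15)

# Horizontal boxplot of the same variable; potential outliers appear as points
boxplot(CO2$uptake,
        main       = "CO2 Uptake",
        xlab       = "Uptake (umol/m^2/s)",
        horizontal = TRUE)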
Distribution description:
Based on your numerical and graphical analyses, describe:
Shape (symmetric, skewed, multimodal?)
Center (typical value)
Spread (variability)
Unusual features (outliers, gaps, clusters?)
Part 5: Grouped Statistics with tapply
R's functional programming paradigm shines when computing statistics by groups. The tapply() function (table-apply) efficiently computes summaries for subsets of data.
Syntax: tapply(X, INDEX, FUN)
X: The vector to analyze
INDEX: The grouping factor(s)
FUN: The function to apply to each group
Question 5: Group-wise analysis:
Using tapply(), compute statistics for uptake grouped by Type:
# Compute grouped statistics
uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)
# Display results
print(uptake_mean_by_type)
print(uptake_sd_by_type)
Report the mean and standard deviation for each Type
Calculate the difference in means between locations (a one-line sketch follows this list)
Which location shows higher average CO₂ uptake?
Which location shows more variability?
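Since tapply() returns a named vector, the difference in means can be computed directly. A one-line sketch, assuming uptake_mean_by_type from the code above:
# Difference in mean uptake between the two origins (Quebec minus Mississippi)
uptake_mean_by_type["Quebec"] - uptake_mean_by_type["Mississippi"]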
Understanding R’s Apply Family 🔧
The apply functions are powerful tools for avoiding loops in R. Each serves a specific purpose:
apply() - For matrices and arrays
Purpose: Apply function over rows or columns of a matrix/array
Syntax:
apply(X, MARGIN, FUN)
MARGIN: 1 = rows, 2 = columns, c(1,2) = both
Example: Calculate row means of a matrix
mat <- matrix(1:12, nrow = 3)
apply(mat, 1, mean) # Mean of each row
apply(mat, 2, sum) # Sum of each column
lapply() - For lists (returns list)
Purpose: Apply function to each element of a list
Syntax:
lapply(X, FUN)
Returns: Always a list
Example: Get length of each list element
my_list <- list(a = 1:5, b = 1:10, c = 1:3)
lapply(my_list, length) # Returns list
sapply() - Simplified lapply (returns vector/matrix)
Purpose: Like lapply but simplifies result if possible
Syntax:
sapply(X, FUN)
Returns: Vector, matrix, or list (depending on output)
Example: Same as above but simplified
sapply(my_list, length) # Returns named vector
tapply() - For grouped operations
Purpose: Apply function to subsets defined by factors
Syntax:
tapply(X, INDEX, FUN)
Use case: Calculate statistics by group
Example: Mean by category
# If we have values and their categories
values <- c(10, 20, 15, 30, 25, 35)
groups <- c("A", "B", "A", "B", "A", "B")
tapply(values, groups, mean) # Mean by group
mapply() - Multivariate apply
Purpose: Apply function to multiple arguments in parallel
Syntax:
mapply(FUN, arg1, arg2, ...)
Use case: Element-wise operations on multiple vectors
Example: Custom calculation on paired values
# Calculate x*y + z for each position
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)
mapply(function(a, b, c) a * b + c, x, y, z)
When to use which?
Have a matrix? → apply()
Have a list? → lapply() or sapply()
Need groups? → tapply()
Multiple inputs? → mapply()
Part 6: Comparative Visualization by Type
Comparing distributions across groups reveals patterns that univariate analysis might miss. We’ll use both base R and ggplot2 for visualization.
Question 6: Side-by-side boxplots:
Create a comparative boxplot using ggplot2:
ggplot2 Structure 📊
Remember the grammar of graphics:
Data: What dataset are you using?
Aesthetics: What variables map to x, y, color, etc.?
Geometries: What type of plot (boxplot, points, etc.)?
Labels: Title, axis labels
Theme: Overall appearance
Key functions to explore:
ggplot() - Initialize plot with data and aesthetics
geom_boxplot() - Add boxplot layer
labs() - Add labels
theme_minimal() or other themes - Control overall appearance
Enhance your plot by:
Adding individual points with geom_jitter()
Highlighting means with stat_summary()
Using colors to distinguish groups
Adding appropriate labels and title
A sketch combining these layers appears after this list.
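One possible version of the enhanced plot; the layer choices, colors, and labels below are only suggestions:
library(ggplot2)

# Side-by-side boxplots of uptake by Type, with raw points and group means
ggplot(CO2, aes(x = Type, y = uptake, fill = Type)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +        # boxes; hide default outliers (shown by the jitter)
  geom_jitter(width = 0.15, alpha = 0.5) +               # individual measurements
  stat_summary(fun = mean, geom = "point",
               shape = 23, size = 3, fill = "white") +   # group means as diamonds
  labs(title = "CO2 Uptake by Geographic Origin",
       x = "Origin", y = "Uptake (umol/m^2/s)") +
  theme_minimal()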
Describe the differences you observe between Quebec and Mississippi plants in terms of:
Central tendency
Variability
Skewness
Potential outliers
Question 7: Comparative histograms with density overlays:
Create faceted histograms showing the distribution of uptake by Type:
Data preparation:
Use tapply() to calculate group-specific means and standard deviations
Use ifelse() to create a new column with normal density values
# Calculate group statistics
means <- tapply(CO2$uptake, CO2$Type, mean)
sds <- tapply(CO2$uptake, CO2$Type, sd)

# Add normal density column
CO2$normal_density <- ifelse(CO2$Type == "Quebec",
                             dnorm(CO2$uptake, means["Quebec"], sds["Quebec"]),
                             dnorm(CO2$uptake, means["Mississippi"], sds["Mississippi"]))
Visualization:
Create a faceted histogram with:
Histogram bars showing frequency distribution
Red kernel density curve (empirical distribution)
Blue normal density curve (theoretical distribution)
Appropriate labels and formatting (a minimal sketch follows this list)
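A sketch of one way to assemble this plot, assuming the normal_density column from the data-preparation step above; the bin count and colors are arbitrary choices:
library(ggplot2)

# Faceted histograms of uptake by Type with kernel and normal density overlays
ggplot(CO2, aes(x = uptake)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 12, fill = "grey80", colour = "white") +  # density scale, not counts
  geom_density(colour = "red") +                                  # empirical (kernel) density
  geom_line(aes(y = normal_density), colour = "blue") +           # theoretical normal density
  facet_wrap(~ Type) +
  labs(title = "CO2 Uptake by Origin",
       x = "Uptake (umol/m^2/s)", y = "Density") +
  theme_minimal()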
Interpretation:
Compare the distributions and discuss:
How well does the normal distribution fit each group?
Are there signs of skewness or other departures from normality?
Do the two locations show similar distributional patterns?
Part 7: Exploring the Concentration Effect
The relationship between uptake and conc reveals how plants respond physiologically to varying CO₂ levels - a key aspect of plant ecology and climate change research.
Question 8: Initial visualization attempt:
Create a boxplot of uptake versus conc:
Task 🎯
Using ggplot2, create a boxplot with:
conc on the x-axis
uptake on the y-axis
Appropriate axis labels
Think about what aesthetic mapping you need in aes()
What issue do you observe with this plot?
Why doesn't this approach work as intended?
How is R treating the conc variable, and why is this problematic?
Variable Types Matter! ⚠️
R’s automatic type detection can sometimes work against us. When a variable contains numbers, R assumes it’s continuous unless told otherwise. This affects:
How ggplot2 creates axes
Which geoms are appropriate
How statistical summaries are calculated
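A quick check of the point made in this box; the sketch below assumes conc has not yet been converted to a factor:
library(ggplot2)

# conc is stored as a number, so ggplot2 places it on a continuous x-axis
class(CO2$conc)    # "numeric"
unique(CO2$conc)   # the 7 ambient concentration levels

# With a continuous x and no grouping, geom_boxplot() collapses everything
# into a single box (usually with a warning about the continuous x aesthetic)
ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_boxplot()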
Question 9: Correcting the visualization:
Fix the issue by explicitly converting conc to a factor:
# Convert to factor to treat as discrete categories
CO2$conc <- as.factor(CO2$conc)
# Verify the change
str(CO2$conc)
levels(CO2$conc)
Now recreate the boxplot (a minimal sketch follows the list below) and describe:
The relationship between CO₂ concentration and uptake rate
How variability changes across concentration levels
Any apparent threshold effects or plateaus
Distributional changes (skewness, outliers) across concentrations
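Here is that minimal sketch of the corrected plot, assuming the as.factor() conversion above has been run; the fill color and labels are only suggestions:
library(ggplot2)

# With conc as a factor, each concentration level gets its own box
ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "CO2 Uptake by Ambient Concentration",
       x = "Ambient CO2 concentration (mL/L)",
       y = "Uptake (umol/m^2/s)") +
  theme_minimal()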
Part 8: Advanced Visualization with Multiple Categories
When working with multiple categories, we need more sophisticated approaches than simple ifelse() statements. The mapply() function enables elegant solutions for multi-group analyses.
Question 10: Multi-category histogram with density curves:
AI Assistance Guidelines 🤖
For this complex visualization task, you’re encouraged to use AI assistance (ChatGPT, Claude, etc.). However, use it as a learning tool, not just a code generator.
Creating an Effective Prompt:
Provide Context: “I’m learning R and ggplot2 in a statistics course. I’m working with the CO2 dataset…”
Be Specific About Requirements:
Must use base R functions (tapply, mapply) with ggplot2
Cannot use dplyr, tidyr, or other tidyverse packages
Need to overlay kernel and normal density curves
Request Teaching, Not Just Code: “Can you explain why mapply is used here and how it works step-by-step?”
Example Prompt Structure:
I need to create faceted histograms showing uptake distributions across 7 CO2 concentration levels.
Requirements:
- Use only base R (tapply, mapply) and ggplot2
- No dplyr or other packages
- Calculate normal density for each concentration group
- Overlay kernel density (red) and normal density (blue) curves
- Use facet_wrap with nrow=2
Can you explain the approach step-by-step with commented code?
Verification Checklist:
✓ Uses only allowed functions and packages
✓ Includes clear explanations of each step
✓ Code is well-commented
✓ Follows the specific requirements
✓ You understand why each part works
Implementation steps:
Calculate group statistics:
# Group means and standard deviations
conc_means <- tapply(CO2$uptake, CO2$conc, mean)
conc_sds <- tapply(CO2$uptake, CO2$conc, sd)
Create a density calculation function:
Function Design 💡
Your function needs to:
Accept two parameters: an uptake value and a concentration group
Look up the appropriate mean and standard deviation for that group
Calculate the normal density at that uptake value
Return the density value
Remember: dnorm(x, mean, sd) calculates the normal density
Apply the function using mapply:
Understanding mapply() 🔄
mapply() will:
Take your function as the first argument
Take CO2$uptake as the second argument (all 84 values)
Take CO2$conc as the third argument (all 84 groups)
Apply your function to each pair of (uptake, conc) values
Return a vector of 84 density values
Why use as.character(CO2$conc)? Because factor levels need to match the names in your statistics vectors.
Create the visualization:
Faceted histograms with facet_wrap(~ conc, nrow = 2)
Overlay kernel density (red) and normal density (blue)
Proper labels and formatting
A hedged sketch of steps 2 through 4 appears after this list.
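The sketch below assumes conc_means and conc_sds from step 1; the helper name norm_dens and the plot settings are only suggestions:
# Step 2: helper that looks up the group's mean/sd and evaluates the normal density there
norm_dens <- function(x, group) {
  dnorm(x, mean = conc_means[group], sd = conc_sds[group])
}

# Step 3: apply it to every (uptake, conc) pair;
# as.character() makes the group labels match the names of conc_means/conc_sds
CO2$normal_density <- mapply(norm_dens, CO2$uptake, as.character(CO2$conc))

# Step 4: faceted histograms with kernel (red) and normal (blue) density overlays
library(ggplot2)
ggplot(CO2, aes(x = uptake)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 10, fill = "grey80", colour = "white") +
  geom_density(colour = "red") +
  geom_line(aes(y = normal_density), colour = "blue") +
  facet_wrap(~ conc, nrow = 2) +
  labs(x = "Uptake (umol/m^2/s)", y = "Density") +
  theme_minimal()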
Interpretation:
Describe patterns across concentration levels:
How does the distribution shape change?
At what concentration does uptake appear to plateau?
Which concentrations show the most/least variability?
How well do normal curves fit at different concentrations?
Reference: Key Functions
ifelse()
Purpose: Vectorized conditional evaluation
Syntax: ifelse(test, yes, no)
Example:
# Assign groups based on condition
group <- ifelse(values > 50, "High", "Low")
Why Used: Simple and efficient for binary conditions. Works element-wise on vectors.
Limitation: Becomes unwieldy with multiple conditions
Alternative: dplyr::case_when() for multiple conditions

tapply()
Purpose: Apply a function to subsets of a vector, grouped by factors
Syntax: tapply(X, INDEX, FUN, ...)
Example:
# Calculate mean by group
group_means <- tapply(data$value, data$group, mean, na.rm = TRUE)
Why Used:
No additional packages required
Memory efficient
Returns a named vector/array
Alternative: dplyr::group_by() %>% summarise()

mapply()
Purpose: Multivariate version of sapply - applies a function to multiple arguments
Syntax: mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)
Example:
# Calculate values using multiple inputs
results <- mapply(function(x, y) x * y + 2, vec1, vec2)
Why Used:
Elegant for element-wise operations on multiple vectors
Avoids explicit loops
Flexible argument passing
Alternative: purrr::map2() or dplyr::mutate()
Base R vs. Tidyverse 🔄
This worksheet uses base R functions to:
Build fundamental R skills
Work without package dependencies
Understand underlying operations
Prepare for situations where packages aren’t available
In practice, tidyverse packages often provide more readable solutions, but understanding base R is essential for R proficiency.
Troubleshooting Guide
Common Issues and Solutions 🔧
Issue: “Error: object ‘CO2’ not found”
Solution: Run data("CO2") first

Issue: Boxplot shows only one box when using conc
Solution: Convert to factor: CO2$conc <- as.factor(CO2$conc)

Issue: mapply() returns unexpected results
Solution: Ensure your function arguments match the vectors passed

Issue: Density curves don’t appear on histogram
Solution: Check that you’re using aes(y = after_stat(density)) in geom_histogram(); a minimal example follows below
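For reference, a minimal example of that last fix, putting the histogram on the density scale so overlaid curves share its y-axis:
library(ggplot2)

# Histogram on the density scale, with a kernel density curve on top
ggplot(CO2, aes(x = uptake)) +
  geom_histogram(aes(y = after_stat(density)), bins = 12) +
  geom_density(colour = "red")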
Additional Resources
R Documentation:
CO2 dataset: help(CO2)
Plotting functions: help(hist), help(boxplot)
Apply family: help(tapply), help(mapply)
Getting Help:
Use ?function_name for documentation
Search specific errors online
Visit office hours with specific questions
Form study groups to discuss approaches
Next Steps 🚀
This worksheet introduces fundamental concepts that we’ll build upon:
Next week: Two-sample comparisons and ANOVA
Coming soon: Regression analysis for continuous relationships
Later: Mixed models for repeated measures data
Happy analyzing! 📊