Worksheet 1: Exploring Data with R
Learning Objectives 🎯
Load and explore datasets in R using basic commands
Distinguish between qualitative and quantitative variables
Classify variables by measurement scale (nominal, ordinal, interval, ratio)
Create and interpret univariate visualizations (histograms, boxplots)
Compute grouped statistics using functional programming
Build comparative visualizations for multivariate exploration
Introduction
In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined how the grass species Echinochloa crus-galli responds to environmental changes.
Study Design:
- Sample Size: 12 plants total (6 from Quebec, Canada; 6 from Mississippi, USA)
- Treatment: Half the plants from each location received overnight chilling; half remained unchilled
- Measurements: CO₂ uptake rates measured at 7 ambient CO₂ concentration levels per plant
- Total Observations: 84 (12 plants × 7 measurements)
This dataset provides an excellent opportunity to practice core statistical skills, including working with categorical and numerical data and creating visualizations such as histograms and boxplots.
Note
Citation: Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) “The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures”, Ecology, 71, 1389–1400.
Part 1: Loading and Understanding the Dataset
The CO2 dataset is included in base R, making it readily accessible for analysis.
# Load the dataset
data("CO2", package = "datasets")
After running this command, an object named CO2 should appear in your environment (visible in the top-right pane of RStudio), listed as a promise. A promise is part of R’s lazy-loading mechanism: the data are read into memory only when the object is first used.
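As a quick check that the promise resolves into an ordinary data frame on first use, a minimal sketch along these lines can be run in the console. The expected row count comes from the study design described above.

```r
# Load CO2 from the datasets package; this registers a lazy-load promise
data("CO2", package = "datasets")

# The first real use of CO2 forces the promise and loads the data
is_df <- is.data.frame(CO2)   # TRUE: CO2 behaves as a data frame
n_obs <- nrow(CO2)            # 84, matching 12 plants x 7 measurements

is_df
n_obs
```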
The help() command provides comprehensive documentation for R’s functionality. To learn more about the dataset:
# View documentation
help(CO2)
Question 1: Using the help documentation (visible in the bottom-right pane), list all variables in the dataset and specify their measurement units where applicable.
Part 2: Initial Data Exploration
The View() command provides a spreadsheet-like format for exploring data:
# Open dataset in viewer
View(CO2)
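View() needs an interactive session such as RStudio. When working in a plain console or a script, the same first look can be approximated with a few base R commands. This is only a sketch of one reasonable set of commands, not a required workflow.

```r
# Non-interactive alternatives to View()
data("CO2", package = "datasets")

first_rows <- head(CO2, n = 6)  # first six rows of the data
first_rows

data_dim <- dim(CO2)            # number of rows, then number of columns
data_dim

str(CO2)                        # compact listing of each column and its type
```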
Question 2: Answer the following questions about the data:
How many observations (rows) are in the CO2 dataset?
Identify and classify each variable:
Which variables are qualitative (categorical)?
Which variables are quantitative (numerical)?
For each qualitative variable:
Specify whether it’s nominal or ordinal
List all possible values (categories)
For each quantitative variable:
Specify whether it’s interval or ratio scale
Provide brief justification (consider meaningful zero, type of measurement)
Part 3: Frequency Tables
The table() function generates frequency counts for categorical variables:
# Example usage
table(CO2$Type)
table(CO2$Treatment)
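table() also accepts two variables at once, which is a convenient way to check whether an experimental design is balanced. The sketch below uses ToothGrowth, another dataset that ships with R, rather than CO2, so the question that follows is left to you. The object names supp_counts and design_counts are chosen here for illustration only.

```r
# ToothGrowth: tooth length (len), supplement type (supp), and dose
supp_counts <- table(ToothGrowth$supp)   # one-way frequency table
supp_counts

# Two-way table: rows are supplement types, columns are doses
design_counts <- table(ToothGrowth$supp, ToothGrowth$dose)
design_counts

# Convert counts to proportions
supp_props <- prop.table(supp_counts)
supp_props
```

Equal counts in every cell of the two-way table indicate a balanced design.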
Question 3:
Run table() on each qualitative variable and report the frequency counts.
Run table() on the quantitative variables. Why might one quantitative variable show a specific pattern (repeated values or fixed intervals)? Based on the experimental design, hypothesize why this occurs.
Part 4: Univariate Analysis of Uptake
The uptake variable is our primary response variable, representing CO₂ uptake rates.
Question 4: Perform numerical and graphical univariate analysis of uptake:
Compute and report statistics for central tendency and spread. Use proper notation (e.g., \(\bar{x}\) for sample mean).
Create a histogram and modified boxplot for uptake.
Describe the distribution based on your visualizations.
# Hint: Basic statistics
mean(CO2$uptake)
sd(CO2$uptake)
summary(CO2$uptake)
# Hint: Basic plots
hist(CO2$uptake)
boxplot(CO2$uptake)
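The hints above cover the mean, standard deviation, and default plots. A modified boxplot flags points beyond 1.5 × IQR from the quartiles, and the following sketch shows how those fences can be computed by hand. It uses ToothGrowth$len instead of uptake, and the variable names are illustrative assumptions.

```r
x <- ToothGrowth$len   # a numeric variable from a built-in dataset

# Central tendency
x_bar <- mean(x)
x_med <- median(x)

# Spread
s <- sd(x)
iqr <- IQR(x)
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Fences for the 1.5 x IQR rule used by a modified boxplot.
# Note: boxplot() itself uses Tukey hinges, which can differ
# slightly from the quartiles returned by quantile().
lower_fence <- five_num[["25%"]] - 1.5 * iqr
upper_fence <- five_num[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Plots with explicit labels; range = 1.5 is the default whisker rule
hist(x, breaks = 10, main = "Histogram of tooth length", xlab = "Length")
boxplot(x, range = 1.5, horizontal = TRUE, xlab = "Length")
```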
Part 5: Grouped Statistics with tapply
R’s functional programming simplifies group-wise operations. The tapply() function applies a function to subsets of data grouped by a categorical variable.
Question 5: Using tapply(), compute the mean and standard deviation of uptake by Type:
# Compute grouped statistics
uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)
Report your results below.
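tapply() returns a named result with one entry per group, which makes it easy to assemble a small summary table. As a sketch of that pattern on a different built-in dataset (ToothGrowth, grouped by supp), with object names chosen for illustration:

```r
# Grouped mean and standard deviation of len by supplement type
len_mean_by_supp <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
len_sd_by_supp <- tapply(ToothGrowth$len, ToothGrowth$supp, sd)

# Combine into one summary table, one row per group
group_summary <- data.frame(
  supp = names(len_mean_by_supp),
  mean = as.vector(len_mean_by_supp),
  sd = as.vector(len_sd_by_supp)
)
group_summary

# Individual results can be looked up by group name
oj_mean <- len_mean_by_supp[["OJ"]]
oj_mean
```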
Part 6: Comparative Visualization by Type
Question 6: Create a side-by-side boxplot using ggplot2:
Plot uptake on the y-axis and Type on the x-axis
Label axes appropriately
Add an informative title
Describe differences in distributional characteristics between Quebec and Mississippi
library(ggplot2)
# Your code here
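For reference, the general shape of a labeled side-by-side boxplot in ggplot2 looks roughly like the sketch below. It plots ToothGrowth (len by supp) rather than CO2, and the title and axis labels are placeholder text, so adapt the variables and wording for your own answer.

```r
library(ggplot2)

# Categorical variable on x, numeric variable on y
p <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot() +
  labs(
    title = "Tooth length by supplement type",
    x = "Supplement type",
    y = "Tooth length"
  )

print(p)
```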
Question 7: Create comparative histograms with density curves:
Use tapply() to calculate mean and standard deviation for each Type
Use ifelse() to compute normal density values conditionally
Create faceted histograms with:
Red kernel density curve
Blue normal density curve
Describe distributional differences between types
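One possible way to combine these steps is sketched below on ToothGrowth, whose supp variable has two groups ("OJ" and "VC"). The column name normal_density, the bin count, and the colors of the bars are assumptions made for illustration. The sketch also assumes ggplot2 3.4.0 or later, where after_stat() and the linewidth argument are available.

```r
library(ggplot2)

tg <- ToothGrowth   # work on a copy

# Step 1: group-wise mean and standard deviation
grp_mean <- tapply(tg$len, tg$supp, mean)
grp_sd <- tapply(tg$len, tg$supp, sd)

# Step 2: normal density for each observation, using its own group's parameters
tg$normal_density <- ifelse(
  tg$supp == "OJ",
  dnorm(tg$len, mean = grp_mean[["OJ"]], sd = grp_sd[["OJ"]]),
  dnorm(tg$len, mean = grp_mean[["VC"]], sd = grp_sd[["VC"]])
)

# Step 3: density-scaled histogram with both curves, one panel per group
p <- ggplot(tg, aes(x = len)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 fill = "grey80", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  geom_line(aes(y = normal_density), color = "blue", linewidth = 1) +
  facet_wrap(~ supp) +
  labs(title = "Tooth length by supplement type",
       x = "Tooth length", y = "Density")

print(p)
```

The histogram is drawn on the density scale so that the bars and both curves share the same y-axis.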
Part 7: Exploring the Concentration Effect
The relationship between uptake and conc (ambient CO₂ concentration) reflects plant response to varying CO₂ levels.
Question 8: Create a boxplot of uptake vs conc:
What issue do you observe?
Why doesn’t this approach work as intended? (Hint: How is conc being treated?)
Question 9: Fix the issue by converting conc to a factor:
CO2$conc <- as.factor(CO2$conc)
Recreate the boxplot and describe distributional differences across concentration levels.
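Converting a numeric column to a factor has one common pitfall worth knowing: as.numeric() on a factor returns the internal level codes, not the original values. The sketch below demonstrates this on ToothGrowth$dose and keeps the factor in a separate, hypothetically named column (dose_f) so the numeric original is preserved.

```r
tg <- ToothGrowth                 # work on a copy
dose_class <- class(tg$dose)      # "numeric"

tg$dose_f <- factor(tg$dose)      # new factor column; original stays numeric
dose_levels <- levels(tg$dose_f)  # "0.5" "1" "2"

# Pitfall: this returns level codes (1, 2, 3), not the doses
dose_codes <- as.numeric(tg$dose_f)

# Recover the original numbers by converting through character
dose_back <- as.numeric(as.character(tg$dose_f))

dose_class
dose_levels
unique(dose_codes)
unique(dose_back)
```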
Part 8: Advanced Visualization with Multiple Categories
Question 10: Create faceted histograms for uptake by conc:
AI Assistance Guidelines 🤖
For this complex visualization task, you’re encouraged to use your favorite AI assistant (ChatGPT, Claude, etc.) to help solve the problem. However, follow these guidelines:
Creating an Effective Pedagogical Prompt:
Provide Context: “I’m learning R and ggplot2 in a statistics course…”
Be Specific: Include the exact requirements and constraints
Request Teaching: Ask for explanations, not just code
Example prompt structure:
“I need to create a histogram showing uptake distributions across multiple CO2 concentration levels. I must use ggplot2 and cannot use dplyr or other tidyverse packages beyond ggplot2. I need to use mapply() to compute normal density values for multiple groups. Can you explain the approach step-by-step and provide commented code?”
Verification Checklist:
✓ Uses only base R functions (tapply, mapply) and ggplot2
✓ No dplyr, tidyr, or other packages
✓ Includes clear explanations of each step
✓ Code is well-commented
✓ Follows the specific requirements (facet_wrap with nrow=2, overlay curves)
Requirements:
Use mapply() to compute normal density for multiple categories
Create faceted histograms with facet_wrap(~ conc, nrow = 2)
Overlay red kernel density and blue normal density curves
Describe distributional patterns
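To check an AI-generated answer against something concrete, it can help to have a small working example of the same pattern. The following is a sketch on ToothGrowth (three dose groups) rather than CO2. The names dose_f and normal_density are illustrative, and ggplot2 3.4.0 or later is assumed for after_stat() and linewidth.

```r
library(ggplot2)

tg <- ToothGrowth
tg$dose_f <- factor(tg$dose)

# Group-wise parameters, named by factor level
grp_mean <- tapply(tg$len, tg$dose_f, mean)
grp_sd <- tapply(tg$len, tg$dose_f, sd)

# mapply() walks over each observation and its group label in parallel
tg$normal_density <- mapply(
  function(value, group) {
    dnorm(value, mean = grp_mean[[group]], sd = grp_sd[[group]])
  },
  tg$len,
  as.character(tg$dose_f)
)

p <- ggplot(tg, aes(x = len)) +
  geom_histogram(aes(y = after_stat(density)), bins = 8,
                 fill = "grey80", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  geom_line(aes(y = normal_density), color = "blue", linewidth = 1) +
  facet_wrap(~ dose_f, nrow = 2) +
  labs(title = "Tooth length by dose",
       x = "Tooth length", y = "Density")

print(p)
```

Unlike ifelse(), this approach does not need one branch per group, so it scales to any number of categories.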
Reference: Key Functions
- ifelse
Purpose: Vectorized conditional function
Usage: ifelse(condition, value_if_true, value_if_false)
Why Used: Straightforward for binary conditions
Alternative: dplyr::case_when() for multiple conditions
- tapply
Purpose: Apply function to grouped subsets
Usage: tapply(vector, grouping_variable, function)
Why Used: Simple and efficient for grouped calculations
Alternative: dplyr::group_by() + summarise()
- mapply
Purpose: Apply function to multiple arguments simultaneously
Usage: mapply(function, arg1, arg2, ...)
Why Used: Flexible for row-wise calculations
Alternative: purrr::map2() or dplyr::mutate()
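The three functions can be tried on small made-up vectors before applying them to real data. The values below are toy inputs chosen purely for illustration.

```r
# ifelse(): element-wise choice between two values
scores <- c(1, 5, 10)
labels <- ifelse(scores > 4, "high", "low")
labels                                   # "low" "high" "high"

# tapply(): apply a function within each group
values <- c(1, 2, 3, 4)
groups <- c("a", "a", "b", "b")
group_means <- tapply(values, groups, mean)
group_means                              # a = 1.5, b = 3.5

# mapply(): apply a function to paired elements of several vectors
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)
pair_sums                                # 5 7 9
```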
Submission Guidelines
Complete all questions in order
Include both code and written interpretations
Ensure all plots are properly labeled
Submit your write-up in class