Worksheet 1: Exploring Data with R

Learning Objectives 🎯

By completing this worksheet, you will:

  • Load and explore datasets in R using basic commands

  • Distinguish between qualitative and quantitative variables

  • Classify variables by measurement scale (nominal, ordinal, interval, ratio)

  • Create and interpret univariate visualizations (histograms, boxplots)

  • Compute grouped statistics using functional programming

  • Build comparative visualizations for multivariate exploration

  • Practice effective prompting strategies for AI assistance

Prerequisites 📚

  • Basic familiarity with RStudio interface

  • Understanding of variable types and measurement scales

  • Ability to run R commands and interpret output

Introduction

In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined how the grass species Echinochloa crus-galli responds to environmental changes. This ecophysiological study provides an excellent opportunity to practice core statistical skills using real experimental data.

Study Design:

  • Species: Echinochloa crus-galli (barnyard grass)

  • Locations: Quebec, Canada (northern) and Mississippi, USA (southern)

  • Sample Size: 12 plants total (6 from each location)

  • Treatment: Overnight chilling (half received treatment, half remained unchilled)

  • Measurements: CO₂ uptake rates at 7 ambient CO₂ concentration levels per plant

  • Total Observations: 84 (12 plants × 7 measurements each)

This balanced experimental design allows us to explore how geographic origin and temperature treatment affect plant CO₂ uptake across varying atmospheric conditions.

Note

Citation: Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) “The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures”, Ecology, 71, 1389–1400.

Part 1: Loading and Understanding the Dataset

The CO2 dataset is included in base R, making it readily accessible for analysis. This is one of several built-in datasets that R provides for teaching and demonstration purposes.

# Load the dataset from the datasets package
data("CO2", package = "datasets")

Understanding R’s Lazy Loading

After running this command, a variable named CO2 should appear in your environment (visible in the top-right pane of RStudio) as a promise. A promise is part of R’s lazy-loading mechanism, which delays evaluation until the object is actually needed. This improves performance: the data isn’t loaded into memory until it’s accessed.

The help() command provides comprehensive documentation for R’s functionality. It’s your first resource for understanding datasets, functions, and packages.

# View documentation for the CO2 dataset
help(CO2)

# Alternative syntax
?CO2

Question 1: Using the help documentation (which appears in the bottom-right pane of RStudio), answer the following:

  1. List all variables in the dataset

  2. For each variable, specify:

    • The variable name

    • What it measures or represents

    • Its measurement units (where applicable, e.g., mL/L, μmol/m²/s)

    • Any additional context provided in the documentation

Documentation Skills 📖

Learning to read R documentation effectively is a crucial skill. Pay attention to:

  • Description: Overview of the dataset

  • Format: Structure and variable definitions

  • Details: Additional context and methodology

  • Source: Original reference for the data

  • Examples: Sample code for using the data

Part 2: Initial Data Exploration

The View() command provides a spreadsheet-like interface for exploring data visually. This is particularly useful for getting an initial sense of your data’s structure.

# Open dataset in RStudio's data viewer
View(CO2)

# Note: View() is capitalized - view() won't work!

Exploring Data Structure

In addition to View(), consider these commands for initial exploration:

# Display structure of the dataset
str(CO2)

# First few rows
head(CO2)

# Basic dimensions
nrow(CO2)  # Number of rows
ncol(CO2)  # Number of columns
dim(CO2)   # Both dimensions

Question 2: Based on your exploration, provide a comprehensive analysis:

  1. Dataset dimensions:

    • How many observations (rows) are in the CO2 dataset?

    • How many variables (columns) are there?

  2. Variable classification:

    • Which variables are qualitative (categorical)?

    • Which variables are quantitative (numerical)?

  3. Qualitative variable analysis:

    For each qualitative variable, determine:

    • Is it nominal (categories have no natural order) or ordinal (categories have a meaningful order)?

    • List all possible values (categories/levels)

    • Explain your reasoning for the classification

  4. Quantitative variable analysis:

    For each quantitative variable, determine:

    • Is it interval (no true zero) or ratio (has true zero) scale?

    • Provide justification based on:

      • Whether zero represents absence of the quantity

      • Whether ratios are meaningful (e.g., “twice as much”)

      • The nature of the measurement

Measurement Scales Review 📏

Qualitative (Categorical) Variables:

  • Nominal: Categories with no inherent order (e.g., colors, treatment groups)

  • Ordinal: Categories with meaningful order but unequal intervals (e.g., satisfaction ratings)

Quantitative (Numerical) Variables:

  • Interval: Equal intervals but no true zero (e.g., temperature in Celsius)

  • Ratio: Equal intervals with true zero (e.g., height, weight, concentration)

Part 3: Frequency Tables

The table() function generates frequency counts for categorical variables, providing insight into the distribution of categories in your data.

# Basic usage of table()
table(CO2$Type)
table(CO2$Treatment)

# For more detailed output
summary(CO2$Type)

Question 3: Frequency analysis:

  1. Categorical variables:

    • Run table() on each qualitative variable

    • Report the frequency counts in a clear format

    • Comment on whether the design appears balanced

  2. Numerical variables:

    • Run table() on both quantitative variables

    • For conc: Explain why you see exactly 12 observations at each level

    • For uptake: Describe the pattern you observe

    • Based on the experimental design described in the introduction, hypothesize why conc shows this specific pattern

Understanding Experimental Design 🔬

The pattern in conc reflects the repeated measures design. Each plant was measured at the same 7 CO₂ concentration levels. This is crucial for understanding:

  • Why certain statistical methods are appropriate

  • How to properly visualize the data

  • The structure of dependencies in the data
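One way to check this structure directly is to cross-tabulate the two categorical variables: table() accepts multiple grouping factors, and each cell then counts one Type/Treatment combination. Given the design described in the introduction (3 plants per combination, each measured 7 times), every cell should contain 21.

```r
# Two-way cross-tabulation of the categorical variables.
# A balanced design shows equal counts in every cell
# (3 plants x 7 readings = 21 per Type/Treatment combination).
table(CO2$Type, CO2$Treatment)
```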

Part 4: Univariate Analysis of Uptake

The uptake variable represents CO₂ uptake rate (μmol/m²/s) and is our primary response variable. A thorough univariate analysis helps us understand its distribution before examining relationships.

Question 4: Perform a comprehensive univariate analysis:

  1. Numerical summaries:

    Compute and report the following statistics using proper notation:

    • Central tendency: sample mean (\(\bar{x}\)) and median (\(\tilde{x}\))

    • Variability: sample standard deviation (\(s\)) and IQR

    • Five-number summary using summary()

    # Hint: Key functions for numerical summaries
    mean()     # Calculate arithmetic mean
    median()   # Calculate median
    sd()       # Calculate standard deviation
    IQR()      # Calculate interquartile range
    summary()  # Five-number summary plus mean
    
  2. Graphical analysis:

    Create the following visualizations:

    • Histogram: Use an appropriate number of bins

    • Modified boxplot: Include potential outliers

    Plotting Hints 📊

    • For histograms, use hist() with arguments for:

      • main: Plot title

      • xlab: X-axis label

      • col: Bar color

      • breaks: Number of bins or breakpoints

    • For boxplots, use boxplot() with arguments for:

      • main: Plot title

      • ylab or xlab: Axis labels

      • horizontal: TRUE/FALSE for orientation

  3. Distribution description:

    Based on your numerical and graphical analyses, describe:

    • Shape (symmetric, skewed, multimodal?)

    • Center (typical value)

    • Spread (variability)

    • Unusual features (outliers, gaps, clusters?)
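The plotting hints above can be sketched as follows; the title, label, colour, and bin-count values here are placeholders to adapt, not prescribed settings:

```r
# Histogram of uptake with labelled axes.
# breaks is a suggestion that R may adjust to "pretty" breakpoints.
hist(CO2$uptake,
     main = "Distribution of CO2 Uptake",
     xlab = "Uptake (umol/m^2/s)",
     col  = "lightblue",
     breaks = 10)

# Horizontal boxplot of the same variable.
# Potential outliers plot as individual points beyond the whiskers.
boxplot(CO2$uptake,
        main = "CO2 Uptake",
        xlab = "Uptake (umol/m^2/s)",
        horizontal = TRUE)
```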

Part 5: Grouped Statistics with tapply

R’s functional programming paradigm shines when computing statistics by groups. The tapply() function (table-apply) efficiently computes summaries for subsets of data.

Syntax: tapply(X, INDEX, FUN)

  • X: The vector to analyze

  • INDEX: The grouping factor(s)

  • FUN: The function to apply to each group

Question 5: Group-wise analysis:

Using tapply(), compute statistics for uptake grouped by Type:

# Compute grouped statistics
uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)

# Display results
print(uptake_mean_by_type)
print(uptake_sd_by_type)

  1. Report the mean and standard deviation for each Type

  2. Calculate the difference in means between locations

  3. Which location shows higher average CO₂ uptake?

  4. Which location shows more variability?

Understanding R’s Apply Family 🔧

The apply functions are powerful tools for avoiding loops in R. Each serves a specific purpose:

apply() - For matrices and arrays

  • Purpose: Apply function over rows or columns of a matrix/array

  • Syntax: apply(X, MARGIN, FUN)

  • MARGIN: 1 = rows, 2 = columns, c(1,2) = both

  • Example: Calculate row means of a matrix

mat <- matrix(1:12, nrow = 3)
apply(mat, 1, mean)  # Mean of each row
apply(mat, 2, sum)   # Sum of each column

lapply() - For lists (returns list)

  • Purpose: Apply function to each element of a list

  • Syntax: lapply(X, FUN)

  • Returns: Always a list

  • Example: Get length of each list element

my_list <- list(a = 1:5, b = 1:10, c = 1:3)
lapply(my_list, length)  # Returns list

sapply() - Simplified lapply (returns vector/matrix)

  • Purpose: Like lapply but simplifies result if possible

  • Syntax: sapply(X, FUN)

  • Returns: Vector, matrix, or list (depending on output)

  • Example: Same as above but simplified

sapply(my_list, length)  # Returns named vector

tapply() - For grouped operations

  • Purpose: Apply function to subsets defined by factors

  • Syntax: tapply(X, INDEX, FUN)

  • Use case: Calculate statistics by group

  • Example: Mean by category

# If we have values and their categories
values <- c(10, 20, 15, 30, 25, 35)
groups <- c("A", "B", "A", "B", "A", "B")
tapply(values, groups, mean)  # Mean by group

mapply() - Multivariate apply

  • Purpose: Apply function to multiple arguments in parallel

  • Syntax: mapply(FUN, arg1, arg2, ...)

  • Use case: Element-wise operations on multiple vectors

  • Example: Custom calculation on paired values

# Calculate x*y + z for each position
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)
mapply(function(a, b, c) a * b + c, x, y, z)

When to use which?

  • Have a matrix? → apply()

  • Have a list? → lapply() or sapply()

  • Need groups? → tapply()

  • Multiple inputs? → mapply()

Part 6: Comparative Visualization by Type

Comparing distributions across groups reveals patterns that univariate analysis might miss. We’ll use both base R and ggplot2 for visualization.

Question 6: Side-by-side boxplots:

Create a comparative boxplot using ggplot2:

ggplot2 Structure 📊

Remember the grammar of graphics:

  1. Data: What dataset are you using?

  2. Aesthetics: What variables map to x, y, color, etc.?

  3. Geometries: What type of plot (boxplot, points, etc.)?

  4. Labels: Title, axis labels

  5. Theme: Overall appearance

Key functions to explore:

  • ggplot() - Initialize plot with data and aesthetics

  • geom_boxplot() - Add boxplot layer

  • labs() - Add labels

  • theme_minimal() or other themes for appearance

Enhance your plot by:

  • Adding individual points with geom_jitter()

  • Highlighting means with stat_summary()

  • Using colors to distinguish groups

  • Adding appropriate labels and title
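Putting those pieces together, one possible skeleton is shown below; the colours, shapes, and labels are placeholders for you to adjust:

```r
library(ggplot2)

# Side-by-side boxplots of uptake by Type, with the raw points jittered on top
ggplot(CO2, aes(x = Type, y = uptake, fill = Type)) +
  geom_boxplot(outlier.shape = NA) +         # hide outliers (jitter shows them)
  geom_jitter(width = 0.15, alpha = 0.5) +   # individual observations
  stat_summary(fun = mean, geom = "point",   # mark each group's mean
               shape = 18, size = 3, colour = "darkred") +
  labs(title = "CO2 Uptake by Origin",
       x = "Origin", y = "Uptake (umol/m^2/s)") +
  theme_minimal()
```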

Describe the differences you observe between Quebec and Mississippi plants in terms of:

  • Central tendency

  • Variability

  • Skewness

  • Potential outliers

Question 7: Comparative histograms with density overlays:

Create faceted histograms showing the distribution of uptake by Type:

  1. Data preparation:

    • Use tapply() to calculate group-specific means and standard deviations

    • Use ifelse() to create a new column with normal density values

    # Calculate group statistics
    means <- tapply(CO2$uptake, CO2$Type, mean)
    sds <- tapply(CO2$uptake, CO2$Type, sd)
    
    # Add normal density column
    CO2$normal_density <- ifelse(CO2$Type == "Quebec",
                                 dnorm(CO2$uptake, means["Quebec"], sds["Quebec"]),
                                 dnorm(CO2$uptake, means["Mississippi"], sds["Mississippi"]))
    
  2. Visualization:

    Create a faceted histogram with:

    • Histogram bars showing frequency distribution

    • Red kernel density curve (empirical distribution)

    • Blue normal density curve (theoretical distribution)

    • Appropriate labels and formatting

  3. Interpretation:

    Compare the distributions and discuss:

    • How well does the normal distribution fit each group?

    • Are there signs of skewness or other departures from normality?

    • Do the two locations show similar distributional patterns?

Part 7: Exploring the Concentration Effect

The relationship between uptake and conc reveals how plants respond physiologically to varying CO₂ levels, a key aspect of plant ecology and climate-change research.

Question 8: Initial visualization attempt:

Create a boxplot of uptake versus conc:

Task 🎯

Using ggplot2, create a boxplot with:

  • conc on the x-axis

  • uptake on the y-axis

  • Appropriate axis labels

Think about what aesthetic mapping you need in aes()

  1. What issue do you observe with this plot?

  2. Why doesn’t this approach work as intended?

  3. How is R treating the conc variable, and why is this problematic?

Variable Types Matter! ⚠️

R’s automatic type detection can sometimes work against us. When a variable contains numbers, R assumes it’s continuous unless told otherwise. This affects:

  • How ggplot2 creates axes

  • Which geoms are appropriate

  • How statistical summaries are calculated
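You can confirm how R is storing conc before plotting; this is a quick diagnostic, not the fix itself:

```r
# conc holds numbers, so R treats it as a continuous variable
class(CO2$conc)    # "numeric"

# ...even though it only ever takes 7 distinct values
unique(CO2$conc)
```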

Question 9: Correcting the visualization:

Fix the issue by explicitly converting conc to a factor:

# Convert to factor to treat as discrete categories
CO2$conc <- as.factor(CO2$conc)

# Verify the change
str(CO2$conc)
levels(CO2$conc)

Now recreate the boxplot and describe:

  1. The relationship between CO₂ concentration and uptake rate

  2. How variability changes across concentration levels

  3. Any apparent threshold effects or plateaus

  4. Distributional changes (skewness, outliers) across concentrations

Part 8: Advanced Visualization with Multiple Categories

When working with multiple categories, we need more sophisticated approaches than simple ifelse() statements. The mapply() function enables elegant solutions for multi-group analyses.

Question 10: Multi-category histogram with density curves:

AI Assistance Guidelines 🤖

For this complex visualization task, you’re encouraged to use AI assistance (ChatGPT, Claude, etc.). However, use it as a learning tool, not just a code generator.

Creating an Effective Prompt:

  1. Provide Context: “I’m learning R and ggplot2 in a statistics course. I’m working with the CO2 dataset…”

  2. Be Specific About Requirements:

    • Must use base R functions (tapply, mapply) with ggplot2

    • Cannot use dplyr, tidyr, or other tidyverse packages

    • Need to overlay kernel and normal density curves

  3. Request Teaching, Not Just Code: “Can you explain why mapply is used here and how it works step-by-step?”

  4. Example Prompt Structure:

    I need to create faceted histograms showing uptake distributions
    across 7 CO2 concentration levels. Requirements:
    - Use only base R (tapply, mapply) and ggplot2
    - No dplyr or other packages
    - Calculate normal density for each concentration group
    - Overlay kernel density (red) and normal density (blue) curves
    - Use facet_wrap with nrow=2
    
    Can you explain the approach step-by-step with commented code?
    

Verification Checklist:

✓ Uses only allowed functions and packages

✓ Includes clear explanations of each step

✓ Code is well-commented

✓ Follows the specific requirements

✓ You understand why each part works

Implementation steps:

  1. Calculate group statistics:

    # Group means and standard deviations
    conc_means <- tapply(CO2$uptake, CO2$conc, mean)
    conc_sds <- tapply(CO2$uptake, CO2$conc, sd)
    
  2. Create a density calculation function:

    Function Design 💡

    Your function needs to:

    1. Accept two parameters: an uptake value and a concentration group

    2. Look up the appropriate mean and standard deviation for that group

    3. Calculate the normal density at that uptake value

    4. Return the density value

    Remember: dnorm(x, mean, sd) calculates normal density

  3. Apply the function using mapply:

    Understanding mapply() 🔄

    mapply() will:

    1. Take your function as the first argument

    2. Take CO2$uptake as the second argument (all 84 values)

    3. Take CO2$conc as the third argument (all 84 groups)

    4. Apply your function to each pair of (uptake, conc) values

    5. Return a vector of 84 density values

    Why use as.character(CO2$conc)? Because factor levels need to match the names in your statistics vectors.
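    One possible sketch of steps 2 and 3, assuming the conc_means and conc_sds vectors from step 1 (the function name density_for is an illustration, not a requirement):

    ```r
    # Step 2: look up the group's mean and sd by name, then
    # evaluate the normal density at that uptake value
    density_for <- function(u, group) {
      dnorm(u, mean = conc_means[group], sd = conc_sds[group])
    }

    # Step 3: apply it pairwise over all 84 (uptake, conc) observations.
    # as.character() converts factor levels to the names indexing the
    # statistics vectors returned by tapply().
    CO2$normal_density <- mapply(density_for,
                                 CO2$uptake,
                                 as.character(CO2$conc))
    ```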

  4. Create the visualization:

    • Faceted histograms with facet_wrap(~ conc, nrow = 2)

    • Overlay kernel density (red) and normal density (blue)

    • Proper labels and formatting

  5. Interpretation:

    Describe patterns across concentration levels:

    • How does the distribution shape change?

    • At what concentration does uptake appear to plateau?

    • Which concentrations show the most/least variability?

    • How well do normal curves fit at different concentrations?

Reference: Key Functions

ifelse()

Purpose: Vectorized conditional evaluation

Syntax: ifelse(test, yes, no)

Example:

# Assign groups based on condition
group <- ifelse(values > 50, "High", "Low")

Why Used: Simple and efficient for binary conditions. Works element-wise on vectors.

Limitation: Becomes unwieldy with multiple conditions

Alternative: dplyr::case_when() for multiple conditions

tapply()

Purpose: Apply a function to subsets of a vector, grouped by factors

Syntax: tapply(X, INDEX, FUN, ...)

Example:

# Calculate mean by group
group_means <- tapply(data$value, data$group, mean, na.rm = TRUE)

Why Used:

  • No additional packages required

  • Memory efficient

  • Returns named vector/array

Alternative: dplyr::group_by() %>% summarise()

mapply()

Purpose: Multivariate version of sapply - applies function to multiple arguments

Syntax: mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)

Example:

# Calculate values using multiple inputs
results <- mapply(function(x, y) x * y + 2,
                 vec1, vec2)

Why Used:

  • Elegant for element-wise operations on multiple vectors

  • Avoids explicit loops

  • Flexible argument passing

Alternative: purrr::map2() or dplyr::mutate()

Base R vs. Tidyverse 🔄

This worksheet uses base R functions to:

  • Build fundamental R skills

  • Work without package dependencies

  • Understand underlying operations

  • Prepare for situations where packages aren’t available

In practice, tidyverse packages often provide more readable solutions, but understanding base R is essential for R proficiency.

Troubleshooting Guide

Common Issues and Solutions 🔧

Issue: “Error: object ‘CO2’ not found”

Solution: Run data("CO2") first

Issue: Boxplot shows only one box when using conc

Solution: Convert to factor: CO2$conc <- as.factor(CO2$conc)

Issue: mapply() returns unexpected results

Solution: Ensure your function arguments match the vectors passed

Issue: Density curves don’t appear on histogram

Solution: Check that you’re using aes(y = after_stat(density)) in geom_histogram()
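For that last issue, a minimal sketch of the fix (the bin count and colours are placeholders): mapping y to after_stat(density) puts the histogram on the same density scale as the curves, so they overlay correctly.

```r
library(ggplot2)

# Plot the histogram as densities so overlaid curves share its y-scale
ggplot(CO2, aes(x = uptake)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 fill = "grey80", colour = "white") +
  geom_density(colour = "red")   # kernel density on the same scale
```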

Additional Resources

R Documentation:

  • CO2 dataset: help(CO2)

  • Plotting functions: help(hist), help(boxplot)

  • Apply family: help(tapply), help(mapply)

Getting Help:

  • Use ?function_name for documentation

  • Search specific errors online

  • Visit office hours with specific questions

  • Form study groups to discuss approaches

Next Steps 🚀

This worksheet introduces fundamental concepts that we’ll build upon:

  • Next week: Two-sample comparisons and ANOVA

  • Coming soon: Regression analysis for continuous relationships

  • Later: Mixed models for repeated measures data

Happy analyzing! 📊