Worksheet 1: Exploring Data with R

Learning Objectives 🎯

By completing this worksheet, you will:

  • Load and explore datasets in R using basic commands

  • Distinguish between qualitative and quantitative variables

  • Classify variables by measurement scale (nominal, ordinal, interval, ratio)

  • Create and interpret univariate visualizations (histograms, boxplots)

  • Compute grouped statistics using functional programming

  • Build comparative visualizations for multivariate exploration

  • Practice effective prompting strategies for AI assistance

Prerequisites 📚

  • Basic familiarity with RStudio interface

  • Understanding of variable types and measurement scales

  • Ability to run R commands and interpret output

Introduction

In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined how the grass species Echinochloa crus-galli responds to environmental changes. This ecophysiological study provides an excellent opportunity to practice core statistical skills using real experimental data.

Study Design:

  • Species: Echinochloa crus-galli (barnyard grass)

  • Locations: Quebec, Canada (northern) and Mississippi, USA (southern)

  • Sample Size: 12 plants total (6 from each location)

  • Treatment: Overnight chilling (half received treatment, half remained unchilled)

  • Measurements: CO₂ uptake rates at 7 ambient CO₂ concentration levels per plant

  • Total Observations: 84 (12 plants × 7 measurements each)

This balanced experimental design allows us to explore how geographic origin and temperature treatment affect plant CO₂ uptake across varying atmospheric conditions.

Note

Citation: Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) “The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures”, Ecology, 71, 1389–1400.

Part 1: Loading and Understanding the Dataset

The CO2 dataset is included in base R, making it readily accessible for analysis. This is one of several built-in datasets that R provides for teaching and demonstration purposes.

# Load the dataset from the datasets package
data("CO2", package = "datasets")

Understanding R’s Lazy Loading

After running this command, a variable named CO2 should appear in your environment (visible in the top-right pane of RStudio) as a promise. A promise is part of R’s lazy-loading mechanism, which delays evaluation until the object is actually needed. This improves performance: the data isn’t loaded into memory until it’s accessed.

The help() command provides comprehensive documentation for R’s functionality. It’s your first resource for understanding datasets, functions, and packages.

# View documentation for the CO2 dataset
help(CO2)

# Alternative syntax
?CO2

Question 1: Using the help documentation (which appears in the bottom-right pane of RStudio), answer the following:

  1. List all variables in the dataset

  2. For each variable, specify:

    • The variable name

    • What it measures or represents

    • Its measurement units (where applicable, e.g., mL/L, μmol/m²/s)

    • Any additional context provided in the documentation

Documentation Skills 📖

Learning to read R documentation effectively is a crucial skill. Pay attention to:

  • Description: Overview of the dataset

  • Format: Structure and variable definitions

  • Details: Additional context and methodology

  • Source: Original reference for the data

  • Examples: Sample code for using the data

Part 2: Initial Data Exploration

The View() command provides a spreadsheet-like interface for exploring data visually. This is particularly useful for getting an initial sense of your data’s structure.

# Open dataset in RStudio's data viewer
View(CO2)

# Note: View() is capitalized - view() won't work!

Exploring Data Structure

In addition to View(), consider these commands for initial exploration:

# Display structure of the dataset
str(CO2)

# First few rows
head(CO2)

# Basic dimensions
nrow(CO2)  # Number of rows
ncol(CO2)  # Number of columns
dim(CO2)   # Both dimensions

Question 2: Based on your exploration, provide a comprehensive analysis:

  1. Dataset dimensions:

    • How many observations (rows) are in the CO2 dataset?

    • How many variables (columns) are there?

  2. Variable classification:

    • Which variables are qualitative (categorical)?

    • Which variables are quantitative (numerical)?

  3. Qualitative variable analysis:

    For each qualitative variable, determine:

    • Is it nominal (categories have no natural order) or ordinal (categories have a meaningful order)?

    • List all possible values (categories/levels)

    • Explain your reasoning for the classification

  4. Quantitative variable analysis:

    For each quantitative variable, determine:

    • Is it interval (no true zero) or ratio (has true zero) scale?

    • Provide justification based on:

      • Whether zero represents absence of the quantity

      • Whether ratios are meaningful (e.g., “twice as much”)

      • The nature of the measurement

Measurement Scales Review 📏

Qualitative (Categorical) Variables:

  • Nominal: Categories with no inherent order (e.g., colors, treatment groups)

  • Ordinal: Categories with meaningful order but unequal intervals (e.g., satisfaction ratings)

Quantitative (Numerical) Variables:

  • Interval: Equal intervals but no true zero (e.g., temperature in Celsius)

  • Ratio: Equal intervals with true zero (e.g., height, weight, concentration)

Part 3: Frequency Tables

The table() function generates frequency counts for categorical variables, providing insight into the distribution of categories in your data.

# Basic usage of table()
table(CO2$Type)
table(CO2$Treatment)

# For more detailed output
summary(CO2$Type)

Question 3: Frequency analysis:

  1. Categorical variables:

    • Run table() on each qualitative variable

    • Report the frequency counts in a clear format

    • Comment on whether the design appears balanced

  2. Numerical variables:

    • Run table() on both quantitative variables

    • For conc: Explain why you see exactly 12 observations at each level

    • For uptake: Describe the pattern you observe

    • Based on the experimental design described in the introduction, hypothesize why conc shows this specific pattern

Understanding Experimental Design 🔬

The pattern in conc reflects the repeated measures design. Each plant was measured at the same 7 CO₂ concentration levels. This is crucial for understanding:

  • Why certain statistical methods are appropriate

  • How to properly visualize the data

  • The structure of dependencies in the data
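One way to check this structure directly is to cross-tabulate the two categorical variables: table() accepts multiple grouping factors, and each cell then counts one Type/Treatment combination. Given the design described in the introduction (3 plants per combination, each measured 7 times), every cell should contain 21.

```r
# Two-way cross-tabulation of the categorical variables.
# A balanced design shows equal counts in every cell
# (3 plants x 7 readings = 21 per Type/Treatment combination).
table(CO2$Type, CO2$Treatment)
```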

Part 4: Univariate Analysis of Uptake

The uptake variable represents CO₂ uptake rate (μmol/m²/s) and is our primary response variable. A thorough univariate analysis helps us understand its distribution before examining relationships.

Question 4: Perform a comprehensive univariate analysis:

  1. Numerical summaries:

    Compute and report the following statistics using proper notation:

    • Central tendency: sample mean (\(\bar{x}\)) and median (\(\tilde{x}\))

    • Variability: sample standard deviation (\(s\)) and IQR

    • Five-number summary using summary()

    # Hint: Key functions for numerical summaries
    mean()     # Calculate arithmetic mean
    median()   # Calculate median
    sd()       # Calculate standard deviation
    IQR()      # Calculate interquartile range
    summary()  # Five-number summary plus mean
    
  2. Graphical analysis:

    Create the following visualizations:

    • Histogram: Use an appropriate number of bins

    • Modified boxplot: Include potential outliers

    Plotting Hints 📊

    • For histograms, use hist() with arguments for:

      • main: Plot title

      • xlab: X-axis label

      • col: Bar color

      • breaks: Number of bins or breakpoints

    • For boxplots, use boxplot() with arguments for:

      • main: Plot title

      • ylab or xlab: Axis labels

      • horizontal: TRUE/FALSE for orientation

  3. Distribution description:

    Based on your numerical and graphical analyses, describe:

    • Shape (symmetric, skewed, multimodal?)

    • Center (typical value)

    • Spread (variability)

    • Unusual features (outliers, gaps, clusters?)
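The plotting hints above can be sketched as follows; the title, label, colour, and bin-count values here are placeholders to adapt, not prescribed settings:

```r
# Histogram of uptake with labelled axes.
# breaks is a suggestion that R may adjust to "pretty" breakpoints.
hist(CO2$uptake,
     main = "Distribution of CO2 Uptake",
     xlab = "Uptake (umol/m^2/s)",
     col  = "lightblue",
     breaks = 10)

# Horizontal boxplot of the same variable.
# Potential outliers plot as individual points beyond the whiskers.
boxplot(CO2$uptake,
        main = "CO2 Uptake",
        xlab = "Uptake (umol/m^2/s)",
        horizontal = TRUE)
```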

Part 5: Grouped Statistics with tapply

R’s functional programming paradigm shines when computing statistics by groups. The tapply() function (table-apply) efficiently computes summaries for subsets of data.

Syntax: tapply(X, INDEX, FUN)

  • X: The vector to analyze

  • INDEX: The grouping factor(s)

  • FUN: The function to apply to each group

Question 5: Group-wise analysis:

Using tapply(), compute statistics for uptake grouped by Type:

# Compute grouped statistics
uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)

# Display results
print(uptake_mean_by_type)
print(uptake_sd_by_type)

  1. Report the mean and standard deviation for each Type

  2. Calculate the difference in means between locations

  3. Which location shows higher average CO₂ uptake?

  4. Which location shows more variability?

Understanding R’s Apply Family 🔧

The apply functions are powerful tools for avoiding loops in R. Each serves a specific purpose:

apply() - For matrices and arrays

  • Purpose: Apply function over rows or columns of a matrix/array

  • Syntax: apply(X, MARGIN, FUN)

  • MARGIN: 1 = rows, 2 = columns, c(1,2) = both

  • Example: Calculate row means of a matrix

mat <- matrix(1:12, nrow = 3)
apply(mat, 1, mean)  # Mean of each row
apply(mat, 2, sum)   # Sum of each column

lapply() - For lists (returns list)

  • Purpose: Apply function to each element of a list

  • Syntax: lapply(X, FUN)

  • Returns: Always a list

  • Example: Get length of each list element

my_list <- list(a = 1:5, b = 1:10, c = 1:3)
lapply(my_list, length)  # Returns list

sapply() - Simplified lapply (returns vector/matrix)

  • Purpose: Like lapply but simplifies result if possible

  • Syntax: sapply(X, FUN)

  • Returns: Vector, matrix, or list (depending on output)

  • Example: Same as above but simplified

sapply(my_list, length)  # Returns named vector

tapply() - For grouped operations

  • Purpose: Apply function to subsets defined by factors

  • Syntax: tapply(X, INDEX, FUN)

  • Use case: Calculate statistics by group

  • Example: Mean by category

# If we have values and their categories
values <- c(10, 20, 15, 30, 25, 35)
groups <- c("A", "B", "A", "B", "A", "B")
tapply(values, groups, mean)  # Mean by group

mapply() - Multivariate apply

  • Purpose: Apply function to multiple arguments in parallel

  • Syntax: mapply(FUN, arg1, arg2, ...)

  • Use case: Element-wise operations on multiple vectors

  • Example: Custom calculation on paired values

# Calculate x*y + z for each position
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)
mapply(function(a, b, c) a * b + c, x, y, z)

When to use which?

  • Have a matrix? → apply()

  • Have a list? → lapply() or sapply()

  • Need groups? → tapply()

  • Multiple inputs? → mapply()

Part 6: Comparative Visualization by Type

Comparing distributions across groups reveals patterns that univariate analysis might miss. We’ll use both base R and ggplot2 for visualization.

Question 6: Side-by-side boxplots:

Create a comparative boxplot using ggplot2:

ggplot2 Structure 📊

Remember the grammar of graphics:

  1. Data: What dataset are you using?

  2. Aesthetics: What variables map to x, y, color, etc.?

  3. Geometries: What type of plot (boxplot, points, etc.)?

  4. Labels: Title, axis labels

  5. Theme: Overall appearance

Key functions to explore:

  • ggplot() - Initialize plot with data and aesthetics

  • geom_boxplot() - Add boxplot layer

  • labs() - Add labels

  • theme_minimal() or other themes for appearance

Enhance your plot by:

  • Adding individual points with geom_jitter()

  • Highlighting means with stat_summary()

  • Using colors to distinguish groups

  • Adding appropriate labels and title
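Putting those pieces together, one possible skeleton is shown below; the colours, shapes, and labels are placeholders for you to adjust:

```r
library(ggplot2)

# Side-by-side boxplots of uptake by Type, with the raw points jittered on top
ggplot(CO2, aes(x = Type, y = uptake, fill = Type)) +
  geom_boxplot(outlier.shape = NA) +         # hide outliers (jitter shows them)
  geom_jitter(width = 0.15, alpha = 0.5) +   # individual observations
  stat_summary(fun = mean, geom = "point",   # mark each group's mean
               shape = 18, size = 3, colour = "darkred") +
  labs(title = "CO2 Uptake by Origin",
       x = "Origin", y = "Uptake (umol/m^2/s)") +
  theme_minimal()
```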

Describe the differences you observe between Quebec and Mississippi plants in terms of:

  • Central tendency

  • Variability

  • Skewness

  • Potential outliers

Question 7: Comparative histograms with density overlays:

Create faceted histograms showing the distribution of uptake by Type:

  1. Data preparation:

    • Use tapply() to calculate group-specific means and standard deviations

    • Use ifelse() to create a new column with normal density values

    # Calculate group statistics
    means <- tapply(CO2$uptake, CO2$Type, mean)
    sds <- tapply(CO2$uptake, CO2$Type, sd)
    
    # Add normal density column
    CO2$normal_density <- ifelse(CO2$Type == "Quebec",
                                 dnorm(CO2$uptake, means["Quebec"], sds["Quebec"]),
                                 dnorm(CO2$uptake, means["Mississippi"], sds["Mississippi"]))
    
  2. Visualization:

    Create a faceted histogram with:

    • Histogram bars showing frequency distribution

    • Red kernel density curve (empirical distribution)

    • Blue normal density curve (theoretical distribution)

    • Appropriate labels and formatting

  3. Interpretation:

    Compare the distributions and discuss:

    • How well does the normal distribution fit each group?

    • Are there signs of skewness or other departures from normality?

    • Do the two locations show similar distributional patterns?

Part 7: Exploring the Concentration Effect

The relationship between uptake and conc reveals how plants respond physiologically to varying CO₂ levels, a key aspect of plant ecology and climate-change research.

Question 8: Initial visualization attempt:

Create a boxplot of uptake versus conc:

Task 🎯

Using ggplot2, create a boxplot with:

  • conc on the x-axis

  • uptake on the y-axis

  • Appropriate axis labels

Think about what aesthetic mapping you need in aes()

  1. What issue do you observe with this plot?

  2. Why doesn’t this approach work as intended?

  3. How is R treating the conc variable, and why is this problematic?

Variable Types Matter! ⚠️

R’s automatic type detection can sometimes work against us. When a variable contains numbers, R assumes it’s continuous unless told otherwise. This affects:

  • How ggplot2 creates axes

  • Which geoms are appropriate

  • How statistical summaries are calculated
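You can confirm how R is storing conc before plotting; this is a quick diagnostic, not the fix itself:

```r
# conc holds numbers, so R treats it as a continuous variable
class(CO2$conc)    # "numeric"

# ...even though it only ever takes 7 distinct values
unique(CO2$conc)
```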

Question 9: Correcting the visualization:

Fix the issue by explicitly converting conc to a factor:

# Convert to factor to treat as discrete categories
CO2$conc <- as.factor(CO2$conc)

# Verify the change
str(CO2$conc)
levels(CO2$conc)

Now recreate the boxplot and describe:

  1. The relationship between CO₂ concentration and uptake rate

  2. How variability changes across concentration levels

  3. Any apparent threshold effects or plateaus

  4. Distributional changes (skewness, outliers) across concentrations

Part 8: Advanced Visualization with Multiple Categories

When working with multiple categories, we need more sophisticated approaches than simple ifelse() statements. The mapply() function enables elegant solutions for multi-group analyses.

Question 10: Multi-category histogram with density curves:

AI Assistance Guidelines 🤖

For this complex visualization task, you’re encouraged to use AI assistance (ChatGPT, Claude, etc.). However, use it as a learning tool, not just a code generator.

Creating an Effective Prompt:

  1. Provide Context: “I’m learning R and ggplot2 in a statistics course. I’m working with the CO2 dataset…”

  2. Be Specific About Requirements:

    • Must use base R functions (tapply, mapply) with ggplot2

    • Cannot use dplyr, tidyr, or other tidyverse packages

    • Need to overlay kernel and normal density curves

  3. Request Teaching, Not Just Code: “Can you explain why mapply is used here and how it works step-by-step?”

  4. Example Prompt Structure:

    I need to create faceted histograms showing uptake distributions
    across 7 CO2 concentration levels. Requirements:
    - Use only base R (tapply, mapply) and ggplot2
    - No dplyr or other packages
    - Calculate normal density for each concentration group
    - Overlay kernel density (red) and normal density (blue) curves
    - Use facet_wrap with nrow=2
    
    Can you explain the approach step-by-step with commented code?
    

Verification Checklist:

✓ Uses only allowed functions and packages

✓ Includes clear explanations of each step

✓ Code is well-commented

✓ Follows the specific requirements

✓ You understand why each part works

Implementation steps:

  1. Calculate group statistics:

    # Group means and standard deviations
    conc_means <- tapply(CO2$uptake, CO2$conc, mean)
    conc_sds <- tapply(CO2$uptake, CO2$conc, sd)
    
  2. Create a density calculation function:

    Function Design 💡

    Your function needs to:

    1. Accept two parameters: an uptake value and a concentration group

    2. Look up the appropriate mean and standard deviation for that group

    3. Calculate the normal density at that uptake value

    4. Return the density value

    Remember: dnorm(x, mean, sd) calculates normal density

  3. Apply the function using mapply:

    Understanding mapply() 🔄

    mapply() will:

    1. Take your function as the first argument

    2. Take CO2$uptake as the second argument (all 84 values)

    3. Take CO2$conc as the third argument (all 84 groups)

    4. Apply your function to each pair of (uptake, conc) values

    5. Return a vector of 84 density values

    Why use as.character(CO2$conc)? Because factor levels need to match the names in your statistics vectors.
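    One possible sketch of steps 2 and 3, assuming the conc_means and conc_sds vectors from step 1 (the function name density_for is an illustration, not a requirement):

    ```r
    # Step 2: look up the group's mean and sd by name, then
    # evaluate the normal density at that uptake value
    density_for <- function(u, group) {
      dnorm(u, mean = conc_means[group], sd = conc_sds[group])
    }

    # Step 3: apply it pairwise over all 84 (uptake, conc) observations.
    # as.character() converts factor levels to the names indexing the
    # statistics vectors returned by tapply().
    CO2$normal_density <- mapply(density_for,
                                 CO2$uptake,
                                 as.character(CO2$conc))
    ```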

  4. Create the visualization:

    • Faceted histograms with facet_wrap(~ conc, nrow = 2)

    • Overlay kernel density (red) and normal density (blue)

    • Proper labels and formatting

  5. Interpretation:

    Describe patterns across concentration levels:

    • How does the distribution shape change?

    • At what concentration does uptake appear to plateau?

    • Which concentrations show the most/least variability?

    • How well do normal curves fit at different concentrations?

Reference: Key Functions

ifelse()

Purpose: Vectorized conditional evaluation

Syntax: ifelse(test, yes, no)

Example:

# Assign groups based on condition
group <- ifelse(values > 50, "High", "Low")

Why Used: Simple and efficient for binary conditions. Works element-wise on vectors.

Limitation: Becomes unwieldy with multiple conditions

Alternative: dplyr::case_when() for multiple conditions

tapply()

Purpose: Apply a function to subsets of a vector, grouped by factors

Syntax: tapply(X, INDEX, FUN, ...)

Example:

# Calculate mean by group
group_means <- tapply(data$value, data$group, mean, na.rm = TRUE)

Why Used:

  • No additional packages required

  • Memory efficient

  • Returns named vector/array

Alternative: dplyr::group_by() %>% summarise()

mapply()

Purpose: Multivariate version of sapply - applies function to multiple arguments

Syntax: mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)

Example:

# Calculate values using multiple inputs
results <- mapply(function(x, y) x * y + 2,
                 vec1, vec2)

Why Used:

  • Elegant for element-wise operations on multiple vectors

  • Avoids explicit loops

  • Flexible argument passing

Alternative: purrr::map2() or dplyr::mutate()

Base R vs. Tidyverse 🔄

This worksheet uses base R functions to:

  • Build fundamental R skills

  • Work without package dependencies

  • Understand underlying operations

  • Prepare for situations where packages aren’t available

In practice, tidyverse packages often provide more readable solutions, but understanding base R is essential for R proficiency.

Troubleshooting Guide

Common Issues and Solutions 🔧

Issue: “Error: object ‘CO2’ not found”

Solution: Run data("CO2") first

Issue: Boxplot shows only one box when using conc

Solution: Convert to factor: CO2$conc <- as.factor(CO2$conc)

Issue: mapply() returns unexpected results

Solution: Ensure your function arguments match the vectors passed

Issue: Density curves don’t appear on histogram

Solution: Check that you’re using aes(y = after_stat(density)) in geom_histogram()
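For that last issue, a minimal sketch of the fix (the bin count and colours are placeholders): mapping y to after_stat(density) puts the histogram on the same density scale as the curves, so they overlay correctly.

```r
library(ggplot2)

# Plot the histogram as densities so overlaid curves share its y-scale
ggplot(CO2, aes(x = uptake)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 fill = "grey80", colour = "white") +
  geom_density(colour = "red")   # kernel density on the same scale
```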

Additional Resources

R Documentation:

  • CO2 dataset: help(CO2)

  • Plotting functions: help(hist), help(boxplot)

  • Apply family: help(tapply), help(mapply)

Getting Help:

  • Use ?function_name for documentation

  • Search specific errors online

  • Visit office hours with specific questions

  • Form study groups to discuss approaches

Next Steps 🚀

This worksheet introduces fundamental concepts that we’ll build upon:

  • Next week: Two-sample comparisons and ANOVA

  • Coming soon: Regression analysis for continuous relationships

  • Later: Mixed models for repeated measures data

Happy analyzing! 📊