Worksheet 1: Exploring Data with R

Learning Objectives 🎯

  • Load and explore datasets in R using basic commands

  • Distinguish between qualitative and quantitative variables

  • Classify variables by measurement scale (nominal, ordinal, interval, ratio)

  • Create and interpret univariate visualizations (histograms, boxplots)

  • Compute grouped statistics using functional programming

  • Build comparative visualizations for multivariate exploration

Introduction

In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined the cold tolerance of the grass species Echinochloa crus-galli by measuring its CO₂ uptake under chilled and unchilled conditions.

Study Design:

  • Sample Size: 12 plants total (6 from Quebec, Canada; 6 from Mississippi, USA)

  • Treatment: Half the plants from each location received overnight chilling; half remained unchilled

  • Measurements: CO₂ uptake rates measured at 7 ambient CO₂ concentration levels per plant

  • Total Observations: 84 (12 plants × 7 measurements)

This dataset provides an excellent opportunity to practice core statistical skills, including working with categorical and numerical data and creating visualizations such as histograms and boxplots.

Note

Citation: Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) “The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures”, Ecology, 71, 1389–1400.

Part 1: Loading and Understanding the Dataset

The CO2 dataset ships with R in the datasets package, making it readily accessible for analysis.

# Load the dataset
data("CO2", package = "datasets")

After running this command, an object named CO2 should appear in your environment (visible in the top-right pane of RStudio), listed as a promise. A promise is R’s placeholder for a lazily loaded object: the data are read into memory only when CO2 is first used.

The help() command provides comprehensive documentation for R’s functionality. To learn more about the dataset:

# View documentation
help(CO2)

Question 1: Using the help documentation (visible in the bottom-right pane), list all variables in the dataset and specify their measurement units where applicable.

Part 2: Initial Data Exploration

The View() command provides a spreadsheet-like format for exploring data:

# Open dataset in viewer
View(CO2)
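For a quick, non-interactive look at the same data, a few base R commands can complement View(). The sketch below is one possible starting point, not the required approach. It works on a copy named co2 (a hypothetical name chosen here) so that your own CO2 object is left untouched.

```r
# Working copy of the built-in dataset (the name co2 is an assumption of this sketch)
co2 <- datasets::CO2

# Compact overview: one line per variable with its type and first few values
str(co2)

# First three rows, as a small preview of what View() shows
head(co2, n = 3)

# Dimensions of the dataset
n_rows <- nrow(co2)
n_cols <- ncol(co2)

# Leading class of each column (e.g., factor vs. numeric)
column_classes <- sapply(co2, function(col) class(col)[1])
print(column_classes)
```

The column classes are only a hint for Question 2: R's storage type tells you how a variable is stored, not its measurement scale, which still requires judgment.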

Question 2: Answer the following questions about the data:

  1. How many observations (rows) are in the CO2 dataset?

  2. Identify and classify each variable:

    • Which variables are qualitative (categorical)?

    • Which variables are quantitative (numerical)?

  3. For each qualitative variable:

    • Specify whether it’s nominal or ordinal

    • List all possible values (categories)

  4. For each quantitative variable:

    • Specify whether it’s interval or ratio scale

    • Provide brief justification (consider meaningful zero, type of measurement)

Part 3: Frequency Tables

The table() function generates frequency counts for categorical variables:

# Example usage
table(CO2$Type)
table(CO2$Treatment)
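table() also accepts two variables at once, which produces a cross-tabulation. As a sketch, assuming the design described in the Introduction, crossing Type with Treatment shows how the observations are allocated across the four design cells. The object names co2 and type_by_treatment are hypothetical.

```r
# Working copy of the built-in dataset
co2 <- datasets::CO2

# Two-way frequency table: rows are Type, columns are Treatment
type_by_treatment <- table(co2$Type, co2$Treatment)
print(type_by_treatment)

# Same table with row and column totals appended
print(addmargins(type_by_treatment))
```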

Question 3:

  1. Run table() on each qualitative variable and report the frequency counts.

  2. Run table() on the quantitative variables. Why might one quantitative variable show a specific pattern (repeated values or fixed intervals)? Based on the experimental design, hypothesize why this occurs.

Part 4: Univariate Analysis of Uptake

The uptake variable is our primary response variable, representing CO₂ uptake rates.

Question 4: Perform numerical and graphical univariate analysis of uptake:

  1. Compute and report statistics for central tendency and spread. Use proper notation (e.g., \(\bar{x}\) for sample mean).

  2. Create a histogram and modified boxplot for uptake.

  3. Describe the distribution based on your visualizations.

# Hint: Basic statistics
mean(CO2$uptake)
sd(CO2$uptake)
summary(CO2$uptake)

# Hint: Basic plots
hist(CO2$uptake)
boxplot(CO2$uptake)
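A modified boxplot flags points beyond 1.5 × IQR from the quartiles as potential outliers. The sketch below shows one way to compute those fences by hand, using the default quantile() definition. Note that boxplot() itself bases its box on the hinges from fivenum(), which can differ slightly from these quartiles. All object names here are hypothetical.

```r
# Working copy of the response variable
x <- datasets::CO2$uptake

# Robust measures of center and spread
med <- median(x)
q <- quantile(x, probs = c(0.25, 0.75))  # first and third quartiles (default type 7)
iqr <- IQR(x)                            # Q3 - Q1, same quantile definition

# Fences used by the 1.5 * IQR rule
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Observations outside the fences (possibly none)
outliers <- x[x < lower_fence | x > upper_fence]

print(c(median = med, IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```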

Part 5: Grouped Statistics with tapply

R’s functional programming simplifies group-wise operations. The tapply() function applies a function to subsets of data grouped by a categorical variable.
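To see the mechanics before applying tapply() to CO2, here is a minimal sketch on made-up data. The vectors heights and group are hypothetical and exist only for illustration.

```r
# Hypothetical measurements and their group labels
heights <- c(10, 12, 14, 20, 22, 24)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# tapply() splits heights by group, then applies the function to each piece
group_means <- tapply(heights, group, mean)
group_sds <- tapply(heights, group, sd)

print(group_means)  # A = 12, B = 22
print(group_sds)    # A = 2,  B = 2
```

The result is a named vector with one entry per group, so individual values can be pulled out by name, for example group_means[["A"]].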

Question 5: Using tapply(), compute the mean and standard deviation of uptake by Type:

# Compute grouped statistics
uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)

Report your results below.

Part 6: Comparative Visualization by Type

Question 6: Create a side-by-side boxplot using ggplot2:

  • Plot uptake on the y-axis and Type on the x-axis

  • Label axes appropriately

  • Add an informative title

  • Describe differences in distributional characteristics between Quebec and Mississippi

library(ggplot2)

# Your code here

Question 7: Create comparative histograms with density curves:

  1. Use tapply() to calculate mean and standard deviation for each Type

  2. Use ifelse() to compute normal density values conditionally

  3. Create faceted histograms with:

    • Red kernel density curve

    • Blue normal density curve

  4. Describe distributional differences between types
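Steps 1 and 2 above can be sketched as follows. This is one possible approach under stated assumptions, not the only valid solution: it works on a copy named co2 and stores the result in a new column called normal_density, both hypothetical names.

```r
# Working copy of the built-in dataset
co2 <- datasets::CO2

# Step 1: group-wise mean and standard deviation of uptake
type_mean <- tapply(co2$uptake, co2$Type, mean)
type_sd <- tapply(co2$uptake, co2$Type, sd)

# Step 2: for each row, evaluate the normal density using the
# parameters of that row's own group (two groups, so one ifelse suffices)
co2$normal_density <- ifelse(
  co2$Type == "Quebec",
  dnorm(co2$uptake, mean = type_mean[["Quebec"]], sd = type_sd[["Quebec"]]),
  dnorm(co2$uptake, mean = type_mean[["Mississippi"]], sd = type_sd[["Mississippi"]])
)

head(co2[, c("Type", "uptake", "normal_density")])
```

In the plotting step, a column like this is typically drawn with geom_line(aes(y = normal_density), color = "blue") over a histogram whose y-axis is on the density scale, so that the bars and both curves share the same units.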

Part 7: Exploring the Concentration Effect

The relationship between uptake and conc (ambient CO₂ concentration) reflects plant response to varying CO₂ levels.

Question 8: Create a boxplot of uptake vs conc:

  1. What issue do you observe?

  2. Why doesn’t this approach work as intended? (Hint: How is conc being treated?)

Question 9: Fix the issue by converting conc to a factor:

CO2$conc <- as.factor(CO2$conc)

Recreate the boxplot and describe distributional differences across concentration levels.
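Because the assignment above overwrites the numeric conc column, it can help to understand what the conversion does before committing to it. The sketch below works on a copy and stores the factor in a separate column, conc_factor (a hypothetical name), and also shows a common pitfall when converting back.

```r
# Working copy of the built-in dataset
co2 <- datasets::CO2

# Before conversion: conc is stored as a number
print(class(co2$conc))

# Factor version kept alongside the original numeric column
co2$conc_factor <- as.factor(co2$conc)
print(levels(co2$conc_factor))

# Pitfall: as.numeric() on a factor returns integer level codes (1, 2, ...),
# not the original concentrations
codes <- as.numeric(co2$conc_factor)

# To recover the original values, go through character first
recovered <- as.numeric(as.character(co2$conc_factor))
print(head(data.frame(conc = co2$conc, codes = codes, recovered = recovered)))
```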

Part 8: Advanced Visualization with Multiple Categories

Question 10: Create faceted histograms for uptake by conc:

AI Assistance Guidelines 🤖

For this complex visualization task, you’re encouraged to use your favorite AI assistant (ChatGPT, Claude, etc.) to help solve the problem. However, follow these guidelines:

Creating an Effective Pedagogical Prompt:

  1. Provide Context: “I’m learning R and ggplot2 in a statistics course…”

  2. Be Specific: Include the exact requirements and constraints

  3. Request Teaching: Ask for explanations, not just code

  4. Example prompt structure:

    “I need to create a histogram showing uptake distributions across multiple CO2 concentration levels. I must use ggplot2 and cannot use dplyr or other tidyverse packages beyond ggplot2. I need to use mapply() to compute normal density values for multiple groups. Can you explain the approach step-by-step and provide commented code?”

Verification Checklist:

  • ✓ Uses only base R functions (tapply, mapply) and ggplot2

  • ✓ No dplyr, tidyr, or other packages

  • ✓ Includes clear explanations of each step

  • ✓ Code is well-commented

  • ✓ Follows the specific requirements (facet_wrap with nrow=2, overlay curves)

Requirements:

  • Use mapply() to compute normal density for multiple categories

  • Create faceted histograms with facet_wrap(~ conc, nrow = 2)

  • Overlay red kernel density and blue normal density curves

  • Describe distributional patterns
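The mapply() requirement can be sketched as below. With seven concentration levels, nesting ifelse() calls becomes unwieldy, so the idea is to look up each row's group mean and standard deviation by name. This is a hedged illustration rather than a full solution: co2, conc_key, and normal_density are hypothetical names, and the plotting step is left to you.

```r
# Working copy of the built-in dataset
co2 <- datasets::CO2

# Character key for each row's concentration level
# (works whether conc is still numeric or already a factor)
conc_key <- as.character(co2$conc)

# Group-wise parameters, named by concentration level
conc_mean <- tapply(co2$uptake, conc_key, mean)
conc_sd <- tapply(co2$uptake, conc_key, sd)

# mapply() walks over uptake and conc_key in parallel, one row at a time,
# and evaluates the normal density with that row's group parameters
co2$normal_density <- mapply(
  function(x, key) dnorm(x, mean = conc_mean[[key]], sd = conc_sd[[key]]),
  co2$uptake,
  conc_key
)

head(co2[, c("conc", "uptake", "normal_density")])
```

The same values could be obtained without mapply() by indexing the parameter vectors directly, since dnorm() is vectorized; mapply() is used here because the worksheet asks for it and because it generalizes to functions that are not vectorized.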

Reference: Key Functions

ifelse

Purpose: Vectorized conditional function

Usage: ifelse(condition, value_if_true, value_if_false)

Why Used: Straightforward for binary conditions

Alternative: dplyr::case_when() for multiple conditions
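A minimal illustration on made-up values (temps is a hypothetical vector): ifelse() tests every element and returns a vector of the same length, and calls can be nested when there are more than two outcomes.

```r
# Hypothetical temperatures
temps <- c(-3, 0, 4, 12)

# Binary condition: one label per element
binary_label <- ifelse(temps > 0, "above zero", "at or below zero")
print(binary_label)

# Three outcomes via nesting (readable for a few cases, awkward for many)
three_way <- ifelse(temps < 0, "negative",
                    ifelse(temps == 0, "zero", "positive"))
print(three_way)
```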

tapply

Purpose: Apply function to grouped subsets

Usage: tapply(vector, grouping_variable, function)

Why Used: Simple and efficient for grouped calculations

Alternative: dplyr::group_by() + summarise()

mapply

Purpose: Apply a function element-wise over several vectors in parallel

Usage: mapply(function, arg1, arg2, ...)

Why Used: Flexible for row-wise calculations where each row needs different argument values

Alternative: purrr::map2() or dplyr::mutate()
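A minimal illustration on made-up values (all names are hypothetical): mapply() takes the first elements of each vector, calls the function, then the second elements, and so on.

```r
# Hypothetical values, each with its own center and spread
values <- c(10, 20, 30)
centers <- c(8, 20, 33)
spreads <- c(2, 5, 3)

# Standardize each value using its own center and spread
z <- mapply(function(x, mu, sigma) (x - mu) / sigma,
            values, centers, spreads)
print(z)  # 1 0 -1
```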

Submission Guidelines

  1. Complete all questions in order

  2. Include both code and written interpretations

  3. Ensure all plots are properly labeled

  4. Submit your write-up in class