.. _worksheet1:

Worksheet 1: Exploring Data with R
==================================

.. admonition:: Learning Objectives 🎯
   :class: info

   • Load and explore datasets in R using basic commands
   • Distinguish between qualitative and quantitative variables
   • Classify variables by measurement scale (nominal, ordinal, interval, ratio)
   • Create and interpret univariate visualizations (histograms, boxplots)
   • Compute grouped statistics using functional programming
   • Build comparative visualizations for multivariate exploration

Introduction
------------

In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined how the grass species *Echinochloa crus-galli* responds to environmental changes.

**Study Design:**

- **Sample Size:** 12 plants total (6 from Quebec, Canada; 6 from Mississippi, USA)
- **Treatment:** Half the plants from each location received overnight chilling; half remained unchilled
- **Measurements:** CO₂ uptake rates measured at 7 ambient CO₂ concentration levels per plant
- **Total Observations:** 84 (12 plants × 7 measurements)

This dataset provides an excellent opportunity to practice core statistical skills, including working with categorical and numerical data and creating visualizations such as histograms and boxplots.

.. note::

   **Citation:** Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) "The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures", *Ecology*, 71, 1389–1400.

Part 1: Loading and Understanding the Dataset
----------------------------------------------

The CO2 dataset is included in base R, making it readily accessible for analysis.

.. code-block:: r

   # Load the dataset
   data("CO2", package = "datasets")

After running this command, a data variable named ``CO2`` should appear in your environment (visible in the top-right pane of RStudio) as a *promise*.
A promise is part of R's lazy loading process that delays evaluation until needed.

The ``help()`` command provides comprehensive documentation for R's functionality. To learn more about the dataset:

.. code-block:: r

   # View documentation
   help(CO2)

**Question 1:** Using the help documentation (visible in the bottom-right pane), list all variables in the dataset and specify their measurement units where applicable.

Part 2: Initial Data Exploration
---------------------------------

The ``View()`` command provides a spreadsheet-like format for exploring data:

.. code-block:: r

   # Open dataset in viewer
   View(CO2)

**Question 2:** Answer the following questions about the data:

a) How many observations (rows) are in the CO2 dataset?

b) Identify and classify each variable:

   - Which variables are **qualitative** (categorical)?
   - Which variables are **quantitative** (numerical)?

c) For each qualitative variable:

   - Specify whether it's **nominal** or **ordinal**
   - List all possible values (categories)

d) For each quantitative variable:

   - Specify whether it's **interval** or **ratio** scale
   - Provide brief justification (consider meaningful zero, type of measurement)

Part 3: Frequency Tables
------------------------

The ``table()`` function generates frequency counts for categorical variables:

.. code-block:: r

   # Example usage
   table(CO2$Type)
   table(CO2$Treatment)

**Question 3:**

a) Run ``table()`` on each qualitative variable and report the frequency counts.

b) Run ``table()`` on the quantitative variables. Why might one quantitative variable show a specific pattern (repeated values or fixed intervals)? Based on the experimental design, hypothesize why this occurs.

Part 4: Univariate Analysis of Uptake
--------------------------------------

The ``uptake`` variable is our primary response variable, representing CO₂ uptake rates.

**Question 4:** Perform numerical and graphical univariate analysis of ``uptake``:

a) Compute and report statistics for central tendency and spread.
   Use proper notation (e.g., :math:`\bar{x}` for sample mean).

b) Create a histogram and modified boxplot for ``uptake``.

c) Describe the distribution based on your visualizations.

.. code-block:: r

   # Hint: Basic statistics
   mean(CO2$uptake)
   sd(CO2$uptake)
   summary(CO2$uptake)

   # Hint: Basic plots
   hist(CO2$uptake)
   boxplot(CO2$uptake)

Part 5: Grouped Statistics with tapply
---------------------------------------

R's functional programming simplifies group-wise operations. The ``tapply()`` function applies a function to subsets of data grouped by a categorical variable.

**Question 5:** Using ``tapply()``, compute the mean and standard deviation of ``uptake`` by ``Type``:

.. code-block:: r

   # Compute grouped statistics
   uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
   uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)

Report your results below.

Part 6: Comparative Visualization by Type
------------------------------------------

**Question 6:** Create a side-by-side boxplot using ``ggplot2``:

- Plot ``uptake`` on y-axis and ``Type`` on x-axis
- Label axes appropriately
- Add an informative title
- Describe differences in distributional characteristics between Quebec and Mississippi

.. code-block:: r

   library(ggplot2)

   # Your code here

**Question 7:** Create comparative histograms with density curves:

a) Use ``tapply()`` to calculate mean and standard deviation for each ``Type``

b) Use ``ifelse()`` to compute normal density values conditionally

c) Create faceted histograms with:

   - Red kernel density curve
   - Blue normal density curve

d) Describe distributional differences between types

Part 7: Exploring the Concentration Effect
-------------------------------------------

The relationship between ``uptake`` and ``conc`` (ambient CO₂ concentration) reflects plant response to varying CO₂ levels.

**Question 8:** Create a boxplot of ``uptake`` vs ``conc``:

a) What issue do you observe?

b) Why doesn't this approach work as intended? (Hint: How is ``conc`` being treated?)
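
One way to follow the hint in Question 8 is to check how R stores a column before plotting it. The sketch below is only an illustration under stated assumptions: it uses a small made-up data frame (``toy``, with hypothetical columns ``dose`` and ``response``) rather than the ``CO2`` data, and relies on base R functions only.

.. code-block:: r

   # Hypothetical toy data: 'dose' takes a few repeated numeric values
   toy <- data.frame(
     dose     = rep(c(10, 20, 40), each = 4),
     response = c(1.2, 1.5, 1.1, 1.4, 2.3, 2.1, 2.6, 2.4, 3.9, 4.2, 4.0, 3.8)
   )

   # Inspect how the column is stored
   class(toy$dose)       # storage class of the column
   is.numeric(toy$dose)  # is it treated as a number?
   is.factor(toy$dose)   # is it treated as a categorical variable?

   # Compare the number of distinct values with the number of rows
   length(unique(toy$dose))
   nrow(toy)

   # A factor built from the same values has one level per distinct value
   dose_f <- factor(toy$dose)
   levels(dose_f)

Comparing the numeric column with its factor version may help explain what a plotting function does when it receives one or the other.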
**Question 9:** Fix the issue by converting ``conc`` to a factor:

.. code-block:: r

   CO2$conc <- as.factor(CO2$conc)

Recreate the boxplot and describe distributional differences across concentration levels.

Part 8: Advanced Visualization with Multiple Categories
--------------------------------------------------------

**Question 10:** Create faceted histograms for ``uptake`` by ``conc``:

.. admonition:: AI Assistance Guidelines 🤖
   :class: tip

   For this complex visualization task, you're encouraged to use your favorite AI assistant (ChatGPT, Claude, etc.) to help solve the problem. However, follow these guidelines:

   **Creating an Effective Pedagogical Prompt:**

   1. **Provide Context:** "I'm learning R and ggplot2 in a statistics course..."
   2. **Be Specific:** Include the exact requirements and constraints
   3. **Request Teaching:** Ask for explanations, not just code
   4. **Example prompt structure:**

      "I need to create a histogram showing uptake distributions across multiple CO2 concentration levels. I must use ggplot2 and cannot use dplyr or other tidyverse packages beyond ggplot2. I need to use mapply() to compute normal density values for multiple groups. Can you explain the approach step-by-step and provide commented code?"

   **Verification Checklist:**

   - ✓ Uses only base R functions (tapply, mapply) and ggplot2
   - ✓ No dplyr, tidyr, or other packages
   - ✓ Includes clear explanations of each step
   - ✓ Code is well-commented
   - ✓ Follows the specific requirements (facet_wrap with nrow=2, overlay curves)

Requirements:

- Use ``mapply()`` to compute normal density for multiple categories
- Create faceted histograms with ``facet_wrap(~ conc, nrow = 2)``
- Overlay red kernel density and blue normal density curves
- Describe distributional patterns

Reference: Key Functions
------------------------

.. glossary::

   ifelse
      **Purpose:** Vectorized conditional function

      **Usage:** ``ifelse(condition, value_if_true, value_if_false)``

      **Why Used:** Straightforward for binary conditions

      **Alternative:** ``dplyr::case_when()`` for multiple conditions

   tapply
      **Purpose:** Apply function to grouped subsets

      **Usage:** ``tapply(vector, grouping_variable, function)``

      **Why Used:** Simple and efficient for grouped calculations

      **Alternative:** ``dplyr::group_by()`` + ``summarise()``

   mapply
      **Purpose:** Apply function to multiple arguments simultaneously

      **Usage:** ``mapply(function, arg1, arg2, ...)``

      **Why Used:** Flexible for row-wise calculations

      **Alternative:** ``purrr::map2()`` or ``dplyr::mutate()``

Submission Guidelines
---------------------

1. Complete all questions in order
2. Include both code and written interpretations
3. Ensure all plots are properly labeled
4. Submit your write-up in class
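
The three helpers listed under Reference: Key Functions can be tried on a small made-up data set before using them on ``CO2``. The following is a minimal sketch under that assumption: the data frame ``toy`` and its columns ``group`` and ``value`` are hypothetical, and the approach shown is only one of several ways to attach group-wise normal densities to each row.

.. code-block:: r

   # Hypothetical toy data with two groups
   toy <- data.frame(
     group = rep(c("A", "B"), each = 5),
     value = c(4.1, 5.0, 5.6, 4.8, 5.2, 9.7, 10.4, 11.1, 10.0, 9.5)
   )

   # tapply(): one mean and one standard deviation per group
   grp_mean <- tapply(toy$value, toy$group, mean)
   grp_sd   <- tapply(toy$value, toy$group, sd)

   # ifelse(): pick the matching statistic for each row (two groups only)
   toy$mean_i <- ifelse(toy$group == "A", grp_mean["A"], grp_mean["B"])
   toy$sd_i   <- ifelse(toy$group == "A", grp_sd["A"], grp_sd["B"])

   # mapply(): evaluate dnorm() row by row with that row's mean and sd
   toy$normal_density <- mapply(
     function(x, m, s) dnorm(x, mean = m, sd = s),
     toy$value, toy$mean_i, toy$sd_i
   )

   head(toy)

With more than two groups, indexing the ``tapply()`` result by group name (for example ``grp_mean[as.character(toy$group)]``) is one way to avoid nested ``ifelse()`` calls.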