.. _worksheet1:

Worksheet 1: Exploring Data with R
==================================

.. admonition:: Learning Objectives 🎯
   :class: info

   • Load and explore datasets in R using basic commands
   • Distinguish between qualitative and quantitative variables
   • Classify variables by measurement scale (nominal, ordinal, interval, ratio)
   • Create and interpret univariate visualizations (histograms, boxplots)
   • Compute grouped statistics using functional programming
   • Build comparative visualizations for multivariate exploration

Introduction
------------

In this exercise, we will analyze a dataset from a study by Potvin, Lechowicz, and Tardif (1990), which examined how the grass species *Echinochloa crus-galli* responds to environmental changes. 

**Study Design:**

- **Sample Size:** 12 plants total (6 from Quebec, Canada; 6 from Mississippi, USA)
- **Treatment:** Half the plants from each location received overnight chilling; half remained unchilled
- **Measurements:** CO₂ uptake rates measured at 7 ambient CO₂ concentration levels per plant
- **Total Observations:** 84 (12 plants × 7 measurements)

This dataset provides an excellent opportunity to practice core statistical skills, including working with categorical and numerical data and creating visualizations such as histograms and boxplots.

.. note::
   **Citation:** Potvin, C., Lechowicz, M. J. and Tardif, S. (1990) "The statistical analysis of ecophysiological response curves obtained from experiments involving repeated measures", *Ecology*, 71, 1389–1400.

Part 1: Loading and Understanding the Dataset
----------------------------------------------

The ``CO2`` dataset ships with R in the ``datasets`` package, so no installation is needed.

.. code-block:: r

   # Load the dataset
   data("CO2", package = "datasets")

After running this command, an object named ``CO2`` should appear in your environment (visible in the top-right pane of RStudio) as a *promise*. A promise is a placeholder created by R's lazy-loading mechanism: the data are not read into memory until the object is first used.
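
As a minimal sketch of how the promise is resolved, any command that reads the object forces R to load it. The specific calls below are only one possible choice:

.. code-block:: r

   # Load CO2 from the datasets package (part of a standard R installation)
   data("CO2", package = "datasets")

   # Reading the object forces the promise and materializes the data frame
   class(CO2)     # classes of the loaded object
   head(CO2, 3)   # first three rows

After either call, RStudio's environment pane should list ``CO2`` as a regular data object rather than a promise.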

The ``help()`` command opens the documentation page for a function or dataset. To learn more about the dataset:

.. code-block:: r

   # View documentation
   help(CO2)

**Question 1:** Using the help documentation (visible in the bottom-right pane), list all variables in the dataset and specify their measurement units where applicable.


Part 2: Initial Data Exploration
---------------------------------

The ``View()`` command opens the dataset in a spreadsheet-like viewer:

.. code-block:: r

   # Open dataset in viewer
   View(CO2)

**Question 2:** Answer the following questions about the data:

a) How many observations (rows) are in the CO2 dataset?

b) Identify and classify each variable:
   
   - Which variables are **qualitative** (categorical)?
   - Which variables are **quantitative** (numerical)?

c) For each qualitative variable:
   
   - Specify whether it's **nominal** or **ordinal**
   - List all possible values (categories)

d) For each quantitative variable:
   
   - Specify whether it's **interval** or **ratio** scale
   - Provide brief justification (consider meaningful zero, type of measurement)
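
The viewer can be complemented by console commands that report the structure directly. The sketch below shows one way to do this. Note that how R *stores* a column (factor, ordered factor, numeric) is a hint about its measurement scale, not a verdict, so the classification still needs your own justification.

.. code-block:: r

   str(CO2)                                   # type and first values of each column
   nrow(CO2)                                  # number of observations
   sapply(CO2, function(col) class(col)[1])   # primary storage class per column
   levels(CO2$Type)                           # categories of a factor
   is.ordered(CO2$Plant)                      # TRUE if the factor is stored as ordered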


Part 3: Frequency Tables
------------------------

The ``table()`` function generates frequency counts for categorical variables:

.. code-block:: r

   # Example usage
   table(CO2$Type)
   table(CO2$Treatment)

**Question 3:** 

a) Run ``table()`` on each qualitative variable and report the frequency counts.

b) Run ``table()`` on the quantitative variables. Why might one quantitative variable show a specific pattern (repeated values or fixed intervals)? Based on the experimental design, hypothesize why this occurs.
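
Beyond one-way counts, ``table()`` also accepts two variables and returns a cross-tabulation. The sketch below is optional for Question 3, and the object name ``type_by_treatment`` is only illustrative:

.. code-block:: r

   # Cross-tabulate the two grouping factors
   type_by_treatment <- table(CO2$Type, CO2$Treatment)
   type_by_treatment

   addmargins(type_by_treatment)   # append row and column totals
   prop.table(table(CO2$Type))     # relative frequencies instead of counts

A cross-tabulation like this is a quick check of whether an experimental design is balanced across groups.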


Part 4: Univariate Analysis of Uptake
--------------------------------------

The ``uptake`` variable is our primary response variable, representing CO₂ uptake rates.

**Question 4:** Perform numerical and graphical univariate analysis of ``uptake``:

a) Compute and report statistics for central tendency and spread. Use proper notation (e.g., :math:`\bar{x}` for sample mean).

b) Create a histogram and a modified boxplot (one that marks outliers individually) for ``uptake``.

c) Describe the distribution based on your visualizations.

.. code-block:: r

   # Hint: Basic statistics
   mean(CO2$uptake)
   sd(CO2$uptake)
   summary(CO2$uptake)
   
   # Hint: Basic plots
   hist(CO2$uptake)
   boxplot(CO2$uptake)
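
The hints above can be extended with further measures of center and spread and with labeled plots. The following is a sketch, not a required solution; the bin count and axis text are arbitrary choices, and the measurement unit is deliberately left for you to add from the help page.

.. code-block:: r

   x <- CO2$uptake

   median(x)                          # sample median
   IQR(x)                             # interquartile range, Q3 - Q1
   range(x)                           # minimum and maximum
   quantile(x, c(0.25, 0.50, 0.75))   # quartiles (R's default quantile method)

   # Labeled histogram; 'breaks' is a suggestion that R may adjust
   hist(x, breaks = 12, main = "CO2 uptake", xlab = "Uptake rate")

   # range = 1.5 (the default) limits whiskers to 1.5 x IQR from the box,
   # so more extreme points are drawn individually
   boxplot(x, horizontal = TRUE, range = 1.5, xlab = "Uptake rate")

Because ``range = 1.5`` is already the default, a plain ``boxplot()`` call in base R produces a modified boxplot.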


Part 5: Grouped Statistics with tapply
---------------------------------------

R's functional-programming tools simplify group-wise operations. The ``tapply()`` function splits a vector into subsets defined by a categorical variable and applies a function to each subset.

**Question 5:** Using ``tapply()``, compute the mean and standard deviation of ``uptake`` by ``Type``:

.. code-block:: r

   # Compute grouped statistics
   uptake_mean_by_type <- tapply(CO2$uptake, CO2$Type, mean)
   uptake_sd_by_type <- tapply(CO2$uptake, CO2$Type, sd)

Report your results below.
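
The same pattern extends to more than one grouping variable and to user-defined functions. The sketch below groups by ``Treatment`` rather than ``Type``, and the name ``uptake_cv_by_treatment`` is illustrative:

.. code-block:: r

   # Mean uptake for every Type x Treatment combination (returns a matrix)
   tapply(CO2$uptake, list(CO2$Type, CO2$Treatment), mean)

   # Any function returning a single value works, including anonymous ones;
   # here, the coefficient of variation (sd divided by mean)
   uptake_cv_by_treatment <- tapply(CO2$uptake, CO2$Treatment,
                                    function(u) sd(u) / mean(u))

   # The result is named by group, so values can be looked up by level name
   uptake_cv_by_treatment["chilled"]

Looking up results by level name rather than by position becomes useful later, when group statistics have to be matched back to individual rows.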


Part 6: Comparative Visualization by Type
------------------------------------------

**Question 6:** Create a side-by-side boxplot using ``ggplot2``:

- Plot ``uptake`` on y-axis and ``Type`` on x-axis
- Label axes appropriately
- Add an informative title
- Describe differences in distributional characteristics between Quebec and Mississippi

.. code-block:: r

   library(ggplot2)
   
   # Your code here
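
As a starting pattern, here is a sketch of a labeled side-by-side boxplot grouped by ``Treatment`` instead of ``Type``. The object name ``treatment_boxplot`` and the label text are illustrative choices; adapt the grouping variable and wording for Question 6.

.. code-block:: r

   library(ggplot2)

   # A categorical variable on the x-axis yields one box per category
   treatment_boxplot <- ggplot(CO2, aes(x = Treatment, y = uptake)) +
     geom_boxplot() +
     labs(title = "CO2 uptake by treatment",
          x = "Treatment",
          y = "Uptake rate")

   print(treatment_boxplot)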


**Question 7:** Create comparative histograms with density curves:

a) Use ``tapply()`` to calculate mean and standard deviation for each ``Type``
b) Use ``ifelse()`` to compute normal density values conditionally
c) Create faceted histograms with:
   
   - Red kernel density curve
   - Blue normal density curve
   
d) Describe distributional differences between types
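
The steps in Question 7 can be sketched as follows, again grouped by ``Treatment`` so that the version by ``Type`` remains your own work. The names ``df``, ``grp_mean``, ``grp_sd``, and ``density_plot`` are illustrative, the bin count is arbitrary, and ``after_stat()`` assumes ggplot2 version 3.3.0 or later.

.. code-block:: r

   library(ggplot2)

   # Work on a plain data-frame copy so CO2 itself is left untouched
   df <- as.data.frame(CO2)

   grp_mean <- tapply(df$uptake, df$Treatment, mean)
   grp_sd   <- tapply(df$uptake, df$Treatment, sd)

   # ifelse() selects the matching group's parameters for every row
   df$normal_density <- ifelse(
     df$Treatment == "chilled",
     dnorm(df$uptake, mean = grp_mean["chilled"],    sd = grp_sd["chilled"]),
     dnorm(df$uptake, mean = grp_mean["nonchilled"], sd = grp_sd["nonchilled"])
   )

   density_plot <- ggplot(df, aes(x = uptake)) +
     # Density scale so the histogram and both curves share one y-axis
     geom_histogram(aes(y = after_stat(density)), bins = 10,
                    fill = "grey80", color = "black") +
     geom_density(color = "red") +                         # kernel density
     geom_line(aes(y = normal_density), color = "blue") +  # fitted normal
     facet_wrap(~ Treatment) +
     labs(x = "Uptake rate", y = "Density")

   print(density_plot)

The blue curve is evaluated only at the observed ``uptake`` values, so it is drawn as connected line segments rather than a perfectly smooth curve.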


Part 7: Exploring the Concentration Effect
-------------------------------------------

The relationship between ``uptake`` and ``conc`` (ambient CO₂ concentration) reflects plant response to varying CO₂ levels.

**Question 8:** Using ``ggplot2``, create a boxplot of ``uptake`` (y-axis) against ``conc`` (x-axis):

a) What issue do you observe?
b) Why doesn't this approach work as intended? (Hint: How is ``conc`` being treated?)


**Question 9:** Fix the issue by converting ``conc`` to a factor:

.. code-block:: r

   CO2$conc <- as.factor(CO2$conc)

Recreate the boxplot and describe distributional differences across concentration levels.
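
Overwriting ``CO2$conc`` means the column is no longer numeric for later calculations. One alternative, sketched here, is to convert only inside the plot call; the object names are illustrative.

.. code-block:: r

   library(ggplot2)

   # Convert inside aes() only; the conc column in CO2 is not modified
   conc_boxplot <- ggplot(CO2, aes(x = factor(conc), y = uptake)) +
     geom_boxplot() +
     labs(x = "Ambient CO2 concentration", y = "Uptake rate")

   print(conc_boxplot)

   # If conc has already been overwritten, recover the numbers via the labels.
   # Calling as.numeric() directly on a factor returns level codes (1, 2, ...),
   # not the original concentrations.
   conc_factor  <- as.factor(CO2$conc)
   conc_numeric <- as.numeric(as.character(conc_factor))

Either approach produces the same plot; the in-place conversion is simpler, while the ``aes()`` conversion keeps the dataset unchanged.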


Part 8: Advanced Visualization with Multiple Categories
--------------------------------------------------------

**Question 10:** Create faceted histograms for ``uptake`` by ``conc``:

.. admonition:: AI Assistance Guidelines 🤖
   :class: tip
   
   For this complex visualization task, you're encouraged to use your favorite AI assistant (ChatGPT, Claude, etc.) to help solve the problem. However, follow these guidelines:

   **Creating an Effective Pedagogical Prompt:**
   
   1. **Provide Context:** "I'm learning R and ggplot2 in a statistics course..."
   2. **Be Specific:** Include the exact requirements and constraints
   3. **Request Teaching:** Ask for explanations, not just code
   4. **Example prompt structure:**
      
      "I need to create a histogram showing uptake distributions across multiple CO2 concentration levels. I must use ggplot2 and cannot use dplyr or other tidyverse packages beyond ggplot2. I need to use mapply() to compute normal density values for multiple groups. Can you explain the approach step-by-step and provide commented code?"

   **Verification Checklist:**
   
   - ✓ Uses only base R functions (tapply, mapply) and ggplot2
   - ✓ No dplyr, tidyr, or other packages
   - ✓ Includes clear explanations of each step
   - ✓ Code is well-commented
   - ✓ Follows the specific requirements (facet_wrap with nrow=2, overlay curves)

Requirements:

- Use ``mapply()`` to compute normal density for multiple categories
- Create faceted histograms with ``facet_wrap(~ conc, nrow = 2)``
- Overlay red kernel density and blue normal density curves
- Describe distributional patterns
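
To illustrate the ``mapply()`` requirement without solving the question, the sketch below uses a hypothetical toy dataset. All values and names (``x``, ``group``, ``grp_mean``, ``grp_sd``) are made up for illustration.

.. code-block:: r

   # Hypothetical toy data: three observations from two groups
   x     <- c(10, 12, 30)
   group <- c("a", "a", "b")

   # Per-group parameters as named vectors, the shape tapply() returns
   grp_mean <- tapply(x, group, mean)
   grp_sd   <- c(a = 2, b = 5)   # made-up values, for illustration only

   # mapply() walks the three vectors in parallel:
   # call i receives x[i] plus the mean and sd of the group x[i] belongs to
   normal_density <- mapply(
     function(xi, m, s) dnorm(xi, mean = m, sd = s),
     x, grp_mean[group], grp_sd[group]
   )

   normal_density

Indexing the parameter vectors by group *name* expands them to one value per observation. With a numeric grouping variable, convert it with ``as.character()`` before indexing, because a numeric index would be read as a position rather than a name.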


Reference: Key Functions
------------------------

.. glossary::

   ifelse
      **Purpose:** Vectorized conditional function
      
      **Usage:** ``ifelse(condition, value_if_true, value_if_false)``
      
      **Why Used:** Straightforward for binary conditions
      
      **Alternative:** ``dplyr::case_when()`` for multiple conditions

   tapply
      **Purpose:** Apply function to grouped subsets
      
      **Usage:** ``tapply(vector, grouping_variable, function)``
      
      **Why Used:** Simple and efficient for grouped calculations
      
      **Alternative:** ``dplyr::group_by()`` + ``summarise()``

   mapply
      **Purpose:** Apply a function element-wise over several vectors in parallel
      
      **Usage:** ``mapply(function, arg1, arg2, ...)``
      
      **Why Used:** Flexible for calculations that need a different set of arguments for each observation
      
      **Alternative:** ``purrr::map2()`` or ``dplyr::mutate()``

Submission Guidelines
---------------------

1. Complete all questions in order
2. Include both code and written interpretations
3. Ensure all plots are properly labeled
4. Submit your write-up in class