.. _r_computer_assignments:

R / RStudio Guide and Function Reference
==================================================

Overview
-------------------------------------------------
This document is a comprehensive reference for the **primary R tools** used in STAT 350. It introduces R and RStudio, outlines the workflow we use in class, and provides an **enhanced reference** to the functions, libraries, and plotting layers that appear across the computer assignments and worksheets. The emphasis is on: loading data; validating and cleaning; visualizing; computing summaries; fitting models; **checking assumptions**; and performing inference with well-annotated code.

Use this as a companion to the assignment tutorials (linked below).

Quick Reference Table
-------------------------------------------------

.. csv-table:: Common Tasks Quick Reference
   :header: "Task", "Function(s)", "Example"
   :widths: 25, 25, 50
   :align: left

   "Import CSV", "``read.csv()``", "``d <- read.csv('data/AppRating.csv')``"
   "Check structure", "``str()``, ``head()``", "``str(d); head(d, 10)``"
   "Remove NAs", "``complete.cases()``", "``d_clean <- d[complete.cases(d), ]``"
   "Basic summary", "``summary()``", "``summary(d$score)``"
   "Create histogram", "``ggplot() + geom_histogram()``", "``ggplot(d, aes(x=score)) + geom_histogram()``"
   "One-sample t-test", "``t.test()``", "``t.test(d$score, mu = 3)``"
   "Two-sample t-test", "``t.test()``", "``t.test(score ~ platform, data = d)``"
   "One-way ANOVA", "``aov()``", "``fit <- aov(score ~ group, data = d)``"
   "Linear regression", "``lm()``", "``mod <- lm(rating ~ price, data = d)``"
   "Check residuals", "``plot()``", "``plot(mod, which = 1:2)``"

Quick Start: R / RStudio Setup
-------------------------------------------------

- Local Install (R/RStudio): `Local Install <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Local_Install.html>`_
- Scholar Access (cluster): `Scholar Access <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Scholar-Access.html>`_


Course Pipeline (At a Glance)
-------------------------------------------------

1. **Import**: ``read.csv`` → **inspect** with ``head``, ``str``, ``summary``.
2. **Validate & clean**: missing data (``is.na``, ``complete.cases``), types (``as.numeric``, ``factor``), quick checks (``length``, ``nrow``, ``unique``).
3. **Explore**: core summaries (``mean``, ``median``, ``sd``, ``quantile``, ``IQR``), plots (``ggplot2`` histograms/boxplots).
4. **Model**: one/two-sample t; ANOVA; SLR via ``lm``; compute p-values/intervals; **diagnostics** (residual plots, QQ).
5. **Report**: figures/tables (``ggplot2``, ``knitr::kable`` + ``kableExtra``), short text with context.

Packages / Libraries (Course Set)
-------------------------------------------------

- **ggplot2** — Grammar of Graphics. Histograms, boxplots, QQ, scatter + fit lines, ribbons for intervals.
- **grid**, **gridExtra** — Plot/table layout (e.g., arrange multiple graphics on a page).
- **kableExtra** — Styling for ``knitr::kable`` tables (borders, alignment).
- **knitr** — Report tooling; ``kable`` produces clean tables.
- **latex2exp** — ``TeX()`` for LaTeX-style math in plot labels.
- **magrittr** — Pipe operator ``%>%`` to chain steps.
- **stats** — Inference & models: ``aov``, ``anova``, ``lm``, ``t.test``, ``TukeyHSD``, distributions (``dnorm``, ``qnorm``, ``pt``, ``qt``), etc.
- **utils** — I/O like ``read.csv``, ``write.csv``.

.. tip::
   Install from CRAN if needed:
   ::
      install.packages(c("ggplot2","knitr","kableExtra","latex2exp","gridExtra"))

RStudio Orientation
-------------------------------------------------

- **Console** (executes statements), **Source/Editor** (scripts, Rmd), **Environment** (objects), **Plots/Help** (graphics, docs).
- Use **Projects** to lock working directory. Prefer relative paths and **never** hardcode machine-specific directories mid-script.
- Keep code modular. Consider separate chunks for import, cleaning, EDA, modeling, diagnostics, and reporting.

Getting Started with swirl
-------------------------------------------------

If you have never coded before, the computer assignments may immediately feel overwhelming. However, R is a language that is as rewarding as it is approachable, especially for statistical work. If you're new to coding, you'll find R's logical syntax is perfect for beginners.

Our tutorials are tailored to gradually build your understanding, making R less daunting. The tutorials are focused on the tools needed for the Computer Assignments. However, you may feel you need to get a little more comfortable in R before you attempt the tutorials and the Computer Assignments. An R package called **swirl** may be just what you need to get started.

Setup and Use swirl
~~~~~~~~~~~~~~~~~~~~~~~~

1. Setup R and RStudio (see links above)
2. Install swirl from the command line in R:

   .. code-block:: r

      install.packages('swirl')

3. Load swirl:

   .. code-block:: r

      library('swirl')

4. Run swirl:

   .. code-block:: r

      swirl()

This will start the command line interactive tutorials of swirl. You will be instructed on exercises on the command line.

**swirl Commands (available anytime during lessons):**

- ``skip()`` — Skip the current question
- ``play()`` — Experiment with R on your own (swirl ignores your actions)
- ``nxt()`` — Return swirl's attention after using play()
- ``bye()`` — Exit swirl (progress is saved)
- ``main()`` — Return to swirl's main menu
- ``info()`` — Display these options again

I recommend working through at least a few lessons to get familiarity with coding in R.

Alternative R Learning Resources
-------------------------------------------------

For curious students who want to explore R beyond the course requirements:

Base R
~~~~~~~~~~~~~~

- **CRAN "An Introduction to R"** (official manual): Canonical, up-to-date coverage of types, subsetting, vectorization, base graphics, and the modeling API. The authoritative source.
  
  https://cran.r-project.org/doc/manuals/r-release/R-intro.html

- **Deep R Programming** (free textbook): Modern, rigorous coverage of R semantics, performance, and numerical computing. Ideal after you're comfortable with basics.
  
  https://deepr.gagolewski.com/

tidyverse
~~~~~~~~~~~~~~

- **R for Data Science** (2e, free online): The authoritative entry point to tidy data workflows (readr, dplyr, tidyr, ggplot2, purrr).
  
  https://r4ds.hadley.nz

- **tidyverse.org**: Core package documentation and philosophy.
  
  https://www.tidyverse.org

R Markdown
~~~~~~~~~~~~~~

- **R Markdown: The Definitive Guide**: Complete reference for authoring and multi-format output.
  
  https://bookdown.org/yihui/rmarkdown/

- **R Markdown Cookbook**: Task-oriented tips that save hours in practice.
  
  https://bookdown.org/yihui/rmarkdown-cookbook/

Statistical Computing
~~~~~~~~~~~~~~~~~~~~~~

- **STAT 545** (UBC): Practice-first course on modern workflows with excellent materials.
  
  https://stat545.com

- **Modern Statistics for Modern Biology**: R-based computational statistics for life sciences.
  
  https://www.huber.embl.de/msmb/

Video Resources
~~~~~~~~~~~~~~~

- **R Programming Full Course for Beginners** (freeCodeCamp): Comprehensive 2-hour introduction covering basics through data analysis.
  
  https://www.youtube.com/watch?v=fDRa82lxzaU

- **R Programming Tutorial - Learn the Basics** (freeCodeCamp): Shorter focused tutorial on R fundamentals.
  
  https://www.youtube.com/watch?v=Q5g6lYUn6Q4

Assignment Tutorials (Links)
-------------------------------------------------

.. note::
   **Winter Session**: Computer Assignments 1 and 2 are combined into a single assignment. Subsequent assignments are renumbered:
   
   - CA1 & CA2 → Combined Assignment 1
   - CA3 → Assignment 2
   - CA4 → Assignment 3  
   - CA5 → Assignment 4
   - CA6 → Assignment 5

- **Computer Assignment 1 Tutorial** — Intro & basics:
  `CA1 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_1_Tutorial.html>`_
- **Computer Assignment 2 Tutorial** — Graphical & numerical summaries:
  `CA2 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_2_Graphical_and_Numerical_Summaries_Tutorial.html>`_
- **Computer Assignment 3 Tutorial** — Subsetting & basic hypothesis testing:
  `CA3 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_3_BasicHypothesisTesting_and_subsetting_Tutorial.html>`_
- **Computer Assignment 4 Tutorial** — Two-sample procedures:
  `CA4 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_4_TwoSampleProcedures_Tutorial.html>`_
- **Computer Assignment 5 Tutorial** — One-way ANOVA:
  `CA5 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_5_ANOVA_Tutorial.html>`_
- **Computer Assignment 6 Tutorial** — Simple linear regression:
  `CA6 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_6_SimpleLinearRegression_Tutorial.html>`_

Function Reference (Alphabetized within Category)
-------------------------------------------------

Data I/O & Housekeeping
~~~~~~~~~~~~~~~~~~~~~~~~

**getwd()**

   *Purpose*: Show current working directory  
   *Help*: Type ``?getwd`` in R console  
   *Syntax*: ``getwd()``  
   
   Example showing the working directory:
   
   .. code-block:: rconsole

      > getwd()
      [1] "/Users/username/STAT350"

**head(x, n = 6)**

   *Purpose*: Display first n rows of a data frame or elements of a vector  
   *Help*: Type ``?head`` in R console  
   *Syntax*: ``head(x, n = 6)``  
   *Common use*: Quickly inspect data after importing  
   
   Using built-in iris dataset:
   
   .. code-block:: rconsole

      > data(iris)
      > head(iris, 3)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
      1          5.1         3.5          1.4         0.2  setosa
      2          4.9         3.0          1.4         0.2  setosa
      3          4.7         3.2          1.3         0.2  setosa
   
   *See also*: ``tail()``, ``str()``, ``summary()``

**read.csv(file, header = TRUE, stringsAsFactors = FALSE, ...)**

   *Purpose*: Import CSV file into a data frame  
   *Help*: Type ``?read.csv`` in R console  
   *Key arguments*:
   
   - ``header``: First row contains column names (default TRUE)
   - ``stringsAsFactors``: Keep strings as characters (recommended FALSE)
   - ``na.strings``: Values to treat as NA (default "NA")
   
   *Common pitfalls*:
   
   - Using absolute paths (breaks on other computers)
   - Forgetting to check for proper import with ``str()``
   - Not handling special characters in column names
   
   Basic import:
   
   .. code-block:: rconsole

      > mydata <- read.csv("data/experiment.csv")
      > str(mydata)
      'data.frame':	150 obs. of  5 variables:
       $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
       $ group   : chr  "A" "B" "A" "C" ...
       $ measure1: num  23.5 19.2 25.1 22.8 20.3 ...
       $ measure2: num  45.2 41.8 46.7 44.3 42.1 ...
       $ outcome : chr  "success" "failure" "success" ...
   
   With options for handling missing data:
   
   .. code-block:: r

      mydata <- read.csv("data/experiment.csv", 
                        stringsAsFactors = FALSE,
                        na.strings = c("NA", "N/A", ""))
   
   Always validate after import:
   
   .. code-block:: rconsole

      > sum(is.na(mydata))
      [1] 12
   
   *See also*: ``read.table()``, ``write.csv()``, ``str()``

**setwd(path)**

   *Purpose*: Change working directory  
   *Help*: Type ``?setwd`` in R console  
   *Warning*: Avoid using in scripts; use RStudio Projects instead  
   
   Not recommended in scripts:
   
   .. code-block:: r

      # Bad practice - machine specific!
      # setwd("/Users/username/STAT350")
      
      # Better: Use RStudio Projects or relative paths

**write.csv(x, file, row.names = FALSE)**

   *Purpose*: Export data frame to CSV file  
   *Help*: Type ``?write.csv`` in R console  
   *Key arguments*:
   
   - ``row.names``: Include row numbers (usually FALSE)
   - ``na``: String for missing values (default "NA")
   
   Save cleaned data:
   
   .. code-block:: r

      write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)

Data Structures & Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**c(...)**

   *Purpose*: Combine values into a vector  
   *Help*: Type ``?c`` in R console  
   
   Creating different types of vectors:
   
   .. code-block:: rconsole

      > v1 <- c(1, 2, 3, 4, 5)
      > v2 <- c("iOS", "Android", "Windows")
      > v3 <- c(1, "two", 3)
      > class(v3)
      [1] "character"
   
   Note that mixed types get coerced to the most general type (character in this case).

**colnames(x), rownames(x)**

   *Purpose*: Get or set column/row names  
   *Help*: Type ``?colnames`` or ``?rownames`` in R console  
   
   Getting column names from built-in dataset:
   
   .. code-block:: rconsole

      > colnames(mtcars)
       [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
   
   Renaming columns:
   
   .. code-block:: r

      # Create a copy to modify
      mydata <- mtcars[1:5, 1:3]
      colnames(mydata) <- c("MPG", "Cylinders", "Displacement")
      
      # Or rename specific columns
      colnames(mydata)[1] <- "MilesPerGallon"

**data.frame(...)**

   *Purpose*: Create a data frame from vectors  
   *Help*: Type ``?data.frame`` in R console  
   *Key arguments*:
   
   - ``stringsAsFactors``: Auto-convert strings to factors (set FALSE)
   
   Creating a data frame from scratch:
   
   .. code-block:: rconsole

      > df <- data.frame(
      +   id = 1:5,
      +   group = c("A", "A", "B", "B", "B"),
      +   score = c(85, 90, 78, 82, 88),
      +   stringsAsFactors = FALSE
      + )
      > str(df)
      'data.frame':	5 obs. of  3 variables:
       $ id   : int  1 2 3 4 5
       $ group: chr  "A" "A" "B" "B" "B"
       $ score: num  85 90 78 82 88
   
   *See also*: ``tibble::tibble()`` for modern alternative

**factor(x, levels = ..., ordered = FALSE)**

   *Purpose*: Create categorical variables with defined levels  
   *Help*: Type ``?factor`` in R console  
   *When to use*: For categorical data in ANOVA, controlling plot order  
   
   Basic factor creation with automatic level detection:
   
   .. code-block:: rconsole

      > platform <- factor(c("iOS", "Android", "iOS", "Windows"))
      > levels(platform)
      [1] "Android" "iOS"     "Windows"
   
   Controlling level order for plots and analyses:
   
   .. code-block:: r

      platform <- factor(platform, 
                        levels = c("iOS", "Android", "Windows"))
   
   Creating ordered factors for ordinal data:
   
   .. code-block:: r

      satisfaction <- factor(c("Low", "High", "Medium", "High"),
                            levels = c("Low", "Medium", "High"),
                            ordered = TRUE)
   
   *Common pitfall*: Converting numeric factors back to numbers
   
   Wrong way (returns level indices):
   
   .. code-block:: rconsole

      > f <- factor(c("2", "4", "6"))
      > as.numeric(f)
      [1] 1 2 3
   
   Correct way:
   
   .. code-block:: rconsole

      > as.numeric(as.character(f))
      [1] 2 4 6
   
   *See also*: ``levels()``, ``relevel()``, ``droplevels()``

Data Wrangling & Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**apply(X, MARGIN, FUN)**

   *Purpose*: Apply function over matrix/array margins  
   *Help*: Type ``?apply`` in R console  
   *Key arguments*:
   
   - ``MARGIN``: 1 for rows, 2 for columns
   - ``FUN``: Function to apply
   
   Create a sample matrix:
   
   .. code-block:: rconsole

      > M <- matrix(1:12, nrow = 3)
      > M
           [,1] [,2] [,3] [,4]
      [1,]    1    4    7   10
      [2,]    2    5    8   11
      [3,]    3    6    9   12
   
   Row means (MARGIN = 1):
   
   .. code-block:: rconsole

      > apply(M, 1, mean)
      [1] 5.5 6.5 7.5
   
   Column sums (MARGIN = 2):
   
   .. code-block:: rconsole

      > apply(M, 2, sum)
      [1]  6 15 24 33
   
   Custom function to calculate range:
   
   .. code-block:: rconsole

      > apply(M, 2, function(x) max(x) - min(x))
      [1] 2 2 2 2
   
   *See also*: ``lapply()``, ``sapply()``, ``tapply()``

**as.numeric(x)**

   *Purpose*: Convert to numeric type  
   *Help*: Type ``?as.numeric`` in R console  
   *Common uses*: Fix data imported as characters, convert factors  
   
   Character to numeric conversion:
   
   .. code-block:: rconsole

      > x <- c("1.5", "2.3", "3.1")
      > as.numeric(x)
      [1] 1.5 2.3 3.1
   
   Non-numeric values produce NAs with a warning:
   
   .. code-block:: rconsole

      > y <- c("1", "2", "three")
      > as.numeric(y)
      [1]  1  2 NA
      Warning message:
      NAs introduced by coercion

**complete.cases(x)**

   *Purpose*: Identify rows with no missing values  
   *Help*: Type ``?complete.cases`` in R console  
   *Returns*: Logical vector (TRUE for complete rows)  
   
   Create data with missing values:
   
   .. code-block:: rconsole

      > df <- data.frame(
      +   x = c(1, 2, NA, 4),
      +   y = c(5, NA, 7, 8)
      + )
      > complete.cases(df)
      [1]  TRUE FALSE FALSE  TRUE
   
   Filter to complete cases only:
   
   .. code-block:: rconsole

      > df_clean <- df[complete.cases(df), ]
      > nrow(df_clean)
      [1] 2
   
   Count incomplete cases:
   
   .. code-block:: rconsole

      > sum(!complete.cases(df))
      [1] 2
   
   *See also*: ``is.na()``, ``na.omit()``

**ifelse(test, yes, no)**

   *Purpose*: Vectorized conditional operation  
   *Help*: Type ``?ifelse`` in R console  
   
   Basic pass/fail grading:
   
   .. code-block:: rconsole

      > score <- c(85, 72, 90, 68, 88)
      > grade <- ifelse(score >= 80, "Pass", "Fail")
      > grade
      [1] "Pass" "Fail" "Pass" "Fail" "Pass"
   
   Nested ifelse for letter grades:
   
   .. code-block:: r

      grade <- ifelse(score >= 90, "A",
                     ifelse(score >= 80, "B", 
                           ifelse(score >= 70, "C", "F")))
   
   Conditional calculations:
   
   .. code-block:: r

      df$adjusted <- ifelse(df$group == "control", 
                           df$score * 1.1, 
                           df$score)

**IQR(x, na.rm = FALSE)**

   *Purpose*: Calculate interquartile range (Q3 - Q1)  
   *Help*: Type ``?IQR`` in R console  
   
   With missing values:
   
   .. code-block:: rconsole

      > x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
      > IQR(x)
      [1] NA
      > IQR(x, na.rm = TRUE)
      [1] 4
   
   Compare with manual calculation from quantiles:
   
   .. code-block:: rconsole

      > quantile(x, c(0.25, 0.75), na.rm = TRUE)
      25% 75% 
        3   7
   
   *See also*: ``quantile()``, ``fivenum()``, ``boxplot.stats()``

**is.na(x)**

   *Purpose*: Test for missing values  
   *Help*: Type ``?is.na`` in R console  
   
   Identifying NAs:
   
   .. code-block:: rconsole

      > x <- c(1, NA, 3, NA, 5)
      > is.na(x)
      [1] FALSE  TRUE FALSE  TRUE FALSE
   
   Counting NAs:
   
   .. code-block:: rconsole

      > sum(is.na(x))
      [1] 2
   
   Finding positions of NAs:
   
   .. code-block:: rconsole

      > which(is.na(x))
      [1] 2 4
   
   Replacing NAs:
   
   .. code-block:: r

      x[is.na(x)] <- 0
   
   *See also*: ``complete.cases()``, ``na.omit()``, ``anyNA()``

**paste(..., sep = " ", collapse = NULL)**

   *Purpose*: Concatenate strings  
   *Help*: Type ``?paste`` in R console  
   *Key arguments*:
   
   - ``sep``: Separator between elements
   - ``collapse``: Collapse vector to single string
   
   Basic concatenation:
   
   .. code-block:: rconsole

      > paste("Mean:", 5.2)
      [1] "Mean: 5.2"
   
   Custom separator for dates:
   
   .. code-block:: rconsole

      > paste("2024", "01", "15", sep = "-")
      [1] "2024-01-15"
   
   Vectorized operation:
   
   .. code-block:: rconsole

      > paste("Group", 1:3)
      [1] "Group 1" "Group 2" "Group 3"
   
   Collapse to single string:
   
   .. code-block:: rconsole

      > paste(c("A", "B", "C"), collapse = ", ")
      [1] "A, B, C"
   
   No-space version with paste0:
   
   .. code-block:: rconsole

      > paste0("var", 1:3)
      [1] "var1" "var2" "var3"

**quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)**

   *Purpose*: Calculate sample quantiles  
   *Help*: Type ``?quantile`` in R console  
   
   Default quartiles:
   
   .. code-block:: rconsole

      > x <- c(1:10, NA)
      > quantile(x, na.rm = TRUE)
        0%  25%  50%  75% 100% 
       1.0  3.0  5.5  8.0 10.0
   
   Custom percentiles:
   
   .. code-block:: rconsole

      > quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
      10% 90% 
      1.9 9.1
   
   *See also*: ``median()``, ``IQR()``, ``fivenum()``

**sapply(X, FUN), lapply(X, FUN)**

   *Purpose*: Apply function over list/vector elements  
   *Help*: Type ``?sapply`` or ``?lapply`` in R console  
   *Differences*:
   
   - ``lapply``: Always returns a list
   - ``sapply``: Simplifies to vector/matrix if possible
   
   Create a list of vectors:
   
   .. code-block:: rconsole

      > lst <- list(a = 1:5, b = 6:10, c = 11:15)
   
   sapply simplifies to vector:
   
   .. code-block:: rconsole

      > sapply(lst, mean)
       a  b  c 
       3  8 13
   
   lapply returns list:
   
   .. code-block:: rconsole

      > lapply(lst, mean)
      $a
      [1] 3
      
      $b
      [1] 8
      
      $c
      [1] 13
   
   Check multiple columns for NAs:
   
   .. code-block:: r

      sapply(df, function(x) sum(is.na(x)))

**tapply(X, INDEX, FUN)**

   *Purpose*: Apply function by group  
   *Help*: Type ``?tapply`` in R console  
   
   Group means using built-in dataset:
   
   .. code-block:: rconsole

      > tapply(iris$Sepal.Length, iris$Species, mean)
          setosa versicolor  virginica 
           5.006      5.936      6.588
   
   Multiple statistics per group:
   
   .. code-block:: r

      tapply(iris$Sepal.Length, iris$Species, 
             function(x) c(mean = mean(x), sd = sd(x)))
   
   *See also*: ``aggregate()``, ``by()``

Descriptive Statistics & Correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**cor(x, y = NULL, use = "everything", method = "pearson")**

   *Purpose*: Calculate correlation coefficient  
   *Help*: Type ``?cor`` in R console  
   *Key arguments*:
   
   - ``use``: How to handle NAs ("complete.obs" drops them)
   - ``method``: "pearson" (default), "spearman", "kendall"
   
   Perfect positive correlation:
   
   .. code-block:: rconsole

      > x <- c(1, 2, 3, 4, 5)
      > y <- c(2, 4, 6, 8, 10)
      > cor(x, y)
      [1] 1
   
   Using built-in dataset:
   
   .. code-block:: rconsole

      > cor(mtcars$mpg, mtcars$wt)
      [1] -0.8676594
   
   Correlation matrix:
   
   .. code-block:: rconsole

      > cor(mtcars[,c("mpg", "wt", "hp")])
                 mpg         wt         hp
      mpg  1.0000000 -0.8676594 -0.7761684
      wt  -0.8676594  1.0000000  0.6587479
      hp  -0.7761684  0.6587479  1.0000000
   
   *Interpretation guide*:
   
   - -1 to -0.7: Strong negative
   - -0.7 to -0.3: Moderate negative  
   - -0.3 to 0.3: Weak/no linear relationship
   - 0.3 to 0.7: Moderate positive
   - 0.7 to 1: Strong positive
   
   Test for significance:
   
   .. code-block:: r

      cor.test(mtcars$mpg, mtcars$wt)
   
   *See also*: ``cor.test()``, ``cov()``

**mean(x, trim = 0, na.rm = FALSE)**

   *Purpose*: Calculate arithmetic mean  
   *Help*: Type ``?mean`` in R console  
   *Key arguments*:
   
   - ``na.rm``: Remove NAs before calculation
   - ``trim``: Fraction to trim from each end
   
   Basic usage with NAs:
   
   .. code-block:: rconsole

      > x <- c(1, 2, 3, 4, 5, NA)
      > mean(x)
      [1] NA
      > mean(x, na.rm = TRUE)
      [1] 3
   
   Trimmed mean (robust to outliers):
   
   .. code-block:: rconsole

      > y <- c(1, 2, 3, 4, 100)
      > mean(y)
      [1] 22
      > mean(y, trim = 0.2)
      [1] 3
   
   The trimmed mean removes 20% from each end before calculating.
   
   *See also*: ``median()``, ``summary()``

**median(x, na.rm = FALSE)**

   *Purpose*: Calculate median (middle value)  
   *Help*: Type ``?median`` in R console  
   
   Odd number of values:
   
   .. code-block:: rconsole

      > median(c(1, 3, 5))
      [1] 3
   
   Even number of values:
   
   .. code-block:: rconsole

      > median(c(1, 2, 3, 4))
      [1] 2.5
   
   Median is more robust than mean for skewed data:
   
   .. code-block:: rconsole

      > x <- c(20, 25, 30, 35, 200)
      > mean(x)
      [1] 62
      > median(x)
      [1] 30

**sd(x, na.rm = FALSE)**

   *Purpose*: Calculate sample standard deviation  
   *Help*: Type ``?sd`` in R console  
   *Note*: Uses n-1 denominator (sample SD)  
   
   .. code-block:: rconsole

      > x <- c(2, 4, 4, 4, 5, 5, 7, 9)
      > sd(x)
      [1] 2.13809
      > var(x)
      [1] 4.571429
   
   The variance equals the square of the standard deviation.
   
   Calculate coefficient of variation (relative variability):
   
   .. code-block:: rconsole

      > cv <- sd(x) / mean(x) * 100
      > paste("CV:", round(cv, 1), "%")
      [1] "CV: 42.8 %"
   
   *See also*: ``var()``, ``mad()`` (robust alternative)

Probability & Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Normal Distribution Functions**

   **dnorm(x, mean = 0, sd = 1)**
   
   *Purpose*: Normal probability density  
   *Help*: Type ``?dnorm`` in R console  
   *Use cases*: Overlay theoretical curves, calculate likelihoods  
   
   Standard normal density at x = 0:
   
   .. code-block:: rconsole

      > dnorm(0)
      [1] 0.3989423
   
   Custom parameters:
   
   .. code-block:: rconsole

      > dnorm(100, mean = 100, sd = 15)
      [1] 0.02659615
   
   Use in ggplot2 for overlaying normal curve:
   
   .. code-block:: r

      ggplot(data.frame(x = x), aes(x)) +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, 
                     args = list(mean = mean(x), sd = sd(x)),
                     color = "red")

   **pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)**
   
   *Purpose*: Normal cumulative distribution (probability)  
   *Help*: Type ``?pnorm`` in R console  
   
   Standard normal probabilities:
   
   .. code-block:: rconsole

      > pnorm(1.96)
      [1] 0.9750021
      > pnorm(1.96, lower.tail = FALSE)
      [1] 0.0249979
   
   Two-tailed p-value:
   
   .. code-block:: rconsole

      > 2 * pnorm(-abs(1.96))
      [1] 0.04999579

   **qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)**
   
   *Purpose*: Normal quantiles (inverse CDF)  
   *Help*: Type ``?qnorm`` in R console  
   
   Critical values:
   
   .. code-block:: rconsole

      > qnorm(0.975)
      [1] 1.959964
      > qnorm(0.95)
      [1] 1.644854
   
   The first gives the two-tailed 95% critical value, the second gives one-tailed.
   
   Calculate confidence interval:
   
   .. code-block:: r

      xbar <- 100; s <- 15; n <- 25
      xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)
   
   **rnorm(n, mean = 0, sd = 1)**
   
   *Purpose*: Generate random normal values  
   *Help*: Type ``?rnorm`` in R console  
   
   .. code-block:: r

      set.seed(123)
      x <- rnorm(100, mean = 50, sd = 10)
      
      # Visualize with ggplot2
      library(ggplot2)
      ggplot(data.frame(x = x), aes(x = x)) +
        geom_histogram(bins = 20) +
        ggtitle("Sample from Normal(50, 10)")

**t Distribution Functions**

   **pt(q, df, lower.tail = TRUE)**
   
   *Purpose*: Student's t cumulative distribution  
   *Help*: Type ``?pt`` in R console  
   *Use cases*: Calculate p-values for t-tests  
   
   One-sided p-value:
   
   .. code-block:: rconsole

      > t_stat <- 2.5
      > df <- 24
      > pt(t_stat, df, lower.tail = FALSE)
      [1] 0.009963466
   
   Two-sided p-value:
   
   .. code-block:: rconsole

      > 2 * pt(-abs(t_stat), df)
      [1] 0.01992693

   **qt(p, df, lower.tail = TRUE)**
   
   *Purpose*: Student's t quantiles  
   *Help*: Type ``?qt`` in R console  
   *Use cases*: Critical values for confidence intervals  
   
   95% CI critical value:
   
   .. code-block:: rconsole

      > qt(0.975, df = 24)
      [1] 2.063899
   
   Compare to normal:
   
   .. code-block:: rconsole

      > qnorm(0.975)
      [1] 1.959964
   
   The t critical value is larger due to heavier tails.
   
   *See also*: ``dt()``, ``rt()``

**Other Distributions**

   **dbinom(x, size, prob)**
   
   *Purpose*: Binomial probability mass function  
   *Help*: Type ``?dbinom`` in R console  
   
   Probability of exactly 3 successes in 10 trials with p = 0.5:
   
   .. code-block:: rconsole

      > dbinom(3, size = 10, prob = 0.5)
      [1] 0.1171875
   
   Full distribution:
   
   .. code-block:: r

      x <- 0:10
      p <- dbinom(x, size = 10, prob = 0.5)
      
      # Visualize with ggplot2
      library(ggplot2)
      ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
        geom_col() +
        scale_x_continuous(breaks = 0:10) +
        ggtitle("Binomial(10, 0.5) Distribution")

   **qtukey(p, nmeans, df)**
   
   *Purpose*: Tukey HSD critical values  
   *Help*: Type ``?qtukey`` in R console  
   
   Critical value for 3 groups with df = 27:
   
   .. code-block:: rconsole

      > qtukey(0.95, nmeans = 3, df = 27)
      [1] 3.506426

Simulation Functions
~~~~~~~~~~~~~~~~~~~~~~~~

**set.seed(seed)**

   *Purpose*: Set random number generator seed for reproducibility  
   *Help*: Type ``?set.seed`` in R console  
   
   Without seed (different each time):
   
   .. code-block:: rconsole

      > rnorm(3)
      [1]  0.3186301 -0.5817907  0.7145327
      > rnorm(3)
      [1] -0.8252594 -0.3598138 -0.0109303
   
   With seed (reproducible):
   
   .. code-block:: rconsole

      > set.seed(123)
      > rnorm(3)
      [1] -0.5604756 -0.2301775  1.5587083
      > set.seed(123)
      > rnorm(3)
      [1] -0.5604756 -0.2301775  1.5587083

**Random Generation Functions**

   All random generation functions take ``n`` (number of values) as the first argument.
   
   .. code-block:: r

      set.seed(1)
      rnorm(5, mean = 100, sd = 15)    # Normal
      runif(5, min = 0, max = 10)      # Uniform
      rexp(5, rate = 2)                # Exponential
   
   For simulation studies:
   
   .. code-block:: r

      n_sims <- 1000
      means <- replicate(n_sims, mean(rnorm(30)))
      
      # Visualize sampling distribution with ggplot2
      library(ggplot2)
      ggplot(data.frame(means = means), aes(x = means)) +
        geom_histogram(bins = 30) +
        ggtitle("Sampling Distribution of Mean")

Inference Functions
~~~~~~~~~~~~~~~~~~~~~~~~

**t.test(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)**

   *Purpose*: One and two sample t-tests  
   *Help*: Type ``?t.test`` in R console  
   *When to use*: Compare means between 1-2 groups  
   *Key arguments*:
   
   - ``alternative``: "two.sided", "less", "greater"
   - ``mu``: Null hypothesis value (one-sample) 
   - ``paired``: TRUE for matched pairs
   - ``var.equal``: TRUE assumes equal variances
   
   *Common pitfalls*:
   
   - Using paired = TRUE with independent samples
   - Not checking normality assumptions
   - Forgetting var.equal = FALSE for unequal variances
   
   One-sample test:
   
   .. code-block:: rconsole

      > x <- c(98, 102, 101, 97, 103, 100, 99, 98, 104, 101)
      > t.test(x, mu = 100)
      
      	One Sample t-test
      
      data:  x
      t = 0.63246, df = 9, p-value = 0.5425
      alternative hypothesis: true mean is not equal to 100
      95 percent confidence interval:
       98.41533 102.18467
      sample estimates:
      mean of x 
          100.3
   
   Two-sample independent test using built-in data:
   
   .. code-block:: rconsole

      > data(sleep)
      > t.test(extra ~ group, data = sleep, var.equal = FALSE)
      
      	Welch Two Sample t-test
      
      data:  extra by group
      t = -1.8608, df = 17.776, p-value = 0.07939
      alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
      95 percent confidence interval:
       -3.3654832  0.2054832
      sample estimates:
      mean in group 1 mean in group 2 
                 0.75            2.33
   
   **Paired t-test - Three equivalent approaches:**
   
   Method 1 - Using paired = TRUE with two vectors:
   
   .. code-block:: r

      before <- c(85, 88, 90, 92, 87)
      after  <- c(88, 90, 91, 94, 89)
      t.test(before, after, paired = TRUE)
   
   Method 2 - Convert to one-sample test of differences:
   
   .. code-block:: r

      differences <- after - before
      t.test(differences, mu = 0)
   
   Method 3 - Formula interface with paired data:
   
   .. code-block:: rconsole

      > # Using sleep dataset (paired by ID)
      > t.test(Pair(extra[group==1], extra[group==2]) ~ 1, data = sleep)
      
      	Paired t-test
      
      data:  Pair(extra[group == 1], extra[group == 2])
      t = -4.0621, df = 9, p-value = 0.002833
      alternative hypothesis: true mean difference is not equal to 0
      95 percent confidence interval:
       -2.4598858 -0.7001142
      sample estimates:
      mean difference 
                -1.58
   
   Extract components from results:
   
   .. code-block:: rconsole

      > result <- t.test(x, mu = 100)
      > result$p.value
      [1] 0.5424805
      > result$conf.int
      [1]  98.41533 102.18467
      attr(,"conf.level")
      [1] 0.95
      > result$estimate
      mean of x 
          100.3
   
   **Checking assumptions**:
   
   Normality check:
   
   .. code-block:: r

      # Shapiro-Wilk test
      shapiro.test(x)
      
      # Visual check with ggplot2
      library(ggplot2)
      ggplot(data.frame(x = x), aes(sample = x)) +
        stat_qq() +
        stat_qq_line()
   
   Equal variances check:
   
   .. code-block:: r

      # F-test
      var.test(x, y)
      
      # Visual check
      ggplot(data.frame(value = c(x, y), 
                       group = rep(c("X", "Y"), c(length(x), length(y)))),
             aes(x = group, y = value)) +
        geom_boxplot()
   
   *See also*: ``wilcox.test()`` (non-parametric alternative)

**aov(formula, data)**

   *Purpose*: Analysis of variance (ANOVA)  
   *Help*: Type ``?aov`` in R console  
   *When to use*: Compare means across 3+ groups  
   
   One-way ANOVA:
   
   .. code-block:: rconsole

      > fit <- aov(score ~ group, data = df)
      > summary(fit)
                  Df Sum Sq Mean Sq F value Pr(>F)   
      group        2  520.3  260.15   8.214 0.0019 **
      Residuals   27  855.2   31.67                  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Check assumptions with diagnostic plots:
   
   .. code-block:: r

      plot(fit, which = 1)  # Residuals vs Fitted
      plot(fit, which = 2)  # Normal Q-Q
   
   Test for equal variances:
   
   .. code-block:: r

      bartlett.test(score ~ group, data = df)
   
   Post-hoc comparisons:
   
   .. code-block:: r

      TukeyHSD(fit)
      plot(TukeyHSD(fit))
   
   *See also*: ``anova()``, ``TukeyHSD()``, ``kruskal.test()``

**TukeyHSD(x, conf.level = 0.95)**

   *Purpose*: Tukey's honest significant differences  
   *Help*: Type ``?TukeyHSD`` in R console  
   *When to use*: After significant ANOVA for pairwise comparisons  
   
   Following ANOVA:
   
   .. code-block:: rconsole

      > fit <- aov(score ~ group, data = df)
      > tukey_result <- TukeyHSD(fit)
      > tukey_result
        Tukey multiple comparisons of means
          95% family-wise confidence level
      
      Fit: aov(formula = score ~ group, data = df)
      
      $group
              diff       lwr      upr     p adj
      B-A  -5.333 -12.20271 1.536038 0.1518711
      C-A   3.167  -3.70271 10.036038 0.5019442
      C-B   8.500   1.63062 15.369376 0.0125506
   
   Visual display:
   
   .. code-block:: r

      plot(tukey_result)
   
   Confidence intervals not containing 0 indicate significant differences.

**lm(formula, data)**

   *Purpose*: Fit linear models  
   *Help*: Type ``?lm`` in R console  
   *When to use*: Regression analysis, ANOVA via regression  
   
   Simple linear regression:
   
   .. code-block:: rconsole

      > mod <- lm(rating ~ price, data = df)
      > summary(mod)
      
      Call:
      lm(formula = rating ~ price, data = df)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.83450 -0.83450 -0.03077  0.78231  2.70308 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  2.1234     0.2145   9.897 2.13e-10 ***
      price        0.5432     0.0532  10.211 9.77e-11 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 1.234 on 28 degrees of freedom
      Multiple R-squared:  0.7882,	Adjusted R-squared:  0.7806 
      F-statistic: 104.3 on 1 and 28 DF,  p-value: 9.77e-11
   
   Extract components:
   
   .. code-block:: r

      coef(mod)           # Coefficients
      confint(mod)        # CI for coefficients
      residuals(mod)      # Residuals
      fitted(mod)         # Fitted values
   
   Diagnostic plots:
   
   .. code-block:: r

      plot(mod)  # Produces 4 diagnostic plots
   
   ANOVA table:
   
   .. code-block:: r

      anova(mod)
   
   *See also*: ``summary.lm()``, ``predict.lm()``, ``plot.lm()``

**predict(object, newdata, interval = "none", level = 0.95)**

   *Purpose*: Predictions from model fits  
   *Help*: Type ``?predict.lm`` in R console  
   *Key arguments*:
   
   - ``interval``: "confidence" (mean response) or "prediction" (new obs)
   - ``level``: Confidence level
   
   *Critical distinction*:
   
   - **Confidence interval**: Uncertainty about mean response
   - **Prediction interval**: Uncertainty about individual response (wider)
   
   Fit model first:
   
   .. code-block:: r

      mod <- lm(rating ~ price, data = df)
   
   Point prediction:
   
   .. code-block:: rconsole

      > new_price <- data.frame(price = 50)
      > predict(mod, new_price)
             1 
      29.28571
   
   Confidence interval (mean rating at price = 50):
   
   .. code-block:: rconsole

      > predict(mod, new_price, interval = "confidence")
             fit      lwr      upr
      1 29.28571 28.47933 30.09209
   
   Prediction interval (individual rating at price = 50):
   
   .. code-block:: rconsole

      > predict(mod, new_price, interval = "prediction")
             fit      lwr      upr
      1 29.28571 26.89414 31.67728
   
   Note that prediction intervals are always wider than confidence intervals.
   
   Multiple predictions for plotting:
   
   .. code-block:: r

      new_prices <- data.frame(
        price = seq(min(df$price), max(df$price), length = 100)
      )
      ci <- predict(mod, new_prices, interval = "confidence")
      pi <- predict(mod, new_prices, interval = "prediction")
   
   Plot with bands:
   
   .. code-block:: r

      plot(rating ~ price, data = df)
      abline(mod)
      lines(new_prices$price, ci[,"lwr"], lty = 2, col = "blue")
      lines(new_prices$price, ci[,"upr"], lty = 2, col = "blue")
      lines(new_prices$price, pi[,"lwr"], lty = 2, col = "red")
      lines(new_prices$price, pi[,"upr"], lty = 2, col = "red")
   
   *See also*: ``fitted()``, ``residuals()``

**Diagnostic Plots for Assumptions**

   For t-test (normality of residuals):
   
   .. code-block:: r

      library(ggplot2)
      x <- residuals(lm(score ~ group, data = df))
      
      # Q-Q plot
      ggplot(data.frame(sample = x), aes(sample = sample)) +
        stat_qq() +
        stat_qq_line() +
        ggtitle("Normal Q-Q Plot of Residuals")
      
      # Formal test
      shapiro.test(x)
   
   For ANOVA:
   
   .. code-block:: r

      fit <- aov(score ~ group, data = df)
      
      # Residuals vs Fitted
      ggplot(data.frame(fitted = fitted(fit),
                       residuals = residuals(fit)),
             aes(x = fitted, y = residuals)) +
        geom_point() +
        geom_hline(yintercept = 0, linetype = "dashed") +
        ggtitle("Residuals vs Fitted")
      
      # Q-Q plot of residuals  
      ggplot(data.frame(sample = residuals(fit)), 
             aes(sample = sample)) +
        stat_qq() +
        stat_qq_line() +
        ggtitle("Normal Q-Q")
   
   For regression (full diagnostic panel):
   
   .. code-block:: r

      mod <- lm(rating ~ price, data = df)
      
      # Prepare data for all plots
      diag_data <- data.frame(
        fitted = fitted(mod),
        residuals = residuals(mod),
        std_resid = rstandard(mod),
        sqrt_std_resid = sqrt(abs(rstandard(mod))),
        cooks = cooks.distance(mod),
        leverage = hatvalues(mod)
      )
      
      # Create four diagnostic plots
      library(gridExtra)
      
      p1 <- ggplot(diag_data, aes(x = fitted, y = residuals)) +
        geom_point() +
        geom_hline(yintercept = 0, linetype = "dashed") +
        geom_smooth(se = FALSE) +
        ggtitle("Residuals vs Fitted")
      
      p2 <- ggplot(diag_data, aes(sample = std_resid)) +
        stat_qq() +
        stat_qq_line() +
        ggtitle("Normal Q-Q")
      
      p3 <- ggplot(diag_data, aes(x = fitted, y = sqrt_std_resid)) +
        geom_point() +
        geom_smooth(se = FALSE) +
        ggtitle("Scale-Location")
      
      p4 <- ggplot(diag_data, aes(x = leverage, y = std_resid)) +
        geom_point() +
        geom_smooth(se = FALSE) +
        ggtitle("Residuals vs Leverage")
      
      grid.arrange(p1, p2, p3, p4, ncol = 2)
   
   Pattern interpretation:
   
   - Random scatter: Good
   - Funnel shape: Heteroscedasticity
   - Curve: Non-linearity
   - Outliers: Influential points

Graphics (ggplot2)
~~~~~~~~~~~~~~~~~~~~~~~~

.. important::
   **This course uses ggplot2 exclusively for all plotting and visualization.** While base R has plotting functions like plot(), hist(), and boxplot(), we will only use ggplot2's grammar of graphics approach. All assignments and exams will expect ggplot2 syntax for graphs.

Core Components
~~~~~~~~~~~~~~~~

**ggplot(data, aes(...))**

   *Purpose*: Initialize a ggplot object  
   *Help*: Type ``?ggplot`` in R console  
   *Structure*: Data + aesthetic mappings + layers  
   
   Basic setup creates blank plot with axes:
   
   .. code-block:: r

      library(ggplot2)
      ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
   
   Add layers with +:
   
   .. code-block:: r

      ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
        geom_point()
   
   Save for reuse:
   
   .. code-block:: r

      p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))
      p + geom_boxplot()
      p + geom_violin()

**aes(...)**

   *Purpose*: Map variables to visual properties  
   *Help*: Type ``?aes`` in R console  
   *Common aesthetics*: x, y, color, fill, shape, size, alpha  
   
   Position aesthetics:
   
   .. code-block:: r

      aes(x = Sepal.Length, y = Sepal.Width)
   
   Color by group:
   
   .. code-block:: r

      aes(x = Sepal.Length, y = Sepal.Width, color = Species)
   
   Multiple aesthetics:
   
   .. code-block:: r

      aes(x = wt, y = mpg, color = factor(cyl), size = hp)
   
   Can set in ggplot() or individual geoms:
   
   .. code-block:: r

      ggplot(iris) +
        geom_point(aes(x = Sepal.Length, y = Sepal.Width))

Geoms (Geometric Objects)
~~~~~~~~~~~~~~~~~~~~~~~~~~

**geom_histogram(binwidth, bins, aes(y = after_stat(...)))**

   *Purpose*: Create histograms  
   *Help*: Type ``?geom_histogram`` in R console  
   *Key arguments*:
   
   - ``bins``: Number of bins (default 30)
   - ``binwidth``: Width of bins
   - ``aes(y = after_stat(density))``: Scale to density
   
   .. important::
      **In this course, all histograms must include:**
      
      1. Density scaling using ``aes(y = after_stat(density))``
      2. A kernel density curve (red) using ``geom_density()``
      3. A normal curve (blue) using ``stat_function()``
      
      This allows visual comparison of the data distribution with theoretical distributions.
   
   Standard histogram format for this course:
   
   .. code-block:: r

      # Calculate mean and sd first
      xbar <- mean(data$variable)
      s <- sd(data$variable)
      
      # Create histogram with required overlays
      ggplot(data, aes(x = variable)) +
        geom_histogram(aes(y = after_stat(density)), 
                      bins = 24, fill = "grey", col = "black") +
        geom_density(col = "red", linewidth = 1) +
        stat_function(fun = dnorm, 
                     args = list(mean = xbar, sd = s), 
                     col = "blue", linewidth = 1) +
        ggtitle("Histogram with Density Curves")
   
   Adding mean and median lines (often required):
   
   .. code-block:: r

      # Calculate statistics
      xbar <- mean(data$variable)
      xtilde <- median(data$variable)
      s <- sd(data$variable)
      
      # Full histogram with all required elements
      ggplot(data, aes(x = variable)) +
        geom_histogram(aes(y = after_stat(density)), 
                      bins = 24, fill = "grey", col = "black") +
        geom_vline(xintercept = xbar, color = "purple", 
                   linetype = "dotted", linewidth = 1) +
        geom_vline(xintercept = xtilde, color = "orange", 
                   linetype = "dashed", linewidth = 1) +
        geom_density(col = "red", linewidth = 1) +
        stat_function(fun = dnorm, 
                     args = list(mean = xbar, sd = s), 
                     col = "blue", linewidth = 1) +
        ggtitle("Complete Histogram for Course")
   
   By group:
   
   .. code-block:: r

      ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
        geom_histogram(aes(y = after_stat(density)), 
                      position = "dodge", bins = 15) +
        facet_wrap(~ Species) +
        ggtitle("Histograms by Group")


**geom_boxplot(outlier.shape, varwidth)**

   *Purpose*: Box and whisker plots  
   *Help*: Type ``?geom_boxplot`` in R console  
   
   Basic boxplot:
   
   .. code-block:: r

      ggplot(iris, aes(x = Species, y = Sepal.Length)) +
        geom_boxplot()
   
   Horizontal orientation:
   
   .. code-block:: r

      ggplot(iris, aes(x = Species, y = Sepal.Length)) +
        geom_boxplot() +
        coord_flip()
   
   With individual points:
   
   .. code-block:: r

      ggplot(iris, aes(x = Species, y = Sepal.Length)) +
        geom_boxplot(outlier.shape = NA) +
        geom_jitter(width = 0.2, alpha = 0.5)
   
   Variable width by sample size:
   
   .. code-block:: r

      ggplot(iris, aes(x = Species, y = Sepal.Length)) +
        geom_boxplot(varwidth = TRUE)

**geom_point(), geom_line()**

   *Purpose*: Scatterplots and line plots  
   *Help*: Type ``?geom_point`` or ``?geom_line``  
   
   Basic scatterplot:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = wt, y = mpg)) +
        geom_point()
   
   Customize points:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
        geom_point(size = 3, alpha = 0.7)
   
   Add regression line:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = wt, y = mpg)) +
        geom_point() +
        geom_smooth(method = "lm", se = FALSE, color = "red")
   
   Time series (using built-in dataset):
   
   .. code-block:: r

      # Convert to time series format
      data(AirPassengers)
      ts_data <- data.frame(
        date = as.Date(time(AirPassengers)),
        value = as.numeric(AirPassengers)
      )
      
      ggplot(ts_data, aes(x = date, y = value)) +
        geom_line() +
        geom_point()

**geom_smooth(method, formula, se)**

   *Purpose*: Add smoothed conditional means  
   *Help*: Type ``?geom_smooth`` in R console  
   *Key arguments*:
   
   - ``method``: "lm", "loess", "gam"
   - ``se``: Show confidence band (TRUE/FALSE)
   
   Linear regression with confidence band:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = wt, y = mpg)) +
        geom_point() +
        geom_smooth(method = "lm")
   
   Without confidence band:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = wt, y = mpg)) +
        geom_point() +
        geom_smooth(method = "lm", se = FALSE)
   
   By group:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
        geom_point() +
        geom_smooth(method = "lm")


Faceted histograms with required density overlays and smooth kernel:

.. code-block:: r

   # Calculate group-specific statistics
   xbar <- tapply(iris$Sepal.Length, iris$Species, mean)
   s <- tapply(iris$Sepal.Length, iris$Species, sd)
   
   # Add normal density values to the dataset
   iris$normal.density <- ifelse(iris$Species == "setosa",
                                dnorm(iris$Sepal.Length, xbar["setosa"], s["setosa"]),
                                ifelse(iris$Species == "versicolor",
                                      dnorm(iris$Sepal.Length, xbar["versicolor"], s["versicolor"]),
                                      dnorm(iris$Sepal.Length, xbar["virginica"], s["virginica"])))
   
   # Create faceted histogram with overlays
   ggplot(iris, aes(x = Sepal.Length)) + 
     geom_histogram(aes(y = after_stat(density)), 
                   bins = 15, fill = "grey", col = "black") + 
     geom_density(col = "red", linewidth = 1) + 
     geom_line(aes(y = normal.density), col = "blue", linewidth = 1) + 
     facet_wrap(~ Species) +
     ggtitle("Distribution of Sepal Length by Species")
   
   # For datasets with only two groups (like CO2), the approach is simpler:
   
   .. code-block:: r
   
   # Two-group example using CO2 dataset
   xbar <- tapply(CO2$uptake, CO2$Type, mean)
   s <- tapply(CO2$uptake, CO2$Type, sd)
   
   CO2$normal.density <- ifelse(CO2$Type == "Quebec", 
                               dnorm(CO2$uptake, xbar["Quebec"], s["Quebec"]), 
                               dnorm(CO2$uptake, xbar["Mississippi"], s["Mississippi"]))
   
   ggplot(CO2, aes(x = uptake)) + 
     geom_histogram(aes(y = after_stat(density)), 
                   bins = 20, fill = "grey", col = "black") + 
     geom_density(col = "red", linewidth = 1) + 
     geom_line(aes(y = normal.density), col = "blue", linewidth = 1) + 
     facet_grid(. ~ Type) +
     ggtitle("Distribution of Uptake by Location Type")


**stat_qq(), stat_qq_line()**

   *Purpose*: Q-Q plots for normality assessment  
   *Help*: Type ``?stat_qq`` in R console  
   
   Basic Q-Q plot:
   
   .. code-block:: r

      ggplot(data.frame(sample = rnorm(100)), aes(sample = sample)) +
        stat_qq() +
        stat_qq_line()
   
   For residuals:
   
   .. code-block:: r

      mod <- lm(mpg ~ wt, data = mtcars)
      res_df <- data.frame(residuals = residuals(mod))
      
      ggplot(res_df, aes(sample = residuals)) +
        stat_qq() +
        stat_qq_line(color = "red") +
        labs(title = "Normal Q-Q Plot of Residuals")

**geom_ribbon(aes(ymin, ymax))**

   *Purpose*: Area plots with upper/lower bounds  
   *Help*: Type ``?geom_ribbon`` in R console  
   *Use case*: Custom confidence/prediction bands  
   
   Setup for regression bands:
   
   .. code-block:: r

      mod <- lm(mpg ~ wt, data = mtcars)
      new_data <- data.frame(wt = seq(min(mtcars$wt), 
                                      max(mtcars$wt), 
                                      length = 100))
      
      pred_ci <- predict(mod, new_data, interval = "confidence")
      pred_pi <- predict(mod, new_data, interval = "prediction")
      
      plot_data <- cbind(new_data, 
                        ci_fit = pred_ci[,"fit"],
                        ci_lwr = pred_ci[,"lwr"],
                        ci_upr = pred_ci[,"upr"],
                        pi_lwr = pred_pi[,"lwr"],
                        pi_upr = pred_pi[,"upr"])
   
   Plot with bands:
   
   .. code-block:: r

      ggplot(plot_data, aes(x = wt)) +
        geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr), 
                   fill = "grey90") +
        geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr), 
                   fill = "grey70") +
        geom_line(aes(y = ci_fit), color = "blue", size = 1) +
        geom_point(data = mtcars, aes(y = mpg)) +
        labs(title = "Regression with CI and PI bands")

Categorical Data Visualization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**geom_bar(), geom_col()**

   *Purpose*: Bar charts for categorical data  
   *Help*: Type ``?geom_bar`` or ``?geom_col`` in R console  
   
   - ``geom_bar()``: Counts categories (like table())
   - ``geom_col()``: Uses pre-computed values
   
   Count frequencies with geom_bar():
   
   .. code-block:: r

      # Frequency bar chart
      ggplot(mtcars, aes(x = factor(cyl))) +
        geom_bar() +
        xlab("Number of Cylinders") +
        ylab("Count") +
        ggtitle("Distribution of Cylinder Counts")
   
   Pre-computed values with geom_col():
   
   .. code-block:: r

      # Create summary data
      cyl_means <- aggregate(mpg ~ cyl, data = mtcars, mean)
      
      ggplot(cyl_means, aes(x = factor(cyl), y = mpg)) +
        geom_col(fill = "steelblue") +
        xlab("Number of Cylinders") +
        ylab("Mean MPG") +
        ggtitle("Average MPG by Cylinder Count")
   
   Grouped bar chart:
   
   .. code-block:: r

      ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
        geom_bar(position = "dodge") +
        xlab("Cylinders") +
        labs(fill = "Transmission") +
        scale_fill_discrete(labels = c("Automatic", "Manual"))
   
   Stacked bar chart (proportions):
   
   .. code-block:: r

      ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
        geom_bar(position = "fill") +
        ylab("Proportion") +
        labs(fill = "Transmission")
   
   .. note::
      For pie charts, while possible in ggplot2 using ``coord_polar()``, they are generally discouraged in statistical visualization. Bar charts are preferred for comparing categorical proportions.

Plot Customization
~~~~~~~~~~~~~~~~~~~

**Labels and Titles**

   All labels at once:
   
   .. code-block:: r

      p + labs(title = "Main Title",
              subtitle = "Subtitle",
              x = "X-axis label",
              y = "Y-axis label",
              caption = "Data source: ...")
   
   Individual functions:
   
   .. code-block:: r

      p + ggtitle("Main Title") +
          xlab("X-axis") +
          ylab("Y-axis")
   
   Math expressions:
   
   .. code-block:: r

      library(latex2exp)
      p + xlab(TeX("$\\beta_1$")) +
          ylab(TeX("$\\hat{y}$"))

**Themes**

   Built-in themes:
   
   .. code-block:: r

      p + theme_minimal()
      p + theme_classic()
      p + theme_bw()
   
   Customize elements:
   
   .. code-block:: r

      p + theme(
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        plot.title = element_text(size = 16, hjust = 0.5),
        legend.position = "bottom"
      )

**Faceting**

   Separate panels by variable:
   
   .. code-block:: r

      ggplot(df, aes(x = score)) +
        geom_histogram() +
        facet_wrap(~ group)
   
   Grid layout:
   
   .. code-block:: r

      ggplot(df, aes(x = score)) +
        geom_histogram() +
        facet_grid(platform ~ group)

Tables & Reporting
~~~~~~~~~~~~~~~~~~~~~~~~

**knitr::kable(x, caption, digits, format)**

   *Purpose*: Create formatted tables  
   *Help*: Type ``?knitr::kable`` in R console  
   
   Basic table using built-in data:
   
   .. code-block:: r

      library(knitr)
      library(kableExtra)
      
      # Summary statistics by species
      summary_table <- aggregate(Sepal.Length ~ Species, data = iris, 
                                FUN = function(x) c(mean = mean(x), 
                                                   sd = sd(x),
                                                   n = length(x)))
      kable(summary_table, 
            caption = "Summary Statistics by Species",
            digits = 2)
   
   Enhanced styling:
   
   .. code-block:: r

      kable(summary_table, caption = "Iris Summary by Species") %>%
        kable_styling(bootstrap_options = c("striped", "hover"),
                     full_width = FALSE) %>%
        column_spec(1, bold = TRUE)

**gridExtra::grid.arrange(...)**

   *Purpose*: Arrange multiple plots  
   *Help*: Type ``?gridExtra::grid.arrange``  
   
   Create and arrange plots:
   
   .. code-block:: r

      library(gridExtra)
      
      p1 <- ggplot(iris, aes(x = Sepal.Length)) + 
            geom_histogram(bins = 20) + 
            ggtitle("Distribution of Sepal Length")
      
      p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + 
            geom_boxplot() + 
            ggtitle("Sepal Length by Species")
      
      grid.arrange(p1, p2, ncol = 2)
   
   Different layouts:
   
   .. code-block:: r

      grid.arrange(p1, p2, p1, p2, 
                  ncol = 2, nrow = 2,
                  top = "Four Panel Display")

Best Practices & Common Pitfalls
=================================

Data Import & Validation
~~~~~~~~~~~~~~~~~~~~~~~~

Standard validation workflow:

.. code-block:: r

   d <- read.csv("data/file.csv")
   
   str(d)                              # Structure check
   head(d); tail(d)                    # First/last rows
   summary(d)                          # Summary statistics
   sapply(d, function(x) sum(is.na(x))) # Missing values
   sapply(d, class)                    # Data types

Statistical Assumptions
~~~~~~~~~~~~~~~~~~~~~~~

**Before t-tests/ANOVA:**

.. code-block:: r

   # Normality
   hist(x)
   qqnorm(x); qqline(x)
   shapiro.test(x)
   
   # Equal variances
   var.test(x, y)
   boxplot(score ~ group, data = df)

**Before regression:**

.. code-block:: r

   mod <- lm(y ~ x, data = df)
   
   par(mfrow = c(2,2))
   plot(mod)
   par(mfrow = c(1,1))

Check for:

1. Linearity (Residuals vs Fitted)
2. Normality (Q-Q plot)
3. Homoscedasticity (Scale-Location)
4. Influential points (Residuals vs Leverage)

Common Errors to Avoid
~~~~~~~~~~~~~~~~~~~~~~

1. **Factor to numeric conversion**

   Wrong (returns level indices):
   
   .. code-block:: rconsole

      > f <- factor(c("2", "4", "6"))
      > as.numeric(f)
      [1] 1 2 3
   
   Correct:
   
   .. code-block:: rconsole

      > as.numeric(as.character(f))
      [1] 2 4 6

2. **Missing data handling**

   Wrong (silently propagates NA):
   
   .. code-block:: rconsole

      > mean(c(1, 2, NA, 4))
      [1] NA
   
   Correct:
   
   .. code-block:: rconsole

      > mean(c(1, 2, NA, 4), na.rm = TRUE)
      [1] 2.333333

3. **Confidence vs Prediction intervals**

   .. code-block:: r

      # CI: Where is the mean?
      predict(mod, newdata, interval = "confidence")
      
      # PI: Where is a new observation?
      predict(mod, newdata, interval = "prediction")
   
   Prediction intervals are always wider!

4. **Multiple testing without adjustment**

   After ANOVA, use Tukey HSD:
   
   .. code-block:: r

      TukeyHSD(aov_fit)
   
   This adjusts for multiple comparisons automatically.

Workflow Template
~~~~~~~~~~~~~~~~~

.. code-block:: r

   # 1. Setup
   library(ggplot2)
   library(knitr)
   set.seed(123)
   
   # 2. Import and validate
   d <- read.csv("data/file.csv", stringsAsFactors = FALSE)
   str(d)
   summary(d)
   
   # 3. Clean
   d_clean <- d[complete.cases(d), ]
   
   # 4. Explore
   ggplot(d_clean, aes(x = x, y = y)) +
     geom_point() +
     theme_minimal()
   
   # 5. Model
   mod <- lm(y ~ x, data = d_clean)
   
   # 6. Diagnose
   plot(mod, which = 1:2)
   
   # 7. Report
   summary(mod)
   kable(tidy(mod), digits = 3)

Appendix: Quick Links
-------------------------------------------------

Setup
~~~~~~~~~~~~

- `Local Install <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Local_Install.html>`_
- `Scholar Access <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Scholar-Access.html>`_

Assignments
~~~~~~~~~~~~

- `CA1 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_1_Tutorial.html>`_ — R/RStudio basics
- `CA2 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_2_Graphical_and_Numerical_Summaries_Tutorial.html>`_ — EDA and summaries
- `CA3 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_3_BasicHypothesisTesting_and_subsetting_Tutorial.html>`_ — Hypothesis testing
- `CA4 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_4_TwoSampleProcedures_Tutorial.html>`_ — Two-sample procedures
- `CA5 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_5_ANOVA_Tutorial.html>`_ — ANOVA
- `CA6 Tutorial <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Computer_Assignment_6_SimpleLinearRegression_Tutorial.html>`_ — Linear regression

Getting Help
~~~~~~~~~~~~

- In R: ``?function_name`` or ``help(function_name)``
- Examples: ``example(function_name)``
- Online: https://www.rdocumentation.org/
- Stack Overflow: https://stackoverflow.com/questions/tagged/r