.. _r_computer_assignments: R / RStudio Guide and Function Reference ================================================== Overview ------------------------------------------------- This document is a comprehensive reference for the **primary R tools** used in STAT 350. It introduces R and RStudio, outlines the workflow we use in class, and provides an **enhanced reference** to the functions, libraries, and plotting layers that appear across the computer assignments and worksheets. The emphasis is on: loading data; validating and cleaning; visualizing; computing summaries; fitting models; **checking assumptions**; and performing inference with well-annotated code. Use this as a companion to the assignment tutorials (linked below). Quick Reference Table ------------------------------------------------- .. csv-table:: Common Tasks Quick Reference :header: "Task", "Function(s)", "Example" :widths: 25, 25, 50 :align: left "Import CSV", "``read.csv()``", "``d <- read.csv('data/AppRating.csv')``" "Check structure", "``str()``, ``head()``", "``str(d); head(d, 10)``" "Remove NAs", "``complete.cases()``", "``d_clean <- d[complete.cases(d), ]``" "Basic summary", "``summary()``", "``summary(d$score)``" "Create histogram", "``ggplot() + geom_histogram()``", "``ggplot(d, aes(x=score)) + geom_histogram()``" "One-sample t-test", "``t.test()``", "``t.test(d$score, mu = 3)``" "Two-sample t-test", "``t.test()``", "``t.test(score ~ platform, data = d)``" "One-way ANOVA", "``aov()``", "``fit <- aov(score ~ group, data = d)``" "Linear regression", "``lm()``", "``mod <- lm(rating ~ price, data = d)``" "Check residuals", "``plot()``", "``plot(mod, which = 1:2)``" Quick Start: R / RStudio Setup ------------------------------------------------- - Local Install (R/RStudio): `Local Install `_ - Scholar Access (cluster): `Scholar Access `_ Course Pipeline (At a Glance) ------------------------------------------------- 1. **Import**: ``read.csv`` → **inspect** with ``head``, ``str``, ``summary``. 2. **Validate & clean**: missing data (``is.na``, ``complete.cases``), types (``as.numeric``, ``factor``), quick checks (``length``, ``nrow``, ``unique``). 3. **Explore**: core summaries (``mean``, ``median``, ``sd``, ``quantile``, ``IQR``), plots (``ggplot2`` histograms/boxplots). 4. **Model**: one/two-sample t; ANOVA; SLR via ``lm``; compute p-values/intervals; **diagnostics** (residual plots, QQ). 5. **Report**: figures/tables (``ggplot2``, ``knitr::kable`` + ``kableExtra``), short text with context. Packages / Libraries (Course Set) ------------------------------------------------- - **ggplot2** — Grammar of Graphics. Histograms, boxplots, QQ, scatter + fit lines, ribbons for intervals. - **grid**, **gridExtra** — Plot/table layout (e.g., arrange multiple graphics on a page). - **kableExtra** — Styling for ``knitr::kable`` tables (borders, alignment). - **knitr** — Report tooling; ``kable`` produces clean tables. - **latex2exp** — ``TeX()`` for LaTeX-style math in plot labels. - **magrittr** — Pipe operator ``%>%`` to chain steps. - **stats** — Inference & models: ``aov``, ``anova``, ``lm``, ``t.test``, ``TukeyHSD``, distributions (``dnorm``, ``qnorm``, ``pt``, ``qt``), etc. - **utils** — I/O like ``read.csv``, ``write.csv``. .. tip:: Install from CRAN if needed: :: install.packages(c("ggplot2","knitr","kableExtra","latex2exp","gridExtra")) RStudio Orientation ------------------------------------------------- - **Console** (executes statements), **Source/Editor** (scripts, Rmd), **Environment** (objects), **Plots/Help** (graphics, docs). - Use **Projects** to lock working directory. Prefer relative paths and **never** hardcode machine-specific directories mid-script. - Keep code modular. Consider separate chunks for import, cleaning, EDA, modeling, diagnostics, and reporting. Getting Started with swirl ------------------------------------------------- If you have never coded before, the computer assignments may immediately feel overwhelming. However, R is a language that is as rewarding as it is approachable, especially for statistical work. If you're new to coding, you'll find R's logical syntax is perfect for beginners. Our tutorials are tailored to gradually build your understanding, making R less daunting. The tutorials are focused on the tools needed for the Computer Assignments. However, you may feel you need to get a little more comfortable in R before you attempt the tutorials and the Computer Assignments. An R package called **swirl** may be just what you need to get started. Setup and Use swirl ~~~~~~~~~~~~~~~~~~~~~~~~ 1. Setup R and RStudio (see links above) 2. Install swirl from the command line in R: .. code-block:: r install.packages('swirl') 3. Load swirl: .. code-block:: r library('swirl') 4. Run swirl: .. code-block:: r swirl() This will start the command line interactive tutorials of swirl. You will be instructed on exercises on the command line. **swirl Commands (available anytime during lessons):** - ``skip()`` — Skip the current question - ``play()`` — Experiment with R on your own (swirl ignores your actions) - ``nxt()`` — Return swirl's attention after using play() - ``bye()`` — Exit swirl (progress is saved) - ``main()`` — Return to swirl's main menu - ``info()`` — Display these options again I recommend working through at least a few lessons to get familiarity with coding in R. Alternative R Learning Resources ------------------------------------------------- For curious students who want to explore R beyond the course requirements: Base R ~~~~~~~~~~~~~~ - **CRAN "An Introduction to R"** (official manual): Canonical, up-to-date coverage of types, subsetting, vectorization, base graphics, and the modeling API. The authoritative source. https://cran.r-project.org/doc/manuals/r-release/R-intro.html - **Deep R Programming** (free textbook): Modern, rigorous coverage of R semantics, performance, and numerical computing. Ideal after you're comfortable with basics. https://deepr.gagolewski.com/ tidyverse ~~~~~~~~~~~~~~ - **R for Data Science** (2e, free online): The authoritative entry point to tidy data workflows (readr, dplyr, tidyr, ggplot2, purrr). https://r4ds.hadley.nz - **tidyverse.org**: Core package documentation and philosophy. https://www.tidyverse.org R Markdown ~~~~~~~~~~~~~~ - **R Markdown: The Definitive Guide**: Complete reference for authoring and multi-format output. https://bookdown.org/yihui/rmarkdown/ - **R Markdown Cookbook**: Task-oriented tips that save hours in practice. https://bookdown.org/yihui/rmarkdown-cookbook/ Statistical Computing ~~~~~~~~~~~~~~~~~~~~~~ - **STAT 545** (UBC): Practice-first course on modern workflows with excellent materials. https://stat545.com - **Modern Statistics for Modern Biology**: R-based computational statistics for life sciences. https://www.huber.embl.de/msmb/ Video Resources ~~~~~~~~~~~~~~~ - **R Programming Full Course for Beginners** (freeCodeCamp): Comprehensive 2-hour introduction covering basics through data analysis. https://www.youtube.com/watch?v=fDRa82lxzaU - **R Programming Tutorial - Learn the Basics** (freeCodeCamp): Shorter focused tutorial on R fundamentals. https://www.youtube.com/watch?v=Q5g6lYUn6Q4 Assignment Tutorials (Links) ------------------------------------------------- .. note:: **Winter Session**: Computer Assignments 1 and 2 are combined into a single assignment. Subsequent assignments are renumbered: - CA1 & CA2 → Combined Assignment 1 - CA3 → Assignment 2 - CA4 → Assignment 3 - CA5 → Assignment 4 - CA6 → Assignment 5 - **Computer Assignment 1 Tutorial** — Intro & basics: `CA1 Tutorial `_ - **Computer Assignment 2 Tutorial** — Graphical & numerical summaries: `CA2 Tutorial `_ - **Computer Assignment 3 Tutorial** — Subsetting & basic hypothesis testing: `CA3 Tutorial `_ - **Computer Assignment 4 Tutorial** — Two-sample procedures: `CA4 Tutorial `_ - **Computer Assignment 5 Tutorial** — One-way ANOVA: `CA5 Tutorial `_ - **Computer Assignment 6 Tutorial** — Simple linear regression: `CA6 Tutorial `_ Function Reference (Alphabetized within Category) ------------------------------------------------- Data I/O & Housekeeping ~~~~~~~~~~~~~~~~~~~~~~~~ **getwd()** *Purpose*: Show current working directory *Help*: Type ``?getwd`` in R console *Syntax*: ``getwd()`` Example showing the working directory: .. code-block:: rconsole > getwd() [1] "/Users/username/STAT350" **head(x, n = 6)** *Purpose*: Display first n rows of a data frame or elements of a vector *Help*: Type ``?head`` in R console *Syntax*: ``head(x, n = 6)`` *Common use*: Quickly inspect data after importing Using built-in iris dataset: .. code-block:: rconsole > data(iris) > head(iris, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa *See also*: ``tail()``, ``str()``, ``summary()`` **read.csv(file, header = TRUE, stringsAsFactors = FALSE, ...)** *Purpose*: Import CSV file into a data frame *Help*: Type ``?read.csv`` in R console *Key arguments*: - ``header``: First row contains column names (default TRUE) - ``stringsAsFactors``: Keep strings as characters (recommended FALSE) - ``na.strings``: Values to treat as NA (default "NA") *Common pitfalls*: - Using absolute paths (breaks on other computers) - Forgetting to check for proper import with ``str()`` - Not handling special characters in column names Basic import: .. code-block:: rconsole > mydata <- read.csv("data/experiment.csv") > str(mydata) 'data.frame': 150 obs. of 5 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 ... $ group : chr "A" "B" "A" "C" ... $ measure1: num 23.5 19.2 25.1 22.8 20.3 ... $ measure2: num 45.2 41.8 46.7 44.3 42.1 ... $ outcome : chr "success" "failure" "success" ... With options for handling missing data: .. code-block:: r mydata <- read.csv("data/experiment.csv", stringsAsFactors = FALSE, na.strings = c("NA", "N/A", "")) Always validate after import: .. code-block:: rconsole > sum(is.na(mydata)) [1] 12 *See also*: ``read.table()``, ``write.csv()``, ``str()`` **setwd(path)** *Purpose*: Change working directory *Help*: Type ``?setwd`` in R console *Warning*: Avoid using in scripts; use RStudio Projects instead Not recommended in scripts: .. code-block:: r # Bad practice - machine specific! # setwd("/Users/username/STAT350") # Better: Use RStudio Projects or relative paths **write.csv(x, file, row.names = FALSE)** *Purpose*: Export data frame to CSV file *Help*: Type ``?write.csv`` in R console *Key arguments*: - ``row.names``: Include row numbers (usually FALSE) - ``na``: String for missing values (default "NA") Save cleaned data: .. code-block:: r write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE) Data Structures & Creation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **c(...)** *Purpose*: Combine values into a vector *Help*: Type ``?c`` in R console Creating different types of vectors: .. code-block:: rconsole > v1 <- c(1, 2, 3, 4, 5) > v2 <- c("iOS", "Android", "Windows") > v3 <- c(1, "two", 3) > class(v3) [1] "character" Note that mixed types get coerced to the most general type (character in this case). **colnames(x), rownames(x)** *Purpose*: Get or set column/row names *Help*: Type ``?colnames`` or ``?rownames`` in R console Getting column names from built-in dataset: .. code-block:: rconsole > colnames(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb" Renaming columns: .. code-block:: r # Create a copy to modify mydata <- mtcars[1:5, 1:3] colnames(mydata) <- c("MPG", "Cylinders", "Displacement") # Or rename specific columns colnames(mydata)[1] <- "MilesPerGallon" **data.frame(...)** *Purpose*: Create a data frame from vectors *Help*: Type ``?data.frame`` in R console *Key arguments*: - ``stringsAsFactors``: Auto-convert strings to factors (set FALSE) Creating a data frame from scratch: .. code-block:: rconsole > df <- data.frame( + id = 1:5, + group = c("A", "A", "B", "B", "B"), + score = c(85, 90, 78, 82, 88), + stringsAsFactors = FALSE + ) > str(df) 'data.frame': 5 obs. of 3 variables: $ id : int 1 2 3 4 5 $ group: chr "A" "A" "B" "B" "B" $ score: num 85 90 78 82 88 *See also*: ``tibble::tibble()`` for modern alternative **factor(x, levels = ..., ordered = FALSE)** *Purpose*: Create categorical variables with defined levels *Help*: Type ``?factor`` in R console *When to use*: For categorical data in ANOVA, controlling plot order Basic factor creation with automatic level detection: .. code-block:: rconsole > platform <- factor(c("iOS", "Android", "iOS", "Windows")) > levels(platform) [1] "Android" "iOS" "Windows" Controlling level order for plots and analyses: .. code-block:: r platform <- factor(platform, levels = c("iOS", "Android", "Windows")) Creating ordered factors for ordinal data: .. code-block:: r satisfaction <- factor(c("Low", "High", "Medium", "High"), levels = c("Low", "Medium", "High"), ordered = TRUE) *Common pitfall*: Converting numeric factors back to numbers Wrong way (returns level indices): .. code-block:: rconsole > f <- factor(c("2", "4", "6")) > as.numeric(f) [1] 1 2 3 Correct way: .. code-block:: rconsole > as.numeric(as.character(f)) [1] 2 4 6 *See also*: ``levels()``, ``relevel()``, ``droplevels()`` Data Wrangling & Utilities ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **apply(X, MARGIN, FUN)** *Purpose*: Apply function over matrix/array margins *Help*: Type ``?apply`` in R console *Key arguments*: - ``MARGIN``: 1 for rows, 2 for columns - ``FUN``: Function to apply Create a sample matrix: .. code-block:: rconsole > M <- matrix(1:12, nrow = 3) > M [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 Row means (MARGIN = 1): .. code-block:: rconsole > apply(M, 1, mean) [1] 5.5 6.5 7.5 Column sums (MARGIN = 2): .. code-block:: rconsole > apply(M, 2, sum) [1] 6 15 24 33 Custom function to calculate range: .. code-block:: rconsole > apply(M, 2, function(x) max(x) - min(x)) [1] 2 2 2 2 *See also*: ``lapply()``, ``sapply()``, ``tapply()`` **as.numeric(x)** *Purpose*: Convert to numeric type *Help*: Type ``?as.numeric`` in R console *Common uses*: Fix data imported as characters, convert factors Character to numeric conversion: .. code-block:: rconsole > x <- c("1.5", "2.3", "3.1") > as.numeric(x) [1] 1.5 2.3 3.1 Non-numeric values produce NAs with a warning: .. code-block:: rconsole > y <- c("1", "2", "three") > as.numeric(y) [1] 1 2 NA Warning message: NAs introduced by coercion **complete.cases(x)** *Purpose*: Identify rows with no missing values *Help*: Type ``?complete.cases`` in R console *Returns*: Logical vector (TRUE for complete rows) Create data with missing values: .. code-block:: rconsole > df <- data.frame( + x = c(1, 2, NA, 4), + y = c(5, NA, 7, 8) + ) > complete.cases(df) [1] TRUE FALSE FALSE TRUE Filter to complete cases only: .. code-block:: rconsole > df_clean <- df[complete.cases(df), ] > nrow(df_clean) [1] 2 Count incomplete cases: .. code-block:: rconsole > sum(!complete.cases(df)) [1] 2 *See also*: ``is.na()``, ``na.omit()`` **ifelse(test, yes, no)** *Purpose*: Vectorized conditional operation *Help*: Type ``?ifelse`` in R console Basic pass/fail grading: .. code-block:: rconsole > score <- c(85, 72, 90, 68, 88) > grade <- ifelse(score >= 80, "Pass", "Fail") > grade [1] "Pass" "Fail" "Pass" "Fail" "Pass" Nested ifelse for letter grades: .. code-block:: r grade <- ifelse(score >= 90, "A", ifelse(score >= 80, "B", ifelse(score >= 70, "C", "F"))) Conditional calculations: .. code-block:: r df$adjusted <- ifelse(df$group == "control", df$score * 1.1, df$score) **IQR(x, na.rm = FALSE)** *Purpose*: Calculate interquartile range (Q3 - Q1) *Help*: Type ``?IQR`` in R console With missing values: .. code-block:: rconsole > x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA) > IQR(x) [1] NA > IQR(x, na.rm = TRUE) [1] 4 Compare with manual calculation from quantiles: .. code-block:: rconsole > quantile(x, c(0.25, 0.75), na.rm = TRUE) 25% 75% 3 7 *See also*: ``quantile()``, ``fivenum()``, ``boxplot.stats()`` **is.na(x)** *Purpose*: Test for missing values *Help*: Type ``?is.na`` in R console Identifying NAs: .. code-block:: rconsole > x <- c(1, NA, 3, NA, 5) > is.na(x) [1] FALSE TRUE FALSE TRUE FALSE Counting NAs: .. code-block:: rconsole > sum(is.na(x)) [1] 2 Finding positions of NAs: .. code-block:: rconsole > which(is.na(x)) [1] 2 4 Replacing NAs: .. code-block:: r x[is.na(x)] <- 0 *See also*: ``complete.cases()``, ``na.omit()``, ``anyNA()`` **paste(..., sep = " ", collapse = NULL)** *Purpose*: Concatenate strings *Help*: Type ``?paste`` in R console *Key arguments*: - ``sep``: Separator between elements - ``collapse``: Collapse vector to single string Basic concatenation: .. code-block:: rconsole > paste("Mean:", 5.2) [1] "Mean: 5.2" Custom separator for dates: .. code-block:: rconsole > paste("2024", "01", "15", sep = "-") [1] "2024-01-15" Vectorized operation: .. code-block:: rconsole > paste("Group", 1:3) [1] "Group 1" "Group 2" "Group 3" Collapse to single string: .. code-block:: rconsole > paste(c("A", "B", "C"), collapse = ", ") [1] "A, B, C" No-space version with paste0: .. code-block:: rconsole > paste0("var", 1:3) [1] "var1" "var2" "var3" **quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)** *Purpose*: Calculate sample quantiles *Help*: Type ``?quantile`` in R console Default quartiles: .. code-block:: rconsole > x <- c(1:10, NA) > quantile(x, na.rm = TRUE) 0% 25% 50% 75% 100% 1.0 3.0 5.5 8.0 10.0 Custom percentiles: .. code-block:: rconsole > quantile(x, probs = c(0.1, 0.9), na.rm = TRUE) 10% 90% 1.9 9.1 *See also*: ``median()``, ``IQR()``, ``fivenum()`` **sapply(X, FUN), lapply(X, FUN)** *Purpose*: Apply function over list/vector elements *Help*: Type ``?sapply`` or ``?lapply`` in R console *Differences*: - ``lapply``: Always returns a list - ``sapply``: Simplifies to vector/matrix if possible Create a list of vectors: .. code-block:: rconsole > lst <- list(a = 1:5, b = 6:10, c = 11:15) sapply simplifies to vector: .. code-block:: rconsole > sapply(lst, mean) a b c 3 8 13 lapply returns list: .. code-block:: rconsole > lapply(lst, mean) $a [1] 3 $b [1] 8 $c [1] 13 Check multiple columns for NAs: .. code-block:: r sapply(df, function(x) sum(is.na(x))) **tapply(X, INDEX, FUN)** *Purpose*: Apply function by group *Help*: Type ``?tapply`` in R console Group means using built-in dataset: .. code-block:: rconsole > tapply(iris$Sepal.Length, iris$Species, mean) setosa versicolor virginica 5.006 5.936 6.588 Multiple statistics per group: .. code-block:: r tapply(iris$Sepal.Length, iris$Species, function(x) c(mean = mean(x), sd = sd(x))) *See also*: ``aggregate()``, ``by()`` Descriptive Statistics & Correlation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **cor(x, y = NULL, use = "everything", method = "pearson")** *Purpose*: Calculate correlation coefficient *Help*: Type ``?cor`` in R console *Key arguments*: - ``use``: How to handle NAs ("complete.obs" drops them) - ``method``: "pearson" (default), "spearman", "kendall" Perfect positive correlation: .. code-block:: rconsole > x <- c(1, 2, 3, 4, 5) > y <- c(2, 4, 6, 8, 10) > cor(x, y) [1] 1 Using built-in dataset: .. code-block:: rconsole > cor(mtcars$mpg, mtcars$wt) [1] -0.8676594 Correlation matrix: .. code-block:: rconsole > cor(mtcars[,c("mpg", "wt", "hp")]) mpg wt hp mpg 1.0000000 -0.8676594 -0.7761684 wt -0.8676594 1.0000000 0.6587479 hp -0.7761684 0.6587479 1.0000000 *Interpretation guide*: - -1 to -0.7: Strong negative - -0.7 to -0.3: Moderate negative - -0.3 to 0.3: Weak/no linear relationship - 0.3 to 0.7: Moderate positive - 0.7 to 1: Strong positive Test for significance: .. code-block:: r cor.test(mtcars$mpg, mtcars$wt) *See also*: ``cor.test()``, ``cov()`` **mean(x, trim = 0, na.rm = FALSE)** *Purpose*: Calculate arithmetic mean *Help*: Type ``?mean`` in R console *Key arguments*: - ``na.rm``: Remove NAs before calculation - ``trim``: Fraction to trim from each end Basic usage with NAs: .. code-block:: rconsole > x <- c(1, 2, 3, 4, 5, NA) > mean(x) [1] NA > mean(x, na.rm = TRUE) [1] 3 Trimmed mean (robust to outliers): .. code-block:: rconsole > y <- c(1, 2, 3, 4, 100) > mean(y) [1] 22 > mean(y, trim = 0.2) [1] 3 The trimmed mean removes 20% from each end before calculating. *See also*: ``median()``, ``summary()`` **median(x, na.rm = FALSE)** *Purpose*: Calculate median (middle value) *Help*: Type ``?median`` in R console Odd number of values: .. code-block:: rconsole > median(c(1, 3, 5)) [1] 3 Even number of values: .. code-block:: rconsole > median(c(1, 2, 3, 4)) [1] 2.5 Median is more robust than mean for skewed data: .. code-block:: rconsole > x <- c(20, 25, 30, 35, 200) > mean(x) [1] 62 > median(x) [1] 30 **sd(x, na.rm = FALSE)** *Purpose*: Calculate sample standard deviation *Help*: Type ``?sd`` in R console *Note*: Uses n-1 denominator (sample SD) .. code-block:: rconsole > x <- c(2, 4, 4, 4, 5, 5, 7, 9) > sd(x) [1] 2.13809 > var(x) [1] 4.571429 The variance equals the square of the standard deviation. Calculate coefficient of variation (relative variability): .. code-block:: rconsole > cv <- sd(x) / mean(x) * 100 > paste("CV:", round(cv, 1), "%") [1] "CV: 42.8 %" *See also*: ``var()``, ``mad()`` (robust alternative) Probability & Distributions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Normal Distribution Functions** **dnorm(x, mean = 0, sd = 1)** *Purpose*: Normal probability density *Help*: Type ``?dnorm`` in R console *Use cases*: Overlay theoretical curves, calculate likelihoods Standard normal density at x = 0: .. code-block:: rconsole > dnorm(0) [1] 0.3989423 Custom parameters: .. code-block:: rconsole > dnorm(100, mean = 100, sd = 15) [1] 0.02659615 Use in ggplot2 for overlaying normal curve: .. code-block:: r ggplot(data.frame(x = x), aes(x)) + geom_histogram(aes(y = after_stat(density))) + stat_function(fun = dnorm, args = list(mean = mean(x), sd = sd(x)), color = "red") **pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)** *Purpose*: Normal cumulative distribution (probability) *Help*: Type ``?pnorm`` in R console Standard normal probabilities: .. code-block:: rconsole > pnorm(1.96) [1] 0.9750021 > pnorm(1.96, lower.tail = FALSE) [1] 0.0249979 Two-tailed p-value: .. code-block:: rconsole > 2 * pnorm(-abs(1.96)) [1] 0.04999579 **qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)** *Purpose*: Normal quantiles (inverse CDF) *Help*: Type ``?qnorm`` in R console Critical values: .. code-block:: rconsole > qnorm(0.975) [1] 1.959964 > qnorm(0.95) [1] 1.644854 The first gives the two-tailed 95% critical value, the second gives one-tailed. Calculate confidence interval: .. code-block:: r xbar <- 100; s <- 15; n <- 25 xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n) **rnorm(n, mean = 0, sd = 1)** *Purpose*: Generate random normal values *Help*: Type ``?rnorm`` in R console .. code-block:: r set.seed(123) x <- rnorm(100, mean = 50, sd = 10) # Visualize with ggplot2 library(ggplot2) ggplot(data.frame(x = x), aes(x = x)) + geom_histogram(bins = 20) + ggtitle("Sample from Normal(50, 10)") **t Distribution Functions** **pt(q, df, lower.tail = TRUE)** *Purpose*: Student's t cumulative distribution *Help*: Type ``?pt`` in R console *Use cases*: Calculate p-values for t-tests One-sided p-value: .. code-block:: rconsole > t_stat <- 2.5 > df <- 24 > pt(t_stat, df, lower.tail = FALSE) [1] 0.009963466 Two-sided p-value: .. code-block:: rconsole > 2 * pt(-abs(t_stat), df) [1] 0.01992693 **qt(p, df, lower.tail = TRUE)** *Purpose*: Student's t quantiles *Help*: Type ``?qt`` in R console *Use cases*: Critical values for confidence intervals 95% CI critical value: .. code-block:: rconsole > qt(0.975, df = 24) [1] 2.063899 Compare to normal: .. code-block:: rconsole > qnorm(0.975) [1] 1.959964 The t critical value is larger due to heavier tails. *See also*: ``dt()``, ``rt()`` **Other Distributions** **dbinom(x, size, prob)** *Purpose*: Binomial probability mass function *Help*: Type ``?dbinom`` in R console Probability of exactly 3 successes in 10 trials with p = 0.5: .. code-block:: rconsole > dbinom(3, size = 10, prob = 0.5) [1] 0.1171875 Full distribution: .. code-block:: r x <- 0:10 p <- dbinom(x, size = 10, prob = 0.5) # Visualize with ggplot2 library(ggplot2) ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) + geom_col() + scale_x_continuous(breaks = 0:10) + ggtitle("Binomial(10, 0.5) Distribution") **qtukey(p, nmeans, df)** *Purpose*: Tukey HSD critical values *Help*: Type ``?qtukey`` in R console Critical value for 3 groups with df = 27: .. code-block:: rconsole > qtukey(0.95, nmeans = 3, df = 27) [1] 3.506426 Simulation Functions ~~~~~~~~~~~~~~~~~~~~~~~~ **set.seed(seed)** *Purpose*: Set random number generator seed for reproducibility *Help*: Type ``?set.seed`` in R console Without seed (different each time): .. code-block:: rconsole > rnorm(3) [1] 0.3186301 -0.5817907 0.7145327 > rnorm(3) [1] -0.8252594 -0.3598138 -0.0109303 With seed (reproducible): .. code-block:: rconsole > set.seed(123) > rnorm(3) [1] -0.5604756 -0.2301775 1.5587083 > set.seed(123) > rnorm(3) [1] -0.5604756 -0.2301775 1.5587083 **Random Generation Functions** All random generation functions take ``n`` (number of values) as the first argument. .. code-block:: r set.seed(1) rnorm(5, mean = 100, sd = 15) # Normal runif(5, min = 0, max = 10) # Uniform rexp(5, rate = 2) # Exponential For simulation studies: .. code-block:: r n_sims <- 1000 means <- replicate(n_sims, mean(rnorm(30))) # Visualize sampling distribution with ggplot2 library(ggplot2) ggplot(data.frame(means = means), aes(x = means)) + geom_histogram(bins = 30) + ggtitle("Sampling Distribution of Mean") Inference Functions ~~~~~~~~~~~~~~~~~~~~~~~~ **t.test(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)** *Purpose*: One and two sample t-tests *Help*: Type ``?t.test`` in R console *When to use*: Compare means between 1-2 groups *Key arguments*: - ``alternative``: "two.sided", "less", "greater" - ``mu``: Null hypothesis value (one-sample) - ``paired``: TRUE for matched pairs - ``var.equal``: TRUE assumes equal variances *Common pitfalls*: - Using paired = TRUE with independent samples - Not checking normality assumptions - Forgetting var.equal = FALSE for unequal variances One-sample test: .. code-block:: rconsole > x <- c(98, 102, 101, 97, 103, 100, 99, 98, 104, 101) > t.test(x, mu = 100) One Sample t-test data: x t = 0.63246, df = 9, p-value = 0.5425 alternative hypothesis: true mean is not equal to 100 95 percent confidence interval: 98.41533 102.18467 sample estimates: mean of x 100.3 Two-sample independent test using built-in data: .. code-block:: rconsole > data(sleep) > t.test(extra ~ group, data = sleep, var.equal = FALSE) Welch Two Sample t-test data: extra by group t = -1.8608, df = 17.776, p-value = 0.07939 alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0 95 percent confidence interval: -3.3654832 0.2054832 sample estimates: mean in group 1 mean in group 2 0.75 2.33 **Paired t-test - Three equivalent approaches:** Method 1 - Using paired = TRUE with two vectors: .. code-block:: r before <- c(85, 88, 90, 92, 87) after <- c(88, 90, 91, 94, 89) t.test(before, after, paired = TRUE) Method 2 - Convert to one-sample test of differences: .. code-block:: r differences <- after - before t.test(differences, mu = 0) Method 3 - Formula interface with paired data: .. code-block:: rconsole > # Using sleep dataset (paired by ID) > t.test(Pair(extra[group==1], extra[group==2]) ~ 1, data = sleep) Paired t-test data: Pair(extra[group == 1], extra[group == 2]) t = -4.0621, df = 9, p-value = 0.002833 alternative hypothesis: true mean difference is not equal to 0 95 percent confidence interval: -2.4598858 -0.7001142 sample estimates: mean difference -1.58 Extract components from results: .. code-block:: rconsole > result <- t.test(x, mu = 100) > result$p.value [1] 0.5424805 > result$conf.int [1] 98.41533 102.18467 attr(,"conf.level") [1] 0.95 > result$estimate mean of x 100.3 **Checking assumptions**: Normality check: .. code-block:: r # Shapiro-Wilk test shapiro.test(x) # Visual check with ggplot2 library(ggplot2) ggplot(data.frame(x = x), aes(sample = x)) + stat_qq() + stat_qq_line() Equal variances check: .. code-block:: r # F-test var.test(x, y) # Visual check ggplot(data.frame(value = c(x, y), group = rep(c("X", "Y"), c(length(x), length(y)))), aes(x = group, y = value)) + geom_boxplot() *See also*: ``wilcox.test()`` (non-parametric alternative) **aov(formula, data)** *Purpose*: Analysis of variance (ANOVA) *Help*: Type ``?aov`` in R console *When to use*: Compare means across 3+ groups One-way ANOVA: .. code-block:: rconsole > fit <- aov(score ~ group, data = df) > summary(fit) Df Sum Sq Mean Sq F value Pr(>F) group 2 520.3 260.15 8.214 0.0019 ** Residuals 27 855.2 31.67 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Check assumptions with diagnostic plots: .. code-block:: r plot(fit, which = 1) # Residuals vs Fitted plot(fit, which = 2) # Normal Q-Q Test for equal variances: .. code-block:: r bartlett.test(score ~ group, data = df) Post-hoc comparisons: .. code-block:: r TukeyHSD(fit) plot(TukeyHSD(fit)) *See also*: ``anova()``, ``TukeyHSD()``, ``kruskal.test()`` **TukeyHSD(x, conf.level = 0.95)** *Purpose*: Tukey's honest significant differences *Help*: Type ``?TukeyHSD`` in R console *When to use*: After significant ANOVA for pairwise comparisons Following ANOVA: .. code-block:: rconsole > fit <- aov(score ~ group, data = df) > tukey_result <- TukeyHSD(fit) > tukey_result Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = score ~ group, data = df) $group diff lwr upr p adj B-A -5.333 -12.20271 1.536038 0.1518711 C-A 3.167 -3.70271 10.036038 0.5019442 C-B 8.500 1.63062 15.369376 0.0125506 Visual display: .. code-block:: r plot(tukey_result) Confidence intervals not containing 0 indicate significant differences. **lm(formula, data)** *Purpose*: Fit linear models *Help*: Type ``?lm`` in R console *When to use*: Regression analysis, ANOVA via regression Simple linear regression: .. code-block:: rconsole > mod <- lm(rating ~ price, data = df) > summary(mod) Call: lm(formula = rating ~ price, data = df) Residuals: Min 1Q Median 3Q Max -2.83450 -0.83450 -0.03077 0.78231 2.70308 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.1234 0.2145 9.897 2.13e-10 *** price 0.5432 0.0532 10.211 9.77e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.234 on 28 degrees of freedom Multiple R-squared: 0.7882, Adjusted R-squared: 0.7806 F-statistic: 104.3 on 1 and 28 DF, p-value: 9.77e-11 Extract components: .. code-block:: r coef(mod) # Coefficients confint(mod) # CI for coefficients residuals(mod) # Residuals fitted(mod) # Fitted values Diagnostic plots: .. code-block:: r plot(mod) # Produces 4 diagnostic plots ANOVA table: .. code-block:: r anova(mod) *See also*: ``summary.lm()``, ``predict.lm()``, ``plot.lm()`` **predict(object, newdata, interval = "none", level = 0.95)** *Purpose*: Predictions from model fits *Help*: Type ``?predict.lm`` in R console *Key arguments*: - ``interval``: "confidence" (mean response) or "prediction" (new obs) - ``level``: Confidence level *Critical distinction*: - **Confidence interval**: Uncertainty about mean response - **Prediction interval**: Uncertainty about individual response (wider) Fit model first: .. code-block:: r mod <- lm(rating ~ price, data = df) Point prediction: .. code-block:: rconsole > new_price <- data.frame(price = 50) > predict(mod, new_price) 1 29.28571 Confidence interval (mean rating at price = 50): .. code-block:: rconsole > predict(mod, new_price, interval = "confidence") fit lwr upr 1 29.28571 28.47933 30.09209 Prediction interval (individual rating at price = 50): .. code-block:: rconsole > predict(mod, new_price, interval = "prediction") fit lwr upr 1 29.28571 26.89414 31.67728 Note that prediction intervals are always wider than confidence intervals. Multiple predictions for plotting: .. code-block:: r new_prices <- data.frame( price = seq(min(df$price), max(df$price), length = 100) ) ci <- predict(mod, new_prices, interval = "confidence") pi <- predict(mod, new_prices, interval = "prediction") Plot with bands: .. code-block:: r plot(rating ~ price, data = df) abline(mod) lines(new_prices$price, ci[,"lwr"], lty = 2, col = "blue") lines(new_prices$price, ci[,"upr"], lty = 2, col = "blue") lines(new_prices$price, pi[,"lwr"], lty = 2, col = "red") lines(new_prices$price, pi[,"upr"], lty = 2, col = "red") *See also*: ``fitted()``, ``residuals()`` **Diagnostic Plots for Assumptions** For t-test (normality of residuals): .. code-block:: r library(ggplot2) x <- residuals(lm(score ~ group, data = df)) # Q-Q plot ggplot(data.frame(sample = x), aes(sample = sample)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q Plot of Residuals") # Formal test shapiro.test(x) For ANOVA: .. code-block:: r fit <- aov(score ~ group, data = df) # Residuals vs Fitted ggplot(data.frame(fitted = fitted(fit), residuals = residuals(fit)), aes(x = fitted, y = residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + ggtitle("Residuals vs Fitted") # Q-Q plot of residuals ggplot(data.frame(sample = residuals(fit)), aes(sample = sample)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q") For regression (full diagnostic panel): .. code-block:: r mod <- lm(rating ~ price, data = df) # Prepare data for all plots diag_data <- data.frame( fitted = fitted(mod), residuals = residuals(mod), std_resid = rstandard(mod), sqrt_std_resid = sqrt(abs(rstandard(mod))), cooks = cooks.distance(mod), leverage = hatvalues(mod) ) # Create four diagnostic plots library(gridExtra) p1 <- ggplot(diag_data, aes(x = fitted, y = residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + geom_smooth(se = FALSE) + ggtitle("Residuals vs Fitted") p2 <- ggplot(diag_data, aes(sample = std_resid)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q") p3 <- ggplot(diag_data, aes(x = fitted, y = sqrt_std_resid)) + geom_point() + geom_smooth(se = FALSE) + ggtitle("Scale-Location") p4 <- ggplot(diag_data, aes(x = leverage, y = std_resid)) + geom_point() + geom_smooth(se = FALSE) + ggtitle("Residuals vs Leverage") grid.arrange(p1, p2, p3, p4, ncol = 2) Pattern interpretation: - Random scatter: Good - Funnel shape: Heteroscedasticity - Curve: Non-linearity - Outliers: Influential points Graphics (ggplot2) ~~~~~~~~~~~~~~~~~~~~~~~~ .. important:: **This course uses ggplot2 exclusively for all plotting and visualization.** While base R has plotting functions like plot(), hist(), and boxplot(), we will only use ggplot2's grammar of graphics approach. All assignments and exams will expect ggplot2 syntax for graphs. Core Components ~~~~~~~~~~~~~~~~ **ggplot(data, aes(...))** *Purpose*: Initialize a ggplot object *Help*: Type ``?ggplot`` in R console *Structure*: Data + aesthetic mappings + layers Basic setup creates blank plot with axes: .. code-block:: r library(ggplot2) ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) Add layers with +: .. code-block:: r ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() Save for reuse: .. code-block:: r p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) p + geom_boxplot() p + geom_violin() **aes(...)** *Purpose*: Map variables to visual properties *Help*: Type ``?aes`` in R console *Common aesthetics*: x, y, color, fill, shape, size, alpha Position aesthetics: .. code-block:: r aes(x = Sepal.Length, y = Sepal.Width) Color by group: .. code-block:: r aes(x = Sepal.Length, y = Sepal.Width, color = Species) Multiple aesthetics: .. code-block:: r aes(x = wt, y = mpg, color = factor(cyl), size = hp) Can set in ggplot() or individual geoms: .. code-block:: r ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width)) Geoms (Geometric Objects) ~~~~~~~~~~~~~~~~~~~~~~~~~~ **geom_histogram(binwidth, bins, aes(y = after_stat(...)))** *Purpose*: Create histograms *Help*: Type ``?geom_histogram`` in R console *Key arguments*: - ``bins``: Number of bins (default 30) - ``binwidth``: Width of bins - ``aes(y = after_stat(density))``: Scale to density .. important:: **In this course, all histograms must include:** 1. Density scaling using ``aes(y = after_stat(density))`` 2. A kernel density curve (red) using ``geom_density()`` 3. A normal curve (blue) using ``stat_function()`` This allows visual comparison of the data distribution with theoretical distributions. Standard histogram format for this course: .. code-block:: r # Calculate mean and sd first xbar <- mean(data$variable) s <- sd(data$variable) # Create histogram with required overlays ggplot(data, aes(x = variable)) + geom_histogram(aes(y = after_stat(density)), bins = 24, fill = "grey", col = "black") + geom_density(col = "red", linewidth = 1) + stat_function(fun = dnorm, args = list(mean = xbar, sd = s), col = "blue", linewidth = 1) + ggtitle("Histogram with Density Curves") Adding mean and median lines (often required): .. code-block:: r # Calculate statistics xbar <- mean(data$variable) xtilde <- median(data$variable) s <- sd(data$variable) # Full histogram with all required elements ggplot(data, aes(x = variable)) + geom_histogram(aes(y = after_stat(density)), bins = 24, fill = "grey", col = "black") + geom_vline(xintercept = xbar, color = "purple", linetype = "dotted", linewidth = 1) + geom_vline(xintercept = xtilde, color = "orange", linetype = "dashed", linewidth = 1) + geom_density(col = "red", linewidth = 1) + stat_function(fun = dnorm, args = list(mean = xbar, sd = s), col = "blue", linewidth = 1) + ggtitle("Complete Histogram for Course") By group: .. code-block:: r ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_histogram(aes(y = after_stat(density)), position = "dodge", bins = 15) + facet_wrap(~ Species) + ggtitle("Histograms by Group") **geom_boxplot(outlier.shape, varwidth)** *Purpose*: Box and whisker plots *Help*: Type ``?geom_boxplot`` in R console Basic boxplot: .. code-block:: r ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot() Horizontal orientation: .. code-block:: r ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot() + coord_flip() With individual points: .. code-block:: r ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2, alpha = 0.5) Variable width by sample size: .. code-block:: r ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(varwidth = TRUE) **geom_point(), geom_line()** *Purpose*: Scatterplots and line plots *Help*: Type ``?geom_point`` or ``?geom_line`` Basic scatterplot: .. code-block:: r ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() Customize points: .. code-block:: r ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point(size = 3, alpha = 0.7) Add regression line: .. code-block:: r ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") Time series (using built-in dataset): .. code-block:: r # Convert to time series format data(AirPassengers) ts_data <- data.frame( date = as.Date(time(AirPassengers)), value = as.numeric(AirPassengers) ) ggplot(ts_data, aes(x = date, y = value)) + geom_line() + geom_point() **geom_smooth(method, formula, se)** *Purpose*: Add smoothed conditional means *Help*: Type ``?geom_smooth`` in R console *Key arguments*: - ``method``: "lm", "loess", "gam" - ``se``: Show confidence band (TRUE/FALSE) Linear regression with confidence band: .. code-block:: r ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm") Without confidence band: .. code-block:: r ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE) By group: .. code-block:: r ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point() + geom_smooth(method = "lm") Faceted histograms with required density overlays and smooth kernel: .. code-block:: r # Calculate group-specific statistics xbar <- tapply(iris$Sepal.Length, iris$Species, mean) s <- tapply(iris$Sepal.Length, iris$Species, sd) # Add normal density values to the dataset iris$normal.density <- ifelse(iris$Species == "setosa", dnorm(iris$Sepal.Length, xbar["setosa"], s["setosa"]), ifelse(iris$Species == "versicolor", dnorm(iris$Sepal.Length, xbar["versicolor"], s["versicolor"]), dnorm(iris$Sepal.Length, xbar["virginica"], s["virginica"]))) # Create faceted histogram with overlays ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "grey", col = "black") + geom_density(col = "red", linewidth = 1) + geom_line(aes(y = normal.density), col = "blue", linewidth = 1) + facet_wrap(~ Species) + ggtitle("Distribution of Sepal Length by Species") # For datasets with only two groups (like CO2), the approach is simpler: .. code-block:: r # Two-group example using CO2 dataset xbar <- tapply(CO2$uptake, CO2$Type, mean) s <- tapply(CO2$uptake, CO2$Type, sd) CO2$normal.density <- ifelse(CO2$Type == "Quebec", dnorm(CO2$uptake, xbar["Quebec"], s["Quebec"]), dnorm(CO2$uptake, xbar["Mississippi"], s["Mississippi"])) ggplot(CO2, aes(x = uptake)) + geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "grey", col = "black") + geom_density(col = "red", linewidth = 1) + geom_line(aes(y = normal.density), col = "blue", linewidth = 1) + facet_grid(. ~ Type) + ggtitle("Distribution of Uptake by Location Type") **stat_qq(), stat_qq_line()** *Purpose*: Q-Q plots for normality assessment *Help*: Type ``?stat_qq`` in R console Basic Q-Q plot: .. code-block:: r ggplot(data.frame(sample = rnorm(100)), aes(sample = sample)) + stat_qq() + stat_qq_line() For residuals: .. code-block:: r mod <- lm(mpg ~ wt, data = mtcars) res_df <- data.frame(residuals = residuals(mod)) ggplot(res_df, aes(sample = residuals)) + stat_qq() + stat_qq_line(color = "red") + labs(title = "Normal Q-Q Plot of Residuals") **geom_ribbon(aes(ymin, ymax))** *Purpose*: Area plots with upper/lower bounds *Help*: Type ``?geom_ribbon`` in R console *Use case*: Custom confidence/prediction bands Setup for regression bands: .. code-block:: r mod <- lm(mpg ~ wt, data = mtcars) new_data <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length = 100)) pred_ci <- predict(mod, new_data, interval = "confidence") pred_pi <- predict(mod, new_data, interval = "prediction") plot_data <- cbind(new_data, ci_fit = pred_ci[,"fit"], ci_lwr = pred_ci[,"lwr"], ci_upr = pred_ci[,"upr"], pi_lwr = pred_pi[,"lwr"], pi_upr = pred_pi[,"upr"]) Plot with bands: .. code-block:: r ggplot(plot_data, aes(x = wt)) + geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr), fill = "grey90") + geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey70") + geom_line(aes(y = ci_fit), color = "blue", size = 1) + geom_point(data = mtcars, aes(y = mpg)) + labs(title = "Regression with CI and PI bands") Categorical Data Visualization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **geom_bar(), geom_col()** *Purpose*: Bar charts for categorical data *Help*: Type ``?geom_bar`` or ``?geom_col`` in R console - ``geom_bar()``: Counts categories (like table()) - ``geom_col()``: Uses pre-computed values Count frequencies with geom_bar(): .. code-block:: r # Frequency bar chart ggplot(mtcars, aes(x = factor(cyl))) + geom_bar() + xlab("Number of Cylinders") + ylab("Count") + ggtitle("Distribution of Cylinder Counts") Pre-computed values with geom_col(): .. code-block:: r # Create summary data cyl_means <- aggregate(mpg ~ cyl, data = mtcars, mean) ggplot(cyl_means, aes(x = factor(cyl), y = mpg)) + geom_col(fill = "steelblue") + xlab("Number of Cylinders") + ylab("Mean MPG") + ggtitle("Average MPG by Cylinder Count") Grouped bar chart: .. code-block:: r ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) + geom_bar(position = "dodge") + xlab("Cylinders") + labs(fill = "Transmission") + scale_fill_discrete(labels = c("Automatic", "Manual")) Stacked bar chart (proportions): .. code-block:: r ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) + geom_bar(position = "fill") + ylab("Proportion") + labs(fill = "Transmission") .. note:: For pie charts, while possible in ggplot2 using ``coord_polar()``, they are generally discouraged in statistical visualization. Bar charts are preferred for comparing categorical proportions. Plot Customization ~~~~~~~~~~~~~~~~~~~ **Labels and Titles** All labels at once: .. code-block:: r p + labs(title = "Main Title", subtitle = "Subtitle", x = "X-axis label", y = "Y-axis label", caption = "Data source: ...") Individual functions: .. code-block:: r p + ggtitle("Main Title") + xlab("X-axis") + ylab("Y-axis") Math expressions: .. code-block:: r library(latex2exp) p + xlab(TeX("$\\beta_1$")) + ylab(TeX("$\\hat{y}$")) **Themes** Built-in themes: .. code-block:: r p + theme_minimal() p + theme_classic() p + theme_bw() Customize elements: .. code-block:: r p + theme( axis.text = element_text(size = 12), axis.title = element_text(size = 14, face = "bold"), plot.title = element_text(size = 16, hjust = 0.5), legend.position = "bottom" ) **Faceting** Separate panels by variable: .. code-block:: r ggplot(df, aes(x = score)) + geom_histogram() + facet_wrap(~ group) Grid layout: .. code-block:: r ggplot(df, aes(x = score)) + geom_histogram() + facet_grid(platform ~ group) Tables & Reporting ~~~~~~~~~~~~~~~~~~~~~~~~ **knitr::kable(x, caption, digits, format)** *Purpose*: Create formatted tables *Help*: Type ``?knitr::kable`` in R console Basic table using built-in data: .. code-block:: r library(knitr) library(kableExtra) # Summary statistics by species summary_table <- aggregate(Sepal.Length ~ Species, data = iris, FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))) kable(summary_table, caption = "Summary Statistics by Species", digits = 2) Enhanced styling: .. code-block:: r kable(summary_table, caption = "Iris Summary by Species") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% column_spec(1, bold = TRUE) **gridExtra::grid.arrange(...)** *Purpose*: Arrange multiple plots *Help*: Type ``?gridExtra::grid.arrange`` Create and arrange plots: .. code-block:: r library(gridExtra) p1 <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 20) + ggtitle("Distribution of Sepal Length") p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot() + ggtitle("Sepal Length by Species") grid.arrange(p1, p2, ncol = 2) Different layouts: .. code-block:: r grid.arrange(p1, p2, p1, p2, ncol = 2, nrow = 2, top = "Four Panel Display") Best Practices & Common Pitfalls ================================= Data Import & Validation ~~~~~~~~~~~~~~~~~~~~~~~~ Standard validation workflow: .. code-block:: r d <- read.csv("data/file.csv") str(d) # Structure check head(d); tail(d) # First/last rows summary(d) # Summary statistics sapply(d, function(x) sum(is.na(x))) # Missing values sapply(d, class) # Data types Statistical Assumptions ~~~~~~~~~~~~~~~~~~~~~~~ **Before t-tests/ANOVA:** .. code-block:: r # Normality hist(x) qqnorm(x); qqline(x) shapiro.test(x) # Equal variances var.test(x, y) boxplot(score ~ group, data = df) **Before regression:** .. code-block:: r mod <- lm(y ~ x, data = df) par(mfrow = c(2,2)) plot(mod) par(mfrow = c(1,1)) Check for: 1. Linearity (Residuals vs Fitted) 2. Normality (Q-Q plot) 3. Homoscedasticity (Scale-Location) 4. Influential points (Residuals vs Leverage) Common Errors to Avoid ~~~~~~~~~~~~~~~~~~~~~~ 1. **Factor to numeric conversion** Wrong (returns level indices): .. code-block:: rconsole > f <- factor(c("2", "4", "6")) > as.numeric(f) [1] 1 2 3 Correct: .. code-block:: rconsole > as.numeric(as.character(f)) [1] 2 4 6 2. **Missing data handling** Wrong (silently propagates NA): .. code-block:: rconsole > mean(c(1, 2, NA, 4)) [1] NA Correct: .. code-block:: rconsole > mean(c(1, 2, NA, 4), na.rm = TRUE) [1] 2.333333 3. **Confidence vs Prediction intervals** .. code-block:: r # CI: Where is the mean? predict(mod, newdata, interval = "confidence") # PI: Where is a new observation? predict(mod, newdata, interval = "prediction") Prediction intervals are always wider! 4. **Multiple testing without adjustment** After ANOVA, use Tukey HSD: .. code-block:: r TukeyHSD(aov_fit) This adjusts for multiple comparisons automatically. Workflow Template ~~~~~~~~~~~~~~~~~ .. code-block:: r # 1. Setup library(ggplot2) library(knitr) set.seed(123) # 2. Import and validate d <- read.csv("data/file.csv", stringsAsFactors = FALSE) str(d) summary(d) # 3. Clean d_clean <- d[complete.cases(d), ] # 4. Explore ggplot(d_clean, aes(x = x, y = y)) + geom_point() + theme_minimal() # 5. Model mod <- lm(y ~ x, data = d_clean) # 6. Diagnose plot(mod, which = 1:2) # 7. Report summary(mod) kable(tidy(mod), digits = 3) Appendix: Quick Links ------------------------------------------------- Setup ~~~~~~~~~~~~ - `Local Install `_ - `Scholar Access `_ Assignments ~~~~~~~~~~~~ - `CA1 Tutorial `_ — R/RStudio basics - `CA2 Tutorial `_ — EDA and summaries - `CA3 Tutorial `_ — Hypothesis testing - `CA4 Tutorial `_ — Two-sample procedures - `CA5 Tutorial `_ — ANOVA - `CA6 Tutorial `_ — Linear regression Getting Help ~~~~~~~~~~~~ - In R: ``?function_name`` or ``help(function_name)`` - Examples: ``example(function_name)`` - Online: https://www.rdocumentation.org/ - Stack Overflow: https://stackoverflow.com/questions/tagged/r