R / RStudio Guide and Function Reference
Overview
This document is a comprehensive reference for the primary R tools used in STAT 350. It introduces R and RStudio, outlines the workflow we use in class, and provides an enhanced reference to the functions, libraries, and plotting layers that appear across the computer assignments and worksheets. The emphasis is on: loading data; validating and cleaning; visualizing; computing summaries; fitting models; checking assumptions; and performing inference with well-annotated code.
Use this as a companion to the assignment tutorials (linked below).
Quick Reference Table
Task |
Function(s) |
Example |
---|---|---|
Import CSV |
|
|
Check structure |
|
|
Remove NAs |
|
|
Basic summary |
|
|
Create histogram |
|
|
One-sample t-test |
|
|
Two-sample t-test |
|
|
One-way ANOVA |
|
|
Linear regression |
|
|
Check residuals |
|
|
Quick Start: R / RStudio Setup
Local Install (R/RStudio): Local Install
Scholar Access (cluster): Scholar Access
Course Pipeline (At a Glance)
Import:
read.csv
→ inspect withhead
,str
,summary
.Validate & clean: missing data (
is.na
,complete.cases
), types (as.numeric
,factor
), quick checks (length
,nrow
,unique
).Explore: core summaries (
mean
,median
,sd
,quantile
,IQR
), plots (ggplot2
histograms/boxplots).Model: one/two-sample t; ANOVA; SLR via
lm
; compute p-values/intervals; diagnostics (residual plots, QQ).Report: figures/tables (
ggplot2
,knitr::kable
+kableExtra
), short text with context.
Packages / Libraries (Course Set)
ggplot2 — Grammar of Graphics. Histograms, boxplots, QQ, scatter + fit lines, ribbons for intervals.
grid, gridExtra — Plot/table layout (e.g., arrange multiple graphics on a page).
kableExtra — Styling for
knitr::kable
tables (borders, alignment).knitr — Report tooling;
kable
produces clean tables.latex2exp —
TeX()
for LaTeX-style math in plot labels.magrittr — Pipe operator
%>%
to chain steps.stats — Inference & models:
aov
,anova
,lm
,t.test
,TukeyHSD
, distributions (dnorm
,qnorm
,pt
,qt
), etc.utils — I/O like
read.csv
,write.csv
.
Tip
Install from CRAN if needed:
install.packages(c("ggplot2","knitr","kableExtra","latex2exp","gridExtra"))
RStudio Orientation
Console (executes statements), Source/Editor (scripts, Rmd), Environment (objects), Plots/Help (graphics, docs).
Use Projects to lock working directory. Prefer relative paths and never hardcode machine-specific directories mid-script.
Keep code modular. Consider separate chunks for import, cleaning, EDA, modeling, diagnostics, and reporting.
Getting Started with swirl
If you have never coded before, the computer assignments may immediately feel overwhelming. However, R is a language that is as rewarding as it is approachable, especially for statistical work. If you’re new to coding, you’ll find R’s logical syntax is perfect for beginners.
Our tutorials are tailored to gradually build your understanding, making R less daunting. The tutorials are focused on the tools needed for the Computer Assignments. However, you may feel you need to get a little more comfortable in R before you attempt the tutorials and the Computer Assignments. An R package called swirl may be just what you need to get started.
Setup and Use swirl
Setup R and RStudio (see links above)
Install swirl from the command line in R:
install.packages('swirl')
Load swirl:
library('swirl')
Run swirl:
swirl()
This will start the command line interactive tutorials of swirl. You will be instructed on exercises on the command line.
swirl Commands (available anytime during lessons):
skip()
— Skip the current questionplay()
— Experiment with R on your own (swirl ignores your actions)nxt()
— Return swirl’s attention after using play()bye()
— Exit swirl (progress is saved)main()
— Return to swirl’s main menuinfo()
— Display these options again
I recommend working through at least a few lessons to get familiarity with coding in R.
Alternative R Learning Resources
For curious students who want to explore R beyond the course requirements:
Base R
CRAN “An Introduction to R” (official manual): Canonical, up-to-date coverage of types, subsetting, vectorization, base graphics, and the modeling API. The authoritative source.
https://cran.r-project.org/doc/manuals/r-release/R-intro.html
Deep R Programming (free textbook): Modern, rigorous coverage of R semantics, performance, and numerical computing. Ideal after you’re comfortable with basics.
tidyverse
R for Data Science (2e, free online): The authoritative entry point to tidy data workflows (readr, dplyr, tidyr, ggplot2, purrr).
tidyverse.org: Core package documentation and philosophy.
R Markdown
R Markdown: The Definitive Guide: Complete reference for authoring and multi-format output.
R Markdown Cookbook: Task-oriented tips that save hours in practice.
Statistical Computing
STAT 545 (UBC): Practice-first course on modern workflows with excellent materials.
Modern Statistics for Modern Biology: R-based computational statistics for life sciences.
Video Resources
R Programming Full Course for Beginners (freeCodeCamp): Comprehensive 2-hour introduction covering basics through data analysis.
R Programming Tutorial - Learn the Basics (freeCodeCamp): Shorter focused tutorial on R fundamentals.
Assignment Tutorials (Links)
Note
Winter Session: Computer Assignments 1 and 2 are combined into a single assignment. Subsequent assignments are renumbered:
CA1 & CA2 → Combined Assignment 1
CA3 → Assignment 2
CA4 → Assignment 3
CA5 → Assignment 4
CA6 → Assignment 5
Computer Assignment 1 Tutorial — Intro & basics: CA1 Tutorial
Computer Assignment 2 Tutorial — Graphical & numerical summaries: CA2 Tutorial
Computer Assignment 3 Tutorial — Subsetting & basic hypothesis testing: CA3 Tutorial
Computer Assignment 4 Tutorial — Two-sample procedures: CA4 Tutorial
Computer Assignment 5 Tutorial — One-way ANOVA: CA5 Tutorial
Computer Assignment 6 Tutorial — Simple linear regression: CA6 Tutorial
Function Reference (Alphabetized within Category)
Data I/O & Housekeeping
getwd()
Purpose: Show current working directory Help: Type
?getwd
in R console Syntax:getwd()
Example showing the working directory:
> getwd() [1] "/Users/username/STAT350"
head(x, n = 6)
Purpose: Display first n rows of a data frame or elements of a vector Help: Type
?head
in R console Syntax:head(x, n = 6)
Common use: Quickly inspect data after importingUsing built-in iris dataset:
> data(iris) > head(iris, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosaSee also:
tail()
,str()
,summary()
read.csv(file, header = TRUE, stringsAsFactors = FALSE, …)
Purpose: Import CSV file into a data frame Help: Type
?read.csv
in R console Key arguments:
header
: First row contains column names (default TRUE)
stringsAsFactors
: Keep strings as characters (recommended FALSE)
na.strings
: Values to treat as NA (default “NA”)Common pitfalls:
Using absolute paths (breaks on other computers)
Forgetting to check for proper import with
str()
Not handling special characters in column names
Basic import:
> mydata <- read.csv("data/experiment.csv") > str(mydata) 'data.frame': 150 obs. of 5 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 ... $ group : chr "A" "B" "A" "C" ... $ measure1: num 23.5 19.2 25.1 22.8 20.3 ... $ measure2: num 45.2 41.8 46.7 44.3 42.1 ... $ outcome : chr "success" "failure" "success" ...With options for handling missing data:
mydata <- read.csv("data/experiment.csv", stringsAsFactors = FALSE, na.strings = c("NA", "N/A", ""))Always validate after import:
> sum(is.na(mydata)) [1] 12See also:
read.table()
,write.csv()
,str()
setwd(path)
Purpose: Change working directory Help: Type
?setwd
in R console Warning: Avoid using in scripts; use RStudio Projects insteadNot recommended in scripts:
# Bad practice - machine specific! # setwd("/Users/username/STAT350") # Better: Use RStudio Projects or relative paths
write.csv(x, file, row.names = FALSE)
Purpose: Export data frame to CSV file Help: Type
?write.csv
in R console Key arguments:
row.names
: Include row numbers (usually FALSE)
na
: String for missing values (default “NA”)Save cleaned data:
write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)
Data Structures & Creation
c(…)
Purpose: Combine values into a vector Help: Type
?c
in R consoleCreating different types of vectors:
> v1 <- c(1, 2, 3, 4, 5) > v2 <- c("iOS", "Android", "Windows") > v3 <- c(1, "two", 3) > class(v3) [1] "character"Note that mixed types get coerced to the most general type (character in this case).
colnames(x), rownames(x)
Purpose: Get or set column/row names Help: Type
?colnames
or?rownames
in R consoleGetting column names from built-in dataset:
> colnames(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"Renaming columns:
# Create a copy to modify mydata <- mtcars[1:5, 1:3] colnames(mydata) <- c("MPG", "Cylinders", "Displacement") # Or rename specific columns colnames(mydata)[1] <- "MilesPerGallon"
data.frame(…)
Purpose: Create a data frame from vectors Help: Type
?data.frame
in R console Key arguments:
stringsAsFactors
: Auto-convert strings to factors (set FALSE)Creating a data frame from scratch:
> df <- data.frame( + id = 1:5, + group = c("A", "A", "B", "B", "B"), + score = c(85, 90, 78, 82, 88), + stringsAsFactors = FALSE + ) > str(df) 'data.frame': 5 obs. of 3 variables: $ id : int 1 2 3 4 5 $ group: chr "A" "A" "B" "B" "B" $ score: num 85 90 78 82 88See also:
tibble::tibble()
for modern alternative
factor(x, levels = …, ordered = FALSE)
Purpose: Create categorical variables with defined levels Help: Type
?factor
in R console When to use: For categorical data in ANOVA, controlling plot orderBasic factor creation with automatic level detection:
> platform <- factor(c("iOS", "Android", "iOS", "Windows")) > levels(platform) [1] "Android" "iOS" "Windows"Controlling level order for plots and analyses:
platform <- factor(platform, levels = c("iOS", "Android", "Windows"))Creating ordered factors for ordinal data:
satisfaction <- factor(c("Low", "High", "Medium", "High"), levels = c("Low", "Medium", "High"), ordered = TRUE)Common pitfall: Converting numeric factors back to numbers
Wrong way (returns level indices):
> f <- factor(c("2", "4", "6")) > as.numeric(f) [1] 1 2 3Correct way:
> as.numeric(as.character(f)) [1] 2 4 6See also:
levels()
,relevel()
,droplevels()
Data Wrangling & Utilities
apply(X, MARGIN, FUN)
Purpose: Apply function over matrix/array margins Help: Type
?apply
in R console Key arguments:
MARGIN
: 1 for rows, 2 for columns
FUN
: Function to applyCreate a sample matrix:
> M <- matrix(1:12, nrow = 3) > M [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12Row means (MARGIN = 1):
> apply(M, 1, mean) [1] 5.5 6.5 7.5Column sums (MARGIN = 2):
> apply(M, 2, sum) [1] 6 15 24 33Custom function to calculate range:
> apply(M, 2, function(x) max(x) - min(x)) [1] 2 2 2 2See also:
lapply()
,sapply()
,tapply()
as.numeric(x)
Purpose: Convert to numeric type Help: Type
?as.numeric
in R console Common uses: Fix data imported as characters, convert factorsCharacter to numeric conversion:
> x <- c("1.5", "2.3", "3.1") > as.numeric(x) [1] 1.5 2.3 3.1Non-numeric values produce NAs with a warning:
> y <- c("1", "2", "three") > as.numeric(y) [1] 1 2 NA Warning message: NAs introduced by coercion
complete.cases(x)
Purpose: Identify rows with no missing values Help: Type
?complete.cases
in R console Returns: Logical vector (TRUE for complete rows)Create data with missing values:
> df <- data.frame( + x = c(1, 2, NA, 4), + y = c(5, NA, 7, 8) + ) > complete.cases(df) [1] TRUE FALSE FALSE TRUEFilter to complete cases only:
> df_clean <- df[complete.cases(df), ] > nrow(df_clean) [1] 2Count incomplete cases:
> sum(!complete.cases(df)) [1] 2See also:
is.na()
,na.omit()
ifelse(test, yes, no)
Purpose: Vectorized conditional operation Help: Type
?ifelse
in R consoleBasic pass/fail grading:
> score <- c(85, 72, 90, 68, 88) > grade <- ifelse(score >= 80, "Pass", "Fail") > grade [1] "Pass" "Fail" "Pass" "Fail" "Pass"Nested ifelse for letter grades:
grade <- ifelse(score >= 90, "A", ifelse(score >= 80, "B", ifelse(score >= 70, "C", "F")))Conditional calculations:
df$adjusted <- ifelse(df$group == "control", df$score * 1.1, df$score)
IQR(x, na.rm = FALSE)
Purpose: Calculate interquartile range (Q3 - Q1) Help: Type
?IQR
in R consoleWith missing values:
> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA) > IQR(x) [1] NA > IQR(x, na.rm = TRUE) [1] 4Compare with manual calculation from quantiles:
> quantile(x, c(0.25, 0.75), na.rm = TRUE) 25% 75% 3 7See also:
quantile()
,fivenum()
,boxplot.stats()
is.na(x)
Purpose: Test for missing values Help: Type
?is.na
in R consoleIdentifying NAs:
> x <- c(1, NA, 3, NA, 5) > is.na(x) [1] FALSE TRUE FALSE TRUE FALSECounting NAs:
> sum(is.na(x)) [1] 2Finding positions of NAs:
> which(is.na(x)) [1] 2 4Replacing NAs:
x[is.na(x)] <- 0See also:
complete.cases()
,na.omit()
,anyNA()
paste(…, sep = “ “, collapse = NULL)
Purpose: Concatenate strings Help: Type
?paste
in R console Key arguments:
sep
: Separator between elements
collapse
: Collapse vector to single stringBasic concatenation:
> paste("Mean:", 5.2) [1] "Mean: 5.2"Custom separator for dates:
> paste("2024", "01", "15", sep = "-") [1] "2024-01-15"Vectorized operation:
> paste("Group", 1:3) [1] "Group 1" "Group 2" "Group 3"Collapse to single string:
> paste(c("A", "B", "C"), collapse = ", ") [1] "A, B, C"No-space version with paste0:
> paste0("var", 1:3) [1] "var1" "var2" "var3"
quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)
Purpose: Calculate sample quantiles Help: Type
?quantile
in R consoleDefault quartiles:
> x <- c(1:10, NA) > quantile(x, na.rm = TRUE) 0% 25% 50% 75% 100% 1.0 3.0 5.5 8.0 10.0Custom percentiles:
> quantile(x, probs = c(0.1, 0.9), na.rm = TRUE) 10% 90% 1.9 9.1See also:
median()
,IQR()
,fivenum()
sapply(X, FUN), lapply(X, FUN)
Purpose: Apply function over list/vector elements Help: Type
?sapply
or?lapply
in R console Differences:
lapply
: Always returns a list
sapply
: Simplifies to vector/matrix if possibleCreate a list of vectors:
> lst <- list(a = 1:5, b = 6:10, c = 11:15)sapply simplifies to vector:
> sapply(lst, mean) a b c 3 8 13lapply returns list:
> lapply(lst, mean) $a [1] 3 $b [1] 8 $c [1] 13Check multiple columns for NAs:
sapply(df, function(x) sum(is.na(x)))
tapply(X, INDEX, FUN)
Purpose: Apply function by group Help: Type
?tapply
in R consoleGroup means using built-in dataset:
> tapply(iris$Sepal.Length, iris$Species, mean) setosa versicolor virginica 5.006 5.936 6.588Multiple statistics per group:
tapply(iris$Sepal.Length, iris$Species, function(x) c(mean = mean(x), sd = sd(x)))See also:
aggregate()
,by()
Descriptive Statistics & Correlation
cor(x, y = NULL, use = “everything”, method = “pearson”)
Purpose: Calculate correlation coefficient Help: Type
?cor
in R console Key arguments:
use
: How to handle NAs (“complete.obs” drops them)
method
: “pearson” (default), “spearman”, “kendall”Perfect positive correlation:
> x <- c(1, 2, 3, 4, 5) > y <- c(2, 4, 6, 8, 10) > cor(x, y) [1] 1Using built-in dataset:
> cor(mtcars$mpg, mtcars$wt) [1] -0.8676594Correlation matrix:
> cor(mtcars[,c("mpg", "wt", "hp")]) mpg wt hp mpg 1.0000000 -0.8676594 -0.7761684 wt -0.8676594 1.0000000 0.6587479 hp -0.7761684 0.6587479 1.0000000Interpretation guide:
-1 to -0.7: Strong negative
-0.7 to -0.3: Moderate negative
-0.3 to 0.3: Weak/no linear relationship
0.3 to 0.7: Moderate positive
0.7 to 1: Strong positive
Test for significance:
cor.test(mtcars$mpg, mtcars$wt)See also:
cor.test()
,cov()
mean(x, trim = 0, na.rm = FALSE)
Purpose: Calculate arithmetic mean Help: Type
?mean
in R console Key arguments:
na.rm
: Remove NAs before calculation
trim
: Fraction to trim from each endBasic usage with NAs:
> x <- c(1, 2, 3, 4, 5, NA) > mean(x) [1] NA > mean(x, na.rm = TRUE) [1] 3Trimmed mean (robust to outliers):
> y <- c(1, 2, 3, 4, 100) > mean(y) [1] 22 > mean(y, trim = 0.2) [1] 3The trimmed mean removes 20% from each end before calculating.
See also:
median()
,summary()
median(x, na.rm = FALSE)
Purpose: Calculate median (middle value) Help: Type
?median
in R consoleOdd number of values:
> median(c(1, 3, 5)) [1] 3Even number of values:
> median(c(1, 2, 3, 4)) [1] 2.5Median is more robust than mean for skewed data:
> x <- c(20, 25, 30, 35, 200) > mean(x) [1] 62 > median(x) [1] 30
sd(x, na.rm = FALSE)
Purpose: Calculate sample standard deviation Help: Type
?sd
in R console Note: Uses n-1 denominator (sample SD)> x <- c(2, 4, 4, 4, 5, 5, 7, 9) > sd(x) [1] 2.13809 > var(x) [1] 4.571429The variance equals the square of the standard deviation.
Calculate coefficient of variation (relative variability):
> cv <- sd(x) / mean(x) * 100 > paste("CV:", round(cv, 1), "%") [1] "CV: 42.8 %"See also:
var()
,mad()
(robust alternative)
Probability & Distributions
Normal Distribution Functions
dnorm(x, mean = 0, sd = 1)
Purpose: Normal probability density Help: Type
?dnorm
in R console Use cases: Overlay theoretical curves, calculate likelihoodsStandard normal density at x = 0:
> dnorm(0) [1] 0.3989423Custom parameters:
> dnorm(100, mean = 100, sd = 15) [1] 0.02659615Use in ggplot2 for overlaying normal curve:
ggplot(data.frame(x = x), aes(x)) + geom_histogram(aes(y = after_stat(density))) + stat_function(fun = dnorm, args = list(mean = mean(x), sd = sd(x)), color = "red")pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
Purpose: Normal cumulative distribution (probability) Help: Type
?pnorm
in R consoleStandard normal probabilities:
> pnorm(1.96) [1] 0.9750021 > pnorm(1.96, lower.tail = FALSE) [1] 0.0249979Two-tailed p-value:
> 2 * pnorm(-abs(1.96)) [1] 0.04999579qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
Purpose: Normal quantiles (inverse CDF) Help: Type
?qnorm
in R consoleCritical values:
> qnorm(0.975) [1] 1.959964 > qnorm(0.95) [1] 1.644854The first gives the two-tailed 95% critical value, the second gives one-tailed.
Calculate confidence interval:
xbar <- 100; s <- 15; n <- 25 xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)rnorm(n, mean = 0, sd = 1)
Purpose: Generate random normal values Help: Type
?rnorm
in R consoleset.seed(123) x <- rnorm(100, mean = 50, sd = 10) # Visualize with ggplot2 library(ggplot2) ggplot(data.frame(x = x), aes(x = x)) + geom_histogram(bins = 20) + ggtitle("Sample from Normal(50, 10)")
t Distribution Functions
pt(q, df, lower.tail = TRUE)
Purpose: Student’s t cumulative distribution Help: Type
?pt
in R console Use cases: Calculate p-values for t-testsOne-sided p-value:
> t_stat <- 2.5 > df <- 24 > pt(t_stat, df, lower.tail = FALSE) [1] 0.009963466Two-sided p-value:
> 2 * pt(-abs(t_stat), df) [1] 0.01992693qt(p, df, lower.tail = TRUE)
Purpose: Student’s t quantiles Help: Type
?qt
in R console Use cases: Critical values for confidence intervals95% CI critical value:
> qt(0.975, df = 24) [1] 2.063899Compare to normal:
> qnorm(0.975) [1] 1.959964The t critical value is larger due to heavier tails.
See also:
dt()
,rt()
Other Distributions
dbinom(x, size, prob)
Purpose: Binomial probability mass function Help: Type
?dbinom
in R consoleProbability of exactly 3 successes in 10 trials with p = 0.5:
> dbinom(3, size = 10, prob = 0.5) [1] 0.1171875Full distribution:
x <- 0:10 p <- dbinom(x, size = 10, prob = 0.5) # Visualize with ggplot2 library(ggplot2) ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) + geom_col() + scale_x_continuous(breaks = 0:10) + ggtitle("Binomial(10, 0.5) Distribution")qtukey(p, nmeans, df)
Purpose: Tukey HSD critical values Help: Type
?qtukey
in R consoleCritical value for 3 groups with df = 27:
> qtukey(0.95, nmeans = 3, df = 27) [1] 3.506426
Simulation Functions
set.seed(seed)
Purpose: Set random number generator seed for reproducibility Help: Type
?set.seed
in R consoleWithout seed (different each time):
> rnorm(3) [1] 0.3186301 -0.5817907 0.7145327 > rnorm(3) [1] -0.8252594 -0.3598138 -0.0109303With seed (reproducible):
> set.seed(123) > rnorm(3) [1] -0.5604756 -0.2301775 1.5587083 > set.seed(123) > rnorm(3) [1] -0.5604756 -0.2301775 1.5587083
Random Generation Functions
All random generation functions take
n
(number of values) as the first argument.set.seed(1) rnorm(5, mean = 100, sd = 15) # Normal runif(5, min = 0, max = 10) # Uniform rexp(5, rate = 2) # ExponentialFor simulation studies:
n_sims <- 1000 means <- replicate(n_sims, mean(rnorm(30))) # Visualize sampling distribution with ggplot2 library(ggplot2) ggplot(data.frame(means = means), aes(x = means)) + geom_histogram(bins = 30) + ggtitle("Sampling Distribution of Mean")
Inference Functions
t.test(x, y = NULL, alternative = “two.sided”, mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
Purpose: One and two sample t-tests Help: Type
?t.test
in R console When to use: Compare means between 1-2 groups Key arguments:
alternative
: “two.sided”, “less”, “greater”
mu
: Null hypothesis value (one-sample)
paired
: TRUE for matched pairs
var.equal
: TRUE assumes equal variancesCommon pitfalls:
Using paired = TRUE with independent samples
Not checking normality assumptions
Forgetting var.equal = FALSE for unequal variances
One-sample test:
> x <- c(98, 102, 101, 97, 103, 100, 99, 98, 104, 101) > t.test(x, mu = 100) One Sample t-test data: x t = 0.63246, df = 9, p-value = 0.5425 alternative hypothesis: true mean is not equal to 100 95 percent confidence interval: 98.41533 102.18467 sample estimates: mean of x 100.3Two-sample independent test using built-in data:
> data(sleep) > t.test(extra ~ group, data = sleep, var.equal = FALSE) Welch Two Sample t-test data: extra by group t = -1.8608, df = 17.776, p-value = 0.07939 alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0 95 percent confidence interval: -3.3654832 0.2054832 sample estimates: mean in group 1 mean in group 2 0.75 2.33Paired t-test - Three equivalent approaches:
Method 1 - Using paired = TRUE with two vectors:
before <- c(85, 88, 90, 92, 87) after <- c(88, 90, 91, 94, 89) t.test(before, after, paired = TRUE)Method 2 - Convert to one-sample test of differences:
differences <- after - before t.test(differences, mu = 0)Method 3 - Formula interface with paired data:
> # Using sleep dataset (paired by ID) > t.test(Pair(extra[group==1], extra[group==2]) ~ 1, data = sleep) Paired t-test data: Pair(extra[group == 1], extra[group == 2]) t = -4.0621, df = 9, p-value = 0.002833 alternative hypothesis: true mean difference is not equal to 0 95 percent confidence interval: -2.4598858 -0.7001142 sample estimates: mean difference -1.58Extract components from results:
> result <- t.test(x, mu = 100) > result$p.value [1] 0.5424805 > result$conf.int [1] 98.41533 102.18467 attr(,"conf.level") [1] 0.95 > result$estimate mean of x 100.3Checking assumptions:
Normality check:
# Shapiro-Wilk test shapiro.test(x) # Visual check with ggplot2 library(ggplot2) ggplot(data.frame(x = x), aes(sample = x)) + stat_qq() + stat_qq_line()Equal variances check:
# F-test var.test(x, y) # Visual check ggplot(data.frame(value = c(x, y), group = rep(c("X", "Y"), c(length(x), length(y)))), aes(x = group, y = value)) + geom_boxplot()See also:
wilcox.test()
(non-parametric alternative)
aov(formula, data)
Purpose: Analysis of variance (ANOVA) Help: Type
?aov
in R console When to use: Compare means across 3+ groupsOne-way ANOVA:
> fit <- aov(score ~ group, data = df) > summary(fit) Df Sum Sq Mean Sq F value Pr(>F) group 2 520.3 260.15 8.214 0.0019 ** Residuals 27 855.2 31.67 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Check assumptions with diagnostic plots:
plot(fit, which = 1) # Residuals vs Fitted plot(fit, which = 2) # Normal Q-QTest for equal variances:
bartlett.test(score ~ group, data = df)Post-hoc comparisons:
TukeyHSD(fit) plot(TukeyHSD(fit))See also:
anova()
,TukeyHSD()
,kruskal.test()
TukeyHSD(x, conf.level = 0.95)
Purpose: Tukey’s honest significant differences Help: Type
?TukeyHSD
in R console When to use: After significant ANOVA for pairwise comparisonsFollowing ANOVA:
> fit <- aov(score ~ group, data = df) > tukey_result <- TukeyHSD(fit) > tukey_result Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = score ~ group, data = df) $group diff lwr upr p adj B-A -5.333 -12.20271 1.536038 0.1518711 C-A 3.167 -3.70271 10.036038 0.5019442 C-B 8.500 1.63062 15.369376 0.0125506Visual display:
plot(tukey_result)Confidence intervals not containing 0 indicate significant differences.
lm(formula, data)
Purpose: Fit linear models Help: Type
?lm
in R console When to use: Regression analysis, ANOVA via regressionSimple linear regression:
> mod <- lm(rating ~ price, data = df) > summary(mod) Call: lm(formula = rating ~ price, data = df) Residuals: Min 1Q Median 3Q Max -2.83450 -0.83450 -0.03077 0.78231 2.70308 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.1234 0.2145 9.897 2.13e-10 *** price 0.5432 0.0532 10.211 9.77e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.234 on 28 degrees of freedom Multiple R-squared: 0.7882, Adjusted R-squared: 0.7806 F-statistic: 104.3 on 1 and 28 DF, p-value: 9.77e-11Extract components:
coef(mod) # Coefficients confint(mod) # CI for coefficients residuals(mod) # Residuals fitted(mod) # Fitted valuesDiagnostic plots:
plot(mod) # Produces 4 diagnostic plotsANOVA table:
anova(mod)See also:
summary.lm()
,predict.lm()
,plot.lm()
predict(object, newdata, interval = “none”, level = 0.95)
Purpose: Predictions from model fits Help: Type
?predict.lm
in R console Key arguments:
interval
: “confidence” (mean response) or “prediction” (new obs)
level
: Confidence levelCritical distinction:
Confidence interval: Uncertainty about mean response
Prediction interval: Uncertainty about individual response (wider)
Fit model first:
mod <- lm(rating ~ price, data = df)Point prediction:
> new_price <- data.frame(price = 50) > predict(mod, new_price) 1 29.28571Confidence interval (mean rating at price = 50):
> predict(mod, new_price, interval = "confidence") fit lwr upr 1 29.28571 28.47933 30.09209Prediction interval (individual rating at price = 50):
> predict(mod, new_price, interval = "prediction") fit lwr upr 1 29.28571 26.89414 31.67728Note that prediction intervals are always wider than confidence intervals.
Multiple predictions for plotting:
new_prices <- data.frame( price = seq(min(df$price), max(df$price), length = 100) ) ci <- predict(mod, new_prices, interval = "confidence") pi <- predict(mod, new_prices, interval = "prediction")Plot with bands:
plot(rating ~ price, data = df) abline(mod) lines(new_prices$price, ci[,"lwr"], lty = 2, col = "blue") lines(new_prices$price, ci[,"upr"], lty = 2, col = "blue") lines(new_prices$price, pi[,"lwr"], lty = 2, col = "red") lines(new_prices$price, pi[,"upr"], lty = 2, col = "red")See also:
fitted()
,residuals()
Diagnostic Plots for Assumptions
For t-test (normality of residuals):
library(ggplot2) x <- residuals(lm(score ~ group, data = df)) # Q-Q plot ggplot(data.frame(sample = x), aes(sample = sample)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q Plot of Residuals") # Formal test shapiro.test(x)For ANOVA:
fit <- aov(score ~ group, data = df) # Residuals vs Fitted ggplot(data.frame(fitted = fitted(fit), residuals = residuals(fit)), aes(x = fitted, y = residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + ggtitle("Residuals vs Fitted") # Q-Q plot of residuals ggplot(data.frame(sample = residuals(fit)), aes(sample = sample)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q")For regression (full diagnostic panel):
mod <- lm(rating ~ price, data = df) # Prepare data for all plots diag_data <- data.frame( fitted = fitted(mod), residuals = residuals(mod), std_resid = rstandard(mod), sqrt_std_resid = sqrt(abs(rstandard(mod))), cooks = cooks.distance(mod), leverage = hatvalues(mod) ) # Create four diagnostic plots library(gridExtra) p1 <- ggplot(diag_data, aes(x = fitted, y = residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + geom_smooth(se = FALSE) + ggtitle("Residuals vs Fitted") p2 <- ggplot(diag_data, aes(sample = std_resid)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q") p3 <- ggplot(diag_data, aes(x = fitted, y = sqrt_std_resid)) + geom_point() + geom_smooth(se = FALSE) + ggtitle("Scale-Location") p4 <- ggplot(diag_data, aes(x = leverage, y = std_resid)) + geom_point() + geom_smooth(se = FALSE) + ggtitle("Residuals vs Leverage") grid.arrange(p1, p2, p3, p4, ncol = 2)Pattern interpretation:
Random scatter: Good
Funnel shape: Heteroscedasticity
Curve: Non-linearity
Outliers: Influential points
Graphics (ggplot2)
Important
This course uses ggplot2 exclusively for all plotting and visualization. While base R has plotting functions like plot(), hist(), and boxplot(), we will only use ggplot2’s grammar of graphics approach. All assignments and exams will expect ggplot2 syntax for graphs.
Core Components
ggplot(data, aes(…))
Purpose: Initialize a ggplot object Help: Type
?ggplot
in R console Structure: Data + aesthetic mappings + layersBasic setup creates blank plot with axes:
library(ggplot2) ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))Add layers with +:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()Save for reuse:
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) p + geom_boxplot() p + geom_violin()
aes(…)
Purpose: Map variables to visual properties Help: Type
?aes
in R console Common aesthetics: x, y, color, fill, shape, size, alphaPosition aesthetics:
aes(x = Sepal.Length, y = Sepal.Width)Color by group:
aes(x = Sepal.Length, y = Sepal.Width, color = Species)Multiple aesthetics:
aes(x = wt, y = mpg, color = factor(cyl), size = hp)Can set in ggplot() or individual geoms:
ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width))
Geoms (Geometric Objects)
geom_histogram(binwidth, bins, aes(y = after_stat(…)))
Purpose: Create histograms Help: Type
?geom_histogram
in R console Key arguments:
bins
: Number of bins (default 30)
binwidth
: Width of bins
aes(y = after_stat(density))
: Scale to densityImportant
In this course, all histograms must include:
Density scaling using
aes(y = after_stat(density))
A kernel density curve (red) using
geom_density()
A normal curve (blue) using
stat_function()
This allows visual comparison of the data distribution with theoretical distributions.
Standard histogram format for this course:
# Calculate mean and sd first xbar <- mean(data$variable) s <- sd(data$variable) # Create histogram with required overlays ggplot(data, aes(x = variable)) + geom_histogram(aes(y = after_stat(density)), bins = 24, fill = "grey", col = "black") + geom_density(col = "red", linewidth = 1) + stat_function(fun = dnorm, args = list(mean = xbar, sd = s), col = "blue", linewidth = 1) + ggtitle("Histogram with Density Curves")Adding mean and median lines (often required):
# Calculate statistics xbar <- mean(data$variable) xtilde <- median(data$variable) s <- sd(data$variable) # Full histogram with all required elements ggplot(data, aes(x = variable)) + geom_histogram(aes(y = after_stat(density)), bins = 24, fill = "grey", col = "black") + geom_vline(xintercept = xbar, color = "purple", linetype = "dotted", linewidth = 1) + geom_vline(xintercept = xtilde, color = "orange", linetype = "dashed", linewidth = 1) + geom_density(col = "red", linewidth = 1) + stat_function(fun = dnorm, args = list(mean = xbar, sd = s), col = "blue", linewidth = 1) + ggtitle("Complete Histogram for Course")By group:
ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_histogram(aes(y = after_stat(density)), position = "dodge", bins = 15) + facet_wrap(~ Species) + ggtitle("Histograms by Group")
geom_boxplot(outlier.shape, varwidth)
Purpose: Box and whisker plots Help: Type
?geom_boxplot
in R consoleBasic boxplot:
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()Horizontal orientation:
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot() + coord_flip()With individual points:
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2, alpha = 0.5)Variable width by sample size:
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(varwidth = TRUE)
geom_point(), geom_line()
Purpose: Scatterplots and line plots Help: Type
?geom_point
or?geom_line
Basic scatterplot:
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()Customize points:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point(size = 3, alpha = 0.7)Add regression line:
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red")Time series (using built-in dataset):
# Convert to time series format data(AirPassengers) ts_data <- data.frame( date = as.Date(time(AirPassengers)), value = as.numeric(AirPassengers) ) ggplot(ts_data, aes(x = date, y = value)) + geom_line() + geom_point()
geom_smooth(method, formula, se)
Purpose: Add smoothed conditional means Help: Type
?geom_smooth
in R console Key arguments:
method
: “lm”, “loess”, “gam”
se
: Show confidence band (TRUE/FALSE)Linear regression with confidence band:
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm")Without confidence band:
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE)By group:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point() + geom_smooth(method = "lm")
Faceted histograms with required density overlays and smooth kernel:
# Calculate group-specific statistics
xbar <- tapply(iris$Sepal.Length, iris$Species, mean)
s <- tapply(iris$Sepal.Length, iris$Species, sd)
# Add normal density values to the dataset
iris$normal.density <- ifelse(iris$Species == "setosa",
dnorm(iris$Sepal.Length, xbar["setosa"], s["setosa"]),
ifelse(iris$Species == "versicolor",
dnorm(iris$Sepal.Length, xbar["versicolor"], s["versicolor"]),
dnorm(iris$Sepal.Length, xbar["virginica"], s["virginica"])))
# Create faceted histogram with overlays
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(aes(y = after_stat(density)),
bins = 15, fill = "grey", col = "black") +
geom_density(col = "red", linewidth = 1) +
geom_line(aes(y = normal.density), col = "blue", linewidth = 1) +
facet_wrap(~ Species) +
ggtitle("Distribution of Sepal Length by Species")
# For datasets with only two groups (like CO2), the approach is simpler:
.. code-block:: r
# Two-group example using CO2 dataset
xbar <- tapply(CO2$uptake, CO2$Type, mean)
s <- tapply(CO2$uptake, CO2$Type, sd)
CO2$normal.density <- ifelse(CO2$Type == "Quebec",
dnorm(CO2$uptake, xbar["Quebec"], s["Quebec"]),
dnorm(CO2$uptake, xbar["Mississippi"], s["Mississippi"]))
ggplot(CO2, aes(x = uptake)) +
geom_histogram(aes(y = after_stat(density)),
bins = 20, fill = "grey", col = "black") +
geom_density(col = "red", linewidth = 1) +
geom_line(aes(y = normal.density), col = "blue", linewidth = 1) +
facet_grid(. ~ Type) +
ggtitle("Distribution of Uptake by Location Type")
stat_qq(), stat_qq_line()
Purpose: Q-Q plots for normality assessment Help: Type
?stat_qq
in R consoleBasic Q-Q plot:
ggplot(data.frame(sample = rnorm(100)), aes(sample = sample)) + stat_qq() + stat_qq_line()For residuals:
mod <- lm(mpg ~ wt, data = mtcars) res_df <- data.frame(residuals = residuals(mod)) ggplot(res_df, aes(sample = residuals)) + stat_qq() + stat_qq_line(color = "red") + labs(title = "Normal Q-Q Plot of Residuals")
geom_ribbon(aes(ymin, ymax))
Purpose: Area plots with upper/lower bounds Help: Type
?geom_ribbon
in R console Use case: Custom confidence/prediction bandsSetup for regression bands:
mod <- lm(mpg ~ wt, data = mtcars) new_data <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length = 100)) pred_ci <- predict(mod, new_data, interval = "confidence") pred_pi <- predict(mod, new_data, interval = "prediction") plot_data <- cbind(new_data, ci_fit = pred_ci[,"fit"], ci_lwr = pred_ci[,"lwr"], ci_upr = pred_ci[,"upr"], pi_lwr = pred_pi[,"lwr"], pi_upr = pred_pi[,"upr"])Plot with bands:
ggplot(plot_data, aes(x = wt)) + geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr), fill = "grey90") + geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey70") + geom_line(aes(y = ci_fit), color = "blue", size = 1) + geom_point(data = mtcars, aes(y = mpg)) + labs(title = "Regression with CI and PI bands")
Categorical Data Visualization
geom_bar(), geom_col()
Purpose: Bar charts for categorical data Help: Type
?geom_bar
or?geom_col
in R console
geom_bar()
: Counts categories (like table())
geom_col()
: Uses pre-computed valuesCount frequencies with geom_bar():
# Frequency bar chart ggplot(mtcars, aes(x = factor(cyl))) + geom_bar() + xlab("Number of Cylinders") + ylab("Count") + ggtitle("Distribution of Cylinder Counts")Pre-computed values with geom_col():
# Create summary data cyl_means <- aggregate(mpg ~ cyl, data = mtcars, mean) ggplot(cyl_means, aes(x = factor(cyl), y = mpg)) + geom_col(fill = "steelblue") + xlab("Number of Cylinders") + ylab("Mean MPG") + ggtitle("Average MPG by Cylinder Count")Grouped bar chart:
ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) + geom_bar(position = "dodge") + xlab("Cylinders") + labs(fill = "Transmission") + scale_fill_discrete(labels = c("Automatic", "Manual"))Stacked bar chart (proportions):
ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) + geom_bar(position = "fill") + ylab("Proportion") + labs(fill = "Transmission")Note
For pie charts, while possible in ggplot2 using
coord_polar()
, they are generally discouraged in statistical visualization. Bar charts are preferred for comparing categorical proportions.
Plot Customization
Labels and Titles
All labels at once:
p + labs(title = "Main Title", subtitle = "Subtitle", x = "X-axis label", y = "Y-axis label", caption = "Data source: ...")Individual functions:
p + ggtitle("Main Title") + xlab("X-axis") + ylab("Y-axis")Math expressions:
library(latex2exp) p + xlab(TeX("$\\beta_1$")) + ylab(TeX("$\\hat{y}$"))
Themes
Built-in themes:
p + theme_minimal() p + theme_classic() p + theme_bw()Customize elements:
p + theme( axis.text = element_text(size = 12), axis.title = element_text(size = 14, face = "bold"), plot.title = element_text(size = 16, hjust = 0.5), legend.position = "bottom" )
Faceting
Separate panels by variable:
ggplot(df, aes(x = score)) + geom_histogram() + facet_wrap(~ group)Grid layout:
ggplot(df, aes(x = score)) + geom_histogram() + facet_grid(platform ~ group)
Tables & Reporting
knitr::kable(x, caption, digits, format)
Purpose: Create formatted tables Help: Type
?knitr::kable
in R consoleBasic table using built-in data:
library(knitr) library(kableExtra) # Summary statistics by species summary_table <- aggregate(Sepal.Length ~ Species, data = iris, FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))) kable(summary_table, caption = "Summary Statistics by Species", digits = 2)Enhanced styling:
kable(summary_table, caption = "Iris Summary by Species") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% column_spec(1, bold = TRUE)
gridExtra::grid.arrange(…)
Purpose: Arrange multiple plots Help: Type
?gridExtra::grid.arrange
Create and arrange plots:
library(gridExtra) p1 <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 20) + ggtitle("Distribution of Sepal Length") p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot() + ggtitle("Sepal Length by Species") grid.arrange(p1, p2, ncol = 2)Different layouts:
grid.arrange(p1, p2, p1, p2, ncol = 2, nrow = 2, top = "Four Panel Display")
Best Practices & Common Pitfalls
Standard validation workflow:
d <- read.csv("data/file.csv")
str(d) # Structure check
head(d); tail(d) # First/last rows
summary(d) # Summary statistics
sapply(d, function(x) sum(is.na(x))) # Missing values
sapply(d, class) # Data types
Before t-tests/ANOVA:
# Normality
hist(x)
qqnorm(x); qqline(x)
shapiro.test(x)
# Equal variances
var.test(x, y)
boxplot(score ~ group, data = df)
Before regression:
mod <- lm(y ~ x, data = df)
par(mfrow = c(2,2))
plot(mod)
par(mfrow = c(1,1))
Check for:
Linearity (Residuals vs Fitted)
Normality (Q-Q plot)
Homoscedasticity (Scale-Location)
Influential points (Residuals vs Leverage)
Factor to numeric conversion
Wrong (returns level indices):
> f <- factor(c("2", "4", "6")) > as.numeric(f) [1] 1 2 3
Correct:
> as.numeric(as.character(f)) [1] 2 4 6
Missing data handling
Wrong (silently propagates NA):
> mean(c(1, 2, NA, 4)) [1] NA
Correct:
> mean(c(1, 2, NA, 4), na.rm = TRUE) [1] 2.333333
Confidence vs Prediction intervals
# CI: Where is the mean? predict(mod, newdata, interval = "confidence") # PI: Where is a new observation? predict(mod, newdata, interval = "prediction")
Prediction intervals are always wider!
Multiple testing without adjustment
After ANOVA, use Tukey HSD:
TukeyHSD(aov_fit)
This adjusts for multiple comparisons automatically.
# 1. Setup
library(ggplot2)
library(knitr)
set.seed(123)
# 2. Import and validate
d <- read.csv("data/file.csv", stringsAsFactors = FALSE)
str(d)
summary(d)
# 3. Clean
d_clean <- d[complete.cases(d), ]
# 4. Explore
ggplot(d_clean, aes(x = x, y = y)) +
geom_point() +
theme_minimal()
# 5. Model
mod <- lm(y ~ x, data = d_clean)
# 6. Diagnose
plot(mod, which = 1:2)
# 7. Report
summary(mod)
kable(tidy(mod), digits = 3)
Appendix: Quick Links
Setup
Assignments
CA1 Tutorial — R/RStudio basics
CA2 Tutorial — EDA and summaries
CA3 Tutorial — Hypothesis testing
CA4 Tutorial — Two-sample procedures
CA5 Tutorial — ANOVA
CA6 Tutorial — Linear regression
Getting Help
In R:
?function_name
orhelp(function_name)
Examples:
example(function_name)
Online: https://www.rdocumentation.org/
Stack Overflow: https://stackoverflow.com/questions/tagged/r