R / RStudio Guide and Function Reference

Overview

This document is a comprehensive reference for the primary R tools used in STAT 350. It introduces R and RStudio, outlines the workflow we use in class, and provides an enhanced reference to the functions, libraries, and plotting layers that appear across the computer assignments and worksheets. The emphasis is on: loading data; validating and cleaning; visualizing; computing summaries; fitting models; checking assumptions; and performing inference with well-annotated code.

Use this as a companion to the assignment tutorials (linked below).

Quick Reference Table

Table 10: Common Tasks Quick Reference

Task               | Function(s)                 | Example
-------------------|-----------------------------|--------------------------------------------
Import CSV         | read.csv()                  | d <- read.csv('data/AppRating.csv')
Check structure    | str(), head()               | str(d); head(d, 10)
Remove NAs         | complete.cases()            | d_clean <- d[complete.cases(d), ]
Basic summary      | summary()                   | summary(d$score)
Create histogram   | ggplot() + geom_histogram() | ggplot(d, aes(x=score)) + geom_histogram()
One-sample t-test  | t.test()                    | t.test(d$score, mu = 3)
Two-sample t-test  | t.test()                    | t.test(score ~ platform, data = d)
One-way ANOVA      | aov()                       | fit <- aov(score ~ group, data = d)
Linear regression  | lm()                        | mod <- lm(rating ~ price, data = d)
Check residuals    | plot()                      | plot(mod, which = 1:2)

Quick Start: R / RStudio Setup

Course Pipeline (At a Glance)

  1. Import: read.csv; inspect with head, str, summary.

  2. Validate & clean: missing data (is.na, complete.cases), types (as.numeric, factor), quick checks (length, nrow, unique).

  3. Explore: core summaries (mean, median, sd, quantile, IQR), plots (ggplot2 histograms/boxplots).

  4. Model: one/two-sample t; ANOVA; SLR via lm; compute p-values/intervals; diagnostics (residual plots, QQ).

  5. Report: figures/tables (ggplot2, knitr::kable + kableExtra), short text with context.

Packages / Libraries (Course Set)

  • ggplot2 — Grammar of Graphics. Histograms, boxplots, QQ, scatter + fit lines, ribbons for intervals.

  • grid, gridExtra — Plot/table layout (e.g., arrange multiple graphics on a page).

  • kableExtra — Styling for knitr::kable tables (borders, alignment).

  • knitr — Report tooling; kable produces clean tables.

  • latex2exp — TeX() for LaTeX-style math in plot labels.

  • magrittr — Pipe operator %>% to chain steps (see the short example after this list).

  • stats — Inference & models: aov, anova, lm, t.test, TukeyHSD, distributions (dnorm, qnorm, pt, qt), etc.

  • utils — I/O like read.csv, write.csv.
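
A minimal illustration of chaining with %>%, using the built-in iris data:

library(magrittr)

# Take iris, keep one species, then summarize the result
iris %>%
  subset(Species == "setosa") %>%
  summary()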

Tip

Install from CRAN if needed:

install.packages(c("ggplot2","knitr","kableExtra","latex2exp","gridExtra","magrittr"))

RStudio Orientation

  • Console (executes statements), Source/Editor (scripts, Rmd), Environment (objects), Plots/Help (graphics, docs).

  • Use Projects to lock working directory. Prefer relative paths and never hardcode machine-specific directories mid-script.

  • Keep code modular. Consider separate chunks for import, cleaning, EDA, modeling, diagnostics, and reporting (see the sketch below).
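
One way to lay out those chunks in an R Markdown file; the file name, column names, and chunk names here are placeholders:

```{r import}
d <- read.csv("data/file.csv")    # placeholder path
```

```{r clean}
d_clean <- d[complete.cases(d), ]
```

```{r model}
mod <- lm(y ~ x, data = d_clean)  # assumes columns y and x
```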

Getting Started with swirl

If you have never coded before, the computer assignments may feel overwhelming at first. However, R is as rewarding as it is approachable, especially for statistical work, and its consistent syntax is friendly to beginners.

Our tutorials are tailored to gradually build your understanding, making R less daunting. The tutorials are focused on the tools needed for the Computer Assignments. However, you may feel you need to get a little more comfortable in R before you attempt the tutorials and the Computer Assignments. An R package called swirl may be just what you need to get started.

Setup and Use swirl

  1. Setup R and RStudio (see links above)

  2. Install swirl from the command line in R:

    install.packages('swirl')
    
  3. Load swirl:

    library('swirl')
    
  4. Run swirl:

    swirl()
    

This starts swirl's interactive tutorials, which walk you through exercises directly in the console.

swirl Commands (available anytime during lessons):

  • skip() — Skip the current question

  • play() — Experiment with R on your own (swirl ignores your actions)

  • nxt() — Return swirl’s attention after using play()

  • bye() — Exit swirl (progress is saved)

  • main() — Return to swirl’s main menu

  • info() — Display these options again

I recommend working through at least a few lessons to get familiar with coding in R.

Alternative R Learning Resources

For curious students who want to explore R beyond the course requirements:

Base R

tidyverse

  • R for Data Science (2e, free online): The authoritative entry point to tidy data workflows (readr, dplyr, tidyr, ggplot2, purrr).

    https://r4ds.hadley.nz

  • tidyverse.org: Core package documentation and philosophy.

    https://www.tidyverse.org

R Markdown

Statistical Computing

Video Resources

Function Reference (Alphabetized within Category)

Data I/O & Housekeeping

getwd()

Purpose: Show current working directory
Help: Type ?getwd in R console
Syntax: getwd()

Example showing the working directory:

> getwd()
[1] "/Users/username/STAT350"

head(x, n = 6)

Purpose: Display first n rows of a data frame or elements of a vector
Help: Type ?head in R console
Syntax: head(x, n = 6)
Common use: Quickly inspect data after importing

Using built-in iris dataset:

> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

See also: tail(), str(), summary()

read.csv(file, header = TRUE, stringsAsFactors = FALSE, ...)

Purpose: Import CSV file into a data frame
Help: Type ?read.csv in R console
Key arguments:

  • header: First row contains column names (default TRUE)

  • stringsAsFactors: Keep strings as characters (recommended FALSE)

  • na.strings: Values to treat as NA (default "NA")

Common pitfalls:

  • Using absolute paths (breaks on other computers)

  • Forgetting to check for proper import with str()

  • Not handling special characters in column names

Basic import:

> mydata <- read.csv("data/experiment.csv")
> str(mydata)
'data.frame':     150 obs. of  5 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ group   : chr  "A" "B" "A" "C" ...
 $ measure1: num  23.5 19.2 25.1 22.8 20.3 ...
 $ measure2: num  45.2 41.8 46.7 44.3 42.1 ...
 $ outcome : chr  "success" "failure" "success" ...

With options for handling missing data:

mydata <- read.csv("data/experiment.csv",
                  stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A", ""))

Always validate after import:

> sum(is.na(mydata))
[1] 12

See also: read.table(), write.csv(), str()

setwd(path)

Purpose: Change working directory
Help: Type ?setwd in R console
Warning: Avoid using in scripts; use RStudio Projects instead

Not recommended in scripts:

# Bad practice - machine specific!
# setwd("/Users/username/STAT350")

# Better: Use RStudio Projects or relative paths

write.csv(x, file, row.names = FALSE)

Purpose: Export data frame to CSV file
Help: Type ?write.csv in R console
Key arguments:

  • row.names: Include row numbers (usually FALSE)

  • na: String for missing values (default "NA")

Save cleaned data:

write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)

Data Structures & Creation

c(...)

Purpose: Combine values into a vector
Help: Type ?c in R console

Creating different types of vectors:

> v1 <- c(1, 2, 3, 4, 5)
> v2 <- c("iOS", "Android", "Windows")
> v3 <- c(1, "two", 3)
> class(v3)
[1] "character"

Note that mixed types get coerced to the most general type (character in this case).
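
The coercion hierarchy runs logical < integer < numeric < character; a quick check:

> class(c(TRUE, 1L))
[1] "integer"
> class(c(1L, 2.5))
[1] "numeric"
> class(c(2.5, "a"))
[1] "character"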

colnames(x), rownames(x)

Purpose: Get or set column/row names
Help: Type ?colnames or ?rownames in R console

Getting column names from built-in dataset:

> colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Renaming columns:

# Create a copy to modify
mydata <- mtcars[1:5, 1:3]
colnames(mydata) <- c("MPG", "Cylinders", "Displacement")

# Or rename specific columns
colnames(mydata)[1] <- "MilesPerGallon"

data.frame(...)

Purpose: Create a data frame from vectors
Help: Type ?data.frame in R console
Key arguments:

  • stringsAsFactors: Auto-convert strings to factors (set FALSE)

Creating a data frame from scratch:

> df <- data.frame(
+   id = 1:5,
+   group = c("A", "A", "B", "B", "B"),
+   score = c(85, 90, 78, 82, 88),
+   stringsAsFactors = FALSE
+ )
> str(df)
'data.frame':     5 obs. of  3 variables:
 $ id   : int  1 2 3 4 5
 $ group: chr  "A" "A" "B" "B" "B"
 $ score: num  85 90 78 82 88

See also: tibble::tibble() for modern alternative

factor(x, levels = ..., ordered = FALSE)

Purpose: Create categorical variables with defined levels
Help: Type ?factor in R console
When to use: For categorical data in ANOVA, controlling plot order

Basic factor creation with automatic level detection:

> platform <- factor(c("iOS", "Android", "iOS", "Windows"))
> levels(platform)
[1] "Android" "iOS"     "Windows"

Controlling level order for plots and analyses:

platform <- factor(platform,
                  levels = c("iOS", "Android", "Windows"))

Creating ordered factors for ordinal data:

satisfaction <- factor(c("Low", "High", "Medium", "High"),
                      levels = c("Low", "Medium", "High"),
                      ordered = TRUE)

Common pitfall: Converting numeric factors back to numbers

Wrong way (returns level indices):

> f <- factor(c("2", "4", "6"))
> as.numeric(f)
[1] 1 2 3

Correct way:

> as.numeric(as.character(f))
[1] 2 4 6

See also: levels(), relevel(), droplevels()

Data Wrangling & Utilities

apply(X, MARGIN, FUN)

Purpose: Apply function over matrix/array margins
Help: Type ?apply in R console
Key arguments:

  • MARGIN: 1 for rows, 2 for columns

  • FUN: Function to apply

Create a sample matrix:

> M <- matrix(1:12, nrow = 3)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Row means (MARGIN = 1):

> apply(M, 1, mean)
[1] 5.5 6.5 7.5

Column sums (MARGIN = 2):

> apply(M, 2, sum)
[1]  6 15 24 33

Custom function to calculate range:

> apply(M, 2, function(x) max(x) - min(x))
[1] 2 2 2 2

See also: lapply(), sapply(), tapply()

as.numeric(x)

Purpose: Convert to numeric type
Help: Type ?as.numeric in R console
Common uses: Fix data imported as characters, convert factors

Character to numeric conversion:

> x <- c("1.5", "2.3", "3.1")
> as.numeric(x)
[1] 1.5 2.3 3.1

Non-numeric values produce NAs with a warning:

> y <- c("1", "2", "three")
> as.numeric(y)
[1]  1  2 NA
Warning message:
NAs introduced by coercion

complete.cases(x)

Purpose: Identify rows with no missing values
Help: Type ?complete.cases in R console
Returns: Logical vector (TRUE for complete rows)

Create data with missing values:

> df <- data.frame(
+   x = c(1, 2, NA, 4),
+   y = c(5, NA, 7, 8)
+ )
> complete.cases(df)
[1]  TRUE FALSE FALSE  TRUE

Filter to complete cases only:

> df_clean <- df[complete.cases(df), ]
> nrow(df_clean)
[1] 2

Count incomplete cases:

> sum(!complete.cases(df))
[1] 2

See also: is.na(), na.omit()

ifelse(test, yes, no)

Purpose: Vectorized conditional operation
Help: Type ?ifelse in R console

Basic pass/fail grading:

> score <- c(85, 72, 90, 68, 88)
> grade <- ifelse(score >= 80, "Pass", "Fail")
> grade
[1] "Pass" "Fail" "Pass" "Fail" "Pass"

Nested ifelse for letter grades:

grade <- ifelse(score >= 90, "A",
               ifelse(score >= 80, "B",
                     ifelse(score >= 70, "C", "F")))

Conditional calculations:

df$adjusted <- ifelse(df$group == "control",
                     df$score * 1.1,
                     df$score)

IQR(x, na.rm = FALSE)

Purpose: Calculate interquartile range (Q3 - Q1)
Help: Type ?IQR in R console

With missing values:

> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
> IQR(x)
[1] NA
> IQR(x, na.rm = TRUE)
[1] 4

Compare with manual calculation from quantiles:

> quantile(x, c(0.25, 0.75), na.rm = TRUE)
25% 75%
  3   7

See also: quantile(), fivenum(), boxplot.stats()

is.na(x)

Purpose: Test for missing values
Help: Type ?is.na in R console

Identifying NAs:

> x <- c(1, NA, 3, NA, 5)
> is.na(x)
[1] FALSE  TRUE FALSE  TRUE FALSE

Counting NAs:

> sum(is.na(x))
[1] 2

Finding positions of NAs:

> which(is.na(x))
[1] 2 4

Replacing NAs:

x[is.na(x)] <- 0

See also: complete.cases(), na.omit(), anyNA()

paste(..., sep = " ", collapse = NULL)

Purpose: Concatenate strings
Help: Type ?paste in R console
Key arguments:

  • sep: Separator between elements

  • collapse: Collapse vector to single string

Basic concatenation:

> paste("Mean:", 5.2)
[1] "Mean: 5.2"

Custom separator for dates:

> paste("2024", "01", "15", sep = "-")
[1] "2024-01-15"

Vectorized operation:

> paste("Group", 1:3)
[1] "Group 1" "Group 2" "Group 3"

Collapse to single string:

> paste(c("A", "B", "C"), collapse = ", ")
[1] "A, B, C"

No-space version with paste0:

> paste0("var", 1:3)
[1] "var1" "var2" "var3"

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)

Purpose: Calculate sample quantiles
Help: Type ?quantile in R console

Default quartiles:

> x <- c(1:10, NA)
> quantile(x, na.rm = TRUE)
   0%   25%   50%   75%  100%
 1.00  3.25  5.50  7.75 10.00

Custom percentiles:

> quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
10% 90%
1.9 9.1

See also: median(), IQR(), fivenum()

sapply(X, FUN), lapply(X, FUN)

Purpose: Apply function over list/vector elements
Help: Type ?sapply or ?lapply in R console
Differences:

  • lapply: Always returns a list

  • sapply: Simplifies to vector/matrix if possible

Create a list of vectors:

> lst <- list(a = 1:5, b = 6:10, c = 11:15)

sapply simplifies to vector:

> sapply(lst, mean)
 a  b  c
 3  8 13

lapply returns list:

> lapply(lst, mean)
$a
[1] 3

$b
[1] 8

$c
[1] 13

Check multiple columns for NAs:

sapply(df, function(x) sum(is.na(x)))

tapply(X, INDEX, FUN)

Purpose: Apply function by group
Help: Type ?tapply in R console

Group means using built-in dataset:

> tapply(iris$Sepal.Length, iris$Species, mean)
    setosa versicolor  virginica
     5.006      5.936      6.588

Multiple statistics per group:

tapply(iris$Sepal.Length, iris$Species,
       function(x) c(mean = mean(x), sd = sd(x)))

See also: aggregate(), by()

Descriptive Statistics & Correlation

cor(x, y = NULL, use = "everything", method = "pearson")

Purpose: Calculate correlation coefficient
Help: Type ?cor in R console
Key arguments:

  • use: How to handle NAs ("complete.obs" drops them)

  • method: "pearson" (default), "spearman", "kendall"

Perfect positive correlation:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 6, 8, 10)
> cor(x, y)
[1] 1

Using built-in dataset:

> cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594

Correlation matrix:

> cor(mtcars[,c("mpg", "wt", "hp")])
           mpg         wt         hp
mpg  1.0000000 -0.8676594 -0.7761684
wt  -0.8676594  1.0000000  0.6587479
hp  -0.7761684  0.6587479  1.0000000

Interpretation guide (see the helper sketch after this list):

  • -1 to -0.7: Strong negative

  • -0.7 to -0.3: Moderate negative

  • -0.3 to 0.3: Weak/no linear relationship

  • 0.3 to 0.7: Moderate positive

  • 0.7 to 1: Strong positive
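
If it helps, those cutoffs can be wrapped in a small helper function (purely illustrative; cor_strength is not a built-in):

# Classify a single |r| using the rule-of-thumb cutoffs above
cor_strength <- function(r) {
  a <- abs(r)
  if (a >= 0.7) "strong" else if (a >= 0.3) "moderate" else "weak/none"
}

cor_strength(cor(mtcars$mpg, mtcars$wt))   # "strong"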

Test for significance:

cor.test(mtcars$mpg, mtcars$wt)

See also: cor.test(), cov()

mean(x, trim = 0, na.rm = FALSE)

Purpose: Calculate arithmetic mean
Help: Type ?mean in R console
Key arguments:

  • na.rm: Remove NAs before calculation

  • trim: Fraction to trim from each end

Basic usage with NAs:

> x <- c(1, 2, 3, 4, 5, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 3

Trimmed mean (robust to outliers):

> y <- c(1, 2, 3, 4, 100)
> mean(y)
[1] 22
> mean(y, trim = 0.2)
[1] 3

The trimmed mean removes 20% from each end before calculating.

See also: median(), summary()

median(x, na.rm = FALSE)

Purpose: Calculate median (middle value)
Help: Type ?median in R console

Odd number of values:

> median(c(1, 3, 5))
[1] 3

Even number of values:

> median(c(1, 2, 3, 4))
[1] 2.5

Median is more robust than mean for skewed data:

> x <- c(20, 25, 30, 35, 200)
> mean(x)
[1] 62
> median(x)
[1] 30

sd(x, na.rm = FALSE)

Purpose: Calculate sample standard deviation
Help: Type ?sd in R console
Note: Uses the n-1 denominator (sample SD)

> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> sd(x)
[1] 2.13809
> var(x)
[1] 4.571429

The variance equals the square of the standard deviation.

Calculate coefficient of variation (relative variability):

> cv <- sd(x) / mean(x) * 100
> paste("CV:", round(cv, 1), "%")
[1] "CV: 42.8 %"

See also: var(), mad() (robust alternative)

Probability & Distributions

Normal Distribution Functions

dnorm(x, mean = 0, sd = 1)

Purpose: Normal probability density
Help: Type ?dnorm in R console
Use cases: Overlay theoretical curves, calculate likelihoods

Standard normal density at x = 0:

> dnorm(0)
[1] 0.3989423

Custom parameters:

> dnorm(100, mean = 100, sd = 15)
[1] 0.02659615

Use in ggplot2 for overlaying normal curve:

ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm,
               args = list(mean = mean(x), sd = sd(x)),
               color = "red")

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)

Purpose: Normal cumulative distribution (probability)
Help: Type ?pnorm in R console

Standard normal probabilities:

> pnorm(1.96)
[1] 0.9750021
> pnorm(1.96, lower.tail = FALSE)
[1] 0.0249979

Two-tailed p-value:

> 2 * pnorm(-abs(1.96))
[1] 0.04999579

qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)

Purpose: Normal quantiles (inverse CDF)
Help: Type ?qnorm in R console

Critical values:

> qnorm(0.975)
[1] 1.959964
> qnorm(0.95)
[1] 1.644854

qnorm(0.975) is the critical value for a two-sided 95% procedure; qnorm(0.95) is its one-sided counterpart.

Calculate confidence interval:

xbar <- 100; s <- 15; n <- 25
xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)

rnorm(n, mean = 0, sd = 1)

Purpose: Generate random normal values
Help: Type ?rnorm in R console

set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)

# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(bins = 20) +
  ggtitle("Sample from Normal(50, 10)")

t Distribution Functions

pt(q, df, lower.tail = TRUE)

Purpose: Student's t cumulative distribution
Help: Type ?pt in R console
Use cases: Calculate p-values for t-tests

One-sided p-value:

> t_stat <- 2.5
> df <- 24
> pt(t_stat, df, lower.tail = FALSE)
[1] 0.009963466

Two-sided p-value:

> 2 * pt(-abs(t_stat), df)
[1] 0.01992693

qt(p, df, lower.tail = TRUE)

Purpose: Student's t quantiles
Help: Type ?qt in R console
Use cases: Critical values for confidence intervals

95% CI critical value:

> qt(0.975, df = 24)
[1] 2.063899

Compare to normal:

> qnorm(0.975)
[1] 1.959964

The t critical value is larger due to heavier tails.
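
As the degrees of freedom grow, the t critical value shrinks toward the normal one; qt() is vectorized over df, so this is easy to see:

> qt(0.975, df = c(5, 30, 100, 1000))
[1] 2.570582 2.042272 1.983972 1.962339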

See also: dt(), rt()

Other Distributions

dbinom(x, size, prob)

Purpose: Binomial probability mass function
Help: Type ?dbinom in R console

Probability of exactly 3 successes in 10 trials with p = 0.5:

> dbinom(3, size = 10, prob = 0.5)
[1] 0.1171875

Full distribution:

x <- 0:10
p <- dbinom(x, size = 10, prob = 0.5)

# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
  geom_col() +
  scale_x_continuous(breaks = 0:10) +
  ggtitle("Binomial(10, 0.5) Distribution")

qtukey(p, nmeans, df)

Purpose: Studentized range quantiles (critical values for Tukey HSD)
Help: Type ?qtukey in R console

Critical value for 3 groups with df = 27:

> qtukey(0.95, nmeans = 3, df = 27)
[1] 3.506426

Simulation Functions

set.seed(seed)

Purpose: Set random number generator seed for reproducibility
Help: Type ?set.seed in R console

Without seed (different each time):

> rnorm(3)
[1]  0.3186301 -0.5817907  0.7145327
> rnorm(3)
[1] -0.8252594 -0.3598138 -0.0109303

With seed (reproducible):

> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083
> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083

Random Generation Functions

All random generation functions take n (number of values) as the first argument.

set.seed(1)
rnorm(5, mean = 100, sd = 15)    # Normal
runif(5, min = 0, max = 10)      # Uniform
rexp(5, rate = 2)                # Exponential

For simulation studies:

n_sims <- 1000
means <- replicate(n_sims, mean(rnorm(30)))

# Visualize sampling distribution with ggplot2
library(ggplot2)
ggplot(data.frame(means = means), aes(x = means)) +
  geom_histogram(bins = 30) +
  ggtitle("Sampling Distribution of Mean")

Inference Functions

t.test(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)

Purpose: One- and two-sample t-tests
Help: Type ?t.test in R console
When to use: Compare means for one or two groups
Key arguments:

  • alternative: "two.sided", "less", "greater"

  • mu: Null hypothesis value (one-sample)

  • paired: TRUE for matched pairs

  • var.equal: TRUE assumes equal variances

Common pitfalls:

  • Using paired = TRUE with independent samples

  • Not checking normality assumptions

  • Forgetting var.equal = FALSE for unequal variances

One-sample test:

> x <- c(98, 102, 101, 97, 103, 100, 99, 98, 104, 101)
> t.test(x, mu = 100)

  One Sample t-test

data:  x
t = 0.41036, df = 9, p-value = 0.6911
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
  98.64623 101.95377
sample estimates:
mean of x
    100.3

Two-sample independent test using built-in data:

> data(sleep)
> t.test(extra ~ group, data = sleep, var.equal = FALSE)

  Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

Paired t-test - Three equivalent approaches:

Method 1 - Using paired = TRUE with two vectors:

before <- c(85, 88, 90, 92, 87)
after  <- c(88, 90, 91, 94, 89)
t.test(before, after, paired = TRUE)

Method 2 - Convert to one-sample test of differences:

differences <- before - after   # same direction as Method 1
t.test(differences, mu = 0)

Method 3 - Formula interface with paired data:

> # Using sleep dataset (paired by ID)
> t.test(Pair(extra[group==1], extra[group==2]) ~ 1, data = sleep)

  Paired t-test

data:  Pair(extra[group == 1], extra[group == 2])
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean difference
          -1.58

Extract components from results:

> result <- t.test(x, mu = 100)
> round(result$p.value, 4)
[1] 0.6911
> result$conf.int
[1]  98.64623 101.95377
attr(,"conf.level")
[1] 0.95
> result$estimate
mean of x
    100.3

Checking assumptions:

Normality check:

# Shapiro-Wilk test
shapiro.test(x)

# Visual check with ggplot2
library(ggplot2)
ggplot(data.frame(x = x), aes(sample = x)) +
  stat_qq() +
  stat_qq_line()

Equal variances check:

# F-test
var.test(x, y)

# Visual check
ggplot(data.frame(value = c(x, y),
                 group = rep(c("X", "Y"), c(length(x), length(y)))),
       aes(x = group, y = value)) +
  geom_boxplot()

See also: wilcox.test() (non-parametric alternative)

aov(formula, data)

Purpose: Analysis of variance (ANOVA)
Help: Type ?aov in R console
When to use: Compare means across 3+ groups

One-way ANOVA:

> fit <- aov(score ~ group, data = df)
> summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)
group        2  520.3  260.15   8.214 0.0019 **
Residuals   27  855.2   31.67
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check assumptions with diagnostic plots:

plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q

Test for equal variances:

bartlett.test(score ~ group, data = df)

Post-hoc comparisons:

TukeyHSD(fit)
plot(TukeyHSD(fit))

See also: anova(), TukeyHSD(), kruskal.test()

TukeyHSD(x, conf.level = 0.95)

Purpose: Tukey's honest significant differences
Help: Type ?TukeyHSD in R console
When to use: After significant ANOVA for pairwise comparisons

Following ANOVA:

> fit <- aov(score ~ group, data = df)
> tukey_result <- TukeyHSD(fit)
> tukey_result
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ group, data = df)

$group
         diff        lwr       upr     p adj
B-A -5.333333 -12.202705  1.536038 0.1518711
C-A  3.166667  -3.702705 10.036038 0.5019442
C-B  8.500000   1.630629 15.369371 0.0125506

Visual display:

plot(tukey_result)

Confidence intervals not containing 0 indicate significant differences.
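
To pull out only the significant pairs programmatically (a small convenience sketch):

# $group is a matrix; keep rows with adjusted p-value below 0.05
tukey_result$group[tukey_result$group[, "p adj"] < 0.05, , drop = FALSE]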

lm(formula, data)

Purpose: Fit linear models
Help: Type ?lm in R console
When to use: Regression analysis, ANOVA via regression

Simple linear regression:

> mod <- lm(rating ~ price, data = df)
> summary(mod)

Call:
lm(formula = rating ~ price, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-2.83450 -0.83450 -0.03077  0.78231  2.70308

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1234     0.2145   9.897 2.13e-10 ***
price        0.5432     0.0532  10.211 9.77e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.234 on 28 degrees of freedom
Multiple R-squared:  0.7882,      Adjusted R-squared:  0.7806
F-statistic: 104.3 on 1 and 28 DF,  p-value: 9.77e-11

Extract components:

coef(mod)           # Coefficients
confint(mod)        # CI for coefficients
residuals(mod)      # Residuals
fitted(mod)         # Fitted values

Diagnostic plots:

plot(mod)  # Produces 4 diagnostic plots

ANOVA table:

anova(mod)

See also: summary.lm(), predict.lm(), plot.lm()

predict(object, newdata, interval = "none", level = 0.95)

Purpose: Predictions from model fits
Help: Type ?predict.lm in R console
Key arguments:

  • interval: "confidence" (mean response) or "prediction" (new obs)

  • level: Confidence level

Critical distinction:

  • Confidence interval: Uncertainty about mean response

  • Prediction interval: Uncertainty about individual response (wider)

Fit model first:

mod <- lm(rating ~ price, data = df)

Point prediction:

> new_price <- data.frame(price = 50)
> predict(mod, new_price)
       1
29.28571

Confidence interval (mean rating at price = 50):

> predict(mod, new_price, interval = "confidence")
       fit      lwr      upr
1 29.28571 28.47933 30.09209

Prediction interval (individual rating at price = 50):

> predict(mod, new_price, interval = "prediction")
       fit      lwr      upr
1 29.28571 26.89414 31.67728

Note that prediction intervals are always wider than confidence intervals.

Multiple predictions for plotting:

new_prices <- data.frame(
  price = seq(min(df$price), max(df$price), length = 100)
)
ci <- predict(mod, new_prices, interval = "confidence")
pi <- predict(mod, new_prices, interval = "prediction")

Plot with bands (base graphics for a quick look; see geom_ribbon below for a ggplot2 version):

plot(rating ~ price, data = df)
abline(mod)
lines(new_prices$price, ci[,"lwr"], lty = 2, col = "blue")
lines(new_prices$price, ci[,"upr"], lty = 2, col = "blue")
lines(new_prices$price, pi[,"lwr"], lty = 2, col = "red")
lines(new_prices$price, pi[,"upr"], lty = 2, col = "red")

See also: fitted(), residuals()

Diagnostic Plots for Assumptions

For t-test (normality of residuals):

library(ggplot2)
x <- residuals(lm(score ~ group, data = df))

# Q-Q plot
ggplot(data.frame(sample = x), aes(sample = sample)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q Plot of Residuals")

# Formal test
shapiro.test(x)

For ANOVA:

fit <- aov(score ~ group, data = df)

# Residuals vs Fitted
ggplot(data.frame(fitted = fitted(fit),
                 residuals = residuals(fit)),
       aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs Fitted")

# Q-Q plot of residuals
ggplot(data.frame(sample = residuals(fit)),
       aes(sample = sample)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q")

For regression (full diagnostic panel):

mod <- lm(rating ~ price, data = df)

# Prepare data for all plots
diag_data <- data.frame(
  fitted = fitted(mod),
  residuals = residuals(mod),
  std_resid = rstandard(mod),
  sqrt_std_resid = sqrt(abs(rstandard(mod))),
  cooks = cooks.distance(mod),
  leverage = hatvalues(mod)
)

# Create four diagnostic plots
library(gridExtra)

p1 <- ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE) +
  ggtitle("Residuals vs Fitted")

p2 <- ggplot(diag_data, aes(sample = std_resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q")

p3 <- ggplot(diag_data, aes(x = fitted, y = sqrt_std_resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("Scale-Location")

p4 <- ggplot(diag_data, aes(x = leverage, y = std_resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("Residuals vs Leverage")

grid.arrange(p1, p2, p3, p4, ncol = 2)

Pattern interpretation:

  • Random scatter: Good

  • Funnel shape: Heteroscedasticity (see the simulated example below)

  • Curve: Non-linearity

  • Outliers: Influential points
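
To recognize the funnel pattern, it can help to simulate data whose error spread grows with the mean (illustrative only):

library(ggplot2)

set.seed(42)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)   # error SD increases with x
mod_het <- lm(y ~ x)

ggplot(data.frame(fitted = fitted(mod_het),
                  residuals = residuals(mod_het)),
       aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Funnel Shape from Heteroscedastic Errors")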

Graphics (ggplot2)

Important

This course uses ggplot2 exclusively for all plotting and visualization. While base R has plotting functions like plot(), hist(), and boxplot(), we will only use ggplot2’s grammar of graphics approach. All assignments and exams will expect ggplot2 syntax for graphs.

Core Components

ggplot(data, aes(...))

Purpose: Initialize a ggplot object
Help: Type ?ggplot in R console
Structure: Data + aesthetic mappings + layers

Basic setup creates blank plot with axes:

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

Add layers with +:

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

Save for reuse:

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))
p + geom_boxplot()
p + geom_violin()

aes(...)

Purpose: Map variables to visual properties
Help: Type ?aes in R console
Common aesthetics: x, y, color, fill, shape, size, alpha

Position aesthetics:

aes(x = Sepal.Length, y = Sepal.Width)

Color by group:

aes(x = Sepal.Length, y = Sepal.Width, color = Species)

Multiple aesthetics:

aes(x = wt, y = mpg, color = factor(cyl), size = hp)

Can set in ggplot() or individual geoms:

ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))

Geoms (Geometric Objects)

geom_histogram(binwidth, bins, aes(y = after_stat(...)))

Purpose: Create histograms
Help: Type ?geom_histogram in R console
Key arguments:

  • bins: Number of bins (default 30)

  • binwidth: Width of bins

  • aes(y = after_stat(density)): Scale to density

Important

In this course, all histograms must include:

  1. Density scaling using aes(y = after_stat(density))

  2. A kernel density curve (red) using geom_density()

  3. A normal curve (blue) using stat_function()

This allows visual comparison of the data distribution with theoretical distributions.

Standard histogram format for this course:

# Calculate mean and sd first
xbar <- mean(data$variable)
s <- sd(data$variable)

# Create histogram with required overlays
ggplot(data, aes(x = variable)) +
  geom_histogram(aes(y = after_stat(density)),
                bins = 24, fill = "grey", col = "black") +
  geom_density(col = "red", linewidth = 1) +
  stat_function(fun = dnorm,
               args = list(mean = xbar, sd = s),
               col = "blue", linewidth = 1) +
  ggtitle("Histogram with Density Curves")

Adding mean and median lines (often required):

# Calculate statistics
xbar <- mean(data$variable)
xtilde <- median(data$variable)
s <- sd(data$variable)

# Full histogram with all required elements
ggplot(data, aes(x = variable)) +
  geom_histogram(aes(y = after_stat(density)),
                bins = 24, fill = "grey", col = "black") +
  geom_vline(xintercept = xbar, color = "purple",
             linetype = "dotted", linewidth = 1) +
  geom_vline(xintercept = xtilde, color = "orange",
             linetype = "dashed", linewidth = 1) +
  geom_density(col = "red", linewidth = 1) +
  stat_function(fun = dnorm,
               args = list(mean = xbar, sd = s),
               col = "blue", linewidth = 1) +
  ggtitle("Complete Histogram for Course")

By group:

ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),
                position = "dodge", bins = 15) +
  facet_wrap(~ Species) +
  ggtitle("Histograms by Group")

geom_boxplot(outlier.shape, varwidth)

Purpose: Box and whisker plots
Help: Type ?geom_boxplot in R console

Basic boxplot:

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

Horizontal orientation:

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  coord_flip()

With individual points:

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.5)

Variable width by sample size:

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(varwidth = TRUE)

geom_point(), geom_line()

Purpose: Scatterplots and line plots
Help: Type ?geom_point or ?geom_line in R console

Basic scatterplot:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

Customize points:

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.7)

Add regression line:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

Time series (using built-in dataset):

# Convert to time series format
data(AirPassengers)
ts_data <- data.frame(
  date = as.Date(paste(floor(time(AirPassengers)),
                       cycle(AirPassengers), 1, sep = "-")),
  value = as.numeric(AirPassengers)
)

ggplot(ts_data, aes(x = date, y = value)) +
  geom_line() +
  geom_point()

geom_smooth(method, formula, se)

Purpose: Add smoothed conditional means
Help: Type ?geom_smooth in R console
Key arguments:

  • method: "lm", "loess", "gam"

  • se: Show confidence band (TRUE/FALSE)

Linear regression with confidence band:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

Without confidence band:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

By group:

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm")

Faceted histograms with the required kernel density and normal-curve overlays:

# Calculate group-specific statistics
xbar <- tapply(iris$Sepal.Length, iris$Species, mean)
s <- tapply(iris$Sepal.Length, iris$Species, sd)

# Add normal density values to the dataset
iris$normal.density <- ifelse(iris$Species == "setosa",
                             dnorm(iris$Sepal.Length, xbar["setosa"], s["setosa"]),
                             ifelse(iris$Species == "versicolor",
                                   dnorm(iris$Sepal.Length, xbar["versicolor"], s["versicolor"]),
                                   dnorm(iris$Sepal.Length, xbar["virginica"], s["virginica"])))

# Create faceted histogram with overlays
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)),
                bins = 15, fill = "grey", col = "black") +
  geom_density(col = "red", linewidth = 1) +
  geom_line(aes(y = normal.density), col = "blue", linewidth = 1) +
  facet_wrap(~ Species) +
  ggtitle("Distribution of Sepal Length by Species")

For datasets with only two groups (like the built-in CO2 data), the approach is simpler:

# Two-group example using CO2 dataset
xbar <- tapply(CO2$uptake, CO2$Type, mean)
s <- tapply(CO2$uptake, CO2$Type, sd)

CO2$normal.density <- ifelse(CO2$Type == "Quebec",
                            dnorm(CO2$uptake, xbar["Quebec"], s["Quebec"]),
                            dnorm(CO2$uptake, xbar["Mississippi"], s["Mississippi"]))

ggplot(CO2, aes(x = uptake)) +
  geom_histogram(aes(y = after_stat(density)),
                bins = 20, fill = "grey", col = "black") +
  geom_density(col = "red", linewidth = 1) +
  geom_line(aes(y = normal.density), col = "blue", linewidth = 1) +
  facet_grid(. ~ Type) +
  ggtitle("Distribution of Uptake by Location Type")

stat_qq(), stat_qq_line()

Purpose: Q-Q plots for normality assessment
Help: Type ?stat_qq in R console

Basic Q-Q plot:

ggplot(data.frame(sample = rnorm(100)), aes(sample = sample)) +
  stat_qq() +
  stat_qq_line()

For residuals:

mod <- lm(mpg ~ wt, data = mtcars)
res_df <- data.frame(residuals = residuals(mod))

ggplot(res_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot of Residuals")

geom_ribbon(aes(ymin, ymax))

Purpose: Area plots with upper/lower bounds
Help: Type ?geom_ribbon in R console
Use case: Custom confidence/prediction bands

Setup for regression bands:

mod <- lm(mpg ~ wt, data = mtcars)
new_data <- data.frame(wt = seq(min(mtcars$wt),
                                max(mtcars$wt),
                                length = 100))

pred_ci <- predict(mod, new_data, interval = "confidence")
pred_pi <- predict(mod, new_data, interval = "prediction")

plot_data <- cbind(new_data,
                  ci_fit = pred_ci[,"fit"],
                  ci_lwr = pred_ci[,"lwr"],
                  ci_upr = pred_ci[,"upr"],
                  pi_lwr = pred_pi[,"lwr"],
                  pi_upr = pred_pi[,"upr"])

Plot with bands:

ggplot(plot_data, aes(x = wt)) +
  geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr),
             fill = "grey90") +
  geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
             fill = "grey70") +
  geom_line(aes(y = ci_fit), color = "blue", linewidth = 1) +
  geom_point(data = mtcars, aes(y = mpg)) +
  labs(title = "Regression with CI and PI bands")

Categorical Data Visualization

geom_bar(), geom_col()

Purpose: Bar charts for categorical data
Help: Type ?geom_bar or ?geom_col in R console

  • geom_bar(): Counts categories (like table())

  • geom_col(): Uses pre-computed values

Count frequencies with geom_bar():

# Frequency bar chart
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  xlab("Number of Cylinders") +
  ylab("Count") +
  ggtitle("Distribution of Cylinder Counts")

Pre-computed values with geom_col():

# Create summary data
cyl_means <- aggregate(mpg ~ cyl, data = mtcars, mean)

ggplot(cyl_means, aes(x = factor(cyl), y = mpg)) +
  geom_col(fill = "steelblue") +
  xlab("Number of Cylinders") +
  ylab("Mean MPG") +
  ggtitle("Average MPG by Cylinder Count")

Grouped bar chart:

ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  xlab("Cylinders") +
  labs(fill = "Transmission") +
  scale_fill_discrete(labels = c("Automatic", "Manual"))

Stacked bar chart (proportions):

ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  labs(fill = "Transmission")

Note

For pie charts, while possible in ggplot2 using coord_polar(), they are generally discouraged in statistical visualization. Bar charts are preferred for comparing categorical proportions.
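
For completeness, a minimal coord_polar() pie chart looks like this (shown only to illustrate the mechanics):

# A pie chart is a stacked bar chart in polar coordinates
ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Cylinders") +
  theme_void()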

Plot Customization

Labels and Titles

All labels at once:

p + labs(title = "Main Title",
        subtitle = "Subtitle",
        x = "X-axis label",
        y = "Y-axis label",
        caption = "Data source: ...")

Individual functions:

p + ggtitle("Main Title") +
    xlab("X-axis") +
    ylab("Y-axis")

Math expressions:

library(latex2exp)
p + xlab(TeX("$\\beta_1$")) +
    ylab(TeX("$\\hat{y}$"))

Themes

Built-in themes:

p + theme_minimal()
p + theme_classic()
p + theme_bw()

Customize elements:

p + theme(
  axis.text = element_text(size = 12),
  axis.title = element_text(size = 14, face = "bold"),
  plot.title = element_text(size = 16, hjust = 0.5),
  legend.position = "bottom"
)

Faceting

Separate panels by variable:

ggplot(df, aes(x = score)) +
  geom_histogram() +
  facet_wrap(~ group)

Grid layout:

ggplot(df, aes(x = score)) +
  geom_histogram() +
  facet_grid(platform ~ group)

Tables & Reporting

knitr::kable(x, caption, digits, format)

Purpose: Create formatted tables
Help: Type ?knitr::kable in R console

Basic table using built-in data:

library(knitr)
library(kableExtra)

# Summary statistics by species
summary_table <- aggregate(Sepal.Length ~ Species, data = iris,
                          FUN = function(x) c(mean = mean(x),
                                             sd = sd(x),
                                             n = length(x)))
kable(summary_table,
      caption = "Summary Statistics by Species",
      digits = 2)

Enhanced styling:

kable(summary_table, caption = "Iris Summary by Species") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
               full_width = FALSE) %>%
  column_spec(1, bold = TRUE)

gridExtra::grid.arrange(...)

Purpose: Arrange multiple plots
Help: Type ?gridExtra::grid.arrange in R console

Create and arrange plots:

library(gridExtra)

p1 <- ggplot(iris, aes(x = Sepal.Length)) +
      geom_histogram(bins = 20) +
      ggtitle("Distribution of Sepal Length")

p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
      geom_boxplot() +
      ggtitle("Sepal Length by Species")

grid.arrange(p1, p2, ncol = 2)

Different layouts:

grid.arrange(p1, p2, p1, p2,
            ncol = 2, nrow = 2,
            top = "Four Panel Display")

Best Practices & Common Pitfalls

Standard validation workflow:

d <- read.csv("data/file.csv")

str(d)                              # Structure check
head(d); tail(d)                    # First/last rows
summary(d)                          # Summary statistics
sapply(d, function(x) sum(is.na(x))) # Missing values
sapply(d, class)                    # Data types

Before t-tests/ANOVA:

# Normality
hist(x)
qqnorm(x); qqline(x)
shapiro.test(x)

# Equal variances
var.test(x, y)
boxplot(score ~ group, data = df)

Before regression:

mod <- lm(y ~ x, data = df)

par(mfrow = c(2,2))
plot(mod)
par(mfrow = c(1,1))

Check for:

  1. Linearity (Residuals vs Fitted)

  2. Normality (Q-Q plot)

  3. Homoscedasticity (Scale-Location)

  4. Influential points (Residuals vs Leverage)

Common pitfalls:

  1. Factor to numeric conversion

    Wrong (returns level indices):

    > f <- factor(c("2", "4", "6"))
    > as.numeric(f)
    [1] 1 2 3
    

    Correct:

    > as.numeric(as.character(f))
    [1] 2 4 6
    
  2. Missing data handling

    Wrong (silently propagates NA):

    > mean(c(1, 2, NA, 4))
    [1] NA
    

    Correct:

    > mean(c(1, 2, NA, 4), na.rm = TRUE)
    [1] 2.333333
    
  3. Confidence vs Prediction intervals

    # CI: Where is the mean?
    predict(mod, newdata, interval = "confidence")
    
    # PI: Where is a new observation?
    predict(mod, newdata, interval = "prediction")
    

    Prediction intervals are always wider!

  4. Multiple testing without adjustment

    After ANOVA, use Tukey HSD:

    TukeyHSD(aov_fit)
    

    This adjusts for multiple comparisons automatically (a non-ANOVA alternative is sketched after this list).
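
Outside the ANOVA workflow, pairwise.t.test() performs all pairwise t-tests with a p-value adjustment; a one-line sketch, assuming a data frame df with score and group columns as in the ANOVA examples above:

pairwise.t.test(df$score, df$group, p.adjust.method = "bonferroni")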

Complete workflow template:

# 1. Setup
library(ggplot2)
library(knitr)
library(broom)   # provides tidy(), used in the report step
set.seed(123)

# 2. Import and validate
d <- read.csv("data/file.csv", stringsAsFactors = FALSE)
str(d)
summary(d)

# 3. Clean
d_clean <- d[complete.cases(d), ]

# 4. Explore
ggplot(d_clean, aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()

# 5. Model
mod <- lm(y ~ x, data = d_clean)

# 6. Diagnose
plot(mod, which = 1:2)

# 7. Report
summary(mod)
kable(tidy(mod), digits = 3)