Function Reference Part 1

Data I/O & Housekeeping

getwd()

Purpose: Show current working directory Help: Type ?getwd in R console Syntax: getwd()

Example showing the working directory:
> getwd()
[1] "/Users/username/STAT350"

head(x, n = 6)

Purpose: Display first n rows of a data frame or elements of a vector Help: Type ?head in R console Syntax: head(x, n = 6) Common use: Quickly inspect data after importing

Using built-in iris dataset:
> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
See also: tail(), str(), summary()

read.csv(file, header = TRUE, stringsAsFactors = FALSE, …)

Purpose: Import CSV file into a data frame Help: Type ?read.csv in R console Key arguments:

header: First row contains column names (default TRUE)

stringsAsFactors: FALSE keeps strings as characters (recommended; FALSE is the default since R 4.0.0)

na.strings: Values to treat as NA (default "NA")

Common pitfalls:

Using absolute paths (breaks on other computers)

Forgetting to check for proper import with str()

Not handling special characters in column names

Basic import:
> mydata <- read.csv("data/experiment.csv")
> str(mydata)
'data.frame':     150 obs. of  5 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ group   : chr  "A" "B" "A" "C" ...
 $ measure1: num  23.5 19.2 25.1 22.8 20.3 ...
 $ measure2: num  45.2 41.8 46.7 44.3 42.1 ...
 $ outcome : chr  "success" "failure" "success" ...
With options for handling missing data:
mydata <- read.csv("data/experiment.csv",
                  stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A", ""))
Always validate after import:
> sum(is.na(mydata))
[1] 12
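The pitfalls listed above can be checked on a throwaway file. The following is a minimal sketch, assuming a hypothetical two-column CSV written to a temporary file (the column names and values are invented for illustration). It shows how read.csv() rewrites a header that contains a space unless check.names = FALSE is set, and how na.strings turns blank and "N/A" cells into NA.

```r
# Hypothetical CSV written to a temporary file (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,body weight",
             "1,70.5",
             "2,N/A",
             "3,"), tmp)

# Default: names are made syntactically valid ("body weight" -> "body.weight")
d1 <- read.csv(tmp, na.strings = c("NA", "N/A", ""))
names(d1)

# check.names = FALSE keeps the header exactly as written
d2 <- read.csv(tmp, na.strings = c("NA", "N/A", ""), check.names = FALSE)
names(d2)

# The "N/A" and blank cells were both imported as NA
sum(is.na(d1))

unlink(tmp)  # remove the temporary file
```

Running str() on the result remains the quickest way to confirm that each column arrived with the expected type.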
See also: read.table(), write.csv(), str()

setwd(path)

Purpose: Change working directory Help: Type ?setwd in R console

Not recommended in scripts:
# Bad practice - machine specific!
# setwd("/Users/username/STAT350")
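One common alternative, sketched here under the assumption that the script is run from the project folder, is to build relative paths with file.path() instead of changing directories. The folder and file names reuse the earlier read.csv() example and may not exist on your machine.

```r
# Build a path relative to the current working directory
# ("data" and "experiment.csv" are the example names used earlier)
csv_path <- file.path("data", "experiment.csv")
csv_path

# Check that the file is there before trying to read it
if (file.exists(csv_path)) {
  mydata <- read.csv(csv_path)
} else {
  message("File not found relative to: ", getwd())
}
```

file.path() joins its arguments with "/", which R accepts on every platform.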

write.csv(x, file, row.names = FALSE)

Purpose: Export data frame to CSV file Help: Type ?write.csv in R console Key arguments:

row.names: Whether to write row names as an extra first column (default TRUE; usually set to FALSE)

na: String for missing values (default "NA")

Save cleaned data:
write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)
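A quick round trip is one way to confirm what was actually written. This sketch assumes a small made-up data frame named d_clean and writes to a temporary file rather than the output/ folder used above.

```r
# Small hypothetical data frame with one missing value
d_clean <- data.frame(id = 1:3, score = c(85.5, NA, 78.25))

out_file <- tempfile(fileext = ".csv")
write.csv(d_clean, out_file, row.names = FALSE)

# Inspect the raw text that was written (NA appears as the string NA)
readLines(out_file)

# Read it back and confirm nothing changed
d_back <- read.csv(out_file)
all.equal(d_clean, d_back)

# With the default row.names = TRUE, an extra unnamed column is written;
# read.csv() names it "X" on import
write.csv(d_clean, out_file)
cols_with_rownames <- names(read.csv(out_file))
cols_with_rownames

unlink(out_file)  # remove the temporary file
```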

Data Structures & Creation

c(…)

Purpose: Combine values into a vector Help: Type ?c in R console

Creating different types of vectors:
> v1 <- c(1, 2, 3, 4, 5)
> v2 <- c("iOS", "Android", "Windows")
> v3 <- c(1, "two", 3)
> class(v3)
[1] "character"
Note that mixed types get coerced to the most general type (character in this case).

colnames(x), rownames(x)

Purpose: Get or set column/row names Help: Type ?colnames or ?rownames in R console

Getting column names from built-in dataset:

> colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Renaming columns:

# Create a copy to modify
mydata <- mtcars[1:5, 1:3]
colnames(mydata) <- c("MPG", "Cylinders", "Displacement")

# Or rename specific columns
colnames(mydata)[1] <- "MilesPerGallon"

data.frame(…)

Purpose: Create a data frame from vectors Help: Type ?data.frame in R console Key arguments:

stringsAsFactors: TRUE auto-converts strings to factors (keep FALSE, the default since R 4.0.0)

Creating a data frame from scratch:
> df <- data.frame(
+   id = 1:5,
+   group = c("A", "A", "B", "B", "B"),
+   score = c(85, 90, 78, 82, 88),
+   stringsAsFactors = FALSE
+ )
> str(df)
'data.frame':     5 obs. of  3 variables:
 $ id   : int  1 2 3 4 5
 $ group: chr  "A" "A" "B" "B" "B"
 $ score: num  85 90 78 82 88
See also: tibble::tibble() for a modern alternative

factor(x, levels = …, ordered = FALSE)

Purpose: Create categorical variables with defined levels Help: Type ?factor in R console When to use: For categorical data in ANOVA, controlling plot order

Basic factor creation with automatic level detection:
> platform <- factor(c("iOS", "Android", "iOS", "Windows"))
> levels(platform)
[1] "Android" "iOS"     "Windows"
Controlling level order for plots and analyses:
platform <- factor(platform,
                  levels = c("iOS", "Android", "Windows"))
Creating ordered factors for ordinal data:
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                      levels = c("Low", "Medium", "High"),
                      ordered = TRUE)
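A brief sketch of what the level order buys you, reusing the example vectors from this entry: ordered factors support comparisons by level, and relevel() moves a chosen level to the front of an unordered factor.

```r
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
at_least_medium <- satisfaction >= "Medium"
at_least_medium

# table() reports counts in level order
table(satisfaction)

# relevel() makes the chosen level the first (reference) level
platform <- factor(c("iOS", "Android", "iOS", "Windows"))
releveled <- levels(relevel(platform, ref = "iOS"))
releveled
```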
Common pitfall: Converting numeric factors back to numbers

Wrong way (returns level indices):
> f <- factor(c("2", "4", "6"))
> as.numeric(f)
[1] 1 2 3
Correct way:
> as.numeric(as.character(f))
[1] 2 4 6
See also: levels(), relevel(), droplevels()

Data Wrangling & Utilities

apply(X, MARGIN, FUN)

Purpose: Apply function over matrix/array margins Help: Type ?apply in R console Key arguments:

MARGIN: 1 for rows, 2 for columns

FUN: Function to apply

Create a sample matrix:
> M <- matrix(1:12, nrow = 3)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Row means (MARGIN = 1):
> apply(M, 1, mean)
[1] 5.5 6.5 7.5
Column sums (MARGIN = 2):
> apply(M, 2, sum)
[1]  6 15 24 33
Custom function to calculate range:
> apply(M, 2, function(x) max(x) - min(x))
[1] 2 2 2 2
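For the common cases of row or column means and sums, base R also provides dedicated functions. A short sketch on the same matrix, checking that they agree with apply():

```r
M <- matrix(1:12, nrow = 3)

# rowMeans() and colSums() give the same values as the apply() calls
same_row_means <- all.equal(apply(M, 1, mean), rowMeans(M))
same_col_sums  <- all.equal(apply(M, 2, sum), colSums(M))
same_row_means
same_col_sums
```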
See also: lapply(), sapply(), tapply()

as.numeric(x)

Purpose: Convert to numeric type Help: Type ?as.numeric in R console Common uses: Fix data imported as characters, convert factors (via as.character() first; see the factor() pitfall)

Character to numeric conversion:
> x <- c("1.5", "2.3", "3.1")
> as.numeric(x)
[1] 1.5 2.3 3.1
Non-numeric values produce NAs with a warning:
> y <- c("1", "2", "three")
> as.numeric(y)
[1]  1  2 NA
Warning message:
NAs introduced by coercion

complete.cases(x)

Purpose: Identify rows with no missing values Help: Type ?complete.cases in R console Returns: Logical vector (TRUE for complete rows)

Create data with missing values:
> df <- data.frame(
+   x = c(1, 2, NA, 4),
+   y = c(5, NA, 7, 8)
+ )
> complete.cases(df)
[1]  TRUE FALSE FALSE  TRUE
Filter to complete cases only:
> df_clean <- df[complete.cases(df), ]
> nrow(df_clean)
[1] 2
Count incomplete cases:
> sum(!complete.cases(df))
[1] 2
See also: is.na(), na.omit()

ifelse(test, yes, no)

Purpose: Vectorized conditional operation Help: Type ?ifelse in R console

Basic pass/fail grading:

> score <- c(85, 72, 90, 68, 88)
> grade <- ifelse(score >= 80, "Pass", "Fail")
> grade
[1] "Pass" "Fail" "Pass" "Fail" "Pass"

Nested ifelse for letter grades:

grade <- ifelse(score >= 90, "A",
               ifelse(score >= 80, "B",
                     ifelse(score >= 70, "C", "F")))

Conditional calculations:

df$adjusted <- ifelse(df$group == "control",
                     df$score * 1.1,
                     df$score)
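Deeply nested ifelse() calls become hard to read. As one possible alternative (not the only one), cut() can bin a numeric vector in a single call. This sketch reuses the score vector from above and checks that both approaches agree.

```r
score <- c(85, 72, 90, 68, 88)

# Nested ifelse(), as in the letter-grade example
nested <- ifelse(score >= 90, "A",
          ifelse(score >= 80, "B",
          ifelse(score >= 70, "C", "F")))

# cut() with right = FALSE uses intervals closed on the left: [70, 80), [80, 90), ...
binned <- cut(score,
              breaks = c(-Inf, 70, 80, 90, Inf),
              labels = c("F", "C", "B", "A"),
              right = FALSE)

nested
identical(nested, as.character(binned))
```

Note that cut() returns a factor, so convert with as.character() if plain strings are needed.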

IQR(x, na.rm = FALSE)

Purpose: Calculate interquartile range (Q3 - Q1) Help: Type ?IQR in R console

With missing values:
> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
> IQR(x)
[1] NA
> IQR(x, na.rm = TRUE)
[1] 4
Compare with manual calculation from quantiles:
> quantile(x, c(0.25, 0.75), na.rm = TRUE)
25% 75%
  3   7
See also: quantile(), fivenum(), boxplot.stats()

is.na(x)

Purpose: Test for missing values Help: Type ?is.na in R console

Identifying NAs:
> x <- c(1, NA, 3, NA, 5)
> is.na(x)
[1] FALSE  TRUE FALSE  TRUE FALSE
Counting NAs:
> sum(is.na(x))
[1] 2
Finding positions of NAs:
> which(is.na(x))
[1] 2 4
Replacing NAs:
x[is.na(x)] <- 0
See also: complete.cases(), na.omit(), anyNA()

paste(…, sep = " ", collapse = NULL)

Purpose: Concatenate strings Help: Type ?paste in R console Key arguments:

sep: Separator between elements

collapse: Collapse vector to single string

Basic concatenation:
> paste("Mean:", 5.2)
[1] "Mean: 5.2"
Custom separator for dates:
> paste("2024", "01", "15", sep = "-")
[1] "2024-01-15"
Vectorized operation:
> paste("Group", 1:3)
[1] "Group 1" "Group 2" "Group 3"
Collapse to single string:
> paste(c("A", "B", "C"), collapse = ", ")
[1] "A, B, C"
No-space version with paste0:
> paste0("var", 1:3)
[1] "var1" "var2" "var3"

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)

Purpose: Calculate sample quantiles Help: Type ?quantile in R console

Default quartiles:
> x <- c(1:10, NA)
> quantile(x, na.rm = TRUE)
   0%   25%   50%   75%  100%
 1.00  3.25  5.50  7.75 10.00
Custom percentiles:
> quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
10% 90%
1.9 9.1
See also: median(), IQR(), fivenum()

sapply(X, FUN), lapply(X, FUN)

Purpose: Apply function over list/vector elements Help: Type ?sapply or ?lapply in R console Differences:

lapply: Always returns a list

sapply: Simplifies to vector/matrix if possible

Create a list of vectors:
> lst <- list(a = 1:5, b = 6:10, c = 11:15)
sapply simplifies to vector:
> sapply(lst, mean)
 a  b  c
 3  8 13
lapply returns list:
> lapply(lst, mean)
$a
[1] 3

$b
[1] 8

$c
[1] 13
Check multiple columns for NAs:
sapply(df, function(x) sum(is.na(x)))

tapply(X, INDEX, FUN)

Purpose: Apply function by group Help: Type ?tapply in R console

Group means using built-in dataset:
> tapply(iris$Sepal.Length, iris$Species, mean)
    setosa versicolor  virginica
     5.006      5.936      6.588
Multiple statistics per group (the result is a list with one element per group):
tapply(iris$Sepal.Length, iris$Species,
       function(x) c(mean = mean(x), sd = sd(x)))
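When the function returns more than one value, the tapply() result is a list-like array, which can be awkward to print or plot. A minimal sketch of one way to flatten it into a matrix with do.call(rbind, ...); the object names are illustrative.

```r
stats_by_species <- tapply(iris$Sepal.Length, iris$Species,
                           function(x) c(mean = mean(x), sd = sd(x)))

# Each element holds the named vector for one group
stats_by_species[["setosa"]]

# Stack the per-group vectors into a matrix (one row per group)
stats_mat <- do.call(rbind, stats_by_species)
stats_mat
```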
See also: aggregate(), by()

Descriptive Statistics & Correlation

cor(x, y = NULL, use = "everything", method = "pearson")

Purpose: Calculate correlation coefficient Help: Type ?cor in R console Key arguments:

use: How to handle NAs ("complete.obs" drops them)

method: "pearson" (default), "spearman", "kendall"

Perfect positive correlation:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 6, 8, 10)
> cor(x, y)
[1] 1
Using built-in dataset:
> cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594
Correlation matrix:
> cor(mtcars[,c("mpg", "wt", "hp")])
           mpg         wt         hp
mpg  1.0000000 -0.8676594 -0.7761684
wt  -0.8676594  1.0000000  0.6587479
hp  -0.7761684  0.6587479  1.0000000
Interpretation guide (rule-of-thumb thresholds; conventions vary by field):

-1 to -0.7: Strong negative

-0.7 to -0.3: Moderate negative

-0.3 to 0.3: Weak/no linear relationship

0.3 to 0.7: Moderate positive

0.7 to 1: Strong positive

Test for significance:
cor.test(mtcars$mpg, mtcars$wt)
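The use and method arguments are easiest to see on a small example. This sketch uses made-up vectors with one missing value and one extreme y value; the variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5, NA)
y <- c(2, 4, 6, 8, 30, 12)

# Default use = "everything": any NA makes the result NA
r_all <- cor(x, y)
r_all

# Drop incomplete pairs before computing Pearson's r
r_pearson <- cor(x, y, use = "complete.obs")
r_pearson

# Spearman works on ranks, so a monotone relationship gives exactly 1
r_spearman <- cor(x, y, use = "complete.obs", method = "spearman")
r_spearman
```

Pearson's r falls below 1 here because the relationship is monotone but not linear.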
See also: cor.test(), cov()

mean(x, trim = 0, na.rm = FALSE)

Purpose: Calculate arithmetic mean Help: Type ?mean in R console Key arguments:

na.rm: Remove NAs before calculation

trim: Fraction to trim from each end

Basic usage with NAs:
> x <- c(1, 2, 3, 4, 5, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 3
Trimmed mean (robust to outliers):
> y <- c(1, 2, 3, 4, 100)
> mean(y)
[1] 22
> mean(y, trim = 0.2)
[1] 3
With trim = 0.2, the lowest and highest 20% of the observations (one value from each end here) are dropped before averaging.
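The trimming step can be reproduced by hand. A minimal sketch, assuming the same vector y as above:

```r
y <- c(1, 2, 3, 4, 100)
trim <- 0.2

# Number of observations dropped from each end
k <- floor(length(y) * trim)

# Sort, then keep the middle values
y_sorted <- sort(y)
kept <- y_sorted[(k + 1):(length(y) - k)]
kept

manual_trimmed <- mean(kept)
manual_trimmed
mean(y, trim = trim)
```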

See also: median(), summary()

median(x, na.rm = FALSE)

Purpose: Calculate median (middle value) Help: Type ?median in R console

Odd number of values:
> median(c(1, 3, 5))
[1] 3
Even number of values:
> median(c(1, 2, 3, 4))
[1] 2.5
Median is more robust than mean for skewed data:
> x <- c(20, 25, 30, 35, 200)
> mean(x)
[1] 62
> median(x)
[1] 30

sd(x, na.rm = FALSE)

Purpose: Calculate sample standard deviation Help: Type ?sd in R console Note: Uses n-1 denominator (sample SD)
> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> sd(x)
[1] 2.13809
> var(x)
[1] 4.571429
The variance equals the square of the standard deviation.
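To make the n-1 denominator concrete, here is a sketch that computes the sample SD by hand for the same vector and contrasts it with the population version (denominator n):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

# Sample SD: divide by n - 1 (this is what sd() does)
manual_sd <- sqrt(ss / (n - 1))
all.equal(manual_sd, sd(x))

# Population SD: divide by n (not provided directly by base R)
pop_sd <- sqrt(ss / n)
pop_sd
```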

Calculate coefficient of variation (relative variability):
> cv <- sd(x) / mean(x) * 100
> paste("CV:", round(cv, 1), "%")
[1] "CV: 42.8 %"
See also: var(), mad() (robust alternative)

Probability & Distributions

Normal Distribution Functions

dnorm(x, mean = 0, sd = 1)

Purpose: Normal probability density Help: Type ?dnorm in R console Use cases: Overlay theoretical curves, calculate likelihoods

Standard normal density at x = 0:
> dnorm(0)
[1] 0.3989423
Custom parameters:
> dnorm(100, mean = 100, sd = 15)
[1] 0.02659615
Use in ggplot2 for overlaying normal curve:
# Assumes x is a numeric vector of observations (e.g. from rnorm())
library(ggplot2)
ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm,
               args = list(mean = mean(x), sd = sd(x)),
               color = "red")

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)

Purpose: Normal cumulative distribution (probability) Help: Type ?pnorm in R console

Standard normal probabilities:
> pnorm(1.96)
[1] 0.9750021
> pnorm(1.96, lower.tail = FALSE)
[1] 0.0249979
Two-tailed p-value:
> 2 * pnorm(-abs(1.96))
[1] 0.04999579
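Probabilities for an interval come from subtracting two cumulative probabilities. A short sketch, using the mean = 100, sd = 15 parameters from the dnorm() example as an assumed illustration:

```r
# P(-1 < Z < 1) for the standard normal
p_std <- pnorm(1) - pnorm(-1)
p_std

# P(85 < X < 115) for X ~ Normal(mean = 100, sd = 15)
p_x <- pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
p_x

# Same answer after standardizing: z = (x - mean) / sd
z <- (115 - 100) / 15
p_z <- pnorm(z) - pnorm(-z)
p_z
```

All three values are about 0.683, the "within one standard deviation" share of the empirical rule.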

qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)

Purpose: Normal quantiles (inverse CDF) Help: Type ?qnorm in R console

Critical values:
> qnorm(0.975)
[1] 1.959964
> qnorm(0.95)
[1] 1.644854
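The first gives the critical value for a two-tailed test at the 5% level (or a 95% confidence interval); the second gives the critical value for a one-tailed test at the same level.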

Calculate confidence interval:
xbar <- 100; s <- 15; n <- 25
xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)

rnorm(n, mean = 0, sd = 1)

Purpose: Generate random normal values Help: Type ?rnorm in R console
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)

# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(bins = 20) +
  ggtitle("Sample from Normal(50, 10)")

t Distribution Functions

pt(q, df, lower.tail = TRUE)

Purpose: Student’s t cumulative distribution Help: Type ?pt in R console Use cases: Calculate p-values for t-tests

One-sided p-value:
> t_stat <- 2.5
> df <- 24
> round(pt(t_stat, df, lower.tail = FALSE), 4)
[1] 0.0098
Two-sided p-value:
> round(2 * pt(-abs(t_stat), df), 4)
[1] 0.0197
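To see how pt() relates to a complete test, the sketch below computes a one-sample t statistic by hand and compares the resulting two-sided p-value with t.test(). The simulated sample, the seed, and the null value mu0 are all assumptions made for illustration.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
sample_x <- rnorm(25, mean = 52, sd = 10)
mu0 <- 50                         # hypothetical null-hypothesis mean

n <- length(sample_x)
t_manual <- (mean(sample_x) - mu0) / (sd(sample_x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Built-in test for comparison
p_builtin <- t.test(sample_x, mu = mu0)$p.value

all.equal(p_manual, p_builtin)
```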

qt(p, df, lower.tail = TRUE)

Purpose: Student’s t quantiles Help: Type ?qt in R console Use cases: Critical values for confidence intervals

95% CI critical value:
> qt(0.975, df = 24)
[1] 2.063899
Compare to normal:
> qnorm(0.975)
[1] 1.959964
The t critical value is larger due to heavier tails.
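The effect on a confidence interval can be sketched with the summary values used in the qnorm() entry (xbar = 100, s = 15, n = 25), treated here as assumed inputs:

```r
xbar <- 100; s <- 15; n <- 25
se <- s / sqrt(n)                 # standard error of the mean

# 95% interval using the normal critical value
ci_z <- xbar + c(-1, 1) * qnorm(0.975) * se

# 95% interval using the t critical value with n - 1 degrees of freedom
ci_t <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se

ci_z
ci_t

# The t interval is wider
diff(ci_t) > diff(ci_z)
```

The t-based interval is the usual choice when the standard deviation is estimated from the sample.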

See also: dt(), rt()

Other Distributions

dbinom(x, size, prob)

Purpose: Binomial probability mass function Help: Type ?dbinom in R console

Probability of exactly 3 successes in 10 trials with p = 0.5:
> dbinom(3, size = 10, prob = 0.5)
[1] 0.1171875
Full distribution:
x <- 0:10
p <- dbinom(x, size = 10, prob = 0.5)

# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
  geom_col() +
  scale_x_continuous(breaks = 0:10) +
  ggtitle("Binomial(10, 0.5) Distribution")
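Two quick sanity checks on a probability mass function, sketched for the same Binomial(10, 0.5) example: the probabilities sum to 1, and summing dbinom() up to a value matches the cumulative function pbinom().

```r
p <- dbinom(0:10, size = 10, prob = 0.5)

# All probabilities together sum to 1
total <- sum(p)
total

# P(X <= 3) two ways: summing the pmf, and the cumulative function
p_sum <- sum(dbinom(0:3, size = 10, prob = 0.5))
p_cdf <- pbinom(3, size = 10, prob = 0.5)
p_sum
p_cdf
```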

qtukey(p, nmeans, df)

Purpose: Tukey HSD critical values Help: Type ?qtukey in R console

Critical value for 3 groups with df = 27:
> qtukey(0.95, nmeans = 3, df = 27)
[1] 3.506426

Simulation Functions

set.seed(seed)

Purpose: Set random number generator seed for reproducibility Help: Type ?set.seed in R console

Without seed (different each time):
> rnorm(3)
[1]  0.3186301 -0.5817907  0.7145327
> rnorm(3)
[1] -0.8252594 -0.3598138 -0.0109303
With seed (reproducible):
> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083
> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083

Random Generation Functions

All random generation functions take n (number of values) as the first argument.

set.seed(1)
rnorm(5, mean = 100, sd = 15)    # Normal
runif(5, min = 0, max = 10)      # Uniform
rexp(5, rate = 2)                # Exponential

For simulation studies:

n_sims <- 1000
means <- replicate(n_sims, mean(rnorm(30)))

# Visualize sampling distribution with ggplot2
library(ggplot2)
ggplot(data.frame(means = means), aes(x = means)) +
  geom_histogram(bins = 30) +
  ggtitle("Sampling Distribution of Mean")
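A simulation like this can be checked against theory. For samples of size n from a standard normal, the standard deviation of the sample means should be close to 1 / sqrt(n). The sketch below repeats the simulation with an arbitrary seed (an assumption added for reproducibility) and compares the two.

```r
set.seed(2024)                    # arbitrary seed, for reproducibility only
n_sims <- 1000
n <- 30

means <- replicate(n_sims, mean(rnorm(n)))

# Empirical standard error: spread of the simulated means
empirical_se <- sd(means)

# Theoretical standard error: sigma / sqrt(n), with sigma = 1 for rnorm()
theoretical_se <- 1 / sqrt(n)

empirical_se
theoretical_se
```

The two values agree more closely as n_sims grows.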