Function Reference Part 1

Data I/O & Housekeeping

getwd()

Purpose: Show current working directory

Help: Type ?getwd in R console

Syntax: getwd()

Example showing the working directory:

> getwd()
[1] "/Users/username/STAT350"

head(x, n = 6)

Purpose: Display first n rows of a data frame or elements of a vector

Help: Type ?head in R console

Syntax: head(x, n = 6)

Common use: Quickly inspect data after importing

Using built-in iris dataset:

> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

See also: tail(), str(), summary()

read.csv(file, header = TRUE, stringsAsFactors = FALSE, …)

Purpose: Import CSV file into a data frame

Help: Type ?read.csv in R console

Key arguments:

  • header: First row contains column names (default TRUE)

  • stringsAsFactors: Convert strings to factors (default FALSE since R 4.0.0; FALSE keeps strings as characters)

  • na.strings: Values to treat as NA (default "NA")

Common pitfalls:

  • Using absolute paths (breaks on other computers)

  • Forgetting to check for proper import with str()

  • Not handling special characters in column names
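The last pitfall can be reproduced without a file on disk. The following is a minimal sketch, assuming a made-up two-column CSV passed through the text argument of read.csv(); the column names are purely illustrative.

```r
# Hypothetical CSV content with a space and symbols in the headers
csv_text <- "my col,rate (%)\n1,2\n3,4"

# Default (check.names = TRUE): headers are passed through make.names()
fixed <- read.csv(text = csv_text)
names(fixed)

# check.names = FALSE keeps the headers exactly as written
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)
```

Columns that keep non-syntactic names have to be quoted with backticks or accessed with [[ ]], for example raw[["rate (%)"]].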

Basic import:

> mydata <- read.csv("data/experiment.csv")
> str(mydata)
'data.frame':     150 obs. of  5 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ group   : chr  "A" "B" "A" "C" ...
 $ measure1: num  23.5 19.2 25.1 22.8 20.3 ...
 $ measure2: num  45.2 41.8 46.7 44.3 42.1 ...
 $ outcome : chr  "success" "failure" "success" ...

With options for handling missing data:

mydata <- read.csv("data/experiment.csv",
                  stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A", ""))

Always validate after import:

> sum(is.na(mydata))
[1] 12

See also: read.table(), write.csv(), str()

setwd(path)

Purpose: Change working directory

Help: Type ?setwd in R console

Not recommended in scripts:

# Bad practice - machine specific!
# setwd("/Users/username/STAT350")
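One common alternative is to keep paths relative to the project folder. This is only a sketch: the data/experiment.csv location is borrowed from the read.csv() example and is assumed to exist relative to wherever the script is run.

```r
# Build a path relative to the project folder instead of calling setwd()
csv_path <- file.path("data", "experiment.csv")
csv_path

# Check that the file is where the script expects it before importing
if (file.exists(csv_path)) {
  mydata <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path,
          " (working directory: ", getwd(), ")")
}
```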

write.csv(x, file, row.names = FALSE)

Purpose: Export data frame to CSV file

Help: Type ?write.csv in R console

Key arguments:

  • row.names: Write row names as the first column (default TRUE; usually set to FALSE)

  • na: String for missing values (default "NA")

Save cleaned data:

write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)
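The effect of row.names is easiest to see with a round trip. A minimal sketch, assuming a small made-up data frame and temporary files so that nothing is written into the project folder:

```r
# Small illustrative data frame (made-up values)
d_clean <- data.frame(id = 1:3, score = c(85, 90.5, 78))

# Temporary files, removed automatically when the R session ends
f_no_rn <- tempfile(fileext = ".csv")
f_rn    <- tempfile(fileext = ".csv")

write.csv(d_clean, f_no_rn, row.names = FALSE)
write.csv(d_clean, f_rn)    # default row.names = TRUE

names(read.csv(f_no_rn))    # "id" "score"
names(read.csv(f_rn))       # "X" "id" "score": row names return as a column
```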

Data Structures & Creation

c(…)

Purpose: Combine values into a vector

Help: Type ?c in R console

Creating different types of vectors:

> v1 <- c(1, 2, 3, 4, 5)
> v2 <- c("iOS", "Android", "Windows")
> v3 <- c(1, "two", 3)
> class(v3)
[1] "character"

Note that mixed types get coerced to the most general type (character in this case).
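The coercion order is logical, then integer, then double, then character. A short sketch that checks each step with class():

```r
class(c(TRUE, 2L))    # logical + integer  -> "integer"
class(c(1L, 2.5))     # integer + double   -> "numeric"
class(c(1.5, "a"))    # double + character -> "character"

c(TRUE, 2L)           # TRUE is stored as 1
```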

colnames(x), rownames(x)

Purpose: Get or set column/row names

Help: Type ?colnames or ?rownames in R console

Getting column names from built-in dataset:

> colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Renaming columns:

# Create a copy to modify
mydata <- mtcars[1:5, 1:3]
colnames(mydata) <- c("MPG", "Cylinders", "Displacement")

# Or rename specific columns
colnames(mydata)[1] <- "MilesPerGallon"

data.frame(…)

Purpose: Create a data frame from vectors

Help: Type ?data.frame in R console

Key arguments:

  • stringsAsFactors: Convert strings to factors (default FALSE since R 4.0.0; set FALSE explicitly for older versions)

Creating a data frame from scratch:

> df <- data.frame(
+   id = 1:5,
+   group = c("A", "A", "B", "B", "B"),
+   score = c(85, 90, 78, 82, 88),
+   stringsAsFactors = FALSE
+ )
> str(df)
'data.frame':     5 obs. of  3 variables:
 $ id   : int  1 2 3 4 5
 $ group: chr  "A" "A" "B" "B" "B"
 $ score: num  85 90 78 82 88

See also: tibble::tibble() for a modern alternative

factor(x, levels = …, ordered = FALSE)

Purpose: Create categorical variables with defined levels

Help: Type ?factor in R console

When to use: For categorical data in ANOVA, controlling plot order

Basic factor creation with automatic level detection:

> platform <- factor(c("iOS", "Android", "iOS", "Windows"))
> levels(platform)
[1] "Android" "iOS"     "Windows"

Controlling level order for plots and analyses:

platform <- factor(platform,
                  levels = c("iOS", "Android", "Windows"))

Creating ordered factors for ordinal data:

satisfaction <- factor(c("Low", "High", "Medium", "High"),
                      levels = c("Low", "Medium", "High"),
                      ordered = TRUE)
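What ordered = TRUE adds is that comparisons follow the level order. A small sketch that rebuilds the same factor so it can run on its own:

```r
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

satisfaction < "High"    # TRUE FALSE TRUE FALSE
max(satisfaction)        # High
table(satisfaction)      # counts are listed in level order
```

The same comparison on an unordered factor returns NA with a warning, because its levels carry no ranking.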

Common pitfall: Converting numeric factors back to numbers

Wrong way (returns level indices):

> f <- factor(c("2", "4", "6"))
> as.numeric(f)
[1] 1 2 3

Correct way:

> as.numeric(as.character(f))
[1] 2 4 6

See also: levels(), relevel(), droplevels()

Data Wrangling & Utilities

apply(X, MARGIN, FUN)

Purpose: Apply function over matrix/array margins

Help: Type ?apply in R console

Key arguments:

  • MARGIN: 1 for rows, 2 for columns

  • FUN: Function to apply

Create a sample matrix:

> M <- matrix(1:12, nrow = 3)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Row means (MARGIN = 1):

> apply(M, 1, mean)
[1] 5.5 6.5 7.5

Column sums (MARGIN = 2):

> apply(M, 2, sum)
[1]  6 15 24 33

Custom function to calculate range:

> apply(M, 2, function(x) max(x) - min(x))
[1] 2 2 2 2
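For plain row or column means and sums, base R also provides dedicated functions that return the same values as the apply() calls above. A brief sketch using the same matrix:

```r
M <- matrix(1:12, nrow = 3)

rowMeans(M)   # same values as apply(M, 1, mean)
colSums(M)    # same values as apply(M, 2, sum)
```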

See also: lapply(), sapply(), tapply()

as.numeric(x)

Purpose: Convert to numeric type

Help: Type ?as.numeric in R console

Common uses: Fix data imported as characters, convert factors (via as.character() first)

Character to numeric conversion:

> x <- c("1.5", "2.3", "3.1")
> as.numeric(x)
[1] 1.5 2.3 3.1

Non-numeric values produce NAs with a warning:

> y <- c("1", "2", "three")
> as.numeric(y)
[1]  1  2 NA
Warning message:
NAs introduced by coercion
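To find out which entries caused the warning, one option is to convert quietly and then inspect the new NAs. A minimal sketch (the helper names y_num and bad are illustrative):

```r
y <- c("1", "2", "three")

# Convert without printing the coercion warning
y_num <- suppressWarnings(as.numeric(y))

# Entries that were not NA before conversion but are NA afterwards
bad <- y[is.na(y_num) & !is.na(y)]
bad    # "three"
```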

complete.cases(x)

Purpose: Identify rows with no missing values

Help: Type ?complete.cases in R console

Returns: Logical vector (TRUE for complete rows)

Create data with missing values:

> df <- data.frame(
+   x = c(1, 2, NA, 4),
+   y = c(5, NA, 7, 8)
+ )
> complete.cases(df)
[1]  TRUE FALSE FALSE  TRUE

Filter to complete cases only:

> df_clean <- df[complete.cases(df), ]
> nrow(df_clean)
[1] 2

Count incomplete cases:

> sum(!complete.cases(df))
[1] 2

See also: is.na(), na.omit()

ifelse(test, yes, no)

Purpose: Vectorized conditional operation

Help: Type ?ifelse in R console

Basic pass/fail grading:

> score <- c(85, 72, 90, 68, 88)
> grade <- ifelse(score >= 80, "Pass", "Fail")
> grade
[1] "Pass" "Fail" "Pass" "Fail" "Pass"

Nested ifelse for letter grades:

grade <- ifelse(score >= 90, "A",
               ifelse(score >= 80, "B",
                     ifelse(score >= 70, "C", "F")))
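When the nesting gets deep, cut() is a possible alternative for binning a numeric variable. A sketch using the same cut-offs as the nested ifelse above; note that cut() returns a factor rather than a character vector:

```r
score <- c(85, 72, 90, 68, 88)

# right = FALSE makes intervals closed on the left: [70, 80), [80, 90), ...
grade <- cut(score,
             breaks = c(-Inf, 70, 80, 90, Inf),
             labels = c("F", "C", "B", "A"),
             right = FALSE)

as.character(grade)    # "B" "C" "A" "F" "B"
```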

Conditional calculations (assuming df has group and score columns):

df$adjusted <- ifelse(df$group == "control",
                     df$score * 1.1,
                     df$score)

IQR(x, na.rm = FALSE)

Purpose: Calculate interquartile range (Q3 - Q1)

Help: Type ?IQR in R console

With missing values:

> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
> IQR(x)
[1] NA
> IQR(x, na.rm = TRUE)
[1] 4

Compare with manual calculation from quantiles:

> quantile(x, c(0.25, 0.75), na.rm = TRUE)
25% 75%
  3   7

See also: quantile(), fivenum(), boxplot.stats()

is.na(x)

Purpose: Test for missing values

Help: Type ?is.na in R console

Identifying NAs:

> x <- c(1, NA, 3, NA, 5)
> is.na(x)
[1] FALSE  TRUE FALSE  TRUE FALSE

Counting NAs:

> sum(is.na(x))
[1] 2

Finding positions of NAs:

> which(is.na(x))
[1] 2 4

Replacing NAs:

x[is.na(x)] <- 0

See also: complete.cases(), na.omit(), anyNA()

paste(…, sep = " ", collapse = NULL)

Purpose: Concatenate strings

Help: Type ?paste in R console

Key arguments:

  • sep: Separator between elements

  • collapse: Collapse vector to single string

Basic concatenation:

> paste("Mean:", 5.2)
[1] "Mean: 5.2"

Custom separator for dates:

> paste("2024", "01", "15", sep = "-")
[1] "2024-01-15"

Vectorized operation:

> paste("Group", 1:3)
[1] "Group 1" "Group 2" "Group 3"

Collapse to single string:

> paste(c("A", "B", "C"), collapse = ", ")
[1] "A, B, C"

No-space version with paste0:

> paste0("var", 1:3)
[1] "var1" "var2" "var3"

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)

Purpose: Calculate sample quantiles

Help: Type ?quantile in R console

Default quartiles:

> x <- c(1:10, NA)
> quantile(x, na.rm = TRUE)
   0%   25%   50%   75%  100%
 1.00  3.25  5.50  7.75 10.00

Custom percentiles:

> quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
10% 90%
1.9 9.1
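quantile() interpolates between data points by default (type = 7), which is why quantiles need not be observed values. The following sketch reproduces the 25% quantile of 1:10 by hand; the helper names h and lo are illustrative.

```r
x <- 1:10
p <- 0.25

# Type 7 rule: position h = 1 + (n - 1) * p in the sorted data
xs <- sort(x)
h  <- 1 + (length(xs) - 1) * p    # 3.25
lo <- floor(h)

# Interpolate linearly between the neighbouring order statistics
q_manual <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

q_manual                  # 3.25
unname(quantile(x, p))    # 3.25
```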

See also: median(), IQR(), fivenum()

sapply(X, FUN), lapply(X, FUN)

Purpose: Apply function over list/vector elements

Help: Type ?sapply or ?lapply in R console

Differences:

  • lapply: Always returns a list

  • sapply: Simplifies to vector/matrix if possible

Create a list of vectors:

> lst <- list(a = 1:5, b = 6:10, c = 11:15)

sapply simplifies to vector:

> sapply(lst, mean)
 a  b  c
 3  8 13

lapply returns list:

> lapply(lst, mean)
$a
[1] 3

$b
[1] 8

$c
[1] 13

Check multiple columns for NAs:

sapply(df, function(x) sum(is.na(x)))
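Because sapply() decides its return type from the results, base R also offers vapply(), where the expected type is stated up front. A short sketch on the same list:

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# FUN.VALUE = numeric(1) declares that each result is a single number
means <- vapply(lst, mean, FUN.VALUE = numeric(1))
means

# On empty input, vapply() still returns a numeric vector (length 0),
# whereas sapply() returns an empty list
empty_v <- vapply(list(), mean, FUN.VALUE = numeric(1))
empty_s <- sapply(list(), mean)
```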

tapply(X, INDEX, FUN)

Purpose: Apply function by group

Help: Type ?tapply in R console

Group means using built-in dataset:

> tapply(iris$Sepal.Length, iris$Species, mean)
    setosa versicolor  virginica
     5.006      5.936      6.588

Multiple statistics per group:

tapply(iris$Sepal.Length, iris$Species,
       function(x) c(mean = mean(x), sd = sd(x)))
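When the function returns more than one value, tapply() gives back a list-like array. If a plain matrix is preferred, one possible alternative (a sketch, not the only approach) is to combine split() with sapply():

```r
stats <- sapply(split(iris$Sepal.Length, iris$Species),
                function(x) c(mean = mean(x), sd = sd(x)))

t(stats)    # one row per species, columns "mean" and "sd"
```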

See also: aggregate(), by()

Descriptive Statistics & Correlation

cor(x, y = NULL, use = "everything", method = "pearson")

Purpose: Calculate correlation coefficient

Help: Type ?cor in R console

Key arguments:

  • use: How to handle NAs ("complete.obs" drops them)

  • method: "pearson" (default), "spearman", "kendall"

Perfect positive correlation:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 6, 8, 10)
> cor(x, y)
[1] 1

Using built-in dataset:

> cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594

Correlation matrix:

> cor(mtcars[,c("mpg", "wt", "hp")])
           mpg         wt         hp
mpg  1.0000000 -0.8676594 -0.7761684
wt  -0.8676594  1.0000000  0.6587479
hp  -0.7761684  0.6587479  1.0000000

Interpretation guide (a rough rule of thumb; thresholds vary by field):

  • -1 to -0.7: Strong negative

  • -0.7 to -0.3: Moderate negative

  • -0.3 to 0.3: Weak/no linear relationship

  • 0.3 to 0.7: Moderate positive

  • 0.7 to 1: Strong positive

Test for significance:

cor.test(mtcars$mpg, mtcars$wt)

See also: cor.test(), cov()
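The use and method arguments can be illustrated with small made-up vectors. A minimal sketch:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)

cor(x, y)                          # NA: default use = "everything"
cor(x, y, use = "complete.obs")    # 1: the pair containing NA is dropped

# Spearman works on ranks, so any strictly increasing relationship gives 1
cor(1:5, (1:5)^2, method = "spearman")
```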

mean(x, trim = 0, na.rm = FALSE)

Purpose: Calculate arithmetic mean

Help: Type ?mean in R console

Key arguments:

  • na.rm: Remove NAs before calculation

  • trim: Fraction to trim from each end

Basic usage with NAs:

> x <- c(1, 2, 3, 4, 5, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 3

Trimmed mean (robust to outliers):

> y <- c(1, 2, 3, 4, 100)
> mean(y)
[1] 22
> mean(y, trim = 0.2)
[1] 3

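With trim = 0.2, the lowest 20% and highest 20% of the sorted values (here one value from each end) are dropped before averaging, so the mean is taken over 2, 3, 4.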

See also: median(), summary()

median(x, na.rm = FALSE)

Purpose: Calculate median (middle value)

Help: Type ?median in R console

Odd number of values:

> median(c(1, 3, 5))
[1] 3

Even number of values:

> median(c(1, 2, 3, 4))
[1] 2.5

Median is more robust than mean for skewed data:

> x <- c(20, 25, 30, 35, 200)
> mean(x)
[1] 62
> median(x)
[1] 30

sd(x, na.rm = FALSE)

Purpose: Calculate sample standard deviation

Help: Type ?sd in R console

Note: Uses n-1 denominator (sample SD)

> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> sd(x)
[1] 2.13809
> var(x)
[1] 4.571429

The variance equals the square of the standard deviation.
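The n-1 denominator can be verified by computing the quantities by hand. A sketch on the same data (the names ss, var_manual and sd_manual are illustrative):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

ss <- sum((x - mean(x))^2)       # sum of squared deviations: 32
var_manual <- ss / (n - 1)       # sample variance: 32 / 7
sd_manual  <- sqrt(var_manual)

all.equal(var_manual, var(x))    # TRUE
all.equal(sd_manual, sd(x))      # TRUE
sqrt(ss / n)                     # dividing by n instead gives 2
```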

Calculate coefficient of variation (relative variability):

> cv <- sd(x) / mean(x) * 100
> paste("CV:", round(cv, 1), "%")
[1] "CV: 42.8 %"

See also: var(), mad() (robust alternative)

Probability & Distributions

Normal Distribution Functions

dnorm(x, mean = 0, sd = 1)

Purpose: Normal probability density

Help: Type ?dnorm in R console

Use cases: Overlay theoretical curves, calculate likelihoods

Standard normal density at x = 0:

> dnorm(0)
[1] 0.3989423

Custom parameters:

> dnorm(100, mean = 100, sd = 15)
[1] 0.02659615

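Use in ggplot2 for overlaying a normal curve (x is a numeric vector of data):

library(ggplot2)
# Histogram on the density scale so the curve and bars are comparable
ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm,
               args = list(mean = mean(x), sd = sd(x)),
               color = "red")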

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)

Purpose: Normal cumulative distribution (probability)

Help: Type ?pnorm in R console

Standard normal probabilities:

> pnorm(1.96)
[1] 0.9750021
> pnorm(1.96, lower.tail = FALSE)
[1] 0.0249979

Two-tailed p-value:

> 2 * pnorm(-abs(1.96))
[1] 0.04999579
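Probabilities for an interval come from subtracting two cumulative probabilities. A short sketch; the Normal(100, 15) parameters reuse the values from the dnorm() example and are otherwise arbitrary:

```r
# P(-1.96 < Z < 1.96) for a standard normal
pnorm(1.96) - pnorm(-1.96)    # about 0.95

# P(85 < X < 115) for X ~ Normal(mean = 100, sd = 15), i.e. within 1 SD
p_within <- pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
p_within                      # about 0.683
```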

qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)

Purpose: Normal quantiles (inverse CDF)

Help: Type ?qnorm in R console

Critical values:

> qnorm(0.975)
[1] 1.959964
> qnorm(0.95)
[1] 1.644854

The first is the critical value for a two-sided 95% interval or test (2.5% in each tail); the second is the critical value for a one-sided test at the 5% level.

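Calculate a 95% confidence interval for a mean (z-interval, which treats s as the known population SD; use qt() when s is estimated from the sample):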

xbar <- 100; s <- 15; n <- 25
xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)
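When s is a sample standard deviation, the t critical value with n - 1 degrees of freedom is normally used instead. A sketch comparing both intervals with the same made-up summary statistics:

```r
xbar <- 100; s <- 15; n <- 25
se <- s / sqrt(n)                                      # standard error: 3

ci_z <- xbar + c(-1, 1) * qnorm(0.975) * se            # treats s as known
ci_t <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se   # s estimated

round(ci_z, 2)    # 94.12 105.88
round(ci_t, 2)    # 93.81 106.19
```

The t interval is slightly wider, reflecting the extra uncertainty from estimating the standard deviation.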

rnorm(n, mean = 0, sd = 1)

Purpose: Generate random normal values

Help: Type ?rnorm in R console

set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)

# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(bins = 20) +
  ggtitle("Sample from Normal(50, 10)")

t Distribution Functions

pt(q, df, lower.tail = TRUE)

Purpose: Student’s t cumulative distribution

Help: Type ?pt in R console

Use cases: Calculate p-values for t-tests

One-sided p-value:

> t_stat <- 2.5
> df <- 24
> pt(t_stat, df, lower.tail = FALSE)
[1] 0.009963466

Two-sided p-value:

> 2 * pt(-abs(t_stat), df)
[1] 0.01992693

qt(p, df, lower.tail = TRUE)

Purpose: Student’s t quantiles

Help: Type ?qt in R console

Use cases: Critical values for confidence intervals

95% CI critical value:

> qt(0.975, df = 24)
[1] 2.063899

Compare to normal:

> qnorm(0.975)
[1] 1.959964

The t critical value is larger due to heavier tails.
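The gap between the t and normal critical values shrinks as the degrees of freedom grow. A brief sketch over a few arbitrarily chosen df values:

```r
dfs <- c(5, 24, 100, 1000)
t_crit <- qt(0.975, df = dfs)

round(t_crit, 3)          # decreases toward qnorm(0.975) as df grows
t_crit - qnorm(0.975)     # always positive
```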

See also: dt(), rt()

Other Distributions

dbinom(x, size, prob)

Purpose: Binomial probability mass function

Help: Type ?dbinom in R console

Probability of exactly 3 successes in 10 trials with p = 0.5:

> dbinom(3, size = 10, prob = 0.5)
[1] 0.1171875

Full distribution:

x <- 0:10
p <- dbinom(x, size = 10, prob = 0.5)

# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
  geom_col() +
  scale_x_continuous(breaks = 0:10) +
  ggtitle("Binomial(10, 0.5) Distribution")
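Cumulative binomial probabilities can be obtained either by summing dbinom() values or with pbinom(). A sketch for the same Binomial(10, 0.5) distribution:

```r
# P(X <= 3), computed two equivalent ways
sum(dbinom(0:3, size = 10, prob = 0.5))    # 0.171875
pbinom(3, size = 10, prob = 0.5)           # 0.171875

# The probabilities over all possible outcomes sum to 1
sum(dbinom(0:10, size = 10, prob = 0.5))
```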

qtukey(p, nmeans, df)

Purpose: Tukey HSD critical values

Help: Type ?qtukey in R console

Critical value for 3 groups with df = 27:

> qtukey(0.95, nmeans = 3, df = 27)
[1] 3.506426

Simulation Functions

set.seed(seed)

Purpose: Set random number generator seed for reproducibility

Help: Type ?set.seed in R console

Without seed (different each time):

> rnorm(3)
[1]  0.3186301 -0.5817907  0.7145327
> rnorm(3)
[1] -0.8252594 -0.3598138 -0.0109303

With seed (reproducible):

> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083
> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083

Random Generation Functions

The r* random generation functions shown here take n (the number of values to generate) as their first argument.

set.seed(1)
rnorm(5, mean = 100, sd = 15)    # Normal
runif(5, min = 0, max = 10)      # Uniform
rexp(5, rate = 2)                # Exponential

For simulation studies:

n_sims <- 1000
means <- replicate(n_sims, mean(rnorm(30)))

# Visualize sampling distribution with ggplot2
library(ggplot2)
ggplot(data.frame(means = means), aes(x = means)) +
  geom_histogram(bins = 30) +
  ggtitle("Sampling Distribution of Mean")
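A simulation like this can be checked against theory: for samples of size n from a standard normal, the standard deviation of the sample means should be close to 1 / sqrt(n). A sketch with an arbitrary seed, chosen only for reproducibility:

```r
set.seed(350)    # arbitrary seed (an assumption, not from the course)
n <- 30
means <- replicate(1000, mean(rnorm(n)))

sd(means)        # empirical standard error of the mean
1 / sqrt(n)      # theoretical value sigma / sqrt(n), about 0.183
```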