Function Reference Part 1
Data I/O & Housekeeping
getwd()
Purpose: Show current working directory
Help: Type ?getwd in R console
Syntax: getwd()
Example showing the working directory:
> getwd()
[1] "/Users/username/STAT350"
head(x, n = 6)
Purpose: Display first n rows of a data frame or elements of a vector
Help: Type ?head in R console
Syntax: head(x, n = 6)
Common use: Quickly inspect data after importing
Using built-in iris dataset:
> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
See also: tail(), str(), summary()
read.csv(file, header = TRUE, stringsAsFactors = FALSE, ...)
Purpose: Import CSV file into a data frame
Help: Type ?read.csv in R console
Key arguments:
header: First row contains column names (default TRUE)
stringsAsFactors: Keep strings as characters (recommended FALSE, which is the default since R 4.0.0)
na.strings: Values to treat as NA (default "NA")
Common pitfalls:
Using absolute paths (breaks on other computers)
Forgetting to check for proper import with str()
Not handling special characters in column names
Basic import:
> mydata <- read.csv("data/experiment.csv")
> str(mydata)
'data.frame': 150 obs. of  5 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ group   : chr  "A" "B" "A" "C" ...
 $ measure1: num  23.5 19.2 25.1 22.8 20.3 ...
 $ measure2: num  45.2 41.8 46.7 44.3 42.1 ...
 $ outcome : chr  "success" "failure" "success" ...
With options for handling missing data:
mydata <- read.csv("data/experiment.csv",
                   stringsAsFactors = FALSE,
                   na.strings = c("NA", "N/A", ""))
Always validate after import:
> sum(is.na(mydata))
[1] 12
See also: read.table(), write.csv(), str()
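The missing-data options above can be tried without an external file. The sketch below writes a small CSV to a temporary file and reads it back, treating "N/A" and empty fields as NA. The column names and values are made up for illustration.

```r
# Hypothetical data written to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,measure1",
             "1,A,23.5",
             "2,N/A,19.2",
             "3,B,",
             "4,,22.8"), tmp)

mydata <- read.csv(tmp,
                   stringsAsFactors = FALSE,
                   na.strings = c("NA", "N/A", ""))

str(mydata)               # check column types after import
colSums(is.na(mydata))    # missing values per column
unlink(tmp)               # remove the temporary file
```

Counting NAs per column with colSums() shows where the missing values sit, which a single overall count does not.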
setwd(path)
Purpose: Change working directory
Help: Type ?setwd in R console
Not recommended in scripts:
# Bad practice - machine specific!
# setwd("/Users/username/STAT350")
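One common alternative to setwd() is to keep paths relative to the project folder and build them with file.path(). The sketch below assumes a hypothetical data/ subfolder and file name; neither is prescribed by this reference.

```r
# Build a path relative to the current working directory.
# "data" and "experiment.csv" are placeholder names for illustration.
csv_path <- file.path("data", "experiment.csv")
csv_path

# Check that the file exists before trying to read it
if (file.exists(csv_path)) {
  mydata <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path,
          " (working directory: ", getwd(), ")")
}
```

Because the path is relative, the same script runs on another computer as long as the project folder keeps the same layout.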
write.csv(x, file, row.names = FALSE)
Purpose: Export data frame to CSV file
Help: Type ?write.csv in R console
Key arguments:
row.names: Include row names as the first column (default TRUE; usually set to FALSE)
na: String for missing values (default "NA")
Save cleaned data:
write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)
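A round trip through a temporary file shows what row.names and na do to the written file. This is a minimal sketch; the d_clean contents here are invented for illustration.

```r
# Hypothetical cleaned data with one missing value
d_clean <- data.frame(id = 1:3, score = c(85, NA, 78))

out_file <- tempfile(fileext = ".csv")   # temporary file, for illustration only
write.csv(d_clean, out_file, row.names = FALSE, na = "")

readLines(out_file)            # raw file contents: header plus three rows
d_back <- read.csv(out_file)   # the empty numeric field is read back as NA
unlink(out_file)
```

With row.names = FALSE the file has no extra leading column, so reading it back gives the same two columns.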
Data Structures & Creation
c(...)
Purpose: Combine values into a vector
Help: Type ?c in R console
Creating different types of vectors:
> v1 <- c(1, 2, 3, 4, 5)
> v2 <- c("iOS", "Android", "Windows")
> v3 <- c(1, "two", 3)
> class(v3)
[1] "character"
Note that mixed types get coerced to the most general type (character in this case).
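The coercion rule can be checked directly. For atomic vectors, R promotes in the order logical, integer, double, character; the short sketch below illustrates each step.

```r
# Coercion follows logical -> integer -> double -> character
class(c(TRUE, 2L))     # "integer"
class(c(2L, 3.5))      # "numeric"
class(c(3.5, "a"))     # "character"

# Logical values become 1/0 when combined with numbers
mixed <- c(TRUE, FALSE, 5)
mixed                  # 1 0 5
```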
colnames(x), rownames(x)
Purpose: Get or set column/row names
Help: Type ?colnames or ?rownames in R console
Getting column names from built-in dataset:
> colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
Renaming columns:
# Create a copy to modify
mydata <- mtcars[1:5, 1:3]
colnames(mydata) <- c("MPG", "Cylinders", "Displacement")
# Or rename specific columns
colnames(mydata)[1] <- "MilesPerGallon"
data.frame(...)
Purpose: Create a data frame from vectors
Help: Type ?data.frame in R console
Key arguments:
stringsAsFactors: Auto-convert strings to factors (set FALSE)
Creating a data frame from scratch:
> df <- data.frame(
+   id = 1:5,
+   group = c("A", "A", "B", "B", "B"),
+   score = c(85, 90, 78, 82, 88),
+   stringsAsFactors = FALSE
+ )
> str(df)
'data.frame': 5 obs. of  3 variables:
 $ id   : int  1 2 3 4 5
 $ group: chr  "A" "A" "B" "B" "B"
 $ score: num  85 90 78 82 88
See also: tibble::tibble() for a modern alternative
factor(x, levels = …, ordered = FALSE)
Purpose: Create categorical variables with defined levels
Help: Type ?factor in R console
When to use: For categorical data in ANOVA, controlling plot order
Basic factor creation with automatic level detection:
> platform <- factor(c("iOS", "Android", "iOS", "Windows"))
> levels(platform)
[1] "Android" "iOS"     "Windows"
Controlling level order for plots and analyses:
platform <- factor(platform, levels = c("iOS", "Android", "Windows"))
Creating ordered factors for ordinal data:
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)
Common pitfall: Converting numeric factors back to numbers
Wrong way (returns level indices):
> f <- factor(c("2", "4", "6"))
> as.numeric(f)
[1] 1 2 3
Correct way:
> as.numeric(as.character(f))
[1] 2 4 6
See also: levels(), relevel(), droplevels()
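The two helpers listed under "See also" cover common follow-up tasks: relevel() changes the reference level and droplevels() removes unused levels. A minimal sketch, reusing the platform example with its levels set explicitly:

```r
platform <- factor(c("iOS", "Android", "iOS", "Windows"),
                   levels = c("Android", "iOS", "Windows"))

# relevel(): move one level to the front (the reference level in models)
platform_ref <- relevel(platform, ref = "iOS")
levels(platform_ref)              # "iOS" "Android" "Windows"

# droplevels(): remove levels that no longer occur after subsetting
no_windows <- platform[platform != "Windows"]
levels(no_windows)                # still lists "Windows"
levels(droplevels(no_windows))    # "Android" "iOS"
```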
Data Wrangling & Utilities
apply(X, MARGIN, FUN)
Purpose: Apply function over matrix/array margins
Help: Type ?apply in R console
Key arguments:
MARGIN: 1 for rows, 2 for columns
FUN: Function to apply
Create a sample matrix:
> M <- matrix(1:12, nrow = 3)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Row means (MARGIN = 1):
> apply(M, 1, mean)
[1] 5.5 6.5 7.5
Column sums (MARGIN = 2):
> apply(M, 2, sum)
[1]  6 15 24 33
Custom function to calculate range:
> apply(M, 2, function(x) max(x) - min(x))
[1] 2 2 2 2
See also: lapply(), sapply(), tapply()
as.numeric(x)
Purpose: Convert to numeric type
Help: Type ?as.numeric in R console
Common uses: Fix data imported as characters, convert factors
Character to numeric conversion:
> x <- c("1.5", "2.3", "3.1")
> as.numeric(x)
[1] 1.5 2.3 3.1
Non-numeric values produce NAs with a warning:
> y <- c("1", "2", "three")
> as.numeric(y)
[1]  1  2 NA
Warning message:
NAs introduced by coercion
complete.cases(x)
Purpose: Identify rows with no missing values
Help: Type ?complete.cases in R console
Returns: Logical vector (TRUE for complete rows)
Create data with missing values:
> df <- data.frame(
+   x = c(1, 2, NA, 4),
+   y = c(5, NA, 7, 8)
+ )
> complete.cases(df)
[1]  TRUE FALSE FALSE  TRUE
Filter to complete cases only:
> df_clean <- df[complete.cases(df), ]
> nrow(df_clean)
[1] 2
Count incomplete cases:
> sum(!complete.cases(df))
[1] 2
See also: is.na(), na.omit()
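na.omit() gives the same rows as indexing with complete.cases(), and complete.cases() can also be limited to selected columns. A short sketch using the same example data:

```r
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(5, NA, 7, 8))

by_index <- df[complete.cases(df), ]   # logical indexing
by_omit  <- na.omit(df)                # same rows, plus an "na.action" attribute

nrow(by_omit)                                        # 2
identical(rownames(by_index), rownames(by_omit))     # TRUE

# Restrict the check to selected columns
x_only <- df[complete.cases(df["x"]), ]   # keeps rows 1, 2, and 4
```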
ifelse(test, yes, no)
Purpose: Vectorized conditional operation
Help: Type ?ifelse in R console
Basic pass/fail grading:
> score <- c(85, 72, 90, 68, 88)
> grade <- ifelse(score >= 80, "Pass", "Fail")
> grade
[1] "Pass" "Fail" "Pass" "Fail" "Pass"
Nested ifelse for letter grades:
grade <- ifelse(score >= 90, "A",
         ifelse(score >= 80, "B",
         ifelse(score >= 70, "C", "F")))
Conditional calculations:
df$adjusted <- ifelse(df$group == "control", df$score * 1.1, df$score)
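Deeply nested ifelse() calls become hard to read. One possible alternative, not covered elsewhere in this reference, is base R's cut(), which assigns values to intervals. The sketch below reproduces the letter grades above; the break points are taken from the nested version.

```r
score <- c(85, 72, 90, 68, 88)

nested <- ifelse(score >= 90, "A",
          ifelse(score >= 80, "B",
          ifelse(score >= 70, "C", "F")))

# cut() bins the scores; right = FALSE makes intervals closed on the left,
# e.g. [80, 90), matching the >= comparisons above
binned <- cut(score,
              breaks = c(-Inf, 70, 80, 90, Inf),
              labels = c("F", "C", "B", "A"),
              right = FALSE)

nested                  # "B" "C" "A" "F" "B"
as.character(binned)    # same letters
```

Note that cut() returns a factor, so as.character() is needed for a direct comparison with the ifelse() result.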
IQR(x, na.rm = FALSE)
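Purpose: Calculate interquartile range (Q3 - Q1)
Help: Type ?IQR in R console
With missing values (unlike mean(), IQR() stops with an error unless na.rm = TRUE):
> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
> IQR(x)
Error in quantile.default(as.numeric(x), c(0.25, 0.75), na.rm = na.rm,  :
  missing values and NaN's not allowed if 'na.rm' is FALSE
> IQR(x, na.rm = TRUE)
[1] 4
Compare with manual calculation from quantiles:
> quantile(x, c(0.25, 0.75), na.rm = TRUE)
25% 75%
  3   7
See also: quantile(), fivenum(), boxplot.stats()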
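The manual comparison can be completed by subtracting the two quantiles. A minimal sketch with the same vector:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
manual_iqr <- unname(q[2] - q[1])   # Q3 - Q1

manual_iqr                # 4
IQR(x, na.rm = TRUE)      # 4
```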
is.na(x)
Purpose: Test for missing values
Help: Type ?is.na in R console
Identifying NAs:
> x <- c(1, NA, 3, NA, 5)
> is.na(x)
[1] FALSE  TRUE FALSE  TRUE FALSE
Counting NAs:
> sum(is.na(x))
[1] 2
Finding positions of NAs:
> which(is.na(x))
[1] 2 4
Replacing NAs:
x[is.na(x)] <- 0
See also: complete.cases(), na.omit(), anyNA()
paste(..., sep = " ", collapse = NULL)
Purpose: Concatenate strings
Help: Type ?paste in R console
Key arguments:
sep: Separator between elements
collapse: Collapse vector to single string
Basic concatenation:
> paste("Mean:", 5.2)
[1] "Mean: 5.2"
Custom separator for dates:
> paste("2024", "01", "15", sep = "-")
[1] "2024-01-15"
Vectorized operation:
> paste("Group", 1:3)
[1] "Group 1" "Group 2" "Group 3"
Collapse to single string:
> paste(c("A", "B", "C"), collapse = ", ")
[1] "A, B, C"
No-space version with paste0:
> paste0("var", 1:3)
[1] "var1" "var2" "var3"
quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)
Purpose: Calculate sample quantiles
Help: Type ?quantile in R console
Default quartiles:
> x <- c(1:10, NA)
> quantile(x, na.rm = TRUE)
   0%   25%   50%   75%  100%
 1.00  3.25  5.50  7.75 10.00
Custom percentiles:
> quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
10% 90%
1.9 9.1
See also: median(), IQR(), fivenum()
sapply(X, FUN), lapply(X, FUN)
Purpose: Apply function over list/vector elements
Help: Type ?sapply or ?lapply in R console
Differences:
lapply: Always returns a list
sapply: Simplifies to vector/matrix if possible
Create a list of vectors:
> lst <- list(a = 1:5, b = 6:10, c = 11:15)
sapply simplifies to vector:
> sapply(lst, mean)
 a  b  c
 3  8 13
lapply returns list:
> lapply(lst, mean)
$a
[1] 3

$b
[1] 8

$c
[1] 13
Check multiple columns for NAs:
sapply(df, function(x) sum(is.na(x)))
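The examples above show sapply() simplifying to a vector. When the function returns more than one value per element, the result is simplified to a matrix instead. A minimal sketch with the same list:

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# FUN returns two values per element, so sapply() builds a 2 x 3 matrix
res <- sapply(lst, function(x) c(min = min(x), max = max(x)))
res
dim(res)            # 2 3

# lapply() keeps the same kind of result as a list of length 3
res_list <- lapply(lst, range)
length(res_list)    # 3
```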
tapply(X, INDEX, FUN)
Purpose: Apply function by group
Help: Type ?tapply in R console
Group means using built-in dataset:
> tapply(iris$Sepal.Length, iris$Species, mean)
    setosa versicolor  virginica
     5.006      5.936      6.588
Multiple statistics per group:
tapply(iris$Sepal.Length, iris$Species,
       function(x) c(mean = mean(x), sd = sd(x)))
See also: aggregate(), by()
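For comparison, aggregate() computes the same group means but returns them as a data frame, which is often easier to merge or plot. A short sketch using the formula interface:

```r
# tapply(): named array of group means
by_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)
by_tapply

# aggregate(): the same means returned as a data frame
agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
agg
```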
Descriptive Statistics & Correlation
cor(x, y = NULL, use = "everything", method = "pearson")
Purpose: Calculate correlation coefficient
Help: Type ?cor in R console
Key arguments:
use: How to handle NAs ("complete.obs" drops them)
method: "pearson" (default), "spearman", "kendall"
Perfect positive correlation:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 6, 8, 10)
> cor(x, y)
[1] 1
Using built-in dataset:
> cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594
Correlation matrix:
> cor(mtcars[, c("mpg", "wt", "hp")])
           mpg         wt         hp
mpg  1.0000000 -0.8676594 -0.7761684
wt  -0.8676594  1.0000000  0.6587479
hp  -0.7761684  0.6587479  1.0000000
Interpretation guide:
-1 to -0.7: Strong negative
-0.7 to -0.3: Moderate negative
-0.3 to 0.3: Weak/no linear relationship
0.3 to 0.7: Moderate positive
0.7 to 1: Strong positive
Test for significance:
cor.test(mtcars$mpg, mtcars$wt)
See also: cor.test(), cov()
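The use and method arguments are listed above without an example. The sketch below uses small made-up vectors to show how each one changes the result.

```r
# Made-up data: y is the square of 1:5, and x has one missing value
x <- c(1, 2, 3, 4, NA)
y <- c(1, 4, 9, 16, 25)

r_default  <- cor(x, y)                          # NA under use = "everything"
r_pearson  <- cor(x, y, use = "complete.obs")    # Pearson on the 4 complete pairs

# Spearman works on ranks, so any strictly increasing relationship gives 1
r_spearman <- cor(x, y, use = "complete.obs", method = "spearman")

r_default
r_pearson
r_spearman
```

Pearson stays below 1 here because the relationship is increasing but not linear.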
mean(x, trim = 0, na.rm = FALSE)
Purpose: Calculate arithmetic mean
Help: Type ?mean in R console
Key arguments:
na.rm: Remove NAs before calculation
trim: Fraction to trim from each end
Basic usage with NAs:
> x <- c(1, 2, 3, 4, 5, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 3
Trimmed mean (robust to outliers):
> y <- c(1, 2, 3, 4, 100)
> mean(y)
[1] 22
> mean(y, trim = 0.2)
[1] 3
The trimmed mean removes 20% of the observations from each end before calculating.
See also: median(), summary()
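The trimming step can be reproduced by hand: sort the values, drop floor(n * trim) observations from each end, and average the rest. A minimal sketch with the same vector:

```r
y <- c(1, 2, 3, 4, 100)
trim <- 0.2

n <- length(y)
k <- floor(n * trim)                 # observations dropped from each end: 1
kept <- sort(y)[(k + 1):(n - k)]     # 2 3 4

mean(kept)                           # 3
mean(y, trim = trim)                 # 3
```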
median(x, na.rm = FALSE)
Purpose: Calculate median (middle value)
Help: Type ?median in R console
Odd number of values:
> median(c(1, 3, 5))
[1] 3
Even number of values:
> median(c(1, 2, 3, 4))
[1] 2.5
Median is more robust than mean for skewed data:
> x <- c(20, 25, 30, 35, 200)
> mean(x)
[1] 62
> median(x)
[1] 30
sd(x, na.rm = FALSE)
Purpose: Calculate sample standard deviation
Help: Type ?sd in R console
Note: Uses n-1 denominator (sample SD)
> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> sd(x)
[1] 2.13809
> var(x)
[1] 4.571429
The variance equals the square of the standard deviation.
Calculate coefficient of variation (relative variability):
> cv <- sd(x) / mean(x) * 100
> paste("CV:", round(cv, 1), "%")
[1] "CV: 42.8 %"
See also: var(), mad() (robust alternative)
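The n-1 denominator can be verified by computing the standard deviation from its definition. The sketch below also shows the population version (denominator n) for contrast.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

ss <- sum((x - mean(x))^2)         # sum of squared deviations: 32
sample_sd <- sqrt(ss / (n - 1))    # what sd() computes
pop_sd    <- sqrt(ss / n)          # population version, for comparison: 2

sample_sd
all.equal(sample_sd, sd(x))        # TRUE
```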
Probability & Distributions
Normal Distribution Functions
dnorm(x, mean = 0, sd = 1)
Purpose: Normal probability density
Help: Type ?dnorm in R console
Use cases: Overlay theoretical curves, calculate likelihoods
Standard normal density at x = 0:
> dnorm(0)
[1] 0.3989423
Custom parameters:
> dnorm(100, mean = 100, sd = 15)
[1] 0.02659615
Use in ggplot2 for overlaying normal curve:
library(ggplot2)
# x is a numeric vector of sample values
ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm,
                args = list(mean = mean(x), sd = sd(x)),
                color = "red")
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
Purpose: Normal cumulative distribution (probability)
Help: Type ?pnorm in R console
Standard normal probabilities:
> pnorm(1.96)
[1] 0.9750021
> pnorm(1.96, lower.tail = FALSE)
[1] 0.0249979
Two-tailed p-value:
> 2 * pnorm(-abs(1.96))
[1] 0.04999579
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
Purpose: Normal quantiles (inverse CDF)
Help: Type ?qnorm in R console
Critical values:
> qnorm(0.975)
[1] 1.959964
> qnorm(0.95)
[1] 1.644854
The first is the critical value for a two-tailed test at the 5% level (or a 95% confidence interval); the second is the critical value for a one-tailed test at the 5% level.
Calculate confidence interval:
xbar <- 100; s <- 15; n <- 25
xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)
rnorm(n, mean = 0, sd = 1)
Purpose: Generate random normal values
Help: Type ?rnorm in R console
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)
# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(bins = 20) +
  ggtitle("Sample from Normal(50, 10)")
t Distribution Functions
pt(q, df, lower.tail = TRUE)
Purpose: Student's t cumulative distribution
Help: Type ?pt in R console
Use cases: Calculate p-values for t-tests
One-sided p-value:
> t_stat <- 2.5
> df <- 24
> pt(t_stat, df, lower.tail = FALSE)
[1] 0.009963466
Two-sided p-value:
> 2 * pt(-abs(t_stat), df)
[1] 0.01992693
qt(p, df, lower.tail = TRUE)
Purpose: Student's t quantiles
Help: Type ?qt in R console
Use cases: Critical values for confidence intervals
95% CI critical value:
> qt(0.975, df = 24)
[1] 2.063899
Compare to normal:
> qnorm(0.975)
[1] 1.959964
The t critical value is larger due to heavier tails.
See also: dt(), rt()
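The effect of the larger critical value shows up in the width of a confidence interval. The sketch below reuses the illustrative summary values from the normal-based interval (xbar = 100, s = 15, n = 25) and computes both versions side by side.

```r
xbar <- 100; s <- 15; n <- 25

t_crit <- qt(0.975, df = n - 1)    # 2.063899
z_crit <- qnorm(0.975)             # 1.959964

ci_t <- xbar + c(-1, 1) * t_crit * s / sqrt(n)
ci_z <- xbar + c(-1, 1) * z_crit * s / sqrt(n)

ci_t    # wider interval
ci_z
```

The t-based interval is the usual choice when the standard deviation is estimated from the sample, as it is here.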
Other Distributions
dbinom(x, size, prob)
Purpose: Binomial probability mass function
Help: Type ?dbinom in R console
Probability of exactly 3 successes in 10 trials with p = 0.5:
> dbinom(3, size = 10, prob = 0.5)
[1] 0.1171875
Full distribution:
x <- 0:10
p <- dbinom(x, size = 10, prob = 0.5)
# Visualize with ggplot2
library(ggplot2)
ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
  geom_col() +
  scale_x_continuous(breaks = 0:10) +
  ggtitle("Binomial(10, 0.5) Distribution")
qtukey(p, nmeans, df)
Purpose: Tukey HSD critical values
Help: Type ?qtukey in R console
Critical value for 3 groups with df = 27:
> qtukey(0.95, nmeans = 3, df = 27)
[1] 3.506426
Simulation Functions
set.seed(seed)
Purpose: Set random number generator seed for reproducibility
Help: Type ?set.seed in R console
Without seed (different each time):
> rnorm(3)
[1]  0.3186301 -0.5817907  0.7145327
> rnorm(3)
[1] -0.8252594 -0.3598138 -0.0109303
With seed (reproducible):
> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083
> set.seed(123)
> rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083
Random Generation Functions
All random generation functions take n (number of values) as the first argument.
set.seed(1)
rnorm(5, mean = 100, sd = 15)   # Normal
runif(5, min = 0, max = 10)     # Uniform
rexp(5, rate = 2)               # Exponential
For simulation studies:
n_sims <- 1000
means <- replicate(n_sims, mean(rnorm(30)))
# Visualize sampling distribution with ggplot2
library(ggplot2)
ggplot(data.frame(means = means), aes(x = means)) +
  geom_histogram(bins = 30) +
  ggtitle("Sampling Distribution of Mean")
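A simulation like this can be checked against theory: for samples of size n from a standard normal, the standard deviation of the sample means should be close to 1/sqrt(n). A minimal sketch; the seed value is arbitrary and chosen only so the run is reproducible.

```r
set.seed(350)      # arbitrary seed, for reproducibility only
n_sims <- 1000
n <- 30

means <- replicate(n_sims, mean(rnorm(n)))

sd(means)          # empirical standard error of the mean
1 / sqrt(n)        # theoretical value sigma / sqrt(n), about 0.1826
```

The two numbers will not match exactly, but they should agree more closely as n_sims grows.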