.. _r_function_reference_part1:

Function Reference Part 1
-------------------------------------------------

Data I/O & Housekeeping
~~~~~~~~~~~~~~~~~~~~~~~~

**getwd()**

*Purpose*: Show current working directory

*Help*: Type ``?getwd`` in R console

*Syntax*: ``getwd()``

Example showing the working directory:

.. code-block:: rconsole

   > getwd()
   [1] "/Users/username/STAT350"

**head(x, n = 6)**

*Purpose*: Display first n rows of a data frame or elements of a vector

*Help*: Type ``?head`` in R console

*Syntax*: ``head(x, n = 6)``

*Common use*: Quickly inspect data after importing

Using built-in iris dataset:

.. code-block:: rconsole

   > data(iris)
   > head(iris, 3)
     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   1          5.1         3.5          1.4         0.2  setosa
   2          4.9         3.0          1.4         0.2  setosa
   3          4.7         3.2          1.3         0.2  setosa

*See also*: ``tail()``, ``str()``, ``summary()``

**read.csv(file, header = TRUE, stringsAsFactors = FALSE, ...)**

*Purpose*: Import CSV file into a data frame

*Help*: Type ``?read.csv`` in R console

*Key arguments*:

- ``header``: First row contains column names (default TRUE)
- ``stringsAsFactors``: ``FALSE`` keeps strings as characters (recommended; the default since R 4.0.0)
- ``na.strings``: Values to treat as NA (default "NA")

*Common pitfalls*:

- Using absolute paths (breaks on other computers)
- Forgetting to check for proper import with ``str()``
- Not handling special characters in column names

Basic import:

.. code-block:: rconsole

   > mydata <- read.csv("data/experiment.csv")
   > str(mydata)
   'data.frame': 150 obs. of 5 variables:
    $ id      : int 1 2 3 4 5 6 7 8 9 10 ...
    $ group   : chr "A" "B" "A" "C" ...
    $ measure1: num 23.5 19.2 25.1 22.8 20.3 ...
    $ measure2: num 45.2 41.8 46.7 44.3 42.1 ...
    $ outcome : chr "success" "failure" "success" ...

With options for handling missing data:

.. code-block:: r

   mydata <- read.csv("data/experiment.csv",
                      stringsAsFactors = FALSE,
                      na.strings = c("NA", "N/A", ""))

Always validate after import:

.. code-block:: rconsole

   > sum(is.na(mydata))
   [1] 12

*See also*: ``read.table()``, ``write.csv()``, ``str()``

**setwd(path)**

*Purpose*: Change working directory

*Help*: Type ``?setwd`` in R console

Not recommended in scripts:

.. code-block:: r

   # Bad practice - machine specific!
   # setwd("/Users/username/STAT350")

**write.csv(x, file, row.names = FALSE)**

*Purpose*: Export data frame to CSV file

*Help*: Type ``?write.csv`` in R console

*Key arguments*:

- ``row.names``: Include row numbers (default TRUE; usually set to FALSE)
- ``na``: String for missing values (default "NA")

Save cleaned data:

.. code-block:: r

   write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)

Data Structures & Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**c(...)**

*Purpose*: Combine values into a vector

*Help*: Type ``?c`` in R console

Creating different types of vectors:

.. code-block:: rconsole

   > v1 <- c(1, 2, 3, 4, 5)
   > v2 <- c("iOS", "Android", "Windows")
   > v3 <- c(1, "two", 3)
   > class(v3)
   [1] "character"

Note that mixed types get coerced to the most general type (character in this case).

**colnames(x), rownames(x)**

*Purpose*: Get or set column/row names

*Help*: Type ``?colnames`` or ``?rownames`` in R console

Getting column names from built-in dataset:

.. code-block:: rconsole

   > colnames(mtcars)
   [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

Renaming columns:

.. code-block:: r

   # Create a copy to modify
   mydata <- mtcars[1:5, 1:3]
   colnames(mydata) <- c("MPG", "Cylinders", "Displacement")

   # Or rename specific columns
   colnames(mydata)[1] <- "MilesPerGallon"

**data.frame(...)**

*Purpose*: Create a data frame from vectors

*Help*: Type ``?data.frame`` in R console

*Key arguments*:

- ``stringsAsFactors``: Auto-convert strings to factors (set FALSE; this is the default since R 4.0.0)

Creating a data frame from scratch:

.. code-block:: rconsole

   > df <- data.frame(
   +   id = 1:5,
   +   group = c("A", "A", "B", "B", "B"),
   +   score = c(85, 90, 78, 82, 88),
   +   stringsAsFactors = FALSE
   + )
   > str(df)
   'data.frame': 5 obs. of 3 variables:
    $ id   : int 1 2 3 4 5
    $ group: chr "A" "A" "B" "B" "B"
    $ score: num 85 90 78 82 88

*See also*: ``tibble::tibble()`` for modern alternative

**factor(x, levels = ..., ordered = FALSE)**

*Purpose*: Create categorical variables with defined levels

*Help*: Type ``?factor`` in R console

*When to use*: For categorical data in ANOVA, controlling plot order

Basic factor creation with automatic level detection:

.. code-block:: rconsole

   > platform <- factor(c("iOS", "Android", "iOS", "Windows"))
   > levels(platform)
   [1] "Android" "iOS"     "Windows"

Controlling level order for plots and analyses:

.. code-block:: r

   platform <- factor(platform,
                      levels = c("iOS", "Android", "Windows"))

Creating ordered factors for ordinal data:

.. code-block:: r

   satisfaction <- factor(c("Low", "High", "Medium", "High"),
                          levels = c("Low", "Medium", "High"),
                          ordered = TRUE)

*Common pitfall*: Converting numeric factors back to numbers

Wrong way (returns level indices):

.. code-block:: rconsole

   > f <- factor(c("2", "4", "6"))
   > as.numeric(f)
   [1] 1 2 3

Correct way:

.. code-block:: rconsole

   > as.numeric(as.character(f))
   [1] 2 4 6

*See also*: ``levels()``, ``relevel()``, ``droplevels()``

Data Wrangling & Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**apply(X, MARGIN, FUN)**

*Purpose*: Apply function over matrix/array margins

*Help*: Type ``?apply`` in R console

*Key arguments*:

- ``MARGIN``: 1 for rows, 2 for columns
- ``FUN``: Function to apply

Create a sample matrix:

.. code-block:: rconsole

   > M <- matrix(1:12, nrow = 3)
   > M
        [,1] [,2] [,3] [,4]
   [1,]    1    4    7   10
   [2,]    2    5    8   11
   [3,]    3    6    9   12

Row means (MARGIN = 1):

.. code-block:: rconsole

   > apply(M, 1, mean)
   [1] 5.5 6.5 7.5

Column sums (MARGIN = 2):

.. code-block:: rconsole

   > apply(M, 2, sum)
   [1]  6 15 24 33

Custom function to calculate range:

.. code-block:: rconsole

   > apply(M, 2, function(x) max(x) - min(x))
   [1] 2 2 2 2

*See also*: ``lapply()``, ``sapply()``, ``tapply()``

**as.numeric(x)**

*Purpose*: Convert to numeric type

*Help*: Type ``?as.numeric`` in R console

*Common uses*: Fix data imported as characters, convert factors

Character to numeric conversion:

.. code-block:: rconsole

   > x <- c("1.5", "2.3", "3.1")
   > as.numeric(x)
   [1] 1.5 2.3 3.1

Non-numeric values produce NAs with a warning:

.. code-block:: rconsole

   > y <- c("1", "2", "three")
   > as.numeric(y)
   [1]  1  2 NA
   Warning message:
   NAs introduced by coercion

**complete.cases(x)**

*Purpose*: Identify rows with no missing values

*Help*: Type ``?complete.cases`` in R console

*Returns*: Logical vector (TRUE for complete rows)

Create data with missing values:

.. code-block:: rconsole

   > df <- data.frame(
   +   x = c(1, 2, NA, 4),
   +   y = c(5, NA, 7, 8)
   + )
   > complete.cases(df)
   [1]  TRUE FALSE FALSE  TRUE

Filter to complete cases only:

.. code-block:: rconsole

   > df_clean <- df[complete.cases(df), ]
   > nrow(df_clean)
   [1] 2

Count incomplete cases:

.. code-block:: rconsole

   > sum(!complete.cases(df))
   [1] 2

*See also*: ``is.na()``, ``na.omit()``

**ifelse(test, yes, no)**

*Purpose*: Vectorized conditional operation

*Help*: Type ``?ifelse`` in R console

Basic pass/fail grading:

.. code-block:: rconsole

   > score <- c(85, 72, 90, 68, 88)
   > grade <- ifelse(score >= 80, "Pass", "Fail")
   > grade
   [1] "Pass" "Fail" "Pass" "Fail" "Pass"

Nested ifelse for letter grades:

.. code-block:: r

   grade <- ifelse(score >= 90, "A",
            ifelse(score >= 80, "B",
            ifelse(score >= 70, "C", "F")))

Conditional calculations:

.. code-block:: r

   df$adjusted <- ifelse(df$group == "control",
                         df$score * 1.1,
                         df$score)

**IQR(x, na.rm = FALSE)**

*Purpose*: Calculate interquartile range (Q3 - Q1)

*Help*: Type ``?IQR`` in R console

With missing values (unlike ``mean()``, ``IQR()`` stops with an error rather than returning NA unless ``na.rm = TRUE``):

.. code-block:: rconsole

   > x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
   > IQR(x)
   Error in quantile.default(as.numeric(x), c(0.25, 0.75), na.rm = na.rm,  :
     missing values and NaN's not allowed if 'na.rm' is FALSE
   > IQR(x, na.rm = TRUE)
   [1] 4

Compare with manual calculation from quantiles:

.. code-block:: rconsole

   > quantile(x, c(0.25, 0.75), na.rm = TRUE)
   25% 75%
     3   7

*See also*: ``quantile()``, ``fivenum()``, ``boxplot.stats()``

**is.na(x)**

*Purpose*: Test for missing values

*Help*: Type ``?is.na`` in R console

Identifying NAs:

.. code-block:: rconsole

   > x <- c(1, NA, 3, NA, 5)
   > is.na(x)
   [1] FALSE  TRUE FALSE  TRUE FALSE

Counting NAs:

.. code-block:: rconsole

   > sum(is.na(x))
   [1] 2

Finding positions of NAs:

.. code-block:: rconsole

   > which(is.na(x))
   [1] 2 4

Replacing NAs:

.. code-block:: r

   x[is.na(x)] <- 0

*See also*: ``complete.cases()``, ``na.omit()``, ``anyNA()``

**paste(..., sep = " ", collapse = NULL)**

*Purpose*: Concatenate strings

*Help*: Type ``?paste`` in R console

*Key arguments*:

- ``sep``: Separator between elements
- ``collapse``: Collapse vector to single string

Basic concatenation:

.. code-block:: rconsole

   > paste("Mean:", 5.2)
   [1] "Mean: 5.2"

Custom separator for dates:

.. code-block:: rconsole

   > paste("2024", "01", "15", sep = "-")
   [1] "2024-01-15"

Vectorized operation:

.. code-block:: rconsole

   > paste("Group", 1:3)
   [1] "Group 1" "Group 2" "Group 3"

Collapse to single string:

.. code-block:: rconsole

   > paste(c("A", "B", "C"), collapse = ", ")
   [1] "A, B, C"

No-space version with paste0:

.. code-block:: rconsole

   > paste0("var", 1:3)
   [1] "var1" "var2" "var3"

**quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)**

*Purpose*: Calculate sample quantiles

*Help*: Type ``?quantile`` in R console

Default quartiles:

.. code-block:: rconsole

   > x <- c(1:10, NA)
   > quantile(x, na.rm = TRUE)
      0%   25%   50%   75%  100%
    1.00  3.25  5.50  7.75 10.00

Custom percentiles:

.. code-block:: rconsole

   > quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
   10% 90%
   1.9 9.1

*See also*: ``median()``, ``IQR()``, ``fivenum()``

**sapply(X, FUN), lapply(X, FUN)**

*Purpose*: Apply function over list/vector elements

*Help*: Type ``?sapply`` or ``?lapply`` in R console

*Differences*:

- ``lapply``: Always returns a list
- ``sapply``: Simplifies to vector/matrix if possible

Create a list of vectors:

.. code-block:: rconsole

   > lst <- list(a = 1:5, b = 6:10, c = 11:15)

sapply simplifies to vector:

.. code-block:: rconsole

   > sapply(lst, mean)
    a  b  c
    3  8 13

lapply returns list:

.. code-block:: rconsole

   > lapply(lst, mean)
   $a
   [1] 3

   $b
   [1] 8

   $c
   [1] 13

Check multiple columns for NAs:

.. code-block:: r

   sapply(df, function(x) sum(is.na(x)))

**tapply(X, INDEX, FUN)**

*Purpose*: Apply function by group

*Help*: Type ``?tapply`` in R console

Group means using built-in dataset:

.. code-block:: rconsole

   > tapply(iris$Sepal.Length, iris$Species, mean)
       setosa versicolor  virginica
        5.006      5.936      6.588

Multiple statistics per group:

.. code-block:: r

   tapply(iris$Sepal.Length, iris$Species,
          function(x) c(mean = mean(x), sd = sd(x)))

*See also*: ``aggregate()``, ``by()``

Descriptive Statistics & Correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**cor(x, y = NULL, use = "everything", method = "pearson")**

*Purpose*: Calculate correlation coefficient

*Help*: Type ``?cor`` in R console

*Key arguments*:

- ``use``: How to handle NAs ("complete.obs" drops them)
- ``method``: "pearson" (default), "spearman", "kendall"

Perfect positive correlation:

.. code-block:: rconsole

   > x <- c(1, 2, 3, 4, 5)
   > y <- c(2, 4, 6, 8, 10)
   > cor(x, y)
   [1] 1

Using built-in dataset:

.. code-block:: rconsole

   > cor(mtcars$mpg, mtcars$wt)
   [1] -0.8676594

Correlation matrix:

.. code-block:: rconsole

   > cor(mtcars[,c("mpg", "wt", "hp")])
              mpg         wt         hp
   mpg  1.0000000 -0.8676594 -0.7761684
   wt  -0.8676594  1.0000000  0.6587479
   hp  -0.7761684  0.6587479  1.0000000

*Interpretation guide*:

- -1 to -0.7: Strong negative
- -0.7 to -0.3: Moderate negative
- -0.3 to 0.3: Weak/no linear relationship
- 0.3 to 0.7: Moderate positive
- 0.7 to 1: Strong positive

Test for significance:

.. code-block:: r

   cor.test(mtcars$mpg, mtcars$wt)

*See also*: ``cor.test()``, ``cov()``

**mean(x, trim = 0, na.rm = FALSE)**

*Purpose*: Calculate arithmetic mean

*Help*: Type ``?mean`` in R console

*Key arguments*:

- ``na.rm``: Remove NAs before calculation
- ``trim``: Fraction to trim from each end

Basic usage with NAs:

.. code-block:: rconsole

   > x <- c(1, 2, 3, 4, 5, NA)
   > mean(x)
   [1] NA
   > mean(x, na.rm = TRUE)
   [1] 3

Trimmed mean (robust to outliers):

.. code-block:: rconsole

   > y <- c(1, 2, 3, 4, 100)
   > mean(y)
   [1] 22
   > mean(y, trim = 0.2)
   [1] 3

With ``trim = 0.2``, the lowest and highest 20% of the observations (here, one value from each end) are dropped before averaging.

*See also*: ``median()``, ``summary()``

**median(x, na.rm = FALSE)**

*Purpose*: Calculate median (middle value)

*Help*: Type ``?median`` in R console

Odd number of values:

.. code-block:: rconsole

   > median(c(1, 3, 5))
   [1] 3

Even number of values:

.. code-block:: rconsole

   > median(c(1, 2, 3, 4))
   [1] 2.5

Median is more robust than mean for skewed data:

.. code-block:: rconsole

   > x <- c(20, 25, 30, 35, 200)
   > mean(x)
   [1] 62
   > median(x)
   [1] 30

**sd(x, na.rm = FALSE)**

*Purpose*: Calculate sample standard deviation

*Help*: Type ``?sd`` in R console

*Note*: Uses n-1 denominator (sample SD)

.. code-block:: rconsole

   > x <- c(2, 4, 4, 4, 5, 5, 7, 9)
   > sd(x)
   [1] 2.13809
   > var(x)
   [1] 4.571429

The variance equals the square of the standard deviation.

Calculate coefficient of variation (relative variability):

.. code-block:: rconsole

   > cv <- sd(x) / mean(x) * 100
   > paste("CV:", round(cv, 1), "%")
   [1] "CV: 42.8 %"

*See also*: ``var()``, ``mad()`` (robust alternative)

Probability & Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Normal Distribution Functions**

**dnorm(x, mean = 0, sd = 1)**

*Purpose*: Normal probability density

*Help*: Type ``?dnorm`` in R console

*Use cases*: Overlay theoretical curves, calculate likelihoods

Standard normal density at x = 0:

.. code-block:: rconsole

   > dnorm(0)
   [1] 0.3989423

Custom parameters:

.. code-block:: rconsole

   > dnorm(100, mean = 100, sd = 15)
   [1] 0.02659615

Use in ggplot2 for overlaying normal curve:

.. code-block:: r

   # Assumes x is a numeric vector of observations
   library(ggplot2)

   ggplot(data.frame(x = x), aes(x)) +
     geom_histogram(aes(y = after_stat(density))) +
     stat_function(fun = dnorm,
                   args = list(mean = mean(x), sd = sd(x)),
                   color = "red")

**pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)**

*Purpose*: Normal cumulative distribution (probability)

*Help*: Type ``?pnorm`` in R console

Standard normal probabilities:

.. code-block:: rconsole

   > pnorm(1.96)
   [1] 0.9750021
   > pnorm(1.96, lower.tail = FALSE)
   [1] 0.0249979

Two-tailed p-value:

.. code-block:: rconsole

   > 2 * pnorm(-abs(1.96))
   [1] 0.04999579

**qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)**

*Purpose*: Normal quantiles (inverse CDF)

*Help*: Type ``?qnorm`` in R console

Critical values:

.. code-block:: rconsole

   > qnorm(0.975)
   [1] 1.959964
   > qnorm(0.95)
   [1] 1.644854

The first is the critical value for a two-tailed test at the 5% level (equivalently, a 95% confidence interval); the second is the one-tailed critical value at the 5% level.

Calculate confidence interval:

.. code-block:: r

   # z-interval: treats s as the known population SD
   xbar <- 100; s <- 15; n <- 25
   xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)

**rnorm(n, mean = 0, sd = 1)**

*Purpose*: Generate random normal values

*Help*: Type ``?rnorm`` in R console

.. code-block:: r

   set.seed(123)
   x <- rnorm(100, mean = 50, sd = 10)

   # Visualize with ggplot2
   library(ggplot2)
   ggplot(data.frame(x = x), aes(x = x)) +
     geom_histogram(bins = 20) +
     ggtitle("Sample from Normal(50, 10)")

**t Distribution Functions**

**pt(q, df, lower.tail = TRUE)**

*Purpose*: Student's t cumulative distribution

*Help*: Type ``?pt`` in R console

*Use cases*: Calculate p-values for t-tests

One-sided p-value:

.. code-block:: rconsole

   > t_stat <- 2.5
   > df <- 24
   > pt(t_stat, df, lower.tail = FALSE)
   [1] 0.009963466

Two-sided p-value:

.. code-block:: rconsole

   > 2 * pt(-abs(t_stat), df)
   [1] 0.01992693

**qt(p, df, lower.tail = TRUE)**

*Purpose*: Student's t quantiles

*Help*: Type ``?qt`` in R console

*Use cases*: Critical values for confidence intervals

95% CI critical value:

.. code-block:: rconsole

   > qt(0.975, df = 24)
   [1] 2.063899

Compare to normal:

.. code-block:: rconsole

   > qnorm(0.975)
   [1] 1.959964

The t critical value is larger because the t distribution has heavier tails than the normal.

*See also*: ``dt()``, ``rt()``

**Other Distributions**

**dbinom(x, size, prob)**

*Purpose*: Binomial probability mass function

*Help*: Type ``?dbinom`` in R console

Probability of exactly 3 successes in 10 trials with p = 0.5:

.. code-block:: rconsole

   > dbinom(3, size = 10, prob = 0.5)
   [1] 0.1171875

Full distribution:

.. code-block:: r

   x <- 0:10
   p <- dbinom(x, size = 10, prob = 0.5)

   # Visualize with ggplot2
   library(ggplot2)
   ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
     geom_col() +
     scale_x_continuous(breaks = 0:10) +
     ggtitle("Binomial(10, 0.5) Distribution")

**qtukey(p, nmeans, df)**

*Purpose*: Tukey HSD critical values

*Help*: Type ``?qtukey`` in R console

Critical value for 3 groups with df = 27:

.. code-block:: rconsole

   > qtukey(0.95, nmeans = 3, df = 27)
   [1] 3.506426

Simulation Functions
~~~~~~~~~~~~~~~~~~~~~~~~

**set.seed(seed)**

*Purpose*: Set random number generator seed for reproducibility

*Help*: Type ``?set.seed`` in R console

Without seed (different each time):

.. code-block:: rconsole

   > rnorm(3)
   [1]  0.3186301 -0.5817907  0.7145327
   > rnorm(3)
   [1] -0.8252594 -0.3598138 -0.0109303

With seed (reproducible):

.. code-block:: rconsole

   > set.seed(123)
   > rnorm(3)
   [1] -0.5604756 -0.2301775  1.5587083
   > set.seed(123)
   > rnorm(3)
   [1] -0.5604756 -0.2301775  1.5587083

**Random Generation Functions**

All random generation functions take ``n`` (number of values) as the first argument.

.. code-block:: r

   set.seed(1)
   rnorm(5, mean = 100, sd = 15)  # Normal
   runif(5, min = 0, max = 10)    # Uniform
   rexp(5, rate = 2)              # Exponential

For simulation studies:

.. code-block:: r

   n_sims <- 1000
   means <- replicate(n_sims, mean(rnorm(30)))

   # Visualize sampling distribution with ggplot2
   library(ggplot2)
   ggplot(data.frame(means = means), aes(x = means)) +
     geom_histogram(bins = 30) +
     ggtitle("Sampling Distribution of Mean")
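
As a rough check on the simulation study above, the standard deviation of the simulated means should be close to the theoretical standard error, ``sigma / sqrt(n)``. The sketch below assumes standard normal data (``sigma = 1``) and samples of size 30, as in the example. The names ``se_theory`` and ``se_sim`` are illustrative, and the seed value is arbitrary.

.. code-block:: r

   set.seed(350)                      # arbitrary seed, used only for reproducibility
   n      <- 30                       # sample size, matching the example above
   n_sims <- 1000                     # number of simulated samples
   means  <- replicate(n_sims, mean(rnorm(n)))

   se_theory <- 1 / sqrt(n)           # sigma / sqrt(n) with sigma = 1
   se_sim    <- sd(means)             # empirical SD of the simulated means

   # The two values should agree up to simulation error
   round(c(theory = se_theory, simulated = se_sim), 3)

Increasing ``n_sims`` should tighten the agreement between the two values, while changing ``n`` changes the standard error itself.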