.. _r_function_reference_part1:

Function Reference Part 1
-------------------------------------------------

Data I/O & Housekeeping
~~~~~~~~~~~~~~~~~~~~~~~~

**getwd()**

   *Purpose*: Show current working directory  
   *Help*: Type ``?getwd`` in R console  
   *Syntax*: ``getwd()``  
   
   Example showing the working directory:
   
   .. code-block:: rconsole

      > getwd()
      [1] "/Users/username/STAT350"

**head(x, n = 6)**

   *Purpose*: Display first n rows of a data frame or elements of a vector  
   *Help*: Type ``?head`` in R console  
   *Syntax*: ``head(x, n = 6)``  
   *Common use*: Quickly inspect data after importing  
   
   Using built-in iris dataset:
   
   .. code-block:: rconsole

      > data(iris)
      > head(iris, 3)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
      1          5.1         3.5          1.4         0.2  setosa
      2          4.9         3.0          1.4         0.2  setosa
      3          4.7         3.2          1.3         0.2  setosa
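
   As a complementary sketch, ``tail()`` mirrors ``head()`` for the last rows, and a negative ``n`` drops rows instead of keeping them. The object names below are illustrative only.

   .. code-block:: r

      # Last three rows of the built-in iris data
      last_rows <- tail(iris, 3)
      rownames(last_rows)            # "148" "149" "150"

      # Negative n: all rows except the last 145, i.e. the first 5
      first_five <- head(iris, -145)
      nrow(first_five)               # 5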
   
   *See also*: ``tail()``, ``str()``, ``summary()``

**read.csv(file, header = TRUE, stringsAsFactors = FALSE, ...)**

   *Purpose*: Import CSV file into a data frame  
   *Help*: Type ``?read.csv`` in R console  
   *Key arguments*:
   
   - ``header``: First row contains column names (default TRUE)
   - ``stringsAsFactors``: If TRUE, convert strings to factors; FALSE (the default since R 4.0.0) keeps them as character
   - ``na.strings``: Values to treat as NA (default "NA")
   
   *Common pitfalls*:
   
   - Using absolute paths (breaks on other computers)
   - Forgetting to check for proper import with ``str()``
   - Not handling special characters in column names
   
   Basic import:
   
   .. code-block:: rconsole

      > mydata <- read.csv("data/experiment.csv")
      > str(mydata)
      'data.frame':	150 obs. of  5 variables:
       $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
       $ group   : chr  "A" "B" "A" "C" ...
       $ measure1: num  23.5 19.2 25.1 22.8 20.3 ...
       $ measure2: num  45.2 41.8 46.7 44.3 42.1 ...
       $ outcome : chr  "success" "failure" "success" ...
   
   With options for handling missing data:
   
   .. code-block:: r

      mydata <- read.csv("data/experiment.csv", 
                        stringsAsFactors = FALSE,
                        na.strings = c("NA", "N/A", ""))
   
   Always validate after import:
   
   .. code-block:: rconsole

      > sum(is.na(mydata))
      [1] 12
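
   To see what ``na.strings`` does without needing a file on disk, the following sketch reads a small, made-up CSV from memory through the ``text`` argument. The column names and values are hypothetical.

   .. code-block:: r

      # Hypothetical CSV content, held in memory instead of a file
      csv_lines <- c("id,group,score",
                     "1,A,23.5",
                     "2,B,N/A",
                     "3,,19.2")

      demo <- read.csv(text = paste(csv_lines, collapse = "\n"),
                       na.strings = c("NA", "N/A", ""))

      str(demo)              # score is numeric because "N/A" was read as NA
      colSums(is.na(demo))   # missing values per column: id 0, group 1, score 1

   Without ``"N/A"`` in ``na.strings``, the ``score`` column would be imported as character.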
   
   *See also*: ``read.table()``, ``write.csv()``, ``str()``

**setwd(path)**

   *Purpose*: Change working directory  
   *Help*: Type ``?setwd`` in R console  
   
   Avoid calling ``setwd()`` inside scripts, because a hard-coded path exists only on your machine:
   
   .. code-block:: r

      # Bad practice - machine specific!
      # setwd("/Users/username/STAT350")
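
   One possible alternative, sketched here under the assumption that the script is run from the project folder, is to build relative paths with ``file.path()`` and check them before reading. The folder and file names are placeholders.

   .. code-block:: r

      # Build a path relative to the current working directory
      data_path <- file.path("data", "experiment.csv")
      data_path                          # "data/experiment.csv"

      # Check that the file is where you expect before reading it
      if (file.exists(data_path)) {
        mydata <- read.csv(data_path)
      } else {
        message("File not found; current directory is ", getwd())
      }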
      

**write.csv(x, file, row.names = FALSE)**

   *Purpose*: Export data frame to CSV file  
   *Help*: Type ``?write.csv`` in R console  
   *Key arguments*:
   
   - ``row.names``: Whether to write row names as the first column (default TRUE; usually set to FALSE)
   - ``na``: String for missing values (default "NA")
   
   Save cleaned data:
   
   .. code-block:: r

      write.csv(d_clean, "output/cleaned_data.csv", row.names = FALSE)
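
   A minimal round-trip sketch, using a temporary file and a made-up data frame, shows how ``row.names`` and ``na`` affect what is written and what comes back from ``read.csv()``.

   .. code-block:: r

      # Hypothetical data with one missing value
      scores <- data.frame(id = 1:3, score = c(85, NA, 78))

      out_file <- tempfile(fileext = ".csv")   # temporary location for the demo
      write.csv(scores, out_file, row.names = FALSE, na = "")

      readLines(out_file)    # quoted header line, then "1,85" "2," "3,78"

      back <- read.csv(out_file)
      back$score             # 85 NA 78: the empty field is read back as NA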

Data Structures & Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**c(...)**

   *Purpose*: Combine values into a vector  
   *Help*: Type ``?c`` in R console  
   
   Creating different types of vectors:
   
   .. code-block:: rconsole

      > v1 <- c(1, 2, 3, 4, 5)
      > v2 <- c("iOS", "Android", "Windows")
      > v3 <- c(1, "two", 3)
      > class(v3)
      [1] "character"
   
   Note that mixed types get coerced to the most general type (character in this case).
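
   The coercion order can be explored with a few quick checks. This is only an illustrative sketch of the logical, integer, double, character hierarchy:

   .. code-block:: r

      class(c(TRUE, 2L))          # "integer":   logical is promoted to integer
      class(c(2L, 3.5))           # "numeric":   integer is promoted to double
      class(c(3.5, "a"))          # "character": double is promoted to character

      c(TRUE, 2L)                 # 1 2  (TRUE becomes 1)
      sum(c(TRUE, FALSE, TRUE))   # 2    (handy for counting TRUE values)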

**colnames(x), rownames(x)**

   *Purpose*: Get or set column/row names  
   *Help*: Type ``?colnames`` or ``?rownames`` in R console  
   
   Getting column names from built-in dataset:
   
   .. code-block:: rconsole

      > colnames(mtcars)
       [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
      [11] "carb"
   
   Renaming columns:
   
   .. code-block:: r

      # Create a copy to modify
      mydata <- mtcars[1:5, 1:3]
      colnames(mydata) <- c("MPG", "Cylinders", "Displacement")
      
      # Or rename specific columns
      colnames(mydata)[1] <- "MilesPerGallon"

**data.frame(...)**

   *Purpose*: Create a data frame from vectors  
   *Help*: Type ``?data.frame`` in R console  
   *Key arguments*:
   
   - ``stringsAsFactors``: If TRUE, convert strings to factors (default FALSE since R 4.0.0; set it explicitly for older versions)
   
   Creating a data frame from scratch:
   
   .. code-block:: rconsole

      > df <- data.frame(
      +   id = 1:5,
      +   group = c("A", "A", "B", "B", "B"),
      +   score = c(85, 90, 78, 82, 88),
      +   stringsAsFactors = FALSE
      + )
      > str(df)
      'data.frame':	5 obs. of  3 variables:
       $ id   : int  1 2 3 4 5
       $ group: chr  "A" "A" "B" "B" "B"
       $ score: num  85 90 78 82 88
   
   *See also*: ``tibble::tibble()`` for a modern alternative

**factor(x, levels = ..., ordered = FALSE)**

   *Purpose*: Create categorical variables with defined levels  
   *Help*: Type ``?factor`` in R console  
   *When to use*: For categorical data in ANOVA, controlling plot order  
   
   Basic factor creation with automatic level detection:
   
   .. code-block:: rconsole

      > platform <- factor(c("iOS", "Android", "iOS", "Windows"))
      > levels(platform)
      [1] "Android" "iOS"     "Windows"
   
   Controlling level order for plots and analyses:
   
   .. code-block:: r

      platform <- factor(platform, 
                        levels = c("iOS", "Android", "Windows"))
   
   Creating ordered factors for ordinal data:
   
   .. code-block:: r

      satisfaction <- factor(c("Low", "High", "Medium", "High"),
                            levels = c("Low", "Medium", "High"),
                            ordered = TRUE)
   
   *Common pitfall*: Converting numeric factors back to numbers
   
   Wrong way (returns level indices):
   
   .. code-block:: rconsole

      > f <- factor(c("2", "4", "6"))
      > as.numeric(f)
      [1] 1 2 3
   
   Correct way:
   
   .. code-block:: rconsole

      > as.numeric(as.character(f))
      [1] 2 4 6
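
   The related helpers can be tried on the same example data. The sketch below is illustrative and re-creates ``platform`` and ``satisfaction`` so it runs on its own.

   .. code-block:: r

      platform <- factor(c("iOS", "Android", "iOS", "Windows"))
      table(platform)                          # counts per level

      # Move one level to the front (the reference level in models)
      levels(relevel(platform, ref = "iOS"))   # "iOS" "Android" "Windows"

      # Subsetting keeps unused levels until droplevels() removes them
      no_windows <- platform[platform != "Windows"]
      levels(no_windows)                       # "Android" "iOS" "Windows"
      levels(droplevels(no_windows))           # "Android" "iOS"

      # Ordered factors support comparisons
      satisfaction <- factor(c("Low", "High", "Medium", "High"),
                             levels = c("Low", "Medium", "High"),
                             ordered = TRUE)
      satisfaction < "High"                    # TRUE FALSE TRUE FALSE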
   
   *See also*: ``levels()``, ``relevel()``, ``droplevels()``

Data Wrangling & Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**apply(X, MARGIN, FUN)**

   *Purpose*: Apply function over matrix/array margins  
   *Help*: Type ``?apply`` in R console  
   *Key arguments*:
   
   - ``MARGIN``: 1 for rows, 2 for columns
   - ``FUN``: Function to apply
   
   Create a sample matrix:
   
   .. code-block:: rconsole

      > M <- matrix(1:12, nrow = 3)
      > M
           [,1] [,2] [,3] [,4]
      [1,]    1    4    7   10
      [2,]    2    5    8   11
      [3,]    3    6    9   12
   
   Row means (MARGIN = 1):
   
   .. code-block:: rconsole

      > apply(M, 1, mean)
      [1] 5.5 6.5 7.5
   
   Column sums (MARGIN = 2):
   
   .. code-block:: rconsole

      > apply(M, 2, sum)
      [1]  6 15 24 33
   
   Custom function to calculate range:
   
   .. code-block:: rconsole

      > apply(M, 2, function(x) max(x) - min(x))
      [1] 2 2 2 2
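
   For the common cases of means and sums, base R also provides dedicated functions. The sketch below checks that they agree with ``apply()`` and shows what happens when the function returns more than one value.

   .. code-block:: r

      M <- matrix(1:12, nrow = 3)

      # Dedicated helpers give the same results as apply()
      all.equal(apply(M, 1, mean), rowMeans(M))   # TRUE
      all.equal(apply(M, 2, sum), colSums(M))     # TRUE

      # A function returning two values yields a 2 x 4 matrix
      # (one column per column of M)
      ranges <- apply(M, 2, range)
      dim(ranges)                                 # 2 4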
   
   *See also*: ``lapply()``, ``sapply()``, ``tapply()``

**as.numeric(x)**

   *Purpose*: Convert to numeric type  
   *Help*: Type ``?as.numeric`` in R console  
   *Common uses*: Fix data imported as characters, convert factors (via ``as.character()`` first)  
   
   Character to numeric conversion:
   
   .. code-block:: rconsole

      > x <- c("1.5", "2.3", "3.1")
      > as.numeric(x)
      [1] 1.5 2.3 3.1
   
   Non-numeric values produce NAs with a warning:
   
   .. code-block:: rconsole

      > y <- c("1", "2", "three")
      > as.numeric(y)
      [1]  1  2 NA
      Warning message:
      NAs introduced by coercion
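
   Imported columns often fail to convert because of formatting characters. One possible cleaning approach, shown on made-up values, strips those characters with ``gsub()`` before converting:

   .. code-block:: r

      # Hypothetical raw values as they might appear after import
      raw <- c("1,250", "$980", " 42 ", "n/a")

      cleaned <- gsub("[$,]", "", raw)    # remove dollar signs and commas
      cleaned <- trimws(cleaned)          # remove surrounding whitespace

      # "n/a" still cannot be parsed, so it becomes NA;
      # suppressWarnings() hides the expected coercion warning
      values <- suppressWarnings(as.numeric(cleaned))
      values                              # 1250  980   42   NA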

**complete.cases(x)**

   *Purpose*: Identify rows with no missing values  
   *Help*: Type ``?complete.cases`` in R console  
   *Returns*: Logical vector (TRUE for complete rows)  
   
   Create data with missing values:
   
   .. code-block:: rconsole

      > df <- data.frame(
      +   x = c(1, 2, NA, 4),
      +   y = c(5, NA, 7, 8)
      + )
      > complete.cases(df)
      [1]  TRUE FALSE FALSE  TRUE
   
   Filter to complete cases only:
   
   .. code-block:: rconsole

      > df_clean <- df[complete.cases(df), ]
      > nrow(df_clean)
      [1] 2
   
   Count incomplete cases:
   
   .. code-block:: rconsole

      > sum(!complete.cases(df))
      [1] 2
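
   As a brief sketch of two related patterns, ``na.omit()`` drops incomplete rows in one step, and ``complete.cases()`` can be restricted to the columns that matter for a given analysis:

   .. code-block:: r

      df <- data.frame(
        x = c(1, 2, NA, 4),
        y = c(5, NA, 7, 8)
      )

      # Same rows as df[complete.cases(df), ]
      nrow(na.omit(df))                        # 2

      # Require completeness in column x only
      x_complete <- df[complete.cases(df$x), ]
      nrow(x_complete)                         # 3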
   
   *See also*: ``is.na()``, ``na.omit()``

**ifelse(test, yes, no)**

   *Purpose*: Vectorized conditional operation  
   *Help*: Type ``?ifelse`` in R console  
   
   Basic pass/fail grading:
   
   .. code-block:: rconsole

      > score <- c(85, 72, 90, 68, 88)
      > grade <- ifelse(score >= 80, "Pass", "Fail")
      > grade
      [1] "Pass" "Fail" "Pass" "Fail" "Pass"
   
   Nested ifelse for letter grades:
   
   .. code-block:: r

      grade <- ifelse(score >= 90, "A",
                     ifelse(score >= 80, "B", 
                           ifelse(score >= 70, "C", "F")))
   
   Conditional calculations:
   
   .. code-block:: r

      df$adjusted <- ifelse(df$group == "control", 
                           df$score * 1.1, 
                           df$score)
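
   Deeply nested ``ifelse()`` calls become hard to read. One alternative for binning a numeric variable is ``cut()``; the sketch below uses the same cutoffs as the letter-grade example and confirms that both approaches agree.

   .. code-block:: r

      score <- c(85, 72, 90, 68, 88)

      nested <- ifelse(score >= 90, "A",
                       ifelse(score >= 80, "B",
                              ifelse(score >= 70, "C", "F")))

      # right = FALSE makes intervals closed on the left: [70, 80), [80, 90), ...
      binned <- cut(score,
                    breaks = c(-Inf, 70, 80, 90, Inf),
                    labels = c("F", "C", "B", "A"),
                    right = FALSE)

      as.character(binned)                       # "B" "C" "A" "F" "B"
      identical(nested, as.character(binned))    # TRUE

   Note that ``cut()`` returns a factor, which is why the comparison converts it to character first.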

**IQR(x, na.rm = FALSE)**

   *Purpose*: Calculate interquartile range (Q3 - Q1)  
   *Help*: Type ``?IQR`` in R console  
   
   With missing values:
   
   .. code-block:: rconsole

      > x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
      > IQR(x)
      [1] NA
      > IQR(x, na.rm = TRUE)
      [1] 4
   
   Compare with manual calculation from quantiles:
   
   .. code-block:: rconsole

      > quantile(x, c(0.25, 0.75), na.rm = TRUE)
      25% 75% 
        3   7
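
   The sketch below reproduces the IQR from the two quartiles and then applies the common 1.5 × IQR boxplot convention for flagging outliers. The fence rule is a widely used convention, not something ``IQR()`` computes itself.

   .. code-block:: r

      x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

      q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
      iqr_manual <- unname(q[2] - q[1])                 # 4, same as IQR(x, na.rm = TRUE)

      lower_fence <- unname(q[1]) - 1.5 * iqr_manual    # -3
      upper_fence <- unname(q[2]) + 1.5 * iqr_manual    # 13

      # Values outside the fences (none in this example)
      outliers <- x[!is.na(x) & (x < lower_fence | x > upper_fence)]
      length(outliers)                                  # 0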
   
   *See also*: ``quantile()``, ``fivenum()``, ``boxplot.stats()``

**is.na(x)**

   *Purpose*: Test for missing values  
   *Help*: Type ``?is.na`` in R console  
   
   Identifying NAs:
   
   .. code-block:: rconsole

      > x <- c(1, NA, 3, NA, 5)
      > is.na(x)
      [1] FALSE  TRUE FALSE  TRUE FALSE
   
   Counting NAs:
   
   .. code-block:: rconsole

      > sum(is.na(x))
      [1] 2
   
   Finding positions of NAs:
   
   .. code-block:: rconsole

      > which(is.na(x))
      [1] 2 4
   
   Replacing NAs:
   
   .. code-block:: r

      x[is.na(x)] <- 0
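
   Replacing with 0 is only one option. As an illustrative sketch (not a recommendation for every data set), the following replaces missing values with the mean of the observed values and keeps the original vector untouched:

   .. code-block:: r

      x <- c(1, NA, 3, NA, 5)

      anyNA(x)                          # TRUE: quick check for any missing value

      x_filled <- x                     # work on a copy
      x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)
      x_filled                          # 1 3 3 3 5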
   
   *See also*: ``complete.cases()``, ``na.omit()``, ``anyNA()``

**paste(..., sep = " ", collapse = NULL)**

   *Purpose*: Concatenate strings  
   *Help*: Type ``?paste`` in R console  
   *Key arguments*:
   
   - ``sep``: Separator between elements
   - ``collapse``: String used to join all results into a single string (default ``NULL`` leaves them as separate elements)
   
   Basic concatenation:
   
   .. code-block:: rconsole

      > paste("Mean:", 5.2)
      [1] "Mean: 5.2"
   
   Custom separator for dates:
   
   .. code-block:: rconsole

      > paste("2024", "01", "15", sep = "-")
      [1] "2024-01-15"
   
   Vectorized operation:
   
   .. code-block:: rconsole

      > paste("Group", 1:3)
      [1] "Group 1" "Group 2" "Group 3"
   
   Collapse to single string:
   
   .. code-block:: rconsole

      > paste(c("A", "B", "C"), collapse = ", ")
      [1] "A, B, C"
   
   No-space version with paste0:
   
   .. code-block:: rconsole

      > paste0("var", 1:3)
      [1] "var1" "var2" "var3"
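
   ``sep`` and ``collapse`` can be combined: ``sep`` joins the arguments element by element, then ``collapse`` joins the resulting vector. The sketch below also shows ``sprintf()`` as one option when a fixed number of decimals is needed.

   .. code-block:: r

      # sep works within each element, collapse across elements
      paste("x", 1:3, sep = "_", collapse = " + ")   # "x_1 + x_2 + x_3"

      # sprintf() controls number formatting
      sprintf("Mean: %.2f", 5.2)                     # "Mean: 5.20"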

**quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)**

   *Purpose*: Calculate sample quantiles  
   *Help*: Type ``?quantile`` in R console  
   
   Default quartiles:
   
   .. code-block:: rconsole

      > x <- c(1:10, NA)
      > quantile(x, na.rm = TRUE)
         0%   25%   50%   75%  100% 
       1.00  3.25  5.50  7.75 10.00
   
   Custom percentiles:
   
   .. code-block:: rconsole

      > quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
      10% 90% 
      1.9 9.1
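
   By default ``quantile()`` uses linear interpolation between order statistics (``type = 7``). The following sketch reproduces the 25th percentile of ``1:10`` by hand; the helper variable names are illustrative, and the manual version does not handle the edge case ``p = 1``.

   .. code-block:: r

      x <- 1:10
      p <- 0.25

      xs <- sort(x)
      h  <- (length(xs) - 1) * p + 1     # fractional position: 3.25
      lo <- floor(h)                     # lower neighbour: 3rd value

      # Interpolate between the 3rd and 4th sorted values
      q_manual <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])
      q_manual                                       # 3.25
      all.equal(q_manual, unname(quantile(x, p)))    # TRUE

      # Other definitions give slightly different answers
      unname(quantile(x, p, type = 6))               # 2.75

   Differences between types matter mostly for small samples, which is worth remembering when comparing R output with hand calculations or other software.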
   
   *See also*: ``median()``, ``IQR()``, ``fivenum()``

**sapply(X, FUN), lapply(X, FUN)**

   *Purpose*: Apply function over list/vector elements  
   *Help*: Type ``?sapply`` or ``?lapply`` in R console  
   *Differences*:
   
   - ``lapply``: Always returns a list
   - ``sapply``: Simplifies to vector/matrix if possible
   
   Create a list of vectors:
   
   .. code-block:: rconsole

      > lst <- list(a = 1:5, b = 6:10, c = 11:15)
   
   sapply simplifies to vector:
   
   .. code-block:: rconsole

      > sapply(lst, mean)
       a  b  c 
       3  8 13
   
   lapply returns list:
   
   .. code-block:: rconsole

      > lapply(lst, mean)
      $a
      [1] 3
      
      $b
      [1] 8
      
      $c
      [1] 13
   
   Check multiple columns for NAs:
   
   .. code-block:: r

      sapply(df, function(x) sum(is.na(x)))
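
   Because ``sapply()`` chooses its return type based on the results, ``vapply()`` is sometimes preferred when the expected type is known. A short sketch using the same list:

   .. code-block:: r

      lst <- list(a = 1:5, b = 6:10, c = 11:15)

      # When FUN returns two values, sapply() simplifies to a matrix
      rng <- sapply(lst, range)
      dim(rng)                              # 2 3  (rows: min and max)

      # vapply() requires the result type and length to be declared up front
      avg <- vapply(lst, mean, FUN.VALUE = numeric(1))
      avg                                   # a = 3, b = 8, c = 13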

**tapply(X, INDEX, FUN)**

   *Purpose*: Apply function by group  
   *Help*: Type ``?tapply`` in R console  
   
   Group means using built-in dataset:
   
   .. code-block:: rconsole

      > tapply(iris$Sepal.Length, iris$Species, mean)
          setosa versicolor  virginica 
           5.006      5.936      6.588
   
   Multiple statistics per group:
   
   .. code-block:: r

      tapply(iris$Sepal.Length, iris$Species, 
             function(x) c(mean = mean(x), sd = sd(x)))
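
   The same group means can be obtained as a data frame with ``aggregate()`` and its formula interface. The sketch below checks that both approaches give identical numbers.

   .. code-block:: r

      by_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

      # Formula interface: response ~ grouping variable
      by_aggregate <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
      by_aggregate      # data frame with columns Species and Sepal.Length

      all.equal(as.vector(by_tapply), by_aggregate$Sepal.Length)   # TRUE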
   
   *See also*: ``aggregate()``, ``by()``

Descriptive Statistics & Correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**cor(x, y = NULL, use = "everything", method = "pearson")**

   *Purpose*: Calculate correlation coefficient  
   *Help*: Type ``?cor`` in R console  
   *Key arguments*:
   
   - ``use``: How to handle NAs (the default ``"everything"`` returns NA if any value is missing; ``"complete.obs"`` drops incomplete observations)
   - ``method``: "pearson" (default), "spearman", "kendall"
   
   Perfect positive correlation:
   
   .. code-block:: rconsole

      > x <- c(1, 2, 3, 4, 5)
      > y <- c(2, 4, 6, 8, 10)
      > cor(x, y)
      [1] 1
   
   Using built-in dataset:
   
   .. code-block:: rconsole

      > cor(mtcars$mpg, mtcars$wt)
      [1] -0.8676594
   
   Correlation matrix:
   
   .. code-block:: rconsole

      > cor(mtcars[,c("mpg", "wt", "hp")])
                 mpg         wt         hp
      mpg  1.0000000 -0.8676594 -0.7761684
      wt  -0.8676594  1.0000000  0.6587479
      hp  -0.7761684  0.6587479  1.0000000
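
   To connect the output with the underlying formula, the sketch below computes the Pearson coefficient by hand and shows that the Spearman method is the Pearson correlation of the ranks. Variable names are illustrative.

   .. code-block:: r

      x <- mtcars$mpg
      y <- mtcars$wt

      # Pearson: sum of cross-products over the root of the sums of squares
      dx <- x - mean(x)
      dy <- y - mean(y)
      r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
      all.equal(r_manual, cor(x, y))                        # TRUE

      # Spearman: Pearson correlation applied to ranks
      all.equal(cor(rank(x), rank(y)),
                cor(x, y, method = "spearman"))             # TRUE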
   
   *Interpretation guide* (a rough rule of thumb; cutoffs vary by field):
   
   - -1 to -0.7: Strong negative
   - -0.7 to -0.3: Moderate negative  
   - -0.3 to 0.3: Weak/no linear relationship
   - 0.3 to 0.7: Moderate positive
   - 0.7 to 1: Strong positive
   
   Test for significance:
   
   .. code-block:: r

      cor.test(mtcars$mpg, mtcars$wt)
   
   *See also*: ``cor.test()``, ``cov()``

**mean(x, trim = 0, na.rm = FALSE)**

   *Purpose*: Calculate arithmetic mean  
   *Help*: Type ``?mean`` in R console  
   *Key arguments*:
   
   - ``na.rm``: Remove NAs before calculation
   - ``trim``: Fraction (0 to 0.5) of observations to trim from each end before averaging
   
   Basic usage with NAs:
   
   .. code-block:: rconsole

      > x <- c(1, 2, 3, 4, 5, NA)
      > mean(x)
      [1] NA
      > mean(x, na.rm = TRUE)
      [1] 3
   
   Trimmed mean (robust to outliers):
   
   .. code-block:: rconsole

      > y <- c(1, 2, 3, 4, 100)
      > mean(y)
      [1] 22
      > mean(y, trim = 0.2)
      [1] 3
   
   The trimmed mean removes 20% from each end before calculating.
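
   With five values and ``trim = 0.2``, exactly one observation is dropped from each end. A hand-computed sketch of the same result, using illustrative variable names:

   .. code-block:: r

      y <- c(1, 2, 3, 4, 100)
      trim <- 0.2

      k <- floor(length(y) * trim)                 # observations dropped per end: 1
      kept <- sort(y)[(k + 1):(length(y) - k)]     # 2 3 4

      mean(kept)                                   # 3
      all.equal(mean(kept), mean(y, trim = trim))  # TRUE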
   
   *See also*: ``median()``, ``summary()``

**median(x, na.rm = FALSE)**

   *Purpose*: Calculate median (middle value)  
   *Help*: Type ``?median`` in R console  
   
   Odd number of values:
   
   .. code-block:: rconsole

      > median(c(1, 3, 5))
      [1] 3
   
   Even number of values:
   
   .. code-block:: rconsole

      > median(c(1, 2, 3, 4))
      [1] 2.5
   
   Median is more robust than mean for skewed data:
   
   .. code-block:: rconsole

      > x <- c(20, 25, 30, 35, 200)
      > mean(x)
      [1] 62
      > median(x)
      [1] 30

**sd(x, na.rm = FALSE)**

   *Purpose*: Calculate sample standard deviation  
   *Help*: Type ``?sd`` in R console  
   *Note*: Uses n-1 denominator (sample SD)  
   
   .. code-block:: rconsole

      > x <- c(2, 4, 4, 4, 5, 5, 7, 9)
      > sd(x)
      [1] 2.13809
      > var(x)
      [1] 4.571429
   
   The variance equals the square of the standard deviation.
   
   Calculate coefficient of variation (relative variability):
   
   .. code-block:: rconsole

      > cv <- sd(x) / mean(x) * 100
      > paste("CV:", round(cv, 1), "%")
      [1] "CV: 42.8 %"
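
   The effect of the n-1 denominator can be seen by computing both versions by hand. This is a sketch for illustration; base R has no built-in population SD function.

   .. code-block:: r

      x <- c(2, 4, 4, 4, 5, 5, 7, 9)
      n <- length(x)

      ss <- sum((x - mean(x))^2)      # sum of squared deviations: 32

      sd_sample <- sqrt(ss / (n - 1)) # sample SD, matches sd(x)
      sd_pop    <- sqrt(ss / n)       # population formula: 2

      all.equal(sd_sample, sd(x))     # TRUE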
   
   *See also*: ``var()``, ``mad()`` (robust alternative)

Probability & Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Normal Distribution Functions**

   **dnorm(x, mean = 0, sd = 1)**
   
   *Purpose*: Normal probability density  
   *Help*: Type ``?dnorm`` in R console  
   *Use cases*: Overlay theoretical curves, calculate likelihoods  
   
   Standard normal density at x = 0:
   
   .. code-block:: rconsole

      > dnorm(0)
      [1] 0.3989423
   
   Custom parameters:
   
   .. code-block:: rconsole

      > dnorm(100, mean = 100, sd = 15)
      [1] 0.02659615
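
   As a check on what ``dnorm()`` returns, the sketch below evaluates the normal density formula directly. It also illustrates that a density is not a probability and can exceed 1. The names ``x0``, ``mu``, and ``sigma`` are illustrative.

   .. code-block:: r

      x0 <- 100; mu <- 100; sigma <- 15

      # Normal density: 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
      manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x0 - mu)^2 / (2 * sigma^2))
      all.equal(manual, dnorm(x0, mean = mu, sd = sigma))   # TRUE

      # Density values can be larger than 1 when sd is small
      dnorm(0, mean = 0, sd = 0.1) > 1                      # TRUE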
   
   Use in ggplot2 for overlaying normal curve:
   
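   .. code-block:: r

      library(ggplot2)

      # x is any numeric vector of observations; simulated here as an example
      x <- rnorm(100, mean = 50, sd = 10)

      ggplot(data.frame(x = x), aes(x)) +
        geom_histogram(aes(y = after_stat(density)), bins = 20) +
        stat_function(fun = dnorm, 
                     args = list(mean = mean(x), sd = sd(x)),
                     color = "red")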

   **pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)**
   
   *Purpose*: Normal cumulative distribution (probability)  
   *Help*: Type ``?pnorm`` in R console  
   
   Standard normal probabilities:
   
   .. code-block:: rconsole

      > pnorm(1.96)
      [1] 0.9750021
      > pnorm(1.96, lower.tail = FALSE)
      [1] 0.0249979
   
   Two-tailed p-value:
   
   .. code-block:: rconsole

      > 2 * pnorm(-abs(1.96))
      [1] 0.04999579
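
   Two further patterns come up often: the probability of falling in an interval, and probabilities for a non-standard normal. The sketch below uses an illustrative Normal(100, 15) variable.

   .. code-block:: r

      # P(-1 < Z < 1): difference of two cumulative probabilities
      within_1sd <- pnorm(1) - pnorm(-1)
      round(within_1sd, 3)                                    # 0.683

      # Supplying mean and sd is equivalent to standardizing first
      all.equal(pnorm(115, mean = 100, sd = 15), pnorm(1))    # TRUE

      # Upper-tail probability P(X > 130), i.e. P(Z > 2)
      upper <- pnorm(130, mean = 100, sd = 15, lower.tail = FALSE)
      round(upper, 4)                                         # 0.0228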

   **qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)**
   
   *Purpose*: Normal quantiles (inverse CDF)  
   *Help*: Type ``?qnorm`` in R console  
   
   Critical values:
   
   .. code-block:: rconsole

      > qnorm(0.975)
      [1] 1.959964
      > qnorm(0.95)
      [1] 1.644854
   
   The first is the critical value for a two-tailed test at the 5% level (a 95% confidence interval); the second is the critical value for a one-tailed test at the 5% level.
   
   Calculate confidence interval:
   
   .. code-block:: r

      # z-based interval: appropriate when s is the known population SD
      xbar <- 100; s <- 15; n <- 25
      xbar + c(-1, 1) * qnorm(0.975) * s/sqrt(n)
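
   ``qnorm()`` and ``pnorm()`` are inverses of each other, and the critical value for any confidence level follows the same pattern. A small sketch, with ``conf_level`` as an illustrative name:

   .. code-block:: r

      # Round trip: probability -> quantile -> probability
      p <- c(0.90, 0.95, 0.975)
      all.equal(pnorm(qnorm(p)), p)               # TRUE

      # Critical value for an arbitrary two-sided confidence level
      conf_level <- 0.90
      z_crit <- qnorm(1 - (1 - conf_level) / 2)   # same as qnorm(0.95)
      round(z_crit, 3)                            # 1.645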
   
   **rnorm(n, mean = 0, sd = 1)**
   
   *Purpose*: Generate random normal values  
   *Help*: Type ``?rnorm`` in R console  
   
   .. code-block:: r

      set.seed(123)
      x <- rnorm(100, mean = 50, sd = 10)
      
      # Visualize with ggplot2
      library(ggplot2)
      ggplot(data.frame(x = x), aes(x = x)) +
        geom_histogram(bins = 20) +
        ggtitle("Sample from Normal(50, 10)")

**t Distribution Functions**

   **pt(q, df, lower.tail = TRUE)**
   
   *Purpose*: Student's t cumulative distribution  
   *Help*: Type ``?pt`` in R console  
   *Use cases*: Calculate p-values for t-tests  
   
   One-sided p-value:
   
   .. code-block:: rconsole

      > t_stat <- 2.5
      > df <- 24
      > round(pt(t_stat, df, lower.tail = FALSE), 6)
      [1] 0.009827
   
   Two-sided p-value:
   
   .. code-block:: rconsole

      > round(2 * pt(-abs(t_stat), df), 6)
      [1] 0.019654

   **qt(p, df, lower.tail = TRUE)**
   
   *Purpose*: Student's t quantiles  
   *Help*: Type ``?qt`` in R console  
   *Use cases*: Critical values for confidence intervals  
   
   95% CI critical value:
   
   .. code-block:: rconsole

      > qt(0.975, df = 24)
      [1] 2.063899
   
   Compare to normal:
   
   .. code-block:: rconsole

      > qnorm(0.975)
      [1] 1.959964
   
   The t critical value is larger due to heavier tails.
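
   To see how ``pt()`` and ``qt()`` fit together, the sketch below rebuilds a one-sample t-test by hand on simulated data and compares the results with ``t.test()``. The data, the seed, and the null value ``mu0`` are arbitrary choices for illustration.

   .. code-block:: r

      set.seed(42)
      sample_x <- rnorm(25, mean = 52, sd = 10)   # simulated data
      mu0 <- 50                                   # hypothesized mean

      n  <- length(sample_x)
      se <- sd(sample_x) / sqrt(n)

      # Test statistic and two-sided p-value from pt()
      t_stat   <- (mean(sample_x) - mu0) / se
      p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

      # 95% confidence interval from qt()
      ci_manual <- mean(sample_x) + c(-1, 1) * qt(0.975, df = n - 1) * se

      fit <- t.test(sample_x, mu = mu0)
      all.equal(p_manual, fit$p.value)                  # TRUE
      all.equal(ci_manual, as.numeric(fit$conf.int))    # TRUE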
   
   *See also*: ``dt()``, ``rt()``

**Other Distributions**

   **dbinom(x, size, prob)**
   
   *Purpose*: Binomial probability mass function  
   *Help*: Type ``?dbinom`` in R console  
   
   Probability of exactly 3 successes in 10 trials with p = 0.5:
   
   .. code-block:: rconsole

      > dbinom(3, size = 10, prob = 0.5)
      [1] 0.1171875
   
   Full distribution:
   
   .. code-block:: r

      x <- 0:10
      p <- dbinom(x, size = 10, prob = 0.5)
      
      # Visualize with ggplot2
      library(ggplot2)
      ggplot(data.frame(x = x, p = p), aes(x = x, y = p)) +
        geom_col() +
        scale_x_continuous(breaks = 0:10) +
        ggtitle("Binomial(10, 0.5) Distribution")
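
   The value returned by ``dbinom()`` can be verified from the binomial formula, and cumulative probabilities can be obtained either by summing ``dbinom()`` or with ``pbinom()``. A brief sketch:

   .. code-block:: r

      # P(X = 3) = choose(n, k) * p^k * (1 - p)^(n - k)
      manual <- choose(10, 3) * 0.5^3 * 0.5^7
      manual                                        # 0.1171875

      # P(X <= 3): sum of the individual probabilities, or pbinom()
      cum_sum <- sum(dbinom(0:3, size = 10, prob = 0.5))
      cum_sum                                       # 0.171875
      all.equal(cum_sum, pbinom(3, size = 10, prob = 0.5))   # TRUE

      # The probabilities over all outcomes sum to 1
      sum(dbinom(0:10, size = 10, prob = 0.5))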

   **qtukey(p, nmeans, df)**
   
   *Purpose*: Tukey HSD critical values  
   *Help*: Type ``?qtukey`` in R console  
   
   Critical value for 3 groups with df = 27:
   
   .. code-block:: rconsole

      > qtukey(0.95, nmeans = 3, df = 27)
      [1] 3.506426
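
   As a sketch of how this critical value is typically used, the code below converts it into a Tukey HSD margin for equal group sizes, ``q * sqrt(MSE / n)``. The ``mse`` and ``n_per_group`` values are made up for illustration (3 groups of 10 give df = 27).

   .. code-block:: r

      q_crit <- qtukey(0.95, nmeans = 3, df = 27)

      # ptukey() is the matching cumulative distribution function
      ptukey(q_crit, nmeans = 3, df = 27)     # approximately 0.95

      # Hypothetical ANOVA summary values, for illustration only
      mse <- 4.2
      n_per_group <- 10

      # Pairs of group means differing by more than hsd are declared different
      hsd <- q_crit * sqrt(mse / n_per_group)
      hsd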

Simulation Functions
~~~~~~~~~~~~~~~~~~~~~~~~

**set.seed(seed)**

   *Purpose*: Set random number generator seed for reproducibility  
   *Help*: Type ``?set.seed`` in R console  
   
   Without seed (different each time):
   
   .. code-block:: rconsole

      > rnorm(3)
      [1]  0.3186301 -0.5817907  0.7145327
      > rnorm(3)
      [1] -0.8252594 -0.3598138 -0.0109303
   
   With seed (reproducible):
   
   .. code-block:: rconsole

      > set.seed(123)
      > rnorm(3)
      [1] -0.5604756 -0.2301775  1.5587083
      > set.seed(123)
      > rnorm(3)
      [1] -0.5604756 -0.2301775  1.5587083

**Random Generation Functions**

   All random generation functions take ``n`` (number of values) as the first argument.
   
   .. code-block:: r

      set.seed(1)
      rnorm(5, mean = 100, sd = 15)    # Normal
      runif(5, min = 0, max = 10)      # Uniform
      rexp(5, rate = 2)                # Exponential
   
   For simulation studies:
   
   .. code-block:: r

      n_sims <- 1000
      means <- replicate(n_sims, mean(rnorm(30)))
      
      # Visualize sampling distribution with ggplot2
      library(ggplot2)
      ggplot(data.frame(means = means), aes(x = means)) +
        geom_histogram(bins = 30) +
        ggtitle("Sampling Distribution of Mean")
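
   A simulation like this can be checked against theory: the standard deviation of the simulated means should be close to ``sigma / sqrt(n)``. The sketch below uses an arbitrary seed so the result is reproducible.

   .. code-block:: r

      set.seed(350)                 # arbitrary seed for reproducibility
      n <- 30
      means <- replicate(1000, mean(rnorm(n)))

      sd(means)                     # empirical standard error of the mean
      1 / sqrt(n)                   # theoretical value for sigma = 1, about 0.183

   The two numbers will not match exactly, but they get closer as the number of simulated samples increases.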