.. _r_function_reference_part2:

Function Reference Part 2: Inference Functions
-------------------------------------------------

**t.test(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)**

*Purpose*: One- and two-sample t-tests

*Help*: Type ``?t.test`` in the R console

*When to use*: Compare the mean of one group to a reference value, or compare the means of two groups

*Key arguments*:

- ``alternative``: "two.sided", "less", or "greater"
- ``mu``: Null hypothesis value (one-sample test)
- ``paired``: TRUE for matched pairs
- ``var.equal``: TRUE assumes equal variances

*Common pitfalls*:

- Using ``paired = TRUE`` with independent samples
- Not checking normality assumptions
- Forgetting ``var.equal = FALSE`` for unequal variances

One-sample test:

.. code-block:: rconsole

   > x <- c(98, 102, 101, 97, 103, 100, 99, 98, 104, 101)
   > t.test(x, mu = 100)

           One Sample t-test

   data:  x
   t = 0.63246, df = 9, p-value = 0.5425
   alternative hypothesis: true mean is not equal to 100
   95 percent confidence interval:
     98.41533 102.18467
   sample estimates:
   mean of x
       100.3

Two-sample independent test using built-in data:

.. code-block:: rconsole

   > data(sleep)
   > t.test(extra ~ group, data = sleep, var.equal = FALSE)

           Welch Two Sample t-test

   data:  extra by group
   t = -1.8608, df = 17.776, p-value = 0.07939
   alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
   95 percent confidence interval:
    -3.3654832  0.2054832
   sample estimates:
   mean in group 1 mean in group 2
              0.75            2.33

**Paired t-test - Three equivalent approaches:**

Method 1 - Using ``paired = TRUE`` with two vectors:

.. code-block:: r

   before <- c(85, 88, 90, 92, 87)
   after  <- c(88, 90, 91, 94, 89)
   t.test(before, after, paired = TRUE)

Method 2 - Convert to a one-sample test of differences:

.. code-block:: r

   differences <- after - before
   t.test(differences, mu = 0)
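The equivalence of Methods 1 and 2 can be checked directly. A minimal sketch, reusing the ``before``/``after`` vectors above (note the sign of the t statistic flips because ``paired = TRUE`` tests ``before - after`` while Method 2 as written tests ``after - before``):

.. code-block:: r

   before <- c(85, 88, 90, 92, 87)
   after  <- c(88, 90, 91, 94, 89)

   m1 <- t.test(before, after, paired = TRUE)  # Method 1
   m2 <- t.test(after - before, mu = 0)        # Method 2

   # Identical p-value; identical t statistic up to sign
   stopifnot(isTRUE(all.equal(m1$p.value, m2$p.value)))
   stopifnot(isTRUE(all.equal(abs(unname(m1$statistic)),
                              abs(unname(m2$statistic)))))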
Method 3 - Formula interface with paired data:

.. code-block:: rconsole

   > # Using sleep dataset (paired by ID)
   > t.test(Pair(extra[group == 1], extra[group == 2]) ~ 1, data = sleep)

           Paired t-test

   data:  Pair(extra[group == 1], extra[group == 2])
   t = -4.0621, df = 9, p-value = 0.002833
   alternative hypothesis: true mean difference is not equal to 0
   95 percent confidence interval:
    -2.4598858 -0.7001142
   sample estimates:
   mean difference
             -1.58

Extract components from results:

.. code-block:: rconsole

   > result <- t.test(x, mu = 100)
   > result$p.value
   [1] 0.5424805
   > result$conf.int
   [1]  98.41533 102.18467
   attr(,"conf.level")
   [1] 0.95
   > result$estimate
   mean of x
       100.3

**Checking assumptions**:

Normality check:

.. code-block:: r

   # Shapiro-Wilk test
   shapiro.test(x)

   # Visual check with ggplot2
   library(ggplot2)
   ggplot(data.frame(x = x), aes(sample = x)) +
     stat_qq() +
     stat_qq_line()

Equal variances check:

.. code-block:: r

   # F-test
   var.test(x, y)

   # Visual check
   ggplot(data.frame(value = c(x, y),
                     group = rep(c("X", "Y"), c(length(x), length(y)))),
          aes(x = group, y = value)) +
     geom_boxplot()

*See also*: ``wilcox.test()`` (non-parametric alternative)

**aov(formula, data)**

*Purpose*: Analysis of variance (ANOVA)

*Help*: Type ``?aov`` in the R console

*When to use*: Compare means across three or more groups

One-way ANOVA:

.. code-block:: rconsole

   > fit <- aov(score ~ group, data = df)
   > summary(fit)
               Df Sum Sq Mean Sq F value Pr(>F)
   group        2  520.3  260.15   8.214 0.0019 **
   Residuals   27  855.2   31.67
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check assumptions with diagnostic plots:

.. code-block:: r

   plot(fit, which = 1)  # Residuals vs Fitted
   plot(fit, which = 2)  # Normal Q-Q

Test for equal variances:

.. code-block:: r

   bartlett.test(score ~ group, data = df)
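If these assumption checks fail, a rank-based test is a common fallback. A minimal sketch, assuming a ``df`` with ``score`` and ``group`` columns like this section's running example (recreated here with simulated values, since ``df`` is not defined in this reference):

.. code-block:: r

   # Simulated stand-in for the running df (made-up values)
   set.seed(1)
   df <- data.frame(
     score = c(rnorm(10, mean = 50), rnorm(10, mean = 55), rnorm(10, mean = 48)),
     group = factor(rep(c("A", "B", "C"), each = 10))
   )

   # Rank-based alternative to aov() when normality or
   # equal-variance assumptions are doubtful
   kruskal.test(score ~ group, data = df)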
Post-hoc comparisons:

.. code-block:: r

   TukeyHSD(fit)
   plot(TukeyHSD(fit))

*See also*: ``anova()``, ``TukeyHSD()``, ``kruskal.test()``

**TukeyHSD(x, conf.level = 0.95)**

*Purpose*: Tukey's honest significant differences

*Help*: Type ``?TukeyHSD`` in the R console

*When to use*: After a significant ANOVA, for pairwise comparisons

Following ANOVA:

.. code-block:: rconsole

   > fit <- aov(score ~ group, data = df)
   > tukey_result <- TukeyHSD(fit)
   > tukey_result
     Tukey multiple comparisons of means
       95% family-wise confidence level

   Fit: aov(formula = score ~ group, data = df)

   $group
         diff       lwr       upr     p adj
   B-A -5.333 -12.20271  1.536038 0.1518711
   C-A  3.167  -3.70271 10.036038 0.5019442
   C-B  8.500   1.63062 15.369376 0.0125506

Visual display:

.. code-block:: r

   plot(tukey_result)

Confidence intervals that do not contain 0 indicate significant differences.

**lm(formula, data)**

*Purpose*: Fit linear models

*Help*: Type ``?lm`` in the R console

*When to use*: Regression analysis, ANOVA via regression

Simple linear regression:

.. code-block:: rconsole

   > mod <- lm(rating ~ price, data = df)
   > summary(mod)

   Call:
   lm(formula = rating ~ price, data = df)

   Residuals:
        Min       1Q   Median       3Q      Max
   -2.83450 -0.83450 -0.03077  0.78231  2.70308

   Coefficients:
               Estimate Std. Error t value Pr(>|t|)
   (Intercept)   2.1234     0.2145   9.897 2.13e-10 ***
   price         0.5432     0.0532  10.211 9.77e-11 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

   Residual standard error: 1.234 on 28 degrees of freedom
   Multiple R-squared:  0.7882,    Adjusted R-squared:  0.7806
   F-statistic: 104.3 on 1 and 28 DF,  p-value: 9.77e-11

Extract components:

.. code-block:: r

   coef(mod)       # Coefficients
   confint(mod)    # CI for coefficients
   residuals(mod)  # Residuals
   fitted(mod)     # Fitted values

Diagnostic plots:

.. code-block:: r

   plot(mod)  # Produces 4 diagnostic plots
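Beyond ``coef()`` and ``confint()``, the object returned by ``summary()`` is itself a list whose elements can be extracted programmatically. A small sketch using the built-in ``cars`` dataset (this section's ``df`` is not defined here):

.. code-block:: r

   # Toy fit on built-in data, standing in for lm(rating ~ price, data = df)
   mod <- lm(dist ~ speed, data = cars)
   s <- summary(mod)

   s$r.squared      # Multiple R-squared
   s$adj.r.squared  # Adjusted R-squared
   s$sigma          # Residual standard error
   s$coefficients   # Matrix: Estimate, Std. Error, t value, Pr(>|t|)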
ANOVA table:

.. code-block:: r

   anova(mod)

*See also*: ``summary.lm()``, ``predict.lm()``, ``plot.lm()``

**predict(object, newdata, interval = "none", level = 0.95)**

*Purpose*: Predictions from model fits

*Help*: Type ``?predict.lm`` in the R console

*Key arguments*:

- ``interval``: "confidence" (mean response) or "prediction" (new observation)
- ``level``: Confidence level

*Critical distinction*:

- **Confidence interval**: Uncertainty about the mean response
- **Prediction interval**: Uncertainty about an individual response (wider)

Fit the model first:

.. code-block:: r

   mod <- lm(rating ~ price, data = df)

Point prediction:

.. code-block:: rconsole

   > new_price <- data.frame(price = 50)
   > predict(mod, new_price)
          1
   29.28571

Confidence interval (mean rating at price = 50):

.. code-block:: rconsole

   > predict(mod, new_price, interval = "confidence")
          fit      lwr      upr
   1 29.28571 28.47933 30.09209

Prediction interval (individual rating at price = 50):

.. code-block:: rconsole

   > predict(mod, new_price, interval = "prediction")
          fit      lwr      upr
   1 29.28571 26.89414 31.67728

Note that prediction intervals are always wider than confidence intervals.

Multiple predictions for plotting:

.. code-block:: r

   new_prices <- data.frame(
     price = seq(min(df$price), max(df$price), length = 100)
   )
   ci <- predict(mod, new_prices, interval = "confidence")
   pi <- predict(mod, new_prices, interval = "prediction")

Plot with bands:

.. code-block:: r

   plot(rating ~ price, data = df)
   abline(mod)
   lines(new_prices$price, ci[, "lwr"], lty = 2, col = "blue")
   lines(new_prices$price, ci[, "upr"], lty = 2, col = "blue")
   lines(new_prices$price, pi[, "lwr"], lty = 2, col = "red")
   lines(new_prices$price, pi[, "upr"], lty = 2, col = "red")

*See also*: ``fitted()``, ``residuals()``
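The claim that prediction intervals are always wider than confidence intervals can be verified numerically. A sketch using the built-in ``cars`` dataset as a stand-in for this section's ``df``:

.. code-block:: r

   # Built-in data standing in for lm(rating ~ price, data = df)
   mod <- lm(dist ~ speed, data = cars)
   nd  <- data.frame(speed = 15)

   ci <- predict(mod, nd, interval = "confidence")
   pi <- predict(mod, nd, interval = "prediction")

   # Same point estimate; the prediction interval strictly
   # contains the confidence interval
   stopifnot(pi[, "fit"] == ci[, "fit"])
   stopifnot(pi[, "lwr"] < ci[, "lwr"], pi[, "upr"] > ci[, "upr"])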
Diagnostic Plots for Assumptions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a t-test (normality of residuals):

.. code-block:: r

   library(ggplot2)

   x <- residuals(lm(score ~ group, data = df))

   # Q-Q plot
   ggplot(data.frame(sample = x), aes(sample = sample)) +
     stat_qq() +
     stat_qq_line() +
     ggtitle("Normal Q-Q Plot of Residuals")

   # Formal test
   shapiro.test(x)

For ANOVA:

.. code-block:: r

   fit <- aov(score ~ group, data = df)

   # Residuals vs Fitted
   ggplot(data.frame(fitted = fitted(fit), residuals = residuals(fit)),
          aes(x = fitted, y = residuals)) +
     geom_point() +
     geom_hline(yintercept = 0, linetype = "dashed") +
     ggtitle("Residuals vs Fitted")

   # Q-Q plot of residuals
   ggplot(data.frame(sample = residuals(fit)), aes(sample = sample)) +
     stat_qq() +
     stat_qq_line() +
     ggtitle("Normal Q-Q")

For regression (full diagnostic panel):

.. code-block:: r

   mod <- lm(rating ~ price, data = df)

   # Prepare data for all plots
   diag_data <- data.frame(
     fitted = fitted(mod),
     residuals = residuals(mod),
     std_resid = rstandard(mod),
     sqrt_std_resid = sqrt(abs(rstandard(mod))),
     cooks = cooks.distance(mod),
     leverage = hatvalues(mod)
   )

   # Create four diagnostic plots
   library(gridExtra)

   p1 <- ggplot(diag_data, aes(x = fitted, y = residuals)) +
     geom_point() +
     geom_hline(yintercept = 0, linetype = "dashed") +
     geom_smooth(se = FALSE) +
     ggtitle("Residuals vs Fitted")

   p2 <- ggplot(diag_data, aes(sample = std_resid)) +
     stat_qq() +
     stat_qq_line() +
     ggtitle("Normal Q-Q")

   p3 <- ggplot(diag_data, aes(x = fitted, y = sqrt_std_resid)) +
     geom_point() +
     geom_smooth(se = FALSE) +
     ggtitle("Scale-Location")

   p4 <- ggplot(diag_data, aes(x = leverage, y = std_resid)) +
     geom_point() +
     geom_smooth(se = FALSE) +
     ggtitle("Residuals vs Leverage")

   grid.arrange(p1, p2, p3, p4, ncol = 2)

Pattern interpretation:

- Random scatter: good
- Funnel shape: heteroscedasticity
- Curve: non-linearity
- Outliers: influential points
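A rough numeric companion to the funnel-shape check above: when the error spread grows with the fitted values, the absolute residuals correlate positively with them. A sketch on simulated (made-up) data:

.. code-block:: r

   # Simulate a regression with heteroscedastic errors
   set.seed(42)
   x <- runif(200, 1, 10)
   y <- 2 + 3 * x + rnorm(200, sd = x)  # error sd grows with x -> funnel shape
   mod <- lm(y ~ x)

   # Positive correlation signals the funnel pattern numerically
   cor(abs(residuals(mod)), fitted(mod))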