.. _r_best_practices: Best Practices & Common Pitfalls ================================= Data Import & Validation ~~~~~~~~~~~~~~~~~~~~~~~~ Standard validation workflow: .. code-block:: r d <- read.csv("data/file.csv") str(d) # Structure check head(d); tail(d) # First/last rows summary(d) # Summary statistics sapply(d, function(x) sum(is.na(x))) # Missing values sapply(d, class) # Data types Statistical Assumptions ~~~~~~~~~~~~~~~~~~~~~~~ **Before t-tests/ANOVA:** .. code-block:: r # Normality - visual check library(ggplot2) ggplot(data.frame(x = x), aes(sample = x)) + stat_qq() + stat_qq_line() # Normality - formal test shapiro.test(x) # Equal variances - visual check ggplot(data.frame(value = c(x, y), group = rep(c("X", "Y"), c(length(x), length(y)))), aes(x = group, y = value)) + geom_boxplot() # Equal variances - formal test var.test(x, y) **Before regression:** .. code-block:: r mod <- lm(y ~ x, data = df) # Prepare diagnostic data diag_data <- data.frame( fitted = fitted(mod), residuals = residuals(mod), std_resid = rstandard(mod) ) # Residuals vs Fitted ggplot(diag_data, aes(x = fitted, y = residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + ggtitle("Residuals vs Fitted") # Q-Q plot ggplot(diag_data, aes(sample = std_resid)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q") Check for: 1. Linearity (Residuals vs Fitted) 2. Normality (Q-Q plot) 3. Homoscedasticity (Scale-Location) 4. Influential points (Residuals vs Leverage) Common Errors to Avoid ~~~~~~~~~~~~~~~~~~~~~~ 1. **Factor to numeric conversion** Wrong (returns level indices): .. code-block:: rconsole > f <- factor(c("2", "4", "6")) > as.numeric(f) [1] 1 2 3 Correct: .. code-block:: rconsole > as.numeric(as.character(f)) [1] 2 4 6 2. **Missing data handling** Wrong (silently propagates NA): .. code-block:: rconsole > mean(c(1, 2, NA, 4)) [1] NA Correct: .. code-block:: rconsole > mean(c(1, 2, NA, 4), na.rm = TRUE) [1] 2.333333 3. **Confidence vs Prediction intervals** .. code-block:: r # CI: Where is the mean? predict(mod, newdata, interval = "confidence") # PI: Where is a new observation? predict(mod, newdata, interval = "prediction") Prediction intervals are always wider! 4. **Multiple testing without adjustment** After ANOVA, use Tukey HSD: .. code-block:: r TukeyHSD(aov_fit) This adjusts for multiple comparisons automatically. Workflow Template ~~~~~~~~~~~~~~~~~ .. code-block:: r # 1. Setup library(ggplot2) library(knitr) set.seed(123) # 2. Import and validate d <- read.csv("data/file.csv", stringsAsFactors = FALSE) str(d) summary(d) # 3. Clean d_clean <- d[complete.cases(d), ] # 4. Explore ggplot(d_clean, aes(x = x, y = y)) + geom_point() + theme_minimal() # 5. Model mod <- lm(y ~ x, data = d_clean) # 6. Diagnose diag_data <- data.frame( fitted = fitted(mod), residuals = residuals(mod), std_resid = rstandard(mod) ) ggplot(diag_data, aes(x = fitted, y = residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") ggplot(diag_data, aes(sample = std_resid)) + stat_qq() + stat_qq_line() # 7. Report summary(mod) kable(coef(summary(mod)), digits = 3)