Best Practices & Common Pitfalls
Data Import & Validation
Standard validation workflow:
d <- read.csv("data/file.csv")
str(d) # Structure check
head(d); tail(d) # First/last rows
summary(d) # Summary statistics
sapply(d, function(x) sum(is.na(x))) # Missing values
sapply(d, class) # Data types
Statistical Assumptions
Before t-tests/ANOVA:
# Normality - visual check
library(ggplot2)
ggplot(data.frame(x = x), aes(sample = x)) +
stat_qq() +
stat_qq_line()
# Normality - formal test
shapiro.test(x)
# Equal variances - visual check
ggplot(data.frame(value = c(x, y),
group = rep(c("X", "Y"), c(length(x), length(y)))),
aes(x = group, y = value)) +
geom_boxplot()
# Equal variances - formal test
var.test(x, y)
Before regression:
mod <- lm(y ~ x, data = df)
# Prepare diagnostic data
diag_data <- data.frame(
fitted = fitted(mod),
residuals = residuals(mod),
std_resid = rstandard(mod)
)
# Residuals vs Fitted
ggplot(diag_data, aes(x = fitted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
ggtitle("Residuals vs Fitted")
# Q-Q plot
ggplot(diag_data, aes(sample = std_resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("Normal Q-Q")
Check for:
Linearity (Residuals vs Fitted)
Normality (Q-Q plot)
Homoscedasticity (Scale-Location)
Influential points (Residuals vs Leverage)
Common Errors to Avoid
Factor to numeric conversion
Wrong (returns level indices):
> f <- factor(c("2", "4", "6")) > as.numeric(f) [1] 1 2 3
Correct:
> as.numeric(as.character(f)) [1] 2 4 6
Missing data handling
Wrong (silently propagates NA):
> mean(c(1, 2, NA, 4)) [1] NA
Correct:
> mean(c(1, 2, NA, 4), na.rm = TRUE) [1] 2.333333
Confidence vs Prediction intervals
# CI: Where is the mean? predict(mod, newdata, interval = "confidence") # PI: Where is a new observation? predict(mod, newdata, interval = "prediction")
Prediction intervals are always wider!
Multiple testing without adjustment
After ANOVA, use Tukey HSD:
TukeyHSD(aov_fit)
This adjusts for multiple comparisons automatically.
Workflow Template
# 1. Setup
library(ggplot2)
library(knitr)
set.seed(123)
# 2. Import and validate
d <- read.csv("data/file.csv", stringsAsFactors = FALSE)
str(d)
summary(d)
# 3. Clean
d_clean <- d[complete.cases(d), ]
# 4. Explore
ggplot(d_clean, aes(x = x, y = y)) +
geom_point() +
theme_minimal()
# 5. Model
mod <- lm(y ~ x, data = d_clean)
# 6. Diagnose
diag_data <- data.frame(
fitted = fitted(mod),
residuals = residuals(mod),
std_resid = rstandard(mod)
)
ggplot(diag_data, aes(x = fitted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed")
ggplot(diag_data, aes(sample = std_resid)) +
stat_qq() +
stat_qq_line()
# 7. Report
summary(mod)
kable(coef(summary(mod)), digits = 3)