Best Practices & Common Pitfalls

Data Import & Validation

Standard validation workflow:

d <- read.csv("data/file.csv")

str(d)                              # Structure check
head(d); tail(d)                    # First/last rows
summary(d)                          # Summary statistics
sapply(d, function(x) sum(is.na(x))) # Missing values
sapply(d, class)                    # Data types
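As a sanity check, the same workflow can be run on a built-in dataset; airquality is used here purely for illustration (it ships with R and is known to contain missing values):

```r
d <- airquality                       # built-in dataset with NAs

str(d)                                # 153 obs. of 6 variables
na_counts <- sapply(d, function(x) sum(is.na(x)))
na_counts                             # Ozone: 37, Solar.R: 7, rest: 0
sapply(d, class)                      # all columns integer/numeric
```

A nonzero count in na_counts flags columns that need an explicit decision (drop, impute, or na.rm) before modelling.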

Statistical Assumptions

Before t-tests/ANOVA:

# Normality - visual check
library(ggplot2)
ggplot(data.frame(x = x), aes(sample = x)) +
  stat_qq() +
  stat_qq_line()

# Normality - formal test (sensitive at large n: trivial deviations become significant)
shapiro.test(x)

# Equal variances - visual check
ggplot(data.frame(value = c(x, y),
                  group = rep(c("X", "Y"), c(length(x), length(y)))),
       aes(x = group, y = value)) +
  geom_boxplot()

# Equal variances - formal test (F test; itself assumes normality)
var.test(x, y)
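A self-contained run of the checks above on simulated data (the seed, sample sizes, and distribution parameters are arbitrary choices for this sketch):

```r
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(25, mean = 11, sd = 2)

# Shapiro-Wilk returns a p-value; small values suggest non-normality
shapiro.test(x)$p.value

# F test for equal variances of the two samples
var.test(x, y)$p.value
```

Treat these p-values as one input among several; the visual checks above often reveal more than a single number.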

Before regression:

mod <- lm(y ~ x, data = df)

# Prepare diagnostic data
diag_data <- data.frame(
  fitted = fitted(mod),
  residuals = residuals(mod),
  std_resid = rstandard(mod)
)

# Residuals vs Fitted
ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs Fitted")

# Q-Q plot
ggplot(diag_data, aes(sample = std_resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q")

Check for:

  1. Linearity (Residuals vs Fitted)

  2. Normality (Q-Q plot)

  3. Homoscedasticity (Scale-Location)

  4. Influential points (Residuals vs Leverage)

Only the first two plots are built above; base R's plot(mod) draws all four diagnostics at once.
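The first two checks are plotted above; the remaining two can be sketched in the same ggplot2 style. mtcars and mpg ~ wt are stand-in choices for this sketch:

```r
library(ggplot2)

mod <- lm(mpg ~ wt, data = mtcars)
diag_data <- data.frame(
  fitted    = fitted(mod),
  std_resid = rstandard(mod),
  leverage  = hatvalues(mod)
)

# Scale-Location: sqrt(|standardized residuals|) vs fitted;
# a flat trend supports homoscedasticity
ggplot(diag_data, aes(x = fitted, y = sqrt(abs(std_resid)))) +
  geom_point() +
  ggtitle("Scale-Location")

# Residuals vs Leverage: points with high leverage and large
# standardized residuals are potentially influential
ggplot(diag_data, aes(x = leverage, y = std_resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs Leverage")
```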

Common Errors to Avoid

  1. Factor to numeric conversion

    Wrong (returns level indices):

    > f <- factor(c("2", "4", "6"))
    > as.numeric(f)
    [1] 1 2 3
    

    Correct:

    > as.numeric(as.character(f))
    [1] 2 4 6
    
  2. Missing data handling

    Wrong (silently propagates NA):

    > mean(c(1, 2, NA, 4))
    [1] NA
    

    Correct:

    > mean(c(1, 2, NA, 4), na.rm = TRUE)
    [1] 2.333333
    
  3. Confidence vs Prediction intervals

    # CI: Where is the mean?
    predict(mod, newdata, interval = "confidence")
    
    # PI: Where is a new observation?
    predict(mod, newdata, interval = "prediction")
    

Prediction intervals are always wider: they add the residual variance of a single new observation to the uncertainty in the estimated mean.

  4. Multiple testing without adjustment

    After ANOVA, use Tukey HSD:

    TukeyHSD(aov_fit)
    

    This adjusts for multiple comparisons automatically.
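Two of the pitfalls above can be verified directly on built-in data; mtcars, mpg ~ wt, and PlantGrowth are stand-ins chosen for this sketch:

```r
# Pitfall 3: the prediction interval must be wider than the
# confidence interval at the same point
mod    <- lm(mpg ~ wt, data = mtcars)
nd     <- data.frame(wt = 3)
ci_int <- predict(mod, nd, interval = "confidence")
pi_int <- predict(mod, nd, interval = "prediction")
(pi_int[, "upr"] - pi_int[, "lwr"]) > (ci_int[, "upr"] - ci_int[, "lwr"])  # TRUE

# Pitfall 4: Tukey HSD after a one-way ANOVA adjusts the p-value
# of every pairwise group comparison
aov_fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(aov_fit)
```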

Workflow Template

# 1. Setup
library(ggplot2)
library(knitr)
set.seed(123)

# 2. Import and validate
d <- read.csv("data/file.csv", stringsAsFactors = FALSE)  # explicit; the default since R 4.0
str(d)
summary(d)

# 3. Clean
d_clean <- d[complete.cases(d), ]

# 4. Explore
ggplot(d_clean, aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()

# 5. Model
mod <- lm(y ~ x, data = d_clean)

# 6. Diagnose
diag_data <- data.frame(
  fitted = fitted(mod),
  residuals = residuals(mod),
  std_resid = rstandard(mod)
)

ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

ggplot(diag_data, aes(sample = std_resid)) +
  stat_qq() +
  stat_qq_line()

# 7. Report
summary(mod)
kable(coef(summary(mod)), digits = 3)
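One caveat about step 3: complete.cases() drops rows silently, so it is worth reporting how many were lost. A sketch using airquality (a stand-in for the real data):

```r
d <- airquality
d_clean <- d[complete.cases(d), ]

# Report the row loss so its impact can be judged
message(sprintf("Dropped %d of %d rows containing missing values",
                nrow(d) - nrow(d_clean), nrow(d)))
nrow(d_clean)   # 111 of the original 153 rows remain
```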