Best Practices & Common Pitfalls

Data Import & Validation

Standard validation workflow:

d <- read.csv("data/file.csv")

str(d)                              # Structure check
head(d); tail(d)                    # First/last rows
summary(d)                          # Summary statistics
sapply(d, function(x) sum(is.na(x))) # Missing values
sapply(d, class)                    # Data types
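As a sanity check, the same workflow can be run on a built-in dataset; airquality is used here purely for illustration (it ships with R and is known to contain missing values):

```r
d <- airquality                       # built-in dataset with NAs

str(d)                                # 153 obs. of 6 variables
na_counts <- sapply(d, function(x) sum(is.na(x)))
na_counts                             # Ozone: 37, Solar.R: 7, rest: 0
sapply(d, class)                      # all columns integer/numeric
```

A nonzero count in na_counts flags columns that need an explicit decision (drop, impute, or na.rm) before modelling.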

Statistical Assumptions

Before t-tests/ANOVA:

# Normality - visual check
library(ggplot2)
ggplot(data.frame(x = x), aes(sample = x)) +
  stat_qq() +
  stat_qq_line()

# Normality - formal test (sensitive at large n: trivial deviations become significant)
shapiro.test(x)

# Equal variances - visual check
ggplot(data.frame(value = c(x, y),
                  group = rep(c("X", "Y"), c(length(x), length(y)))),
       aes(x = group, y = value)) +
  geom_boxplot()

# Equal variances - formal test (F test; itself assumes normality)
var.test(x, y)
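A self-contained run of the checks above on simulated data (the seed, sample sizes, and distribution parameters are arbitrary choices for this sketch):

```r
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(25, mean = 11, sd = 2)

# Shapiro-Wilk returns a p-value; small values suggest non-normality
shapiro.test(x)$p.value

# F test for equal variances of the two samples
var.test(x, y)$p.value
```

Treat these p-values as one input among several; the visual checks above often reveal more than a single number.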

Before regression:

mod <- lm(y ~ x, data = df)

# Prepare diagnostic data
diag_data <- data.frame(
  fitted = fitted(mod),
  residuals = residuals(mod),
  std_resid = rstandard(mod)
)

# Residuals vs Fitted
ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs Fitted")

# Q-Q plot
ggplot(diag_data, aes(sample = std_resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q")

Check for:

  1. Linearity (Residuals vs Fitted)

  2. Normality (Q-Q plot)

  3. Homoscedasticity (Scale-Location)

  4. Influential points (Residuals vs Leverage)

Only the first two plots are built above; base R's plot(mod) draws all four diagnostics at once.
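The first two checks are plotted above; the remaining two can be sketched in the same ggplot2 style. mtcars and mpg ~ wt are stand-in choices for this sketch:

```r
library(ggplot2)

mod <- lm(mpg ~ wt, data = mtcars)
diag_data <- data.frame(
  fitted    = fitted(mod),
  std_resid = rstandard(mod),
  leverage  = hatvalues(mod)
)

# Scale-Location: sqrt(|standardized residuals|) vs fitted;
# a flat trend supports homoscedasticity
ggplot(diag_data, aes(x = fitted, y = sqrt(abs(std_resid)))) +
  geom_point() +
  ggtitle("Scale-Location")

# Residuals vs Leverage: points with high leverage and large
# standardized residuals are potentially influential
ggplot(diag_data, aes(x = leverage, y = std_resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs Leverage")
```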

Common Errors to Avoid

  1. Factor to numeric conversion

    Wrong (returns level indices):

    > f <- factor(c("2", "4", "6"))
    > as.numeric(f)
    [1] 1 2 3
    

    Correct:

    > as.numeric(as.character(f))
    [1] 2 4 6
    
  2. Missing data handling

    Wrong (silently propagates NA):

    > mean(c(1, 2, NA, 4))
    [1] NA
    

    Correct:

    > mean(c(1, 2, NA, 4), na.rm = TRUE)
    [1] 2.333333
    
  3. Confidence vs Prediction intervals

    # CI: Where is the mean?
    predict(mod, newdata, interval = "confidence")
    
    # PI: Where is a new observation?
    predict(mod, newdata, interval = "prediction")
    

Prediction intervals are always wider: they add the residual variance of a single new observation to the uncertainty in the estimated mean.

  4. Multiple testing without adjustment

    After ANOVA, use Tukey HSD:

    TukeyHSD(aov_fit)
    

    This adjusts for multiple comparisons automatically.
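Two of the pitfalls above can be verified directly on built-in data; mtcars, mpg ~ wt, and PlantGrowth are stand-ins chosen for this sketch:

```r
# Pitfall 3: the prediction interval must be wider than the
# confidence interval at the same point
mod    <- lm(mpg ~ wt, data = mtcars)
nd     <- data.frame(wt = 3)
ci_int <- predict(mod, nd, interval = "confidence")
pi_int <- predict(mod, nd, interval = "prediction")
(pi_int[, "upr"] - pi_int[, "lwr"]) > (ci_int[, "upr"] - ci_int[, "lwr"])  # TRUE

# Pitfall 4: Tukey HSD after a one-way ANOVA adjusts the p-value
# of every pairwise group comparison
aov_fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(aov_fit)
```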

Workflow Template

# 1. Setup
library(ggplot2)
library(knitr)
set.seed(123)

# 2. Import and validate
d <- read.csv("data/file.csv", stringsAsFactors = FALSE)  # explicit; the default since R 4.0
str(d)
summary(d)

# 3. Clean
d_clean <- d[complete.cases(d), ]

# 4. Explore
ggplot(d_clean, aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()

# 5. Model
mod <- lm(y ~ x, data = d_clean)

# 6. Diagnose
diag_data <- data.frame(
  fitted = fitted(mod),
  residuals = residuals(mod),
  std_resid = rstandard(mod)
)

ggplot(diag_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

ggplot(diag_data, aes(sample = std_resid)) +
  stat_qq() +
  stat_qq_line()

# 7. Report
summary(mod)
kable(coef(summary(mod)), digits = 3)
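One caveat about step 3: complete.cases() drops rows silently, so it is worth reporting how many were lost. A sketch using airquality (a stand-in for the real data):

```r
d <- airquality
d_clean <- d[complete.cases(d), ]

# Report the row loss so its impact can be judged
message(sprintf("Dropped %d of %d rows containing missing values",
                nrow(d) - nrow(d_clean), nrow(d)))
nrow(d_clean)   # 111 of the original 153 rows remain
```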