.. _r_best_practices:

Best Practices & Common Pitfalls
=================================

Data Import & Validation
~~~~~~~~~~~~~~~~~~~~~~~~

Standard validation workflow:

.. code-block:: r

   d <- read.csv("data/file.csv")
   
   str(d)                              # Structure check
   head(d); tail(d)                    # First/last rows
   summary(d)                          # Summary statistics
   sapply(d, function(x) sum(is.na(x))) # Missing values
   sapply(d, class)                    # Data types

Statistical Assumptions
~~~~~~~~~~~~~~~~~~~~~~~

**Before t-tests/ANOVA:**

.. code-block:: r

   # Normality - visual check
   library(ggplot2)
   ggplot(data.frame(x = x), aes(sample = x)) +
     stat_qq() +
     stat_qq_line()
   
   # Normality - formal test
   shapiro.test(x)
   
   # Equal variances - visual check
   ggplot(data.frame(value = c(x, y), 
                    group = rep(c("X", "Y"), c(length(x), length(y)))),
          aes(x = group, y = value)) +
     geom_boxplot()
   
   # Equal variances - formal test
   var.test(x, y)

**Before regression:**

.. code-block:: r

   mod <- lm(y ~ x, data = df)
   
   # Prepare diagnostic data
   diag_data <- data.frame(
     fitted = fitted(mod),
     residuals = residuals(mod),
     std_resid = rstandard(mod)
   )
   
   # Residuals vs Fitted
   ggplot(diag_data, aes(x = fitted, y = residuals)) +
     geom_point() +
     geom_hline(yintercept = 0, linetype = "dashed") +
     ggtitle("Residuals vs Fitted")
   
   # Q-Q plot
   ggplot(diag_data, aes(sample = std_resid)) +
     stat_qq() +
     stat_qq_line() +
     ggtitle("Normal Q-Q")

Check for:

1. Linearity (Residuals vs Fitted)
2. Normality (Q-Q plot)
3. Homoscedasticity (Scale-Location)
4. Influential points (Residuals vs Leverage)

Common Errors to Avoid
~~~~~~~~~~~~~~~~~~~~~~

1. **Factor to numeric conversion**

   Wrong (returns level indices):
   
   .. code-block:: rconsole

      > f <- factor(c("2", "4", "6"))
      > as.numeric(f)
      [1] 1 2 3
   
   Correct:
   
   .. code-block:: rconsole

      > as.numeric(as.character(f))
      [1] 2 4 6

2. **Missing data handling**

   Wrong (silently propagates NA):
   
   .. code-block:: rconsole

      > mean(c(1, 2, NA, 4))
      [1] NA
   
   Correct:
   
   .. code-block:: rconsole

      > mean(c(1, 2, NA, 4), na.rm = TRUE)
      [1] 2.333333

3. **Confidence vs Prediction intervals**

   .. code-block:: r

      # CI: Where is the mean?
      predict(mod, newdata, interval = "confidence")
      
      # PI: Where is a new observation?
      predict(mod, newdata, interval = "prediction")
   
   Prediction intervals are always wider!

4. **Multiple testing without adjustment**

   After ANOVA, use Tukey HSD:
   
   .. code-block:: r

      TukeyHSD(aov_fit)
   
   This adjusts for multiple comparisons automatically.

Workflow Template
~~~~~~~~~~~~~~~~~

.. code-block:: r

   # 1. Setup
   library(ggplot2)
   library(knitr)
   set.seed(123)
   
   # 2. Import and validate
   d <- read.csv("data/file.csv", stringsAsFactors = FALSE)
   str(d)
   summary(d)
   
   # 3. Clean
   d_clean <- d[complete.cases(d), ]
   
   # 4. Explore
   ggplot(d_clean, aes(x = x, y = y)) +
     geom_point() +
     theme_minimal()
   
   # 5. Model
   mod <- lm(y ~ x, data = d_clean)
   
   # 6. Diagnose
   diag_data <- data.frame(
     fitted = fitted(mod),
     residuals = residuals(mod),
     std_resid = rstandard(mod)
   )
   
   ggplot(diag_data, aes(x = fitted, y = residuals)) +
     geom_point() +
     geom_hline(yintercept = 0, linetype = "dashed")
   
   ggplot(diag_data, aes(sample = std_resid)) +
     stat_qq() +
     stat_qq_line()
   
   # 7. Report
   summary(mod)
   kable(coef(summary(mod)), digits = 3)