Worksheet 22: Model Assessment and Prediction in Regression
Learning Objectives 🎯
Understand the coefficient of determination \(R^2\) and its interpretation
Distinguish between confidence intervals for the mean response and prediction intervals for individual observations
Explore through simulation how the Central Limit Theorem affects CIs differently than PIs
Compute and interpret prediction intervals and confidence intervals at specific values of \(x\)
Apply comprehensive regression diagnostics to real-world data
Implement model assessment and prediction tools using R
Introduction
Over the last two worksheets, you developed a solid foundation in simple linear regression:
Worksheet 20: Introduction to Simple Linear Regression: You learned how to estimate the regression line by minimizing squared residuals (the least squares method). You derived formulas in special univariate-calculus examples, gaining insight into how the slope and intercept are derived in the multivariate case, explored their statistical properties via simulation, and saw how these estimators behave in repeated sampling.
Worksheet 21: Inference for Simple Linear Regression: You moved from estimation to formal statistical inference. You learned how to test hypotheses about the slope (\(\beta_1\)) and intercept (\(\beta_0\)), construct confidence intervals, and connect regression inference to the ANOVA framework. Additionally, you examined critical regression assumptions using residual diagnostics, including scatterplots, residual plots, histograms, and QQ-plots.
Now that you understand how to estimate the line and conduct inference about its parameters, you must address two essential follow-up questions:
Key Questions for This Worksheet 🎯
How well does our regression line actually explain the observed data?
How can we accurately predict future observations, and quantify our uncertainty about those predictions?
Having statistically significant slope or intercept parameters is necessary, but not sufficient, for practical usefulness. A model can show a statistically significant relationship and yet fail to explain much variability in the data or be unsuitable for accurate predictions. Therefore, the evaluation of a regression model involves more than just testing parameters; it involves quantifying the explanatory power, verifying assumptions, and assessing predictive accuracy and reliability. To do this, we will introduce new quantitative measures and deepen our diagnostic understanding of the regression model’s fit.
Part 1: The Coefficient of Determination
To answer these questions rigorously, we rely on additional metrics and diagnostics. One central measure you’ll examine closely is the Coefficient of Determination (\(R^2\)).
Intuitively, \(R^2\) quantifies the proportion of the total variability in the response variable (\(Y\)) explained by your regression model. Yet, it’s crucial to understand not only what \(R^2\) indicates but also what it does not indicate:
\(R^2\) provides a global measure of explanatory power for your fitted regression model. A high \(R^2\) indicates that the regression line effectively summarizes the observed data.
However, a high \(R^2\) alone doesn’t confirm a causal relationship, nor does it ensure that predictions will always be accurate. It simply measures how closely the data points cluster around your fitted regression line.
You should also recognize the direct relationship between \(R^2\) and the sample Pearson correlation coefficient (\(r\)):
While \(r\) specifically quantifies the linear association between two quantitative variables, in simple linear regression, the relationship is even clearer: \(R^2\) is simply \(r^2\). Thus, in this specific context, understanding one measure directly informs your understanding of the other.
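As a quick numerical check of the identity \(R^2 = r^2\), here is a minimal sketch with simulated data (so the particular values are arbitrary):
# In simple linear regression, R^2 from lm() equals the squared Pearson correlation
set.seed(42)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 4)
demo_fit <- lm(y ~ x)
summary(demo_fit)$r.squared   # R^2 reported by the fitted model
cor(x, y)^2                   # r^2 -- the same value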
Part 2: Confidence Intervals vs. Prediction Intervals
Given a fitted regression model, your next step is often to use it for predictive purposes. To do this responsibly, you must quantify not only your best predictions, but also how confident you can be in these predictions.
You will explore two distinct types of intervals:
Confidence Intervals for the Mean Response
These intervals measure uncertainty around the regression line itself—the estimated average response at a given value of \(x^*\).
Estimated Mean Response:
\[\hat{\mu}^* = b_0 + b_1 x^*\]
Standard Error of the Mean Response:
\[\text{SE}_{\hat{\mu}^*} = \sqrt{\text{MS}_E \cdot \left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]
A \(100(1-\alpha)\%\) CI for the mean response at \(x^*\):
\[(b_0 + b_1 x^*) \pm t_{\alpha/2, n-2} \sqrt{\text{MS}_E \cdot \left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]
Note
The CLT applies to \(\hat{\mu}^* = b_0 + b_1 x^* = \sum_{i=1}^{n}\left(\frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{S_{xx}}\right)y_i\) because it is a weighted sum of the \(y_i\); quantifying uncertainty with \(t_{\alpha/2, n-2}\) is therefore reasonably accurate even if the population errors deviate from Normality.
Prediction Intervals for Individual Observations
These intervals reflect not only uncertainty in the regression line, but also the added variability inherent in predicting a single new data point.
Predicted individual response:
\[\hat{Y}^* = b_0 + b_1 x^* + \varepsilon_{\text{new}}\]
Standard Error for an individual prediction:
\[\text{SE}_{\hat{Y}^*} = \sqrt{\text{MS}_E \cdot \left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]
A \(100(1-\alpha)\%\) PI for an individual response at \(x^*\):
\[(b_0 + b_1 x^*) \pm t_{\alpha/2, n-2} \sqrt{\text{MS}_E \cdot \left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]
Warning
The CLT does not rescue \(\hat{Y}^* = b_0 + b_1 x^* + \varepsilon_{\text{new}} = \sum_{i=1}^{n}\left(\frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{S_{xx}}\right)y_i + \varepsilon_{\text{new}}\) when \(\varepsilon_{\text{new}}\) is not Normal: the new error term \(\varepsilon_{\text{new}}\) is not averaged over, so it retains its original distribution.
These intervals highlight the crucial distinction between estimating average responses and predicting individual observations. The latter always requires wider intervals to account for additional uncertainty.
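To make the two formulas concrete, here is a minimal sketch in R (simulated toy data; the object names and numbers are illustrative, not part of this worksheet's data sets) that computes both intervals by hand at \(x^* = 5\) and checks them against predict():
# CI for the mean response and PI for an individual response, by hand
set.seed(1)
n <- 20
x <- runif(n, 0, 10)
y <- 5 + 2 * x + rnorm(n, sd = 2)
toy_fit <- lm(y ~ x)

b0    <- coef(toy_fit)[1]
b1    <- coef(toy_fit)[2]
xbar  <- mean(x)
Sxx   <- sum((x - xbar)^2)
MSE   <- sum(resid(toy_fit)^2) / (n - 2)
xstar <- 5
yhat  <- b0 + b1 * xstar
tmult <- qt(0.975, df = n - 2)

se_mean <- sqrt(MSE * (1/n + (xstar - xbar)^2 / Sxx))      # SE of the mean response
se_pred <- sqrt(MSE * (1 + 1/n + (xstar - xbar)^2 / Sxx))  # SE for a new observation

c(yhat - tmult * se_mean, yhat + tmult * se_mean)   # 95% CI for the mean response
c(yhat - tmult * se_pred, yhat + tmult * se_pred)   # 95% PI for an individual response

# The same intervals from predict():
predict(toy_fit, data.frame(x = xstar), interval = "confidence")
predict(toy_fit, data.frame(x = xstar), interval = "prediction")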
Part 3: Simulation Study — CI vs. PI Coverage
Question 1: In this problem, you’ll explore through simulation how confidence intervals for the mean response differ from prediction intervals for individual responses, specifically examining their sensitivity to the underlying error distribution and sample size.
Instructions:
Copy the R simulation script provided below into an R file (e.g., CI_PI_simulation.R) and save it.
Start with a fresh R session (e.g., restart R in RStudio). Then run the simulation by sourcing the script: source("CI_PI_simulation.R").
This script performs the following:
Simulates data for three error distributions (Normal, Uniform, and Lognormal) and three sample sizes (n = 3, 30, 500).
Computes 95% confidence intervals (CIs) for the mean response and 95% prediction intervals (PIs) for individual responses.
Reports empirical coverage probabilities in the console.
Produces side-by-side histograms and QQ-plots for the sampling distributions of the mean predictions and individual predictions.
#------------------------------------------------------------------
# Simulation: CI for Mean Response vs PI for Individual Response
# (single x0) — STAT 350 Worksheet 22
#------------------------------------------------------------------
library(ggplot2)
library(gridExtra)
set.seed(123)
# ------------------ 1. USER SETTINGS -----------------------------
Nsim <- 5000 # simulations per scenario
n_vec <- c(3, 30, 500) # sample sizes
beta0 <- 5 # true intercept
beta1 <- 2 # true slope
sigma <- 2 # SD of errors
alpha <- 0.05 # 95% intervals
err_list <- c("normal", "uniform", "lognormal")
# -----------------------------------------------------------------
# ------------------ 2. HELPER FUNCTIONS --------------------------
make_errors <- function(N, n, dist, sigma) {
  switch(dist,
         normal    = matrix(rnorm(N * n, 0, sigma), N),
         uniform   = {rng <- sqrt(12) * sigma           # width chosen so Var = sigma^2
                      matrix(runif(N * n, -rng/2, rng/2), N)},
         lognormal = {sdlog <- sqrt(log(1 + sigma^2))   # sdlog chosen so Var = sigma^2
                      # lognormal with meanlog = -sdlog^2/2 has mean 1; shift to mean 0
                      matrix(rlnorm(N * n, -sdlog^2/2, sdlog) - 1, N)},
         stop("Unknown distribution"))
}
ls_est <- function(Y, x, xbar, Sxx) {
  b1 <- drop(Y %*% (x - xbar) / Sxx)   # N x n %*% n x 1 -> N x 1
  b0 <- rowMeans(Y) - b1 * xbar
  list(b0 = b0, b1 = b1)
}
# -----------------------------------------------------------------
# ------------------ 3. MAIN SIMULATION LOOP ----------------------
for (dist in err_list) {
  cat("\n==============================\nError distribution:", dist, "\n")
  for (n in n_vec) {
    x_vec <- seq(0, 10, length.out = n)
    xbar  <- mean(x_vec)
    Sxx   <- sum((x_vec - xbar)^2)
    x0    <- xbar                      # prediction point x0 = mean(x)

    # simulate responses
    E_mat <- make_errors(Nsim, n, dist, sigma)
    Y_mat <- sweep(E_mat, 2, beta0 + beta1 * x_vec, "+")   # Nsim x n

    # least-squares estimates
    est <- ls_est(Y_mat, x_vec, xbar, Sxx)
    b0 <- est$b0; b1 <- est$b1

    # fitted values for residual MS
    fits <- matrix(b0, Nsim, n) + matrix(b1, Nsim, n) * rep(x_vec, each = Nsim)
    SSE  <- rowSums((Y_mat - fits)^2)
    s    <- sqrt(SSE / (n - 2))        # Nsim-vector

    # point predictions
    yhat <- b0 + b1 * x0               # mean response

    # ------------------ intervals -------------------------------
    se_mean <- s * sqrt(1/n + (x0 - xbar)^2 / Sxx)
    se_pred <- s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)
    t_mult  <- qt(1 - alpha/2, df = n - 2)
    CI_low  <- yhat - t_mult * se_mean
    CI_high <- yhat + t_mult * se_mean
    PI_low  <- yhat - t_mult * se_pred
    PI_high <- yhat + t_mult * se_pred

    # simulate ONE future error & future observation
    eps_new <- make_errors(Nsim, 1, dist, sigma)[, 1]
    y_indiv <- yhat + eps_new                    # what you'd predict for one new case
    y_true  <- beta0 + beta1 * x0 + eps_new      # population truth for coverage

    # coverage
    mu_true  <- beta0 + beta1 * x0
    cover_CI <- mean(CI_low <= mu_true & mu_true <= CI_high)
    cover_PI <- mean(PI_low <= y_true & y_true <= PI_high)

    # ------------------ plotting -------------------------------
    df <- data.frame(mean_pred = yhat, indiv_pred = y_indiv)
    p1 <- ggplot(df, aes(mean_pred)) +
      geom_histogram(aes(y = after_stat(density)),
                     bins = 30, fill = "grey", col = "black") +
      geom_density(col = "blue", linewidth = 1) +
      stat_function(fun = dnorm,
                    args = list(mean = mean(yhat), sd = sd(yhat)),
                    col = "red", linewidth = 1) +
      labs(title = "Mean predictions",
           subtitle = paste0("CI cover = ", round(cover_CI, 3))) +
      theme(plot.title = element_text(size = 9))
    p2 <- ggplot(df, aes(sample = mean_pred)) +
      geom_qq(size = .4) + geom_qq_line(col = "blue") +
      labs(title = "QQ-Mean") +
      theme(plot.title = element_text(size = 9))
    p3 <- ggplot(df, aes(indiv_pred)) +
      geom_histogram(aes(y = after_stat(density)),
                     bins = 30, fill = "grey", col = "black") +
      geom_density(col = "darkgreen", linewidth = 1) +
      stat_function(fun = dnorm,
                    args = list(mean = mean(df$indiv_pred), sd = sd(df$indiv_pred)),
                    col = "red", linewidth = 1) +
      labs(title = "Individual predictions",
           subtitle = paste0("PI cover = ", round(cover_PI, 3))) +
      theme(plot.title = element_text(size = 9))
    p4 <- ggplot(df, aes(sample = indiv_pred)) +
      geom_qq(size = .4) + geom_qq_line(col = "darkgreen") +
      labs(title = "QQ-Individual") +
      theme(plot.title = element_text(size = 9))
    grid.arrange(p1, p2, p3, p4, nrow = 2,
                 top = paste("Error:", dist, "| n =", n, "| x0 =", round(x0, 2)))

    # console summary
    cat("n =", n, "  CI cover =", round(cover_CI, 3),
        "  PI cover =", round(cover_PI, 3), "\n")
  }
}
a) Report your findings:
When errors are Normal, how close are the observed CI and PI coverages to the nominal 95%?
Identify which distribution consistently produced intervals with coverage noticeably above or below 95%. Provide a brief explanation for these observations.
Compare the shapes of the histograms and QQ-plots for the mean predictions and individual predictions. Which appear approximately Normal, and which clearly reflect the shape of the original error distribution?
b) Answer the following questions:
Use these observations to explain how the Central Limit Theorem affects confidence intervals differently than prediction intervals.
Describe clearly how increasing the sample size influences CI and PI coverage, as well as the normality of their distributions.
Why does the distribution of individual predictions remain noticeably skewed even for very large sample sizes?
Part 4: Prediction with the Bird Data
Question 2: Use your own estimates \(b_0\), \(b_1\), \(\bar{x}\), and \(\text{MS}_E\) from the model you fitted to the bird data to answer the following:
| Weight (grams) | WingLength (cm) |
|---|---|
| 130.8 | 24.9 |
| 125.8 | 24.5 |
| 155.2 | 27.1 |
| 105.6 | 22.3 |
| 146.9 | 25.4 |
| 148.4 | 26.5 |
| 181.2 | 32.4 |
| 137.0 | 25.7 |
Your regression equation: \(\hat{y}_{\text{wing length}} = b_0 + b_1 \cdot x_{\text{weight}} =\) ____________________
a) Use your fitted line to compute the predicted wing length at Point A and Point B. Give a clear reason why the prediction at Point B is much less trustworthy than that for Point A.
| Point | Weight (grams) | \(\hat{y}_{\text{wing length}}\) |
|---|---|---|
| A | 140 | |
| B | 210 | |
Why is the prediction at Point B less trustworthy?
b) Work only with Point A (140 g) in this part.
Obtain and report the 95% confidence interval for the mean wing length at 140 g.
Obtain and report the 95% prediction interval for an individual bird’s wing length at 140 g.
Which interval is wider, and by how many centimeters? Provide a clear statistical interpretation of each interval.
# Verification code for bird data predictions
bird_data <- data.frame(
  Weight     = c(130.8, 125.8, 155.2, 105.6, 146.9, 148.4, 181.2, 137.0),
  WingLength = c(24.9, 24.5, 27.1, 22.3, 25.4, 26.5, 32.4, 25.7)
)
bird_lm <- lm(WingLength ~ Weight, data = bird_data)
# Create new data for prediction (note: 210 g lies well outside the observed weights)
newdata <- data.frame(Weight = c(140, 210))
# Point predictions
predict(bird_lm, newdata = newdata)
# Confidence interval for mean response
predict(bird_lm, newdata = newdata, interval = "confidence", level = 0.95)
# Prediction interval for individual response
predict(bird_lm, newdata = newdata, interval = "prediction", level = 0.95)
Part 5: Diamonds Data Analysis
Question 3: The diamonds data set, included in the ggplot2 package, contains detailed information on 53,940 round-cut diamonds. Each observation corresponds to a single diamond and includes both physical characteristics and pricing information.
| Variable | Description | Type |
|---|---|---|
| carat | Weight of the diamond (in carats) | Quantitative |
| cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) | Categorical |
| color | Diamond color (graded from J (worst) to D (best)) | Categorical |
| clarity | A measurement of how clear the diamond is | Categorical |
| depth | Total depth percentage | Quantitative |
| table | Width of the top of the diamond relative to the widest point | Quantitative |
| price | Price in US dollars | Quantitative |
| x | Length in mm | Quantitative |
| y | Width in mm | Quantitative |
| z | Depth in mm | Quantitative |
For this exercise, we will focus on a subset of the diamonds data, specifically those diamonds with carat strictly less than 1.00. This yields a more focused analysis by excluding the influence of very large diamonds and narrowing the prediction space.
We are interested in exploring how carat (\(x\)) influences price (\(Y\)) for diamonds under 1 carat. We will treat carat as the explanatory variable and price as the response variable in a simple linear regression framework.
While linearity may be a reasonable approximation in this limited range, we will examine the model assumptions carefully and explore how to compute and interpret interval estimates (confidence and prediction intervals).
a) Checking Assumptions:
Subset the data to include only diamonds where carat < 1.00.
Create the following diagnostic plots based on your fitted simple linear regression:
Scatterplot of carat (\(x\)) vs price (\(y\)), with the least squares regression line
Residual plot (residuals vs. carat)
Histogram of residuals with density and Normal curve overlay
QQ-plot of residuals
After reviewing the plots, evaluate the assumptions by completing the table below.
library(ggplot2)

# Load and subset the data
data(diamonds, package = "ggplot2")
d_small <- diamonds[diamonds$carat < 1, ]
cat("Number of observations:", nrow(d_small), "\n")

# Fit the model
fit <- lm(price ~ carat, data = d_small)

# Add residuals and fitted values to data
d_small$resids <- resid(fit)
d_small$fitted <- fitted(fit)

# 1. Scatterplot with regression line
ggplot(d_small, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Price vs. Carat (<1 carat)",
       x = "Carat", y = "Price (US$)") +
  theme_minimal()

# 2. Residual plot
ggplot(d_small, aes(x = carat, y = resids)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, color = "blue", linewidth = 1) +
  labs(title = "Residuals vs. Carat",
       x = "Carat", y = "Residual (y - y-hat)") +
  theme_minimal()

# 3. Histogram of residuals
res_mu    <- mean(d_small$resids)
res_sigma <- sd(d_small$resids)
ggplot(d_small, aes(x = resids)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "grey", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = res_mu, sd = res_sigma),
                color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Histogram of Residuals",
       x = "Residual", y = "Density") +
  theme_minimal()

# 4. QQ-plot of residuals
ggplot(d_small, aes(sample = resids)) +
  stat_qq(alpha = 0.5) +
  stat_qq_line(color = "red") +
  labs(title = "QQ-plot of Residuals") +
  theme_minimal()
| Assumption | OK / Violated | Brief justification (one sentence) |
|---|---|---|
| Linearity | | |
| Constant variance (homoscedasticity) | | |
| Normality of errors | | |
| Independence | | |
b) Applying the Log-Log Transformation:
The diagnostics from part (a) likely revealed serious assumption violations. When the relationship between \(x\) and \(Y\) follows a power law (i.e., \(Y \propto x^\beta\)), applying a log-log transformation can simultaneously:
Linearize the relationship
Stabilize variance (reduce heteroscedasticity)
Improve Normality of errors
The transformation works because if \(Y = \alpha \cdot x^\beta \cdot \varepsilon\), then:
\[\log(Y) = \log(\alpha) + \beta \log(x) + \log(\varepsilon)\]
This is now a linear relationship between \(\log(Y)\) and \(\log(x)\), with intercept \(\log(\alpha)\), slope \(\beta\), and additive error \(\log(\varepsilon)\).
Create new variables log_carat = log(carat) and log_price = log(price).
Fit a simple linear regression of log(price) on log(carat).
Create the same four diagnostic plots as in part (a), but now for the log-log model.
Re-evaluate the assumptions using the table below.
# Create log-transformed variables
d_small$log_carat <- log(d_small$carat)
d_small$log_price <- log(d_small$price)

# Fit the model on log-log scale
fit_log <- lm(log_price ~ log_carat, data = d_small)

# Add residuals and fitted values
d_small$resids_log <- resid(fit_log)
d_small$fitted_log <- fitted(fit_log)

# 1. Scatterplot with regression line (log-log scale)
ggplot(d_small, aes(x = log_carat, y = log_price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Log(Price) vs. Log(Carat)",
       x = "Log(Carat)", y = "Log(Price)") +
  theme_minimal()

# 2. Residual plot (log-log scale)
ggplot(d_small, aes(x = log_carat, y = resids_log)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, color = "blue", linewidth = 1) +
  labs(title = "Residuals vs. Log(Carat)",
       x = "Log(Carat)", y = "Residual") +
  theme_minimal()

# 3. Histogram of residuals (log-log scale)
res_mu_log    <- mean(d_small$resids_log)
res_sigma_log <- sd(d_small$resids_log)
ggplot(d_small, aes(x = resids_log)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "grey", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = res_mu_log, sd = res_sigma_log),
                color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Histogram of Residuals — Log-Log Model",
       x = "Residual", y = "Density") +
  theme_minimal()

# 4. QQ-plot of residuals (log-log scale)
ggplot(d_small, aes(sample = resids_log)) +
  stat_qq(alpha = 0.5) +
  stat_qq_line(color = "red") +
  labs(title = "QQ-plot of Residuals — Log-Log Model") +
  theme_minimal()
Assumption Assessment (Log-Log Scale):
| Assumption | OK / Violated | Brief justification (one sentence) |
|---|---|---|
| Linearity | | |
| Constant variance (homoscedasticity) | | |
| Normality of errors | | |
| Independence | | |
Comparison: In one or two sentences, describe how the log-log transformation affected the assumption diagnostics.
Note
For the remaining parts (c–f), use the log-log transformed model since it better satisfies the regression assumptions.
c) Write down the estimated regression equation for the log-log model.
# Model summary (log-log)
summary(fit_log)
\(\widehat{\log(\text{price})} =\) ____________________
Interpretation of the slope (elasticity): In a log-log model, the slope \(b_1\) represents the elasticity — the percent change in \(Y\) for a 1% change in \(x\). Complete the interpretation:
“A 1% increase in carat is associated with approximately a ____% increase in price.”
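To check this interpretation numerically (a short sketch assuming your log-log fit is stored as fit_log, as in the code above): a 1% increase in carat multiplies the predicted price by \(1.01^{b_1}\), so the exact percent change is
# Exact % change in predicted price for a 1% increase in carat;
# for small changes this is approximately b1 itself
b1 <- coef(fit_log)["log_carat"]
100 * (1.01^b1 - 1)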
d) Perform the full four-step hypothesis test for the slope of the log-log model at \(\alpha = 0.01\).
Step 1: State the parameter of interest. (Skip)
Step 2: State the hypotheses.
Step 3: Calculate the test statistic and p-value.
Step 4: State your conclusion in context.
e) Complete the ANOVA table for the log-log model.
# ANOVA table (log-log model)
anova(fit_log)
| Source | Degrees of Freedom | Sum of Squares | Mean Squares | \(F\)-test statistic | \(p\)-value |
|---|---|---|---|---|---|
| Regression | | | | | |
| Residual | | | | | |
| Total | | | | | |
f) Compute the coefficient of determination for the log-log model and compare it to the original model.
# Compare R-squared values
# (the original model was stored as `fit` in part (a))
R2_orig <- summary(fit)$r.squared
R2_log  <- summary(fit_log)$r.squared
cat("Original model (price ~ carat): R² =", round(R2_orig, 4), "\n")
cat("Log-log model (log_price ~ log_carat): R² =", round(R2_log, 4), "\n")
Interpretation: Which model explains more variability in the response? Why might \(R^2\) values not be directly comparable between the two models?
g) Graphically construct confidence and prediction bands for the log-log model.
Construct 90% confidence bands for the mean log(price) across the observed range of log(carat).
Construct 90% prediction bands for individual log(price) values across the same range.
Superimpose on one plot: scatterplot of the log-transformed data, fitted line, and both bands.
Compare the bands to those you would have obtained from the original model. Why are prediction intervals from the log-log model more trustworthy?
# Create prediction grid on log scale
log_carat_grid <- data.frame(log_carat = seq(min(d_small$log_carat),
                                             max(d_small$log_carat),
                                             length.out = 400))

# Get confidence and prediction bands
conf_band <- predict(fit_log, newdata = log_carat_grid,
                     interval = "confidence", level = 0.90)
pred_band <- predict(fit_log, newdata = log_carat_grid,
                     interval = "prediction", level = 0.90)

# Combine into data frame
band_df <- cbind(log_carat_grid,
                 fit      = conf_band[, "fit"],
                 conf_lwr = conf_band[, "lwr"],
                 conf_upr = conf_band[, "upr"],
                 pred_lwr = pred_band[, "lwr"],
                 pred_upr = pred_band[, "upr"])

# Create the plot
ggplot() +
  # Scatter points (log-log scale)
  geom_point(data = d_small, aes(x = log_carat, y = log_price),
             alpha = 0.25, size = 1) +
  # Prediction band (wide, light blue) — plot first so it's behind
  geom_ribbon(data = band_df,
              aes(x = log_carat, ymin = pred_lwr, ymax = pred_upr),
              fill = "lightblue", alpha = 0.5) +
  # Confidence band (narrow, dark blue)
  geom_ribbon(data = band_df,
              aes(x = log_carat, ymin = conf_lwr, ymax = conf_upr),
              fill = "darkblue", alpha = 0.6) +
  # Fitted line
  geom_line(data = band_df, aes(x = log_carat, y = fit),
            linewidth = 1.5, color = "black") +
  labs(title = "90% Confidence (dark) & Prediction (light) Bands — Log-Log Model",
       x = "Log(Carat)", y = "Log(Price)") +
  theme_minimal()
Your comparison of the bands and discussion of trustworthiness:
h) (Bonus) Back-transform the predictions to the original scale.
When we predict on the log scale, we get \(\widehat{\log(\text{price})}\). To convert back to dollars, we exponentiate: \(\widehat{\text{price}} = e^{\widehat{\log(\text{price})}}\).
However, there is a subtlety: \(E[\log(Y)] \neq \log(E[Y])\). The back-transformed prediction \(e^{\widehat{\log(Y)}}\) is actually an estimate of the median of \(Y\), not the mean. For the mean, a bias correction is needed (beyond the scope of this course).
# Predict for a specific diamond: 0.5 carats
new_diamond <- data.frame(log_carat = log(0.5))

# Get prediction on log scale
pred_log <- predict(fit_log, newdata = new_diamond,
                    interval = "prediction", level = 0.95)
cat("Prediction on log scale:\n")
print(pred_log)

cat("\nBack-transformed to dollars (median prediction):\n")
cat("Point estimate:", exp(pred_log[1, "fit"]), "\n")
cat("95% PI: (", exp(pred_log[1, "lwr"]), ",", exp(pred_log[1, "upr"]), ")\n")
Predicted median price for a 0.5-carat diamond: $______
95% prediction interval (back-transformed): ($______, $______)
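If you want to see the median-versus-mean subtlety directly, here is a self-contained sketch (simulated lognormal data, independent of the diamonds): exponentiating the average of the logs recovers the median, not the mean.
# exp(E[log Y]) estimates the median of a right-skewed Y, not its mean
set.seed(350)
y <- rlnorm(1e5, meanlog = 2, sdlog = 1)
exp(mean(log(y)))   # about exp(2) = 7.39, the median
median(y)           # also about 7.39
mean(y)             # about exp(2 + 1/2) = 12.18, noticeably larger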
Warning
What Can We Trust When Assumptions Are Violated? ⚠️
When regression assumptions are violated, not all of our inference procedures fail equally. Understanding what remains trustworthy and what becomes unreliable is critical for responsible statistical practice.
WHAT WE CAN STILL TRUST:
Point estimates \(b_0\) and \(b_1\) are unbiased: The least squares estimators are unbiased as long as (1) the model is correctly specified (linearity holds) and (2) \(E[\varepsilon_i] = 0\). Normality is NOT required for unbiasedness. However, to be BLUE (Best Linear Unbiased Estimators), the Gauss-Markov theorem additionally requires (3) homoscedasticity (constant variance) and (4) uncorrelated errors. When heteroscedasticity is present, the estimators are still unbiased but no longer “best” (minimum variance); other estimators like weighted least squares would be more efficient. (A small simulation after this list illustrates the unbiasedness point.)
Confidence intervals for parameters (with large \(n\)): Due to the CLT, confidence intervals for \(\beta_0\) and \(\beta_1\) achieve approximately nominal coverage even when errors are non-Normal, provided the sample size is sufficiently large AND variance is constant. With heteroscedasticity, even large samples don’t fully fix the problem.
Confidence intervals for the mean response \(\mu_{Y|X=x^*}\): These also benefit from the CLT because the estimated mean response is a weighted average of the \(y_i\) values. With large samples and constant variance, these intervals are reasonably trustworthy even with non-Normal errors.
Direction of the relationship: If you find a significant positive (or negative) slope, the direction of the association is reliable even if the exact p-value or interval width is not.
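As a minimal illustration of the unbiasedness point (a sketch with made-up parameters and deliberately skewed, mean-zero errors):
# The slope estimator stays unbiased even with skewed (exponential) errors
set.seed(350)
x <- runif(30, 0, 10)
b1_hats <- replicate(2000, {
  y <- 5 + 2 * x + (rexp(30, rate = 1) - 1)   # skewed errors with mean 0, var 1
  coef(lm(y ~ x))[2]
})
mean(b1_hats)   # close to the true slope of 2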
WHAT WE CANNOT TRUST:
Standard error estimates under heteroscedasticity: When variance is not constant (funnel/fan patterns in residuals), the standard errors for \(b_0\) and \(b_1\) are biased. This means confidence intervals may be too wide or too narrow, and p-values are unreliable. The estimates themselves are still unbiased, but we cannot trust our uncertainty quantification.
Prediction intervals with non-Normal errors: The CLT does NOT save prediction intervals. Since \(\hat{Y}^* = b_0 + b_1 x^* + \varepsilon_{\text{new}}\), the new error term \(\varepsilon_{\text{new}}\) is not averaged over and retains its original (possibly non-Normal) distribution. Prediction intervals can have coverage far from the nominal level.
Prediction intervals under heteroscedasticity: The situation is even worse than for CIs: the prediction interval formula assumes a constant variance \(\sigma^2\) across all \(x\). If variance increases with \(x\), prediction intervals will be too narrow at high \(x\) values and too wide at low \(x\) values (see the simulation sketch after this list).
Inference with small samples and non-Normal errors: The CLT requires sufficient sample size. With small \(n\) and skewed or heavy-tailed errors, even confidence intervals for parameters may have poor coverage.
Any inference when independence is violated: If observations are not independent (e.g., time series data, clustered observations), standard errors are typically underestimated, leading to inflated Type I error rates. This is often the most serious violation.
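And a companion sketch (again, made-up parameters) of the heteroscedasticity problem for prediction intervals: with error SD proportional to \(x\), nominal 95% PIs over-cover at small \(x\) and under-cover at large \(x\).
# PI coverage when the error SD grows with x (violating constant variance)
set.seed(350)
n <- 100; Nsim <- 1000
x <- seq(1, 10, length.out = n)
pi_coverage <- function(x0) {
  hits <- replicate(Nsim, {
    y        <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)    # heteroscedastic errors
    pred_int <- predict(lm(y ~ x), data.frame(x = x0),
                        interval = "prediction", level = 0.95)
    ynew     <- 1 + 2 * x0 + rnorm(1, sd = 0.5 * x0)  # new observation at x0
    pred_int[1, "lwr"] <= ynew && ynew <= pred_int[1, "upr"]
  })
  mean(hits)
}
pi_coverage(2)    # well above 0.95: intervals too wide at small x
pi_coverage(9.5)  # well below 0.95: intervals too narrow at large x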
GAUSS-MARKOV CONDITIONS FOR BLUE:
For \(b_0\) and \(b_1\) to be the Best Linear Unbiased Estimators:
Linearity: \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) (correct model specification)
Zero mean errors: \(E[\varepsilon_i] = 0\)
Homoscedasticity: \(\text{Var}(\varepsilon_i) = \sigma^2\) (constant variance)
Uncorrelated errors: \(\text{Cov}(\varepsilon_i, \varepsilon_j) = 0\) for \(i \neq j\)
Note: Normality is not required for BLUE—it is only needed for exact \(t\) and \(F\) distributions in finite samples. With large samples, the CLT provides approximate Normality for the estimators.
THE DIAMONDS EXAMPLE — WHAT YOU OBSERVED:
In the diamonds analysis (Question 3), you compared two models:
Original Model (price ~ carat):
Linearity violation (curvature): The residual plot showed a U-shaped pattern — residuals systematically positive at low and high carat values, negative in the middle. This means \(b_0\) and \(b_1\) are biased.
Severe heteroscedasticity: Variance increased dramatically with carat (fanning pattern). Standard errors and p-values are unreliable.
Non-Normal residuals: Heavy tails and skewness. Prediction intervals have poor coverage.
Log-Log Model (log(price) ~ log(carat)):
Linearity: ✓ Much better — the relationship is approximately linear on the log-log scale.
Homoscedasticity: ✓ Much better — variance is approximately constant across log(carat).
Normality: ✓ Much better — residuals are approximately Normal (slight heavy tails remain).
THE LESSON: When the original model violates assumptions, the solution is often to transform the variables rather than proceeding with unreliable inference. The log-log transformation is particularly effective when the underlying relationship follows a power law (\(Y \propto x^\beta\)), which is common in economics, biology, and physics.
BOTTOM LINE:
✓ Use the log-log model for inference — its assumptions are reasonably satisfied
✓ Interpret the slope as an elasticity (% change in price per % change in carat)
✓ Prediction intervals from the log-log model are trustworthy; those from the original model are not
✗ Do NOT use inference from the original model — biased estimates and unreliable standard errors
Key Takeaways
Summary 📝
The coefficient of determination \(R^2\) measures the proportion of variability in \(Y\) explained by the regression model; in simple linear regression, \(R^2 = r^2\)
Confidence intervals for the mean response quantify uncertainty about the average \(Y\) at a given \(x^*\); the CLT helps these intervals achieve nominal coverage even with non-Normal errors
Prediction intervals for individual observations must account for both uncertainty in the regression line AND the inherent variability of a single new observation
Prediction intervals are always wider than confidence intervals because their standard error contains the additional \(\text{MS}_E \cdot 1\) term (the leading 1 inside the square root) for the new observation's own error
The CLT does not apply to prediction intervals when errors are non-Normal, because the new error \(\varepsilon_{\text{new}}\) is not averaged over
Extrapolation (predicting beyond the range of observed \(x\) values) is unreliable because we have no data to support the assumed linear relationship in that region
Transformations (such as log-log) can simultaneously fix multiple assumption violations: linearizing the relationship, stabilizing variance, and improving Normality
In a log-log model, the slope is an elasticity: a 1% increase in \(x\) is associated with approximately a \(b_1\)% change in \(Y\)
Heteroscedasticity (non-constant variance) biases standard error estimates, making confidence intervals and p-values unreliable even when point estimates remain valid
Linearity violations cause bias in the point estimates themselves — this is the only assumption whose violation directly biases \(b_0\) and \(b_1\)
R functions
predict() with interval = "confidence" or interval = "prediction" efficiently computes interval estimates at specified values of \(x\)