13.4. Prediction and Robustness
We have now developed the complete foundation for simple linear regression: model fitting, assumption checking, and statistical inference for model parameters. This final chapter completes our regression toolkit by addressing two critical questions:
How do we make predictions with appropriate uncertainty quantification?
When can we trust our inference procedures despite violations of the normality assumption?
This chapter represents the culmination of our statistical journey through STAT 350, bringing together concepts from descriptive statistics, probability, sampling distributions, and inference into a comprehensive framework.
Road Map 🧭
Build awareness of the dangers of extrapolation.
Understand the difference between the two prediction tasks in regression: predicting the mean response versus predicting a single response.
Construct intervals for both types of prediction and compare their shared characteristics and key differences.
Discuss how the CLT makes some, but not all, linear regression inference tasks robust to violations of normality.
Organize all components of linear regression into a single workflow using a complete example.
13.4.1. The Danger of Extrapolation
One of the most important applications of linear regression is prediction—using the fitted model to estimate the response for a new explanatory input \(x^*\). However, a fitted model is reliable for prediction only if the explanatory value is “reasonable” for the model. To understand what counts as reasonable, we should first understand the difference between interpolation and extrapolation.
Fig. 13.25 Scatter plot with extrapolation regions marked
Interpolation involves making predictions for explanatory variable values that fall within the range of values used to create the regression line. These predictions are generally trustworthy because:
Our model has been “trained” on data within this range.
We have evidence that the linear relationship holds in this region.
Our model assumptions have been validated using data from this range.
Extrapolation involves using the regression line for prediction outside the observed range of the explanatory variable. This is risky because:
We have no evidence that the linear relationship continues outside this range.
Model assumptions may not hold in unobserved regions.
In essence, the only \(x^*\) values suitable for prediction are those within the observed range of \(X\). Extrapolation should be avoided whenever possible.
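As a simple safeguard, a prediction routine can check whether a requested \(x^*\) lies inside the observed range of the explanatory variable before producing a number (a minimal sketch with made-up values):
# Guard against extrapolation: check x* against the observed range of X
x_obs  <- c(12, 15, 18, 22, 25, 30)   # observed explanatory values (illustrative)
x_star <- 40                          # value at which a prediction is requested

if (x_star < min(x_obs) || x_star > max(x_obs)) {
  warning("x* lies outside the observed range of X; prediction would be extrapolation")
}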
13.4.2. Two Types of Prediction and Their Distributional Foundation
Two Types of Prediction
Prediction in regression analysis involves finding a suitable response for a given explanatory value \(x^*\). This seemingly simple task, however, breaks down into two different questions depending on the target of estimation:
What is the average of all responses to the explanatory value \(x^*\)?
We call the average of all responses to an explanatory value \(x^*\) the mean response, and we answer this question by computing a confidence interval for the mean response.
What will be the value of one new response associated with the explanatory value \(x^*\)?
When our interest is in estimating a single response to a given explanatory value \(x^*\), we call our target the predicted response. We answer this question using a prediction interval.
The different types of prediction give rise to inferences with different characterizations of uncertainty. We begin by analyzing the distributional properties of the corresponding prediction estimators.
1. Distribution of the Mean Response Estimator
Mathematically, the true mean response at \(X = x^*\) can be written as
\[ E[Y \mid X = x^*] = \beta_0 + \beta_1 x^*. \]
For conciseness, let us assume that the value of \(x^*\) is clear from context and write \(\mu^* = E[Y \mid X = x^*]\).
The Mean Response Estimator
The logical choice for an estimator of the mean response is
\[ \hat{\mu}^* = b_0 + b_1 x^*, \]
where the unknown regression parameters are replaced by their estimators, \(b_0\) and \(b_1\). Let us further analyze the distribution of \(\hat{\mu}^*\).
Expected Value of \(\hat{\mu}^*\)
The parameter estimators \(b_0\) and \(b_1\) are unbiased, which in turn makes the mean response estimator unbiased: \(E[\hat{\mu}^*] = \beta_0 + \beta_1 x^* = \mu^*\).
Variance of \(\hat{\mu}^*\)
Using the identity \(b_0 = \bar{Y} - b_1\bar{x}\), the estimator can be rewritten as \(\hat{\mu}^* = \bar{Y} + b_1(x^* - \bar{x})\). Now using the expression of \(b_1\) as a linear combination of the responses (see Eq. (13.10)),
\[ \hat{\mu}^* = \sum_{i=1}^{n} \left[\frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{S_{XX}}\right] Y_i. \qquad (13.11) \]
Using the independence of the \(Y_i\) terms,
\[ \text{Var}[\hat{\mu}^*] = \sigma^2\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}\right). \]
Distribution of \(\hat{\mu}^*\) Under Normality
From Eq. (13.11), it is evident that \(\hat{\mu}^*\) is a linear combination of the responses. Therefore, given that the normality assumption holds,
\[ \hat{\mu}^* \sim N\!\left(\mu^*,\; \sigma^2\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}\right)\right). \]
2. Distribution of the Single Response Estimator
Denoting the predicted response as \(Y^*\), its mathematical identity is
\[ Y^* = \beta_0 + \beta_1 x^* + \varepsilon^*, \]
where \(\varepsilon^* \sim N(0, \sigma^2)\) is a new error term, independent of the data used to fit the model. Replacing the unknown regression parameters with their estimators, we define the estimator of \(Y^*\) as
\[ \hat{Y}^* = b_0 + b_1 x^* + \varepsilon^* = \hat{\mu}^* + \varepsilon^*. \]
Expected Value of \(\hat{Y}^*\)
The expected value is \(E[\hat{Y}^*] = E[\hat{\mu}^*] + E[\varepsilon^*] = \mu^* + 0 = \mu^*\), the true mean response, which is also the expected value of \(\hat{\mu}^*\).
Variance of \(\hat{Y}^*\)
Since the new error term \(\varepsilon^*\) is independent of the data used to fit the regression,
\[ \text{Var}[\hat{Y}^*] = \text{Var}[\hat{\mu}^*] + \text{Var}[\varepsilon^*] = \sigma^2\left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}\right). \qquad (13.12) \]
From the first step of Eq. (13.12), we note that \(\hat{Y}^*\) accounts for two sources of variability:
Uncertainty in estimating the mean response with \(\hat{\mu}^*\)
Natural variability of individual observations around the mean response
Since \(\hat{Y}^*\) is essentially \(\hat{\mu}^*\) with an additional source of variability, its variance is always greater than the variance of the mean response estimator.
Distribution of \(\hat{Y}^*\) Under Normality
If both \(\hat{\mu}^*\) and \(\varepsilon^*\) are normally distributed, their sum \(\hat{Y}^*\) is also normal. Given that the normality assumption holds,
\[ \hat{Y}^* \sim N\!\left(\mu^*,\; \sigma^2\left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}\right)\right). \]
13.4.3. Confidence and Prediction Intervals
We are mainly concerned with two-sided intervals for prediction tasks. The intervals follow the standard form:
\[ \text{point estimate} \;\pm\; (\text{critical value}) \times (\text{standard error}). \]
Since \(\sigma^2\) is unknown, the true standard errors are not available and must be estimated by replacing \(\sigma^2\) with \(s^2 = MSE\). This leads to \(t\)-based confidence intervals with \(df = n-2\).
1. Confidence Intervals for Mean Response
\[ \hat{\mu}^* \pm t_{\alpha/2,\, n-2}\; s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}} \qquad (13.13) \]
Interpretation: We are \((1-\alpha) \times 100\%\) confident that the true mean of all responses for explanatory value \(x^*\) lies within this interval.
2. Prediction Interval for Single Response
\[ \hat{y}^* \pm t_{\alpha/2,\, n-2}\; s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}} \qquad (13.14) \]
Interpretation: We are \((1-\alpha) \times 100\%\) confident that a new response with explanatory value \(x^*\) falls within this interval.
Despite its definition, \(\hat{y}^*\) cannot contain \(\varepsilon^*\) as part of its computational formula, because \(\hat{y}^*\) must be a number while \(\varepsilon^*\) is a random variable. In Eq. (13.14), \(\varepsilon^*\) is effectively replaced by its best point estimate, \(0\).
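For a quick look at how these two intervals are obtained in R, here is a minimal sketch on simulated data (all object names and values are illustrative; the worked example in Section 13.4.5 applies the same calls to real data):
# Illustrative sketch: confidence vs. prediction interval at a single x* value
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 5 + 2 * dat$x + rnorm(20, sd = 3)

fit <- lm(y ~ x, data = dat)
new_x <- data.frame(x = 10)

# Eq. (13.13): interval for the mean response at x* = 10
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
# Eq. (13.14): interval for a single new response at x* = 10
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)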
Confidence and Prediction Bands
By evaluating Equations (13.13) and (13.14) over a range of \(x^*\) values, we can construct and visualize the confidence band and the prediction band on a scatter plot:
Fig. 13.26 Comparison of confidence band and prediction band
The confidence band provides the range of plausible values for the true mean response line \(\mu_{Y|X=x} = \beta_0 + \beta_1 x\) across the entire range of explanatory variable values.
The prediction band marks a plausible region for individual observations.
Key Observations
Prediction intervals (band) are always wider than confidence intervals (band) for the mean response.
Mathematically, this is due to the additional \(+1\) term in the standard error of the prediction interval.
Intuitively, a mean is always more stable than a single observation.
At \(x^* = \bar{x}\), the \(\frac{(x^* - \bar{x})^2}{S_{XX}}\) term equals zero, so both standard errors reach their minimum there and grow as \(x^*\) moves away from \(\bar{x}\). As a consequence, the bands have a curved, “bow-tie” shape, narrowest at \((\bar{x}, \bar{y})\).
Points falling outside the prediction band suggest potential outliers or model inadequacy.
Important Considerations for Multiple Predictions
Constructing many intervals simultaneously raises a concern about the familywise error rate (FWER), just as in the multiple-comparison procedures for ANOVA. To address this issue (a short illustration follows this list):
Use more conservative confidence levels (e.g., 99% instead of 95%).
Apply multiple comparison corrections (beyond this course’s scope).
Understand that individual intervals have the stated coverage probability, but simultaneous coverage is lower.
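As a small illustration of the first strategy, several mean-response intervals can be requested at a more conservative level in one call (continuing the simulated-data sketch above, so dat and fit are as defined there):
# 99% (rather than 95%) confidence intervals at several x* values at once
new_xs <- data.frame(x = c(5, 10, 15))
predict(fit, newdata = new_xs, interval = "confidence", level = 0.99)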
13.4.4. Robustness to Normality Assumptions
What happens to the inference results of linear regression when the normality assumption is violated? This discussion mirrors our earlier discussions about robustness in single-sample and two-sample procedures, but regression also presents some unique considerations.
The Central Limit Theorem in Regression
Different inference procedures in the same regression context have varying sensitivity to normality violations. The key characteristic that distinguishes robust methods from non-robust ones is whether the central estimator is an average or involves a single observation. When the estimator is constructed through averaging a large enough number of responses, this mitigates the non-normality of individual outcomes and allows safer use of the associated inference methods.
1. Parameter Estimation (Robust)
Both estimators are weighted averages of the \(Y_i\)’s.
2. Mean Response Prediction (Robust)
This is also a weighted average of the \(Y_i\)’s.
3. Single Response Prediction (NOT Robust)
While \(\hat{\mu}^*\) benefits from CLT, the additional error term \(\varepsilon^*\) does not. This new error term represents a single draw from the error distribution, not an average.
Critical Limitation: If the error terms are not normally distributed, prediction intervals may have incorrect coverage rates. The intervals might be too wide, too narrow, or asymmetric, depending on the true error distribution.
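To see why this matters, the following simulation sketch (not from the text; the sample size, error distribution, and other settings are illustrative assumptions) generates data with strongly right-skewed errors and tracks how often a nominal 95% prediction interval captures a new response, and on which side the misses occur:
# Simulation sketch: prediction-interval behavior under skewed (non-normal) errors
set.seed(350)
n      <- 20
x      <- runif(n, 0, 10)
x_star <- 5
n_rep  <- 5000
miss_low <- miss_high <- 0

for (r in seq_len(n_rep)) {
  eps   <- rexp(n, rate = 1) - 1          # right-skewed errors with mean 0
  y     <- 2 + 0.5 * x + eps
  fit_r <- lm(y ~ x)
  pi_r  <- predict(fit_r, newdata = data.frame(x = x_star),
                   interval = "prediction", level = 0.95)
  y_new <- 2 + 0.5 * x_star + (rexp(1, rate = 1) - 1)   # a new skewed response
  if (y_new < pi_r[, "lwr"]) miss_low  <- miss_low + 1
  if (y_new > pi_r[, "upr"]) miss_high <- miss_high + 1
}

c(coverage  = 1 - (miss_low + miss_high) / n_rep,
  miss_low  = miss_low / n_rep,
  miss_high = miss_high / n_rep)
# Essentially all misses land on one side: the symmetric t-based interval is
# miscalibrated even if the total coverage happens to look close to 95%.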
Practical Implications for Real Data Analysis
Always verify the normality assumption using residual diagnostics.
For small samples (\(n < 20\)), normality is more critical for all procedures.
Normality violations are particularly problematic for prediction intervals.
Consider transformations if normality violations are severe.
Acknowledge limitations when reporting results with questionable normality.
13.4.5. Example of Complete Linear Regression Workflow
Example 💡: Cetane Number and Iodine Value
The cetane number is a critical property that specifies the ignition quality of fuel used in diesel engines. Determination of this number for biodiesel fuel is expensive and time-consuming. Researchers want to explore using a simple linear regression model to predict cetane number from the iodine value.
Variables:
Response (Y): Cetane Number (CN) - measures ignition quality
Explanatory (X): Iodine Value (IV) - the amount of iodine necessary to saturate a sample of 100 grams of oil
A sample of 14 different biodiesel fuels was collected, with both iodine value and cetane number measured for each fuel. Can iodine value be used to predict cetane number through a simple linear relationship?
| Obs | Iodine Value (IV) | Cetane Number (CN) |
|---|---|---|
| 1 | 132.0 | 46.0 |
| 2 | 129.0 | 48.0 |
| 3 | 120.0 | 51.0 |
| 4 | 113.2 | 52.1 |
| 5 | 105.0 | 54.0 |
| 6 | 92.0 | 52.0 |
| 7 | 84.0 | 59.0 |
| 8 | 83.2 | 58.7 |
| 9 | 88.4 | 61.6 |
| 10 | 59.0 | 64.0 |
| 11 | 80.0 | 61.4 |
| 12 | 81.5 | 54.6 |
| 13 | 71.0 | 58.8 |
| 14 | 69.2 | 58.0 |
A. Exploratory Data Analysis
# Create the dataset
iodine_value <- c(132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2)
cetane_number <- c(46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0)
# Create data frame
cetane_data <- data.frame(
IodineValue = iodine_value,
CetaneNumber = cetane_number
)
# Initial scatter plot
library(ggplot2)
ggplot(cetane_data, aes(x = IodineValue, y = CetaneNumber)) +
geom_point(size = 3, color = "blue") +
labs(
title = "Cetane Number vs Iodine Value",
x = "Iodine Value (IV)",
y = "Cetane Number (CN)"
) +
theme_minimal()
Fig. 13.27 Scatter plot of cetane number vs iodine value data
Negative linear trend visible
Points roughly follow a straight line pattern
No extreme outliers apparent
B. Model Fitting
# Fit the linear regression model
fit <- lm(CetaneNumber ~ IodineValue, data = cetane_data)
# Get basic summary
summary(fit)
# Extract coefficients
b0 <- coef(fit)["(Intercept)"]
b1 <- coef(fit)["IodineValue"]
# Display fitted equation
cat("Fitted equation: CN =", round(b0, 3), "+", round(b1, 3), "* IV\n")
\(b_0 = 75.212\): Cetane number is predicted to be \(75.212\) when iodine value is \(0\). This is not practically meaningful since \(IV = 0\) is outside our data range.
\(b_1 = -0.2094\): For each unit increase in iodine value, the cetane number decreases by an average of \(0.2094\) units
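To connect the lm() output back to the least-squares formulas, the coefficients can be verified by hand from the summary statistics (a verification sketch; the object names below are our own):
# Hand-check of the fitted coefficients using b1 = Sxy/Sxx and b0 = ybar - b1*xbar
x_bar <- mean(cetane_data$IodineValue)
y_bar <- mean(cetane_data$CetaneNumber)
Sxx   <- sum((cetane_data$IodineValue - x_bar)^2)
Sxy   <- sum((cetane_data$IodineValue - x_bar) *
             (cetane_data$CetaneNumber - y_bar))

b1_hand <- Sxy / Sxx                  # should match coef(fit)["IodineValue"]
b0_hand <- y_bar - b1_hand * x_bar    # should match coef(fit)["(Intercept)"]
c(b0_hand, b1_hand)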
C. Comprehensive Assumption Checking
Before proceeding with inference, we must verify that our model assumptions are reasonable.
# Calculate residuals and fitted values
cetane_data$residuals <- residuals(fit)
cetane_data$fitted <- fitted(fit)
# Residual plot
ggplot(cetane_data, aes(x = IodineValue, y = residuals)) +
geom_point(size = 3) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Residual Plot",
x = "Iodine Value",
y = "Residuals"
) +
theme_minimal()
# Histogram of residuals
ggplot(cetane_data, aes(x = residuals)) +
geom_histogram(aes(y = after_stat(density)), bins = 5,
fill = "lightblue", color = "black") +
geom_density(color = "red", size = 1) +
stat_function(fun = dnorm,
args = list(mean = mean(cetane_data$residuals),
sd = sd(cetane_data$residuals)),
color = "blue", size = 1) +
labs(title = "Histogram of Residuals with Normal Overlay")
# QQ plot
ggplot(cetane_data, aes(sample = residuals)) +
stat_qq() +
stat_qq_line() +
labs(title = "QQ Plot of Residuals")
Fig. 13.28 Residual plot of cetane number vs iodine value data
Fig. 13.29 Normality assessment plots of cetane number vs iodine value data; upper: histogram of residuals; lower: QQ plot of residuals
Independence: This must be evaluated based on data collection procedures.
Linearity: The residual plot shows no obvious patterns or curvature, supporting the linearity assumption.
Constant Variance: The residual plot shows some regions where points cluster more tightly than others, but with only 14 observations, it’s difficult to definitively assess the assumption. The violations don’t appear severe enough to invalidate the analysis.
Normality: The histogram shows roughly symmetric distribution with no extreme outliers. The QQ plot shows some fluctuation but no systematic departures from linearity. The normality assumption appears reasonable, though drawing a definitive conclusion is challenging due to small sample size.
D. F-test for Model Utility
The hypotheses are:
\(H_0\): There is no linear association between iodine value and cetane number.
\(H_a\): There is a linear association between iodine value and cetane number.
summary(fit)
\(R^2 = 0.7906\)
\(f_{TS}=45.35\) with \(df_1=1\) and \(df_2=n-2=12\)
\(p\)-value \(= 2.091 \times 10^{-5}\)
With p-value < 0.001, we reject \(H_0\) at any reasonable significance level. There is strong evidence of a linear association between iodine value and cetane number.
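The same F-test can also be read from the ANOVA table of the fitted model:
# ANOVA table for the regression; the F statistic and p-value match summary(fit)
anova(fit)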
E. Inference on the Slope Parameter
From summary(fit), we also obtain:
\(b_1=-0.20939\)
\(\widehat{SE}=0.03109\) for the slope estimate
\(t_{TS} = -6.734\)
\(p\)-value \(=2.09 \times 10^{-5}\)
A hypothesis test of \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) yields the same conclusion as the model utility test.
95% Confidence Interval for Slope:
# Calculate confidence interval for slope
confint(fit, 'IodineValue', level = 0.95)
# output: (-0.277, -0.142)
We are 95% confident that each unit increase in iodine value is associated with a decrease in cetane number between 0.142 and 0.277 units.
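The interval reported by confint() can be reproduced by hand from the slope estimate, its standard error, and the t critical value with \(df = n - 2 = 12\) (a verification sketch):
# Hand computation of the 95% confidence interval for the slope
b1_hat <- coef(fit)["IodineValue"]
se_b1  <- summary(fit)$coefficients["IodineValue", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))    # df = n - 2 = 12
b1_hat + c(-1, 1) * t_crit * se_b1            # matches confint(fit, "IodineValue")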
F. Prediction Applications
Q1. What is the expected cetane number for a biodiesel fuel with iodine value of 75?
# Create new data for prediction
new_data <- data.frame(IodineValue = 75)
# Confidence interval for mean response
conf_interval <- predict(fit, newdata = new_data,
interval = "confidence", level = 0.99)
print(conf_interval)
Predicted mean cetane number: 59.51
99% Confidence interval: (57.74, 61.28)
We are 99% confident that the average cetane number for all biodiesel fuels with iodine value 75 is between 57.74 and 61.28.
Q2. What is the predicted response for a new biodiesel fuel with iodine value of 75?
# Prediction interval for individual response
pred_interval <- predict(fit, newdata = new_data,
interval = "prediction", level = 0.99)
print(pred_interval)
Predicted individual cetane number: 59.51
99% Prediction interval: (54.19, 64.83)
We are 99% confident that a new biodiesel fuel with iodine value 75 will have a cetane number between 54.19 and 64.83.
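As a check, the 99% prediction interval can be reproduced directly from Eq. (13.14) using quantities stored in the fitted model (a verification sketch; the object names are our own):
# Hand computation of the 99% prediction interval at IV = 75
n       <- nrow(cetane_data)
x_bar   <- mean(cetane_data$IodineValue)
Sxx     <- sum((cetane_data$IodineValue - x_bar)^2)
s       <- summary(fit)$sigma                 # residual standard error, sqrt(MSE)
x_star  <- 75
y_hat   <- coef(fit)[1] + coef(fit)[2] * x_star
se_pred <- s * sqrt(1 + 1/n + (x_star - x_bar)^2 / Sxx)
t_crit  <- qt(0.995, df = n - 2)
y_hat + c(-1, 1) * t_crit * se_pred           # matches predict(..., interval = "prediction")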
Creating Confidence and Prediction Bands:
# Generate confidence and prediction bands
conf_band <- predict(fit, interval = "confidence", level = 0.99)
pred_band <- predict(fit, interval = "prediction", level = 0.99)
# Comprehensive visualization
ggplot(cetane_data, aes(x = IodineValue, y = CetaneNumber)) +
# Prediction bands (outer)
geom_ribbon(aes(ymin = pred_band[,2], ymax = pred_band[,3]),
fill = "lightblue", alpha = 0.3) +
# Confidence bands (inner)
geom_ribbon(aes(ymin = conf_band[,2], ymax = conf_band[,3]),
fill = "darkblue", alpha = 0.5) +
# Data points
geom_point(size = 3, color = "black") +
# Regression line
geom_smooth(method = "lm", se = FALSE, color = "black", size = 1) +
labs(
title = "Cetane Number Prediction Model",
subtitle = "Dark blue: 99% Confidence bands, Light blue: 99% Prediction bands",
x = "Iodine Value",
y = "Cetane Number"
) +
theme_minimal()
Fig. 13.30 Confidence and prediction bands for cetane number versus iodine value
G. Practical Conclusions and Limitations
There is compelling evidence (\(p < 0.001\)) of a negative linear association between iodine value and cetane number in biodiesel fuels. The model explains approximately 79% of the variation in cetane number, suggesting that iodine value is a useful predictor for this important fuel quality measure.
With only 14 observations, however, our conclusions should be considered preliminary. Larger studies would provide more definitive results. Some minor violations of the constant variance assumption were noted, which could affect the reliability of prediction intervals. Predictions should only be made within the range of observed iodine values (59-132). Extrapolation beyond this range is not justified.
13.4.6. Bringing It All Together
Key Takeaways 📝
Interpolation is safe, extrapolation is dangerous. Predictions should only be made within the range of observed explanatory variable values used to fit the model.
Confidence intervals for mean response estimate average behavior, while prediction intervals for individual observations account for additional uncertainty from new error terms. For this reason, prediction intervals are always wider than confidence intervals for the mean response.
The Central Limit Theorem provides robustness for estimators that involve averaging, but individual predictions are not robust to normality violations.
Exercises
Prediction Types and Interpretation: For a study relating years of experience (X) to annual salary (Y) with fitted model \(\hat{y} = 35000 + 2000x\),
Calculate the predicted salary for someone with 10 years of experience.
Explain the difference between a confidence interval for mean salary and a prediction interval for an individual’s salary at X = 10.
Which interval will be wider and why?
How would you explain these concepts to a non-statistical audience?
Mathematical Understanding: Given the variance formulas for mean response and individual prediction:
Mean response: \(\text{Var}[\hat{\mu}^*] = \sigma^2 \left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}\right)\)
Individual prediction: \(\text{Var}[\hat{Y}^*] = \sigma^2 \left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}\right)\)
Explain why the prediction variance includes the additional “+1” term.
How does the variance change as \(x^*\) moves away from \(\bar{x}\)?
What happens to both variances as \(n\) increases?
Under what conditions would the two variances be most similar?
Assumption Violations: For each scenario, identify the primary assumption violation and discuss the implications:
Residual plot shows a clear curved pattern.
Residual plot shows increasing variance as X increases.
QQ plot of residuals shows heavy tails.
Data points were collected sequentially over time and show temporal patterns.
The dataset includes two distinct subgroups with different relationships.