13.3. Model Diagnostics and Statistical Inference

Having developed the simple linear regression model and methods for fitting it using least squares, we now face a critical question: how do we know if our model is appropriate for the data? Before conducting any statistical inference—hypothesis tests, confidence intervals, or predictions—we must first verify that our model assumptions are reasonable. Violating these assumptions can lead to invalid conclusions and unreliable inference procedures.

This chapter combines three essential components of regression analysis: diagnostic procedures for checking model assumptions, the F-test for overall model utility, and inference procedures for individual model parameters. Together, these tools provide a complete framework for validating and drawing conclusions from simple linear regression models.

Road Map 🧭

  • Problem we will solve – How to verify that regression model assumptions are satisfied, test whether our model provides useful information about the relationship between variables, and conduct formal inference about model parameters with appropriate uncertainty quantification

  • Tools we’ll learn – Residual plots and diagnostic graphics for assumption checking, F-test for overall model utility, t-tests and confidence intervals for slope and intercept parameters, and the mathematical relationship between different inference approaches

  • How it fits – This completes our regression toolkit by ensuring model validity before inference, testing overall model usefulness, and providing methods for parameter-specific conclusions—preparing us for prediction and more advanced regression techniques

13.3.1. The Critical Importance of Assumption Checking

Before conducting any statistical inference procedures, we must verify that our model assumptions are reasonable. If we have strong violations of these assumptions, our statistical inference procedures will not be accurate—we won’t be able to trust the results and won’t be able to convey the information we want about the relationship between our variables.

Review of Simple Linear Regression Assumptions

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-regression-assumptions.png

Fig. 13.31 The four fundamental assumptions of simple linear regression that must be verified before conducting inference

Our simple linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\) requires four key assumptions:

Assumption 1: Independence and Identical Distribution (IID)

The observed pairs \((x_i, y_i)\) for \(i \in \{1, 2, \ldots, n\}\) are collected so that the responses constitute a simple random sample at their fixed \(x_i\) values. This means:

  • We plan in advance which explanatory variable values \(x_1, x_2, \ldots, x_n\) to collect

  • We then measure the response as output for each fixed \(x_i\)

  • Each response \(y_i\) is independent of the others

  • The responses constitute a simple random sample for their respective \(x\) values

This assumption is primarily ensured through proper experimental design and data collection procedures. It’s difficult to verify statistically after data collection, so we must rely on understanding how the data was gathered.

Assumption 2: Linearity

The association between the explanatory variable and the response is, on average, linear. The mean response follows the straight line \(E[Y|X] = \beta_0 + \beta_1 X\). If this assumption is violated, using a linear model to describe a non-linear relationship will lead to poor fits and misleading conclusions.

Assumption 3: Normality

The error terms (and hence the response values) are normally distributed:

\[\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n\]

This leads to the conditional distribution:

\[Y_i | X_i = x_i \stackrel{iid}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)\]

Assumption 4: Homoscedasticity (Equal Variance)

The error terms have constant variance \(\sigma^2\) across all values of \(X\). The spread of \(Y\) values around the regression line remains the same regardless of the \(X\) value. Violations of this assumption are called heteroscedasticity.

13.3.2. Diagnostic Tools: Scatter Plots and Residual Plots

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-scatterplot-assumptions.png

Fig. 13.32 Using scatter plots to check linearity and constant variance assumptions

Scatter Plots for Initial Assessment

Scatter plots serve as our primary tool for initial assumption checking:

  • Linearity: Points should roughly follow a straight line pattern

  • Constant variance: The spread of points around the apparent trend should remain consistent across the range of \(X\) values

  • Outliers: Identify observations that don’t fit the general pattern

However, scatter plots cannot help us assess the normality assumption—we need additional tools for that.

The Power of Residual Plots

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-residual-plot-construction.png

Fig. 13.33 Construction of residual plots from scatter plots, showing how residuals transform the regression analysis

Residual plots provide a more sensitive diagnostic tool than scatter plots alone. A residual plot displays the residuals \(e_i = y_i - \hat{y}_i\) on the vertical axis against the explanatory variable \(x_i\) on the horizontal axis.

Construction Process:

  1. Fit the regression line to obtain predicted values \(\hat{y}_i\)

  2. Calculate residuals: \(e_i = y_i - \hat{y}_i\) for each observation

  3. Plot residuals (vertical axis) against \(x_i\) values (horizontal axis)

What Residual Plots Reveal:

The residual plot essentially rotates the scatter plot so that the fitted line becomes the horizontal zero line, amplifying deviations from the fitted relationship. This makes patterns easier to detect than in the original scatter plot.
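To build intuition for what a well-behaved residual plot looks like, the following sketch simulates data that satisfy the model assumptions and plots the residuals; the variable names and parameter values are illustrative choices, not taken from any example in this chapter.

```r
# Simulate data satisfying the regression assumptions, then plot the residuals.
# All numbers here are illustrative, not values from the text.
set.seed(1)
x   <- runif(50, min = 0, max = 10)
y   <- 3 + 2 * x + rnorm(50, mean = 0, sd = 1.5)  # linear mean, constant variance, normal errors
fit <- lm(y ~ x)

plot(x, residuals(fit),
     xlab = "x", ylab = "Residuals",
     main = "Residuals vs. x: random scatter around zero")
abline(h = 0, lty = 2)  # reference line at zero
```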

Applied Example: Blood Pressure Study Revisited

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-blood-pressure-diagnostics.png

Fig. 13.34 Diagnostic analysis of the blood pressure treatment study showing scatter plot with fitted line

Let’s apply our diagnostic procedures to the blood pressure treatment study, where we examined the relationship between patient age and change in blood pressure after 24 hours of treatment.

Scatter Plot Assessment:

Looking at the scatter plot with the fitted line \(\hat{y} = 20.11 - 0.526x\), we can scan across the plot to assess the spread of points around the line. The linear relationship appears reasonable, though with only 11 observations, it’s challenging to definitively assess the constant variance assumption.

Residual Plot Analysis:

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-blood-pressure-residuals.png

Fig. 13.35 Residual plot for the blood pressure data showing each patient’s residual plotted against age

The residual plot for the blood pressure data shows each patient’s residual plotted against age. When we examine the spread across the range of ages:

  • Some regions of the age range contain only a few observations, making the spread difficult to assess there

  • Where observations are denser, the residuals fall both above and below zero

  • The spread is not perfectly uniform across ages, hinting at mild unevenness in variance

The residual plot suggests potential minor violations of the constant variance assumption, but nothing strong enough to invalidate our analysis given the small sample size.

13.3.3. Recognizing Assumption Violations

Understanding what various patterns in residual plots indicate is crucial for proper model assessment.

Constant Variance Violations (Heteroscedasticity)

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-heteroscedasticity-patterns.png

Fig. 13.36 Common patterns indicating violations of the constant variance assumption

Cone Pattern: As \(X\) increases, the residual errors become larger. This indicates that \(\text{Var}(\varepsilon_i)\) is an increasing function of \(X\).

Hourglass Pattern: For extreme values of \(X\) (both large and small), the spread is larger than in the middle range. Variance depends on \(X\) in a non-constant way.

Reverse Cone Pattern: As \(X\) increases, the residual errors become smaller. Again, variance is a function of \(X\) rather than constant.

These patterns indicate strong violations of the equal variance assumption, requiring more advanced techniques like weighted regression (beyond this course’s scope).

Linearity Violations

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-linearity-violations.png

Fig. 13.37 Residual plot patterns indicating violations of the linearity assumption

What We Want to See: Random scatter of points around the horizontal line at zero, with no discernible pattern. This indicates both linearity and constant variance assumptions are satisfied.

Curved Patterns: If residuals show systematic curved patterns, this suggests the true population relationship is non-linear. For example, if the true model should be \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon\) but we fit only \(Y = \beta_0 + \beta_1 X + \varepsilon\), the quadratic component gets absorbed into the residuals, creating a curved pattern.

Key Insight: Patterns in residual plots indicate that our linear model is missing important systematic relationships that exist in the data.
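A quick simulation illustrates this point. In the sketch below (illustrative values only), the true relationship is quadratic, but we fit a straight line; the missed curvature shows up as a U-shaped pattern in the residual plot.

```r
# Fit a straight line to data whose true mean is quadratic, then inspect residuals.
# The coefficients and sample size are illustrative, not from the text.
set.seed(2)
x   <- runif(60, min = 0, max = 10)
y   <- 1 + 0.5 * x + 0.4 * x^2 + rnorm(60, mean = 0, sd = 2)
fit <- lm(y ~ x)   # misspecified: omits the quadratic term

plot(x, residuals(fit),
     xlab = "x", ylab = "Residuals",
     main = "Curved residual pattern: linearity violated")
abline(h = 0, lty = 2)
```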

Assessing Normality of Residuals

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-normality-assessment.png

Fig. 13.38 Methods for assessing normality of residuals using histograms and QQ plots

For normality assessment, we use the same tools we’ve employed throughout the course, but applied to the residuals \(e_1, e_2, \ldots, e_n\):

Histogram of Residuals: Should approximate a normal distribution shape, centered around zero with the characteristic bell curve.

QQ Plot of Residuals: Should show points following approximately a straight diagonal line from lower left to upper right. Systematic deviations from this line suggest departures from normality.

The residuals should behave like a random sample from \(N(0, \sigma^2)\), so standard normality assessment techniques apply directly.
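In R, these checks are applied directly to the residual vector of a fitted model. A minimal sketch, assuming a fitted model object named `fit`:

```r
# Normality diagnostics for the residuals of a fitted model `fit`
e <- residuals(fit)

hist(e, xlab = "Residual", main = "Histogram of residuals")

qqnorm(e)   # points should track a straight line
qqline(e)   # reference line through the quartiles
```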

Comprehensive Diagnostic Examples

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-diagnostic-examples.png

Fig. 13.39 Multiple examples showing different combinations of assumption violations and their appearance in diagnostic plots

Example 1: Good Model Fit

  • Scatter plot shows clear linear trend with consistent spread

  • Residual plot shows random scatter with no patterns

  • Histogram of residuals approximates normal distribution

  • QQ plot shows points close to diagonal line

Example 2: Non-linearity Problem

  • Scatter plot shows slight curvature

  • Residual plot reveals systematic curved pattern

  • Normality plots may look reasonable since the issue is functional form, not error distribution

The lesson: visual inspection of multiple diagnostic plots provides complementary information about different aspects of model adequacy.

13.3.4. The F-Test for Model Utility

Once we’ve verified that our model assumptions are reasonably satisfied, we can proceed with statistical inference. The first question we typically ask is: “Does our simple linear regression model provide useful information about the relationship between our explanatory and response variables?”

Understanding the ANOVA Decomposition

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-anova-decomposition.png

Fig. 13.40 Complete ANOVA table for simple linear regression showing all components and formulas

The F-test for model utility builds on the ANOVA decomposition we developed in Chapter 13.2, but now we understand it in the context of hypothesis testing about model usefulness.

The Fundamental Identity:

\[\text{SST} = \text{SSR} + \text{SSE}\]

This decomposes the total variability in our response variable into two meaningful components:

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-sst-explanation.png

Fig. 13.41 Visual explanation of Sum of Squares Total as baseline variability ignoring the explanatory variable

Sum of Squares Total (SST):

\[\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2\]

SST measures how much the response values deviate from their overall mean, completely ignoring any information from the explanatory variable. If there were no explanatory variable, \(\bar{y}\) would be our best estimate for modeling the response.

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-ssr-explanation.png

Fig. 13.42 Visual explanation of Sum of Squares Regression as improvement over the baseline mean model

Sum of Squares Regression (SSR):

\[\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]

SSR measures how much the fitted values deviate from the overall mean response. This quantifies the improvement we get by using the linear relationship instead of simply averaging all response values.

Model Utility Interpretation: If our model is useful, we want \(\hat{y}_i\) values to be different from \(\bar{y}\). If \(\hat{y}_i \approx \bar{y}\) for all observations, our explanatory variable provides no additional information beyond the overall mean.

Connection to the Slope: Recall that \(\text{SSR} = b_1^2 \sum_{i=1}^n (x_i - \bar{x})^2\). If the slope \(b_1 \to 0\), then \(\text{SSR} \to 0\), indicating no linear relationship.
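As a quick numerical check, the blood pressure study analyzed later in this section has \(b_1 = -0.526\), \(S_{xx} = 2008\), and \(\text{SSR} = 556\), and indeed

\[b_1^2 S_{xx} = (-0.526)^2 (2008) \approx 555.6 \approx \text{SSR}.\]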

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-sse-explanation.png

Fig. 13.43 Visual explanation of Sum of Squares Error as unexplained variation after fitting the model

Sum of Squares Error (SSE):

\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

SSE measures how much the observed values deviate from the fitted line—the unexplained variability. This is exactly what we minimized when fitting the least squares regression line.

What We Want: We want SSE to be small relative to SST, meaning our model explains most of the variation. If SSE ≈ SST, our model provides little improvement over simply using \(\bar{y}\).
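This comparison is often summarized by the coefficient of determination, \(R^2 = \text{SSR}/\text{SST} = 1 - \text{SSE}/\text{SST}\), which R reports in summary(). For the blood pressure study used later in this section, \(R^2 = 556/939 \approx 0.59\): the linear relationship with age accounts for roughly 59% of the variability in the blood pressure response.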

Understanding Degrees of Freedom: Multiple Perspectives

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-degrees-freedom-explanation.png

Fig. 13.44 Explanation of degrees of freedom concept and calculation for regression ANOVA

The concept of degrees of freedom is fundamental to understanding statistical inference, and there are several complementary ways to think about why we get \(n-2\) degrees of freedom for error in simple linear regression. Understanding these different perspectives will help you develop intuition for more complex statistical procedures.

The Information Lens: “Paying the Price of Estimation”

Every observation \(y_i\) initially contributes one independent piece of information about the population. However, when we estimate parameters from the data, we “use up” some of this information.

Think of it this way: we start with \(n\) independent pieces of random variation from our \(n\) observations. To estimate our intercept \(b_0\) and slope \(b_1\), we must impose constraints on the data that “consume” two pieces of this randomness:

  • Estimating \(b_0\) requires that \(\sum_{i=1}^n e_i = 0\) (residuals sum to zero)

  • Estimating \(b_1\) requires that \(\sum_{i=1}^n x_i e_i = 0\) (residuals are uncorrelated with X)

After paying this “price” of estimation, only \(n - 2\) independent pieces of information remain in the residuals.

Connection to Familiar Concepts: This matches the intuition you developed with one-sample t-tests, where estimating the sample mean \(\bar{x}\) used up 1 degree of freedom, leaving \(n-1\) degrees of freedom for the sample variance. Here, estimating two parameters uses up 2 degrees of freedom.

The Constraint Lens: “Equations the Data Must Satisfy”

When we fit \(Y_i = \hat{y}_i + e_i\) using least squares, we’re solving an optimization problem. The solution must satisfy exactly two linear constraints:

\[\sum_{i=1}^n e_i = 0 \quad \text{and} \quad \sum_{i=1}^n x_i e_i = 0\]

These aren’t just mathematical curiosities—they’re fundamental requirements that our residuals must satisfy. With \(n\) residual values but 2 constraints, only \(n-2\) residuals can vary independently. The remaining 2 are completely determined by these constraints.
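These two constraints are easy to verify numerically. The sketch below uses a small set of made-up values (not from the text) and checks that the residuals from lm() satisfy both equations up to rounding error.

```r
# Verify the two least squares constraints on the residuals numerically.
# The data values below are made up for illustration.
x   <- c(1, 2, 3, 4, 5, 6)
y   <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
fit <- lm(y ~ x)
e   <- residuals(fit)

sum(e)       # essentially 0: residuals sum to zero (intercept constraint)
sum(x * e)   # essentially 0: residuals uncorrelated with x (slope constraint)
```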

General Pattern: In multiple regression with \(p\) explanatory variables plus an intercept, we estimate \(p+1\) parameters, creating \(p+1\) constraints on the residuals. This leaves \(n-(p+1)\) degrees of freedom for error.

The Geometric Lens: “Dimensions of Solution Spaces”

From a linear algebra perspective, we can think about the “space” of possible residual vectors. Our \(n\) observations define an \(n\)-dimensional space. When we fit the regression model, we project the data onto a 2-dimensional subspace (spanned by the intercept and slope).

The residuals live in the remaining \(n-2\) dimensional space—the part of the full \(n\)-dimensional space that’s orthogonal (perpendicular) to our fitted model. This geometric perspective shows that the dimension of the “leftover” space after fitting is exactly our error degrees of freedom.

The Distribution Lens: “Where the Chi-Square Comes From”

The degrees of freedom directly determine the shape of our sampling distributions. Under our normality assumptions:

  • \(\frac{\text{SSE}}{\sigma^2} \sim \chi^2_{n-2}\) (chi-square with \(n-2\) degrees of freedom)

  • This feeds into our t-statistics: \(t = \frac{b_1 - \beta_1}{SE(b_1)} \sim t_{n-2}\)

  • And our F-statistic: \(F = \frac{\text{MSR}}{\text{MSE}} \sim F_{1,n-2}\)

The \(n-2\) parameter isn’t arbitrary—it’s precisely the dimension of the space where our residuals can vary independently.
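A short Monte Carlo sketch can make this concrete: if we repeatedly generate data from a model that satisfies the assumptions, the scaled error sum of squares \(\text{SSE}/\sigma^2\) should behave like a \(\chi^2_{n-2}\) random variable. All parameter values below are illustrative assumptions.

```r
# Monte Carlo check that SSE / sigma^2 behaves like chi-square with n - 2 df.
# The sample size, x values, true line, and sigma are illustrative assumptions.
set.seed(350)
n     <- 11
x     <- seq(20, 70, length.out = n)
sigma <- 6

sse_scaled <- replicate(5000, {
  y   <- 20 - 0.5 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  sum(residuals(fit)^2) / sigma^2
})

mean(sse_scaled)   # close to n - 2 = 9, the mean of a chi-square with 9 df
var(sse_scaled)    # close to 2 * (n - 2) = 18, its variance
```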

For Simple Linear Regression, Specifically:

  • df_regression = 1: We’re fitting one slope parameter beyond the overall mean (because the least squares line must pass through \((\bar{x}, \bar{y})\), the intercept is determined once the slope is chosen)

  • df_error = n - 2: After estimating intercept and slope, \(n-2\) residuals remain free to vary

  • df_total = n - 1: Total variation around the overall mean \(\bar{y}\) has the familiar \(n-1\) degrees of freedom

The degrees of freedom always sum: \((n-1) = 1 + (n-2)\).

Why Multiple Perspectives Matter

Different situations call for different ways of thinking about degrees of freedom:

  • The information lens helps with intuitive understanding and connects to familiar concepts

  • The constraint lens is useful when working with model equations and understanding why certain relationships hold

  • The geometric lens becomes powerful in multiple regression and advanced modeling

  • The distribution lens is essential for understanding test statistics and p-values

As you encounter more complex statistical procedures—ANOVA with multiple factors, multiple regression, mixed effects models—you’ll find that some of these perspectives provide clearer insight than others. Having multiple ways to think about the same concept makes you a more flexible and intuitive statistical thinker.

Conducting the F-Test: Step-by-Step Process

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-f-test-steps.png

Fig. 13.45 Complete four-step process for conducting the F-test for model utility

Step 1: Parameter of Interest (Can be skipped)

For the model utility test, we’re not focusing on a specific parameter but rather on the overall usefulness of the linear relationship. We can skip the explicit parameter statement.

Step 2: Hypotheses

\[H_0: \text{There is no linear association between } X \text{ and } Y\]
\[H_a: \text{There is a linear association between } X \text{ and } Y\]

Important: Always state these hypotheses in the context of your specific problem, replacing “X” and “Y” with the actual variable names and context.

Step 3: Test Statistic and P-value

The test statistic compares the variability explained by the regression to the unexplained variability:

\[F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/1}{\text{SSE}/(n-2)}\]

Degrees of freedom: \(df_1 = 1\), \(df_2 = n-2\)

P-value calculation in R:

```r
p_value <- pf(F_statistic, df1 = 1, df2 = n - 2, lower.tail = FALSE)
```

Why Always Upper Tail? The F-distribution is right-skewed and bounded below by zero. Large F-values provide evidence against the null hypothesis (that the model is not useful), so we always calculate \(P(F > F_{\text{observed}})\).

Step 4: Decision and Conclusion

Compare the p-value to the significance level \(\alpha\):

  • If p-value ≤ \(\alpha\): Reject \(H_0\). We have evidence of a linear association.

  • If p-value > \(\alpha\): Fail to reject \(H_0\). We do not have sufficient evidence of a linear association.

Conclusion Template:

  • If rejecting \(H_0\): “At the \(\alpha\) significance level, we have sufficient evidence to conclude that there is a linear association between [explanatory variable] and [response variable] in [context].”

  • If failing to reject \(H_0\): “At the \(\alpha\) significance level, we do not have sufficient evidence to conclude that there is a linear association between [explanatory variable] and [response variable] in [context].”
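To connect these steps to computation, here is a sketch of the entire calculation in R from summary statistics; the numbers come from the blood pressure study worked out later in this section (\(n = 11\), \(\text{SSR} = 556\), \(\text{SSE} = 383\)).

```r
# F-test for model utility from summary statistics
# (values from the blood pressure study in Section 13.3.7)
n   <- 11
SSR <- 556
SSE <- 383

MSR <- SSR / 1          # df1 = 1
MSE <- SSE / (n - 2)    # df2 = n - 2 = 9

F_statistic <- MSR / MSE                                              # about 13.06
p_value <- pf(F_statistic, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # about 0.0055
```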

13.3.5. Inference for Individual Parameters

While the F-test tells us whether our model provides useful information overall, we often want to make specific inferences about the slope and intercept parameters. This allows us to quantify the nature of the relationship and test specific hypotheses about the parameters.

Rewriting Estimates as Linear Combinations

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-parameter-linear-combinations.png

Fig. 13.46 Rewriting slope and intercept estimates as linear combinations of the response values

To develop inference procedures for \(\beta_0\) and \(\beta_1\), we need to understand the statistical properties of our estimates \(b_0\) and \(b_1\). The key insight is rewriting these estimates as linear combinations of the response values \(Y_i\), since the responses are the only random components in our model.

Intercept Rewrite:

Starting from \(b_0 = \bar{y} - b_1\bar{x}\) and substituting the expression for \(b_1\), we can show:

\[b_0 = \sum_{i=1}^n \left(\frac{1}{n} - \frac{(x_i - \bar{x})\bar{x}}{S_{xx}}\right) Y_i\]

Slope Rewrite:

Starting from the least squares formula and using algebraic manipulation:

\[b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{S_{xx}} = \sum_{i=1}^n \frac{(x_i - \bar{x})}{S_{xx}}\, Y_i\]

Why This Matters: Both estimates are linear combinations (weighted averages) of the normally distributed response values \(Y_i\). Since linear combinations of independent normal random variables are also normally distributed, both \(b_0\) and \(b_1\) follow normal distributions under our model assumptions.

Expected Values and Variances

Through careful application of expectation and variance properties:

For the Intercept:

  • \(E[b_0] = \beta_0\) (unbiased estimate)

  • \(\text{Var}(b_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\)

For the Slope:

  • \(E[b_1] = \beta_1\) (unbiased estimate)

  • \(\text{Var}(b_1) = \frac{\sigma^2}{S_{xx}}\)

Distributions Under Normality:

\[b_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right) \quad \text{and} \quad b_0 \sim N\left(\beta_0, \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right)\]

Standard Errors and t-Distribution

Since \(\sigma^2\) is unknown, we estimate it using the mean squared error:

\[s^2 = \text{MSE} = \frac{\text{SSE}}{n-2}\]

Standard Errors:

\[SE(b_1) = \sqrt{\frac{\text{MSE}}{S_{xx}}} \quad \text{and} \quad SE(b_0) = \sqrt{\text{MSE}\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}\]

t-Distribution Result: When we replace \(\sigma^2\) with \(s^2\), the normal distribution becomes a t-distribution with \(n-2\) degrees of freedom:

\[\frac{b_1 - \beta_1}{SE(b_1)} \sim t_{n-2} \quad \text{and} \quad \frac{b_0 - \beta_0}{SE(b_0)} \sim t_{n-2}\]

Confidence Intervals for Parameters

General Form:

\[\text{estimate} \pm t_{\alpha/2,\, n-2} \times SE(\text{estimate})\]

For the Slope:

\[b_1 \pm t_{\alpha/2,\, n-2} \sqrt{\frac{\text{MSE}}{S_{xx}}}\]

For the Intercept:

\[b_0 \pm t_{\alpha/2,\, n-2} \sqrt{\text{MSE}\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}\]

Interpretation: We are \((1-\alpha) \times 100\%\) confident that the true parameter value lies within the calculated interval.

Critical Value in R:

```r
t_critical <- qt(alpha/2, df = n - 2, lower.tail = FALSE)
```

Hypothesis Testing for Parameters

The Four-Step Process for Slope Testing:

Step 1: Parameter of Interest

We are interested in \(\beta_1\), the true slope of the population regression line relating [explanatory variable] to [response variable].

Step 2: Hypotheses

Most commonly:

  • \(H_0: \beta_1 = 0\) (no linear relationship)

  • \(H_a: \beta_1 \neq 0\) (linear relationship exists)

But we can test other values:

  • \(H_0: \beta_1 = \beta_{10}\) (for any specified value \(\beta_{10}\))

  • \(H_a: \beta_1 \neq \beta_{10}\) (or \(>\) or \(<\) for one-sided tests)

Step 3: Test Statistic and P-value

\[t = \frac{b_1 - \beta_{10}}{SE(b_1)} = \frac{b_1 - \beta_{10}}{\sqrt{\text{MSE}/S_{xx}}}\]

Degrees of freedom: \(n-2\)

P-value calculation depends on the alternative hypothesis:

```r
# Two-sided test
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Upper tail test
p_value <- pt(t_stat, df = n - 2, lower.tail = FALSE)

# Lower tail test
p_value <- pt(t_stat, df = n - 2, lower.tail = TRUE)
```

Step 4: Decision and Conclusion

Compare p-value to \(\alpha\) and draw conclusions about the slope parameter in context.

Connection Between F-test and t-test

For simple linear regression with one explanatory variable, there’s a direct mathematical relationship:

\[F = t^2\]

when testing \(H_0: \beta_1 = 0\).

Why Both Tests Matter: In simple linear regression, the F-test and t-test for \(\beta_1 = 0\) are equivalent. However, in multiple regression:

  • The F-test assesses overall model utility (are any of the explanatory variables useful?)

  • Individual t-tests assess each explanatory variable separately (is this specific variable useful?)

Both perspectives provide valuable but different information about model components.

13.3.6. Implementation in R

Fitting the Model:

```r
# Fit linear model
fit <- lm(response ~ explanatory, data = dataset_name)

# Extract coefficients
coefficients(fit)   # or fit$coefficients

# Extract residuals and fitted values
residuals(fit)      # or fit$residuals
fitted(fit)         # or fit$fitted.values
```

Getting Inference Results:

```r
# Complete summary with tests and R-squared
summary(fit)

# ANOVA table for F-test
anova(fit)   # or summary(aov(fit))
```

Manual Calculations (when given summary statistics rather than raw data):

```r
# Calculate test statistics manually
F_stat <- MSR / MSE
t_stat <- (b1 - beta1_null) / SE_b1

# Calculate p-values
p_value_F <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
p_value_t <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Calculate confidence intervals
margin_error <- qt(alpha/2, df = n - 2, lower.tail = FALSE) * SE_b1
CI_lower <- b1 - margin_error
CI_upper <- b1 + margin_error
```

13.3.7. Integrated Example: Blood Pressure Study Complete Analysis

Let’s work through a complete analysis of the blood pressure treatment study, incorporating all diagnostic and inference procedures.

Research Context: Investigating whether patient age affects the change in blood pressure after 24 hours of a new treatment (\(n = 11\) patients).

Step 1: Diagnostic Analysis

```r
# Check assumptions
plot(age, bp_change)          # Scatter plot for linearity and homoscedasticity
abline(lm(bp_change ~ age))

# Fit model and create residual plot
fit <- lm(bp_change ~ age)
plot(age, residuals(fit))     # Residual plot
abline(h = 0)

# Check normality of residuals
hist(residuals(fit))
qqnorm(residuals(fit))
qqline(residuals(fit))
```

Assessment: Linear relationship appears reasonable, constant variance assumption shows some minor violations but nothing severe enough to invalidate analysis with \(n = 11\). Normality appears reasonable.

Step 2: F-test for Model Utility

Using our fitted model \(\hat{y} = 20.11 - 0.526x\) with \(\text{SSR} = 556\), \(\text{SSE} = 383\), \(\text{SST} = 939\):

Hypotheses:

  • \(H_0\): There is no linear association between patient age and change in blood pressure

  • \(H_a\): There is a linear association between patient age and change in blood pressure

Test Statistic:

  • \(\text{MSR} = \text{SSR}/1 = 556\)

  • \(\text{MSE} = \text{SSE}/9 = 42.56\)

  • \(F = 556/42.56 = 13.06\)

P-value: \(P(F_{1,9} > 13.06) = 0.0055\)

Conclusion: At \(\alpha = 0.05\), we reject \(H_0\) and conclude there is sufficient evidence of a linear association between patient age and change in blood pressure.

Step 3: Inference for the Slope

Parameter of Interest: \(\beta_1\), the true change in blood pressure reduction per year increase in patient age.

Hypotheses:

  • \(H_0: \beta_1 = 0\)

  • \(H_a: \beta_1 \neq 0\)

Test Statistic:

  • \(SE(b_1) = \sqrt{\text{MSE}/S_{xx}} = \sqrt{42.56/2008} = 0.146\)

  • \(t = \frac{-0.526 - 0}{0.146} = -3.61\)

P-value: \(P(|t_9| > 3.61) = 2 \times P(t_9 > 3.61) = 0.0055\)

95% Confidence Interval:

  • \(t_{0.025,9} = 2.262\)

  • \(-0.526 \pm 2.262 \times 0.146 = -0.526 \pm 0.330 = (-0.856, -0.196)\)

Conclusion: We are 95% confident that each additional year of patient age is associated with a decrease of between 0.196 and 0.856 mm Hg in the blood pressure reduction achieved by the treatment.

Note: The F-test and t-test give identical p-values (0.0055) since \(F = t^2 = (-3.61)^2 = 13.03 \approx 13.06\) (small rounding differences).

13.3.8. Parameter Distribution Properties

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-parameter-properties.png

Fig. 13.47 Summary of the statistical properties of slope and intercept estimators showing unbiasedness and variance formulas

The mathematical foundation for our inference procedures rests on the key statistical properties of our parameter estimates.

Slope Properties:

  • Unbiased: \(E[b_1] = \beta_1\)

  • Variance: \(\text{Var}(b_1) = \frac{\sigma^2}{S_{xx}}\)

  • Distribution: \(b_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)\)

Intercept Properties:

  • Unbiased: \(E[b_0] = \beta_0\)

  • Variance: \(\text{Var}(b_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\)

  • Distribution: \(b_0 \sim N\left(\beta_0, \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right)\)

Key Insights:

  • Both estimates are linear combinations of the normally distributed response values

  • The slope variance depends only on the error variance and the spread of X values

  • The intercept variance includes additional uncertainty when \(\bar{x} \neq 0\)

Standard Error Formulas and Confidence Intervals

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-confidence-interval-slope.png

Fig. 13.48 Confidence interval formula for the slope parameter with all components clearly labeled

Since \(\sigma^2\) is unknown, we estimate it using \(s^2 = \text{MSE}\) and use the t-distribution:

Standard Error of the Slope:

\[SE(b_1) = \sqrt{\frac{\text{MSE}}{S_{xx}}}\]

where \(S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2\).

Confidence Interval for the Slope:

\[b_1 \pm t_{\alpha/2, n-2} \sqrt{\frac{\text{MSE}}{S_{xx}}}\]

This provides a range of plausible values for the true population slope \(\beta_1\) with \((1-\alpha) \times 100\%\) confidence.

Interpretation: We are \((1-\alpha) \times 100\%\) confident that the true change in the response variable for each one-unit increase in the explanatory variable lies within this interval.

Complete Hypothesis Testing Framework

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-hypothesis-test-slope.png

Fig. 13.49 Complete framework for hypothesis testing about the slope parameter

Step 1: Parameter of Interest

We are interested in \(\beta_1\), the true population slope of the mean response line \(\mu_{Y|X=x}\).

Step 2: Hypotheses

The general form allows for various null values \(\beta_{10}\):

\[H_0: \beta_1 = \beta_{10} \quad \text{vs} \quad H_a: \beta_1 \neq \beta_{10}\]

Alternative formulations:

  • One-sided upper: \(H_a: \beta_1 > \beta_{10}\)

  • One-sided lower: \(H_a: \beta_1 < \beta_{10}\)

Step 3: Test Statistic and P-value

The test statistic is

\[t = \frac{b_1 - \beta_{10}}{SE(b_1)}\]

which follows a t-distribution with \(n-2\) degrees of freedom when \(H_0\) is true.

P-value calculation depends on the alternative:

  • Two-sided: \(\text{p-value} = 2P(t_{n-2} > |t|)\)

  • Upper tail: \(\text{p-value} = P(t_{n-2} > t)\)

  • Lower tail: \(\text{p-value} = P(t_{n-2} < t)\)

Step 4: Decision and Conclusion

Compare p-value to \(\alpha\) and state conclusions in context of the problem.

Special Case: Equivalence of F-test and t-test

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-f-t-equivalence.png

Fig. 13.50 Mathematical relationship between F-test and t-test when testing slope equals zero

When testing \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\), there’s a direct mathematical relationship:

\[F = t^2\]

Specifically:

\[F = \frac{\text{MSR}}{\text{MSE}} = \frac{b_1^2 S_{xx}}{\text{MSE}} = \left(\frac{b_1}{\sqrt{\text{MSE}/S_{xx}}}\right)^2 = \left(\frac{b_1 - 0}{SE(b_1)}\right)^2 = t^2\]

This equivalence means both tests provide identical p-values and lead to identical conclusions for this specific hypothesis.

Why Both Tests Matter: While equivalent in simple linear regression, they serve different purposes in multiple regression:

  • F-test: Tests overall model utility (are any predictors useful?)

  • t-tests: Test individual predictors (is this specific predictor useful?)

The F-test is more general and extends naturally to multiple regression scenarios where we test several slopes simultaneously.

13.3.9. Implementation in R: Complete Workflow

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-r-implementation.png

Fig. 13.51 Complete R workflow for fitting models and conducting inference

Model Fitting and Basic Output:

# Fit the linear model
fit <- lm(response_variable ~ explanatory_variable, data = dataFrame)

# Extract key components
coefficients(fit)     # Get b0 and b1
residuals(fit)        # Get residuals for diagnostics
fitted.values(fit)    # Get predicted values

# Complete inference summary
summary(fit)          # Includes R², F-test, t-tests, standard errors

# ANOVA table
anova(fit)           # Or summary(aov(fit))

What summary(fit) Provides:

  • Coefficient estimates (\(b_0\), \(b_1\)) with standard errors

  • t-statistics and p-values for testing each coefficient equals zero

  • R-squared and adjusted R-squared

  • F-statistic and p-value for overall model utility

  • Residual standard error (estimate of \(\sigma\))
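In addition to summary(), R’s confint() function returns confidence intervals for the fitted coefficients directly, which provides a convenient cross-check on the manual formulas below:

```r
# Confidence intervals for the intercept and slope (default level = 0.95)
confint(fit, level = 0.95)
```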

Manual Calculations (useful for understanding or when given summary statistics):

# Calculate standard error of slope manually
SE_b1 <- sqrt(MSE / Sxx)

# Calculate t-statistic
t_stat <- (b1 - beta1_null) / SE_b1

# Calculate confidence interval
t_critical <- qt(alpha/2, df = n-2, lower.tail = FALSE)
CI_lower <- b1 - t_critical * SE_b1
CI_upper <- b1 + t_critical * SE_b1

# Calculate p-values
p_value_two_sided <- 2 * pt(abs(t_stat), df = n-2, lower.tail = FALSE)
p_value_F <- pf(F_stat, df1 = 1, df2 = n-2, lower.tail = FALSE)

13.3.10. Comprehensive Example: Blood Pressure Study Final Analysis

Let’s complete our blood pressure analysis with full parameter inference.

Given Information:

  • Sample size: \(n = 11\)

  • Fitted model: \(\hat{y} = 20.11 - 0.526x\)

  • \(\text{SSE} = 383\), so \(\text{MSE} = 383/9 = 42.56\)

  • \(S_{xx} = 2008\) (from our previous calculations)

Slope Inference:

Standard Error:

\(SE(b_1) = \sqrt{42.56/2008} = \sqrt{0.0212} = 0.146\)

95% Confidence Interval:

  • \(t_{0.025,9} = 2.262\)

  • \(-0.526 \pm 2.262 \times 0.146 = -0.526 \pm 0.330\)

  • Interval: \((-0.856, -0.196)\)

Interpretation: We are 95% confident that each additional year of patient age is associated with a decrease of between 0.196 and 0.856 mm Hg in the blood pressure reduction achieved by the treatment.

Hypothesis Test (\(H_0: \beta_1 = 0\) vs \(H_a: \beta_1 \neq 0\)):

  • \(t = \frac{-0.526 - 0}{0.146} = -3.61\)

  • p-value = \(2 \times P(t_9 > 3.61) = 0.0055\)

  • Conclusion: At \(\alpha = 0.05\), we reject \(H_0\) and conclude there is significant evidence that patient age affects the change in blood pressure after treatment.

Verification of F-test and t-test Equivalence:

  • \(t^2 = (-3.61)^2 = 13.03\)

  • Our F-statistic was 13.06 (small rounding differences)

  • Both tests give p-value ≈ 0.0055
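As a sketch, the hand calculations above can be reproduced in R from the reported summary statistics:

```r
# Reproduce the slope inference from the reported summary statistics
n    <- 11
MSE  <- 383 / (n - 2)      # 42.56
Sxx  <- 2008
b1   <- -0.526

SE_b1  <- sqrt(MSE / Sxx)                                        # about 0.146
t_stat <- (b1 - 0) / SE_b1                                       # about -3.61
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)   # about 0.0055

t_crit <- qt(0.025, df = n - 2, lower.tail = FALSE)              # 2.262
c(b1 - t_crit * SE_b1, b1 + t_crit * SE_b1)                      # about (-0.856, -0.196)
```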

13.3.11. Summary of Diagnostic and Inference Workflow

The complete regression analysis workflow involves these essential steps:

1. Exploratory Analysis

  • Create scatter plot to assess initial relationship

  • Determine which variable should be explanatory vs. response

  • Look for obvious outliers or non-linear patterns

2. Model Fitting

  • Fit least squares regression line

  • Calculate basic summary statistics and R-squared

3. Diagnostic Checking (Critical - do before inference!)

  • Create residual plots to check linearity and constant variance

  • Examine histograms and QQ plots of residuals for normality

  • Identify any influential points or assumption violations

4. Inference Procedures (only if assumptions are reasonable)

  • F-test for overall model utility

  • t-tests and confidence intervals for individual parameters

  • Interpret all results in context of the original problem

5. Model Use (if diagnostics and tests support the model)

  • Make predictions with appropriate uncertainty quantification

  • Draw scientific conclusions about the relationship

13.3.12. When Assumptions Are Violated

If diagnostic procedures reveal serious assumption violations:

Linearity Violations:

  • Consider transformations of variables (log, square root, etc.)

  • Fit polynomial or other non-linear models (beyond course scope)

  • Use piecewise or segmented regression for different regions

Constant Variance Violations:

  • Variable transformations may help stabilize variance

  • Weighted least squares methods (beyond course scope)

  • Robust regression techniques

Normality Violations:

  • Often less critical for large sample sizes (Central Limit Theorem)

  • Bootstrap methods for inference (beyond course scope)

  • Non-parametric alternatives

Independence Violations:

  • Time series methods for correlated observations

  • Mixed effects models for clustered data

  • These require specialized techniques beyond this course

Important Principle: When assumptions are seriously violated, our inference procedures may not be reliable. It’s better to address the violations or acknowledge limitations than to proceed with invalid analysis.

13.3.13. Building Toward Prediction and Advanced Topics

The diagnostic and inference tools developed in this chapter provide the foundation for the final components of regression analysis:

What’s Coming Next:

  • Prediction intervals: Using our fitted model to predict new observations with appropriate uncertainty

  • Confidence intervals for the mean response: Estimating expected values at specific X values

  • Model comparison and selection: Comparing different potential models

  • Introduction to multiple regression: Extending to multiple explanatory variables

The Bigger Picture: The workflow established here—visual exploration, model fitting, diagnostic checking, and formal inference—forms the backbone of all statistical modeling. Whether working with simple linear regression, multiple regression, or advanced modeling techniques, this systematic approach ensures reliable and interpretable results.

The combination of diagnostic tools and inference procedures gives us confidence in our conclusions while maintaining appropriate humility about the limitations of our models. This balance between statistical rigor and practical insight represents the essence of effective statistical analysis.

Key Takeaways 📝

  1. Assumption checking must precede inference: Diagnostic plots are essential for verifying that our model is appropriate before conducting hypothesis tests or constructing confidence intervals.

  2. Residual plots are more sensitive than scatter plots: They amplify patterns and make assumption violations easier to detect, particularly for constant variance and linearity.

  3. Four key diagnostics work together: Scatter plots, residual plots, histograms of residuals, and QQ plots provide complementary information about different aspects of model adequacy.

  4. The F-test assesses overall model utility: It tests whether the linear relationship explains a significant portion of the variability in the response variable.

  5. ANOVA decomposition provides intuitive understanding: SST = SSR + SSE shows how total variation splits into explained and unexplained components.

  6. Parameter inference follows familiar patterns: Confidence intervals and hypothesis tests for slope and intercept use the same principles as previous inference procedures, adapted for regression context.

  7. Standard errors incorporate both error variance and design: SE(b₁) = √(MSE/Sₓₓ) shows that precision depends on both residual variation and spread of X values.

  8. F-test and t-test are equivalent for simple regression: When testing β₁ = 0, both approaches give identical conclusions, but the F-test generalizes to multiple regression.

  9. Degrees of freedom reflect parameters estimated: df = n-2 accounts for estimating both slope and intercept from the data.

  10. R provides comprehensive output: The summary() function includes all essential inference results, while diagnostic plots require additional commands.

  11. Violations have consequences: Serious assumption violations can invalidate inference procedures, requiring alternative approaches or model modifications.

  12. Context drives interpretation: All statistical results must be interpreted in terms of the original research question and practical significance.

Exercises

  1. Diagnostic Interpretation: For each residual plot pattern described, identify the assumption violation and potential consequences:

    1. Residuals form a cone shape, spreading out as X increases

    2. Residuals show a clear curved (U-shaped) pattern

    3. Residuals appear randomly scattered around zero with constant spread

    4. Most residuals are near zero with a few extremely large positive and negative values

    5. Residuals show alternating positive and negative values in sequence

  2. ANOVA Table Completion: Given the following partial ANOVA table for a regression with n = 15, complete all missing values:

  3. Parameter Inference: A study of house prices yields the regression equation Price = 45,000 + 120 × Size, where Price is in dollars and Size is in square feet. With n = 20, MSE = 50,000,000, and Sₓₓ = 2500:

    1. Calculate the standard error of the slope

    2. Construct a 95% confidence interval for the slope

    3. Test H₀: β₁ = 100 vs Hₐ: β₁ ≠ 100 at α = 0.05

    4. Interpret the slope coefficient in context

  4. F-test vs t-test: Using the house price data from Exercise 3:

    1. Conduct the F-test for model utility

    2. Conduct the t-test for H₀: β₁ = 0 vs Hₐ: β₁ ≠ 0

    3. Verify that F = t² and explain why this relationship holds

    4. Discuss when you might prefer one test over the other

  5. Assumption Checking Protocol: Design a systematic approach for checking regression assumptions:

    1. List the specific plots you would create and in what order

    2. Describe what to look for in each plot

    3. Explain how you would decide whether violations are serious enough to invalidate analysis

    4. Suggest potential remedies for each type of violation

  6. Real Data Analysis: Collect data on two quantitative variables of interest (at least 15 observations):

    1. Create appropriate exploratory plots

    2. Fit a simple linear regression model

    3. Conduct complete diagnostic analysis

    4. Perform F-test and parameter inference

    5. Interpret all results in context

    6. Discuss any limitations or concerns

  7. R Implementation: Write R code to perform a complete regression analysis:

    1. Fit the model and extract basic output

    2. Create all necessary diagnostic plots

    3. Conduct F-test and t-tests manually (not using summary output)

    4. Calculate confidence intervals for the slope

    5. Compare your manual calculations to R’s built-in results

  8. Critical Evaluation: A researcher reports: “The regression has R² = 0.95, so the model is excellent and all assumptions are satisfied.”

    1. What’s wrong with this reasoning?

    2. What additional information would you need to evaluate the model?

    3. Describe how a high R² could coexist with serious assumption violations

    4. What would you recommend the researcher do?

  9. Design Considerations: Explain how each factor affects the precision of slope estimation:

    1. Increasing the sample size n

    2. Increasing the range of X values observed

    3. Reducing the error variance σ²

    4. Changing from X values clustered together to X values spread out

  10. Confidence Interval Interpretation: For each confidence interval interpretation, identify whether it’s correct or incorrect and explain:

    1. “There’s a 95% chance that the true slope lies in this interval”

    2. “95% of sample slopes will fall in this interval”

    3. “If we repeated this study many times, 95% of the intervals would contain the true slope”

    4. “We’re 95% confident about the slope value for this specific dataset”

  11. Hypothesis Testing Scenarios: For each research scenario, formulate appropriate hypotheses:

    1. Testing whether there’s any linear relationship between study hours and test scores

    2. Testing whether the slope of salary vs. experience is at least $2000 per year

    3. Testing whether the relationship between temperature and ice cream sales is negative

    4. Testing whether the effect of fertilizer on plant growth is exactly 5 cm per gram

  12. Comprehensive Case Study: A medical researcher studies the relationship between patient age (X) and recovery time in days (Y) for a surgical procedure. With n = 25 patients, the analysis yields:

    • Fitted model: Ŷ = 8.5 + 0.3X

    • SSR = 156, SSE = 234, SST = 390

    • Sₓₓ = 1200

    Conduct a complete analysis including:

    1. ANOVA table and R² calculation

    2. F-test for model utility

    3. 95% confidence interval for the slope

    4. Test whether the slope exceeds 0.25 days per year

    5. Practical interpretation of all results

    6. Discussion of what diagnostic plots you would need to see