Slides 📊
13.2. Simple Linear Regression
After identifying a linear association between two quantitative variables via a scatter plot, we proceed to mathematical modeling of the association. This involves formalizing the population model, developing methods to estimate the model parameters, and assessing the model’s fit to the data.
Road Map 🧭
Understand the implications of linear regression model assumptions.
Derive the least squares estimates of the slope and intercept parameters, and present the result as a fitted regression line.
Use the ANOVA table to decompose the sources of variability in the response into error and model components.
Use the coefficient of determination and the sample correlation as assisting numerical summaries.
13.2.1. The Simple Linear Regression Model
In this course, we assume that the \(X\) values are given (non-random) as \(x_1, x_2, \cdots, x_n\). For each explanatory value, the corresponding response \(Y_i\) is generated from the population model:
\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \cdots, n,\]
where \(\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\), and \(\beta_0\) and \(\beta_1\) are the intercept and slope of the true linear association.
Important Implications of the Model
1. The Mean Response
\(y=\beta_0 + \beta_1 x\) represents the mean response line, or the true relationship between the explanatory and response variables. That is,
\[E[Y|X=x] = E[\beta_0 + \beta_1 x + \varepsilon] = \beta_0 + \beta_1 x\]
for any given \(x\) value in an appropriate range. \(\beta_0\), \(\beta_1\), and \(x\) pass through the expectation operator unchanged since they are constants. The error term has expected value zero, so it disappears.
2. The Error Term
The error term \(\varepsilon\) represents the random variation in \(Y\) that is not explained by the mean response line. With non-random explanatory values, this is the only source of randomness in \(Y\). Namely,
\[\text{Var}(Y_i|X=x_i) = \text{Var}(\beta_0 + \beta_1 x_i + \varepsilon_i) = \text{Var}(\varepsilon_i) = \sigma^2.\]
Note that the result does not depend on the index \(i\) or the value of \(x_i\). In our model, we assume that the true spread of \(Y\) values around the regression line remains constant regardless of the \(X\) value.
3. The Conditional Distribution of \(Y_i\)
With the explanatory value fixed as \(x_i\), each \(Y_i\) is a linear function of a normal random variable \(\varepsilon_i\). This implies that \(Y_i|X=x_i\) is also normally distributed. Combining this result with the previous two observations, we can write the exact conditional distribution of the response as:
\[Y_i \,|\, X = x_i \sim N\!\left(\beta_0 + \beta_1 x_i,\; \sigma^2\right).\]
Fig. 13.8 Visualization of conditional distributions of \(Y_i\) given \(x_i\)
Summary of the Essential Assumptions
The following four assumptions are equivalent to the model assumptions discussed above and are required to ensure validity of subsequent analysis procedures.
Assumption 1: Independence
For each given value \(x_i\), \(Y_i\) is a size-1 simple random sample from the distribution of \(Y|X=x_i\). Further, each pair \((x_i, Y_i)\) is independent of all other pairs \((x_j, Y_j)\), \(j\neq i\).
Assumption 2: Linearity
The association between the explanatory variable and the response is linear on average.
Assumption 3: Normality
The errors are normally distributed:
\[\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n\]
Assumption 4: Equal Variance
The error terms have constant variance \(\sigma^2\) across all values of \(X\).
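To make these assumptions concrete, here is a minimal simulation sketch in Python/NumPy (not from the text; the parameter values \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\sigma = 1\) are arbitrary illustrative choices) that generates responses exactly as the model prescribes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Arbitrary illustrative values for the unknown population parameters
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Fixed (non-random) explanatory values x_1, ..., x_n
x = np.linspace(0, 10, 50)

# Assumptions 1, 3, 4: independent N(0, sigma^2) errors with a common variance
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)

# Assumption 2: responses scatter around the mean line beta0 + beta1 * x
y = beta0 + beta1 * x + eps

# Each Y_i | x_i is N(beta0 + beta1 * x_i, sigma^2), as visualized in Fig. 13.8
print(np.column_stack([x[:3], y[:3]]))
```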
13.2.2. Point Estimates of Slope and Intercept
We now develop the point estimates for the unknown intercept and slope parameters \(\beta_0\) and \(\beta_1\) using the least squares method. The estimators are chosen so that the resulting trend line yields the smallest possible overall squared distance from the observed response values.
Fig. 13.9 Distances between observed responses and the fitted line
The goal is mathematically equivalent to finding the arguments that minimize:
\[g(\beta_0, \beta_1) = \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i\right)^2.\]
To find their explicit forms, we take the partial derivative of Eq. (13.2) with respect to each parameter and set it equal to zero. Then we solve the resulting system of equations for \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
Derivation of the Intercept Estimate
Take the partial derivative of \(g(\beta_0, \beta_1)\) with respect to \(\beta_0\):
\[\frac{\partial g}{\partial \beta_0} = -2\sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i\right).\]
By setting this equal to zero and solving for \(\beta_0\), we obtain:
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.\]
Derivation of the Slope Estimate
Likewise, we take the partial derivative of \(g(\beta_0, \beta_1)\) with respect to \(\beta_1\):
\[\frac{\partial g}{\partial \beta_1} = -2\sum_{i=1}^n x_i\left(y_i - \beta_0 - \beta_1 x_i\right).\]
Substitute \(\hat{\beta}_0\) (Eq. (13.3)) for \(\beta_0\) and set equal to zero:
\[\sum_{i=1}^n x_i\left(y_i - \bar{y} + \hat{\beta}_1\bar{x} - \hat{\beta}_1 x_i\right) = 0.\]
Isolate \(\hat{\beta}_1\) to obtain:
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}.\]
Summary
Slope estimate:
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}\]
Intercept estimate:
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
Alternative Notation
The symbols \(b_0\) and \(b_1\) are used interchangeably with \(\hat{\beta}_0\) and \(\hat{\beta}_1\), respectively.
Alternative Expressions of the Slope Estimate
Several alternative expressions exist for the slope estimate \(\hat{\beta}_1\), each offering a different perspective on its relation to various components of regression analysis. Being able to transition freely between these forms is key to deepening the intuition for linear regression and enabling efficient computation.
Define the sum of cross products and the sum of squares of \(X\) as:
\(S_{XY} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\)
\(S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2\)
In addition, the sample covariance of \(X\) and \(Y\) is denoted \(s_{XY}\) (lower case \(s\)) and defined as:
\[s_{XY} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).\]
Also recall that \(s_X^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\) denotes the sample variance of \(x_1, \cdots, x_n\).
Then \(\hat{\beta}_1\) (13.4) can also be written as:
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{XY}}{S_{XX}} = \frac{s_{XY}}{s_X^2}.\]
Exercise: Prove equality between the alternative expressions
Hint: Showing the second and third equality of (13.5) is relatively simple. To show the first equality, begin with the numerators:
Repeat a similar procedure for the denominators.
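As a quick sanity check of these formulas, the sketch below (Python/NumPy; the `least_squares` function name and the toy data are ours, not from the text) computes the estimates from the centered sums and confirms that the sample covariance/variance form of the slope gives the same number.

```python
import numpy as np

def least_squares(x, y):
    """Least squares slope and intercept from the centered sums S_XY and S_XX."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))   # S_XY
    s_xx = np.sum((x - x.mean()) ** 2)               # S_XX
    b1 = s_xy / s_xx
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # toy data for illustration
y = [2.1, 3.9, 6.2, 8.1, 9.8]

b0, b1 = least_squares(x, y)

# Alternative form: sample covariance divided by the sample variance of x
b1_alt = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(b0, b1, np.isclose(b1, b1_alt))   # the two slope expressions agree
```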
13.2.3. The Fitted Regression Line
Once the slope and intercept estimates are obtained, we can use them to construct the fitted regression line:
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.\]
By plugging in specific \(x\) values into the equation, we obtain point estimates of the corresponding responses. The hat notation is essential, as it distinguishes these estimates from the observed responses, \(y\).
Properties of the Fitted Regression Line
1. Interpretation of the Slope
The slope \(\hat{\beta}_1\) represents the estimated average change in the response for every one-unit change in the explanatory variable. The sign of \(\hat{\beta}_1\) indicates the direction of the association.
2. Interpretation of the Intercept
The intercept \(\hat{\beta}_0\) represents the estimated average value of the response when the explanatory variable equals zero. However, this may not have significance if \(X = 0\) is outside the range of the data or not practically meaningful.
3. The Line Passes Through \((\bar{x}, \bar{y})\)
If we substitute \(x = \bar{x}\) into the fitted regression equation:
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x} = (\bar{y} - \hat{\beta}_1\bar{x}) + \hat{\beta}_1\bar{x} = \bar{y}.\]
That is, a fitted regression line always passes through the point \((\bar{x}, \bar{y})\).
4. Non-exchangeability
If we swap the explanatory and response variables and refit the model, the resulting fitted line will not be the same as the original, nor will the new slope be the reciprocal of the original.
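Properties 3 and 4 are easy to verify numerically. A short sketch (Python/NumPy with toy data; `np.polyfit` returns the slope and intercept of a degree-1 least squares fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data for illustration
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)   # fit of y on x: slope, then intercept
c1, c0 = np.polyfit(y, x, deg=1)   # roles swapped: fit of x on y

# Property 3: the fitted line passes through (x-bar, y-bar)
print(np.isclose(b0 + b1 * x.mean(), y.mean()))    # True

# Property 4: the swapped slope is not the reciprocal of the original slope
print(c1, 1.0 / b1)                                # generally different values
```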
Example 💡: Blood Pressure Study
A new treatment for high blood pressure is being assessed for feasibility. In an early trial, 11 subjects had their blood pressure measured before and after treatment. Researchers want to determine if there is a linear association between patient age and the change in systolic blood pressure after 24 hours. Using the data and the summary statistics below, compute and interpret the fitted regression line.
Variable Definitions:
Explanatory variable (X): Age of patient (years)
Response variable (Y): Change in blood pressure = (After treatment) - (Before treatment)
Data:
| \(i\) | Age (\(x_i\)) | ΔBP (\(y_i\)) |
|---|---|---|
| 1 | 70 | -28 |
| 2 | 51 | -10 |
| 3 | 65 | -8 |
| 4 | 70 | -15 |
| 5 | 48 | -8 |
| 6 | 70 | -10 |
| 7 | 45 | -12 |
| 8 | 48 | 3 |
| 9 | 35 | 1 |
| 10 | 48 | -5 |
| 11 | 30 | 8 |
Scatter Plot:
Fig. 13.10 Scatter plot of age vs blood pressure data
Summary Statistics:
\(n=11\)
\(\bar{x} = 52.7273\) years
\(\bar{y} = -7.6364\) mm Hg
\(\sum x_i y_i = -5485\)
\(\sum x_i^2 = 32588\)
Slope Calculation:
\[\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{-5485 - 11(52.7273)(-7.6364)}{32588 - 11(52.7273)^2} = \frac{-1055.8857}{2006.1502} \approx -0.5263\]
Intercept Calculation:
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = -7.6364 - (-0.5263)(52.7273) \approx 20.114\]
Fitted Regression Line:
\[\hat{y} = 20.114 - 0.5263\,x\]
Interpretation:
For each additional year of age, the change in blood pressure decreases by an average of 0.526 mm Hg.
The intercept estimate has no practical meaning since we do not study newborns for blood pressure treatment.
The negative slope suggests that older patients experience larger average decreases in blood pressure, i.e., they appear to benefit more from the treatment.
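The hand calculation can be checked in a few lines of Python/NumPy (a sketch; the arrays simply transcribe the data table above).

```python
import numpy as np

# Age and change in blood pressure from the data table
age = np.array([70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30], dtype=float)
dbp = np.array([-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8], dtype=float)

s_xy = np.sum((age - age.mean()) * (dbp - dbp.mean()))   # S_XY
s_xx = np.sum((age - age.mean()) ** 2)                   # S_XX

b1 = s_xy / s_xx
b0 = dbp.mean() - b1 * age.mean()

# Approximately -0.5263 and 20.11, matching the hand calculation up to rounding
print(round(b1, 4), round(b0, 3))
```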
13.2.4. The ANOVA Table for Regression
Just as in ANOVA, we can decompose the total variability in the response variable into two components, one arising from the model structure and the other from random error.
Total Variability: SST
Fig. 13.11 SST is the sum of the squared lengths of all the red dotted lines. They measure the distances of the response values from an overall mean.
The total sum of squares is defined as
\[\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2.\]
SST measures how much the response values deviate from their overall mean, ignoring the explanatory variable. SST can also be denoted as \(S_{YY}\). The degrees of freedom associated with \(\text{SST}\) is \(df_T = n-1\).
Computational Shortcut for SST
While Eq. (13.6) conveys its intuitive meaning, it is often more convenient to use the following equivalent formula for computation:
\[\text{SST} = \sum_{i=1}^n y_i^2 - n\bar{y}^2.\]
Model Variability: SSR and MSR
Fig. 13.12 SSR is the sum of the squared lengths of all the red dotted lines. They measure the distances of the predicted \(\hat{y}\) values from an overall mean.
The variability in \(Y\) arising from its linear association with \(X\) is measured with the regression sum of squares, or SSR:
\[\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.\]
SSR expresses how much the fitted values deviate from the overall mean. Its associated degrees of freedom is denoted \(df_R\), and it is equal to the number of explanatory variables. In simple linear regression, \(df_R\) is always equal to 1. Therefore,
\[\text{MSR} = \frac{\text{SSR}}{df_R} = \frac{\text{SSR}}{1} = \text{SSR}.\]
Computational Shortcut for SSR
Using the expanded formula of \(\hat{y}_i\) and \(\hat{\beta}_0\),
\[\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^n \left(\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y}\right)^2 = \sum_{i=1}^n \hat{\beta}_1^2 (x_i - \bar{x})^2 = \hat{\beta}_1 S_{XY}.\]
The final equality uses the definition of \(S_{XX}\) and the fact that \(\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}\). The resulting equation
\[\text{SSR} = \hat{\beta}_1 S_{XY}\]
is convenient for computation if \(\hat{\beta}_1\) and \(S_{XY}\) are available.
Variability Due to Random Error: SSE and MSE
Fig. 13.13 SSE is the sum of the squared lengths of all the red dotted lines. They measure the distances between the observed and predicted response values.
Finally, the variability in \(Y\) arising from random error is measured with the error sum of squares, or SSE:
\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2.\]
The Residuals
Each difference \(e_i = y_i - \hat{y}_i\) is called the \(i\)-th residual. Residuals serve as proxies for the unobserved true error terms \(\varepsilon_i\).
Degrees of Freedom and MSE
The degrees of freedom associated with SSE is \(df_E = n-2\) because its formula involves two estimated parameters (\(b_0\) and \(b_1\)). Readjusting the scale by the degrees of freedom,
\[\text{MSE} = \frac{\text{SSE}}{df_E} = \frac{\text{SSE}}{n-2}.\]
MSE as Variance Estimator
\(\text{MSE}\) is a mean of squared distances of \(y_i\) from their estimated means, \(\hat{y}_i\). Therefore, we use it as an estimator of \(\sigma^2\):
\[\hat{\sigma}^2 = s^2 = \text{MSE}.\]
The Fundamental Identity
Just like in ANOVA, the total sum of squares and the total degrees of freedom decompose into their respective model and error components:
\[\text{SST} = \text{SSR} + \text{SSE}, \qquad df_T = df_R + df_E, \;\text{ i.e., } \; n-1 = 1 + (n-2).\]
This can be proven algebraically by adding and subtracting \(\hat{y}_i\) in the expression for SST and using properties of least squares. You are encouraged to show this as an independent exercise.
Partial ANOVA Table for Simple Linear Regression
| Source | df | Sum of Squares | Mean Square | F-statistic | p-value |
|---|---|---|---|---|---|
| Regression | 1 | \(\sum_{i=1}^n (\hat{y}_i - \bar{y})^2\) | \(\frac{\text{SSR}}{1}\) | ? | ? |
| Error | \(n-2\) | \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\) | \(\frac{\text{SSE}}{n-2}\) | | |
| Total | \(n-1\) | \(\sum_{i=1}^n (y_i - \bar{y})^2\) | | | |
We will discuss how to compute and use the \(F\)-statistic and \(p\)-value in the upcoming lesson.
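The table entries are mechanical to compute once the fitted values are available. Below is a minimal helper sketch in Python/NumPy (the function name and the dictionary layout are our own choices, not a standard API); applying it to any \((x, y)\) data reproduces the \(\text{SST} = \text{SSR} + \text{SSE}\) identity up to floating-point error.

```python
import numpy as np

def regression_anova(x, y):
    """Sums of squares, degrees of freedom, and mean squares for simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size

    # Least squares fit
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    sst = np.sum((y - y.mean()) ** 2)       # total variability
    ssr = np.sum((y_hat - y.mean()) ** 2)   # variability explained by the model
    sse = np.sum((y - y_hat) ** 2)          # variability due to random error

    return {
        "df": {"Regression": 1, "Error": n - 2, "Total": n - 1},
        "SS": {"Regression": ssr, "Error": sse, "Total": sst},
        "MS": {"Regression": ssr / 1, "Error": sse / (n - 2)},
    }
```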
Example 💡: Blood Pressure Study, Continued
The summary statistics and the model parameter estimates computed in the previous example for the blood pressure study are listed below:
\(n=11\)
\(\bar{x} = 52.7273\) years
\(\bar{y} = -7.6364\) mm Hg
\(\sum x_i y_i = -5485\)
\(\sum x_i^2 = 32588\)
\(\sum y_i^2 = 1580\)
\(S_{XY} = -1055.8857\)
\(S_{XX}=2006.1502\)
\(b_0 = 20.114\)
\(b_1 = -0.5263\)
Q1: Predict the change in blood pressure for a patient who is 65 years old. Compute the corresponding residual.
The predicted change is
\[\hat{y} = 20.114 - 0.5263(65) \approx -14.10 \text{ mm Hg}.\]
The observed response value for \(x=65\) is \(y=-8\) from the data table. Therefore, the residual is
\[e = y - \hat{y} = -8 - (-14.10) = 6.10 \text{ mm Hg}.\]
Q2: Complete the ANOVA table for the linear regression model between age and change in blood pressure.
For each sum of squares, we will use the computational shortcut rather than the definition:
\[\text{SST} = \sum y_i^2 - n\bar{y}^2 = 1580 - 11(-7.6364)^2 \approx 938.5393\]
\[\text{SSR} = b_1 S_{XY} = (-0.5263)(-1055.8857) \approx 555.7126\]
Finally, using the decomposition identity of the sums of squares,
\[\text{SSE} = \text{SST} - \text{SSR} = 938.5393 - 555.7126 = 382.8267.\]
The mean squares are:
\[\text{MSR} = \frac{\text{SSR}}{1} = 555.7126, \qquad \text{MSE} = \frac{\text{SSE}}{n-2} = \frac{382.8267}{9} \approx 42.5363.\]
The results are organized into a table below:
| Source | df | Sum of Squares | Mean Square | F-statistic | p-value |
|---|---|---|---|---|---|
| Regression | 1 | \(555.7126\) | \(555.7126\) | ? | ? |
| Error | \(9\) | \(382.8267\) | \(42.5363\) | | |
| Total | \(10\) | \(938.5393\) | | | |
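As a check, the sketch below recomputes the table entries for the blood pressure data using the computational shortcuts (Python/NumPy; small differences from the hand-rounded values above are expected).

```python
import numpy as np

age = np.array([70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30], dtype=float)
dbp = np.array([-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8], dtype=float)
n = age.size

s_xy = np.sum(age * dbp) - n * age.mean() * dbp.mean()    # S_XY
s_xx = np.sum(age ** 2) - n * age.mean() ** 2             # S_XX
b1 = s_xy / s_xx

sst = np.sum(dbp ** 2) - n * dbp.mean() ** 2   # shortcut for SST
ssr = b1 * s_xy                                # shortcut for SSR
sse = sst - ssr                                # decomposition identity

# Roughly 938.5, 555.8, 382.8, and 42.5: the table values up to rounding
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(sse / (n - 2), 2))
```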
13.2.5. The Coefficient of Determination (\(R^2\))
The coefficient of determination, denoted \(R^2\), provides a single numerical measure of how well our regression line fits the data. It is defined as:
\[R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.\]
Due to the fundamental identity \(\text{SST} = \text{SSR} + \text{SSE}\) and the fact that each component is non-negative, \(R^2\) is always between 0 and 1.
It is interpreted as the fraction of the response variability that is explained by its linear association with the explanatory variable.
\(R^2\) approaches 1 when:
\(\text{SSR}\) approaches \(\text{SST}\).
The residuals \(y_i - \hat{y}_i\) have small magnitudes.
The regression line captures most of the variability in the response.
Points gather tightly around the fitted line.
\(R^2\) approaches 0 when:
\(\text{SSE}\) approaches \(\text{SST}\).
The residuals have large magnitudes.
The explanatory variable provides little information about the response.
Points scatter widely around the fitted line.
Example 💡: Blood Pressure Study, Continued
Compute and interpret the coefficient of determination from the blood pressure study.
\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{555.7126}{938.5393} \approx 0.5921\]
Interpretation: Approximately 59.21% of the variation in blood pressure change is explained by the linear relationship with patient age.
Important Limitations of \(R^2\)
While \(R^2\) is a useful summary measure, it has important limitations that require careful interpretation. These limitations are evident in the scatter plots of the well-known Anscombe's quartet (Fig. 13.14). The figure presents four bivariate data sets that have distinct patterns yet are identical in their \(R^2\) values:
Fig. 13.14 Anscombe's Quartet
Let us point out some important observations from Fig. 13.14:
A high \(R^2\) value does not guarantee linear association.
\(R^2\) is not robust to outliers. Outliers can dramatically reduce or inflate \(R^2\).
A high \(R^2\) does not automatically guarantee good prediction performance.
Even when the true form is not linear, we might obtain a high \(R^2\) if the sample size is small and the points happen to align well with a linear pattern.
The absolute scale of the observed errors also matters. Consider a scenario where \(\text{SSR} = 9 \times 10^6\) and \(\text{SST} = 10^7\), giving \(R^2 = 0.90\). While 90% of the total variation is explained, \(\text{SSE} = 10^6\) indicates substantial absolute errors that might make predictions unreliable for practical purposes.
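The identical-\(R^2\) claim is easy to verify. A small sketch, assuming the seaborn package is installed (it ships Anscombe's quartet as a bundled sample dataset):

```python
import numpy as np
import seaborn as sns

anscombe = sns.load_dataset("anscombe")   # columns: dataset, x, y

for name, grp in anscombe.groupby("dataset"):
    x, y = grp["x"].to_numpy(), grp["y"].to_numpy()
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    y_hat = y.mean() + b1 * (x - x.mean())                               # fitted values
    r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)   # SSR / SST
    print(f"Dataset {name}: R^2 = {r2:.2f}")

# All four data sets print essentially the same R^2 (about 0.67),
# despite the very different patterns shown in Fig. 13.14.
```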
Best Practices for Using the Coefficient of Determination
Always examine scatter plots to verify model assumptions and check for outliers.
Use \(R^2\) as one component of model assessment, not the sole criterion. Consider factors such as sample size, scale, and practical significance.
13.2.6. Sample Pearson Correlation Coefficient
Another numerical measure that proves useful in simple linear regression is the sample Pearson correlation coefficient. This provides a standardized quantification of the direction and strength of a linear association. It is defined as:
As suggested by the structural similarity, \(r\) is the sample equivalent of the population correlation introduced in the probability chapters:
\[\rho = \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.\]
Fig. 13.15 Scatter plots with different sample correlation values
Sample correlation is unitless and always falls between -1 and +1, with the sign indicating the direction and the magnitude indicating the strength of linear association. The bounded range makes it easy to compare correlations across different data sets.
We classify linear associations as strong, moderate or weak using the following rule of thumb:
Strong association: \(0.8 <|r| \leq 1\)
Moderate association: \(0.5 <|r| \leq 0.8\)
Weak association: \(0 \leq |r| \leq 0.5\)
Alternative Expressions of Sample Correlation
1. Using Sample Covariance and Sample Standard Deviations
Using the sample covariance \(s_{XY} = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\) and the sample standard deviations \(s_X\) and \(s_Y\), we can rewrite the definition of sample correlation (13.7) as:
\[r = \frac{s_{XY}}{s_X\, s_Y}.\]
This definition gives us an insight into what \(r\) measures. Sample covariance is an unscaled measure of how \(X\) and \(Y\) move together.
If the two variables typically take values above or below their respective means simultaneously, most terms in the summation will be positive, increasing the chance that the mean of these terms, the sample covariance, is also positive.
If the values tend to lie on opposite sides of their respective means, the sample covariance is likely negative.
Sample correlation is obtained by removing variable-specific scales from sample covariance, dividing it by the sample standard deviations of both \(X\) and \(Y\).
2. Using Sums of Cross-products and Squares
The definition of sample correlation can also be expressed using the sums of cross-products and squares, \(S_{XY}, S_{XX},\) and \(S_{YY}\):
\[r = \frac{S_{XY}}{\sqrt{S_{XX}\, S_{YY}}}.\]
3. Cross-products of Standardized Data
Modifying the equation (13.8),
\[r = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_i - \bar{y}}{s_Y}\right).\]
This gives us another perspective into what the correlation coefficient is really doing—it is a mean of cross-products of the standardized values of the explanatory and response variables.
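A quick numerical check (Python/NumPy with toy data, not from the text) confirms that the three expressions produce the same value of \(r\).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data for illustration
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

# (1) sample covariance over the product of sample standard deviations
r1 = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# (2) sums of cross-products and squares
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
r2 = s_xy / np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# (3) cross-products of standardized values, divided by n - 1
zx = (x - x.mean()) / np.std(x, ddof=1)
zy = (y - y.mean()) / np.std(y, ddof=1)
r3 = np.sum(zx * zy) / (n - 1)

print(np.isclose(r1, r2), np.isclose(r2, r3))   # all three forms agree
```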
Connections with Other Components of Linear Regression
1. Slope Estimate, \(b_1\)
Both the correlation coefficient and the slope estimate capture information about a linear relationship. How exactly are they connected?
Recall that one way to write the slope estimate is:
\[\hat{\beta}_1 = \frac{s_{XY}}{s_X^2}.\]
Using the correlation formula \(r = \frac{s_{XY}}{s_X s_Y}\), the slope estimate can be written as:
\[\hat{\beta}_1 = \frac{s_{XY}}{s_X^2} = \frac{r\, s_X s_Y}{s_X^2} = r\,\frac{s_Y}{s_X}.\]
This relationship shows that the slope estimate is simply the correlation coefficient rescaled by the ratio of standard deviations.
2. Coefficient of Determination, \(R^2\)
Recall that:
\(b_1 = \frac{S_{XY}}{S_{XX}}\)
\(\text{SSR} = b_1 S_{XY}\)
\(\text{SST} = S_{YY}\)
Combining these together,
\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\hat{\beta}_1 S_{XY}}{S_{YY}} = \frac{S_{XY}^2}{S_{XX}\, S_{YY}} = r^2.\]
In other words, the coefficient of determination is equal to the square of the correlation coefficient. However, this special relationship holds only for simple linear regression, where the model has a single explanatory variable.
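Both connections can be verified numerically with a short sketch (Python/NumPy with toy data, not from the text).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data for illustration
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Sample correlation and coefficient of determination
r = np.corrcoef(x, y)[0, 1]
r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)   # SSR / SST

print(np.isclose(b1, r * np.std(y, ddof=1) / np.std(x, ddof=1)))   # b1 = r * sY / sX
print(np.isclose(r_squared, r ** 2))                                # R^2 = r^2
```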
Limitations of Correlation
Correlation specifically measures the direction and strength of linear association. Therefore, \(r = 0\) indicates an absence of linear association, not necessarily an absence of association altogether. Symmetrical non-linear relationships tend to produce sample correlation close to zero, as any linear trend on one side offsets the trend on the other.
Fig. 13.16 Scatter plots with zero sample correlation
Furthermore, extreme observations can significantly impact correlation values. These limitations mean we need to examine our data visually before trusting correlation values, just as we do with \(R^2\).
13.2.7. Bringing It All Together
Key Takeaways 📝
The simple linear regression model \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) and its associated assumptions formalize the linear relationship between quantitative variables.
The least squares method provides optimal estimates for the slope and intercept parameters by minimizing the sum of squared differences between the observed data and the summary line.
The variability components of linear regression can also be organized into an ANOVA table.
Residuals \(e_i = y_i - \hat{y}_i\) estimate the unobserved error terms and enable variance estimation through \(s^2 = \text{MSE} = \text{SSE}/(n-2)\).
The coefficient of determination \(R^2 = \text{SSR}/\text{SST}\) measures the proportion of total variation in observed \(Y\) that is explained by the regression model.
The sample Pearson correlation coefficient \(r\) provides the direction and strength of a linear relationship on a standard numerical scale between -1 and 1.
Model assessment requires multiple tools: scatter plots, residual analysis, ANOVA table, \(R^2\), and \(r\) all work together to evaluate model appropriateness.
Exercises
Least Squares Calculation: Given the following data on house size (X, in thousands of sq ft) and price (Y, in thousands of dollars):
| House | Size (X) | Price (Y) |
|---|---|---|
| 1 | 1.2 | 180 |
| 2 | 1.8 | 220 |
| 3 | 2.1 | 280 |
| 4 | 1.5 | 200 |
| 5 | 2.4 | 320 |
Calculate \(\bar{x}\), \(\bar{y}\), \(\sum x_i y_i\), and \(\sum x_i^2\).
Find the least squares estimates \(b_0\) and \(b_1\).
Write the fitted regression equation.
Interpret the slope and intercept in context.
Verify that the line passes through \((\bar{x}, \bar{y})\).
ANOVA Table Construction: Using the house price data from the previous exercise:
Calculate SST, SSR, and SSE.
Complete the ANOVA table with degrees of freedom and mean squares.
Calculate \(R^2\) and interpret its meaning.
Estimate the standard deviation \(\sigma\).
Residual Analysis: For the house price regression model,
Calculate the fitted value and residual for House 3.
What does this residual tell you about the model’s performance for this house?
If all residuals were positive in a subrange of the \(X\) values, what would this suggest about the model?
Critical Thinking: A news article claims: “Study finds strong relationship between ice cream sales and drowning deaths (R² = 0.89). Ice cream consumption causes drowning!”
What statistical concept is being misunderstood in this claim?
Explain what the high \(R^2\) actually tells us.
What might be a more plausible explanation for this relationship?
Advanced Calculation: Using the computational formula \(\text{SSR} = b_1 S_{XY}\):
Explain why \(b_1\) and \(S_{XY}\) always have the same sign.
Under what conditions would \(\text{SSR} = 0\)?