13.2. Simple Linear Regression

After identifying a linear association between two quantitative variables via a scatter plot, we proceed to mathematical modeling of the association. This involves formalizing the population model, developing methods to estimate the model parameters, and assessing the model’s fit to the data.

Road Map 🧭

  • Understand the implications of linear regression model assumptions.

  • Derive the least squares estimates of the slope and intercept parameters, and present the result as a fitted regression line.

  • Use the ANOVA table to decompose the sources of variability in the response into error and model components.

  • Use the coefficient of determination and the sample correlation as supporting numerical summaries.

13.2.1. The Simple Linear Regression Model

In this course, we assume that the \(X\) values are given (non-random) as \(x_1, x_2, \cdots, x_n\). For each explanatory value, the corresponding response \(Y_i\) is generated from the population model:

(13.1)\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,\]

where \(\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\), and \(\beta_0\) and \(\beta_1\) are the intercept and slope of the true linear association.

Important Implications of the Model

1. The Mean Response

\(y=\beta_0 + \beta_1 x\) represents the mean response line, or the true relationship between the explanatory and response variables. That is,

\[\begin{split}E[Y|X=x] &= E[\beta_0 + \beta_1 x + \varepsilon] \\ &= \beta_0 + \beta_1 x + E[\varepsilon]\\ &= \beta_0 + \beta_1 x,\end{split}\]

for any given \(x\) values in an appropriate range. \(\beta_0\), \(\beta_1\), and \(x\) pass through the expectation operator unchanged since they are constants. The error term has expected value zero, so it disappears.

2. The Error Term

The error term \(\varepsilon\) represents the random variation in \(Y\) that is not explained by the mean response line. With non-random explanatory values, this is the only source of randomness in \(Y\). Namely,

\[\text{Var}(Y|X=x) = \text{Var}(\beta_0 + \beta_1 x + \varepsilon) = \text{Var}(\varepsilon) = \sigma^2.\]

Note that the result does not depend on the index \(i\) or the value of \(x_i\). In our model, we assume that the true spread of \(Y\) values around the regression line remains constant regardless of the \(X\) value.

3. The Conditional distribution of \(Y_i\)

With the explanatory value fixed as \(x_i\), each \(Y_i\) is a linear function of a normal random variable \(\varepsilon_i\). This implies that \(Y_i|X=x_i\) is also normally distributed. Combining this result with the previous two observations, we can write the exact conditional distribution of the response as:

\[Y_i|X=x_i \sim N(\beta_0 + \beta_1 x_i, \quad \sigma^2)\]

Fig. 13.8 Visualization of conditional distributions of \(Y_i\) given \(x_i\)
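
To make these implications concrete, the short sketch below (Python with NumPy, not part of the original text) simulates responses from the model for a handful of fixed \(x\) values, using hypothetical parameter values \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\sigma = 1\). Each simulated \(Y_i\) is one draw from \(N(\beta_0 + \beta_1 x_i, \sigma^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters (chosen for illustration only)
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Fixed (non-random) explanatory values x_1, ..., x_n
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean response line: E[Y | X = x_i] = beta0 + beta1 * x_i
mean_response = beta0 + beta1 * x

# Each Y_i adds an independent N(0, sigma^2) error to its mean
y = mean_response + rng.normal(loc=0.0, scale=sigma, size=x.size)

print(mean_response)  # conditional means
print(y)              # one simulated set of responses
```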

Summary of the Essential Assumptions

The following four assumptions are equivalent to the model assumptions discussed above and are required to ensure validity of subsequent analysis procedures.

  • Assumption 1: Independence

    For each given value \(x_i\), \(Y_i\) is a size-1 simple random sample from the distribution of \(Y|X=x_i\). Further, the pairs \((x_i, Y_i)\) are independent from all other pairs \((x_j, Y_j)\), \(j\neq i\).

  • Assumption 2: Linearity

    The association between the explanatory variable and the response is linear on average.

  • Assumption 3: Normality

    The errors are normally distributed:

    \[\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n\]
  • Assumption 4: Equal Variance

    The error terms have constant variance \(\sigma^2\) across all values of \(X\).

13.2.2. Point Estimates of Slope and Intercept

We now develop the point estimates for the unknown intercept and slope parameters \(\beta_0\) and \(\beta_1\) using the least squares method. The estimators are chosen so that the resulting trend line yields the smallest possible overall squared distance from the observed response values.


Fig. 13.9 Distances between observed responses and the fitted line

The goal is mathematically equivalent to finding the arguments that minimize:

(13.2)\[g(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2\]

To find their explicit forms, we take the partial derivative of Eq. (13.2) with respect to each parameter and set it equal to zero. Then we solve the resulting system of equations for \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Derivation of the Intercept Estimate

Take the partial derivative of \(g(\beta_0, \beta_1)\) with respect to \(\beta_0\):

\[\begin{split}\frac{\partial g}{\partial \beta_0} &= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\\ &= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)\\ &= -2\left(\sum_{i=1}^n y_i - n\beta_0 - \beta_1 \sum_{i=1}^n x_i\right)\end{split}\]

By setting this equal to zero and solving for \(\beta_0\), we obtain:

(13.3)\[\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n y_i - \beta_1 \frac{1}{n}\sum_{i=1}^n x_i = \bar{y} - \beta_1 \bar{x}.\]

Derivation of the Slope Estimate

Likewise, we take the partial derivative of \(g(\beta_0, \beta_1)\) with respect to \(\beta_1\):

\[\begin{split}\frac{\partial g}{\partial \beta_1} &= -2 \sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i)\\ &= -2 \left(\sum_{i=1}^n x_i y_i - \beta_0 \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2 \right)\end{split}\]

Substitute \(\hat{\beta}_0\) (Eq. (13.3)) for \(\beta_0\) and set equal to zero:

\[\begin{split}0 &= \sum_{i=1}^n x_i y_i - (\bar{y}-\beta_1 \bar{x})\sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2\\ &= \sum_{i=1}^n x_i y_i - (\bar{y}-\beta_1 \bar{x})n\bar{x} - \beta_1 \sum_{i=1}^n x_i^2\\ &= \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} - \beta_1 (\sum_{i=1}^n x_i^2 - n \bar{x}^2)\end{split}\]

Isolate \(\beta_1\) to obtain:

(13.4)\[\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}\]

Summary

Slope estimate:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}\]

Intercept estimate:

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

Alternative Notation

The symbols \(b_0\) and \(b_1\) are used interchangeably with \(\hat{\beta}_0\) and \(\hat{\beta}_1\), respectively.
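
As a computational companion to the summary formulas above, here is a minimal Python/NumPy sketch (the function name `least_squares_fit` and the toy data are ours, not from the text) that applies the two closed-form formulas directly to paired data.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) from the closed-form least squares formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    xbar, ybar = x.mean(), y.mean()
    b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Small made-up example: y is roughly 1 + 2x plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
print(least_squares_fit(x, y))  # approximately (1.05, 1.99)
```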

Alternative Expressions of the Slope Estimate

Several alternative expressions exist for the slope estimate \(\hat{\beta}_1\), each offering a different perspective on its relation to various components of regression analysis. Being able to transition freely between these forms is key to deepening the intuition for linear regression and enabling efficient computation.

Define the sum of cross products and the sum of squares of \(X\) as:

  • \(S_{XY} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\)

  • \(S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2\)

In addition, the sample covariance of \(X\) and \(Y\) is denoted \(s_{XY}\) (lower case \(s\)) and defined as:

\[s_{XY} = \frac{ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1}\]

Also recall that \(s_X^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\) denotes the sample variance of \(x_1, \cdots, x_n\).

Then \(\hat{\beta}_1\) from Eq. (13.4) can also be written as:

(13.5)\[\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{s_{XY}}{s^2_X} = \frac{\sum_{i=1}^n (x_i -\bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

Exercise: Prove equality between the alternative expressions

Hint: Showing the second and third equality of (13.5) is relatively simple. To show the first equality, begin with the numerators:

\[\begin{split}\sum_{i=1}^n (x_i -\bar{x})(y_i - \bar{y}) &= \sum_{i=1}^n x_i y_i -\bar{x}\sum_{i=1}^n y_i -\bar{y}\sum_{i=1}^n x_i + n\bar{x}\bar{y}\\ &= \sum_{i=1}^n x_i y_i -\bar{x}(n\bar{y}) -\bar{y}(n\bar{x}) + n\bar{x}\bar{y}\\ &= \sum_{i=1}^n x_i y_i -n\bar{x}\bar{y} \quad ✅\end{split}\]

Repeat a similar procedure for the denominators.
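
A quick numerical check of Eq. (13.5) can complement the algebraic proof. The sketch below (Python/NumPy, with synthetic data generated only for this check) evaluates all three expressions of the slope and confirms they agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 3.0 - 0.7 * x + rng.normal(scale=2.0, size=30)  # synthetic pairs

n = x.size
xbar, ybar = x.mean(), y.mean()

# Eq. (13.4): raw sums of products and squares
b1_a = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)

# S_XY / S_XX: centered sums
b1_b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)

# s_XY / s_X^2: sample covariance over sample variance
b1_c = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(np.allclose([b1_a, b1_b], b1_c))  # True
```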

13.2.3. The Fitted Regression Line

Once the slope and intercept estimates are obtained, we can use them to construct the fitted regression line:

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.\]

By plugging specific \(x\) values into the equation, we obtain point estimates of the corresponding responses. The hat notation is essential, as it distinguishes these estimates from the observed responses, \(y\).

Properties of the Fitted Regression Line

1. Interpretation of the Slope

The slope \(\hat{\beta}_1\) represents the estimated average change in the response for every one-unit change in the explanatory variable. The sign of \(\hat{\beta}_1\) indicates the direction of the association.

2. Interpretation of the Intercept

The intercept \(\hat{\beta}_0\) represents the estimated average value of the response when the explanatory variable equals zero. However, this may not have significance if \(X = 0\) is outside the range of the data or not practically meaningful.

3. The Line Passes Through \((\bar{x}, \bar{y})\)

If we substitute \(x = \bar{x}\) into the fitted regression equation:

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x} = (\bar{y} - \hat{\beta}_1\bar{x}) + \hat{\beta}_1\bar{x} = \bar{y}.\]

That is, a fitted regression line always passes through the point \((\bar{x}, \bar{y})\).

4. Non-exchangeability

If we swap the explanatory and response variables and refit the model, the resulting fitted line will not be the same as the original, nor will the new slope be the reciprocal of the original slope.

Example 💡: Blood Pressure Study

A new treatment for high blood pressure is being assessed for feasibility. In an early trial, 11 subjects had their blood pressure measured before and after treatment. Researchers want to determine if there is a linear association between patient age and the change in systolic blood pressure after 24 hours. Using the data and the summary statistics below, compute and interpret the fitted regression line.

Variable Definitions:

  • Explanatory variable (X): Age of patient (years)

  • Response variable (Y): Change in blood pressure = (After treatment) - (Before treatment)

Data:

\(i\) | Age (\(x_i\)) | ΔBP (\(y_i\))
------|---------------|--------------
1     | 70            | -28
2     | 51            | -10
3     | 65            | -8
4     | 70            | -15
5     | 48            | -8
6     | 70            | -10
7     | 45            | -12
8     | 48            | 3
9     | 35            | 1
10    | 48            | -5
11    | 30            | 8

Scatter Plot:


Fig. 13.10 Scatter plot of age vs blood pressure data

Summary Statistics:

  • \(n=11\)

  • \(\bar{x} = 52.7273\) years

  • \(\bar{y} = -7.6364\) mm Hg

  • \(\sum x_i y_i = -5485\)

  • \(\sum x_i^2 = 32588\)

Slope Calculation:

\[\begin{split}S_{XY} &= \sum x_i y_i - n\bar{x}\bar{y} \\ &= -5485 - 11(52.7273)(-7.6364) \\ &= -1055.8857\\ &\quad\\ S_{XX} &= \sum x_i^2 - n\bar{x}^2\\ &= 32588 - 11(52.7273)^2 \\ &= 2006.1502\end{split}\]
\[b_1 = \frac{S_{XY}}{S_{XX}} = \frac{-1055.8857}{2006.1502} = -0.5263\]

Intercept Calculation:

\[b_0 = \bar{y} - b_1\bar{x} = -7.6364 - (-0.5263)(52.7273) = 20.114\]

Fitted Regression Line:

\[\hat{y} = 20.114 - 0.526x\]

Interpretation:

  • For each additional year of age, the change in blood pressure decreases by an average of 0.526 mm Hg.

  • The intercept estimate has no practical meaning, since age zero lies far outside the observed range of the data (ages 30 to 70 years); we do not study newborns for blood pressure treatment.

  • The negative slope suggests that older patients experience larger average reductions in blood pressure, that is, they benefit more from the treatment.
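
The hand calculation above can be reproduced in a few lines of Python/NumPy using the data from the table. This is only a check of the arithmetic, not part of the original example.

```python
import numpy as np

# Blood pressure data from the table above
age = np.array([70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30], dtype=float)
dbp = np.array([-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8], dtype=float)

n = age.size
xbar, ybar = age.mean(), dbp.mean()

Sxy = np.sum(age * dbp) - n * xbar * ybar   # ≈ -1055.9
Sxx = np.sum(age**2) - n * xbar**2          # ≈ 2006.2

b1 = Sxy / Sxx              # ≈ -0.5263
b0 = ybar - b1 * xbar       # ≈ 20.11

print(f"fitted line: yhat = {b0:.3f} + ({b1:.4f}) * age")
```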


13.2.4. The ANOVA Table for Regression

Just as in ANOVA, we can decompose the total variability in the response variable into two components, one arising from the model structure and the other from random error.

Total Variability: SST


Fig. 13.11 SST is the sum of the squared lengths of all the red dotted lines. They measure the distances of the response values from an overall mean.

(13.6)\[\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2\]

SST measures how much the response values deviate from their overall mean, ignoring the explanatory variable. SST can also be denoted as \(S_{YY}\). The degrees of freedom associated with \(\text{SST}\) is \(df_T = n-1\).

Computational Shortcut for SST

While Eq. (13.6) conveys its intuitive meaning, it is often more convenient to use the following equivalent formula for computation:

\[\text{SST} = \sum_{i=1}^n y_i^2 - n\bar{y}^2\]

Model Variability: SSR and MSR


Fig. 13.12 SSR is the sum of the squared lengths of all the red dotted lines. They measure the distances of the predicted \(\hat{y}\) values from an overall mean.

The variability in \(Y\) arising from its linear association with \(X\) is measured with the regression sum of squares, or SSR:

\[\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.\]

SSR expresses how much the fitted values deviate from the overall mean. Its associated degrees of freedom is denoted \(df_R\), and it is equal to the number of explanatory variables. In simple linear regression, \(df_R\) is always equal to 1. Therefore,

\[\text{MSR} = \frac{\text{SSR}}{df_R} = \text{SSR}.\]

Computational Shortcut for SSR

Using the expanded formulas for \(\hat{y}_i\) and \(\hat{\beta}_0\),

\[\begin{split}\text{SSR} &= \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1x_i - \bar{y})^2\\ &= \sum_{i=1}^n ((\bar{y}-\hat{\beta}_1\bar{x}) + \hat{\beta}_1x_i - \bar{y})^2\\ &= \hat{\beta}_1^2\sum_{i=1}^n (x_i -\bar{x})^2\\ &= \hat{\beta}_1\frac{S_{XY}}{S_{XX}} S_{XX} = \hat{\beta}_1 S_{XY}\end{split}\]

The final equality uses the definition of \(S_{XX}\) and the fact that \(\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}\). The resulting equation

\[\text{SSR} = \hat{\beta}_1 S_{XY}\]

is convenient for computation if \(\hat{\beta}_1\) and \(S_{XY}\) are available.

Variability Due to Random Error: SSE and MSE


Fig. 13.13 SSE is the sum of the squared lengths of all the red dotted lines. They measure the distances between the observed and predicted response values.

Finally, the variability in \(Y\) arising from random error is measured with the error sum of squares, or SSE:

\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2\]

The Residuals

Each difference \(e_i = y_i - \hat{y}_i\) is called the \(i\)-th residual. Residuals serve as proxies for the unobserved true error terms \(\varepsilon_i\).

Degrees of Freedom and MSE

The degrees of freedom associated with SSE is \(df_E = n-2\) because its formula involves two estimated parameters (\(b_0\) and \(b_1\)). Readjusting the scale by the degrees of freedom,

\[\text{MSE} = \frac{\text{SSE}}{n-2}.\]

MSE as Variance Estimator

\(\text{MSE}\) is a mean of squared distances of \(y_i\) from their estimated means, \(\hat{y}_i\). Therefore, we use it as an estimator of \(\sigma^2\):

\[\hat{\sigma}^2 = MSE\]

The Fundamental Identity

Just like in ANOVA, the total sum of squares and the total degrees of freedom decompose into their respective model and error components:

\[\text{SST} = \text{SSR} + \text{SSE} \quad \text{ and } \quad df_T = df_R + df_E.\]

This can be proven algebraically by adding and subtracting \(\hat{y}_i\) in the expression for SST and using properties of least squares. You are encouraged to show this as an independent exercise.
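
The decomposition can also be verified numerically. Below is a small sketch (Python/NumPy; the helper name `regression_anova` is ours) that computes SST, SSR, SSE, and MSE for any paired data set and asserts the fundamental identity.

```python
import numpy as np

def regression_anova(x, y):
    """Return the regression ANOVA components for paired data (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    xbar, ybar = x.mean(), y.mean()

    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
    b0 = ybar - b1 * xbar
    yhat = b0 + b1 * x

    sst = np.sum((y - ybar)**2)      # total variability
    ssr = np.sum((yhat - ybar)**2)   # variability explained by the line
    sse = np.sum((y - yhat)**2)      # variability left to random error
    mse = sse / (n - 2)              # estimate of sigma^2

    assert np.isclose(sst, ssr + sse)  # SST = SSR + SSE
    return {"SST": sst, "SSR": ssr, "SSE": sse, "MSR": ssr, "MSE": mse}
```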

Partial ANOVA Table for Simple Linear Regression

Source     | df       | Sum of Squares                            | Mean Square                | F-statistic | p-value
-----------|----------|-------------------------------------------|----------------------------|-------------|--------
Regression | \(1\)    | \(\sum_{i=1}^n (\hat{y}_i - \bar{y})^2\)  | \(\frac{\text{SSR}}{1}\)   | ?           | ?
Error      | \(n-2\)  | \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\)      | \(\frac{\text{SSE}}{n-2}\) |             |
Total      | \(n-1\)  | \(\sum_{i=1}^n (y_i - \bar{y})^2\)        |                            |             |

We will discuss how to compute and use the \(F\)-statistic and \(p\)-value in the upcoming lesson.

Example 💡: Blood Pressure Study, Continued

The summary statistics and the model parameter estimates computed in the previous example for the blood pressure study are listed below:

  • \(n=11\)

  • \(\bar{x} = 52.7273\) years

  • \(\bar{y} = -7.6364\) mm Hg

  • \(\sum x_i y_i = -5485\)

  • \(\sum x_i^2 = 32588\)

  • \(\sum y_i^2 = 1580\)

  • \(S_{XY} = -1055.8857\)

  • \(S_{XX}=2006.1502\)

  • \(b_0 = 20.114\)

  • \(b_1 = -0.5263\)

Q1: Predict the change in blood pressure for a patient who is 65 years old. Compute the corresponding residual.

The predicted change is

\[\hat{y} = 20.114 - 0.5263(65) \approx -14.1 \text{ mm Hg}\]

The observed response value for \(x=65\) is \(y=-8\) from the data table. Therefore, the residual is

\[e = -8 - (-14.1) = 6.1 \text{ mm Hg}\]

Q2: Complete the ANOVA table for the linear regression model between age and change in blood pressure.

For each sum of squares, we will use the computational shortcut rather than the definition:

\[\begin{split}&\text{SSR} = b_1 S_{XY} = (-0.5263)(-1055.8857) = 555.7126\\ &\text{SST} = \sum y_i^2 - n\bar{y}^2 = 1580 - 11(-7.6364)^2 = 938.5393\end{split}\]

Finally, using the decomposition identity of the sums of squares,

\[\text{SSE} = \text{SST} - \text{SSR} = 938.5393 - 555.7126 = 382.8267\]

The mean squares are:

\[\begin{split}& \text{MSR} = \frac{\text{SSR}}{1} = 555.7126\\ & \text{MSE} = \frac{\text{SSE}}{9} = 42.5363\end{split}\]

The results are organized into a table below:

Source     | df | Sum of Squares | Mean Square  | F-statistic | p-value
-----------|----|----------------|--------------|-------------|--------
Regression | 1  | \(555.7126\)   | \(555.7126\) | ?           | ?
Error      | 9  | \(382.8267\)   | \(42.5363\)  |             |
Total      | 10 | \(938.5393\)   |              |             |
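
If the statsmodels package is available, the table can be cross-checked in software; the sketch below is one way to do so and is not part of the original example. Note that statsmodels' attribute names differ from ours: its `ess` is our SSR and its `ssr` is our SSE.

```python
import numpy as np
import statsmodels.api as sm

age = np.array([70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30], dtype=float)
dbp = np.array([-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8], dtype=float)

fit = sm.OLS(dbp, sm.add_constant(age)).fit()

print(fit.params)         # [b0, b1] ≈ [20.11, -0.526]
print(fit.ess)            # ≈ 555.7  (our SSR)
print(fit.ssr)            # ≈ 382.8  (our SSE)
print(fit.centered_tss)   # ≈ 938.5  (our SST)
print(fit.mse_resid)      # ≈ 42.5   (our MSE = SSE / (n - 2))
```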

13.2.5. The Coefficient of Determination (\(R^2\))

The coefficient of determination, denoted \(R^2\), provides a single numerical measure of how well our regression line fits the data. It is defined as:

\[R^2 = \frac{\text{SSR}}{\text{SST}}.\]

Due to the fundamental identity \(\text{SST} = \text{SSR} + \text{SSE}\) and the fact that each component is non-negative, \(R^2\) is always between 0 and 1.

It is interpreted as the fraction of the response variability that is explained by its linear association with the explanatory variable.

\(R^2\) approaches 1 when:

  • \(\text{SSR}\) approaches \(\text{SST}\).

  • The residuals \(y_i - \hat{y}_i\) have small magnitudes.

  • The regression line captures most of the variability in the response.

  • Points gather tightly around the fitted line.

\(R^2\) approaches 0 when:

  • \(\text{SSE}\) approaches \(\text{SST}\).

  • The residuals have large magnitudes.

  • The explanatory variable provides little information about the response.

  • Points scatter widely around the fitted line.

Example 💡: Blood Pressure Study, Continued

Compute and interpret the coefficient of determination from the blood pressure study.

\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{555.7126}{938.5393} = 0.5921\]

Interpretation: Approximately 59.21% of the variation in blood pressure change is explained by the linear relationship with patient age.

Important Limitations of \(R^2\)

While \(R^2\) is a useful summary measure, it has important limitations that require careful interpretation. These limitations are evident in the scatter plots of the well-known Anscombe's quartet (Fig. 13.14). The figure presents four bivariate data sets that have distinct patterns yet are identical in their \(R^2\) values:


Fig. 13.14 Anscombe's Quartet

Let us point out some important observations from Fig. 13.14:

  1. A high \(R^2\) value does not guarantee linear association.

  2. \(R^2\) is not robust to outliers. Outliers can dramatically reduce or inflate \(R^2\) (a numerical illustration follows the best-practices list below).

  3. A high \(R^2\) does not automatically guarantee good prediction performance.

    • Even when the true form is not linear, we might obtain a high \(R^2\) if the sample size is small and the points happen to align well with a linear pattern.

    • The absolute scale of the observed errors also matters. Consider a scenario where \(\text{SSR} = 9 \times 10^6\) and \(\text{SST} = 10^7\), giving \(R^2 = 0.90\). While 90% of the total variation is explained, \(\text{SSE} = 10^6\) indicates substantial absolute errors that might make predictions unreliable for practical purposes.

Best Practices for Using the Coefficient of Determination

  1. Always examine scatter plots to verify model assumptions and check for outliers.

  2. Use \(R^2\) as one component of model assessment, not the sole criterion. Consider factors such as sample size, scale, and practical significance.
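
As a concrete illustration of the outlier sensitivity noted above, the following sketch fits a line to synthetic data that follow a clear linear trend and then adds a single extreme point. The helper function and data are hypothetical, but the sharp change in \(R^2\) is typical.

```python
import numpy as np

def r_squared(x, y):
    """R^2 = SSR / SST for a simple linear regression fit."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
    yhat = ybar + b1 * (x - xbar)   # same line as b0 + b1 * x
    return np.sum((yhat - ybar)**2) / np.sum((y - ybar)**2)

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 20)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

print(round(r_squared(x, y), 3))    # high R^2 for the clean linear data

# One extreme observation is enough to change R^2 dramatically
x_out = np.append(x, 30.0)
y_out = np.append(y, -20.0)
print(round(r_squared(x_out, y_out), 3))
```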


13.2.6. Sample Pearson Correlation Coefficient

Another numerical measure that proves useful in simple linear regression is the sample Pearson correlation coefficient. This provides a standardized quantification of the direction and strength of a linear association. It is defined as:

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}.\]

As suggested by the structural similarity, \(r\) is the sample equivalent of the population correlation introduced in the probability chapters:

(13.7)\[\rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.\]

Fig. 13.15 Scatter plots with different sample correlation values

Sample correlation is unitless and always falls between -1 and +1, with the sign indicating the direction and the magnitude indicating the strength of linear association. The bounded range makes it easy to compare correlations across different data sets.

We classify linear associations as strong, moderate or weak using the following rule of thumb:

  • Strong association: \(0.8 <|r| \leq 1\)

  • Moderate association: \(0.5 < |r| \leq 0.8\)

  • Weak association: \(0 \leq |r| \leq 0.5\)

Alternative Expressions of Sample Correlation

1. Using Sample Covariance and Sample Standard Deviations

Using the sample covariance \(s_{XY} = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\) and the sample standard deviations \(s_X\) and \(s_Y\), we can rewrite the definition of the sample correlation as:

(13.8)\[r = \frac{s_{XY}}{s_X s_Y}.\]

This definition gives us an insight into what \(r\) measures. Sample covariance is an unscaled measure of how \(X\) and \(Y\) move together.

  • If the two variables typically take values above or below their respective means simultaneously, most terms in the summation will be positive, increasing the chance that the mean of these terms, the sample covariance, is also positive.

  • If the values tend to lie on opposite sides of their respective means, the sample covariance is likely negative.

Sample correlation is obtained by removing variable-specific scales from sample covariance, dividing it by the sample standard deviations of both \(X\) and \(Y\).

2. Using Sums of Cross-products and Squares

The definition of sample correlation can also be expressed using the sums of cross-products and squares, \(S_{XY}, S_{XX},\) and \(S_{YY}\):

\[r = \frac{S_{XY}}{\sqrt{S_{XX}S_{YY}}}.\]

3. Cross-products of Standardized Data

Modifying the equation (13.8),

\[\begin{split}r &= \frac{\sum_{i=1}^n[(x_i - \bar{x})(y_i - \bar{y})]/(n-1)}{s_X s_Y}\\ &= \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_i - \bar{y}}{s_Y}\right).\end{split}\]

This gives us another perspective into what the correlation coefficient is really doing—it is a mean of cross-products of the standardized values of the explanatory and response variables.
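
The three equivalent expressions can be confirmed numerically; a sketch in Python/NumPy (our own helper, not from the text) is given below. All three values also agree with `np.corrcoef`.

```python
import numpy as np

def correlation_three_ways(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)

    # 1. sample covariance divided by the product of sample standard deviations
    r1 = np.cov(x, y, ddof=1)[0, 1] / (sx * sy)

    # 2. sums of cross-products and squares
    Sxy = np.sum((x - xbar) * (y - ybar))
    r2 = Sxy / np.sqrt(np.sum((x - xbar)**2) * np.sum((y - ybar)**2))

    # 3. cross-products of standardized values, divided by n - 1
    r3 = np.sum(((x - xbar) / sx) * ((y - ybar) / sy)) / (n - 1)

    return r1, r2, r3

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
print(correlation_three_ways(x, y))   # three identical values
print(np.corrcoef(x, y)[0, 1])        # same value again
```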

Connections with Other Components of Linear Regression

1. Slope Estimate, \(b_1\)

Both the correlation coefficient and the slope estimate capture information about a linear relationship. How exactly are they connected?

Recall that one way to write the slope estimate is:

\[b_1 = \frac{s_{XY}}{s_X^2}.\]

Using the correlation formula \(r = \frac{s_{XY}}{s_X s_Y}\), which gives \(s_{XY} = r\, s_X s_Y\), the slope estimate can be written as:

\[b_1 = \frac{r s_X s_Y}{s_X^2} = r \frac{s_Y}{s_X}\]

This relationship shows that the slope estimate is simply the correlation coefficient rescaled by the ratio of standard deviations.

2. Coefficient of Determination, \(R^2\)

Recall that:

  • \(b_1 = \frac{S_{XY}}{S_{XX}}\)

  • \(\text{SSR} = b_1 S_{XY}\)

  • \(\text{SST} = S_{YY}\)

Combining these together,

\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{b_1 S_{XY}}{S_{YY}} = \frac{S_{XY}^2}{S_{XX}S_{YY}} = \left(\frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}\right)^2 = r^2.\]

In other words, the coefficient of determination is equal to the square of the correlation coefficient. However, this special relationship holds only for simple linear regression, where the model has a single explanatory variable.
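
Both connections are easy to verify on the blood pressure data; the following check (not part of the original example) uses NumPy only.

```python
import numpy as np

age = np.array([70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30], dtype=float)
dbp = np.array([-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8], dtype=float)

r = np.corrcoef(age, dbp)[0, 1]

print(round(r, 4))                                      # ≈ -0.77 (negative association)
print(round(r * dbp.std(ddof=1) / age.std(ddof=1), 4))  # b1 = r * s_Y / s_X ≈ -0.5263
print(round(r**2, 4))                                   # R^2 ≈ 0.5921
```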

Limitations of Correlation

Correlation specifically measures the direction and strength of linear association. Therefore, \(r = 0\) indicates an absence of linear association, not necessarily an absence of association altogether. Symmetrical non-linear relationships tend to produce sample correlation close to zero, as any linear trend on one side offsets the trend on the other.


Fig. 13.16 Scatter plots with zero sample correlation

Furthermore, extreme observations can significantly impact correlation values. These limitations mean we need to examine our data visually before trusting correlation values, just as we do with \(R^2\).
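
A tiny example of the first limitation: for synthetic data that follow a perfect but symmetric parabola, the sample correlation is essentially zero even though the association is perfect.

```python
import numpy as np

x = np.linspace(-3, 3, 101)   # symmetric around zero
y = x**2                      # perfect, but non-linear, relationship

print(np.corrcoef(x, y)[0, 1])  # approximately 0: no *linear* association
```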

13.2.7. Bringing It All Together

Key Takeaways 📝

  1. The simple linear regression model \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) and its associated assumptions formalize the linear relationship between quantitative variables.

  2. The least squares method provides optimal estimates for the slope and intercept parameters by minimizing the sum of squared differences between the observed data and the summary line.

  3. The variability components of linear regression can also be organized into an ANOVA table.

  4. Residuals \(e_i = y_i - \hat{y}_i\) estimate the unobserved error terms and enable variance estimation through \(s^2 = \text{MSE} = \text{SSE}/(n-2)\).

  5. The coefficient of determination \(R^2 = \text{SSR}/\text{SST}\) measures the proportion of total variation in observed \(Y\) that is explained by the regression model.

  6. The sample Pearson correlation coefficient \(r\) provides the direction and strength of a linear relationship on a standard numerical scale between -1 and 1.

  7. Model assessment requires multiple tools: scatter plots, residual analysis, ANOVA table, \(R^2\), and \(r\) all work together to evaluate model appropriateness.

Exercises

  1. Least Squares Calculation: Given the following data on house size (X, in thousands of sq ft) and price (Y, in thousands of dollars):

    House | Size (X, 1000 sq ft) | Price (Y, $1000s)
    ------|----------------------|------------------
    1     | 1.2                  | 180
    2     | 1.8                  | 220
    3     | 2.1                  | 280
    4     | 1.5                  | 200
    5     | 2.4                  | 320

    1. Calculate \(\bar{x}\), \(\bar{y}\), \(\sum x_i y_i\), and \(\sum x_i^2\).

    2. Find the least squares estimates \(b_0\) and \(b_1\).

    3. Write the fitted regression equation.

    4. Interpret the slope and intercept in context.

    5. Verify that the line passes through \((\bar{x}, \bar{y})\).

  2. ANOVA Table Construction: Using the house price data from Exercise 1:

    1. Calculate SST, SSR, and SSE.

    2. Complete the ANOVA table with degrees of freedom and mean squares.

    3. Calculate \(R^2\) and interpret its meaning.

    4. Estimate the standard deviation \(\sigma\).

  3. Residual Analysis: For the house price regression model,

    1. Calculate the fitted value and residual for House 3.

    2. What does this residual tell you about the model’s performance for this house?

    3. If all residuals were positive in a subrange of the \(X\) values, what would this suggest about the model?

  4. Critical Thinking: A news article claims: “Study finds strong relationship between ice cream sales and drowning deaths (R² = 0.89). Ice cream consumption causes drowning!”

    1. What statistical concept is being misunderstood in this claim?

    2. Explain what the high \(R^2\) actually tells us.

    3. What might be a more plausible explanation for this relationship?

  5. Advanced Calculation: Using the computational formula \(\text{SSR} = b_1 S_{xy}\):

    1. Explain why \(b_1\) and \(S_{xy}\) always have the same sign.

    2. Under what conditions would \(\text{SSR} = 0\)?