13.2. Simple Linear Regression

After exploring the relationship between two quantitative variables through scatter plots and determining that a linear relationship provides the best description of their association, we need to move from visual exploration to mathematical modeling. This involves formalizing our population-level model, developing methods to estimate the parameters of this model, and creating tools to assess how well our model fits the data.

Road Map 🧭

  • Problem we will solve – How to formalize the simple linear regression model, estimate its parameters using least squares methods, and assess model quality through ANOVA decomposition and the coefficient of determination

  • Tools we’ll learn – The population regression model with assumptions, least squares estimation for slope and intercept, residual analysis, ANOVA table for regression, and R-squared as a measure of model fit

  • How it fits – This transforms our visual understanding of relationships into a rigorous statistical framework with parameter estimation and model assessment, setting the foundation for hypothesis testing and prediction

13.2.1. The Population Model for Simple Linear Regression

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-simple-linear-regression-model.png

Fig. 13.18 The simple linear regression population model showing the linear relationship between explanatory and response variables

After determining through scatter plot analysis that a linear relationship appropriately describes the association between our explanatory and response variables, we can formalize this relationship using a statistical model. This model provides the theoretical foundation for all our subsequent estimation and inference procedures.

The Linear Regression Equation

Our population model takes the form:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n\]

This equation captures three essential components of the relationship:

The Systematic Component: \(\beta_0 + \beta_1 X_i\) represents the mean response line—the average value of \(Y\) for any given value of \(X\). This linear function defines the systematic relationship between the explanatory and response variables.

The Random Component: \(\varepsilon_i\) represents the error term or unexplained variation—the difference between each individual observation and the mean response line. This captures all variation in \(Y\) that cannot be explained by the linear relationship with \(X\).

Population Parameters:

  • \(\beta_0\) is the true population intercept—the expected value of \(Y\) when \(X = 0\)

  • \(\beta_1\) is the true population slope—the expected change in \(Y\) for a one-unit increase in \(X\)

Understanding the Model’s Meaning

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-regression-model-assumptions.png

Fig. 13.19 Key assumptions of the simple linear regression model

This model framework tells us that for each fixed value of the explanatory variable \(X_i\), the corresponding response \(Y_i\) is a random variable. The randomness comes from the error term \(\varepsilon_i\), while the systematic relationship is captured by the linear function.

Expected Value (Mean Response Line):

\[E[Y_i] = E[\beta_0 + \beta_1 X_i + \varepsilon_i] = \beta_0 + \beta_1 X_i\]

Since \(\beta_0\), \(\beta_1\), and \(X_i\) are constants, they pass through the expectation operator unchanged. The error term has expected value zero, so it disappears. This gives us the mean response line: \(\mu_{Y|X=x} = \beta_0 + \beta_1 x\).

Common Variance:

\[\text{Var}(Y_i) = \text{Var}(\beta_0 + \beta_1 X_i + \varepsilon_i) = \text{Var}(\varepsilon_i) = \sigma^2\]

Since constants have zero variance, only the error term contributes to the variability of \(Y_i\).
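
To see these two facts numerically, here is a minimal simulation sketch, assuming arbitrary illustrative values \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\sigma = 1.5\) (not from the text): repeated responses at a fixed \(x\) have sample mean near \(\beta_0 + \beta_1 x\) and sample variance near \(\sigma^2\).

```python
import numpy as np

# Hypothetical parameter values chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.5
x_fixed = 10.0                      # a single fixed value of the explanatory variable
rng = np.random.default_rng(350)

# Simulate many responses Y = beta0 + beta1*x + epsilon at this fixed x
eps = rng.normal(loc=0.0, scale=sigma, size=100_000)
y = beta0 + beta1 * x_fixed + eps

print("theoretical mean  :", beta0 + beta1 * x_fixed)   # 7.0
print("simulated mean    :", y.mean())                  # close to 7.0
print("theoretical var   :", sigma**2)                  # 2.25
print("simulated variance:", y.var(ddof=1))             # close to 2.25
```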

13.2.2. Essential Assumptions of Linear Regression

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-regression-assumptions-list.png

Fig. 13.20 The four fundamental assumptions required for valid linear regression analysis

For our regression procedures to be valid, we must make four key assumptions about our model:

Assumption 1: Independence

The observed pairs \((x_i, y_i)\) for \(i \in \{1, 2, \ldots, n\}\) are collected so that the \(y_i\) values form a simple random sample at each fixed value of \(x_i\). Observations should not influence each other.

Assumption 2: Linearity

The association between the explanatory variable and the response is, on average, linear. The mean response follows the straight line \(E[Y|X] = \beta_0 + \beta_1 X\).

Assumption 3: Normality

The error terms (and hence the \(Y_i\) values) are normally distributed:

\[\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n\]

Assumption 4: Equal Variance (Homoscedasticity)

The error terms have constant variance \(\sigma^2\) across all values of \(X\). This means the spread of \(Y\) values around the regression line remains the same regardless of the \(X\) value.

Visualizing the Regression Model

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-3d-regression-visualization.png

Fig. 13.21 Three-dimensional visualization showing normal distributions around the mean response line

The regression model can be visualized as a three-dimensional structure:

  • The X-Y plane contains our scatter plot of data points

  • The mean response line \(\mu_{Y|X=x} = \beta_0 + \beta_1 x\) runs through this plane

  • Normal distributions are centered on this line at each \(X\) value, extending perpendicular to the X-Y plane

For any fixed value of \(X\), we observe a simple random sample of \(Y\) values from a normal distribution with:

  • Mean: \(\beta_0 + \beta_1 x\) (the value on the line)

  • Variance: \(\sigma^2\) (the same for all \(X\) values)

This is analogous to ANOVA, but instead of discrete group means, we have a continuous mean function that changes linearly with \(X\).

13.2.3. The Method of Least Squares

Since we don’t know the true population parameters \(\beta_0\), \(\beta_1\), and \(\sigma^2\), we must estimate them from our sample data. The method of least squares provides the optimal approach for estimating the slope and intercept.

The Least Squares Principle

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-finding-best-fit-line.png

Fig. 13.22 Illustration of the least squares principle using the car efficiency data

The least squares method finds the line that minimizes the sum of squared deviations between the observed \(y_i\) values and the fitted values \(\hat{y}_i\) predicted by our line.

The Optimization Problem:

We want to find estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize:

\[g(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2\]

Observable Errors (Residuals):

For any candidate line \(\hat{y} = b_0 + b_1 x\), we can compute residuals:

\[e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)\]

The least squares method minimizes \(\sum_{i=1}^n e_i^2\).
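
To illustrate the optimization view before deriving the closed form, the sketch below (with made-up data, not the textbook example) minimizes \(\sum_{i=1}^n e_i^2\) numerically and compares the answer to NumPy's built-in degree-one polynomial fit.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

def sse(params):
    """Sum of squared residuals g(b0, b1) for a candidate line."""
    b0, b1 = params
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Minimize g numerically, starting from an arbitrary guess
result = minimize(sse, x0=[0.0, 0.0])
print("numerical minimizer (b0, b1):", result.x)

# Compare with the closed-form least squares fit (degree-1 polynomial)
b1_hat, b0_hat = np.polyfit(x, y, deg=1)   # returns slope first, then intercept
print("closed-form fit     (b0, b1):", (b0_hat, b1_hat))
```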

Derivation of Least Squares Estimates

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-least-squares-calculus.png

Fig. 13.23 The calculus approach to finding least squares estimates

To find the minimum, we take partial derivatives with respect to both parameters and set them equal to zero.

Optimizing with respect to \(\beta_0\):

\[\frac{\partial g}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = 0\]
https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-intercept-derivation.png

Fig. 13.24 Step-by-step derivation of the intercept estimate

Using the chain rule:

\[-2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0\]

Dividing by -2 and expanding:

\[\sum_{i=1}^n y_i - n\beta_0 - \beta_1 \sum_{i=1}^n x_i = 0\]

Solving for \(\beta_0\):

\[\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n y_i - \beta_1 \frac{1}{n}\sum_{i=1}^n x_i = \bar{y} - \hat{\beta}_1 \bar{x}\]

Optimizing with respect to \(\beta_1\):

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-slope-derivation.png

Fig. 13.25 Detailed derivation of the slope estimate showing the algebraic steps

\[\frac{\partial g}{\partial \beta_1} = -2 \sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0\]

Substituting our expression for \(\hat{\beta}_0\) and simplifying through several algebraic steps:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}\]
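
For completeness, the omitted algebra is short. Substituting \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) into \(\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0\) gives

\[\sum_{i=1}^n x_i(y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^n x_i(x_i - \bar{x}) = 0,\]

and since \(\sum_{i=1}^n x_i \bar{y} = n\bar{x}\bar{y}\) and \(\sum_{i=1}^n x_i \bar{x} = n\bar{x}^2\), solving for \(\hat{\beta}_1\) produces the ratio displayed above.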

The Least Squares Formulas

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-regression-line-formulas.png

Fig. 13.26 Final least squares formulas expressed in terms of sample covariance and variance

Our final least squares estimates can be expressed in several equivalent forms:

Slope Estimate:

\[b_1 = \frac{s_{xy}}{s_x^2} = \frac{S_{xy}}{S_{xx}}\]

where:

  • \(s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\) (sample covariance)

  • \(s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\) (sample variance of X)

  • \(S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\) (sum of cross-products)

  • \(S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2\) (sum of squares for X)

Intercept Estimate:

\[b_0 = \bar{y} - b_1\bar{x}\]

Fitted Regression Line:

\[\hat{y} = b_0 + b_1 x\]
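
As a minimal sketch of these formulas in code (the data values here are made up for illustration):

```python
import numpy as np

# Illustrative data (not the textbook example)
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([8.2, 11.1, 13.9, 17.2, 19.8])

x_bar, y_bar = x.mean(), y.mean()

# Sums of squares and cross-products, exactly as in the formulas above
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

b1 = S_xy / S_xx           # slope estimate
b0 = y_bar - b1 * x_bar    # intercept estimate

print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f} x")

# Fitted values for the observed x's
y_hat = b0 + b1 * x
```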

13.2.4. Properties of the Regression Line

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-regression-line-properties.png

Fig. 13.27 Important properties of the fitted regression line

The least squares regression line has several important properties:

Property 1: Interpretation of the Slope

The slope \(b_1\) represents the average change in the response variable for every one-unit change in the explanatory variable. The sign of \(b_1\) indicates the direction of the association.

Property 2: Interpretation of the Intercept

The intercept \(b_0\) represents the average value of the response variable when the explanatory variable equals zero. However, this may not have practical meaning if \(X = 0\) is outside the range of the data or not physically meaningful.

Property 3: The Line Passes Through \((\bar{x}, \bar{y})\)

If we substitute \(x = \bar{x}\) into our regression equation:

\[\hat{y} = b_0 + b_1\bar{x} = (\bar{y} - b_1\bar{x}) + b_1\bar{x} = \bar{y}\]

Property 4: Non-invertibility

If we swap the explanatory and response variables and refit the regression, we do not get the algebraic inverse of the original line. The least squares method considers which variable is treated as fixed and which as random, so the roles cannot simply be interchanged.
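
A quick numerical illustration with made-up data: the slope from regressing \(Y\) on \(X\) is not the reciprocal of the slope from regressing \(X\) on \(Y\); the product of the two slopes equals \(r^2\), not 1.

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.0, 2.5, 4.1, 4.0, 5.8, 6.1, 6.5])

s_xy = np.cov(x, y, ddof=1)[0, 1]
slope_y_on_x = s_xy / np.var(x, ddof=1)   # b1 from regressing Y on X
slope_x_on_y = s_xy / np.var(y, ddof=1)   # slope from regressing X on Y

print("Y-on-X slope        :", slope_y_on_x)
print("1 / (X-on-Y slope)  :", 1 / slope_x_on_y)            # not the same line
print("product of slopes   :", slope_y_on_x * slope_x_on_y) # equals r^2
r = np.corrcoef(x, y)[0, 1]
print("r squared           :", r**2)
```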

13.2.5. A Complete Example: Blood Pressure Study

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-blood-pressure-example.png

Fig. 13.28 Blood pressure treatment study data showing age and change in blood pressure

Let’s work through a comprehensive example to illustrate all the concepts.

Research Context: A new treatment for high blood pressure is being assessed for feasibility. In an early trial, 11 subjects have their blood pressure measured before and after treatment. Researchers want to determine if there’s an association between patient age and the change in systolic blood pressure after 24 hours.

Variable Definitions:

  • Explanatory variable (X): Age of patient (years)

  • Response variable (Y): Change in blood pressure = (After treatment) - (Before treatment)

Expected Results: If the treatment is effective, we expect mostly negative values (blood pressure decreases). If age affects treatment effectiveness, we might see a relationship between age and the magnitude of change.

Computing the Least Squares Estimates

Using our data with \(n = 11\) subjects:

Summary Statistics:

  • \(\bar{x} = 52.73\) years

  • \(\bar{y} = -8.09\) mm Hg

  • \(\sum x_i y_i = -5700\)

  • \(\sum x_i^2 = 32056\)

Slope Calculation:

\[S_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = -5700 - 11(52.73)(-8.09) = -1056\]
\[S_{xx} = \sum x_i^2 - n\bar{x}^2 = 32056 - 11(52.73)^2 = 2008\]
\[b_1 = \frac{S_{xy}}{S_{xx}} = \frac{-1056}{2008} = -0.526\]

Intercept Calculation:

\[b_0 = \bar{y} - b_1\bar{x} = -8.09 - (-0.526)(52.73) = 20.11\]

Fitted Regression Line:

\[\hat{y} = 20.11 - 0.526x\]

Interpretation:

  • For each additional year of age, the change in blood pressure decreases by an average of 0.526 mm Hg (treatment becomes more effective)

  • The intercept (20.11) has no practical meaning here, since \(x = 0\) corresponds to a newborn, far outside the range of ages studied for this treatment

  • The negative slope suggests older patients benefit more from the treatment

13.2.6. Residual Analysis and Variance Estimation

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-variance-estimation.png

Fig. 13.29 Estimating the common variance using residuals from the fitted model

Once we have our fitted line, we can compute residuals and estimate the error variance.

Residuals (Observable Errors):

\[e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)\]

Example Calculation: For a 65-year-old patient:

  • Predicted change: \(\hat{y} = 20.11 - 0.526(65) = -14.0\) mm Hg

  • Observed change: \(y = -8\) mm Hg (from data)

  • Residual: \(e = -8 - (-14) = 6\) mm Hg

Variance Estimate:

\[s^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2 = \frac{\text{SSE}}{n-2}\]

We use \(n-2\) degrees of freedom because we estimated two parameters (\(b_0\) and \(b_1\)) from the data.
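
A short sketch of the residual and variance calculations, again using made-up data rather than the blood pressure study:

```python
import numpy as np

# Made-up data for illustration
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y = np.array([8.0, 11.5, 13.2, 17.9, 19.1, 23.0])

# Fit the least squares line
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Residuals and the variance estimate s^2 = SSE / (n - 2)
residuals = y - (b0 + b1 * x)
SSE = np.sum(residuals**2)
n = len(x)
s2 = SSE / (n - 2)
print("SSE =", round(SSE, 3), " s^2 =", round(s2, 3), " s =", round(np.sqrt(s2), 3))
```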

13.2.7. The ANOVA Table for Regression

Just as in ANOVA, we can decompose the total variability in our response variable into meaningful components. This decomposition helps us assess how well our regression model explains the variation in the data.

Variance Decomposition in Regression

The key insight is that total variability can be split into two parts:

\[\text{Total Variability} = \text{Explained by Regression} + \text{Unexplained (Error)}\]

Sum of Squares Total (SST):

\[\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2\]

This measures how much the response values deviate from their overall mean, ignoring the explanatory variable.

Sum of Squares Regression (SSR):

\[\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]

This measures how much the fitted values deviate from the overall mean—the variability explained by the linear relationship.

Sum of Squares Error (SSE):

\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2\]

This measures how much the observed values deviate from the fitted line—the unexplained variability.

The Fundamental Identity:

\[\text{SST} = \text{SSR} + \text{SSE}\]

This can be proven algebraically by adding and subtracting \(\hat{y}_i\) in the expression for SST and using properties of least squares.
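
A sketch of that argument: adding and subtracting \(\hat{y}_i\) inside \((y_i - \bar{y})\) gives

\[\text{SST} = \sum_{i=1}^n \big[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\big]^2 = \text{SSE} + \text{SSR} + 2\sum_{i=1}^n e_i(\hat{y}_i - \bar{y}),\]

and the cross term vanishes because the least squares normal equations imply \(\sum_{i=1}^n e_i = 0\) and \(\sum_{i=1}^n e_i x_i = 0\), so \(\sum_{i=1}^n e_i \hat{y}_i = b_0 \sum_{i=1}^n e_i + b_1 \sum_{i=1}^n e_i x_i = 0\) and \(\bar{y}\sum_{i=1}^n e_i = 0\).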

Computational Shortcuts for Sum of Squares

While we could compute each sum of squares directly, there are more efficient computational formulas:

For SSR:

\[\text{SSR} = b_1 \cdot S_{xy}\]

Because \(S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\) is already computed when estimating the slope, SSR requires no additional sums beyond those used to fit the line.

For SST:

\[\text{SST} = \sum_{i=1}^n y_i^2 - n\bar{y}^2\]

For SSE:

\[\text{SSE} = \text{SST} - \text{SSR}\]

Once SST and SSR are available, the error sum of squares follows immediately by subtraction.

The Complete ANOVA Table

Table 13.2 ANOVA Table for Simple Linear Regression

| Source     | df        | Sum of Squares                           | Mean Square                       | F-statistic               | p-value |
|------------|-----------|------------------------------------------|-----------------------------------|---------------------------|---------|
| Regression | 1         | \(\text{SSR} = b_1 \cdot S_{xy}\)        | \(\text{MSR} = \text{SSR}/1\)     | \(\text{MSR}/\text{MSE}\) | p-value |
| Error      | \(n-2\)   | \(\text{SSE} = \text{SST} - \text{SSR}\) | \(\text{MSE} = \text{SSE}/(n-2)\) |                           |         |
| Total      | \(n-1\)   | \(\text{SST} = \sum y_i^2 - n\bar{y}^2\) | \(\text{MST} = \text{SST}/(n-1)\) |                           |         |

Degrees of Freedom:

  • Regression: 1 (we have one explanatory variable)

  • Error: \(n-2\) (total sample size minus two estimated parameters)

  • Total: \(n-1\) (total sample size minus one estimated mean)

Mean Squares:

  • \(\text{MSE} = \text{SSE}/(n-2)\) estimates \(\sigma^2\)

  • \(\text{MSR} = \text{SSR}/1\) measures explained variation per degree of freedom
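
The sketch below assembles these ANOVA quantities for made-up data; the F statistic and p-value columns are computed here only as a preview of the inference developed in later sections.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
y = np.array([5.1, 8.9, 12.2, 17.0, 20.3, 25.1, 27.8])
n = len(x)

# Least squares fit
x_bar, y_bar = x.mean(), y.mean()
S_xy = np.sum((x - x_bar) * (y - y_bar))
S_xx = np.sum((x - x_bar) ** 2)
b1 = S_xy / S_xx
b0 = y_bar - b1 * x_bar

# Sums of squares
SST = np.sum((y - y_bar) ** 2)
SSR = b1 * S_xy                  # computational shortcut
SSE = SST - SSR

# Mean squares and F statistic
MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE
p_value = stats.f.sf(F, 1, n - 2)   # upper-tail probability of an F(1, n-2)

print(f"SSR={SSR:.2f}  SSE={SSE:.2f}  SST={SST:.2f}")
print(f"MSR={MSR:.2f}  MSE={MSE:.2f}  F={F:.2f}  p={p_value:.4g}")
```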

13.2.8. The Coefficient of Determination (R-squared)

The coefficient of determination, denoted \(R^2\), provides a single numerical measure of how well our regression line fits the data.

Definition and Calculation

\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{Variation Explained by Regression}}{\text{Total Variation}}\]

Interpretation: \(R^2\) represents the fraction (or percentage when multiplied by 100) of the variation in the response variable that is explained by the least squares regression line.

Range: \(0 \leq R^2 \leq 1\)

  • \(R^2 = 0\): The regression line explains none of the variation (the fitted line is the horizontal line at \(\bar{y}\))

  • \(R^2 = 1\): The regression line explains all the variation (all points lie exactly on the line)

Alternative Formula:

\[R^2 = 1 - \frac{\text{SSE}}{\text{SST}}\]

This form shows that \(R^2\) equals one minus the proportion of variation left unexplained by the regression.
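
A two-line check that both forms agree, using assumed illustrative sums of squares:

```python
# Illustrative sums of squares (assumed values, chosen so that SST = SSR + SSE)
SSR, SSE = 420.0, 180.0
SST = SSR + SSE

print(SSR / SST, 1 - SSE / SST)   # both print 0.7
```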

Understanding R-squared Through Examples

When is R-squared large?

\(R^2\) approaches 1 when \(\text{SSR}\) is large relative to \(\text{SST}\). This happens when:

  • The fitted values \(\hat{y}_i\) are close to the observed values \(y_i\)

  • The regression line captures most of the variability in the response

  • Points cluster tightly around the fitted line

When is R-squared small?

\(R^2\) approaches 0 when \(\text{SSE}\) is large relative to \(\text{SST}\). This happens when:

  • The fitted values \(\hat{y}_i\) are close to \(\bar{y}\) (horizontal line)

  • The explanatory variable provides little information about the response

  • Points scatter widely around the fitted line

Blood Pressure Example: Computing R-squared

Using our blood pressure data:

SSR Calculation:

\[\text{SSR} = b_1 \cdot S_{xy} = (-0.526)(-1056) \approx 555.5\]

SST Calculation:

\[\text{SST} = \sum_{i=1}^n y_i^2 - n\bar{y}^2\]

(computed from the observed changes in blood pressure, not reproduced here)

SSE Calculation:

\[\text{SSE} = \text{SST} - \text{SSR}\]

R-squared:

\[R^2 = \frac{\text{SSR}}{\text{SST}} \approx 0.592\]

Interpretation: Approximately 59.2% of the variation in blood pressure change is explained by the linear relationship with patient age.

13.2.9. Important Limitations of R-squared

While \(R^2\) is a useful summary measure, it has important limitations that require careful interpretation.

R-squared Does Not Guarantee Linearity

A high \(R^2\) value does not necessarily mean the relationship is truly linear. Consider data that oscillate sinusoidally around an increasing trend: a straight line can still capture the average trend and yield a high \(R^2\), even though the true relationship is clearly non-linear.

Example: A curved relationship might have \(R^2 = 0.90\), suggesting the linear model explains 90% of the variation. However, the residuals would show systematic patterns indicating that a non-linear model would be more appropriate.

Lesson: Always examine scatter plots and residual plots, not just \(R^2\).

R-squared is Not Robust to Outliers

Outliers can dramatically affect \(R^2\) in unexpected ways:

Case 1: Outlier reduces R-squared

A single extreme outlier in the response direction can make \(\text{SST}\) very large while having less impact on \(\text{SSR}\), resulting in a misleadingly low \(R^2\) even when most points follow a clear linear pattern.

Case 2: Outlier inflates R-squared

An outlier far from the other points in the \(X\) direction that happens to fall near the extended line can pull the fit toward itself and inflate \(R^2\), overstating the strength of the relationship among the remaining points.

Lesson: Always identify and investigate outliers before interpreting \(R^2\).

High R-squared ≠ Good Predictions

A high \(R^2\) does not automatically guarantee good prediction performance:

Scale Matters: Consider a scenario where \(\text{SSR} = 9,000,000\) and \(\text{SST} = 10,000,000\), giving \(R^2 = 0.90\). While 90% of variation is explained, \(\text{SSE} = 1,000,000\) indicates substantial absolute errors that might make predictions unreliable for practical purposes.

Missing Variables: Even a high \(R^2\) from a single explanatory variable might improve dramatically when additional relevant variables are included, indicating that the current model, while useful, is incomplete.

Lesson: Consider both the proportion of variation explained and the absolute magnitude of unexplained variation.

Other Important Limitations

Sample Size Effects: With small sample sizes, you might obtain high \(R^2\) values even when the true population relationship is weak or non-linear, simply because you happened to sample points that align well with a linear pattern.

Extrapolation Risks: \(R^2\) describes fit within the range of observed data but provides no information about model performance outside this range.

Assumption Violations: If regression assumptions (normality, equal variance, independence) are violated, \(R^2\) may not provide reliable information about model quality.

Correlation vs. Causation: A high \(R^2\) indicates association but never implies causation. The explanatory variable might be correlated with the true causal factor without itself being causal.

13.2.10. Best Practices for Using R-squared

To use \(R^2\) effectively:

  1. Always examine scatter plots first to verify linear relationships

  2. Check for outliers and assess their impact on the analysis

  3. Consider the practical significance of unexplained variation, not just the percentage explained

  4. Use R-squared as one component of model assessment, not the sole criterion

  5. Verify model assumptions through residual analysis

  6. Be cautious with small sample sizes where high \(R^2\) might be misleading

  7. Avoid extrapolation beyond the range of observed data

  8. Remember that association ≠ causation regardless of \(R^2\) value

13.2.11. Sample Pearson Correlation Coefficient

Another numerical measure that proves useful in simple linear regression—and will become even more valuable in advanced regression techniques—is the sample Pearson correlation coefficient. This measure provides a standardized way to quantify the linear association between our explanatory and response variables.

The sample Pearson correlation coefficient is simply a statistical measure of the strength and direction of a linear relationship. Given that we’ve established a linear relationship between two quantitative variables, it tells us how strongly they’re associated and in which direction.

The formula for the sample correlation coefficient is:

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}\]

If we divide both the numerator and denominator by \(n-1\), we can rewrite this more compactly as:

\[r = \frac{s_{xy}}{s_x s_y}\]

where \(s_{xy}\) is the sample covariance and \(s_x\), \(s_y\) are the sample standard deviations of X and Y respectively.

The sample covariance \(s_{xy}\) tells us about the relationship between X and Y. For observed pairs \((x_i, y_i)\), when X tends to be above its mean, what does Y tend to do? Does it tend to be above its mean as well (indicating a positive association), or does it tend to be below its mean (indicating a negative association)? We divide by the standard deviations to get a standardized measure.

This quantity is estimating the population correlation \(\rho\), which we saw earlier in the probability chapters:

\[\rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}\]

We’re estimating the population covariance with the sample covariance, and the population standard deviations with their sample versions.

Since the denominator consists only of constants, we can bring everything up into the numerator to give a different form:

\[r = \frac{1}{n-1} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]

This gives us insight into what the correlation coefficient is really doing. Notice that \(\frac{x_i - \bar{x}}{s_x}\) is just the standardized form of \(x_i\)—a z-score telling us how many standard deviations the observation \(x_i\) is away from \(\bar{x}\). Similarly for the Y values. We’re looking at the average product of these standardized values across all observations.

Since standardized units are unitless measures, \(r\) is unitless, making it easy to compare the strength and direction of relationships between different quantitative variables regardless of their original units.

This correlation coefficient has a useful property: it always falls between -1 and +1. You can prove this using the Cauchy-Schwarz inequality, though we won’t cover that proof here. This bounded range makes it easy to compare associations across different variable pairs.
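
A minimal sketch computing \(r\) in three equivalent ways on made-up data, with np.corrcoef as a cross-check:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.1, 3.8, 4.4, 4.1, 5.9, 6.2, 7.1])
n = len(x)

# (1) Sample covariance divided by the product of sample standard deviations
s_xy = np.cov(x, y, ddof=1)[0, 1]
r1 = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# (2) Average product of z-scores
zx = (x - x.mean()) / np.std(x, ddof=1)
zy = (y - y.mean()) / np.std(y, ddof=1)
r2 = np.sum(zx * zy) / (n - 1)

# (3) Built-in cross-check
r3 = np.corrcoef(x, y)[0, 1]

print(r1, r2, r3)   # all three agree
```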

To interpret correlation strength, we use these rules of thumb:

  • Strong negative association: -1 to -0.8

  • Moderate negative association: -0.8 to -0.5

  • Weak association: -0.5 to +0.5

  • Moderate positive association: +0.5 to +0.8

  • Strong positive association: +0.8 to +1

The sign clearly indicates direction. Anything close to 0 indicates very little linear association between X and Y.

https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter13/slide-correlation-visual-examples.png

Fig. 13.30 Visual representation of different correlation strengths showing downward trends for negative correlations and upward trends for positive correlations

When \(r = 0\), it indicates no linear association, but this can happen in several ways. Sometimes there is genuinely no relationship and the points form a shapeless cloud with no pattern. But zero correlation can also occur with strong nonlinear relationships that are symmetric around the means: a circular pattern, a U-shaped curve, or certain clustered patterns can all yield \(r = 0\) despite clear associations being present.

The key insight is that correlation specifically measures linear association. Symmetry tends to produce zero correlation because in some regions where X is above its mean, Y tends to be above its mean, while in other regions the opposite occurs, and these effects cancel out in the averaging process.
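
A tiny demonstration of that cancellation: a perfectly U-shaped relationship, symmetric about \(\bar{x}\), yields a sample correlation of essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 61)        # symmetric around 0
y = x**2                          # strong relationship, but not linear
print(np.corrcoef(x, y)[0, 1])    # approximately 0
```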

Connection Between Correlation and Slope

Since both the correlation coefficient and the regression slope capture information about the relationship between X and Y, they should be related—and indeed they are.

Recall that our slope formula is:

\[b_1 = \frac{s_{xy}}{s_x^2}\]

From our correlation formula \(r = \frac{s_{xy}}{s_x s_y}\), we can solve for the sample covariance:

\[s_{xy} = r \, s_x s_y\]

Substituting this into our slope formula:

\[b_1 = \frac{s_{xy}}{s_x^2} = \frac{r \, s_x s_y}{s_x^2} = r \cdot \frac{s_y}{s_x}\]

This elegant relationship shows that the slope is the correlation coefficient rescaled by the ratio of standard deviations. The correlation provides the unitless measure of strength and direction, while the ratio \(s_y/s_x\) converts this into the actual units of our data.

Connection to R-squared

There’s also a direct mathematical relationship between the correlation coefficient and our coefficient of determination. Recall that:

\[\text{SSR} = b_1 \cdot S_{xy} = \frac{S_{xy}}{S_{xx}} \cdot S_{xy} = \frac{S_{xy}^2}{S_{xx}}\]

and \(\text{SST} = S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2\). Therefore:

\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{S_{xy}^2}{S_{xx} S_{yy}}\]

If we multiply the numerator and denominator by \(\left(\tfrac{1}{n-1}\right)^2\), this becomes:

\[R^2 = \frac{s_{xy}^2}{s_x^2 \, s_y^2} = \left(\frac{s_{xy}}{s_x s_y}\right)^2 = r^2\]

So in simple linear regression, \(R^2 = r^2\). This means if we know one, we can determine the other (though we need the sign of the slope to determine whether \(r\) is positive or negative).

This relationship only holds for simple linear regression with one explanatory variable. In multiple regression, \(R^2\) assesses the overall model utility, while individual correlation coefficients measure pairwise associations between specific variables.
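
A quick numerical check of both identities on made-up data (valid, as noted, only for simple linear regression):

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0, 11.0])
y = np.array([1.2, 2.9, 4.1, 5.2, 7.8, 8.1, 10.3])

s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)
r = np.corrcoef(x, y)[0, 1]

# Slope two ways: covariance formula vs. r * (s_y / s_x)
b1_cov = np.cov(x, y, ddof=1)[0, 1] / s_x**2
b1_r = r * s_y / s_x
print(b1_cov, b1_r)          # identical

# R^2 from the ANOVA decomposition equals r^2
b0 = y.mean() - b1_cov * x.mean()
y_hat = b0 + b1_cov * x
SSR = np.sum((y_hat - y.mean()) ** 2)
SST = np.sum((y - y.mean()) ** 2)
print(SSR / SST, r**2)       # identical
```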

Limitations of Correlation

Just like \(R^2\), the correlation coefficient has important limitations:

  1. Quantitative variables only: Correlation measures linear association between two quantitative variables exclusively.

  2. Linearity required: Correlation only measures linear relationships effectively. Strong nonlinear relationships can produce correlations near zero.

  3. Not robust to outliers: Extreme observations can significantly impact correlation values.

  4. Incomplete information: Correlation provides a summary number but doesn’t give complete information about the association. You must combine numerical assessment with visual assessment through scatter plots.

These limitations mean we need to examine our data visually before trusting correlation values, just as we do with \(R^2\). The correlation coefficient is useful for quantifying linear association strength and direction, but only after we’ve verified that a linear relationship is indeed appropriate.

13.2.12. Moving Forward: From Description to Inference

The tools developed in this chapter—the regression model, least squares estimation, ANOVA decomposition, \(R^2\), and correlation analysis—provide the foundation for statistical inference in regression. We can now:

  • Estimate relationships between quantitative variables

  • Assess model fit using multiple criteria

  • Decompose variation into explained and unexplained components

  • Quantify strength of association through both \(R^2\) and \(r\)

  • Compare standardized and unstandardized measures of association

What’s Next: In subsequent sections, we’ll develop:

  • Model diagnostics for checking regression assumptions

  • Hypothesis tests for slope and intercept parameters

  • Confidence intervals for regression parameters and predictions

  • Prediction intervals for new observations

  • Robustness considerations for when assumptions are violated

The progression follows our familiar pattern: establish the model, estimate parameters, assess fit, quantify associations, then conduct formal inference with proper uncertainty quantification.

Key Takeaways 📝

  1. The simple linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\) formalizes the linear relationship between quantitative variables with clear assumptions about error terms.

  2. Least squares estimation provides optimal estimates for slope and intercept by minimizing the sum of squared residuals, with formulas based on sample covariance and variance.

  3. The regression line always passes through \((\bar{x}, \bar{y})\) and provides average changes in response per unit change in explanatory variable.

  4. ANOVA decomposition splits total variation into explained (SSR) and unexplained (SSE) components, with degrees of freedom that sum appropriately.

  5. The coefficient of determination \(R^2 = \text{SSR}/\text{SST}\) measures the proportion of variation explained by the regression, ranging from 0 to 1.

  6. R-squared has important limitations: it doesn’t guarantee linearity, isn’t robust to outliers, and doesn’t automatically ensure good predictions.

  7. Model assessment requires multiple tools: scatter plots, residual analysis, \(R^2\), and assumption checking work together to evaluate model appropriateness.

  8. Residuals \(e_i = y_i - \hat{y}_i\) provide estimates of error terms and enable variance estimation through \(s^2 = \text{SSE}/(n-2)\).

  9. Computational shortcuts like \(\text{SSR} = b_1 \cdot S_{xy}\) make ANOVA calculations efficient while maintaining conceptual clarity.

  10. Parameter interpretation requires context: the slope represents average change per unit increase in X, while the intercept may or may not have practical meaning depending on whether X = 0 is meaningful.

Exercises

  1. Model Components and Assumptions: For the regression model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\):

    1. Explain what each component represents in the context of studying the relationship between hours studied and exam scores

    2. List the four key assumptions and explain why each is important

    3. What does it mean for the error terms to be “iid normal”?

    4. How would you check each assumption using real data?

  2. Least Squares Calculation: Given the following data on house size (X, in 1000 sq ft) and price (Y, in $1000s):

    | House | Size (X) | Price (Y) |
    |-------|----------|-----------|
    | 1     | 1.2      | 180       |
    | 2     | 1.8      | 220       |
    | 3     | 2.1      | 280       |
    | 4     | 1.5      | 200       |
    | 5     | 2.4      | 320       |

    1. Calculate \(\bar{x}\), \(\bar{y}\), \(\sum x_i y_i\), and \(\sum x_i^2\)

    2. Find the least squares estimates \(b_0\) and \(b_1\)

    3. Write the fitted regression equation

    4. Interpret the slope and intercept in context

    5. Verify that the line passes through \((\bar{x}, \bar{y})\)

  3. ANOVA Table Construction: Using the house price data from Exercise 2:

    1. Calculate SST, SSR, and SSE

    2. Complete the ANOVA table with degrees of freedom and mean squares

    3. Calculate \(R^2\) and interpret its meaning

    4. Estimate the error standard deviation \(s\)

  4. Residual Analysis: For the house price regression:

    1. Calculate the fitted value and residual for House 3

    2. What does this residual tell you about the model’s performance for this house?

    3. If all residuals were positive, what would this suggest about the model?

  5. R-squared Interpretation: Consider three different regression analyses:

    Scenario A: Predicting height from shoe size with \(R^2 = 0.85\), \(\text{SSE} = 25\)

    Scenario B: Predicting income from education with \(R^2 = 0.65\), \(\text{SSE} = 10,000,000\)

    Scenario C: Predicting test scores from study hours with \(R^2 = 0.40\), \(\text{SSE} = 100\)

    1. Which scenario shows the strongest linear relationship?

    2. Which scenario might be most useful for prediction? Explain your reasoning

    3. What additional information would help you better evaluate these models?

  6. Computational Formulas: Show algebraically that:

    1. \(S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}\)

    2. The regression line always passes through \((\bar{x}, \bar{y})\)

    3. \(\text{SST} = \text{SSR} + \text{SSE}\) (hint: add and subtract \(\hat{y}_i\) in the SST expression)

  7. Model Comparison: Two students fit different models to the same dataset:

    Student A: \(\hat{y} = 12 + 3.5x\) with \(R^2 = 0.78\), \(s = 4.2\)

    Student B: \(\hat{y} = 8 + 4.1x\) with \(R^2 = 0.81\), \(s = 3.9\)

    1. If both used the same data and method, why might they get different results?

    2. Which model appears better based on the given information?

    3. What could cause such different results from the same dataset?

  8. Assumption Violations: For each scenario, identify which regression assumption might be violated and explain the potential consequences:

    1. Studying crop yield vs. rainfall, but some fields received fertilizer and others didn’t

    2. Predicting salary from years of experience, but the data includes both part-time and full-time workers

    3. Analyzing test scores vs. study time, but students were allowed to collaborate

    4. Examining house prices vs. size, but the data spans both urban and rural areas

  9. Practical Applications: A researcher studying the relationship between advertising spend (X, in $1000s) and sales (Y, in $1000s) obtains:

    • \(\hat{y} = 42 + 3.2x\)

    • \(R^2 = 0.67\)

    • \(s = 15\)

    • \(n = 25\)

    1. Interpret the slope and intercept in business terms

    2. What does the \(R^2\) value tell management about the advertising-sales relationship?

    3. If the company spends $8,000 on advertising, what sales level does the model predict?

    4. Given \(s = 15\), how much confidence should management have in predictions?

  10. Critical Thinking: A news article claims: “Study finds strong relationship between ice cream sales and drowning deaths (R² = 0.89). Ice cream consumption causes drowning!”

    1. What statistical concept is being misunderstood in this claim?

    2. Explain what the high \(R^2\) actually tells us

    3. What might be a more plausible explanation for this relationship?

    4. How could you design a study to investigate causation rather than just association?

  11. Advanced Calculation: Using the computational formula \(\text{SSR} = b_1 \cdot S_{xy}\):

    1. Prove that \(\text{SSR}\) is always non-negative

    2. Explain why \(b_1\) and \(S_{xy}\) always have the same sign

    3. Under what conditions would \(\text{SSR} = 0\)?

  12. Real Data Analysis: Collect data on a topic of interest (e.g., study hours vs. GPA, car age vs. price, etc.) with at least 10 observations:

    1. Create a scatter plot and assess whether linear regression is appropriate

    2. Calculate the least squares regression line by hand

    3. Construct the ANOVA table and calculate \(R^2\)

    4. Interpret your results in context

    5. Identify any potential outliers or assumption violations

    6. Discuss the practical implications of your findings