Worksheet 20: Introduction to Simple Linear Regression

Learning Objectives 🎯

  • Understand the transition from comparing group means to modeling relationships between quantitative variables

  • Master the concept of least squares estimation and derive estimators for simplified models

  • Apply single-variable calculus to optimize the sum of squared residuals

  • Explore the sampling distributions of regression estimators through simulation

  • Investigate how sample size and error distribution affect estimator properties

  • Implement regression calculations manually and verify results using R

Introduction

Over the past several lessons, you’ve developed a solid foundation in comparing group means under uncertainty:

First, we had inference for a single population mean using the \(Z\) and \(t\) procedure frameworks.

Next, we extended that to comparing two populations, using either independent or paired designs.

Most recently, ANOVA allowed you to compare several population means simultaneously, carefully controlling the family-wise error rate in subsequent post-hoc analyses.

In each of these scenarios, the explanatory variable was categorical (e.g., Treatment A vs. Treatment B), while only the response was quantitative.

But many real-world questions involve an explanatory variable that is also quantitative:

  • Does a patient’s age influence their blood pressure?

  • Does increasing study time actually lead to better exam scores?

  • How does the fuel’s chemical composition affect combustion quality?

These are questions about relationships between two quantitative variables—not fixed group comparisons, but continuous patterns across a spectrum of values.

This leads us to a natural next step: How can we describe, and make inferences about, the form, strength, and direction of a relationship between two numerical measurements?

From ANOVA to Regression: Same Core Ideas, New Application

Although the surface looks different, regression analysis extends many of the ideas you already know:

Each topic from earlier lessons (left of the arrow) translates into a regression counterpart (right of the arrow):

  • Comparing means across groups → modeling the mean response as a function of the explanatory variable \(x\)

  • Pooled variance \(s_p^2=\text{MS}_{\text{E}}\) → error variance \(\text{MS}_{\text{E}}\) around the regression line

  • ANOVA \(F\)-test (is at least one population mean different?) → \(F\)-test for overall linear association

  • \(t\)-test for two groups → \(t\)-test for the slope parameter

  • Confidence intervals for means → confidence and prediction intervals for the response \(Y\) at a given value of the explanatory variable \(x\)

What’s new is the form of the relationship. Rather than comparing the means of a few fixed groups, we now explore how the mean of the response \(Y\) changes with the explanatory variable \(x\), and how to fit that relationship using a simple linear regression line.

The Simple Linear Regression Model

In simple linear regression, we model the relationship between two quantitative variables:

\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

This formulation leads to four key assumptions:

Linearity: The mean response \(E[Y|x]\) is a linear function of the explanatory variable \(x\). This assumption is explicit in how we construct the model and estimate the parameters. Our statistical inference procedure will then be used to test how well our model fits the data.

Independence: We assume that the random errors \(\varepsilon_i\) are independent of one another. Equivalently, for fixed values \(x_i\), the response values \(Y_i\) are independent of each other.

Constant Variance (Homoscedasticity): The variance of the errors is the same across all values of the explanatory variable \(x\). \(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\).

Normality of Errors: The errors \(\varepsilon_i\) are Normally distributed. This assumption is needed for our inference procedures but not for fitting the line itself.
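
To make these assumptions concrete, here is a minimal R sketch that generates data satisfying all four of them. The parameter values (intercept 5, slope 2, error standard deviation 2) are arbitrary choices for illustration, not values used in any exercise below.

# Simulate data consistent with the four assumptions (illustrative values only)
set.seed(1)
n   <- 50
x   <- runif(n, 0, 10)               # explanatory variable
eps <- rnorm(n, mean = 0, sd = 2)    # independent, constant-variance, Normal errors
y   <- 5 + 2 * x + eps               # linearity: E[Y | x] = 5 + 2x

plot(x, y, main = "Simulated data: Y = 5 + 2x + error")
abline(a = 5, b = 2, col = "blue")   # the true mean line E[Y | x]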

Suppose we’ve collected data on two quantitative variables and created a scatterplot. The plot shows a general pattern suggesting that as the explanatory variable \(x\) changes, the response \(Y\) tends to change as well. If a simple linear relationship is reasonable, we would like to summarize it with a single straight line and use that line for prediction.

But immediately we face a challenge: How do we objectively choose the best line?

We need a precise and consistent method that anyone could apply and reach the same result. This is where the idea of an “objective fitting criterion” comes in.

Least Squares

The key idea behind fitting a line is measuring how closely that line represents the actual data points. To evaluate closeness, we consider the residuals: \(e_i = y_i - \hat{y}_i\) (observed – predicted).

Each residual measures how far off our predicted response \(\hat{y}_i\) is from the actual observed response \(y_i\). Clearly, a good line is one where most residuals are small in magnitude.

But we still have multiple sensible options:

  • Minimize the sum of residuals? (Problem: positive and negative residuals would cancel each other out, even if predictions were inaccurate.)

  • Minimize the sum of absolute residuals? (Makes sense, but computationally complex and lacks convenient analytical solutions.)

  • Minimize the largest residual (making the worst prediction as small as possible)? (Intuitive, but highly sensitive to single unusual points and very challenging to calculate.)

Instead, statisticians often minimize the sum of squared residuals:

\[SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Normally, finding the slope and intercept requires tools from multivariate calculus (partial derivatives). But even without multivariate calculus, we can still practice this concept in simpler scenarios. Below, you’ll complete two exercises designed to illustrate the concept of minimizing squared error. Each exercise walks you through the solution step-by-step, using only algebra and basic single-variable calculus.
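
Before deriving anything by hand, it can help to see the minimization numerically. The short R sketch below uses made-up data (not from any exercise in this worksheet) to plot \(SSE\) as a function of a candidate slope for the through-origin model of Question 1 and to locate its minimizer with optimize(); your calculus derivation should produce the same value in closed form.

# SSE as a function of a candidate slope for the model Y = b1 * x (toy data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

sse <- function(b1) sum((y - b1 * x)^2)

# Plot the SSE curve over a grid of candidate slopes
b1_grid <- seq(1, 3, by = 0.01)
plot(b1_grid, sapply(b1_grid, sse), type = "l",
     xlab = "Candidate slope b1", ylab = "SSE")

# Find the minimizing slope numerically
optimize(sse, interval = c(1, 3))$minimum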

Part 1: Deriving Least Squares Estimators

Question 1 (Linear Relationship Through the Origin)

Consider a scenario where the relationship between the explanatory variable \(x\) and the response variable \(Y\) is believed to be linear and the intercept is known to be zero. In other words, the model is:

Model: \(Y_i = \beta_1 x_i + \varepsilon_i\)

a) Write down the formula for the sum of squared residuals \(SSE\) as a function of \(\beta_1\) using the model above.

\[SSE(\beta_1) = \sum_{i=1}^{n} (y_i - \hspace{3cm})^2 =\]




b) Take the derivative of \(SSE\) with respect to the slope parameter \(\beta_1\). Set the derivative equal to zero and solve for \(\beta_1\) as a function of summary statistics computed from your data pairs \((x_i, y_i)\). We denote this estimate either by adding a hat to the parameter, \(\hat{\beta}_1\), or by using the Roman letter \(b_1\).

Find Critical Point:

\[\frac{d(SSE)}{d\beta_1} =\]



Set the derivative equal to 0, put a hat on \(\beta_1\), and solve for \(\hat{\beta}_1\):






c) Verify the solution is a minimum by applying the second derivative test.

Show that this is indeed a minimum:

\[\frac{d^2(SSE)}{d\beta_1^2} =\]



d) Why does minimizing squared error naturally lead to this specific solution? Provide an intuitive interpretation considering:

  • What does the numerator of your estimator capture about the relationship between \(x\) and \(y\)?

  • What does the denominator measure about the explanatory variable?

  • Verify that the units of your estimator make sense.







Question 2 (Quadratic Relationship Through the Origin)

Consider a scenario where the relationship between the explanatory variable \(x\) and the response variable \(Y\) is believed to be quadratic. Suppose further that the linear term is absent (or negligible), and the intercept is known to be zero. In other words, the model is:

Model: \(Y_i = \beta_2 x_i^2 + \varepsilon_i\)

a) Write down the formula for the sum of squared residuals \(SSE\) as a function of \(\beta_2\) using the model above.

\[SSE(\beta_2) =\]



b) Take the derivative of \(SSE\) with respect to the parameter \(\beta_2\). Set the derivative equal to zero and solve for \(\hat{\beta}_2\).

Hint

Consider using the substitution \(u_i = x_i^2\). This renaming has no effect on the derivative, since the \(u_i\) are treated as constants with respect to \(\beta_2\), but it avoids unnecessary computation: the estimator takes the same form as in Question 1, with \(u_i\) in place of \(x_i\).







c) Verify the solution is a minimum by applying the second derivative test.





d) Provide an intuitive interpretation of your result:

  • What does your parameter estimate \(\hat{\beta}_2\) tell you about the relationship between the response \(Y\) and the explanatory variable \(x\)?

  • Why does minimizing squared error naturally lead to this specific quadratic coefficient?

  • How is the estimator related to the scale of the explanatory variable?

  • Verify that the units are appropriate.







Part 2: The General Simple Linear Regression Estimators

In the two examples above, we simplified the optimization problem by forcing the line or curve to pass through the origin (0,0). This allowed us to use only single-variable calculus and basic algebra to find our best-fitting parameters.

However, in practice, the relationship between the explanatory variable \(x\) and the response variable \(Y\) usually requires estimating an intercept as well as a slope. As stated earlier, the simple linear regression model is:

\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

With this model, the sum of squared residuals becomes a function of two parameters:

\[SSE(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\]

Minimizing \(SSE\) with respect to two parameters simultaneously requires more advanced mathematical tools: multivariate calculus (partial derivatives) or linear algebra (solving a system of equations). Using these tools, we derive what are known as the “normal equations,” a system of two equations in two unknowns.
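
For reference, setting both partial derivatives of \(SSE(\beta_0, \beta_1)\) equal to zero gives the normal equations in their usual form:

\[\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i\]
\[\sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2\]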

Least Squares Estimators 📐

The solution leads to the following estimates of the slope and intercept:

\[b_1 = \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}\]
\[b_0 = \hat{\beta}_0 = \bar{y} - b_1 \bar{x}\]

where we define the following summary statistics:

\[S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\]
\[S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2\]
\[S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\]

The process of finding the best-fitting line naturally introduces a mathematical constraint. Specifically, the line fitted by the least squares method always passes through the “center” of your data, the point defined by the sample means, \((\bar{x}, \bar{y})\). This property follows directly from minimizing squared errors: intuitively, the best-fitting line balances the residuals around the data’s central point. Because of this constraint, once you’ve found the optimal slope \(b_1\), the intercept \(b_0\) is uniquely determined. You cannot freely choose one parameter independently of the other.
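
The following brief R sketch, using made-up data rather than any of the exercise data below, shows that the summary-statistic formulas reproduce the estimates from lm() and that the fitted line does pass through \((\bar{x}, \bar{y})\).

# Verify the least squares formulas on toy data
x <- c(2, 4, 5, 7, 9)
y <- c(3.1, 5.8, 6.9, 9.2, 11.5)

x_bar <- mean(x); y_bar <- mean(y)
S_xx  <- sum(x^2) - length(x) * x_bar^2
S_xy  <- sum(x * y) - length(x) * x_bar * y_bar

b1 <- S_xy / S_xx
b0 <- y_bar - b1 * x_bar
c(b0 = b0, b1 = b1)

coef(lm(y ~ x))                      # should agree with b0 and b1 above

# The fitted value at x_bar equals y_bar: the line passes through the center
all.equal(b0 + b1 * x_bar, y_bar)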

Part 3: Sampling Distributions of Regression Estimators

While these formulas give us a way to compute estimates from a single sample, it’s important to understand their statistical properties as random variables. In other words, if we collected many samples under the same conditions, how would these estimators behave?

Question 3 (Simulation Study)

Use the R simulation script provided below to explore the sampling distributions of the regression estimators. Copy the entire script into an R file (e.g., LinearRegressionParameters_Normality_Sim.R), save it, and then source it in R using source("LinearRegressionParameters_Normality_Sim.R").

#-----------------------------------------------------------------
#  Simulation Study: Sampling Distribution of b0 and b1
#-----------------------------------------------------------------
set.seed(123)
N          <- 2000      # replications
n          <- 500       # sample size (CHANGE THIS)
beta0_true <- 5
beta1_true <- 2
sigma      <- 2
err_dist   <- "normal"  # "normal", "uniform", "lognormal" (CHANGE THIS)
#-----------------------------------------------------------------

library(ggplot2)
library(gridExtra)      # for grid.arrange()

# 1. fixed x
x_vec <- seq(0, 10, length.out = n)
x_bar <- mean(x_vec)
S_xx  <- sum((x_vec - x_bar)^2)

# 2. error generator: returns an N x n matrix of mean-zero errors
gen_errors <- function(N, n, dist, sigma){
  if(dist=="normal"){
    matrix(rnorm(N*n,0,sigma), nrow=N)
  } else if(dist=="uniform"){
    rng <- sqrt(12)*sigma   # Uniform(-rng/2, rng/2) has standard deviation sigma
    matrix(runif(N*n,-rng/2,rng/2), nrow=N)
  } else if(dist=="lognormal"){
    sdlog <- sqrt(log(1+sigma^2))
    # subtract the lognormal mean exp(sdlog^2/2) so the errors are centered at 0
    matrix(rlnorm(N*n,0,sdlog) - exp(sdlog^2/2), nrow=N)
  } else stop("unknown dist")
}

E_mat <- gen_errors(N,n,err_dist,sigma)
Y_mean_part <- beta0_true + beta1_true*x_vec
all_Y <- sweep(E_mat,2,Y_mean_part,"+")

# 3. estimates
b0_hat <- b1_hat <- numeric(N)
for(i in 1:N){
  y <- all_Y[i,]; y_bar <- mean(y)
  b1_hat[i] <- sum((x_vec-x_bar)*(y-y_bar))/S_xx
  b0_hat[i] <- y_bar - b1_hat[i]*x_bar
}

# 4. plotting helpers -------------------------------------------------
hist_plot <- function(est, true, title_lab){
  df <- data.frame(est=est)
  ggplot(df,aes(est))+
    geom_histogram(aes(y=after_stat(density)),
                   bins=24,fill="grey",col="black")+
    geom_density(col="red",linewidth=1)+
    stat_function(fun=dnorm,
                  args=list(mean=mean(est),sd=sd(est)),
                  col="blue",linewidth=1)+
    geom_vline(xintercept=true,col="darkgreen",linewidth=1)+
    labs(title=title_lab,
         subtitle=paste0("True value = ",true),
         x="Estimate",y="Density")+
    theme(plot.title=element_text(size=11))
}

qq_plot <- function(est,title_lab){
  df <- data.frame(est=est)
  ggplot(df,aes(sample=est))+
    geom_qq(size=.6)+
    geom_qq_line(col="red",linewidth=1)+
    labs(title=title_lab,x="Theoretical Quantiles",
         y="Sample Quantiles")+
    theme(plot.title=element_text(size=11))
}

p1 <- hist_plot(b0_hat,beta0_true,"Histogram of b0")
p2 <- qq_plot(b0_hat,"QQ-Plot of b0")
p3 <- hist_plot(b1_hat,beta1_true,"Histogram of b1")
p4 <- qq_plot(b1_hat,"QQ-Plot of b1")

# 5. arrange 2x2 grid
grid.arrange(p1,p2,p3,p4,nrow=2)

# 6. quick numeric summary
cat("--------------------------------------------------\n")
cat("Simulation summary  N =",N,"  n =",n,
    "  error dist =",err_dist,"\n")
cat(sprintf("b0: mean = %.3f   sd = %.3f\n",
            mean(b0_hat),sd(b0_hat)))
cat(sprintf("b1: mean = %.3f   sd = %.3f\n",
            mean(b1_hat),sd(b1_hat)))

Set the simulation parameters as described below. For each case, modify the parameters in the script, save the file, and source it again.

Case 1 (Normal Errors):

  • Set err_dist <- "normal"

  • Set sample size \(n = 10, 30, 100, 500\)

Case 2 (Uniform Errors):

  • Set err_dist <- "uniform"

  • Set sample size \(n = 10, 30, 100, 500\)

Case 3 (Lognormal Errors):

  • Set err_dist <- "lognormal"

  • Set sample size \(n = 10, 30, 100, 500\)

Run the simulation and generate plots. Examine both the histograms and QQ-plots carefully.
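
For example, Case 2 with sample size \(n = 30\) corresponds to editing the two flagged lines in the parameter block at the top of the script to read:

n          <- 30        # sample size (CHANGE THIS)
err_dist   <- "uniform" # "normal", "uniform", "lognormal" (CHANGE THIS)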

a) (Bias) Describe clearly what happens to the sampling distributions of the intercept \(b_0\) and slope \(b_1\) as the sample size \(n\) increases under each error distribution (Normal, Uniform, Lognormal).

  • (Asymptotically Unbiased) Do the average estimates approach the true intercept and slope as the sample size grows?

  • Do the estimators appear unbiased at small \(n\)?






b) (Spread) How does increasing the sample size influence the variability (spread) of the sampling distributions?





c) (Normality) Are the estimators approximately Normally distributed for small \(n\)? What about large \(n\)?





d) Recall that the CLT states that the sampling distributions of sample means and sums approach Normality as the sample size grows, regardless of the distribution of the original data (provided that distribution has finite variance). Based on your simulations, comment specifically on whether the CLT appears to hold for the slope \(b_1\) and intercept \(b_0\) estimators under each error distribution.






e) Do you notice any important differences in the sampling distributions of the slope \(b_1\) and intercept \(b_0\)? Are there notable differences in how quickly each estimator approaches Normality?






Part 4: Practice Problems

You’ve now seen how regression models are built, how the least-squares line is estimated, and how the slope and intercept estimators behave under different conditions. In this final section, you’ll put these concepts into practice by working through two scenarios.

Question 4 (Bird Wing Length Study)

A biologist investigates how the weight \(x\) of a certain species of bird affects its average wing length \(Y\). A random sample of 8 birds provides the following data:

Weight (grams)   Wing Length (cm)
130.8            24.9
125.8            24.5
155.2            27.1
105.6            22.3
146.9            25.4
148.4            26.5
181.2            32.4
137.0            25.7

a) Create a sketch of the corresponding scatter plot by hand.















b) Calculate the estimates of the slope \(b_1\) and intercept \(b_0\) manually.

First, compute the necessary summary statistics:

\[n = \hspace{1cm}, \quad \bar{x} = \hspace{2cm}, \quad \bar{y} = \hspace{2cm}\]
\[\sum_{i=1}^n x_i^2 = \hspace{2cm}, \quad \sum_{i=1}^n y_i^2 = \hspace{2cm}, \quad \sum_{i=1}^n x_i y_i = \hspace{2cm}\]
\[S_{xx} = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = \hspace{4cm}\]
\[S_{xy} = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = \hspace{4cm}\]

Now compute the estimates:

\[b_1 = \frac{S_{xy}}{S_{xx}} = \hspace{4cm}\]
\[b_0 = \bar{y} - b_1 \bar{x} = \hspace{4cm}\]

The estimated regression line: \(\hat{y} = \hspace{6cm}\)

c) Add the least squares regression line to your scatter plot.

# After completing your calculations, verify with R
library(ggplot2)

bird_data <- data.frame(
  Weight = c(130.8, 125.8, 155.2, 105.6, 146.9, 148.4, 181.2, 137.0),
  WingLength = c(24.9, 24.5, 27.1, 22.3, 25.4, 26.5, 32.4, 25.7)
)

# Enter your calculated values
your_b1 <- ___  # Fill in your calculated slope
your_b0 <- ___  # Fill in your calculated intercept

# Compare to R's calculation
bird_lm <- lm(WingLength ~ Weight, data = bird_data)

cat("Your estimates:\n")
cat("  b0 =", your_b0, "\n")
cat("  b1 =", your_b1, "\n")
cat("\nR's estimates:\n")
print(coef(bird_lm))

# Check if your estimates match
if (abs(your_b0 - coef(bird_lm)[1]) < 0.01 &&
    abs(your_b1 - coef(bird_lm)[2]) < 0.001) {
  cat("\n✓ Your calculations are correct!\n")
} else {
  cat("\n✗ Check your calculations\n")
}

# Visualize the scatter plot with regression line
ggplot(bird_data, aes(x = Weight, y = WingLength)) +
  geom_point(size = 4) +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE,
              linewidth = 1.5, color = "blue") +
  labs(title = "Scatterplot of Bird Wing Length vs. Weight",
       x = "Weight (grams)",
       y = "Wing Length (cm)") +
  theme_minimal() +
  theme(axis.title = element_text(size = 14, face = "bold"),
        axis.text = element_text(size = 12),
        plot.title = element_text(size = 16, face = "bold"))

Question 5 (Crop Yield Study)

An agronomist examines how fertilizer quantity (kg/acre) influences crop yield (bushels/acre). Data from 30 fields were summarized into the following statistics:

\[n = 30, \quad \bar{x} = 58.02, \quad \sum_{i=1}^n x_i^2 = 112584.1\]
\[\bar{y} = 86.32, \quad \sum_{i=1}^n x_i y_i = 159083.4\]

Calculate the estimates of the slope \(b_1\) and intercept \(b_0\) manually.

Calculate \(S_{xx}\):

\[S_{xx} = \sum_{i=1}^n x_i^2 - n\bar{x}^2 =\]


Calculate \(S_{xy}\):

\[S_{xy} = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} =\]


Calculate the slope:

\[b_1 = \frac{S_{xy}}{S_{xx}} =\]


Calculate the intercept:

\[b_0 = \bar{y} - b_1 \bar{x} =\]


The estimated regression line: \(\hat{y} = \hspace{6cm}\)

Interpretation: For every additional kg/acre of fertilizer applied, the predicted crop yield changes by approximately ____ bushels/acre.

# Verification code
# Enter your calculated values
your_b1 <- ___  # Fill in your calculated slope
your_b0 <- ___  # Fill in your calculated intercept

# Given summary statistics
n <- 30
x_bar <- 58.02
y_bar <- 86.32
sum_x2 <- 112584.1
sum_xy <- 159083.4

# Calculate using summary statistics
S_xx <- sum_x2 - n * x_bar^2
S_xy <- sum_xy - n * x_bar * y_bar

b1_check <- S_xy / S_xx
b0_check <- y_bar - b1_check * x_bar

cat("Intermediate calculations:\n")
cat("  S_xx =", round(S_xx, 2), "\n")
cat("  S_xy =", round(S_xy, 2), "\n")
cat("\nExpected estimates:\n")
cat("  b0 =", round(b0_check, 4), "\n")
cat("  b1 =", round(b1_check, 4), "\n")

# Check your work
if (abs(your_b0 - b0_check) < 0.1 && abs(your_b1 - b1_check) < 0.01) {
  cat("\n✓ Your calculations are correct!\n")
} else {
  cat("\n✗ Check your calculations\n")
}

Key Takeaways

Summary 📝

  • Simple linear regression models the relationship between two quantitative variables using the equation \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)

  • Least squares estimation minimizes the sum of squared residuals to find the best-fitting line

  • The least squares estimators are \(b_1 = S_{xy}/S_{xx}\) for the slope and \(b_0 = \bar{y} - b_1\bar{x}\) for the intercept

  • The regression line always passes through \((\bar{x}, \bar{y})\), the center of the data

  • Sampling distributions of the estimators \(b_0\) and \(b_1\) approach Normality as sample size increases (CLT applies)

  • The estimators are unbiased for any sample size, meaning \(E(b_0) = \beta_0\) and \(E(b_1) = \beta_1\)

  • Variability decreases with larger sample sizes, leading to more precise estimates

  • The slope \(b_1\) tends to converge to Normality faster than the intercept \(b_0\)

  • R verification allows you to check your manual calculations and visualize the fitted relationship