Section 3.4 Linear Models

Linear regression is one of the central tools in statistical modeling. Developed in the early 19th century in connection with astronomical and geodetic problems, least squares is often associated with Legendre and Gauss, and it provided a systematic way to estimate relationships from noisy measurements. Today, linear regression remains a default starting point across the sciences and a core building block for many ideas in modern machine learning. The goal is not simply to run a routine, but to understand the model, the assumptions it encodes, and what its estimates and diagnostics do and do not justify.

This section develops linear regression from first principles. We begin with the matrix calculus needed to derive estimators—building up the rules for differentiating vectors, matrices, and quadratic forms. We then derive the ordinary least squares (OLS) estimator via two complementary approaches: the calculus approach (minimizing sum of squared errors) and the geometric approach (projection onto the column space). These perspectives illuminate different aspects of the same solution.

With the estimator in hand, we prove the Gauss-Markov theorem—that OLS is the Best Linear Unbiased Estimator (BLUE) under specific conditions. We examine each condition carefully, understanding what happens when it fails. We then develop the distributional theory under normality, enabling t-tests, F-tests, and confidence intervals. Finally, we address diagnostics and remedies for when assumptions don’t hold.

Road Map 🧭

Develop: Matrix calculus rules for differentiating vectors, quadratic forms, and matrix products
Derive: OLS estimator via calculus (normal equations) and geometry (projection)
Prove: Gauss-Markov theorem establishing OLS as BLUE under classical assumptions
Apply: Distributional results for inference: t-tests, F-tests, confidence intervals
Evaluate: Diagnostics for assumption violations and remedial measures

Matrix Calculus Foundations

Before deriving the least squares estimator, we need tools for differentiating expressions involving vectors and matrices. This section builds up the necessary calculus systematically, starting from scalar derivatives and extending to the multivariate case. (A condensed reference for these rules — including the gradient identities for linear and quadratic forms — appears in Appendix B.)

Why Matrix Calculus?

In scalar calculus, finding the minimum of \(f(x) = (y - x)^2\) is straightforward: take the derivative, set it to zero, solve for \(x\). With \(n\) observations and \(p\) parameters, our objective becomes:

\[f(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Minimizing this requires differentiating with respect to a vector \(\boldsymbol{\beta}\). Matrix calculus provides the systematic framework for such operations.

Notation Conventions

We adopt the following conventions throughout:

Vectors are column vectors, denoted by bold lowercase: \(\mathbf{x}, \mathbf{y}, \boldsymbol{\beta}\)
Matrices are denoted by bold uppercase: \(\mathbf{X}, \mathbf{A}, \boldsymbol{\Sigma}\)
Scalars are italic lowercase: \(a, b, \sigma^2\)
Transpose: \(\mathbf{x}^\top\) (row vector), \(\mathbf{A}^\top\)
Dimensions: \(\mathbf{y} \in \mathbb{R}^n\), \(\boldsymbol{\beta} \in \mathbb{R}^p\), \(\mathbf{X} \in \mathbb{R}^{n \times p}\)

The gradient of a scalar function \(f: \mathbb{R}^p \to \mathbb{R}\) with respect to vector \(\boldsymbol{\beta}\) is:

\[\begin{split}\nabla_{\boldsymbol{\beta}} f = \begin{pmatrix} \frac{\partial f}{\partial \beta_1} \\ \frac{\partial f}{\partial \beta_2} \\ \vdots \\ \frac{\partial f}{\partial \beta_p} \end{pmatrix} \in \mathbb{R}^p\end{split}\]

This is a column vector of the same dimension as \(\boldsymbol{\beta}\).

Scalar-by-Vector Derivatives: Building the Rules

We develop the key differentiation rules step by step, starting from component-wise analysis.

Rule 1: Linear function \(f(\boldsymbol{\beta}) = \mathbf{a}^\top \boldsymbol{\beta}\) where \(\mathbf{a} \in \mathbb{R}^p\) is constant.

Component form:

\[f(\boldsymbol{\beta}) = \mathbf{a}^\top \boldsymbol{\beta} = \sum_{j=1}^p a_j \beta_j\]

Taking the partial derivative with respect to \(\beta_k\):

\[\frac{\partial f}{\partial \beta_k} = \frac{\partial}{\partial \beta_k} \sum_{j=1}^p a_j \beta_j = a_k\]

Assembling all partials into a vector:

\[\boxed{\nabla_{\boldsymbol{\beta}} (\mathbf{a}^\top \boldsymbol{\beta}) = \mathbf{a}}\]

By symmetry of the dot product, \(\mathbf{a}^\top \boldsymbol{\beta} = \boldsymbol{\beta}^\top \mathbf{a}\), so:

\[\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{a}) = \mathbf{a}\]

Rule 2: Quadratic form \(f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}\) where \(\mathbf{A} \in \mathbb{R}^{p \times p}\) is constant.

Component form:

\[f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta} = \sum_{i=1}^p \sum_{j=1}^p \beta_i A_{ij} \beta_j\]

Taking the partial derivative with respect to \(\beta_k\) requires identifying all terms containing \(\beta_k\):

Terms where \(i = k\) (varying \(j\)): \(\sum_{j=1}^p \beta_k A_{kj} \beta_j\)
Terms where \(j = k\) (varying \(i\)): \(\sum_{i=1}^p \beta_i A_{ik} \beta_k\)

For the first group:

\[\frac{\partial}{\partial \beta_k} \sum_{j=1}^p \beta_k A_{kj} \beta_j = \sum_{j=1}^p A_{kj} \beta_j = [\mathbf{A}\boldsymbol{\beta}]_k\]

For the second group:

\[\frac{\partial}{\partial \beta_k} \sum_{i=1}^p \beta_i A_{ik} \beta_k = \sum_{i=1}^p A_{ik} \beta_i = [\mathbf{A}^\top\boldsymbol{\beta}]_k\]

Combining:

\[\frac{\partial f}{\partial \beta_k} = [\mathbf{A}\boldsymbol{\beta}]_k + [\mathbf{A}^\top\boldsymbol{\beta}]_k\]

Assembling into vector form:

\[\boxed{\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}) = (\mathbf{A} + \mathbf{A}^\top)\boldsymbol{\beta}}\]

Special case: If \(\mathbf{A}\) is symmetric (\(\mathbf{A} = \mathbf{A}^\top\)):

\[\boxed{\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta} \quad \text{(when } \mathbf{A} = \mathbf{A}^\top)}\]

This is the result we’ll use most often, since \(\mathbf{X}^\top\mathbf{X}\) is always symmetric.

Rule 3: Matrix-vector product \(f(\boldsymbol{\beta}) = \mathbf{A}\boldsymbol{\beta}\) where the result is a vector, not a scalar.

Here \(f: \mathbb{R}^p \to \mathbb{R}^n\), so the derivative is a matrix called the Jacobian:

\[\frac{\partial (\mathbf{A}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^\top} = \mathbf{A} \in \mathbb{R}^{n \times p}\]

The \((i, j)\) element is \(\frac{\partial [\mathbf{A}\boldsymbol{\beta}]_i}{\partial \beta_j} = A_{ij}\).

Summary of Key Rules

Table 18 Matrix Calculus Rules for Scalar Functions
Expression \(f(\boldsymbol{\beta})\)	Gradient \(\nabla_{\boldsymbol{\beta}} f\)	Condition
\(\mathbf{a}^\top \boldsymbol{\beta}\) or \(\boldsymbol{\beta}^\top \mathbf{a}\)	\(\mathbf{a}\)	\(\mathbf{a}\) constant
\(\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}\)	\((\mathbf{A} + \mathbf{A}^\top)\boldsymbol{\beta}\)	\(\mathbf{A}\) constant
\(\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}\)	\(2\mathbf{A}\boldsymbol{\beta}\)	\(\mathbf{A}\) symmetric
\(\\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\\|^2\)	\(-2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\)	See derivation below

The Chain Rule for Vectors

For composite functions, we need the chain rule. If \(f: \mathbb{R}^n \to \mathbb{R}\) and \(\mathbf{g}: \mathbb{R}^p \to \mathbb{R}^n\), then for \(h(\boldsymbol{\beta}) = f(\mathbf{g}(\boldsymbol{\beta}))\):

\[\nabla_{\boldsymbol{\beta}} h = \left(\frac{\partial \mathbf{g}}{\partial \boldsymbol{\beta}^\top}\right)^\top \nabla_{\mathbf{g}} f\]

This is the matrix form of the familiar chain rule: derivative of outer function times derivative of inner function.

The Linear Model

With matrix calculus tools in hand, we now formalize the linear regression model.

Model Specification

We observe \(n\) data points \((y_i, \mathbf{x}_i)\) where \(y_i \in \mathbb{R}\) is the response (dependent variable) and \(\mathbf{x}_i \in \mathbb{R}^p\) is the vector of predictors (independent variables, features, covariates).

The linear model posits:

(52)\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \quad i = 1, \ldots, n\]

where:

\(\beta_0, \beta_1, \ldots, \beta_{p-1}\) are unknown parameters (coefficients)
\(\varepsilon_i\) is the error term (disturbance, noise)

Matrix notation: Define the design matrix \(\mathbf{X}\) and response vector \(\mathbf{y}\):

\[\begin{split}\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{pmatrix} \in \mathbb{R}^{n \times p}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n\end{split}\]

The first column of \(\mathbf{X}\) is all ones (for the intercept). The parameter vector is \(\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_{p-1})^\top \in \mathbb{R}^p\).

Notation Note

This text indexes coefficients from \(\beta_0\) to \(\beta_{p-1}\), with the intercept as \(\beta_0\) and \(p\) total parameters. Some references instead index from \(\beta_1\) to \(\beta_p\), or write the model with \(p+1\) coefficients explicitly. The two conventions are equivalent; we use \(p\) to keep the dimension of \(\boldsymbol{\beta}\) and the column count of \(\mathbf{X}\) the same symbol throughout.

The model in matrix form:

(53)\[\boxed{\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}}\]

where \(\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^\top\).

What “Linear” Means

The model (53) is linear in the parameters \(\boldsymbol{\beta}\), not necessarily in the predictors. All of the following are linear models:

\[\begin{split}y &= \beta_0 + \beta_1 x + \varepsilon & \text{(simple linear regression)} \\ y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon & \text{(polynomial regression)} \\ y &= \beta_0 + \beta_1 \log(x) + \varepsilon & \text{(log transformation)} \\ y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon & \text{(interaction term)}\end{split}\]

Each involves \(y\) as a linear combination of \(\boldsymbol{\beta}\). The following is not a linear model:

\[y = \beta_0 e^{\beta_1 x} + \varepsilon\]

Here \(y\) is nonlinear in \(\beta_1\).

Interpreting Coefficients

Before diving into estimation, let’s clarify what the coefficients mean:

Partial slopes: In multiple regression, \(\beta_j\) represents the expected change in \(y\) for a one-unit increase in \(x_j\), holding all other predictors fixed. This “ceteris paribus” interpretation is fundamental.

Correlation vs. causation: The OLS coefficient \(\hat{\beta}_j\) measures the association between \(x_j\) and \(y\), not necessarily a causal effect. To interpret \(\beta_j\) causally, we need:

No omitted confounders (variables that affect both \(x_j\) and \(y\))
No reverse causation (\(y\) doesn’t cause \(x_j\))
No measurement error in predictors

These are strong assumptions rarely satisfied in observational data. Experimental design (randomization) or quasi-experimental methods (instrumental variables, regression discontinuity) are needed for causal claims.

Effect of correlation among predictors: When predictors are correlated, “holding other predictors fixed” may describe a scenario rarely observed in the data. The coefficient \(\beta_j\) still has a well-defined mathematical interpretation, but its practical meaning becomes murky. High multicollinearity inflates standard errors, making individual coefficients imprecise even when the overall model predicts well.

Assumptions: The Classical Linear Model

The linear model (52) comes with assumptions about the errors \(\boldsymbol{\varepsilon}\). We distinguish two levels:

Gauss-Markov Assumptions (for OLS to be BLUE):

Assumption (GM1): Linearity

The true relationship is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) for some \(\boldsymbol{\beta} \in \mathbb{R}^p\).

Assumption (GM2): Full Rank

The design matrix \(\mathbf{X}\) has full column rank: \(\text{rank}(\mathbf{X}) = p\).

Equivalently, \(\mathbf{X}^\top\mathbf{X}\) is invertible. This requires \(n \geq p\) and no exact multicollinearity among predictors.

Assumption (GM3): Strict Exogeneity

The errors have zero conditional mean given \(\mathbf{X}\):

\[\mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \mathbf{0}\]

This implies \(\mathbb{E}[\varepsilon_i | \mathbf{X}] = 0\) for each \(i\). Predictors are uncorrelated with errors.

Assumption (GM4): Homoskedastic and Uncorrelated Errors

The errors are homoskedastic and uncorrelated:

\[\text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n\]

This means:

Homoskedasticity: \(\text{Var}(\varepsilon_i | \mathbf{X}) = \sigma^2\) for all \(i\) (constant variance)
No autocorrelation: \(\text{Cov}(\varepsilon_i, \varepsilon_j | \mathbf{X}) = 0\) for \(i \neq j\)

(This condition is sometimes called “spherical errors” because the error covariance is a scaled identity matrix.)

Normality Assumption (for exact inference):

Assumption (GM5): Normal Errors

The errors follow a multivariate normal distribution:

\[\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\]

This implies GM3 and GM4, plus enables exact t-tests and F-tests. Note that this is specifically the spherical normal model \(\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), not merely “some normal covariance”—normality alone does not guarantee homoskedasticity or independence.

The assumptions form a hierarchy:

GM1-GM2: OLS is well-defined (unique solution exists)
GM1-GM4: OLS is BLUE (Gauss-Markov theorem applies)
GM1-GM5: Exact finite-sample inference (t, F distributions)

Ordinary Least Squares: The Calculus Approach

The Ordinary Least Squares (OLS) estimator minimizes the sum of squared residuals:

(54)\[\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

We find \(\hat{\boldsymbol{\beta}}\) that minimizes RSS by taking the derivative and setting it to zero.

Expanding the Objective Function

First, expand the quadratic form:

\[\begin{split}\text{RSS}(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\end{split}\]

Since \(\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}\) is a scalar, it equals its transpose \(\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}\):

\[\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\]

Taking the Derivative

Using our matrix calculus rules:

\(\nabla_{\boldsymbol{\beta}}(\mathbf{y}^\top\mathbf{y}) = \mathbf{0}\) (constant in \(\boldsymbol{\beta}\))
\(\nabla_{\boldsymbol{\beta}}(2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}) = 2\mathbf{X}^\top\mathbf{y}\) (Rule 1: linear form with \(\mathbf{a} = \mathbf{X}^\top\mathbf{y}\))
\(\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}) = 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\) (Rule 2: quadratic form with symmetric \(\mathbf{A} = \mathbf{X}^\top\mathbf{X}\))

Combining:

\[\nabla_{\boldsymbol{\beta}} \text{RSS} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = 2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})\]

The Normal Equations

Setting the gradient to zero:

\[\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} - \mathbf{X}^\top\mathbf{y} = \mathbf{0}\]

Rearranging:

(55)\[\boxed{\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}}\]

These are the normal equations. The name comes from the fact that the residual vector \(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\) is normal (perpendicular) to the column space of \(\mathbf{X}\)—a fact we’ll prove geometrically.

Two approaches to deriving normal equations: calculus and geometry — Fig. 81 **Figure 3.4.2**: The normal equations derived two ways. (a) The calculus approach: minimize RSS by setting the gradient to zero. (b) The geometric approach: the residual \(\mathbf{e}\) must be perpendicular to \(C(\mathbf{X})\), leading to \(\mathbf{X}^\top\mathbf{e} = \mathbf{0}\).

Solving for the OLS Estimator

Under assumption GM2 (\(\text{rank}(\mathbf{X}) = p\)), the matrix \(\mathbf{X}^\top\mathbf{X}\) is invertible. Multiplying both sides by \((\mathbf{X}^\top\mathbf{X})^{-1}\):

(56)\[\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}\]

This is the OLS estimator—the central result of regression analysis.

Verifying the Minimum

We found a critical point. To confirm it’s a minimum (not maximum or saddle), examine the second derivative (Hessian):

\[\frac{\partial^2 \text{RSS}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^\top} = 2\mathbf{X}^\top\mathbf{X}\]

Since \(\mathbf{X}^\top\mathbf{X}\) is positive semi-definite (for any \(\mathbf{v}\), \(\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2 \geq 0\)), and positive definite when \(\mathbf{X}\) has full column rank, the Hessian is positive definite. Therefore, the critical point is indeed a global minimum.

Example 💡 Simple Linear Regression by Hand

Consider simple regression \(y = \beta_0 + \beta_1 x + \varepsilon\) with \(n = 4\) observations:

\[\begin{split}\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 2.1 \\ 3.9 \\ 6.2 \\ 7.8 \end{pmatrix}\end{split}\]

Step 1: Compute \(\mathbf{X}^\top\mathbf{X}\):

\[\begin{split}\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}\end{split}\]

Step 2: Compute \(\mathbf{X}^\top\mathbf{y}\):

\[\begin{split}\mathbf{X}^\top\mathbf{y} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{pmatrix} \begin{pmatrix} 2.1 \\ 3.9 \\ 6.2 \\ 7.8 \end{pmatrix} = \begin{pmatrix} 20.0 \\ 59.7 \end{pmatrix}\end{split}\]

Step 3: Compute \((\mathbf{X}^\top\mathbf{X})^{-1}\):

\[\begin{split}(\mathbf{X}^\top\mathbf{X})^{-1} = \frac{1}{4 \cdot 30 - 10 \cdot 10} \begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix} = \begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix}\end{split}\]

Step 4: Compute \(\hat{\boldsymbol{\beta}}\):

\[\begin{split}\hat{\boldsymbol{\beta}} = \begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix} \begin{pmatrix} 20.0 \\ 59.7 \end{pmatrix} = \begin{pmatrix} 30 - 29.85 \\ -10 + 11.94 \end{pmatrix} = \begin{pmatrix} 0.15 \\ 1.94 \end{pmatrix}\end{split}\]

The fitted line is \(\hat{y} = 0.15 + 1.94x\).

import numpy as np

# Data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# OLS via normal equations
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)  # More stable than explicit inverse

print("X'X =")
print(XtX)
print(f"\nX'y = {Xty}")
print(f"\nOLS estimates: β̂₀ = {beta_hat[0]:.4f}, β̂₁ = {beta_hat[1]:.4f}")

# Verify with fitted values
y_hat = X @ beta_hat
residuals = y - y_hat
RSS = np.sum(residuals**2)
print(f"\nFitted values: {y_hat}")
print(f"Residuals: {residuals}")
print(f"RSS = {RSS:.4f}")

X'X =
[[ 4 10]
 [10 30]]

X'y = [20.  59.7]

OLS estimates: β̂₀ = 0.1500, β̂₁ = 1.9400

Fitted values: [2.09 4.03 5.97 7.91]
Residuals: [ 0.01 -0.13  0.23 -0.11]
RSS = 0.0820

Ordinary Least Squares: The Geometric Approach

The calculus approach found the OLS estimator by optimization. The geometric approach reveals why the normal equations work and provides deeper insight into what OLS does.

Column Space and Projection

Consider the design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\). Its column space (or range) is:

(57)\[\mathcal{C}(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^p\} \subseteq \mathbb{R}^n\]

This is the set of all linear combinations of the columns of \(\mathbf{X}\)—a \(p\)-dimensional subspace of \(\mathbb{R}^n\) (assuming full column rank).

The model says \(\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}\), so the expected response lies in the column space. The observed \(\mathbf{y}\) typically does not lie in \(\mathcal{C}(\mathbf{X})\) because of errors.

Key insight: OLS finds the point in \(\mathcal{C}(\mathbf{X})\) closest to \(\mathbf{y}\). (Column spaces, orthogonal projections, and projection matrices are reviewed in Appendix B.)

Geometric interpretation of OLS as orthogonal projection onto a hyperplane — Fig. 82 **Figure 3.4.1**: OLS as orthogonal projection. The observed \(\mathbf{y}\) lives in \(\mathbb{R}^n\), while the column space \(\mathcal{C}(\mathbf{X})\) is a p-dimensional hyperplane. The fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are the closest point to \(\mathbf{y}\) in this subspace. The residual \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to the entire column space.

The Orthogonality Principle

The closest point to \(\mathbf{y}\) in a subspace is the orthogonal projection. The projection \(\hat{\mathbf{y}}\) satisfies:

\[\mathbf{y} - \hat{\mathbf{y}} \perp \mathcal{C}(\mathbf{X})\]

This means the residual \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to every vector in the column space. Since any vector in \(\mathcal{C}(\mathbf{X})\) can be written as \(\mathbf{X}\mathbf{a}\) for some \(\mathbf{a} \in \mathbb{R}^p\):

\[(\mathbf{X}\mathbf{a})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^p\]

This requires:

\[\mathbf{a}^\top \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \quad \text{for all } \mathbf{a}\]

The only way this holds for all \(\mathbf{a}\) is if:

\[\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\]

Rearranging:

\[\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\]

We’ve recovered the normal equations (55) geometrically! The name “normal equations” now makes sense: they express that residuals are normal (perpendicular) to the column space.

The Hat Matrix

The hat matrix (or projection matrix) projects \(\mathbf{y}\) onto \(\mathcal{C}(\mathbf{X})\):

(58)\[\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\]

The fitted values are:

\[\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y}\]

The name “hat matrix” comes from putting a hat on \(\mathbf{y}\) to get \(\hat{\mathbf{y}}\).

Properties of the hat matrix:

Symmetric: \(\mathbf{H}^\top = \mathbf{H}\)

Proof: \(\mathbf{H}^\top = [\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top]^\top = \mathbf{X}[(\mathbf{X}^\top\mathbf{X})^{-1}]^\top\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}\)
Idempotent: \(\mathbf{H}^2 = \mathbf{H}\) (projecting twice gives the same result)

Proof: \(\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}\)
Trace: \(\text{tr}(\mathbf{H}) = p\)

Proof: Using \(\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})\): \(\text{tr}(\mathbf{H}) = \text{tr}(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top) = \text{tr}((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}) = \text{tr}(\mathbf{I}_p) = p\)
Leverage: The diagonal elements \(h_{ii} \in [0, 1]\) of the hat matrix (58) measure how much observation \(i\) influences its own fitted value.

The residual vector can also be written using a projection matrix:

\[\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y} = \mathbf{M}\mathbf{y}\]

where \(\mathbf{M} = \mathbf{I} - \mathbf{H}\) projects onto the orthogonal complement of \(\mathcal{C}(\mathbf{X})\).

The hat matrix visualized as a projection operator — Fig. 83 **Figure 3.4.3**: The hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) “puts a hat on y.” (a) Conceptual view: H transforms observed y into fitted ŷ. (b) Matrix visualization showing larger diagonal elements (leverage) for extreme observations. (c) The projection: H maps each \(y_i\) to \(\hat{y}_i\).

Rank, Pseudo-Inverse, and When Solutions Exist

When \(\mathbf{X}\) has full column rank (\(\text{rank}(\mathbf{X}) = p\)), the solution is unique:

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

The matrix \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is the left pseudo-inverse of \(\mathbf{X}\), denoted \(\mathbf{X}^+\). It satisfies \(\mathbf{X}^+\mathbf{X} = \mathbf{I}_p\).

What if \(\mathbf{X}\) doesn’t have full rank?

When \(\text{rank}(\mathbf{X}) = r < p\) (multicollinearity), \(\mathbf{X}^\top\mathbf{X}\) is singular and not invertible. The normal equations \(\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}\) still have solutions, but infinitely many—the solution set is a \((p-r)\)-dimensional affine subspace.

Multicollinearity effects on coefficient estimation — Fig. 84 **Figure 3.4.12**: Multicollinearity inflates standard errors. (a) VIF grows rapidly as correlation approaches 1: \(\text{VIF} = 1/(1-\rho^2)\). (b) With high correlation (\(\rho = 0.95\)), the sampling distribution of \(\hat{\beta}_1\) is much wider than with uncorrelated predictors. (c) The joint confidence region becomes an elongated ellipse when predictors are correlated.

In this case, we use the Moore-Penrose pseudo-inverse \(\mathbf{X}^+\), which selects the minimum-norm solution:

\[\hat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{y}\]

This solution has the smallest \(\|\boldsymbol{\beta}\|\) among all solutions to the normal equations.

In practice: Software like numpy.linalg.lstsq and scipy.linalg.lstsq use SVD-based methods that handle rank deficiency automatically. (Rank and the singular value decomposition are reviewed in Appendix B.)

import numpy as np

# Example with exact multicollinearity
# x3 = x1 + x2 (perfect linear dependence)
X_singular = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 2, 1, 3],
    [1, 1, 2, 3]
])
y = np.array([1, 2, 4, 5])

# Check rank
print(f"Matrix shape: {X_singular.shape}")
print(f"Rank: {np.linalg.matrix_rank(X_singular)}")
print(f"X'X is singular: {np.linalg.det(X_singular.T @ X_singular) == 0}")

# lstsq handles rank deficiency via SVD
beta_hat, residuals, rank, s = np.linalg.lstsq(X_singular, y, rcond=None)
print(f"\nMinimum-norm solution: {beta_hat}")
print(f"Singular values: {s}")

Matrix shape: (4, 4)
Rank: 3
X'X is singular: True

Minimum-norm solution: [9.32295354e-16 2.22044605e-16 1.00000000e+00 1.00000000e+00]
Singular values: [5.76877982 1.41421356 0.84922282 0.        ]

Properties of the OLS Estimator

Having derived the OLS estimator, we now examine its statistical properties.

Convention: Fixed Design

Throughout this section, we treat \(\mathbf{X}\) as fixed (non-random) and condition all statements on \(\mathbf{X}\). Expectations, variances, and distributions are conditional on \(\mathbf{X}\). This “fixed design” viewpoint is standard in classical regression theory. When \(\mathbf{X}\) is random, results hold conditional on the observed \(\mathbf{X}\).

Unbiasedness

Theorem: OLS is Unbiased

Under assumptions GM1-GM3, the OLS estimator is unbiased:

\[\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}\]

Proof:

Starting from the OLS estimator (56) and substituting \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\):

\[\begin{split}\hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\end{split}\]

Taking expectations conditional on \(\mathbf{X}\):

\[\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{0} = \boldsymbol{\beta}\]

where we used GM3: \(\mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \mathbf{0}\). ∎

Variance-Covariance Matrix

Theorem: Variance of OLS Estimator

Under assumptions GM1-GM4:

(59)\[\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

Proof:

From above, \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\).

Let \(\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\). Then:

\[\begin{split}\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \text{Var}(\mathbf{A}\boldsymbol{\varepsilon} | \mathbf{X}) \\ &= \mathbf{A} \text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) \mathbf{A}^\top \\ &= \mathbf{A} (\sigma^2 \mathbf{I}_n) \mathbf{A}^\top \\ &= \sigma^2 \mathbf{A} \mathbf{A}^\top\end{split}\]

Now compute \(\mathbf{A}\mathbf{A}^\top\):

\[\begin{split}\mathbf{A}\mathbf{A}^\top &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{X}[(\mathbf{X}^\top\mathbf{X})^{-1}]^\top \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\end{split}\]

Therefore:

\[\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

∎

Interpreting the variance formula (59):

The variance of the \(j\)-th coefficient is:

\[\text{Var}(\hat{\beta}_j | \mathbf{X}) = \sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}\]

The theoretical standard deviation is:

\[\text{SD}(\hat{\beta}_j) = \sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}\]

In practice, we replace \(\sigma\) with its estimate \(s\) (see below).

The Gauss-Markov Theorem

The Gauss-Markov theorem is one of the most important results in regression theory. It establishes OLS as the Best Linear Unbiased Estimator (BLUE).

What BLUE Means

Let’s unpack each word:

Estimator: A function of the data that produces an estimate of the parameter
Unbiased: \(\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\) (hits the target on average)
Linear: The estimator is a linear function of \(\mathbf{y}\): \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some matrix \(\mathbf{C}\)
Best: Has the smallest variance among all linear unbiased estimators

Why focus on linear estimators?

Tractability: Linear estimators have simple variance formulas
Sufficiency: For normal errors, linear estimators can be optimal even among nonlinear ones
Robustness considerations: Nonlinear estimators may be more efficient under specific distributional assumptions but fail badly when those assumptions are wrong

The Theorem Statement

Theorem: Gauss-Markov

Under assumptions GM1-GM4 (linearity, full rank, strict exogeneity, spherical errors), the OLS estimator \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) is BLUE:

For any other linear unbiased estimator \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) of \(\boldsymbol{\beta}\):

\[\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}) \text{ is positive semi-definite}\]

In particular, for any coefficient:

\[\text{Var}(\tilde{\beta}_j | \mathbf{X}) \geq \text{Var}(\hat{\beta}_j | \mathbf{X})\]

OLS achieves the minimum variance among all linear unbiased estimators.

Complete Proof

Step 1: Characterize linear unbiased estimators

Any linear estimator has the form \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some \(p \times n\) matrix \(\mathbf{C}\).

For \(\tilde{\boldsymbol{\beta}}\) to be unbiased:

\[\mathbb{E}[\tilde{\boldsymbol{\beta}}] = \mathbf{C}\mathbb{E}[\mathbf{y}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \quad \text{for all } \boldsymbol{\beta}\]

This requires \(\mathbf{C}\mathbf{X} = \mathbf{I}_p\).

Step 2: Write any linear unbiased estimator as OLS plus something

Let \(\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) be the OLS matrix, so \(\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}\).

Define \(\mathbf{D} = \mathbf{C} - \mathbf{A}\). Then any linear unbiased estimator can be written:

\[\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y} = (\mathbf{A} + \mathbf{D})\mathbf{y}\]

Since both \(\mathbf{C}\) and \(\mathbf{A}\) satisfy \(\mathbf{C}\mathbf{X} = \mathbf{I}_p\) and \(\mathbf{A}\mathbf{X} = \mathbf{I}_p\):

\[\mathbf{D}\mathbf{X} = \mathbf{C}\mathbf{X} - \mathbf{A}\mathbf{X} = \mathbf{I}_p - \mathbf{I}_p = \mathbf{0}\]

Step 3: Compute the variance of the alternative estimator

Conditional on \(\mathbf{X}\), the variance of any linear estimator \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) is:

\[\begin{split}\text{Var}(\tilde{\boldsymbol{\beta}} | \mathbf{X}) &= \text{Var}((\mathbf{A} + \mathbf{D})\mathbf{y} | \mathbf{X}) \\ &= (\mathbf{A} + \mathbf{D}) \text{Var}(\mathbf{y} | \mathbf{X}) (\mathbf{A} + \mathbf{D})^\top \\ &= \sigma^2 (\mathbf{A} + \mathbf{D})(\mathbf{A} + \mathbf{D})^\top \\ &= \sigma^2 (\mathbf{A}\mathbf{A}^\top + \mathbf{A}\mathbf{D}^\top + \mathbf{D}\mathbf{A}^\top + \mathbf{D}\mathbf{D}^\top)\end{split}\]

Step 4: Show the cross-terms vanish

Compute \(\mathbf{A}\mathbf{D}^\top\):

\[\mathbf{A}\mathbf{D}^\top = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{D}^\top\]

Now, \(\mathbf{D}\mathbf{X} = \mathbf{0}\) implies \(\mathbf{X}^\top\mathbf{D}^\top = \mathbf{0}\). Therefore:

\[\mathbf{A}\mathbf{D}^\top = (\mathbf{X}^\top\mathbf{X})^{-1} \cdot \mathbf{0} = \mathbf{0}\]

Similarly, \(\mathbf{D}\mathbf{A}^\top = (\mathbf{A}\mathbf{D}^\top)^\top = \mathbf{0}\).

Step 5: Conclude

\[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{A}\mathbf{A}^\top + \mathbf{D}\mathbf{D}^\top) = \text{Var}(\hat{\boldsymbol{\beta}}) + \sigma^2 \mathbf{D}\mathbf{D}^\top\]

Since \(\mathbf{D}\mathbf{D}^\top\) is positive semi-definite (for any \(\mathbf{v}\), \(\mathbf{v}^\top\mathbf{D}\mathbf{D}^\top\mathbf{v} = \|\mathbf{D}^\top\mathbf{v}\|^2 \geq 0\)):

\[\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 \mathbf{D}\mathbf{D}^\top \succeq \mathbf{0}\]

with equality if and only if \(\mathbf{D} = \mathbf{0}\), i.e., \(\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}\). ∎

What Happens When Assumptions Fail

Each Gauss-Markov assumption is critical. Here’s what breaks when each fails:

Table 19 Consequences of Gauss-Markov Assumption Violations
Assumption	Violation Example	Consequence for OLS	Remedy
GM1 Linearity	True model is \(y = \beta_0 e^{\beta_1 x} + \varepsilon\) but we fit \(y = \beta_0 + \beta_1 x\)	OLS is biased and inconsistent; converges to wrong quantity as \(n \to \infty\)	Re-specify the model; transform predictors or add polynomial terms if misspecification is mild; nonlinear regression methods are beyond the scope of this chapter
GM2 Full Rank	Perfect multicollinearity: \(x_3 = 2x_1 + x_2\)	\(\mathbf{X}^\top\mathbf{X}\) is singular; OLS estimator not unique	Remove redundant predictors; use ridge regularization (covered in Regularization: Ridge and LASSO later in this section)
GM3 Exogeneity	Excluded variable correlated with included predictor (e.g., omitted confounder)	Slope estimates biased and inconsistent; a constant nonzero mean in \(\varepsilon\) is absorbed by the intercept and is harmless	Include the omitted variable if observable; instrumental variables and fixed effects are remedies when it is not (beyond scope of this chapter)
GM4a Homoskedasticity	Variance increases with \(x\): \(\text{Var}(\varepsilon_i) = \sigma^2 x_i^2\)	OLS remains unbiased but standard errors are wrong; \(t\)-tests and CIs have incorrect size	Robust (sandwich) standard errors (HC0–HC3); covered in Robust Standard Errors later in this section
GM4b No autocorrelation	Time series: \(\varepsilon_t = \rho\varepsilon_{t-1} + u_t\)	OLS remains unbiased but standard errors underestimate variability under positive autocorrelation; \(t\)-tests invalid	Heteroskedasticity and autocorrelation consistent (HAC) standard errors; GLS and time-series methods are beyond this chapter’s scope

Common Pitfall ⚠️ Endogeneity

The most serious violation is GM3 (exogeneity): the requirement that \(\mathbb{E}[\varepsilon_i | \mathbf{X}] = 0\) for all \(i\). This “conditional mean zero” condition says that knowing \(\mathbf{X}\) tells you nothing about the expected error.

When \(\mathbb{E}[\varepsilon | \mathbf{X}] \neq \mathbf{0}\):

OLS is biased even in large samples (inconsistent)
No amount of data fixes the problem
Robust standard errors do not help: They correct for heteroskedasticity, not bias. If your coefficients are biased, having more precise standard errors for the wrong value doesn’t help!

Common causes: omitted variables correlated with included predictors, measurement error in \(\mathbf{X}\), simultaneity (reverse causation).

Detection is difficult since we don’t observe \(\varepsilon\). Economic theory, domain knowledge, and careful study design are essential. Remedies include instrumental variables and natural experiments.

Estimating the Error Variance

The variance formula \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\) depends on the unknown \(\sigma^2\). We need to estimate it.

The Residual Sum of Squares

Define the residuals as \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\).

The residual sum of squares (RSS) is:

\[\text{RSS} = \mathbf{e}^\top\mathbf{e} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

A natural estimator of \(\sigma^2\) might be \(\text{RSS}/n\). However, this is biased because residuals underestimate the true errors.

The Unbiased Estimator

Theorem: Unbiased Estimator of σ²

Under GM1-GM4:

(60)\[s^2 = \frac{\text{RSS}}{n - p} = \frac{\mathbf{e}^\top\mathbf{e}}{n - p}\]

is an unbiased estimator of \(\sigma^2\): \(\mathbb{E}[s^2] = \sigma^2\).

Proof sketch:

The residuals can be written as \(\mathbf{e} = \mathbf{M}\mathbf{y} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \mathbf{M}\boldsymbol{\varepsilon}\) where \(\mathbf{M} = \mathbf{I} - \mathbf{H}\).

(Note: \(\mathbf{M}\mathbf{X}\boldsymbol{\beta} = (\mathbf{I} - \mathbf{H})\mathbf{X}\boldsymbol{\beta} = \mathbf{X}\boldsymbol{\beta} - \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}\boldsymbol{\beta} - \mathbf{X}\boldsymbol{\beta} = \mathbf{0}\))

Then:

\[\text{RSS} = \mathbf{e}^\top\mathbf{e} = \boldsymbol{\varepsilon}^\top\mathbf{M}^\top\mathbf{M}\boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}\]

since \(\mathbf{M}\) is idempotent (\(\mathbf{M}^2 = \mathbf{M}\)) and symmetric.

Taking expectations and using the trace formula for quadratic forms:

\[\mathbb{E}[\text{RSS}] = \mathbb{E}[\boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}] = \text{tr}(\mathbf{M} \cdot \sigma^2\mathbf{I}) = \sigma^2 \text{tr}(\mathbf{M})\]

Since \(\text{tr}(\mathbf{M}) = \text{tr}(\mathbf{I} - \mathbf{H}) = n - \text{tr}(\mathbf{H}) = n - p\):

\[\mathbb{E}[\text{RSS}] = \sigma^2 (n - p)\]

Therefore \(\mathbb{E}[\text{RSS}/(n-p)] = \sigma^2\). ∎

Degrees of freedom interpretation: We estimate \(p\) parameters from \(n\) observations, leaving \(n - p\) degrees of freedom for estimating error variance.

Degrees of freedom visualized showing why we divide by n-p — Fig. 85 **Figure 3.4.5**: Degrees of freedom explained. (a) With n observations, p are “used up” estimating \(\boldsymbol{\beta}\), leaving n-p to estimate \(\sigma^2\). (b) Simulation confirms \(s^2 = \text{RSS}/(n-p)\) is unbiased while \(\text{RSS}/n\) underestimates \(\sigma^2\). (c) Under normality, \(\text{RSS}/\sigma^2 \sim \chi^2_{n-p}\).

Standard Errors of Coefficients

The estimated variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) is:

\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = s^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

The standard error of the \(j\)-th coefficient is:

(61)\[\text{SE}(\hat{\beta}_j) = s \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}\]

import numpy as np
from scipy import stats

def ols_with_inference(X, y):
    """
    Perform OLS regression with inference.

    Note: This function uses explicit matrix formulas for pedagogical clarity.
    For production code, use np.linalg.lstsq() or scipy.linalg.lstsq() which
    are numerically more stable (see Numerical Stability section).

    Returns
    -------
    dict with beta_hat, se, t_stats, p_values, R2, s2
    """
    n, p = X.shape

    # OLS estimates (formula-aligned; use np.linalg.lstsq for production)
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)  # Needed for standard errors
    beta_hat = np.linalg.solve(XtX, X.T @ y)  # More stable than inv() @ y

    # Fitted values and residuals
    y_hat = X @ beta_hat
    residuals = y - y_hat

    # Variance estimation
    RSS = residuals @ residuals
    s2 = RSS / (n - p)
    s = np.sqrt(s2)

    # Standard errors
    var_beta = s2 * XtX_inv
    se = np.sqrt(np.diag(var_beta))

    # t-statistics and p-values
    t_stats = beta_hat / se
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n-p))

    # R-squared
    TSS = np.sum((y - np.mean(y))**2)
    R2 = 1 - RSS / TSS
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p)

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_stats': t_stats,
        'p_values': p_values,
        'R2': R2,
        'R2_adj': R2_adj,
        's2': s2,
        's': s,
        'residuals': residuals,
        'y_hat': y_hat
    }

# Example: mtcars-style data
rng = np.random.default_rng(42)
n = 32
hp = rng.uniform(50, 300, n)  # horsepower
wt = rng.uniform(1.5, 5.5, n)  # weight (1000 lbs)
mpg = 40 - 0.03 * hp - 4 * wt + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), hp, wt])
y = mpg

results = ols_with_inference(X, y)

print("="*60)
print("OLS REGRESSION RESULTS")
print("="*60)
print(f"\nSample size: n = {n}, Predictors: p = {X.shape[1]}")
print(f"R² = {results['R2']:.4f}, Adjusted R² = {results['R2_adj']:.4f}")
print(f"Residual SE: s = {results['s']:.4f}")

print(f"\n{'Coefficient':<12} {'Estimate':>10} {'Std Error':>10} {'t-stat':>10} {'p-value':>10}")
print("-"*54)
names = ['Intercept', 'hp', 'wt']
for name, b, se, t, p in zip(names, results['beta_hat'], results['se'],
                               results['t_stats'], results['p_values']):
    print(f"{name:<12} {b:>10.4f} {se:>10.4f} {t:>10.2f} {p:>10.4f}")

============================================================
OLS REGRESSION RESULTS
============================================================

Sample size: n = 32, Predictors: p = 3
R² = 0.9520, Adjusted R² = 0.9486
Residual SE: s = 1.2165

Coefficient    Estimate  Std Error     t-stat    p-value
------------------------------------------------------
Intercept       40.3548     0.8585      47.01     0.0000
hp              -0.0265     0.0031      -8.61     0.0000
wt              -4.3206     0.2290     -18.87     0.0000

Distributional Results Under Normality

Under the full normal linear model (GM1-GM5), we have exact distributional results enabling precise inference.

Distribution of the OLS Estimator

Theorem: Normality of OLS

Under GM1-GM5:

\[\hat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N}\left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\right)\]

Proof: Since \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) and \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\):

\(\hat{\boldsymbol{\beta}}\) is a linear transformation of normal \(\boldsymbol{\varepsilon}\), hence normal. Mean and variance computed earlier. ∎

Distribution of s²

Theorem: Chi-Square Distribution of RSS

Under GM1-GM5:

\[\frac{(n-p)s^2}{\sigma^2} = \frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}\]

Moreover, \(s^2\) and \(\hat{\boldsymbol{\beta}}\) are independent.

Proof sketch: RSS can be written as \(\boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}/\sigma^2\) where \(\mathbf{M} = \mathbf{I} - \mathbf{H}\) is idempotent with rank \(n - p\). By Cochran’s theorem, this is \(\chi^2_{n-p}\).

The independence follows from the fact that \(\hat{\boldsymbol{\beta}}\) depends on \(\boldsymbol{\varepsilon}\) through \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\) and \(s^2\) through \(\mathbf{M}\boldsymbol{\varepsilon}\). Since \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \sigma^2 \mathbf{M} = \mathbf{0}\) (orthogonality), these linear forms of a multivariate normal are independent. ∎

t-Tests for Individual Coefficients

Combining the above results:

\[\frac{\hat{\beta}_j - \beta_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_j}{s\sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}} \sim t_{n-p}\]

This enables t-tests for individual coefficients:

Null hypothesis: \(H_0: \beta_j = \beta_{j,0}\) (often \(\beta_{j,0} = 0\))
Test statistic: \(t = \frac{\hat{\beta}_j - \beta_{j,0}}{\text{SE}(\hat{\beta}_j)}\)
Decision: Reject at level \(\alpha\) if \(|t| > t_{n-p, 1-\alpha/2}\)

Confidence interval for \(\beta_j\):

\[\hat{\beta}_j \pm t_{n-p, 1-\alpha/2} \cdot \text{SE}(\hat{\beta}_j)\]

Confidence intervals vs prediction intervals showing the key difference — Fig. 86 **Figure 3.4.7**: Confidence interval (for the mean) vs. prediction interval (for new Y). (a) The CI (narrow band) captures uncertainty in \(E[Y|x]\), while the PI (wide band) also includes the irreducible error variance. (b) The formulas differ by the “1” inside the square root—representing \(\sigma^2\) for a new observation. (c) Both intervals have hyperbolic shape, widening away from \(\bar{x}\), but the PI is offset by a constant amount.

F-Tests for Nested Models

To test whether a subset of coefficients is jointly zero, use the F-test.

Setup: Compare: - Restricted model (R): \(q\) constraints, \(\text{RSS}_R\) - Full model (F): no constraints, \(\text{RSS}_F\)

Theorem: F-Test for Nested Models

Under \(H_0\) (constraints are true) and normality:

\[F = \frac{(\text{RSS}_R - \text{RSS}_F) / q}{\text{RSS}_F / (n - p)} \sim F_{q, n-p}\]

Example: Testing \(H_0: \beta_2 = \beta_3 = 0\) (joint significance of two predictors).

import numpy as np
from scipy import stats

def f_test_nested(y, X_full, X_reduced):
    """
    F-test comparing full model to reduced (nested) model.
    """
    n = len(y)
    p_full = X_full.shape[1]
    p_reduced = X_reduced.shape[1]
    q = p_full - p_reduced  # number of restrictions

    # Fit both models
    beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
    beta_reduced = np.linalg.lstsq(X_reduced, y, rcond=None)[0]

    RSS_full = np.sum((y - X_full @ beta_full)**2)
    RSS_reduced = np.sum((y - X_reduced @ beta_reduced)**2)

    # F statistic
    F = ((RSS_reduced - RSS_full) / q) / (RSS_full / (n - p_full))
    p_value = 1 - stats.f.cdf(F, q, n - p_full)

    return {'F': F, 'df1': q, 'df2': n - p_full, 'p_value': p_value,
            'RSS_full': RSS_full, 'RSS_reduced': RSS_reduced}

# Example: test if both hp and wt are needed
X_full = np.column_stack([np.ones(n), hp, wt])
X_reduced = np.column_stack([np.ones(n)])  # intercept only

result = f_test_nested(y, X_full, X_reduced)
print(f"\nF-test: Are hp and wt jointly significant?")
print(f"F({result['df1']}, {result['df2']}) = {result['F']:.2f}")
print(f"p-value = {result['p_value']:.6f}")

Example 💡 OLS Inference on the Advertising Data

Given: The ISLR Advertising data (\(n = 200\) markets), regressing Sales on the three media budgets TV, Radio, and Newspaper. A single fitted model exercises the whole inference toolkit of this section—the individual t-tests, the joint F-test, and the \(F = t^2\) identity.

Find: Which predictors carry their weight once the others are held fixed?

import pandas as pd
import statsmodels.formula.api as smf

url = ("https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/"
       "public/STAT%20418%20Images/Data/Chapter3/Advertising.csv")
adv = pd.read_csv(url)

full = smf.ols("Sales ~ TV + Radio + Newspaper", data=adv).fit()
print(f"Full model  Sales ~ TV + Radio + Newspaper   (n = {int(full.nobs)})")
print(f"  R^2 = {full.rsquared:.3f}   overall F = {full.fvalue:.1f}")

print("\nIndividual t-tests  (H0: beta_j = 0):")
for v in ["TV", "Radio", "Newspaper"]:
    print(f"  {v:10s} coef={full.params[v]:+.4f}  t={full.tvalues[v]:+.2f}  p={full.pvalues[v]:.3f}")

# Joint F-test: are Radio and Newspaper jointly unnecessary?
joint = full.f_test("Radio = 0, Newspaper = 0")
print(f"\nJoint F-test  H0: beta_Radio = beta_Newspaper = 0:")
print(f"  F = {float(joint.fvalue):.1f}  (df {int(joint.df_num)}, {int(joint.df_denom)})  p = {float(joint.pvalue):.1e}")

# Single restriction reproduces the t-test: F = t^2
drop_news = full.f_test("Newspaper = 0")
print(f"\nSingle restriction  H0: beta_Newspaper = 0:")
print(f"  F = {float(drop_news.fvalue):.3f}  = t^2 = {full.tvalues['Newspaper']**2:.3f}")

# Newspaper alone tells a different story (confounding via Radio)
simple_news = smf.ols("Sales ~ Newspaper", data=adv).fit()
print(f"\nNewspaper ALONE: t = {simple_news.tvalues['Newspaper']:.2f}, "
      f"p = {simple_news.pvalues['Newspaper']:.3f}")
print(f"corr(Radio, Newspaper) = {adv['Radio'].corr(adv['Newspaper']):.3f}")

Output:

Full model  Sales ~ TV + Radio + Newspaper   (n = 200)
  R^2 = 0.897   overall F = 570.3

Individual t-tests  (H0: beta_j = 0):
  TV         coef=+0.0458  t=+32.81  p=0.000
  Radio      coef=+0.1885  t=+21.89  p=0.000
  Newspaper  coef=-0.0010  t=-0.18  p=0.860

Joint F-test  H0: beta_Radio = beta_Newspaper = 0:
  F = 272.0  (df 2, 196)  p = 2.8e-57

Single restriction  H0: beta_Newspaper = 0:
  F = 0.031  = t^2 = 0.031

Newspaper ALONE: t = 3.30, p = 0.001
corr(Radio, Newspaper) = 0.354

The individual t-tests flag TV (\(t = 32.8\)) and Radio (\(t = 21.9\)) as overwhelmingly significant, while Newspaper fails to reject (\(t = -0.18\), \(p = 0.86\)): with TV and radio already in the model, newspaper spending adds nothing. The joint F-test that both radio and newspaper are unnecessary is decisively rejected (\(F = 272\) on \((2, 196)\) df), so the model clearly needs more than TV alone. The single-restriction F-test for newspaper returns \(F = 0.031\), exactly the square of its t-statistic (\((-0.18)^2\))—the \(F = t^2\) identity, since a one-coefficient F-test and the corresponding two-sided t-test are the same test. Finally, a cautionary twist: regressed alone, Newspaper looks significant (\(t = 3.30\), \(p = 0.001\)), because newspaper budgets are correlated with radio budgets (\(r = 0.35\)); once radio is held fixed, newspaper’s apparent effect evaporates. This is exactly why a multiple-regression coefficient must be read as an effect holding the other predictors fixed.

Diagnostics and Model Checking

The validity of regression inference depends on assumptions. Diagnostics help detect violations.

Residual Analysis

Types of residuals:

Raw residuals: \(e_i = y_i - \hat{y}_i\)
Standardized residuals: \(r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}\)

These have approximately constant variance under homoskedasticity.
Studentized (externally): \(t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} \sim t_{n-p-1}\)

where \(s_{(i)}\) is computed without observation \(i\). Follows exact t-distribution.

Diagnostic plots:

Residuals vs. Fitted: Check linearity (should be random scatter) and homoskedasticity (constant spread)
Q-Q plot of residuals: Check normality (should follow 45° line)
Residuals vs. each predictor: Check functional form
Scale-Location plot: \(\sqrt{|r_i|}\) vs. fitted for homoskedasticity

Gallery of assumption violations showing nonlinearity, heteroskedasticity, non-normality, and autocorrelation — Fig. 87 **Figure 3.4.6**: Assumption violations gallery. Top: data patterns; Bottom: diagnostic plots. Nonlinearity shows curved residual pattern. Heteroskedasticity produces a fan shape. Non-normality creates heavy tails in Q-Q plots. Autocorrelation shows serial correlation in \(e_t\) vs \(e_{t+1}\) plots.

Leverage and Influence

Not all observations affect the fit equally.

Leverage: \(h_{ii} = [\mathbf{H}]_{ii}\) measures how far \(\mathbf{x}_i\) is from the center of the predictor space.

\(h_{ii} \in [1/n, 1]\) (from idempotence and hat matrix properties)
Average leverage: \(\bar{h} = p/n\)
High leverage if \(h_{ii} > 2p/n\) or \(> 3p/n\)

Leverage intuition showing why extreme x values matter — Fig. 88 **Figure 3.4.4**: Leverage measures how far observation \(x_i\) is from \(\bar{x}\). (a) Points far from center have high leverage (red = high, blue = low). (b) High-leverage points can dramatically influence the regression line. (c) The leverage formula \(h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}\) is U-shaped in \(x\).

Influence measures:

Cook’s Distance: Combines leverage and residual size

\[D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}\]

Large if \(D_i > 4/n\) or \(> 1\)
DFFITS: Scaled difference in fitted value when \(i\) deleted
DFBETAS: Change in each coefficient when \(i\) deleted

Robust Standard Errors

When homoskedasticity fails but we still want valid inference, use robust (sandwich) standard errors:

\[\widehat{\text{Var}}_{\text{robust}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{X}^\top \hat{\boldsymbol{\Omega}} \mathbf{X} (\mathbf{X}^\top\mathbf{X})^{-1}\]

where \(\hat{\boldsymbol{\Omega}} = \text{diag}(e_1^2, \ldots, e_n^2)\) estimates the heteroskedastic structure.

HC variants (heteroskedasticity-consistent):

HC0: \(\hat{\Omega}_{ii} = e_i^2\) (White’s original)
HC1: \(\hat{\Omega}_{ii} = \frac{n}{n-p} e_i^2\) (degrees-of-freedom correction)
HC2: \(\hat{\Omega}_{ii} = \frac{e_i^2}{1 - h_{ii}}\) (leverage adjustment)
HC3: \(\hat{\Omega}_{ii} = \frac{e_i^2}{(1 - h_{ii})^2}\) (jackknife-like, best for small samples)

def robust_se(X, residuals, hc_type='HC1'):
    """
    Compute heteroskedasticity-robust standard errors.

    Note: This implementation forms the full n×n hat matrix H, which is
    O(n²) in memory and time. For large n, compute leverages h_ii directly
    without forming H: h_ii = x_i' (X'X)^{-1} x_i for each row.

    Parameters
    ----------
    X : array, shape (n, p)
        Design matrix
    residuals : array, shape (n,)
        OLS residuals
    hc_type : str
        'HC0', 'HC1', 'HC2', or 'HC3'
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)

    # Leverage values (O(n²) - for large n, compute row-by-row instead)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)

    # Construct Omega based on HC type
    e2 = residuals**2
    if hc_type == 'HC0':
        omega = e2
    elif hc_type == 'HC1':
        omega = e2 * n / (n - p)
    elif hc_type == 'HC2':
        omega = e2 / (1 - h)
    elif hc_type == 'HC3':
        omega = e2 / (1 - h)**2
    else:
        raise ValueError(f"Unknown HC type: {hc_type}")

    # Sandwich formula
    meat = X.T @ np.diag(omega) @ X
    sandwich = XtX_inv @ meat @ XtX_inv

    return np.sqrt(np.diag(sandwich))

Bringing It All Together

Linear regression is both a statistical tool and a conceptual framework that illuminates fundamental ideas in inference. We’ve seen:

The calculus approach reveals OLS as the solution to an optimization problem—minimizing squared residuals leads to the normal equations \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\).

The geometric approach shows OLS as projection—\(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). The residual \(\mathbf{e}\) is perpendicular to all predictors.

The Gauss-Markov theorem establishes that under classical assumptions (linearity, exogeneity, spherical errors), OLS achieves minimum variance among all linear unbiased estimators. This optimality result has driven the dominance of OLS in applied work.

Distributional theory under normality enables exact t-tests and F-tests. Without normality, asymptotic approximations still provide valid inference for large samples.

Diagnostics reveal whether assumptions hold. When they fail, remedies range from robust standard errors to alternative estimators.

Looking Ahead

Section 3.5 extends these ideas to Generalized Linear Models (GLMs)—logistic regression, Poisson regression, and more. GLMs handle non-normal responses while maintaining the interpretability and computational tractability of the linear model framework. The IRLS algorithm for fitting GLMs is essentially Fisher scoring—Newton-Raphson with expected information—applied to the same exponential family likelihoods we met in Section 3.1.

Key Takeaways 📝

Matrix calculus foundation: The gradient of \(\boldsymbol{\beta}^\top\mathbf{A}\boldsymbol{\beta}\) is \(2\mathbf{A}\boldsymbol{\beta}\) (for symmetric \(\mathbf{A}\)), enabling systematic derivation of least squares estimators.
Two perspectives on OLS: Calculus (minimize RSS) and geometry (project onto column space) yield the same normal equations \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\).
Gauss-Markov theorem: Under linearity, full rank, exogeneity, and spherical errors, OLS is BLUE—the best linear unbiased estimator. The complete proof uses the \(\mathbf{D}\) matrix decomposition.
Inference requires variance estimation: The unbiased estimator \(s^2 = \text{RSS}/(n-p)\) replaces \(\sigma^2\) in standard error formulas. Under normality, t-tests and F-tests are exact.
Diagnostics are essential: Residual plots, leverage measures, and Cook’s distance detect assumption violations. Robust standard errors (HC0-HC3) provide valid inference under heteroskedasticity.
Outcome alignment: This section addresses Learning Outcomes on implementing regression methods, understanding estimator properties (bias, variance, efficiency), and applying diagnostics—skills essential for Sections 3.5 (GLMs), Chapter 4 (bootstrap for regression), and Chapter 5 (Bayesian linear models).

Numerical Stability: QR Decomposition

The formula \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) is mathematically correct but numerically problematic. Computing \(\mathbf{X}^\top\mathbf{X}\) and inverting it can lose precision due to the condition number. (For background on floating-point arithmetic, machine epsilon, and how condition numbers determine digits lost, see Appendix C.)

The Condition Number Problem

The condition number of a matrix measures sensitivity to numerical errors:

\[\kappa(\mathbf{A}) = \|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\|\]

For the product \(\mathbf{X}^\top\mathbf{X}\):

\[\kappa(\mathbf{X}^\top\mathbf{X}) = \kappa(\mathbf{X})^2\]

This squaring of the condition number is catastrophic. If \(\mathbf{X}\) has condition number \(10^4\) (not unusual with poorly scaled predictors), then \(\mathbf{X}^\top\mathbf{X}\) has condition number \(10^8\), potentially losing 8 digits of precision.

Numerical stability comparison showing normal equations vs QR decomposition — Fig. 89 **Figure 3.4.8**: Numerical stability matters. (a) Normal equations square the condition number: \(\kappa(\mathbf{X}^\top\mathbf{X}) = \kappa(\mathbf{X})^2\). (b) Theoretical error bounds show QR decomposition with \(O(\kappa \cdot \epsilon)\) error vastly outperforms normal equations with \(O(\kappa^2 \cdot \epsilon)\) error. (c) Practical guidelines: use QR/SVD-based methods (like `np.linalg.lstsq`) when \(\kappa > 10^3\).

QR Decomposition Solution

The QR decomposition factorizes \(\mathbf{X}\) as:

\[\mathbf{X} = \mathbf{Q}\mathbf{R}\]

where:

\(\mathbf{Q} \in \mathbb{R}^{n \times p}\) has orthonormal columns: \(\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_p\)
\(\mathbf{R} \in \mathbb{R}^{p \times p}\) is upper triangular

Solving the normal equations via QR:

Substitute into \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\):

\[\mathbf{R}^\top\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{R}^\top\mathbf{Q}^\top\mathbf{y}\]

Since \(\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}\) and \(\mathbf{R}^\top\) is invertible:

\[\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^\top\mathbf{y}\]

This triangular system is solved by back-substitution—no matrix inversion needed! (For the QR decomposition itself and the use of triangular solves in place of explicit inversion, see Appendix B.)

Advantages:

Condition number is \(\kappa(\mathbf{R}) = \kappa(\mathbf{X})\), not squared
No explicit computation of \(\mathbf{X}^\top\mathbf{X}\)
Back-substitution is \(O(p^2)\), highly stable

import numpy as np

def ols_qr(X, y):
    """
    Compute OLS via QR decomposition (numerically stable).
    """
    Q, R = np.linalg.qr(X)
    Qty = Q.T @ y
    beta_hat = np.linalg.solve(R, Qty)  # Back-substitution
    return beta_hat

def demonstrate_condition_number():
    """Show the condition number problem."""
    rng = np.random.default_rng(42)
    n = 100

    # Poorly conditioned design matrix (highly correlated predictors)
    x1 = rng.uniform(0, 1, n)
    x2 = x1 + rng.normal(0, 0.0001, n)  # Nearly collinear!

    X = np.column_stack([np.ones(n), x1, x2])
    y = 1 + 2*x1 + 3*x2 + rng.normal(0, 0.1, n)

    # Condition numbers
    kappa_X = np.linalg.cond(X)
    kappa_XtX = np.linalg.cond(X.T @ X)

    print("="*60)
    print("CONDITION NUMBER COMPARISON")
    print("="*60)
    print(f"\nκ(X):     {kappa_X:.2e}")
    print(f"κ(X'X):   {kappa_XtX:.2e}")
    print(f"Ratio:    {kappa_XtX / kappa_X:.2e} (should be ≈ κ(X))")

    # Compare methods
    # Method 1: Normal equations (unstable)
    try:
        beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
        print(f"\nNormal equations: β̂ = {beta_normal}")
    except np.linalg.LinAlgError:
        print("\nNormal equations: FAILED (singular matrix)")

    # Method 2: QR decomposition (stable)
    beta_qr = ols_qr(X, y)
    print(f"QR decomposition: β̂ = {beta_qr}")

    # Method 3: lstsq (uses SVD)
    beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"lstsq (SVD):      β̂ = {beta_lstsq}")

demonstrate_condition_number()

============================================================
CONDITION NUMBER COMPARISON
============================================================

κ(X):     1.78e+04
κ(X'X):   3.18e+08
Ratio:    1.78e+04 (should be ≈ κ(X))

Normal equations: β̂ = [  1.0253349  -16.99304398  21.93082782]
QR decomposition: β̂ = [  1.0253349  -16.99304426  21.9308281 ]
lstsq (SVD):      β̂ = [  1.0253349  -16.99304426  21.9308281 ]

Common Pitfall ⚠️ Never Compute (X’X)⁻¹ Explicitly

When writing your own regression code:

Don’t do this: np.linalg.inv(X.T @ X) @ X.T @ y
Do this instead: np.linalg.lstsq(X, y, rcond=None)[0]

The lstsq function uses SVD internally, which is even more stable than QR and handles rank-deficient matrices gracefully.

Model Selection and Information Criteria

With multiple candidate models, how do we choose? Information criteria balance goodness-of-fit against complexity.

The Bias-Variance Tradeoff

Consider adding a predictor to a model:

RSS decreases (or stays the same)—the fit improves or is unchanged
Variance increases—more parameters to estimate

This is the classic bias-variance tradeoff:

\[\text{MSE}(\hat{y}) = \text{Bias}^2 + \text{Variance}\]

A complex model has low bias (fits the data well) but high variance (overfits to noise). A simple model has higher bias but lower variance.

R-Squared and Adjusted R-Squared

R-squared measures the proportion of variance explained:

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]

Regression decomposition showing TSS = ESS + RSS — Fig. 90 **Figure 3.4.13**: The fundamental decomposition TSS = ESS + RSS. (a) For each point, the deviation from \(\bar{y}\) splits into explained (ESS) and residual (RSS) components. (b) The equation: Total Sum of Squares = Explained Sum of Squares + Residual Sum of Squares. (c) \(R^2 = \text{ESS}/\text{TSS}\) is the proportion explained.

Problem: \(R^2\) never decreases when adding predictors, even useless ones.

Adjusted R-squared penalizes for model complexity:

\[R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)} = 1 - (1 - R^2)\frac{n-1}{n-p}\]

This can decrease when adding a predictor that doesn’t improve the fit enough to justify its inclusion.

Information Criteria

Akaike Information Criterion (AIC):

\[\text{AIC} = n \ln(\text{RSS}/n) + 2p\]

Bayesian Information Criterion (BIC):

\[\text{BIC} = n \ln(\text{RSS}/n) + p \ln(n)\]

These are the AIC and BIC formulas for Gaussian linear regression, dropping additive constants that cancel when comparing models fit to the same response \(\mathbf{y}\). For general likelihoods, use \(\text{AIC} = -2\ln L + 2p\) and \(\text{BIC} = -2\ln L + p\ln n\).

Both penalize complexity, but BIC penalizes more heavily for large \(n\).

Usage: Lower is better. Compare models by their AIC or BIC values.

Table 20 Model Selection Criteria
Criterion	Formula	Penalty	Best for
\(R^2\)	\(1 - \text{RSS}/\text{TSS}\)	None (always increases)	Never for selection
\(R^2_{\text{adj}}\)	\(1 - (1-R^2)\frac{n-1}{n-p}\)	Mild	Quick comparison
AIC	\(n\ln(\text{RSS}/n) + 2p\)	\(2p\)	Prediction
BIC	\(n\ln(\text{RSS}/n) + p\ln(n)\)	\(p\ln(n)\)	Consistent selection

def model_selection_criteria(y, y_hat, p, n):
    """
    Compute model selection criteria.

    Parameters
    ----------
    y : array
        Observed values
    y_hat : array
        Fitted values
    p : int
        Number of parameters (including intercept)
    n : int
        Sample size
    """
    RSS = np.sum((y - y_hat)**2)
    TSS = np.sum((y - np.mean(y))**2)

    R2 = 1 - RSS / TSS
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p)

    AIC = n * np.log(RSS / n) + 2 * p
    BIC = n * np.log(RSS / n) + p * np.log(n)

    return {'R2': R2, 'R2_adj': R2_adj, 'AIC': AIC, 'BIC': BIC}

Regularization: Ridge and LASSO

When predictors are highly correlated or \(p\) is large relative to \(n\), OLS can be unstable or overfit. Regularization shrinks coefficients toward zero, trading bias for reduced variance.

Ridge Regression

Ridge regression adds an \(L_2\) penalty:

\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \right\}\]

where \(\lambda \geq 0\) is the regularization parameter.

Closed-form solution:

(62)\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\]

Key properties of the ridge estimator (62):

Bias-variance tradeoff: Ridge is biased (\(\mathbb{E}[\hat{\boldsymbol{\beta}}_{\text{ridge}}] \neq \boldsymbol{\beta}\)) but can have lower MSE than OLS
Shrinkage: Coefficients shrink toward zero as \(\lambda \to \infty\)
Handles multicollinearity: Adding \(\lambda\mathbf{I}\) makes \(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}\) invertible even if \(\mathbf{X}^\top\mathbf{X}\) is singular
No variable selection: All coefficients remain nonzero

The bias-variance tradeoff in regularization — Fig. 91 **Figure 3.4.9**: The bias-variance tradeoff. (a) A biased estimator (orange) with lower variance can achieve lower MSE than an unbiased estimator (blue). (b) As regularization \(\lambda\) increases, variance decreases but bias increases—there’s an optimal \(\lambda\). (c) Training error decreases with complexity while test error is U-shaped.

LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) uses an \(L_1\) penalty:

\[\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}\]

where \(\|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j|\).

Key properties:

No closed form: Must use numerical optimization (coordinate descent)
Variable selection: Sets some coefficients exactly to zero (sparse solutions)
Geometry: The \(L_1\) constraint has corners at the axes, making exact zeros likely

L1 vs L2 geometry showing why LASSO produces sparse solutions — Fig. 92 **Figure 3.4.10**: L1 vs L2 geometry. (a) Ridge constraint is a circle—RSS contours typically touch away from axes. (b) LASSO constraint is a diamond—contours often touch at corners where \(\beta_j = 0\). (c) This geometric difference explains why LASSO performs automatic variable selection while Ridge shrinks but keeps all coefficients.

Regularization paths showing how coefficients shrink with lambda — Fig. 93 **Figure 3.4.11**: Regularization paths. (a) Ridge shrinks coefficients smoothly toward zero as \(\lambda\) increases—all remain nonzero. (b) LASSO drives coefficients to exactly zero at different \(\lambda\) values, providing a variable selection path.

Elastic Net

Elastic Net combines \(L_1\) and \(L_2\) penalties:

\[\hat{\boldsymbol{\beta}}_{\text{EN}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|^2 \right\}\]

This combines the variable selection of LASSO with the stability of ridge for correlated predictors.

Practical Considerations for Regularization

Intercept handling: Typically, we center \(\mathbf{y}\) and standardize predictors, and we do not penalize the intercept \(\beta_0\). Most software handles this automatically.
Scaling matters: Always report whether predictors were standardized. The penalty \(\lambda\) depends on the scale of predictors—a coefficient for income in dollars behaves very differently from income in thousands of dollars.
Cross-validation: Choose \(\lambda\) by cross-validation to balance fit and complexity.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def compare_regularization():
    """Compare OLS, Ridge, LASSO, and Elastic Net."""
    rng = np.random.default_rng(42)

    # Generate data with correlated predictors
    n, p = 100, 50
    X = rng.normal(0, 1, (n, p))
    # Add correlation structure
    for j in range(1, p):
        X[:, j] = 0.9 * X[:, j-1] + 0.1 * X[:, j]

    # True coefficients: only first 5 are nonzero
    beta_true = np.zeros(p)
    beta_true[:5] = [3, 2, 1, 0.5, 0.25]

    y = X @ beta_true + rng.normal(0, 1, n)

    # Standardize for fair comparison
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit models
    ols_beta = np.linalg.lstsq(X_scaled, y, rcond=None)[0]

    ridge = Ridge(alpha=1.0)
    ridge.fit(X_scaled, y)

    lasso = Lasso(alpha=0.1)
    lasso.fit(X_scaled, y)

    enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
    enet.fit(X_scaled, y)

    print("="*60)
    print("REGULARIZATION COMPARISON")
    print("="*60)
    print(f"\nData: n={n}, p={p}, {np.sum(beta_true != 0)} true nonzero coefficients")

    print(f"\n{'Method':<15} {'Nonzero':>10} {'MSE (coeffs)':>15}")
    print("-"*40)

    methods = [('OLS', ols_beta), ('Ridge', ridge.coef_),
               ('LASSO', lasso.coef_), ('Elastic Net', enet.coef_)]

    for name, coef in methods:
        nonzero = np.sum(np.abs(coef) > 1e-6)
        mse = np.mean((coef - beta_true)**2)
        print(f"{name:<15} {nonzero:>10} {mse:>15.4f}")

    print("\n→ LASSO and Elastic Net achieve sparsity (fewer nonzero coefficients)")
    print("→ Ridge shrinks but keeps all coefficients nonzero")

    return X_scaled, y

X_scaled, y = compare_regularization()

============================================================
REGULARIZATION COMPARISON
============================================================

Data: n=100, p=50, 5 true nonzero coefficients

Method             Nonzero    MSE (coeffs)
----------------------------------------
OLS                     50          0.3280
Ridge                   50          0.1460
LASSO                    6          0.0212
Elastic Net             16          0.0401

→ LASSO and Elastic Net achieve sparsity (fewer nonzero coefficients)
→ Ridge shrinks but keeps all coefficients nonzero

Choosing the Regularization Parameter

The parameter \(\lambda\) controls the amount of shrinkage. Too small → overfitting; too large → underfitting.

Cross-validation is the standard approach:

Split data into \(K\) folds
For each candidate \(\lambda\): - Train on \(K-1\) folds, predict on held-out fold - Average prediction error across folds
Choose \(\lambda\) with lowest CV error

from sklearn.linear_model import RidgeCV, LassoCV

# Ridge with built-in CV
alphas = np.logspace(-4, 4, 50)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_scaled, y)
print(f"Ridge optimal λ: {ridge_cv.alpha_:.4f}")

# LASSO with built-in CV
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
print(f"LASSO optimal λ: {lasso_cv.alpha_:.4f}")

Chapter 3.4 Exercises: Linear Models Mastery

These exercises build your understanding of linear regression from matrix calculus foundations through the Gauss-Markov theorem to practical diagnostics. Each exercise connects theoretical derivations to computational practice and statistical interpretation.

A Note on These Exercises

These exercises are designed to deepen your understanding of linear models through hands-on exploration:

Exercise 1 develops matrix calculus skills essential for deriving OLS estimators
Exercise 2 explores the geometric interpretation of regression as projection
Exercise 3 identifies Gauss-Markov assumption violations and their consequences
Exercise 4 provides a complete proof of Gauss-Markov for a single coefficient
Exercise 5 verifies theoretical properties computationally via Monte Carlo simulation
Exercise 6 compares model-based and robust standard errors under heteroskedasticity

Complete solutions with derivations, code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!

Exercise 1: Matrix Calculus Derivations

The matrix calculus rules developed in this section are the foundation for deriving OLS estimators. This exercise builds fluency with these essential tools.

Background: Why Matrix Calculus?

When optimizing functions of vectors and matrices (like RSS as a function of \(\boldsymbol{\beta}\)), we need systematic rules for computing gradients. Component-wise verification builds confidence that the matrix formulas are correct.

Linear form: Verify by component-wise differentiation that \(\nabla_{\mathbf{x}}(\mathbf{a}^\top\mathbf{x}) = \mathbf{a}\) for \(\mathbf{a}, \mathbf{x} \in \mathbb{R}^3\).
Cross term: Derive \(\nabla_{\boldsymbol{\beta}}(\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta})\) directly from the definition of gradient.
Quadratic form: For symmetric \(\mathbf{A}\), show that \(\nabla_{\mathbf{x}}(\mathbf{x}^\top\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}\) by expanding in components.
RSS gradient: Derive the OLS gradient \(\nabla_{\boldsymbol{\beta}} \text{RSS} = 2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})\) using the chain rule.

Hint

For part (c), write out \(\mathbf{x}^\top\mathbf{A}\mathbf{x} = \sum_i\sum_j x_i A_{ij} x_j\) and differentiate with respect to \(x_k\). Remember that terms containing \(x_k\) appear both when \(i = k\) and when \(j = k\).

Exercise 2: Geometry of Simple Regression

The geometric perspective reveals regression as projection onto the column space of \(\mathbf{X}\). This exercise develops that intuition for simple linear regression.

Background: Projection Interpretation

In simple regression \(y = \beta_0 + \beta_1 x + \varepsilon\), the design matrix \(\mathbf{X} = [\mathbf{1} \; \mathbf{x}]\) spans a 2-dimensional subspace of \(\mathbb{R}^n\). The fitted values \(\hat{\mathbf{y}}\) are the orthogonal projection of \(\mathbf{y}\) onto this plane.

Show that the design matrix has columns \(\mathbf{1}\) and \(\mathbf{x}\), and verify that \(\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\).
Verify geometrically that the residual vector is orthogonal to both \(\mathbf{1}\) and \(\mathbf{x}\). What do these orthogonality conditions imply about the residuals?
Show that the regression line passes through the point \((\bar{x}, \bar{y})\).
Derive the formula for \(R^2\) as the squared correlation between \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) using vector inner products.

Hint

For part (b), the condition \(\mathbf{1}^\top \mathbf{e} = 0\) means \(\sum e_i = 0\) (residuals sum to zero). The condition \(\mathbf{x}^\top \mathbf{e} = 0\) means \(\sum x_i e_i = 0\).

Exercise 3: Gauss-Markov Assumptions

Understanding which assumptions are violated—and the consequences—is essential for applied regression. This exercise builds diagnostic intuition.

Background: The Assumption Hierarchy

The Gauss-Markov assumptions are: GM1 (linearity), GM2 (full rank), GM3 (exogeneity: \(\mathbb{E}[\varepsilon|\mathbf{X}] = \mathbf{0}\)), and GM4 (spherical errors: \(\text{Var}(\varepsilon|\mathbf{X}) = \sigma^2\mathbf{I}\)). Each violation has specific consequences for OLS.

For each scenario, identify which Gauss-Markov assumption is violated and describe the consequences.

A researcher fits \(y = \beta_0 + \beta_1 x + \varepsilon\) but the true relationship is \(y = \beta_0 + \beta_1 \log(x) + \varepsilon\).
In a regression with three predictors, the third predictor is defined as \(x_3 = 2x_1 - x_2\).
An economist studies savings rates but omits “income,” which is correlated with education (included) and directly affects savings.
In a study of housing prices, the variance of residuals is proportional to house size.
Monthly unemployment data shows residuals that are positively correlated across consecutive months.

Solution

Part (a): Wrong functional form

Violation: GM1 (Linearity)

The true model is \(y = \beta_0 + \beta_1\log(x) + \varepsilon\), but we’re fitting \(y = \gamma_0 + \gamma_1 x + \varepsilon\).

Consequences:

OLS estimates \(\hat{\gamma}_0, \hat{\gamma}_1\) are biased and inconsistent
They estimate the “best linear approximation” to a nonlinear relationship
Residual plots will show systematic curvature (nonlinearity pattern)
Predictions will be systematically wrong, especially at extreme \(x\) values

Remedy: Transform predictors (\(\log(x)\)), add polynomial terms, or use nonlinear models.

Part (b): Perfect multicollinearity

Violation: GM2 (Full Rank)

If \(x_3 = 2x_1 - x_2\), the third column of \(\mathbf{X}\) is a linear combination of the first two.

Consequences:

\(\mathbf{X}^\top\mathbf{X}\) is singular (not invertible)
OLS has infinitely many solutions—the parameter vector is not identified
Software will fail or drop one predictor automatically

Remedy: Remove redundant predictor(s) or use regularization (ridge regression).

Part (c): Omitted variable bias

Violation: GM3 (Exogeneity/Strict Exogeneity)

Income affects both education (parents with higher income can afford more education) and savings (direct effect). By omitting income, \(\mathbb{E}[\varepsilon|\text{education}] \neq 0\).

Consequences:

OLS coefficient on education is biased and inconsistent
The bias direction depends on signs: if income ↔ education (positive) and income → savings (positive), education coefficient is biased upward
Omitted variable bias formula: \(\text{bias}(\hat{\beta}_{\text{edu}}) = \beta_{\text{inc}} \times \delta\) where \(\delta\) is the regression coefficient of income on education

Remedy: Include income as a control, use instrumental variables, or acknowledge the limitation.

Part (d): Heteroskedasticity

Violation: GM4 (Spherical Errors—specifically homoskedasticity)

\(\text{Var}(\varepsilon_i) = \sigma^2 \cdot (\text{house size}_i)\), not constant.

Consequences:

OLS is still unbiased and consistent
OLS is no longer BLUE—more efficient estimators exist (WLS)
Standard errors are wrong (typically too small for large houses)
t-tests and F-tests have incorrect size (Type I error ≠ α)

Remedy: Use robust (HC) standard errors, or weighted least squares with \(w_i \propto 1/\text{size}_i\).

Part (e): Autocorrelation

Violation: GM4 (Spherical Errors—specifically no autocorrelation)

\(\text{Cov}(\varepsilon_t, \varepsilon_{t-1}) > 0\) for consecutive months.

Consequences:

OLS is still unbiased and consistent
OLS is not BLUE
Standard errors are underestimated (positive autocorrelation clusters errors)
Durbin-Watson test will detect this (DW < 2 indicates positive autocorrelation)

Remedy: Use Newey-West (HAC) standard errors, or GLS/Cochrane-Orcutt for efficiency.

Exercise 4: Complete BLUE Proof

The Gauss-Markov theorem establishes OLS as the Best Linear Unbiased Estimator. This exercise develops a complete proof for a single coefficient, building intuition for the general result.

Background: The BLUE Property

BLUE means OLS achieves the minimum variance among all linear unbiased estimators. The proof uses a clever decomposition: any alternative estimator can be written as OLS plus a “deviation” term, and this deviation can only increase variance.

Let \(\tilde{\beta}_1 = \mathbf{c}^\top\mathbf{y}\) be any linear unbiased estimator of \(\beta_1\) in simple regression. Show that unbiasedness requires \(\mathbf{c}^\top\mathbf{X} = (0, 1)\).
Write \(\mathbf{c} = \mathbf{a}_1 + \mathbf{d}\) where \(\mathbf{a}_1\) is the second row of \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\). Show that \(\mathbf{d}^\top\mathbf{X} = \mathbf{0}\).
Show that \(\text{Var}(\tilde{\beta}_1) = \text{Var}(\hat{\beta}_1) + \sigma^2\mathbf{d}^\top\mathbf{d}\).
Conclude that \(\text{Var}(\tilde{\beta}_1) \geq \text{Var}(\hat{\beta}_1)\) with equality iff \(\mathbf{d} = \mathbf{0}\).

Hint

The key insight in part (c) is that the cross-term \(\mathbf{a}_1^\top\mathbf{d} = 0\) because \(\mathbf{d}\) is orthogonal to the column space of \(\mathbf{X}\), and \(\mathbf{a}_1\) lies in that column space.

Exercise 5: Computational Practice

Theory becomes concrete through computation. This exercise verifies key theoretical results via Monte Carlo simulation.

Background: Why Simulate?

Simulation provides empirical verification of theoretical formulas. When theory says \(\mathbb{E}[\hat{\beta}] = \beta\) and \(\text{Var}(\hat{\beta}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\), simulation confirms these hold—building confidence and catching errors.

Using Python, verify key theoretical results.

Generate data from \(y = 2 + 3x + \varepsilon\) with \(\varepsilon \sim N(0, 4)\), \(n = 50\). Fit OLS and verify \(\hat{\boldsymbol{\beta}} \approx (2, 3)\).
Compute the hat matrix \(\mathbf{H}\) and verify: \(\mathbf{H} = \mathbf{H}^\top\), \(\mathbf{H}^2 = \mathbf{H}\), \(\text{tr}(\mathbf{H}) = p\).
Verify that \(\mathbf{X}^\top\mathbf{e} = \mathbf{0}\) (residuals orthogonal to predictors).
Simulate 10,000 datasets and verify that \(\mathbb{E}[\hat{\beta}_1] = 3\) and \(\text{Var}(\hat{\beta}_1) \approx \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{22}\).
Verify that \(s^2 = \text{RSS}/(n-p)\) is unbiased for \(\sigma^2 = 4\).

Solution

import numpy as np
from scipy import stats

def exercise_3_4_5():
    """Complete solution to Exercise 3.4.5."""
    rng = np.random.default_rng(42)

    # Parameters
    beta_true = np.array([2.0, 3.0])
    sigma = 2.0
    n = 50

    print("="*60)
    print("EXERCISE 3.4.5: COMPUTATIONAL VERIFICATION")
    print("="*60)

    # Part (a): Single dataset fit
    print("\n--- Part (a): Single Dataset OLS ---")
    x = rng.uniform(0, 5, n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.normal(0, sigma, n)
    y = X @ beta_true + eps

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"True β = {beta_true}")
    print(f"OLS β̂ = [{beta_hat[0]:.3f}, {beta_hat[1]:.3f}]")

    # Part (b): Hat matrix properties
    print("\n--- Part (b): Hat Matrix Properties ---")
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    p = X.shape[1]

    print(f"H = H' (symmetric):     {np.allclose(H, H.T)}")
    print(f"H² = H (idempotent):    {np.allclose(H @ H, H)}")
    print(f"tr(H) = p = {p}:        tr(H) = {np.trace(H):.6f}")

    # Part (c): Orthogonality of residuals
    print("\n--- Part (c): Residual Orthogonality ---")
    e = y - X @ beta_hat
    Xte = X.T @ e
    print(f"X'e = [{Xte[0]:.2e}, {Xte[1]:.2e}]")
    print(f"||X'e|| ≈ 0: {np.allclose(Xte, 0)}")

    # Parts (d) & (e): Simulation study
    print("\n--- Parts (d) & (e): Simulation Study (10,000 reps) ---")
    n_sim = 10000

    beta1_hats = []
    s2_estimates = []

    # Use FIXED X for simulation
    x_fixed = rng.uniform(0, 5, n)
    X_fixed = np.column_stack([np.ones(n), x_fixed])
    XtX_inv_fixed = np.linalg.inv(X_fixed.T @ X_fixed)

    for _ in range(n_sim):
        eps_sim = rng.normal(0, sigma, n)
        y_sim = X_fixed @ beta_true + eps_sim

        beta_sim = XtX_inv_fixed @ X_fixed.T @ y_sim
        beta1_hats.append(beta_sim[1])

        resid = y_sim - X_fixed @ beta_sim
        s2_estimates.append(np.sum(resid**2) / (n - p))

    beta1_hats = np.array(beta1_hats)
    s2_estimates = np.array(s2_estimates)

    # Theoretical variance
    var_theory = sigma**2 * XtX_inv_fixed[1, 1]

    print(f"\nFor β̂₁ (slope):")
    print(f"  E[β̂₁] ≈ β₁ = {beta_true[1]}:")
    print(f"    Sample mean: {np.mean(beta1_hats):.4f}")
    print(f"  Var(β̂₁) = σ²[(X'X)⁻¹]₂₂ = {var_theory:.6f}:")
    print(f"    Sample var:  {np.var(beta1_hats):.6f}")

    print(f"\nFor s² (variance estimator):")
    print(f"  E[s²] = σ² = {sigma**2}:")
    print(f"    Sample mean: {np.mean(s2_estimates):.4f}")

exercise_3_4_5()

============================================================
EXERCISE 3.4.5: COMPUTATIONAL VERIFICATION
============================================================

--- Part (a): Single Dataset OLS ---
True β = [2. 3.]
OLS β̂ = [1.537, 3.048]

--- Part (b): Hat Matrix Properties ---
H = H' (symmetric):     True
H² = H (idempotent):    True
tr(H) = p = 2:        tr(H) = 2.000000

--- Part (c): Residual Orthogonality ---
X'e = [-1.78e-13, -5.64e-13]
||X'e|| ≈ 0: True

--- Parts (d) & (e): Simulation Study (10,000 reps) ---

For β̂₁ (slope):
  E[β̂₁] ≈ β₁ = 3.0:
    Sample mean: 2.9987
  Var(β̂₁) = σ²[(X'X)⁻¹]₂₂ = 0.042729:
    Sample var:  0.043122

For s² (variance estimator):
  E[s²] = σ² = 4.0:
    Sample mean: 4.0070

Key Insights:

Hat matrix properties confirmed: \(\mathbf{H}\) is symmetric, idempotent, and has trace equal to \(p\) (number of parameters).
Orthogonality verified: \(\mathbf{X}^\top\mathbf{e} \approx \mathbf{0}\) to machine precision—the residuals are exactly orthogonal to the column space.
Unbiasedness confirmed: Over 10,000 simulations, \(\bar{\hat{\beta}}_1 = 2.9987 \approx \beta_1 = 3.0\).
Variance formula confirmed: Empirical variance matches \(\sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{22}\) closely.
s² is unbiased: \(\bar{s}^2 = 4.0070 \approx \sigma^2 = 4.0\).

Exercise 6: Robust Standard Errors

When heteroskedasticity is present, model-based standard errors are wrong. This exercise compares them to robust alternatives via simulation.

Background: The Heteroskedasticity Problem

Under heteroskedasticity, \(\text{Var}(\varepsilon_i) = \sigma_i^2\) varies across observations. OLS remains unbiased but the formula \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\) is wrong. Robust (sandwich) standard errors correct this.

Generate heteroskedastic data: \(y_i = 2 + 3x_i + \varepsilon_i\) where \(\varepsilon_i \sim N(0, (0.5 + x_i)^2)\).
Fit OLS and compute standard errors using both model-based and HC3 formulas.
Construct 95% CIs for \(\beta_1\) using both SE methods. Assess coverage via simulation (10,000 replications). Which achieves closer to 95% coverage?
Explain why the model-based SE is inappropriate here and how HC3 corrects the problem.

Solution

Parts (a)-(c): Implementation and Coverage Simulation

import numpy as np
from scipy import stats

def robust_se_hc3(X, residuals):
    """Compute HC3 robust standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)
    omega = residuals**2 / (1 - h)**2
    meat = X.T @ np.diag(omega) @ X
    sandwich = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(sandwich))

def coverage_simulation(n_sim=10000, n=100, seed=42):
    """Compare coverage of model-based vs HC3 standard errors."""
    rng = np.random.default_rng(seed)

    beta_true = np.array([2.0, 3.0])

    model_covers = 0
    hc3_covers = 0

    for _ in range(n_sim):
        # Generate heteroskedastic data
        x = rng.uniform(0, 3, n)
        sigma = 0.5 + x  # SD increases with x
        eps = rng.normal(0, sigma)
        y = beta_true[0] + beta_true[1] * x + eps

        # Design matrix
        X = np.column_stack([np.ones(n), x])

        # OLS fit
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        residuals = y - X @ beta_hat

        # Model-based SE (assumes homoskedasticity)
        s2 = np.sum(residuals**2) / (n - 2)
        se_model = np.sqrt(s2 * np.diag(XtX_inv))

        # HC3 robust SE
        se_hc3 = robust_se_hc3(X, residuals)

        # 95% CI for beta_1 (slope)
        t_crit = stats.t.ppf(0.975, n - 2)

        ci_model = (beta_hat[1] - t_crit * se_model[1],
                    beta_hat[1] + t_crit * se_model[1])
        ci_hc3 = (beta_hat[1] - t_crit * se_hc3[1],
                  beta_hat[1] + t_crit * se_hc3[1])

        if ci_model[0] <= beta_true[1] <= ci_model[1]:
            model_covers += 1
        if ci_hc3[0] <= beta_true[1] <= ci_hc3[1]:
            hc3_covers += 1

    return model_covers / n_sim, hc3_covers / n_sim

# Run simulation
cov_model, cov_hc3 = coverage_simulation()

print("="*60)
print("HETEROSKEDASTICITY: MODEL-BASED VS ROBUST SE")
print("="*60)
print(f"\n95% CI Coverage for β₁ (10,000 simulations):")
print(f"  Model-based SE: {cov_model:.3f} (target: 0.950)")
print(f"  HC3 robust SE:  {cov_hc3:.3f} (target: 0.950)")
print(f"\nModel-based {'UNDERCOVERED' if cov_model < 0.945 else 'OK'}")
print(f"HC3 {'achieves nominal coverage' if 0.94 <= cov_hc3 <= 0.96 else 'slightly off'}")

============================================================
HETEROSKEDASTICITY: MODEL-BASED VS ROBUST SE
============================================================

95% CI Coverage for β₁ (10,000 simulations):
  Model-based SE: 0.938 (target: 0.950)
  HC3 robust SE:  0.951 (target: 0.950)

Model-based UNDERCOVERED
HC3 achieves nominal coverage

Part (d): Explanation

The model-based SE assumes \(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\). When variance increases with \(x\), this assumption fails.

Why model-based SE fails: The formula \(s^2(\mathbf{X}^\top\mathbf{X})^{-1}\) uses a pooled \(s^2\) that averages across all observations. This underestimates variance where \(x\) is large (variance actually higher) and overestimates where \(x\) is small. Since high-\(x\) observations carry more weight in determining \(\hat{\beta}_1\), the net effect is that model-based SE is too small, leading to CIs that are too narrow and systematic undercoverage.

How HC3 corrects: The sandwich formula uses observation-specific variance estimates:

\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\hat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\]

where \(\hat{\Omega}_{ii} = e_i^2/(1-h_{ii})^2\) for HC3.

Each observation contributes its own squared residual
The \((1-h_{ii})^2\) denominator inflates residuals for high-leverage points (which tend to have smaller residuals even when variance is large)
This provides a conservative correction that achieves approximately correct coverage even under heteroskedasticity

Key Insight: Robust standard errors do not change the point estimates—OLS \(\hat{\boldsymbol{\beta}}\) is still unbiased. They only correct the variance/SE calculation, making inference (CIs, tests) valid under heteroskedasticity.