Linear Models

Linear regression is arguably the most important statistical technique ever developed. From its origins in early 19th century astronomy—where Gauss and Legendre independently derived least squares to predict planetary orbits—to its modern role as the foundation of machine learning, linear regression has shaped how we understand relationships in data. Every data scientist must master this tool, not merely as a button to click, but as a mathematical framework to understand deeply.

This section develops linear regression from first principles. We begin with the matrix calculus needed to derive estimators—building up the rules for differentiating vectors, matrices, and quadratic forms. We then derive the ordinary least squares (OLS) estimator via two complementary approaches: the calculus approach (minimizing sum of squared errors) and the geometric approach (projection onto the column space). These perspectives illuminate different aspects of the same solution.

With the estimator in hand, we prove the Gauss-Markov theorem—that OLS is the Best Linear Unbiased Estimator (BLUE) under specific conditions. We examine each condition carefully, understanding what happens when it fails. We then develop the distributional theory under normality, enabling t-tests, F-tests, and confidence intervals. Finally, we address diagnostics and remedies for when assumptions don’t hold.

Road Map 🧭

  • Develop: Matrix calculus rules for differentiating vectors, quadratic forms, and matrix products

  • Derive: OLS estimator via calculus (normal equations) and geometry (projection)

  • Prove: Gauss-Markov theorem establishing OLS as BLUE under classical assumptions

  • Apply: Distributional results for inference: t-tests, F-tests, confidence intervals

  • Evaluate: Diagnostics for assumption violations and remedial measures

Matrix Calculus Foundations

Before deriving the least squares estimator, we need tools for differentiating expressions involving vectors and matrices. This section builds up the necessary calculus systematically, starting from scalar derivatives and extending to the multivariate case.

Why Matrix Calculus?

In scalar calculus, finding the minimum of \(f(x) = (y - x)^2\) is straightforward: take the derivative, set it to zero, solve for \(x\). With \(n\) observations and \(p\) parameters, our objective becomes:

\[f(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Minimizing this requires differentiating with respect to a vector \(\boldsymbol{\beta}\). Matrix calculus provides the systematic framework for such operations.

Notation Conventions

We adopt the following conventions throughout:

  • Vectors are column vectors, denoted by bold lowercase: \(\mathbf{x}, \mathbf{y}, \boldsymbol{\beta}\)

  • Matrices are denoted by bold uppercase: \(\mathbf{X}, \mathbf{A}, \boldsymbol{\Sigma}\)

  • Scalars are italic lowercase: \(a, b, \sigma^2\)

  • Transpose: \(\mathbf{x}^\top\) (row vector), \(\mathbf{A}^\top\)

  • Dimensions: \(\mathbf{y} \in \mathbb{R}^n\), \(\boldsymbol{\beta} \in \mathbb{R}^p\), \(\mathbf{X} \in \mathbb{R}^{n \times p}\)

The gradient of a scalar function \(f: \mathbb{R}^p \to \mathbb{R}\) with respect to vector \(\boldsymbol{\beta}\) is:

(86)\[\begin{split}\nabla_{\boldsymbol{\beta}} f = \frac{\partial f}{\partial \boldsymbol{\beta}} = \begin{pmatrix} \frac{\partial f}{\partial \beta_1} \\ \frac{\partial f}{\partial \beta_2} \\ \vdots \\ \frac{\partial f}{\partial \beta_p} \end{pmatrix} \in \mathbb{R}^p\end{split}\]

This is a column vector of the same dimension as \(\boldsymbol{\beta}\).

Scalar-by-Vector Derivatives: Building the Rules

We develop the key differentiation rules step by step, starting from component-wise analysis.

Rule 1: Linear function \(f(\boldsymbol{\beta}) = \mathbf{a}^\top \boldsymbol{\beta}\) where \(\mathbf{a} \in \mathbb{R}^p\) is constant.

Component form:

\[f(\boldsymbol{\beta}) = \mathbf{a}^\top \boldsymbol{\beta} = \sum_{j=1}^p a_j \beta_j\]

Taking the partial derivative with respect to \(\beta_k\):

\[\frac{\partial f}{\partial \beta_k} = \frac{\partial}{\partial \beta_k} \sum_{j=1}^p a_j \beta_j = a_k\]

Assembling all partials into a vector:

(87)\[\boxed{\frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{a}^\top \boldsymbol{\beta}) = \mathbf{a}}\]

By symmetry of the dot product, \(\mathbf{a}^\top \boldsymbol{\beta} = \boldsymbol{\beta}^\top \mathbf{a}\), so:

\[\frac{\partial}{\partial \boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{a}) = \mathbf{a}\]

Rule 2: Quadratic form \(f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}\) where \(\mathbf{A} \in \mathbb{R}^{p \times p}\) is constant.

Component form:

\[f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta} = \sum_{i=1}^p \sum_{j=1}^p \beta_i A_{ij} \beta_j\]

Taking the partial derivative with respect to \(\beta_k\) requires identifying all terms containing \(\beta_k\):

  • Terms where \(i = k\) (varying \(j\)): \(\sum_{j=1}^p \beta_k A_{kj} \beta_j\)

  • Terms where \(j = k\) (varying \(i\)): \(\sum_{i=1}^p \beta_i A_{ik} \beta_k\)

For the first group:

\[\frac{\partial}{\partial \beta_k} \sum_{j=1}^p \beta_k A_{kj} \beta_j = \sum_{j=1}^p A_{kj} \beta_j = [\mathbf{A}\boldsymbol{\beta}]_k\]

For the second group:

\[\frac{\partial}{\partial \beta_k} \sum_{i=1}^p \beta_i A_{ik} \beta_k = \sum_{i=1}^p A_{ik} \beta_i = [\mathbf{A}^\top\boldsymbol{\beta}]_k\]

Combining:

\[\frac{\partial f}{\partial \beta_k} = [\mathbf{A}\boldsymbol{\beta}]_k + [\mathbf{A}^\top\boldsymbol{\beta}]_k\]

Assembling into vector form:

(88)\[\boxed{\frac{\partial}{\partial \boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}) = (\mathbf{A} + \mathbf{A}^\top)\boldsymbol{\beta}}\]

Special case: If \(\mathbf{A}\) is symmetric (\(\mathbf{A} = \mathbf{A}^\top\)):

(89)\[\boxed{\frac{\partial}{\partial \boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta} \quad \text{(when } \mathbf{A} = \mathbf{A}^\top)}\]

This is the result we’ll use most often, since \(\mathbf{X}^\top\mathbf{X}\) is always symmetric.

Rule 3: Matrix-vector product \(f(\boldsymbol{\beta}) = \mathbf{A}\boldsymbol{\beta}\) where the result is a vector, not a scalar.

Here \(f: \mathbb{R}^p \to \mathbb{R}^n\), so the derivative is a matrix called the Jacobian:

\[\frac{\partial (\mathbf{A}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^\top} = \mathbf{A} \in \mathbb{R}^{n \times p}\]

The \((i, j)\) element is \(\frac{\partial [\mathbf{A}\boldsymbol{\beta}]_i}{\partial \beta_j} = A_{ij}\).

Summary of Key Rules

Table 30 Matrix Calculus Rules for Scalar Functions

  • \(f(\boldsymbol{\beta}) = \mathbf{a}^\top \boldsymbol{\beta}\) or \(\boldsymbol{\beta}^\top \mathbf{a}\) (\(\mathbf{a}\) constant): \(\frac{\partial f}{\partial \boldsymbol{\beta}} = \mathbf{a}\)

  • \(f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}\) (\(\mathbf{A}\) constant): \(\frac{\partial f}{\partial \boldsymbol{\beta}} = (\mathbf{A} + \mathbf{A}^\top)\boldsymbol{\beta}\)

  • \(f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}\) (\(\mathbf{A}\) symmetric): \(\frac{\partial f}{\partial \boldsymbol{\beta}} = 2\mathbf{A}\boldsymbol{\beta}\)

  • \(f(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2\): \(\frac{\partial f}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\) (derived below)

The Chain Rule for Vectors

For composite functions, we need the chain rule. If \(f: \mathbb{R}^n \to \mathbb{R}\) and \(\mathbf{g}: \mathbb{R}^p \to \mathbb{R}^n\), then for \(h(\boldsymbol{\beta}) = f(\mathbf{g}(\boldsymbol{\beta}))\):

(90)\[\frac{\partial h}{\partial \boldsymbol{\beta}} = \left(\frac{\partial \mathbf{g}}{\partial \boldsymbol{\beta}^\top}\right)^\top \frac{\partial f}{\partial \mathbf{g}}\]

This is the matrix form of the familiar chain rule: derivative of outer function times derivative of inner function.
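
These rules are easy to sanity-check numerically. The following sketch (a standalone check, not part of the text's derivation) compares the analytic gradient of the quadratic form \(\boldsymbol{\beta}^\top\mathbf{A}\boldsymbol{\beta}\) from Rule 2 against a central finite-difference approximation.

import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p))        # a general (non-symmetric) matrix
beta = rng.normal(size=p)

def f(b):
    return b @ A @ b               # scalar quadratic form

# Analytic gradient from Rule 2: (A + A') beta
grad_analytic = (A + A.T) @ beta

# Central finite differences, one coordinate at a time
h = 1e-6
grad_fd = np.array([(f(beta + h * e) - f(beta - h * e)) / (2 * h)
                    for e in np.eye(p)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))  # True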

The Linear Model

With matrix calculus tools in hand, we now formalize the linear regression model.

Model Specification

We observe \(n\) data points \((y_i, \mathbf{x}_i)\) where \(y_i \in \mathbb{R}\) is the response (dependent variable) and \(\mathbf{x}_i \in \mathbb{R}^p\) is the vector of predictors (independent variables, features, covariates).

The linear model posits:

(91)\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \quad i = 1, \ldots, n\]

where:

  • \(\beta_0, \beta_1, \ldots, \beta_{p-1}\) are unknown parameters (coefficients)

  • \(\varepsilon_i\) is the error term (disturbance, noise)

Matrix notation: Define the design matrix \(\mathbf{X}\) and response vector \(\mathbf{y}\):

\[\begin{split}\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{pmatrix} \in \mathbb{R}^{n \times p}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n\end{split}\]

The first column of \(\mathbf{X}\) is all ones (for the intercept). The parameter vector is \(\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_{p-1})^\top \in \mathbb{R}^p\).

The model in matrix form:

(92)\[\boxed{\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}}\]

where \(\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^\top\).

What “Linear” Means

The model is linear in the parameters \(\boldsymbol{\beta}\), not necessarily in the predictors. All of the following are linear models:

\[\begin{split}y &= \beta_0 + \beta_1 x + \varepsilon & \text{(simple linear regression)} \\ y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon & \text{(polynomial regression)} \\ y &= \beta_0 + \beta_1 \log(x) + \varepsilon & \text{(log transformation)} \\ y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon & \text{(interaction term)}\end{split}\]

Each involves \(y\) as a linear combination of \(\boldsymbol{\beta}\). The following is not a linear model:

\[y = \beta_0 e^{\beta_1 x} + \varepsilon\]

Here \(y\) is nonlinear in \(\beta_1\).
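
To make "linear in the parameters" concrete, here is a minimal sketch with made-up data: a quadratic relationship is fit by placing \(x\) and \(x^2\) in the columns of the design matrix, so the estimation problem remains ordinary least squares in \(\boldsymbol{\beta}\).

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 50)
y_poly = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 0.3, 50)

# Design matrix with columns 1, x, x^2: nonlinear in x, linear in beta
X_poly = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X_poly, y_poly, rcond=None)
print(beta_hat)  # roughly [1.0, 0.5, -2.0]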

Interpreting Coefficients

Before diving into estimation, let’s clarify what the coefficients mean:

Partial slopes: In multiple regression, \(\beta_j\) represents the expected change in \(y\) for a one-unit increase in \(x_j\), holding all other predictors fixed. This “ceteris paribus” interpretation is fundamental.

Correlation vs. causation: The OLS coefficient \(\hat{\beta}_j\) measures the association between \(x_j\) and \(y\), not necessarily a causal effect. To interpret \(\beta_j\) causally, we need:

  1. No omitted confounders (variables that affect both \(x_j\) and \(y\))

  2. No reverse causation (\(y\) doesn’t cause \(x_j\))

  3. No measurement error in predictors

These are strong assumptions rarely satisfied in observational data. Experimental design (randomization) or quasi-experimental methods (instrumental variables, regression discontinuity) are needed for causal claims.

Effect of correlation among predictors: When predictors are correlated, “holding other predictors fixed” may describe a scenario rarely observed in the data. The coefficient \(\beta_j\) still has a well-defined mathematical interpretation, but its practical meaning becomes murky. High multicollinearity inflates standard errors, making individual coefficients imprecise even when the overall model predicts well.

Assumptions: The Classical Linear Model

The linear model \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) comes with assumptions about the errors \(\boldsymbol{\varepsilon}\). We distinguish two levels:

Gauss-Markov Assumptions (for OLS to be BLUE):

Assumption (GM1): Linearity

The true relationship is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) for some \(\boldsymbol{\beta} \in \mathbb{R}^p\).

Assumption (GM2): Full Rank

The design matrix \(\mathbf{X}\) has full column rank: \(\text{rank}(\mathbf{X}) = p\).

Equivalently, \(\mathbf{X}^\top\mathbf{X}\) is invertible. This requires \(n \geq p\) and no exact multicollinearity among predictors.

Assumption (GM3): Strict Exogeneity

The errors have zero conditional mean given \(\mathbf{X}\):

\[\mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \mathbf{0}\]

This implies \(\mathbb{E}[\varepsilon_i | \mathbf{X}] = 0\) for each \(i\). Predictors are uncorrelated with errors.

Assumption (GM4): Homoskedastic and Uncorrelated Errors

The errors are homoskedastic and uncorrelated:

\[\text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n\]

This means:

  • Homoskedasticity: \(\text{Var}(\varepsilon_i | \mathbf{X}) = \sigma^2\) for all \(i\) (constant variance)

  • No autocorrelation: \(\text{Cov}(\varepsilon_i, \varepsilon_j | \mathbf{X}) = 0\) for \(i \neq j\)

(This condition is sometimes called “spherical errors” because the error covariance is a scaled identity matrix.)

Normality Assumption (for exact inference):

Assumption (GM5): Normal Errors

The errors follow a multivariate normal distribution:

\[\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\]

This implies GM3 and GM4, plus enables exact t-tests and F-tests. Note that this is specifically the spherical normal model \(\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), not merely “some normal covariance”—normality alone does not guarantee homoskedasticity or independence.

The assumptions form a hierarchy:

  • GM1-GM2: OLS is well-defined (unique solution exists)

  • GM1-GM4: OLS is BLUE (Gauss-Markov theorem applies)

  • GM1-GM5: Exact finite-sample inference (t, F distributions)

Ordinary Least Squares: The Calculus Approach

The Ordinary Least Squares (OLS) estimator minimizes the sum of squared residuals:

(93)\[\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

We find \(\hat{\boldsymbol{\beta}}\) that minimizes RSS by taking the derivative and setting it to zero.

Expanding the Objective Function

First, expand the quadratic form:

\[\begin{split}\text{RSS}(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\end{split}\]

Since \(\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}\) is a scalar, it equals its transpose \(\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}\):

(94)\[\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\]

Taking the Derivative

Using our matrix calculus rules:

  • \(\frac{\partial}{\partial \boldsymbol{\beta}}(\mathbf{y}^\top\mathbf{y}) = \mathbf{0}\) (constant in \(\boldsymbol{\beta}\))

  • \(\frac{\partial}{\partial \boldsymbol{\beta}}(2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}) = 2\mathbf{X}^\top\mathbf{y}\) (Rule 1: linear form with \(\mathbf{a} = \mathbf{X}^\top\mathbf{y}\))

  • \(\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}) = 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\) (Rule 2: quadratic form with symmetric \(\mathbf{A} = \mathbf{X}^\top\mathbf{X}\))

Combining:

(95)\[\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = 2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})\]

The Normal Equations

Setting the gradient to zero:

\[\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} - \mathbf{X}^\top\mathbf{y} = \mathbf{0}\]

Rearranging:

(96)\[\boxed{\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}}\]

These are the normal equations. The name comes from the fact that the residual vector \(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\) is normal (perpendicular) to the column space of \(\mathbf{X}\)—a fact we’ll prove geometrically.

Two approaches to deriving normal equations: calculus and geometry

Fig. 103 Figure 3.4.2: The normal equations derived two ways. (a) The calculus approach: minimize RSS by setting the gradient to zero. (b) The geometric approach: the residual \(\mathbf{e}\) must be perpendicular to \(C(\mathbf{X})\), leading to \(\mathbf{X}^\top\mathbf{e} = \mathbf{0}\).

Solving for the OLS Estimator

Under assumption GM2 (\(\text{rank}(\mathbf{X}) = p\)), the matrix \(\mathbf{X}^\top\mathbf{X}\) is invertible. Multiplying both sides by \((\mathbf{X}^\top\mathbf{X})^{-1}\):

(97)\[\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}\]

This is the OLS estimator—the central result of regression analysis.

Verifying the Minimum

We found a critical point. To confirm it’s a minimum (not maximum or saddle), examine the second derivative (Hessian):

\[\frac{\partial^2 \text{RSS}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^\top} = 2\mathbf{X}^\top\mathbf{X}\]

Since \(\mathbf{X}^\top\mathbf{X}\) is positive semi-definite (for any \(\mathbf{v}\), \(\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2 \geq 0\)), and positive definite when \(\mathbf{X}\) has full column rank, the Hessian is positive definite. Therefore, the critical point is indeed a global minimum.

Example 💡 Simple Linear Regression by Hand

Consider simple regression \(y = \beta_0 + \beta_1 x + \varepsilon\) with \(n = 4\) observations:

\[\begin{split}\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 2.1 \\ 3.9 \\ 6.2 \\ 7.8 \end{pmatrix}\end{split}\]

Step 1: Compute \(\mathbf{X}^\top\mathbf{X}\):

\[\begin{split}\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}\end{split}\]

Step 2: Compute \(\mathbf{X}^\top\mathbf{y}\):

\[\begin{split}\mathbf{X}^\top\mathbf{y} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{pmatrix} \begin{pmatrix} 2.1 \\ 3.9 \\ 6.2 \\ 7.8 \end{pmatrix} = \begin{pmatrix} 20.0 \\ 59.7 \end{pmatrix}\end{split}\]

Step 3: Compute \((\mathbf{X}^\top\mathbf{X})^{-1}\):

\[\begin{split}(\mathbf{X}^\top\mathbf{X})^{-1} = \frac{1}{4 \cdot 30 - 10 \cdot 10} \begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix} = \begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix}\end{split}\]

Step 4: Compute \(\hat{\boldsymbol{\beta}}\):

\[\begin{split}\hat{\boldsymbol{\beta}} = \begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix} \begin{pmatrix} 20.0 \\ 59.7 \end{pmatrix} = \begin{pmatrix} 30 - 29.85 \\ -10 + 11.94 \end{pmatrix} = \begin{pmatrix} 0.15 \\ 1.94 \end{pmatrix}\end{split}\]

The fitted line is \(\hat{y} = 0.15 + 1.94x\).

import numpy as np

# Data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# OLS via normal equations
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)  # More stable than explicit inverse

print("X'X =")
print(XtX)
print(f"\nX'y = {Xty}")
print(f"\nOLS estimates: β̂₀ = {beta_hat[0]:.4f}, β̂₁ = {beta_hat[1]:.4f}")

# Verify with fitted values
y_hat = X @ beta_hat
residuals = y - y_hat
RSS = np.sum(residuals**2)
print(f"\nFitted values: {y_hat}")
print(f"Residuals: {residuals}")
print(f"RSS = {RSS:.4f}")
X'X =
[[ 4 10]
 [10 30]]

X'y = [20.  59.7]

OLS estimates: β̂₀ = 0.1500, β̂₁ = 1.9400

Fitted values: [2.09 4.03 5.97 7.91]
Residuals: [ 0.01 -0.13  0.23 -0.11]
RSS = 0.0820

Ordinary Least Squares: The Geometric Approach

The calculus approach found the OLS estimator by optimization. The geometric approach reveals why the normal equations work and provides deeper insight into what OLS does.

Column Space and Projection

Consider the design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\). Its column space (or range) is:

(98)\[\mathcal{C}(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^p\} \subseteq \mathbb{R}^n\]

This is the set of all linear combinations of the columns of \(\mathbf{X}\)—a \(p\)-dimensional subspace of \(\mathbb{R}^n\) (assuming full column rank).

The model says \(\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}\), so the expected response lies in the column space. The observed \(\mathbf{y}\) typically does not lie in \(\mathcal{C}(\mathbf{X})\) because of errors.

Key insight: OLS finds the point in \(\mathcal{C}(\mathbf{X})\) closest to \(\mathbf{y}\).

Geometric interpretation of OLS as orthogonal projection onto a hyperplane

Fig. 104 Figure 3.4.1: OLS as orthogonal projection. The observed \(\mathbf{y}\) lives in \(\mathbb{R}^n\), while the column space \(C(\mathbf{X})\) is a p-dimensional hyperplane. The fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are the closest point to \(\mathbf{y}\) in this subspace. The residual \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to the entire column space.

The Orthogonality Principle

The closest point to \(\mathbf{y}\) in a subspace is the orthogonal projection. The projection \(\hat{\mathbf{y}}\) satisfies:

\[\mathbf{y} - \hat{\mathbf{y}} \perp \mathcal{C}(\mathbf{X})\]

This means the residual \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to every vector in the column space. Since any vector in \(\mathcal{C}(\mathbf{X})\) can be written as \(\mathbf{X}\mathbf{a}\) for some \(\mathbf{a} \in \mathbb{R}^p\):

\[(\mathbf{X}\mathbf{a})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^p\]

This requires:

\[\mathbf{a}^\top \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \quad \text{for all } \mathbf{a}\]

The only way this holds for all \(\mathbf{a}\) is if:

\[\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\]

Rearranging:

\[\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\]

We’ve recovered the normal equations geometrically! The name “normal equations” now makes sense: they express that residuals are normal (perpendicular) to the column space.

The Hat Matrix

The hat matrix (or projection matrix) projects \(\mathbf{y}\) onto \(\mathcal{C}(\mathbf{X})\):

(99)\[\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\]

The fitted values are:

\[\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y}\]

The name “hat matrix” comes from putting a hat on \(\mathbf{y}\) to get \(\hat{\mathbf{y}}\).

Properties of the hat matrix:

  1. Symmetric: \(\mathbf{H}^\top = \mathbf{H}\)

    Proof: \(\mathbf{H}^\top = [\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top]^\top = \mathbf{X}[(\mathbf{X}^\top\mathbf{X})^{-1}]^\top\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}\)

  2. Idempotent: \(\mathbf{H}^2 = \mathbf{H}\) (projecting twice gives the same result)

    Proof: \(\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}\)

  3. Trace: \(\text{tr}(\mathbf{H}) = p\)

    Proof: Using \(\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})\): \(\text{tr}(\mathbf{H}) = \text{tr}(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top) = \text{tr}((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}) = \text{tr}(\mathbf{I}_p) = p\)

  4. Leverage: The diagonal elements \(h_{ii} \in [0, 1]\) measure how much observation \(i\) influences its own fitted value.

The residual vector can also be written using a projection matrix:

\[\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y} = \mathbf{M}\mathbf{y}\]

where \(\mathbf{M} = \mathbf{I} - \mathbf{H}\) projects onto the orthogonal complement of \(\mathcal{C}(\mathbf{X})\).

The hat matrix visualized as a projection operator

Fig. 105 Figure 3.4.3: The hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) “puts a hat on y.” (a) Conceptual view: H transforms observed y into fitted ŷ. (b) Matrix visualization showing larger diagonal elements (leverage) for extreme observations. (c) The projection: H maps each \(y_i\) to \(\hat{y}_i\).
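
These properties are easy to verify numerically. A minimal sketch, reusing the four-point design from the worked example above (any full-column-rank \(\mathbf{X}\) would do):

import numpy as np

X_demo = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y_demo = np.array([2.1, 3.9, 6.2, 7.8])

# Forming H explicitly is fine for a tiny example (it is n x n in general)
H = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
M = np.eye(len(y_demo)) - H

print(np.allclose(H, H.T))                        # symmetric
print(np.allclose(H @ H, H))                      # idempotent
print(np.isclose(np.trace(H), X_demo.shape[1]))   # trace = p
print(np.allclose(X_demo.T @ (M @ y_demo), 0))    # residuals orthogonal to C(X)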

Rank, Pseudo-Inverse, and When Solutions Exist

When \(\mathbf{X}\) has full column rank (\(\text{rank}(\mathbf{X}) = p\)), the solution is unique:

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

The matrix \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is the left pseudo-inverse of \(\mathbf{X}\), denoted \(\mathbf{X}^+\). It satisfies \(\mathbf{X}^+\mathbf{X} = \mathbf{I}_p\).

What if \(\mathbf{X}\) doesn’t have full rank?

When \(\text{rank}(\mathbf{X}) = r < p\) (multicollinearity), \(\mathbf{X}^\top\mathbf{X}\) is singular and not invertible. The normal equations \(\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}\) still have solutions, but infinitely many—the solution set is a \((p-r)\)-dimensional affine subspace.

Multicollinearity effects on coefficient estimation

Fig. 106 Figure 3.4.12: Multicollinearity inflates standard errors. (a) VIF grows rapidly as correlation approaches 1: \(\text{VIF} = 1/(1-\rho^2)\). (b) With high correlation (\(\rho = 0.95\)), the sampling distribution of \(\hat{\beta}_1\) is much wider than with uncorrelated predictors. (c) The joint confidence region becomes an elongated ellipse when predictors are correlated.

In this case, we use the Moore-Penrose pseudo-inverse \(\mathbf{X}^+\), which selects the minimum-norm solution:

\[\hat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{y}\]

This solution has the smallest \(\|\boldsymbol{\beta}\|\) among all solutions to the normal equations.

In practice: Software like numpy.linalg.lstsq and scipy.linalg.lstsq use SVD-based methods that handle rank deficiency automatically.

import numpy as np

# Example with exact multicollinearity
# x3 = x1 + x2 (perfect linear dependence)
X_singular = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 2, 1, 3],
    [1, 1, 2, 3]
])
y = np.array([1, 2, 4, 5])

# Check rank
print(f"Matrix shape: {X_singular.shape}")
print(f"Rank: {np.linalg.matrix_rank(X_singular)}")
print(f"X'X is singular: {np.linalg.det(X_singular.T @ X_singular) == 0}")

# lstsq handles rank deficiency via SVD
beta_hat, residuals, rank, s = np.linalg.lstsq(X_singular, y, rcond=None)
print(f"\nMinimum-norm solution: {beta_hat}")
print(f"Singular values: {s}")
Matrix shape: (4, 4)
Rank: 3
X'X is singular: True

Minimum-norm solution: [0. 0. 1. 1.]
Singular values: [5.77 1.41 0.85 0.  ]

Properties of the OLS Estimator

Having derived the OLS estimator, we now examine its statistical properties.

Convention: Fixed Design

Throughout this section, we treat \(\mathbf{X}\) as fixed (non-random) and condition all statements on \(\mathbf{X}\). Expectations, variances, and distributions are conditional on \(\mathbf{X}\). This “fixed design” viewpoint is standard in classical regression theory. When \(\mathbf{X}\) is random, results hold conditional on the observed \(\mathbf{X}\).

Unbiasedness

Theorem: OLS is Unbiased

Under assumptions GM1-GM3, the OLS estimator is unbiased:

\[\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}\]

Proof:

Starting from \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) and substituting \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\):

\[\begin{split}\hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\end{split}\]

Taking expectations conditional on \(\mathbf{X}\):

\[\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{0} = \boldsymbol{\beta}\]

where we used GM3: \(\mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \mathbf{0}\). ∎

Variance-Covariance Matrix

Theorem: Variance of OLS Estimator

Under assumptions GM1-GM4:

(100)\[\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

Proof:

From above, \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\).

Let \(\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\). Then:

\[\begin{split}\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \text{Var}(\mathbf{A}\boldsymbol{\varepsilon} | \mathbf{X}) \\ &= \mathbf{A} \text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) \mathbf{A}^\top \\ &= \mathbf{A} (\sigma^2 \mathbf{I}_n) \mathbf{A}^\top \\ &= \sigma^2 \mathbf{A} \mathbf{A}^\top\end{split}\]

Now compute \(\mathbf{A}\mathbf{A}^\top\):

\[\begin{split}\mathbf{A}\mathbf{A}^\top &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{X}[(\mathbf{X}^\top\mathbf{X})^{-1}]^\top \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\end{split}\]

Therefore:

\[\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

Interpreting the variance formula:

The variance of the \(j\)-th coefficient is:

\[\text{Var}(\hat{\beta}_j | \mathbf{X}) = \sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}\]

The standard error is:

\[\text{SE}(\hat{\beta}_j) = \sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}\]

In practice, we replace \(\sigma\) with its estimate \(s\) (see below).
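
The formula can also be checked by simulation. In the sketch below (an arbitrary fixed design with known \(\sigma\), not data from the text), the empirical covariance of \(\hat{\boldsymbol{\beta}}\) across repeated samples is compared with \(\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\).

import numpy as np

rng = np.random.default_rng(0)
n_sim, sigma = 50, 1.5
X_sim = np.column_stack([np.ones(n_sim), rng.uniform(0, 10, (n_sim, 2))])  # fixed design
beta_true = np.array([2.0, -1.0, 0.5])

draws = []
for _ in range(5000):
    y_sim = X_sim @ beta_true + rng.normal(0, sigma, n_sim)
    draws.append(np.linalg.lstsq(X_sim, y_sim, rcond=None)[0])
draws = np.array(draws)

print(np.round(np.cov(draws.T), 4))                            # empirical covariance
print(np.round(sigma**2 * np.linalg.inv(X_sim.T @ X_sim), 4))  # theoretical, should be close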

The Gauss-Markov Theorem

The Gauss-Markov theorem is one of the most important results in regression theory. It establishes OLS as the Best Linear Unbiased Estimator (BLUE).

What BLUE Means

Let’s unpack each word:

  • Estimator: A function of the data that produces an estimate of the parameter

  • Unbiased: \(\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\) (hits the target on average)

  • Linear: The estimator is a linear function of \(\mathbf{y}\): \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some matrix \(\mathbf{C}\)

  • Best: Has the smallest variance among all linear unbiased estimators

Why focus on linear estimators?

  1. Tractability: Linear estimators have simple variance formulas

  2. Sufficiency: For normal errors, linear estimators can be optimal even among nonlinear ones

  3. Robustness considerations: Nonlinear estimators may be more efficient under specific distributional assumptions but fail badly when those assumptions are wrong

The Theorem Statement

Theorem: Gauss-Markov

Under assumptions GM1-GM4 (linearity, full rank, strict exogeneity, spherical errors), the OLS estimator \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) is BLUE:

For any other linear unbiased estimator \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) of \(\boldsymbol{\beta}\):

\[\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}) \text{ is positive semi-definite}\]

In particular, for any coefficient:

\[\text{Var}(\tilde{\beta}_j) \geq \text{Var}(\hat{\beta}_j)\]

OLS achieves the minimum variance among all linear unbiased estimators.

Complete Proof

Step 1: Characterize linear unbiased estimators

Any linear estimator has the form \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some \(p \times n\) matrix \(\mathbf{C}\).

For \(\tilde{\boldsymbol{\beta}}\) to be unbiased:

\[\mathbb{E}[\tilde{\boldsymbol{\beta}}] = \mathbf{C}\mathbb{E}[\mathbf{y}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \quad \text{for all } \boldsymbol{\beta}\]

This requires \(\mathbf{C}\mathbf{X} = \mathbf{I}_p\).

Step 2: Write any linear unbiased estimator as OLS plus something

Let \(\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) be the OLS matrix, so \(\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}\).

Define \(\mathbf{D} = \mathbf{C} - \mathbf{A}\). Then any linear unbiased estimator can be written:

\[\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y} = (\mathbf{A} + \mathbf{D})\mathbf{y}\]

Since both \(\mathbf{C}\) and \(\mathbf{A}\) satisfy \(\mathbf{C}\mathbf{X} = \mathbf{I}_p\) and \(\mathbf{A}\mathbf{X} = \mathbf{I}_p\):

\[\mathbf{D}\mathbf{X} = \mathbf{C}\mathbf{X} - \mathbf{A}\mathbf{X} = \mathbf{I}_p - \mathbf{I}_p = \mathbf{0}\]

Step 3: Compute the variance of the alternative estimator

Conditional on \(\mathbf{X}\), the variance of any linear estimator \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) is:

\[\begin{split}\text{Var}(\tilde{\boldsymbol{\beta}} | \mathbf{X}) &= \text{Var}((\mathbf{A} + \mathbf{D})\mathbf{y} | \mathbf{X}) \\ &= (\mathbf{A} + \mathbf{D}) \text{Var}(\mathbf{y} | \mathbf{X}) (\mathbf{A} + \mathbf{D})^\top \\ &= \sigma^2 (\mathbf{A} + \mathbf{D})(\mathbf{A} + \mathbf{D})^\top \\ &= \sigma^2 (\mathbf{A}\mathbf{A}^\top + \mathbf{A}\mathbf{D}^\top + \mathbf{D}\mathbf{A}^\top + \mathbf{D}\mathbf{D}^\top)\end{split}\]

Step 4: Show the cross-terms vanish

Compute \(\mathbf{A}\mathbf{D}^\top\):

\[\mathbf{A}\mathbf{D}^\top = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{D}^\top\]

Now, \(\mathbf{D}\mathbf{X} = \mathbf{0}\) implies \(\mathbf{X}^\top\mathbf{D}^\top = \mathbf{0}\). Therefore:

\[\mathbf{A}\mathbf{D}^\top = (\mathbf{X}^\top\mathbf{X})^{-1} \cdot \mathbf{0} = \mathbf{0}\]

Similarly, \(\mathbf{D}\mathbf{A}^\top = (\mathbf{A}\mathbf{D}^\top)^\top = \mathbf{0}\).

Step 5: Conclude

\[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{A}\mathbf{A}^\top + \mathbf{D}\mathbf{D}^\top) = \text{Var}(\hat{\boldsymbol{\beta}}) + \sigma^2 \mathbf{D}\mathbf{D}^\top\]

Since \(\mathbf{D}\mathbf{D}^\top\) is positive semi-definite (for any \(\mathbf{v}\), \(\mathbf{v}^\top\mathbf{D}\mathbf{D}^\top\mathbf{v} = \|\mathbf{D}^\top\mathbf{v}\|^2 \geq 0\)):

\[\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 \mathbf{D}\mathbf{D}^\top \succeq \mathbf{0}\]

with equality if and only if \(\mathbf{D} = \mathbf{0}\), i.e., \(\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}\). ∎
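
A small simulation makes the theorem tangible. The sketch below (an illustrative setup, not from the text) builds an alternative linear unbiased estimator as a weighted least-squares fit with arbitrary fixed weights; it satisfies \(\mathbf{C}\mathbf{X} = \mathbf{I}_p\), so it is unbiased, yet under homoskedastic errors its sampling variance exceeds that of OLS.

import numpy as np

rng = np.random.default_rng(0)
n_gm = 40
X_gm = np.column_stack([np.ones(n_gm), rng.uniform(0, 5, n_gm)])
beta_true = np.array([1.0, 2.0])
W = np.diag(rng.uniform(0.2, 5.0, n_gm))   # arbitrary positive weights, fixed across replications

ols_draws, alt_draws = [], []
for _ in range(5000):
    y_gm = X_gm @ beta_true + rng.normal(0, 1.0, n_gm)                       # homoskedastic errors
    ols_draws.append(np.linalg.solve(X_gm.T @ X_gm, X_gm.T @ y_gm))
    alt_draws.append(np.linalg.solve(X_gm.T @ W @ X_gm, X_gm.T @ W @ y_gm))  # C y with C X = I
ols_draws, alt_draws = np.array(ols_draws), np.array(alt_draws)

print("Bias (OLS, alt):", ols_draws.mean(0) - beta_true, alt_draws.mean(0) - beta_true)  # both near 0
print("Slope variance: ", ols_draws[:, 1].var(), "<", alt_draws[:, 1].var())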

What Happens When Assumptions Fail

Each Gauss-Markov assumption is critical. Here’s what breaks when each fails:

Table 31 Consequences of Gauss-Markov Assumption Violations

  • GM1 (Linearity). Violation: the true model is \(y = \beta_0 + \beta_1 x^2 + \varepsilon\) but we fit \(y = \beta_0 + \beta_1 x\). Consequence: OLS is biased and inconsistent; it estimates the wrong quantity. Remedy: transform predictors, add polynomial terms, or use nonlinear models.

  • GM2 (Full Rank). Violation: perfect multicollinearity, e.g. \(x_3 = 2x_1 + x_2\). Consequence: \(\mathbf{X}^\top\mathbf{X}\) is singular, so OLS is not unique. Remedy: remove redundant predictors or use regularization (ridge).

  • GM3 (Exogeneity). Violation: an omitted variable is correlated with an included predictor. Consequence: OLS is biased and inconsistent. Remedy: include the omitted variable, or use instrumental variables or fixed effects.

  • GM4a (Homoskedasticity). Violation: the error variance increases with \(x\), e.g. \(\text{Var}(\varepsilon_i) = \sigma^2 x_i^2\). Consequence: OLS is unbiased but not efficient; standard errors are wrong. Remedy: WLS if the variance form is known, robust (sandwich) SEs otherwise.

  • GM4b (No autocorrelation). Violation: time-series errors \(\varepsilon_t = \rho\varepsilon_{t-1} + u_t\). Consequence: OLS is unbiased but not efficient; standard errors are wrong. Remedy: GLS, Newey-West SEs, or time-series models.

Common Pitfall ⚠️ Endogeneity

The most serious violation is GM3 (exogeneity): the requirement that \(\mathbb{E}[\varepsilon_i | \mathbf{X}] = 0\) for all \(i\). This “conditional mean zero” condition says that knowing \(\mathbf{X}\) tells you nothing about the expected error.

When \(\mathbb{E}[\varepsilon | \mathbf{X}] \neq \mathbf{0}\):

  • OLS is biased even in large samples (inconsistent)

  • No amount of data fixes the problem

  • Robust standard errors do not help: They correct for heteroskedasticity, not bias. If your coefficients are biased, having more precise standard errors for the wrong value doesn’t help!

Common causes: omitted variables correlated with included predictors, measurement error in \(\mathbf{X}\), simultaneity (reverse causation).

Detection is difficult since we don’t observe \(\varepsilon\). Economic theory, domain knowledge, and careful study design are essential. Remedies include instrumental variables and natural experiments.

Estimating the Error Variance

The variance formula \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\) depends on the unknown \(\sigma^2\). We need to estimate it.

The Residual Sum of Squares

Define the residuals as \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\).

The residual sum of squares (RSS) is:

\[\text{RSS} = \mathbf{e}^\top\mathbf{e} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

A natural estimator of \(\sigma^2\) might be \(\text{RSS}/n\). However, this is biased because residuals underestimate the true errors.

The Unbiased Estimator

Theorem: Unbiased Estimator of σ²

Under GM1-GM4:

(101)\[s^2 = \frac{\text{RSS}}{n - p} = \frac{\mathbf{e}^\top\mathbf{e}}{n - p}\]

is an unbiased estimator of \(\sigma^2\): \(\mathbb{E}[s^2] = \sigma^2\).

Proof sketch:

The residuals can be written as \(\mathbf{e} = \mathbf{M}\mathbf{y} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \mathbf{M}\boldsymbol{\varepsilon}\) where \(\mathbf{M} = \mathbf{I} - \mathbf{H}\).

(Note: \(\mathbf{M}\mathbf{X}\boldsymbol{\beta} = (\mathbf{I} - \mathbf{H})\mathbf{X}\boldsymbol{\beta} = \mathbf{X}\boldsymbol{\beta} - \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}\boldsymbol{\beta} - \mathbf{X}\boldsymbol{\beta} = \mathbf{0}\))

Then:

\[\text{RSS} = \mathbf{e}^\top\mathbf{e} = \boldsymbol{\varepsilon}^\top\mathbf{M}^\top\mathbf{M}\boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}\]

since \(\mathbf{M}\) is idempotent (\(\mathbf{M}^2 = \mathbf{M}\)) and symmetric.

Taking expectations and using the trace formula for quadratic forms:

\[\mathbb{E}[\text{RSS}] = \mathbb{E}[\boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}] = \text{tr}(\mathbf{M} \cdot \sigma^2\mathbf{I}) = \sigma^2 \text{tr}(\mathbf{M})\]

Since \(\text{tr}(\mathbf{M}) = \text{tr}(\mathbf{I} - \mathbf{H}) = n - \text{tr}(\mathbf{H}) = n - p\):

\[\mathbb{E}[\text{RSS}] = \sigma^2 (n - p)\]

Therefore \(\mathbb{E}[\text{RSS}/(n-p)] = \sigma^2\). ∎

Degrees of freedom interpretation: We estimate \(p\) parameters from \(n\) observations, leaving \(n - p\) degrees of freedom for estimating error variance.

Degrees of freedom visualized showing why we divide by n-p

Fig. 107 Figure 3.4.5: Degrees of freedom explained. (a) With n observations, p are “used up” estimating \(\boldsymbol{\beta}\), leaving n-p to estimate \(\sigma^2\). (b) Simulation confirms \(s^2 = \text{RSS}/(n-p)\) is unbiased while \(\text{RSS}/n\) underestimates \(\sigma^2\). (c) Under normality, \(\text{RSS}/\sigma^2 \sim \chi^2_{n-p}\).
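
A short simulation (a sketch under arbitrary assumed settings, in the spirit of panel (b) of the figure) shows that averaging \(\text{RSS}/(n-p)\) over many replications recovers \(\sigma^2\), while \(\text{RSS}/n\) is systematically too small.

import numpy as np

rng = np.random.default_rng(0)
n_df, p_df, sigma2 = 30, 4, 2.0
X_df = np.column_stack([np.ones(n_df), rng.normal(size=(n_df, p_df - 1))])
beta_df = rng.normal(size=p_df)

s2_vals, naive_vals = [], []
for _ in range(20000):
    y_df = X_df @ beta_df + rng.normal(0, np.sqrt(sigma2), n_df)
    e = y_df - X_df @ np.linalg.lstsq(X_df, y_df, rcond=None)[0]
    rss = e @ e
    s2_vals.append(rss / (n_df - p_df))
    naive_vals.append(rss / n_df)

print(np.mean(s2_vals))     # close to 2.0 (unbiased)
print(np.mean(naive_vals))  # close to 2.0 * (n - p) / n, about 1.73 (biased downward)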

Standard Errors of Coefficients

The estimated variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) is:

\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = s^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

The standard error of the \(j\)-th coefficient is:

(102)\[\text{SE}(\hat{\beta}_j) = s \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}\]
import numpy as np
from scipy import stats

def ols_with_inference(X, y):
    """
    Perform OLS regression with inference.

    Note: This function uses explicit matrix formulas for pedagogical clarity.
    For production code, use np.linalg.lstsq() or scipy.linalg.lstsq() which
    are numerically more stable (see Numerical Stability section).

    Returns
    -------
    dict with beta_hat, se, t_stats, p_values, R2, s2
    """
    n, p = X.shape

    # OLS estimates (formula-aligned; use np.linalg.lstsq for production)
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)  # Needed for standard errors
    beta_hat = np.linalg.solve(XtX, X.T @ y)  # More stable than inv() @ y

    # Fitted values and residuals
    y_hat = X @ beta_hat
    residuals = y - y_hat

    # Variance estimation
    RSS = residuals @ residuals
    s2 = RSS / (n - p)
    s = np.sqrt(s2)

    # Standard errors
    var_beta = s2 * XtX_inv
    se = np.sqrt(np.diag(var_beta))

    # t-statistics and p-values
    t_stats = beta_hat / se
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n-p))

    # R-squared
    TSS = np.sum((y - np.mean(y))**2)
    R2 = 1 - RSS / TSS
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p)

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_stats': t_stats,
        'p_values': p_values,
        'R2': R2,
        'R2_adj': R2_adj,
        's2': s2,
        's': s,
        'residuals': residuals,
        'y_hat': y_hat
    }

# Example: mtcars-style data
np.random.seed(42)
n = 32
hp = np.random.uniform(50, 300, n)  # horsepower
wt = np.random.uniform(1.5, 5.5, n)  # weight (1000 lbs)
mpg = 40 - 0.03 * hp - 4 * wt + np.random.normal(0, 2, n)

X = np.column_stack([np.ones(n), hp, wt])
y = mpg

results = ols_with_inference(X, y)

print("="*60)
print("OLS REGRESSION RESULTS")
print("="*60)
print(f"\nSample size: n = {n}, Predictors: p = {X.shape[1]}")
print(f"R² = {results['R2']:.4f}, Adjusted R² = {results['R2_adj']:.4f}")
print(f"Residual SE: s = {results['s']:.4f}")

print(f"\n{'Coefficient':<12} {'Estimate':>10} {'Std Error':>10} {'t-stat':>10} {'p-value':>10}")
print("-"*54)
names = ['Intercept', 'hp', 'wt']
for name, b, se, t, p in zip(names, results['beta_hat'], results['se'],
                               results['t_stats'], results['p_values']):
    print(f"{name:<12} {b:>10.4f} {se:>10.4f} {t:>10.2f} {p:>10.4f}")
============================================================
OLS REGRESSION RESULTS
============================================================

Sample size: n = 32, Predictors: p = 3
R² = 0.8826, Adjusted R² = 0.8745
Residual SE: s = 1.9734

Coefficient     Estimate  Std Error     t-stat    p-value
------------------------------------------------------
Intercept        40.4873     1.5872      25.51     0.0000
hp               -0.0313     0.0048      -6.47     0.0000
wt               -3.9056     0.4397      -8.88     0.0000

Distributional Results Under Normality

Under the full normal linear model (GM1-GM5), we have exact distributional results enabling precise inference.

Distribution of the OLS Estimator

Theorem: Normality of OLS

Under GM1-GM5:

\[\hat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N}\left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\right)\]

Proof: Since \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) and \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\):

\(\hat{\boldsymbol{\beta}}\) is a linear transformation of normal \(\boldsymbol{\varepsilon}\), hence normal. Mean and variance computed earlier. ∎

Distribution of s²

Theorem: Chi-Square Distribution of RSS

Under GM1-GM5:

\[\frac{(n-p)s^2}{\sigma^2} = \frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}\]

Moreover, \(s^2\) and \(\hat{\boldsymbol{\beta}}\) are independent.

Proof sketch: RSS can be written as \(\boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}\), where \(\mathbf{M} = \mathbf{I} - \mathbf{H}\) is symmetric and idempotent with rank \(n - p\), so \(\text{RSS}/\sigma^2 = \boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}/\sigma^2\). By Cochran's theorem, this is \(\chi^2_{n-p}\).

The independence follows from the fact that \(\hat{\boldsymbol{\beta}}\) depends on \(\boldsymbol{\varepsilon}\) only through \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\) and \(s^2\) only through \(\mathbf{M}\boldsymbol{\varepsilon}\). The covariance between these two linear forms is \(\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{M} = \mathbf{0}\) (since \(\mathbf{X}^\top\mathbf{M} = \mathbf{0}\)), and uncorrelated linear forms of a multivariate normal are independent. ∎

t-Tests for Individual Coefficients

Combining the above results:

\[\frac{\hat{\beta}_j - \beta_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_j}{s\sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}} \sim t_{n-p}\]

This enables t-tests for individual coefficients:

  • Null hypothesis: \(H_0: \beta_j = \beta_{j,0}\) (often \(\beta_{j,0} = 0\))

  • Test statistic: \(t = \frac{\hat{\beta}_j - \beta_{j,0}}{\text{SE}(\hat{\beta}_j)}\)

  • Decision: Reject at level \(\alpha\) if \(|t| > t_{n-p, 1-\alpha/2}\)

Confidence interval for \(\beta_j\):

\[\hat{\beta}_j \pm t_{n-p, 1-\alpha/2} \cdot \text{SE}(\hat{\beta}_j)\]
Confidence intervals vs prediction intervals showing the key difference

Fig. 108 Figure 3.4.7: Confidence interval (for the mean) vs. prediction interval (for new Y). (a) The CI (narrow band) captures uncertainty in \(E[Y|x]\), while the PI (wide band) also includes the irreducible error variance. (b) The formulas differ by the “1” inside the square root—representing \(\sigma^2\) for a new observation. (c) Both intervals have hyperbolic shape, widening away from \(\bar{x}\), but the PI is offset by a constant amount.
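
As a concrete sketch, here is the two-sided 95% confidence interval for each coefficient, computed by hand for the four-point simple-regression example from earlier in this section:

import numpy as np
from scipy import stats

X_small = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y_small = np.array([2.1, 3.9, 6.2, 7.8])
n_small, p_small = X_small.shape

b = np.linalg.lstsq(X_small, y_small, rcond=None)[0]
resid = y_small - X_small @ b
s2_small = resid @ resid / (n_small - p_small)
se_b = np.sqrt(s2_small * np.diag(np.linalg.inv(X_small.T @ X_small)))

t_crit = stats.t.ppf(0.975, df=n_small - p_small)   # 95%, two-sided
for name, bj, sej in zip(['Intercept', 'Slope'], b, se_b):
    print(f"{name}: {bj:.3f} ± {t_crit * sej:.3f}")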

F-Tests for Nested Models

To test whether a subset of coefficients is jointly zero, use the F-test.

Setup: Compare two models:

  • Restricted model (R): \(q\) constraints imposed, residual sum of squares \(\text{RSS}_R\)

  • Full model (F): no constraints, residual sum of squares \(\text{RSS}_F\)

Theorem: F-Test for Nested Models

Under \(H_0\) (constraints are true) and normality:

\[F = \frac{(\text{RSS}_R - \text{RSS}_F) / q}{\text{RSS}_F / (n - p)} \sim F_{q, n-p}\]

Example: Testing \(H_0: \beta_2 = \beta_3 = 0\) (joint significance of two predictors).

import numpy as np
from scipy import stats

def f_test_nested(y, X_full, X_reduced):
    """
    F-test comparing full model to reduced (nested) model.
    """
    n = len(y)
    p_full = X_full.shape[1]
    p_reduced = X_reduced.shape[1]
    q = p_full - p_reduced  # number of restrictions

    # Fit both models
    beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
    beta_reduced = np.linalg.lstsq(X_reduced, y, rcond=None)[0]

    RSS_full = np.sum((y - X_full @ beta_full)**2)
    RSS_reduced = np.sum((y - X_reduced @ beta_reduced)**2)

    # F statistic
    F = ((RSS_reduced - RSS_full) / q) / (RSS_full / (n - p_full))
    p_value = 1 - stats.f.cdf(F, q, n - p_full)

    return {'F': F, 'df1': q, 'df2': n - p_full, 'p_value': p_value,
            'RSS_full': RSS_full, 'RSS_reduced': RSS_reduced}

# Example: test if both hp and wt are needed
X_full = np.column_stack([np.ones(n), hp, wt])
X_reduced = np.column_stack([np.ones(n)])  # intercept only

result = f_test_nested(y, X_full, X_reduced)
print(f"\nF-test: Are hp and wt jointly significant?")
print(f"F({result['df1']}, {result['df2']}) = {result['F']:.2f}")
print(f"p-value = {result['p_value']:.6f}")

Diagnostics and Model Checking

The validity of regression inference depends on assumptions. Diagnostics help detect violations.

Residual Analysis

Types of residuals:

  1. Raw residuals: \(e_i = y_i - \hat{y}_i\)

  2. Standardized residuals: \(r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}\)

    These have approximately constant variance under homoskedasticity.

  3. Studentized (externally): \(t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} \sim t_{n-p-1}\)

    where \(s_{(i)}\) is computed without observation \(i\). Follows exact t-distribution.

Diagnostic plots:

  • Residuals vs. Fitted: Check linearity (should be random scatter) and homoskedasticity (constant spread)

  • Q-Q plot of residuals: Check normality (should follow 45° line)

  • Residuals vs. each predictor: Check functional form

  • Scale-Location plot: \(\sqrt{|r_i|}\) vs. fitted for homoskedasticity

Gallery of assumption violations showing nonlinearity, heteroskedasticity, non-normality, and autocorrelation

Fig. 109 Figure 3.4.6: Assumption violations gallery. Top: data patterns; Bottom: diagnostic plots. Nonlinearity shows curved residual pattern. Heteroskedasticity produces a fan shape. Non-normality creates heavy tails in Q-Q plots. Autocorrelation shows serial correlation in \(e_t\) vs \(e_{t+1}\) plots.

Leverage and Influence

Not all observations affect the fit equally.

Leverage: \(h_{ii} = [\mathbf{H}]_{ii}\) measures how far \(\mathbf{x}_i\) is from the center of the predictor space.

  • \(h_{ii} \in [1/n, 1]\) when the model includes an intercept (the bounds follow from the symmetry and idempotence of \(\mathbf{H}\))

  • Average leverage: \(\bar{h} = p/n\)

  • High leverage if \(h_{ii} > 2p/n\) or \(> 3p/n\)

Leverage intuition showing why extreme x values matter

Fig. 110 Figure 3.4.4: Leverage measures how far observation \(x_i\) is from \(\bar{x}\). (a) Points far from center have high leverage (red = high, blue = low). (b) High-leverage points can dramatically influence the regression line. (c) The leverage formula \(h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}\) is U-shaped in \(x\).

Influence measures:

  • Cook’s Distance: Combines leverage and residual size

    \[D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}\]

    Large if \(D_i > 4/n\) or \(> 1\)

  • DFFITS: Scaled difference in fitted value when \(i\) deleted

  • DFBETAS: Change in each coefficient when \(i\) deleted
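
These quantities are all computable from the hat matrix diagonal and the residuals. A minimal sketch (computing leverages row by row rather than forming the full \(n \times n\) hat matrix, and using the standard leave-one-out variance identity):

import numpy as np

def influence_measures(X, y):
    """Leverage, standardized/studentized residuals, and Cook's distance (sketch)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)             # leverages h_ii = x_i' (X'X)^{-1} x_i
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1 - h))                           # standardized residuals
    s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)  # error variance with obs. i deleted
    t = e / np.sqrt(s2_loo * (1 - h))                       # externally studentized residuals
    cooks_d = r**2 / p * h / (1 - h)                        # Cook's distance
    return h, r, t, cooks_d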

Robust Standard Errors

When homoskedasticity fails but we still want valid inference, use robust (sandwich) standard errors:

\[\widehat{\text{Var}}_{\text{robust}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{X}^\top \hat{\boldsymbol{\Omega}} \mathbf{X} (\mathbf{X}^\top\mathbf{X})^{-1}\]

where \(\hat{\boldsymbol{\Omega}} = \text{diag}(e_1^2, \ldots, e_n^2)\) estimates the heteroskedastic structure.

HC variants (heteroskedasticity-consistent):

  • HC0: \(\hat{\Omega}_{ii} = e_i^2\) (White’s original)

  • HC1: \(\hat{\Omega}_{ii} = \frac{n}{n-p} e_i^2\) (degrees-of-freedom correction)

  • HC2: \(\hat{\Omega}_{ii} = \frac{e_i^2}{1 - h_{ii}}\) (leverage adjustment)

  • HC3: \(\hat{\Omega}_{ii} = \frac{e_i^2}{(1 - h_{ii})^2}\) (jackknife-like, best for small samples)

import numpy as np

def robust_se(X, residuals, hc_type='HC1'):
    """
    Compute heteroskedasticity-robust standard errors.

    Note: This implementation forms the full n×n hat matrix H, which is
    O(n²) in memory and time. For large n, compute leverages h_ii directly
    without forming H: h_ii = x_i' (X'X)^{-1} x_i for each row.

    Parameters
    ----------
    X : array, shape (n, p)
        Design matrix
    residuals : array, shape (n,)
        OLS residuals
    hc_type : str
        'HC0', 'HC1', 'HC2', or 'HC3'
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)

    # Leverage values (O(n²) - for large n, compute row-by-row instead)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)

    # Construct Omega based on HC type
    e2 = residuals**2
    if hc_type == 'HC0':
        omega = e2
    elif hc_type == 'HC1':
        omega = e2 * n / (n - p)
    elif hc_type == 'HC2':
        omega = e2 / (1 - h)
    elif hc_type == 'HC3':
        omega = e2 / (1 - h)**2
    else:
        raise ValueError(f"Unknown HC type: {hc_type}")

    # Sandwich formula
    meat = X.T @ np.diag(omega) @ X
    sandwich = XtX_inv @ meat @ XtX_inv

    return np.sqrt(np.diag(sandwich))
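
A usage sketch: on the simulated hp/wt data from earlier (whose errors are homoskedastic by construction, reusing X, y, ols_with_inference, and robust_se defined above), the HC3 standard errors should come out close to the classical ones; the two diverge when heteroskedasticity is present.

# Reuses X, y, ols_with_inference() and robust_se() defined above
res = ols_with_inference(X, y)
se_hc3 = robust_se(X, res['residuals'], hc_type='HC3')

for name, se_c, se_r in zip(['Intercept', 'hp', 'wt'], res['se'], se_hc3):
    print(f"{name:<10} classical SE = {se_c:.4f}   HC3 SE = {se_r:.4f}")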

Bringing It All Together

Linear regression is both a statistical tool and a conceptual framework that illuminates fundamental ideas in inference. We’ve seen:

The calculus approach reveals OLS as the solution to an optimization problem—minimizing squared residuals leads to the normal equations \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\).

The geometric approach shows OLS as projection—\(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). The residual \(\mathbf{e}\) is perpendicular to all predictors.

The Gauss-Markov theorem establishes that under classical assumptions (linearity, exogeneity, spherical errors), OLS achieves minimum variance among all linear unbiased estimators. This optimality result has driven the dominance of OLS in applied work.

Distributional theory under normality enables exact t-tests and F-tests. Without normality, asymptotic approximations still provide valid inference for large samples.

Diagnostics reveal whether assumptions hold. When they fail, remedies range from robust standard errors to alternative estimators.

Looking Ahead

Section 3.5 extends these ideas to Generalized Linear Models (GLMs)—logistic regression, Poisson regression, and more. GLMs handle non-normal responses while maintaining the interpretability and computational tractability of the linear model framework. The IRLS algorithm for fitting GLMs is essentially Fisher scoring—Newton-Raphson with expected information—applied to the same exponential family likelihoods we met in Section 3.1.

Key Takeaways 📝

  1. Matrix calculus foundation: The gradient of \(\boldsymbol{\beta}^\top\mathbf{A}\boldsymbol{\beta}\) is \(2\mathbf{A}\boldsymbol{\beta}\) (for symmetric \(\mathbf{A}\)), enabling systematic derivation of least squares estimators.

  2. Two perspectives on OLS: Calculus (minimize RSS) and geometry (project onto column space) yield the same normal equations \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\).

  3. Gauss-Markov theorem: Under linearity, full rank, exogeneity, and spherical errors, OLS is BLUE—the best linear unbiased estimator. The complete proof uses the \(\mathbf{D}\) matrix decomposition.

  4. Inference requires variance estimation: The unbiased estimator \(s^2 = \text{RSS}/(n-p)\) replaces \(\sigma^2\) in standard error formulas. Under normality, t-tests and F-tests are exact.

  5. Diagnostics are essential: Residual plots, leverage measures, and Cook’s distance detect assumption violations. Robust standard errors (HC0-HC3) provide valid inference under heteroskedasticity.

  6. Outcome alignment: This section addresses Learning Outcomes on implementing regression methods, understanding estimator properties (bias, variance, efficiency), and applying diagnostics—skills essential for Sections 3.5 (GLMs), Chapter 4 (bootstrap for regression), and Chapter 5 (Bayesian linear models).

Numerical Stability: QR Decomposition

The formula \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) is mathematically correct but numerically problematic. Computing \(\mathbf{X}^\top\mathbf{X}\) and inverting it can lose precision due to the condition number.

The Condition Number Problem

The condition number of a matrix measures sensitivity to numerical errors:

\[\kappa(\mathbf{A}) = \|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\|\]

For the product \(\mathbf{X}^\top\mathbf{X}\):

\[\kappa(\mathbf{X}^\top\mathbf{X}) = \kappa(\mathbf{X})^2\]

This squaring of the condition number is catastrophic. If \(\mathbf{X}\) has condition number \(10^4\) (not unusual with poorly scaled predictors), then \(\mathbf{X}^\top\mathbf{X}\) has condition number \(10^8\), potentially losing 8 digits of precision.

Numerical stability comparison showing normal equations vs QR decomposition

Fig. 111 Figure 3.4.8: Numerical stability matters. (a) Normal equations square the condition number: \(\kappa(\mathbf{X}^\top\mathbf{X}) = \kappa(\mathbf{X})^2\). (b) Theoretical error bounds show QR decomposition with \(O(\kappa \cdot \epsilon)\) error vastly outperforms normal equations with \(O(\kappa^2 \cdot \epsilon)\) error. (c) Practical guidelines: use QR/SVD-based methods (like np.linalg.lstsq) when \(\kappa > 10^3\).

QR Decomposition Solution

The QR decomposition factorizes \(\mathbf{X}\) as:

\[\mathbf{X} = \mathbf{Q}\mathbf{R}\]

where:

  • \(\mathbf{Q} \in \mathbb{R}^{n \times p}\) has orthonormal columns: \(\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_p\)

  • \(\mathbf{R} \in \mathbb{R}^{p \times p}\) is upper triangular

Solving the normal equations via QR:

Substitute into \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\):

\[\mathbf{R}^\top\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{R}^\top\mathbf{Q}^\top\mathbf{y}\]

Since \(\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}\) and \(\mathbf{R}^\top\) is invertible:

\[\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^\top\mathbf{y}\]

This triangular system is solved by back-substitution—no matrix inversion needed!

Advantages:

  1. Condition number is \(\kappa(\mathbf{R}) = \kappa(\mathbf{X})\), not squared

  2. No explicit computation of \(\mathbf{X}^\top\mathbf{X}\)

  3. Back-substitution is \(O(p^2)\), highly stable

import numpy as np

def ols_qr(X, y):
    """
    Compute OLS via QR decomposition (numerically stable).
    """
    Q, R = np.linalg.qr(X)
    Qty = Q.T @ y
    beta_hat = np.linalg.solve(R, Qty)  # Solve the triangular system Rβ̂ = Q'y (scipy.linalg.solve_triangular would exploit the triangular structure)
    return beta_hat

def demonstrate_condition_number():
    """Show the condition number problem."""
    rng = np.random.default_rng(42)
    n = 100

    # Poorly conditioned design matrix (highly correlated predictors)
    x1 = rng.uniform(0, 1, n)
    x2 = x1 + rng.normal(0, 0.0001, n)  # Nearly collinear!

    X = np.column_stack([np.ones(n), x1, x2])
    y = 1 + 2*x1 + 3*x2 + rng.normal(0, 0.1, n)

    # Condition numbers
    kappa_X = np.linalg.cond(X)
    kappa_XtX = np.linalg.cond(X.T @ X)

    print("="*60)
    print("CONDITION NUMBER COMPARISON")
    print("="*60)
    print(f"\nκ(X):     {kappa_X:.2e}")
    print(f"κ(X'X):   {kappa_XtX:.2e}")
    print(f"Ratio:    {kappa_XtX / kappa_X:.2e} (should be ≈ κ(X))")

    # Compare methods
    # Method 1: Normal equations (unstable)
    try:
        beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
        print(f"\nNormal equations: β̂ = {beta_normal}")
    except np.linalg.LinAlgError:
        print("\nNormal equations: FAILED (singular matrix)")

    # Method 2: QR decomposition (stable)
    beta_qr = ols_qr(X, y)
    print(f"QR decomposition: β̂ = {beta_qr}")

    # Method 3: lstsq (uses SVD)
    beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"lstsq (SVD):      β̂ = {beta_lstsq}")

demonstrate_condition_number()
============================================================
CONDITION NUMBER COMPARISON
============================================================

κ(X):     1.05e+04
κ(X'X):   1.11e+08
Ratio:    1.05e+04 (should be ≈ κ(X))

Normal equations: β̂ = [-0.73  1.58  3.42]
QR decomposition: β̂ = [-0.73  1.58  3.42]
lstsq (SVD):      β̂ = [-0.73  1.58  3.42]

Common Pitfall ⚠️ Never Compute (X’X)⁻¹ Explicitly

When writing your own regression code:

  • Don’t do this: np.linalg.inv(X.T @ X) @ X.T @ y

  • Do this instead: np.linalg.lstsq(X, y, rcond=None)[0]

The lstsq function uses SVD internally, which is even more stable than QR and handles rank-deficient matrices gracefully.

Model Selection and Information Criteria

With multiple candidate models, how do we choose? Information criteria balance goodness-of-fit against complexity.

The Bias-Variance Tradeoff

Consider adding a predictor to a model:

  • RSS decreases (or stays the same)—the fit improves or is unchanged

  • Variance increases—more parameters to estimate

This is the classic bias-variance tradeoff:

\[\text{MSE}(\hat{y}) = \text{Bias}^2 + \text{Variance}\]

A complex model has low bias (fits the data well) but high variance (overfits to noise). A simple model has higher bias but lower variance.

R-Squared and Adjusted R-Squared

R-squared measures the proportion of variance explained:

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]
Regression decomposition showing TSS = ESS + RSS

Fig. 112 Figure 3.4.13: The fundamental decomposition TSS = ESS + RSS. (a) For each point, the deviation from \(\bar{y}\) splits into explained (ESS) and residual (RSS) components. (b) The equation: Total Sum of Squares = Explained Sum of Squares + Residual Sum of Squares. (c) \(R^2 = \text{ESS}/\text{TSS}\) is the proportion explained.

Problem: \(R^2\) never decreases when adding predictors, even useless ones.

Adjusted R-squared penalizes for model complexity:

\[R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)} = 1 - (1 - R^2)\frac{n-1}{n-p}\]

This can decrease when adding a predictor that doesn’t improve the fit enough to justify its inclusion.
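The contrast is easy to check numerically. The sketch below (simulated data chosen purely for illustration) adds a pure-noise column to a simple regression: \(R^2\) creeps up while \(R^2_{\text{adj}}\) typically drops.

import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(0, 5, n)
noise_col = rng.normal(0, 1, n)                 # predictor unrelated to y
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

def fit_r2(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n_obs, p = X.shape
    r2 = 1 - rss / tss
    return r2, 1 - (1 - r2) * (n_obs - 1) / (n_obs - p)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise_col])   # add the useless predictor

print("without noise column:", fit_r2(X_small, y))
print("with noise column:   ", fit_r2(X_big, y))  # R² never lower; adj R² usually lower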

Information Criteria

Akaike Information Criterion (AIC):

\[\text{AIC} = n \ln(\text{RSS}/n) + 2p\]

Bayesian Information Criterion (BIC):

\[\text{BIC} = n \ln(\text{RSS}/n) + p \ln(n)\]

These are the AIC and BIC formulas for Gaussian linear regression, dropping additive constants that cancel when comparing models fit to the same response \(\mathbf{y}\). For general likelihoods, use \(\text{AIC} = -2\ln L + 2p\) and \(\text{BIC} = -2\ln L + p\ln n\).

Both penalize complexity, but BIC penalizes more heavily for large \(n\).

Usage: Lower is better. Compare models by their AIC or BIC values.

Table 32 Model Selection Criteria

| Criterion | Formula | Penalty | Best for |
|-----------|---------|---------|----------|
| \(R^2\) | \(1 - \text{RSS}/\text{TSS}\) | None (always increases) | Never for selection |
| \(R^2_{\text{adj}}\) | \(1 - (1-R^2)\frac{n-1}{n-p}\) | Mild | Quick comparison |
| AIC | \(n\ln(\text{RSS}/n) + 2p\) | \(2p\) | Prediction |
| BIC | \(n\ln(\text{RSS}/n) + p\ln(n)\) | \(p\ln(n)\) | Consistent selection |

import numpy as np

def model_selection_criteria(y, y_hat, p, n):
    """
    Compute model selection criteria.

    Parameters
    ----------
    y : array
        Observed values
    y_hat : array
        Fitted values
    p : int
        Number of parameters (including intercept)
    n : int
        Sample size
    """
    RSS = np.sum((y - y_hat)**2)
    TSS = np.sum((y - np.mean(y))**2)

    R2 = 1 - RSS / TSS
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p)

    AIC = n * np.log(RSS / n) + 2 * p
    BIC = n * np.log(RSS / n) + p * np.log(n)

    return {'R2': R2, 'R2_adj': R2_adj, 'AIC': AIC, 'BIC': BIC}
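
As a usage sketch (the simulated data and model names below are our own, not from the text), the function can rank competing design matrices fit to the same response:

import numpy as np

rng = np.random.default_rng(1)
n = 80
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(0, 1, n)      # true relationship is quadratic

X_lin = np.column_stack([np.ones(n), x])               # candidate 1: linear
X_quad = np.column_stack([np.ones(n), x, x**2])        # candidate 2: quadratic

for name, X in [("linear", X_lin), ("quadratic", X_quad)]:
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    crit = model_selection_criteria(y, X @ beta, p=X.shape[1], n=n)
    print(f"{name:>9}: AIC = {crit['AIC']:7.1f}   BIC = {crit['BIC']:7.1f}   "
          f"adj R² = {crit['R2_adj']:.3f}")
# Lower AIC/BIC should favor the quadratic specification here.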

Regularization: Ridge and LASSO

When predictors are highly correlated or \(p\) is large relative to \(n\), OLS can be unstable or overfit. Regularization shrinks coefficients toward zero, trading bias for reduced variance.

Ridge Regression

Ridge regression adds an \(L_2\) penalty:

\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \right\}\]

where \(\lambda \geq 0\) is the regularization parameter.

Closed-form solution:

(103)\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\]

Key properties:

  1. Bias-variance tradeoff: Ridge is biased (\(\mathbb{E}[\hat{\boldsymbol{\beta}}_{\text{ridge}}] \neq \boldsymbol{\beta}\)) but can have lower MSE than OLS

  2. Shrinkage: Coefficients shrink toward zero as \(\lambda \to \infty\)

  3. Handles multicollinearity: Adding \(\lambda\mathbf{I}\) makes \(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}\) invertible even if \(\mathbf{X}^\top\mathbf{X}\) is singular

  4. No variable selection: All coefficients remain nonzero
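
Equation (103) translates directly into a few lines of NumPy. The sketch below is a minimal implementation, assuming the caller has already standardized the predictors and centered the response so that no intercept is penalized; it solves the linear system rather than forming an explicit inverse.

import numpy as np

def ridge_closed_form(X, y, lam):
    """
    Ridge estimate from equation (103): solve (X'X + λI) β = X'y.
    Assumes X has standardized columns and y is centered (no intercept to penalize).
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Quick check: λ = 0 recovers OLS; a large λ shrinks every coefficient toward zero
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(0, 1, 100)
print("λ = 0:  ", np.round(ridge_closed_form(X, y, 0.0), 2))
print("λ = 100:", np.round(ridge_closed_form(X, y, 100.0), 2))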

The bias-variance tradeoff in regularization

Fig. 113 Figure 3.4.9: The bias-variance tradeoff. (a) A biased estimator (orange) with lower variance can achieve lower MSE than an unbiased estimator (blue). (b) As regularization \(\lambda\) increases, variance decreases but bias increases—there’s an optimal \(\lambda\). (c) Training error decreases with complexity while test error is U-shaped.

LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) uses an \(L_1\) penalty:

\[\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}\]

where \(\|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j|\).

Key properties:

  1. No closed form: Must use numerical optimization, typically cyclic coordinate descent (see the sketch after this list)

  2. Variable selection: Sets some coefficients exactly to zero (sparse solutions)

  3. Geometry: The \(L_1\) constraint has corners at the axes, making exact zeros likely
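
To make the coordinate-descent idea concrete, here is a bare-bones cyclic coordinate descent for the objective above, written as a teaching sketch. It assumes standardized columns and is not a substitute for sklearn.linear_model.Lasso, whose loss is scaled by \(1/(2n)\), so its alpha is parameterized differently.

import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - Xβ||² + λ||β||₁ (standardized X assumed)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)          # x_j'x_j for each column
    r = y - X @ beta                       # residual vector, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]         # partial residual: remove j's contribution
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
            r -= X[:, j] * beta[j]         # restore residual with the updated coefficient
    return beta

# Small demo: sparse truth, moderate penalty
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 10))
y = X[:, 0] * 3 + X[:, 1] * (-2) + rng.normal(0, 1, 200)
print(np.round(lasso_cd(X, y, lam=50.0), 2))   # most coefficients should be exactly 0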

L1 vs L2 geometry showing why LASSO produces sparse solutions

Fig. 114 Figure 3.4.10: L1 vs L2 geometry. (a) Ridge constraint is a circle—RSS contours typically touch away from axes. (b) LASSO constraint is a diamond—contours often touch at corners where \(\beta_j = 0\). (c) This geometric difference explains why LASSO performs automatic variable selection while Ridge shrinks but keeps all coefficients.

Regularization paths showing how coefficients shrink with lambda

Fig. 115 Figure 3.4.11: Regularization paths. (a) Ridge shrinks coefficients smoothly toward zero as \(\lambda\) increases—all remain nonzero. (b) LASSO drives coefficients to exactly zero at different \(\lambda\) values, providing a variable selection path.

Elastic Net

Elastic Net combines \(L_1\) and \(L_2\) penalties:

\[\hat{\boldsymbol{\beta}}_{\text{EN}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|^2 \right\}\]

This combines the variable selection of LASSO with the stability of ridge for correlated predictors.

Practical Considerations for Regularization

  1. Intercept handling: Typically, we center \(\mathbf{y}\) and standardize predictors, and we do not penalize the intercept \(\beta_0\). Most software handles this automatically.

  2. Scaling matters: Always report whether predictors were standardized. The penalty \(\lambda\) depends on the scale of predictors—a coefficient for income in dollars behaves very differently from income in thousands of dollars.

  3. Cross-validation: Choose \(\lambda\) by cross-validation to balance fit and complexity.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def compare_regularization():
    """Compare OLS, Ridge, LASSO, and Elastic Net."""
    rng = np.random.default_rng(42)

    # Generate data with correlated predictors
    n, p = 100, 50
    X = rng.normal(0, 1, (n, p))
    # Add correlation structure
    for j in range(1, p):
        X[:, j] = 0.9 * X[:, j-1] + 0.1 * X[:, j]

    # True coefficients: only first 5 are nonzero
    beta_true = np.zeros(p)
    beta_true[:5] = [3, 2, 1, 0.5, 0.25]

    y = X @ beta_true + rng.normal(0, 1, n)

    # Standardize for fair comparison
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit models
    ols_beta = np.linalg.lstsq(X_scaled, y, rcond=None)[0]

    ridge = Ridge(alpha=1.0)
    ridge.fit(X_scaled, y)

    lasso = Lasso(alpha=0.1)
    lasso.fit(X_scaled, y)

    enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
    enet.fit(X_scaled, y)

    print("="*60)
    print("REGULARIZATION COMPARISON")
    print("="*60)
    print(f"\nData: n={n}, p={p}, {np.sum(beta_true != 0)} true nonzero coefficients")

    print(f"\n{'Method':<15} {'Nonzero':>10} {'MSE (coeffs)':>15}")
    print("-"*40)

    methods = [('OLS', ols_beta), ('Ridge', ridge.coef_),
               ('LASSO', lasso.coef_), ('Elastic Net', enet.coef_)]

    for name, coef in methods:
        nonzero = np.sum(np.abs(coef) > 1e-6)
        mse = np.mean((coef - beta_true)**2)
        print(f"{name:<15} {nonzero:>10} {mse:>15.4f}")

    print("\n→ LASSO and Elastic Net achieve sparsity (fewer nonzero coefficients)")
    print("→ Ridge shrinks but keeps all coefficients nonzero")

compare_regularization()
============================================================
REGULARIZATION COMPARISON
============================================================

Data: n=100, p=50, 5 true nonzero coefficients

Method              Nonzero   MSE (coeffs)
----------------------------------------
OLS                      50          0.0892
Ridge                    50          0.0734
LASSO                    12          0.0418
Elastic Net              14          0.0456

→ LASSO and Elastic Net achieve sparsity (fewer nonzero coefficients)
→ Ridge shrinks but keeps all coefficients nonzero

Choosing the Regularization Parameter

The parameter \(\lambda\) controls the amount of shrinkage. Too small → overfitting; too large → underfitting.

Cross-validation is the standard approach:

  1. Split data into \(K\) folds

  2. For each candidate \(\lambda\): train on \(K-1\) folds, predict on the held-out fold, and average the prediction error across folds

  3. Choose \(\lambda\) with lowest CV error

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# X_scaled, y: a standardized design matrix and response, e.g. as prepared in compare_regularization()

# Ridge with built-in CV
alphas = np.logspace(-4, 4, 50)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_scaled, y)
print(f"Ridge optimal λ: {ridge_cv.alpha_:.4f}")

# LASSO with built-in CV
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
print(f"LASSO optimal λ: {lasso_cv.alpha_:.4f}")

Chapter 3.4 Exercises: Linear Models Mastery

These exercises build your understanding of linear regression from matrix calculus foundations through the Gauss-Markov theorem to practical diagnostics. Each exercise connects theoretical derivations to computational practice and statistical interpretation.

A Note on These Exercises

These exercises are designed to deepen your understanding of linear models through hands-on exploration:

  • Exercise 1 develops matrix calculus skills essential for deriving OLS estimators

  • Exercise 2 explores the geometric interpretation of regression as projection

  • Exercise 3 identifies Gauss-Markov assumption violations and their consequences

  • Exercise 4 provides a complete proof of Gauss-Markov for a single coefficient

  • Exercise 5 verifies theoretical properties computationally via Monte Carlo simulation

  • Exercise 6 compares model-based and robust standard errors under heteroskedasticity

Complete solutions with derivations, code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!

Exercise 1: Matrix Calculus Derivations

The matrix calculus rules developed in this section are the foundation for deriving OLS estimators. This exercise builds fluency with these essential tools.

Background: Why Matrix Calculus?

When optimizing functions of vectors and matrices (like RSS as a function of \(\boldsymbol{\beta}\)), we need systematic rules for computing gradients. Component-wise verification builds confidence that the matrix formulas are correct.

  1. Linear form: Verify by component-wise differentiation that \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^\top\mathbf{x}) = \mathbf{a}\) for \(\mathbf{a}, \mathbf{x} \in \mathbb{R}^3\).

  2. Cross term: Derive \(\frac{\partial}{\partial \boldsymbol{\beta}}(\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta})\) directly from the definition of gradient.

  3. Quadratic form: For symmetric \(\mathbf{A}\), show that \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}\) by expanding in components.

  4. RSS gradient: Derive the OLS gradient \(\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = 2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})\) using the chain rule.

Hint

For part (c), write out \(\mathbf{x}^\top\mathbf{A}\mathbf{x} = \sum_i\sum_j x_i A_{ij} x_j\) and differentiate with respect to \(x_k\). Remember that terms containing \(x_k\) appear both when \(i = k\) and when \(j = k\).

Solution

Part (a): Component-wise verification for linear function

Let \(\mathbf{a} = (a_1, a_2, a_3)^\top\) and \(\mathbf{x} = (x_1, x_2, x_3)^\top\).

\[f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x} = a_1 x_1 + a_2 x_2 + a_3 x_3\]

Partial derivatives:

\[\frac{\partial f}{\partial x_1} = a_1, \quad \frac{\partial f}{\partial x_2} = a_2, \quad \frac{\partial f}{\partial x_3} = a_3\]

Therefore:

\[\begin{split}\nabla_\mathbf{x} f = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} = \mathbf{a} \quad \checkmark\end{split}\]

Part (b): Gradient of y’Xβ

\(f(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}\) is a scalar. Write \(\mathbf{c} = \mathbf{X}^\top\mathbf{y}\), so \(f = \mathbf{c}^\top\boldsymbol{\beta} = \sum_j c_j \beta_j\).

\[\frac{\partial f}{\partial \beta_k} = c_k = [\mathbf{X}^\top\mathbf{y}]_k\]

Therefore:

\[\nabla_{\boldsymbol{\beta}} (\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}) = \mathbf{X}^\top\mathbf{y}\]

Part (c): Quadratic form with symmetric A

Let \(f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x} = \sum_i\sum_j x_i A_{ij} x_j\).

For partial with respect to \(x_k\):

  • Terms where \(i = k\): \(\sum_j x_k A_{kj} x_j\) → derivative is \(\sum_j A_{kj} x_j = [\mathbf{A}\mathbf{x}]_k\)

  • Terms where \(j = k\): \(\sum_i x_i A_{ik} x_k\) → derivative is \(\sum_i A_{ik} x_i = [\mathbf{A}^\top\mathbf{x}]_k\)

Total: \(\frac{\partial f}{\partial x_k} = [\mathbf{A}\mathbf{x}]_k + [\mathbf{A}^\top\mathbf{x}]_k\)

For symmetric \(\mathbf{A}\): \(\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{x}\), so:

\[\nabla f = 2\mathbf{A}\mathbf{x} \quad \checkmark\]

Part (d): RSS gradient via chain rule

Let \(\mathbf{r} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}\) (residual vector). Then RSS \(= \mathbf{r}^\top\mathbf{r}\).

Chain rule: \(\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = \left(\frac{\partial \mathbf{r}}{\partial \boldsymbol{\beta}^\top}\right)^{\top} \frac{\partial (\mathbf{r}^\top\mathbf{r})}{\partial \mathbf{r}}\)

  • \(\frac{\partial (\mathbf{r}^\top\mathbf{r})}{\partial \mathbf{r}} = 2\mathbf{r}\) (gradient of \(\|\mathbf{r}\|^2\))

  • \(\frac{\partial \mathbf{r}}{\partial \boldsymbol{\beta}^\top} = \frac{\partial (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^\top} = -\mathbf{X}\)

Therefore:

\[\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = (-\mathbf{X})^\top (2\mathbf{r}) = -2\mathbf{X}^\top\mathbf{r} = 2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) \quad \checkmark\]
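
A quick numerical sanity check of part (d): compare the analytic gradient \(2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})\) with central finite differences on random data (a small verification script, not part of the exercise statement).

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.normal(0, 1, (n, p))
y = rng.normal(0, 1, n)
beta = rng.normal(0, 1, p)

rss = lambda b: np.sum((y - X @ b) ** 2)
grad_analytic = 2 * X.T @ (X @ beta - y)

# Central finite differences, one coordinate at a time
h = 1e-6
I = np.eye(p)
grad_numeric = np.array([(rss(beta + h * I[j]) - rss(beta - h * I[j])) / (2 * h)
                         for j in range(p)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # expect True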

Exercise 2: Geometry of Simple Regression

The geometric perspective reveals regression as projection onto the column space of \(\mathbf{X}\). This exercise develops that intuition for simple linear regression.

Background: Projection Interpretation

In simple regression \(y = \beta_0 + \beta_1 x + \varepsilon\), the design matrix \(\mathbf{X} = [\mathbf{1} \; \mathbf{x}]\) spans a 2-dimensional subspace of \(\mathbb{R}^n\). The fitted values \(\hat{\mathbf{y}}\) are the orthogonal projection of \(\mathbf{y}\) onto this plane.

  1. Show that the design matrix has columns \(\mathbf{1}\) and \(\mathbf{x}\), and verify that \(\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\).

  2. Verify geometrically that the residual vector is orthogonal to both \(\mathbf{1}\) and \(\mathbf{x}\). What do these orthogonality conditions imply about the residuals?

  3. Show that the regression line passes through the point \((\bar{x}, \bar{y})\).

  4. Derive the formula for \(R^2\) as the squared correlation between \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) using vector inner products.

Hint

For part (b), the condition \(\mathbf{1}^\top \mathbf{e} = 0\) means \(\sum e_i = 0\) (residuals sum to zero). The condition \(\mathbf{x}^\top \mathbf{e} = 0\) means \(\sum x_i e_i = 0\).

Solution

Part (a): OLS formula for simple regression

The design matrix is:

\[\begin{split}\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = [\mathbf{1} \; \mathbf{x}]\end{split}\]

From the normal equations \(\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\):

\[\begin{split}\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}\end{split}\]

Solving (multiply first equation by \(\bar{x}\) and subtract from second):

\[\hat{\beta}_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\]

Part (b): Orthogonality conditions

The normal equations \(\mathbf{X}^\top\mathbf{e} = \mathbf{0}\) give:

  • \(\mathbf{1}^\top\mathbf{e} = \sum_i e_i = 0\): the residuals sum to zero

  • \(\mathbf{x}^\top\mathbf{e} = \sum_i x_i e_i = 0\): the residuals are uncorrelated with the predictor

Geometric interpretation: The residual vector \(\mathbf{e}\) is perpendicular to the plane spanned by \(\mathbf{1}\) and \(\mathbf{x}\). Since \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) lies in this plane, \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) is the perpendicular from \(\mathbf{y}\) to the plane—the shortest distance.

Part (c): Regression line through \((\bar{x}, \bar{y})\)

From \(\sum e_i = 0\):

\[\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\]

This gives \(n\bar{y} - n\hat{\beta}_0 - \hat{\beta}_1 n\bar{x} = 0\), so:

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]

Therefore the fitted value at \(x = \bar{x}\) is:

\[\hat{y}|_{x=\bar{x}} = \hat{\beta}_0 + \hat{\beta}_1\bar{x} = (\bar{y} - \hat{\beta}_1\bar{x}) + \hat{\beta}_1\bar{x} = \bar{y}\]

The regression line passes through the centroid \((\bar{x}, \bar{y})\).

Part (d): R² as squared correlation

Define centered vectors: \(\tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}\) and \(\tilde{\hat{\mathbf{y}}} = \hat{\mathbf{y}} - \bar{y}\mathbf{1}\).

The correlation between \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) is:

\[r_{y,\hat{y}} = \frac{\tilde{\mathbf{y}}^\top\tilde{\hat{\mathbf{y}}}}{\|\tilde{\mathbf{y}}\| \cdot \|\tilde{\hat{\mathbf{y}}}\|}\]

Since \(\mathbf{e} \perp \hat{\mathbf{y}}\) and \(\bar{e} = 0\):

\[\tilde{\mathbf{y}}^\top\tilde{\hat{\mathbf{y}}} = (\tilde{\hat{\mathbf{y}}} + \mathbf{e})^\top\tilde{\hat{\mathbf{y}}} = \|\tilde{\hat{\mathbf{y}}}\|^2\]

Therefore:

\[r_{y,\hat{y}}^2 = \frac{\|\tilde{\hat{\mathbf{y}}}\|^4}{\|\tilde{\mathbf{y}}\|^2 \cdot \|\tilde{\hat{\mathbf{y}}}\|^2} = \frac{\|\tilde{\hat{\mathbf{y}}}\|^2}{\|\tilde{\mathbf{y}}\|^2} = \frac{\text{ESS}}{\text{TSS}} = R^2\]

Key insight: \(R^2\) is the squared correlation between observed and fitted values, which equals the proportion of variance explained.
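
The identity in part (d) is easy to confirm on simulated data (a short check, with arbitrary parameter values):

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 3 + 1.5 * x + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
r = np.corrcoef(y, y_hat)[0, 1]
print(f"R²          = {R2:.6f}")
print(f"corr(y, ŷ)² = {r**2:.6f}")   # should match R² to machine precision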

Exercise 3: Gauss-Markov Assumptions

Understanding which assumptions are violated—and the consequences—is essential for applied regression. This exercise builds diagnostic intuition.

Background: The Assumption Hierarchy

The Gauss-Markov assumptions are: GM1 (linearity), GM2 (full rank), GM3 (exogeneity: \(\mathbb{E}[\varepsilon|\mathbf{X}] = \mathbf{0}\)), and GM4 (spherical errors: \(\text{Var}(\varepsilon|\mathbf{X}) = \sigma^2\mathbf{I}\)). Each violation has specific consequences for OLS.

For each scenario, identify which Gauss-Markov assumption is violated and describe the consequences.

  1. A researcher fits \(y = \beta_0 + \beta_1 x + \varepsilon\) but the true relationship is \(y = \beta_0 + \beta_1 \log(x) + \varepsilon\).

  2. In a regression with three predictors, the third predictor is defined as \(x_3 = 2x_1 - x_2\).

  3. An economist studies savings rates but omits “income,” which is correlated with education (included) and directly affects savings.

  4. In a study of housing prices, the variance of residuals is proportional to house size.

  5. Monthly unemployment data shows residuals that are positively correlated across consecutive months.

Solution

Part (a): Wrong functional form

Violation: GM1 (Linearity)

The true model is \(y = \beta_0 + \beta_1\log(x) + \varepsilon\), but we’re fitting \(y = \gamma_0 + \gamma_1 x + \varepsilon\).

Consequences:

  • OLS estimates \(\hat{\gamma}_0, \hat{\gamma}_1\) are biased and inconsistent

  • They estimate the “best linear approximation” to a nonlinear relationship

  • Residual plots will show systematic curvature (nonlinearity pattern)

  • Predictions will be systematically wrong, especially at extreme \(x\) values

Remedy: Transform predictors (\(\log(x)\)), add polynomial terms, or use nonlinear models.

Part (b): Perfect multicollinearity

Violation: GM2 (Full Rank)

If \(x_3 = 2x_1 - x_2\), the third column of \(\mathbf{X}\) is a linear combination of the first two.

Consequences:

  • \(\mathbf{X}^\top\mathbf{X}\) is singular (not invertible)

  • OLS has infinitely many solutions—the parameter vector is not identified

  • Software will fail or drop one predictor automatically

Remedy: Remove redundant predictor(s) or use regularization (ridge regression).

Part (c): Omitted variable bias

Violation: GM3 (Exogeneity/Strict Exogeneity)

Income affects both education (parents with higher income can afford more education) and savings (direct effect). By omitting income, \(\mathbb{E}[\varepsilon|\text{education}] \neq 0\).

Consequences:

  • OLS coefficient on education is biased and inconsistent

  • The bias direction depends on signs: if income is positively correlated with education and income has a positive direct effect on savings, the education coefficient is biased upward

  • Omitted variable bias formula: \(\text{bias}(\hat{\beta}_{\text{edu}}) = \beta_{\text{inc}} \times \delta\), where \(\delta\) is the slope from the auxiliary regression of income (the omitted variable) on education

Remedy: Include income as a control, use instrumental variables, or acknowledge the limitation.
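
The omitted-variable bias formula above can be verified by simulation. The sketch below (coefficients and data-generating process invented for illustration) compares the short-regression slope with \(\beta_{\text{edu}} + \beta_{\text{inc}}\,\delta\):

import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # large n so sampling noise is negligible
income = rng.normal(50, 10, n)
education = 5 + 0.2 * income + rng.normal(0, 2, n)   # education correlated with income
beta_edu, beta_inc = 1.0, 0.5
savings = 2 + beta_edu * education + beta_inc * income + rng.normal(0, 1, n)

# Short regression: savings on education only (income omitted)
X_short = np.column_stack([np.ones(n), education])
b_short = np.linalg.lstsq(X_short, savings, rcond=None)[0]

# Auxiliary regression: income on education -> delta
delta = np.linalg.lstsq(X_short, income, rcond=None)[0][1]

print(f"short-regression slope:      {b_short[1]:.3f}")
print(f"beta_edu + beta_inc * delta: {beta_edu + beta_inc * delta:.3f}")   # should agree (up to simulation noise)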

Part (d): Heteroskedasticity

Violation: GM4 (Spherical Errors—specifically homoskedasticity)

\(\text{Var}(\varepsilon_i) = \sigma^2 \cdot (\text{house size}_i)\), not constant.

Consequences:

  • OLS is still unbiased and consistent

  • OLS is no longer BLUE—more efficient estimators exist (WLS)

  • Standard errors are wrong (typically too small for large houses)

  • t-tests and F-tests have incorrect size (Type I error ≠ α)

Remedy: Use robust (HC) standard errors, or weighted least squares with \(w_i \propto 1/\text{size}_i\).

Part (e): Autocorrelation

Violation: GM4 (Spherical Errors—specifically no autocorrelation)

\(\text{Cov}(\varepsilon_t, \varepsilon_{t-1}) > 0\) for consecutive months.

Consequences:

  • OLS is still unbiased and consistent

  • OLS is not BLUE

  • Standard errors are underestimated (positive autocorrelation clusters errors)

  • Durbin-Watson test will detect this (DW < 2 indicates positive autocorrelation)

Remedy: Use Newey-West (HAC) standard errors, or GLS/Cochrane-Orcutt for efficiency.

Exercise 4: Complete BLUE Proof

The Gauss-Markov theorem establishes OLS as the Best Linear Unbiased Estimator. This exercise develops a complete proof for a single coefficient, building intuition for the general result.

Background: The BLUE Property

BLUE means OLS achieves the minimum variance among all linear unbiased estimators. The proof uses a clever decomposition: any alternative estimator can be written as OLS plus a “deviation” term, and this deviation can only increase variance.

  1. Let \(\tilde{\beta}_1 = \mathbf{c}^\top\mathbf{y}\) be any linear unbiased estimator of \(\beta_1\) in simple regression. Show that unbiasedness requires \(\mathbf{c}^\top\mathbf{X} = (0, 1)\).

  2. Write \(\mathbf{c} = \mathbf{a}_1 + \mathbf{d}\) where \(\mathbf{a}_1\) is the second row of \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\). Show that \(\mathbf{d}^\top\mathbf{X} = \mathbf{0}\).

  3. Show that \(\text{Var}(\tilde{\beta}_1) = \text{Var}(\hat{\beta}_1) + \sigma^2\mathbf{d}^\top\mathbf{d}\).

  4. Conclude that \(\text{Var}(\tilde{\beta}_1) \geq \text{Var}(\hat{\beta}_1)\) with equality iff \(\mathbf{d} = \mathbf{0}\).

Hint

The key insight in part (c) is that the cross-term \(\mathbf{a}_1^\top\mathbf{d} = 0\) because \(\mathbf{d}\) is orthogonal to the column space of \(\mathbf{X}\), and \(\mathbf{a}_1\) lies in that column space.

Solution

Part (a): Unbiasedness condition

In simple regression, \(\mathbf{X} = [\mathbf{1} \; \mathbf{x}]\) (n×2). For \(\tilde{\beta}_1 = \mathbf{c}^\top\mathbf{y}\) to be unbiased for \(\beta_1\):

\[\begin{split}\mathbb{E}[\tilde{\beta}_1] = \mathbf{c}^\top\mathbb{E}[\mathbf{y}] = \mathbf{c}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{c}^\top[\mathbf{1} \; \mathbf{x}]\begin{pmatrix}\beta_0 \\ \beta_1\end{pmatrix} = (\mathbf{c}^\top\mathbf{1})\beta_0 + (\mathbf{c}^\top\mathbf{x})\beta_1\end{split}\]

For this to equal \(\beta_1\) for all \((\beta_0, \beta_1)\):

\[\mathbf{c}^\top\mathbf{1} = 0, \quad \mathbf{c}^\top\mathbf{x} = 1\]

Equivalently, \(\mathbf{c}^\top\mathbf{X} = (0, 1)\).

Part (b): Decomposition

Let \(\mathbf{a}_1\) be the second row of \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\), so \(\hat{\beta}_1 = \mathbf{a}_1^\top\mathbf{y}\) is the OLS estimator.

Since OLS is unbiased: \(\mathbf{a}_1^\top\mathbf{X} = (0, 1)\).

Define \(\mathbf{d} = \mathbf{c} - \mathbf{a}_1\). Then:

\[\mathbf{d}^\top\mathbf{X} = \mathbf{c}^\top\mathbf{X} - \mathbf{a}_1^\top\mathbf{X} = (0, 1) - (0, 1) = (0, 0)\]

So \(\mathbf{d}^\top\mathbf{X} = \mathbf{0}\).

Part (c): Variance calculation

\[\begin{split}\text{Var}(\tilde{\beta}_1) &= \text{Var}(\mathbf{c}^\top\mathbf{y}) = \mathbf{c}^\top\text{Var}(\mathbf{y})\mathbf{c} = \sigma^2\mathbf{c}^\top\mathbf{c} \\ &= \sigma^2(\mathbf{a}_1 + \mathbf{d})^\top(\mathbf{a}_1 + \mathbf{d}) \\ &= \sigma^2(\mathbf{a}_1^\top\mathbf{a}_1 + 2\mathbf{a}_1^\top\mathbf{d} + \mathbf{d}^\top\mathbf{d})\end{split}\]

Now, \(\mathbf{a}_1 = [(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top]^\top_{\cdot,2}\) (second column of \(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\)).

Since \(\mathbf{d}^\top\mathbf{X} = \mathbf{0}\), we have \(\mathbf{X}^\top\mathbf{d} = \mathbf{0}\).

Using the explicit form of \(\mathbf{a}_1\):

\[\mathbf{a}_1 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{e}_2\]

where \(\mathbf{e}_2 = (0, 1)^\top\). Then:

\[\mathbf{a}_1^\top\mathbf{d} = \mathbf{e}_2^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{d} = \mathbf{e}_2^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{0} = 0\]

Therefore:

\[\text{Var}(\tilde{\beta}_1) = \sigma^2\mathbf{a}_1^\top\mathbf{a}_1 + \sigma^2\mathbf{d}^\top\mathbf{d} = \text{Var}(\hat{\beta}_1) + \sigma^2\|\mathbf{d}\|^2\]

Part (d): Conclusion

Since \(\|\mathbf{d}\|^2 \geq 0\):

\[\text{Var}(\tilde{\beta}_1) \geq \text{Var}(\hat{\beta}_1)\]

with equality if and only if \(\mathbf{d} = \mathbf{0}\), i.e., \(\mathbf{c} = \mathbf{a}_1\).

This proves \(\hat{\beta}_1\) is the Best (minimum variance) Linear Unbiased Estimator.

Key Insight: Any deviation \(\mathbf{d}\) from the OLS weights must be orthogonal to \(C(\mathbf{X})\) to preserve unbiasedness. But such deviations can only add variance, never reduce it. OLS is the unique minimum-variance choice.
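
A numerical companion to the proof (a sketch on arbitrary simulated data): build an alternative linear unbiased estimator by adding a vector \(\mathbf{d}\) orthogonal to \(C(\mathbf{X})\) to the OLS weights, then compare Monte Carlo variances against the predicted gap \(\sigma^2\|\mathbf{d}\|^2\).

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 40, 1.5
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
beta = np.array([2.0, 3.0])

XtX_inv = np.linalg.inv(X.T @ X)
a1 = X @ XtX_inv @ np.array([0.0, 1.0])      # OLS weights for the slope
H = X @ XtX_inv @ X.T
d = (np.eye(n) - H) @ rng.normal(0, 1, n)    # arbitrary vector orthogonal to C(X)
c = a1 + d                                   # alternative weights: still unbiased since d'X = 0

print("c'X =", np.round(c @ X, 10))          # should be [0, 1]

# Monte Carlo variances of the two linear estimators
reps = 20_000
slopes_ols, slopes_alt = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(0, sigma, n)
    slopes_ols[r] = a1 @ y
    slopes_alt[r] = c @ y

print(f"Var(OLS)         ≈ {slopes_ols.var():.5f}  (theory {sigma**2 * XtX_inv[1, 1]:.5f})")
print(f"Var(alternative) ≈ {slopes_alt.var():.5f}  (theory {sigma**2 * (c @ c):.5f})")
print(f"predicted gap σ²‖d‖² = {sigma**2 * (d @ d):.5f}")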

Exercise 5: Computational Practice

Theory becomes concrete through computation. This exercise verifies key theoretical results via Monte Carlo simulation.

Background: Why Simulate?

Simulation provides empirical verification of theoretical formulas. When theory says \(\mathbb{E}[\hat{\beta}] = \beta\) and \(\text{Var}(\hat{\beta}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\), simulation confirms these hold—building confidence and catching errors.

Using Python, verify key theoretical results.

  1. Generate data from \(y = 2 + 3x + \varepsilon\) with \(\varepsilon \sim N(0, 4)\), \(n = 50\). Fit OLS and verify \(\hat{\boldsymbol{\beta}} \approx (2, 3)\).

  2. Compute the hat matrix \(\mathbf{H}\) and verify: \(\mathbf{H} = \mathbf{H}^\top\), \(\mathbf{H}^2 = \mathbf{H}\), \(\text{tr}(\mathbf{H}) = p\).

  3. Verify that \(\mathbf{X}^\top\mathbf{e} = \mathbf{0}\) (residuals orthogonal to predictors).

  4. Simulate 10,000 datasets and verify that \(\mathbb{E}[\hat{\beta}_1] = 3\) and \(\text{Var}(\hat{\beta}_1) \approx \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{22}\).

  5. Verify that \(s^2 = \text{RSS}/(n-p)\) is unbiased for \(\sigma^2 = 4\).

Solution
import numpy as np
from scipy import stats

def exercise_3_4_5():
    """Complete solution to Exercise 3.4.5."""
    rng = np.random.default_rng(42)

    # Parameters
    beta_true = np.array([2.0, 3.0])
    sigma = 2.0
    n = 50

    print("="*60)
    print("EXERCISE 3.4.5: COMPUTATIONAL VERIFICATION")
    print("="*60)

    # Part (a): Single dataset fit
    print("\n--- Part (a): Single Dataset OLS ---")
    x = rng.uniform(0, 5, n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.normal(0, sigma, n)
    y = X @ beta_true + eps

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"True β = {beta_true}")
    print(f"OLS β̂ = [{beta_hat[0]:.3f}, {beta_hat[1]:.3f}]")

    # Part (b): Hat matrix properties
    print("\n--- Part (b): Hat Matrix Properties ---")
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    p = X.shape[1]

    print(f"H = H' (symmetric):     {np.allclose(H, H.T)}")
    print(f"H² = H (idempotent):    {np.allclose(H @ H, H)}")
    print(f"tr(H) = p = {p}:        tr(H) = {np.trace(H):.6f}")

    # Part (c): Orthogonality of residuals
    print("\n--- Part (c): Residual Orthogonality ---")
    e = y - X @ beta_hat
    Xte = X.T @ e
    print(f"X'e = [{Xte[0]:.2e}, {Xte[1]:.2e}]")
    print(f"||X'e|| ≈ 0: {np.allclose(Xte, 0)}")

    # Parts (d) & (e): Simulation study
    print("\n--- Parts (d) & (e): Simulation Study (10,000 reps) ---")
    n_sim = 10000

    beta1_hats = []
    s2_estimates = []

    # Use FIXED X for simulation
    x_fixed = rng.uniform(0, 5, n)
    X_fixed = np.column_stack([np.ones(n), x_fixed])
    XtX_inv_fixed = np.linalg.inv(X_fixed.T @ X_fixed)

    for _ in range(n_sim):
        eps_sim = rng.normal(0, sigma, n)
        y_sim = X_fixed @ beta_true + eps_sim

        beta_sim = XtX_inv_fixed @ X_fixed.T @ y_sim
        beta1_hats.append(beta_sim[1])

        resid = y_sim - X_fixed @ beta_sim
        s2_estimates.append(np.sum(resid**2) / (n - p))

    beta1_hats = np.array(beta1_hats)
    s2_estimates = np.array(s2_estimates)

    # Theoretical variance
    var_theory = sigma**2 * XtX_inv_fixed[1, 1]

    print(f"\nFor β̂₁ (slope):")
    print(f"  E[β̂₁] ≈ β₁ = {beta_true[1]}:")
    print(f"    Sample mean: {np.mean(beta1_hats):.4f}")
    print(f"  Var(β̂₁) = σ²[(X'X)⁻¹]₂₂ = {var_theory:.6f}:")
    print(f"    Sample var:  {np.var(beta1_hats):.6f}")

    print(f"\nFor s² (variance estimator):")
    print(f"  E[s²] = σ² = {sigma**2}:")
    print(f"    Sample mean: {np.mean(s2_estimates):.4f}")

exercise_3_4_5()
============================================================
EXERCISE 3.4.5: COMPUTATIONAL VERIFICATION
============================================================

--- Part (a): Single Dataset OLS ---
True β = [2. 3.]
OLS β̂ = [2.119, 2.972]

--- Part (b): Hat Matrix Properties ---
H = H' (symmetric):     True
H² = H (idempotent):    True
tr(H) = p = 2:        tr(H) = 2.000000

--- Part (c): Residual Orthogonality ---
X'e = [1.42e-14, -3.55e-14]
||X'e|| ≈ 0: True

--- Parts (d) & (e): Simulation Study (10,000 reps) ---

For β̂₁ (slope):
  E[β̂₁] ≈ β₁ = 3.0:
    Sample mean: 3.0002
  Var(β̂₁) = σ²[(X'X)⁻¹]₂₂ = 0.035294:
    Sample var:  0.035183

For s² (variance estimator):
  E[s²] = σ² = 4.0:
    Sample mean: 4.0024

Key Insights:

  1. Hat matrix properties confirmed: \(\mathbf{H}\) is symmetric, idempotent, and has trace equal to \(p\) (number of parameters).

  2. Orthogonality verified: \(\mathbf{X}^\top\mathbf{e} \approx \mathbf{0}\) to machine precision—the residuals are exactly orthogonal to the column space.

  3. Unbiasedness confirmed: Over 10,000 simulations, \(\bar{\hat{\beta}}_1 = 3.0002 \approx \beta_1 = 3.0\).

  4. Variance formula confirmed: Empirical variance matches \(\sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{22}\) closely.

  5. s² is unbiased: \(\bar{s}^2 = 4.0024 \approx \sigma^2 = 4.0\).

Exercise 6: Robust Standard Errors

When heteroskedasticity is present, model-based standard errors are wrong. This exercise compares them to robust alternatives via simulation.

Background: The Heteroskedasticity Problem

Under heteroskedasticity, \(\text{Var}(\varepsilon_i) = \sigma_i^2\) varies across observations. OLS remains unbiased but the formula \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\) is wrong. Robust (sandwich) standard errors correct this.

  1. Generate heteroskedastic data: \(y_i = 2 + 3x_i + \varepsilon_i\) where \(\varepsilon_i \sim N(0, (0.5 + x_i)^2)\).

  2. Fit OLS and compute standard errors using both model-based and HC3 formulas.

  3. Construct 95% CIs for \(\beta_1\) using both SE methods. Assess coverage via simulation (10,000 replications). Which achieves closer to 95% coverage?

  4. Explain why the model-based SE is inappropriate here and how HC3 corrects the problem.

Solution

Parts (a)-(c): Implementation and Coverage Simulation

import numpy as np
from scipy import stats

def robust_se_hc3(X, residuals):
    """Compute HC3 robust standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)
    omega = residuals**2 / (1 - h)**2
    meat = X.T @ np.diag(omega) @ X
    sandwich = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(sandwich))

def coverage_simulation(n_sim=10000, n=100, seed=42):
    """Compare coverage of model-based vs HC3 standard errors."""
    rng = np.random.default_rng(seed)

    beta_true = np.array([2.0, 3.0])

    model_covers = 0
    hc3_covers = 0

    for _ in range(n_sim):
        # Generate heteroskedastic data
        x = rng.uniform(0, 3, n)
        sigma = 0.5 + x  # SD increases with x
        eps = rng.normal(0, sigma)
        y = beta_true[0] + beta_true[1] * x + eps

        # Design matrix
        X = np.column_stack([np.ones(n), x])

        # OLS fit
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        residuals = y - X @ beta_hat

        # Model-based SE (assumes homoskedasticity)
        s2 = np.sum(residuals**2) / (n - 2)
        se_model = np.sqrt(s2 * np.diag(XtX_inv))

        # HC3 robust SE
        se_hc3 = robust_se_hc3(X, residuals)

        # 95% CI for beta_1 (slope)
        t_crit = stats.t.ppf(0.975, n - 2)

        ci_model = (beta_hat[1] - t_crit * se_model[1],
                    beta_hat[1] + t_crit * se_model[1])
        ci_hc3 = (beta_hat[1] - t_crit * se_hc3[1],
                  beta_hat[1] + t_crit * se_hc3[1])

        if ci_model[0] <= beta_true[1] <= ci_model[1]:
            model_covers += 1
        if ci_hc3[0] <= beta_true[1] <= ci_hc3[1]:
            hc3_covers += 1

    return model_covers / n_sim, hc3_covers / n_sim

# Run simulation
cov_model, cov_hc3 = coverage_simulation()

print("="*60)
print("HETEROSKEDASTICITY: MODEL-BASED VS ROBUST SE")
print("="*60)
print(f"\n95% CI Coverage for β₁ (10,000 simulations):")
print(f"  Model-based SE: {cov_model:.3f} (target: 0.950)")
print(f"  HC3 robust SE:  {cov_hc3:.3f} (target: 0.950)")
print(f"\nModel-based {'UNDERCOVERED' if cov_model < 0.93 else 'OK'}")
print(f"HC3 {'achieves nominal coverage' if 0.94 <= cov_hc3 <= 0.96 else 'slightly off'}")
============================================================
HETEROSKEDASTICITY: MODEL-BASED VS ROBUST SE
============================================================

95% CI Coverage for β₁ (10,000 simulations):
  Model-based SE: 0.912 (target: 0.950)
  HC3 robust SE:  0.948 (target: 0.950)

Model-based UNDERCOVERED
HC3 achieves nominal coverage

Part (d): Explanation

The model-based SE assumes \(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\). When variance increases with \(x\), this assumption fails.

Why model-based SE fails: The formula \(s^2(\mathbf{X}^\top\mathbf{X})^{-1}\) uses a pooled \(s^2\) that averages across all observations. This underestimates variance where \(x\) is large (variance actually higher) and overestimates where \(x\) is small. Since high-\(x\) observations carry more weight in determining \(\hat{\beta}_1\), the net effect is that model-based SE is too small, leading to CIs that are too narrow and systematic undercoverage.

How HC3 corrects: The sandwich formula uses observation-specific variance estimates:

\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\hat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\]

where \(\hat{\Omega}_{ii} = e_i^2/(1-h_{ii})^2\) for HC3.

  • Each observation contributes its own squared residual

  • The \((1-h_{ii})^2\) denominator inflates residuals for high-leverage points (which tend to have smaller residuals even when variance is large)

  • This provides a conservative correction that achieves approximately correct coverage even under heteroskedasticity

Key Insight: Robust standard errors do not change the point estimates—OLS \(\hat{\boldsymbol{\beta}}\) is still unbiased. They only correct the variance/SE calculation, making inference (CIs, tests) valid under heteroskedasticity.

References

Historical Origins of Least Squares

[Legendre1805]

Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot, Paris. Appendix contains the first published account of the method of least squares, developed for fitting parabolic orbits to comet observations.

[Gauss1809]

Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Perthes and Besser, Hamburg. Contains Gauss’s probabilistic justification of least squares under normally distributed errors, applied to predicting the orbit of the asteroid Ceres.

[Gauss1821]

Gauss, C. F. (1821–1823). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae. Commentationes Societatis Regiae Scientiarum Gottingensis. English translation by G. W. Stewart (1995), SIAM. Proves that least squares minimizes variance among linear unbiased estimators without assuming normality—the original Gauss-Markov theorem.

[Stigler1981]

Stigler, S. M. (1981). Gauss and the invention of least squares. Annals of Statistics, 9(3), 465–474. Historical analysis of the priority dispute between Gauss and Legendre over the invention of least squares.

The Gauss-Markov Theorem

[Markov1900]

Markov, A. A. (1900). Wahrscheinlichkeitsrechnung. Teubner, Leipzig. Contains Markov’s rediscovery and generalization of Gauss’s optimality result for least squares.

[Aitken1935]

Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48. Generalizes least squares to handle correlated errors through the weighted least squares estimator (generalized least squares).

[Plackett1950]

Plackett, R. L. (1950). Some theorems in least squares. Biometrika, 37(1/2), 149–157. Clarifies and extends the Gauss-Markov theorem with modern notation.

[Kruskal1968]

Kruskal, W. (1968). When are Gauss-Markov and least squares estimators identical? A coordinate-free approach. Annals of Mathematical Statistics, 39(1), 70–75. Geometric perspective on the Gauss-Markov theorem using coordinate-free notation.

Distributional Theory and Inference

[Student1908]

Student [Gosset, W. S.] (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Introduces the t-distribution for inference on means with estimated variance—fundamental for regression coefficient testing.

[Fisher1925]

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd. Develops the F-test for comparing nested models and analysis of variance for regression.

[Cochran1934]

Cochran, W. G. (1934). The distribution of quadratic forms in a normal system, with applications to the analysis of covariance. Mathematical Proceedings of the Cambridge Philosophical Society, 30(2), 178–191. Cochran’s theorem on distributions of quadratic forms, fundamental for understanding F-tests and ANOVA decompositions.

[Scheffe1959]

Scheffé, H. (1959). The Analysis of Variance. Wiley. Comprehensive treatment of linear models from the ANOVA perspective with rigorous distributional theory.

Matrix Approach to Linear Models

[Graybill1961]

Graybill, F. A. (1961). An Introduction to Linear Statistical Models. McGraw-Hill. Early systematic treatment of linear models using matrix notation.

[Searle1971]

Searle, S. R. (1971). Linear Models. Wiley. Definitive reference on the matrix theory of linear models including rank-deficient designs.

[Rao1973]

Rao, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.). Wiley. Graduate-level treatment emphasizing linear algebra foundations of statistical inference.

Diagnostics and Model Checking

[Cook1977]

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18. Introduces Cook’s distance for identifying influential observations.

[Belsley1980]

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley. Comprehensive treatment of regression diagnostics including condition numbers and influence measures.

[Cook1982]

Cook, R. D., and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall. Monograph on diagnostic methods including added-variable plots and influence functions.

Heteroskedasticity and Robust Methods

[White1980]

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838. Introduces heteroskedasticity-robust standard errors for regression coefficients.

[Breusch1979]

Breusch, T. S., and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5), 1287–1294. The Breusch-Pagan test for detecting heteroskedasticity.

[Long2000]

Long, J. S., and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3), 217–224. Practical guidance on when and how to use robust standard errors.

Modern Perspectives

[Seber2003]

Seber, G. A. F., and Lee, A. J. (2003). Linear Regression Analysis (2nd ed.). Wiley. Comprehensive modern treatment of linear regression including computational methods.

[Rencher2008]

Rencher, A. C., and Schaalje, G. B. (2008). Linear Models in Statistics (2nd ed.). Wiley. Graduate textbook covering general linear models with extensive applications.

[Weisberg2014]

Weisberg, S. (2014). Applied Linear Regression (4th ed.). Wiley. Accessible applied treatment emphasizing diagnostics and practical considerations.

Computational Aspects

[Golub1996]

Golub, G. H., and Van Loan, C. F. (1996). Matrix Computations (3rd ed.). Johns Hopkins University Press. Definitive reference on numerical linear algebra including QR decomposition for least squares.

[Björck1996]

Björck, Å. (1996). Numerical Methods for Least Squares Problems. SIAM. Comprehensive treatment of computational methods for solving least squares problems.

Historical Perspective

[Stigler1986]

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press. Historical context for the development of regression and least squares methods.

[Magnus2019]

Magnus, J. R. (2019). On the concept of matrix derivative. Journal of Multivariate Analysis, 169, 94–119. Modern perspective on matrix calculus notation used in deriving least squares estimators.