Appendix B — Linear Algebra for Data Science¶

STAT 418 · Computational Data Science · runnable companion to the Linear Algebra appendix.

Linear algebra is the language of modern statistics: every dataset is a matrix, every parameter is a vector, and every inference procedure — from OLS to MCMC — is a sequence of matrix operations. This notebook makes that concrete. We run each idea with numpy.linalg / scipy.linalg so you can see a matrix expression like $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ behave as a projection, which is exactly what least squares does.

Learning outcomes¶

By the end you can:

  1. Compute inner products, norms, and angles, and read orthogonality as the geometry behind least-squares residuals.
  2. Use the core matrix operations — multiply, solve (never inv), trace, determinant, rank — and recognize the special structures (symmetric, idempotent, positive-definite, stochastic) that unlock shortcuts.
  3. Invert structured matrices with the Schur complement and the Sherman–Morrison–Woodbury rank-update formula.
  4. Solve the least-squares normal equations $\mathbf{X}^\top\mathbf{X}\hat\beta=\mathbf{X}^\top\mathbf{y}$ on the real Advertising data, build the hat matrix, and verify the projection geometry — tying directly to Chapter 3 OLS.
  5. Read structure from the eigen, Cholesky, QR, and SVD decompositions, and use the SVD to understand ridge shrinkage and the condition number.
  6. Propagate variance with $\mathrm{Var}(\mathbf{A}\mathbf{X})=\mathbf{A}\,\mathrm{Var}(\mathbf{X})\,\mathbf{A}^\top$ and take gradients/Jacobians for optimization-based inference.

Section map¶

§ Topic What we run
B.1 Vectors & inner products dot / norm / Cauchy–Schwarz; orthogonal matrices preserve length
B.2 Matrix operations multiply, solve vs inv, trace, determinant, rank
B.3 Special structures symmetric, idempotent, positive-definite, stochastic
B.4 Block matrices Schur complement + Sherman–Morrison–Woodbury
B.5 Projections & least squares Advertising normal equations + hat matrix (→ Ch 3)
B.6 Decompositions eigen, Cholesky (sampling), QR, SVD (ridge)
B.7 Quadratic forms & covariance $\mathrm{Var}(\mathbf{A}\mathbf{X})$, $\chi^2$ from idempotents
B.8 Matrix calculus gradients, log-det, Jacobian / delta method
B.9 Numerical considerations condition number, why never form $\mathbf{X}^\top\mathbf{X}$
— Exercises hat matrix · ill-conditioned ridge · block / LOOCV
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(418)            # reproducibility (STYLE.md contract)
plt.rcParams.update({
    "figure.figsize": (7.0, 4.2), "figure.dpi": 110,
    "axes.grid": True, "grid.alpha": 0.25,
    "axes.spines.top": False, "axes.spines.right": False,
    "font.size": 11,
})
BLUE, RED, GREEN, ORANGE, GRAY = "#4C72B0", "#C44E52", "#55A868", "#DD8452", "#7f7f7f"

# Course data bucket: the least-squares section loads the Chapter 3 Advertising set.
BASE = ("https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/"
        "STAT%20418%20Images/Data/Chapter3/")
def url(name):
    return BASE + name

print("environment ready · numpy", np.__version__, "· pandas", pd.__version__)
environment ready · numpy 1.26.4 · pandas 2.3.3

Section B.1 — Vectors and Inner Products¶

A vector $\mathbf{x}\in\mathbb{R}^n$ is an ordered list of numbers (a column vector by default). The inner product and Euclidean norm are

$$ \langle\mathbf{x},\mathbf{y}\rangle=\mathbf{x}^\top\mathbf{y}=\sum_i x_iy_i, \qquad \|\mathbf{x}\|=\sqrt{\mathbf{x}^\top\mathbf{x}}. $$

The inner product measures alignment: $\mathbf{x}^\top\mathbf{y}=0$ means the vectors are orthogonal (perpendicular). That single fact is the geometric backbone of least squares — the residual will be orthogonal to the fitted space. The Cauchy–Schwarz inequality $|\mathbf{x}^\top\mathbf{y}|\le\|\mathbf{x}\|\,\|\mathbf{y}\|$ also drives the Cramér–Rao bound of Chapter 3.

In [2]:
# Inner product, norm, angle, and the Cauchy-Schwarz inequality.
x = np.array([3.0, 1.0, 4.0])
y = np.array([2.0, 7.0, 1.0])

dot = x @ y                       # = 3*2 + 1*7 + 4*1
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
cos_ang = dot / (nx * ny)

print(f"x . y            = {dot:.4f}      (3*2 + 1*7 + 4*1)")
print(f"||x|| = sqrt(26) = {nx:.4f}")
print(f"||y|| = sqrt(54) = {ny:.4f}")
print(f"Cauchy-Schwarz : |x.y| = {abs(dot):.4f}  <=  ||x||*||y|| = {nx*ny:.4f}   -> {abs(dot) <= nx*ny}")
print(f"angle(x, y)      = {np.degrees(np.arccos(cos_ang)):.2f} deg")

# An orthogonal pair: inner product is exactly zero.
u = np.array([1.0, 0.0, -1.0]); v = np.array([1.0, 5.0, 1.0])
print(f"u . v            = {u @ v:.1f}   -> orthogonal: {np.isclose(u @ v, 0.0)}")
x . y            = 17.0000      (3*2 + 1*7 + 4*1)
||x|| = sqrt(26) = 5.0990
||y|| = sqrt(54) = 7.3485
Cauchy-Schwarz : |x.y| = 17.0000  <=  ||x||*||y|| = 37.4700   -> True
angle(x, y)      = 63.02 deg
u . v            = 0.0   -> orthogonal: True
In [3]:
# Orthogonal matrices preserve lengths and angles (Q^T Q = I), so they never
# amplify error -- the reason QR is numerically safer than forming X^T X.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation by 45 degrees
w = np.array([3.0, 4.0])

print("Q^T Q =\n", np.round(Q.T @ Q, 12))
print(f"||w||   = {np.linalg.norm(w):.4f}")
print(f"||Q w|| = {np.linalg.norm(Q @ w):.4f}   (length preserved by the rotation)")
Q^T Q =
 [[ 1. -0.]
 [-0.  1.]]
||w||   = 5.0000
||Q w|| = 5.0000   (length preserved by the rotation)

Section B.2 — Matrix Operations and Properties¶

A matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ is a linear transformation. The operations we lean on: multiplication ($\mathbf{C}=\mathbf{AB}$, not commutative, order reverses under transpose), the inverse (computed never explicitly — use solve), the trace (sum of the diagonal, cyclic under products), the determinant (volume factor; zero $\iff$ singular), and the rank (number of independent columns). For a design matrix $\mathbf{X}$, full column rank is exactly the condition that makes $\mathbf{X}^\top\mathbf{X}$ invertible, so that the OLS estimator exists.

In [4]:
# Multiplication, transpose rule, trace, rank, determinant.
A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
B = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 0.]])

print("A @ B =\n", A @ B)
print(f"\n(AB)^T == B^T A^T : {np.allclose((A @ B).T, B.T @ A.T)}   (order reverses)")
print(f"AB == BA          : {np.allclose(A @ B, B @ A)}   (not commutative)")
print(f"trace(A)          = {np.trace(A):.1f}")
print(f"rank(A)           = {np.linalg.matrix_rank(A)}   (row3 = 2*row2 - row1, so only 2 independent)")
print(f"det(A)            = {np.linalg.det(A):.3e}   (~0  ->  singular, not invertible)")
print(f"det(B)            = {np.linalg.det(B):.1f}   (nonzero  ->  invertible)")
print(f"det(AB)=det(A)det(B): {np.isclose(np.linalg.det(A @ B), np.linalg.det(A)*np.linalg.det(B))}")
A @ B =
 [[ 4.  5.  4.]
 [10. 11. 13.]
 [16. 17. 22.]]

(AB)^T == B^T A^T : True   (order reverses)
AB == BA          : False   (not commutative)
trace(A)          = 15.0
rank(A)           = 2   (row3 = 2*row2 - row1, so only 2 independent)
det(A)            = 0.000e+00   (~0  ->  singular, not invertible)
det(B)            = -3.0   (nonzero  ->  invertible)
det(AB)=det(A)det(B): True
In [5]:
# Solve A x = b WITHOUT forming the inverse (faster + more stable; STYLE pitfall).
M = np.array([[4., 2.],
              [2., 3.]])
b = np.array([1.0, 1.0])

x_solve = np.linalg.solve(M, b)        # LU factorization internally
x_inv   = np.linalg.inv(M) @ b         # forms the full inverse first (avoid)
print(f"np.linalg.solve(M, b) = {x_solve}")
print(f"np.linalg.inv(M) @ b  = {x_inv}   (same answer, worse numerics)")
print(f"agree: {np.allclose(x_solve, x_inv)}")

# Trace cyclic property tr(ABC) = tr(CAB), and the inner-product identity x'Ax = tr(A x x').
rng = np.random.default_rng(0)
P, Qm, R = rng.standard_normal((3, 3)), rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
print(f"\ntr(PQR) = tr(RPQ) : {np.isclose(np.trace(P @ Qm @ R), np.trace(R @ P @ Qm))}   (cyclic)")
xv = rng.standard_normal(2)
print(f"x'Mx == tr(M x x'): {np.isclose(xv @ M @ xv, np.trace(M @ np.outer(xv, xv)))}   (inner-product identity)")
np.linalg.solve(M, b) = [0.125 0.25 ]
np.linalg.inv(M) @ b  = [0.125 0.25 ]   (same answer, worse numerics)
agree: True

tr(PQR) = tr(RPQ) : True   (cyclic)
x'Mx == tr(M x x'): True   (inner-product identity)

Section B.3 — Special Matrix Structures¶

Recognizing structure unlocks shortcuts and theory:

  • Symmetric ($\mathbf{A}=\mathbf{A}^\top$): real eigenvalues, orthogonal eigenvectors, a spectral decomposition $\mathbf{A}=\mathbf{Q}\boldsymbol\Lambda\mathbf{Q}^\top$. Every covariance, Fisher information, and Hessian is symmetric.
  • Idempotent ($\mathbf{P}^2=\mathbf{P}$): the defining property of a projection; eigenvalues are $0$ or $1$, and $\mathrm{rank}=\mathrm{tr}$.
  • Positive definite ($\mathbf{x}^\top\mathbf{A}\mathbf{x}>0$): all eigenvalues $>0$; the matrix analogue of a positive number; admits a Cholesky factor.
  • Stochastic (non-negative rows summing to 1): the transition matrix of a Markov chain; the stationary distribution is its left eigenvector with eigenvalue $1$.
In [6]:
# Symmetric -> spectral decomposition;  positive definite -> all eigenvalues > 0.
A = np.array([[4., 2.],
              [2., 3.]])                       # symmetric
vals, Qv = np.linalg.eigh(A)                   # eigh: symmetric solver
print(f"eigenvalues          = {vals.round(4)}   (both > 0  ->  positive definite)")
print(f"reconstruction error : {np.max(np.abs(A - Qv @ np.diag(vals) @ Qv.T)):.2e}   (A = Q Lambda Q^T)")

# Verify x'Ax > 0 for 1000 random x (definition of PD).
rng = np.random.default_rng(42)
quad_min = min(float(z @ A @ z) for z in rng.standard_normal((1000, 2)))
print(f"min x'Ax over 1000 random x = {quad_min:.4f}  (> 0, confirms PD)")

# Eigenvectors of a symmetric matrix are orthogonal:
print(f"q1 . q2 = {Qv[:,0] @ Qv[:,1]:.2e}  (orthogonal eigenvectors)")
eigenvalues          = [1.4384 5.5616]   (both > 0  ->  positive definite)
reconstruction error : 8.88e-16   (A = Q Lambda Q^T)
min x'Ax over 1000 random x = 0.0113  (> 0, confirms PD)
q1 . q2 = 0.00e+00  (orthogonal eigenvectors)
In [7]:
# Idempotent matrices are projections: eigenvalues are 0/1 and rank == trace.
rng = np.random.default_rng(42)
n, p = 8, 3
Xr = rng.standard_normal((n, p))
Pmat = Xr @ np.linalg.solve(Xr.T @ Xr, Xr.T)   # projection onto col-space of Xr

print(f"P^2 == P (idempotent) : {np.allclose(Pmat @ Pmat, Pmat)}")
print(f"P symmetric           : {np.allclose(Pmat, Pmat.T)}")
ev = np.linalg.eigvalsh(Pmat)
print(f"eigenvalues (rounded) : {ev.round(3)}")
print(f"  # near 0 = {np.sum(np.abs(ev) < 1e-9)},  # near 1 = {np.sum(np.abs(ev-1) < 1e-9)}")
print(f"rank(P) = {np.linalg.matrix_rank(Pmat)} == trace(P) = {np.trace(Pmat):.1f} == p = {p}")
P^2 == P (idempotent) : True
P symmetric           : True
eigenvalues (rounded) : [-0. -0. -0.  0.  0.  1.  1.  1.]
  # near 0 = 5,  # near 1 = 3
rank(P) = 3 == trace(P) = 3.0 == p = 3
In [8]:
# Stochastic (transition) matrix: rows are probabilities; stationary pi is the
# left eigenvector with eigenvalue 1, and P^k converges to rows of pi.
Pt = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3],
               [0.2, 0.3, 0.5]])
print(f"row sums = {Pt.sum(axis=1)}   (each row is a probability distribution)")

# Left eigenvector for eigenvalue 1  ->  pi^T P = pi^T.
w, V = np.linalg.eig(Pt.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))]);  pi = pi / pi.sum()
print(f"stationary pi          = {pi.round(4)}")
print(f"pi^T P == pi^T         : {np.allclose(pi @ Pt, pi)}")

P50 = np.linalg.matrix_power(Pt, 50)
print(f"rows of P^50 (-> pi)   = {P50[0].round(4)}")
print(f"second-largest |lambda| = {np.sort(np.abs(np.linalg.eigvals(Pt)))[-2]:.4f}  (governs convergence speed)")
row sums = [1. 1. 1.]   (each row is a probability distribution)
stationary pi          = [0.3235 0.3824 0.2941]
pi^T P == pi^T         : True
rows of P^50 (-> pi)   = [0.3235 0.3824 0.2941]
second-largest |lambda| = 0.5414  (governs convergence speed)

Section B.4 — Block Matrices¶

Many statistical matrices have natural block structure. With $\mathbf{A}=\begin{pmatrix}\mathbf{A}_{11}&\mathbf{A}_{12}\\\mathbf{A}_{21}&\mathbf{A}_{22}\end{pmatrix}$ and $\mathbf{A}_{11}$ invertible, the Schur complement $\mathbf{S}=\mathbf{A}_{22}-\mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}$ gives the block determinant $\det(\mathbf{A})=\det(\mathbf{A}_{11})\det(\mathbf{S})$ and is exactly the conditional covariance of a multivariate normal.

The Sherman–Morrison–Woodbury formula turns an $n\times n$ inverse with a rank-$k$ update into a $k\times k$ inverse:

$$ (\mathbf{A}+\mathbf{U}\mathbf{C}\mathbf{V})^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1}+\mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}\mathbf{A}^{-1}. $$

When $k\ll n$ this is a dramatic speedup — and it is the engine behind the leave-one-out cross-validation shortcut (Exercise 3).

In [9]:
# Schur complement: block determinant identity det(A) = det(A11) det(S).
rng = np.random.default_rng(7)
G = rng.standard_normal((5, 5)); A = G @ G.T + np.eye(5)   # symmetric PD, 2+3 blocks
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]
S = A22 - A21 @ np.linalg.solve(A11, A12)                  # Schur complement of A11
print(f"det(A)                 = {np.linalg.det(A):.4f}")
print(f"det(A11) * det(S)      = {np.linalg.det(A11) * np.linalg.det(S):.4f}")
print(f"block determinant holds: {np.isclose(np.linalg.det(A), np.linalg.det(A11)*np.linalg.det(S))}")
det(A)                 = 588.6852
det(A11) * det(S)      = 588.6852
block determinant holds: True
In [10]:
# Sherman-Morrison-Woodbury: a rank-3 update, verified against direct inversion.
rng = np.random.default_rng(42)
n, k = 100, 3
Am = rng.standard_normal((n, n)); Am = Am @ Am.T + np.eye(n)   # PD
U = rng.standard_normal((n, k))
C = np.eye(k)
Vm = rng.standard_normal((k, n))

direct = np.linalg.inv(Am + U @ C @ Vm)                       # invert the full n x n
Ainv = np.linalg.inv(Am)
inner = np.linalg.inv(np.linalg.inv(C) + Vm @ Ainv @ U)       # only a k x k inverse
smw = Ainv - Ainv @ U @ inner @ Vm @ Ainv

print(f"n = {n}, rank-{k} update")
print(f"max |direct - SMW| = {np.max(np.abs(direct - smw)):.2e}   (k x k inverse reproduces the n x n one)")
n = 100, rank-3 update
max |direct - SMW| = 1.03e-15   (k x k inverse reproduces the n x n one)

Section B.5 — Column Space, Projections, and Least Squares¶

This is the geometric heart of regression. The column space $\mathcal{C}(\mathbf{X})=\{\mathbf{X}\beta\}$ is a thin $p$-dimensional slice of $\mathbb{R}^n$; the response $\mathbf{y}$ almost never lies in it. Least squares finds the closest point in that slice — the orthogonal projection $\hat{\mathbf{y}}=\mathbf{H}\mathbf{y}$ with the hat matrix

$$ \mathbf{H}=\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top. $$

Setting $\nabla_\beta\|\mathbf{y}-\mathbf{X}\beta\|^2=\mathbf{0}$ gives the normal equations $\mathbf{X}^\top\mathbf{X}\hat\beta=\mathbf{X}^\top\mathbf{y}$, whose defining feature is that the residual $\mathbf{e}=\mathbf{y}-\hat{\mathbf{y}}$ is orthogonal to every column of $\mathbf{X}$: $\mathbf{X}^\top\mathbf{e}=\mathbf{0}$.

In [11]:
# A 2-D picture of projection: column space = a line; y splits into y_hat + e.
a = np.array([2.0, 1.0])                      # spans the (1-D) column space
yv = np.array([1.0, 3.0])                     # the response vector
proj = (a @ yv) / (a @ a) * a                 # orthogonal projection of y onto span(a)
resid = yv - proj                             # residual, perpendicular to a

fig, ax = plt.subplots(figsize=(5.6, 5.2))
s = np.array([-0.6, 1.9])
ax.plot(s * a[0], s * a[1], color=GRAY, lw=1.5, label="column space  C(X)")
ax.annotate("", xy=yv,   xytext=(0, 0),  arrowprops=dict(arrowstyle="->", color=BLUE,  lw=2.2))
ax.annotate("", xy=proj, xytext=(0, 0),  arrowprops=dict(arrowstyle="->", color=GREEN, lw=2.2))
ax.annotate("", xy=yv,   xytext=proj,    arrowprops=dict(arrowstyle="->", color=RED,   lw=2.2))
ax.text(*(yv + 0.06),   r"$\mathbf{y}$",       color=BLUE,  fontsize=13)
ax.text(proj[0]+0.06, proj[1]-0.22, r"$\hat{\mathbf{y}}=\mathbf{Hy}$", color=GREEN, fontsize=13)
ax.text((yv[0]+proj[0])/2+0.05, (yv[1]+proj[1])/2, r"$\mathbf{e}\perp\mathbf{X}$", color=RED, fontsize=13)
# little right-angle marker at the foot of the projection
d1 = a / np.linalg.norm(a); d2 = resid / np.linalg.norm(resid); h = 0.18
corner = proj + h*d1
ax.plot(*np.array([proj+h*d1, proj+h*d1+h*d2, proj+h*d2]).T, color="k", lw=1)
ax.scatter([0], [0], color="k", zorder=5); ax.text(-0.18, -0.05, "0")
ax.set_xlim(-1.4, 2.6); ax.set_ylim(-0.6, 3.4); ax.set_aspect("equal")
ax.set_title("Least squares = orthogonal projection onto C(X)")
ax.legend(loc="lower right"); plt.tight_layout(); plt.show()
print(f"check  e . a = {resid @ a:.2e}  (residual orthogonal to the column space)")
check  e . a = 0.00e+00  (residual orthogonal to the column space)
In [12]:
# REAL DATA: solve the Chapter 3 Advertising OLS by the normal equations directly.
adv = pd.read_csv(url("Advertising.csv"))
Xmat = np.column_stack([np.ones(len(adv)),
                        adv["TV"].to_numpy(),
                        adv["Radio"].to_numpy(),
                        adv["Newspaper"].to_numpy()])
yv = adv["Sales"].to_numpy()
n, p = Xmat.shape

# Normal equations:  (X^T X) beta = X^T y   -- solved WITHOUT forming an inverse.
beta = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ yv)
names = ["intercept", "TV", "Radio", "Newspaper"]
print(f"Advertising OLS by normal equations   (n = {n}, p = {p})")
for nm, b in zip(names, beta):
    print(f"  beta[{nm:9s}] = {b:+.5f}")

# Cross-check 1: numpy lstsq (SVD) gives the same coefficients.
beta_ls = np.linalg.lstsq(Xmat, yv, rcond=None)[0]
print(f"\nmax |normal-eq - lstsq(SVD)|      = {np.max(np.abs(beta - beta_ls)):.2e}")

# Cross-check 2: statsmodels OLS from Chapter 3 -> identical betas.
import statsmodels.formula.api as smf
sm_fit = smf.ols("Sales ~ TV + Radio + Newspaper", data=adv).fit()
print(f"max |normal-eq - statsmodels (Ch3)| = {np.max(np.abs(beta - sm_fit.params.values)):.2e}")
print(f"  (Ch3 read-off: TV beta ~ {beta[1]:.4f}, Newspaper ~ {beta[3]:.4f} -- effectively zero)")
Advertising OLS by normal equations   (n = 200, p = 4)
  beta[intercept] = +2.93889
  beta[TV       ] = +0.04576
  beta[Radio    ] = +0.18853
  beta[Newspaper] = -0.00104

max |normal-eq - lstsq(SVD)|      = 2.04e-14
max |normal-eq - statsmodels (Ch3)| = 1.33e-14
  (Ch3 read-off: TV beta ~ 0.0458, Newspaper ~ -0.0010 -- effectively zero)
In [13]:
# Hat matrix on the real design: symmetric, idempotent, trace = p, residual orthogonal.
H = Xmat @ np.linalg.solve(Xmat.T @ Xmat, Xmat.T)   # n x n projection
yhat = H @ yv
e = yv - yhat
lev = np.diag(H)

print(f"H symmetric                 : {np.allclose(H, H.T)}")
print(f"H idempotent (H^2 = H)      : {np.allclose(H @ H, H)}")
print(f"trace(H) = {np.trace(H):.4f}  ==  p = {p}   (effective degrees of freedom)")
print(f"sum of leverages            = {lev.sum():.4f}  ==  p")
print(f"mean leverage = p/n         = {lev.mean():.4f}   (avg = {p/n:.4f})")
print(f"max leverage h_ii           = {lev.max():.4f}  at market #{int(np.argmax(lev))}")
print(f"orthogonality  max|X^T e|   = {np.max(np.abs(Xmat.T @ e)):.2e}   (residual perpendicular to every column)")

# Pythagoras in n dimensions:  ||y||^2 = ||y_hat||^2 + ||e||^2.
print(f"\nPythagoras: ||y||^2 = {yv@yv:.2f},  ||yhat||^2 + ||e||^2 = {yhat@yhat + e@e:.2f}")
R2 = 1 - (e @ e) / np.sum((yv - yv.mean())**2)
print(f"R^2 = 1 - RSS/TSS = {R2:.4f}   (matches the Chapter 3 OLS fit)")
H symmetric                 : True
H idempotent (H^2 = H)      : True
trace(H) = 4.0000  ==  p = 4   (effective degrees of freedom)
sum of leverages            = 4.0000  ==  p
mean leverage = p/n         = 0.0200   (avg = 0.0200)
max leverage h_ii           = 0.0863  at market #16
orthogonality  max|X^T e|   = 7.17e-11   (residual perpendicular to every column)

Pythagoras: ||y||^2 = 44743.25,  ||yhat||^2 + ||e||^2 = 44743.25
R^2 = 1 - RSS/TSS = 0.8972   (matches the Chapter 3 OLS fit)
In [14]:
# Figure: fitted (projected) vs actual sales -- the projection that minimizes RSS.
fig, ax = plt.subplots(figsize=(5.8, 5.2))
ax.scatter(yhat, yv, s=22, color=BLUE, alpha=0.6, edgecolor="white")
lims = [min(yv.min(), yhat.min()) - 1, max(yv.max(), yhat.max()) + 1]
ax.plot(lims, lims, color=RED, lw=1.6, ls="--", label="perfect fit (y = y_hat)")
ax.set_xlim(lims); ax.set_ylim(lims); ax.set_aspect("equal")
ax.set_xlabel(r"fitted  $\hat{y}=\mathbf{Hy}$  (sales)")
ax.set_ylabel("actual sales")
ax.set_title(f"Advertising OLS projection   ($R^2$ = {R2:.3f})")
ax.legend(loc="upper left"); plt.tight_layout(); plt.show()

Read-out. Solving the normal equations directly reproduces the Chapter 3 OLS coefficients to machine precision (and matches both lstsq and statsmodels): TV $\approx 0.046$ sales per ad-dollar, Newspaper $\approx -0.001$ — effectively zero, the chapter's confounding story. The hat matrix is a genuine projection ($\mathbf{H}^2=\mathbf{H}$, $\mathrm{tr}(\mathbf{H})=p=4$), the residual is orthogonal to every predictor ($\|\mathbf{X}^\top\mathbf{e}\|\approx 10^{-12}$), and the Pythagorean split $\|\mathbf{y}\|^2=\|\hat{\mathbf{y}}\|^2+\|\mathbf{e}\|^2$ holds exactly — the geometry is the statistics.

Section B.6 — Matrix Decompositions¶

Each decomposition answers a different question and powers a different computation:

Decomposition Needs Primary use in this course
Eigen $\mathbf{Q}\boldsymbol\Lambda\mathbf{Q}^\top$ symmetric PCA, Markov convergence, definiteness
Cholesky $\mathbf{L}\mathbf{L}^\top$ symmetric PD sampling correlated normals, PD solves
QR $\mathbf{Q}\mathbf{R}$ full column rank stable regression, leverage
SVD $\mathbf{U}\boldsymbol\Sigma\mathbf{V}^\top$ any matrix ridge, condition number, rank
In [15]:
# Eigendecomposition of a covariance matrix: eigenvectors are the principal axes,
# eigenvalues are the variance along each axis.
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
vals, vecs = np.linalg.eigh(Sigma)            # ascending eigenvalues
print(f"eigenvalues (variances) = {vals.round(4)}")
print(f"principal direction (largest var) = {vecs[:, -1].round(4)}")
# Matrix functions via the spectrum: a symmetric square root.
Sig_sqrt = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
print(f"(Sigma^1/2)^2 == Sigma : {np.allclose(Sig_sqrt @ Sig_sqrt, Sigma)}")

# Draw the 2-sd covariance ellipse with its eigen-axes over a sample cloud.
rng = np.random.default_rng(418)
L = np.linalg.cholesky(Sigma)
pts = (rng.standard_normal((600, 2)) @ L.T)
tt = np.linspace(0, 2*np.pi, 200)
ell = vecs @ np.diag(2*np.sqrt(vals)) @ np.array([np.cos(tt), np.sin(tt)])
fig, ax = plt.subplots(figsize=(5.6, 5.2))
ax.scatter(pts[:, 0], pts[:, 1], s=10, alpha=0.25, color=BLUE)
ax.plot(ell[0], ell[1], color="k", lw=2, label="2-sd ellipse")
for i, col in [(-1, RED), (0, GREEN)]:
    ax.annotate("", xy=2*np.sqrt(vals[i])*vecs[:, i], xytext=(0, 0),
                arrowprops=dict(arrowstyle="->", color=col, lw=2.4))
ax.text(2.6, 1.6, r"$\mathbf{q}_1\,(\lambda_1)$", color=RED)
ax.text(-1.8, 1.4, r"$\mathbf{q}_2\,(\lambda_2)$", color=GREEN)
ax.set_aspect("equal"); ax.set_title("Eigendecomposition: principal axes of a covariance")
ax.legend(loc="lower right"); plt.tight_layout(); plt.show()
eigenvalues (variances) = [0.2 2.8]
principal direction (largest var) = [-0.8321 -0.5547]
(Sigma^1/2)^2 == Sigma : True
In [16]:
# Cholesky decomposition Sigma = L L^T is the matrix square root used to GENERATE
# correlated normals:  X = L Z  has Var(X) = L I L^T = Sigma.
Sig3 = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])
L = np.linalg.cholesky(Sig3)
print("L (lower triangular) =\n", L.round(4))
print(f"reconstruction error : {np.max(np.abs(Sig3 - L @ L.T)):.2e}")

rng = np.random.default_rng(42)
Z = rng.standard_normal((3, 100000))
Xc = L @ Z
print("sample covariance of X = L Z (-> Sigma):\n", np.cov(Xc).round(3))

# Figure: independent Z (left) vs correlated X = L Z (right), rho = 0.8.
L2 = np.linalg.cholesky(np.array([[1.0, 0.8], [0.8, 1.0]]))
Z2 = rng.standard_normal((2, 1500)); X2 = L2 @ Z2
fig, axes = plt.subplots(1, 2, figsize=(9.4, 4.6))
for ax, data, ttl, col in [(axes[0], Z2, r"independent  $\mathbf{Z}\sim N(0,\mathbf{I})$", GRAY),
                           (axes[1], X2, r"correlated  $\mathbf{X}=\mathbf{LZ}$  ($\rho=0.8$)", BLUE)]:
    ax.scatter(data[0], data[1], s=8, alpha=0.3, color=col)
    ax.set_xlim(-4, 4); ax.set_ylim(-4, 4); ax.set_aspect("equal"); ax.set_title(ttl)
plt.tight_layout(); plt.show()
L (lower triangular) =
 [[1.     0.     0.    ]
 [0.8    0.6    0.    ]
 [0.3    0.4333 0.8498]]
reconstruction error : 1.11e-16
sample covariance of X = L Z (-> Sigma):
 [[1.007 0.805 0.3  ]
 [0.805 1.005 0.504]
 [0.3   0.504 1.   ]]
In [17]:
# QR decomposition: numerically stable regression without forming X^T X.
rng = np.random.default_rng(42)
n, p = 100, 5
Xq = rng.standard_normal((n, p))
yq = Xq @ np.ones(p) + rng.standard_normal(n)

Qm, Rm = np.linalg.qr(Xq)                       # X = Q R, Q^T Q = I
beta_qr = np.linalg.solve(Rm, Qm.T @ yq)        # triangular solve, R beta = Q^T y
beta_ne = np.linalg.solve(Xq.T @ Xq, Xq.T @ yq) # direct normal equations
print(f"Q^T Q == I            : {np.allclose(Qm.T @ Qm, np.eye(p))}")
print(f"max |QR - normal eq|  = {np.max(np.abs(beta_qr - beta_ne)):.2e}")
# Leverage falls straight out of Q:  h_ii = ||row_i of Q||^2 = diag(Q Q^T).
lev_qr = np.sum(Qm**2, axis=1)
lev_H = np.diag(Xq @ np.linalg.solve(Xq.T @ Xq, Xq.T))
print(f"leverage via Q vs H   : max diff = {np.max(np.abs(lev_qr - lev_H)):.2e}")
Q^T Q == I            : True
max |QR - normal eq|  = 6.66e-16
leverage via Q vs H   : max diff = 5.55e-17
In [18]:
# SVD: the universal decomposition. sigma_i^2 are the eigenvalues of X^T X; the
# condition number is sigma_max/sigma_min; and ridge shrinks each SVD direction.
rng = np.random.default_rng(42)
n, p = 100, 5
Xs = rng.standard_normal((n, p))
U, sig, Vt = np.linalg.svd(Xs, full_matrices=False)
print(f"singular values   = {sig.round(3)}")
print(f"condition number  = sigma_max/sigma_min = {sig[0]/sig[-1]:.2f}")
print(f"sigma^2 == eig(X^T X): {np.allclose(np.sort(sig**2), np.sort(np.linalg.eigvalsh(Xs.T @ Xs)))}")

ys = Xs @ np.ones(p) + rng.standard_normal(n)
lam = 1.0
beta_ridge = Vt.T @ ((sig / (sig**2 + lam)) * (U.T @ ys))   # ridge via SVD
df_ridge = np.sum(sig**2 / (sig**2 + lam))                  # effective d.o.f.
print(f"ridge(lambda=1) effective df = {df_ridge:.2f}   (< p = {p})")

# Figure: shrinkage factor sigma^2/(sigma^2+lambda) by SVD direction, several lambda.
fig, ax = plt.subplots(figsize=(7.0, 4.2))
idx = np.arange(1, p + 1)
for lam, col in [(0.01, GREEN), (1.0, BLUE), (10.0, ORANGE), (100.0, RED)]:
    ax.plot(idx, sig**2 / (sig**2 + lam), "o-", color=col, lw=2, label=fr"$\lambda={lam:g}$")
ax.set_xticks(idx); ax.set_ylim(-0.03, 1.05)
ax.set_xlabel("SVD direction (decreasing singular value $\\rightarrow$)")
ax.set_ylabel(r"shrinkage  $\sigma_i^2/(\sigma_i^2+\lambda)$")
ax.set_title("Ridge shrinks weak (small-$\\sigma$) directions the most")
ax.legend(); plt.tight_layout(); plt.show()
singular values   = [11.379 10.055  9.283  9.116  7.749]
condition number  = sigma_max/sigma_min = 1.47
sigma^2 == eig(X^T X): True
ridge(lambda=1) effective df = 4.94   (< p = 5)

Section B.7 — Quadratic Forms and Covariance¶

A quadratic form $Q(\mathbf{x})=\mathbf{x}^\top\mathbf{A}\mathbf{x}$ is the matrix face of the chi-square distribution, the Mahalanobis distance, and every sum of squares. The single most useful covariance identity is the variance of a linear map:

$$ \mathrm{Var}(\mathbf{A}\mathbf{X})=\mathbf{A}\,\mathrm{Var}(\mathbf{X})\,\mathbf{A}^\top. $$

It is the one rule behind the OLS variance $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, the delta method, and Cholesky sampling. And if $\mathbf{Z}\sim N(\mathbf{0},\mathbf{I})$ and $\mathbf{P}$ is a symmetric idempotent matrix of rank $r$, then $\mathbf{Z}^\top\mathbf{P}\mathbf{Z}\sim\chi^2_r$ — the basis for $\mathrm{RSS}/\sigma^2\sim\chi^2_{n-p}$.

In [19]:
# Var(A X) = A Var(X) A^T, checked by Monte Carlo, then used for the OLS variance.
rng = np.random.default_rng(418)
Sig = np.array([[2.0, 0.6, 0.2],
                [0.6, 1.0, 0.3],
                [0.2, 0.3, 1.5]])
Amap = np.array([[1.0, 2.0, 0.0],
                 [0.0, 1.0, -1.0]])
L = np.linalg.cholesky(Sig)
Xd = rng.standard_normal((400000, 3)) @ L.T      # Var(X) = Sigma
Yd = Xd @ Amap.T                                 # Y = A X
print("empirical Var(AX) =\n", np.cov(Yd.T).round(3))
print("A Sigma A^T       =\n", (Amap @ Sig @ Amap.T).round(3))
print(f"max diff = {np.max(np.abs(np.cov(Yd.T) - Amap @ Sig @ Amap.T)):.3f}\n")

# OLS variance Var(beta_hat) = sigma^2 (X^T X)^-1 is the same rule with A = (X^T X)^-1 X^T.
n, p = 1000, 3
Xo = rng.standard_normal((n, p)); true_beta = np.array([1.0, 2.0, -1.0]); sigma = 0.5
yo = Xo @ true_beta + sigma * rng.standard_normal(n)
XtX_inv = np.linalg.inv(Xo.T @ Xo)
bhat = XtX_inv @ Xo.T @ yo
s2 = np.sum((yo - Xo @ bhat)**2) / (n - p)
se = np.sqrt(np.diag(s2 * XtX_inv))
print(f"beta_hat        = {bhat.round(4)}   (true = {true_beta})")
print(f"standard errors = {se.round(4)}   (~ sigma/sqrt(n) = {sigma/np.sqrt(n):.4f})")
empirical Var(AX) =
 [[8.392 1.791]
 [1.791 1.896]]
A Sigma A^T       =
 [[8.4 1.8]
 [1.8 1.9]]
max diff = 0.009

beta_hat        = [ 1.0163  1.9958 -0.984 ]   (true = [ 1.  2. -1.])
standard errors = [0.0158 0.0169 0.0171]   (~ sigma/sqrt(n) = 0.0158)
In [20]:
# Chi-square from an idempotent quadratic form:  Z^T P Z ~ chi^2_r  (r = rank P).
rng = np.random.default_rng(418)
nz, r = 12, 4
Xrnd = rng.standard_normal((nz, r))
Pproj = Xrnd @ np.linalg.solve(Xrnd.T @ Xrnd, Xrnd.T)   # symmetric idempotent, rank r
Zd = rng.standard_normal((300000, nz))
quad = np.einsum("ij,jk,ik->i", Zd, Pproj, Zd)          # z^T P z for each draw
print(f"rank(P) = {np.linalg.matrix_rank(Pproj)}  (target chi^2 degrees of freedom)")
print(f"mean(z'Pz) = {quad.mean():.3f}   (chi^2_{r} mean  = {r})")
print(f"var(z'Pz)  = {quad.var():.3f}   (chi^2_{r} var   = {2*r})")
rank(P) = 4  (target chi^2 degrees of freedom)
mean(z'Pz) = 3.997   (chi^2_4 mean  = 4)
var(z'Pz)  = 7.997   (chi^2_4 var   = 8)

Section B.8 — Matrix Calculus¶

Optimization-based inference (MLE, MAP, Fisher scoring) needs derivatives of scalar functions of vectors. The identities we use again and again:

$$ \nabla_{\mathbf x}(\mathbf a^\top\mathbf x)=\mathbf a,\qquad \nabla_{\mathbf x}(\mathbf x^\top\mathbf A\mathbf x)=2\mathbf A\mathbf x\ \ (\mathbf A\ \text{symmetric}),\qquad \frac{\partial}{\partial\theta_k}\log\det\mathbf A=\mathrm{tr}\!\Big(\mathbf A^{-1}\frac{\partial\mathbf A}{\partial\theta_k}\Big). $$

The Hessian (second derivatives) classifies critical points — negative definite $\Rightarrow$ a maximum, which is how the observed Fisher information $-\nabla^2\ell(\hat\theta)$ confirms an MLE. The Jacobian propagates uncertainty through transforms (the multivariate delta method). We verify each identity numerically with finite differences.

In [21]:
# Gradient identities, verified against central finite differences.
rng = np.random.default_rng(0)
Asym = np.array([[4.0, 1.0],
                 [1.0, 3.0]])               # symmetric
avec = np.array([2.0, -1.0])
x0 = np.array([1.0, -2.0])
eps = 1e-6

def fd_grad(f, x):
    g = np.zeros_like(x)
    for k in range(len(x)):
        d = np.zeros_like(x); d[k] = eps
        g[k] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

g_lin = fd_grad(lambda v: avec @ v, x0)
g_quad = fd_grad(lambda v: v @ Asym @ v, x0)
print(f"grad(a^T x): numeric {g_lin.round(5)}  vs  a            = {avec}")
print(f"grad(x^T A x): numeric {g_quad.round(5)}  vs  2 A x = {2 * Asym @ x0}")

# Derivative of log-det:  d/dt logdet(A0 + t E) = tr(A0^-1 E)  at t = 0.
M = rng.standard_normal((4, 4)); A0 = M @ M.T + 4 * np.eye(4)
E = rng.standard_normal((4, 4)); E = E + E.T
ld = lambda t: np.linalg.slogdet(A0 + t * E)[1]
print(f"\nd/dt log det: numeric {(ld(eps) - ld(-eps)) / (2*eps):.5f}  vs  tr(A^-1 E) = {np.trace(np.linalg.solve(A0, E)):.5f}")
grad(a^T x): numeric [ 2. -1.]  vs  a            = [ 2. -1.]
grad(x^T A x): numeric [  4. -10.]  vs  2 A x = [  4. -10.]

d/dt log det: numeric 0.04060  vs  tr(A^-1 E) = 0.04060
In [22]:
# Jacobian and the multivariate delta method: Var(g) ~ J Var(theta) J^T.
def g(th):                                   # g: R^2 -> R^2
    return np.array([th[0] * th[1], th[0]**2])
theta = np.array([2.0, 3.0])
Sig_th = np.array([[0.10, 0.02],
                   [0.02, 0.05]])            # Var(theta_hat)

# Analytic Jacobian J = d g / d theta^T.
J = np.array([[theta[1], theta[0]],
              [2*theta[0], 0.0]])
var_delta = J @ Sig_th @ J.T

# Monte-Carlo check by pushing samples of theta_hat through g.
rng = np.random.default_rng(418)
samp = rng.multivariate_normal(theta, Sig_th, size=400000)
gv = np.column_stack([samp[:, 0]*samp[:, 1], samp[:, 0]**2])
print("Jacobian J =\n", J)
print("delta-method  J Sig J^T =\n", var_delta.round(4))
print("Monte-Carlo Var(g)      =\n", np.cov(gv.T).round(4))
print(f"max diff = {np.max(np.abs(var_delta - np.cov(gv.T))):.3f}  (close: first-order Taylor is good here)")
Jacobian J =
 [[3. 2.]
 [4. 0.]]
delta-method  J Sig J^T =
 [[1.34 1.36]
 [1.36 1.6 ]]
Monte-Carlo Var(g)      =
 [[1.3426 1.3603]
 [1.3603 1.6155]]
max diff = 0.016  (close: first-order Taylor is good here)

Section B.9 — Numerical Considerations¶

The formulas above are exact in real arithmetic; computers are not. The condition number $\kappa(\mathbf{A})=\sigma_{\max}/\sigma_{\min}$ measures how much a problem amplifies error: a relative perturbation $\epsilon$ can become $\kappa\epsilon$ in the answer, so $\kappa\approx 10^k$ costs about $k$ digits. Forming $\mathbf{X}^\top\mathbf{X}$ squares the condition number — which is why numerical regression uses QR or SVD instead.

Task Naive (avoid) Stable (prefer)
Solve $\mathbf{X}^\top\mathbf{X}\beta=\mathbf{X}^\top\mathbf{y}$ inv(X.T@X)@X.T@y np.linalg.lstsq(X, y) (SVD)
Leverage $h_{ii}$ diag(X@inv(X.T@X)@X.T) Q,R=qr(X); sum(Q**2,1)
PD solve $\mathbf{A}\mathbf{x}=\mathbf{b}$ inv(A)@b solve(A,b) / cho_solve
In [23]:
# Condition number squaring, and QR staying stable where the normal equations rot.
rng = np.random.default_rng(42)
n, p = 100, 5
Xw = rng.standard_normal((n, p))                       # well conditioned
Xi = rng.standard_normal((n, p))
Xi[:, 4] = Xi[:, 3] + 1e-8 * rng.standard_normal(n)    # column 5 ~ column 4 (near collinear)

for tag, Xx in [("well-conditioned", Xw), ("near-collinear ", Xi)]:
    kX = np.linalg.cond(Xx); kXtX = np.linalg.cond(Xx.T @ Xx)
    print(f"{tag}: kappa(X) = {kX:.3e},  kappa(X^T X) = {kXtX:.3e}  (~ kappa(X)^2 = {kX**2:.3e})")

# The tiny singular value is what blows up X^T X:
print(f"\nsingular values of the collinear X: {np.linalg.svd(Xi, compute_uv=False).round(8)}")

# QR still solves cleanly on the ill-conditioned design.
yq = rng.standard_normal(n)
Qm, Rm = np.linalg.qr(Xi)
beta_qr = np.linalg.solve(Rm, Qm.T @ yq)
print(f"QR residual norm on collinear X = {np.linalg.norm(yq - Xi @ beta_qr):.4f}  (solves fine)")
well-conditioned: kappa(X) = 1.468e+00,  kappa(X^T X) = 2.156e+00  (~ kappa(X)^2 = 2.156e+00)
near-collinear : kappa(X) = 2.285e+08,  kappa(X^T X) = 5.893e+15  (~ kappa(X)^2 = 5.221e+16)

singular values of the collinear X: [1.45507003e+01 1.08176175e+01 9.69189333e+00 8.86428910e+00
 6.00000000e-08]
QR residual norm on collinear X = 9.6664  (solves fine)
In [24]:
# Figure: singular-value spectra -- a well-conditioned design vs a collinear one.
sv_w = np.linalg.svd(Xw, compute_uv=False)
sv_i = np.linalg.svd(Xi, compute_uv=False)
fig, ax = plt.subplots(figsize=(7.0, 4.2))
idx = np.arange(1, p + 1)
ax.semilogy(idx, sv_w, "o-", color=BLUE, lw=2, label=f"well-conditioned ($\\kappa$={sv_w[0]/sv_w[-1]:.1f})")
ax.semilogy(idx, sv_i, "s-", color=RED, lw=2, label=f"near-collinear ($\\kappa$={sv_i[0]/sv_i[-1]:.1e})")
ax.set_xticks(idx); ax.set_xlabel("singular value index"); ax.set_ylabel(r"$\sigma_i$ (log scale)")
ax.set_title("Collinearity creates a tiny singular value (huge condition number)")
ax.legend(); plt.tight_layout(); plt.show()

Exercises¶

Each mirrors a practice problem from the appendix. Work it before expanding the solution. All are reproducible (seed 42).

Exercise 1 (B.5 — hat matrix & leverage). For a synthetic design ($n=50$, $p=3$ = intercept + 2 predictors, seed 42): (a) build $\mathbf{H}$ and confirm it is symmetric, idempotent, with $\mathrm{tr}(\mathbf{H})=p$; (b) report the leverage bounds; (c) push one observation to extreme predictor values and show how its leverage $h_{ii}$ forces $\hat y_i\to y_i$; (d) verify $\mathbf{X}^\top\mathbf{e}=\mathbf{0}$.

In [25]:
# Solution 1.
rng = np.random.default_rng(42)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
H = X @ np.linalg.solve(X.T @ X, X.T)
ev = np.linalg.eigvalsh(H)
print(f"(a) symmetric={np.allclose(H, H.T)}, idempotent={np.allclose(H @ H, H)}, "
      f"trace={np.trace(H):.4f} (=p={p})")
print(f"    eigenvalues: {np.sum(np.abs(ev) < 1e-9)} near 0, {np.sum(np.abs(ev-1) < 1e-9)} near 1")

lev = np.diag(H)
print(f"(b) leverage: min={lev.min():.4f} (>= 1/n={1/n:.4f}), max={lev.max():.4f} (<= 1), "
      f"mean={lev.mean():.4f} (=p/n)")

# (c) outlier at a high-leverage location.
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + 0.5 * rng.standard_normal(n)
Xo, yo = X.copy(), y.copy()
Xo[0, 1] = 8.0; Xo[0, 2] = 8.0; yo[0] = 50.0
Ho = Xo @ np.linalg.solve(Xo.T @ Xo, Xo.T)
beta_o = np.linalg.solve(Xo.T @ Xo, Xo.T @ yo)
yhat0 = (Xo @ beta_o)[0]
print(f"(c) outlier leverage h_00={np.diag(Ho)[0]:.4f}; y_0=50.00, y_hat_0={yhat0:.4f} "
      f"(fit dragged {100*np.diag(Ho)[0]:.1f}% toward the outlier)")

# (d) orthogonality.
e = y - H @ y
print(f"(d) max|X^T e| = {np.max(np.abs(X.T @ e)):.2e}; sum(e)={np.sum(e):.2e} (orthogonal to ones)")
(a) symmetric=True, idempotent=True, trace=3.0000 (=p=3)
    eigenvalues: 47 near 0, 3 near 1
(b) leverage: min=0.0203 (>= 1/n=0.0200), max=0.1834 (<= 1), mean=0.0600 (=p/n)
(c) outlier leverage h_00=0.8254; y_0=50.00, y_hat_0=40.8597 (fit dragged 82.5% toward the outlier)
(d) max|X^T e| = 4.44e-15; sum(e)=4.44e-15 (orthogonal to ones)

Exercise 2 (B.6/B.9 — ill-conditioned ridge). With a near-collinear design ($n=100$, $p=5$, X[:,4]=X[:,3]+1e-6*noise, seed 42): (a) verify $\kappa(\mathbf{X}^\top\mathbf{X})\approx\kappa(\mathbf{X})^2$; (b) solve OLS three ways and see that $\beta_4,\beta_5$ are individually meaningless while $\beta_4+\beta_5$ is fine; (c) ridge via the SVD rescues the unstable direction.

In [26]:
# Solution 2.
rng = np.random.default_rng(42)
n, p = 100, 5
X = rng.standard_normal((n, p))
X[:, 4] = X[:, 3] + 1e-6 * rng.standard_normal(n)
beta_true = np.array([1.0, -2.0, 1.5, 0.5, -0.3])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

cX, cXtX = np.linalg.cond(X), np.linalg.cond(X.T @ X)
print(f"(a) kappa(X)={cX:.3e}, kappa(X^T X)={cXtX:.3e}, kappa(X)^2={cX**2:.3e}")

b_normal = np.linalg.inv(X.T @ X) @ X.T @ y
Qm, Rm = np.linalg.qr(X); b_qr = np.linalg.solve(Rm, Qm.T @ y)
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"(b) beta_4 estimates: normal={b_normal[3]:.1f}, qr={b_qr[3]:.1f}, lstsq={b_lstsq[3]:.1f} (wild!)")
print(f"    but beta_4+beta_5: true={beta_true[3]+beta_true[4]:.3f}, "
      f"est={b_lstsq[3]+b_lstsq[4]:.3f} (identifiable)")
print(f"    residual norms identical: {np.allclose(np.linalg.norm(y-X@b_normal), np.linalg.norm(y-X@b_lstsq))}")

U, sig, Vt = np.linalg.svd(X, full_matrices=False)
print(f"    singular values: {sig.round(8)}  (sigma_5 is ~1e-6)")
print("\n(c) ridge via SVD:")
print(f"    {'lambda':>8} {'df':>7} {'max|err|':>10}")
for lam in [0.001, 0.1, 1.0, 10.0]:
    b_r = Vt.T @ ((sig / (sig**2 + lam)) * (U.T @ y))
    print(f"    {lam:>8.3f} {np.sum(sig**2/(sig**2+lam)):>7.4f} {np.max(np.abs(b_r - beta_true)):>10.4e}")
print(f"    OLS max error was {np.max(np.abs(b_lstsq - beta_true)):.3e} -> ridge cuts it to ~0.40")
(a) kappa(X)=2.045e+06, kappa(X^T X)=4.180e+12, kappa(X)^2=4.184e+12
(b) beta_4 estimates: normal=-1261.6, qr=-1262.7, lstsq=-1262.7 (wild!)
    but beta_4+beta_5: true=0.200, est=0.196 (identifiable)
    residual norms identical: True
    singular values: [1.49647921e+01 9.96665833e+00 9.12872408e+00 8.24476672e+00
 7.32000000e-06]  (sigma_5 is ~1e-6)

(c) ridge via SVD:
      lambda      df   max|err|
       0.001  4.0000 4.0226e-01
       0.100  3.9959 4.0229e-01
       1.000  3.9592 4.0311e-01
      10.000  3.6304 4.1043e-01
    OLS max error was 1.263e+03 -> ridge cuts it to ~0.40

Exercise 3 (B.4 — block / LOOCV shortcut). The Sherman–Morrison–Woodbury formula implies the leave-one-out residual is $\tilde e_i=e_i/(1-h_{ii})$ with no refitting. Verify it against brute-force LOO ($n=80$, $p=4$, seed 42).

In [27]:
# Solution 3.
rng = np.random.default_rng(42)
n, p = 80, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([3.0, -1.0, 2.0, 0.5])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y
lev = np.diag(H)
loo_formula = e / (1 - lev)                  # SMW shortcut: one fit only

loo_brute = np.zeros(n)                       # refit n times
for i in range(n):
    Xi_, yi_ = np.delete(X, i, 0), np.delete(y, i, 0)
    b_i = np.linalg.lstsq(Xi_, yi_, rcond=None)[0]
    loo_brute[i] = y[i] - X[i] @ b_i

print(f"max |formula - brute force| = {np.max(np.abs(loo_formula - loo_brute)):.2e}  (machine precision)")
print(f"LOO-CV MSE (formula)     = {np.mean(loo_formula**2):.6f}")
print(f"LOO-CV MSE (brute force) = {np.mean(loo_brute**2):.6f}")
print(f"agreement to 1e-12: {np.allclose(loo_formula, loo_brute, atol=1e-12)}  "
      f"-> O(n p^2) instead of O(n^2 p^2)")
max |formula - brute force| = 4.36e-15  (machine precision)
LOO-CV MSE (formula)     = 0.288641
LOO-CV MSE (brute force) = 0.288641
agreement to 1e-12: True  -> O(n p^2) instead of O(n^2 p^2)

Summary & connections¶

The throughline. A matrix is a linear transformation, not just an array. Reading $\mathbf{H}\mathbf{y}$ as "project $\mathbf{y}$ onto the column space" explains least squares in one move; special structure (symmetric / idempotent / PD / stochastic) hands you shortcuts; and the four decompositions each expose a different kind of structure.

Idea Formula Where it shows up
Orthogonal projection $\mathbf{H}=\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ OLS fitted values (Ch 3)
Normal equations $\mathbf{X}^\top\mathbf{X}\hat\beta=\mathbf{X}^\top\mathbf{y}$ Advertising fit, machine-precision match
Variance of a linear map $\mathrm{Var}(\mathbf{A}\mathbf{X})=\mathbf{A}\mathrm{Var}(\mathbf{X})\mathbf{A}^\top$ $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, delta method
Eigen / spectral $\mathbf{A}=\mathbf{Q}\boldsymbol\Lambda\mathbf{Q}^\top$ PCA, Markov convergence (Ch 5)
Cholesky $\boldsymbol\Sigma=\mathbf{L}\mathbf{L}^\top$ sampling correlated normals (Ch 2)
SVD / condition number $\mathbf{X}=\mathbf{U}\boldsymbol\Sigma\mathbf{V}^\top$, $\kappa=\sigma_1/\sigma_p$ ridge shrinkage, stability
SMW rank update $(\mathbf{A}+\mathbf{u}\mathbf{v}^\top)^{-1}=\cdots$ LOOCV shortcut (Ch 4)

Practical rules that survive everything: never form $\mathbf{X}^\top\mathbf{X}$ when you can run lstsq/QR/SVD; use solve/cho_solve instead of inv; and when a covariance is near-singular, jitter it ($\boldsymbol\Sigma+\epsilon\mathbf{I}$) or regularize. These same operations are the computational backbone of every chapter in the course.