Chapter 3 — Parametric Inference¶

STAT 418 · Computational Data Science · runnable companion to the Chapter 3 webbook.

Parametric inference assumes the data were generated by a probability model $f(x \mid \theta)$ indexed by a finite parameter $\theta$, and asks: given the data, what is $\theta$, and how sure are we? This notebook works that question end-to-end on real datasets, loaded straight from the course data bucket so you can run every cell without the repo.

Learning outcomes¶

By the end you can:

  1. Put a familiar distribution in exponential-family form and read off the sufficient statistic, the natural parameter, and the mean/variance from the log-partition function.
  2. Derive and compute a maximum-likelihood estimate, its score, Fisher information, and a confidence interval — and recognize when the Wald interval fails near a boundary.
  3. Propagate uncertainty through a nonlinear transform with the delta method (vaccine efficacy).
  4. Fit and interpret a linear model (OLS): coefficients, $t$- and $F$-tests, and why classical SEs need a robust (HC3) fix under heteroskedasticity.
  5. Fit and interpret generalized linear models — logistic and Gamma — with odds ratios / mean ratios, and the diagnostics that keep them honest.

Section map (all core — nothing optional in Chapter 3)¶

§ Topic Real-data worked example
3.1 Exponential families Bernoulli & Poisson from NBA free throws
3.2 Maximum likelihood Curry's 7/7 binomial CI (Wald vs Wilson vs profile) + gradient benchmark
3.3 Sampling variability & the delta method Pfizer vaccine efficacy (VE ≈ 95%)
3.4 Linear models ISLR Advertising OLS + robust-SE coverage study
3.5 Generalized linear models Titanic logistic + Stroop Gamma GLM
3.6 Summary & connections formula table → Chapter 4 (the bootstrap)
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats, optimize
from statsmodels.stats.proportion import proportion_confint
%matplotlib inline

np.random.seed(418)            # reproducibility (STYLE.md contract)
plt.rcParams.update({
    'figure.figsize': (7.2, 4.2), 'figure.dpi': 110,
    'axes.grid': True, 'grid.alpha': 0.25,
    'axes.spines.top': False, 'axes.spines.right': False,
    'font.size': 11,
})
BLUE, RED, GREEN, ORANGE = '#4C72B0', '#C44E52', '#55A868', '#DD8452'

# Course data bucket: every dataset for Chapter 3 lives here (no repo needed).
BASE = ('https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/'
        'STAT%20418%20Images/Data/Chapter3/')
def url(name):
    return BASE + name

print('environment ready · numpy', np.__version__, '· pandas', pd.__version__)
environment ready · numpy 1.26.4 · pandas 2.3.3

Section 3.1 — Exponential Families¶

A distribution belongs to the canonical exponential family if its density (or pmf) can be written

$$ f(x \mid \theta) = h(x)\,\exp\!\big\{\, \eta(\theta)^{\!\top} T(x) - A(\eta) \,\big\}, $$

where $\eta$ is the natural parameter, $T(x)$ the sufficient statistic, and $A(\eta)$ the log-partition function (it normalizes the density). This single form unifies the Bernoulli, Poisson, Normal, Exponential, Gamma, ... and it makes two facts mechanical:

  • Sufficiency. All the information the sample carries about $\theta$ is packed into $\sum_i T(x_i)$ — you can throw the raw data away.
  • Moments from $A$. Differentiating the log-partition gives the moments of the sufficient statistic: $A'(\eta) = \mathbb{E}[T(X)]$ and $A''(\eta) = \mathrm{Var}[T(X)]$.

We make both concrete on real free-throw data: each free throw is a Bernoulli trial (made / missed), and per-game free-throw attempts are count data we can look at through the Poisson family.

In [2]:
# Real data: 2023-24 game-by-game free-throw log.
ft = pd.read_csv(url('nba_freethrows_2023_24.csv'))
curry = ft[ft['player'] == 'Stephen Curry'].reset_index(drop=True)
makes, attempts = int(curry['cum_ft'].iloc[-1]), int(curry['cum_fta'].iloc[-1])

# ---- Bernoulli as an exponential family --------------------------------
# Each free throw is x in {0,1}.  Natural param eta = logit(p);  T(x) = x;
# h(x) = 1;  A(eta) = log(1 + e^eta).  Sufficient stat for the whole sample
# is sum(x) = makes.
x = np.r_[np.ones(makes), np.zeros(attempts - makes)]   # the Bernoulli sample
phat = x.mean()
eta_hat = np.log(phat / (1 - phat))                      # natural parameter
T_sum = x.sum()                                          # sufficient statistic
A = np.log1p(np.exp(eta_hat))                            # log-partition
A_prime = np.exp(eta_hat) / (1 + np.exp(eta_hat))        # A'(eta) = mean
A_dprime = A_prime * (1 - A_prime)                       # A''(eta) = variance

print(f'Curry 2023-24 free throws: {makes}/{attempts}  ->  p_hat = {phat:.4f}')
print(f'  natural parameter   eta_hat = logit(p_hat) = {eta_hat:+.4f}')
print(f'  sufficient statistic  sum(x) = {T_sum:.0f}   (the data reduce to this one number)')
print(f"  log-partition  A(eta)   = {A:.4f}")
print(f"  A'(eta) = {A_prime:.4f}  vs  sample mean = {phat:.4f}   (match -> mean)")
print(f"  A''(eta) = {A_dprime:.4f}  vs  sample var = {x.var():.4f}   (match -> variance)")
Curry 2023-24 free throws: 299/324  ->  p_hat = 0.9228
  natural parameter   eta_hat = logit(p_hat) = +2.4816
  sufficient statistic  sum(x) = 299   (the data reduce to this one number)
  log-partition  A(eta)   = 2.5619
  A'(eta) = 0.9228  vs  sample mean = 0.9228   (match -> mean)
  A''(eta) = 0.0712  vs  sample var = 0.0712   (match -> variance)
In [3]:
# ---- Poisson as an exponential family ----------------------------------
# Per-game free-throw ATTEMPTS as counts.  eta = log(lambda);  T(x) = x;
# h(x) = 1/x!;  A(eta) = e^eta = lambda.  Then A'(eta) = A''(eta) = lambda,
# i.e. the Poisson mean == variance (a testable assumption).
counts = curry['fta'].to_numpy()
lam = counts.mean()
eta_p = np.log(lam)
print(f'Per-game FT attempts: n = {len(counts)} games,  lambda_hat = mean = {lam:.3f}')
print(f"  eta_hat = log(lambda) = {eta_p:+.4f},  sufficient statistic sum(x) = {counts.sum()}")
print(f"  A'(eta) = e^eta = {np.exp(eta_p):.3f}  (predicted mean = variance)")
print(f'  sample variance = {counts.var(ddof=1):.3f}  ->  dispersion ratio = '
      f'{counts.var(ddof=1)/lam:.2f}')
print('  (ratio > 1 = mild OVERdispersion: real attempt counts are a bit more')
print('   variable than Poisson predicts -- exactly the check the family makes easy.)')
Per-game FT attempts: n = 74 games,  lambda_hat = mean = 4.378
  eta_hat = log(lambda) = +1.4767,  sufficient statistic sum(x) = 324
  A'(eta) = e^eta = 4.378  (predicted mean = variance)
  sample variance = 9.499  ->  dispersion ratio = 2.17
  (ratio > 1 = mild OVERdispersion: real attempt counts are a bit more
   variable than Poisson predicts -- exactly the check the family makes easy.)
In [4]:
# Figure: the log-partition A(eta)=log(1+e^eta) for the Bernoulli family.
# Its SLOPE at eta_hat equals p_hat -- 'first derivative of A is the mean'.
etas = np.linspace(-3, 3, 400)
Acurve = np.log1p(np.exp(etas))
fig, ax = plt.subplots()
ax.plot(etas, Acurve, color=BLUE, lw=2.2, label=r'$A(\eta)=\log(1+e^{\eta})$')
tangent = A + A_prime * (etas - eta_hat)
ax.plot(etas, tangent, '--', color=RED, lw=1.6,
        label=fr"tangent slope $A'(\eta)=\hat p={phat:.3f}$")
ax.plot([eta_hat], [A], 'o', color=RED, ms=8)
ax.annotate(fr'$\hat\eta={eta_hat:.2f}$', (eta_hat, A), textcoords='offset points',
            xytext=(10, -18), color=RED)
ax.set_xlabel(r'natural parameter $\eta$'); ax.set_ylabel(r'$A(\eta)$')
ax.set_title('Bernoulli log-partition: its slope is the mean')
ax.legend(loc='upper left'); plt.tight_layout(); plt.show()

Read-out. The free-throw record (hundreds of attempts) collapsed to a single sufficient statistic $\sum_i x_i = $ makes, and the mean fell out of the slope of $A$ rather than a separate calculation. The Poisson view added a built-in diagnostic: mean $=$ variance is an assumption, and the attempt counts are mildly overdispersed — a flag we will respect later when we model counts. This machinery (sufficient statistic, natural parameter, $A$ and its derivatives) is exactly what makes maximum likelihood, GLMs, and conjugate Bayesian updating tick in the rest of the chapter.

Section 3.2 — Maximum Likelihood Estimation¶

Given data and a model, the likelihood $L(\theta) = \prod_i f(x_i\mid\theta)$ measures how well each $\theta$ explains what we saw; the MLE $\hat\theta = \arg\max_\theta \log L(\theta)$ is the best-supported value. Two derived objects drive all the inference:

  • Score $\;U(\theta) = \dfrac{\partial}{\partial\theta}\log L(\theta)$ — zero at the MLE.
  • Fisher information $\;I(\theta) = -\mathbb{E}\!\left[\dfrac{\partial^2}{\partial\theta^2}\log L\right]$ — the curvature of the log-likelihood; its inverse is the asymptotic variance, so $\widehat{\mathrm{SE}}(\hat\theta) = 1/\sqrt{I(\hat\theta)}$.

For a Binomial$(n,p)$ count $x$: $\hat p = x/n$, $\;I(p) = n/[p(1-p)]$, so $\widehat{\mathrm{SE}}(\hat p) = \sqrt{\hat p(1-\hat p)/n}$. The Wald interval $\hat p \pm z\,\widehat{\mathrm{SE}}$ is the default — but it collapses at a boundary. Curry's opening game of 2023-24 is a real boundary case: 7 for 7.

In [5]:
# Curry's opening game: a naturally occurring boundary proportion p_hat = 1.
g1 = curry.iloc[0]
x, n = int(g1['ft']), int(g1['fta'])
phat1 = x / n
print(f"Curry opening game {g1['date']}: ft = {x} / fta = {n}  ->  p_hat = {phat1:.2f}")

# (1) WALD -- normal approximation; degenerates because SE = 0 at the boundary.
se = np.sqrt(phat1 * (1 - phat1) / n)
z = stats.norm.ppf(0.975)
wald = (max(0.0, phat1 - z * se), min(1.0, phat1 + z * se))
print(f'  Wald     95% CI: [{wald[0]:.3f}, {wald[1]:.3f}]   '
      f'(SE = {se:.3f} -> ZERO width; a useless "exactly 1")')

# (2) WILSON (score) interval -- inverts the score test; stays inside [0,1].
wil = proportion_confint(x, n, alpha=0.05, method='wilson')
print(f'  Wilson   95% CI: [{wil[0]:.3f}, {wil[1]:.3f}]   (score interval, honest width)')

# (3) PROFILE / LRT interval -- invert the likelihood-ratio test.
# At x = n the closed form lower bound is exp(-chi2_{1,.95} / (2x)).
c = stats.chi2.ppf(0.95, 1)
grid = np.linspace(1e-6, 1.0, 200001)
ll = x * np.log(grid)                      # log-lik at x=n reduces to x*log(p)
inside = grid[2 * (0.0 - ll) <= c]         # ll_max = 0 at p_hat = 1
prof = (inside.min(), inside.max())
print(f'  Profile  95% CI: [{prof[0]:.3f}, {prof[1]:.3f}]   '
      f'(closed form lower = e^(-chi2/2x) = {np.exp(-c/(2*x)):.3f})')
Curry opening game 2023-10-24: ft = 7 / fta = 7  ->  p_hat = 1.00
  Wald     95% CI: [1.000, 1.000]   (SE = 0.000 -> ZERO width; a useless "exactly 1")
  Wilson   95% CI: [0.646, 1.000]   (score interval, honest width)
  Profile  95% CI: [0.760, 1.000]   (closed form lower = e^(-chi2/2x) = 0.760)
In [6]:
# Figure: the binomial log-likelihood for 7/7, with the three intervals.
p = np.linspace(0.3, 0.9999, 500)
loglik = x * np.log(p)            # at x = n the log-lik reduces to x*log(p)
fig, ax = plt.subplots()
ax.plot(p, loglik, color=BLUE, lw=2.2)
ax.axvline(phat1, color='k', lw=1, ls=':', label=fr'MLE $\hat p={phat1:.0f}$')
ax.axhline(-c/2, color=GREEN, lw=1.2, ls='--', label=r'LRT cutoff $-\chi^2_{1,.95}/2$')
for (lo, hi), col, name, yo in [(wald, RED, 'Wald', 0.0),
                                (wil, ORANGE, 'Wilson', -1.1),
                                (prof, GREEN, 'Profile', -2.2)]:
    ax.plot([lo, hi], [yo - 2, yo - 2], color=col, lw=4, solid_capstyle='butt')
    ax.plot([lo], [yo - 2], '|', color=col, ms=14, mew=2)
    ax.text(lo - 0.01, yo - 2, name, ha='right', va='center', color=col, fontsize=9)
ax.set_xlabel(r'free-throw probability $p$'); ax.set_ylabel('log-likelihood')
ax.set_title('Binomial log-likelihood for 7/7 — the Wald interval collapses')
ax.set_ylim(-6, 0.5); ax.legend(loc='lower left'); plt.tight_layout(); plt.show()

Read-out. With $\hat p = 1$ the Wald SE is literally zero, so its interval is the absurd $[1,1]$ — it claims certainty from 7 attempts. The Wilson and profile-likelihood intervals respect $[0,1]$ and report honest lower bounds ($0.646$ and $0.760$). Rule: never use the Wald interval for a proportion near 0 or 1; prefer Wilson or the profile/LRT interval, which invert a test instead of trusting a normal approximation.

When the MLE has no closed form: supply the gradient¶

Most MLEs (logistic regression, Gamma GLMs, ...) have no closed form, so we hand the negative log-likelihood to a numerical optimizer. A quasi-Newton method (BFGS) needs the gradient at every step. If we don't supply it, SciPy approximates it by finite differences — roughly $p+1$ extra objective evaluations per gradient. An exact gradient (analytic, or via autodiff such as jax.grad) is handed straight in. The benchmark below (logistic MLE, $n=400$, $p=15$, seed 0) shows the cost.

In [7]:
# Gradient-evaluation benchmark (scripts/ch3/07).  Reproducible, seed = 0.
rng = np.random.default_rng(0)
Nb, Pb = 400, 15
Xb = np.column_stack([np.ones(Nb), rng.normal(size=(Nb, Pb - 1))])
beta_true = 0.5 * rng.normal(size=Pb)
yb = (rng.uniform(size=Nb) < 1.0 / (1.0 + np.exp(-(Xb @ beta_true)))).astype(float)

def nll(b):
    z = Xb @ b
    return float(np.sum(np.log1p(np.exp(z)) - yb * z))
def grad(b):
    return Xb.T @ (1.0 / (1.0 + np.exp(-(Xb @ b))) - yb)

fd = optimize.minimize(nll, np.zeros(Pb), method='BFGS')              # finite differences
an = optimize.minimize(nll, np.zeros(Pb), method='BFGS', jac=grad)    # exact gradient
print(f'Logistic MLE (n={Nb}, p={Pb}), BFGS, seed=0')
print(f'  finite-difference gradient (jac=None): nfev = {fd.nfev}')
print(f'  exact gradient (analytic / autodiff)  : nfev = {an.nfev}')
print(f'  -> {fd.nfev / an.nfev:.0f}x fewer objective evaluations '
      f'(same optimum: max|beta diff| = {np.max(np.abs(fd.x - an.x)):.1e})')
Logistic MLE (n=400, p=15), BFGS, seed=0
  finite-difference gradient (jac=None): nfev = 688
  exact gradient (analytic / autodiff)  : nfev = 43
  -> 16x fewer objective evaluations (same optimum: max|beta diff| = 3.3e-08)

Section 3.3 — Sampling Variability and the Delta Method¶

An estimator is a random variable: drawn on a new sample it lands somewhere else. We summarize that spread with the standard error. But we often care about a nonlinear function $g(\hat\theta)$ of our estimates — and a function of a random variable has its own, harder-to-find variance. The delta method linearizes $g$ with a first-order Taylor expansion around the truth:

$$ \mathrm{Var}\big(g(\hat\theta)\big) \;\approx\; \nabla g(\theta)^{\!\top}\, \mathrm{Var}(\hat\theta)\,\nabla g(\theta). $$

The chapter's payoff application is vaccine efficacy, $\mathrm{VE} = 1 - p_v/p_p$ — a ratio of two estimated rates, so plainly nonlinear. We attach it to the real Pfizer-BioNTech BNT162b2 Phase-3 primary endpoint (Polack et al., NEJM 2020): 8 COVID-19 cases among 18,198 vaccine recipients vs 162 among 18,325 placebo recipients.

In [8]:
# Pfizer-BioNTech BNT162b2 primary efficacy endpoint (Polack et al. 2020).
cases_v, n_v = 8, 18198
cases_p, n_p = 162, 18325
p_v, p_p = cases_v / n_v, cases_p / n_p
rr = p_v / p_p
ve = 1 - rr
print(f'  p_vaccine = {p_v:.6f}   p_placebo = {p_p:.6f}')
print(f'  VE = 1 - p_v/p_p = {ve:.4f}  ({100*ve:.1f}%)')

# --- delta method, written out explicitly ---
# g(p_v, p_p) = 1 - p_v/p_p.  Gradient:
dg_dpv = -1 / p_p                 # partial wrt p_v
dg_dpp = p_v / p_p**2             # partial wrt p_p
var_pv = p_v * (1 - p_v) / n_v    # binomial variances of the two rate estimates
var_pp = p_p * (1 - p_p) / n_p
var_ve = dg_dpv**2 * var_pv + dg_dpp**2 * var_pp   # delta method (independent arms)
se_ve = np.sqrt(var_ve)
se_closed = rr * np.sqrt((1 - p_v)/(n_v*p_v) + (1 - p_p)/(n_p*p_p))  # algebra check
print(f'  d(VE)/dp_v = {dg_dpv:.2f},  d(VE)/dp_p = {dg_dpp:.2f}')
print(f'  delta-method SE(VE) = {se_ve:.4f}   (closed-form check = {se_closed:.4f})')
z = stats.norm.ppf(0.975)
wald_ve = (ve - z*se_ve, ve + z*se_ve)
print(f'  Wald 95% CI: ({wald_ve[0]:.3f}, {wald_ve[1]:.3f})   <- symmetric, normal-based')
  p_vaccine = 0.000440   p_placebo = 0.008840
  VE = 1 - p_v/p_p = 0.9503  (95.0%)
  d(VE)/dp_v = -113.12,  d(VE)/dp_p = 5.63
  delta-method SE(VE) = 0.0180   (closed-form check = 0.0180)
  Wald 95% CI: (0.915, 0.986)   <- symmetric, normal-based
In [9]:
# The published Pfizer interval is BAYESIAN, not the Wald delta interval.
# Pre-specified analysis: theta = p_v/(p_v+p_p), prior Beta(0.700102, 1),
# posterior Beta(0.700102 + cases_v, 1 + cases_p);  VE = (1 - 2 theta)/(1 - theta).
post = stats.beta(0.700102 + cases_v, 1.0 + cases_p)
ve_of_theta = lambda th: (1 - 2*th) / (1 - th)        # decreasing in theta
cred = (ve_of_theta(post.ppf(0.975)), ve_of_theta(post.ppf(0.025)))
print(f'  Bayesian beta-binomial 95% CREDIBLE interval: '
      f'({cred[0]:.3f}, {cred[1]:.3f})  = the published (90.3%, 97.6%)')
print('  The HEADLINE interval is Bayesian; with only 8 vaccine-arm events the')
print('  symmetric Wald lower bound (~0.915) is optimistic vs the published 0.903.')
  Bayesian beta-binomial 95% CREDIBLE interval: (0.904, 0.976)  = the published (90.3%, 97.6%)
  The HEADLINE interval is Bayesian; with only 8 vaccine-arm events the
  symmetric Wald lower bound (~0.915) is optimistic vs the published 0.903.
In [10]:
# Figure: VE point estimate with the Wald (delta) vs Bayesian credible interval.
fig, ax = plt.subplots(figsize=(7.2, 2.6))
ax.plot([wald_ve[0], wald_ve[1]], [1, 1], color=BLUE, lw=6, solid_capstyle='butt',
        label='Wald (delta-method)')
ax.plot([cred[0], cred[1]], [0.6, 0.6], color=ORANGE, lw=6, solid_capstyle='butt',
        label='Bayesian credible (published)')
ax.plot([ve], [0.8], 'D', color=RED, ms=10, label=fr'VE point = {100*ve:.1f}%')
for v in (wald_ve[0], wald_ve[1], cred[0], cred[1]):
    ax.annotate(f'{100*v:.1f}%', (v, 1 if v in wald_ve else 0.6),
                textcoords='offset points', xytext=(0, 8 if v in wald_ve else -16),
                ha='center', fontsize=9)
ax.set_yticks([]); ax.set_ylim(0.3, 1.4); ax.set_xlim(0.88, 1.0)
ax.set_xlabel('vaccine efficacy'); ax.set_title('Pfizer VE: two 95% intervals disagree')
ax.legend(loc='lower left', fontsize=9); plt.tight_layout(); plt.show()

Read-out. The delta method turned two binomial counts into a standard error for a ratio: $\mathrm{VE} = 95.0\%$ with $\widehat{\mathrm{SE}} \approx 0.018$ and a Wald 95% CI of about $(0.915,\,0.986)$. The famous published interval $(90.3\%, 97.6\%)$ is a Bayesian beta-binomial credible interval, not the Wald delta interval — and with only 8 vaccine-arm events the two genuinely differ, the symmetric Wald lower bound being optimistic. Lesson: the delta method is fast and general, but near sparse counts a likelihood- or Bayes-based interval is more honest (and Chapter 4's bootstrap is a third route to the same SE).

Section 3.4 — Linear Models (OLS)¶

The linear model $y = X\beta + \varepsilon$ fit by ordinary least squares gives $\hat\beta = (X^\top X)^{-1}X^\top y$, the best linear unbiased estimator under the classical assumptions. Each coefficient is partial: the effect of one predictor holding the others fixed. We test $H_0\!: \beta_j = 0$ with a $t$-statistic $t_j = \hat\beta_j / \widehat{\mathrm{SE}}(\hat\beta_j)$, and compare nested models with an $F$-test (with $F = t^2$ for a single restriction).

We use the ISLR Advertising data: Sales (thousands of units) regressed on TV, Radio, and Newspaper budgets across $n=200$ markets. The full multiple regression contains a memorable real fail-to-reject: Newspaper is significant on its own but vanishes once Radio is in the model.

In [11]:
adv = pd.read_csv(url('Advertising.csv'))
full = smf.ols('Sales ~ TV + Radio + Newspaper', data=adv).fit()
print(f'Full OLS  Sales ~ TV + Radio + Newspaper   (n = {int(full.nobs)})')
tab = pd.DataFrame({'coef': full.params, 'std_err': full.bse,
                    't': full.tvalues, 'p_value': full.pvalues})
print(tab.round(4).to_string())
print(f'\n  R^2 = {full.rsquared:.4f}   adj R^2 = {full.rsquared_adj:.4f}'
      f'   F = {full.fvalue:.1f} on (df {int(full.df_model)}, {int(full.df_resid)})')
print(f'  >> TV t-statistic in the FULL model = {full.tvalues["TV"]:.2f}  '
      f'(the partial effect, NOT the simple-regression value)')
Full OLS  Sales ~ TV + Radio + Newspaper   (n = 200)
             coef  std_err        t  p_value
Intercept  2.9389   0.3119   9.4223   0.0000
TV         0.0458   0.0014  32.8086   0.0000
Radio      0.1885   0.0086  21.8935   0.0000
Newspaper -0.0010   0.0059  -0.1767   0.8599

  R^2 = 0.8972   adj R^2 = 0.8956   F = 570.3 on (df 3, 196)
  >> TV t-statistic in the FULL model = 32.81  (the partial effect, NOT the simple-regression value)
In [12]:
# Individual t-tests and the nested F-tests.
print('Individual t-tests (H0: beta_j = 0):')
for v in ('TV', 'Radio', 'Newspaper'):
    verdict = 'fail to reject' if full.pvalues[v] > 0.05 else 'REJECT'
    print(f'  {v:9s}: t = {full.tvalues[v]:+6.2f}   p = {full.pvalues[v]:.4f}   -> {verdict}')

simple_tv = smf.ols('Sales ~ TV', data=adv).fit()
joint = full.f_test('Radio = 0, Newspaper = 0')
drop_news = full.f_test('Newspaper = 0')
print(f'\nNested F-tests:')
print(f'  reduced Sales~TV         R^2 = {simple_tv.rsquared:.3f}')
print(f'  full    Sales~TV+Rad+New R^2 = {full.rsquared:.3f}')
print(f'  joint H0: beta_Radio = beta_News = 0  ->  F = {float(joint.fvalue):.1f}  '
      f'p = {float(joint.pvalue):.1e}')
print(f'  single H0: beta_News = 0  ->  F = {float(drop_news.fvalue):.4f} '
      f'= t^2 = {full.tvalues["Newspaper"]**2:.4f}   (F = t^2 identity)')

# Confounding hook: Newspaper is significant ALONE, vanishes in the full model.
simple_news = smf.ols('Sales ~ Newspaper', data=adv).fit()
print(f'\nConfounding hook (holding the others fixed):')
print(f'  Newspaper ALONE: t = {simple_news.tvalues["Newspaper"]:.2f} '
      f'(p = {simple_news.pvalues["Newspaper"]:.4f}, significant!)')
print(f'  Newspaper in FULL model: t = {full.tvalues["Newspaper"]:.2f} (vanishes)')
print(f'  corr(Radio, Newspaper) = {adv["Radio"].corr(adv["Newspaper"]):.3f} '
      f'-> Newspaper rode on Radio')
Individual t-tests (H0: beta_j = 0):
  TV       : t = +32.81   p = 0.0000   -> REJECT
  Radio    : t = +21.89   p = 0.0000   -> REJECT
  Newspaper: t =  -0.18   p = 0.8599   -> fail to reject

Nested F-tests:
  reduced Sales~TV         R^2 = 0.612
  full    Sales~TV+Rad+New R^2 = 0.897
  joint H0: beta_Radio = beta_News = 0  ->  F = 272.0  p = 2.8e-57
  single H0: beta_News = 0  ->  F = 0.0312 = t^2 = 0.0312   (F = t^2 identity)

Confounding hook (holding the others fixed):
  Newspaper ALONE: t = 3.30 (p = 0.0011, significant!)
  Newspaper in FULL model: t = -0.18 (vanishes)
  corr(Radio, Newspaper) = 0.354 -> Newspaper rode on Radio
In [13]:
# Figure: |t| for each coefficient; the dashed line is the |t|~1.97 (5%) threshold.
fig, ax = plt.subplots(figsize=(7.2, 3.4))
vars_ = ['TV', 'Radio', 'Newspaper']
tvals = [abs(full.tvalues[v]) for v in vars_]
cols = [GREEN if abs(full.tvalues[v]) > stats.t.ppf(0.975, full.df_resid) else RED for v in vars_]
bars = ax.bar(vars_, tvals, color=cols, edgecolor='white')
ax.axhline(stats.t.ppf(0.975, full.df_resid), color='k', ls='--', lw=1,
           label=r'$|t|$ = 1.97 (reject at 5%)')
for b, t in zip(bars, tvals):
    ax.text(b.get_x()+b.get_width()/2, t+0.6, f'{t:.1f}', ha='center', fontsize=10)
ax.set_ylabel(r'$|t|$ statistic'); ax.set_title('Advertising: TV & Radio dominate; Newspaper is noise')
ax.legend(); plt.tight_layout(); plt.show()

Robust (HC3) standard errors under heteroskedasticity¶

The classical $t$- and $F$-tests assume constant error variance. When that fails (heteroskedasticity), the classical SEs are biased and the 95% CIs under-cover. The HC3 sandwich SE fixes this. Because real data never reveal the true $\beta$, we verify coverage with a small Monte-Carlo study (a known truth $\beta_1 = 3$, strongly heteroskedastic errors, 10,000 fits, seed 42) — exactly the slide's design.

In [14]:
# Coverage study (scripts/ch3/06): OLS vs HC3 95% CI coverage of the true slope.
rng = np.random.default_rng(42)
B, N, BETA = 10_000, 100, np.array([2.0, 3.0])   # intercept, slope
tcrit = stats.t.ppf(0.975, N - 2)
cover_ols = cover_hc3 = 0
for _ in range(B):
    x = rng.uniform(0.0, 3.0, N)
    sd = (0.5 + x) ** 2                              # heteroskedastic: SD grows with x
    y = BETA[0] + BETA[1] * x + rng.normal(0.0, sd)
    X = np.column_stack([np.ones(N), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    s2 = (e @ e) / (N - 2)
    se_ols = np.sqrt(s2 * XtX_inv[1, 1])
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverages
    meat = X.T @ (X * (e**2 / (1 - h)**2)[:, None])
    se_hc3 = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])
    cover_ols += abs(beta[1] - BETA[1]) <= tcrit * se_ols
    cover_hc3 += abs(beta[1] - BETA[1]) <= tcrit * se_hc3
cov_ols, cov_hc3 = 100*cover_ols/B, 100*cover_hc3/B
print(f'DGP: y = 2 + 3x + e, sigma=(0.5+x)^2, x~U(0,3); n={N}, B={B:,}, seed=42')
print(f'  classical OLS SE : 95% CI coverage of beta_1 = {cov_ols:.1f}%')
print(f'  HC3 sandwich SE  : 95% CI coverage of beta_1 = {cov_hc3:.1f}%')
print(f'  -> OLS under-covers by {95.0 - cov_ols:.1f} pp; HC3 restores near-nominal coverage.')
DGP: y = 2 + 3x + e, sigma=(0.5+x)^2, x~U(0,3); n=100, B=10,000, seed=42
  classical OLS SE : 95% CI coverage of beta_1 = 90.0%
  HC3 sandwich SE  : 95% CI coverage of beta_1 = 94.7%
  -> OLS under-covers by 5.0 pp; HC3 restores near-nominal coverage.
In [15]:
# Figure: coverage bars vs the 95% nominal target.
fig, ax = plt.subplots(figsize=(6.2, 3.6))
bars = ax.bar(['classical OLS', 'HC3 robust'], [cov_ols, cov_hc3],
              color=[RED, GREEN], edgecolor='white', width=0.6)
ax.axhline(95, color='k', ls='--', lw=1.2, label='nominal 95%')
for b, v in zip(bars, [cov_ols, cov_hc3]):
    ax.text(b.get_x()+b.get_width()/2, v+0.15, f'{v:.1f}%', ha='center', fontsize=11)
ax.set_ylim(88, 96); ax.set_ylabel('actual 95%-CI coverage')
ax.set_title('Heteroskedasticity: HC3 restores coverage'); ax.legend()
plt.tight_layout(); plt.show()

Read-out. In the full Advertising model TV is overwhelmingly significant ($t \approx 32.8$), Radio strongly so ($t \approx 21.9$), and Newspaper is indistinguishable from noise ($t \approx -0.18$) — even though Newspaper alone looks significant ($t \approx 3.3$), because it is correlated with Radio ($r \approx 0.35$). That is the whole point of "holding the others fixed." And under heteroskedasticity the classical 95% CI covered the truth only 90.0% of the time while HC3 restored it to 94.7% — when in doubt, report robust SEs.

Section 3.5 — Generalized Linear Models¶

A GLM keeps the linear predictor $\eta = X\beta$ but (1) lets the response follow any exponential-family distribution and (2) connects its mean to $\eta$ through a link function $g(\mu) = \eta$. Two workhorses:

  • Logistic regression (binary $y$, logit link): $\exp(\beta_j)$ is an odds ratio.
  • Gamma regression (positive, skewed $y$, log link): $\exp(\beta_j)$ is a mean ratio.

We fit both on real data: survival on the Titanic, and reaction times in the Stroop task.

Logistic regression — Titanic survival¶

First, why not just use OLS on a 0/1 outcome? The linear probability model fits a straight line to a probability, so it cheerfully predicts values outside $[0,1]$. Then we fit the proper logistic GLM and read off odds ratios.

In [16]:
tit = pd.read_csv(url('titanic.csv'))
for c in ('Survived', 'Pclass', 'Age', 'Fare'):
    tit[c] = pd.to_numeric(tit[c], errors='coerce')
n_miss = int(tit['Age'].isna().sum())
print(f'Titanic: n = {len(tit)} passengers, Age missing = {n_miss} ({100*n_miss/len(tit):.1f}%)')

# Why OLS fails on binary y: the linear probability model predicts outside [0,1].
lpm1 = smf.ols('Survived ~ Fare', data=tit).fit()
f1 = lpm1.fittedvalues
print(f'  LPM Survived~Fare        : max p_hat = {f1.max():.2f}, #(p_hat>1) = {(f1>1).sum()}')
tit_age = tit.dropna(subset=['Age']).copy()
lpm2 = smf.ols('Survived ~ Pclass + Age + Fare', data=tit_age).fit()
f2 = lpm2.fittedvalues
print(f'  LPM Survived~Pclass+Age+Fare: min p_hat = {f2.min():.2f}, #(p_hat<0) = {(f2<0).sum()}'
      f'  (n = {len(tit_age)})')
Titanic: n = 891 passengers, Age missing = 177 (19.9%)
  LPM Survived~Fare        : max p_hat = 1.59, #(p_hat>1) = 3
  LPM Survived~Pclass+Age+Fare: min p_hat = -0.15, #(p_hat<0) = 6  (n = 714)
In [17]:
# Logistic GLM (statsmodels uses FEMALE as the reference category).
logit = smf.logit('Survived ~ C(Sex) + Pclass + Age', data=tit_age).fit(disp=0)
b_male = logit.params['C(Sex)[T.male]']
OR = pd.Series({'female vs male': np.exp(-b_male),
                'per Pclass level': np.exp(logit.params['Pclass']),
                'per year of Age': np.exp(logit.params['Age'])})
print(f'Logistic  Survived ~ Sex + Pclass + Age   (n = {int(logit.nobs)}, converged = '
      f"{logit.mle_retvals['converged']})")
print(f'  beta_male = {b_male:+.3f}  beta_Pclass = {logit.params["Pclass"]:+.3f}  '
      f'beta_Age = {logit.params["Age"]:+.4f}')
print('  Adjusted odds ratios exp(beta):')
for k, v in OR.items():
    print(f'    OR {k:18s} = {v:.3f}')
print('  Caveat: Age is ~20% missing and NOT MCAR, so listwise deletion (714 of 891)')
print('  adds mild selection bias; the data are observational -> odds ratios are associational.')
Logistic  Survived ~ Sex + Pclass + Age   (n = 714, converged = True)
  beta_male = -2.522  beta_Pclass = -1.289  beta_Age = -0.0369
  Adjusted odds ratios exp(beta):
    OR female vs male     = 12.455
    OR per Pclass level   = 0.276
    OR per year of Age    = 0.964
  Caveat: Age is ~20% missing and NOT MCAR, so listwise deletion (714 of 891)
  adds mild selection bias; the data are observational -> odds ratios are associational.
In [18]:
# Figure: fitted survival probability vs age, by sex, at 2nd class.
ages = np.linspace(tit_age['Age'].min(), tit_age['Age'].max(), 100)
fig, ax = plt.subplots()
for sex, col in [('female', RED), ('male', BLUE)]:
    grid = pd.DataFrame({'Sex': sex, 'Pclass': 2, 'Age': ages})
    ax.plot(ages, logit.predict(grid), color=col, lw=2.4, label=f'{sex} (2nd class)')
ax.scatter(tit_age['Age'], tit_age['Survived'] + np.random.uniform(-0.03, 0.03, len(tit_age)),
           s=8, alpha=0.12, color='gray')
ax.set_xlabel('Age (years)'); ax.set_ylabel('P(survive)')
ax.set_title('Titanic logistic fit: women far likelier to survive'); ax.legend()
plt.tight_layout(); plt.show()

Read-out. OLS on the 0/1 outcome produced impossible probabilities (above 1 with Fare alone, below 0 in the multiple model) — structural failure, not bad luck. The logistic GLM is well-behaved: adjusting for class and age, a female passenger's odds of survival were $\approx 12.5\times$ a male's, each step down in class multiplied the odds by $\approx 0.28$, and each year of age by $\approx 0.96$. Odds ratios, not raw coefficients, are the interpretable currency of logistic regression.

Gamma regression — Stroop reaction times¶

Reaction times are strictly positive and right-skewed — a textbook Gamma response. With a log link, $\exp(\beta)$ is a mean ratio. The Stroop data have 50 subjects and ~3000 trials; the incongruent condition (word/color mismatch) should be slower. We also surface two diagnostics the example must not hide: the Gamma family's constant-CV assumption, and the within-subject dependence of repeated trials.

In [19]:
st = pd.read_csv(url('stroop_dataset.csv'))
st['RT'] = pd.to_numeric(st['RT'], errors='coerce')
st['subj'] = pd.to_numeric(st['subj'], errors='coerce')
st = st.dropna(subset=['RT', 'subj', 'condition'])
print(f"Stroop: {st['subj'].nunique()} subjects, {len(st)} trials")

# Diagnostic 1: constant coefficient of variation (CV = sd/mean) -- VIOLATED here.
print('  per-condition mean / sd / CV  (Gamma assumes CV constant):')
for cond, s in st.groupby('condition')['RT']:
    print(f'    {cond:12s} mean = {s.mean():6.1f} ms   sd = {s.std():6.1f}   '
          f'CV = {s.std()/s.mean():.3f}   n = {len(s)}')

# Gamma GLM with log link -> exp(beta) is a mean ratio (Congruent = reference).
gamma = sm.families.Gamma(link=sm.families.links.Log())
fit = smf.glm('RT ~ C(condition)', data=st, family=gamma).fit()
b = fit.params['C(condition)[T.Incongruent]']
print(f'\nGamma GLM  RT ~ condition:  beta_Incongruent = {b:.4f}  ->  '
      f'mean ratio exp(beta) = {np.exp(b):.3f}  (~{(np.exp(b)-1)*100:.0f}% slower)')
Stroop: 50 subjects, 3058 trials
  per-condition mean / sd / CV  (Gamma assumes CV constant):
    Congruent    mean =  563.1 ms   sd =  197.7   CV = 0.351   n = 1032
    Incongruent  mean =  618.8 ms   sd =  325.2   CV = 0.526   n = 2026

Gamma GLM  RT ~ condition:  beta_Incongruent = 0.0944  ->  mean ratio exp(beta) = 1.099  (~10% slower)
In [20]:
# Diagnostic 2: within-subject dependence. Because `condition` varies WITHIN each
# subject, cluster-robust SEs barely move the CONDITION effect -- the dependence
# inflates the INTERCEPT (grand-mean) SE instead.  Do NOT claim the condition SE
# is understated.
fit_cl = smf.glm('RT ~ C(condition)', data=st, family=gamma).fit(
    cov_type='cluster', cov_kwds={'groups': st['subj']})
ck = 'C(condition)[T.Incongruent]'
print('Within-subject dependence (repeated trials per subject):')
print(f'  SE(condition): naive = {fit.bse[ck]:.4f}, cluster = {fit_cl.bse[ck]:.4f}  '
      f'(ratio {fit_cl.bse[ck]/fit.bse[ck]:.2f} -- WITHIN-subject, barely changes)')
print(f'  SE(intercept): naive = {fit.bse["Intercept"]:.4f}, '
      f'cluster = {fit_cl.bse["Intercept"]:.4f}  '
      f'(ratio {fit_cl.bse["Intercept"]/fit.bse["Intercept"]:.2f} -- the grand-mean SE inflates)')
Within-subject dependence (repeated trials per subject):
  SE(condition): naive = 0.0181, cluster = 0.0178  (ratio 0.98 -- WITHIN-subject, barely changes)
  SE(intercept): naive = 0.0148, cluster = 0.0189  (ratio 1.28 -- the grand-mean SE inflates)
In [21]:
# Figure: RT distributions by condition (positive, right-skewed -> Gamma).
fig, ax = plt.subplots()
for cond, col in [('Congruent', GREEN), ('Incongruent', ORANGE)]:
    s = st.loc[st['condition'] == cond, 'RT']
    ax.hist(s, bins=40, range=(0, 2000), alpha=0.55, color=col, density=True,
            label=f'{cond}  (mean {s.mean():.0f} ms)')
    ax.axvline(s.mean(), color=col, lw=2, ls='--')
ax.set_xlabel('reaction time (ms)'); ax.set_ylabel('density')
ax.set_title('Stroop RTs: positive, right-skewed, incongruent slower'); ax.legend()
plt.tight_layout(); plt.show()

Read-out. The Gamma GLM says incongruent trials are about 10% slower in the mean ($\exp(\hat\beta) \approx 1.10$) — the classic Stroop interference effect, read directly as a mean ratio. But honesty requires the diagnostics: the constant-CV assumption is violated (CV $\approx 0.35$ vs $0.53$), and although trials are not iid, the condition effect is estimated within subjects, so its SE is essentially unchanged by clustering ($\times 0.98$); the dependence inflates the intercept SE instead ($\times 1.28$). A common mistake is to claim clustering widens the condition SE — here it does not.

Exercises¶

Work each before expanding the solution. All use the same real datasets.

Exercise 1 (§3.1 / §3.2). Curry's full-season free-throw record is far from a boundary. Compute $\hat p$, the Fisher-information-based SE, and the Wald 95% CI, and confirm it now nearly coincides with the Wilson interval (so the boundary pathology has disappeared). Why?

In [22]:
# Solution 1.
k, m = makes, attempts                     # full-season makes / attempts (from 3.1)
phat = k / m
I = m / (phat * (1 - phat))                # Fisher information for Binomial(m, p)
se = 1 / np.sqrt(I)                         # = sqrt(p(1-p)/m)
wald = (phat - 1.96*se, phat + 1.96*se)
wil = proportion_confint(k, m, alpha=0.05, method='wilson')
print(f'  full season {k}/{m}: p_hat = {phat:.4f}, SE = {se:.4f}')
print(f'  Wald   95% CI: [{wald[0]:.4f}, {wald[1]:.4f}]')
print(f'  Wilson 95% CI: [{wil[0]:.4f}, {wil[1]:.4f}]')
print('  They nearly coincide: large m and p_hat well inside (0,1) make the normal')
print('  approximation accurate -- the Wald pathology was specific to the n=7 boundary.')
  full season 299/324: p_hat = 0.9228, SE = 0.0148
  Wald   95% CI: [0.8938, 0.9519]
  Wilson 95% CI: [0.8886, 0.9472]
  They nearly coincide: large m and p_hat well inside (0,1) make the normal
  approximation accurate -- the Wald pathology was specific to the n=7 boundary.

Exercise 2 (§3.3 delta method). For the Pfizer trial, estimate the risk difference $\mathrm{RD} = p_p - p_v$ and its delta-method SE, then give a 95% CI. (Hint: $\nabla\mathrm{RD}$ is trivial, so this is a quick warm-up for the harder VE ratio.)

In [23]:
# Solution 2.
rd = p_p - p_v
se_rd = np.sqrt(var_pv + var_pp)           # gradient is (+1,-1) -> just add variances
ci_rd = (rd - 1.96*se_rd, rd + 1.96*se_rd)
print(f'  risk difference p_p - p_v = {rd:.5f}  ({rd*1000:.2f} fewer cases per 1000)')
print(f'  delta-method SE = {se_rd:.5f}   95% CI = ({ci_rd[0]:.5f}, {ci_rd[1]:.5f})')
print('  The CI excludes 0 by a wide margin -> a clear protective effect, expressed on')
print('  the absolute (risk) scale rather than the relative (efficacy) scale.')
  risk difference p_p - p_v = 0.00840  (8.40 fewer cases per 1000)
  delta-method SE = 0.00071   95% CI = (0.00701, 0.00979)
  The CI excludes 0 by a wide margin -> a clear protective effect, expressed on
  the absolute (risk) scale rather than the relative (efficacy) scale.

Exercise 3 (§3.4). Refit the full Advertising model with HC3 robust standard errors and compare the TV $t$-statistic and CI half-width to the classical fit. Does the conclusion about TV change?

In [24]:
# Solution 3.
rob = smf.ols('Sales ~ TV + Radio + Newspaper', data=adv).fit(cov_type='HC3')
print(f'  TV coef = {full.params["TV"]:.4f}  (identical: HC3 changes SEs, not betas)')
print(f'  classical: SE(TV) = {full.bse["TV"]:.5f}, t = {full.tvalues["TV"]:.2f}')
print(f'  HC3      : SE(TV) = {rob.bse["TV"]:.5f}, t = {rob.tvalues["TV"]:.2f}')
print('  TV stays overwhelmingly significant either way; here the robust SE is close')
print('  to classical, but reporting HC3 is cheap insurance against heteroskedasticity.')
  TV coef = 0.0458  (identical: HC3 changes SEs, not betas)
  classical: SE(TV) = 0.00139, t = 32.81
  HC3      : SE(TV) = 0.00196, t = 23.38
  TV stays overwhelmingly significant either way; here the robust SE is close
  to classical, but reporting HC3 is cheap insurance against heteroskedasticity.

Exercise 4 (§3.5). Using the fitted Titanic logistic model, predict the survival probability of a 30-year-old, 1st-class passenger, separately for female and male.

In [25]:
# Solution 4.
prof = pd.DataFrame({'Sex': ['female', 'male'], 'Pclass': [1, 1], 'Age': [30, 30]})
prof['P(survive)'] = logit.predict(prof).round(3)
print(prof.to_string(index=False))
print('  A 1st-class 30-year-old woman is predicted to survive with high probability;')
print('  an otherwise-identical man far less likely -- the OR ~12.5 made concrete.')
   Sex  Pclass  Age  P(survive)
female       1   30       0.935
  male       1   30       0.534
  A 1st-class 30-year-old woman is predicted to survive with high probability;
  an otherwise-identical man far less likely -- the OR ~12.5 made concrete.

Section 3.6 — Summary & connections forward¶

The parametric-inference pipeline. Pick a model $\Rightarrow$ write the likelihood $\Rightarrow$ maximize for $\hat\theta$ $\Rightarrow$ quantify uncertainty from the curvature (Fisher information) or a delta-method/robust SE $\Rightarrow$ test and form intervals. Exponential families make every step mechanical; GLMs extend it to non-Normal responses.

Quantity Formula This chapter's example
Exp.-family density $h(x)\exp\{\eta^\top T(x) - A(\eta)\}$ Bernoulli / Poisson free throws
Mean from log-partition $\mathbb{E}[T] = A'(\eta)$ $\hat p$ as slope of $A$
MLE (binomial) $\hat p = x/n$ Curry 7/7
Fisher information $I(p) = n/[p(1-p)]$ $\mathrm{SE}(\hat p)$
Confidence interval Wald / Wilson / profile boundary $\hat p = 1$
Delta method $\mathrm{Var}(g) \approx \nabla g^\top \Sigma \nabla g$ VE $= 95\%$, SE $\approx 0.018$
OLS estimator $\hat\beta = (X^\top X)^{-1}X^\top y$ Advertising, TV $t \approx 32.8$
Robust (HC3) SE sandwich $(X^\top X)^{-1} X^\top \hat\Omega X (X^\top X)^{-1}$ 90.0% → 94.7% coverage
Logistic OR $\exp(\beta_j)$ female vs male $\approx 12.5$
Gamma mean ratio $\exp(\beta_j)$, log link incongruent $\approx 1.10$

Forward to Chapter 4 (the bootstrap). Every standard error here came from theory — a Fisher-information formula, a delta-method Taylor expansion, a sandwich estimator — each leaning on asymptotics that can be shaky with small samples, boundary parameters (the $7/7$ case!), or messy statistics. The bootstrap computes the same standard errors and confidence intervals by resampling the data, with almost no formulas. We will re-derive the Advertising regression SEs, the free-throw proportion interval, and a correlation CI by resampling — and compare them head-to-head with the parametric answers from this chapter.