Chapter Summary

This chapter developed the complete framework for parametric inference—the art and science of learning about model parameters from data. Starting with the unifying structure of exponential families, we built maximum likelihood estimation theory from first principles, established the statistical foundations for quantifying uncertainty, and extended these ideas to the workhorse models of applied statistics: linear regression and generalized linear models. The result is a coherent toolkit where the same mathematical principles—likelihood, score equations, Fisher information—apply across an extraordinary range of applications.

The Parametric Inference Pipeline

Every parametric inference problem follows a unified workflow:

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE PARAMETRIC INFERENCE PIPELINE                    │
└─────────────────────────────────────────────────────────────────────────┘

Stage 1: MODEL             Stage 2: ESTIMATION      Stage 3: UNCERTAINTY
(Section 3.1)              (Section 3.2)            (Section 3.3)
┌──────────────┐           ┌──────────────┐         ┌──────────────┐
│ Choose       │           │ Maximize     │         │ Quantify     │
│ Distribution │           │ Likelihood   │         │ Variability  │
│              │ ──f(x|θ)──│              │ ──θ̂────│              │
│ • Exp Family │→          │ • Score = 0  │→        │ • Fisher I   │
│ • EDM        │           │ • Newton     │         │ • Delta      │
│ • GLM        │           │ • IRLS       │         │ • Sandwich   │
└──────────────┘           └──────────────┘         └──────────────┘
       │                                                   │
       └───────────────────────┬───────────────────────────┘
                               ↓
                   Stage 4: INFERENCE
                   (Sections 3.4-3.5)
                   ┌─────────────────────────────────────────┐
                   │ • Confidence Intervals                  │
                   │ • Hypothesis Tests (LRT, Wald, Score)   │
                   │ • Model Comparison                      │
                   │ • Diagnostics & Validation              │
                   └─────────────────────────────────────────┘

Stage 1 — Model Selection: Identify the appropriate probability model for your data. The exponential family (Section 3.1) provides a unified framework encompassing Normal, Poisson, Bernoulli, Gamma, and many other distributions. The canonical form \(f(x|\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}\) reveals sufficient statistics, connects moments to derivatives of \(A(\eta)\), and guarantees concave log-likelihoods in the natural parameter \(\eta\).

Stage 2 — Parameter Estimation: Find the parameter values that best explain the observed data. Maximum likelihood estimation (Section 3.2) provides a principled approach: maximize \(L(\theta) = \prod_i f(x_i|\theta)\) or equivalently solve the score equation \(U(\theta) = 0\). For exponential families, this reduces to matching observed sufficient statistics to their expectations: \(\bar{T} = \nabla A(\hat{\eta})\). Newton-Raphson and Fisher scoring provide computational algorithms.

Stage 3 — Uncertainty Quantification: A point estimate alone is incomplete; we need standard errors and confidence intervals. Sampling variability theory (Section 3.3) shows that \(\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I_1(\theta_0)^{-1})\). The delta method extends this to transformed parameters. Fisher information, observed information, and sandwich estimators provide variance estimates under different assumptions.

Stage 4 — Statistical Inference: Apply the fitted model to answer scientific questions. Linear models (Section 3.4) enable regression analysis with t-tests and F-tests under Gauss-Markov optimality. Generalized linear models (Section 3.5) extend these ideas to binary, count, and other non-normal responses through the link function and IRLS algorithm.

The Five Pillars of Chapter 3

Pillar 1: Exponential Families (Section 3.1)

The exponential family provides the mathematical foundation for all that follows. Its canonical form:

\[f(x|\eta) = h(x) \exp\left\{ \eta^\top T(x) - A(\eta) \right\}\]

yields extraordinary theoretical and computational benefits:

  • Sufficient statistics \(T(X)\) capture all information about \(\eta\)—the Neyman-Fisher factorization theorem ensures no information loss when we reduce data to \(T\).

  • Moment generation via the log-partition function: \(\mathbb{E}[T(X)] = \nabla A(\eta)\) and \(\text{Cov}[T(X)] = \nabla^2 A(\eta)\). No distribution-specific derivations needed.

  • Fisher information equals the Hessian of the log-partition function: \(I_1(\eta) = \nabla^2 A(\eta)\) per observation. For a sample of size \(n\), \(I_n(\eta) = n\nabla^2 A(\eta)\). The convexity of \(A(\eta)\) guarantees positive definiteness and ensures concave log-likelihoods in \(\eta\).

  • Conjugate priors take the form \(\pi(\eta) \propto \exp\{\eta^\top \nu_0 - n_0 A(\eta)\}\), enabling closed-form Bayesian updating.

The exponential dispersion family extends this framework with a dispersion parameter \(\phi\), creating the foundation for generalized linear models.
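
To make the moment identities concrete, here is a minimal numerical sketch (simulated data, assuming NumPy) using the Poisson family, where \(\eta = \log\lambda\) and \(A(\eta) = e^\eta\): numerical derivatives of the log-partition function are compared against sample moments of simulated draws.

import numpy as np

# Poisson in canonical form: eta = log(lambda), T(x) = x, A(eta) = exp(eta).
# Check E[T] = A'(eta) and Var[T] = A''(eta) against simulated moments.
rng = np.random.default_rng(0)
lam = 3.5
eta = np.log(lam)
A = np.exp                                                   # log-partition function of the Poisson family

h = 1e-4
A_prime = (A(eta + h) - A(eta - h)) / (2 * h)                # numerical A'(eta)
A_second = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2     # numerical A''(eta)

x = rng.poisson(lam, size=200_000)
print(f"E[T]  : A'(eta)  = {A_prime:.4f}, sample mean = {x.mean():.4f}")
print(f"Var[T]: A''(eta) = {A_second:.4f}, sample var  = {x.var():.4f}")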

Pillar 2: Maximum Likelihood Estimation (Section 3.2)

Maximum likelihood is the workhorse of parametric inference. Given data \(x_1, \ldots, x_n\), the MLE \(\hat{\theta}\) maximizes:

\[L(\theta) = \prod_{i=1}^n f(x_i|\theta) \quad \text{or equivalently} \quad \ell(\theta) = \sum_{i=1}^n \log f(x_i|\theta)\]

Key theoretical results:

  • Consistency: \(\hat{\theta}_n \xrightarrow{P} \theta_0\) as \(n \to \infty\)

  • Asymptotic normality: \(\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I_1(\theta_0)^{-1})\)

  • Asymptotic efficiency: MLEs achieve the Cramér-Rao lower bound, \(\text{Var}(\hat{\theta}) \geq [nI_1(\theta)]^{-1}\)

  • Invariance: If \(\hat{\theta}\) is MLE of \(\theta\), then \(g(\hat{\theta})\) is MLE of \(g(\theta)\)

For exponential families, the score equation \(U(\hat{\eta}) = 0\) reduces to the moment-matching condition \(\nabla A(\hat{\eta}) = \bar{T}\), providing elegant closed-form solutions for many common distributions.
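
As an illustration of this moment-matching principle, the following sketch (simulated Poisson data; assumes NumPy and SciPy are available) compares the closed-form MLE \(\hat{\lambda} = \bar{x}\) with a direct numerical maximization of the log-likelihood.

import numpy as np
from scipy.optimize import minimize_scalar

# Poisson MLE: the moment-matching condition grad A(eta_hat) = T_bar gives
# exp(eta_hat) = x_bar, i.e. lambda_hat = x_bar.  Compare with a direct
# numerical maximization of the log-likelihood.
rng = np.random.default_rng(1)
x = rng.poisson(4.2, size=500)

lam_closed_form = x.mean()          # moment matching: lambda_hat = sample mean

def neg_loglik(lam):
    # Poisson log-likelihood up to the additive constant -sum(log x_i!)
    return -(np.sum(x) * np.log(lam) - len(x) * lam)

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")
print(f"closed form: {lam_closed_form:.5f}")
print(f"numerical  : {res.x:.5f}")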

Pillar 3: Sampling Variability (Section 3.3)

Understanding how estimators vary across samples is essential for valid inference:

  • Bias: \(\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta\). MLEs are often biased in finite samples but asymptotically unbiased.

  • Variance: \(\text{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]\). Decreases with sample size as \(O(1/n)\).

  • Mean Squared Error: \(\text{MSE} = \text{Bias}^2 + \text{Variance}\). The fundamental bias-variance tradeoff.

  • Consistency: \(\hat{\theta}_n \xrightarrow{P} \theta_0\). Estimators converge to the truth.

The delta method propagates uncertainty through transformations:

\[\sqrt{n}(g(\hat{\theta}) - g(\theta_0)) \xrightarrow{d} \mathcal{N}\left(0, [g'(\theta_0)]^2 \cdot I_1(\theta_0)^{-1}\right)\]

For multivariate parameters:

\[\sqrt{n}(g(\hat{\boldsymbol{\theta}}) - g(\boldsymbol{\theta}_0)) \xrightarrow{d} \mathcal{N}\left(0, \nabla g(\boldsymbol{\theta}_0)^\top I_1(\boldsymbol{\theta}_0)^{-1} \nabla g(\boldsymbol{\theta}_0)\right)\]

Variance estimation methods include Fisher information (theoretical), observed information (data-adaptive), and sandwich estimators (robust to misspecification).
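
The sketch below (a hypothetical Monte Carlo check, assuming NumPy) illustrates the delta method: for Poisson data, \(\hat{\lambda} = \bar{x}\) has variance \(\lambda/n\), so \(\log\hat{\lambda}\) has approximate standard error \(\sqrt{1/(n\lambda)}\), which is compared against the spread observed across repeated simulated samples.

import numpy as np

# Delta method check: for Poisson data, lambda_hat = x_bar has Var = lambda/n,
# so log(x_bar) has approximate variance [g'(lambda)]^2 * lambda/n = 1/(n*lambda).
rng = np.random.default_rng(2)
lam, n, n_rep = 4.0, 200, 20_000

delta_se = np.sqrt(1.0 / (n * lam))            # delta-method SE of log(x_bar)

x_bars = rng.poisson(lam, size=(n_rep, n)).mean(axis=1)
mc_se = np.log(x_bars).std(ddof=1)             # Monte Carlo SE across replications

print(f"delta method SE: {delta_se:.5f}")
print(f"Monte Carlo SE : {mc_se:.5f}")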

Pillar 4: Linear Models (Section 3.4)

Linear regression applies the likelihood framework to continuous responses:

\[Y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)\]

The OLS estimator minimizes sum of squared errors:

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

Key theoretical results:

  • Gauss-Markov Theorem: Under the assumptions of linearity, exogeneity, homoskedastic and uncorrelated errors, and no perfect multicollinearity, OLS is BLUE (Best Linear Unbiased Estimator).

  • Distributional theory: \(\hat{\boldsymbol{\beta}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})\) under normality.

  • Hypothesis testing: t-tests for individual coefficients, F-tests for nested model comparison.

  • Residual analysis: Diagnostic tools detect violations of assumptions.

The geometric interpretation—OLS projects \(\mathbf{y}\) onto the column space of \(\mathbf{X}\)—provides deep insight into the estimator’s properties.
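
The following sketch (simulated data, NumPy only) computes \(\hat{\boldsymbol{\beta}}\), \(\hat{\sigma}^2\), standard errors, and t-statistics directly from the formulas above; in practice a library routine such as statsmodels OLS would be used, but the hand computation makes the linear-algebra structure explicit.

import numpy as np

# OLS "by hand": beta_hat = (X'X)^{-1} X'y, Cov(beta_hat) = sigma^2 (X'X)^{-1},
# with sigma^2 estimated by RSS / (n - p).
rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 1.5 + 2.0 * x + rng.normal(0, 2.0, n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations

resid = y - X @ beta_hat
p = X.shape[1]
sigma2_hat = resid @ resid / (n - p)            # RSS / (n - p)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))

print("beta_hat :", np.round(beta_hat, 3))
print("SE       :", np.round(se, 3))
print("t-stats  :", np.round(beta_hat / se, 2))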

Pillar 5: Generalized Linear Models (Section 3.5)

GLMs extend linear regression to non-normal responses through three components:

  1. Random component: Response \(Y_i\) from exponential dispersion family with mean \(\mu_i\)

  2. Systematic component: Linear predictor \(\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}\)

  3. Link function: \(g(\mu_i) = \eta_i\) connecting mean to linear predictor

Table 36 Common GLM Configurations

| Response Type       | Distribution | Canonical Link                   | Application                |
|---------------------|--------------|----------------------------------|----------------------------|
| Binary              | Bernoulli    | Logit: \(\log\frac{\mu}{1-\mu}\) | Classification, propensity |
| Count               | Poisson      | Log: \(\log \mu\)                | Event rates, frequencies   |
| Positive continuous | Gamma        | Reciprocal: \(-1/\mu\)           | Duration, costs, insurance |
| Continuous          | Normal       | Identity: \(\mu\)                | Standard regression        |

The IRLS algorithm (Iteratively Reweighted Least Squares) provides unified computation:

Initialize: β⁽⁰⁾
Repeat until convergence:
  1. Compute η⁽ᵗ⁾ = Xβ⁽ᵗ⁾
  2. Compute μ⁽ᵗ⁾ = g⁻¹(η⁽ᵗ⁾)
  3. Compute working weights W⁽ᵗ⁾ and working response z⁽ᵗ⁾
  4. Update: β⁽ᵗ⁺¹⁾ = (X'W⁽ᵗ⁾X)⁻¹X'W⁽ᵗ⁾z⁽ᵗ⁾
Return: β̂

This algorithm is Fisher scoring applied to the GLM likelihood—the same principles from Section 3.2 yield a unified estimation procedure for all exponential family responses.
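
A minimal runnable sketch of IRLS for the logistic-regression special case follows (simulated data; the helper name irls_logistic is illustrative). For the canonical logit link the working weights are \(W_{ii} = \mu_i(1-\mu_i)\) and the working response is \(z_i = \eta_i + (y_i - \mu_i)/W_{ii}\).

import numpy as np

# Minimal IRLS for logistic regression (canonical logit link).
def irls_logistic(X, y, n_iter=25, tol=1e-10):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        W = mu * (1.0 - mu)                       # diagonal working weights
        z = eta + (y - mu) / W                    # working response
        XtW = X.T * W                             # X'W with W diagonal
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(4)
x = rng.uniform(2, 12, 200)
p = 1.0 / (1.0 + np.exp(-(-3 + 0.5 * x)))
y = rng.binomial(1, p)
X = np.column_stack([np.ones_like(x), x])
print("IRLS estimate:", np.round(irls_logistic(X, y), 3))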

How the Pillars Connect

These five pillars form an integrated framework where each builds on the others:

Exponential families enable elegant MLE. For an exponential family, the score equation becomes \(\nabla A(\hat{\eta}) = \bar{T}\)—match expected sufficient statistics to observed values. This moment-matching principle yields closed-form solutions for Normal, Poisson, Exponential, and many other distributions. When closed forms don’t exist, the concavity of the log-likelihood in \(\eta\) (guaranteed by convexity of \(A\)) ensures that Newton’s method converges to the unique global maximum.

MLE theory provides sampling distributions. The asymptotic normality theorem \(\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I_1^{-1})\) is proven using properties of the score function and Fisher information. This theorem transforms point estimates into confidence intervals and hypothesis tests.

Sampling variability extends to transformations. The delta method converts asymptotic normality of \(\hat{\theta}\) into asymptotic normality of \(g(\hat{\theta})\). This enables inference on interpretable quantities like odds ratios (logistic regression), rate ratios (Poisson regression), and elasticities (log-transformed regression).

Linear models are normal-theory GLMs. The classical linear model \(Y \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 I)\) is a GLM with normal distribution, identity link, and estimated dispersion. OLS and MLE coincide. The Gauss-Markov theorem provides finite-sample optimality that complements MLE’s asymptotic efficiency.

GLMs unify regression for all exponential families. Logistic, Poisson, and Gamma regression—previously treated as separate topics—are all special cases of the same framework. IRLS is Fisher scoring specialized to GLMs. Deviance analysis parallels F-tests. The canonical link ensures that \(\eta = \theta\), making the connection to exponential family theory explicit.

Example 💡 Integrated Analysis

Problem: Analyze whether study hours predict exam pass/fail for 200 students.

Stage 1 (Model): Binary response → Bernoulli distribution → Exponential family with \(\theta = \log\frac{p}{1-p}\) and \(A(\theta) = \log(1 + e^\theta)\).

Stage 2 (Estimation): Logistic regression via IRLS. The canonical link equates \(\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}\) to \(\theta_i\), so the score equation is \(\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}\).

Stage 3 (Uncertainty): Fisher information \(I(\boldsymbol{\beta}) = \mathbf{X}^\top \mathbf{W} \mathbf{X}\) where \(W_{ii} = \mu_i(1-\mu_i)\). Standard errors from \([\hat{I}(\hat{\boldsymbol{\beta}})]^{-1}\).

Stage 4 (Inference): Wald test for \(H_0: \beta_{\text{hours}} = 0\). Odds ratio \(e^{\hat{\beta}_{\text{hours}}}\) with 95% CI via delta method. Deviance test comparing to null model.

Code:

import numpy as np
import statsmodels.api as sm

# Simulated data
rng = np.random.default_rng(42)
hours = rng.uniform(2, 12, 200)
prob_pass = 1 / (1 + np.exp(-(-3 + 0.5 * hours)))
passed = rng.binomial(1, prob_pass)

# Fit logistic regression
X = sm.add_constant(hours)
model = sm.GLM(passed, X, family=sm.families.Binomial())
result = model.fit()

print(f"Intercept: {result.params[0]:.3f} (SE: {result.bse[0]:.3f})")
print(f"Hours coef: {result.params[1]:.3f} (SE: {result.bse[1]:.3f})")
print(f"Odds ratio per hour: {np.exp(result.params[1]):.3f}")
print(f"95% CI for OR: ({np.exp(result.conf_int()[1,0]):.3f}, "
      f"{np.exp(result.conf_int()[1,1]):.3f})")

Output:

Intercept: -2.687 (SE: 0.537)
Hours coef: 0.432 (SE: 0.074)
Odds ratio per hour: 1.540
95% CI for OR: (1.332, 1.781)

Method Selection Guide

Use these decision frameworks to choose appropriate methods:

Choosing the Distribution (Random Component)

What type of response variable?
│
├─► Continuous, unbounded?
│   └─► Normal distribution (identity link)
│
├─► Binary (0/1)?
│   └─► Bernoulli distribution (logit link)
│
├─► Count (0, 1, 2, ...)?
│   ├─► Mean ≈ Variance? → Poisson (log link)
│   └─► Variance > Mean? → Negative Binomial or Quasi-Poisson
│
├─► Proportion (0 to 1)?
│   └─► Beta distribution (logit link)
│
├─► Strictly positive continuous?
│   ├─► Constant CV? → Gamma (log link)
│   └─► CV increases with mean? → Inverse Gaussian
│
└─► Bounded continuous [a, b]?
    └─► Transform to (0,1), use Beta

Choosing the Variance Estimator

What assumptions can you make?
│
├─► Model correctly specified, large n?
│   └─► Fisher Information (theoretical)
│       SE = 1/√(nI₁(θ̂))
│
├─► Model correct, prefer data-adaptive?
│   └─► Observed Information
│       SE = 1/√J(θ̂) where J = -∂²ℓ/∂θ²
│
├─► Possible misspecification?
│   └─► Sandwich Estimator (robust)
│       Var(θ̂) = A⁻¹BA⁻¹ where A = E[-∂²ℓ/∂θ²], B = E[(∂ℓ/∂θ)²]
│
└─► Small sample, exact inference needed?
    └─► Bootstrap (Chapter 4)
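
As a sketch of the robust option in the guide above (simulated heteroskedastic data; assumes statsmodels is available), the comparison below contrasts classical OLS standard errors with sandwich (HC1) standard errors.

import numpy as np
import statsmodels.api as sm

# Model-based vs. sandwich (heteroskedasticity-robust) standard errors.
# With error variance that grows with x, the robust SEs are the trustworthy ones.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)       # error SD increases with x

X = sm.add_constant(x)
fit_classic = sm.OLS(y, X).fit()                 # assumes constant variance
fit_robust = sm.OLS(y, X).fit(cov_type="HC1")    # sandwich estimator

print("classic SE :", np.round(fit_classic.bse, 4))
print("sandwich SE:", np.round(fit_robust.bse, 4))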

Choosing the Hypothesis Test

Testing H₀: θ = θ₀?
│
├─► Large n, quick computation?
│   └─► Wald Test: W = (θ̂ - θ₀)² / Var(θ̂) ~ χ²₁
│
├─► Better finite-sample properties?
│   └─► Likelihood Ratio Test: LRT = 2[ℓ(θ̂) - ℓ(θ₀)] ~ χ²₁
│
└─► Only need to evaluate at θ₀?
    └─► Score Test: S = U(θ₀)² / I(θ₀) ~ χ²₁
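
The sketch below (a hypothetical Poisson example, assuming NumPy and SciPy) computes all three statistics for \(H_0: \lambda = \lambda_0\), where closed forms exist: the Wald statistic uses the information evaluated at \(\hat{\lambda}\), the score statistic uses the information at \(\lambda_0\), and the LRT compares the two log-likelihoods.

import numpy as np
from scipy import stats

# Wald, likelihood ratio, and score tests of H0: lambda = lambda0 for Poisson data,
# each compared against the chi-square(1) reference distribution.
rng = np.random.default_rng(6)
x = rng.poisson(4.5, size=100)
n, lam0 = len(x), 4.0
lam_hat = x.mean()

wald = n * (lam_hat - lam0) ** 2 / lam_hat                        # information at the MLE
lrt = 2 * (np.sum(x) * np.log(lam_hat / lam0) - n * (lam_hat - lam0))
score = n * (lam_hat - lam0) ** 2 / lam0                          # information at lambda0

for name, stat in [("Wald", wald), ("LRT", lrt), ("Score", score)]:
    print(f"{name:5s}: stat = {stat:6.3f},  p = {stats.chi2.sf(stat, df=1):.4f}")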

Quick Reference: Core Formulas

Exponential Family

| Quantity                         | Formula                                                 |
|----------------------------------|---------------------------------------------------------|
| Canonical form                   | \(f(x\mid\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}\) |
| Mean of sufficient statistic     | \(\mathbb{E}[T(X)] = \nabla A(\eta)\)                   |
| Variance of sufficient statistic | \(\text{Cov}[T(X)] = \nabla^2 A(\eta)\)                 |
| Fisher information               | \(I(\eta) = \nabla^2 A(\eta)\)                          |
| MLE condition                    | \(\nabla A(\hat{\eta}) = \bar{T}\)                      |

Maximum Likelihood Estimation

| Quantity                     | Formula                                                                                 |
|------------------------------|-----------------------------------------------------------------------------------------|
| Log-likelihood               | \(\ell(\theta) = \sum_{i=1}^n \log f(x_i\mid\theta)\)                                   |
| Score function               | \(U(\theta) = \nabla \ell(\theta)\)                                                     |
| Fisher information (per obs) | \(I_1(\theta) = \mathbb{E}[U_1(\theta)^2] = -\mathbb{E}[\nabla^2 \log f(X\mid\theta)]\) |
| Asymptotic variance          | \(\text{Var}(\hat{\theta}) \approx [nI_1(\theta)]^{-1}\)                                |
| Cramér-Rao bound             | \(\text{Var}(\hat{\theta}) \geq [nI_1(\theta)]^{-1}\)                                   |

Sampling Variability

| Quantity              | Formula                                                                                         |
|-----------------------|--------------------------------------------------------------------------------------------------|
| Bias                  | \(\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta\)                               |
| Mean squared error    | \(\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2\)           |
| Delta method (scalar) | \(\text{Var}(g(\hat{\theta})) \approx [g'(\theta)]^2 \text{Var}(\hat{\theta})\)                 |
| Delta method (vector) | \(\text{Var}(g(\hat{\boldsymbol{\theta}})) \approx \nabla g^\top \boldsymbol{\Sigma} \nabla g\) |
| Sandwich estimator    | \(\hat{\text{Var}}(\hat{\theta}) = \hat{A}^{-1}\hat{B}\hat{A}^{-1}\)                            |

Linear Models

| Quantity                                 | Formula                                                                                                  |
|------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| OLS estimator                            | \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\)                 |
| Variance of \(\hat{\boldsymbol{\beta}}\) | \(\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\)                      |
| Residual variance estimator              | \(\hat{\sigma}^2 = \frac{\mathbf{e}^\top\mathbf{e}}{n-p} = \frac{\text{RSS}}{n-p}\)                      |
| t-statistic for \(\beta_j\)              | \(t = \frac{\hat{\beta}_j - \beta_{j,0}}{\text{SE}(\hat{\beta}_j)} \sim t_{n-p}\)                        |
| F-statistic (nested models)              | \(F = \frac{(\text{RSS}_R - \text{RSS}_F)/(p_F - p_R)}{\text{RSS}_F/(n-p_F)} \sim F_{p_F-p_R,\, n-p_F}\) |

Generalized Linear Models

| Quantity                        | Formula                                                                                                                                                                                                                  |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Mean-variance relation          | \(\text{Var}(Y_i) = \phi \cdot V(\mu_i)\)                                                                                                                                                                               |
| Link function                   | \(g(\mu_i) = \eta_i = \mathbf{x}_i^\top\boldsymbol{\beta}\)                                                                                                                                                             |
| Score equation (canonical link) | \(\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}\)                                                                                                                                                         |
| Score equation (general link)   | \(\mathbf{X}^\top\mathbf{W}\mathbf{G}(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}\), where \(G_{ii} = \frac{d\eta_i}{d\mu_i}\) and \(W_{ii} = \frac{1}{V(\mu_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^2\) (the IRLS working weight) |
| IRLS update                     | \(\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)}\)                                                                                        |
| Deviance                        | \(D = 2[\ell(\hat{\boldsymbol{\mu}}_{\text{saturated}}) - \ell(\hat{\boldsymbol{\mu}}_{\text{model}})]\); scaled deviance \(D^* = D/\phi\)                                                                              |

Connections to Future Material

The parametric inference framework developed in this chapter provides essential foundations for the remainder of the course.

Bootstrap Methods (Chapter 4)

Chapter 4 develops resampling methods that complement and extend parametric inference:

  • When parametric assumptions fail: The bootstrap provides valid standard errors and confidence intervals without distributional assumptions. When the model is misspecified, bootstrap variance estimates can be more reliable than Fisher information.

  • Complex statistics: For statistics without tractable asymptotic distributions (medians, ratios, eigenvalues), the bootstrap provides a general-purpose solution where delta method approximations may be poor.

  • Small samples: Asymptotic normality may be inadequate for small \(n\). Bootstrap percentile and BCa intervals can have better coverage than Wald intervals.

  • Model validation: Cross-validation, a resampling technique, assesses predictive performance and guards against overfitting—complementing the in-sample fit measures (deviance, AIC) introduced in this chapter.

Connection to Chapter 3: The parametric bootstrap samples from \(F_{\hat{\theta}}\), the fitted parametric model. It combines the efficiency of parametric assumptions with the flexibility of resampling. Understanding MLE and Fisher information enables comparison of bootstrap and asymptotic standard errors.
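
A minimal sketch of the idea (simulated Poisson data, NumPy only): fit \(\hat{\lambda}\) by maximum likelihood, resample repeatedly from the fitted model, and compare the bootstrap standard error with the Fisher-information standard error \(\sqrt{\hat{\lambda}/n}\).

import numpy as np

# Parametric bootstrap preview: resample from the fitted Poisson(lambda_hat) model
# and compare the bootstrap SE of lambda_hat with the Fisher-information SE.
rng = np.random.default_rng(7)
x = rng.poisson(3.0, size=150)
n, lam_hat = len(x), x.mean()

se_fisher = np.sqrt(lam_hat / n)                 # [n I_1(lambda_hat)]^{-1/2}

B = 5000
boot_estimates = rng.poisson(lam_hat, size=(B, n)).mean(axis=1)
se_boot = boot_estimates.std(ddof=1)

print(f"Fisher information SE   : {se_fisher:.4f}")
print(f"Parametric bootstrap SE : {se_boot:.4f}")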

Bayesian Inference (Chapter 5)

Chapter 5 presents the Bayesian alternative to maximum likelihood:

  • Prior distributions: Where MLE treats \(\theta\) as fixed but unknown, Bayesian inference places a prior distribution \(\pi(\theta)\) on parameters. Conjugate priors for exponential families (Section 3.1) enable closed-form posteriors.

  • Posterior inference: Bayes’ theorem yields \(\pi(\theta|x) \propto L(\theta)\pi(\theta)\)—the posterior combines likelihood with prior. MLE emerges as the posterior mode under a flat prior.

  • Credible intervals vs. confidence intervals: Bayesian 95% credible intervals contain the parameter with 95% posterior probability. Frequentist 95% confidence intervals cover the true parameter in 95% of repeated samples. The interpretations differ philosophically but often coincide numerically.

  • MCMC computation: When posteriors lack closed forms, Markov chain Monte Carlo provides samples from \(\pi(\theta|x)\). The Metropolis-Hastings algorithm uses proposal distributions analogous to importance sampling (Chapter 2).

Connection to Chapter 3: Fisher information determines the asymptotic variance of both MLE and the posterior. Under regularity conditions, the Bernstein-von Mises theorem shows \(\pi(\theta|x) \to \mathcal{N}(\hat{\theta}_{\text{MLE}}, [nI_1]^{-1})\). Bayesian and frequentist inference converge as \(n \to \infty\).
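
As a small preview (a hypothetical Beta-Bernoulli example, assuming SciPy), conjugate updating gives the posterior in closed form, and for moderate \(n\) its mean and spread already track the MLE and its asymptotic standard error, consistent with the Bernstein-von Mises result.

import numpy as np
from scipy import stats

# Conjugate Beta-Bernoulli updating, compared with the MLE and its normal-approximation SE.
rng = np.random.default_rng(8)
y = rng.binomial(1, 0.3, size=400)
n, s = len(y), y.sum()

a0, b0 = 1.0, 1.0                                # flat Beta(1, 1) prior (an assumption here)
posterior = stats.beta(a0 + s, b0 + n - s)       # closed-form posterior Beta(a0 + s, b0 + n - s)

p_hat = s / n
se_mle = np.sqrt(p_hat * (1 - p_hat) / n)        # [n I_1(p_hat)]^{-1/2}

print(f"posterior mean / SD : {posterior.mean():.4f} / {posterior.std():.4f}")
print(f"MLE / asymptotic SE : {p_hat:.4f} / {se_mle:.4f}")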

LLMs and Modern Methods (Chapters 13-15)

The parametric foundations extend to modern machine learning contexts:

  • Logistic regression in classification: Logistic regression (GLM with Bernoulli response) remains the interpretable baseline for binary classification. Understanding its likelihood foundation enables principled comparison with neural networks.

  • Regularization: Ridge and LASSO regression add penalty terms to the log-likelihood, corresponding to specific prior distributions in the Bayesian interpretation.

  • Uncertainty quantification: Modern deep learning increasingly emphasizes uncertainty estimates. The Fisher information and delta method concepts from Chapter 3 inform variance propagation through neural networks.

Practical Guidance

Best Practices for Parametric Inference

  1. Visualize first: Plot your data before fitting models. Histograms reveal distributional shape; scatterplots show relationships; residual plots expose assumption violations.

  2. Check assumptions systematically: For linear models, verify linearity (residuals vs. fitted), homoskedasticity (scale-location plot), normality (Q-Q plot), and influential points (Cook’s distance).

  3. Use robust standard errors by default: The sandwich estimator protects against misspecification with minimal efficiency loss when the model is correct.

  4. Report confidence intervals, not just p-values: Intervals convey both statistical significance and practical magnitude. A significant effect may be too small to matter; a non-significant effect may have a wide interval consistent with meaningful effects.

  5. Consider model comparison holistically: Likelihood ratio tests compare nested models; AIC and BIC enable comparison of non-nested models; cross-validation assesses predictive performance.

  6. Mind the sample size: Asymptotic results require “large enough” \(n\). For small samples with binary outcomes, consider exact methods or Firth’s penalized likelihood for separation issues.

Common Pitfalls to Avoid

Common Pitfall ⚠️ Interpreting Coefficients Incorrectly

GLM coefficients are on the link scale, not the response scale:

  • Logistic: \(\beta_j\) is the log-odds ratio, not the change in probability

  • Poisson: \(\beta_j\) is the log-rate ratio, not the change in count

  • Linear: \(\beta_j\) is the actual change in response (identity link)

Transform coefficients appropriately for interpretation: \(e^{\beta_j}\) gives odds ratios (logistic) or rate ratios (Poisson).

Common Pitfall ⚠️ Ignoring Overdispersion

For count and binary data, the variance may exceed what the model predicts. Symptoms include:

  • Residual deviance much larger than degrees of freedom

  • Many observations with large Pearson residuals

Remedies: quasi-likelihood methods, negative binomial (for counts), or beta-binomial (for proportions).
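
The sketch below (simulated overdispersed counts; assumes statsmodels, and uses its option of estimating the dispersion from the Pearson statistic via scale="X2") illustrates the diagnostic ratio and the quasi-Poisson adjustment to the standard errors.

import numpy as np
import statsmodels.api as sm

# Diagnosing overdispersion in a Poisson GLM: compare Pearson chi-square (or
# residual deviance) to the residual degrees of freedom; refit with an estimated
# dispersion if the ratio is well above 1.
rng = np.random.default_rng(9)
n = 300
x = rng.uniform(0, 2, n)
mu = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # overdispersed counts with mean mu

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(f"estimated dispersion: {dispersion:.2f}  (close to 1 if the Poisson variance holds)")

quasi_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")
print("Poisson SEs      :", np.round(poisson_fit.bse, 3))
print("quasi-Poisson SEs:", np.round(quasi_fit.bse, 3))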

Common Pitfall ⚠️ Perfect Separation in Logistic Regression

When a predictor perfectly separates the classes (for example, every success occurs above some predictor threshold and every failure below it), the MLE does not exist: the coefficient estimates diverge to \(\pm\infty\). Symptoms include:

  • Extremely large coefficient estimates

  • Huge standard errors

  • Convergence warnings

Remedies: Firth’s penalized likelihood, Bayesian approaches with proper priors, or regularization (ridge/LASSO).
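
To see the symptom numerically, the sketch below (a toy perfectly separated dataset; assumes scikit-learn, whose logistic regression applies a ridge penalty controlled by C) shows the slope estimate growing without bound as the penalty is relaxed, while a moderate penalty keeps it finite.

import numpy as np
from sklearn.linear_model import LogisticRegression

# With perfectly separated data the unpenalized MLE diverges, so the fitted slope
# grows as the ridge penalty weakens (larger C); a moderate penalty keeps it finite.
x = np.linspace(-2, 2, 40).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)                  # perfectly separated at x = 0

for C in [1.0, 100.0, 10_000.0]:
    slope = LogisticRegression(C=C, penalty="l2").fit(x, y).coef_[0, 0]
    print(f"C = {C:>8.0f}  ->  slope = {slope:10.2f}")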

Final Perspective

Parametric inference embodies a fundamental tradeoff: we assume a probability model to gain statistical power and interpretability, but suffer if the model is wrong. The exponential family provides the mathematical structure that makes parametric inference tractable—sufficient statistics compress data without information loss, log-partition functions generate moments automatically, and concave log-likelihoods ensure well-behaved optimization.

The techniques in this chapter form the statistical backbone of data science:

  1. Model data with exponential families (Section 3.1)

  2. Estimate parameters via maximum likelihood (Section 3.2)

  3. Quantify uncertainty through sampling variability (Section 3.3)

  4. Analyze continuous responses with linear regression (Section 3.4)

  5. Extend to non-normal responses with GLMs (Section 3.5)

This toolkit is not merely academic. Every logistic regression in click-through prediction, every Poisson model in epidemiology, every linear regression in econometrics relies on these principles. The bootstrap (Chapter 4) provides robustness when parametric assumptions fail. Bayesian methods (Chapter 5) offer an alternative philosophical foundation with distinct computational techniques. Modern machine learning (Chapters 13-15) builds on these foundations while relaxing some assumptions in exchange for flexibility.

Master parametric inference, and you hold the key to principled statistical modeling. The combination of theoretical rigor and computational tractability makes these methods indispensable—equally valuable for understanding why methods work and for applying them in practice.

Key Takeaways 📝

  1. The Unifying Framework: Exponential families provide a common mathematical structure for diverse distributions. The canonical form \(f(x|\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}\) yields sufficient statistics, moment formulas, and Fisher information from a single framework.

  2. The Estimation Principle: Maximum likelihood chooses parameters that make observed data most probable. For exponential families, this reduces to matching sufficient statistics: \(\nabla A(\hat{\eta}) = \bar{T}\). Asymptotic theory guarantees consistency, normality, and efficiency.

  3. The Uncertainty Machinery: Sampling variability quantifies estimator precision. The delta method propagates uncertainty through transformations. Fisher information, observed information, and sandwich estimators provide variance estimates under different assumptions.

  4. The Model Extensions: Linear models provide optimal estimation under Gauss-Markov conditions. GLMs extend to binary, count, and other non-normal responses through the link function. IRLS provides unified computation across all exponential family responses.

  5. The Course Connections: This chapter provides the likelihood foundations for bootstrap methods (Chapter 4, where parametric bootstrap uses \(\hat{F}_{\hat{\theta}}\)), Bayesian inference (Chapter 5, where posterior ∝ likelihood × prior), and modern ML (where cross-entropy loss is negative log-likelihood). [LO 1, 2, 4]

References

Foundational Works on Maximum Likelihood (Section 3.2)

[Fisher1912]

Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41, 155–160. Fisher’s earliest work on maximum likelihood.

[Fisher1922]

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A, 222, 309–368. The foundational paper introducing maximum likelihood, sufficiency, and efficiency—concepts that remain central to statistical inference.

[Fisher1925]

Fisher, R. A. (1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22(5), 700–725. Develops the asymptotic theory of maximum likelihood including asymptotic normality and efficiency.

[Rao1945]

Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. Independently establishes the information inequality (Cramér-Rao bound).

[Cramer1946]

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Classic synthesis of statistical theory including rigorous treatment of the Cramér-Rao inequality.

Exponential Families (Section 3.1)

[Koopman1936]

Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical Society, 39(3), 399–409. One of the three independent proofs of the Pitman-Koopman-Darmois theorem.

[Pitman1936]

Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Proceedings of the Cambridge Philosophical Society, 32(4), 567–579. Establishes the theorem connecting sufficiency to exponential family structure.

[Darmois1935]

Darmois, G. (1935). Sur les lois de probabilité à estimation exhaustive. Comptes Rendus de l’Académie des Sciences, 200, 1265–1266. The French contribution to the simultaneous independent discovery.

[Brown1986]

Brown, L. D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes–Monograph Series, Vol. 9. The definitive mathematical reference on exponential families.

[BarndorffNielsen1978]

Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley. Comprehensive treatment emphasizing information geometry.

Sampling Variability and Robust Methods (Section 3.3)

[Huber1967]

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 221–233. University of California Press. Foundational work introducing the sandwich variance estimator.

[White1980]

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838. Develops heteroskedasticity-consistent standard errors now standard in regression.

[White1982]

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25. Establishes quasi-maximum likelihood theory under model misspecification.

[Serfling1980]

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley. Comprehensive treatment of the delta method and asymptotic approximations.

[Efron1978]

Efron, B., and Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65(3), 457–487. Compares information-based variance estimation methods.

Linear Models (Section 3.4)

[Legendre1805]

Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot, Paris. First published account of the method of least squares.

[Gauss1809]

Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Perthes and Besser, Hamburg. Gauss’s probabilistic justification of least squares under normally distributed errors.

[Gauss1821]

Gauss, C. F. (1821–1823). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae. English translation by G. W. Stewart (1995), SIAM. Proves the original Gauss-Markov theorem without assuming normality.

[Aitken1935]

Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48. Generalizes least squares to handle correlated errors.

[Cochran1934]

Cochran, W. G. (1934). The distribution of quadratic forms in a normal system. Mathematical Proceedings of the Cambridge Philosophical Society, 30(2), 178–191. Cochran’s theorem fundamental for F-tests and ANOVA.

[Cook1977]

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18. Introduces Cook’s distance for identifying influential observations.

Generalized Linear Models (Section 3.5)

[NelderWedderburn1972]

Nelder, J. A., and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3), 370–384. The foundational paper introducing the unified GLM framework and IRLS algorithm.

[Wedderburn1974]

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3), 439–447. Extends GLM theory to quasi-likelihood.

[McCullaghNelder1989]

McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall. The definitive reference on GLM theory covering exponential dispersion models, deviance, and diagnostics.

[Firth1993]

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38. Introduces penalized likelihood to address separation problems in logistic regression.

[Hosmer2013]

Hosmer, D. W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley. Comprehensive applied treatment of logistic regression.

Asymptotic Theory

[Wald1949]

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20(4), 595–601. Establishes conditions for MLE consistency.

[LeCam1953]

Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates. University of California Publications in Statistics, 1, 277–329. Establishes local asymptotic normality.

[VanDerVaart1998]

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Definitive modern treatment of asymptotic statistical theory.

[LehmannCasella1998]

Lehmann, E. L., and Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Graduate-level treatment including comprehensive coverage of exponential families.

Comprehensive Texts

[CasellaBerger2002]

Casella, G., and Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury Press. Accessible introduction to sufficiency, exponential families, and inference.

[EfronHastie2016]

Efron, B., and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press. Contemporary perspective integrating classical and modern methods.

[Wasserman2004]

Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. Modern introduction to statistical inference.

[Davison2003]

Davison, A. C. (2003). Statistical Models. Cambridge University Press. Modern treatment including information-based standard errors.

[Dobson2018]

Dobson, A. J., and Barnett, A. G. (2018). An Introduction to Generalized Linear Models (4th ed.). CRC Press. Accessible introduction to GLMs.

Historical Perspectives

[Stigler1986]

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press. Historical context for the development of statistical methods.

[Hand2015]

Hand, D. J. (2015). From evidence to understanding: A commentary on Fisher (1922). Philosophical Transactions of the Royal Society A, 373(2039), 20140249. Modern perspective on Fisher’s foundational contributions.