Maximum Likelihood Estimation

In the summer of 1912, a young statistician named Ronald Aylmer Fisher was grappling with a fundamental question: given a sample of data, how should we estimate the parameters of a probability distribution? Fisher was not satisfied with the existing answers—Karl Pearson’s method of moments, while simple, seemed to throw away information. Surely there was a principled way to extract all the information the data contained about the unknown parameters.

Fisher’s answer, published in a series of papers between 1912 and 1925, was maximum likelihood estimation: choose the parameter values that make the observed data most probable. This deceptively simple idea revolutionized statistical inference. Fisher showed that maximum likelihood estimators (MLEs) have remarkable properties—they are consistent, asymptotically normal, and asymptotically efficient, achieving the theoretical lower bound on estimator variance. These properties made MLE the workhorse of parametric inference for a century.

This section develops the theory and practice of maximum likelihood estimation. We begin with the likelihood function itself—the mathematical object that quantifies how well different parameter values explain the observed data. We derive the score function and Fisher information, which together characterize the geometry of the likelihood surface. For simple models, we obtain closed-form MLEs; for complex models, we develop numerical optimization algorithms including Newton-Raphson and Fisher scoring. We establish the asymptotic theory that justifies using MLEs for inference, including a complete proof of the Cramér-Rao lower bound. Finally, we connect MLE to hypothesis testing through likelihood ratio, Wald, and score tests.

The exponential family framework from Section 3.1 will prove essential here: for exponential families, the score equation takes a particularly elegant form, and the MLE has explicit connections to sufficient statistics. But MLE extends far beyond exponential families—it applies to any parametric model, making it the universal tool for parametric inference.

Road Map 🧭

Understand: The likelihood function as a measure of parameter support, the score function as its gradient, and Fisher information as its curvature
Derive: Closed-form MLEs for normal, exponential, Poisson, and Bernoulli distributions; understand why Gamma and Beta require numerical methods
Implement: Newton-Raphson and Fisher scoring algorithms; leverage scipy.optimize for production code
Prove: Asymptotic consistency, normality, and efficiency; the Cramér-Rao lower bound with full regularity conditions
Apply: Likelihood ratio, Wald, and score tests for hypothesis testing; construct confidence intervals via multiple methods

The Likelihood Function

The likelihood function is the foundation of maximum likelihood estimation. It answers a simple question: for fixed data, how probable would that data be under different parameter values?

Definition and Interpretation

Let \(X_1, X_2, \ldots, X_n\) be independent random variables, each with probability density (or mass) function \(f(x|\theta)\) depending on an unknown parameter \(\theta \in \Theta\). After observing data \(x_1, x_2, \ldots, x_n\), we define the likelihood function:

(38)\[L(\theta) = L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta)\]

The crucial conceptual shift is this: we view the likelihood as a function of \(\theta\) for fixed data, not as a probability of data for fixed \(\theta\). The data are observed and therefore fixed; the parameter is unknown and therefore variable.

The Likelihood is Not a Probability Density

While \(L(\theta)\) is constructed from probability densities, it is not a probability density over \(\theta\). There is no requirement that \(\int L(\theta) d\theta = 1\), and indeed this integral often diverges or depends on the data in complex ways. The likelihood measures relative support—how much more or less the data support one parameter value versus another—not absolute probability.

The maximum likelihood estimator (MLE) is the parameter value that maximizes the likelihood:

(39)\[\hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \Theta} L(\theta)\]

Intuitively, the MLE is the parameter value that makes the observed data most probable.

The Log-Likelihood

In practice, we almost always work with the log-likelihood:

(40)\[\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i | \theta)\]

This transformation offers multiple advantages:

Numerical stability: Products of many small probabilities underflow floating-point arithmetic; sums of log-probabilities do not.
Computational convenience: Sums are easier to differentiate and optimize than products.
Theoretical elegance: Asymptotic theory for the log-likelihood has cleaner formulations.

Since \(\log\) is monotonically increasing, maximizing \(\ell(\theta)\) is equivalent to maximizing \(L(\theta)\). The MLE is unchanged.

Likelihood function concept showing data, likelihood, and log-likelihood — Fig. 83 **Figure 3.2.1**: The likelihood function for Poisson data. (a) Histogram of observed counts with sample mean \(\bar{x} = 3.10\). (b) Normalized likelihood function \(L(\lambda)/L(\hat{\lambda})\) showing the MLE at the peak. (c) Log-likelihood function \(\ell(\lambda)\), whose curvature at the maximum relates to Fisher information. The true parameter \(\lambda = 3.5\) is marked for comparison.

Example 💡 Normal Likelihood

Setup: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\) with both parameters unknown. The density is:

\[f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

Log-likelihood derivation:

\[\begin{split}\ell(\mu, \sigma^2) &= \sum_{i=1}^n \log f(x_i|\mu, \sigma^2) \\ &= \sum_{i=1}^n \left[ -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right] \\ &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\end{split}\]

This expression will be maximized shortly to derive the normal MLEs.

The Score Function

The score function is the gradient of the log-likelihood with respect to the parameters. It plays a central role in both the theory and computation of MLE.

Definition and Properties

For a scalar parameter \(\theta\), the score function is:

(41)\[U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(X_i | \theta)\]

For a parameter vector \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^\top\), the score is the gradient vector:

\[\mathbf{U}(\boldsymbol{\theta}) = \nabla \ell(\boldsymbol{\theta}) = \left( \frac{\partial \ell}{\partial \theta_1}, \ldots, \frac{\partial \ell}{\partial \theta_p} \right)^\top\]

The MLE is typically found by solving the score equation \(U(\hat{\theta}) = 0\). At the maximum, the gradient of the log-likelihood vanishes.

The score function has a fundamental property that underlies much of likelihood theory:

Theorem: Score Has Mean Zero

Under regularity conditions (see below), the expected value of the score function at the true parameter value is zero:

(42)\[\mathbb{E}_{\theta_0}\left[ U(\theta_0) \right] = 0\]

Proof: For a single observation, the score contribution is \(u(X; \theta) = \frac{\partial}{\partial \theta} \log f(X|\theta)\). We compute:

\[\begin{split}\mathbb{E}_{\theta}\left[ \frac{\partial}{\partial \theta} \log f(X|\theta) \right] &= \int \frac{\partial}{\partial \theta} \log f(x|\theta) \cdot f(x|\theta) \, dx \\ &= \int \frac{1}{f(x|\theta)} \cdot \frac{\partial f(x|\theta)}{\partial \theta} \cdot f(x|\theta) \, dx \\ &= \int \frac{\partial f(x|\theta)}{\partial \theta} \, dx\end{split}\]

Now, assuming we can interchange differentiation and integration (this is one of the regularity conditions):

\[\int \frac{\partial f(x|\theta)}{\partial \theta} \, dx = \frac{\partial}{\partial \theta} \int f(x|\theta) \, dx = \frac{\partial}{\partial \theta} (1) = 0\]

Since the score for \(n\) iid observations is the sum of \(n\) individual scores, each with mean zero, we have \(\mathbb{E}[U(\theta_0)] = 0\). ∎

This result has an intuitive interpretation: at the true parameter value, the log-likelihood is “locally flat” on average—positive and negative slopes cancel out when we average over all possible datasets.

Score function geometry showing zero-crossing at MLE and score distribution — Fig. 84 **Figure 3.2.2**: The score function and its properties. (a) For Poisson data, the score \(U(\lambda) = n(\bar{x}/\lambda - 1)\) crosses zero at the MLE \(\hat{\lambda} = \bar{x}\). When \(U > 0\) (green region), the likelihood increases with \(\lambda\); when \(U < 0\) (red region), it decreases. (b) At the true parameter \(\lambda_0\), the score has mean zero and variance equal to the Fisher information: \(\text{Var}[U(\lambda_0)] = nI_1(\lambda_0)\).

Fisher Information

While the score tells us the direction of steepest ascent on the likelihood surface, the Fisher information tells us about the surface’s curvature—how sharply peaked the likelihood is around its maximum.

Definition

The Fisher information for a scalar parameter \(\theta\) is defined as the variance of the score:

(43)\[I(\theta) = \text{Var}_\theta\left[ U(\theta) \right] = \mathbb{E}_\theta\left[ U(\theta)^2 \right]\]

where the second equality follows from \(\mathbb{E}[U(\theta)] = 0\).

Notation Convention: Expected vs. Observed Information

We adopt the following notation throughout this chapter:

Per-observation vs. total information:

\(I_1(\theta)\) = Fisher information from a single observation
\(I_n(\theta) = n \cdot I_1(\theta)\) = total Fisher information from \(n\) iid observations

Expected vs. observed information:

\(I(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]\) = expected (Fisher) information
\(J(\theta) = -\frac{\partial^2 \ell}{\partial \theta^2}\bigg|_{\text{observed}}\) = observed information (data-dependent)

Under correct model specification, \(\mathbb{E}[J(\theta_0)] = I(\theta_0)\). Under misspecification, these may differ, leading to the sandwich variance estimator (see below).

Convention in formulas: Unless subscripted, \(I(\theta)\) denotes per-observation expected information \(I_1(\theta)\). The Wald statistic \(W = n I_1(\hat{\theta})(\hat{\theta} - \theta_0)^2\) uses per-observation information.

Under the same regularity conditions that gave us \(\mathbb{E}[U] = 0\), we have an equivalent expression involving the second derivative:

(44)\[I(\theta) = -\mathbb{E}_\theta\left[ \frac{\partial^2 \ell}{\partial \theta^2} \right]\]

This is the information equality: the variance of the score equals the negative expected Hessian.

Proof of information equality: We differentiate the identity \(\mathbb{E}_\theta[U(\theta)] = 0\) with respect to \(\theta\):

\[\frac{\partial}{\partial \theta} \int \frac{\partial \log f(x|\theta)}{\partial \theta} f(x|\theta) \, dx = 0\]

Applying the product rule inside the integral:

\[\int \frac{\partial^2 \log f}{\partial \theta^2} f \, dx + \int \frac{\partial \log f}{\partial \theta} \cdot \frac{\partial f}{\partial \theta} \, dx = 0\]

The first integral is \(\mathbb{E}[\partial^2 \ell / \partial \theta^2]\). For the second integral, note that \(\frac{\partial f}{\partial \theta} = f \cdot \frac{\partial \log f}{\partial \theta}\), so:

\[\int \frac{\partial \log f}{\partial \theta} \cdot f \cdot \frac{\partial \log f}{\partial \theta} \, dx = \mathbb{E}\left[\left(\frac{\partial \log f}{\partial \theta}\right)^2\right] = \mathbb{E}[U^2]\]

Therefore: \(\mathbb{E}[\partial^2 \ell / \partial \theta^2] + \mathbb{E}[U^2] = 0\), giving \(I(\theta) = \mathbb{E}[U^2] = -\mathbb{E}[\partial^2 \ell / \partial \theta^2]\). ∎

Additivity for IID Samples

For \(n\) iid observations, the Fisher information is additive:

(45)\[I_n(\theta) = n \cdot I_1(\theta)\]

where \(I_1(\theta)\) is the information from a single observation. This follows because the score is a sum of independent terms, and variance is additive for independent random variables.

This additivity has profound implications: more data means more information. Specifically, information grows linearly with sample size, which (as we will see) implies that standard errors shrink as \(1/\sqrt{n}\).

Fisher information as curvature of log-likelihood — Fig. 85 **Figure 3.2.3**: Fisher information as log-likelihood curvature. (a) High information (\(I = 4.0\)): a sharply peaked log-likelihood means small changes in \(\theta\) produce large changes in \(\ell\)—the data strongly constrain the parameter. (b) Low information (\(I = 0.44\)): a broad peak indicates the data are consistent with many parameter values. (c) For the Bernoulli distribution, \(I(p) = 1/[p(1-p)]\) is minimized at \(p = 0.5\), where the outcome is most uncertain and each observation is most informative about \(p\).

Multivariate Fisher Information

For a parameter vector \(\boldsymbol{\theta} \in \mathbb{R}^p\), the Fisher information becomes a \(p \times p\) matrix:

(46)\[\mathbf{I}(\boldsymbol{\theta})_{jk} = \mathbb{E}\left[ \frac{\partial \ell}{\partial \theta_j} \cdot \frac{\partial \ell}{\partial \theta_k} \right] = -\mathbb{E}\left[ \frac{\partial^2 \ell}{\partial \theta_j \partial \theta_k} \right]\]

The Fisher information matrix is positive semi-definite (as a covariance matrix must be), and under regularity conditions, positive definite.

Fisher information matrix and parameter orthogonality — Fig. 86 **Figure 3.2.12**: Multivariate Fisher information for the Normal distribution. (a) Joint sampling distribution of \((\hat{\mu}, \hat{\sigma}^2)\) from 1000 simulations (\(n=50\)). The elliptical 95% contour reflects the diagonal Fisher information matrix—the near-zero correlation (−0.04) confirms that \(\hat{\mu}\) and \(\hat{\sigma}^2\) are asymptotically independent. (b) Per-observation information components: \(I_{\mu\mu} = 1/\sigma^2\) for the mean and \(I_{\sigma^2\sigma^2} = 1/(2\sigma^4)\) for the variance. The off-diagonal \(I_{\mu,\sigma^2} = 0\) demonstrates **parameter orthogonality**—inference about one parameter does not depend on the other.

Example 💡 Fisher Information for Exponential Distribution

Setup: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exponential}(\lambda)\) where \(\lambda\) is the rate parameter. The density is \(f(x|\lambda) = \lambda e^{-\lambda x}\) for \(x > 0\).

Log-likelihood: \(\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i\)

Score: \(U(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = \frac{n}{\lambda} - n\bar{x}\)

Second derivative: \(\frac{\partial^2 \ell}{\partial \lambda^2} = -\frac{n}{\lambda^2}\)

Fisher information: \(I_n(\lambda) = -\mathbb{E}\left[-\frac{n}{\lambda^2}\right] = \frac{n}{\lambda^2}\)

The per-observation information is \(I_1(\lambda) = 1/\lambda^2\). Lower rates (longer mean lifetimes) provide more information per observation. This may seem counterintuitive: shouldn’t more events (higher rate) tell us more? The resolution lies in understanding what “information about \(\lambda\)” means.

When \(\lambda\) is small, lifetimes are long and vary considerably—the data spread out, allowing us to distinguish between nearby \(\lambda\) values. When \(\lambda\) is large, lifetimes are short and tightly concentrated near zero—the data provide less resolution for distinguishing \(\lambda\) from \(\lambda + \epsilon\).

Reparameterization perspective: In terms of the mean lifetime \(\theta = 1/\lambda\), the information is \(I_1(\theta) = 1/\theta^2\). The coefficient of variation of the MLE is \(\text{CV}(\hat{\theta}) = \text{SE}(\hat{\theta})/\theta = 1/\sqrt{n}\), which is constant regardless of \(\theta\). The absolute precision scales with the parameter, but relative precision is parameter-independent.

Closed-Form Maximum Likelihood Estimators

For many common distributions, setting the score equal to zero yields explicit formulas for the MLE.

Normal Distribution

For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\):

Score equations:

\[\begin{split}\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \\ \frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (x_i - \mu)^2 = 0\end{split}\]

Solutions:

(47)\[\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\]

The MLE for \(\mu\) is the sample mean—unbiased and efficient. The MLE for \(\sigma^2\) is the biased sample variance (dividing by \(n\) rather than \(n-1\)). This illustrates an important point: MLEs are not always unbiased, though their bias typically vanishes as \(n \to \infty\).

Exponential Distribution

For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exponential}(\lambda)\) (rate parameterization):

Setting \(U(\lambda) = n/\lambda - n\bar{x} = 0\) gives:

(48)\[\hat{\lambda} = \frac{1}{\bar{x}}\]

The MLE is the reciprocal of the sample mean.

Poisson Distribution

For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\lambda)\):

Log-likelihood: \(\ell(\lambda) = \sum_{i=1}^n (x_i \log \lambda - \lambda - \log(x_i!))\)

Score: \(U(\lambda) = \frac{\sum_{i=1}^n x_i}{\lambda} - n = \frac{n\bar{x}}{\lambda} - n\)

Setting \(U(\lambda) = 0\):

(49)\[\hat{\lambda} = \bar{x}\]

The MLE equals the sample mean—exactly what we would expect given that \(\mathbb{E}[X] = \lambda\) for the Poisson.

Bernoulli and Binomial

For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Bernoulli}(p)\):

Log-likelihood: \(\ell(p) = \sum_{i=1}^n [x_i \log p + (1-x_i) \log(1-p)]\)

Score: \(U(p) = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p}\)

Setting \(U(p) = 0\) and solving:

(50)\[\hat{p} = \bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

The MLE is the sample proportion.

import numpy as np
from scipy import stats

def mle_normal(x):
    """
    Compute MLEs for Normal(μ, σ²) distribution.

    Parameters
    ----------
    x : array-like
        Observed data.

    Returns
    -------
    dict
        MLEs and standard errors.
    """
    x = np.asarray(x)
    n = len(x)

    # Point estimates
    mu_hat = np.mean(x)
    sigma2_hat = np.var(x, ddof=0)  # MLE uses divisor n, not n-1

    # Fisher information (evaluated at MLE)
    # I(μ) = n/σ², I(σ²) = n/(2σ⁴)
    se_mu = np.sqrt(sigma2_hat / n)
    se_sigma2 = sigma2_hat * np.sqrt(2 / n)

    return {
        'mu_hat': mu_hat,
        'sigma2_hat': sigma2_hat,
        'se_mu': se_mu,
        'se_sigma2': se_sigma2
    }

def mle_exponential(x):
    """
    Compute MLE for Exponential(λ) distribution (rate parameterization).

    Parameters
    ----------
    x : array-like
        Observed data (must be positive).

    Returns
    -------
    dict
        MLE and standard error.
    """
    x = np.asarray(x)
    n = len(x)

    lambda_hat = 1 / np.mean(x)

    # Fisher information: I(λ) = n/λ²
    # SE(λ̂) = λ/√n (evaluated at MLE)
    se_lambda = lambda_hat / np.sqrt(n)

    return {
        'lambda_hat': lambda_hat,
        'se_lambda': se_lambda
    }

# Demonstration
rng = np.random.default_rng(42)

# Normal data
true_mu, true_sigma = 5.0, 2.0
x_normal = rng.normal(true_mu, true_sigma, size=100)
result_normal = mle_normal(x_normal)
print("Normal MLE Results:")
print(f"  μ̂ = {result_normal['mu_hat']:.4f} (true: {true_mu}), SE = {result_normal['se_mu']:.4f}")
print(f"  σ̂² = {result_normal['sigma2_hat']:.4f} (true: {true_sigma**2}), SE = {result_normal['se_sigma2']:.4f}")

# Exponential data
true_lambda = 0.5
x_exp = rng.exponential(scale=1/true_lambda, size=100)
result_exp = mle_exponential(x_exp)
print(f"\nExponential MLE Results:")
print(f"  λ̂ = {result_exp['lambda_hat']:.4f} (true: {true_lambda}), SE = {result_exp['se_lambda']:.4f}")

Normal MLE Results:
  μ̂ = 4.8572 (true: 5.0), SE = 0.1918
  σ̂² = 3.6800 (true: 4.0), SE = 0.5205

Exponential MLE Results:
  λ̂ = 0.4834 (true: 0.5), SE = 0.0483

Exact Finite-Sample Properties

While our focus is on asymptotic theory, several exact results are available for common distributions and provide useful benchmarks.

Normal distribution: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\):

The MLEs \(\hat{\mu} = \bar{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2\) are independent
\(\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/n)\) exactly
\(\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}\) exactly

The \(\chi^2_{n-1}\) result implies \(\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2\)—the MLE is biased. The unbiased estimator \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\) corrects this.

Exponential distribution: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exp}(\lambda)\) with MLE \(\hat{\lambda} = 1/\bar{X}\):

\(n\bar{X}\) has a Gamma(\(n, \lambda\)) distribution
The exact bias is \(\mathbb{E}[\hat{\lambda}] = \lambda \cdot \frac{n}{n-1}\) for \(n > 1\)
The exact variance is \(\text{Var}(\hat{\lambda}) = \lambda^2 \cdot \frac{n}{(n-1)^2(n-2)}\) for \(n > 2\)

The bias-corrected estimator \(\tilde{\lambda} = \frac{n-1}{n} \hat{\lambda} = \frac{n-1}{n\bar{X}}\) is unbiased.

import numpy as np
from scipy import stats

def verify_exponential_exact_results():
    """Verify exact finite-sample results for exponential MLE."""
    rng = np.random.default_rng(42)
    true_lambda = 2.0
    n = 10
    n_sim = 100000

    mles = np.zeros(n_sim)
    for i in range(n_sim):
        x = rng.exponential(scale=1/true_lambda, size=n)
        mles[i] = 1 / np.mean(x)

    # Theoretical exact values
    theory_mean = true_lambda * n / (n - 1)
    theory_var = true_lambda**2 * n / ((n-1)**2 * (n-2))

    print("EXACT FINITE-SAMPLE RESULTS: EXPONENTIAL MLE")
    print("=" * 50)
    print(f"n = {n}, true λ = {true_lambda}")
    print(f"\n{'Quantity':<20} {'Theory':>15} {'Empirical':>15}")
    print("-" * 50)
    print(f"{'E[λ̂]':<20} {theory_mean:>15.6f} {np.mean(mles):>15.6f}")
    print(f"{'Var(λ̂)':<20} {theory_var:>15.6f} {np.var(mles):>15.6f}")
    print(f"{'Bias':<20} {theory_mean - true_lambda:>15.6f} {np.mean(mles) - true_lambda:>15.6f}")

verify_exponential_exact_results()

EXACT FINITE-SAMPLE RESULTS: EXPONENTIAL MLE
==================================================
n = 10, true λ = 2.0

Quantity                   Theory       Empirical
--------------------------------------------------
E[λ̂]                     2.222222        2.221456
Var(λ̂)                   0.617284        0.615890
Bias                      0.222222        0.221456

Exact versus asymptotic properties of exponential MLE — Fig. 87 **Figure 3.2.10**: Exact versus asymptotic finite-sample properties of the exponential MLE. (a) The MLE is biased upward: \(\mathbb{E}[\hat{\lambda}] = \lambda n/(n-1) > \lambda\), with bias decreasing as \(n\) increases. (b) The exact variance \(\lambda^2 n/[(n-1)^2(n-2)]\) substantially exceeds the asymptotic approximation \(\lambda^2/n\) for small samples. The \(n-1\) and \(n-2\) factors explain why finite-sample corrections matter.

When Closed Forms Don’t Exist

Not all MLEs have closed-form solutions. Two important examples:

Gamma distribution: For \(X \sim \text{Gamma}(\alpha, \beta)\), the score equation for \(\alpha\) involves the digamma function \(\psi(\alpha) = \frac{d}{d\alpha} \log \Gamma(\alpha)\):

\[n[\log \beta + \psi(\alpha)] = \sum_{i=1}^n \log x_i\]

This transcendental equation has no closed-form solution; numerical methods are required.

Beta distribution: Similarly, the Beta(\(\alpha, \beta\)) MLEs require solving a system involving digamma functions.

Mixture models: Even simple mixtures like \(p \cdot \mathcal{N}(\mu_1, \sigma_1^2) + (1-p) \cdot \mathcal{N}(\mu_2, \sigma_2^2)\) have likelihood surfaces that preclude closed-form solutions.

For these cases, we turn to numerical optimization.

Numerical Optimization for MLE

When closed-form solutions are unavailable, we compute MLEs numerically by optimizing the log-likelihood. Two classical algorithms dominate: Newton-Raphson and Fisher scoring.

Newton-Raphson Method

Newton-Raphson is a general-purpose optimization algorithm based on quadratic approximation of the objective function.

At iteration \(t\), we approximate the log-likelihood near \(\theta^{(t)}\) by its second-order Taylor expansion:

\[\ell(\theta) \approx \ell(\theta^{(t)}) + (\theta - \theta^{(t)}) U(\theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^2 H(\theta^{(t)})\]

where \(U(\theta) = \partial \ell / \partial \theta\) is the score and \(H(\theta) = \partial^2 \ell / \partial \theta^2\) is the Hessian (second derivative).

Maximizing this quadratic approximation gives the update:

(51)\[\theta^{(t+1)} = \theta^{(t)} - \frac{U(\theta^{(t)})}{H(\theta^{(t)})} = \theta^{(t)} - [H(\theta^{(t)})]^{-1} U(\theta^{(t)})\]

In the multivariate case with parameter vector \(\boldsymbol{\theta}\):

\[\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - [\mathbf{H}(\boldsymbol{\theta}^{(t)})]^{-1} \mathbf{U}(\boldsymbol{\theta}^{(t)})\]

where \(\mathbf{H}\) is the \(p \times p\) Hessian matrix.

Properties of Newton-Raphson:

Quadratic convergence: Near the optimum, the error decreases quadratically—each iteration roughly doubles the number of correct digits.
Local convergence: Convergence is only guaranteed if we start sufficiently close to the optimum.
Potential instability: If the Hessian is not negative definite (the log-likelihood is not locally concave), the algorithm may diverge or converge to a saddle point.

Newton-Raphson convergence for Gamma MLE — Fig. 88 **Figure 3.2.5**: Newton-Raphson optimization for the Gamma distribution. (a) The profile log-likelihood \(\ell_p(\alpha)\) with Newton-Raphson iterates shown as points. Starting from a method-of-moments initial value near \(\alpha = 2.5\), the algorithm converges to the MLE \(\hat{\alpha} = 4.02\) in just 5 iterations. (b) Quadratic convergence: the error \(|\alpha^{(t)} - \hat{\alpha}|\) decreases super-exponentially, roughly doubling the number of correct digits per iteration.

Fisher Scoring

Fisher scoring replaces the observed Hessian \(H(\theta)\) with its expected value, the negative Fisher information:

(52)\[\theta^{(t+1)} = \theta^{(t)} + [I(\theta^{(t)})]^{-1} U(\theta^{(t)})\]

Why Fisher scoring?

Guaranteed stability: The Fisher information matrix is always positive definite (under regularity conditions), ensuring we always move in an ascent direction.
Simpler computation: For some models, the expected information has a simpler form than the observed Hessian.
Statistical interpretation: Fisher scoring is equivalent to iteratively reweighted least squares (IRLS) for generalized linear models.

For exponential families with canonical links, Newton-Raphson and Fisher scoring are identical: the observed and expected information coincide.

import numpy as np
from scipy import special

def mle_gamma_fisher_scoring(x, max_iter=100, tol=1e-8):
    """
    Compute MLE for Gamma(α, β) using Fisher scoring.

    Uses shape-rate parameterization where E[X] = α/β, Var(X) = α/β².

    Parameters
    ----------
    x : array-like
        Observed data (must be positive).
    max_iter : int
        Maximum iterations.
    tol : float
        Convergence tolerance.

    Returns
    -------
    dict
        MLEs, standard errors, and convergence info.
    """
    x = np.asarray(x)
    n = len(x)
    x_bar = np.mean(x)
    log_x_bar = np.mean(np.log(x))

    # Method of moments initialization
    s2 = np.var(x, ddof=1)
    alpha = x_bar**2 / s2
    beta = x_bar / s2

    history = [(alpha, beta)]

    for iteration in range(max_iter):
        # Score functions
        # ∂ℓ/∂α = n[log(β) - ψ(α)] + Σlog(xᵢ)
        # ∂ℓ/∂β = nα/β - Σxᵢ
        psi_alpha = special.digamma(alpha)
        score_alpha = n * (np.log(beta) - psi_alpha) + n * log_x_bar
        score_beta = n * alpha / beta - n * x_bar

        # Fisher information matrix
        # I_αα = n·ψ'(α), I_ββ = nα/β², I_αβ = -n/β
        psi1_alpha = special.polygamma(1, alpha)  # trigamma
        I_aa = n * psi1_alpha
        I_bb = n * alpha / beta**2
        I_ab = -n / beta

        # Invert 2x2 Fisher information
        det = I_aa * I_bb - I_ab**2
        I_inv = np.array([[I_bb, -I_ab], [-I_ab, I_aa]]) / det

        # Fisher scoring update
        score = np.array([score_alpha, score_beta])
        delta = I_inv @ score

        alpha_new = alpha + delta[0]
        beta_new = beta + delta[1]

        # Ensure parameters stay positive
        alpha_new = max(alpha_new, 1e-10)
        beta_new = max(beta_new, 1e-10)

        history.append((alpha_new, beta_new))

        # Check convergence
        if np.max(np.abs(delta)) < tol:
            break

        alpha, beta = alpha_new, beta_new

    # Standard errors from inverse Fisher information at MLE
    psi1_alpha = special.polygamma(1, alpha)
    I_aa = n * psi1_alpha
    I_bb = n * alpha / beta**2
    I_ab = -n / beta
    det = I_aa * I_bb - I_ab**2
    I_inv = np.array([[I_bb, -I_ab], [-I_ab, I_aa]]) / det

    return {
        'alpha_hat': alpha,
        'beta_hat': beta,
        'se_alpha': np.sqrt(I_inv[0, 0]),
        'se_beta': np.sqrt(I_inv[1, 1]),
        'iterations': iteration + 1,
        'converged': iteration + 1 < max_iter,
        'history': history
    }

# Demonstration
rng = np.random.default_rng(42)
true_alpha, true_beta = 3.0, 2.0
x_gamma = rng.gamma(shape=true_alpha, scale=1/true_beta, size=200)

result = mle_gamma_fisher_scoring(x_gamma)
print("Gamma MLE via Fisher Scoring:")
print(f"  α̂ = {result['alpha_hat']:.4f} (true: {true_alpha}), SE = {result['se_alpha']:.4f}")
print(f"  β̂ = {result['beta_hat']:.4f} (true: {true_beta}), SE = {result['se_beta']:.4f}")
print(f"  Converged in {result['iterations']} iterations")

Gamma MLE via Fisher Scoring:
  α̂ = 3.1538 (true: 3.0), SE = 0.3209
  β̂ = 2.0893 (true: 2.0), SE = 0.2257
  Converged in 6 iterations

Connection to Generalized Linear Models

Fisher scoring becomes especially elegant for exponential family models with canonical links. In Section 3.7, we show that:

For an exponential family response with canonical link, the observed and expected information are equal
Therefore, Newton-Raphson and Fisher scoring produce identical iterations
Fisher scoring reduces to Iteratively Reweighted Least Squares (IRLS), solving at each step:

\[\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{z}^{(t)}\]

where \(\mathbf{W}\) is a diagonal weight matrix and \(\mathbf{z}\) is a “working response”

This connection—from general MLE theory through Fisher scoring to IRLS for GLMs—illustrates how foundational concepts build toward practical algorithms.

Practical Implementation with SciPy

For production code, scipy.optimize provides robust, well-tested optimization routines:

import numpy as np
from scipy import optimize, stats

def mle_gamma_scipy(x):
    """
    Compute Gamma MLE using scipy.optimize.

    Parameters
    ----------
    x : array-like
        Observed data.

    Returns
    -------
    dict
        MLEs and optimization results.
    """
    x = np.asarray(x)
    n = len(x)

    def neg_log_likelihood(params):
        alpha, beta = params
        if alpha <= 0 or beta <= 0:
            return np.inf
        # Gamma log-likelihood (rate parameterization)
        return -(n * alpha * np.log(beta) - n * np.log(special.gamma(alpha))
                 + (alpha - 1) * np.sum(np.log(x)) - beta * np.sum(x))

    # Method of moments starting values
    x_bar = np.mean(x)
    s2 = np.var(x, ddof=1)
    alpha0 = x_bar**2 / s2
    beta0 = x_bar / s2

    # L-BFGS-B with bounds to keep parameters positive
    result = optimize.minimize(
        neg_log_likelihood,
        x0=[alpha0, beta0],
        method='L-BFGS-B',
        bounds=[(1e-10, None), (1e-10, None)]
    )

    # Standard errors via numerical Hessian
    hess_inv = result.hess_inv.todense() if hasattr(result.hess_inv, 'todense') else result.hess_inv
    se = np.sqrt(np.diag(hess_inv))

    return {
        'alpha_hat': result.x[0],
        'beta_hat': result.x[1],
        'se_alpha': se[0] if len(se) > 0 else np.nan,
        'se_beta': se[1] if len(se) > 1 else np.nan,
        'success': result.success,
        'message': result.message
    }

# Compare with scipy.stats.gamma.fit
# Note: scipy uses shape, loc, scale parameterization
fitted = stats.gamma.fit(x_gamma, floc=0)  # Fix location at 0
print(f"\nscipy.stats.gamma.fit: shape={fitted[0]:.4f}, scale={fitted[2]:.4f}")
print(f"  Implied β = 1/scale = {1/fitted[2]:.4f}")

scipy.stats.gamma.fit: shape=3.1538, scale=0.4786
  Implied β = 1/scale = 2.0893

Common Pitfall ⚠️ Parameterization Consistency

Different software packages use different parameterizations for the same distribution:

Gamma: SciPy uses shape-scale; some texts use shape-rate (β = 1/scale)
Exponential: SciPy uses scale (mean); some texts use rate (1/mean)
Normal: The second parameter is always variance in this text; some use standard deviation

Always verify parameterization before comparing results across packages or implementing formulas from different sources.

Practical Safeguards for Numerical MLE

Real-world MLE optimization requires more than the basic Newton-Raphson update. Several safeguards improve robustness and reliability.

1. Line search and step control

The pure Newton step \(\theta^{(t+1)} = \theta^{(t)} - H^{-1} U\) may overshoot, especially far from the optimum. Line search modifies this to:

\[\theta^{(t+1)} = \theta^{(t)} + \alpha_t \cdot d_t\]

where \(d_t = -H^{-1} U\) is the Newton direction and \(\alpha_t \in (0, 1]\) is chosen to ensure sufficient increase in \(\ell(\theta)\). The Armijo condition requires:

\[\ell(\theta^{(t)} + \alpha d_t) \geq \ell(\theta^{(t)}) + c \cdot \alpha \cdot U^\top d_t\]

for some \(c \in (0, 1)\) (typically \(c = 10^{-4}\)). Start with \(\alpha = 1\) and backtrack by halving until the condition holds.

2. Trust region methods

Instead of line search along a direction, trust region methods constrain the step size directly:

\[\theta^{(t+1)} = \arg\max_\theta \left\{ \ell(\theta^{(t)}) + (\theta - \theta^{(t)})^\top U + \frac{1}{2}(\theta - \theta^{(t)})^\top H (\theta - \theta^{(t)}) \right\}\]

subject to \(\|\theta - \theta^{(t)}\| \leq \Delta_t\). The trust region radius \(\Delta_t\) adapts based on how well the quadratic approximation predicts actual improvement.

3. Parameter transformations for constraints

Many parameters have natural constraints. Transform to an unconstrained scale:

Table 27 Common Parameter Transformations
Constraint	Transformation	Original parameter
\(\sigma^2 > 0\)	\(\eta = \log(\sigma^2)\)	\(\sigma^2 = e^\eta\)
\(0 < p < 1\)	\(\eta = \log(p/(1-p))\)	\(p = 1/(1 + e^{-\eta})\)
\(\lambda > 0\)	\(\eta = \log(\lambda)\)	\(\lambda = e^\eta\)
\(\rho \in (-1, 1)\)	\(\eta = \text{arctanh}(\rho)\)	\(\rho = \tanh(\eta)\)

Optimize in \(\eta\), then transform back. The Jacobian of the transformation enters the standard error calculation.

4. Scaling and conditioning

Poor numerical conditioning causes optimization difficulties:

Center predictors: In regression, use \(\tilde{x}_j = x_j - \bar{x}_j\) to reduce correlation between intercept and slopes
Scale parameters: If parameters differ by orders of magnitude, rescale so they’re comparable
Check condition number: If \(\kappa(H) = \lambda_{\max} / \lambda_{\min} > 10^6\), the problem is ill-conditioned

from scipy.optimize import minimize

def mle_with_safeguards(neg_log_lik, x0, bounds=None, transform=None):
    """
    MLE with practical safeguards.

    Parameters
    ----------
    neg_log_lik : callable
        Negative log-likelihood function.
    x0 : ndarray
        Initial parameter values.
    bounds : list of tuples, optional
        Parameter bounds for L-BFGS-B.
    transform : dict, optional
        Parameter transformations {'to_unconstrained': func, 'from_unconstrained': func}.

    Returns
    -------
    result : OptimizeResult
        Optimization result with MLE and diagnostics.
    """
    # Apply transformation if provided
    if transform is not None:
        x0_transformed = transform['to_unconstrained'](x0)

        def neg_ll_transformed(eta):
            theta = transform['from_unconstrained'](eta)
            return neg_log_lik(theta)

        result = minimize(neg_ll_transformed, x0_transformed,
                          method='BFGS',  # No bounds needed after transform
                          options={'gtol': 1e-8, 'maxiter': 1000})

        result.x = transform['from_unconstrained'](result.x)
    else:
        # Use L-BFGS-B with bounds if provided
        result = minimize(neg_log_lik, x0,
                          method='L-BFGS-B' if bounds else 'BFGS',
                          bounds=bounds,
                          options={'gtol': 1e-8, 'maxiter': 1000})

    return result

Asymptotic Properties of MLEs

The true power of maximum likelihood estimation lies in its asymptotic properties. Under regularity conditions, MLEs are consistent, asymptotically normal, and asymptotically efficient.

Regularity Conditions

The classical asymptotic theory requires the following conditions. Let \(\theta_0\) denote the true parameter value:

Regularity Conditions for MLE Asymptotics

R1. Identifiability: Different parameter values give different distributions: \(\theta_1 \neq \theta_2 \Rightarrow f(\cdot|\theta_1) \neq f(\cdot|\theta_2)\).

R2. Common support: The support of \(f(x|\theta)\) does not depend on \(\theta\).

R3. Interior true value: \(\theta_0\) is in the interior of the parameter space \(\Theta\).

R4. Differentiability: \(\log f(x|\theta)\) is three times continuously differentiable in \(\theta\) for all \(x\) in the support.

R5. Integrability: We can interchange differentiation and integration: \(\frac{\partial}{\partial \theta} \int f(x|\theta) dx = \int \frac{\partial f(x|\theta)}{\partial \theta} dx\).

R6. Finite Fisher information: \(0 < I(\theta_0) < \infty\).

R7. Uniform integrability: \(\left|\frac{\partial^3 \log f(x|\theta)}{\partial \theta^3}\right|\) is bounded by a function with finite expectation in a neighborhood of \(\theta_0\).

These conditions exclude some pathological cases (e.g., uniform distributions with unknown endpoints) but are satisfied by most common parametric families.

Consistency

Theorem: Consistency of MLE

Under regularity conditions R1–R3, the MLE is consistent:

\[\hat{\theta}_n \xrightarrow{p} \theta_0 \quad \text{as } n \to \infty\]

Proof sketch: The key insight is that maximizing the log-likelihood is equivalent to maximizing the Kullback-Leibler divergence from the true distribution. Define:

\[M(\theta) = \mathbb{E}_{\theta_0}\left[\log f(X|\theta)\right]\]

By Jensen’s inequality and strict concavity of the log function, \(M(\theta)\) is uniquely maximized at \(\theta = \theta_0\). The sample average \(\frac{1}{n}\ell(\theta)\) converges uniformly to \(M(\theta)\) by the uniform law of large numbers, and the maximizer of the sample average converges to the maximizer of the limit. ∎

Asymptotic Normality

Theorem: Asymptotic Normality of MLE

Under regularity conditions R1–R7, the MLE is asymptotically normal:

(53)\[\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\left(0, I_1(\theta_0)^{-1}\right)\]

Equivalently, for large \(n\):

\[\hat{\theta}_n \stackrel{\cdot}{\sim} \mathcal{N}\left(\theta_0, \frac{1}{nI_1(\theta_0)}\right)\]

Proof: Taylor expand the score function around \(\theta_0\):

\[0 = U(\hat{\theta}_n) = U(\theta_0) + (\hat{\theta}_n - \theta_0) \frac{\partial U}{\partial \theta}\bigg|_{\tilde{\theta}}\]

for some \(\tilde{\theta}\) between \(\hat{\theta}_n\) and \(\theta_0\). Rearranging:

\[\sqrt{n}(\hat{\theta}_n - \theta_0) = \frac{\frac{1}{\sqrt{n}} U(\theta_0)}{-\frac{1}{n} \frac{\partial U}{\partial \theta}\big|_{\tilde{\theta}}}\]

The numerator: \(\frac{1}{\sqrt{n}} U(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n u_i\) where \(u_i\) has mean 0 and variance \(I_1(\theta_0)\). By the CLT:

\[\frac{1}{\sqrt{n}} U(\theta_0) \xrightarrow{d} \mathcal{N}(0, I_1(\theta_0))\]

The denominator: \(-\frac{1}{n} \frac{\partial U}{\partial \theta} = -\frac{1}{n} \frac{\partial^2 \ell}{\partial \theta^2} \xrightarrow{p} I_1(\theta_0)\) by the LLN and consistency of \(\hat{\theta}_n\).

By Slutsky’s theorem:

\[\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \frac{\mathcal{N}(0, I_1(\theta_0))}{I_1(\theta_0)} = \mathcal{N}\left(0, \frac{1}{I_1(\theta_0)}\right) \quad \blacksquare\]

This result is remarkably powerful: regardless of the underlying distribution, MLEs have approximately normal sampling distributions with variance determined by the Fisher information.

Asymptotic normality of MLE demonstrated via simulation — Fig. 89 **Figure 3.2.4**: Asymptotic normality of the exponential MLE. The standardized statistic \(\sqrt{nI_1(\lambda_0)}(\hat{\lambda} - \lambda_0)\) converges to \(N(0,1)\) as \(n\) increases. For \(n=5\), the distribution is right-skewed (skewness = 2.63); by \(n=200\), it closely matches the standard normal (skewness = 0.28). The K-S test p-values confirm improved approximation with larger samples.

Model Misspecification and the Sandwich Estimator

The asymptotic theory developed above assumes the model is correctly specified—that the data truly come from \(f(x|\theta_0)\) for some \(\theta_0 \in \Theta\). In practice, all models are approximations. What happens when the model is wrong?

The quasi-MLE target: Even under misspecification, the MLE converges to a well-defined limit: the parameter value \(\theta^*\) that minimizes the Kullback-Leibler divergence from the true data-generating distribution \(g(x)\) to the model family:

\[\theta^* = \arg\min_\theta \text{KL}(g \| f_\theta) = \arg\min_\theta \left[ -\int g(x) \log f(x|\theta) \, dx \right]\]

This is the “best approximation” within the model family, even if no member of the family is correct.

Asymptotic normality still holds, but with a different variance. Define:

\[\begin{split}\mathbf{A} &= -\mathbb{E}_{g}\left[\frac{\partial^2 \log f(X|\theta^*)}{\partial \theta \partial \theta^\top}\right] \quad \text{(expected Hessian under true } g\text{)} \\ \mathbf{B} &= \mathbb{E}_{g}\left[\frac{\partial \log f(X|\theta^*)}{\partial \theta} \frac{\partial \log f(X|\theta^*)}{\partial \theta^\top}\right] \quad \text{(variance of score under true } g\text{)}\end{split}\]

Under correct specification, \(\mathbf{A} = \mathbf{B} = \mathbf{I}(\theta_0)\) (the information equality). Under misspecification, \(\mathbf{A} \neq \mathbf{B}\) in general.

Theorem: Quasi-MLE Asymptotics

Under misspecification and regularity conditions:

\[\sqrt{n}(\hat{\theta}_n - \theta^*) \xrightarrow{d} \mathcal{N}\left(\mathbf{0}, \mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}\right)\]

The variance \(\mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}\) is called the sandwich variance (or Huber-White variance).

Practical implications:

Standard errors may be wrong: If we use \(I(\hat{\theta})^{-1}\) for variance (assuming correct specification), SEs can be too small or too large depending on the nature of misspecification.
Sandwich (robust) standard errors: Estimate \(\mathbf{A}\) and \(\mathbf{B}\) separately:

\[\hat{\mathbf{A}} = -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(x_i|\hat{\theta})}{\partial \theta \partial \theta^\top}, \quad \hat{\mathbf{B}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(x_i|\hat{\theta})}{\partial \theta} \frac{\partial \log f(x_i|\hat{\theta})}{\partial \theta^\top}\]

Then \(\widehat{\text{Var}}(\hat{\theta}) = \hat{\mathbf{A}}^{-1} \hat{\mathbf{B}} \hat{\mathbf{A}}^{-1} / n\).
Model-based vs. robust inference: Under correct specification, both give consistent SEs. Under misspecification, only sandwich SEs are consistent. The difference between them is a diagnostic for misspecification.

import numpy as np

def sandwich_variance(score_contributions, hessian):
    """
    Compute sandwich variance estimator.

    Parameters
    ----------
    score_contributions : ndarray, shape (n, p)
        Individual score contributions ∂log f(xᵢ|θ̂)/∂θ for each observation.
    hessian : ndarray, shape (p, p)
        Observed Hessian ∂²ℓ/∂θ∂θ' evaluated at MLE.

    Returns
    -------
    sandwich_var : ndarray, shape (p, p)
        Sandwich variance estimate for θ̂.
    """
    n = score_contributions.shape[0]

    # A = -Hessian/n (average negative curvature)
    A = -hessian / n

    # B = Var(score) estimate = (1/n) Σ uᵢuᵢ' (outer products)
    B = score_contributions.T @ score_contributions / n

    # Sandwich: A⁻¹ B A⁻¹ / n
    A_inv = np.linalg.inv(A)
    sandwich_var = A_inv @ B @ A_inv / n

    return sandwich_var

Model misspecification and sandwich variance — Fig. 90 **Figure 3.2.9**: Effects of model misspecification on MLE inference. A Normal model is fit to data from three distributions: (left) correctly specified \(N(0,1)\), (center) mildly misspecified \(t_5\), and (right) severely misspecified \(t_3\). Under correct specification, model-based and sandwich SEs agree. Under misspecification, the heavier tails of \(t\)-distributions inflate variance; model-based SEs underestimate true variability while sandwich SEs remain consistent.

The Cramér-Rao Lower Bound

The Cramér-Rao inequality establishes a fundamental limit on how precisely any unbiased estimator can estimate a parameter. The MLE achieves this bound asymptotically.

Statement and Proof

Theorem: Cramér-Rao Lower Bound

Let \(T = T(X_1, \ldots, X_n)\) be any unbiased estimator of \(\tau(\theta)\), a function of the parameter. Under regularity conditions R4–R6:

(54)\[\text{Var}_\theta(T) \geq \frac{[\tau'(\theta)]^2}{I_n(\theta)} = \frac{[\tau'(\theta)]^2}{n I_1(\theta)}\]

For estimating \(\theta\) itself (\(\tau(\theta) = \theta\), so \(\tau'(\theta) = 1\)):

(55)\[\text{Var}_\theta(\hat{\theta}) \geq \frac{1}{n I_1(\theta)}\]

Complete Proof:

The proof uses the Cauchy-Schwarz inequality. Define:

\(U = U(\theta)\) = score function
\(T\) = unbiased estimator with \(\mathbb{E}_\theta[T] = \tau(\theta)\)

We know \(\mathbb{E}_\theta[U] = 0\) (score has mean zero).

Step 1: Differentiate the unbiasedness condition.

Since \(\mathbb{E}_\theta[T] = \tau(\theta)\), we have:

\[\int T(x) f(x|\theta) dx = \tau(\theta)\]

Differentiating both sides with respect to \(\theta\):

\[\int T(x) \frac{\partial f(x|\theta)}{\partial \theta} dx = \tau'(\theta)\]

Using \(\frac{\partial f}{\partial \theta} = f \cdot \frac{\partial \log f}{\partial \theta}\):

\[\int T(x) f(x|\theta) \frac{\partial \log f(x|\theta)}{\partial \theta} dx = \tau'(\theta)\]

This is \(\mathbb{E}_\theta[T \cdot U] = \tau'(\theta)\).

Step 2: Compute the covariance.

Since \(\mathbb{E}[U] = 0\):

\[\text{Cov}_\theta(T, U) = \mathbb{E}_\theta[T \cdot U] - \mathbb{E}_\theta[T] \cdot \mathbb{E}_\theta[U] = \tau'(\theta) - \tau(\theta) \cdot 0 = \tau'(\theta)\]

Step 3: Apply Cauchy-Schwarz.

For any random variables \(A, B\): \([\text{Cov}(A,B)]^2 \leq \text{Var}(A) \cdot \text{Var}(B)\).

Applied to \(T\) and \(U\):

\[[\tau'(\theta)]^2 = [\text{Cov}(T, U)]^2 \leq \text{Var}(T) \cdot \text{Var}(U) = \text{Var}(T) \cdot I_n(\theta)\]

Rearranging:

\[\text{Var}(T) \geq \frac{[\tau'(\theta)]^2}{I_n(\theta)} \quad \blacksquare\]

Efficiency

An estimator that achieves the Cramér-Rao bound is called efficient. The MLE is not generally efficient for finite samples, but it achieves the bound asymptotically:

Theorem: Asymptotic Efficiency of MLE

Under regularity conditions, the MLE is asymptotically efficient: its asymptotic variance equals the Cramér-Rao lower bound.

\[\text{AVar}(\hat{\theta}_{\text{MLE}}) = \lim_{n \to \infty} n \cdot \text{Var}(\hat{\theta}_n) = \frac{1}{I_1(\theta_0)}\]

This means that for large samples, no unbiased estimator can have smaller variance than the MLE.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def simulate_mle_distribution(true_theta, n_samples, n_simulations=5000, seed=42):
    """
    Simulate the sampling distribution of the Poisson MLE to verify asymptotic normality.

    Parameters
    ----------
    true_theta : float
        True Poisson rate parameter.
    n_samples : int
        Sample size per simulation.
    n_simulations : int
        Number of Monte Carlo simulations.
    seed : int
        Random seed.

    Returns
    -------
    dict
        MLEs, theoretical quantities, and test statistics.
    """
    rng = np.random.default_rng(seed)

    # Simulate MLEs
    mles = np.zeros(n_simulations)
    for i in range(n_simulations):
        x = rng.poisson(true_theta, size=n_samples)
        mles[i] = np.mean(x)  # MLE for Poisson is sample mean

    # Theoretical values
    # Fisher information for Poisson: I₁(λ) = 1/λ
    fisher_info = 1 / true_theta
    theoretical_se = np.sqrt(1 / (n_samples * fisher_info))
    cramer_rao_bound = 1 / (n_samples * fisher_info)

    # Standardized MLEs for normality check
    standardized = (mles - true_theta) / theoretical_se

    return {
        'mles': mles,
        'mean': np.mean(mles),
        'std': np.std(mles),
        'theoretical_mean': true_theta,
        'theoretical_se': theoretical_se,
        'cramer_rao_bound': cramer_rao_bound,
        'standardized': standardized
    }

# Run simulation
true_lambda = 5.0
results = {}
sample_sizes = [10, 50, 200]

print("Verifying MLE Asymptotic Properties (Poisson)")
print("=" * 60)
print(f"True λ = {true_lambda}, CR Bound = 1/(n·I₁) = λ/n")
print()
print(f"{'n':>6} {'Mean(λ̂)':>12} {'SD(λ̂)':>12} {'Theor SE':>12} {'CR Bound':>12}")
print("-" * 60)

for n in sample_sizes:
    results[n] = simulate_mle_distribution(true_lambda, n)
    r = results[n]
    print(f"{n:>6} {r['mean']:>12.4f} {r['std']:>12.4f} "
          f"{r['theoretical_se']:>12.4f} {np.sqrt(r['cramer_rao_bound']):>12.4f}")

Verifying MLE Asymptotic Properties (Poisson)
============================================================
True λ = 5.0, CR Bound = 1/(n·I₁) = λ/n

     n     Mean(λ̂)       SD(λ̂)     Theor SE     CR Bound
------------------------------------------------------------
    10       5.0021       0.7106       0.7071       0.7071
    50       5.0006       0.3173       0.3162       0.3162
   200       4.9990       0.1575       0.1581       0.1581

MLE variance approaching the Cramér-Rao lower bound — Fig. 91 **Figure 3.2.8**: Asymptotic efficiency of the exponential MLE. (a) Log-log plot showing that empirical \(\text{Var}(\hat{\lambda})\) closely tracks the Cramér-Rao lower bound \(\lambda^2/n\) across sample sizes. (b) Efficiency ratio \(\text{CRLB}/\text{Var}(\hat{\lambda})\) approaches 1 as \(n \to \infty\), confirming that the MLE achieves the theoretical minimum variance asymptotically.

The Invariance Property

A remarkable feature of maximum likelihood is its behavior under reparameterization.

Theorem: Invariance of MLE

If \(\hat{\theta}\) is the MLE of \(\theta\), then for any function \(g\), the MLE of \(\tau = g(\theta)\) is \(\hat{\tau} = g(\hat{\theta})\).

This follows directly from the definition: if \(\hat{\theta}\) maximizes \(L(\theta)\), then \(g(\hat{\theta})\) maximizes \(L(g^{-1}(\tau))\) over \(\tau\) (when \(g\) is one-to-one).

Example: For exponential data, \(\hat{\lambda} = 1/\bar{x}\). The MLE of the mean \(\mu = 1/\lambda\) is therefore \(\hat{\mu} = \bar{x}\)—no additional optimization needed.

This invariance property distinguishes MLE from other estimation methods. The method of moments estimator for \(g(\theta)\) is generally not \(g(\hat{\theta}_{\text{MoM}})\).

Likelihood-Based Hypothesis Testing

Maximum likelihood provides a unified framework for hypothesis testing through three asymptotically equivalent tests.

The Likelihood Ratio Test

Consider testing \(H_0: \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0\).

The likelihood ratio statistic is:

(56)\[\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})}\]

where \(\hat{\theta}_0\) is the MLE under \(H_0\) and \(\hat{\theta}\) is the unrestricted MLE.

Since \(0 \leq \Lambda \leq 1\), we reject \(H_0\) for small values of \(\Lambda\), or equivalently, for large values of:

(57)\[D = -2\log\Lambda = 2[\ell(\hat{\theta}) - \ell(\hat{\theta}_0)]\]

Theorem: Wilks’ Theorem

Under \(H_0\) and regularity conditions, the deviance \(D\) converges in distribution:

\[D = -2\log\Lambda \xrightarrow{d} \chi^2_r \quad \text{as } n \to \infty\]

where \(r = \dim(\Theta) - \dim(\Theta_0)\) is the difference in parameter dimensions.

For testing \(H_0: \theta = \theta_0\) (a point null) against \(H_1: \theta \neq \theta_0\) with scalar \(\theta\):

\[D = 2[\ell(\hat{\theta}) - \ell(\theta_0)] \xrightarrow{d} \chi^2_1\]

Wald Test

The Wald test uses the asymptotic normality of the MLE directly. Using per-observation information \(I_1(\cdot)\):

(58)\[W = n \cdot I_1(\hat{\theta}) \cdot (\hat{\theta} - \theta_0)^2 = \frac{(\hat{\theta} - \theta_0)^2}{\widehat{\text{Var}}(\hat{\theta})}\]

where \(\widehat{\text{Var}}(\hat{\theta}) = 1/[n I_1(\hat{\theta})]\).

Equivalently, using total information \(I_n(\hat{\theta}) = n I_1(\hat{\theta})\):

\[W = I_n(\hat{\theta}) \cdot (\hat{\theta} - \theta_0)^2\]

For multivariate \(\boldsymbol{\theta} \in \mathbb{R}^p\), using total information \(\mathbf{I}_n(\boldsymbol{\theta}) = n \mathbf{I}_1(\boldsymbol{\theta})\):

\[W = (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^\top \widehat{\mathbf{I}}_n(\hat{\boldsymbol{\theta}}) (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \chi^2_p\]

where \(\widehat{\mathbf{I}}_n\) may be either:

Expected information: \(n \mathbf{I}_1(\hat{\boldsymbol{\theta}})\)
Observed information: \(\mathbf{J}_n(\hat{\boldsymbol{\theta}}) = -\partial^2 \ell_n / \partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top |_{\hat{\boldsymbol{\theta}}}\)

Both are asymptotically equivalent under correct specification.

Score Test (Rao Test)

The score test evaluates the score function at the null value:

(59)\[S = \frac{U(\theta_0)^2}{I(\theta_0)} = U(\theta_0)^\top I(\theta_0)^{-1} U(\theta_0)\]

Under \(H_0\):

\[S \xrightarrow{d} \chi^2_1\]

Computational tradeoffs:

Likelihood ratio: Requires fitting both restricted and unrestricted models
Wald: Requires only the unrestricted MLE
Score: Requires only evaluation at the null—no optimization needed

All three tests are asymptotically equivalent under \(H_0\), but can differ substantially in finite samples. The ordering \(W \geq D \geq S\) often holds for the test statistics (though not universally).

Geometric comparison of likelihood ratio, Wald, and score tests — Fig. 92 **Figure 3.2.6**: Geometric interpretation of the three likelihood-based tests for \(H_0: \lambda = \lambda_0\). (a) **Likelihood ratio test**: measures the vertical drop \(D = 2[\ell(\hat{\lambda}) - \ell(\lambda_0)]\) between the MLE and null. (b) **Wald test**: measures the horizontal distance \((\hat{\lambda} - \lambda_0)^2\) scaled by estimated variance. (c) **Score test**: measures the slope (tangent line) at \(\lambda_0\)—a steep slope indicates the null is far from the maximum. All three statistics are asymptotically \(\chi^2_1\) under \(H_0\).

import numpy as np
from scipy import stats

def likelihood_tests_poisson(x, lambda_0):
    """
    Perform likelihood ratio, Wald, and score tests for Poisson rate.

    Tests H₀: λ = λ₀ vs H₁: λ ≠ λ₀.

    Parameters
    ----------
    x : array-like
        Observed counts.
    lambda_0 : float
        Null hypothesis rate.

    Returns
    -------
    dict
        Test statistics and p-values.
    """
    x = np.asarray(x)
    n = len(x)
    x_bar = np.mean(x)
    lambda_hat = x_bar  # MLE

    # Log-likelihood function
    def log_lik(lam):
        return np.sum(x * np.log(lam) - lam - np.log(np.array([np.math.factorial(int(xi)) for xi in x])))

    # Simpler: use scipy.stats
    ll_mle = np.sum(stats.poisson.logpmf(x, lambda_hat))
    ll_null = np.sum(stats.poisson.logpmf(x, lambda_0))

    # Likelihood ratio test
    D = 2 * (ll_mle - ll_null)
    lr_pvalue = 1 - stats.chi2.cdf(D, df=1)

    # Wald test
    # Var(λ̂) ≈ λ/n, so W = n(λ̂ - λ₀)²/λ̂
    W = n * (lambda_hat - lambda_0)**2 / lambda_hat
    wald_pvalue = 1 - stats.chi2.cdf(W, df=1)

    # Score test
    # Score at λ₀: U(λ₀) = n·x̄/λ₀ - n = n(x̄ - λ₀)/λ₀
    # Fisher info at λ₀: I(λ₀) = n/λ₀
    # S = U(λ₀)²/I(λ₀) = n(x̄ - λ₀)²/λ₀
    S = n * (x_bar - lambda_0)**2 / lambda_0
    score_pvalue = 1 - stats.chi2.cdf(S, df=1)

    return {
        'lambda_hat': lambda_hat,
        'lambda_0': lambda_0,
        'lr_stat': D,
        'lr_pvalue': lr_pvalue,
        'wald_stat': W,
        'wald_pvalue': wald_pvalue,
        'score_stat': S,
        'score_pvalue': score_pvalue
    }

# Example: Test if Poisson rate equals 5
rng = np.random.default_rng(42)
x = rng.poisson(lam=6, size=50)  # True rate is 6, not 5

result = likelihood_tests_poisson(x, lambda_0=5.0)
print("Testing H₀: λ = 5.0")
print(f"MLE: λ̂ = {result['lambda_hat']:.4f}")
print()
print(f"{'Test':<20} {'Statistic':>12} {'p-value':>12}")
print("-" * 45)
print(f"{'Likelihood Ratio':<20} {result['lr_stat']:>12.4f} {result['lr_pvalue']:>12.4f}")
print(f"{'Wald':<20} {result['wald_stat']:>12.4f} {result['wald_pvalue']:>12.4f}")
print(f"{'Score':<20} {result['score_stat']:>12.4f} {result['score_pvalue']:>12.4f}")

Testing H₀: λ = 5.0
MLE: λ̂ = 5.8600

Test                    Statistic      p-value
---------------------------------------------
Likelihood Ratio            3.6455       0.0562
Wald                        3.9898       0.0458
Score                       3.4848       0.0619

Confidence Intervals from Likelihood

The asymptotic normality of MLEs provides multiple approaches to confidence interval construction.

Wald Intervals

The simplest approach uses the normal approximation directly:

(60)\[\hat{\theta} \pm z_{\alpha/2} \cdot \widehat{\text{SE}}(\hat{\theta})\]

where \(\widehat{\text{SE}} = 1/\sqrt{n I(\hat{\theta})}\) or is estimated from the observed Hessian.

Limitations: Wald intervals

Are not invariant to reparameterization
Can give poor coverage near parameter boundaries
May extend outside the parameter space

Profile Likelihood Intervals

Profile likelihood intervals invert the likelihood ratio test:

(61)\[\text{CI}_{1-\alpha} = \left\{ \theta : 2[\ell(\hat{\theta}) - \ell(\theta)] \leq \chi^2_{1, 1-\alpha} \right\}\]

These intervals are invariant to reparameterization: if we transform \(\tau = g(\theta)\), the profile likelihood interval for \(\tau\) is exactly \(g\) applied to the interval for \(\theta\).

For multiparameter models where \(\theta = (\psi, \lambda)\) with \(\psi\) the parameter of interest:

\[\ell_p(\psi) = \max_\lambda \ell(\psi, \lambda)\]

The profile likelihood interval for \(\psi\) is:

\[\left\{ \psi : 2[\ell(\hat{\psi}, \hat{\lambda}) - \ell_p(\psi)] \leq \chi^2_{1, 1-\alpha} \right\}\]

import numpy as np
from scipy import stats, optimize

def profile_likelihood_ci_exponential(x, confidence=0.95):
    """
    Compute profile likelihood CI for exponential rate parameter.

    Uses expanding bracket search for robustness.

    Parameters
    ----------
    x : array-like
        Observed data.
    confidence : float
        Confidence level.

    Returns
    -------
    dict
        MLE, Wald CI, and profile likelihood CI.
    """
    x = np.asarray(x)
    n = len(x)
    x_sum = np.sum(x)

    # MLE
    lambda_hat = n / x_sum

    # Log-likelihood (up to constant)
    def log_lik(lam):
        if lam <= 0:
            return -np.inf
        return n * np.log(lam) - lam * x_sum

    # Maximum log-likelihood
    ll_max = log_lik(lambda_hat)

    # Chi-square cutoff
    chi2_cutoff = stats.chi2.ppf(confidence, df=1)

    # Profile equation: find λ where 2(ℓ_max - ℓ(λ)) = χ²_{1,α}
    def profile_equation(lam):
        return 2 * (ll_max - log_lik(lam)) - chi2_cutoff

    # ROBUST BRACKET SEARCH for lower bound
    # Start just above 0, expand downward if needed
    lower_bracket_right = lambda_hat * 0.99
    lower_bracket_left = lambda_hat * 0.01

    # Ensure sign change exists
    max_expansions = 20
    for _ in range(max_expansions):
        if profile_equation(lower_bracket_left) > 0:
            break
        lower_bracket_left /= 2
        if lower_bracket_left < 1e-15:
            lower_bracket_left = 1e-15
            break

    try:
        lower = optimize.brentq(profile_equation, lower_bracket_left, lower_bracket_right)
    except ValueError:
        lower = 1e-15  # Fallback for edge cases

    # ROBUST BRACKET SEARCH for upper bound
    # Start at MLE, expand upward until sign change
    upper_bracket_left = lambda_hat * 1.01
    upper_bracket_right = lambda_hat * 2.0

    for _ in range(max_expansions):
        if profile_equation(upper_bracket_right) > 0:
            break
        upper_bracket_right *= 2  # Double the bracket

    try:
        upper = optimize.brentq(profile_equation, upper_bracket_left, upper_bracket_right)
    except ValueError:
        upper = upper_bracket_right  # Fallback

    # Wald CI for comparison
    se = lambda_hat / np.sqrt(n)  # SE(λ̂) = λ̂/√n for exponential
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    wald_lower = max(0, lambda_hat - z * se)  # Clip at 0
    wald_upper = lambda_hat + z * se

    return {
        'lambda_hat': lambda_hat,
        'wald_ci': (wald_lower, wald_upper),
        'profile_ci': (lower, upper),
        'se': se
    }

Exponential rate estimation (n=30, true λ=2.0)
MLE: λ̂ = 1.9205
Wald 95% CI:    (1.2334, 2.6076)
Profile 95% CI: (1.3184, 2.6709)

Note: Profile CI is asymmetric around MLE (appropriate for positive parameter)

Fig. 93 Figure 3.2.7: Confidence interval methods for the Binomial proportion (\(x=3\), \(n=20\), \(\hat{p}=0.15\)). (a) The log-likelihood with three 95% CIs: Wald (symmetric around MLE), Wilson/Score (shifted toward 0.5), and Profile (derived from likelihood ratio inversion). (b) Coverage simulation showing that Wald intervals undercover near the boundaries while Wilson intervals maintain closer to nominal 95% coverage across all \(p\) values.

Practical Considerations

Numerical Stability

Several numerical issues arise in MLE computation:

Log-likelihood underflow: Always work with log-likelihoods, never raw likelihoods.
Log-sum-exp: When computing \(\log \sum_i \exp(a_i)\), use the stable formula:

\[\log \sum_i e^{a_i} = a_{\max} + \log \sum_i e^{a_i - a_{\max}}\]
Hessian conditioning: Near-singular Hessians indicate weak identifiability. Consider regularization or reparameterization.
Boundary maxima: When the MLE lies on the parameter space boundary, standard asymptotics fail. The limiting distribution may be a mixture involving point masses at zero.

Starting Values

Newton-type algorithms require good starting values. Common strategies:

Method of moments: Often provides reasonable starting points with minimal computation
Grid search: For low-dimensional problems, evaluate the log-likelihood on a grid
Random restarts: Run optimization from multiple starting points; compare results

Common Pitfall ⚠️ Local vs. Global Maxima

The log-likelihood may have multiple local maxima, particularly for:

Mixture models: Notoriously multimodal
Models with latent variables: Related to mixture issues
Highly parameterized models: Many degrees of freedom

Always examine convergence diagnostics and consider multiple starting values. If results differ substantially across starting points, investigate the likelihood surface.

Non-Regular Settings: When Standard Theory Fails

The regularity conditions (R1–R7) exclude several important cases where MLE behavior differs qualitatively from the standard theory.

1. Parameter-dependent support (R2 violation)

We have seen that Uniform(0, \(\theta\)) and shifted exponential violate R2, leading to:

Boundary MLEs at \(X_{(n)}\) or \(X_{(1)}\)
Faster convergence: \(O(1/n)\) rather than \(O(1/\sqrt{n})\)
Non-normal limiting distributions (e.g., Exponential)

2. Mixture models: unbounded likelihood and non-identifiability

For mixture models like \(p \cdot \mathcal{N}(\mu_1, \sigma_1^2) + (1-p) \cdot \mathcal{N}(\mu_2, \sigma_2^2)\):

Unbounded likelihood: If \(\mu_1 = x_1\) and \(\sigma_1 \to 0\), the likelihood approaches infinity. The global MLE is degenerate.

Solution: Constrain \(\sigma_k \geq \sigma_{\min}\) or use penalized likelihood.
Label switching: Permuting component labels gives identical likelihood. The parameter space has a discrete symmetry.

Solution: Impose ordering constraints (e.g., \(\mu_1 < \mu_2\)) or work with invariant functionals.
Singular Fisher information: At certain parameter values (e.g., \(p = 0\)), the Fisher information matrix is singular. Standard asymptotics fail.

3. Parameters on the boundary

When the true parameter lies on the boundary of the parameter space (e.g., testing \(\sigma^2 = 0\) in a variance components model):

The standard LR test statistic \(D = 2(\ell_1 - \ell_0)\) no longer has a \(\chi^2\) limit
Instead, \(D \xrightarrow{d} \bar{\chi}^2 = 0.5 \cdot \chi^2_0 + 0.5 \cdot \chi^2_1\) (a 50-50 mixture of point mass at 0 and \(\chi^2_1\))
Critical values and p-values must account for this

Non-normal limiting distribution for Uniform MLE — Fig. 94 **Figure 3.2.11**: Non-regular MLE behavior for \(\text{Uniform}(0, \theta)\). (a) The correctly scaled statistic \(n(\theta - \hat{\theta})/\theta\) converges to an Exponential(1) distribution—not Normal—because the MLE \(\hat{\theta} = X_{(n)}\) lies on the boundary of the support. (b) The standard \(\sqrt{n}\) scaling produces a distribution that is dramatically non-normal, with a sharp boundary at zero. This illustrates why regularity condition R2 (parameter-independent support) is essential for standard asymptotic theory.

Practical guidance: When encountering non-regular problems:

Check whether regularity conditions hold before applying standard theory
Consider simulation-based inference (parametric bootstrap)
Use specialized asymptotic theory when available
Be cautious about standard errors near boundaries

Connection to Bayesian Inference

Maximum likelihood estimation has a natural Bayesian interpretation. Consider the posterior distribution:

\[\pi(\theta | x) \propto L(\theta) \cdot \pi(\theta)\]

With a flat (uniform) prior \(\pi(\theta) \propto 1\):

\[\pi(\theta | x) \propto L(\theta)\]

The posterior mode (MAP estimate) equals the MLE. More generally:

As sample size increases, the likelihood dominates the prior
The posterior concentrates around the MLE
The posterior is approximately normal with mean at MLE and variance \(1/[nI(\hat{\theta})]\)

This Bernstein-von Mises theorem provides a bridge between frequentist MLE and Bayesian inference, justifying the use of likelihood-based intervals from either perspective.

Chapter 3.2 Exercises: Maximum Likelihood Estimation Mastery

These exercises build your understanding of maximum likelihood estimation from analytical derivations through numerical optimization to asymptotic theory verification. Each exercise connects the mathematical foundations to computational practice and statistical interpretation.

A Note on These Exercises

These exercises are designed to deepen your understanding of MLE through hands-on exploration:

Exercise 1 develops analytical skills for deriving MLEs and understanding when closed forms exist
Exercise 2 explores Fisher information—its computation, interpretation, and role in quantifying estimation precision
Exercise 3 implements numerical optimization algorithms (Newton-Raphson, Fisher scoring) and compares their behavior
Exercise 4 verifies the asymptotic properties of MLEs through Monte Carlo simulation
Exercise 5 compares likelihood ratio, Wald, and score tests empirically
Exercise 6 constructs and compares confidence intervals via multiple methods

Complete solutions with derivations, code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!

Exercise 1: Analytical MLE Derivations

The ability to derive MLEs analytically provides deep insight into the structure of statistical models. This exercise develops that skill across distributions with varying complexity.

Background: The Score Equation

For most regular problems, the MLE is found by solving the score equation \(U(\theta) = \partial \ell / \partial \theta = 0\). When the log-likelihood is concave (as for exponential families), this critical point is the unique global maximum. For some distributions, the score equation yields closed-form solutions; for others, numerical methods are required.

Geometric distribution: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Geometric}(p)\) where \(P(X = k) = (1-p)^{k-1}p\) for \(k = 1, 2, \ldots\) (number of trials until first success).
- Write the log-likelihood \(\ell(p)\)
- Derive the score function \(U(p)\) and solve for \(\hat{p}\)
- Verify your answer makes intuitive sense
Hint: Simplifying the Sum

The log-likelihood involves \(\sum_{i=1}^n (x_i - 1)\). Note that \(\sum(x_i - 1) = n\bar{x} - n\).
Pareto distribution: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Pareto}(\alpha, x_m)\) where \(f(x) = \alpha x_m^\alpha / x^{\alpha+1}\) for \(x \geq x_m\). Assume \(x_m\) is known.
- Derive \(\hat{\alpha}\) analytically
- Show that \(\hat{\alpha}\) depends on the data only through \(\sum \log(x_i/x_m)\)
- What happens if some \(x_i < x_m\)?
Uniform distribution (boundary MLE): Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta)\).
- Write the likelihood function carefully, noting where it equals zero
- Show that the MLE is \(\hat{\theta} = X_{(n)} = \max_i X_i\)
- Explain why this distribution violates the regularity conditions and what consequences this has
Hint: Likelihood Structure

The likelihood is \(L(\theta) = \theta^{-n}\) when \(\theta \geq X_{(n)}\) and \(L(\theta) = 0\) otherwise. The maximum is at the boundary.
Two-parameter exponential: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exp}(\lambda, \mu)\) where \(f(x) = \lambda e^{-\lambda(x-\mu)}\) for \(x \geq \mu\) (shifted exponential with rate \(\lambda\) and location \(\mu\)).
- Derive the MLEs \(\hat{\mu}\) and \(\hat{\lambda}\) jointly
- Which parameter has a boundary MLE similar to part (c)?

Exercise 2: Fisher Information Computation and Interpretation

Fisher information quantifies how much information the data contain about parameters. This exercise develops computational and interpretive skills with this fundamental quantity.

Background: Two Equivalent Definitions

Fisher information has two equivalent definitions under regularity conditions:

Variance form: \(I(\theta) = \text{Var}[U(\theta)] = \mathbb{E}[(U(\theta))^2]\)
Curvature form: \(I(\theta) = -\mathbb{E}[\partial^2 \ell / \partial \theta^2]\)

The curvature form is often easier to compute; the variance form provides intuition about the score’s variability.

Bernoulli information: For \(X \sim \text{Bernoulli}(p)\):
- Compute \(I(p)\) using both definitions
- Show that information is maximized at \(p = 0.5\)
- Interpret: why do extreme probabilities (near 0 or 1) provide less information?
Normal with both parameters unknown: For \(X \sim \mathcal{N}(\mu, \sigma^2)\):
- Compute the \(2 \times 2\) Fisher information matrix
- Show that \(\mu\) and \(\sigma^2\) are “orthogonal” (off-diagonal entries are zero)
- What does orthogonality mean for inference?
Exponential information: For \(X \sim \text{Exponential}(\lambda)\) (rate parameterization):
- Compute \(I(\lambda)\)
- How does information change with \(\lambda\)? Interpret this.
- Compare to the scale parameterization \(\theta = 1/\lambda\)
Hint: Reparameterization

For reparameterization \(\eta = g(\theta)\), the information transforms as \(I_\eta(\eta) = I_\theta(\theta) / [g'(\theta)]^2\).
Binomial vs. Bernoulli: For \(n\) iid Bernoulli trials vs. a single Binomial(\(n, p\)) observation:
- Show that both give \(I_n(p) = n/[p(1-p)]\)
- Explain why sufficiency implies they must have equal information

Exercise 3: Numerical MLE via Newton-Raphson and Fisher Scoring

When closed-form MLEs don’t exist, numerical optimization is required. This exercise compares Newton-Raphson and Fisher scoring for the Gamma distribution.

Background: The Gamma MLE Problem

For Gamma(\(\alpha, \beta\)) data, the MLE for \(\beta\) has a closed form given \(\alpha\), but the MLE for \(\alpha\) requires solving a transcendental equation involving the digamma function \(\psi(\alpha) = \frac{d}{d\alpha} \log \Gamma(\alpha)\). This makes Gamma an excellent test case for numerical MLE methods.

Derive the score equations: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Gamma}(\alpha, \beta)\) (shape-rate parameterization):
- Write the log-likelihood
- Derive the score functions \(U_\alpha\) and \(U_\beta\)
- Show that \(\hat{\beta} = \alpha / \bar{x}\) given \(\alpha\)
Implement Newton-Raphson: Implement Newton-Raphson for the profile log-likelihood in \(\alpha\) alone (substituting \(\hat{\beta}(\alpha)\)).
- Derive the profile log-likelihood \(\ell_p(\alpha)\)
- Compute \(\ell_p'(\alpha)\) and \(\ell_p''(\alpha)\) using digamma and trigamma functions
- Implement the algorithm and track convergence
Hint: Profile Likelihood

Substituting \(\beta = \alpha/\bar{x}\) into \(\ell(\alpha, \beta)\) eliminates \(\beta\), giving a one-dimensional optimization problem in \(\alpha\).
Implement Fisher scoring: For the full two-parameter problem:
- Compute the Fisher information matrix \(\mathbf{I}(\alpha, \beta)\)
- Implement the Fisher scoring update
- Compare convergence behavior to Newton-Raphson
Compare methods: Generate 1000 Gamma(3, 2) samples and estimate \((\alpha, \beta)\) using both methods. Compare:
- Number of iterations to convergence
- Sensitivity to starting values
- Behavior near the optimum

Exercise 4: Verifying Asymptotic Properties via Simulation

The asymptotic properties of MLEs—consistency, normality, efficiency—are theoretical results. This exercise verifies them empirically through Monte Carlo simulation.

Background: What to Verify

The key asymptotic results state that under regularity conditions:

Consistency: \(\hat{\theta}_n \xrightarrow{p} \theta_0\)
Asymptotic normality: \(\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})\)
Efficiency: Asymptotic variance equals the Cramér-Rao bound

Simulation lets us verify these properties and see how quickly they “kick in.”

Consistency: For \(X_i \sim \text{Exponential}(\lambda = 2)\):
- Simulate 10,000 datasets for each \(n \in \{10, 50, 100, 500, 2000\}\)
- Compute the MLE \(\hat{\lambda} = 1/\bar{X}\) for each
- Show that \(\mathbb{E}[\hat{\lambda}] \to 2\) and \(\text{Var}(\hat{\lambda}) \to 0\)
Asymptotic normality: For the same setup:
- Compute the standardized statistic \(Z_n = \sqrt{n}(\hat{\lambda} - \lambda_0) / \sqrt{I_1(\lambda_0)^{-1}}\)
- Create Q-Q plots comparing \(Z_n\) to \(\mathcal{N}(0,1)\) for different \(n\)
- At what \(n\) does the normal approximation become accurate?
Efficiency: Compare to the Cramér-Rao bound:
- Compute the empirical variance of \(\hat{\lambda}\) for each \(n\)
- Compare to the Cramér-Rao bound \(1/(nI_1(\lambda_0))\)
- Compute the efficiency ratio \(\text{CRLB} / \text{Var}(\hat{\lambda})\)
Hint: Finite-Sample Bias

The exponential MLE is biased in finite samples: \(\mathbb{E}[\hat{\lambda}] = \lambda n/(n-1)\). Account for this when interpreting your results.
When asymptotics fail: Repeat the analysis for Uniform(0, \(\theta\)) with \(\hat{\theta} = X_{(n)}\):
- Show that \(n(\theta - \hat{\theta})\) converges to an Exponential, not Normal
- Explain why the regularity conditions are violated

Exercise 5: Likelihood Ratio, Wald, and Score Tests Compared

The three likelihood-based tests are asymptotically equivalent but can differ substantially in finite samples. This exercise compares their behavior.

Background: The Trinity of Tests

For testing \(H_0: \theta = \theta_0\):

Likelihood Ratio (LR): \(D = 2[\ell(\hat{\theta}) - \ell(\theta_0)]\)
Wald: \(W = (\hat{\theta} - \theta_0)^2 / \text{Var}(\hat{\theta})\)
Score: \(S = U(\theta_0)^2 / I(\theta_0)\)

All converge to \(\chi^2_1\) under \(H_0\), but computational requirements differ.

Implementation for Poisson: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\lambda)\):
- Derive all three test statistics for testing \(H_0: \lambda = \lambda_0\)
- Implement a function that computes all three given data and \(\lambda_0\)
Type I error comparison: Under \(H_0: \lambda = 5\):
- Simulate 10,000 datasets with \(n = 20\)
- Compute rejection rates at \(\alpha = 0.05\) for each test
- Which test is closest to nominal level?
Power comparison: Under \(H_1: \lambda = 6\) (testing \(H_0: \lambda = 5\)):
- Compute power for \(n \in \{10, 20, 50, 100\}\)
- Which test is most powerful?
The ordering phenomenon: For a single dataset, verify the classical ordering \(W \geq D \geq S\) (when \(\hat{\theta} > \theta_0\)).

Hint: Relationship Between Tests

The ordering follows from Taylor expansions: Wald overestimates, Score underestimates, and LR lies between. This ordering reverses when \(\hat{\theta} < \theta_0\).

Exercise 6: Confidence Interval Construction and Comparison

Multiple methods exist for constructing confidence intervals from likelihood. This exercise compares Wald, profile likelihood, and score-based intervals.

Background: Three Interval Methods

Wald: \(\hat{\theta} \pm z_{\alpha/2} \times \text{SE}(\hat{\theta})\) — simple but not invariant
Profile likelihood: \(\{\theta: 2[\ell(\hat{\theta}) - \ell(\theta)] \leq \chi^2_{1,1-\alpha}\}\) — invariant under reparameterization
Score (Wilson-type): Invert the score test — good boundary behavior

Implementation for binomial proportion: For \(X \sim \text{Binomial}(n, p)\):
- Implement Wald interval: \(\hat{p} \pm z \sqrt{\hat{p}(1-\hat{p})/n}\)
- Implement Wilson (score) interval
- Implement profile likelihood interval
Coverage comparison: Simulate coverage probabilities for \(n = 20\) and \(p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}\):
- Which method achieves closest to nominal 95% coverage?
- Where do Wald intervals fail?
Boundary behavior: For \(n = 10\) and \(x = 0\) (no successes):
- Compute all three intervals
- Which methods give sensible results?
Hint: Wald Boundary Problem

When \(\hat{p} = 0\) or \(\hat{p} = 1\), the Wald interval has width zero because \(\hat{p}(1-\hat{p}) = 0\). This is clearly wrong.
Reparameterization: Transform to log-odds \(\psi = \log(p/(1-p))\):
- Compute Wald interval for \(\psi\) and transform back to \(p\)
- Compare to direct Wald interval for \(p\)
- Which matches the profile likelihood interval better?

Bringing It All Together

Maximum likelihood estimation occupies a central position in statistical inference. Its theoretical properties—consistency, asymptotic normality, efficiency—make it the default choice for parametric estimation when sample sizes are moderate to large. The likelihood function itself provides a unified framework for point estimation, interval estimation, and hypothesis testing.

Yet MLE has limitations. For small samples, the asymptotic approximations may be poor; bootstrap methods (Chapter 4) provide an alternative. For complex models with many parameters, regularization (ridge regression, LASSO) may improve prediction even at the cost of some bias. And when prior information is available, Bayesian methods (Chapter 5) provide a principled way to incorporate it.

The next sections extend these ideas. Section 3.3 compares MLE with method of moments and Bayesian estimation. Section 3.4 develops the sampling distribution theory that underlies our standard error calculations. And Sections 3.6–3.7 apply MLE to the linear model and its generalizations—the workhorses of applied statistics.

Key Takeaways 📝

The likelihood function \(L(\theta) = \prod f(x_i|\theta)\) measures how well different parameter values explain observed data; the MLE maximizes this function.
Score and Fisher information: The score \(U(\theta) = \partial \ell / \partial \theta\) has mean zero at the true parameter; its variance is the Fisher information \(I(\theta)\), which quantifies the curvature of the likelihood.
Asymptotic properties: Under regularity conditions, MLEs are consistent, asymptotically normal with variance \(1/[nI(\theta)]\), and asymptotically efficient (achieving the Cramér-Rao bound).
Numerical optimization: Newton-Raphson and Fisher scoring find MLEs when closed forms don’t exist; scipy.optimize provides robust implementations.
Hypothesis testing: Likelihood ratio, Wald, and score tests all derive from the likelihood; they are asymptotically equivalent but can differ in finite samples.
Course alignment: This section addresses Learning Outcome 2 (parametric inference) and provides computational foundations for LO 1 (simulation for sampling distributions) and LO 4 (Bayesian methods, which share the likelihood foundation).

References

Foundational Works by R. A. Fisher

[Fisher1912]

Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41, 155–160. Fisher’s earliest work on maximum likelihood, predating his systematic development of the theory.

[Fisher1922]

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A, 222, 309–368. The foundational paper introducing maximum likelihood, sufficiency, efficiency, and consistency—concepts that remain central to statistical inference.

[Fisher1925]

Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22(5), 700–725. Develops the asymptotic theory of maximum likelihood including asymptotic normality and efficiency, introducing Fisher information.

[Fisher1934]

Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society A, 144(852), 285–307. Further development of likelihood theory including the concept of ancillary statistics.

The Cramér-Rao Lower Bound

[Rao1945]

Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. Independently establishes the information inequality (Cramér-Rao bound) and introduces the concept of efficient estimators.

[Cramer1946]

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Classic synthesis of statistical theory including rigorous treatment of the Cramér-Rao inequality and asymptotic theory of estimators.

[Darmois1945]

Darmois, G. (1945). Sur les limites de la dispersion de certaines estimations. Revue de l’Institut International de Statistique, 13, 9–15. Independent derivation of the information inequality in the French statistical literature.

Asymptotic Theory

[Wald1949]

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20(4), 595–601. Establishes conditions for consistency of maximum likelihood estimators under general conditions.

[LeCam1953]

Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. University of California Publications in Statistics, 1, 277–329. Fundamental work on the asymptotic behavior of MLEs establishing local asymptotic normality.

[LeCam1970]

Le Cam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Annals of Mathematical Statistics, 41(3), 802–828. Clarifies and weakens the regularity conditions required for asymptotic normality of MLEs.

Numerical Methods for MLE

[Dennis1996]

Dennis, J. E., and Schnabel, R. B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM. Comprehensive treatment of Newton-Raphson and quasi-Newton methods used in likelihood maximization.

[Nocedal2006]

Nocedal, J., and Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Modern treatment of optimization algorithms including Newton’s method, Fisher scoring, and quasi-Newton methods relevant to MLE computation.

[McLachlan2008]

McLachlan, G. J., and Krishnan, T. (2008). The EM Algorithm and Extensions (2nd ed.). Wiley. Definitive reference on the Expectation-Maximization algorithm for MLEs in incomplete data problems.

Likelihood Ratio, Wald, and Score Tests

[Wilks1938]

Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics, 9(1), 60–62. Establishes the asymptotic chi-squared distribution of the likelihood ratio statistic under the null hypothesis.

[Wald1943]

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3), 426–482. Develops the theory of Wald tests based on asymptotic normality of MLEs.

[Rao1948]

Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44(1), 50–57. Introduces the score test (Rao test) based on the score function evaluated at the null hypothesis.

Model Misspecification

[White1982]

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25. Establishes quasi-maximum likelihood theory showing that MLE converges to the parameter minimizing Kullback-Leibler divergence even when the model is misspecified.

[Huber1967]

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 221–233. University of California Press. Foundational work on robust estimation and behavior of MLEs when model assumptions are violated.

Comprehensive Texts

[Lehmann1983]

Lehmann, E. L. (1983). Theory of Point Estimation. Wiley. (2nd ed. with Casella, 1998, Springer.) Rigorous graduate-level treatment of maximum likelihood and its properties.

[VanDerVaart1998]

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Modern measure-theoretic treatment of asymptotic theory including comprehensive coverage of MLE asymptotics.

[Pawitan2001]

Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. Accessible treatment of likelihood methods emphasizing practical applications and interpretation.

Historical Perspectives

[Stigler1986]

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press. Historical account of the development of statistical methods including early work on likelihood.

[Hald1998]

Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. Wiley. Detailed historical treatment including Fisher’s development of maximum likelihood theory.

[Hand2015]

Hand, D. J. (2015). From evidence to understanding: A commentary on Fisher (1922) ‘On the mathematical foundations of theoretical statistics’. Philosophical Transactions of the Royal Society A, 373(2039), 20140249. Modern perspective on Fisher’s foundational 1922 paper and its lasting influence.