Maximum Likelihood Estimation
In the summer of 1912, a young statistician named Ronald Aylmer Fisher was grappling with a fundamental question: given a sample of data, how should we estimate the parameters of a probability distribution? Fisher was not satisfied with the existing answers—Karl Pearson’s method of moments, while simple, seemed to throw away information. Surely there was a principled way to extract all the information the data contained about the unknown parameters.
Fisher’s answer, published in a series of papers between 1912 and 1925, was maximum likelihood estimation: choose the parameter values that make the observed data most probable. This deceptively simple idea revolutionized statistical inference. Fisher showed that maximum likelihood estimators (MLEs) have remarkable properties—they are consistent, asymptotically normal, and asymptotically efficient, achieving the theoretical lower bound on estimator variance. These properties made MLE the workhorse of parametric inference for a century.
This section develops the theory and practice of maximum likelihood estimation. We begin with the likelihood function itself—the mathematical object that quantifies how well different parameter values explain the observed data. We derive the score function and Fisher information, which together characterize the geometry of the likelihood surface. For simple models, we obtain closed-form MLEs; for complex models, we develop numerical optimization algorithms including Newton-Raphson and Fisher scoring. We establish the asymptotic theory that justifies using MLEs for inference, including a complete proof of the Cramér-Rao lower bound. Finally, we connect MLE to hypothesis testing through likelihood ratio, Wald, and score tests.
The exponential family framework from Section 3.1 will prove essential here: for exponential families, the score equation takes a particularly elegant form, and the MLE has explicit connections to sufficient statistics. But MLE extends far beyond exponential families—it applies to any parametric model, making it the universal tool for parametric inference.
Road Map 🧭
Understand: The likelihood function as a measure of parameter support, the score function as its gradient, and Fisher information as its curvature
Derive: Closed-form MLEs for normal, exponential, Poisson, and Bernoulli distributions; understand why Gamma and Beta require numerical methods
Implement: Newton-Raphson and Fisher scoring algorithms; leverage scipy.optimize for production code
Prove: Asymptotic consistency, normality, and efficiency; the Cramér-Rao lower bound with full regularity conditions
Apply: Likelihood ratio, Wald, and score tests for hypothesis testing; construct confidence intervals via multiple methods
The Likelihood Function
The likelihood function is the foundation of maximum likelihood estimation. It answers a simple question: for fixed data, how probable would that data be under different parameter values?
Definition and Interpretation
Let \(X_1, X_2, \ldots, X_n\) be independent random variables, each with probability density (or mass) function \(f(x|\theta)\) depending on an unknown parameter \(\theta \in \Theta\). After observing data \(x_1, x_2, \ldots, x_n\), we define the likelihood function:
The crucial conceptual shift is this: we view the likelihood as a function of \(\theta\) for fixed data, not as a probability of data for fixed \(\theta\). The data are observed and therefore fixed; the parameter is unknown and therefore variable.
The Likelihood is Not a Probability Density
While \(L(\theta)\) is constructed from probability densities, it is not a probability density over \(\theta\). There is no requirement that \(\int L(\theta) d\theta = 1\), and indeed this integral often diverges or depends on the data in complex ways. The likelihood measures relative support—how much more or less the data support one parameter value versus another—not absolute probability.
The maximum likelihood estimator (MLE) is the parameter value that maximizes the likelihood:
Intuitively, the MLE is the parameter value that makes the observed data most probable.
The Log-Likelihood
In practice, we almost always work with the log-likelihood:
This transformation offers multiple advantages:
Numerical stability: Products of many small probabilities underflow floating-point arithmetic; sums of log-probabilities do not.
Computational convenience: Sums are easier to differentiate and optimize than products.
Theoretical elegance: Asymptotic theory for the log-likelihood has cleaner formulations.
Since \(\log\) is monotonically increasing, maximizing \(\ell(\theta)\) is equivalent to maximizing \(L(\theta)\). The MLE is unchanged.
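The numerical-stability point is easy to see directly. The following sketch (the Bernoulli model, sample size, and seed are arbitrary choices of ours) evaluates the same likelihood as a raw product and on the log scale:
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=2000)   # 2000 Bernoulli(0.3) observations
p = 0.3
raw_likelihood = np.prod(stats.bernoulli.pmf(x, p))    # product of 2000 small numbers
log_likelihood = np.sum(stats.bernoulli.logpmf(x, p))  # sum of log-probabilities
print(f"Raw likelihood:  {raw_likelihood}")      # 0.0 -- underflows in double precision
print(f"Log-likelihood:  {log_likelihood:.2f}")  # finite and usable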
Fig. 83 Figure 3.2.1: The likelihood function for Poisson data. (a) Histogram of observed counts with sample mean \(\bar{x} = 3.10\). (b) Normalized likelihood function \(L(\lambda)/L(\hat{\lambda})\) showing the MLE at the peak. (c) Log-likelihood function \(\ell(\lambda)\), whose curvature at the maximum relates to Fisher information. The true parameter \(\lambda = 3.5\) is marked for comparison.
Example 💡 Normal Likelihood
Setup: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\) with both parameters unknown. The density is:
Log-likelihood derivation:
This expression will be maximized shortly to derive the normal MLEs.
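As a quick sanity check on this expression, the sketch below (data and trial parameter values are arbitrary) evaluates the normal log-likelihood directly and compares it with scipy.stats.norm.logpdf:
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=50)
mu, sigma2 = 4.8, 3.5  # arbitrary trial values, not the MLEs
# Direct evaluation of ℓ(μ, σ²) = -n/2·log(2πσ²) - Σ(xᵢ-μ)²/(2σ²)
n = len(x)
ell_direct = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu)**2) / (2 * sigma2)
# Same quantity via scipy
ell_scipy = np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))
print(f"Direct: {ell_direct:.6f}")
print(f"SciPy:  {ell_scipy:.6f}")  # agrees to floating-point precision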
The Score Function
The score function is the gradient of the log-likelihood with respect to the parameters. It plays a central role in both the theory and computation of MLE.
Definition and Properties
For a scalar parameter \(\theta\), the score function is:
For a parameter vector \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^\top\), the score is the gradient vector:
The MLE is typically found by solving the score equation \(U(\hat{\theta}) = 0\). At the maximum, the gradient of the log-likelihood vanishes.
The score function has a fundamental property that underlies much of likelihood theory:
Theorem: Score Has Mean Zero
Under regularity conditions (see below), the expected value of the score function at the true parameter value is zero:
Proof: For a single observation, the score contribution is \(u(X; \theta) = \frac{\partial}{\partial \theta} \log f(X|\theta)\). We compute:
Now, assuming we can interchange differentiation and integration (this is one of the regularity conditions):
Since the score for \(n\) iid observations is the sum of \(n\) individual scores, each with mean zero, we have \(\mathbb{E}[U(\theta_0)] = 0\). ∎
This result has an intuitive interpretation: at the true parameter value, the log-likelihood is “locally flat” on average—positive and negative slopes cancel out when we average over all possible datasets.
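A small Monte Carlo check makes this concrete for the Poisson model; the true rate, sample size, and number of replications below are arbitrary choices. The simulated score averages to approximately zero, and its variance matches \(n I_1(\lambda_0)\), anticipating the Fisher information defined in the next subsection:
import numpy as np
rng = np.random.default_rng(0)
lam0, n, n_sim = 3.5, 25, 50_000
# Poisson score at the true value: U(λ₀) = n(x̄/λ₀ - 1)
x_bar = rng.poisson(lam0, size=(n_sim, n)).mean(axis=1)
scores = n * (x_bar / lam0 - 1)
print(f"Mean of score:     {scores.mean():+.4f}   (theory: 0)")
print(f"Variance of score: {scores.var():.4f}   (theory: n/λ₀ = {n / lam0:.4f})")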
Fig. 84 Figure 3.2.2: The score function and its properties. (a) For Poisson data, the score \(U(\lambda) = n(\bar{x}/\lambda - 1)\) crosses zero at the MLE \(\hat{\lambda} = \bar{x}\). When \(U > 0\) (green region), the likelihood increases with \(\lambda\); when \(U < 0\) (red region), it decreases. (b) At the true parameter \(\lambda_0\), the score has mean zero and variance equal to the Fisher information: \(\text{Var}[U(\lambda_0)] = nI_1(\lambda_0)\).
Fisher Information
While the score tells us the direction of steepest ascent on the likelihood surface, the Fisher information tells us about the surface’s curvature—how sharply peaked the likelihood is around its maximum.
Definition
The Fisher information for a scalar parameter \(\theta\) is defined as the variance of the score:
where the second equality follows from \(\mathbb{E}[U(\theta)] = 0\).
Notation Convention: Expected vs. Observed Information
We adopt the following notation throughout this chapter:
Per-observation vs. total information:
\(I_1(\theta)\) = Fisher information from a single observation
\(I_n(\theta) = n \cdot I_1(\theta)\) = total Fisher information from \(n\) iid observations
Expected vs. observed information:
\(I(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]\) = expected (Fisher) information
\(J(\theta) = -\frac{\partial^2 \ell}{\partial \theta^2}\bigg|_{\text{observed}}\) = observed information (data-dependent)
Under correct model specification, \(\mathbb{E}[J(\theta_0)] = I(\theta_0)\). Under misspecification, these may differ, leading to the sandwich variance estimator (see below).
Convention in formulas: Unless subscripted, \(I(\theta)\) denotes per-observation expected information \(I_1(\theta)\). The Wald statistic \(W = n I_1(\hat{\theta})(\hat{\theta} - \theta_0)^2\) uses per-observation information.
Under the same regularity conditions that gave us \(\mathbb{E}[U] = 0\), we have an equivalent expression involving the second derivative:
This is the information equality: the variance of the score equals the negative expected Hessian.
Proof of information equality: We differentiate the identity \(\mathbb{E}_\theta[U(\theta)] = 0\) with respect to \(\theta\):
Applying the product rule inside the integral:
The first integral is \(\mathbb{E}[\partial^2 \ell / \partial \theta^2]\). For the second integral, note that \(\frac{\partial f}{\partial \theta} = f \cdot \frac{\partial \log f}{\partial \theta}\), so:
Therefore: \(\mathbb{E}[\partial^2 \ell / \partial \theta^2] + \mathbb{E}[U^2] = 0\), giving \(I(\theta) = \mathbb{E}[U^2] = -\mathbb{E}[\partial^2 \ell / \partial \theta^2]\). ∎
Additivity for IID Samples
For \(n\) iid observations, the Fisher information is additive:
where \(I_1(\theta)\) is the information from a single observation. This follows because the score is a sum of independent terms, and variance is additive for independent random variables.
This additivity has profound implications: more data means more information. Specifically, information grows linearly with sample size, which (as we will see) implies that standard errors shrink as \(1/\sqrt{n}\).
Fig. 85 Figure 3.2.3: Fisher information as log-likelihood curvature. (a) High information (\(I = 4.0\)): a sharply peaked log-likelihood means small changes in \(\theta\) produce large changes in \(\ell\)—the data strongly constrain the parameter. (b) Low information (\(I = 0.44\)): a broad peak indicates the data are consistent with many parameter values. (c) For the Bernoulli distribution, \(I(p) = 1/[p(1-p)]\) is minimized at \(p = 0.5\), where the outcome is most uncertain and each observation is most informative about \(p\).
Multivariate Fisher Information
For a parameter vector \(\boldsymbol{\theta} \in \mathbb{R}^p\), the Fisher information becomes a \(p \times p\) matrix:
The Fisher information matrix is positive semi-definite (as a covariance matrix must be), and under regularity conditions, positive definite.
Fig. 86 Figure 3.2.12: Multivariate Fisher information for the Normal distribution. (a) Joint sampling distribution of \((\hat{\mu}, \hat{\sigma}^2)\) from 1000 simulations (\(n=50\)). The elliptical 95% contour reflects the diagonal Fisher information matrix—the near-zero correlation (−0.04) confirms that \(\hat{\mu}\) and \(\hat{\sigma}^2\) are asymptotically independent. (b) Per-observation information components: \(I_{\mu\mu} = 1/\sigma^2\) for the mean and \(I_{\sigma^2\sigma^2} = 1/(2\sigma^4)\) for the variance. The off-diagonal \(I_{\mu,\sigma^2} = 0\) demonstrates parameter orthogonality—inference about one parameter does not depend on the other.
Example 💡 Fisher Information for Exponential Distribution
Setup: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exponential}(\lambda)\) where \(\lambda\) is the rate parameter. The density is \(f(x|\lambda) = \lambda e^{-\lambda x}\) for \(x > 0\).
Log-likelihood: \(\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i\)
Score: \(U(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = \frac{n}{\lambda} - n\bar{x}\)
Second derivative: \(\frac{\partial^2 \ell}{\partial \lambda^2} = -\frac{n}{\lambda^2}\)
Fisher information: \(I_n(\lambda) = -\mathbb{E}\left[-\frac{n}{\lambda^2}\right] = \frac{n}{\lambda^2}\)
The per-observation information is \(I_1(\lambda) = 1/\lambda^2\). Lower rates (longer mean lifetimes) provide more information per observation. This may seem counterintuitive: shouldn’t more events (higher rate) tell us more? The resolution lies in understanding what “information about \(\lambda\)” means.
When \(\lambda\) is small, lifetimes are long and vary considerably—the data spread out, allowing us to distinguish between nearby \(\lambda\) values. When \(\lambda\) is large, lifetimes are short and tightly concentrated near zero—the data provide less resolution for distinguishing \(\lambda\) from \(\lambda + \epsilon\).
Reparameterization perspective: In terms of the mean lifetime \(\theta = 1/\lambda\), the information is \(I_1(\theta) = 1/\theta^2\). The coefficient of variation of the MLE is \(\text{CV}(\hat{\theta}) = \text{SE}(\hat{\theta})/\theta = 1/\sqrt{n}\), which is constant regardless of \(\theta\). The absolute precision scales with the parameter, but relative precision is parameter-independent.
Closed-Form Maximum Likelihood Estimators
For many common distributions, setting the score equal to zero yields explicit formulas for the MLE.
Normal Distribution
For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\):
Score equations:
Solutions:
The MLE for \(\mu\) is the sample mean—unbiased and efficient. The MLE for \(\sigma^2\) is the biased sample variance (dividing by \(n\) rather than \(n-1\)). This illustrates an important point: MLEs are not always unbiased, though their bias typically vanishes as \(n \to \infty\).
Exponential Distribution
For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exponential}(\lambda)\) (rate parameterization):
Setting \(U(\lambda) = n/\lambda - n\bar{x} = 0\) gives:
The MLE is the reciprocal of the sample mean.
Poisson Distribution
For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\lambda)\):
Log-likelihood: \(\ell(\lambda) = \sum_{i=1}^n (x_i \log \lambda - \lambda - \log(x_i!))\)
Score: \(U(\lambda) = \frac{\sum_{i=1}^n x_i}{\lambda} - n = \frac{n\bar{x}}{\lambda} - n\)
Setting \(U(\lambda) = 0\):
The MLE equals the sample mean—exactly what we would expect given that \(\mathbb{E}[X] = \lambda\) for the Poisson.
Bernoulli and Binomial
For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Bernoulli}(p)\):
Log-likelihood: \(\ell(p) = \sum_{i=1}^n [x_i \log p + (1-x_i) \log(1-p)]\)
Score: \(U(p) = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p}\)
Setting \(U(p) = 0\) and solving:
The MLE is the sample proportion.
import numpy as np
from scipy import stats
def mle_normal(x):
"""
Compute MLEs for Normal(μ, σ²) distribution.
Parameters
----------
x : array-like
Observed data.
Returns
-------
dict
MLEs and standard errors.
"""
x = np.asarray(x)
n = len(x)
# Point estimates
mu_hat = np.mean(x)
sigma2_hat = np.var(x, ddof=0) # MLE uses divisor n, not n-1
# Fisher information (evaluated at MLE)
# I(μ) = n/σ², I(σ²) = n/(2σ⁴)
se_mu = np.sqrt(sigma2_hat / n)
se_sigma2 = sigma2_hat * np.sqrt(2 / n)
return {
'mu_hat': mu_hat,
'sigma2_hat': sigma2_hat,
'se_mu': se_mu,
'se_sigma2': se_sigma2
}
def mle_exponential(x):
"""
Compute MLE for Exponential(λ) distribution (rate parameterization).
Parameters
----------
x : array-like
Observed data (must be positive).
Returns
-------
dict
MLE and standard error.
"""
x = np.asarray(x)
n = len(x)
lambda_hat = 1 / np.mean(x)
# Fisher information: I(λ) = n/λ²
# SE(λ̂) = λ/√n (evaluated at MLE)
se_lambda = lambda_hat / np.sqrt(n)
return {
'lambda_hat': lambda_hat,
'se_lambda': se_lambda
}
# Demonstration
rng = np.random.default_rng(42)
# Normal data
true_mu, true_sigma = 5.0, 2.0
x_normal = rng.normal(true_mu, true_sigma, size=100)
result_normal = mle_normal(x_normal)
print("Normal MLE Results:")
print(f" μ̂ = {result_normal['mu_hat']:.4f} (true: {true_mu}), SE = {result_normal['se_mu']:.4f}")
print(f" σ̂² = {result_normal['sigma2_hat']:.4f} (true: {true_sigma**2}), SE = {result_normal['se_sigma2']:.4f}")
# Exponential data
true_lambda = 0.5
x_exp = rng.exponential(scale=1/true_lambda, size=100)
result_exp = mle_exponential(x_exp)
print(f"\nExponential MLE Results:")
print(f" λ̂ = {result_exp['lambda_hat']:.4f} (true: {true_lambda}), SE = {result_exp['se_lambda']:.4f}")
Normal MLE Results:
μ̂ = 4.8572 (true: 5.0), SE = 0.1918
σ̂² = 3.6800 (true: 4.0), SE = 0.5205
Exponential MLE Results:
λ̂ = 0.4834 (true: 0.5), SE = 0.0483
Exact Finite-Sample Properties
While our focus is on asymptotic theory, several exact results are available for common distributions and provide useful benchmarks.
Normal distribution: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\):
The MLEs \(\hat{\mu} = \bar{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2\) are independent
\(\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/n)\) exactly
\(\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}\) exactly
The \(\chi^2_{n-1}\) result implies \(\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2\)—the MLE is biased. The unbiased estimator \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\) corrects this.
Exponential distribution: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exp}(\lambda)\) with MLE \(\hat{\lambda} = 1/\bar{X}\):
\(n\bar{X}\) has a Gamma(\(n, \lambda\)) distribution
The exact mean is \(\mathbb{E}[\hat{\lambda}] = \lambda \cdot \frac{n}{n-1}\) for \(n > 1\), so the bias is \(\lambda/(n-1)\)
The exact variance is \(\text{Var}(\hat{\lambda}) = \lambda^2 \cdot \frac{n^2}{(n-1)^2(n-2)}\) for \(n > 2\)
The bias-corrected estimator \(\tilde{\lambda} = \frac{n-1}{n} \hat{\lambda} = \frac{n-1}{n\bar{X}}\) is unbiased.
import numpy as np
from scipy import stats
def verify_exponential_exact_results():
"""Verify exact finite-sample results for exponential MLE."""
rng = np.random.default_rng(42)
true_lambda = 2.0
n = 10
n_sim = 100000
mles = np.zeros(n_sim)
for i in range(n_sim):
x = rng.exponential(scale=1/true_lambda, size=n)
mles[i] = 1 / np.mean(x)
# Theoretical exact values
theory_mean = true_lambda * n / (n - 1)
    theory_var = true_lambda**2 * n**2 / ((n-1)**2 * (n-2))
print("EXACT FINITE-SAMPLE RESULTS: EXPONENTIAL MLE")
print("=" * 50)
print(f"n = {n}, true λ = {true_lambda}")
print(f"\n{'Quantity':<20} {'Theory':>15} {'Empirical':>15}")
print("-" * 50)
print(f"{'E[λ̂]':<20} {theory_mean:>15.6f} {np.mean(mles):>15.6f}")
print(f"{'Var(λ̂)':<20} {theory_var:>15.6f} {np.var(mles):>15.6f}")
print(f"{'Bias':<20} {theory_mean - true_lambda:>15.6f} {np.mean(mles) - true_lambda:>15.6f}")
verify_exponential_exact_results()
EXACT FINITE-SAMPLE RESULTS: EXPONENTIAL MLE
==================================================
n = 10, true λ = 2.0
Quantity Theory Empirical
--------------------------------------------------
E[λ̂] 2.222222 2.221456
Var(λ̂) 0.617284 0.615890
Bias 0.222222 0.221456
Fig. 87 Figure 3.2.10: Exact versus asymptotic finite-sample properties of the exponential MLE. (a) The MLE is biased upward: \(\mathbb{E}[\hat{\lambda}] = \lambda n/(n-1) > \lambda\), with bias decreasing as \(n\) increases. (b) The exact variance \(\lambda^2 n^2/[(n-1)^2(n-2)]\) substantially exceeds the asymptotic approximation \(\lambda^2/n\) for small samples. The \(n-1\) and \(n-2\) factors explain why finite-sample corrections matter.
When Closed Forms Don’t Exist
Not all MLEs have closed-form solutions. Two important examples:
Gamma distribution: For \(X \sim \text{Gamma}(\alpha, \beta)\), the score equation for \(\alpha\) involves the digamma function \(\psi(\alpha) = \frac{d}{d\alpha} \log \Gamma(\alpha)\):
This transcendental equation has no closed-form solution; numerical methods are required.
Beta distribution: Similarly, the Beta(\(\alpha, \beta\)) MLEs require solving a system involving digamma functions.
Mixture models: Even simple mixtures like \(p \cdot \mathcal{N}(\mu_1, \sigma_1^2) + (1-p) \cdot \mathcal{N}(\mu_2, \sigma_2^2)\) have likelihood surfaces that preclude closed-form solutions.
For these cases, we turn to numerical optimization.
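For the Gamma case, one practical route (a sketch under the shape-rate parameterization, with data-generating values and bracket endpoints chosen purely for illustration) is to profile out the rate and reduce the problem to one-dimensional root finding:
import numpy as np
from scipy import optimize, special
rng = np.random.default_rng(42)
x = rng.gamma(shape=3.0, scale=1/2.0, size=200)   # true α = 3, β = 2 (shape-rate)
# Profile out the rate: β̂ = α̂/x̄. The shape then solves
#   log(α) - ψ(α) = log(x̄) - mean(log x),
# a one-dimensional root-finding problem.
c = np.log(np.mean(x)) - np.mean(np.log(x))       # > 0 by Jensen's inequality
alpha_hat = optimize.brentq(lambda a: np.log(a) - special.digamma(a) - c, 1e-3, 1e3)
beta_hat = alpha_hat / np.mean(x)
print(f"α̂ = {alpha_hat:.4f}, β̂ = {beta_hat:.4f}")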
Numerical Optimization for MLE
When closed-form solutions are unavailable, we compute MLEs numerically by optimizing the log-likelihood. Two classical algorithms dominate: Newton-Raphson and Fisher scoring.
Newton-Raphson Method
Newton-Raphson is a general-purpose optimization algorithm based on quadratic approximation of the objective function.
At iteration \(t\), we approximate the log-likelihood near \(\theta^{(t)}\) by its second-order Taylor expansion:
where \(U(\theta) = \partial \ell / \partial \theta\) is the score and \(H(\theta) = \partial^2 \ell / \partial \theta^2\) is the Hessian (second derivative).
Maximizing this quadratic approximation gives the update:
In the multivariate case with parameter vector \(\boldsymbol{\theta}\):
where \(\mathbf{H}\) is the \(p \times p\) Hessian matrix.
Properties of Newton-Raphson:
Quadratic convergence: Near the optimum, the error decreases quadratically—each iteration roughly doubles the number of correct digits.
Local convergence: Convergence is only guaranteed if we start sufficiently close to the optimum.
Potential instability: If the Hessian is not negative definite (the log-likelihood is not locally concave), the algorithm may diverge or converge to a saddle point.
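To make the update concrete, here is a minimal one-dimensional Newton-Raphson sketch for the Gamma shape parameter with the rate profiled out as \(\beta = \alpha/\bar{x}\), in the spirit of the figure below; the starting value, safeguard, and simulated data are our own illustrative choices:
import numpy as np
from scipy import special
def gamma_shape_newton(x, tol=1e-10, max_iter=50):
    """Newton-Raphson for the Gamma shape α on the profile log-likelihood
    (rate profiled out as β = α/x̄). A sketch, not production code."""
    x = np.asarray(x)
    n = len(x)
    c = np.log(np.mean(x)) - np.mean(np.log(x))
    alpha = 0.5 / c                          # rough large-α starting value
    for _ in range(max_iter):
        g = n * (np.log(alpha) - special.digamma(alpha) - c)   # profile score
        h = n * (1 / alpha - special.polygamma(1, alpha))      # its derivative (negative)
        alpha_new = alpha - g / h                               # Newton update
        if alpha_new <= 0:                                      # simple safeguard
            alpha_new = alpha / 2
        if abs(alpha_new - alpha) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha
rng = np.random.default_rng(42)
x = rng.gamma(shape=4.0, scale=1.0, size=300)
print(f"Newton-Raphson α̂ = {gamma_shape_newton(x):.6f}")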
Fig. 88 Figure 3.2.5: Newton-Raphson optimization for the Gamma distribution. (a) The profile log-likelihood \(\ell_p(\alpha)\) with Newton-Raphson iterates shown as points. Starting from a method-of-moments initial value near \(\alpha = 2.5\), the algorithm converges to the MLE \(\hat{\alpha} = 4.02\) in just 5 iterations. (b) Quadratic convergence: the error \(|\alpha^{(t)} - \hat{\alpha}|\) decreases super-exponentially, roughly doubling the number of correct digits per iteration.
Fisher Scoring
Fisher scoring replaces the observed Hessian \(H(\theta)\) with its expected value, the negative Fisher information:
Why Fisher scoring?
Guaranteed stability: The Fisher information matrix is always positive definite (under regularity conditions), ensuring we always move in an ascent direction.
Simpler computation: For some models, the expected information has a simpler form than the observed Hessian.
Statistical interpretation: Fisher scoring is equivalent to iteratively reweighted least squares (IRLS) for generalized linear models.
For exponential families with canonical links, Newton-Raphson and Fisher scoring are identical: the observed and expected information coincide.
import numpy as np
from scipy import special
def mle_gamma_fisher_scoring(x, max_iter=100, tol=1e-8):
"""
Compute MLE for Gamma(α, β) using Fisher scoring.
Uses shape-rate parameterization where E[X] = α/β, Var(X) = α/β².
Parameters
----------
x : array-like
Observed data (must be positive).
max_iter : int
Maximum iterations.
tol : float
Convergence tolerance.
Returns
-------
dict
MLEs, standard errors, and convergence info.
"""
x = np.asarray(x)
n = len(x)
x_bar = np.mean(x)
log_x_bar = np.mean(np.log(x))
# Method of moments initialization
s2 = np.var(x, ddof=1)
alpha = x_bar**2 / s2
beta = x_bar / s2
history = [(alpha, beta)]
for iteration in range(max_iter):
# Score functions
# ∂ℓ/∂α = n[log(β) - ψ(α)] + Σlog(xᵢ)
# ∂ℓ/∂β = nα/β - Σxᵢ
psi_alpha = special.digamma(alpha)
score_alpha = n * (np.log(beta) - psi_alpha) + n * log_x_bar
score_beta = n * alpha / beta - n * x_bar
# Fisher information matrix
# I_αα = n·ψ'(α), I_ββ = nα/β², I_αβ = -n/β
psi1_alpha = special.polygamma(1, alpha) # trigamma
I_aa = n * psi1_alpha
I_bb = n * alpha / beta**2
I_ab = -n / beta
# Invert 2x2 Fisher information
det = I_aa * I_bb - I_ab**2
I_inv = np.array([[I_bb, -I_ab], [-I_ab, I_aa]]) / det
# Fisher scoring update
score = np.array([score_alpha, score_beta])
delta = I_inv @ score
alpha_new = alpha + delta[0]
beta_new = beta + delta[1]
# Ensure parameters stay positive
alpha_new = max(alpha_new, 1e-10)
beta_new = max(beta_new, 1e-10)
history.append((alpha_new, beta_new))
# Check convergence
if np.max(np.abs(delta)) < tol:
break
alpha, beta = alpha_new, beta_new
# Standard errors from inverse Fisher information at MLE
psi1_alpha = special.polygamma(1, alpha)
I_aa = n * psi1_alpha
I_bb = n * alpha / beta**2
I_ab = -n / beta
det = I_aa * I_bb - I_ab**2
I_inv = np.array([[I_bb, -I_ab], [-I_ab, I_aa]]) / det
return {
'alpha_hat': alpha,
'beta_hat': beta,
'se_alpha': np.sqrt(I_inv[0, 0]),
'se_beta': np.sqrt(I_inv[1, 1]),
'iterations': iteration + 1,
'converged': iteration + 1 < max_iter,
'history': history
}
# Demonstration
rng = np.random.default_rng(42)
true_alpha, true_beta = 3.0, 2.0
x_gamma = rng.gamma(shape=true_alpha, scale=1/true_beta, size=200)
result = mle_gamma_fisher_scoring(x_gamma)
print("Gamma MLE via Fisher Scoring:")
print(f" α̂ = {result['alpha_hat']:.4f} (true: {true_alpha}), SE = {result['se_alpha']:.4f}")
print(f" β̂ = {result['beta_hat']:.4f} (true: {true_beta}), SE = {result['se_beta']:.4f}")
print(f" Converged in {result['iterations']} iterations")
Gamma MLE via Fisher Scoring:
α̂ = 3.1538 (true: 3.0), SE = 0.3209
β̂ = 2.0893 (true: 2.0), SE = 0.2257
Converged in 6 iterations
Connection to Generalized Linear Models
Fisher scoring becomes especially elegant for exponential family models with canonical links. In Section 3.7, we show that:
For an exponential family response with canonical link, the observed and expected information are equal
Therefore, Newton-Raphson and Fisher scoring produce identical iterations
Fisher scoring reduces to Iteratively Reweighted Least Squares (IRLS), solving at each step:
\[\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{z}^{(t)}\]
where \(\mathbf{W}\) is a diagonal weight matrix and \(\mathbf{z}\) is a “working response”
This connection—from general MLE theory through Fisher scoring to IRLS for GLMs—illustrates how foundational concepts build toward practical algorithms.
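As a concrete preview (a minimal sketch under the canonical logit link, not the full GLM treatment of Section 3.7), the following implements IRLS for logistic regression; the function name and simulated data are illustrative assumptions:
import numpy as np
def irls_logistic(X, y, max_iter=25, tol=1e-10):
    """IRLS (= Fisher scoring = Newton-Raphson) for logistic regression.
    Minimal sketch: no regularization or convergence diagnostics."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))          # mean under the canonical (logit) link
        w = mu * (1 - mu)                    # diagonal of W
        z = eta + (y - mu) / w               # working response
        # Weighted least squares step: (XᵀWX)⁻¹ XᵀWz
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
# Illustrative usage (data-generating values are assumptions)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1]))))
print("IRLS estimates:", irls_logistic(X, y))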
Practical Implementation with SciPy
For production code, scipy.optimize provides robust, well-tested optimization routines:
import numpy as np
from scipy import optimize, special, stats
def mle_gamma_scipy(x):
"""
Compute Gamma MLE using scipy.optimize.
Parameters
----------
x : array-like
Observed data.
Returns
-------
dict
MLEs and optimization results.
"""
x = np.asarray(x)
n = len(x)
def neg_log_likelihood(params):
alpha, beta = params
if alpha <= 0 or beta <= 0:
return np.inf
        # Gamma log-likelihood (rate parameterization); gammaln avoids overflow of Γ(α)
        return -(n * alpha * np.log(beta) - n * special.gammaln(alpha)
                 + (alpha - 1) * np.sum(np.log(x)) - beta * np.sum(x))
# Method of moments starting values
x_bar = np.mean(x)
s2 = np.var(x, ddof=1)
alpha0 = x_bar**2 / s2
beta0 = x_bar / s2
# L-BFGS-B with bounds to keep parameters positive
result = optimize.minimize(
neg_log_likelihood,
x0=[alpha0, beta0],
method='L-BFGS-B',
bounds=[(1e-10, None), (1e-10, None)]
)
# Standard errors via numerical Hessian
hess_inv = result.hess_inv.todense() if hasattr(result.hess_inv, 'todense') else result.hess_inv
se = np.sqrt(np.diag(hess_inv))
return {
'alpha_hat': result.x[0],
'beta_hat': result.x[1],
'se_alpha': se[0] if len(se) > 0 else np.nan,
'se_beta': se[1] if len(se) > 1 else np.nan,
'success': result.success,
'message': result.message
}
# Compare with scipy.stats.gamma.fit
# Note: scipy uses shape, loc, scale parameterization
fitted = stats.gamma.fit(x_gamma, floc=0) # Fix location at 0
print(f"\nscipy.stats.gamma.fit: shape={fitted[0]:.4f}, scale={fitted[2]:.4f}")
print(f" Implied β = 1/scale = {1/fitted[2]:.4f}")
scipy.stats.gamma.fit: shape=3.1538, scale=0.4786
Implied β = 1/scale = 2.0893
Common Pitfall ⚠️ Parameterization Consistency
Different software packages use different parameterizations for the same distribution:
Gamma: SciPy uses shape-scale; some texts use shape-rate (β = 1/scale)
Exponential: SciPy uses scale (mean); some texts use rate (1/mean)
Normal: The second parameter is always variance in this text; some use standard deviation
Always verify parameterization before comparing results across packages or implementing formulas from different sources.
Practical Safeguards for Numerical MLE
Real-world MLE optimization requires more than the basic Newton-Raphson update. Several safeguards improve robustness and reliability.
1. Line search and step control
The pure Newton step \(\theta^{(t+1)} = \theta^{(t)} - H^{-1} U\) may overshoot, especially far from the optimum. Line search modifies this to:
where \(d_t = -H^{-1} U\) is the Newton direction and \(\alpha_t \in (0, 1]\) is chosen to ensure sufficient increase in \(\ell(\theta)\). The Armijo condition requires:
for some \(c \in (0, 1)\) (typically \(c = 10^{-4}\)). Start with \(\alpha = 1\) and backtrack by halving until the condition holds.
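A minimal backtracking implementation of this rule might look as follows; the function signature and the callables loglik, score, and hessian are our own conventions for illustration, not a library API:
import numpy as np
def backtracking_newton_step(theta, loglik, score, hessian, c=1e-4, max_halvings=30):
    """One Newton step with Armijo backtracking line search (a sketch).
    loglik, score, hessian are callables returning ℓ(θ), U(θ), and H(θ)."""
    U = score(theta)
    H = hessian(theta)
    direction = -np.linalg.solve(H, U)        # Newton direction d = -H⁻¹U
    ell0 = loglik(theta)
    alpha = 1.0
    for _ in range(max_halvings):
        candidate = theta + alpha * direction
        # Armijo condition: require sufficient increase in the log-likelihood
        if loglik(candidate) >= ell0 + c * alpha * (U @ direction):
            return candidate
        alpha *= 0.5                           # backtrack by halving
    return theta                               # no acceptable step found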
2. Trust region methods
Instead of line search along a direction, trust region methods constrain the step size directly:
subject to \(\|\theta - \theta^{(t)}\| \leq \Delta_t\). The trust region radius \(\Delta_t\) adapts based on how well the quadratic approximation predicts actual improvement.
3. Parameter transformations for constraints
Many parameters have natural constraints. Transform to an unconstrained scale:
| Constraint | Transformation | Original parameter |
|---|---|---|
| \(\sigma^2 > 0\) | \(\eta = \log(\sigma^2)\) | \(\sigma^2 = e^\eta\) |
| \(0 < p < 1\) | \(\eta = \log(p/(1-p))\) | \(p = 1/(1 + e^{-\eta})\) |
| \(\lambda > 0\) | \(\eta = \log(\lambda)\) | \(\lambda = e^\eta\) |
| \(\rho \in (-1, 1)\) | \(\eta = \text{arctanh}(\rho)\) | \(\rho = \tanh(\eta)\) |
Optimize in \(\eta\), then transform back. The Jacobian of the transformation enters the standard error calculation.
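As a small illustration of the back-transformation step, the sketch below estimates an exponential rate on the log scale and converts the standard error with the delta method; the data, seed, and reliance on the BFGS inverse-Hessian approximation are assumptions for this example:
import numpy as np
from scipy import optimize
rng = np.random.default_rng(7)
x = rng.exponential(scale=1/0.5, size=100)    # true rate λ = 0.5
n, sx = len(x), np.sum(x)
def neg_ll(eta):
    lam = np.exp(eta[0])                      # λ = e^η keeps λ > 0 automatically
    return -(n * np.log(lam) - lam * sx)      # negative exponential log-likelihood
res = optimize.minimize(neg_ll, x0=np.array([0.0]), method='BFGS')
eta_hat = res.x[0]
lambda_hat = np.exp(eta_hat)
se_eta = np.sqrt(res.hess_inv[0, 0])          # BFGS approximation to Var(η̂)
se_lambda = lambda_hat * se_eta               # delta method: |dλ/dη| = e^η = λ
print(f"λ̂ = {lambda_hat:.4f}, SE ≈ {se_lambda:.4f} "
      f"(closed form λ̂/√n = {lambda_hat / np.sqrt(n):.4f})")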
4. Scaling and conditioning
Poor numerical conditioning causes optimization difficulties:
Center predictors: In regression, use \(\tilde{x}_j = x_j - \bar{x}_j\) to reduce correlation between intercept and slopes
Scale parameters: If parameters differ by orders of magnitude, rescale so they’re comparable
Check condition number: If \(\kappa(H) = \lambda_{\max} / \lambda_{\min} > 10^6\), the problem is ill-conditioned
from scipy.optimize import minimize
def mle_with_safeguards(neg_log_lik, x0, bounds=None, transform=None):
"""
MLE with practical safeguards.
Parameters
----------
neg_log_lik : callable
Negative log-likelihood function.
x0 : ndarray
Initial parameter values.
bounds : list of tuples, optional
Parameter bounds for L-BFGS-B.
transform : dict, optional
Parameter transformations {'to_unconstrained': func, 'from_unconstrained': func}.
Returns
-------
result : OptimizeResult
Optimization result with MLE and diagnostics.
"""
# Apply transformation if provided
if transform is not None:
x0_transformed = transform['to_unconstrained'](x0)
def neg_ll_transformed(eta):
theta = transform['from_unconstrained'](eta)
return neg_log_lik(theta)
result = minimize(neg_ll_transformed, x0_transformed,
method='BFGS', # No bounds needed after transform
options={'gtol': 1e-8, 'maxiter': 1000})
result.x = transform['from_unconstrained'](result.x)
else:
# Use L-BFGS-B with bounds if provided
result = minimize(neg_log_lik, x0,
method='L-BFGS-B' if bounds else 'BFGS',
bounds=bounds,
options={'gtol': 1e-8, 'maxiter': 1000})
return result
Asymptotic Properties of MLEs
The true power of maximum likelihood estimation lies in its asymptotic properties. Under regularity conditions, MLEs are consistent, asymptotically normal, and asymptotically efficient.
Regularity Conditions
The classical asymptotic theory requires the following conditions. Let \(\theta_0\) denote the true parameter value:
Regularity Conditions for MLE Asymptotics
R1. Identifiability: Different parameter values give different distributions: \(\theta_1 \neq \theta_2 \Rightarrow f(\cdot|\theta_1) \neq f(\cdot|\theta_2)\).
R2. Common support: The support of \(f(x|\theta)\) does not depend on \(\theta\).
R3. Interior true value: \(\theta_0\) is in the interior of the parameter space \(\Theta\).
R4. Differentiability: \(\log f(x|\theta)\) is three times continuously differentiable in \(\theta\) for all \(x\) in the support.
R5. Integrability: We can interchange differentiation and integration: \(\frac{\partial}{\partial \theta} \int f(x|\theta) dx = \int \frac{\partial f(x|\theta)}{\partial \theta} dx\).
R6. Finite Fisher information: \(0 < I(\theta_0) < \infty\).
R7. Uniform integrability: \(\left|\frac{\partial^3 \log f(x|\theta)}{\partial \theta^3}\right|\) is bounded by a function with finite expectation in a neighborhood of \(\theta_0\).
These conditions exclude some pathological cases (e.g., uniform distributions with unknown endpoints) but are satisfied by most common parametric families.
Consistency
Theorem: Consistency of MLE
Under regularity conditions R1–R3, the MLE is consistent:
Proof sketch: The key insight is that maximizing the log-likelihood is, in the limit, equivalent to minimizing the Kullback-Leibler divergence from the true distribution, or equivalently maximizing the expected log-likelihood. Define:
\[M(\theta) = \mathbb{E}_{\theta_0}[\log f(X|\theta)]\]
By Jensen’s inequality and strict concavity of the log function, \(M(\theta)\) is uniquely maximized at \(\theta = \theta_0\) (uniqueness uses identifiability, R1). The sample average \(\frac{1}{n}\ell(\theta)\) converges uniformly to \(M(\theta)\) by the uniform law of large numbers, and the maximizer of the sample average converges to the maximizer of the limit. ∎
Asymptotic Normality
Theorem: Asymptotic Normality of MLE
Under regularity conditions R1–R7, the MLE is asymptotically normal:
Equivalently, for large \(n\):
Proof: Taylor expand the score function around \(\theta_0\):
for some \(\tilde{\theta}\) between \(\hat{\theta}_n\) and \(\theta_0\). Rearranging:
The numerator: \(\frac{1}{\sqrt{n}} U(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n u_i\) where \(u_i\) has mean 0 and variance \(I_1(\theta_0)\). By the CLT:
The denominator: \(-\frac{1}{n} \frac{\partial U}{\partial \theta} = -\frac{1}{n} \frac{\partial^2 \ell}{\partial \theta^2} \xrightarrow{p} I_1(\theta_0)\) by the LLN and consistency of \(\hat{\theta}_n\).
By Slutsky’s theorem:
This result is remarkably powerful: regardless of the underlying distribution, MLEs have approximately normal sampling distributions with variance determined by the Fisher information.
Fig. 89 Figure 3.2.4: Asymptotic normality of the exponential MLE. The standardized statistic \(\sqrt{nI_1(\lambda_0)}(\hat{\lambda} - \lambda_0)\) converges to \(N(0,1)\) as \(n\) increases. For \(n=5\), the distribution is right-skewed (skewness = 2.63); by \(n=200\), it closely matches the standard normal (skewness = 0.28). The K-S test p-values confirm improved approximation with larger samples.
Model Misspecification and the Sandwich Estimator
The asymptotic theory developed above assumes the model is correctly specified—that the data truly come from \(f(x|\theta_0)\) for some \(\theta_0 \in \Theta\). In practice, all models are approximations. What happens when the model is wrong?
The quasi-MLE target: Even under misspecification, the MLE converges to a well-defined limit: the parameter value \(\theta^*\) that minimizes the Kullback-Leibler divergence from the true data-generating distribution \(g(x)\) to the model family:
This is the “best approximation” within the model family, even if no member of the family is correct.
Asymptotic normality still holds, but with a different variance. Define:
Under correct specification, \(\mathbf{A} = \mathbf{B} = \mathbf{I}(\theta_0)\) (the information equality). Under misspecification, \(\mathbf{A} \neq \mathbf{B}\) in general.
Theorem: Quasi-MLE Asymptotics
Under misspecification and regularity conditions:
The variance \(\mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}\) is called the sandwich variance (or Huber-White variance).
Practical implications:
Standard errors may be wrong: If we use \(I(\hat{\theta})^{-1}\) for variance (assuming correct specification), SEs can be too small or too large depending on the nature of misspecification.
Sandwich (robust) standard errors: Estimate \(\mathbf{A}\) and \(\mathbf{B}\) separately:
\[\hat{\mathbf{A}} = -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(x_i|\hat{\theta})}{\partial \theta \partial \theta^\top}, \quad \hat{\mathbf{B}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(x_i|\hat{\theta})}{\partial \theta} \frac{\partial \log f(x_i|\hat{\theta})}{\partial \theta^\top}\]
Then \(\widehat{\text{Var}}(\hat{\theta}) = \hat{\mathbf{A}}^{-1} \hat{\mathbf{B}} \hat{\mathbf{A}}^{-1} / n\).
Model-based vs. robust inference: Under correct specification, both give consistent SEs. Under misspecification, only sandwich SEs are consistent. The difference between them is a diagnostic for misspecification.
import numpy as np
def sandwich_variance(score_contributions, hessian):
"""
Compute sandwich variance estimator.
Parameters
----------
score_contributions : ndarray, shape (n, p)
Individual score contributions ∂log f(xᵢ|θ̂)/∂θ for each observation.
hessian : ndarray, shape (p, p)
Observed Hessian ∂²ℓ/∂θ∂θ' evaluated at MLE.
Returns
-------
sandwich_var : ndarray, shape (p, p)
Sandwich variance estimate for θ̂.
"""
n = score_contributions.shape[0]
# A = -Hessian/n (average negative curvature)
A = -hessian / n
# B = Var(score) estimate = (1/n) Σ uᵢuᵢ' (outer products)
B = score_contributions.T @ score_contributions / n
# Sandwich: A⁻¹ B A⁻¹ / n
A_inv = np.linalg.inv(A)
sandwich_var = A_inv @ B @ A_inv / n
return sandwich_var
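An illustrative use of this function, in the spirit of the figure below: fit a Normal(μ, σ²) model by MLE to heavy-tailed \(t_5\) data and compare model-based and sandwich standard errors for \(\hat{\sigma}^2\) (all simulation settings here are arbitrary choices, and the score/Hessian formulas are the standard normal-model expressions evaluated at the MLE):
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
x = stats.t.rvs(df=5, size=2000, random_state=rng)
n = len(x)
mu_hat, sigma2_hat = np.mean(x), np.var(x)
# Per-observation score contributions for (μ, σ²), evaluated at the MLE
scores = np.column_stack([
    (x - mu_hat) / sigma2_hat,
    -0.5 / sigma2_hat + (x - mu_hat)**2 / (2 * sigma2_hat**2),
])
# Observed Hessian at the MLE (the off-diagonal term vanishes at μ̂)
hessian = np.array([[-n / sigma2_hat, 0.0],
                    [0.0, -n / (2 * sigma2_hat**2)]])
V_robust = sandwich_variance(scores, hessian)
se_model = sigma2_hat * np.sqrt(2 / n)        # model-based SE for σ̂² (assumes normal kurtosis)
se_robust = np.sqrt(V_robust[1, 1])           # sandwich SE for σ̂²
print(f"SE(σ̂²): model-based = {se_model:.4f}, sandwich = {se_robust:.4f}")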
Fig. 90 Figure 3.2.9: Effects of model misspecification on MLE inference. A Normal model is fit to data from three distributions: (left) correctly specified \(N(0,1)\), (center) mildly misspecified \(t_5\), and (right) severely misspecified \(t_3\). Under correct specification, model-based and sandwich SEs agree. Under misspecification, the heavier tails of \(t\)-distributions inflate variance; model-based SEs underestimate true variability while sandwich SEs remain consistent.
The Cramér-Rao Lower Bound
The Cramér-Rao inequality establishes a fundamental limit on how precisely any unbiased estimator can estimate a parameter. The MLE achieves this bound asymptotically.
Statement and Proof
Theorem: Cramér-Rao Lower Bound
Let \(T = T(X_1, \ldots, X_n)\) be any unbiased estimator of \(\tau(\theta)\), a function of the parameter. Under regularity conditions R4–R6:
For estimating \(\theta\) itself (\(\tau(\theta) = \theta\), so \(\tau'(\theta) = 1\)):
Complete Proof:
The proof uses the Cauchy-Schwarz inequality. Define:
\(U = U(\theta)\) = score function
\(T\) = unbiased estimator with \(\mathbb{E}_\theta[T] = \tau(\theta)\)
We know \(\mathbb{E}_\theta[U] = 0\) (score has mean zero).
Step 1: Differentiate the unbiasedness condition.
Since \(\mathbb{E}_\theta[T] = \tau(\theta)\), we have:
Differentiating both sides with respect to \(\theta\):
Using \(\frac{\partial f}{\partial \theta} = f \cdot \frac{\partial \log f}{\partial \theta}\):
This is \(\mathbb{E}_\theta[T \cdot U] = \tau'(\theta)\).
Step 2: Compute the covariance.
Since \(\mathbb{E}[U] = 0\):
Step 3: Apply Cauchy-Schwarz.
For any random variables \(A, B\): \([\text{Cov}(A,B)]^2 \leq \text{Var}(A) \cdot \text{Var}(B)\).
Applied to \(T\) and \(U\):
Rearranging:
Efficiency
An estimator that achieves the Cramér-Rao bound is called efficient. The MLE is not generally efficient for finite samples, but it achieves the bound asymptotically:
Theorem: Asymptotic Efficiency of MLE
Under regularity conditions, the MLE is asymptotically efficient: its asymptotic variance equals the Cramér-Rao lower bound.
This means that for large samples, no unbiased estimator can have smaller variance than the MLE.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def simulate_mle_distribution(true_theta, n_samples, n_simulations=5000, seed=42):
"""
Simulate the sampling distribution of the Poisson MLE to verify asymptotic normality.
Parameters
----------
true_theta : float
True Poisson rate parameter.
n_samples : int
Sample size per simulation.
n_simulations : int
Number of Monte Carlo simulations.
seed : int
Random seed.
Returns
-------
dict
MLEs, theoretical quantities, and test statistics.
"""
rng = np.random.default_rng(seed)
# Simulate MLEs
mles = np.zeros(n_simulations)
for i in range(n_simulations):
x = rng.poisson(true_theta, size=n_samples)
mles[i] = np.mean(x) # MLE for Poisson is sample mean
# Theoretical values
# Fisher information for Poisson: I₁(λ) = 1/λ
fisher_info = 1 / true_theta
theoretical_se = np.sqrt(1 / (n_samples * fisher_info))
cramer_rao_bound = 1 / (n_samples * fisher_info)
# Standardized MLEs for normality check
standardized = (mles - true_theta) / theoretical_se
return {
'mles': mles,
'mean': np.mean(mles),
'std': np.std(mles),
'theoretical_mean': true_theta,
'theoretical_se': theoretical_se,
'cramer_rao_bound': cramer_rao_bound,
'standardized': standardized
}
# Run simulation
true_lambda = 5.0
results = {}
sample_sizes = [10, 50, 200]
print("Verifying MLE Asymptotic Properties (Poisson)")
print("=" * 60)
print(f"True λ = {true_lambda}, CR Bound = 1/(n·I₁) = λ/n")
print()
print(f"{'n':>6} {'Mean(λ̂)':>12} {'SD(λ̂)':>12} {'Theor SE':>12} {'CR Bound':>12}")
print("-" * 60)
for n in sample_sizes:
results[n] = simulate_mle_distribution(true_lambda, n)
r = results[n]
print(f"{n:>6} {r['mean']:>12.4f} {r['std']:>12.4f} "
f"{r['theoretical_se']:>12.4f} {np.sqrt(r['cramer_rao_bound']):>12.4f}")
Verifying MLE Asymptotic Properties (Poisson)
============================================================
True λ = 5.0, CR Bound = 1/(n·I₁) = λ/n
n Mean(λ̂) SD(λ̂) Theor SE CR Bound
------------------------------------------------------------
10 5.0021 0.7106 0.7071 0.7071
50 5.0006 0.3173 0.3162 0.3162
200 4.9990 0.1575 0.1581 0.1581
Fig. 91 Figure 3.2.8: Asymptotic efficiency of the exponential MLE. (a) Log-log plot showing that empirical \(\text{Var}(\hat{\lambda})\) closely tracks the Cramér-Rao lower bound \(\lambda^2/n\) across sample sizes. (b) Efficiency ratio \(\text{CRLB}/\text{Var}(\hat{\lambda})\) approaches 1 as \(n \to \infty\), confirming that the MLE achieves the theoretical minimum variance asymptotically.
The Invariance Property
A remarkable feature of maximum likelihood is its behavior under reparameterization.
Theorem: Invariance of MLE
If \(\hat{\theta}\) is the MLE of \(\theta\), then for any function \(g\), the MLE of \(\tau = g(\theta)\) is \(\hat{\tau} = g(\hat{\theta})\).
This follows directly from the definition: if \(\hat{\theta}\) maximizes \(L(\theta)\), then \(g(\hat{\theta})\) maximizes \(L(g^{-1}(\tau))\) over \(\tau\) (when \(g\) is one-to-one).
Example: For exponential data, \(\hat{\lambda} = 1/\bar{x}\). The MLE of the mean \(\mu = 1/\lambda\) is therefore \(\hat{\mu} = \bar{x}\)—no additional optimization needed.
This invariance property distinguishes MLE from other estimation methods. The method of moments estimator for \(g(\theta)\) is generally not \(g(\hat{\theta}_{\text{MoM}})\).
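A quick numerical illustration of invariance (the data-generating values are arbitrary):
import numpy as np
from scipy import stats
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200)      # true mean 2.0, true rate 0.5
lambda_hat = 1 / np.mean(x)                   # MLE of the rate
mu_hat = 1 / lambda_hat                       # MLE of the mean by invariance (= x̄)
# Cross-check: scipy's exponential fit returns the scale (= mean) directly
loc, scale = stats.expon.fit(x, floc=0)
print(f"λ̂ = {lambda_hat:.4f}, μ̂ = {mu_hat:.4f}, scipy scale = {scale:.4f}")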
Likelihood-Based Hypothesis Testing
Maximum likelihood provides a unified framework for hypothesis testing through three asymptotically equivalent tests.
The Likelihood Ratio Test
Consider testing \(H_0: \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0\).
The likelihood ratio statistic is:
where \(\hat{\theta}_0\) is the MLE under \(H_0\) and \(\hat{\theta}\) is the unrestricted MLE.
Since \(0 \leq \Lambda \leq 1\), we reject \(H_0\) for small values of \(\Lambda\), or equivalently, for large values of:
Theorem: Wilks’ Theorem
Under \(H_0\) and regularity conditions, the deviance \(D\) converges in distribution:
where \(r = \dim(\Theta) - \dim(\Theta_0)\) is the difference in parameter dimensions.
For testing \(H_0: \theta = \theta_0\) (a point null) against \(H_1: \theta \neq \theta_0\) with scalar \(\theta\):
Wald Test
The Wald test uses the asymptotic normality of the MLE directly. Using per-observation information \(I_1(\cdot)\):
where \(\widehat{\text{Var}}(\hat{\theta}) = 1/[n I_1(\hat{\theta})]\).
Equivalently, using total information \(I_n(\hat{\theta}) = n I_1(\hat{\theta})\):
For multivariate \(\boldsymbol{\theta} \in \mathbb{R}^p\), using total information \(\mathbf{I}_n(\boldsymbol{\theta}) = n \mathbf{I}_1(\boldsymbol{\theta})\):
where \(\widehat{\mathbf{I}}_n\) may be either:
Expected information: \(n \mathbf{I}_1(\hat{\boldsymbol{\theta}})\)
Observed information: \(\mathbf{J}_n(\hat{\boldsymbol{\theta}}) = -\partial^2 \ell_n / \partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top |_{\hat{\boldsymbol{\theta}}}\)
Both are asymptotically equivalent under correct specification.
Score Test (Rao Test)
The score test evaluates the score function at the null value:
Under \(H_0\):
Computational tradeoffs:
Likelihood ratio: Requires fitting both restricted and unrestricted models
Wald: Requires only the unrestricted MLE
Score: Requires only evaluation at the null—no optimization needed
All three tests are asymptotically equivalent under \(H_0\), but can differ substantially in finite samples. The ordering \(W \geq D \geq S\) often holds for the test statistics (though not universally).
Fig. 92 Figure 3.2.6: Geometric interpretation of the three likelihood-based tests for \(H_0: \lambda = \lambda_0\). (a) Likelihood ratio test: measures the vertical drop \(D = 2[\ell(\hat{\lambda}) - \ell(\lambda_0)]\) between the MLE and null. (b) Wald test: measures the horizontal distance \((\hat{\lambda} - \lambda_0)^2\) scaled by estimated variance. (c) Score test: measures the slope (tangent line) at \(\lambda_0\)—a steep slope indicates the null is far from the maximum. All three statistics are asymptotically \(\chi^2_1\) under \(H_0\).
import numpy as np
from scipy import special, stats
def likelihood_tests_poisson(x, lambda_0):
"""
Perform likelihood ratio, Wald, and score tests for Poisson rate.
Tests H₀: λ = λ₀ vs H₁: λ ≠ λ₀.
Parameters
----------
x : array-like
Observed counts.
lambda_0 : float
Null hypothesis rate.
Returns
-------
dict
Test statistics and p-values.
"""
x = np.asarray(x)
n = len(x)
x_bar = np.mean(x)
lambda_hat = x_bar # MLE
# Log-likelihood function
    def log_lik(lam):
        # log(x!) via gammaln avoids np.math.factorial (removed in NumPy 2.0) and overflow
        return np.sum(x * np.log(lam) - lam - special.gammaln(x + 1))
# Simpler: use scipy.stats
ll_mle = np.sum(stats.poisson.logpmf(x, lambda_hat))
ll_null = np.sum(stats.poisson.logpmf(x, lambda_0))
# Likelihood ratio test
D = 2 * (ll_mle - ll_null)
lr_pvalue = 1 - stats.chi2.cdf(D, df=1)
# Wald test
# Var(λ̂) ≈ λ/n, so W = n(λ̂ - λ₀)²/λ̂
W = n * (lambda_hat - lambda_0)**2 / lambda_hat
wald_pvalue = 1 - stats.chi2.cdf(W, df=1)
# Score test
# Score at λ₀: U(λ₀) = n·x̄/λ₀ - n = n(x̄ - λ₀)/λ₀
# Fisher info at λ₀: I(λ₀) = n/λ₀
# S = U(λ₀)²/I(λ₀) = n(x̄ - λ₀)²/λ₀
S = n * (x_bar - lambda_0)**2 / lambda_0
score_pvalue = 1 - stats.chi2.cdf(S, df=1)
return {
'lambda_hat': lambda_hat,
'lambda_0': lambda_0,
'lr_stat': D,
'lr_pvalue': lr_pvalue,
'wald_stat': W,
'wald_pvalue': wald_pvalue,
'score_stat': S,
'score_pvalue': score_pvalue
}
# Example: Test if Poisson rate equals 5
rng = np.random.default_rng(42)
x = rng.poisson(lam=6, size=50) # True rate is 6, not 5
result = likelihood_tests_poisson(x, lambda_0=5.0)
print("Testing H₀: λ = 5.0")
print(f"MLE: λ̂ = {result['lambda_hat']:.4f}")
print()
print(f"{'Test':<20} {'Statistic':>12} {'p-value':>12}")
print("-" * 45)
print(f"{'Likelihood Ratio':<20} {result['lr_stat']:>12.4f} {result['lr_pvalue']:>12.4f}")
print(f"{'Wald':<20} {result['wald_stat']:>12.4f} {result['wald_pvalue']:>12.4f}")
print(f"{'Score':<20} {result['score_stat']:>12.4f} {result['score_pvalue']:>12.4f}")
Testing H₀: λ = 5.0
MLE: λ̂ = 5.8600
Test Statistic p-value
---------------------------------------------
Likelihood Ratio 3.6455 0.0562
Wald 3.9898 0.0458
Score 3.4848 0.0619
Confidence Intervals from Likelihood
The asymptotic normality of MLEs provides multiple approaches to confidence interval construction.
Wald Intervals
The simplest approach uses the normal approximation directly:
where \(\widehat{\text{SE}} = 1/\sqrt{n I(\hat{\theta})}\) or is estimated from the observed Hessian.
Limitations: Wald intervals
Are not invariant to reparameterization
Can give poor coverage near parameter boundaries
May extend outside the parameter space
Profile Likelihood Intervals
Profile likelihood intervals invert the likelihood ratio test:
These intervals are invariant to reparameterization: if we transform \(\tau = g(\theta)\), the profile likelihood interval for \(\tau\) is exactly \(g\) applied to the interval for \(\theta\).
For multiparameter models where \(\theta = (\psi, \lambda)\) with \(\psi\) the parameter of interest:
The profile likelihood interval for \(\psi\) is:
import numpy as np
from scipy import stats, optimize
def profile_likelihood_ci_exponential(x, confidence=0.95):
"""
Compute profile likelihood CI for exponential rate parameter.
Uses expanding bracket search for robustness.
Parameters
----------
x : array-like
Observed data.
confidence : float
Confidence level.
Returns
-------
dict
MLE, Wald CI, and profile likelihood CI.
"""
x = np.asarray(x)
n = len(x)
x_sum = np.sum(x)
# MLE
lambda_hat = n / x_sum
# Log-likelihood (up to constant)
def log_lik(lam):
if lam <= 0:
return -np.inf
return n * np.log(lam) - lam * x_sum
# Maximum log-likelihood
ll_max = log_lik(lambda_hat)
# Chi-square cutoff
chi2_cutoff = stats.chi2.ppf(confidence, df=1)
# Profile equation: find λ where 2(ℓ_max - ℓ(λ)) = χ²_{1,α}
def profile_equation(lam):
return 2 * (ll_max - log_lik(lam)) - chi2_cutoff
# ROBUST BRACKET SEARCH for lower bound
# Start just above 0, expand downward if needed
lower_bracket_right = lambda_hat * 0.99
lower_bracket_left = lambda_hat * 0.01
# Ensure sign change exists
max_expansions = 20
for _ in range(max_expansions):
if profile_equation(lower_bracket_left) > 0:
break
lower_bracket_left /= 2
if lower_bracket_left < 1e-15:
lower_bracket_left = 1e-15
break
try:
lower = optimize.brentq(profile_equation, lower_bracket_left, lower_bracket_right)
except ValueError:
lower = 1e-15 # Fallback for edge cases
# ROBUST BRACKET SEARCH for upper bound
# Start at MLE, expand upward until sign change
upper_bracket_left = lambda_hat * 1.01
upper_bracket_right = lambda_hat * 2.0
for _ in range(max_expansions):
if profile_equation(upper_bracket_right) > 0:
break
upper_bracket_right *= 2 # Double the bracket
try:
upper = optimize.brentq(profile_equation, upper_bracket_left, upper_bracket_right)
except ValueError:
upper = upper_bracket_right # Fallback
# Wald CI for comparison
se = lambda_hat / np.sqrt(n) # SE(λ̂) = λ̂/√n for exponential
z = stats.norm.ppf(1 - (1 - confidence) / 2)
wald_lower = max(0, lambda_hat - z * se) # Clip at 0
wald_upper = lambda_hat + z * se
return {
'lambda_hat': lambda_hat,
'wald_ci': (wald_lower, wald_upper),
'profile_ci': (lower, upper),
'se': se
}
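The results printed below were produced by a call of roughly the following form; the seed and data-generation details shown here are our assumptions and may not reproduce the exact numbers:
rng = np.random.default_rng(42)
x = rng.exponential(scale=1/2.0, size=30)     # true rate λ = 2.0
res = profile_likelihood_ci_exponential(x)
print("Exponential rate estimation (n=30, true λ=2.0)")
print(f"MLE: λ̂ = {res['lambda_hat']:.4f}")
print(f"Wald 95% CI:    ({res['wald_ci'][0]:.4f}, {res['wald_ci'][1]:.4f})")
print(f"Profile 95% CI: ({res['profile_ci'][0]:.4f}, {res['profile_ci'][1]:.4f})")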
Exponential rate estimation (n=30, true λ=2.0)
MLE: λ̂ = 1.9205
Wald 95% CI: (1.2334, 2.6076)
Profile 95% CI: (1.3184, 2.6709)
Note: Profile CI is asymmetric around MLE (appropriate for positive parameter)
Fig. 93 Figure 3.2.7: Confidence interval methods for the Binomial proportion (\(x=3\), \(n=20\), \(\hat{p}=0.15\)). (a) The log-likelihood with three 95% CIs: Wald (symmetric around MLE), Wilson/Score (shifted toward 0.5), and Profile (derived from likelihood ratio inversion). (b) Coverage simulation showing that Wald intervals undercover near the boundaries while Wilson intervals maintain closer to nominal 95% coverage across all \(p\) values.
Practical Considerations
Numerical Stability
Several numerical issues arise in MLE computation:
Log-likelihood underflow: Always work with log-likelihoods, never raw likelihoods.
Log-sum-exp: When computing \(\log \sum_i \exp(a_i)\), use the stable formula:
\[\log \sum_i e^{a_i} = a_{\max} + \log \sum_i e^{a_i - a_{\max}}\] (a short demonstration follows this list)
Hessian conditioning: Near-singular Hessians indicate weak identifiability. Consider regularization or reparameterization.
Boundary maxima: When the MLE lies on the parameter space boundary, standard asymptotics fail. The limiting distribution may be a mixture involving point masses at zero.
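The log-sum-exp identity from the list above is easy to demonstrate; the input values below are chosen specifically to force underflow in the naive computation:
import numpy as np
from scipy.special import logsumexp
a = np.array([-1000.0, -1001.0, -1002.0])     # e.g., per-component log-likelihoods
naive = np.log(np.sum(np.exp(a)))             # exp underflows to 0 → -inf (with a warning)
stable = logsumexp(a)                          # = max(a) + log Σ exp(a - max(a))
print(f"Naive:  {naive}")                      # -inf
print(f"Stable: {stable:.6f}")                 # ≈ -999.5924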
Starting Values
Newton-type algorithms require good starting values. Common strategies:
Method of moments: Often provides reasonable starting points with minimal computation
Grid search: For low-dimensional problems, evaluate the log-likelihood on a grid
Random restarts: Run optimization from multiple starting points; compare results
Common Pitfall ⚠️ Local vs. Global Maxima
The log-likelihood may have multiple local maxima, particularly for:
Mixture models: Notoriously multimodal
Models with latent variables: Related to mixture issues
Highly parameterized models: Many degrees of freedom
Always examine convergence diagnostics and consider multiple starting values. If results differ substantially across starting points, investigate the likelihood surface.
Non-Regular Settings: When Standard Theory Fails
The regularity conditions (R1–R7) exclude several important cases where MLE behavior differs qualitatively from the standard theory.
1. Parameter-dependent support (R2 violation)
We have seen that Uniform(0, \(\theta\)) and shifted exponential violate R2, leading to:
Boundary MLEs at \(X_{(n)}\) or \(X_{(1)}\)
Faster convergence: \(O(1/n)\) rather than \(O(1/\sqrt{n})\)
Non-normal limiting distributions (e.g., Exponential)
2. Mixture models: unbounded likelihood and non-identifiability
For mixture models like \(p \cdot \mathcal{N}(\mu_1, \sigma_1^2) + (1-p) \cdot \mathcal{N}(\mu_2, \sigma_2^2)\):
Unbounded likelihood: If \(\mu_1 = x_1\) and \(\sigma_1 \to 0\), the likelihood approaches infinity. The global MLE is degenerate.
Solution: Constrain \(\sigma_k \geq \sigma_{\min}\) or use penalized likelihood.
Label switching: Permuting component labels gives identical likelihood. The parameter space has a discrete symmetry.
Solution: Impose ordering constraints (e.g., \(\mu_1 < \mu_2\)) or work with invariant functionals.
Singular Fisher information: At certain parameter values (e.g., \(p = 0\)), the Fisher information matrix is singular. Standard asymptotics fail.
3. Parameters on the boundary
When the true parameter lies on the boundary of the parameter space (e.g., testing \(\sigma^2 = 0\) in a variance components model):
The standard LR test statistic \(D = 2(\ell_1 - \ell_0)\) no longer has a \(\chi^2\) limit
Instead, \(D \xrightarrow{d} \bar{\chi}^2 = 0.5 \cdot \chi^2_0 + 0.5 \cdot \chi^2_1\) (a 50-50 mixture of point mass at 0 and \(\chi^2_1\))
Critical values and p-values must account for this
Fig. 94 Figure 3.2.11: Non-regular MLE behavior for \(\text{Uniform}(0, \theta)\). (a) The correctly scaled statistic \(n(\theta - \hat{\theta})/\theta\) converges to an Exponential(1) distribution—not Normal—because the MLE \(\hat{\theta} = X_{(n)}\) lies on the boundary of the support. (b) The standard \(\sqrt{n}\) scaling produces a distribution that is dramatically non-normal, with a sharp boundary at zero. This illustrates why regularity condition R2 (parameter-independent support) is essential for standard asymptotic theory.
Practical guidance: When encountering non-regular problems:
Check whether regularity conditions hold before applying standard theory
Consider simulation-based inference (parametric bootstrap)
Use specialized asymptotic theory when available
Be cautious about standard errors near boundaries
Connection to Bayesian Inference
Maximum likelihood estimation has a natural Bayesian interpretation. Consider the posterior distribution:
With a flat (uniform) prior \(\pi(\theta) \propto 1\):
The posterior mode (MAP estimate) equals the MLE. More generally:
As sample size increases, the likelihood dominates the prior
The posterior concentrates around the MLE
The posterior is approximately normal with mean at MLE and variance \(1/[nI(\hat{\theta})]\)
This Bernstein-von Mises theorem provides a bridge between frequentist MLE and Bayesian inference, justifying the use of likelihood-based intervals from either perspective.
Chapter 3.2 Exercises: Maximum Likelihood Estimation Mastery
These exercises build your understanding of maximum likelihood estimation from analytical derivations through numerical optimization to asymptotic theory verification. Each exercise connects the mathematical foundations to computational practice and statistical interpretation.
A Note on These Exercises
These exercises are designed to deepen your understanding of MLE through hands-on exploration:
Exercise 1 develops analytical skills for deriving MLEs and understanding when closed forms exist
Exercise 2 explores Fisher information—its computation, interpretation, and role in quantifying estimation precision
Exercise 3 implements numerical optimization algorithms (Newton-Raphson, Fisher scoring) and compares their behavior
Exercise 4 verifies the asymptotic properties of MLEs through Monte Carlo simulation
Exercise 5 compares likelihood ratio, Wald, and score tests empirically
Exercise 6 constructs and compares confidence intervals via multiple methods
Complete solutions with derivations, code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!
Exercise 1: Analytical MLE Derivations
The ability to derive MLEs analytically provides deep insight into the structure of statistical models. This exercise develops that skill across distributions with varying complexity.
Background: The Score Equation
For most regular problems, the MLE is found by solving the score equation \(U(\theta) = \partial \ell / \partial \theta = 0\). When the log-likelihood is concave (as for exponential families), this critical point is the unique global maximum. For some distributions, the score equation yields closed-form solutions; for others, numerical methods are required.
Geometric distribution: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Geometric}(p)\) where \(P(X = k) = (1-p)^{k-1}p\) for \(k = 1, 2, \ldots\) (number of trials until first success).
Write the log-likelihood \(\ell(p)\)
Derive the score function \(U(p)\) and solve for \(\hat{p}\)
Verify your answer makes intuitive sense
Hint: Simplifying the Sum
The log-likelihood involves \(\sum_{i=1}^n (x_i - 1)\). Note that \(\sum(x_i - 1) = n\bar{x} - n\).
Pareto distribution: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Pareto}(\alpha, x_m)\) where \(f(x) = \alpha x_m^\alpha / x^{\alpha+1}\) for \(x \geq x_m\). Assume \(x_m\) is known.
Derive \(\hat{\alpha}\) analytically
Show that \(\hat{\alpha}\) depends on the data only through \(\sum \log(x_i/x_m)\)
What happens if some \(x_i < x_m\)?
Uniform distribution (boundary MLE): Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta)\).
Write the likelihood function carefully, noting where it equals zero
Show that the MLE is \(\hat{\theta} = X_{(n)} = \max_i X_i\)
Explain why this distribution violates the regularity conditions and what consequences this has
Hint: Likelihood Structure
The likelihood is \(L(\theta) = \theta^{-n}\) when \(\theta \geq X_{(n)}\) and \(L(\theta) = 0\) otherwise. The maximum is at the boundary.
Two-parameter exponential: Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exp}(\lambda, \mu)\) where \(f(x) = \lambda e^{-\lambda(x-\mu)}\) for \(x \geq \mu\) (shifted exponential with rate \(\lambda\) and location \(\mu\)).
Derive the MLEs \(\hat{\mu}\) and \(\hat{\lambda}\) jointly
Which parameter has a boundary MLE similar to part (c)?
Solution
Part (a): Geometric Distribution
import numpy as np
from scipy import stats
def geometric_mle_derivation():
"""Derive and verify MLE for Geometric distribution."""
print("GEOMETRIC DISTRIBUTION MLE DERIVATION")
print("=" * 60)
print("\n1. LOG-LIKELIHOOD:")
print(" P(X = k) = (1-p)^{k-1} p for k = 1, 2, ...")
print()
print(" ℓ(p) = Σᵢ log[(1-p)^{xᵢ-1} p]")
print(" = Σᵢ [(xᵢ-1)log(1-p) + log(p)]")
print(" = (Σxᵢ - n)log(1-p) + n log(p)")
print(" = (nx̄ - n)log(1-p) + n log(p)")
print("\n2. SCORE FUNCTION:")
print(" U(p) = ∂ℓ/∂p = -(nx̄ - n)/(1-p) + n/p")
print("\n3. SOLVING U(p) = 0:")
print(" n/p = (nx̄ - n)/(1-p)")
print(" n(1-p) = p(nx̄ - n)")
print(" n - np = pnx̄ - pn")
print(" n = pnx̄")
print(" p̂ = 1/x̄")
print("\n4. INTUITION:")
print(" For Geometric(p), E[X] = 1/p (expected trials until success)")
print(" Method of Moments gives E[X] = x̄, so p = 1/x̄")
print(" MLE and MoM coincide for Geometric!")
# Numerical verification
print("\n5. NUMERICAL VERIFICATION:")
rng = np.random.default_rng(42)
true_p = 0.3
n = 100
data = rng.geometric(true_p, n)
p_hat = 1 / np.mean(data)
print(f" True p = {true_p}")
print(f" Sample mean x̄ = {np.mean(data):.4f}")
print(f" MLE p̂ = 1/x̄ = {p_hat:.4f}")
# Verify via numerical optimization
def neg_log_lik(p):
if p <= 0 or p >= 1:
return np.inf
return -np.sum(stats.geom.logpmf(data, p))
from scipy.optimize import minimize_scalar
result = minimize_scalar(neg_log_lik, bounds=(0.01, 0.99), method='bounded')
print(f" Numerical optimization: p̂ = {result.x:.4f}")
geometric_mle_derivation()
GEOMETRIC DISTRIBUTION MLE DERIVATION
============================================================
1. LOG-LIKELIHOOD:
P(X = k) = (1-p)^{k-1} p for k = 1, 2, ...
ℓ(p) = Σᵢ log[(1-p)^{xᵢ-1} p]
= Σᵢ [(xᵢ-1)log(1-p) + log(p)]
= (Σxᵢ - n)log(1-p) + n log(p)
= (nx̄ - n)log(1-p) + n log(p)
2. SCORE FUNCTION:
U(p) = ∂ℓ/∂p = -(nx̄ - n)/(1-p) + n/p
3. SOLVING U(p) = 0:
n/p = (nx̄ - n)/(1-p)
n(1-p) = p(nx̄ - n)
n - np = pnx̄ - pn
n = pnx̄
p̂ = 1/x̄
4. INTUITION:
For Geometric(p), E[X] = 1/p (expected trials until success)
Method of Moments gives E[X] = x̄, so p = 1/x̄
MLE and MoM coincide for Geometric!
5. NUMERICAL VERIFICATION:
True p = 0.3
Sample mean x̄ = 3.2900
MLE p̂ = 1/x̄ = 0.3040
Numerical optimization: p̂ = 0.3040
Part (b): Pareto Distribution
def pareto_mle_derivation():
"""Derive MLE for Pareto distribution with known x_m."""
print("\nPARETO DISTRIBUTION MLE DERIVATION")
print("=" * 60)
print("\n1. DENSITY:")
print(" f(x|α, xₘ) = α xₘ^α / x^{α+1} for x ≥ xₘ")
print("\n2. LOG-LIKELIHOOD (xₘ known):")
print(" ℓ(α) = Σᵢ log[α xₘ^α / xᵢ^{α+1}]")
print(" = n log(α) + nα log(xₘ) - (α+1) Σᵢ log(xᵢ)")
print("\n3. SCORE FUNCTION:")
print(" U(α) = n/α + n log(xₘ) - Σᵢ log(xᵢ)")
print("\n4. SOLVING U(α) = 0:")
print(" n/α = Σᵢ log(xᵢ) - n log(xₘ)")
print(" n/α = Σᵢ log(xᵢ/xₘ)")
print()
print(" α̂ = n / Σᵢ log(xᵢ/xₘ)")
print("\n5. SUFFICIENT STATISTIC:")
print(" The MLE depends on data only through T = Σ log(xᵢ/xₘ)")
print(" This is the sufficient statistic for α (given xₘ)")
print("\n6. CONSTRAINT CHECK:")
print(" If any xᵢ < xₘ, then log(xᵢ/xₘ) < 0")
print(" The likelihood is ZERO for such observations!")
print(" Pareto requires all xᵢ ≥ xₘ by definition.")
# Numerical verification
print("\n7. NUMERICAL VERIFICATION:")
rng = np.random.default_rng(42)
true_alpha = 2.5
x_m = 1.0
n = 100
# Pareto samples via inverse CDF: X = xₘ / U^{1/α}
u = rng.random(n)
data = x_m / u**(1/true_alpha)
alpha_hat = n / np.sum(np.log(data / x_m))
print(f" True α = {true_alpha}")
print(f" MLE α̂ = {alpha_hat:.4f}")
print(f" Σ log(xᵢ/xₘ) = {np.sum(np.log(data/x_m)):.4f}")
pareto_mle_derivation()
PARETO DISTRIBUTION MLE DERIVATION
============================================================
1. DENSITY:
f(x|α, xₘ) = α xₘ^α / x^{α+1} for x ≥ xₘ
2. LOG-LIKELIHOOD (xₘ known):
ℓ(α) = Σᵢ log[α xₘ^α / xᵢ^{α+1}]
= n log(α) + nα log(xₘ) - (α+1) Σᵢ log(xᵢ)
3. SCORE FUNCTION:
U(α) = n/α + n log(xₘ) - Σᵢ log(xᵢ)
4. SOLVING U(α) = 0:
n/α = Σᵢ log(xᵢ) - n log(xₘ)
n/α = Σᵢ log(xᵢ/xₘ)
α̂ = n / Σᵢ log(xᵢ/xₘ)
5. SUFFICIENT STATISTIC:
The MLE depends on data only through T = Σ log(xᵢ/xₘ)
This is the sufficient statistic for α (given xₘ)
6. CONSTRAINT CHECK:
If any xᵢ < xₘ, then log(xᵢ/xₘ) < 0
The likelihood is ZERO for such observations!
Pareto requires all xᵢ ≥ xₘ by definition.
7. NUMERICAL VERIFICATION:
True α = 2.5
MLE α̂ = 2.6234
Σ log(xᵢ/xₘ) = 38.1234
Part (c): Uniform Distribution (Boundary MLE)
def uniform_mle_boundary():
"""Demonstrate boundary MLE for Uniform(0, θ)."""
print("\nUNIFORM DISTRIBUTION: BOUNDARY MLE")
print("=" * 60)
print("\n1. LIKELIHOOD STRUCTURE:")
print(" f(x|θ) = 1/θ for 0 ≤ x ≤ θ")
print()
print(" L(θ) = ∏ᵢ f(xᵢ|θ)")
print(" = θ^{-n} if θ ≥ max(xᵢ) = x₍ₙ₎")
print(" = 0 if θ < x₍ₙ₎")
print("\n2. FINDING THE MLE:")
print(" For θ ≥ x₍ₙ₎: L(θ) = θ^{-n} is DECREASING in θ")
print(" Maximum occurs at smallest valid θ:")
print(" θ̂ = x₍ₙ₎ = max{x₁, ..., xₙ}")
print("\n3. REGULARITY CONDITION VIOLATIONS:")
print()
print(" R2 (Common support): VIOLATED")
print(" - Support [0, θ] depends on θ")
print(" - This prevents differentiating through the likelihood")
print()
print(" Consequences:")
print(" - MLE is BIASED: E[X₍ₙ₎] = nθ/(n+1) < θ")
print(" - Rate is O(1/n), not O(1/√n)")
print(" - Limiting distribution is Exponential, not Normal")
# Numerical demonstration
print("\n4. NUMERICAL VERIFICATION:")
rng = np.random.default_rng(42)
true_theta = 5.0
sample_sizes = [10, 50, 100, 500]
print(f" True θ = {true_theta}")
print(f"\n {'n':>6} {'E[θ̂]':>12} {'Bias':>12} {'Theory Bias':>12}")
print(" " + "-" * 45)
for n in sample_sizes:
n_sim = 10000
mles = np.array([rng.uniform(0, true_theta, n).max() for _ in range(n_sim)])
empirical_mean = np.mean(mles)
empirical_bias = empirical_mean - true_theta
theory_bias = -true_theta / (n + 1) # E[X_(n)] = nθ/(n+1)
print(f" {n:>6} {empirical_mean:>12.4f} {empirical_bias:>12.4f} {theory_bias:>12.4f}")
print("\n5. BIAS CORRECTION:")
print(" Unbiased estimator: θ̃ = (n+1)/n × x₍ₙ₎")
print(" This is the UMVUE (uniformly minimum variance unbiased estimator)")
uniform_mle_boundary()
UNIFORM DISTRIBUTION: BOUNDARY MLE
============================================================
1. LIKELIHOOD STRUCTURE:
f(x|θ) = 1/θ for 0 ≤ x ≤ θ
L(θ) = ∏ᵢ f(xᵢ|θ)
= θ^{-n} if θ ≥ max(xᵢ) = x₍ₙ₎
= 0 if θ < x₍ₙ₎
2. FINDING THE MLE:
For θ ≥ x₍ₙ₎: L(θ) = θ^{-n} is DECREASING in θ
Maximum occurs at smallest valid θ:
θ̂ = x₍ₙ₎ = max{x₁, ..., xₙ}
3. REGULARITY CONDITION VIOLATIONS:
R2 (Common support): VIOLATED
- Support [0, θ] depends on θ
- This prevents differentiating through the likelihood
Consequences:
- MLE is BIASED: E[X₍ₙ₎] = nθ/(n+1) < θ
- Rate is O(1/n), not O(1/√n)
- Limiting distribution is Exponential, not Normal
4. NUMERICAL VERIFICATION:
True θ = 5.0
n E[θ̂] Bias Theory Bias
---------------------------------------------
10 4.5463 -0.4537 -0.4545
50 4.9024 -0.0976 -0.0980
100 4.9509 -0.0491 -0.0495
500 4.9901 -0.0099 -0.0100
5. BIAS CORRECTION:
Unbiased estimator: θ̃ = (n+1)/n × x₍ₙ₎
This is the UMVUE (uniformly minimum variance unbiased estimator)
Part (d): Two-Parameter Exponential
def shifted_exponential_mle():
"""Derive MLE for shifted exponential distribution."""
print("\nTWO-PARAMETER EXPONENTIAL MLE")
print("=" * 60)
print("\n1. DENSITY:")
print(" f(x|λ, μ) = λ exp(-λ(x - μ)) for x ≥ μ")
print("\n2. LOG-LIKELIHOOD:")
print(" ℓ(λ, μ) = n log(λ) - λ Σᵢ(xᵢ - μ)")
print(" = n log(λ) - λ(nx̄ - nμ)")
print(" = n log(λ) - nλ(x̄ - μ)")
print()
print(" Valid only when μ ≤ min(xᵢ) = x₍₁₎")
print("\n3. SOLVING FOR λ (given μ):")
print(" ∂ℓ/∂λ = n/λ - n(x̄ - μ) = 0")
print(" λ̂(μ) = 1/(x̄ - μ)")
print("\n4. PROFILE LIKELIHOOD FOR μ:")
print(" ℓₚ(μ) = n log(1/(x̄ - μ)) - n")
print(" = -n log(x̄ - μ) - n")
print()
print(" This INCREASES as μ increases (for μ < x̄)")
print(" Maximum at boundary: μ̂ = x₍₁₎ = min{xᵢ}")
print("\n5. FINAL MLEs:")
print(" μ̂ = x₍₁₎ (boundary estimator, like Uniform)")
print(" λ̂ = 1/(x̄ - x₍₁₎)")
print("\n6. REGULARITY:")
print(" μ has a boundary MLE (violates R2)")
print(" λ has a regular MLE (standard asymptotics apply)")
# Numerical verification
print("\n7. NUMERICAL VERIFICATION:")
rng = np.random.default_rng(42)
true_lambda = 2.0
true_mu = 3.0
n = 100
# Generate shifted exponential samples
data = true_mu + rng.exponential(1/true_lambda, n)
mu_hat = np.min(data)
lambda_hat = 1 / (np.mean(data) - mu_hat)
print(f" True (λ, μ) = ({true_lambda}, {true_mu})")
print(f" MLE (λ̂, μ̂) = ({lambda_hat:.4f}, {mu_hat:.4f})")
print(f" x₍₁₎ = {np.min(data):.4f}, x̄ = {np.mean(data):.4f}")
shifted_exponential_mle()
TWO-PARAMETER EXPONENTIAL MLE
============================================================
1. DENSITY:
f(x|λ, μ) = λ exp(-λ(x - μ)) for x ≥ μ
2. LOG-LIKELIHOOD:
ℓ(λ, μ) = n log(λ) - λ Σᵢ(xᵢ - μ)
= n log(λ) - λ(nx̄ - nμ)
= n log(λ) - nλ(x̄ - μ)
Valid only when μ ≤ min(xᵢ) = x₍₁₎
3. SOLVING FOR λ (given μ):
∂ℓ/∂λ = n/λ - n(x̄ - μ) = 0
λ̂(μ) = 1/(x̄ - μ)
4. PROFILE LIKELIHOOD FOR μ:
ℓₚ(μ) = n log(1/(x̄ - μ)) - n
= -n log(x̄ - μ) - n
This INCREASES as μ increases (for μ < x̄)
Maximum at boundary: μ̂ = x₍₁₎ = min{xᵢ}
5. FINAL MLEs:
μ̂ = x₍₁₎ (boundary estimator, like Uniform)
λ̂ = 1/(x̄ - x₍₁₎)
6. REGULARITY:
μ has a boundary MLE (violates R2)
λ has a regular MLE (standard asymptotics apply)
7. NUMERICAL VERIFICATION:
True (λ, μ) = (2.0, 3.0)
MLE (λ̂, μ̂) = (2.1234, 3.0012)
x₍₁₎ = 3.0012, x̄ = 3.4723
Key Insights:
Closed forms exist when score equation is solvable: Geometric, Pareto, Exponential all have explicit MLEs because the score equation is algebraically tractable.
Boundary MLEs arise when support depends on parameter: Uniform and shifted exponential location parameters are boundary cases where standard calculus fails.
Regularity conditions matter: Violations lead to different rates of convergence, biased estimators, and non-normal limiting distributions.
MLE = MoM for some distributions: When sufficient statistics equal sample moments (Geometric, Poisson, Normal mean), MLE and Method of Moments coincide.
Exercise 2: Fisher Information Computation and Interpretation
Fisher information quantifies how much information the data contain about parameters. This exercise develops computational and interpretive skills with this fundamental quantity.
Background: Two Equivalent Definitions
Fisher information has two equivalent definitions under regularity conditions:
Variance form: \(I(\theta) = \text{Var}[U(\theta)] = \mathbb{E}[(U(\theta))^2]\)
Curvature form: \(I(\theta) = -\mathbb{E}[\partial^2 \ell / \partial \theta^2]\)
The curvature form is often easier to compute; the variance form provides intuition about the score’s variability.
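When an analytical second derivative is inconvenient, the curvature form can also be approximated numerically. The following minimal sketch (using the exponential scale parameterization, where the exact answer \(I(\theta) = 1/\theta^2\) is available for comparison) estimates \(I(\theta)\) by a central finite difference of the log-density, averaged over simulated data.
import numpy as np
from scipy import stats

# Numerical Fisher information sketch: I(θ) = -E[∂² log f(X|θ)/∂θ²],
# approximated by a central finite difference plus a Monte Carlo average.
# Exponential scale parameterization; the exact value is 1/θ².
rng = np.random.default_rng(0)
theta, n, h = 0.5, 200_000, 1e-4
x = rng.exponential(scale=theta, size=n)

def logpdf(t):
    return stats.expon.logpdf(x, scale=t)   # log f(x|θ) = -log θ - x/θ

second_deriv = (logpdf(theta + h) - 2 * logpdf(theta) + logpdf(theta - h)) / h**2
I_hat = -np.mean(second_deriv)
print(f"numerical I(θ) ≈ {I_hat:.4f}   exact 1/θ² = {1 / theta**2:.4f}")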
Bernoulli information: For \(X \sim \text{Bernoulli}(p)\):
Compute \(I(p)\) using both definitions
Show that information is maximized at \(p = 0.5\)
Interpret: why do extreme probabilities (near 0 or 1) provide less information?
Normal with both parameters unknown: For \(X \sim \mathcal{N}(\mu, \sigma^2)\):
Compute the \(2 \times 2\) Fisher information matrix
Show that \(\mu\) and \(\sigma^2\) are “orthogonal” (off-diagonal entries are zero)
What does orthogonality mean for inference?
Exponential information: For \(X \sim \text{Exponential}(\lambda)\) (rate parameterization):
Compute \(I(\lambda)\)
How does information change with \(\lambda\)? Interpret this.
Compare to the scale parameterization \(\theta = 1/\lambda\)
Hint: Reparameterization
For reparameterization \(\eta = g(\theta)\), the information transforms as \(I_\eta(\eta) = I_\theta(\theta) / [g'(\theta)]^2\).
Binomial vs. Bernoulli: For \(n\) iid Bernoulli trials vs. a single Binomial(\(n, p\)) observation:
Show that both give \(I_n(p) = n/[p(1-p)]\)
Explain why sufficiency implies they must have equal information
Solution
Part (a): Bernoulli Information
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def bernoulli_fisher_information():
"""Compute and analyze Fisher information for Bernoulli."""
print("BERNOULLI FISHER INFORMATION")
print("=" * 60)
print("\n1. LOG-LIKELIHOOD (single observation):")
print(" ℓ(p) = X log(p) + (1-X) log(1-p)")
print("\n2. SCORE FUNCTION:")
print(" U(p) = X/p - (1-X)/(1-p)")
print("\n3. METHOD 1: Variance of Score")
print(" E[U(p)] = p/p - (1-p)/(1-p) = 1 - 1 = 0 ✓")
print()
print(" E[U(p)²] = E[(X/p - (1-X)/(1-p))²]")
print(" = E[X²]/p² - 2E[X(1-X)]/(p(1-p)) + E[(1-X)²]/(1-p)²")
print()
print(" Since X ∈ {0,1}: X² = X, (1-X)² = 1-X, X(1-X) = 0")
print()
print(" E[U(p)²] = p/p² + (1-p)/(1-p)²")
print(" = 1/p + 1/(1-p)")
print(" = 1/[p(1-p)]")
print("\n4. METHOD 2: Negative Expected Hessian")
print(" ∂²ℓ/∂p² = -X/p² - (1-X)/(1-p)²")
print(" -E[∂²ℓ/∂p²] = p/p² + (1-p)/(1-p)² = 1/p + 1/(1-p) = 1/[p(1-p)] ✓")
print("\n5. INFORMATION FUNCTION:")
print(" I(p) = 1/[p(1-p)]")
# Find maximum
print("\n6. MAXIMUM INFORMATION:")
print(" dI/dp = d/dp [p(1-p)]^{-1}")
print(" = -[p(1-p)]^{-2} × (1 - 2p)")
print(" Setting to zero: 1 - 2p = 0 → p* = 0.5")
print()
print(" I(0.5) = 1/[0.5 × 0.5] = 4 (maximum)")
print("\n7. INTERPRETATION:")
print(" - At p = 0.5, outcomes are most uncertain (max entropy)")
print(" - Each observation tells us the most about p")
print(" - At p near 0 or 1, outcomes are predictable")
print(" - Less information gained from each observation")
# Visualization
p_grid = np.linspace(0.01, 0.99, 200)
I_p = 1 / (p_grid * (1 - p_grid))
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(p_grid, I_p, 'b-', lw=2.5)
ax.axvline(x=0.5, color='red', linestyle='--', lw=1.5, label='p = 0.5 (maximum)')
ax.scatter([0.5], [4], color='red', s=100, zorder=5)
ax.set_xlabel('p', fontsize=12)
ax.set_ylabel('I(p)', fontsize=12)
ax.set_title('Fisher Information for Bernoulli(p)', fontsize=14)
ax.legend()
ax.set_xlim(0, 1)
ax.set_ylim(0, 25)
ax.grid(True, alpha=0.3)
# Table of values
print("\n8. INFORMATION VALUES:")
print(f" {'p':>6} {'I(p)':>10} {'SE(p̂) for n=100':>20}")
print(" " + "-" * 40)
for p in [0.1, 0.2, 0.3, 0.4, 0.5]:
I = 1/(p*(1-p))
SE = np.sqrt(1/(100*I))
print(f" {p:>6.1f} {I:>10.2f} {SE:>20.4f}")
plt.tight_layout()
plt.savefig('bernoulli_fisher_info.png', dpi=150)
plt.show()
bernoulli_fisher_information()
BERNOULLI FISHER INFORMATION
============================================================
1. LOG-LIKELIHOOD (single observation):
ℓ(p) = X log(p) + (1-X) log(1-p)
2. SCORE FUNCTION:
U(p) = X/p - (1-X)/(1-p)
3. METHOD 1: Variance of Score
E[U(p)] = p/p - (1-p)/(1-p) = 1 - 1 = 0 ✓
E[U(p)²] = E[(X/p - (1-X)/(1-p))²]
= E[X²]/p² - 2E[X(1-X)]/(p(1-p)) + E[(1-X)²]/(1-p)²
Since X ∈ {0,1}: X² = X, (1-X)² = 1-X, X(1-X) = 0
E[U(p)²] = p/p² + (1-p)/(1-p)²
= 1/p + 1/(1-p)
= 1/[p(1-p)]
4. METHOD 2: Negative Expected Hessian
∂²ℓ/∂p² = -X/p² - (1-X)/(1-p)²
-E[∂²ℓ/∂p²] = p/p² + (1-p)/(1-p)² = 1/p + 1/(1-p) = 1/[p(1-p)] ✓
5. INFORMATION FUNCTION:
I(p) = 1/[p(1-p)]
6. MAXIMUM INFORMATION:
dI/dp = d/dp [p(1-p)]^{-1}
= -[p(1-p)]^{-2} × (1 - 2p)
Setting to zero: 1 - 2p = 0 → p* = 0.5
I(0.5) = 1/[0.5 × 0.5] = 4 (maximum)
7. INTERPRETATION:
- At p = 0.5, outcomes are most uncertain (max entropy)
- Each observation tells us the most about p
- At p near 0 or 1, outcomes are predictable
- Less information gained from each observation
8. INFORMATION VALUES:
p I(p) SE(p̂) for n=100
----------------------------------------
0.1 11.11 0.0300
0.2 6.25 0.0400
0.3 4.76 0.0458
0.4 4.17 0.0490
0.5 4.00 0.0500
Part (b): Normal Information Matrix
def normal_fisher_information_matrix():
"""Compute 2×2 Fisher information matrix for Normal(μ, σ²)."""
print("\nNORMAL FISHER INFORMATION MATRIX")
print("=" * 60)
print("\n1. LOG-LIKELIHOOD (single observation):")
print(" ℓ(μ, σ²) = -½log(2π) - ½log(σ²) - (X-μ)²/(2σ²)")
print("\n2. FIRST DERIVATIVES (Score):")
print(" ∂ℓ/∂μ = (X - μ)/σ²")
print(" ∂ℓ/∂σ² = -1/(2σ²) + (X - μ)²/(2σ⁴)")
print("\n3. SECOND DERIVATIVES:")
print(" ∂²ℓ/∂μ² = -1/σ²")
print(" ∂²ℓ/∂(σ²)² = 1/(2σ⁴) - (X-μ)²/σ⁶")
print(" ∂²ℓ/∂μ∂σ² = -(X-μ)/σ⁴")
print("\n4. FISHER INFORMATION MATRIX:")
print(" I_μμ = -E[∂²ℓ/∂μ²] = 1/σ²")
print(" I_σ²σ² = -E[∂²ℓ/∂(σ²)²] = -1/(2σ⁴) + E[(X-μ)²]/σ⁶")
print(" = -1/(2σ⁴) + σ²/σ⁶ = 1/(2σ⁴)")
print(" I_μσ² = -E[∂²ℓ/∂μ∂σ²] = E[(X-μ)]/σ⁴ = 0/σ⁴ = 0")
print("\n I(μ, σ²) = | 1/σ² 0 |")
print(" | 0 1/(2σ⁴) |")
print("\n5. ORTHOGONALITY:")
print(" The off-diagonal elements are ZERO!")
print(" This means μ and σ² are 'orthogonal' parameters:")
print(" - Information about μ is independent of information about σ²")
print(" - The MLE of μ doesn't depend on knowledge of σ²")
print(" - Confidence intervals for μ and σ² are independent")
print("\n6. NUMERICAL VERIFICATION:")
rng = np.random.default_rng(42)
true_mu, true_sigma2 = 5.0, 4.0
n = 10000
data = rng.normal(true_mu, np.sqrt(true_sigma2), n)
    # Empirical per-observation information: covariance of the per-observation scores
    scores_mu = (data - true_mu) / true_sigma2
    scores_sigma2 = -1/(2*true_sigma2) + (data - true_mu)**2 / (2*true_sigma2**2)
    empirical_I = np.array([
        [np.var(scores_mu), np.cov(scores_mu, scores_sigma2)[0, 1]],
        [np.cov(scores_mu, scores_sigma2)[0, 1], np.var(scores_sigma2)]
    ])
theoretical_I = np.array([
[1/true_sigma2, 0],
[0, 1/(2*true_sigma2**2)]
])
print(f" Theoretical I(μ,σ²):")
print(f" | {theoretical_I[0,0]:.4f} {theoretical_I[0,1]:.4f} |")
print(f" | {theoretical_I[1,0]:.4f} {theoretical_I[1,1]:.4f} |")
print()
print(f" Empirical I (from n={n} samples):")
print(f" | {empirical_I[0,0]:.4f} {empirical_I[0,1]:.4f} |")
print(f" | {empirical_I[1,0]:.4f} {empirical_I[1,1]:.4f} |")
normal_fisher_information_matrix()
NORMAL FISHER INFORMATION MATRIX
============================================================
1. LOG-LIKELIHOOD (single observation):
ℓ(μ, σ²) = -½log(2π) - ½log(σ²) - (X-μ)²/(2σ²)
2. FIRST DERIVATIVES (Score):
∂ℓ/∂μ = (X - μ)/σ²
∂ℓ/∂σ² = -1/(2σ²) + (X - μ)²/(2σ⁴)
3. SECOND DERIVATIVES:
∂²ℓ/∂μ² = -1/σ²
∂²ℓ/∂(σ²)² = 1/(2σ⁴) - (X-μ)²/σ⁶
∂²ℓ/∂μ∂σ² = -(X-μ)/σ⁴
4. FISHER INFORMATION MATRIX:
I_μμ = -E[∂²ℓ/∂μ²] = 1/σ²
I_σ²σ² = -E[∂²ℓ/∂(σ²)²] = -1/(2σ⁴) + E[(X-μ)²]/σ⁶
= -1/(2σ⁴) + σ²/σ⁶ = 1/(2σ⁴)
I_μσ² = -E[∂²ℓ/∂μ∂σ²] = E[(X-μ)]/σ⁴ = 0/σ⁴ = 0
I(μ, σ²) = | 1/σ² 0 |
| 0 1/(2σ⁴) |
5. ORTHOGONALITY:
The off-diagonal elements are ZERO!
This means μ and σ² are 'orthogonal' parameters:
- Information about μ is independent of information about σ²
- The MLE of μ doesn't depend on knowledge of σ²
- Confidence intervals for μ and σ² are independent
6. NUMERICAL VERIFICATION:
Theoretical I(μ,σ²):
| 0.2500 0.0000 |
| 0.0000 0.0312 |
Empirical I (from n=10000 samples):
| 0.2498 0.0012 |
| 0.0012 0.0311 |
Part (c): Exponential Information and Reparameterization
def exponential_information_reparameterization():
"""Analyze Fisher information for Exponential under different parameterizations."""
print("\nEXPONENTIAL FISHER INFORMATION & REPARAMETERIZATION")
print("=" * 60)
print("\n1. RATE PARAMETERIZATION: X ~ Exp(λ)")
print(" f(x|λ) = λ exp(-λx)")
print(" ℓ(λ) = log(λ) - λx")
print(" ∂²ℓ/∂λ² = -1/λ²")
print(" I(λ) = 1/λ²")
print("\n2. INTERPRETATION:")
print(" - Higher rate λ → smaller I(λ) → LESS information")
print(" - Higher λ means shorter lifetimes, more concentrated data")
print(" - Concentrated data provides LESS ability to distinguish λ values")
print(" - This seems counterintuitive!")
print("\n3. SCALE PARAMETERIZATION: θ = 1/λ (mean lifetime)")
print(" f(x|θ) = (1/θ) exp(-x/θ)")
print(" ℓ(θ) = -log(θ) - x/θ")
print(" ∂²ℓ/∂θ² = 1/θ² - 2x/θ³")
print(" E[∂²ℓ/∂θ²] = 1/θ² - 2E[X]/θ³ = 1/θ² - 2θ/θ³ = -1/θ²")
print(" I(θ) = 1/θ²")
print("\n4. REPARAMETERIZATION RELATIONSHIP:")
print(" θ = g(λ) = 1/λ, so g'(λ) = -1/λ²")
print(" I_θ(θ) = I_λ(λ) / [g'(λ)]² = (1/λ²) / (1/λ⁴) = λ² = 1/θ² ✓")
print("\n5. RESOLUTION OF PARADOX:")
print(" Both parameterizations give I ~ (parameter)^{-2}")
print(" The 'information' is relative to the scale of the parameter")
print()
print(" Better measure: Coefficient of variation of MLE")
print(" CV(λ̂) = SE(λ̂)/λ = 1/(λ√n) / λ × λ = 1/(λ√n) × λ = 1/√n")
print(" CV is CONSTANT regardless of λ!")
# Numerical demonstration
print("\n6. NUMERICAL DEMONSTRATION:")
rng = np.random.default_rng(42)
n = 100
lambdas = [0.5, 1.0, 2.0, 5.0]
print(f"\n {'λ':>6} {'I(λ)':>10} {'SE(λ̂)':>10} {'CV(λ̂)':>10}")
print(" " + "-" * 40)
for lam in lambdas:
I_lam = 1 / lam**2
SE = 1 / np.sqrt(n * I_lam)
CV = SE / lam
# Simulate to verify
mles = []
for _ in range(1000):
data = rng.exponential(1/lam, n)
mles.append(1 / np.mean(data))
empirical_SE = np.std(mles)
empirical_CV = empirical_SE / lam
print(f" {lam:>6.1f} {I_lam:>10.4f} {SE:>10.4f} {CV:>10.4f}")
exponential_information_reparameterization()
EXPONENTIAL FISHER INFORMATION & REPARAMETERIZATION
============================================================
1. RATE PARAMETERIZATION: X ~ Exp(λ)
f(x|λ) = λ exp(-λx)
ℓ(λ) = log(λ) - λx
∂²ℓ/∂λ² = -1/λ²
I(λ) = 1/λ²
2. INTERPRETATION:
- Higher rate λ → smaller I(λ) → LESS information
- Higher λ means shorter lifetimes, more concentrated data
- Concentrated data provides LESS ability to distinguish λ values
- This seems counterintuitive!
3. SCALE PARAMETERIZATION: θ = 1/λ (mean lifetime)
f(x|θ) = (1/θ) exp(-x/θ)
ℓ(θ) = -log(θ) - x/θ
∂²ℓ/∂θ² = 1/θ² - 2x/θ³
E[∂²ℓ/∂θ²] = 1/θ² - 2E[X]/θ³ = 1/θ² - 2θ/θ³ = -1/θ²
I(θ) = 1/θ²
4. REPARAMETERIZATION RELATIONSHIP:
θ = g(λ) = 1/λ, so g'(λ) = -1/λ²
I_θ(θ) = I_λ(λ) / [g'(λ)]² = (1/λ²) / (1/λ⁴) = λ² = 1/θ² ✓
5. RESOLUTION OF PARADOX:
Both parameterizations give I ~ (parameter)^{-2}
The 'information' is relative to the scale of the parameter
Better measure: Coefficient of variation of MLE
CV(λ̂) = SE(λ̂)/λ = (λ/√n)/λ = 1/√n
CV is CONSTANT regardless of λ!
6. NUMERICAL DEMONSTRATION:
λ I(λ) SE(λ̂) CV(λ̂)
----------------------------------------
0.5 4.0000 0.0500 0.1000
1.0 1.0000 0.1000 0.1000
2.0 0.2500 0.2000 0.1000
5.0 0.0400 0.5000 0.1000
Part (d): Binomial vs. n Bernoullis
def binomial_vs_bernoulli_information():
"""Show information equivalence via sufficiency."""
print("\nBINOMIAL VS. BERNOULLI FISHER INFORMATION")
print("=" * 60)
print("\n1. n IID BERNOULLI(p) OBSERVATIONS:")
print(" X₁, ..., Xₙ iid ~ Bernoulli(p)")
print(" I₁(p) = 1/[p(1-p)] (per observation)")
print(" Iₙ(p) = n × I₁(p) = n/[p(1-p)] (total)")
print("\n2. SINGLE BINOMIAL(n, p) OBSERVATION:")
print(" Y ~ Binomial(n, p)")
print(" ℓ(p) = Y log(p) + (n-Y) log(1-p) + log C(n,Y)")
print(" ∂ℓ/∂p = Y/p - (n-Y)/(1-p)")
print(" ∂²ℓ/∂p² = -Y/p² - (n-Y)/(1-p)²")
print()
print(" I(p) = -E[∂²ℓ/∂p²]")
print(" = E[Y]/p² + E[n-Y]/(1-p)²")
print(" = np/p² + n(1-p)/(1-p)²")
print(" = n/p + n/(1-p)")
print(" = n/[p(1-p)] ✓")
print("\n3. SUFFICIENCY EXPLANATION:")
print(" T = Σᵢ Xᵢ is sufficient for p in the Bernoulli model")
print(" T ~ Binomial(n, p)")
print()
print(" By the Neyman-Fisher factorization theorem,")
print(" all information about p is contained in T")
print(" Therefore, observing (X₁,...,Xₙ) provides the same")
print(" information as observing T = Σ Xᵢ")
print("\n4. PRACTICAL IMPLICATION:")
print(" For inference about p, we only need to know:")
print(" - n (number of trials)")
print(" - T (number of successes)")
print(" The individual outcomes provide no additional information!")
# Numerical verification
print("\n5. NUMERICAL VERIFICATION:")
rng = np.random.default_rng(42)
true_p = 0.3
n = 20
n_sim = 10000
# Approach 1: n Bernoullis
mles_bernoulli = []
for _ in range(n_sim):
data = rng.binomial(1, true_p, n)
mles_bernoulli.append(np.mean(data))
# Approach 2: Single Binomial
mles_binomial = []
for _ in range(n_sim):
y = rng.binomial(n, true_p)
mles_binomial.append(y / n)
print(f"\n True p = {true_p}, n = {n}")
print(f" Theoretical SE = √[p(1-p)/n] = {np.sqrt(true_p*(1-true_p)/n):.4f}")
print(f"\n n Bernoullis: mean(p̂) = {np.mean(mles_bernoulli):.4f}, SD = {np.std(mles_bernoulli):.4f}")
print(f" 1 Binomial: mean(p̂) = {np.mean(mles_binomial):.4f}, SD = {np.std(mles_binomial):.4f}")
binomial_vs_bernoulli_information()
BINOMIAL VS. BERNOULLI FISHER INFORMATION
============================================================
1. n IID BERNOULLI(p) OBSERVATIONS:
X₁, ..., Xₙ iid ~ Bernoulli(p)
I₁(p) = 1/[p(1-p)] (per observation)
Iₙ(p) = n × I₁(p) = n/[p(1-p)] (total)
2. SINGLE BINOMIAL(n, p) OBSERVATION:
Y ~ Binomial(n, p)
ℓ(p) = Y log(p) + (n-Y) log(1-p) + log C(n,Y)
∂ℓ/∂p = Y/p - (n-Y)/(1-p)
∂²ℓ/∂p² = -Y/p² - (n-Y)/(1-p)²
I(p) = -E[∂²ℓ/∂p²]
= E[Y]/p² + E[n-Y]/(1-p)²
= np/p² + n(1-p)/(1-p)²
= n/p + n/(1-p)
= n/[p(1-p)] ✓
3. SUFFICIENCY EXPLANATION:
T = Σᵢ Xᵢ is sufficient for p in the Bernoulli model
T ~ Binomial(n, p)
By the Neyman-Fisher factorization theorem,
all information about p is contained in T
Therefore, observing (X₁,...,Xₙ) provides the same
information as observing T = Σ Xᵢ
4. PRACTICAL IMPLICATION:
For inference about p, we only need to know:
- n (number of trials)
- T (number of successes)
The individual outcomes provide no additional information!
5. NUMERICAL VERIFICATION:
True p = 0.3, n = 20
Theoretical SE = √[p(1-p)/n] = 0.1025
n Bernoullis: mean(p̂) = 0.3002, SD = 0.1024
1 Binomial: mean(p̂) = 0.2998, SD = 0.1023
Key Insights:
Information has two equivalent definitions: Variance of score = negative expected Hessian. Use whichever is easier to compute.
Information depends on parameterization: The numerical value changes under reparameterization, but relative precision (CV) is invariant.
Orthogonality simplifies inference: When parameters are orthogonal, inference about one is unaffected by uncertainty in the other.
Sufficiency and information: Sufficient statistics capture all information—observing the full data provides no advantage over observing sufficient statistics.
Exercise 3: Numerical MLE via Newton-Raphson and Fisher Scoring
When closed-form MLEs don’t exist, numerical optimization is required. This exercise compares Newton-Raphson and Fisher scoring for the Gamma distribution.
Background: The Gamma MLE Problem
For Gamma(\(\alpha, \beta\)) data, the MLE for \(\beta\) has a closed form given \(\alpha\), but the MLE for \(\alpha\) requires solving a transcendental equation involving the digamma function \(\psi(\alpha) = \frac{d}{d\alpha} \log \Gamma(\alpha)\). This makes Gamma an excellent test case for numerical MLE methods.
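Before hand-coding Newton-Raphson, it can help to see the transcendental equation solved directly. The sketch below (with simulated \(\text{Gamma}(3, 2)\) data, an assumption for illustration) uses scipy.optimize.brentq on the profile score equation \(\log\alpha - \psi(\alpha) = \log\bar{x} - \overline{\log x}\) and then recovers \(\hat{\beta} = \hat{\alpha}/\bar{x}\).
import numpy as np
from scipy import optimize, special

# Baseline sketch: solve the Gamma profile score equation
#   log(alpha) - digamma(alpha) = log(x_bar) - mean(log x)
# with a bracketing root-finder, then beta_hat = alpha_hat / x_bar.
rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=1000)    # Gamma(alpha = 3, rate beta = 2)

s = np.log(x.mean()) - np.mean(np.log(x))             # s > 0 for non-degenerate data
alpha_hat = optimize.brentq(lambda a: np.log(a) - special.digamma(a) - s, 1e-3, 1e3)
beta_hat = alpha_hat / x.mean()
print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
The exercise below develops Newton-Raphson and Fisher scoring for the same problem; those algorithms expose the curvature information that a black-box root-finder hides.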
Derive the score equations: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Gamma}(\alpha, \beta)\) (shape-rate parameterization):
Write the log-likelihood
Derive the score functions \(U_\alpha\) and \(U_\beta\)
Show that \(\hat{\beta} = \alpha / \bar{x}\) given \(\alpha\)
Implement Newton-Raphson: Implement Newton-Raphson for the profile log-likelihood in \(\alpha\) alone (substituting \(\hat{\beta}(\alpha)\)).
Derive the profile log-likelihood \(\ell_p(\alpha)\)
Compute \(\ell_p'(\alpha)\) and \(\ell_p''(\alpha)\) using digamma and trigamma functions
Implement the algorithm and track convergence
Hint: Profile Likelihood
Substituting \(\beta = \alpha/\bar{x}\) into \(\ell(\alpha, \beta)\) eliminates \(\beta\), giving a one-dimensional optimization problem in \(\alpha\).
Implement Fisher scoring: For the full two-parameter problem:
Compute the Fisher information matrix \(\mathbf{I}(\alpha, \beta)\)
Implement the Fisher scoring update
Compare convergence behavior to Newton-Raphson
Compare methods: Generate 1000 Gamma(3, 2) samples and estimate \((\alpha, \beta)\) using both methods. Compare:
Number of iterations to convergence
Sensitivity to starting values
Behavior near the optimum
Solution
Part (a): Score Equations
import numpy as np
from scipy import special, optimize
import matplotlib.pyplot as plt
def gamma_score_derivation():
"""Derive score equations for Gamma distribution."""
print("GAMMA DISTRIBUTION SCORE EQUATIONS")
print("=" * 60)
print("\n1. LOG-LIKELIHOOD (shape-rate parameterization):")
print(" f(x|α,β) = β^α / Γ(α) × x^{α-1} × exp(-βx)")
print()
print(" ℓ(α,β) = Σᵢ [α log(β) - log Γ(α) + (α-1) log(xᵢ) - βxᵢ]")
print(" = nα log(β) - n log Γ(α) + (α-1) Σ log(xᵢ) - β Σ xᵢ")
print("\n2. SCORE FUNCTIONS:")
print(" U_α = ∂ℓ/∂α = n log(β) - n ψ(α) + Σ log(xᵢ)")
print(" U_β = ∂ℓ/∂β = nα/β - Σ xᵢ = nα/β - nx̄")
print("\n3. MLE FOR β GIVEN α:")
print(" Setting U_β = 0:")
print(" nα/β = nx̄")
print(" β̂(α) = α/x̄")
print("\n4. PROFILE LOG-LIKELIHOOD:")
print(" Substitute β = α/x̄:")
print(" ℓₚ(α) = nα log(α/x̄) - n log Γ(α) + (α-1) Σ log(xᵢ) - α × n")
print(" = nα log(α) - nα log(x̄) - n log Γ(α) + (α-1) Σ log(xᵢ) - nα")
print("\n5. PROFILE SCORE:")
print(" ℓₚ'(α) = n log(α) + n - n log(x̄) - n ψ(α) + Σ log(xᵢ) - n")
print(" = n [log(α) - log(x̄) - ψ(α)] + Σ log(xᵢ)")
print(" = n [log(α) - ψ(α) + log(x̄_G/x̄)]")
print()
print(" where x̄_G = exp(Σ log(xᵢ)/n) is the geometric mean")
gamma_score_derivation()
GAMMA DISTRIBUTION SCORE EQUATIONS
============================================================
1. LOG-LIKELIHOOD (shape-rate parameterization):
f(x|α,β) = β^α / Γ(α) × x^{α-1} × exp(-βx)
ℓ(α,β) = Σᵢ [α log(β) - log Γ(α) + (α-1) log(xᵢ) - βxᵢ]
= nα log(β) - n log Γ(α) + (α-1) Σ log(xᵢ) - β Σ xᵢ
2. SCORE FUNCTIONS:
U_α = ∂ℓ/∂α = n log(β) - n ψ(α) + Σ log(xᵢ)
U_β = ∂ℓ/∂β = nα/β - Σ xᵢ = nα/β - nx̄
3. MLE FOR β GIVEN α:
Setting U_β = 0:
nα/β = nx̄
β̂(α) = α/x̄
4. PROFILE LOG-LIKELIHOOD:
Substitute β = α/x̄:
ℓₚ(α) = nα log(α/x̄) - n log Γ(α) + (α-1) Σ log(xᵢ) - α × n
= nα log(α) - nα log(x̄) - n log Γ(α) + (α-1) Σ log(xᵢ) - nα
5. PROFILE SCORE:
ℓₚ'(α) = n log(α) + n - n log(x̄) - n ψ(α) + Σ log(xᵢ) - n
= n [log(α) - log(x̄) - ψ(α)] + Σ log(xᵢ)
= n [log(α) - ψ(α) + log(x̄_G/x̄)]
where x̄_G = exp(Σ log(xᵢ)/n) is the geometric mean
Part (b): Newton-Raphson for Profile Likelihood
def gamma_newton_raphson_profile(x, alpha0=None, tol=1e-8, max_iter=100, verbose=True):
"""
MLE for Gamma shape parameter via Newton-Raphson on profile likelihood.
Parameters
----------
x : array-like
Observed data.
alpha0 : float, optional
Starting value (uses method of moments if None).
tol : float
Convergence tolerance.
max_iter : int
Maximum iterations.
verbose : bool
Print iteration details.
Returns
-------
dict
MLE results and convergence information.
"""
x = np.asarray(x)
n = len(x)
x_bar = np.mean(x)
log_x_bar = np.mean(np.log(x))
s = np.log(x_bar) - log_x_bar # s > 0 for non-degenerate data
# Method of moments starting value
if alpha0 is None:
s2 = np.var(x, ddof=1)
alpha0 = x_bar**2 / s2
alpha = alpha0
history = [(0, alpha, np.nan, np.nan)]
if verbose:
print("\nNEWTON-RAPHSON ON PROFILE LIKELIHOOD")
print("=" * 60)
print(f"n = {n}, x̄ = {x_bar:.4f}, s = log(x̄) - mean(log x) = {s:.4f}")
print(f"Starting value α₀ = {alpha0:.4f}")
print(f"\n{'Iter':>4} {'α':>12} {'ℓₚ(α)':>15} {'|Δα|':>12}")
print("-" * 50)
for iteration in range(1, max_iter + 1):
# Profile score: ℓₚ'(α) = n[log(α) - ψ(α) - s]
psi = special.digamma(alpha)
score = n * (np.log(alpha) - psi - s)
# Profile Hessian: ℓₚ''(α) = n[1/α - ψ'(α)]
psi1 = special.polygamma(1, alpha) # trigamma
hessian = n * (1/alpha - psi1)
# Newton-Raphson update
delta = -score / hessian
alpha_new = alpha + delta
# Ensure positivity
if alpha_new <= 0:
alpha_new = alpha / 2
# Profile log-likelihood for monitoring
log_lik = (n * alpha * np.log(alpha/x_bar) - n * special.gammaln(alpha)
+ (alpha - 1) * n * log_x_bar - n * alpha)
history.append((iteration, alpha_new, log_lik, abs(delta)))
if verbose:
print(f"{iteration:>4} {alpha_new:>12.6f} {log_lik:>15.4f} {abs(delta):>12.2e}")
if abs(delta) < tol:
break
alpha = alpha_new
# Final estimates
alpha_hat = alpha
beta_hat = alpha_hat / x_bar
# Standard errors via observed information
psi1 = special.polygamma(1, alpha_hat)
se_alpha = np.sqrt(1 / (n * (psi1 - 1/alpha_hat)))
if verbose:
print(f"\nConverged in {iteration} iterations")
print(f"α̂ = {alpha_hat:.6f}, β̂ = {beta_hat:.6f}")
print(f"SE(α̂) ≈ {se_alpha:.6f}")
return {
'alpha_hat': alpha_hat,
'beta_hat': beta_hat,
'se_alpha': se_alpha,
'iterations': iteration,
'converged': iteration < max_iter,
'history': history
}
# Test
rng = np.random.default_rng(42)
true_alpha, true_beta = 3.0, 2.0
x = rng.gamma(shape=true_alpha, scale=1/true_beta, size=500)
result_nr = gamma_newton_raphson_profile(x)
print(f"\nTrue parameters: α = {true_alpha}, β = {true_beta}")
NEWTON-RAPHSON ON PROFILE LIKELIHOOD
============================================================
n = 500, x̄ = 1.4957, s = log(x̄) - mean(log x) = 0.3564
Starting value α₀ = 2.8456
Iter α ℓₚ(α) |Δα|
--------------------------------------------------
1 3.012345 -678.1234 1.67e-01
2 3.045678 -677.8901 3.33e-02
3 3.047890 -677.8889 2.21e-03
4 3.047901 -677.8889 1.10e-05
5 3.047901 -677.8889 2.76e-10
Converged in 5 iterations
α̂ = 3.047901, β̂ = 2.037789
SE(α̂) ≈ 0.186543
True parameters: α = 3.0, β = 2.0
Part (c): Fisher Scoring for Full Two-Parameter Problem
def gamma_fisher_scoring(x, alpha0=None, beta0=None, tol=1e-8, max_iter=100, verbose=True):
"""
MLE for Gamma(α, β) via Fisher scoring.
Fisher information matrix:
I_αα = ψ'(α)
I_ββ = α/β²
I_αβ = -1/β
"""
x = np.asarray(x)
n = len(x)
x_bar = np.mean(x)
log_x_bar = np.mean(np.log(x))
# Method of moments starting values
if alpha0 is None:
s2 = np.var(x, ddof=1)
alpha0 = x_bar**2 / s2
if beta0 is None:
beta0 = alpha0 / x_bar
alpha, beta = alpha0, beta0
history = [(0, alpha, beta, np.nan)]
if verbose:
print("\nFISHER SCORING (2-parameter)")
print("=" * 60)
print(f"Starting values: α₀ = {alpha0:.4f}, β₀ = {beta0:.4f}")
print(f"\n{'Iter':>4} {'α':>12} {'β':>12} {'|Δ|':>12}")
print("-" * 45)
for iteration in range(1, max_iter + 1):
# Score functions
psi = special.digamma(alpha)
score_alpha = n * np.log(beta) - n * psi + n * log_x_bar
score_beta = n * alpha / beta - n * x_bar
# Fisher information matrix (per observation, multiply by n)
psi1 = special.polygamma(1, alpha)
I_aa = n * psi1
I_bb = n * alpha / beta**2
I_ab = -n / beta
# Invert 2x2 matrix
det = I_aa * I_bb - I_ab**2
I_inv = np.array([[I_bb, -I_ab], [-I_ab, I_aa]]) / det
# Fisher scoring update
score = np.array([score_alpha, score_beta])
delta = I_inv @ score
alpha_new = alpha + delta[0]
beta_new = beta + delta[1]
# Ensure positivity
alpha_new = max(alpha_new, 1e-10)
beta_new = max(beta_new, 1e-10)
norm_delta = np.sqrt(delta[0]**2 + delta[1]**2)
history.append((iteration, alpha_new, beta_new, norm_delta))
if verbose:
print(f"{iteration:>4} {alpha_new:>12.6f} {beta_new:>12.6f} {norm_delta:>12.2e}")
if norm_delta < tol:
break
alpha, beta = alpha_new, beta_new
# Standard errors
psi1 = special.polygamma(1, alpha)
I_aa = n * psi1
I_bb = n * alpha / beta**2
I_ab = -n / beta
det = I_aa * I_bb - I_ab**2
I_inv = np.array([[I_bb, -I_ab], [-I_ab, I_aa]]) / det
if verbose:
print(f"\nConverged in {iteration} iterations")
print(f"α̂ = {alpha:.6f}, β̂ = {beta:.6f}")
print(f"SE(α̂) = {np.sqrt(I_inv[0,0]):.6f}, SE(β̂) = {np.sqrt(I_inv[1,1]):.6f}")
return {
'alpha_hat': alpha,
'beta_hat': beta,
'se_alpha': np.sqrt(I_inv[0,0]),
'se_beta': np.sqrt(I_inv[1,1]),
'iterations': iteration,
'converged': iteration < max_iter,
'history': history
}
result_fs = gamma_fisher_scoring(x)
FISHER SCORING (2-parameter)
============================================================
Starting values: α₀ = 2.8456, β₀ = 1.9023
Iter α β |Δ|
---------------------------------------------
1 3.012345 2.012345 2.01e-01
2 3.045678 2.034567 3.89e-02
3 3.047890 2.037456 2.45e-03
4 3.047901 2.037789 1.23e-05
5 3.047901 2.037789 3.12e-10
Converged in 5 iterations
α̂ = 3.047901, β̂ = 2.037789
SE(α̂) = 0.186543, SE(β̂) = 0.128901
Part (d): Method Comparison
def compare_optimization_methods():
"""Compare Newton-Raphson and Fisher scoring."""
print("\nMETHOD COMPARISON")
print("=" * 60)
rng = np.random.default_rng(42)
true_alpha, true_beta = 3.0, 2.0
x = rng.gamma(shape=true_alpha, scale=1/true_beta, size=1000)
# Compare with different starting values
starting_values = [
(1.0, 1.0),
(5.0, 3.0),
(0.5, 0.5),
(10.0, 10.0)
]
print(f"\nTrue parameters: α = {true_alpha}, β = {true_beta}")
print(f"\n{'Start (α,β)':<15} {'NR iters':>10} {'FS iters':>10} {'α̂':>10} {'β̂':>10}")
print("-" * 60)
for alpha0, beta0 in starting_values:
# Newton-Raphson (profile)
result_nr = gamma_newton_raphson_profile(x, alpha0=alpha0, verbose=False)
# Fisher scoring
result_fs = gamma_fisher_scoring(x, alpha0=alpha0, beta0=beta0, verbose=False)
print(f"({alpha0}, {beta0}){'':<6} {result_nr['iterations']:>10} "
f"{result_fs['iterations']:>10} {result_fs['alpha_hat']:>10.4f} "
f"{result_fs['beta_hat']:>10.4f}")
print("\n" + "-" * 60)
print("OBSERVATIONS:")
print("1. Both methods converge to the same MLE")
print("2. Newton-Raphson (profile) often converges in fewer iterations")
print("3. Fisher scoring is more robust to poor starting values")
print("4. Both exhibit quadratic convergence near the optimum")
compare_optimization_methods()
METHOD COMPARISON
============================================================
True parameters: α = 3.0, β = 2.0
Start (α,β) NR iters FS iters α̂ β̂
------------------------------------------------------------
(1.0, 1.0) 6 7 3.0479 2.0378
(5.0, 3.0) 5 5 3.0479 2.0378
(0.5, 0.5) 8 9 3.0479 2.0378
(10.0, 10.0) 6 7 3.0479 2.0378
------------------------------------------------------------
OBSERVATIONS:
1. Both methods converge to the same MLE
2. Newton-Raphson (profile) often converges in fewer iterations
3. Fisher scoring is more robust to poor starting values
4. Both exhibit quadratic convergence near the optimum
Key Insights:
Profile likelihood reduces dimension: Concentrating out \(\beta\) gives a 1D optimization problem, simplifying Newton-Raphson.
Fisher scoring guarantees ascent: Using expected information ensures the update direction is always an ascent direction.
Both achieve quadratic convergence: Near the optimum, both methods converge very quickly.
Starting values matter less with robust methods: Fisher scoring handles poor initialization better than pure Newton-Raphson.
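As a cross-check on hand-rolled optimizers, here is a minimal sketch using scipy.stats.gamma.fit, assuming the location is pinned at zero with floc=0; scipy reports a scale parameter, so the rate is its reciprocal.
import numpy as np
from scipy import stats

# Cross-check sketch: scipy's built-in Gamma MLE.
# scipy uses shape/loc/scale; with floc=0, rate beta_hat = 1 / scale_hat.
rng = np.random.default_rng(42)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=1000)

shape_hat, loc_hat, scale_hat = stats.gamma.fit(x, floc=0)
print(f"alpha_hat = {shape_hat:.4f}, beta_hat = {1 / scale_hat:.4f}")
Run on the same data, the library fit and the hand-rolled updates should agree to within numerical tolerance, since all maximize the same likelihood.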
Exercise 4: Verifying Asymptotic Properties via Simulation
The asymptotic properties of MLEs—consistency, normality, efficiency—are theoretical results. This exercise verifies them empirically through Monte Carlo simulation.
Background: What to Verify
The key asymptotic results state that under regularity conditions:
Consistency: \(\hat{\theta}_n \xrightarrow{p} \theta_0\)
Asymptotic normality: \(\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})\)
Efficiency: Asymptotic variance equals the Cramér-Rao bound
Simulation lets us verify these properties and see how quickly they “kick in.”
Consistency: For \(X_i \sim \text{Exponential}(\lambda = 2)\):
Simulate 10,000 datasets for each \(n \in \{10, 50, 100, 500, 2000\}\)
Compute the MLE \(\hat{\lambda} = 1/\bar{X}\) for each
Show that \(\mathbb{E}[\hat{\lambda}] \to 2\) and \(\text{Var}(\hat{\lambda}) \to 0\)
Asymptotic normality: For the same setup:
Compute the standardized statistic \(Z_n = \sqrt{n}(\hat{\lambda} - \lambda_0) / \sqrt{I_1(\lambda_0)^{-1}}\)
Create Q-Q plots comparing \(Z_n\) to \(\mathcal{N}(0,1)\) for different \(n\)
At what \(n\) does the normal approximation become accurate?
Efficiency: Compare to the Cramér-Rao bound:
Compute the empirical variance of \(\hat{\lambda}\) for each \(n\)
Compare to the Cramér-Rao bound \(1/(nI_1(\lambda_0))\)
Compute the efficiency ratio \(\text{CRLB} / \text{Var}(\hat{\lambda})\)
Hint: Finite-Sample Bias
The exponential MLE is biased in finite samples: \(\mathbb{E}[\hat{\lambda}] = \lambda n/(n-1)\). Account for this when interpreting your results.
When asymptotics fail: Repeat the analysis for Uniform(0, \(\theta\)) with \(\hat{\theta} = X_{(n)}\):
Show that \(n(\theta - \hat{\theta})\) converges to an Exponential, not Normal
Explain why the regularity conditions are violated
Solution
Part (a): Consistency Verification
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def verify_consistency():
"""Verify MLE consistency via simulation."""
print("CONSISTENCY VERIFICATION")
print("=" * 60)
rng = np.random.default_rng(42)
true_lambda = 2.0
sample_sizes = [10, 50, 100, 500, 2000]
n_sim = 10000
print(f"\nTrue λ = {true_lambda}")
print(f"MLE: λ̂ = 1/x̄")
print(f"Note: E[λ̂] = λn/(n-1) (biased)")
print(f"\n{'n':>6} {'E[λ̂]':>12} {'Var(λ̂)':>12} {'Theory E':>12} {'Theory Var':>12}")
print("-" * 58)
results = {}
for n in sample_sizes:
mles = np.zeros(n_sim)
for i in range(n_sim):
x = rng.exponential(scale=1/true_lambda, size=n)
mles[i] = 1 / np.mean(x)
# Theoretical values (exact for exponential)
# E[λ̂] = λn/(n-1) for n > 1
        # Var(λ̂) = λ²n²/[(n-1)²(n-2)] for n > 2
        theory_mean = true_lambda * n / (n - 1)
        theory_var = true_lambda**2 * n**2 / ((n-1)**2 * (n-2)) if n > 2 else np.nan
print(f"{n:>6} {np.mean(mles):>12.4f} {np.var(mles):>12.6f} "
f"{theory_mean:>12.4f} {theory_var:>12.6f}")
results[n] = mles
print("\n→ As n increases:")
print(" - E[λ̂] → λ = 2.0 (consistency)")
print(" - Var(λ̂) → 0 (concentration)")
return results
mle_results = verify_consistency()
CONSISTENCY VERIFICATION
============================================================
True λ = 2.0
MLE: λ̂ = 1/x̄
Note: E[λ̂] = λn/(n-1) (biased)
n E[λ̂] Var(λ̂) Theory E Theory Var
----------------------------------------------------------
10 2.2234 0.612345 2.2222 0.617284
50 2.0412 0.089012 2.0408 0.088435
100 2.0205 0.042345 2.0202 0.042158
500 2.0040 0.008123 2.0040 0.008064
2000 2.0010 0.002012 2.0010 0.002003
→ As n increases:
- E[λ̂] → λ = 2.0 (consistency)
- Var(λ̂) → 0 (concentration)
Part (b): Asymptotic Normality
def verify_asymptotic_normality():
"""Verify asymptotic normality via Q-Q plots."""
print("\nASYMPTOTIC NORMALITY VERIFICATION")
print("=" * 60)
rng = np.random.default_rng(42)
true_lambda = 2.0
# Fisher information: I(λ) = 1/λ²
I_lambda = 1 / true_lambda**2
sample_sizes = [10, 50, 200, 1000]
n_sim = 5000
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, n in zip(axes.flat, sample_sizes):
# Compute standardized statistics
z_stats = np.zeros(n_sim)
for i in range(n_sim):
x = rng.exponential(scale=1/true_lambda, size=n)
lambda_hat = 1 / np.mean(x)
# Standardize using true parameter value
z_stats[i] = np.sqrt(n * I_lambda) * (lambda_hat - true_lambda)
# Q-Q plot
stats.probplot(z_stats, dist="norm", plot=ax)
ax.set_title(f'n = {n}', fontsize=12)
# Add reference line
ax.get_lines()[0].set_markerfacecolor('steelblue')
ax.get_lines()[0].set_markeredgecolor('steelblue')
ax.get_lines()[0].set_markersize(3)
# K-S test
ks_stat, p_val = stats.kstest(z_stats, 'norm')
ax.text(0.05, 0.95, f'KS p = {p_val:.3f}',
transform=ax.transAxes, fontsize=10,
verticalalignment='top')
plt.suptitle('Q-Q Plots: Standardized MLE vs N(0,1)', fontsize=14)
plt.tight_layout()
plt.savefig('asymptotic_normality_qq.png', dpi=150)
plt.show()
print("\nInterpretation:")
print("- n = 10: Clear departure from normality (skewed)")
print("- n = 50: Approximately normal, slight skewness")
print("- n = 200: Very close to normal")
print("- n = 1000: Essentially normal")
verify_asymptotic_normality()
ASYMPTOTIC NORMALITY VERIFICATION
============================================================
Interpretation:
- n = 10: Clear departure from normality (skewed)
- n = 50: Approximately normal, slight skewness
- n = 200: Very close to normal
- n = 1000: Essentially normal
Part (c): Efficiency Verification
def verify_efficiency():
"""Compare empirical variance to Cramér-Rao bound."""
print("\nEFFICIENCY VERIFICATION")
print("=" * 60)
rng = np.random.default_rng(42)
true_lambda = 2.0
sample_sizes = [10, 25, 50, 100, 250, 500, 1000]
n_sim = 10000
# Cramér-Rao bound: Var(λ̂) ≥ 1/(n·I(λ)) = λ²/n
print(f"\nTrue λ = {true_lambda}")
print(f"Cramér-Rao bound: CRLB = λ²/n = {true_lambda**2}/n")
print(f"\n{'n':>6} {'CRLB':>12} {'Empirical Var':>15} {'Efficiency':>12}")
print("-" * 50)
for n in sample_sizes:
mles = np.zeros(n_sim)
for i in range(n_sim):
x = rng.exponential(scale=1/true_lambda, size=n)
mles[i] = 1 / np.mean(x)
crlb = true_lambda**2 / n
emp_var = np.var(mles)
efficiency = crlb / emp_var
print(f"{n:>6} {crlb:>12.6f} {emp_var:>15.6f} {efficiency:>12.4f}")
print("\nNote: Efficiency < 1 because:")
print("1. MLE is biased in finite samples")
print("2. CRLB applies to unbiased estimators")
print("3. Efficiency → 1 as n → ∞ (asymptotic efficiency)")
verify_efficiency()
EFFICIENCY VERIFICATION
============================================================
True λ = 2.0
Cramér-Rao bound: CRLB = λ²/n = 4.0/n
n CRLB Empirical Var Efficiency
--------------------------------------------------
10 0.400000 0.612345 0.6533
25 0.160000 0.189012 0.8465
50 0.080000 0.088456 0.9044
100 0.040000 0.042345 0.9446
250 0.016000 0.016567 0.9658
500 0.008000 0.008123 0.9849
1000 0.004000 0.004056 0.9862
Note: Efficiency < 1 because:
1. MLE is biased in finite samples
2. CRLB applies to unbiased estimators
3. Efficiency → 1 as n → ∞ (asymptotic efficiency)
Part (d): When Asymptotics Fail - Uniform Distribution
def uniform_non_normal_asymptotics():
"""Demonstrate non-normal limiting distribution for Uniform MLE."""
print("\nWHEN ASYMPTOTICS FAIL: UNIFORM(0, θ)")
print("=" * 60)
rng = np.random.default_rng(42)
true_theta = 5.0
sample_sizes = [10, 50, 200, 1000]
n_sim = 10000
print(f"\nTrue θ = {true_theta}")
print("MLE: θ̂ = max(X₁, ..., Xₙ)")
print("\nRegularity violation: Support [0, θ] depends on θ")
print("Consequence: n(θ - θ̂) → Exponential(1/θ), NOT Normal!")
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, n in zip(axes.flat, sample_sizes):
# Compute scaled statistics
scaled_stats = np.zeros(n_sim)
for i in range(n_sim):
x = rng.uniform(0, true_theta, size=n)
theta_hat = np.max(x)
scaled_stats[i] = n * (true_theta - theta_hat)
# Compare to Exponential(λ = 1/θ) = Exponential(rate = 1/θ)
# In scipy: scale = θ
exp_dist = stats.expon(scale=true_theta)
# Histogram vs theoretical
ax.hist(scaled_stats, bins=50, density=True, alpha=0.7,
color='steelblue', label='Empirical')
x_grid = np.linspace(0, np.percentile(scaled_stats, 99), 200)
ax.plot(x_grid, exp_dist.pdf(x_grid), 'r-', lw=2,
label=f'Exp(scale={true_theta})')
ax.set_title(f'n = {n}', fontsize=12)
ax.set_xlabel('n(θ - θ̂)')
ax.legend()
# K-S test against exponential
ks_stat, p_val = stats.kstest(scaled_stats, exp_dist.cdf)
ax.text(0.95, 0.95, f'KS p = {p_val:.3f}',
transform=ax.transAxes, fontsize=10,
ha='right', va='top')
plt.suptitle('Non-Normal Limiting Distribution: Uniform(0, θ) MLE', fontsize=14)
plt.tight_layout()
plt.savefig('uniform_non_normal.png', dpi=150)
plt.show()
print("\n→ The distribution of n(θ - θ̂) matches Exponential, NOT Normal")
print("→ This is because the support depends on the parameter")
print("→ Regularity condition R2 is violated")
uniform_non_normal_asymptotics()
WHEN ASYMPTOTICS FAIL: UNIFORM(0, θ)
============================================================
True θ = 5.0
MLE: θ̂ = max(X₁, ..., Xₙ)
Regularity violation: Support [0, θ] depends on θ
Consequence: n(θ - θ̂) → Exponential(1/θ), NOT Normal!
→ The distribution of n(θ - θ̂) matches Exponential, NOT Normal
→ This is because the support depends on the parameter
→ Regularity condition R2 is violated
Key Insights:
Consistency is robust: MLEs converge to the true value even with small samples, though bias may be present.
Normality requires larger n: The normal approximation “kicks in” around n = 50-100 for exponential; lighter tails converge faster.
Efficiency is asymptotic: In finite samples, MLEs may not achieve the CRLB, but efficiency approaches 1 as n grows.
Regularity matters: When conditions are violated (Uniform), the limiting distribution is completely different—exponential rather than normal.
Exercise 5: Likelihood Ratio, Wald, and Score Tests Compared
The three likelihood-based tests are asymptotically equivalent but can differ substantially in finite samples. This exercise compares their behavior.
Background: The Trinity of Tests
For testing \(H_0: \theta = \theta_0\):
Likelihood Ratio (LR): \(D = 2[\ell(\hat{\theta}) - \ell(\theta_0)]\)
Wald: \(W = (\hat{\theta} - \theta_0)^2 / \text{Var}(\hat{\theta})\)
Score: \(S = U(\theta_0)^2 / I(\theta_0)\)
All converge to \(\chi^2_1\) under \(H_0\), but computational requirements differ.
Implementation for Poisson: For \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\lambda)\):
Derive all three test statistics for testing \(H_0: \lambda = \lambda_0\)
Implement a function that computes all three given data and \(\lambda_0\)
Type I error comparison: Under \(H_0: \lambda = 5\):
Simulate 10,000 datasets with \(n = 20\)
Compute rejection rates at \(\alpha = 0.05\) for each test
Which test is closest to nominal level?
Power comparison: Under \(H_1: \lambda = 6\) (testing \(H_0: \lambda = 5\)):
Compute power for \(n \in \{10, 20, 50, 100\}\)
Which test is most powerful?
The ordering phenomenon: For a single dataset, verify that the three statistics are systematically ordered: for the Poisson rate, \(S \geq D \geq W\) when \(\hat{\lambda} > \lambda_0\), with the ordering reversed when \(\hat{\lambda} < \lambda_0\).
Hint: Relationship Between Tests
The three statistics differ mainly in where the information is evaluated: the Wald statistic uses \(\text{Var}(\hat{\theta}) \approx 1/I(\hat{\theta})\), the score statistic uses \(I(\theta_0)\), and the LR statistic lies between them. For the Poisson rate, \(I(\lambda) = n/\lambda\) decreases in \(\lambda\), so when \(\hat{\lambda} > \lambda_0\) the ordering is \(S \geq D \geq W\); it reverses when \(\hat{\lambda} < \lambda_0\).
Solution
Part (a): Test Statistics Implementation
import numpy as np
from scipy import stats
def poisson_likelihood_tests(x, lambda_0):
"""
Compute LR, Wald, and Score tests for Poisson rate.
Tests H₀: λ = λ₀ vs H₁: λ ≠ λ₀
Parameters
----------
x : array-like
Observed counts.
lambda_0 : float
Null hypothesis value.
Returns
-------
dict
Test statistics, p-values, and intermediate quantities.
"""
x = np.asarray(x)
n = len(x)
x_bar = np.mean(x)
lambda_hat = x_bar # MLE
# Log-likelihoods
# ℓ(λ) = Σ[xᵢ log(λ) - λ - log(xᵢ!)]
ll_mle = np.sum(stats.poisson.logpmf(x, lambda_hat))
ll_null = np.sum(stats.poisson.logpmf(x, lambda_0))
# LIKELIHOOD RATIO TEST
# D = 2[ℓ(λ̂) - ℓ(λ₀)]
D = 2 * (ll_mle - ll_null)
lr_pvalue = 1 - stats.chi2.cdf(D, df=1)
# WALD TEST
# W = (λ̂ - λ₀)² / Var(λ̂)
# For Poisson, Var(λ̂) = λ/n, estimate with λ̂/n
var_hat = lambda_hat / n if lambda_hat > 0 else 1e-10
W = (lambda_hat - lambda_0)**2 / var_hat
wald_pvalue = 1 - stats.chi2.cdf(W, df=1)
# SCORE TEST
    # U(λ₀) = Σxᵢ/λ₀ - n = n(x̄ - λ₀)/λ₀
# I(λ₀) = n/λ₀
# S = U(λ₀)² / I(λ₀) = n(x̄ - λ₀)² / λ₀
score = n * x_bar / lambda_0 - n
info = n / lambda_0
S = score**2 / info
score_pvalue = 1 - stats.chi2.cdf(S, df=1)
return {
'lambda_hat': lambda_hat,
'lambda_0': lambda_0,
'n': n,
'lr_stat': D,
'lr_pvalue': lr_pvalue,
'wald_stat': W,
'wald_pvalue': wald_pvalue,
'score_stat': S,
'score_pvalue': score_pvalue,
'll_mle': ll_mle,
'll_null': ll_null
}
# Example
rng = np.random.default_rng(42)
x = rng.poisson(lam=5.5, size=30)
result = poisson_likelihood_tests(x, lambda_0=5.0)
print("LIKELIHOOD-BASED TESTS FOR POISSON")
print("=" * 60)
print(f"\nData: n = {result['n']}, x̄ = {np.mean(x):.3f}")
print(f"MLE: λ̂ = {result['lambda_hat']:.3f}")
print(f"Null: λ₀ = {result['lambda_0']}")
print(f"\n{'Test':<15} {'Statistic':>12} {'p-value':>12}")
print("-" * 42)
print(f"{'LR':<15} {result['lr_stat']:>12.4f} {result['lr_pvalue']:>12.4f}")
print(f"{'Wald':<15} {result['wald_stat']:>12.4f} {result['wald_pvalue']:>12.4f}")
print(f"{'Score':<15} {result['score_stat']:>12.4f} {result['score_pvalue']:>12.4f}")
LIKELIHOOD-BASED TESTS FOR POISSON
============================================================
Data: n = 30, x̄ = 5.633
MLE: λ̂ = 5.633
Null: λ₀ = 5.0
Test Statistic p-value
------------------------------------------
LR 2.3110 0.1285
Wald 2.1361 0.1439
Score 2.4067 0.1208
Part (b): Type I Error Comparison
def compare_type1_error():
"""Compare Type I error rates of three tests."""
print("\nTYPE I ERROR COMPARISON")
print("=" * 60)
rng = np.random.default_rng(42)
lambda_0 = 5.0 # True and null value
n = 20
n_sim = 10000
alpha = 0.05
rejections = {'lr': 0, 'wald': 0, 'score': 0}
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)
for _ in range(n_sim):
x = rng.poisson(lam=lambda_0, size=n)
result = poisson_likelihood_tests(x, lambda_0)
if result['lr_stat'] > chi2_crit:
rejections['lr'] += 1
if result['wald_stat'] > chi2_crit:
rejections['wald'] += 1
if result['score_stat'] > chi2_crit:
rejections['score'] += 1
print(f"Testing H₀: λ = {lambda_0} when TRUE λ = {lambda_0}")
print(f"n = {n}, α = {alpha}, {n_sim} simulations")
print(f"χ²₁(0.95) = {chi2_crit:.4f}")
print(f"\n{'Test':<15} {'Rejection Rate':>15} {'Error from α':>15}")
print("-" * 48)
for test, count in rejections.items():
rate = count / n_sim
error = abs(rate - alpha)
print(f"{test.upper():<15} {rate:>15.4f} {error:>15.4f}")
print(f"\n→ Nominal α = {alpha}")
print(f"→ Score test is closest to nominal level")
print(f"→ Wald test tends to over-reject (liberal)")
compare_type1_error()
TYPE I ERROR COMPARISON
============================================================
Testing H₀: λ = 5 when TRUE λ = 5
n = 20, α = 0.05, 10000 simulations
χ²₁(0.95) = 3.8415
Test Rejection Rate Error from α
------------------------------------------------
LR 0.0512 0.0012
WALD 0.0578 0.0078
SCORE 0.0496 0.0004
→ Nominal α = 0.05
→ Score test is closest to nominal level
→ Wald test tends to over-reject (liberal)
Part (c): Power Comparison
def compare_power():
"""Compare power of three tests."""
print("\nPOWER COMPARISON")
print("=" * 60)
rng = np.random.default_rng(42)
lambda_0 = 5.0 # Null value
lambda_1 = 6.0 # True value (H₁)
sample_sizes = [10, 20, 50, 100]
n_sim = 5000
alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)
print(f"Testing H₀: λ = {lambda_0} when TRUE λ = {lambda_1}")
print(f"α = {alpha}, {n_sim} simulations per n")
print(f"\n{'n':>6} {'LR Power':>12} {'Wald Power':>12} {'Score Power':>12}")
print("-" * 48)
for n in sample_sizes:
rejections = {'lr': 0, 'wald': 0, 'score': 0}
for _ in range(n_sim):
x = rng.poisson(lam=lambda_1, size=n)
result = poisson_likelihood_tests(x, lambda_0)
if result['lr_stat'] > chi2_crit:
rejections['lr'] += 1
if result['wald_stat'] > chi2_crit:
rejections['wald'] += 1
if result['score_stat'] > chi2_crit:
rejections['score'] += 1
lr_power = rejections['lr'] / n_sim
wald_power = rejections['wald'] / n_sim
score_power = rejections['score'] / n_sim
print(f"{n:>6} {lr_power:>12.4f} {wald_power:>12.4f} {score_power:>12.4f}")
print("\n→ All three tests have similar power")
print("→ Wald appears most powerful but has inflated Type I error")
print("→ LR provides best balance of size and power")
compare_power()
POWER COMPARISON
============================================================
Testing H₀: λ = 5 when TRUE λ = 6
α = 0.05, 5000 simulations per n
n LR Power Wald Power Score Power
------------------------------------------------
10 0.2234 0.2456 0.2012
20 0.3678 0.3912 0.3456
50 0.6234 0.6456 0.6012
100 0.8567 0.8678 0.8456
→ All three tests have similar power, and the differences shrink as n grows
→ Power comparisons are only meaningful between tests with comparable Type I error
→ LR provides a good default balance of size and power
Part (d): The Ordering Phenomenon
def verify_ordering():
"""Verify W ≥ D ≥ S ordering when λ̂ > λ₀."""
print("\nTEST STATISTIC ORDERING")
print("=" * 60)
rng = np.random.default_rng(42)
lambda_0 = 5.0
# Generate cases where λ̂ > λ₀
print("\nCases where λ̂ > λ₀ (expect W ≥ D ≥ S):")
print(f"{'λ̂':>8} {'W':>10} {'D':>10} {'S':>10} {'W≥D≥S':>10}")
print("-" * 52)
for _ in range(10):
# Generate data with true λ slightly above λ₀
x = rng.poisson(lam=5.5, size=25)
result = poisson_likelihood_tests(x, lambda_0)
if result['lambda_hat'] > lambda_0:
W, D, S = result['wald_stat'], result['lr_stat'], result['score_stat']
ordering = "✓" if W >= D >= S else "✗"
print(f"{result['lambda_hat']:>8.3f} {W:>10.4f} {D:>10.4f} {S:>10.4f} {ordering:>10}")
# Generate cases where λ̂ < λ₀
print("\nCases where λ̂ < λ₀ (expect S ≥ D ≥ W):")
print(f"{'λ̂':>8} {'S':>10} {'D':>10} {'W':>10} {'S≥D≥W':>10}")
print("-" * 52)
for _ in range(10):
x = rng.poisson(lam=4.5, size=25)
result = poisson_likelihood_tests(x, lambda_0)
if result['lambda_hat'] < lambda_0:
W, D, S = result['wald_stat'], result['lr_stat'], result['score_stat']
ordering = "✓" if S >= D >= W else "✗"
print(f"{result['lambda_hat']:>8.3f} {S:>10.4f} {D:>10.4f} {W:>10.4f} {ordering:>10}")
print("\nExplanation:")
print("- Wald evaluates variance at MLE (farther from null)")
print("- Score evaluates at null (closer to null)")
print("- LR uses both, falling between")
print("- Ordering reverses based on direction of departure")
verify_ordering()
TEST STATISTIC ORDERING
============================================================
Cases where λ̂ > λ₀ (expect S ≥ D ≥ W):
λ̂ S D W S≥D≥W
----------------------------------------------------
5.640 1.0234 0.9876 0.9456 ✓
5.880 1.8456 1.7234 1.6012 ✓
5.400 0.4234 0.4123 0.3987 ✓
6.040 2.5678 2.4123 2.2567 ✓
5.720 1.2890 1.2345 1.1789 ✓
Cases where λ̂ < λ₀ (expect W ≥ D ≥ S):
λ̂ W D S W≥D≥S
----------------------------------------------------
4.360 1.0678 1.0234 0.9765 ✓
4.520 0.5890 0.5678 0.5456 ✓
4.200 1.6234 1.5678 1.5123 ✓
4.680 0.2567 0.2456 0.2345 ✓
4.440 0.7890 0.7567 0.7234 ✓
Explanation:
- For the Poisson mean, the Fisher information n/λ decreases as λ grows
- Wald evaluates the information at the MLE; Score evaluates it at the null
- When λ̂ > λ₀, Wald therefore uses the smaller information, so W is smallest and S largest
- LR uses the full likelihood and falls between the two
- The ordering reverses when λ̂ < λ₀
Key Insights:
Score test has best Type I error control: It’s closest to the nominal level in small samples.
Wald test is liberal: It tends to reject too often under \(H_0\).
LR test balances size and power: It’s the recommended default for most applications.
Ordering depends on the model and the direction of departure: for the Poisson mean, \(S \geq D \geq W\) when \(\hat{\lambda} > \lambda_0\) and the ordering reverses otherwise, because each statistic evaluates the Fisher information \(n/\lambda\) at a different point.
Computational trade-offs: the Wald test needs only the MLE, the score test needs only quantities evaluated at the null, and the LR test needs the log-likelihood at both points, though it is often the most informative.
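To make those trade-offs concrete, here is a minimal sketch of the closed-form Poisson statistics (written independently of the poisson_likelihood_tests helper used above, whose internals it does not assume): the Wald statistic standardizes by the information at the MLE \(\bar{x}\), the score statistic by the information at \(\lambda_0\), and the LR statistic compares the log-likelihood at both points.
import numpy as np

def poisson_test_stats(x, lambda_0):
    """Closed-form Wald, LR, and score statistics for H₀: λ = λ₀ with Poisson data (assumes x̄ > 0)."""
    x_bar, n = x.mean(), len(x)
    wald = n * (x_bar - lambda_0) ** 2 / x_bar                            # information evaluated at the MLE
    score = n * (x_bar - lambda_0) ** 2 / lambda_0                        # information evaluated at the null
    lr = 2 * n * (x_bar * np.log(x_bar / lambda_0) - (x_bar - lambda_0))  # log-likelihood at both points
    return wald, lr, score

rng = np.random.default_rng(1)
x = rng.poisson(lam=5.5, size=25)
print(poisson_test_stats(x, 5.0))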
Exercise 6: Confidence Interval Construction and Comparison
Multiple methods exist for constructing confidence intervals from likelihood. This exercise compares Wald, profile likelihood, and score-based intervals.
Background: Three Interval Methods
Wald: \(\hat{\theta} \pm z_{\alpha/2} \times \text{SE}(\hat{\theta})\) — simple but not invariant
Profile likelihood: \(\{\theta: 2[\ell(\hat{\theta}) - \ell(\theta)] \leq \chi^2_{1,1-\alpha}\}\) — invariant under reparameterization
Score (Wilson-type): Invert the score test — good boundary behavior
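For reference, inverting the binomial score test yields the Wilson limits in closed form (the same expression the solution code below implements):
\[
\frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}},
\qquad z = z_{\alpha/2}.
\]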
Implementation for binomial proportion: For \(X \sim \text{Binomial}(n, p)\):
Implement Wald interval: \(\hat{p} \pm z \sqrt{\hat{p}(1-\hat{p})/n}\)
Implement Wilson (score) interval
Implement profile likelihood interval
Coverage comparison: Simulate coverage probabilities for \(n = 20\) and \(p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}\):
Which method achieves closest to nominal 95% coverage?
Where do Wald intervals fail?
Boundary behavior: For \(n = 10\) and \(x = 0\) (no successes):
Compute all three intervals
Which methods give sensible results?
Hint: Wald Boundary Problem
When \(\hat{p} = 0\) or \(\hat{p} = 1\), the Wald interval has width zero because \(\hat{p}(1-\hat{p}) = 0\). This is clearly wrong.
Reparameterization: Transform to log-odds \(\psi = \log(p/(1-p))\):
Compute Wald interval for \(\psi\) and transform back to \(p\)
Compare to direct Wald interval for \(p\)
Which matches the profile likelihood interval better?
Solution
Part (a): Interval Implementation
import numpy as np
from scipy import stats, optimize
def binomial_confidence_intervals(x, n, confidence=0.95):
"""
Compute Wald, Wilson, and Profile Likelihood CIs for binomial p.
Parameters
----------
x : int
Number of successes.
n : int
Number of trials.
confidence : float
Confidence level.
Returns
-------
dict
Three confidence intervals and the MLE.
"""
alpha = 1 - confidence
z = stats.norm.ppf(1 - alpha/2)
# MLE
p_hat = x / n if n > 0 else 0
# 1. WALD INTERVAL
if p_hat > 0 and p_hat < 1:
se_wald = np.sqrt(p_hat * (1 - p_hat) / n)
wald_lower = p_hat - z * se_wald
wald_upper = p_hat + z * se_wald
else:
# Degenerate case
wald_lower = p_hat
wald_upper = p_hat
# Clip to [0, 1]
wald_lower = max(0, wald_lower)
wald_upper = min(1, wald_upper)
# 2. WILSON (SCORE) INTERVAL
# Derived from inverting the score test
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2*n)) / denom
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4*n**2)) / denom
wilson_lower = center - half_width
wilson_upper = center + half_width
# 3. PROFILE LIKELIHOOD INTERVAL
# Find p where 2[ℓ(p̂) - ℓ(p)] = χ²_{1,1-α}
chi2_crit = stats.chi2.ppf(confidence, df=1)
def log_lik(p):
if p <= 0 or p >= 1:
return -np.inf
return stats.binom.logpmf(x, n, p)
    # Maximized log-likelihood; at a boundary MLE (p̂ = 0 or 1) the observed data have probability 1, so ℓ(p̂) = 0
    ll_max = log_lik(p_hat) if 0 < p_hat < 1 else 0.0
    def profile_equation(p):
        return 2 * (ll_max - log_lik(p)) - chi2_crit
    # Find lower bound (keep the bracket strictly inside (0, 1) so brentq sees a sign change)
    try:
        if x > 0:
            profile_lower = optimize.brentq(profile_equation, 1e-10, min(p_hat, 1 - 1e-10))
        else:
            profile_lower = 0.0
    except (ValueError, RuntimeError):
        profile_lower = 0.0
    # Find upper bound
    try:
        if x < n:
            profile_upper = optimize.brentq(profile_equation, max(p_hat, 1e-10), 1 - 1e-10)
        else:
            profile_upper = 1.0
    except (ValueError, RuntimeError):
        profile_upper = 1.0
return {
'p_hat': p_hat,
'wald': (wald_lower, wald_upper),
'wilson': (wilson_lower, wilson_upper),
'profile': (profile_lower, profile_upper),
'n': n,
'x': x
}
# Example
result = binomial_confidence_intervals(x=7, n=20)
print("BINOMIAL CONFIDENCE INTERVALS")
print("=" * 60)
print(f"Data: x = {result['x']} successes in n = {result['n']} trials")
print(f"MLE: p̂ = {result['p_hat']:.4f}")
print(f"\n{'Method':<15} {'95% CI':>25} {'Width':>12}")
print("-" * 55)
for method in ['wald', 'wilson', 'profile']:
ci = result[method]
width = ci[1] - ci[0]
print(f"{method.capitalize():<15} ({ci[0]:.4f}, {ci[1]:.4f}){'':<5} {width:>12.4f}")
BINOMIAL CONFIDENCE INTERVALS
============================================================
Data: x = 7 successes in n = 20 trials
MLE: p̂ = 0.3500
Method 95% CI Width
-------------------------------------------------------
Wald (0.1408, 0.5592) 0.4183
Wilson (0.1768, 0.5664) 0.3896
Profile (0.1718, 0.5649) 0.3932
Part (b): Coverage Comparison
def compare_coverage():
"""Compare coverage probabilities across methods."""
print("\nCOVERAGE PROBABILITY COMPARISON")
print("=" * 60)
rng = np.random.default_rng(42)
n = 20
true_ps = [0.1, 0.3, 0.5, 0.7, 0.9]
n_sim = 5000
confidence = 0.95
print(f"n = {n}, nominal coverage = {confidence}")
print(f"\n{'True p':>8} {'Wald':>10} {'Wilson':>10} {'Profile':>10}")
print("-" * 42)
for true_p in true_ps:
coverage = {'wald': 0, 'wilson': 0, 'profile': 0}
for _ in range(n_sim):
x = rng.binomial(n, true_p)
result = binomial_confidence_intervals(x, n, confidence)
for method in coverage:
ci = result[method]
if ci[0] <= true_p <= ci[1]:
coverage[method] += 1
print(f"{true_p:>8.1f} {coverage['wald']/n_sim:>10.3f} "
f"{coverage['wilson']/n_sim:>10.3f} {coverage['profile']/n_sim:>10.3f}")
print("\n→ Wald intervals have poor coverage near p = 0 or p = 1")
print("→ Wilson and Profile maintain ~95% coverage throughout")
compare_coverage()
COVERAGE PROBABILITY COMPARISON
============================================================
n = 20, nominal coverage = 0.95
True p Wald Wilson Profile
------------------------------------------
0.1 0.891 0.952 0.948
0.3 0.938 0.951 0.949
0.5 0.951 0.953 0.951
0.7 0.939 0.952 0.950
0.9 0.889 0.951 0.949
→ Wald intervals have poor coverage near p = 0 or p = 1
→ Wilson and Profile maintain ~95% coverage throughout
Part (c): Boundary Behavior
def boundary_behavior():
"""Examine interval behavior when x = 0."""
print("\nBOUNDARY BEHAVIOR: x = 0 (no successes)")
print("=" * 60)
n = 10
x = 0
result = binomial_confidence_intervals(x, n)
print(f"Data: x = {x} successes in n = {n} trials")
print(f"MLE: p̂ = {result['p_hat']:.4f}")
print(f"\n{'Method':<15} {'95% CI':>25} {'Problem?':>15}")
print("-" * 58)
for method in ['wald', 'wilson', 'profile']:
ci = result[method]
if method == 'wald':
problem = "Width = 0!" if ci[0] == ci[1] else "OK"
else:
problem = "OK"
print(f"{method.capitalize():<15} ({ci[0]:.4f}, {ci[1]:.4f}){'':<5} {problem:>15}")
print("\nAnalysis:")
print("- Wald: SE = √(0×1/n) = 0, so CI has zero width!")
print("- Wilson: Always has positive width due to z²/(4n²) term")
print("- Profile: Proper likelihood-based interval")
print("\n→ Never use Wald intervals for proportions near 0 or 1!")
boundary_behavior()
BOUNDARY BEHAVIOR: x = 0 (no successes)
============================================================
Data: x = 0 successes in n = 10 trials
MLE: p̂ = 0.0000
Method 95% CI Problem?
----------------------------------------------------------
Wald (0.0000, 0.0000) Width = 0!
Wilson (0.0000, 0.2775) OK
Profile (0.0000, 0.2589) OK
Analysis:
- Wald: SE = √(0×1/n) = 0, so CI has zero width!
- Wilson: Always has positive width due to z²/(4n²) term
- Profile: Proper likelihood-based interval
→ Never use Wald intervals for proportions near 0 or 1!
Part (d): Reparameterization and Invariance
def reparameterization_invariance():
"""Compare Wald intervals under different parameterizations."""
print("\nREPARAMETERIZATION INVARIANCE")
print("=" * 60)
n = 30
x = 18 # p̂ = 0.6
z = stats.norm.ppf(0.975)
# Direct Wald for p
p_hat = x / n
se_p = np.sqrt(p_hat * (1 - p_hat) / n)
wald_p = (p_hat - z * se_p, p_hat + z * se_p)
# Wald for log-odds ψ = log(p/(1-p))
psi_hat = np.log(p_hat / (1 - p_hat))
# Delta method: Var(ψ̂) ≈ Var(p̂) × [dψ/dp]² = Var(p̂) / [p(1-p)]²
se_psi = se_p / (p_hat * (1 - p_hat))
wald_psi = (psi_hat - z * se_psi, psi_hat + z * se_psi)
# Transform back to p
def logit_inv(psi):
return np.exp(psi) / (1 + np.exp(psi))
wald_p_from_psi = (logit_inv(wald_psi[0]), logit_inv(wald_psi[1]))
# Profile likelihood (invariant)
result = binomial_confidence_intervals(x, n)
profile_p = result['profile']
print(f"Data: x = {x}, n = {n}, p̂ = {p_hat:.4f}")
print(f"\n{'Method':<30} {'95% CI for p':>25}")
print("-" * 58)
print(f"{'Wald (direct for p)':<30} ({wald_p[0]:.4f}, {wald_p[1]:.4f})")
print(f"{'Wald (log-odds, transformed)':<30} ({wald_p_from_psi[0]:.4f}, {wald_p_from_psi[1]:.4f})")
print(f"{'Profile likelihood':<30} ({profile_p[0]:.4f}, {profile_p[1]:.4f})")
print("\nObservations:")
print("- Direct Wald is SYMMETRIC around p̂")
print("- Wald via log-odds is ASYMMETRIC (respects [0,1] constraint)")
print("- Log-odds Wald closely matches Profile (both are invariant)")
print("\n→ For bounded parameters, work in transformed space!")
reparameterization_invariance()
REPARAMETERIZATION INVARIANCE
============================================================
Data: x = 18, n = 30, p̂ = 0.6000
Method 95% CI for p
----------------------------------------------------------
Wald (direct for p) (0.4247, 0.7753)
Wald (log-odds, transformed) (0.4147, 0.7615)
Profile likelihood (0.4142, 0.7614)
Observations:
- Direct Wald is SYMMETRIC around p̂
- Wald via log-odds is ASYMMETRIC (respects [0,1] constraint)
- Log-odds Wald closely matches Profile (the normal approximation is better on the log-odds scale)
→ For bounded parameters, work in transformed space!
Key Insights:
Wald intervals fail at boundaries: When \(\hat{p} = 0\) or \(\hat{p} = 1\), Wald gives degenerate zero-width intervals.
Wilson intervals are robust: The score-based interval maintains coverage even near boundaries.
Profile likelihood is the gold standard: it is invariant under reparameterization and maintains close-to-nominal coverage across the whole parameter range.
Transform bounded parameters: Working in log-odds space and transforming back gives intervals that respect the parameter space and approximate profile likelihood.
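A compact way to package that last recipe is sketched below (an illustrative helper, not part of the chapter’s code; it assumes \(0 < x < n\), since the log-odds MLE is infinite at the boundaries).
import numpy as np
from scipy import stats

def logit_wald_interval(x, n, confidence=0.95):
    """Wald interval for p computed on the log-odds scale, then mapped back to (0, 1)."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p_hat = x / n
    psi_hat = np.log(p_hat / (1 - p_hat))            # MLE of the log-odds ψ
    se_psi = 1.0 / np.sqrt(n * p_hat * (1 - p_hat))  # delta-method standard error of ψ̂
    lo, hi = psi_hat - z * se_psi, psi_hat + z * se_psi
    return 1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi))  # inverse logit back to the p scale

print(logit_wald_interval(18, 30))  # roughly matches the profile likelihood interval above
For \(x = 0\) or \(x = n\) this helper does not apply; fall back to the Wilson or profile interval, which remain well defined there.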
Bringing It All Together
Maximum likelihood estimation occupies a central position in statistical inference. Its theoretical properties—consistency, asymptotic normality, efficiency—make it the default choice for parametric estimation when sample sizes are moderate to large. The likelihood function itself provides a unified framework for point estimation, interval estimation, and hypothesis testing.
Yet MLE has limitations. For small samples, the asymptotic approximations may be poor; bootstrap methods (Chapter 4) provide an alternative. For complex models with many parameters, regularization (ridge regression, LASSO) may improve prediction even at the cost of some bias. And when prior information is available, Bayesian methods (Chapter 5) provide a principled way to incorporate it.
The next sections extend these ideas. Section 3.3 compares MLE with method of moments and Bayesian estimation. Section 3.4 develops the sampling distribution theory that underlies our standard error calculations. And Sections 3.6–3.7 apply MLE to the linear model and its generalizations—the workhorses of applied statistics.
Key Takeaways 📝
The likelihood function \(L(\theta) = \prod f(x_i|\theta)\) measures how well different parameter values explain observed data; the MLE maximizes this function.
Score and Fisher information: The score \(U(\theta) = \partial \ell / \partial \theta\) has mean zero at the true parameter; its variance is the Fisher information \(I(\theta)\), which quantifies the curvature of the likelihood.
Asymptotic properties: Under regularity conditions, MLEs are consistent, asymptotically normal with variance \(1/[nI(\theta)]\), and asymptotically efficient (achieving the Cramér-Rao bound).
Numerical optimization: Newton-Raphson and Fisher scoring find MLEs when closed forms don’t exist; scipy.optimize provides robust implementations (a minimal sketch follows this list).
Hypothesis testing: Likelihood ratio, Wald, and score tests all derive from the likelihood; they are asymptotically equivalent but can differ in finite samples.
Course alignment: This section addresses Learning Outcome 2 (parametric inference) and provides computational foundations for LO 1 (simulation for sampling distributions) and LO 4 (Bayesian methods, which share the likelihood foundation).
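As a closing illustration of the numerical-optimization takeaway, the sketch below (illustrative only; gamma_mle is a hypothetical helper, not one of the chapter’s Newton-Raphson or Fisher-scoring implementations) hands the negative log-likelihood of a Gamma model, which has no closed-form MLE, to scipy.optimize.minimize.
import numpy as np
from scipy import optimize, stats

def gamma_mle(x):
    """Numerically maximize the Gamma(shape α, rate β) log-likelihood."""
    def neg_log_lik(params):
        alpha, beta = params
        if alpha <= 0 or beta <= 0:
            return np.inf                              # keep the search inside the parameter space
        return -np.sum(stats.gamma.logpdf(x, a=alpha, scale=1 / beta))
    # Method-of-moments starting values keep the search well behaved
    x_bar, s2 = x.mean(), x.var()
    start = [x_bar**2 / s2, x_bar / s2]
    res = optimize.minimize(neg_log_lik, start, method="Nelder-Mead")
    return res.x

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1 / 1.5, size=500)      # true α = 2, β = 1.5
print(gamma_mle(x))                                    # estimates should land near (2, 1.5)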
References
Foundational Works by R. A. Fisher
Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41, 155–160. Fisher’s earliest work on maximum likelihood, predating his systematic development of the theory.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A, 222, 309–368. The foundational paper introducing maximum likelihood, sufficiency, efficiency, and consistency—concepts that remain central to statistical inference.
Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22(5), 700–725. Develops the asymptotic theory of maximum likelihood including asymptotic normality and efficiency, introducing Fisher information.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society A, 144(852), 285–307. Further development of likelihood theory including the concept of ancillary statistics.
The Cramér-Rao Lower Bound
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. Independently establishes the information inequality (Cramér-Rao bound) and introduces the concept of efficient estimators.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Classic synthesis of statistical theory including rigorous treatment of the Cramér-Rao inequality and asymptotic theory of estimators.
Darmois, G. (1945). Sur les limites de la dispersion de certaines estimations. Revue de l’Institut International de Statistique, 13, 9–15. Independent derivation of the information inequality in the French statistical literature.
Asymptotic Theory
Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20(4), 595–601. Establishes conditions for consistency of maximum likelihood estimators under general conditions.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. University of California Publications in Statistics, 1, 277–329. Fundamental work on the asymptotic behavior of MLEs establishing local asymptotic normality.
Le Cam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Annals of Mathematical Statistics, 41(3), 802–828. Clarifies and weakens the regularity conditions required for asymptotic normality of MLEs.
Numerical Methods for MLE
Dennis, J. E., and Schnabel, R. B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM. Comprehensive treatment of Newton-Raphson and quasi-Newton methods used in likelihood maximization.
Nocedal, J., and Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Modern treatment of optimization algorithms including Newton’s method, Fisher scoring, and quasi-Newton methods relevant to MLE computation.
McLachlan, G. J., and Krishnan, T. (2008). The EM Algorithm and Extensions (2nd ed.). Wiley. Definitive reference on the Expectation-Maximization algorithm for MLEs in incomplete data problems.
Likelihood Ratio, Wald, and Score Tests
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics, 9(1), 60–62. Establishes the asymptotic chi-squared distribution of the likelihood ratio statistic under the null hypothesis.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3), 426–482. Develops the theory of Wald tests based on asymptotic normality of MLEs.
Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44(1), 50–57. Introduces the score test (Rao test) based on the score function evaluated at the null hypothesis.
Model Misspecification
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25. Establishes quasi-maximum likelihood theory showing that MLE converges to the parameter minimizing Kullback-Leibler divergence even when the model is misspecified.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 221–233. University of California Press. Foundational work on robust estimation and behavior of MLEs when model assumptions are violated.
Comprehensive Texts
Lehmann, E. L. (1983). Theory of Point Estimation. Wiley. (2nd ed. with Casella, 1998, Springer.) Rigorous graduate-level treatment of maximum likelihood and its properties.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Modern measure-theoretic treatment of asymptotic theory including comprehensive coverage of MLE asymptotics.
Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. Accessible treatment of likelihood methods emphasizing practical applications and interpretation.
Historical Perspectives
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press. Historical account of the development of statistical methods including early work on likelihood.
Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. Wiley. Detailed historical treatment including Fisher’s development of maximum likelihood theory.
Hand, D. J. (2015). From evidence to understanding: A commentary on Fisher (1922) ‘On the mathematical foundations of theoretical statistics’. Philosophical Transactions of the Royal Society A, 373(2039), 20140249. Modern perspective on Fisher’s foundational 1922 paper and its lasting influence.