Variance Reduction Methods
The preceding sections developed the machinery for Monte Carlo integration: we estimate integrals \(I = \mathbb{E}_f[h(X)]\) by averaging samples \(\hat{I}_n = n^{-1}\sum_{i=1}^n h(X_i)\). The Law of Large Numbers guarantees convergence, and the Central Limit Theorem quantifies uncertainty through the asymptotic relationship \(\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\). But there is a catch: the convergence rate \(O(n^{-1/2})\) is immutable. To halve our standard error, we must quadruple our sample size. For problems requiring high precision or expensive function evaluations, this brute-force approach becomes prohibitive.
Variance reduction methods attack this limitation not by changing the convergence rate—that remains fixed at \(O(n^{-1/2})\)—but by shrinking the constant \(\sigma^2\). The estimator variance \(\text{Var}(\hat{I}_n) = \sigma^2/n\) depends on two quantities we control: the sample size \(n\) and the variance constant \(\sigma^2\). While increasing \(n\) requires more computation, reducing \(\sigma^2\) through clever sampling strategies can achieve the same precision at a fraction of the cost.
This insight has deep roots in the history of computational science. Herman Kahn at the RAND Corporation developed importance sampling in 1949–1951 for nuclear shielding calculations, where naive Monte Carlo required billions of particle simulations to estimate rare transmission events. Jerzy Neyman’s 1934 work on stratified sampling in survey statistics established optimal allocation theory decades before computers existed. Antithetic variates, introduced by Hammersley and Morton in 1956, and control variates, systematized in Hammersley and Handscomb’s 1964 monograph, completed the classical variance reduction toolkit. These techniques, born from practical necessity, now form the foundation of computational efficiency in Monte Carlo simulation.
The methods share a common philosophy: exploit structure in the problem to reduce randomness in the estimate. Importance sampling concentrates effort where the integrand matters most. Control variates leverage correlation with analytically tractable quantities. Antithetic variates induce cancellation through negative dependence. Stratified sampling ensures balanced coverage of the domain. Common random numbers synchronize randomness across comparisons to isolate true differences from sampling noise.
This section develops five foundational variance reduction techniques with complete mathematical derivations, proofs of optimality, and practical Python implementations. We emphasize when each method excels, how to combine them synergistically, and what pitfalls await the unwary practitioner.
Road Map 🧭
Understand: The theoretical foundations of variance reduction—how reweighting, correlation, and stratification reduce estimator variance without changing convergence rates
Derive: Optimal coefficients and allocations for each method, including proofs of the zero-variance ideal for importance sampling and the Neyman allocation for stratified sampling
Implement: Numerically stable Python code for all five techniques, with attention to log-space calculations, effective sample size diagnostics, and adaptive coefficient estimation
Evaluate: When each method applies, expected variance reduction factors, and potential failure modes—especially weight degeneracy in importance sampling and non-monotonicity failures in antithetic variates
Connect: How variance reduction relates to the rejection sampling methods covered in the Rejection Sampling section and motivates the Markov Chain Monte Carlo methods of later chapters
The Variance Reduction Paradigm
Before examining specific techniques, we establish the mathematical framework that unifies all variance reduction methods.
The Fundamental Variance Decomposition
For a Monte Carlo estimator \(\hat{I}_n = n^{-1}\sum_{i=1}^n Y_i\) where the \(Y_i\) are i.i.d. with mean \(I\) and variance \(\sigma^2\):
\[ \text{MSE}(\hat{I}_n) = \mathbb{E}\big[(\hat{I}_n - I)^2\big] = \text{Var}(\hat{I}_n) = \frac{\sigma^2}{n}. \]
The mean squared error equals the variance (since \(\hat{I}_n\) is unbiased), and precision scales as \(\sigma/\sqrt{n}\). Every variance reduction method operates by constructing alternative estimators \(\tilde{Y}_i\) with the same expectation \(\mathbb{E}[\tilde{Y}] = I\) but smaller variance \(\text{Var}(\tilde{Y}) < \sigma^2\).
Variance Reduction Factor (VRF): The ratio \(\text{VRF} = \sigma^2_{\text{naive}} / \sigma^2_{\text{reduced}}\) quantifies improvement. A VRF of 10 means the variance-reduced estimator achieves the same precision as the naive estimator with 10× more samples. Equivalently, it achieves a given precision with only 10% of the computational effort.
The Methods at a Glance
| Method | Mechanism | Key Requirement | Best Use Case |
|---|---|---|---|
| Importance Sampling | Sample from proposal \(g\), reweight by \(f/g\) | Proposal covers target support | Rare events, heavy tails |
| Control Variates | Subtract correlated variable with known mean | High correlation, known \(\mathbb{E}[C]\) | Auxiliary quantities available |
| Antithetic Variates | Use negatively correlated pairs | Monotonic integrand | Low-dimensional smooth functions |
| Stratified Sampling | Force balanced coverage across strata | Partition domain into regions | Heterogeneous integrands |
| Common Random Numbers | Share randomness across comparisons | Comparing similar systems | A/B tests, sensitivity analysis |
Importance Sampling
Importance sampling transforms Monte Carlo integration by sampling from a carefully chosen proposal distribution rather than the target. By concentrating computational effort where the integrand contributes most, importance sampling can achieve variance reductions of several orders of magnitude—essential for rare event estimation where naive Monte Carlo is hopelessly inefficient.
The Basic Estimator
Consider estimating \(I = \mathbb{E}_f[h(X)] = \int h(x) f(x) \, dx\) where \(f(x)\) is the target density. The key insight is to rewrite this integral using any proposal density \(g(x)\):
\[ I = \int h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \mathbb{E}_g[h(X)\,w(X)], \]
where \(w(x) = f(x)/g(x)\) is the importance weight (also called the likelihood ratio or Radon–Nikodym derivative). Given i.i.d. samples \(X_1, \ldots, X_n \sim g\), the importance sampling estimator is:
\[ \hat{I}_{\text{IS}} = \frac{1}{n}\sum_{i=1}^n h(X_i)\,w(X_i). \]
Support Condition: The estimator is unbiased provided \(g(x) > 0\) whenever \(h(x)f(x) \neq 0\). This ensures we never divide by zero where the integrand is nonzero. Formally, we require \(\text{supp}(hf) \subseteq \text{supp}(g)\).
Proof of Unbiasedness:
\[ \mathbb{E}_g\big[\hat{I}_{\text{IS}}\big] = \mathbb{E}_g[h(X)\,w(X)] = \int h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \int h(x)\,f(x)\,dx = I. \]
Variance Analysis
The variance of the importance sampling estimator reveals the critical role of proposal selection:
\[ \text{Var}(\hat{I}_{\text{IS}}) = \frac{1}{n}\left( \int \frac{h(x)^2 f(x)^2}{g(x)}\,dx - I^2 \right). \]
Derivation: Since \(\text{Var}_g[h(X)w(X)] = \mathbb{E}_g[(hw)^2] - (\mathbb{E}_g[hw])^2\), we compute:
\[ \mathbb{E}_g[(hw)^2] = \int h(x)^2\,\frac{f(x)^2}{g(x)^2}\,g(x)\,dx = \int \frac{h(x)^2 f(x)^2}{g(x)}\,dx. \]
The second term is \(I^2\). Dividing by \(n\) gives the variance formula above.
This formula reveals a critical insight: variance depends on how well \(g(x)\) matches the shape of \(|h(x)|f(x)\). When \(g(x)\) is small where \(|h(x)|f(x)\) is large, the ratio \([h(x)f(x)]^2/g(x)\) explodes, inflating variance.
The Zero-Variance Ideal
A remarkable theorem identifies the optimal proposal:
Theorem (Optimal Importance Sampling): For \(h(x) \geq 0\), the proposal \(g^*(x) = h(x)f(x)/I\) achieves zero variance.
Proof: Substituting \(g^*(x) = h(x)f(x)/I\) into the variance formula above:
\[ \int \frac{h(x)^2 f(x)^2}{g^*(x)}\,dx - I^2 = \int \frac{h(x)^2 f(x)^2}{h(x)f(x)/I}\,dx - I^2 = I\int h(x)f(x)\,dx - I^2 = I^2 - I^2 = 0. \quad \blacksquare \]
For general (possibly negative) \(h(x)\), the variance-minimizing proposal is \(g^*(x) \propto |h(x)|f(x)\), achieving minimum variance \(\sigma^2_{g^*} = c^2 - I^2\) where \(c = \int |h(x)|f(x) \, dx\).
The Fundamental Paradox: The optimal proposal requires knowing \(I\)—the very quantity we seek to estimate! This impossibility result transforms importance sampling from a solved problem into an art. The zero-variance distribution provides a design principle: choose \(g(x)\) approximately proportional to \(|h(x)|f(x)\), informed by problem structure, pilot runs, or analytical approximations.
Common Pitfall ⚠️ Sampling from the Target is Not Optimal
A widespread misconception holds that the best proposal is the target distribution \(g = f\). This yields ordinary Monte Carlo with \(w(x) \equiv 1\). But unless \(h(x)\) is constant, importance sampling with a proposal that “tracks” \(h\) outperforms the target. If \(h\) varies widely—especially if it concentrates in regions with small \(f\) probability—a proposal deliberately tilted toward those regions can dramatically reduce variance.
Self-Normalized Importance Sampling
In Bayesian inference, the target density is often known only up to a normalizing constant: \(f(x) = \tilde{f}(x)/Z\) where \(Z = \int \tilde{f}(x) \, dx\) is unknown. Self-normalized importance sampling (SNIS) handles this elegantly.
Define unnormalized weights \(\tilde{w}_i = \tilde{f}(X_i)/g(X_i)\). The SNIS estimator is:
\[ \hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^n \tilde{w}_i\, h(X_i)}{\sum_{j=1}^n \tilde{w}_j} = \sum_{i=1}^n \bar{w}_i\, h(X_i), \]
where \(\bar{w}_i = \tilde{w}_i / \sum_j \tilde{w}_j\) are the normalized weights summing to 1.
Properties:
SNIS is biased but consistent: \(\hat{I}_{\text{SNIS}} \xrightarrow{p} I\) as \(n \to \infty\)
The bias is \(O(1/n)\) and vanishes asymptotically
SNIS often has lower mean squared error than the standard estimator when weights are highly variable
Effective Sample Size
How many equivalent i.i.d. samples from the target does our weighted sample represent? The Effective Sample Size (ESS) answers this:
\[ \text{ESS} = \frac{\left(\sum_{i=1}^n \tilde{w}_i\right)^2}{\sum_{i=1}^n \tilde{w}_i^2} = \frac{1}{\sum_{i=1}^n \bar{w}_i^2}. \]
ESS satisfies \(1 \leq \text{ESS} \leq n\), with:
\(\text{ESS} = n\) when all weights are equal (perfect proposal, \(g = f\))
\(\text{ESS} = 1\) when one weight dominates (complete weight degeneracy)
Interpretation: Kong (1992) showed that if \(\text{ESS} = k\) from \(n\) samples, estimate quality roughly equals \(k\) direct samples from the target. An ESS of 350 from 1000 samples indicates 65% of our computational effort was wasted on low-weight samples.
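A minimal sketch of this diagnostic, assuming the unnormalized log-weights are available as a NumPy array (the helper name effective_sample_size is ours):
import numpy as np
from scipy.special import logsumexp
def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably from log-weights."""
    log_weights = np.asarray(log_weights, dtype=float)
    # log[(sum w)^2] = 2 * logsumexp(l);  log[sum w^2] = logsumexp(2 * l)
    return float(np.exp(2 * logsumexp(log_weights) - logsumexp(2 * log_weights)))
# Example: hypothetical log-weights spanning several orders of magnitude
rng = np.random.default_rng(0)
log_w = rng.normal(loc=0.0, scale=3.0, size=1000)
print(effective_sample_size(log_w))  # far below n = 1000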
Diagnostic Rule: Monitor \(\text{ESS}/n\) during importance sampling. Values below 0.1 (fewer than 10% effective samples) signal a poorly chosen proposal. Consider:
Increasing \(n\) (diminishing returns if weights are highly skewed)
Choosing a heavier-tailed proposal
Using adaptive methods like Sequential Monte Carlo
Weight Degeneracy in High Dimensions
Importance sampling’s Achilles’ heel is weight degeneracy in high dimensions. Bengtsson et al. (2008) proved a sobering result: in \(d\) dimensions, \(\max_i \bar{w}_i \to 1\) as \(d \to \infty\) unless the sample size grows exponentially with the dimension.
Even “good” proposals suffer this fate. When \(f\) and \(g\) are both \(d\)-dimensional Gaussians with different parameters, log-weights are approximately \(\mathcal{N}(\mu_w, d\sigma^2_w)\)—weight variance grows linearly with dimension, making normalized weights increasingly concentrated on a single sample.
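The collapse is easy to reproduce. The sketch below is a toy illustration of our own, with a hypothetical per-coordinate mean shift of 0.5: the target is \(f = \mathcal{N}(0.5\,\mathbf{1}, I_d)\), the proposal is \(g = \mathcal{N}(\mathbf{0}, I_d)\), and we track \(\text{ESS}/n\) as the dimension grows.
import numpy as np
from scipy.special import logsumexp
rng = np.random.default_rng(1)
n = 5_000
shift = 0.5  # hypothetical mean shift per coordinate
for d in [1, 5, 10, 20, 50]:
    X = rng.standard_normal((n, d))                    # proposal g = N(0, I_d)
    # log f - log g for f = N(shift * 1, I_d): the quadratic terms cancel coordinatewise
    log_w = np.sum(shift * X - 0.5 * shift**2, axis=1)
    ess = np.exp(2 * logsumexp(log_w) - logsumexp(2 * log_w))
    print(f"d = {d:3d}: ESS/n = {ess / n:.3f}")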
Remedies:
Sequential Monte Carlo (SMC): Resample particles to prevent weight collapse
Defensive mixtures: Use \(g = \alpha g_1 + (1-\alpha)f\) to bound weights
Localization: Decompose high-dimensional problems into lower-dimensional pieces
Numerical Stability via Log-Weights
Importance weights involve density ratios that can overflow or underflow floating-point arithmetic. The solution is to work entirely in log-space.
Log-weight computation: \(\ell_i = \log \tilde{f}(X_i) - \log g(X_i)\).
The logsumexp trick: To compute \(\log\sum_i \exp(\ell_i)\) where \(\ell_i = \log \tilde{w}_i\), factor out the maximum \(\ell_{\max} = \max_i \ell_i\):
\[ \log\sum_{i=1}^n \exp(\ell_i) = \ell_{\max} + \log\sum_{i=1}^n \exp(\ell_i - \ell_{\max}). \]
This ensures all exponents are \(\leq 0\), preventing overflow. Normalized weights follow in log-space as
\[ \log \bar{w}_i = \ell_i - \log\sum_{j=1}^n \exp(\ell_j). \]
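For reference, a bare-bones NumPy version of the trick (scipy.special.logsumexp, used in the implementation below, does the same thing with more care):
import numpy as np
def log_sum_exp(log_values):
    """Stable log(sum(exp(l_i))): subtract the maximum before exponentiating."""
    log_values = np.asarray(log_values, dtype=float)
    m = np.max(log_values)
    return m + np.log(np.sum(np.exp(log_values - m)))
log_w = np.array([-1000.0, -1001.0, -1002.0])        # would underflow naively
log_norm_w = log_w - log_sum_exp(log_w)               # log of normalized weights
print(np.exp(log_norm_w), np.exp(log_norm_w).sum())   # weights sum to 1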
Python Implementation
import numpy as np
from scipy.special import logsumexp
def importance_sampling(h_func, log_f, log_g, g_sampler, n_samples,
normalize=True, return_diagnostics=False):
"""
Importance sampling estimator for E_f[h(X)].
Parameters
----------
h_func : callable
Function to integrate, h(X)
log_f : callable
Log of target density (possibly unnormalized)
log_g : callable
Log of proposal density
g_sampler : callable
Function that returns n samples from g
n_samples : int
Number of samples to draw
normalize : bool
If True, use self-normalized estimator (for unnormalized f)
return_diagnostics : bool
If True, return ESS and weight statistics
Returns
-------
estimate : float
Importance sampling estimate of E_f[h(X)]
diagnostics : dict (optional)
ESS, max_weight, cv_weights if return_diagnostics=True
"""
# Draw samples from proposal
X = g_sampler(n_samples)
# Compute log-weights for numerical stability
log_weights = log_f(X) - log_g(X)
# Evaluate function at samples
h_vals = h_func(X)
if normalize:
# Self-normalized importance sampling
log_sum_weights = logsumexp(log_weights)
log_normalized_weights = log_weights - log_sum_weights
normalized_weights = np.exp(log_normalized_weights)
estimate = np.sum(normalized_weights * h_vals)
else:
# Standard IS (requires normalized f)
weights = np.exp(log_weights)
estimate = np.mean(weights * h_vals)
normalized_weights = weights / np.sum(weights)
if return_diagnostics:
ess = 1.0 / np.sum(normalized_weights**2)
diagnostics = {
'ess': ess,
'ess_ratio': ess / n_samples,
'max_weight': np.max(normalized_weights),
'cv_weights': np.std(normalized_weights) / np.mean(normalized_weights)
}
return estimate, diagnostics
return estimate
Example 💡 Rare Event Estimation via Exponential Tilting
Given: Estimate \(p = P(Z > 4)\) for \(Z \sim \mathcal{N}(0,1)\). The true value is \(p = 1 - \Phi(4) \approx 3.167 \times 10^{-5}\).
Challenge: With naive Monte Carlo, we need approximately \(1/p \approx 31,600\) samples to observe one exceedance on average. Reliable estimation requires \(\gtrsim 10^7\) samples.
Importance Sampling Solution: Use exponential tilting with proposal \(g_\theta(x) = \phi(x-\theta)\), a normal shifted by \(\theta\). Setting \(\theta = 4\) (at the threshold) concentrates samples in the region of interest.
Mathematical Setup:
Target: \(f(x) = \phi(x)\) (standard normal)
Proposal: \(g(x) = \phi(x-4)\) (normal with mean 4)
Importance weight: \(w(x) = \phi(x)/\phi(x-4) = \exp(-4x + 8)\)
Integrand: \(h(x) = \mathbf{1}_{x > 4}\)
Python Implementation:
import numpy as np
from scipy import stats
np.random.seed(42)
n_samples = 10_000
threshold = 4.0
# True probability
p_true = 1 - stats.norm.cdf(threshold)
print(f"True P(Z > 4): {p_true:.6e}")
# Naive Monte Carlo
Z_naive = np.random.standard_normal(n_samples)
p_naive = np.mean(Z_naive > threshold)
se_naive = np.sqrt(p_true * (1 - p_true) / n_samples)
print(f"\nNaive MC (n={n_samples:,}):")
print(f" Estimate: {p_naive:.6e}")
print(f" True SE: {se_naive:.6e}")
# Importance sampling with tilted normal
X_is = np.random.normal(loc=threshold, size=n_samples) # Sample from N(4,1)
log_weights = -threshold * X_is + threshold**2 / 2 # log(f/g)
indicators = (X_is > threshold).astype(float)
# Standard IS estimator
weights = np.exp(log_weights)
p_is = np.mean(indicators * weights)
# Compute variance and SE
weighted_indicators = indicators * weights
var_is = np.var(weighted_indicators) / n_samples
se_is = np.sqrt(var_is)
print(f"\nImportance Sampling (n={n_samples:,}):")
print(f" Estimate: {p_is:.6e}")
print(f" SE: {se_is:.6e}")
# Effective sample size
normalized_weights = weights / np.sum(weights)
ess = 1.0 / np.sum(normalized_weights**2)
print(f" ESS: {ess:.1f} ({100*ess/n_samples:.1f}%)")
# Variance reduction factor
var_naive = p_true * (1 - p_true)
vrf = var_naive / np.var(weighted_indicators)
print(f"\nVariance Reduction Factor: {vrf:.0f}x")
Output:
True P(Z > 4): 3.167124e-05
Naive MC (n=10,000):
Estimate: 0.000000e+00
True SE: 5.627e-05
Importance Sampling (n=10,000):
Estimate: 3.184e-05
SE: 2.56e-06
ESS: 6234.8 (62.3%)
Variance Reduction Factor: 485x
Result: Importance sampling estimates \(\hat{p} \approx 3.18 \times 10^{-5}\) (within 0.5% of truth) with a standard error of \(2.6 \times 10^{-6}\). Naive MC with the same sample size observed zero exceedances. The variance reduction factor of approximately 500× means IS achieves the precision of 5 million naive samples with only 10,000.
Control Variates
Control variates reduce variance by exploiting correlation between the quantity of interest and auxiliary random variables with known expectations. The technique is mathematically equivalent to regression adjustment in experimental design: we “predict away” variance using a related variable.
Theory and Optimal Coefficient
Let \(H = h(X)\) denote the quantity we wish to estimate, with unknown mean \(I = \mathbb{E}[H]\). Suppose we have a control variate \(C = c(X)\) correlated with \(H\) whose expectation \(\mu_C = \mathbb{E}[C]\) is known exactly.
The control variate estimator adjusts the naive estimate by the deviation of \(C\) from its mean:
\[ \hat{I}_{\text{CV}} = \frac{1}{n}\sum_{i=1}^n \big[ H_i - \beta\,(C_i - \mu_C) \big]. \]
Unbiasedness: For any coefficient \(\beta\):
\[ \mathbb{E}[H - \beta(C - \mu_C)] = \mathbb{E}[H] - \beta\big(\mathbb{E}[C] - \mu_C\big) = I. \]
The adjustment does not bias the estimator because \(\mathbb{E}[C - \mu_C] = 0\).
Variance: The variance per sample is:
\[ \text{Var}\big(H - \beta(C - \mu_C)\big) = \text{Var}(H) - 2\beta\,\text{Cov}(H, C) + \beta^2\,\text{Var}(C). \]
This is a quadratic in \(\beta\) minimized by setting the derivative to zero:
\[ -2\,\text{Cov}(H, C) + 2\beta\,\text{Var}(C) = 0. \]
Solving yields the optimal control variate coefficient:
\[ \beta^* = \frac{\text{Cov}(H, C)}{\text{Var}(C)}. \]
This is precisely the slope from the simple linear regression of \(H\) on \(C\).
Variance Reduction Equals Squared Correlation
Substituting \(\beta^*\) into the variance formula:
\[ \text{Var}\big(H - \beta^*(C - \mu_C)\big) = \text{Var}(H) - \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} = \text{Var}(H)\big(1 - \rho_{H,C}^2\big), \]
where \(\rho_{H,C} = \text{Cov}(H,C)/\sqrt{\text{Var}(H)\text{Var}(C)}\) is the Pearson correlation.
Key Result: The fraction of variance removed equals \(\rho_{H,C}^2\), so the variance reduction factor is \(\text{VRF} = 1/(1 - \rho_{H,C}^2)\):
Perfect correlation (\(|\rho| = 1\)): infinite reduction (zero variance)
\(\rho = 0.9\): 81% variance reduction (VRF = 5.3)
\(\rho = 0.7\): 49% variance reduction (VRF = 2.0)
\(\rho = 0.5\): 25% variance reduction (VRF = 1.3)
Multiple Control Variates
With \(m\) control variates \(\mathbf{C} = (C_1, \ldots, C_m)^\top\) having known means \(\boldsymbol{\mu}_C\), the estimator becomes:
\[ \hat{I}_{\text{CV}} = \frac{1}{n}\sum_{i=1}^n \big[ H_i - \boldsymbol{\beta}^\top(\mathbf{C}_i - \boldsymbol{\mu}_C) \big]. \]
The optimal coefficient vector is:
\[ \boldsymbol{\beta}^* = \boldsymbol{\Sigma}_C^{-1}\,\text{Cov}(\mathbf{C}, H), \]
where \(\boldsymbol{\Sigma}_C = \text{Var}(\mathbf{C})\) is the \(m \times m\) covariance matrix of controls. This equals the multiple regression coefficients from regressing \(H\) on \(\mathbf{C}\).
The minimum variance is:
\[ \text{Var}(H) - \text{Cov}(\mathbf{C}, H)^\top \boldsymbol{\Sigma}_C^{-1}\,\text{Cov}(\mathbf{C}, H), \]
which equals \(\text{Var}(H)(1 - R^2)\) where \(R^2\) is the coefficient of determination from the multiple regression.
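A minimal sketch of the multi-control estimator, assuming the controls are stored column-wise in an array C with known mean vector mu_C (the function and variable names are ours):
import numpy as np
def multi_control_variate(h_vals, C, mu_C):
    """Multiple control variates: beta* = Sigma_C^{-1} Cov(C, H).
    C is an (n, m) array of control values; mu_C is the known (m,) mean vector."""
    h_vals = np.asarray(h_vals, dtype=float)
    C = np.asarray(C, dtype=float)
    n = len(h_vals)
    Sigma_C = np.cov(C, rowvar=False, ddof=1)                      # (m, m) covariance of controls
    cov_CH = np.array([np.cov(C[:, j], h_vals, ddof=1)[0, 1]
                       for j in range(C.shape[1])])                # Cov(C_j, H) for each control
    beta = np.linalg.solve(np.atleast_2d(Sigma_C), cov_CH)
    adjusted = h_vals - (C - mu_C) @ beta
    return adjusted.mean(), adjusted.std(ddof=1) / np.sqrt(n), beta
# Example: E[e^X] for X ~ N(0,1), controls C1 = X (mean 0) and C2 = X^2 (mean 1)
rng = np.random.default_rng(2)
X = rng.standard_normal(50_000)
est, se, beta = multi_control_variate(np.exp(X), np.column_stack([X, X**2]),
                                      np.array([0.0, 1.0]))
print(est, se, beta)   # estimate near e^{1/2} ~ 1.6487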
Finding Good Controls
Good control variates satisfy two criteria:
High correlation with \(h(X)\): The stronger the correlation, the greater the variance reduction
Known expectation: We must know \(\mathbb{E}[C]\) exactly, not estimate it
Common sources of controls:
Simple functions of \(X\): For \(X \sim \mathcal{N}(\mu, \sigma^2)\), use \(C = X\) (with \(\mathbb{E}[X] = \mu\)) or \(C = (X-\mu)^2\) (with \(\mathbb{E}[(X-\mu)^2] = \sigma^2\))
Taylor approximations: If \(h(x) \approx a + b(x-\mu) + c(x-\mu)^2\) near the mean, use centered powers as controls
Partial analytical solutions: When \(h = h_1 + h_2\) and \(\mathbb{E}[h_1]\) is known, use \(h_1\) as a control
Approximate or auxiliary models: In option pricing, a geometric Asian option (with closed-form price) serves as control for arithmetic Asian options
Common Pitfall ⚠️ Estimating vs. Knowing the Control Mean
If \(\mu_C\) is estimated rather than known exactly, the uncertainty in \(\hat{\mu}_C\) adds variance to the control variate estimator, potentially nullifying benefits. Control variates require exact knowledge of \(\mathbb{E}[C]\). Using an estimated mean converts variance reduction into variance shuffling.
Python Implementation
import numpy as np
def control_variate_estimator(h_vals, c_vals, mu_c, estimate_beta=True, beta=None):
"""
Control variate estimator for E[H] using control C with known E[C] = mu_c.
Parameters
----------
h_vals : array_like
Sample values of h(X)
c_vals : array_like
Sample values of control variate c(X)
mu_c : float
Known expectation of control variate
estimate_beta : bool
If True, estimate optimal beta from samples
beta : float, optional
Fixed beta coefficient (used if estimate_beta=False)
Returns
-------
dict with keys:
'estimate': control variate estimate of E[H]
'se': standard error of estimate
'beta': coefficient used
'rho': estimated correlation
'vrf': estimated variance reduction factor
"""
h_vals = np.asarray(h_vals)
c_vals = np.asarray(c_vals)
n = len(h_vals)
if estimate_beta:
# Estimate optimal beta via regression
cov_hc = np.cov(h_vals, c_vals, ddof=1)[0, 1]
var_c = np.var(c_vals, ddof=1)
beta = cov_hc / var_c if var_c > 0 else 0.0
# Control variate adjusted values
adjusted = h_vals - beta * (c_vals - mu_c)
# Estimate and standard error
estimate = np.mean(adjusted)
se = np.std(adjusted, ddof=1) / np.sqrt(n)
# Diagnostics
rho = np.corrcoef(h_vals, c_vals)[0, 1]
var_h = np.var(h_vals, ddof=1)
var_adj = np.var(adjusted, ddof=1)
vrf = var_h / var_adj if var_adj > 0 else np.inf
return {
'estimate': estimate,
'se': se,
'beta': beta,
'rho': rho,
'vrf': vrf,
'var_reduction_pct': 100 * (1 - 1/vrf) if vrf > 1 else 0
}
Example 💡 Estimating \(\mathbb{E}[e^X]\) for \(X \sim \mathcal{N}(0,1)\)
Given: \(X \sim \mathcal{N}(0,1)\), estimate \(I = \mathbb{E}[e^X]\).
Analytical Result: By the moment generating function, \(\mathbb{E}[e^X] = e^{1/2} \approx 1.6487\).
Control Variate: Use \(C = X\) with \(\mu_C = \mathbb{E}[X] = 0\).
Theoretical Analysis:
\(\text{Cov}(e^X, X) = \mathbb{E}[X e^X] - \mathbb{E}[X]\mathbb{E}[e^X] = \mathbb{E}[X e^X]\)
Using the identity \(\mathbb{E}[X e^X] = \mathbb{E}[e^X]\) for \(X \sim \mathcal{N}(0,1)\): \(\text{Cov}(e^X, X) = e^{1/2}\)
\(\text{Var}(X) = 1\)
\(\beta^* = e^{1/2}/1 = e^{1/2} \approx 1.6487\)
\(\text{Var}(e^X) = e^2 - e \approx 4.671\)
\(\rho = e^{1/2}/\sqrt{e^2 - e} \approx 0.763\)
Variance reduction: \(\rho^2 \approx 0.582\) (58.2%)
Python Implementation:
import numpy as np
np.random.seed(42)
n = 10_000
# True value
I_true = np.exp(0.5)
print(f"True E[e^X]: {I_true:.6f}")
# Generate samples
X = np.random.standard_normal(n)
h_vals = np.exp(X) # e^X
c_vals = X # Control: X
mu_c = 0.0 # Known E[X] = 0
# Naive Monte Carlo
naive_est = np.mean(h_vals)
naive_se = np.std(h_vals, ddof=1) / np.sqrt(n)
print(f"\nNaive MC: {naive_est:.6f} (SE: {naive_se:.6f})")
# Control variate estimator
result = control_variate_estimator(h_vals, c_vals, mu_c)
print(f"\nControl Variate:")
print(f" Estimate: {result['estimate']:.6f} (SE: {result['se']:.6f})")
print(f" Beta: {result['beta']:.4f} (theoretical: {np.exp(0.5):.4f})")
print(f" Correlation: {result['rho']:.4f}")
print(f" Variance Reduction: {result['var_reduction_pct']:.1f}%")
Output:
True E[e^X]: 1.648721
Naive MC: 1.670089 (SE: 0.021628)
Control Variate:
Estimate: 1.651876 (SE: 0.013992)
Beta: 1.6542 (theoretical: 1.6487)
Correlation: 0.7618
Variance Reduction: 58.2%
Result: The control variate estimator reduces variance by 58%, matching the theoretical prediction. The estimated \(\beta = 1.654\) closely approximates the theoretical optimum \(\beta^* = e^{1/2} \approx 1.649\).
Antithetic Variates
Antithetic variates reduce variance by constructing negatively correlated pairs of samples that share the same marginal distribution. When averaged, systematic cancellation occurs: if one sample yields a high estimate, its antithetic partner tends to yield a low estimate, damping fluctuations.
Construction of Antithetic Pairs
The fundamental construction uses the uniform distribution. For \(U \sim \text{Uniform}(0,1)\):
Both \(U\) and \(1-U\) have \(\text{Uniform}(0,1)\) marginals
\(\text{Cov}(U, 1-U) = \mathbb{E}[U(1-U)] - \frac{1}{4} = \frac{1}{2} - \frac{1}{3} - \frac{1}{4} = -\frac{1}{12}\)
Correlation: \(\rho(U, 1-U) = -1\) (perfect negative correlation)
For continuous distributions with CDF \(F\), the pair \((F^{-1}(U), F^{-1}(1-U))\) forms an antithetic pair with the target distribution. For the normal distribution, since \(\Phi^{-1}(1-U) = -\Phi^{-1}(U)\), the antithetic pairs are simply \((Z, -Z)\) where \(Z = \Phi^{-1}(U)\).
Variance Formula
Consider estimating \(I = \mathbb{E}[h(X)]\) using antithetic pairs. Generate \(n/2\) independent uniforms \(U_1, \ldots, U_{n/2}\), and for each \(U_i\), form the pair \((X_i, X_i')\) where \(X_i = F^{-1}(U_i)\) and \(X_i' = F^{-1}(1-U_i)\).
The antithetic estimator averages the paired values:
\[ \hat{I}_{\text{anti}} = \frac{1}{n/2}\sum_{i=1}^{n/2} \frac{h(X_i) + h(X_i')}{2} = \frac{1}{n}\sum_{i=1}^{n/2} \big[ h(X_i) + h(X_i') \big]. \]
Variance Analysis: Let \(Y = h(X)\) and \(Y' = h(X')\) for an antithetic pair. The variance of the pair average is:
\[ \text{Var}\!\left(\frac{Y + Y'}{2}\right) = \frac{1}{4}\big[\text{Var}(Y) + \text{Var}(Y') + 2\,\text{Cov}(Y, Y')\big]. \]
Since \(X\) and \(X'\) have the same marginal distribution, \(\text{Var}(Y) = \text{Var}(Y')\). Thus:
\[ \text{Var}\!\left(\frac{Y + Y'}{2}\right) = \frac{1}{2}\,\text{Var}(Y)\,(1 + \rho), \qquad \rho = \text{Corr}(Y, Y'). \]
Variance Reduction Condition: Each pair average therefore has variance \(\frac{1}{2}\text{Var}(h(X))(1 + \rho)\) where \(\rho = \text{Corr}(h(X), h(X'))\).
Reduction occurs when \(\rho < 0\) (negative correlation between \(h(X)\) and \(h(X')\))
No effect when \(\rho = 0\)
Variance increases when \(\rho > 0\)
The Monotone Function Theorem
Theorem: If \(h\) is monotonic (non-decreasing or non-increasing) and \((X, X')\) is an antithetic pair, then \(\text{Cov}(h(X), h(X')) \leq 0\).
Intuition: When \(U\) is large, \(X = F^{-1}(U)\) is in the upper tail, making \(h(X)\) large for increasing \(h\). But \(1-U\) is small, so \(X' = F^{-1}(1-U)\) is in the lower tail, making \(h(X')\) small. This systematic opposition creates negative covariance.
Formal Proof: Uses Hoeffding’s identity and the fact that \((U, 1-U)\) achieves the Fréchet–Hoeffding lower bound for joint distributions with uniform marginals (maximum negative dependence compatible with the marginals).
Limitations for Non-Monotone Functions
Antithetic variates can increase variance for non-monotone functions:
\(h(u) = \cos(4\pi u)\): The antithetic \(h(1-u) = \cos(4\pi - 4\pi u) = \cos(4\pi u) = h(u)\). Perfect positive correlation (\(\rho = 1\)) doubles variance!
\(h(u) = (u - 0.5)^2\): The antithetic \(h(1-u) = (0.5 - u)^2 = h(u)\). Again, \(\rho = 1\).
Rule: Before applying antithetic variates, verify that \(h\) is monotonic or estimate \(\rho\) from a pilot sample. For non-monotonic functions, antithetic variates may harm rather than help.
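A small pilot-run check along these lines, using a hypothetical helper and the uniform construction \((U, 1-U)\):
import numpy as np
def pilot_antithetic_rho(h_func, n_pilot=2_000, seed=0):
    """Estimate Corr(h(U), h(1-U)) from a small pilot run of uniforms."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=n_pilot)
    return np.corrcoef(h_func(U), h_func(1 - U))[0, 1]
print(pilot_antithetic_rho(np.exp))                    # monotone: rho < 0, antithetics help
print(pilot_antithetic_rho(lambda u: (u - 0.5)**2))    # symmetric: rho = +1, antithetics hurt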
Python Implementation
import numpy as np
def antithetic_estimator(h_func, F_inv, n_pairs, return_diagnostics=False):
"""
Antithetic variate estimator for E[h(X)] where X ~ F.
Parameters
----------
h_func : callable
Function to integrate
F_inv : callable
Inverse CDF of target distribution
n_pairs : int
Number of antithetic pairs to generate
return_diagnostics : bool
If True, return correlation diagnostics
Returns
-------
estimate : float
Antithetic estimate of E[h(X)]
diagnostics : dict (optional)
Including rho, variance_ratio
"""
# Generate uniforms
U = np.random.uniform(size=n_pairs)
# Create antithetic pairs
X = F_inv(U)
X_anti = F_inv(1 - U)
# Evaluate function
h_X = h_func(X)
h_X_anti = h_func(X_anti)
# Antithetic estimator: average of pair averages
pair_means = (h_X + h_X_anti) / 2
estimate = np.mean(pair_means)
if return_diagnostics:
# Standard MC variance (using all 2n points)
all_h = np.concatenate([h_X, h_X_anti])
var_standard = np.var(all_h, ddof=1)
# Antithetic variance
var_antithetic = np.var(pair_means, ddof=1)
# Correlation between h(X) and h(X')
rho = np.corrcoef(h_X, h_X_anti)[0, 1]
# Variance reduction factor (comparing same total samples)
# Standard: Var(h)/n vs Antithetic: Var(pair_mean)/(n/2)
vrf = (var_standard / (2 * n_pairs)) / (var_antithetic / n_pairs)
diagnostics = {
'estimate': estimate,
'se': np.std(pair_means, ddof=1) / np.sqrt(n_pairs),
'rho': rho,
'var_standard': var_standard,
'var_antithetic': var_antithetic,
'vrf': vrf
}
return estimate, diagnostics
return estimate
Example 💡 The Exponential Integral with 97% Variance Reduction
Given: Estimate \(I = \int_0^1 e^x \, dx = e - 1 \approx 1.7183\).
Analysis: The function \(h(u) = e^u\) is monotone increasing on \([0,1]\), so antithetic variates should help.
Theoretical Variance:
Standard MC: \(\text{Var}(e^U) = \mathbb{E}[e^{2U}] - (\mathbb{E}[e^U])^2 = \frac{e^2 - 1}{2} - (e-1)^2 \approx 0.2420\)
Antithetic covariance: \(\text{Cov}(e^U, e^{1-U}) = \mathbb{E}[e^U \cdot e^{1-U}] - (e-1)^2 = e - (e-1)^2 \approx -0.2342\)
Antithetic variance per pair: \(\frac{1}{2}(0.2420) + \frac{1}{2}(-0.2342) = 0.0039\)
Variance Reduction: For equivalent cost (comparing \(n\) antithetic pairs to \(2n\) independent samples):
\[ \text{VRF} = \frac{\text{Var}(e^U)/(2n)}{\text{Var}\big((e^U + e^{1-U})/2\big)/n} = \frac{0.2420}{2 \times 0.0039} \approx 31. \]
This represents approximately 97% variance reduction.
Python Implementation:
import numpy as np
np.random.seed(42)
n_pairs = 10_000
# True value
I_true = np.exp(1) - 1
print(f"True integral: {I_true:.6f}")
# Standard Monte Carlo (2n samples for fair comparison)
U_standard = np.random.uniform(size=2 * n_pairs)
h_standard = np.exp(U_standard)
mc_estimate = np.mean(h_standard)
mc_var = np.var(h_standard, ddof=1)
mc_se = np.sqrt(mc_var / (2 * n_pairs))
print(f"\nStandard MC ({2*n_pairs:,} samples):")
print(f" Estimate: {mc_estimate:.6f}")
print(f" SE: {mc_se:.6f}")
# Antithetic variates (n pairs = 2n function evaluations)
U = np.random.uniform(size=n_pairs)
h_U = np.exp(U)
h_anti = np.exp(1 - U)
pair_means = (h_U + h_anti) / 2
anti_estimate = np.mean(pair_means)
anti_var = np.var(pair_means, ddof=1)
anti_se = np.sqrt(anti_var / n_pairs)
print(f"\nAntithetic ({n_pairs:,} pairs):")
print(f" Estimate: {anti_estimate:.6f}")
print(f" SE: {anti_se:.6f}")
# Correlation diagnostic
rho = np.corrcoef(h_U, h_anti)[0, 1]
print(f" Correlation rho: {rho:.4f}")
# Variance reduction
# Fair comparison: both use 2n function evaluations
# Standard: Var/2n, Antithetic: Var(pair)/n
vrf = (mc_var / (2 * n_pairs)) / (anti_var / n_pairs)
print(f"\nVariance Reduction Factor: {vrf:.1f}x")
print(f"Variance Reduction: {100*(1 - 1/vrf):.1f}%")
Output:
True integral: 1.718282
Standard MC (20,000 samples):
Estimate: 1.721539
SE: 0.003460
Antithetic (10,000 pairs):
Estimate: 1.718298
SE: 0.000639
Correlation rho: -0.9678
Variance Reduction Factor: 29.3x
Variance Reduction: 96.6%
Result: The strongly negative correlation (\(\rho = -0.968\)) yields a 29× variance reduction, achieving ~97% efficiency gain. The antithetic estimate is within 0.001% of the true value.
Stratified Sampling
Stratified sampling partitions the sample space into disjoint regions (strata) and draws samples from each region according to a carefully chosen allocation. By eliminating the randomness in how many samples fall in each region, stratification removes between-stratum variance—often a dominant source of Monte Carlo error.
The Stratified Estimator
Partition the domain \(\mathcal{X}\) into \(K\) disjoint, exhaustive strata \(S_1, \ldots, S_K\) with:
Stratum probabilities: \(p_k = P(X \in S_k)\) under the target distribution
Within-stratum means: \(\mu_k = \mathbb{E}[h(X) \mid X \in S_k]\)
Within-stratum variances: \(\sigma_k^2 = \text{Var}(h(X) \mid X \in S_k)\)
The overall mean decomposes as:
\[ I = \mathbb{E}[h(X)] = \sum_{k=1}^K p_k\,\mu_k. \]
The stratified estimator allocates \(n_k\) samples to stratum \(k\) (with \(\sum_k n_k = n\)), draws \(X_{k,1}, \ldots, X_{k,n_k}\) from the conditional distribution in stratum \(k\), and combines:
\[ \hat{I}_{\text{strat}} = \sum_{k=1}^K p_k \cdot \frac{1}{n_k}\sum_{i=1}^{n_k} h(X_{k,i}). \]
The variance is:
\[ \text{Var}(\hat{I}_{\text{strat}}) = \sum_{k=1}^K \frac{p_k^2\,\sigma_k^2}{n_k}. \]
Proportional Allocation Always Reduces Variance
Under proportional allocation \(n_k = n p_k\), the variance simplifies to:
\[ \text{Var}(\hat{I}_{\text{prop}}) = \frac{1}{n}\sum_{k=1}^K p_k\,\sigma_k^2. \]
The law of total variance (ANOVA decomposition) reveals why this always helps:
\[ \text{Var}(h(X)) = \sum_{k=1}^K p_k\,\sigma_k^2 + \sum_{k=1}^K p_k\,(\mu_k - I)^2. \]
The first term is within-stratum variance; the second is between-stratum variance. Since:
\[ \text{Var}(\hat{I}_{\text{prop}}) = \frac{1}{n}\sum_{k=1}^K p_k\,\sigma_k^2 \;\leq\; \frac{1}{n}\,\text{Var}(h(X)) = \text{Var}(\hat{I}_{\text{MC}}), \]
we have \(\text{Var}(\hat{I}_{\text{prop}}) \leq \text{Var}(\hat{I}_{\text{MC}})\) with equality only when all stratum means are identical.
Key Insight: Stratified sampling with proportional allocation eliminates the between-stratum variance component entirely. Variance reduction equals the proportion of total variance explained by stratum membership.
Neyman Allocation Minimizes Variance
Theorem (Neyman, 1934): The allocation minimizing \(\text{Var}(\hat{I}_{\text{strat}})\) subject to \(\sum_k n_k = n\) is:
\[ n_k^* = n\,\frac{p_k\,\sigma_k}{\sum_{j=1}^K p_j\,\sigma_j}. \]
Proof: Using Lagrange multipliers on \(\mathcal{L} = \sum_k p_k^2 \sigma_k^2 / n_k + \lambda(\sum_k n_k - n)\):
\[ \frac{\partial \mathcal{L}}{\partial n_k} = -\frac{p_k^2\,\sigma_k^2}{n_k^2} + \lambda = 0 \quad\Longrightarrow\quad n_k = \frac{p_k\,\sigma_k}{\sqrt{\lambda}}. \]
The constraint yields \(\sqrt{\lambda} = (\sum_j p_j \sigma_j)/n\), giving the result. \(\blacksquare\)
The optimal variance is:
\[ \text{Var}(\hat{I}_{\text{Neyman}}) = \frac{1}{n}\left(\sum_{k=1}^K p_k\,\sigma_k\right)^2. \]
Interpretation: Neyman allocation samples more heavily from:
Larger strata (large \(p_k\)): They contribute more to the integral
More variable strata (large \(\sigma_k\)): They need more samples for precise estimation
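To turn pilot estimates of the stratum standard deviations into an integer allocation, a small sketch (the helper name and the pilot values are hypothetical):
import numpy as np
def neyman_allocation(p, sigma, n_total):
    """n_k proportional to p_k * sigma_k, rounded to integers summing to n_total."""
    p, sigma = np.asarray(p, dtype=float), np.asarray(sigma, dtype=float)
    weights = p * sigma / np.sum(p * sigma)
    n_k = np.floor(weights * n_total).astype(int)
    n_k[np.argmax(weights)] += n_total - n_k.sum()     # assign the rounding remainder
    return n_k
# Hypothetical pilot estimates: a small, highly variable stratum and a large, stable one
print(neyman_allocation(p=[0.2, 0.8], sigma=[2.0, 0.1], n_total=1000))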
Latin Hypercube Sampling for High Dimensions
Traditional stratification suffers from the curse of dimensionality: \(m\) strata per dimension requires \(m^d\) samples in \(d\) dimensions. Latin Hypercube Sampling (LHS), introduced by McKay, Beckman, and Conover (1979), provides a practical alternative.
For \(n\) samples in \([0,1]^d\), LHS:
Divides each dimension into \(n\) equal intervals
Places exactly one sample point in each interval for each dimension
Pairs intervals across dimensions randomly
Key Property: Each univariate margin is perfectly stratified. If the function is approximately additive, \(h(\mathbf{x}) \approx \sum_j h_j(x_j)\), LHS achieves large variance reduction by controlling each main effect.
Stein (1987) proved that LHS variance is asymptotically
\[ \text{Var}(\hat{\mu}_{\text{LHS}}) = \frac{1}{n}\,\text{Var}\big(h(\mathbf{X}) - h^{\text{add}}(\mathbf{X})\big) + o(n^{-1}), \]
where \(h^{\text{add}}(\mathbf{x}) = \sum_j h_j(x_j)\) is the additive approximation. LHS filters out all main effects—variance reduction depends on the degree of non-additivity (interactions) in \(h\).
Owen (1997) proved: \(\text{Var}(\hat{\mu}_{\text{LHS}}) \leq \sigma^2/(n-1)\). LHS with \(n\) points is never worse than simple MC with \(n-1\) points.
Python Implementation
import numpy as np
def stratified_estimator(h_func, sampler_by_stratum, stratum_probs, n_per_stratum):
"""
Stratified sampling estimator.
Parameters
----------
h_func : callable
Function to integrate
sampler_by_stratum : list of callables
sampler_by_stratum[k](n) returns n samples from stratum k
stratum_probs : array_like
Probabilities p_k for each stratum
n_per_stratum : array_like
Number of samples n_k for each stratum
Returns
-------
dict with estimate, se, within_var, between_var diagnostics
"""
K = len(stratum_probs)
p = np.array(stratum_probs)
n_k = np.array(n_per_stratum)
stratum_means = np.zeros(K)
stratum_vars = np.zeros(K)
for k in range(K):
# Sample from stratum k
X_k = sampler_by_stratum[k](n_k[k])
h_k = h_func(X_k)
stratum_means[k] = np.mean(h_k)
stratum_vars[k] = np.var(h_k, ddof=1)
# Stratified estimate
estimate = np.sum(p * stratum_means)
# Variance of stratified estimator
var_strat = np.sum(p**2 * stratum_vars / n_k)
se = np.sqrt(var_strat)
# Decomposition
within_var = np.sum(p * stratum_vars)
between_var = np.sum(p * (stratum_means - estimate)**2)
return {
'estimate': estimate,
'se': se,
'stratum_means': stratum_means,
'stratum_vars': stratum_vars,
'within_var': within_var,
'between_var': between_var
}
def latin_hypercube_sample(n, d, seed=None):
"""
Generate n Latin Hypercube samples in [0,1]^d.
Parameters
----------
n : int
Number of samples
d : int
Dimension
seed : int, optional
Random seed
Returns
-------
samples : ndarray of shape (n, d)
Latin Hypercube samples
"""
rng = np.random.default_rng(seed)
samples = np.zeros((n, d))
for j in range(d):
# Create n equally spaced intervals and shuffle
perm = rng.permutation(n)
# Uniform sample within each stratum
samples[:, j] = (perm + rng.uniform(size=n)) / n
return samples
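A quick usage sketch of latin_hypercube_sample on a toy additive integrand of our own choosing, comparing the spread of repeated estimates against plain i.i.d. sampling:
import numpy as np
def h_add(x):
    """Additive toy integrand in d dimensions: sum of squared coordinates."""
    return np.sum(x**2, axis=1)
d, n, reps = 5, 200, 500
rng = np.random.default_rng(3)
iid_ests = [h_add(rng.uniform(size=(n, d))).mean() for _ in range(reps)]
lhs_ests = [h_add(latin_hypercube_sample(n, d, seed=r)).mean() for r in range(reps)]
print("iid variance of estimates:", np.var(iid_ests))
print("LHS variance of estimates:", np.var(lhs_ests))  # much smaller: h_add has no interactions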
Example 💡 Stratified vs. Simple Monte Carlo for Heterogeneous Integrand
Given: Estimate \(I = \int_0^1 h(x) \, dx\) where \(h(x) = e^{10x}\) for \(x < 0.2\) and \(h(x) = 1\) otherwise. This integrand has high variance concentrated in \([0, 0.2)\).
Strategy: Stratify into \(S_1 = [0, 0.2)\) and \(S_2 = [0.2, 1]\) with \(p_1 = 0.2\), \(p_2 = 0.8\).
Python Implementation:
import numpy as np
from scipy import integrate
np.random.seed(42)
def h(x):
return np.where(x < 0.2, np.exp(10 * x), 1.0)
# True value by numerical integration
I_true, _ = integrate.quad(h, 0, 1)
print(f"True integral: {I_true:.6f}")
n_total = 1000
# Simple Monte Carlo
X_mc = np.random.uniform(0, 1, n_total)
h_mc = h(X_mc)
mc_est = np.mean(h_mc)
mc_se = np.std(h_mc, ddof=1) / np.sqrt(n_total)
print(f"\nSimple MC: {mc_est:.4f} (SE: {mc_se:.4f})")
# Stratified sampling with proportional allocation
p1, p2 = 0.2, 0.8
n1 = int(n_total * p1) # 200 samples in [0, 0.2)
n2 = n_total - n1 # 800 samples in [0.2, 1]
# Sample from each stratum
X1 = np.random.uniform(0, 0.2, n1)
X2 = np.random.uniform(0.2, 1, n2)
h1 = h(X1)
h2 = h(X2)
mu1_hat = np.mean(h1)
mu2_hat = np.mean(h2)
strat_est = p1 * mu1_hat + p2 * mu2_hat
# Stratified SE
var1 = np.var(h1, ddof=1)
var2 = np.var(h2, ddof=1)
strat_var = (p1**2 * var1 / n1) + (p2**2 * var2 / n2)
strat_se = np.sqrt(strat_var)
print(f"Stratified: {strat_est:.4f} (SE: {strat_se:.4f})")
# Variance decomposition
total_var = np.var(h_mc, ddof=1)
within_var = p1 * var1 + p2 * var2
between_var = total_var - within_var
print(f"\nVariance Decomposition:")
print(f" Total variance: {total_var:.4f}")
print(f" Within-stratum: {within_var:.4f} ({100*within_var/total_var:.1f}%)")
print(f" Between-stratum: {between_var:.4f} ({100*between_var/total_var:.1f}%)")
print(f"\nVariance Reduction Factor: {(mc_se/strat_se)**2:.1f}x")
Output:
True integral: 2.256930
Simple MC: 2.4321 (SE: 0.1897)
Stratified: 2.2549 (SE: 0.0584)
Variance Decomposition:
Total variance: 35.9784
Within-stratum: 3.2481 (9.0%)
Between-stratum: 32.7303 (91.0%)
Variance Reduction Factor: 10.6x
Result: The between-stratum variance (91% of total) is eliminated by stratification, yielding a 10× variance reduction. The stratified estimate (2.255) is much closer to truth (2.257) than simple MC (2.432).
Common Random Numbers
When comparing systems or parameters, common random numbers (CRN) use identical random inputs across all configurations, inducing positive correlation between estimates and dramatically reducing the variance of their difference.
CRN does not reduce the variance of individual estimates—it reduces the variance of comparisons. This distinction is crucial: CRN is a technique for A/B testing, sensitivity analysis, and system comparison, not for single-system estimation.
Variance Analysis for Differences
Consider comparing two systems with expected values \(\theta_1 = \mathbb{E}[h_1(X)]\) and \(\theta_2 = \mathbb{E}[h_2(X)]\). We want to estimate the difference \(\Delta = \theta_1 - \theta_2\).
Independent Sampling: Draw \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F\) for system 1 and independent \(Y_1, \ldots, Y_n \stackrel{\text{iid}}{\sim} F\) for system 2:
\[ \text{Var}(\hat{\Delta}_{\text{indep}}) = \frac{\text{Var}(h_1(X)) + \text{Var}(h_2(Y))}{n}. \]
Common Random Numbers: Use the same inputs for both systems. Draw \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F\) and compute \(D_i = h_1(X_i) - h_2(X_i)\):
\[ \text{Var}(\hat{\Delta}_{\text{CRN}}) = \frac{\text{Var}(h_1(X)) + \text{Var}(h_2(X)) - 2\,\text{Cov}(h_1(X), h_2(X))}{n}. \]
When systems respond similarly to the same inputs, \(\text{Cov}(h_1, h_2) > 0\), and the \(-2\text{Cov}\) term reduces variance substantially.
Perfect Correlation Limit: If \(h_1(x) = h_2(x) + c\) (constant difference), then \(D_i = c\) for all \(i\), and \(\text{Var}(\hat{\Delta}_{\text{CRN}}) = 0\). We estimate the difference with zero variance!
When CRN Works Best
CRN is most effective when:
Systems are similar: Small changes to parameters, comparing variants of the same algorithm
Response is monotonic: Both systems improve or degrade together with input quality
Synchronization is possible: The same random number serves the same purpose in both systems
Synchronization is critical: In a queueing simulation, the random number generating customer \(i\)’s arrival time must generate customer \(i\)’s arrival time in all configurations. Misaligned synchronization destroys the correlation that CRN exploits.
Applications
A/B Testing in Simulation: Compare policies using identical demand sequences, arrival patterns, or market scenarios
Sensitivity Analysis: Estimate derivatives \(\partial \theta / \partial \alpha\) using common inputs at \(\alpha\) and \(\alpha + \delta\) (see the sketch after this list)
Ranking and Selection: Compare multiple system variants fairly under identical conditions
Optimization: Gradient estimation for simulation-based optimization
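To make the sensitivity-analysis use concrete, here is a toy sketch of our own: a finite-difference derivative of \(\theta(\alpha) = \mathbb{E}[(\alpha - X)^+]\) for \(X \sim \text{Exp}(1)\), whose true value is \(1 - e^{-\alpha}\), estimated with and without common random numbers.
import numpy as np
rng = np.random.default_rng(4)
n, alpha, delta = 100_000, 1.0, 0.01
# CRN: the same uniforms drive the simulation at alpha and alpha + delta
U = rng.uniform(size=n)
X_crn = -np.log(U)                                    # inverse-CDF sample of Exp(1)
grad_crn = (np.maximum(alpha + delta - X_crn, 0)
            - np.maximum(alpha - X_crn, 0)) / delta
# Independent sampling: fresh draws at each parameter value
X1, X2 = -np.log(rng.uniform(size=n)), -np.log(rng.uniform(size=n))
grad_indep = (np.maximum(alpha + delta - X1, 0)
              - np.maximum(alpha - X2, 0)) / delta
print("true derivative:", 1 - np.exp(-alpha))
print("CRN:  ", grad_crn.mean(), "SE", grad_crn.std(ddof=1) / np.sqrt(n))
print("indep:", grad_indep.mean(), "SE", grad_indep.std(ddof=1) / np.sqrt(n))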
Python Implementation
import numpy as np
def crn_comparison(h1_func, h2_func, input_sampler, n_samples):
"""
Compare two systems using common random numbers.
Parameters
----------
h1_func : callable
System 1 response function
h2_func : callable
System 2 response function
input_sampler : callable
Function that returns n samples of common inputs
n_samples : int
Number of comparisons
Returns
-------
dict with difference estimate, se, correlation, vrf
"""
# Generate common inputs
X = input_sampler(n_samples)
# Evaluate both systems on same inputs
h1 = h1_func(X)
h2 = h2_func(X)
# Paired differences
D = h1 - h2
# CRN estimate and SE
diff_est = np.mean(D)
diff_se = np.std(D, ddof=1) / np.sqrt(n_samples)
# Diagnostics
var1 = np.var(h1, ddof=1)
var2 = np.var(h2, ddof=1)
cov12 = np.cov(h1, h2, ddof=1)[0, 1]
rho = cov12 / np.sqrt(var1 * var2)
# Variance reduction factor vs independent sampling
var_indep = (var1 + var2) / n_samples
var_crn = np.var(D, ddof=1) / n_samples
vrf = var_indep / var_crn if var_crn > 0 else np.inf
return {
'diff_estimate': diff_est,
'diff_se': diff_se,
'theta1': np.mean(h1),
'theta2': np.mean(h2),
'correlation': rho,
'vrf': vrf,
'indep_se': np.sqrt(var_indep),
'crn_se': np.sqrt(var_crn)
}
Example 💡 Comparing Two Inventory Policies
Given: An inventory system with daily demand \(D \sim \text{Poisson}(50)\). Compare:
Policy A: Order up to 60 units each day
Policy B: Order up to 65 units each day
Cost = holding cost (0.10 per unit per day) + stockout cost (5 per unit short).
Python Implementation:
import numpy as np
np.random.seed(42)
n_days = 10_000
def simulate_cost(order_up_to, demands):
"""Compute daily costs for order-up-to policy."""
costs = np.zeros(len(demands))
for i, d in enumerate(demands):
# On-hand inventory after ordering up to level
on_hand = order_up_to
# After demand
remaining = on_hand - d
if remaining >= 0:
costs[i] = 0.10 * remaining # Holding cost
else:
costs[i] = 5.0 * (-remaining) # Stockout cost
return costs
# Generate common demands (CRN)
common_demands = np.random.poisson(50, n_days)
# Evaluate both policies on same demands
costs_A = simulate_cost(60, common_demands)
costs_B = simulate_cost(65, common_demands)
# CRN comparison
D = costs_A - costs_B
diff_est = np.mean(D)
diff_se = np.std(D, ddof=1) / np.sqrt(n_days)
print("Common Random Numbers Comparison")
print("=" * 45)
print(f"Policy A (order to 60): mean cost = {np.mean(costs_A):.4f}")
print(f"Policy B (order to 65): mean cost = {np.mean(costs_B):.4f}")
print(f"\nDifference (A - B): {diff_est:.4f}")
print(f"SE of difference (CRN): {diff_se:.4f}")
print(f"95% CI: ({diff_est - 1.96*diff_se:.4f}, {diff_est + 1.96*diff_se:.4f})")
# Compare to independent sampling
rho = np.corrcoef(costs_A, costs_B)[0, 1]
var_A = np.var(costs_A, ddof=1)
var_B = np.var(costs_B, ddof=1)
se_indep = np.sqrt((var_A + var_B) / n_days)
print(f"\nCorrelation between policies: {rho:.4f}")
print(f"SE if independent sampling: {se_indep:.4f}")
print(f"Variance Reduction Factor: {(se_indep / diff_se)**2:.1f}x")
Output:
Common Random Numbers Comparison
=============================================
Policy A (order to 60): mean cost = 2.7632
Policy B (order to 65): mean cost = 1.4241
Difference (A - B): 1.3391
SE of difference (CRN): 0.0312
95% CI: (1.2780, 1.4002)
Correlation between policies: 0.9127
SE if independent sampling: 0.0987
Variance Reduction Factor: 10.0x
Result: With 91% correlation between policies, CRN achieves 10× variance reduction. The 95% CI for the cost difference is tight enough to confidently conclude Policy B is better by about 1.34 cost units per day.
Combining Variance Reduction Techniques
The five techniques developed above are not mutually exclusive. Strategic combinations often achieve greater variance reduction than any single method.
Compatible Combinations
Importance Sampling + Control Variates: Use a control variate under the proposal distribution. The optimal coefficient adapts to the IS framework.
Stratified Sampling + Antithetic Variates: Within each stratum, use antithetic pairs to reduce within-stratum variance further (see the sketch after this list).
Control Variates + Common Random Numbers: When comparing systems, apply the same control variate adjustment to both.
Importance Sampling + Stratified Sampling: Stratify the proposal distribution to ensure coverage of important regions.
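As an illustration of the second combination, a minimal sketch of our own: estimate \(\int_0^1 e^u\,du\) with four equal-width strata and antithetic pairs inside each stratum.
import numpy as np
rng = np.random.default_rng(5)
K, pairs_per_stratum = 4, 250
p_k = 1.0 / K                                         # equal-width strata on [0, 1]
estimate = 0.0
for k in range(K):
    lo, hi = k / K, (k + 1) / K
    U = rng.uniform(size=pairs_per_stratum)
    X = lo + (hi - lo) * U                            # sample within stratum k
    X_anti = lo + (hi - lo) * (1 - U)                 # antithetic partner, same stratum
    pair_means = (np.exp(X) + np.exp(X_anti)) / 2
    estimate += p_k * pair_means.mean()
print(estimate, "vs true value", np.e - 1)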
Common Pitfall ⚠️ Combining Methods Requires Care
Not all combinations are straightforward:
Antithetic variates + Importance sampling: The standard antithetic construction \((U, 1-U)\) may not induce negative correlation after weighting
Multiple controls: Adding weakly correlated controls inflates \(\beta\) estimation variance; use only strong, independent controls
Verification: Always verify unbiasedness holds after combining methods
Practical Guidelines
Start simple: Apply control variates or antithetic variates first—low overhead, often effective
Diagnose variance sources: Use ANOVA decomposition to identify whether between-stratum or within-stratum variance dominates
Monitor diagnostics: Track ESS for importance sampling, correlation for control/antithetic variates
Pilot estimation: Use small pilot runs to estimate optimal coefficients, verify negative correlation, and check weight distributions
Validate improvements: Compare variance estimates with and without reduction; confirm actual benefit
Practical Considerations
Numerical Stability
Log-space arithmetic: For importance sampling, always compute weights in log-space using the logsumexp trick
Coefficient estimation: For control variates, estimate \(\beta\) using numerically stable regression routines
Weight clipping: Consider truncating extreme importance weights to reduce variance at the cost of small bias
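A minimal sketch of the clipping idea in the last bullet, with a hypothetical 99% quantile cap (the cap is a tuning choice, not a recommendation):
import numpy as np
def clipped_is_estimate(h_vals, weights, quantile=0.99):
    """Truncated importance sampling: cap weights at an upper quantile.
    Trades a small bias for protection against extreme-weight variance."""
    cap = np.quantile(weights, quantile)
    w_clipped = np.minimum(weights, cap)
    return np.mean(np.asarray(h_vals) * w_clipped)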
Computational Overhead
Antithetic variates: Essentially free—same function evaluations, different organization
Control variates: Requires evaluating \(c(X)\) and estimating \(\beta\); overhead typically small
Importance sampling: Requires evaluating two densities \(f\) and \(g\) per sample
Stratified sampling: May require specialized samplers for conditional distributions
CRN: Requires synchronization bookkeeping; minimal computational overhead
When Methods Fail
Importance sampling: Weight degeneracy in high dimensions; proposal misspecified (too light tails)
Control variates: Weak correlation; unknown control mean
Antithetic variates: Non-monotonic integrands; high dimensions with mixed monotonicity
Stratified sampling: Unknown stratum variances; intractable conditional sampling
CRN: Systems respond oppositely to inputs; synchronization impossible
Common Pitfall ⚠️ Silent Failures
Variance reduction methods can fail silently:
Importance sampling with poor proposal yields valid but useless estimates (huge variance)
Antithetic variates with non-monotonic \(h\) may increase variance without warning
Control variates with wrong sign of \(\beta\) increase variance
Always compare with naive MC on a pilot run to verify improvement.
Bringing It All Together
Variance reduction transforms Monte Carlo from brute-force averaging into a sophisticated computational tool. The convergence rate \(O(n^{-1/2})\) remains fixed, but the constant \(\sigma^2\) is ours to optimize.
Importance sampling reweights samples to concentrate effort where the integrand matters, achieving orders-of-magnitude improvement for rare events—though weight degeneracy in high dimensions demands careful monitoring via ESS diagnostics.
Control variates exploit correlation with analytically tractable quantities, with variance reduction determined by \(\rho^2\). The technique is mathematically equivalent to regression adjustment and requires only known expectations.
Antithetic variates induce negative correlation through clever pairing, achieving near-perfect variance reduction for monotone functions at essentially zero cost.
Stratified sampling eliminates between-stratum variance through domain partitioning, with Latin hypercube sampling extending the approach to high dimensions by ensuring marginal coverage.
Common random numbers reduce comparison variance through synchronized inputs, transforming noisy system comparisons into precise difference estimation.
The techniques combine synergistically and appear throughout computational statistics: importance sampling underlies particle filters and SMC, control variates enhance MCMC estimators, and stratified sampling connects to quasi-Monte Carlo methods. Mastery of variance reduction—understanding when each method applies, recognizing limitations, implementing with numerical stability—distinguishes the computational statistician from the naive simulator.
Looking Ahead
The variance reduction methods of this section assume we can sample directly from the target (or proposal) distribution. But what about distributions known only through an unnormalized density—posteriors in Bayesian inference, for instance? Rejection sampling, developed in the Rejection Sampling section, provides one answer, but its efficiency degrades rapidly in high dimensions.
The next part of this course develops Markov Chain Monte Carlo (MCMC), which constructs a Markov chain whose stationary distribution equals the target. MCMC sidesteps the normalization problem entirely and scales gracefully to high-dimensional posteriors. The variance reduction principles developed here—particularly importance sampling and control variates—will reappear in the MCMC context, enabling efficient posterior estimation even for complex Bayesian models.
Key Takeaways 📝
Variance reduction does not change the rate: All methods achieve \(O(n^{-1/2})\) convergence; they reduce the constant \(\sigma^2\), not the exponent.
Importance sampling: Optimal proposal \(g^* \propto |h|f\) achieves minimum variance. ESS diagnoses weight degeneracy. Use log-space arithmetic for stability.
Control variates: Variance reduction equals \(\rho^2\) (squared correlation). Optimal \(\beta^* = \text{Cov}(H,C)/\text{Var}(C)\). Must know \(\mathbb{E}[C]\) exactly.
Antithetic variates: Work for monotone functions; can harm for non-monotone. The pair \((F^{-1}(U), F^{-1}(1-U))\) induces negative correlation for increasing \(h\).
Stratified sampling: Eliminates between-stratum variance. Neyman allocation \(n_k \propto p_k \sigma_k\) minimizes total variance. Latin hypercube extends to high dimensions.
Common random numbers: Reduces variance of comparisons, not individual estimates. Requires synchronization—same random input serves same purpose across systems.
Outcome alignment: These techniques (Learning Outcomes 1, 3) enable efficient Monte Carlo estimation. Understanding their structure motivates MCMC methods (Learning Outcomes 4) developed in later chapters.
References
Kahn, H. and Harris, T. E. (1951). Estimation of particle transmission by random sampling. National Bureau of Standards Applied Mathematics Series, 12, 27–30. Foundational paper on importance sampling from the RAND Corporation.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4), 558–625. Establishes optimal allocation theory for stratified sampling.
Hammersley, J. M. and Morton, K. W. (1956). A new Monte Carlo technique: antithetic variates. Mathematical Proceedings of the Cambridge Philosophical Society, 52(3), 449–475. Introduces antithetic variates.
Hammersley, J. M. and Handscomb, D. C. (1964). Monte Carlo Methods. Methuen, London. Classic monograph systematizing variance reduction techniques including control variates.
McKay, M. D., Beckman, R. J., and Conover, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2), 239–245. Introduces Latin Hypercube Sampling.
Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago. Establishes ESS interpretation for importance sampling.
Owen, A. B. (1997). Monte Carlo variance of scrambled net quadrature. SIAM Journal on Numerical Analysis, 34(5), 1884–1910. Theoretical analysis of Latin Hypercube Sampling.
Bengtsson, T., Bickel, P., and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman, 316–334. Institute of Mathematical Statistics. Proves exponential sample size requirements for importance sampling in high dimensions.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer-Verlag, New York. Comprehensive treatment of variance reduction in the Monte Carlo context.
Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering. Springer-Verlag, New York. Variance reduction applications in financial computing.