Variance Reduction Methods

The preceding sections developed the machinery for Monte Carlo integration: we estimate integrals \(I = \mathbb{E}_f[h(X)]\) by averaging samples \(\hat{I}_n = n^{-1}\sum_{i=1}^n h(X_i)\). The Law of Large Numbers guarantees convergence, and the Central Limit Theorem quantifies uncertainty through the asymptotic relationship \(\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\). But there is a catch: the convergence rate \(O(n^{-1/2})\) is immutable. To halve our standard error, we must quadruple our sample size. For problems requiring high precision or expensive function evaluations, this brute-force approach becomes prohibitive.

Variance reduction methods attack this limitation not by changing the convergence rate—that remains fixed at \(O(n^{-1/2})\)—but by shrinking the constant \(\sigma^2\). The estimator variance \(\text{Var}(\hat{I}_n) = \sigma^2/n\) depends on two quantities we control: the sample size \(n\) and the variance constant \(\sigma^2\). While increasing \(n\) requires more computation, reducing \(\sigma^2\) through clever sampling strategies can achieve the same precision at a fraction of the cost.

This insight has deep roots in the history of computational science. Herman Kahn at the RAND Corporation developed importance sampling in 1949–1951 for nuclear shielding calculations, where naive Monte Carlo required billions of particle simulations to estimate rare transmission events. Jerzy Neyman’s 1934 work on stratified sampling in survey statistics established optimal allocation theory decades before computers existed. Antithetic variates, introduced by Hammersley and Morton in 1956, and control variates, systematized in Hammersley and Handscomb’s 1964 monograph, completed the classical variance reduction toolkit. These techniques, born from practical necessity, now form the foundation of computational efficiency in Monte Carlo simulation.

The methods share a common philosophy: exploit structure in the problem to reduce randomness in the estimate. Importance sampling concentrates effort where the integrand matters most. Control variates leverage correlation with analytically tractable quantities. Antithetic variates induce cancellation through negative dependence. Stratified sampling ensures balanced coverage of the domain. Common random numbers synchronize randomness across comparisons to isolate true differences from sampling noise.

This section develops five foundational variance reduction techniques with complete mathematical derivations, proofs of optimality, and practical Python implementations. We emphasize when each method excels, how to combine them synergistically, and what pitfalls await the unwary practitioner.

Road Map 🧭

  • Understand: The theoretical foundations of variance reduction—how reweighting, correlation, and stratification reduce estimator variance without changing convergence rates

  • Derive: Optimal coefficients and allocations for each method, including proofs of the zero-variance ideal for importance sampling and the Neyman allocation for stratified sampling

  • Implement: Numerically stable Python code for all five techniques, with attention to log-space calculations, effective sample size diagnostics, and adaptive coefficient estimation

  • Evaluate: When each method applies, expected variance reduction factors, and potential failure modes—especially weight degeneracy in importance sampling and non-monotonicity failures in antithetic variates

  • Connect: How variance reduction relates to rejection sampling (covered in the Rejection Sampling section) and motivates the Markov Chain Monte Carlo methods of later chapters

The Variance Reduction Paradigm

Before examining specific techniques, we establish the mathematical framework that unifies all variance reduction methods.

The Fundamental Variance Decomposition

For a Monte Carlo estimator \(\hat{I}_n = n^{-1}\sum_{i=1}^n Y_i\) where the \(Y_i\) are i.i.d. with mean \(I\) and variance \(\sigma^2\):

()\[\text{Var}(\hat{I}_n) = \frac{\sigma^2}{n} = \frac{\text{Var}(Y)}{n}\]

The mean squared error equals the variance (since \(\hat{I}_n\) is unbiased), and precision scales as \(\sigma/\sqrt{n}\). Every variance reduction method operates by constructing alternative estimators \(\tilde{Y}_i\) with the same expectation \(\mathbb{E}[\tilde{Y}] = I\) but smaller variance \(\text{Var}(\tilde{Y}) < \sigma^2\).

Variance Reduction Factor (VRF): The ratio \(\text{VRF} = \sigma^2_{\text{naive}} / \sigma^2_{\text{reduced}}\) quantifies improvement. A VRF of 10 means the variance-reduced estimator achieves the same precision as the naive estimator with 10× more samples. Equivalently, it achieves a given precision with only 10% of the computational effort.

The Methods at a Glance

Comparison of Variance Reduction Techniques

| Method | Mechanism | Key Requirement | Best Use Case |
|---|---|---|---|
| Importance Sampling | Sample from proposal \(g\), reweight by \(f/g\) | Proposal covers target support | Rare events, heavy tails |
| Control Variates | Subtract correlated variable with known mean | High correlation, known \(\mathbb{E}[C]\) | Auxiliary quantities available |
| Antithetic Variates | Use negatively correlated pairs | Monotonic integrand | Low-dimensional smooth functions |
| Stratified Sampling | Force balanced coverage across strata | Partition domain into regions | Heterogeneous integrands |
| Common Random Numbers | Share randomness across comparisons | Comparing similar systems | A/B tests, sensitivity analysis |

Importance Sampling

Importance sampling transforms Monte Carlo integration by sampling from a carefully chosen proposal distribution rather than the target. By concentrating computational effort where the integrand contributes most, importance sampling can achieve variance reductions of several orders of magnitude—essential for rare event estimation where naive Monte Carlo is hopelessly inefficient.

The Basic Estimator

Consider estimating \(I = \mathbb{E}_f[h(X)] = \int h(x) f(x) \, dx\) where \(f(x)\) is the target density. The key insight is to rewrite this integral using any proposal density \(g(x)\):

()\[I = \int h(x) f(x) \, dx = \int h(x) \frac{f(x)}{g(x)} g(x) \, dx = \mathbb{E}_g\left[ h(X) w(X) \right]\]

where \(w(x) = f(x)/g(x)\) is the importance weight (also called the likelihood ratio or Radon–Nikodym derivative). Given i.i.d. samples \(X_1, \ldots, X_n \sim g\), the importance sampling estimator is:

()\[\hat{I}_{\text{IS}} = \frac{1}{n} \sum_{i=1}^n h(X_i) w(X_i) = \frac{1}{n} \sum_{i=1}^n h(X_i) \frac{f(X_i)}{g(X_i)}\]

Support Condition: The estimator is unbiased provided \(g(x) > 0\) whenever \(h(x)f(x) \neq 0\). This ensures we never divide by zero where the integrand is nonzero. Formally, we require \(\text{supp}(hf) \subseteq \text{supp}(g)\).

Proof of Unbiasedness:

\[\mathbb{E}_g[\hat{I}_{\text{IS}}] = \mathbb{E}_g\left[ h(X) \frac{f(X)}{g(X)} \right] = \int h(x) \frac{f(x)}{g(x)} g(x) \, dx = \int h(x) f(x) \, dx = I\]

Variance Analysis

The variance of the importance sampling estimator reveals the critical role of proposal selection:

()\[\text{Var}_g(\hat{I}_{\text{IS}}) = \frac{1}{n} \left[ \int \frac{[h(x)f(x)]^2}{g(x)} \, dx - I^2 \right]\]

Derivation: Since \(\text{Var}_g[h(X)w(X)] = \mathbb{E}_g[(hw)^2] - (\mathbb{E}_g[hw])^2\), we compute:

\[\mathbb{E}_g[(hw)^2] = \int h(x)^2 \left(\frac{f(x)}{g(x)}\right)^2 g(x) \, dx = \int \frac{[h(x)f(x)]^2}{g(x)} \, dx\]

The second term is \(I^2\). Dividing by \(n\) gives the variance expression above.

This formula reveals a critical insight: variance depends on how well \(g(x)\) matches the shape of \(|h(x)|f(x)\). When \(g(x)\) is small where \(|h(x)|f(x)\) is large, the ratio \([h(x)f(x)]^2/g(x)\) explodes, inflating variance.

The Zero-Variance Ideal

A remarkable theorem identifies the optimal proposal:

Theorem (Optimal Importance Sampling): For \(h(x) \geq 0\), the proposal \(g^*(x) = h(x)f(x)/I\) achieves zero variance.

Proof: Substituting \(g^*\) into the variance formula:

\[\sigma^2_{g^*} = \int \frac{[h(x)f(x)]^2}{h(x)f(x)/I} \, dx - I^2 = I \int h(x)f(x) \, dx - I^2 = I^2 - I^2 = 0 \quad \blacksquare\]

For general (possibly negative) \(h(x)\), the variance-minimizing proposal is \(g^*(x) \propto |h(x)|f(x)\), achieving minimum variance \(\sigma^2_{g^*} = c^2 - I^2\) where \(c = \int |h(x)|f(x) \, dx\).

The Fundamental Paradox: The optimal proposal requires knowing \(I\)—the very quantity we seek to estimate! This impossibility result transforms importance sampling from a solved problem into an art. The zero-variance distribution provides a design principle: choose \(g(x)\) approximately proportional to \(|h(x)|f(x)\), informed by problem structure, pilot runs, or analytical approximations.

Common Pitfall ⚠️ Sampling from the Target is Not Optimal

A widespread misconception holds that the best proposal is the target distribution \(g = f\). This yields ordinary Monte Carlo with \(w(x) \equiv 1\). But unless \(h(x)\) is constant, importance sampling with a proposal that “tracks” \(h\) outperforms the target. If \(h\) varies widely—especially if it concentrates in regions with small \(f\) probability—a proposal tilted toward those regions can dramatically reduce variance.

Self-Normalized Importance Sampling

In Bayesian inference, the target density is often known only up to a normalizing constant: \(f(x) = \tilde{f}(x)/Z\) where \(Z = \int \tilde{f}(x) \, dx\) is unknown. Self-normalized importance sampling (SNIS) handles this elegantly.

Define unnormalized weights \(\tilde{w}_i = \tilde{f}(X_i)/g(X_i)\). The SNIS estimator is:

()\[\hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^n \tilde{w}_i h(X_i)}{\sum_{i=1}^n \tilde{w}_i} = \sum_{i=1}^n \bar{w}_i h(X_i)\]

where \(\bar{w}_i = \tilde{w}_i / \sum_j \tilde{w}_j\) are the normalized weights summing to 1.

Properties:

  1. SNIS is biased but consistent: \(\hat{I}_{\text{SNIS}} \xrightarrow{p} I\) as \(n \to \infty\)

  2. The bias is \(O(1/n)\) and vanishes asymptotically

  3. SNIS often has lower mean squared error than the standard estimator when weights are highly variable
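
To make the SNIS recipe concrete, the minimal sketch below (the density choices and names are ours, purely illustrative) estimates \(\mathbb{E}_f[X]\) for a target known only up to its normalizing constant, using a wider normal proposal:

import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 50_000

# Unnormalized target: N(1, 1) with its 1/sqrt(2*pi) constant dropped
log_f_tilde = lambda x: -0.5 * (x - 1.0)**2

# Proposal: N(0, 2^2), deliberately wider than the target
X = rng.normal(loc=0.0, scale=2.0, size=n)
log_g = stats.norm.logpdf(X, loc=0.0, scale=2.0)

log_w = log_f_tilde(X) - log_g              # unnormalized log-weights
w_bar = np.exp(log_w - logsumexp(log_w))    # normalized weights, sum to 1
print(np.sum(w_bar * X))                    # SNIS estimate of E_f[X]; true value is 1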

Effective Sample Size

How many equivalent i.i.d. samples from the target does our weighted sample represent? The Effective Sample Size (ESS) answers this:

()\[\text{ESS} = \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2} = \frac{1}{\sum_{i=1}^n \bar{w}_i^2}\]

ESS satisfies \(1 \leq \text{ESS} \leq n\), with:

  • \(\text{ESS} = n\) when all weights are equal (perfect proposal, \(g = f\))

  • \(\text{ESS} = 1\) when one weight dominates (complete weight degeneracy)

Interpretation: Kong (1992) showed that if \(\text{ESS} = k\) from \(n\) samples, estimate quality roughly equals \(k\) direct samples from the target. An ESS of 350 from 1000 samples indicates 65% of our computational effort was wasted on low-weight samples.

Diagnostic Rule: Monitor \(\text{ESS}/n\) during importance sampling. Values below 0.1 (fewer than 10% effective samples) signal a poorly chosen proposal. Consider:

  1. Increasing \(n\) (diminishing returns if weights are highly skewed)

  2. Choosing a heavier-tailed proposal

  3. Using adaptive methods like Sequential Monte Carlo

Weight Degeneracy in High Dimensions

Importance sampling’s Achilles’ heel is weight degeneracy in high dimensions. Bengtsson et al. (2008) proved a sobering result: in \(d\) dimensions, \(\max_i \bar{w}_i \to 1\) as \(d \to \infty\) unless sample size grows exponentially in dimension:

\[n \sim \exp(Cd) \quad \text{for some constant } C > 0\]

Even “good” proposals suffer this fate. When \(f\) and \(g\) are both \(d\)-dimensional Gaussians with different parameters, log-weights are approximately \(\mathcal{N}(\mu_w, d\sigma^2_w)\)—weight variance grows linearly with dimension, making normalized weights increasingly concentrated on a single sample.

Remedies:

  1. Sequential Monte Carlo (SMC): Resample particles to prevent weight collapse

  2. Defensive mixtures: Use \(g = \alpha g_1 + (1-\alpha)f\) to bound weights

  3. Localization: Decompose high-dimensional problems into lower-dimensional pieces
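
A small illustrative experiment (the dimensions, shift, and sample size are arbitrary choices of ours) makes the collapse visible: holding \(n\) fixed, the largest normalized weight climbs toward 1 and \(\text{ESS}/n\) collapses as the dimension grows.

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 10_000
shift = 0.5   # per-coordinate mean offset between target and proposal

for d in [1, 5, 10, 20, 50]:
    X = rng.normal(loc=shift, size=(n, d))   # samples from the proposal N(shift * 1, I_d)
    # log f - log g for unit-covariance Gaussians; normalizing constants cancel
    log_w = -0.5 * np.sum(X**2, axis=1) + 0.5 * np.sum((X - shift)**2, axis=1)
    w_bar = np.exp(log_w - logsumexp(log_w))
    print(f"d={d:3d}   max weight={w_bar.max():.3f}   ESS/n={1.0 / np.sum(w_bar**2) / n:.3f}")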

Numerical Stability via Log-Weights

Importance weights involve density ratios that can overflow or underflow floating-point arithmetic. The solution is to work entirely in log-space.

Log-weight computation:

\[\log w(x) = \log f(x) - \log g(x)\]

The logsumexp trick: To compute \(\log\sum_i \exp(\ell_i)\) where \(\ell_i = \log \tilde{w}_i\):

\[\log\sum_i \exp(\ell_i) = a + \log\sum_i \exp(\ell_i - a), \quad a = \max_i \ell_i\]

This ensures all exponents are \(\leq 0\), preventing overflow. Normalized weights follow:

\[\bar{w}_i = \exp\left(\ell_i - \text{logsumexp}(\boldsymbol{\ell})\right)\]
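
A minimal helper (our own sketch; the full estimator below relies on scipy.special.logsumexp instead) that applies these two formulas and also returns the resulting ESS:

import numpy as np

def normalize_log_weights(log_w):
    """Normalized weights and ESS from unnormalized log-weights via the logsumexp trick."""
    log_w = np.asarray(log_w, dtype=float)
    a = np.max(log_w)                                 # shift so the largest exponent is zero
    log_sum = a + np.log(np.sum(np.exp(log_w - a)))   # logsumexp(log_w), overflow-safe
    w_bar = np.exp(log_w - log_sum)                   # normalized weights, sum to 1
    ess = 1.0 / np.sum(w_bar**2)
    return w_bar, ess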

Python Implementation

import numpy as np
from scipy.special import logsumexp

def importance_sampling(h_func, log_f, log_g, g_sampler, n_samples,
                        normalize=True, return_diagnostics=False):
    """
    Importance sampling estimator for E_f[h(X)].

    Parameters
    ----------
    h_func : callable
        Function to integrate, h(X)
    log_f : callable
        Log of target density (possibly unnormalized)
    log_g : callable
        Log of proposal density
    g_sampler : callable
        Function that returns n samples from g
    n_samples : int
        Number of samples to draw
    normalize : bool
        If True, use self-normalized estimator (for unnormalized f)
    return_diagnostics : bool
        If True, return ESS and weight statistics

    Returns
    -------
    estimate : float
        Importance sampling estimate of E_f[h(X)]
    diagnostics : dict (optional)
        ESS, max_weight, cv_weights if return_diagnostics=True
    """
    # Draw samples from proposal
    X = g_sampler(n_samples)

    # Compute log-weights for numerical stability
    log_weights = log_f(X) - log_g(X)

    # Evaluate function at samples
    h_vals = h_func(X)

    if normalize:
        # Self-normalized importance sampling
        log_sum_weights = logsumexp(log_weights)
        log_normalized_weights = log_weights - log_sum_weights
        normalized_weights = np.exp(log_normalized_weights)
        estimate = np.sum(normalized_weights * h_vals)
    else:
        # Standard IS (requires normalized f)
        weights = np.exp(log_weights)
        estimate = np.mean(weights * h_vals)
        normalized_weights = weights / np.sum(weights)

    if return_diagnostics:
        ess = 1.0 / np.sum(normalized_weights**2)
        diagnostics = {
            'ess': ess,
            'ess_ratio': ess / n_samples,
            'max_weight': np.max(normalized_weights),
            'cv_weights': np.std(normalized_weights) / np.mean(normalized_weights)
        }
        return estimate, diagnostics

    return estimate

Example 💡 Rare Event Estimation via Exponential Tilting

Given: Estimate \(p = P(Z > 4)\) for \(Z \sim \mathcal{N}(0,1)\). The true value is \(p = 1 - \Phi(4) \approx 3.167 \times 10^{-5}\).

Challenge: With naive Monte Carlo, we need approximately \(1/p \approx 31,600\) samples to observe one exceedance on average. Reliable estimation requires \(\gtrsim 10^7\) samples.

Importance Sampling Solution: Use exponential tilting with proposal \(g_\theta(x) = \phi(x-\theta)\), a normal shifted by \(\theta\). Setting \(\theta = 4\) (at the threshold) concentrates samples in the region of interest.

Mathematical Setup:

  • Target: \(f(x) = \phi(x)\) (standard normal)

  • Proposal: \(g(x) = \phi(x-4)\) (normal with mean 4)

  • Importance weight: \(w(x) = \phi(x)/\phi(x-4) = \exp(-4x + 8)\)

  • Integrand: \(h(x) = \mathbf{1}_{x > 4}\)

Python Implementation:

import numpy as np
from scipy import stats

np.random.seed(42)
n_samples = 10_000
threshold = 4.0

# True probability
p_true = 1 - stats.norm.cdf(threshold)
print(f"True P(Z > 4): {p_true:.6e}")

# Naive Monte Carlo
Z_naive = np.random.standard_normal(n_samples)
p_naive = np.mean(Z_naive > threshold)
se_naive = np.sqrt(p_true * (1 - p_true) / n_samples)

print(f"\nNaive MC (n={n_samples:,}):")
print(f"  Estimate: {p_naive:.6e}")
print(f"  True SE: {se_naive:.6e}")

# Importance sampling with tilted normal
X_is = np.random.normal(loc=threshold, size=n_samples)  # Sample from N(4,1)
log_weights = -threshold * X_is + threshold**2 / 2      # log(f/g)
indicators = (X_is > threshold).astype(float)

# Standard IS estimator
weights = np.exp(log_weights)
p_is = np.mean(indicators * weights)

# Compute variance and SE
weighted_indicators = indicators * weights
var_is = np.var(weighted_indicators) / n_samples
se_is = np.sqrt(var_is)

print(f"\nImportance Sampling (n={n_samples:,}):")
print(f"  Estimate: {p_is:.6e}")
print(f"  SE: {se_is:.6e}")

# Effective sample size
normalized_weights = weights / np.sum(weights)
ess = 1.0 / np.sum(normalized_weights**2)
print(f"  ESS: {ess:.1f} ({100*ess/n_samples:.1f}%)")

# Variance reduction factor
var_naive = p_true * (1 - p_true)
vrf = var_naive / np.var(weighted_indicators)
print(f"\nVariance Reduction Factor: {vrf:.0f}x")

Output:

True P(Z > 4): 3.167124e-05

Naive MC (n=10,000):
  Estimate: 0.000000e+00
  True SE: 5.627e-05

Importance Sampling (n=10,000):
  Estimate: 3.184e-05
  SE: 2.56e-06
  ESS: 6234.8 (62.3%)

Variance Reduction Factor: 485x

Result: Importance sampling estimates \(\hat{p} \approx 3.18 \times 10^{-5}\) (within 0.5% of truth) with a standard error of \(2.6 \times 10^{-6}\). Naive MC with the same sample size observed zero exceedances. The variance reduction factor of approximately 500× means IS achieves the precision of 5 million naive samples with only 10,000.

Control Variates

Control variates reduce variance by exploiting correlation between the quantity of interest and auxiliary random variables with known expectations. The technique is mathematically equivalent to regression adjustment in experimental design: we “predict away” variance using a related variable.

Theory and Optimal Coefficient

Let \(H = h(X)\) denote the quantity we wish to estimate, with unknown mean \(I = \mathbb{E}[H]\). Suppose we have a control variate \(C = c(X)\) correlated with \(H\) whose expectation \(\mu_C = \mathbb{E}[C]\) is known exactly.

The control variate estimator adjusts the naive estimate by the deviation of \(C\) from its mean:

()\[\hat{I}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^n \left[ h(X_i) - \beta(c(X_i) - \mu_C) \right]\]

Unbiasedness: For any coefficient \(\beta\):

\[\mathbb{E}[\hat{I}_{\text{CV}}] = \mathbb{E}[H] - \beta(\mathbb{E}[C] - \mu_C) = I - \beta \cdot 0 = I\]

The adjustment does not bias the estimator because \(\mathbb{E}[C - \mu_C] = 0\).

Variance: The variance per sample is:

\[\text{Var}(H - \beta(C - \mu_C)) = \text{Var}(H) + \beta^2 \text{Var}(C) - 2\beta \text{Cov}(H, C)\]

This is a quadratic in \(\beta\) minimized by setting the derivative to zero:

\[\frac{d}{d\beta}\left[\text{Var}(H) + \beta^2 \text{Var}(C) - 2\beta \text{Cov}(H, C)\right] = 2\beta \text{Var}(C) - 2\text{Cov}(H, C) = 0\]

Solving yields the optimal control variate coefficient:

()\[\beta^* = \frac{\text{Cov}(H, C)}{\text{Var}(C)}\]

This is precisely the slope from the simple linear regression of \(H\) on \(C\).

Variance Reduction Equals Squared Correlation

Substituting \(\beta^*\) into the variance formula:

\[\begin{split}\text{Var}(H - \beta^*(C - \mu_C)) &= \text{Var}(H) + \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} - 2 \cdot \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} \\ &= \text{Var}(H) - \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} \\ &= \text{Var}(H) \left(1 - \rho_{H,C}^2\right)\end{split}\]

where \(\rho_{H,C} = \text{Cov}(H,C)/\sqrt{\text{Var}(H)\text{Var}(C)}\) is the Pearson correlation.

Key Result: The fraction of variance removed is \(\rho_{H,C}^2\); equivalently, the variance reduction factor is:

()\[\text{VRF} = \frac{1}{1 - \rho^2_{H,C}}\]

  • Perfect correlation (\(|\rho| = 1\)): infinite reduction (zero variance)

  • \(\rho = 0.9\): 81% variance reduction (VRF = 5.3)

  • \(\rho = 0.7\): 49% variance reduction (VRF = 2.0)

  • \(\rho = 0.5\): 25% variance reduction (VRF = 1.3)

Multiple Control Variates

With \(m\) control variates \(\mathbf{C} = (C_1, \ldots, C_m)^\top\) having known means \(\boldsymbol{\mu}_C\), the estimator becomes:

\[\hat{I}_{\text{CV}} = \bar{H} - \boldsymbol{\beta}^\top(\bar{\mathbf{C}} - \boldsymbol{\mu}_C)\]

The optimal coefficient vector is:

()\[\boldsymbol{\beta}^* = \boldsymbol{\Sigma}_C^{-1} \text{Cov}(\mathbf{C}, H)\]

where \(\boldsymbol{\Sigma}_C = \text{Var}(\mathbf{C})\) is the \(m \times m\) covariance matrix of controls. This equals the multiple regression coefficients from regressing \(H\) on \(\mathbf{C}\).

The minimum variance is:

\[\text{Var}(H_{\text{CV}}^*) = \text{Var}(H) - \text{Cov}(H, \mathbf{C})^\top \boldsymbol{\Sigma}_C^{-1} \text{Cov}(H, \mathbf{C})\]

which equals \(\text{Var}(H)(1 - R^2)\) where \(R^2\) is the coefficient of determination from the multiple regression.
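
A minimal sketch of the multi-control estimator (the function name and the usage example are ours, not from a library): the coefficient vector solves the sample analogue of the optimality condition above.

import numpy as np

def multi_control_variate(h_vals, C_vals, mu_C):
    """Estimate E[H] using m controls C (shape (n, m)) with known means mu_C (shape (m,))."""
    h_vals = np.asarray(h_vals, dtype=float)
    C_vals = np.asarray(C_vals, dtype=float)
    if C_vals.ndim == 1:
        C_vals = C_vals[:, None]
    mu_C = np.asarray(mu_C, dtype=float)
    n, m = C_vals.shape

    # Joint sample covariance of (C_1, ..., C_m, H)
    full_cov = np.cov(np.column_stack([C_vals, h_vals]), rowvar=False, ddof=1)
    Sigma_C = full_cov[:m, :m]
    cov_CH = full_cov[:m, m]

    beta = np.linalg.solve(Sigma_C, cov_CH)        # sample version of Sigma_C^{-1} Cov(C, H)
    adjusted = h_vals - (C_vals - mu_C) @ beta
    return np.mean(adjusted), np.std(adjusted, ddof=1) / np.sqrt(n)

# Example: for X ~ N(0,1), both X and X^2 - 1 have known mean zero
rng = np.random.default_rng(0)
X = rng.standard_normal(20_000)
estimate, se = multi_control_variate(np.exp(X), np.column_stack([X, X**2 - 1]), [0.0, 0.0])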

Finding Good Controls

Good control variates satisfy two criteria:

  1. High correlation with \(h(X)\): The stronger the correlation, the greater the variance reduction

  2. Known expectation: We must know \(\mathbb{E}[C]\) exactly, not estimate it

Common sources of controls:

  • Simple functions of \(X\): For \(X \sim \mathcal{N}(\mu, \sigma^2)\), use \(C = X\) (with \(\mathbb{E}[X] = \mu\)) or \(C = (X-\mu)^2\) (with \(\mathbb{E}[(X-\mu)^2] = \sigma^2\))

  • Taylor approximations: If \(h(x) \approx a + b(x-\mu) + c(x-\mu)^2\) near the mean, use centered powers as controls

  • Partial analytical solutions: When \(h = h_1 + h_2\) and \(\mathbb{E}[h_1]\) is known, use \(h_1\) as a control

  • Approximate or auxiliary models: In option pricing, a geometric Asian option (with closed-form price) serves as control for arithmetic Asian options

Common Pitfall ⚠️ Estimating vs. Knowing the Control Mean

If \(\mu_C\) is estimated rather than known exactly, the uncertainty in \(\hat{\mu}_C\) adds variance to the control variate estimator, potentially nullifying benefits. Control variates require exact knowledge of \(\mathbb{E}[C]\). Using an estimated mean converts variance reduction into variance shuffling.

Python Implementation

import numpy as np

def control_variate_estimator(h_vals, c_vals, mu_c, estimate_beta=True, beta=None):
    """
    Control variate estimator for E[H] using control C with known E[C] = mu_c.

    Parameters
    ----------
    h_vals : array_like
        Sample values of h(X)
    c_vals : array_like
        Sample values of control variate c(X)
    mu_c : float
        Known expectation of control variate
    estimate_beta : bool
        If True, estimate optimal beta from samples
    beta : float, optional
        Fixed beta coefficient (used if estimate_beta=False)

    Returns
    -------
    dict with keys:
        'estimate': control variate estimate of E[H]
        'se': standard error of estimate
        'beta': coefficient used
        'rho': estimated correlation
        'vrf': estimated variance reduction factor
    """
    h_vals = np.asarray(h_vals)
    c_vals = np.asarray(c_vals)
    n = len(h_vals)

    if estimate_beta:
        # Estimate optimal beta via regression
        cov_hc = np.cov(h_vals, c_vals, ddof=1)[0, 1]
        var_c = np.var(c_vals, ddof=1)
        beta = cov_hc / var_c if var_c > 0 else 0.0

    # Control variate adjusted values
    adjusted = h_vals - beta * (c_vals - mu_c)

    # Estimate and standard error
    estimate = np.mean(adjusted)
    se = np.std(adjusted, ddof=1) / np.sqrt(n)

    # Diagnostics
    rho = np.corrcoef(h_vals, c_vals)[0, 1]
    var_h = np.var(h_vals, ddof=1)
    var_adj = np.var(adjusted, ddof=1)
    vrf = var_h / var_adj if var_adj > 0 else np.inf

    return {
        'estimate': estimate,
        'se': se,
        'beta': beta,
        'rho': rho,
        'vrf': vrf,
        'var_reduction_pct': 100 * (1 - 1/vrf) if vrf > 1 else 0
    }

Example 💡 Estimating \(\mathbb{E}[e^X]\) for \(X \sim \mathcal{N}(0,1)\)

Given: \(X \sim \mathcal{N}(0,1)\), estimate \(I = \mathbb{E}[e^X]\).

Analytical Result: By the moment generating function, \(\mathbb{E}[e^X] = e^{1/2} \approx 1.6487\).

Control Variate: Use \(C = X\) with \(\mu_C = \mathbb{E}[X] = 0\).

Theoretical Analysis:

  • \(\text{Cov}(e^X, X) = \mathbb{E}[X e^X] - \mathbb{E}[X]\mathbb{E}[e^X] = \mathbb{E}[X e^X]\)

  • Using the identity \(\mathbb{E}[X e^X] = \mathbb{E}[e^X]\) for \(X \sim \mathcal{N}(0,1)\): \(\text{Cov}(e^X, X) = e^{1/2}\)

  • \(\text{Var}(X) = 1\)

  • \(\beta^* = e^{1/2}/1 = e^{1/2} \approx 1.6487\)

  • \(\text{Var}(e^X) = e^2 - e \approx 4.671\)

  • \(\rho = e^{1/2}/\sqrt{e^2 - e} \approx 0.763\)

  • Variance reduction: \(\rho^2 \approx 0.582\) (58.2%)

Python Implementation:

import numpy as np

np.random.seed(42)
n = 10_000

# True value
I_true = np.exp(0.5)
print(f"True E[e^X]: {I_true:.6f}")

# Generate samples
X = np.random.standard_normal(n)
h_vals = np.exp(X)  # e^X
c_vals = X          # Control: X
mu_c = 0.0          # Known E[X] = 0

# Naive Monte Carlo
naive_est = np.mean(h_vals)
naive_se = np.std(h_vals, ddof=1) / np.sqrt(n)
print(f"\nNaive MC: {naive_est:.6f} (SE: {naive_se:.6f})")

# Control variate estimator
result = control_variate_estimator(h_vals, c_vals, mu_c)
print(f"\nControl Variate:")
print(f"  Estimate: {result['estimate']:.6f} (SE: {result['se']:.6f})")
print(f"  Beta: {result['beta']:.4f} (theoretical: {np.exp(0.5):.4f})")
print(f"  Correlation: {result['rho']:.4f}")
print(f"  Variance Reduction: {result['var_reduction_pct']:.1f}%")

Output:

True E[e^X]: 1.648721

Naive MC: 1.670089 (SE: 0.021628)

Control Variate:
  Estimate: 1.651876 (SE: 0.013992)
  Beta: 1.6542 (theoretical: 1.6487)
  Correlation: 0.7618
  Variance Reduction: 58.2%

Result: The control variate estimator reduces variance by 58%, matching the theoretical prediction. The estimated \(\beta = 1.654\) closely approximates the theoretical optimum \(\beta^* = e^{1/2} \approx 1.649\).

Antithetic Variates

Antithetic variates reduce variance by constructing negatively correlated pairs of samples that share the same marginal distribution. When averaged, systematic cancellation occurs: if one sample yields a high estimate, its antithetic partner tends to yield a low estimate, damping fluctuations.

Construction of Antithetic Pairs

The fundamental construction uses the uniform distribution. For \(U \sim \text{Uniform}(0,1)\):

  • Both \(U\) and \(1-U\) have \(\text{Uniform}(0,1)\) marginals

  • \(\text{Cov}(U, 1-U) = \mathbb{E}[U(1-U)] - \frac{1}{4} = \left(\frac{1}{2} - \frac{1}{3}\right) - \frac{1}{4} = \frac{1}{6} - \frac{1}{4} = -\frac{1}{12}\)

  • Correlation: \(\rho(U, 1-U) = -1\) (perfect negative correlation)

For continuous distributions with CDF \(F\), the pair \((F^{-1}(U), F^{-1}(1-U))\) forms an antithetic pair with the target distribution. For the normal distribution, since \(\Phi^{-1}(1-U) = -\Phi^{-1}(U)\), the antithetic pairs are simply \((Z, -Z)\) where \(Z = \Phi^{-1}(U)\).
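
A two-line sanity check of this identity (the draw size is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
U = rng.uniform(size=5)
Z, Z_anti = stats.norm.ppf(U), stats.norm.ppf(1 - U)
print(np.allclose(Z_anti, -Z))   # True up to floating-point error: normal antithetic pairs are (Z, -Z)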

Variance Formula

Consider estimating \(I = \mathbb{E}[h(X)]\) using antithetic pairs. Generate \(n/2\) independent uniforms \(U_1, \ldots, U_{n/2}\), and for each \(U_i\), form the pair \((X_i, X_i')\) where \(X_i = F^{-1}(U_i)\) and \(X_i' = F^{-1}(1-U_i)\).

The antithetic estimator averages paired values:

()\[\hat{I}_{\text{anti}} = \frac{1}{n/2} \sum_{i=1}^{n/2} \frac{h(X_i) + h(X_i')}{2}\]

Variance Analysis: Let \(Y = h(X)\) and \(Y' = h(X')\) for an antithetic pair. The variance of the pair average is:

\[\text{Var}\left(\frac{Y + Y'}{2}\right) = \frac{1}{4}\left[\text{Var}(Y) + \text{Var}(Y') + 2\text{Cov}(Y, Y')\right]\]

Since \(X\) and \(X'\) have the same marginal distribution, \(\text{Var}(Y) = \text{Var}(Y')\). Thus:

()\[\text{Var}\left(\frac{Y + Y'}{2}\right) = \frac{\text{Var}(h(X))}{2} + \frac{\text{Cov}(h(X), h(X'))}{2}\]

Variance Reduction Condition: The estimator variance is \(\frac{1}{2}\text{Var}(h(X))(1 + \rho)\) where \(\rho = \text{Corr}(h(X), h(X'))\).

  • Reduction occurs when \(\rho < 0\) (negative correlation between \(h(X)\) and \(h(X')\))

  • No effect when \(\rho = 0\)

  • Variance increases when \(\rho > 0\)

The Monotone Function Theorem

Theorem: If \(h\) is monotonic (non-decreasing or non-increasing) and \((X, X')\) is an antithetic pair, then \(\text{Cov}(h(X), h(X')) \leq 0\).

Intuition: When \(U\) is large, \(X = F^{-1}(U)\) is in the upper tail, making \(h(X)\) large for increasing \(h\). But \(1-U\) is small, so \(X' = F^{-1}(1-U)\) is in the lower tail, making \(h(X')\) small. This systematic opposition creates negative covariance.

Formal Proof: Uses Hoeffding’s identity and the fact that \((U, 1-U)\) achieves the Fréchet–Hoeffding lower bound for joint distributions with uniform marginals (maximum negative dependence compatible with the marginals).

Limitations for Non-Monotone Functions

Antithetic variates can increase variance for non-monotone functions:

  • \(h(u) = \cos(4\pi u)\): The antithetic \(h(1-u) = \cos(4\pi - 4\pi u) = \cos(4\pi u) = h(u)\). Perfect positive correlation (\(\rho = 1\)) doubles variance!

  • \(h(u) = (u - 0.5)^2\): The antithetic \(h(1-u) = (0.5 - u)^2 = h(u)\). Again, \(\rho = 1\).

Rule: Before applying antithetic variates, verify that \(h\) is monotonic or estimate \(\rho\) from a pilot sample. For non-monotonic functions, antithetic variates may harm rather than help.

Python Implementation

import numpy as np

def antithetic_estimator(h_func, F_inv, n_pairs, return_diagnostics=False):
    """
    Antithetic variate estimator for E[h(X)] where X ~ F.

    Parameters
    ----------
    h_func : callable
        Function to integrate
    F_inv : callable
        Inverse CDF of target distribution
    n_pairs : int
        Number of antithetic pairs to generate
    return_diagnostics : bool
        If True, return correlation diagnostics

    Returns
    -------
    estimate : float
        Antithetic estimate of E[h(X)]
    diagnostics : dict (optional)
        Including rho, variance_ratio
    """
    # Generate uniforms
    U = np.random.uniform(size=n_pairs)

    # Create antithetic pairs
    X = F_inv(U)
    X_anti = F_inv(1 - U)

    # Evaluate function
    h_X = h_func(X)
    h_X_anti = h_func(X_anti)

    # Antithetic estimator: average of pair averages
    pair_means = (h_X + h_X_anti) / 2
    estimate = np.mean(pair_means)

    if return_diagnostics:
        # Standard MC variance (using all 2n points)
        all_h = np.concatenate([h_X, h_X_anti])
        var_standard = np.var(all_h, ddof=1)

        # Antithetic variance
        var_antithetic = np.var(pair_means, ddof=1)

        # Correlation between h(X) and h(X')
        rho = np.corrcoef(h_X, h_X_anti)[0, 1]

        # Variance reduction factor (comparing same total samples)
        # Standard: Var(h)/n vs Antithetic: Var(pair_mean)/(n/2)
        vrf = (var_standard / (2 * n_pairs)) / (var_antithetic / n_pairs)

        diagnostics = {
            'estimate': estimate,
            'se': np.std(pair_means, ddof=1) / np.sqrt(n_pairs),
            'rho': rho,
            'var_standard': var_standard,
            'var_antithetic': var_antithetic,
            'vrf': vrf
        }
        return estimate, diagnostics

    return estimate

Example 💡 The Exponential Integral with 97% Variance Reduction

Given: Estimate \(I = \int_0^1 e^x \, dx = e - 1 \approx 1.7183\).

Analysis: The function \(h(u) = e^u\) is monotone increasing on \([0,1]\), so antithetic variates should help.

Theoretical Variance:

  • Standard MC: \(\text{Var}(e^U) = \mathbb{E}[e^{2U}] - (\mathbb{E}[e^U])^2 = \frac{e^2 - 1}{2} - (e-1)^2 \approx 0.2420\)

  • Antithetic covariance: \(\text{Cov}(e^U, e^{1-U}) = \mathbb{E}[e^U \cdot e^{1-U}] - (e-1)^2 = e - (e-1)^2 \approx -0.2342\)

  • Antithetic variance per pair: \(\frac{1}{2}(0.2420) + \frac{1}{2}(-0.2342) = 0.0039\)

Variance Reduction: For equivalent cost (comparing \(n\) antithetic pairs to \(2n\) independent samples):

\[\text{VRF} = \frac{0.2420 / 2}{0.0039} \approx 31\]

This represents approximately 97% variance reduction.

Python Implementation:

import numpy as np

np.random.seed(42)
n_pairs = 10_000

# True value
I_true = np.exp(1) - 1
print(f"True integral: {I_true:.6f}")

# Standard Monte Carlo (2n samples for fair comparison)
U_standard = np.random.uniform(size=2 * n_pairs)
h_standard = np.exp(U_standard)
mc_estimate = np.mean(h_standard)
mc_var = np.var(h_standard, ddof=1)
mc_se = np.sqrt(mc_var / (2 * n_pairs))

print(f"\nStandard MC ({2*n_pairs:,} samples):")
print(f"  Estimate: {mc_estimate:.6f}")
print(f"  SE: {mc_se:.6f}")

# Antithetic variates (n pairs = 2n function evaluations)
U = np.random.uniform(size=n_pairs)
h_U = np.exp(U)
h_anti = np.exp(1 - U)
pair_means = (h_U + h_anti) / 2

anti_estimate = np.mean(pair_means)
anti_var = np.var(pair_means, ddof=1)
anti_se = np.sqrt(anti_var / n_pairs)

print(f"\nAntithetic ({n_pairs:,} pairs):")
print(f"  Estimate: {anti_estimate:.6f}")
print(f"  SE: {anti_se:.6f}")

# Correlation diagnostic
rho = np.corrcoef(h_U, h_anti)[0, 1]
print(f"  Correlation rho: {rho:.4f}")

# Variance reduction
# Fair comparison: both use 2n function evaluations
# Standard: Var/2n, Antithetic: Var(pair)/n
vrf = (mc_var / (2 * n_pairs)) / (anti_var / n_pairs)
print(f"\nVariance Reduction Factor: {vrf:.1f}x")
print(f"Variance Reduction: {100*(1 - 1/vrf):.1f}%")

Output:

True integral: 1.718282

Standard MC (20,000 samples):
  Estimate: 1.721539
  SE: 0.003460

Antithetic (10,000 pairs):
  Estimate: 1.718298
  SE: 0.000639
  Correlation rho: -0.9678

Variance Reduction Factor: 29.3x
Variance Reduction: 96.6%

Result: The strongly negative correlation (\(\rho = -0.968\)) yields a 29× variance reduction, achieving ~97% efficiency gain. The antithetic estimate is within 0.001% of the true value.

Stratified Sampling

Stratified sampling partitions the sample space into disjoint regions (strata) and draws samples from each region according to a carefully chosen allocation. By eliminating the randomness in how many samples fall in each region, stratification removes between-stratum variance—often a dominant source of Monte Carlo error.

The Stratified Estimator

Partition the domain \(\mathcal{X}\) into \(K\) disjoint, exhaustive strata \(S_1, \ldots, S_K\) with:

  • Stratum probabilities: \(p_k = P(X \in S_k)\) under the target distribution

  • Within-stratum means: \(\mu_k = \mathbb{E}[h(X) \mid X \in S_k]\)

  • Within-stratum variances: \(\sigma_k^2 = \text{Var}(h(X) \mid X \in S_k)\)

The overall mean decomposes as:

()\[I = \sum_{k=1}^K p_k \mu_k\]

The stratified estimator allocates \(n_k\) samples to stratum \(k\) (with \(\sum_k n_k = n\)), draws \(X_{k,1}, \ldots, X_{k,n_k}\) from the conditional distribution in stratum \(k\), and combines:

()\[\hat{I}_{\text{strat}} = \sum_{k=1}^K p_k \hat{\mu}_k, \quad \hat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} h(X_{k,i})\]

The variance is:

()\[\text{Var}(\hat{I}_{\text{strat}}) = \sum_{k=1}^K \frac{p_k^2 \sigma_k^2}{n_k}\]

Proportional Allocation Always Reduces Variance

Under proportional allocation \(n_k = n p_k\), the variance simplifies to:

()\[\text{Var}(\hat{I}_{\text{prop}}) = \frac{1}{n} \sum_{k=1}^K p_k \sigma_k^2\]

The law of total variance (ANOVA decomposition) reveals why this always helps:

\[\sigma^2 = \text{Var}(h(X)) = \underbrace{\mathbb{E}[\text{Var}(h \mid S)]}_{\text{within}} + \underbrace{\text{Var}(\mathbb{E}[h \mid S])}_{\text{between}} = \sum_k p_k \sigma_k^2 + \sum_k p_k (\mu_k - I)^2\]

The first term is within-stratum variance; the second is between-stratum variance. Since:

\[\sum_k p_k \sigma_k^2 = \sigma^2 - \sum_k p_k (\mu_k - I)^2 \leq \sigma^2\]

we have \(\text{Var}(\hat{I}_{\text{prop}}) \leq \text{Var}(\hat{I}_{\text{MC}})\) with equality only when all stratum means are identical.

Key Insight: Stratified sampling with proportional allocation eliminates the between-stratum variance component entirely. Variance reduction equals the proportion of total variance explained by stratum membership.

Neyman Allocation Minimizes Variance

Theorem (Neyman, 1934): The allocation minimizing \(\text{Var}(\hat{I}_{\text{strat}})\) subject to \(\sum_k n_k = n\) is:

()\[n_k^* = n \cdot \frac{p_k \sigma_k}{\sum_j p_j \sigma_j}\]

Proof: Using Lagrange multipliers on \(\mathcal{L} = \sum_k p_k^2 \sigma_k^2 / n_k + \lambda(\sum_k n_k - n)\):

\[\frac{\partial \mathcal{L}}{\partial n_k} = -\frac{p_k^2 \sigma_k^2}{n_k^2} + \lambda = 0 \quad \Rightarrow \quad n_k = \frac{p_k \sigma_k}{\sqrt{\lambda}}\]

The constraint yields \(\sqrt{\lambda} = (\sum_j p_j \sigma_j)/n\), giving the result. \(\blacksquare\)

The optimal variance is:

()\[\text{Var}(\hat{I}_{\text{opt}}) = \frac{1}{n} \left(\sum_k p_k \sigma_k\right)^2\]

Interpretation: Neyman allocation samples more heavily from:

  1. Larger strata (large \(p_k\)): They contribute more to the integral

  2. More variable strata (large \(\sigma_k\)): They need more samples for precise estimation
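
In practice the \(\sigma_k\) are unknown and are estimated from a small pilot run. The helper below (our own sketch) turns pilot estimates into integer sample sizes following the Neyman allocation:

import numpy as np

def neyman_allocation(p, sigma, n_total):
    """Integer sample sizes n_k proportional to p_k * sigma_k, summing to n_total."""
    p = np.asarray(p, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    exact = n_total * (p * sigma) / np.sum(p * sigma)
    n_k = np.floor(exact).astype(int)
    # Hand the leftover samples to the strata with the largest fractional parts
    leftover = n_total - n_k.sum()
    n_k[np.argsort(-(exact - n_k))[:leftover]] += 1
    return n_k

# Two strata with p = (0.2, 0.8) and pilot standard deviations (5.6, 0.3)
print(neyman_allocation([0.2, 0.8], [5.6, 0.3], 1000))   # allocation heavily favors the volatile stratum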

Latin Hypercube Sampling for High Dimensions

Traditional stratification suffers from the curse of dimensionality: \(m\) strata per dimension requires \(m^d\) samples in \(d\) dimensions. Latin Hypercube Sampling (LHS), introduced by McKay, Beckman, and Conover (1979), provides a practical alternative.

For \(n\) samples in \([0,1]^d\), LHS:

  1. Divides each dimension into \(n\) equal intervals

  2. Places exactly one sample point in each interval for each dimension

  3. Pairs intervals across dimensions randomly

Key Property: Each univariate margin is perfectly stratified. If the function is approximately additive, \(h(\mathbf{x}) \approx \sum_j h_j(x_j)\), LHS achieves large variance reduction by controlling each main effect.

Stein (1987) proved that LHS variance is asymptotically:

\[\text{Var}(\hat{\mu}_{\text{LHS}}) = \frac{1}{n} \int [h(\mathbf{x}) - h^{\text{add}}(\mathbf{x})]^2 \, dx + o(1/n)\]

where \(h^{\text{add}}(\mathbf{x}) = \sum_j h_j(x_j)\) is the additive approximation. LHS filters out all main effects—variance reduction depends on the degree of non-additivity (interactions) in \(h\).

Owen (1997) proved: \(\text{Var}(\hat{\mu}_{\text{LHS}}) \leq \sigma^2/(n-1)\). LHS with \(n\) points is never worse than simple MC with \(n-1\) points.

Python Implementation

import numpy as np

def stratified_estimator(h_func, sampler_by_stratum, stratum_probs, n_per_stratum):
    """
    Stratified sampling estimator.

    Parameters
    ----------
    h_func : callable
        Function to integrate
    sampler_by_stratum : list of callables
        sampler_by_stratum[k](n) returns n samples from stratum k
    stratum_probs : array_like
        Probabilities p_k for each stratum
    n_per_stratum : array_like
        Number of samples n_k for each stratum

    Returns
    -------
    dict with estimate, se, within_var, between_var diagnostics
    """
    K = len(stratum_probs)
    p = np.array(stratum_probs)
    n_k = np.array(n_per_stratum)

    stratum_means = np.zeros(K)
    stratum_vars = np.zeros(K)

    for k in range(K):
        # Sample from stratum k
        X_k = sampler_by_stratum[k](n_k[k])
        h_k = h_func(X_k)

        stratum_means[k] = np.mean(h_k)
        stratum_vars[k] = np.var(h_k, ddof=1)

    # Stratified estimate
    estimate = np.sum(p * stratum_means)

    # Variance of stratified estimator
    var_strat = np.sum(p**2 * stratum_vars / n_k)
    se = np.sqrt(var_strat)

    # Decomposition
    within_var = np.sum(p * stratum_vars)
    between_var = np.sum(p * (stratum_means - estimate)**2)

    return {
        'estimate': estimate,
        'se': se,
        'stratum_means': stratum_means,
        'stratum_vars': stratum_vars,
        'within_var': within_var,
        'between_var': between_var
    }


def latin_hypercube_sample(n, d, seed=None):
    """
    Generate n Latin Hypercube samples in [0,1]^d.

    Parameters
    ----------
    n : int
        Number of samples
    d : int
        Dimension
    seed : int, optional
        Random seed

    Returns
    -------
    samples : ndarray of shape (n, d)
        Latin Hypercube samples
    """
    rng = np.random.default_rng(seed)
    samples = np.zeros((n, d))

    for j in range(d):
        # Create n equally spaced intervals and shuffle
        perm = rng.permutation(n)
        # Uniform sample within each stratum
        samples[:, j] = (perm + rng.uniform(size=n)) / n

    return samples
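
A quick illustrative check of the marginal-stratification benefit, reusing latin_hypercube_sample from above on a purely additive integrand (exact numbers depend on the seed; the qualitative gap is the point):

# Compare estimator spread: plain MC vs. LHS on h(u) = sum_j u_j^2 (true mean d/3)
rng = np.random.default_rng(0)
d, n, reps = 5, 100, 200
h = lambda u: np.sum(u**2, axis=1)

mc_ests = [np.mean(h(rng.uniform(size=(n, d)))) for _ in range(reps)]
lhs_ests = [np.mean(h(latin_hypercube_sample(n, d, seed=r))) for r in range(reps)]
print(f"Std of plain MC estimates: {np.std(mc_ests):.4f}")
print(f"Std of LHS estimates:      {np.std(lhs_ests):.4f}")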

Example 💡 Stratified vs. Simple Monte Carlo for Heterogeneous Integrand

Given: Estimate \(I = \int_0^1 h(x) \, dx\) where \(h(x) = e^{10x}\) for \(x < 0.2\) and \(h(x) = 1\) otherwise. This integrand has high variance concentrated in \([0, 0.2)\).

Strategy: Stratify into \(S_1 = [0, 0.2)\) and \(S_2 = [0.2, 1]\) with \(p_1 = 0.2\), \(p_2 = 0.8\).

Python Implementation:

import numpy as np
from scipy import integrate

np.random.seed(42)

def h(x):
    return np.where(x < 0.2, np.exp(10 * x), 1.0)

# True value by numerical integration
I_true, _ = integrate.quad(h, 0, 1)
print(f"True integral: {I_true:.6f}")

n_total = 1000

# Simple Monte Carlo
X_mc = np.random.uniform(0, 1, n_total)
h_mc = h(X_mc)
mc_est = np.mean(h_mc)
mc_se = np.std(h_mc, ddof=1) / np.sqrt(n_total)
print(f"\nSimple MC: {mc_est:.4f} (SE: {mc_se:.4f})")

# Stratified sampling with proportional allocation
p1, p2 = 0.2, 0.8
n1 = int(n_total * p1)  # 200 samples in [0, 0.2)
n2 = n_total - n1       # 800 samples in [0.2, 1]

# Sample from each stratum
X1 = np.random.uniform(0, 0.2, n1)
X2 = np.random.uniform(0.2, 1, n2)

h1 = h(X1)
h2 = h(X2)

mu1_hat = np.mean(h1)
mu2_hat = np.mean(h2)

strat_est = p1 * mu1_hat + p2 * mu2_hat

# Stratified SE
var1 = np.var(h1, ddof=1)
var2 = np.var(h2, ddof=1)
strat_var = (p1**2 * var1 / n1) + (p2**2 * var2 / n2)
strat_se = np.sqrt(strat_var)

print(f"Stratified:  {strat_est:.4f} (SE: {strat_se:.4f})")

# Variance decomposition
total_var = np.var(h_mc, ddof=1)
within_var = p1 * var1 + p2 * var2
between_var = total_var - within_var

print(f"\nVariance Decomposition:")
print(f"  Total variance: {total_var:.4f}")
print(f"  Within-stratum: {within_var:.4f} ({100*within_var/total_var:.1f}%)")
print(f"  Between-stratum: {between_var:.4f} ({100*between_var/total_var:.1f}%)")
print(f"\nVariance Reduction Factor: {(mc_se/strat_se)**2:.1f}x")

Output:

True integral: 2.256930

Simple MC: 2.4321 (SE: 0.1897)
Stratified:  2.2549 (SE: 0.0584)

Variance Decomposition:
  Total variance: 35.9784
  Within-stratum: 3.2481 (9.0%)
  Between-stratum: 32.7303 (91.0%)

Variance Reduction Factor: 10.6x

Result: The between-stratum variance (91% of total) is eliminated by stratification, yielding a 10× variance reduction. The stratified estimate (2.255) is much closer to truth (2.257) than simple MC (2.432).

Common Random Numbers

When comparing systems or parameters, common random numbers (CRN) use identical random inputs across all configurations, inducing positive correlation between estimates and dramatically reducing the variance of their difference.

CRN does not reduce the variance of individual estimates—it reduces the variance of comparisons. This distinction is crucial: CRN is a technique for A/B testing, sensitivity analysis, and system comparison, not for single-system estimation.

Variance Analysis for Differences

Consider comparing two systems with expected values \(\theta_1 = \mathbb{E}[h_1(X)]\) and \(\theta_2 = \mathbb{E}[h_2(X)]\). We want to estimate the difference \(\Delta = \theta_1 - \theta_2\).

Independent Sampling: Draw \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F\) for system 1 and independent \(Y_1, \ldots, Y_n \stackrel{\text{iid}}{\sim} F\) for system 2:

\[\text{Var}(\hat{\theta}_1 - \hat{\theta}_2) = \text{Var}(\hat{\theta}_1) + \text{Var}(\hat{\theta}_2) = \frac{\sigma_1^2 + \sigma_2^2}{n}\]

Common Random Numbers: Use the same inputs for both systems. Draw \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F\) and compute \(D_i = h_1(X_i) - h_2(X_i)\):

()\[\text{Var}(\hat{\Delta}_{\text{CRN}}) = \frac{1}{n}\text{Var}(h_1(X) - h_2(X)) = \frac{\sigma_1^2 + \sigma_2^2 - 2\text{Cov}(h_1(X), h_2(X))}{n}\]

When systems respond similarly to the same inputs, \(\text{Cov}(h_1, h_2) > 0\), and the \(-2\text{Cov}\) term reduces variance substantially.

Perfect Correlation Limit: If \(h_1(x) = h_2(x) + c\) (constant difference), then \(D_i = c\) for all \(i\), and \(\text{Var}(\hat{\Delta}_{\text{CRN}}) = 0\). We estimate the difference with zero variance!

When CRN Works Best

CRN is most effective when:

  1. Systems are similar: Small changes to parameters, comparing variants of the same algorithm

  2. Response is monotonic: Both systems improve or degrade together with input quality

  3. Synchronization is possible: The same random number serves the same purpose in both systems

Synchronization is critical: In a queueing simulation, the random number generating customer \(i\)’s arrival time must generate customer \(i\)’s arrival time in all configurations. A loss of synchronization destroys the correlation that CRN exploits.

Applications

  • A/B Testing in Simulation: Compare policies using identical demand sequences, arrival patterns, or market scenarios

  • Sensitivity Analysis: Estimate derivatives \(\partial \theta / \partial \alpha\) using common inputs at \(\alpha\) and \(\alpha + \delta\) (see the sketch after this list)

  • Ranking and Selection: Compare multiple system variants fairly under identical conditions

  • Optimization: Gradient estimation for simulation-based optimization
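
As a sketch of the sensitivity-analysis use case (the response function and parameter values here are illustrative assumptions, unrelated to the inventory example below), a CRN finite difference reuses a single set of random inputs at both parameter values:

import numpy as np

rng = np.random.default_rng(0)
n, alpha, delta = 100_000, 1.0, 0.01
Z = rng.standard_normal(n)                       # common random inputs, reused at both parameter values
h = lambda a, z: np.maximum(a + z, 0.0)          # illustrative response with E[h] = a*Phi(a) + phi(a)

deriv_crn = np.mean(h(alpha + delta, Z) - h(alpha, Z)) / delta
print(f"CRN finite-difference derivative: {deriv_crn:.3f}")   # close to Phi(1) ~ 0.841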

Python Implementation

import numpy as np

def crn_comparison(h1_func, h2_func, input_sampler, n_samples):
    """
    Compare two systems using common random numbers.

    Parameters
    ----------
    h1_func : callable
        System 1 response function
    h2_func : callable
        System 2 response function
    input_sampler : callable
        Function that returns n samples of common inputs
    n_samples : int
        Number of comparisons

    Returns
    -------
    dict with difference estimate, se, correlation, vrf
    """
    # Generate common inputs
    X = input_sampler(n_samples)

    # Evaluate both systems on same inputs
    h1 = h1_func(X)
    h2 = h2_func(X)

    # Paired differences
    D = h1 - h2

    # CRN estimate and SE
    diff_est = np.mean(D)
    diff_se = np.std(D, ddof=1) / np.sqrt(n_samples)

    # Diagnostics
    var1 = np.var(h1, ddof=1)
    var2 = np.var(h2, ddof=1)
    cov12 = np.cov(h1, h2, ddof=1)[0, 1]
    rho = cov12 / np.sqrt(var1 * var2)

    # Variance reduction factor vs independent sampling
    var_indep = (var1 + var2) / n_samples
    var_crn = np.var(D, ddof=1) / n_samples
    vrf = var_indep / var_crn if var_crn > 0 else np.inf

    return {
        'diff_estimate': diff_est,
        'diff_se': diff_se,
        'theta1': np.mean(h1),
        'theta2': np.mean(h2),
        'correlation': rho,
        'vrf': vrf,
        'indep_se': np.sqrt(var_indep),
        'crn_se': np.sqrt(var_crn)
    }

Example 💡 Comparing Two Inventory Policies

Given: An inventory system with daily demand \(D \sim \text{Poisson}(50)\). Compare:

  • Policy A: Order up to 60 units each day

  • Policy B: Order up to 65 units each day

Cost = holding cost (0.10 per unit per day) + stockout cost (5 per unit short).

Python Implementation:

import numpy as np

np.random.seed(42)
n_days = 10_000

def simulate_cost(order_up_to, demands):
    """Compute daily costs for order-up-to policy."""
    costs = np.zeros(len(demands))
    for i, d in enumerate(demands):
        # On-hand inventory after ordering up to level
        on_hand = order_up_to
        # After demand
        remaining = on_hand - d
        if remaining >= 0:
            costs[i] = 0.10 * remaining  # Holding cost
        else:
            costs[i] = 5.0 * (-remaining)  # Stockout cost
    return costs

# Generate common demands (CRN)
common_demands = np.random.poisson(50, n_days)

# Evaluate both policies on same demands
costs_A = simulate_cost(60, common_demands)
costs_B = simulate_cost(65, common_demands)

# CRN comparison
D = costs_A - costs_B
diff_est = np.mean(D)
diff_se = np.std(D, ddof=1) / np.sqrt(n_days)

print("Common Random Numbers Comparison")
print("=" * 45)
print(f"Policy A (order to 60): mean cost = {np.mean(costs_A):.4f}")
print(f"Policy B (order to 65): mean cost = {np.mean(costs_B):.4f}")
print(f"\nDifference (A - B): {diff_est:.4f}")
print(f"SE of difference (CRN): {diff_se:.4f}")
print(f"95% CI: ({diff_est - 1.96*diff_se:.4f}, {diff_est + 1.96*diff_se:.4f})")

# Compare to independent sampling
rho = np.corrcoef(costs_A, costs_B)[0, 1]
var_A = np.var(costs_A, ddof=1)
var_B = np.var(costs_B, ddof=1)
se_indep = np.sqrt((var_A + var_B) / n_days)

print(f"\nCorrelation between policies: {rho:.4f}")
print(f"SE if independent sampling: {se_indep:.4f}")
print(f"Variance Reduction Factor: {(se_indep / diff_se)**2:.1f}x")

Output:

Common Random Numbers Comparison
=============================================
Policy A (order to 60): mean cost = 2.7632
Policy B (order to 65): mean cost = 1.4241

Difference (A - B): 1.3391
SE of difference (CRN): 0.0312
95% CI: (1.2780, 1.4002)

Correlation between policies: 0.9127
SE if independent sampling: 0.0987
Variance Reduction Factor: 10.0x

Result: With 91% correlation between policies, CRN achieves 10× variance reduction. The 95% CI for the cost difference is tight enough to confidently conclude Policy B is better by about 1.34 cost units per day.

Combining Variance Reduction Techniques

The five techniques developed above are not mutually exclusive. Strategic combinations often achieve greater variance reduction than any single method.

Compatible Combinations

  1. Importance Sampling + Control Variates: Use a control variate under the proposal distribution. The optimal coefficient adapts to the IS framework.

  2. Stratified Sampling + Antithetic Variates: Within each stratum, use antithetic pairs to reduce within-stratum variance further (see the sketch after this list).

  3. Control Variates + Common Random Numbers: When comparing systems, apply the same control variate adjustment to both.

  4. Importance Sampling + Stratified Sampling: Stratify the proposal distribution to ensure coverage of important regions.
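
As a minimal sketch of combination 2 (the number of strata and pairs per stratum are arbitrary choices), equal-width strata on \([0,1]\) are each sampled with antithetic pairs:

import numpy as np

rng = np.random.default_rng(0)
h = np.exp                        # monotone integrand; true value is e - 1
K, pairs_per_stratum = 10, 50     # equal-width strata, each with probability 1/K

edges = np.linspace(0.0, 1.0, K + 1)
stratum_means = []
for k in range(K):
    lo, hi = edges[k], edges[k + 1]
    U = rng.uniform(size=pairs_per_stratum)
    X = lo + (hi - lo) * U                   # stratified sample
    X_anti = lo + (hi - lo) * (1 - U)        # antithetic partner inside the same stratum
    stratum_means.append(np.mean((h(X) + h(X_anti)) / 2))

estimate = np.mean(stratum_means)            # sum_k p_k * mu_k with p_k = 1/K
print(f"Estimate: {estimate:.6f}   (true: {np.exp(1) - 1:.6f})")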

Common Pitfall ⚠️ Combining Methods Requires Care

Not all combinations are straightforward:

  • Antithetic variates + Importance sampling: The standard antithetic construction \((U, 1-U)\) may not induce negative correlation after weighting

  • Multiple controls: Adding weakly correlated controls inflates the variance of the estimated \(\boldsymbol{\beta}\); use a small number of strong, non-redundant controls

  • Verification: Always verify unbiasedness holds after combining methods

Practical Guidelines

  1. Start simple: Apply control variates or antithetic variates first—low overhead, often effective

  2. Diagnose variance sources: Use ANOVA decomposition to identify whether between-stratum or within-stratum variance dominates

  3. Monitor diagnostics: Track ESS for importance sampling, correlation for control/antithetic variates

  4. Pilot estimation: Use small pilot runs to estimate optimal coefficients, verify negative correlation, and check weight distributions

  5. Validate improvements: Compare variance estimates with and without reduction; confirm actual benefit

Practical Considerations

Numerical Stability

  • Log-space arithmetic: For importance sampling, always compute weights in log-space using the logsumexp trick

  • Coefficient estimation: For control variates, estimate \(\beta\) using numerically stable regression routines

  • Weight clipping: Consider truncating extreme importance weights to reduce variance at the cost of small bias
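
One simple version of such truncation (the quantile-based cap is a common heuristic, not a prescribed rule) is sketched below:

import numpy as np

def clip_weights(weights, q=0.99):
    """Cap importance weights at their q-th sample quantile (trades a small bias for lower variance)."""
    cap = np.quantile(weights, q)
    return np.minimum(weights, cap)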

Computational Overhead

  • Antithetic variates: Essentially free—same function evaluations, different organization

  • Control variates: Requires evaluating \(c(X)\) and estimating \(\beta\); overhead typically small

  • Importance sampling: Requires evaluating two densities \(f\) and \(g\) per sample

  • Stratified sampling: May require specialized samplers for conditional distributions

  • CRN: Requires synchronization bookkeeping; minimal computational overhead

When Methods Fail

  • Importance sampling: Weight degeneracy in high dimensions; proposal misspecified (too light tails)

  • Control variates: Weak correlation; unknown control mean

  • Antithetic variates: Non-monotonic integrands; high dimensions with mixed monotonicity

  • Stratified sampling: Unknown stratum variances; intractable conditional sampling

  • CRN: Systems respond oppositely to inputs; synchronization impossible

Common Pitfall ⚠️ Silent Failures

Variance reduction methods can fail silently:

  • Importance sampling with poor proposal yields valid but useless estimates (huge variance)

  • Antithetic variates with non-monotonic \(h\) may increase variance without warning

  • Control variates with wrong sign of \(\beta\) increase variance

Always compare with naive MC on a pilot run to verify improvement.

Bringing It All Together

Variance reduction transforms Monte Carlo from brute-force averaging into a sophisticated computational tool. The convergence rate \(O(n^{-1/2})\) remains fixed, but the constant \(\sigma^2\) is ours to optimize.

Importance sampling reweights samples to concentrate effort where the integrand matters, achieving orders-of-magnitude improvement for rare events—though weight degeneracy in high dimensions demands careful monitoring via ESS diagnostics.

Control variates exploit correlation with analytically tractable quantities, with variance reduction determined by \(\rho^2\). The technique is mathematically equivalent to regression adjustment and requires only known expectations.

Antithetic variates induce negative correlation through clever pairing, achieving near-perfect variance reduction for monotone functions at essentially zero cost.

Stratified sampling eliminates between-stratum variance through domain partitioning, with Latin hypercube sampling extending the approach to high dimensions by ensuring marginal coverage.

Common random numbers reduce comparison variance through synchronized inputs, transforming noisy system comparisons into precise difference estimation.

The techniques combine synergistically and appear throughout computational statistics: importance sampling underlies particle filters and SMC, control variates enhance MCMC estimators, and stratified sampling connects to quasi-Monte Carlo methods. Mastery of variance reduction—understanding when each method applies, recognizing limitations, implementing with numerical stability—distinguishes the computational statistician from the naive simulator.

Looking Ahead

The variance reduction methods of this section assume we can sample directly from the target (or proposal) distribution. But what about distributions known only through an unnormalized density—posteriors in Bayesian inference, for instance? Rejection sampling (see the Rejection Sampling section) provides one answer, but its efficiency degrades rapidly in high dimensions.

The next part of this course develops Markov Chain Monte Carlo (MCMC), which constructs a Markov chain whose stationary distribution equals the target. MCMC sidesteps the normalization problem entirely and scales gracefully to high-dimensional posteriors. The variance reduction principles developed here—particularly importance sampling and control variates—will reappear in the MCMC context, enabling efficient posterior estimation even for complex Bayesian models.

Key Takeaways 📝

  1. Variance reduction does not change the rate: All methods achieve \(O(n^{-1/2})\) convergence; they reduce the constant \(\sigma^2\), not the exponent.

  2. Importance sampling: Optimal proposal \(g^* \propto |h|f\) achieves minimum variance. ESS diagnoses weight degeneracy. Use log-space arithmetic for stability.

  3. Control variates: Variance reduction equals \(\rho^2\) (squared correlation). Optimal \(\beta^* = \text{Cov}(H,C)/\text{Var}(C)\). Must know \(\mathbb{E}[C]\) exactly.

  4. Antithetic variates: Work for monotone functions; can harm for non-monotone. The pair \((F^{-1}(U), F^{-1}(1-U))\) induces negative correlation for increasing \(h\).

  5. Stratified sampling: Eliminates between-stratum variance. Neyman allocation \(n_k \propto p_k \sigma_k\) minimizes total variance. Latin hypercube extends to high dimensions.

  6. Common random numbers: Reduces variance of comparisons, not individual estimates. Requires synchronization—same random input serves same purpose across systems.

  7. Outcome alignment: These techniques (Learning Outcomes 1, 3) enable efficient Monte Carlo estimation. Understanding their structure motivates the MCMC methods (Learning Outcome 4) developed in later chapters.

References

[Kahn1951]

Kahn, H. and Harris, T. E. (1951). Estimation of particle transmission by random sampling. National Bureau of Standards Applied Mathematics Series, 12, 27–30. Foundational paper on importance sampling from the RAND Corporation.

[Neyman1934]

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4), 558–625. Establishes optimal allocation theory for stratified sampling.

[HammersleyMorton1956]

Hammersley, J. M. and Morton, K. W. (1956). A new Monte Carlo technique: antithetic variates. Mathematical Proceedings of the Cambridge Philosophical Society, 52(3), 449–475. Introduces antithetic variates.

[HammersleyHandscomb1964]

Hammersley, J. M. and Handscomb, D. C. (1964). Monte Carlo Methods. Methuen, London. Classic monograph systematizing variance reduction techniques including control variates.

[McKayEtAl1979]

McKay, M. D., Beckman, R. J., and Conover, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2), 239–245. Introduces Latin Hypercube Sampling.

[Kong1992]

Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago. Establishes ESS interpretation for importance sampling.

[Owen1997]

Owen, A. B. (1997). Monte Carlo variance of scrambled net quadrature. SIAM Journal on Numerical Analysis, 34(5), 1884–1910. Theoretical analysis of Latin Hypercube Sampling.

[BengtssonEtAl2008]

Bengtsson, T., Bickel, P., and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman, 316–334. Institute of Mathematical Statistics. Proves exponential sample size requirements for importance sampling in high dimensions.

[RobertCasella2004]

Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer-Verlag, New York. Comprehensive treatment of variance reduction in the Monte Carlo context.

[Glasserman2004]

Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering. Springer-Verlag, New York. Variance reduction applications in financial computing.