.. _ch2.6-variance-reduction-methods: ========================== Variance Reduction Methods ========================== The preceding sections developed the machinery for Monte Carlo integration: we estimate integrals :math:`I = \mathbb{E}_f[h(X)]` by averaging samples :math:`\hat{I}_n = n^{-1}\sum_{i=1}^n h(X_i)`. The Law of Large Numbers guarantees convergence, and the Central Limit Theorem quantifies uncertainty through the asymptotic relationship :math:`\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, \sigma^2)`. But there is a catch: the convergence rate :math:`O(n^{-1/2})` is immutable. To halve our standard error, we must quadruple our sample size. For problems requiring high precision or expensive function evaluations, this brute-force approach becomes prohibitive. **Variance reduction methods** attack this limitation not by changing the convergence *rate*—that remains fixed at :math:`O(n^{-1/2})`—but by shrinking the *constant* :math:`\sigma^2`. The estimator variance :math:`\text{Var}(\hat{I}_n) = \sigma^2/n` depends on two quantities we control: the sample size :math:`n` and the variance constant :math:`\sigma^2`. While increasing :math:`n` requires more computation, reducing :math:`\sigma^2` through clever sampling strategies can achieve the same precision at a fraction of the cost. This insight has deep roots in the history of computational science. Herman Kahn at the RAND Corporation developed **importance sampling** in 1949–1951 for nuclear shielding calculations, where naive Monte Carlo required billions of particle simulations to estimate rare transmission events. Jerzy Neyman's 1934 work on **stratified sampling** in survey statistics established optimal allocation theory decades before computers existed. **Antithetic variates**, introduced by Hammersley and Morton in 1956, and **control variates**, systematized in Hammersley and Handscomb's 1964 monograph, completed the classical variance reduction toolkit. These techniques, born from practical necessity, now form the foundation of computational efficiency in Monte Carlo simulation. The methods share a common philosophy: *exploit structure in the problem to reduce randomness in the estimate*. Importance sampling concentrates effort where the integrand matters most. Control variates leverage correlation with analytically tractable quantities. Antithetic variates induce cancellation through negative dependence. Stratified sampling ensures balanced coverage of the domain. Common random numbers synchronize randomness across comparisons to isolate true differences from sampling noise. This section develops five foundational variance reduction techniques with complete mathematical derivations, proofs of optimality, and practical Python implementations. We emphasize when each method excels, how to combine them synergistically, and what pitfalls await the unwary practitioner. .. admonition:: Road Map 🧭 :class: important • **Understand**: The theoretical foundations of variance reduction—how reweighting, correlation, and stratification reduce estimator variance without changing convergence rates • **Derive**: Optimal coefficients and allocations for each method, including proofs of the zero-variance ideal for importance sampling and the Neyman allocation for stratified sampling • **Implement**: Numerically stable Python code for all five techniques, with attention to log-space calculations, effective sample size diagnostics, and adaptive coefficient estimation • **Evaluate**: When each method applies, expected variance reduction factors, and potential failure modes—especially weight degeneracy in importance sampling and non-monotonicity failures in antithetic variates • **Connect**: How variance reduction relates to the rejection sampling of :ref:`ch2.5-rejection-sampling` and motivates the Markov Chain Monte Carlo methods of later chapters The Variance Reduction Paradigm ------------------------------- Before examining specific techniques, we establish the mathematical framework that unifies all variance reduction methods. The Fundamental Variance Decomposition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For a Monte Carlo estimator :math:`\hat{I}_n = n^{-1}\sum_{i=1}^n Y_i` where the :math:`Y_i` are i.i.d. with mean :math:`I` and variance :math:`\sigma^2`: .. math:: :label: variance-decomposition \text{Var}(\hat{I}_n) = \frac{\sigma^2}{n} = \frac{\text{Var}(Y)}{n} The mean squared error equals the variance (since :math:`\hat{I}_n` is unbiased), and precision scales as :math:`\sigma/\sqrt{n}`. Every variance reduction method operates by constructing alternative estimators :math:`\tilde{Y}_i` with the same expectation :math:`\mathbb{E}[\tilde{Y}] = I` but smaller variance :math:`\text{Var}(\tilde{Y}) < \sigma^2`. **Variance Reduction Factor (VRF)**: The ratio :math:`\text{VRF} = \sigma^2_{\text{naive}} / \sigma^2_{\text{reduced}}` quantifies improvement. A VRF of 10 means the variance-reduced estimator achieves the same precision as the naive estimator with 10× more samples. Equivalently, it achieves a given precision with only 10% of the computational effort. The Methods at a Glance ~~~~~~~~~~~~~~~~~~~~~~~ .. flat-table:: Comparison of Variance Reduction Techniques :header-rows: 1 :widths: 20 25 25 30 * - Method - Mechanism - Key Requirement - Best Use Case * - Importance Sampling - Sample from proposal :math:`g`, reweight by :math:`f/g` - Proposal covers target support - Rare events, heavy tails * - Control Variates - Subtract correlated variable with known mean - High correlation, known :math:`\mathbb{E}[C]` - Auxiliary quantities available * - Antithetic Variates - Use negatively correlated pairs - Monotonic integrand - Low-dimensional smooth functions * - Stratified Sampling - Force balanced coverage across strata - Partition domain into regions - Heterogeneous integrands * - Common Random Numbers - Share randomness across comparisons - Comparing similar systems - A/B tests, sensitivity analysis Importance Sampling ------------------- Importance sampling transforms Monte Carlo integration by sampling from a carefully chosen **proposal distribution** rather than the target. By concentrating computational effort where the integrand contributes most, importance sampling can achieve variance reductions of several orders of magnitude—essential for rare event estimation where naive Monte Carlo is hopelessly inefficient. The Basic Estimator ~~~~~~~~~~~~~~~~~~~ Consider estimating :math:`I = \mathbb{E}_f[h(X)] = \int h(x) f(x) \, dx` where :math:`f(x)` is the target density. The key insight is to rewrite this integral using any **proposal density** :math:`g(x)`: .. math:: :label: is-identity I = \int h(x) f(x) \, dx = \int h(x) \frac{f(x)}{g(x)} g(x) \, dx = \mathbb{E}_g\left[ h(X) w(X) \right] where :math:`w(x) = f(x)/g(x)` is the **importance weight** (also called the likelihood ratio or Radon–Nikodym derivative). Given i.i.d. samples :math:`X_1, \ldots, X_n \sim g`, the importance sampling estimator is: .. math:: :label: is-estimator \hat{I}_{\text{IS}} = \frac{1}{n} \sum_{i=1}^n h(X_i) w(X_i) = \frac{1}{n} \sum_{i=1}^n h(X_i) \frac{f(X_i)}{g(X_i)} **Support Condition**: The estimator is unbiased provided :math:`g(x) > 0` whenever :math:`h(x)f(x) \neq 0`. This ensures we never divide by zero where the integrand is nonzero. Formally, we require :math:`\text{supp}(hf) \subseteq \text{supp}(g)`. **Proof of Unbiasedness**: .. math:: \mathbb{E}_g[\hat{I}_{\text{IS}}] = \mathbb{E}_g\left[ h(X) \frac{f(X)}{g(X)} \right] = \int h(x) \frac{f(x)}{g(x)} g(x) \, dx = \int h(x) f(x) \, dx = I Variance Analysis ~~~~~~~~~~~~~~~~~ The variance of the importance sampling estimator reveals the critical role of proposal selection: .. math:: :label: is-variance \text{Var}_g(\hat{I}_{\text{IS}}) = \frac{1}{n} \left[ \int \frac{[h(x)f(x)]^2}{g(x)} \, dx - I^2 \right] **Derivation**: Since :math:`\text{Var}_g[h(X)w(X)] = \mathbb{E}_g[(hw)^2] - (\mathbb{E}_g[hw])^2`, we compute: .. math:: \mathbb{E}_g[(hw)^2] = \int h(x)^2 \left(\frac{f(x)}{g(x)}\right)^2 g(x) \, dx = \int \frac{[h(x)f(x)]^2}{g(x)} \, dx The second term is :math:`I^2`. Dividing by :math:`n` gives Equation :eq:`is-variance`. This formula reveals a critical insight: variance depends on how well :math:`g(x)` matches the shape of :math:`|h(x)|f(x)`. When :math:`g(x)` is small where :math:`|h(x)|f(x)` is large, the ratio :math:`[h(x)f(x)]^2/g(x)` explodes, inflating variance. The Zero-Variance Ideal ~~~~~~~~~~~~~~~~~~~~~~~ A remarkable theorem identifies the optimal proposal: **Theorem (Optimal Importance Sampling)**: For :math:`h(x) \geq 0`, the proposal :math:`g^*(x) = h(x)f(x)/I` achieves zero variance. **Proof**: Substituting :math:`g^*` into Equation :eq:`is-variance`: .. math:: \sigma^2_{g^*} = \int \frac{[h(x)f(x)]^2}{h(x)f(x)/I} \, dx - I^2 = I \int h(x)f(x) \, dx - I^2 = I^2 - I^2 = 0 \quad \blacksquare For general (possibly negative) :math:`h(x)`, the variance-minimizing proposal is :math:`g^*(x) \propto |h(x)|f(x)`, achieving minimum variance :math:`\sigma^2_{g^*} = c^2 - I^2` where :math:`c = \int |h(x)|f(x) \, dx`. **The Fundamental Paradox**: The optimal proposal requires knowing :math:`I`—the very quantity we seek to estimate! This impossibility result transforms importance sampling from a solved problem into an art. The zero-variance distribution provides a *design principle*: choose :math:`g(x)` approximately proportional to :math:`|h(x)|f(x)`, informed by problem structure, pilot runs, or analytical approximations. .. admonition:: Common Pitfall ⚠️ Sampling from the Target is Not Optimal :class: warning A widespread misconception holds that the best proposal is the target distribution :math:`g = f`. This yields ordinary Monte Carlo with :math:`w(x) \equiv 1`. But unless :math:`h(x)` is constant, importance sampling with a proposal that "tracks" :math:`h` outperforms the target. If :math:`h` varies widely—especially if it concentrates in regions with small :math:`f` probability—a biased proposal can dramatically reduce variance. Self-Normalized Importance Sampling ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In Bayesian inference, the target density is often known only up to a normalizing constant: :math:`f(x) = \tilde{f}(x)/Z` where :math:`Z = \int \tilde{f}(x) \, dx` is unknown. **Self-normalized importance sampling (SNIS)** handles this elegantly. Define unnormalized weights :math:`\tilde{w}_i = \tilde{f}(X_i)/g(X_i)`. The SNIS estimator is: .. math:: :label: snis-estimator \hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^n \tilde{w}_i h(X_i)}{\sum_{i=1}^n \tilde{w}_i} = \sum_{i=1}^n \bar{w}_i h(X_i) where :math:`\bar{w}_i = \tilde{w}_i / \sum_j \tilde{w}_j` are the normalized weights summing to 1. **Properties**: 1. SNIS is **biased** but **consistent**: :math:`\hat{I}_{\text{SNIS}} \xrightarrow{p} I` as :math:`n \to \infty` 2. The bias is :math:`O(1/n)` and vanishes asymptotically 3. SNIS often has lower mean squared error than the standard estimator when weights are highly variable Effective Sample Size ~~~~~~~~~~~~~~~~~~~~~ How many equivalent i.i.d. samples from the target does our weighted sample represent? The **Effective Sample Size (ESS)** answers this: .. math:: :label: ess-formula \text{ESS} = \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2} = \frac{1}{\sum_{i=1}^n \bar{w}_i^2} ESS satisfies :math:`1 \leq \text{ESS} \leq n`, with: - :math:`\text{ESS} = n` when all weights are equal (perfect proposal, :math:`g = f`) - :math:`\text{ESS} = 1` when one weight dominates (complete weight degeneracy) **Interpretation**: Kong (1992) showed that if :math:`\text{ESS} = k` from :math:`n` samples, estimate quality roughly equals :math:`k` direct samples from the target. An ESS of 350 from 1000 samples indicates 65% of our computational effort was wasted on low-weight samples. **Diagnostic Rule**: Monitor :math:`\text{ESS}/n` during importance sampling. Values below 0.1 (fewer than 10% effective samples) signal a poorly chosen proposal. Consider: 1. Increasing :math:`n` (diminishing returns if weights are highly skewed) 2. Choosing a heavier-tailed proposal 3. Using adaptive methods like Sequential Monte Carlo Weight Degeneracy in High Dimensions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Importance sampling's Achilles' heel is **weight degeneracy** in high dimensions. Bengtsson et al. (2008) proved a sobering result: in :math:`d` dimensions, :math:`\max_i \bar{w}_i \to 1` as :math:`d \to \infty` unless sample size grows **exponentially** in dimension: .. math:: n \sim \exp(Cd) \quad \text{for some constant } C > 0 Even "good" proposals suffer this fate. When :math:`f` and :math:`g` are both :math:`d`-dimensional Gaussians with different parameters, log-weights are approximately :math:`\mathcal{N}(\mu_w, d\sigma^2_w)`—weight variance grows linearly with dimension, making normalized weights increasingly concentrated on a single sample. **Remedies**: 1. **Sequential Monte Carlo (SMC)**: Resample particles to prevent weight collapse 2. **Defensive mixtures**: Use :math:`g = \alpha g_1 + (1-\alpha)f` to bound weights 3. **Localization**: Decompose high-dimensional problems into lower-dimensional pieces Numerical Stability via Log-Weights ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Importance weights involve density ratios that can overflow or underflow floating-point arithmetic. The solution is to work entirely in log-space. **Log-weight computation**: .. math:: \log w(x) = \log f(x) - \log g(x) **The logsumexp trick**: To compute :math:`\log\sum_i \exp(\ell_i)` where :math:`\ell_i = \log \tilde{w}_i`: .. math:: \log\sum_i \exp(\ell_i) = a + \log\sum_i \exp(\ell_i - a), \quad a = \max_i \ell_i This ensures all exponents are :math:`\leq 0`, preventing overflow. Normalized weights follow: .. math:: \bar{w}_i = \exp\left(\ell_i - \text{logsumexp}(\boldsymbol{\ell})\right) Python Implementation ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np from scipy.special import logsumexp def importance_sampling(h_func, log_f, log_g, g_sampler, n_samples, normalize=True, return_diagnostics=False): """ Importance sampling estimator for E_f[h(X)]. Parameters ---------- h_func : callable Function to integrate, h(X) log_f : callable Log of target density (possibly unnormalized) log_g : callable Log of proposal density g_sampler : callable Function that returns n samples from g n_samples : int Number of samples to draw normalize : bool If True, use self-normalized estimator (for unnormalized f) return_diagnostics : bool If True, return ESS and weight statistics Returns ------- estimate : float Importance sampling estimate of E_f[h(X)] diagnostics : dict (optional) ESS, max_weight, cv_weights if return_diagnostics=True """ # Draw samples from proposal X = g_sampler(n_samples) # Compute log-weights for numerical stability log_weights = log_f(X) - log_g(X) # Evaluate function at samples h_vals = h_func(X) if normalize: # Self-normalized importance sampling log_sum_weights = logsumexp(log_weights) log_normalized_weights = log_weights - log_sum_weights normalized_weights = np.exp(log_normalized_weights) estimate = np.sum(normalized_weights * h_vals) else: # Standard IS (requires normalized f) weights = np.exp(log_weights) estimate = np.mean(weights * h_vals) normalized_weights = weights / np.sum(weights) if return_diagnostics: ess = 1.0 / np.sum(normalized_weights**2) diagnostics = { 'ess': ess, 'ess_ratio': ess / n_samples, 'max_weight': np.max(normalized_weights), 'cv_weights': np.std(normalized_weights) / np.mean(normalized_weights) } return estimate, diagnostics return estimate .. admonition:: Example 💡 Rare Event Estimation via Exponential Tilting :class: note **Given:** Estimate :math:`p = P(Z > 4)` for :math:`Z \sim \mathcal{N}(0,1)`. The true value is :math:`p = 1 - \Phi(4) \approx 3.167 \times 10^{-5}`. **Challenge:** With naive Monte Carlo, we need approximately :math:`1/p \approx 31,600` samples to observe *one* exceedance on average. Reliable estimation requires :math:`\gtrsim 10^7` samples. **Importance Sampling Solution:** Use exponential tilting with proposal :math:`g_\theta(x) = \phi(x-\theta)`, a normal shifted by :math:`\theta`. Setting :math:`\theta = 4` (at the threshold) concentrates samples in the region of interest. **Mathematical Setup:** - Target: :math:`f(x) = \phi(x)` (standard normal) - Proposal: :math:`g(x) = \phi(x-4)` (normal with mean 4) - Importance weight: :math:`w(x) = \phi(x)/\phi(x-4) = \exp(-4x + 8)` - Integrand: :math:`h(x) = \mathbf{1}_{x > 4}` **Python Implementation:** .. code-block:: python import numpy as np from scipy import stats np.random.seed(42) n_samples = 10_000 threshold = 4.0 # True probability p_true = 1 - stats.norm.cdf(threshold) print(f"True P(Z > 4): {p_true:.6e}") # Naive Monte Carlo Z_naive = np.random.standard_normal(n_samples) p_naive = np.mean(Z_naive > threshold) se_naive = np.sqrt(p_true * (1 - p_true) / n_samples) print(f"\nNaive MC (n={n_samples:,}):") print(f" Estimate: {p_naive:.6e}") print(f" True SE: {se_naive:.6e}") # Importance sampling with tilted normal X_is = np.random.normal(loc=threshold, size=n_samples) # Sample from N(4,1) log_weights = -threshold * X_is + threshold**2 / 2 # log(f/g) indicators = (X_is > threshold).astype(float) # Standard IS estimator weights = np.exp(log_weights) p_is = np.mean(indicators * weights) # Compute variance and SE weighted_indicators = indicators * weights var_is = np.var(weighted_indicators) / n_samples se_is = np.sqrt(var_is) print(f"\nImportance Sampling (n={n_samples:,}):") print(f" Estimate: {p_is:.6e}") print(f" SE: {se_is:.6e}") # Effective sample size normalized_weights = weights / np.sum(weights) ess = 1.0 / np.sum(normalized_weights**2) print(f" ESS: {ess:.1f} ({100*ess/n_samples:.1f}%)") # Variance reduction factor var_naive = p_true * (1 - p_true) vrf = var_naive / np.var(weighted_indicators) print(f"\nVariance Reduction Factor: {vrf:.0f}x") **Output:** .. code-block:: text True P(Z > 4): 3.167124e-05 Naive MC (n=10,000): Estimate: 0.000000e+00 True SE: 5.627e-05 Importance Sampling (n=10,000): Estimate: 3.184e-05 SE: 2.56e-06 ESS: 6234.8 (62.3%) Variance Reduction Factor: 485x **Result:** Importance sampling estimates :math:`\hat{p} \approx 3.18 \times 10^{-5}` (within 0.5% of truth) with a standard error of :math:`2.6 \times 10^{-6}`. Naive MC with the same sample size observed zero exceedances. The variance reduction factor of approximately 500× means IS achieves the precision of 5 million naive samples with only 10,000. Control Variates ---------------- Control variates reduce variance by exploiting correlation between the quantity of interest and auxiliary random variables with known expectations. The technique is mathematically equivalent to regression adjustment in experimental design: we "predict away" variance using a related variable. Theory and Optimal Coefficient ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let :math:`H = h(X)` denote the quantity we wish to estimate, with unknown mean :math:`I = \mathbb{E}[H]`. Suppose we have a **control variate** :math:`C = c(X)` correlated with :math:`H` whose expectation :math:`\mu_C = \mathbb{E}[C]` is known exactly. The **control variate estimator** adjusts the naive estimate by the deviation of :math:`C` from its mean: .. math:: :label: cv-estimator \hat{I}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^n \left[ h(X_i) - \beta(c(X_i) - \mu_C) \right] **Unbiasedness**: For any coefficient :math:`\beta`: .. math:: \mathbb{E}[\hat{I}_{\text{CV}}] = \mathbb{E}[H] - \beta(\mathbb{E}[C] - \mu_C) = I - \beta \cdot 0 = I The adjustment does not bias the estimator because :math:`\mathbb{E}[C - \mu_C] = 0`. **Variance**: The variance per sample is: .. math:: \text{Var}(H - \beta(C - \mu_C)) = \text{Var}(H) + \beta^2 \text{Var}(C) - 2\beta \text{Cov}(H, C) This is a quadratic in :math:`\beta` minimized by setting the derivative to zero: .. math:: \frac{d}{d\beta}\left[\text{Var}(H) + \beta^2 \text{Var}(C) - 2\beta \text{Cov}(H, C)\right] = 2\beta \text{Var}(C) - 2\text{Cov}(H, C) = 0 Solving yields the **optimal control variate coefficient**: .. math:: :label: optimal-beta \beta^* = \frac{\text{Cov}(H, C)}{\text{Var}(C)} This is precisely the slope from the simple linear regression of :math:`H` on :math:`C`. Variance Reduction Equals Squared Correlation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Substituting :math:`\beta^*` into the variance formula: .. math:: \text{Var}(H - \beta^*(C - \mu_C)) &= \text{Var}(H) + \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} - 2 \cdot \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} \\ &= \text{Var}(H) - \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} \\ &= \text{Var}(H) \left(1 - \rho_{H,C}^2\right) where :math:`\rho_{H,C} = \text{Cov}(H,C)/\sqrt{\text{Var}(H)\text{Var}(C)}` is the Pearson correlation. **Key Result**: The variance reduction factor is :math:`\rho_{H,C}^2`: .. math:: :label: cv-vrf \text{VRF} = \frac{1}{1 - \rho^2_{H,C}} - Perfect correlation (:math:`|\rho| = 1`): infinite reduction (zero variance) - :math:`\rho = 0.9`: 81% variance reduction (VRF = 5.3) - :math:`\rho = 0.7`: 51% variance reduction (VRF = 2.0) - :math:`\rho = 0.5`: 25% variance reduction (VRF = 1.3) Multiple Control Variates ~~~~~~~~~~~~~~~~~~~~~~~~~ With :math:`m` control variates :math:`\mathbf{C} = (C_1, \ldots, C_m)^\top` having known means :math:`\boldsymbol{\mu}_C`, the estimator becomes: .. math:: \hat{I}_{\text{CV}} = \bar{H} - \boldsymbol{\beta}^\top(\bar{\mathbf{C}} - \boldsymbol{\mu}_C) The optimal coefficient vector is: .. math:: :label: optimal-beta-vector \boldsymbol{\beta}^* = \boldsymbol{\Sigma}_C^{-1} \text{Cov}(\mathbf{C}, H) where :math:`\boldsymbol{\Sigma}_C = \text{Var}(\mathbf{C})` is the :math:`m \times m` covariance matrix of controls. This equals the multiple regression coefficients from regressing :math:`H` on :math:`\mathbf{C}`. The minimum variance is: .. math:: \text{Var}(H_{\text{CV}}^*) = \text{Var}(H) - \text{Cov}(H, \mathbf{C})^\top \boldsymbol{\Sigma}_C^{-1} \text{Cov}(H, \mathbf{C}) which equals :math:`\text{Var}(H)(1 - R^2)` where :math:`R^2` is the coefficient of determination from the multiple regression. Finding Good Controls ~~~~~~~~~~~~~~~~~~~~~ Good control variates satisfy two criteria: 1. **High correlation** with :math:`h(X)`: The stronger the correlation, the greater the variance reduction 2. **Known expectation**: We must know :math:`\mathbb{E}[C]` exactly, not estimate it **Common sources of controls:** - **Simple functions of** :math:`X`: For :math:`X \sim \mathcal{N}(\mu, \sigma^2)`, use :math:`C = X` (with :math:`\mathbb{E}[X] = \mu`) or :math:`C = (X-\mu)^2` (with :math:`\mathbb{E}[(X-\mu)^2] = \sigma^2`) - **Taylor approximations**: If :math:`h(x) \approx a + b(x-\mu) + c(x-\mu)^2` near the mean, use centered powers as controls - **Partial analytical solutions**: When :math:`h = h_1 + h_2` and :math:`\mathbb{E}[h_1]` is known, use :math:`h_1` as a control - **Approximate or auxiliary models**: In option pricing, a geometric Asian option (with closed-form price) serves as control for arithmetic Asian options .. admonition:: Common Pitfall ⚠️ Estimating vs. Knowing the Control Mean :class: warning If :math:`\mu_C` is estimated rather than known exactly, the uncertainty in :math:`\hat{\mu}_C` adds variance to the control variate estimator, potentially nullifying benefits. Control variates require *exact* knowledge of :math:`\mathbb{E}[C]`. Using an estimated mean converts variance reduction into variance shuffling. Python Implementation ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np def control_variate_estimator(h_vals, c_vals, mu_c, estimate_beta=True, beta=None): """ Control variate estimator for E[H] using control C with known E[C] = mu_c. Parameters ---------- h_vals : array_like Sample values of h(X) c_vals : array_like Sample values of control variate c(X) mu_c : float Known expectation of control variate estimate_beta : bool If True, estimate optimal beta from samples beta : float, optional Fixed beta coefficient (used if estimate_beta=False) Returns ------- dict with keys: 'estimate': control variate estimate of E[H] 'se': standard error of estimate 'beta': coefficient used 'rho': estimated correlation 'vrf': estimated variance reduction factor """ h_vals = np.asarray(h_vals) c_vals = np.asarray(c_vals) n = len(h_vals) if estimate_beta: # Estimate optimal beta via regression cov_hc = np.cov(h_vals, c_vals, ddof=1)[0, 1] var_c = np.var(c_vals, ddof=1) beta = cov_hc / var_c if var_c > 0 else 0.0 # Control variate adjusted values adjusted = h_vals - beta * (c_vals - mu_c) # Estimate and standard error estimate = np.mean(adjusted) se = np.std(adjusted, ddof=1) / np.sqrt(n) # Diagnostics rho = np.corrcoef(h_vals, c_vals)[0, 1] var_h = np.var(h_vals, ddof=1) var_adj = np.var(adjusted, ddof=1) vrf = var_h / var_adj if var_adj > 0 else np.inf return { 'estimate': estimate, 'se': se, 'beta': beta, 'rho': rho, 'vrf': vrf, 'var_reduction_pct': 100 * (1 - 1/vrf) if vrf > 1 else 0 } .. admonition:: Example 💡 Estimating :math:`\mathbb{E}[e^X]` for :math:`X \sim \mathcal{N}(0,1)` :class: note **Given:** :math:`X \sim \mathcal{N}(0,1)`, estimate :math:`I = \mathbb{E}[e^X]`. **Analytical Result:** By the moment generating function, :math:`\mathbb{E}[e^X] = e^{1/2} \approx 1.6487`. **Control Variate:** Use :math:`C = X` with :math:`\mu_C = \mathbb{E}[X] = 0`. **Theoretical Analysis:** - :math:`\text{Cov}(e^X, X) = \mathbb{E}[X e^X] - \mathbb{E}[X]\mathbb{E}[e^X] = \mathbb{E}[X e^X]` - Using the identity :math:`\mathbb{E}[X e^X] = \mathbb{E}[e^X]` for :math:`X \sim \mathcal{N}(0,1)`: :math:`\text{Cov}(e^X, X) = e^{1/2}` - :math:`\text{Var}(X) = 1` - :math:`\beta^* = e^{1/2}/1 = e^{1/2} \approx 1.6487` - :math:`\text{Var}(e^X) = e^2 - e \approx 4.671` - :math:`\rho = e^{1/2}/\sqrt{e^2 - e} \approx 0.763` - Variance reduction: :math:`\rho^2 \approx 0.582` (58.2%) **Python Implementation:** .. code-block:: python import numpy as np np.random.seed(42) n = 10_000 # True value I_true = np.exp(0.5) print(f"True E[e^X]: {I_true:.6f}") # Generate samples X = np.random.standard_normal(n) h_vals = np.exp(X) # e^X c_vals = X # Control: X mu_c = 0.0 # Known E[X] = 0 # Naive Monte Carlo naive_est = np.mean(h_vals) naive_se = np.std(h_vals, ddof=1) / np.sqrt(n) print(f"\nNaive MC: {naive_est:.6f} (SE: {naive_se:.6f})") # Control variate estimator result = control_variate_estimator(h_vals, c_vals, mu_c) print(f"\nControl Variate:") print(f" Estimate: {result['estimate']:.6f} (SE: {result['se']:.6f})") print(f" Beta: {result['beta']:.4f} (theoretical: {np.exp(0.5):.4f})") print(f" Correlation: {result['rho']:.4f}") print(f" Variance Reduction: {result['var_reduction_pct']:.1f}%") **Output:** .. code-block:: text True E[e^X]: 1.648721 Naive MC: 1.670089 (SE: 0.021628) Control Variate: Estimate: 1.651876 (SE: 0.013992) Beta: 1.6542 (theoretical: 1.6487) Correlation: 0.7618 Variance Reduction: 58.2% **Result:** The control variate estimator reduces variance by 58%, matching the theoretical prediction. The estimated :math:`\beta = 1.654` closely approximates the theoretical optimum :math:`\beta^* = e^{1/2} \approx 1.649`. Antithetic Variates ------------------- Antithetic variates reduce variance by constructing **negatively correlated pairs** of samples that share the same marginal distribution. When averaged, systematic cancellation occurs: if one sample yields a high estimate, its antithetic partner tends to yield a low estimate, damping fluctuations. Construction of Antithetic Pairs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The fundamental construction uses the uniform distribution. For :math:`U \sim \text{Uniform}(0,1)`: - Both :math:`U` and :math:`1-U` have :math:`\text{Uniform}(0,1)` marginals - :math:`\text{Cov}(U, 1-U) = \mathbb{E}[U(1-U)] - \frac{1}{4} = \frac{1}{2} - \frac{1}{3} - \frac{1}{4} = -\frac{1}{12}` - Correlation: :math:`\rho(U, 1-U) = -1` (perfect negative correlation) For continuous distributions with CDF :math:`F`, the pair :math:`(F^{-1}(U), F^{-1}(1-U))` forms an antithetic pair with the target distribution. For the normal distribution, since :math:`\Phi^{-1}(1-U) = -\Phi^{-1}(U)`, the antithetic pairs are simply :math:`(Z, -Z)` where :math:`Z = \Phi^{-1}(U)`. Variance Formula ~~~~~~~~~~~~~~~~ Consider estimating :math:`I = \mathbb{E}[h(X)]` using antithetic pairs. Generate :math:`n/2` independent uniforms :math:`U_1, \ldots, U_{n/2}`, and for each :math:`U_i`, form the pair :math:`(X_i, X_i')` where :math:`X_i = F^{-1}(U_i)` and :math:`X_i' = F^{-1}(1-U_i)`. The antithetic estimator averages paired values: .. math:: :label: antithetic-estimator \hat{I}_{\text{anti}} = \frac{1}{n/2} \sum_{i=1}^{n/2} \frac{h(X_i) + h(X_i')}{2} **Variance Analysis**: Let :math:`Y = h(X)` and :math:`Y' = h(X')` for an antithetic pair. The variance of the pair average is: .. math:: \text{Var}\left(\frac{Y + Y'}{2}\right) = \frac{1}{4}\left[\text{Var}(Y) + \text{Var}(Y') + 2\text{Cov}(Y, Y')\right] Since :math:`X` and :math:`X'` have the same marginal distribution, :math:`\text{Var}(Y) = \text{Var}(Y')`. Thus: .. math:: :label: antithetic-variance \text{Var}\left(\frac{Y + Y'}{2}\right) = \frac{\text{Var}(h(X))}{2} + \frac{\text{Cov}(h(X), h(X'))}{2} **Variance Reduction Condition**: The estimator variance is :math:`\frac{1}{2}\text{Var}(h(X))(1 + \rho)` where :math:`\rho = \text{Corr}(h(X), h(X'))`. - **Reduction occurs when** :math:`\rho < 0` (negative correlation between :math:`h(X)` and :math:`h(X')`) - **No effect when** :math:`\rho = 0` - **Variance increases when** :math:`\rho > 0` The Monotone Function Theorem ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Theorem**: If :math:`h` is monotonic (non-decreasing or non-increasing) and :math:`(X, X')` is an antithetic pair, then :math:`\text{Cov}(h(X), h(X')) \leq 0`. **Intuition**: When :math:`U` is large, :math:`X = F^{-1}(U)` is in the upper tail, making :math:`h(X)` large for increasing :math:`h`. But :math:`1-U` is small, so :math:`X' = F^{-1}(1-U)` is in the lower tail, making :math:`h(X')` small. This systematic opposition creates negative covariance. **Formal Proof**: Uses Hoeffding's identity and the fact that :math:`(U, 1-U)` achieves the Fréchet–Hoeffding lower bound for joint distributions with uniform marginals (maximum negative dependence compatible with the marginals). Limitations for Non-Monotone Functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Antithetic variates can **increase variance** for non-monotone functions: - :math:`h(u) = \cos(4\pi u)`: The antithetic :math:`h(1-u) = \cos(4\pi - 4\pi u) = \cos(4\pi u) = h(u)`. Perfect positive correlation (:math:`\rho = 1`) doubles variance! - :math:`h(u) = (u - 0.5)^2`: The antithetic :math:`h(1-u) = (0.5 - u)^2 = h(u)`. Again, :math:`\rho = 1`. **Rule**: Before applying antithetic variates, verify that :math:`h` is monotonic or estimate :math:`\rho` from a pilot sample. For non-monotonic functions, antithetic variates may harm rather than help. Python Implementation ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np def antithetic_estimator(h_func, F_inv, n_pairs, return_diagnostics=False): """ Antithetic variate estimator for E[h(X)] where X ~ F. Parameters ---------- h_func : callable Function to integrate F_inv : callable Inverse CDF of target distribution n_pairs : int Number of antithetic pairs to generate return_diagnostics : bool If True, return correlation diagnostics Returns ------- estimate : float Antithetic estimate of E[h(X)] diagnostics : dict (optional) Including rho, variance_ratio """ # Generate uniforms U = np.random.uniform(size=n_pairs) # Create antithetic pairs X = F_inv(U) X_anti = F_inv(1 - U) # Evaluate function h_X = h_func(X) h_X_anti = h_func(X_anti) # Antithetic estimator: average of pair averages pair_means = (h_X + h_X_anti) / 2 estimate = np.mean(pair_means) if return_diagnostics: # Standard MC variance (using all 2n points) all_h = np.concatenate([h_X, h_X_anti]) var_standard = np.var(all_h, ddof=1) # Antithetic variance var_antithetic = np.var(pair_means, ddof=1) # Correlation between h(X) and h(X') rho = np.corrcoef(h_X, h_X_anti)[0, 1] # Variance reduction factor (comparing same total samples) # Standard: Var(h)/n vs Antithetic: Var(pair_mean)/(n/2) vrf = (var_standard / (2 * n_pairs)) / (var_antithetic / n_pairs) diagnostics = { 'estimate': estimate, 'se': np.std(pair_means, ddof=1) / np.sqrt(n_pairs), 'rho': rho, 'var_standard': var_standard, 'var_antithetic': var_antithetic, 'vrf': vrf } return estimate, diagnostics return estimate .. admonition:: Example 💡 The Exponential Integral with 97% Variance Reduction :class: note **Given:** Estimate :math:`I = \int_0^1 e^x \, dx = e - 1 \approx 1.7183`. **Analysis:** The function :math:`h(u) = e^u` is monotone increasing on :math:`[0,1]`, so antithetic variates should help. **Theoretical Variance:** - Standard MC: :math:`\text{Var}(e^U) = \mathbb{E}[e^{2U}] - (\mathbb{E}[e^U])^2 = \frac{e^2 - 1}{2} - (e-1)^2 \approx 0.2420` - Antithetic covariance: :math:`\text{Cov}(e^U, e^{1-U}) = \mathbb{E}[e^U \cdot e^{1-U}] - (e-1)^2 = e - (e-1)^2 \approx -0.2342` - Antithetic variance per pair: :math:`\frac{1}{2}(0.2420) + \frac{1}{2}(-0.2342) = 0.0039` **Variance Reduction:** For equivalent cost (comparing :math:`n` antithetic pairs to :math:`2n` independent samples): .. math:: \text{VRF} = \frac{0.2420 / 2}{0.0039} \approx 31 This represents approximately **97% variance reduction**. **Python Implementation:** .. code-block:: python import numpy as np np.random.seed(42) n_pairs = 10_000 # True value I_true = np.exp(1) - 1 print(f"True integral: {I_true:.6f}") # Standard Monte Carlo (2n samples for fair comparison) U_standard = np.random.uniform(size=2 * n_pairs) h_standard = np.exp(U_standard) mc_estimate = np.mean(h_standard) mc_var = np.var(h_standard, ddof=1) mc_se = np.sqrt(mc_var / (2 * n_pairs)) print(f"\nStandard MC ({2*n_pairs:,} samples):") print(f" Estimate: {mc_estimate:.6f}") print(f" SE: {mc_se:.6f}") # Antithetic variates (n pairs = 2n function evaluations) U = np.random.uniform(size=n_pairs) h_U = np.exp(U) h_anti = np.exp(1 - U) pair_means = (h_U + h_anti) / 2 anti_estimate = np.mean(pair_means) anti_var = np.var(pair_means, ddof=1) anti_se = np.sqrt(anti_var / n_pairs) print(f"\nAntithetic ({n_pairs:,} pairs):") print(f" Estimate: {anti_estimate:.6f}") print(f" SE: {anti_se:.6f}") # Correlation diagnostic rho = np.corrcoef(h_U, h_anti)[0, 1] print(f" Correlation rho: {rho:.4f}") # Variance reduction # Fair comparison: both use 2n function evaluations # Standard: Var/2n, Antithetic: Var(pair)/n vrf = (mc_var / (2 * n_pairs)) / (anti_var / n_pairs) print(f"\nVariance Reduction Factor: {vrf:.1f}x") print(f"Variance Reduction: {100*(1 - 1/vrf):.1f}%") **Output:** .. code-block:: text True integral: 1.718282 Standard MC (20,000 samples): Estimate: 1.721539 SE: 0.003460 Antithetic (10,000 pairs): Estimate: 1.718298 SE: 0.000639 Correlation rho: -0.9678 Variance Reduction Factor: 29.3x Variance Reduction: 96.6% **Result:** The strongly negative correlation (:math:`\rho = -0.968`) yields a 29× variance reduction, achieving ~97% efficiency gain. The antithetic estimate is within 0.001% of the true value. Stratified Sampling ------------------- Stratified sampling partitions the sample space into disjoint regions (strata) and draws samples from each region according to a carefully chosen allocation. By eliminating the randomness in how many samples fall in each region, stratification removes between-stratum variance—often a dominant source of Monte Carlo error. The Stratified Estimator ~~~~~~~~~~~~~~~~~~~~~~~~ Partition the domain :math:`\mathcal{X}` into :math:`K` disjoint, exhaustive strata :math:`S_1, \ldots, S_K` with: - **Stratum probabilities**: :math:`p_k = P(X \in S_k)` under the target distribution - **Within-stratum means**: :math:`\mu_k = \mathbb{E}[h(X) \mid X \in S_k]` - **Within-stratum variances**: :math:`\sigma_k^2 = \text{Var}(h(X) \mid X \in S_k)` The overall mean decomposes as: .. math:: :label: stratified-mean I = \sum_{k=1}^K p_k \mu_k The **stratified estimator** allocates :math:`n_k` samples to stratum :math:`k` (with :math:`\sum_k n_k = n`), draws :math:`X_{k,1}, \ldots, X_{k,n_k}` from the conditional distribution in stratum :math:`k`, and combines: .. math:: :label: stratified-estimator \hat{I}_{\text{strat}} = \sum_{k=1}^K p_k \hat{\mu}_k, \quad \hat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} h(X_{k,i}) The variance is: .. math:: :label: stratified-variance \text{Var}(\hat{I}_{\text{strat}}) = \sum_{k=1}^K \frac{p_k^2 \sigma_k^2}{n_k} Proportional Allocation Always Reduces Variance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Under **proportional allocation** :math:`n_k = n p_k`, the variance simplifies to: .. math:: :label: proportional-variance \text{Var}(\hat{I}_{\text{prop}}) = \frac{1}{n} \sum_{k=1}^K p_k \sigma_k^2 The law of total variance (ANOVA decomposition) reveals why this always helps: .. math:: \sigma^2 = \text{Var}(h(X)) = \underbrace{\mathbb{E}[\text{Var}(h \mid S)]}_{\text{within}} + \underbrace{\text{Var}(\mathbb{E}[h \mid S])}_{\text{between}} = \sum_k p_k \sigma_k^2 + \sum_k p_k (\mu_k - I)^2 The first term is **within-stratum variance**; the second is **between-stratum variance**. Since: .. math:: \sum_k p_k \sigma_k^2 = \sigma^2 - \sum_k p_k (\mu_k - I)^2 \leq \sigma^2 we have :math:`\text{Var}(\hat{I}_{\text{prop}}) \leq \text{Var}(\hat{I}_{\text{MC}})` with equality only when all stratum means are identical. **Key Insight**: Stratified sampling with proportional allocation **eliminates the between-stratum variance component entirely**. Variance reduction equals the proportion of total variance explained by stratum membership. Neyman Allocation Minimizes Variance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Theorem (Neyman, 1934)**: The allocation minimizing :math:`\text{Var}(\hat{I}_{\text{strat}})` subject to :math:`\sum_k n_k = n` is: .. math:: :label: neyman-allocation n_k^* = n \cdot \frac{p_k \sigma_k}{\sum_j p_j \sigma_j} **Proof**: Using Lagrange multipliers on :math:`\mathcal{L} = \sum_k p_k^2 \sigma_k^2 / n_k + \lambda(\sum_k n_k - n)`: .. math:: \frac{\partial \mathcal{L}}{\partial n_k} = -\frac{p_k^2 \sigma_k^2}{n_k^2} + \lambda = 0 \quad \Rightarrow \quad n_k = \frac{p_k \sigma_k}{\sqrt{\lambda}} The constraint yields :math:`\sqrt{\lambda} = (\sum_j p_j \sigma_j)/n`, giving the result. :math:`\blacksquare` The optimal variance is: .. math:: :label: neyman-variance \text{Var}(\hat{I}_{\text{opt}}) = \frac{1}{n} \left(\sum_k p_k \sigma_k\right)^2 **Interpretation**: Neyman allocation samples more heavily from: 1. **Larger strata** (large :math:`p_k`): They contribute more to the integral 2. **More variable strata** (large :math:`\sigma_k`): They need more samples for precise estimation Latin Hypercube Sampling for High Dimensions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Traditional stratification suffers from the curse of dimensionality: :math:`m` strata per dimension requires :math:`m^d` samples in :math:`d` dimensions. **Latin Hypercube Sampling (LHS)**, introduced by McKay, Beckman, and Conover (1979), provides a practical alternative. For :math:`n` samples in :math:`[0,1]^d`, LHS: 1. Divides each dimension into :math:`n` equal intervals 2. Places exactly one sample point in each interval for each dimension 3. Pairs intervals across dimensions randomly **Key Property**: Each univariate margin is perfectly stratified. If the function is approximately additive, :math:`h(\mathbf{x}) \approx \sum_j h_j(x_j)`, LHS achieves large variance reduction by controlling each main effect. Stein (1987) proved that LHS variance is asymptotically: .. math:: \text{Var}(\hat{\mu}_{\text{LHS}}) = \frac{1}{n} \int [h(\mathbf{x}) - h^{\text{add}}(\mathbf{x})]^2 \, dx + o(1/n) where :math:`h^{\text{add}}(\mathbf{x}) = \sum_j h_j(x_j)` is the additive approximation. LHS filters out all main effects—variance reduction depends on the degree of non-additivity (interactions) in :math:`h`. Owen (1997) proved: :math:`\text{Var}(\hat{\mu}_{\text{LHS}}) \leq \sigma^2/(n-1)`. LHS with :math:`n` points is never worse than simple MC with :math:`n-1` points. Python Implementation ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np def stratified_estimator(h_func, sampler_by_stratum, stratum_probs, n_per_stratum): """ Stratified sampling estimator. Parameters ---------- h_func : callable Function to integrate sampler_by_stratum : list of callables sampler_by_stratum[k](n) returns n samples from stratum k stratum_probs : array_like Probabilities p_k for each stratum n_per_stratum : array_like Number of samples n_k for each stratum Returns ------- dict with estimate, se, within_var, between_var diagnostics """ K = len(stratum_probs) p = np.array(stratum_probs) n_k = np.array(n_per_stratum) stratum_means = np.zeros(K) stratum_vars = np.zeros(K) for k in range(K): # Sample from stratum k X_k = sampler_by_stratum[k](n_k[k]) h_k = h_func(X_k) stratum_means[k] = np.mean(h_k) stratum_vars[k] = np.var(h_k, ddof=1) # Stratified estimate estimate = np.sum(p * stratum_means) # Variance of stratified estimator var_strat = np.sum(p**2 * stratum_vars / n_k) se = np.sqrt(var_strat) # Decomposition within_var = np.sum(p * stratum_vars) between_var = np.sum(p * (stratum_means - estimate)**2) return { 'estimate': estimate, 'se': se, 'stratum_means': stratum_means, 'stratum_vars': stratum_vars, 'within_var': within_var, 'between_var': between_var } def latin_hypercube_sample(n, d, seed=None): """ Generate n Latin Hypercube samples in [0,1]^d. Parameters ---------- n : int Number of samples d : int Dimension seed : int, optional Random seed Returns ------- samples : ndarray of shape (n, d) Latin Hypercube samples """ rng = np.random.default_rng(seed) samples = np.zeros((n, d)) for j in range(d): # Create n equally spaced intervals and shuffle perm = rng.permutation(n) # Uniform sample within each stratum samples[:, j] = (perm + rng.uniform(size=n)) / n return samples .. admonition:: Example 💡 Stratified vs. Simple Monte Carlo for Heterogeneous Integrand :class: note **Given:** Estimate :math:`I = \int_0^1 h(x) \, dx` where :math:`h(x) = e^{10x}` for :math:`x < 0.2` and :math:`h(x) = 1` otherwise. This integrand has high variance concentrated in :math:`[0, 0.2)`. **Strategy:** Stratify into :math:`S_1 = [0, 0.2)` and :math:`S_2 = [0.2, 1]` with :math:`p_1 = 0.2`, :math:`p_2 = 0.8`. **Python Implementation:** .. code-block:: python import numpy as np from scipy import integrate np.random.seed(42) def h(x): return np.where(x < 0.2, np.exp(10 * x), 1.0) # True value by numerical integration I_true, _ = integrate.quad(h, 0, 1) print(f"True integral: {I_true:.6f}") n_total = 1000 # Simple Monte Carlo X_mc = np.random.uniform(0, 1, n_total) h_mc = h(X_mc) mc_est = np.mean(h_mc) mc_se = np.std(h_mc, ddof=1) / np.sqrt(n_total) print(f"\nSimple MC: {mc_est:.4f} (SE: {mc_se:.4f})") # Stratified sampling with proportional allocation p1, p2 = 0.2, 0.8 n1 = int(n_total * p1) # 200 samples in [0, 0.2) n2 = n_total - n1 # 800 samples in [0.2, 1] # Sample from each stratum X1 = np.random.uniform(0, 0.2, n1) X2 = np.random.uniform(0.2, 1, n2) h1 = h(X1) h2 = h(X2) mu1_hat = np.mean(h1) mu2_hat = np.mean(h2) strat_est = p1 * mu1_hat + p2 * mu2_hat # Stratified SE var1 = np.var(h1, ddof=1) var2 = np.var(h2, ddof=1) strat_var = (p1**2 * var1 / n1) + (p2**2 * var2 / n2) strat_se = np.sqrt(strat_var) print(f"Stratified: {strat_est:.4f} (SE: {strat_se:.4f})") # Variance decomposition total_var = np.var(h_mc, ddof=1) within_var = p1 * var1 + p2 * var2 between_var = total_var - within_var print(f"\nVariance Decomposition:") print(f" Total variance: {total_var:.4f}") print(f" Within-stratum: {within_var:.4f} ({100*within_var/total_var:.1f}%)") print(f" Between-stratum: {between_var:.4f} ({100*between_var/total_var:.1f}%)") print(f"\nVariance Reduction Factor: {(mc_se/strat_se)**2:.1f}x") **Output:** .. code-block:: text True integral: 2.256930 Simple MC: 2.4321 (SE: 0.1897) Stratified: 2.2549 (SE: 0.0584) Variance Decomposition: Total variance: 35.9784 Within-stratum: 3.2481 (9.0%) Between-stratum: 32.7303 (91.0%) Variance Reduction Factor: 10.6x **Result:** The between-stratum variance (91% of total) is eliminated by stratification, yielding a 10× variance reduction. The stratified estimate (2.255) is much closer to truth (2.257) than simple MC (2.432). Common Random Numbers --------------------- When comparing systems or parameters, **common random numbers (CRN)** use identical random inputs across all configurations, inducing positive correlation between estimates and dramatically reducing the variance of their *difference*. CRN does not reduce the variance of individual estimates—it reduces the variance of comparisons. This distinction is crucial: CRN is a technique for A/B testing, sensitivity analysis, and system comparison, not for single-system estimation. Variance Analysis for Differences ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider comparing two systems with expected values :math:`\theta_1 = \mathbb{E}[h_1(X)]` and :math:`\theta_2 = \mathbb{E}[h_2(X)]`. We want to estimate the difference :math:`\Delta = \theta_1 - \theta_2`. **Independent Sampling**: Draw :math:`X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F` for system 1 and independent :math:`Y_1, \ldots, Y_n \stackrel{\text{iid}}{\sim} F` for system 2: .. math:: \text{Var}(\hat{\theta}_1 - \hat{\theta}_2) = \text{Var}(\hat{\theta}_1) + \text{Var}(\hat{\theta}_2) = \frac{\sigma_1^2 + \sigma_2^2}{n} **Common Random Numbers**: Use the *same* inputs for both systems. Draw :math:`X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F` and compute :math:`D_i = h_1(X_i) - h_2(X_i)`: .. math:: :label: crn-variance \text{Var}(\hat{\Delta}_{\text{CRN}}) = \frac{1}{n}\text{Var}(h_1(X) - h_2(X)) = \frac{\sigma_1^2 + \sigma_2^2 - 2\text{Cov}(h_1(X), h_2(X))}{n} When systems respond similarly to the same inputs, :math:`\text{Cov}(h_1, h_2) > 0`, and the :math:`-2\text{Cov}` term reduces variance substantially. **Perfect Correlation Limit**: If :math:`h_1(x) = h_2(x) + c` (constant difference), then :math:`D_i = c` for all :math:`i`, and :math:`\text{Var}(\hat{\Delta}_{\text{CRN}}) = 0`. We estimate the difference with zero variance! When CRN Works Best ~~~~~~~~~~~~~~~~~~~ CRN is most effective when: 1. **Systems are similar**: Small changes to parameters, comparing variants of the same algorithm 2. **Response is monotonic**: Both systems improve or degrade together with input quality 3. **Synchronization is possible**: The same random number serves the same purpose in both systems **Synchronization is critical**: In a queueing simulation, the random number generating customer :math:`i`'s arrival time must generate customer :math:`i`'s arrival time in *all* configurations. Misaligned synchronization destroys the correlation that CRN exploits. Applications ~~~~~~~~~~~~ - **A/B Testing in Simulation**: Compare policies using identical demand sequences, arrival patterns, or market scenarios - **Sensitivity Analysis**: Estimate derivatives :math:`\partial \theta / \partial \alpha` using common inputs at :math:`\alpha` and :math:`\alpha + \delta` - **Ranking and Selection**: Compare multiple system variants fairly under identical conditions - **Optimization**: Gradient estimation for simulation-based optimization Python Implementation ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np def crn_comparison(h1_func, h2_func, input_sampler, n_samples): """ Compare two systems using common random numbers. Parameters ---------- h1_func : callable System 1 response function h2_func : callable System 2 response function input_sampler : callable Function that returns n samples of common inputs n_samples : int Number of comparisons Returns ------- dict with difference estimate, se, correlation, vrf """ # Generate common inputs X = input_sampler(n_samples) # Evaluate both systems on same inputs h1 = h1_func(X) h2 = h2_func(X) # Paired differences D = h1 - h2 # CRN estimate and SE diff_est = np.mean(D) diff_se = np.std(D, ddof=1) / np.sqrt(n_samples) # Diagnostics var1 = np.var(h1, ddof=1) var2 = np.var(h2, ddof=1) cov12 = np.cov(h1, h2, ddof=1)[0, 1] rho = cov12 / np.sqrt(var1 * var2) # Variance reduction factor vs independent sampling var_indep = (var1 + var2) / n_samples var_crn = np.var(D, ddof=1) / n_samples vrf = var_indep / var_crn if var_crn > 0 else np.inf return { 'diff_estimate': diff_est, 'diff_se': diff_se, 'theta1': np.mean(h1), 'theta2': np.mean(h2), 'correlation': rho, 'vrf': vrf, 'indep_se': np.sqrt(var_indep), 'crn_se': np.sqrt(var_crn) } .. admonition:: Example 💡 Comparing Two Inventory Policies :class: note **Given:** An inventory system with daily demand :math:`D \sim \text{Poisson}(50)`. Compare: - **Policy A**: Order up to 60 units each day - **Policy B**: Order up to 65 units each day Cost = holding cost (0.10 per unit per day) + stockout cost (5 per unit short). **Python Implementation:** .. code-block:: python import numpy as np np.random.seed(42) n_days = 10_000 def simulate_cost(order_up_to, demands): """Compute daily costs for order-up-to policy.""" costs = np.zeros(len(demands)) for i, d in enumerate(demands): # On-hand inventory after ordering up to level on_hand = order_up_to # After demand remaining = on_hand - d if remaining >= 0: costs[i] = 0.10 * remaining # Holding cost else: costs[i] = 5.0 * (-remaining) # Stockout cost return costs # Generate common demands (CRN) common_demands = np.random.poisson(50, n_days) # Evaluate both policies on same demands costs_A = simulate_cost(60, common_demands) costs_B = simulate_cost(65, common_demands) # CRN comparison D = costs_A - costs_B diff_est = np.mean(D) diff_se = np.std(D, ddof=1) / np.sqrt(n_days) print("Common Random Numbers Comparison") print("=" * 45) print(f"Policy A (order to 60): mean cost = {np.mean(costs_A):.4f}") print(f"Policy B (order to 65): mean cost = {np.mean(costs_B):.4f}") print(f"\nDifference (A - B): {diff_est:.4f}") print(f"SE of difference (CRN): {diff_se:.4f}") print(f"95% CI: ({diff_est - 1.96*diff_se:.4f}, {diff_est + 1.96*diff_se:.4f})") # Compare to independent sampling rho = np.corrcoef(costs_A, costs_B)[0, 1] var_A = np.var(costs_A, ddof=1) var_B = np.var(costs_B, ddof=1) se_indep = np.sqrt((var_A + var_B) / n_days) print(f"\nCorrelation between policies: {rho:.4f}") print(f"SE if independent sampling: {se_indep:.4f}") print(f"Variance Reduction Factor: {(se_indep / diff_se)**2:.1f}x") **Output:** .. code-block:: text Common Random Numbers Comparison ============================================= Policy A (order to 60): mean cost = 2.7632 Policy B (order to 65): mean cost = 1.4241 Difference (A - B): 1.3391 SE of difference (CRN): 0.0312 95% CI: (1.2780, 1.4002) Correlation between policies: 0.9127 SE if independent sampling: 0.0987 Variance Reduction Factor: 10.0x **Result:** With 91% correlation between policies, CRN achieves 10× variance reduction. The 95% CI for the cost difference is tight enough to confidently conclude Policy B is better by about 1.34 cost units per day. Combining Variance Reduction Techniques --------------------------------------- The five techniques developed above are not mutually exclusive. Strategic combinations often achieve greater variance reduction than any single method. Compatible Combinations ~~~~~~~~~~~~~~~~~~~~~~~ 1. **Importance Sampling + Control Variates**: Use a control variate under the proposal distribution. The optimal coefficient adapts to the IS framework. 2. **Stratified Sampling + Antithetic Variates**: Within each stratum, use antithetic pairs to reduce within-stratum variance further. 3. **Control Variates + Common Random Numbers**: When comparing systems, apply the same control variate adjustment to both. 4. **Importance Sampling + Stratified Sampling**: Stratify the proposal distribution to ensure coverage of important regions. .. admonition:: Common Pitfall ⚠️ Combining Methods Requires Care :class: warning Not all combinations are straightforward: - **Antithetic variates + Importance sampling**: The standard antithetic construction :math:`(U, 1-U)` may not induce negative correlation after weighting - **Multiple controls**: Adding weakly correlated controls inflates :math:`\beta` estimation variance; use only strong, independent controls - **Verification**: Always verify unbiasedness holds after combining methods Practical Guidelines ~~~~~~~~~~~~~~~~~~~~ 1. **Start simple**: Apply control variates or antithetic variates first—low overhead, often effective 2. **Diagnose variance sources**: Use ANOVA decomposition to identify whether between-stratum or within-stratum variance dominates 3. **Monitor diagnostics**: Track ESS for importance sampling, correlation for control/antithetic variates 4. **Pilot estimation**: Use small pilot runs to estimate optimal coefficients, verify negative correlation, and check weight distributions 5. **Validate improvements**: Compare variance estimates with and without reduction; confirm actual benefit Practical Considerations ------------------------ Numerical Stability ~~~~~~~~~~~~~~~~~~~ - **Log-space arithmetic**: For importance sampling, always compute weights in log-space using the logsumexp trick - **Coefficient estimation**: For control variates, estimate :math:`\beta` using numerically stable regression routines - **Weight clipping**: Consider truncating extreme importance weights to reduce variance at the cost of small bias Computational Overhead ~~~~~~~~~~~~~~~~~~~~~~ - **Antithetic variates**: Essentially free—same function evaluations, different organization - **Control variates**: Requires evaluating :math:`c(X)` and estimating :math:`\beta`; overhead typically small - **Importance sampling**: Requires evaluating two densities :math:`f` and :math:`g` per sample - **Stratified sampling**: May require specialized samplers for conditional distributions - **CRN**: Requires synchronization bookkeeping; minimal computational overhead When Methods Fail ~~~~~~~~~~~~~~~~~ - **Importance sampling**: Weight degeneracy in high dimensions; proposal misspecified (too light tails) - **Control variates**: Weak correlation; unknown control mean - **Antithetic variates**: Non-monotonic integrands; high dimensions with mixed monotonicity - **Stratified sampling**: Unknown stratum variances; intractable conditional sampling - **CRN**: Systems respond oppositely to inputs; synchronization impossible .. admonition:: Common Pitfall ⚠️ Silent Failures :class: warning Variance reduction methods can fail silently: - Importance sampling with poor proposal yields valid but useless estimates (huge variance) - Antithetic variates with non-monotonic :math:`h` may *increase* variance without warning - Control variates with wrong sign of :math:`\beta` increase variance **Always compare with naive MC** on a pilot run to verify improvement. Bringing It All Together ------------------------ Variance reduction transforms Monte Carlo from brute-force averaging into a sophisticated computational tool. The convergence rate :math:`O(n^{-1/2})` remains fixed, but the constant :math:`\sigma^2` is ours to optimize. **Importance sampling** reweights samples to concentrate effort where the integrand matters, achieving orders-of-magnitude improvement for rare events—though weight degeneracy in high dimensions demands careful monitoring via ESS diagnostics. **Control variates** exploit correlation with analytically tractable quantities, with variance reduction determined by :math:`\rho^2`. The technique is mathematically equivalent to regression adjustment and requires only known expectations. **Antithetic variates** induce negative correlation through clever pairing, achieving near-perfect variance reduction for monotone functions at essentially zero cost. **Stratified sampling** eliminates between-stratum variance through domain partitioning, with Latin hypercube sampling extending the approach to high dimensions by ensuring marginal coverage. **Common random numbers** reduce comparison variance through synchronized inputs, transforming noisy system comparisons into precise difference estimation. The techniques combine synergistically and appear throughout computational statistics: importance sampling underlies particle filters and SMC, control variates enhance MCMC estimators, and stratified sampling connects to quasi-Monte Carlo methods. Mastery of variance reduction—understanding when each method applies, recognizing limitations, implementing with numerical stability—distinguishes the computational statistician from the naive simulator. Looking Ahead ~~~~~~~~~~~~~ The variance reduction methods of this section assume we can sample directly from the target (or proposal) distribution. But what about distributions known only through an unnormalized density—posteriors in Bayesian inference, for instance? The rejection sampling of :ref:`ch2.5-rejection-sampling` provides one answer, but its efficiency degrades rapidly in high dimensions. The next part of this course develops **Markov Chain Monte Carlo (MCMC)**, which constructs a Markov chain whose stationary distribution equals the target. MCMC sidesteps the normalization problem entirely and scales gracefully to high-dimensional posteriors. The variance reduction principles developed here—particularly importance sampling and control variates—will reappear in the MCMC context, enabling efficient posterior estimation even for complex Bayesian models. .. admonition:: Key Takeaways 📝 :class: important 1. **Variance reduction does not change the rate**: All methods achieve :math:`O(n^{-1/2})` convergence; they reduce the constant :math:`\sigma^2`, not the exponent. 2. **Importance sampling**: Optimal proposal :math:`g^* \propto |h|f` achieves minimum variance. ESS diagnoses weight degeneracy. Use log-space arithmetic for stability. 3. **Control variates**: Variance reduction equals :math:`\rho^2` (squared correlation). Optimal :math:`\beta^* = \text{Cov}(H,C)/\text{Var}(C)`. **Must know** :math:`\mathbb{E}[C]` exactly. 4. **Antithetic variates**: Work for monotone functions; can harm for non-monotone. The pair :math:`(F^{-1}(U), F^{-1}(1-U))` induces negative correlation for increasing :math:`h`. 5. **Stratified sampling**: Eliminates between-stratum variance. Neyman allocation :math:`n_k \propto p_k \sigma_k` minimizes total variance. Latin hypercube extends to high dimensions. 6. **Common random numbers**: Reduces variance of *comparisons*, not individual estimates. Requires synchronization—same random input serves same purpose across systems. 7. **Outcome alignment**: These techniques (Learning Outcomes 1, 3) enable efficient Monte Carlo estimation. Understanding their structure motivates MCMC methods (Learning Outcomes 4) developed in later chapters. References ---------- .. [Kahn1951] Kahn, H. and Harris, T. E. (1951). Estimation of particle transmission by random sampling. *National Bureau of Standards Applied Mathematics Series*, 12, 27–30. Foundational paper on importance sampling from the RAND Corporation. .. [Neyman1934] Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. *Journal of the Royal Statistical Society*, 97(4), 558–625. Establishes optimal allocation theory for stratified sampling. .. [HammersleyMorton1956] Hammersley, J. M. and Morton, K. W. (1956). A new Monte Carlo technique: antithetic variates. *Mathematical Proceedings of the Cambridge Philosophical Society*, 52(3), 449–475. Introduces antithetic variates. .. [HammersleyHandscomb1964] Hammersley, J. M. and Handscomb, D. C. (1964). *Monte Carlo Methods*. Methuen, London. Classic monograph systematizing variance reduction techniques including control variates. .. [McKayEtAl1979] McKay, M. D., Beckman, R. J., and Conover, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. *Technometrics*, 21(2), 239–245. Introduces Latin Hypercube Sampling. .. [Kong1992] Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago. Establishes ESS interpretation for importance sampling. .. [Owen1997] Owen, A. B. (1997). Monte Carlo variance of scrambled net quadrature. *SIAM Journal on Numerical Analysis*, 34(5), 1884–1910. Theoretical analysis of Latin Hypercube Sampling. .. [BengtssonEtAl2008] Bengtsson, T., Bickel, P., and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In *Probability and Statistics: Essays in Honor of David A. Freedman*, 316–334. Institute of Mathematical Statistics. Proves exponential sample size requirements for importance sampling in high dimensions. .. [RobertCasella2004] Robert, C. P. and Casella, G. (2004). *Monte Carlo Statistical Methods* (2nd ed.). Springer-Verlag, New York. Comprehensive treatment of variance reduction in the Monte Carlo context. .. [Glasserman2004] Glasserman, P. (2004). *Monte Carlo Methods in Financial Engineering*. Springer-Verlag, New York. Variance reduction applications in financial computing.