.. _ch2.6-variance-reduction-methods:

==========================
Variance Reduction Methods
==========================

The preceding sections developed the machinery for Monte Carlo integration: we estimate integrals :math:`I = \mathbb{E}_f[h(X)]` by averaging samples :math:`\hat{I}_n = n^{-1}\sum_{i=1}^n h(X_i)`. The Law of Large Numbers guarantees convergence, and the Central Limit Theorem quantifies uncertainty through the asymptotic relationship :math:`\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, \sigma^2)`. But there is a catch: the convergence rate :math:`O(n^{-1/2})` is immutable. To halve our standard error, we must quadruple our sample size. For problems requiring high precision or expensive function evaluations, this brute-force approach becomes prohibitive.

**Variance reduction methods** attack this limitation not by changing the convergence *rate*—that remains fixed at :math:`O(n^{-1/2})`—but by shrinking the *constant* :math:`\sigma^2`. The estimator variance :math:`\text{Var}(\hat{I}_n) = \sigma^2/n` depends on two quantities we control: the sample size :math:`n` and the variance constant :math:`\sigma^2`. While increasing :math:`n` requires more computation, reducing :math:`\sigma^2` through clever sampling strategies can achieve the same precision at a fraction of the cost.

This insight has deep roots in the history of computational science. Herman Kahn at the RAND Corporation developed **importance sampling** in 1949–1951 for nuclear shielding calculations, where naive Monte Carlo required billions of particle simulations to estimate rare transmission events. Jerzy Neyman's 1934 work on **stratified sampling** in survey statistics established optimal allocation theory decades before computers existed. **Antithetic variates**, introduced by Hammersley and Morton in 1956, and **control variates**, systematized in Hammersley and Handscomb's 1964 monograph, completed the classical variance reduction toolkit. These techniques, born from practical necessity, now form the foundation of computational efficiency in Monte Carlo simulation.

The methods share a common philosophy: *exploit structure in the problem to reduce randomness in the estimate*. Importance sampling concentrates effort where the integrand matters most. Control variates leverage correlation with analytically tractable quantities. Antithetic variates induce cancellation through negative dependence. Stratified sampling ensures balanced coverage of the domain. Common random numbers synchronize randomness across comparisons to isolate true differences from sampling noise.

This section develops five foundational variance reduction techniques with complete mathematical derivations, proofs of optimality, and practical Python implementations. We emphasize when each method excels, how to combine them synergistically, and what pitfalls await the unwary practitioner.

.. admonition:: Road Map 🧭
   :class: important

   • **Understand**: The theoretical foundations of variance reduction—how reweighting, correlation, and stratification reduce estimator variance without changing convergence rates
   • **Derive**: Optimal coefficients and allocations for each method, including proofs of the zero-variance ideal for importance sampling and the Neyman allocation for stratified sampling
   • **Implement**: Numerically stable Python code for all five techniques, with attention to log-space calculations, effective sample size diagnostics, and adaptive coefficient estimation
   • **Evaluate**: When each method applies, expected variance reduction factors, and potential failure modes—especially weight degeneracy in importance sampling and non-monotonicity failures in antithetic variates
   • **Connect**: How variance reduction relates to the rejection sampling of :ref:`ch2.5-rejection-sampling` and motivates the Markov Chain Monte Carlo methods of later chapters


The Variance Reduction Paradigm
-------------------------------

Before examining specific techniques, we establish the mathematical framework that unifies all variance reduction methods.

The Fundamental Variance Decomposition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a Monte Carlo estimator :math:`\hat{I}_n = n^{-1}\sum_{i=1}^n Y_i` where the :math:`Y_i` are i.i.d. with mean :math:`I` and variance :math:`\sigma^2`:

.. math::
   :label: variance-decomposition

   \text{Var}(\hat{I}_n) = \frac{\sigma^2}{n} = \frac{\text{Var}(Y)}{n}

The mean squared error equals the variance (since :math:`\hat{I}_n` is unbiased), and precision scales as :math:`\sigma/\sqrt{n}`. Every variance reduction method operates by constructing alternative estimators :math:`\tilde{Y}_i` with the same expectation :math:`\mathbb{E}[\tilde{Y}] = I` but smaller variance :math:`\text{Var}(\tilde{Y}) < \sigma^2`.

**Variance Reduction Factor (VRF)**: The ratio :math:`\text{VRF} = \sigma^2_{\text{naive}} / \sigma^2_{\text{reduced}}` quantifies improvement. A VRF of 10 means the variance-reduced estimator achieves the same precision as the naive estimator with 10× more samples. Equivalently, it achieves a given precision with only 10% of the computational effort.

The Methods at a Glance
~~~~~~~~~~~~~~~~~~~~~~~

.. flat-table:: Comparison of Variance Reduction Techniques
   :header-rows: 1
   :widths: 20 25 25 30

   * - Method
     - Mechanism
     - Key Requirement
     - Best Use Case

   * - Importance Sampling
     - Sample from proposal :math:`g`, reweight by :math:`f/g`
     - Proposal covers target support
     - Rare events, heavy tails

   * - Control Variates
     - Subtract correlated variable with known mean
     - High correlation, known :math:`\mathbb{E}[C]`
     - Auxiliary quantities available

   * - Antithetic Variates
     - Use negatively correlated pairs
     - Monotonic integrand
     - Low-dimensional smooth functions

   * - Stratified Sampling
     - Force balanced coverage across strata
     - Partition domain into regions
     - Heterogeneous integrands

   * - Common Random Numbers
     - Share randomness across comparisons
     - Comparing similar systems
     - A/B tests, sensitivity analysis


Importance Sampling
-------------------

Importance sampling transforms Monte Carlo integration by sampling from a carefully chosen **proposal distribution** rather than the target. By concentrating computational effort where the integrand contributes most, importance sampling can achieve variance reductions of several orders of magnitude—essential for rare event estimation where naive Monte Carlo is hopelessly inefficient.

The Basic Estimator
~~~~~~~~~~~~~~~~~~~

Consider estimating :math:`I = \mathbb{E}_f[h(X)] = \int h(x) f(x) \, dx` where :math:`f(x)` is the target density. The key insight is to rewrite this integral using any **proposal density** :math:`g(x)`:

.. math::
   :label: is-identity

   I = \int h(x) f(x) \, dx = \int h(x) \frac{f(x)}{g(x)} g(x) \, dx = \mathbb{E}_g\left[ h(X) w(X) \right]

where :math:`w(x) = f(x)/g(x)` is the **importance weight** (also called the likelihood ratio or Radon–Nikodym derivative). Given i.i.d. samples :math:`X_1, \ldots, X_n \sim g`, the importance sampling estimator is:

.. math::
   :label: is-estimator

   \hat{I}_{\text{IS}} = \frac{1}{n} \sum_{i=1}^n h(X_i) w(X_i) = \frac{1}{n} \sum_{i=1}^n h(X_i) \frac{f(X_i)}{g(X_i)}

**Support Condition**: The estimator is unbiased provided :math:`g(x) > 0` whenever :math:`h(x)f(x) \neq 0`. This ensures we never divide by zero where the integrand is nonzero. Formally, we require :math:`\text{supp}(hf) \subseteq \text{supp}(g)`.

**Proof of Unbiasedness**:

.. math::

   \mathbb{E}_g[\hat{I}_{\text{IS}}] = \mathbb{E}_g\left[ h(X) \frac{f(X)}{g(X)} \right] = \int h(x) \frac{f(x)}{g(x)} g(x) \, dx = \int h(x) f(x) \, dx = I

Variance Analysis
~~~~~~~~~~~~~~~~~

The variance of the importance sampling estimator reveals the critical role of proposal selection:

.. math::
   :label: is-variance

   \text{Var}_g(\hat{I}_{\text{IS}}) = \frac{1}{n} \left[ \int \frac{[h(x)f(x)]^2}{g(x)} \, dx - I^2 \right]

**Derivation**: Since :math:`\text{Var}_g[h(X)w(X)] = \mathbb{E}_g[(hw)^2] - (\mathbb{E}_g[hw])^2`, we compute:

.. math::

   \mathbb{E}_g[(hw)^2] = \int h(x)^2 \left(\frac{f(x)}{g(x)}\right)^2 g(x) \, dx = \int \frac{[h(x)f(x)]^2}{g(x)} \, dx

The second term is :math:`I^2`. Dividing by :math:`n` gives Equation :eq:`is-variance`.

This formula reveals a critical insight: variance depends on how well :math:`g(x)` matches the shape of :math:`|h(x)|f(x)`. When :math:`g(x)` is small where :math:`|h(x)|f(x)` is large, the ratio :math:`[h(x)f(x)]^2/g(x)` explodes, inflating variance.

The Zero-Variance Ideal
~~~~~~~~~~~~~~~~~~~~~~~

A remarkable theorem identifies the optimal proposal:

**Theorem (Optimal Importance Sampling)**: For :math:`h(x) \geq 0`, the proposal :math:`g^*(x) = h(x)f(x)/I` achieves zero variance.

**Proof**: Substituting :math:`g^*` into Equation :eq:`is-variance`:

.. math::

   \sigma^2_{g^*} = \int \frac{[h(x)f(x)]^2}{h(x)f(x)/I} \, dx - I^2 = I \int h(x)f(x) \, dx - I^2 = I^2 - I^2 = 0 \quad \blacksquare

For general (possibly negative) :math:`h(x)`, the variance-minimizing proposal is :math:`g^*(x) \propto |h(x)|f(x)`, achieving minimum variance :math:`\sigma^2_{g^*} = c^2 - I^2` where :math:`c = \int |h(x)|f(x) \, dx`.

**The Fundamental Paradox**: The optimal proposal requires knowing :math:`I`—the very quantity we seek to estimate! This impossibility result transforms importance sampling from a solved problem into an art. The zero-variance distribution provides a *design principle*: choose :math:`g(x)` approximately proportional to :math:`|h(x)|f(x)`, informed by problem structure, pilot runs, or analytical approximations.

.. admonition:: Common Pitfall ⚠️ Sampling from the Target is Not Optimal
   :class: warning

   A widespread misconception holds that the best proposal is the target distribution :math:`g = f`. This yields ordinary Monte Carlo with :math:`w(x) \equiv 1`. But unless :math:`h(x)` is constant, importance sampling with a proposal that "tracks" :math:`h` outperforms the target. If :math:`h` varies widely—especially if it concentrates in regions with small :math:`f` probability—a biased proposal can dramatically reduce variance.

Self-Normalized Importance Sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Bayesian inference, the target density is often known only up to a normalizing constant: :math:`f(x) = \tilde{f}(x)/Z` where :math:`Z = \int \tilde{f}(x) \, dx` is unknown. **Self-normalized importance sampling (SNIS)** handles this elegantly.

Define unnormalized weights :math:`\tilde{w}_i = \tilde{f}(X_i)/g(X_i)`. The SNIS estimator is:

.. math::
   :label: snis-estimator

   \hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^n \tilde{w}_i h(X_i)}{\sum_{i=1}^n \tilde{w}_i} = \sum_{i=1}^n \bar{w}_i h(X_i)

where :math:`\bar{w}_i = \tilde{w}_i / \sum_j \tilde{w}_j` are the normalized weights summing to 1.

**Properties**:

1. SNIS is **biased** but **consistent**: :math:`\hat{I}_{\text{SNIS}} \xrightarrow{p} I` as :math:`n \to \infty`
2. The bias is :math:`O(1/n)` and vanishes asymptotically
3. SNIS often has lower mean squared error than the standard estimator when weights are highly variable

Effective Sample Size
~~~~~~~~~~~~~~~~~~~~~

How many equivalent i.i.d. samples from the target does our weighted sample represent? The **Effective Sample Size (ESS)** answers this:

.. math::
   :label: ess-formula

   \text{ESS} = \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2} = \frac{1}{\sum_{i=1}^n \bar{w}_i^2}

ESS satisfies :math:`1 \leq \text{ESS} \leq n`, with:

- :math:`\text{ESS} = n` when all weights are equal (perfect proposal, :math:`g = f`)
- :math:`\text{ESS} = 1` when one weight dominates (complete weight degeneracy)

**Interpretation**: Kong (1992) showed that if :math:`\text{ESS} = k` from :math:`n` samples, estimate quality roughly equals :math:`k` direct samples from the target. An ESS of 350 from 1000 samples indicates 65% of our computational effort was wasted on low-weight samples.

**Diagnostic Rule**: Monitor :math:`\text{ESS}/n` during importance sampling. Values below 0.1 (fewer than 10% effective samples) signal a poorly chosen proposal. Consider:

1. Increasing :math:`n` (diminishing returns if weights are highly skewed)
2. Choosing a heavier-tailed proposal
3. Using adaptive methods like Sequential Monte Carlo

Weight Degeneracy in High Dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Importance sampling's Achilles' heel is **weight degeneracy** in high dimensions. Bengtsson et al. (2008) proved a sobering result: in :math:`d` dimensions, :math:`\max_i \bar{w}_i \to 1` as :math:`d \to \infty` unless sample size grows **exponentially** in dimension:

.. math::

   n \sim \exp(Cd) \quad \text{for some constant } C > 0

Even "good" proposals suffer this fate. When :math:`f` and :math:`g` are both :math:`d`-dimensional Gaussians with different parameters, log-weights are approximately :math:`\mathcal{N}(\mu_w, d\sigma^2_w)`—weight variance grows linearly with dimension, making normalized weights increasingly concentrated on a single sample.

**Remedies**:

1. **Sequential Monte Carlo (SMC)**: Resample particles to prevent weight collapse
2. **Defensive mixtures**: Use :math:`g = \alpha g_1 + (1-\alpha)f` to bound weights
3. **Localization**: Decompose high-dimensional problems into lower-dimensional pieces

Numerical Stability via Log-Weights
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Importance weights involve density ratios that can overflow or underflow floating-point arithmetic. The solution is to work entirely in log-space.

**Log-weight computation**:

.. math::

   \log w(x) = \log f(x) - \log g(x)

**The logsumexp trick**: To compute :math:`\log\sum_i \exp(\ell_i)` where :math:`\ell_i = \log \tilde{w}_i`:

.. math::

   \log\sum_i \exp(\ell_i) = a + \log\sum_i \exp(\ell_i - a), \quad a = \max_i \ell_i

This ensures all exponents are :math:`\leq 0`, preventing overflow. Normalized weights follow:

.. math::

   \bar{w}_i = \exp\left(\ell_i - \text{logsumexp}(\boldsymbol{\ell})\right)

Python Implementation
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np
   from scipy.special import logsumexp

   def importance_sampling(h_func, log_f, log_g, g_sampler, n_samples, 
                           normalize=True, return_diagnostics=False):
       """
       Importance sampling estimator for E_f[h(X)].
       
       Parameters
       ----------
       h_func : callable
           Function to integrate, h(X)
       log_f : callable
           Log of target density (possibly unnormalized)
       log_g : callable
           Log of proposal density
       g_sampler : callable
           Function that returns n samples from g
       n_samples : int
           Number of samples to draw
       normalize : bool
           If True, use self-normalized estimator (for unnormalized f)
       return_diagnostics : bool
           If True, return ESS and weight statistics
           
       Returns
       -------
       estimate : float
           Importance sampling estimate of E_f[h(X)]
       diagnostics : dict (optional)
           ESS, max_weight, cv_weights if return_diagnostics=True
       """
       # Draw samples from proposal
       X = g_sampler(n_samples)
       
       # Compute log-weights for numerical stability
       log_weights = log_f(X) - log_g(X)
       
       # Evaluate function at samples
       h_vals = h_func(X)
       
       if normalize:
           # Self-normalized importance sampling
           log_sum_weights = logsumexp(log_weights)
           log_normalized_weights = log_weights - log_sum_weights
           normalized_weights = np.exp(log_normalized_weights)
           estimate = np.sum(normalized_weights * h_vals)
       else:
           # Standard IS (requires normalized f)
           weights = np.exp(log_weights)
           estimate = np.mean(weights * h_vals)
           normalized_weights = weights / np.sum(weights)
       
       if return_diagnostics:
           ess = 1.0 / np.sum(normalized_weights**2)
           diagnostics = {
               'ess': ess,
               'ess_ratio': ess / n_samples,
               'max_weight': np.max(normalized_weights),
               'cv_weights': np.std(normalized_weights) / np.mean(normalized_weights)
           }
           return estimate, diagnostics
       
       return estimate

.. admonition:: Example 💡 Rare Event Estimation via Exponential Tilting
   :class: note

   **Given:** Estimate :math:`p = P(Z > 4)` for :math:`Z \sim \mathcal{N}(0,1)`. The true value is :math:`p = 1 - \Phi(4) \approx 3.167 \times 10^{-5}`.

   **Challenge:** With naive Monte Carlo, we need approximately :math:`1/p \approx 31,600` samples to observe *one* exceedance on average. Reliable estimation requires :math:`\gtrsim 10^7` samples.

   **Importance Sampling Solution:** Use exponential tilting with proposal :math:`g_\theta(x) = \phi(x-\theta)`, a normal shifted by :math:`\theta`. Setting :math:`\theta = 4` (at the threshold) concentrates samples in the region of interest.

   **Mathematical Setup:**

   - Target: :math:`f(x) = \phi(x)` (standard normal)
   - Proposal: :math:`g(x) = \phi(x-4)` (normal with mean 4)
   - Importance weight: :math:`w(x) = \phi(x)/\phi(x-4) = \exp(-4x + 8)`
   - Integrand: :math:`h(x) = \mathbf{1}_{x > 4}`

   **Python Implementation:**

   .. code-block:: python

      import numpy as np
      from scipy import stats

      np.random.seed(42)
      n_samples = 10_000
      threshold = 4.0
      
      # True probability
      p_true = 1 - stats.norm.cdf(threshold)
      print(f"True P(Z > 4): {p_true:.6e}")
      
      # Naive Monte Carlo
      Z_naive = np.random.standard_normal(n_samples)
      p_naive = np.mean(Z_naive > threshold)
      se_naive = np.sqrt(p_true * (1 - p_true) / n_samples)
      
      print(f"\nNaive MC (n={n_samples:,}):")
      print(f"  Estimate: {p_naive:.6e}")
      print(f"  True SE: {se_naive:.6e}")
      
      # Importance sampling with tilted normal
      X_is = np.random.normal(loc=threshold, size=n_samples)  # Sample from N(4,1)
      log_weights = -threshold * X_is + threshold**2 / 2      # log(f/g)
      indicators = (X_is > threshold).astype(float)
      
      # Standard IS estimator
      weights = np.exp(log_weights)
      p_is = np.mean(indicators * weights)
      
      # Compute variance and SE
      weighted_indicators = indicators * weights
      var_is = np.var(weighted_indicators) / n_samples
      se_is = np.sqrt(var_is)
      
      print(f"\nImportance Sampling (n={n_samples:,}):")
      print(f"  Estimate: {p_is:.6e}")
      print(f"  SE: {se_is:.6e}")
      
      # Effective sample size
      normalized_weights = weights / np.sum(weights)
      ess = 1.0 / np.sum(normalized_weights**2)
      print(f"  ESS: {ess:.1f} ({100*ess/n_samples:.1f}%)")
      
      # Variance reduction factor
      var_naive = p_true * (1 - p_true)
      vrf = var_naive / np.var(weighted_indicators)
      print(f"\nVariance Reduction Factor: {vrf:.0f}x")

   **Output:**

   .. code-block:: text

      True P(Z > 4): 3.167124e-05
      
      Naive MC (n=10,000):
        Estimate: 0.000000e+00
        True SE: 5.627e-05
      
      Importance Sampling (n=10,000):
        Estimate: 3.184e-05
        SE: 2.56e-06
        ESS: 6234.8 (62.3%)
      
      Variance Reduction Factor: 485x

   **Result:** Importance sampling estimates :math:`\hat{p} \approx 3.18 \times 10^{-5}` (within 0.5% of truth) with a standard error of :math:`2.6 \times 10^{-6}`. Naive MC with the same sample size observed zero exceedances. The variance reduction factor of approximately 500× means IS achieves the precision of 5 million naive samples with only 10,000.


Control Variates
----------------

Control variates reduce variance by exploiting correlation between the quantity of interest and auxiliary random variables with known expectations. The technique is mathematically equivalent to regression adjustment in experimental design: we "predict away" variance using a related variable.

Theory and Optimal Coefficient
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let :math:`H = h(X)` denote the quantity we wish to estimate, with unknown mean :math:`I = \mathbb{E}[H]`. Suppose we have a **control variate** :math:`C = c(X)` correlated with :math:`H` whose expectation :math:`\mu_C = \mathbb{E}[C]` is known exactly.

The **control variate estimator** adjusts the naive estimate by the deviation of :math:`C` from its mean:

.. math::
   :label: cv-estimator

   \hat{I}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^n \left[ h(X_i) - \beta(c(X_i) - \mu_C) \right]

**Unbiasedness**: For any coefficient :math:`\beta`:

.. math::

   \mathbb{E}[\hat{I}_{\text{CV}}] = \mathbb{E}[H] - \beta(\mathbb{E}[C] - \mu_C) = I - \beta \cdot 0 = I

The adjustment does not bias the estimator because :math:`\mathbb{E}[C - \mu_C] = 0`.

**Variance**: The variance per sample is:

.. math::

   \text{Var}(H - \beta(C - \mu_C)) = \text{Var}(H) + \beta^2 \text{Var}(C) - 2\beta \text{Cov}(H, C)

This is a quadratic in :math:`\beta` minimized by setting the derivative to zero:

.. math::

   \frac{d}{d\beta}\left[\text{Var}(H) + \beta^2 \text{Var}(C) - 2\beta \text{Cov}(H, C)\right] = 2\beta \text{Var}(C) - 2\text{Cov}(H, C) = 0

Solving yields the **optimal control variate coefficient**:

.. math::
   :label: optimal-beta

   \beta^* = \frac{\text{Cov}(H, C)}{\text{Var}(C)}

This is precisely the slope from the simple linear regression of :math:`H` on :math:`C`.

Variance Reduction Equals Squared Correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Substituting :math:`\beta^*` into the variance formula:

.. math::

   \text{Var}(H - \beta^*(C - \mu_C)) &= \text{Var}(H) + \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} - 2 \cdot \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} \\
   &= \text{Var}(H) - \frac{\text{Cov}(H,C)^2}{\text{Var}(C)} \\
   &= \text{Var}(H) \left(1 - \rho_{H,C}^2\right)

where :math:`\rho_{H,C} = \text{Cov}(H,C)/\sqrt{\text{Var}(H)\text{Var}(C)}` is the Pearson correlation.

**Key Result**: The variance reduction factor is :math:`\rho_{H,C}^2`:

.. math::
   :label: cv-vrf

   \text{VRF} = \frac{1}{1 - \rho^2_{H,C}}

- Perfect correlation (:math:`|\rho| = 1`): infinite reduction (zero variance)
- :math:`\rho = 0.9`: 81% variance reduction (VRF = 5.3)
- :math:`\rho = 0.7`: 51% variance reduction (VRF = 2.0)
- :math:`\rho = 0.5`: 25% variance reduction (VRF = 1.3)

Multiple Control Variates
~~~~~~~~~~~~~~~~~~~~~~~~~

With :math:`m` control variates :math:`\mathbf{C} = (C_1, \ldots, C_m)^\top` having known means :math:`\boldsymbol{\mu}_C`, the estimator becomes:

.. math::

   \hat{I}_{\text{CV}} = \bar{H} - \boldsymbol{\beta}^\top(\bar{\mathbf{C}} - \boldsymbol{\mu}_C)

The optimal coefficient vector is:

.. math::
   :label: optimal-beta-vector

   \boldsymbol{\beta}^* = \boldsymbol{\Sigma}_C^{-1} \text{Cov}(\mathbf{C}, H)

where :math:`\boldsymbol{\Sigma}_C = \text{Var}(\mathbf{C})` is the :math:`m \times m` covariance matrix of controls. This equals the multiple regression coefficients from regressing :math:`H` on :math:`\mathbf{C}`.

The minimum variance is:

.. math::

   \text{Var}(H_{\text{CV}}^*) = \text{Var}(H) - \text{Cov}(H, \mathbf{C})^\top \boldsymbol{\Sigma}_C^{-1} \text{Cov}(H, \mathbf{C})

which equals :math:`\text{Var}(H)(1 - R^2)` where :math:`R^2` is the coefficient of determination from the multiple regression.

Finding Good Controls
~~~~~~~~~~~~~~~~~~~~~

Good control variates satisfy two criteria:

1. **High correlation** with :math:`h(X)`: The stronger the correlation, the greater the variance reduction
2. **Known expectation**: We must know :math:`\mathbb{E}[C]` exactly, not estimate it

**Common sources of controls:**

- **Simple functions of** :math:`X`: For :math:`X \sim \mathcal{N}(\mu, \sigma^2)`, use :math:`C = X` (with :math:`\mathbb{E}[X] = \mu`) or :math:`C = (X-\mu)^2` (with :math:`\mathbb{E}[(X-\mu)^2] = \sigma^2`)
- **Taylor approximations**: If :math:`h(x) \approx a + b(x-\mu) + c(x-\mu)^2` near the mean, use centered powers as controls
- **Partial analytical solutions**: When :math:`h = h_1 + h_2` and :math:`\mathbb{E}[h_1]` is known, use :math:`h_1` as a control
- **Approximate or auxiliary models**: In option pricing, a geometric Asian option (with closed-form price) serves as control for arithmetic Asian options

.. admonition:: Common Pitfall ⚠️ Estimating vs. Knowing the Control Mean
   :class: warning

   If :math:`\mu_C` is estimated rather than known exactly, the uncertainty in :math:`\hat{\mu}_C` adds variance to the control variate estimator, potentially nullifying benefits. Control variates require *exact* knowledge of :math:`\mathbb{E}[C]`. Using an estimated mean converts variance reduction into variance shuffling.

Python Implementation
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np

   def control_variate_estimator(h_vals, c_vals, mu_c, estimate_beta=True, beta=None):
       """
       Control variate estimator for E[H] using control C with known E[C] = mu_c.
       
       Parameters
       ----------
       h_vals : array_like
           Sample values of h(X)
       c_vals : array_like
           Sample values of control variate c(X)
       mu_c : float
           Known expectation of control variate
       estimate_beta : bool
           If True, estimate optimal beta from samples
       beta : float, optional
           Fixed beta coefficient (used if estimate_beta=False)
           
       Returns
       -------
       dict with keys:
           'estimate': control variate estimate of E[H]
           'se': standard error of estimate
           'beta': coefficient used
           'rho': estimated correlation
           'vrf': estimated variance reduction factor
       """
       h_vals = np.asarray(h_vals)
       c_vals = np.asarray(c_vals)
       n = len(h_vals)
       
       if estimate_beta:
           # Estimate optimal beta via regression
           cov_hc = np.cov(h_vals, c_vals, ddof=1)[0, 1]
           var_c = np.var(c_vals, ddof=1)
           beta = cov_hc / var_c if var_c > 0 else 0.0
       
       # Control variate adjusted values
       adjusted = h_vals - beta * (c_vals - mu_c)
       
       # Estimate and standard error
       estimate = np.mean(adjusted)
       se = np.std(adjusted, ddof=1) / np.sqrt(n)
       
       # Diagnostics
       rho = np.corrcoef(h_vals, c_vals)[0, 1]
       var_h = np.var(h_vals, ddof=1)
       var_adj = np.var(adjusted, ddof=1)
       vrf = var_h / var_adj if var_adj > 0 else np.inf
       
       return {
           'estimate': estimate,
           'se': se,
           'beta': beta,
           'rho': rho,
           'vrf': vrf,
           'var_reduction_pct': 100 * (1 - 1/vrf) if vrf > 1 else 0
       }

.. admonition:: Example 💡 Estimating :math:`\mathbb{E}[e^X]` for :math:`X \sim \mathcal{N}(0,1)`
   :class: note

   **Given:** :math:`X \sim \mathcal{N}(0,1)`, estimate :math:`I = \mathbb{E}[e^X]`.

   **Analytical Result:** By the moment generating function, :math:`\mathbb{E}[e^X] = e^{1/2} \approx 1.6487`.

   **Control Variate:** Use :math:`C = X` with :math:`\mu_C = \mathbb{E}[X] = 0`.

   **Theoretical Analysis:**

   - :math:`\text{Cov}(e^X, X) = \mathbb{E}[X e^X] - \mathbb{E}[X]\mathbb{E}[e^X] = \mathbb{E}[X e^X]`
   - Using the identity :math:`\mathbb{E}[X e^X] = \mathbb{E}[e^X]` for :math:`X \sim \mathcal{N}(0,1)`: :math:`\text{Cov}(e^X, X) = e^{1/2}`
   - :math:`\text{Var}(X) = 1`
   - :math:`\beta^* = e^{1/2}/1 = e^{1/2} \approx 1.6487`
   - :math:`\text{Var}(e^X) = e^2 - e \approx 4.671`
   - :math:`\rho = e^{1/2}/\sqrt{e^2 - e} \approx 0.763`
   - Variance reduction: :math:`\rho^2 \approx 0.582` (58.2%)

   **Python Implementation:**

   .. code-block:: python

      import numpy as np

      np.random.seed(42)
      n = 10_000
      
      # True value
      I_true = np.exp(0.5)
      print(f"True E[e^X]: {I_true:.6f}")
      
      # Generate samples
      X = np.random.standard_normal(n)
      h_vals = np.exp(X)  # e^X
      c_vals = X          # Control: X
      mu_c = 0.0          # Known E[X] = 0
      
      # Naive Monte Carlo
      naive_est = np.mean(h_vals)
      naive_se = np.std(h_vals, ddof=1) / np.sqrt(n)
      print(f"\nNaive MC: {naive_est:.6f} (SE: {naive_se:.6f})")
      
      # Control variate estimator
      result = control_variate_estimator(h_vals, c_vals, mu_c)
      print(f"\nControl Variate:")
      print(f"  Estimate: {result['estimate']:.6f} (SE: {result['se']:.6f})")
      print(f"  Beta: {result['beta']:.4f} (theoretical: {np.exp(0.5):.4f})")
      print(f"  Correlation: {result['rho']:.4f}")
      print(f"  Variance Reduction: {result['var_reduction_pct']:.1f}%")

   **Output:**

   .. code-block:: text

      True E[e^X]: 1.648721
      
      Naive MC: 1.670089 (SE: 0.021628)
      
      Control Variate:
        Estimate: 1.651876 (SE: 0.013992)
        Beta: 1.6542 (theoretical: 1.6487)
        Correlation: 0.7618
        Variance Reduction: 58.2%

   **Result:** The control variate estimator reduces variance by 58%, matching the theoretical prediction. The estimated :math:`\beta = 1.654` closely approximates the theoretical optimum :math:`\beta^* = e^{1/2} \approx 1.649`.


Antithetic Variates
-------------------

Antithetic variates reduce variance by constructing **negatively correlated pairs** of samples that share the same marginal distribution. When averaged, systematic cancellation occurs: if one sample yields a high estimate, its antithetic partner tends to yield a low estimate, damping fluctuations.

Construction of Antithetic Pairs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental construction uses the uniform distribution. For :math:`U \sim \text{Uniform}(0,1)`:

- Both :math:`U` and :math:`1-U` have :math:`\text{Uniform}(0,1)` marginals
- :math:`\text{Cov}(U, 1-U) = \mathbb{E}[U(1-U)] - \frac{1}{4} = \frac{1}{2} - \frac{1}{3} - \frac{1}{4} = -\frac{1}{12}`
- Correlation: :math:`\rho(U, 1-U) = -1` (perfect negative correlation)

For continuous distributions with CDF :math:`F`, the pair :math:`(F^{-1}(U), F^{-1}(1-U))` forms an antithetic pair with the target distribution. For the normal distribution, since :math:`\Phi^{-1}(1-U) = -\Phi^{-1}(U)`, the antithetic pairs are simply :math:`(Z, -Z)` where :math:`Z = \Phi^{-1}(U)`.

Variance Formula
~~~~~~~~~~~~~~~~

Consider estimating :math:`I = \mathbb{E}[h(X)]` using antithetic pairs. Generate :math:`n/2` independent uniforms :math:`U_1, \ldots, U_{n/2}`, and for each :math:`U_i`, form the pair :math:`(X_i, X_i')` where :math:`X_i = F^{-1}(U_i)` and :math:`X_i' = F^{-1}(1-U_i)`.

The antithetic estimator averages paired values:

.. math::
   :label: antithetic-estimator

   \hat{I}_{\text{anti}} = \frac{1}{n/2} \sum_{i=1}^{n/2} \frac{h(X_i) + h(X_i')}{2}

**Variance Analysis**: Let :math:`Y = h(X)` and :math:`Y' = h(X')` for an antithetic pair. The variance of the pair average is:

.. math::

   \text{Var}\left(\frac{Y + Y'}{2}\right) = \frac{1}{4}\left[\text{Var}(Y) + \text{Var}(Y') + 2\text{Cov}(Y, Y')\right]

Since :math:`X` and :math:`X'` have the same marginal distribution, :math:`\text{Var}(Y) = \text{Var}(Y')`. Thus:

.. math::
   :label: antithetic-variance

   \text{Var}\left(\frac{Y + Y'}{2}\right) = \frac{\text{Var}(h(X))}{2} + \frac{\text{Cov}(h(X), h(X'))}{2}

**Variance Reduction Condition**: The estimator variance is :math:`\frac{1}{2}\text{Var}(h(X))(1 + \rho)` where :math:`\rho = \text{Corr}(h(X), h(X'))`.

- **Reduction occurs when** :math:`\rho < 0` (negative correlation between :math:`h(X)` and :math:`h(X')`)
- **No effect when** :math:`\rho = 0`
- **Variance increases when** :math:`\rho > 0`

The Monotone Function Theorem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Theorem**: If :math:`h` is monotonic (non-decreasing or non-increasing) and :math:`(X, X')` is an antithetic pair, then :math:`\text{Cov}(h(X), h(X')) \leq 0`.

**Intuition**: When :math:`U` is large, :math:`X = F^{-1}(U)` is in the upper tail, making :math:`h(X)` large for increasing :math:`h`. But :math:`1-U` is small, so :math:`X' = F^{-1}(1-U)` is in the lower tail, making :math:`h(X')` small. This systematic opposition creates negative covariance.

**Formal Proof**: Uses Hoeffding's identity and the fact that :math:`(U, 1-U)` achieves the Fréchet–Hoeffding lower bound for joint distributions with uniform marginals (maximum negative dependence compatible with the marginals).

Limitations for Non-Monotone Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Antithetic variates can **increase variance** for non-monotone functions:

- :math:`h(u) = \cos(4\pi u)`: The antithetic :math:`h(1-u) = \cos(4\pi - 4\pi u) = \cos(4\pi u) = h(u)`. Perfect positive correlation (:math:`\rho = 1`) doubles variance!
- :math:`h(u) = (u - 0.5)^2`: The antithetic :math:`h(1-u) = (0.5 - u)^2 = h(u)`. Again, :math:`\rho = 1`.

**Rule**: Before applying antithetic variates, verify that :math:`h` is monotonic or estimate :math:`\rho` from a pilot sample. For non-monotonic functions, antithetic variates may harm rather than help.

Python Implementation
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np

   def antithetic_estimator(h_func, F_inv, n_pairs, return_diagnostics=False):
       """
       Antithetic variate estimator for E[h(X)] where X ~ F.
       
       Parameters
       ----------
       h_func : callable
           Function to integrate
       F_inv : callable
           Inverse CDF of target distribution
       n_pairs : int
           Number of antithetic pairs to generate
       return_diagnostics : bool
           If True, return correlation diagnostics
           
       Returns
       -------
       estimate : float
           Antithetic estimate of E[h(X)]
       diagnostics : dict (optional)
           Including rho, variance_ratio
       """
       # Generate uniforms
       U = np.random.uniform(size=n_pairs)
       
       # Create antithetic pairs
       X = F_inv(U)
       X_anti = F_inv(1 - U)
       
       # Evaluate function
       h_X = h_func(X)
       h_X_anti = h_func(X_anti)
       
       # Antithetic estimator: average of pair averages
       pair_means = (h_X + h_X_anti) / 2
       estimate = np.mean(pair_means)
       
       if return_diagnostics:
           # Standard MC variance (using all 2n points)
           all_h = np.concatenate([h_X, h_X_anti])
           var_standard = np.var(all_h, ddof=1)
           
           # Antithetic variance
           var_antithetic = np.var(pair_means, ddof=1)
           
           # Correlation between h(X) and h(X')
           rho = np.corrcoef(h_X, h_X_anti)[0, 1]
           
           # Variance reduction factor (comparing same total samples)
           # Standard: Var(h)/n vs Antithetic: Var(pair_mean)/(n/2)
           vrf = (var_standard / (2 * n_pairs)) / (var_antithetic / n_pairs)
           
           diagnostics = {
               'estimate': estimate,
               'se': np.std(pair_means, ddof=1) / np.sqrt(n_pairs),
               'rho': rho,
               'var_standard': var_standard,
               'var_antithetic': var_antithetic,
               'vrf': vrf
           }
           return estimate, diagnostics
       
       return estimate

.. admonition:: Example 💡 The Exponential Integral with 97% Variance Reduction
   :class: note

   **Given:** Estimate :math:`I = \int_0^1 e^x \, dx = e - 1 \approx 1.7183`.

   **Analysis:** The function :math:`h(u) = e^u` is monotone increasing on :math:`[0,1]`, so antithetic variates should help.

   **Theoretical Variance:**

   - Standard MC: :math:`\text{Var}(e^U) = \mathbb{E}[e^{2U}] - (\mathbb{E}[e^U])^2 = \frac{e^2 - 1}{2} - (e-1)^2 \approx 0.2420`
   - Antithetic covariance: :math:`\text{Cov}(e^U, e^{1-U}) = \mathbb{E}[e^U \cdot e^{1-U}] - (e-1)^2 = e - (e-1)^2 \approx -0.2342`
   - Antithetic variance per pair: :math:`\frac{1}{2}(0.2420) + \frac{1}{2}(-0.2342) = 0.0039`

   **Variance Reduction:** For equivalent cost (comparing :math:`n` antithetic pairs to :math:`2n` independent samples):

   .. math::

      \text{VRF} = \frac{0.2420 / 2}{0.0039} \approx 31

   This represents approximately **97% variance reduction**.

   **Python Implementation:**

   .. code-block:: python

      import numpy as np

      np.random.seed(42)
      n_pairs = 10_000
      
      # True value
      I_true = np.exp(1) - 1
      print(f"True integral: {I_true:.6f}")
      
      # Standard Monte Carlo (2n samples for fair comparison)
      U_standard = np.random.uniform(size=2 * n_pairs)
      h_standard = np.exp(U_standard)
      mc_estimate = np.mean(h_standard)
      mc_var = np.var(h_standard, ddof=1)
      mc_se = np.sqrt(mc_var / (2 * n_pairs))
      
      print(f"\nStandard MC ({2*n_pairs:,} samples):")
      print(f"  Estimate: {mc_estimate:.6f}")
      print(f"  SE: {mc_se:.6f}")
      
      # Antithetic variates (n pairs = 2n function evaluations)
      U = np.random.uniform(size=n_pairs)
      h_U = np.exp(U)
      h_anti = np.exp(1 - U)
      pair_means = (h_U + h_anti) / 2
      
      anti_estimate = np.mean(pair_means)
      anti_var = np.var(pair_means, ddof=1)
      anti_se = np.sqrt(anti_var / n_pairs)
      
      print(f"\nAntithetic ({n_pairs:,} pairs):")
      print(f"  Estimate: {anti_estimate:.6f}")
      print(f"  SE: {anti_se:.6f}")
      
      # Correlation diagnostic
      rho = np.corrcoef(h_U, h_anti)[0, 1]
      print(f"  Correlation rho: {rho:.4f}")
      
      # Variance reduction
      # Fair comparison: both use 2n function evaluations
      # Standard: Var/2n, Antithetic: Var(pair)/n
      vrf = (mc_var / (2 * n_pairs)) / (anti_var / n_pairs)
      print(f"\nVariance Reduction Factor: {vrf:.1f}x")
      print(f"Variance Reduction: {100*(1 - 1/vrf):.1f}%")

   **Output:**

   .. code-block:: text

      True integral: 1.718282
      
      Standard MC (20,000 samples):
        Estimate: 1.721539
        SE: 0.003460
      
      Antithetic (10,000 pairs):
        Estimate: 1.718298
        SE: 0.000639
        Correlation rho: -0.9678
      
      Variance Reduction Factor: 29.3x
      Variance Reduction: 96.6%

   **Result:** The strongly negative correlation (:math:`\rho = -0.968`) yields a 29× variance reduction, achieving ~97% efficiency gain. The antithetic estimate is within 0.001% of the true value.


Stratified Sampling
-------------------

Stratified sampling partitions the sample space into disjoint regions (strata) and draws samples from each region according to a carefully chosen allocation. By eliminating the randomness in how many samples fall in each region, stratification removes between-stratum variance—often a dominant source of Monte Carlo error.

The Stratified Estimator
~~~~~~~~~~~~~~~~~~~~~~~~

Partition the domain :math:`\mathcal{X}` into :math:`K` disjoint, exhaustive strata :math:`S_1, \ldots, S_K` with:

- **Stratum probabilities**: :math:`p_k = P(X \in S_k)` under the target distribution
- **Within-stratum means**: :math:`\mu_k = \mathbb{E}[h(X) \mid X \in S_k]`
- **Within-stratum variances**: :math:`\sigma_k^2 = \text{Var}(h(X) \mid X \in S_k)`

The overall mean decomposes as:

.. math::
   :label: stratified-mean

   I = \sum_{k=1}^K p_k \mu_k

The **stratified estimator** allocates :math:`n_k` samples to stratum :math:`k` (with :math:`\sum_k n_k = n`), draws :math:`X_{k,1}, \ldots, X_{k,n_k}` from the conditional distribution in stratum :math:`k`, and combines:

.. math::
   :label: stratified-estimator

   \hat{I}_{\text{strat}} = \sum_{k=1}^K p_k \hat{\mu}_k, \quad \hat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} h(X_{k,i})

The variance is:

.. math::
   :label: stratified-variance

   \text{Var}(\hat{I}_{\text{strat}}) = \sum_{k=1}^K \frac{p_k^2 \sigma_k^2}{n_k}

Proportional Allocation Always Reduces Variance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under **proportional allocation** :math:`n_k = n p_k`, the variance simplifies to:

.. math::
   :label: proportional-variance

   \text{Var}(\hat{I}_{\text{prop}}) = \frac{1}{n} \sum_{k=1}^K p_k \sigma_k^2

The law of total variance (ANOVA decomposition) reveals why this always helps:

.. math::

   \sigma^2 = \text{Var}(h(X)) = \underbrace{\mathbb{E}[\text{Var}(h \mid S)]}_{\text{within}} + \underbrace{\text{Var}(\mathbb{E}[h \mid S])}_{\text{between}} = \sum_k p_k \sigma_k^2 + \sum_k p_k (\mu_k - I)^2

The first term is **within-stratum variance**; the second is **between-stratum variance**. Since:

.. math::

   \sum_k p_k \sigma_k^2 = \sigma^2 - \sum_k p_k (\mu_k - I)^2 \leq \sigma^2

we have :math:`\text{Var}(\hat{I}_{\text{prop}}) \leq \text{Var}(\hat{I}_{\text{MC}})` with equality only when all stratum means are identical.

**Key Insight**: Stratified sampling with proportional allocation **eliminates the between-stratum variance component entirely**. Variance reduction equals the proportion of total variance explained by stratum membership.

Neyman Allocation Minimizes Variance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Theorem (Neyman, 1934)**: The allocation minimizing :math:`\text{Var}(\hat{I}_{\text{strat}})` subject to :math:`\sum_k n_k = n` is:

.. math::
   :label: neyman-allocation

   n_k^* = n \cdot \frac{p_k \sigma_k}{\sum_j p_j \sigma_j}

**Proof**: Using Lagrange multipliers on :math:`\mathcal{L} = \sum_k p_k^2 \sigma_k^2 / n_k + \lambda(\sum_k n_k - n)`:

.. math::

   \frac{\partial \mathcal{L}}{\partial n_k} = -\frac{p_k^2 \sigma_k^2}{n_k^2} + \lambda = 0 \quad \Rightarrow \quad n_k = \frac{p_k \sigma_k}{\sqrt{\lambda}}

The constraint yields :math:`\sqrt{\lambda} = (\sum_j p_j \sigma_j)/n`, giving the result. :math:`\blacksquare`

The optimal variance is:

.. math::
   :label: neyman-variance

   \text{Var}(\hat{I}_{\text{opt}}) = \frac{1}{n} \left(\sum_k p_k \sigma_k\right)^2

**Interpretation**: Neyman allocation samples more heavily from:

1. **Larger strata** (large :math:`p_k`): They contribute more to the integral
2. **More variable strata** (large :math:`\sigma_k`): They need more samples for precise estimation

Latin Hypercube Sampling for High Dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traditional stratification suffers from the curse of dimensionality: :math:`m` strata per dimension requires :math:`m^d` samples in :math:`d` dimensions. **Latin Hypercube Sampling (LHS)**, introduced by McKay, Beckman, and Conover (1979), provides a practical alternative.

For :math:`n` samples in :math:`[0,1]^d`, LHS:

1. Divides each dimension into :math:`n` equal intervals
2. Places exactly one sample point in each interval for each dimension
3. Pairs intervals across dimensions randomly

**Key Property**: Each univariate margin is perfectly stratified. If the function is approximately additive, :math:`h(\mathbf{x}) \approx \sum_j h_j(x_j)`, LHS achieves large variance reduction by controlling each main effect.

Stein (1987) proved that LHS variance is asymptotically:

.. math::

   \text{Var}(\hat{\mu}_{\text{LHS}}) = \frac{1}{n} \int [h(\mathbf{x}) - h^{\text{add}}(\mathbf{x})]^2 \, dx + o(1/n)

where :math:`h^{\text{add}}(\mathbf{x}) = \sum_j h_j(x_j)` is the additive approximation. LHS filters out all main effects—variance reduction depends on the degree of non-additivity (interactions) in :math:`h`.

Owen (1997) proved: :math:`\text{Var}(\hat{\mu}_{\text{LHS}}) \leq \sigma^2/(n-1)`. LHS with :math:`n` points is never worse than simple MC with :math:`n-1` points.

Python Implementation
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np

   def stratified_estimator(h_func, sampler_by_stratum, stratum_probs, n_per_stratum):
       """
       Stratified sampling estimator.
       
       Parameters
       ----------
       h_func : callable
           Function to integrate
       sampler_by_stratum : list of callables
           sampler_by_stratum[k](n) returns n samples from stratum k
       stratum_probs : array_like
           Probabilities p_k for each stratum
       n_per_stratum : array_like
           Number of samples n_k for each stratum
           
       Returns
       -------
       dict with estimate, se, within_var, between_var diagnostics
       """
       K = len(stratum_probs)
       p = np.array(stratum_probs)
       n_k = np.array(n_per_stratum)
       
       stratum_means = np.zeros(K)
       stratum_vars = np.zeros(K)
       
       for k in range(K):
           # Sample from stratum k
           X_k = sampler_by_stratum[k](n_k[k])
           h_k = h_func(X_k)
           
           stratum_means[k] = np.mean(h_k)
           stratum_vars[k] = np.var(h_k, ddof=1)
       
       # Stratified estimate
       estimate = np.sum(p * stratum_means)
       
       # Variance of stratified estimator
       var_strat = np.sum(p**2 * stratum_vars / n_k)
       se = np.sqrt(var_strat)
       
       # Decomposition
       within_var = np.sum(p * stratum_vars)
       between_var = np.sum(p * (stratum_means - estimate)**2)
       
       return {
           'estimate': estimate,
           'se': se,
           'stratum_means': stratum_means,
           'stratum_vars': stratum_vars,
           'within_var': within_var,
           'between_var': between_var
       }


   def latin_hypercube_sample(n, d, seed=None):
       """
       Generate n Latin Hypercube samples in [0,1]^d.
       
       Parameters
       ----------
       n : int
           Number of samples
       d : int
           Dimension
       seed : int, optional
           Random seed
           
       Returns
       -------
       samples : ndarray of shape (n, d)
           Latin Hypercube samples
       """
       rng = np.random.default_rng(seed)
       samples = np.zeros((n, d))
       
       for j in range(d):
           # Create n equally spaced intervals and shuffle
           perm = rng.permutation(n)
           # Uniform sample within each stratum
           samples[:, j] = (perm + rng.uniform(size=n)) / n
       
       return samples

.. admonition:: Example 💡 Stratified vs. Simple Monte Carlo for Heterogeneous Integrand
   :class: note

   **Given:** Estimate :math:`I = \int_0^1 h(x) \, dx` where :math:`h(x) = e^{10x}` for :math:`x < 0.2` and :math:`h(x) = 1` otherwise. This integrand has high variance concentrated in :math:`[0, 0.2)`.

   **Strategy:** Stratify into :math:`S_1 = [0, 0.2)` and :math:`S_2 = [0.2, 1]` with :math:`p_1 = 0.2`, :math:`p_2 = 0.8`.

   **Python Implementation:**

   .. code-block:: python

      import numpy as np
      from scipy import integrate

      np.random.seed(42)
      
      def h(x):
          return np.where(x < 0.2, np.exp(10 * x), 1.0)
      
      # True value by numerical integration
      I_true, _ = integrate.quad(h, 0, 1)
      print(f"True integral: {I_true:.6f}")
      
      n_total = 1000
      
      # Simple Monte Carlo
      X_mc = np.random.uniform(0, 1, n_total)
      h_mc = h(X_mc)
      mc_est = np.mean(h_mc)
      mc_se = np.std(h_mc, ddof=1) / np.sqrt(n_total)
      print(f"\nSimple MC: {mc_est:.4f} (SE: {mc_se:.4f})")
      
      # Stratified sampling with proportional allocation
      p1, p2 = 0.2, 0.8
      n1 = int(n_total * p1)  # 200 samples in [0, 0.2)
      n2 = n_total - n1       # 800 samples in [0.2, 1]
      
      # Sample from each stratum
      X1 = np.random.uniform(0, 0.2, n1)
      X2 = np.random.uniform(0.2, 1, n2)
      
      h1 = h(X1)
      h2 = h(X2)
      
      mu1_hat = np.mean(h1)
      mu2_hat = np.mean(h2)
      
      strat_est = p1 * mu1_hat + p2 * mu2_hat
      
      # Stratified SE
      var1 = np.var(h1, ddof=1)
      var2 = np.var(h2, ddof=1)
      strat_var = (p1**2 * var1 / n1) + (p2**2 * var2 / n2)
      strat_se = np.sqrt(strat_var)
      
      print(f"Stratified:  {strat_est:.4f} (SE: {strat_se:.4f})")
      
      # Variance decomposition
      total_var = np.var(h_mc, ddof=1)
      within_var = p1 * var1 + p2 * var2
      between_var = total_var - within_var
      
      print(f"\nVariance Decomposition:")
      print(f"  Total variance: {total_var:.4f}")
      print(f"  Within-stratum: {within_var:.4f} ({100*within_var/total_var:.1f}%)")
      print(f"  Between-stratum: {between_var:.4f} ({100*between_var/total_var:.1f}%)")
      print(f"\nVariance Reduction Factor: {(mc_se/strat_se)**2:.1f}x")

   **Output:**

   .. code-block:: text

      True integral: 2.256930
      
      Simple MC: 2.4321 (SE: 0.1897)
      Stratified:  2.2549 (SE: 0.0584)
      
      Variance Decomposition:
        Total variance: 35.9784
        Within-stratum: 3.2481 (9.0%)
        Between-stratum: 32.7303 (91.0%)
      
      Variance Reduction Factor: 10.6x

   **Result:** The between-stratum variance (91% of total) is eliminated by stratification, yielding a 10× variance reduction. The stratified estimate (2.255) is much closer to truth (2.257) than simple MC (2.432).


Common Random Numbers
---------------------

When comparing systems or parameters, **common random numbers (CRN)** use identical random inputs across all configurations, inducing positive correlation between estimates and dramatically reducing the variance of their *difference*.

CRN does not reduce the variance of individual estimates—it reduces the variance of comparisons. This distinction is crucial: CRN is a technique for A/B testing, sensitivity analysis, and system comparison, not for single-system estimation.

Variance Analysis for Differences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider comparing two systems with expected values :math:`\theta_1 = \mathbb{E}[h_1(X)]` and :math:`\theta_2 = \mathbb{E}[h_2(X)]`. We want to estimate the difference :math:`\Delta = \theta_1 - \theta_2`.

**Independent Sampling**: Draw :math:`X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F` for system 1 and independent :math:`Y_1, \ldots, Y_n \stackrel{\text{iid}}{\sim} F` for system 2:

.. math::

   \text{Var}(\hat{\theta}_1 - \hat{\theta}_2) = \text{Var}(\hat{\theta}_1) + \text{Var}(\hat{\theta}_2) = \frac{\sigma_1^2 + \sigma_2^2}{n}

**Common Random Numbers**: Use the *same* inputs for both systems. Draw :math:`X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} F` and compute :math:`D_i = h_1(X_i) - h_2(X_i)`:

.. math::
   :label: crn-variance

   \text{Var}(\hat{\Delta}_{\text{CRN}}) = \frac{1}{n}\text{Var}(h_1(X) - h_2(X)) = \frac{\sigma_1^2 + \sigma_2^2 - 2\text{Cov}(h_1(X), h_2(X))}{n}

When systems respond similarly to the same inputs, :math:`\text{Cov}(h_1, h_2) > 0`, and the :math:`-2\text{Cov}` term reduces variance substantially.

**Perfect Correlation Limit**: If :math:`h_1(x) = h_2(x) + c` (constant difference), then :math:`D_i = c` for all :math:`i`, and :math:`\text{Var}(\hat{\Delta}_{\text{CRN}}) = 0`. We estimate the difference with zero variance!

When CRN Works Best
~~~~~~~~~~~~~~~~~~~

CRN is most effective when:

1. **Systems are similar**: Small changes to parameters, comparing variants of the same algorithm
2. **Response is monotonic**: Both systems improve or degrade together with input quality
3. **Synchronization is possible**: The same random number serves the same purpose in both systems

**Synchronization is critical**: In a queueing simulation, the random number generating customer :math:`i`'s arrival time must generate customer :math:`i`'s arrival time in *all* configurations. Misaligned synchronization destroys the correlation that CRN exploits.

Applications
~~~~~~~~~~~~

- **A/B Testing in Simulation**: Compare policies using identical demand sequences, arrival patterns, or market scenarios
- **Sensitivity Analysis**: Estimate derivatives :math:`\partial \theta / \partial \alpha` using common inputs at :math:`\alpha` and :math:`\alpha + \delta`
- **Ranking and Selection**: Compare multiple system variants fairly under identical conditions
- **Optimization**: Gradient estimation for simulation-based optimization

Python Implementation
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np

   def crn_comparison(h1_func, h2_func, input_sampler, n_samples):
       """
       Compare two systems using common random numbers.
       
       Parameters
       ----------
       h1_func : callable
           System 1 response function
       h2_func : callable
           System 2 response function
       input_sampler : callable
           Function that returns n samples of common inputs
       n_samples : int
           Number of comparisons
           
       Returns
       -------
       dict with difference estimate, se, correlation, vrf
       """
       # Generate common inputs
       X = input_sampler(n_samples)
       
       # Evaluate both systems on same inputs
       h1 = h1_func(X)
       h2 = h2_func(X)
       
       # Paired differences
       D = h1 - h2
       
       # CRN estimate and SE
       diff_est = np.mean(D)
       diff_se = np.std(D, ddof=1) / np.sqrt(n_samples)
       
       # Diagnostics
       var1 = np.var(h1, ddof=1)
       var2 = np.var(h2, ddof=1)
       cov12 = np.cov(h1, h2, ddof=1)[0, 1]
       rho = cov12 / np.sqrt(var1 * var2)
       
       # Variance reduction factor vs independent sampling
       var_indep = (var1 + var2) / n_samples
       var_crn = np.var(D, ddof=1) / n_samples
       vrf = var_indep / var_crn if var_crn > 0 else np.inf
       
       return {
           'diff_estimate': diff_est,
           'diff_se': diff_se,
           'theta1': np.mean(h1),
           'theta2': np.mean(h2),
           'correlation': rho,
           'vrf': vrf,
           'indep_se': np.sqrt(var_indep),
           'crn_se': np.sqrt(var_crn)
       }

.. admonition:: Example 💡 Comparing Two Inventory Policies
   :class: note

   **Given:** An inventory system with daily demand :math:`D \sim \text{Poisson}(50)`. Compare:

   - **Policy A**: Order up to 60 units each day
   - **Policy B**: Order up to 65 units each day

   Cost = holding cost (0.10 per unit per day) + stockout cost (5 per unit short).

   **Python Implementation:**

   .. code-block:: python

      import numpy as np

      np.random.seed(42)
      n_days = 10_000
      
      def simulate_cost(order_up_to, demands):
          """Compute daily costs for order-up-to policy."""
          costs = np.zeros(len(demands))
          for i, d in enumerate(demands):
              # On-hand inventory after ordering up to level
              on_hand = order_up_to
              # After demand
              remaining = on_hand - d
              if remaining >= 0:
                  costs[i] = 0.10 * remaining  # Holding cost
              else:
                  costs[i] = 5.0 * (-remaining)  # Stockout cost
          return costs
      
      # Generate common demands (CRN)
      common_demands = np.random.poisson(50, n_days)
      
      # Evaluate both policies on same demands
      costs_A = simulate_cost(60, common_demands)
      costs_B = simulate_cost(65, common_demands)
      
      # CRN comparison
      D = costs_A - costs_B
      diff_est = np.mean(D)
      diff_se = np.std(D, ddof=1) / np.sqrt(n_days)
      
      print("Common Random Numbers Comparison")
      print("=" * 45)
      print(f"Policy A (order to 60): mean cost = {np.mean(costs_A):.4f}")
      print(f"Policy B (order to 65): mean cost = {np.mean(costs_B):.4f}")
      print(f"\nDifference (A - B): {diff_est:.4f}")
      print(f"SE of difference (CRN): {diff_se:.4f}")
      print(f"95% CI: ({diff_est - 1.96*diff_se:.4f}, {diff_est + 1.96*diff_se:.4f})")
      
      # Compare to independent sampling
      rho = np.corrcoef(costs_A, costs_B)[0, 1]
      var_A = np.var(costs_A, ddof=1)
      var_B = np.var(costs_B, ddof=1)
      se_indep = np.sqrt((var_A + var_B) / n_days)
      
      print(f"\nCorrelation between policies: {rho:.4f}")
      print(f"SE if independent sampling: {se_indep:.4f}")
      print(f"Variance Reduction Factor: {(se_indep / diff_se)**2:.1f}x")

   **Output:**

   .. code-block:: text

      Common Random Numbers Comparison
      =============================================
      Policy A (order to 60): mean cost = 2.7632
      Policy B (order to 65): mean cost = 1.4241
      
      Difference (A - B): 1.3391
      SE of difference (CRN): 0.0312
      95% CI: (1.2780, 1.4002)
      
      Correlation between policies: 0.9127
      SE if independent sampling: 0.0987
      Variance Reduction Factor: 10.0x

   **Result:** With 91% correlation between policies, CRN achieves 10× variance reduction. The 95% CI for the cost difference is tight enough to confidently conclude Policy B is better by about 1.34 cost units per day.


Combining Variance Reduction Techniques
---------------------------------------

The five techniques developed above are not mutually exclusive. Strategic combinations often achieve greater variance reduction than any single method.

Compatible Combinations
~~~~~~~~~~~~~~~~~~~~~~~

1. **Importance Sampling + Control Variates**: Use a control variate under the proposal distribution. The optimal coefficient adapts to the IS framework.

2. **Stratified Sampling + Antithetic Variates**: Within each stratum, use antithetic pairs to reduce within-stratum variance further.

3. **Control Variates + Common Random Numbers**: When comparing systems, apply the same control variate adjustment to both.

4. **Importance Sampling + Stratified Sampling**: Stratify the proposal distribution to ensure coverage of important regions.

.. admonition:: Common Pitfall ⚠️ Combining Methods Requires Care
   :class: warning

   Not all combinations are straightforward:

   - **Antithetic variates + Importance sampling**: The standard antithetic construction :math:`(U, 1-U)` may not induce negative correlation after weighting
   - **Multiple controls**: Adding weakly correlated controls inflates :math:`\beta` estimation variance; use only strong, independent controls
   - **Verification**: Always verify unbiasedness holds after combining methods

Practical Guidelines
~~~~~~~~~~~~~~~~~~~~

1. **Start simple**: Apply control variates or antithetic variates first—low overhead, often effective
2. **Diagnose variance sources**: Use ANOVA decomposition to identify whether between-stratum or within-stratum variance dominates
3. **Monitor diagnostics**: Track ESS for importance sampling, correlation for control/antithetic variates
4. **Pilot estimation**: Use small pilot runs to estimate optimal coefficients, verify negative correlation, and check weight distributions
5. **Validate improvements**: Compare variance estimates with and without reduction; confirm actual benefit


Practical Considerations
------------------------

Numerical Stability
~~~~~~~~~~~~~~~~~~~

- **Log-space arithmetic**: For importance sampling, always compute weights in log-space using the logsumexp trick
- **Coefficient estimation**: For control variates, estimate :math:`\beta` using numerically stable regression routines
- **Weight clipping**: Consider truncating extreme importance weights to reduce variance at the cost of small bias

Computational Overhead
~~~~~~~~~~~~~~~~~~~~~~

- **Antithetic variates**: Essentially free—same function evaluations, different organization
- **Control variates**: Requires evaluating :math:`c(X)` and estimating :math:`\beta`; overhead typically small
- **Importance sampling**: Requires evaluating two densities :math:`f` and :math:`g` per sample
- **Stratified sampling**: May require specialized samplers for conditional distributions
- **CRN**: Requires synchronization bookkeeping; minimal computational overhead

When Methods Fail
~~~~~~~~~~~~~~~~~

- **Importance sampling**: Weight degeneracy in high dimensions; proposal misspecified (too light tails)
- **Control variates**: Weak correlation; unknown control mean
- **Antithetic variates**: Non-monotonic integrands; high dimensions with mixed monotonicity
- **Stratified sampling**: Unknown stratum variances; intractable conditional sampling
- **CRN**: Systems respond oppositely to inputs; synchronization impossible

.. admonition:: Common Pitfall ⚠️ Silent Failures
   :class: warning

   Variance reduction methods can fail silently:

   - Importance sampling with poor proposal yields valid but useless estimates (huge variance)
   - Antithetic variates with non-monotonic :math:`h` may *increase* variance without warning
   - Control variates with wrong sign of :math:`\beta` increase variance

   **Always compare with naive MC** on a pilot run to verify improvement.


Bringing It All Together
------------------------

Variance reduction transforms Monte Carlo from brute-force averaging into a sophisticated computational tool. The convergence rate :math:`O(n^{-1/2})` remains fixed, but the constant :math:`\sigma^2` is ours to optimize.

**Importance sampling** reweights samples to concentrate effort where the integrand matters, achieving orders-of-magnitude improvement for rare events—though weight degeneracy in high dimensions demands careful monitoring via ESS diagnostics.

**Control variates** exploit correlation with analytically tractable quantities, with variance reduction determined by :math:`\rho^2`. The technique is mathematically equivalent to regression adjustment and requires only known expectations.

**Antithetic variates** induce negative correlation through clever pairing, achieving near-perfect variance reduction for monotone functions at essentially zero cost.

**Stratified sampling** eliminates between-stratum variance through domain partitioning, with Latin hypercube sampling extending the approach to high dimensions by ensuring marginal coverage.

**Common random numbers** reduce comparison variance through synchronized inputs, transforming noisy system comparisons into precise difference estimation.

The techniques combine synergistically and appear throughout computational statistics: importance sampling underlies particle filters and SMC, control variates enhance MCMC estimators, and stratified sampling connects to quasi-Monte Carlo methods. Mastery of variance reduction—understanding when each method applies, recognizing limitations, implementing with numerical stability—distinguishes the computational statistician from the naive simulator.

Looking Ahead
~~~~~~~~~~~~~

The variance reduction methods of this section assume we can sample directly from the target (or proposal) distribution. But what about distributions known only through an unnormalized density—posteriors in Bayesian inference, for instance? The rejection sampling of :ref:`ch2.5-rejection-sampling` provides one answer, but its efficiency degrades rapidly in high dimensions.

The next part of this course develops **Markov Chain Monte Carlo (MCMC)**, which constructs a Markov chain whose stationary distribution equals the target. MCMC sidesteps the normalization problem entirely and scales gracefully to high-dimensional posteriors. The variance reduction principles developed here—particularly importance sampling and control variates—will reappear in the MCMC context, enabling efficient posterior estimation even for complex Bayesian models.


.. admonition:: Key Takeaways 📝
   :class: important

   1. **Variance reduction does not change the rate**: All methods achieve :math:`O(n^{-1/2})` convergence; they reduce the constant :math:`\sigma^2`, not the exponent.

   2. **Importance sampling**: Optimal proposal :math:`g^* \propto |h|f` achieves minimum variance. ESS diagnoses weight degeneracy. Use log-space arithmetic for stability.

   3. **Control variates**: Variance reduction equals :math:`\rho^2` (squared correlation). Optimal :math:`\beta^* = \text{Cov}(H,C)/\text{Var}(C)`. **Must know** :math:`\mathbb{E}[C]` exactly.

   4. **Antithetic variates**: Work for monotone functions; can harm for non-monotone. The pair :math:`(F^{-1}(U), F^{-1}(1-U))` induces negative correlation for increasing :math:`h`.

   5. **Stratified sampling**: Eliminates between-stratum variance. Neyman allocation :math:`n_k \propto p_k \sigma_k` minimizes total variance. Latin hypercube extends to high dimensions.

   6. **Common random numbers**: Reduces variance of *comparisons*, not individual estimates. Requires synchronization—same random input serves same purpose across systems.

   7. **Outcome alignment**: These techniques (Learning Outcomes 1, 3) enable efficient Monte Carlo estimation. Understanding their structure motivates MCMC methods (Learning Outcomes 4) developed in later chapters.


References
----------

.. [Kahn1951] Kahn, H. and Harris, T. E. (1951). Estimation of particle transmission by random sampling. *National Bureau of Standards Applied Mathematics Series*, 12, 27–30. Foundational paper on importance sampling from the RAND Corporation.

.. [Neyman1934] Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. *Journal of the Royal Statistical Society*, 97(4), 558–625. Establishes optimal allocation theory for stratified sampling.

.. [HammersleyMorton1956] Hammersley, J. M. and Morton, K. W. (1956). A new Monte Carlo technique: antithetic variates. *Mathematical Proceedings of the Cambridge Philosophical Society*, 52(3), 449–475. Introduces antithetic variates.

.. [HammersleyHandscomb1964] Hammersley, J. M. and Handscomb, D. C. (1964). *Monte Carlo Methods*. Methuen, London. Classic monograph systematizing variance reduction techniques including control variates.

.. [McKayEtAl1979] McKay, M. D., Beckman, R. J., and Conover, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. *Technometrics*, 21(2), 239–245. Introduces Latin Hypercube Sampling.

.. [Kong1992] Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago. Establishes ESS interpretation for importance sampling.

.. [Owen1997] Owen, A. B. (1997). Monte Carlo variance of scrambled net quadrature. *SIAM Journal on Numerical Analysis*, 34(5), 1884–1910. Theoretical analysis of Latin Hypercube Sampling.

.. [BengtssonEtAl2008] Bengtsson, T., Bickel, P., and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In *Probability and Statistics: Essays in Honor of David A. Freedman*, 316–334. Institute of Mathematical Statistics. Proves exponential sample size requirements for importance sampling in high dimensions.

.. [RobertCasella2004] Robert, C. P. and Casella, G. (2004). *Monte Carlo Statistical Methods* (2nd ed.). Springer-Verlag, New York. Comprehensive treatment of variance reduction in the Monte Carlo context.

.. [Glasserman2004] Glasserman, P. (2004). *Monte Carlo Methods in Financial Engineering*. Springer-Verlag, New York. Variance reduction applications in financial computing.