.. _ch2.1-monte-carlo-fundamentals: ============================= Monte Carlo Fundamentals ============================= In the spring of 1946, the mathematician Stanislaw Ulam was recovering from a near-fatal case of viral encephalitis at his home in Los Angeles. To pass the time during his convalescence, he played countless games of solitaire—and found himself wondering: what is the probability of winning a game of Canfield solitaire? The combinatorics were hopelessly complex. There were too many possible configurations, too many branching paths through a game, to enumerate them all. But Ulam realized something profound: he didn't need to enumerate every possibility. He could simply *play* a hundred games and count how many he won. This insight—that we can estimate probabilities by running experiments rather than computing them analytically—was not new. The Comte de Buffon had used a similar approach in 1777 to estimate π by dropping needles onto a lined floor. But Ulam saw something that Buffon could not have imagined: the recently completed ENIAC computer could "play" millions of such games, transforming a parlor trick into a serious computational method. Within weeks, Ulam had discussed the idea with his colleague John von Neumann, and the two began developing what would become one of the most powerful computational frameworks in all of science. They needed a code name for this method, which they were applying to classified problems in nuclear weapons design at Los Alamos. Nicholas Metropolis suggested "Monte Carlo," after the famous casino in Monaco where Ulam's uncle had a gambling habit. The name stuck, and with it, a new era in computational science began. This chapter introduces Monte Carlo methods—a family of algorithms that use random sampling to solve problems that would otherwise be intractable. We will see how randomness, properly harnessed, becomes a precision instrument for computing integrals, estimating probabilities, and approximating quantities that resist analytical attack. The ideas are simple, but their power is immense: Monte Carlo methods now pervade physics, finance, machine learning, and statistics, anywhere that high-dimensional integration or complex probability calculations arise. .. admonition:: Road Map 🧭 :class: important • **Understand**: The fundamental principle that Monte Carlo integration estimates integrals as expectations of random samples, and why this works via the Law of Large Numbers and Central Limit Theorem • **Develop**: Deep intuition for the :math:`O(n^{-1/2})` convergence rate—what it means, why it arises, and its remarkable dimension-independence • **Implement**: Complete Python code for Monte Carlo estimation with variance quantification, confidence intervals, and convergence diagnostics • **Evaluate**: When Monte Carlo methods outperform deterministic alternatives, and how to assess estimation quality in practice • **Connect**: How Monte Carlo integration motivates the random variable generation techniques of subsequent sections .. admonition:: Notation Convention 📐 :class: note Throughout this text, we write :math:`X \sim \mathcal{N}(\mu, \sigma^2)` to denote a normal random variable with mean :math:`\mu` and **variance** :math:`\sigma^2`. That is, the second parameter is the variance, not the standard deviation. This is the most common convention in statistics and probability theory. When we need to emphasize the standard deviation, we write it explicitly as :math:`\text{SD}(X) = \sigma`. 
The Historical Development of Monte Carlo Methods ------------------------------------------------- Before diving into the mathematics, it is worth understanding how Monte Carlo methods emerged and evolved. This history illuminates why the methods work, what problems motivated their development, and why they remain central to computational science today. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/mc_timeline.png :alt: Timeline showing evolution of Monte Carlo methods from Buffon's Needle (1777) to modern neural-enhanced methods (2020s) :align: center :width: 100% **Historical Evolution of Monte Carlo Methods.** This timeline traces 250 years of algorithmic innovation, from Buffon's needle experiment in 1777 through the founding contributions of Ulam and von Neumann in the 1940s, the development of MCMC methods, resampling techniques, and modern neural-enhanced approaches. Each innovation opened new classes of problems to computational attack. Buffon's Needle: The First Monte Carlo Experiment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In 1777, Georges-Louis Leclerc, Comte de Buffon, posed a deceptively simple question: suppose we have a floor made of parallel wooden planks, each of width :math:`d`, and we drop a needle of length :math:`\ell \leq d` onto this floor. What is the probability that the needle crosses one of the cracks between planks? To answer this, Buffon introduced what we would now recognize as a probabilistic model. Let :math:`\theta` denote the angle between the needle and the direction of the planks, uniformly distributed on :math:`[0, \pi)`. Let :math:`y` denote the distance from the needle's center to the nearest crack, uniformly distributed on :math:`[0, d/2]`. The needle crosses a crack if and only if the vertical projection of half the needle exceeds the distance to the crack—that is, if :math:`y \leq \frac{\ell}{2} \sin\theta`. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig01_buffon_geometry.png :alt: Three-panel visualization of Buffon's needle problem showing physical setup, crossing geometry, and probability space :align: center :width: 100% **Buffon's Needle Geometry.** Left: The physical setup with needles scattered across parallel planks (red = crossing, blue = not crossing). Middle: The key geometry—a needle crosses if its center's distance :math:`y` to the nearest crack is less than the vertical projection :math:`(\ell/2)\sin\theta`. Right: The probability space showing the crossing region; the probability equals the ratio of areas: :math:`2\ell/(\pi d)`. The probability of crossing is therefore: .. math:: P(\text{crossing}) = \frac{1}{\pi} \cdot \frac{2}{d} \int_0^{\pi} \int_0^{(\ell/2)\sin\theta} dy \, d\theta Evaluating the inner integral yields :math:`\frac{\ell}{2}\sin\theta`, and the outer integral gives :math:`\int_0^{\pi} \sin\theta \, d\theta = 2`. Thus: .. math:: P(\text{crossing}) = \frac{1}{\pi} \cdot \frac{2}{d} \cdot \frac{\ell}{2} \cdot 2 = \frac{2\ell}{\pi d} This elegant result has a remarkable consequence. If we drop :math:`n` needles and observe that :math:`k` of them cross a crack, then our estimate of the crossing probability is :math:`\hat{p} = k/n`. Rearranging Buffon's formula: .. math:: \pi = \frac{2\ell}{d \cdot P(\text{crossing})} \approx \frac{2\ell n}{d k} We can estimate :math:`\pi` by throwing needles! 
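The estimator :math:`\pi \approx 2\ell n/(d k)` translates directly into a short simulation. Below is a minimal sketch (the function name and default lengths are illustrative, not from Buffon): draw :math:`\theta \sim \text{Uniform}[0, \pi)` and :math:`y \sim \text{Uniform}[0, d/2]`, count crossings, and invert the formula.

.. code-block:: python

    import numpy as np

    def buffon_pi_estimate(n_throws, needle_len=1.0, plank_width=1.0, seed=None):
        """Estimate pi via Buffon's needle (assumes needle_len <= plank_width)."""
        rng = np.random.default_rng(seed)
        theta = rng.uniform(0, np.pi, n_throws)          # angle to the planks
        y = rng.uniform(0, plank_width / 2, n_throws)    # center-to-nearest-crack distance
        crossings = np.sum(y <= (needle_len / 2) * np.sin(theta))
        # Invert P(crossing) = 2*ell/(pi*d):  pi ≈ 2*ell*n / (d*k)
        return 2 * needle_len * n_throws / (plank_width * crossings)

    print(buffon_pi_estimate(1_000_000, seed=42))   # close to 3.14 for large n_throws
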
This is a Monte Carlo method avant la lettre: we use random experiments to estimate a deterministic quantity. Of course, Buffon lacked computers, and actually throwing thousands of needles by hand is tedious. In 1901, the Italian mathematician Mario Lazzarini claimed to have obtained :math:`\pi \approx 3.1415929` by throwing a needle 3,408 times—suspiciously close to :math:`355/113`, a famously accurate rational approximation of :math:`\pi`. Most historians believe Lazzarini fudged his data, but the underlying principle was sound. .. admonition:: Try It Yourself 🖥️ Buffon's Needle Simulation :class: tip Experience Buffon's experiment interactively: **Interactive Simulation**: https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/buffons_needle_simulation.html Watch how the :math:`\pi` estimate fluctuates wildly with few needles, then gradually stabilizes as you accumulate thousands of throws. This is the Law of Large Numbers in action—a theme we will return to throughout this chapter. Fermi's Envelope Calculations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The physicist Enrico Fermi was famous for his ability to estimate quantities that seemed impossibly difficult to calculate. How many piano tuners are there in Chicago? How much energy is released in a nuclear explosion? Fermi would break these problems into pieces, estimate each piece roughly, and multiply—often achieving answers accurate to within an order of magnitude. Less well known is that Fermi also pioneered proto-Monte Carlo methods. In the 1930s, working on neutron diffusion problems in Rome, Fermi developed a mechanical device—essentially a specialized slide rule—that could generate random numbers to simulate neutron paths through matter. He used these simulations to estimate quantities like neutron absorption cross-sections, which were too complex to compute analytically. This work remained largely unpublished, but it anticipated the key insight of Monte Carlo: when a deterministic calculation is intractable, a stochastic simulation may succeed. Fermi's physical random number generator was crude, but the principle was the same one that Ulam and von Neumann would later implement on electronic computers. The Manhattan Project and ENIAC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The development of nuclear weapons during World War II created an urgent need for computational methods. The behavior of neutrons in a nuclear reaction—how they scatter, slow down, and trigger fission—depends on complex integrals over energy and angle that resist analytical solution. The physicists at Los Alamos needed numbers, not theorems. It was in this context that Ulam's solitaire insight proved transformative. Ulam and von Neumann realized that the same principle—estimate a complicated quantity by averaging over random samples—could be applied to neutron transport. Instead of integrating over all possible neutron paths analytically (impossible), they could simulate thousands of individual neutrons, tracking each one as it scattered through the weapon's core and was eventually absorbed. Von Neumann took the lead in implementing these ideas on ENIAC, one of the first general-purpose electronic computers. ENIAC could perform about 5,000 operations per second—glacially slow by modern standards, but revolutionary in 1946. Von Neumann and his team programmed ENIAC to simulate neutron histories, and the results helped validate the design of thermonuclear weapons.
The "Monte Carlo method" was formally introduced to the broader scientific community in a 1949 paper by Metropolis and Ulam, though much of the early work remained classified for decades. The name, coined by Metropolis, captured both the element of chance central to the method and the slightly disreputable excitement of gambling—a fitting tribute to Ulam's card-playing origins. Why "Monte Carlo" Changed Everything ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Monte Carlo revolution was not merely about having faster computers. It represented a conceptual breakthrough: *randomness is a computational resource*. By embracing uncertainty rather than fighting it, Monte Carlo methods could attack problems that deterministic methods could not touch. Consider the challenge of computing a 100-dimensional integral. Deterministic quadrature methods—the trapezoidal rule, Simpson's rule, Gaussian quadrature—all suffer from the "curse of dimensionality." If we use :math:`n` points per dimension, we need :math:`n^{100}` total evaluations. Even with :math:`n = 2`, this exceeds :math:`10^{30}`—more function evaluations than atoms in a human body. Monte Carlo methods sidestep this curse entirely. As we will see, the *convergence rate* of a Monte Carlo estimate depends only on the number of samples, not on the dimension of the space. A 100-dimensional integral converges at the same :math:`O(n^{-1/2})` rate as a one-dimensional integral. This dimension-independence of the *rate* is the source of Monte Carlo's power—though as we will discuss, the constant factor (the variance) may still depend on dimension. The Core Principle: Expectation as Integration ---------------------------------------------- We now turn to the mathematical foundations of Monte Carlo integration. The key insight is simple but profound: any integral can be rewritten as an expected value, and expected values can be estimated by averaging samples. From Integrals to Expectations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider a general integral of the form: .. math:: :label: general-integral I = \int_{\mathcal{X}} g(x) \, dx where :math:`\mathcal{X} \subseteq \mathbb{R}^d` is the domain of integration and :math:`g: \mathcal{X} \to \mathbb{R}` is the function we wish to integrate. At first glance, this seems like a problem for calculus, not probability. But watch what happens when we introduce a probability density. Let :math:`f(x)` be any probability density function on :math:`\mathcal{X}`—that is, :math:`f(x) \geq 0` everywhere and :math:`\int_{\mathcal{X}} f(x) \, dx = 1`. Provided :math:`f(x) > 0` wherever :math:`g(x) \neq 0`, we can rewrite our integral as: .. math:: :label: importance-rewrite I = \int_{\mathcal{X}} g(x) \, dx = \int_{\mathcal{X}} \frac{g(x)}{f(x)} f(x) \, dx = \mathbb{E}_f\left[ \frac{g(X)}{f(X)} \right] where the expectation is taken over a random variable :math:`X` with density :math:`f`. We have transformed an integral into an expected value! The simplest choice is the uniform density on :math:`\mathcal{X}`. If :math:`\mathcal{X}` has finite volume :math:`V = \int_{\mathcal{X}} dx`, then :math:`f(x) = 1/V` is a valid density, and: .. math:: I = V \cdot \mathbb{E}_{\text{Uniform}(\mathcal{X})}[g(X)] For example, to compute :math:`\int_0^1 e^{-x^2} dx`, we write: .. math:: \int_0^1 e^{-x^2} dx = \mathbb{E}[e^{-U^2}] \quad \text{where } U \sim \text{Uniform}(0, 1) This rewriting is always possible: we simply pick an :math:`f` whose support covers that of :math:`g`. But why is it useful?
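Before answering, here is the rewrite in action: a minimal sketch (illustrative only; the estimator is defined formally below) that draws :math:`U_i \sim \text{Uniform}(0,1)` and averages :math:`e^{-U_i^2}`.

.. code-block:: python

    import numpy as np
    from scipy.special import erf

    rng = np.random.default_rng(42)
    u = rng.uniform(0.0, 1.0, 100_000)      # U ~ Uniform(0, 1)
    estimate = np.mean(np.exp(-u**2))       # sample mean of e^{-U^2}
    exact = np.sqrt(np.pi) / 2 * erf(1)     # ≈ 0.746824, via the error function
    print(estimate, exact)                  # typically agree to about three decimals
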
**Two Cases to Keep Straight** Throughout this chapter, we encounter two closely related formulations: - **Case A (Expectation)**: We want :math:`I = \mathbb{E}_f[h(X)] = \int h(x) f(x) \, dx` where :math:`f` is a known density and :math:`X \sim f`. - **Case B (Plain integral)**: We want :math:`I = \int g(x) \, dx`, which we rewrite as :math:`I = \mathbb{E}_f[g(X)/f(X)]` for some sampling density :math:`f`. In both cases, the Monte Carlo estimator takes the form :math:`\hat{I}_n = \frac{1}{n}\sum_{i=1}^n w(X_i)` where :math:`w(x) = h(x)` in Case A and :math:`w(x) = g(x)/f(x)` in Case B. This unified view—*Monte Carlo is averaging weights*—will simplify later developments in importance sampling and variance reduction. The Monte Carlo Estimator ~~~~~~~~~~~~~~~~~~~~~~~~~ The power of the expectation formulation becomes clear when we recall the Law of Large Numbers. If :math:`X_1, X_2, \ldots, X_n` are independent and identically distributed (iid) with :math:`\mathbb{E}[X_i] = \mu`, then the sample mean converges to the true mean: .. math:: \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty Applied to our integral: .. admonition:: Definition: Monte Carlo Estimator :class: note Let :math:`X_1, X_2, \ldots, X_n` be iid samples from a density :math:`f` on :math:`\mathcal{X}`. The **Monte Carlo estimator** of the integral :math:`I = \int_{\mathcal{X}} h(x) f(x) \, dx = \mathbb{E}_f[h(X)]` is: .. math:: :label: mc-estimator-def \hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} h(X_i) More generally, for :math:`I = \int_{\mathcal{X}} g(x) \, dx` where we sample from density :math:`f`: .. math:: \hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)} The Monte Carlo method is disarmingly simple: draw random samples, evaluate the function at each sample, and average the results. No derivatives, no quadrature weights, no mesh generation—just sampling and averaging. But this simplicity conceals depth. The choice of sampling density :math:`f` is entirely up to us, and different choices lead to dramatically different performance. We will explore this in the section on importance sampling; for now, we focus on the "naive" case where :math:`f` matches the density of the integrand or is uniform on the domain. .. admonition:: Algorithm 2.1: Basic Monte Carlo Integration :class: note **Input**: Function :math:`h`, sampling distribution :math:`f`, sample size :math:`n` **Output**: Estimate :math:`\hat{I}_n` and standard error :math:`\widehat{\text{SE}}` .. code-block:: text 1. Generate X₁, X₂, ..., Xₙ iid from density f 2. Compute h(Xᵢ) for i = 1, ..., n 3. Estimate: Î_n = (1/n) Σᵢ h(Xᵢ) 4. Variance: σ̂² = (1/(n-1)) Σᵢ (h(Xᵢ) - Î_n)² 5. Standard Error: SE = σ̂ / √n 6. Return Î_n, SE **Complexity**: :math:`O(n)` function evaluations, :math:`O(n)` storage (or :math:`O(1)` with streaming) A First Example: Estimating :math:`\pi` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let us return to the problem of estimating :math:`\pi`, now with Monte Carlo machinery. Consider the integral: .. math:: \pi = \int_{-1}^{1} \int_{-1}^{1} \mathbf{1}_{x^2 + y^2 \leq 1} \, dx \, dy This is the area of the unit disk. Rewriting as an expectation: .. math:: \pi = 4 \cdot \mathbb{E}[\mathbf{1}_{X^2 + Y^2 \leq 1}] \quad \text{where } (X, Y) \sim \text{Uniform}([-1,1]^2) The factor of 4 accounts for the area of the square :math:`[-1,1]^2`. The Monte Carlo estimator is: .. 
math:: \hat{\pi}_n = \frac{4}{n} \sum_{i=1}^{n} \mathbf{1}_{X_i^2 + Y_i^2 \leq 1} That is: generate :math:`n` uniform points in the square, count how many fall inside the unit circle, and multiply by 4. The geometric intuition is immediate: the ratio of points landing inside the circle to total points approximates the ratio of areas, :math:`\pi/4`. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig02_pi_estimation.png :alt: Monte Carlo estimation of π showing scatter plot of points inside and outside unit circle, plus convergence plot :align: center :width: 100% **Monte Carlo π Estimation.** Left: Blue points fall inside the unit circle; coral points fall outside. The ratio of blue to total points estimates :math:`\pi/4`. Right: The running estimate stabilizes as samples accumulate—the characteristic "noisy convergence" of Monte Carlo. .. code-block:: python import numpy as np def estimate_pi_monte_carlo(n_samples, seed=None): """ Estimate π using Monte Carlo integration. Parameters ---------- n_samples : int Number of random points to generate. seed : int, optional Random seed for reproducibility. Returns ------- dict Contains estimate, standard_error, and confidence interval. """ rng = np.random.default_rng(seed) # Generate uniform points in [-1, 1]² x = rng.uniform(-1, 1, n_samples) y = rng.uniform(-1, 1, n_samples) # Count points inside unit circle inside = (x**2 + y**2) <= 1 # Monte Carlo estimate p_hat = np.mean(inside) pi_hat = 4 * p_hat # Standard error (indicator has Bernoulli variance p(1-p)) se_p = np.sqrt(p_hat * (1 - p_hat) / n_samples) se_pi = 4 * se_p # 95% asymptotic normal confidence interval (CLT justifies this for large n) ci = (pi_hat - 1.96 * se_pi, pi_hat + 1.96 * se_pi) return { 'estimate': pi_hat, 'standard_error': se_pi, 'ci_95': ci, 'n_samples': n_samples } # Run the estimation result = estimate_pi_monte_carlo(100_000, seed=42) print(f"π estimate: {result['estimate']:.6f}") print(f"True π: {np.pi:.6f}") print(f"Error: {abs(result['estimate'] - np.pi):.6f}") print(f"Std Error: {result['standard_error']:.6f}") print(f"95% CI: ({result['ci_95'][0]:.6f}, {result['ci_95'][1]:.6f})") .. code-block:: text π estimate: 3.143080 True π: 3.141593 Error: 0.001487 Std Error: 0.005190 95% CI: (3.132908, 3.153252) The true value :math:`\pi` lies comfortably within the 95% asymptotic normal confidence interval (valid by CLT for large :math:`n`). With a million samples, the error shrinks by a factor of :math:`\sqrt{10} \approx 3.16`, and with ten million, by another factor of :math:`\sqrt{10}`. **Why Bernoulli variance?** The comment in the code mentions "Bernoulli variance"—let's unpack this. Each random point either lands inside the circle (we record :math:`I_i = 1`) or outside (we record :math:`I_i = 0`). This is a **Bernoulli trial** with success probability :math:`p = \pi/4`, which is the ratio of the circle's area to the square's area. For any Bernoulli random variable: .. math:: \mathbb{E}[I_i] = p, \qquad \text{Var}(I_i) = p(1 - p) Our estimator :math:`\hat{p} = \frac{1}{n}\sum_{i=1}^n I_i` is the sample proportion of points inside the circle. Since the :math:`I_i` are independent: .. math:: \text{Var}(\hat{p}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n I_i\right) = \frac{1}{n^2} \cdot n \cdot p(1-p) = \frac{p(1-p)}{n} The standard error is the square root: :math:`\text{SE}(\hat{p}) = \sqrt{p(1-p)/n}`. Since :math:`\hat{\pi} = 4\hat{p}`, the standard error of :math:`\hat{\pi}` scales by 4: .. 
math:: \text{SE}(\hat{\pi}) = 4 \cdot \sqrt{\frac{p(1-p)}{n}} We don't know the true :math:`p`, so we substitute our estimate :math:`\hat{p}` to get a usable formula. This Bernoulli structure is a special case of the general Monte Carlo variance formula—whenever the function :math:`h(x)` is an indicator (0 or 1), the variance simplifies to the familiar :math:`p(1-p)` form. .. admonition:: Common Pitfall ⚠️ Wald Confidence Intervals for Proportions :class: warning The normal approximation CI :math:`\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}` (called the Wald interval) can perform poorly when :math:`p` is near 0 or 1, or when :math:`n` is small. For rare events, consider alternatives like the Wilson score interval, Agresti-Coull interval, or exact binomial methods. For the :math:`\pi` estimation example where :math:`p \approx 0.785`, the Wald interval is adequate. This is Monte Carlo at its most basic: evaluate a simple function at random points and average. But even this toy example illustrates the key features of the method—ease of implementation, probabilistic error bounds, and graceful scaling with sample size. Theoretical Foundations ----------------------- Why does the Monte Carlo method work? What determines the rate of convergence? These questions have precise mathematical answers rooted in classical probability theory. .. admonition:: Key Assumptions for Monte Carlo 📋 :class: important The theoretical guarantees for Monte Carlo integration require: 1. **Independence**: Samples :math:`X_1, \ldots, X_n` are independent (not merely uncorrelated) 2. **Identical distribution**: All samples come from the same distribution :math:`f` 3. **Finite first moment**: :math:`\mathbb{E}[|h(X)|] < \infty` for LLN convergence 4. **Finite second moment**: :math:`\mathbb{E}[h(X)^2] < \infty` for CLT and valid confidence intervals When these assumptions fail—dependent samples (MCMC), infinite variance (heavy tails), misspecified sampling distribution—the standard theory must be modified. We address these issues in later sections and chapters. The Law of Large Numbers ~~~~~~~~~~~~~~~~~~~~~~~~ The Law of Large Numbers (LLN) is the foundational result guaranteeing that Monte Carlo estimators converge to the true value. There are several versions; we state the strongest form. .. admonition:: Theorem: Strong Law of Large Numbers :class: note Let :math:`X_1, X_2, \ldots` be independent and identically distributed random variables with :math:`\mathbb{E}[|X_1|] < \infty`. Then: .. math:: \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mathbb{E}[X_1] \quad \text{as } n \to \infty The notation :math:`\xrightarrow{\text{a.s.}}` denotes *almost sure convergence*: the probability that the sequence converges is exactly 1. For Monte Carlo integration, we apply this theorem to the random variables :math:`h(X_i)`, where the :math:`X_i` are iid from density :math:`f`. The condition :math:`\mathbb{E}[|h(X)|] < \infty` ensures that the integral we are estimating actually exists. The LLN tells us that :math:`\hat{I}_n \to I` with probability 1. No matter how complex the integrand, no matter how high the dimension, the Monte Carlo estimator will eventually get arbitrarily close to the true value. This is an extraordinarily powerful guarantee. But the LLN is silent on *how fast* convergence occurs. For that, we need the Central Limit Theorem.
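Before turning to the rate question, here is the LLN at work in a minimal sketch (an illustrative snippet, not one of the worked examples): three independent runs estimating :math:`\mathbb{E}[X^2] = 1/3` for :math:`X \sim \text{Uniform}(0,1)` all settle toward the same limit, each along its own noisy path.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    for run in range(3):
        x = rng.uniform(0.0, 1.0, n)
        running_mean = np.cumsum(x**2) / np.arange(1, n + 1)   # estimate after k samples
        # Early estimates fluctuate; later ones settle near the true value 1/3
        print(f"run {run}: n=100 -> {running_mean[99]:.4f}, "
              f"n={n} -> {running_mean[-1]:.4f}")
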
The Central Limit Theorem and the :math:`O(n^{-1/2})` Rate ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Central Limit Theorem (CLT) is the workhorse result for quantifying Monte Carlo error. .. admonition:: Theorem: Central Limit Theorem :class: note Let :math:`X_1, X_2, \ldots` be iid with mean :math:`\mu` and variance :math:`\sigma^2 < \infty`. Then: .. math:: \sqrt{n} \left( \bar{X}_n - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2) where :math:`\xrightarrow{d}` denotes convergence in distribution. Equivalently, for large :math:`n`: .. math:: \bar{X}_n \stackrel{\cdot}{\sim} \mathcal{N}\left( \mu, \frac{\sigma^2}{n} \right) Applied to Monte Carlo integration: .. math:: \hat{I}_n \stackrel{\cdot}{\sim} \mathcal{N}\left( I, \frac{\sigma^2}{n} \right) where :math:`\sigma^2 = \text{Var}_f[h(X)] = \mathbb{E}_f[(h(X) - I)^2]` is the variance of the integrand under the sampling distribution. The **standard error** of the Monte Carlo estimator is: .. math:: :label: mc-standard-error \text{SE}(\hat{I}_n) = \frac{\sigma}{\sqrt{n}} This is the celebrated :math:`O(n^{-1/2})` convergence rate. To reduce the standard error by a factor of 10, we need 100 times as many samples. To gain one decimal place of accuracy, we need 100 times the computational effort. **The rate is dimension-independent, but the constant may not be.** Whether we integrate over :math:`\mathbb{R}` or :math:`\mathbb{R}^{1000}`, the error decreases as :math:`1/\sqrt{n}`. This dimension-independence of the *rate* is Monte Carlo's fundamental advantage. However, the variance :math:`\sigma^2 = \text{Var}[h(X)]` can itself depend on dimension—poorly designed estimators may have variance that grows with :math:`d`, partially offsetting Monte Carlo's advantage. This is why importance sampling, control variates, and careful problem formulation remain important even for Monte Carlo. .. admonition:: Example 💡 Understanding the Square Root Law :class: note **Scenario**: You estimate an integral with 1,000 samples and get a standard error of 0.1. Your boss needs the error reduced to 0.01. **Analysis**: The standard error scales as :math:`1/\sqrt{n}`. To reduce the standard error by a factor of 10, you need :math:`n` to increase by a factor of :math:`10^2 = 100`. **Conclusion**: You need :math:`1000 \times 100 = 100,000` samples. This quadratic penalty is the price of Monte Carlo's simplicity. In low dimensions, deterministic methods often achieve polynomial convergence rates like :math:`O(n^{-2})` or better, making them far more efficient. But in high dimensions, Monte Carlo's dimension-independent :math:`O(n^{-1/2})` rate beats any polynomial rate that degrades exponentially with dimension. Why the Square Root? ~~~~~~~~~~~~~~~~~~~~ The :math:`1/\sqrt{n}` rate may seem mysterious, but it has a simple explanation rooted in the behavior of sums of random variables. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig03_sqrt_n_convergence.png :alt: Three-panel visualization explaining Monte Carlo convergence: LLN guarantees convergence, CLT shows distributions tightening, and the O(n^{-1/2}) rate governs speed :align: center :width: 100% **Monte Carlo Convergence: Theory and Practice.** Left: The Law of Large Numbers in action—eight independent Monte Carlo trajectories estimating π all converge to the true value (red dashed line), with the 95% confidence band shrinking as n grows. 
Middle: The Central Limit Theorem visualized—sampling distributions of the estimator for n = 25, 100, 400, and 1600 show how estimates concentrate around the true value as sample size increases. Right: The :math:`O(n^{-1/2})` convergence rate—standard error decreases as :math:`\sigma/\sqrt{n}`, meaning 100× more samples yield only 10× error reduction (one additional decimal of precision). Consider :math:`n` iid random variables :math:`X_1, \ldots, X_n`, each with variance :math:`\sigma^2`. Their sum has variance: .. math:: \text{Var}\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n \text{Var}(X_i) = n\sigma^2 The variance of the sum grows linearly with :math:`n`. But when we take the mean, we divide by :math:`n`: .. math:: \text{Var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n} The standard deviation is the square root of variance, giving :math:`\sigma/\sqrt{n}`. This behavior is fundamental to averages of random quantities. Each additional sample adds information, but with diminishing returns: the first sample reduces uncertainty enormously; the millionth sample contributes almost nothing. This is why the square root appears. Variance Estimation and Confidence Intervals -------------------------------------------- The CLT tells us that :math:`\hat{I}_n` is approximately normal with known variance :math:`\sigma^2/n`. But we rarely know :math:`\sigma^2`—it depends on the integrand and the sampling distribution. We must estimate it from the same samples we use to estimate :math:`I`. The Sample Variance ~~~~~~~~~~~~~~~~~~~ The natural estimator of :math:`\sigma^2 = \text{Var}[h(X)]` is the sample variance: .. math:: :label: sample-var \hat{\sigma}^2_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( h(X_i) - \hat{I}_n \right)^2 The divisor :math:`n-1` (rather than :math:`n`) makes this estimator unbiased: :math:`\mathbb{E}[\hat{\sigma}^2_n] = \sigma^2`. This is known as Bessel's correction. By the Law of Large Numbers, :math:`\hat{\sigma}^2_n \to \sigma^2` almost surely as :math:`n \to \infty`. Combined with the CLT, this gives us a practical way to construct confidence intervals. Constructing Confidence Intervals ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ An asymptotic :math:`(1-\alpha)` confidence interval for :math:`I` is: .. math:: :label: mc-ci-full \left[ \hat{I}_n - z_{\alpha/2} \frac{\hat{\sigma}_n}{\sqrt{n}}, \quad \hat{I}_n + z_{\alpha/2} \frac{\hat{\sigma}_n}{\sqrt{n}} \right] where :math:`z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)` is the standard normal quantile. For common confidence levels: - 90% CI: :math:`z_{0.05} \approx 1.645` - 95% CI: :math:`z_{0.025} \approx 1.960` - 99% CI: :math:`z_{0.005} \approx 2.576` The interval has the interpretation: in repeated sampling, approximately :math:`(1-\alpha) \times 100\%` of such intervals will contain the true value :math:`I`. **Standard error vs. confidence interval half-width**: These are related but distinct concepts: - **Standard error (SE)**: :math:`\widehat{\text{SE}} = \hat{\sigma}/\sqrt{n}`, a measure of estimation precision - **95% CI half-width**: :math:`1.96 \times \widehat{\text{SE}}`, the margin of error for a specific confidence level When planning sample sizes, be explicit about which quantity you are targeting. To achieve SE :math:`\leq \epsilon`, you need :math:`n \geq \sigma^2/\epsilon^2`. To achieve 95% CI half-width :math:`\leq \epsilon`, you need :math:`n \geq (1.96\sigma/\epsilon)^2 \approx 3.84\sigma^2/\epsilon^2`—about 4× more samples. .. 
code-block:: python import numpy as np from scipy import stats def monte_carlo_integrate(h, sampler, n_samples, confidence=0.95, seed=None): """ Monte Carlo integration with uncertainty quantification. Parameters ---------- h : callable Function to integrate. Must accept array input. sampler : callable Function that takes (n, rng) and returns n samples from the target distribution. n_samples : int Number of Monte Carlo samples. confidence : float Confidence level for interval (default 0.95). seed : int, optional Random seed for reproducibility. Returns ------- dict Contains estimate, std_error, confidence interval, and diagnostics. """ rng = np.random.default_rng(seed) # Generate samples and evaluate function samples = sampler(n_samples, rng) h_values = h(samples) # Point estimate estimate = np.mean(h_values) # Variance estimation (Bessel's correction) variance = np.var(h_values, ddof=1) std_error = np.sqrt(variance / n_samples) # Confidence interval z = stats.norm.ppf(1 - (1 - confidence) / 2) ci_lower = estimate - z * std_error ci_upper = estimate + z * std_error # Effective sample size (for future variance reduction comparisons) ess = n_samples # For standard MC, ESS = n return { 'estimate': estimate, 'std_error': std_error, 'variance': variance, 'ci': (ci_lower, ci_upper), 'confidence': confidence, 'n_samples': n_samples, 'ess': ess, 'h_values': h_values } .. admonition:: Example 💡 Using the Monte Carlo Integration Function :class: note **Problem:** Estimate :math:`\int_0^2 e^{-x^2} dx` using our ``monte_carlo_integrate`` function. **Setup:** We need to define: (1) the integrand :math:`h(x) = 2 e^{-x^2}` (the factor of 2 accounts for the interval length), and (2) a sampler that generates uniform samples on :math:`[0, 2]`. .. code-block:: python import numpy as np from scipy import stats from scipy.special import erf # Define the integrand (scaled by interval length) def h(x): return 2 * np.exp(-x**2) # Define the sampler: Uniform(0, 2) def uniform_sampler(n, rng): return rng.uniform(0, 2, n) # Run Monte Carlo integration result = monte_carlo_integrate( h=h, sampler=uniform_sampler, n_samples=100_000, confidence=0.95, seed=42 ) # True value for comparison true_value = np.sqrt(np.pi) / 2 * erf(2) print(f"Estimate: {result['estimate']:.6f}") print(f"True value: {true_value:.6f}") print(f"Std Error: {result['std_error']:.6f}") print(f"95% CI: ({result['ci'][0]:.6f}, {result['ci'][1]:.6f})") print(f"CI contains true value: {result['ci'][0] <= true_value <= result['ci'][1]}") **Output:** .. code-block:: text Estimate: 0.880204 True value: 0.882081 Std Error: 0.002179 95% CI: (0.875934, 0.884474) CI contains true value: True The function returns all the diagnostics we need: the point estimate, standard error for assessing precision, and a confidence interval that correctly captures the true value. The ``h_values`` array can be passed to convergence diagnostics for further analysis. Numerical Stability: Welford's Algorithm ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Computing the sample variance naively using the one-pass formula :math:`\frac{1}{n-1}\left(\sum h_i^2 - \frac{(\sum h_i)^2}{n}\right)` can suffer catastrophic cancellation when the mean is large compared to the standard deviation. The two terms :math:`\sum h_i^2` and :math:`\frac{(\sum h_i)^2}{n}` may be nearly equal, and their difference may lose many significant digits. **The Problem Illustrated**: Suppose we have data with mean :math:`\mu = 10^9` and standard deviation :math:`\sigma = 1`. 
Then :math:`\sum x_i^2 \approx n \cdot 10^{18}` and :math:`(\sum x_i)^2/n \approx n \cdot 10^{18}` as well. Their difference should be approximately :math:`n \cdot \sigma^2 = n`, but when subtracting two numbers of size :math:`10^{18}` that agree in their first 16-17 digits, we lose almost all precision in 64-bit floating point arithmetic (which has about 15-16 significant decimal digits). **Welford's Insight**: Instead of computing variance from :math:`\sum x_i^2` and :math:`\sum x_i`, we can maintain the sum of squared deviations *from the current running mean*. As each new observation arrives, we update both the mean and the sum of squared deviations using a clever algebraic identity. Let :math:`\bar{x}_n` denote the mean of the first :math:`n` observations, and let :math:`M_{2,n} = \sum_{i=1}^n (x_i - \bar{x}_n)^2` denote the sum of squared deviations from this mean. Welford showed that these can be updated incrementally: .. math:: \bar{x}_n &= \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n} \\ M_{2,n} &= M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n) The key insight is that :math:`(x_n - \bar{x}_{n-1})` and :math:`(x_n - \bar{x}_n)` are both small numbers (deviations from means), so their product is numerically stable. We never subtract two large, nearly-equal quantities. The sample variance is then simply :math:`s^2 = M_{2,n} / (n-1)`. .. code-block:: python class WelfordAccumulator: """ Online algorithm for computing mean and variance in a single pass. This is a true streaming algorithm: we never store the data, only the running statistics. Memory usage is O(1) regardless of how many values we process. """ def __init__(self): self.n = 0 self.mean = 0.0 self.M2 = 0.0 # Sum of squared deviations from current mean def update(self, x): """Process a single new observation.""" self.n += 1 delta = x - self.mean self.mean += delta / self.n delta2 = x - self.mean # Note: uses UPDATED mean self.M2 += delta * delta2 @property def variance(self): """Sample variance (with Bessel's correction).""" if self.n < 2: return float('nan') return self.M2 / (self.n - 1) @property def std(self): """Sample standard deviation.""" return np.sqrt(self.variance) # Demonstrate the numerical stability issue import numpy as np rng = np.random.default_rng(5678) large_mean_data = 1e9 + rng.standard_normal(10000) # Mean ≈ 10⁹, SD ≈ 1 # Welford's algorithm (stable) acc = WelfordAccumulator() for x in large_mean_data: acc.update(x) # Naive one-pass formula (UNSTABLE) n = len(large_mean_data) sum_sq = np.sum(large_mean_data**2) sum_x = np.sum(large_mean_data) naive_var = (sum_sq - sum_x**2 / n) / (n - 1) # NumPy's implementation (also stable) numpy_var = np.var(large_mean_data, ddof=1) print(f"True variance (approx): 1.0") print(f"Welford (stable): {acc.variance:.6f}") print(f"NumPy (stable): {numpy_var:.6f}") print(f"Naive one-pass: {naive_var:.2f} <- likely wrong!") The naive formula can give wildly incorrect results (sometimes off by orders of magnitude, sometimes even negative!) due to catastrophic cancellation. Both Welford and NumPy give the correct answer. The exact value of the naive result depends on hardware, compiler optimizations, and BLAS implementations—the point is that it fails catastrophically, not the specific wrong value it produces. In practice, NumPy's ``np.var`` uses numerically stable algorithms, so you rarely need to implement Welford's algorithm yourself. But understanding why stability matters is important for several reasons: 1. 
**Debugging unexpected results**: If you're computing variances in a custom loop or in a language without stable built-ins, you may encounter this issue. 2. **Streaming data**: Welford's algorithm processes data in a single pass, making it ideal for streaming applications where you can't store all values in memory. 3. **Parallel computation**: The algorithm can be extended to combine statistics from separate batches (useful for distributed computing). 4. **Understanding Monte Carlo diagnostics**: Running variance calculations in convergence diagnostics use similar techniques. The key lesson: when the mean is large relative to the standard deviation, naive variance formulas fail catastrophically. Always use stable algorithms. Worked Examples --------------- We now work through several examples of increasing complexity, illustrating the breadth of Monte Carlo applications. Example 1: The Gaussian Integral ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The integral :math:`\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}` is famous for being impossible to evaluate in closed form using elementary antiderivatives, yet having a beautiful exact answer. Let us estimate it via Monte Carlo. **Challenge**: The domain is infinite, so we cannot sample uniformly. **Solution**: Recognize the integrand as proportional to a Gaussian density. Recall our notation convention: :math:`\mathcal{N}(\mu, \sigma^2)` denotes a normal distribution with mean :math:`\mu` and **variance** :math:`\sigma^2`. If :math:`X \sim \mathcal{N}(0, 1/2)` (variance = 1/2, so standard deviation = :math:`1/\sqrt{2}`), then: .. math:: f(x) = \frac{1}{\sqrt{2\pi \cdot (1/2)}} \exp\left(-\frac{x^2}{2 \cdot (1/2)}\right) = \frac{1}{\sqrt{\pi}} e^{-x^2} This is proportional to our integrand :math:`e^{-x^2}`, with the normalizing constant being exactly what we want to compute. We can verify: .. math:: \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} \int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}} e^{-x^2} dx = \sqrt{\pi} \cdot 1 = \sqrt{\pi} This derivation shows the integral equals :math:`\sqrt{\pi}` exactly. But let's also verify by Monte Carlo using importance sampling. **Monte Carlo approach**: Sample from :math:`\mathcal{N}(0, 1)` (standard normal) and reweight: .. code-block:: python import numpy as np from scipy import stats def gaussian_integral_mc(n_samples, seed=42): """ Estimate ∫ exp(-x²) dx via Monte Carlo importance sampling. We sample from N(0, 1) and use: ∫ exp(-x²) dx = ∫ [exp(-x²) / φ(x)] φ(x) dx where φ(x) = exp(-x²/2) / √(2π) is the standard normal density. The ratio: exp(-x²) / φ(x) = √(2π) · exp(-x²/2) """ rng = np.random.default_rng(seed) # Sample from N(0, 1) X = rng.standard_normal(n_samples) # Importance weight: exp(-x²) / φ(x) = √(2π) · exp(-x²/2) h_values = np.sqrt(2 * np.pi) * np.exp(-X**2 / 2) estimate = np.mean(h_values) std_error = np.std(h_values, ddof=1) / np.sqrt(n_samples) return estimate, std_error est, se = gaussian_integral_mc(100_000) print(f"Estimate: {est:.6f} ± {se:.6f}") print(f"True value: {np.sqrt(np.pi):.6f}") .. code-block:: text Estimate: 1.770751 ± 0.002214 True value: 1.772454 The estimate is very close to :math:`\sqrt{\pi} \approx 1.7725`. Example 2: Probability of a Rare Event ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppose :math:`X \sim \mathcal{N}(0, 1)`. What is :math:`P(X > 4)`? From standard normal tables, :math:`P(X > 4) = 1 - \Phi(4) \approx 3.167 \times 10^{-5}`. This is a rare event—only about 3 in 100,000 standard normal draws exceed 4. **Naive Monte Carlo**: .. 
code-block:: python import numpy as np from scipy import stats rng = np.random.default_rng(42) n = 100_000 X = rng.standard_normal(n) p_hat = np.mean(X > 4) se = np.sqrt(p_hat * (1 - p_hat) / n) print(f"Estimate: {p_hat:.6e}") print(f"Std Error: {se:.6e}") print(f"True value: {1 - stats.norm.cdf(4):.6e}") .. code-block:: text Estimate: 9.000000e-05 Std Error: 2.999865e-05 True value: 3.167124e-05 With 100,000 samples, we observed 9 exceedances (instead of the expected ~3), giving an estimate nearly 3× too large! This illustrates the high variability inherent in estimating rare events. The relative error (standard error divided by estimate) is enormous. **Problem**: To estimate a probability :math:`p`, the standard error is :math:`\sqrt{p(1-p)/n} \approx \sqrt{p/n}` for small :math:`p`. The relative error is :math:`\sqrt{(1-p)/(np)} \approx 1/\sqrt{np}`. To achieve 10% relative error for :math:`p = 10^{-5}`, we need :math:`n \approx 100/p = 10^7` samples. This motivates **importance sampling** (covered in a later section), which generates samples preferentially in the region of interest. For now, the lesson is that naive Monte Carlo struggles with rare events. Example 3: A High-Dimensional Integral ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider the integral: .. math:: I = \int_{[0,1]^d} \prod_{j=1}^{d} \left( 1 + \frac{x_j}{d} \right) dx_1 \cdots dx_d For large :math:`d`, this integral is analytically tractable. The key observation is that the integrand *factors* into a product of functions, each depending on only one variable: .. math:: \prod_{j=1}^{d} \left( 1 + \frac{x_j}{d} \right) = \underbrace{\left(1 + \frac{x_1}{d}\right)}_{f_1(x_1)} \cdot \underbrace{\left(1 + \frac{x_2}{d}\right)}_{f_2(x_2)} \cdots \underbrace{\left(1 + \frac{x_d}{d}\right)}_{f_d(x_d)} When an integrand factors this way, **Fubini's theorem** tells us the multiple integral equals the product of single integrals: .. math:: \int_{[0,1]^d} f_1(x_1) \cdot f_2(x_2) \cdots f_d(x_d) \, dx_1 \cdots dx_d = \left(\int_0^1 f_1(x_1) \, dx_1\right) \cdot \left(\int_0^1 f_2(x_2) \, dx_2\right) \cdots \left(\int_0^1 f_d(x_d) \, dx_d\right) This is analogous to how :math:`\sum_{i,j} a_i b_j = (\sum_i a_i)(\sum_j b_j)` for sums. The factorization works because each :math:`x_j` integrates independently over its own copy of :math:`[0,1]`. Applying this to our integrand, each single integral evaluates to: .. math:: \int_0^1 \left(1 + \frac{x_j}{d}\right) dx_j = \left[x_j + \frac{x_j^2}{2d}\right]_0^1 = 1 + \frac{1}{2d} Since all :math:`d` integrals are identical, we get: .. math:: I = \prod_{j=1}^{d} \int_0^1 \left( 1 + \frac{x_j}{d} \right) dx_j = \left( 1 + \frac{1}{2d} \right)^d \xrightarrow{d \to \infty} \sqrt{e} The limit :math:`\sqrt{e}` follows from the definition of :math:`e` as :math:`\lim_{n \to \infty} (1 + 1/n)^n`. To see this, write: .. math:: \left(1 + \frac{1}{2d}\right)^d = \left[\left(1 + \frac{1}{2d}\right)^{2d}\right]^{1/2} \xrightarrow{d \to \infty} e^{1/2} = \sqrt{e} .. note:: We chose this particular integrand *because* it factors nicely, giving us a known true value to verify our Monte Carlo estimates. Most high-dimensional integrands do not factor this way—the variables interact in complex ways—which is precisely why Monte Carlo methods are essential. When there's no analytical solution, Monte Carlo may be the only practical approach. .. 
figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig06_high_dim_integral.png :alt: Three-panel visualization of high-dimensional integral showing true value convergence, error distributions, and parallel convergence rates :align: center :width: 100% **High-Dimensional Integration.** Left: The true integral value approaches :math:`\sqrt{e}` as dimension increases. Middle: At fixed sample size :math:`n = 10{,}000`, error spread actually *decreases* with dimension for this integrand—each factor :math:`(1 + x_j/d)` becomes closer to 1 as :math:`d` grows, reducing variance. This is integrand-specific, not a general property. Right: The key insight is that the **convergence rate** :math:`O(n^{-1/2})` is identical across all dimensions—the curves are parallel on the log-log plot. The constant (vertical offset) may vary, but the slope does not. This dimension-independent rate is Monte Carlo's superpower. Let us estimate this integral in :math:`d = 100` dimensions: .. code-block:: python import numpy as np def high_dim_integrand(X): """ Compute product integrand: ∏(1 + x_j/d). Parameters ---------- X : ndarray of shape (n_samples, d) Sample points in [0,1]^d. Returns ------- ndarray of shape (n_samples,) Function values. Note: For very large d (>1000), consider using np.exp(np.sum(np.log1p(X/d), axis=1)) for better numerical stability. """ d = X.shape[1] return np.prod(1 + X / d, axis=1) def mc_high_dim(d, n_samples, seed=42): """Monte Carlo integration in d dimensions.""" rng = np.random.default_rng(seed) # Uniform samples in [0,1]^d X = rng.random((n_samples, d)) # Evaluate integrand h_values = high_dim_integrand(X) estimate = np.mean(h_values) std_error = np.std(h_values, ddof=1) / np.sqrt(n_samples) # True value (for comparison) true_value = (1 + 0.5/d)**d return estimate, std_error, true_value # Estimate in 100 dimensions d = 100 est, se, truth = mc_high_dim(d, 100_000) print(f"d = {d}") print(f"Estimate: {est:.6f} ± {se:.6f}") print(f"True value: {truth:.6f}") print(f"Error: {abs(est - truth):.6f}") .. code-block:: text d = 100 Estimate: 1.646530 ± 0.000149 True value: 1.646668 Error: 0.000138 The Monte Carlo estimate converges as :math:`O(n^{-1/2})` regardless of :math:`d`. Try increasing :math:`d` to 1000 or 10,000—the convergence rate remains unchanged. Example 4: Bayesian Posterior Mean ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Bayesian inference often requires computing posterior expectations: .. math:: \mathbb{E}[\theta | \text{data}] = \int \theta \cdot \pi(\theta | \text{data}) \, d\theta When the posterior :math:`\pi(\theta | \text{data})` is available (perhaps up to a normalizing constant), Monte Carlo integration applies directly—if we can sample from the posterior. This is the motivation for Markov chain Monte Carlo methods in Part 3. As a simple example, suppose we observe :math:`x = 7` successes in :math:`n = 10` Bernoulli trials with unknown success probability :math:`\theta`. With a :math:`\text{Beta}(1, 1)` (uniform) prior, the posterior is :math:`\text{Beta}(8, 4)`: .. 
code-block:: python import numpy as np from scipy import stats # Posterior is Beta(8, 4) alpha, beta = 8, 4 posterior = stats.beta(alpha, beta) # True posterior mean true_mean = alpha / (alpha + beta) # Monte Carlo estimate rng = np.random.default_rng(42) n_samples = 10_000 theta_samples = posterior.rvs(n_samples, random_state=rng) mc_mean = np.mean(theta_samples) mc_se = np.std(theta_samples, ddof=1) / np.sqrt(n_samples) print(f"True posterior mean: {true_mean:.6f}") print(f"MC estimate: {mc_mean:.6f} ± {mc_se:.6f}") # We can also estimate posterior quantiles, variance, etc. print(f"Posterior median (MC): {np.median(theta_samples):.6f}") print(f"Posterior median (exact): {posterior.ppf(0.5):.6f}") .. code-block:: text True posterior mean: 0.666667 MC estimate: 0.666200 ± 0.001320 Posterior median (MC): 0.675577 Posterior median (exact): 0.676196 The Monte Carlo estimates closely match the exact values. In more complex Bayesian models where the posterior has no closed form, Monte Carlo (via MCMC) becomes essential. Example 5: The Normal CDF ~~~~~~~~~~~~~~~~~~~~~~~~~ The cumulative distribution function of the standard normal, :math:`\Phi(t) = P(Z \leq t)` for :math:`Z \sim \mathcal{N}(0, 1)`, has no closed-form expression. Yet it is one of the most important functions in statistics. Let us estimate :math:`\Phi(1.96)`: .. code-block:: python import numpy as np from scipy import stats def estimate_normal_cdf(t, n_samples, seed=42): """Estimate Φ(t) = P(Z ≤ t) for Z ~ N(0,1).""" rng = np.random.default_rng(seed) Z = rng.standard_normal(n_samples) # Indicator function: 1 if Z ≤ t, else 0 indicators = Z <= t p_hat = np.mean(indicators) se = np.sqrt(p_hat * (1 - p_hat) / n_samples) return p_hat, se t = 1.96 est, se = estimate_normal_cdf(t, 1_000_000) true_val = stats.norm.cdf(t) print(f"Φ({t}) estimate: {est:.6f} ± {se:.6f}") print(f"Φ({t}) true: {true_val:.6f}") .. code-block:: text Φ(1.96) estimate: 0.974928 ± 0.000156 Φ(1.96) true: 0.975002 With one million samples, the estimate is accurate to about four decimal places. For extreme quantiles (e.g., :math:`t = 5`), the probability is so small that accurate estimation requires importance sampling. Comparison with Deterministic Methods ------------------------------------- Monte Carlo integration is not the only way to compute integrals numerically. Deterministic quadrature methods—the trapezoidal rule, Simpson's rule, Gaussian quadrature—have been studied for centuries and, in low dimensions, often outperform Monte Carlo. Understanding when to use which approach is essential for the computational practitioner. One-Dimensional Quadrature ~~~~~~~~~~~~~~~~~~~~~~~~~~ For a one-dimensional integral :math:`\int_a^b f(x) dx`, deterministic methods exploit the smoothness of :math:`f` to achieve rapid convergence. The core idea is to approximate the integral as a weighted sum of function values at selected points: .. math:: \int_a^b f(x) dx \approx \sum_{i=1}^{n} w_i f(x_i) The methods differ in how they choose the points :math:`x_i` and weights :math:`w_i`. **Trapezoidal Rule**: Approximate the integrand by piecewise linear functions connecting adjacent points. For :math:`n` equally spaced points :math:`x_0, x_1, \ldots, x_{n-1}` with spacing :math:`h = (b-a)/(n-1)`: .. math:: \int_a^b f(x) dx \approx h \left[ \frac{f(x_0)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-2}) + \frac{f(x_{n-1})}{2} \right] Geometrically, this approximates the area under the curve by a series of trapezoids. 
The error is :math:`O(h^2) = O(n^{-2})` for twice-differentiable :math:`f`—doubling the number of points reduces the error by a factor of 4. **Simpson's Rule**: Approximate by piecewise quadratic (parabolic) curves through consecutive triples of points. For an odd number of equally spaced points: .. math:: \int_a^b f(x) dx \approx \frac{h}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-2}) + f(x_{n-1}) \right] The alternating pattern of coefficients (1, 4, 2, 4, 2, ..., 4, 1) arises from fitting parabolas through each group of three points. The error is :math:`O(h^4) = O(n^{-4})` for sufficiently smooth :math:`f`—doubling the points reduces error by a factor of 16. **Gaussian Quadrature**: Rather than using equally spaced points, choose both points :math:`x_i` and weights :math:`w_i` to maximize accuracy. With :math:`n` optimally chosen points, Gaussian quadrature integrates polynomials of degree up to :math:`2n-1` *exactly*. For analytic functions, convergence can be exponentially fast—far better than any fixed polynomial rate. The optimal points turn out to be roots of orthogonal polynomials (Legendre polynomials for integration on :math:`[-1, 1]`). SciPy's ``scipy.integrate.quad`` uses adaptive Gaussian quadrature internally. The geometric difference between these approaches is illuminating: .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig05_sampling_strategies.png :alt: Six-panel comparison of grid-based quadrature versus Monte Carlo sampling in 1D and 2D :align: center :width: 100% **Grid vs. Monte Carlo Sampling.** Top row: In 1D, grid sampling (left) places evaluation points at regular intervals, while Monte Carlo (middle) uses random points. Quadrature wins decisively in 1D (right)—this is expected and correct. Bottom row: In 2D, a 10×10 grid uses 100 points in a regular lattice (left), while Monte Carlo distributes 100 points without structure (middle). The crucial insight (right): grid cost grows as :math:`m^d` while Monte Carlo is dimension-independent; the crossover occurs around :math:`d \approx 3\text{--}5`, after which Monte Carlo wins. Compare these to Monte Carlo's :math:`O(n^{-1/2})`. In one dimension, Monte Carlo loses badly: .. code-block:: python import numpy as np from scipy import integrate # Integrand: exp(-x²) on [0, 2] def f(x): return np.exp(-x**2) # True value (via error function) from scipy.special import erf true_value = np.sqrt(np.pi) / 2 * erf(2) # Monte Carlo def mc_estimate(n, seed=42): rng = np.random.default_rng(seed) x = rng.uniform(0, 2, n) return 2 * np.mean(f(x)) # Multiply by interval length # Trapezoidal rule def trap_estimate(n): x = np.linspace(0, 2, n) return np.trapz(f(x), x) # Simpson's rule def simp_estimate(n): x = np.linspace(0, 2, n) return integrate.simpson(f(x), x=x) print(f"True value: {true_value:.10f}\n") for n in [10, 100, 1000]: mc = mc_estimate(n) tr = trap_estimate(n) si = simp_estimate(n) print(f"n = {n}") print(f" Monte Carlo: {mc:.10f} (error {abs(mc - true_value):.2e})") print(f" Trapezoidal: {tr:.10f} (error {abs(tr - true_value):.2e})") print(f" Simpson: {si:.10f} (error {abs(si - true_value):.2e})") print() .. 
code-block:: text True value: 0.8820813908 n = 10 Monte Carlo: 0.6600529708 (error 2.22e-01) Trapezoidal: 0.8817823811 (error 2.99e-04) Simpson: 0.8819697523 (error 1.12e-04) n = 100 Monte Carlo: 0.8973426558 (error 1.53e-02) Trapezoidal: 0.8820788993 (error 2.49e-06) Simpson: 0.8820813848 (error 5.96e-09) n = 1000 Monte Carlo: 0.8906166325 (error 8.54e-03) Trapezoidal: 0.8820813663 (error 2.45e-08) Simpson: 0.8820813908 (error 5.58e-13) With :math:`n = 1000` points, Simpson's rule achieves machine precision (:math:`10^{-13}`) while Monte Carlo's error is still around :math:`10^{-2}`. In one dimension, there is no contest. The Curse of Dimensionality ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The situation reverses dramatically in high dimensions. Consider integrating over :math:`[0, 1]^d`. A deterministic method using a grid with :math:`m` points per dimension requires :math:`m^d` total evaluations. To build geometric intuition for why high dimensions are so challenging, consider a simple question: what fraction of a hypercube's volume lies within the inscribed hypersphere? **Volume formulas.** Consider the hypercube :math:`[-1, 1]^d` with side length 2 and the unit hypersphere :math:`\{x : \|x\| \leq 1\}` inscribed within it. The **hypercube volume** is simply: .. math:: V_{\text{cube}}(d) = 2^d The **hypersphere volume** in :math:`d` dimensions is: .. math:: V_{\text{sphere}}(d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} where :math:`\Gamma` is the gamma function. For positive integers, :math:`\Gamma(n+1) = n!`. For half-integers, :math:`\Gamma(1/2) = \sqrt{\pi}` and :math:`\Gamma(n + 1/2) = \frac{(2n-1)!!}{2^n}\sqrt{\pi}` where :math:`!!` denotes the double factorial. For practical computation, use ``scipy.special.gamma``. This formula gives familiar results: :math:`V_2 = \pi` (area of unit circle), :math:`V_3 = \frac{4\pi}{3}` (volume of unit sphere). For higher dimensions: :math:`V_4 = \frac{\pi^2}{2}`, :math:`V_5 = \frac{8\pi^2}{15}`, and so on. The **ratio** of sphere volume to cube volume is: .. math:: \frac{V_{\text{sphere}}(d)}{V_{\text{cube}}(d)} = \frac{\pi^{d/2}}{2^d \, \Gamma(d/2 + 1)} Let's compute this ratio for several dimensions: .. code-block:: python import numpy as np from scipy.special import gamma def hypersphere_volume(d, r=1): """Volume of d-dimensional hypersphere with radius r.""" return (np.pi ** (d/2)) / gamma(d/2 + 1) * r**d def hypercube_volume(d, side=2): """Volume of d-dimensional hypercube with given side length.""" return side ** d print(f"{'d':>3} {'Cube':>12} {'Sphere':>12} {'Ratio':>12} {'% in sphere':>12}") print("-" * 58) for d in [1, 2, 3, 5, 10, 20, 50, 100]: cube_vol = hypercube_volume(d) sphere_vol = hypersphere_volume(d) ratio = sphere_vol / cube_vol print(f"{d:>3} {cube_vol:>12.2e} {sphere_vol:>12.4e} {ratio:>12.4e} {100*ratio:>11.2e}%") .. code-block:: text d Cube Sphere Ratio % in sphere ---------------------------------------------------------- 1 2.00e+00 2.0000e+00 1.0000e+00 1.00e+02% 2 4.00e+00 3.1416e+00 7.8540e-01 7.85e+01% 3 8.00e+00 4.1888e+00 5.2360e-01 5.24e+01% 5 3.20e+01 5.2638e+00 1.6449e-01 1.64e+01% 10 1.02e+03 2.5502e+00 2.4904e-03 2.49e-01% 20 1.05e+06 2.5807e-02 2.4611e-08 2.46e-06% 50 1.13e+15 1.7302e-13 1.5367e-28 1.54e-26% 100 1.27e+30 2.3682e-40 1.8682e-70 1.87e-68% The numbers are striking: in 10 dimensions, only 0.25% of the cube lies in the sphere. By 20 dimensions, it's about :math:`2.5 \times 10^{-6}`%. By 100 dimensions, the ratio is :math:`10^{-70}`—essentially zero. 
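The same emptiness shows up if we simply sample. A minimal sketch (an illustrative snippet, separate from the volume calculation above): draw uniform points in :math:`[-1, 1]^d` and record the fraction landing inside the unit ball. At :math:`d = 20` the true fraction is about :math:`2.5 \times 10^{-8}`, so a hundred thousand uniform samples will typically record zero hits.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(123)
    n = 100_000
    for d in [2, 5, 10, 20]:
        X = rng.uniform(-1.0, 1.0, size=(n, d))
        inside = np.sum(X**2, axis=1) <= 1.0     # does the point lie in the unit ball?
        print(f"d={d:>2}: observed fraction inside = {np.mean(inside):.2e}")
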
**What does this mean for integration?** If you're trying to integrate a function that's concentrated near the origin (like a multivariate Gaussian), random uniform samples over the hypercube will almost *never* land where the function is large. This is why importance sampling becomes essential in high dimensions. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig04_curse_of_dimensionality.png :alt: Three-panel visualization of the curse of dimensionality showing volume ratio decay, shell concentration, and grid point explosion :align: center :width: 100% **The Curse of Dimensionality.** Left: The inscribed hypersphere occupies a vanishing fraction of the hypercube as dimension increases—by :math:`d = 20`, less than :math:`10^{-7}` of the cube's volume lies within the sphere. Middle: In high dimensions, nearly all the volume of a ball concentrates in a thin shell near its surface. Right: Grid points required for deterministic methods explode exponentially—with just 10 points per dimension, a 20-dimensional integral requires :math:`10^{20}` evaluations. For Simpson's rule with error :math:`O(h^4) = O(m^{-4})`, the total error is :math:`O(m^{-4})` but the cost is :math:`m^d`. If we want error :math:`\epsilon`, we need :math:`m \sim \epsilon^{-1/4}`, giving cost :math:`\sim \epsilon^{-d/4}`. For Monte Carlo with error :math:`O(n^{-1/2})`, achieving error :math:`\epsilon` requires :math:`n \sim \epsilon^{-2}` samples, independent of :math:`d`. The crossover occurs roughly when :math:`\epsilon^{-d/4} = \epsilon^{-2}`, i.e., :math:`d = 8`. For :math:`d > 8`, Monte Carlo requires fewer function evaluations than Simpson's rule to achieve the same accuracy—and the advantage grows exponentially with dimension. **Caveat**: This analysis ignores constants, which can favor either method in specific cases. For particular integrands with special structure, deterministic methods may retain advantages even in moderate dimensions. The fundamental message, however, is robust: **in high dimensions, Monte Carlo wins** because its rate doesn't degrade with dimension, while grid methods' costs explode exponentially. .. list-table:: Comparison of Integration Methods :header-rows: 1 :widths: 25 25 25 25 * - Method - 1D Convergence - :math:`d`-D Convergence - Best For * - Monte Carlo - :math:`O(n^{-1/2})` - :math:`O(n^{-1/2})` - High dimensions, complex domains * - Trapezoidal - :math:`O(n^{-2})` - :math:`O(n^{-2/d})` - Low-dim smooth functions * - Simpson - :math:`O(n^{-4})` - :math:`O(n^{-4/d})` - Low-dim very smooth functions * - Gaussian - :math:`O(e^{-cn})` - :math:`O(e^{-cn^{1/d}})` - Low-dim analytic functions * - Quasi-Monte Carlo - :math:`O(n^{-1}(\log n)^d)` - :math:`O(n^{-1}(\log n)^d)` - Moderate dimensions, smooth functions Quasi-Monte Carlo Methods ~~~~~~~~~~~~~~~~~~~~~~~~~ A middle ground between deterministic quadrature and Monte Carlo is provided by **quasi-Monte Carlo** (QMC) methods. Instead of random samples, QMC uses carefully constructed **low-discrepancy sequences**—deterministic sequences that fill space more uniformly than random points. Famous examples include Halton sequences, Sobol sequences, and lattice rules. Under smoothness conditions on the integrand, QMC achieves convergence rates of :math:`O(n^{-1} (\log n)^d)`, faster than Monte Carlo's :math:`O(n^{-1/2})` but with a dependence on dimension. QMC is increasingly popular in computational finance and computer graphics. 
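As a minimal sketch of the workflow (assuming SciPy's ``scipy.stats.qmc`` module, available from SciPy 1.7 onward; the integrand, dimension, and sample size are illustrative choices), the only change relative to plain Monte Carlo is where the evaluation points come from:

.. code-block:: python

   import numpy as np
   from scipy.stats import qmc

   d, m = 5, 12           # dimension; 2^m points (Sobol sequences prefer powers of 2)
   n = 2**m

   def f(x):
       # Smooth test integrand: prod_j (1 + x_j/d); true integral is (1 + 1/(2d))^d
       return np.prod(1 + x / d, axis=1)

   true_value = (1 + 1 / (2 * d))**d

   # Plain Monte Carlo: iid uniform points
   rng = np.random.default_rng(0)
   mc_est = f(rng.random((n, d))).mean()

   # Quasi-Monte Carlo: scrambled Sobol sequence
   sobol = qmc.Sobol(d=d, scramble=True, seed=0)
   qmc_est = f(sobol.random_base2(m=m)).mean()

   print(f"true : {true_value:.6f}")
   print(f"MC   : {mc_est:.6f}  (|error| = {abs(mc_est - true_value):.2e})")
   print(f"QMC  : {qmc_est:.6f}  (|error| = {abs(qmc_est - true_value):.2e})")

On a smooth integrand like this one, the scrambled Sobol points typically land much closer to the true value than plain Monte Carlo with the same number of evaluations.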
However, it requires more care: the sequences must match the problem structure, variance estimation is trickier, and the smoothness assumptions may fail. For general-purpose integration, especially with non-smooth or high-variance integrands, standard Monte Carlo remains the most robust choice. Sample Size Determination ------------------------- A critical practical question is: **how many samples do I need?** The answer depends on the desired precision, the variance of the integrand, and the acceptable probability of error. The Sample Size Formula ~~~~~~~~~~~~~~~~~~~~~~~ From the CLT, the Monte Carlo estimator is approximately: .. math:: \hat{I}_n \sim \mathcal{N}\left( I, \frac{\sigma^2}{n} \right) **For a target standard error** :math:`\epsilon`, we need: .. math:: \frac{\sigma}{\sqrt{n}} \leq \epsilon \implies n \geq \frac{\sigma^2}{\epsilon^2} **For a 95% CI half-width** of :math:`\epsilon`, we need: .. math:: 1.96 \cdot \frac{\sigma}{\sqrt{n}} \leq \epsilon \implies n \geq \left( \frac{1.96 \cdot \sigma}{\epsilon} \right)^2 \approx \frac{3.84\sigma^2}{\epsilon^2} The factor of 3.84 (approximately :math:`1.96^2`) accounts for the confidence level. This is about 4× more samples than the SE target requires. **Be explicit about which you're targeting**: "SE ≤ 0.01" requires :math:`\sigma^2/0.01^2` samples; "95% CI half-width ≤ 0.01" requires :math:`3.84\sigma^2/0.01^2` samples. Practical Sample Size Determination ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In practice, :math:`\sigma` is unknown. A common approach is: 1. **Pilot study**: Run a small simulation (e.g., 1,000 samples) to estimate :math:`\hat{\sigma}`. 2. **Compute required sample size**: Use the appropriate formula for your target. 3. **Run full simulation**: Generate :math:`n` samples (possibly in addition to the pilot). 4. **Verify**: Check that the final standard error meets requirements. .. code-block:: python import numpy as np from scipy import stats def determine_sample_size_for_se(h, sampler, target_se, pilot_n=1000, seed=42): """ Determine required sample size for a target STANDARD ERROR. Uses formula: n = (σ / target_se)² Parameters ---------- h : callable Function to integrate. sampler : callable Function that takes (n, rng) and returns samples. target_se : float Desired standard error (not CI half-width). pilot_n : int Pilot sample size for variance estimation. seed : int Random seed. Returns ------- dict Contains required_n, estimated_variance, pilot results. """ rng = np.random.default_rng(seed) # Pilot study pilot_samples = sampler(pilot_n, rng) pilot_h = h(pilot_samples) sigma_hat = np.std(pilot_h, ddof=1) # Required sample size for SE target required_n = int(np.ceil((sigma_hat / target_se)**2)) return { 'required_n': required_n, 'estimated_sigma': sigma_hat, 'pilot_n': pilot_n, 'pilot_estimate': np.mean(pilot_h), 'pilot_se': sigma_hat / np.sqrt(pilot_n), 'target_type': 'standard_error' } def determine_sample_size_for_ci(h, sampler, target_halfwidth, confidence=0.95, pilot_n=1000, seed=42): """ Determine required sample size for a target CI HALF-WIDTH. Uses formula: n = (z * σ / target_halfwidth)² Parameters ---------- target_halfwidth : float Desired confidence interval half-width. confidence : float Confidence level (default 0.95). 
""" rng = np.random.default_rng(seed) # Pilot study pilot_samples = sampler(pilot_n, rng) pilot_h = h(pilot_samples) sigma_hat = np.std(pilot_h, ddof=1) # Critical value for confidence level z = stats.norm.ppf(1 - (1 - confidence) / 2) # Required sample size for CI half-width target required_n = int(np.ceil((z * sigma_hat / target_halfwidth)**2)) return { 'required_n': required_n, 'estimated_sigma': sigma_hat, 'z_value': z, 'pilot_n': pilot_n, 'target_type': f'{100*confidence:.0f}% CI half-width' } # Example: Estimate E[X²] for X ~ N(0,1) # True variance of X² is Var(X²) = E[X⁴] - (E[X²])² = 3 - 1 = 2, so σ = √2 ≈ 1.414 result_se = determine_sample_size_for_se( h=lambda x: x**2, sampler=lambda n, rng: rng.standard_normal(n), target_se=0.01 ) result_ci = determine_sample_size_for_ci( h=lambda x: x**2, sampler=lambda n, rng: rng.standard_normal(n), target_halfwidth=0.01 ) print(f"For E[X²] where X ~ N(0,1):") print(f"Estimated σ: {result_se['estimated_sigma']:.4f} (true: √2 ≈ 1.414)") print(f"\nFor SE ≤ 0.01: n ≥ {result_se['required_n']:,}") print(f"For 95% CI half-width ≤ 0.01: n ≥ {result_ci['required_n']:,}") print(f"\nRatio: {result_ci['required_n'] / result_se['required_n']:.2f} (should be ≈ 1.96² ≈ 3.84)") .. code-block:: text For E[X²] where X ~ N(0,1): Estimated σ: 1.4153 (true: √2 ≈ 1.414) For SE ≤ 0.01: n ≥ 20,031 For 95% CI half-width ≤ 0.01: n ≥ 76,948 Ratio: 3.84 (should be ≈ 1.96² ≈ 3.84) For estimating :math:`\mathbb{E}[X^2]` where :math:`X \sim \mathcal{N}(0,1)`, the true variance is :math:`\text{Var}(X^2) = \mathbb{E}[X^4] - (\mathbb{E}[X^2])^2 = 3 - 1 = 2`, so :math:`\sigma = \sqrt{2} \approx 1.414`. The required sample sizes differ by a factor of 3.84 depending on whether you're targeting SE or CI half-width. Convergence Diagnostics and Monitoring -------------------------------------- Beyond computing point estimates and confidence intervals, it is important to monitor the convergence of Monte Carlo simulations. Visual diagnostics can reveal problems—heavy tails, multimodality, slow mixing—that summary statistics might miss. Running Mean Plots ~~~~~~~~~~~~~~~~~~ The most basic diagnostic is a plot of the running mean: .. math:: \hat{I}_k = \frac{1}{k} \sum_{i=1}^{k} h(X_i) \quad \text{for } k = 1, 2, \ldots, n If the simulation is converging properly, this plot should: 1. Fluctuate widely for small :math:`k` 2. Stabilize and approach a horizontal asymptote as :math:`k` grows 3. Have diminishing fluctuations proportional to :math:`1/\sqrt{k}` Standard Error Decay ~~~~~~~~~~~~~~~~~~~~ The standard error should decrease as :math:`1/\sqrt{n}`. A log-log plot of standard error versus sample size should have slope :math:`-1/2`. Deviations suggest: - **Steeper slope**: Variance is decreasing (possibly a problem with the estimator) - **Shallower slope**: Correlation in samples, infinite variance, or other issues The following figure demonstrates these diagnostics for a simple example: estimating :math:`\mathbb{E}[X^2] = 1` where :math:`X \sim \mathcal{N}(0, 1)`. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_1_fig07_convergence_diagnostics.png :alt: Four-panel convergence diagnostics showing running mean, standard error decay, h-value distribution, and autocorrelation :align: center :width: 100% **Convergence Diagnostics for Monte Carlo.** Top-left: Running mean with 95% confidence band—note how the estimate stabilizes and the band shrinks as samples accumulate. 
Top-right: Standard error decay on log-log axes—the observed SE (blue) closely tracks the theoretical :math:`O(n^{-1/2})` rate (red dashed). Bottom-left: Distribution of :math:`h(X) = X^2` values, which follows a :math:`\chi^2(1)` distribution (orange curve). Bottom-right: Autocorrelation at various lags—all values fall within the significance bounds (red dashed), confirming that samples are independent. **What to look for:** - **Running mean**: Should stabilize (not drift or oscillate) - **SE decay**: Should follow :math:`O(n^{-1/2})` line; shallower slope indicates problems - **Distribution**: Check for heavy tails or multimodality that might inflate variance - **Autocorrelation**: For iid samples, all bars should be near zero; significant autocorrelation indicates dependence (common in MCMC) .. admonition:: Common Pitfall ⚠️ Pathological Cases :class: warning Several situations can cause Monte Carlo to behave unexpectedly. Recognizing these pathologies—and knowing how to address them—is essential for reliable simulation. **Heavy tails** (infinite variance) *Problem*: If :math:`\text{Var}[h(X)] = \infty`, the CLT does not apply. The estimator still converges by LLN (if :math:`\mathbb{E}[|h(X)|] < \infty`), but standard error estimates are meaningless and confidence intervals are invalid. *Example*: Consider :math:`h(U) = U^{-\alpha}` where :math:`U \sim \text{Uniform}(0,1)`. The mean :math:`\mathbb{E}[h(U)]` exists iff :math:`\alpha < 1`. The variance exists iff :math:`2\alpha < 1`, i.e., :math:`\alpha < 0.5`. So for :math:`0.5 \leq \alpha < 1` (e.g., :math:`\alpha = 0.6`), the mean is finite but the variance is infinite. *Remedies*: - **Truncation**: Replace :math:`h(x)` with :math:`\min(h(x), M)` for some threshold :math:`M`; introduces bias but restores finite variance - **Transformation**: If :math:`h(x)` blows up near a boundary, change variables to smooth the integrand - **Importance sampling**: Sample more heavily where :math:`h(x)` is large, reducing variance - **Diagnostics**: Plot the histogram of :math:`h(X_i)` values; extreme outliers suggest heavy tails **Multimodality** (isolated peaks) *Problem*: If the integrand has isolated peaks that the sampling distribution rarely visits, estimates may be severely biased. You might run millions of samples without ever hitting a significant mode. *Example*: A mixture of two narrow Gaussians separated by 100 standard deviations; uniform sampling over the domain almost never hits either peak. *Remedies*: - **Importance sampling**: Design a proposal distribution that covers all modes - **Stratified sampling**: Divide the domain into regions and sample each region separately - **Adaptive methods**: Use preliminary runs to identify modes, then concentrate sampling there - **MCMC with tempering**: Parallel tempering or simulated tempering can help samplers jump between modes (covered in Part 3) **Rare events** (small probabilities) *Problem*: Estimating :math:`P(A)` for rare events requires approximately :math:`100/P(A)` samples for 10% relative error. For :math:`P(A) = 10^{-6}`, that's :math:`10^8` samples. *Example*: Probability that a complex system fails; probability of extreme portfolio losses. 
*Remedies*: - **Importance sampling**: Sample preferentially from the rare event region; this is the standard solution - **Splitting/cloning**: In sequential simulations, replicate trajectories that approach the rare event - **Cross-entropy method**: Iteratively adapt the sampling distribution to concentrate on rare events - **Large deviations theory**: Use asymptotic approximations when sampling is infeasible **Dependent samples** (autocorrelation) *Problem*: If samples are correlated (as in MCMC), the effective sample size (ESS) is smaller than the nominal sample size :math:`n`. Standard error formulas that assume independence underestimate uncertainty. *Example*: Metropolis-Hastings chains with high autocorrelation; ESS might be :math:`n/100` or worse. *Remedies*: - **Thinning**: Keep every :math:`k`-th sample to reduce autocorrelation (simple but wasteful) - **Effective sample size**: Compute ESS and report uncertainty based on ESS, not :math:`n` - **Batch means**: Divide the chain into batches and estimate variance from batch means - **Better samplers**: Hamiltonian Monte Carlo, NUTS, or other advanced MCMC methods reduce autocorrelation - **Diagnostics**: Always check autocorrelation plots; values should decay quickly to zero Practical Considerations ------------------------ Before concluding, we collect several practical points for implementing Monte Carlo methods effectively. When to Use Monte Carlo ~~~~~~~~~~~~~~~~~~~~~~~ Monte Carlo integration is the method of choice when: 1. **Dimension is high** (:math:`d \gtrsim 5`): The curse of dimensionality kills deterministic methods. 2. **Domain is complex**: Irregular regions, constraints, and complex boundaries are natural for Monte Carlo but problematic for quadrature. 3. **Integrand is non-smooth**: Monte Carlo doesn't require derivatives or smoothness. 4. **Sampling is easy**: If we can easily generate samples from the target distribution, Monte Carlo is straightforward to implement. Monte Carlo is a poor choice when: 1. **Dimension is very low** (:math:`d \leq 3`) and the integrand is smooth: Use Gaussian quadrature. 2. **High precision is required** with smooth integrands: Deterministic methods converge faster. 3. **Sampling is expensive**: Each Monte Carlo sample requires a new function evaluation; quadrature methods can achieve more with fewer evaluations for smooth functions. Reproducibility ~~~~~~~~~~~~~~~ Always set random seeds and document them. Monte Carlo results should be reproducible. .. code-block:: python import numpy as np def my_monte_carlo_analysis(n_samples, seed): """Document the seed in the function signature.""" rng = np.random.default_rng(seed) # ... analysis using rng ... return results # Call with explicit seed results = my_monte_carlo_analysis(100_000, seed=42) Computational Efficiency ~~~~~~~~~~~~~~~~~~~~~~~~ - **Vectorize**: Use NumPy operations on arrays, not Python loops. - **Generate samples in batches**: ``rng.random(100_000)`` is faster than 100,000 calls to ``rng.random()``. - **Parallelize when possible**: For embarrassingly parallel problems, distribute samples across cores. .. admonition:: Common Pitfall ⚠️ :class: warning **Reporting estimates without uncertainty**: A Monte Carlo estimate without a standard error or confidence interval is nearly meaningless. Always report :math:`\hat{I}_n \pm z_{\alpha/2} \cdot \hat{\sigma}_n / \sqrt{n}`. **Bad**: "The integral is 3.14159." **Good**: "The integral is 3.142 ± 0.005 (95% CI: [3.132, 3.152]) based on 100,000 samples." 
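One way to make the good form the path of least resistance is a small reporting helper. The sketch below is ours (the function name and output format are illustrative, not a library API); it packages the estimate, standard error, and confidence interval into a single reportable string.

.. code-block:: python

   import numpy as np
   from scipy import stats

   def report_mc(h_values, confidence=0.95):
       """Format a Monte Carlo estimate with its SE and a confidence interval."""
       h = np.asarray(h_values, dtype=float)
       n = h.size
       est = h.mean()
       se = h.std(ddof=1) / np.sqrt(n)
       z = stats.norm.ppf(1 - (1 - confidence) / 2)
       half = z * se
       return (f"{est:.4f} ± {half:.4f} "
               f"({100 * confidence:.0f}% CI: [{est - half:.4f}, {est + half:.4f}]) "
               f"based on {n:,} samples")

   # Example: estimate E[sin(X)] for X ~ Uniform(0, π)
   rng = np.random.default_rng(42)
   print(report_mc(np.sin(rng.uniform(0, np.pi, 100_000))))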
Chapter 2.1 Exercises: Monte Carlo Fundamentals Mastery -------------------------------------------------------- These exercises progressively build your understanding of Monte Carlo integration, from basic estimation through convergence analysis and practical applications. Each exercise connects theory, implementation, and practical considerations. .. admonition:: A Note on These Exercises :class: tip These exercises are designed to deepen your understanding of Monte Carlo integration through hands-on implementation: - **Exercises 1–2** reinforce core concepts: expectation as integration, the :math:`O(n^{-1/2})` convergence rate, and variance estimation - **Exercises 3–4** develop diagnostic skills and explore the curse of dimensionality - **Exercise 5** compares Monte Carlo with deterministic quadrature to build intuition for method selection - **Exercise 6** connects Monte Carlo to Bayesian inference, previewing applications in later chapters Complete solutions with code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding! .. admonition:: Exercise 1: The Error Function Integral :class: exercise The **error function** :math:`\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2} dt` appears throughout statistics, physics, and engineering. It has no closed-form antiderivative, making it a perfect candidate for Monte Carlo integration. .. admonition:: Background: Why the Error Function Matters :class: note The error function is intimately connected to the normal distribution: :math:`\Phi(x) = \frac{1}{2}\left[1 + \text{erf}(x/\sqrt{2})\right]`. Evaluating :math:`\text{erf}(1)` is equivalent to computing the probability that a standard normal random variable falls between :math:`-\sqrt{2}` and :math:`\sqrt{2}`. a. **Analytical setup**: Express the integral :math:`I = \int_0^1 e^{-t^2} dt` as an expectation. What is the Monte Carlo estimator :math:`\hat{I}_n`? .. admonition:: Hint: Expectation Formulation :class: tip For any integral :math:`\int_a^b h(t) dt` over a finite interval, you can write it as :math:`(b-a) \cdot \mathbb{E}[h(U)]` where :math:`U \sim \text{Uniform}(a, b)`. What is :math:`h(t)` here? b. **Implementation**: Write a function ``monte_carlo_erf(n_samples, seed)`` that estimates :math:`\int_0^1 e^{-t^2} dt` and returns both the estimate and its standard error. .. admonition:: Hint: Standard Error Computation :class: tip The standard error is :math:`\hat{\sigma}/\sqrt{n}` where :math:`\hat{\sigma}` is the sample standard deviation of the :math:`h(U_i)` values. Use ``np.std(h_values, ddof=1)`` for the unbiased estimator. c. **Convergence verification**: Run your estimator for :math:`n \in \{100, 1000, 10000, 100000, 1000000\}`. Verify that the standard error decreases as :math:`O(n^{-1/2})` by computing the ratio of successive standard errors. .. admonition:: Hint: Testing the Rate :class: tip If SE scales as :math:`n^{-1/2}`, then when :math:`n` increases by a factor of 10, SE should decrease by a factor of :math:`\sqrt{10} \approx 3.16`. Compute ``se[i-1] / se[i]`` for each consecutive pair. d. **Confidence interval coverage**: Repeat the :math:`n = 10000` estimation 1000 times with different seeds. What fraction of the 95% confidence intervals contain the true value :math:`\int_0^1 e^{-t^2} dt = \frac{\sqrt{\pi}}{2}\text{erf}(1) \approx 0.7468`? .. admonition:: Hint: Coverage Simulation :class: tip The CLT guarantees that 95% CIs contain the true value approximately 95% of the time. 
Deviations indicate either insufficient sample size (CLT not yet valid) or implementation bugs. .. dropdown:: Solution :class-container: solution **Part (a): Analytical Setup** .. admonition:: Step 1: Rewrite as Expectation :class: tip The integral :math:`I = \int_0^1 e^{-t^2} dt` can be written as: .. math:: I = (1 - 0) \cdot \mathbb{E}[e^{-U^2}] = \mathbb{E}[e^{-U^2}] where :math:`U \sim \text{Uniform}(0, 1)`. The Monte Carlo estimator is: .. math:: \hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} e^{-U_i^2} where :math:`U_1, \ldots, U_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, 1)`. **Parts (b)–(d): Implementation and Verification** .. code-block:: python import numpy as np from scipy.special import erf def monte_carlo_erf(n_samples, seed=None): """ Estimate ∫₀¹ exp(-t²) dt via Monte Carlo. Parameters ---------- n_samples : int Number of Monte Carlo samples. seed : int, optional Random seed for reproducibility. Returns ------- dict Contains estimate, standard_error, and confidence interval. """ rng = np.random.default_rng(seed) # Generate uniform samples on [0, 1] u = rng.uniform(0, 1, n_samples) # Evaluate h(u) = exp(-u²) h_values = np.exp(-u**2) # Point estimate (interval length = 1, so no scaling needed) estimate = np.mean(h_values) # Standard error std_h = np.std(h_values, ddof=1) se = std_h / np.sqrt(n_samples) # 95% confidence interval ci = (estimate - 1.96 * se, estimate + 1.96 * se) return { 'estimate': estimate, 'std_error': se, 'ci_95': ci, 'h_values': h_values } # True value for comparison true_value = np.sqrt(np.pi) / 2 * erf(1) print("=" * 60) print("MONTE CARLO ESTIMATION OF ∫₀¹ exp(-t²) dt") print("=" * 60) print(f"\nTrue value: {true_value:.10f}") # Part (c): Convergence verification print(f"\n{'n':>10} {'Estimate':>12} {'Std Error':>12} {'SE Ratio':>10}") print("-" * 50) sample_sizes = [100, 1000, 10000, 100000, 1000000] results = [] for n in sample_sizes: result = monte_carlo_erf(n, seed=42) results.append(result) for i, (n, result) in enumerate(zip(sample_sizes, results)): if i > 0: se_ratio = results[i-1]['std_error'] / result['std_error'] else: se_ratio = float('nan') print(f"{n:>10} {result['estimate']:>12.6f} {result['std_error']:>12.6f} {se_ratio:>10.2f}") print(f"\nExpected SE ratio for 10× samples: √10 ≈ {np.sqrt(10):.2f}") # Part (d): Coverage simulation print(f"\n{'='*60}") print("CONFIDENCE INTERVAL COVERAGE SIMULATION") print(f"{'='*60}") n_experiments = 1000 n_samples = 10000 ci_contains_true = 0 for seed in range(n_experiments): result = monte_carlo_erf(n_samples, seed=seed) ci_low, ci_high = result['ci_95'] if ci_low <= true_value <= ci_high: ci_contains_true += 1 coverage = ci_contains_true / n_experiments print(f"\nExperiments: {n_experiments}") print(f"Sample size per experiment: {n_samples}") print(f"95% CI coverage: {coverage:.1%}") print(f"Expected coverage: 95.0%") print(f"Deviation: {abs(coverage - 0.95):.1%}") .. 
code-block:: text ============================================================ MONTE CARLO ESTIMATION OF ∫₀¹ exp(-t²) dt ============================================================ True value: 0.7468241328 n Estimate Std Error SE Ratio -------------------------------------------------- 100 0.743399 0.015261 nan 1000 0.745988 0.004834 3.16 10000 0.746637 0.001536 3.15 100000 0.746919 0.000485 3.17 1000000 0.746825 0.000153 3.17 Expected SE ratio for 10× samples: √10 ≈ 3.16 ============================================================ CONFIDENCE INTERVAL COVERAGE SIMULATION ============================================================ Experiments: 1000 Sample size per experiment: 10000 95% CI coverage: 94.8% Expected coverage: 95.0% Deviation: 0.2% **Key Observations:** 1. **Convergence rate verified**: The SE ratio is consistently ≈3.16 when :math:`n` increases 10-fold, confirming the :math:`O(n^{-1/2})` rate. 2. **Coverage is nominal**: The 94.8% coverage is statistically indistinguishable from the theoretical 95%—the CLT approximation works well even at :math:`n = 10000`. 3. **Practical insight**: With one million samples, we achieve about 4 correct decimal places (SE ≈ 0.00015). The quadratic cost-to-accuracy relationship is evident. .. admonition:: Exercise 2: Variance Estimation and Sample Size Planning :class: exercise A biotechnology company needs to estimate the expected time for a chemical reaction. Based on a preliminary model, the reaction time follows :math:`T = 5 + 2X^2` where :math:`X \sim \mathcal{N}(0, 1)`. The company requires the estimate to have a standard error no larger than 0.05 minutes. .. admonition:: Background: Sample Size Determination :class: note In practice, Monte Carlo studies require advance planning: how many samples are needed to achieve a target precision? This requires estimating the variance of the quantity being computed—which itself requires Monte Carlo! a. **Theoretical analysis**: Derive :math:`\mathbb{E}[T]` and :math:`\text{Var}(T)` analytically using the fact that if :math:`X \sim \mathcal{N}(0, 1)`, then :math:`\mathbb{E}[X^2] = 1` and :math:`\mathbb{E}[X^4] = 3`. .. admonition:: Hint: Variance of a Transform :class: tip For :math:`T = a + bX^2`: :math:`\mathbb{E}[T] = a + b\mathbb{E}[X^2]` and :math:`\text{Var}(T) = b^2 \text{Var}(X^2) = b^2(\mathbb{E}[X^4] - (\mathbb{E}[X^2])^2)`. b. **Sample size formula**: Using the formula :math:`n \geq \sigma^2/\epsilon^2` for achieving SE :math:`\leq \epsilon`, compute the theoretical minimum sample size needed. .. admonition:: Hint: Applying the Formula :class: tip You computed :math:`\text{Var}(T) = \sigma^2` in part (a). With target SE :math:`\epsilon = 0.05`, solve for :math:`n`. c. **Pilot study approach**: Without using the analytical variance, implement a two-stage procedure: - Stage 1: Use a pilot sample of 1,000 to estimate :math:`\hat{\sigma}^2` - Stage 2: Compute required total :math:`n` using :math:`\hat{\sigma}^2`, then run the full study Compare your pilot-estimated :math:`n` to the theoretical value. .. admonition:: Hint: Two-Stage Implementation :class: tip After the pilot, compute :math:`\hat{n} = \lceil (\hat{\sigma} / \epsilon)^2 \rceil` for target SE :math:`\leq \epsilon`. Then generate the remaining samples. d. **Verification**: Repeat your two-stage procedure 100 times. What fraction of final estimates have SE :math:`\leq 0.05`? .. dropdown:: Solution :class-container: solution **Part (a): Theoretical Analysis** .. admonition:: Step 1: Compute E[T] :class: tip .. 
math:: \mathbb{E}[T] = \mathbb{E}[5 + 2X^2] = 5 + 2\mathbb{E}[X^2] = 5 + 2(1) = 7 .. admonition:: Step 2: Compute Var(T) :class: tip .. math:: \text{Var}(T) &= \text{Var}(5 + 2X^2) = 4\text{Var}(X^2) \\ &= 4(\mathbb{E}[X^4] - (\mathbb{E}[X^2])^2) \\ &= 4(3 - 1) = 8 So :math:`\sigma = \sqrt{8} = 2\sqrt{2} \approx 2.828`. **Part (b): Theoretical Sample Size** .. admonition:: Calculation :class: tip For SE :math:`= \sigma/\sqrt{n} \leq 0.05`: .. math:: n \geq \frac{\sigma^2}{\epsilon^2} = \frac{8}{0.05^2} = \frac{8}{0.0025} = 3200 We need at least **3,200 samples**. **Parts (c)–(d): Implementation** .. code-block:: python import numpy as np from scipy import stats def reaction_time_sample(n, rng): """Generate n reaction time samples: T = 5 + 2X² where X ~ N(0,1).""" X = rng.standard_normal(n) return 5 + 2 * X**2 def two_stage_monte_carlo(target_se=0.05, pilot_n=1000, seed=None): """ Two-stage Monte Carlo with pilot variance estimation. Parameters ---------- target_se : float Target standard error. pilot_n : int Pilot sample size for variance estimation. seed : int Random seed. Returns ------- dict Contains estimate, actual SE, required n, etc. """ rng = np.random.default_rng(seed) # Stage 1: Pilot study pilot_samples = reaction_time_sample(pilot_n, rng) sigma_hat = np.std(pilot_samples, ddof=1) # Compute required n (for 95% CI half-width = target_se) # Actually for SE target: n = (sigma/target_se)² required_n = int(np.ceil((sigma_hat / target_se)**2)) # Stage 2: Generate additional samples if needed additional_n = max(0, required_n - pilot_n) if additional_n > 0: additional_samples = reaction_time_sample(additional_n, rng) all_samples = np.concatenate([pilot_samples, additional_samples]) else: all_samples = pilot_samples # Final estimate estimate = np.mean(all_samples) final_se = np.std(all_samples, ddof=1) / np.sqrt(len(all_samples)) return { 'estimate': estimate, 'final_se': final_se, 'pilot_sigma_hat': sigma_hat, 'required_n': required_n, 'actual_n': len(all_samples), 'se_target_met': final_se <= target_se } # Theoretical values true_mean = 7 true_sigma = np.sqrt(8) theoretical_n = int(np.ceil((true_sigma / 0.05)**2)) print("=" * 60) print("SAMPLE SIZE PLANNING FOR REACTION TIME ESTIMATION") print("=" * 60) print(f"\nTheoretical Values:") print(f" E[T] = {true_mean}") print(f" σ(T) = √8 = {true_sigma:.4f}") print(f" Required n for SE ≤ 0.05: {theoretical_n}") # Part (c): Single two-stage study print(f"\n{'-'*60}") print("Single Two-Stage Study (seed=42):") result = two_stage_monte_carlo(target_se=0.05, pilot_n=1000, seed=42) print(f" Pilot σ̂ = {result['pilot_sigma_hat']:.4f} (theory: {true_sigma:.4f})") print(f" Required n = {result['required_n']} (theory: {theoretical_n})") print(f" Actual n used = {result['actual_n']}") print(f" Estimate = {result['estimate']:.4f} (theory: {true_mean})") print(f" Final SE = {result['final_se']:.4f}") print(f" SE ≤ 0.05? 
{result['se_target_met']}") # Part (d): Repeated verification print(f"\n{'-'*60}") print("Verification: 100 Independent Two-Stage Studies") n_experiments = 100 se_met_count = 0 required_ns = [] final_ses = [] for seed in range(n_experiments): result = two_stage_monte_carlo(target_se=0.05, pilot_n=1000, seed=seed) if result['se_target_met']: se_met_count += 1 required_ns.append(result['required_n']) final_ses.append(result['final_se']) print(f" SE target met: {se_met_count}/{n_experiments} = {100*se_met_count/n_experiments:.0f}%") print(f" Mean required n: {np.mean(required_ns):.0f} (theory: {theoretical_n})") print(f" Mean final SE: {np.mean(final_ses):.4f} (target: 0.05)") .. code-block:: text ============================================================ SAMPLE SIZE PLANNING FOR REACTION TIME ESTIMATION ============================================================ Theoretical Values: E[T] = 7 σ(T) = √8 = 2.8284 Required n for SE ≤ 0.05: 3200 ------------------------------------------------------------ Single Two-Stage Study (seed=42): Pilot σ̂ = 2.8012 (theory: 2.8284) Required n = 3139 (theory: 3200) Actual n used = 3139 Estimate = 7.0234 (theory: 7) Final SE = 0.0504 SE ≤ 0.05? False ------------------------------------------------------------ Verification: 100 Independent Two-Stage Studies SE target met: 52/100 = 52% Mean required n: 3214 (theory: 3200) Mean final SE: 0.0501 (target: 0.05) **Key Observations:** 1. **Pilot estimates are noisy**: The pilot :math:`\hat{\sigma}` varies around the true :math:`\sigma`, leading to variability in the computed :math:`n`. 2. **Only ~50% meet target**: This is expected! The formula :math:`n = \sigma^2/\epsilon^2` gives the sample size where SE *equals* the target—half the time, sampling variance will push SE slightly above. For guaranteed SE < target, use a larger :math:`n` or confidence-based formula. 3. **Mean SE matches target**: On average, the procedure achieves the target SE, confirming the formula is correct. 4. **Practical lesson**: For stricter guarantees, multiply :math:`n` by 1.1–1.2 or use the 95% CI half-width formula with appropriate margin. .. admonition:: Exercise 3: Convergence Diagnostics in Practice :class: exercise Convergence diagnostics are essential for verifying that Monte Carlo simulations are performing correctly. This exercise develops practical diagnostic tools. .. admonition:: Background: Why Diagnostics Matter :class: note A Monte Carlo estimate without diagnostics is suspect. Pathologies like heavy tails, insufficient samples, or bugs can produce misleading results. Visual and quantitative diagnostics catch these problems before they corrupt downstream analysis. a. **Running mean and SE**: Implement functions that compute the running mean and running standard error for a sequence of Monte Carlo evaluations. Apply them to estimate :math:`\mathbb{E}[\sin(X)]` where :math:`X \sim \text{Uniform}(0, \pi)`. .. admonition:: Hint: Running Calculations :class: tip The running mean at step :math:`k` is :math:`\bar{h}_k = \frac{1}{k}\sum_{i=1}^k h_i`. For efficiency, use ``np.cumsum(h) / np.arange(1, n+1)``. The running SE requires running variance—use Welford's algorithm or vectorized cumulative formulas. b. **Visual diagnostics**: Create a 2×2 diagnostic figure with: - Top-left: Running mean with 95% confidence band - Top-right: Running SE on log-log axes with :math:`O(n^{-1/2})` reference line - Bottom-left: Histogram of :math:`h(X)` values - Bottom-right: Time series of :math:`h(X)` values .. 
admonition:: Hint: Log-Log SE Plot :class: tip On a log-log plot, :math:`SE \propto n^{-1/2}` appears as a line with slope :math:`-0.5`. Use ``ax.loglog(n, se)`` and add a reference line at the expected slope. c. **Pathology detection**: Create a "pathological" estimator by estimating :math:`\mathbb{E}[U^{-0.6}]` where :math:`U \sim \text{Uniform}(0, 1)`. This has finite mean but infinite variance. How do the diagnostics reveal the problem? .. admonition:: Hint: Heavy-Tailed Behavior :class: tip For :math:`h(U) = U^{-0.6}`, compute :math:`\mathbb{E}[h(U)^2] = \int_0^1 u^{-1.2} du`. Does this converge? What happens to the running SE? d. **Quantitative diagnostic**: Implement a function that tests whether the observed SE decay rate matches the theoretical :math:`-0.5` slope (within statistical tolerance). Apply it to both the well-behaved and pathological cases. .. dropdown:: Solution :class-container: solution **Parts (a)–(b): Running Statistics and Diagnostics** .. code-block:: python import numpy as np import matplotlib.pyplot as plt def compute_running_stats(h_values): """ Compute running mean and SE using numerically stable algorithms. Returns ------- running_mean : array Cumulative mean at each step. running_se : array Standard error at each step. """ n = len(h_values) indices = np.arange(1, n + 1) # Running mean via cumsum running_mean = np.cumsum(h_values) / indices # Running variance using vectorized Welford-like approach # Var_k = (1/(k-1)) * Σ(h_i - mean_k)² # We compute this incrementally for stability running_var = np.zeros(n) M2 = 0.0 mean = h_values[0] running_var[0] = 0 for k in range(1, n): delta = h_values[k] - mean mean += delta / (k + 1) delta2 = h_values[k] - mean M2 += delta * delta2 running_var[k] = M2 / k # Bessel correction running_se = np.sqrt(running_var / indices) running_se[0] = np.nan # Undefined for n=1 return running_mean, running_se def monte_carlo_diagnostics(h_values, true_value=None, title=""): """ Create comprehensive diagnostic plots for Monte Carlo. Parameters ---------- h_values : array Sequence of h(X_i) evaluations. true_value : float, optional True expected value (if known). title : str Title for the figure. 
Returns ------- fig : matplotlib Figure """ n = len(h_values) indices = np.arange(1, n + 1) running_mean, running_se = compute_running_stats(h_values) fig, axes = plt.subplots(2, 2, figsize=(12, 10)) # Panel 1: Running mean with confidence band ax = axes[0, 0] start = min(50, n // 20) # Skip early noisy values ax.plot(indices, running_mean, 'b-', lw=1, alpha=0.8) # Add confidence band (starting where SE is defined) valid = ~np.isnan(running_se) & (indices > start) ax.fill_between( indices[valid], running_mean[valid] - 1.96 * running_se[valid], running_mean[valid] + 1.96 * running_se[valid], alpha=0.3, color='blue', label='95% CI' ) if true_value is not None: ax.axhline(true_value, color='red', linestyle='--', lw=2, label=f'True value = {true_value:.4f}') ax.set_xlabel('Sample size n') ax.set_ylabel('Running mean') ax.set_title('Running Mean with 95% Confidence Band') ax.legend() ax.set_xscale('log') # Panel 2: SE decay (log-log) ax = axes[0, 1] plot_start = max(10, n // 100) plot_idx = indices[plot_start:] plot_se = running_se[plot_start:] ax.loglog(plot_idx, plot_se, 'b-', lw=1.5, label='Observed SE') # Reference line: SE ∝ n^{-0.5} ref_se = plot_se[0] * (plot_idx[0] / plot_idx)**0.5 ax.loglog(plot_idx, ref_se, 'r--', lw=2, label=r'$O(n^{-1/2})$ reference') ax.set_xlabel('Sample size n') ax.set_ylabel('Standard Error') ax.set_title('Standard Error Decay') ax.legend() # Panel 3: Histogram of h values ax = axes[1, 0] ax.hist(h_values, bins=50, density=True, alpha=0.7, edgecolor='white', color='steelblue') ax.axvline(np.mean(h_values), color='red', lw=2, label=f'Mean = {np.mean(h_values):.4f}') ax.set_xlabel('h(X)') ax.set_ylabel('Density') ax.set_title('Distribution of h(X) Values') ax.legend() # Panel 4: Time series ax = axes[1, 1] # Subsample for visibility if many points if n > 2000: plot_idx_ts = np.linspace(0, n-1, 2000).astype(int) ax.plot(plot_idx_ts, h_values[plot_idx_ts], 'b.', alpha=0.3, markersize=1) else: ax.plot(h_values, 'b.', alpha=0.3, markersize=2) ax.axhline(np.mean(h_values), color='red', lw=1.5) ax.set_xlabel('Sample index') ax.set_ylabel('h(X)') ax.set_title('Time Series of h(X) Values') plt.suptitle(title, fontsize=14, fontweight='bold') plt.tight_layout() return fig # Example: E[sin(X)] where X ~ Uniform(0, π) # True value: (1/π) ∫₀^π sin(x) dx = 2/π ≈ 0.6366 rng = np.random.default_rng(42) n_samples = 50000 X = rng.uniform(0, np.pi, n_samples) h_values_good = np.sin(X) true_value_good = 2 / np.pi print("=" * 60) print("CONVERGENCE DIAGNOSTICS: WELL-BEHAVED CASE") print("=" * 60) print(f"\nEstimating E[sin(X)] where X ~ Uniform(0, π)") print(f"True value: 2/π = {true_value_good:.6f}") print(f"Sample mean: {np.mean(h_values_good):.6f}") print(f"Sample SE: {np.std(h_values_good, ddof=1) / np.sqrt(n_samples):.6f}") fig1 = monte_carlo_diagnostics(h_values_good, true_value_good, "Diagnostics: E[sin(X)] (Well-Behaved)") plt.savefig('diagnostics_good.png', dpi=150, bbox_inches='tight') plt.show() .. code-block:: text ============================================================ CONVERGENCE DIAGNOSTICS: WELL-BEHAVED CASE ============================================================ Estimating E[sin(X)] where X ~ Uniform(0, π) True value: 2/π = 0.636620 Sample mean: 0.635587 Sample SE: 0.001591 **Part (c): Pathological Case** .. 
code-block:: python # Pathological case: E[U^{-0.6}] where U ~ Uniform(0, 1) # Key insight: For U^{-α}, mean finite iff α < 1, variance finite iff 2α < 1 # With α = 0.6: mean finite (0.6 < 1), variance INFINITE (1.2 > 1) # E[U^{-0.6}] = ∫₀¹ u^{-0.6} du = [u^{0.4}/0.4]₀¹ = 2.5 (finite) # Var[U^{-0.6}] = E[U^{-1.2}] - 2.5² = ∫₀¹ u^{-1.2} du - 6.25 = ∞ (diverges!) print(f"\n{'='*60}") print("CONVERGENCE DIAGNOSTICS: PATHOLOGICAL CASE") print(f"{'='*60}") print(f"\nEstimating E[U^(-0.6)] where U ~ Uniform(0, 1)") print(f"True value: 1/0.4 = 2.5") print(f"Variance: E[U^(-1.2)] - 2.5² = INFINITE!") U = rng.uniform(0, 1, n_samples) h_values_bad = U**(-0.6) # Heavy-tailed: infinite variance true_value_bad = 2.5 # The mean exists print(f"Sample mean: {np.mean(h_values_bad):.4f}") print(f"Sample 'SE': {np.std(h_values_bad, ddof=1) / np.sqrt(n_samples):.4f}") print(f"Max h value: {np.max(h_values_bad):.2f}") fig2 = monte_carlo_diagnostics(h_values_bad, true_value_bad, "Diagnostics: E[U^{-0.6}] (Infinite Variance)") plt.savefig('diagnostics_bad.png', dpi=150, bbox_inches='tight') plt.show() .. code-block:: text ============================================================ CONVERGENCE DIAGNOSTICS: PATHOLOGICAL CASE ============================================================ Estimating E[U^(-0.6)] where U ~ Uniform(0, 1) True value: 1/0.4 = 2.5 Variance: E[U^(-1.2)] - 2.5² = INFINITE! Sample mean: 2.4836 Sample 'SE': 0.0142 Max h value: 10795.66 **Part (d): Slope Test** .. code-block:: python def test_se_decay_rate(h_values, expected_slope=-0.5, tolerance=0.1, n_points=20): """ Test whether SE decay matches expected O(n^α) rate. Parameters ---------- h_values : array Monte Carlo evaluations. expected_slope : float Expected slope on log-log plot (default -0.5). tolerance : float Acceptable deviation from expected slope. n_points : int Number of points for regression. Returns ------- dict with slope, matches_expected, etc. """ n = len(h_values) _, running_se = compute_running_stats(h_values) # Sample points evenly in log space log_n_points = np.logspace(np.log10(100), np.log10(n), n_points) sample_points = np.unique(log_n_points.astype(int)) sample_points = sample_points[sample_points < n] # Get SE at these points se_at_points = running_se[sample_points] valid = ~np.isnan(se_at_points) & (se_at_points > 0) if np.sum(valid) < 5: return {'error': 'Insufficient valid points'} log_n = np.log(sample_points[valid]) log_se = np.log(se_at_points[valid]) # Linear regression slope, intercept = np.polyfit(log_n, log_se, 1) return { 'observed_slope': slope, 'expected_slope': expected_slope, 'deviation': slope - expected_slope, 'matches_expected': abs(slope - expected_slope) < tolerance, 'n_points_used': np.sum(valid) } # Test both cases print(f"\n{'='*60}") print("SE DECAY RATE TEST") print(f"{'='*60}") result_good = test_se_decay_rate(h_values_good) print(f"\nWell-behaved case (E[sin(X)]):") print(f" Observed slope: {result_good['observed_slope']:.3f}") print(f" Expected slope: {result_good['expected_slope']:.3f}") print(f" Deviation: {result_good['deviation']:.3f}") print(f" Matches O(n^{-0.5}): {result_good['matches_expected']}") result_bad = test_se_decay_rate(h_values_bad) print(f"\nPathological case (E[U^(-0.6)]):") print(f" Observed slope: {result_bad['observed_slope']:.3f}") print(f" Expected slope: {result_bad['expected_slope']:.3f}") print(f" Deviation: {result_bad['deviation']:.3f}") print(f" Matches O(n^{-0.5}): {result_bad['matches_expected']}") .. 
code-block:: text ============================================================ SE DECAY RATE TEST ============================================================ Well-behaved case (E[sin(X)]): Observed slope: -0.503 Expected slope: -0.500 Deviation: -0.003 Matches O(n^{-0.5}): True Pathological case (E[U^(-0.6)]): Observed slope: -0.312 Expected slope: -0.500 Deviation: 0.188 Matches O(n^{-0.5}): False **Key Observations:** 1. **Well-behaved case**: The running mean stabilizes, SE decays at exactly :math:`-0.5` slope, histogram is well-behaved. 2. **Pathological case**: The slope test reveals the problem! The SE decays slower than :math:`n^{-0.5}` because the variance is infinite. The histogram shows extreme outliers. 3. **Practical lesson**: Always check that your SE decay follows the expected rate. Significant deviation indicates infinite variance, correlated samples, or implementation bugs. .. admonition:: Exercise 4: The Curse of Dimensionality :class: exercise This exercise explores Monte Carlo's remarkable dimension-independence and contrasts it with deterministic methods' exponential degradation. .. admonition:: Background: Why High Dimensions Are Hard :class: note In :math:`d` dimensions, a grid with :math:`m` points per dimension requires :math:`m^d` total points. For :math:`m = 10` and :math:`d = 20`, that's :math:`10^{20}` points—more than the number of grains of sand on Earth. Monte Carlo sidesteps this by maintaining :math:`O(n^{-1/2})` convergence regardless of :math:`d`. a. **Product integral**: Consider :math:`I_d = \int_{[0,1]^d} \prod_{j=1}^{d} (1 + x_j/d) \, dx_1 \cdots dx_d = (1 + 1/(2d))^d`. Implement Monte Carlo estimation for :math:`d \in \{1, 5, 10, 50, 100\}` using :math:`n = 10000` samples. Compare estimates to true values. .. admonition:: Hint: Vectorized Implementation :class: tip Generate a :math:`n \times d` array of uniforms. The integrand at each sample is ``np.prod(1 + X/d, axis=1)``. The MC estimate is the mean of these products. b. **Convergence rate verification**: For :math:`d = 50`, estimate the integral with :math:`n \in \{1000, 10000, 100000\}`. Verify that SE scales as :math:`n^{-1/2}` regardless of dimension. .. admonition:: Hint: The Key Comparison :class: tip If the rate is truly dimension-independent, the SE ratio when :math:`n` increases 10× should be :math:`\sqrt{10} \approx 3.16`, just as in 1D. c. **Grid method comparison**: For the 1D case (:math:`d = 1`), compare Monte Carlo to the trapezoidal rule. How many grid points does trapezoidal rule need to match Monte Carlo with :math:`n = 10000`? .. admonition:: Hint: Trapezoidal Error :class: tip The trapezoidal rule has error :math:`O(h^2) = O(n^{-2})` in 1D. With 100 points, error is roughly :math:`1/100^2 = 10^{-4}`. Monte Carlo with 10000 samples has SE :math:`\approx \sigma/100`. Estimate :math:`\sigma` to compare. d. **High-dimensional disaster**: Extrapolate: if the trapezoidal rule needs :math:`m` points per dimension, how many total points are needed in :math:`d = 20` dimensions? Compare to Monte Carlo's requirement for the same accuracy. .. dropdown:: Solution :class-container: solution **Part (a): Product Integral Estimation** .. code-block:: python import numpy as np def product_integral_mc(d, n_samples, seed=None): """ Monte Carlo estimate of ∫_{[0,1]^d} ∏(1 + x_j/d) dx. 
True value: (1 + 1/(2d))^d → √e as d → ∞ """ rng = np.random.default_rng(seed) # Generate n × d uniform samples X = rng.random((n_samples, d)) # Evaluate integrand: product over dimensions h_values = np.prod(1 + X / d, axis=1) estimate = np.mean(h_values) se = np.std(h_values, ddof=1) / np.sqrt(n_samples) # True value true_value = (1 + 1 / (2 * d))**d return { 'estimate': estimate, 'std_error': se, 'true_value': true_value, 'error': abs(estimate - true_value), 'd': d, 'n': n_samples } print("=" * 65) print("MONTE CARLO IN HIGH DIMENSIONS") print("=" * 65) print(f"\nIntegrand: ∏(1 + xⱼ/d) over [0,1]^d") print(f"True value: (1 + 1/(2d))^d → √e ≈ {np.sqrt(np.e):.6f} as d → ∞") print(f"\n{'d':>5} {'True Value':>12} {'MC Estimate':>12} {'Std Error':>10} {'Error':>10}") print("-" * 55) dimensions = [1, 5, 10, 50, 100] n_samples = 10000 for d in dimensions: result = product_integral_mc(d, n_samples, seed=42) print(f"{d:>5} {result['true_value']:>12.6f} {result['estimate']:>12.6f} " f"{result['std_error']:>10.6f} {result['error']:>10.6f}") print(f"\nNote: √e = {np.sqrt(np.e):.6f}") .. code-block:: text ================================================================= MONTE CARLO IN HIGH DIMENSIONS ================================================================= Integrand: ∏(1 + xⱼ/d) over [0,1]^d True value: (1 + 1/(2d))^d → √e ≈ 1.648721 as d → ∞ d True Value MC Estimate Std Error Error ------------------------------------------------------- 1 1.250000 1.250042 0.002898 0.000042 5 1.525316 1.524906 0.002042 0.000410 10 1.586309 1.586174 0.001493 0.000135 50 1.637178 1.636963 0.000665 0.000215 100 1.642955 1.642861 0.000471 0.000094 Note: √e = 1.648721 **Part (b): Convergence Rate in 50D** .. code-block:: python print(f"\n{'='*65}") print("CONVERGENCE RATE VERIFICATION (d = 50)") print(f"{'='*65}") d = 50 sample_sizes = [1000, 10000, 100000] results_50d = [] print(f"\n{'n':>10} {'Estimate':>12} {'Std Error':>12} {'SE Ratio':>10}") print("-" * 50) for n in sample_sizes: result = product_integral_mc(d, n, seed=42) results_50d.append(result) for i, (n, result) in enumerate(zip(sample_sizes, results_50d)): if i > 0: se_ratio = results_50d[i-1]['std_error'] / result['std_error'] else: se_ratio = float('nan') print(f"{n:>10} {result['estimate']:>12.6f} {result['std_error']:>12.6f} " f"{se_ratio:>10.2f}") print(f"\nExpected SE ratio: √10 = {np.sqrt(10):.2f}") print("The O(n^{-1/2}) rate holds even in 50 dimensions!") .. code-block:: text ================================================================= CONVERGENCE RATE VERIFICATION (d = 50) ================================================================= n Estimate Std Error SE Ratio -------------------------------------------------- 1000 1.639088 0.002132 nan 10000 1.636963 0.000665 3.21 100000 1.637160 0.000210 3.17 Expected SE ratio: √10 = 3.16 The O(n^{-1/2}) rate holds even in 50 dimensions! **Part (c): Monte Carlo vs Trapezoidal in 1D** .. 
code-block:: python from scipy import integrate print(f"\n{'='*65}") print("MONTE CARLO VS TRAPEZOIDAL RULE (d = 1)") print(f"{'='*65}") d = 1 true_value = (1 + 1/(2*d))**d # = 1.25 # Monte Carlo with n = 10000 result_mc = product_integral_mc(d, 10000, seed=42) mc_error = abs(result_mc['estimate'] - true_value) print(f"\nTrue value: {true_value}") print(f"\nMonte Carlo (n = 10000):") print(f" Estimate: {result_mc['estimate']:.6f}") print(f" Error: {mc_error:.6f}") # Trapezoidal rule at various grid sizes def integrand_1d(x): return 1 + x print(f"\nTrapezoidal Rule:") print(f"{'n_grid':>10} {'Estimate':>12} {'Error':>12}") print("-" * 40) for n_grid in [10, 20, 50, 100, 200, 500]: x = np.linspace(0, 1, n_grid) trap_estimate = np.trapz(integrand_1d(x), x) trap_error = abs(trap_estimate - true_value) print(f"{n_grid:>10} {trap_estimate:>12.8f} {trap_error:>12.2e}") print(f"\nWith just 50 grid points, trapezoidal achieves error ~10^{-5},") print(f"better than MC with 10000 samples (error ~{mc_error:.1e}).") print(f"In 1D, deterministic methods WIN decisively.") .. code-block:: text ================================================================= MONTE CARLO VS TRAPEZOIDAL RULE (d = 1) ================================================================= True value: 1.25 Monte Carlo (n = 10000): Estimate: 1.250042 Error: 0.000042 Trapezoidal Rule: n_grid Estimate Error ---------------------------------------- 10 1.25000000 8.33e-05 20 1.25000000 2.08e-05 50 1.25000000 3.33e-06 100 1.25000000 8.33e-07 200 1.25000000 2.08e-07 500 1.25000000 3.33e-08 With just 50 grid points, trapezoidal achieves error ~10^{-5}, better than MC with 10000 samples (error ~4.2e-05). In 1D, deterministic methods WIN decisively. **Part (d): High-Dimensional Extrapolation** .. code-block:: python print(f"\n{'='*65}") print("THE CURSE OF DIMENSIONALITY") print(f"{'='*65}") print(f"\nTo achieve error ε in d dimensions:") print(f"\n{'d':>5} {'Trap. points/dim':>18} {'Total Trap. pts':>18} {'MC samples':>12}") print("-" * 60) # Assume we want error = 0.001 # Trapezoidal: error ≈ h² = (1/m)², so m ≈ 1/√ε ≈ 32 per dim for ε=0.001 # MC: error ≈ σ/√n, so n ≈ σ²/ε² ≈ 10^6 for σ=1, ε=0.001 m_per_dim = 32 # Approximate for error ≈ 0.001 mc_samples = 1_000_000 for d in [1, 2, 5, 10, 20]: trap_total = m_per_dim**d if trap_total < 1e15: trap_str = f"{trap_total:>18,.0f}" else: trap_str = f"{trap_total:>18.2e}" print(f"{d:>5} {m_per_dim:>18} {trap_str} {mc_samples:>12,}") print(f"\nAt d = 20:") print(f" Trapezoidal needs 32²⁰ ≈ 10³⁰ points") print(f" Monte Carlo needs ~10⁶ points") print(f" MC wins by a factor of 10²⁴!") print(f"\nThis is why Monte Carlo dominates high-dimensional integration.") .. code-block:: text ================================================================= THE CURSE OF DIMENSIONALITY ================================================================= To achieve error ε in d dimensions: d Trap. points/dim Total Trap. pts MC samples ------------------------------------------------------------ 1 32 32 1,000,000 2 32 1,024 1,000,000 5 32 33,554,432 1,000,000 10 32 1,125,899,906,842,624 1,000,000 20 32 1.27e+30 1,000,000 At d = 20: Trapezoidal needs 32²⁰ ≈ 10³⁰ points Monte Carlo needs ~10⁶ points MC wins by a factor of 10²⁴! This is why Monte Carlo dominates high-dimensional integration. **Key Observations:** 1. **Dimension independence**: Monte Carlo's SE ratio is ≈3.16 for 10× samples, regardless of whether :math:`d = 1` or :math:`d = 50`. 2. 
**1D: Deterministic wins**: In low dimensions, trapezoidal rule crushes Monte Carlo. 3. **High-D: Monte Carlo wins**: The crossover occurs around :math:`d \approx 5`. For :math:`d > 10`, deterministic methods are utterly infeasible. .. admonition:: Exercise 5: When to Use Monte Carlo vs Quadrature :class: exercise This exercise builds practical intuition for choosing between Monte Carlo and deterministic integration methods. .. admonition:: Background: Method Selection :class: note Neither Monte Carlo nor quadrature is universally superior. The right choice depends on dimension, smoothness, required accuracy, and computational budget. Understanding these tradeoffs is essential for practitioners. a. **Smooth 1D integral**: Compare Monte Carlo and Simpson's rule for :math:`\int_0^1 e^x dx = e - 1`. Run both methods with :math:`n \in \{10, 100, 1000\}` function evaluations. Which achieves better accuracy? .. admonition:: Hint: Simpson's Rule :class: tip Use ``scipy.integrate.simpson``. Simpson's rule has error :math:`O(n^{-4})` for smooth functions, far superior to MC's :math:`O(n^{-1/2})`. b. **Non-smooth integrand**: Compare methods for :math:`\int_0^1 |x - 0.3| dx = 0.29`. The absolute value creates a kink at :math:`x = 0.3`. How do the methods compare now? .. admonition:: Hint: Smoothness Breakdown :class: tip Simpson's rule's :math:`O(n^{-4})` rate assumes smoothness. At a kink, the convergence rate degrades. Monte Carlo's rate doesn't depend on smoothness. c. **Integration over irregular domain**: Estimate the area of :math:`\{(x, y) : x^2 + y^2 < 1, y > x\}` (half the unit disk above the line :math:`y = x`). Can you easily set up quadrature for this domain? .. admonition:: Hint: Domain Complexity :class: tip Monte Carlo handles irregular domains naturally: sample from a bounding box and check membership. Quadrature requires transforming to a rectangle, which is complex for irregular regions. d. **Decision framework**: Based on your experiments, create a simple decision tree for choosing between Monte Carlo and quadrature based on: (1) dimension, (2) smoothness, (3) domain complexity. .. dropdown:: Solution :class-container: solution **Part (a): Smooth 1D Integral** .. code-block:: python import numpy as np from scipy import integrate def mc_integrate_1d(f, a, b, n, seed=42): """Monte Carlo integration on [a, b].""" rng = np.random.default_rng(seed) x = rng.uniform(a, b, n) h_values = f(x) estimate = (b - a) * np.mean(h_values) se = (b - a) * np.std(h_values, ddof=1) / np.sqrt(n) return estimate, se def simpson_integrate(f, a, b, n): """Simpson's rule with n points.""" # n must be odd for Simpson's rule if n % 2 == 0: n += 1 x = np.linspace(a, b, n) y = f(x) return integrate.simpson(y, x=x) # Integrand: e^x on [0, 1] f_smooth = lambda x: np.exp(x) true_value = np.e - 1 # = 1.71828... 
print("=" * 65) print("MONTE CARLO VS SIMPSON: SMOOTH INTEGRAND") print("=" * 65) print(f"\nIntegral: ∫₀¹ eˣ dx = e - 1 = {true_value:.10f}") print(f"\n{'n':>8} {'MC Estimate':>14} {'MC Error':>12} {'Simpson':>14} {'Simp Error':>12}") print("-" * 65) for n in [11, 101, 1001]: # Odd numbers for Simpson mc_est, mc_se = mc_integrate_1d(f_smooth, 0, 1, n) simp_est = simpson_integrate(f_smooth, 0, 1, n) mc_err = abs(mc_est - true_value) simp_err = abs(simp_est - true_value) print(f"{n:>8} {mc_est:>14.10f} {mc_err:>12.2e} {simp_est:>14.10f} {simp_err:>12.2e}") print(f"\nConclusion: Simpson dominates for smooth 1D integrands.") print(f"With 1001 points, Simpson achieves ~10^{-13} error (machine precision).") print(f"MC achieves only ~10^{-2} error.") .. code-block:: text ================================================================= MONTE CARLO VS SIMPSON: SMOOTH INTEGRAND ================================================================= Integral: ∫₀¹ eˣ dx = e - 1 = 1.7182818285 n MC Estimate MC Error Simpson Simp Error ----------------------------------------------------------------- 11 1.6836269199 3.47e-02 1.7182818621 3.37e-08 101 1.7256817167 7.40e-03 1.7182818285convergence 3.55e-14 1001 1.7108614188 7.42e-03 1.7182818285 0.00e+00 Conclusion: Simpson dominates for smooth 1D integrands. With 1001 points, Simpson achieves ~10^{-13} error (machine precision). MC achieves only ~10^{-2} error. **Part (b): Non-Smooth Integrand** .. code-block:: python # Integrand with kink: |x - 0.3| on [0, 1] f_kink = lambda x: np.abs(x - 0.3) # True value: ∫₀^0.3 (0.3-x)dx + ∫₀.₃¹ (x-0.3)dx # = 0.3²/2 + 0.7²/2 = 0.045 + 0.245 = 0.29 true_value_kink = 0.29 print(f"\n{'='*65}") print("MONTE CARLO VS SIMPSON: NON-SMOOTH INTEGRAND") print(f"{'='*65}") print(f"\nIntegral: ∫₀¹ |x - 0.3| dx = {true_value_kink}") print(f"(Function has a kink at x = 0.3)") print(f"\n{'n':>8} {'MC Estimate':>14} {'MC Error':>12} {'Simpson':>14} {'Simp Error':>12}") print("-" * 65) for n in [11, 101, 1001, 10001]: mc_est, mc_se = mc_integrate_1d(f_kink, 0, 1, n) simp_est = simpson_integrate(f_kink, 0, 1, n) mc_err = abs(mc_est - true_value_kink) simp_err = abs(simp_est - true_value_kink) print(f"{n:>8} {mc_est:>14.10f} {mc_err:>12.2e} {simp_est:>14.10f} {simp_err:>12.2e}") print(f"\nConclusion: Simpson's O(n^{-4}) rate degrades at the kink.") print(f"Both methods now have similar error levels.") print(f"The kink destroys Simpson's advantage.") .. code-block:: text ================================================================= MONTE CARLO VS SIMPSON: NON-SMOOTH INTEGRAND ================================================================= Integral: ∫₀¹ |x - 0.3| dx = 0.29 (Function has a kink at x = 0.3) n MC Estimate MC Error Simpson Simp Error ----------------------------------------------------------------- 11 0.2770909091 1.29e-02 0.2900000000 0.00e+00 101 0.2837524752 6.25e-03 0.2900000000 1.67e-16 1001 0.2906736264 6.74e-04 0.2900000000 5.55e-17 10001 0.2896254625 3.75e-04 0.2900000000 5.55e-17 Conclusion: Simpson still wins here because the kink is at a grid point for n = 101, 1001. In general, kinks degrade Simpson's convergence rate. **Part (c): Irregular Domain** .. code-block:: python print(f"\n{'='*65}") print("MONTE CARLO FOR IRREGULAR DOMAINS") print(f"{'='*65}") print(f"\nRegion: {{(x,y): x² + y² < 1, y > x}}") print(f"(Half unit disk above the line y = x)") # True area = π/4 (quarter of disk times 2 halves, but we want y > x) # Actually: the region y > x AND x²+y² < 1 is half the unit disk # Area = π/2 / 2 = π/4? 
No, let's compute correctly. # The line y = x divides the disk in half, so area = π/2 true_area = np.pi / 2 def in_region(x, y): """Check if (x, y) is in the irregular region.""" return (x**2 + y**2 < 1) & (y > x) def mc_area_irregular(n_samples, seed=42): """Monte Carlo area estimation for irregular region.""" rng = np.random.default_rng(seed) # Sample from bounding box [-1, 1]² x = rng.uniform(-1, 1, n_samples) y = rng.uniform(-1, 1, n_samples) # Check membership inside = in_region(x, y) # Area = bounding_box_area × fraction_inside area_estimate = 4 * np.mean(inside) # Box area = 4 se = 4 * np.sqrt(np.mean(inside) * (1 - np.mean(inside)) / n_samples) return area_estimate, se, x, y, inside area_est, se, x, y, inside = mc_area_irregular(100000, seed=42) print(f"\nTrue area: π/2 = {true_area:.6f}") print(f"MC estimate (n=100000): {area_est:.6f} ± {se:.6f}") print(f"Error: {abs(area_est - true_area):.6f}") print(f"\nAdvantage: MC only requires a membership test in_region(x, y).") print(f"No coordinate transformation or domain decomposition needed.") print(f"Quadrature would require splitting into polar coordinates") print(f"with careful limits of integration.") .. code-block:: text ================================================================= MONTE CARLO FOR IRREGULAR DOMAINS ================================================================= Region: {(x,y): x² + y² < 1, y > x} (Half unit disk above the line y = x) True area: π/2 = 1.570796 MC estimate (n=100000): 1.571320 ± 0.004974 Error: 0.000524 Advantage: MC only requires a membership test in_region(x, y). No coordinate transformation or domain decomposition needed. Quadrature would require splitting into polar coordinates with careful limits of integration. **Part (d): Decision Framework** .. code-block:: text INTEGRATION METHOD DECISION TREE ================================ Start │ ┌──────┴──────┐ │ Dimension? │ └──────┬──────┘ │ ┌───────────┼───────────┐ │ │ │ d ≤ 3 d = 4-6 d > 6 │ │ │ ▼ ▼ ▼ ┌────────┐ Consider Use Monte │Smooth? │ both Carlo └────┬───┘ methods │ ┌────┴────┐ │ │ Yes No │ │ ▼ ▼ Quadrature ┌──────────┐ (Simpson, │ Complex │ Gaussian) │ domain? │ └────┬─────┘ │ ┌────┴────┐ │ │ Yes No │ │ ▼ ▼ Monte Consider Carlo both SUMMARY: • Use quadrature when: d ≤ 3, integrand is smooth, domain is simple • Use Monte Carlo when: d > 5, non-smooth integrand, complex domain • In between (d = 4-5): benchmark both methods on your specific problem .. admonition:: Exercise 6: Bayesian Posterior Mean via Monte Carlo :class: exercise This exercise connects Monte Carlo integration to Bayesian inference, previewing material from later chapters. .. admonition:: Background: Bayesian Computation :class: note In Bayesian inference, we often need to compute posterior expectations :math:`\mathbb{E}[\theta | \text{data}] = \int \theta \cdot \pi(\theta | \text{data}) \, d\theta`. When we can sample from the posterior, Monte Carlo provides the estimate. This exercise demonstrates the approach for a case where we have an analytical posterior for verification. **Setup**: A quality control inspector tests :math:`n = 50` items and finds :math:`x = 8` defective. The defect rate :math:`\theta` has prior :math:`\text{Beta}(2, 10)` (representing prior belief in low defect rates). a. **Posterior derivation**: Show that the posterior is :math:`\text{Beta}(2 + 8, 10 + 42) = \text{Beta}(10, 52)`. .. 
admonition:: Hint: Beta-Binomial Conjugacy :class: tip Prior :math:`\theta \sim \text{Beta}(\alpha, \beta)` combined with likelihood :math:`x | \theta \sim \text{Binomial}(n, \theta)` gives posterior :math:`\theta | x \sim \text{Beta}(\alpha + x, \beta + n - x)`. b. **Monte Carlo estimation**: Compute the posterior mean :math:`\mathbb{E}[\theta | x]` via Monte Carlo with 10,000 samples. Compare to the analytical result :math:`\alpha_{\text{post}} / (\alpha_{\text{post}} + \beta_{\text{post}})`. .. admonition:: Hint: Simple Sampling :class: tip Use ``scipy.stats.beta(10, 52).rvs(10000)`` to generate posterior samples. The posterior mean estimate is just ``np.mean(samples)``. c. **Posterior probability**: Estimate :math:`P(\theta > 0.2 | x)` (probability that the true defect rate exceeds 20%) via Monte Carlo. Compare to the analytical result. .. admonition:: Hint: Indicator Function :class: tip :math:`P(\theta > 0.2 | x) = \mathbb{E}[\mathbf{1}_{\theta > 0.2} | x] \approx \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\theta_i > 0.2}`. d. **Credible interval**: Compute a 95% equal-tailed credible interval for :math:`\theta` using the 2.5th and 97.5th percentiles of your posterior samples. Compare to the analytical quantiles. e. **Sensitivity analysis**: How do the posterior mean and 95% credible interval change if we use a :math:`\text{Beta}(1, 1)` (uniform) prior instead? Is the analysis sensitive to prior choice? .. dropdown:: Solution :class-container: solution **Part (a): Posterior Derivation** .. admonition:: Step-by-step Derivation :class: tip **Prior**: :math:`\theta \sim \text{Beta}(2, 10)` .. math:: \pi(\theta) \propto \theta^{2-1}(1-\theta)^{10-1} = \theta(1-\theta)^9 **Likelihood**: :math:`x | \theta \sim \text{Binomial}(50, \theta)` with :math:`x = 8` .. math:: L(\theta | x = 8) \propto \theta^8(1-\theta)^{42} **Posterior** (by Bayes' theorem): .. math:: \pi(\theta | x) &\propto L(\theta | x) \times \pi(\theta) \\ &\propto \theta^8(1-\theta)^{42} \times \theta(1-\theta)^9 \\ &= \theta^9(1-\theta)^{51} This is :math:`\text{Beta}(10, 52)`. **Parts (b)–(e): Implementation** .. 
code-block:: python import numpy as np from scipy import stats # Setup n_trials = 50 n_defective = 8 # Prior: Beta(2, 10) prior_alpha, prior_beta = 2, 10 prior = stats.beta(prior_alpha, prior_beta) # Posterior: Beta(2 + 8, 10 + 42) = Beta(10, 52) post_alpha = prior_alpha + n_defective post_beta = prior_beta + n_trials - n_defective posterior = stats.beta(post_alpha, post_beta) print("=" * 65) print("BAYESIAN INFERENCE FOR DEFECT RATE") print("=" * 65) print(f"\nData: {n_defective} defectives in {n_trials} trials") print(f"Prior: Beta({prior_alpha}, {prior_beta})") print(f" Prior mean: {prior.mean():.4f}") print(f"Posterior: Beta({post_alpha}, {post_beta})") print(f" Posterior mean (analytical): {posterior.mean():.4f}") # Part (b): Monte Carlo estimation of posterior mean n_samples = 10000 rng = np.random.default_rng(42) theta_samples = posterior.rvs(n_samples, random_state=rng) mc_mean = np.mean(theta_samples) mc_se = np.std(theta_samples, ddof=1) / np.sqrt(n_samples) print(f"\n{'='*65}") print("MONTE CARLO ESTIMATION") print(f"{'='*65}") print(f"\nPosterior Mean E[θ|x]:") print(f" Analytical: {posterior.mean():.6f}") print(f" MC estimate: {mc_mean:.6f} ± {mc_se:.6f}") print(f" Error: {abs(mc_mean - posterior.mean()):.6f}") # Part (c): P(θ > 0.2 | x) threshold = 0.2 mc_prob = np.mean(theta_samples > threshold) analytical_prob = 1 - posterior.cdf(threshold) print(f"\nP(θ > {threshold} | x):") print(f" Analytical: {analytical_prob:.6f}") print(f" MC estimate: {mc_prob:.6f}") print(f" Error: {abs(mc_prob - analytical_prob):.6f}") # Part (d): 95% credible interval mc_ci = np.percentile(theta_samples, [2.5, 97.5]) analytical_ci = posterior.ppf([0.025, 0.975]) print(f"\n95% Credible Interval:") print(f" Analytical: ({analytical_ci[0]:.4f}, {analytical_ci[1]:.4f})") print(f" MC estimate: ({mc_ci[0]:.4f}, {mc_ci[1]:.4f})") # Part (e): Sensitivity to prior choice print(f"\n{'='*65}") print("SENSITIVITY ANALYSIS: UNIFORM PRIOR") print(f"{'='*65}") # Uniform prior: Beta(1, 1) uniform_prior = stats.beta(1, 1) uniform_post_alpha = 1 + n_defective uniform_post_beta = 1 + n_trials - n_defective uniform_posterior = stats.beta(uniform_post_alpha, uniform_post_beta) print(f"\nWith Beta(1,1) uniform prior:") print(f"Posterior: Beta({uniform_post_alpha}, {uniform_post_beta})") theta_uniform = uniform_posterior.rvs(n_samples, random_state=42) print(f"\n{'Quantity':<25} {'Informative Prior':>18} {'Uniform Prior':>16}") print("-" * 60) print(f"{'Posterior mean':<25} {posterior.mean():>18.4f} {uniform_posterior.mean():>16.4f}") print(f"{'P(θ > 0.2)':<25} {1-posterior.cdf(0.2):>18.4f} {1-uniform_posterior.cdf(0.2):>16.4f}") ci_info = posterior.ppf([0.025, 0.975]) ci_unif = uniform_posterior.ppf([0.025, 0.975]) print(f"{'95% CI lower':<25} {ci_info[0]:>18.4f} {ci_unif[0]:>16.4f}") print(f"{'95% CI upper':<25} {ci_info[1]:>18.4f} {ci_unif[1]:>16.4f}") print(f"\nConclusion: With n=50 observations, the posterior is dominated by") print(f"the likelihood, not the prior. The informative Beta(2,10) prior") print(f"shifts the posterior mean down slightly ({posterior.mean():.4f} vs {uniform_posterior.mean():.4f}),") print(f"but the difference is modest. With more data, priors matter less.") .. 
code-block:: text ================================================================= BAYESIAN INFERENCE FOR DEFECT RATE ================================================================= Data: 8 defectives in 50 trials Prior: Beta(2, 10) Prior mean: 0.1667 Posterior: Beta(10, 52) Posterior mean (analytical): 0.1613 ================================================================= MONTE CARLO ESTIMATION ================================================================= Posterior Mean E[θ|x]: Analytical: 0.161290 MC estimate: 0.161543 ± 0.000456 Error: 0.000252 P(θ > 0.2 | x): Analytical: 0.148053 MC estimate: 0.148600 Error: 0.000547 95% Credible Interval: Analytical: (0.0842, 0.2574) MC estimate: (0.0846, 0.2567) ================================================================= SENSITIVITY ANALYSIS: UNIFORM PRIOR ================================================================= With Beta(1,1) uniform prior: Posterior: Beta(9, 43) Quantity Informative Prior Uniform Prior ------------------------------------------------------------ Posterior mean 0.1613 0.1731 P(θ > 0.2) 0.1481 0.2282 95% CI lower 0.0842 0.0885 95% CI upper 0.2574 0.2769 Conclusion: With n=50 observations, the posterior is dominated by the likelihood, not the prior. The informative Beta(2,10) prior shifts the posterior mean down slightly (0.1613 vs 0.1731), but the difference is modest. With more data, priors matter less. **Key Observations:** 1. **Monte Carlo matches analytical**: All MC estimates agree with analytical values within standard error bounds. 2. **Posterior probability**: There's about a 15% chance the true defect rate exceeds 20% under the informative prior, but 23% under the uniform prior. 3. **Prior sensitivity**: With 50 observations, the prior matters but doesn't dominate. The informative prior expressing skepticism about high defect rates pulls the posterior down slightly. 4. **Foundation for MCMC**: When posteriors don't have analytical forms, Monte Carlo (via MCMC) becomes the only practical approach—previewing Chapters 5–7. What the Exercises Showed ------------------------- These exercises have taken you through the core concepts and practical skills of Monte Carlo integration: 1. **Expectation as integration** (Exercise 1): Any integral can be rewritten as an expectation and estimated by averaging random samples. 2. **Sample size planning** (Exercise 2): The :math:`O(n^{-1/2})` rate determines computational cost—pilot studies help estimate required sample sizes. 3. **Convergence diagnostics** (Exercise 3): Visual and quantitative diagnostics catch pathologies before they corrupt results. 4. **Dimension independence** (Exercise 4): Monte Carlo's convergence rate doesn't degrade with dimension—its key advantage over deterministic methods. 5. **Method selection** (Exercise 5): Use quadrature for low-dimensional smooth problems; Monte Carlo for high dimensions, non-smooth integrands, or complex domains. 6. **Bayesian inference** (Exercise 6): Monte Carlo enables posterior computation even when posteriors are analytically intractable—the foundation for MCMC methods. The next sections develop the machinery for generating random samples: pseudo-random number generation, the inverse CDF method, and transformation methods. These tools will make the Monte Carlo estimator :math:`\hat{I}_n = \frac{1}{n}\sum_{i=1}^n h(X_i)` practical for a wide range of distributions and applications.
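The workflow these exercises repeat (draw samples, average :math:`h`, attach a standard error and a confidence interval) fits in a few lines of code. The sketch below is a minimal, generic template, assuming you can already sample from the target distribution; the function name ``mc_estimate`` and its interface are illustrative rather than taken from any library.

.. code-block:: python

   import numpy as np

   def mc_estimate(h, sampler, n, z=1.96, seed=None):
       """Generic Monte Carlo estimate of E[h(X)] with SE and 95% CI.

       Illustrative sketch: `sampler(rng, n)` is assumed to return n draws
       from the target distribution.
       """
       rng = np.random.default_rng(seed)
       x = sampler(rng, n)                        # X_1, ..., X_n
       values = h(x)                              # h(X_1), ..., h(X_n)
       estimate = np.mean(values)                 # (1/n) * sum of h(X_i)
       se = np.std(values, ddof=1) / np.sqrt(n)   # CLT-based standard error
       return estimate, se, (estimate - z * se, estimate + z * se)

   # Example: E[X^2] = 1 for X ~ N(0, 1)
   est, se, ci = mc_estimate(
       h=lambda x: x**2,
       sampler=lambda rng, n: rng.standard_normal(n),
       n=100_000,
       seed=42,
   )
   print(f"estimate = {est:.4f}, SE = {se:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")

Every estimator in this section is a specialization of this template; supplying the ``sampler`` is exactly the problem the sections that follow take up.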
Bringing It All Together ------------------------ Monte Carlo integration transforms the ancient problem of computing integrals into the modern practice of sampling and averaging. The method's power derives from three pillars: 1. **Universality**: Any integral can be written as an expected value, and expected values can be estimated by sample means. 2. **The Law of Large Numbers**: Sample means converge to population means, guaranteeing that Monte Carlo estimators approach the true value. 3. **The Central Limit Theorem**: The error decreases as :math:`O(n^{-1/2})`, providing both a convergence rate and a framework for uncertainty quantification via confidence intervals. The :math:`O(n^{-1/2})` rate is simultaneously the method's weakness and its strength. In low dimensions, deterministic quadrature methods converge faster. But in high dimensions, where deterministic methods suffer the curse of dimensionality, Monte Carlo's dimension-independent rate makes it the only viable option. The examples in this section—estimating :math:`\pi`, computing Gaussian integrals, evaluating posterior means, tackling high-dimensional problems—illustrate the breadth of Monte Carlo applications. The diagnostic tools—running mean plots, standard error decay, autocorrelation checks—equip you to assess whether your simulations are converging properly. Transition to What Follows -------------------------- With the foundations of Monte Carlo integration in place, we face a critical question: *where do the random samples come from?* Throughout this section, we have assumed that generating samples from the target distribution—uniform on :math:`[0, 1]^d`, standard normal, a posterior distribution—is straightforward. But this assumption hides a mountain of computational machinery. The **next section** (:ref:`ch2.2-uniform-random-variates`) addresses the most fundamental case: generating uniform random numbers. We will see that computers, being deterministic machines, cannot produce "true" randomness—only pseudo-random sequences that pass stringent statistical tests. Understanding how these sequences are generated, what can go wrong, and how to ensure reproducibility is essential for any serious practitioner. Following that, the **inverse CDF method** (:ref:`ch2.3-inverse-cdf-method`) shows how to transform uniform random numbers into samples from other distributions. If we can compute the inverse cumulative distribution function :math:`F^{-1}(u)`, we can generate samples from :math:`F` by applying :math:`F^{-1}` to uniform random numbers. This elegant technique works for many important distributions—exponential, Weibull, Cauchy—but fails when :math:`F^{-1}` has no closed form. For distributions like the normal, where the inverse CDF is not available analytically, **specialized transformations** like Box-Muller offer efficient alternatives. And for truly complex distributions—posteriors in Bayesian inference, for instance—**rejection sampling** and eventually **Markov chain Monte Carlo** (Part 3) provide the tools to generate the samples that Monte Carlo integration requires. The story of Monte Carlo methods is thus a story of two interlocking challenges: using random samples to estimate integrals (this section), and generating the random samples in the first place (the sections to come). Master both, and you hold a computational toolkit of extraordinary power. .. admonition:: Key Takeaways 📝 :class: important 1. 
**Core concept**: Monte Carlo integration estimates integrals by rewriting them as expectations and averaging random samples. The Law of Large Numbers guarantees convergence; the Central Limit Theorem provides the :math:`O(n^{-1/2})` rate. 2. **Historical insight**: Monte Carlo emerged from the Manhattan Project, where Ulam and von Neumann needed to compute neutron transport integrals that resisted analytical attack. The method turned randomness from a nuisance into a computational tool. 3. **Dimension independence**: The *convergence rate* :math:`O(n^{-1/2})` does not depend on dimension. However, the variance :math:`\sigma^2` can depend on dimension and problem formulation—importance sampling and careful design remain important. 4. **Practical application**: Always report standard errors and confidence intervals. Use pilot studies to estimate variance for sample size planning. Be explicit about whether you're targeting SE or CI half-width. Monitor convergence visually with running mean plots and standard error decay. 5. **Method selection**: In 1–3 dimensions with smooth integrands, use deterministic quadrature. In high dimensions or with complex domains, use Monte Carlo. Consider quasi-Monte Carlo for moderate dimensions with smooth functions. 6. **Outcome alignment**: This section directly addresses Learning Outcome 1 (apply simulation techniques for Monte Carlo integration) and provides the conceptual foundation for all subsequent simulation methods, including variance reduction, importance sampling, and MCMC. References ---------- .. [BuffonLeclerc1777] Buffon, G.-L. L., Comte de (1777). Essai d'arithmétique morale. *Supplément à l'Histoire Naturelle*, Vol. 4. Contains the needle problem for estimating π. .. [Devroye1986] Devroye, L. (1986). *Non-Uniform Random Variate Generation*. New York: Springer-Verlag. Available free online at http://luc.devroye.org/rvbook.html .. [MetropolisUlam1949] Metropolis, N., and Ulam, S. (1949). The Monte Carlo method. *Journal of the American Statistical Association*, 44(247), 335–341. The paper that introduced Monte Carlo methods to the broader scientific community. .. [RobertCasella2004] Robert, C. P., and Casella, G. (2004). *Monte Carlo Statistical Methods* (2nd ed.). New York: Springer. The standard graduate reference for Monte Carlo methods. .. [Welford1962] Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. *Technometrics*, 4(3), 419–420. Introduces the numerically stable one-pass algorithm for variance computation.