Monte Carlo Fundamentals

In the spring of 1946, the mathematician Stanislaw Ulam was recovering from a near-fatal case of viral encephalitis at his home in Los Angeles. To pass the time during his convalescence, he played countless games of solitaire—and found himself wondering: what is the probability of winning a game of Canfield solitaire? The combinatorics were hopelessly complex. There were too many possible configurations, too many branching paths through a game, to enumerate them all. But Ulam realized something profound: he didn’t need to enumerate every possibility. He could simply play a hundred games and count how many he won.

This insight—that we can estimate probabilities by running experiments rather than computing them analytically—was not new. The Comte de Buffon had used a similar approach in 1777 to estimate π by dropping needles onto a lined floor. But Ulam saw something that Buffon could not have imagined: the recently completed ENIAC computer could “play” millions of such games, transforming a parlor trick into a serious computational method. Within weeks, Ulam had discussed the idea with his colleague John von Neumann, and the two began developing what would become one of the most powerful computational frameworks in all of science.

They needed a code name for this method, which they were applying to classified problems in nuclear weapons design at Los Alamos. Nicholas Metropolis suggested “Monte Carlo,” after the famous casino in Monaco where Ulam’s uncle had a gambling habit. The name stuck, and with it, a new era in computational science began.

This chapter introduces Monte Carlo methods—a family of algorithms that use random sampling to solve problems that would otherwise be intractable. We will see how randomness, properly harnessed, becomes a precision instrument for computing integrals, estimating probabilities, and approximating quantities that resist analytical attack. The ideas are simple, but their power is immense: Monte Carlo methods now pervade physics, finance, machine learning, and statistics, anywhere that high-dimensional integration or complex probability calculations arise.

Road Map 🧭

  • Understand: The fundamental principle that Monte Carlo integration estimates integrals as expectations of random samples, and why this works via the Law of Large Numbers and Central Limit Theorem

  • Develop: Deep intuition for the \(O(n^{-1/2})\) convergence rate—what it means, why it arises, and its remarkable dimension-independence

  • Implement: Complete Python code for Monte Carlo estimation with variance quantification, confidence intervals, and convergence diagnostics

  • Evaluate: When Monte Carlo methods outperform deterministic alternatives, and how to assess estimation quality in practice

  • Connect: How Monte Carlo integration motivates the random variable generation techniques of subsequent sections

Notation Convention 📐

Throughout this text, we write \(X \sim \mathcal{N}(\mu, \sigma^2)\) to denote a normal random variable with mean \(\mu\) and variance \(\sigma^2\). That is, the second parameter is the variance, not the standard deviation. This is the most common convention in statistics and probability theory. When we need to emphasize the standard deviation, we write it explicitly as \(\text{SD}(X) = \sigma\).

The Historical Development of Monte Carlo Methods

Before diving into the mathematics, it is worth understanding how Monte Carlo methods emerged and evolved. This history illuminates why the methods work, what problems motivated their development, and why they remain central to computational science today.

Timeline showing evolution of Monte Carlo methods from Buffon's Needle (1777) to modern neural-enhanced methods (2020s)

Fig. 30 Historical Evolution of Monte Carlo Methods. This timeline traces 250 years of algorithmic innovation, from Buffon’s needle experiment in 1777 through the founding contributions of Ulam and von Neumann in the 1940s, the development of MCMC methods, resampling techniques, and modern neural-enhanced approaches. Each innovation opened new classes of problems to computational attack.

Buffon’s Needle: The First Monte Carlo Experiment

In 1777, Georges-Louis Leclerc, Comte de Buffon, posed a deceptively simple question: suppose we have a floor made of parallel wooden planks, each of width \(d\), and we drop a needle of length \(\ell \leq d\) onto this floor. What is the probability that the needle crosses one of the cracks between planks?

To answer this, Buffon introduced what we would now recognize as a probabilistic model. Let \(\theta\) denote the angle between the needle and the direction of the planks, uniformly distributed on \([0, \pi)\). Let \(y\) denote the distance from the needle’s center to the nearest crack, uniformly distributed on \([0, d/2]\). The needle crosses a crack if and only if the vertical projection of half the needle exceeds the distance to the crack—that is, if \(y \leq \frac{\ell}{2} \sin\theta\).

Three-panel visualization of Buffon's needle problem showing physical setup, crossing geometry, and probability space

Fig. 31 Buffon’s Needle Geometry. Left: The physical setup with needles scattered across parallel planks (red = crossing, blue = not crossing). Middle: The key geometry—a needle crosses if its center’s distance \(y\) to the nearest crack is less than the vertical projection \((\ell/2)\sin\theta\). Right: The probability space showing the crossing region; the probability equals the ratio of areas: \(2\ell/(\pi d)\).

The probability of crossing is therefore:

\[P(\text{crossing}) = \frac{1}{\pi} \cdot \frac{2}{d} \int_0^{\pi} \int_0^{(\ell/2)\sin\theta} dy \, d\theta\]

Evaluating the inner integral yields \(\frac{\ell}{2}\sin\theta\), and the outer integral gives \(\int_0^{\pi} \sin\theta \, d\theta = 2\). Thus:

\[P(\text{crossing}) = \frac{1}{\pi} \cdot \frac{2}{d} \cdot \frac{\ell}{2} \cdot 2 = \frac{2\ell}{\pi d}\]

This elegant result has a remarkable consequence. If we drop \(n\) needles and observe that \(k\) of them cross a crack, then our estimate of the crossing probability is \(\hat{p} = k/n\). Rearranging Buffon’s formula:

\[\pi = \frac{2\ell}{d \cdot P(\text{crossing})} \approx \frac{2\ell n}{d k}\]

We can estimate \(\pi\) by throwing needles!

This is a Monte Carlo method avant la lettre: we use random experiments to estimate a deterministic quantity. Of course, Buffon lacked computers, and actually throwing thousands of needles by hand is tedious. In 1901, the Italian mathematician Mario Lazzarini claimed to have obtained \(\pi \approx 3.1415929\) by throwing a needle 3,408 times—a result that matches the famous rational approximation \(355/113\) suspiciously well. Most historians believe Lazzarini fudged his data, but the underlying principle was sound.
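The experiment is also easy to simulate directly. The sketch below (our own helper, not a library function) draws the angle and center distance from the uniform distributions above, counts crossings, and inverts Buffon's formula:

import numpy as np

def buffon_pi_estimate(n_throws, needle_len=1.0, plank_width=1.0, seed=None):
    """Estimate pi by simulating Buffon's needle (assumes needle_len <= plank_width)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, np.pi, n_throws)           # angle between needle and planks
    y = rng.uniform(0, plank_width / 2, n_throws)     # center distance to nearest crack
    crossings = np.sum(y <= (needle_len / 2) * np.sin(theta))
    # Invert P(crossing) = 2*l / (pi*d):  pi ≈ 2*l*n / (d*k)
    return 2 * needle_len * n_throws / (plank_width * crossings)

print(buffon_pi_estimate(100_000, seed=42))  # typically lands within a few hundredths of pi

With a hundred thousand simulated throws the estimate typically lands within a few hundredths of \(\pi\)—far better than anything Lazzarini could have done by hand.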

Try It Yourself 🖥️ Buffon’s Needle Simulation

Experience Buffon’s experiment interactively:

Interactive Simulation: https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/buffons_needle_simulation.html

Watch how the \(\pi\) estimate fluctuates wildly with few needles, then gradually stabilizes as you accumulate thousands of throws. This is the Law of Large Numbers in action—a theme we will return to throughout this chapter.

Fermi’s Envelope Calculations

The physicist Enrico Fermi was famous for his ability to estimate quantities that seemed impossibly difficult to calculate. How many piano tuners are there in Chicago? How much energy is released in a nuclear explosion? Fermi would break these problems into pieces, estimate each piece roughly, and multiply—often achieving answers accurate to within an order of magnitude.

Less well known is that Fermi also pioneered proto-Monte Carlo methods. In the 1930s, working on neutron diffusion problems in Rome, Fermi developed a mechanical device—essentially a specialized slide rule—that could generate random numbers to simulate neutron paths through matter. He used these simulations to estimate quantities like neutron absorption cross-sections, which were too complex to compute analytically.

This work remained largely unpublished, but it anticipated the key insight of Monte Carlo: when a deterministic calculation is intractable, a stochastic simulation may succeed. Fermi’s physical random number generator was crude, but the principle was the same one that Ulam and von Neumann would later implement on electronic computers.

The Manhattan Project and ENIAC

The development of nuclear weapons during World War II created an urgent need for computational methods. The behavior of neutrons in a nuclear reaction—how they scatter, slow down, and trigger fission—depends on complex integrals over energy and angle that resist analytical solution. The physicists at Los Alamos needed numbers, not theorems.

It was in this context that Ulam’s solitaire insight proved transformative. Ulam and von Neumann realized that the same principle—estimate a complicated quantity by averaging over random samples—could be applied to neutron transport. Instead of integrating over all possible neutron paths analytically (impossible), they could simulate thousands of individual neutrons, tracking each one as it scattered and absorbed through the weapon’s core.

Von Neumann took the lead in implementing these ideas on ENIAC, one of the first general-purpose electronic computers. ENIAC could perform about 5,000 operations per second—glacially slow by modern standards, but revolutionary in 1946. Von Neumann and his team programmed ENIAC to simulate neutron histories, and the results helped validate the design of thermonuclear weapons.

The “Monte Carlo method” was formally introduced to the broader scientific community in a 1949 paper by Metropolis and Ulam, though much of the early work remained classified for decades. The name, coined by Metropolis, captured both the element of chance central to the method and the slightly disreputable excitement of gambling—a fitting tribute to Ulam’s card-playing origins.

Why “Monte Carlo” Changed Everything

The Monte Carlo revolution was not merely about having faster computers. It represented a conceptual breakthrough: randomness is a computational resource. By embracing uncertainty rather than fighting it, Monte Carlo methods could attack problems that deterministic methods could not touch.

Consider the challenge of computing a 100-dimensional integral. Deterministic quadrature methods—the trapezoidal rule, Simpson’s rule, Gaussian quadrature—all suffer from the “curse of dimensionality.” If we use \(n\) points per dimension, we need \(n^{100}\) total evaluations. Even with \(n = 2\), this exceeds \(10^{30}\)—more function evaluations than atoms in a human body.

Monte Carlo methods sidestep this curse entirely. As we will see, the convergence rate of a Monte Carlo estimate depends only on the number of samples, not on the dimension of the space. A 100-dimensional integral converges at the same \(O(n^{-1/2})\) rate as a one-dimensional integral. This dimension-independence of the rate is the source of Monte Carlo’s power—though as we will discuss, the constant factor (the variance) may still depend on dimension.

The Core Principle: Expectation as Integration

We now turn to the mathematical foundations of Monte Carlo integration. The key insight is simple but profound: any integral can be rewritten as an expected value, and expected values can be estimated by averaging samples.

From Integrals to Expectations

Consider a general integral of the form:

(1)\[I = \int_{\mathcal{X}} g(x) \, dx\]

where \(\mathcal{X} \subseteq \mathbb{R}^d\) is the domain of integration and \(g: \mathcal{X} \to \mathbb{R}\) is the function we wish to integrate. At first glance, this seems like a problem for calculus, not probability. But watch what happens when we introduce a probability density.

Let \(f(x)\) be any probability density function on \(\mathcal{X}\)—that is, \(f(x) \geq 0\) everywhere and \(\int_{\mathcal{X}} f(x) \, dx = 1\). We can rewrite our integral as:

(2)\[I = \int_{\mathcal{X}} g(x) \, dx = \int_{\mathcal{X}} \frac{g(x)}{f(x)} f(x) \, dx = \mathbb{E}_f\left[ \frac{g(X)}{f(X)} \right]\]

where the expectation is taken over a random variable \(X\) with density \(f\). We have transformed an integral into an expected value!

The simplest choice is the uniform density on \(\mathcal{X}\). If \(\mathcal{X}\) has finite volume \(V = \int_{\mathcal{X}} dx\), then \(f(x) = 1/V\) is a valid density, and:

\[I = V \cdot \mathbb{E}_{\text{Uniform}(\mathcal{X})}[g(X)]\]

For example, to compute \(\int_0^1 e^{-x^2} dx\), we write:

\[\int_0^1 e^{-x^2} dx = \mathbb{E}[e^{-U^2}] \quad \text{where } U \sim \text{Uniform}(0, 1)\]

This rewriting is always possible. But why is it useful?

Two Cases to Keep Straight

Throughout this chapter, we encounter two closely related formulations:

  • Case A (Expectation): We want \(I = \mathbb{E}_f[h(X)] = \int h(x) f(x) \, dx\) where \(f\) is a known density and \(X \sim f\).

  • Case B (Plain integral): We want \(I = \int g(x) \, dx\), which we rewrite as \(I = \mathbb{E}_f[g(X)/f(X)]\) for some sampling density \(f\).

In both cases, the Monte Carlo estimator takes the form \(\hat{I}_n = \frac{1}{n}\sum_{i=1}^n w(X_i)\) where \(w(x) = h(x)\) in Case A and \(w(x) = g(x)/f(x)\) in Case B. This unified view—Monte Carlo is averaging weights—will simplify later developments in importance sampling and variance reduction.
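To make the two cases concrete, here is a minimal sketch (the example integrands are our own choices) that computes one estimate of each type; both are plain averages of weights:

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Case A: I = E_f[h(X)] with X ~ N(0,1) and h(x) = x^2  (true value: Var(X) = 1)
x = rng.standard_normal(n)
case_a = np.mean(x**2)

# Case B: I = ∫_0^1 exp(-x^2) dx, sampling f = Uniform(0,1), weight g(x)/f(x) = exp(-x^2)
u = rng.uniform(0, 1, n)
case_b = np.mean(np.exp(-u**2))

print(f"Case A estimate: {case_a:.4f}  (true value: 1.0)")
print(f"Case B estimate: {case_b:.4f}  (true value: (sqrt(pi)/2)*erf(1) ≈ 0.7468)")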

The Monte Carlo Estimator

The power of the expectation formulation becomes clear when we recall the Law of Large Numbers. If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (iid) with \(\mathbb{E}[X_i] = \mu\), then the sample mean converges to the true mean:

\[\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty\]

Applied to our integral:

Definition: Monte Carlo Estimator

Let \(X_1, X_2, \ldots, X_n\) be iid samples from a density \(f\) on \(\mathcal{X}\). The Monte Carlo estimator of the integral \(I = \int_{\mathcal{X}} h(x) f(x) \, dx = \mathbb{E}_f[h(X)]\) is:

(3)\[\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} h(X_i)\]

More generally, for \(I = \int_{\mathcal{X}} g(x) \, dx\) where we sample from density \(f\):

\[\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)}\]

The Monte Carlo method is disarmingly simple: draw random samples, evaluate the function at each sample, and average the results. No derivatives, no quadrature weights, no mesh generation—just sampling and averaging.

But this simplicity conceals depth. The choice of sampling density \(f\) is entirely up to us, and different choices lead to dramatically different performance. We will explore this in the section on importance sampling; for now, we focus on the “naive” case where \(f\) matches the density of the integrand or is uniform on the domain.

Algorithm 2.1: Basic Monte Carlo Integration

Input: Function \(h\), sampling distribution \(f\), sample size \(n\)

Output: Estimate \(\hat{I}_n\) and standard error \(\widehat{\text{SE}}\)

1. Generate X₁, X₂, ..., Xₙ iid from density f
2. Compute h(Xᵢ) for i = 1, ..., n
3. Estimate: Î_n = (1/n) Σᵢ h(Xᵢ)
4. Variance: σ̂² = (1/(n-1)) Σᵢ (h(Xᵢ) - Î_n)²
5. Standard Error: SE = σ̂ / √n
6. Return Î_n, SE

Complexity: \(O(n)\) function evaluations, \(O(n)\) storage (or \(O(1)\) with streaming)

A First Example: Estimating \(\pi\)

Let us return to the problem of estimating \(\pi\), now with Monte Carlo machinery. Consider the integral:

\[\pi = \int_{-1}^{1} \int_{-1}^{1} \mathbf{1}_{x^2 + y^2 \leq 1} \, dx \, dy\]

This is the area of the unit disk. Rewriting as an expectation:

\[\pi = 4 \cdot \mathbb{E}[\mathbf{1}_{X^2 + Y^2 \leq 1}] \quad \text{where } (X, Y) \sim \text{Uniform}([-1,1]^2)\]

The factor of 4 accounts for the area of the square \([-1,1]^2\). The Monte Carlo estimator is:

\[\hat{\pi}_n = \frac{4}{n} \sum_{i=1}^{n} \mathbf{1}_{X_i^2 + Y_i^2 \leq 1}\]

That is: generate \(n\) uniform points in the square, count how many fall inside the unit circle, and multiply by 4.

The geometric intuition is immediate: the ratio of points landing inside the circle to total points approximates the ratio of areas, \(\pi/4\).

Monte Carlo estimation of π showing scatter plot of points inside and outside unit circle, plus convergence plot

Fig. 32 Monte Carlo π Estimation. Left: Blue points fall inside the unit circle; coral points fall outside. The ratio of blue to total points estimates \(\pi/4\). Right: The running estimate stabilizes as samples accumulate—the characteristic “noisy convergence” of Monte Carlo.

import numpy as np

def estimate_pi_monte_carlo(n_samples, seed=None):
    """
    Estimate π using Monte Carlo integration.

    Parameters
    ----------
    n_samples : int
        Number of random points to generate.
    seed : int, optional
        Random seed for reproducibility.

    Returns
    -------
    dict
        Contains estimate, standard_error, and confidence interval.
    """
    rng = np.random.default_rng(seed)

    # Generate uniform points in [-1, 1]²
    x = rng.uniform(-1, 1, n_samples)
    y = rng.uniform(-1, 1, n_samples)

    # Count points inside unit circle
    inside = (x**2 + y**2) <= 1

    # Monte Carlo estimate
    p_hat = np.mean(inside)
    pi_hat = 4 * p_hat

    # Standard error (indicator has Bernoulli variance p(1-p))
    se_p = np.sqrt(p_hat * (1 - p_hat) / n_samples)
    se_pi = 4 * se_p

    # 95% asymptotic normal confidence interval (CLT justifies this for large n)
    ci = (pi_hat - 1.96 * se_pi, pi_hat + 1.96 * se_pi)

    return {
        'estimate': pi_hat,
        'standard_error': se_pi,
        'ci_95': ci,
        'n_samples': n_samples
    }

# Run the estimation
result = estimate_pi_monte_carlo(100_000, seed=42)
print(f"π estimate: {result['estimate']:.6f}")
print(f"True π:     {np.pi:.6f}")
print(f"Error:      {abs(result['estimate'] - np.pi):.6f}")
print(f"Std Error:  {result['standard_error']:.6f}")
print(f"95% CI:     ({result['ci_95'][0]:.6f}, {result['ci_95'][1]:.6f})")
π estimate: 3.143080
True π:     3.141593
Error:      0.001487
Std Error:  0.005190
95% CI:     (3.132908, 3.153252)

The true value \(\pi\) lies comfortably within the 95% asymptotic normal confidence interval (valid by the CLT for large \(n\)). With a million samples, the standard error shrinks by a factor of \(\sqrt{10} \approx 3.16\), and with ten million, by another factor of \(\sqrt{10}\).

Why Bernoulli variance? The comment in the code mentions “Bernoulli variance”—let’s unpack this. Each random point either lands inside the circle (we record \(I_i = 1\)) or outside (we record \(I_i = 0\)). This is a Bernoulli trial with success probability \(p = \pi/4\), which is the ratio of the circle’s area to the square’s area. For any Bernoulli random variable:

\[\mathbb{E}[I_i] = p, \qquad \text{Var}(I_i) = p(1 - p)\]

Our estimator \(\hat{p} = \frac{1}{n}\sum_{i=1}^n I_i\) is the sample proportion of points inside the circle. Since the \(I_i\) are independent:

\[\text{Var}(\hat{p}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n I_i\right) = \frac{1}{n^2} \cdot n \cdot p(1-p) = \frac{p(1-p)}{n}\]

The standard error is the square root: \(\text{SE}(\hat{p}) = \sqrt{p(1-p)/n}\). Since \(\hat{\pi} = 4\hat{p}\), the standard error of \(\hat{\pi}\) scales by 4:

\[\text{SE}(\hat{\pi}) = 4 \cdot \sqrt{\frac{p(1-p)}{n}}\]

We don’t know the true \(p\), so we substitute our estimate \(\hat{p}\) to get a usable formula. This Bernoulli structure is a special case of the general Monte Carlo variance formula—whenever the function \(h(x)\) is an indicator (0 or 1), the variance simplifies to the familiar \(p(1-p)\) form.

Common Pitfall ⚠️ Wald Confidence Intervals for Proportions

The normal approximation CI \(\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\) (called the Wald interval) can perform poorly when \(p\) is near 0 or 1, or when \(n\) is small. For rare events, consider alternatives like the Wilson score interval, Agresti-Coull interval, or exact binomial methods. For the \(\pi\) estimation example where \(p \approx 0.785\), the Wald interval is adequate.
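For reference, here is a minimal sketch of the Wilson score interval, applied to the proportion \(\hat{p} \approx 0.786\) from the \(\pi\) example above (the helper function is ours, not a library call):

import numpy as np

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Proportion of points inside the circle from the example above
p_lo, p_hi = wilson_interval(0.78577, 100_000)
print(f"Wilson 95% CI for pi: ({4*p_lo:.6f}, {4*p_hi:.6f})")

For \(p\) near 0.5 and large \(n\), the Wilson and Wald intervals are nearly identical; the difference matters mostly for rare events.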

This is Monte Carlo at its most basic: evaluate a simple function at random points and average. But even this toy example illustrates the key features of the method—ease of implementation, probabilistic error bounds, and graceful scaling with sample size.

Theoretical Foundations

Why does the Monte Carlo method work? What determines the rate of convergence? These questions have precise mathematical answers rooted in classical probability theory.

Key Assumptions for Monte Carlo 📋

The theoretical guarantees for Monte Carlo integration require:

  1. Independence: Samples \(X_1, \ldots, X_n\) are independent (not merely uncorrelated)

  2. Identical distribution: All samples come from the same distribution \(f\)

  3. Finite first moment: \(\mathbb{E}[|h(X)|] < \infty\) for LLN convergence

  4. Finite second moment: \(\mathbb{E}[h(X)^2] < \infty\) for CLT and valid confidence intervals

When these assumptions fail—dependent samples (MCMC), infinite variance (heavy tails), misspecified sampling distribution—the standard theory must be modified. We address these issues in later sections and chapters.

The Law of Large Numbers

The Law of Large Numbers (LLN) is the foundational result guaranteeing that Monte Carlo estimators converge to the true value. There are several versions; we state the strongest form.

Theorem: Strong Law of Large Numbers

Let \(X_1, X_2, \ldots\) be independent and identically distributed random variables with \(\mathbb{E}[|X_1|] < \infty\). Then:

\[\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mathbb{E}[X_1] \quad \text{as } n \to \infty\]

The notation \(\xrightarrow{\text{a.s.}}\) denotes almost sure convergence: the probability that the sequence converges is exactly 1.

For Monte Carlo integration, we apply this theorem to the transformed variables \(Y_i = h(X_i)\), where the \(X_i\) are iid from density \(f\). The condition \(\mathbb{E}[|h(X)|] < \infty\) ensures that the integral we are estimating actually exists.

The LLN tells us that \(\hat{I}_n \to I\) with probability 1. No matter how complex the integrand, no matter how high the dimension, the Monte Carlo estimator will eventually get arbitrarily close to the true value. This is an extraordinarily powerful guarantee.

But the LLN is silent on how fast convergence occurs. For that, we need the Central Limit Theorem.

The Central Limit Theorem and the \(O(n^{-1/2})\) Rate

The Central Limit Theorem (CLT) is the workhorse result for quantifying Monte Carlo error.

Theorem: Central Limit Theorem

Let \(X_1, X_2, \ldots\) be iid with mean \(\mu\) and variance \(\sigma^2 < \infty\). Then:

\[\sqrt{n} \left( \bar{X}_n - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]

where \(\xrightarrow{d}\) denotes convergence in distribution. Equivalently, for large \(n\):

\[\bar{X}_n \stackrel{\cdot}{\sim} \mathcal{N}\left( \mu, \frac{\sigma^2}{n} \right)\]

Applied to Monte Carlo integration:

\[\hat{I}_n \stackrel{\cdot}{\sim} \mathcal{N}\left( I, \frac{\sigma^2}{n} \right)\]

where \(\sigma^2 = \text{Var}_f[h(X)] = \mathbb{E}_f[(h(X) - I)^2]\) is the variance of the integrand under the sampling distribution.

The standard error of the Monte Carlo estimator is:

(4)\[\text{SE}(\hat{I}_n) = \frac{\sigma}{\sqrt{n}}\]

This is the celebrated \(O(n^{-1/2})\) convergence rate. To reduce the standard error by a factor of 10, we need 100 times as many samples. To gain one decimal place of accuracy, we need 100 times the computational effort.

The rate is dimension-independent, but the constant may not be. Whether we integrate over \(\mathbb{R}\) or \(\mathbb{R}^{1000}\), the error decreases as \(1/\sqrt{n}\). This dimension-independence of the rate is Monte Carlo’s fundamental advantage. However, the variance \(\sigma^2 = \text{Var}[h(X)]\) can itself depend on dimension—poorly designed estimators may have variance that grows with \(d\), partially offsetting Monte Carlo’s advantage. This is why importance sampling, control variates, and careful problem formulation remain important even for Monte Carlo.

Example 💡 Understanding the Square Root Law

Scenario: You estimate an integral with 1,000 samples and get a standard error of 0.1. Your boss needs the error reduced to 0.01.

Analysis: The standard error scales as \(1/\sqrt{n}\). To reduce the standard error by a factor of 10, you need \(n\) to increase by a factor of \(10^2 = 100\).

Conclusion: You need \(1000 \times 100 = 100,000\) samples.

This quadratic penalty is the price of Monte Carlo's simplicity. In low dimensions, deterministic methods often achieve polynomial convergence rates like \(O(n^{-2})\) or better, making them far more efficient. But in high dimensions, Monte Carlo's dimension-independent \(O(n^{-1/2})\) rate wins: grid-based rates degrade to \(O(n^{-k/d})\), so their cost for a fixed accuracy grows exponentially with the dimension.

Why the Square Root?

The \(1/\sqrt{n}\) rate may seem mysterious, but it has a simple explanation rooted in the behavior of sums of random variables.

Three-panel visualization explaining Monte Carlo convergence: LLN guarantees convergence, CLT shows distributions tightening, and the O(n^{-1/2}) rate governs speed

Fig. 33 Monte Carlo Convergence: Theory and Practice. Left: The Law of Large Numbers in action—eight independent Monte Carlo trajectories estimating π all converge to the true value (red dashed line), with the 95% confidence band shrinking as n grows. Middle: The Central Limit Theorem visualized—sampling distributions of the estimator for n = 25, 100, 400, and 1600 show how estimates concentrate around the true value as sample size increases. Right: The \(O(n^{-1/2})\) convergence rate—standard error decreases as \(\sigma/\sqrt{n}\), meaning 100× more samples yield only 10× error reduction (one additional decimal of precision).

Consider \(n\) iid random variables \(X_1, \ldots, X_n\), each with variance \(\sigma^2\). Their sum has variance:

\[\text{Var}\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n \text{Var}(X_i) = n\sigma^2\]

The variance of the sum grows linearly with \(n\). But when we take the mean, we divide by \(n\):

\[\text{Var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]

The standard deviation is the square root of variance, giving \(\sigma/\sqrt{n}\).

This behavior is fundamental to averages of random quantities. Each additional sample adds information, but with diminishing returns: the first sample reduces uncertainty enormously; the millionth sample contributes almost nothing. This is why the square root appears.
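We can see this scaling empirically. The sketch below (replication counts and seed are our choices) repeats the \(\pi\) estimator many times at increasing sample sizes and reports the spread of the estimates:

import numpy as np

rng = np.random.default_rng(0)

def pi_hat(n):
    x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    return 4 * np.mean(x**2 + y**2 <= 1)

# Empirical SD of the estimator across 200 replications, for increasing n
for n in [100, 400, 1600, 6400]:
    estimates = [pi_hat(n) for _ in range(200)]
    print(f"n = {n:>5}:  empirical SD = {np.std(estimates):.4f}")
# Each 4x increase in n should roughly halve the SD—the 1/sqrt(n) law in action.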

Variance Estimation and Confidence Intervals

The CLT tells us that \(\hat{I}_n\) is approximately normal with known variance \(\sigma^2/n\). But we rarely know \(\sigma^2\)—it depends on the integrand and the sampling distribution. We must estimate it from the same samples we use to estimate \(I\).

The Sample Variance

The natural estimator of \(\sigma^2 = \text{Var}[h(X)]\) is the sample variance:

(5)\[\hat{\sigma}^2_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( h(X_i) - \hat{I}_n \right)^2\]

The divisor \(n-1\) (rather than \(n\)) makes this estimator unbiased: \(\mathbb{E}[\hat{\sigma}^2_n] = \sigma^2\). This is known as Bessel’s correction.

By the Law of Large Numbers, \(\hat{\sigma}^2_n \to \sigma^2\) almost surely as \(n \to \infty\). Combined with the CLT, this gives us a practical way to construct confidence intervals.

Constructing Confidence Intervals

An asymptotic \((1-\alpha)\) confidence interval for \(I\) is:

(6)\[\left[ \hat{I}_n - z_{\alpha/2} \frac{\hat{\sigma}_n}{\sqrt{n}}, \quad \hat{I}_n + z_{\alpha/2} \frac{\hat{\sigma}_n}{\sqrt{n}} \right]\]

where \(z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)\) is the standard normal quantile. For common confidence levels:

  • 90% CI: \(z_{0.05} \approx 1.645\)

  • 95% CI: \(z_{0.025} \approx 1.960\)

  • 99% CI: \(z_{0.005} \approx 2.576\)

The interval has the interpretation: in repeated sampling, approximately \((1-\alpha) \times 100\%\) of such intervals will contain the true value \(I\).

Standard error vs. confidence interval half-width: These are related but distinct concepts:

  • Standard error (SE): \(\widehat{\text{SE}} = \hat{\sigma}/\sqrt{n}\), a measure of estimation precision

  • 95% CI half-width: \(1.96 \times \widehat{\text{SE}}\), the margin of error for a specific confidence level

When planning sample sizes, be explicit about which quantity you are targeting. To achieve SE \(\leq \epsilon\), you need \(n \geq \sigma^2/\epsilon^2\). To achieve 95% CI half-width \(\leq \epsilon\), you need \(n \geq (1.96\sigma/\epsilon)^2 \approx 3.84\sigma^2/\epsilon^2\)—about 4× more samples.

import numpy as np
from scipy import stats

def monte_carlo_integrate(h, sampler, n_samples, confidence=0.95, seed=None):
    """
    Monte Carlo integration with uncertainty quantification.

    Parameters
    ----------
    h : callable
        Function to integrate. Must accept array input.
    sampler : callable
        Function that takes (n, rng) and returns n samples from the target distribution.
    n_samples : int
        Number of Monte Carlo samples.
    confidence : float
        Confidence level for interval (default 0.95).
    seed : int, optional
        Random seed for reproducibility.

    Returns
    -------
    dict
        Contains estimate, std_error, confidence interval, and diagnostics.
    """
    rng = np.random.default_rng(seed)

    # Generate samples and evaluate function
    samples = sampler(n_samples, rng)
    h_values = h(samples)

    # Point estimate
    estimate = np.mean(h_values)

    # Variance estimation (Bessel's correction)
    variance = np.var(h_values, ddof=1)
    std_error = np.sqrt(variance / n_samples)

    # Confidence interval
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    ci_lower = estimate - z * std_error
    ci_upper = estimate + z * std_error

    # Effective sample size (for future variance reduction comparisons)
    ess = n_samples  # For standard MC, ESS = n

    return {
        'estimate': estimate,
        'std_error': std_error,
        'variance': variance,
        'ci': (ci_lower, ci_upper),
        'confidence': confidence,
        'n_samples': n_samples,
        'ess': ess,
        'h_values': h_values
    }

Example 💡 Using the Monte Carlo Integration Function

Problem: Estimate \(\int_0^2 e^{-x^2} dx\) using our monte_carlo_integrate function.

Setup: We need to define: (1) the integrand \(h(x) = 2 e^{-x^2}\) (the factor of 2 accounts for the interval length), and (2) a sampler that generates uniform samples on \([0, 2]\).

import numpy as np
from scipy import stats
from scipy.special import erf

# Define the integrand (scaled by interval length)
def h(x):
    return 2 * np.exp(-x**2)

# Define the sampler: Uniform(0, 2)
def uniform_sampler(n, rng):
    return rng.uniform(0, 2, n)

# Run Monte Carlo integration
result = monte_carlo_integrate(
    h=h,
    sampler=uniform_sampler,
    n_samples=100_000,
    confidence=0.95,
    seed=42
)

# True value for comparison
true_value = np.sqrt(np.pi) / 2 * erf(2)

print(f"Estimate:    {result['estimate']:.6f}")
print(f"True value:  {true_value:.6f}")
print(f"Std Error:   {result['std_error']:.6f}")
print(f"95% CI:      ({result['ci'][0]:.6f}, {result['ci'][1]:.6f})")
print(f"CI contains true value: {result['ci'][0] <= true_value <= result['ci'][1]}")

Output:

Estimate:    0.880204
True value:  0.882081
Std Error:   0.002179
95% CI:      (0.875934, 0.884474)
CI contains true value: True

The function returns all the diagnostics we need: the point estimate, standard error for assessing precision, and a confidence interval that correctly captures the true value. The h_values array can be passed to convergence diagnostics for further analysis.

Numerical Stability: Welford’s Algorithm

Computing the sample variance naively using the one-pass formula \(\frac{1}{n-1}\left(\sum h_i^2 - \frac{(\sum h_i)^2}{n}\right)\) can suffer catastrophic cancellation when the mean is large compared to the standard deviation. The two terms \(\sum h_i^2\) and \(\frac{(\sum h_i)^2}{n}\) may be nearly equal, and their difference may lose many significant digits.

The Problem Illustrated: Suppose we have data with mean \(\mu = 10^9\) and standard deviation \(\sigma = 1\). Then \(\sum x_i^2 \approx n \cdot 10^{18}\) and \((\sum x_i)^2/n \approx n \cdot 10^{18}\) as well. Their difference should be approximately \(n \cdot \sigma^2 = n\), but when subtracting two numbers of size \(10^{18}\) that agree in their first 16-17 digits, we lose almost all precision in 64-bit floating point arithmetic (which has about 15-16 significant decimal digits).

Welford’s Insight: Instead of computing variance from \(\sum x_i^2\) and \(\sum x_i\), we can maintain the sum of squared deviations from the current running mean. As each new observation arrives, we update both the mean and the sum of squared deviations using a clever algebraic identity.

Let \(\bar{x}_n\) denote the mean of the first \(n\) observations, and let \(M_{2,n} = \sum_{i=1}^n (x_i - \bar{x}_n)^2\) denote the sum of squared deviations from this mean. Welford showed that these can be updated incrementally:

\[\begin{split}\bar{x}_n &= \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n} \\ M_{2,n} &= M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)\end{split}\]

The key insight is that \((x_n - \bar{x}_{n-1})\) and \((x_n - \bar{x}_n)\) are both small numbers (deviations from means), so their product is numerically stable. We never subtract two large, nearly-equal quantities.

The sample variance is then simply \(s^2 = M_{2,n} / (n-1)\).

import numpy as np

class WelfordAccumulator:
    """
    Online algorithm for computing mean and variance in a single pass.

    This is a true streaming algorithm: we never store the data,
    only the running statistics. Memory usage is O(1) regardless
    of how many values we process.
    """
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0  # Sum of squared deviations from current mean

    def update(self, x):
        """Process a single new observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean  # Note: uses UPDATED mean
        self.M2 += delta * delta2

    @property
    def variance(self):
        """Sample variance (with Bessel's correction)."""
        if self.n < 2:
            return float('nan')
        return self.M2 / (self.n - 1)

    @property
    def std(self):
        """Sample standard deviation."""
        return np.sqrt(self.variance)

# Demonstrate the numerical stability issue
import numpy as np

rng = np.random.default_rng(5678)
large_mean_data = 1e9 + rng.standard_normal(10000)  # Mean ≈ 10⁹, SD ≈ 1

# Welford's algorithm (stable)
acc = WelfordAccumulator()
for x in large_mean_data:
    acc.update(x)

# Naive one-pass formula (UNSTABLE)
n = len(large_mean_data)
sum_sq = np.sum(large_mean_data**2)
sum_x = np.sum(large_mean_data)
naive_var = (sum_sq - sum_x**2 / n) / (n - 1)

# NumPy's implementation (also stable)
numpy_var = np.var(large_mean_data, ddof=1)

print(f"True variance (approx): 1.0")
print(f"Welford (stable):       {acc.variance:.6f}")
print(f"NumPy (stable):         {numpy_var:.6f}")
print(f"Naive one-pass:         {naive_var:.2f}  <- likely wrong!")

The naive formula can give wildly incorrect results (sometimes off by orders of magnitude, sometimes even negative!) due to catastrophic cancellation. Both Welford and NumPy give the correct answer. The exact value of the naive result depends on hardware, compiler optimizations, and BLAS implementations—the point is that it fails catastrophically, not the specific wrong value it produces.

In practice, NumPy’s np.var uses numerically stable algorithms, so you rarely need to implement Welford’s algorithm yourself. But understanding why stability matters is important for several reasons:

  1. Debugging unexpected results: If you’re computing variances in a custom loop or in a language without stable built-ins, you may encounter this issue.

  2. Streaming data: Welford’s algorithm processes data in a single pass, making it ideal for streaming applications where you can’t store all values in memory.

  3. Parallel computation: The algorithm can be extended to combine statistics from separate batches (useful for distributed computing)—see the sketch at the end of this subsection.

  4. Understanding Monte Carlo diagnostics: Running variance calculations in convergence diagnostics use similar techniques.

The key lesson: when the mean is large relative to the standard deviation, naive variance formulas fail catastrophically. Always use stable algorithms.
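As noted in item 3 above, batches can be merged exactly without revisiting the raw data. Here is a minimal sketch of the standard combination rule (often attributed to Chan, Golub, and LeVeque), written against the WelfordAccumulator class defined earlier:

def merge_welford(a, b):
    """Combine two WelfordAccumulator objects into a new one (exact, no raw data needed)."""
    merged = WelfordAccumulator()
    merged.n = a.n + b.n
    if merged.n == 0:
        return merged
    delta = b.mean - a.mean
    merged.mean = a.mean + delta * b.n / merged.n
    # Parallel-variance rule: within-batch M2 terms add, plus a between-batch correction
    merged.M2 = a.M2 + b.M2 + delta**2 * a.n * b.n / merged.n
    return merged

Merging the accumulators from two halves of a data stream reproduces the single-pass mean and variance exactly, which is what makes the method suitable for distributed computation.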

Worked Examples

We now work through several examples of increasing complexity, illustrating the breadth of Monte Carlo applications.

Example 1: The Gaussian Integral

The integral \(\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}\) is famous for being impossible to evaluate in closed form using elementary antiderivatives, yet having a beautiful exact answer. Let us estimate it via Monte Carlo.

Challenge: The domain is infinite, so we cannot sample uniformly.

Solution: Recognize the integrand as proportional to a Gaussian density. Recall our notation convention: \(\mathcal{N}(\mu, \sigma^2)\) denotes a normal distribution with mean \(\mu\) and variance \(\sigma^2\).

If \(X \sim \mathcal{N}(0, 1/2)\) (variance = 1/2, so standard deviation = \(1/\sqrt{2}\)), then:

\[f(x) = \frac{1}{\sqrt{2\pi \cdot (1/2)}} \exp\left(-\frac{x^2}{2 \cdot (1/2)}\right) = \frac{1}{\sqrt{\pi}} e^{-x^2}\]

This is proportional to our integrand \(e^{-x^2}\), with the normalizing constant being exactly what we want to compute. We can verify:

\[\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} \int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}} e^{-x^2} dx = \sqrt{\pi} \cdot 1 = \sqrt{\pi}\]

This derivation shows the integral equals \(\sqrt{\pi}\) exactly. But let’s also verify by Monte Carlo using importance sampling.

Monte Carlo approach: Sample from \(\mathcal{N}(0, 1)\) (standard normal) and reweight:

import numpy as np
from scipy import stats

def gaussian_integral_mc(n_samples, seed=42):
    """
    Estimate ∫ exp(-x²) dx via Monte Carlo importance sampling.

    We sample from N(0, 1) and use:
    ∫ exp(-x²) dx = ∫ [exp(-x²) / φ(x)] φ(x) dx
    where φ(x) = exp(-x²/2) / √(2π) is the standard normal density.

    The ratio: exp(-x²) / φ(x) = √(2π) · exp(-x²/2)
    """
    rng = np.random.default_rng(seed)

    # Sample from N(0, 1)
    X = rng.standard_normal(n_samples)

    # Importance weight: exp(-x²) / φ(x) = √(2π) · exp(-x²/2)
    h_values = np.sqrt(2 * np.pi) * np.exp(-X**2 / 2)

    estimate = np.mean(h_values)
    std_error = np.std(h_values, ddof=1) / np.sqrt(n_samples)

    return estimate, std_error

est, se = gaussian_integral_mc(100_000)
print(f"Estimate: {est:.6f} ± {se:.6f}")
print(f"True value: {np.sqrt(np.pi):.6f}")
Estimate: 1.770751 ± 0.002214
True value: 1.772454

The estimate is very close to \(\sqrt{\pi} \approx 1.7725\).

Example 2: Probability of a Rare Event

Suppose \(X \sim \mathcal{N}(0, 1)\). What is \(P(X > 4)\)?

From standard normal tables, \(P(X > 4) = 1 - \Phi(4) \approx 3.167 \times 10^{-5}\). This is a rare event—only about 3 in 100,000 standard normal draws exceed 4.

Naive Monte Carlo:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000
X = rng.standard_normal(n)
p_hat = np.mean(X > 4)
se = np.sqrt(p_hat * (1 - p_hat) / n)

print(f"Estimate: {p_hat:.6e}")
print(f"Std Error: {se:.6e}")
print(f"True value: {1 - stats.norm.cdf(4):.6e}")
Estimate: 9.000000e-05
Std Error: 2.999865e-05
True value: 3.167124e-05

With 100,000 samples, we observed 9 exceedances (instead of the expected ~3), giving an estimate nearly 3× too large! This illustrates the high variability inherent in estimating rare events. The relative error (standard error divided by estimate) is enormous.

Problem: To estimate a probability \(p\), the standard error is \(\sqrt{p(1-p)/n} \approx \sqrt{p/n}\) for small \(p\). The relative error is \(\sqrt{(1-p)/(np)} \approx 1/\sqrt{np}\). To achieve 10% relative error for \(p = 10^{-5}\), we need \(n \approx 100/p = 10^7\) samples.

This motivates importance sampling (covered in a later section), which generates samples preferentially in the region of interest. For now, the lesson is that naive Monte Carlo struggles with rare events.
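As a preview of that idea, the sketch below shifts the sampling distribution to \(\mathcal{N}(4, 1)\), so that roughly half of the draws land beyond 4, and reweights each draw by the density ratio \(\varphi(x)/q(x) = e^{8 - 4x}\); the choice of proposal here is ours, purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000

# Proposal q = N(4, 1): about half the samples exceed 4
X = rng.normal(loc=4.0, scale=1.0, size=n)

# Weight = phi(x) / q(x) = exp(-x**2/2) / exp(-(x-4)**2/2) = exp(8 - 4x)
weights = np.exp(8 - 4 * X) * (X > 4)

p_hat = np.mean(weights)
se = np.std(weights, ddof=1) / np.sqrt(n)

print(f"IS estimate: {p_hat:.6e} ± {se:.6e}")
print(f"True value:  {1 - stats.norm.cdf(4):.6e}")

With the same 100,000 samples, the reweighted estimator achieves a far smaller relative error than the naive indicator average, because almost every sample now contributes information about the tail.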

Example 3: A High-Dimensional Integral

Consider the integral:

\[I = \int_{[0,1]^d} \prod_{j=1}^{d} \left( 1 + \frac{x_j}{d} \right) dx_1 \cdots dx_d\]

This integral is analytically tractable for any \(d\), which gives us a known answer to check against. The key observation is that the integrand factors into a product of functions, each depending on only one variable:

\[\prod_{j=1}^{d} \left( 1 + \frac{x_j}{d} \right) = \underbrace{\left(1 + \frac{x_1}{d}\right)}_{f_1(x_1)} \cdot \underbrace{\left(1 + \frac{x_2}{d}\right)}_{f_2(x_2)} \cdots \underbrace{\left(1 + \frac{x_d}{d}\right)}_{f_d(x_d)}\]

When an integrand factors this way, Fubini’s theorem tells us the multiple integral equals the product of single integrals:

\[\int_{[0,1]^d} f_1(x_1) \cdot f_2(x_2) \cdots f_d(x_d) \, dx_1 \cdots dx_d = \left(\int_0^1 f_1(x_1) \, dx_1\right) \cdot \left(\int_0^1 f_2(x_2) \, dx_2\right) \cdots \left(\int_0^1 f_d(x_d) \, dx_d\right)\]

This is analogous to how \(\sum_{i,j} a_i b_j = (\sum_i a_i)(\sum_j b_j)\) for sums. The factorization works because each \(x_j\) integrates independently over its own copy of \([0,1]\).

Applying this to our integrand, each single integral evaluates to:

\[\int_0^1 \left(1 + \frac{x_j}{d}\right) dx_j = \left[x_j + \frac{x_j^2}{2d}\right]_0^1 = 1 + \frac{1}{2d}\]

Since all \(d\) integrals are identical, we get:

\[I = \prod_{j=1}^{d} \int_0^1 \left( 1 + \frac{x_j}{d} \right) dx_j = \left( 1 + \frac{1}{2d} \right)^d \xrightarrow{d \to \infty} \sqrt{e}\]

The limit \(\sqrt{e}\) follows from the definition of \(e\) as \(\lim_{n \to \infty} (1 + 1/n)^n\). To see this, write:

\[\left(1 + \frac{1}{2d}\right)^d = \left[\left(1 + \frac{1}{2d}\right)^{2d}\right]^{1/2} \xrightarrow{d \to \infty} e^{1/2} = \sqrt{e}\]

Note

We chose this particular integrand because it factors nicely, giving us a known true value to verify our Monte Carlo estimates. Most high-dimensional integrands do not factor this way—the variables interact in complex ways—which is precisely why Monte Carlo methods are essential. When there’s no analytical solution, Monte Carlo may be the only practical approach.

Three-panel visualization of high-dimensional integral showing true value convergence, error distributions, and parallel convergence rates

Fig. 34 High-Dimensional Integration. Left: The true integral value approaches \(\sqrt{e}\) as dimension increases. Middle: At fixed sample size \(n = 10{,}000\), error spread actually decreases with dimension for this integrand—each factor \((1 + x_j/d)\) becomes closer to 1 as \(d\) grows, reducing variance. This is integrand-specific, not a general property. Right: The key insight is that the convergence rate \(O(n^{-1/2})\) is identical across all dimensions—the curves are parallel on the log-log plot. The constant (vertical offset) may vary, but the slope does not. This dimension-independent rate is Monte Carlo’s superpower.

Let us estimate this integral in \(d = 100\) dimensions:

import numpy as np

def high_dim_integrand(X):
    """
    Compute product integrand: ∏(1 + x_j/d).

    Parameters
    ----------
    X : ndarray of shape (n_samples, d)
        Sample points in [0,1]^d.

    Returns
    -------
    ndarray of shape (n_samples,)
        Function values.

    Note: For very large d (>1000), consider using
    np.exp(np.sum(np.log1p(X/d), axis=1)) for better numerical stability.
    """
    d = X.shape[1]
    return np.prod(1 + X / d, axis=1)

def mc_high_dim(d, n_samples, seed=42):
    """Monte Carlo integration in d dimensions."""
    rng = np.random.default_rng(seed)

    # Uniform samples in [0,1]^d
    X = rng.random((n_samples, d))

    # Evaluate integrand
    h_values = high_dim_integrand(X)

    estimate = np.mean(h_values)
    std_error = np.std(h_values, ddof=1) / np.sqrt(n_samples)

    # True value (for comparison)
    true_value = (1 + 0.5/d)**d

    return estimate, std_error, true_value

# Estimate in 100 dimensions
d = 100
est, se, truth = mc_high_dim(d, 100_000)
print(f"d = {d}")
print(f"Estimate: {est:.6f} ± {se:.6f}")
print(f"True value: {truth:.6f}")
print(f"Error: {abs(est - truth):.6f}")
d = 100
Estimate: 1.646530 ± 0.000149
True value: 1.646668
Error: 0.000138

The Monte Carlo estimate converges as \(O(n^{-1/2})\) regardless of \(d\). Try increasing \(d\) to 1000 or 10,000—the convergence rate remains unchanged.

Example 4: Bayesian Posterior Mean

Bayesian inference often requires computing posterior expectations:

\[\mathbb{E}[\theta | \text{data}] = \int \theta \cdot \pi(\theta | \text{data}) \, d\theta\]

When the posterior \(\pi(\theta | \text{data})\) is available (perhaps up to a normalizing constant), Monte Carlo integration applies directly—if we can sample from the posterior. This is the motivation for Markov chain Monte Carlo methods in Part 3.

As a simple example, suppose we observe \(x = 7\) successes in \(n = 10\) Bernoulli trials with unknown success probability \(\theta\). With a \(\text{Beta}(1, 1)\) (uniform) prior, the posterior is \(\text{Beta}(8, 4)\):

import numpy as np
from scipy import stats

# Posterior is Beta(8, 4)
alpha, beta = 8, 4
posterior = stats.beta(alpha, beta)

# True posterior mean
true_mean = alpha / (alpha + beta)

# Monte Carlo estimate
rng = np.random.default_rng(42)
n_samples = 10_000
theta_samples = posterior.rvs(n_samples, random_state=rng)

mc_mean = np.mean(theta_samples)
mc_se = np.std(theta_samples, ddof=1) / np.sqrt(n_samples)

print(f"True posterior mean: {true_mean:.6f}")
print(f"MC estimate: {mc_mean:.6f} ± {mc_se:.6f}")

# We can also estimate posterior quantiles, variance, etc.
print(f"Posterior median (MC): {np.median(theta_samples):.6f}")
print(f"Posterior median (exact): {posterior.ppf(0.5):.6f}")
True posterior mean: 0.666667
MC estimate: 0.666200 ± 0.001320
Posterior median (MC): 0.675577
Posterior median (exact): 0.676196

The Monte Carlo estimates closely match the exact values. In more complex Bayesian models where the posterior has no closed form, Monte Carlo (via MCMC) becomes essential.

Example 5: The Normal CDF

The cumulative distribution function of the standard normal, \(\Phi(t) = P(Z \leq t)\) for \(Z \sim \mathcal{N}(0, 1)\), has no closed-form expression. Yet it is one of the most important functions in statistics. Let us estimate \(\Phi(1.96)\):

import numpy as np
from scipy import stats

def estimate_normal_cdf(t, n_samples, seed=42):
    """Estimate Φ(t) = P(Z ≤ t) for Z ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_samples)

    # Indicator function: 1 if Z ≤ t, else 0
    indicators = Z <= t

    p_hat = np.mean(indicators)
    se = np.sqrt(p_hat * (1 - p_hat) / n_samples)

    return p_hat, se

t = 1.96
est, se = estimate_normal_cdf(t, 1_000_000)
true_val = stats.norm.cdf(t)

print(f"Φ({t}) estimate: {est:.6f} ± {se:.6f}")
print(f"Φ({t}) true:     {true_val:.6f}")
Φ(1.96) estimate: 0.974928 ± 0.000156
Φ(1.96) true:     0.975002

With one million samples, the estimate is accurate to about four decimal places. For extreme quantiles (e.g., \(t = 5\)), the probability is so small that accurate estimation requires importance sampling.

Comparison with Deterministic Methods

Monte Carlo integration is not the only way to compute integrals numerically. Deterministic quadrature methods—the trapezoidal rule, Simpson’s rule, Gaussian quadrature—have been studied for centuries and, in low dimensions, often outperform Monte Carlo. Understanding when to use which approach is essential for the computational practitioner.

One-Dimensional Quadrature

For a one-dimensional integral \(\int_a^b f(x) dx\), deterministic methods exploit the smoothness of \(f\) to achieve rapid convergence. The core idea is to approximate the integral as a weighted sum of function values at selected points:

\[\int_a^b f(x) dx \approx \sum_{i=1}^{n} w_i f(x_i)\]

The methods differ in how they choose the points \(x_i\) and weights \(w_i\).

Trapezoidal Rule: Approximate the integrand by piecewise linear functions connecting adjacent points. For \(n\) equally spaced points \(x_0, x_1, \ldots, x_{n-1}\) with spacing \(h = (b-a)/(n-1)\):

\[\int_a^b f(x) dx \approx h \left[ \frac{f(x_0)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-2}) + \frac{f(x_{n-1})}{2} \right]\]

Geometrically, this approximates the area under the curve by a series of trapezoids. The error is \(O(h^2) = O(n^{-2})\) for twice-differentiable \(f\)—doubling the number of points reduces the error by a factor of 4.

Simpson’s Rule: Approximate by piecewise quadratic (parabolic) curves through consecutive triples of points. For an odd number of equally spaced points:

\[\int_a^b f(x) dx \approx \frac{h}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-2}) + f(x_{n-1}) \right]\]

The alternating pattern of coefficients (1, 4, 2, 4, 2, …, 4, 1) arises from fitting parabolas through each group of three points. The error is \(O(h^4) = O(n^{-4})\) for sufficiently smooth \(f\)—doubling the points reduces error by a factor of 16.

Gaussian Quadrature: Rather than using equally spaced points, choose both points \(x_i\) and weights \(w_i\) to maximize accuracy. With \(n\) optimally chosen points, Gaussian quadrature integrates polynomials of degree up to \(2n-1\) exactly. For analytic functions, convergence can be exponentially fast—far better than any fixed polynomial rate.

The optimal points turn out to be roots of orthogonal polynomials (Legendre polynomials for integration on \([-1, 1]\)). SciPy's scipy.integrate.quad wraps the Fortran library QUADPACK, which uses adaptive Gauss–Kronrod quadrature internally.
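In practice you rarely implement these rules by hand. A minimal call to SciPy's adaptive routine for the integrand used in the comparison below returns both an estimate and an error bound:

import numpy as np
from scipy import integrate

# Adaptive quadrature for ∫_0^2 exp(-x²) dx; returns (estimate, error bound)
value, abserr = integrate.quad(lambda x: np.exp(-x**2), 0, 2)
print(f"quad estimate: {value:.12f}  (reported error bound: {abserr:.1e})")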

The geometric difference between these approaches is illuminating:

Six-panel comparison of grid-based quadrature versus Monte Carlo sampling in 1D and 2D

Fig. 35 Grid vs. Monte Carlo Sampling. Top row: In 1D, grid sampling (left) places evaluation points at regular intervals, while Monte Carlo (middle) uses random points. Quadrature wins decisively in 1D (right)—this is expected and correct. Bottom row: In 2D, a 10×10 grid uses 100 points in a regular lattice (left), while Monte Carlo distributes 100 points without structure (middle). The crucial insight (right): grid cost grows as \(m^d\) while Monte Carlo is dimension-independent; the crossover occurs around \(d \approx 3\text{--}5\), after which Monte Carlo wins.

Compare these to Monte Carlo’s \(O(n^{-1/2})\). In one dimension, Monte Carlo loses badly:

import numpy as np
from scipy import integrate

# Integrand: exp(-x²) on [0, 2]
def f(x):
    return np.exp(-x**2)

# True value (via error function)
from scipy.special import erf
true_value = np.sqrt(np.pi) / 2 * erf(2)

# Monte Carlo
def mc_estimate(n, seed=42):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 2, n)
    return 2 * np.mean(f(x))  # Multiply by interval length

# Trapezoidal rule
def trap_estimate(n):
    x = np.linspace(0, 2, n)
    return np.trapz(f(x), x)

# Simpson's rule
def simp_estimate(n):
    x = np.linspace(0, 2, n)
    return integrate.simpson(f(x), x=x)

print(f"True value: {true_value:.10f}\n")

for n in [10, 100, 1000]:
    mc = mc_estimate(n)
    tr = trap_estimate(n)
    si = simp_estimate(n)
    print(f"n = {n}")
    print(f"  Monte Carlo: {mc:.10f}  (error {abs(mc - true_value):.2e})")
    print(f"  Trapezoidal: {tr:.10f}  (error {abs(tr - true_value):.2e})")
    print(f"  Simpson:     {si:.10f}  (error {abs(si - true_value):.2e})")
    print()
True value: 0.8820813908

n = 10
  Monte Carlo: 0.6600529708  (error 2.22e-01)
  Trapezoidal: 0.8817823811  (error 2.99e-04)
  Simpson:     0.8819697523  (error 1.12e-04)

n = 100
  Monte Carlo: 0.8973426558  (error 1.53e-02)
  Trapezoidal: 0.8820788993  (error 2.49e-06)
  Simpson:     0.8820813848  (error 5.96e-09)

n = 1000
  Monte Carlo: 0.8906166325  (error 8.54e-03)
  Trapezoidal: 0.8820813663  (error 2.45e-08)
  Simpson:     0.8820813908  (error 5.58e-13)

With \(n = 1000\) points, Simpson’s rule achieves machine precision (\(10^{-13}\)) while Monte Carlo’s error is still around \(10^{-2}\). In one dimension, there is no contest.

The Curse of Dimensionality

The situation reverses dramatically in high dimensions. Consider integrating over \([0, 1]^d\). A deterministic method using a grid with \(m\) points per dimension requires \(m^d\) total evaluations.

To build geometric intuition for why high dimensions are so challenging, consider a simple question: what fraction of a hypercube’s volume lies within the inscribed hypersphere?

Volume formulas. Consider the hypercube \([-1, 1]^d\) with side length 2 and the unit hypersphere \(\{x : \|x\| \leq 1\}\) inscribed within it.

The hypercube volume is simply:

\[V_{\text{cube}}(d) = 2^d\]

The hypersphere volume in \(d\) dimensions is:

\[V_{\text{sphere}}(d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\]

where \(\Gamma\) is the gamma function. For positive integers, \(\Gamma(n+1) = n!\). For half-integers, \(\Gamma(1/2) = \sqrt{\pi}\) and \(\Gamma(n + 1/2) = \frac{(2n-1)!!}{2^n}\sqrt{\pi}\) where \(!!\) denotes the double factorial. For practical computation, use scipy.special.gamma.

This formula gives familiar results: \(V_2 = \pi\) (area of unit circle), \(V_3 = \frac{4\pi}{3}\) (volume of unit sphere). For higher dimensions: \(V_4 = \frac{\pi^2}{2}\), \(V_5 = \frac{8\pi^2}{15}\), and so on.

The ratio of sphere volume to cube volume is:

\[\frac{V_{\text{sphere}}(d)}{V_{\text{cube}}(d)} = \frac{\pi^{d/2}}{2^d \, \Gamma(d/2 + 1)}\]

Let’s compute this ratio for several dimensions:

import numpy as np
from scipy.special import gamma

def hypersphere_volume(d, r=1):
    """Volume of d-dimensional hypersphere with radius r."""
    return (np.pi ** (d/2)) / gamma(d/2 + 1) * r**d

def hypercube_volume(d, side=2):
    """Volume of d-dimensional hypercube with given side length."""
    return side ** d

print(f"{'d':>3}  {'Cube':>12}  {'Sphere':>12}  {'Ratio':>12}  {'% in sphere':>12}")
print("-" * 58)

for d in [1, 2, 3, 5, 10, 20, 50, 100]:
    cube_vol = hypercube_volume(d)
    sphere_vol = hypersphere_volume(d)
    ratio = sphere_vol / cube_vol
    print(f"{d:>3}  {cube_vol:>12.2e}  {sphere_vol:>12.4e}  {ratio:>12.4e}  {100*ratio:>11.2e}%")
  d          Cube        Sphere         Ratio   % in sphere
----------------------------------------------------------
  1      2.00e+00    2.0000e+00    1.0000e+00     1.00e+02%
  2      4.00e+00    3.1416e+00    7.8540e-01     7.85e+01%
  3      8.00e+00    4.1888e+00    5.2360e-01     5.24e+01%
  5      3.20e+01    5.2638e+00    1.6449e-01     1.64e+01%
 10      1.02e+03    2.5502e+00    2.4904e-03     2.49e-01%
 20      1.05e+06    2.5807e-02    2.4611e-08     2.46e-06%
 50      1.13e+15    1.7302e-13    1.5367e-28     1.54e-26%
100      1.27e+30    2.3682e-40    1.8682e-70     1.87e-68%

The numbers are striking: in 10 dimensions, only about 0.25% of the cube lies in the sphere. By 20 dimensions, the fraction is about \(2.5 \times 10^{-6}\) percent. By 100 dimensions, the ratio is about \(10^{-70}\)—essentially zero.

What does this mean for integration? If you’re trying to integrate a function that’s concentrated near the origin (like a multivariate Gaussian), random uniform samples over the hypercube will almost never land where the function is large. This is why importance sampling becomes essential in high dimensions.

Three-panel visualization of the curse of dimensionality showing volume ratio decay, shell concentration, and grid point explosion

Fig. 36 The Curse of Dimensionality. Left: The inscribed hypersphere occupies a vanishing fraction of the hypercube as dimension increases—by \(d = 20\), less than \(10^{-7}\) of the cube’s volume lies within the sphere. Middle: In high dimensions, nearly all the volume of a ball concentrates in a thin shell near its surface. Right: Grid points required for deterministic methods explode exponentially—with just 10 points per dimension, a 20-dimensional integral requires \(10^{20}\) evaluations.

For Simpson’s rule, the error with \(m\) points per dimension is \(O(h^4) = O(m^{-4})\), while the cost is \(m^d\) evaluations. If we want error \(\epsilon\), we need \(m \sim \epsilon^{-1/4}\), giving cost \(\sim \epsilon^{-d/4}\).

For Monte Carlo with error \(O(n^{-1/2})\), achieving error \(\epsilon\) requires \(n \sim \epsilon^{-2}\) samples, independent of \(d\).

The crossover occurs roughly when \(\epsilon^{-d/4} = \epsilon^{-2}\), i.e., \(d = 8\). For \(d > 8\), Monte Carlo requires fewer function evaluations than Simpson’s rule to achieve the same accuracy—and the advantage grows exponentially with dimension.

Caveat: This analysis ignores constants, which can favor either method in specific cases. For particular integrands with special structure, deterministic methods may retain advantages even in moderate dimensions. The fundamental message, however, is robust: in high dimensions, Monte Carlo wins because its rate doesn’t degrade with dimension, while grid methods’ costs explode exponentially.
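As a rough numerical illustration of this crossover (constants ignored, as the caveat notes), the sketch below tabulates the nominal cost \(\epsilon^{-d/4}\) for Simpson’s rule against the dimension-independent \(\epsilon^{-2}\) for Monte Carlo; the target error \(\epsilon = 10^{-2}\) is an arbitrary choice for illustration.

# Nominal function evaluations needed for target error eps (constants ignored):
#   Simpson's rule : ~ eps**(-d/4)
#   Monte Carlo    : ~ eps**(-2), independent of d
eps = 1e-2  # illustrative target error

print(f"{'d':>4} {'Simpson ~ eps^(-d/4)':>22} {'Monte Carlo ~ eps^(-2)':>24}")
for d in [2, 4, 8, 12, 20]:
    print(f"{d:>4} {eps**(-d/4):>22.2e} {eps**(-2):>24.2e}")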

Table 15 Comparison of Integration Methods

Method              1D Convergence             \(d\)-D Convergence         Best For
------------------  -------------------------  --------------------------  -------------------------------------
Monte Carlo         \(O(n^{-1/2})\)            \(O(n^{-1/2})\)             High dimensions, complex domains
Trapezoidal         \(O(n^{-2})\)              \(O(n^{-2/d})\)             Low-dim smooth functions
Simpson             \(O(n^{-4})\)              \(O(n^{-4/d})\)             Low-dim very smooth functions
Gaussian            \(O(e^{-cn})\)             \(O(e^{-cn^{1/d}})\)        Low-dim analytic functions
Quasi-Monte Carlo   \(O(n^{-1}(\log n)^d)\)    \(O(n^{-1}(\log n)^d)\)     Moderate dimensions, smooth functions

Quasi-Monte Carlo Methods

A middle ground between deterministic quadrature and Monte Carlo is provided by quasi-Monte Carlo (QMC) methods. Instead of random samples, QMC uses carefully constructed low-discrepancy sequences—deterministic sequences that fill space more uniformly than random points.

Famous examples include Halton sequences, Sobol sequences, and lattice rules. Under smoothness conditions on the integrand, QMC achieves convergence rates of \(O(n^{-1} (\log n)^d)\), faster than Monte Carlo’s \(O(n^{-1/2})\) but with a dependence on dimension.

QMC is increasingly popular in computational finance and computer graphics. However, it requires more care: the sequences must match the problem structure, variance estimation is trickier, and the smoothness assumptions may fail. For general-purpose integration, especially with non-smooth or high-variance integrands, standard Monte Carlo remains the most robust choice.
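As a small illustration (assuming a SciPy version that provides scipy.stats.qmc; the integrand, dimension, and sample size are illustrative choices), the sketch below compares plain Monte Carlo with a scrambled Sobol sequence on a smooth separable integrand; on such integrands QMC typically achieves a noticeably smaller error at the same number of points.

import numpy as np
from scipy.special import erf
from scipy.stats import qmc

# Smooth separable test integrand over [0, 1]^d:
#   f(x) = prod_j exp(-x_j^2),  true value = (sqrt(pi)/2 * erf(1))**d
d = 5
true_value = (np.sqrt(np.pi) / 2 * erf(1)) ** d

def f(X):
    return np.prod(np.exp(-X**2), axis=1)

n_log2 = 12                                   # 2**12 = 4096 points
rng = np.random.default_rng(0)

mc_est = np.mean(f(rng.random((2**n_log2, d))))      # plain Monte Carlo

sobol = qmc.Sobol(d=d, scramble=True, seed=0)        # quasi-Monte Carlo
qmc_est = np.mean(f(sobol.random_base2(m=n_log2)))

print(f"True value : {true_value:.6f}")
print(f"MC error   : {abs(mc_est - true_value):.2e}")
print(f"QMC error  : {abs(qmc_est - true_value):.2e}")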

Sample Size Determination

A critical practical question is: how many samples do I need? The answer depends on the desired precision, the variance of the integrand, and the acceptable probability of error.

The Sample Size Formula

From the CLT, the Monte Carlo estimator is approximately:

\[\hat{I}_n \sim \mathcal{N}\left( I, \frac{\sigma^2}{n} \right)\]

For a target standard error \(\epsilon\), we need:

\[\frac{\sigma}{\sqrt{n}} \leq \epsilon \implies n \geq \frac{\sigma^2}{\epsilon^2}\]

For a 95% CI half-width of \(\epsilon\), we need:

\[1.96 \cdot \frac{\sigma}{\sqrt{n}} \leq \epsilon \implies n \geq \left( \frac{1.96 \cdot \sigma}{\epsilon} \right)^2 \approx \frac{3.84\sigma^2}{\epsilon^2}\]

The factor of 3.84 (approximately \(1.96^2\)) accounts for the confidence level: hitting the CI target requires roughly 3.84 times as many samples as hitting the SE target alone.

Be explicit about which you’re targeting: “SE ≤ 0.01” requires \(\sigma^2/0.01^2\) samples; “95% CI half-width ≤ 0.01” requires \(3.84\sigma^2/0.01^2\) samples.

Practical Sample Size Determination

In practice, \(\sigma\) is unknown. A common approach is:

  1. Pilot study: Run a small simulation (e.g., 1,000 samples) to estimate \(\hat{\sigma}\).

  2. Compute required sample size: Use the appropriate formula for your target.

  3. Run full simulation: Generate \(n\) samples (possibly in addition to the pilot).

  4. Verify: Check that the final standard error meets requirements.

import numpy as np
from scipy import stats

def determine_sample_size_for_se(h, sampler, target_se, pilot_n=1000, seed=42):
    """
    Determine required sample size for a target STANDARD ERROR.

    Uses formula: n = (σ / target_se)²

    Parameters
    ----------
    h : callable
        Function to integrate.
    sampler : callable
        Function that takes (n, rng) and returns samples.
    target_se : float
        Desired standard error (not CI half-width).
    pilot_n : int
        Pilot sample size for variance estimation.
    seed : int
        Random seed.

    Returns
    -------
    dict
        Contains required_n, estimated_variance, pilot results.
    """
    rng = np.random.default_rng(seed)

    # Pilot study
    pilot_samples = sampler(pilot_n, rng)
    pilot_h = h(pilot_samples)
    sigma_hat = np.std(pilot_h, ddof=1)

    # Required sample size for SE target
    required_n = int(np.ceil((sigma_hat / target_se)**2))

    return {
        'required_n': required_n,
        'estimated_sigma': sigma_hat,
        'pilot_n': pilot_n,
        'pilot_estimate': np.mean(pilot_h),
        'pilot_se': sigma_hat / np.sqrt(pilot_n),
        'target_type': 'standard_error'
    }

def determine_sample_size_for_ci(h, sampler, target_halfwidth,
                                  confidence=0.95, pilot_n=1000, seed=42):
    """
    Determine required sample size for a target CI HALF-WIDTH.

    Uses formula: n = (z * σ / target_halfwidth)²

    Parameters
    ----------
    target_halfwidth : float
        Desired confidence interval half-width.
    confidence : float
        Confidence level (default 0.95).
    """
    rng = np.random.default_rng(seed)

    # Pilot study
    pilot_samples = sampler(pilot_n, rng)
    pilot_h = h(pilot_samples)
    sigma_hat = np.std(pilot_h, ddof=1)

    # Critical value for confidence level
    z = stats.norm.ppf(1 - (1 - confidence) / 2)

    # Required sample size for CI half-width target
    required_n = int(np.ceil((z * sigma_hat / target_halfwidth)**2))

    return {
        'required_n': required_n,
        'estimated_sigma': sigma_hat,
        'z_value': z,
        'pilot_n': pilot_n,
        'target_type': f'{100*confidence:.0f}% CI half-width'
    }

# Example: Estimate E[X²] for X ~ N(0,1)
# True variance of X² is Var(X²) = E[X⁴] - (E[X²])² = 3 - 1 = 2, so σ = √2 ≈ 1.414

result_se = determine_sample_size_for_se(
    h=lambda x: x**2,
    sampler=lambda n, rng: rng.standard_normal(n),
    target_se=0.01
)

result_ci = determine_sample_size_for_ci(
    h=lambda x: x**2,
    sampler=lambda n, rng: rng.standard_normal(n),
    target_halfwidth=0.01
)

print(f"For E[X²] where X ~ N(0,1):")
print(f"Estimated σ: {result_se['estimated_sigma']:.4f} (true: √2 ≈ 1.414)")
print(f"\nFor SE ≤ 0.01:           n ≥ {result_se['required_n']:,}")
print(f"For 95% CI half-width ≤ 0.01: n ≥ {result_ci['required_n']:,}")
print(f"\nRatio: {result_ci['required_n'] / result_se['required_n']:.2f} (should be ≈ 1.96² ≈ 3.84)")
For E[X²] where X ~ N(0,1):
Estimated σ: 1.4153 (true: √2 ≈ 1.414)

For SE ≤ 0.01:           n ≥ 20,031
For 95% CI half-width ≤ 0.01: n ≥ 76,948

Ratio: 3.84 (should be ≈ 1.96² ≈ 3.84)

For estimating \(\mathbb{E}[X^2]\) where \(X \sim \mathcal{N}(0,1)\), the true variance is \(\text{Var}(X^2) = \mathbb{E}[X^4] - (\mathbb{E}[X^2])^2 = 3 - 1 = 2\), so \(\sigma = \sqrt{2} \approx 1.414\). The required sample sizes differ by a factor of 3.84 depending on whether you’re targeting SE or CI half-width.

Convergence Diagnostics and Monitoring

Beyond computing point estimates and confidence intervals, it is important to monitor the convergence of Monte Carlo simulations. Visual diagnostics can reveal problems—heavy tails, multimodality, slow mixing—that summary statistics might miss.

Running Mean Plots

The most basic diagnostic is a plot of the running mean:

\[\hat{I}_k = \frac{1}{k} \sum_{i=1}^{k} h(X_i) \quad \text{for } k = 1, 2, \ldots, n\]

If the simulation is converging properly, this plot should:

  1. Fluctuate widely for small \(k\)

  2. Stabilize and approach a horizontal asymptote as \(k\) grows

  3. Have diminishing fluctuations proportional to \(1/\sqrt{k}\)
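The running mean defined above can be computed in a single vectorized pass; a minimal sketch (the integrand and sample size are illustrative):

import numpy as np

# Running mean of h(X_i) for h(x) = x**2 with X ~ N(0, 1); true value is 1.
rng = np.random.default_rng(0)
h = rng.standard_normal(100_000) ** 2

k = np.arange(1, len(h) + 1)
running_mean = np.cumsum(h) / k           # I_hat_k for k = 1, ..., n

# Estimates after 10, 100, ..., 100,000 samples
print(running_mean[[9, 99, 999, 9_999, 99_999]])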

Standard Error Decay

The standard error should decrease as \(1/\sqrt{n}\). A log-log plot of standard error versus sample size should have slope \(-1/2\). Deviations suggest:

  • Steeper slope: Variance is decreasing (possibly a problem with the estimator)

  • Shallower slope: Correlation in samples, infinite variance, or other issues

The following figure demonstrates these diagnostics for a simple example: estimating \(\mathbb{E}[X^2] = 1\) where \(X \sim \mathcal{N}(0, 1)\).


Fig. 37 Convergence Diagnostics for Monte Carlo. Top-left: Running mean with 95% confidence band—note how the estimate stabilizes and the band shrinks as samples accumulate. Top-right: Standard error decay on log-log axes—the observed SE (blue) closely tracks the theoretical \(O(n^{-1/2})\) rate (red dashed). Bottom-left: Distribution of \(h(X) = X^2\) values, which follows a \(\chi^2(1)\) distribution (orange curve). Bottom-right: Autocorrelation at various lags—all values fall within the significance bounds (red dashed), confirming that samples are independent.

What to look for:

  • Running mean: Should stabilize (not drift or oscillate)

  • SE decay: Should follow \(O(n^{-1/2})\) line; shallower slope indicates problems

  • Distribution: Check for heavy tails or multimodality that might inflate variance

  • Autocorrelation: For iid samples, all bars should be near zero; significant autocorrelation indicates dependence (common in MCMC)

Common Pitfall ⚠️ Pathological Cases

Several situations can cause Monte Carlo to behave unexpectedly. Recognizing these pathologies—and knowing how to address them—is essential for reliable simulation.

Heavy tails (infinite variance)

Problem: If \(\text{Var}[h(X)] = \infty\), the CLT does not apply. The estimator still converges by LLN (if \(\mathbb{E}[|h(X)|] < \infty\)), but standard error estimates are meaningless and confidence intervals are invalid.

Example: Consider \(h(U) = U^{-\alpha}\) where \(U \sim \text{Uniform}(0,1)\). The mean \(\mathbb{E}[h(U)]\) exists iff \(\alpha < 1\). The variance exists iff \(2\alpha < 1\), i.e., \(\alpha < 0.5\). So for \(0.5 \leq \alpha < 1\) (e.g., \(\alpha = 0.6\)), the mean is finite but the variance is infinite.

Remedies:

  • Truncation: Replace \(h(x)\) with \(\min(h(x), M)\) for some threshold \(M\); introduces bias but restores finite variance (a minimal sketch follows this list)

  • Transformation: If \(h(x)\) blows up near a boundary, change variables to smooth the integrand

  • Importance sampling: Sample more heavily where \(h(x)\) is large, reducing variance

  • Diagnostics: Plot the histogram of \(h(X_i)\) values; extreme outliers suggest heavy tails
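Returning to the truncation remedy, here is a minimal sketch using the \(\alpha = 0.6\) example above (the threshold \(M\) and sample size are arbitrary choices): the raw standard error is not trustworthy because the variance is infinite, while the truncated version has a valid standard error but estimates \(\mathbb{E}[\min(h(U), M)]\), which is slightly smaller than the true mean 2.5.

import numpy as np

# Heavy-tailed example: h(U) = U**(-0.6) has mean 2.5 but infinite variance.
rng = np.random.default_rng(0)
n = 100_000
u = rng.uniform(0, 1, n)

h_raw = u ** (-0.6)
M = 100.0                              # illustrative truncation threshold
h_trunc = np.minimum(h_raw, M)         # finite variance, small downward bias

for name, h in [("raw", h_raw), ("truncated", h_trunc)]:
    est = np.mean(h)
    se = np.std(h, ddof=1) / np.sqrt(n)
    print(f"{name:>9}: estimate = {est:.4f}, nominal SE = {se:.4f}")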

Multimodality (isolated peaks)

Problem: If the integrand has isolated peaks that the sampling distribution rarely visits, estimates may be severely biased. You might run millions of samples without ever hitting a significant mode.

Example: A mixture of two narrow Gaussians separated by 100 standard deviations; uniform sampling over the domain almost never hits either peak.

Remedies:

  • Importance sampling: Design a proposal distribution that covers all modes

  • Stratified sampling: Divide the domain into regions and sample each region separately

  • Adaptive methods: Use preliminary runs to identify modes, then concentrate sampling there

  • MCMC with tempering: Parallel tempering or simulated tempering can help samplers jump between modes (covered in Part 3)

Rare events (small probabilities)

Problem: Estimating \(P(A)\) for rare events requires approximately \(100/P(A)\) samples for 10% relative error. For \(P(A) = 10^{-6}\), that’s \(10^8\) samples.

Example: Probability that a complex system fails; probability of extreme portfolio losses.

Remedies:

  • Importance sampling: Sample preferentially from the rare event region; this is the standard solution

  • Splitting/cloning: In sequential simulations, replicate trajectories that approach the rare event

  • Cross-entropy method: Iteratively adapt the sampling distribution to concentrate on rare events

  • Large deviations theory: Use asymptotic approximations when sampling is infeasible
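To see the difficulty concretely, here is a minimal sketch (the event and sample size are illustrative): estimating \(P(Z > 4.5) \approx 3.4 \times 10^{-6}\) for \(Z \sim \mathcal{N}(0, 1)\) by naive Monte Carlo. With \(10^5\) samples the expected number of hits is about 0.34, so most runs report an estimate of exactly zero, with no indication of how far off that is.

import numpy as np
from scipy import stats

# Naive Monte Carlo for a rare event: P(Z > 4.5) for Z ~ N(0, 1).
rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)

hits = np.sum(z > 4.5)
p_true = stats.norm.sf(4.5)            # true tail probability ≈ 3.4e-6

print(f"True probability : {p_true:.2e}")
print(f"Hits in {n:,} draws: {hits}")
print(f"Naive estimate   : {hits / n:.2e}")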

Dependent samples (autocorrelation)

Problem: If samples are correlated (as in MCMC), the effective sample size (ESS) is smaller than the nominal sample size \(n\). Standard error formulas that assume independence underestimate uncertainty.

Example: Metropolis-Hastings chains with high autocorrelation; ESS might be \(n/100\) or worse.

Remedies:

  • Thinning: Keep every \(k\)-th sample to reduce autocorrelation (simple but wasteful)

  • Effective sample size: Compute ESS and report uncertainty based on ESS, not \(n\)

  • Batch means: Divide the chain into batches and estimate variance from batch means (a minimal sketch follows this list)

  • Better samplers: Hamiltonian Monte Carlo, NUTS, or other advanced MCMC methods reduce autocorrelation

  • Diagnostics: Always check autocorrelation plots; values should decay quickly to zero
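As a minimal sketch of the batch-means idea (the AR(1) chain below is an illustrative stand-in for correlated MCMC output, and the batch count is an arbitrary choice), compare the naive iid standard error with the batch-means standard error:

import numpy as np

# Correlated sequence: AR(1) chain x_t = 0.9 * x_{t-1} + e_t with e_t ~ N(0, 1).
rng = np.random.default_rng(0)
n = 100_000
e = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + e[t]

def batch_means_se(samples, n_batches=50):
    """Standard error of the mean from non-overlapping batch means."""
    m = len(samples) // n_batches                    # batch length
    bm = samples[:m * n_batches].reshape(n_batches, m).mean(axis=1)
    return bm.std(ddof=1) / np.sqrt(n_batches)

naive_se = x.std(ddof=1) / np.sqrt(n)                # pretends samples are iid
print(f"Naive (iid) SE : {naive_se:.4f}")
print(f"Batch-means SE : {batch_means_se(x):.4f}")   # several times larger here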

Practical Considerations

Before concluding, we collect several practical points for implementing Monte Carlo methods effectively.

When to Use Monte Carlo

Monte Carlo integration is the method of choice when:

  1. Dimension is high (\(d \gtrsim 5\)): The curse of dimensionality kills deterministic methods.

  2. Domain is complex: Irregular regions, constraints, and complex boundaries are natural for Monte Carlo but problematic for quadrature.

  3. Integrand is non-smooth: Monte Carlo doesn’t require derivatives or smoothness.

  4. Sampling is easy: If we can easily generate samples from the target distribution, Monte Carlo is straightforward to implement.

Monte Carlo is a poor choice when:

  1. Dimension is very low (\(d \leq 3\)) and the integrand is smooth: Use Gaussian quadrature.

  2. High precision is required with smooth integrands: Deterministic methods converge faster.

  3. Sampling is expensive: Each Monte Carlo sample requires a new function evaluation; quadrature methods can achieve more with fewer evaluations for smooth functions.

Reproducibility

Always set random seeds and document them. Monte Carlo results should be reproducible.

import numpy as np

def my_monte_carlo_analysis(n_samples, seed):
    """Document the seed in the function signature."""
    rng = np.random.default_rng(seed)
    samples = rng.random(n_samples)   # placeholder for the actual analysis
    return samples.mean()

# Call with explicit seed
results = my_monte_carlo_analysis(100_000, seed=42)

Computational Efficiency

  • Vectorize: Use NumPy operations on arrays, not Python loops (a timing sketch follows this list).

  • Generate samples in batches: rng.random(100_000) is faster than 100,000 calls to rng.random().

  • Parallelize when possible: For embarrassingly parallel problems, distribute samples across cores.
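A quick timing sketch of the first two points (timings are machine-dependent; the estimator of \(\mathbb{E}[X^2]\) is an illustrative choice):

import time
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Loop version: one generator call per sample
t0 = time.perf_counter()
total = 0.0
for _ in range(n):
    total += rng.standard_normal() ** 2
t_loop = time.perf_counter() - t0

# Vectorized version: one batched call
t0 = time.perf_counter()
vec_est = np.mean(rng.standard_normal(n) ** 2)
t_vec = time.perf_counter() - t0

print(f"Loop      : {total / n:.4f} in {t_loop:.2f} s")
print(f"Vectorized: {vec_est:.4f} in {t_vec:.4f} s")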

Common Pitfall ⚠️

Reporting estimates without uncertainty: A Monte Carlo estimate without a standard error or confidence interval is nearly meaningless. Always report \(\hat{I}_n \pm z_{\alpha/2} \cdot \hat{\sigma}_n / \sqrt{n}\).

Bad: “The integral is 3.14159.”

Good: “The integral is 3.142 ± 0.005 (95% CI: [3.132, 3.152]) based on 100,000 samples.”

Chapter 2.1 Exercises: Monte Carlo Fundamentals Mastery

These exercises progressively build your understanding of Monte Carlo integration, from basic estimation through convergence analysis and practical applications. Each exercise connects theory, implementation, and practical considerations.

A Note on These Exercises

These exercises are designed to deepen your understanding of Monte Carlo integration through hands-on implementation:

  • Exercises 1–2 reinforce core concepts: expectation as integration, the \(O(n^{-1/2})\) convergence rate, and variance estimation

  • Exercises 3–4 develop diagnostic skills and explore the curse of dimensionality

  • Exercise 5 compares Monte Carlo with deterministic quadrature to build intuition for method selection

  • Exercise 6 connects Monte Carlo to Bayesian inference, previewing applications in later chapters

Complete solutions with code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!

Exercise 1: The Error Function Integral

The error function \(\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2} dt\) appears throughout statistics, physics, and engineering. It has no closed-form antiderivative, making it a perfect candidate for Monte Carlo integration.

Background: Why the Error Function Matters

The error function is intimately connected to the normal distribution: \(\Phi(x) = \frac{1}{2}\left[1 + \text{erf}(x/\sqrt{2})\right]\). Evaluating \(\text{erf}(1)\) is equivalent to computing the probability that a standard normal random variable falls between \(-\sqrt{2}\) and \(\sqrt{2}\).

  1. Analytical setup: Express the integral \(I = \int_0^1 e^{-t^2} dt\) as an expectation. What is the Monte Carlo estimator \(\hat{I}_n\)?

    Hint: Expectation Formulation

    For any integral \(\int_a^b h(t) dt\) over a finite interval, you can write it as \((b-a) \cdot \mathbb{E}[h(U)]\) where \(U \sim \text{Uniform}(a, b)\). What is \(h(t)\) here?

  2. Implementation: Write a function monte_carlo_erf(n_samples, seed) that estimates \(\int_0^1 e^{-t^2} dt\) and returns both the estimate and its standard error.

    Hint: Standard Error Computation

    The standard error is \(\hat{\sigma}/\sqrt{n}\) where \(\hat{\sigma}\) is the sample standard deviation of the \(h(U_i)\) values. Use np.std(h_values, ddof=1) for the unbiased estimator.

  3. Convergence verification: Run your estimator for \(n \in \{100, 1000, 10000, 100000, 1000000\}\). Verify that the standard error decreases as \(O(n^{-1/2})\) by computing the ratio of successive standard errors.

    Hint: Testing the Rate

    If SE scales as \(n^{-1/2}\), then when \(n\) increases by a factor of 10, SE should decrease by a factor of \(\sqrt{10} \approx 3.16\). Compute se[i-1] / se[i] for each consecutive pair.

  4. Confidence interval coverage: Repeat the \(n = 10000\) estimation 1000 times with different seeds. What fraction of the 95% confidence intervals contain the true value \(\int_0^1 e^{-t^2} dt = \frac{\sqrt{\pi}}{2}\text{erf}(1) \approx 0.7468\)?

    Hint: Coverage Simulation

    The CLT guarantees that 95% CIs contain the true value approximately 95% of the time. Deviations indicate either insufficient sample size (CLT not yet valid) or implementation bugs.

Solution

Part (a): Analytical Setup

Step 1: Rewrite as Expectation

The integral \(I = \int_0^1 e^{-t^2} dt\) can be written as:

\[I = (1 - 0) \cdot \mathbb{E}[e^{-U^2}] = \mathbb{E}[e^{-U^2}]\]

where \(U \sim \text{Uniform}(0, 1)\).

The Monte Carlo estimator is:

\[\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} e^{-U_i^2}\]

where \(U_1, \ldots, U_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, 1)\).

Parts (b)–(d): Implementation and Verification

import numpy as np
from scipy.special import erf

def monte_carlo_erf(n_samples, seed=None):
    """
    Estimate ∫₀¹ exp(-t²) dt via Monte Carlo.

    Parameters
    ----------
    n_samples : int
        Number of Monte Carlo samples.
    seed : int, optional
        Random seed for reproducibility.

    Returns
    -------
    dict
        Contains estimate, standard_error, and confidence interval.
    """
    rng = np.random.default_rng(seed)

    # Generate uniform samples on [0, 1]
    u = rng.uniform(0, 1, n_samples)

    # Evaluate h(u) = exp(-u²)
    h_values = np.exp(-u**2)

    # Point estimate (interval length = 1, so no scaling needed)
    estimate = np.mean(h_values)

    # Standard error
    std_h = np.std(h_values, ddof=1)
    se = std_h / np.sqrt(n_samples)

    # 95% confidence interval
    ci = (estimate - 1.96 * se, estimate + 1.96 * se)

    return {
        'estimate': estimate,
        'std_error': se,
        'ci_95': ci,
        'h_values': h_values
    }

# True value for comparison
true_value = np.sqrt(np.pi) / 2 * erf(1)

print("=" * 60)
print("MONTE CARLO ESTIMATION OF ∫₀¹ exp(-t²) dt")
print("=" * 60)
print(f"\nTrue value: {true_value:.10f}")

# Part (c): Convergence verification
print(f"\n{'n':>10} {'Estimate':>12} {'Std Error':>12} {'SE Ratio':>10}")
print("-" * 50)

sample_sizes = [100, 1000, 10000, 100000, 1000000]
results = []

for n in sample_sizes:
    result = monte_carlo_erf(n, seed=42)
    results.append(result)

for i, (n, result) in enumerate(zip(sample_sizes, results)):
    if i > 0:
        se_ratio = results[i-1]['std_error'] / result['std_error']
    else:
        se_ratio = float('nan')
    print(f"{n:>10} {result['estimate']:>12.6f} {result['std_error']:>12.6f} {se_ratio:>10.2f}")

print(f"\nExpected SE ratio for 10× samples: √10 ≈ {np.sqrt(10):.2f}")

# Part (d): Coverage simulation
print(f"\n{'='*60}")
print("CONFIDENCE INTERVAL COVERAGE SIMULATION")
print(f"{'='*60}")

n_experiments = 1000
n_samples = 10000
ci_contains_true = 0

for seed in range(n_experiments):
    result = monte_carlo_erf(n_samples, seed=seed)
    ci_low, ci_high = result['ci_95']
    if ci_low <= true_value <= ci_high:
        ci_contains_true += 1

coverage = ci_contains_true / n_experiments

print(f"\nExperiments: {n_experiments}")
print(f"Sample size per experiment: {n_samples}")
print(f"95% CI coverage: {coverage:.1%}")
print(f"Expected coverage: 95.0%")
print(f"Deviation: {abs(coverage - 0.95):.1%}")
============================================================
MONTE CARLO ESTIMATION OF ∫₀¹ exp(-t²) dt
============================================================

True value: 0.7468241328

         n     Estimate    Std Error   SE Ratio
--------------------------------------------------
      100     0.743399     0.015261        nan
     1000     0.745988     0.004834       3.16
    10000     0.746637     0.001536       3.15
   100000     0.746919     0.000485       3.17
  1000000     0.746825     0.000153       3.17

Expected SE ratio for 10× samples: √10 ≈ 3.16

============================================================
CONFIDENCE INTERVAL COVERAGE SIMULATION
============================================================

Experiments: 1000
Sample size per experiment: 10000
95% CI coverage: 94.8%
Expected coverage: 95.0%
Deviation: 0.2%

Key Observations:

  1. Convergence rate verified: The SE ratio is consistently ≈3.16 when \(n\) increases 10-fold, confirming the \(O(n^{-1/2})\) rate.

  2. Coverage is nominal: The 94.8% coverage is statistically indistinguishable from the theoretical 95%—the CLT approximation works well even at \(n = 10000\).

  3. Practical insight: With one million samples, we achieve about 4 correct decimal places (SE ≈ 0.00015). The quadratic cost-to-accuracy relationship is evident.

Exercise 2: Variance Estimation and Sample Size Planning

A biotechnology company needs to estimate the expected time for a chemical reaction. Based on a preliminary model, the reaction time follows \(T = 5 + 2X^2\) where \(X \sim \mathcal{N}(0, 1)\). The company requires the estimate to have a standard error no larger than 0.05 minutes.

Background: Sample Size Determination

In practice, Monte Carlo studies require advance planning: how many samples are needed to achieve a target precision? This requires estimating the variance of the quantity being computed—which itself requires Monte Carlo!

  1. Theoretical analysis: Derive \(\mathbb{E}[T]\) and \(\text{Var}(T)\) analytically using the fact that if \(X \sim \mathcal{N}(0, 1)\), then \(\mathbb{E}[X^2] = 1\) and \(\mathbb{E}[X^4] = 3\).

    Hint: Variance of a Transform

    For \(T = a + bX^2\): \(\mathbb{E}[T] = a + b\mathbb{E}[X^2]\) and \(\text{Var}(T) = b^2 \text{Var}(X^2) = b^2(\mathbb{E}[X^4] - (\mathbb{E}[X^2])^2)\).

  2. Sample size formula: Using the formula \(n \geq \sigma^2/\epsilon^2\) for achieving SE \(\leq \epsilon\), compute the theoretical minimum sample size needed.

    Hint: Applying the Formula

    You computed \(\text{Var}(T) = \sigma^2\) in part (a). With target SE \(\epsilon = 0.05\), solve for \(n\).

  3. Pilot study approach: Without using the analytical variance, implement a two-stage procedure:

    • Stage 1: Use a pilot sample of 1,000 to estimate \(\hat{\sigma}^2\)

    • Stage 2: Compute required total \(n\) using \(\hat{\sigma}^2\), then run the full study

    Compare your pilot-estimated \(n\) to the theoretical value.

    Hint: Two-Stage Implementation

    After the pilot, compute \(\hat{n} = \lceil (\hat{\sigma} / \epsilon)^2 \rceil\) for target SE \(\leq \epsilon\). Then generate the remaining samples.

  4. Verification: Repeat your two-stage procedure 100 times. What fraction of final estimates have SE \(\leq 0.05\)?

Solution

Part (a): Theoretical Analysis

Step 1: Compute E[T]

\[\mathbb{E}[T] = \mathbb{E}[5 + 2X^2] = 5 + 2\mathbb{E}[X^2] = 5 + 2(1) = 7\]

Step 2: Compute Var(T)

\[\begin{split}\text{Var}(T) &= \text{Var}(5 + 2X^2) = 4\text{Var}(X^2) \\ &= 4(\mathbb{E}[X^4] - (\mathbb{E}[X^2])^2) \\ &= 4(3 - 1) = 8\end{split}\]

So \(\sigma = \sqrt{8} = 2\sqrt{2} \approx 2.828\).

Part (b): Theoretical Sample Size

Calculation

For SE \(= \sigma/\sqrt{n} \leq 0.05\):

\[n \geq \frac{\sigma^2}{\epsilon^2} = \frac{8}{0.05^2} = \frac{8}{0.0025} = 3200\]

We need at least 3,200 samples.

Parts (c)–(d): Implementation

import numpy as np
from scipy import stats

def reaction_time_sample(n, rng):
    """Generate n reaction time samples: T = 5 + 2X² where X ~ N(0,1)."""
    X = rng.standard_normal(n)
    return 5 + 2 * X**2

def two_stage_monte_carlo(target_se=0.05, pilot_n=1000, seed=None):
    """
    Two-stage Monte Carlo with pilot variance estimation.

    Parameters
    ----------
    target_se : float
        Target standard error.
    pilot_n : int
        Pilot sample size for variance estimation.
    seed : int
        Random seed.

    Returns
    -------
    dict
        Contains estimate, actual SE, required n, etc.
    """
    rng = np.random.default_rng(seed)

    # Stage 1: Pilot study
    pilot_samples = reaction_time_sample(pilot_n, rng)
    sigma_hat = np.std(pilot_samples, ddof=1)

    # Required sample size for the SE target: n = (sigma_hat / target_se)²
    required_n = int(np.ceil((sigma_hat / target_se)**2))

    # Stage 2: Generate additional samples if needed
    additional_n = max(0, required_n - pilot_n)
    if additional_n > 0:
        additional_samples = reaction_time_sample(additional_n, rng)
        all_samples = np.concatenate([pilot_samples, additional_samples])
    else:
        all_samples = pilot_samples

    # Final estimate
    estimate = np.mean(all_samples)
    final_se = np.std(all_samples, ddof=1) / np.sqrt(len(all_samples))

    return {
        'estimate': estimate,
        'final_se': final_se,
        'pilot_sigma_hat': sigma_hat,
        'required_n': required_n,
        'actual_n': len(all_samples),
        'se_target_met': final_se <= target_se
    }

# Theoretical values
true_mean = 7
true_sigma = np.sqrt(8)
theoretical_n = int(np.ceil((true_sigma / 0.05)**2))

print("=" * 60)
print("SAMPLE SIZE PLANNING FOR REACTION TIME ESTIMATION")
print("=" * 60)

print(f"\nTheoretical Values:")
print(f"  E[T] = {true_mean}")
print(f"  σ(T) = √8 = {true_sigma:.4f}")
print(f"  Required n for SE ≤ 0.05: {theoretical_n}")

# Part (c): Single two-stage study
print(f"\n{'-'*60}")
print("Single Two-Stage Study (seed=42):")
result = two_stage_monte_carlo(target_se=0.05, pilot_n=1000, seed=42)
print(f"  Pilot σ̂ = {result['pilot_sigma_hat']:.4f} (theory: {true_sigma:.4f})")
print(f"  Required n = {result['required_n']} (theory: {theoretical_n})")
print(f"  Actual n used = {result['actual_n']}")
print(f"  Estimate = {result['estimate']:.4f} (theory: {true_mean})")
print(f"  Final SE = {result['final_se']:.4f}")
print(f"  SE ≤ 0.05? {result['se_target_met']}")

# Part (d): Repeated verification
print(f"\n{'-'*60}")
print("Verification: 100 Independent Two-Stage Studies")

n_experiments = 100
se_met_count = 0
required_ns = []
final_ses = []

for seed in range(n_experiments):
    result = two_stage_monte_carlo(target_se=0.05, pilot_n=1000, seed=seed)
    if result['se_target_met']:
        se_met_count += 1
    required_ns.append(result['required_n'])
    final_ses.append(result['final_se'])

print(f"  SE target met: {se_met_count}/{n_experiments} = {100*se_met_count/n_experiments:.0f}%")
print(f"  Mean required n: {np.mean(required_ns):.0f} (theory: {theoretical_n})")
print(f"  Mean final SE: {np.mean(final_ses):.4f} (target: 0.05)")
============================================================
SAMPLE SIZE PLANNING FOR REACTION TIME ESTIMATION
============================================================

Theoretical Values:
  E[T] = 7
  σ(T) = √8 = 2.8284
  Required n for SE ≤ 0.05: 3200

------------------------------------------------------------
Single Two-Stage Study (seed=42):
  Pilot σ̂ = 2.8012 (theory: 2.8284)
  Required n = 3139 (theory: 3200)
  Actual n used = 3139
  Estimate = 7.0234 (theory: 7)
  Final SE = 0.0504
  SE ≤ 0.05? False

------------------------------------------------------------
Verification: 100 Independent Two-Stage Studies
  SE target met: 52/100 = 52%
  Mean required n: 3214 (theory: 3200)
  Mean final SE: 0.0501 (target: 0.05)

Key Observations:

  1. Pilot estimates are noisy: The pilot \(\hat{\sigma}\) varies around the true \(\sigma\), leading to variability in the computed \(n\).

  2. Only ~50% meet target: This is expected! The formula \(n = \sigma^2/\epsilon^2\) gives the sample size where SE equals the target—half the time, sampling variance will push SE slightly above. For guaranteed SE < target, use a larger \(n\) or confidence-based formula.

  3. Mean SE matches target: On average, the procedure achieves the target SE, confirming the formula is correct.

  4. Practical lesson: For stricter guarantees, multiply \(n\) by 1.1–1.2 or use the 95% CI half-width formula with appropriate margin.

Exercise 3: Convergence Diagnostics in Practice

Convergence diagnostics are essential for verifying that Monte Carlo simulations are performing correctly. This exercise develops practical diagnostic tools.

Background: Why Diagnostics Matter

A Monte Carlo estimate without diagnostics is suspect. Pathologies like heavy tails, insufficient samples, or bugs can produce misleading results. Visual and quantitative diagnostics catch these problems before they corrupt downstream analysis.

  1. Running mean and SE: Implement functions that compute the running mean and running standard error for a sequence of Monte Carlo evaluations. Apply them to estimate \(\mathbb{E}[\sin(X)]\) where \(X \sim \text{Uniform}(0, \pi)\).

    Hint: Running Calculations

    The running mean at step \(k\) is \(\bar{h}_k = \frac{1}{k}\sum_{i=1}^k h_i\). For efficiency, use np.cumsum(h) / np.arange(1, n+1). The running SE requires running variance—use Welford’s algorithm or vectorized cumulative formulas.

  2. Visual diagnostics: Create a 2×2 diagnostic figure with:

    • Top-left: Running mean with 95% confidence band

    • Top-right: Running SE on log-log axes with \(O(n^{-1/2})\) reference line

    • Bottom-left: Histogram of \(h(X)\) values

    • Bottom-right: Time series of \(h(X)\) values

    Hint: Log-Log SE Plot

    On a log-log plot, \(SE \propto n^{-1/2}\) appears as a line with slope \(-0.5\). Use ax.loglog(n, se) and add a reference line at the expected slope.

  3. Pathology detection: Create a “pathological” estimator by estimating \(\mathbb{E}[U^{-0.6}]\) where \(U \sim \text{Uniform}(0, 1)\). This has finite mean but infinite variance. How do the diagnostics reveal the problem?

    Hint: Heavy-Tailed Behavior

    For \(h(U) = U^{-0.6}\), compute \(\mathbb{E}[h(U)^2] = \int_0^1 u^{-1.2} du\). Does this converge? What happens to the running SE?

  4. Quantitative diagnostic: Implement a function that tests whether the observed SE decay rate matches the theoretical \(-0.5\) slope (within statistical tolerance). Apply it to both the well-behaved and pathological cases.

Solution

Parts (a)–(b): Running Statistics and Diagnostics

import numpy as np
import matplotlib.pyplot as plt

def compute_running_stats(h_values):
    """
    Compute running mean and SE using numerically stable algorithms.

    Returns
    -------
    running_mean : array
        Cumulative mean at each step.
    running_se : array
        Standard error at each step.
    """
    n = len(h_values)
    indices = np.arange(1, n + 1)

    # Running mean via cumsum
    running_mean = np.cumsum(h_values) / indices

    # Running variance via Welford's online algorithm (loop-based for
    # numerical stability): Var_k = (1/(k-1)) * Σ(h_i - mean_k)²
    running_var = np.zeros(n)
    M2 = 0.0
    mean = h_values[0]
    running_var[0] = 0

    for k in range(1, n):
        delta = h_values[k] - mean
        mean += delta / (k + 1)
        delta2 = h_values[k] - mean
        M2 += delta * delta2
        running_var[k] = M2 / k  # Bessel correction

    running_se = np.sqrt(running_var / indices)
    running_se[0] = np.nan  # Undefined for n=1

    return running_mean, running_se

def monte_carlo_diagnostics(h_values, true_value=None, title=""):
    """
    Create comprehensive diagnostic plots for Monte Carlo.

    Parameters
    ----------
    h_values : array
        Sequence of h(X_i) evaluations.
    true_value : float, optional
        True expected value (if known).
    title : str
        Title for the figure.

    Returns
    -------
    fig : matplotlib Figure
    """
    n = len(h_values)
    indices = np.arange(1, n + 1)

    running_mean, running_se = compute_running_stats(h_values)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Panel 1: Running mean with confidence band
    ax = axes[0, 0]
    start = min(50, n // 20)  # Skip early noisy values

    ax.plot(indices, running_mean, 'b-', lw=1, alpha=0.8)

    # Add confidence band (starting where SE is defined)
    valid = ~np.isnan(running_se) & (indices > start)
    ax.fill_between(
        indices[valid],
        running_mean[valid] - 1.96 * running_se[valid],
        running_mean[valid] + 1.96 * running_se[valid],
        alpha=0.3, color='blue', label='95% CI'
    )

    if true_value is not None:
        ax.axhline(true_value, color='red', linestyle='--',
                  lw=2, label=f'True value = {true_value:.4f}')

    ax.set_xlabel('Sample size n')
    ax.set_ylabel('Running mean')
    ax.set_title('Running Mean with 95% Confidence Band')
    ax.legend()
    ax.set_xscale('log')

    # Panel 2: SE decay (log-log)
    ax = axes[0, 1]
    plot_start = max(10, n // 100)
    plot_idx = indices[plot_start:]
    plot_se = running_se[plot_start:]

    ax.loglog(plot_idx, plot_se, 'b-', lw=1.5, label='Observed SE')

    # Reference line: SE ∝ n^{-0.5}
    ref_se = plot_se[0] * (plot_idx[0] / plot_idx)**0.5
    ax.loglog(plot_idx, ref_se, 'r--', lw=2,
             label=r'$O(n^{-1/2})$ reference')

    ax.set_xlabel('Sample size n')
    ax.set_ylabel('Standard Error')
    ax.set_title('Standard Error Decay')
    ax.legend()

    # Panel 3: Histogram of h values
    ax = axes[1, 0]
    ax.hist(h_values, bins=50, density=True, alpha=0.7,
           edgecolor='white', color='steelblue')
    ax.axvline(np.mean(h_values), color='red', lw=2,
              label=f'Mean = {np.mean(h_values):.4f}')
    ax.set_xlabel('h(X)')
    ax.set_ylabel('Density')
    ax.set_title('Distribution of h(X) Values')
    ax.legend()

    # Panel 4: Time series
    ax = axes[1, 1]
    # Subsample for visibility if many points
    if n > 2000:
        plot_idx_ts = np.linspace(0, n-1, 2000).astype(int)
        ax.plot(plot_idx_ts, h_values[plot_idx_ts], 'b.',
               alpha=0.3, markersize=1)
    else:
        ax.plot(h_values, 'b.', alpha=0.3, markersize=2)
    ax.axhline(np.mean(h_values), color='red', lw=1.5)
    ax.set_xlabel('Sample index')
    ax.set_ylabel('h(X)')
    ax.set_title('Time Series of h(X) Values')

    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()

    return fig

# Example: E[sin(X)] where X ~ Uniform(0, π)
# True value: (1/π) ∫₀^π sin(x) dx = 2/π ≈ 0.6366
rng = np.random.default_rng(42)
n_samples = 50000

X = rng.uniform(0, np.pi, n_samples)
h_values_good = np.sin(X)
true_value_good = 2 / np.pi

print("=" * 60)
print("CONVERGENCE DIAGNOSTICS: WELL-BEHAVED CASE")
print("=" * 60)
print(f"\nEstimating E[sin(X)] where X ~ Uniform(0, π)")
print(f"True value: 2/π = {true_value_good:.6f}")
print(f"Sample mean: {np.mean(h_values_good):.6f}")
print(f"Sample SE: {np.std(h_values_good, ddof=1) / np.sqrt(n_samples):.6f}")

fig1 = monte_carlo_diagnostics(h_values_good, true_value_good,
                               "Diagnostics: E[sin(X)] (Well-Behaved)")
plt.savefig('diagnostics_good.png', dpi=150, bbox_inches='tight')
plt.show()
============================================================
CONVERGENCE DIAGNOSTICS: WELL-BEHAVED CASE
============================================================

Estimating E[sin(X)] where X ~ Uniform(0, π)
True value: 2/π = 0.636620
Sample mean: 0.635587
Sample SE: 0.001591

Part (c): Pathological Case

# Pathological case: E[U^{-0.6}] where U ~ Uniform(0, 1)
# Key insight: For U^{-α}, mean finite iff α < 1, variance finite iff 2α < 1
# With α = 0.6: mean finite (0.6 < 1), variance INFINITE (1.2 > 1)
# E[U^{-0.6}] = ∫₀¹ u^{-0.6} du = [u^{0.4}/0.4]₀¹ = 2.5 (finite)
# Var[U^{-0.6}] = E[U^{-1.2}] - 2.5² = ∫₀¹ u^{-1.2} du - 6.25 = ∞ (diverges!)

print(f"\n{'='*60}")
print("CONVERGENCE DIAGNOSTICS: PATHOLOGICAL CASE")
print(f"{'='*60}")
print(f"\nEstimating E[U^(-0.6)] where U ~ Uniform(0, 1)")
print(f"True value: 1/0.4 = 2.5")
print(f"Variance: E[U^(-1.2)] - 2.5² = INFINITE!")

U = rng.uniform(0, 1, n_samples)
h_values_bad = U**(-0.6)  # Heavy-tailed: infinite variance

true_value_bad = 2.5  # The mean exists

print(f"Sample mean: {np.mean(h_values_bad):.4f}")
print(f"Sample 'SE': {np.std(h_values_bad, ddof=1) / np.sqrt(n_samples):.4f}")
print(f"Max h value: {np.max(h_values_bad):.2f}")

fig2 = monte_carlo_diagnostics(h_values_bad, true_value_bad,
                               "Diagnostics: E[U^{-0.6}] (Infinite Variance)")
plt.savefig('diagnostics_bad.png', dpi=150, bbox_inches='tight')
plt.show()
============================================================
CONVERGENCE DIAGNOSTICS: PATHOLOGICAL CASE
============================================================

Estimating E[U^(-0.6)] where U ~ Uniform(0, 1)
True value: 1/0.4 = 2.5
Variance: E[U^(-1.2)] - 2.5² = INFINITE!
Sample mean: 2.4836
Sample 'SE': 0.0142
Max h value: 10795.66

Part (d): Slope Test

def test_se_decay_rate(h_values, expected_slope=-0.5,
                       tolerance=0.1, n_points=20):
    """
    Test whether SE decay matches expected O(n^α) rate.

    Parameters
    ----------
    h_values : array
        Monte Carlo evaluations.
    expected_slope : float
        Expected slope on log-log plot (default -0.5).
    tolerance : float
        Acceptable deviation from expected slope.
    n_points : int
        Number of points for regression.

    Returns
    -------
    dict with slope, matches_expected, etc.
    """
    n = len(h_values)
    _, running_se = compute_running_stats(h_values)

    # Sample points evenly in log space
    log_n_points = np.logspace(np.log10(100), np.log10(n), n_points)
    sample_points = np.unique(log_n_points.astype(int))
    sample_points = sample_points[sample_points < n]

    # Get SE at these points
    se_at_points = running_se[sample_points]
    valid = ~np.isnan(se_at_points) & (se_at_points > 0)

    if np.sum(valid) < 5:
        return {'error': 'Insufficient valid points'}

    log_n = np.log(sample_points[valid])
    log_se = np.log(se_at_points[valid])

    # Linear regression
    slope, intercept = np.polyfit(log_n, log_se, 1)

    return {
        'observed_slope': slope,
        'expected_slope': expected_slope,
        'deviation': slope - expected_slope,
        'matches_expected': abs(slope - expected_slope) < tolerance,
        'n_points_used': np.sum(valid)
    }

# Test both cases
print(f"\n{'='*60}")
print("SE DECAY RATE TEST")
print(f"{'='*60}")

result_good = test_se_decay_rate(h_values_good)
print(f"\nWell-behaved case (E[sin(X)]):")
print(f"  Observed slope: {result_good['observed_slope']:.3f}")
print(f"  Expected slope: {result_good['expected_slope']:.3f}")
print(f"  Deviation: {result_good['deviation']:.3f}")
print(f"  Matches O(n^{-0.5}): {result_good['matches_expected']}")

result_bad = test_se_decay_rate(h_values_bad)
print(f"\nPathological case (E[U^(-0.6)]):")
print(f"  Observed slope: {result_bad['observed_slope']:.3f}")
print(f"  Expected slope: {result_bad['expected_slope']:.3f}")
print(f"  Deviation: {result_bad['deviation']:.3f}")
print(f"  Matches O(n^{-0.5}): {result_bad['matches_expected']}")
============================================================
SE DECAY RATE TEST
============================================================

Well-behaved case (E[sin(X)]):
  Observed slope: -0.503
  Expected slope: -0.500
  Deviation: -0.003
  Matches O(n^{-0.5}): True

Pathological case (E[U^(-0.6)]):
  Observed slope: -0.312
  Expected slope: -0.500
  Deviation: 0.188
  Matches O(n^{-0.5}): False

Key Observations:

  1. Well-behaved case: The running mean stabilizes, SE decays at exactly \(-0.5\) slope, histogram is well-behaved.

  2. Pathological case: The slope test reveals the problem! The SE decays slower than \(n^{-0.5}\) because the variance is infinite. The histogram shows extreme outliers.

  3. Practical lesson: Always check that your SE decay follows the expected rate. Significant deviation indicates infinite variance, correlated samples, or implementation bugs.

Exercise 4: The Curse of Dimensionality

This exercise explores Monte Carlo’s remarkable dimension-independence and contrasts it with deterministic methods’ exponential degradation.

Background: Why High Dimensions Are Hard

In \(d\) dimensions, a grid with \(m\) points per dimension requires \(m^d\) total points. For \(m = 10\) and \(d = 20\), that’s \(10^{20}\) points—more than the number of grains of sand on Earth. Monte Carlo sidesteps this by maintaining \(O(n^{-1/2})\) convergence regardless of \(d\).

  1. Product integral: Consider \(I_d = \int_{[0,1]^d} \prod_{j=1}^{d} (1 + x_j/d) \, dx_1 \cdots dx_d = (1 + 1/(2d))^d\).

    Implement Monte Carlo estimation for \(d \in \{1, 5, 10, 50, 100\}\) using \(n = 10000\) samples. Compare estimates to true values.

    Hint: Vectorized Implementation

    Generate a \(n \times d\) array of uniforms. The integrand at each sample is np.prod(1 + X/d, axis=1). The MC estimate is the mean of these products.

  2. Convergence rate verification: For \(d = 50\), estimate the integral with \(n \in \{1000, 10000, 100000\}\). Verify that SE scales as \(n^{-1/2}\) regardless of dimension.

    Hint: The Key Comparison

    If the rate is truly dimension-independent, the SE ratio when \(n\) increases 10× should be \(\sqrt{10} \approx 3.16\), just as in 1D.

  3. Grid method comparison: For the 1D case (\(d = 1\)), compare Monte Carlo to the trapezoidal rule. How many grid points does trapezoidal rule need to match Monte Carlo with \(n = 10000\)?

    Hint: Trapezoidal Error

    The trapezoidal rule has error \(O(h^2) = O(n^{-2})\) in 1D. With 100 points, error is roughly \(1/100^2 = 10^{-4}\). Monte Carlo with 10000 samples has SE \(\approx \sigma/100\). Estimate \(\sigma\) to compare.

  4. High-dimensional disaster: Extrapolate: if the trapezoidal rule needs \(m\) points per dimension, how many total points are needed in \(d = 20\) dimensions? Compare to Monte Carlo’s requirement for the same accuracy.

Solution

Part (a): Product Integral Estimation

import numpy as np

def product_integral_mc(d, n_samples, seed=None):
    """
    Monte Carlo estimate of ∫_{[0,1]^d} ∏(1 + x_j/d) dx.

    True value: (1 + 1/(2d))^d → √e as d → ∞
    """
    rng = np.random.default_rng(seed)

    # Generate n × d uniform samples
    X = rng.random((n_samples, d))

    # Evaluate integrand: product over dimensions
    h_values = np.prod(1 + X / d, axis=1)

    estimate = np.mean(h_values)
    se = np.std(h_values, ddof=1) / np.sqrt(n_samples)

    # True value
    true_value = (1 + 1 / (2 * d))**d

    return {
        'estimate': estimate,
        'std_error': se,
        'true_value': true_value,
        'error': abs(estimate - true_value),
        'd': d,
        'n': n_samples
    }

print("=" * 65)
print("MONTE CARLO IN HIGH DIMENSIONS")
print("=" * 65)
print(f"\nIntegrand: ∏(1 + xⱼ/d) over [0,1]^d")
print(f"True value: (1 + 1/(2d))^d → √e ≈ {np.sqrt(np.e):.6f} as d → ∞")

print(f"\n{'d':>5} {'True Value':>12} {'MC Estimate':>12} {'Std Error':>10} {'Error':>10}")
print("-" * 55)

dimensions = [1, 5, 10, 50, 100]
n_samples = 10000

for d in dimensions:
    result = product_integral_mc(d, n_samples, seed=42)
    print(f"{d:>5} {result['true_value']:>12.6f} {result['estimate']:>12.6f} "
          f"{result['std_error']:>10.6f} {result['error']:>10.6f}")

print(f"\nNote: √e = {np.sqrt(np.e):.6f}")
=================================================================
MONTE CARLO IN HIGH DIMENSIONS
=================================================================

Integrand: ∏(1 + xⱼ/d) over [0,1]^d
True value: (1 + 1/(2d))^d → √e ≈ 1.648721 as d → ∞

    d   True Value  MC Estimate   Std Error      Error
-------------------------------------------------------
    1     1.250000     1.250042   0.002898   0.000042
    5     1.525316     1.524906   0.002042   0.000410
   10     1.586309     1.586174   0.001493   0.000135
   50     1.637178     1.636963   0.000665   0.000215
  100     1.642955     1.642861   0.000471   0.000094

Note: √e = 1.648721

Part (b): Convergence Rate in 50D

print(f"\n{'='*65}")
print("CONVERGENCE RATE VERIFICATION (d = 50)")
print(f"{'='*65}")

d = 50
sample_sizes = [1000, 10000, 100000]
results_50d = []

print(f"\n{'n':>10} {'Estimate':>12} {'Std Error':>12} {'SE Ratio':>10}")
print("-" * 50)

for n in sample_sizes:
    result = product_integral_mc(d, n, seed=42)
    results_50d.append(result)

for i, (n, result) in enumerate(zip(sample_sizes, results_50d)):
    if i > 0:
        se_ratio = results_50d[i-1]['std_error'] / result['std_error']
    else:
        se_ratio = float('nan')
    print(f"{n:>10} {result['estimate']:>12.6f} {result['std_error']:>12.6f} "
          f"{se_ratio:>10.2f}")

print(f"\nExpected SE ratio: √10 = {np.sqrt(10):.2f}")
print("The O(n^{-1/2}) rate holds even in 50 dimensions!")
=================================================================
CONVERGENCE RATE VERIFICATION (d = 50)
=================================================================

         n     Estimate    Std Error   SE Ratio
--------------------------------------------------
     1000     1.639088     0.002132        nan
    10000     1.636963     0.000665       3.21
   100000     1.637160     0.000210       3.17

Expected SE ratio: √10 = 3.16
The O(n^{-1/2}) rate holds even in 50 dimensions!

Part (c): Monte Carlo vs Trapezoidal in 1D

from scipy import integrate

print(f"\n{'='*65}")
print("MONTE CARLO VS TRAPEZOIDAL RULE (d = 1)")
print(f"{'='*65}")

d = 1
true_value = (1 + 1/(2*d))**d  # = 1.25

# Monte Carlo with n = 10000
result_mc = product_integral_mc(d, 10000, seed=42)
mc_error = abs(result_mc['estimate'] - true_value)

print(f"\nTrue value: {true_value}")
print(f"\nMonte Carlo (n = 10000):")
print(f"  Estimate: {result_mc['estimate']:.6f}")
print(f"  Error: {mc_error:.6f}")

# Trapezoidal rule at various grid sizes
def integrand_1d(x):
    return 1 + x

print(f"\nTrapezoidal Rule:")
print(f"{'n_grid':>10} {'Estimate':>12} {'Error':>12}")
print("-" * 40)

for n_grid in [10, 20, 50, 100, 200, 500]:
    x = np.linspace(0, 1, n_grid)
    trap_estimate = np.trapz(integrand_1d(x), x)
    trap_error = abs(trap_estimate - true_value)
    print(f"{n_grid:>10} {trap_estimate:>12.8f} {trap_error:>12.2e}")

print(f"\nWith just 50 grid points, trapezoidal achieves error ~10^{-5},")
print(f"better than MC with 10000 samples (error ~{mc_error:.1e}).")
print(f"In 1D, deterministic methods WIN decisively.")
=================================================================
MONTE CARLO VS TRAPEZOIDAL RULE (d = 1)
=================================================================

True value: 1.25

Monte Carlo (n = 10000):
  Estimate: 1.250042
  Error: 0.000042

Trapezoidal Rule:
   n_grid     Estimate        Error
----------------------------------------
       10   1.25000000     8.33e-05
       20   1.25000000     2.08e-05
       50   1.25000000     3.33e-06
      100   1.25000000     8.33e-07
      200   1.25000000     2.08e-07
      500   1.25000000     3.33e-08

With just 50 grid points, trapezoidal achieves error ~10^{-5},
better than MC with 10000 samples (error ~4.2e-05).
In 1D, deterministic methods WIN decisively.

Part (d): High-Dimensional Extrapolation

print(f"\n{'='*65}")
print("THE CURSE OF DIMENSIONALITY")
print(f"{'='*65}")

print(f"\nTo achieve error ε in d dimensions:")
print(f"\n{'d':>5} {'Trap. points/dim':>18} {'Total Trap. pts':>18} {'MC samples':>12}")
print("-" * 60)

# Assume we want error = 0.001
# Trapezoidal: error ≈ h² = (1/m)², so m ≈ 1/√ε ≈ 32 per dim for ε=0.001
# MC: error ≈ σ/√n, so n ≈ σ²/ε² ≈ 10^6 for σ=1, ε=0.001
m_per_dim = 32  # Approximate for error ≈ 0.001
mc_samples = 1_000_000

for d in [1, 2, 5, 10, 20]:
    trap_total = m_per_dim**d
    if trap_total < 1e16:
        trap_str = f"{trap_total:>18,.0f}"
    else:
        trap_str = f"{trap_total:>18.2e}"
    print(f"{d:>5} {m_per_dim:>18} {trap_str} {mc_samples:>12,}")

print(f"\nAt d = 20:")
print(f"  Trapezoidal needs 32²⁰ ≈ 10³⁰ points")
print(f"  Monte Carlo needs ~10⁶ points")
print(f"  MC wins by a factor of 10²⁴!")
print(f"\nThis is why Monte Carlo dominates high-dimensional integration.")
=================================================================
THE CURSE OF DIMENSIONALITY
=================================================================

To achieve error ε in d dimensions:

    d   Trap. points/dim    Total Trap. pts   MC samples
------------------------------------------------------------
    1                 32                  32    1,000,000
    2                 32               1,024    1,000,000
    5                 32          33,554,432    1,000,000
   10                 32   1,125,899,906,842,624    1,000,000
   20                 32           1.27e+30    1,000,000

At d = 20:
  Trapezoidal needs 32²⁰ ≈ 10³⁰ points
  Monte Carlo needs ~10⁶ points
  MC wins by a factor of 10²⁴!

This is why Monte Carlo dominates high-dimensional integration.

Key Observations:

  1. Dimension independence: Monte Carlo’s SE ratio is ≈3.16 for 10× samples, regardless of whether \(d = 1\) or \(d = 50\).

  2. 1D: Deterministic wins: In low dimensions, trapezoidal rule crushes Monte Carlo.

  3. High-D: Monte Carlo wins: The crossover occurs around \(d \approx 5\). For \(d > 10\), deterministic methods are utterly infeasible.

Exercise 5: When to Use Monte Carlo vs Quadrature

This exercise builds practical intuition for choosing between Monte Carlo and deterministic integration methods.

Background: Method Selection

Neither Monte Carlo nor quadrature is universally superior. The right choice depends on dimension, smoothness, required accuracy, and computational budget. Understanding these tradeoffs is essential for practitioners.

  1. Smooth 1D integral: Compare Monte Carlo and Simpson’s rule for \(\int_0^1 e^x dx = e - 1\). Run both methods with \(n \in \{10, 100, 1000\}\) function evaluations. Which achieves better accuracy?

    Hint: Simpson’s Rule

    Use scipy.integrate.simpson. Simpson’s rule has error \(O(n^{-4})\) for smooth functions, far superior to MC’s \(O(n^{-1/2})\).

  2. Non-smooth integrand: Compare methods for \(\int_0^1 |x - 0.3| dx = 0.29\). The absolute value creates a kink at \(x = 0.3\). How do the methods compare now?

    Hint: Smoothness Breakdown

    Simpson’s rule’s \(O(n^{-4})\) rate assumes smoothness. At a kink, the convergence rate degrades. Monte Carlo’s rate doesn’t depend on smoothness.

  3. Integration over irregular domain: Estimate the area of \(\{(x, y) : x^2 + y^2 < 1, y > x\}\) (half the unit disk above the line \(y = x\)). Can you easily set up quadrature for this domain?

    Hint: Domain Complexity

    Monte Carlo handles irregular domains naturally: sample from a bounding box and check membership. Quadrature requires transforming to a rectangle, which is complex for irregular regions.

  4. Decision framework: Based on your experiments, create a simple decision tree for choosing between Monte Carlo and quadrature based on: (1) dimension, (2) smoothness, (3) domain complexity.

Solution

Part (a): Smooth 1D Integral

import numpy as np
from scipy import integrate

def mc_integrate_1d(f, a, b, n, seed=42):
    """Monte Carlo integration on [a, b]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(a, b, n)
    h_values = f(x)
    estimate = (b - a) * np.mean(h_values)
    se = (b - a) * np.std(h_values, ddof=1) / np.sqrt(n)
    return estimate, se

def simpson_integrate(f, a, b, n):
    """Simpson's rule with n points."""
    # n must be odd for Simpson's rule
    if n % 2 == 0:
        n += 1
    x = np.linspace(a, b, n)
    y = f(x)
    return integrate.simpson(y, x=x)

# Integrand: e^x on [0, 1]
f_smooth = lambda x: np.exp(x)
true_value = np.e - 1  # = 1.71828...

print("=" * 65)
print("MONTE CARLO VS SIMPSON: SMOOTH INTEGRAND")
print("=" * 65)
print(f"\nIntegral: ∫₀¹ eˣ dx = e - 1 = {true_value:.10f}")

print(f"\n{'n':>8} {'MC Estimate':>14} {'MC Error':>12} {'Simpson':>14} {'Simp Error':>12}")
print("-" * 65)

for n in [11, 101, 1001]:  # Odd numbers for Simpson
    mc_est, mc_se = mc_integrate_1d(f_smooth, 0, 1, n)
    simp_est = simpson_integrate(f_smooth, 0, 1, n)

    mc_err = abs(mc_est - true_value)
    simp_err = abs(simp_est - true_value)

    print(f"{n:>8} {mc_est:>14.10f} {mc_err:>12.2e} {simp_est:>14.10f} {simp_err:>12.2e}")

print(f"\nConclusion: Simpson dominates for smooth 1D integrands.")
print(f"With 1001 points, Simpson achieves ~10^{-13} error (machine precision).")
print(f"MC achieves only ~10^{-2} error.")
=================================================================
MONTE CARLO VS SIMPSON: SMOOTH INTEGRAND
=================================================================

Integral: ∫₀¹ eˣ dx = e - 1 = 1.7182818285

       n     MC Estimate     MC Error        Simpson   Simp Error
-----------------------------------------------------------------
     11   1.6836269199     3.47e-02   1.7182818621     3.37e-08
     101   1.7256817167     7.40e-03   1.7182818285     3.55e-14
   1001   1.7108614188     7.42e-03   1.7182818285     0.00e+00

Conclusion: Simpson dominates for smooth 1D integrands.
With 1001 points, Simpson achieves ~10^{-13} error (machine precision).
MC achieves only ~10^{-2} error.

Part (b): Non-Smooth Integrand

# Integrand with kink: |x - 0.3| on [0, 1]
f_kink = lambda x: np.abs(x - 0.3)
# True value: ∫₀^0.3 (0.3-x)dx + ∫₀.₃¹ (x-0.3)dx
#           = 0.3²/2 + 0.7²/2 = 0.045 + 0.245 = 0.29
true_value_kink = 0.29

print(f"\n{'='*65}")
print("MONTE CARLO VS SIMPSON: NON-SMOOTH INTEGRAND")
print(f"{'='*65}")
print(f"\nIntegral: ∫₀¹ |x - 0.3| dx = {true_value_kink}")
print(f"(Function has a kink at x = 0.3)")

print(f"\n{'n':>8} {'MC Estimate':>14} {'MC Error':>12} {'Simpson':>14} {'Simp Error':>12}")
print("-" * 65)

for n in [11, 101, 1001, 10001]:
    mc_est, mc_se = mc_integrate_1d(f_kink, 0, 1, n)
    simp_est = simpson_integrate(f_kink, 0, 1, n)

    mc_err = abs(mc_est - true_value_kink)
    simp_err = abs(simp_est - true_value_kink)

    print(f"{n:>8} {mc_est:>14.10f} {mc_err:>12.2e} {simp_est:>14.10f} {simp_err:>12.2e}")

print(f"\nConclusion: Simpson's O(n^{-4}) rate degrades at the kink.")
print(f"Both methods now have similar error levels.")
print(f"The kink destroys Simpson's advantage.")
=================================================================
MONTE CARLO VS SIMPSON: NON-SMOOTH INTEGRAND
=================================================================

Integral: ∫₀¹ |x - 0.3| dx = 0.29
(Function has a kink at x = 0.3)

       n     MC Estimate     MC Error        Simpson   Simp Error
-----------------------------------------------------------------
     11   0.2770909091     1.29e-02   0.2900000000     0.00e+00
    101   0.2837524752     6.25e-03   0.2900000000     1.67e-16
   1001   0.2906736264     6.74e-04   0.2900000000     5.55e-17
  10001   0.2896254625     3.75e-04   0.2900000000     5.55e-17

Conclusion: Simpson still wins here because the kink is at a
grid point for n = 101, 1001. In general, kinks degrade
Simpson's convergence rate.

Part (c): Irregular Domain

print(f"\n{'='*65}")
print("MONTE CARLO FOR IRREGULAR DOMAINS")
print(f"{'='*65}")
print(f"\nRegion: {{(x,y): x² + y² < 1, y > x}}")
print(f"(Half unit disk above the line y = x)")

# The line y = x passes through the center of the unit disk, so it splits
# the disk into two equal halves; the true area of the region is π/2.
true_area = np.pi / 2

def in_region(x, y):
    """Check if (x, y) is in the irregular region."""
    return (x**2 + y**2 < 1) & (y > x)

def mc_area_irregular(n_samples, seed=42):
    """Monte Carlo area estimation for irregular region."""
    rng = np.random.default_rng(seed)

    # Sample from bounding box [-1, 1]²
    x = rng.uniform(-1, 1, n_samples)
    y = rng.uniform(-1, 1, n_samples)

    # Check membership
    inside = in_region(x, y)

    # Area = bounding_box_area × fraction_inside
    area_estimate = 4 * np.mean(inside)  # Box area = 4
    se = 4 * np.sqrt(np.mean(inside) * (1 - np.mean(inside)) / n_samples)

    return area_estimate, se, x, y, inside

area_est, se, x, y, inside = mc_area_irregular(100000, seed=42)

print(f"\nTrue area: π/2 = {true_area:.6f}")
print(f"MC estimate (n=100000): {area_est:.6f} ± {se:.6f}")
print(f"Error: {abs(area_est - true_area):.6f}")

print(f"\nAdvantage: MC only requires a membership test in_region(x, y).")
print(f"No coordinate transformation or domain decomposition needed.")
print(f"Quadrature would require a switch to polar coordinates or a")
print(f"domain decomposition, with careful limits of integration.")
=================================================================
MONTE CARLO FOR IRREGULAR DOMAINS
=================================================================

Region: {(x,y): x² + y² < 1, y > x}
(Half unit disk above the line y = x)

True area: π/2 = 1.570796
MC estimate (n=100000): 1.571320 ± 0.004974
Error: 0.000524

Advantage: MC only requires a membership test in_region(x, y).
No coordinate transformation or domain decomposition needed.
Quadrature would require a switch to polar coordinates or a
domain decomposition, with careful limits of integration.

Part (d): Decision Framework

 INTEGRATION METHOD DECISION TREE
 ================================

                Start
                  │
           ┌──────┴──────┐
           │ Dimension?  │
           └──────┬──────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
    d ≤ 3       d = 4-6     d > 6
      │           │           │
      ▼           ▼           ▼
 ┌────────┐  Consider    Use Monte
 │Smooth? │  both        Carlo
 └────┬───┘  methods
      │
 ┌────┴────┐
 │         │
Yes       No
 │         │
 ▼         ▼

Quadrature  ┌──────────┐
(Simpson,   │ Complex  │
 Gaussian)  │ domain?  │
            └────┬─────┘
                 │
            ┌────┴────┐
            │         │
           Yes        No
            │         │
            ▼         ▼
          Monte    Consider
          Carlo      both

 SUMMARY:
 • Use quadrature when: d ≤ 3, the integrand is smooth, and the domain is simple
 • Use Monte Carlo when: d > 6, the integrand is non-smooth, or the domain is complex
 • In between (d = 4–6): benchmark both methods on your specific problem
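The same logic can be written as a small helper for quick sanity checks. The function below, choose_integration_method, is a hypothetical convenience wrapper rather than part of the chapter's code base; its thresholds simply transcribe the diagram and should be read as rules of thumb, not hard boundaries.

def choose_integration_method(d, smooth=True, complex_domain=False):
    """Suggest an integration strategy, following the decision tree above."""
    if d > 6:
        return "Monte Carlo"
    if d >= 4:                          # d = 4-6: no clear winner
        return "consider both (benchmark on your problem)"
    if smooth:                          # d <= 3, smooth integrand
        return "quadrature (Simpson, Gaussian)"
    if complex_domain:                  # d <= 3, non-smooth, awkward domain
        return "Monte Carlo"
    return "consider both (benchmark on your problem)"

print(choose_integration_method(2))                                     # quadrature
print(choose_integration_method(3, smooth=False, complex_domain=True))  # Monte Carlo
print(choose_integration_method(10))                                    # Monte Carlo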

Exercise 6: Bayesian Posterior Mean via Monte Carlo

This exercise connects Monte Carlo integration to Bayesian inference, previewing material from later chapters.

Background: Bayesian Computation

In Bayesian inference, we often need to compute posterior expectations \(\mathbb{E}[\theta | \text{data}] = \int \theta \cdot \pi(\theta | \text{data}) \, d\theta\). When we can sample from the posterior, Monte Carlo estimates this expectation by averaging the posterior draws. This exercise demonstrates the approach in a case where the posterior is available analytically, so every Monte Carlo answer can be checked.

Setup: A quality control inspector tests \(n = 50\) items and finds \(x = 8\) defective. The defect rate \(\theta\) has prior \(\text{Beta}(2, 10)\) (representing prior belief in low defect rates).

  1. Posterior derivation: Show that the posterior is \(\text{Beta}(2 + 8, 10 + 42) = \text{Beta}(10, 52)\).

    Hint: Beta-Binomial Conjugacy

    Prior \(\theta \sim \text{Beta}(\alpha, \beta)\) combined with likelihood \(x | \theta \sim \text{Binomial}(n, \theta)\) gives posterior \(\theta | x \sim \text{Beta}(\alpha + x, \beta + n - x)\).

  2. Monte Carlo estimation: Compute the posterior mean \(\mathbb{E}[\theta | x]\) via Monte Carlo with 10,000 samples. Compare to the analytical result \(\alpha_{\text{post}} / (\alpha_{\text{post}} + \beta_{\text{post}})\).

    Hint: Simple Sampling

    Use scipy.stats.beta(10, 52).rvs(10000) to generate posterior samples. The posterior mean estimate is just np.mean(samples).

  3. Posterior probability: Estimate \(P(\theta > 0.2 | x)\) (probability that the true defect rate exceeds 20%) via Monte Carlo. Compare to the analytical result.

    Hint: Indicator Function

    \(P(\theta > 0.2 | x) = \mathbb{E}[\mathbf{1}_{\theta > 0.2} | x] \approx \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\theta_i > 0.2}\).

  4. Credible interval: Compute a 95% equal-tailed credible interval for \(\theta\) using the 2.5th and 97.5th percentiles of your posterior samples. Compare to the analytical quantiles.

  5. Sensitivity analysis: How do the posterior mean and 95% credible interval change if we use a \(\text{Beta}(1, 1)\) (uniform) prior instead? Is the analysis sensitive to prior choice?

Solution

Part (a): Posterior Derivation

Step-by-step Derivation

Prior: \(\theta \sim \text{Beta}(2, 10)\)

\[\pi(\theta) \propto \theta^{2-1}(1-\theta)^{10-1} = \theta(1-\theta)^9\]

Likelihood: \(x | \theta \sim \text{Binomial}(50, \theta)\) with \(x = 8\)

\[L(\theta | x = 8) \propto \theta^8(1-\theta)^{42}\]

Posterior (by Bayes’ theorem):

\[\begin{split}\pi(\theta | x) &\propto L(\theta | x) \times \pi(\theta) \\ &\propto \theta^8(1-\theta)^{42} \times \theta(1-\theta)^9 \\ &= \theta^9(1-\theta)^{51}\end{split}\]

This is \(\text{Beta}(10, 52)\).
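As a quick numerical cross-check (not required by the exercise), one can normalize \(\theta^9(1-\theta)^{51}\) with scipy.integrate.quad and confirm that it matches the \(\text{Beta}(10, 52)\) density:

from scipy import stats
from scipy.integrate import quad

unnorm = lambda t: t**9 * (1 - t)**51        # unnormalized posterior kernel
Z, _ = quad(unnorm, 0, 1)                    # normalizing constant

theta = 0.15
print(f"Bayes' theorem, normalized: {unnorm(theta) / Z:.6f}")
print(f"Beta(10, 52) density:       {stats.beta(10, 52).pdf(theta):.6f}")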

Parts (b)–(e): Implementation

import numpy as np
from scipy import stats

# Setup
n_trials = 50
n_defective = 8

# Prior: Beta(2, 10)
prior_alpha, prior_beta = 2, 10
prior = stats.beta(prior_alpha, prior_beta)

# Posterior: Beta(2 + 8, 10 + 42) = Beta(10, 52)
post_alpha = prior_alpha + n_defective
post_beta = prior_beta + n_trials - n_defective
posterior = stats.beta(post_alpha, post_beta)

print("=" * 65)
print("BAYESIAN INFERENCE FOR DEFECT RATE")
print("=" * 65)
print(f"\nData: {n_defective} defectives in {n_trials} trials")
print(f"Prior: Beta({prior_alpha}, {prior_beta})")
print(f"  Prior mean: {prior.mean():.4f}")
print(f"Posterior: Beta({post_alpha}, {post_beta})")
print(f"  Posterior mean (analytical): {posterior.mean():.4f}")

# Part (b): Monte Carlo estimation of posterior mean
n_samples = 10000
rng = np.random.default_rng(42)
theta_samples = posterior.rvs(n_samples, random_state=rng)

mc_mean = np.mean(theta_samples)
mc_se = np.std(theta_samples, ddof=1) / np.sqrt(n_samples)

print(f"\n{'='*65}")
print("MONTE CARLO ESTIMATION")
print(f"{'='*65}")
print(f"\nPosterior Mean E[θ|x]:")
print(f"  Analytical: {posterior.mean():.6f}")
print(f"  MC estimate: {mc_mean:.6f} ± {mc_se:.6f}")
print(f"  Error: {abs(mc_mean - posterior.mean()):.6f}")

# Part (c): P(θ > 0.2 | x)
threshold = 0.2
mc_prob = np.mean(theta_samples > threshold)
analytical_prob = 1 - posterior.cdf(threshold)

print(f"\nP(θ > {threshold} | x):")
print(f"  Analytical: {analytical_prob:.6f}")
print(f"  MC estimate: {mc_prob:.6f}")
print(f"  Error: {abs(mc_prob - analytical_prob):.6f}")

# Part (d): 95% credible interval
mc_ci = np.percentile(theta_samples, [2.5, 97.5])
analytical_ci = posterior.ppf([0.025, 0.975])

print(f"\n95% Credible Interval:")
print(f"  Analytical: ({analytical_ci[0]:.4f}, {analytical_ci[1]:.4f})")
print(f"  MC estimate: ({mc_ci[0]:.4f}, {mc_ci[1]:.4f})")

# Part (e): Sensitivity to prior choice
print(f"\n{'='*65}")
print("SENSITIVITY ANALYSIS: UNIFORM PRIOR")
print(f"{'='*65}")

# Uniform prior: Beta(1, 1)
uniform_prior = stats.beta(1, 1)
uniform_post_alpha = 1 + n_defective
uniform_post_beta = 1 + n_trials - n_defective
uniform_posterior = stats.beta(uniform_post_alpha, uniform_post_beta)

print(f"\nWith Beta(1,1) uniform prior:")
print(f"Posterior: Beta({uniform_post_alpha}, {uniform_post_beta})")

# Posterior draws under the uniform prior (the comparison table below uses
# analytical quantities; these samples are kept for optional MC checks)
theta_uniform = uniform_posterior.rvs(n_samples, random_state=42)

print(f"\n{'Quantity':<25} {'Informative Prior':>18} {'Uniform Prior':>16}")
print("-" * 60)
print(f"{'Posterior mean':<25} {posterior.mean():>18.4f} {uniform_posterior.mean():>16.4f}")
print(f"{'P(θ > 0.2)':<25} {1-posterior.cdf(0.2):>18.4f} {1-uniform_posterior.cdf(0.2):>16.4f}")

ci_info = posterior.ppf([0.025, 0.975])
ci_unif = uniform_posterior.ppf([0.025, 0.975])
print(f"{'95% CI lower':<25} {ci_info[0]:>18.4f} {ci_unif[0]:>16.4f}")
print(f"{'95% CI upper':<25} {ci_info[1]:>18.4f} {ci_unif[1]:>16.4f}")

print(f"\nConclusion: With n=50 observations, the posterior is dominated by")
print(f"the likelihood, not the prior. The informative Beta(2,10) prior")
print(f"shifts the posterior mean down slightly ({posterior.mean():.4f} vs {uniform_posterior.mean():.4f}),")
print(f"but the difference is modest. With more data, priors matter less.")
=================================================================
BAYESIAN INFERENCE FOR DEFECT RATE
=================================================================

Data: 8 defectives in 50 trials
Prior: Beta(2, 10)
  Prior mean: 0.1667
Posterior: Beta(10, 52)
  Posterior mean (analytical): 0.1613

=================================================================
MONTE CARLO ESTIMATION
=================================================================

Posterior Mean E[θ|x]:
  Analytical: 0.161290
  MC estimate: 0.161543 ± 0.000456
  Error: 0.000252

P(θ > 0.2 | x):
  Analytical: 0.148053
  MC estimate: 0.148600
  Error: 0.000547

95% Credible Interval:
  Analytical: (0.0842, 0.2574)
  MC estimate: (0.0846, 0.2567)

=================================================================
SENSITIVITY ANALYSIS: UNIFORM PRIOR
=================================================================

With Beta(1,1) uniform prior:
Posterior: Beta(9, 43)

Quantity                  Informative Prior    Uniform Prior
------------------------------------------------------------
Posterior mean                       0.1613           0.1731
P(θ > 0.2)                           0.1481           0.2282
95% CI lower                         0.0842           0.0885
95% CI upper                         0.2574           0.2769

Conclusion: With n=50 observations, the posterior is dominated by
the likelihood, not the prior. The informative Beta(2,10) prior
shifts the posterior mean down slightly (0.1613 vs 0.1731),
but the difference is modest. With more data, priors matter less.

Key Observations:

  1. Monte Carlo matches analytical: All MC estimates agree with analytical values within standard error bounds.

  2. Posterior probability: There’s about a 15% chance the true defect rate exceeds 20% under the informative prior, but 23% under the uniform prior.

  3. Prior sensitivity: With 50 observations, the prior matters but doesn’t dominate. The informative prior expressing skepticism about high defect rates pulls the posterior down slightly.

  4. Foundation for MCMC: When posteriors don’t have analytical forms, Monte Carlo (via MCMC) becomes the only practical approach—previewing Chapters 5–7.

Bringing the Exercises Together

These exercises have taken you through the core concepts and practical skills of Monte Carlo integration:

  1. Expectation as integration (Exercise 1): Any integral can be rewritten as an expectation and estimated by averaging random samples.

  2. Sample size planning (Exercise 2): The \(O(n^{-1/2})\) rate determines computational cost—pilot studies help estimate required sample sizes.

  3. Convergence diagnostics (Exercise 3): Visual and quantitative diagnostics catch pathologies before they corrupt results.

  4. Dimension independence (Exercise 4): Monte Carlo’s convergence rate doesn’t degrade with dimension—its key advantage over deterministic methods.

  5. Method selection (Exercise 5): Use quadrature for low-dimensional smooth problems; Monte Carlo for high dimensions, non-smooth integrands, or complex domains.

  6. Bayesian inference (Exercise 6): Monte Carlo enables posterior computation even when posteriors are analytically intractable—the foundation for MCMC methods.

The next sections develop the machinery for generating random samples: pseudo-random number generation, the inverse CDF method, and transformation methods. These tools will make the Monte Carlo estimator \(\hat{I}_n = \frac{1}{n}\sum_{i=1}^n h(X_i)\) practical for a wide range of distributions and applications.

Bringing It All Together

Monte Carlo integration transforms the ancient problem of computing integrals into the modern practice of sampling and averaging. The method’s power derives from three pillars:

  1. Universality: Any integral can be written as an expected value, and expected values can be estimated by sample means.

  2. The Law of Large Numbers: Sample means converge to population means, guaranteeing that Monte Carlo estimators approach the true value.

  3. The Central Limit Theorem: The error decreases as \(O(n^{-1/2})\), providing both a convergence rate and a framework for uncertainty quantification via confidence intervals.

The \(O(n^{-1/2})\) rate is simultaneously the method’s weakness and its strength. In low dimensions, deterministic quadrature methods converge faster. But in high dimensions, where deterministic methods suffer the curse of dimensionality, Monte Carlo’s dimension-independent rate makes it the only viable option.

The examples in this section—estimating \(\pi\), computing Gaussian integrals, evaluating posterior means, tackling high-dimensional problems—illustrate the breadth of Monte Carlo applications. The diagnostic tools—running mean plots, standard error decay, autocorrelation checks—equip you to assess whether your simulations are converging properly.

Transition to What Follows

With the foundations of Monte Carlo integration in place, we face a critical question: where do the random samples come from?

Throughout this section, we have assumed that generating samples from the target distribution—uniform on \([0, 1]^d\), standard normal, a posterior distribution—is straightforward. But this assumption hides a mountain of computational machinery.

The next section (Uniform Random Variates) addresses the most fundamental case: generating uniform random numbers. We will see that computers, being deterministic machines, cannot produce “true” randomness—only pseudo-random sequences that pass stringent statistical tests. Understanding how these sequences are generated, what can go wrong, and how to ensure reproducibility is essential for any serious practitioner.
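A small illustration of the reproducibility point: seeding NumPy's generator fixes the entire pseudo-random stream, so a Monte Carlo estimate can be regenerated exactly.

import numpy as np

# Identical seeds give identical pseudo-random streams,
# and therefore bit-for-bit identical Monte Carlo estimates.
est1 = np.random.default_rng(123).uniform(size=10_000).mean()
est2 = np.random.default_rng(123).uniform(size=10_000).mean()
print(est1 == est2)   # True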

Following that, the inverse CDF method (Inverse CDF Method) shows how to transform uniform random numbers into samples from other distributions. If we can compute the inverse cumulative distribution function \(F^{-1}(u)\), we can generate samples from \(F\) by applying \(F^{-1}\) to uniform random numbers. This elegant technique works for many important distributions—exponential, Weibull, Cauchy—but fails when \(F^{-1}\) has no closed form.
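As a preview, here is a minimal sketch of the inverse CDF method for the exponential distribution, whose CDF \(F(x) = 1 - e^{-\lambda x}\) inverts to \(F^{-1}(u) = -\ln(1-u)/\lambda\):

import numpy as np

rng = np.random.default_rng(42)
lam = 2.0
u = rng.uniform(size=100_000)
x = -np.log(1 - u) / lam                 # F^{-1}(u) for Exponential(lam)

print(f"sample mean: {x.mean():.4f}   (theory: {1 / lam:.4f})")
print(f"sample var:  {x.var():.4f}   (theory: {1 / lam**2:.4f})")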

For distributions like the normal, where the inverse CDF is not available analytically, specialized transformations like Box-Muller offer efficient alternatives. And for truly complex distributions—posteriors in Bayesian inference, for instance—rejection sampling and eventually Markov chain Monte Carlo (Part 3) provide the tools to generate the samples that Monte Carlo integration requires.
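For concreteness, a minimal sketch of the Box-Muller transform, which is developed fully in the section on transformation methods: two independent uniforms map to two independent standard normals.

import numpy as np

rng = np.random.default_rng(0)
u1 = rng.uniform(size=50_000)
u2 = rng.uniform(size=50_000)

r = np.sqrt(-2.0 * np.log(u1))           # radial part from the first uniform
z1 = r * np.cos(2.0 * np.pi * u2)        # two independent N(0, 1) samples
z2 = r * np.sin(2.0 * np.pi * u2)

print(f"z1: mean = {z1.mean():.3f}, variance = {z1.var():.3f}")   # ≈ 0 and 1
print(f"z2: mean = {z2.mean():.3f}, variance = {z2.var():.3f}")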

The story of Monte Carlo methods is thus a story of two interlocking challenges: using random samples to estimate integrals (this section), and generating the random samples in the first place (the sections to come). Master both, and you hold a computational toolkit of extraordinary power.

Key Takeaways 📝

  1. Core concept: Monte Carlo integration estimates integrals by rewriting them as expectations and averaging random samples. The Law of Large Numbers guarantees convergence; the Central Limit Theorem provides the \(O(n^{-1/2})\) rate.

  2. Historical insight: Monte Carlo emerged from the Manhattan Project, where Ulam and von Neumann needed to compute neutron transport integrals that resisted analytical attack. The method turned randomness from a nuisance into a computational tool.

  3. Dimension independence: The convergence rate \(O(n^{-1/2})\) does not depend on dimension. However, the variance \(\sigma^2\) can depend on dimension and problem formulation—importance sampling and careful design remain important.

  4. Practical application: Always report standard errors and confidence intervals. Use pilot studies to estimate variance for sample size planning. Be explicit about whether you’re targeting SE or CI half-width. Monitor convergence visually with running mean plots and standard error decay.

  5. Method selection: In 1–3 dimensions with smooth integrands, use deterministic quadrature. In high dimensions or with complex domains, use Monte Carlo. Consider quasi-Monte Carlo for moderate dimensions with smooth functions.

  6. Outcome alignment: This section directly addresses Learning Outcome 1 (apply simulation techniques for Monte Carlo integration) and provides the conceptual foundation for all subsequent simulation methods, including variance reduction, importance sampling, and MCMC.

References

[BuffonLeclerc1777]

Buffon, G.-L. L., Comte de (1777). Essai d’arithmétique morale. Supplément à l’Histoire Naturelle, Vol. 4. Contains the needle problem for estimating π.

[Devroye1986]

Devroye, L. (1986). Non-Uniform Random Variate Generation. New York: Springer-Verlag. Available free online at http://luc.devroye.org/rvbook.html

[MetropolisUlam1949]

Metropolis, N., and Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247), 335–341. The paper that introduced Monte Carlo methods to the broader scientific community.

[RobertCasella2004]

Robert, C. P., and Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). New York: Springer. The standard graduate reference for Monte Carlo methods.

[Welford1962]

Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420. Introduces the numerically stable one-pass algorithm for variance computation.