2.1.1. Monte Carlo Fundamentals
In the spring of 1946, the mathematician Stanislaw Ulam was recovering from a near-fatal case of viral encephalitis at his home in Los Angeles. To pass the time during his convalescence, he played countless games of solitaire—and found himself wondering: what is the probability of winning a game of Canfield solitaire? The combinatorics were hopelessly complex. There were too many possible configurations, too many branching paths through a game, to enumerate them all. But Ulam realized something profound: he didn’t need to enumerate every possibility. He could simply play a hundred games and count how many he won.
This insight—that we can estimate probabilities by running experiments rather than computing them analytically—was not new. The Comte de Buffon had used a similar approach in 1777 to estimate π by dropping needles onto a lined floor. But Ulam saw something that Buffon could not have imagined: the recently completed ENIAC computer could “play” millions of such games, transforming a parlor trick into a serious computational method. Within weeks, Ulam had discussed the idea with his colleague John von Neumann, and the two began developing what would become one of the most powerful computational frameworks in all of science.
They needed a code name for this method, which they were applying to classified problems in nuclear weapons design at Los Alamos. Nicholas Metropolis suggested “Monte Carlo,” after the famous casino in Monaco where Ulam’s uncle had a gambling habit. The name stuck, and with it, a new era in computational science began.
This chapter introduces Monte Carlo methods—a family of algorithms that use random sampling to solve problems that would otherwise be intractable. We will see how randomness, properly harnessed, becomes a precision instrument for computing integrals, estimating probabilities, and approximating quantities that resist analytical attack. The ideas are simple, but their power is immense: Monte Carlo methods now pervade physics, finance, machine learning, and statistics, anywhere that high-dimensional integration or complex probability calculations arise.
Road Map 🧭
Understand: The fundamental principle that Monte Carlo integration estimates integrals as expectations of random samples, and why this works via the Law of Large Numbers and Central Limit Theorem
Develop: Deep intuition for the \(O(n^{-1/2})\) convergence rate—what it means, why it arises, and its remarkable dimension-independence
Implement: Complete Python code for Monte Carlo estimation with variance quantification, confidence intervals, and convergence diagnostics
Evaluate: When Monte Carlo methods outperform deterministic alternatives, and how to assess estimation quality in practice
Connect: How Monte Carlo integration motivates the random variable generation techniques of subsequent sections
The Historical Development of Monte Carlo Methods
Before diving into the mathematics, it is worth understanding how Monte Carlo methods emerged and evolved. This history illuminates why the methods work, what problems motivated their development, and why they remain central to computational science today.
Fig. 2.1 Historical Evolution of Monte Carlo Methods. This timeline traces 250 years of algorithmic innovation, from Buffon’s needle experiment in 1777 through the founding contributions of Ulam and von Neumann in the 1940s, the development of MCMC methods, resampling techniques, and modern neural-enhanced approaches. Each innovation opened new classes of problems to computational attack.
Buffon’s Needle: The First Monte Carlo Experiment
In 1777, Georges-Louis Leclerc, Comte de Buffon, posed a deceptively simple question: suppose we have a floor made of parallel wooden planks, each of width \(d\), and we drop a needle of length \(\ell \leq d\) onto this floor. What is the probability that the needle crosses one of the cracks between planks?
To answer this, Buffon introduced what we would now recognize as a probabilistic model. Let \(\theta\) denote the angle between the needle and the direction of the planks, uniformly distributed on \([0, \pi)\). Let \(y\) denote the distance from the needle’s center to the nearest crack, uniformly distributed on \([0, d/2]\). The needle crosses a crack if and only if the vertical projection of half the needle exceeds the distance to the crack—that is, if \(y \leq \frac{\ell}{2} \sin\theta\).
Fig. 2.2 Buffon’s Needle Geometry. Left: The physical setup with needles scattered across parallel planks (red = crossing, blue = not crossing). Middle: The key geometry—a needle crosses if its center’s distance \(y\) to the nearest crack is less than the vertical projection \((\ell/2)\sin\theta\). Right: The probability space showing the crossing region; the probability equals the ratio of areas: \(2\ell/(\pi d)\).
The probability of crossing is therefore:
\[
P(\text{cross}) = \int_0^{\pi} \int_0^{(\ell/2)\sin\theta} \frac{2}{d} \, dy \, \frac{1}{\pi} \, d\theta.
\]
Evaluating the inner integral yields \(\frac{\ell}{2}\sin\theta\), and the outer integral gives \(\int_0^{\pi} \sin\theta \, d\theta = 2\). Thus:
\[
P(\text{cross}) = \frac{1}{\pi} \cdot \frac{2}{d} \cdot \frac{\ell}{2} \cdot 2 = \frac{2\ell}{\pi d}.
\]
This elegant result has a remarkable consequence. If we drop \(n\) needles and observe that \(k\) of them cross a crack, then our estimate of the crossing probability is \(\hat{p} = k/n\). Rearranging Buffon’s formula:
\[
\hat{\pi} = \frac{2\ell}{d\,\hat{p}} = \frac{2\ell n}{d k}.
\]
We can estimate \(\pi\) by throwing needles!
This is a Monte Carlo method avant la lettre: we use random experiments to estimate a deterministic quantity. Of course, Buffon lacked computers, and actually throwing thousands of needles by hand is tedious. In 1901, the Italian mathematician Mario Lazzarini claimed to have obtained \(\pi \approx 3.1415929\) from 3,408 needle throws, a result that agrees exactly with the famous rational approximation \(355/113\). Most historians believe Lazzarini fudged his data, but the underlying principle was sound.
Try It Yourself 🖥️ Buffon’s Needle Simulation
Experience Buffon’s experiment interactively:
Interactive Simulation: https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/buffons_needle_simulation.html
Watch how the \(\pi\) estimate fluctuates wildly with few needles, then gradually stabilizes as you accumulate thousands of throws. This is the Law of Large Numbers in action—a theme we will return to throughout this chapter.
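If you would rather run the experiment in code, here is a minimal NumPy sketch of Buffon's experiment; the function name and default lengths are our own choices for illustration, not part of the interactive page.
import numpy as np

def buffon_pi(n_needles, needle_len=1.0, plank_width=1.0, seed=None):
    """Estimate π from Buffon's needle; assumes needle_len <= plank_width."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, np.pi, n_needles)         # angle between needle and planks
    y = rng.uniform(0, plank_width / 2, n_needles)   # distance from center to nearest crack
    p_hat = np.mean(y <= (needle_len / 2) * np.sin(theta))
    # Invert P(cross) = 2ℓ/(πd):  π ≈ 2ℓ / (d · p̂)
    return 2 * needle_len / (plank_width * p_hat)

print(f"π estimate from 1,000,000 needles: {buffon_pi(1_000_000, seed=1777):.5f}")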
Fermi’s Envelope Calculations
The physicist Enrico Fermi was famous for his ability to estimate quantities that seemed impossibly difficult to calculate. How many piano tuners are there in Chicago? How much energy is released in a nuclear explosion? Fermi would break these problems into pieces, estimate each piece roughly, and multiply—often achieving answers accurate to within an order of magnitude.
Less well known is that Fermi also pioneered proto-Monte Carlo methods. In the 1930s, working on neutron diffusion problems in Rome, Fermi developed a mechanical device—essentially a specialized slide rule—that could generate random numbers to simulate neutron paths through matter. He used these simulations to estimate quantities like neutron absorption cross-sections, which were too complex to compute analytically.
This work remained largely unpublished, but it anticipated the key insight of Monte Carlo: when a deterministic calculation is intractable, a stochastic simulation may succeed. Fermi’s physical random number generator was crude, but the principle was the same one that Ulam and von Neumann would later implement on electronic computers.
The Manhattan Project and ENIAC
The development of nuclear weapons during World War II created an urgent need for computational methods. The behavior of neutrons in a nuclear reaction—how they scatter, slow down, and trigger fission—depends on complex integrals over energy and angle that resist analytical solution. The physicists at Los Alamos needed numbers, not theorems.
It was in this context that Ulam’s solitaire insight proved transformative. Ulam and von Neumann realized that the same principle—estimate a complicated quantity by averaging over random samples—could be applied to neutron transport. Instead of integrating over all possible neutron paths analytically (impossible), they could simulate thousands of individual neutrons, tracking each one as it scattered and absorbed through the weapon’s core.
Von Neumann took the lead in implementing these ideas on ENIAC, one of the first general-purpose electronic computers. ENIAC could perform about 5,000 operations per second—glacially slow by modern standards, but revolutionary in 1946. Von Neumann and his team programmed ENIAC to simulate neutron histories, and the results helped validate the design of thermonuclear weapons.
The “Monte Carlo method” was formally introduced to the broader scientific community in a 1949 paper by Metropolis and Ulam, though much of the early work remained classified for decades. The name, coined by Metropolis, captured both the element of chance central to the method and the slightly disreputable excitement of gambling—a fitting tribute to Ulam’s card-playing origins.
Why “Monte Carlo” Changed Everything
The Monte Carlo revolution was not merely about having faster computers. It represented a conceptual breakthrough: randomness is a computational resource. By embracing uncertainty rather than fighting it, Monte Carlo methods could attack problems that deterministic methods could not touch.
Consider the challenge of computing a 100-dimensional integral. Deterministic quadrature methods—the trapezoidal rule, Simpson’s rule, Gaussian quadrature—all suffer from the “curse of dimensionality.” If we use \(n\) points per dimension, we need \(n^{100}\) total evaluations. Even with \(n = 2\), this exceeds \(10^{30}\)—more function evaluations than atoms in a human body.
Monte Carlo methods sidestep this curse entirely. As we will see, the error in a Monte Carlo estimate depends only on the number of samples and the variance of the integrand, not on the dimension of the space. A 100-dimensional integral is no harder than a one-dimensional integral, at least in terms of convergence rate. This dimension-independence is the source of Monte Carlo’s power.
The Core Principle: Expectation as Integration
We now turn to the mathematical foundations of Monte Carlo integration. The key insight is simple but profound: any integral can be rewritten as an expected value, and expected values can be estimated by averaging samples.
From Integrals to Expectations
Consider a general integral of the form:
\[
I = \int_{\mathcal{X}} h(x) \, dx,
\]
where \(\mathcal{X} \subseteq \mathbb{R}^d\) is the domain of integration and \(h: \mathcal{X} \to \mathbb{R}\) is the function we wish to integrate. At first glance, this seems like a problem for calculus, not probability. But watch what happens when we introduce a probability density.
Let \(f(x)\) be any probability density function on \(\mathcal{X}\)—that is, \(f(x) \geq 0\) everywhere, \(\int_{\mathcal{X}} f(x) \, dx = 1\), and \(f(x) > 0\) wherever \(h(x) \neq 0\). We can rewrite our integral as:
\[
I = \int_{\mathcal{X}} \frac{h(x)}{f(x)} \, f(x) \, dx = \mathbb{E}_f\!\left[\frac{h(X)}{f(X)}\right],
\]
where the expectation is taken over a random variable \(X\) with density \(f\). We have transformed an integral into an expected value!
The simplest choice is the uniform density on \(\mathcal{X}\). If \(\mathcal{X}\) has finite volume \(V = \int_{\mathcal{X}} dx\), then \(f(x) = 1/V\) is a valid density, and:
\[
I = V \, \mathbb{E}[h(X)], \qquad X \sim \text{Uniform}(\mathcal{X}).
\]
For example, to compute \(\int_0^1 e^{-x^2} dx\), we write:
\[
\int_0^1 e^{-x^2} \, dx = \mathbb{E}\!\left[e^{-U^2}\right], \qquad U \sim \text{Uniform}(0, 1).
\]
This rewriting is always possible. But why is it useful?
The Monte Carlo Estimator
The power of the expectation formulation becomes clear when we recall the Law of Large Numbers. If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (iid) with \(\mathbb{E}[X_i] = \mu\), then the sample mean converges to the true mean:
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \longrightarrow \mu \quad \text{as } n \to \infty.
\]
Applied to our integral:
\[
\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} h(X_i) \longrightarrow \mathbb{E}_f[h(X)] = I.
\]
Definition: Monte Carlo Estimator
Let \(X_1, X_2, \ldots, X_n\) be iid samples from a density \(f\) on \(\mathcal{X}\). The Monte Carlo estimator of the integral \(I = \int_{\mathcal{X}} h(x) f(x) \, dx = \mathbb{E}_f[h(X)]\) is:
\[
\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} h(X_i).
\]
More generally, for \(I = \int_{\mathcal{X}} g(x) \, dx\) where we sample from density \(f\):
\[
\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)}, \qquad X_i \overset{\text{iid}}{\sim} f.
\]
The Monte Carlo method is disarmingly simple: draw random samples, evaluate the function at each sample, and average the results. No derivatives, no quadrature weights, no mesh generation—just sampling and averaging.
But this simplicity conceals depth. The choice of sampling density \(f\) is entirely up to us, and different choices lead to dramatically different performance. We will explore this in the section on importance sampling; for now, we focus on the simplest cases, where \(f\) is either a density already present in the integrand or the uniform density on the domain.
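As a minimal sketch of the estimator in action, the following estimates the earlier example \(\int_0^1 e^{-x^2} dx = \mathbb{E}[e^{-U^2}]\) from uniform samples; the comparison against the error-function closed form is our own addition.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, 100_000)       # U ~ Uniform(0, 1); here V = 1
estimate = np.mean(np.exp(-u**2))        # (1/n) Σ h(U_i)
exact = np.sqrt(np.pi) / 2 * erf(1.0)    # closed form, for comparison only
print(f"MC estimate: {estimate:.5f}   exact: {exact:.5f}")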
A First Example: Estimating \(\pi\)
Let us return to the problem of estimating \(\pi\), now with Monte Carlo machinery. Consider the integral:
\[
\pi = \int_{-1}^{1} \int_{-1}^{1} \mathbb{1}\{x^2 + y^2 \leq 1\} \, dx \, dy.
\]
This is the area of the unit disk. Rewriting as an expectation:
\[
\pi = 4 \, \mathbb{E}\!\left[\mathbb{1}\{X^2 + Y^2 \leq 1\}\right], \qquad X, Y \overset{\text{iid}}{\sim} \text{Uniform}(-1, 1).
\]
The factor of 4 accounts for the area of the square \([-1,1]^2\). The Monte Carlo estimator is:
\[
\hat{\pi}_n = \frac{4}{n} \sum_{i=1}^{n} \mathbb{1}\{X_i^2 + Y_i^2 \leq 1\}.
\]
That is: generate \(n\) uniform points in the square, count how many fall inside the unit circle, and multiply by 4.
The geometric intuition is immediate: the ratio of points landing inside the circle to total points approximates the ratio of areas, \(\pi/4\).
Fig. 2.3 Monte Carlo π Estimation. Left: Blue points fall inside the unit circle; coral points fall outside. The ratio of blue to total points estimates \(\pi/4\). Right: The running estimate stabilizes as samples accumulate—the characteristic “noisy convergence” of Monte Carlo.
import numpy as np
def estimate_pi_monte_carlo(n_samples, seed=None):
"""
Estimate π using Monte Carlo integration.
Parameters
----------
n_samples : int
Number of random points to generate.
seed : int, optional
Random seed for reproducibility.
Returns
-------
dict
Contains estimate, standard_error, and confidence interval.
"""
rng = np.random.default_rng(seed)
# Generate uniform points in [-1, 1]²
x = rng.uniform(-1, 1, n_samples)
y = rng.uniform(-1, 1, n_samples)
# Count points inside unit circle
inside = (x**2 + y**2) <= 1
# Monte Carlo estimate
p_hat = np.mean(inside)
pi_hat = 4 * p_hat
# Standard error (indicator has Bernoulli variance p(1-p))
se_p = np.sqrt(p_hat * (1 - p_hat) / n_samples)
se_pi = 4 * se_p
# 95% confidence interval
ci = (pi_hat - 1.96 * se_pi, pi_hat + 1.96 * se_pi)
return {
'estimate': pi_hat,
'standard_error': se_pi,
'ci_95': ci,
'n_samples': n_samples
}
# Run the estimation
result = estimate_pi_monte_carlo(100_000, seed=42)
print(f"π estimate: {result['estimate']:.6f}")
print(f"True π: {np.pi:.6f}")
print(f"Error: {abs(result['estimate'] - np.pi):.6f}")
print(f"Std Error: {result['standard_error']:.6f}")
print(f"95% CI: ({result['ci_95'][0]:.6f}, {result['ci_95'][1]:.6f})")
π estimate: 3.143080
True π: 3.141593
Error: 0.001487
Std Error: 0.005190
95% CI: (3.132908, 3.153252)
The true value \(\pi\) lies comfortably within the 95% confidence interval. With a million samples, the error shrinks by a factor of \(\sqrt{10} \approx 3.16\), and with ten million, by another factor of \(\sqrt{10}\).
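A quick way to check the \(\sqrt{10}\) claim empirically is to rerun estimate_pi_monte_carlo with increasing sample sizes; the exact numbers will vary with the seed.
# SE should shrink by roughly sqrt(10) for each tenfold increase in n
for n in [100_000, 1_000_000, 10_000_000]:
    r = estimate_pi_monte_carlo(n, seed=42)
    print(f"n = {n:>10,}   SE = {r['standard_error']:.6f}")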
Why Bernoulli variance? The comment in the code mentions “Bernoulli variance”—let’s unpack this. Each random point either lands inside the circle (we record \(I_i = 1\)) or outside (we record \(I_i = 0\)). This is a Bernoulli trial with success probability \(p = \pi/4\), which is the ratio of the circle’s area to the square’s area. For any Bernoulli random variable:
\[
\text{Var}(I_i) = p(1 - p).
\]
Our estimator \(\hat{p} = \frac{1}{n}\sum_{i=1}^n I_i\) is the sample proportion of points inside the circle. Since the \(I_i\) are independent:
\[
\text{Var}(\hat{p}) = \frac{p(1-p)}{n}.
\]
The standard error is the square root: \(\text{SE}(\hat{p}) = \sqrt{p(1-p)/n}\). Since \(\hat{\pi} = 4\hat{p}\), the standard error of \(\hat{\pi}\) scales by 4:
\[
\text{SE}(\hat{\pi}) = 4 \sqrt{\frac{p(1-p)}{n}}.
\]
We don’t know the true \(p\), so we substitute our estimate \(\hat{p}\) to get a usable formula. This Bernoulli structure is a special case of the general Monte Carlo variance formula—whenever the function \(h(x)\) is an indicator (0 or 1), the variance simplifies to the familiar \(p(1-p)\) form.
This is Monte Carlo at its most basic: evaluate a simple function at random points and average. But even this toy example illustrates the key features of the method—ease of implementation, probabilistic error bounds, and graceful scaling with sample size.
Theoretical Foundations
Why does the Monte Carlo method work? What determines the rate of convergence? These questions have precise mathematical answers rooted in classical probability theory.
The Law of Large Numbers
The Law of Large Numbers (LLN) is the foundational result guaranteeing that Monte Carlo estimators converge to the true value. There are several versions; we state the strongest form.
Theorem: Strong Law of Large Numbers
Let \(X_1, X_2, \ldots\) be independent and identically distributed random variables with \(\mathbb{E}[|X_1|] < \infty\). Then:
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mathbb{E}[X_1] \quad \text{as } n \to \infty.
\]
The notation \(\xrightarrow{\text{a.s.}}\) denotes almost sure convergence: the probability that the sequence converges is exactly 1.
For Monte Carlo integration, we apply this theorem to the random variables \(h(X_i)\), where the \(X_i\) are iid draws from the density \(f\). The condition \(\mathbb{E}_f[|h(X)|] < \infty\) ensures that the integral we are estimating actually exists.
The LLN tells us that \(\hat{I}_n \to I\) with probability 1. No matter how complex the integrand, no matter how high the dimension, the Monte Carlo estimator will eventually get arbitrarily close to the true value. This is an extraordinarily powerful guarantee.
But the LLN is silent on how fast convergence occurs. For that, we need the Central Limit Theorem.
The Central Limit Theorem and the \(O(n^{-1/2})\) Rate
The Central Limit Theorem (CLT) is the workhorse result for quantifying Monte Carlo error.
Theorem: Central Limit Theorem
Let \(X_1, X_2, \ldots\) be iid with mean \(\mu\) and variance \(\sigma^2 < \infty\). Then:
\[
\sqrt{n} \left( \bar{X}_n - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2),
\]
where \(\xrightarrow{d}\) denotes convergence in distribution. Equivalently, for large \(n\):
\[
\bar{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right).
\]
Applied to Monte Carlo integration:
\[
\hat{I}_n \approx \mathcal{N}\!\left(I, \frac{\sigma^2}{n}\right),
\]
where \(\sigma^2 = \text{Var}_f[h(X)] = \mathbb{E}_f[(h(X) - I)^2]\) is the variance of the integrand under the sampling distribution.
The standard error of the Monte Carlo estimator is:
\[
\text{SE}(\hat{I}_n) = \frac{\sigma}{\sqrt{n}}.
\]
This is the celebrated \(O(n^{-1/2})\) convergence rate. To reduce the standard error by a factor of 10, we need 100 times as many samples. To gain one decimal place of accuracy, we need 100 times the computational effort.
This rate may seem slow—and compared to some deterministic methods, it is. But the rate is independent of dimension. Whether we are integrating over \(\mathbb{R}\) or \(\mathbb{R}^{1000}\), the error decreases as \(1/\sqrt{n}\). This dimension-independence is the source of Monte Carlo’s power in high-dimensional problems.
Example 💡 Understanding the Square Root Law
Scenario: You estimate an integral with 1,000 samples and get a standard error of 0.1. Your boss needs the error reduced to 0.01.
Analysis: The standard error scales as \(1/\sqrt{n}\). To reduce the standard error by a factor of 10, you need \(n\) to increase by a factor of \(10^2 = 100\).
Conclusion: You need \(1000 \times 100 = 100,000\) samples.
This quadratic penalty is the price of Monte Carlo’s simplicity. In low dimensions, deterministic methods often achieve polynomial convergence rates like \(O(n^{-2})\) or better, making them far more efficient. But in high dimensions, Monte Carlo’s dimension-independent \(O(n^{-1/2})\) beats any polynomial rate that degrades with dimension.
Why the Square Root?
The \(1/\sqrt{n}\) rate may seem mysterious, but it has a simple explanation rooted in the behavior of sums of random variables.
Fig. 2.4 The Square Root Convergence Law. Left: The central limit theorem in action—sample means cluster more tightly around the true mean as \(n\) grows. Middle: Why variance shrinks—the variance of a sum equals \(n\sigma^2\), but dividing by \(n\) to get the mean gives variance \(\sigma^2/n\). Right: The practical consequence—the \(1/\sqrt{n}\) law means diminishing returns, with each additional decimal place of accuracy costing 100× more samples.
Consider \(n\) iid random variables \(X_1, \ldots, X_n\), each with variance \(\sigma^2\). Their sum has variance:
\[
\text{Var}(X_1 + X_2 + \cdots + X_n) = n \sigma^2.
\]
The variance of the sum grows linearly with \(n\). But when we take the mean, we divide by \(n\):
\[
\text{Var}(\bar{X}_n) = \text{Var}\!\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}.
\]
The standard deviation is the square root of variance, giving \(\sigma/\sqrt{n}\).
This behavior is fundamental to averages of random quantities. Each additional sample adds information, but with diminishing returns: the first sample reduces uncertainty enormously; the millionth sample contributes almost nothing. This is why the square root appears.
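A small simulation makes the \(\sigma/\sqrt{n}\) law concrete; here we use standard normal draws (so \(\sigma = 1\)) and compare the empirical spread of many sample means against the theoretical value. The number of replications is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1_000, 10_000]:
    # 1,000 independent sample means, each computed from n draws
    means = rng.standard_normal((1_000, n)).mean(axis=1)
    print(f"n = {n:>6}   empirical SD of mean = {means.std(ddof=1):.4f}   "
          f"sigma/sqrt(n) = {1 / np.sqrt(n):.4f}")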
Variance Estimation and Confidence Intervals
The CLT tells us that \(\hat{I}_n\) is approximately normal with known variance \(\sigma^2/n\). But we rarely know \(\sigma^2\)—it depends on the integrand and the sampling distribution. We must estimate it from the same samples we use to estimate \(I\).
The Sample Variance
The natural estimator of \(\sigma^2 = \text{Var}[h(X)]\) is the sample variance:
\[
\hat{\sigma}^2_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( h(X_i) - \hat{I}_n \right)^2.
\]
The divisor \(n-1\) (rather than \(n\)) makes this estimator unbiased: \(\mathbb{E}[\hat{\sigma}^2_n] = \sigma^2\). This is known as Bessel’s correction.
By the Law of Large Numbers, \(\hat{\sigma}^2_n \to \sigma^2\) almost surely as \(n \to \infty\). Combined with the CLT, this gives us a practical way to construct confidence intervals.
Constructing Confidence Intervals
An asymptotic \((1-\alpha)\) confidence interval for \(I\) is:
\[
\hat{I}_n \pm z_{\alpha/2} \, \frac{\hat{\sigma}_n}{\sqrt{n}},
\]
where \(z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)\) is the standard normal quantile. For common confidence levels:
90% CI: \(z_{0.05} \approx 1.645\)
95% CI: \(z_{0.025} \approx 1.960\)
99% CI: \(z_{0.005} \approx 2.576\)
The interval has the interpretation: in repeated sampling, approximately \((1-\alpha) \times 100\%\) of such intervals will contain the true value \(I\).
import numpy as np
from scipy import stats
def monte_carlo_integrate(h, sampler, n_samples, confidence=0.95, seed=None):
"""
Monte Carlo integration with uncertainty quantification.
Parameters
----------
h : callable
Function to integrate. Must accept array input.
sampler : callable
Function that takes (n, rng) and returns n samples from the target distribution.
n_samples : int
Number of Monte Carlo samples.
confidence : float
Confidence level for interval (default 0.95).
seed : int, optional
Random seed for reproducibility.
Returns
-------
dict
Contains estimate, std_error, confidence interval, and diagnostics.
"""
rng = np.random.default_rng(seed)
# Generate samples and evaluate function
samples = sampler(n_samples, rng)
h_values = h(samples)
# Point estimate
estimate = np.mean(h_values)
# Variance estimation (Bessel's correction)
variance = np.var(h_values, ddof=1)
std_error = np.sqrt(variance / n_samples)
# Confidence interval
z = stats.norm.ppf(1 - (1 - confidence) / 2)
ci_lower = estimate - z * std_error
ci_upper = estimate + z * std_error
# Effective sample size (for future variance reduction comparisons)
ess = n_samples # For standard MC, ESS = n
return {
'estimate': estimate,
'std_error': std_error,
'variance': variance,
'ci': (ci_lower, ci_upper),
'confidence': confidence,
'n_samples': n_samples,
'ess': ess,
'h_values': h_values
}
Example 💡 Using the Monte Carlo Integration Function
Problem: Estimate \(\int_0^2 e^{-x^2} dx\) using our monte_carlo_integrate function.
Setup: We need to define: (1) the integrand \(h(x) = 2 e^{-x^2}\) (the factor of 2 accounts for the interval length), and (2) a sampler that generates uniform samples on \([0, 2]\).
import numpy as np
from scipy import stats
from scipy.special import erf
# Define the integrand (scaled by interval length)
def h(x):
return 2 * np.exp(-x**2)
# Define the sampler: Uniform(0, 2)
def uniform_sampler(n, rng):
return rng.uniform(0, 2, n)
# Run Monte Carlo integration
result = monte_carlo_integrate(
h=h,
sampler=uniform_sampler,
n_samples=100_000,
confidence=0.95,
seed=42
)
# True value for comparison
true_value = np.sqrt(np.pi) / 2 * erf(2)
print(f"Estimate: {result['estimate']:.6f}")
print(f"True value: {true_value:.6f}")
print(f"Std Error: {result['std_error']:.6f}")
print(f"95% CI: ({result['ci'][0]:.6f}, {result['ci'][1]:.6f})")
print(f"CI contains true value: {result['ci'][0] <= true_value <= result['ci'][1]}")
Output:
Estimate: 0.880204
True value: 0.882081
Std Error: 0.002179
95% CI: (0.875934, 0.884474)
CI contains true value: True
The function returns all the diagnostics we need: the point estimate, standard error for assessing precision, and a confidence interval that correctly captures the true value. The h_values array can be passed to convergence diagnostics for further analysis.
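The repeated-sampling interpretation of the interval can be checked directly by rerunning the estimator many times. The sketch below reuses monte_carlo_integrate, h, uniform_sampler, and true_value from the example above; the 500 replications and 2,000 samples per run are arbitrary choices, and the empirical coverage should come out near the nominal 95%.
# Coverage check: roughly 95% of the intervals should contain the true value
n_reps, hits = 500, 0
for rep in range(n_reps):
    r = monte_carlo_integrate(h=h, sampler=uniform_sampler,
                              n_samples=2_000, confidence=0.95, seed=rep)
    lo, hi = r['ci']
    hits += (lo <= true_value <= hi)
print(f"Empirical coverage: {hits / n_reps:.3f} (nominal 0.95)")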
Numerical Stability: Welford’s Algorithm
Computing the sample variance naively using the one-pass formula \(\frac{1}{n-1}\left(\sum h_i^2 - \frac{(\sum h_i)^2}{n}\right)\) can suffer catastrophic cancellation when the mean is large compared to the standard deviation. The two terms \(\sum h_i^2\) and \(\frac{(\sum h_i)^2}{n}\) may be nearly equal, and their difference may lose many significant digits.
The Problem Illustrated: Suppose we have data with mean \(\mu = 10^9\) and standard deviation \(\sigma = 1\). Then \(\sum x_i^2 \approx n \cdot 10^{18}\) and \((\sum x_i)^2/n \approx n \cdot 10^{18}\) as well. Their difference should be approximately \(n \cdot \sigma^2 = n\), but when subtracting two numbers of size \(10^{18}\) that agree in their first 16-17 digits, we lose almost all precision in 64-bit floating point arithmetic (which has about 15-16 significant decimal digits).
Welford’s Insight: Instead of computing variance from \(\sum x_i^2\) and \(\sum x_i\), we can maintain the sum of squared deviations from the current running mean. As each new observation arrives, we update both the mean and the sum of squared deviations using a clever algebraic identity.
Let \(\bar{x}_n\) denote the mean of the first \(n\) observations, and let \(M_{2,n} = \sum_{i=1}^n (x_i - \bar{x}_n)^2\) denote the sum of squared deviations from this mean. Welford showed that these can be updated incrementally:
\[
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}, \qquad
M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n).
\]
The key insight is that \((x_n - \bar{x}_{n-1})\) and \((x_n - \bar{x}_n)\) are both small numbers (deviations from means), so their product is numerically stable. We never subtract two large, nearly-equal quantities.
The sample variance is then simply \(s^2 = M_{2,n} / (n-1)\).
class WelfordAccumulator:
"""
Online algorithm for computing mean and variance in a single pass.
This is a true streaming algorithm: we never store the data,
only the running statistics. Memory usage is O(1) regardless
of how many values we process.
"""
def __init__(self):
self.n = 0
self.mean = 0.0
self.M2 = 0.0 # Sum of squared deviations from current mean
def update(self, x):
"""Process a single new observation."""
self.n += 1
delta = x - self.mean
self.mean += delta / self.n
delta2 = x - self.mean # Note: uses UPDATED mean
self.M2 += delta * delta2
@property
def variance(self):
"""Sample variance (with Bessel's correction)."""
if self.n < 2:
return float('nan')
return self.M2 / (self.n - 1)
@property
def std(self):
"""Sample standard deviation."""
return np.sqrt(self.variance)
# Generate data once (in practice, this would arrive as a stream)
rng = np.random.default_rng(5678)
large_mean_data = 1e9 + rng.standard_normal(10000) # Mean ≈ 10⁹, SD ≈ 1
# Demonstrate the ONLINE nature: process data one value at a time
acc = WelfordAccumulator()
print("Processing data as a stream (mean ≈ 10⁹, SD ≈ 1)...")
print(f"{'n':>8} {'Running Mean':>20} {'Running Variance':>18}")
print("-" * 52)
for i, x in enumerate(large_mean_data):
# In a real streaming application, x would arrive from a sensor,
# network connection, etc. We process it and discard it.
acc.update(x)
# Periodically report running statistics
if (i + 1) in [10, 100, 1000, 10000]:
print(f"{acc.n:>8} {acc.mean:>20.6f} {acc.variance:>18.6f}")
print(f"\nFinal Welford estimates: mean = {acc.mean:.2f}, var = {acc.variance:.6f}")
The output shows the algorithm refining its estimates as more data streams in:
Processing data as a stream (mean ≈ 10⁹, SD ≈ 1)...
n Running Mean Running Variance
----------------------------------------------------
10 999999999.769556 0.723562
100 1000000000.024273 0.934600
1000 999999999.989465 0.974390
10000 999999999.991273 0.962625
Final Welford estimates: mean = 999999999.99, var = 0.962625
Why this matters: In a true streaming scenario, we would process values as they arrive without ever storing them. Memory usage is O(1)—just three numbers (n, mean, M2)—regardless of how much data we process. The naive formula would require either storing all values (O(n) memory) or suffer catastrophic cancellation.
Now let’s see what happens with the naive one-pass formula on the same data:
# Compare stable vs unstable on the SAME data
n = len(large_mean_data)
# One-pass naive formula (UNSTABLE): Var = (Σx² - (Σx)²/n) / (n-1)
sum_sq = np.sum(large_mean_data**2)
sum_x = np.sum(large_mean_data)
naive_var_onepass = (sum_sq - sum_x**2 / n) / (n - 1)
# NumPy's implementation (also stable)
numpy_var = np.var(large_mean_data, ddof=1)
print(f"True variance (approx): 1.0")
print(f"One-pass naive: {naive_var_onepass:.2f} <- CATASTROPHIC FAILURE!")
print(f"Welford (online): {acc.variance:.6f}")
print(f"NumPy variance: {numpy_var:.6f}")
True variance (approx): 1.0
One-pass naive: 209.74 <- CATASTROPHIC FAILURE!
Welford (online): 0.962625
NumPy variance: 0.962625
The naive formula gives 209.74 instead of ≈1.0—a 200× error! Both Welford and NumPy (which uses a stable two-pass algorithm) give the correct answer.
In practice, NumPy’s np.var uses numerically stable algorithms, so you rarely need to implement Welford’s algorithm yourself. But understanding why stability matters is important for several reasons:
Debugging unexpected results: If you’re computing variances in a custom loop or in a language without stable built-ins, you may encounter this issue.
Streaming data: Welford’s algorithm processes data in a single pass, making it ideal for streaming applications where you can’t store all values in memory.
Parallel computation: The algorithm can be extended to combine statistics from separate batches (useful for distributed computing); a sketch of the batch-merge step appears after this list.
Understanding Monte Carlo diagnostics: Running variance calculations in convergence diagnostics use similar techniques.
The key lesson: when the mean is large relative to the standard deviation, naive variance formulas fail catastrophically. Always use stable algorithms.
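On the parallel-computation point above, the standard pairwise-merge formula combines two sets of Welford statistics without revisiting the data. The merge_welford helper below is our own illustration built on the WelfordAccumulator class and large_mean_data array defined earlier; the merged variance should agree with the single-pass result up to floating-point error.
def merge_welford(a, b):
    """Combine two WelfordAccumulator objects via the pairwise-merge formula."""
    combined = WelfordAccumulator()
    combined.n = a.n + b.n
    if combined.n == 0:
        return combined
    delta = b.mean - a.mean
    combined.mean = a.mean + delta * b.n / combined.n
    combined.M2 = a.M2 + b.M2 + delta**2 * a.n * b.n / combined.n
    return combined

# Process the earlier data in two independent batches, then merge the statistics
acc1, acc2 = WelfordAccumulator(), WelfordAccumulator()
for x in large_mean_data[:5000]:
    acc1.update(x)
for x in large_mean_data[5000:]:
    acc2.update(x)
merged = merge_welford(acc1, acc2)
print(f"Merged mean = {merged.mean:.2f}, merged variance = {merged.variance:.6f}")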
Worked Examples
We now work through several examples of increasing complexity, illustrating the breadth of Monte Carlo applications.
Example 1: The Gaussian Integral
The integral \(\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}\) is famous for being impossible to evaluate in closed form using elementary antiderivatives, yet having a beautiful exact answer. Let us estimate it via Monte Carlo.
Challenge: The domain is infinite, so we cannot sample uniformly.
Solution: Recognize the integrand as proportional to a Gaussian density. If \(X \sim \mathcal{N}(0, \tfrac{1}{2})\), a normal distribution with variance \(1/2\), then its density is:
\[
f(x) = \frac{1}{\sqrt{2\pi \cdot \tfrac{1}{2}}} \, e^{-x^2} = \frac{1}{\sqrt{\pi}} \, e^{-x^2}.
\]
This is a valid probability density (it integrates to 1). We can write:
\[
\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} \int_{-\infty}^{\infty} f(x) \, dx = \sqrt{\pi}.
\]
This derivation shows the integral equals \(\sqrt{\pi}\) exactly, but let’s also verify by Monte Carlo.
Alternative approach: Sample from a distribution that covers the domain and reweight:
import numpy as np
from scipy import stats
def gaussian_integral_mc(n_samples, seed=42):
"""
Estimate ∫ exp(-x²) dx via Monte Carlo.
We sample from N(0, 1) and use importance sampling:
∫ exp(-x²) dx = ∫ [exp(-x²) / φ(x)] φ(x) dx
where φ(x) is the standard normal density.
"""
rng = np.random.default_rng(seed)
# Sample from N(0, 1)
X = rng.standard_normal(n_samples)
# Integrand: exp(-x²)
# Sampling density: φ(x) = exp(-x²/2) / sqrt(2π)
# Ratio: exp(-x²) / φ(x) = sqrt(2π) * exp(-x²/2)
h_values = np.sqrt(2 * np.pi) * np.exp(-X**2 / 2)
estimate = np.mean(h_values)
std_error = np.std(h_values, ddof=1) / np.sqrt(n_samples)
return estimate, std_error
est, se = gaussian_integral_mc(100_000)
print(f"Estimate: {est:.6f} ± {se:.6f}")
print(f"True value: {np.sqrt(np.pi):.6f}")
Estimate: 1.770751 ± 0.002214
True value: 1.772454
The estimate is very close to \(\sqrt{\pi} \approx 1.7725\).
Example 2: Probability of a Rare Event
Suppose \(X \sim \mathcal{N}(0, 1)\). What is \(P(X > 4)\)?
From standard normal tables, \(P(X > 4) = 1 - \Phi(4) \approx 3.167 \times 10^{-5}\). This is a rare event—only about 3 in 100,000 standard normal draws exceed 4.
Naive Monte Carlo:
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
n = 100_000
X = rng.standard_normal(n)
p_hat = np.mean(X > 4)
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"Estimate: {p_hat:.6e}")
print(f"Std Error: {se:.6e}")
print(f"True value: {1 - stats.norm.cdf(4):.6e}")
Estimate: 9.000000e-05
Std Error: 2.999865e-05
True value: 3.167124e-05
With 100,000 samples, we observed 9 exceedances (instead of the expected ~3), giving an estimate nearly 3× too large! This illustrates the high variability inherent in estimating rare events. The relative error (standard error divided by estimate) is enormous.
Problem: To estimate a probability \(p\), the standard error is \(\sqrt{p(1-p)/n} \approx \sqrt{p/n}\) for small \(p\). The relative error is \(\sqrt{(1-p)/(np)} \approx 1/\sqrt{np}\). To achieve 10% relative error for \(p = 10^{-5}\), we need \(n \approx 100/p = 10^7\) samples.
This motivates importance sampling (covered in a later section), which generates samples preferentially in the region of interest. For now, the lesson is that naive Monte Carlo struggles with rare events.
Example 3: A High-Dimensional Integral
Consider the integral:
\[
I_d = \int_{[0,1]^d} \prod_{j=1}^{d} \left(1 + \frac{x_j}{d}\right) dx_1 \cdots dx_d.
\]
For large \(d\), this integral is analytically tractable. The key observation is that the integrand factors into a product of functions, each depending on only one variable:
When an integrand factors this way, Fubini’s theorem tells us the multiple integral equals the product of single integrals:
\[
I_d = \prod_{j=1}^{d} \int_0^1 \left(1 + \frac{x_j}{d}\right) dx_j.
\]
This is analogous to how \(\sum_{i,j} a_i b_j = (\sum_i a_i)(\sum_j b_j)\) for sums. The factorization works because each \(x_j\) integrates independently over its own copy of \([0,1]\).
Applying this to our integrand, each single integral evaluates to:
\[
\int_0^1 \left(1 + \frac{x}{d}\right) dx = 1 + \frac{1}{2d}.
\]
Since all \(d\) integrals are identical, we get:
\[
I_d = \left(1 + \frac{1}{2d}\right)^{d}.
\]
The limit \(\sqrt{e}\) follows from the definition of \(e\) as \(\lim_{n \to \infty} (1 + 1/n)^n\). To see this, write:
\[
\left(1 + \frac{1}{2d}\right)^{d} = \left[\left(1 + \frac{1}{2d}\right)^{2d}\right]^{1/2} \longrightarrow e^{1/2} = \sqrt{e} \quad \text{as } d \to \infty.
\]
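A quick numerical check of this limit (the dimensions in the loop are arbitrary):
import numpy as np

for d in [1, 10, 100, 1_000, 10_000]:
    print(f"d = {d:>6}:  (1 + 1/(2d))^d = {(1 + 0.5 / d) ** d:.8f}")
print(f"limit: sqrt(e) = {np.sqrt(np.e):.8f}")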
Note
We chose this particular integrand because it factors nicely, giving us a known true value to verify our Monte Carlo estimates. Most high-dimensional integrands do not factor this way—the variables interact in complex ways—which is precisely why Monte Carlo methods are essential. When there’s no analytical solution, Monte Carlo may be the only practical approach.
Fig. 2.5 High-Dimensional Integration. Left: The true integral value approaches \(\sqrt{e}\) as dimension increases. Middle: At fixed sample size \(n = 10{,}000\), error spread actually decreases with dimension for this integrand—each factor \((1 + x_j/d)\) becomes closer to 1 as \(d\) grows, reducing variance. This is integrand-specific, not a general property. Right: The key insight is that the convergence rate \(O(n^{-1/2})\) is identical across all dimensions—the curves are parallel on the log-log plot. The constant (vertical offset) may vary, but the slope does not. This dimension-independent rate is Monte Carlo’s superpower.
Let us estimate this integral in \(d = 100\) dimensions:
import numpy as np
def high_dim_integrand(X):
"""
Compute product integrand: ∏(1 + x_j/d).
Parameters
----------
X : ndarray of shape (n_samples, d)
Sample points in [0,1]^d.
Returns
-------
ndarray of shape (n_samples,)
Function values.
"""
d = X.shape[1]
return np.prod(1 + X / d, axis=1)
def mc_high_dim(d, n_samples, seed=42):
"""Monte Carlo integration in d dimensions."""
rng = np.random.default_rng(seed)
# Uniform samples in [0,1]^d
X = rng.random((n_samples, d))
# Evaluate integrand
h_values = high_dim_integrand(X)
estimate = np.mean(h_values)
std_error = np.std(h_values, ddof=1) / np.sqrt(n_samples)
# True value (for comparison)
true_value = (1 + 0.5/d)**d
return estimate, std_error, true_value
# Estimate in 100 dimensions
d = 100
est, se, truth = mc_high_dim(d, 100_000)
print(f"d = {d}")
print(f"Estimate: {est:.6f} ± {se:.6f}")
print(f"True value: {truth:.6f}")
print(f"Error: {abs(est - truth):.6f}")
d = 100
Estimate: 1.646530 ± 0.000149
True value: 1.646668
Error: 0.000138
The Monte Carlo estimate converges as \(O(n^{-1/2})\) regardless of \(d\). Try increasing \(d\) to 1000 or 10,000—the convergence rate remains unchanged.
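One way to see this is to loop over dimensions with the mc_high_dim function above; we use a smaller \(n\) here only to keep the memory footprint of the \((n, d)\) sample array modest.
# The standard error stays the same order of magnitude as d grows (for this integrand)
for d in [10, 100, 1_000]:
    est, se, truth = mc_high_dim(d, n_samples=10_000)
    print(f"d = {d:>5}:  estimate = {est:.6f}   true = {truth:.6f}   SE = {se:.6f}")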
Example 4: Bayesian Posterior Mean
Bayesian inference often requires computing posterior expectations:
\[
\mathbb{E}[g(\theta) \mid \text{data}] = \int g(\theta) \, \pi(\theta \mid \text{data}) \, d\theta.
\]
When the posterior \(\pi(\theta | \text{data})\) is available (perhaps up to a normalizing constant), Monte Carlo integration applies directly—if we can sample from the posterior. This is the motivation for Markov chain Monte Carlo methods in Part 3.
As a simple example, suppose we observe \(x = 7\) successes in \(n = 10\) Bernoulli trials with unknown success probability \(\theta\). With a \(\text{Beta}(1, 1)\) (uniform) prior, the posterior is \(\text{Beta}(8, 4)\):
import numpy as np
from scipy import stats
# Posterior is Beta(8, 4)
alpha, beta = 8, 4
posterior = stats.beta(alpha, beta)
# True posterior mean
true_mean = alpha / (alpha + beta)
# Monte Carlo estimate
rng = np.random.default_rng(42)
n_samples = 10_000
theta_samples = posterior.rvs(n_samples, random_state=rng)
mc_mean = np.mean(theta_samples)
mc_se = np.std(theta_samples, ddof=1) / np.sqrt(n_samples)
print(f"True posterior mean: {true_mean:.6f}")
print(f"MC estimate: {mc_mean:.6f} ± {mc_se:.6f}")
# We can also estimate posterior quantiles, variance, etc.
print(f"Posterior median (MC): {np.median(theta_samples):.6f}")
print(f"Posterior median (exact): {posterior.ppf(0.5):.6f}")
True posterior mean: 0.666667
MC estimate: 0.666200 ± 0.001320
Posterior median (MC): 0.675577
Posterior median (exact): 0.676196
The Monte Carlo estimates closely match the exact values. In more complex Bayesian models where the posterior has no closed form, Monte Carlo (via MCMC) becomes essential.
Example 5: The Normal CDF
The cumulative distribution function of the standard normal, \(\Phi(t) = P(Z \leq t)\) for \(Z \sim \mathcal{N}(0, 1)\), has no closed-form expression. Yet it is one of the most important functions in statistics. Let us estimate \(\Phi(1.96)\):
import numpy as np
from scipy import stats
def estimate_normal_cdf(t, n_samples, seed=42):
"""Estimate Φ(t) = P(Z ≤ t) for Z ~ N(0,1)."""
rng = np.random.default_rng(seed)
Z = rng.standard_normal(n_samples)
# Indicator function: 1 if Z ≤ t, else 0
indicators = Z <= t
p_hat = np.mean(indicators)
se = np.sqrt(p_hat * (1 - p_hat) / n_samples)
return p_hat, se
t = 1.96
est, se = estimate_normal_cdf(t, 1_000_000)
true_val = stats.norm.cdf(t)
print(f"Φ({t}) estimate: {est:.6f} ± {se:.6f}")
print(f"Φ({t}) true: {true_val:.6f}")
Φ(1.96) estimate: 0.974928 ± 0.000156
Φ(1.96) true: 0.975002
With one million samples, the estimate is accurate to about four decimal places. For extreme quantiles (e.g., \(t = 5\)), the probability is so small that accurate estimation requires importance sampling.
Comparison with Deterministic Methods
Monte Carlo integration is not the only way to compute integrals numerically. Deterministic quadrature methods—the trapezoidal rule, Simpson’s rule, Gaussian quadrature—have been studied for centuries and, in low dimensions, often outperform Monte Carlo. Understanding when to use which approach is essential for the computational practitioner.
One-Dimensional Quadrature
For a one-dimensional integral \(\int_a^b f(x) dx\), deterministic methods exploit the smoothness of \(f\) to achieve rapid convergence. The core idea is to approximate the integral as a weighted sum of function values at selected points:
\[
\int_a^b f(x) \, dx \approx \sum_{i} w_i \, f(x_i).
\]
The methods differ in how they choose the points \(x_i\) and weights \(w_i\).
Trapezoidal Rule: Approximate the integrand by piecewise linear functions connecting adjacent points. For \(n\) equally spaced points \(x_0, x_1, \ldots, x_{n-1}\) with spacing \(h = (b-a)/(n-1)\):
\[
\int_a^b f(x) \, dx \approx h \left[ \tfrac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{n-2}) + \tfrac{1}{2} f(x_{n-1}) \right].
\]
Geometrically, this approximates the area under the curve by a series of trapezoids. The error is \(O(h^2) = O(n^{-2})\) for twice-differentiable \(f\)—doubling the number of points reduces the error by a factor of 4.
Simpson’s Rule: Approximate by piecewise quadratic (parabolic) curves through consecutive triples of points. For an odd number of equally spaced points:
\[
\int_a^b f(x) \, dx \approx \frac{h}{3} \left[ f(x_0) + 4 f(x_1) + 2 f(x_2) + 4 f(x_3) + \cdots + 4 f(x_{n-2}) + f(x_{n-1}) \right].
\]
The alternating pattern of coefficients (1, 4, 2, 4, 2, …, 4, 1) arises from fitting parabolas through each group of three points. The error is \(O(h^4) = O(n^{-4})\) for sufficiently smooth \(f\)—doubling the points reduces error by a factor of 16.
Gaussian Quadrature: Rather than using equally spaced points, choose both points \(x_i\) and weights \(w_i\) to maximize accuracy. With \(n\) optimally chosen points, Gaussian quadrature integrates polynomials of degree up to \(2n-1\) exactly. For analytic functions, convergence can be exponentially fast—far better than any fixed polynomial rate.
The optimal points turn out to be roots of orthogonal polynomials (Legendre polynomials for integration on \([-1, 1]\)). SciPy’s scipy.integrate.quad uses adaptive Gaussian quadrature internally.
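As a small fixed-order illustration (not the adaptive routine inside scipy.integrate.quad), Gauss–Legendre nodes and weights are available from numpy.polynomial.legendre.leggauss; here we apply a 10-point rule to \(\int_0^2 e^{-x^2} dx\), the same test integral used in the comparison below. The node mapping from \([-1, 1]\) to \([a, b]\) is shown explicitly.
import numpy as np
from numpy.polynomial.legendre import leggauss
from scipy.special import erf

a, b = 0.0, 2.0
nodes, weights = leggauss(10)                  # nodes and weights on [-1, 1]
x = 0.5 * (b - a) * nodes + 0.5 * (a + b)      # map nodes to [a, b]
approx = 0.5 * (b - a) * np.sum(weights * np.exp(-x**2))
exact = np.sqrt(np.pi) / 2 * erf(2.0)
print(f"10-point Gauss-Legendre: {approx:.12f}   exact: {exact:.12f}")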
The geometric difference between these approaches is illuminating:
Fig. 2.6 Grid vs. Monte Carlo Sampling. Top row: In 1D, grid sampling (left) places evaluation points at regular intervals, while Monte Carlo (middle) uses random points. Quadrature wins decisively in 1D (right)—this is expected and correct. Bottom row: In 2D, a 10×10 grid uses 100 points in a regular lattice (left), while Monte Carlo distributes 100 points without structure (middle). The crucial insight (right): grid cost grows as \(m^d\) while Monte Carlo is dimension-independent; the crossover occurs around \(d \approx 3\), after which Monte Carlo wins.
Compare these to Monte Carlo’s \(O(n^{-1/2})\). In one dimension, Monte Carlo loses badly:
import numpy as np
from scipy import integrate
# Integrand: exp(-x²) on [0, 2]
def f(x):
return np.exp(-x**2)
# True value (via error function)
from scipy.special import erf
true_value = np.sqrt(np.pi) / 2 * erf(2)
# Monte Carlo
def mc_estimate(n, seed=42):
rng = np.random.default_rng(seed)
x = rng.uniform(0, 2, n)
return 2 * np.mean(f(x)) # Multiply by interval length
# Trapezoidal rule
def trap_estimate(n):
x = np.linspace(0, 2, n)
return np.trapz(f(x), x)
# Simpson's rule
def simp_estimate(n):
x = np.linspace(0, 2, n)
return integrate.simpson(f(x), x=x)
print(f"True value: {true_value:.10f}\n")
for n in [10, 100, 1000]:
mc = mc_estimate(n)
tr = trap_estimate(n)
si = simp_estimate(n)
print(f"n = {n}")
print(f" Monte Carlo: {mc:.10f} (error {abs(mc - true_value):.2e})")
print(f" Trapezoidal: {tr:.10f} (error {abs(tr - true_value):.2e})")
print(f" Simpson: {si:.10f} (error {abs(si - true_value):.2e})")
print()
True value: 0.8820813908
n = 10
Monte Carlo: 0.6600529708 (error 2.22e-01)
Trapezoidal: 0.8817823811 (error 2.99e-04)
Simpson: 0.8819697523 (error 1.12e-04)
n = 100
Monte Carlo: 0.8973426558 (error 1.53e-02)
Trapezoidal: 0.8820788993 (error 2.49e-06)
Simpson: 0.8820813848 (error 5.96e-09)
n = 1000
Monte Carlo: 0.8906166325 (error 8.54e-03)
Trapezoidal: 0.8820813663 (error 2.45e-08)
Simpson: 0.8820813908 (error 5.58e-13)
With \(n = 1000\) points, Simpson’s rule achieves machine precision (\(10^{-13}\)) while Monte Carlo’s error is still around \(10^{-2}\). In one dimension, there is no contest.
The Curse of Dimensionality
The situation reverses dramatically in high dimensions. Consider integrating over \([0, 1]^d\). A deterministic method using a grid with \(m\) points per dimension requires \(m^d\) total evaluations.
To build geometric intuition for why high dimensions are so challenging, consider a simple question: what fraction of a hypercube’s volume lies within the inscribed hypersphere?
Volume formulas. Consider the hypercube \([-1, 1]^d\) with side length 2 and the unit hypersphere \(\{x : \|x\| \leq 1\}\) inscribed within it.
The hypercube volume is simply:
\[
V_{\text{cube}} = 2^d.
\]
The hypersphere volume in \(d\) dimensions is:
\[
V_d = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)},
\]
where \(\Gamma\) is the gamma function. For positive integers, \(\Gamma(n+1) = n!\). For half-integers, \(\Gamma(1/2) = \sqrt{\pi}\) and \(\Gamma(n + 1/2) = \frac{(2n-1)!!}{2^n}\sqrt{\pi}\) where \(!!\) denotes the double factorial. For practical computation, use scipy.special.gamma.
This formula gives familiar results: \(V_2 = \pi\) (area of unit circle), \(V_3 = \frac{4\pi}{3}\) (volume of unit sphere). For higher dimensions: \(V_4 = \frac{\pi^2}{2}\), \(V_5 = \frac{8\pi^2}{15}\), and so on.
The ratio of sphere volume to cube volume is:
\[
\frac{V_d}{V_{\text{cube}}} = \frac{\pi^{d/2}}{2^d \, \Gamma\!\left(\frac{d}{2} + 1\right)}.
\]
Let’s compute this ratio for several dimensions:
import numpy as np
from scipy.special import gamma
def hypersphere_volume(d, r=1):
"""Volume of d-dimensional hypersphere with radius r."""
return (np.pi ** (d/2)) / gamma(d/2 + 1) * r**d
def hypercube_volume(d, side=2):
"""Volume of d-dimensional hypercube with given side length."""
return side ** d
print(f"{'d':>3} {'Cube':>12} {'Sphere':>12} {'Ratio':>12} {'% in sphere':>12}")
print("-" * 58)
for d in [1, 2, 3, 5, 10, 20, 50, 100]:
cube_vol = hypercube_volume(d)
sphere_vol = hypersphere_volume(d)
ratio = sphere_vol / cube_vol
print(f"{d:>3} {cube_vol:>12.2e} {sphere_vol:>12.4e} {ratio:>12.4e} {100*ratio:>11.2e}%")
d Cube Sphere Ratio % in sphere
----------------------------------------------------------
1 2.00e+00 2.0000e+00 1.0000e+00 1.00e+02%
2 4.00e+00 3.1416e+00 7.8540e-01 7.85e+01%
3 8.00e+00 4.1888e+00 5.2360e-01 5.24e+01%
5 3.20e+01 5.2638e+00 1.6449e-01 1.64e+01%
10 1.02e+03 2.5502e+00 2.4904e-03 2.49e-01%
20 1.05e+06 2.5807e-02 2.4611e-08 2.46e-06%
50 1.13e+15 1.7302e-13 1.5367e-28 1.54e-26%
100 1.27e+30 2.3682e-40 1.8682e-70 1.87e-68%
The numbers are striking: in 10 dimensions, only 0.25% of the cube lies in the sphere. By 20 dimensions, it is about \(2.5 \times 10^{-6}\) percent. By 100 dimensions, the ratio is roughly \(10^{-70}\)—essentially zero.
What does this mean for integration? If you’re trying to integrate a function that’s concentrated near the origin (like a multivariate Gaussian), random uniform samples over the hypercube will almost never land where the function is large. This is why importance sampling becomes essential in high dimensions.
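A direct way to feel this is to draw uniform points in \([-1, 1]^d\) and count how many land inside the unit ball. The sample size below is an arbitrary choice; by \(d = 20\) you should expect essentially zero hits.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
for d in [2, 5, 10, 20]:
    points = rng.uniform(-1, 1, size=(n, d))
    inside = int(np.sum(np.sum(points**2, axis=1) <= 1.0))
    print(f"d = {d:>2}:  {inside:>6} of {n:,} uniform points fall inside the unit ball")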
Fig. 2.7 The Curse of Dimensionality. Left: The inscribed hypersphere occupies a vanishing fraction of the hypercube as dimension increases—by \(d = 20\), less than \(10^{-7}\) of the cube’s volume lies within the sphere. Middle: In high dimensions, nearly all the volume of a ball concentrates in a thin shell near its surface. Right: Grid points required for deterministic methods explode exponentially—with just 10 points per dimension, a 20-dimensional integral requires \(10^{20}\) evaluations.
For Simpson’s rule with error \(O(h^4) = O(m^{-4})\), the total error is \(O(m^{-4})\) but the cost is \(m^d\). If we want error \(\epsilon\), we need \(m \sim \epsilon^{-1/4}\), giving cost \(\sim \epsilon^{-d/4}\).
For Monte Carlo with error \(O(n^{-1/2})\), achieving error \(\epsilon\) requires \(n \sim \epsilon^{-2}\) samples, independent of \(d\).
The crossover occurs when \(\epsilon^{-d/4} = \epsilon^{-2}\), i.e., \(d = 8\). For \(d > 8\), Monte Carlo requires fewer function evaluations than Simpson’s rule to achieve the same accuracy—and the advantage grows exponentially with dimension.
This analysis ignores constants, which can favor either method in specific cases. But the fundamental message is robust: in high dimensions, Monte Carlo wins.
| Method | 1D Convergence | \(d\)-D Convergence | Best For |
|---|---|---|---|
| Monte Carlo | \(O(n^{-1/2})\) | \(O(n^{-1/2})\) | High dimensions, complex domains |
| Trapezoidal | \(O(n^{-2})\) | \(O(n^{-2/d})\) | Low-dim smooth functions |
| Simpson | \(O(n^{-4})\) | \(O(n^{-4/d})\) | Low-dim very smooth functions |
| Gaussian | \(O(e^{-cn})\) | \(O(e^{-cn^{1/d}})\) | Low-dim analytic functions |
| Quasi-Monte Carlo | \(O(n^{-1}(\log n)^d)\) | \(O(n^{-1}(\log n)^d)\) | Moderate dimensions, smooth functions |
Quasi-Monte Carlo Methods
A middle ground between deterministic quadrature and Monte Carlo is provided by quasi-Monte Carlo (QMC) methods. Instead of random samples, QMC uses carefully constructed low-discrepancy sequences—deterministic sequences that fill space more uniformly than random points.
Famous examples include Halton sequences, Sobol sequences, and lattice rules. Under smoothness conditions on the integrand, QMC achieves convergence rates of \(O(n^{-1} (\log n)^d)\), faster than Monte Carlo’s \(O(n^{-1/2})\) but with a dependence on dimension.
QMC is increasingly popular in computational finance and computer graphics. However, it requires more care: the sequences must match the problem structure, variance estimation is trickier, and the smoothness assumptions may fail. For general-purpose integration, especially with non-smooth or high-variance integrands, standard Monte Carlo remains the most robust choice.
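To give a feel for QMC, the sketch below compares plain Monte Carlo with a scrambled Sobol sequence from scipy.stats.qmc on the smooth product integrand of Example 3 in \(d = 5\) (the sample size and seeds are arbitrary choices); on a smooth problem like this, the Sobol estimate is typically noticeably closer to the truth.
import numpy as np
from scipy.stats import qmc

d = 5
true_value = (1 + 0.5 / d) ** d               # product integrand from Example 3

def integrand(X):
    return np.prod(1 + X / d, axis=1)

rng = np.random.default_rng(0)
mc_points = rng.random((2**14, d))            # plain Monte Carlo points
sobol = qmc.Sobol(d=d, scramble=True, seed=0)
qmc_points = sobol.random_base2(m=14)         # 2^14 scrambled Sobol points

print(f"True value:  {true_value:.8f}")
print(f"Plain MC:    {integrand(mc_points).mean():.8f}")
print(f"Sobol (QMC): {integrand(qmc_points).mean():.8f}")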
Sample Size Determination
A critical practical question is: how many samples do I need? The answer depends on the desired precision, the variance of the integrand, and the acceptable probability of error.
The Sample Size Formula
From the CLT, the Monte Carlo estimator is approximately:
\[
\hat{I}_n \approx \mathcal{N}\!\left(I, \frac{\sigma^2}{n}\right).
\]
To achieve a standard error of \(\epsilon\), we need:
\[
\frac{\sigma}{\sqrt{n}} \leq \epsilon \quad \Longleftrightarrow \quad n \geq \frac{\sigma^2}{\epsilon^2}.
\]
For a 95% confidence interval of half-width \(\epsilon\), we need:
\[
n \geq \frac{(1.96)^2 \, \sigma^2}{\epsilon^2} \approx \frac{4 \sigma^2}{\epsilon^2}.
\]
The factor of 4 (approximately \(1.96^2\)) accounts for the confidence level.
Practical Sample Size Determination
In practice, \(\sigma\) is unknown. A common approach is:
Pilot study: Run a small simulation (e.g., 1,000 samples) to estimate \(\hat{\sigma}\).
Compute required sample size: Use the formula \(n = 4\hat{\sigma}^2 / \epsilon^2\) for 95% confidence.
Run full simulation: Generate \(n\) samples (possibly in addition to the pilot).
Verify: Check that the final standard error meets requirements.
import numpy as np
from scipy import stats
def determine_sample_size(h, sampler, target_se, pilot_n=1000,
confidence=0.95, seed=42):
"""
Determine required sample size for target standard error.
Parameters
----------
h : callable
Function to integrate.
sampler : callable
Function that takes (n, rng) and returns samples from target distribution.
target_se : float
Desired standard error.
pilot_n : int
Pilot sample size for variance estimation.
confidence : float
Confidence level for interval.
seed : int
Random seed.
Returns
-------
dict
Contains required_n, estimated_variance, pilot results.
"""
rng = np.random.default_rng(seed)
# Pilot study
pilot_samples = sampler(pilot_n, rng)
pilot_h = h(pilot_samples)
sigma_hat = np.std(pilot_h, ddof=1)
# Required sample size
z = stats.norm.ppf(1 - (1 - confidence) / 2)
required_n = int(np.ceil((z * sigma_hat / target_se)**2))
return {
'required_n': required_n,
'estimated_sigma': sigma_hat,
'pilot_n': pilot_n,
'pilot_estimate': np.mean(pilot_h),
'pilot_se': sigma_hat / np.sqrt(pilot_n)
}
# Example: Estimate E[X²] for X ~ N(0,1) with SE ≤ 0.01
result = determine_sample_size(
h=lambda x: x**2,
sampler=lambda n, rng: rng.standard_normal(n),
target_se=0.01
)
print(f"Estimated σ: {result['estimated_sigma']:.4f}")
print(f"Required n for SE ≤ 0.01: {result['required_n']:,}")
Estimated σ: 1.4153
Required n for SE ≤ 0.01: 76,948
For estimating \(\mathbb{E}[X^2]\) where \(X \sim \mathcal{N}(0,1)\), the true variance is \(\text{Var}(X^2) = \mathbb{E}[X^4] - (\mathbb{E}[X^2])^2 = 3 - 1 = 2\), so \(\sigma = \sqrt{2} \approx 1.414\). To achieve standard error 0.01, we need \(n \geq (1.96)^2 \times 2 / (0.01)^2 \approx 77{,}000\) samples—consistent with our pilot estimate.
Convergence Diagnostics and Monitoring
Beyond computing point estimates and confidence intervals, it is important to monitor the convergence of Monte Carlo simulations. Visual diagnostics can reveal problems—heavy tails, multimodality, slow mixing—that summary statistics might miss.
Running Mean Plots
The most basic diagnostic is a plot of the running mean:
\[
\hat{I}_k = \frac{1}{k} \sum_{i=1}^{k} h(X_i), \qquad k = 1, 2, \ldots, n,
\]
plotted against \(k\).
If the simulation is converging properly, this plot should:
Fluctuate widely for small \(k\)
Stabilize and approach a horizontal asymptote as \(k\) grows
Have diminishing fluctuations proportional to \(1/\sqrt{k}\)
Standard Error Decay
The standard error should decrease as \(1/\sqrt{n}\). A log-log plot of standard error versus sample size should have slope \(-1/2\). Deviations suggest:
Steeper slope: Variance is decreasing (possibly a problem with the estimator)
Shallower slope: Correlation in samples, infinite variance, or other issues
The following figure demonstrates these diagnostics for a simple example: estimating \(\mathbb{E}[X^2] = 1\) where \(X \sim \mathcal{N}(0, 1)\).
Fig. 2.8 Convergence Diagnostics for Monte Carlo. Top-left: Running mean with 95% confidence band—note how the estimate stabilizes and the band shrinks as samples accumulate. Top-right: Standard error decay on log-log axes—the observed SE (blue) closely tracks the theoretical \(O(n^{-1/2})\) rate (red dashed). Bottom-left: Distribution of \(h(X) = X^2\) values, which follows a \(\chi^2(1)\) distribution (orange curve). Bottom-right: Autocorrelation at various lags—all values fall within the significance bounds (red dashed), confirming that samples are independent.
What to look for:
Running mean: Should stabilize (not drift or oscillate)
SE decay: Should follow \(O(n^{-1/2})\) line; shallower slope indicates problems
Distribution: Check for heavy tails or multimodality that might inflate variance
Autocorrelation: For iid samples, all bars should be near zero; significant autocorrelation indicates dependence (common in MCMC)
The code below generates these diagnostic plots. Study it to understand how each panel is constructed:
import numpy as np
import matplotlib.pyplot as plt
def convergence_diagnostics(h_values, true_value=None):
"""
Create comprehensive convergence diagnostics for Monte Carlo.
Parameters
----------
h_values : array
Sequence of h(X_i) values from Monte Carlo simulation.
true_value : float, optional
True integral value (if known) for reference line.
Returns
-------
fig : matplotlib Figure
Four-panel diagnostic figure.
"""
n = len(h_values)
if n < 10:
print("Warning: Too few samples for meaningful diagnostics")
return None
indices = np.arange(1, n + 1)
# Running mean: cumulative sum divided by count
cumsum = np.cumsum(h_values)
running_mean = cumsum / indices
# Running variance using Welford's algorithm (numerically stable)
running_var = np.zeros(n)
M2 = 0
mean = h_values[0]
for k in range(1, n):
delta = h_values[k] - mean
mean += delta / (k + 1)
delta2 = h_values[k] - mean
M2 += delta * delta2
running_var[k] = M2 / k if k > 0 else 0
running_se = np.sqrt(running_var / indices)
# Create 2x2 figure
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Panel 1: Running mean with confidence band
ax = axes[0, 0]
ax.plot(indices, running_mean, 'b-', linewidth=0.5, alpha=0.8)
start_idx = min(100, n // 10) # Skip early noisy estimates
if start_idx < n - 1:
ax.fill_between(indices[start_idx:],
running_mean[start_idx:] - 1.96*running_se[start_idx:],
running_mean[start_idx:] + 1.96*running_se[start_idx:],
alpha=0.3, color='blue', label='95% CI')
if true_value is not None:
ax.axhline(true_value, color='red', linestyle='--',
linewidth=2, label=f'True value = {true_value:.4f}')
ax.set_xlabel('Sample size')
ax.set_ylabel('Running mean')
ax.set_title('Running Mean with 95% Confidence Band')
ax.legend()
ax.set_xscale('log')
# Panel 2: Standard error decay (log-log plot)
ax = axes[0, 1]
se_start = max(99, n // 100)
se_plot_idx = indices[se_start:]
ax.loglog(se_plot_idx, running_se[se_start:], 'b-', linewidth=1,
label='Observed SE')
# Reference line: SE should decay as 1/sqrt(n)
ref_se = running_se[se_start] * np.sqrt((se_start + 1) / se_plot_idx)
ax.loglog(se_plot_idx, ref_se, 'r--', linewidth=2,
label=r'$O(n^{-1/2})$ reference')
ax.set_xlabel('Sample size')
ax.set_ylabel('Standard Error')
ax.set_title('Standard Error Decay')
ax.legend()
# Panel 3: Distribution of h-values
ax = axes[1, 0]
ax.hist(h_values, bins=50, density=True, alpha=0.7, edgecolor='black')
ax.axvline(np.mean(h_values), color='red', linewidth=2,
label=f'Mean = {np.mean(h_values):.4f}')
ax.set_xlabel('h(X)')
ax.set_ylabel('Density')
ax.set_title('Distribution of h(X) Values')
ax.legend()
# Panel 4: Autocorrelation (detects dependence in samples)
ax = axes[1, 1]
max_lag = min(50, n // 20)
if max_lag >= 2:
autocorr = [np.corrcoef(h_values[:-k], h_values[k:])[0, 1]
for k in range(1, max_lag + 1)]
ax.bar(range(1, max_lag + 1), autocorr, color='steelblue')
# 95% significance bounds for white noise
ax.axhline(1.96/np.sqrt(n), color='red', linestyle='--',
label='95% bounds')
ax.axhline(-1.96/np.sqrt(n), color='red', linestyle='--')
ax.set_xlabel('Lag')
ax.set_ylabel('Autocorrelation')
ax.set_title('Autocorrelation (should be ~0 for iid samples)')
ax.legend()
plt.tight_layout()
return fig
# Example: Estimate E[X²] where X ~ N(0,1)
# True value is 1, variance of X² is 2
rng = np.random.default_rng(42)
X = rng.standard_normal(50000)
h_values = X**2 # h(X) = X²
fig = convergence_diagnostics(h_values, true_value=1.0)
plt.savefig('convergence_diagnostics.png', dpi=150, bbox_inches='tight')
plt.show()
Key Implementation Details
Running mean: Use np.cumsum for efficiency rather than recomputing the sum at each step.
Running variance: Welford's algorithm (the explicit loop in the function above) avoids the numerical instability that plagues the naive one-pass formula; the sketch after this list shows the failure mode.
Log scale: The x-axis uses log scale so early fluctuations don’t dominate the plot.
Autocorrelation bounds: For iid samples, autocorrelation at lag \(k\) is approximately \(\mathcal{N}(0, 1/n)\), so \(\pm 1.96/\sqrt{n}\) gives 95% bounds.
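To see concretely what that instability looks like, the short sketch below (an illustration separate from the diagnostic function above) applies both approaches to data with a tiny variance sitting on a huge offset; the offset of 1e8 is an arbitrary choice that makes the cancellation visible.
import numpy as np

rng = np.random.default_rng(0)
# Small-variance data on a huge offset: a worst case for the naive formula
x = 1e8 + rng.standard_normal(100_000)

# Naive one-pass formula: mean of squares minus square of mean
naive_var = np.mean(x**2) - np.mean(x)**2

# Welford's algorithm: update the mean and the sum of squared deviations incrementally
mean, M2 = 0.0, 0.0
for k, xk in enumerate(x, start=1):
    delta = xk - mean
    mean += delta / k
    M2 += delta * (xk - mean)
welford_var = M2 / (len(x) - 1)

print(f"Naive one-pass variance:  {naive_var:.6f}")   # often negative or wildly off
print(f"Welford variance:         {welford_var:.6f}")
print(f"NumPy reference (ddof=1): {np.var(x, ddof=1):.6f}")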
Common Pitfall ⚠️ Pathological Cases
Several situations can cause Monte Carlo to behave unexpectedly. Recognizing these pathologies—and knowing how to address them—is essential for reliable simulation.
Heavy tails (infinite variance)
Problem: If \(\text{Var}[h(X)] = \infty\), the CLT does not apply. The estimator still converges by LLN (if \(\mathbb{E}[|h(X)|] < \infty\)), but standard error estimates are meaningless and confidence intervals are invalid.
Example: Estimating \(\mathbb{E}[1/U]\) where \(U \sim \text{Uniform}(0,1)\). The expectation is infinite, but even for \(\mathbb{E}[1/\sqrt{U}] = 2\) (finite mean), the variance is infinite.
Remedies:
Truncation: Replace \(h(x)\) with \(\min(h(x), M)\) for some threshold \(M\); introduces bias but restores finite variance (see the sketch after this list)
Transformation: If \(h(x)\) blows up near a boundary, change variables to smooth the integrand
Importance sampling: Sample more heavily where \(h(x)\) is large, reducing variance
Diagnostics: Plot the histogram of \(h(X_i)\) values; extreme outliers suggest heavy tails
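As a minimal illustration of the truncation remedy, the sketch below estimates \(\mathbb{E}[1/\sqrt{U}] = 2\), whose variance is infinite: the raw estimator's reported standard error cannot be trusted, while the truncated estimator has finite variance at the cost of a known downward bias of \(1/M\). The threshold M = 100 is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(123)
n = 100_000
u = rng.random(n)
h = 1.0 / np.sqrt(u)              # E[h(U)] = 2, but Var[h(U)] is infinite

# Raw estimator: the usual standard-error formula is formally meaningless here
raw_mean = h.mean()
raw_se = h.std(ddof=1) / np.sqrt(n)

# Truncation remedy: cap h at M; for this h the mean is biased down by exactly 1/M
M = 100.0
h_trunc = np.minimum(h, M)
trunc_mean = h_trunc.mean()
trunc_se = h_trunc.std(ddof=1) / np.sqrt(n)

print(f"Raw:       {raw_mean:.4f} +/- {1.96 * raw_se:.4f}   (SE not trustworthy)")
print(f"Truncated: {trunc_mean:.4f} +/- {1.96 * trunc_se:.4f}   (bias = -1/M = -{1/M:.2f})")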
Multimodality (isolated peaks)
Problem: If the integrand has isolated peaks that the sampling distribution rarely visits, estimates may be severely biased. You might run millions of samples without ever hitting a significant mode.
Example: A mixture of two narrow Gaussians separated by 100 standard deviations; uniform sampling over the domain almost never hits either peak.
Remedies:
Importance sampling: Design a proposal distribution that covers all modes (sketched after this list)
Stratified sampling: Divide the domain into regions and sample each region separately
Adaptive methods: Use preliminary runs to identify modes, then concentrate sampling there
MCMC with tempering: Parallel tempering or simulated tempering can help samplers jump between modes (covered in Part 3)
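As an illustration of the mixture-proposal idea (the peak locations, widths, and proposal spread below are all hypothetical choices), the sketch integrates a function with two very narrow peaks: plain uniform sampling rarely lands on either peak and gives a noisy estimate, while importance sampling from a two-component proposal covers both.
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical integrand: two very narrow Gaussian bumps, far apart on [-10, 60].
# Its integral over the real line is 2 * sigma * sqrt(2*pi).
sigma = 0.001
def f(x):
    return np.exp(-0.5 * (x / sigma) ** 2) + np.exp(-0.5 * ((x - 50.0) / sigma) ** 2)

true_value = 2 * sigma * np.sqrt(2 * np.pi)
n = 20_000

# Plain Monte Carlo: uniform samples over [-10, 60] almost never land on a peak
a, b = -10.0, 60.0
x_unif = rng.uniform(a, b, n)
plain_est = (b - a) * f(x_unif).mean()

# Importance sampling: a two-component mixture proposal with one component per mode
s_q = 0.005                                   # proposal spread (illustrative choice)
centers = np.where(rng.integers(0, 2, n) == 0, 0.0, 50.0)
x_is = centers + s_q * rng.standard_normal(n)
q = 0.5 * (np.exp(-0.5 * (x_is / s_q) ** 2)
           + np.exp(-0.5 * ((x_is - 50.0) / s_q) ** 2)) / (s_q * np.sqrt(2 * np.pi))
is_est = np.mean(f(x_is) / q)                 # mass outside [-10, 60] is negligible

print(f"True value:          {true_value:.5f}")
print(f"Plain Monte Carlo:   {plain_est:.5f}")
print(f"Importance sampling: {is_est:.5f}")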
Rare events (small probabilities)
Problem: Estimating \(P(A)\) for rare events requires approximately \(100/P(A)\) samples for 10% relative error. For \(P(A) = 10^{-6}\), that’s \(10^8\) samples.
Example: Probability that a complex system fails; probability of extreme portfolio losses.
Remedies:
Importance sampling: Sample preferentially from the rare event region; this is the standard solution (sketched after this list)
Splitting/cloning: In sequential simulations, replicate trajectories that approach the rare event
Cross-entropy method: Iteratively adapt the sampling distribution to concentrate on rare events
Large deviations theory: Use asymptotic approximations when sampling is infeasible
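To make the importance-sampling remedy concrete, the sketch below estimates \(P(Z > 5)\) for \(Z \sim \mathcal{N}(0,1)\), roughly \(2.9 \times 10^{-7}\), by sampling from a proposal shifted into the rare-event region; shifting the proposal mean to the threshold is an illustrative (though common) choice.
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000
c = 5.0                                    # threshold; P(Z > 5) is about 2.87e-7

# Plain Monte Carlo: with 1e5 samples we will almost surely observe zero exceedances
z = rng.standard_normal(n)
plain_est = np.mean(z > c)

# Importance sampling: draw from N(c, 1), which puts half its mass beyond the threshold,
# and reweight by the likelihood ratio phi(y) / phi(y - c) = exp(c^2/2 - c*y)
y = c + rng.standard_normal(n)
h = (y > c) * np.exp(0.5 * c**2 - c * y)
is_est = h.mean()
is_se = h.std(ddof=1) / np.sqrt(n)

print(f"Plain Monte Carlo:   {plain_est:.2e}")
print(f"Importance sampling: {is_est:.3e} +/- {1.96 * is_se:.1e} (95% CI)")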
Dependent samples (autocorrelation)
Problem: If samples are correlated (as in MCMC), the effective sample size (ESS) is smaller than the nominal sample size \(n\). Standard error formulas that assume independence underestimate uncertainty.
Example: Metropolis-Hastings chains with high autocorrelation; ESS might be \(n/100\) or worse.
Remedies:
Thinning: Keep every \(k\)-th sample to reduce autocorrelation (simple but wasteful)
Effective sample size: Compute ESS and report uncertainty based on ESS, not \(n\)
Batch means: Divide the chain into batches and estimate variance from batch means (sketched after this list)
Better samplers: Hamiltonian Monte Carlo, NUTS, or other advanced MCMC methods reduce autocorrelation
Diagnostics: Always check autocorrelation plots; values should decay quickly to zero
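The sketch below uses an AR(1) sequence as a stand-in for correlated MCMC output (the autocorrelation parameter is an illustrative choice) and compares the naive iid standard error with a batch-means standard error; the ratio of the two gives a rough effective sample size.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
rho = 0.95    # strong autocorrelation; integrated autocorrelation time is (1+rho)/(1-rho) = 39

# Simulate an AR(1) chain with stationary distribution N(0, 1)
x = np.empty(n)
x[0] = rng.standard_normal()
innovations = rng.standard_normal(n) * np.sqrt(1 - rho**2)
for t in range(1, n):
    x[t] = rho * x[t - 1] + innovations[t]

est = x.mean()

# Naive SE assumes independence and badly understates the uncertainty
naive_se = x.std(ddof=1) / np.sqrt(n)

# Batch-means SE: split the chain into batches and use the variability of batch averages
n_batches = 100
batch_means = x[: n_batches * (n // n_batches)].reshape(n_batches, -1).mean(axis=1)
bm_se = batch_means.std(ddof=1) / np.sqrt(n_batches)

# Rough effective sample size implied by the two variance estimates
ess = n * (naive_se / bm_se) ** 2

print(f"Estimate of E[X]: {est:.4f}")
print(f"Naive SE (iid):   {naive_se:.5f}")
print(f"Batch-means SE:   {bm_se:.5f}")
print(f"Approximate ESS:  {ess:.0f} of {n} samples")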
Practical Considerations
Before concluding, we collect several practical points for implementing Monte Carlo methods effectively.
When to Use Monte Carlo
Monte Carlo integration is the method of choice when:
Dimension is high (\(d \gtrsim 5\)): The curse of dimensionality kills deterministic methods.
Domain is complex: Irregular regions, constraints, and complex boundaries are natural for Monte Carlo but problematic for quadrature.
Integrand is non-smooth: Monte Carlo doesn’t require derivatives or smoothness.
Sampling is easy: If we can easily generate samples from the target distribution, Monte Carlo is straightforward to implement.
Monte Carlo is a poor choice when:
Dimension is very low (\(d \leq 3\)) and the integrand is smooth: Use Gaussian quadrature.
High precision is required with smooth integrands: Deterministic methods converge faster (compare the sketch after this list).
Sampling is expensive: Each Monte Carlo sample requires a new function evaluation; for smooth integrands, quadrature methods extract more accuracy from fewer evaluations.
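To make the low-dimensional comparison concrete, the sketch below pits a simple midpoint rule against Monte Carlo on the smooth integral \(\int_0^1 e^x\,dx = e - 1\), using the same number of function evaluations for each; the integrand and n = 1,000 are illustrative choices.
import numpy as np

true_value = np.e - 1            # integral of exp(x) over [0, 1]
n = 1_000                        # same number of function evaluations for both methods

# Midpoint rule: deterministic, error O(n^-2) for smooth integrands
midpoints = (np.arange(n) + 0.5) / n
midpoint_est = np.exp(midpoints).mean()

# Monte Carlo: error O(n^-1/2)
rng = np.random.default_rng(1)
u = rng.random(n)
mc_est = np.exp(u).mean()

print(f"True value:    {true_value:.8f}")
print(f"Midpoint rule: {midpoint_est:.8f}  (error {abs(midpoint_est - true_value):.1e})")
print(f"Monte Carlo:   {mc_est:.8f}  (error {abs(mc_est - true_value):.1e})")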
Reproducibility
Always set random seeds and document them. Monte Carlo results should be reproducible.
import numpy as np
def my_monte_carlo_analysis(n_samples, seed):
"""Document the seed in the function signature."""
rng = np.random.default_rng(seed)
# ... analysis using rng ...
return results
# Call with explicit seed
results = my_monte_carlo_analysis(100_000, seed=42)
Computational Efficiency
Vectorize: Use NumPy operations on arrays, not Python loops.
Generate samples in batches: rng.random(100_000) is faster than 100,000 calls to rng.random() (see the timing sketch below).
Parallelize when possible: For embarrassingly parallel problems, distribute samples across cores.
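To illustrate the first two points, the sketch below times a pure-Python loop against a vectorized batch computation of the same estimate of \(\mathbb{E}[U^2] = 1/3\); exact timings depend on the machine, but the vectorized version is typically faster by a large factor.
import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(42)

# Loop version: one sample and one Python-level operation per iteration
t0 = time.perf_counter()
total = 0.0
for _ in range(n):
    u = rng.random()
    total += u * u
loop_est = total / n
loop_time = time.perf_counter() - t0

# Vectorized version: generate all samples in one batch and reduce with NumPy
t0 = time.perf_counter()
u = rng.random(n)
vec_est = np.mean(u * u)
vec_time = time.perf_counter() - t0

print(f"Loop:       {loop_est:.5f} in {loop_time:.2f} s")
print(f"Vectorized: {vec_est:.5f} in {vec_time:.3f} s")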
Common Pitfall ⚠️
Reporting estimates without uncertainty: A Monte Carlo estimate without a standard error or confidence interval is nearly meaningless. Always report \(\hat{I}_n \pm z_{\alpha/2} \cdot \hat{\sigma}_n / \sqrt{n}\).
Bad: “The integral is 3.14159.”
Good: “The integral is 3.142 ± 0.005 (95% CI: [3.132, 3.152]) based on 100,000 samples.”
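A small helper along these lines (a sketch of our own; the function name and formatting are not from any library) makes it hard to forget the uncertainty when reporting results.
import numpy as np

def report_estimate(h_values):
    """Format a Monte Carlo estimate as 'value ± half-width (95% CI)' with the sample size."""
    h = np.asarray(h_values, dtype=float)
    n = h.size
    est = h.mean()
    half_width = 1.96 * h.std(ddof=1) / np.sqrt(n)   # z_{0.025} ≈ 1.96
    return (f"{est:.3f} ± {half_width:.3f} "
            f"(95% CI: [{est - half_width:.3f}, {est + half_width:.3f}]) "
            f"based on {n:,} samples")

# Example: the quarter-circle integral of 4*sqrt(1 - x^2) over [0, 1] equals pi
rng = np.random.default_rng(42)
x = rng.random(100_000)
print("The integral is", report_estimate(4 * np.sqrt(1 - x**2)))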
Bringing It All Together
Monte Carlo integration transforms the ancient problem of computing integrals into the modern practice of sampling and averaging. The method’s power derives from three pillars:
Universality: Any integral can be written as an expected value, and expected values can be estimated by sample means.
The Law of Large Numbers: Sample means converge to population means, guaranteeing that Monte Carlo estimators approach the true value.
The Central Limit Theorem: The error decreases as \(O(n^{-1/2})\), providing both a convergence rate and a framework for uncertainty quantification via confidence intervals.
The \(O(n^{-1/2})\) rate is simultaneously the method's weakness and its strength. In low dimensions, deterministic quadrature methods converge faster. But in high dimensions, where deterministic methods suffer the curse of dimensionality, Monte Carlo's dimension-independent rate often makes it the only viable option.
The examples in this section—estimating \(\pi\), computing Gaussian integrals, evaluating posterior means, tackling high-dimensional problems—illustrate the breadth of Monte Carlo applications. The diagnostic tools—running mean plots, standard error decay, autocorrelation checks—equip you to assess whether your simulations are converging properly.
Transition to What Follows
With the foundations of Monte Carlo integration in place, we face a critical question: where do the random samples come from?
Throughout this section, we have assumed that generating samples from the target distribution—uniform on \([0, 1]^d\), standard normal, a posterior distribution—is straightforward. But this assumption hides a mountain of computational machinery.
The next section (Uniform Random Variates) addresses the most fundamental case: generating uniform random numbers. We will see that computers, being deterministic machines, cannot produce “true” randomness—only pseudo-random sequences that pass stringent statistical tests. Understanding how these sequences are generated, what can go wrong, and how to ensure reproducibility is essential for any serious practitioner.
Following that, the inverse CDF method (Inverse CDF Method) shows how to transform uniform random numbers into samples from other distributions. If we can compute the inverse cumulative distribution function \(F^{-1}(u)\), we can generate samples from \(F\) by applying \(F^{-1}\) to uniform random numbers. This elegant technique works for many important distributions—exponential, Weibull, Cauchy—but fails when \(F^{-1}\) has no closed form.
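As a small preview (a sketch only; the full treatment comes in that section), the exponential distribution with rate \(\lambda\) has \(F^{-1}(u) = -\ln(1-u)/\lambda\), so exponential samples follow directly from uniform ones:
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                   # rate parameter (illustrative)
u = rng.random(100_000)                     # Uniform(0, 1) variates
x = -np.log(1.0 - u) / lam                  # inverse CDF of Exponential(lam)

print(f"Sample mean:     {x.mean():.4f}  (theory: {1/lam:.4f})")
print(f"Sample variance: {x.var():.4f}  (theory: {1/lam**2:.4f})")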
For distributions like the normal, where the inverse CDF is not available analytically, specialized transformations like Box-Muller offer efficient alternatives. And for truly complex distributions—posteriors in Bayesian inference, for instance—rejection sampling and eventually Markov chain Monte Carlo (Part 3) provide the tools to generate the samples that Monte Carlo integration requires.
The story of Monte Carlo methods is thus a story of two interlocking challenges: using random samples to estimate integrals (this section), and generating the random samples in the first place (the sections to come). Master both, and you hold a computational toolkit of extraordinary power.
Key Takeaways 📝
Core concept: Monte Carlo integration estimates integrals by rewriting them as expectations and averaging random samples. The Law of Large Numbers guarantees convergence; the Central Limit Theorem provides the \(O(n^{-1/2})\) rate.
Historical insight: Monte Carlo emerged from the Manhattan Project, where Ulam and von Neumann needed to compute neutron transport integrals that resisted analytical attack. The method turned randomness from a nuisance into a computational tool.
Dimension independence: The \(O(n^{-1/2})\) rate does not depend on the dimension of the integration domain. This is why Monte Carlo dominates in high-dimensional problems where deterministic methods fail.
Practical application: Always report standard errors and confidence intervals. Use pilot studies to estimate variance for sample size planning. Monitor convergence visually with running mean plots and standard error decay.
Method selection: In 1–3 dimensions with smooth integrands, use deterministic quadrature. In high dimensions or with complex domains, use Monte Carlo. Consider quasi-Monte Carlo for moderate dimensions with smooth functions.
Outcome alignment: This section directly addresses Learning Outcome 1 (apply simulation techniques for Monte Carlo integration) and provides the conceptual foundation for all subsequent simulation methods, including variance reduction, importance sampling, and MCMC.