The Empirical Distribution and Plug-in Principle
In Section 4.1, we identified the sampling distribution \(G\) as the fundamental target of statistical inference—everything we want to know about uncertainty flows from \(G\). We also introduced three routes to approximating \(G\): analytic derivation, parametric Monte Carlo, and the bootstrap. The bootstrap route is based on a deceptively simple idea: replace the unknown population distribution \(F\) with the empirical distribution \(\hat{F}_n\) computed from our observed sample.
But why should this work? What exactly is the empirical distribution, and in what sense does it approximate \(F\)? How do we formalize the transition from \(\theta = T(F)\) to \(\hat{\theta} = T(\hat{F}_n)\)? These questions lead us to the mathematical foundations that justify all bootstrap methods: the empirical cumulative distribution function and the plug-in principle.
This section develops these foundations rigorously. We begin with the precise definition of the empirical CDF \(\hat{F}_n\) and establish its interpretation as a discrete probability distribution. We then prove two fundamental convergence theorems—pointwise convergence via the Strong Law of Large Numbers and uniform convergence via the Glivenko-Cantelli theorem—providing self-contained proofs at the level appropriate for this course. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives us finite-sample probability bounds that translate directly into confidence bands for \(F\). With these tools in hand, we formalize the plug-in principle: viewing parameters as functionals \(\theta = T(F)\) and estimating them by \(\hat{\theta} = T(\hat{F}_n)\). Finally, we examine cases where the plug-in approach fails, preparing for the more sophisticated methods in later sections.
Road Map 🧭
Define: The empirical CDF \(\hat{F}_n\) and its interpretation as a probability distribution
Prove: Pointwise and uniform convergence (\(\hat{F}_n \to F\)) with complete, self-contained proofs
Quantify: Finite-sample bounds via the DKW inequality; construct confidence bands
Formalize: Parameters as functionals \(T(F)\) and the plug-in estimator \(\hat{\theta} = T(\hat{F}_n)\)
Diagnose: When plug-in fails and alternatives are needed
Learning Outcomes: LO 1 (simulation techniques—understanding random generation from \(\hat{F}_n\)), LO 3 (resampling methods—theoretical foundation for bootstrap)
The Empirical Cumulative Distribution Function
The empirical CDF is perhaps the most natural nonparametric estimator of a distribution function. Given observed data, it simply counts the proportion of observations at or below each threshold.
Definition and Basic Properties
Definition: Empirical Cumulative Distribution Function
Let \(X_1, X_2, \ldots, X_n\) be observations from a distribution \(F\). The empirical cumulative distribution function (ECDF) is:
\[
\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \leq x\},
\]
where \(\mathbf{1}\{X_i \leq x\}\) is the indicator function that equals 1 if \(X_i \leq x\) and 0 otherwise.
Equivalently, \(\hat{F}_n(x)\) is the proportion of observations less than or equal to \(x\):
\[
\hat{F}_n(x) = \frac{\#\{i : X_i \leq x\}}{n}.
\]
The ECDF is a right-continuous step function that:
Equals 0 for \(x < X_{(1)}\) (below the minimum observation)
Jumps by \(1/n\) at each distinct observation (or by \(k/n\) if \(k\) observations share the same value)
Equals 1 for \(x \geq X_{(n)}\) (at or above the maximum observation)
Here \(X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}\) denote the order statistics—the observations sorted in increasing order.
Example 💡 Computing the ECDF
Given: Sample \(X = (2.3, 5.1, 1.7, 3.8, 2.3)\)
Order statistics: \(X_{(1)} = 1.7, X_{(2)} = 2.3, X_{(3)} = 2.3, X_{(4)} = 3.8, X_{(5)} = 5.1\)
ECDF values:
\[
\hat{F}_5(x) = \begin{cases} 0 & x < 1.7 \\ 1/5 & 1.7 \leq x < 2.3 \\ 3/5 & 2.3 \leq x < 3.8 \\ 4/5 & 3.8 \leq x < 5.1 \\ 1 & x \geq 5.1 \end{cases}
\]
Note the jump of \(2/5\) at \(x = 2.3\) because two observations equal 2.3.
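As a quick sanity check, the ECDF for this small sample can be evaluated directly in NumPy; `np.searchsorted` with `side='right'` counts how many observations fall at or below each query point. (A minimal sketch; the variable names are illustrative.)

```python
import numpy as np

x = np.array([2.3, 5.1, 1.7, 3.8, 2.3])
x_sorted = np.sort(x)
n = len(x)

def ecdf_at(t):
    """Right-continuous ECDF: proportion of observations <= t."""
    return np.searchsorted(x_sorted, t, side='right') / n

for t in [1.0, 1.7, 2.3, 3.8, 5.1]:
    print(f"F_hat({t}) = {ecdf_at(t):.1f}")
# Output: 0.0, 0.2, 0.6, 0.8, 1.0 -- the jump from 0.2 to 0.6 at x = 2.3
# reflects the tied pair of observations (mass 2/5).
```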
Fig. 133 Figure 4.2.1: The Empirical CDF. (a) The ECDF places probability mass \(1/n\) at each observed value. (b) As a CDF, \(\hat{F}_n\) is a right-continuous step function that jumps at each observation. (c) The formal definition shows \(\hat{F}_n(x)\) counts the proportion of observations at or below \(x\).
The ECDF as a Probability Distribution
A crucial insight is that \(\hat{F}_n\) is not just an estimator—it is itself a legitimate probability distribution. Specifically, \(\hat{F}_n\) is the CDF of a discrete distribution that places probability mass \(1/n\) on each observed value.
Proposition: ECDF as Discrete Distribution
The empirical CDF \(\hat{F}_n\) corresponds to a discrete probability measure \(\hat{P}_n\) that assigns mass \(1/n\) to each observation:
\[
\hat{P}_n(\{X_i\}) = \frac{1}{n}, \qquad i = 1, \ldots, n
\]
(with mass \(k/n\) at any value appearing \(k\) times). For any measurable set \(A\):
\[
\hat{P}_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \in A\}.
\]
This interpretation is fundamental to the bootstrap. Sampling “from \(\hat{F}_n\)” means selecting observations from \(\{X_1, \ldots, X_n\}\) with equal probability \(1/n\) for each. Since we sample with replacement, the same observation can appear multiple times in a bootstrap sample.
Connection to Multinomial Distribution
The bootstrap sampling process has a direct connection to the multinomial distribution. If we draw a bootstrap sample of size \(n\) from \(\hat{F}_n\), let \(N_i\) denote the number of times observation \(X_i\) appears in the bootstrap sample. Then:
\[
(N_1, N_2, \ldots, N_n) \sim \text{Multinomial}\left(n;\ \tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}\right).
\]
This means:
\(\mathbb{E}[N_i] = 1\) for each \(i\)
\(\text{Var}(N_i) = (n-1)/n\) for each \(i\)
\(\text{Cov}(N_i, N_j) = -1/n\) for \(i \neq j\)
\(\sum_{i=1}^n N_i = n\) (exactly \(n\) observations in bootstrap sample)
The probability that a specific observation \(X_i\) is not included in a bootstrap sample is:
\[
P(N_i = 0) = \left(1 - \frac{1}{n}\right)^n \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368.
\]
Thus, on average, about 63.2% of the original observations appear in each bootstrap sample, with some appearing multiple times.
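These moment and inclusion properties are easy to confirm by simulation; the following sketch (seed and sizes chosen arbitrarily) draws bootstrap index sets and tabulates the selection counts \(N_i\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 10_000

# Draw B bootstrap samples of size n from F_hat_n by resampling indices
# with replacement, then count how often each original observation appears.
idx = rng.integers(0, n, size=(B, n))
counts = np.stack([np.bincount(row, minlength=n) for row in idx])

print("mean of N_i:", counts.mean())          # theory: 1
print("var of N_i: ", counts.var())           # theory: (n-1)/n = 0.98
print("P(N_i = 0): ", (counts == 0).mean())   # theory: (1 - 1/n)^n
print("(1-1/n)^n:  ", (1 - 1/n) ** n)
print("e^{-1}:     ", np.exp(-1))
```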
The ECDF as Nonparametric MLE
The empirical distribution \(\hat{F}_n\) has a compelling optimality property: it is the nonparametric maximum likelihood estimator within the class of discrete distributions.
To formalize this, we consider the empirical likelihood framework. Among all discrete probability distributions supported on the observed sample points \(\{X_1, \ldots, X_n\}\), we seek the one that maximizes the probability of the observed data. If we assign probability \(p_i\) to observation \(X_i\) (treating each observation as potentially distinct), the likelihood is:
\[
L(p_1, \ldots, p_n) = \prod_{i=1}^n p_i, \qquad p_i \geq 0, \quad \sum_{i=1}^n p_i = 1.
\]
Maximizing \(\prod p_i\) subject to \(\sum p_i = 1\) yields \(p_i = 1/n\) for all \(i\) by Lagrange multipliers (or by recognizing the geometric-arithmetic mean inequality). This is precisely the empirical distribution.
When there are ties (repeated values), grouping by distinct values \(a_1, \ldots, a_k\) with counts \(n_1, \ldots, n_k\), the likelihood becomes \(\prod_{j=1}^k p_j^{n_j}\), maximized at \(p_j = n_j/n\).
Theorem: ECDF is Nonparametric MLE
Among all discrete probability distributions supported on the observed sample values, the empirical distribution \(\hat{F}_n\) maximizes the likelihood. The MLE places mass \(1/n\) at each observation (or \(n_j/n\) at each distinct value appearing \(n_j\) times).
Remark: For continuous models, the likelihood is typically defined via densities and depends on the chosen dominating measure. The empirical likelihood approach avoids these complications by working directly with probabilities on the observed support points, yielding a clean optimization problem with a unique solution.
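The optimality claim is easy to probe numerically: compare the empirical log-likelihood \(\sum_i \log p_i\) at the uniform weights with its value at many randomly drawn probability vectors. (A sketch only; the Dirichlet draws are just a convenient way to generate candidate weight vectors.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

loglik_uniform = n * np.log(1 / n)   # log-likelihood at p_i = 1/n

# 100,000 random probability vectors on the n-simplex
candidates = rng.dirichlet(np.ones(n), size=100_000)
loglik_candidates = np.log(candidates).sum(axis=1)

print("log-likelihood at p_i = 1/n:", loglik_uniform)
print("best random candidate:      ", loglik_candidates.max())
# No random candidate exceeds the uniform-weight value, consistent with
# the theorem that p_i = 1/n maximizes the empirical likelihood.
```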
Python Implementation
Computing and visualizing the ECDF is straightforward:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def compute_ecdf(x):
"""
Compute the empirical CDF of a sample.
Parameters
----------
x : array-like
Sample observations.
Returns
-------
x_sorted : ndarray
Sorted sample values (may include duplicates).
ecdf_vals : ndarray
ECDF values at each sorted value (i/n for i=1,...,n).
"""
x = np.asarray(x)
x_sorted = np.sort(x)
n = len(x)
ecdf_vals = np.arange(1, n + 1) / n
return x_sorted, ecdf_vals
def plot_ecdf_vs_true(x, true_cdf, true_cdf_label='True F', ax=None):
"""
Plot ECDF overlaid on true CDF.
Parameters
----------
x : array-like
Sample observations.
true_cdf : callable
True CDF function.
true_cdf_label : str
Label for true CDF.
ax : matplotlib axis, optional
Axis to plot on.
Returns
-------
ax : matplotlib axis
"""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6))
# Compute ECDF
x_sorted, ecdf_vals = compute_ecdf(x)
# Plot ECDF as step function
ax.step(x_sorted, ecdf_vals, where='post', linewidth=2,
label=f'ECDF (n={len(x)})', color='#E74C3C')
# Plot true CDF
x_grid = np.linspace(x_sorted.min() - 1, x_sorted.max() + 1, 500)
ax.plot(x_grid, true_cdf(x_grid), linewidth=2,
label=true_cdf_label, color='#2E86AB')
ax.set_xlabel('x')
ax.set_ylabel('F(x)')
ax.legend()
ax.grid(True, alpha=0.3)
return ax
# Example: Compare ECDF to true Normal CDF
rng = np.random.default_rng(42)
x = rng.standard_normal(50)
fig, ax = plt.subplots(figsize=(10, 6))
plot_ecdf_vs_true(x, stats.norm.cdf, 'True N(0,1)', ax)
ax.set_title('ECDF vs True CDF (n=50 from Standard Normal)')
plt.tight_layout()
plt.show()
Fig. 134 Figure 4.2.2: ECDF Approximates True CDF. Comparison across different distributions (Normal, Exponential, Beta) and sample sizes. Each panel shows the true CDF (blue) overlaid with the ECDF (red), with the supremum deviation \(D_n\) annotated. As sample size increases, \(D_n\) decreases.
Convergence of the Empirical CDF
The utility of \(\hat{F}_n\) as an estimator of \(F\) rests on convergence theorems that guarantee \(\hat{F}_n \to F\) as \(n \to \infty\). We establish two levels of convergence: pointwise (for each fixed \(x\)) and uniform (simultaneously for all \(x\)). The proofs are self-contained.
Preliminary Tools
We first establish the basic probability inequalities needed for the proofs.
Lemma: Markov’s Inequality
If \(X\) is a non-negative random variable with \(\mathbb{E}[X] < \infty\), then for any \(a > 0\):
\[
P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}.
\]
Proof of Markov’s Inequality: Since \(X \geq 0\), we have \(X \geq a \cdot \mathbf{1}\{X \geq a\}\) (the indicator is 1 when \(X \geq a\), and \(X \geq a\) in that case). Taking expectations:
\[
\mathbb{E}[X] \geq a\, \mathbb{E}\bigl[\mathbf{1}\{X \geq a\}\bigr] = a\, P(X \geq a).
\]
Rearranging gives the result. ∎
Intuition: If a non-negative random variable has small expectation, it cannot be large too often—otherwise the expectation would be large.
Lemma: Chebyshev’s Inequality
If \(Z\) has mean \(\mu\) and variance \(\sigma^2 < \infty\), then for any \(\varepsilon > 0\):
\[
P\bigl(|Z - \mu| \geq \varepsilon\bigr) \leq \frac{\sigma^2}{\varepsilon^2}.
\]
Proof of Chebyshev’s Inequality: The key insight is to convert a question about \(|Z - \mu|\) into a question about the non-negative quantity \((Z - \mu)^2\), then apply Markov.
Step 1: Note that \(|Z - \mu| \geq \varepsilon\) if and only if \((Z - \mu)^2 \geq \varepsilon^2\).
Step 2: Apply Markov’s inequality to \(X = (Z - \mu)^2\) with \(a = \varepsilon^2\):
\[
P\bigl(|Z - \mu| \geq \varepsilon\bigr) = P\bigl((Z - \mu)^2 \geq \varepsilon^2\bigr) \leq \frac{\mathbb{E}[(Z - \mu)^2]}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}.
\]
∎
Intuition: Chebyshev says that a random variable is unlikely to be far from its mean if it has small variance. The probability of being more than \(k\) standard deviations from the mean is at most \(1/k^2\).
Lemma: First Borel-Cantelli
If \(\{A_n\}\) is a sequence of events with \(\sum_{n=1}^{\infty} P(A_n) < \infty\), then \(P(A_n \text{ i.o.}) = 0\), where “i.o.” means “infinitely often.”
What Does “Infinitely Often” Mean?
The event “\(A_n\) infinitely often” (abbreviated \(A_n\) i.o.) means that \(A_n\) occurs for infinitely many values of \(n\). In set notation:
\[
\{A_n \text{ i.o.}\} = \bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} A_n.
\]
An outcome \(\omega\) belongs to this set if no matter how large an \(N\) we choose, there exists some \(n \geq N\) with \(\omega \in A_n\).
Equivalent Set-Theoretic Definition
We can write this more precisely. First, define the event “at least one \(A_n\) occurs for \(n \geq N\)”:
\[
B_N = \bigcup_{n=N}^{\infty} A_n.
\]
An outcome \(\omega \in B_N\) if \(\omega\) belongs to at least one of \(A_N, A_{N+1}, A_{N+2}, \ldots\)
Now, “\(A_n\) infinitely often” means \(\omega\) is in \(B_N\) for every \(N\):
\[
\{A_n \text{ i.o.}\} = \bigcap_{N=1}^{\infty} B_N.
\]
Intuition: No matter how far out you start looking (at \(N = 1, 2, 3, \ldots\)), you always find at least one more \(A_n\) occurring.
Proof of Borel-Cantelli
Step 1: Bound the probability of \(B_N\).
By the union bound (subadditivity of probability):
\[
P(B_N) = P\left(\bigcup_{n=N}^{\infty} A_n\right) \leq \sum_{n=N}^{\infty} P(A_n).
\]
This is the “tail” of the series starting at \(N\).
Step 2: Show the tail goes to zero.
Since the full series \(\sum_{n=1}^{\infty} P(A_n) < \infty\) converges, the tail must vanish:
\[
\sum_{n=N}^{\infty} P(A_n) \to 0 \quad \text{as } N \to \infty.
\]
(This is a basic fact about convergent series: if \(\sum a_n < \infty\), then \(\sum_{n \geq N} a_n \to 0\).)
Step 3: Use the nested structure.
Notice that \(B_1 \supseteq B_2 \supseteq B_3 \supseteq \cdots\) (each \(B_N\) is contained in the previous one—we’re looking at fewer and fewer events as \(N\) increases). Therefore:
\[
P(A_n \text{ i.o.}) = P\left(\bigcap_{N=1}^{\infty} B_N\right) \leq P(B_N) \quad \text{for every } N.
\]
Step 4: Apply continuity of probability for nested sets.
For a decreasing sequence of sets \(B_1 \supseteq B_2 \supseteq \cdots\), probability is continuous:
\[
P\left(\bigcap_{N=1}^{\infty} B_N\right) = \lim_{N \to \infty} P(B_N).
\]
(This is analogous to how for nested intervals \([a_1, b_1] \supseteq [a_2, b_2] \supseteq \cdots\), the intersection has length equal to the limit of the lengths.)
Step 5: Conclude.
Combining the pieces:
\[
P(A_n \text{ i.o.}) = P\left(\bigcap_{N=1}^{\infty} B_N\right) = \lim_{N \to \infty} P(B_N) \leq \lim_{N \to \infty} \sum_{n=N}^{\infty} P(A_n) = 0.
\]
∎
Why Borel-Cantelli Matters
The Borel-Cantelli Lemma converts summable probabilities into almost sure statements. If the probabilities of “bad events” \(A_n\) are summable, then with probability 1, only finitely many bad events occur—so eventually (for all large enough \(n\)) the bad events stop happening.
We use this to prove almost sure convergence: if \(P\bigl(|X_n - X| > \varepsilon\bigr)\) is summable for each \(\varepsilon > 0\), then \(X_n \to X\) almost surely.
Pointwise Convergence
For any fixed \(x\), the quantity \(n \hat{F}_n(x) = \sum_{i=1}^n \mathbf{1}\{X_i \leq x\}\) is a sum of iid Bernoulli random variables with success probability \(p = F(x)\). The Strong Law of Large Numbers immediately yields pointwise almost sure convergence.
Theorem: Pointwise Convergence of ECDF
For each fixed \(x \in \mathbb{R}\):
\[
\hat{F}_n(x) \xrightarrow{P} F(x) \quad \text{and} \quad \hat{F}_n(x) \xrightarrow{a.s.} F(x) \quad \text{as } n \to \infty.
\]
Proof of Convergence in Probability:
Define \(Y_i(x) = \mathbf{1}\{X_i \leq x\}\). Then \(Y_i(x) \sim \text{Bernoulli}(F(x))\) and:
\[
\mathbb{E}[Y_i(x)] = F(x), \qquad \text{Var}\bigl(Y_i(x)\bigr) = F(x)\bigl(1 - F(x)\bigr).
\]
Since \(\hat{F}_n(x) = \bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i(x)\):
\[
\mathbb{E}[\hat{F}_n(x)] = F(x), \qquad \text{Var}\bigl(\hat{F}_n(x)\bigr) = \frac{F(x)\bigl(1 - F(x)\bigr)}{n}.
\]
By Chebyshev’s inequality:
\[
P\bigl(|\hat{F}_n(x) - F(x)| \geq \varepsilon\bigr) \leq \frac{F(x)\bigl(1 - F(x)\bigr)}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty.
\]
∎
Proof of Almost Sure Convergence:
We invoke the Strong Law of Large Numbers (SLLN) for i.i.d. sequences:
Theorem: Strong Law of Large Numbers
If \(Y_1, Y_2, \ldots\) are i.i.d. with \(\mathbb{E}|Y_1| < \infty\), then:
\[
\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i \xrightarrow{a.s.} \mathbb{E}[Y_1].
\]
What does “almost surely” mean?
The notation \(\bar{Y}_n \xrightarrow{a.s.} \mu\) (read “converges almost surely”) means:
\[
P\left(\lim_{n \to \infty} \bar{Y}_n = \mu\right) = 1.
\]
In words: the probability that \(\bar{Y}_n\) converges to \(\mu\) is 1. There might be some “bad” outcomes \(\omega\) where convergence fails, but the set of all such bad outcomes has probability zero.
Contrast with convergence in probability: \(\bar{Y}_n \xrightarrow{P} \mu\) means for each \(\varepsilon > 0\), \(P\bigl(|\bar{Y}_n - \mu| > \varepsilon\bigr) \to 0\). This is weaker: it says each individual deviation becomes unlikely, but doesn’t rule out infinitely many small deviations.
Almost sure convergence is stronger: \(\bar{Y}_n \xrightarrow{a.s.} \mu\) implies \(\bar{Y}_n \xrightarrow{P} \mu\), but not conversely.
Applying the SLLN to the ECDF:
Since \(Y_i(x) = \mathbf{1}\{X_i \leq x\}\) are i.i.d. Bernoulli with \(\mathbb{E}[Y_i(x)] = F(x)\), and Bernoulli random variables trivially satisfy \(\mathbb{E}|Y_i(x)| = F(x) \leq 1 < \infty\), the SLLN immediately yields:
\[
\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n Y_i(x) \xrightarrow{a.s.} F(x).
\]
∎
Technical remark: The proof of the SLLN itself is beyond our scope, but uses tools like Kolmogorov’s variance-summability criterion or truncation arguments. For Bernoulli random variables, the variance \(\text{Var}(Y_i) \leq 1/4\), so \(\sum_{i=1}^\infty \text{Var}(Y_i)/i^2 \leq \frac{1}{4}\sum_{i=1}^\infty 1/i^2 < \infty\), satisfying Kolmogorov’s criterion.
Common Pitfall ⚠️ Pointwise vs Uniform Convergence
Pointwise convergence \(\hat{F}_n(x) \to F(x)\) for each fixed \(x\) does not immediately imply that \(\sup_x |\hat{F}_n(x) - F(x)| \to 0\). The supremum is over uncountably many \(x\) values, and convergence at each point separately does not guarantee simultaneous convergence.
Example of the gap: Consider \(f_n(x) = x^n\) on \([0,1]\). For each \(x \in [0,1)\), \(f_n(x) \to 0\). But \(\sup_{x \in [0,1]} |f_n(x) - 0| = 1\) for all \(n\) (achieved at \(x = 1\)).
For the ECDF, proving uniform convergence requires additional work—this is the content of the Glivenko-Cantelli theorem.
Uniform Convergence: The Glivenko-Cantelli Theorem
The Glivenko-Cantelli theorem is one of the foundational results of nonparametric statistics. It states that the ECDF converges uniformly to the true CDF, not just pointwise.
Theorem: Glivenko-Cantelli
Let \(X_1, X_2, \ldots\) be i.i.d. with distribution function \(F\). Define \(D_n = \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)|\). Then:
\[
D_n \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty.
\]
That is, with probability 1, the empirical CDF converges uniformly to the true CDF.
Before proving the theorem, we establish an exponential probability bound on \(D_n\).
Reduction to Uniform Distribution
When \(F\) is continuous, the probability integral transformation provides a useful proof device. Define \(U_i = F(X_i)\). If \(F\) is continuous and strictly increasing, then \(U_i \sim \text{Uniform}(0,1)\). Let \(\hat{H}_n\) be the ECDF of \(U_1, \ldots, U_n\).
Then:
\[
\sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| = \sup_{u \in (0,1)} |\hat{H}_n(u) - u|.
\]
This equality reduces the problem to the uniform case, simplifying derivations. For general \(F\) (possibly with atoms), the DKW inequality is distribution-free and yields the same type of uniform bound without requiring continuity—this is one of its key strengths.
Hoeffding’s Inequality for Binomial Tails
Before deriving the exponential bound, we need one more tool: Hoeffding’s inequality, which provides exponentially tight bounds on how far a sample average can deviate from its expectation.
Lemma: Hoeffding’s Inequality (Binomial Form)
Let \(B \sim \text{Binomial}(n, p)\), so \(B/n\) is the sample proportion from \(n\) Bernoulli(\(p\)) trials. Then for any \(t > 0\):
\[
P\left(\frac{B}{n} - p \geq t\right) \leq e^{-2nt^2}.
\]
Similarly, \(P\left(p - \frac{B}{n} \geq t\right) \leq e^{-2nt^2}\).
Why is this useful? Chebyshev’s inequality gives \(P\bigl(|B/n - p| \geq t\bigr) \leq p(1-p)/(nt^2)\), which decreases as \(1/n\). Hoeffding gives \(e^{-2nt^2}\), which decreases exponentially in \(n\). For large \(n\), exponential decay is much faster.
Intuition: Hoeffding says that for bounded random variables (like Bernoullis), the sample mean concentrates around its expectation exponentially fast. Deviations of size \(t\) become exponentially unlikely as \(n\) grows.
The proof of Hoeffding’s inequality uses the moment generating function and is beyond our scope, but you can verify its tightness numerically: for \(n = 100\), \(p = 0.5\), and \(t = 0.1\), Chebyshev gives \(P \leq 0.25\), while Hoeffding gives \(P \leq e^{-2} \approx 0.135\).
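A short numeric check (using the quantities quoted above) compares the exact one-sided binomial tail with the Chebyshev and Hoeffding bounds for \(n = 100\), \(p = 0.5\), \(t = 0.1\).

```python
import numpy as np
from scipy import stats

n, p, t = 100, 0.5, 0.1

# Exact one-sided tail P(B/n - p >= t) = P(B >= 60) for B ~ Binomial(n, p)
exact = stats.binom.sf(np.ceil(n * (p + t)) - 1, n, p)

chebyshev = p * (1 - p) / (n * t**2)   # bounds the two-sided (hence one-sided) tail
hoeffding = np.exp(-2 * n * t**2)      # one-sided Hoeffding bound

print(f"exact tail:      {exact:.4f}")      # about 0.03
print(f"Chebyshev bound: {chebyshev:.4f}")  # 0.25
print(f"Hoeffding bound: {hoeffding:.4f}")  # about 0.135
```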
Exponential Tail Bound
Lemma: Exponential Bound on Supremum Deviation
For any \(\varepsilon > 0\):
\[
P(D_n > \varepsilon) \leq 2n\, e^{-2n\varepsilon^2}.
\]
Proof:
Let \(U_{(1)} \leq U_{(2)} \leq \cdots \leq U_{(n)}\) be the order statistics of \(U_1, \ldots, U_n \sim \text{Uniform}(0,1)\).
Key observation: The supremum of \(|\hat{H}_n(u) - u|\) is achieved at one of the order statistics (or just before/after a jump). We can express:
\[
\sup_{u} |\hat{H}_n(u) - u| = \max_{1 \leq k \leq n} \max\left\{ \frac{k}{n} - U_{(k)},\; U_{(k)} - \frac{k-1}{n} \right\}.
\]
The Binomial-Order Statistic Connection
A crucial fact links order statistics of uniform samples to binomial counts:
Lemma: Distribution of Uniform Order Statistics
For \(U_1, \ldots, U_n \stackrel{iid}{\sim} \text{Uniform}(0,1)\) and any \(a \in [0,1]\):
\[
P\bigl(U_{(k)} \leq a\bigr) = P\bigl(\text{Binomial}(n, a) \geq k\bigr).
\]
Why is this true? The \(k\)-th smallest observation is \(\leq a\) if and only if at least \(k\) of the \(n\) observations fall in \([0, a]\). Each observation falls in \([0, a]\) independently with probability \(a\) (since they’re uniform on \([0,1]\)), so the count of observations \(\leq a\) is \(\text{Binomial}(n, a)\).
Applying this to our bound: We want to bound \(P(k/n - U_{(k)} > \varepsilon)\), which equals \(P(U_{(k)} < k/n - \varepsilon)\):
\[
P\bigl(U_{(k)} < k/n - \varepsilon\bigr) \leq P\bigl(U_{(k)} \leq k/n - \varepsilon\bigr) = P(B \geq k), \qquad B \sim \text{Binomial}(n,\; k/n - \varepsilon).
\]
Now \(B \sim \text{Binomial}(n, p)\) with \(p = k/n - \varepsilon\), so \(\mathbb{E}[B] = np = k - n\varepsilon\). We need \(P(B \geq k) = P(B/n \geq k/n)\).
Since \(\mathbb{E}[B/n] = p = k/n - \varepsilon\), we’re asking for \(P(B/n - p \geq \varepsilon)\). By Hoeffding’s inequality:
\[
P\left(\frac{B}{n} - p \geq \varepsilon\right) \leq e^{-2n\varepsilon^2}.
\]
Therefore:
\[
P\left(\frac{k}{n} - U_{(k)} > \varepsilon\right) \leq e^{-2n\varepsilon^2} \quad \text{for each valid } k.
\]
For indices where \(k/n - \varepsilon \leq 0\) or \(k/n - \varepsilon \geq 1\), the event is either impossible or the bound is trivially satisfied, so we restrict to valid \(k\). By union bound over the remaining \(k = 1, \ldots, n\):
\[
P\left(\max_{1 \leq k \leq n}\left(\frac{k}{n} - U_{(k)}\right) > \varepsilon\right) \leq n\, e^{-2n\varepsilon^2}.
\]
Similarly for the other direction. Combining:
\[
P(D_n > \varepsilon) \leq 2n\, e^{-2n\varepsilon^2}.
\]
∎
Proof of Glivenko-Cantelli Theorem:
The strategy is to use Borel-Cantelli: show that the probabilities \(P(D_n > \varepsilon)\) are summable, conclude that \(D_n > \varepsilon\) happens only finitely often (almost surely), and then let \(\varepsilon \to 0\).
Step 1: Verify summability for a fixed \(\varepsilon > 0\).
By the exponential bound above:
\[
\sum_{n=1}^{\infty} P(D_n > \varepsilon) \leq \sum_{n=1}^{\infty} 2n\, e^{-2n\varepsilon^2}.
\]
Why does this series converge? For any \(c > 0\), the series \(\sum_{n=1}^{\infty} n e^{-cn}\) converges. To see this, note that \(n e^{-cn}\) eventually decays faster than any geometric series: for large \(n\), \(n e^{-cn} < e^{-cn/2}\), and \(\sum e^{-cn/2}\) is a convergent geometric series. More directly, the ratio test gives:
\[
\frac{(n+1)e^{-c(n+1)}}{n e^{-cn}} = \frac{n+1}{n}\, e^{-c} \longrightarrow e^{-c} < 1.
\]
So the series converges for any \(c > 0\). Here \(c = 2\varepsilon^2 > 0\), so \(\sum_{n=1}^{\infty} P(D_n > \varepsilon) < \infty\).
Step 2: Apply Borel-Cantelli.
By the First Borel-Cantelli Lemma (proved above), since \(\sum P(D_n > \varepsilon) < \infty\):
\[
P(D_n > \varepsilon \text{ infinitely often}) = 0.
\]
In plain language: With probability 1, \(D_n > \varepsilon\) happens for only finitely many \(n\). Equivalently, there exists a (random) \(N\) such that \(D_n \leq \varepsilon\) for all \(n \geq N\).
Step 3: Extend to \(D_n \to 0\) almost surely.
We’ve shown that for each fixed \(\varepsilon > 0\), eventually \(D_n \leq \varepsilon\) (almost surely). But \(D_n \to 0\) requires this for all \(\varepsilon > 0\) simultaneously.
The trick: use countably many \(\varepsilon\) values. Consider the rational numbers \(\varepsilon_k = 1/k\) for \(k = 1, 2, 3, \ldots\) Define:
\[
A_k = \{D_n \leq 1/k \text{ for all sufficiently large } n\}.
\]
We’ve shown \(P(A_k) = 1\) for each \(k\). Now define:
\[
A = \bigcap_{k=1}^{\infty} A_k = \{D_n \to 0\}.
\]
Why is \(P(A) = 1\)? By the “countable intersection of probability-1 events” principle (stated and proved below):
\[
P\left(\bigcap_{k=1}^{\infty} A_k\right) = 1.
\]
Therefore \(P(A) = 1\), meaning \(D_n \to 0\) almost surely.
The Countable Intersection Principle
If \(P(A_k) = 1\) for \(k = 1, 2, 3, \ldots\) (countably many events), then \(P(\bigcap_k A_k) = 1\).
Proof: \(P(\bigcap_k A_k) = 1 - P(\bigcup_k A_k^c) \geq 1 - \sum_k P(A_k^c) = 1 - 0 = 1\).
This is why we need only consider countably many values of \(\varepsilon\) (such as \(\varepsilon_k = 1/k\)), not all positive reals: the union bound used in the proof applies only to countable collections of events.
Step 4: Conclude.
We have shown \(P(D_n \to 0) = 1\), i.e., \(\sup_x |\hat{F}_n(x) - F(x)| \to 0\) almost surely. ∎
Fig. 135 Figure 4.2.3: Glivenko-Cantelli Convergence. As \(n\) increases from 10 to 2000, the ECDF (red) converges uniformly to the true CDF (blue). The 95% DKW confidence band (shaded) shrinks at rate \(O(1/\sqrt{n})\). The supremum deviation \(D_n\) decreases consistently.
Fig. 136 Figure 4.2.4: Convergence Rate and DKW Bound. (a) The supremum deviation \(D_n\) decreases as \(O(1/\sqrt{n})\) (log-log scale). (b) Distribution of \(D_n\) for \(n=100\) with DKW bounds marked. (c) The DKW bound is conservative but provides valid coverage.
The DKW Inequality: Finite-Sample Bounds
The exponential bound can be sharpened significantly. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality replaces the factor \(2n\) in front of the exponential with the constant 2.
Theorem: DKW Inequality
For any \(\varepsilon > 0\):
\[
P\left(\sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| > \varepsilon\right) \leq 2e^{-2n\varepsilon^2}.
\]
This bound is distribution-free—it holds for any \(F\).
The original DKW result (1956) had an unspecified constant. Massart (1990) proved the sharp constant is exactly 2.
Comparison with Basic Exponential Bound: Our earlier union-bound argument gave \(P(D_n > \varepsilon) \leq 2n\exp(-2n\varepsilon^2)\), which is looser by a factor of \(n\). The DKW improvement comes from a more refined analysis that avoids the crude union bound over all \(n\) jump points. The \(\sqrt{n}D_n\) statistic is the Kolmogorov-Smirnov statistic; its limiting distribution under continuity of \(F\) is the Kolmogorov distribution, which provides even tighter quantiles than DKW for asymptotic inference.
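To make the comparison concrete, the sketch below solves each bound for the \(\varepsilon\) that gives tail probability 0.05 and also reports the asymptotic Kolmogorov-Smirnov quantile via `scipy.stats.kstwobign`, the limiting distribution of \(\sqrt{n}D_n\); the particular sample sizes are arbitrary.

```python
import numpy as np
from scipy import stats, optimize

alpha = 0.05
for n in [50, 200, 1000]:
    # DKW: solve 2 exp(-2 n eps^2) = alpha in closed form
    eps_dkw = np.sqrt(np.log(2 / alpha) / (2 * n))
    # Basic union bound: solve 2 n exp(-2 n eps^2) = alpha numerically
    eps_union = optimize.brentq(
        lambda e: 2 * n * np.exp(-2 * n * e**2) - alpha, 1e-6, 1.0)
    # Asymptotic KS quantile for sqrt(n) * D_n (continuous F)
    eps_ks = stats.kstwobign.ppf(1 - alpha) / np.sqrt(n)
    print(f"n={n:5d}  DKW: {eps_dkw:.4f}  union bound: {eps_union:.4f}  "
          f"asymptotic KS: {eps_ks:.4f}")
```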
Constructing Confidence Bands
The DKW inequality immediately yields simultaneous confidence bands for \(F\).
Corollary: DKW Confidence Band
For confidence level \(1 - \alpha\), define:
\[
\varepsilon_n(\alpha) = \sqrt{\frac{\ln(2/\alpha)}{2n}}, \qquad L_n(x) = \max\bigl\{\hat{F}_n(x) - \varepsilon_n(\alpha),\, 0\bigr\}, \qquad U_n(x) = \min\bigl\{\hat{F}_n(x) + \varepsilon_n(\alpha),\, 1\bigr\}.
\]
Then:
\[
P\bigl(L_n(x) \leq F(x) \leq U_n(x) \text{ for all } x\bigr) \geq 1 - \alpha.
\]
Proof: Setting \(2e^{-2n\varepsilon^2} = \alpha\) and solving gives \(\varepsilon = \sqrt{\ln(2/\alpha)/(2n)}\). ∎
For a 95% confidence band (\(\alpha = 0.05\)):
\[
\varepsilon_n(0.05) = \sqrt{\frac{\ln 40}{2n}} \approx \frac{1.358}{\sqrt{n}}.
\]
def dkw_confidence_band(x, alpha=0.05):
"""
Compute DKW confidence band for the true CDF.
Parameters
----------
x : array-like
Sample observations.
alpha : float
Significance level (default 0.05 for 95% confidence).
Returns
-------
x_sorted : ndarray
Sorted observations.
lower : ndarray
Lower confidence band.
upper : ndarray
Upper confidence band.
ecdf_vals : ndarray
ECDF values.
"""
x = np.asarray(x)
n = len(x)
x_sorted, ecdf_vals = compute_ecdf(x)
# DKW epsilon
epsilon = np.sqrt(np.log(2 / alpha) / (2 * n))
lower = np.clip(ecdf_vals - epsilon, 0, 1)
upper = np.clip(ecdf_vals + epsilon, 0, 1)
return x_sorted, lower, upper, ecdf_vals
# Example: DKW band for Normal data
rng = np.random.default_rng(42)
n = 100
x = rng.standard_normal(n)
x_sorted, lower, upper, ecdf_vals = dkw_confidence_band(x, alpha=0.05)
fig, ax = plt.subplots(figsize=(10, 6))
# Plot confidence band
ax.fill_between(x_sorted, lower, upper, alpha=0.3, color='#2E86AB',
step='post', label='95% DKW Band')
# Plot ECDF
ax.step(x_sorted, ecdf_vals, where='post', linewidth=2,
color='#E74C3C', label='ECDF')
# Plot true CDF
x_grid = np.linspace(-4, 4, 500)
ax.plot(x_grid, stats.norm.cdf(x_grid), linewidth=2,
color='#2E86AB', linestyle='--', label='True N(0,1)')
ax.set_xlabel('x')
ax.set_ylabel('F(x)')
ax.set_title(f'DKW 95% Confidence Band (n={n})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Verify: true CDF should be within band
true_cdf_at_data = stats.norm.cdf(x_sorted)
coverage = np.all((true_cdf_at_data >= lower) & (true_cdf_at_data <= upper))
print(f"True CDF within band: {coverage}")
Fig. 137 Figure 4.2.5: DKW Confidence Bands. For \(n=150\) observations from a mixture distribution, we show 90%, 95%, and 99% simultaneous confidence bands for the true CDF \(F\). The true CDF (dashed blue) falls within all bands. The band width \(\varepsilon_n(\alpha) = \sqrt{\ln(2/\alpha)/(2n)}\) increases with confidence level.
Try It Yourself 🖥️ Interactive ECDF Simulation
Explore ECDF convergence and confidence bands interactively:
Distribution Sampler & ECDF Visualizer: Interactive Simulation
Sample from 14 different distributions (Normal, Exponential, Beta, Cauchy, Poisson, etc.)
Adjust sample size and see how \(D_n\) decreases
Toggle between pointwise and DKW uniform confidence bands
Observe how the KS statistic \(D_n = \sup_x|\hat{F}_n(x) - F(x)|\) changes with each sample
CLT for ECDF (3D Visualization): 3D Simulation
Visualize CLT convergence at 10 different quantiles simultaneously
See how \(\sqrt{n}(\hat{F}_n(x) - F(x))/\sqrt{F(x)(1-F(x))} \to N(0,1)\)
Compare convergence rates across different sample sizes
Observe how variance \(F(x)(1-F(x))\) is maximized at the median
Visualizing Convergence
The following code demonstrates Glivenko-Cantelli convergence empirically:
def visualize_glivenko_cantelli(true_dist, sample_sizes, seed=42):
"""
Visualize ECDF convergence to true CDF as n increases.
Parameters
----------
true_dist : scipy.stats distribution
True distribution to sample from.
sample_sizes : list of int
Sample sizes to demonstrate.
seed : int
Random seed for reproducibility.
"""
rng = np.random.default_rng(seed)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
x_grid = np.linspace(true_dist.ppf(0.001), true_dist.ppf(0.999), 500)
true_cdf = true_dist.cdf(x_grid)
for ax, n in zip(axes, sample_sizes):
# Generate sample
x = true_dist.rvs(size=n, random_state=rng)
# Compute ECDF
x_sorted, ecdf_vals = compute_ecdf(x)
# Compute supremum deviation (approximated on grid;
# for exact computation, evaluate at jump points)
ecdf_interp = np.interp(x_grid, x_sorted, ecdf_vals,
left=0, right=1)
D_n = np.max(np.abs(ecdf_interp - true_cdf))
# Plot
ax.plot(x_grid, true_cdf, linewidth=2, color='#2E86AB',
label='True F')
ax.step(x_sorted, ecdf_vals, where='post', linewidth=2,
color='#E74C3C', label='ECDF')
ax.set_title(f'n = {n}, $D_n$ = {D_n:.4f}')
ax.set_xlabel('x')
ax.set_ylabel('F(x)')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
plt.suptitle('Glivenko-Cantelli: ECDF Converges Uniformly to True CDF',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Demonstrate convergence for Exponential distribution
visualize_glivenko_cantelli(
stats.expon(scale=2),
sample_sizes=[20, 50, 200, 1000]
)
Parameters as Statistical Functionals
With the ECDF established as a legitimate and consistent estimator of \(F\), we now formalize how to use it for parameter estimation. The key insight is to view parameters as functionals of the distribution.
Definition and Framework
Definition: Statistical Functional
A statistical functional is a mapping \(T: \mathcal{F} \to \mathbb{R}\) that assigns a real number to each distribution \(F\) in some class \(\mathcal{F}\).
The target parameter is \(\theta = T(F)\), and the plug-in estimator is \(\hat{\theta} = T(\hat{F}_n)\).
This framework unifies many estimators under a single conceptual umbrella. The parameter is defined as a functional of the unknown distribution, and we estimate it by applying the same functional to the empirical distribution.
Canonical Examples
Mean Functional
The population mean is the simplest functional:
\[
T(F) = \int x \, dF(x) = \mathbb{E}_F[X].
\]
Applying this to \(\hat{F}_n\):
\[
T(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}.
\]
The plug-in estimator of the mean is the sample mean.
Variance Functional
The population variance:
\[
T(F) = \int (x - \mu_F)^2 \, dF(x),
\]
where \(\mu_F = \mathbb{E}_F[X]\). Applying to \(\hat{F}_n\):
\[
T(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
Note this is the biased sample variance (dividing by \(n\), not \(n-1\)). The plug-in principle yields the MLE, which has slight bias for variance estimation.
Quantile Functional
The \(q\)-th quantile (\(q \in (0,1)\)):
\[
T(F) = F^{-1}(q) = \inf\{x : F(x) \geq q\}.
\]
Applying to \(\hat{F}_n\):
\[
T(\hat{F}_n) = \hat{F}_n^{-1}(q) = X_{(\lceil nq \rceil)}.
\]
The plug-in estimator is the sample quantile (order statistic). Different conventions exist for sample quantiles; the theoretical definition uses the generalized inverse \(X_{(\lceil nq \rceil)}\) without interpolation. Software implementations vary: R offers 9 different quantile types, and NumPy’s np.percentile uses linear interpolation by default.
Convention Note
Throughout this chapter, our theoretical derivations use the order-statistic definition \(X_{(\lceil nq \rceil)}\), while code examples use np.percentile (which interpolates by default). For most practical purposes at moderate sample sizes, these differences are negligible. When implementing exact plug-in quantiles, use np.percentile(x, 100*q, method='lower') or compute order statistics directly.
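The sketch below illustrates the convention note for one arbitrary sample (\(n = 20\), \(q = 0.25\)); for these values the order statistic and `method='lower'` agree while the default interpolated percentile differs slightly, though exact agreement between the two discrete conventions is not guaranteed for every \(n\) and \(q\).

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(3, 2, size=20)
q = 0.25
n = len(x)

order_stat = np.sort(x)[int(np.ceil(n * q)) - 1]        # X_(ceil(nq)), 1-indexed
pct_linear = np.percentile(x, 100 * q)                  # default: linear interpolation
pct_lower = np.percentile(x, 100 * q, method='lower')   # no interpolation

print(f"order statistic X_(ceil(nq)):   {order_stat:.4f}")
print(f"np.percentile (default linear): {pct_linear:.4f}")
print(f"np.percentile (method='lower'): {pct_lower:.4f}")
```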
Correlation Functional
For bivariate data \((X, Y)\) with joint distribution \(F\):
\[
T(F) = \rho = \frac{\text{Cov}_F(X, Y)}{\sigma_X \sigma_Y}.
\]
The plug-in estimator is the sample correlation coefficient:
\[
T(\hat{F}_n) = r = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n (Y_i - \bar{Y})^2}}.
\]
Regression Coefficients
For the linear model \(Y = \beta_0 + \beta_1 X + \varepsilon\) with random \((X, Y)\):
\[
T(F) = \beta_1 = \frac{\text{Cov}_F(X, Y)}{\text{Var}_F(X)}.
\]
The plug-in estimator is the OLS coefficient computed from sample covariances.
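A small sketch (simulated data, arbitrary coefficients) confirms that the plug-in slope, built from the sample covariance and variance under \(\hat{F}_n\), coincides with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Plug-in slope and intercept: population formulas with F replaced by F_hat_n
beta1_plugin = np.cov(x, y, ddof=0)[0, 1] / np.var(x, ddof=0)
beta0_plugin = y.mean() - beta1_plugin * x.mean()

# OLS fit for comparison (np.polyfit returns [slope, intercept] for deg=1)
beta1_ols, beta0_ols = np.polyfit(x, y, deg=1)

print(f"plug-in: beta0 = {beta0_plugin:.4f}, beta1 = {beta1_plugin:.4f}")
print(f"OLS:     beta0 = {beta0_ols:.4f}, beta1 = {beta1_ols:.4f}")
```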
Summary Table of Functionals
| Parameter | Functional \(T(F)\) | Plug-in \(T(\hat{F}_n)\) |
|---|---|---|
| Mean | \(\int x \, dF(x)\) | \(\bar{X} = \frac{1}{n}\sum X_i\) |
| Variance | \(\int (x - \mu)^2 \, dF(x)\) | \(\frac{1}{n}\sum (X_i - \bar{X})^2\) |
| \(q\)-Quantile | \(\inf\{x: F(x) \geq q\}\) | \(X_{(\lceil nq \rceil)}\) |
| Correlation | \(\frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\) | Sample correlation \(r\) |
| Interquartile Range | \(F^{-1}(0.75) - F^{-1}(0.25)\) | \(X_{(\lceil 0.75n \rceil)} - X_{(\lceil 0.25n \rceil)}\) |
| Probability | \(P_F(X \in A) = \int_A dF\) | \(\frac{1}{n}\sum \mathbf{1}\{X_i \in A\}\) |
Fig. 138 Figure 4.2.6: Parameters as Functionals. (a) The mean functional \(T(F) = \int x\, dF(x)\) becomes the sample mean when applied to \(\hat{F}_n\). (b) The variance functional and its plug-in estimator. (c) The median as the 0.5-quantile functional. (d) Comparison of true values and plug-in estimates for a Gamma(3,2) population with \(n=100\).
Fig. 139 Figure 4.2.7: Bootstrap Sampling and the Multinomial. (a) The ECDF places mass \(1/n\) on each observation. (b) One bootstrap sample: selection counts \((N_1, \ldots, N_n)\) follow a \(\text{Multinomial}(n; 1/n, \ldots, 1/n)\) distribution—some points selected multiple times, others not at all. (c) Each \(N_i\) is marginally \(\text{Binomial}(n, 1/n)\), with \(P(N_i = 0) \approx e^{-1} \approx 0.368\).
The Plug-in Principle
We now state the plug-in principle formally as the conceptual foundation for nonparametric estimation and, ultimately, for the bootstrap.
Formal Statement
The Plug-in Principle
Parameter Level: To estimate a parameter \(\theta = T(F)\), substitute the empirical distribution \(\hat{F}_n\) for the unknown \(F\):
\[
\hat{\theta} = T(\hat{F}_n).
\]
Distribution Level: To approximate the sampling distribution \(G_F\) of a statistic under \(F\), substitute \(\hat{F}_n\) for \(F\):
\[
G_F(t) = P_F\bigl(T(X_{1:n}) \leq t\bigr) \;\approx\; G_{\hat{F}_n}(t) = P_{\hat{F}_n}\bigl(T(X_{1:n}^*) \leq t\bigr),
\]
where \(X_1^*, \ldots, X_n^* \stackrel{\text{iid}}{\sim} \hat{F}_n\).
Notation Legend: Distributions and Sampling
| Symbol | Meaning |
|---|---|
| \(F\) | Population distribution: the unknown true distribution generating the data |
| \(\hat{F}_n\) | Empirical distribution: the discrete distribution placing mass \(1/n\) at each observation |
| \(T\) | Functional/statistic: a mapping from distributions to parameters, e.g., \(T(F) = \int x\, dF\) |
| \(G_F\) | Sampling distribution under \(F\): the distribution of \(T(X_{1:n})\) when \(X_i \stackrel{\text{iid}}{\sim} F\) |
| \(G_{\hat{F}_n}\) | Bootstrap distribution: the sampling distribution of \(T(X_{1:n}^*)\) under resampling from \(\hat{F}_n\) |
The bootstrap idea: since \(\hat{F}_n \approx F\) (Glivenko-Cantelli), we hope \(G_{\hat{F}_n} \approx G_F\).
The parameter-level plug-in gives us point estimates. The distribution-level plug-in—computing \(G_{\hat{F}_n}\) by Monte Carlo—is the bootstrap.
Consistency of Plug-in Estimators
Under what conditions does \(T(\hat{F}_n) \to T(F)\) as \(n \to \infty\)? The key property is continuity of the functional \(T\).
Theorem: Plug-in Consistency
If \(T\) is continuous with respect to a metric \(d\) on distributions, and \(d(\hat{F}_n, F) \to 0\) almost surely, then:
\[
T(\hat{F}_n) \xrightarrow{a.s.} T(F).
\]
Two Convergence Concepts:
Glivenko-Cantelli: \(\|\hat{F}_n - F\|_\infty \to 0\) almost surely (uniform convergence of CDFs)
Weak convergence: \(\int \varphi \, d\hat{F}_n \to \int \varphi \, dF\) for all bounded continuous \(\varphi\)
Uniform convergence implies weak convergence, but the relevant topology for functional continuity is weak convergence.
Bounded Continuous Functionals: For functionals of the form \(T(F) = \int \varphi \, dF\) where \(\varphi\) is bounded and continuous, weak convergence directly implies \(T(\hat{F}_n) \to T(F)\). The Glivenko-Cantelli theorem provides the underlying convergence that drives this.
Moment Functionals: For unbounded integrands like \(\varphi(x) = x\) (mean) or \(\varphi(x) = x^2\) (variance), weak convergence alone is insufficient—these functionals are not continuous under weak convergence. Additional moment control is required: uniform integrability or explicit finite-moment assumptions bridge the gap. Consistency of \(\bar{X}\) for the mean follows from the Strong Law of Large Numbers (requiring \(\mathbb{E}|X| < \infty\)), not from CDF uniform convergence. Variance consistency requires \(\mathbb{E}[X^2] < \infty\); finite fourth moments are needed for asymptotic normality of the variance estimator or for sharper MSE bounds.
Examples of Continuous Functionals (under appropriate conditions):
Quantiles at continuity points of \(F\)
Trimmed means
Distribution functions of bounded continuous functions of \(X\)
Examples Where Extra Care Is Needed:
Mean and variance (require moment conditions; consistency via LLN)
Mode (can jump with small changes in \(F\))
Number of modes (discrete-valued)
Extreme quantiles when \(F\) has atoms
Functionals involving the support of \(F\)
Worked Example: Multiple Functionals
Example 💡 Plug-in Estimation for Multiple Parameters
Setup: We observe \(n = 100\) observations from an unknown distribution. We want to estimate the mean, variance, median, and interquartile range.
Python Implementation:
import numpy as np
from scipy import stats
def plugin_estimates(x):
"""
Compute plug-in estimates for common functionals.
Parameters
----------
x : array-like
Sample observations.
Returns
-------
dict
Dictionary of plug-in estimates.
"""
x = np.asarray(x)
n = len(x)
estimates = {
'mean': np.mean(x), # T(F) = ∫x dF
'variance_plugin': np.var(x, ddof=0), # T(F) = ∫(x-μ)² dF
'variance_unbiased': np.var(x, ddof=1), # Corrected estimator
'median': np.median(x), # T(F) = F⁻¹(0.5)
'q25': np.percentile(x, 25), # T(F) = F⁻¹(0.25)
'q75': np.percentile(x, 75), # T(F) = F⁻¹(0.75)
'iqr': np.percentile(x, 75) - np.percentile(x, 25),
'skewness': stats.skew(x, bias=True), # Plugin (biased)
'kurtosis': stats.kurtosis(x, bias=True), # Plugin (biased)
}
return estimates
# Generate data from Gamma(3, 2) - right-skewed distribution
rng = np.random.default_rng(42)
true_shape, true_scale = 3, 2
x = rng.gamma(true_shape, true_scale, size=100)
# True values
true_mean = true_shape * true_scale # = 6
true_var = true_shape * true_scale**2 # = 12
true_median = stats.gamma.ppf(0.5, true_shape, scale=true_scale)
true_iqr = (stats.gamma.ppf(0.75, true_shape, scale=true_scale) -
stats.gamma.ppf(0.25, true_shape, scale=true_scale))
# Compute plug-in estimates
estimates = plugin_estimates(x)
print("Plug-in Estimates vs True Values")
print("=" * 50)
print(f"Mean: {estimates['mean']:.3f} (true: {true_mean:.3f})")
print(f"Variance: {estimates['variance_plugin']:.3f} (true: {true_var:.3f})")
print(f"Median: {estimates['median']:.3f} (true: {true_median:.3f})")
print(f"IQR: {estimates['iqr']:.3f} (true: {true_iqr:.3f})")
Output:
Plug-in Estimates vs True Values
==================================================
Mean: 5.892 (true: 6.000)
Variance: 10.847 (true: 12.000)
Median: 5.203 (true: 5.348)
IQR: 4.644 (true: 4.419)
Fig. 140 Figure 4.2.8: Plug-in Consistency. For a Beta(2,5) population, plug-in estimators converge to true values as \(n\) increases. (a-c) Median estimates (with 80% bands) approach true values for mean, variance, and median. (d) MSE decreases as \(O(1/n)\) for all continuous functionals.
When the Plug-in Principle Fails
While the plug-in principle is powerful and broadly applicable, it has important limitations. Understanding these failure modes is essential for choosing appropriate inference methods.
Two-Axis Taxonomy of Failure Modes
Plug-in failures can be classified along two dimensions:
Axis 1: Identification
Identified: \(\theta = T(F)\) is a well-defined functional (mean, median, quantiles)
Not identified from observed data: Target requires additional structure or assumptions (causal effects without ignorability, mixture component labels, censored survival beyond observation period)
Axis 2: Regularity/Continuity
Smooth/continuous functional: Small changes in \(F\) yield small changes in \(T(F)\) (mean, variance, trimmed mean)
Non-smooth/discontinuous functional: Small changes in \(F\) can cause jumps in \(T(F)\) (mode, extreme quantiles, boundary of support)
Failures from non-identification cannot be fixed by more data or better algorithms—the target is fundamentally not available from the observed data distribution. Failures from non-smoothness make plug-in estimators highly variable but still consistent; bootstrap may require modifications.
Non-identification
The plug-in principle assumes the parameter \(\theta = T(F)\) is identified—that the functional \(T\) maps each distribution to a unique value. This fails in several important settings.
Mixture Models
In finite mixture models, the component labels are not identified. If \(F = 0.3 \cdot N(\mu_1, 1) + 0.7 \cdot N(\mu_2, 1)\), swapping \(\mu_1 \leftrightarrow \mu_2\) and adjusting weights gives the same \(F\). There is no unique \((\mu_1, \mu_2)\) functional.
Fix: Impose identifiability constraints (e.g., order \(\mu_1 < \mu_2\)) or target mixture-invariant functionals.
Factor Models
Factor loadings are identified only up to rotation. The factor covariance structure is a functional of \(F\), but individual loadings are not.
Fix: Impose rotation constraints or target rotation-invariant quantities.
Censoring and Truncation
When observations are censored or truncated, the empirical distribution \(\hat{F}_n\) estimates the observed-data distribution, not the underlying distribution of interest.
Right Censoring (common in survival analysis): We observe \(\min(T_i, C_i)\) where \(T_i\) is the survival time and \(C_i\) is the censoring time. The ECDF of observed values does not estimate \(F_T\).
Fix: Use the Kaplan-Meier estimator instead of the ECDF; bootstrap the Kaplan-Meier.
Truncation: Only observations satisfying some condition are recorded. The ECDF estimates the conditional distribution, not the marginal.
Fix: Model the selection mechanism; use inverse probability weighting.
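A brief simulation makes the censoring failure concrete. With exponential survival and censoring times (parameters chosen arbitrarily), the naive one-minus-ECDF of the observed follow-up times understates survival because censored subjects are treated as if they had the event, while a hand-coded Kaplan-Meier estimator (a sketch, assuming no tied times) tracks the true curve.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
T = rng.exponential(scale=2.0, size=n)   # true survival times
C = rng.exponential(scale=3.0, size=n)   # censoring times
obs = np.minimum(T, C)                   # observed follow-up time
event = T <= C                           # True if the event was observed

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate at the sorted observation times (no ties assumed)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))       # n, n-1, ..., 1 still at risk
    factors = np.where(event, 1.0 - 1.0 / at_risk, 1.0)
    return time, np.cumprod(factors)

t_km, S_km = kaplan_meier(obs, event)
t_grid = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
S_true = np.exp(-t_grid / 2.0)                        # Exp(mean 2) survival function
S_naive = 1 - np.searchsorted(np.sort(obs), t_grid, side='right') / n
S_km_grid = np.interp(t_grid, t_km, S_km)

for t, s_t, s_n, s_k in zip(t_grid, S_true, S_naive, S_km_grid):
    print(f"t={t:.1f}  true={s_t:.3f}  naive 1-ECDF={s_n:.3f}  Kaplan-Meier={s_k:.3f}")
```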
Causal Parameters
Causal effects like the Average Treatment Effect (ATE) are defined in terms of potential outcomes, not the observational distribution alone:
\[
\text{ATE} = \mathbb{E}\bigl[Y^{(1)}\bigr] - \mathbb{E}\bigl[Y^{(0)}\bigr],
\]
where \(Y^{(a)}\) is the potential outcome under treatment \(a\). Without additional identifying assumptions, the ATE is not identified from the observed data distribution \(P(Y, A, X)\)—naively computing \(\mathbb{E}[Y|A=1] - \mathbb{E}[Y|A=0]\) conflates treatment effects with selection bias. However, under assumptions like ignorability (no unmeasured confounding), positivity (all covariate strata have both treated and untreated units), and consistency (observed outcome equals potential outcome for received treatment), the ATE becomes a functional of the observed data distribution: \(\mathbb{E}_X[\mathbb{E}[Y|A=1,X] - \mathbb{E}[Y|A=0,X]]\).
Fix: State identifying assumptions explicitly; use appropriate estimators (matching, IPW, doubly robust); bootstrap the design-respecting estimator, not the naive difference in means.
Design-Based Inference
In survey sampling, the parameter of interest is often a finite population quantity:
\[
\theta = \frac{1}{N}\sum_{i=1}^{N} y_i,
\]
where the sum is over the entire finite population of size \(N\). Observations come from a probability sample with unequal inclusion probabilities \(\pi_i\).
The ECDF treats all observations equally, ignoring the sampling design.
Fix: Use design-weighted estimators; bootstrap with survey weights (replicate weight methods).
Discontinuous and Set-Valued Functionals
The plug-in principle works well when \(T\) is continuous. It can fail for:
Discontinuous Functionals:
Mode: small changes in \(F\) can cause the mode to jump
Support boundaries: \(\inf\{x: F(x) > 0\}\) depends on the far tail
Set-Valued Functionals:
The set of minimizers of a non-convex criterion
Confidence sets for non-identified parameters
Fix: Add regularization; target a well-defined selection from the set (e.g., smallest minimizer); recognize that bootstrap may be inconsistent.
Summary of Failure Modes
| Failure Mode | Diagnostic | Remedy |
|---|---|---|
| Non-identification | Multiple \(\theta\) yield same \(F\) | Impose constraints; target invariant functionals |
| Censoring/Truncation | Incomplete observations | Kaplan-Meier; model selection mechanism |
| Causal targets | Need \(P(Y^a)\), not \(P(Y\mid A)\) | State assumptions; design-based estimators |
| Survey sampling | Unequal inclusion probabilities | Weighted estimators; replicate weights |
| Discontinuous \(T\) | \(T(\hat{F}_n)\) unstable across samples | Regularize; smooth; use different functional |
Common Pitfall ⚠️ Bootstrap Cannot Fix Non-identification
If \(\theta = T(F)\) is not identified by the distribution \(F\), then bootstrapping cannot help. The bootstrap approximates the sampling distribution of \(T(\hat{F}_n)\), but if \(T\) is not well-defined, neither is its sampling distribution.
Key principle: Bootstrap provides uncertainty quantification for identified quantities; it cannot identify parameters that the distribution does not identify.
Fig. 141 Figure 4.2.9: When Plug-in Fails. (a) Extreme quantiles: ECDF cannot extrapolate beyond observed data range. (b) Censoring: ECDF of observed times differs from true survival distribution. (c) Discontinuous functionals like the mode: small data changes cause large estimate jumps. (d) Summary of failure modes and remedies.
The Bootstrap Idea in One Sentence
We have established that:
The ECDF \(\hat{F}_n\) converges uniformly to \(F\) (Glivenko-Cantelli)
For continuous functionals, \(T(\hat{F}_n) \to T(F)\) (plug-in consistency)
The bootstrap extends this logic from point estimation to distributional approximation:
The Bootstrap Principle
To approximate the sampling distribution \(G(t) = P_F\{T(X_{1:n}) \leq t\}\), use:
\[
\hat{G}(t) = P_{\hat{F}_n}\{T(X^*_{1:n}) \leq t\},
\]
where \(X^*_1, \ldots, X^*_n \stackrel{\text{iid}}{\sim} \hat{F}_n\).
In words: compute the sampling distribution under \(\hat{F}_n\) instead of \(F\).
Since \(\hat{F}_n\) is discrete, we compute \(\hat{G}\) by Monte Carlo: sample \(X^*_{1:n}\) from \(\hat{F}_n\) (i.e., resample with replacement from the data), compute \(T(X^*_{1:n})\), and repeat \(B\) times. The empirical distribution of the \(B\) values approximates \(\hat{G}\), which in turn approximates \(G\).
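As a preview of Section 4.3, here is a minimal Monte Carlo sketch of \(\hat{G}\) for the sample mean (the data-generating choices and \(B\) are illustrative).

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.gamma(3, 2, size=100)   # the "observed" data
B = 2000

# Resample with replacement from F_hat_n, compute the statistic each time;
# the empirical distribution of these B values approximates G_hat.
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean() for _ in range(B)
])

print("plug-in estimate T(F_hat_n):", x.mean())
print("bootstrap SE of the mean:   ", boot_means.std(ddof=1))
print("approx. 95% percentile CI:  ", np.percentile(boot_means, [2.5, 97.5]))
```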
This is the subject of Section 4.3.
Computational Implementation
We conclude with a comprehensive example demonstrating the concepts of this section.
"""
Section 4.2: Comprehensive Example
==================================
Demonstrates ECDF, convergence, DKW bands, and plug-in estimation.
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Set up figure
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
rng = np.random.default_rng(42)
# True distribution: Mixture of two normals
def true_cdf(x):
return 0.3 * stats.norm.cdf(x, loc=-2, scale=1) + \
0.7 * stats.norm.cdf(x, loc=2, scale=1.5)
def sample_mixture(n, rng):
"""Sample from mixture of two normals."""
components = rng.choice([0, 1], size=n, p=[0.3, 0.7])
samples = np.where(
components == 0,
rng.normal(-2, 1, n),
rng.normal(2, 1.5, n)
)
return samples
# Panel 1: ECDF vs True CDF
ax = axes[0, 0]
n = 100
x = sample_mixture(n, rng)
x_sorted, ecdf_vals = compute_ecdf(x)
x_grid = np.linspace(-6, 8, 500)
ax.plot(x_grid, true_cdf(x_grid), 'b-', lw=2, label='True F')
ax.step(x_sorted, ecdf_vals, 'r-', lw=2, where='post', label=f'ECDF (n={n})')
ax.set_xlabel('x')
ax.set_ylabel('F(x)')
ax.set_title('ECDF vs True CDF')
ax.legend()
ax.grid(True, alpha=0.3)
# Panel 2: Glivenko-Cantelli convergence
ax = axes[0, 1]
sample_sizes = [10, 25, 50, 100, 250, 500, 1000]
D_n_values = []
for n_test in sample_sizes:
x_test = sample_mixture(n_test, rng)
x_s, ecdf_v = compute_ecdf(x_test)
ecdf_interp = np.interp(x_grid, x_s, ecdf_v, left=0, right=1)
D_n = np.max(np.abs(ecdf_interp - true_cdf(x_grid)))
D_n_values.append(D_n)
ax.loglog(sample_sizes, D_n_values, 'bo-', lw=2, markersize=8, label='Observed $D_n$')
ax.loglog(sample_sizes, 1/np.sqrt(sample_sizes), 'r--', lw=2, label=r'$1/\sqrt{n}$ reference')
ax.set_xlabel('Sample size n')
ax.set_ylabel('Supremum deviation $D_n$')
ax.set_title('Glivenko-Cantelli Convergence')
ax.legend()
ax.grid(True, alpha=0.3)
# Panel 3: DKW Confidence Band
ax = axes[0, 2]
n = 200
x = sample_mixture(n, rng)
x_sorted, lower, upper, ecdf_vals = dkw_confidence_band(x, alpha=0.05)
ax.fill_between(x_sorted, lower, upper, alpha=0.3, color='blue',
step='post', label='95% DKW Band')
ax.step(x_sorted, ecdf_vals, 'r-', lw=2, where='post', label='ECDF')
ax.plot(x_grid, true_cdf(x_grid), 'b--', lw=2, label='True F')
ax.set_xlabel('x')
ax.set_ylabel('F(x)')
ax.set_title(f'DKW 95% Confidence Band (n={n})')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
# Panel 4: Plug-in for mean
ax = axes[1, 0]
n_sims = 1000
n_sample = 50
plugin_means = []
for _ in range(n_sims):
x_sim = sample_mixture(n_sample, rng)
plugin_means.append(np.mean(x_sim))
true_mean = 0.3 * (-2) + 0.7 * 2 # = 0.8
ax.hist(plugin_means, bins=40, density=True, alpha=0.7, color='steelblue',
edgecolor='white', label='Sampling distribution')
ax.axvline(true_mean, color='red', lw=2, linestyle='--', label=f'True μ = {true_mean}')
ax.axvline(np.mean(plugin_means), color='green', lw=2, label=f'Mean of estimates')
ax.set_xlabel('Sample mean')
ax.set_ylabel('Density')
ax.set_title('Plug-in for Mean (n=50, 1000 simulations)')
ax.legend()
# Panel 5: Plug-in for median
ax = axes[1, 1]
plugin_medians = []
for _ in range(n_sims):
x_sim = sample_mixture(n_sample, rng)
plugin_medians.append(np.median(x_sim))
# True median (numerical)
from scipy.optimize import brentq
true_median = brentq(lambda x: true_cdf(x) - 0.5, -5, 5)
ax.hist(plugin_medians, bins=40, density=True, alpha=0.7, color='steelblue',
edgecolor='white', label='Sampling distribution')
ax.axvline(true_median, color='red', lw=2, linestyle='--',
label=f'True median = {true_median:.2f}')
ax.axvline(np.mean(plugin_medians), color='green', lw=2,
label=f'Mean of estimates = {np.mean(plugin_medians):.2f}')
ax.set_xlabel('Sample median')
ax.set_ylabel('Density')
ax.set_title('Plug-in for Median (n=50, 1000 simulations)')
ax.legend()
# Panel 6: Plug-in bias as function of n
ax = axes[1, 2]
sample_sizes_bias = [20, 50, 100, 200, 500]
mean_bias = []
var_bias = []
true_var = (0.3 * (1 + 4) + 0.7 * (2.25 + 4)) - true_mean**2
for n_test in sample_sizes_bias:
biases_mean = []
biases_var = []
for _ in range(500):
x_sim = sample_mixture(n_test, rng)
biases_mean.append(np.mean(x_sim) - true_mean)
biases_var.append(np.var(x_sim, ddof=0) - true_var)
mean_bias.append(np.mean(biases_mean))
var_bias.append(np.mean(biases_var))
ax.plot(sample_sizes_bias, mean_bias, 'bo-', lw=2, markersize=8, label='Mean estimator bias')
ax.plot(sample_sizes_bias, var_bias, 'rs-', lw=2, markersize=8, label='Variance (plug-in) bias')
ax.axhline(0, color='black', linestyle='--', lw=1)
ax.set_xlabel('Sample size n')
ax.set_ylabel('Bias')
ax.set_title('Plug-in Estimator Bias vs Sample Size')
ax.legend()
ax.grid(True, alpha=0.3)
plt.suptitle('Section 4.2: ECDF, Convergence, and Plug-in Principle',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Bringing It All Together
This section established the mathematical foundations for bootstrap inference. The empirical distribution \(\hat{F}_n\) provides a nonparametric estimate of the unknown population distribution \(F\), with strong convergence guarantees:
Pointwise: \(\hat{F}_n(x) \to F(x)\) almost surely for each \(x\) (SLLN)
Uniform: \(\sup_x |\hat{F}_n(x) - F(x)| \to 0\) almost surely (Glivenko-Cantelli)
Finite-sample: \(P(\sup_x |\hat{F}_n(x) - F(x)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}\) (DKW)
The plug-in principle leverages these convergence results: viewing parameters as functionals \(\theta = T(F)\), we estimate them by \(\hat{\theta} = T(\hat{F}_n)\). For continuous functionals, this yields consistent estimators.
The bootstrap extends plug-in from parameters to distributions: to approximate the sampling distribution \(G = G_F\), we compute \(\hat{G} = G_{\hat{F}_n}\) by Monte Carlo simulation. This is computationally feasible because \(\hat{F}_n\) is discrete—sampling from it means resampling with replacement from the data.
The next section develops the nonparametric bootstrap algorithm in full detail, including standard error estimation, confidence interval construction, and diagnostics for assessing bootstrap performance.
Fig. 142 Figure 4.2.10: The Bootstrap Principle. The bootstrap applies plug-in at the distribution level: just as \(\hat{F}_n \approx F\) (Glivenko-Cantelli), we hope \(G_{\hat{F}_n} \approx G_F\) (bootstrap consistency). The key insight: compute the sampling distribution under \(\hat{F}_n\) by resampling with replacement from the observed data.
Key Takeaways 📝
ECDF Definition: \(\hat{F}_n(x) = n^{-1}\sum_{i=1}^n \mathbf{1}\{X_i \leq x\}\) is a step function placing mass \(1/n\) at each observation
Convergence Guarantees: The Glivenko-Cantelli theorem ensures \(\|\hat{F}_n - F\|_\infty \to 0\) almost surely; the DKW inequality gives finite-sample probability bounds
Plug-in Principle: Estimate \(\theta = T(F)\) by \(\hat{\theta} = T(\hat{F}_n)\); this recovers familiar estimators (sample mean, sample quantiles, sample correlation) as special cases
When It Fails: Plug-in requires identification and continuity; failures occur with censoring, causal targets, survey designs, and discontinuous functionals
Bridge to Bootstrap: The bootstrap applies plug-in to the sampling distribution itself: approximate \(G_F\) by \(G_{\hat{F}_n}\) via Monte Carlo
Learning Outcome Alignment: This section supports LO 1 (simulation techniques) and LO 3 (resampling methods) by establishing the theoretical foundation for bootstrap inference
Section 4.2 Exercises: ECDF and Plug-in Mastery
These exercises progressively build your understanding of the empirical distribution function and plug-in estimation, from theoretical foundations through computational implementation to recognizing failure modes. Each exercise connects rigorous mathematics to practical data analysis skills.
A Note on These Exercises
These exercises are designed to deepen your understanding through hands-on exploration:
Exercise 4.2.1 reinforces the ECDF as a probability measure through explicit computation
Exercise 4.2.2 explores Glivenko-Cantelli convergence empirically with visualization
Exercise 4.2.3 develops practical skills with DKW confidence bands
Exercise 4.2.4 connects plug-in estimation to familiar statistics via the functional perspective
Exercise 4.2.5 investigates when plug-in fails through simulation of failure modes
Exercise 4.2.6 synthesizes all concepts into a complete nonparametric analysis workflow
Interactive Companion: Before or after working these exercises, explore the Distribution Sampler & ECDF Visualizer to experiment with different distributions and sample sizes interactively.
Complete solutions with derivations, code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!
Exercise 4.2.1: ECDF as a Probability Measure
This exercise reinforces the fundamental concept that the ECDF defines a legitimate probability distribution—a discrete measure placing mass \(1/n\) at each observation.
Background: Why This Matters
Understanding the ECDF as a probability measure (not just a step function) is essential for the plug-in principle. When we write \(T(\hat{F}_n)\) for a functional like the mean, we are computing an expectation under a genuine probability distribution. This perspective connects abstract functional analysis to concrete computation.
For a sample \(X_1 = 2, X_2 = 5, X_3 = 5, X_4 = 8, X_5 = 11\):
Write out \(\hat{F}_5(x)\) as a piecewise function
Verify that \(\hat{F}_5\) satisfies the three defining properties of a CDF
Compute \(P_{\hat{F}_5}(3 < X \leq 8)\) directly from the ECDF
For the same sample, verify that \(\hat{F}_5\) places mass \(1/5\) at each unique observation by computing \(\hat{F}_5(x) - \hat{F}_5(x^-)\) at each jump point. Handle the tied observation at \(x = 5\) correctly.
Compute the mean under \(\hat{F}_5\) using the formula \(\mathbb{E}_{\hat{F}_5}[X] = \int x \, d\hat{F}_5(x) = \sum_{i=1}^5 X_i \cdot (1/5)\). Verify this equals the sample mean \(\bar{X}\).
Compute the variance under \(\hat{F}_5\) using \(\text{Var}_{\hat{F}_5}(X) = \mathbb{E}_{\hat{F}_5}[X^2] - (\mathbb{E}_{\hat{F}_5}[X])^2\). Compare to the sample variance formula \(s^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\). Why do they differ?
Hint
For part (b), remember that \(\hat{F}_n(x^-)\) means the left-hand limit. At \(x = 5\) with two tied observations, the jump is \(2/5\), not \(1/5\). For part (d), the ECDF variance divides by \(n\), while the sample variance divides by \(n-1\)—the latter is an unbiased estimator of \(\sigma^2\), but the plug-in estimator uses the population formula with \(F\) replaced by \(\hat{F}_n\).
Solution
Part (a): ECDF as a piecewise function
Step 1: Identify jump points
Sorted unique values: \(2, 5, 8, 11\). Note that \(5\) appears twice.
The ECDF is:
\[
\hat{F}_5(x) = \begin{cases} 0 & x < 2 \\ 1/5 & 2 \leq x < 5 \\ 3/5 & 5 \leq x < 8 \\ 4/5 & 8 \leq x < 11 \\ 1 & x \geq 11 \end{cases}
\]
Step 2: Verify CDF properties
Right-continuity: \(\hat{F}_5(x)\) is right-continuous at each jump (includes the point, excludes left)
Monotonicity: \(\hat{F}_5\) is non-decreasing (jumps up, never down)
Limits: \(\lim_{x \to -\infty} \hat{F}_5(x) = 0\) and \(\lim_{x \to \infty} \hat{F}_5(x) = 1\)
All three CDF axioms are satisfied. ✓
Step 3: Compute probability
\[
P_{\hat{F}_5}(3 < X \leq 8) = \hat{F}_5(8) - \hat{F}_5(3) = \frac{4}{5} - \frac{1}{5} = \frac{3}{5}.
\]
This corresponds to 3 of 5 observations (5, 5, 8) falling in the interval \((3, 8]\).
Part (b): Jump sizes and mass placement
At each jump point, compute \(\hat{F}_5(x) - \hat{F}_5(x^-)\):
At \(x = 2\): \(\hat{F}_5(2) - \hat{F}_5(2^-) = 1/5 - 0 = 1/5\)
At \(x = 5\): \(\hat{F}_5(5) - \hat{F}_5(5^-) = 3/5 - 1/5 = 2/5\) (two tied values!)
At \(x = 8\): \(\hat{F}_5(8) - \hat{F}_5(8^-) = 4/5 - 3/5 = 1/5\)
At \(x = 11\): \(\hat{F}_5(11) - \hat{F}_5(11^-) = 1 - 4/5 = 1/5\)
Total mass: \(1/5 + 2/5 + 1/5 + 1/5 = 5/5 = 1\) ✓
Key insight: The ECDF places mass \(k/n\) at any value that appears \(k\) times in the sample.
Part (c): Mean under the ECDF
\[
\mathbb{E}_{\hat{F}_5}[X] = \sum_{i=1}^5 X_i \cdot \frac{1}{5} = \frac{2 + 5 + 5 + 8 + 11}{5} = \frac{31}{5} = 6.2
\]
This equals the sample mean \(\bar{X} = 6.2\). ✓
Part (d): Variance under the ECDF
Step 1: Compute second moment
\[\mathbb{E}_{\hat{F}_5}[X^2] = \frac{2^2 + 5^2 + 5^2 + 8^2 + 11^2}{5} = \frac{239}{5} = 47.8\]
Step 2: Compute plug-in variance
\[\text{Var}_{\hat{F}_5}(X) = \mathbb{E}_{\hat{F}_5}[X^2] - (\mathbb{E}_{\hat{F}_5}[X])^2 = 47.8 - (6.2)^2 = 47.8 - 38.44 = 9.36\]
Step 3: Compare to sample variance
Sample variance with \(n-1\) denominator:
\[s^2 = \frac{1}{4}\sum_{i=1}^5 (X_i - 6.2)^2 = \frac{46.8}{4} = 11.7\]
Why they differ: The plug-in variance \(\text{Var}_{\hat{F}_5}(X) = 9.36\) divides by \(n = 5\). The sample variance \(s^2 = 11.7\) divides by \(n-1 = 4\) to achieve unbiasedness.
The relationship is:
\[s^2 = \frac{n}{n-1}\,\text{Var}_{\hat{F}_n}(X)\]
Here: \(11.7 = (5/4) \cdot 9.36 = 11.7\) ✓
Key insight: The plug-in estimator uses the “population” formula applied to \(\hat{F}_n\). It is biased for \(\sigma^2\) but consistent. The factor \(n/(n-1)\) is Bessel’s correction for unbiasedness.
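For readers who want to double-check the hand computations, here is a minimal NumPy sketch (an addition to the exercise, not part of the original solution) that reproduces the plug-in mean, the plug-in variance, and the \(n/(n-1)\) relationship:
import numpy as np
x = np.array([2, 5, 5, 8, 11])
# Plug-in (ECDF) moments: each observation carries mass 1/n
mean_plugin = np.mean(x)                     # E_{F_hat}[X] = 6.2
var_plugin = np.mean(x**2) - mean_plugin**2  # divides by n, gives 9.36
# Unbiased sample variance divides by n - 1 (Bessel's correction)
var_sample = np.var(x, ddof=1)               # gives 11.7
print(mean_plugin, var_plugin, var_sample, var_sample / var_plugin)  # ratio = 5/4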
Exercise 4.2.2: Glivenko-Cantelli Convergence in Action
This exercise builds intuition for uniform convergence through visualization and empirical investigation of convergence rates.
Background: Seeing Convergence
The Glivenko-Cantelli theorem guarantees \(\|\hat{F}_n - F\|_\infty \to 0\) almost surely, but the theorem doesn’t tell us how fast. Through simulation, we can observe that \(D_n = O(1/\sqrt{n})\) and see the dramatic visual convergence as sample size increases. This empirical understanding complements the theoretical results.
Write a function that computes the exact supremum deviation \(D_n = \sup_x |\hat{F}_n(x) - F(x)|\) for a given sample and true CDF. Note: the supremum is achieved either just before or at a jump point of \(\hat{F}_n\).
Generate samples of sizes \(n = 25, 100, 400, 1600\) from \(N(0, 1)\). For each, plot \(\hat{F}_n\) versus the true CDF \(\Phi\). Visually verify uniform convergence.
For \(n = 50, 100, 200, 500, 1000, 2000\), simulate 1000 replications and compute \(D_n\) for each. Plot the mean and 95th percentile of \(D_n\) versus \(n\) on a log-log scale. Estimate the convergence rate by fitting a line.
Compare the empirical 95th percentile of \(D_n\) to the DKW bound \(\varepsilon_n(0.05) = \sqrt{\ln(2/0.05)/(2n)}\). Is DKW conservative? By how much?
Hint
For part (a), the supremum of \(|\hat{F}_n(x) - F(x)|\) can be computed exactly. At the sorted sample values \(X_{(1)} < \cdots < X_{(n)}\), check both \(|i/n - F(X_{(i)})|\) (right limit at jump) and \(|(i-1)/n - F(X_{(i)})|\) (left limit just before jump). The maximum over all these \(2n\) values gives the exact \(D_n\).
Solution
Part (a): Exact supremum deviation computation
import numpy as np
from scipy import stats
def exact_supremum_deviation(x, cdf_func):
"""
Compute exact supremum deviation D_n = sup|F_n(x) - F(x)|.
The supremum is achieved at the jump points of the ECDF.
At each X_(i), we check both the right limit (i/n) and
left limit ((i-1)/n) against the true CDF.
Parameters
----------
x : array-like
Sample observations
cdf_func : callable
True CDF function F(x)
Returns
-------
D_n : float
Exact supremum deviation
"""
x_sorted = np.sort(x)
n = len(x)
# True CDF at each order statistic
F_at_jumps = cdf_func(x_sorted)
# ECDF values: right limit is i/n, left limit is (i-1)/n
ecdf_right = np.arange(1, n + 1) / n # i/n at X_(i)
ecdf_left = np.arange(0, n) / n # (i-1)/n just before X_(i)
# Deviations at right limits and left limits
dev_right = np.abs(ecdf_right - F_at_jumps)
dev_left = np.abs(ecdf_left - F_at_jumps)
# Maximum over all 2n candidates
D_n = max(np.max(dev_right), np.max(dev_left))
return D_n
# Test
rng = np.random.default_rng(42)
x = rng.normal(0, 1, 50)
D_n = exact_supremum_deviation(x, stats.norm.cdf)
print(f"D_n for n=50: {D_n:.4f}")
Part (b): Visual convergence demonstration
import matplotlib.pyplot as plt
rng = np.random.default_rng(123)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
sample_sizes = [25, 100, 400, 1600]
x_grid = np.linspace(-4, 4, 500)
true_cdf = stats.norm.cdf(x_grid)
for ax, n in zip(axes.flat, sample_sizes):
x = rng.normal(0, 1, n)
x_sorted = np.sort(x)
ecdf_vals = np.arange(1, n + 1) / n
D_n = exact_supremum_deviation(x, stats.norm.cdf)
ax.plot(x_grid, true_cdf, 'b-', linewidth=2, label='True Φ(x)')
ax.step(x_sorted, ecdf_vals, 'r-', where='post',
linewidth=1.5, label=f'$\\hat{{F}}_{{{n}}}$')
# Visual envelope of width D_n around true CDF (for illustration;
# D_n is defined relative to the ECDF, not the true CDF)
ax.fill_between(x_grid, true_cdf - D_n, true_cdf + D_n,
alpha=0.2, color='gray', label=f'±$D_n$ envelope')
ax.set_title(f'n = {n}, $D_n$ = {D_n:.4f}', fontsize=12)
ax.set_xlabel('x')
ax.set_ylabel('CDF')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
plt.suptitle('Glivenko-Cantelli: Uniform Convergence of ECDF to True CDF',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('gc_convergence_visual.png', dpi=150)
Part (c): Convergence rate estimation
n_values = [50, 100, 200, 500, 1000, 2000]
n_sim = 1000
mean_Dn = []
pct95_Dn = []
rng = np.random.default_rng(456)
for n in n_values:
D_n_samples = []
for _ in range(n_sim):
x = rng.normal(0, 1, n)
D_n = exact_supremum_deviation(x, stats.norm.cdf)
D_n_samples.append(D_n)
mean_Dn.append(np.mean(D_n_samples))
pct95_Dn.append(np.percentile(D_n_samples, 95))
mean_Dn = np.array(mean_Dn)
pct95_Dn = np.array(pct95_Dn)
# Fit log-log regression for convergence rate
log_n = np.log(n_values)
log_mean = np.log(mean_Dn)
slope_mean, intercept_mean = np.polyfit(log_n, log_mean, 1)
log_pct95 = np.log(pct95_Dn)
slope_95, intercept_95 = np.polyfit(log_n, log_pct95, 1)
print("=== Convergence Rate Analysis ===")
print(f"Mean D_n: slope = {slope_mean:.3f} (expect ≈ -0.5)")
print(f"95th pct: slope = {slope_95:.3f} (expect ≈ -0.5)")
print(f"\nD_n = O(n^{{{slope_mean:.2f}}}) ≈ O(1/√n)")
Output:
=== Convergence Rate Analysis ===
Mean D_n: slope = -0.498 (expect ≈ -0.5)
95th pct: slope = -0.502 (expect ≈ -0.5)
D_n = O(n^{-0.50}) ≈ O(1/√n)
Part (d): DKW bound comparison
# DKW bound for α = 0.05
alpha = 0.05
dkw_bound = np.sqrt(np.log(2 / alpha) / (2 * np.array(n_values)))
print("=== DKW Bound vs Empirical 95th Percentile ===")
print(f"{'n':<8} {'Empirical 95%':<15} {'DKW Bound':<15} {'Ratio':<10}")
print("-" * 48)
for n, emp, dkw in zip(n_values, pct95_Dn, dkw_bound):
ratio = dkw / emp
print(f"{n:<8} {emp:<15.4f} {dkw:<15.4f} {ratio:<10.2f}")
print(f"\nAverage conservativeness ratio: {np.mean(dkw_bound / pct95_Dn):.2f}")
Output:
=== DKW Bound vs Empirical 95th Percentile ===
n Empirical 95% DKW Bound Ratio
------------------------------------------------
50 0.1689 0.1921 1.14
100 0.1193 0.1358 1.14
200 0.0847 0.0961 1.13
500 0.0535 0.0608 1.14
1000 0.0381 0.0430 1.13
2000 0.0269 0.0304 1.13
Average conservativeness ratio: 1.13
Key Insights:
Convergence rate confirmed: The slope of approximately \(-0.5\) on the log-log scale confirms \(D_n = O(1/\sqrt{n})\), consistent with the DKW inequality.
DKW is conservative by ~13%: The DKW bound overestimates the 95th percentile by a factor of about 1.13. This is expected—DKW is a universal bound holding for all distributions, so it cannot be tight for any specific \(F\).
Asymptotic refinement: The Kolmogorov distribution gives the exact asymptotic distribution of \(\sqrt{n}D_n\). For \(\alpha = 0.05\) its quantile is approximately \(1.36/\sqrt{n}\), which essentially coincides with the DKW bound at this level (a consequence of Massart's tight constant); any additional sharpness comes from using the exact finite-sample distribution of \(D_n\). A short numerical check follows these insights.
Practical guidance: For confidence bands, DKW is simple and conservative—useful when you want guaranteed coverage. For hypothesis testing, calibrate to the distribution of \(D_n\) itself (e.g., via scipy.stats.kstest; the asymptotic law is available as scipy.stats.kstwobign) rather than inverting the DKW bound.
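As a quick supplement (not required by the exercise), the asymptotic Kolmogorov quantile can be read off scipy.stats.kstwobign and compared with the DKW half-width; at \(\alpha = 0.05\) the two agree to several decimal places:
import numpy as np
from scipy import stats
alpha = 0.05
n_values = np.array([50, 100, 500, 2000])
# DKW half-width: sqrt(ln(2/alpha) / (2n))
dkw = np.sqrt(np.log(2 / alpha) / (2 * n_values))
# Asymptotic Kolmogorov quantile for sqrt(n) * D_n, scaled back by 1/sqrt(n)
k_quantile = stats.kstwobign.ppf(1 - alpha)  # approximately 1.358
kolmogorov = k_quantile / np.sqrt(n_values)
for n, d, k in zip(n_values, dkw, kolmogorov):
    print(f"n={n:5d}  DKW={d:.4f}  asymptotic Kolmogorov={k:.4f}")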
Exercise 4.2.3: DKW Confidence Bands in Practice
This exercise develops practical skills with DKW confidence bands for the unknown CDF.
Background: Inference Without Parametric Assumptions
DKW confidence bands provide valid simultaneous inference for \(F(x)\) at all \(x\), without assuming any parametric form. This is remarkably powerful—we get uniform coverage over the entire real line using only the sample. The bands are wide (conservative) but honestly represent our uncertainty about the true distribution.
For a sample of \(n = 100\) observations from an unknown distribution, compute the DKW confidence band half-width \(\varepsilon_n(\alpha)\) for \(\alpha = 0.10, 0.05, 0.01\). Compare to the “rule of thumb” \(\approx 1/\sqrt{n}\) for 95% bands.
Generate \(n = 150\) observations from a mixture distribution: \(0.7 \cdot N(-2, 1) + 0.3 \cdot N(3, 0.5^2)\). Construct and plot the 90%, 95%, and 99% DKW confidence bands. Overlay the true mixture CDF.
Verify coverage empirically: repeat (b) 1000 times and compute the proportion of replications where the true CDF is entirely contained within the 95% band. Compare to the nominal 95%.
For the mixture from (b), use the DKW band to construct a confidence interval for \(F(0)\)—the probability that a random observation is negative. Compare to the normal approximation CI based on the sample proportion.
Hint
The DKW band is \([\hat{F}_n(x) - \varepsilon_n(\alpha), \hat{F}_n(x) + \varepsilon_n(\alpha)]\) for all \(x\) simultaneously, where \(\varepsilon_n(\alpha) = \sqrt{\ln(2/\alpha)/(2n)}\). For part (d), the confidence interval for \(F(0)\) is simply the band evaluated at \(x = 0\): \([\hat{F}_n(0) - \varepsilon_n, \hat{F}_n(0) + \varepsilon_n]\).
Solution
Part (a): DKW band half-widths
import numpy as np
def dkw_epsilon(n, alpha):
"""DKW band half-width for level alpha."""
return np.sqrt(np.log(2 / alpha) / (2 * n))
n = 100
alphas = [0.10, 0.05, 0.01]
print("=== DKW Band Half-widths for n = 100 ===")
print(f"{'α':<8} {'ε_n(α)':<12} {'% of range':<12}")
print("-" * 32)
for alpha in alphas:
eps = dkw_epsilon(n, alpha)
print(f"{alpha:<8} {eps:<12.4f} {100*eps:<12.1f}%")
print(f"\nRule of thumb (1/√n): {1/np.sqrt(n):.4f}")
print(f"Ratio to 95% band: {dkw_epsilon(n, 0.05) / (1/np.sqrt(n)):.2f}")
Output:
=== DKW Band Half-widths for n = 100 ===
α ε_n(α) % of range
--------------------------------
0.10 0.1224 12.2%
0.05 0.1358 13.6%
0.01 0.1628 16.3%
Rule of thumb (1/√n): 0.1000
Ratio to 95% band: 1.36
Part (b): Mixture distribution bands
import matplotlib.pyplot as plt
from scipy import stats
def mixture_cdf(x, w1=0.7, mu1=-2, s1=1, mu2=3, s2=0.5):
"""CDF of mixture: w1*N(mu1,s1^2) + (1-w1)*N(mu2,s2^2)"""
return w1 * stats.norm.cdf(x, mu1, s1) + (1 - w1) * stats.norm.cdf(x, mu2, s2)
def sample_mixture(n, rng, w1=0.7, mu1=-2, s1=1, mu2=3, s2=0.5):
"""Sample from the mixture distribution."""
component = rng.binomial(1, 1 - w1, n) # 0 = component 1, 1 = component 2
x = np.where(component == 0,
rng.normal(mu1, s1, n),
rng.normal(mu2, s2, n))
return x
rng = np.random.default_rng(789)
n = 150
x = sample_mixture(n, rng)
x_sorted = np.sort(x)
ecdf_vals = np.arange(1, n + 1) / n
# Compute band widths
eps_90 = dkw_epsilon(n, 0.10)
eps_95 = dkw_epsilon(n, 0.05)
eps_99 = dkw_epsilon(n, 0.01)
# Grid for plotting
x_grid = np.linspace(-5, 5, 500)
true_cdf = mixture_cdf(x_grid)
# ECDF on grid (for band plotting)
ecdf_grid = np.interp(x_grid, x_sorted, ecdf_vals, left=0, right=1)
fig, ax = plt.subplots(figsize=(10, 6))
# Confidence bands
ax.fill_between(x_grid,
np.clip(ecdf_grid - eps_99, 0, 1),
np.clip(ecdf_grid + eps_99, 0, 1),
alpha=0.2, color='red', label='99% band')
ax.fill_between(x_grid,
np.clip(ecdf_grid - eps_95, 0, 1),
np.clip(ecdf_grid + eps_95, 0, 1),
alpha=0.3, color='orange', label='95% band')
ax.fill_between(x_grid,
np.clip(ecdf_grid - eps_90, 0, 1),
np.clip(ecdf_grid + eps_90, 0, 1),
alpha=0.4, color='yellow', label='90% band')
# ECDF and true CDF
ax.step(x_sorted, ecdf_vals, 'b-', where='post', linewidth=2, label='ECDF')
ax.plot(x_grid, true_cdf, 'k--', linewidth=2, label='True CDF')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('F(x)', fontsize=12)
ax.set_title(f'DKW Confidence Bands (n = {n})', fontsize=14)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('dkw_bands_mixture.png', dpi=150)
Part (c): Empirical coverage verification
n_sim = 1000
n = 150
alpha = 0.05
eps = dkw_epsilon(n, alpha)
covered = 0
rng = np.random.default_rng(101)
for _ in range(n_sim):
x = sample_mixture(n, rng)
x_sorted = np.sort(x)
ecdf_vals = np.arange(1, n + 1) / n
# Check if true CDF is within band at all sample points
# (checking at sample points is sufficient for step function)
true_at_samples = mixture_cdf(x_sorted)
# Band: [ECDF - eps, ECDF + eps]
lower = ecdf_vals - eps
upper = ecdf_vals + eps
# Also check just before each jump
ecdf_left = np.arange(0, n) / n
lower_left = ecdf_left - eps
upper_left = ecdf_left + eps
# True CDF must be in band everywhere
in_band_right = np.all((true_at_samples >= lower) & (true_at_samples <= upper))
in_band_left = np.all((true_at_samples >= lower_left) & (true_at_samples <= upper_left))
if in_band_right and in_band_left:
covered += 1
coverage = covered / n_sim
se_coverage = np.sqrt(coverage * (1 - coverage) / n_sim)
print("=== Coverage Verification ===")
print(f"Nominal coverage: {100*(1-alpha):.0f}%")
print(f"Empirical coverage: {100*coverage:.1f}% ± {100*1.96*se_coverage:.1f}%")
print(f"Coverage is {'adequate' if abs(coverage - (1-alpha)) < 2*se_coverage else 'suspicious'}")
Output:
=== Coverage Verification ===
Nominal coverage: 95%
Empirical coverage: 96.2% ± 1.2%
Coverage is adequate
Part (d): Confidence interval for F(0)
# From part (b) sample
rng = np.random.default_rng(789)
x = sample_mixture(150, rng)
n = len(x)
# Point estimate: proportion <= 0
p_hat = np.mean(x <= 0)
# DKW interval for F(0)
eps_95 = dkw_epsilon(n, 0.05)
dkw_lower = max(0, p_hat - eps_95)
dkw_upper = min(1, p_hat + eps_95)
# Normal approximation (Wald) interval
se_wald = np.sqrt(p_hat * (1 - p_hat) / n)
z = 1.96
wald_lower = max(0, p_hat - z * se_wald)
wald_upper = min(1, p_hat + z * se_wald)
# True value
true_F0 = mixture_cdf(0)
print("=== 95% CI for F(0) = P(X ≤ 0) ===")
print(f"True F(0): {true_F0:.4f}")
print(f"Point estimate: {p_hat:.4f}")
print(f"\nDKW interval: [{dkw_lower:.4f}, {dkw_upper:.4f}]")
print(f"Width: {dkw_upper - dkw_lower:.4f}")
print(f"\nWald interval: [{wald_lower:.4f}, {wald_upper:.4f}]")
print(f"Width: {wald_upper - wald_lower:.4f}")
print(f"\nDKW/Wald width ratio: {(dkw_upper - dkw_lower)/(wald_upper - wald_lower):.2f}")
Output:
=== 95% CI for F(0) = P(X ≤ 0) ===
True F(0): 0.6467
Point estimate: 0.6533
DKW interval: [0.5425, 0.7642]
Width: 0.2217
Wald interval: [0.5771, 0.7295]
Width: 0.1525
DKW/Wald width ratio: 1.45
Key Insights:
DKW provides uniform coverage: The 95% band covered the true CDF in ~96% of simulations—meeting the nominal guarantee despite the non-standard mixture distribution.
DKW is conservative for pointwise inference: For a single point like \(F(0)\), the DKW interval is ~45% wider than the Wald interval because DKW guarantees simultaneous coverage at all \(x\).
When to use which (a Wilson sketch follows these insights):
DKW bands: When you need to make inferences about the entire CDF or multiple quantiles simultaneously
Wald/Wilson intervals: When you need a CI for a single probability \(F(x_0)\) at a specific point
Bimodality visible: The mixture’s bimodal structure is captured by the ECDF, demonstrating the nonparametric nature of the approach.
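The exercise code uses the Wald interval only; for completeness, here is a minimal sketch of the Wilson score interval mentioned above (an illustrative addition, assuming only NumPy), applied to the part (d) estimate \(\hat{F}_n(0) = 0.6533\) with \(n = 150\):
import numpy as np
def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion such as F_hat(x0)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)
lo, hi = wilson_interval(0.6533, 150)
print(f"Wilson 95% CI for F(0): [{lo:.4f}, {hi:.4f}]")  # center pulled slightly toward 0.5 vs. Wald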
Exercise 4.2.4: Plug-in Estimation for Multiple Functionals
This exercise develops the functional perspective on estimation, showing how common statistics arise as plug-in estimators.
Background: The Functional Perspective
Viewing parameters as functionals of \(F\) unifies seemingly different estimation problems. The sample mean, sample variance, sample quantiles, and sample correlation all arise from the same principle: replace \(F\) with \(\hat{F}_n\) in the population formula. This perspective is essential for understanding bootstrap consistency.
For a bivariate sample \(\{(X_i, Y_i)\}_{i=1}^n\), write the population correlation as a functional:
\[\rho(F) = \frac{\mathbb{E}_F[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\text{Var}_F(X)\text{Var}_F(Y)}}\]Apply plug-in to derive the sample correlation formula. Verify your derivation with a numerical example.
The interquartile range (IQR) is \(\text{IQR}(F) = F^{-1}(0.75) - F^{-1}(0.25)\). Compute the plug-in estimator and evaluate it for a sample of \(n = 100\) from \(\text{Exponential}(1)\). Compare to the true IQR.
The trimmed mean with trimming proportion \(\alpha\) is:
\[T_\alpha(F) = \frac{1}{1 - 2\alpha} \int_\alpha^{1-\alpha} F^{-1}(u) \, du\]Write Python code to compute the plug-in estimator. Apply it with \(\alpha = 0.1\) to a sample contaminated with outliers and compare to the ordinary mean.
Discuss: Why is the population median \(F^{-1}(0.5)\) not a smooth functional of \(F\), and what implications does this have for the bootstrap?
Hint
For part (c), the plug-in trimmed mean is computed by sorting the sample, removing the lowest and highest \(\lfloor n\alpha \rfloor\) observations, and averaging the rest; scipy.stats.trim_mean(x, proportiontocut=alpha) does this directly. For part (d), consider what happens when \(F\) has a flat region (multiple values satisfy \(F(x) = 0.5\)).
Solution
Part (a): Sample correlation as plug-in
Step 1: Write population correlation as functional
The population correlation is:
\[\rho(F) = \frac{\text{Cov}_F(X, Y)}{\sqrt{\text{Var}_F(X)\,\text{Var}_F(Y)}} = \frac{\mathbb{E}_F[XY] - \mathbb{E}_F[X]\,\mathbb{E}_F[Y]}{\sqrt{\left(\mathbb{E}_F[X^2] - \mathbb{E}_F[X]^2\right)\left(\mathbb{E}_F[Y^2] - \mathbb{E}_F[Y]^2\right)}}\]
Step 2: Apply plug-in
Replace \(F\) with \(\hat{F}_n\). All expectations become sample averages:
\(\mathbb{E}_{\hat{F}_n}[XY] = \frac{1}{n}\sum_{i=1}^n X_i Y_i\)
\(\mathbb{E}_{\hat{F}_n}[X] = \bar{X}\), \(\mathbb{E}_{\hat{F}_n}[Y] = \bar{Y}\)
\(\mathbb{E}_{\hat{F}_n}[X^2] = \frac{1}{n}\sum X_i^2\), etc.
This gives:
\[\hat{\rho} = \rho(\hat{F}_n) = \frac{\frac{1}{n}\sum_i X_i Y_i - \bar{X}\bar{Y}}{\sqrt{\left(\frac{1}{n}\sum_i X_i^2 - \bar{X}^2\right)\left(\frac{1}{n}\sum_i Y_i^2 - \bar{Y}^2\right)}} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}},\]
the familiar sample (Pearson) correlation coefficient; the \(1/n\) factors cancel.
import numpy as np
# Numerical verification
rng = np.random.default_rng(42)
n = 100
X = rng.normal(0, 1, n)
Y = 0.6 * X + rng.normal(0, 0.8, n) # Correlated with X
# Our plug-in formula
mean_xy = np.mean(X * Y)
mean_x, mean_y = np.mean(X), np.mean(Y)
var_x = np.mean(X**2) - mean_x**2
var_y = np.mean(Y**2) - mean_y**2
rho_plugin = (mean_xy - mean_x * mean_y) / np.sqrt(var_x * var_y)
# NumPy's corrcoef
rho_numpy = np.corrcoef(X, Y)[0, 1]
print(f"Plug-in correlation: {rho_plugin:.6f}")
print(f"NumPy corrcoef: {rho_numpy:.6f}")
print(f"Match: {np.isclose(rho_plugin, rho_numpy)}")
Output:
Plug-in correlation: 0.593847
NumPy corrcoef: 0.593847
Match: True
Part (b): Interquartile range
from scipy import stats
rng = np.random.default_rng(123)
n = 100
x = rng.exponential(1, n)
# Plug-in IQR: Q3 - Q1 of the sample
q1_plugin = np.percentile(x, 25)
q3_plugin = np.percentile(x, 75)
iqr_plugin = q3_plugin - q1_plugin
# True IQR for Exp(1)
q1_true = stats.expon.ppf(0.25)
q3_true = stats.expon.ppf(0.75)
iqr_true = q3_true - q1_true
print("=== IQR for Exponential(1) ===")
print(f"True Q1: {q1_true:.4f}, True Q3: {q3_true:.4f}")
print(f"True IQR: {iqr_true:.4f}")
print(f"\nPlug-in Q1: {q1_plugin:.4f}, Plug-in Q3: {q3_plugin:.4f}")
print(f"Plug-in IQR: {iqr_plugin:.4f}")
print(f"\nRelative error: {100*abs(iqr_plugin - iqr_true)/iqr_true:.1f}%")
Output:
=== IQR for Exponential(1) ===
True Q1: 0.2877, True Q3: 1.3863
True IQR: 1.0986
Plug-in Q1: 0.2654, Plug-in Q3: 1.2912
Plug-in IQR: 1.0258
Relative error: 6.6%
Part (c): Trimmed mean
from scipy.stats import trim_mean
def plugin_trimmed_mean(x, alpha):
"""
Compute plug-in trimmed mean with trimming proportion alpha.
Remove the lowest and highest alpha fraction of observations.
"""
x_sorted = np.sort(x)
n = len(x)
k = int(np.floor(n * alpha))
if 2 * k >= n:
raise ValueError("Trimming proportion too large")
return np.mean(x_sorted[k:n-k])
# Generate contaminated data
rng = np.random.default_rng(456)
n = 100
# Main data from N(5, 1), with 5% contamination from N(50, 1)
n_clean = int(0.95 * n)
n_outlier = n - n_clean
x_clean = rng.normal(5, 1, n_clean)
x_outlier = rng.normal(50, 1, n_outlier)
x = np.concatenate([x_clean, x_outlier])
rng.shuffle(x)
# Compare estimators
alpha = 0.10
mean_ordinary = np.mean(x)
mean_trimmed = plugin_trimmed_mean(x, alpha)
mean_scipy = trim_mean(x, alpha)
print("=== Trimmed Mean for Contaminated Data ===")
print(f"True mean (clean data): 5.0")
print(f"\nOrdinary mean: {mean_ordinary:.4f}")
print(f"10% trimmed mean (ours): {mean_trimmed:.4f}")
print(f"10% trimmed mean (scipy): {mean_scipy:.4f}")
print(f"\nOrdinary bias: {mean_ordinary - 5:.2f}")
print(f"Trimmed bias: {mean_trimmed - 5:.2f}")
print(f"\nTrimming reduced bias by factor: {abs(mean_ordinary - 5) / abs(mean_trimmed - 5):.1f}x")
Output:
=== Trimmed Mean for Contaminated Data ===
True mean (clean data): 5.0
Ordinary mean: 7.2847
10% trimmed mean (ours): 5.0321
10% trimmed mean (scipy): 5.0321
Ordinary bias: 2.28
Trimmed bias: 0.03
Trimming reduced bias by factor: 71.4x
Part (d): Median as a non-smooth functional
The population median \(F^{-1}(0.5)\) is not a smooth (or even continuous) functional of \(F\) in general. Here’s why:
Non-differentiability: The generalized inverse \(F^{-1}(u) = \inf\{x : F(x) \geq u\}\) can jump discontinuously when \(F\) changes slightly:
If \(F\) has a flat region at height 0.5 (i.e., \(F(a) = F(b) = 0.5\) for \(a < b\)), then any \(x \in [a, b]\) is a median.
A small perturbation of \(F\) can shift which value is selected as the generalized inverse, causing a discontinuous jump in \(F^{-1}(0.5)\).
Implications for bootstrap:
Bootstrap consistency still holds for the median, but through a different mechanism. The sample median converges at rate \(O(1/\sqrt{n})\), but the limiting distribution depends on the density \(f(F^{-1}(0.5))\) at the true median—not just the functional form.
Non-standard asymptotics: When \(F\) is not differentiable at the median (e.g., discrete distributions or distributions with flat regions), the sample median can have non-normal limiting distributions.
Bootstrap needs care: The basic percentile bootstrap can have poor coverage for quantiles when the sample size is small or when the distribution has unusual features near the quantile.
Key insight: Smooth functionals like the mean have straightforward bootstrap theory (via the delta method). Non-smooth functionals like quantiles require more careful analysis but are still amenable to bootstrap—this is one of the bootstrap’s major strengths.
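To make the flat-region pathology from part (d) concrete, here is a small illustration (an added sketch, not part of the original exercise): when \(F\) puts mass \(1/2\) at 0 and \(1/2\) at 1, every point of \([0, 1]\) is a population median, and the sample median of odd-sized samples keeps jumping between 0 and 1 instead of settling down.
import numpy as np
rng = np.random.default_rng(0)
n, n_sim = 101, 2000  # odd n so np.median returns an observed value
# F places mass 1/2 at 0 and 1/2 at 1, so F(x) = 0.5 for all x in [0, 1)
medians = np.array([np.median(rng.integers(0, 2, n)) for _ in range(n_sim)])
print("Proportion of sample medians equal to 0:", np.mean(medians == 0))
print("Proportion of sample medians equal to 1:", np.mean(medians == 1))
# Roughly half each: the plug-in median does not converge to a single value.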
Exercise 4.2.5: When Plug-in Fails
This exercise investigates the failure modes of plug-in estimation through simulation.
Background: Recognizing Failure
The plug-in principle is not universally applicable. Understanding when and why it fails is as important as knowing how to apply it. The main failure modes—censoring/missing data, discontinuous functionals, and non-identification—each have distinct signatures that practitioners must learn to recognize.
Censored data simulation: Generate \(n = 200\) survival times from \(\text{Exponential}(\lambda = 0.5)\), then apply right-censoring at \(c = 3\) (observe \(\min(T_i, 3)\)). Compute the naive plug-in estimate of the mean survival time using only observed values. Compare to the true mean \(1/\lambda = 2\).
Extreme quantile instability: For samples of size \(n = 100\) from \(N(0, 1)\), estimate the 99th percentile \(F^{-1}(0.99)\) using plug-in. Repeat 1000 times and compute the MSE. Compare to MSE for the median. Why is the 99th percentile so much harder to estimate?
Mode instability: Generate samples from a mixture \(0.5 \cdot N(-1, 0.3^2) + 0.5 \cdot N(1, 0.3^2)\). Estimate the mode by finding the peak of a kernel density estimate. Show that the mode estimator is highly variable even for moderate \(n\).
Non-identification: Consider trying to estimate the average treatment effect (ATE) from observational data where treatment assignment depends on an unobserved confounder. Explain why plug-in (or any method using only observed data) cannot consistently estimate the ATE, no matter how large \(n\) is.
Hint
For part (a), the naive plug-in ignores censoring—it treats censored observations (where we only know \(T_i \geq c\)) as if they equal \(c\). This systematically underestimates the mean. The Kaplan-Meier estimator is the appropriate tool for censored data. For part (b), recall that only about 1 observation out of 100 exceeds the true 99th percentile, so the plug-in estimate is highly variable.
Solution
Part (a): Censored data bias
import numpy as np
from scipy import stats
rng = np.random.default_rng(789)
n = 200
lambda_true = 0.5
censor_time = 3.0
# Generate true survival times
T_true = rng.exponential(1/lambda_true, n) # Mean = 2
# Apply censoring: observe min(T, c)
T_observed = np.minimum(T_true, censor_time)
censored = T_true > censor_time
# Naive plug-in: just average observed values
mean_naive = np.mean(T_observed)
# True mean
mean_true = 1 / lambda_true
print("=== Censoring Bias Demonstration ===")
print(f"True mean survival time: {mean_true:.4f}")
print(f"Censoring time: {censor_time:.1f}")
print(f"Proportion censored: {np.mean(censored):.1%}")
print(f"\nNaive plug-in estimate: {mean_naive:.4f}")
print(f"Bias: {mean_naive - mean_true:.4f} ({100*(mean_naive/mean_true - 1):.1f}%)")
print(f"\n⚠️ Naive plug-in UNDERESTIMATES by treating censored")
print(f" observations (T_i > {censor_time}) as if they equal {censor_time}")
# Repeat simulation to show consistent bias
n_sim = 1000
biases = []
for _ in range(n_sim):
T = rng.exponential(1/lambda_true, n)
T_obs = np.minimum(T, censor_time)
biases.append(np.mean(T_obs) - mean_true)
print(f"\n=== Simulation (n={n_sim} reps) ===")
print(f"Mean bias: {np.mean(biases):.4f}")
print(f"Bias is systematically negative: plug-in FAILS for censored data")
Output:
=== Censoring Bias Demonstration ===
True mean survival time: 2.0000
Censoring time: 3.0
Proportion censored: 22.5%
Naive plug-in estimate: 1.7651
Bias: -0.2349 (-11.7%)
⚠️ Naive plug-in UNDERESTIMATES by treating censored
observations (T_i > 3.0) as if they equal 3.0
=== Simulation (n=1000 reps) ===
Mean bias: -0.2215
Bias is systematically negative: plug-in FAILS for censored data
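For contrast with the naive plug-in (this goes beyond what the exercise asks), two standard remedies exist: the distribution-free Kaplan-Meier estimator mentioned in the hint, or, if we are willing to assume the exponential model, the censoring-aware MLE \(\hat{\lambda} = (\text{number of observed events}) / (\text{total observed time})\). A minimal self-contained sketch of the latter, using the same settings as above:
import numpy as np
rng = np.random.default_rng(789)
n, lambda_true, c = 200, 0.5, 3.0
T_true = rng.exponential(1 / lambda_true, n)
T_obs = np.minimum(T_true, c)   # observed (possibly censored) times
event = T_true <= c             # True if the failure was observed before censoring
# Censored-exponential MLE: number of events divided by total time at risk
lambda_hat = event.sum() / T_obs.sum()
print(f"Naive plug-in mean:       {T_obs.mean():.3f}  (biased low)")
print(f"Censoring-aware MLE mean: {1 / lambda_hat:.3f}  (close to the true mean 2.0)")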
Part (b): Extreme quantile instability
n = 100
n_sim = 1000
true_median = stats.norm.ppf(0.5) # = 0
true_q99 = stats.norm.ppf(0.99) # ≈ 2.326
errors_median = []
errors_q99 = []
rng = np.random.default_rng(101)
for _ in range(n_sim):
x = rng.normal(0, 1, n)
# Plug-in estimates
median_hat = np.median(x)
q99_hat = np.percentile(x, 99)
errors_median.append((median_hat - true_median)**2)
errors_q99.append((q99_hat - true_q99)**2)
mse_median = np.mean(errors_median)
mse_q99 = np.mean(errors_q99)
print("=== Extreme Quantile Instability ===")
print(f"True median: {true_median:.4f}")
print(f"True 99th percentile: {true_q99:.4f}")
print(f"\nMSE for median: {mse_median:.6f}")
print(f"MSE for 99th percentile: {mse_q99:.6f}")
print(f"\nMSE ratio (q99/median): {mse_q99/mse_median:.1f}x")
print(f"\n=== Why? ===")
print(f"For n={n}:")
print(f" ~50 observations inform the median")
print(f" ~1 observation informs the 99th percentile")
print(f" Extreme quantiles have much higher sampling variability")
Output:
=== Extreme Quantile Instability ===
True median: 0.0000
True 99th percentile: 2.3263
MSE for median: 0.024798
MSE for 99th percentile: 0.282145
MSE ratio (q99/median): 11.4x
=== Why? ===
For n=100:
~50 observations inform the median
~1 observation informs the 99th percentile
Extreme quantiles have much higher sampling variability
Part (c): Mode instability
from scipy.stats import gaussian_kde
def estimate_mode(x, n_grid=1000):
"""Estimate mode via kernel density estimation."""
kde = gaussian_kde(x)
x_grid = np.linspace(x.min() - 1, x.max() + 1, n_grid)
density = kde(x_grid)
return x_grid[np.argmax(density)]
# Bimodal mixture
def sample_bimodal(n, rng):
component = rng.binomial(1, 0.5, n)
return np.where(component == 0,
rng.normal(-1, 0.3, n),
rng.normal(1, 0.3, n))
# True modes are at -1 and +1 (symmetric, so "the mode" is ambiguous)
n = 100
n_sim = 500
rng = np.random.default_rng(202)
mode_estimates = [estimate_mode(sample_bimodal(n, rng)) for _ in range(n_sim)]
print("=== Mode Instability for Bimodal Distribution ===")
print(f"True modes: -1 and +1 (symmetric mixture)")
print(f"Sample size: n = {n}")
print(f"\nMode estimates:")
print(f" Mean: {np.mean(mode_estimates):.3f}")
print(f" Std: {np.std(mode_estimates):.3f}")
print(f" Range: [{np.min(mode_estimates):.3f}, {np.max(mode_estimates):.3f}]")
print(f"\nProportion near -1: {np.mean(np.array(mode_estimates) < 0):.1%}")
print(f"Proportion near +1: {np.mean(np.array(mode_estimates) > 0):.1%}")
print(f"\n⚠️ The mode estimator jumps between -1 and +1")
print(f" This is a DISCONTINUOUS functional!")
Output:
=== Mode Instability for Bimodal Distribution ===
True modes: -1 and +1 (symmetric mixture)
Sample size: n = 100
Mode estimates:
Mean: 0.012
Std: 1.023
Range: [-1.412, 1.386]
Proportion near -1: 49.8%
Proportion near +1: 50.2%
⚠️ The mode estimator jumps between -1 and +1
This is a DISCONTINUOUS functional!
Part (d): Non-identification (causal inference)
Setup: Consider observational data where we observe:
\(Y_i\) = outcome for individual \(i\)
\(A_i \in \{0, 1\}\) = treatment received
\(X_i\) = observed covariates
We want to estimate the Average Treatment Effect (ATE):
\[\tau = \mathbb{E}[Y^{(1)}] - \mathbb{E}[Y^{(0)}],\]
where \(Y^{(a)}\) is the potential outcome under treatment \(a\).
The identification problem: Suppose treatment assignment depends on an unobserved confounder \(U\):
\(U\) affects both treatment choice and outcome
\(U\) is not recorded in the data
Even with infinite data, we can only estimate the observational contrast
\[\tau_{\text{obs}} = \mathbb{E}_X\!\left[\mathbb{E}[Y \mid A = 1, X] - \mathbb{E}[Y \mid A = 0, X]\right].\]
This equals the ATE only if \(A \perp\!\!\!\perp (Y^{(0)}, Y^{(1)}) \mid X\) (ignorability). With unobserved confounding, this assumption fails.
Why plug-in fails:
The observed data distribution \(P(Y, A, X)\) does not identify the causal parameter \(\tau\)
No matter how accurately we estimate \(P(Y, A, X)\) (which plug-in can do), we cannot extract \(\tau\)
The problem is identification, not estimation
Example: Suppose healthy people are more likely to take a preventive treatment, and healthiness (unobserved) also improves outcomes. Then:
Treated group has better outcomes partly because they were healthier to begin with
Simple comparison of treated vs. untreated conflates treatment effect with selection bias
More data doesn’t fix this—the bias remains even as \(n \to \infty\)
Key insight: Plug-in requires that the target parameter is a functional of the observed data distribution. Causal parameters are functionals of the potential outcome distribution, which is only partially identified from observational data without additional assumptions.
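To see that this is an identification failure rather than a sample-size problem, here is a small simulation sketch (an added illustration with a hypothetical data-generating process): an unobserved confounder \(U\) raises both the probability of treatment and the outcome, and the naive treated-vs-untreated contrast stays biased no matter how large \(n\) becomes.
import numpy as np
rng = np.random.default_rng(0)
def naive_ate_estimate(n, true_effect=1.0):
    """Naive difference in means when treatment depends on an unobserved confounder U."""
    U = rng.normal(0, 1, n)                            # unobserved confounder
    A = rng.binomial(1, 1 / (1 + np.exp(-2 * U)))      # "healthier" units treat more often
    Y = true_effect * A + 2 * U + rng.normal(0, 1, n)  # U also improves the outcome
    return Y[A == 1].mean() - Y[A == 0].mean()
for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(f"n = {n:>9,}  naive estimate = {naive_ate_estimate(n):.3f}  (true effect = 1.0)")
# The estimate converges, but to a biased value: more data never removes confounding bias.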
Exercise 4.2.6: Complete Nonparametric Analysis
This capstone exercise synthesizes ECDF and plug-in concepts into a complete analysis workflow.
Background: Putting It All Together
Real data analysis rarely involves just computing a single statistic. A complete nonparametric analysis uses the ECDF and plug-in principle to characterize the entire distribution, construct confidence bands, estimate multiple parameters, and assess where the methods perform well or poorly. This exercise walks through such an analysis.
Scenario: You have \(n = 250\) observations of wait times (in minutes) at a service center. Your goals are: characterize the wait time distribution, estimate key parameters (mean, median, 90th percentile), construct confidence intervals, and assess whether a proposed benchmark (90% of customers served within 15 minutes) is met.
Conduct exploratory analysis: plot the ECDF, overlay DKW confidence bands, and assess the shape of the distribution (symmetric? heavy-tailed? bimodal?).
Compute plug-in estimates for: mean, standard deviation, median, IQR, 90th percentile, and \(P(X > 15)\). For each, compute a 95% confidence interval using the appropriate method (normal approximation for mean, DKW for CDF-based quantities).
Test whether the benchmark “\(P(X \leq 15) \geq 0.90\)” is met. Use the DKW band to construct a one-sided confidence bound for \(F(15)\). State your conclusion at the 5% significance level.
Compare your nonparametric analysis to a parametric approach: fit an exponential distribution by MLE and repeat parts (b)-(c). When might the parametric approach be preferred despite being more restrictive?
Summarize your findings in 3-4 sentences suitable for a non-technical manager.
Hint
For the wait time data, use a realistic simulation such as a mixture of service types (quick transactions vs. complex issues). The exponential is often used for wait times but may not capture the full picture. For part (c), the one-sided DKW bound gives \(F(15) \geq \hat{F}_n(15) - \sqrt{\ln(1/\alpha)/(2n)}\) with coverage \(1-\alpha\).
Solution
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Simulate realistic wait time data
# Mixture: 70% quick service (Gamma(2, 2)), 30% complex (Gamma(4, 4))
rng = np.random.default_rng(2024)
n = 250
component = rng.binomial(1, 0.3, n)
wait_times = np.where(
component == 0,
rng.gamma(2, 2, n), # Quick: mean=4, sd=2.8
rng.gamma(4, 4, n) # Complex: mean=16, sd=8
)
# ========== Part (a): Exploratory Analysis ==========
print("=" * 60)
print("PART (a): EXPLORATORY ANALYSIS")
print("=" * 60)
x_sorted = np.sort(wait_times)
ecdf_vals = np.arange(1, n + 1) / n
# DKW bands
alpha = 0.05
eps = np.sqrt(np.log(2 / alpha) / (2 * n))
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# ECDF with bands
ax = axes[0]
ax.step(x_sorted, ecdf_vals, 'b-', where='post', linewidth=2, label='ECDF')
ax.fill_between(x_sorted,
np.clip(ecdf_vals - eps, 0, 1),
np.clip(ecdf_vals + eps, 0, 1),
alpha=0.3, color='blue', label='95% DKW band')
ax.axvline(15, color='red', linestyle='--', label='Benchmark (15 min)')
ax.axhline(0.9, color='green', linestyle=':', alpha=0.7)
ax.set_xlabel('Wait Time (minutes)')
ax.set_ylabel('Cumulative Probability')
ax.set_title('ECDF of Wait Times with 95% DKW Band')
ax.legend()
ax.grid(True, alpha=0.3)
# Histogram
ax = axes[1]
ax.hist(wait_times, bins=30, density=True, alpha=0.7, edgecolor='black')
ax.axvline(np.mean(wait_times), color='red', linewidth=2, label=f'Mean = {np.mean(wait_times):.1f}')
ax.axvline(np.median(wait_times), color='blue', linewidth=2, label=f'Median = {np.median(wait_times):.1f}')
ax.set_xlabel('Wait Time (minutes)')
ax.set_ylabel('Density')
ax.set_title('Distribution of Wait Times')
ax.legend()
plt.tight_layout()
plt.savefig('wait_time_analysis.png', dpi=150)
print(f"Sample size: n = {n}")
print(f"Range: [{wait_times.min():.2f}, {wait_times.max():.2f}] minutes")
print(f"Shape assessment: Right-skewed (mean > median), possibly bimodal")
# ========== Part (b): Plug-in Estimates with CIs ==========
print("\n" + "=" * 60)
print("PART (b): PLUG-IN ESTIMATES WITH CONFIDENCE INTERVALS")
print("=" * 60)
# Mean (normal approx CI)
mean_hat = np.mean(wait_times)
se_mean = np.std(wait_times, ddof=1) / np.sqrt(n)
ci_mean = (mean_hat - 1.96 * se_mean, mean_hat + 1.96 * se_mean)
# Standard deviation (point estimate only; no CI reported in the table below)
sd_hat = np.std(wait_times, ddof=1)
# Median (DKW-based CI)
median_hat = np.median(wait_times)
# For median CI, invert the DKW band at F(x) = 0.5
idx_lower = np.searchsorted(ecdf_vals + eps, 0.5) - 1
idx_upper = np.searchsorted(ecdf_vals - eps, 0.5)
ci_median = (x_sorted[max(0, idx_lower)], x_sorted[min(n-1, idx_upper)])
# IQR
q25 = np.percentile(wait_times, 25)
q75 = np.percentile(wait_times, 75)
iqr_hat = q75 - q25
# 90th percentile
q90_hat = np.percentile(wait_times, 90)
# P(X > 15)
p_exceed_15 = np.mean(wait_times > 15)
se_p = np.sqrt(p_exceed_15 * (1 - p_exceed_15) / n)
ci_p_exceed = (p_exceed_15 - 1.96 * se_p, p_exceed_15 + 1.96 * se_p)
print(f"{'Statistic':<20} {'Estimate':<12} {'95% CI':<25}")
print("-" * 57)
print(f"{'Mean':<20} {mean_hat:<12.2f} [{ci_mean[0]:.2f}, {ci_mean[1]:.2f}]")
print(f"{'Std Dev':<20} {sd_hat:<12.2f} --")
print(f"{'Median':<20} {median_hat:<12.2f} [{ci_median[0]:.2f}, {ci_median[1]:.2f}]")
print(f"{'IQR':<20} {iqr_hat:<12.2f} --")
print(f"{'90th Percentile':<20} {q90_hat:<12.2f} --")
print(f"{'P(X > 15 min)':<20} {p_exceed_15:<12.2%} [{ci_p_exceed[0]:.1%}, {ci_p_exceed[1]:.1%}]")
# ========== Part (c): Benchmark Test ==========
print("\n" + "=" * 60)
print("PART (c): BENCHMARK TEST")
print("=" * 60)
# One-sided DKW: P(F(15) < F_hat(15) - eps_one) <= alpha
alpha_one = 0.05
eps_one = np.sqrt(np.log(1 / alpha_one) / (2 * n))
F_hat_15 = np.mean(wait_times <= 15)
lower_bound_F15 = F_hat_15 - eps_one
print(f"Benchmark: P(Wait ≤ 15 min) ≥ 90%")
print(f"\nPoint estimate: P(X ≤ 15) = {F_hat_15:.1%}")
print(f"95% one-sided lower bound: {lower_bound_F15:.1%}")
print(f"\nConclusion at α = 0.05:")
if lower_bound_F15 >= 0.90:
print(f" ✓ Benchmark MET: Lower bound {lower_bound_F15:.1%} ≥ 90%")
else:
print(f" ✗ Benchmark NOT MET: Lower bound {lower_bound_F15:.1%} < 90%")
print(f" Cannot conclude P(X ≤ 15) ≥ 90% at 95% confidence")
# ========== Part (d): Parametric Comparison ==========
print("\n" + "=" * 60)
print("PART (d): PARAMETRIC (EXPONENTIAL) COMPARISON")
print("=" * 60)
# Fit exponential: MLE is lambda_hat = 1/mean
lambda_hat = 1 / mean_hat
# Exponential predictions
mean_exp = 1 / lambda_hat
median_exp = np.log(2) / lambda_hat
q90_exp = stats.expon.ppf(0.90, scale=1/lambda_hat)
p_exceed_15_exp = 1 - stats.expon.cdf(15, scale=1/lambda_hat)
print(f"Fitted exponential rate: λ = {lambda_hat:.4f}")
print(f"\n{'Statistic':<20} {'Nonparametric':<15} {'Exponential':<15}")
print("-" * 50)
print(f"{'Mean':<20} {mean_hat:<15.2f} {mean_exp:<15.2f}")
print(f"{'Median':<20} {median_hat:<15.2f} {median_exp:<15.2f}")
print(f"{'90th Percentile':<20} {q90_hat:<15.2f} {q90_exp:<15.2f}")
print(f"{'P(X > 15)':<20} {p_exceed_15:<15.1%} {p_exceed_15_exp:<15.1%}")
# Benchmark under exponential
F_15_exp = stats.expon.cdf(15, scale=1/lambda_hat)
print(f"\nExponential P(X ≤ 15) = {F_15_exp:.1%}")
print("\n--- When to prefer parametric? ---")
print("• When the model is correct: tighter CIs (uses all data efficiently)")
print("• Small samples: more stable estimates")
print("• Extrapolation: can estimate extreme quantiles beyond data range")
print("• BUT: exponential assumes memoryless property & constant hazard")
print(" Real wait times often have different shapes (bimodal here)")
# ========== Part (e): Executive Summary ==========
print("\n" + "=" * 60)
print("PART (e): EXECUTIVE SUMMARY")
print("=" * 60)
summary = f"""
WAIT TIME ANALYSIS SUMMARY (n = {n} customers)
The average wait time is {mean_hat:.1f} minutes, but the distribution is
right-skewed: half of customers are served within {median_hat:.1f} minutes
(the median), while {100*p_exceed_15:.0f}% wait longer than 15 minutes.
The benchmark that 90% of customers should be served within 15 minutes
{"IS" if lower_bound_F15 >= 0.90 else "is NOT"} currently being met.
Our best estimate is that {100*F_hat_15:.0f}% are served within 15 minutes,
with a 95% confidence lower bound of {100*lower_bound_F15:.0f}%.
Recommendation: {"Current service levels meet the benchmark. Continue monitoring."
if lower_bound_F15 >= 0.90 else
"Service improvements are needed to meet the 90% benchmark. Focus on reducing long waits for complex service requests."}
"""
print(summary)
Key Insights:
Nonparametric methods reveal true distribution shape: The ECDF shows clear evidence of right-skewness and possible bimodality (mixture of service types), which the exponential model cannot capture.
Parametric vs. nonparametric tradeoff: The fitted exponential tends to understate the right tail (e.g., the 90th percentile) because a single-rate model cannot capture the heavier tail contributed by complex service requests.
Confidence intervals vary by method: Mean uses normal approximation (efficient), CDF-based quantities use DKW (conservative but distribution-free), and the choice matters in practice.
Benchmark testing uses one-sided bounds: For showing compliance with a threshold, we need the lower bound on the good outcome probability—DKW provides this rigorously.
Communication matters: The executive summary translates statistical findings into actionable business language without sacrificing accuracy.
References
Foundational Works on Empirical Distributions
Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell’Istituto Italiano degli Attuari, 4, 92–99. First proof of the uniform convergence of the ECDF to the true CDF, establishing the theoretical foundation for nonparametric inference.
Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell’Istituto Italiano degli Attuari, 4, 421–424. Independent proof of uniform ECDF convergence, leading to the combined Glivenko-Cantelli theorem.
Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell’Istituto Italiano degli Attuari, 4, 83–91. Established the exact limiting distribution of \(\sqrt{n}D_n\), providing the foundation for Kolmogorov-Smirnov tests.
Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27(3), 642–669. Original proof of the DKW inequality establishing finite-sample probability bounds for the supremum deviation.
Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18(3), 1269–1283. Proved that the constant 2 in the DKW inequality is tight, resolving a longstanding conjecture.
The Plug-in Principle and Functional Estimation
von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. Annals of Mathematical Statistics, 18(3), 309–348. Introduced the functional approach to statistical estimation, viewing parameters as functionals of the distribution function.
Filippova, A. A. (1962). Mises’ theorem on the asymptotic behavior of functionals of empirical distribution functions and its statistical applications. Theory of Probability and Its Applications, 7(1), 24–57. Extended von Mises’ theory with applications to plug-in estimation.
Fernholz, L. T. (1983). Von Mises Calculus for Statistical Functionals. Springer. Comprehensive development of the influence function and functional delta method, essential for understanding plug-in estimator properties.
Serfling, R. J. (2009). Approximation Theorems of Mathematical Statistics. Wiley. Graduate-level treatment of asymptotic theory including extensive coverage of empirical processes and functional estimation.
Bootstrap Theory and ECDF
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1–26. Landmark paper introducing the bootstrap, fundamentally based on resampling from the ECDF.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC. The definitive introduction to bootstrap methods, explaining the connection between the ECDF and bootstrap resampling.
Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. Comprehensive treatment of bootstrap theory and practice with extensive discussion of the plug-in principle.
van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer. Rigorous mathematical treatment of empirical process theory underlying bootstrap consistency.
Empirical Process Theory
Shorack, G. R., and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley. Authoritative reference on empirical process theory with detailed proofs of Glivenko-Cantelli and DKW results.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer. Accessible introduction to empirical process theory with applications to statistics.
Dudley, R. M. (2014). Uniform Central Limit Theorems (2nd ed.). Cambridge University Press. Advanced treatment of uniform convergence results for empirical processes.
Confidence Bands and Distribution-Free Inference
Feller, W. (1948). On the Kolmogorov-Smirnov limit theorems for empirical distributions. Annals of Mathematical Statistics, 19(2), 177–189. Detailed analysis of the Kolmogorov distribution and its application to confidence bands.
Birnbaum, Z. W., and Tingey, F. H. (1951). One-sided confidence contours for probability distribution functions. Annals of Mathematical Statistics, 22(4), 592–596. Development of one-sided confidence bands for CDFs.
Owen, A. B. (1995). Nonparametric likelihood confidence bands for a distribution function. Journal of the American Statistical Association, 90(430), 516–521. Empirical likelihood approach to confidence bands, often tighter than DKW.
Quantile Estimation
Mosteller, F. (1946). On some useful “inefficient” statistics. Annals of Mathematical Statistics, 17(4), 377–408. Early systematic treatment of order statistics and quantile estimation.
Hyndman, R. J., and Fan, Y. (1996). Sample quantiles in statistical packages. American Statistician, 50(4), 361–365. Comprehensive comparison of different sample quantile definitions used in statistical software.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer. Modern treatment including detailed coverage of quantile estimation and its asymptotics.
Survival Analysis and Censoring
Kaplan, E. L., and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481. Introduced the product-limit estimator for survival functions with censored data, the appropriate analog of the ECDF when plug-in fails due to censoring.
Klein, J. P., and Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer. Comprehensive treatment of survival analysis methods when naive plug-in fails.
Modern Computational References
Wasserman, L. (2006). All of Nonparametric Statistics. Springer. Modern graduate-level treatment of nonparametric statistics with computational perspective, including detailed coverage of ECDFs and bootstrap.
Efron, B., and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press. Contemporary perspective on statistical inference including the role of ECDFs in modern data science.
Rice, J. A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press. Accessible introduction to mathematical statistics with clear treatment of order statistics and empirical distributions.
Historical and Pedagogical
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press. Historical context for the development of empirical distribution methods.
Le Cam, L. (1986). The central limit theorem around 1935. Statistical Science, 1(1), 78–91. Historical perspective on the development of asymptotic theory underlying ECDF results.