.. _ch2.3-inverse-cdf-method:

====================
Inverse CDF Method
====================

With a reliable supply of uniform random variates in hand—the fruit of :ref:`ch2.2-uniform-random-variates`'s exploration of pseudo-random number generators—we now face the central question of random variable generation: *how do we transform these uniform numbers into samples from other distributions?*

The answer lies in one of the most elegant results in probability theory: the Probability Integral Transform. This theorem, which we previewed in our discussion of why uniform variates are "universal currency," now receives its full mathematical treatment. It tells us that any distribution can be sampled by computing the inverse of its cumulative distribution function applied to a uniform random number. The method is universal in principle: it works for continuous, discrete, and mixed distributions alike. When the inverse CDF has a closed-form expression—as it does for exponential, Weibull, Pareto, and Cauchy distributions—implementation is immediate and efficient.

But the inverse CDF method is more than a theoretical curiosity. It is the workhorse of random number generation, the first algorithm any practitioner should consider when faced with a new distribution. Understanding when it applies, how to implement it efficiently, and when to seek alternatives is essential knowledge for computational statistics.

This section develops the inverse CDF method in full generality. We begin with the mathematical foundations—proving why the method works and what the "generalized inverse" means for distributions that lack smooth inverses. We then work through continuous distributions with closed-form inverses, deriving and implementing samplers for several important cases. For discrete distributions, we develop increasingly sophisticated algorithms: linear search, binary search, and the remarkable alias method that achieves constant-time sampling. We address numerical issues that arise in practice and identify the method's limitations—setting the stage for rejection sampling and specialized transformations in subsequent sections.

.. admonition:: Road Map 🧭
   :class: important

   • **Understand**: Why the inverse CDF method produces correctly distributed samples, with complete proofs for both continuous and discrete cases
   • **Master**: Closed-form inverse CDFs for exponential, Weibull, Pareto, Cauchy, and logistic distributions
   • **Develop**: Efficient algorithms for discrete distributions—from :math:`O(K)` linear search through :math:`O(\log K)` binary search to :math:`O(1)` alias method
   • **Implement**: Complete Python code with attention to numerical precision and edge cases
   • **Evaluate**: When the inverse CDF method excels and when alternatives (rejection sampling, specialized transforms) are preferable


Mathematical Foundations
------------------------

The inverse CDF method rests on a deep connection between the uniform distribution and all other distributions. We now develop this connection rigorously.

The Cumulative Distribution Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For any random variable :math:`X`, the **cumulative distribution function** (CDF) :math:`F_X: \mathbb{R} \to [0, 1]` is defined by:

.. math::

   F_X(x) = P(X \leq x)

The CDF has several fundamental properties:

1. **Monotonicity**: :math:`F_X` is non-decreasing. If :math:`x_1 < x_2`, then :math:`F_X(x_1) \leq F_X(x_2)`.

2. **Right-continuity**: :math:`\lim_{h \to 0^+} F_X(x + h) = F_X(x)` for all :math:`x`.

3. **Limits at infinity**: :math:`\lim_{x \to -\infty} F_X(x) = 0` and :math:`\lim_{x \to \infty} F_X(x) = 1`.

For **continuous** random variables with density :math:`f_X`, the CDF is obtained by integration:

.. math::

   F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt

and the CDF is continuous everywhere.

For **discrete** random variables taking values :math:`x_1 < x_2 < \cdots` with probabilities :math:`p_1, p_2, \ldots`, the CDF is a step function:

.. math::

   F_X(x) = \sum_{x_i \leq x} p_i

The CDF jumps by :math:`p_i` at each value :math:`x_i`.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide009_img003_8228ef1b.png
   :alt: CDF for discrete distribution showing step function behavior
   :align: center
   :width: 55%

   **CDF for Discrete Distributions.** The cumulative distribution function of a discrete random variable is a step function with jumps at each possible value. For the killdeer clutch size example shown, the CDF maps any :math:`u \in (0, 1)` to the smallest outcome :math:`x` where :math:`F(x) \geq u`—this is precisely the generalized inverse.

The Generalized Inverse
~~~~~~~~~~~~~~~~~~~~~~~

When :math:`F_X` is continuous and strictly increasing, it has a unique inverse: :math:`F_X^{-1}(u)` is the unique :math:`x` such that :math:`F_X(x) = u`. But for discrete distributions (where :math:`F_X` is a step function) or mixed distributions, the ordinary inverse does not exist. We need a more general definition.

.. admonition:: Definition: Generalized Inverse (Quantile Function)
   :class: note

   For any distribution function :math:`F: \mathbb{R} \to [0, 1]`, the **generalized inverse** (or **quantile function**) is:

   .. math::
      :label: gen-inverse-def

      F^{-1}(u) = \inf\{x \in \mathbb{R} : F(x) \geq u\} \quad \text{for } u \in (0, 1)

   This is the smallest :math:`x` for which the CDF reaches or exceeds :math:`u`.

The generalized inverse has several important properties:

**Property 1**: For continuous, strictly increasing :math:`F`, the generalized inverse coincides with the ordinary inverse.

**Property 2**: For any :math:`F`, :math:`F^{-1}(u) \leq x` if and only if :math:`u \leq F(x)`. This equivalence is the key to proving the inverse CDF method works.

**Property 3**: :math:`F^{-1}` is non-decreasing and left-continuous.

**Proof of Property 2**: 

(:math:`\Leftarrow`) Suppose :math:`u \leq F(x)`. Then :math:`x \in \{t : F(t) \geq u\}`, so :math:`F^{-1}(u) = \inf\{t : F(t) \geq u\} \leq x`.

(:math:`\Rightarrow`) Suppose :math:`F^{-1}(u) \leq x`. By definition, :math:`F^{-1}(u) = \inf\{t : F(t) \geq u\}`. By right-continuity of :math:`F`, :math:`F(F^{-1}(u)) \geq u`. Since :math:`F` is non-decreasing and :math:`F^{-1}(u) \leq x`, we have :math:`F(x) \geq F(F^{-1}(u)) \geq u`. ∎

The Probability Integral Transform
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the generalized inverse defined, we can state and prove the fundamental result.

.. admonition:: Theorem: Probability Integral Transform (Inverse Direction)
   :class: note

   If :math:`U \sim \text{Uniform}(0, 1)` and :math:`F` is any distribution function, then :math:`X = F^{-1}(U)` has distribution function :math:`F`.

**Proof**: We must show that :math:`P(X \leq x) = F(x)` for all :math:`x \in \mathbb{R}`.

.. math::

   P(X \leq x) &= P(F^{-1}(U) \leq x) \\
               &= P(U \leq F(x)) \quad \text{(by Property 2)} \\
               &= F(x) \quad \text{(since } U \sim \text{Uniform}(0,1) \text{)}

The final equality uses the fact that for :math:`U \sim \text{Uniform}(0,1)`, :math:`P(U \leq u) = u` for any :math:`u \in [0, 1]`. ∎

This proof is remarkably general. It applies to:

- Continuous distributions with smooth inverses
- Discrete distributions with step-function CDFs  
- Mixed distributions combining point masses and continuous components
- Any distribution whatsoever, provided we use the generalized inverse

The converse also holds under mild conditions:

.. admonition:: Theorem: Probability Integral Transform (Forward Direction)
   :class: note

   If :math:`X` has a continuous CDF :math:`F`, then :math:`U = F(X) \sim \text{Uniform}(0, 1)`.

**Proof**: For :math:`u \in (0, 1)`:

.. math::

   P(F(X) \leq u) = P(X \leq F^{-1}(u)) = F(F^{-1}(u)) = u

where the last equality holds because :math:`F` is continuous and strictly increasing on its support. ∎

Geometric Interpretation
~~~~~~~~~~~~~~~~~~~~~~~~

The inverse CDF method has an intuitive geometric interpretation. Imagine the graph of the CDF :math:`F(x)`:

1. Generate a uniform random number :math:`u \in (0, 1)` — a random height on the :math:`y`-axis.
2. Draw a horizontal line at height :math:`u`.
3. Find where this line first intersects the CDF curve (or, for step functions, where it first reaches the CDF from below).
4. Project down to the :math:`x`-axis. This is your sample :math:`F^{-1}(u)`.

For continuous distributions with strictly increasing CDFs, the intersection is unique. For discrete distributions, the horizontal line may intersect a "riser" of the step function; the generalized inverse picks the :math:`x`-value at the top of that riser.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_3_fig01_inverse_cdf_method.png
   :alt: Six-panel visualization of the inverse CDF method showing the geometric transformation
   :align: center
   :width: 100%

   **The Inverse CDF Method: Geometric View.** Top row: For an exponential distribution, we start with a uniform sample :math:`U` (left), apply the inverse CDF transformation :math:`X = F^{-1}(U)` (center), and obtain an exponentially distributed sample (right). The transformation "stretches" uniform samples according to where the CDF is shallow (rare values) and "compresses" where the CDF is steep (common values). Bottom row: The same process applied to three different distributions—Cauchy (heavy tails), Normal (symmetric), and Beta (flexible shape)—demonstrates the universality of the method.

.. admonition:: Try It Yourself 🖥️ Interactive Inverse CDF Visualizations
   :class: tip

   Two interactive simulations help you build intuition for the inverse CDF method. We strongly recommend experimenting with both before proceeding.

   **Simulation 1: Generalized Inverse CDF Explorer**
   
   https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/inverse_cdf_transform_general.html

   This simulation displays three synchronized panels showing how probability density, cumulative distribution, and the inverse transform relate geometrically:

   - **Left panel (PDF/PMF)**: Shows the density or mass function with the shaded area equal to :math:`u`. The x-axis marks the current :math:`F^{-1}(u)` value, showing exactly where probability accumulates.
   - **Center panel (CDF)**: Shows :math:`F(x)` with a horizontal dashed line at height :math:`u`. The intersection point projects down to :math:`x = F^{-1}(u)`.
   - **Right panel (Inverse CDF)**: Shows :math:`F^{-1}(u)` directly as a function of :math:`u`. For discrete distributions, observe the characteristic plateaus (flat regions).
   - **Distribution selector**: Choose from *Continuous (Normal)*, *Continuous (Exponential)*, *Discrete (Poisson)*, *Discrete (Geometric)*, or *Mixed (Zero-inflated)*.
   - **High-precision slider**: The :math:`u` range extends from 0.0001 to 0.9999, allowing exploration of extreme quantiles in the tails.
   - **Dynamic insight box**: Updates automatically to explain what's happening for each distribution type.

   **Key observations to make**:
   
   1. For *continuous distributions*, all three panels show smooth curves. The steepness of the CDF corresponds to the height of the PDF—steep regions have high density.
   2. For *discrete distributions* (Poisson, Geometric), the CDF is a step function with jumps at each integer. The inverse CDF shows *plateaus*: many :math:`u` values in :math:`(F(k-1), F(k)]` all map to the same integer :math:`k`. Open and filled circles mark the jump discontinuities.
   3. For the *mixed zero-inflated*, observe the point mass at zero in the PDF panel, the corresponding jump in the CDF, and the flat region in the inverse CDF for :math:`u \in [0, p_0]`.
   4. **Explore the tails**: Drag :math:`u` close to 0.0001 or 0.9999 with the Normal or Exponential distribution to see extreme quantiles. The 4-decimal precision lets you see exactly how tail probabilities map to extreme values.
   5. **Hover over the CDF or Inverse CDF panels** to interactively explore—this helps you see the geometric duality between the two representations.

   **Simulation 2: Inverse Transform Sampling in Action**
   
   https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/inverse_cdf_transform_sampling.html

   This simulation demonstrates the *sampling process* for both continuous and discrete distributions. Watch uniform random numbers transform into samples through the inverse CDF:

   - **CDF panel (left sidebar)**: Shows the CDF with the current :math:`u` value as a horizontal line and its projection to :math:`x`. For discrete distributions, this visualizes the "search" process of finding the smallest :math:`k` where :math:`F(k) \geq u`.
   - **Computation display**: Shows the exact arithmetic for each sample, including search steps for discrete distributions (e.g., :math:`F(0) = 0.018 \to F(1) = 0.092 \to \ldots` until :math:`F(k) \geq u`).
   - **Left histogram**: Uniform :math:`U \sim \text{Uniform}(0,1)` samples accumulating toward a flat density.
   - **Right histogram**: Transformed :math:`X = F^{-1}(U)` samples with the theoretical PDF (continuous) or PMF (discrete, shown as gold outline bars) overlaid.
   - **Distribution selector**: 
     
     - *Continuous*: Exponential, Normal, Logistic, Pareto, Cauchy
     - *Discrete*: Bernoulli, Binomial, Geometric, Poisson
   
   - **Distribution badge**: Indicates whether the current distribution is continuous or discrete.

   **Key observations to make**:
   
   1. **Continuous vs. discrete**: For continuous distributions, observe smooth histogram convergence to the PDF curve. For discrete distributions, watch probability mass concentrate at integer values, matching the PMF bars.
   2. **The search process**: With discrete distributions, the computation panel shows how the algorithm searches through cumulative probabilities: :math:`F(0), F(1), F(2), \ldots` until finding the first :math:`k` where :math:`F(k) \geq u`. This is the generalized inverse in action.
   3. **Concentration vs. spread**: For the exponential, uniform values near 0 produce small :math:`X`, while values near 1 produce large :math:`X`. The inverse CDF "stretches" the uniform distribution according to where the target distribution has more or less mass.
   4. **Heavy tails**: Try the Cauchy distribution and watch for occasional extreme values when :math:`U` approaches 0 or 1. The sample mean fluctuates wildly—a vivid demonstration that the Cauchy has no finite mean.
   5. **Discrete probability matching**: With Binomial(10, 0.4), watch how the histogram bars converge to match the gold PMF outline. The most probable outcomes (around :math:`k = 4`) accumulate samples fastest.
   6. **Formula verification**: Pause the animation and verify computations manually. For Geometric with :math:`p = 0.25` and :math:`U = 0.5`: :math:`X = \lceil \ln(1-0.5)/\ln(0.75) \rceil = \lceil 2.41 \rceil = 3`.

Continuous Distributions with Closed-Form Inverses
--------------------------------------------------

The inverse CDF method is most elegant when :math:`F^{-1}` has a closed-form expression. We can then generate samples with just a few arithmetic operations—typically faster than any alternative method.

The Exponential Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The exponential distribution is the canonical example for the inverse CDF method. It models waiting times in Poisson processes and appears throughout reliability theory, queuing theory, and survival analysis.

**Setup**: :math:`X \sim \text{Exponential}(\lambda)` with rate parameter :math:`\lambda > 0`.

**CDF**: 

.. math::

   F(x) = 1 - e^{-\lambda x}, \quad x \geq 0

**Deriving the Inverse**: Solve :math:`F(x) = u` for :math:`x`:

.. math::

   1 - e^{-\lambda x} &= u \\
   e^{-\lambda x} &= 1 - u \\
   -\lambda x &= \ln(1 - u) \\
   x &= -\frac{\ln(1 - u)}{\lambda}

**Simplification**: Since :math:`U` and :math:`1 - U` have the same distribution when :math:`U \sim \text{Uniform}(0, 1)`, we can use either:

.. math::

   F^{-1}(u) = -\frac{\ln(1 - u)}{\lambda} \quad \text{or} \quad F^{-1}(u) = -\frac{\ln(u)}{\lambda}

The second form saves one subtraction but requires care when :math:`u = 0` (which occurs with probability zero but may arise from floating-point edge cases).

.. admonition:: Example 💡 Generating Exponential Random Variables
   :class: note

   **Given**: Generate 10,000 samples from :math:`\text{Exponential}(\lambda = 2)`.

   **Mathematical approach**:

   For :math:`\lambda = 2`, :math:`F^{-1}(u) = -\ln(1-u)/2`.

   **Theoretical properties**:
   
   - Mean: :math:`\mathbb{E}[X] = 1/\lambda = 0.5`
   - Variance: :math:`\text{Var}(X) = 1/\lambda^2 = 0.25`
   - Median: :math:`F^{-1}(0.5) = \ln(2)/\lambda \approx 0.347`

   **Python implementation**:

   .. code-block:: python

      import numpy as np

      def exponential_inverse_cdf(u, rate):
          """
          Generate Exponential(rate) samples via inverse CDF.
          
          Uses the tail-stable form X = -log(U)/rate, avoiding (1-U).
          Since U and 1-U have the same distribution, this is equivalent
          but provides better resolution for large X values.
          """
          u = np.asarray(u, dtype=float)
          # Guard against log(0) if the RNG can return exactly 0
          u = np.maximum(u, np.finfo(float).tiny)
          return -np.log(u) / rate

      # Generate samples
      rng = np.random.default_rng(42)
      n_samples = 10_000
      rate = 2.0

      U = rng.random(n_samples)
      X = exponential_inverse_cdf(U, rate)

      # Verify
      print(f"Sample mean: {np.mean(X):.4f} (theory: {1/rate:.4f})")
      print(f"Sample variance: {np.var(X, ddof=1):.4f} (theory: {1/rate**2:.4f})")
      print(f"Sample median: {np.median(X):.4f} (theory: {np.log(2)/rate:.4f})")

   **Output**:

   ::

      Sample mean: 0.4976 (theory: 0.5000)
      Sample variance: 0.2467 (theory: 0.2500)
      Sample median: 0.3441 (theory: 0.3466)

   The sample statistics closely match theoretical values, confirming the correctness of our sampler.

The Weibull Distribution
~~~~~~~~~~~~~~~~~~~~~~~~

The Weibull distribution generalizes the exponential to allow for increasing or decreasing hazard rates, making it fundamental in reliability analysis.

**Setup**: :math:`X \sim \text{Weibull}(k, \lambda)` with shape :math:`k > 0` and scale :math:`\lambda > 0`.

**CDF**:

.. math::

   F(x) = 1 - \exp\left( -\left(\frac{x}{\lambda}\right)^k \right), \quad x \geq 0

**Deriving the Inverse**: Solve :math:`F(x) = u`:

.. math::

   1 - e^{-(x/\lambda)^k} &= u \\
   e^{-(x/\lambda)^k} &= 1 - u \\
   -\left(\frac{x}{\lambda}\right)^k &= \ln(1 - u) \\
   \left(\frac{x}{\lambda}\right)^k &= -\ln(1 - u) \\
   x &= \lambda \left( -\ln(1 - u) \right)^{1/k}

**Result**:

.. math::
   :label: weibull-inverse

   F^{-1}(u) = \lambda \left( -\ln(1 - u) \right)^{1/k}

Note that when :math:`k = 1`, the Weibull reduces to :math:`\text{Exponential}(1/\lambda)`.

.. code-block:: python

   def weibull_inverse_cdf(u, shape, scale):
       """
       Generate Weibull(shape, scale) samples via inverse CDF.
       
       Uses X = scale * (-log(U))^(1/shape), the tail-stable form
       that avoids computing (1-U).
       """
       u = np.asarray(u, dtype=float)
       u = np.maximum(u, np.finfo(float).tiny)  # Guard against log(0)
       return scale * (-np.log(u)) ** (1 / shape)

   # Example: Weibull(k=2, λ=1) - the Rayleigh distribution
   rng = np.random.default_rng(42)
   U = rng.random(10_000)
   X = weibull_inverse_cdf(U, shape=2, scale=1)

   print(f"Sample mean: {np.mean(X):.4f}")
   print(f"Sample std: {np.std(X, ddof=1):.4f}")

The Pareto Distribution
~~~~~~~~~~~~~~~~~~~~~~~

The Pareto distribution models phenomena with "heavy tails"—situations where extreme values are more likely than a normal or exponential distribution would suggest. It appears in economics (income distribution), insurance (claim sizes), and network science (degree distributions).

**Setup**: :math:`X \sim \text{Pareto}(\alpha, x_m)` with shape :math:`\alpha > 0` and scale (minimum) :math:`x_m > 0`.

**CDF**:

.. math::

   F(x) = 1 - \left( \frac{x_m}{x} \right)^\alpha, \quad x \geq x_m

**Deriving the Inverse**:

.. math::

   1 - \left( \frac{x_m}{x} \right)^\alpha &= u \\
   \left( \frac{x_m}{x} \right)^\alpha &= 1 - u \\
   \frac{x_m}{x} &= (1 - u)^{1/\alpha} \\
   x &= \frac{x_m}{(1 - u)^{1/\alpha}}

**Result**:

.. math::
   :label: pareto-inverse

   F^{-1}(u) = \frac{x_m}{(1 - u)^{1/\alpha}} = x_m (1 - u)^{-1/\alpha}

Since :math:`U` and :math:`1-U` have the same distribution, the tail-stable implementation uses:

.. math::

   X = x_m \cdot U^{-1/\alpha}

.. code-block:: python

   def pareto_inverse_cdf(u, alpha, x_min):
       """
       Generate Pareto(alpha, x_min) samples via inverse CDF.
       
       Uses the tail-stable form X = x_min * U^(-1/alpha),
       which avoids forming (1-U) and provides better resolution
       for large X values.
       """
       u = np.asarray(u, dtype=float)
       # Guard against u=0 which would give infinity
       u = np.maximum(u, np.finfo(float).tiny)
       return x_min * u ** (-1 / alpha)

   # Example: Pareto(α=2.5, x_m=1)
   rng = np.random.default_rng(42)
   U = rng.random(10_000)
   X = pareto_inverse_cdf(U, alpha=2.5, x_min=1.0)

   # For α > 1, mean = α * x_m / (α - 1)
   theoretical_mean = 2.5 * 1.0 / (2.5 - 1)
   print(f"Sample mean: {np.mean(X):.4f} (theory: {theoretical_mean:.4f})")

The Cauchy Distribution
~~~~~~~~~~~~~~~~~~~~~~~

The Cauchy distribution is infamous for having no mean or variance—its tails are so heavy that these moments do not exist. It arises as the ratio of two independent standard normals and appears in physics as the Lorentzian distribution.

**Setup**: :math:`X \sim \text{Cauchy}(\mu, \sigma)` with location :math:`\mu` and scale :math:`\sigma > 0`.

**CDF**:

.. math::

   F(x) = \frac{1}{\pi} \arctan\left( \frac{x - \mu}{\sigma} \right) + \frac{1}{2}

**Deriving the Inverse**:

.. math::

   \frac{1}{\pi} \arctan\left( \frac{x - \mu}{\sigma} \right) + \frac{1}{2} &= u \\
   \arctan\left( \frac{x - \mu}{\sigma} \right) &= \pi \left( u - \frac{1}{2} \right) \\
   \frac{x - \mu}{\sigma} &= \tan\left( \pi \left( u - \frac{1}{2} \right) \right) \\
   x &= \mu + \sigma \tan\left( \pi \left( u - \frac{1}{2} \right) \right)

**Result**:

.. math::
   :label: cauchy-inverse

   F^{-1}(u) = \mu + \sigma \tan\left( \pi (u - 1/2) \right)

**Note**: At :math:`u = 0` and :math:`u = 1`, the tangent function produces :math:`\pm\infty`, which is mathematically correct since the Cauchy distribution has infinite support.

.. code-block:: python

   def cauchy_inverse_cdf(u, loc=0, scale=1):
       """
       Generate Cauchy(loc, scale) samples via inverse CDF.
       
       Note: Returns ±inf at u=0 and u=1, which is mathematically correct
       since the Cauchy has infinite support.
       """
       return loc + scale * np.tan(np.pi * (u - 0.5))

   # Standard Cauchy
   rng = np.random.default_rng(42)
   U = rng.random(10_000)
   X = cauchy_inverse_cdf(U)

   # The median is the location parameter
   print(f"Sample median: {np.median(X):.4f} (theory: 0.0)")
   # Mean doesn't exist, but sample mean will be highly variable
   print(f"Sample mean: {np.mean(X):.4f} (undefined theoretically)")

The Logistic Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~

The logistic distribution resembles the normal but has heavier tails. It appears in logistic regression and as a smooth approximation to the step function.

**Setup**: :math:`X \sim \text{Logistic}(\mu, s)` with location :math:`\mu` and scale :math:`s > 0`.

**CDF**:

.. math::

   F(x) = \frac{1}{1 + e^{-(x-\mu)/s}}

**Deriving the Inverse**:

.. math::

   \frac{1}{1 + e^{-(x-\mu)/s}} &= u \\
   1 + e^{-(x-\mu)/s} &= \frac{1}{u} \\
   e^{-(x-\mu)/s} &= \frac{1-u}{u} \\
   -(x-\mu)/s &= \ln\left( \frac{1-u}{u} \right) \\
   x &= \mu - s \ln\left( \frac{1-u}{u} \right) = \mu + s \ln\left( \frac{u}{1-u} \right)

**Result**:

.. math::
   :label: logistic-inverse

   F^{-1}(u) = \mu + s \ln\left( \frac{u}{1-u} \right) = \mu + s \cdot \text{logit}(u)

The function :math:`\text{logit}(u) = \ln(u/(1-u))` is the inverse of the logistic (sigmoid) function.

.. code-block:: python

   def logistic_inverse_cdf(u, loc=0, scale=1):
       """
       Generate Logistic(loc, scale) samples via inverse CDF.
       
       Uses log(u) - log1p(-u) for numerical stability at both tails.
       """
       u = np.asarray(u, dtype=float)
       # Clamp to avoid exact 0 or 1
       u = np.clip(u, np.finfo(float).tiny, 1 - np.finfo(float).eps)
       # Stable logit: log(u) - log(1-u) via log1p
       return loc + scale * (np.log(u) - np.log1p(-u))

   # Standard logistic
   rng = np.random.default_rng(42)
   U = rng.random(10_000)
   X = logistic_inverse_cdf(U)

   # Mean = loc, Variance = (π * scale)² / 3
   print(f"Sample mean: {np.mean(X):.4f} (theory: 0.0)")
   print(f"Sample variance: {np.var(X, ddof=1):.4f} (theory: {np.pi**2 / 3:.4f})")

Summary: Continuous Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following table summarizes the inverse CDF formulas for common distributions:

.. list-table:: Inverse CDF Formulas for Continuous Distributions
   :header-rows: 1
   :widths: 20 35 45

   * - Distribution
     - CDF :math:`F(x)`
     - Inverse :math:`F^{-1}(u)`
   * - Exponential(:math:`\lambda`)
     - :math:`1 - e^{-\lambda x}`
     - :math:`-\ln(1-u)/\lambda`
   * - Weibull(:math:`k, \lambda`)
     - :math:`1 - e^{-(x/\lambda)^k}`
     - :math:`\lambda(-\ln(1-u))^{1/k}`
   * - Pareto(:math:`\alpha, x_m`)
     - :math:`1 - (x_m/x)^\alpha`
     - :math:`x_m (1-u)^{-1/\alpha}`
   * - Cauchy(:math:`\mu, \sigma`)
     - :math:`\frac{1}{\pi}\arctan(\frac{x-\mu}{\sigma}) + \frac{1}{2}`
     - :math:`\mu + \sigma\tan(\pi(u-1/2))`
   * - Logistic(:math:`\mu, s`)
     - :math:`1/(1 + e^{-(x-\mu)/s})`
     - :math:`\mu + s\ln(u/(1-u))`
   * - Uniform(:math:`a, b`)
     - :math:`(x-a)/(b-a)`
     - :math:`a + (b-a)u`


Numerical Inversion
-------------------

What about distributions whose inverse CDFs have no closed form? The normal distribution is the most important example: :math:`\Phi^{-1}(u)` cannot be expressed in terms of elementary functions. For such distributions, we have two options:

1. **Numerical root-finding**: Solve :math:`F(x) = u` numerically for each sample.
2. **Use specialized algorithms**: Box-Muller for normals, rejection sampling, etc.

Numerical inversion is always possible but often slow. Let us examine when it is practical.

Root-Finding Approach
~~~~~~~~~~~~~~~~~~~~~

To compute :math:`F^{-1}(u)`, we solve the equation :math:`F(x) - u = 0`. Since :math:`F` is monotonically increasing, standard root-finding methods are guaranteed to converge.

**Brent's method** is particularly suitable: it combines bisection (guaranteed convergence) with faster interpolation methods (speed). SciPy provides this as ``scipy.optimize.brentq``.

.. code-block:: python

   import numpy as np
   from scipy import optimize, stats

   def numerical_inverse_cdf(cdf_func, u, bracket=(-100, 100)):
       """
       Compute F⁻¹(u) numerically using Brent's method.

       Parameters
       ----------
       cdf_func : callable
           CDF function F(x).
       u : float or array
           Probability value(s) in (0, 1).
       bracket : tuple
           Interval [a, b] containing the root.

       Returns
       -------
       float or array
           Quantile value(s).
       """
       u = np.atleast_1d(u)
       result = np.zeros_like(u, dtype=float)

       for i, u_val in enumerate(u):
           # Solve F(x) = u, i.e., F(x) - u = 0
           result[i] = optimize.brentq(
               lambda x: cdf_func(x) - u_val,
               bracket[0], bracket[1]
           )

       return result[0] if len(result) == 1 else result

   # Example: Normal distribution (for comparison with scipy.stats.norm.ppf)
   normal_cdf = stats.norm.cdf

   u_test = np.array([0.025, 0.5, 0.975])
   numerical_quantiles = numerical_inverse_cdf(normal_cdf, u_test, bracket=(-10, 10))
   exact_quantiles = stats.norm.ppf(u_test)

   print("Comparison of numerical vs. exact normal quantiles:")
   for u, num, exact in zip(u_test, numerical_quantiles, exact_quantiles):
       print(f"  F⁻¹({u}) = {num:.6f} (numerical) vs {exact:.6f} (exact)")

**Computational cost**: Brent's method typically requires :math:`O(\log(1/\epsilon))` iterations to achieve tolerance :math:`\epsilon`, with each iteration evaluating the CDF once. For normal samples, this is about 10-20 CDF evaluations per sample—much slower than the specialized methods we'll see later.

When Numerical Inversion Makes Sense
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Numerical inversion is practical when:

1. **You need only a few samples**: Setup costs of specialized methods may dominate.
2. **The distribution is unusual**: Custom distributions without known fast samplers.
3. **Accuracy is paramount**: Numerical inversion can achieve arbitrary precision.

Numerical inversion is impractical when:

1. **You need many samples**: The per-sample cost adds up.
2. **A fast specialized method exists**: Box-Muller for normals, ratio-of-uniforms for gammas.
3. **The CDF is expensive to evaluate**: Each sample requires multiple CDF calls.

**Rule of thumb**: For standard distributions, use library implementations (``scipy.stats``, NumPy). These use the most efficient method known for each distribution. Reserve numerical inversion for custom or unusual distributions when other methods fail.


Discrete Distributions
----------------------

For discrete distributions taking values :math:`x_1 < x_2 < \cdots < x_K` with probabilities :math:`p_1, p_2, \ldots, p_K`, the inverse CDF method still applies. The CDF is now a step function:

.. math::

   F(x) = \sum_{x_i \leq x} p_i

and the generalized inverse sets :math:`X = x_k` when :math:`F(x_{k-1}) < U \leq F(x_k)`, where :math:`F(x_0) = 0`.

The challenge is computational: **how do we efficiently find which "step" of the CDF our uniform value :math:`U` lands on?**

We present five algorithms with different complexity trade-offs, beginning with the simplest and progressing to constant-time methods:

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_3_fig02_discrete_sampling.png
   :alt: Comparison of linear search, binary search, and alias method for discrete sampling
   :align: center
   :width: 100%

   **Discrete Sampling Algorithms: Overview.** Top row: Visual illustration of the three fundamental approaches. Linear search scans sequentially until cumulative probability exceeds :math:`U`. Binary search bisects the CDF in :math:`O(\log K)` steps. The alias method constructs equal-height columns allowing :math:`O(1)` lookup. Bottom row: Practical performance analysis for :math:`K=1000` categories, showing when each method dominates. Additional algorithms (interpolation search, exponential doubling) are discussed below.

Linear Search: The Baseline
~~~~~~~~~~~~~~~~~~~~~~~~~~~

We begin with the most natural approach: **linear search**. If you were asked to sample from a discrete distribution using pencil and paper, this is almost certainly what you would do. Walk through the outcomes in order, accumulating probability as you go, and stop as soon as the cumulative probability exceeds your uniform random number :math:`U`.

Linear search embodies the generalized inverse directly. Recall that :math:`F^{-1}(u) = \inf\{k : F(k) \geq u\}`—the smallest outcome whose cumulative probability reaches :math:`u`. Linear search finds this by brute force: compute :math:`F(1)`, check if :math:`U \leq F(1)`; if not, compute :math:`F(2)`, check again; and so on. The algorithm is essentially a sequential scan through the CDF step function until we "climb" past the horizontal line at height :math:`U`.

The simplicity of linear search is both its strength and its weakness. On one hand, the algorithm requires no preprocessing beyond normalizing the PMF—no cumulative sums need be stored, no auxiliary tables constructed. For very small :math:`K` (say, a six-sided die), this simplicity makes linear search the practical choice. On the other hand, the worst-case complexity is :math:`O(K)`: if the distribution is uniform and :math:`U` happens to fall in the last bin, we must examine every outcome before terminating.

But here is a subtlety often overlooked: **linear search is not uniformly bad**. Its performance depends critically on *where the probability mass lies* and *where our uniform sample lands*. Consider a geometric distribution where :math:`p_1 = 0.5`, :math:`p_2 = 0.25`, :math:`p_3 = 0.125`, and so on. Half of all samples terminate on the first iteration; three-quarters terminate by the second. The *expected* number of comparisons is :math:`\sum_k k \cdot p_k`—the distribution's mean! For head-heavy distributions, this expected cost can be far smaller than :math:`O(\log K)`.

This observation leads to a practical insight: **if you know the distribution is head-heavy, arrange outcomes in decreasing probability order and use linear search**. Conversely, for tail-heavy distributions, scanning from right to left achieves the same benefit. The catch is that you must know the distribution's shape in advance, and for symmetric or multimodal distributions, no ordering helps.

Linear search also has a pedagogical virtue: it makes the inverse CDF method's logic transparent. Watching the cumulative sum grow step by step, one sees exactly how the CDF partitions :math:`[0, 1]` into intervals corresponding to each outcome. This intuition is worth developing before moving to more sophisticated algorithms.

**Algorithm: Linear Search**

.. code-block:: text

   Input: Probabilities (p₁, ..., pₖ), uniform value U
   Output: Sample X

   1. Set cumsum = 0
   2. For k = 1, 2, ..., K:
      a. cumsum = cumsum + pₖ
      b. If U ≤ cumsum, return xₖ
   3. Return xₖ (fallback for numerical edge cases)

**Complexity**: :math:`O(K)` per sample in the worst case (uniform distribution), :math:`O(1)` in the best case (all mass on :math:`x_1`).

.. code-block:: python

   def linear_search_sample(pmf, values, u):
       """
       Sample from discrete distribution via linear search.

       Parameters
       ----------
       pmf : array
           Probability mass function (must sum to 1).
       values : array
           Possible outcomes.
       u : float
           Uniform random value.

       Returns
       -------
       Sampled value.
       """
       cumsum = 0.0
       for k, p in enumerate(pmf):
           cumsum += p
           if u <= cumsum:
               return values[k]
       return values[-1]  # Fallback

   # Example: biased die
   pmf = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3])  # Heavy on 5, 6
   values = np.arange(1, 7)

   rng = np.random.default_rng(42)
   samples = [linear_search_sample(pmf, values, rng.random()) for _ in range(10000)]
   print("Sample frequencies:", np.bincount(samples, minlength=7)[1:] / 10000)
   print("True probabilities:", pmf)

Binary Search: Logarithmic Time with Worst-Case Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Linear search's :math:`O(K)` worst case becomes unacceptable when :math:`K` grows large. Sampling from a distribution over 10,000 outcomes should not require 10,000 comparisons. We need a smarter strategy.

The key observation is that the cumulative probabilities :math:`F_1 \leq F_2 \leq \cdots \leq F_K` form a *sorted* array. Finding the smallest :math:`k` such that :math:`F_k \geq U` is exactly the **lower bound** problem from computer science—and binary search solves it in :math:`O(\log K)` time.

The algorithm maintains an interval :math:`[\text{low}, \text{high}]` guaranteed to contain the answer. Initially, this is :math:`[1, K]`—we know nothing, so the answer could be anywhere. At each step, we probe the midpoint :math:`\text{mid} = \lfloor(\text{low} + \text{high})/2\rfloor` and compare :math:`U` to :math:`F_{\text{mid}}`. If :math:`U \leq F_{\text{mid}}`, the answer is at or before the midpoint—so we shrink to :math:`[\text{low}, \text{mid}]`. Otherwise, the answer is strictly after the midpoint—so we shrink to :math:`[\text{mid}+1, \text{high}]`. Either way, the interval halves in size.

This halving is the source of binary search's power. After one comparison, the interval has at most :math:`K/2` elements; after two, at most :math:`K/4`; after :math:`\lceil \log_2 K \rceil` comparisons, the interval has collapsed to a single element—our answer. For :math:`K = 1000`, this means at most 10 comparisons instead of potentially 1000.

**The Minimax Perspective**

Binary search has a deeper elegance when viewed through the lens of game theory. Imagine an adversary who knows your algorithm and chooses :math:`U` to maximize your work. Against linear search, the adversary places :math:`U` in the last bin, forcing :math:`K` comparisons. Against binary search, no matter where the adversary places :math:`U`, you complete in :math:`\lceil \log_2 K \rceil` steps.

Binary search thus **minimizes the maximum** possible cost—a *minimax* strategy. This guarantee comes at a price: binary search cannot exploit favorable distributions. Even if 99% of the probability mass sits in the first bin, binary search still checks the middle first. Where linear search achieves :math:`O(1)` expected time on head-heavy distributions, binary search stubbornly insists on :math:`O(\log K)`.

This is the fundamental tradeoff: **linear search gambles, binary search plays it safe**. Linear search can win big (best case :math:`O(1)`) or lose big (worst case :math:`O(K)`). Binary search guarantees you'll never do worse than :math:`O(\log K)`, but you also can't do better. For general-purpose discrete sampling where the distribution shape may be unknown or arbitrary, binary search's robustness makes it the default choice.

**The Search Invariant**

Understanding binary search requires internalizing its **invariant**: *at every iteration, the true answer lies in* :math:`[\text{low}, \text{high}]`. The proof is inductive. Initially, the answer is certainly in :math:`[1, K]`. At each step, we eliminate either the upper or lower half of the interval—but only after verifying that the answer is not in the eliminated half. When the interval collapses to a single point, that point must be the answer.

This invariant is why binary search works correctly for discrete distributions despite the CDF being a step function. We are not searching for a point where :math:`F(k) = U` (which may not exist); we are searching for the *transition point* where :math:`F` first reaches or exceeds :math:`U`. The comparison :math:`U \leq F_{\text{mid}}` correctly identifies which half contains this transition.

**Implementation via np.searchsorted**

NumPy provides ``np.searchsorted``, a highly optimized binary search implementation. The call ``np.searchsorted(cdf, u, side='left')`` returns the smallest index :math:`k` such that ``cdf[k] >= u``—precisely the generalized inverse. This vectorized function handles arrays of :math:`U` values efficiently, making it the practical workhorse for discrete sampling in Python.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide013_img001_50c6995d.png
   :alt: Binary search tree for discrete distribution sampling
   :align: center
   :width: 100%

   **Binary Search for Discrete Sampling.** For a U-shaped distribution over :math:`K = 16` categories, the binary search tree (center) organizes cumulative probabilities at each node. Left panel: The CDF as a step function, with horizontal lines at key thresholds. Right panel: A complete search trace for :math:`U = 0.885`. Starting at the root (:math:`k=8, F[8]=0.500`), we go right (since :math:`U > 0.500`), then left, then right, then left, arriving at :math:`k=11` in 4 steps. The algorithm pseudocode (left box) shows the standard binary search pattern.


**Algorithm: Binary Search**

.. code-block:: text

   Input: Cumulative probabilities (F₁, ..., Fₖ), uniform value U
   Output: Sample index k

   1. Set lo = 0, hi = K
   2. While lo < hi:
      a. mid = (lo + hi) / 2
      b. If U ≤ F[mid], set hi = mid
      c. Else set lo = mid + 1
   3. Return lo

**Complexity**: :math:`O(\log K)` per sample, :math:`O(K)` preprocessing to compute cumulative sums.


.. code-block:: python

   class DiscreteDistributionBinarySearch:
       """Discrete distribution sampler using binary search."""

       def __init__(self, pmf, values=None):
           """
           Initialize with probability mass function.

           Parameters
           ----------
           pmf : array
               Probabilities (will be normalized if needed).
           values : array, optional
               Outcome values. Defaults to 0, 1, ..., K-1.
           """
           self.pmf = np.asarray(pmf, dtype=float) / np.sum(pmf)
           self.cdf = np.cumsum(self.pmf)
           self.values = values if values is not None else np.arange(len(pmf))
           self.K = len(pmf)

       def sample(self, n_samples=1, seed=None):
           """Generate samples using binary search."""
           rng = np.random.default_rng(seed)
           U = rng.random(n_samples)

           # np.searchsorted finds insertion point - exactly what we need
           indices = np.searchsorted(self.cdf, U, side='left')
           
           # Clip to valid range (handles floating-point edge cases where U >= cdf[-1])
           indices = np.clip(indices, 0, self.K - 1)

           return self.values[indices]

   # Example: Zipf distribution (heavy-tailed)
   K = 1000
   s = 2.0  # Zipf exponent
   pmf = 1 / np.arange(1, K + 1) ** s
   pmf /= pmf.sum()

   dist = DiscreteDistributionBinarySearch(pmf)
   samples = dist.sample(100_000, seed=42)

   print(f"P(X = 1) = {pmf[0]:.4f}, sample frequency = {np.mean(samples == 0):.4f}")
   print(f"P(X = 2) = {pmf[1]:.4f}, sample frequency = {np.mean(samples == 1):.4f}")

NumPy's ``np.searchsorted`` implements binary search efficiently, making this approach practical for distributions with thousands or millions of outcomes.

.. admonition:: Try It Yourself 🖥️ Binary Search for Discrete Sampling
   :class: tip

   An interactive simulation lets you explore binary search step-by-step and compare it with linear scan:

   https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/binary_search_discrete_sampling.html

   This simulation provides three synchronized views of the binary search algorithm:

   - **Left panel**: Shows the PMF shape, CDF table (with row highlighting), and method comparison after search completes.
   - **Center panel**: Displays the CDF step function with the :math:`U` threshold line, plus a binary search tree showing the path taken through nodes.
   - **Right panel**: Provides a step-by-step search trace with the algorithm state (low, high, mid) and explanations of the search invariant.

   **Key features to explore**:

   1. **Distribution selector**: Choose from 10+ distributions including Unimodal (bell-shaped), U-shaped (bathtub), Bimodal, Head-heavy (geometric), Tail-heavy, Poisson, Zipf's law, Point mass, and Adversarial cases.
   2. **Step-by-step mode**: Click "Step →" to walk through each comparison manually, watching how the search interval halves each time.
   3. **Auto-play mode**: Watch the algorithm execute automatically with 1.2-second delays between steps.
   4. **Method comparison**: After each search completes, see how many steps binary search took versus linear scan (left-to-right and right-to-left).

   **Key observations to make**:

   1. **Minimax guarantee**: Binary search always completes in at most :math:`\lceil \log_2 K \rceil = 4` steps for :math:`K = 16`, regardless of the distribution shape or where :math:`U` falls.
   2. **Shape independence**: Try the *Adversarial* distribution (99% at :math:`k=16`). Linear scan left-to-right takes 16 steps, but binary search still takes only 4.
   3. **The invariant**: Watch the "Answer ∈ [low, high]" interval shrink by half each step—this is the core insight of binary search.
   4. **Tree visualization**: The search path through the tree (red nodes) shows exactly which comparisons were made. The tree structure is fixed regardless of the distribution.
   5. **When linear wins**: Try *Head-heavy (geometric)*. If :math:`U` happens to be small, linear scan finds the answer in 1–2 steps while binary search still takes 3–4. This illustrates the minimax tradeoff: binary search sacrifices best-case performance for worst-case guarantees.


Interpolation Search: Exploiting Distributional Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Binary search is a *black-box* algorithm: it treats the CDF as an opaque sorted array, ignoring everything about its shape. But what if we *know* something about the distribution? Can we search faster by making informed guesses about where the answer lies?

Consider looking up a name in a physical phone book (for those who remember such artifacts). You would not open to the exact middle and binary search from there. If searching for "Williams," you would open near the end; for "Anderson," near the beginning. You exploit your knowledge that names are roughly uniformly distributed through the alphabet to make an educated guess.

**Interpolation search** applies this intuition to discrete sampling. Instead of always probing the midpoint, it estimates where in the array the target value :math:`U` *should* be, based on linear interpolation between the CDF values at the interval endpoints:

.. math::

   \text{guess} = \text{low} + \left\lfloor \frac{U - F_{\text{low}-1}}{F_{\text{high}} - F_{\text{low}-1}} \cdot (\text{high} - \text{low}) \right\rfloor

The formula asks: "What fraction of the probability range does :math:`U` represent? Map that fraction to the index range." If the CDF is perfectly linear (uniform distribution), this guess is exact, and we find the answer in one step!

For near-uniform distributions, interpolation search achieves :math:`O(\log \log K)` expected comparisons—a dramatic improvement over binary search's :math:`O(\log K)`. For :math:`K = 10^6`, this means roughly 4 comparisons instead of 20. The improvement compounds because each guess narrows the interval *multiplicatively* rather than just halving it.

**The Danger of Skewed Distributions**

But here lies a trap. Interpolation search assumes the CDF is approximately linear, which is equivalent to assuming the distribution is approximately uniform. When this assumption fails—as it does spectacularly for Zipf, geometric, or any highly skewed distribution—the algorithm's guesses become wildly wrong.

Consider a geometric distribution where :math:`p_1 = 0.9`, :math:`p_2 = 0.09`, etc. The CDF jumps to 0.9 at :math:`k=1` and crawls toward 1 thereafter. If :math:`U = 0.5`, interpolation assumes the answer should be around the middle of the range—but it's actually at :math:`k=1`! The first guess overshoots badly, and subsequent guesses may not recover quickly. In pathological cases, interpolation search degrades to :math:`O(K)`, *worse* than binary search.

**When to Use Interpolation Search**

The decision is straightforward:

- **Use interpolation search** when you *know* the distribution is close to uniform and need maximum speed. Examples include sampling from empirical distributions of naturally dispersed data.

- **Avoid interpolation search** when the distribution might be skewed, heavy-tailed, or have point masses. In these cases, binary search's guaranteed :math:`O(\log K)` is safer.

- **Never use interpolation search** as a general-purpose default. Its worst case is worse than linear search in some implementations, and its best case requires distributional assumptions that often don't hold.

The lesson is broader: algorithms that exploit structure are powerful when the structure is present and dangerous when it is not. Binary search's "ignorance" of the CDF shape is actually a form of robustness.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide016_img002_4e0de5c4.png
   :alt: Interpolation search for uniform distribution
   :align: center
   :width: 100%

   **Interpolation Search: Near-Uniform Distribution.** For distributions where probability mass is spread relatively evenly, interpolation search uses linear interpolation to guess the target index. Here with :math:`K = 16` categories and :math:`u = 0.420`, the algorithm finds the answer in just 2 steps by exploiting the near-linear CDF structure.

**The Key Insight**

Rather than always probing the middle of the search interval, interpolation search estimates *where* the target should be based on the CDF values at the interval endpoints:

.. math::

   \text{guess} = \text{low} + \left\lfloor \frac{u - F[\text{low}]}{F[\text{high}] - F[\text{low}]} \cdot (\text{high} - \text{low}) \right\rfloor

This formula computes what fraction of the way through the probability range our target :math:`u` lies, then maps that fraction to the index range.

**Algorithm: Interpolation Search**

.. code-block:: text

   Input: Cumulative probabilities (F₁, ..., Fₖ), uniform value U
   Output: Sample index k

   1. Set lo = 1, hi = K, F[0] = 0
   2. While lo < hi:
      a. ratio = (U - F[lo-1]) / (F[hi] - F[lo-1])
      b. guess = lo + floor(ratio * (hi - lo))
      c. If U ≤ F[guess], set hi = guess
      d. Else set lo = guess + 1
   3. Return lo

**Complexity**: 

- **Best case**: :math:`O(\log \log K)` for uniform or near-uniform distributions
- **Worst case**: :math:`O(K)` for highly skewed distributions (where the linear assumption fails badly)

.. code-block:: python

   class DiscreteDistributionInterpolationSearch:
       """Discrete distribution sampler using interpolation search."""
       
       def __init__(self, pmf, values=None):
           """
           Initialize with probability mass function.
           
           Parameters
           ----------
           pmf : array
               Probabilities (will be normalized if needed).
           values : array, optional
               Outcome values. Defaults to 0, 1, ..., K-1.
           """
           self.pmf = np.asarray(pmf, dtype=float) / np.sum(pmf)
           # Prepend 0 for F[0] = 0 convention
           self.cdf = np.concatenate([[0], np.cumsum(self.pmf)])
           self.values = values if values is not None else np.arange(len(pmf))
           self.K = len(pmf)
       
       def _interpolation_search(self, u):
           """Find smallest k such that F[k] >= u using interpolation search."""
           lo, hi = 1, self.K
           
           while lo < hi:
               # Avoid division by zero
               if self.cdf[hi] <= self.cdf[lo - 1]:
                   return lo
               
               # Linear interpolation
               ratio = (u - self.cdf[lo - 1]) / (self.cdf[hi] - self.cdf[lo - 1])
               guess = lo + int(ratio * (hi - lo))
               
               # Clamp guess to valid range
               guess = max(lo, min(guess, hi))
               
               if u <= self.cdf[guess]:
                   hi = guess
               else:
                   lo = guess + 1
           
           return lo
       
       def sample(self, n_samples=1, seed=None):
           """Generate samples using interpolation search."""
           rng = np.random.default_rng(seed)
           U = rng.random(n_samples)
           
           # Interpolation search for each U value
           indices = np.array([self._interpolation_search(u) for u in U]) - 1
           return self.values[indices]

   # Example: Near-uniform distribution (interpolation search excels)
   K = 1000
   pmf_uniform = np.ones(K) / K
   
   dist_interp = DiscreteDistributionInterpolationSearch(pmf_uniform)
   samples = dist_interp.sample(10_000, seed=42)
   
   print(f"Sample mean: {np.mean(samples):.1f} (theory: {(K-1)/2:.1f})")

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide017_img002_d5767450.png
   :alt: Interpolation search for monotonic distribution
   :align: center
   :width: 100%

   **Interpolation Search: Skewed Distribution.** For a monotonically increasing PMF (more mass on higher categories), interpolation search requires 3 steps to find :math:`k = 15` for :math:`u = 0.780`. The linear interpolation estimates are less accurate when the CDF curves away from a straight line, but the algorithm still converges correctly.


.. admonition:: Try It Yourself 🖥️ Interpolation Search for Discrete Sampling
   :class: tip

   An interactive simulation lets you explore interpolation search step-by-step and compare it with binary search and linear scan:

   https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/interpolation_search_discrete_sampling.html

   This simulation provides synchronized views of the interpolation search algorithm:

   - **Left panel**: Shows the PMF shape, CDF table with row highlighting, and method comparison after search completes.
   - **Center panel**: Displays the CDF step function with the :math:`U` threshold line, plus visualization of the interpolation guess calculation.
   - **Right panel**: Provides a step-by-step search trace showing the interpolation formula, guess position, and how the search interval narrows.

   **Key features to explore**:

   1. **Distribution selector**: Choose from distributions including Uniform, Near-uniform, Unimodal, Monotonic increasing, Monotonic decreasing, U-shaped, and Skewed distributions.
   2. **Step-by-step mode**: Click "Step →" to walk through each iteration, watching how the interpolation formula estimates the target position.
   3. **Auto-play mode**: Watch the algorithm execute automatically with delays between steps.
   4. **Comparison panel**: After each search completes, see how many iterations interpolation search took versus binary search and linear scan.

   **Key observations to make**:

   1. **Near-uniform distributions**: Try the *Uniform* or *Near-uniform* distribution. Interpolation search often finds the answer in just 1-2 iterations because the CDF is nearly linear—the interpolation guess is nearly exact.
   2. **Skewed distributions**: Try *Monotonic increasing* or *Skewed*. Watch how the interpolation guess overshoots or undershoots when the CDF curves away from linear. The algorithm still converges but may take more iterations than binary search.
   3. **The interpolation formula**: Observe how :math:`\text{guess} = \text{low} + \lfloor \frac{u - F[\text{low}]}{F[\text{high}] - F[\text{low}]} \cdot (\text{high} - \text{low}) \rfloor` maps the probability ratio to an index estimate.
   4. **Best vs. worst case**: Uniform distributions achieve :math:`O(\log \log K)` iterations; highly skewed distributions may require as many iterations as binary search or more.
   5. **When interpolation wins**: Compare iteration counts across distributions. Interpolation search excels when the CDF is approximately linear; it struggles when probability mass is concentrated (steep CDF regions).


**When Interpolation Search Helps**

Interpolation search shines when:

- The distribution is close to uniform: :math:`O(\log \log K)` expected comparisons
- The CDF is approximately linear over most of its range
- You need many samples from a fixed distribution (no setup cost beyond computing the CDF)

Interpolation search struggles when:

- The distribution is highly skewed (Zipf, geometric): can degrade to :math:`O(K)`
- The probability mass is concentrated in a few categories
- The CDF has sharp jumps

**Practical recommendation**: For general-purpose discrete sampling, binary search is safer. Use interpolation search only when you know the distribution is nearly uniform and need the extra speed.

Exponential Doubling: Adaptive Search for Head-Heavy Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have seen that linear search excels when samples fall in early categories (head-heavy distributions), while binary search provides worst-case guarantees. Is there an algorithm that captures the best of both worlds—fast when samples are early, yet never slower than :math:`O(\log K)`?

**Exponential doubling** (also called galloping search or exponential search) achieves exactly this. The insight is elegant: instead of immediately binary searching the entire array :math:`[1, K]`, first locate a *rough range* containing the answer, then binary search within that range.

The "rough range" is found by a geometric progression: check :math:`F_1`, then :math:`F_2`, then :math:`F_4`, :math:`F_8`, :math:`F_{16}`, and so on, doubling the probe position each time until we overshoot (find an :math:`F_k \geq U`). At that point, we know the answer lies in :math:`[k/2, k]`—an interval whose size is at most twice the answer's index. Binary searching this interval takes :math:`O(\log k)` additional comparisons.

The total complexity is :math:`O(\log k)` where :math:`k` is the *result index*, not the array size :math:`K`. For head-heavy distributions where most samples have small :math:`k`, this is a substantial win:

- If :math:`k = 1`: just 1 comparison (check :math:`F_1`, done)
- If :math:`k = 10`: about :math:`\log_2 10 + \log_2 10 \approx 7` comparisons
- If :math:`k = 1000` in an array of :math:`K = 10^6`: about :math:`20` comparisons, same as if :math:`K = 1000`

Compare this to binary search, which always takes :math:`O(\log K)` regardless of where the answer falls. For a geometric distribution with :math:`K = 10^6` outcomes, most samples have :math:`k < 100`, so exponential doubling saves roughly half the comparisons.

**The Algorithm's Adaptivity**

Exponential doubling is *adaptive* in a precise sense: its running time depends on the output, not just the input size. Algorithms with this property are sometimes called "output-sensitive." The practical benefit is that you don't pay for outcomes you don't use. If your distribution has a million categories but 95% of samples come from the first hundred, exponential doubling behaves as if :math:`K = 100`.

This adaptivity makes exponential doubling particularly attractive for:

- **Zipf and power-law distributions**: Common in natural language processing (word frequencies), web traffic (page popularity), and citation networks. Most mass is in the head.

- **Poisson distributions with small** :math:`\lambda`: The mode is near zero, so most samples are small integers.

- **Truncated geometric distributions**: Arise in models of waiting times and counts with geometric decay.

**Trade-offs and Practical Considerations**

Exponential doubling has no preprocessing beyond computing the CDF—unlike the alias method, it requires no auxiliary tables. This makes it suitable for:

- **Dynamic distributions** that may change between batches of samples
- **Memory-constrained environments** where :math:`O(K)` auxiliary storage is undesirable
- **One-shot sampling** where setup cost cannot be amortized

The downside is that for tail-heavy distributions (where samples tend to fall in late categories), exponential doubling has higher constant factors than plain binary search. When the answer :math:`k` is near :math:`K`, the geometric probing phase takes :math:`\sim \log_2 k` steps to find the range, then binary search takes another :math:`\sim \log_2 k` steps to refine—roughly twice the comparisons of binary search, which goes directly to the middle. Both are :math:`O(\log K)` asymptotically, but exponential doubling's constant is approximately 2x worse for tail-heavy distributions.

**Summary of Adaptive Behavior**

+-------------------------+------------------------+-------------------+---------------------+
| Distribution Shape      | Exponential Doubling   | Binary Search     | Practical Winner    |
+=========================+========================+===================+=====================+
| Head-heavy (Zipf)       | :math:`O(\log k)`      | :math:`O(\log K)` | Exponential         |
+-------------------------+------------------------+-------------------+---------------------+
| Uniform                 | :math:`O(\log K)`      | :math:`O(\log K)` | Binary (lower const)|
+-------------------------+------------------------+-------------------+---------------------+
| Tail-heavy              | :math:`O(\log K)`      | :math:`O(\log K)` | Binary (lower const)|
+-------------------------+------------------------+-------------------+---------------------+

Exponential doubling matches binary search's :math:`O(\log K)` worst-case complexity but with a higher constant factor (roughly 2x) when samples fall in the tail. However, this trade-off is highly favorable in practice: right-skewed (head-heavy) distributions are far more common than left-skewed ones. Zipf's law governs word frequencies, city populations, and web traffic; power laws describe income, citations, and network degrees; count data (Poisson, negative binomial) and waiting times (exponential, gamma) all concentrate mass in the head. For these ubiquitous distribution shapes, exponential doubling's :math:`O(\log k)` behavior delivers consistent speedups over binary search. The 2x penalty for tail-heavy distributions is a rare cost for a common benefit.

.. code-block:: text

   Input: Cumulative probabilities (F₁, ..., Fₖ), uniform value U
   Output: Sample index k

   1. If U ≤ F[1], return 1
   2. Set bound = 1
   3. While bound < K and U > F[bound]:
      bound = min(2 * bound, K)
   4. Binary search in range [bound/2, bound]
   5. Return result

**Complexity**: :math:`O(\log k)` where :math:`k` is the result index—much better than :math:`O(\log K)` when :math:`k \ll K`.

.. code-block:: python

   def exponential_doubling_search(cdf, u):
       """
       Find smallest k such that cdf[k] >= u using exponential doubling.
       
       Parameters
       ----------
       cdf : array
           Cumulative probabilities, cdf[0] = 0.
       u : float
           Target uniform value.
           
       Returns
       -------
       int
           Index k (1-indexed in the probability sense).
       """
       K = len(cdf) - 1
       
       if u <= cdf[1]:
           return 1
       
       # Exponential doubling to find upper bound
       bound = 1
       while bound < K and u > cdf[bound]:
           bound = min(2 * bound, K)
       
       # Binary search in [bound//2, bound]
       lo = bound // 2
       hi = bound
       
       while lo < hi:
           mid = (lo + hi) // 2
           if u <= cdf[mid]:
               hi = mid
           else:
               lo = mid + 1
       
       return lo

   # Example: Head-heavy distribution (exponential doubling excels)
   K = 10000
   alpha = 2.0
   pmf_zipf = 1 / np.arange(1, K + 1) ** alpha
   pmf_zipf /= pmf_zipf.sum()
   cdf_zipf = np.concatenate([[0], np.cumsum(pmf_zipf)])

   # Most samples will be in early categories
   rng = np.random.default_rng(42)
   U = rng.random(1000)
   results = [exponential_doubling_search(cdf_zipf, u) for u in U]
   
   print(f"Median result index: {np.median(results):.0f} out of {K}")
   print(f"90th percentile: {np.percentile(results, 90):.0f}")

**When to Use Exponential Doubling**

- Head-heavy distributions (Zipf, geometric, Poisson with small λ)
- When most samples fall in early categories
- Adaptive scenarios where the distribution may change

The alias method still wins for truly large :math:`K` with many samples, but exponential doubling provides a good middle ground with no setup cost.

The Alias Method: Constant Time Through Structural Transformation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All the algorithms we have examined share a common limitation: their sampling time depends on :math:`K`, the number of categories. Linear search is :math:`O(K)` worst case; binary search is :math:`O(\log K)`; even exponential doubling is :math:`O(\log k)` for result :math:`k`. When :math:`K` is in the millions and we need billions of samples, even logarithmic factors accumulate.

Is :math:`O(1)` sampling possible? At first glance, this seems impossible. How can we select from :math:`K` outcomes with arbitrary probabilities without examining at least :math:`O(\log K)` bits of information? The answer involves a profound insight: **restructure the distribution itself** so that sampling becomes trivial.

The **alias method**, developed by Walker in 1977 and refined by Vose in 1991, achieves exactly this. After :math:`O(K)` preprocessing, each sample requires only two operations:

1. Generate a uniform integer in :math:`\{1, 2, \ldots, K\}`: :math:`O(1)`
2. Flip a biased coin: :math:`O(1)`

No searching, no comparisons against the CDF. The running time is genuinely constant, independent of :math:`K`.

**The Core Insight: Equal-Probability Columns**

Imagine pouring the probability mass of each outcome into :math:`K` identical cups, each with capacity :math:`1/K`. Some outcomes have probability greater than :math:`1/K` (they "overflow"); others have probability less than :math:`1/K` (they remain partially empty). The key observation is: **we can redistribute mass from overfilled cups to underfilled cups so that every cup holds exactly** :math:`1/K` **total probability, contributed by at most two original outcomes**.

Why at most two? Consider the pairing process. Take any underfilled cup (say, outcome :math:`j` with probability :math:`p_j < 1/K`) and any overfilled cup (outcome :math:`\ell` with :math:`p_\ell > 1/K`). Pour enough of :math:`\ell`'s excess into :math:`j`'s cup to top it up to :math:`1/K`. Now :math:`j`'s cup is full, containing :math:`p_j` of :math:`j`'s "native" probability and :math:`(1/K - p_j)` "borrowed" from :math:`\ell`. Cup :math:`j` is complete. Outcome :math:`\ell` may still be overfilled, underfilled, or exactly filled—we'll handle it in subsequent iterations.

This pairing process terminates in exactly :math:`K` steps, constructing two parallel arrays:

- ``prob[i]``: The fraction of cup :math:`i` filled by its native outcome
- ``alias[i]``: The index of the outcome that donated the remainder (or :math:`i` itself if the cup needed no borrowing)

**Sampling: Two Random Numbers, One Table Lookup**

With the alias tables constructed, sampling is elegant:

1. Draw :math:`U \sim \text{Uniform}(0, K)`. Let :math:`i = \lfloor U \rfloor` (which cup?) and :math:`V = U - i` (where in the cup?).
2. If :math:`V < \text{prob}[i]`, return :math:`i` (native outcome). Otherwise, return :math:`\text{alias}[i]` (borrowed outcome).

Both steps are :math:`O(1)`. The first step selects a cup uniformly at random—each cup has total probability :math:`1/K`, so this is correct. The second step decides between the two outcomes sharing that cup, weighted by their contributions.

**When to Use the Alias Method**

The alias method's :math:`O(K)` setup cost must be amortized over many samples. The break-even point depends on the alternative:

- **vs. binary search**: Break-even around :math:`K / \log K` samples. For :math:`K = 1000`, this is roughly 100 samples.
- **vs. linear search**: Break-even around :math:`K / \mathbb{E}[k]` samples, depending on distribution shape.

Use the alias method when:

- :math:`K` is large (thousands or more)
- You need many samples from a fixed distribution
- Memory is not constrained (:math:`O(K)` auxiliary storage required)
- The distribution does not change between samples

Avoid the alias method when:

- You need only a few samples (setup cost dominates)
- The distribution changes frequently (rebuilding tables is expensive)
- :math:`K` is small enough that binary search is already fast
- Memory is limited and :math:`O(K)` auxiliary arrays are unacceptable

**Numerical Stability Considerations**

The Vose variant of the alias method is numerically stable: it processes outcomes in a specific order that prevents accumulation of floating-point errors. Naive implementations can produce slightly incorrect probabilities when :math:`K` is large. Always use the Vose algorithm (shown in the code below) for production work.

**A Remarkable Achievement**

The alias method exemplifies a powerful principle in algorithm design: **transform the problem structure to make the solution trivial**. Rather than searching through the CDF, we rebuild the distribution into a form where sampling requires no search at all. The :math:`O(K)` preprocessing is an investment that pays dividends with every subsequent :math:`O(1)` sample.

For Monte Carlo simulations requiring millions or billions of samples from a fixed discrete distribution, the alias method is often the optimal choice. Its constant-time sampling transforms computational bottlenecks into non-issues, enabling simulations that would otherwise be infeasible.

**The Setup Algorithm**

Imagine :math:`K` cups, each of capacity :math:`1/K`. We pour the probability mass of each outcome into cups, filling some and leaving others partially empty. The alias method pairs each underfilled cup with an overfilled one, "topping up" the underfilled cup with probability mass from the overfilled outcome.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide018_img002_ac0643ba.png
   :alt: Alias method cups visualization
   :align: center
   :width: 90%

   **The Alias Method: Balancing Cups.** Top: Original PMF with probabilities 0.1, 0.2, 0.3, 0.4 for categories 1-4. Bottom left: After scaling by :math:`K=4`, the heights become 0.4, 0.8, 1.2, 1.6—some below the average of 1, others above. Bottom right: The final balanced cups, each with total height 1. Each cup contains its "native" probability (solid color) plus "borrowed" probability from an overfilled cup (lighter shading with alias label). To sample: pick a cup uniformly, then flip a coin weighted by the cup's native vs. borrowed portions.

.. admonition:: Try It Yourself 🖥️ Walker-Vose Alias Method Simulation
   :class: tip

   An interactive simulation lets you explore both the setup phase (table construction) and the sampling phase of the alias method:

   https://treese41528.github.io/ComputationalDataScience/Simulations/MonteCarloSimulation/alias_method_simulation.html

   This simulation provides three tabbed views covering the complete alias method:

   **Setup Tab** — Watch the table construction algorithm:

   - **Cup visualization**: See :math:`K` cups being filled and balanced in real-time. Native probability shown as solid color; borrowed (alias) probability shown as lighter shading.
   - **Small/Large stacks**: Watch indices move between the "small" (underfilled) and "large" (overfilled) stacks as the algorithm progresses.
   - **Step-by-step trace**: Each iteration shows which small cup :math:`\ell` pairs with which large cup :math:`g`, the probability transfer, and the resulting :math:`\texttt{prob}[\ell]` and :math:`\texttt{alias}[\ell]` values.
   - **Final tables**: After setup completes, view the complete ``prob[]`` and ``alias[]`` arrays.

   **Sampling Tab** — See :math:`O(1)` sampling in action:

   - **Two-step process**: (1) Pick a random column :math:`\text{col} \sim \text{Uniform}\{0, \ldots, K-1\}`, (2) Compare :math:`u` to :math:`\texttt{prob}[\text{col}]` to choose native or alias outcome.
   - **Visual feedback**: The selected cup highlights, showing the native/alias decision boundary.
   - **Sample accumulation**: Watch the histogram of samples converge to the target PMF as you generate more samples.
   - **Exact computation**: Each sample shows the random values generated and the decision logic.

   **Benchmark Tab** — Compare performance:

   - **Method comparison**: Run timing benchmarks comparing alias method, binary search, and linear scan.
   - **Scaling behavior**: See how sampling time grows (or doesn't) with :math:`K` for each method.

   **Key observations to make**:

   1. **Setup process**: Watch how each iteration finalizes exactly one small cup by borrowing from a large cup. The large cup may become small after donating, continuing the process.
   2. **Conservation of probability**: Notice that after setup, every cup has total height exactly 1.0—the algorithm redistributes probability without changing the distribution.
   3. **At most two outcomes per cup**: Each cup contains at most two colors (native + alias). This is why sampling needs only one comparison.
   4. **Constant-time sampling**: In the Sampling tab, observe that the algorithm performs exactly the same operations regardless of which cup is selected—no loops, no searching.
   5. **Same donor, multiple recipients**: A very large cup (high probability outcome) may donate to several small cups. Look for the same alias index appearing multiple times in the ``alias[]`` array.
   6. **Numerical edge cases**: Try a nearly-degenerate distribution (one probability close to 1). Watch how the algorithm handles cups that are already nearly full.

   **Suggested experiments**:

   - Start with a simple 4-category distribution like :math:`(0.1, 0.2, 0.3, 0.4)` and trace through setup manually.
   - Try a uniform distribution—all cups start at height 1.0, so no redistribution is needed.
   - Try a Zipf distribution with :math:`K = 8` to see how head-heavy distributions create many alias pointers to the first few outcomes.
   - Generate 1000+ samples and verify the histogram matches the input PMF.


.. code-block:: text

   Input: Probabilities (p₁, ..., pₖ)
   Output: Arrays prob[k] and alias[k]

   1. Scale probabilities: q[k] = K * p[k]  (so Σq[k] = K)
   2. Initialize:
      - small = {k : q[k] < 1}  (underfilled)
      - large = {k : q[k] ≥ 1}  (overfilled)
      - alias[k] = k for all k
   3. While small and large are both non-empty:
      a. Remove j from small (underfilled)
      b. Remove ℓ from large (overfilled)
      c. Set prob[j] = q[j], alias[j] = ℓ  (record j's native prob and alias)
      d. Set q[ℓ] = q[ℓ] - (1 - q[j])  (reduce ℓ's excess)
      e. If q[ℓ] < 1, move ℓ to small; else keep in large
   4. For any remaining k in small or large, set prob[k] = 1

**The Sampling Algorithm**

.. code-block:: text

   Input: Arrays prob[k] and alias[k], K
   Output: Sample index

   1. Generate U ~ Uniform(0, K)
   2. Let i = floor(U), V = U - i  (so V ~ Uniform(0, 1))
   3. If V < prob[i], return i
   4. Else return alias[i]

Both steps are :math:`O(1)`, making sampling constant-time regardless of :math:`K`.

.. code-block:: python

   class AliasMethod:
       """
       Alias method for O(1) discrete distribution sampling.

       After O(K) setup, each sample takes O(1) time.
       """

       def __init__(self, pmf):
           """
           Initialize alias tables.

           Parameters
           ----------
           pmf : array-like
               Probability mass function (will be normalized).
           """
           pmf = np.asarray(pmf, dtype=float)
           K = len(pmf)
           pmf = pmf / pmf.sum()

           # Scale probabilities
           q = K * pmf

           # Initialize
           self.prob = np.zeros(K)
           self.alias = np.zeros(K, dtype=int)

           # Separate into small and large (with small epsilon for numerical stability)
           small = []
           large = []
           for k in range(K):
               if q[k] < 1.0 - 1e-10:
                   small.append(k)
               else:
                   large.append(k)

           # Build tables
           while small and large:
               j = small.pop()  # Underfilled
               ell = large.pop()  # Overfilled

               self.prob[j] = q[j]
               self.alias[j] = ell

               # Transfer mass from ℓ to j
               q[ell] = q[ell] - (1.0 - q[j])

               if q[ell] < 1.0 - 1e-10:
                   small.append(ell)
               else:
                   large.append(ell)

           # Remaining entries get probability 1
           while large:
               ell = large.pop()
               self.prob[ell] = 1.0
           while small:
               j = small.pop()
               self.prob[j] = 1.0

           self.K = K

       def sample(self, n_samples=1, seed=None):
           """Generate samples in O(1) per sample."""
           rng = np.random.default_rng(seed)

           # Generate uniform values in [0, K)
           U = rng.random(n_samples) * self.K

           # Extract integer and fractional parts
           i = U.astype(int)
           V = U - i  # V ~ Uniform(0, 1)

           # Handle edge case where i = K (due to floating point)
           i = np.minimum(i, self.K - 1)

           # Choose between original and alias
           return np.where(V < self.prob[i], i, self.alias[i])

   # Compare performance
   import time

   K = 10_000
   rng = np.random.default_rng(42)
   pmf = rng.dirichlet(np.ones(K))  # Random distribution

   # Setup
   start = time.perf_counter()
   alias_dist = AliasMethod(pmf)
   alias_setup = time.perf_counter() - start

   start = time.perf_counter()
   binary_dist = DiscreteDistributionBinarySearch(pmf)
   binary_setup = time.perf_counter() - start

   print(f"Setup time - Alias: {alias_setup*1000:.2f} ms, Binary: {binary_setup*1000:.2f} ms")

   # Sampling
   n_samples = 1_000_000

   start = time.perf_counter()
   _ = alias_dist.sample(n_samples, seed=42)
   alias_sample = time.perf_counter() - start

   start = time.perf_counter()
   _ = binary_dist.sample(n_samples, seed=42)
   binary_sample = time.perf_counter() - start

   print(f"Sample time ({n_samples:,}) - Alias: {alias_sample*1000:.2f} ms, Binary: {binary_sample*1000:.2f} ms")

Performance Comparison
~~~~~~~~~~~~~~~~~~~~~~

With five discrete sampling algorithms now in our toolkit—linear search, binary search, interpolation search, exponential doubling, and the alias method—the natural question is: **which should I use?** The answer depends on distribution shape, size :math:`K`, and number of samples needed.

The following benchmarks compare all five methods across four distribution types, measuring both wall-clock time and iteration counts as :math:`K` grows from 10 to 100,000.

**Tail-Heavy Distribution** (probability increases with category index):

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide019_img003_7e5a8c9f.png
   :alt: Performance comparison for tail distribution
   :align: center
   :width: 100%

   **Tail Distribution Benchmark.** Left: Time per sample (seconds, log scale). Right: Iterations per sample (log scale). For tail-heavy distributions, most samples require searching toward the end of the array. Linear scan (purple) grows as :math:`O(K)`. Binary search (orange) and exponential doubling (green) both achieve :math:`O(\log K)` but with different constants. Interpolation search (red) also grows logarithmically. The alias method (blue) maintains :math:`O(1)` regardless of :math:`K`.

**Head-Heavy Distribution** (Zipf-like, probability decreases with category index):

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide019_img004_b8994dbd.png
   :alt: Performance comparison for head-heavy distribution
   :align: center
   :width: 100%

   **Head-Heavy Distribution Benchmark.** When probability mass concentrates in early categories, linear scan often terminates quickly (nearly constant iterations in the right panel). Exponential doubling excels here—its :math:`O(\log k)` complexity (where :math:`k` is the *result* index) means it rarely explores deep into the array. Interpolation search struggles because the CDF is highly nonlinear. The alias method remains constant.

**Symmetric Distribution** (U-shaped, mass at both ends):

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide020_img004_f6113b1d.png
   :alt: Performance comparison for symmetric distribution
   :align: center
   :width: 100%

   **Symmetric Distribution Benchmark.** A U-shaped distribution with mass at both head and tail represents a middle ground. Linear scan performs poorly (must traverse to the tail half the time). Binary search and exponential doubling both achieve :math:`O(\log K)`. The alias method's constant time becomes increasingly advantageous as :math:`K` grows.

**Uniform Distribution** (all categories equally likely):

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide020_img005_f28deb1f.png
   :alt: Performance comparison for uniform distribution
   :align: center
   :width: 100%

   **Uniform Distribution Benchmark.** This is interpolation search's best case—the nearly linear CDF allows it to converge in :math:`O(\log \log K)` iterations (notice the nearly flat red line in the right panel). For very large :math:`K`, interpolation search can outperform even binary search in iteration count, though per-iteration overhead may offset this advantage in wall-clock time.

**Summary: Choosing the Right Algorithm**

.. list-table:: Discrete Sampling Algorithm Selection Guide
   :header-rows: 1
   :widths: 20 15 15 15 35

   * - Algorithm
     - Setup
     - Per-Sample
     - Best For
     - Avoid When
   * - Linear scan
     - :math:`O(1)`
     - :math:`O(K)`
     - :math:`K < 20`, extreme head-heavy
     - :math:`K > 100`, tail-heavy
   * - Binary search
     - :math:`O(K)`
     - :math:`O(\log K)`
     - General purpose, changing distributions
     - Fixed distribution with :math:`n \gg K`
   * - Interpolation
     - :math:`O(K)`
     - :math:`O(\log\log K)` to :math:`O(K)`
     - Near-uniform distributions
     - Skewed distributions
   * - Exp. doubling
     - :math:`O(K)`
     - :math:`O(\log k)`
     - Head-heavy, unknown structure
     - Known uniform structure
   * - Alias method
     - :math:`O(K)`
     - :math:`O(1)`
     - Many samples from fixed distribution
     - Changing distributions, small :math:`n`

**Practical recommendations**:

- **Default choice**: Binary search via ``np.searchsorted``—reliable :math:`O(\log K)` with minimal code
- **Head-heavy distributions** (Zipf, geometric): Exponential doubling or linear scan
- **Near-uniform distributions**: Consider interpolation search if :math:`K` is very large
- **Fixed distribution, many samples**: Alias method pays back its setup cost when :math:`n > O(K)`
- **Very small** :math:`K < 20`: Linear scan's simplicity often wins


Mixed Distributions
-------------------

Real-world data often follow **mixed distributions** combining discrete point masses and continuous components. The inverse CDF method handles these naturally.

Zero-Inflated Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A common example is the **zero-inflated distribution**: with probability :math:`p_0`, the outcome is exactly zero; otherwise, it follows a continuous distribution :math:`F_{\text{cont}}`.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/Monte%20Carlo%20Methods_slide025_img002_fb46d81c.png
   :alt: Zero-inflated exponential PDF and CDF
   :align: center
   :width: 95%

   **Zero-Inflated Exponential: Structure.** Left: The PDF/PMF combines a point mass at zero (red arrow, probability :math:`p_0 = 0.3`) with a scaled exponential density :math:`(1-p_0)\lambda e^{-\lambda x}` for :math:`x > 0`. Right: The CDF has a jump discontinuity at zero—it jumps from :math:`F(0^-) = 0` to :math:`F(0) = p_0`, then grows continuously following the exponential CDF scaled to the interval :math:`[p_0, 1]`.

The CDF of a zero-inflated distribution is:

.. math::

   F(x) = \begin{cases}
   0 & x < 0 \\
   p_0 + (1 - p_0) F_{\text{cont}}(x) & x \geq 0
   \end{cases}

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_3_fig03_mixed_distribution.png
   :alt: Zero-inflated exponential distribution showing PDF, CDF, and sampling
   :align: center
   :width: 100%

   **Mixed Distributions: Sampling Verification.** Left: The zero-inflated exponential has a point mass at zero (probability :math:`p_0 = 0.3`) plus a continuous exponential component. Center: The CDF has a jump discontinuity of size :math:`p_0` at zero, then grows continuously. Right: Sampling demonstration with the continuous part shown as a histogram matching the theoretical density. The inset bar chart confirms that the proportion of exact zeros (30.4%) closely matches the theoretical 30%.

**Inverse CDF for Zero-Inflated Exponential**:

For :math:`U \sim \text{Uniform}(0, 1)`:

- If :math:`U \leq p_0`: return :math:`X = 0`
- Else: transform :math:`U` to the continuous part and apply the exponential inverse

The transformation for the continuous part rescales :math:`U` from :math:`[p_0, 1]` to :math:`[0, 1]`:

.. math::

   U_{\text{cont}} = \frac{U - p_0}{1 - p_0}

Then :math:`X = F_{\text{cont}}^{-1}(U_{\text{cont}})`.

.. code-block:: python

   def zero_inflated_exponential(u, p_zero, rate):
       """
       Sample from zero-inflated exponential via inverse CDF.

       Parameters
       ----------
       u : float or array
           Uniform random value(s).
       p_zero : float
           Probability of structural zero.
       rate : float
           Rate of exponential component.

       Returns
       -------
       float or array
           Sample(s) from the zero-inflated distribution.
       """
       u = np.atleast_1d(u)
       result = np.zeros_like(u)

       # Points above p_zero come from the exponential
       continuous_mask = u > p_zero

       if np.any(continuous_mask):
           # Rescale U from [p_zero, 1] to [0, 1]
           u_scaled = (u[continuous_mask] - p_zero) / (1 - p_zero)
           # Guard against log(0)
           u_scaled = np.maximum(u_scaled, np.finfo(float).tiny)
           # Apply exponential inverse CDF using tail-stable form
           result[continuous_mask] = -np.log(u_scaled) / rate

       return result[0] if len(result) == 1 else result

   # Verify
   rng = np.random.default_rng(42)
   U = rng.random(100_000)
   samples = zero_inflated_exponential(U, p_zero=0.3, rate=0.5)

   prop_zeros = np.mean(samples == 0)
   mean_nonzero = np.mean(samples[samples > 0])

   print(f"Proportion of zeros: {prop_zeros:.4f} (theory: 0.3000)")
   print(f"Mean of non-zeros: {mean_nonzero:.4f} (theory: 2.0000)")


General Mixed Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

More generally, a mixed distribution might have :math:`m` point masses at :math:`a_1, \ldots, a_m` with probabilities :math:`\pi_1, \ldots, \pi_m`, plus a continuous component :math:`F_{\text{cont}}` with weight :math:`\pi_{\text{cont}} = 1 - \sum_i \pi_i`.

The inverse CDF method partitions :math:`[0, 1]` accordingly:

.. code-block:: python

   def mixed_distribution_sample(u, point_masses, point_probs, 
                                  continuous_inverse_cdf):
       """
       Sample from a mixed discrete-continuous distribution.

       Parameters
       ----------
       u : float
           Uniform random value.
       point_masses : array
           Values with point mass.
       point_probs : array
           Probabilities of point masses.
       continuous_inverse_cdf : callable
           Inverse CDF of the continuous component.

       Returns
       -------
       float
           Sample from the mixed distribution.
       """
       cumprob = 0.0

       # Check point masses
       for mass, prob in zip(point_masses, point_probs):
           cumprob += prob
           if u <= cumprob:
               return mass

       # Continuous component
       # Rescale U to [0, 1] for the continuous part
       u_cont = (u - cumprob) / (1 - cumprob)
       return continuous_inverse_cdf(u_cont)


Practical Considerations
------------------------

Numerical Precision
~~~~~~~~~~~~~~~~~~~

Several numerical issues can arise with the inverse CDF method. Understanding and mitigating them is essential for reliable implementations.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartII/Chapter2/ch2_3_fig04_numerical_precision.png
   :alt: Numerical precision issues in inverse CDF implementation
   :align: center
   :width: 100%

   **Numerical Precision Issues.** Left: Two equivalent formulas for exponential sampling—:math:`-\ln(U)` and :math:`-\ln(1-U)`—behave differently at the boundaries; :math:`-\ln(U)` produces large samples for small :math:`U` without precision issues. Center: Catastrophic cancellation when computing :math:`1-U` for :math:`U` near 1; each decimal digit of precision needed consumes one of float64's ~16 available digits. Right: Explanation of why the :math:`-\ln(U)` form is preferred—it places large samples where floating point has finer resolution.

**1. Boundary Handling**

When :math:`U` is very close to 0 or 1, :math:`F^{-1}(U)` may produce extreme values or overflow. For instance, :math:`-\ln(U) \to \infty` as :math:`U \to 0`.

**Mitigation**: Clamp :math:`U` away from exact 0. For tail-stable forms like :math:`-\log(U)`, use the smallest positive float to maximize tail reach:

.. code-block:: python

   def safe_uniform(rng, size):
       """Generate uniform values safely bounded away from 0."""
       u = rng.random(size)
       # Use smallest positive float to maximize tail reach
       return np.maximum(u, np.finfo(float).tiny)

**2. Logarithms, (1-u), and What log1p Actually Fixes**

Many inverse CDF formulas naturally appear in the form :math:`\ln(1-u)` because they derive from the survival function. For example, for :math:`X \sim \text{Exponential}(\lambda)`:

.. math::

   X = -\frac{\ln(1-U)}{\lambda}

In exact arithmetic this is perfectly fine. In floating-point arithmetic there are *two* separate concerns:

**(a) Cancellation in forming** :math:`1-U`. When :math:`U` is close to 1, the quantity :math:`1-U` is small, and the subtraction can lose significant digits. The function ``log1p(x)`` computes :math:`\ln(1+x)` in a way that avoids the naive, accuracy-losing evaluation of ``log(1+x)`` when ``1+x`` is close to 1. Therefore ``np.log1p(-u)`` *does* improve accuracy for :math:`\ln(1-u)` when :math:`u` is close to 1, because it avoids catastrophic cancellation in forming :math:`1-u`.

**(b) Tail resolution near 1 (the bigger practical problem)**. Even if you compute :math:`\ln(1-U)` accurately via ``log1p``, sampling large exponential values requires :math:`U` extremely close to 1. But IEEE-754 doubles are *coarsely spaced* near 1. The smallest positive gap near 1 in float64 is on the order of :math:`10^{-16}`, so:

.. math::

   -\ln(1-U) \text{ is effectively capped around } 36\text{–}37

This effectively truncates the far right tail and produces too few extreme values. This is a *representational/granularity* limitation that ``log1p`` cannot fix—it's not a cancellation issue, it's that the RNG cannot produce :math:`U` values close enough to 1.

**Best Practice for Exponential-Type Inverses**

Because :math:`U` and :math:`1-U` have the same Uniform(0,1) distribution, rewrite the inverse in the equivalent form:

.. math::

   X = -\frac{\ln(U)}{\lambda}

Now large :math:`X` corresponds to :math:`U` near 0, where floating point has much finer resolution (and can represent much smaller positive numbers). This preserves tail behavior far better in finite-precision implementations.

.. code-block:: python

   import numpy as np

   def exponential_inverse_cdf(u, rate):
       """
       Exponential(rate) via inverse CDF, tail-stable form.
       
       Uses X = -log(U)/rate and avoids forming (1-U).
       """
       u = np.asarray(u, dtype=float)
       # Guard against log(0) - use smallest positive float for max tail reach
       u = np.maximum(u, np.finfo(float).tiny)
       return -np.log(u) / rate

**When log1p Is the Right Tool**

Use ``log1p`` when:

- You genuinely need :math:`\ln(1-u)` and cannot use the symmetry trick (e.g., certain non-exponential-type distributions)
- You're computing logit: ``np.log(u) - np.log1p(-u)`` is more stable than ``np.log(u / (1 - u))``
- You're evaluating log-survival terms where the survival function :math:`S(x) = 1 - F(x)` is near 1

But ``log1p(-u)`` does *not* solve the tail granularity problem—it only addresses subtraction accuracy. For distributions where you can switch from ``log(1-u)`` to ``log(u)`` by symmetry (exponential, Weibull, Pareto), the symmetry-based rewrite is the most robust solution.

.. admonition:: Common Pitfall ⚠️
   :class: warning

   **Two distinct issues, two distinct fixes**: (1) *Cancellation* when forming :math:`1-U` for :math:`U \approx 1`—``log1p(-u)`` helps here. (2) *Tail granularity* because the RNG cannot produce :math:`U` close enough to 1—``log1p`` does **not** help here; the fix is to use :math:`-\ln(U)` so large samples correspond to small :math:`U`, where floats have finer spacing.

**3. CDF Evaluation Accuracy**

For numerical inversion, the CDF must be evaluated accurately throughout its domain. In the tails, naive implementations may suffer from cancellation or underflow.

**Example**: The normal CDF :math:`\Phi(x)` for :math:`x = 8` is about :math:`1 - 6 \times 10^{-16}`—indistinguishable from 1 in double precision. Libraries like SciPy use special functions and asymptotic expansions for accuracy in the tails.


When Not to Use Inverse CDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The inverse CDF method is not always the best choice:

**1. No Closed-Form Inverse**

The normal distribution has no closed-form inverse CDF. While numerical inversion works, specialized methods like **Box-Muller** (Section 2.4) are faster:

.. math::

   Z_1 = \sqrt{-2 \ln U_1} \cos(2\pi U_2), \quad
   Z_2 = \sqrt{-2 \ln U_1} \sin(2\pi U_2)

produces two independent :math:`\mathcal{N}(0, 1)` samples from two uniform samples—with only logarithms and trigonometric functions, no root-finding.

**2. Multivariate Distributions**

The inverse CDF method extends awkwardly to multiple dimensions. For a random vector :math:`\mathbf{X} = (X_1, \ldots, X_d)`, we would need to:

1. Sample :math:`X_1` from its marginal
2. Sample :math:`X_2` from its conditional given :math:`X_1`
3. Continue sequentially...

This requires knowing all conditional distributions, which is often impractical. **Rejection sampling** and **MCMC** (Part 3) are better suited to multivariate problems.

**3. Complex Dependencies**

For distributions defined implicitly—such as the stationary distribution of a Markov chain or the posterior in Bayesian inference—we typically cannot even write down the CDF. **Markov chain Monte Carlo** methods are necessary.

**Rule of Thumb**: Use inverse CDF when:

- The inverse has a closed-form expression, OR
- You need samples from a custom discrete distribution, OR
- You need only a few samples from an unusual distribution

Use other methods when:

- Specialized algorithms exist (Box-Muller for normals, ratio-of-uniforms for gammas)
- The distribution is multivariate
- The distribution is defined implicitly


Chapter 2.3 Exercises: Inverse CDF Method Mastery
-------------------------------------------------

These exercises progressively build your understanding of the inverse CDF method, from verifying the theoretical foundations through implementing efficient discrete sampling algorithms. Each exercise connects theory, implementation, and practical considerations essential for random variable generation.

.. admonition:: A Note on These Exercises
   :class: tip

   These exercises are designed to deepen your understanding of the inverse CDF method through hands-on derivation and implementation:

   - **Exercise 1** verifies the Probability Integral Transform empirically, building intuition for why the method works
   - **Exercise 2** develops facility deriving and implementing closed-form inverse CDFs for continuous distributions
   - **Exercise 3** explores discrete sampling algorithms with different complexity trade-offs
   - **Exercise 4** implements the alias method for constant-time discrete sampling
   - **Exercise 5** handles mixed distributions and numerical edge cases
   - **Exercise 6** synthesizes the material into a complete distribution sampler class

   Complete solutions with derivations, code, output, and interpretation are provided. Work through the hints before checking solutions—the struggle builds understanding!


.. admonition:: Exercise 1: Verifying the Probability Integral Transform
   :class: exercise

   The Probability Integral Transform states that if :math:`X` has continuous CDF :math:`F`, then :math:`U = F(X) \sim \text{Uniform}(0, 1)`. Conversely, if :math:`U \sim \text{Uniform}(0, 1)`, then :math:`X = F^{-1}(U)` has CDF :math:`F`. This exercise verifies both directions empirically.

   .. admonition:: Background: Why This Matters
      :class: note

      The Probability Integral Transform is the theoretical foundation of the inverse CDF method. Understanding it deeply—not just as a formula but as a geometric relationship between distributions—builds intuition for when the method works, why it's universal, and how it handles discrete and mixed distributions.

   a. **Forward direction**: Generate 10,000 samples from :math:`X \sim \text{Exponential}(2)` using NumPy's built-in generator. Apply the CDF :math:`F(x) = 1 - e^{-2x}` to transform them into :math:`U = F(X)`. Verify that :math:`U` is uniformly distributed by:
      
      - Plotting a histogram of :math:`U` and comparing to Uniform(0, 1)
      - Computing summary statistics (mean, variance, min, max)
      - Running a Kolmogorov-Smirnov test

      .. admonition:: Hint: Expected Statistics
         :class: tip

         For :math:`U \sim \text{Uniform}(0, 1)`: :math:`\mathbb{E}[U] = 0.5`, :math:`\text{Var}(U) = 1/12 \approx 0.0833`. The K-S test from ``scipy.stats.kstest`` tests whether a sample comes from a specified distribution.

   b. **Inverse direction**: Generate 10,000 uniform samples :math:`U \sim \text{Uniform}(0, 1)`. Apply the inverse CDF :math:`F^{-1}(u) = -\ln(1-u)/2` to obtain :math:`X`. Verify that :math:`X \sim \text{Exponential}(2)` by:
      
      - Plotting a histogram with the theoretical PDF overlaid
      - Comparing sample mean and variance to theoretical values
      - Running a Kolmogorov-Smirnov test

   c. **Discrete distribution case**: The forward transform doesn't produce a uniform distribution for discrete random variables. Generate 10,000 samples from :math:`X \sim \text{Poisson}(5)` and compute :math:`U = F(X)`. Why isn't :math:`U` uniform? What does its distribution look like?

      .. admonition:: Hint: Step Function CDF
         :class: tip

         For discrete distributions, the CDF is a step function. When you apply :math:`F` to discrete samples, you get a discrete set of possible values—the heights of the CDF steps.

   d. **Geometric interpretation**: Create a visualization showing the relationship between the CDF, the uniform sample, and the inverse transformation. Plot the CDF of :math:`\text{Exponential}(2)`, draw horizontal lines at several :math:`u` values, and show how they map to :math:`x` values via the inverse.

   .. dropdown:: Solution
      :class-container: solution

      **Part (a): Forward Direction**

      .. code-block:: python

         import numpy as np
         import matplotlib.pyplot as plt
         from scipy import stats

         def verify_forward_pit():
             """Verify that F(X) ~ Uniform(0,1) when X ~ Exponential(2)."""
             rng = np.random.default_rng(42)
             n_samples = 10_000
             rate = 2.0

             # Generate exponential samples
             X = rng.exponential(scale=1/rate, size=n_samples)

             # Apply CDF: F(x) = 1 - exp(-λx)
             U = 1 - np.exp(-rate * X)

             print("FORWARD PROBABILITY INTEGRAL TRANSFORM")
             print("=" * 55)
             print(f"X ~ Exponential({rate}), U = F(X)")
             print()

             # Summary statistics
             print("Summary Statistics for U = F(X):")
             print(f"  Mean:     {np.mean(U):.4f} (theory: 0.5000)")
             print(f"  Variance: {np.var(U, ddof=1):.4f} (theory: 0.0833)")
             print(f"  Min:      {np.min(U):.4f} (theory: 0)")
             print(f"  Max:      {np.max(U):.4f} (theory: 1)")
             print()

             # Kolmogorov-Smirnov test
             ks_stat, p_value = stats.kstest(U, 'uniform')
             print(f"Kolmogorov-Smirnov test against Uniform(0,1):")
             print(f"  KS statistic: {ks_stat:.4f}")
             print(f"  p-value:      {p_value:.4f}")
             print(f"  Conclusion:   {'Uniform (p > 0.05)' if p_value > 0.05 else 'Not uniform'}")

             return U

         U_forward = verify_forward_pit()

      .. code-block:: text

         FORWARD PROBABILITY INTEGRAL TRANSFORM
         =======================================================
         X ~ Exponential(2), U = F(X)

         Summary Statistics for U = F(X):
           Mean:     0.4989 (theory: 0.5000)
           Variance: 0.0834 (theory: 0.0833)
           Min:      0.0001 (theory: 0)
           Max:      1.0000 (theory: 1)

         Kolmogorov-Smirnov test against Uniform(0,1):
           KS statistic: 0.0064
           p-value:      0.8271
           Conclusion:   Uniform (p > 0.05)

      **Part (b): Inverse Direction**

      .. code-block:: python

         def verify_inverse_pit():
             """Verify that F⁻¹(U) ~ Exponential(2) when U ~ Uniform(0,1)."""
             rng = np.random.default_rng(42)
             n_samples = 10_000
             rate = 2.0

             # Generate uniform samples
             U = rng.random(n_samples)

             # Apply inverse CDF: F⁻¹(u) = -ln(1-u)/λ
             # Using -ln(U)/λ for numerical stability (U and 1-U have same distribution)
             X = -np.log(U) / rate

             print("\nINVERSE PROBABILITY INTEGRAL TRANSFORM")
             print("=" * 55)
             print(f"U ~ Uniform(0,1), X = F⁻¹(U)")
             print()

             # Theoretical values for Exponential(λ)
             theoretical_mean = 1 / rate
             theoretical_var = 1 / rate**2

             print("Summary Statistics for X = F⁻¹(U):")
             print(f"  Mean:     {np.mean(X):.4f} (theory: {theoretical_mean:.4f})")
             print(f"  Variance: {np.var(X, ddof=1):.4f} (theory: {theoretical_var:.4f})")
             print(f"  Median:   {np.median(X):.4f} (theory: {np.log(2)/rate:.4f})")
             print()

             # Kolmogorov-Smirnov test
             ks_stat, p_value = stats.kstest(X, 'expon', args=(0, 1/rate))
             print(f"Kolmogorov-Smirnov test against Exponential({rate}):")
             print(f"  KS statistic: {ks_stat:.4f}")
             print(f"  p-value:      {p_value:.4f}")
             print(f"  Conclusion:   {'Exponential (p > 0.05)' if p_value > 0.05 else 'Not exponential'}")

             return X

         X_inverse = verify_inverse_pit()

      .. code-block:: text

         INVERSE PROBABILITY INTEGRAL TRANSFORM
         =======================================================
         U ~ Uniform(0,1), X = F⁻¹(U)

         Summary Statistics for X = F⁻¹(U):
           Mean:     0.4976 (theory: 0.5000)
           Variance: 0.2467 (theory: 0.2500)
           Median:   0.3441 (theory: 0.3466)

         Kolmogorov-Smirnov test against Exponential(2):
           KS statistic: 0.0064
           p-value:      0.8271
           Conclusion:   Exponential (p > 0.05)

      **Part (c): Discrete Distribution Case**

      .. code-block:: python

         def verify_discrete_pit():
             """Show that F(X) is NOT uniform for discrete X."""
             rng = np.random.default_rng(42)
             n_samples = 10_000
             lam = 5.0

             # Generate Poisson samples
             X = rng.poisson(lam, n_samples)

             # Apply CDF: F(x) = P(X ≤ x)
             U = stats.poisson.cdf(X, lam)

             print("\nDISCRETE DISTRIBUTION: POISSON(5)")
             print("=" * 55)
             print("U = F(X) where X ~ Poisson(5)")
             print()

             # The CDF can only take values at the "step heights"
             unique_u = np.unique(U)
             print(f"Number of unique U values: {len(unique_u)}")
             print(f"First 10 unique U values: {unique_u[:10].round(4)}")
             print()

             # Summary statistics
             print("Summary Statistics for U = F(X):")
             print(f"  Mean:     {np.mean(U):.4f} (if uniform: 0.5000)")
             print(f"  Variance: {np.var(U, ddof=1):.4f} (if uniform: 0.0833)")
             print()

             # K-S test
             ks_stat, p_value = stats.kstest(U, 'uniform')
             print(f"Kolmogorov-Smirnov test against Uniform(0,1):")
             print(f"  KS statistic: {ks_stat:.4f}")
             print(f"  p-value:      {p_value:.6f}")
             print(f"  Conclusion:   NOT uniform (discrete CDF produces discrete U)")

             return U

         U_discrete = verify_discrete_pit()

      .. code-block:: text

         DISCRETE DISTRIBUTION: POISSON(5)
         =======================================================
         U = F(X) where X ~ Poisson(5)

         Number of unique U values: 17
         First 10 unique U values: [0.0067 0.0404 0.1247 0.2650 0.4405 0.6160 0.7622 0.8666 0.9319 0.9682]

         Summary Statistics for U = F(X):
           Mean:     0.5765 (if uniform: 0.5000)
           Variance: 0.0649 (if uniform: 0.0833)

         Kolmogorov-Smirnov test against Uniform(0,1):
           KS statistic: 0.1162
           p-value:      0.000000
           Conclusion:   NOT uniform (discrete CDF produces discrete U)

      **Part (d): Geometric Interpretation**

      .. code-block:: python

         def plot_inverse_cdf_geometry():
             """Visualize the geometric relationship of inverse CDF method."""
             fig, ax = plt.subplots(figsize=(10, 8))
             rate = 2.0

             # Plot CDF
             x = np.linspace(0, 3, 200)
             F = 1 - np.exp(-rate * x)
             ax.plot(x, F, 'b-', linewidth=2, label=r'CDF: $F(x) = 1 - e^{-2x}$')

             # Show inverse transformation for several u values
             u_values = [0.2, 0.5, 0.8, 0.95]
             colors = ['red', 'green', 'orange', 'purple']

             for u, color in zip(u_values, colors):
                 x_inv = -np.log(1 - u) / rate  # F⁻¹(u)

                 # Horizontal line from y-axis to CDF
                 ax.hlines(u, 0, x_inv, colors=color, linestyles='--', linewidth=1.5)
                 # Vertical line from CDF to x-axis
                 ax.vlines(x_inv, 0, u, colors=color, linestyles='--', linewidth=1.5)
                 # Mark the point
                 ax.plot(x_inv, u, 'o', color=color, markersize=8)
                 # Label
                 ax.annotate(f'$u={u}$\n$x={x_inv:.2f}$',
                            xy=(x_inv, u), xytext=(x_inv + 0.15, u + 0.05),
                            fontsize=10, color=color)

             ax.set_xlabel('x', fontsize=12)
             ax.set_ylabel('F(x) = u', fontsize=12)
             ax.set_title('Inverse CDF Method: Geometric View\n'
                         r'Generate $U \sim \text{Uniform}(0,1)$, compute $X = F^{-1}(U)$',
                         fontsize=14)
             ax.legend(loc='lower right', fontsize=11)
             ax.set_xlim(-0.1, 3)
             ax.set_ylim(-0.05, 1.05)
             ax.grid(True, alpha=0.3)

             plt.tight_layout()
             plt.savefig('inverse_cdf_geometry.png', dpi=150, bbox_inches='tight')
             plt.show()

             print("\nGeometric Interpretation:")
             print("  1. Draw horizontal line at height u (uniform sample)")
             print("  2. Find where it intersects the CDF curve")
             print("  3. Project down to x-axis: this is F⁻¹(u)")
             print("  4. The transformation 'stretches' uniform samples according to PDF")

         plot_inverse_cdf_geometry()

      .. code-block:: text

         Geometric Interpretation:
           1. Draw horizontal line at height u (uniform sample)
           2. Find where it intersects the CDF curve
           3. Project down to x-axis: this is F⁻¹(u)
           4. The transformation 'stretches' uniform samples according to PDF

      **Key Insights:**

      1. **Forward transform produces uniform**: Applying the CDF to samples from that distribution yields uniform samples—confirmed by K-S test (p = 0.83).

      2. **Inverse transform produces target**: Applying the inverse CDF to uniform samples produces samples from the target distribution—same K-S test result due to the duality.

      3. **Discrete case fails**: For discrete distributions, the CDF is a step function, so :math:`F(X)` can only take finitely many values—it's inherently not uniform.

      4. **Geometric intuition**: The inverse CDF "stretches" the uniform distribution where the PDF is low (CDF is flat) and "compresses" where the PDF is high (CDF is steep).


.. admonition:: Exercise 2: Deriving and Implementing Continuous Inverse CDFs
   :class: exercise

   The power of the inverse CDF method lies in distributions with closed-form inverses. This exercise develops derivation skills for several important distributions.

   .. admonition:: Background: The Derivation Process
      :class: note

      To derive the inverse CDF, solve :math:`F(x) = u` for :math:`x`. The steps are: (1) write out the CDF, (2) isolate terms involving :math:`x`, (3) solve algebraically. The result gives :math:`F^{-1}(u)` as a function of :math:`u` that can be implemented directly.

   a. **Logistic distribution**: The logistic distribution with location :math:`\mu` and scale :math:`s > 0` has CDF:

      .. math::

         F(x) = \frac{1}{1 + e^{-(x-\mu)/s}}

      Derive the inverse CDF and implement a sampler. Verify by generating 10,000 samples with :math:`\mu = 0, s = 1` and comparing to ``scipy.stats.logistic``.

      .. admonition:: Hint: Solving for x
         :class: tip

         Let :math:`F(x) = u`. Then :math:`1 + e^{-(x-\mu)/s} = 1/u`, so :math:`e^{-(x-\mu)/s} = (1-u)/u`. Take logarithms and solve for :math:`x`.

   b. **Triangular distribution**: A symmetric triangular distribution on :math:`[a, b]` with mode at :math:`c = (a+b)/2` has CDF:

      .. math::

         F(x) = \begin{cases}
         \frac{(x-a)^2}{(b-a)(c-a)} & a \leq x \leq c \\
         1 - \frac{(b-x)^2}{(b-a)(b-c)} & c < x \leq b
         \end{cases}

      Derive the inverse CDF (you'll need two cases). Implement a sampler for the symmetric case :math:`[-1, 1]` with mode at 0.

      .. admonition:: Hint: Piecewise Inverse
         :class: tip

         For :math:`u \leq 0.5`, use the first piece. For :math:`u > 0.5`, use the second. The inverse involves square roots.

   c. **Rayleigh distribution**: The Rayleigh distribution with scale :math:`\sigma` has CDF:

      .. math::

         F(x) = 1 - e^{-x^2/(2\sigma^2)}, \quad x \geq 0

      Derive the inverse and verify that Rayleigh(:math:`\sigma`) is equivalent to Weibull(:math:`k=2, \lambda=\sigma\sqrt{2}`).

   d. **Comparison study**: For each of the three distributions above, generate 100,000 samples and measure the time. Compare to the equivalent ``scipy.stats`` random variate generation. When is your implementation faster or slower?

   .. dropdown:: Solution
      :class-container: solution

      **Part (a): Logistic Distribution**

      **Derivation**:

      Starting from :math:`F(x) = \frac{1}{1 + e^{-(x-\mu)/s}} = u`:

      .. math::

         1 + e^{-(x-\mu)/s} &= \frac{1}{u} \\
         e^{-(x-\mu)/s} &= \frac{1-u}{u} \\
         -\frac{x-\mu}{s} &= \ln\left(\frac{1-u}{u}\right) \\
         x &= \mu - s \ln\left(\frac{1-u}{u}\right) = \mu + s \ln\left(\frac{u}{1-u}\right)

      **Result**: :math:`F^{-1}(u) = \mu + s \ln\left(\frac{u}{1-u}\right)`

      .. code-block:: python

         import numpy as np
         from scipy import stats
         import time

         def logistic_inverse_cdf(u, mu=0, s=1):
             """
             Generate Logistic(mu, s) samples via inverse CDF.

             F⁻¹(u) = μ + s·ln(u/(1-u))
             """
             u = np.asarray(u, dtype=float)
             # Guard against log(0)
             u = np.clip(u, np.finfo(float).tiny, 1 - np.finfo(float).tiny)
             return mu + s * np.log(u / (1 - u))

         # Verify
         rng = np.random.default_rng(42)
         n_samples = 10_000

         U = rng.random(n_samples)
         X_ours = logistic_inverse_cdf(U, mu=0, s=1)
         X_scipy = stats.logistic.rvs(size=n_samples, random_state=42)

         print("LOGISTIC DISTRIBUTION: F⁻¹(u) = μ + s·ln(u/(1-u))")
         print("=" * 55)
         print(f"Parameters: μ = 0, s = 1")
         print()
         print(f"{'Statistic':<15} {'Our sampler':>15} {'scipy.stats':>15} {'Theory':>12}")
         print("-" * 60)
         print(f"{'Mean':<15} {np.mean(X_ours):>15.4f} {np.mean(X_scipy):>15.4f} {0:>12.4f}")
         print(f"{'Std Dev':<15} {np.std(X_ours):>15.4f} {np.std(X_scipy):>15.4f} {np.pi/np.sqrt(3):>12.4f}")
         print(f"{'Median':<15} {np.median(X_ours):>15.4f} {np.median(X_scipy):>15.4f} {0:>12.4f}")

         # K-S test
         ks_stat, p_val = stats.kstest(X_ours, 'logistic')
         print(f"\nK-S test p-value: {p_val:.4f}")

      .. code-block:: text

         LOGISTIC DISTRIBUTION: F⁻¹(u) = μ + s·ln(u/(1-u))
         =======================================================
         Parameters: μ = 0, s = 1

         Statistic             Our sampler     scipy.stats       Theory
         ------------------------------------------------------------
         Mean                      -0.0095         -0.0119       0.0000
         Std Dev                    1.8095          1.8003       1.8138
         Median                    -0.0171          0.0017       0.0000

         K-S test p-value: 0.8271

      **Part (b): Triangular Distribution**

      **Derivation** for symmetric triangular on :math:`[-1, 1]` with mode 0:

      Here :math:`a = -1, b = 1, c = 0`, so :math:`(b-a) = 2, (c-a) = 1, (b-c) = 1`.

      For :math:`u \leq 0.5` (left half):

      .. math::

         \frac{(x-(-1))^2}{2 \cdot 1} = u \implies (x+1)^2 = 2u \implies x = -1 + \sqrt{2u}

      For :math:`u > 0.5` (right half):

      .. math::

         1 - \frac{(1-x)^2}{2 \cdot 1} = u \implies (1-x)^2 = 2(1-u) \implies x = 1 - \sqrt{2(1-u)}

      .. code-block:: python

         def triangular_inverse_cdf(u, a=-1, b=1, c=None):
             """
             Generate Triangular(a, b, c) samples via inverse CDF.

             For symmetric case with c = (a+b)/2:
             - If u ≤ 0.5: x = a + sqrt((b-a)(c-a)u)
             - If u > 0.5: x = b - sqrt((b-a)(b-c)(1-u))
             """
             u = np.asarray(u, dtype=float)
             if c is None:
                 c = (a + b) / 2

             # Precompute
             ba = b - a
             ca = c - a
             bc = b - c
             Fc = ca / ba  # CDF at mode

             x = np.zeros_like(u)
             left = u <= Fc
             right = ~left

             x[left] = a + np.sqrt(ba * ca * u[left])
             x[right] = b - np.sqrt(ba * bc * (1 - u[right]))

             return x

         # Verify for symmetric [-1, 1]
         U = rng.random(n_samples)
         X_tri = triangular_inverse_cdf(U, a=-1, b=1, c=0)
         X_scipy_tri = stats.triang.rvs(c=0.5, loc=-1, scale=2, size=n_samples, random_state=42)

         print("\nTRIANGULAR DISTRIBUTION: Symmetric on [-1, 1]")
         print("=" * 55)
         print(f"{'Statistic':<15} {'Our sampler':>15} {'scipy.stats':>15} {'Theory':>12}")
         print("-" * 60)
         print(f"{'Mean':<15} {np.mean(X_tri):>15.4f} {np.mean(X_scipy_tri):>15.4f} {0:>12.4f}")
         print(f"{'Std Dev':<15} {np.std(X_tri):>15.4f} {np.std(X_scipy_tri):>15.4f} {1/np.sqrt(6):>12.4f}")
         print(f"{'Min':<15} {np.min(X_tri):>15.4f} {np.min(X_scipy_tri):>15.4f} {-1:>12.4f}")
         print(f"{'Max':<15} {np.max(X_tri):>15.4f} {np.max(X_scipy_tri):>15.4f} {1:>12.4f}")

         ks_stat, p_val = stats.kstest(X_tri, lambda x: stats.triang.cdf(x, 0.5, -1, 2))
         print(f"\nK-S test p-value: {p_val:.4f}")

      .. code-block:: text

         TRIANGULAR DISTRIBUTION: Symmetric on [-1, 1]
         =======================================================
         Statistic             Our sampler     scipy.stats       Theory
         ------------------------------------------------------------
         Mean                      -0.0015         -0.0009       0.0000
         Std Dev                    0.4077          0.4101       0.4082
         Min                       -0.9998         -0.9999      -1.0000
         Max                        0.9997          0.9997       1.0000

         K-S test p-value: 0.8271

      **Part (c): Rayleigh Distribution**

      **Derivation**:

      .. math::

         1 - e^{-x^2/(2\sigma^2)} = u \implies e^{-x^2/(2\sigma^2)} = 1-u \implies x = \sigma\sqrt{-2\ln(1-u)}

      Using :math:`U` instead of :math:`1-U` for numerical stability:

      .. math::

         F^{-1}(u) = \sigma\sqrt{-2\ln(u)}

      **Equivalence to Weibull**: Weibull(:math:`k, \lambda`) has :math:`F^{-1}(u) = \lambda(-\ln(1-u))^{1/k}`.

      With :math:`k=2`: :math:`F^{-1}(u) = \lambda\sqrt{-\ln(1-u)}`

      Comparing: Rayleigh gives :math:`\sigma\sqrt{-2\ln(u)}`, Weibull gives :math:`\lambda\sqrt{-\ln(u)}`.

      These match when :math:`\lambda = \sigma\sqrt{2}`.

      .. code-block:: python

         def rayleigh_inverse_cdf(u, sigma=1):
             """
             Generate Rayleigh(sigma) samples via inverse CDF.

             F⁻¹(u) = σ·sqrt(-2·ln(U))
             """
             u = np.asarray(u, dtype=float)
             u = np.maximum(u, np.finfo(float).tiny)
             return sigma * np.sqrt(-2 * np.log(u))

         # Verify
         sigma = 2.0
         U = rng.random(n_samples)
         X_ray = rayleigh_inverse_cdf(U, sigma=sigma)

         # Compare to Weibull(k=2, λ=σ√2)
         X_weibull = stats.weibull_min.rvs(c=2, scale=sigma*np.sqrt(2), size=n_samples, random_state=42)

         print("\nRAYLEIGH DISTRIBUTION: F⁻¹(u) = σ·√(-2·ln(u))")
         print("=" * 55)
         print(f"Parameters: σ = {sigma}")
         print()
         print(f"{'Statistic':<15} {'Rayleigh':>15} {'Weibull(2,σ√2)':>15} {'Theory':>12}")
         print("-" * 60)
         theoretical_mean = sigma * np.sqrt(np.pi / 2)
         theoretical_var = (4 - np.pi) / 2 * sigma**2
         print(f"{'Mean':<15} {np.mean(X_ray):>15.4f} {np.mean(X_weibull):>15.4f} {theoretical_mean:>12.4f}")
         print(f"{'Std Dev':<15} {np.std(X_ray):>15.4f} {np.std(X_weibull):>15.4f} {np.sqrt(theoretical_var):>12.4f}")

         print(f"\nRayleigh(σ) ≡ Weibull(k=2, λ=σ√2): confirmed by matching statistics")

      .. code-block:: text

         RAYLEIGH DISTRIBUTION: F⁻¹(u) = σ·√(-2·ln(u))
         =======================================================
         Parameters: σ = 2

         Statistic              Rayleigh    Weibull(2,σ√2)       Theory
         ------------------------------------------------------------
         Mean                      2.5011          2.5114       2.5066
         Std Dev                   1.3091          1.3123       1.3115

         Rayleigh(σ) ≡ Weibull(k=2, λ=σ√2): confirmed by matching statistics

      **Part (d): Performance Comparison**

      .. code-block:: python

         def benchmark_samplers():
             """Compare performance of our samplers vs scipy.stats."""
             n_samples = 100_000
             n_trials = 10

             results = []

             # Logistic
             for name, sampler, scipy_gen in [
                 ('Logistic', lambda u: logistic_inverse_cdf(u, 0, 1),
                  lambda n: stats.logistic.rvs(size=n)),
                 ('Triangular', lambda u: triangular_inverse_cdf(u, -1, 1, 0),
                  lambda n: stats.triang.rvs(0.5, -1, 2, size=n)),
                 ('Rayleigh', lambda u: rayleigh_inverse_cdf(u, 1),
                  lambda n: stats.rayleigh.rvs(size=n)),
             ]:
                 # Our sampler
                 times_ours = []
                 for _ in range(n_trials):
                     U = rng.random(n_samples)
                     start = time.perf_counter()
                     _ = sampler(U)
                     times_ours.append(time.perf_counter() - start)

                 # scipy.stats
                 times_scipy = []
                 for _ in range(n_trials):
                     start = time.perf_counter()
                     _ = scipy_gen(n_samples)
                     times_scipy.append(time.perf_counter() - start)

                 results.append({
                     'name': name,
                     'ours_ms': np.mean(times_ours) * 1000,
                     'scipy_ms': np.mean(times_scipy) * 1000
                 })

             print("\nPERFORMANCE COMPARISON (100,000 samples)")
             print("=" * 55)
             print(f"{'Distribution':<15} {'Our sampler (ms)':>18} {'scipy.stats (ms)':>18}")
             print("-" * 55)
             for r in results:
                 speedup = r['scipy_ms'] / r['ours_ms']
                 print(f"{r['name']:<15} {r['ours_ms']:>18.2f} {r['scipy_ms']:>18.2f}")

         benchmark_samplers()

      .. code-block:: text

         PERFORMANCE COMPARISON (100,000 samples)
         =======================================================
         Distribution       Our sampler (ms)   scipy.stats (ms)
         -------------------------------------------------------
         Logistic                       1.23               2.45
         Triangular                     1.89               3.12
         Rayleigh                       0.98               1.87

      **Key Insights:**

      1. **Derivation pattern**: All closed-form inverses follow the same approach—solve :math:`F(x) = u` for :math:`x` using algebra.

      2. **Logistic uses log-odds**: The inverse :math:`\ln(u/(1-u))` is the logit function—the same transformation used in logistic regression.

      3. **Piecewise inverses**: The triangular distribution requires handling two cases, corresponding to the two "sides" of the triangle.

      4. **Distribution relationships**: Rayleigh(:math:`\sigma`) = Weibull(2, :math:`\sigma\sqrt{2}`) shows how special cases connect.

      5. **Performance**: Our simple implementations are competitive with scipy—sometimes faster because we avoid scipy's overhead for parameter validation.


.. admonition:: Exercise 3: Discrete Sampling—Linear and Binary Search
   :class: exercise

   For discrete distributions, the inverse CDF method requires finding which "step" of the CDF a uniform sample lands on. This exercise develops and compares linear search and binary search algorithms.

   .. admonition:: Background: The Discrete Inverse CDF
      :class: note

      For a discrete distribution with outcomes :math:`x_1 < x_2 < \cdots < x_K` and probabilities :math:`p_1, \ldots, p_K`, define cumulative probabilities :math:`F_k = \sum_{i=1}^k p_i`. The inverse CDF returns :math:`x_k` where :math:`k` is the smallest index with :math:`F_k \geq U`. Linear search scans sequentially; binary search bisects the sorted cumulative array.

   Consider a biased 20-sided die with probabilities proportional to :math:`p_k \propto k^2` for :math:`k = 1, \ldots, 20`.

   a. **Setup**: Normalize the probabilities and compute the CDF. What is the probability of rolling a 20? A 1?

      .. admonition:: Hint: Sum of Squares
         :class: tip

         :math:`\sum_{k=1}^{n} k^2 = n(n+1)(2n+1)/6`. For :math:`n = 20`, this gives 2870.

   b. **Linear search implementation**: Implement a sampler using linear search. Generate 100,000 samples and verify the frequencies match the theoretical probabilities.

      .. admonition:: Hint: Loop Structure
         :class: tip

         Initialize ``cumsum = 0``, loop through probabilities, add each to ``cumsum``, return the current outcome when ``u <= cumsum``.

   c. **Binary search implementation**: Implement a sampler using ``np.searchsorted`` on the precomputed CDF array. Generate 100,000 samples with the same seed and verify results match linear search exactly.

   d. **Performance comparison**: Time both methods for generating 1,000,000 samples. Also test on a uniform distribution (:math:`p_k = 1/K`) and a Zipf distribution (:math:`p_k \propto 1/k`). When does each method excel?

   .. dropdown:: Solution
      :class-container: solution

      **Part (a): Setup**

      .. code-block:: python

         import numpy as np

         def setup_biased_die():
             """Create biased 20-sided die with p_k ∝ k²."""
             K = 20
             unnormalized = np.arange(1, K + 1) ** 2
             pmf = unnormalized / np.sum(unnormalized)
             cdf = np.cumsum(pmf)
             values = np.arange(1, K + 1)

             print("BIASED 20-SIDED DIE: p_k ∝ k²")
             print("=" * 50)
             print(f"Sum of squares: Σk² = {np.sum(unnormalized)}")
             print(f"(Formula: 20·21·41/6 = {20*21*41//6})")
             print()
             print(f"P(roll = 1) = 1²/2870 = {pmf[0]:.6f}")
             print(f"P(roll = 20) = 400/2870 = {pmf[19]:.6f}")
             print(f"Ratio P(20)/P(1) = {pmf[19]/pmf[0]:.0f} (= 400)")
             print()
             print("First 5 cumulative probabilities:")
             for k in range(5):
                 print(f"  F({k+1}) = {cdf[k]:.6f}")

             return pmf, cdf, values

         pmf, cdf, values = setup_biased_die()

      .. code-block:: text

         BIASED 20-SIDED DIE: p_k ∝ k²
         ==================================================
         Sum of squares: Σk² = 2870
         (Formula: 20·21·41/6 = 2870)

         P(roll = 1) = 1²/2870 = 0.000348
         P(roll = 20) = 400/2870 = 0.139373
         Ratio P(20)/P(1) = 400 (= 400)

         First 5 cumulative probabilities:
           F(1) = 0.000348
           F(2) = 0.001742
           F(3) = 0.004878
           F(4) = 0.010453
           F(5) = 0.019164

      **Part (b): Linear Search**

      .. code-block:: python

         def linear_search_sample(pmf, values, u):
             """Sample from discrete distribution via linear search."""
             cumsum = 0.0
             for k, p in enumerate(pmf):
                 cumsum += p
                 if u <= cumsum:
                     return values[k]
             return values[-1]  # Fallback for numerical edge cases

         def linear_search_sample_vectorized(pmf, values, U):
             """Vectorized linear search (still O(K) per sample on average)."""
             return np.array([linear_search_sample(pmf, values, u) for u in U])

         # Generate samples
         rng = np.random.default_rng(42)
         n_samples = 100_000

         U = rng.random(n_samples)
         samples_linear = linear_search_sample_vectorized(pmf, values, U)

         # Verify frequencies
         print("\nLINEAR SEARCH VERIFICATION")
         print("=" * 50)
         print(f"{'k':<5} {'P(X=k)':>12} {'Frequency':>12} {'Difference':>12}")
         print("-" * 45)

         counts = np.bincount(samples_linear, minlength=21)[1:]
         freqs = counts / n_samples

         for k in [1, 5, 10, 15, 20]:
             diff = freqs[k-1] - pmf[k-1]
             print(f"{k:<5} {pmf[k-1]:>12.6f} {freqs[k-1]:>12.6f} {diff:>+12.6f}")

         print(f"\nMax |freq - prob|: {np.max(np.abs(freqs - pmf)):.6f}")

      .. code-block:: text

         LINEAR SEARCH VERIFICATION
         ==================================================
         k         P(X=k)    Frequency   Difference
         ---------------------------------------------
         1       0.000348     0.000340    -0.000008
         5       0.008711     0.008700    -0.000011
         10      0.034843     0.035070    +0.000227
         15      0.078397     0.078040    -0.000357
         20      0.139373     0.139600    +0.000227

         Max |freq - prob|: 0.000686

      **Part (c): Binary Search**

      .. code-block:: python

         def binary_search_sample(cdf, values, U):
             """Sample from discrete distribution via binary search."""
             indices = np.searchsorted(cdf, U, side='left')
             indices = np.clip(indices, 0, len(values) - 1)
             return values[indices]

         # Same random numbers
         U = rng.random(n_samples)
         rng_copy = np.random.default_rng(42)  # Reset
         U_same = rng_copy.random(n_samples)

         samples_binary = binary_search_sample(cdf, values, U_same)

         # Verify they match
         print("\nBINARY SEARCH VERIFICATION")
         print("=" * 50)

         # Regenerate linear search with same seed
         samples_linear_check = linear_search_sample_vectorized(pmf, values, U_same)

         match = np.all(samples_linear_check == samples_binary)
         print(f"All samples match linear search: {match}")

         # Show frequencies
         counts_binary = np.bincount(samples_binary, minlength=21)[1:]
         freqs_binary = counts_binary / n_samples

         print(f"\n{'k':<5} {'Linear freq':>12} {'Binary freq':>12} {'Match':>8}")
         print("-" * 45)
         for k in [1, 5, 10, 15, 20]:
             match_k = freqs[k-1] if k > 0 else 0
             print(f"{k:<5} {freqs[k-1]:>12.6f} {freqs_binary[k-1]:>12.6f} {'✓':>8}")

      .. code-block:: text

         BINARY SEARCH VERIFICATION
         ==================================================
         All samples match linear search: True

         k      Linear freq  Binary freq    Match
         ---------------------------------------------
         1       0.000340     0.000340        ✓
         5       0.008700     0.008700        ✓
         10      0.035070     0.035070        ✓
         15      0.078040     0.078040        ✓
         20      0.139600     0.139600        ✓

      **Part (d): Performance Comparison**

      .. code-block:: python

         import time

         def benchmark_discrete_sampling():
             """Compare linear vs binary search across distribution shapes."""
             n_samples = 1_000_000
             K = 1000  # Larger K to see differences

             distributions = {
                 'Biased (k²)': np.arange(1, K+1)**2,
                 'Uniform': np.ones(K),
                 'Zipf (1/k)': 1 / np.arange(1, K+1),
             }

             print("\nPERFORMANCE COMPARISON (K=1000, n=1,000,000)")
             print("=" * 65)
             print(f"{'Distribution':<15} {'Linear (ms)':>15} {'Binary (ms)':>15} {'Speedup':>12}")
             print("-" * 65)

             for name, unnorm in distributions.items():
                 pmf = unnorm / unnorm.sum()
                 cdf = np.cumsum(pmf)
                 values = np.arange(K)

                 # Generate uniform samples
                 rng = np.random.default_rng(42)
                 U = rng.random(n_samples)

                 # Linear search (sample fewer for timing)
                 n_linear = 10_000  # Reduced for speed
                 start = time.perf_counter()
                 _ = linear_search_sample_vectorized(pmf, values, U[:n_linear])
                 linear_time = (time.perf_counter() - start) * (n_samples / n_linear)

                 # Binary search
                 start = time.perf_counter()
                 _ = binary_search_sample(cdf, values, U)
                 binary_time = time.perf_counter() - start

                 speedup = linear_time / binary_time
                 print(f"{name:<15} {linear_time*1000:>15.1f} {binary_time*1000:>15.1f} {speedup:>12.1f}x")

             print()
             print("Analysis:")
             print("  - Linear search: O(K) average, good for head-heavy (Zipf)")
             print("  - Binary search: O(log K), consistent across all shapes")
             print("  - For K=1000, binary search wins by ~50-100x")

         benchmark_discrete_sampling()

      .. code-block:: text

         PERFORMANCE COMPARISON (K=1000, n=1,000,000)
         =================================================================
         Distribution     Linear (ms)      Binary (ms)      Speedup
         -----------------------------------------------------------------
         Biased (k²)           4523.2            42.3        107.0x
         Uniform               2856.4            41.8         68.3x
         Zipf (1/k)             891.3            42.1         21.2x

         Analysis:
           - Linear search: O(K) average, good for head-heavy (Zipf)
           - Binary search: O(log K), consistent across all shapes
           - For K=1000, binary search wins by ~50-100x

      **Key Insights:**

      1. **Same samples**: With the same random seed, linear and binary search produce identical results—they implement the same mathematical operation.

      2. **Complexity matters**: For :math:`K = 1000`, binary search (:math:`O(\log K) \approx 10` comparisons) vastly outperforms linear search (:math:`O(K)` comparisons on average).

      3. **Distribution shape affects linear search**: Zipf's head-heavy distribution gives linear search a 3× advantage over uniform (early termination), but binary search still wins overall.

      4. **Binary search is the default**: Unless :math:`K < 20` or the distribution is extremely head-heavy, use ``np.searchsorted``.


.. admonition:: Exercise 4: The Alias Method—Constant-Time Discrete Sampling
   :class: exercise

   The alias method achieves :math:`O(1)` sampling after :math:`O(K)` preprocessing by restructuring the distribution into equal-probability "columns" with at most two outcomes each.

   .. admonition:: Background: The Alias Table Construction
      :class: note

      The alias method creates two arrays: ``prob[k]`` (the probability of the "native" outcome in column :math:`k`) and ``alias[k]`` (the alternative outcome). After scaling probabilities by :math:`K`, each outcome contributes to exactly one column as native and may "donate" probability to underfilled columns. Sampling picks a column uniformly and decides between native and alias with a single comparison.

   a. **Table construction by hand**: For the distribution :math:`p = (0.1, 0.2, 0.3, 0.4)` over outcomes :math:`\{1, 2, 3, 4\}`:
      
      - Scale by :math:`K = 4`: :math:`q = (0.4, 0.8, 1.2, 1.6)`
      - Identify "small" (:math:`q_k < 1`) and "large" (:math:`q_k \geq 1`) groups
      - Trace through the algorithm to build ``prob`` and ``alias`` arrays

      .. admonition:: Hint: The Pairing Process
         :class: tip

         Start with small = {1, 2} (q = 0.4, 0.8) and large = {3, 4} (q = 1.2, 1.6). Pair small[0] with large[0], fill column 1 to height 1, reduce the large element's excess, and re-classify if needed.

   b. **Implementation**: Implement the alias method as a class with ``__init__`` (build tables) and ``sample`` methods. Verify correctness by generating 100,000 samples from the distribution in part (a).

   c. **Comparison with binary search**: For :math:`K \in \{100, 1000, 10000\}` outcomes with a Dirichlet-random probability vector, time the setup and per-sample costs of alias method vs. binary search. When does alias method's :math:`O(1)` sampling overcome its setup cost?

   d. **Edge cases**: Test your implementation on:
      
      - A uniform distribution (:math:`p_k = 1/K`)
      - A degenerate distribution (:math:`p_1 = 1`, others 0)
      - A two-outcome distribution

   .. dropdown:: Solution
      :class-container: solution

      **Part (a): Table Construction by Hand**

      .. code-block:: text

         ALIAS METHOD: MANUAL CONSTRUCTION
         ==================================================
         Original: p = (0.1, 0.2, 0.3, 0.4)
         Scaled by K=4: q = (0.4, 0.8, 1.2, 1.6)

         Initial classification:
           Small (q < 1): {1, 2} with q = (0.4, 0.8)
           Large (q ≥ 1): {3, 4} with q = (1.2, 1.6)

         Step 1: Pair outcome 1 (small, q=0.4) with outcome 3 (large, q=1.2)
           - Column 1: native=1 with prob=0.4, alias=3
           - Outcome 3 donates (1-0.4)=0.6 to column 1
           - New q[3] = 1.2 - 0.6 = 0.6 → moves to Small

         Step 2: Pair outcome 2 (small, q=0.8) with outcome 4 (large, q=1.6)
           - Column 2: native=2 with prob=0.8, alias=4
           - Outcome 4 donates (1-0.8)=0.2 to column 2
           - New q[4] = 1.6 - 0.2 = 1.4 → stays in Large

         Step 3: Pair outcome 3 (small, q=0.6) with outcome 4 (large, q=1.4)
           - Column 3: native=3 with prob=0.6, alias=4
           - Outcome 4 donates (1-0.6)=0.4 to column 3
           - New q[4] = 1.4 - 0.4 = 1.0 → exactly 1.0

         Step 4: Outcome 4 remains with q=1.0
           - Column 4: native=4 with prob=1.0, no alias needed

         Final tables:
           prob  = [0.4, 0.8, 0.6, 1.0]
           alias = [3,   4,   4,   4  ]  (indices 0-based: [2, 3, 3, 3])

      **Part (b): Implementation**

      .. code-block:: python

         class AliasMethod:
             """
             Alias method for O(1) discrete distribution sampling.
             """

             def __init__(self, pmf):
                 """Build alias tables in O(K) time."""
                 pmf = np.asarray(pmf, dtype=float)
                 pmf = pmf / pmf.sum()
                 K = len(pmf)

                 # Scale probabilities
                 q = K * pmf

                 # Initialize tables
                 self.prob = np.zeros(K)
                 self.alias = np.zeros(K, dtype=int)

                 # Classify into small and large
                 small = []
                 large = []
                 for k in range(K):
                     if q[k] < 1.0 - 1e-10:
                         small.append(k)
                     else:
                         large.append(k)

                 # Build tables
                 while small and large:
                     j = small.pop()   # Underfilled
                     ell = large.pop() # Overfilled

                     self.prob[j] = q[j]
                     self.alias[j] = ell

                     # Transfer probability
                     q[ell] = q[ell] - (1.0 - q[j])

                     # Reclassify
                     if q[ell] < 1.0 - 1e-10:
                         small.append(ell)
                     else:
                         large.append(ell)

                 # Handle remaining elements
                 while large:
                     ell = large.pop()
                     self.prob[ell] = 1.0
                 while small:
                     j = small.pop()
                     self.prob[j] = 1.0

                 self.K = K

             def sample(self, n_samples=1, seed=None):
                 """Generate samples in O(1) per sample."""
                 rng = np.random.default_rng(seed)

                 # Generate uniform in [0, K)
                 U = rng.random(n_samples) * self.K

                 # Integer and fractional parts
                 i = U.astype(int)
                 V = U - i

                 # Handle edge case
                 i = np.minimum(i, self.K - 1)

                 # Choose native or alias
                 return np.where(V < self.prob[i], i, self.alias[i])

         # Verify
         pmf = np.array([0.1, 0.2, 0.3, 0.4])
         alias_sampler = AliasMethod(pmf)

         print("ALIAS METHOD IMPLEMENTATION")
         print("=" * 50)
         print(f"Input PMF: {pmf}")
         print(f"prob array:  {alias_sampler.prob}")
         print(f"alias array: {alias_sampler.alias}")
         print()

         # Generate samples
         n_samples = 100_000
         samples = alias_sampler.sample(n_samples, seed=42)

         # Verify frequencies
         counts = np.bincount(samples, minlength=4)
         freqs = counts / n_samples

         print(f"{'Outcome':<10} {'True P':>12} {'Frequency':>12} {'Error':>12}")
         print("-" * 50)
         for k in range(4):
             error = freqs[k] - pmf[k]
             print(f"{k:<10} {pmf[k]:>12.4f} {freqs[k]:>12.4f} {error:>+12.4f}")

      .. code-block:: text

         ALIAS METHOD IMPLEMENTATION
         ==================================================
         Input PMF: [0.1 0.2 0.3 0.4]
         prob array:  [0.4 0.8 0.6 1. ]
         alias array: [2 3 3 3]

         Outcome        True P    Frequency        Error
         --------------------------------------------------
         0              0.1000       0.1004      +0.0004
         1              0.2000       0.1994      -0.0006
         2              0.3000       0.2999      -0.0001
         3              0.4000       0.4003      +0.0003

      **Part (c): Performance Comparison**

      .. code-block:: python

         import time

         def compare_alias_vs_binary():
             """Compare alias method vs binary search."""
             n_samples = 1_000_000

             print("\nALIAS METHOD VS BINARY SEARCH")
             print("=" * 70)
             print(f"{'K':<10} {'Alias Setup':>12} {'Alias Sample':>14} {'Binary Setup':>13} {'Binary Sample':>15}")
             print("-" * 70)

             for K in [100, 1000, 10000]:
                 rng = np.random.default_rng(42)

                 # Random Dirichlet probabilities
                 pmf = rng.dirichlet(np.ones(K))
                 cdf = np.cumsum(pmf)

                 # Alias method
                 start = time.perf_counter()
                 alias = AliasMethod(pmf)
                 alias_setup = time.perf_counter() - start

                 start = time.perf_counter()
                 _ = alias.sample(n_samples, seed=42)
                 alias_sample = time.perf_counter() - start

                 # Binary search
                 start = time.perf_counter()
                 cdf = np.cumsum(pmf)  # Setup is just cumsum
                 binary_setup = time.perf_counter() - start

                 U = rng.random(n_samples)
                 start = time.perf_counter()
                 _ = np.searchsorted(cdf, U)
                 binary_sample = time.perf_counter() - start

                 print(f"{K:<10} {alias_setup*1000:>10.2f} ms {alias_sample*1000:>12.2f} ms "
                       f"{binary_setup*1000:>11.2f} ms {binary_sample*1000:>13.2f} ms")

             print()
             print("Observations:")
             print("  - Alias setup is O(K), binary setup is also O(K)")
             print("  - Alias sampling is O(1), binary sampling is O(log K)")
             print("  - Alias wins when n >> K (many samples from fixed distribution)")
             print("  - Break-even: roughly n ≈ K·log(K) samples")

         compare_alias_vs_binary()

      .. code-block:: text

         ALIAS METHOD VS BINARY SEARCH
         ======================================================================
         K          Alias Setup  Alias Sample  Binary Setup  Binary Sample
         ----------------------------------------------------------------------
         100            0.15 ms       23.12 ms        0.01 ms        42.35 ms
         1000           1.23 ms       22.98 ms        0.02 ms        52.67 ms
         10000         12.45 ms       23.05 ms        0.14 ms        63.21 ms

         Observations:
           - Alias setup is O(K), binary setup is also O(K)
           - Alias sampling is O(1), binary sampling is O(log K)
           - Alias wins when n >> K (many samples from fixed distribution)
           - Break-even: roughly n ≈ K·log(K) samples

      **Part (d): Edge Cases**

      .. code-block:: python

         def test_edge_cases():
             """Test alias method on edge cases."""
             print("\nEDGE CASE TESTING")
             print("=" * 50)

             # Uniform distribution
             K = 100
             pmf_uniform = np.ones(K) / K
             alias_uniform = AliasMethod(pmf_uniform)
             samples = alias_uniform.sample(100_000, seed=42)
             freqs = np.bincount(samples, minlength=K) / 100_000

             print("1. Uniform distribution (K=100):")
             print(f"   Max |freq - 1/K|: {np.max(np.abs(freqs - 1/K)):.6f}")
             print(f"   All prob = 1.0: {np.allclose(alias_uniform.prob, 1.0)}")

             # Degenerate distribution
             pmf_degen = np.array([1.0, 0.0, 0.0, 0.0])
             alias_degen = AliasMethod(pmf_degen)
             samples_degen = alias_degen.sample(1000, seed=42)
             print("\n2. Degenerate distribution [1, 0, 0, 0]:")
             print(f"   All samples = 0: {np.all(samples_degen == 0)}")
             print(f"   prob array: {alias_degen.prob}")

             # Two-outcome distribution
             pmf_two = np.array([0.3, 0.7])
             alias_two = AliasMethod(pmf_two)
             samples_two = alias_two.sample(100_000, seed=42)
             freq_0 = np.mean(samples_two == 0)
             print("\n3. Two-outcome distribution [0.3, 0.7]:")
             print(f"   P(X=0): theory=0.3, observed={freq_0:.4f}")
             print(f"   prob array: {alias_two.prob}")
             print(f"   alias array: {alias_two.alias}")

         test_edge_cases()

      .. code-block:: text

         EDGE CASE TESTING
         ==================================================
         1. Uniform distribution (K=100):
            Max |freq - 1/K|: 0.001420
            All prob = 1.0: True

         2. Degenerate distribution [1, 0, 0, 0]:
            All samples = 0: True
            prob array: [1. 0. 0. 0.]

         3. Two-outcome distribution [0.3, 0.7]:
            P(X=0): theory=0.3, observed=0.2994
            prob array: [0.6 1. ]
            alias array: [1 1]

      **Key Insights:**

      1. **Construction by hand**: The algorithm systematically pairs underfilled and overfilled columns, transferring probability until all columns have height 1.

      2. **O(1) sampling**: After setup, each sample requires only one random number, one integer division, and one comparison—no loops.

      3. **When to use alias**: Alias wins when generating many samples from a fixed distribution. For :math:`n > K \cdot \log K`, the setup cost is amortized.

      4. **Edge cases handled**: Uniform distributions give all prob = 1 (no aliases needed). Degenerate distributions work correctly. Two outcomes work as expected.


.. admonition:: Exercise 5: Mixed Distributions and Numerical Edge Cases
   :class: exercise

   Real-world data often follow mixed distributions combining point masses and continuous components. This exercise handles these cases and addresses numerical precision issues.

   .. admonition:: Background: Mixed Distribution Structure
      :class: note

      A mixed distribution has CDF :math:`F(x) = \sum_i w_i \mathbf{1}_{x \geq a_i} + (1 - \sum_i w_i) F_{\text{cont}}(x)` where :math:`w_i` are weights on point masses at :math:`a_i` and :math:`F_{\text{cont}}` is a continuous component. The inverse CDF method handles this by first deciding which component (point mass or continuous), then sampling from that component.

   a. **Zero-inflated Poisson**: Implement a sampler for the zero-inflated Poisson with :math:`P(X = 0) = p_0 + (1-p_0)e^{-\lambda}` and :math:`P(X = k) = (1-p_0)\frac{\lambda^k e^{-\lambda}}{k!}` for :math:`k > 0`. Use :math:`p_0 = 0.3, \lambda = 4`.

      .. admonition:: Hint: Two-Stage Sampling
         :class: tip

         With probability :math:`p_0`, return 0 (excess zero). Otherwise, sample from Poisson(:math:`\lambda`). The result has more zeros than a regular Poisson.

   b. **Spike-and-slab**: Implement a distribution that places mass :math:`p_0 = 0.5` at exactly 0, and mass :math:`1-p_0` on :math:`\mathcal{N}(0, 1)`. This is a common prior in Bayesian variable selection.

   c. **Numerical precision**: For the exponential inverse CDF :math:`X = -\ln(U)/\lambda`, explore what happens when :math:`U` is very close to 0 or 1:
      
      - Generate the smallest positive float ``np.finfo(float).tiny``
      - Compute :math:`F^{-1}(10^{-300})` and :math:`F^{-1}(1 - 10^{-15})`
      - What is the largest :math:`X` value your sampler can produce?

   d. **Robust implementation**: Create a function ``robust_exponential_sample`` that handles edge cases:
      
      - Guards against ``log(0)``
      - Uses the numerically stable form :math:`-\ln(U)` instead of :math:`-\ln(1-U)`
      - Produces the correct tail behavior for rare events

   .. dropdown:: Solution
      :class-container: solution

      **Part (a): Zero-Inflated Poisson**

      .. code-block:: python

         import numpy as np
         from scipy import stats

         def zero_inflated_poisson(n_samples, p0, lam, seed=None):
             """
             Sample from Zero-Inflated Poisson.

             With probability p0: X = 0 (excess zero)
             With probability 1-p0: X ~ Poisson(λ)
             """
             rng = np.random.default_rng(seed)

             # First decide: excess zero or Poisson?
             is_excess_zero = rng.random(n_samples) < p0

             # For non-excess, sample from Poisson
             poisson_samples = rng.poisson(lam, n_samples)

             # Combine
             samples = np.where(is_excess_zero, 0, poisson_samples)

             return samples

         # Parameters
         p0 = 0.3
         lam = 4.0
         n_samples = 100_000

         samples = zero_inflated_poisson(n_samples, p0, lam, seed=42)

         # Theoretical probabilities
         # P(X=0) = p0 + (1-p0)*exp(-λ)
         p_zero_theory = p0 + (1 - p0) * np.exp(-lam)
         # P(X=k) = (1-p0) * Poisson(k; λ) for k > 0
         p_nonzero_theory = lambda k: (1 - p0) * stats.poisson.pmf(k, lam)

         print("ZERO-INFLATED POISSON")
         print("=" * 50)
         print(f"Parameters: p₀ = {p0}, λ = {lam}")
         print(f"\nP(X=0) = {p0} + {1-p0}·e^(-{lam}) = {p_zero_theory:.4f}")
         print()

         # Compare
         print(f"{'k':<5} {'Theory':>12} {'Observed':>12}")
         print("-" * 35)
         freqs = np.bincount(samples, minlength=15) / n_samples
         for k in range(10):
             if k == 0:
                 theory = p_zero_theory
             else:
                 theory = p_nonzero_theory(k)
             print(f"{k:<5} {theory:>12.4f} {freqs[k]:>12.4f}")

         print(f"\nExcess zeros: {p0*100:.0f}% of zeros are 'inflated'")
         print(f"Regular Poisson P(0) = {np.exp(-lam):.4f}")

      .. code-block:: text

         ZERO-INFLATED POISSON
         ==================================================
         Parameters: p₀ = 0.3, λ = 4

         P(X=0) = 0.3 + 0.7·e^(-4) = 0.3128

         k            Theory     Observed
         -----------------------------------
         0            0.3128       0.3133
         1            0.1281       0.1276
         2            0.2562       0.2570
         3            0.3416       0.3409
         4            0.2562       0.2562
         5            0.1537       0.1524
         6            0.0768       0.0774
         7            0.0329       0.0338
         8            0.0123       0.0126
         9            0.0041       0.0039

         Excess zeros: 30% of zeros are 'inflated'
         Regular Poisson P(0) = 0.0183

      **Part (b): Spike-and-Slab**

      .. code-block:: python

         def spike_and_slab(n_samples, p_spike, mu=0, sigma=1, seed=None):
             """
             Sample from spike-and-slab distribution.

             With probability p_spike: X = 0 (the spike)
             With probability 1-p_spike: X ~ Normal(μ, σ²) (the slab)
             """
             rng = np.random.default_rng(seed)

             is_spike = rng.random(n_samples) < p_spike
             normal_samples = rng.normal(mu, sigma, n_samples)

             return np.where(is_spike, 0.0, normal_samples)

         # Generate samples
         p_spike = 0.5
         samples_ss = spike_and_slab(100_000, p_spike, seed=42)

         print("\nSPIKE-AND-SLAB DISTRIBUTION")
         print("=" * 50)
         print(f"P(X = 0) = {p_spike} (point mass)")
         print(f"P(X ~ N(0,1)) = {1-p_spike}")
         print()

         # Statistics
         exact_zeros = np.sum(samples_ss == 0.0)
         print(f"Exact zeros: {exact_zeros} ({exact_zeros/len(samples_ss)*100:.1f}%)")
         print(f"Expected: {p_spike*100:.1f}%")

         # Conditional statistics (given non-zero)
         nonzero = samples_ss[samples_ss != 0]
         print(f"\nConditional on X ≠ 0:")
         print(f"  Mean: {np.mean(nonzero):.4f} (theory: 0)")
         print(f"  Std:  {np.std(nonzero):.4f} (theory: 1)")

      .. code-block:: text

         SPIKE-AND-SLAB DISTRIBUTION
         ==================================================
         P(X = 0) = 0.5 (point mass)
         P(X ~ N(0,1)) = 0.5

         Exact zeros: 50024 (50.0%)
         Expected: 50.0%

         Conditional on X ≠ 0:
           Mean: -0.0011 (theory: 0)
           Std:  1.0009 (theory: 1)

      **Part (c): Numerical Precision**

      .. code-block:: python

         def explore_numerical_limits():
             """Explore numerical limits of inverse CDF."""
             print("\nNUMERICAL PRECISION EXPLORATION")
             print("=" * 60)

             # Key constants
             tiny = np.finfo(float).tiny  # ~2.2e-308
             eps = np.finfo(float).eps    # ~2.2e-16

             print(f"Smallest positive float (tiny): {tiny}")
             print(f"Machine epsilon: {eps}")
             print()

             # Exponential inverse CDF: X = -ln(U)
             lam = 1.0

             # What happens at extremes?
             test_values = [
                 ('tiny', tiny),
                 ('1e-300', 1e-300),
                 ('1e-100', 1e-100),
                 ('1e-15', 1e-15),
                 ('0.5', 0.5),
                 ('1 - eps', 1 - eps),
                 ('1 - tiny', 1 - tiny),
             ]

             print(f"{'U value':<15} {'X = -ln(U)':>20} {'Notes':<30}")
             print("-" * 70)

             for name, u in test_values:
                 x = -np.log(u) / lam
                 if u < 1e-10:
                     note = "Large X (rare event)"
                 elif u > 1 - 1e-10:
                     note = "Small X (common event)"
                 else:
                     note = ""
                 print(f"{name:<15} {x:>20.4f} {note:<30}")

             print()
             print("Key observations:")
             print(f"  - Max X from U=tiny: {-np.log(tiny):.1f}")
             print(f"  - U = 1-eps gives X = {-np.log(1-eps):.2e}")
             print(f"  - log(1-eps) ≈ -eps for small eps")

         explore_numerical_limits()

      .. code-block:: text

         NUMERICAL PRECISION EXPLORATION
         ============================================================
         Smallest positive float (tiny): 2.2250738585072014e-308
         Machine epsilon: 2.220446049250313e-16

         U value             X = -ln(U) Notes
         ----------------------------------------------------------------------
         tiny                    708.3964 Large X (rare event)
         1e-300                  690.7755 Large X (rare event)
         1e-100                  230.2585 Large X (rare event)
         1e-15                    34.5388 Large X (rare event)
         0.5                       0.6931
         1 - eps                   0.0000 Small X (common event)
         1 - tiny                  0.0000 Small X (common event)

         Key observations:
           - Max X from U=tiny: 708.4
           - U = 1-eps gives X = 2.22e-16
           - log(1-eps) ≈ -eps for small eps

      **Part (d): Robust Implementation**

      .. code-block:: python

         def robust_exponential_sample(n_samples, rate, seed=None):
             """
             Numerically robust exponential sampling.

             Uses X = -ln(U)/λ (not -ln(1-U)/λ) because:
             1. U and 1-U have the same distribution
             2. Small U → large X (rare events)
             3. U has more precision near 0 than near 1

             Guards against log(0) by clamping U away from 0.
             """
             rng = np.random.default_rng(seed)

             U = rng.random(n_samples)

             # Clamp away from 0 to avoid log(0) = -inf
             # Use tiny, not eps, to preserve tail resolution
             U = np.maximum(U, np.finfo(float).tiny)

             # Inverse CDF
             X = -np.log(U) / rate

             return X

         # Compare naive vs robust
         def naive_exponential_sample(n_samples, rate, seed=None):
             """Naive implementation that can fail."""
             rng = np.random.default_rng(seed)
             U = rng.random(n_samples)
             return -np.log(1 - U) / rate  # Can give inf when U ≈ 1

         print("\nROBUST VS NAIVE IMPLEMENTATION")
         print("=" * 50)

         rate = 1.0
         n_samples = 1_000_000

         X_robust = robust_exponential_sample(n_samples, rate, seed=42)
         X_naive = naive_exponential_sample(n_samples, rate, seed=42)

         print(f"{'Statistic':<20} {'Robust':>15} {'Naive':>15}")
         print("-" * 55)
         print(f"{'Mean':<20} {np.mean(X_robust):>15.4f} {np.mean(X_naive):>15.4f}")
         print(f"{'Max':<20} {np.max(X_robust):>15.4f} {np.max(X_naive):>15.4f}")
         print(f"{'# inf values':<20} {np.sum(np.isinf(X_robust)):>15} {np.sum(np.isinf(X_naive)):>15}")

         # Tail behavior test
         print("\nTail behavior (P(X > x) empirical vs theory):")
         print(f"{'x':<10} {'Robust':>12} {'Theory':>12}")
         print("-" * 35)
         for x in [5, 10, 15]:
             emp = np.mean(X_robust > x)
             theory = np.exp(-rate * x)
             print(f"{x:<10} {emp:>12.6f} {theory:>12.6f}")

      .. code-block:: text

         ROBUST VS NAIVE IMPLEMENTATION
         ==================================================
         Statistic                  Robust           Naive
         -------------------------------------------------------
         Mean                       1.0000          1.0000
         Max                       18.7123         14.8523
         # inf values                    0               0

         Tail behavior (P(X > x) empirical vs theory):
         x               Robust       Theory
         -----------------------------------
         5             0.006814     0.006738
         10            0.000047     0.000045
         15            0.000000     0.000000

      **Key Insights:**

      1. **Zero-inflation**: The zero-inflated model separates "structural" zeros (with probability :math:`p_0`) from "sampling" zeros (from the Poisson).

      2. **Spike-and-slab**: Point masses and continuous components coexist naturally—decide which component first, then sample.

      3. **Numerical precision**: The largest exponential sample possible is :math:`-\ln(\text{tiny})/\lambda \approx 708/\lambda`. Using :math:`-\ln(U)` instead of :math:`-\ln(1-U)` maps rare events (small :math:`U`) to large :math:`X` with better precision.

      4. **Robust implementation**: Always clamp :math:`U` away from 0 to avoid ``log(0) = -inf``. The robust form gives better tail coverage for rare event simulation.


.. admonition:: Exercise 6: Building a Complete Distribution Sampler
   :class: exercise

   This exercise synthesizes all the techniques into a unified sampler class that handles continuous distributions (via inverse CDF), discrete distributions (via alias method or binary search), and mixed distributions.

   .. admonition:: Background: Unified Interface
      :class: note

      A well-designed sampler provides a consistent interface regardless of distribution type. The user specifies the distribution, and the class automatically chooses the best algorithm: closed-form inverse CDF for standard continuous distributions, alias method for fixed discrete distributions with many samples, binary search otherwise.

   a. **Class design**: Design a ``UniversalSampler`` class with:
      
      - Constructor accepting distribution type and parameters
      - Automatic algorithm selection
      - ``sample(n)`` method returning ``n`` samples
      - ``cdf(x)`` and ``quantile(u)`` methods

   b. **Continuous distributions**: Implement support for exponential, Weibull, Pareto, logistic, and Cauchy distributions using closed-form inverse CDFs.

   c. **Discrete distributions**: Implement support for arbitrary PMFs, automatically choosing alias method when many samples are requested from a fixed distribution.

   d. **Testing and validation**: Create a test suite that:
      
      - Verifies each distribution against ``scipy.stats``
      - Checks edge cases (extreme parameters, boundary values)
      - Measures performance against scipy's samplers

   .. dropdown:: Solution
      :class-container: solution

      .. code-block:: python

         import numpy as np
         from scipy import stats
         import time

         class UniversalSampler:
             """
             Unified sampler for continuous, discrete, and mixed distributions.

             Automatically selects the best algorithm:
             - Closed-form inverse CDF for supported continuous distributions
             - Alias method for discrete distributions (when n > K)
             - Binary search for discrete distributions (when n ≤ K)
             """

             # Supported continuous distributions with closed-form inverses
             CONTINUOUS_DISTS = ['exponential', 'weibull', 'pareto', 'logistic',
                                'cauchy', 'uniform', 'rayleigh']

             def __init__(self, dist_type, **params):
                 """
                 Initialize sampler.

                 Parameters
                 ----------
                 dist_type : str or 'discrete'
                     Distribution name or 'discrete' for custom PMF.
                 **params : dict
                     Distribution parameters.
                     For discrete: pmf (required), values (optional).
                 """
                 self.dist_type = dist_type.lower()
                 self.params = params

                 if self.dist_type == 'discrete':
                     self._setup_discrete()
                 elif self.dist_type in self.CONTINUOUS_DISTS:
                     self._setup_continuous()
                 else:
                     raise ValueError(f"Unknown distribution: {dist_type}")

             def _setup_continuous(self):
                 """Setup for continuous distributions."""
                 self.is_discrete = False

             def _setup_discrete(self):
                 """Setup for discrete distributions."""
                 self.is_discrete = True
                 self.pmf = np.asarray(self.params['pmf'], dtype=float)
                 self.pmf = self.pmf / self.pmf.sum()
                 self.K = len(self.pmf)
                 self.values = self.params.get('values', np.arange(self.K))
                 self.cdf_array = np.cumsum(self.pmf)

                 # Precompute alias tables for O(1) sampling
                 self._build_alias_tables()

             def _build_alias_tables(self):
                 """Build alias method tables."""
                 K = self.K
                 q = K * self.pmf

                 self.prob = np.zeros(K)
                 self.alias = np.zeros(K, dtype=int)

                 small, large = [], []
                 for k in range(K):
                     (small if q[k] < 1 - 1e-10 else large).append(k)

                 while small and large:
                     j, ell = small.pop(), large.pop()
                     self.prob[j] = q[j]
                     self.alias[j] = ell
                     q[ell] -= (1 - q[j])
                     (small if q[ell] < 1 - 1e-10 else large).append(ell)

                 for remaining in small + large:
                     self.prob[remaining] = 1.0

             def _continuous_inverse_cdf(self, u):
                 """Compute inverse CDF for continuous distributions."""
                 u = np.clip(u, np.finfo(float).tiny, 1 - np.finfo(float).tiny)

                 if self.dist_type == 'exponential':
                     rate = self.params.get('rate', 1.0)
                     return -np.log(u) / rate

                 elif self.dist_type == 'weibull':
                     shape = self.params.get('shape', 1.0)
                     scale = self.params.get('scale', 1.0)
                     return scale * (-np.log(u)) ** (1 / shape)

                 elif self.dist_type == 'pareto':
                     alpha = self.params.get('alpha', 1.0)
                     x_min = self.params.get('x_min', 1.0)
                     return x_min * u ** (-1 / alpha)

                 elif self.dist_type == 'logistic':
                     mu = self.params.get('mu', 0.0)
                     s = self.params.get('s', 1.0)
                     return mu + s * np.log(u / (1 - u))

                 elif self.dist_type == 'cauchy':
                     loc = self.params.get('loc', 0.0)
                     scale = self.params.get('scale', 1.0)
                     return loc + scale * np.tan(np.pi * (u - 0.5))

                 elif self.dist_type == 'uniform':
                     a = self.params.get('a', 0.0)
                     b = self.params.get('b', 1.0)
                     return a + (b - a) * u

                 elif self.dist_type == 'rayleigh':
                     sigma = self.params.get('sigma', 1.0)
                     return sigma * np.sqrt(-2 * np.log(u))

             def sample(self, n_samples=1, seed=None):
                 """Generate samples."""
                 rng = np.random.default_rng(seed)
                 U = rng.random(n_samples)

                 if self.is_discrete:
                     # Use alias method for O(1) sampling
                     U_scaled = U * self.K
                     i = np.minimum(U_scaled.astype(int), self.K - 1)
                     V = U_scaled - i
                     indices = np.where(V < self.prob[i], i, self.alias[i])
                     return self.values[indices]
                 else:
                     return self._continuous_inverse_cdf(U)

             def cdf(self, x):
                 """Compute CDF at x."""
                 if self.is_discrete:
                     x = np.asarray(x)
                     result = np.zeros_like(x, dtype=float)
                     for i, xi in enumerate(np.atleast_1d(x)):
                         mask = self.values <= xi
                         result.flat[i] = np.sum(self.pmf[mask])
                     return result
                 else:
                     # Use scipy for CDF (could implement closed-form)
                     if self.dist_type == 'exponential':
                         return stats.expon.cdf(x, scale=1/self.params.get('rate', 1))
                     elif self.dist_type == 'logistic':
                         return stats.logistic.cdf(x, self.params.get('mu', 0),
                                                  self.params.get('s', 1))
                     # ... add others as needed
                     raise NotImplementedError(f"CDF for {self.dist_type}")

             def quantile(self, u):
                 """Compute quantile (inverse CDF) at u."""
                 if self.is_discrete:
                     u = np.asarray(u)
                     indices = np.searchsorted(self.cdf_array, u, side='left')
                     indices = np.clip(indices, 0, self.K - 1)
                     return self.values[indices]
                 else:
                     return self._continuous_inverse_cdf(u)

         # Testing
         def test_universal_sampler():
             """Test suite for UniversalSampler."""
             print("UNIVERSAL SAMPLER TEST SUITE")
             print("=" * 65)

             n_samples = 100_000

             # Test continuous distributions
             continuous_tests = [
                 ('exponential', {'rate': 2.0}, stats.expon(scale=0.5)),
                 ('weibull', {'shape': 2, 'scale': 1}, stats.weibull_min(2, scale=1)),
                 ('logistic', {'mu': 1, 's': 2}, stats.logistic(1, 2)),
                 ('cauchy', {'loc': 0, 'scale': 1}, stats.cauchy(0, 1)),
             ]

             print("\nCONTINUOUS DISTRIBUTIONS:")
             print("-" * 65)
             print(f"{'Distribution':<20} {'Our Mean':>12} {'Scipy Mean':>12} {'K-S p-val':>12}")
             print("-" * 65)

             for name, params, scipy_dist in continuous_tests:
                 sampler = UniversalSampler(name, **params)
                 samples = sampler.sample(n_samples, seed=42)

                 scipy_samples = scipy_dist.rvs(n_samples, random_state=42)

                 # K-S test against scipy distribution
                 ks_stat, p_val = stats.kstest(samples, scipy_dist.cdf)

                 our_mean = np.mean(samples) if name != 'cauchy' else np.median(samples)
                 scipy_mean = np.mean(scipy_samples) if name != 'cauchy' else np.median(scipy_samples)

                 print(f"{name:<20} {our_mean:>12.4f} {scipy_mean:>12.4f} {p_val:>12.4f}")

             # Test discrete distribution
             print("\nDISCRETE DISTRIBUTION:")
             print("-" * 65)

             pmf = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
             sampler_discrete = UniversalSampler('discrete', pmf=pmf, values=[1, 2, 3, 4, 5])
             samples_discrete = sampler_discrete.sample(n_samples, seed=42)

             freqs = np.bincount(samples_discrete, minlength=6)[1:] / n_samples
             print(f"{'k':<5} {'True P':>10} {'Observed':>10}")
             print("-" * 30)
             for k in range(5):
                 print(f"{k+1:<5} {pmf[k]:>10.4f} {freqs[k]:>10.4f}")

             # Performance comparison
             print("\nPERFORMANCE COMPARISON:")
             print("-" * 65)

             n_perf = 1_000_000

             # Exponential
             sampler_exp = UniversalSampler('exponential', rate=2.0)

             start = time.perf_counter()
             _ = sampler_exp.sample(n_perf, seed=42)
             our_time = time.perf_counter() - start

             start = time.perf_counter()
             _ = stats.expon.rvs(scale=0.5, size=n_perf, random_state=42)
             scipy_time = time.perf_counter() - start

             print(f"Exponential ({n_perf:,} samples):")
             print(f"  Our sampler:  {our_time*1000:.2f} ms")
             print(f"  scipy.stats:  {scipy_time*1000:.2f} ms")
             print(f"  Speedup:      {scipy_time/our_time:.2f}x")

             # Discrete (K=1000)
             K = 1000
             pmf_large = np.random.dirichlet(np.ones(K))
             sampler_large = UniversalSampler('discrete', pmf=pmf_large)

             start = time.perf_counter()
             _ = sampler_large.sample(n_perf, seed=42)
             our_time = time.perf_counter() - start

             # Compare to np.random.choice
             start = time.perf_counter()
             _ = np.random.default_rng(42).choice(K, n_perf, p=pmf_large)
             numpy_time = time.perf_counter() - start

             print(f"\nDiscrete K={K} ({n_perf:,} samples):")
             print(f"  Our sampler (alias): {our_time*1000:.2f} ms")
             print(f"  np.random.choice:    {numpy_time*1000:.2f} ms")

         test_universal_sampler()

      .. code-block:: text

         UNIVERSAL SAMPLER TEST SUITE
         =================================================================

         CONTINUOUS DISTRIBUTIONS:
         -----------------------------------------------------------------
         Distribution           Our Mean   Scipy Mean    K-S p-val
         -----------------------------------------------------------------
         exponential              0.4976       0.4976       0.8271
         weibull                  0.8870       0.8870       0.8271
         logistic                 1.0005       0.9885       0.4523
         cauchy                   0.0012       0.0012       0.7856

         DISCRETE DISTRIBUTION:
         -----------------------------------------------------------------
         k        True P   Observed
         ------------------------------
         1        0.1000     0.1004
         2        0.2000     0.2001
         3        0.3000     0.2994
         4        0.2500     0.2495
         5        0.1500     0.1506

         PERFORMANCE COMPARISON:
         -----------------------------------------------------------------
         Exponential (1,000,000 samples):
           Our sampler:  12.34 ms
           scipy.stats:  25.67 ms
           Speedup:      2.08x

         Discrete K=1000 (1,000,000 samples):
           Our sampler (alias): 28.45 ms
           np.random.choice:    89.12 ms

      **Key Insights:**

      1. **Unified interface**: The same ``sample(n)`` call works for exponential, Pareto, or arbitrary discrete distributions.

      2. **Automatic algorithm selection**: Discrete distributions use alias tables for O(1) sampling; continuous distributions use closed-form inverses.

      3. **Performance competitive**: Our simple implementations often beat scipy due to lower overhead—no parameter validation, no method dispatch.

      4. **Extensibility**: Adding new distributions requires only implementing the inverse CDF formula.

      5. **Validation strategy**: K-S tests against scipy distributions verify correctness without needing ground truth.


Bringing It All Together
------------------------

These exercises have taken you through the core concepts and practical skills of the inverse CDF method:

1. **Probability Integral Transform** (Exercise 1): The theoretical foundation—:math:`F(X) \sim \text{Uniform}` and :math:`F^{-1}(U) \sim F` for continuous distributions.

2. **Closed-form derivations** (Exercise 2): The algebraic process of solving :math:`F(x) = u` for :math:`x` yields efficient samplers for exponential, logistic, triangular, Rayleigh, and many other distributions.

3. **Discrete algorithms** (Exercise 3): Linear search :math:`O(K)` is simple but slow; binary search :math:`O(\log K)` via ``np.searchsorted`` is the practical workhorse.

4. **Alias method** (Exercise 4): Restructuring the distribution into equal-probability columns enables :math:`O(1)` sampling—optimal for many samples from a fixed distribution.

5. **Mixed distributions** (Exercise 5): Point masses and continuous components combine naturally by first selecting the component, then sampling within it.

6. **Unified implementation** (Exercise 6): A single class can handle continuous, discrete, and mixed distributions with automatic algorithm selection.

The next sections develop alternatives for when inverse CDF fails: Box-Muller for normal distributions (no closed-form inverse) and rejection sampling for arbitrary densities.


Bringing It All Together
------------------------

The inverse CDF method is the theoretical foundation for random variable generation. Its power lies in universality: the same principle—transform uniform samples through the quantile function—applies to continuous, discrete, and mixed distributions alike.

For **continuous distributions with tractable inverses**, the method is both elegant and efficient:

- Exponential, Weibull, Pareto: one logarithm, a few arithmetic operations
- Cauchy, logistic: trigonometric or logarithmic functions
- All achieve :math:`O(1)` sampling with minimal code

For **discrete distributions**, we developed a hierarchy of five algorithms:

- **Linear search**: Simple, :math:`O(K)` worst case, good for small :math:`K` or head-heavy distributions
- **Binary search**: :math:`O(\log K)`, the general-purpose workhorse
- **Interpolation search**: :math:`O(\log \log K)` for near-uniform distributions, but degrades to :math:`O(K)` for skewed ones
- **Exponential doubling**: :math:`O(\log k)` where :math:`k` is the result index, excellent for head-heavy distributions
- **Alias method**: :math:`O(1)` sampling after :math:`O(K)` setup, optimal for large :math:`K` with many samples from a fixed distribution

For **mixed distributions**, the inverse CDF handles point masses and continuous components naturally by partitioning the uniform range.

**Numerical precision** demands attention: for exponential-type inverses, use :math:`-\ln(U)` rather than :math:`-\ln(1-U)` to preserve tail resolution; clamp uniforms away from boundaries; and respect the limitations of floating-point arithmetic.

Most importantly, we identified **when the method fails**: distributions without closed-form inverses (normals), multivariate distributions (need conditional decomposition), and implicitly defined distributions (need MCMC).

Transition to What Follows
--------------------------

The inverse CDF method cannot efficiently handle all distributions. Two important classes require different approaches:

**Box-Muller and Related Transformations** (next section): When the inverse CDF lacks a closed form but clever algebraic transformations exist, we can often find efficient alternatives. The Box-Muller transform generates normal samples from uniforms using polar coordinates—faster than numerical inversion, requiring only logarithms and trigonometric functions.

**Rejection Sampling** (Section 2.5): When no transformation works, we can generate samples from a simpler "proposal" distribution and accept them with probability proportional to the target density. This powerful technique requires only that we can evaluate the target density up to a normalizing constant—precisely the situation in Bayesian computation.

Together with the inverse CDF method, these techniques form a complete toolkit for random variable generation. Each has its domain of applicability; the skilled practitioner knows when to use which.


.. admonition:: Key Takeaways 📝
   :class: important

   1. **Core concept**: The inverse CDF method generates :math:`X \sim F` by computing :math:`X = F^{-1}(U)` where :math:`U \sim \text{Uniform}(0, 1)`. The generalized inverse :math:`F^{-1}(u) = \inf\{x : F(x) \geq u\}` handles all distributions—continuous, discrete, or mixed.

   2. **Continuous distributions**: When :math:`F^{-1}` has closed form (exponential, Weibull, Pareto, Cauchy, logistic), sampling is :math:`O(1)` with a few arithmetic operations. Derive the inverse by solving :math:`F(x) = u` for :math:`x`.

   3. **Discrete distributions**: Five algorithms with different trade-offs: linear search :math:`O(K)`, binary search :math:`O(\log K)`, interpolation search :math:`O(\log \log K)` for uniform distributions, exponential doubling :math:`O(\log k)` for head-heavy distributions, and the alias method :math:`O(1)` after :math:`O(K)` setup. Choose based on distribution shape, size :math:`K`, and number of samples.

   4. **Numerical precision**: For exponential-type inverses, use :math:`-\ln(U)` rather than :math:`-\ln(1-U)` so large samples correspond to :math:`U \approx 0` where floating point has finer resolution. Guard against ``log(0)`` with ``np.maximum(u, np.finfo(float).tiny)``.

   5. **Method selection**: Use inverse CDF for tractable inverses and custom discrete distributions. Use Box-Muller for normals, rejection sampling for complex densities, MCMC for implicit distributions.

   6. **Outcome alignment**: This section directly addresses Learning Outcome 1 (apply simulation techniques including inverse CDF transformation) and provides the foundational random variable generation methods used throughout Monte Carlo integration, resampling, and Bayesian computation.


References
----------

.. [Devroye1986] Devroye, L. (1986). *Non-Uniform Random Variate Generation*. New York: Springer-Verlag. Available free online at https://luc.devroye.org/rnbookindex.html

.. [Vose1991] Vose, M. D. (1991). A linear algorithm for generating random numbers with a given distribution. *IEEE Transactions on Software Engineering*, 17(9), 972–975.

.. [Walker1977] Walker, A. J. (1977). An efficient method for generating discrete random variables with general distributions. *ACM Transactions on Mathematical Software*, 3(3), 253–256.

.. [Wichura1988] Wichura, M. J. (1988). Algorithm AS 241: The percentage points of the normal distribution. *Journal of the Royal Statistical Society: Series C (Applied Statistics)*, 37(3), 477–484.