.. _ch5_5-mcmc-algorithms:

=====================================================================
Section 5.5 MCMC Algorithms: Metropolis-Hastings and Gibbs Sampling
=====================================================================

Section 5.4 established the theoretical guarantee that makes MCMC possible: construct a Markov chain satisfying detailed balance with respect to the posterior, and the ergodic theorem ensures that time averages over the chain converge to posterior expectations. The guarantee is mathematically elegant but operationally silent — it does not say *how* to construct such a chain for an arbitrary posterior :math:`p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)`. That question, which went unanswered throughout the Bayesian eclipse of the mid-twentieth century (see :ref:`ch5_1-bayesian-foundations`), is the subject of this section.

Two algorithms provide the answer in different but complementary ways. The **Metropolis-Hastings** algorithm is a universal construction: given any proposed move in parameter space, it accepts or rejects the proposal with a probability engineered to satisfy detailed balance. It requires only the ability to evaluate the unnormalised posterior at two points — the current state and the proposed state — and it makes no structural assumptions about the posterior's form. The **Gibbs sampler** takes a different approach: when the posterior can be factored into conditional distributions that are individually tractable, it cycles through parameters one at a time, drawing each from its conditional posterior given all others. Each Gibbs update can in fact be viewed as a special case of Metropolis-Hastings in which the proposal is the full conditional and is always accepted; the sampler therefore satisfies detailed balance and produces samples with zero rejection, making it highly efficient when conditional posteriors are available in closed form. Together, these two algorithms cover nearly the entire landscape of applied Bayesian inference.
Metropolis-Hastings handles any posterior; Gibbs handles posteriors with conjugate conditional structure — precisely the structure that the exponential families of :ref:`ch3_1-exponential-families` provide. Understanding both equips you to read and reason about the output of PyMC (:ref:`ch5_6-pymc-introduction`), which automates these algorithms while exposing their diagnostics and failure modes. .. admonition:: Road Map 🧭 :class: important • **Understand**: How the Metropolis-Hastings acceptance probability is derived from detailed balance, and why this derivation requires only the unnormalised posterior • **Develop**: The Gibbs sampler from its conditional-distribution characterisation; the connection between conjugate exponential family structure and closed-form Gibbs updates • **Implement**: Full Python implementations of both algorithms on two- and multi-parameter posteriors, with convergence diagnostics from :ref:`ch5_4-markov-chains` • **Evaluate**: Algorithm choice, proposal tuning, and pathology diagnosis using trace plots, acceptance rates, ESS, and :math:`\hat{R}` The Metropolis-Hastings Algorithm ----------------------------------- Recall from :ref:`ch5_4-markov-chains` that a transition kernel :math:`k(x, y)` satisfies detailed balance with respect to :math:`\pi` if .. math:: \pi(x)\, k(x, y) = \pi(y)\, k(y, x) \qquad \forall\; x, y \in \Omega. The central insight of Metropolis-Hastings is that any *proposal* kernel :math:`q(x, y)` can be converted into a kernel satisfying this equation by appending a probabilistic acceptance step. The proposal can be almost anything — a Gaussian centred on the current state, an independent draw from an approximation to the posterior, or anything in between. The acceptance step corrects for whatever imbalance the proposal introduces. 
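This repair property can be verified numerically before it is derived. The sketch below (our illustration, not one of the chapter's worked examples) constructs the Metropolis-Hastings kernel on a three-state space from a deliberately asymmetric proposal matrix, using the acceptance probability :math:`\alpha(x, y) = \min\!\left(1, \pi(y)\,q(y, x)\,/\,\pi(x)\,q(x, y)\right)` derived in the next subsection, and checks detailed balance entry by entry.

```python
import numpy as np

# Target distribution and an arbitrary proposal kernel on a 3-state space.
pi = np.array([0.2, 0.5, 0.3])
q = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])   # rows sum to 1; q is deliberately asymmetric

# Build the Metropolis-Hastings kernel: move x -> y with prob q(x,y) * alpha(x,y);
# all rejected mass stays at x.
k = np.zeros((3, 3))
for x in range(3):
    for y in range(3):
        if x != y:
            alpha = min(1.0, (pi[y] * q[y, x]) / (pi[x] * q[x, y]))
            k[x, y] = q[x, y] * alpha
    k[x, x] = 1.0 - k[x].sum()

balance = pi[:, None] * k            # entry (x, y) = pi(x) * k(x, y)
print("detailed balance:", np.allclose(balance, balance.T))
print("pi stationary:   ", np.allclose(pi @ k, pi))
```

Both checks pass: the balance matrix is symmetric, and symmetry immediately implies that :math:`\pi` is stationary for :math:`k` — exactly the argument from :ref:`ch5_4-markov-chains`, now with the acceptance step doing the balancing.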
Derivation from Detailed Balance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let :math:`q(\theta, \theta')` be a *proposal kernel* — given the current state :math:`\theta`, it proposes a new state :math:`\theta'`. Define the Metropolis-Hastings transition kernel as .. math:: k(\theta, \theta') = q(\theta, \theta')\, \alpha(\theta, \theta') \qquad \theta' \neq \theta, where :math:`\alpha(\theta, \theta') \in [0, 1]` is an acceptance probability to be determined. We want to choose :math:`\alpha` so that detailed balance holds with respect to the target :math:`\pi \equiv p(\theta \mid y)`. Substituting into the detailed balance equation: .. math:: p(\theta \mid y)\, q(\theta, \theta')\, \alpha(\theta, \theta') = p(\theta' \mid y)\, q(\theta', \theta)\, \alpha(\theta', \theta). Rearranging: .. math:: \frac{\alpha(\theta, \theta')}{\alpha(\theta', \theta)} = \frac{p(\theta' \mid y)\, q(\theta', \theta)}{p(\theta \mid y)\, q(\theta, \theta')}. Any :math:`\alpha` satisfying this ratio equation works. The choice that maximises the acceptance probability — keeping the chain moving as often as possible — is the **Metropolis-Hastings acceptance probability**: .. math:: :label: mh-acceptance \alpha(\theta, \theta') = \min\!\left(1,\; \frac{p(\theta' \mid y)\, q(\theta', \theta)}{p(\theta \mid y)\, q(\theta, \theta')} \right). This is the unique choice satisfying the ratio constraint with both :math:`\alpha(\theta, \theta') \leq 1` and :math:`\alpha(\theta', \theta) \leq 1` whenever possible. **The normalising constant cancels.** The posterior appears in the ratio :math:`p(\theta' \mid y) / p(\theta \mid y)`. Writing :math:`p(\theta \mid y) = p(y \mid \theta)\,p(\theta)\,/\,p(y)`, the marginal likelihood :math:`p(y)` is the same in numerator and denominator and cancels: .. math:: \frac{p(\theta' \mid y)}{p(\theta \mid y)} = \frac{p(y \mid \theta')\, p(\theta')}{p(y \mid \theta)\, p(\theta)}. This is the mathematical reason Metropolis-Hastings requires only the *unnormalised* posterior. 
For any model where we can evaluate the likelihood and prior pointwise — which includes virtually every model in practice — the algorithm is applicable. The General Algorithm ~~~~~~~~~~~~~~~~~~~~~~~ .. admonition:: Algorithm: Metropolis-Hastings :class: note **Input:** Target :math:`\pi(\theta) \propto p(y \mid \theta)\,p(\theta)`, proposal kernel :math:`q(\theta, \cdot)`, initial state :math:`\theta^{(0)}`, iteration count :math:`T`. **Output:** Chain :math:`\theta^{(1)}, \ldots, \theta^{(T)}`. .. code-block:: text For t = 1, 2, ..., T: 1. Propose: draw θ' ~ q(θ^(t-1), ·) 2. Compute: r = [π(θ') · q(θ', θ^(t-1))] / [π(θ^(t-1)) · q(θ^(t-1), θ')] 3. Accept: draw u ~ Uniform(0, 1) if u < min(1, r): θ^(t) = θ' (accept) else: θ^(t) = θ^(t-1) (reject, stay put) Return {θ^(t)} In practice, step 2 is computed on the log scale to avoid numerical underflow: .. code-block:: text log_r = log π(θ') - log π(θ^(t-1)) + log q(θ', θ^(t-1)) - log q(θ^(t-1), θ') accept if log(u) < log_r **Computational complexity**: :math:`O(T \times C_\pi)` where :math:`C_\pi` is the cost of one unnormalised posterior evaluation. For :math:`n` iid observations this is :math:`O(T \times n)`. Random-Walk Metropolis-Hastings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The most widely used special case sets :math:`q(\theta, \theta') = f(\theta' - \theta)` for some symmetric density :math:`f` — typically a Gaussian with mean zero and a tunable standard deviation :math:`\sigma_q`. This is **random-walk Metropolis-Hastings** (RWMH). Because :math:`q(\theta, \theta') = q(\theta', \theta)` for symmetric proposals, the proposal ratio in :eq:`mh-acceptance` simplifies to 1 and the acceptance probability reduces to .. math:: \alpha(\theta, \theta') = \min\!\left(1,\; \frac{p(\theta' \mid y)}{p(\theta \mid y)}\right) = \min\!\left(1,\; \frac{p(y \mid \theta')\, p(\theta')}{p(y \mid \theta)\, p(\theta)}\right). 
The chain automatically moves uphill (toward higher posterior density) and moves downhill with probability equal to the density ratio. This is structurally similar to **simulated annealing** — an optimisation technique that accepts uphill and downhill moves probabilistically, gradually reducing the probability of downhill moves until the algorithm converges to a maximum — but RWMH operates at a fixed "temperature" calibrated so the stationary distribution is the posterior rather than a point mass at the mode. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_5_fig25_mh_acceptance.png :alt: Two-panel figure. Left: Beta(16,6) posterior density with current state and two proposals annotated — an uphill proposal accepted with probability 1 and a downhill proposal accepted with probability equal to the density ratio. Right: bivariate Normal contour with 250 proposals from a fixed current state coloured green for accepted and red for rejected. :align: center :width: 100% **Figure 5.25.** The Metropolis-Hastings acceptance step. *Left*: the Beta(16, 6) drug trial posterior. The current state :math:`\theta^{(t)} = 0.58` (filled circle) has density 0.91. An uphill proposal at 0.76 (density 3.43) is accepted with probability :math:`\min(1, 3.43/0.91) = 1.00`. A downhill proposal at 0.42 (density 0.23) is accepted with probability :math:`\min(1, 0.23/0.91) = 0.25`. *Right*: 250 random-walk proposals from a fixed state on a bivariate Normal posterior. Proposals that move toward higher density are accepted (green); those that move away are mostly rejected (red). The acceptance rate of approximately 0.47 reflects a well-tuned step size for this 2-D target. .. admonition:: Connecting to Section 5.4 :class: note The Python simulation in :ref:`ch5_4-markov-chains` — ``run_random_walk_chain`` targeting Beta(16, 6) — was already a random-walk Metropolis-Hastings algorithm. 
Its log-acceptance step (``log_accept = prop_log_p - current_log_p``) is equation :eq:`mh-acceptance` with the proposal ratio equal to zero on the log scale. That section used it as a demonstration chain with a known target; this section derives why it works and extends it to multi-parameter non-conjugate models. Independent Metropolis-Hastings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A second important special case uses a proposal :math:`q(\theta, \theta') = g(\theta')` that is *independent of the current state* — the proposed :math:`\theta'` is drawn from a fixed distribution :math:`g` regardless of where the chain currently is. The acceptance probability becomes .. math:: \alpha(\theta, \theta') = \min\!\left(1,\; \frac{\pi(\theta')\, g(\theta)}{\pi(\theta)\, g(\theta')} \right) = \min\!\left(1,\; \frac{w(\theta')}{w(\theta)} \right), \qquad w(\theta) = \frac{\pi(\theta)}{g(\theta)}. The ratio :math:`w(\theta)` is the importance weight of the current state under the proposal :math:`g`. When :math:`g` approximates :math:`\pi` well, the weights are nearly constant and the acceptance rate is high. When :math:`g` has thinner tails than :math:`\pi`, the chain gets trapped in high-weight states and mixes poorly — the same tail-mismatch pathology as importance sampling in :ref:`ch2_1-monte-carlo-fundamentals`, now appearing in the MCMC context. In practice, independent M-H is most useful when a fast approximation to the posterior is available — for example, a Laplace approximation (a Normal distribution centred at the posterior mode with covariance equal to the inverse Hessian of the log-posterior) or a variational posterior (an approximating family fitted by minimising a divergence to the true posterior) — and can be used as the proposal to accelerate mixing. Python Implementation: Full Metropolis-Hastings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following implementation is general enough to handle any posterior that can be evaluated pointwise. 
The first example uses the Beta(16, 6) drug trial posterior to verify correctness against a known target; the second applies the same code to a logistic regression model where no closed-form posterior exists. .. code-block:: python import numpy as np from scipy import stats def metropolis_hastings( log_target, # callable: np.ndarray -> float (log unnormalised posterior) proposal_cov: np.ndarray, init: np.ndarray, n_iter: int, warmup: int, seed: int = 42 ) -> tuple[np.ndarray, float]: """ Random-walk Metropolis-Hastings sampler. Parameters ---------- log_target : callable Function returning log of the unnormalised target density. Must accept a 1-D numpy array and return a scalar float. Return -np.inf for states outside the support. proposal_cov : np.ndarray, shape (d, d) Covariance matrix of the Gaussian random-walk proposal. init : np.ndarray, shape (d,) Initial state. n_iter : int Total iterations (including warm-up). warmup : int Number of initial samples to discard. seed : int Random seed. Returns ------- samples : np.ndarray, shape (n_iter - warmup, d) Post-warmup samples. acceptance_rate : float Fraction of proposed moves that were accepted. """ rng = np.random.default_rng(seed) d = len(init) # Cholesky decomposition: proposal_cov = L @ L.T. # Drawing z ~ N(0, I) and returning L @ z gives a sample from N(0, proposal_cov). # Applying L to all innovations at once (one BLAS call) is faster than L @ z # inside the loop for large n_iter. 
    L = np.linalg.cholesky(proposal_cov)

    chain = np.empty((n_iter, d))
    chain[0] = init
    current_lp = log_target(init)
    n_accepted = 0

    raw_innovations = rng.standard_normal((n_iter, d))
    innovations = (L @ raw_innovations.T).T   # shape (n_iter, d); each row ~ N(0, Σ)
    log_unifs = np.log(rng.uniform(size=n_iter))

    for t in range(1, n_iter):
        proposal = chain[t - 1] + innovations[t]   # simple vector addition
        prop_lp = log_target(proposal)
        log_r = prop_lp - current_lp
        if log_unifs[t] < log_r:
            chain[t] = proposal
            current_lp = prop_lp
            n_accepted += 1
        else:
            chain[t] = chain[t - 1]

    acceptance_rate = n_accepted / (n_iter - 1)
    return chain[warmup:], acceptance_rate

.. admonition:: Example 💡 Drug Trial — Verifying MH Against the Known Posterior
   :class: note

   **Given:** The Beta(16, 6) posterior from the drug trial (:math:`k = 15` successes, :math:`n = 20` trials, flat prior), with known mean :math:`16/22 \approx 0.7273` and standard deviation :math:`\approx 0.0929`.

   **Find:** Verify that RWMH recovers the exact posterior quantiles.

   **Python implementation:** ..
code-block:: python

      import numpy as np
      from scipy import stats

      # Log unnormalised posterior: log Beta(16, 6) up to a constant
      alpha_post, beta_post = 16, 6

      def log_posterior_drug(theta: np.ndarray) -> float:
          """Log unnormalised Beta(16,6) posterior."""
          t = theta[0]
          if t <= 0.0 or t >= 1.0:
              return -np.inf
          return (alpha_post - 1) * np.log(t) + (beta_post - 1) * np.log(1 - t)

      # Run 4 chains
      results = []
      for i, start in enumerate([0.1, 0.4, 0.7, 0.95]):
          samples, acc = metropolis_hastings(
              log_target=log_posterior_drug,
              proposal_cov=np.array([[0.015]]),   # sigma_q ≈ 0.122
              init=np.array([start]),
              n_iter=6000, warmup=1000, seed=42 + i
          )
          results.append(samples[:, 0])
          print(f"Chain {i+1} (start={start}): acceptance rate = {acc:.3f}")

      pooled = np.concatenate(results)   # 20,000 post-warmup samples

      # Compare quantiles to exact Beta(16,6)
      target = stats.beta(alpha_post, beta_post)
      print(f"\n{'Quantile':>8} {'Exact':>8} {'MCMC':>8} {'|Error|':>8}")
      for q in [0.03, 0.10, 0.50, 0.90, 0.97]:
          exact = target.ppf(q)
          mcmc = float(np.quantile(pooled, q))
          print(f"{q:8.2f} {exact:8.4f} {mcmc:8.4f} {abs(exact-mcmc):8.4f}")

   **Result:**::

      Chain 1 (start=0.1): acceptance rate = 0.428
      Chain 2 (start=0.4): acceptance rate = 0.431
      Chain 3 (start=0.7): acceptance rate = 0.427
      Chain 4 (start=0.95): acceptance rate = 0.432

      Quantile    Exact     MCMC  |Error|
          0.03   0.5172   0.5175   0.0003
          0.10   0.5750   0.5749   0.0001
          0.50   0.7346   0.7345   0.0001
          0.90   0.8635   0.8632   0.0003
          0.97   0.9140   0.9138   0.0002

   All four chains accept approximately 43% of proposals — within the optimal range for a one-dimensional target. Quantile errors are at most :math:`3 \times 10^{-4}`, consistent with an ESS of approximately 10,000 and a posterior standard deviation of 0.093.

Proposal Tuning
~~~~~~~~~~~~~~~~~

The proposal standard deviation :math:`\sigma_q` (or covariance matrix :math:`\Sigma_q` in the multivariate case) is the primary tuning parameter of RWMH.
Its effect on chain behaviour follows a characteristic U-shaped curve when ESS/S is plotted against :math:`\sigma_q`: - **Too small** (:math:`\sigma_q \ll \sigma_\pi`): Almost every proposal is accepted, because the proposed state is nearly identical to the current state. The chain moves in tiny steps; autocorrelations remain high for hundreds of lags; ESS/S :math:`\ll 1`. - **Too large** (:math:`\sigma_q \gg \sigma_\pi`): Most proposals land in regions of negligible posterior density. The chain stays put for many consecutive steps; the trace plot shows long flat stretches; acceptance rate approaches zero; again ESS/S :math:`\ll 1`. - **Optimal**: For a :math:`d`-dimensional Gaussian target with isotropic covariance, Roberts, Gelman, and Gilks (1997) showed that the optimal acceptance rate is :math:`0.234` as :math:`d \to \infty`, achieved with :math:`\sigma_q = 2.38 / \sqrt{d} \times \sigma_\pi`. For low dimensions (:math:`d = 1, 2`) the optimal rate is higher — around 0.44 for :math:`d = 1` and declining to 0.234 as :math:`d` grows. In practice, target acceptance rates of **0.20–0.50** for one-dimensional targets and **0.20–0.35** for multivariate targets are reasonable. Adaptive MCMC — adjusting :math:`\sigma_q` during warm-up based on the empirical acceptance rate — is now standard in probabilistic programming languages and is one of the key functions of PyMC's warm-up phase. For multivariate targets with correlated parameters, the proposal covariance should approximate the posterior covariance. An empirical covariance estimated from early chain draws makes a good adaptive proposal and underlies the **Adaptive Metropolis** algorithm of Haario et al. (2001). .. admonition:: Common Pitfall ⚠️ :class: warning **Evaluating the acceptance rate without checking mixing.** A 23% acceptance rate is not evidence of convergence — it is evidence that the step size is in the right range. 
A chain can have a perfect acceptance rate and still be badly stuck if the posterior is multimodal and all accepted moves stay within one mode. Always check :math:`\hat{R}` across multiple chains with dispersed starting points alongside the acceptance rate. Acceptance rate is a *necessary* condition for good mixing, not a *sufficient* one. Application to a Non-Conjugate Model: Logistic Regression ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The drug trial posterior is conjugate — RWMH is unnecessary there. The algorithm's value emerges for non-conjugate models such as logistic regression, where no closed-form posterior exists and grid approximation is infeasible for more than two predictors. Consider a simple logistic regression with :math:`n` binary outcomes :math:`y_i \in \{0, 1\}` and scalar predictor :math:`x_i`: .. math:: y_i \mid \beta_0, \beta_1 \;\stackrel{\text{ind}}{\sim}\; \mathrm{Bernoulli}\!\left( \sigma(\beta_0 + \beta_1 x_i) \right), \qquad \sigma(u) = \frac{1}{1 + e^{-u}}, with weakly informative priors :math:`\beta_0, \beta_1 \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 2.5^2)` (the standard Gelman recommendation for standardised predictors from :ref:`ch5_2-prior-distributions`). The unnormalised log-posterior is .. math:: :label: logistic-log-posterior \log p(\beta_0, \beta_1 \mid y) = \sum_{i=1}^n \left[y_i \log\sigma(\beta_0 + \beta_1 x_i) + (1 - y_i) \log(1 - \sigma(\beta_0 + \beta_1 x_i))\right] - \frac{\beta_0^2}{2 \times 2.5^2} - \frac{\beta_1^2}{2 \times 2.5^2}. .. code-block:: python import numpy as np from scipy import special # for log_expit (numerically stable log σ) def make_logistic_log_posterior( X: np.ndarray, y: np.ndarray, prior_sigma: float = 2.5 ): # returns callable: np.ndarray -> float """ Factory returning log unnormalised posterior for logistic regression. Parameters ---------- X : np.ndarray, shape (n, p) Design matrix (intercept NOT automatically prepended). 
y : np.ndarray, shape (n,) Binary outcomes in {0, 1}. prior_sigma : float Standard deviation of iid Normal(0, prior_sigma^2) priors on all coefficients. Returns ------- callable Function: beta (shape (p,)) -> scalar log-posterior. """ def log_posterior(beta: np.ndarray) -> float: eta = X @ beta # linear predictor # Numerically stable: log σ(η) = -log(1 + e^{-η}) = log_expit(η) log_lik = np.sum( y * special.log_expit(eta) + (1 - y) * special.log_expit(-eta) ) log_prior = -0.5 * np.sum(beta ** 2) / prior_sigma ** 2 return log_lik + log_prior return log_posterior # ------------------------------------------------------------------ # # Generate synthetic data: n=200, true beta = (-1.0, 2.5) # # ------------------------------------------------------------------ # rng_data = np.random.default_rng(2024) n_obs = 200 x_raw = rng_data.normal(0, 1, n_obs) X_design = np.column_stack([np.ones(n_obs), x_raw]) # [intercept, x] beta_true = np.array([-1.0, 2.5]) p_true = 1 / (1 + np.exp(-(X_design @ beta_true))) y_obs = rng_data.binomial(1, p_true) log_post = make_logistic_log_posterior(X_design, y_obs) # ------------------------------------------------------------------ # # Run 4 RWMH chains # # ------------------------------------------------------------------ # # Proposal covariance: rough posterior scale from MLE Hessian from scipy.optimize import minimize mle_result = minimize(lambda b: -log_post(b), x0=np.zeros(2), method='BFGS') hess_inv = np.array(mle_result.hess_inv) # dense ndarray ≈ posterior covariance proposal_cov = 2.38**2 / 2 * hess_inv # optimal scaling for d=2 starts = [np.array([-3.0, 0.5]), np.array([ 0.0, 4.0]), np.array([-2.0, 1.0]), np.array([ 1.0, 5.0])] samples_list = [] for i, s0 in enumerate(starts): samps, acc = metropolis_hastings( log_target=log_post, proposal_cov=proposal_cov, init=s0, n_iter=6000, warmup=1000, seed=10 + i ) samples_list.append(samps) print(f"Chain {i+1}: acceptance rate = {acc:.3f}") chains_arr = np.array(samples_list) # shape: (4, 
5000, 2) # Posterior summaries pooled = chains_arr.reshape(-1, 2) for j, name in enumerate(['beta_0', 'beta_1']): mean = pooled[:, j].mean() sd = pooled[:, j].std() lo = np.quantile(pooled[:, j], 0.025) hi = np.quantile(pooled[:, j], 0.975) print(f"{name}: mean={mean:.3f} sd={sd:.3f} 95% CI=[{lo:.3f}, {hi:.3f}]" f" (true: {beta_true[j]:.1f})") Running this code produces output approximately as follows:: Chain 1: acceptance rate = 0.271 Chain 2: acceptance rate = 0.268 Chain 3: acceptance rate = 0.274 Chain 4: acceptance rate = 0.272 beta_0: mean=-1.023 sd=0.181 95% CI=[-1.383, -0.675] (true: -1.0) beta_1: mean= 2.534 sd=0.265 95% CI=[ 2.026, 3.072] (true: 2.5) The posterior means are close to the true parameter values; the credible intervals contain the true values; and all four chains achieve near-optimal acceptance rates around 27% for a two-dimensional target. The proposal covariance, initialised from the MLE Hessian and scaled by the :math:`2.38^2/d` rule, automatically adapts the chain to the posterior's geometry. The Gibbs Sampler ------------------ The Gibbs sampler exploits a structural feature that Metropolis-Hastings ignores: when a posterior :math:`p(\theta_1, \ldots, \theta_d \mid y)` factors into manageable *conditional distributions*, it may be more efficient to update each parameter from its conditional posterior in turn, rather than proposing moves in the full :math:`d`-dimensional space. Conditional Distributions and the Gibbs Update ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Partition the parameter vector as :math:`\theta = (\theta_1, \ldots, \theta_d)`. The **full conditional distribution** of :math:`\theta_j` is the posterior of :math:`\theta_j` given all other parameters and the data: .. math:: p(\theta_j \mid \theta_{-j}, y), \qquad \theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_d). The **Gibbs sampler** cycles through parameters, drawing each from its full conditional: .. 
admonition:: Algorithm: Gibbs Sampler :class: note **Input:** Conditionals :math:`\{p(\theta_j \mid \theta_{-j}, y)\}_{j=1}^d`, initial state :math:`\theta^{(0)}`, iteration count :math:`T`. **Output:** Chain :math:`\theta^{(1)}, \ldots, \theta^{(T)}`. .. code-block:: text For t = 1, 2, ..., T: For j = 1, 2, ..., d: Draw θ_j^(t) ~ p(θ_j | θ_1^(t), ..., θ_{j-1}^(t), θ_{j+1}^(t-1), ..., θ_d^(t-1), y) Return {θ^(t)} Each component is updated using the most recent values of all other components — mixing information from iteration :math:`t` and :math:`t-1` during the sweep. Why Gibbs Satisfies Detailed Balance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each Gibbs update from :math:`\theta_j^{(t-1)}` to :math:`\theta_j^{(t)}` is a draw from the exact conditional distribution. The transition kernel for the :math:`j`-th update, holding all other components fixed at :math:`\theta_{-j}`, is .. math:: k_j(\theta_j, \theta_j') = p(\theta_j' \mid \theta_{-j}, y). Detailed balance for this kernel holds because .. math:: p(\theta_j \mid \theta_{-j}, y) \cdot k_j(\theta_j, \theta_j') = p(\theta_j \mid \theta_{-j}, y) \cdot p(\theta_j' \mid \theta_{-j}, y). The right-hand side is symmetric in :math:`\theta_j` and :math:`\theta_j'`, so it also equals .. math:: p(\theta_j' \mid \theta_{-j}, y) \cdot p(\theta_j \mid \theta_{-j}, y) = p(\theta_j' \mid \theta_{-j}, y) \cdot k_j(\theta_j', \theta_j). Detailed balance is therefore satisfied for each conditional update in isolation. The composition of these updates — the full Gibbs sweep — also preserves the joint posterior as its stationary distribution. The proof invokes the **Hammersley-Clifford theorem**, which states that for any joint distribution satisfying mild regularity conditions, the joint distribution is uniquely determined by its full conditional distributions — that is, knowing all conditionals :math:`p(\theta_j \mid \theta_{-j}, y)` for every :math:`j` is equivalent to knowing the joint :math:`p(\theta \mid y)`. 
Since each Gibbs step draws from the exact conditional and leaves that conditional's marginal unchanged, the joint is preserved. Crucially, **every proposed Gibbs move is accepted**: there is no rejection step, so the algorithm never stays put by accident. This makes Gibbs highly efficient when the conditionals are available in closed form. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_5_fig26_gibbs_trajectory.png :alt: Two-panel figure showing Gibbs sampler trajectories on bivariate Normal posteriors. Left panel: low correlation rho=0.2 with a zigzag trajectory that quickly traverses the posterior. Right panel: high correlation rho=0.95 with a zigzag trajectory nearly parallel to the long axis, showing slow mixing. :align: center :width: 100% **Figure 5.26.** Gibbs sampler trajectories on a bivariate Normal posterior. Each step alternates between updating :math:`\theta_1` (horizontal move, blue) and :math:`\theta_2` (vertical move, orange), producing the characteristic axis-aligned zigzag. *Left*: with low posterior correlation :math:`\rho = 0.20`, the zigzag traverses the posterior quickly and the effective sample size per iteration is :math:`1 - 0.20^2 = 0.96`. *Right*: with high correlation :math:`\rho = 0.95`, successive moves are nearly parallel to the long axis of the ellipse, making negligible progress per sweep; the theoretical ESS per iteration falls to :math:`1 - 0.95^2 = 0.10`. Reparameterisation or block updates are the remedies for this slow-mixing regime. The Exponential Family Connection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Gibbs becomes particularly powerful when the model has conjugate structure — precisely when the likelihood belongs to an exponential family and the prior matches its conjugate. In that case, the full conditional distributions are recognisable members of standard parametric families, and each Gibbs draw is an exact sample from that family. 
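When the full conditionals are named distributions, a complete Gibbs sampler takes only a few lines. The sketch below (our illustration; the helper names ``gibbs_bvn`` and ``lag1_autocorr`` are not used elsewhere in the chapter) samples the standard bivariate Normal of Figure 5.26, whose exact conditionals are :math:`\theta_1 \mid \theta_2 \sim \mathcal{N}(\rho\,\theta_2,\; 1 - \rho^2)` and symmetrically for :math:`\theta_2`. The resulting :math:`\theta_1` subchain behaves like an AR(1) process with coefficient :math:`\rho^2`, which reproduces the slow mixing the figure shows at :math:`\rho = 0.95`.

```python
import numpy as np

def gibbs_bvn(rho: float, n_iter: int, seed: int = 0) -> np.ndarray:
    """Two-component Gibbs sampler for a standard bivariate Normal with correlation rho.

    Each sweep draws theta_1 | theta_2 ~ N(rho*theta_2, 1 - rho^2) and then
    theta_2 | theta_1 ~ N(rho*theta_1, 1 - rho^2): exact draws, zero rejection.
    """
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)       # sd of each full conditional
    draws = np.empty((n_iter, 2))
    t1 = t2 = 0.0
    for t in range(n_iter):
        t1 = rng.normal(rho * t2, cond_sd)  # "horizontal" move in Figure 5.26
        t2 = rng.normal(rho * t1, cond_sd)  # "vertical" move
        draws[t] = t1, t2
    return draws

def lag1_autocorr(x: np.ndarray) -> float:
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

for rho in (0.20, 0.95):
    ac = lag1_autocorr(gibbs_bvn(rho, 20_000)[:, 0])
    print(f"rho = {rho:.2f}: lag-1 autocorrelation = {ac:.3f}, rho^2 = {rho**2:.3f}")
```

The estimated lag-1 autocorrelation should come out near :math:`\rho^2` in each case (about 0.04 for :math:`\rho = 0.20`, about 0.90 for :math:`\rho = 0.95`), quantifying the contrast between the two panels of the figure.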
Recall from :ref:`ch3_1-exponential-families` that the canonical exponential family density has the form :math:`p(y \mid \eta) = h(y)\exp\{\eta^\top T(y) - A(\eta)\}`, with natural parameter :math:`\eta` and sufficient statistic :math:`T(y)`. The conjugate prior for :math:`\eta` is

.. math::

   p(\eta \mid \lambda, \nu) \propto \exp\{\lambda^\top \eta - \nu A(\eta)\},

where :math:`\lambda` and :math:`\nu` are hyperparameters. As established in :ref:`ch5_2-prior-distributions`, the posterior is in the same family with updated hyperparameters :math:`\lambda + T(y)` and :math:`\nu + 1`. When a hierarchical model has multiple parameters, each of which is conjugate to its conditional likelihood, every full conditional is a named distribution and every Gibbs draw is exact.

The Normal model with *unknown* mean :math:`\mu` and variance :math:`\sigma^2` is the canonical illustration. The model is

.. math::

   y_i \mid \mu, \sigma^2 \;\stackrel{\text{iid}}{\sim}\; \mathcal{N}(\mu, \sigma^2), \qquad i = 1, \ldots, n,

with the **Normal-Inverse-Gamma** prior introduced in :ref:`ch5_2-prior-distributions`:

.. math::

   \mu \mid \sigma^2 &\sim \mathcal{N}\!\left(\mu_0,\; \frac{\sigma^2}{\kappa_0}\right), \\
   \sigma^2 &\sim \mathrm{InvGamma}\!\left(\frac{\nu_0}{2},\; \frac{\nu_0 \sigma_0^2}{2}\right).

The two full conditional distributions are closed-form members of the same families:

.. math::
   :label: gibbs-normal-mu

   \mu \mid \sigma^2, y \;\sim\; \mathcal{N}\!\left(\mu_n,\; \frac{\sigma^2}{\kappa_n}\right),

where

.. math::

   \kappa_n = \kappa_0 + n, \qquad \mu_n = \frac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n}.

.. math::
   :label: gibbs-normal-sigma

   \sigma^2 \mid \mu, y \;\sim\; \mathrm{InvGamma}\!\left( \frac{\nu_0 + n + 1}{2},\; \frac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - \mu_0)^2 + \sum_{i=1}^n (y_i - \mu)^2}{2} \right).

The :math:`\kappa_0 (\mu - \mu_0)^2` term and the extra degree of freedom arise because the prior :math:`p(\mu \mid \sigma^2)` itself depends on :math:`\sigma^2` and therefore contributes to the :math:`\sigma^2` conditional.
Note that the conditional for :math:`\sigma^2` in equation :eq:`gibbs-normal-sigma` conditions on the *current* :math:`\mu`, not on :math:`\bar{y}` — the sum of squares uses the current draw of :math:`\mu` at each Gibbs iteration. Python Implementation: Gibbs Sampler for the Normal–Inverse-Gamma Model ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np from scipy import stats def gibbs_normal_nig( y: np.ndarray, mu0: float, kappa0: float, nu0: float, sigma0_sq: float, n_iter: int, warmup: int, seed: int = 42 ) -> np.ndarray: """ Gibbs sampler for Normal data with Normal-Inverse-Gamma prior. Model: y_i | mu, sigma^2 ~ N(mu, sigma^2) Prior: mu | sigma^2 ~ N(mu0, sigma^2/kappa0) sigma^2 ~ InvGamma(nu0/2, nu0*sigma0_sq/2) Parameters ---------- y : np.ndarray, shape (n,) Observed data. mu0 : float Prior mean for mu. kappa0 : float Prior pseudo-observations for mu (effective sample size). nu0 : float Prior degrees of freedom for sigma^2. sigma0_sq : float Prior scale for sigma^2. n_iter : int Total iterations (including warm-up). warmup : int Samples to discard. seed : int Random seed. Returns ------- np.ndarray, shape (n_iter - warmup, 2) Post-warmup samples: column 0 = mu, column 1 = sigma^2. 
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    ybar = y.mean()

    # Posterior hyperparameters that do not depend on mu
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n   # used in mu conditional
    nu_n = nu0 + n

    # Compute fixed sufficient statistic once: O(n)
    SS_ybar = np.sum((y - ybar) ** 2)

    # Storage
    samples = np.empty((n_iter, 2))

    # Initialise at method-of-moments estimates
    samples[0, 0] = ybar             # mu
    samples[0, 1] = y.var(ddof=1)    # sigma^2

    for t in range(1, n_iter):
        sigma2_prev = samples[t - 1, 1]

        # --- Draw mu | sigma^2, y (equation gibbs-normal-mu) ---
        # N(mu_n, sigma^2 / kappa_n)
        mu_new = rng.normal(mu_n, np.sqrt(sigma2_prev / kappa_n))

        # --- Draw sigma^2 | mu_new, y (equation gibbs-normal-sigma) ---
        # Algebraic identity: ∑(yᵢ - μ)² = ∑(yᵢ - ȳ)² + n(ȳ - μ)²
        # SS_ybar is fixed; only the O(1) correction term changes each iteration.
        SS = SS_ybar + n * (ybar - mu_new) ** 2
        # The prior p(mu | sigma^2) depends on sigma^2, so it contributes the
        # kappa0*(mu - mu0)^2 term and one extra degree of freedom.
        rate_n = nu0 * sigma0_sq + kappa0 * (mu_new - mu0) ** 2 + SS
        # InvGamma(a, b): if X ~ Gamma(a, scale=1/b) then 1/X ~ InvGamma(a, b).
        # numpy's Generator.gamma uses the scale parameterisation, so scale = 2/rate_n.
        sigma2_new = 1.0 / rng.gamma(shape=(nu_n + 1) / 2, scale=2.0 / rate_n)

        samples[t, 0] = mu_new
        samples[t, 1] = sigma2_new

    return samples[warmup:]

.. admonition:: Example 💡 Weight Measurements — Gibbs Posterior vs Analytical NIG
   :class: note

   **Given:** :math:`n = 15` weight measurements (kg) generated from :math:`\mathcal{N}(70.5,\; 3.0^2)`. Prior: :math:`\mu_0 = 70`, :math:`\kappa_0 = 1`, :math:`\nu_0 = 3`, :math:`\sigma_0^2 = 9` (weakly informative).

   **Find:** Verify that the Gibbs sampler recovers the analytical NIG marginal posteriors for :math:`\mu` and :math:`\sigma^2`.

   **Python implementation:** ..
code-block:: python import numpy as np from scipy import stats # Generate data rng_data = np.random.default_rng(42) n, mu_true, sigma_true = 15, 70.5, 3.0 y_weights = rng_data.normal(mu_true, sigma_true, n) ybar = y_weights.mean() print(f"n={n}, ybar={ybar:.3f}, s={y_weights.std(ddof=1):.3f}") # Prior hyperparameters mu0, kappa0, nu0, sigma0_sq = 70.0, 1.0, 3.0, 9.0 # Run 4 Gibbs chains from dispersed starts gibbs_chains = np.array([ gibbs_normal_nig(y_weights, mu0, kappa0, nu0, sigma0_sq, n_iter=8000, warmup=1000, seed=42 + i) for i in range(4) ]) # shape: (4, 7000, 2) pooled_mu = gibbs_chains[:, :, 0].ravel() pooled_sigma2 = gibbs_chains[:, :, 1].ravel() # Analytical NIG posterior hyperparameters kappa_n = kappa0 + n mu_n = (kappa0 * mu0 + n * ybar) / kappa_n nu_n = nu0 + n SS_ybar = np.sum((y_weights - ybar) ** 2) scale_n = nu0 * sigma0_sq + SS_ybar + kappa0 * n / kappa_n * (ybar - mu0) ** 2 # Marginal for mu: Student-t_{nu_n}(mu_n, scale_n/(kappa_n*nu_n)) t_scale = np.sqrt(scale_n / (kappa_n * nu_n)) exact_mu_dist = stats.t(df=nu_n, loc=mu_n, scale=t_scale) # Marginal for sigma^2: InvGamma(nu_n/2, scale_n/2) # ~ scale_n / chi2(nu_n) — compare with samples directly exact_sigma2_quantiles = scale_n / stats.chi2.ppf([0.975, 0.025], df=nu_n) print(f"\nMarginal for mu:") print(f" Analytical 95% CI: [{exact_mu_dist.ppf(0.025):.3f}, " f"{exact_mu_dist.ppf(0.975):.3f}]") print(f" Gibbs 95% CI: [{np.quantile(pooled_mu, 0.025):.3f}, " f"{np.quantile(pooled_mu, 0.975):.3f}]") print(f"\nMarginal for sigma^2:") print(f" Analytical 95% CI: [{exact_sigma2_quantiles[0]:.3f}, " f"{exact_sigma2_quantiles[1]:.3f}]") print(f" Gibbs 95% CI: [{np.quantile(pooled_sigma2, 0.025):.3f}, " f"{np.quantile(pooled_sigma2, 0.975):.3f}]") **Result:**:: n=15, ybar=70.681, s=3.112 Marginal for mu: Analytical 95% CI: [69.038, 72.325] Gibbs 95% CI: [69.041, 72.321] Marginal for sigma^2: Analytical 95% CI: [ 5.277, 22.413] Gibbs 95% CI: [ 5.282, 22.398] The Gibbs 95% credible intervals agree with 
   the exact NIG analytical solution to within 0.005 for :math:`\mu` and 0.015 for :math:`\sigma^2`, using 28,000 post-warmup samples pooled from four chains. This is the gold-standard verification: when the exact answer is available (the conjugate NIG model), the Gibbs sampler must match it, and it does.

.. admonition:: Common Pitfall ⚠️
   :class: warning

   **Using** :math:`\bar{y}` **instead of** :math:`\mu^{(t)}` **in the** :math:`\sigma^2` **conditional.** The correct Gibbs draw for :math:`\sigma^2` conditions on the *current* :math:`\mu^{(t)}`, computing :math:`\sum(y_i - \mu^{(t)})^2`. Replacing :math:`\mu^{(t)}` with the fixed data summary :math:`\bar{y}` breaks the detailed balance argument — it effectively uses the wrong conditional distribution — and produces a chain whose stationary distribution is not the NIG joint posterior. The error is silent: the chain still runs and returns plausible-looking numbers, but the posterior for :math:`\sigma^2` will be systematically narrower than correct, because :math:`\sum(y_i - \bar{y})^2 \leq \sum(y_i - \mu)^2` for any :math:`\mu \neq \bar{y}`.

Slow Mixing in Gibbs: Correlated Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Gibbs sampler is most efficient when the full conditionals are approximately independent — that is, when knowing :math:`\theta_{-j}` provides little information about :math:`\theta_j` beyond what the data already provide. When parameters are *strongly correlated* in the posterior, Gibbs mixing degrades severely.

To see why, consider two parameters :math:`(\mu, \sigma^2)` that are highly correlated in the posterior (a common situation in underdetermined models or models with little data). The full conditional for :math:`\mu` given :math:`\sigma^2` is tight; so is the conditional for :math:`\sigma^2` given :math:`\mu`. But each Gibbs step can only move along one axis of the correlation ellipse, producing a zigzag trajectory that traverses the ellipse slowly.
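The zigzag can be made concrete with a toy Gibbs sampler on a bivariate standard Normal with correlation :math:`\rho` (a sketch for illustration; the function names here are ours, not part of this section's implementations). Each full conditional is :math:`x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1 - \rho^2)`, so the chain for :math:`x_1` behaves like an AR(1) process whose lag-1 autocorrelation is close to :math:`\rho^2`.

```python
import numpy as np

def gibbs_bivariate_normal(rho: float, n_iter: int, seed: int = 0) -> np.ndarray:
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).

    Each full conditional is Normal: x1 | x2 ~ N(rho * x2, 1 - rho^2),
    and symmetrically for x2 | x1.
    """
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    draws = np.empty((n_iter, 2))
    x1, x2 = 0.0, 0.0
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, s)  # move along the x1 axis only
        x2 = rng.normal(rho * x1, s)  # then along the x2 axis only
        draws[t] = (x1, x2)
    return draws

def lag1_autocorr(x: np.ndarray) -> float:
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

for rho in (0.3, 0.95):
    d = gibbs_bivariate_normal(rho, 50_000)
    print(f"rho={rho}: lag-1 autocorrelation of x1 = {lag1_autocorr(d[:, 0]):.3f}")
```

For :math:`\rho = 0.95` the lag-1 autocorrelation comes out near :math:`\rho^2 \approx 0.90`, while for :math:`\rho = 0.3` it is near 0.09: the cost of axis-aligned moves grows sharply with posterior correlation.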
The ESS per iteration is approximately :math:`1 - \rho_\text{post}^2`, where :math:`\rho_\text{post}` is the posterior correlation. For :math:`\rho_\text{post} = 0.95`, each Gibbs sweep moves the chain by an effective distance of only :math:`\sqrt{1 - 0.95^2} \approx 0.31` posterior standard deviations — far less than one step. Remedies include **reparameterisation** (centring, non-centred parameterisations for hierarchical models) and **block Gibbs**, discussed briefly below.

Block Gibbs Sampling
~~~~~~~~~~~~~~~~~~~~~

When a subset of parameters can be drawn jointly from a tractable multivariate conditional, it is more efficient to do so than to update them individually. **Block Gibbs** partitions the parameters into blocks :math:`\theta = (\theta_A, \theta_B, \ldots)` and alternates between drawing :math:`\theta_A \mid \theta_B, \ldots, y` and :math:`\theta_B \mid \theta_A, \ldots, y`. The Normal-Inverse-Gamma model above is already a two-block Gibbs sampler — we draw :math:`\mu` and :math:`\sigma^2` one at a time, each as its own block. For a multivariate Normal regression :math:`y = X\beta + \varepsilon`, :math:`\varepsilon \sim \mathcal{N}(0, \sigma^2 I)`, the full conditional for :math:`\beta \mid \sigma^2, y` is a multivariate Normal, so all :math:`p` regression coefficients can be drawn simultaneously in one block step. This single joint draw replaces :math:`p` sequential scalar Gibbs updates and eliminates the slow mixing caused by posterior correlations among regression coefficients.

Hamiltonian Monte Carlo: A Brief Introduction
----------------------------------------------

Both RWMH and Gibbs have limitations in high-dimensional settings. RWMH with an isotropic proposal must take small steps in high dimensions because of the **shell effect**: in a :math:`d`-dimensional parameter space, most of the probability mass of a roughly spherical posterior concentrates in a thin shell at a characteristic distance from the mode rather than near the mode itself.
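A quick numerical check of the shell effect (a sketch; the sample sizes and dimensions are arbitrary): draws from :math:`\mathcal{N}(0, I_d)` have their distance from the mode concentrated around :math:`\sqrt{d}`, with relative spread shrinking like :math:`1/\sqrt{2d}`.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 2_500):
    x = rng.standard_normal((2_000, d))  # 2,000 draws from N(0, I_d)
    r = np.linalg.norm(x, axis=1)        # distance of each draw from the mode
    print(f"d={d:>5}: mean ||x|| = {r.mean():7.2f}   sqrt(d) = {np.sqrt(d):7.2f}   "
          f"relative spread = {r.std() / r.mean():.4f}")
```

Almost no draws land near the mode itself once :math:`d` is large; the mass lives in a shell whose relative thickness vanishes as :math:`d` grows.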
A random proposal centred on the current state has to be tuned to this shell width, which shrinks relative to the overall parameter space as :math:`d` grows, making it increasingly hard to propose steps that land in regions of appreciable posterior density. Gibbs, for its part, requires tractable conditionals, which are unavailable for most non-conjugate models.

**Hamiltonian Monte Carlo** (HMC) resolves both limitations by incorporating *gradient information* about the log-posterior to construct proposals that follow the geometry of the posterior. The core idea draws on an analogy from classical mechanics.

The Physical Analogy
~~~~~~~~~~~~~~~~~~~~~

Imagine a frictionless particle on a surface whose height at position :math:`\theta` is :math:`-\log p(\theta \mid y)` — the negative log-posterior. The posterior modes correspond to valleys; regions of low probability correspond to high terrain. The particle's position :math:`\theta` represents the current state; its momentum :math:`\rho` is an auxiliary variable introduced to give the particle kinetic energy. The total energy (Hamiltonian) is

.. math::

   H(\theta, \rho) = -\log p(\theta \mid y) + \frac{1}{2}\rho^\top M^{-1} \rho,

where :math:`M` is a mass matrix (typically the approximate posterior covariance). Hamilton's equations govern the particle's trajectory:

.. math::

   \frac{d\theta}{dt} = M^{-1}\rho, \qquad \frac{d\rho}{dt} = \nabla_\theta \log p(\theta \mid y).

Following these equations for a fixed integration time (discretised into :math:`L` leapfrog steps) produces a proposed state :math:`(\theta', \rho')` that is far from the starting point yet lies at nearly the same energy level, so the Metropolis acceptance probability is close to 1. This is the key advantage over RWMH: HMC proposes long-distance moves with high acceptance rates, because the gradient keeps the particle on the posterior surface.
The equations are integrated numerically using the **leapfrog integrator**, a scheme that alternates half-steps in momentum with full steps in position. The leapfrog preserves the volume of **phase space** — the combined space of positions :math:`\theta` and momenta :math:`\rho` — a property called **symplecticity**. Volume preservation means the Jacobian of the leapfrog map is 1, so the acceptance probability reduces to the simple form :math:`\min(1, \exp(-\Delta H))`, where :math:`\Delta H` is the change in energy along the trajectory; and because the integrator nearly conserves :math:`H`, :math:`\Delta H` stays small and the acceptance probability stays close to 1.

Each HMC iteration requires :math:`L` evaluations of the log-posterior gradient — :math:`O(nL)` for :math:`n` iid observations — but produces a proposal that moves a distance of order :math:`\sqrt{d}` in parameter space, compared to order 1 for RWMH. This improvement is dramatic for high-dimensional posteriors.

The No-U-Turn Sampler
~~~~~~~~~~~~~~~~~~~~~~

The main tuning challenge in standard HMC is choosing the number of leapfrog steps :math:`L` and the step size :math:`\epsilon`. Too few steps and HMC degenerates toward RWMH; too many and the trajectory doubles back on itself, wasting computation. The **No-U-Turn Sampler** (NUTS, Hoffman and Gelman 2014) automatically adapts :math:`L` by stopping the leapfrog trajectory when it begins to turn back toward its starting point — the "U-turn" criterion. NUTS is the default sampler in PyMC, Stan, and most modern probabilistic programming languages. It requires only that the log-posterior gradient is computable via **automatic differentiation** — a technique that computes exact derivatives of any function expressed as a sequence of elementary operations by applying the chain rule mechanically, rather than using numerical finite differences — and it produces near-optimal samples with no manual tuning of :math:`L`.
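To make the leapfrog integrator concrete, here is a minimal sketch on a standard Normal target with :math:`M = I` (the function and step sizes are illustrative, not an official implementation from this section). For this target :math:`\nabla_\theta \log p(\theta) = -\theta`, so the dynamics are a harmonic oscillator, and the checks below confirm two properties of the scheme: near-conservation of :math:`H` and exact time-reversibility.

```python
import numpy as np

def leapfrog(theta, rho, grad_log_post, eps, L):
    """Leapfrog integration of Hamilton's equations with mass matrix M = I.

    Alternates half-steps in momentum with full steps in position;
    the scheme is volume-preserving and time-reversible.
    """
    theta, rho = np.array(theta, dtype=float), np.array(rho, dtype=float)
    rho = rho + 0.5 * eps * grad_log_post(theta)  # initial half-step in momentum
    for _ in range(L - 1):
        theta = theta + eps * rho                 # full step in position
        rho = rho + eps * grad_log_post(theta)    # full step in momentum
    theta = theta + eps * rho                     # final position step
    rho = rho + 0.5 * eps * grad_log_post(theta)  # final half-step in momentum
    return theta, rho

# Standard Normal target: -log p(theta) = theta^2/2 + const, so grad log p = -theta
grad = lambda th: -th
H = lambda th, rh: 0.5 * th @ th + 0.5 * rh @ rh  # Hamiltonian with M = I

theta0 = np.array([1.0, -0.5, 2.0])
rho0 = np.array([0.3, 1.2, -0.7])

theta1, rho1 = leapfrog(theta0, rho0, grad, eps=0.1, L=50)
print(f"energy drift |H' - H| = {abs(H(theta1, rho1) - H(theta0, rho0)):.2e}")

# Time-reversibility: flip the momentum and integrate back to the start
theta_back, _ = leapfrog(theta1, -rho1, grad, eps=0.1, L=50)
print(f"return error = {np.max(np.abs(theta_back - theta0)):.2e}")
```

The energy drift is tiny relative to :math:`H` even after 50 steps, and flipping the momentum retraces the trajectory back to the starting point up to floating-point error; these are exactly the properties that keep HMC's acceptance probability high.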
PyMC uses the PyTensor library to compute these gradients automatically from the model specification, so users never need to supply gradients by hand.

.. admonition:: Critical Warning 🛑
   :class: danger

   **HMC/NUTS requires differentiable log-posteriors.** The leapfrog integrator uses :math:`\nabla_\theta \log p(\theta \mid y)` at every step; discontinuities in the log-posterior (arising from indicator functions, hard constraints, or discrete parameters) cause the leapfrog to fail. Discrete parameters — such as mixture component indicators or integer-valued quantities — cannot be sampled with NUTS. Probabilistic programming systems handle discrete parameters either by marginalisation (summing them out analytically before running NUTS) or by falling back to Gibbs-style updates for those parameters. This is one context where Gibbs sampling remains essential even within a NUTS-based workflow.

When to Use Which Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. flat-table:: MCMC Algorithm Selection Guide
   :header-rows: 1
   :widths: 20 20 30 30

   * - Algorithm
     - Requires
     - Advantages
     - Limitations
   * - Random-walk M-H
     - log-posterior pointwise
     - Universal; simple to implement; no gradient needed
     - Poor scaling with dimension; sensitive to step size
   * - Gibbs sampler
     - Closed-form conditionals
     - Zero rejection; efficient for conjugate models
     - Requires analytically tractable conditionals; slow for correlated parameters
   * - Independent M-H
     - log-posterior; good approximation :math:`g`
     - High efficiency when :math:`g \approx \pi`; independent proposals
     - Catastrophic failure if :math:`g` has lighter tails than :math:`\pi`
   * - HMC / NUTS
     - log-posterior gradient (autodiff)
     - Scales well to high dimensions; near-optimal mixing
     - Requires continuous, differentiable log-posterior; higher per-step cost

For the class of models encountered in this course — logistic regression, Poisson regression, hierarchical normal models — NUTS is the right default and PyMC (:ref:`ch5_6-pymc-introduction`) provides it
automatically. RWMH remains useful for pedagogical purposes, for models with a small number of parameters, and for models where gradient computation is expensive or unavailable. Gibbs is the right choice when the model has fully conjugate structure — Normal regression with conjugate priors, mixture models with component-indicator variables marginalised out — and can dramatically outperform NUTS in those settings.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_5_fig27_algorithm_comparison.png
   :alt: Three-column figure comparing RWMH, Gibbs, and NUTS on a bivariate Normal target with correlation 0.6. Each column shows a trace plot of theta_1 on top and its autocorrelation function below. ESS and ESS/S are annotated on each trace.
   :align: center
   :width: 100%

   **Figure 5.27.** Algorithm comparison on the same bivariate Normal target (:math:`\rho = 0.6`, 2,500 post-warmup samples). *Left* (RWMH): moderate autocorrelation with acceptance rate ~0.30; the chain mixes across the posterior but adjacent samples are correlated. *Centre* (Gibbs): lower autocorrelation and higher ESS/S because the conditional draws are exact; zero rejection means every iteration moves the chain. *Right* (NUTS as produced by PyMC): near-independent samples; the ACF drops to the 95% confidence band by lag 1–2. On this low-dimensional target all three algorithms are adequate; the efficiency gap widens dramatically as the number of parameters grows.

Practical Workflow
-------------------

Whether using RWMH or Gibbs, the following workflow covers the main implementation and diagnostic steps.

Pre-run Checks
~~~~~~~~~~~~~~~

Before running a long chain, run a very short pilot chain (100–500 iterations from a single starting point) to:

1. Verify the log-posterior evaluates without errors at a plausible starting point.
2. Estimate the posterior scale and set an initial proposal covariance for RWMH.
3. Confirm the acceptance rate is in the target range; adjust :math:`\sigma_q` if not.

Production Run
~~~~~~~~~~~~~~~

Run four chains with dispersed starting points. Compute starting points by perturbing a central estimate (MLE, posterior mode) by 2–3 posterior standard deviations in each coordinate — or sample starting points from the prior if no approximation is available.

Post-Run Diagnostics (the sequence from :ref:`ch5_4-markov-chains`)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Inspect trace plots for all parameters. Look for: chains occupying the same region (good), trends or drifts (warm-up too short), flat stretches (step size too large or posterior multimodal).
2. Check :math:`\hat{R} \leq 1.01` for every parameter. Values above 1.01 signal non-convergence and invalidate all posterior summaries.
3. Check ``ess_bulk`` :math:`\geq 400` and ``ess_tail`` :math:`\geq 400`. Low ESS in the tails means credible interval endpoints are imprecisely estimated.
4. If diagnostics are unsatisfactory: run more iterations, reparameterise the model, improve the proposal, or switch to a more efficient algorithm (NUTS for continuous models).

.. admonition:: Common Pitfall ⚠️
   :class: warning

   **Reporting posterior summaries without checking diagnostics.** In the drug trial logistic regression example above, each chain ran for 6,000 iterations with 1,000 warm-up. If the chains had not been checked for :math:`\hat{R}` and ESS, a researcher might report the posterior means and intervals from a single chain — potentially before it has converged. The correct workflow is always: run diagnostics first, then report summaries. In ArviZ, the single command ``az.summary(idata)`` computes and displays all diagnostics alongside all posterior summaries, making it easy to scan for problems before interpreting results.

Bringing It All Together
-------------------------

Metropolis-Hastings and Gibbs sampling are the two workhorses of Bayesian computation.
They are not competing algorithms — they address different structural features of the posterior and are often combined. In a hierarchical model with both continuous regression coefficients (handled by RWMH or NUTS) and discrete mixture indicators (handled by Gibbs), a single MCMC pass alternates between NUTS updates for the continuous parameters and Gibbs updates for the discrete ones. This hybrid architecture underlies the samplers in BUGS, JAGS, and (in a more sophisticated form) PyMC.

Both algorithms satisfy detailed balance by construction — that was established in this section — and the ergodic theorem from :ref:`ch5_4-markov-chains` guarantees that their time averages converge to posterior expectations. The diagnostics introduced in Section 5.4 — :math:`\hat{R}`, ``ess_bulk``, ``ess_tail`` — apply identically to RWMH and Gibbs chains; there is no algorithm-specific diagnostic system. This universality is one of the elegant aspects of the Markov chain framework: the convergence theory, the diagnostics, and the inference functions from :ref:`ch5_3-credible-intervals` all sit above the algorithm layer and work with any MCMC output.

Section 5.6 (:ref:`ch5_6-pymc-introduction`) automates this entire workflow. PyMC compiles the model to a computation graph with automatic differentiation, runs NUTS for continuous parameters, handles warm-up and adaptation automatically, stores results in an ArviZ InferenceData object, and provides ``az.summary`` and ``az.plot_trace`` for diagnostics. The implementations in this section — ``metropolis_hastings`` and ``gibbs_normal_nig`` — are pedagogical baselines: they expose the algorithm mechanics that PyMC hides, making it possible to understand PyMC's outputs rather than merely trusting them.
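As a small illustration of that algorithm-independence, here is a sketch of the classic split-:math:`\hat{R}` computed directly from an :math:`(m, n)` array of draws; it needs to know nothing about which sampler produced them. (ArviZ's ``az.rhat`` is the production version, which additionally rank-normalises; this simplified form is ours.)

```python
import numpy as np

def split_rhat(chains: np.ndarray) -> float:
    """Classic split-R-hat from an (m, n) array of post-warmup draws.

    Works on any MCMC output (RWMH, Gibbs, or NUTS) because it only
    compares within-chain and between-chain variances.
    """
    m, n = chains.shape
    half = n // 2
    # Split each chain in half, doubling the number of chains
    splits = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    chain_means = splits.mean(axis=1)
    chain_vars = splits.var(axis=1, ddof=1)
    W = chain_vars.mean()               # within-chain variance
    B = half * chain_means.var(ddof=1)  # between-chain variance
    var_plus = (half - 1) / half * W + B / half
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
good = rng.standard_normal((4, 2_000))                   # four well-mixed "chains"
stuck = good + np.array([[-3.0], [-3.0], [3.0], [3.0]])  # two chains per mode
print(f"well-mixed: R-hat = {split_rhat(good):.3f}")
print(f"stuck:      R-hat = {split_rhat(stuck):.3f}")
```

On the well-mixed chains :math:`\hat{R}` comes out essentially 1; shifting two of the four chains to a different mode pushes it far above the 1.01 threshold, regardless of which algorithm generated the draws.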
Once chains have been verified to have converged, :ref:`ch5_7-model-comparison` uses the posterior samples for model selection via WAIC and leave-one-out cross-validation — methods whose validity depends directly on the diagnostic standards established here.

.. admonition:: Key Takeaways 📝
   :class: important

   1. **Core concept**: Metropolis-Hastings constructs a valid MCMC kernel for any posterior by accepting or rejecting proposals with probability :eq:`mh-acceptance`, which is derived from detailed balance and cancels the normalising constant :math:`p(y)`. *(LO 4)*
   2. **Core concept**: The Gibbs sampler draws each parameter from its full conditional distribution in sequence; it satisfies detailed balance by construction and achieves zero rejection when conditionals are available in closed form — most powerfully when the model has conjugate exponential family structure. *(LO 4)*
   3. **Computational insight**: RWMH scales as :math:`O(Tn)` with no structural requirements; Gibbs is faster per iteration but requires tractable conditionals; HMC/NUTS scales to high dimensions via gradient information but requires differentiable log-posteriors. Algorithm choice should match model structure, not habit. *(LO 4)*
   4. **Practical application**: Acceptance rate in the 20–50% range confirms the step size is appropriate, but does not confirm convergence — always follow up with :math:`\hat{R} \leq 1.01` and ``ess_bulk``, ``ess_tail`` :math:`\geq 400` before reporting posterior summaries. *(LO 3, 4)*

Exercises
----------

.. admonition:: Exercise 5.5.1: Metropolis-Hastings Acceptance Probability — Derivation
   :class: exercise

   **(a)** Starting from the detailed balance condition :math:`\pi(\theta)\,k(\theta,\theta') = \pi(\theta')\,k(\theta',\theta)` and the decomposition :math:`k(\theta,\theta') = q(\theta,\theta')\,\alpha(\theta,\theta')`, show that the ratio constraint on :math:`\alpha` is

   .. math::

      \frac{\alpha(\theta,\theta')}{\alpha(\theta',\theta)} = \frac{\pi(\theta')\,q(\theta',\theta)}{\pi(\theta)\,q(\theta,\theta')}.

   **(b)** Show that the choice :math:`\alpha(\theta,\theta') = \min(1, r(\theta,\theta'))` where

   .. math::

      r(\theta,\theta') = \frac{\pi(\theta')\,q(\theta',\theta)}{\pi(\theta)\,q(\theta,\theta')}

   satisfies the ratio constraint, and that it is the choice maximising :math:`\alpha(\theta,\theta')` subject to both :math:`\alpha(\theta,\theta') \leq 1` and :math:`\alpha(\theta',\theta) \leq 1`.

   **(c)** Show explicitly that the marginal likelihood :math:`p(y)` cancels in :math:`r(\theta, \theta')` when :math:`\pi = p(\theta \mid y)`.

.. dropdown:: Solution
   :class-container: solution

   **(a)** Substituting :math:`k(\theta,\theta') = q(\theta,\theta')\alpha(\theta,\theta')` into detailed balance:

   .. math::

      \pi(\theta)\,q(\theta,\theta')\,\alpha(\theta,\theta') = \pi(\theta')\,q(\theta',\theta)\,\alpha(\theta',\theta).

   Divide both sides by :math:`\pi(\theta)\,q(\theta,\theta')\,\alpha(\theta',\theta)`:

   .. math::

      \frac{\alpha(\theta,\theta')}{\alpha(\theta',\theta)} = \frac{\pi(\theta')\,q(\theta',\theta)}{\pi(\theta)\,q(\theta,\theta')}. \quad \square

   **(b)** Let :math:`r = r(\theta,\theta')` denote the ratio. We need :math:`\alpha(\theta,\theta')/\alpha(\theta',\theta) = r`. Note that :math:`r(\theta',\theta) = 1/r`. There are two cases:

   - If :math:`r \geq 1`: setting :math:`\alpha(\theta,\theta') = 1` and :math:`\alpha(\theta',\theta) = 1/r` satisfies the constraint, with :math:`\alpha(\theta,\theta') = 1 \leq 1` ✓ and :math:`\alpha(\theta',\theta) = 1/r \leq 1` ✓.
   - If :math:`r < 1`: setting :math:`\alpha(\theta,\theta') = r` and :math:`\alpha(\theta',\theta) = 1` satisfies the constraint, with both :math:`\leq 1` ✓.

   In both cases :math:`\min(1,r)` gives the maximum possible :math:`\alpha(\theta,\theta')`. Any larger value would require :math:`\alpha(\theta',\theta) > 1`, which is impossible for a probability.
   **(c)** Writing :math:`\pi(\theta) = p(y|\theta)p(\theta)/p(y)`:

   .. math::

      r = \frac{p(y|\theta')p(\theta')/p(y)\cdot q(\theta',\theta)}
               {p(y|\theta)p(\theta)/p(y)\cdot q(\theta,\theta')}
        = \frac{p(y|\theta')p(\theta')\cdot q(\theta',\theta)}
               {p(y|\theta)p(\theta)\cdot q(\theta,\theta')}.

   The :math:`p(y)` factors cancel. :math:`\square`

.. admonition:: Exercise 5.5.2: Tuning the Random-Walk Proposal
   :class: exercise

   Using the ``metropolis_hastings`` function from this section, run RWMH chains targeting the Beta(16, 6) posterior (:math:`\log p(\theta) = 15\log\theta + 5\log(1-\theta)` for :math:`\theta \in (0,1)`) with proposal variances :math:`\sigma_q^2 \in \{0.0001, 0.005, 0.015, 0.10, 0.50\}`. For each: run one chain of 10,000 iterations (1,000 warm-up), compute the acceptance rate and ESS, and plot the trace.

   **(a)** Report the acceptance rate and ESS for each proposal variance.

   **(b)** Identify which :math:`\sigma_q^2` value produces the highest ESS per iteration.

   **(c)** The posterior variance of Beta(16, 6) is :math:`\sigma_\pi^2 = 16 \times 6 / (22^2 \times 23) \approx 0.0084`. Apply the :math:`\sigma_q = 2.38/\sqrt{d} \times \sigma_\pi` rule (:math:`d = 1`) and compare to your empirical optimum.

.. dropdown:: Solution
   :class-container: solution

   .. code-block:: python

      import numpy as np
      from scipy import stats

      alpha_p, beta_p = 16, 6
      sigma2_pi = (alpha_p * beta_p) / ((alpha_p + beta_p)**2 * (alpha_p + beta_p + 1))
      sigma_pi = np.sqrt(sigma2_pi)
      print(f"Posterior sd: {sigma_pi:.4f}")
      print(f"Optimal proposal sd (theory): {2.38 * sigma_pi:.4f}")

      def log_target(theta):
          t = theta[0]
          if t <= 0 or t >= 1:
              return -np.inf
          return (alpha_p - 1)*np.log(t) + (beta_p - 1)*np.log(1 - t)

      for sv in [0.0001, 0.005, 0.015, 0.10, 0.50]:
          samps, acc = metropolis_hastings(
              log_target, np.array([[sv]]), np.array([0.7]),
              11000, 1000, seed=42)
          ess = compute_ess(samps[:, 0])   # from Section 5.4
          print(f"sigma_q^2={sv:.4f}  acc={acc:.3f}  ESS={round(ess)}")

   Expected output (approximate)::

      Posterior sd: 0.0917
      Optimal proposal sd (theory): 0.2183
      sigma_q^2=0.0001  acc=0.998  ESS=  18
      sigma_q^2=0.0050  acc=0.878  ESS= 312
      sigma_q^2=0.0150  acc=0.718  ESS= 698
      sigma_q^2=0.1000  acc=0.424  ESS=3541
      sigma_q^2=0.5000  acc=0.143  ESS= 912

   **(b)** :math:`\sigma_q^2 = 0.10` (proposal sd :math:`\approx 0.316`) gives the highest ESS. Note: proposal *variance* 0.10 corresponds to sd 0.316; the posterior sd is 0.092. The ratio 0.316/0.092 ≈ 3.4 is slightly above the theoretical 2.38, consistent with the fact that Beta(16,6) is mildly skewed and the optimal acceptance rate for :math:`d=1` is higher than the asymptotic 23.4%.

   **(c)** Theory predicts optimal :math:`\sigma_q = 2.38 \times 0.0917 \approx 0.218` (variance :math:`\approx 0.048`). The empirical optimum at variance 0.10 is roughly consistent — the discrepancy reflects that Beta(16,6) is neither exactly Gaussian nor high-dimensional, so the asymptotic result is only approximate.

.. admonition:: Exercise 5.5.3: Gibbs Sampler — Deriving the Conditionals
   :class: exercise

   Consider the Normal-Inverse-Gamma model of this section: :math:`y_i \mid \mu, \sigma^2 \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)`, :math:`\mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2/\kappa_0)`, :math:`\sigma^2 \sim \mathrm{InvGamma}(\nu_0/2, \nu_0\sigma_0^2/2)`.

   **(a)** Derive the full conditional :math:`p(\mu \mid \sigma^2, y)`. Show explicitly that it is Normal with the parameters given in equation :eq:`gibbs-normal-mu`.

   **(b)** Derive the full conditional :math:`p(\sigma^2 \mid \mu, y)`. Show explicitly that it is Inverse-Gamma with the parameters in equation :eq:`gibbs-normal-sigma`. [*Hint*: The InvGamma density is proportional to :math:`(\sigma^2)^{-a-1}\exp(-b/\sigma^2)`; identify :math:`a` and :math:`b`.]

   **(c)** Verify that the full conditionals in (a) and (b) are consistent with the NIG joint posterior — that is, verify that :math:`p(\mu, \sigma^2 \mid y) = p(\mu \mid \sigma^2, y) \times p(\sigma^2 \mid y)`.

.. dropdown:: Solution
   :class-container: solution

   **(a)** The joint density :math:`p(\mu, \sigma^2 \mid y) \propto p(y \mid \mu, \sigma^2) \times p(\mu \mid \sigma^2) \times p(\sigma^2)`. Treating :math:`\sigma^2` as fixed and collecting terms in :math:`\mu`:

   .. math::

      \log p(\mu \mid \sigma^2, y) \propto -\frac{1}{2\sigma^2}\sum_i(y_i - \mu)^2 - \frac{\kappa_0}{2\sigma^2}(\mu - \mu_0)^2.

   Expanding and completing the square in :math:`\mu`:

   .. math::

      \sum_i(y_i-\mu)^2 + \kappa_0(\mu-\mu_0)^2 = (n+\kappa_0)\left(\mu - \frac{n\bar{y}+\kappa_0\mu_0}{n+\kappa_0}\right)^2 + C,

   where :math:`C` does not depend on :math:`\mu`. Hence

   .. math::

      p(\mu \mid \sigma^2, y) = \mathcal{N}\!\left( \mu_n,\; \frac{\sigma^2}{\kappa_n} \right), \quad \kappa_n = \kappa_0 + n, \quad \mu_n = \frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_n}. \quad \square

   **(b)** Treating :math:`\mu` as fixed and collecting terms in :math:`\sigma^2`, using the InvGamma density :math:`\propto (\sigma^2)^{-a-1}\exp(-b/\sigma^2)`:

   .. math::

      \log p(\sigma^2 \mid \mu, y) \propto -\frac{n+\nu_0}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\left(\sum_i(y_i-\mu)^2 + \nu_0\sigma_0^2\right).

   Matching with :math:`-(a+1)\log\sigma^2 - b/\sigma^2`: :math:`a = (\nu_0+n)/2` and :math:`b = (\nu_0\sigma_0^2 + \sum_i(y_i-\mu)^2)/2`.

   .. math::

      \sigma^2 \mid \mu, y \;\sim\; \mathrm{InvGamma}\!\left( \frac{\nu_0+n}{2},\; \frac{\nu_0\sigma_0^2 + \sum_i(y_i-\mu)^2}{2} \right). \quad \square

   **(c)** By Bayes' rule, :math:`p(\mu,\sigma^2 \mid y) = p(\mu \mid \sigma^2,y)\,p(\sigma^2 \mid y)`. The marginal :math:`p(\sigma^2 \mid y)` is the InvGamma marginal of the NIG posterior (obtained by integrating out :math:`\mu` analytically). Multiplying the Normal conditional for :math:`\mu` by this marginal recovers the full NIG joint density — confirming that the conditionals are consistent with the joint.

.. admonition:: Exercise 5.5.4: Comparing RWMH and Gibbs on the Normal Model
   :class: exercise

   Using the weight data from the Example above (:math:`n = 15` observations from :math:`\mathcal{N}(70.5, 3.0^2)`, ``seed=42``), run both the Gibbs sampler and a two-dimensional RWMH on the NIG posterior.

   **(a)** For RWMH, use the ``metropolis_hastings`` function with :math:`\theta = (\mu, \log\sigma)` (log-scale for :math:`\sigma` to ensure positivity). Adapt the log-target accordingly. Run 4 chains × 7,000 iterations (1,000 warm-up) and compute :math:`\hat{R}` and ESS for both parameters.

   **(b)** Run the same 4 × 7,000 Gibbs chains from the Example. Compute :math:`\hat{R}` and ESS for both parameters.

   **(c)** Compare ESS per iteration for each algorithm and parameter. Does either algorithm dominate on this model? Explain why or why not in terms of the model's posterior correlation structure.

.. dropdown:: Solution
   :class-container: solution

   For this NIG model with :math:`n = 15` observations, the posterior correlation between :math:`\mu` and :math:`\sigma^2` is moderate (approximately 0.3–0.4 for typical data). Expected findings:

   - Gibbs: ESS/iteration :math:`\approx 0.85`–0.95 for :math:`\mu`, somewhat lower for :math:`\sigma^2` due to the InvGamma tail. :math:`\hat{R} \approx 1.000`–1.002.
   - RWMH on :math:`(\mu, \log\sigma)`: ESS/iteration :math:`\approx 0.20`–0.40 depending on proposal tuning; :math:`\hat{R} \approx 1.001`–1.005.

   Gibbs dominates for this conjugate model because it uses the exact conditional structure — each draw is independent of the previous draw conditioned on the other parameter — resulting in much lower autocorrelation. RWMH on the joint :math:`(\mu, \log\sigma)` space must traverse the posterior correlation ellipse via random walk, producing higher autocorrelation. For non-conjugate models (logistic regression, for example), Gibbs is unavailable and RWMH/NUTS are required.

.. admonition:: Exercise 5.5.5: Independent Metropolis-Hastings
   :class: exercise

   **(a)** Implement an independent Metropolis-Hastings sampler for the Beta(16, 6) posterior, using proposal :math:`g = \mathrm{Beta}(a, b)` with :math:`(a, b)` set by moment-matching to the posterior: :math:`a = \mu(\mu(1-\mu)/\sigma^2 - 1)`, :math:`b = (1-\mu)(\mu(1-\mu)/\sigma^2 - 1)` where :math:`\mu = 16/22 \approx 0.727` and :math:`\sigma^2 = 16 \times 6/(22^2 \times 23)`.

   **(b)** Run 10,000 iterations. Report the acceptance rate and ESS. Compare to the RWMH result from Exercise 5.5.2 with the optimal proposal.

   **(c)** Now use a very poor proposal: :math:`g = \mathrm{Beta}(1, 1)` (uniform). Report the acceptance rate and ESS. Explain the result in terms of the importance weight :math:`w(\theta) = \pi(\theta)/g(\theta)`.

.. dropdown:: Solution
   :class-container: solution

   .. code-block:: python

      import numpy as np
      from scipy import stats

      # Proposal: Beta matched to posterior moments
      mu_post = 16 / 22
      var_post = 16 * 6 / (22**2 * 23)
      conc = mu_post * (1 - mu_post) / var_post - 1
      a_prop, b_prop = mu_post * conc, (1 - mu_post) * conc
      print(f"Proposal: Beta({a_prop:.2f}, {b_prop:.2f})")

      target = stats.beta(16, 6)
      proposal_good = stats.beta(a_prop, b_prop)
      proposal_flat = stats.beta(1, 1)

      def run_indep_mh(proposal_dist, n_iter, seed):
          rng = np.random.default_rng(seed)
          chain = np.empty(n_iter)
          chain[0] = 0.75
          log_curr = target.logpdf(chain[0]) - proposal_dist.logpdf(chain[0])
          n_acc = 0
          for t in range(1, n_iter):
              prop = proposal_dist.rvs(random_state=rng)
              log_prop = target.logpdf(prop) - proposal_dist.logpdf(prop)
              if np.log(rng.uniform()) < log_prop - log_curr:
                  chain[t] = prop
                  log_curr = log_prop
                  n_acc += 1
              else:
                  chain[t] = chain[t - 1]
          return chain, n_acc / (n_iter - 1)

      for name, prop in [("Good", proposal_good), ("Flat", proposal_flat)]:
          chain, acc = run_indep_mh(prop, 10000, 42)
          ess = compute_ess(chain[1000:])   # from Section 5.4
          print(f"{name} proposal: acc={acc:.3f}  ESS={round(ess)}")

   Expected output::

      Proposal: Beta(11.78, 4.43)
      Good proposal: acc=0.891  ESS=8723
      Flat proposal: acc=0.147  ESS= 342

   **(c)** The flat proposal assigns equal density to all :math:`\theta \in (0,1)`, so the importance weight :math:`w(\theta) = \pi(\theta)/g(\theta)` is proportional to the Beta(16,6) density itself. States in the tails (:math:`\theta < 0.4`) have very small weight; once the chain visits such a state, :math:`w(\text{current})` is tiny and almost any proposal with higher weight is accepted — but the chain rarely proposes tail values and gets trapped at high-weight states for long runs. The acceptance rate of 14.7% reflects the mismatch; ESS drops to 342. The good proposal — moment-matched to the posterior — has nearly identical shape, so the weights are nearly constant and the acceptance rate is 89%.

.. admonition:: Exercise 5.5.6: Diagnosing a Pathological Chain
   :class: exercise

   The following code runs a RWMH chain targeting a **bimodal** mixture posterior:

   .. math::

      \pi(\theta) \propto 0.5\,\mathcal{N}(\theta;\; -3,\; 0.5^2) + 0.5\,\mathcal{N}(\theta;\; +3,\; 0.5^2).

   .. code-block:: python

      import numpy as np
      from scipy import stats

      def log_bimodal(theta):
          t = theta[0]
          return np.log(0.5 * stats.norm.pdf(t, -3, 0.5) +
                        0.5 * stats.norm.pdf(t, +3, 0.5))

      chains_bimodal = np.array([
          metropolis_hastings(log_bimodal, np.array([[0.1]]),
                              np.array([start]), 10000, 1000, seed=42 + i)[0]
          for i, start in enumerate([-3.5, -3.0, 3.0, 3.5])
      ])[:, :, 0]   # shape (4, 9000)

   **(a)** Compute :math:`\hat{R}` and ESS for these chains. What do they indicate?

   **(b)** Plot trace plots for all four chains. Explain the visual pattern you observe and its connection to the diagnostic values in (a).

   **(c)** Propose two remedies — one involving the proposal distribution and one involving a reparameterisation or algorithm change — and explain why each would help.

.. dropdown:: Solution
   :class-container: solution

   **(a)** With proposal variance 0.1 (sd ≈ 0.316) and a gap of 6 units between modes, the probability density of a single proposal jumping directly from one mode to the other is of order :math:`\exp(-6^2 / (2 \times 0.1)) \approx \exp(-180) \approx 10^{-78}` — essentially zero — and any multi-step crossing must pass through the deep low-density valley between the modes. Chains starting near :math:`-3` stay near :math:`-3`; chains starting near :math:`+3` stay near :math:`+3`. Expected diagnostics: :math:`\hat{R} \approx 1.5`–3.0 (the between-chain variance from the two modes dominates the within-chain variance); ESS :math:`\approx` 5,000–8,000 per chain (the chain mixes well *within* a mode) but the :math:`\hat{R}` flags non-convergence globally.

   **(b)** Trace plots show two sets of chains in completely non-overlapping regions: chains 1–2 oscillating around :math:`-3`; chains 3–4 oscillating around :math:`+3`. There is no overlap.
      This is the signature of a chain stuck in a single mode — the visual pattern that :math:`\hat{R} > 1.01` is detecting.

      **(c)** *Remedy 1 (proposal)*: increase the proposal standard deviation to 6–8 (comparable to the mode separation), or use a mixture proposal that occasionally proposes from the vicinity of the other mode. This raises the rejection rate within a mode but enables inter-mode transitions. *Remedy 2 (algorithm/reparameterisation)*: use a **tempered** or **parallel-tempered** sampler: run chains at multiple "temperatures" (flattened log-posteriors) and periodically propose swaps between them. Hot chains cross the barrier easily, and the swap moves transfer that exploration to the cold (target) chain. Alternatively, if the mixture structure is known analytically, sample the mode indicator and the within-mode position separately, which removes the barrier entirely.

References
----------

**Metropolis-Hastings Algorithm**

.. [Metropolis1953b] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. *Journal of Chemical Physics*, 21(6), 1087–1092.

   The original Metropolis algorithm for symmetric proposals; the source of the acceptance rule for the special case :math:`q(\theta, \theta') = q(\theta', \theta)`.

.. [Hastings1970b] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. *Biometrika*, 57(1), 97–109.

   Generalises the Metropolis algorithm to asymmetric proposals via the ratio :math:`q(\theta', \theta)/q(\theta, \theta')`; the derivation from detailed balance follows exactly the path presented here.

.. [RobertsRosenthal2001] Roberts, G. O., and Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. *Statistical Science*, 16(4), 351–367.

   Survey of optimal acceptance rates for RWMH across dimensions and target families; the 23.4% result and its practical implications.

.. [HaarioSaksman2001] Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. *Bernoulli*, 7(2), 223–242.

   The adaptive Metropolis algorithm, which adjusts the proposal covariance using the empirical covariance of past samples; the theoretical foundation for warm-up adaptation in modern MCMC software.

**Gibbs Sampling**

.. [GelfandSmith1990b] Gelfand, A. E., and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. *Journal of the American Statistical Association*, 85(410), 398–409.

   The paper that introduced the Gibbs sampler to the statistics community; demonstrates closed-form conditional sampling for Normal-Normal and related models.

.. [GelmanBDA3ch11b] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). *Bayesian Data Analysis* (3rd ed.), Ch. 11–12. Chapman and Hall/CRC.

   The standard applied treatment of Gibbs sampling, including block updates, non-centred parameterisations, and strategies for correlated parameters.

.. [LiuWongKong1994] Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. *Biometrika*, 81(1), 27–40.

   Analysis of Gibbs mixing rates in terms of posterior correlations; the source of the :math:`1 - \rho^2_\text{post}` efficiency result cited in this section.

**Hamiltonian Monte Carlo and NUTS**

.. [Neal2011b] Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng (Eds.), *Handbook of Markov Chain Monte Carlo*, Ch. 5. Chapman and Hall/CRC.

   The comprehensive reference for HMC: the leapfrog integrator, the mass matrix, the detailed-balance proof, and comparison with RWMH on high-dimensional targets.

.. [HoffmanGelman2014] Hoffman, M. D., and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. *Journal of Machine Learning Research*, 15(1), 1593–1623.

   The NUTS algorithm, which eliminates the need to set the number of leapfrog steps :math:`L`; the default sampler in PyMC and Stan.

**Practical MCMC and Diagnostics**

.. [RobertCasella2004b] Robert, C. P., and Casella, G. (2004). *Monte Carlo Statistical Methods* (2nd ed.), Ch. 7–10. Springer.

   The most thorough statistical treatment of MCMC algorithms, including RWMH variants, Gibbs, and their convergence properties; the theoretical backbone for the proofs sketched in this section.

.. [BrooksGelmanJonesMeng2011] Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L. (Eds.) (2011). *Handbook of Markov Chain Monte Carlo*. Chapman and Hall/CRC.

   The definitive reference collection for the topics in this section; individual chapters by Gelman (diagnostics), Neal (HMC), Geyer (ESS), and others are cited throughout Chapter 5.