.. _ch5_8-hierarchical-models:

=================================================================
Section 5.8 Hierarchical Models and Partial Pooling
=================================================================

Every model in this chapter so far has treated its data as a single undifferentiated group.
The Beta-Binomial model in :ref:`ch5_2-prior-distributions` estimated one probability
:math:`\theta` from one sequence of trials. The logistic regression in
:ref:`ch5_6-pymc-introduction` estimated one coefficient vector :math:`\boldsymbol{\beta}` from
one dataset. But data in the real world are rarely so uniform. Exam scores come from different
instructors. Clinical outcomes come from different hospitals. Species counts come from different
ecosystems. In each case you have *multiple related groups*, and you face a choice: should you
treat the groups as identical, treat them as completely unrelated, or try to learn both what they
share and how they differ?

Hierarchical models formalise the third option. They place a *distribution over group-level
parameters*, so that each group gets its own estimate but that estimate is informed by the data
from every other group. The result is called **partial pooling**, and it is one of the most
practically useful ideas in applied statistics. Groups with abundant data are left largely alone;
groups with sparse data are pulled toward the overall pattern. The amount of pulling is not
chosen by the analyst — it is learned from the data.

This section develops hierarchical models from motivation through implementation. We begin with
the three pooling strategies and their tradeoffs (§5.8.1), formalise the mathematical structure
(§5.8.2), work through the canonical Eight Schools problem including the geometry that makes
naive MCMC fail (§5.8.3), demonstrate the same hierarchical idea with a categorical likelihood
(§5.8.4), and close the loop with :ref:`ch5_7-model-comparison` by comparing pooling strategies
quantitatively using LOO (§5.8.5). A practical decision guide (§5.8.6) concludes the chapter's
final section.

.. admonition:: Road Map 🧭
   :class: important

   • **Understand**: Why partial pooling adaptively balances bias and variance across groups, and
     what exchangeability means as a modeling assumption
   • **Develop**: Facility with hierarchical model specification, the centred/non-centred
     parameterisation trade-off, and the geometry of Neal's funnel
   • **Implement**: Hierarchical Normal and Dirichlet-Multinomial models in PyMC, including
     non-centred reparameterisation, forest plots, and LOO model comparison
   • **Evaluate**: When hierarchical structure improves inference, how to diagnose funnel
     pathology, and how to compare pooling strategies quantitatively


.. _pooling-problem:

The Pooling Problem
-------------------

Before introducing any machinery, consider a concrete scenario. Five instructors each teach the
same introductory statistics course to sections of 10–20 students. At the end of the semester,
every student takes the same final exam. You want to estimate each instructor's true average exam
score :math:`\mu_i`. Three strategies present themselves.

Complete Pooling
~~~~~~~~~~~~~~~~

Ignore the instructor variable entirely. Pool all students into a single group, fit one overall
mean :math:`\mu`, and assign that estimate to every instructor:

.. math::

   y_{ij} \mid \mu, \sigma \sim \text{Normal}(\mu, \sigma), \quad
   \mu \sim \text{Normal}(50, 25), \quad
   \sigma \sim \text{HalfNormal}(15)

where :math:`y_{ij}` is the exam score for student :math:`j` in instructor :math:`i`'s section,
and a single :math:`\mu` is shared across all five sections. This model is fast and yields narrow
intervals because it uses all the data, but it is unfair: instructors who are genuinely different
get averaged away. If one instructor's true mean is 85 and another's is 65, both receive the
same estimate of roughly 75. Complete pooling trades *bias* for *precision*.

No Pooling
~~~~~~~~~~

Fit a separate, independent model to each instructor's section:

.. math::

   y_{ij} \mid \mu_i, \sigma_i \sim \text{Normal}(\mu_i, \sigma_i), \quad
   \mu_i \sim \text{Normal}(50, 25), \quad
   \sigma_i \sim \text{HalfNormal}(15)

for :math:`i = 1, \ldots, 5`. Now each instructor gets an unbiased estimate of their own mean.
The cost is *variance*: an instructor with only 8 students gets a wide credible interval because
the model has no way to borrow information from the other 60 students in the dataset. No pooling
treats every group as if it has nothing in common with any other group. This is inefficient when
the groups genuinely are related, and it cannot make predictions for a new instructor (one not in
the training data) because new groups have no parameters at all.

Partial Pooling (Hierarchical)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Model the instructor-level means as draws from a common distribution:

.. math::

   \mu_i \mid \mu, \tau &\sim \text{Normal}(\mu, \tau), \quad i = 1, \ldots, J \\
   y_{ij} \mid \mu_i, \sigma &\sim \text{Normal}(\mu_i, \sigma)

with shared **hyperparameters** :math:`\mu` (grand mean across instructors) and :math:`\tau`
(between-instructor standard deviation), and hyperpriors :math:`\mu \sim \text{Normal}(50, 25)`,
:math:`\tau \sim \text{HalfNormal}(10)`, :math:`\sigma \sim \text{HalfNormal}(15)`.

Each instructor's estimate :math:`\hat{\mu}_i` is pulled toward the grand mean :math:`\mu` by an
amount that depends on two factors: (i) how much data that instructor has — instructors with more
students are pulled less — and (ii) how similar the instructors are overall — when :math:`\tau`
is small, the data indicate that instructors are similar, so every group is pulled substantially;
when :math:`\tau` is large, the data indicate genuine diversity, so the pull is weak.

This adaptive compromise between the two extreme strategies is called **shrinkage**. Partial
pooling gives narrower intervals than no pooling for groups with sparse data, while barely
affecting groups with abundant data. It does not force a one-size-fits-all answer (unlike
complete pooling), but it does use the overall pattern to stabilise noisy estimates (unlike no
pooling).

.. # Figure 5.33 — three-panel comparison (complete / partial / no pooling)

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_8_fig33_pooling_comparison.png
   :alt: Three-panel dot plot comparing instructor mean estimates under complete pooling, partial pooling, and no pooling.
   :align: center
   :width: 95%

   **Figure 5.33** — Instructor mean estimates under three pooling strategies. Complete pooling
   (left) assigns one value to all instructors. No pooling (centre) gives unbiased but noisy
   estimates, especially for instructors with small sections. Partial pooling (right) shrinks
   small-section estimates toward the grand mean while leaving large-section estimates largely
   unchanged.

.. admonition:: Definition: Key Hierarchical Vocabulary
   :class: note

   **Hyperparameters** are parameters that govern the distribution of other parameters. In the
   instructor example, :math:`\mu` and :math:`\tau` are hyperparameters because they control
   the Normal distribution from which the group-level means :math:`\mu_i` are drawn.

   **Hyperpriors** are the prior distributions placed on hyperparameters. The choices
   :math:`\mu \sim \text{Normal}(50, 25)` and :math:`\tau \sim \text{HalfNormal}(10)` are
   hyperpriors.

   **Shrinkage** is the pull of group-level estimates toward the grand mean. Shrinkage is
   stronger for groups with less data and weaker for groups with more data. In the no-pooling
   model, shrinkage is zero. In the complete-pooling model, shrinkage is total. The hierarchical
   model estimates the appropriate amount of shrinkage from the data.


.. _hierarchical-structure:

The Mathematical Structure
--------------------------

A hierarchical model organises parameters into levels. At the bottom are the observations,
governed by group-specific parameters. At the middle level, the group parameters are governed by
shared hyperparameters. At the top, hyperpriors express uncertainty about the hyperparameters
themselves. This three-level structure is the defining feature of the hierarchical framework.

The Generative Story
~~~~~~~~~~~~~~~~~~~~

Writing the model top-down as a generative process makes the structure transparent:

.. math::

   \text{Hyperprior:}  &\quad \mu \sim p(\mu), \quad \tau \sim p(\tau) \\
   \text{Group level:} &\quad \theta_i \mid \mu, \tau \sim p(\theta_i \mid \mu, \tau),
                         \quad i = 1, \ldots, J \\
   \text{Observation:}  &\quad y_{ij} \mid \theta_i \sim p(y_{ij} \mid \theta_i),
                         \quad j = 1, \ldots, n_i

where :math:`J` is the number of groups, :math:`n_i` is the sample size in group :math:`i`, and
the total dataset contains :math:`N = \sum_{i=1}^{J} n_i` observations.

The joint model factors as:

.. math::
   :label: hierarchical-joint

   p(y, \theta, \mu, \tau) = p(\mu) \, p(\tau)
     \prod_{i=1}^{J} p(\theta_i \mid \mu, \tau)
     \prod_{i=1}^{J} \prod_{j=1}^{n_i} p(y_{ij} \mid \theta_i)

Read this factorisation left to right. First, nature draws the hyperparameters :math:`\mu` and
:math:`\tau` from their hyperpriors. Then, for each group, nature draws a group-specific
parameter :math:`\theta_i` from the group-level distribution governed by those hyperparameters.
Finally, for each observation within each group, nature draws the data :math:`y_{ij}` from the
likelihood, which depends only on its group parameter :math:`\theta_i`.

The crucial structural feature is that the groups are linked *only* through the hyperparameters.
Conditional on :math:`\mu` and :math:`\tau`, the group parameters :math:`\theta_1, \ldots,
\theta_J` are independent. But marginally — before conditioning on :math:`\mu` and :math:`\tau`
— they are dependent, because they share a common source of variation. This marginal dependence
is what enables borrowing of strength: the data from group :math:`i` inform the hyperparameters,
which in turn inform the estimate for group :math:`k`.

.. # Figure 5.34a — Plate diagram

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_8_fig34a_plate_diagram.png
   :alt: Plate notation diagram showing the three-level hierarchical model with hyperparameters, group parameters, and observations.
   :align: center
   :width: 60%

   **Figure 5.34a** — Plate notation for a hierarchical model. Shaded nodes are observed,
   unshaded nodes are latent. The outer plate indexes groups :math:`i = 1, \ldots, J`; the
   inner plate indexes observations :math:`j = 1, \ldots, n_i` within each group.

Exchangeability
~~~~~~~~~~~~~~~

The group-level assumption :math:`\theta_i \mid \mu, \tau \sim p(\theta_i \mid \mu, \tau)` for
all :math:`i` is a statement of **exchangeability**: before looking at the data, we have no
reason to believe that any one group is systematically different from any other. The order of the
group labels does not matter — relabelling "School A" as "School C" does not change the model.

Exchangeability is *not* the same as independence. The :math:`\theta_i` are exchangeable but
dependent (through the shared hyperparameters). An analogy: imagine drawing coloured marbles from
an urn. Before any draw, any marble is equally likely. But once you draw a red marble, you
update your beliefs about the urn's composition, which changes the probabilities for the next
draw. The marbles are exchangeable but not independent.

Exchangeability is the key modeling assumption of the hierarchical framework, and it should be
checked. If there is a known structural distinction between groups — for example, if some schools
are private and some are public, and you have reason to believe this matters — that distinction
should be encoded explicitly in the model (e.g., as a covariate at the group level), not assumed
away under exchangeability.

The Shrinkage Formula
~~~~~~~~~~~~~~~~~~~~~

The mechanics of partial pooling can be made precise for the Normal-Normal hierarchy. Suppose the
group-level model is :math:`\theta_i \sim \text{Normal}(\mu, \tau^2)` and the data model is
:math:`\bar{y}_i \mid \theta_i \sim \text{Normal}(\theta_i, \sigma_i^2)`, where
:math:`\bar{y}_i` is the sample mean for group :math:`i` and :math:`\sigma_i^2` is its known
variance (this is the Eight Schools setup). Then the posterior mean for :math:`\theta_i`,
conditional on :math:`\mu` and :math:`\tau`, is:

.. math::
   :label: shrinkage-formula

   \mathbb{E}[\theta_i \mid y, \mu, \tau] = \underbrace{(1 - B_i)}_{\text{data weight}}
     \bar{y}_i \;+\; \underbrace{B_i}_{\text{prior weight}} \mu

where the **shrinkage factor** :math:`B_i` is:

.. math::

   B_i = \frac{\sigma_i^2}{\sigma_i^2 + \tau^2}

This is a weighted average of the group's own data :math:`\bar{y}_i` and the grand mean
:math:`\mu`. The weight given to the grand mean is :math:`B_i`, which depends on the ratio of
within-group noise :math:`\sigma_i^2` to between-group variation :math:`\tau^2`. Three limiting
cases make the behaviour transparent:

- **When** :math:`\sigma_i^2 \gg \tau^2` (**noisy group, similar groups**): :math:`B_i \to 1`,
  and the estimate is pulled almost entirely to the grand mean. The group's own data are so noisy
  relative to the between-group spread that they carry little information.

- **When** :math:`\sigma_i^2 \ll \tau^2` (**precise group, diverse groups**): :math:`B_i \to 0`,
  and the estimate is determined almost entirely by the group's own data. The within-group
  precision is high enough to overwhelm the pull toward :math:`\mu`.

- **When** :math:`\tau \to 0` (**all groups identical**): :math:`B_i \to 1` for every group,
  and partial pooling becomes complete pooling.

In the Eight Schools data, School E has :math:`\sigma_E = 9` (one of the smallest standard
errors), so its shrinkage factor is relatively low — its estimate is determined mainly by its own
data. School H has :math:`\sigma_H = 18` (the largest standard error), so its shrinkage factor is
high — it is pulled more strongly toward the grand mean. The hierarchical model adapts the degree
of pooling to each group's information content automatically.

.. admonition:: Example 💡 Shrinkage in Action
   :class: note

   **Given:** Eight Schools data with :math:`\mu \approx 8` and :math:`\tau \approx 6` (posterior
   means from the non-centred model).

   **Compute shrinkage factors for Schools E and H:**

   .. math::

      B_E = \frac{9^2}{9^2 + 6^2} = \frac{81}{81 + 36} = \frac{81}{117} \approx 0.69

   .. math::

      B_H = \frac{18^2}{18^2 + 6^2} = \frac{324}{324 + 36} = \frac{324}{360} = 0.90

   **Python verification:**

   .. code-block:: python

      mu_hat, tau_hat = 8.0, 6.0
      for school, y_i, sig_i in [('E', -1, 9), ('H', 12, 18)]:
          B_i = sig_i**2 / (sig_i**2 + tau_hat**2)
          theta_hat = (1 - B_i) * y_i + B_i * mu_hat
          print(f"School {school}: B = {B_i:.2f}, "
                f"shrunk estimate = {theta_hat:.1f} "
                f"(raw = {y_i})")

   .. code-block:: text

      School E: B = 0.69, shrunk estimate = 5.2 (raw = -1)
      School H: B = 0.90, shrunk estimate = 8.4 (raw = 12)

   **Result:** School E's negative raw effect (:math:`-1`) is pulled substantially toward the
   grand mean (8), yielding a shrunk estimate of 5.2. School H's raw effect (12) is pulled
   almost all the way to the grand mean because its standard error is so large. Both estimates
   are more moderate than the raw data — this is shrinkage in action.

Borrowing Strength: The Information Flow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The phrase "borrowing strength" describes how hierarchical models use data from all groups to
improve estimates for each individual group. The mechanism is indirect: the data from group
:math:`i` do not directly enter the likelihood for group :math:`k`. Instead, the data from
group :math:`i` inform the hyperparameters :math:`\mu` and :math:`\tau`, and those
hyperparameters in turn inform the prior for group :math:`k`'s parameter :math:`\theta_k`.

This information flow has a concrete consequence. Consider a new school — School I — that was
not in the original study. Under no pooling, we have no estimate for School I because it
contributed no data. Under partial pooling, we can predict School I's effect by drawing from the
population distribution:

.. math::

   \theta_{\text{new}} \sim \text{Normal}(\mu, \tau)

Using posterior samples :math:`(\mu^{(s)}, \tau^{(s)})` for :math:`s = 1, \ldots, S`, we obtain
a predictive distribution for :math:`\theta_{\text{new}}` that incorporates both the estimated
population mean and the estimated between-group variability. This is the hierarchical model's
built-in mechanism for *generalisation* — making predictions for units not in the training data.


.. _eight-schools:

Worked Example 1: Eight Schools (Normal Likelihood)
----------------------------------------------------

The Eight Schools dataset is the canonical hierarchical example in Bayesian statistics, first
analysed by [Rubin1981]_ and extensively discussed by [GelmanBDA3]_. Eight US high schools each
ran a short SAT coaching programme. Each school reported an estimated effect size
:math:`y_i` (improvement in SAT score) and its standard error :math:`\sigma_i`. The question:
did the programme work, and did it work equally across schools?

The Data
~~~~~~~~

.. flat-table:: Eight Schools data
   :header-rows: 1
   :widths: 15 20 20

   * - School
     - Effect :math:`y_i`
     - Std Error :math:`\sigma_i`

   * - A
     - 28
     - 15

   * - B
     - 8
     - 10

   * - C
     - :math:`-3`
     - 16

   * - D
     - 7
     - 11

   * - E
     - :math:`-1`
     - 9

   * - F
     - 1
     - 11

   * - G
     - 18
     - 10

   * - H
     - 12
     - 18

The standard errors :math:`\sigma_i` are treated as *fixed known quantities* — they come from
the original experimental design, not from the model. This is a known-variance Normal model,
which simplifies the hierarchy.

The Hierarchical Model
~~~~~~~~~~~~~~~~~~~~~~

The three-level structure specialised to this problem is:

.. math::
   :label: eight-schools-model

   \mu &\sim \text{Normal}(0, 20) \qquad \text{(grand mean effect)} \\
   \tau &\sim \text{HalfNormal}(10) \qquad \text{(between-school SD)} \\
   \theta_i &\sim \text{Normal}(\mu, \tau), \quad i = 1, \ldots, 8
     \qquad \text{(school-level effects)} \\
   y_i &\sim \text{Normal}(\theta_i, \sigma_i) \qquad \text{(observed effects, } \sigma_i
     \text{ fixed)}

The hyperprior :math:`\mu \sim \text{Normal}(0, 20)` centres the grand mean at zero (no prior
assumption that coaching helps or hurts) with a standard deviation of 20 SAT points — generous
enough to accommodate any plausible programme effect. The hyperprior
:math:`\tau \sim \text{HalfNormal}(10)` allows between-school variation from near zero (all
schools are the same) to roughly 20 points (substantial heterogeneity). These are weakly
informative priors in the sense of :ref:`ch5_2-prior-distributions`: they constrain the
parameter space to physically plausible ranges without strongly favouring any particular value.

The Centred Parameterisation (and Why It Fails)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The natural way to write this model in PyMC follows the mathematical specification directly:

.. code-block:: python

   import numpy as np
   import pymc as pm
   import arviz as az

   # Eight Schools data
   y_obs   = np.array([28,  8, -3,  7, -1,  1, 18, 12], dtype=float)
   sigma_i = np.array([15, 10, 16, 11,  9, 11, 10, 18], dtype=float)
   J = len(y_obs)

   with pm.Model() as centred:
       mu    = pm.Normal('mu',  mu=0,  sigma=20)
       tau   = pm.HalfNormal('tau', sigma=10)
       theta = pm.Normal('theta', mu=mu, sigma=tau, shape=J)
       y     = pm.Normal('y', mu=theta, sigma=sigma_i, observed=y_obs)

       idata_c = pm.sample(draws=2000, tune=1000, chains=4,
                           target_accept=0.90, random_seed=42)

This is called the **centred parameterisation** because each :math:`\theta_i` is centred
directly on :math:`\mu` with scale :math:`\tau`. Run this code and PyMC will report
**divergent transitions** — the sampler's warning that it encountered posterior geometry it could
not navigate reliably. Divergences are not a nuisance to be ignored; they indicate that the
reported posterior samples may be biased (discussed in :ref:`ch5_5-mcmc-algorithms`).

Neal's Funnel
~~~~~~~~~~~~~

The divergences arise from a geometric pathology first described by [BetancourtGirolami2015]_
and named **Neal's funnel**. The issue is the interaction between :math:`\tau` and the
:math:`\theta_i`.

When :math:`\tau` is large — say :math:`\tau = 10` — the school effects :math:`\theta_i` can
spread freely across a wide range of values. The posterior geometry in this region is smooth
and NUTS navigates it easily.

When :math:`\tau` is small — say :math:`\tau \approx 0.5` — the model is saying that all
schools have nearly the same effect. Consequently, every :math:`\theta_i` must cluster tightly
around :math:`\mu`, leaving almost no room to vary. The posterior geometry in this region is
extremely narrow: a thin spike in :math:`J`-dimensional :math:`\theta`-space, constrained to a
tiny neighbourhood of :math:`\mu`.

Now imagine the sampler trying to move through both regions. At the wide top of the funnel (large
:math:`\tau`), NUTS selects a step size :math:`\epsilon` that efficiently covers the broad
posterior. But when the sampler drifts toward the narrow stem (small :math:`\tau`), that same
step size is far too large: each leapfrog step overshoots the narrow constraint, the Hamiltonian
diverges, and the proposal is rejected. The sampler gets stuck in the wide region and rarely
visits the stem.

An analogy: imagine trying to sample uniformly from a wine glass shape. Steps sized for the wide
rim will crash through the narrow stem; steps sized for the narrow stem will take forever to
explore the rim. No single step size works for both regions.

.. # Figure 5.34 — Neal's funnel scatter

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_8_fig34_neals_funnel.png
   :alt: Two-panel scatter plot showing centred parameterisation with funnel geometry and divergences versus non-centred parameterisation with regular geometry and no divergences.
   :align: center
   :width: 90%

   **Figure 5.34** — Neal's funnel. Left: the centred parameterisation produces a
   funnel-shaped joint posterior in :math:`(\tau, \theta_1)` space; red points mark divergent
   transitions concentrated at the funnel's neck. Right: the non-centred parameterisation
   transforms the geometry into a cylinder; divergences vanish.

.. admonition:: Common Pitfall ⚠️
   :class: warning

   **Divergent transitions are not a tuning problem.** Increasing ``target_accept`` from 0.90 to
   0.95 or 0.99 can reduce the number of divergences by forcing a smaller step size, but it does
   not eliminate the geometric pathology. If divergences appear in a hierarchical model, the first
   diagnosis should be a reparameterisation, not a tuning adjustment. Ignoring divergences risks
   biased posterior estimates — typically with the between-group variance :math:`\tau`
   underestimated because the sampler fails to explore the funnel's stem.

The Non-Centred Parameterisation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The solution is a change of variables that decouples the group parameters from :math:`\tau`.
Instead of writing :math:`\theta_i \sim \text{Normal}(\mu, \tau)`, we write:

.. math::
   :label: noncentred-reparam

   \tilde{\theta}_i &\sim \text{Normal}(0, 1) \\
   \theta_i &= \mu + \tau \cdot \tilde{\theta}_i

The auxiliary variables :math:`\tilde{\theta}_i` are standard Normal regardless of :math:`\tau`.
This is a deterministic transformation: the joint distribution of :math:`(\theta_1, \ldots,
\theta_J)` is unchanged, but the geometry of the *sampled* parameters
:math:`(\tilde{\theta}_1, \ldots, \tilde{\theta}_J)` is now approximately independent of
:math:`\tau`. The funnel is replaced by a cylinder that NUTS traverses without difficulty.

.. code-block:: python

   with pm.Model() as noncentred:
       mu        = pm.Normal('mu',  mu=0,  sigma=20)
       tau       = pm.HalfNormal('tau', sigma=10)
       theta_raw = pm.Normal('theta_raw', mu=0, sigma=1, shape=J)
       theta     = pm.Deterministic('theta', mu + tau * theta_raw)
       y         = pm.Normal('y', mu=theta, sigma=sigma_i, observed=y_obs)

       idata_nc = pm.sample(draws=2000, tune=1000, chains=4,
                            target_accept=0.90, random_seed=42,
                            idata_kwargs={"log_likelihood": True})

The only change is in how ``theta`` is constructed: a standard Normal ``theta_raw`` is sampled
first, then ``theta`` is built deterministically. Mathematically, :math:`\theta_i = \mu +
\tau \cdot \tilde{\theta}_i` with :math:`\tilde{\theta}_i \sim \text{Normal}(0,1)` produces
the same distribution as :math:`\theta_i \sim \text{Normal}(\mu, \tau)`. Computationally, the
difference is decisive.

After sampling:

.. code-block:: python

   print(az.summary(idata_nc, var_names=['mu', 'tau', 'theta'],
                    hdi_prob=0.89))

.. code-block:: text

          mean     sd  hdi_5.5%  hdi_94.5%  r_hat  ess_bulk  ess_tail
   mu     7.87   5.18    -0.54      15.93   1.00    2914.0    3002.0
   tau    6.41   5.22     0.16      14.34   1.00    1841.0    2189.0
   theta[0]  11.24   8.41    -2.51      24.90   1.00    3415.0    3201.0
   theta[1]   7.78   6.18    -2.68      17.38   1.00    4102.0    3587.0
   theta[2]   5.96   7.61   -7.70      17.71   1.00    4315.0    3498.0
   theta[3]   7.49   6.50    -3.55      17.56   1.00    4008.0    3321.0
   theta[4]   5.14   6.20    -5.89      14.44   1.00    3987.0    3419.0
   theta[5]   6.13   6.58    -5.32      16.22   1.00    4121.0    3412.0
   theta[6]  10.57   6.78    -0.14      21.23   1.00    3618.0    3155.0
   theta[7]   8.50   7.61    -4.18      20.35   1.00    3798.0    3302.0

All :math:`\hat{R}` values are at 1.00, and effective sample sizes are in the thousands — the
non-centred parameterisation has eliminated the sampling pathology entirely.

Forest Plot
~~~~~~~~~~~

The forest plot is the standard visualisation for hierarchical model estimates. It shows each
group's posterior distribution as a credible interval with a point estimate, arranged vertically
for comparison:

.. code-block:: python

   school_names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
   labeller = az.labels.MapLabeller(
       var_name_map={f'theta[{i}]': school_names[i] for i in range(J)}
   )
   az.plot_forest(idata_nc, var_names=['theta'], combined=True,
                  hdi_prob=0.89, r_hat=True, labeller=labeller)

.. # Figure 5.35 — Forest plot

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_8_fig35_eight_schools_forest.png
   :alt: Forest plot showing 89% credible intervals for each school's treatment effect, with shrinkage toward the grand mean visible for all schools.
   :align: center
   :width: 80%

   **Figure 5.35** — Forest plot of Eight Schools posterior estimates. Horizontal bars are 89%
   HDI intervals; dots are posterior means. All schools are shrunk toward the grand mean
   :math:`\mu \approx 8`. Schools with larger standard errors (A, C, H) are shrunk more.

Key observations to read from the forest plot:

- **Schools A and G** have the largest observed effects (28 and 18), but their posterior means
  are shrunk to approximately 11 and 10.5. The data alone suggest large effects; the hierarchical
  model tempers this by noting that the other six schools show much smaller effects, so the
  extreme values are partly noise.

- **Schools C and E** have negative point estimates in the raw data (:math:`-3` and :math:`-1`),
  but their posterior intervals cross zero comfortably. We cannot confidently claim that the
  programme hurt these schools.

- **School B** has a moderate observed effect (8) with a relatively small standard error (10),
  so it is shrunk less than schools with wider standard errors.

- The *width* of each interval reflects both the within-school standard error :math:`\sigma_i`
  and the between-school variance :math:`\tau`. Schools with larger :math:`\sigma_i` have wider
  intervals — they contribute less information and are therefore pulled more toward :math:`\mu`.

Diagnosing the Centred vs. Non-Centred Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The diagnostic comparison between the two parameterisations is instructive. The ``az.plot_pair``
function reveals the geometry directly:

.. code-block:: python

   # Centred model diagnostics
   az.plot_pair(idata_c, var_names=['tau', 'theta'],
                coords={'theta_dim_0': [0]},
                divergences=True, kind='scatter',
                figsize=(6, 6))

This produces the left panel of Figure 5.34: a scatter plot of :math:`(\tau, \theta_0)` with
divergent transitions marked in red. The funnel shape is unmistakable — samples cluster at
large :math:`\tau` and divergences concentrate where :math:`\tau` is small and :math:`\theta_0`
is forced near :math:`\mu`.

Running the same diagnostic on the non-centred model shows no funnel and no divergences:

.. code-block:: python

   # Non-centred model diagnostics
   az.plot_pair(idata_nc, var_names=['tau', 'theta'],
                coords={'theta_dim_0': [0]},
                divergences=True, kind='scatter',
                figsize=(6, 6))

A quantitative comparison confirms the visual impression:

.. code-block:: python

   # Count divergences
   divergences_c  = idata_c.sample_stats['diverging'].values.sum()
   divergences_nc = idata_nc.sample_stats['diverging'].values.sum()
   print(f"Centred model divergences:     {divergences_c}")
   print(f"Non-centred model divergences: {divergences_nc}")

   # Compare ESS for tau
   ess_c  = az.ess(idata_c,  var_names=['tau'])['tau'].values
   ess_nc = az.ess(idata_nc, var_names=['tau'])['tau'].values
   print(f"\nESS for tau (centred):     {ess_c:.0f}")
   print(f"ESS for tau (non-centred): {ess_nc:.0f}")

.. code-block:: text

   Centred model divergences:     27
   Non-centred model divergences: 0

   ESS for tau (centred):     312
   ESS for tau (non-centred): 1841

The centred model produces 27 divergent transitions and an ESS for :math:`\tau` of only 312 —
less than one-sixth of the non-centred model's ESS. The non-centred model has zero divergences
and nearly six times the effective sample size for the critical between-group variance parameter.
This is not a marginal improvement; it is the difference between a reliable posterior and one
that cannot be trusted.

Posterior Predictive Check
~~~~~~~~~~~~~~~~~~~~~~~~~~

As a final diagnostic, we verify that the hierarchical model can reproduce the observed data. The
posterior predictive check draws simulated datasets :math:`y^{\text{rep}}` from the posterior
and compares them to the observed data:

.. code-block:: python

   with noncentred:
       pm.sample_posterior_predictive(idata_nc, extend_inferencedata=True,
                                      random_seed=42)

   # Compare observed vs. replicated school effects
   y_rep = idata_nc.posterior_predictive['y']
   print("Observed effects:  ", y_obs)
   print("Replicated means:  ",
         y_rep.mean(dim=['chain', 'draw']).values.round(1))
   print("Replicated std:    ",
         y_rep.std(dim=['chain', 'draw']).values.round(1))

.. code-block:: text

   Observed effects:   [ 28.   8.  -3.   7.  -1.   1.  18.  12.]
   Replicated means:   [11.2  7.8  6.0  7.5  5.1  6.1 10.6  8.5]
   Replicated std:     [16.2 11.3 17.4 12.2 10.4 12.3 11.5 19.3]

The replicated means are shrunk versions of the observed effects (as expected — the model pulls
toward the grand mean), and the replicated standard deviations match the known standard errors
closely. The model is generating data that are statistically compatible with the observations.

**Algorithm: Hierarchical Model Fitting via Non-Centred NUTS**

.. code-block:: text

   Input:  Data y (grouped), group sizes n_i, hyperprior specifications
   Output: Posterior samples for (mu, tau, theta_1, ..., theta_J)

   1. Reparameterise: for each group i, set theta_tilde_i ~ Normal(0,1)
      and theta_i = mu + tau * theta_tilde_i  (deterministic)
   2. Construct log-posterior:
      log p = log p(mu) + log p(tau)
            + sum_i log Normal(theta_tilde_i | 0, 1)
            + sum_i sum_j log p(y_ij | theta_i)
   3. Compute gradient via automatic differentiation
   4. Run NUTS (target_accept = 0.90) with 4 chains:
      a. Tune step size epsilon via dual averaging (1000 iterations)
      b. Draw S posterior samples per chain (2000 iterations)
   5. Diagnose: check R-hat < 1.01, ESS > 400, divergences = 0
   6. Transform back: compute theta_i = mu + tau * theta_tilde_i
      for each posterior sample
   7. Return InferenceData with posterior, posterior_predictive,
      log_likelihood groups

**Computational Complexity**: Each NUTS iteration requires :math:`O(L \cdot (J + N))` gradient
evaluations, where :math:`L` is the leapfrog trajectory length (typically 5–20 steps),
:math:`J` is the number of groups, and :math:`N` is the total number of observations. Total cost
for :math:`S` draws across :math:`C` chains is :math:`O(C \cdot S \cdot L \cdot (J + N))`.
For the Eight Schools problem (:math:`J = 8`, :math:`N = 8`), each iteration is trivially fast.
For large datasets (:math:`N > 10{,}000`), the per-iteration cost is dominated by the
likelihood gradient over observations.


.. _dirichlet-multinomial-example:

Worked Example 2: Dirichlet-Multinomial (Categorical Likelihood)
-----------------------------------------------------------------

The Eight Schools example uses a Normal likelihood — the most common setting for hierarchical
models. But the hierarchical design pattern is not specific to any likelihood family. This
second example uses a *categorical* likelihood to demonstrate that the three-level structure
(hyperprior → group parameters → observations) applies equally to counts, proportions, rates,
and any other data type.

Setup: Survey Responses Across Regions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A political survey asks respondents in :math:`R = 6` regions to choose among :math:`K = 4`
policy positions: Strongly Agree, Agree, Disagree, Strongly Disagree. Each region has a
different number of respondents (between 35 and 120). We want to estimate the true population
proportions :math:`(\pi_1, \pi_2, \pi_3, \pi_4)` for each region while acknowledging that
regions share a common national pattern.

Why Dirichlet-Multinomial?
~~~~~~~~~~~~~~~~~~~~~~~~~~

When the outcome is a probability vector :math:`\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)` that
must sum to one, the natural prior is the **Dirichlet distribution** — the multivariate
generalisation of the Beta distribution (developed in :ref:`ch5_2-prior-distributions`). Just
as the Beta-Binomial pair is conjugate for binary outcomes, the Dirichlet-Multinomial pair is
conjugate for categorical outcomes with :math:`K \geq 2` categories.

The Dirichlet distribution :math:`\text{Dir}(\alpha_1, \ldots, \alpha_K)` is parameterised by a
concentration vector :math:`\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)` with all
:math:`\alpha_k > 0`. Its density is:

.. math::

   p(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) =
     \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
     \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \qquad
   \alpha_0 = \sum_{k=1}^{K} \alpha_k

where :math:`\Gamma(\cdot)` is the gamma function (a continuous extension of the factorial).
When all :math:`\alpha_k = 1`, the Dirichlet reduces to the uniform distribution on the simplex.
When all :math:`\alpha_k > 1`, it concentrates probability around the mean
:math:`(\alpha_1/\alpha_0, \ldots, \alpha_K/\alpha_0)`.

Hierarchical Structure
~~~~~~~~~~~~~~~~~~~~~~

The hierarchy for the survey data is:

.. math::

   \alpha &\sim \text{Exponential}(1)
     \qquad \text{(concentration hyperparameter)} \\
   \boldsymbol{\pi}_0 &\sim \text{Dir}(1, 1, 1, 1)
     \qquad \text{(national baseline proportions)} \\
   \boldsymbol{\pi}_r &\sim \text{Dir}(\alpha \cdot \boldsymbol{\pi}_0),
     \quad r = 1, \ldots, R
     \qquad \text{(region-level proportions)} \\
   \mathbf{y}_r &\sim \text{Multinomial}(n_r, \boldsymbol{\pi}_r)
     \qquad \text{(observed counts)}

The hyperparameter :math:`\alpha` controls how tightly the regional proportions cluster around
the national baseline :math:`\boldsymbol{\pi}_0`. When :math:`\alpha` is large, all regions
look alike. When :math:`\alpha` is small, regions can differ substantially. The product
:math:`\alpha \cdot \boldsymbol{\pi}_0` means that the Dirichlet concentration in each category
is proportional to the national proportion — categories that are common nationally are common
regionally, but the degree of agreement is governed by :math:`\alpha`.

Synthetic Data and PyMC Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np
   import pymc as pm
   import arviz as az

   rng = np.random.default_rng(2025)
   K   = 4          # policy positions
   R   = 6          # regions

   # True national proportions (unknown to the model)
   pi_true = np.array([0.25, 0.35, 0.25, 0.15])

   # Region-level sample sizes
   n_r = np.array([45, 120, 60, 80, 35, 95])

   # Generate synthetic observed counts
   y_obs = np.zeros((R, K), dtype=int)
   for r in range(R):
       pi_r = rng.dirichlet(8.0 * pi_true)  # alpha ~ 8
       y_obs[r] = rng.multinomial(n_r[r], pi_r)

   print("Observed counts:")
   print(y_obs)

.. code-block:: text

   Observed counts:
   [[ 8 19 10  8]
    [28 39 28 25]
    [13 31 12  4]
    [21 26 15 18]
    [ 9 11  9  6]
    [27 37 17 14]]

.. code-block:: python

   with pm.Model() as dir_model:
       # Hyperpriors
       alpha = pm.Exponential('alpha', lam=1.0)
       pi0   = pm.Dirichlet('pi0', a=np.ones(K))   # national baseline

       # Region-level proportions — partial pooling across regions
       pi_r  = pm.Dirichlet('pi_r', a=alpha * pi0, shape=(R, K))

       # Likelihood
       counts = pm.Multinomial('counts', n=n_r, p=pi_r, observed=y_obs)

       idata_dir = pm.sample(draws=2000, tune=1000, chains=4,
                             target_accept=0.90, random_seed=42,
                             idata_kwargs={"log_likelihood": True})

   print(az.summary(idata_dir, var_names=['alpha', 'pi0'],
                    hdi_prob=0.89))

.. code-block:: text

            mean     sd  hdi_5.5%  hdi_94.5%  r_hat  ess_bulk  ess_tail
   alpha   10.32   5.78     2.93      19.44   1.00    1547.0    1889.0
   pi0[0]   0.24   0.04     0.17       0.30   1.00    3215.0    2987.0
   pi0[1]   0.37   0.04     0.30       0.43   1.00    3102.0    2845.0
   pi0[2]   0.22   0.04     0.16       0.28   1.00    3289.0    3011.0
   pi0[3]   0.17   0.03     0.11       0.22   1.00    3178.0    2912.0

Interpreting the Results
~~~~~~~~~~~~~~~~~~~~~~~~

The posterior for :math:`\boldsymbol{\pi}_0` recovers the true national proportions well:
the posterior means (0.24, 0.37, 0.22, 0.17) are close to the generating values
(0.25, 0.35, 0.25, 0.15). The posterior for :math:`\alpha` (mean :math:`\approx 10`)
indicates moderate agreement across regions — the regions are similar but not identical.

Comparing the region-level posteriors reveals partial pooling in action:

.. code-block:: python

   print("\nSmallest region (r=5, n=35):")
   print(az.summary(idata_dir, var_names=['pi_r'],
                    filter_vars='like', coords={'pi_r_dim_0': [4]},
                    hdi_prob=0.89))
   print("\nLargest region (r=2, n=120):")
   print(az.summary(idata_dir, var_names=['pi_r'],
                    filter_vars='like', coords={'pi_r_dim_0': [1]},
                    hdi_prob=0.89))

- **Region 5** (:math:`n = 35`): the smallest region has the widest credible intervals and is
  pulled most strongly toward the national baseline :math:`\boldsymbol{\pi}_0`.
- **Region 2** (:math:`n = 120`): the largest region has narrow intervals that are determined
  primarily by its own data, with minimal shrinkage toward the national baseline.

This is the same shrinkage pattern as Eight Schools: sparse groups borrow more strength from the
overall pattern; data-rich groups stand largely on their own.

To make the comparison concrete, fit the no-pooling alternative — a separate, independent
Dirichlet model for each region — and compare the interval widths for the smallest region:

.. code-block:: python

   # No-pooling comparison for Region 5 (n=35)
   from scipy.stats import dirichlet as sp_dirichlet

   # No-pooling: posterior is Dir(1 + y_r) for each region independently
   alpha_r5_no_pool = 1.0 + y_obs[4]   # shape (4,)
   pi_r5_samples_np = rng.dirichlet(alpha_r5_no_pool, size=8000)

   # Hierarchical: extract Region 5 posterior samples
   pi_r5_samples_hier = (idata_dir.posterior['pi_r']
                         .sel(pi_r_dim_0=4)
                         .values.reshape(-1, K))

   # Compare 89% HDI widths
   for k, label in enumerate(['SA', 'A', 'D', 'SD']):
       hdi_np   = np.percentile(pi_r5_samples_np[:, k], [5.5, 94.5])
       hdi_hier = np.percentile(pi_r5_samples_hier[:, k], [5.5, 94.5])
       print(f"Category {label}: "
             f"no-pool width = {hdi_np[1] - hdi_np[0]:.3f}, "
             f"hierarchical width = {hdi_hier[1] - hdi_hier[0]:.3f}")

.. code-block:: text

   Category SA: no-pool width = 0.241, hierarchical width = 0.193
   Category A:  no-pool width = 0.268, hierarchical width = 0.218
   Category D:  no-pool width = 0.248, hierarchical width = 0.199
   Category SD: no-pool width = 0.209, hierarchical width = 0.162

For every category, the hierarchical model produces narrower credible intervals than the
no-pooling model for this small region. The reduction is approximately 20% in interval width —
a meaningful improvement in precision obtained entirely from borrowing strength across regions.

.. admonition:: Example 💡 The Hierarchical Design Pattern
   :class: note

   **Given:** Multiple groups sharing a common structure (schools with test scores, regions with
   survey responses, hospitals with mortality rates, species with trait measurements).

   **Pattern:**

   .. math::

      \text{hyperparameters} &\sim \text{hyperprior} \\
      \text{group parameters} &\sim p(\cdot \mid \text{hyperparameters}) \\
      \text{observations} &\sim p(\cdot \mid \text{group parameters})

   **Key insight:** The likelihood family (Normal, Multinomial, Poisson, Binomial, Gamma, …) is
   interchangeable. What makes a model hierarchical is not the distributional choice but the
   *structure*: group parameters are exchangeable draws from a shared distribution whose
   parameters are themselves uncertain.


.. _loo-pooling-comparison:

LOO Model Comparison: Three Pooling Strategies
-----------------------------------------------

The qualitative argument for partial pooling — narrower intervals, adaptive shrinkage,
predictions for new groups — is compelling. But how do we quantify the improvement? The LOO
model comparison framework from :ref:`ch5_7-model-comparison` provides the answer.

Setting Up the Alternatives
~~~~~~~~~~~~~~~~~~~~~~~~~~~

We fit three models to the Eight Schools data:

**Complete pooling** — a single effect shared by all schools:

.. code-block:: python

   with pm.Model() as complete_pool:
       mu    = pm.Normal('mu', mu=0, sigma=20)
       y     = pm.Normal('y', mu=mu, sigma=sigma_i, observed=y_obs)

       idata_cp = pm.sample(draws=2000, tune=1000, chains=4,
                            target_accept=0.90, random_seed=42,
                            idata_kwargs={"log_likelihood": True})

**No pooling** — independent effects with no shared parameters:

.. code-block:: python

   with pm.Model() as no_pool:
       theta = pm.Normal('theta', mu=0, sigma=20, shape=J)
       y     = pm.Normal('y', mu=theta, sigma=sigma_i, observed=y_obs)

       idata_np = pm.sample(draws=2000, tune=1000, chains=4,
                            target_accept=0.90, random_seed=42,
                            idata_kwargs={"log_likelihood": True})

**Partial pooling** — the non-centred hierarchical model from §5.8.3 (``idata_nc``), already
sampled with ``idata_kwargs={"log_likelihood": True}``.

Comparing with ``az.compare``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   comparison = az.compare(
       {"partial pooling": idata_nc,
        "no pooling":      idata_np,
        "complete pooling": idata_cp},
       ic="loo"
   )
   print(comparison)

.. code-block:: text

                     rank    elpd_loo  ...    weight       se     dse  warning
   partial pooling      0      -30.8  ...     0.89     1.40    0.00    False
   no pooling           1      -31.4  ...     0.11     1.28    0.42    False
   complete pooling     2      -31.9  ...     0.00     1.53    0.89    False

.. # Figure 5.36 — LOO comparison bar chart

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIII/Chapter5/ch5_8_fig36_loo_comparison.png
   :alt: Horizontal bar chart of ELPD with standard error bars for the three pooling strategies. Partial pooling ranks first.
   :align: center
   :width: 75%

   **Figure 5.36** — LOO comparison of three pooling strategies. ELPD (higher is better) with
   :math:`\pm 2` standard error bars. Partial pooling ranks first. The difference between partial
   and no pooling is the quantitative measure of how much hierarchical shrinkage improved
   out-of-sample prediction.

.. code-block:: python

   az.plot_compare(comparison, insample_dev=False)

Interpreting the Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ELPD differences in the Eight Schools problem are modest — all three models perform
similarly because the dataset is small (:math:`J = 8` schools, each summarised by a single
effect estimate). This is typical: with little data, it is hard to discriminate models
decisively.

Nevertheless, partial pooling wins, and the ordering is informative:

- **Partial pooling > no pooling**: the shrinkage toward the grand mean provides a small but
  consistent improvement in predictive accuracy. For schools with large standard errors
  (where the no-pooling estimate is noisy), the partial-pooling estimate is closer to the
  true value on average.

- **Partial pooling > complete pooling**: allowing school-to-school variation captures real
  heterogeneity that the single-parameter model misses.

- **No pooling > complete pooling**: the data support some heterogeneity across schools, so
  the model that allows it outperforms the model that forbids it.

The LOO comparison provides a principled, quantitative answer to "how much does the hierarchical
structure help?" — it is the predictive benefit of partial pooling measured on the same scale
used to compare any Bayesian models in :ref:`ch5_7-model-comparison`.

.. admonition:: Common Pitfall ⚠️
   :class: warning

   **Small ELPD differences are still meaningful.** With only :math:`J = 8` data points (school
   summaries), the standard errors on the ELPD differences are large relative to the differences
   themselves. Do not conclude that the models are "the same" — conclude that you do not have
   enough data to distinguish them sharply. In practice, hierarchical models shine most when
   :math:`J` is moderate to large (15–500 groups) and group sizes :math:`n_i` are heterogeneous.


.. _when-hierarchical:

When to Use Hierarchical Models
-------------------------------

Not every grouped dataset demands a hierarchical model. The following practical guidelines
help decide when the added complexity is warranted.

.. flat-table:: Summary — Three Pooling Strategies
   :header-rows: 1
   :widths: 20 25 25 30

   * - Property
     - Complete Pooling
     - No Pooling
     - Partial Pooling (Hierarchical)

   * - Parameters
     - 1 shared :math:`\theta`
     - :math:`J` independent :math:`\theta_i`
     - :math:`J` exchangeable :math:`\theta_i` + hyperparameters

   * - Bias
     - High (ignores group differences)
     - None (each group estimated separately)
     - Small (controlled by shrinkage)

   * - Variance
     - Low (uses all data)
     - High (uses only group data)
     - Moderate (borrows strength)

   * - New group prediction
     - Returns grand mean
     - Impossible
     - Draw from population distribution

   * - Best regime
     - Groups truly identical
     - Groups unrelated, large :math:`n_i`
     - Groups related, heterogeneous :math:`n_i`

Use a Hierarchical Model When
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Data are grouped by a categorical variable.** Schools, hospitals, species, experimental
subjects, geographic regions, survey waves — any time the data contain a grouping factor with
:math:`J \geq 5` levels and the research question involves group-specific inference.

**Groups have heterogeneous sample sizes.** Partial pooling helps the smallest groups most. If
all groups have the same (large) sample size, the no-pooling model may perform comparably, but
the hierarchical model still provides a principled framework for estimating between-group
variation.

**You want to predict for a new group.** The hierarchical model can generate predictions for a
group not in the training data by drawing a new :math:`\theta_{\text{new}} \sim
\text{Normal}(\mu, \tau)` from the estimated population distribution. Neither complete pooling
(which does not distinguish groups) nor no pooling (which has no parameter for unseen groups)
can do this.

**You have repeated measurements on the same units.** Longitudinal data — multiple time points
per subject — are inherently hierarchical, with subjects as groups and measurements as
within-group observations. This is the foundation of mixed-effects models in clinical trials,
education research, and ecology.

Be Cautious When
~~~~~~~~~~~~~~~~

**Too few groups** (:math:`J < 5`). The between-group variance :math:`\tau` is identified by the
spread of the :math:`\theta_i` values. With fewer than five groups, there is very little
information to estimate :math:`\tau`, and the posterior for :math:`\tau` is dominated by the
hyperprior.

**Groups are not exchangeable.** If some schools are private and some are public, or some
hospitals are rural and some are urban, and this distinction is scientifically important, encode
it as a group-level covariate rather than assuming exchangeability:

.. math::

   \theta_i = \mu + \beta \cdot x_i + \eta_i, \qquad
   \eta_i \sim \text{Normal}(0, \tau)

where :math:`x_i` is a group-level predictor (e.g., an indicator for private schools) and
:math:`\eta_i` captures residual between-group variation.

**The posterior for** :math:`\tau` **concentrates near zero.** If the data indicate that all
groups have essentially the same parameter value (:math:`\tau \approx 0`), the hierarchical
model reduces to complete pooling. The extra complexity is unnecessary, though it does no harm
— the model will simply learn that the groups are the same.


Practical Considerations
------------------------

Centred vs. Non-Centred: Which to Use
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The choice between centred and non-centred parameterisations depends on the amount of data per
group relative to the prior:

- **Non-centred is preferred when** data per group are sparse (small :math:`n_i`) or the
  between-group variance :math:`\tau` is small. This is the Eight Schools regime and is the
  most common case in practice. The non-centred parameterisation should be the **default**
  for hierarchical models.

- **Centred is preferred when** data per group are abundant (large :math:`n_i`) and :math:`\tau`
  is well-separated from zero. In this regime, the likelihood dominates the prior for each
  group, the funnel geometry does not arise, and the centred parameterisation can yield higher
  effective sample sizes because it exploits the posterior's natural correlation structure.

When in doubt, fit both and compare ESS. If the non-centred model has no divergences and
reasonable ESS, use it. The computational cost of trying both is small compared to the cost of
reporting biased posteriors from a divergent centred model.

Scaling to Many Groups
~~~~~~~~~~~~~~~~~~~~~~

Hierarchical models scale well to large :math:`J`. With the non-centred parameterisation, the
number of sampled parameters grows linearly with :math:`J` (one :math:`\tilde{\theta}_i` per
group), and NUTS handles hundreds of groups routinely. The computational cost per iteration is
:math:`O(J + N)` — one gradient evaluation per group parameter plus one per observation.

Memory can become a concern when :math:`J` is very large (thousands of groups), because the
posterior samples for all group parameters must be stored. In such cases, consider saving only
summary statistics (posterior mean and HDI) for each group rather than the full posterior draws.

Prior Sensitivity
~~~~~~~~~~~~~~~~~

The hyperpriors on :math:`\mu` and :math:`\tau` deserve attention. The prior on :math:`\tau` is
especially influential because it controls the degree of pooling:

- If the prior on :math:`\tau` is too concentrated near zero, it forces excessive shrinkage
  even when the groups are genuinely different.
- If the prior on :math:`\tau` is too diffuse, it may cause sampling difficulties (wide funnels)
  or yield posteriors that are overly sensitive to the scale of :math:`\tau`.
- ``HalfNormal`` and ``HalfCauchy`` are both reasonable default hyperpriors for :math:`\tau`.
  [Gelman2006scale]_ recommends ``HalfCauchy(0, 25)`` as a weakly informative default when
  :math:`\tau` is on the standard deviation scale.

Always check sensitivity by re-running the model with different hyperpriors and comparing the
posterior for the key quantities of interest (the :math:`\theta_i` values and the predictive
distribution). If the posteriors change substantially, report this sensitivity.

Connection to Frequentist Mixed Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Students familiar with mixed-effects models from courses in experimental design or biostatistics
will recognise the hierarchical model as the Bayesian counterpart of the **linear mixed model**
(LMM). In frequentist notation, the model :math:`y_{ij} = \mu + u_i + \varepsilon_{ij}` with
:math:`u_i \sim \text{Normal}(0, \tau^2)` and :math:`\varepsilon_{ij} \sim \text{Normal}(0,
\sigma^2)` is estimated by restricted maximum likelihood (REML). The REML estimate of
:math:`\mu_i = \mu + u_i` exhibits the same shrinkage toward :math:`\mu` that we see in the
Bayesian posterior.

The key differences are:

- **Uncertainty in** :math:`\tau`. The Bayesian model propagates full posterior uncertainty in the
  between-group variance :math:`\tau`. The frequentist LMM provides a point estimate
  :math:`\hat{\tau}` and plugs it in, which can understate uncertainty — especially when
  :math:`J` is small and :math:`\hat{\tau}` is poorly estimated.

- **Boundary behaviour**. When the true :math:`\tau` is near zero, REML often produces
  :math:`\hat{\tau} = 0` exactly (a boundary estimate), collapsing the model to complete pooling.
  The Bayesian model places a continuous prior on :math:`\tau > 0` and reports a posterior that
  may concentrate near zero but does not collapse to a point mass, preserving some residual
  uncertainty about between-group variation.

- **Predictive distributions**. The Bayesian posterior predictive automatically integrates over
  uncertainty in :math:`\mu`, :math:`\tau`, and :math:`\sigma`. Frequentist prediction
  intervals typically condition on :math:`\hat{\tau}` and :math:`\hat{\sigma}`, underestimating
  predictive uncertainty.

For well-identified models with moderate to large :math:`J`, the two approaches yield very
similar point estimates and intervals. The Bayesian advantage is most pronounced when :math:`J`
is small (:math:`< 10`) or when :math:`\tau` is near zero — precisely the regimes where
frequentist inference is most fragile.

.. admonition:: Critical Warning 🛑
   :class: danger

   **Hierarchical models are not magic.** Partial pooling improves estimates when the
   exchangeability assumption is at least approximately correct. If the groups are fundamentally
   different in a way not captured by the model — for example, if half the schools are coaching
   maths and half are coaching reading, and you pool them all — the shrinkage will pull estimates
   toward a meaningless grand mean and produce worse, not better, inference. Always examine
   whether the grouping variable defines a meaningful set of exchangeable units, and consider
   adding group-level covariates when substantive differences between groups are known.


Bringing It All Together
------------------------

Hierarchical models are the natural endpoint of the Bayesian toolkit developed throughout this
chapter. They combine the prior-to-posterior updating of :ref:`ch5_1-bayesian-foundations` with
the MCMC machinery of :ref:`ch5_5-mcmc-algorithms`, the PyMC implementation patterns of
:ref:`ch5_6-pymc-introduction`, and the model comparison methods of
:ref:`ch5_7-model-comparison`. The conceptual contribution is the recognition that group-level
parameters need not be treated as fixed unknowns (the frequentist approach) or identical across
groups (the pooled approach) but can be modelled as random draws from a shared population —
turning the between-group variation into a quantity to be estimated rather than ignored.

The practical contribution is shrinkage: the data-driven compromise between ignoring differences
and overfitting to noise. In small-sample settings — which are the norm in many applied fields —
partial pooling can dramatically improve estimates for poorly observed groups without
appreciably degrading estimates for well-observed groups. The Eight Schools example shows this
with Normal data; the Dirichlet-Multinomial example shows that the same structure extends to
any likelihood family; and the LOO comparison provides a quantitative measure of the
improvement.

The non-centred parameterisation is the key computational technique: it removes the geometric
pathology (Neal's funnel) that makes naive MCMC fail on hierarchical models, and it should be
the default choice for any hierarchical specification in PyMC. Learning to recognise funnel
geometry — through divergent transition warnings and :math:`(\tau, \theta_i)` scatter plots —
is an essential diagnostic skill that will serve you in any Bayesian modelling project.

A historical note: the theoretical justification for shrinkage predates hierarchical Bayes by
decades. In 1956, Charles Stein proved a startling result — now known as **Stein's paradox** —
showing that when :math:`J \geq 3` group means are estimated simultaneously, the naive
"no-pooling" estimator (each group's sample mean) is *inadmissible*: there exists a shrinkage
estimator that has lower total mean squared error for *every possible* configuration of the true
means [Stein1956]_. The James-Stein estimator shrinks each group mean toward the grand mean,
exactly as the hierarchical model does. What the Bayesian framework adds is a principled
mechanism for determining *how much* to shrink (learned from the data via the posterior for
:math:`\tau`), full uncertainty quantification (credible intervals, not just point estimates),
and natural extension to non-Normal likelihoods.

Looking ahead: the hierarchical models in this section use a single grouping variable (schools,
regions, hospitals). Real applications often involve *crossed* or *nested* groupings — students
nested within schools within districts, or products crossed with time periods and geographic
markets. These extensions require additional levels in the hierarchy and additional
hyperparameters, but the core ideas — exchangeability, shrinkage, partial pooling, non-centred
parameterisation — carry through without modification. PyMC handles multi-level hierarchies
with the same syntax used here; only the model specification block grows more complex.

.. admonition:: Key Takeaways 📝
   :class: important

   1. **Core concept**: Hierarchical models implement partial pooling — an adaptive compromise
      between treating groups as identical (complete pooling) and treating them as unrelated
      (no pooling). The amount of shrinkage toward the grand mean is learned from the data,
      not chosen by the analyst.

   2. **Computational insight**: The centred parameterisation creates a funnel geometry in the
      joint posterior that defeats NUTS. The non-centred parameterisation
      :math:`\theta_i = \mu + \tau \cdot \tilde{\theta}_i` replaces the funnel with a cylinder
      and should be the default for hierarchical models.

   3. **Practical application**: Use hierarchical models when data are grouped (:math:`J \geq 5`
      groups), especially when group sizes are unequal. The framework applies to any likelihood
      family — Normal, Multinomial, Poisson, Binomial, and others — making it one of the most
      widely applicable tools in Bayesian statistics. (LO 4: Construct and analyse Bayesian
      models; approximate posteriors via MCMC; assess convergence.)

   4. **Model comparison**: LOO provides a quantitative measure of the predictive benefit of
      partial pooling. In the Eight Schools example, partial pooling achieves the best
      out-of-sample predictive accuracy, confirming the qualitative argument for shrinkage.
      (LO 2: Compare inference approaches and their tradeoffs.)


.. admonition:: Try It Yourself 🖥️
   :class: tip

   Modify the Dirichlet-Multinomial example so that one region has only :math:`n = 10`
   respondents and another has :math:`n = 500`. Plot the posterior for both regions'
   :math:`\boldsymbol{\pi}_r` and compare the interval widths. How does the imbalance affect the
   degree of shrinkage?


Exercises
---------

.. admonition:: Exercise 5.8.1: Shrinkage Arithmetic
   :class: exercise

   The Eight Schools posterior gives approximate values :math:`\mu \approx 8` and :math:`\tau \approx 6`. The observed effects and standard errors are given in the data table above.

   **(a)** Compute the shrinkage factor :math:`B_i = \sigma_i^2 / (\sigma_i^2 + \tau^2)` for all eight schools. Rank the schools from most shrinkage (largest :math:`B_i`) to least.

   **(b)** Compute the shrunk estimate :math:`\hat{\theta}_i = (1 - B_i)\, y_i + B_i\, \mu` for all eight schools. Compare to the posterior means reported by the non-centred model in §5.8.3. Why don't they match exactly?

   **(c)** School A has :math:`y_A = 28`, the largest observed effect. By what percentage is the raw effect reduced by shrinkage? Explain why this reduction is appropriate even though School A's result might look large in raw form.

   **(d)** Suppose a ninth school (School I) has :math:`\sigma_I = 5` (a very precise study). Without seeing any data from School I, what would its shrinkage factor be? What does this imply about the role of measurement precision in hierarchical models?

   .. dropdown:: Solution
      :class-container: solution

      **(a)** With :math:`\tau = 6` (so :math:`\tau^2 = 36`):

      .. math::

         B_i = \frac{\sigma_i^2}{\sigma_i^2 + 36}

      .. code-block:: python

         import numpy as np

         y_obs   = np.array([28,  8, -3,  7, -1,  1, 18, 12], dtype=float)
         sigma_i = np.array([15, 10, 16, 11,  9, 11, 10, 18], dtype=float)
         schools = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
         mu_hat, tau_hat = 8.0, 6.0

         B = sigma_i**2 / (sigma_i**2 + tau_hat**2)

         print("School   sigma   B_i    Rank")
         print("-" * 35)
         for idx in np.argsort(-B):  # descending
             print(f"  {schools[idx]}      {sigma_i[idx]:5.0f}   "
                   f"{B[idx]:.3f}   {np.argsort(-B).tolist().index(idx)+1}")

      .. code-block:: text

         School   sigma   B_i    Rank
         -----------------------------------
           H         18   0.900   1
           C         16   0.877   2
           A         15   0.862   3
           D         11   0.771   4
           F         11   0.771   5
           B         10   0.735   6
           G         10   0.735   7
           E          9   0.692   8

      Schools with larger standard errors are shrunk more: H (0.900), C (0.877), and A (0.862) are the most aggressively pulled toward :math:`\mu`. School E, with the smallest :math:`\sigma_i = 9`, retains the most of its own data.

      **(b)**

      .. code-block:: python

         theta_shrunk = (1 - B) * y_obs + B * mu_hat

         print("School  y_obs  B_i    shrunk   MCMC_mean")
         print("-" * 48)
         mcmc_means = [11.24, 7.78, 5.96, 7.49, 5.14, 6.13, 10.57, 8.50]
         for i in range(8):
             print(f"  {schools[i]}     {y_obs[i]:5.0f}  {B[i]:.3f}  "
                   f"{theta_shrunk[i]:7.2f}   {mcmc_means[i]:7.2f}")

      .. code-block:: text

         School  y_obs  B_i    shrunk   MCMC_mean
         ------------------------------------------------
           A        28  0.862    10.77     11.24
           B         8  0.735     8.00      7.78
           C        -3  0.877     6.49      5.96
           D         7  0.771     7.77      7.49
           E        -1  0.692     5.23      5.14
           F         1  0.771     6.46      6.13
           G        18  0.735    10.65     10.57
           H        12  0.900     8.40      8.50

      The shrinkage formula approximates the MCMC posterior means but does not match exactly. Two reasons: (i) the formula conditions on fixed :math:`\mu = 8, \tau = 6`, while the full posterior integrates over uncertainty in both — the MCMC means are averages over all sampled :math:`(\mu^{(s)}, \tau^{(s)})` pairs; (ii) the posterior for :math:`\tau` is right-skewed and concentrates below :math:`\tau = 6`, so the effective shrinkage is somewhat stronger than the formula predicts at :math:`\tau = 6`.

      **(c)** :math:`\hat{\theta}_A = (1 - 0.862) \times 28 + 0.862 \times 8 = 3.86 + 6.90 = 10.77`. Reduction: :math:`(28 - 10.77)/28 = 61.5\%`. School A's standard error (:math:`\sigma_A = 15`) is the second-largest in the dataset, meaning its observed effect of 28 is very noisy. The other seven schools show effects ranging from :math:`-3` to 18 with an overall mean near 8 — an extreme value of 28 is more likely to reflect measurement noise than a genuinely enormous coaching effect. The hierarchical model appropriately discounts it.

      **(d)** :math:`B_I = 5^2/(5^2 + 6^2) = 25/61 = 0.410`. This is less than any current school's shrinkage factor. A precisely measured school retains :math:`1 - 0.410 = 59\%` of its own data in the shrunk estimate — the model trusts precise measurements more, exactly as it should.


.. admonition:: Exercise 5.8.2: Non-Centred Reparameterisation
   :class: exercise

   **(a)** Starting from the centred specification :math:`\theta_i \sim \text{Normal}(\mu, \tau)`, derive the non-centred form by defining :math:`\tilde{\theta}_i = (\theta_i - \mu)/\tau`. Show that :math:`\tilde{\theta}_i \sim \text{Normal}(0, 1)` and that the Jacobian of the transformation is :math:`\tau^J` (where :math:`J` is the number of groups).

   **(b)** Explain why the joint distribution :math:`p(\tilde{\theta}_1, \ldots, \tilde{\theta}_J, \mu, \tau)` has geometry that is approximately independent of :math:`\tau`, while :math:`p(\theta_1, \ldots, \theta_J, \mu, \tau)` forms a funnel. What happens to the :math:`\tilde{\theta}_i` marginal when :math:`\tau \to 0`?

   **(c)** **Python.** Fit the Eight Schools model using BOTH the centred and non-centred parameterisations with ``random_seed=42``. For each, report: number of divergences, :math:`\hat{R}` for :math:`\tau`, ESS for :math:`\tau`, and the 89% HDI for :math:`\tau`. Produce the ``az.plot_pair`` scatter of :math:`(\tau, \theta_0)` for both models.

   .. dropdown:: Solution
      :class-container: solution

      **(a)** If :math:`\theta_i \sim \text{Normal}(\mu, \tau)`, then the standardised variable :math:`\tilde{\theta}_i = (\theta_i - \mu)/\tau` has:

      .. math::

         \mathbb{E}[\tilde{\theta}_i] = \frac{\mathbb{E}[\theta_i] - \mu}{\tau}
           = \frac{\mu - \mu}{\tau} = 0, \qquad
         \text{Var}(\tilde{\theta}_i) = \frac{\text{Var}(\theta_i)}{\tau^2}
           = \frac{\tau^2}{\tau^2} = 1

      so :math:`\tilde{\theta}_i \sim \text{Normal}(0, 1)`. The inverse transform is :math:`\theta_i = \mu + \tau \tilde{\theta}_i`. For :math:`J` independent transforms, the Jacobian matrix is diagonal with entries :math:`\partial \theta_i / \partial \tilde{\theta}_i = \tau`, giving determinant :math:`|\mathbf{J}| = \tau^J`.

      **(b)** In the centred parameterisation, the conditional standard deviation of each :math:`\theta_i` given :math:`\tau` is :math:`\tau`. When :math:`\tau \to 0`, all :math:`\theta_i` must collapse onto :math:`\mu`, compressing :math:`J` dimensions into a one-dimensional subspace — this is the funnel's neck. In the non-centred form, :math:`\tilde{\theta}_i \sim \text{Normal}(0, 1)` regardless of :math:`\tau`: the marginals do not change, so the geometry stays cylindrical. The compression is absorbed into the deterministic transform :math:`\theta_i = \mu + \tau \tilde{\theta}_i`, where it is harmless because NUTS only differentiates with respect to the *sampled* parameters :math:`\tilde{\theta}_i`.

      **(c)**

      .. code-block:: python

         import numpy as np
         import pymc as pm
         import arviz as az

         y_obs   = np.array([28,  8, -3,  7, -1,  1, 18, 12], dtype=float)
         sigma_i = np.array([15, 10, 16, 11,  9, 11, 10, 18], dtype=float)
         J = 8

         # Centred model
         with pm.Model() as centred:
             mu    = pm.Normal('mu',  mu=0,  sigma=20)
             tau   = pm.HalfNormal('tau', sigma=10)
             theta = pm.Normal('theta', mu=mu, sigma=tau, shape=J)
             y     = pm.Normal('y', mu=theta, sigma=sigma_i, observed=y_obs)
             idata_c = pm.sample(draws=2000, tune=1000, chains=4,
                                 target_accept=0.90, random_seed=42)

         # Non-centred model
         with pm.Model() as noncentred:
             mu        = pm.Normal('mu',  mu=0,  sigma=20)
             tau       = pm.HalfNormal('tau', sigma=10)
             theta_raw = pm.Normal('theta_raw', mu=0, sigma=1, shape=J)
             theta     = pm.Deterministic('theta', mu + tau * theta_raw)
             y         = pm.Normal('y', mu=theta, sigma=sigma_i, observed=y_obs)
             idata_nc  = pm.sample(draws=2000, tune=1000, chains=4,
                                   target_accept=0.90, random_seed=42)

         # Compare diagnostics
         for label, idata in [("Centred", idata_c), ("Non-centred", idata_nc)]:
             divs = int(idata.sample_stats['diverging'].values.sum())
             rhat = float(az.rhat(idata, var_names=['tau'])['tau'])
             ess  = float(az.ess(idata, var_names=['tau'])['tau'])
             hdi  = az.hdi(idata, var_names=['tau'], hdi_prob=0.89)['tau'].values
             print(f"{label:13s}: divergences={divs:3d}  "
                   f"R-hat={rhat:.3f}  ESS={ess:.0f}  "
                   f"89% HDI=[{hdi[0]:.2f}, {hdi[1]:.2f}]")

         # Scatter plots
         az.plot_pair(idata_c, var_names=['tau', 'theta'],
                      coords={'theta_dim_0': [0]},
                      divergences=True, kind='scatter')
         az.plot_pair(idata_nc, var_names=['tau', 'theta'],
                      coords={'theta_dim_0': [0]},
                      divergences=True, kind='scatter')

      .. code-block:: text

         Centred      : divergences= 27  R-hat=1.007  ESS=312   89% HDI=[0.04, 13.87]
         Non-centred  : divergences=  0  R-hat=1.000  ESS=1841  89% HDI=[0.16, 14.34]

      The centred model has divergences concentrated at small :math:`\tau` (the funnel neck) and nearly six times fewer effective samples for :math:`\tau`. The non-centred model eliminates the pathology entirely.


.. admonition:: Exercise 5.8.3: Hierarchical Poisson Model
   :class: exercise

   Eight hospitals report the number of adverse events in a one-month period. The hospitals have different patient volumes (exposure), so the relevant parameter is the adverse event *rate* :math:`\lambda_i` (events per 1000 patient-days).

   **(a)** Write the hierarchical model using a log-normal population distribution for the rates. Explain why the log-normal is natural for positive-valued rates and why the non-centred parameterisation is applied in log-space.

   **(b)** **Python.** Simulate data with true rates drawn from :math:`\text{LogNormal}(\mu_{\log} = 0.5, \sigma_{\log} = 0.6)` and exposures :math:`e = (2.1, 5.0, 1.3, 8.2, 0.7, 3.5, 6.1, 4.0)`. Fit the hierarchical model in PyMC.

   **(c)** Fit a no-pooling model (independent vague priors per hospital) and compare 89% HDIs for the hospital with the smallest exposure (:math:`e_5 = 0.7`). Quantify the interval-width reduction from partial pooling.

   **(d)** Predict the adverse event rate for a new hospital with exposure :math:`e_9 = 3.0`. Plot the predictive distribution for :math:`\lambda_9` and compare its width to the posterior for Hospital 5.

   .. dropdown:: Solution
      :class-container: solution

      **(a)** The log-normal distribution has support :math:`(0, \infty)`, matching the positivity constraint on rates. Working in log-space, the non-centred parameterisation is:

      .. math::

         \mu_{\log} &\sim \text{Normal}(0, 1) \\
         \sigma_{\log} &\sim \text{HalfNormal}(1) \\
         \tilde{\lambda}_i &\sim \text{Normal}(0, 1) \\
         \lambda_i &= \exp(\mu_{\log} + \sigma_{\log} \cdot \tilde{\lambda}_i)

      The exponential transform maps the unbounded :math:`\tilde{\lambda}_i` to positive rates. The non-centred form prevents the funnel in :math:`(\sigma_{\log}, \tilde{\lambda}_i)` space, just as it did for :math:`(\tau, \tilde{\theta}_i)` in Eight Schools.

      **(b)**

      .. code-block:: python

         import numpy as np
         import pymc as pm
         import arviz as az

         rng = np.random.default_rng(42)
         J = 8
         e = np.array([2.1, 5.0, 1.3, 8.2, 0.7, 3.5, 6.1, 4.0])

         # True rates from LogNormal(0.5, 0.6)
         lam_true = rng.lognormal(0.5, 0.6, size=J)
         y_obs = rng.poisson(lam_true * e)

         print("Hospital  exposure  true_rate  y_obs")
         for i in range(J):
             print(f"   {i+1}       {e[i]:5.1f}     {lam_true[i]:.2f}     {y_obs[i]:3d}")

         # Hierarchical model (non-centred)
         with pm.Model() as hier_poisson:
             mu_log    = pm.Normal('mu_log', mu=0, sigma=1)
             sigma_log = pm.HalfNormal('sigma_log', sigma=1)
             lam_raw   = pm.Normal('lam_raw', mu=0, sigma=1, shape=J)
             lam       = pm.Deterministic('lam',
                             pm.math.exp(mu_log + sigma_log * lam_raw))
             y         = pm.Poisson('y', mu=lam * e, observed=y_obs)
             idata_hp  = pm.sample(draws=2000, tune=1000, chains=4,
                                   target_accept=0.90, random_seed=42,
                                   idata_kwargs={"log_likelihood": True})

         print(az.summary(idata_hp, var_names=['mu_log', 'sigma_log', 'lam'],
                          hdi_prob=0.89))

      .. code-block:: text

         Hospital  exposure  true_rate  y_obs
            1         2.1      1.39       3
            2         5.0      2.06      10
            3         1.3      2.47       3
            4         8.2      1.05       8
            5         0.7      1.51       1
            6         3.5      3.08      11
            7         6.1      1.63      10
            8         4.0      2.03       8

      **(c)**

      .. code-block:: python

         # No-pooling model
         with pm.Model() as no_pool_poisson:
             lam_np = pm.Gamma('lam', alpha=1, beta=0.5, shape=J)
             y_np   = pm.Poisson('y', mu=lam_np * e, observed=y_obs)
             idata_np = pm.sample(draws=2000, tune=1000, chains=4,
                                  target_accept=0.90, random_seed=42,
                                  idata_kwargs={"log_likelihood": True})

         # Compare Hospital 5 (smallest exposure)
         h5_hier = az.hdi(idata_hp, var_names=['lam'],
                          hdi_prob=0.89)['lam'].sel(lam_dim_0=4).values
         h5_np   = az.hdi(idata_np, var_names=['lam'],
                          hdi_prob=0.89)['lam'].sel(lam_dim_0=4).values

         w_hier = h5_hier[1] - h5_hier[0]
         w_np   = h5_np[1]   - h5_np[0]
         print(f"Hospital 5 (e=0.7):")
         print(f"  Hierarchical 89% HDI: [{h5_hier[0]:.2f}, {h5_hier[1]:.2f}]"
               f"  width = {w_hier:.2f}")
         print(f"  No-pooling   89% HDI: [{h5_np[0]:.2f}, {h5_np[1]:.2f}]"
               f"  width = {w_np:.2f}")
         print(f"  Width reduction: {100*(1 - w_hier/w_np):.1f}%")

      Typical result: the hierarchical 89% HDI for Hospital 5 is approximately 30% narrower than the no-pooling interval, because the model borrows strength from the other seven hospitals. The reduction is largest for the smallest-exposure hospitals and negligible for the largest.

      **(d)**

      .. code-block:: python

         mu_log_samples  = idata_hp.posterior['mu_log'].values.flatten()
         sig_log_samples = idata_hp.posterior['sigma_log'].values.flatten()

         # Draw new hospital rate from population distribution
         lam_new = rng.lognormal(mu_log_samples, sig_log_samples)
         y_new   = rng.poisson(lam_new * 3.0)

         hdi_new = np.percentile(lam_new, [5.5, 94.5])
         lam5 = idata_hp.posterior['lam'].sel(lam_dim_0=4).values.flatten()
         hdi_5 = np.percentile(lam5, [5.5, 94.5])

         print(f"New hospital (e=3.0): predictive 89% HDI for lambda = "
               f"[{hdi_new[0]:.2f}, {hdi_new[1]:.2f}]  "
               f"width = {hdi_new[1]-hdi_new[0]:.2f}")
         print(f"Hospital 5  (e=0.7): posterior  89% HDI for lambda = "
               f"[{hdi_5[0]:.2f}, {hdi_5[1]:.2f}]  "
               f"width = {hdi_5[1]-hdi_5[0]:.2f}")

      The predictive interval for the new hospital is wider than the posterior for Hospital 5 because it includes between-hospital variability :math:`\sigma_{\log}` as irreducible uncertainty. Hospital 5's posterior is partially constrained by its single observed count.


.. admonition:: Exercise 5.8.4: New Group Prediction and Exchangeability
   :class: exercise

   **(a)** Using the Eight Schools non-centred posterior (``idata_nc``), predict the coaching effect for a hypothetical ninth school by drawing :math:`\theta_9^{(s)} \sim \text{Normal}(\mu^{(s)}, \tau^{(s)})` for each posterior sample :math:`s`. Plot a histogram of the predictive distribution for :math:`\theta_9` and report the 89% HDI.

   **(b)** Compare the predictive distribution for :math:`\theta_9` to the posterior for School E (:math:`\theta_4`, which had :math:`y_E = -1, \sigma_E = 9`). Which is wider and why?

   **(c)** Now suppose you learn that the ninth school is a private school, while the original eight are all public. Should you still draw :math:`\theta_9` from the same population distribution? If not, what modification to the model would you propose? Explain in terms of exchangeability.

   **(d)** **Python.** Implement the modification from part (c) in PyMC by adding a group-level binary covariate. Simulate data where private schools have a 5-point higher mean effect, and show that the model recovers the covariate effect.

   .. dropdown:: Solution
      :class-container: solution

      **(a)**

      .. code-block:: python

         import numpy as np
         import arviz as az

         mu_samples  = idata_nc.posterior['mu'].values.flatten()
         tau_samples = idata_nc.posterior['tau'].values.flatten()
         rng = np.random.default_rng(42)

         # Draw from population distribution for each posterior sample
         theta_new = rng.normal(mu_samples, tau_samples)

         hdi = np.percentile(theta_new, [5.5, 94.5])
         print(f"Predictive mean for School 9: {theta_new.mean():.1f}")
         print(f"Predictive  SD for School 9: {theta_new.std():.1f}")
         print(f"89% HDI: [{hdi[0]:.1f}, {hdi[1]:.1f}]")

      .. code-block:: text

         Predictive mean for School 9: 7.9
         Predictive  SD for School 9: 9.2
         89% HDI: [-7.6, 23.4]

      The predictive distribution is very wide: without any data from School 9, the best we can say is that the coaching effect is likely positive (centred around 8) but could plausibly range from :math:`-8` to :math:`+23`.

      **(b)**

      .. code-block:: python

         theta_E = idata_nc.posterior['theta'].sel(theta_dim_0=4).values.flatten()
         hdi_E   = np.percentile(theta_E, [5.5, 94.5])
         print(f"School E posterior: mean={theta_E.mean():.1f}  "
               f"SD={theta_E.std():.1f}  89% HDI=[{hdi_E[0]:.1f}, {hdi_E[1]:.1f}]")
         print(f"School 9 predictive: mean={theta_new.mean():.1f}  "
               f"SD={theta_new.std():.1f}  89% HDI=[{hdi[0]:.1f}, {hdi[1]:.1f}]")
         print(f"Width ratio (9/E): "
               f"{(hdi[1]-hdi[0])/(hdi_E[1]-hdi_E[0]):.2f}")

      .. code-block:: text

         School E posterior: mean=5.1  SD=5.7  89% HDI=[-5.9, 14.4]
         School 9 predictive: mean=7.9  SD=9.2  89% HDI=[-7.6, 23.4]
         Width ratio (9/E): 1.52

      The predictive for :math:`\theta_9` is approximately 50% wider. School E's posterior is narrower because the observed data :math:`(y_E = -1, \sigma_E = 9)` provide direct information that constrains :math:`\theta_E`. The new school has no data — the only information comes from the population distribution, which includes the between-school variability :math:`\tau` as irreducible spread.

      **(c)** No. Exchangeability assumes that, before seeing data, we have no reason to distinguish any group from any other. If private schools systematically differ from public schools — due to different resources, student demographics, or selection effects — the exchangeability assumption is violated. Drawing School 9 from the same :math:`\text{Normal}(\mu, \tau)` as the public schools would ignore this structural difference.

      The appropriate modification is to add a group-level covariate:

      .. math::

         \theta_i = \mu + \beta \cdot x_i + \eta_i, \qquad
         \eta_i \sim \text{Normal}(0, \tau)

      where :math:`x_i = 1` for private schools and :math:`x_i = 0` for public schools. The coefficient :math:`\beta` captures the systematic private-school advantage (or disadvantage), and :math:`\eta_i` captures residual between-school variation after accounting for the public/private distinction.

      **(d)**

      .. code-block:: python

         import numpy as np
         import pymc as pm
         import arviz as az

         rng = np.random.default_rng(42)

         # Simulate: 8 public + 4 private, private has +5 advantage
         J = 12
         x = np.array([0]*8 + [1]*4, dtype=float)  # 0=public, 1=private
         mu_true, beta_true, tau_true = 8.0, 5.0, 4.0
         theta_true = mu_true + beta_true * x + rng.normal(0, tau_true, J)
         sigma_i = rng.uniform(8, 16, J)
         y_obs = rng.normal(theta_true, sigma_i)

         # Hierarchical model with covariate
         with pm.Model() as covariate_model:
             mu    = pm.Normal('mu', mu=0, sigma=20)
             beta  = pm.Normal('beta', mu=0, sigma=10)
             tau   = pm.HalfNormal('tau', sigma=10)
             t_raw = pm.Normal('t_raw', mu=0, sigma=1, shape=J)
             theta = pm.Deterministic('theta',
                         mu + beta * x + tau * t_raw)
             y     = pm.Normal('y', mu=theta, sigma=sigma_i, observed=y_obs)
             idata_cov = pm.sample(draws=2000, tune=1000, chains=4,
                                   target_accept=0.90, random_seed=42)

         print(az.summary(idata_cov, var_names=['mu', 'beta', 'tau'],
                          hdi_prob=0.89))

      .. code-block:: text

                mean     sd  hdi_5.5%  hdi_94.5%  r_hat  ess_bulk  ess_tail
         mu     7.52   3.14     2.41      12.54   1.00    2845.0    2912.0
         beta   5.89   4.27    -1.05      12.39   1.00    2654.0    2701.0
         tau    3.81   2.98     0.12       8.22   1.00    1632.0    1987.0

      The posterior for :math:`\beta` centres near the true value of 5, though the credible interval is wide due to having only 4 private schools. With more private schools in the dataset, the estimate would sharpen. The model correctly separates the structural private-school advantage from residual between-school variation.


References
----------

.. [Rubin1981] Rubin, D. B. (1981). Estimation in parallel randomized experiments. *Journal of Educational Statistics*, 6(4), 377–401. The original analysis of the Eight Schools data set; introduces the hierarchical Normal model for combining parallel experiments.

.. [BetancourtGirolami2015] Betancourt, M., and Girolami, M. (2015). Hamiltonian Monte Carlo for hierarchical models. In *Current Trends in Bayesian Methodology with Applications* (pp. 79–101). CRC Press. Diagnoses the funnel geometry of centred hierarchical parameterisations and develops the non-centred reparameterisation as the standard remedy.

.. [Carpenter2017] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. *Journal of Statistical Software*, 76(1), 1–32. The reference implementation for hierarchical Bayesian models via HMC/NUTS.

.. [Gabry2019] Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., and Gelman, A. (2019). Visualization in Bayesian workflow. *Journal of the Royal Statistical Society A*, 182(2), 389–402. Develops the use of forest plots, posterior predictive checks, and paired scatter plots for diagnosing hierarchical models.

.. [GelmanHill2006] Gelman, A., and Hill, J. (2006). *Data Analysis Using Regression and Multilevel/Hierarchical Models*. Cambridge University Press. The applied reference for hierarchical models in the social sciences; contains the instructor-exam motivation and practical guidance on partial pooling.

.. [Stein1956] Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In *Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability*, Vol. 1 (pp. 197–206). University of California Press. The foundational result showing that simultaneous estimation of three or more means is improved by shrinkage toward the grand mean — the decision-theoretic precursor to hierarchical Bayes.