Part III: Bayesian Inference

What should I believe given this evidence? The Bayesian asks a fundamentally different question than the frequentist. Where Part II asked about hypothetical repetitions—“how would this procedure perform if we applied it again and again?”—the Bayesian asks about rational belief in the face of uncertainty. The answer comes through Bayes’ theorem: prior beliefs, updated by data through the likelihood, yield posterior beliefs. This simple formula—known for over 250 years—generates a complete framework for learning from data.

The Bayesian interpretation treats probability as a measure of uncertainty about unknown quantities. Parameters are not fixed constants to be estimated through repeated-sampling arguments but uncertain quantities whose distributions encode what we know—and don’t know—about them. Before seeing data, we encode our knowledge (or ignorance) in a prior distribution. The data speak through the likelihood—the same likelihood that drove maximum likelihood estimation in Chapter 3. Bayes’ theorem combines these mechanically:

\[\underbrace{p(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{p(\text{data} \mid \theta)}_{\text{likelihood}} \times \underbrace{p(\theta)}_{\text{prior}}\]

The posterior distribution is the complete answer: point estimates (posterior mean or mode), uncertainty (credible intervals), predictions (posterior predictive distribution), and model comparisons all flow from this single object. There is no separate theory for each inferential task—everything is a calculation from the posterior.
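The mechanics can be made concrete with a minimal discrete sketch (the numbers here are hypothetical, chosen only for illustration): a coin whose bias is either 0.5 or 0.8 with equal prior probability, updated after observing 6 heads in 8 flips.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical example: the coin's bias theta is either 0.5 (fair) or
# 0.8 (biased), with equal prior probability. We observe 6 heads in 8 flips.
thetas = np.array([0.5, 0.8])
prior = np.array([0.5, 0.5])

likelihood = binom.pmf(6, n=8, p=thetas)       # p(data | theta)
unnormalized = likelihood * prior              # proportional to the posterior
posterior = unnormalized / unnormalized.sum()  # normalize to sum to 1

print(posterior)  # posterior probability of each candidate theta
```

Note that normalization happens only at the end: the proportionality in Bayes' theorem is all the update itself requires.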

The Posterior as the Central Object

Bayesian inference replaces the frequentist’s sampling distribution with the posterior distribution. Where Part II constructed confidence intervals by reasoning about estimator behavior across hypothetical samples, Part III constructs credible intervals by computing probability statements directly from \(p(\theta \mid \text{data})\). Where Part II assessed models by their repeated-sampling prediction error, Part III assesses models by their posterior predictive performance. The shift from “What would happen across many datasets?” to “What do these data tell us?” changes the computational target but preserves the fundamental goal: rigorous, quantified uncertainty.

Prior Information as Formal Input

The most distinctive feature of Bayesian analysis is the prior distribution. Prior information—from previous studies, expert knowledge, physical constraints, or deliberate agnosticism—enters the analysis formally rather than being ignored or hidden in informal modeling choices. This explicitness is both strength and vulnerability: priors make assumptions transparent and auditable, but they also introduce a subjective element that demands careful justification. We develop strategies spanning the full spectrum: conjugate priors for computational tractability, weakly informative priors that regularize without dominating, and informative priors that encode genuine knowledge.


The Arc of Part III

Chapter 5: Bayesian Inference builds the complete Bayesian toolkit from philosophical foundations through computational machinery to applied methodology.

We begin with foundations and the Bayesian workflow: what it means to treat probability as belief, how Bayes’ theorem mechanizes learning from data, and why the Bayesian approach offers certain advantages (direct probability statements, coherent sequential updating, natural handling of nuisance parameters) alongside certain challenges (prior specification, computational cost). Grid approximation provides an initial computational strategy that makes the entire framework concrete.
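A grid approximation along these lines can be sketched in a few lines (the data here are hypothetical: 7 successes in 10 trials under a flat prior):

```python
import numpy as np
from scipy.stats import binom

# Grid approximation of the posterior for a binomial proportion:
# hypothetical data of 7 successes in 10 trials, uniform prior on [0, 1].
grid = np.linspace(0, 1, 1001)
dx = grid[1] - grid[0]
prior = np.ones_like(grid)                   # flat prior density
likelihood = binom.pmf(7, n=10, p=grid)      # likelihood at each grid point
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)  # approximate density

post_mean = np.sum(grid * posterior) * dx
print(post_mean)  # close to the exact Beta(8, 4) posterior mean, 8/12
```

With a flat prior this grid recovers the known Beta(8, 4) posterior, which makes it a useful check before moving to problems with no closed form.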

Prior specification and conjugate analysis is where Bayesian analysis begins—and where it invites the most scrutiny. We develop strategies for choosing priors spanning the full spectrum: conjugate priors that yield closed-form posteriors (Beta-Binomial for proportions, Normal-Normal for means, Gamma-Poisson for rates, Normal-Inverse-Gamma for joint mean-variance inference), weakly informative priors that regularize without dominating, and informative priors grounded in substantive knowledge. For each conjugate family we derive the posterior in full, connecting directly to Chapter 3’s exponential family framework, and interpret posterior means as precision-weighted averages that shrink toward the prior. We compare Bayesian and frequentist results head-to-head, showing when they agree and when they diverge. Prior predictive simulation provides a practical tool for evaluating whether prior choices generate plausible data before any observations enter the analysis, and sensitivity analysis quantifies how much conclusions depend on prior choice.
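The Beta-Binomial case shows the shrinkage interpretation directly. A brief sketch, with prior pseudo-counts and data assumed for illustration:

```python
import numpy as np

# Beta-Binomial conjugate update: a Beta(a, b) prior and k successes in
# n trials give a Beta(a + k, b + n - k) posterior in closed form.
a, b = 2.0, 2.0   # prior pseudo-counts (assumed for illustration)
k, n = 7, 10      # hypothetical data

a_post, b_post = a + k, b + (n - k)
post_mean = a_post / (a_post + b_post)

# The posterior mean is a weighted average of the prior mean and the MLE,
# with weight determined by the relative sizes of data and pseudo-counts:
prior_mean, mle = a / (a + b), k / n
weight = n / (n + a + b)
assert np.isclose(post_mean, weight * mle + (1 - weight) * prior_mean)
print(post_mean)  # 9/14, shrunk from the MLE 0.7 toward the prior mean 0.5
```

As n grows, the weight on the data approaches 1 and the posterior mean approaches the MLE, which previews the asymptotic agreement discussed under Connections below.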

Credible intervals are the Bayesian analog of confidence intervals, with a crucial interpretive difference: a 95% credible interval contains the parameter with 95% probability given our model and data. This is what most practitioners mistakenly think confidence intervals mean. We develop both equal-tailed and highest posterior density intervals, contrast them systematically with the confidence intervals and bootstrap intervals from Chapters 3–4, and examine posterior-based hypothesis assessment.
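With a closed-form posterior, an equal-tailed credible interval is just two quantiles. A sketch using the Beta(8, 4) posterior from a hypothetical 7-of-10 example with a uniform prior:

```python
from scipy.stats import beta

# 95% equal-tailed credible interval from a Beta(8, 4) posterior
# (hypothetical data: 7 successes in 10 trials, uniform prior).
a_post, b_post = 8, 4
lo, hi = beta.ppf([0.025, 0.975], a_post, b_post)
print(lo, hi)  # the parameter lies in [lo, hi] with 95% posterior probability
```

The interval is a direct probability statement about the parameter, conditional on model and data, rather than a statement about the procedure's long-run coverage.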

Markov chain theory provides the mathematical foundation for MCMC—the computational breakthrough that made modern Bayesian inference practical. We develop transition kernels, stationary distributions, detailed balance, and ergodicity. The ergodic theorem plays the same role for MCMC that the Law of Large Numbers plays for Monte Carlo: it justifies replacing intractable integrals with sample averages. Understanding why MCMC works enables diagnosing when it fails.
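The central object, the stationary distribution, can be computed directly for a toy chain (the transition matrix below is made up for illustration):

```python
import numpy as np

# A 2-state Markov chain: the stationary distribution pi satisfies
# pi @ P = pi, i.e. pi is a left eigenvector of P with eigenvalue 1.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

vals, vecs = np.linalg.eig(P.T)               # right eigenvectors of P.T
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()                            # normalize to a probability vector
print(pi)  # [0.75, 0.25]
```

Here detailed balance can be verified by hand: pi[0] * P[0, 1] = 0.75 * 0.1 equals pi[1] * P[1, 0] = 0.25 * 0.3, which is the same balance condition MCMC algorithms are engineered to satisfy for the posterior.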

MCMC algorithms turn theory into practice. The Metropolis-Hastings algorithm proposes moves and accepts based on posterior ratios—elegantly avoiding the intractable normalizing constant. Gibbs sampling exploits conditional conjugate structure. Hamiltonian Monte Carlo uses gradient information for efficient high-dimensional exploration. We implement each from scratch before introducing PyMC for practical workflows—the same “build it, then use the tool” philosophy that guided Chapter 2’s progression from hand-coded Monte Carlo to SciPy.
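A minimal random-walk Metropolis sketch shows the key point: only ratios of the target density are needed, so the normalizing constant never appears. The target here is a standard normal, known only up to a constant:

```python
import numpy as np

# Random-walk Metropolis targeting an unnormalized density (a sketch,
# not a production sampler). Step size and sample count are illustrative.
rng = np.random.default_rng(0)

def log_target(theta):
    return -0.5 * theta**2                       # standard normal, up to a constant

def metropolis(n_samples, step=1.0, theta0=0.0):
    samples = np.empty(n_samples)
    theta = theta0
    for i in range(n_samples):
        proposal = theta + step * rng.normal()   # symmetric proposal
        log_ratio = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) < log_ratio:    # accept with prob min(1, ratio)
            theta = proposal
        samples[i] = theta                       # rejected moves repeat theta
    return samples

draws = metropolis(20_000)
print(draws.mean(), draws.std())  # roughly 0 and 1 for this target
```

Because rejections repeat the current state, the draws are dependent; the diagnostics below quantify how much information they actually carry.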

Convergence diagnostics address the central practical question: has the chain run long enough? Visual diagnostics, autocorrelation analysis, effective sample size, and the Gelman-Rubin \(\hat{R}\) statistic provide complementary evidence. We develop a systematic diagnostic workflow using ArviZ.
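The idea behind \(\hat{R}\) fits in a few lines: compare between-chain and within-chain variance. A simplified hand-rolled version (without the split-chain refinement ArviZ applies), run on simulated chains:

```python
import numpy as np

# Simplified Gelman-Rubin R-hat: m chains of n draws each.
# (ArviZ's version additionally splits chains and ranks the draws.)
def rhat(chains):
    chains = np.asarray(chains)              # shape (m, n)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    B = n * chain_means.var(ddof=1)          # between-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))           # 4 chains exploring the same target
stuck = mixed + np.arange(4)[:, None]        # 4 chains stuck at different levels
print(rhat(mixed))   # close to 1: chains agree
print(rhat(stuck))   # well above 1: chains have not converged to a common target
```

Values near 1 indicate that pooling the chains adds no extra variance, which is what convergence to a common stationary distribution implies.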

Model comparison evaluates competing Bayesian models. Posterior predictive checks assess whether a model reproduces key features of observed data—the Bayesian analog of residual analysis from Chapter 3. Information criteria (WAIC, LOO-CV via Pareto-smoothed importance sampling) estimate out-of-sample predictive performance, connecting directly to the cross-validation framework from Chapter 4. Bayes factors quantify evidence ratios between models.
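A posterior predictive check can be sketched for the Beta-Binomial example (hypothetical data again): draw parameters from the posterior, simulate replicated datasets, and ask whether the observed statistic looks typical.

```python
import numpy as np
from scipy.stats import beta, binom

# Posterior predictive check: does the fitted Beta-Binomial model
# reproduce the observed success count? (Hypothetical data.)
rng = np.random.default_rng(2)
k_obs, n = 7, 10
a_post, b_post = 1 + k_obs, 1 + n - k_obs          # uniform prior -> Beta(8, 4)

theta_draws = beta.rvs(a_post, b_post, size=5000, random_state=rng)
k_rep = binom.rvs(n, theta_draws, random_state=rng)  # replicated datasets

# Posterior predictive p-value for the statistic T(y) = success count
ppp = np.mean(k_rep >= k_obs)
print(ppp)  # values near 0 or 1 would flag a feature the model misses
```

For richer models the same two-step recipe applies with any test statistic; the statistic, not the machinery, is where the modeling judgment lives.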

Hierarchical models represent Bayesian inference at its most powerful. When data arise in groups—students within schools, patients within hospitals, measurements within subjects—hierarchical models let groups borrow strength from each other through partial pooling. Small groups are regularized toward the grand mean; large groups speak for themselves. This automatic, data-driven shrinkage is the natural Bayesian framework for structured data.
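The shrinkage arithmetic can be previewed in the normal case with known variances (all numbers below, including the within- and between-group variances, are assumed for illustration):

```python
import numpy as np

# Partial pooling sketch: each group mean is shrunk toward the grand mean,
# with smaller groups shrunk more. All quantities here are illustrative.
group_means = np.array([0.80, 0.50, 0.65])   # observed group averages
group_sizes = np.array([5, 200, 50])         # observations per group
sigma2 = 1.0                                 # within-group variance (assumed known)
tau2 = 0.01                                  # between-group variance (assumed known)

grand_mean = np.average(group_means, weights=group_sizes)
weight = tau2 / (tau2 + sigma2 / group_sizes)   # weight on each group's own data
pooled = weight * group_means + (1 - weight) * grand_mean
print(pooled)  # the n=5 group moves furthest toward the grand mean
```

In a full hierarchical model, tau2 is itself estimated from the data, so the degree of pooling is learned rather than fixed in advance.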

Computational Themes

Several computational motifs recur throughout Part III, extending and transforming those from Part II:

Simulation as inference. In Part II, simulation calibrated frequentist procedures by approximating sampling distributions. In Part III, simulation is inference: MCMC samples from the posterior are the computational representation of Bayesian conclusions. The target changes from sampling distribution to posterior distribution, but the Monte Carlo machinery transfers directly.

The normalizing constant problem. Computing \(p(\text{data}) = \int p(\text{data} \mid \theta) \, p(\theta) \, d\theta\) is the fundamental computational challenge of Bayesian inference. Every method in this chapter—conjugate analysis, grid approximation, MCMC—is a strategy for working around this intractable integral.
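In one dimension the integral can still be attacked numerically, which makes the problem tangible. For a binomial likelihood with a uniform prior the marginal likelihood has the known closed form \(1/(n+1)\), so the numerical result can be checked:

```python
from scipy.integrate import quad
from scipy.stats import binom

# Marginal likelihood p(data) = ∫ p(data | θ) p(θ) dθ for a binomial
# likelihood with a uniform prior (hypothetical data: 7 of 10).
k, n = 7, 10

def integrand(t):
    return binom.pmf(k, n, t) * 1.0   # likelihood times flat prior density

p_data, _ = quad(integrand, 0, 1)
print(p_data)  # equals 1/(n+1) = 1/11 for a uniform prior
```

In higher dimensions this quadrature strategy collapses, which is precisely why conjugacy (which sidesteps the integral analytically) and MCMC (which never computes it) dominate practice.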

Convergence and diagnostics. MCMC chains produce dependent, not independent, samples. Effective sample size, mixing behavior, and convergence diagnostics replace the simpler Monte Carlo error analysis of Chapter 2. The margin between “it worked” and “it didn’t” requires careful, principled assessment.

Probabilistic programming. PyMC and ArviZ provide a high-level interface for specifying Bayesian models and analyzing their output. The transition from hand-coded algorithms to probabilistic programming mirrors the trajectory of scientific computing: understanding the fundamentals enables effective use of powerful abstractions.

Shrinkage as a unifying principle. Bayesian posteriors shrink estimates toward the prior mean. Hierarchical models shrink group estimates toward the grand mean. This connects to ridge regression (Chapter 3), James-Stein estimation, and the bias-variance tradeoff that pervades modern statistics.

Connections

Part I: Foundations introduces Bayes’ theorem as a mathematical identity and the Bayesian interpretation of probability as one of several competing philosophies. The distributions catalogued there reappear as priors and posteriors. The exponential family structure from Part I provides the scaffolding for conjugate analysis.

Part II: Frequentist Inference offers the contrasting paradigm. Frequentist and Bayesian methods answer different questions—“How reliable is this procedure?” versus “What should I believe?”—and the complete data scientist understands both. Likelihood functions, central to both paradigms, provide common ground: MLE emerges as the posterior mode under a flat prior, and the Bernstein-von Mises theorem shows that Bayesian and frequentist inferences converge asymptotically. Monte Carlo intuition from Chapter 2 transfers directly to MCMC. Cross-validation from Chapter 4 reappears as LOO-CV for Bayesian model comparison.

Part IV: LLMs in Data Science extends Bayesian thinking to new domains. Uncertainty quantification—central to Bayesian inference—becomes critical when assessing LLM reliability. Bayesian methods appear explicitly in calibration, active learning for annotation, and probabilistic retrieval. The capstone project integrates Bayesian analysis with modern data science workflows.