Part III: Bayesian Inference
What should I believe given this evidence? The Bayesian asks a different question than the frequentist. Where the frequentist asks about hypothetical repetitions, the Bayesian asks about rational belief. The answer comes through Bayes’ theorem: prior beliefs, updated by data through the likelihood, yield posterior beliefs. This simple formula—known for over 250 years—generates a complete framework for learning from data.
The Bayesian interpretation treats probability as a measure of uncertainty. Parameters are not fixed constants to be estimated but uncertain quantities to be learned about. Before seeing data, we encode our knowledge (or ignorance) in a prior distribution. The data speak through the likelihood. Bayes’ theorem combines these mechanically:
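$$
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \;\propto\; p(y \mid \theta)\, p(\theta),
$$

where $\theta$ denotes the parameters, $y$ the data, $p(\theta)$ the prior, $p(y \mid \theta)$ the likelihood, and $p(\theta \mid y)$ the posterior.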
The posterior distribution tells us everything we need: point estimates (posterior mean or mode), uncertainty (credible intervals), and predictions (posterior predictive distribution). All flow from this single object.
This philosophical shift has profound practical consequences. We can make direct probability statements: “There is a 95% probability that θ lies between 2.3 and 4.7.” We can incorporate prior information formally—from previous studies, expert knowledge, or physical constraints. We can update beliefs sequentially as new data arrive. And we can handle complex models with many parameters, borrowing strength across groups through hierarchical structures.
The Arc of Part III
Chapter 5 builds Bayesian inference from foundations through computation to application.
We begin with philosophy and foundations: what it means to treat probability as belief, how Bayes’ theorem mechanizes learning, and why the Bayesian approach offers certain advantages (direct probability statements, coherent updating, natural handling of nuisance parameters) alongside certain challenges (prior specification, computational cost).
Prior specification is where Bayesian analysis begins. We develop strategies for choosing priors: conjugate priors for computational convenience, weakly informative priors that regularize without dominating, and informative priors that encode genuine knowledge. We discuss prior sensitivity—how much do conclusions depend on prior choice?
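A minimal sketch of such a sensitivity check, assuming a Beta-Binomial setting with hypothetical data (7 successes in 20 trials) and three illustrative priors, compares the resulting posterior means:

```python
# Prior sensitivity sketch for a Beta-Binomial model (illustrative values).
from scipy import stats

successes, trials = 7, 20  # hypothetical data

priors = {
    "flat Beta(1, 1)": (1, 1),
    "weakly informative Beta(2, 2)": (2, 2),
    "informative Beta(10, 30)": (10, 30),
}

for name, (a, b) in priors.items():
    # Conjugate update: Beta prior + Binomial data -> Beta posterior.
    posterior = stats.beta(a + successes, b + trials - successes)
    print(f"{name}: posterior mean = {posterior.mean():.3f}")
```

If the conclusions barely move across reasonable priors, the data dominate; if they move substantially, the prior deserves scrutiny and explicit justification.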
Conjugate models provide analytical posteriors: Beta-Binomial for proportions, Normal-Normal for means, Gamma-Poisson for rates. These closed-form solutions build intuition before we need numerical methods and remain useful as building blocks in larger models.
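For example, with a Beta prior on a proportion and Binomial data, the posterior is again a Beta distribution:

$$
\theta \sim \mathrm{Beta}(a, b), \quad y \mid \theta \sim \mathrm{Binomial}(n, \theta)
\;\;\Longrightarrow\;\;
\theta \mid y \sim \mathrm{Beta}(a + y,\; b + n - y).
$$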
Credible intervals are the Bayesian analog of confidence intervals—but with a crucial interpretive difference. A 95% credible interval contains the parameter with 95% probability (given our model and data). This is what most people mistakenly think confidence intervals mean.
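As a small illustration, assuming the Beta(8, 14) posterior that follows from a Beta(1, 1) prior and 7 successes in 20 trials, an equal-tailed 95% credible interval can be read off directly from the posterior quantiles:

```python
# 95% equal-tailed credible interval from a Beta(8, 14) posterior
# (Beta(1, 1) prior updated with 7 successes in 20 trials).
from scipy import stats

posterior = stats.beta(8, 14)
lower, upper = posterior.interval(0.95)
print(f"P({lower:.3f} <= theta <= {upper:.3f} | data) = 0.95")
```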
Markov chain theory provides the mathematical foundation for MCMC. We develop states, transition kernels, stationary distributions, detailed balance, and ergodicity. Understanding why MCMC works enables diagnosing when it fails.
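A minimal numerical illustration, assuming a hand-picked three-state transition matrix, shows a marginal distribution settling into the stationary distribution:

```python
# Stationary distribution of a small discrete Markov chain (illustrative).
import numpy as np

# Transition matrix: rows are current states, columns are next states.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

# Start in state 0 and iterate the marginal distribution forward.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P

print("stationary distribution:", pi)
print("satisfies pi @ P = pi:", np.allclose(pi @ P, pi))
```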
MCMC algorithms turn theory into practice. Metropolis-Hastings proposes moves and accepts based on posterior ratios—no normalizing constant needed. Gibbs sampling exploits conditional structure. We implement both, discussing tuning, efficiency, and convergence diagnostics.
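The sketch below is a random-walk Metropolis sampler targeting a stand-in log posterior (a standard normal); the step size and sample count are illustrative, not tuned recommendations:

```python
# Random-walk Metropolis sketch targeting an unnormalized log posterior.
import numpy as np

rng = np.random.default_rng(42)

def log_target(theta):
    # Unnormalized log posterior; a standard normal stands in for illustration.
    return -0.5 * theta**2

def metropolis(log_target, theta0=0.0, n_samples=5000, step=1.0):
    theta = theta0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + step * rng.normal()
        # Accept with probability min(1, target(proposal) / target(current));
        # the unknown normalizing constant cancels in this ratio.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta
    return samples

samples = metropolis(log_target)
print("posterior mean:", samples.mean(), "posterior sd:", samples.std())
```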
Model comparison asks which model best explains the data. Bayes factors quantify evidence ratios. Information criteria estimate predictive performance. Posterior predictive checks assess whether the model reproduces features of observed data.
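A posterior predictive check can be sketched in a few lines; the Beta-Binomial setting and the tail-probability summary below are illustrative choices, not the only ones:

```python
# Posterior predictive check sketch for a Beta-Binomial model (illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_obs, n = 7, 20                                    # hypothetical observed data
posterior = stats.beta(1 + y_obs, 1 + n - y_obs)    # Beta(1, 1) prior

# Draw parameters from the posterior, then simulate replicated datasets.
theta_draws = posterior.rvs(size=4000, random_state=rng)
y_rep = rng.binomial(n, theta_draws)

# Compare the observed count to its posterior predictive distribution.
tail_prob = np.mean(y_rep >= y_obs)
print(f"posterior predictive P(y_rep >= y_obs) = {tail_prob:.2f}")
```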
Hierarchical models represent Bayesian inference at its most powerful. When data come in groups, hierarchical models let groups borrow strength from each other—the natural Bayesian framework for structured data.
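A canonical instance is the hierarchical normal model, written here in its standard form:

$$
y_{ij} \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma^2), \qquad
\theta_j \mid \mu, \tau \sim \mathcal{N}(\mu, \tau^2), \qquad
(\mu, \tau, \sigma) \sim p(\mu, \tau, \sigma).
$$

Each group mean $\theta_j$ is shrunk toward the overall mean $\mu$, with more shrinkage for groups with little data and for smaller values of $\tau$.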
The Challenge: Computation
If Bayesian inference is so elegant, why did it take until the late 20th century to become mainstream? The answer is computational.
The posterior must integrate to one. Computing this normalizing constant requires integrating the likelihood times the prior over all parameter values:
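$$
p(y) \;=\; \int p(y \mid \theta)\, p(\theta)\, d\theta.
$$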
For most interesting models, this integral is intractable. For decades, Bayesian methods were limited to conjugate priors or low-dimensional problems.
The breakthrough came with Markov chain Monte Carlo. Rather than computing the posterior analytically, we construct a Markov chain whose stationary distribution is the posterior. Running the chain generates samples—enough to estimate any quantity of interest. Suddenly, Bayesian inference became practical for complex, high-dimensional models.
Connections
Part I: Foundations introduces Bayes’ theorem as mathematical identity and the Bayesian interpretation of probability. The distributions catalogued there reappear as priors and posteriors.
Part II: Frequentist Inference offers the contrasting paradigm. Frequentist and Bayesian methods answer different questions: “How reliable is this procedure?” versus “What should I believe?” The complete data scientist understands both. Monte Carlo intuition from Part II transfers directly to MCMC. Likelihood—central to both paradigms—provides common ground.
Part IV: LLMs in Data Science extends Bayesian thinking to new domains. Uncertainty quantification—central to Bayesian inference—becomes critical when assessing LLM reliability. Bayesian methods appear explicitly in calibration, active learning for annotation, and probabilistic retrieval.
Prerequisites
Part III assumes mastery of the material in Parts I and II: probability distributions, likelihood functions, maximum likelihood estimation, and Monte Carlo simulation. The likelihood plays a central role in Bayesian updating. Monte Carlo intuition transfers directly to MCMC.
Comfort with calculus and linear algebra helps with derivations, though we emphasize conceptual understanding and computation over mathematical technicalities.
By Part III’s end, you’ll be able to specify Bayesian models, compute or approximate posteriors, assess convergence, check model adequacy, compare alternatives, and communicate results—a complete toolkit for Bayesian data analysis.