Part I: Foundations of Probability and Computation
What is probability? The number 0.7 might represent “seven out of ten coin flips landed heads,” or “I’m 70% confident it will rain tomorrow,” or “this radioactive atom has a 70% chance of decaying within the hour.” These statements feel different: one describes an observed frequency, another expresses a belief, a third characterizes a physical propensity. Yet all use the same mathematical machinery. Part I establishes this machinery and confronts the interpretive questions that shape everything that follows.
The remarkable fact about probability theory is that everyone agrees on the rules while disagreeing profoundly about what the rules mean. Kolmogorov’s axioms (non-negativity, normalization, and countable additivity) provide the mathematical foundation that all practitioners accept. But frequentists see probabilities as long-run frequencies revealed through repetition, while Bayesians see them as degrees of rational belief updated through evidence. These are not merely philosophical positions; they lead to different inferential targets and different interpretations of results.
Part I does not resolve this debate; it cannot be resolved, because the interpretations answer different questions. Instead, we develop fluency in both perspectives and build a shared computational language for uncertainty. A central theme is that many problems in probability and statistics reduce to computing expectations or probabilities under a specified distribution. When analytic calculation fails, Monte Carlo simulation provides a general, principled approximation strategy that will power both Part II and Part III.
The Arc of Part I
Chapter 1 weaves together three essential threads: philosophical foundations, mathematical machinery, and computational tools.
We begin with Kolmogorov’s axioms, the mathematical bedrock accepted by all camps. These rules generate the entire edifice of probability theory. From them we derive conditional probability and Bayes’ theorem, independence and exchangeability, expectation and variance, the law of large numbers and the central limit theorem. The axioms are interpretation-neutral: they specify how probabilities must behave without dictating what they represent.
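For reference, these rules fit in a few lines (the notation here is the standard one; the chapter may label events and measures slightly differently):

```latex
% Kolmogorov's axioms for a probability measure P on a sample space Omega
\begin{align*}
  &P(A) \ge 0 \text{ for every event } A        && \text{(non-negativity)} \\
  &P(\Omega) = 1                                && \text{(normalization)} \\
  &P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i)
    \text{ for pairwise disjoint } A_1, A_2, \ldots
                                                && \text{(countable additivity)}
\end{align*}
% Conditional probability, and from it Bayes' theorem, follow directly:
\begin{align*}
  P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad
  P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \quad (P(B) > 0)
\end{align*}
```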
The interpretive question then takes center stage. The frequentist views probability as the limiting relative frequency in an infinite sequence of trials, which grounds probability in repeatable phenomena but struggles with one-time events. The Bayesian views probability as a coherent degree of belief, quantified and updated through conditional probability, which handles unique events naturally but requires the specification of priors. We study both perspectives not to declare a winner but to understand what each offers and when each applies.
From philosophy we turn to random variables and their distributions. A random variable maps outcomes to numbers; a distribution describes how probability mass spreads across those numbers. We develop probability mass functions for discrete variables, probability density functions for continuous ones, cumulative distribution functions that unify both, and quantile functions that invert them.
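As a quick computational preview, these four objects correspond directly to library calls. The sketch below assumes NumPy and SciPy are available; the chapter’s own examples may use different tooling.

```python
# Minimal sketch: the pmf/pdf, cdf, and quantile function (ppf, the inverse
# of the cdf) for one discrete and one continuous distribution.
from scipy import stats

# Discrete: Binomial(n=10, p=0.3)
binom = stats.binom(n=10, p=0.3)
print(binom.pmf(3))        # P(X = 3)
print(binom.cdf(3))        # P(X <= 3)

# Continuous: Normal(mu=0, sigma=1)
norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0.0))       # density at 0 (not a probability)
print(norm.cdf(1.96))      # P(X <= 1.96), roughly 0.975
print(norm.ppf(0.975))     # quantile function inverts the cdf, roughly 1.96
```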
The chapter catalogues the probability distributions that appear throughout data science. Discrete distributions (Bernoulli, Binomial, Poisson, Geometric, Negative Binomial) model trials and counts. Continuous distributions (Uniform, Normal, Exponential, Gamma, Beta) model measurements, durations, and proportions. Inference distributions (Student’s t, Chi-square, and F) arise when estimating parameters from normally distributed data. For each, we develop not just formulas but understanding: what phenomena it models, why it arises, and how it relates to other distributions.
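To illustrate what relating distributions to one another means in practice, here is one such relationship checked numerically, a sketch that assumes NumPy and is illustrative rather than drawn from the chapter: the sum of k independent Exponential(rate) variables follows a Gamma distribution with shape k and the same rate.

```python
# Illustrative check: a sum of k independent Exponential(rate) variables
# is Gamma(shape=k, rate)-distributed, so its mean should be k/rate and
# its variance k/rate^2.
import numpy as np

rng = np.random.default_rng(0)
k, rate, n = 5, 2.0, 100_000

# Each row holds k exponential draws; summing rows gives n Gamma-distributed sums
sums = rng.exponential(scale=1 / rate, size=(n, k)).sum(axis=1)

print(sums.mean(), k / rate)           # both close to 2.5
print(sums.var(ddof=1), k / rate**2)   # both close to 1.25
```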
Chapter 2 develops Monte Carlo simulation as a general computational methodology. The core estimator is a sample average approximating an expectation, with accuracy governed by the law of large numbers and the central limit theorem. We then build practical machinery: pseudo-random number generation, random variate generation (inverse CDF, transformations, rejection sampling), and variance reduction. This chapter establishes a reusable computational primitive: approximate an integral by sampling from a target distribution and averaging.
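That primitive fits in a dozen lines of code. The sketch below is an illustrative example rather than the chapter’s own: assuming NumPy, it estimates E[g(X)] for g(x) = exp(-x²) with X ~ Uniform(0, 1) and reports a CLT-based standard error.

```python
# Monte Carlo estimate of E[g(X)] with g(x) = exp(-x^2), X ~ Uniform(0, 1).
# The exact value is (sqrt(pi)/2) * erf(1) ≈ 0.7468, so the estimate can be checked.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# For Uniform(0, 1) the inverse CDF is the identity, so uniform draws are
# already samples from the target distribution.
x = rng.uniform(0.0, 1.0, size=n)
g = np.exp(-x**2)

estimate = g.mean()                      # law of large numbers: converges to E[g(X)]
std_error = g.std(ddof=1) / np.sqrt(n)   # central limit theorem: error shrinks like 1/sqrt(n)
print(f"estimate = {estimate:.4f} +/- {1.96 * std_error:.4f}")
```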
Why Foundations Matter
It is tempting to rush past foundations toward inference and machine learning, but foundations are the soil from which everything else grows.
Interpretation shapes method. Frequentist procedures are designed to have good long-run operating characteristics, while Bayesian analysis expresses uncertainty through posterior distributions. The same data can yield different answers because the questions differ.
Distribution knowledge enables computation. Many simulation and inference strategies exploit structural relationships among distributions. The richer your distribution vocabulary, the more computational strategies become available.
Mathematical precision prevents errors. Sloppy probability reasoning leads to famous fallacies, such as confusing P(A | B) with P(B | A) or neglecting base rates. Part I builds habits of precision that prevent these errors.
Connections
Part II: Frequentist Inference uses Monte Carlo as a computational engine for the frequentist thought experiment. The target distribution is the sampling distribution of estimators and test statistics, approximated analytically in special cases and computationally through parametric simulation, bootstrap resampling, and permutation tests.
Part III: Bayesian Inference uses the same Monte Carlo engine but targets posterior and posterior predictive distributions. Bayes’ theorem supplies the target; computation supplies the approximation.
Part IV: LLMs in Data Science applies the same computational mindset to modern evaluation problems. Quantifying uncertainty and validating procedures transfer directly to assessing model reliability.
Prerequisites
Part I assumes comfort with calculus (derivatives, integrals, series), linear algebra (vectors, matrices), and Python programming (functions, arrays, basic NumPy). Prior exposure to probability and statistics helps but is not required.
By Part I’s end, you will have the conceptual vocabulary, mathematical tools, and computational skills to engage with the serious methods that follow.