Computational Methods in Data Science
STAT 418 · Spring 2026 · Purdue University
Course Content
Modern statistics is computational statistics. The elegant formulas of classical theory—derived under assumptions of normality, independence, and large samples—often fail when confronted with real data: messy, high-dimensional, and decidedly non-normal. Yet the core questions remain: How certain should we be? What can we conclude? How might we be wrong?
This course develops the computational methods that answer these questions without relying on fragile assumptions. We replace analytical derivations with simulation, asymptotic approximations with resampling, and conjugate convenience with general-purpose algorithms. The computer becomes not just a calculator but a laboratory for statistical thought experiments.
The intellectual journey moves through four parts. We begin with foundations: what probability means, how distributions behave, and how Python’s scientific stack turns theory into computation. We then develop frequentist inference computationally: Monte Carlo simulation as the engine, maximum likelihood as the estimator, and bootstrap resampling as the uncertainty quantifier. Next, we explore Bayesian inference: prior beliefs, posterior updating, and Markov chain Monte Carlo for models too complex for analytical treatment. Finally, we address large language models in data science: integrating pre-trained models into analytical workflows for text preprocessing, feature extraction, data annotation, and retrieval-augmented generation—with careful attention to responsible use, reliability, and privacy.
Throughout, we emphasize both rigor and practice. Every method receives mathematical justification—you’ll understand why these techniques work, not just how to call the functions. But every derivation leads to working code. By course end, you’ll have built a complete toolkit for modern statistical computation, from foundational simulation through cutting-edge AI integration.
Course Information
| Instructor | Dr. Timothy Reese |
|---|---|
| Office | MATH 210 |
| Phone | 765-494-4129 |
| Lectures | Tue/Thu 1:30–2:45 PM |
| Room | UNIV 127 |
| Office Hours | Wed/Fri 1:00–3:00 PM |
| Location | MATH 210 |
| Credits | 3.00 |
| Website | |
Prerequisites
Enrollment requires completion of the following with a grade of C- or better:
- Probability (one of):
  - STAT 41600: Probability (undergraduate)
  - STAT 51600: Basic Probability and Applications (graduate)
- Statistical Inference (one of):
  - STAT 35000: Introduction to Statistics
  - STAT 35500: Statistics for Data Science
  - STAT 51100: Statistical Methods (graduate)
- Programming (one of):
  - MA 16290: Integrated Calculus and Linear Algebra II
  - CS 38003: Python Programming
You should be comfortable with: probability axioms and random variables; discrete and continuous distributions (PMFs, PDFs, CDFs); expectation, variance, and covariance; joint, marginal, and conditional distributions; functions and transformations of random variables; moment generating functions (MGFs); the Law of Large Numbers and Central Limit Theorem; Bayes’ theorem; hypothesis testing and confidence intervals; Python programming with NumPy arrays; and calculus through multiple integrals.
Course Structure
The course divides into four parts, each building on the previous:
Part I: Foundations of Probability and Computation
What does probability mean? How do we describe and compute with random variables? Part I establishes the mathematical and philosophical groundwork: Kolmogorov’s axioms, frequentist and Bayesian interpretations, probability distributions and their properties, and Python’s ecosystem for statistical computing. These foundations support everything that follows.
Part II: Frequentist Inference
The frequentist asks: “What would happen if I repeated this procedure many times?” Part II develops this perspective computationally. Monte Carlo simulation provides the engine for approximating expectations and probabilities. Maximum likelihood estimation and generalized linear models provide the parametric toolkit. Bootstrap resampling, the jackknife, and permutation tests provide distribution-free inference when parametric assumptions fail.
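To preview the flavor of Part II, here is a minimal sketch (illustrative values, not course-issued code) of two workhorses: a Monte Carlo estimate of a tail probability and a percentile bootstrap confidence interval, both using NumPy's modern Generator API.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Monte Carlo: approximate a probability by averaging over simulated draws.
# Here, P(X > 2) for X ~ Exponential(1); the exact answer is e^(-2) ~ 0.1353.
draws = rng.exponential(scale=1.0, size=100_000)
mc_estimate = np.mean(draws > 2)

# Nonparametric bootstrap: resample the observed data with replacement to
# approximate the sampling distribution of a statistic (here, the mean).
data = rng.normal(loc=5.0, scale=2.0, size=50)   # stand-in "observed" sample
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5_000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])   # 95% percentile CI

print(f"P(X > 2) estimate: {mc_estimate:.4f}")
print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```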
Part III: Bayesian Inference
The Bayesian asks: “What should I believe given this evidence?” Part III develops posterior inference from prior specification through MCMC computation. We construct models, check their fit, compare alternatives, and extract predictions—all while properly quantifying uncertainty through posterior distributions.
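As a preview of the MCMC machinery, the sketch below implements a random-walk Metropolis sampler for a deliberately simple model: Normal data with known scale and a Normal prior on the mean. All numbers are illustrative; the course's actual tooling automates and improves on this by hand-rolled loop.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy model (illustrative numbers): y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 10).
y = rng.normal(loc=2.0, scale=1.0, size=30)

def log_posterior(mu):
    log_lik = -0.5 * np.sum((y - mu) ** 2)    # Normal(mu, 1) log-likelihood, up to a constant
    log_prior = -0.5 * (mu / 10.0) ** 2       # Normal(0, 10) log-prior, up to a constant
    return log_lik + log_prior

chain = np.empty(10_000)
mu_current = 0.0
for t in range(chain.size):
    proposal = mu_current + rng.normal(scale=0.5)     # symmetric random-walk proposal
    log_ratio = log_posterior(proposal) - log_posterior(mu_current)
    if np.log(rng.uniform()) < log_ratio:             # Metropolis accept/reject step
        mu_current = proposal
    chain[t] = mu_current

posterior = chain[2_000:]                             # discard burn-in
print(f"posterior mean {posterior.mean():.3f}, sd {posterior.std():.3f}")
```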
Part IV: Large Language Models in Data Science
How can we leverage pre-trained language models to enhance data science workflows? Part IV addresses the practical and responsible integration of LLMs into analytical pipelines. We cover text preprocessing and feature extraction using embeddings, leveraging pre-trained models for data annotation and augmentation, retrieval-augmented generation (RAG) for domain-specific applications, and responsible AI practices including prompt engineering, reliability assessment, and privacy considerations.
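One concrete entry point: embedding models turn text into numeric feature vectors that slot directly into standard statistical workflows. The sketch below assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model, which are common choices but not necessarily the ones this course will use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load a small general-purpose embedding model (an assumption, see above).
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The bootstrap resamples the observed data with replacement.",
    "Markov chain Monte Carlo draws samples from a posterior distribution.",
    "The recipe calls for two cups of flour.",
]
embeddings = model.encode(docs)   # array of shape (3, embedding_dim)

# Cosine similarity between documents: the two statistics sentences should
# score higher with each other than either does with the cooking sentence.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(np.round(normed @ normed.T, 2))
```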
The course culminates in a capstone project where you synthesize these methods to address a substantial data science problem, demonstrating both theoretical understanding and practical implementation skill.
Learning Outcomes
Upon completing this course, you will be able to:
- Apply simulation techniques including Monte Carlo methods, transformation approaches, and rejection sampling to analyze probabilistic behavior in data science applications (see the rejection-sampling sketch after this list)
- Compare and evaluate frequentist and Bayesian inference paradigms by examining their theoretical foundations, identifying their strengths and limitations, and explaining their roles in statistical modeling and decision-making
- Design, implement, and assess resampling methods focusing on both nonparametric and parametric forms of the bootstrap to estimate variability, construct confidence intervals, and improve statistical estimates through bias correction techniques
- Apply cross-validation principles to compute model performance metrics, detect overfitting and underfitting, and select models with reliable predictive accuracy using Python libraries
- Construct and interpret Bayesian models including posterior distributions and credible intervals, apply Markov chain Monte Carlo methods to approximate posteriors, and evaluate the role of prior distributions in Bayesian inference
- Utilize large language models in data science workflows for contextual data augmentation, feature engineering, and integrating structured and unstructured data to enhance predictive models, while addressing challenges of privacy and reliability
- Synthesize course methods in a capstone project to design, develop, and present robust solutions to real-world data science challenges, demonstrating both theoretical understanding and applied expertise
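As promised in the first outcome, here is a minimal rejection-sampling sketch: drawing from Beta(2, 2), whose density f(x) = 6x(1 - x) is bounded by M = 1.5 on [0, 1], using Uniform(0, 1) proposals.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

M = 1.5   # bound on the target density, attained at x = 0.5

def beta22_density(x):
    """Density of Beta(2, 2) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

samples = []
while len(samples) < 10_000:
    x = rng.uniform()                  # proposal draw from Uniform(0, 1)
    u = rng.uniform()                  # acceptance coin-flip
    if u < beta22_density(x) / M:      # accept with probability f(x) / (M * g(x))
        samples.append(x)

samples = np.array(samples)
print(samples.mean())   # should be close to the Beta(2, 2) mean of 0.5
```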
Assessment
| Component | Weight | Details |
|---|---|---|
| Homework | 40% | 6–7 assignments on ~2-week cadence; lowest score dropped; late submissions accepted up to 3 days with 20% penalty |
| Midterm Exams | 30% | Two exams (15% each): Midterm I covers Chapters 1–3 (Foundations, Monte Carlo, Frequentist Inference); Midterm II covers Chapters 4–5 (Resampling Methods, Bayesian Inference) |
| Capstone Project | 30% | Proposal (2%), progress report (1%), presentation (7%), final submission (20%); demonstrates synthesis of course methods on a substantive problem |
Academic Integrity: All work governed by Purdue’s Honor Pledge. Collaboration encouraged on concepts; submitted work must be your own. AI tools (ChatGPT, Copilot, Claude) permitted for debugging, studying, and exploring ideas; prohibited for generating turnkey solutions. Disclose AI assistance; verify all AI-generated content for accuracy.
Schedule & Syllabus
- 📅 Course Schedule
Interactive weekly schedule with topics, assignments, and exam dates.
- 📋 Course Syllabus (PDF)
Complete syllabus with policies, grading breakdown, and academic integrity guidelines.
Lecture Slides
Interactive slide presentations for each chapter. These slides contain key concepts, visualizations, and speaker notes for in-class lectures.
Additional slide decks will be released as the course progresses.
| Chapter | View Slides |
|---|---|
| Chapter 1: Probability Foundations & Inference Paradigms | 🎬 View Slides |
| Chapter 2: Monte Carlo Simulation | 🎬 View Slides |
| Chapter 3: Parametric Inference and Likelihood Methods | 🎬 View Slides |
Additional Notes
Supplementary materials providing additional coverage of specific topics.
Additional notes will be added throughout the semester.
| Topic | View Notes |
|---|---|
| Bernoulli Distribution Relationships: Connections between Bernoulli, Binomial, Geometric, and Negative Binomial distributions | 📄 View Notes |
| Four Faces of Leibniz: How differentiating under the integral sign powers MLE theory, exponential families, policy gradients (REINFORCE), and score matching | 📄 View Notes |
Companion Notebooks
These Jupyter notebooks accompany the course lectures. Each chapter notebook contains worked examples, visualizations, and code you can run and modify. View the rendered HTML online or download the .ipynb files to run locally (right-click and “Save Link As” if needed).
Additional notebooks will be released as the course progresses.
| Chapter | View Online | Download |
|---|---|---|
| Chapter 1: Probability Foundations & Python Review | 🔗 View HTML | ⬇ Download .ipynb |
| Chapter 2: Monte Carlo Methods | 🔗 View HTML | ⬇ Download .ipynb |
| Chapter 3: Parametric Inference | 🔗 View HTML | ⬇ Download .ipynb |
Computing Resources
Students in this course have access to Purdue’s Scholar Cluster, a high-performance computing resource designed for classroom learning. Scholar provides the computational power needed for Monte Carlo simulations, MCMC sampling, bootstrap resampling, and other computationally intensive methods covered in this course.
What You Get
As a registered student in STAT 418, you automatically receive:
- 25 GB Home Directory for your course files and projects
- Scratch Storage for temporary computational work
- Jupyter Notebook Server — run Python notebooks directly in your browser
- Batch Job Submission — run longer simulations via the Slurm scheduler
Access Methods
You can access Scholar through several interfaces, depending on your needs and comfort level:
| Method | Best For | Access |
|---|---|---|
| Gateway (Open OnDemand) | Jupyter notebooks, file management, job monitoring | 🚀 Launch Gateway |
| Remote Desktop (ThinLinc) | Full graphical Linux desktop experience | 🖥️ Launch Desktop |
| SSH (Command Line) | Direct terminal access, batch job submission | ssh yourname@scholar.rcac.purdue.edu |
Recommended for this course: The Gateway interface provides the easiest access to Jupyter notebooks, which is how most course assignments are structured.
Getting Started
1. Login: Use your Purdue Career Account credentials (same as BoilerKey)
2. Launch Jupyter: From the Gateway, select “Jupyter Notebook” under Interactive Apps
3. Select Resources: For most assignments, 1–2 cores and 4 GB of memory are sufficient
4. Upload Notebooks: Transfer course notebooks to your home directory
First Time Setup
The first time you log in, you may need to wait a few minutes for your account to be fully provisioned. If you encounter issues, try again after 15 minutes or contact RCAC support.
When to Use Scholar
Scholar is particularly useful when your local machine is insufficient:
- Large Monte Carlo studies — running millions of simulations
- MCMC sampling — chains with many iterations or multiple chains in parallel
- Bootstrap resampling — thousands of resamples on large datasets
- Cross-validation — parallelizing fold computations
- Capstone projects — scaling up analyses for real-world datasets
Reserve Scholar for computationally demanding tasks.
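When a task does merit Scholar, the usual pattern is to split independent replications across cores. Here is a minimal sketch using only the Python standard library (worker and task counts are illustrative; on Scholar you would match them to the resources you request):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def one_batch(seed, n=1_000_000):
    """Estimate pi from n uniform points in the unit square."""
    rng = np.random.default_rng(seed)            # independent stream per task
    x, y = rng.uniform(size=(2, n))
    return 4.0 * np.mean(x**2 + y**2 < 1.0)      # fraction inside quarter circle

if __name__ == "__main__":                       # required guard when run as a script
    seeds = range(8)                             # eight independent batches
    with ProcessPoolExecutor(max_workers=4) as pool:
        estimates = list(pool.map(one_batch, seeds))
    print(np.mean(estimates))                    # should be close to 3.1416
```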
Documentation & Help
| Resource | Link |
|---|---|
| Scholar User Guide | 📖 User Guide |
| Running Jobs (Slurm) | 📋 Job Submission Guide |
| Jupyter on Scholar | 🐍 Jupyter Guide |
| File Storage & Transfer | 💾 Storage Guide |
| Frequently Asked Questions | ❓ FAQ |
| RCAC Help Desk | 📧 rcac-help@purdue.edu |
Troubleshooting
If you experience issues accessing Scholar or running jobs, first check the RCAC Status Page for scheduled maintenance or outages. For persistent problems, email rcac-help@purdue.edu with your username, the time of the issue, and any error messages.
Python Tutorials for Further Study
The resources below are curated for students who want to deepen their Python skills beyond the course material. All are free, open-source, and maintained by recognized experts or framework developers.
Course Environment Setup
⬇ Download requirements.txt — Python package dependencies for the course. Install with pip install -r requirements.txt in a virtual environment.
Data Science Foundations
- Python Data Science Handbook by Jake VanderPlas
Complete free book covering NumPy, Pandas, Matplotlib, and Scikit-learn. All content available as executable Jupyter notebooks. Essential reference for the scientific Python stack.
- Scientific Python Lectures
From core SciPy contributors. Structured modules progressing from fundamentals to expert topics including memory optimization and performance tuning.
- From Python to NumPy by Nicolas Rougier
Focused entirely on vectorization techniques—transforming Python loops into efficient NumPy operations. Essential for writing fast numerical code.
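The payoff of vectorization in one small example (a sketch, not taken from the book): the same sum of squared deviations computed with an interpreted loop and with a single NumPy expression.

```python
import numpy as np

x = np.arange(100_000, dtype=float)
m = x.mean()

total = 0.0
for xi in x:                         # interpreted loop: one round-trip per element
    total += (xi - m) ** 2

vectorized = np.sum((x - m) ** 2)    # one C-level pass over the whole array
assert np.isclose(total, vectorized)
```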
NumPy and Pandas Deep Dives
- SciPy Lecture Notes: Advanced NumPy
Written by Pauli Virtanen (NumPy core developer). Covers ndarray internals, strides, memory layout, and creating ufuncs. Graduate-level depth.
- Pandas User Guide
Official comprehensive documentation. Essential sections: GroupBy, Reshaping, and Enhancing Performance.
- Modern Random Generator API
Official documentation for np.random.default_rng()—the modern approach we use throughout the course.
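A minimal usage sketch of the Generator-based API:

```python
import numpy as np

# Create one Generator and reuse it throughout an analysis.
rng = np.random.default_rng(seed=418)               # seeded for reproducibility
normals = rng.normal(loc=0.0, scale=1.0, size=3)    # three standard-normal draws
rolls = rng.integers(low=1, high=7, size=5)         # five fair-die rolls (high is exclusive)
picks = rng.choice(["a", "b", "c"], size=2)         # sample from a list
print(normals, rolls, picks)
```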
Machine Learning
- INRIA Scikit-learn MOOC
Gold standard for ML education—developed by scikit-learn core developers. 70% hands-on notebooks covering model selection, cross-validation, and ensemble methods.
- Scikit-learn User Guide
Official documentation with mathematical formulations for all algorithms. Includes the famous “Choosing the Right Estimator” flowchart.
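A minimal cross-validation sketch in scikit-learn (the dataset and model here are illustrative stand-ins, not course assignments):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Five-fold cross-validation: fit on four folds, score on the held-out fifth.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())   # average and spread of out-of-fold R^2
```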
Bayesian Statistics
- Think Bayes 2nd Edition by Allen Downey
Computational approach to Bayesian statistics using Python code instead of calculus. All Jupyter notebooks available for Colab.
- PyMC Documentation
Official tutorials for the probabilistic programming library we’ll use. Start with the Overview Tutorial.
- Statistical Rethinking with PyMC
Full port of McElreath’s acclaimed course materials to Python/PyMC.
- ArviZ Documentation
Bayesian model visualization and diagnostics. Works with PyMC, NumPyro, Stan, and other backends.
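To see how these pieces fit together, here is a minimal model-fitting sketch assuming the PyMC v5-style API, with ArviZ for the summary (all numbers are illustrative):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated "observed" data for illustration.
y = np.random.default_rng(3).normal(loc=1.0, scale=2.0, size=40)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)             # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)            # prior on the scale
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)   # likelihood
    idata = pm.sample(1_000, tune=1_000, chains=2)       # NUTS sampling by default

print(az.summary(idata, var_names=["mu", "sigma"]))
```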
Large Language Models
- Hugging Face LLM Course
Comprehensive 12-chapter course covering transformer architecture, fine-tuning, and building applications. Updated for current models.
- Prompt Engineering Guide
Techniques including Chain-of-Thought, ReAct, and RAG. Model-specific guidance for GPT-4, Claude, and open models.
- LangChain Tutorials
RAG implementation, agents, and complex LLM workflows. Industry-standard orchestration framework.
- OpenAI Cookbook
Production-ready patterns for API integration, embeddings, function calling, and cost optimization.
Recommended Textbooks
There is no single textbook that covers all course topics in depth. Students seeking one comprehensive resource should start with:
Efron, B. & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press. https://doi.org/10.1017/CBO9781316576533
This text bridges classical frequentist methods, bootstrap and resampling, and Bayesian approaches, which partially mirrors the arc of the course.
For deeper study, the following topic-specific texts are recommended. Within each category, texts are ranked by accessibility and relevance to course material (★★★ = primary recommendation).
Statistical Foundations and Inference Theory
★★★ Abramovich, F. & Ritov, Y. (2022). Statistical Theory: A Concise Introduction (2nd ed.). Chapman and Hall/CRC. Concise, modern treatment of estimation, hypothesis testing, and asymptotic theory. Best for building theoretical intuition.
Monte Carlo and Simulation Methods
★★★ Robert, C. P. & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer. https://doi.org/10.1007/978-1-4757-4145-2 The definitive reference for simulation techniques. Chapters 2–4 cover foundational methods used in Weeks 2–3.
Resampling Methods
★★★ Efron, B. & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593 The foundational text by the method’s creators. Exceptionally clear exposition; essential reading for Weeks 6–8.
★★ Shao, J. & Tu, D. (1995). The Jackknife and Bootstrap. Springer. https://doi.org/10.1007/978-1-4612-0795-5 [Advanced] More theoretical treatment with rigorous asymptotic analysis. Recommended after Efron & Tibshirani.
Bayesian Data Analysis
★★★ McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608 Outstanding pedagogical approach that builds intuition before formalism. Primary recommendation for Weeks 10–12.
★★ Martin, O. A. (2024). Bayesian Analysis with Python: A Practical Guide to Probabilistic Modeling (3rd ed.). Packt Publishing. Practical implementation focus using PyMC. Excellent for translating Bayesian concepts into working Python code.
★★ Gelman, A. et al. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018 [Advanced] Comprehensive reference (“BDA3”). More encyclopedic; best used for specific topics or deeper theoretical study.