.. _7-1-statistics-and-sampling-distributions:
Statistics and Sampling Distributions
====================================================
We've spent considerable time building probability models that describe populations. Now we reach a pivotal
moment in our statistical journey: the bridge from probability theory to statistical inference.
Instead of starting from a known population distribution, we'll work backward from sample data to draw conclusions
about an unknown population. This transition requires understanding **how sample statistics themselves behave
as random variables**.
.. admonition:: Road Map đź§
:class: important
* Understand how **sample statistics can be viewed as random variables**.
* Master the **theoretical properties of sample statistics** in relation to the population distribution.
Parameters, Statistics, and the Bridge to Inference
----------------------------------------------------
Before diving into sampling distributions, let us review the fundamental vocabulary.
Population Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~
A **parameter** is a number that describes some attribute of the population.
For example, the population mean :math:`\mu` is a parameter that tells us the average value across *all units in
the population*. These parameters are often unknown in practice, but they always exist as **fixed, non-random values**.
When we study probability distributions, we often assume we know these parameters to
understand theoretical behavior. But in statistical inference, the tables turn—we
observe sample data and try to learn about these unknown population characteristics.
Sample Statistics: Our Window Into the Population
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A **statistic** :math:`T` is a function that maps each sample to a numerical summary :math:`t`:
.. math:: T(x_1, x_2, \cdots, x_n) = t
One example is the sample mean :math:`\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i`, which summarizes the
center of an observed sample. Unlike parameters, we can calculate statistics directly
from the data.
The crucial insight is that **statistics will change from sample to sample**. If we collect a
new sample of the same size from the same population, we'll almost certainly get **different values for**
:math:`\bar{x}`.
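To see this sample-to-sample variability concretely, here is a minimal simulation sketch. It assumes NumPy is
available, and the population (heights following a normal distribution with mean 67 and standard deviation 3
inches) is an illustrative choice, not something specified in the text.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(seed=350)  # fixed seed for reproducibility

   # Hypothetical population: heights ~ Normal(mu = 67, sigma = 3) inches
   mu, sigma, n = 67.0, 3.0, 25

   sample1 = rng.normal(mu, sigma, size=n)
   sample2 = rng.normal(mu, sigma, size=n)

   # The same statistic T (the sample mean) applied to two samples
   # from the same population yields two different realized values t.
   print(f"x-bar from sample 1: {sample1.mean():.2f}")
   print(f"x-bar from sample 2: {sample2.mean():.2f}")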
Estimators: Statistics with a Purpose
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
An **estimator** is a statistic that **targets a specific
population parameter**. We use the sample mean :math:`\bar{x}` as an estimator of the population mean :math:`\mu`.
We use the sample standard deviation :math:`s` as an estimator of the population standard deviation :math:`\sigma`.
The sample mean :math:`\bar{x}` is both a statistic (it summarizes our sample)
and an estimator (it targets the population mean :math:`\mu`). This dual role reflects the bridge between
describing what we observe and inferring what we cannot directly measure.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter7/pop-stat-est-inf.png
:figwidth: 80%
:align: center
*Parameters describe the population; statistics summarize the sample; estimators target the unknown parameters*
The Sampling Distribution
--------------------------------------------------------------------
To understand **how estimators behave across many
possible samples**, we must establish the concept of a sampling distribution.
The Thought Experiment
~~~~~~~~~~~~~~~~~~~~~~~
Imagine we could repeatedly sample from the same population, computing a statistic each
time. Suppose we're studying the heights of college students and we take samples of size
:math:`n = 25`. We collect our first sample, measure all :math:`25` students, and compute :math:`\bar{x}_1 = 67.2` inches.
We collect a second sample of :math:`25` different students and get :math:`\bar{x}_2 = 68.8` inches. A third sample
gives :math:`\bar{x}_3 = 66.9` inches.
If we repeated this process thousands of times, we'd have thousands of different sample means:
:math:`\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_m`. These sample means would **vary**,
and that variation would follow a pattern.
The distribution of these sample means is called the **sampling distribution of the sample mean**.
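The thought experiment can be approximated by simulation. The sketch below repeats the sample-and-average
process :math:`m = 10{,}000` times, again assuming NumPy and the same illustrative Normal(67, 3) height
population used above.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(seed=350)
   mu, sigma, n, m = 67.0, 3.0, 25, 10_000  # m = number of repeated samples

   # Draw m samples of size n; each row is one sample.
   samples = rng.normal(mu, sigma, size=(m, n))
   xbars = samples.mean(axis=1)  # one sample mean per row

   # The empirical distribution of these x-bars approximates the
   # sampling distribution of the sample mean for n = 25.
   print(f"average of the x-bars: {xbars.mean():.3f}  (population mean is {mu})")
   print(f"spread of the x-bars:  {xbars.std(ddof=1):.3f}  (much smaller than sigma = {sigma})")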
Formal Definition
~~~~~~~~~~~~~~~~~~~~
The **sampling distribution** of a statistic is the probability distribution of that statistic
across all possible samples of the same size from the same population. It tells us **how the statistic
behaves as a random variable**.
This concept applies to any statistic—sample means, sample standard deviations, sample medians,
sample correlations. Each has its own sampling distribution that describes how that particular
statistic varies from sample to sample.
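Continuing the same illustrative simulation, the sketch below treats three different statistics the same way;
each one produces its own empirical sampling distribution.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(seed=350)
   mu, sigma, n, m = 67.0, 3.0, 25, 10_000

   samples = rng.normal(mu, sigma, size=(m, n))

   # Each statistic, computed sample by sample, has its own sampling distribution.
   means   = samples.mean(axis=1)
   medians = np.median(samples, axis=1)
   sds     = samples.std(axis=1, ddof=1)  # sample standard deviation (divisor n - 1)

   for name, stat in [("mean", means), ("median", medians), ("std dev", sds)]:
       print(f"sample {name:8s} center = {stat.mean():.3f}, spread = {stat.std(ddof=1):.3f}")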
Why This Matters for Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Understanding sampling distributions allows us to quantify the uncertainty in our estimates.
If we know how :math:`\bar{X}` behaves across many samples, we can assess how reliable any single observed
:math:`\bar{x}` might be as an estimate of :math:`\mu`. This knowledge forms the foundation for
confidence intervals, hypothesis tests, and all other inferential procedures.
.. admonition:: The Capital :math:`\bar{X}`
:class: important
We will now study the behavior of sample statistics as random variables.
To distinguish the random variable from its realized values, we use
capital letters: :math:`\bar{X}` denotes the random variable whose
observed values are the :math:`\bar{x}`'s.
Similar distinctions apply to :math:`S` vs. :math:`s` and :math:`S^2` vs. :math:`s^2`.
Factors Affecting Sampling Distributions
-------------------------------------------
Several factors determine the behavior of a sampling distribution:

1. **The population distribution**
2. **Sample size**, :math:`n`
3. **The statistic itself**
For example, :math:`S` can only take non-negative values, while :math:`\bar{X}` has no such
restriction. This difference comes from how the two statistics are computed.
4. **The sampling technique**
The sampling technique affects how well a sample represents the population and whether
key properties like independence are satisfied.
We will examine how these factors influence inference in greater depth in the upcoming chapters.
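As a rough preview of the first two factors, the sketch below (same assumptions as before: NumPy and
illustrative populations) compares the spread of the simulated sampling distribution of :math:`\bar{X}`
across sample sizes and across a symmetric versus a right-skewed population.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(seed=350)
   m = 10_000  # repeated samples per setting

   populations = {
       "normal(67, 3)":        lambda size: rng.normal(67.0, 3.0, size=size),
       "exponential(scale=3)": lambda size: rng.exponential(3.0, size=size),  # right-skewed
   }

   for label, draw in populations.items():
       for n in (5, 25, 100):
           xbars = draw((m, n)).mean(axis=1)
           print(f"{label:22s} n = {n:3d}: spread of x-bar = {xbars.std(ddof=1):.3f}")

   # Larger n shrinks the spread of x-bar; the shape of the population
   # mainly affects the shape of the sampling distribution for small n.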
Bringing It All Together
---------------------------
.. admonition:: Key Takeaways 📝
:class: important
1. **Statistics are random variables** that vary from sample to sample, creating sampling distributions
that describe this variability. Understanding how **statistics behave as random variables** lets us quantify
uncertainty and draw conclusions about unknown populations.
2. **Multiple factors affect sampling distributions**: the population distribution, the sample size,
the choice of statistic, and the sampling technique (which governs whether independence holds) all play important roles.
Exercises
~~~~~~~~~~~~~
1. **Conceptual Understanding**: Explain the difference between a parameter, a statistic, and an estimator.
Give an example of each in the context of measuring the average commute time for workers in your city.