7.1. Statistics and Sampling Distributions
We’ve spent considerable time building probability models that describe populations. Now we reach a pivotal moment in our statistical journey: the bridge from probability theory to statistical inference. Instead of knowing the population distribution, we’ll work backward from sample data to draw conclusions about unknown populations. This transition requires understanding how sample statistics themselves behave as random variables.
Road Map 🧭
Understand how sample statistics can be viewed as random variables.
Master the theoretical properties of sample statistics in relation to the population distribution.
7.1.1. Parameters, Statistics, and the Bridge to Inference
Before diving into sampling distributions, let us review the fundamental vocabulary.
Population Parameters
A parameter is a number that describes some attribute of the population. For example, the population mean \(\mu\) is a parameter that tells us the average value across all units in the population. These parameters are often unknown in practice, but they always exist as fixed, non-random values.
When we study probability distributions, we often assume we know these parameters to understand theoretical behavior. But in statistical inference, the tables turn—we observe sample data and try to learn about these unknown population characteristics.
Sample Statistics: Our Window Into the Population
A statistic \(T\) is a function that maps each sample to a numerical summary \(t = T(x_1, x_2, \ldots, x_n)\).
One example is the sample mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\), which summarizes the center of an observed sample. Unlike parameters, we can calculate statistics directly from the data.
The crucial insight is that statistics will change from sample to sample. If we collect a new sample of the same size from the same population, we’ll almost certainly get different values for \(\bar{x}\).
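To make this concrete, here is a minimal sketch in NumPy (the normal population and its parameter values are invented for illustration): drawing two samples of the same size from the same population yields two different sample means.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Hypothetical population: normal with mean 67 and SD 3 (assumed values)
mu, sigma, n = 67.0, 3.0, 25

sample1 = rng.normal(mu, sigma, size=n)
sample2 = rng.normal(mu, sigma, size=n)

# The same statistic, computed on two different samples, gives two values
print(sample1.mean())  # one realized x-bar
print(sample2.mean())  # almost certainly a different realized x-bar
```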
Estimators: Statistics with a Purpose
An estimator is a statistic that targets a specific population parameter. We use the sample mean \(\bar{x}\) as an estimator of the population mean \(\mu\), and the sample standard deviation \(s\) as an estimator of the population standard deviation \(\sigma\).
The sample mean \(\bar{x}\) is both a statistic (it summarizes our sample) and an estimator (it targets the population mean \(\mu\)). This dual role reflects the bridge between describing what we observe and inferring what we cannot directly measure.
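As a hedged illustration of this dual role (again using assumed population values that would be unknown in practice), we can compute each estimate and compare it with the parameter it targets:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 67.0, 3.0                # population parameters (assumed here)
sample = rng.normal(mu, sigma, size=25)

x_bar = sample.mean()                # summarizes the sample; estimates mu
s = sample.std(ddof=1)               # sample standard deviation; estimates sigma

print(f"x-bar = {x_bar:.2f}, targeting mu = {mu}")
print(f"s     = {s:.2f}, targeting sigma = {sigma}")
```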

Fig. 7.1 Parameters describe the population; statistics summarize the sample; estimators target the unknown parameters
7.1.2. The Sampling Distribution
To understand how estimators behave across many possible samples, we must establish the concept of a sampling distribution.
The Thought Experiment
Imagine we could repeatedly sample from the same population, computing a statistic each time. Suppose we’re studying the heights of college students and we take samples of size \(n = 25\). We collect our first sample, measure all \(25\) students, and compute \(\bar{x}_1 = 67.2\) inches. We collect a second sample of \(25\) different students and get \(\bar{x}_2 = 68.8\) inches. A third sample gives \(\bar{x}_3 = 66.9\) inches.
If we repeated this process thousands of times, we’d have thousands of different sample means: \(\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_m\). These sample means would vary, and that variation would follow a pattern. The distribution of these sample means is called the sampling distribution of the sample mean.
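This thought experiment is easy to simulate. The sketch below (assuming, for illustration only, that heights are normal with mean 67 and SD 3 inches) draws \(m = 10{,}000\) samples of size \(n = 25\) and records each sample mean:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, m = 67.0, 3.0, 25, 10_000  # m repeated samples of size n

samples = rng.normal(mu, sigma, size=(m, n))  # each row is one sample
x_bars = samples.mean(axis=1)                 # one x-bar per sample

print(x_bars[:3])           # three realized sample means, all different
print(x_bars.mean())        # the x-bars center near mu
print(x_bars.std(ddof=1))   # their spread is far smaller than sigma
```

A histogram of `x_bars` approximates the sampling distribution of the sample mean: the simulated means cluster around \(\mu\) with noticeably less variability than individual heights.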
Formal Definition
The sampling distribution of a statistic is the probability distribution of that statistic across all possible samples of the same size from the same population. It tells us how the statistic behaves as a random variable.
This concept applies to any statistic—sample means, sample standard deviations, sample medians, sample correlations. Each has its own sampling distribution that describes how that particular statistic varies from sample to sample.
Why This Matters for Inference
Understanding sampling distributions allows us to quantify the uncertainty in our estimates. If we know how \(\bar{X}\) behaves across many samples, we can assess how reliable any single observed \(\bar{x}\) might be as an estimate of \(\mu\). This knowledge forms the foundation for confidence intervals, hypothesis tests, and all other inferential procedures.
The Capital \(\bar{X}\)
We will now study the behavior of sample statistics as random variables. To distinguish a statistic viewed as a random variable from its realized value, we use capital letters: \(\bar{X}\) denotes the random variable whose observed values are the \(\bar{x}\)’s. The same distinction applies to \(S\) vs. \(s\) and \(S^2\) vs. \(s^2\).
7.1.3. Factors Affecting Sampling Distributions
The sampling distribution of a statistic depends on several factors:
The population distribution
Sample size, \(n\) (its effect is illustrated in the sketch after this list)
The statistic itself
For example, \(S\) can take only nonnegative values, while \(\bar{X}\) has no such restriction. The difference comes from how the two statistics are computed: \(S\) is the square root of an average of squared deviations, so it can never be negative.
The sampling technique
The sampling technique affects how well a sample represents the population and whether key properties like independence are satisfied.
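The sketch below illustrates the sample-size factor. It assumes, purely for illustration, a skewed population (exponential with mean 1) and compares the simulated spread of \(\bar{X}\) across several sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10_000  # repeated samples per sample size

# A skewed population, exponential with mean 1, chosen only for illustration
for n in (5, 25, 100):
    x_bars = rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)
    print(f"n = {n:3d}: mean of x-bars = {x_bars.mean():.3f}, "
          f"SD of x-bars = {x_bars.std(ddof=1):.3f}")

# Larger n leaves the center near the population mean (1.0)
# but shrinks the spread of the sampling distribution.
```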
We will examine how these factors influence inference in greater depth in the upcoming chapters.
7.1.4. Bringing It All Together
Key Takeaways 📝
Statistics are random variables that vary from sample to sample, creating sampling distributions that describe this variability. Understanding how statistics behave as random variables lets us quantify uncertainty and draw conclusions about unknown populations.
Multiple factors affect sampling distributions: population shape, sample size, the choice of statistic, and independence all play important roles.
Exercises
Conceptual Understanding: Explain the difference between a parameter, a statistic, and an estimator. Give an example of each in the context of measuring the average commute time for workers in your city.