7.3. The Central Limit Theorem (CLT)
We’ve established that the sampling distribution of \(\bar{X}\) is normal when the population is normal. But what about cases where the population is not normally distributed? The Central Limit Theorem (CLT) is a pivotal result in statistics that addresses this question.
Road Map 🧭
Learn the formal statement of the Central Limit Theorem (CLT).
Recognize the experimental settings where the CLT applies.
Apply the CLT to compute approximate probabilities of the sample mean.
7.3.1. The Formal Statement of the CLT
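For an independent and identically distributed (iid) random sample \(X_1, X_2, \ldots, X_n\) from a population with a finite mean \(\mu\) and finite standard deviation \(\sigma\),
\[\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.\]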
What does this all mean? 🔎
\(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) is the standardized sample mean. The mean of \(\bar{X}\), \(\mu_{\bar{X}} = \mu\), is subtracted from \(\bar{X}\), and the result is divided by the standard deviation of \(\bar{X}\), \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\) (the standard error).
\(\xrightarrow{d}\) indicates convergence in distribution.
Putting it together, the mathematical statement reads: “The distribution of the standardized sample mean approaches the standard normal distribution as the sample size \(n\) goes to infinity.”
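The convergence statement can be explored numerically. The following is a minimal sketch, not part of the original text, assuming an exponential population with rate 1 (so \(\mu = \sigma = 1\)). The function names and the chosen sample sizes are illustrative. It compares the empirical value of \(P(Z_n \leq 0)\) for the standardized sample mean \(Z_n\) with the standard normal value \(\Phi(0) = 0.5\).

```python
import math
import random
from statistics import NormalDist

def standardized_means(n, reps, rng, mu=1.0, sigma=1.0):
    """Return `reps` standardized sample means from an Exponential(rate=1) population."""
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((xbar - mu) / (sigma / math.sqrt(n)))
    return out

def cdf_gap_at_zero(n, reps=4000, seed=0):
    """Absolute gap between the empirical P(Z_n <= 0) and Phi(0) = 0.5."""
    rng = random.Random(seed)
    z = standardized_means(n, reps, rng)
    empirical = sum(v <= 0 for v in z) / reps
    return abs(empirical - NormalDist().cdf(0.0))

# Assumed sample sizes, chosen only to show the trend as n grows.
for n in (2, 20, 200):
    print(f"n = {n:3d}: |empirical - Phi(0)| = {cdf_gap_at_zero(n):.3f}")
```

Under these assumptions the gap should shrink as \(n\) grows, although each value carries simulation noise because only a finite number of sample means is drawn.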
Practical Implications of the CLT
When
the population has a finite mean \(\mu\) and finite standard deviation \(\sigma\),
the observations are independent and identically distributed (iid), and
the sample size \(n\) is sufficiently large,
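we have
\[\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \stackrel{d}{\approx} Z, \qquad Z \sim N(0, 1),\]
where \(\stackrel{d}{\approx}\) indicates that the two random variables have similar distributions. By applying the same linear operations on both sides, we can equivalently write:
\[\bar{X} \stackrel{d}{\approx} \mu + \frac{\sigma}{\sqrt{n}} Z\]
The right-hand side is a linear transformation of a standard normal random variable, so its distribution is \(N(\mu, \sigma/\sqrt{n})\). It follows that:
\[\bar{X} \stackrel{\text{approx}}{\sim} N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]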
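The mean and standard deviation of the linearly transformed variable can be checked directly. Here \(Z\) is assumed to denote a standard normal random variable, so that \(E[Z] = 0\) and \(\mathrm{SD}(Z) = 1\):

\[\begin{split}E\left[\mu + \frac{\sigma}{\sqrt{n}} Z\right] &= \mu + \frac{\sigma}{\sqrt{n}} E[Z] = \mu \\ \mathrm{SD}\left(\mu + \frac{\sigma}{\sqrt{n}} Z\right) &= \frac{\sigma}{\sqrt{n}} \, \mathrm{SD}(Z) = \frac{\sigma}{\sqrt{n}}\end{split}\]

Since a linear transformation of a normal random variable is again normal, these two values identify the distribution as \(N(\mu, \sigma/\sqrt{n})\), with the second parameter being the standard deviation.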
7.3.2. Visual Demonstrations
The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, as long as certain conditions are met. Let us use a visual tool to build intuition for this somewhat surprising result.
How to Use the Simulation Tool
The simulation tool requires a few inputs.
Argument | Description | How to use |
---|---|---|
Population Distribution | Normal, uniform, exponential, etc. | Select the family of distributions from which samples will be generated |
Population Parameters | Different for each population family (e.g., \(\mu\) and \(\sigma\) for normal, \(\lambda\) for exponential) | Select the specific distribution belonging to the chosen family |
Sample Size \(n\) | The number of data points used to compute one sample mean | Increase from small to large and observe the change in the sampling distribution |
Number of Samples | The number of sample means used to construct a histogram of the sampling distribution | Fix at a large number (10000, for example) so that the sampling distribution of sample means is sufficiently described |
Once all values are set as desired, click simulate. Two plots will display:
A pdf/pmf plot of the population distribution
A histogram of sample means drawn from the population
Interactive Visualization 🎮
Central Limit Theorem Simulation
Visualize how sample means converge to a normal distribution as sample size increases, regardless of the underlying population distribution.
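For readers without access to the interactive tool, the same experiment can be approximated offline. The sketch below is an illustrative stand-in, not the tool's actual implementation: the helper names `sample_means` and `text_histogram`, the exponential population, and the bin count are all assumptions. It draws many samples of size \(n\), computes each sample mean, and prints a text histogram of those means.

```python
import random

def sample_means(draw, n, num_samples, seed=0):
    """Draw `num_samples` samples of size `n` and return their sample means.

    `draw` takes a random.Random instance and returns one observation.
    """
    rng = random.Random(seed)
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(num_samples)]

def text_histogram(values, bins=15, width=50):
    """Return (counts, lines): bin counts and a printable text histogram."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / step), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    peak = max(counts)
    lines = [
        f"{lo + i * step:8.3f} | " + "#" * round(width * c / peak)
        for i, c in enumerate(counts)
    ]
    return counts, lines

# Assumed settings: exponential population with rate 1, sample size n = 30.
means = sample_means(lambda rng: rng.expovariate(1.0), n=30, num_samples=10000)
counts, lines = text_histogram(means)
print("\n".join(lines))
```

Changing `n` or swapping the `draw` function for another population (for example `lambda rng: rng.random()` for a uniform population) mimics adjusting the tool's inputs.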
A Checklist
By experimenting with different settings, confirm the following:
✔❓ | To confirm |
---|---|
| | The CLT indeed holds. That is, the histogram of sample means approaches a normal shape as \(n\) grows. |
| | A large enough \(n\), which makes the histogram look sufficiently normal, is different for each population distribution. In general, a population requires a larger \(n\) if it has strong non-normal characteristics (asymmetry, multimodality, etc.). |
| | The center of the histogram remains around \(\mu\) while the spread (\(\sigma/\sqrt{n}\)) narrows around \(\mu\) as \(n\) increases. |
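The claim about center and spread can also be checked numerically. The following sketch assumes a Uniform(0, 1) population, for which \(\mu = 0.5\) and \(\sigma = 1/\sqrt{12}\). The function name and the sample sizes are illustrative choices, not part of the original material.

```python
import math
import random
from statistics import fmean, stdev

def mean_and_sd_of_sample_means(n, num_samples=5000, seed=1):
    """Simulate sample means from a Uniform(0, 1) population and summarize them."""
    rng = random.Random(seed)
    means = [fmean(rng.random() for _ in range(n)) for _ in range(num_samples)]
    return fmean(means), stdev(means)

MU = 0.5                   # mean of Uniform(0, 1)
SIGMA = 1 / math.sqrt(12)  # standard deviation of Uniform(0, 1)

for n in (5, 20, 80):
    center, spread = mean_and_sd_of_sample_means(n)
    print(f"n = {n:2d}: center = {center:.4f} (mu = {MU}), "
          f"spread = {spread:.4f} (sigma/sqrt(n) = {SIGMA / math.sqrt(n):.4f})")
```

If the simulation behaves as the theory predicts, the center stays near 0.5 for every \(n\), while the spread tracks \(\sigma/\sqrt{n}\) and roughly halves each time \(n\) is quadrupled.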
How Large is “Sufficiently Large”?
The common rule of thumb that \(n > 30\) is sufficient should not be applied blindly. As the simulations show, the required sample size depends on the shape of the underlying population distribution: the farther the population is from normal, the larger the sample size needed for the CLT approximation to be accurate.
In practice, we only observe a single sample from a population. Our understanding of the population depends on the sample we observe and our background knowledge. We must explore our sample carefully to determine if the sample size is likely sufficient for the CLT to apply.
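One informal way to see why a fixed cutoff is unreliable is to compare how skewed the sample means still are at \(n = 30\) for different populations. The sketch below is an illustration under assumed choices: the three populations, the lognormal parameters, and the helper names are hypothetical and were picked only to contrast symmetric, moderately skewed, and heavily skewed cases.

```python
import random
from statistics import fmean

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = fmean(values)
    m2 = fmean((v - m) ** 2 for v in values)
    m3 = fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def skewness_of_sample_means(draw, n, num_samples=5000, seed=2):
    """Skewness of simulated sample means; values near 0 suggest a symmetric shape."""
    rng = random.Random(seed)
    means = [fmean(draw(rng) for _ in range(n)) for _ in range(num_samples)]
    return skewness(means)

# Assumed populations, from symmetric to heavily right-skewed.
populations = {
    "uniform(0, 1)": lambda rng: rng.random(),
    "exponential(rate 1)": lambda rng: rng.expovariate(1.0),
    "lognormal(0, 1.5)": lambda rng: rng.lognormvariate(0.0, 1.5),
}

for name, draw in populations.items():
    print(f"{name:>20}: skewness of sample means at n = 30 is "
          f"{skewness_of_sample_means(draw, 30):.2f}")
```

Skewness is only one symptom of non-normality, so this is a rough diagnostic rather than a formal test. With a single observed sample, the analogous check is to inspect a histogram or the skewness of the sample itself.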
7.3.3. Step-by-Step Problem Solving with the CLT
Problems using the CLT follow the general steps below:
Verify that the prerequisites for the CLT are met.
Identify the population mean \(\mu_X\) and standard deviation \(\sigma_X\), and use them to calculate the mean \(\mu_{\bar{X}} = \mu_X\) and the standard error \(\sigma_{\bar{X}} = \sigma_X/\sqrt{n}\) of the sampling distribution.
Establish the approximate sampling distribution of the sample mean: \(\bar{X} \stackrel{\text{approx}}{\sim} N(\mu_{\bar{X}}, \sigma_{\bar{X}})\).
Use a standard normal table or software to find the required probability.
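The steps above can be sketched as a small helper. This is a minimal illustration using Python's standard library. The function name `clt_upper_tail` and the numbers in the usage line are hypothetical, and step 1 (checking the prerequisites) is left to the reader because it cannot be automated from \(\mu\), \(\sigma\), and \(n\) alone.

```python
import math
from statistics import NormalDist

def clt_upper_tail(mu, sigma, n, threshold):
    """Approximate P(sample mean > threshold) using the CLT.

    Assumes the CLT prerequisites (step 1) have already been checked.
    """
    if n <= 0 or sigma <= 0:
        raise ValueError("n and sigma must be positive")
    mean_xbar = mu                           # step 2: mean of the sampling distribution
    se_xbar = sigma / math.sqrt(n)           # step 2: standard error
    approx = NormalDist(mean_xbar, se_xbar)  # step 3: approximate sampling distribution
    return 1.0 - approx.cdf(threshold)       # step 4: probability from software

# Hypothetical numbers: mu = 50, sigma = 10, n = 25, threshold = 52 (z = 1).
p = clt_upper_tail(mu=50, sigma=10, n=25, threshold=52)
print(f"P(sample mean > 52) is approximately {p:.4f}")
```

`NormalDist` takes the standard deviation as its second argument, which matches the \(N(\text{mean}, \text{standard deviation})\) notation used in this section.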
Example💡: Another Maze Navigation Example
The same subspecies of rat from the previous experiment will be forced to navigate a more complex maze in which a wrong turn early on can drastically increase the time to completion. The true mean completion time for the whole population of the subspecies is known to be 10 minutes. The true standard deviation is 3 minutes. It is also known that the distribution is slightly right skewed.
Suppose that 60 rats are randomly selected from the population to navigate the maze.
What is the mean of the average time it takes these 60 rats to navigate the maze?
What is the standard deviation of the average time?
What is the probability that the average maze navigation time for the 60 rats is greater than 11 minutes?
Organize the Given Information
We know that the distribution of the population \(X\) is slightly right skewed. Further, \(\mu_X=10\) and \(\sigma_X=3\).
Solve the Problems
\(\mu_{\bar{X}} = \mu_{X} = 10\)
\(\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{3}{\sqrt{60}}=0.3873\)
We do not know the exact population distribution, so we use the CLT to approximate the probability. We must first confirm that all the conditions for the CLT are met:
The population mean and standard deviation are finite. ✔
The sample was formed by randomly selecting rats from the same population. The randomness ensures that the selection of one rat does not influence the others (independence). Since they come from the same population, their distributions are identical. ✔
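For a slightly skewed distribution, \(n=60\) is large enough. (To confirm, try a mildly skewed case in the simulation tool above, such as a beta distribution with \(\alpha=3, \beta=2\), which is slightly left-skewed; the direction of the skew does not change how quickly the sample mean approaches normality.) ✔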
Therefore, \(\bar{X} \stackrel{\text{approx}}{\sim} N(\mu_{\bar{X}}=10, \sigma_{\bar{X}}=0.3873)\) by the CLT. Then
\[\begin{split}P(\bar{X} > 11) &= 1 - P(\bar{X} \leq 11) \\ &= 1 - P\left(Z \leq \frac{11-10}{3/\sqrt{60}}\right) \\ &= 1-\Phi(2.58) = 0.0049\end{split}\]
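The calculation can be reproduced in software and compared against a simulation. The sketch below is illustrative only. The text says just that the population is slightly right-skewed with mean 10 and standard deviation 3, so the gamma population used in the simulation is an assumption made here to have something concrete to sample from, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

MU, SIGMA, N = 10.0, 3.0, 60

def normal_approx_tail(threshold=11.0):
    """Return (standard error, CLT approximation of P(sample mean > threshold))."""
    se = SIGMA / math.sqrt(N)
    return se, 1.0 - NormalDist(MU, se).cdf(threshold)

def simulate_tail(threshold=11.0, reps=10000, seed=3):
    """Estimate P(sample mean > threshold) under an ASSUMED gamma population."""
    shape = (MU / SIGMA) ** 2  # gamma shape giving mean 10 and sd 3
    scale = SIGMA ** 2 / MU    # gamma scale
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gammavariate(shape, scale) for _ in range(N)) / N
        hits += xbar > threshold
    return hits / reps

se, p_clt = normal_approx_tail()
print(f"standard error = {se:.4f}")
print(f"CLT approximation = {p_clt:.4f}")
print(f"simulation (assumed gamma population) = {simulate_tail():.4f}")
```

The simulated value depends on the assumed population and on simulation noise, so it should be read as a plausibility check on the CLT approximation rather than as the true probability.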
7.3.4. Bringing It All Together
Key Takeaways 📝
The Central Limit Theorem states that for an iid sample from a population with finite mean and variance, the distribution of the standardized sample mean approaches standard normal as the sample size increases.
The CLT can be used to approximate the sampling distribution of \(\bar{X}\), but the sample size needed depends on how far the population distribution deviates from normality.
The Central Limit Theorem represents one of the most powerful and remarkable results in statistics. It allows us to use the normal distribution as a foundation for statistical inference across a wide range of real-world situations, even when the underlying population distributions are unknown or non-normal.
Exercises
CLT Application: A population follows a uniform distribution between 0 and 100. For samples of size \(n = 36\):
What are the mean and standard deviation of the sampling distribution of \(\bar{X}\)?
Approximately what distribution does the sample mean follow?
Calculate \(P(\bar{X} > 55)\).
Practical Implications: Explain why the CLT is so important for statistical inference. What practical problems would we face if the CLT didn’t exist?