10.2. Hypothesis Test for the Population Mean When σ is Known
Now that we understand the conceptual framework of hypothesis testing, including Type I and Type II errors and statistical power, we’re ready to implement these ideas with actual data. When we have a sample and want to test a claim about the population mean, we need systematic procedures for computing test statistics, calculating p-values, and making defensible conclusions.
Road Map 🧭
Problem we will solve – How to conduct formal hypothesis tests for population means when the population standard deviation is known
Tool we’ll learn – The z-test procedure using p-values to evaluate evidence against null hypotheses
How it fits – This applies our sampling distribution knowledge and hypothesis testing framework to make data-driven decisions about population parameters
10.2.1. From Theory to Practice: The Z-Test for Population Means
When testing hypotheses about a population mean \(\mu\) with known population standard deviation \(\sigma\), we use what’s called a z-test. This test leverages the sampling distribution of the sample mean and the Central Limit Theorem to transform our sample evidence into a standardized test statistic.
The Foundation: Sampling Distribution Under the Null Hypothesis
The key insight is that under the null hypothesis, we know exactly what the sampling distribution of \(\bar{X}\) should look like. If \(H_0: \mu = \mu_0\) is true, then by our sampling distribution theory:
\[\bar{X} \sim N\!\left(\mu_0,\ \frac{\sigma}{\sqrt{n}}\right)\]
that is, normal with mean \(\mu_0\) and standard error \(\sigma/\sqrt{n}\).
This means we can standardize any observed sample mean \(\bar{x}\) to see how far it falls from what we’d expect under the null hypothesis.
The Z Test Statistic
Our test statistic measures how many standard errors the observed sample mean falls from the hypothesized population mean:
\[Z_{TS} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\]
Under the null hypothesis and our standard assumptions (SRS from a normal population, or large enough sample for CLT), this test statistic follows a standard normal distribution: \(Z_{TS} \sim N(0,1)\).
Why This Test Statistic Makes Sense
The numerator \(\bar{X} - \mu_0\) measures the discrepancy between what we observed and what the null hypothesis claims. The denominator \(\sigma/\sqrt{n}\) is the standard error—it tells us how much variability we expect in sample means just due to random sampling.
When we divide the discrepancy by the standard error, we get a measure of how unusual our observation is relative to normal sampling variability. Large absolute values of \(Z_{TS}\) suggest the discrepancy is too large to be explained by chance alone.
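As a quick numeric illustration (with made-up numbers, not one of this chapter's examples), suppose a sample of n = 25 yields \(\bar{x} = 103\) and we test \(\mu_0 = 100\) with \(\sigma = 10\). A minimal sketch in R:
# Hypothetical numbers for illustration only
x_bar <- 103
mu_0 <- 100
sigma <- 10
n <- 25
std_error <- sigma / sqrt(n)        # 10 / 5 = 2
z_ts <- (x_bar - mu_0) / std_error  # 3 / 2 = 1.5
z_ts
# [1] 1.5
The observed mean sits 1.5 standard errors above the null value; whether that is "unusual" is exactly what the p-value will quantify next.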
10.2.2. Implementing the P-Value Method
While we could use rejection regions (comparing our test statistic to critical values), we’ll focus on the p-value method because it provides more nuanced information about the strength of evidence against the null hypothesis.
What the P-Value Tells Us
The p-value answers this question: “If the null hypothesis were actually true, what’s the probability of observing a test statistic at least as extreme as what we actually got?”
For different alternative hypotheses, “at least as extreme” means different things:
One-Tailed Tests (Upper Tail) When \(H_a: \mu > \mu_0\), extreme values are those in the upper tail: \(\text{p-value} = P(Z \geq z_{ts})\)
One-Tailed Tests (Lower Tail) When \(H_a: \mu < \mu_0\), extreme values are those in the lower tail: \(\text{p-value} = P(Z \leq z_{ts})\)
Two-Tailed Tests When \(H_a: \mu \neq \mu_0\), extreme values are in either tail. Since the standard normal distribution is symmetric: \(\text{p-value} = 2\,P(Z \geq |z_{ts}|)\)
Computing P-Values in R
The pnorm() function gives us probabilities for the standard normal distribution:
# Right-tailed test (H_a: mu > mu_0)
p_value <- pnorm(z_test_stat, lower.tail = FALSE)
# Left-tailed test (H_a: mu < mu_0)
p_value <- pnorm(z_test_stat, lower.tail = TRUE)
# Two-tailed test (H_a: mu ≠ mu_0)
p_value <- 2 * pnorm(abs(z_test_stat), lower.tail = FALSE)
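If you find yourself repeating these calls, they can be bundled into a small helper. This is one possible sketch; the function and argument names are our own choices, not a standard R API:
# Compute the z statistic and p-value for a one-sample z-test
# from summary statistics (a sketch, not a built-in function)
z_test_p_value <- function(x_bar, mu_0, sigma, n,
                           alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  z_ts <- (x_bar - mu_0) / (sigma / sqrt(n))   # standardized test statistic
  p_value <- switch(alternative,
    greater   = pnorm(z_ts, lower.tail = FALSE),
    less      = pnorm(z_ts, lower.tail = TRUE),
    two.sided = 2 * pnorm(abs(z_ts), lower.tail = FALSE)
  )
  list(statistic = z_ts, p_value = p_value)
}
For example, z_test_p_value(51600, 50000, 4000, 50, "greater") reproduces the galaxy test worked out in the next section.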
Making the Decision
Once we have the p-value, we compare it to our pre-specified significance level \(\alpha\):
If p-value ≤ \(\alpha\): Reject \(H_0\) (statistically significant result)
If p-value > \(\alpha\): Fail to reject \(H_0\) (not statistically significant)
10.2.3. A Complete Example: Spiral Galaxy Diameters
Let’s work through a detailed example on spiral galaxy diameters to see how all these pieces fit together.
The Research Question
A theory predicts that spiral galaxies like the Milky Way have an average diameter of 50,000 light years. A research group wants to test whether galaxies in their local neighborhood are actually larger than this theoretical prediction.
Study Design and Data
Sample: SRS of 50 spiral galaxies from a catalog
Sample mean: \(\bar{x} = 51,600\) light years
Population standard deviation: \(\sigma = 4,000\) light years (assumed known)
Data characteristics: Symmetric, free of outliers (supports normality assumption)
Significance level: \(\alpha = 0.01\)
Step 1: State the Hypotheses
The researchers want to show that local galaxies are larger than the theoretical prediction:
\(H_0: \mu \leq 50,000\) (galaxies are not larger than theory predicts)
\(H_a: \mu > 50,000\) (galaxies are larger than theory predicts)
This is a right-tailed test because the alternative hypothesis involves “greater than.”
Step 2: Check Assumptions
✓ Simple Random Sample: Galaxies were selected randomly from a catalog
✓ Independence: Galaxy sizes should be independent of each other
✓ Normality: Either the population is normal OR the sample size is large enough for the CLT
The data are described as “symmetric and free of outliers,” which supports normality
The sample size n = 50 is also sufficient for the CLT
✓ Known σ: We’re told \(\sigma = 4,000\) light years
Step 3: Calculate the Test Statistic
\[Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{51{,}600 - 50{,}000}{4{,}000/\sqrt{50}} = \frac{1{,}600}{565.69} \approx 2.828\]
Step 4: Calculate the P-Value
Since this is a right-tailed test (\(H_a: \mu > 50,000\)):
z_test_stat <- (51600 - 50000) / (4000 / sqrt(50))  # 2.828427
p_value <- pnorm(z_test_stat, lower.tail = FALSE)
p_value
# [1] 0.002339
Step 5: Make the Decision
Compare the p-value to the significance level:
p-value = 0.002339
\(\alpha = 0.01\)
Since 0.002339 < 0.01, we reject the null hypothesis
Step 6: State the Conclusion
Since the p-value (0.002339) is less than our significance level (0.01), we reject the null hypothesis. We have statistically significant evidence that the true population mean diameter of spiral galaxies in this neighborhood is greater than the 50,000 light years predicted by theory.
Interpreting the P-Value
The p-value of 0.002339 tells us that if the theory were correct (mean diameter = 50,000 light years), there would be only about a 0.23% chance of observing a sample mean of 51,600 light years or larger in a sample of 50 galaxies. This is quite unlikely, providing strong evidence against the theoretical prediction.
10.2.4. Power Analysis for the Galaxy Study
Now let’s examine the power characteristics of this study, working through the calculations step by step.
Assessing Power for a Specific Alternative
Suppose the researchers want to know their power to detect galaxies that are 2,000 light years larger on average than the theory predicts (i.e., \(\mu_a = 52,000\) light years).
Step 1: Find the Critical Cutoff Value
At \(\alpha = 0.01\), we reject \(H_0\) when our test statistic exceeds \(z_{0.01} = 2.326\):
z_alpha <- qnorm(0.01, lower.tail = FALSE)
z_alpha
# [1] 2.326348
The corresponding cutoff value for \(\bar{x}\) is:
\[\bar{x}_{\text{cutoff}} = \mu_0 + z_{0.01}\cdot\frac{\sigma}{\sqrt{n}} = 50{,}000 + 2.326 \times \frac{4{,}000}{\sqrt{50}} \approx 51{,}316\]
Step 2: Calculate Type II Error Probability
If the true mean is \(\mu_a = 52,000\), the Type II error is the probability that \(\bar{X} < 51,316\):
# Type II error calculation
beta <- pnorm(51316, mean = 52000, sd = 4000/sqrt(50), lower.tail = TRUE)
beta
# [1] 0.1132957
Step 3: Calculate Power
\[\text{Power} = 1 - \beta = 1 - 0.1133 = 0.8867\]
Interpretation
This study has about 88.7% power to detect a 2,000 light-year increase in galaxy diameter. This is quite good power—if galaxies in this region truly average 52,000 light years in diameter, there’s about an 89% chance this study would detect that difference.
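The same calculation can be wrapped in a short R function so power is easy to recompute for other sample sizes or alternatives. A minimal sketch, with names of our own choosing, checked against the galaxy numbers:
# Power of a right-tailed one-sample z-test (a sketch)
power_right_tailed <- function(n, mu_0, mu_a, sigma, alpha) {
  se <- sigma / sqrt(n)
  cutoff <- mu_0 + qnorm(alpha, lower.tail = FALSE) * se  # reject when x_bar > cutoff
  1 - pnorm(cutoff, mean = mu_a, sd = se)                 # P(reject | mu = mu_a)
}
power_right_tailed(n = 50, mu_0 = 50000, mu_a = 52000, sigma = 4000, alpha = 0.01)
# approximately 0.8867, matching 1 - beta above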
10.2.5. Sample Size Determination
Planning for Higher Power
Suppose the researchers wanted 95% power to detect the same 2,000 light-year difference. What sample size would they need?
The Sample Size Formula
For a one-tailed z-test, the required sample size is:
\[n = \left(\frac{(z_{\alpha} + z_{\beta})\,\sigma}{|\mu_0 - \mu_a|}\right)^2\]
where:
\(z_{\alpha}\) is the critical value for the significance level
\(z_{\beta}\) is the critical value corresponding to the desired power (power \(= 1 - \beta\))
\(\sigma\) is the population standard deviation
\(|\mu_0 - \mu_a|\) is the effect size we want to detect
Step-by-Step Calculation
For 95% power, \(\beta = 0.05\):
z_alpha <- qnorm(0.01, lower.tail = FALSE) # 2.326
z_beta <- qnorm(0.05, lower.tail = FALSE) # 1.645
sigma <- 4000
effect_size <- abs(50000 - 52000) # 2000
n_required <- ((z_alpha + z_beta) * sigma / effect_size)^2
n_required
# [1] 63.08177
Rounding up to the next whole number: n = 64
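As a reusable sketch (the function name and arguments are our own), the whole calculation, including the round-up step via ceiling(), can be packaged as:
# Required sample size for a one-tailed z-test (a sketch)
sample_size_z <- function(alpha, power, sigma, effect) {
  z_alpha <- qnorm(alpha, lower.tail = FALSE)
  z_beta  <- qnorm(1 - power, lower.tail = FALSE)
  ceiling(((z_alpha + z_beta) * sigma / effect)^2)  # always round up
}
sample_size_z(alpha = 0.01, power = 0.95, sigma = 4000, effect = 2000)
# [1] 64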
Verification
We can verify this calculation by checking that n = 64 indeed gives us 95% power:
# With n = 64, what's the power?
std_error_new <- 4000 / sqrt(64)                                       # 500
cutoff_new <- 50000 + qnorm(0.01, lower.tail = FALSE) * std_error_new  # about 51163
power_check <- 1 - pnorm(cutoff_new, mean = 52000, sd = std_error_new)
power_check
# [1] 0.9529008
Indeed, with n = 64 the power is about 95.3%, slightly above the 95% target, as expected since we rounded the sample size up.
10.2.6. A Manufacturing Quality Control Example
Let’s work through the Bulls Eye Production example to demonstrate a two-tailed test scenario.
The Scenario
Bulls Eye Production manufactures precision components with an ideal diameter of 5mm. Under normal operating conditions, component diameters follow a normal distribution with mean 5mm and standard deviation 0.5mm. However, the production machine needs periodic recalibration.
Quality Control Protocol
A sample of 64 components is taken regularly. If the sample provides evidence that the true mean diameter differs significantly from 5mm (in either direction), recalibration is needed.
Sample size: n = 64
Sample mean: \(\bar{x} = 4.85\) mm
Population standard deviation: \(\sigma = 0.5\) mm (known from process specifications)
Significance level: \(\alpha = 0.01\)
Step 1: State the Hypotheses
Since we care about deviations in either direction:
\(H_0: \mu = 5\) (machine is properly calibrated)
\(H_a: \mu \neq 5\) (machine needs recalibration)
This is a two-tailed test.
Step 2: Calculate the Test Statistic
\[Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{4.85 - 5}{0.5/\sqrt{64}} = \frac{-0.15}{0.0625} = -2.4\]
Step 3: Calculate the P-Value
For a two-tailed test, we need the probability of observing a test statistic at least as extreme as ±2.4:
z_test_stat <- (4.85 - 5) / (0.5 / sqrt(64))  # -2.4
p_value <- 2 * pnorm(abs(z_test_stat), lower.tail = FALSE)
p_value
# [1] 0.01639507
Step 4: Make the Decision
Compare to the significance level:
p-value = 0.0164
\(\alpha = 0.01\)
Since 0.0164 > 0.01, we fail to reject the null hypothesis
Step 5: Conclusion
At the 1% significance level, we do not have sufficient evidence to conclude that the machine needs recalibration. The observed sample mean of 4.85 mm, while below the target of 5.0 mm, does not differ from it by a statistically significant amount once normal sampling variability is taken into account.
Practical Interpretation
The p-value of 0.0164 indicates that if the machine were properly calibrated, we’d expect to see a sample mean at least as far from 5.0mm as 4.85mm about 1.64% of the time due to random sampling variation. While this is uncommon, it’s not quite rare enough to trigger recalibration at our chosen significance level.
10.2.7. Understanding the Components of the Z-Test
The Role of Each Element
Sample Mean (\(\bar{x}\)): This is our point estimate of the population mean, summarizing the central tendency of our sample data.
Null Value (\(\mu_0\)): This represents the claimed or hypothesized value we’re testing against, often based on theory, historical data, or regulatory standards.
Standard Error (\(\sigma/\sqrt{n}\)): This quantifies how much we expect sample means to vary around the true population mean due to random sampling.
Test Statistic (\(Z_{TS}\)): This standardizes the difference between what we observed and what we expected, allowing us to use the standard normal distribution for probability calculations.
P-Value: This translates our test statistic into an interpretable probability statement about the strength of evidence against the null hypothesis.
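To make these roles concrete, here is a short sketch that computes each component for the Bulls Eye data from Section 10.2.6 (the variable names are our own):
# Each component of the z-test, using the Bulls Eye numbers
x_bar <- 4.85; mu_0 <- 5; sigma <- 0.5; n <- 64
std_error <- sigma / sqrt(n)                          # 0.0625: expected sampling variability
z_ts <- (x_bar - mu_0) / std_error                    # -2.4: discrepancy in standard-error units
p_value <- 2 * pnorm(abs(z_ts), lower.tail = FALSE)   # two-tailed strength of evidence
c(std_error = std_error, z = z_ts, p = p_value)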
Why the Standard Normal Distribution?
The beauty of the z-test is that regardless of the original population distribution (as long as our assumptions are met), the standardized test statistic follows the well-known standard normal distribution. This allows us to:
Use standard normal tables or software functions
Compare results across different studies and contexts
Apply consistent decision rules based on p-values
The Central Limit Theorem Connection
Even when the original population isn’t normal, the CLT ensures that for sufficiently large samples, the sampling distribution of \(\bar{X}\) is approximately normal. This makes the z-test robust and widely applicable.
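A quick simulation (our own illustration, not one of the chapter’s examples) makes this concrete: even when samples come from a skewed exponential population, the z statistics computed under a true null hypothesis behave approximately like draws from \(N(0,1)\).
set.seed(1)                        # for reproducibility
n <- 50
mu_0 <- 1; sigma <- 1              # an Exponential(1) population has mean 1 and sd 1
z_stats <- replicate(10000, {
  x <- rexp(n, rate = 1)           # skewed, decidedly non-normal population
  (mean(x) - mu_0) / (sigma / sqrt(n))
})
mean(abs(z_stats) > 1.96)          # should be close to 0.05 if Z_TS is approximately N(0,1)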
10.2.8. Common Mistakes and How to Avoid Them
Mistake 1: Confusing Statistical and Practical Significance
Just because a result is statistically significant doesn’t mean it’s practically important. In the galaxy example, we found statistically significant evidence that galaxies are larger than 50,000 light years. But is a difference of 1,600 light years (from 50,000 to 51,600) practically meaningful for astronomers? That depends on the scientific context.
Mistake 2: Misinterpreting P-Values
The p-value is NOT the probability that the null hypothesis is true. It’s the probability of observing our data (or more extreme) IF the null hypothesis were true. A p-value of 0.02 doesn’t mean there’s a 2% chance the null hypothesis is correct.
Mistake 3: Changing Significance Levels After Seeing Results
The significance level must be chosen before collecting or analyzing data. Adjusting \(\alpha\) after seeing the p-value invalidates the error rate control that makes hypothesis testing meaningful.
Mistake 4: Accepting the Null Hypothesis
We never “accept” or “prove” the null hypothesis. We either reject it (with sufficient evidence) or fail to reject it (insufficient evidence). Failing to reject doesn’t prove the null is true—it simply means we lack strong evidence against it.
Mistake 5: Ignoring Assumptions
The validity of z-test results depends critically on the assumptions being met. Always check:
Random sampling
Independence of observations
Normality (population normal OR large sample for the CLT)
Known population standard deviation
10.2.9. The Broader Context: When to Use Z-Tests
Ideal Conditions for Z-Tests
Z-tests for population means are most appropriate when:
Population standard deviation is known: This is often unrealistic in practice, but occurs in:
Manufacturing processes with well-established specifications
Standardized testing where historical variability is known
Theoretical or simulation studies
Large sample sizes: Even if \(\sigma\) is unknown, when n is very large (say, n > 100), the sample standard deviation s provides a very good estimate of \(\sigma\), making z-tests reasonable approximations.
Educational contexts: Z-tests provide a clean introduction to hypothesis testing concepts without the additional complexity of unknown parameters.
Transitioning to T-Tests
In most real-world applications, we don’t know \(\sigma\) and must estimate it from our sample data. This leads to t-tests, which account for the additional uncertainty introduced by estimating \(\sigma\). The conceptual framework remains identical—only the reference distribution changes from standard normal to t-distribution.
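As a small preview (our own illustration), the one-sided critical values of the t-distribution approach the z critical value as the degrees of freedom grow, which is why the two tests agree for large samples:
# t critical values for alpha = 0.01 shrink toward the z critical value
qt(0.99, df = c(10, 30, 100, 1000))
# [1] 2.764 2.457 2.364 2.330  (rounded)
qnorm(0.99)
# [1] 2.326348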
Building Statistical Intuition
Working through z-test examples helps build intuition about:
How sample size affects the strength of evidence
The relationship between effect size and statistical significance
The trade-offs between Type I and Type II errors
The practical meaning of p-values and significance levels
These insights transfer directly to more complex testing situations we’ll encounter later.
10.2.10. Bringing It All Together
Connecting to Previous Concepts
Sampling Distributions: The z-test relies entirely on our understanding of sampling distributions from Chapter 7. The test statistic is just a standardized version of the sample mean.
Central Limit Theorem: This theorem (Chapter 7) justifies using the normal distribution for the test statistic even when the population isn’t normal, provided the sample size is adequate.
Confidence Intervals: There’s a direct relationship between hypothesis tests and confidence intervals. A 95% confidence interval contains exactly those null hypothesis values that would not be rejected at \(\alpha = 0.05\) in a two-tailed test (see the sketch after this list).
Standard Error: The denominator of our test statistic is the standard error we’ve been working with since Chapter 7. It quantifies the precision of our sample mean as an estimate of the population mean.
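As a quick numerical check of this duality, here is a sketch using the Bulls Eye numbers from Section 10.2.6: a 99% confidence interval should contain \(\mu_0 = 5\) exactly when the two-tailed test fails to reject at \(\alpha = 0.01\), as it did.
x_bar <- 4.85; sigma <- 0.5; n <- 64
z_crit <- qnorm(0.005, lower.tail = FALSE)   # 2.576 for 99% confidence
x_bar + c(-1, 1) * z_crit * sigma / sqrt(n)  # 99% CI endpoints
# [1] 4.689011 5.010989
The interval contains 5, matching the “fail to reject” decision at the 1% level.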
Looking Forward
The framework we’ve developed here—stating hypotheses, computing test statistics, calculating p-values, and making decisions—applies to virtually all hypothesis testing procedures in statistics. Whether we’re testing means, proportions, differences between groups, or relationships between variables, the same logical structure applies.
In Section 10.3, we’ll extend these ideas to the more realistic situation where the population standard deviation is unknown, introducing t-tests. In Section 10.4, we’ll explore the broader interpretation of statistical significance and its limitations.
The computational skills you’ve developed here (using R functions like pnorm() and qnorm()) will serve you throughout your statistical journey, as will the conceptual understanding of what hypothesis tests can and cannot tell us about our research questions.
Key Takeaways 📝
Z-tests for population means are appropriate when the population standard deviation is known and standard assumptions are met.
The test statistic \(Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\) standardizes the evidence against the null hypothesis.
P-values quantify evidence strength by giving the probability of observing our results (or more extreme) if the null hypothesis were true.
Statistical significance (p-value ≤ \(\alpha\)) indicates evidence against the null hypothesis, but doesn’t guarantee practical importance.
Power analysis helps determine if our study design is adequate for detecting meaningful effects and can guide sample size decisions.
Two-tailed tests are used when we care about deviations in either direction from the null value.
The hypothesis testing framework established here applies broadly across different types of statistical tests.
Exercises
Galaxy Follow-up Study: The galaxy researchers want to conduct a follow-up study with more power. If they want 99% power to detect a mean diameter of 52,500 light years (compared to the null hypothesis of 50,000), and they plan to use \(\alpha = 0.01\) with \(\sigma = 4,000\), what sample size do they need?
Quality Control Threshold: In the Bulls Eye Production example, at what significance level would the observed sample mean of 4.85mm become statistically significant? What does this tell us about the relationship between \(\alpha\) and decision-making?
Pharmaceutical Testing: A pharmaceutical company claims their new pain medication reduces recovery time from an average of 7.2 days to less than that. In a clinical trial of 100 patients, the sample mean recovery time was 6.8 days. Assuming \(\sigma = 1.5\) days, test the company’s claim at \(\alpha = 0.05\).
Two-Tailed Manufacturing: A bottling company targets 12 oz per bottle. Quality control samples 36 bottles and finds \(\bar{x} = 11.85\) oz. With \(\sigma = 0.6\) oz, test whether the process mean differs from target at \(\alpha = 0.02\).
Power Comparison: Compare the power to detect a 0.5-unit increase in population mean for sample sizes of n = 25, 50, 100, and 200, assuming \(\sigma = 2\) and \(\alpha = 0.05\). What pattern do you observe?
P-Value Interpretation: A study reports a p-value of 0.08 for testing \(H_0: \mu = 100\) versus \(H_a: \mu \neq 100\). Write three different statements about what this p-value means, avoiding common misinterpretations.