10.2. Hypothesis Test for the Population Mean When σ is Known

Now that we understand the conceptual framework of hypothesis testing, including Type I and Type II errors and statistical power, we’re ready to implement these ideas with actual data. When we have a sample and want to test a claim about the population mean, we need systematic procedures for computing test statistics, calculating p-values, and making defensible conclusions.

Road Map 🧭

  • Problem we will solve – How to conduct formal hypothesis tests for population means when the population standard deviation is known

  • Tool we’ll learn – The z-test procedure using p-values to evaluate evidence against null hypotheses

  • How it fits – This applies our sampling distribution knowledge and hypothesis testing framework to make data-driven decisions about population parameters

10.2.1. From Theory to Practice: The Z-Test for Population Means

When testing hypotheses about a population mean \(\mu\) with known population standard deviation \(\sigma\), we use what’s called a z-test. This test leverages the sampling distribution of the sample mean and the Central Limit Theorem to transform our sample evidence into a standardized test statistic.

The Foundation: Sampling Distribution Under the Null Hypothesis

The key insight is that under the null hypothesis, we know exactly what the sampling distribution of \(\bar{X}\) should look like. If \(H_0: \mu = \mu_0\) is true, then by our sampling distribution theory:

\[\bar{X} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right)\]

This means we can standardize any observed sample mean \(\bar{x}\) to see how far it falls from what we’d expect under the null hypothesis.

The Z Test Statistic

Our test statistic measures how many standard errors the observed sample mean falls from the hypothesized population mean:

\[Z_{TS} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\]

Under the null hypothesis and our standard assumptions (SRS from a normal population, or large enough sample for CLT), this test statistic follows a standard normal distribution: \(Z_{TS} \sim N(0,1)\).

Why This Test Statistic Makes Sense

The numerator \(\bar{X} - \mu_0\) measures the discrepancy between what we observed and what the null hypothesis claims. The denominator \(\sigma/\sqrt{n}\) is the standard error—it tells us how much variability we expect in sample means just due to random sampling.

When we divide the discrepancy by the standard error, we get a measure of how unusual our observation is relative to normal sampling variability. Large absolute values of \(Z_{TS}\) suggest the discrepancy is too large to be explained by chance alone.
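To make the mechanics concrete, here is a minimal R sketch using made-up summary statistics (all values are hypothetical, not from any study in this section):

# Illustrative only: hypothetical summary statistics
x_bar <- 103.2  # observed sample mean
mu_0  <- 100    # hypothesized population mean
sigma <- 12     # known population standard deviation
n     <- 36     # sample size

std_error <- sigma / sqrt(n)        # expected sampling variability: 2
z_ts <- (x_bar - mu_0) / std_error  # discrepancy in standard-error units
z_ts
# [1] 1.6

Here the observed mean sits 1.6 standard errors above the hypothesized value; whether that counts as "unusual" is exactly what the p-value will quantify.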

10.2.2. Implementing the P-Value Method

While we could use rejection regions (comparing our test statistic to critical values), we’ll focus on the p-value method because it provides more nuanced information about the strength of evidence against the null hypothesis.

What the P-Value Tells Us

The p-value answers this question: “If the null hypothesis were actually true, what’s the probability of observing a test statistic at least as extreme as what we actually got?”

For different alternative hypotheses, “at least as extreme” means different things:

One-Tailed Tests (Upper Tail)

When \(H_a: \mu > \mu_0\), extreme values are those in the upper tail:

\[\text{p-value} = P(Z \geq z_{ts})\]

where \(Z \sim N(0,1)\) and \(z_{ts}\) is the observed value of the test statistic \(Z_{TS}\).

One-Tailed Tests (Lower Tail)

When \(H_a: \mu < \mu_0\), extreme values are those in the lower tail:

\[\text{p-value} = P(Z \leq z_{ts})\]

Two-Tailed Tests

When \(H_a: \mu \neq \mu_0\), extreme values are in either tail. Since the standard normal distribution is symmetric:

\[\text{p-value} = 2 \cdot P(Z \geq |z_{ts}|)\]

Computing P-Values in R

The pnorm() function gives us probabilities for the standard normal distribution:

# Right-tailed test (H_a: mu > mu_0)
p_value <- pnorm(z_test_stat, lower.tail = FALSE)

# Left-tailed test (H_a: mu < mu_0)
p_value <- pnorm(z_test_stat, lower.tail = TRUE)

# Two-tailed test (H_a: mu ≠ mu_0)
p_value <- 2 * pnorm(abs(z_test_stat), lower.tail = FALSE)

Making the Decision

Once we have the p-value, we compare it to our pre-specified significance level \(\alpha\):

  • If p-value ≤ \(\alpha\): Reject \(H_0\) (statistically significant result)

  • If p-value > \(\alpha\): Fail to reject \(H_0\) (not statistically significant)
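One convenient way to bundle these steps is a small helper function. The sketch below is our own (the name z_test_p and its interface are not from any package); the alternative argument mimics the convention used by R's t.test():

# A sketch of a reusable z-test helper (our own function, not from a package)
z_test_p <- function(x_bar, mu_0, sigma, n,
                     alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z_ts <- (x_bar - mu_0) / (sigma / sqrt(n))   # standardized test statistic
  p_value <- switch(alternative,
    greater   = pnorm(z_ts, lower.tail = FALSE),
    less      = pnorm(z_ts, lower.tail = TRUE),
    two.sided = 2 * pnorm(abs(z_ts), lower.tail = FALSE))
  list(z = z_ts, p_value = p_value)
}

# Example call using the galaxy numbers from the next section:
z_test_p(51600, 50000, 4000, 50, alternative = "greater")
# z = 2.828427, p-value = 0.002339 (matching Section 10.2.3)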

10.2.3. A Complete Example: Spiral Galaxy Diameters

Let’s work through a detailed spiral galaxy example to see how all these pieces fit together.

The Research Question

A theory predicts that spiral galaxies like the Milky Way have an average diameter of 50,000 light years. A research group wants to test whether galaxies in their local neighborhood are actually larger than this theoretical prediction.

Study Design and Data

  • Sample: SRS of 50 spiral galaxies from a catalog

  • Sample mean: \(\bar{x} = 51,600\) light years

  • Population standard deviation: \(\sigma = 4,000\) light years (assumed known)

  • Data characteristics: Symmetric, free of outliers (supports normality assumption)

  • Significance level: \(\alpha = 0.01\)

Step 1: State the Hypotheses

The researchers want to show that local galaxies are larger than the theoretical prediction:

  • \(H_0: \mu \leq 50,000\) (galaxies are not larger than theory predicts)

  • \(H_a: \mu > 50,000\) (galaxies are larger than theory predicts)

This is a right-tailed test because the alternative hypothesis involves “greater than.”

Step 2: Check Assumptions

Simple Random Sample: Galaxies were selected randomly from a catalog ✓

Independence: Galaxy sizes should be independent of each other ✓

Normality: Either the population is normal OR the sample size is large enough for the CLT ✓

  • Data described as “symmetric and free of outliers” suggests normality

  • Sample size n = 50 is also sufficient for CLT

Known σ: We’re told \(\sigma = 4,000\) light years

Step 3: Calculate the Test Statistic

\[Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{51,600 - 50,000}{4,000/\sqrt{50}} = \frac{1,600}{565.685} = 2.828\]
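We can let R do the same arithmetic (it matches the hand calculation):

x_bar <- 51600
mu_0  <- 50000
sigma <- 4000
n     <- 50

z_test_stat <- (x_bar - mu_0) / (sigma / sqrt(n))
z_test_stat
# [1] 2.828427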

Step 4: Calculate the P-Value

Since this is a right-tailed test (\(H_a: \mu > 50,000\)):

p_value <- pnorm(z_test_stat, lower.tail = FALSE)  # z_test_stat computed above
p_value
# [1] 0.002338867

Step 5: Make the Decision

Compare the p-value to the significance level:

  • p-value = 0.002339

  • \(\alpha = 0.01\)

  • Since 0.002339 < 0.01, we reject the null hypothesis

Step 6: State the Conclusion

Since the p-value (0.002339) is less than our significance level (0.01), we reject the null hypothesis. We have statistically significant evidence that the true mean diameter of spiral galaxies in this neighborhood is greater than the 50,000 light years predicted by theory.

Interpreting the P-Value

The p-value of 0.002339 tells us that if the theory were correct (mean diameter = 50,000 light years), there would be only about a 0.23% chance of observing a sample mean of 51,600 light years or larger in a sample of 50 galaxies. This is quite unlikely, providing strong evidence against the theoretical prediction.

10.2.4. Power Analysis for the Galaxy Study

Now let’s examine the power characteristics of this study, working through the calculations in detail.

Assessing Power for a Specific Alternative

Suppose the researchers want to know their power to detect galaxies that are 2,000 light years larger on average than the theory predicts (i.e., \(\mu_a = 52,000\) light years).

Step 1: Find the Critical Cutoff Value

At significance level \(\alpha = 0.01\), we reject \(H_0\) when our test statistic exceeds the critical value \(z_{0.01} = 2.326\):

z_alpha <- qnorm(0.01, lower.tail = FALSE)
z_alpha
# [1] 2.326348

The corresponding cutoff value for \(\bar{x}\) is:

\[\bar{x}_{cutoff} = \mu_0 + z_{\alpha} \cdot \frac{\sigma}{\sqrt{n}} = 50,000 + 2.326 \times \frac{4,000}{\sqrt{50}} = 51,316\]
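In R, continuing with the z_alpha value computed above (the tiny difference from the hand calculation is just rounding):

x_bar_cutoff <- 50000 + z_alpha * 4000 / sqrt(50)
x_bar_cutoff
# [1] 51315.98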

Step 2: Calculate Type II Error Probability

If the true mean is \(\mu_a = 52,000\), the Type II error is the probability that \(\bar{X} < 51,316\):

# Type II error calculation
beta <- pnorm(51316, mean = 52000, sd = 4000/sqrt(50), lower.tail = TRUE)
beta
# [1] 0.1132957

Step 3: Calculate Power

\[\text{Power} = 1 - \beta = 1 - 0.113 = 0.887\]
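Equivalently, we can ask R for the power directly, as the probability that \(\bar{X}\) exceeds the cutoff when the true mean is 52,000:

power <- pnorm(51316, mean = 52000, sd = 4000 / sqrt(50), lower.tail = FALSE)
power
# [1] 0.8867043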

Interpretation

This study has about 88.7% power to detect a 2,000 light-year increase in galaxy diameter. This is quite good power—if galaxies in this region truly average 52,000 light years in diameter, there’s about an 89% chance this study would detect that difference.

10.2.5. Sample Size Determination

Planning for Higher Power

Suppose the researchers wanted 95% power to detect the same 2,000 light-year difference. What sample size would they need?

The Sample Size Formula

For a one-tailed z-test, the required sample size is:

\[n = \left[\frac{(z_{\alpha} + z_{\beta})\,\sigma}{|\mu_0 - \mu_a|}\right]^2\]

Where:

  • \(z_{\alpha}\) is the upper-tail critical value for the significance level

  • \(z_{\beta}\) is the upper-tail critical value for the Type II error rate \(\beta = 1 - \text{power}\)

  • \(\sigma\) is the population standard deviation

  • \(|\mu_0 - \mu_a|\) is the effect size we want to detect

Step-by-Step Calculation

For 95% power, \(\beta = 0.05\):

z_alpha <- qnorm(0.01, lower.tail = FALSE)  # 2.326
z_beta <- qnorm(0.05, lower.tail = FALSE)   # 1.645
sigma <- 4000
effect_size <- abs(50000 - 52000)  # 2000

n_required <- ((z_alpha + z_beta) * sigma / effect_size)^2
n_required
# [1] 63.08177

Rounding up (we always round up, so the achieved power is at least the target): n = 64

Verification

We can verify this calculation by checking that n = 64 indeed gives us 95% power:

# With n = 64, what's the power?
std_error_new <- 4000 / sqrt(64)  # 500
cutoff_new <- 50000 + 2.326 * 500  # 51163

power_check <- 1 - pnorm(51163, mean = 52000, sd = 500)
power_check
# [1] 0.9529347

Indeed, with n = 64 the power is about 95.3%, slightly above the 95% target, as expected since we rounded the sample size up.

10.2.6. A Manufacturing Quality Control Example

Let’s work through the Bulls Eye Production example to demonstrate a two-tailed test scenario.

The Scenario

Bulls Eye Production manufactures precision components with an ideal diameter of 5mm. Under normal operating conditions, component diameters follow a normal distribution with mean 5mm and standard deviation 0.5mm. However, the production machine needs periodic recalibration.

Quality Control Protocol

A sample of 64 components is taken regularly. If the sample provides evidence that the true mean diameter differs significantly from 5mm (in either direction), recalibration is needed.

  • Sample size: n = 64

  • Sample mean: \(\bar{x} = 4.85\) mm

  • Population standard deviation: \(\sigma = 0.5\) mm (known from process specifications)

  • Significance level: \(\alpha = 0.01\)

Step 1: State the Hypotheses

Since we care about deviations in either direction:

  • \(H_0: \mu = 5\) (machine is properly calibrated)

  • \(H_a: \mu \neq 5\) (machine needs recalibration)

This is a two-tailed test.

Step 2: Calculate the Test Statistic

\[Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{4.85 - 5.0}{0.5/\sqrt{64}} = \frac{-0.15}{0.0625} = -2.4\]
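As before, we can verify the arithmetic in R:

z_test_stat <- (4.85 - 5.0) / (0.5 / sqrt(64))
z_test_stat
# [1] -2.4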

Step 3: Calculate the P-Value

For a two-tailed test, we need the probability of observing a test statistic at least as extreme as ±2.4:

z_test_stat <- -2.4
p_value <- 2 * pnorm(abs(z_test_stat), lower.tail = FALSE)
p_value
# [1] 0.01639507

Step 4: Make the Decision

Compare to the significance level:

  • p-value = 0.0164

  • \(\alpha = 0.01\)

  • Since 0.0164 > 0.01, we fail to reject the null hypothesis

Step 5: Conclusion

At the 1% significance level, we do not have sufficient evidence to conclude that the machine needs recalibration. The observed sample mean of 4.85mm, while below the target of 5.0mm, is not statistically significantly different when accounting for normal sampling variability.

Practical Interpretation

The p-value of 0.0164 indicates that if the machine were properly calibrated, we’d expect to see a sample mean at least as far from 5.0mm as 4.85mm about 1.64% of the time due to random sampling variation. While this is uncommon, it’s not quite rare enough to trigger recalibration at our chosen significance level.

10.2.7. Understanding the Components of the Z-Test

The Role of Each Element

Sample Mean (\(\bar{x}\)): This is our point estimate of the population mean, summarizing the central tendency of our sample data.

Null Value (\(\mu_0\)): This represents the claimed or hypothesized value we’re testing against, often based on theory, historical data, or regulatory standards.

Standard Error (\(\sigma/\sqrt{n}\)): This quantifies how much we expect sample means to vary around the true population mean due to random sampling.

Test Statistic (\(Z_{TS}\)): This standardizes the difference between what we observed and what we expected, allowing us to use the standard normal distribution for probability calculations.

P-Value: This translates our test statistic into an interpretable probability statement about the strength of evidence against the null hypothesis.

Why the Standard Normal Distribution?

The beauty of the z-test is that regardless of the original population distribution (as long as our assumptions are met), the standardized test statistic follows the well-known standard normal distribution. This allows us to:

  1. Use standard normal tables or software functions

  2. Compare results across different studies and contexts

  3. Apply consistent decision rules based on p-values

The Central Limit Theorem Connection

Even when the original population isn’t normal, the CLT ensures that for sufficiently large samples, the sampling distribution of \(\bar{X}\) is approximately normal. This makes the z-test robust and widely applicable.
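A quick simulation makes this tangible. The sketch below is our own illustration with an arbitrarily chosen skewed population: it draws repeated samples from an Exponential(1) distribution and checks that the standardized sample mean behaves approximately like a standard normal variable.

# Simulation sketch: CLT in action for a skewed population (Exponential(1))
set.seed(1)              # for reproducibility
mu <- 1; sigma <- 1      # an Exponential(1) population has mean 1 and sd 1
n <- 50                  # same sample size as the galaxy study

z_vals <- replicate(10000, {
  x <- rexp(n, rate = 1)                 # strongly right-skewed data
  (mean(x) - mu) / (sigma / sqrt(n))     # standardized sample mean
})

# If Z_TS is approximately N(0, 1), this should be close to 0.05
mean(abs(z_vals) > 1.96)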

10.2.8. Common Mistakes and How to Avoid Them

Mistake 1: Confusing Statistical and Practical Significance

Just because a result is statistically significant doesn’t mean it’s practically important. In the galaxy example, we found statistically significant evidence that galaxies are larger than 50,000 light years. But is a difference of 1,600 light years (from 50,000 to 51,600) practically meaningful for astronomers? That depends on the scientific context.

Mistake 2: Misinterpreting P-Values

The p-value is NOT the probability that the null hypothesis is true. It’s the probability of observing our data (or more extreme) IF the null hypothesis were true. A p-value of 0.02 doesn’t mean there’s a 2% chance the null hypothesis is correct.

Mistake 3: Changing Significance Levels After Seeing Results

The significance level must be chosen before collecting or analyzing data. Adjusting \(\alpha\) after seeing the p-value invalidates the error rate control that makes hypothesis testing meaningful.

Mistake 4: Accepting the Null Hypothesis

We never “accept” or “prove” the null hypothesis. We either reject it (with sufficient evidence) or fail to reject it (insufficient evidence). Failing to reject doesn’t prove the null is true—it simply means we lack strong evidence against it.

Mistake 5: Ignoring Assumptions

The validity of z-test results depends critically on the assumptions being met. Always check:

  • Random sampling

  • Independence of observations

  • Normality (population normal OR large sample for CLT)

  • Known population standard deviation

10.2.9. The Broader Context: When to Use Z-Tests

Ideal Conditions for Z-Tests

Z-tests for population means are most appropriate when:

  1. Population standard deviation is known: This is often unrealistic in practice, but occurs in:

      • Manufacturing processes with well-established specifications

      • Standardized testing where historical variability is known

      • Theoretical or simulation studies

  2. Large sample sizes: Even if \(\sigma\) is unknown, when n is very large (say, n > 100), the sample standard deviation s provides a very good estimate of \(\sigma\), making z-tests reasonable approximations.

  3. Educational contexts: Z-tests provide a clean introduction to hypothesis testing concepts without the additional complexity of unknown parameters.

Transitioning to T-Tests

In most real-world applications, we don’t know \(\sigma\) and must estimate it from our sample data. This leads to t-tests, which account for the additional uncertainty introduced by estimating \(\sigma\). The conceptual framework remains identical—only the reference distribution changes from standard normal to t-distribution.

Building Statistical Intuition

Working through z-test examples helps build intuition about:

  • How sample size affects the strength of evidence

  • The relationship between effect size and statistical significance

  • The trade-offs between Type I and Type II errors

  • The practical meaning of p-values and significance levels

These insights transfer directly to more complex testing situations we’ll encounter later.

10.2.10. Bringing It All Together

Connecting to Previous Concepts

Sampling Distributions: The z-test relies entirely on our understanding of sampling distributions from Chapter 7. The test statistic is just a standardized version of the sample mean.

Central Limit Theorem: This theorem (Chapter 7) justifies using the normal distribution for the test statistic even when the population isn’t normal, provided the sample size is adequate.

Confidence Intervals: There’s a direct relationship between hypothesis tests and confidence intervals. A 95% confidence interval contains exactly those null hypothesis values that would not be rejected at \(\alpha = 0.05\) in a two-tailed test.
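As a quick illustration, here is a minimal sketch using the Bulls Eye numbers from Section 10.2.6, where a 99% confidence interval pairs with the two-tailed test at \(\alpha = 0.01\):

# 99% confidence interval for the Bulls Eye example
x_bar <- 4.85; sigma <- 0.5; n <- 64
z_crit <- qnorm(0.005, lower.tail = FALSE)        # 2.575829
ci <- x_bar + c(-1, 1) * z_crit * sigma / sqrt(n)
ci
# [1] 4.689011 5.010989

The interval contains the null value of 5mm, matching our fail-to-reject decision in that two-tailed test.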

Standard Error: The denominator of our test statistic is the standard error we’ve been working with since Chapter 7. It quantifies the precision of our sample mean as an estimate of the population mean.

Looking Forward

The framework we’ve developed here—stating hypotheses, computing test statistics, calculating p-values, and making decisions—applies to virtually all hypothesis testing procedures in statistics. Whether we’re testing means, proportions, differences between groups, or relationships between variables, the same logical structure applies.

In Section 10.3, we’ll extend these ideas to the more realistic situation where the population standard deviation is unknown, introducing t-tests. In Section 10.4, we’ll explore the broader interpretation of statistical significance and its limitations.

The computational skills you’ve developed here (using R functions like pnorm() and qnorm()) will serve you throughout your statistical journey, as will the conceptual understanding of what hypothesis tests can and cannot tell us about our research questions.

Key Takeaways 📝

  1. Z-tests for population means are appropriate when the population standard deviation is known and standard assumptions are met.

  2. The test statistic \(Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\) standardizes the evidence against the null hypothesis.

  3. P-values quantify evidence strength by giving the probability of observing our results (or more extreme) if the null hypothesis were true.

  4. Statistical significance (p-value ≤ \(\alpha\)) indicates evidence against the null hypothesis, but doesn’t guarantee practical importance.

  5. Power analysis helps determine if our study design is adequate for detecting meaningful effects and can guide sample size decisions.

  6. Two-tailed tests are used when we care about deviations in either direction from the null value.

  7. The hypothesis testing framework established here applies broadly across different types of statistical tests.

Exercises

  1. Galaxy Follow-up Study: The galaxy researchers want to conduct a follow-up study with more power. If they want 99% power to detect a mean diameter of 52,500 light years (compared to the null hypothesis of 50,000), and they plan to use \(\alpha = 0.01\) with \(\sigma = 4,000\), what sample size do they need?

  2. Quality Control Threshold: In the Bulls Eye Production example, at what significance level would the observed sample mean of 4.85mm become statistically significant? What does this tell us about the relationship between \(\alpha\) and decision-making?

  3. Pharmaceutical Testing: A pharmaceutical company claims their new pain medication reduces recovery time from an average of 7.2 days to less than that. In a clinical trial of 100 patients, the sample mean recovery time was 6.8 days. Assuming \(\sigma = 1.5\) days, test the company’s claim at \(\alpha = 0.05\).

  4. Two-Tailed Manufacturing: A bottling company targets 12 oz per bottle. Quality control samples 36 bottles and finds \(\bar{x} = 11.85\) oz. With \(\sigma = 0.6\) oz, test whether the process mean differs from target at \(\alpha = 0.02\).

  5. Power Comparison: Compare the power to detect a 0.5-unit increase in population mean for sample sizes of n = 25, 50, 100, and 200, assuming \(\sigma = 2\) and \(\alpha = 0.05\). What pattern do you observe?

  6. P-Value Interpretation: A study reports a p-value of 0.08 for testing \(H_0: \mu = 100\) versus \(H_a: \mu \neq 100\). Write three different statements about what this p-value means, avoiding common misinterpretations.