10.1. Type I Error, Type II Error, and Power
Statistical inference provides us with more than just tools to estimate population parameters and quantify uncertainty through confidence intervals. It also gives us formal procedures to test hypotheses—to gather data and assess whether specific statistical claims are reasonable or not. We’ll explore hypotheses about population parameters, starting with the population mean, and develop systematic approaches for evaluating evidence from sample data.
Road Map 🧭
Problem: We need formal procedures to test specific claims about population parameters using sample evidence
Tool: Hypothesis testing framework with null and alternative hypotheses, test statistics, and controlled error rates
Pipeline: This completes our statistical inference toolkit alongside confidence intervals, enabling evidence-based decision making
10.1.1. The Foundation of Hypothesis Testing
What Is a Statistical Hypothesis?
In statistics, a hypothesis is a specific claim or declaration about one or more population parameters, expressed as a mathematical statement. Unlike everyday hypotheses, statistical hypotheses must be precise enough to be tested with data. For example, “the population mean equals 50” or “the population mean is greater than 50” are testable statistical hypotheses.
The Dual Hypothesis Framework
Hypothesis testing always involves two competing hypotheses:
The null hypothesis (denoted \(H_0\)) represents the status quo, baseline claim, or position we will assume to be true until we have sufficient evidence to conclude otherwise. This might be a theory we suspect is false, a historical standard we want to challenge, or simply a convenient reference point for comparison.
The alternative hypothesis (denoted \(H_a\)) represents what we’re actually trying to establish or demonstrate. This is the claim we believe might be true and want to test. We gather evidence through our sample data to see if this alternative hypothesis appears reasonable.
Think of this setup as “innocent until proven guilty”—we assume the null hypothesis is true until our data provides compelling evidence to favor the alternative.
Test Statistics: Condensing Evidence
To assess the evidence from our data, we construct a test statistic—a single numerical value calculated from our sample that measures how consistent (or inconsistent) the observed data is with the null hypothesis.
The test statistic serves as our “evidence summary.” The more extreme the test statistic (in a direction that supports the alternative hypothesis), the stronger the evidence against the null hypothesis.
For testing claims about a population mean, our test statistic will typically be a function of the sample mean \(\bar{x}\), since this is our best estimator of the population mean \(\mu\).
Significance Level: Setting the Evidence Threshold
Before collecting any data, we must decide how much evidence we require to reject the null hypothesis. The significance level (denoted \(\alpha\)) is the pre-specified probability threshold that determines the minimum strength of evidence needed to conclude the null hypothesis is false.
Common choices include \(\alpha = 0.05\) (5% significance level) or \(\alpha = 0.01\) (1% significance level). The significance level represents our tolerance for making an error when the null hypothesis is actually true.
P-values: Measuring the Strength of Evidence
The p-value transforms our test statistic into a probability that quantifies the strength of evidence against the null hypothesis. Specifically, the p-value is the probability of observing a test statistic at least as extreme as what we actually observed, assuming the null hypothesis is true and all model assumptions are met.
Small p-values indicate strong evidence against the null hypothesis (the observed data would be very unlikely if the null were true), while large p-values indicate weak evidence against the null hypothesis.
We compare the p-value to our pre-specified significance level \(\alpha\):
If p-value ≤ \(\alpha\): We reject the null hypothesis in favor of the alternative
If p-value > \(\alpha\): We fail to reject the null hypothesis
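For example, here is a minimal R sketch of this decision rule for a one-sided test of a mean with known \(\sigma\); the data and numeric values are assumed purely for illustration:
set.seed(1)
x <- rnorm(40, mean = 52, sd = 8)                      # simulated sample
mu0 <- 50; sigma <- 8; alpha <- 0.05                   # null value, known sd, evidence threshold
z_stat <- (mean(x) - mu0) / (sigma / sqrt(length(x)))  # test statistic
p_value <- 1 - pnorm(z_stat)                           # one-sided p-value for Ha: mu > mu0
p_value <= alpha                                       # TRUE -> reject H0; FALSE -> fail to reject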
10.1.2. A Legal System Analogy
The hypothesis testing framework mirrors the American legal system, providing a helpful analogy for understanding the key concepts.
The Defendant and the Prosecutor
In our analogy:
The null hypothesis is like the defendant, presumed innocent until proven guilty
The alternative hypothesis is like the prosecutor, trying to establish the defendant’s guilt
The significance level \(\alpha\) represents the standard of evidence required for conviction (like “beyond reasonable doubt”)
The test statistic summarizes all the evidence presented at trial
The p-value measures how convincing this evidence is
The jury compares the strength of evidence to the required standard and renders a verdict
Just as we would rather let a guilty person go free than convict an innocent person, in hypothesis testing we’re generally more concerned about incorrectly rejecting a true null hypothesis than about failing to detect a false one.
The Trial Process
State the hypotheses: Define what we’re trying to prove (alternative) and what we assume to be true initially (null)
Set the evidence standard: Choose the significance level before seeing any data
Collect and present evidence: Gather our sample data and compute the test statistic
Evaluate the evidence: Calculate the p-value to measure how compelling the evidence is
Render the verdict: Compare p-value to \(\alpha\) and make our decision
State the conclusion: Clearly communicate what we’ve established (or failed to establish)
This systematic approach ensures objectivity and prevents us from changing our standards after seeing the data.
10.1.3. Understanding Type I and Type II Errors
Because hypothesis testing involves making decisions under uncertainty based on sample data, we can make two types of errors. Understanding these errors is crucial for designing good studies and interpreting results properly.
Type I Error: False Conviction
A Type I error occurs when we reject a true null hypothesis—essentially “convicting an innocent defendant.” This represents a false positive result where we claim to have found an effect or difference when none actually exists.
The probability of making a Type I error is denoted \(\alpha\) (our significance level). By choosing \(\alpha = 0.05\), we’re saying we’re willing to falsely reject a true null hypothesis 5% of the time in the long run.
Examples of Type I errors:
Concluding a new drug is effective when it actually has no effect
Claiming a manufacturing process has changed when it’s still operating the same way
Detecting a gender wage gap when no systematic difference exists
Type II Error: False Acquittal
A Type II error occurs when we fail to reject a false null hypothesis—like “letting a guilty defendant go free.” This represents a false negative result where we fail to detect a real effect or difference.
The probability of making a Type II error is denoted \(\beta\). Unlike \(\alpha\), we don’t directly control \(\beta\) by setting it in advance. Instead, \(\beta\) depends on several factors including the true value of the parameter, the sample size, and the significance level.
Examples of Type II errors:
Failing to detect that a new drug is actually effective
Missing the fact that a manufacturing process has deteriorated
Not discovering a real gender wage gap that exists
The Truth Table
The relationship between our decisions and reality can be summarized in a 2×2 table:
| Reality | Fail to Reject \(H_0\) | Reject \(H_0\) |
|---|---|---|
| \(H_0\) is True | ✓ Correct Decision (probability = \(1-\alpha\)) | ✗ Type I Error (probability = \(\alpha\)) |
| \(H_0\) is False | ✗ Type II Error (probability = \(\beta\)) | ✓ Correct Decision (probability = \(1-\beta\)) |
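To see these error probabilities in action, here is a small Monte Carlo sketch in R; the test setup and all numeric values are assumed for illustration only:
# One-sided z-test of H0: mu <= 50 vs Ha: mu > 50 with known sigma
set.seed(123)
alpha <- 0.05; mu0 <- 50; sigma <- 10; n <- 25; reps <- 10000
cutoff <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)         # rejection cutoff for the sample mean
xbar_null <- replicate(reps, mean(rnorm(n, mu0, sigma)))   # sample means generated under H0
xbar_alt  <- replicate(reps, mean(rnorm(n, 53,  sigma)))   # sample means generated with true mu = 53
mean(xbar_null > cutoff)   # estimated Type I error rate, close to alpha
mean(xbar_alt <= cutoff)   # estimated Type II error rate (beta) for this alternative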
Error Trade-offs
Type I and Type II errors are inversely related—efforts to reduce one type of error typically increase the other type (holding sample size constant). This creates important trade-offs:
Decreasing \(\alpha\) (being more stringent about rejecting \(H_0\)) increases \(\beta\) (makes it harder to detect false null hypotheses)
Increasing \(\alpha\) (being more liberal about rejecting \(H_0\)) decreases \(\beta\) (makes it easier to detect false null hypotheses)
The only way to reduce both error types simultaneously is to increase the sample size, collect higher quality data, or improve the measurement process.
10.1.4. Statistical Power: The Ability to Detect Truth
Definition of Power
Statistical power is the probability that a test will correctly reject a false null hypothesis. It represents our ability to detect an effect when an effect actually exists:
\(\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})\)
Power ranges from 0 to 1, with higher values indicating a better ability to detect false null hypotheses. A power of 0.80 means that if the null hypothesis is actually false, we have an 80% chance of correctly rejecting it.
Understanding the Trade-offs
Several important relationships govern power; a short R sketch follows this list.
As \(\alpha\) increases, \(\beta\) decreases and power increases. We choose the maximum acceptable \(\alpha\) at the start of the study. Moving the cutoff value toward the null mean (making it easier to reject \(H_0\)) raises the Type I error rate but lowers the Type II error rate and increases power.
As \(\mu_a\) moves farther from \(\mu_0\) (a larger effect size), \(\beta\) decreases and power increases. This reflects the change or effect we want to be able to detect. Larger effects are easier to detect because the two sampling distributions become more separated.
As \(\sigma\) decreases, \(\beta\) decreases and power increases. \(\sigma\) depends both on the population itself and on the design of the experiment. Less variability makes it easier to distinguish between the null and alternative hypotheses.
As \(n\) increases, \(\beta\) decreases and power increases. Sample size is often the most practical lever for controlling power. Larger samples reduce the standard error \(\sigma/\sqrt{n}\), making the sampling distributions more concentrated and easier to distinguish.
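Here is a compact R sketch of these four relationships, using an upper-tailed z test and assumed values:
power_z <- function(n, mu0, mu_a, sigma, alpha) {
  se <- sigma / sqrt(n)
  1 - pnorm(mu0 + qnorm(1 - alpha) * se, mean = mu_a, sd = se)   # area beyond the cutoff under Ha
}
power_z(n = 25,  mu0 = 50, mu_a = 53, sigma = 10, alpha = 0.05)  # baseline
power_z(n = 25,  mu0 = 50, mu_a = 53, sigma = 10, alpha = 0.10)  # larger alpha -> more power
power_z(n = 25,  mu0 = 50, mu_a = 55, sigma = 10, alpha = 0.05)  # larger effect -> more power
power_z(n = 25,  mu0 = 50, mu_a = 53, sigma = 5,  alpha = 0.05)  # smaller sigma -> more power
power_z(n = 100, mu0 = 50, mu_a = 53, sigma = 10, alpha = 0.05)  # larger n -> more power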
Visualizing Power Through Sampling Distributions
Consider testing \(H_0: \mu \leq \mu_0\) versus \(H_a: \mu > \mu_0\). We can visualize the relationship between Type I error, Type II error, and power using the sampling distributions under both hypotheses.
The key insight is that we have two different sampling distributions for \(\bar{X}\):
Under the null hypothesis: \(\bar{X} \sim N(\mu_0, \sigma^2/n)\)
Under the alternative hypothesis: \(\bar{X} \sim N(\mu_a, \sigma^2/n)\) where \(\mu_a > \mu_0\)

Fig. 10.1 The sampling distribution of \(\bar{X}\) under the null hypothesis (left curve) and alternative hypothesis (right curve). The critical value \(\bar{x}_{cutoff}\) determines the boundary between rejection and non-rejection regions.
Type I Error (\(\alpha\))
\(\alpha\) is the probability of incorrectly concluding there is a systematic effect when the result is due to chance. Under the null distribution, this corresponds to the area in the rejection region (to the right of the cutoff value).
Type II Error (\(\beta\))
\(\beta\) is the probability of failing to reject the null hypothesis when an effect is present. Under the alternative distribution, this corresponds to the area in the non-rejection region (to the left of the cutoff value).
Statistical Power (\(1-\beta\))
Power is the probability of correctly rejecting the null hypothesis when it’s actually false. Under the alternative distribution, this corresponds to the area in the rejection region (to the right of the cutoff value).
The Critical Relationship
The researcher controls the cutoff value by choosing \(\alpha\). Once this cutoff is set:
The Type I error is fixed at \(\alpha\) (chosen prior to data collection)
The Type II error \(\beta\) and power (1-\(\beta\)) are determined by the position of the alternative distribution relative to this cutoff
10.1.5. Power Analysis in Practice
Prospective Power Analysis
Before conducting a study, researchers should perform prospective power analysis to determine the sample size needed to achieve adequate power (typically 80% or higher) for detecting meaningful effects.
This requires specifying:
The significance level \(\alpha\)
The desired power (1-\(\beta\))
The smallest effect size worth detecting
An estimate of the population standard deviation
Mathematical Framework for Power Calculations
When \(\sigma\) is known and we have an SRS (simple random sample), the standardized sample mean follows a standard normal distribution (exactly for normal populations, approximately otherwise by the CLT):
\(Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)\) under \(H_0\)
For a one-tailed test (\(H_0: \mu \leq \mu_0\) vs \(H_a: \mu > \mu_0\)):
Step 1: Find the critical cutoff value. The cutoff value is determined by the null hypothesis and \(\alpha\):
\(\bar{x}_{cutoff} = \mu_0 + z_{\alpha}\,\dfrac{\sigma}{\sqrt{n}}\)
Where \(z_{\alpha}\) satisfies \(P(Z > z_{\alpha}) = \alpha\).
Step 2: Calculate power. The power is the probability of rejecting \(H_0\) when the true mean is \(\mu_a\):
\(\text{Power} = P\!\left(\bar{X} > \bar{x}_{cutoff} \mid \mu = \mu_a\right)\)
This can be computed using:
\(\text{Power} = 1 - \Phi\!\left(\dfrac{\bar{x}_{cutoff} - \mu_a}{\sigma/\sqrt{n}}\right) = 1 - \Phi\!\left(z_{\alpha} - \dfrac{\mu_a - \mu_0}{\sigma/\sqrt{n}}\right)\)
Step 3: Sample size determination. To achieve a specified power (\(1-\beta\)), we solve for \(n\):
\(n = \left(\dfrac{(z_{\alpha} + z_{\beta})\,\sigma}{\mu_a - \mu_0}\right)^{2}\)
Where \(z_{\beta}\) is the critical value such that \(P(Z > z_{\beta}) = \beta\).
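These three steps translate directly into R. The helper function below and its example values are ours, not part of the text, and assume the same one-tailed setup with known \(\sigma\):
# Step 3: smallest n achieving the target power for the one-tailed z test
n_for_power <- function(mu0, mu_a, sigma, alpha = 0.05, power = 0.80) {
  z_a <- qnorm(1 - alpha); z_b <- qnorm(power)
  ceiling(((z_a + z_b) * sigma / (mu_a - mu0))^2)
}
n <- n_for_power(mu0 = 50, mu_a = 53, sigma = 10, power = 0.80)
se <- 10 / sqrt(n)
cutoff <- 50 + qnorm(0.95) * se          # Step 1: rejection cutoff
1 - pnorm(cutoff, mean = 53, sd = se)    # Step 2: achieved power, at least 0.80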
Retrospective Power Analysis
After completing a study, researchers sometimes calculate the power their test had for detecting various effect sizes. This can help interpret non-significant results and inform future research plans.
However, calculating power using the observed effect size from a non-significant result (sometimes called “observed power”) is generally not informative and can be misleading.
10.1.6. Working with Power: A Practical Example
A pharmaceutical company wants to test whether a new drug reduces average recovery time from a common illness. Historical data shows the standard recovery time is \(\mu_0 = 7\) days with \(\sigma = 2\) days. The company wants to detect a reduction to \(\mu_a = 6\) days (a 1-day improvement) with 90% power at \(\alpha = 0.05\) significance.
Step 1: Identify the components
\(H_0: \mu = 7\) (no improvement)
\(H_a: \mu < 7\) (recovery time reduced)
\(\alpha = 0.05\), so \(z_{\alpha} = 1.645\) (one-tailed test)
Desired power = 0.90, so \(\beta = 0.10\) and \(z_\beta = 1.282\)
Effect size: \(\mu_a - \mu_0 = 6 - 7 = -1\) day
Population standard deviation: \(\sigma = 2\) days
Step 2: Calculate required sample size
\(n = \left(\dfrac{(z_{\alpha} + z_{\beta})\,\sigma}{\mu_0 - \mu_a}\right)^{2} = \left(\dfrac{(1.645 + 1.282)(2)}{1}\right)^{2} \approx 34.3\)
Rounding up, the company needs n = 35 patients to achieve 90% power.
Step 3: Verify the calculation
With n = 35, the standard error is \(\sigma/\sqrt{n} = 2/\sqrt{35} = 0.338\).
The critical value for rejecting \(H_0\) at \(\alpha = 0.05\) (one-tailed) corresponds to:
\(\bar{x}_{critical} = 7 - 1.645 \times 0.338 = 6.44\) days
If the true mean is \(\mu_a = 6\), the power is:
\(P(\bar{X} < 6.44 | \mu = 6) = P\left(Z < \frac{6.44 - 6}{0.338}\right) = P(Z < 1.30) = 0.903\)
This confirms approximately 90% power, validating our sample size calculation.
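A quick R check of the same calculation, using the values above:
mu0 <- 7; mu_a <- 6; sigma <- 2; alpha <- 0.05
z_a <- qnorm(1 - alpha); z_b <- qnorm(0.90)
n <- ceiling(((z_a + z_b) * sigma / (mu_a - mu0))^2)   # required sample size, 35
se <- sigma / sqrt(n)
pnorm(mu0 - z_a * se, mean = mu_a, sd = se)            # achieved power, about 0.90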
A Comprehensive SAT Scores Example
Let’s work through the complete example to demonstrate all the concepts.
Scenario: A teacher believes that the mean SAT score at a high school will be more than 1497 because that was the national average in 2013, and the teacher thinks students at this high school are better than the national average. An SRS of 300 Indiana high school students’ SAT scores is reviewed. Assume the population standard deviation is 200.
Setting up the hypotheses:
\(H_0: \mu \leq 1497\) (students perform at or below national average)
\(H_a: \mu > 1497\) (students perform above national average)
\(\alpha = 0.01\) (1% significance level)
\(n = 300\), \(\sigma = 200\), \(\mu_0 = 1497\)
Question 1: Power to detect a 20-point increase
Suppose we want to know if this test has sufficient power to detect an increase of 20 points (\(\mu_a = 1517\)).
Step 1: Find the cutoff value
\(\bar{x}_{cutoff} = \mu_0 + z_{0.01}\,\dfrac{\sigma}{\sqrt{n}} = 1497 + 2.326 \times \dfrac{200}{\sqrt{300}} \approx 1523.86\)
Step 2: Calculate power
\(\text{Power} = P(\bar{X} > 1523.86 \mid \mu = 1517) = P\!\left(Z > \dfrac{1523.86 - 1517}{200/\sqrt{300}}\right) = P(Z > 0.59) \approx 0.2762\)
Result: The power is only 27.62%, which means there’s a 72.38% probability of Type II error. This test is not sufficiently sensitive to reliably detect a 20-point improvement.
Question 2: Sample size for 90% power
What sample size would be required to detect an increase of 20 points with 90% power at 1% significance?
Step 1: Find critical values
\(z_{0.01} = 2.326348\)
For 90% power, \(\beta = 0.10\), so \(z_{0.1} = 1.281552\)
Step 2: Apply the sample size formula
\(n = \left(\dfrac{(z_{0.01} + z_{0.1})\,\sigma}{\mu_a - \mu_0}\right)^{2} = \left(\dfrac{(2.326348 + 1.281552)(200)}{20}\right)^{2} \approx 1301.7\)
Result: We would need n = 1302 students to achieve 90% power—much larger than the available sample of 300.
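Both SAT results can be reproduced with a few lines of R:
mu0 <- 1497; mu_a <- 1517; sigma <- 200; alpha <- 0.01
se_300 <- sigma / sqrt(300)
cutoff <- mu0 + qnorm(1 - alpha) * se_300     # cutoff for n = 300, about 1523.86
1 - pnorm(cutoff, mean = mu_a, sd = se_300)   # power with n = 300, about 0.2762
ceiling(((qnorm(1 - alpha) + qnorm(0.90)) * sigma / (mu_a - mu0))^2)  # n for 90% power, 1302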
10.1.7. Interactive Power Analysis Visualization
To better understand how the various components of hypothesis testing interact, you can explore an interactive simulation that visualizes the relationship between Type I error, Type II error, power, and the factors that affect them.
This interactive tool allows you to:
Adjust the null and alternative means to see how effect size affects power
Change the significance level (\(\alpha\)) to observe the trade-off between Type I error and power
Modify the sample size to see its dramatic effect on power
Alter the population standard deviation to understand how variability affects detectability
Switch between one-tailed and two-tailed tests to compare their power characteristics
Key Observations to Make:
Effect Size Impact: Move the alternative mean closer to or farther from the null mean. Notice how larger effect sizes (greater separation) dramatically increase power.
Sample Size Power: Increase the sample size and watch both distributions become narrower (smaller standard error), making it easier to distinguish between null and alternative hypotheses.
Alpha Trade-off: Increase \(\alpha\) and see how the red region (Type I error) grows, but the green region (power) also increases while the purple region (Type II error) shrinks.
Standard Deviation Effect: Increase the population standard deviation and observe how both curves become wider, making it harder to detect differences and reducing power.
10.1.8. Understanding the Code Behind the Visualization
The interactive app is built using R Shiny and demonstrates several key statistical computing concepts. Here’s how the power calculation works programmatically:
Core Power Calculation Logic
# Example inputs (assumed values; in the Shiny app these come from the user's controls)
null_mean <- 50; alt_mean <- 53      # hypothesized mean and true (alternative) mean
std_dev <- 10; sample_size <- 25     # population standard deviation and sample size
alpha <- 0.05                        # significance level
hypothesis <- "greater"              # "greater", "less", or "two.sided"

# Calculate standard error of the sampling distribution
std_error <- std_dev / sqrt(sample_size)

# Find critical value based on hypothesis direction
z_critical <- if (hypothesis == "greater") {
  qnorm(1 - alpha)      # upper-tail critical value
} else if (hypothesis == "less") {
  qnorm(alpha)          # lower-tail critical value
} else {
  qnorm(1 - alpha / 2)  # two-tailed critical value
}

# Calculate the cutoff value in original units
cutoff <- null_mean + z_critical * std_error

# Calculate power using the alternative distribution
power <- if (hypothesis == "greater") {
  1 - pnorm(cutoff, alt_mean, std_error)   # area to the right of the cutoff
} else if (hypothesis == "less") {
  pnorm(cutoff, alt_mean, std_error)       # area to the left of the cutoff
} else {
  # Two-tailed case: the rejection region lies in both tails
  upper_cutoff <- null_mean + z_critical * std_error
  lower_cutoff <- null_mean - z_critical * std_error
  (1 - pnorm(upper_cutoff, alt_mean, std_error)) +
    pnorm(lower_cutoff, alt_mean, std_error)
}
power
Visualization Strategy
The app creates two normal distributions:
Null distribution: Centered at \(\mu_0\) with standard error \(\sigma/\sqrt{n}\)
Alternative distribution: Centered at \(\mu_a\) with the same standard error
The colored regions represent:
Red areas: Type I error (\(\alpha\)) - rejection region under null distribution
Purple areas: Type II error (\(\beta\)) - non-rejection region under alternative distribution
Green areas: Power (1-\(\beta\)) - rejection region under alternative distribution
Why This Visualization Matters
This interactive approach helps students understand several crucial concepts that are often difficult to grasp from static diagrams alone:
Dynamic relationships: See how changing one parameter affects all others simultaneously
Magnitude of effects: Understand whether a change in sample size from 30 to 40 matters as much as a change from 30 to 100
Practical constraints: Recognize why achieving very high power often requires impractically large sample sizes
Design decisions: Appreciate the trade-offs researchers face when planning studies
The app reinforces that power analysis isn’t just a mathematical exercise—it’s a practical tool for making informed decisions about study design and resource allocation.
10.1.9. The Assumptions Behind Hypothesis Testing
Like confidence intervals, hypothesis tests for population means rely on several key assumptions:
Random Sampling
The data must represent a simple random sample from the population of interest. Systematic sampling biases can invalidate the test results by making the sample non-representative.
Independence
Individual observations must be independent of each other. Dependence between observations can affect the sampling distribution of the test statistic in ways that invalidate the p-value calculations.
Distributional Assumptions
For tests about population means:
If the population is normally distributed, the test is exact for any sample size
If the population is not normal, the Central Limit Theorem provides approximate validity for sufficiently large samples (typically n ≥ 30, but this depends on the degree of non-normality)
Known vs. Unknown Population Parameters
When the population standard deviation \(\sigma\) is known, we use z-tests based on the standard normal distribution. When \(\sigma\) is unknown (the more common situation), we use t-tests based on the t-distribution with appropriate degrees of freedom.
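For instance, here is a brief R illustration with simulated data (all values assumed); when \(\sigma\) is unknown, the built-in t.test() performs the one-sample t-test:
set.seed(42)
x <- rnorm(25, mean = 52, sd = 8)             # simulated sample, sigma treated as unknown
t.test(x, mu = 50, alternative = "greater")   # tests H0: mu <= 50 vs Ha: mu > 50
# If sigma were known, we would instead compute z = (mean(x) - 50) / (sigma / sqrt(25))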
10.1.10. Connecting Hypothesis Tests and Confidence Intervals
Hypothesis tests and confidence intervals are intimately connected—they’re two sides of the same inferential coin. This connection, called duality, provides important insights:
The Duality Principle
A two-sided \(100(1-\alpha)\%\) confidence interval contains exactly those null hypothesis values that would not be rejected at significance level \(\alpha\) in a two-sided test.
Conversely, if a two-sided test rejects \(H_0: \mu = \mu_0\) at significance level \(\alpha\), then \(\mu_0\) lies outside the corresponding \(100(1-\alpha)\%\) confidence interval.
Practical Implications
This duality means:
Confidence intervals provide information about a range of plausible parameter values
Hypothesis tests provide yes/no answers about specific parameter values
Both approaches give equivalent information, just presented differently
For example, if a 95% confidence interval for \(\mu\) is [12.3, 18.7], then:
We would reject \(H_0: \mu = 10\) at \(\alpha = 0.05\) (since 10 is outside the interval)
We would fail to reject \(H_0: \mu = 15\) at \(\alpha = 0.05\) (since 15 is inside the interval)
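The duality can be seen directly in R with simulated data (all values assumed):
set.seed(7)
x <- rnorm(30, mean = 15, sd = 5)                  # hypothetical sample
ci <- t.test(x, conf.level = 0.95)$conf.int        # 95% confidence interval
p_at <- function(mu0) t.test(x, mu = mu0)$p.value  # two-sided p-value for a given null value
ci
p_at(mean(ci))     # a value inside the CI: p-value > 0.05, fail to reject
p_at(ci[2] + 1)    # a value outside the CI: p-value < 0.05, reject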
10.1.11. Bringing It All Together
This chapter has established the conceptual foundation for hypothesis testing, including the decision framework, error types, and power considerations. In the following chapters, we’ll apply these concepts to specific testing scenarios:
Chapter 10.2 will develop z-tests for population means when \(\sigma\) is known
Chapter 10.3 will extend to t-tests when \(\sigma\) is unknown
Chapter 10.4 will explore the broader concept of statistical significance and its interpretation
The framework we’ve established—null and alternative hypotheses, test statistics, p-values, Type I and Type II errors, and power—applies to all these specific tests and indeed to most hypothesis testing procedures in statistics.
Key Takeaways 📝
Hypothesis testing provides a formal framework for evaluating specific claims about population parameters using sample evidence.
The null hypothesis represents the status quo or baseline claim, while the alternative hypothesis represents what we’re trying to establish.
Type I error (false positive) occurs when we reject a true null hypothesis with probability \(\alpha\).
Type II error (false negative) occurs when we fail to reject a false null hypothesis with probability \(\beta\).
Statistical power (1-\(\beta\)) measures our ability to detect false null hypotheses and depends on effect size, sample size, significance level, and population variability.
Power analysis helps determine appropriate sample sizes before conducting studies and interpret results afterward.
The p-value quantifies the strength of evidence against the null hypothesis, while the significance level \(\alpha\) sets our evidence threshold.
Hypothesis tests and confidence intervals are dual procedures that provide complementary perspectives on statistical inference.
Understanding these foundational concepts prepares us to tackle specific testing procedures with confidence, knowing that the same logical framework applies whether we’re testing means, proportions, or other population parameters.
Exercises
Conceptual Understanding: Explain the difference between Type I and Type II errors using a medical testing scenario (e.g., testing for a disease). Which error would be more serious for the patient, and why?
Power Calculation: A researcher wants to test whether a new teaching method improves average test scores. Historical data shows mean scores of 75 with standard deviation 10. The researcher wants 80% power to detect an improvement to 78 points at \(\alpha = 0.05\). Calculate the required sample size.
Error Trade-offs: If a researcher reduces the significance level from \(\alpha = 0.05\) to \(\alpha = 0.01\) while keeping everything else constant, what happens to: a) The probability of Type I error? b) The probability of Type II error? c) The power of the test?
Duality Relationship: A 95% confidence interval for a population mean is [23.4, 28.6]. Without doing any calculations, determine whether the following null hypotheses would be rejected or not rejected at \(\alpha = 0.05\): a) \(H_0: \mu = 25\) b) \(H_0: \mu = 30\) c) \(H_0: \mu = 22\)
Power Analysis: A pharmaceutical company plans to test a new drug and wants to detect a 5-point reduction in blood pressure (from 140 to 135 mmHg) with 90% power at \(\alpha = 0.01\). If the population standard deviation is 15 mmHg, how many patients should be enrolled in the study?
Legal Analogy Extension: In the context of hypothesis testing as a legal trial, explain what each of the following represents: a) The burden of proof b) Circumstantial evidence vs. direct evidence c) A hung jury d) An appeal process