10.1. The Foundation of Hypothesis Testing

Hypothesis testing is a key statistical inference framework that assesses whether claims about population parameters are reasonable based on data evidence. In this lesson, we establish the basic language of hypothesis testing to prepare for the formal steps covered in the upcoming lessons.

Road Map 🧭

  • Learn the building blocks of hypothesis testing.

  • Formulate the null and alternative hypotheses in the correct format for a given research question.

  • Understand the logic of constructing a decision rule.

  • Recognize the two types of errors that can arise in hypothesis testing and understand how they are controlled or influenced by different components of the procedure.

10.1.1. The Building Blocks of Hypothesis Testing

A. The Dual Hypothesis Framework

A statistical hypothesis is a claim about one or more population parameters, expressed as a mathematical statement. The first step in hypothesis testing is to frame the research question as two competing hypotheses:

  • Null hypothesis, \(H_0\): the status quo or a baseline claim, assumed true until there is sufficient evidence to conclude otherwise

  • Alternative hypothesis, \(H_a\): the competing claim to be tested against the null

When testing a claim about the population mean \(\mu\), the hypothesis formulation follows a set of rules summarized by Fig. 10.1 and the following list.

Fig. 10.1 Template for dual hypothesis

  1. In Fig. 10.1, the part in black always stays unchanged.

  2. \(\mu_0\), called the null value, is a point of comparison for \(\mu\) taken from the research context. It is represented with a symbol here, but it takes a concrete numerical value in applications.

  3. \(H_0\) and \(H_a\) are complementary—their cases must not overlap yet together encompass all possibilities for the parameter. \(H_0\) always includes an equality sign (\(=, \leq, \geq\)), while the inequality in \(H_a\) is always strict (\(\neq, <, >\)).

Let us get some practice applying these rules correctly to research questions.

Example 💡: Writing the Hypotheses Correctly

For each research scenario below, write the appropriate set of hypotheses for conducting a hypothesis test. Be sure to follow all the rules for presenting hypotheses.

  1. The census data show that the mean household income in an area is $63K (63 thousand dollars) per year. A market research firm wants to find out whether the mean household income of the shoppers at a mall in this area is HIGHER THAN that of the general population.

    Let \(\mu\) denote the true mean household income of the shoppers at this mall. The dual hypothesis is:

    \[\begin{split}&H_0: \mu \leq 63\\ &H_a: \mu > 63\end{split}\]

    The question raised by the study will always align with the alternative hypothesis. Also note that the generalized symbol \(\mu_0\) in the template (Fig. 10.1) is replaced with a specific value, 63, from the context.

  2. Last year, your company’s service technicians took an average of 2.6 hours to respond to calls from customers. Do this year’s data show a DIFFERENT average time?

    Let \(\mu\) denote the true mean response time of service technicians this year. The dual hypothesis appropriate for this research question is:

    \[\begin{split}&H_0: \mu = 2.6\\ &H_a: \mu \neq 2.6\end{split}\]
  3. The drying time of paint under a specified test condition is known to be normally distributed with mean 75 min and standard deviation 9 min. Chemists have proposed a new additive designed to DECREASE average drying time. Should the company change to the new additive?

    Let \(\mu\) be the true mean drying time of the new paint formula. Then we have:

    \[\begin{split}&H_0: \mu \geq 75\\ &H_a: \mu < 75\end{split}\]

Three Types of Hypotheses

From these examples, we see that there are three main ways to formulate a pair of hypotheses. Focusing on the alternative side,

  • A test with \(H_a: \mu > \mu_0\) is called an upper-tailed (right-tailed) hypothesis test.

  • A test with \(H_a: \mu < \mu_0\) is called a lower-tailed (left-tailed) hypothesis test.

  • A test with \(H_a: \mu \neq \mu_0\) is called a two-tailed hypothesis test.

B. The Significance Level

Before collecting any data, we must decide how strong the evidence must be to reject the null hypothesis. The significance level, denoted \(\alpha\), is the pre-specified probability that represents our tolerance for the error of rejecting a true null hypothesis. A small value, typically less than or equal to \(0.1\), is chosen based on expert recommendations, legal requirements, or field conventions. The smaller the \(\alpha\), the stronger the evidence must be to reject the null hypothesis.

C. The Test Statistic and the Decision

Identifying the Goal

For a concrete context, suppose we perform the upper-tailed hypothesis test for the true mean income of shoppers at a mall, taken from the first of the three examples above.

\[\begin{split}&H_0: \mu \leq 63\\ &H_a: \mu > 63\end{split}\]

Let us also assume that

  1. \(X_1, X_2, \cdots, X_n\) form an iid sample from the population \(X\) with mean \(\mu\) and variance \(\sigma^2\).

  2. Either the population \(X\) is normally distributed, or the sample size \(n\) is sufficiently large for the CLT to hold.

  3. The population variance \(\sigma^2\) is known.

We now need to develop an objective rule for rejecting the null hypothesis. This rule must (1) be applicable in any upper-tailed hypothesis testing scenario where the assumptions hold, and (2) satisfy the maximum error tolerance condition given by \(\alpha\).

Finding the Decision Rule

It is natural to view the sample mean \(\bar{X}\) as central to the decision, since it is one of the best indicators of the true location of \(\mu\). In the simplest terms, if \(\bar{X}\) yields an observed value \(\bar{x}\) much larger than 63 (thousands of dollars), we would want to reject the null hypothesis, whereas if it is close to or lower than 63, there would not be enough evidence against it. The key question is, how large must \(\bar{x}\) be to count as sufficient evidence against the null?

Under the set of assumptions about the distribution of \(X\) and its sampling conditions, \(\bar{X}\) (approximately) follows a normal distribution. In addition, its full distribution can be given by

\[\bar{X} \sim N\left(63, \frac{\sigma^2}{n}\right)\]

under the null hypothesis. We call this the null distribution.

Let us consider rejecting the null hypothesis only if \(\bar{x}\) lands above the cutoff that marks an upper area of \(\alpha\) under the null distribution:

Fig. 10.2 Decision rule for an upper-tailed hypothesis test

  1. Is this rule objective and universally applicable in other upper-tailed hypothesis tests?

    Yes. If the same set of assumptions holds, we can construct an equivalent rule by substituting the appropriate values of \(\mu_0, \sigma^2\), and \(n\).

  2. Does this rule limit the false rejection rate to at most \(\alpha\) ?

    Yes. If \(H_0\) were indeed true, then according to the null distribution, \(\bar{X}\) would generate values above the cutoff only \(\alpha \cdot 100 \%\) of the time. By design, then, the probability of incorrectly rejecting the null hypothesis is capped at \(\alpha\) (see the simulation sketch after this list).

  3. What about other potential values under \(H_0\) ?

    The null hypothesis \(H_0: \mu \leq 63\) proposes that \(\mu\) is anything less than or equal to the null value, 63. Is the decision rule also safe for candidate values other than 63? The answer is yes. When the true mean is strictly less than \(\mu_0\), the entire distribution of \(\bar{X}\) slides to the left and away from the cutoff, leaving an upper-tail area smaller than \(\alpha\):

    Fig. 10.3 Candidate values for \(\mu\) other than \(\mu_0\) in the null hypothesis

    Therefore, error-inducing outcomes are generated even less frequently than when the population mean is exactly 63. In general, the boundary case \(\mu=\mu_0\) represents the worst-case scenario for the false rejection rate. If the rule is safe in the boundary case, then it is safe under every other scenario belonging to the null hypothesis.
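A quick simulation can make the second point above concrete. The following R sketch, with assumed values \(\sigma = 4\) and \(n = 35\) and the cutoff formula derived in the next subsection, checks that the rule rejects a true boundary null about \(\alpha \cdot 100\%\) of the time:

    # Estimate the false rejection rate under the boundary null, mu = 63
    set.seed(1)
    mu_0 <- 63; sigma <- 4; n <- 35; alpha <- 0.05   # assumed values
    cutoff <- qnorm(alpha, lower.tail = FALSE) * sigma / sqrt(n) + mu_0
    x_bars <- replicate(1e5, mean(rnorm(n, mean = mu_0, sd = sigma)))
    mean(x_bars > cutoff)   # close to 0.05, as designed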

How to Locate the Cutoff

The exact location of the cutoff can be computed by viewing it as the \((1-\alpha)\cdot 100\)-th percentile of the boundary-case null distribution. Using the techniques learned in Chapter 6, confirm that the cutoff is \(z_{\alpha}\frac{\sigma}{\sqrt{n}} + 63\) for this example, where \(z_\alpha\) is the \(z\)-critical value used in Chapter 9. In general,

\[\text{cutoff}_{upper} = z_{\alpha}\frac{\sigma}{\sqrt{n}} + \mu_0.\]

In summary, we reject the null hypothesis of an upper-tailed hypothesis test if \(\bar{x} > z_{\alpha}\frac{\sigma}{\sqrt{n}} + \mu_0\) or, by standardizing both sides, if

\[\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} > z_{\alpha}.\]
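As a minimal sketch of applying this rule, suppose hypothetical values \(\sigma = 4\), \(n = 35\), \(\alpha = 0.05\), and an observed sample mean \(\bar{x} = 64.5\) (all assumptions for illustration, not given in the example):

    # Decision rule for the mall-income example, with hypothetical inputs
    mu_0 <- 63; sigma <- 4; n <- 35; alpha <- 0.05; x_bar <- 64.5
    z_alpha <- qnorm(alpha, lower.tail = FALSE)        # z-critical value
    cutoff_upper <- z_alpha * sigma / sqrt(n) + mu_0   # about 64.11
    x_bar > cutoff_upper                               # TRUE: reject H0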

What About the Cutoff for a Lower-Tailed Hypothesis Test?

By making a mirror argument of this section, confirm that you would reject the null hypothesis for a lower-tailed hypothesis test if \(\bar{x} < -z_\alpha\frac{\sigma}{\sqrt{n}} + \mu_0 = \text{cutoff}_{lower}\), or by standardizing both sides, if

\[\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} < -z_{\alpha}.\]

The Test Statistic

A statistic that measures the consistency of the observed data with the null hypothesis is called the test statistic. For hypothesis tests on a population mean, \(\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\) plays this role. Its realized value represents the standardized distance between the hypothesized true mean \(\mu_0\) and the generated outcome \(\bar{x}\). It is also used for comparison against a \(z\)-critical value to draw the final decision. Since it follows the standard normal distribution under the null hypothesis, we denote this quantity \(Z_{TS}\):

\[Z_{TS} = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\]

and call it the \(z\)-test statistic.
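The same decision can be reproduced on the standardized scale. A sketch with the same hypothetical inputs as above:

    # z-test statistic: standardized distance between x-bar and mu_0
    mu_0 <- 63; sigma <- 4; n <- 35; alpha <- 0.05; x_bar <- 64.5
    z_ts <- (x_bar - mu_0) / (sigma / sqrt(n))         # about 2.22
    z_ts > qnorm(alpha, lower.tail = FALSE)            # TRUE: same conclusion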

10.1.2. Understanding Type I and Type II Errors

Since the results of hypothesis tests always accompany a degree of uncertainty, it is important to analyze the likelihood and consequences of the possible errors. There are two types of errors that can arise in hypothesis testing. The error of incorrectly rejecting a true null hypothesis is called the Type I error, while the error of failing to reject a false null hypothesis is called the Type II error. The table below summarizes the different combinations of reality and decision.

Reality \ Decision      Fail to Reject \(H_0\)     Reject \(H_0\)
\(H_0\) is True         ✅ Correct                  ❌ Type I Error
\(H_0\) is False        ❌ Type II Error            ✅ Correct

Type I Error: False Positive

A Type I error occurs when a true null hypothesis is rejected. This error results in a false positive claim of an effect or difference when none actually exists. The probability of making a Type I error is denoted \(\alpha\). Formally,

\[\alpha = P(\text{Type I error}) = P(\text{Reject } H_0|H_0 \text{ is true}).\]

Examples of Type I errors

  • Concluding that a new drug is effective when it actually has no effect

  • Claiming that a manufacturing process has changed when it is operating the same way as before

Type II Error: False Negative

A Type II error occurs when a false null hypothesis is not rejected. This results in a false negative case where a real effect or difference goes undetected. The probability of making a Type II error is

\[\beta = P(\text{Type II error}) = P(\text{Fail to reject } H_0| H_0 \text{ is false}).\]

Examples of Type II errors

  • Failing to detect that a new drug is more effective than placebo

  • Failing to recognize that a manufacturing process has deteriorated

Error Trade-offs and Prioritization of the Type I Error

Type I and Type II errors are inversely related: efforts to reduce one type of error typically increase the other. The only ways to reduce both error types simultaneously are to increase the sample size, collect higher-quality data, or improve the measurement process.

When constructing a hypothesis test under limited resources, therefore, we must prioritize one error over the other. We typically choose to control \(\alpha\). That is, we design the decision-making procedure so that its probability of Type I error remains below a pre-specified maximum. We make this choice because falsely claiming a change from the status quo often carries substantial immediate cost. Such costs can include purchasing new factory equipment, setting up a production line and marketing strategy for a new drug, or revising a business contract.

As a consequence, we cannot directly control \(\beta\). Instead, we analyze and try to minimize \(\beta\) by understanding its relationship with the population distribution, the sample size, and the significance level \(\alpha\).

A Legal System Analogy 🧑‍⚖️

The analogy between hypothesis testing and the American legal system offers useful insight. Just as we would rather let a guilty person go free than convict an innocent person, we are generally more concerned about incorrectly rejecting a true null hypothesis than about failing to detect a false one. Further,

  • The null hypothesis is like the defendant, presumed innocent until proven guilty.

  • The alternative hypothesis is like the prosecutor, trying to establish the defendant’s guilt.

  • The significance level \(\alpha\) represents the standard of evidence required for conviction.

  • The test statistic summarizes all the evidence presented at trial.

  • The p-value (to be discussed in Section 10.2) measures how convincing this evidence is.

10.1.3. Statistical Power: The Ability to Detect True Change

Definition

Statistical power is the probability that a test will correctly reject a false null hypothesis. It represents the test’s ability to detect an unusual effect when it actually exists. It is also the complement of the Type II error probability, \(\beta\).

\[\text{Power} = P(\text{Reject } H_0 | H_0 \text{ is false}) = 1 - \beta\]

Power ranges from 0 to 1, with higher values indicating a better ability to detect false null hypotheses. A power of 0.80 means that if the null hypothesis is false, the test correctly rejects it with an 80% chance.

Visualization of \(\alpha, \beta\), and Power

Let us continue with the upper-tailed hypothesis test for the true mean household income of shoppers at a mall. The dual hypothesis is:

\[\begin{split}&H_0: \mu \leq 63\\ &H_a: \mu > 63\end{split}\]

where the null value is \(\mu_0 = 63\). We agreed to reject the null hypothesis whenever \(\bar{x}\) was “too large”, or when

\[\bar{x} > z_{\alpha}\frac{\sigma}{\sqrt{n}} + 63 = \text{cutoff}_{upper}.\]

Let us also make the unrealistic assumption that we know the true value of \(\mu\): it equals \(\mu_a = 65\), placing reality on the alternative side. Let us visualize this along with the resulting locations of \(\alpha\), \(\beta\), and power:

Fig. 10.4 Locations of \(\alpha\), \(\beta\), and power marked on the null and alternative distributions of \(\bar{X}\). Plots were generated using \(n=35, \mu_0 = 63, \mu_a=65, \sigma=4, \alpha =0.05\) in the 🔗 Power Simulator.

The diagram has two normal densities partially overlapping one another. The left curve, centered at \(\mu_0 = 63\), represents the reality assumed by the null hypothesis, while the right curve, centered at \(\mu_a=65,\) represents the truth. According to our decision rule, we draw the cutoff where it leaves an upper-tail area of \(\alpha\) (red) under the null distribution and reject the null hypothesis whenever we see the sample mean land above it.

What happens if, in the meantime, the sample means are actually being generated from the right (alternative) curve? With probability \(\beta\) (purple), the observed sample mean will fail to lead to a rejection of \(H_0\) (a Type II error). All other outcomes lead to a correct rejection, with probability represented by the green area (power).
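These three areas can also be computed directly rather than read off the plot. A short R sketch using the parameter values from the Fig. 10.4 caption:

    # alpha, beta, and power for the Fig. 10.4 setting
    mu_0 <- 63; mu_a <- 65; sigma <- 4; n <- 35; alpha <- 0.05
    se <- sigma / sqrt(n)
    cutoff <- qnorm(alpha, lower.tail = FALSE) * se + mu_0
    beta <- pnorm(cutoff, mean = mu_a, sd = se)    # purple area under the right curve
    power <- 1 - beta                              # green area
    c(cutoff = cutoff, beta = beta, power = power) # about 64.11, 0.095, 0.905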

Let us observe how the sizes of these three regions are influenced by different components of the experiment.

What Influences Power?

Significance Level, \(\alpha\)

Fig. 10.5 \(\alpha\), \(\beta\), and power as \(\alpha\) changes (\(\alpha=0.01, 0.05, 0.1\) from top to bottom)

The central plot with the blue outline is the original, identical to Fig. 10.4. A smaller \(\alpha\) pushes the cutoff up in an upper-tailed hypothesis test, since it calls for more caution against Type I error and requires stronger evidence (a larger \(\bar{x}\)) to reject the null hypothesis. In response, the probability of Type II error increases (purple) and the power decreases (green).

True Mean, \(\mu_a\)

Fig. 10.6 \(\alpha\), \(\beta\), and power as the alternative truth \(\mu_a\) changes (\(\mu_a=64,65,66\) from top to bottom)

If the hypothesized \(\mu_0\) and the true effect \(\mu_a\) are close to each other, it is naturally harder to separate the two cases. Even though \(\alpha\) stays constant (because we explicitly control it), the power decreases and the Type II error probability goes up as the distance between \(\mu_0\) and \(\mu_a\) narrows.

Population Standard Deviation, \(\sigma\)

Fig. 10.7 \(\alpha\), \(\beta\), and power as the population standard deviation \(\sigma\) changes (\(\sigma=2.5, 4, 5\) from top to bottom)

Recall that \(\bar{X}\) has the standard deviation \(\sigma/\sqrt{n}\). When \(\sigma\) decreases while \(\mu_0\) and \(\mu_a\) stay constant, the two densities become narrower around their respective means, creating a stronger separation between the two cases. This leads to a higher power and smaller Type II error probability.

Sample Size, \(n\)

Fig. 10.8 \(\alpha\), \(\beta\), and power as the sample size \(n\) changes (\(n=13, 35, 70\) from top to bottom)

The sample size \(n\) also affects the spread of the distribution of \(\bar{X}\), but in the opposite direction of \(\sigma\). As \(n\) decreases, \(\sigma/\sqrt{n}\) increases, making the densities wider. Larger overlap between the distributions leads to decreased power and higher Type II error probability.
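All four effects can be checked numerically. The sketch below, using the Fig. 10.4 baseline and varying one component at a time, recomputes the power behind each panel of Fig. 10.5 through Fig. 10.8:

    # Power of the upper-tailed z-test as each design component varies
    power_z <- function(n = 35, mu_0 = 63, mu_a = 65, sigma = 4, alpha = 0.05) {
      se <- sigma / sqrt(n)
      cutoff <- qnorm(alpha, lower.tail = FALSE) * se + mu_0
      pnorm(cutoff, mean = mu_a, sd = se, lower.tail = FALSE)
    }
    sapply(c(0.01, 0.05, 0.10), function(a) power_z(alpha = a)) # grows with alpha
    sapply(c(64, 65, 66), function(m) power_z(mu_a = m))        # grows with the effect
    sapply(c(2.5, 4, 5), function(s) power_z(sigma = s))        # shrinks as sigma grows
    sapply(c(13, 35, 70), function(k) power_z(n = k))           # grows with n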

Power Analysis Simulator 🎮

Explore how \(\alpha, \beta\), and statistical power relate to each other, and reproduce the images used in this section using:

🔗 Interactive Demo | 📄 R Code

10.1.4. Prospective Power Analysis

From the previous discussion, we find that the only realistic way to control statistical power is through the sample size, \(n\). Before conducting a study, researchers perform prospective power analysis to determine the sample size needed to ensure adequate power in their tests.

We continue with the upper-tailed hypothesis test on the true mean household income of shoppers at a mall:

\[\begin{split}&H_0: \mu \leq 63\\ &H_a: \mu > 63\end{split}\]

Suppose that the researchers want the test to reliably detect an increase of $2K in mean household income. Specifically, such a jump should be detected with probability at least 0.8. In other words, we want:

\[\text{Power} = 1-\beta \geq 0.8\]

when \(\mu = 63 + 2 = 65\). The magnitude of change to be detected, 2 in this case, is also called the effect size.

Additionally, we still assume:

  1. \(X_1, X_2, \cdots, X_n\) form an iid sample from a population \(X\) with mean \(\mu\) and variance \(\sigma^2\).

  2. Either the population \(X\) is normally distributed, or the sample size \(n\) is sufficiently large for the CLT to hold.

  3. The population variance \(\sigma^2\) is known.

Step 1: Mathematically Clarify the Goal

In general, \(\text{Power} = P(\text{Reject } H_0|H_0 \text{ is false})\).

We replace the general definition with the specific conditions given by our problem. The event of “rejecting \(H_0\)” is equivalent to the event \(\{\bar{X} > \text{cutoff}_{upper}\},\) and the event that \(H_0\) is false should now reflect the desired effect size. Therefore, our goal is to find \(n\) satisfying:

\[\text{Power} = P\left(\bar{X} > \text{cutoff}_{upper} \Bigg| \mu=65\right) \geq 0.8\]

To guarantee this, it suffices to require

\[\beta = P\left(\bar{X} \leq \text{cutoff}_{upper} \Bigg| \mu=65\right) < 0.2.\]

Denote the value \(0.2\) by \(\beta_{max}\), since we do not allow \(\beta\) to be larger than \(0.2\).

Step 2: Simplify and Calculate

Let us break down the latter form of our mathematical goal.

  • From the conditional information, we know that \(\bar{X}\) follows the distribution \(N(65, \sigma^2/n)\).

  • Since the goal is written with a strict inequality, the cutoff must be strictly less than the 20th (\(\beta_{max}\cdot 100\)-th) percentile of \(N(65, \sigma^2/n)\). Mathematically,

    \[\text{cutoff}_{upper} < -z_{\beta_{max}}\frac{\sigma}{\sqrt{n}} + 65\]

    where \(z_{\beta_{max}}\) is a \(z\)-critical value computed for the upper-tail area \(\beta_{max}\).

  • Replacing the \(\text{cutoff}_{upper}\) with its complete formula,

    \[z_\alpha\frac{\sigma}{\sqrt{n}} + 63 < -z_{\beta_{max}}\frac{\sigma}{\sqrt{n}} + 65.\]

    Isolate \(n\):

    \[n > \left(\frac{(z_{\alpha} + z_{\beta_{max}}) \sigma}{65 - 63}\right)^2.\]

Since \(n\) must be an integer, we take the smallest integer above this lower bound.

Summary

In an upper-tailed hypothesis test, the minimum sample size for a desired power lower bound \(1-\beta_{max}\) and an effect size \(|\mu_a-\mu_0|\) is the smallest integer \(n\) satisfying:

\[n > \left(\frac{(z_{\alpha} + z_{\beta_{\max}}) \sigma}{|\mu_a - \mu_0|}\right)^2.\]
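As a sketch under the stated assumptions (one-tailed test, known \(\sigma\)), the formula translates directly into R. The call below reuses the mall example with the Fig. 10.4 value \(\sigma = 4\), effect size 2, \(\alpha = 0.05\), and \(\beta_{max} = 0.2\):

    # Minimum sample size for a one-tailed z-test with a desired power
    min_n <- function(alpha, beta_max, sigma, effect) {
      z_a <- qnorm(alpha, lower.tail = FALSE)
      z_b <- qnorm(beta_max, lower.tail = FALSE)
      bound <- ((z_a + z_b) * sigma / effect)^2
      floor(bound) + 1   # smallest integer strictly above the bound
    }
    min_n(alpha = 0.05, beta_max = 0.2, sigma = 4, effect = 2)   # 25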

Prospective Power Analysis for Lower-tailed Hypothesis Tests

By walking through a mirror argument, confirm that the minimum sample size \(n\) for a desired power lower bound \(1-\beta_{max}\) and an effect size \(|\mu_a-\mu_0|\) in a lower-tailed hypothesis test is determined by the same formula as the upper-tailed case.

Example 💡: Compute Power for SAT Scores

A teacher at STAT High School believes that their students score higher on the SAT than the 2013 national average of 1497. Assume the true standard deviation of SAT scores from this school is 200.

Q1: The teacher wants to construct a hypothesis test at 0.01 significance level that can detect a 20-point increase in the true mean effectively. If the current sample size is 300, what is the power of this test?

Step 1: Identify the Components

  • The dual hypothesis is:

    \[\begin{split}&H_0: \mu \leq 1497\\ &H_a: \mu > 1497\end{split}\]
  • \(\alpha = 0.01\) (\(z_{0.01} = 2.326348\))

    z_alpha <- qnorm(0.01, lower.tail = FALSE)
    
  • Effect size: \(20\) points. This makes \(\mu_a = \mu_0 + 20 = 1497 + 20 = 1517\).

  • Population standard deviation is \(\sigma = 200\) points

  • Current sample size: \(n=300\)

Step 2: Find the Cutoff

\[\text{cutoff}_{upper} = 1497 + \frac{200}{\sqrt{300}}(2.326348) = 1497 + 26.862 = 1523.862\]

Step 3: Calculate Power

\[\text{Power} = P(\bar{X} > 1523.862 | \mu = 1517)\]

Using the conditional information, we compute the probability assuming that \(\bar{X} \sim N(1517, \sigma^2/n)\).

\[\begin{split}\text{Power} &= P\left(\frac{\bar{X}-1517}{\sigma/\sqrt{n}} > \frac{1523.862-1517}{\sigma/\sqrt{n}}\right)\\ &= P(Z > 0.5943) = 0.2762\end{split}\]

Result: The power is only 27.62%. This test is not sufficiently sensitive to reliably detect a 20-point improvement.
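The arithmetic can be verified in R:

    # Power for the SAT example: n = 300, alpha = 0.01
    mu_0 <- 1497; mu_a <- 1517; sigma <- 200; n <- 300; alpha <- 0.01
    se <- sigma / sqrt(n)
    cutoff <- qnorm(alpha, lower.tail = FALSE) * se + mu_0   # about 1523.862
    pnorm(cutoff, mean = mu_a, sd = se, lower.tail = FALSE)  # about 0.2762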

Example 💡: Compute Minimum Sample Size for SAT Scores

Q2: Continuing with the SAT scores problem, what is the minimum sample size required for the test to detect a 20-point increase with at least 90% chance?

Step 1: Identify the Components

  • \(\text{Power} \geq 0.90\) is required, so we set \(\beta_{max} = 0.10\) and require \(\beta = 1 - \text{Power} < \beta_{max}\).

  • \(z_{\beta_{max}} = 1.282\)

    z_betamax <- qnorm(0.1, lower.tail=FALSE)
    

Step 2: Apply the Formula

\[\begin{split}n &> \left[\frac{(z_{\alpha} + z_{\beta_{max}}) \sigma}{|\mu_a - \mu_0|}\right]^2 = \left[\frac{(2.326348 + 1.281552)(200)}{1517-1497}\right]^2\\ &= \left[\frac{(3.6079)(200)}{20}\right]^2 = (36.079)^2 = 1301.69\end{split}\]

Result: We would need at least \(n = 1302\) students to achieve 90% power—much larger than the available sample of 300.
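A quick check in R:

    # Minimum n for 90% power in the SAT example
    z_alpha   <- qnorm(0.01, lower.tail = FALSE)
    z_betamax <- qnorm(0.10, lower.tail = FALSE)
    bound <- ((z_alpha + z_betamax) * 200 / 20)^2   # about 1301.69
    floor(bound) + 1                                # 1302 students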

Example 💡: Average Recovery Time

A pharmaceutical company wants to test whether a new drug reduces average recovery time from a common illness. Historical data shows the standard recovery time is \(\mu_0 = 7\) days with \(\sigma = 2\) days. The company wants to detect a reduction to \(\mu_a = 6\) days (a 1-day improvement) with 90% power at \(\alpha = 0.05\) significance.

Step 1: Identify the Components

  • The hypotheses

    \[\begin{split}&H_0: \mu \geq 7\\ &H_a: \mu < 7\end{split}\]
  • The significance level: \(\alpha = 0.05\) \((z_{\alpha} = 1.645)\)

    z_alpha <- qnorm(0.05, lower.tail=FALSE)
    
  • \(\text{Power} \geq 0.90\) is required, so we set \(\beta_{max} = 0.10\) and require \(\beta = 1 - \text{Power} < \beta_{max}\). \((z_{\beta_{max}} = 1.282)\)

    z_betamax <- qnorm(0.1, lower.tail=FALSE)
    
  • Effect size: \(|\mu_a - \mu_0| = |6 - 7| = 1\) day

  • Population standard deviation: \(\sigma = 2\) days

Step 2: Calculate Required Sample Size

\[\begin{split}n &> \left(\frac{(1.645 + 1.282)(2)}{|6 - 7|}\right)^2 \\ &= \left(\frac{(2.927)(2)}{1}\right)^2 = (5.854)^2 \approx 34.3\end{split}\]

The company needs at least \(n = 35\) patients to achieve statistical power of at least 90%.
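Checking in R:

    # Minimum n for the recovery-time example
    z_alpha   <- qnorm(0.05, lower.tail = FALSE)
    z_betamax <- qnorm(0.10, lower.tail = FALSE)
    bound <- ((z_alpha + z_betamax) * 2 / 1)^2   # about 34.3
    floor(bound) + 1                             # 35 patients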

10.1.5. Bringing It All Together

Key Takeaways 📝

  1. Hypothesis testing provides a framework for evaluating specific claims about population parameters using sample evidence. It consists of formally stating the null and alternative hypotheses, setting the significance level, computing a test statistic, assessing the strength of the evidence it provides, and drawing a decision.

  2. Type I error (false positive) occurs when a true null hypothesis is rejected. Its probability, denoted \(\alpha\), is the significance level of the test.

  3. Type II error (false negative) occurs when a false null hypothesis is not rejected. It occurs with probability \(\beta\).

  4. Statistical power (\(1-\beta\)) measures a test’s ability to detect false null hypotheses. It depends on the sample size, the significance level, the population standard deviation, and the size of the true effect.

Exercises

  1. Power Calculation: A researcher wants to test whether a new teaching method improves average test scores. Historical data shows mean scores of 75 with standard deviation 10. The researcher wants 80% power to detect an improvement to 78 points at \(\alpha = 0.05\). Calculate the required sample size.

  2. Error Trade-offs: If a researcher reduces the significance level from \(\alpha = 0.05\) to \(\alpha = 0.01\) while keeping everything else constant, what happens to:

    1. The probability of Type I error?

    2. The probability of Type II error?

    3. The power of the test?