10.4. The Four Steps to Hypothesis Testing and Understanding the Result

We conclude our introduction to hypothesis testing by stepping back to examine the broader picture. We present a standardized four-step framework to hypothesis testing and discuss important subtleties in its interpretation. In particular, we address what hypothesis tests can and cannot tell us, how to interpret results responsibly, and common pitfalls to avoid in practice.

Road Map 🧭

  • Organize the hypothesis testing workflow into four steps: (1) define the parameter(s), (2) state the hypotheses, (3) compute the test statistic, the df, and the \(p\)-value, (4) draw a conclusion.

  • Correctly interpret \(p\)-values and statistically significant results.

  • Recognize ethical and practical concerns regarding statistical inference.

10.4.1. The Four-Step Process to Hypothesis Testing

Throughout our exploration of hypothesis testing, we have been following an implicit structure. Now let us formalize this into a systematic four-step process that ensures consistency and completeness in our analysis.

Step 1: Identify and Describe the Parameter(s) of Interest

Make a concrete connection between the context of the problem and the mathematical notation to be used in the formal test. This step should include:

  • The population being studied

  • The symbol for the parameter of interest (e.g., \(\mu\))

  • What the parameter represents in practical terms

  • Units of measurement

  • Any other contextual details needed for interpretation

Example 💡: Testing Water Recycling Performance–Step 1

The Whimsical Wet ‘n’ Wobble Water Wonderland Waterpark has implemented a new water recycling system, which is supposed to limit water loss to evaporation and splashing by park patrons. However, after its implementation, park managers suspect a higher daily water loss than the pre-implementation average of 230,000 gallons per day.

To investigate, the park collected 21 days of water usage data during the first year of implementation. Perform a hypothesis test to determine whether the system underperforms expectations. Use an \(\alpha=0.05\) significance level.

Step 1 of the Hypothesis Test

Let \(\mu\) represent the true average daily water loss (in thousands of gallons) at the Whimsical Wet ‘n’ Wobble Water Wonderland Waterpark after implementing the new recycling system.

Step 2: State the Hypotheses

State the paired hypotheses in the correct mathematical format. Refer to the rules listed in Section 10.1.1.

Example 💡: Testing Water Recycling Performance–Step 2

Continuing from the context provided in the previous example,

Step 2 of the Hypothesis Test

\[\begin{split}&H_0: \mu \leq 230\\ &H_a: \mu > 230\end{split}\]

Step 3: Calculate the Test Statistic and P-Value

Perform all the computational steps. This includes:

  • Checking that the normality assumption is reasonably met

  • Deciding between a \(z\)-test or a \(t\)-test based on whether the population standard deviation is known

  • Computing the test statistic and stating the appropriate degrees of freedom, if any

  • Calculating the \(p\)-value

Example 💡: Testing Water Recycling Performance–Step 3

The 21 daily water loss measurements, in thousands of gallons, are:

190.1  244.6  244.1  270.1  269.6  201.0  234.3
292.3  205.7  242.3  263.0  219.0  233.3  229.0
293.5  290.4  264.0  248.6  260.5  210.4  236.9

Step 3 of the Hypothesis Test

Graphical analysis of the data’s distribution:

Fig. 10.16 Graphical assessment of normality

The data distribution shows only moderate deviation from normality, which is acceptable given the sample size of \(n=21\). In addition, the population standard deviation is unknown. Therefore, we use the \(t\)-test procedure.
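As a sketch of how a graphical check like Fig. 10.16 can be produced in base R (the variable name `loss` is our own), a histogram and a normal Q-Q plot are the standard tools:

```r
# Daily water loss in thousands of gallons (n = 21)
loss <- c(190.1, 244.6, 244.1, 270.1, 269.6, 201.0, 234.3,
          292.3, 205.7, 242.3, 263.0, 219.0, 233.3, 229.0,
          293.5, 290.4, 264.0, 248.6, 260.5, 210.4, 236.9)

# Histogram for overall shape; Q-Q plot compares sample quantiles
# against normal quantiles (points near the line suggest normality)
hist(loss, main = "Daily Water Loss", xlab = "Thousands of gallons")
qqnorm(loss)
qqline(loss)
```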

The numerical summaries:

  • \(n=21\)

  • \(\bar{x} = 244.8905\)

  • \(s=29.81\)

  • \(df = n-1 = 20\)

The observed test statistic:

\[t_{TS} = \frac{244.8905-230}{29.81/\sqrt{21}} = 2.2891\]

The \(p\)-value:

\[p = P(T_{n-1} > t_{TS}) = P(T_{20} > 2.2891) = 0.01654,\]

which can be computed using the R code below:

pvalue <- pt(2.2891, df=20, lower.tail=FALSE)  # upper-tail area of the t distribution with 20 df
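Alternatively, all of Step 3 can be reproduced in one call with R's built-in `t.test()`; the vector `loss` below repeats the 21 measurements:

```r
loss <- c(190.1, 244.6, 244.1, 270.1, 269.6, 201.0, 234.3,
          292.3, 205.7, 242.3, 263.0, 219.0, 233.3, 229.0,
          293.5, 290.4, 264.0, 248.6, 260.5, 210.4, 236.9)

# One-sample, one-sided t-test of H0: mu <= 230 vs Ha: mu > 230
res <- t.test(loss, mu = 230, alternative = "greater")
res$statistic  # t, approximately 2.289
res$parameter  # df = 20
res$p.value    # approximately 0.0165
```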

Step 4: Make the Decision and State the Conclusion

This step consists of two parts.

First, draw a formal decision: compare the \(p\)-value computed in Step 3 to \(\alpha\) and state whether the null hypothesis is rejected. We never “reject the alternative” or “accept” anything. The only two choices available for a formal decision are to

  • reject the null hypothesis, or

  • fail to reject the null hypothesis.

Then make a contextual conclusion using the template below:

“The data [does/does not] give [some/strong] support (p-value = [actual value]) to the claim that [statement of \(H_a\) in context].”
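The formal decision itself is a mechanical comparison. As a minimal sketch using the values from this example:

```r
alpha  <- 0.05
pvalue <- 0.01654  # from Step 3

# The only two possible formal decisions
decision <- if (pvalue < alpha) "reject H0" else "fail to reject H0"
decision  # "reject H0"
```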

Example 💡: Testing Water Recycling Performance–Step 4

The waterpark required the test to have the significance level \(\alpha=0.05\).

Step 4 of Hypothesis Testing

Since \(p\)-value \(= 0.01654 < 0.05\), the null hypothesis is rejected. There is sufficient evidence from the data (\(p\)-value \(= 0.01654\)) to support the claim that the true mean daily water loss after implementing the new water recycling system is greater than 230,000 gallons.

10.4.2. Understanding Statistical Significance

We call a result statistically significant when its behavior cannot plausibly be attributed to random chance alone. In hypothesis testing, results that lead to rejection of the null hypothesis are conventionally regarded as statistically significant.

However, when we encounter a \(p\)-value smaller than \(\alpha\) in practice, the information conveyed can be more nuanced than it may first appear. A statistically significant result can arise from any combination of the following three reasons:

1. Existence of True Effect (What We Hope For)

We may encounter a statistically significant result because the null hypothesis is actually false and our test correctly identified a genuine effect. This represents the ideal scenario where statistical significance corresponds to a real phenomenon.

2. Rare Event Under True Null

Even if the null hypothesis is true, we could observe extreme data purely by chance. Our significance level \(\alpha\) represents our tolerance for this type of error (Type I error). If \(\alpha = 0.05\), we expect to falsely reject true null hypotheses about 5% of the time in the long run.
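This long-run error rate can be illustrated by simulation. The sketch below (the population values are hypothetical) draws repeated samples from a population where the null hypothesis \(\mu = 230\) is true and records how often the test falsely rejects:

```r
set.seed(1)
alpha <- 0.05

# 10,000 samples of size 21 from a population where H0 is TRUE
false.reject <- replicate(10000, {
  x <- rnorm(21, mean = 230, sd = 30)
  t.test(x, mu = 230, alternative = "greater")$p.value < alpha
})
mean(false.reject)  # close to 0.05, as promised by the choice of alpha
</antml_code>
```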

3. Assumption Violations

Hypothesis tests rely on assumptions about the population and the sampling procedure. A hypothesis test can flag the data as unusual simply because the background assumptions are incorrect.

The Key Message

A statistically significant result indicates inconsistency between the data and the assumptions. For the result to be meaningful, we must ensure that the only assumption that can be violated is the null hypothesis. This is why checking the distributional assumptions thoroughly before inference is essential.

10.4.3. What \(p\)-Values Are and What They Are Not

Recall that a \(p\)-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis and all other model assumptions are true. It quantifies how incompatible the data are with the null hypothesis.

What \(p\)-Values Are NOT

❌ The p-value is NOT the probability that the null hypothesis is true.

A \(p\)-value of 0.03 does not mean there’s a 3% chance the null hypothesis is correct.

❌ The p-value is NOT the probability that observations occurred by chance.

The \(p\)-value is computed under the assumption that the null hypothesis and all model assumptions are true. It is the probability of the data given these assumptions, not the probability that chance alone explains the results.

❌ A small p-value does NOT prove the null hypothesis false.

A small \(p\)-value flags the data as unusual under our assumptions. This could mean that:

  • The null hypothesis is false (what we hope).

  • A rare event occurred under a true null hypothesis.

  • Model assumptions are violated.

❌ A large p-value does NOT prove the null hypothesis true.

A large \(p\)-value simply indicates that the data are consistent with the null hypothesis. This could mean that:

  • The null hypothesis is actually true.

  • Our sample size was too small to detect an existing effect.

  • The effect size is too small to detect with our current design.
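The last two points concern statistical power. As an illustration with made-up numbers (a true mean of 245 against \(\sigma \approx 30\)), R's `power.t.test()` shows that with \(n=21\) a real effect of this size would often go undetected:

```r
# Power of a one-sided, one-sample t-test to detect a true mean of 245
# (delta = 15 thousand gallons) when sigma is about 30 and n = 21;
# all numbers here are illustrative, not from the waterpark example
pw <- power.t.test(n = 21, delta = 15, sd = 30, sig.level = 0.05,
                   type = "one.sample", alternative = "one.sided")$power
pw  # roughly 0.7: the test would miss this effect about 30% of the time
```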

10.4.4. The Problem of \(p\)-Value Hacking

Unfortunately, the pressure to publish statistically significant results has led to unethical practices collectively known as \(p\)-value hacking, data dredging, or fishing for significance.

Common Forms of \(p\)-Value Hacking

  • Selective Reporting: conducting multiple analyses but reporting only those that yield significant results.

  • Data Manipulation: increasing the sample size until reaching significance, excluding problematic observations, or trying different response variables until one turns out significant.

  • Model Shopping: trying different statistical procedures until one yields significance, without proper justification for the model choice.
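A back-of-the-envelope calculation shows why selective reporting inflates false positives: if \(m\) independent tests are each run at \(\alpha = 0.05\) when every null hypothesis is true, the chance of at least one "significant" result grows quickly with \(m\):

```r
# Probability of at least one false positive among m independent tests,
# each run at significance level 0.05, when all null hypotheses are true
m <- 20
p.at.least.one <- 1 - (1 - 0.05)^m
p.at.least.one  # about 0.64 for m = 20
```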

Preventing \(p\)-Value Hacking

\(p\)-value hacking leads to increased false positives, undermining confidence in research findings. For good research practice, a researcher should:

  • document all procedures, including any data exclusions,

  • report all analyses conducted, not just significant ones, and

  • specify hypotheses and analysis plans before data collection.

10.4.5. Bringing It All Together

Key Takeaways 📝

  1. Organize your hypothesis testing workflow in four steps:

    1. parameter identification,

    2. hypothesis statement,

    3. calculation, and

    4. decision and conclusion in context.

  2. Know the correct implications of statistical significance and \(p\)-values.

  3. \(p\)-value hacking undermines scientific integrity. For good research practice, determine the complete analysis plan prior to data collection and document all procedures transparently.