10.4. P-values, Statistical Significance, and Formal Conclusion
We conclude our introduction to hypothesis testing by stepping back to examine the bigger picture. Having mastered the technical procedures for testing claims about population means, we now need to understand what these tests can and cannot tell us, how to interpret results responsibly, and what pitfalls to avoid in practice.
Road Map 🧭
Problem we will solve – Understanding what statistical significance really means and how to avoid common misinterpretations and misuses of hypothesis testing
Tool we’ll learn – A systematic four-step process for hypothesis testing and proper interpretation of p-values in scientific context
How it fits – This provides the critical thinking framework needed to use hypothesis testing responsibly in real research and decision-making
10.4.1. A Systematic Approach: The Four-Step Process
Throughout our exploration of hypothesis testing, we’ve been following an implicit structure. Now let’s formalize this into a systematic four-step process that ensures consistency and completeness in our analysis.
Step 1: Identify and Describe the Parameter(s) of Interest
The parameter must be described clearly in the context of the problem, including:
The symbol (e.g., \(\mu\))
What the parameter represents in practical terms
The population being studied
Units of measurement
Any contextual details needed for interpretation
Example: “\(\mu\) represents the true average daily water loss (in thousands of gallons) at the Whimsical Wet ‘n’ Wobble Water Wonderland Waterpark after implementing the new recycling system.”
Step 2: State the Hypotheses
Write both the null hypothesis (\(H_0\)) and alternative hypothesis (\(H_a\))
Use symbolic form unless explicitly asked to write in words
Ensure the hypotheses are based on the research question, not the data
The null value \(\mu_0\) should be determined before data collection
Example:
\(H_0: \mu \leq 230\) (water loss does not exceed initial estimates)
\(H_a: \mu > 230\) (water loss exceeds initial estimates)
Step 3: Calculate the Test Statistic and P-Value
Choose the appropriate test statistic (z or t) based on whether \(\sigma\) is known
State degrees of freedom when using t-procedures
Calculate the p-value using the correct approach for your alternative hypothesis
Master the R code for computing test statistics, p-values, and confidence intervals
Key formulas to know:
Z-test (\(\sigma\) known): \(Z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\)
T-test (\(\sigma\) unknown): \(t_{TS} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\), with \(df = n-1\)
R code patterns:
P-values: pnorm() or pt() with appropriate lower.tail setting
Confidence intervals: qnorm() or qt() for critical values
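A minimal sketch of these patterns, using invented values for the test statistic and degrees of freedom (the worked example at the end of this section applies the same calls to real data):

## Invented values for illustration only
t_ts <- 2.1    # observed t test statistic (assumed)
df   <- 24     # degrees of freedom, n - 1 (assumed n = 25)

## p-value for each form of the alternative hypothesis
p_upper <- pt(t_ts, df, lower.tail = FALSE)           # H_a: mu > mu_0
p_lower <- pt(t_ts, df, lower.tail = TRUE)            # H_a: mu < mu_0
p_two   <- 2 * pt(abs(t_ts), df, lower.tail = FALSE)  # H_a: mu != mu_0

## critical value for a 95% two-sided t confidence interval;
## when sigma is known, use qnorm()/pnorm() in place of qt()/pt()
t_crit <- qt(1 - 0.05 / 2, df)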
Step 4: Make the Decision and State the Conclusion
This step has two parts:
Hard Decision: Compare p-value to \(\alpha\)
If p-value ≤ \(\alpha\): Reject \(H_0\)
If p-value > \(\alpha\): Fail to reject \(H_0\)
Contextual Conclusion: Use this template:
“The data [does/does not] give [some/strong] support (p-value = [actual value]) to the claim that [statement of \(H_a\) in context].”
The strength of language should reflect both the decision and the size of the p-value:
“does” when we reject \(H_0\); “does not” when we fail to reject \(H_0\)
“strong” support when the p-value is far below \(\alpha\); “some” support when it is only modestly below
“might” or “might not” when the p-value is very close to \(\alpha\)
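The hard decision itself is a one-line comparison in R; a minimal sketch, using the p-value from the water-park example that closes this section:

pval  <- 0.0165
alpha <- 0.05
if (pval <= alpha) "Reject H0" else "Fail to reject H0"
## [1] "Reject H0"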
10.4.2. Understanding What Statistical Significance Really Means
When we obtain statistical significance (p-value ≤ \(\alpha\)), what does this actually tell us? The answer is more nuanced than it might initially appear.
What Statistical Significance Could Indicate
1. True Effect Detection (What We Hope For)
The null hypothesis is actually false, and our test correctly identified a genuine effect. This represents the ideal scenario where statistical significance corresponds to a real phenomenon.
2. Rare Event Under True Null
Even if the null hypothesis is true, we could observe extreme data purely by chance. Our significance level \(\alpha\) represents our tolerance for this type of error (Type I error). If \(\alpha = 0.05\), we expect to falsely reject true null hypotheses about 5% of the time in the long run.
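A short simulation makes this concrete. In the sketch below (the sample size, mean, and standard deviation are arbitrary choices), the null hypothesis is true by construction, yet roughly 5% of tests still reject it:

## Type I error rate when H0 is actually true
set.seed(1)
alpha <- 0.05
reps  <- 10000
rejections <- replicate(reps, {
  x <- rnorm(30, mean = 50, sd = 10)   # data generated with mu = 50
  t.test(x, mu = 50)$p.value <= alpha  # H0: mu = 50 is true here
})
mean(rejections)   # approximately 0.05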
3. Assumption Violations
Our test procedures rely on several assumptions. If these are violated, our p-value calculations may be invalid:
Independence violations: If observations are dependent (e.g., repeated measures, cluster sampling), our standard errors may be incorrect
Distributional violations: If data comes from non-normal distributions and sample sizes are small, t-procedures may not be appropriate
Sampling violations: If the sample is not representative of the target population, results may not generalize
The Key Message
A statistically significant result indicates that the data are inconsistent with the full set of model assumptions, including the null hypothesis, but it does not tell us which assumption is at fault. This is why checking assumptions thoroughly is crucial before interpreting results.
10.4.3. What P-Values Actually Mean (And Don’t Mean)
P-values are among the most misunderstood concepts in statistics. Let’s clarify what they truly represent and address common misconceptions.
What P-Values ARE
The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis and all other model assumptions are true.
More specifically:
P-values assess the evidence provided by the test statistic
They represent the smallest significance level at which, for the same alternative hypothesis, the observed data would lead us to reject \(H_0\)
They quantify how incompatible our data appears with the null hypothesis
What P-Values Are NOT
❌ The p-value is NOT the probability that the null hypothesis is true
This is perhaps the most common misinterpretation. A p-value of 0.03 does not mean there’s a 3% chance the null hypothesis is correct.
❌ The p-value is NOT the probability that observations occurred by chance
The p-value is computed under the assumption that the null hypothesis and all model assumptions are true. It’s the probability of the data given these assumptions, not the probability that chance alone explains the results.
❌ A significant result does NOT prove the null hypothesis is false
A small p-value flags the data as unusual under our assumptions. This could mean:
The null hypothesis is false (what we hope)
A rare event occurred under a true null hypothesis
Our model assumptions are violated
❌ A non-significant result does NOT prove the null hypothesis is true
A large p-value simply indicates that the data is consistent with the null hypothesis. This could mean:
The null hypothesis is actually true
Our sample size was too small to detect an existing effect
The effect size is too small to detect with our current design
The Correct Interpretation
Think of p-values as measuring the degree of surprise in your data. A very small p-value (say, 0.001) means “if the null hypothesis were true, we’d be very surprised to see data like this.” A large p-value (say, 0.8) means “this data looks pretty typical for what we’d expect if the null hypothesis were true.”
10.4.4. The Importance of All Assumptions
A crucial but often overlooked point: the p-value calculation depends on ALL model assumptions, not just the null hypothesis.
When we compute a p-value, we assume:
Simple random sampling from the target population
Independence of observations
Identical distribution (observations from the same population)
Normality (either population normal or CLT applicable)
The null hypothesis is true
If any of these assumptions fails, our p-value may be meaningless. This is why exploratory data analysis and assumption checking are not optional—they’re essential for valid inference.
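To see why the non-null assumptions matter, consider this simulation sketch (the skewed distribution and small sample size are illustrative choices). The null hypothesis is true in every replication, yet the t-test’s rejection rate drifts away from the nominal 5% because the Normality assumption fails:

## t-test Type I error rate with skewed data and a small sample
set.seed(2)
alpha <- 0.05
reps  <- 10000
rejections <- replicate(reps, {
  x <- rexp(10, rate = 1)              # heavily skewed; true mean = 1
  t.test(x, mu = 1)$p.value <= alpha   # H0: mu = 1 is true here
})
mean(rejections)   # typically well off the nominal 0.05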
10.4.5. The Problem of P-Value Hacking
Unfortunately, the pressure to publish statistically significant results has led to unethical practices collectively known as p-value hacking, data dredging, or fishing for significance.
Common Forms of P-Value Hacking
Selective Reporting
Researchers conduct multiple analyses but only report those yielding significant results. This inflates the apparent rate of significant findings far above the nominal \(\alpha\) level.
Data Manipulation
Adding observations: Continue collecting data until reaching significance (the simulation sketch after this list shows how sharply this inflates false positives)
Excluding observations: Remove “outliers” or problematic data points to achieve significance
Variable selection: Try different outcome measures until finding one that’s significant
Model Shopping
Trying different statistical procedures until finding one that yields significance, without proper justification for the model choice.
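The “adding observations” strategy is easy to expose by simulation. In this sketch (all settings are illustrative), the null hypothesis is always true, yet testing after every new batch of data and stopping at the first significant result rejects far more often than 5% of the time:

## Optional stopping: test after each batch, stop at "significance"
set.seed(3)
alpha <- 0.05
reps  <- 2000
hacked_reject <- replicate(reps, {
  x <- rnorm(10)                            # H0: mu = 0 is true
  significant <- FALSE
  while (!significant && length(x) <= 100) {
    significant <- t.test(x, mu = 0)$p.value <= alpha
    if (!significant) x <- c(x, rnorm(5))   # "collect" five more observations
  }
  significant
})
mean(hacked_reject)   # well above the nominal 0.05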
The Consequences
P-value hacking leads to:
Increased false positives: Results that appear significant but aren’t real
Wasted resources: Other researchers waste time trying to replicate false findings
Loss of scientific trust: Undermines confidence in research findings
Public health risks: Especially dangerous in medical research where false results could harm patients
A Contemporary Example
A widely publicized case involves a Harvard Business School professor who studied dishonesty but was later accused of data manipulation. The case centered on a study examining whether placing honesty declarations at the top versus the bottom of forms affected truthfulness in reporting.
While the case remains in litigation with the professor defending against the allegations, it highlights how even researchers studying ethics are not immune to the pressures that can lead to questionable research practices. The situation demonstrates the importance of transparency, rigorous peer review, and proper oversight in academic research.
Preventing P-Value Hacking
Good research practice involves:
Transparent methodology: Document all procedures, including any data exclusions
Comprehensive reporting: Report all analyses conducted, not just significant ones
Pre-registration: Specify hypotheses and analysis plans before data collection
Rigorous peer review: Ensure independent evaluation of methods and conclusions
10.4.6. Beyond P-Values: A Note on Practical Significance
While we focus primarily on statistical significance in this course, it’s important to acknowledge that statistical significance doesn’t automatically imply practical importance. A result can be statistically significant (indicating a real effect exists) while being too small to matter in practical terms.
The Concept of Effect Size
Effect size measures quantify the magnitude of differences relative to the underlying variability in the data. They help answer: “Even if this difference is real, is it large enough to care about?”
Why This Matters
With large enough sample sizes, even tiny differences can become statistically significant. For example:
A website change that increases click-through rates by 0.01% might be statistically significant with millions of users but practically meaningless
A drug that lowers blood pressure by 0.1 mmHg might show statistical significance in a large trial but provide no clinical benefit
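A sketch of the blood-pressure scenario (every number below is invented for illustration): with a million patients, an average reduction of just 0.1 mmHg produces an overwhelming test statistic.

## Statistically significant but clinically meaningless
set.seed(4)
n <- 1e6
bp_change <- rnorm(n, mean = -0.1, sd = 15)   # tiny average drop of 0.1 mmHg
t.test(bp_change, mu = 0, alternative = "less")$p.value
## minuscule p-value, even though the effect itself is trivial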
Our Approach in This Course
While practical significance is crucial in real research, we will not focus on effect size calculations or practical significance assessments in this introductory course. Our goal is to master the fundamental concepts of statistical significance and proper interpretation of hypothesis tests.
However, you should always remember that statistical significance is just the first step in evaluating research findings. In practice, researchers must also consider:
The magnitude of effects
Cost-benefit analyses
Clinical or practical relevance
Broader implications and context
10.4.7. A Word on Better Statistical Practices
The American Statistical Association and other professional organizations have advocated for moving beyond simple “significant vs. non-significant” thinking toward more nuanced statistical reasoning.
Key Principles for Good Practice
Report exact p-values rather than just inequalities (p = 0.0165 rather than p < 0.05)
Include confidence intervals to show the range of plausible parameter values
Think of evidence as continuous rather than dichotomous (reject/fail to reject)
Examine all assumptions that contribute to your inference
Consider effect sizes and practical significance alongside statistical significance
Practice transparency in reporting methods and results
The Bigger Picture
As noted by Ron Wasserstein, executive director of the American Statistical Association: “In our era, scientific argumentation is not based on whether a p-value is small enough or not. Attention is paid to effect sizes and confidence intervals. Evidence is thought of as being continuous rather than some sort of dichotomy… no single numerical value, and certainly not the p-value, will substitute for thoughtful statistical and scientific reasoning.”
10.4.8. A Complete Example: Water Park Quality Control
To cement the four-step procedure, we now work through a fully reproducible example that starts with raw data and ends with a decision and confidence statement. All numerical results are generated in R so that you can copy the code, run it yourself, and obtain the same output.
10.4.9. Background
Whimsical Wet ’n’ Wobble Water Wonderland installed a recycling system that was supposed to hold the average daily water loss at or below 230 000 gallons. One year later, managers recorded losses on 21 randomly selected days and fear the system is not meeting expectations.
The data (thousand gallons per day)
190.1, 244.6, 244.1, 270.1, 269.6, 201.0, 234.3,
292.3, 205.7, 242.3, 263.0, 219.0, 233.3, 229.0,
293.5, 290.4, 264.0, 248.6, 260.5, 210.4, 236.9
We test the claim at the \(\alpha = 0.05\) level.
10.4.10. Step 1 – Identify the parameter
\(\mu\) = true mean daily water loss (in thousand gallons) after the recycling system was installed.
10.4.11. Step 2 – State the hypotheses
Because management worries the mean exceeds 230, the logical pair is
| Null hypothesis | Alternative hypothesis |
|---|---|
| \(H_{0}\!: \mu \le 230\) | \(H_{a}\!: \mu > 230\) |
Important. Although the null is written “≤ 230”, the sampling distribution used for the t-statistic is built under the boundary value \(\mu = 230\).
10.4.12. Step 3 – Compute the test statistic and p-value
## ---- Water-park analysis ----
water_usage <- c(
190.1, 244.6, 244.1, 270.1, 269.6, 201.0, 234.3,
292.3, 205.7, 242.3, 263.0, 219.0, 233.3, 229.0,
293.5, 290.4, 264.0, 248.6, 260.5, 210.4, 236.9
)
n <- length(water_usage) # 21
xbar <- mean(water_usage) # 244.8905
s <- sd(water_usage) # 29.8100
mu_0 <- 230
t_ts <- (xbar - mu_0) / (s / sqrt(n)) # 2.2891
df <- n - 1 # 20
pval <- pt(t_ts, df, lower.tail = FALSE)
pval # 0.01654
## --------------------------------
For reference, \(t_{TS} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{244.8905 - 230}{29.8100/\sqrt{21}} \approx 2.289\) with \(df = 20\), and the one-tailed p-value is \(p = P(t_{20} \ge 2.289) \approx 0.0165\).
Note
With \(n = 21\) we still examine a histogram and Q–Q plot to make sure a Normal-based t-procedure is reasonable. Suppose those diagnostics look satisfactory.
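A minimal version of those diagnostics in base R (plots not shown):

hist(water_usage, main = "Daily water loss", xlab = "Thousand gallons per day")
qqnorm(water_usage)
qqline(water_usage)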
10.4.13. Step 4 – Decision and conclusion
Hard decision Because \(p = 0.0165 < 0.05\), we reject the null hypothesis.
Contextual conclusion The data give some support (p = 0.0165) to the claim that the average daily water loss at the park exceeds 230 000 gallons.
10.4.14. 95 % one-sided confidence bound
A right-tailed test pairs with a lower confidence bound.
t_crit <- qt(1 - 0.05, df) # 1.724718
lower <- xbar - t_crit * s / sqrt(n) # 233.671
lower
## [1] 233.6709
We are 95 % confident the true mean loss is greater than 233.7 thousand gallons.
10.4.15. Single-call verification
t.test(water_usage,
mu = mu_0,
alternative = "greater",
conf.level = 0.95)
Output (abbreviated):
One Sample t-test
t = 2.2891, df = 20, p-value = 0.01654
95 percent confidence interval:
233.671 Inf
sample estimates:
mean of x
244.8905
The t-statistic, p-value, and confidence bound match the hand calculations—always a good final check.
10.4.16. Key points illustrated
The null hypothesis covers all values ≤ 230, but the test statistic is computed under \(\mu = 230\).
A single line of R code (t.test) reproduces the entire analysis, but writing out each step clarifies what the software is doing.
Reporting the exact p-value (0.0165) and a compatible confidence bound (233.7 k gal) gives a precise picture of the evidence rather than a simple “reject / fail to reject” label.
10.4.17. Looking Forward: The Foundation for Advanced Methods
The hypothesis testing framework we’ve developed for population means extends far beyond this single parameter. The same logical structure—stating hypotheses, calculating test statistics, computing p-values, and making decisions—applies to:
Comparing Two Populations
Differences between two means
Paired data analysis
Multiple Group Comparisons
Analysis of variance (ANOVA)
Multiple comparison procedures
Relationships Between Variables
Correlation analysis
Linear regression
The Universal Framework
Regardless of the specific parameter or procedure, good statistical practice always involves:
Clear parameter definition in context
Appropriate hypothesis formulation based on research questions
Careful assumption checking before analysis
Proper calculation of test statistics and p-values
Thoughtful interpretation that considers both statistical and practical significance
Transparent reporting of methods and limitations
10.4.18. Critically Thinking About Statistical Results
As you encounter statistical analyses in research papers, news reports, and professional contexts, apply these critical thinking questions:
About the Study Design
Was the sample representative of the target population?
Were observations truly independent?
Was the sample size adequate for the questions being asked?
About the Analysis
Were assumptions checked and reported?
Were multiple comparisons properly adjusted for?
Was the analysis plan specified before seeing the data?
About the Interpretation
Do the conclusions match what the statistical test actually demonstrated?
Is statistical significance confused with practical importance?
Are limitations and uncertainties appropriately acknowledged?
About Broader Context
Do these results replicate previous findings?
Are effect sizes reported alongside statistical significance?
Would you make important decisions based on this evidence alone?
10.4.19. Final Thoughts: Statistics as a Tool for Understanding
Hypothesis testing is a powerful tool for learning from data, but it’s just that—a tool. Like any tool, its value depends on how skillfully and appropriately it’s used.
Statistical significance can help us distinguish real patterns from random noise, but it cannot substitute for:
Scientific reasoning about mechanisms and causation
Domain expertise about what effects are meaningful
Ethical consideration of how results will be used
Humility about the limitations of any single study
As you continue your statistical journey, remember that the goal is not to achieve statistical significance but to advance understanding. Good statistics serves science and society by helping us make better decisions based on evidence, while acknowledging the uncertainty that is inherent in learning from data.
The systematic approach you’ve learned here—careful formulation of questions, rigorous checking of assumptions, proper calculation of test statistics and p-values, and thoughtful interpretation of results—will serve you well whether you’re reading research papers, conducting your own analyses, or simply trying to make sense of the statistical claims you encounter in daily life.
Key Takeaways 📝
Follow a systematic four-step process: Parameter identification, hypothesis statement, calculation, and decision/conclusion in context.
Statistical significance has multiple possible explanations: true effect detection, rare events under a true null, or assumption violations.
P-values are NOT the probability that the null hypothesis is true or that results occurred by chance—they measure evidence against the null under model assumptions.
All model assumptions matter for valid inference, not just the null hypothesis. Check assumptions before interpreting results.
P-value hacking undermines scientific integrity through selective reporting, data manipulation, and model shopping.
Statistical significance doesn’t guarantee practical importance—effect sizes and context matter for real-world applications.
Good statistical practice emphasizes transparency, complete reporting, and thoughtful interpretation rather than simple reject/fail-to-reject decisions.
The hypothesis testing framework generalizes broadly to many other statistical procedures and research contexts.
Exercises
Four-Step Practice: A quality control manager wants to test whether a manufacturing process produces bolts with mean diameter different from the target 10.0 mm. A sample of 15 bolts yields \(\bar{x} = 10.12\) mm and \(s = 0.08\) mm. Use the four-step process to test at \(\alpha = 0.01\).
P-Value Interpretation: For each scenario below, explain what the p-value means and what it does NOT mean:
A medical study reports p = 0.03 for testing whether a new drug reduces blood pressure
An educational study reports p = 0.42 for testing whether a teaching method improves test scores
Assumption Checking: A researcher finds p = 0.001 in testing \(H_0: \mu = 50\) vs. \(H_a: \mu \neq 50\) with n = 20. List three different explanations for this result and describe how you would investigate which is most likely correct.
Research Ethics: Describe three specific practices that would constitute p-value hacking and explain why each is problematic for scientific integrity.
Critical Analysis: Find a news article reporting on a scientific study that uses hypothesis testing. Identify:
What null and alternative hypotheses were likely tested
Whether the reported interpretation matches what the test actually showed
Any potential issues with the study design or analysis
Whether practical significance was considered alongside statistical significance
Template Application: Using the conclusion template from this chapter, write appropriate conclusions for these scenarios:
Testing \(H_0: \mu = 100\) vs. \(H_a: \mu > 100\), p-value = 0.023, \(\alpha = 0.05\)
Testing \(H_0: \mu = 50\) vs. \(H_a: \mu \neq 50\), p-value = 0.087, \(\alpha = 0.05\)
Testing \(H_0: \mu = 25\) vs. \(H_a: \mu < 25\), p-value = 0.048, \(\alpha = 0.05\)