12.4. Multiple Comparison Procedures and Family-Wise Error Rates
After obtaining statistical significance in our analysis of variance procedure, we have determined that at least one of the means is different from the rest. However, this doesn’t tell us which means are different, and it doesn’t tell us that only one mean differs from the rest; it could be that all of them are different. We need to explore further to identify which means differ, which leads us to what are called multiple comparison procedures.
Road Map 🧭
Problem we will solve – How to make multiple pairwise comparisons after ANOVA while controlling the overall probability of making Type I errors across all tests
Tools we’ll learn – Family-wise error rate concepts, Šidák and Bonferroni corrections, Tukey’s HSD method, Dunnett’s method, and graphical display techniques for post-hoc comparisons
How it fits – This completes the ANOVA procedure by providing rigorous methods to identify specific group differences while maintaining statistical control, bridging the gap between detecting overall differences and understanding the structure of those differences
12.4.1. The Foundation: Why We Need Multiple Comparisons
When ANOVA indicates statistical significance with a small p-value from our F-test statistic, we have rejected the null hypothesis \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\) in favor of the alternative hypothesis \(H_a: \mu_i \neq \mu_j\) for some \(i \neq j\). This tells us that at least one of the population means is different from the others.
However, this global conclusion could mean several different things:
Only one pair of means differs (e.g., \(\mu_1 \neq \mu_2\) but \(\mu_3 = \mu_4 = \mu_5\))
Multiple pairs differ but some remain equal (e.g., \(\mu_1 = \mu_2 \neq \mu_3 = \mu_4 \neq \mu_5\))
All means are completely different from each other
Any number of intermediate patterns
To understand the specific structure of these differences, we need to perform multiple comparison procedures—essentially conducting a series of focused two-sample tests comparing pairs of groups.
Visual Exploration Before Formal Testing
Before diving into formal statistical procedures, we should return to our graphical displays to gain initial insights:
Effects plots display the sample means for each group without considering variability, making it easy to see which means are most separated. These plots help us identify the groups that appear most different visually.
Side-by-side boxplots allow us to consider both the separation of means and the variability within groups, helping us identify which differences are most likely to be statistically significant when we account for sampling variability.
In our coffeehouse example, the boxplots clearly showed that coffeehouses 2 and 4 had the most separated means relative to their within-group variability, suggesting these would likely be significantly different in formal testing.
12.4.2. The Fundamental Challenge: The Multiple Testing Problem
The most natural approach might be to simply perform individual two-sample t-tests for every pair of groups using the pooled variance approach (since we’ve already established the equal variance assumption in ANOVA). However, this naive approach creates a serious multiple testing problem.
The Mathematical Setup
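For \(k\) groups, we need to perform \(c = \binom{k}{2} = \frac{k!}{(k-2)!\,2!} = \frac{k(k-1)}{2}\) different pairwise comparisons. Each comparison tests:
\[H_0: \mu_i = \mu_j \quad \text{versus} \quad H_a: \mu_i \neq \mu_j\]
for every pair of groups \(i < j\), giving \(c\) tests in total.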
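Since we’re building on ANOVA, we use the pooled variance confidence interval approach:
\[(\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2,\,n-k}\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}\]
where MSE comes from our ANOVA table and the \(n-k\) degrees of freedom are those of the pooled variance estimate.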
The Problem: Type I Error Inflation
Each individual test controls Type I error at level \(\alpha\). If we set \(\alpha = 0.05\) for each test, then each test has a 5% chance of making a Type I error when its null hypothesis is true. However, when we perform many tests simultaneously, the overall probability of making at least one Type I error becomes much larger than 5%.
Let’s examine this mathematically. For a concrete example, consider 5 groups requiring \(\binom{5}{2} = 10\) pairwise comparisons. If we naively use \(\alpha = 0.05\) for each test, what’s the probability of making at least one Type I error across all 10 tests?
To understand this, let’s think about the complementary event. The probability of making no Type I errors means that all 10 tests correctly fail to reject their respective null hypotheses. If we assume the tests are independent (a simplifying assumption we’ll discuss), then:
\[P(\text{no Type I errors}) = (1 - 0.05)^{10} = 0.95^{10} \approx 0.5987\]
Therefore:
\[P(\text{at least one Type I error}) = 1 - 0.95^{10} \approx 0.4013\]
This means we have about a 40% chance of making at least one false positive, far above our intended 5% level. This inflation of the Type I error rate undermines our statistical control and can lead to false discoveries.
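The arithmetic above is easy to reproduce. The following is a minimal sketch in R, assuming independent tests as in the text; the variable names are illustrative only.

```r
# Probability of at least one Type I error across independent tests
alpha   <- 0.05           # per-test significance level
c_tests <- choose(5, 2)   # number of pairwise comparisons for 5 groups

p_no_error <- (1 - alpha)^c_tests   # every test correctly fails to reject
fwer       <- 1 - p_no_error        # at least one false rejection

round(c(p_no_error = p_no_error, fwer = fwer), 4)
```

Changing `alpha` or the number of groups in `choose()` shows how quickly the overall error probability moves away from the per-test level.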
Visualizing the Multiple Testing Problem
Imagine we have two simultaneous tests, represented by their test statistic distributions under their respective null hypotheses. Each distribution has rejection regions in its tails corresponding to \(\alpha/2\) in each tail for a two-sided test.
The overall Type I error occurs when we reject at least one null hypothesis incorrectly. This happens in several scenarios:
We make a Type I error in the first test but not the second
We make a Type I error in the second test but not the first
We make Type I errors in both tests simultaneously
The total probability of “at least one error” encompasses all these scenarios and is much larger than the individual \(\alpha\) level for any single test.
12.4.3. Family-Wise Error Rate (FWER): The Formal Framework
The family-wise error rate (FWER) provides a rigorous framework for understanding and controlling the overall Type I error rate across multiple tests.
Formal Definition
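The family-wise error rate is the probability of making at least one Type I error across all comparisons in a family of tests, assuming all null hypotheses in the family are true:
\[\text{FWER} = P(\text{at least one Type I error} \mid \text{all } H_0 \text{ in the family are true})\]
This can be equivalently written as:
\[\text{FWER} = 1 - P(\text{no Type I errors} \mid \text{all } H_0 \text{ in the family are true})\]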
Independence Assumption and Its Implications
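If we assume the \(c\) tests are statistically independent and each has Type I error rate \(\alpha_{\text{single}}\), then:
\[P(\text{no Type I errors}) = (1 - \alpha_{\text{single}})^c\]
Therefore:
\[\text{FWER} = 1 - (1 - \alpha_{\text{single}})^c\]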
Important caveat: The independence assumption is actually quite unrealistic in our setting. The pairwise comparisons use overlapping data and share the common MSE estimate, creating dependencies between the tests. However, this assumption provides a useful starting point for understanding the problem and developing solutions.
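To see what the dependence does in practice, one option is a small simulation. The sketch below is an illustration under assumed settings (5 groups, 10 observations each, standard normal data, 2000 replications, a fixed seed); none of these values come from the coffeehouse study. It estimates the FWER of unadjusted pooled-variance t-tests when every null hypothesis is true and prints it next to the independence-based value. In runs of this kind the estimate tends to land well above 0.05 and somewhat below the independence-based figure.

```r
# Monte Carlo estimate of the FWER for unadjusted pooled-variance t-tests
# when all population means are truly equal (assumed illustrative settings)
set.seed(1)
k <- 5; n_per_group <- 10; n_sims <- 2000
group <- factor(rep(seq_len(k), each = n_per_group))

any_rejection <- replicate(n_sims, {
  y <- rnorm(k * n_per_group)   # same mean and variance in every group
  p <- pairwise.t.test(y, group, p.adjust.method = "none",
                       pool.sd = TRUE)$p.value
  min(p, na.rm = TRUE) < 0.05   # TRUE if at least one pair is falsely rejected
})

fwer_simulated   <- mean(any_rejection)
fwer_independent <- 1 - 0.95^choose(k, 2)
c(simulated = fwer_simulated, independence_formula = fwer_independent)
```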
The General Formula for Any Number of Comparisons
For \(k\) groups requiring \(c = \binom{k}{2}\) comparisons, if each test uses significance level \(\alpha_{\text{single}}\):
\[\text{FWER} = 1 - (1 - \alpha_{\text{single}})^{k(k-1)/2}\]
This formula reveals how quickly the family-wise error rate grows as we increase either the number of groups or the individual test significance level.
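A short table makes the growth concrete. This is a sketch under the same independence assumption; `fwer_independent` is a helper name introduced here for illustration.

```r
# FWER under independence for all pairwise comparisons among k groups
fwer_independent <- function(k, alpha_single = 0.05) {
  c_tests <- choose(k, 2)
  1 - (1 - alpha_single)^c_tests
}

k <- 2:10
fwer_table <- data.frame(
  groups      = k,
  comparisons = choose(k, 2),
  fwer        = round(fwer_independent(k), 3)
)
fwer_table
```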
12.4.4. Methods for Controlling FWER
Several statistical methods have been developed to control the family-wise error rate. We’ll examine three major approaches: the Šidák correction, the Bonferroni correction, and Tukey’s HSD method.
The Šidák Correction
The Šidák correction directly inverts the independence-based FWER formula to find the individual test level needed to achieve a desired overall level.
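Derivation: If we want \(\text{FWER} = \alpha_{\text{overall}}\), we solve:
\[1 - (1 - \alpha_{\text{single}})^c = \alpha_{\text{overall}}\]
Rearranging:
\[(1 - \alpha_{\text{single}})^c = 1 - \alpha_{\text{overall}}\]
Taking the \(c\)-th root:
\[1 - \alpha_{\text{single}} = (1 - \alpha_{\text{overall}})^{1/c}\]
Therefore:
\[\alpha_{\text{single}} = 1 - (1 - \alpha_{\text{overall}})^{1/c}\]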
Example calculation: For our coffeehouse study with 5 groups (\(c = 10\) comparisons) and desired \(\alpha_{\text{overall}} = 0.05\):
\[\alpha_{\text{single}} = 1 - (1 - 0.05)^{1/10} = 1 - 0.95^{0.1} \approx 0.005116\]
We would use \(\alpha = 0.005116\) for each individual test, making our critical values much more stringent.
Verification: Using this corrected level:
\[\text{FWER} = 1 - (1 - 0.005116)^{10} \approx 0.05\]
Limitations: The Šidák correction assumes complete independence between tests. In practice, pairwise comparisons using the same data and pooled variance estimate are correlated, making this assumption problematic.
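The Šidák level and its verification can be checked numerically. A minimal sketch, with `sidak_alpha` as an illustrative helper name:

```r
# Per-test level that gives a target FWER under independence (Šidák)
sidak_alpha <- function(alpha_overall, c_tests) {
  1 - (1 - alpha_overall)^(1 / c_tests)
}

alpha_single <- sidak_alpha(0.05, c_tests = 10)
fwer_check   <- 1 - (1 - alpha_single)^10   # plug back into the FWER formula

c(alpha_single = alpha_single, fwer_check = fwer_check)
```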
The Bonferroni Correction
The Bonferroni correction takes a different approach, using an inequality to bound the family-wise error rate rather than assuming independence.
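Mathematical foundation: The correction relies on Boole’s inequality (also called the union bound), which holds for any events \(A_1, \ldots, A_c\):
\[P\left(\bigcup_{i=1}^{c} A_i\right) \leq \sum_{i=1}^{c} P(A_i)\]
Applied to our context, let \(A_i\) be the event that test \(i\) makes a Type I error:
\[\text{FWER} = P\left(\bigcup_{i=1}^{c} A_i\right) \leq \sum_{i=1}^{c} P(A_i)\]
If each test uses significance level \(\alpha_{\text{single}}\), then under the respective null hypotheses:
\[\text{FWER} \leq \sum_{i=1}^{c} \alpha_{\text{single}} = c \cdot \alpha_{\text{single}}\]
The Bonferroni correction: To ensure \(\text{FWER} \leq \alpha_{\text{overall}}\), we set:
\[\alpha_{\text{single}} = \frac{\alpha_{\text{overall}}}{c}\]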
Example calculation: For 10 comparisons with desired \(\alpha_{\text{overall}} = 0.05\):
\[\alpha_{\text{single}} = \frac{0.05}{10} = 0.005\]
Notice this is slightly more conservative than the Šidák correction (0.005 vs 0.005116).
Key advantages:
No independence assumption required
Simple to calculate and apply
Provides a true upper bound on FWER
Key disadvantages:
More conservative than Šidák correction
Can be overly restrictive with many comparisons
Uses an inequality rather than an equality, potentially wasting statistical power
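The two corrections can be placed side by side. The sketch below uses the 10-comparison example from the text; the last quantity shows what the FWER would be if the Bonferroni-adjusted tests happened to be independent, which is an assumption made only for illustration.

```r
c_tests       <- 10
alpha_overall <- 0.05

alpha_bonf  <- alpha_overall / c_tests                 # Bonferroni per-test level
alpha_sidak <- 1 - (1 - alpha_overall)^(1 / c_tests)   # Šidák per-test level

bound_bonf          <- c_tests * alpha_bonf            # union bound on the FWER
fwer_if_independent <- 1 - (1 - alpha_bonf)^c_tests    # FWER if tests were independent

c(alpha_bonf = alpha_bonf, alpha_sidak = alpha_sidak,
  bound_bonf = bound_bonf, fwer_if_independent = fwer_if_independent)
```

The gap between `bound_bonf` and `fwer_if_independent` is the slack introduced by using an inequality.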
Why the Inequality Matters
The Bonferroni correction uses an upper bound rather than the exact FWER. To see why this creates conservatism, consider the inclusion-exclusion principle for the union of two events:
\[P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)\]
The Bonferroni bound ignores the negative intersection term \(P(A_1 \cap A_2)\), which represents the probability of making Type I errors in multiple tests simultaneously. Since this intersection probability is always non-negative, the bound overestimates the true FWER.
For more than two events, the inclusion-exclusion principle involves alternating sums of increasingly complex intersection terms. The Bonferroni bound uses only the first (positive) terms, making it increasingly conservative as the number of comparisons grows.
Conservative Nature and Power Implications
Both Šidák and Bonferroni corrections are conservative—they control the FWER at or below the desired level, often well below it. This conservatism comes at the cost of reduced statistical power (increased probability of Type II errors).
When we make the significance level more stringent for each comparison (from 0.05 to 0.005, for example), we make it more difficult to reject individual null hypotheses. This reduces our ability to detect true differences that actually exist in the population.
As the number of comparisons increases, these methods become increasingly restrictive. For example, with 15 groups requiring 105 pairwise comparisons, the Bonferroni correction would use \(\alpha_{\text{single}} = 0.05/105 \approx 0.0005\) for each test, an extremely stringent criterion that would miss many real differences.
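In practice the adjustment is usually applied to the p-values rather than to \(\alpha\). The sketch below uses `pairwise.t.test()` from base R on the built-in `PlantGrowth` data (three groups, not the coffeehouse data) purely as a stand-in, and confirms that each Bonferroni-adjusted p-value is the raw p-value multiplied by the number of tests, capped at 1.

```r
# Pooled-variance pairwise t-tests, with and without Bonferroni adjustment
raw <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                       p.adjust.method = "none", pool.sd = TRUE)$p.value
bonf <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "bonferroni", pool.sd = TRUE)$p.value

n_tests <- sum(!is.na(raw))   # number of pairwise comparisons actually made
raw
bonf
```

Comparing the adjusted p-values to 0.05 is equivalent to comparing the raw p-values to 0.05 divided by `n_tests`.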
12.4.5. Tukey’s Honestly Significant Difference (HSD) Method
Tukey’s HSD method represents a major advance in multiple comparison procedures. Instead of adjusting individual significance levels, it uses a fundamentally different approach based on the studentized range distribution.
The Philosophy Behind Tukey’s HSD
Tukey’s method recognizes that the pairwise comparisons in ANOVA are not independent—they all use the same pooled variance estimate (MSE) and involve overlapping group comparisons. Rather than pretending independence or using conservative bounds, Tukey’s HSD explicitly accounts for this correlation structure.
The key insight is to control the family-wise error rate by focusing on the largest difference among all possible pairwise comparisons. If we can control the probability that the largest difference exceeds a certain threshold when all null hypotheses are true, then we automatically control the FWER for all comparisons.
The Studentized Range Distribution
The theoretical foundation of Tukey’s method rests on the studentized range statistic. Suppose we have \(k\) independent sample means \(\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k\) from normal populations with a common mean and common variance \(\sigma^2\), each based on sample size \(n\). The studentized range statistic is:
\[\frac{\max_i \bar{X}_i - \min_i \bar{X}_i}{\sigma/\sqrt{n}}\]
When \(\sigma^2\) is unknown and estimated by \(s^2\) with \(\nu\) degrees of freedom:
\[Q = \frac{\max_i \bar{X}_i - \min_i \bar{X}_i}{s/\sqrt{n}}\]
follows the studentized range distribution with parameters \(k\) (number of groups) and \(\nu\) (degrees of freedom for the variance estimate).
Adaptation to Unequal Sample Sizes
In practice, we often have unequal sample sizes. The studentized range distribution theory extends to this case (the Tukey-Kramer adjustment), where for any two groups \(i\) and \(j\) we use the statistic:
\[\frac{\left|(\bar{X}_i - \bar{X}_j) - (\mu_i - \mu_j)\right|}{\sqrt{\frac{\text{MSE}}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}\]
The factor \(\frac{1}{2}\) appears because the studentized range distribution is scaled by the standard error of a single mean, \(\sqrt{\sigma^2/n}\), while the difference between two specific means has standard error \(\sqrt{\sigma^2(1/n_i + 1/n_j)}\). With equal sample sizes \(n_i = n_j = n\), we get \(\frac{1}{2}\left(\frac{1}{n} + \frac{1}{n}\right) = \frac{1}{n}\), so the denominator reduces to \(\sqrt{\text{MSE}/n}\) and matches the equal-sample-size statistic.
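One way to sanity-check this scaling: with only two groups there is a single comparison, so a studentized range quantile divided by \(\sqrt{2}\) should agree with the usual two-sided t critical value. The sketch below checks this for an arbitrarily chosen 20 degrees of freedom; agreement is up to the numerical precision of `qtukey()`.

```r
df_error <- 20   # arbitrary degrees of freedom, chosen for illustration

q_scaled <- qtukey(0.95, nmeans = 2, df = df_error) / sqrt(2)  # rescaled range quantile
t_crit   <- qt(0.975, df = df_error)                           # two-sided 5% t critical value

c(q_scaled = q_scaled, t_crit = t_crit)
```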
Tukey’s HSD Critical Value
The critical value for Tukey’s HSD is:
\[q^{*} = \frac{q_{\alpha,k,n-k}}{\sqrt{2}}\]
where \(q_{\alpha,k,n-k}\) is the \((1-\alpha)\)-quantile of the studentized range distribution with \(k\) groups and \(n-k\) degrees of freedom.
The Confidence Interval Formula
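The Tukey HSD confidence interval for the difference \(\mu_i - \mu_j\) is:
\[(\bar{x}_i - \bar{x}_j) \pm \frac{q_{\alpha,k,n-k}}{\sqrt{2}}\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}\]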
This interval simultaneously controls the family-wise confidence level at \(1 - \alpha\) for all \(\binom{k}{2}\) pairwise comparisons.
Implementation in R
Method 1: Using confidence level
# Get critical value using confidence level
q_crit <- qtukey(0.95, nmeans = k, df = n-k, lower.tail = TRUE) / sqrt(2)
Method 2: Using significance level
# Get critical value using significance level
q_crit <- qtukey(0.05, nmeans = k, df = n-k, lower.tail = FALSE) / sqrt(2)
Complete procedure with raw data:
# Fit ANOVA model
fit <- aov(response ~ factor, data = dataset)
# Perform Tukey's HSD
tukey_results <- TukeyHSD(fit, ordered = TRUE, conf.level = 0.95)
# View results
tukey_results
Manual calculation for summary statistics:
# Calculate critical value
q_crit <- qtukey(0.05, nmeans = k, df = n-k, lower.tail = FALSE) / sqrt(2)
# For each pair (i,j)
diff <- x_bar_i - x_bar_j
se_diff <- sqrt(MSE * (1/n_i + 1/n_j))
margin_error <- q_crit * se_diff
# Confidence interval
ci_lower <- diff - margin_error
ci_upper <- diff + margin_error
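To confirm that the manual formula and `TukeyHSD()` agree, the two can be run on the same data. The sketch below uses simulated, deliberately unbalanced data; the group labels, sample sizes, means, and seed are all made up for illustration.

```r
set.seed(42)
# Hypothetical unbalanced data: three groups with different sample sizes
sizes_in <- c(8, 10, 12)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = sizes_in)),
  y     = rnorm(30, mean = rep(c(10, 12, 15), times = sizes_in), sd = 2)
)
fit <- aov(y ~ group, data = dat)

k     <- nlevels(dat$group)
n     <- nrow(dat)
MSE   <- sum(residuals(fit)^2) / (n - k)
means <- sapply(split(dat$y, dat$group), mean)
sizes <- sapply(split(dat$y, dat$group), length)

# Manual Tukey interval for mu_C - mu_A
q_crit <- qtukey(0.95, nmeans = k, df = n - k) / sqrt(2)
est    <- means[["C"]] - means[["A"]]
margin <- q_crit * sqrt(MSE * (1 / sizes[["C"]] + 1 / sizes[["A"]]))
manual <- c(diff = est, lwr = est - margin, upr = est + margin)

# Same interval from the built-in procedure
builtin <- TukeyHSD(fit, conf.level = 0.95)$group["C-A", c("diff", "lwr", "upr")]
rbind(manual, builtin)
```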
Interpreting Tukey HSD Results
A confidence interval that does not contain 0 indicates a statistically significant difference between those two groups at the family-wise level \(\alpha\).
The R output provides several components:
diff: Difference in sample means (\(\bar{X}_i - \bar{X}_j\))
lwr: Lower bound of confidence interval
upr: Upper bound of confidence interval
p adj: Adjusted p-value accounting for multiple comparisons
The adjusted p-values can be compared directly to \(\alpha\) (e.g., 0.05) to determine significance, or you can examine whether 0 falls within each confidence interval.
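Both decision rules can be applied directly to the `TukeyHSD()` output. The sketch below uses the built-in `PlantGrowth` data as a stand-in for a real study; the column names `sig_by_p` and `sig_by_ci` are introduced here for illustration.

```r
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- as.data.frame(TukeyHSD(fit, conf.level = 0.95)$group)

# Two equivalent ways to flag significance at the 5% family-wise level
tukey$sig_by_p  <- tukey[["p adj"]] < 0.05        # adjusted p-value rule
tukey$sig_by_ci <- tukey$lwr > 0 | tukey$upr < 0  # interval excludes 0

tukey
rownames(tukey)[tukey$sig_by_p]   # labels of the significant pairs
```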
Advantages of Tukey’s HSD
Exact FWER control: For balanced designs, Tukey’s HSD controls the FWER at exactly the specified level (not just an upper bound)
Less conservative: Generally more powerful than Bonferroni or Šidák corrections because it accounts for the correlation structure among tests
No independence assumption: The method is derived specifically for the correlated structure of pairwise comparisons using pooled variance
Optimal for all pairwise comparisons: When the goal is to compare all possible pairs of groups, Tukey’s HSD is typically the most powerful method
12.4.6. Alternative: Dunnett’s Method for Control Comparisons
When the research design includes a control group and the primary interest lies in comparing each treatment to the control (rather than all pairwise comparisons), Dunnett’s method provides a more powerful alternative.
The Dunnett Setup
Consider \(k\) groups where group 1 is a control and groups 2 through \(k\) are treatments. Instead of \(\binom{k}{2}\) pairwise comparisons, we perform only \(k-1\) comparisons:
\[H_0: \mu_i = \mu_1 \quad \text{versus} \quad H_a: \mu_i \neq \mu_1, \qquad i = 2, 3, \ldots, k\]
This reduction in the number of comparisons (from \(\binom{k}{2}\) to \(k-1\)) leads to increased statistical power for detecting differences from the control.
Advantages of Dunnett’s Method
Fewer comparisons: \(k-1\) instead of \(\binom{k}{2}\), dramatically reducing the multiple comparison penalty
Increased power: More powerful for detecting treatment-vs-control differences
Targeted inference: Focuses on the scientifically relevant comparisons when a control group exists
Still controls FWER: Maintains proper family-wise error control
Example: With 5 groups, Tukey’s HSD requires 10 comparisons while Dunnett’s method requires only 4, leading to substantially less stringent critical values.
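The difference in family size grows quickly with the number of groups. The sketch below only counts comparisons; it does not compute Dunnett critical values, which are not available in base R.

```r
k <- 3:8
comparison_counts <- data.frame(
  groups         = k,
  all_pairwise   = choose(k, 2),  # family used by Tukey's HSD
  versus_control = k - 1          # family used by Dunnett's method
)
comparison_counts
```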
Implementation Considerations
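Dunnett’s method is not implemented in base R but is available in add-on packages such as multcomp. The theoretical development parallels Tukey’s approach but uses a different multivariate distribution that accounts for the specific correlation structure among control-vs-treatment comparisons.
# Sketch using the multcomp package (install once with install.packages("multcomp"))
# Assumes dataset$group is a factor whose control level is labeled "group1"
library(multcomp)
dataset$group <- relevel(dataset$group, ref = "group1")  # control becomes the reference level
fit <- aov(response ~ group, data = dataset)
dunnett_results <- glht(fit, linfct = mcp(group = "Dunnett"))
summary(dunnett_results)  # adjusted p-values for each treatment vs. control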
The exact implementation details vary by package, but the conceptual approach remains the same: use a specialized distribution to set critical values that control the FWER for the specific set of control-vs-treatment comparisons.
12.4.7. Visual Display of Multiple Comparison Results
After obtaining multiple comparison results, creating an effective visual summary helps communicate the pattern of group differences clearly and intuitively.
The Underline Method: Step-by-Step Process
The standard approach uses an “underline” notation to show which groups cannot be distinguished statistically:
Order groups by sample means: Arrange groups from smallest to largest sample mean
Identify non-significant pairs: From your multiple comparison results, identify all pairs that are NOT significantly different
Draw underlines for adjacent groups: Start by drawing lines under groups that are adjacent in your ordering and not significantly different
Extend underlines when possible: If groups A and B are connected by an underline, and B and C are connected by an underline, and A and C are also not significantly different, extend the line to cover A, B, and C
Interpret the grouping pattern: Groups connected by the same underline are statistically indistinguishable from each other
Visual Example 1: Simple Grouping
Consider four groups A, B, C, D with sample means ordered as \(\bar{X}_A < \bar{X}_B < \bar{X}_C < \bar{X}_D\).
Suppose Tukey HSD results show:
A vs B: not significant
B vs C: not significant
C vs D: not significant
A vs C: significant
A vs D: significant
B vs D: significant
Visual display:
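A     B     C     D
_______
      _______
            _______
(three overlapping underlines: A-B, B-C, and C-D)
Interpretation: Each group is statistically indistinguishable from its immediate neighbor in the ordering, but any two groups separated by at least one other group differ significantly. No single underline spans three groups, so the groups do not split into clean, separate clusters.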
Visual Example 2: Complex Overlapping Pattern
Consider the same four groups with different significance results:
A vs B: not significant
B vs C: not significant
C vs D: not significant
A vs C: not significant
A vs D: significant
B vs D: significant
Visual display:
A     B     C     D
_____________
            _______
(two underlines: A-B-C and C-D)
Interpretation: Groups A, B, and C are all statistically indistinguishable from each other, but group D is significantly different from groups A and B (though not from group C).
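The underline steps can also be automated. The function below is a sketch, not a standard library routine: `underline_runs` and its arguments are names invented here. It takes the groups already sorted by sample mean plus the list of significant pairs, and returns the maximal runs of consecutive groups in which no pair differs significantly. It is applied to the results of Example 2.

```r
# groups:    labels sorted from smallest to largest sample mean
# sig_pairs: significant pairs written as "X-Y" (order within a pair is ignored)
underline_runs <- function(groups, sig_pairs) {
  k <- length(groups)
  is_sig <- function(a, b) {
    paste(a, b, sep = "-") %in% sig_pairs || paste(b, a, sep = "-") %in% sig_pairs
  }
  # TRUE if no pair among groups i..j is significant
  run_ok <- function(i, j) {
    pairs <- combn(groups[i:j], 2)
    !any(mapply(is_sig, pairs[1, ], pairs[2, ]))
  }
  runs <- list()
  last_end <- 0
  for (i in seq_len(k)) {
    j <- i
    while (j < k && run_ok(i, j + 1)) j <- j + 1   # extend the underline
    # keep runs of 2+ groups that are not contained in an earlier run
    if (j > i && j > last_end) {
      runs[[length(runs) + 1]] <- groups[i:j]
      last_end <- j
    }
  }
  runs
}

# Example 2: only A vs D and B vs D are significant
runs <- underline_runs(c("A", "B", "C", "D"), sig_pairs = c("A-D", "B-D"))
runs
```

Each element of the returned list corresponds to one underline. Patterns in which a non-significant pair spans a significant one cannot be represented by contiguous underlines and are not handled by this sketch.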
Visual Example 3: Ambiguous Transitivity
Sometimes the results create apparent contradictions that require careful interpretation:
A vs B: not significant
B vs C: significant
C vs D: not significant
A vs C: significant
A vs D: not significant
B vs D: significant
Visual display:
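A     B     C     D
_______     _______
(underlines A-B and C-D; the non-significant A vs D result cannot be drawn as a contiguous underline)
Interpretation: Groups A and B are not distinguishable, and neither are groups C and D. The puzzling result is that A differs significantly from C but not from D, even though D has the larger sample mean. With equal sample sizes every pair shares the same margin of error, so this pattern cannot occur; it arises when sample sizes are unequal, for example when group D is small and its intervals are wide. A non-significant result is not evidence that two means are equal, so we should not conclude that A is more similar to D than to C. This illustrates the limitations of reading statistical significance as a perfect ordering: the underlying population means may have a different relationship than our sample evidence suggests.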
12.4.8. Complete Example: Coffeehouse Analysis Revisited
Let’s work through a comprehensive multiple comparison analysis using our coffeehouse customer age data.
Step 1: ANOVA Foundation
Our previous ANOVA analysis yielded:
F-statistic = 22.14
p-value ≈ 0 (highly significant)
MSE = 99.8
Error degrees of freedom: \(n - k = 200 - 5 = 195\)
Between-groups degrees of freedom: \(k - 1 = 4\)
Conclusion: We have strong evidence that at least one coffeehouse differs in mean customer age. We proceed to multiple comparisons to identify the specific differences.
Step 2: Choosing the Appropriate Method
Since we want to compare all possible pairs of coffeehouses (no natural control group), Tukey’s HSD is the appropriate method. We’ll use a family-wise confidence level of 95% (\(\alpha = 0.05\)).
Step 3: Tukey HSD Implementation
# Assuming we have the fitted ANOVA model
tukey_results <- TukeyHSD(fit, ordered = TRUE, conf.level = 0.95)
The critical value calculation:
# Manual critical value calculation
q_crit <- qtukey(0.05, nmeans = 5, df = 195, lower.tail = FALSE) / sqrt(2)
# q_crit ≈ 2.75
Step 4: Detailed Results Interpretation
Significant differences (confidence intervals not containing 0):
Comparison | Difference | 95% CI Lower | 95% CI Upper | Interpretation |
---|---|---|---|---|
5-4 | 7.65 | 1.53 | 13.77 | CH5 customers significantly older than CH4 |
1-4 | 12.71 | 6.44 | 18.98 | CH1 customers significantly older than CH4 |
3-4 | 14.08 | 7.92 | 20.24 | CH3 customers significantly older than CH4 |
2-4 | 20.24 | 13.93 | 26.55 | CH2 customers significantly older than CH4 |
3-5 | 6.43 | 0.46 | 12.40 | CH3 customers significantly older than CH5 |
2-5 | 12.59 | 6.47 | 18.71 | CH2 customers significantly older than CH5 |
2-1 | 7.53 | 1.26 | 13.80 | CH2 customers significantly older than CH1 |
2-3 | 6.16 | 0.0009 | 12.31 | CH2 customers significantly older than CH3 (barely) |
Non-significant differences (confidence intervals containing 0):
Comparison | Difference | 95% CI Lower | 95% CI Upper | Interpretation |
---|---|---|---|---|
1-5 | 5.06 | -1.02 | 11.14 | No significant difference |
3-1 | 1.37 | -4.74 | 7.49 | No significant difference |
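The classification in the two tables above follows mechanically from the interval endpoints. The sketch below re-enters the reported differences and bounds by hand (the data frame and column names are illustrative) and flags each interval that excludes 0.

```r
# Reported Tukey intervals for the coffeehouse comparisons (values from the tables)
ci <- data.frame(
  pair = c("5-4", "1-4", "3-4", "2-4", "3-5", "2-5", "2-1", "2-3", "1-5", "3-1"),
  diff = c(7.65, 12.71, 14.08, 20.24, 6.43, 12.59, 7.53, 6.16, 5.06, 1.37),
  lwr  = c(1.53, 6.44, 7.92, 13.93, 0.46, 6.47, 1.26, 0.0009, -1.02, -4.74),
  upr  = c(13.77, 18.98, 20.24, 26.55, 12.40, 18.71, 13.80, 12.31, 11.14, 7.49)
)

ci$significant <- ci$lwr > 0 | ci$upr < 0   # interval excludes 0
ci$margin      <- (ci$upr - ci$lwr) / 2     # half-width of each interval

ci[order(ci$diff), ]
```

The half-widths differ slightly from pair to pair, which is what we expect when the group sample sizes are not all equal.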
Step 5: Sample Mean Ordering and Visual Display
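Sample means in ascending order:
\[\bar{x}_4 < \bar{x}_5 < \bar{x}_1 < \bar{x}_3 < \bar{x}_2\]
Non-significant pairs: 1-5 and 3-1 (both are adjacent in the ordering; no non-adjacent pair is non-significant)
Visual representation:
CH4    CH5    CH1    CH3    CH2
       __________
              __________
(two overlapping underlines: CH5-CH1 and CH1-CH3)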
Step 6: Comprehensive Interpretation
Population mean relationships:
Coffeehouse 4 has the youngest customer base, significantly different from all other coffeehouses
Coffeehouse 2 has the oldest customer base, significantly different from all other coffeehouses
Coffeehouses 5, 1, and 3 form an intermediate, overlapping group: CH1 cannot be distinguished from either CH5 or CH3, yet CH3 customers are significantly older than CH5 customers. The three therefore do not form a single homogeneous cluster, although their sample means follow the ordering CH5 < CH1 < CH3
Statistical conclusion: The data suggest that Coffeehouse 4 might be popular with undergraduates (younger customers) while Coffeehouse 2 might attract graduate students and faculty (older customers). Coffeehouses 5, 1, and 3 have overlapping age demographics; among them, only the difference between CH3 and CH5 is statistically significant.
12.4.9. Advanced Topics in Multiple Comparisons
Controlling False Discovery Rate (FDR)
While family-wise error rate provides strong control over Type I errors, it can be overly conservative when the number of comparisons is large. An alternative approach controls the false discovery rate (FDR)—the expected proportion of false discoveries among all rejected hypotheses.
The FDR approach is particularly useful in exploratory research where some false positives are acceptable in exchange for increased power to detect true effects. However, for the controlled experimental settings typical in introductory statistics courses, FWER control remains the standard approach.
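The practical difference between the two kinds of control shows up in the adjusted p-values. The sketch below uses made-up p-values and base R's `p.adjust()`, with the Benjamini-Hochberg procedure (`"BH"`) as one widely used FDR method; the text does not commit to a particular FDR procedure, so this choice is an assumption.

```r
# Hypothetical unadjusted p-values from ten pairwise tests (illustration only)
p_raw <- c(0.001, 0.004, 0.008, 0.012, 0.020, 0.031, 0.045, 0.120, 0.350, 0.700)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls the FWER
p_bh   <- p.adjust(p_raw, method = "BH")          # controls the FDR

data.frame(p_raw, p_bonf, p_bh,
           reject_bonf = p_bonf < 0.05,
           reject_bh   = p_bh < 0.05)
```

The FDR-adjusted p-values are never larger than the Bonferroni-adjusted ones, which is where the extra power comes from.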
Simultaneous Confidence Intervals
The multiple comparison methods we’ve discussed provide simultaneous confidence intervals—sets of intervals that jointly have the specified confidence level. This means we can be 95% confident that all intervals simultaneously contain their respective true population differences.
This simultaneity distinguishes these intervals from individual confidence intervals, each of which would have 95% confidence individually but would not maintain 95% confidence when considered together.
Power Considerations in Multiple Comparisons
The choice among multiple comparison methods involves a fundamental trade-off between Type I and Type II error control:
Most conservative (lowest power): Bonferroni correction
Moderately conservative: Šidák correction
Least conservative (highest power): Tukey’s HSD (for all pairwise comparisons)
Most powerful (for specific designs): Dunnett’s method (for control comparisons)
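The ordering can be seen in the critical multipliers that each method applies to the standard error of a difference. The sketch below uses 5 groups and 195 error degrees of freedom, as in the coffeehouse example. Dunnett's method is left out because its critical values are not available in base R.

```r
k <- 5; df_error <- 195; alpha <- 0.05
c_tests <- choose(k, 2)

# Multiplier applied to sqrt(MSE * (1/n_i + 1/n_j)) under each approach
multipliers <- c(
  unadjusted = qt(1 - alpha / 2, df_error),
  tukey      = qtukey(1 - alpha, nmeans = k, df = df_error) / sqrt(2),
  sidak      = qt(1 - (1 - (1 - alpha)^(1 / c_tests)) / 2, df_error),
  bonferroni = qt(1 - (alpha / c_tests) / 2, df_error)
)
round(multipliers, 3)
```

A larger multiplier means wider intervals and lower power, so the values line up with the ranking above.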
When planning studies, researchers should consider:
The number of planned comparisons
The relative costs of Type I vs Type II errors in their context
Whether all pairwise comparisons are scientifically meaningful
The availability of natural control groups
12.4.10. Best Practices for Multiple Comparisons
Always perform ANOVA first as a gatekeeper test
Only proceed with multiple comparisons if ANOVA is significant
Choose the appropriate method based on your research design:
Tukey HSD for all pairwise comparisons
Dunnett’s method for treatment-vs-control designs
Avoid Bonferroni correction unless you have specific reasons to use it
Create visual displays to communicate results clearly
Report both statistical significance and effect sizes
12.4.11. The Complete ANOVA Workflow
Exploratory analysis: Side-by-side boxplots, effects plots
Check assumptions: Independence, normality, equal variances
Perform ANOVA: Test for any group differences
If ANOVA significant: Proceed to multiple comparisons
Apply appropriate method: Usually Tukey HSD
Create visual display: Show pattern of group differences
Interpret results: Identify which groups differ and by how much
12.4.12. Bringing It All Together
Key Takeaways 📝
Multiple comparisons are essential after significant ANOVA to identify which specific groups differ, but naive approaches inflate Type I error rates dangerously.
Family-wise error rate (FWER) provides the theoretical framework for understanding and controlling the overall probability of making at least one Type I error across all comparisons.
Bonferroni and Šidák corrections control FWER by adjusting individual test significance levels, but they tend to be overly conservative, especially with many comparisons.
Tukey’s HSD method provides superior power while maintaining exact FWER control in balanced designs by using the studentized range distribution, which accounts for the correlation structure among pairwise comparisons.
Dunnett’s method offers the most powerful approach when comparing multiple treatments to a single control group by reducing the number of required comparisons.
Visual displays using underlines effectively communicate complex patterns of group similarities and differences, though they should be interpreted carefully regarding transitivity.
The complete ANOVA workflow integrates exploratory analysis, assumption checking, overall F-testing, and targeted multiple comparisons to provide comprehensive understanding of group differences.
Method selection matters: The choice among multiple comparison procedures should align with research questions, study design, and the relative costs of Type I vs Type II errors in the specific context.
Exercises
FWER Calculation and Method Comparison: A researcher plans to compare 6 different teaching methods using all pairwise comparisons.
How many individual comparisons will be performed?
If each test uses \(\alpha = 0.05\), what is the family-wise error rate assuming independence?
What individual \(\alpha\) level should be used with Bonferroni correction to achieve FWER = 0.05?
What individual \(\alpha\) level should be used with Šidák correction?
Which correction is more powerful, and why?
Tukey HSD Interpretation: Consider these Tukey HSD confidence intervals for differences in means (all at 95% family-wise confidence level):
A vs B: [-2.1, 4.8]
A vs C: [1.2, 8.1]
A vs D: [3.5, 10.4]
B vs C: [-1.5, 5.4]
B vs D: [0.8, 7.7]
C vs D: [-2.0, 4.9]
Which pairs are significantly different at the 0.05 family-wise level?
Create a visual display showing group relationships using the underline method
Order the population means from smallest to largest based on the evidence
If these were individual confidence intervals (not simultaneous), would your conclusions change? Explain.
Method Selection and Design Considerations: For each scenario, identify the most appropriate multiple comparison method and justify your choice:
A pharmaceutical company tests 4 new drugs against a placebo for reducing blood pressure
A psychologist compares reaction times across 5 different age groups with no control condition
An agricultural researcher compares crop yields for 8 different fertilizer treatments
A marketing team compares customer satisfaction across 3 store locations
An education researcher compares test scores for students using 6 different learning apps vs traditional instruction
Complete Analysis with Real Data: A nutrition study compares the effectiveness of 5 different diets on weight loss over 12 weeks. ANOVA yields F = 8.2 with p-value = 0.0003. The summary data is:
Diet A (Control): \(n_1 = 20\), \(\bar{x}_1 = 8.2\) lbs, \(s_1 = 3.1\)
Diet B (Low-carb): \(n_2 = 18\), \(\bar{x}_2 = 12.7\) lbs, \(s_2 = 2.8\)
Diet C (Low-fat): \(n_3 = 22\), \(\bar{x}_3 = 6.1\) lbs, \(s_3 = 3.4\)
Diet D (Mediterranean): \(n_4 = 19\), \(\bar{x}_4 = 15.3\) lbs, \(s_4 = 3.0\)
Diet E (Intermittent fasting): \(n_5 = 21\), \(\bar{x}_5 = 10.8\) lbs, \(s_5 = 2.9\)
MSE = 9.12, total \(n = 100\)
What can you conclude from the ANOVA results?
Should you proceed with multiple comparisons? Which method is most appropriate?
Which pairs of diets would you expect to be significantly different based on the sample means?
Calculate the Tukey HSD critical value for this study
Calculate the 95% family-wise confidence interval for the difference between Diet D and Diet C
If the research question focused on comparing each diet to the control (Diet A), which method would be more appropriate and why?
Power and Sample Size Considerations:
Explain why Tukey’s HSD becomes more conservative as the number of groups increases
A researcher is planning a study with 4 groups and wants 80% power to detect a difference of 2 units between any pair of means, assuming \(\sigma = 3\). How would the required sample size change if they planned to use Bonferroni correction instead of Tukey’s HSD?
Under what circumstances might a researcher choose to use individual t-tests without multiple comparison correction, and what would be the statistical and ethical implications?
Interpretation Challenge: Given this underline display for 5 groups:
A ——— B ——— C D ——— E |______________|
Which pairs of groups are significantly different?
Which pairs are not significantly different?
Is it possible for A and D to be significantly different while C and D are not? Explain the statistical reasoning.
What does this pattern suggest about the underlying population means?