12.4. Multiple Comparison Procedures

When ANOVA leads to rejection of the null hypothesis, we conclude that at least one population mean differs from the others. However, this global conclusion could mean that only one mean differs from the rest, that all means differ from one another, or any intermediate scenario.

To understand the specific structure of differences, we need to perform a multiple comparison procedure.

Road Map 🧭

  • Understand the impact of multiple simultaneous testing on the Family-Wise Error Rate (FWER).

  • Learn the Bonferroni correction and Tukey’s method as two major ways to control FWER. Recognize the advantages and limitations of each method.

  • Visually display multiple comparisons results using an underline graph.

  • Follow the correct sequence of steps to perform a complete ANOVA.

12.4.1. The Multiple Testing Problem

The Hypotheses

The multiple comparisons procedure is equivalent to performing a hypothesis test for each pair of groups \(i\) and \(j\) such that \(i,j \in \{1, \cdots,k\}\) and \(i < j\), using the dual hypothesis:

(12.1)\[\begin{split}&H_0: \mu_i - \mu_j = 0\\ &H_a: \mu_i - \mu_j \neq 0.\end{split}\]

The Problem of Type I Error Inflation

When performing multiple simultaneous inferences, we are concerned not only with controlling the individual Type I error probability \(\alpha_{\text{single}}\) of each test, but also with managing the Family-Wise Error Rate (FWER), \(\alpha_{\text{overall}}\). The FWER is the probability of making at least one Type I error across the entire family of tests.

Let us denote the number of comparisons to be made as \(c\). The naive approach of performing \(c\) different two-sample \(t\)-procedures with no further adjustments raises the serious issue of FWER inflation. To see how, suppose there are five groups requiring \(\binom{5}{2} = 10\) pairwise comparisons at \(\alpha_{\text{single}} = 0.05\) for each test. In a simplified setting where the test results are mutually independent, the FWER is:

\[\begin{split}\alpha_{\text{overall}} &=P(\text{At least one Type I error})\\ & = 1 - P(\text{No Type I errors})\\ &= 1 - \prod_{i=1}^{10} P(\text{No Type I error in the } i\text{-th test})\\ &= 1-(1 - 0.05)^{10} = 1 - (0.95)^{10} = 0.401\end{split}\]

The transition from the second to the third line above requires independence. The result means that we have about a 40% chance of making at least one false positive—far above our intended 5% level!

General Result

In general, the family-wise error rate of \(c\) independent tests is computed by:

\[\begin{split}\alpha_{\text{overall}} &= P(\text{At least one false positive}) \\ &= 1 - P(\text{No false positives}) = 1 - (1 - \alpha_{\text{single}})^c.\end{split}\]

Even with small \(c\), the FWER quickly grows to exceed the individual Type I error probability by a large margin. The issue is exacerbated by the fact that \(c\) grows quadratically with \(k\) as shown by the general formula:

\[c = \binom{k}{2} = \frac{k!}{(k-2)!2!} = \frac{k(k-1)}{2}.\]
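To make this concrete, the short R sketch below (illustrative only, assuming independent tests at \(\alpha_{\text{single}} = 0.05\)) computes \(c\) and the corresponding FWER for several values of \(k\):

# Number of comparisons and FWER for independent tests at alpha_single = 0.05
alpha_single <- 0.05
k      <- 2:8                            # number of groups
n_comp <- choose(k, 2)                   # c = k(k-1)/2 pairwise comparisons
fwer   <- 1 - (1 - alpha_single)^n_comp  # FWER under independence
data.frame(k, n_comp, fwer)
# For k = 5: n_comp = 10 and fwer is roughly 0.401, matching the example above.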

Without adjustments that address this issue, the inflation of the Type I error rate undermines our statistical control and leads to false discoveries.

Important Caveat: The Independence Assumption

Note that our exploration of Type I error inflation was made under the assumption that the test results are independent. This is in fact quite unrealistic since pairwise comparisons use overlapping data and share the common MSE estimate. However, this provides a useful starting point for understanding the problem.

The adjustment methods introduced below will not require the independence assumption.

12.4.2. Methods for Controlling FWER

The Common Structure

Several statistical methods have been developed to control the family-wise error rate. We will examine two major approaches: the Bonferroni correction and Tukey’s HSD method, and briefly discuss two special cases—Šidák correction and Dunnett’s method—in the appendix.

For every method, a confidence interval equivalent to the test (12.1) is constructed as:

(12.2)\[(\bar{X}_i - \bar{X}_j) \pm \mathbf{t^{**}} \sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)},\]

where the form of \(\mathbf{t^{**}}\) is specific to each method. All other components remain unchanged and draw a close parallel with the independent two-sample confidence interval under the equal variance assumption.

Method 1: The Bonferroni Correction

The Bonferroni correction bounds the individual Type I error probability so that the resulting family-wise error rate stays below a desired level. The correction relies on Boole’s inequality (also called the union bound):

\[P\left(\bigcup_{j=1}^c A_j\right) \leq \sum_{j=1}^c P(A_j)\]

How does Boole’s Inequality Hold?

Let us examine a two-way case. Beginning with the general addition rule:

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B)\]

which agrees with Boole’s inequality with \(c=2\). Cases with \(c > 2\) can be inductively reasoned.

Applying the inequality to our context,

\[\begin{split}\alpha_{\text{overall}} &= P\left(\text{At least one Type I error}\right)\\ &=P\left(\{\text{Type I error in Test 1}\} \cup \cdots \cup \{\text{Type I error in Test } c\} \right)\\ &\leq \sum_{\ell=1}^c P(\text{Type I error in Test } \ell) = c\,\alpha_{\text{single}}\end{split}\]

To ensure a target FWER of at most \(\alpha_{\text{overall}}\), therefore, we choose:

\[\alpha_{\text{single}} = \frac{\alpha_{\text{overall}}}{c},\]

and use the \(t\)-critical value:

\[\mathbf{t^{**}} = t_{\frac{\alpha_{\text{single}}}{2}, n-k} = t_{\frac{\alpha_{\text{overall}}}{2c}, n-k}\]

in the general confidence interval (12.2). Note that independence of test results is not required for this approach, since Boole’s inequality holds regardless of independence between events.
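As a quick sketch (assuming the number of groups k, the total sample size n, and a target FWER of 0.05 are already defined), the Bonferroni critical value can be obtained in R as:

# Bonferroni-adjusted level and critical value (sketch)
c_comp       <- choose(k, 2)        # number of pairwise comparisons
alpha_single <- 0.05 / c_comp       # alpha_overall / c
t_crit       <- qt(alpha_single / 2, df = n - k, lower.tail = FALSE)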

Example 💡: Bonferroni Multiple Comparisons for Coffeehouse Data

Coffeehouse data summary

Recall the coffeehouse dataset. In the ANOVA F-test, we rejected the null hypothesis that all true means are equal.

Suppose the Bonferroni correction is to be used for all confidence intervals of pairwise differences, with \(\alpha_{\text{overall}} = 0.05\). Compute the interval for the difference between Coffeehouse 2 and Coffeehouse 3.

Key components

  • \(\bar{x}_2 - \bar{x}_3 = 46.66 - 40.50 = 6.16\)

  • \(df_E = n-k = 200-5 = 195\)

  • \(MSE = 99.8\) from the ANOVA table (see Chapter 12.3)

For \(k=5\), the number of comparisons is \(\binom{5}{2}=10\). Then, \(\alpha_{\text{single}} = \frac{0.05}{10} = 0.005\).

Computing the critical value

Using R,

qt(0.0025, df=195, lower.tail=FALSE)

gives \(\mathbf{t^{**}} = t_{0.005/2, 195}=2.84\).

Substituting the appropriate values into Eq (12.2),

\[6.16 \pm (2.84)\sqrt{99.8\left(\frac{1}{38} + \frac{1}{42}\right)} = (-0.1920, 12.5120)\]
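The same interval can be verified in R using the summary values above (a verification sketch):

# Bonferroni interval for Coffeehouse 2 vs Coffeehouse 3
diff   <- 46.66 - 40.50
t_crit <- qt(0.005 / 2, df = 195, lower.tail = FALSE)    # roughly 2.84
margin <- t_crit * sqrt(99.8 * (1 / 38 + 1 / 42))
c(diff - margin, diff + margin)                          # roughly (-0.19, 12.51)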

Conclusion

Since the interval includes 0, we do not have enough evidence to conclude that the true mean ages of Coffeehouses 2 and 3 are different at the family-wise error rate of \(\alpha_{\text{overall}}\) according to the Bonferroni correction.

Method 2: Tukey’s Honestly Significant Difference (HSD) Method

The theoretical foundation of Tukey’s method rests on the distribution of the largest difference among all sample means, \(\max_i \bar{X}_i - \min_i\bar{X}_i\). When the group sizes are balanced, with common size denoted \(n\), the statistic

\[Q = \frac{\max_i \bar{X}_i - \min_i \bar{X}_i}{\sqrt{MSE/n}}\]

is called the studentized range statistic and follows the studentized range distribution, which is determined by two parameters: the number of groups \(k\) and the error degrees of freedom \(n-k\).

We use the distribution of \(Q\) to compute a single critical value for all comparisons: if the interval is wide enough to accommodate the largest pairwise difference, it is wide enough for every other pairwise comparison as well. The critical value used for Tukey’s HSD is:

\[\mathbf{t^{**}} = \frac{q_{\alpha,k,n-k}}{\sqrt{2}},\]

where \(q_{\alpha,k,n-k}\) is the \((1-\alpha)\)-quantile of the studentized range distribution with \(k\) groups and \(n-k\) degrees of freedom.

Performing Tukey’s Method in R

With Summary Statistics

Calculate the critical value for the chosen family-wise level alpha:

q_crit <- qtukey(alpha, nmeans = k, df = n-k, lower.tail = FALSE) / sqrt(2)

Then for each pair \((i,j)\),

diff <- x_bar_i - x_bar_j
se_diff <- sqrt(MSE * (1/n_i + 1/n_j))
margin_error <- q_crit * se_diff

# Confidence interval
ci_lower <- diff - margin_error
ci_upper <- diff + margin_error

Using Raw Data

# Fit ANOVA model
fit <- aov(response ~ factor, data = dataset)

# Perform Tukey's HSD
tukey_results <- TukeyHSD(fit, ordered = TRUE, conf.level = 0.95)

# View results
tukey_results

Example 💡: Tukey Multiple Comparisons for Coffeehouse Data

Coffeehouse data summary

Suppose now that Tukey’s HSD method is to be used for all confidence intervals of pairwise differences, with \(\alpha_{\text{overall}} = 0.05\). Compute the interval for the difference between Coffeehouse 2 and Coffeehouse 3.

Compare the result with the Bonferroni-corrected interval.

Key components

  • \(\bar{x}_2 - \bar{x}_3 = 46.66 - 40.50 = 6.16\)

  • \(df_E = n-k = 200-5 = 195\)

  • \(MSE = 99.8\) from the ANOVA table (see Chapter 12.3)

Computing the critical value

Using R,

qtukey(0.05, nmeans = 5, df = 195, lower.tail = FALSE) / sqrt(2)

gives \(\mathbf{t^{**}} = 2.75\).

Substituting the appropriate values into Eq (12.2),

\[6.16 \pm (2.75)\sqrt{99.8\left(\frac{1}{38} + \frac{1}{42}\right)} = (0.0014, 12.3107)\]

Conclusion

Since the interval does not include 0, we have enough evidence to conclude that the true mean ages of Coffeehouses 2 and 3 are different at the family-wise error rate of \(\alpha_{\text{overall}}\) according to Tukey’s HSD method.

Why does it give a different conclusion than Bonferroni?

The Bonferroni correction tends to make the inference results overly conservative as \(c\) grows, because it chooses an \(\alpha_{\text{single}}\) that falls well below what is actually required. When the number of comparisons is more than a few, therefore, Tukey’s HSD has higher power and is generally a better choice than Bonferroni.

Bonferroni or Tukey’s HSD?

Bonferroni

  • Advantages: Computation is slightly simpler, and the method is based on the \(t\)-distribution, which we are more familiar with.

  • Disadvantages: The \(\alpha_{\text{overall}}\) is often a loose upper bound on the true FWER, so the conclusions are more conservative than necessary. As the number of comparisons increases, the method becomes increasingly restrictive with very small \(\alpha_{\text{single}}\), and true effects may not be detected effectively.

Tukey’s HSD

  • Advantages: Generally more powerful than Bonferroni. For balanced designs, Tukey’s HSD controls the FWER at exactly the specified level.

  • Disadvantages: Requires a slightly more advanced statistical understanding. Even so, Tukey’s method is generally a better option than Bonferroni.

12.4.3. Interpreting Multiple Comparisons Results

Reading the R output

A confidence interval that does not contain 0 indicates a statistically significant difference at the family-wise error rate \(\alpha_{\text{overall}}\).

A typical R output for a multiple comparisons procedure consists of:

  • diff: The evaluated difference \(\bar{x}_{i\cdot} - \bar{x}_{j\cdot}\)

  • lwr: Lower bound of confidence interval

  • upr: Upper bound of confidence interval

  • p adj: Adjusted \(p\)-value accounting for multiple comparisons

The adjusted \(p\)-values can be compared directly to \(\alpha_{\text{overall}}\) to determine significance, or you can examine whether 0 falls within each confidence interval.
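For instance, assuming the TukeyHSD() result from earlier is stored in tukey_results and the grouping variable in the model is named factor, the comparison table can be extracted and significant pairs flagged as follows (a sketch):

# Pull the comparison matrix for the grouping term
res <- tukey_results$factor                  # columns: diff, lwr, upr, p adj
# Flag pairs whose adjusted p-value is below alpha_overall
significant <- res[, "p adj"] < 0.05
# Equivalently: significant <- res[, "lwr"] > 0 | res[, "upr"] < 0
res[significant, , drop = FALSE]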

Visually Displaying the Results

After obtaining multiple comparison results, creating an effective visual summary helps communicate the pattern of group differences clearly and intuitively. The standard approach uses an “underline” notation to show which groups cannot be distinguished statistically. To construct underline diagrams, follow the steps below:

  1. Arrange the groups from smallest to largest sample mean.

  2. Identify all pairs that are NOT significantly different.

  3. Draw lines under groups that are not significantly different.

  4. Interpret the grouping pattern. Groups connected by the same underline are statistically indistinguishable from each other.
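The bookkeeping can also be automated. The base-R sketch below is illustrative only; it assumes the groups are already ordered by sample mean and that non-significant clusters are contiguous, and it prints one underline per maximal run of consecutive, mutually non-significant groups (the inputs here match Example 1 below).

groups   <- c("A", "B", "C", "D")             # ordered by sample mean
ns_pairs <- list(c("A", "B"), c("C", "D"))    # pairs NOT significantly different

not_sig <- function(g1, g2) {
  any(sapply(ns_pairs, function(p) setequal(p, c(g1, g2))))
}

max_end <- 0
for (start in seq_along(groups)) {
  end <- start
  # Extend the run while the next group is non-significant with every group in the run
  while (end < length(groups) &&
         all(sapply(start:end, function(i) not_sig(groups[i], groups[end + 1])))) {
    end <- end + 1
  }
  if (end > start && end > max_end) {         # report only maximal runs
    cat(paste(groups[start:end], collapse = " --- "), "\n")
    max_end <- end
  }
}
# Prints: "A --- B" and "C --- D"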

Example 💡: Underline Diagrams

Consider four groups A, B, C, D whose observed sample means are ordered as \(\bar{x}_A < \bar{x}_B < \bar{x}_C < \bar{x}_D.\) For each case of multiple comparisons results, draw and interpret the underline diagram.

Example 1: Simple Grouping

  • A vs B: not significant

  • B vs C: significant

  • C vs D: not significant

  • A vs C: significant

  • A vs D: significant

  • B vs D: significant

Underline graph for example 1

Groups A and B form one cluster and groups C and D form another cluster. The two clusters are significantly different from each other.

Example 2: Complex Overlapping Pattern

  • A vs B: not significant

  • B vs C: not significant

  • C vs D: not significant

  • A vs C: not significant

  • A vs D: significant

  • B vs D: significant

Underline graph for example 2

Groups A, B, and C are all statistically indistinguishable from each other. Groups C and D are not statistically different, either. However, Group D is significantly different from groups A and B.

Example 3: Ambiguous Transitivity

  • A vs B: not significant

  • B vs C: significant

  • C vs D: not significant

  • A vs C: significant

  • A vs D: not significant

  • B vs D: significant

Underline graph for example 3

Sometimes the results create apparent contradictions that require careful interpretation; such patterns often arise when the group sizes are heavily imbalanced, making some comparisons far more precise than others.

Groups A and B are similar; groups C and D are similar. However, the data also fail to distinguish A from D, even though A differs significantly from C, whose mean lies between those of A and D. This illustrates the limitations of interpreting statistical significance as a perfect ordering: the underlying population means may have a different relationship than what our sample evidence suggests.

12.4.4. The Complete ANOVA Workflow

An ANOVA procedure requires a series of steps to be performed in the correct order. The list below summarizes the complete ANOVA workflow:

  1. Perform exploratory data analysis. Use side-by-side boxplots.

  2. Graphically and numerically check assumptions of independence, normality, and equal variances.

  3. Construct the ANOVA table. Perform the ANOVA \(F\)-test for any group differences.

  4. If the result of Step 3 is statistically significant, proceed to multiple comparisons.

  5. Apply the appropriate multiple comparisons method.

  6. Create an underline diagram and interpret the results.
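A compact R sketch of the workflow is given below (assuming a data frame dataset with a numeric response column and a grouping column factor; the assumption checks in Step 2 use residuals from the fitted model):

# Step 1: exploratory analysis with side-by-side boxplots
boxplot(response ~ factor, data = dataset)

# Step 3: fit the one-way ANOVA model
fit <- aov(response ~ factor, data = dataset)

# Step 2: graphical assumption checks based on the residuals
plot(fit, which = 1)   # residuals vs fitted values (equal variance)
plot(fit, which = 2)   # normal Q-Q plot of residuals (normality)

# Step 3 (continued): ANOVA table and overall F-test
summary(fit)

# Steps 4-6: if the F-test is significant, run multiple comparisons
TukeyHSD(fit, conf.level = 0.95)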

Example 💡: Complete Tukey’s HSD Multiple Comparisons ☕️

Let us continue with the coffeehouse example. Disregarding the Bonferroni example, suppose the multiple comparison procedure is to be performed using Tukey’s HSD. This is in fact more appropriate since, with \(c=10\), the Bonferroni-adjusted \(\alpha_{\text{single}}\) is too small.

Use \(\alpha_{\text{overall}} = 0.05\).

Manual Computation

The critical value calculation:

# Manual critical value calculation
q_crit <- qtukey(0.05, nmeans = 5, df = 195, lower.tail = FALSE) / sqrt(2)
# q_crit ≈ 2.75

Then for all possible pairs \((i,j)\), compute the interval using Equation (12.2).
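For a single pair, the computation can be wrapped in a small helper (a sketch reusing q_crit from above and the summary statistics for Coffeehouses 2 and 3):

# Evaluate Equation (12.2) for one pair of groups
pair_ci <- function(xbar_i, xbar_j, n_i, n_j, MSE, crit) {
  diff   <- xbar_i - xbar_j
  margin <- crit * sqrt(MSE * (1 / n_i + 1 / n_j))
  c(estimate = diff, lower = diff - margin, upper = diff + margin)
}

# Coffeehouse 2 vs Coffeehouse 3
pair_ci(46.66, 40.50, n_i = 38, n_j = 42, MSE = 99.8, crit = q_crit)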

Using the Raw Data

Assuming that we have the fitted ANOVA model saved to fit, we can also run:

tukey_results <- TukeyHSD(fit, ordered = TRUE, conf.level = 0.95)

Results

Pair    Difference    95% CI Lower    95% CI Upper    Significant?
5-4     7.65          1.53            13.77           Yes
1-4     12.71         6.44            18.98           Yes
3-4     14.08         7.92            20.24           Yes
2-4     20.24         13.93           26.55           Yes
3-5     6.43          0.46            12.40           Yes
2-5     12.59         6.47            18.71           Yes
2-1     7.53          1.26            13.80           Yes
2-3     6.16          0.0009          12.31           Yes
1-5     5.06          -1.02           11.14           No
3-1     1.37          -4.74           7.49            No

The Underline Graph

First, order the groups in ascending order of their sample means. Since pairs 1-5 and 3-1 are the only non-significant comparisons, the appropriate underline graph is:

CH4    CH5 ——— CH1 ——— CH3    CH2
  • Coffeehouse 4 has the youngest customer base, significantly different from all other coffeehouses.

  • Coffeehouse 2 has the oldest customer base, significantly different from all other coffeehouses.

  • Coffeehouses 5, 1, and 3 form an intermediate cluster where customers cannot be distinguished by age statistically.

12.4.5. Bringing It All Together

Key Takeaways 📝

  1. After the ANOVA \(F\)-test yields a significant result, multiple comparisons are essential to identify which specific groups differ.

  2. Family-wise error rate (FWER) must be controlled in multiple comparisons contexts.

  3. Bonferroni correction controls FWER by adjusting individual test significance levels, but it tends to be overly conservative, especially with many comparisons.

  4. Tukey’s HSD method provides superior power while maintaining exact FWER control by using the studentized range distribution.

  5. Visual displays using underlines effectively communicate complex patterns of group similarities and differences, though they should be interpreted carefully regarding transitivity.

  6. The complete ANOVA workflow integrates exploratory analysis, assumption checking, overall F-testing, and targeted multiple comparisons to provide comprehensive understanding of group differences.

Exercises

  1. FWER Calculation and Method Comparison: A researcher plans to compare 6 different teaching methods using all pairwise comparisons.

    1. How many individual comparisons will be performed?

    2. What individual \(\alpha\) level should be used with Bonferroni correction to achieve FWER = 0.05?

  2. Tukey HSD Interpretation: Consider the Tukey HSD confidence intervals for differences in means at 95% family-wise confidence level:

    • A vs B: [-2.1, 4.8]

    • A vs C: [1.2, 8.1]

    • A vs D: [3.5, 10.4]

    • B vs C: [-1.5, 5.4]

    • B vs D: [0.8, 7.7]

    • C vs D: [-2.0, 4.9]

    1. Which pairs are significantly different at the 0.05 family-wise level?

    2. Create a visual display showing group relationships using the underline method.

    3. If these were individual confidence intervals (not simultaneous), would your conclusions change? Explain.

  3. Complete Analysis with Real Data: A nutrition study compares the effectiveness of 5 different diets on weight loss over 12 weeks. ANOVA yields \(F = 8.2\) with \(p\)-value \(= 0.0003\). The data summary is:

    • Diet A (Control): \(n_1 = 20\), \(\bar{x}_{1\cdot} = 8.2\) lbs, \(s_{1\cdot} = 3.1\)

    • Diet B (Low-carb): \(n_2 = 18\), \(\bar{x}_{2\cdot} = 12.7\) lbs, \(s_{2\cdot} = 2.8\)

    • Diet C (Low-fat): \(n_3 = 22\), \(\bar{x}_{3\cdot} = 6.1\) lbs, \(s_{3\cdot} = 3.4\)

    • Diet D (Mediterranean): \(n_4 = 19\), \(\bar{x}_{4\cdot} = 15.3\) lbs, \(s_{4\cdot} = 3.0\)

    • Diet E (Intermittent fasting): \(n_5 = 21\), \(\bar{x}_{5\cdot} = 10.8\) lbs, \(s_{5\cdot} = 2.9\)

    In addition, \(MSE = 9.12\) and the total sample size is \(n = 100\).

    1. What can you conclude from the ANOVA results?

    2. Should you proceed with multiple comparisons? Which method is most appropriate?

    3. Calculate the Tukey HSD critical value for this study.

    4. Calculate the 95% family-wise confidence interval for the difference between Diet D and Diet C.

12.4.6. Appendix: Other FWER-Controlling Methods

Method 3: The Šidák Correction

The Šidák correction continues from the discussion of FWER inflation under the assumption that the test results are independent. By isolating \(\alpha_{\text{single}}\) in:

\[\alpha_{\text{overall}} = 1 - (1 - \alpha_{\text{single}})^c,\]

we get:

\[\alpha_{\text{single}} = 1 - (1 - \alpha_{\text{overall}})^{1/c}.\]

Once \(\alpha_{\text{single}}\) is obtained, we use it to compute the critical value as \(\mathbf{t^{**}} = t_{\alpha_{\text{single}}/2, n-k}\).
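As a sketch (assuming the number of comparisons c_comp, the total sample size n, and the number of groups k are already defined, with a target FWER of 0.05), the Šidák-adjusted level and critical value can be computed as:

# Sidak-adjusted individual level and critical value (sketch)
alpha_overall <- 0.05
alpha_single  <- 1 - (1 - alpha_overall)^(1 / c_comp)
t_crit        <- qt(alpha_single / 2, df = n - k, lower.tail = FALSE)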

Limitations of the Šidák Correction

The Šidák correction assumes mutual independence between tests. In practice, pairwise comparisons using the same data and pooled variance estimate are correlated, making this assumption problematic.

Method 4: Dunnett’s Method for Control Comparisons

Consider \(k\) groups where group 1 is a control and groups 2 through \(k\) are treatments. If the primary interest lies in comparing each treatment to the control rather than all pairwise comparisons, Dunnett’s method provides a more powerful alternative. The increased power is partially due to the reduced number of comparisons from all \(\binom{k}{2}\) pairs to only \(k-1\) control-treatment pairs. Another contributing factor is the specialized distribution that allows control over the FWER for the specific set of control-vs-treatment comparisons.

Dunnett’s method is not implemented in base R but is available in contributed packages such as multcomp. The exact implementation details vary by package, but the conceptual approach remains the same.

# One possible option (sketch): Dunnett contrasts via the multcomp package
# install.packages("multcomp")
# library(multcomp)
# "factor" below must match the name of the grouping variable used in fit
# dunnett_results <- glht(fit, linfct = mcp(factor = "Dunnett"))
# summary(dunnett_results)
# confint(dunnett_results)