11.4. Comparing the Means of Two Independent Populations - No Equal Variance Assumption

When population standard deviations are unknown and we cannot reasonably assume that the variances are equal across populations, we must estimate each variance separately. This scenario, while more realistic than the pooled approach, introduces significant computational complexity through the need for approximate degrees of freedom calculations. This section develops the unpooled (Welch) approach and demonstrates why it has become the preferred method in modern statistical practice.

Road Map 🧭

  • Problem we will solve – How to compare two population means when standard deviations are unknown and potentially unequal, avoiding the restrictive equal variance assumption

  • Tools we’ll learn – The Welch-Satterthwaite approximation for degrees of freedom and robust two-sample t-procedures that work regardless of variance equality

  • How it fits – This provides the most general and widely applicable approach to two-sample comparisons, serving as the default method in contemporary statistical practice

11.4.1. The Motivation for Unpooled Procedures

In practical statistical analysis, the assumption of equal population variances required for pooled procedures is often unrealistic or unverifiable. Consider the fundamental question: if we do not know the population means (which is why we are testing them), how can we confidently assume we know that the population variances are equal?

The Limitations of Equal Variance Assumptions

The equal variance assumption \(\sigma^2_A = \sigma^2_B\) requires substantial justification:

  1. Process similarity: The data-generating mechanisms must be nearly identical except for location shifts

  2. Homogeneous populations: Both populations must have similar underlying variability structures

  3. Verifiability: We need sufficient evidence that the assumption holds in the specific context

Real-World Scenarios Where Variances Differ

Many practical situations naturally violate the equal variance assumption:

  • Different measurement scales: Comparing groups with naturally different variability ranges

  • Heteroscedastic relationships: When variability changes systematically across groups

  • Different population structures: Comparing homogeneous vs. heterogeneous populations

  • Treatment effects on variability: Interventions that affect both mean and variance

The Conservative Approach

Rather than risking the serious consequences of incorrect pooling, the unpooled approach provides a robust alternative that:

  • Works regardless of variance equality: Valid whether variances are equal or unequal

  • Avoids assumption verification: Eliminates the need to test for equal variances

  • Provides conservative inference: Generally produces slightly wider confidence intervals when variances are actually equal

11.4.2. Mathematical Framework for Unpooled Procedures

Fundamental Assumptions

The unpooled approach maintains the core assumptions of independent sample procedures while removing the equal variance requirement:

  1. Independent simple random sampling from each population

  2. Independence between groups: No relationship between observations from different populations

  3. Normal sampling distributions: Either through population normality or Central Limit Theorem

  4. Unknown and potentially unequal variances: \(\sigma^2_A\) and \(\sigma^2_B\) are both unknown and need not be equal

The Standard Error with Separate Variance Estimation

When variances must be estimated separately, our standard error becomes:

\[SE = \sqrt{\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}}\]

This expression directly estimates the variability of each sample mean separately, making no assumptions about the relationship between \(\sigma^2_A\) and \(\sigma^2_B\).

The Test Statistic

The test statistic takes the familiar form:

\[T'_{TS} = \frac{(\bar{X}_A - \bar{X}_B) - \Delta_0}{\sqrt{\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}}}\]

The prime notation (T’) indicates that this statistic does not follow an exact t-distribution but rather an approximation that depends on the unknown population variances.

11.4.3. The Welch-Satterthwaite Approximation

The Degrees of Freedom Problem

The critical challenge in unpooled procedures lies in determining appropriate degrees of freedom. Unlike pooled procedures where \(df = n_A + n_B - 2\), the unpooled case requires a complex approximation.

The Exact Degrees of Freedom Formula

If the population variances were known, the exact degrees of freedom would be:

\[df_{exact} = \frac{\left(\frac{\sigma^2_A}{n_A} + \frac{\sigma^2_B}{n_B}\right)^2}{\frac{1}{n_A - 1}\left(\frac{\sigma^2_A}{n_A}\right)^2 + \frac{1}{n_B - 1}\left(\frac{\sigma^2_B}{n_B}\right)^2}\]

This formula would yield an exact t-distribution for the test statistic. However, since the population variances are unknown, we must approximate.

The Welch-Satterthwaite Approximation

By substituting sample variances for population variances, we obtain the approximate degrees of freedom:

\[\nu = \frac{\left(\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}\right)^2}{\frac{1}{n_A - 1}\left(\frac{s^2_A}{n_A}\right)^2 + \frac{1}{n_B - 1}\left(\frac{s^2_B}{n_B}\right)^2}\]

Key Properties of the Approximation

  1. Non-integer values: Unlike exact degrees of freedom, \(\nu\) is typically not an integer

  2. Data-dependent: The degrees of freedom depend on the observed sample variances

  3. Approximation quality: The approximation is generally very good for practical purposes

  4. Conservative bias: Tends to produce slightly conservative inference when sample sizes are small

Computational Considerations

To avoid errors in complex calculations, it is recommended to:

  1. Calculate the numerator separately: \(\left(\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}\right)^2\)

  2. Calculate the denominator separately: \(\frac{1}{n_A - 1}\left(\frac{s^2_A}{n_A}\right)^2 + \frac{1}{n_B - 1}\left(\frac{s^2_B}{n_B}\right)^2\)

  3. Divide to obtain the final result: \(\nu = \frac{\text{numerator}}{\text{denominator}}\)

11.4.4. The Pivotal Quantity and Confidence Intervals

Approximate Pivotal Quantity

For confidence interval construction, we use the approximate pivotal quantity:

This quantity follows an approximate t-distribution with \(\nu\) degrees of freedom calculated via the Welch-Satterthwaite method.

Confidence Interval Formula

A \(100(1-\alpha)\%\) confidence interval for \(\mu_A - \mu_B\) is:

\[(\bar{x}_A - \bar{x}_B) \pm t_{\alpha/2,\nu} \sqrt{\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}}\]

where \(t_{\alpha/2,\nu}\) is the critical value from a t-distribution with \(\nu\) degrees of freedom.

Confidence Bounds for One-Sided Alternatives

For one-sided alternatives, we construct confidence bounds:

Upper bound (for :math:`H_a: mu_A - mu_B < Delta_0`*)*:

Lower bound (for :math:`H_a: mu_A - mu_B > Delta_0`*)*:

11.4.5. Comparing Pooled and Unpooled Approaches: A Critical Analysis

When Equal Variance Assumption Holds (:math:`sigma_A = sigma_B`)

If the population variances are actually equal, pooled procedures offer several advantages:

Advantages of pooled approach:

  • Better variance estimation: Uses data from both samples (\(n_A + n_B - 2\) degrees of freedom)

  • Exact t-distribution: Test statistics follow exact t-distributions

  • Simple degrees of freedom: \(df = n_A + n_B - 2\) (always an integer)

  • Higher power: More efficient use of data leads to better ability to detect true differences

Disadvantages of unpooled approach when variances are equal:

  • Less efficient estimation: Uses only individual sample information for variance estimation

  • Approximate distribution: Test statistics follow only approximate t-distributions

  • Complex degrees of freedom: Requires Welch-Satterthwaite calculation

  • Slightly lower power: Less efficient use of available information

When Equal Variance Assumption Fails (:math:`sigma_A neq sigma_B`)

When population variances differ, the consequences of using pooled procedures can be severe, particularly when sample sizes are unequal.

The Problem with Unequal Sample Sizes:

Consider a scenario where \(n_A = 1500\), \(n_B = 200\), and \(\sigma_A > \sigma_B\). The pooled variance estimator becomes:

Since \(n_A >> n_B\), the pooled estimator is heavily weighted toward \(S^2_A\). If \(\sigma_A > \sigma_B\), this leads to:

  1. Overestimation of overall variability: The pooled estimate reflects the larger variance from population A

  2. Overly wide confidence intervals: Margins of error become unnecessarily large

  3. Reduced power: True differences become harder to detect

  4. Poor coverage properties: Confidence intervals may have coverage probabilities far from nominal levels

Coverage Simulation Results:

The slides show simulation results demonstrating how pooled procedures fail when the equal variance assumption is violated:

  • Nominal coverage: 95% confidence intervals should capture the true difference 95% of the time

  • Observed coverage with unequal variances: Can drop to 50% or rise to nearly 100%, depending on sample size configuration

  • Systematic bias: The direction and magnitude of coverage errors depend on which population has larger variance and larger sample size

When to Use Each Approach

Use pooled procedures when:

  • Strong theoretical or empirical evidence supports equal variances

  • Experimental control ensures similar variability (e.g., designed experiments)

  • Sample sizes are small and efficiency gains are crucial

  • Domain expertise strongly suggests homogeneous variance structures

Use unpooled procedures when:

  • No clear evidence supports equal variances

  • Populations may have systematically different variability

  • Sample sizes are unequal

  • A conservative, robust approach is preferred

  • Default analysis without strong assumptions

11.4.6. The Four-Step Process for Unpooled Procedures

Step 1: Identify and Describe Parameters

Clearly define both population means using contextual labels and appropriate units. Specify that standard deviations are unknown and will be estimated separately.

Example: \(\mu_{new}\) = true mean DRP score for students receiving the new teaching method; \(\mu_{traditional}\) = true mean DRP score for students receiving traditional instruction.

Step 2: State Hypotheses

Formulate hypotheses about the difference in means, choosing the appropriate alternative based on the research question:

  • \(H_0: \mu_A - \mu_B = \Delta_0\) (typically \(\Delta_0 = 0\))

  • \(H_a: \mu_A - \mu_B \neq \Delta_0\) (or \(>\) or \(<\))

Step 3: Calculate Test Statistic, Degrees of Freedom, and P-Value

Test statistic:

Degrees of freedom (Welch-Satterthwaite):

P-value calculation (using approximate t-distribution):

  • Two-sided: 2 * pt(abs(t_ts), df = nu, lower.tail = FALSE)

  • Right-tailed: pt(t_ts, df = nu, lower.tail = FALSE)

  • Left-tailed: pt(t_ts, df = nu, lower.tail = TRUE)

Step 4: Decision and Conclusion

Compare p-value to \(\alpha\) and state conclusions in context, emphasizing the difference between population means.

11.4.7. Implementation in R

Using t.test() Function

For data analysis, R’s t.test() function automatically performs Welch procedures:

t.test(quantitativeVariable ~ categoricalVariable,
       mu = Delta0,
       conf.level = C,
       paired = FALSE,
       alternative = "alternative_hypothesis",
       var.equal = FALSE)

Key Arguments:

  • Formula syntax: quantitativeVariable ~ categoricalVariable separates groups

  • mu: Null hypothesis value \(\Delta_0\) (default = 0)

  • conf.level: Confidence level \(C = 1 - \alpha\)

  • paired: Set to FALSE for independent samples

  • alternative: “two.sided”, “greater”, or “less”

  • var.equal: Set to FALSE to use Welch procedures (our course default)

Manual Calculation Example

When summary statistics are provided instead of raw data:

# Given summary statistics
n_A <- 21; xbar_A <- 51.48; s_A <- 11.01
n_B <- 23; xbar_B <- 41.52; s_B <- 17.15

# Calculate test statistic
point_est <- xbar_A - xbar_B
se <- sqrt(s_A^2/n_A + s_B^2/n_B)
t_ts <- point_est / se

# Calculate Welch-Satterthwaite degrees of freedom
numerator <- (s_A^2/n_A + s_B^2/n_B)^2
denominator <- (s_A^2/n_A)^2/(n_A-1) + (s_B^2/n_B)^2/(n_B-1)
nu <- numerator / denominator

# Calculate p-value (for right-tailed test)
p_value <- pt(t_ts, df = nu, lower.tail = FALSE)

11.4.8. Robustness and Assumption Checking

Robustness to Normality Violations

The unpooled t-procedure is robust to moderate departures from normality, with robustness increasing with sample size. However, data visualization remains essential regardless of sample size guidelines.

Sample Size Guidelines

While the following guidelines provide general direction, they cannot substitute for careful data examination:

Total Sample Size

Normality Requirements

\(n_A + n_B < 15\)

Data must be very close to normal. Requires careful graphical assessment.

\(15 \leq n_A + n_B < 40\)

Can tolerate mild skewness. Strong skewness still problematic.

\(n_A + n_B \geq 40\)

Usually acceptable even with moderate skewness.

Critical Emphasis on Data Visualization

These sample size rules are guidelines only. The robustness of t-procedures depends on the specific nature of departures from normality:

  • Outliers: Can invalidate procedures regardless of sample size

  • Extreme skewness: May require much larger samples than guidelines suggest

  • Heavy tails: Can affect both Type I error rates and power

  • Multiple modes: May indicate population heterogeneity

Essential Diagnostic Tools:

  • Histograms: Assess shape, skewness, and outliers for each group

  • Box plots: Compare distributions and identify outliers

  • Q-Q plots: Evaluate normality assumption more precisely

  • Side-by-side comparisons: Examine both groups simultaneously

Equal Sample Sizes Improve Robustness

When \(n_A \approx n_B\), t-procedures are more robust to:

  • Moderate normality violations

  • Unequal variances

  • Outliers in one group

This provides another advantage of balanced designs beyond their efficiency benefits.

When Assumptions Are Seriously Violated

If normality assumptions are severely violated:

  1. Data transformation: Consider log, square root, or other transformations

  2. Non-parametric methods: Wilcoxon rank-sum test or Mann-Whitney U test

  3. Bootstrap methods: Computer-intensive resampling approaches

  4. Larger sample sizes: May overcome moderate violations through CLT

11.4.9. A Complete Example: Teaching Methods Comparison

To illustrate the complete unpooled procedure, we present a comprehensive analysis of teaching method effectiveness.

Study Design

Researchers investigated whether directed reading activities improve elementary students’ reading ability, measured by Degree of Reading Power (DRP) scores. Students were randomly assigned to either:

  • New method: Directed reading activities (\(n = 21\))

  • Traditional method: Standard instruction (\(n = 23\))

Higher DRP scores indicate better reading ability.

Step 1: Parameter Identification

  • \(\mu_{new}\) = true mean DRP score for students receiving directed reading instruction

  • \(\mu_{traditional}\) = true mean DRP score for students receiving traditional instruction

Both population standard deviations are unknown and will be estimated separately.

Step 2: Hypothesis Formulation

The research question asks whether directed reading activities improve performance:

  • \(H_0: \mu_{new} - \mu_{traditional} \leq 0\) (no improvement)

  • \(H_a: \mu_{new} - \mu_{traditional} > 0\) (improvement)

This is a right-tailed test with \(\Delta_0 = 0\).

Step 3: Assumption Checking and Statistical Analysis

Data exploration reveals:

  • Both distributions approximately normal with mild skewness in traditional group

  • No serious outliers identified in modified box plots

  • Sample sizes (\(n_{new} = 21\), \(n_{traditional} = 23\)) adequate for mild departures from normality

  • Some evidence that variances may differ between groups

Summary statistics (from the example):

  • New method: \(\bar{x}_{new} = 51.48\), \(s_{new} = 11.01\)

  • Traditional method: \(\bar{x}_{traditional} = 41.52\), \(s_{traditional} = 17.15\)

Test statistic calculation:

Welch-Satterthwaite degrees of freedom:

P-value: For a right-tailed test, p-value = pt(2.31, df = 37.8, lower.tail = FALSE) = 0.013

Step 4: Decision and Conclusion

Since p-value = 0.013 < \(\alpha = 0.05\), we reject the null hypothesis.

Conclusion: The data give some support (p-value = 0.013) to the claim that directed reading activities improve elementary students’ reading ability as measured by DRP scores.

95% Lower Confidence Bound

For the one-sided alternative, we construct a lower confidence bound:

We are 95% confident that the new teaching method improves DRP scores by more than 2.69 points on average.

11.4.10. Modern Statistical Practice: Why Unpooled is Preferred

The Pragmatic Argument

Contemporary statistical practice increasingly favors unpooled procedures as the default approach for several compelling reasons:

  1. Assumption-free robustness: Works whether variances are equal or unequal

  2. Conservative inference: Provides valid results even when assumptions are violated

  3. Computational accessibility: Modern software makes complex calculations routine

  4. Reduced assumption checking: Eliminates need for formal tests of variance equality

The Risk-Benefit Analysis

The slight efficiency loss when variances are actually equal is far outweighed by the protection against serious errors when they are not:

  • Cost of false assumption: Severe (incorrect inference, poor coverage)

  • Cost of avoiding assumption: Minimal (slightly wider intervals when assumption holds)

  • Uncertainty about assumption: High (difficult to verify in practice)

Software Implementation

Most statistical software packages now use Welch procedures as the default:

  • R’s t.test() uses var.equal = FALSE by default in recent versions

  • This reflects the statistical community’s consensus on best practices

  • Pooled procedures remain available but require explicit specification

Pedagogical Considerations

Understanding both approaches provides important insights:

  • Pooled procedures: Illustrate the impact of assumptions on statistical methods

  • Unpooled procedures: Demonstrate robust, widely applicable techniques

  • Comparison: Highlights the trade-offs inherent in statistical methodology

11.4.11. Future Connections: Extension to Multiple Groups

The principles developed in two-sample unpooled procedures extend to more complex scenarios:

Analysis of Variance (ANOVA)

  • Traditional ANOVA assumes equal variances across all groups

  • Welch’s ANOVA provides a robust alternative for unequal variances

  • Same principle: trade some efficiency for assumption-free robustness

Regression Analysis

  • Heteroscedasticity (unequal error variances) violates standard regression assumptions

  • Robust standard errors and weighted least squares provide analogous solutions

  • Pattern: adapt methods to relax restrictive assumptions

Key Takeaways 📝

  1. Unpooled procedures avoid the restrictive equal variance assumption by estimating population variances separately using \(SE = \sqrt{\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}}\).

  2. The Welch-Satterthwaite approximation provides approximate degrees of freedom: \(\nu = \frac{\left(\frac{s^2_A}{n_A} + \frac{s^2_B}{n_B}\right)^2}{\frac{1}{n_A - 1}\left(\frac{s^2_A}{n_A}\right)^2 + \frac{1}{n_B - 1}\left(\frac{s^2_B}{n_B}\right)^2}\), typically yielding non-integer values.

  3. Pooled procedures can fail dramatically when equal variance assumptions are violated, especially with unequal sample sizes, leading to poor coverage and biased inference.

  4. Unpooled procedures provide robust inference that works whether variances are equal or unequal, with only minor efficiency loss when variances are actually equal.

  5. Modern statistical practice favors unpooled procedures as the default approach due to their assumption-free robustness and computational accessibility.

  6. Data visualization remains essential regardless of sample size guidelines, as robustness depends on the specific nature of departures from normality.

  7. The four-step hypothesis testing framework applies directly with test statistics approximately following t-distributions with Welch-Satterthwaite degrees of freedom.

  8. R implementation uses var.equal = FALSE in t.test() to automatically perform Welch procedures with approximate degrees of freedom calculations.

Exercises

  1. Degrees of Freedom Calculation: Two independent samples have the following characteristics:

    • Sample A: \(n_A = 15\), \(s^2_A = 28.4\)

    • Sample B: \(n_B = 20\), \(s^2_B = 45.7\)

    1. Calculate the Welch-Satterthwaite degrees of freedom

    2. Compare this to the pooled degrees of freedom \(n_A + n_B - 2\)

    3. Explain why the Welch degrees of freedom is typically smaller

  2. Assumption Violation Analysis: Consider the simulation results shown in the slides where pooled procedures fail when \(\sigma_A \neq \sigma_B\).

    1. Explain why unequal sample sizes exacerbate the problem

    2. Describe what happens to confidence interval coverage when the larger sample comes from the population with larger variance

    3. Explain why equal sample sizes improve robustness

  3. Method Comparison: A researcher has samples with \(n_A = n_B = 25\) and approximately equal sample variances.

    1. What are the advantages of using pooled procedures in this scenario?

    2. What are the advantages of using unpooled procedures?

    3. Which approach would you recommend and why?

  4. Robustness Assessment: You have two samples with total size \(n_A + n_B = 30\). One sample shows moderate right skewness and the other has one potential outlier.

    1. What additional information would you need to assess the appropriateness of t-procedures?

    2. What graphical tools would you use to evaluate the assumptions?

    3. Under what conditions might you proceed with t-procedures despite these concerns?

  5. Practical Implementation: Using the teaching methods example, verify the calculations by:

    1. Computing the test statistic manually

    2. Calculating the Welch-Satterthwaite degrees of freedom step-by-step

    3. Finding the p-value using R’s pt() function

    4. Constructing a 95% lower confidence bound

  6. Real-World Application: Design a study comparing two populations where:

    1. Equal variance assumption would be reasonable

    2. Equal variance assumption would be questionable

    For each scenario, justify your assessment and explain which procedure you would use.