11.5. Analyzing the Mean of Paired Differences Between Two Dependent Populations

In the independent sample procedures we have developed so far, we have operated under the fundamental assumption that observations from one population are completely independent of observations from the other population. However, many important research questions involve comparisons where this independence assumption is violated by design. When observations are naturally linked or paired, we must modify our statistical approach to account for these dependencies while gaining the substantial benefits that pairing can provide.

Road Map 🧭

  • Problem we will solve – How to compare two population means when observations are paired or matched, violating the independence assumption required for independent sample procedures

  • Tools we’ll learn – Paired sample t-procedures that transform two-sample problems into familiar one-sample analyses of differences

  • How it fits – This provides a powerful alternative to independent samples that controls for individual variability and often yields more precise inference about treatment effects

11.5.1. When Independence Fails: The Paired Design Framework

The independence assumption that underlies our previous two-sample procedures requires that the process of selecting individuals from one population has no effect on or relation to the selection from the other population. While this assumption is reasonable in many contexts, there are important scenarios where it is systematically violated by the nature of the research design or the inherent structure of the data.

Characteristics of Paired Observations

Paired sample procedures are appropriate when observations are linked by some underlying characteristic or relationship that creates dependencies between measurements. These dependencies can arise through several mechanisms:

Same Subject Measured Twice

The most common pairing scenario involves measuring the same individuals under two different conditions or at two different times. Examples include:

  • Before and after studies: Measuring patient blood pressure before and after treatment

  • Pre-test and post-test designs: Assessing student performance before and after an educational intervention

  • Crossover trials: Administering two different treatments to the same subjects with appropriate washout periods

In these designs, each subject serves as their own control, eliminating between-subject variability from the comparison.

Matched Subjects with Similar Characteristics

Pairing can also involve different subjects who are matched on characteristics that might otherwise introduce substantial variability into the comparison:

  • Twin studies: Comparing outcomes between twins who receive different treatments

  • Sibling pairs: Studying interventions within families to control for genetic and environmental factors

  • Matched pairs: Deliberately pairing subjects on age, gender, disease severity, or other relevant characteristics

Naturally Paired Materials or Units

Some research contexts involve natural pairing relationships:

  • Split-plot designs: Cutting material from the same source and applying different treatments to each piece

  • Bilateral measurements: Comparing left versus right measurements on the same subjects

  • Temporal dependencies: Measurements taken on consecutive days or under related conditions

The Statistical Rationale for Pairing

Pairing is particularly valuable when the characteristic that links the observations creates large variability that might otherwise obscure the treatment effect of interest. By controlling for these lurking variables through the pairing mechanism, we can isolate the effect of the treatment or intervention being studied.

Consider a study evaluating the effectiveness of a new pain medication. If we used an independent sample design, individual differences in pain tolerance, medical history, and baseline pain levels would contribute substantial noise to our comparison. However, by measuring the same patients before and after treatment (a paired design), these individual characteristics affect both measurements equally and are eliminated when we analyze the differences.

11.5.2. The Mathematical Foundation: Working with Differences

The key insight in paired sample procedures is to transform the two-sample comparison problem into a one-sample analysis problem by focusing on the differences between paired observations.

Parameter Definitions and Notation

For paired samples, we have observations \(X_{A1}, X_{A2}, \ldots, X_{An}\) from population A and \(X_{B1}, X_{B2}, \ldots, X_{Bn}\) from population B, where \(X_{Ai}\) is paired with \(X_{Bi}\) for each \(i\). The crucial aspect is that these observations are not independent across populations.

Rather than working with the original populations separately, we define the differences:

\[D_i = X_{Ai} - X_{Bi} \text{ for } i = 1, 2, \ldots, n\]

These differences \(D_1, D_2, \ldots, D_n\) form a new dataset that captures the essence of the comparison while eliminating the correlation structure between the original observations.

Population Parameters for Differences

The population parameter of primary interest becomes the mean difference:

\[\mu_D = E[D_i] = E[X_{Ai} - X_{Bi}] = \mu_A - \mu_B\]

This parameter represents the average difference between the two conditions or treatments in the population.

The variance of the differences involves a more complex relationship:

\[\sigma^2_D = \mathrm{Var}(D_i) = \mathrm{Var}(X_{Ai} - X_{Bi})\]

Using the properties of variance for correlated random variables:

\[\sigma^2_D = \sigma^2_A + \sigma^2_B - 2\sigma_A\sigma_B\rho_{AB},\]

where \(\rho_{AB}\) is the correlation between paired observations.

The Power of Avoiding Correlation Estimation

While we could theoretically estimate the individual population parameters and their correlation to construct our inference procedures, this approach would be unnecessarily complex and potentially imprecise. Instead, paired procedures take the elegant approach of working directly with the differences, treating them as a simple random sample from the population of differences with mean \(\mu_D\) and variance \(\sigma^2_D\).

This transformation eliminates the need to estimate or model the correlation structure, allowing us to apply the familiar one-sample t-procedures to the differences.
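To see this concretely, here is a small R sketch with made-up paired measurements: the sample variance of the differences automatically equals \(s_A^2 + s_B^2 - 2\,\widehat{\mathrm{cov}}(A,B)\), so the correlation never has to be estimated separately.

# Hypothetical paired measurements on the same six subjects
a <- c(12.1, 14.3, 11.8, 15.0, 13.2, 12.7)   # condition A
b <- c(10.9, 13.5, 11.2, 13.8, 12.5, 12.1)   # condition B

d <- a - b   # paired differences

# Variance computed directly from the differences ...
var(d)

# ... matches the correlated-variables formula, with no separate
# estimate of the correlation required
var(a) + var(b) - 2 * cov(a, b)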

11.5.3. Sample Statistics for Paired Data

Sample Mean Difference

The sample mean difference provides our point estimator for \(\mu_D\):

\[\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i\]

This estimator is unbiased: \(E[\bar{D}] = \mu_D\).

Sample Variance of Differences

The sample variance of differences is calculated using the standard formula applied to the difference values:

\[S^2_D = \frac{1}{n-1}\sum_{i=1}^{n} \left(D_i - \bar{D}\right)^2\]

This provides an unbiased estimator for \(\sigma^2_D\): \(E[S^2_D] = \sigma^2_D\).

Sampling Distribution of the Sample Mean Difference

Under the assumption that the differences are normally distributed (or that the Central Limit Theorem applies), the sample mean difference follows:

\[\bar{D} \sim N\!\left(\mu_D,\ \frac{\sigma^2_D}{n}\right)\]

When \(\sigma_D\) is unknown and estimated by \(S_D\), the standardized quantity

\[T = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} \sim t_{n-1}\]

forms the foundation for our inference procedures.
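A minimal R sketch of these sample statistics, using a hypothetical vector of differences:

# Hypothetical paired differences D_i
d <- c(1.2, -0.4, 0.8, 2.1, 0.3, 1.5, 0.9, -0.2)

n     <- length(d)
d_bar <- mean(d)         # sample mean difference, the point estimate of mu_D
s_d   <- sd(d)           # sample standard deviation of the differences
se    <- s_d / sqrt(n)   # estimated standard error of the sample mean difference

c(d_bar = d_bar, s_d = s_d, se = se)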

11.5.4. Assumptions for Paired Sample Procedures

Paired sample procedures require three fundamental assumptions that mirror those for one-sample procedures:

1. Independent Pairs

Each pair \((X_{Ai}, X_{Bi})\) must be independent of all other pairs \((X_{Aj}, X_{Bj})\) for \(i \neq j\). This means that while observations within a pair are dependent (which is the point of pairing), different pairs must be independently sampled.

2. Simple Random Sampling from the Population of Pairs

The pairs themselves must constitute a simple random sample from the broader population of potential pairs. This ensures that our sample is representative of the population we wish to make inferences about.

3. Normality of Differences

The differences \(D_i\) must be normally distributed, or the sample size must be large enough for the Central Limit Theorem to ensure that \(\bar{D}\) is approximately normally distributed. Note that this assumption concerns the distribution of differences, not the original observations.

This normality assumption is often more readily satisfied than normality of the original observations because the differencing process can reduce skewness when both original distributions are skewed in similar ways.
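In practice, the normality of the differences can be examined graphically and with a formal test; a brief sketch using hypothetical differences:

# Hypothetical paired differences
d <- c(1.2, -0.4, 0.8, 2.1, 0.3, 1.5, 0.9, -0.2)

# Graphical check: points close to the reference line suggest
# approximate normality of the differences
qqnorm(d)
qqline(d)

# Formal check: a small p-value indicates evidence against normality
shapiro.test(d)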

11.5.5. Hypothesis Testing for Paired Samples: The Four-Step Process

Paired sample hypothesis testing follows the same four-step framework as one-sample procedures, but with careful attention to defining the differences appropriately.

Step 1: Parameter Identification and Difference Definition

The parameter of interest is \(\mu_D\), the mean difference between paired observations. Critically, we must explicitly define how the difference is calculated, as this determines the direction of our alternative hypothesis.

For example, if studying a training program’s effectiveness, we might define \(D = X_{\text{pre}} - X_{\text{post}}\) (pre-training score minus post-training score).

The choice of difference direction should align with the research question and the anticipated direction of the effect.

Step 2: Hypothesis Formulation

Our hypotheses are formulated in terms of \(\mu_D\):

  • Null hypothesis: \(H_0: \mu_D = \Delta_0\) (typically \(\Delta_0 = 0\))

  • Alternative hypotheses, one of:

      – Two-sided: \(H_a: \mu_D \neq \Delta_0\)

      – Right-tailed: \(H_a: \mu_D > \Delta_0\)

      – Left-tailed: \(H_a: \mu_D < \Delta_0\)

The choice of alternative depends on the research question and how we defined the differences in Step 1.

Step 3: Test Statistic and P-Value Calculation

The test statistic follows the familiar one-sample t-test format:

\[t_{TS} = \frac{\bar{d} - \Delta_0}{s_d/\sqrt{n}},\]

where \(\bar{d}\) is the observed sample mean difference, \(s_d\) is the sample standard deviation of differences, and \(n\) is the number of pairs.

Under the null hypothesis and our stated assumptions, this test statistic follows a t-distribution with \(df = n - 1\) degrees of freedom.

P-value calculation follows the same patterns as one-sample procedures, where \(T \sim t_{n-1}\):

  • Two-sided: p-value = \(2P(T > |t_{TS}|)\)

  • Right-tailed: p-value = \(P(T > t_{TS})\)

  • Left-tailed: p-value = \(P(T < t_{TS})\)
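These tail probabilities can be computed with pt() once the test statistic is in hand; a sketch using placeholder summary values (d_bar, s_d, n, and Delta_0 below are hypothetical):

# Placeholder summary statistics (all values hypothetical)
d_bar   <- -1.5    # sample mean difference
s_d     <- 1.2     # sample standard deviation of the differences
n       <- 10      # number of pairs
Delta_0 <- 0       # null-hypothesis value

t_ts <- (d_bar - Delta_0) / (s_d / sqrt(n))   # test statistic
df   <- n - 1                                 # degrees of freedom

2 * pt(abs(t_ts), df, lower.tail = FALSE)     # two-sided p-value
pt(t_ts, df, lower.tail = FALSE)              # right-tailed p-value
pt(t_ts, df)                                  # left-tailed p-value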

Step 4: Decision and Conclusion

Compare the p-value to the predetermined significance level \(\alpha\) and draw conclusions in the context of the original research question, being careful to interpret results in terms of the mean difference as defined in Step 1.

11.5.6. Confidence Intervals for Paired Differences

Confidence intervals for the mean difference follow the standard one-sample format:

\[\bar{d} \pm t_{\alpha/2,\, n-1} \cdot \frac{s_d}{\sqrt{n}}\]

Confidence Bounds for One-Sided Alternatives

For one-sided alternatives, we construct confidence bounds rather than intervals:

Upper confidence bound (for \(H_a: \mu_D < \Delta_0\)):

\[\mu_D < \bar{d} + t_{\alpha,\, n-1} \cdot \frac{s_d}{\sqrt{n}}\]

Lower confidence bound (for \(H_a: \mu_D > \Delta_0\)):

\[\mu_D > \bar{d} - t_{\alpha,\, n-1} \cdot \frac{s_d}{\sqrt{n}}\]

These bounds provide ranges of plausible values for the true mean difference and can be used to assess both statistical and practical significance.
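A sketch of these calculations in R using qt(), with placeholder summary values (all numbers hypothetical):

# Placeholder summary statistics (all values hypothetical)
d_bar <- -1.5
s_d   <- 1.2
n     <- 10
alpha <- 0.05
se    <- s_d / sqrt(n)

# Two-sided 100(1 - alpha)% confidence interval for mu_D
d_bar + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se

# One-sided bounds for mu_D
d_bar + qt(1 - alpha, df = n - 1) * se   # upper confidence bound
d_bar - qt(1 - alpha, df = n - 1) * se   # lower confidence bound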

11.5.7. A Complete Example: Nursing Sensitivity Training Study

To illustrate the complete paired sample procedure, we analyze a study evaluating the effectiveness of sensitivity training for hospital nurses.

Study Design and Context

A regional hospital conducted a study to determine whether sensitivity training would improve the quality of care provided by their nursing staff. Eight nurses were selected, and their nursing skills were evaluated on a scale from 1 to 10, where higher scores indicate greater sensitivity to patients. After this initial assessment, the nurses participated in a training program, and their skills were evaluated again using the same scale.

Since each nurse serves as their own control (measured before and after training), this is clearly a paired design. The pairing controls for individual differences in baseline nursing ability, personality, experience, and other factors that might otherwise obscure the training effect.

Data Analysis

The data shows pre-training scores, post-training scores, and the calculated differences for each nurse:

Nurse   Pre-Training   Post-Training   Difference (Pre - Post)
1       2.56           4.54            -1.98
2       3.22           5.33            -2.11
3       3.45           4.32            -0.87
4       5.55           7.45            -1.90
5       5.63           7.00            -1.37
6       7.89           9.80            -1.91
7       7.66           7.33             0.33
8       6.20           6.80            -0.60

Summary statistics for the differences: \(\bar{d} = -1.30\), \(s_d = 0.8608\), \(n = 8\).

Step 1: Parameter Identification

The parameter of interest is \(\mu_D\), the true mean difference between pre-training and post-training nursing sensitivity scores, where \(D = X_{\text{pre}} - X_{\text{post}}\).

Since higher scores indicate better performance, we expect training to increase scores, making the differences (pre minus post) negative on average.

Step 2: Hypothesis Formulation

We want to test whether the training improves nursing skills on average:

  • \(H_0: \mu_D \geq 0\) (no improvement; pre-training scores are greater than or equal to post-training scores)

  • \(H_a: \mu_D < 0\) (improvement; pre-training scores are less than post-training scores)

This is a left-tailed test with \(\Delta_0 = 0\).

Step 3: Test Statistic and P-Value

The test statistic is:

\[t_{TS} = \frac{\bar{d} - \Delta_0}{s_d/\sqrt{n}} = \frac{-1.30 - 0}{0.8608/\sqrt{8}} \approx -4.2755,\]

with \(df = n - 1 = 7\) degrees of freedom.

For a left-tailed test, the p-value is:

p-value = \(P(T_7 < -4.2755) = 0.001838\)

Step 4: Decision and Conclusion

Since p-value = 0.001838 < \(\alpha = 0.01\), we reject the null hypothesis.

Conclusion: The data gives strong support (p-value = 0.001838) to the claim that average nursing sensitivity scores improved after training for the population of nurses at the regional hospital.

99% Upper Confidence Bound

Since we conducted a left-tailed test, the corresponding confidence bound is an upper bound:

Critical value: \(t_{0.01,7} = 2.998\)

Upper bound: \(\mu_D < -1.30 + 2.998 \times \frac{0.8608}{\sqrt{8}} = -1.30 + 0.913 = -0.387\)

We are 99% confident that the true mean difference between pre- and post-training scores is less than -0.387. Since this upper bound is negative, it confirms that the training program produces improvement (negative differences indicate post-training scores exceed pre-training scores).

11.5.8. Implementation in R

Method 1: Using t.test() with Paired Data

# Define the data
pre_training <- c(2.56, 3.22, 3.45, 5.55, 5.63, 7.89, 7.66, 6.20)
post_training <- c(4.54, 5.33, 4.32, 7.45, 7.00, 9.80, 7.33, 6.80)

# Perform paired t-test
t.test(pre_training, post_training,
       mu = 0,
       conf.level = 0.99,
       paired = TRUE,
       alternative = "less")

Method 2: Manual Difference Calculation

# Calculate differences manually
differences <- pre_training - post_training

# Perform one-sample t-test on differences
t.test(differences,
       mu = 0,
       conf.level = 0.99,
       alternative = "less")

Both methods produce identical results, demonstrating that paired procedures are simply one-sample procedures applied to the differences.
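The hand calculations from the previous section can also be reproduced from the summary statistics alone with pt() and qt(); small rounding differences are expected because \(\bar{d}\) and \(s_d\) are rounded:

# Summary statistics from the nursing example
d_bar <- -1.30
s_d   <- 0.8608
n     <- 8

t_ts <- (d_bar - 0) / (s_d / sqrt(n))          # test statistic
pt(t_ts, df = n - 1)                           # left-tailed p-value
d_bar + qt(0.99, df = n - 1) * s_d / sqrt(n)   # 99% upper confidence bound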

Key R Arguments for Paired Procedures

  • paired = TRUE: Specifies that observations should be treated as paired

  • alternative: Specifies the direction of the alternative hypothesis (“less”, “greater”, or “two.sided”)

  • mu: The null hypothesis value \(\Delta_0\) (typically 0)

  • conf.level: The confidence level for the corresponding confidence interval or bound
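It is often convenient to store the result of t.test() and extract its components for reporting; the fields below are standard parts of the returned htest object, and the snippet reuses the pre_training and post_training vectors defined above:

# Store the test result and extract its components
result <- t.test(pre_training, post_training,
                 mu = 0,
                 conf.level = 0.99,
                 paired = TRUE,
                 alternative = "less")

result$statistic   # t test statistic
result$p.value     # p-value
result$conf.int    # 99% confidence interval / bound
result$estimate    # sample mean difference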

11.5.9. When to Use Paired vs. Independent Procedures: A Decision Framework

The choice between paired and independent sample procedures is fundamental and depends on the study design and data structure. Making the wrong choice can lead to invalid inference or substantial loss of statistical power.

Use Paired Procedures When:

  1. Natural Pairing Exists: Observations are naturally linked through same subjects, matched characteristics, or other dependencies

  2. Large Lurking Variable Variance: There are important variables that create substantial variability but are controlled through pairing

  3. Pairing is Feasible: The study design allows for meaningful pairing without compromising other aspects of the research

  4. Power Considerations: When individual differences are large relative to treatment effects, pairing can substantially increase statistical power

Use Independent Procedures When:

  1. No Natural Pairing: Observations come from distinct, unrelated populations with no meaningful way to create pairs

  2. Independence is Maintained: The sampling process for one group has no relationship to the sampling for the other group

  3. Large Sample Sizes: When sample sizes are large enough that the loss of efficiency from not pairing is acceptable

  4. Pairing is Inappropriate: When forcing artificial pairing would introduce bias or compromise the research design

Advantages of Paired Designs

Statistical Advantages:

  • Increased Power: By controlling for individual variability, paired designs often have greater power to detect true differences

  • Improved Precision: Standard errors are typically smaller when pairing is effective

  • Reduced Sample Size Requirements: The same statistical power can often be achieved with fewer subjects

Practical Advantages:

  • Cost Efficiency: Measuring the same subjects twice can be more economical than recruiting separate groups

  • Ethical Considerations: In medical research, paired designs may be preferred when withholding treatment from a control group raises ethical concerns

Disadvantages of Paired Designs

Statistical Limitations:

  • Dependency Requirements: Observations must be meaningfully paired, which is not always possible or appropriate

  • Carryover Effects: In crossover designs, treatment effects from the first period may influence the second period

  • Missing Data Complications: If one member of a pair is lost, the entire pair must typically be excluded from analysis

Practical Limitations:

  • Logistical Complexity: Coordinating paired measurements can be more complex than independent sampling

  • Time Constraints: Longitudinal paired designs require extended follow-up periods

  • Subject Dropout: Higher risk of losing data when the same subjects must be measured multiple times

11.5.10. The Relationship to One-Sample Procedures

Paired sample procedures demonstrate a fundamental principle in statistics: complex problems can often be reduced to simpler, well-understood methods through appropriate data transformation. By working with differences rather than original observations, we transform the two-sample paired problem into a one-sample problem about the mean difference.

This reduction allows us to leverage all the theory and methods we developed for one-sample procedures:

  • Test statistics follow the same t-distribution under the null hypothesis

  • Confidence intervals use the same pivotal quantity approach

  • Assumption checking focuses on the normality of differences rather than original observations

  • Effect size calculations can be applied directly to the differences

Connection to Independent Sample Procedures

Interestingly, when the correlation between paired observations is zero (\(\rho_{AB} = 0\)), the variance of differences becomes \(\sigma^2_D = \sigma^2_A + \sigma^2_B\), which is exactly what we would use in independent sample procedures. In this case, pairing provides no advantage and may actually reduce power slightly due to the smaller degrees of freedom (\(n-1\) instead of \(n_A + n_B - 2\)).

However, when correlation is positive (which is typical in well-designed paired studies), the variance of differences is reduced: \(\sigma^2_D = \sigma^2_A + \sigma^2_B - 2\sigma_A\sigma_B\rho_{AB} < \sigma^2_A + \sigma^2_B\), leading to more precise inference.
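A small simulation sketch illustrates this variance reduction; the shared subject effect below induces a strong positive within-pair correlation (all parameter values are arbitrary choices):

# Simulated paired data with a shared subject effect (arbitrary settings)
set.seed(1)
n_pairs <- 10000

subject <- rnorm(n_pairs, mean = 50, sd = 10)      # subject-to-subject variability
a <- subject + rnorm(n_pairs, mean = 0, sd = 3)    # condition A
b <- subject + rnorm(n_pairs, mean = 2, sd = 3)    # condition B

cor(a, b)         # strongly positive due to the shared subject effect
var(a - b)        # variance of the differences, roughly 3^2 + 3^2 = 18
var(a) + var(b)   # far larger: what an unpaired comparison must contend with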

11.5.11. Looking Forward: Extensions and Applications

The principles developed in paired sample procedures extend to more complex experimental designs and provide the foundation for understanding repeated measures analysis and longitudinal data methods. The key insight—that we can often simplify complex dependency structures by focusing on appropriate transformations of the data—appears throughout advanced statistical methodology.

In the context of our course progression, paired procedures complete our toolkit for comparing two population means. We now have methods that work whether populations are independent or dependent, whether variances are equal or unequal, and whether we want to make strong assumptions for efficiency or robust assumptions for broad applicability.

Key Takeaways 📝

  1. Paired procedures apply when observations are linked through natural pairing relationships that violate the independence assumption of two-sample procedures.

  2. The key transformation is working with differences \(D_i = X_{Ai} - X_{Bi}\), which reduces the two-sample problem to a familiar one-sample analysis.

  3. All inference focuses on the mean difference \(\mu_D = \mu_A - \mu_B\), using test statistics of the form \(t_{TS} = \frac{\bar{d} - \Delta_0}{s_d/\sqrt{n}}\) with \(n-1\) degrees of freedom.

  4. Proper difference definition is crucial – the direction of subtraction must align with the research question and alternative hypothesis formulation.

  5. Pairing controls for individual variability that might otherwise obscure treatment effects, often leading to more powerful statistical tests.

  6. Three key assumptions must be satisfied: independent pairs, simple random sampling from the population of pairs, and normality of differences.

  7. R implementation uses t.test() with paired = TRUE or equivalent one-sample analysis of manually calculated differences.

  8. The choice between paired and independent procedures depends on study design, the presence of natural pairing relationships, and the magnitude of individual variability relative to treatment effects.

Exercises

  1. Design Recognition: For each scenario below, determine whether a paired or independent sample design is most appropriate and explain your reasoning:

    1. Comparing blood pressure medications by randomly assigning patients to receive either drug A or drug B

    2. Evaluating a weight-loss program by measuring participants before and after the intervention

    3. Studying gender differences in mathematical ability using standardized test scores

    4. Comparing two teaching methods using identical twins, where one twin receives method A and the other receives method B

    5. Assessing the effectiveness of a new sleep aid by measuring sleep quality before and after treatment

  2. Difference Definition Impact: A researcher studies whether a new exercise program improves cardiovascular fitness, measured by time to complete a standard fitness test (in minutes, where lower times indicate better fitness).

    1. If the difference is defined as \(D = \text{Time}_{\text{before}} - \text{Time}_{\text{after}}\), write appropriate hypotheses to test whether the program improves fitness

    2. If the difference is defined as \(D = \text{Time}_{\text{after}} - \text{Time}_{\text{before}}\), write appropriate hypotheses for the same research question

    3. Explain how the choice of difference definition affects the interpretation of results

  3. Complete Analysis: Eight patients with chronic pain participated in a study of acupuncture therapy. Their pain levels were measured on a 10-point scale before and after a series of acupuncture treatments:

    Patient   Before   After   Difference
    1         8.2      6.1
    2         7.5      5.8
    3         9.1      7.2
    4         6.8      6.0
    5         8.9      5.5
    6         7.2      6.8
    7         8.5      6.9
    8         7.8      6.2

    1. Calculate the differences and summary statistics

    2. Test whether acupuncture reduces pain levels on average (use \(\alpha = 0.05\))

    3. Construct a 95% confidence interval for the mean reduction in pain

    4. Interpret your results in the context of the study

  4. Assumption Checking: Explain what assumptions must be verified for paired sample procedures and describe how you would check each assumption with a small sample like the acupuncture study above.

  5. Power Comparison: Explain why paired designs often have higher statistical power than independent sample designs for detecting treatment effects. Under what circumstances might an independent design be preferred despite this power advantage?

  6. R Implementation: Write R code to analyze the acupuncture data from Exercise 3 using both the paired t.test() approach and the manual difference calculation approach. Verify that both methods produce identical results.