10.3. Connecting CI and HT; t-Test for μ When σ Is Unknown

Hypothesis testing and confidence regions are complementary tools that address essentially the same question from two different perspectives. When certain conditions are carefully matched, a confidence region and a hypothesis test yield outcomes that carry direct implications for one another.

We also discuss how to extend hypothesis testing to the case where the population standard deviation is unknown. As with confidence regions, we employ the \(t\)-distribution to construct a testing procedure that accounts for the added uncertainty.

Road Map 🧭

  • Understand the underlying connection between confidence regions and hypothesis testing. For a given confidence region, identify its complementary hypothesis testing scenario, and vice versa.

  • Use the \(t\)-distribution to construct hypothesis tests when the population standard deviation \(\sigma\) is unknown.

10.3.1. The Duality of Hypothesis Tests and Confidence Regions

Two Perspectives on the Same Question

We begin our discussion by comparing the key components of hypothesis testing and confidence regions:

| Inference Component | Confidence region for \(\mu\) | Hypothesis testing on \(\mu\) |
| --- | --- | --- |
| Parameter of interest | A population mean, \(\mu\) | A population mean, \(\mu\) |
| Question | “What parameter values are consistent with the sample?” | “Is this specific parameter value consistent with the sample?” |
| How inference strength is conveyed | A large \(C\) | A small \(\alpha\) |
| Sample information used | \(\bar{X}\) and its approximate normality due to the CLT | \(\bar{X}\) and its approximate normality due to the CLT |
| Outcome | A range of plausible values | An answer to whether a candidate value (\(\mu_0\)) is plausible |

It is evident from the summary above that confidence intervals and hypothesis tests address the same fundamental question from different angles.

The mathematical connection becomes even deeper when the two inference methods are matched by their inferential strengths and sidedness. Under this pairing, the two methods are in fact equivalent: the result of one has direct implications for the result of the other.

Confidence Intervals and Two-sided Hypothesis Tests

Consider the \(C \cdot 100 \%\) confidence interval for a population mean. With \(\alpha=1-C\), we use the formula:

\[\left(\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right).\]

Let us now perform a two-sided hypothesis test on a candidate value \(\mu_0\) using the significance level \(\alpha = 1-C\). The hypotheses are:

\[\begin{split}&H_0: \mu = \mu_0\\ &H_a: \mu \neq \mu_0\end{split}\]

Based on the cutoff method, we reject the null hypothesis if:

\[\bar{x} \geq z_{\alpha/2}\frac{\sigma}{\sqrt{n}} + \mu_0 \quad \text{ or } \quad \bar{x} \leq -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} + \mu_0.\]

Isolate \(\mu_0\) in both inequalities. Then the null hypothesis is rejected when:

\[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \geq \mu_0 \quad \text{ or } \quad \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \mu_0.\]

Note that the left-hand sides of these inequalities are exactly the two endpoints of the confidence interval.

The Connection

The null hypothesis of a two-tailed test is rejected exactly when the null value \(\mu_0\) falls outside the matching confidence interval. Conversely, if the chosen \(\mu_0\) lies inside the confidence interval, we do not reject the null hypothesis for that test.
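This equivalence is easy to check numerically. The short R sketch below uses made-up values for \(\bar{x}\), \(\sigma\), \(n\), \(\mu_0\), and \(\alpha\) (they are not from any example in this section); under the matching conditions, the confidence-interval decision and the two-sided test decision always come out as opposites.

# Hypothetical values, for illustration only
xbar  <- 52.3   # observed sample mean
sigma <- 8      # known population standard deviation
n     <- 25
mu0   <- 50     # null value
alpha <- 0.05

z_crit <- qnorm(alpha / 2, lower.tail = FALSE)
me     <- z_crit * sigma / sqrt(n)

# Confidence interval decision: is mu0 inside (xbar - me, xbar + me)?
inside_ci <- (mu0 > xbar - me) & (mu0 < xbar + me)

# Two-sided z-test decision: is |z_TS| at least the critical value?
z_ts   <- (xbar - mu0) / (sigma / sqrt(n))
reject <- abs(z_ts) >= z_crit

inside_ci   # TRUE
reject      # FALSE; always the opposite of inside_ci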

Lower Confidence Bounds and Upper-Tailed Tests

The \(C\cdot 100 \%\) lower confidence bound for the population mean \(\mu\) is:

\[\left(\bar{x} - z_{\alpha} \frac{\sigma}{\sqrt{n}}, \quad \infty\right).\]

When performing an upper-tailed test:

\[\begin{split}&H_0: \mu \leq \mu_0\\ &H_a: \mu > \mu_0,\end{split}\]

we reject the null hypothesis if \(\bar{x} \geq z_\alpha\frac{\sigma}{\sqrt{n}} + \mu_0\). By isolating \(\mu_0\), the inequality becomes:

\[\bar{x} - z_\alpha\frac{\sigma}{\sqrt{n}} \geq \mu_0.\]

The Connection

Given \(\alpha = 1-C\), the null hypothesis of an upper-tailed test is rejected if and only if the null value \(\mu_0\) falls outside the confidence region defined by the lower confidence bound.

Upper Confidence Bounds and Lower-Tailed Tests

The \(C\cdot 100 \%\) upper confidence bound for the population mean \(\mu\) is:

\[\left(-\infty, \quad \bar{x} + z_{\alpha} \frac{\sigma}{\sqrt{n}} \right).\]

For a lower-tailed hypothesis test with

\[\begin{split}&H_0: \mu \geq \mu_0\\ &H_a: \mu < \mu_0,\end{split}\]

the null hypothesis is rejected if \(\bar{x} \leq -z_\alpha\frac{\sigma}{\sqrt{n}} + \mu_0\). By isolating \(\mu_0\), the inequality becomes:

\[\bar{x} + z_\alpha\frac{\sigma}{\sqrt{n}} \leq \mu_0.\]

The Connection

As expected, the values of \(\mu_0\) that lead to rejection of the null hypothesis in a lower-tailed test coincide with those that fall outside the upper confidence bound of a matching confidence level, \(C= 1-\alpha\).
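The one-sided pairings can be checked numerically in the same way. The sketch below reuses the same hypothetical values as before (not from any example in this section) and verifies the upper-tailed test against the lower confidence bound.

# Hypothetical values, for illustration only
xbar  <- 52.3
sigma <- 8
n     <- 25
mu0   <- 50
alpha <- 0.05

z_crit <- qnorm(alpha, lower.tail = FALSE)
lcb    <- xbar - z_crit * sigma / sqrt(n)   # lower confidence bound: (lcb, Inf)

inside_region <- mu0 > lcb                              # is mu0 inside (lcb, Inf)?
reject <- (xbar - mu0) / (sigma / sqrt(n)) >= z_crit    # upper-tailed z-test decision

inside_region   # TRUE
reject          # FALSE; again the opposite of inside_region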

Summary

For the duality to work, we need three crucial conditions:

  1. The two inference methods are being applied to the same experimental result.

  2. \(C + \alpha = 1\).

  3. The methods are paired correctly based on sidedness (two-sided test and CI, upper-tailed test and LCB, lower-tailed test and UCB).

When these conditions hold,

  • \(\mu_0\) lies inside the confidence region \(\iff\) fail to reject \(H_0\)

  • \(\mu_0\) lies outside the confidence region \(\iff\) reject \(H_0\)

Example 💡: Quality Control for Cherry Tomatoes

Tom Green oversees quality control for a large produce company. The weights of cherry tomato packages are known to be normally distributed with \(\mu=227\) g (1/2 lb) and \(\sigma=5\) g. He obtains a simple random sample of four packages of cherry tomatoes and discovers that their average weight is 222 g.

  1. Construct a 95% confidence interval for the mean weight.

  2. Tom would like to test whether the true mean weight of the packages differs from 227 g, using \(\alpha = 0.05\). Based on the result of part 1 (do not compute the test statistic or the \(p\)-value), predict whether the null hypothesis will be rejected.

  3. Perform the hypothesis test and confirm your answer from part 2.

Q1: Construct the 95% Confidence Interval

Since \(\sigma\) is known and the population is normally distributed, we use the \(z\)-procedure. In general,

\[\left(\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right).\]

For 95% confidence, \(\alpha = 0.05\). The critical value \(z_{0.025}\) can be found using:

z_critical <- qnorm(0.025, lower.tail = FALSE)
z_critical
# [1] 1.959964

Calculate the interval:

\[\left(222 - (1.96)\frac{5}{\sqrt{4}}, 222 + (1.96) \frac{5}{\sqrt{4}}\right) = (217.1, 226.9)\]

We are 95% confident that the true mean weight of cherry tomato packages is captured between 217.1 and 226.9 grams.
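As a quick check, the interval endpoints can also be computed in R, reusing z_critical from the block above together with the values given in the example:

xbar  <- 222
sigma <- 5
n     <- 4

lower <- xbar - z_critical * sigma / sqrt(n)
upper <- xbar + z_critical * sigma / sqrt(n)
round(c(lower, upper), 1)
# [1] 217.1 226.9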

Q2: Use Duality to Predict the Conclusion for the Hypothesis Test

We want to test:

\[\begin{split}&H_0: \mu = 227\\ &H_a: \mu \neq 227\end{split}\]

Since \(C + \alpha = 0.95 + 0.05 = 1\) and both the confidence region and the hypothesis test are two-sided, the duality relationship applies. The null value \(\mu_0 = 227\) lies outside the 95% confidence interval \((217.1, 226.9)\), so 227 is NOT a plausible value for the true mean weight according to the CI. By duality, we predict that the hypothesis test will reject the null hypothesis.

Q3: Verify with Formal Hypothesis Test

Let’s confirm this conclusion by performing a z-test for the hypothesis pair.

\[z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{222 - 227}{5/\sqrt{4}} = \frac{-5}{2.5} = -2.0\]

The p-value is:

z_test_stat <- -2.0
p_value <- 2 * pnorm(abs(z_test_stat), lower.tail = FALSE)
p_value
# [1] 0.04550026

Since p-value = \(0.0455 < \alpha = 0.05\), we reject the null hypothesis. Both approaches give the same conclusion. This confirms the duality relationship.

10.3.2. \(t\)-Tests: When \(\sigma\) Is Unknown

So far, we have been building our test procedures based on the convenient assumption that the population standard deviation is known. If we do not know the population mean \(\mu\), however, we almost certainly do not know \(\sigma\), either.

In such cases, we take the natural step of replacing the unknown \(\sigma\) with its estimator \(S\). The sample standard deviation \(S\) is itself a random variable that varies from sample to sample, and this extra variability must be accounted for.

The Assumptions

For the new test procedure, we use a slightly modified set of assumptions:

  1. \(X_1, X_2, \cdots, X_n\) form an iid sample from the population \(X\) with mean \(\mu\) and variance \(\sigma^2\).

  2. Either the population \(X\) is normally distributed, or the sample size \(n\) is sufficiently large for the CLT to hold.

  3. The population variance \(\sigma^2\) is unknown.

The only difference from the \(z\)-test setting is that the population variance (and therefore the standard deviation) is unknown.

The \(t\)-Test Statistic

Recall that when \(\sigma\) was known, the \(z\)-test statistic

\[Z_{TS} = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\]

played a key role. We used it to

  • measure the standardized discrepancy of the data from the null assumption,

  • compare it with a \(z\)-critical value and draw a conclusion using the cutoff method, and

  • compute the tail probability of the observed \(z_{TS}\) and draw a conclusion in the \(p\)-value method.

We obtain a new test statistic by replacing the unknown \(\sigma\) with the sample standard deviation, \(S\):

\[T_{TS} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}\]

This new test statistic is called the \(t\)-test statistic. When the null hypothesis holds, the \(t\)-test statistic has a \(t\)-distribution with \(\nu = n-1\) degrees of freedom.

The \(t\)-test statistic plays the same roles as the \(z\)-test statistic, but we must account for the change in distribution by referencing the appropriate \(t\)-distribution rather than the standard normal distribution when computing critical values and \(p\)-values.
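As a quick illustration, the statistic can be computed directly from a sample. The data vector and null value below are hypothetical, not from any example in this section.

# Hypothetical sample and null value, for illustration only
x   <- c(9.8, 10.4, 10.1, 9.6, 10.9, 10.2)
mu0 <- 10

xbar <- mean(x)
s    <- sd(x)
n    <- length(x)

t_ts <- (xbar - mu0) / (s / sqrt(n))   # referenced against a t-distribution with n - 1 df
t_ts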

Cutoff Method for \(t\)-Tests

Recall that the cutoff method rejects the null hypothesis if the observed test statistic falls in a region that is too unusual for the null hypothesis.

For an upper-tailed \(t\)-test, the null hypothesis would be rejected if the observed sample mean is much higher than the null value \(\mu_0\) and satisfies:

\[t_{TS} = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} > t_{\alpha, n-1},\]

where \(t_{\alpha, n-1}\) is the appropriate \(t\)-critical value.

Likewise, the rejection rule for a lower-tailed test is:

\[t_{TS} = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} < -t_{\alpha, n-1}.\]

Finally, for a two-tailed test,

\[|t_{TS}| = \left|\frac{\bar{x}-\mu_0}{s/\sqrt{n}}\right| > t_{\alpha/2, n-1}.\]
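The \(t\)-critical values in these rules can be obtained in R with qt(). The significance level and sample size below are illustrative choices, not tied to a specific example.

alpha <- 0.05   # illustrative significance level
n     <- 12     # illustrative sample size

qt(alpha, df = n - 1, lower.tail = FALSE)       # t_{alpha, n-1} for one-tailed tests
# [1] 1.795885
qt(alpha / 2, df = n - 1, lower.tail = FALSE)   # t_{alpha/2, n-1} for a two-tailed test (about 2.201)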

\(p\)-Values for \(t\)-Tests


Upper-tailed p-value

\[P(T_{n-1} \geq t_{TS})\]
# xbar, s, n, and mu0 are assumed to have been computed from the data beforehand
tts <- (xbar-mu0)/(s/sqrt(n))
pt(tts, df=n-1, lower.tail=FALSE)

Lower-tailed p-value

\[P(T_{n-1} \leq t_{TS})\]
pt(tts, df=n-1)

Two-tailed p-value

\[2P(T_{n-1} \leq -|t_{TS}|) \quad \text{ or } \quad 2P(T_{n-1} \geq |t_{TS}|)\]
2 * pt(-abs(tts), df=n-1)
2 * pt(abs(tts), df=n-1, lower.tail=FALSE)

The Rejection Rule Remains Unchanged

Once a \(p\)-value is computed, it is compared against a pre-specified significance level \(\alpha\). If the \(p\)-value is less than \(\alpha\), the null hypothesis is rejected.

Example 💡: Radon Detector Accuracy

University researchers want to find out whether their radon detectors are working correctly. They collected a random sample of 12 detectors and placed them in a chamber exposed to exactly 105 picocuries per liter of radon. If the detectors work properly, their measurements should be close to 105 on average.

The Measurements (in picocuries per liter)

91.9, 97.8, 111.4, 122.3, 105.4, 95.0, 103.8, 99.6, 119.3, 104.8, 101.7, 96.6

In addition, suppose that the population distribution is known to be normal. Perform a hypothesis test with the significance level \(\alpha=0.1\).

Step 0: Which Procedure?

The experiment uses a random sample from a normally distributed population, so we are justified in using an inference method that assumes (approximate) normality of the sample mean. Since the population standard deviation is unknown, we use the \(t\)-test procedure.

Step 1: Define the Parameter

Let \(\mu\) denote the true mean of the measurements produced by the detectors in a chamber with exactly 105 picocuries per liter of radon.

Step 2: State the Hypotheses

\[\begin{split}&H_0: \mu = 105\\ &H_a: \mu \neq 105\end{split}\]

Step 3: Calculate the Observed Test Statistic and the p-value

Components:

  • \(n=12\)

  • \(\bar{x}=104.1333\)

  • \(s = 9.397421\)

  • \(df = n-1 = 11\)

The sample mean and the sample standard deviation can be computed from the data set.
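In R, entering the measurements listed above as a vector gives the quantities reported in the bullet list:

radon <- c(91.9, 97.8, 111.4, 122.3, 105.4, 95.0,
           103.8, 99.6, 119.3, 104.8, 101.7, 96.6)
mean(radon)
# [1] 104.1333
sd(radon)
# [1] 9.397421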

\[t_{TS} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{104.1333 - 105}{9.397421/\sqrt{12}} = \frac{-0.8667}{2.7136} = -0.319\]

The \(p\)-value is computed using R:

t_test_stat <- (104.1333 - 105) / (9.397421 / sqrt(12))   # about -0.319
p_value <- 2 * pt(abs(t_test_stat), df = 11, lower.tail = FALSE)
round(p_value, 3)
# [1] 0.755

Step 4: Make the Decision and Write the Conclusion

Since the \(p\)-value \(= 0.755 > \alpha = 0.10\), we fail to reject the null hypothesis. At the significance level \(\alpha=0.1\), we do not have enough evidence to reject the null hypothesis that the true mean measurement is 105 picocuries per liter.

🤔 Why Such a Large P-Value?

Although the individual measurements in the data set seem quite inaccurate, we failed to reject the null hypothesis with a large \(p\)-value of \(0.755\). Several factors contribute:

  1. The sample mean (\(\bar{x} = 104.1\)) is very close to the null value (\(\mu_0 = 105.0\)).

  2. The sample size of 12 is small and limits precision.

  3. The sample standard deviation (\(s=9.4\)) is relatively large.

  4. We are performing a two-sided test, which pushes the rejection threshold farther into the tails than a one-sided test would.

This example illustrates why, in hypothesis testing, we say we “fail to reject” rather than “accept” the null hypothesis. The absence of evidence against the null does not necessarily constitute evidence in its favor.

When there is a large degree of uncertainty, hypothesis tests become more conservative, meaning the evidence must be stronger before the null hypothesis is rejected.

Duality Revisited for \(t\)-Procedures

The duality relationship established for \(z\)-procedures carries over directly to \(t\)-procedures. A confidence region and a hypothesis test based on \(t\)-distributions are equivalent if:

  1. The two inference methods are being applied to the same experimental result.

  2. \(C + \alpha = 1\).

  3. The methods are paired correctly based on sidedness (two-sided test and CI, upper-tailed test and LCB, lower-tailed test and UCB).

Example 💡: Complementary CI for Radon Detector Accuracy

Compute the 90% confidence interval for the Radon Detector experiment and comment on its consistency with the hypothesis test.

A \(t\)-confidence interval is, in general,

\[\left(\bar{x} - t_{\alpha/2, n-1}\frac{s}{\sqrt{n}}, \quad \bar{x} + t_{\alpha/2, n-1}\frac{s}{\sqrt{n}}\right).\]

From the previous example, we have:

  • \(\bar{x} =104.1333\)

  • \(s = 9.397421\)

  • \(n = 12\)

The t-critical value \(t_{\alpha/2, n-1}\) is computed using R:

alpha <- 0.10
t_critical <- qt(alpha/2, df = 11, lower.tail = FALSE)
t_critical
# [1] 1.795885

Substituting these values into the general formula, the 90% confidence interval is:

\[(99.3, 109.0).\]
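The endpoints can also be computed directly in R, reusing t_critical from the block above together with the sample quantities from the previous example:

xbar <- 104.1333
s    <- 9.397421
n    <- 12

lower <- xbar - t_critical * s / sqrt(n)
upper <- xbar + t_critical * s / sqrt(n)
round(c(lower, upper), 1)
# [1]  99.3 109.0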

Since \(\mu_0 = 105\) lies within this interval, the duality principle tells us we should fail to reject \(H_0\), which matches our hypothesis test conclusion.

Single-call Verification Using t.test

When the raw data set is available, the R command t.test produces both inference results simultaneously.

radon <- c(91.9, 97.8, 111.4, 122.3, 105.4, 95.0,
           103.8, 99.6, 119.3, 104.8, 101.7, 96.6)

t.test(radon,
       mu = 105,
       alternative = "two.sided",
       conf.level=0.9)

# Output:
#
#         One Sample t-test
#
# data:  radon
# t = -0.31947, df = 11, p-value = 0.7554
# alternative hypothesis: true mean is not equal to 105
# 90 percent confidence interval:
#   99.26145 109.00521
# sample estimates:
# mean of x
#  104.1333

The \(t\)-statistic, \(p\)-value, and confidence interval match the hand calculations—always a good final check.

10.3.3. \(t\)-Procedures vs. \(z\)-Procedures

We learned in Section 9.5.4 that for any given significance level, the \(t\)-critical value decreases as \(n\) (and therefore the degrees of freedom) increases. This also means that

\[t_{\alpha, n-1} > z_\alpha\]

for any finite \(n\), since the standard normal distribution can be viewed as a \(t\)-distribution with “infinite” degrees of freedom. As a result, \(t\)-based confidence regions are wider on average than \(z\)-based regions.

In hypothesis tests, if the observed test statistic is held constant, its p-value is larger in a \(t\)-test than in a \(z\)-test because the tails of a \(t\)-distribution are heavier than those of the standard normal. This makes it more difficult for a \(t\)-test to reject the null hypothesis.
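The difference is easy to see in R. The observed statistic and sample size below are illustrative choices; for the same statistic, the two-sided \(t\) \(p\)-value always exceeds the two-sided \(z\) \(p\)-value.

stat <- 2.0   # illustrative observed test statistic
n    <- 12    # illustrative sample size

p_t <- 2 * pt(abs(stat), df = n - 1, lower.tail = FALSE)   # about 0.07
p_z <- 2 * pnorm(abs(stat), lower.tail = FALSE)            # about 0.0455
p_t > p_z
# [1] TRUE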

The trend is consistent: in the presence of added uncertainty, both inference methods become more conservative—more cautious in labeling an experimental result as unusual. The confidence region widens, and the test becomes more reluctant to reject the status quo.

10.3.4. Bringing It All Together

Key Takeaways 📝

  1. Hypothesis tests and confidence regions are dual procedures that address the same questions from different perspectives, connected by the relationship \(C + \alpha = 1\). Further,

    • Two-sided hypothesis tests pair with confidence intervals.

    • Upper-tailed tests pair with lower confidence bounds.

    • Lower-tailed tests pair with upper confidence bounds.

  2. When the population standard deviation \(\sigma\) is unknown, \(t\)-tests are used instead of \(z\)-tests.

  3. \(t\)-procedures generally produce more conservative inference results than the corresponding \(z\)-procedures; the confidence regions are wider, and it is more difficult to reject the null hypothesis.

Exercises

  1. One-Sided Test: A manufacturer wants to show that their batteries last more than 20 hours on average. With a random sample of 12 batteries, they obtain \(\bar{x} = 22.1\) hours and \(s = 3.5\) hours.

    1. Perform the appropriate hypothesis test at \(\alpha = 0.01\).

    2. What type of confidence region is appropriate for this context? Compute the appropriate 99% confidence region and confirm that it aligns with the conclusion of the hypothesis test.

  2. Sample Size Impact: Explain why a \(t\)-test with \(n = 5\) requires a larger test statistic to reject \(H_0\) than a \(z\)-test with the same data and significance level. What does this say about our confidence in conclusions from small samples?

  3. Cherry Tomato Follow-up: In the cherry tomato example, suppose \(\sigma\) is in fact unknown and estimated as \(s = 5\) grams from the sample of 4 packages. Rework the entire analysis using \(t\)-procedures and compare your conclusions to the original \(z\)-procedure results.

  4. Critical Thinking: A study reports “no significant difference” with \(p = 0.12\) and \(n = 8.\) The researcher concludes the null hypothesis is true. Identify at least three problems with this reasoning and suggest better ways to interpret the results.