t-Test of Different Types

In this course, we’ve covered three distinct types of t-tests: one-sample, two-sample independent, and two-sample paired. Each serves a distinct purpose in hypothesis testing, though all operate under the same guiding principles and use the same t.test() function. However, the choice of data format, diagnostic plots, and arguments depends on the specific inference procedure.

We previously discussed the one-sample t-test in Computer Assignment 6. In this tutorial, we focus on the two-sample independent test and the two-sample paired test.

t Procedures for Two Independent Samples

In a two-sample independent t-test, we aim to compare the means of a quantitative variable between two distinct groups. These groups must be independent, implying that the measurements in one group do not affect the measurements in the other group. For conducting such an analysis, you have two main approaches:

  1. Two Vectors of Data: Each vector represents the measurements from one of the groups. This method requires manually separating your data based on group membership before conducting the test.

  2. A Data Frame with a Factor Variable: This approach leverages a factor variable in your data frame that identifies the group membership for each observation, effectively categorizing your continuous variable of interest into two groups based on this factor.

Given its simplicity and direct alignment with R’s formula interface used in various other statistical functions, we will focus on the second approach. This method enhances readability and efficiency, particularly when your data is already organized within a single data frame.
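The two approaches can be compared side by side. The following is a minimal sketch using a small made-up data frame (the names `scores`, `value`, and `group` are hypothetical and not part of the course data); it suggests that both call styles lead to the same test.

```r
# Hypothetical toy data, used only to illustrate the two call styles
scores <- data.frame(
  value = c(5.1, 4.8, 6.0, 5.5, 7.2, 6.9, 7.5, 6.4),
  group = factor(rep(c("A", "B"), each = 4))
)

# Approach 1: two vectors, separated by group membership beforehand
a_values <- scores$value[scores$group == "A"]
b_values <- scores$value[scores$group == "B"]
res_vectors <- t.test(a_values, b_values)

# Approach 2: formula interface, response ~ grouping factor
res_formula <- t.test(value ~ group, data = scores)

# Both calls run the same Welch two-sample test, so the results agree
all.equal(unname(res_vectors$statistic), unname(res_formula$statistic))
all.equal(res_vectors$p.value, res_formula$p.value)
```

With the formula interface the difference is taken in the order of the factor levels (here A minus B), which matches the order of the two vectors in the first call.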

Combining Data Categories with ifelse

When a grouping variable has more levels than your analysis needs, as can happen in a two-sample procedure or ANOVA, you might find it useful to combine several categories into one or to recode the ones that aren’t relevant to your analysis. This can be especially helpful when working with datasets containing numerous categories. R provides several ways to accomplish this task, with ifelse being one of the most straightforward approaches.

The ifelse function in R is a vectorized conditional function that allows you to replace values in a vector based on a condition. The syntax of ifelse is as follows:

ifelse(test_expression, yes, no)
  • test_expression is the logical condition based on which replacements are performed.
  • yes is the value to return if test_expression is TRUE.
  • no is the value to return if test_expression is FALSE.
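As a quick illustration of the syntax, here is a minimal sketch on a made-up numeric vector (`temps` and `temp_label` are hypothetical names chosen for this example only):

```r
# Hypothetical values, used only to show how ifelse works element by element
temps <- c(12, 25, 31, 18)

# The condition is checked for every element; "warm" is returned where it is
# TRUE and "cool" where it is FALSE
temp_label <- ifelse(temps > 20, "warm", "cool")
temp_label
```

The result has the same length as the input, with one label per element.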

We’ll create an illustrative example to show how ifelse can be used in practical scenarios.

Example: Movie Profitability Statistics

(Data Set: movies.csv) In the film industry, understanding the financial performance of movies through different lenses, such as audience ratings, is crucial for stakeholders. This understanding helps in tailoring future productions to meet audience expectations and optimize profitability. The movies.csv dataset provides a snapshot of various movies’ profitability metrics, including LOpening, which represents the log-transformed revenue from the opening weekend per theater. This transformation helps in normalizing revenue data, making it more amenable to statistical analysis.

Our analysis focuses on exploring how movies rated for different audiences—‘R’ for adults and a combined ‘Family’ category including ‘PG’ and ‘PG-13’ ratings—fare in terms of their opening weekend revenue per theater.

To compare these groups effectively, we first redefine our movie ratings into two distinct categories: ‘Adult’ for ‘R’ rated movies and ‘Family’ for movies rated ‘PG’ and ‘PG-13’. This re-categorization is captured in a new variable within our dataset, MergedRating:

library(knitr)    # provides kable()
library(ggplot2)  # used for the plots later in this tutorial
movies <- read.csv("Data/movies.csv")
kable(movies, caption = "Movie Profitability Data")
Table: Movie Profitability Data

| Title | Rating | Genre | Budget | USRevenue | Opening | LOpening | Theaters | Opinion | Profit |
|---|---|---|---|---|---|---|---|---|---|
| Madagascar: Escape 2 Africa | PG | Animation | 150.0 | 180.0 | 63.1 | 4.145 | 4056 | 6.9 | 1 |
| Sex and the City | R | Comedy | 65.0 | 152.6 | 56.8 | 4.040 | 3285 | 5.4 | 1 |
| The Ruins | R | Horror | 8.0 | 17.4 | 8.0 | 2.079 | 2812 | 6.0 | 1 |
| Stop-Loss | R | Drama | 25.0 | 10.9 | 4.6 | 1.526 | 1291 | 6.5 | 0 |
| The Curious Case of Benjamin Button | PG-13 | Drama | 150.0 | 127.5 | 26.9 | 3.292 | 2988 | 8.0 | 0 |
| Redbelt | R | Action | 7.0 | 2.3 | 1.1 | 0.095 | 1379 | 6.9 | 0 |
| The Secret Life of Bees | PG-13 | Drama | 11.0 | 37.8 | 10.5 | 2.351 | 1591 | 7.0 | 1 |
| Kung Fu Panda | PG | Animation | 130.0 | 215.4 | 60.2 | 4.098 | 4114 | 7.7 | 1 |
| The Happening | R | Drama | 60.0 | 64.5 | 30.5 | 3.418 | 2986 | 5.2 | 1 |
| Zach and Miri Make a Porno | R | Comedy | 24.0 | 31.5 | 10.1 | 2.313 | 2735 | 7.1 | 1 |
| The Strangers | R | Horror | 10.0 | 52.5 | 21.0 | 3.045 | 2466 | 6.0 | 1 |
| Prom Night | PG-13 | Horror | 20.0 | 43.8 | 20.8 | 3.035 | 2700 | 3.6 | 1 |
| The Dark Knight | PG-13 | Action | 185.0 | 533.3 | 158.4 | 5.065 | 4366 | 8.9 | 1 |
| Baby Mama | PG-13 | Comedy | 30.0 | 60.3 | 17.4 | 2.856 | 2543 | 6.1 | 1 |
| Wanted | R | Action | 75.0 | 134.3 | 50.9 | 3.930 | 3175 | 6.8 | 1 |
| Changeling | R | Drama | 55.0 | 35.7 | 10.0 | 2.303 | 1850 | 8.0 | 0 |
| Yes Man | PG-13 | Comedy | 70.0 | 97.7 | 18.3 | 2.907 | 3434 | 7.0 | 1 |
| The Express | PG | Drama | 40.0 | 9.6 | 4.6 | 1.526 | 2808 | 7.1 | 0 |
| W. | PG-13 | Drama | 25.1 | 25.5 | 10.5 | 2.351 | 2030 | 6.6 | 1 |
| The Mummy: Tomb of the Dragon Emporer | PG-13 | Action | 145.0 | 102.2 | 40.5 | 3.701 | 3760 | 5.1 | 0 |
| Eagle Eye | PG-13 | Action | 80.0 | 101.1 | 29.2 | 3.374 | 3510 | 6.6 | 1 |
| Burn After Reading | R | Comedy | 37.0 | 60.3 | 19.1 | 2.950 | 2651 | 7.2 | 1 |
| Saw V | R | Horror | 10.8 | 56.7 | 30.1 | 3.405 | 3060 | 5.8 | 1 |
| Miracle and St Anna | R | Action | 45.0 | 7.9 | 3.5 | 1.253 | 1185 | 5.9 | 0 |
| The Day the Earth Stood Still | PG-13 | Drama | 80.0 | 79.4 | 30.5 | 3.418 | 3560 | 5.5 | 0 |
| Be Kind Rewind | PG-13 | Comedy | 20.0 | 11.2 | 4.1 | 1.411 | 808 | 6.6 | 0 |
| Jumper | PG-13 | Action | 85.0 | 80.2 | 32.1 | 3.469 | 3428 | 5.9 | 0 |
| Hancock | PG-13 | Action | 150.0 | 227.9 | 62.6 | 4.137 | 3965 | 6.5 | 1 |
| Speed Racer | PG | Action | 120.0 | 43.9 | 18.6 | 2.923 | 3606 | 6.3 | 0 |
| The Eye | R | Drama | 12.0 | 31.4 | 12.4 | 2.518 | 2436 | 5.3 | 1 |
| Death Race | R | Action | 45.0 | 36.1 | 12.6 | 2.534 | 2532 | 6.6 | 0 |
| College | R | Comedy | 6.5 | 4.7 | 2.6 | 0.956 | 2123 | 4.3 | 0 |
| Blindness | R | Drama | 25.0 | 3.1 | 2.0 | 0.693 | 1690 | 6.7 | 0 |
| Iron Man | PG-13 | Action | 140.0 | 318.3 | 102.1 | 4.626 | 4105 | 8.0 | 1 |
| Lakeview Terrace | PG-13 | Drama | 22.0 | 39.3 | 15.0 | 2.708 | 2464 | 6.3 | 1 |

movies$MergedRating <- ifelse(movies$Rating == "PG" | movies$Rating == "PG-13", "Family", "Adult")
kable(head(movies), caption = "Movie Profitability Data")

Table 1: Movie Profitability Statistics

| Title | Rating | Genre | Budget | USRevenue | Opening | LOpening | Theaters | Opinion | Profit | MergedRating |
|---|---|---|---|---|---|---|---|---|---|---|
| Madagascar: Escape 2 Africa | PG | Animation | 150 | 180.0 | 63.1 | 4.145 | 4056 | 6.9 | 1 | Family |
| Sex and the City | R | Comedy | 65 | 152.6 | 56.8 | 4.040 | 3285 | 5.4 | 1 | Adult |
| The Ruins | R | Horror | 8 | 17.4 | 8.0 | 2.079 | 2812 | 6.0 | 1 | Adult |
| Stop-Loss | R | Drama | 25 | 10.9 | 4.6 | 1.526 | 1291 | 6.5 | 0 | Adult |
| The Curious Case of Benjamin Button | PG-13 | Drama | 150 | 127.5 | 26.9 | 3.292 | 2988 | 8.0 | 0 | Family |
| Redbelt | R | Action | 7 | 2.3 | 1.1 | 0.095 | 1379 | 6.9 | 0 | Adult |

Refer back to Computer Assignment #6 Tutorial for information regarding logical operators.
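When more than two ratings need to be merged, chaining `|` comparisons gets long. One common alternative is the `%in%` operator. The sketch below uses a short stand-in vector (`rating` is a hypothetical name, not the full `movies$Rating` column) and suggests that both forms produce the same recoding.

```r
# Small stand-in for movies$Rating, used only for illustration
rating <- c("PG", "R", "PG-13", "R")

# Recode with chained comparisons, as in the tutorial
merged_or <- ifelse(rating == "PG" | rating == "PG-13", "Family", "Adult")

# Equivalent recode using %in% with a vector of the ratings to merge
merged_in <- ifelse(rating %in% c("PG", "PG-13"), "Family", "Adult")

identical(merged_or, merged_in)  # the two recodings match
table(merged_in)                 # group sizes after merging
```

Checking the group sizes with table() after recoding is a cheap way to confirm that every observation landed in the intended category.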

Two-Sample Independent Procedure

Hypothesis Testing Framework

  1. Test Selection: For our purpose, a two-sample independent t-test is appropriate as it compares means between two distinct groups that are not related or paired. This test suits our scenario since each movie is unique and falls into one of two independent categories, ‘Adult’ or ‘Family’.

  2. Alternative Hypothesis: We aim to determine if there’s a significant difference in profitability (as measured by log opening revenue, LOpening) between ‘Adult’ and ‘Family’ movies. Hence, our alternative hypothesis could be that the mean LOpening for ‘Adult’ movies is different from ‘Family’ movies.

  3. Data Visualization: To understand the distribution of LOpening for each category, we generate histograms and boxplots.

First, calculate the group-level statistics and the estimated normal density for each group.

# Calculate the sample mean and standard deviation for each group
xbar <- tapply(movies$LOpening, movies$MergedRating, mean)
s <- tapply(movies$LOpening, movies$MergedRating, sd)

# Create estimated normal density curves for each group
movies$normal.density <- ifelse(movies$MergedRating == "Family", 
                                 dnorm(movies$LOpening, xbar["Family"], s["Family"]), 
                                 dnorm(movies$LOpening, xbar["Adult"], s["Adult"]))

To ensure an accurate comparison between the two groups in the histogram, we use the ‘facet_grid()’ function from the ggplot2 package, which creates a grid of plots based on the levels of our factor. It allows for the simultaneous visualization of subsets of data across different categories, facilitating comparisons and highlighting differences or patterns within the data.

binLen <- as.numeric(max(tapply(movies$LOpening, movies$MergedRating,length)))
n_bins <- round(max(sqrt(binLen)+2, 5))


ggplot(movies, aes(x = LOpening)) + 
  geom_histogram(aes(y = after_stat(density)), bins = n_bins, fill = "grey", col = "black") + 
  facet_grid(. ~ MergedRating) +
  geom_density(col = "red", lwd = 1) + 
  geom_line(aes(y = normal.density), col = "blue", lwd = 1) + 
  labs(title = "Distribution of Log Opening Revenue by Rating Category")
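The bin count used in the histogram can be checked by hand. Counting the rows of the movie table above gives 16 ‘Adult’ (R) movies and 19 ‘Family’ (PG and PG-13) movies; this sketch assumes that table is the complete data set and uses the hypothetical name `group_sizes` in place of the tapply() call.

```r
# Group sizes counted from the movie table (assumed to be the full data set)
group_sizes <- c(Adult = 16, Family = 19)

# Same rule as in the tutorial: sqrt of the largest group size plus 2,
# with a floor of 5 bins, rounded to a whole number
largest <- max(group_sizes)
n_bins <- round(max(sqrt(largest) + 2, 5))
n_bins
```

Basing the rule on the larger group means both facets share one bin count, which keeps the two histograms directly comparable.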

Create boxplots for both ‘Family’ and ‘Adult’ rating categories. Boxplots are instrumental in visualizing the central tendency and variability of data. By designating a categorical variable for the x-axis, we can generate side-by-side boxplots, facilitating an effortless comparison between the two groups. This visual comparison can help highlight differences in the distribution of log opening weekend revenue per theater across rating categories, providing insights into how movie ratings may influence financial performance.

ggplot(movies, aes(x = MergedRating, y = LOpening)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  stat_summary(fun = mean, colour = "black", geom = "point", size = 3) +
  ggtitle("Boxplots of Log Opening Revenue by Rating Category")

  4. Diagnostics: Determine whether the assumptions required for inference are valid in this situation. You do not need to repeat any graphs already presented in step 3, but additional plots may be needed. Be sure to list all of the assumptions, whether or not they can be checked from the graphs.

Calculating Slope and Intercept for Reference Lines

For each rating category, we calculate the slope and intercept of the reference line that would represent a perfectly normal distribution. These calculations allow ggplot2 to draw the reference lines accurately for each category in the Q-Q plots:

movies$intercept <- ifelse(movies$MergedRating == "Family", xbar["Family"], xbar["Adult"])
movies$slope <- ifelse(movies$MergedRating == "Family", s["Family"], s["Adult"])
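The reason the group mean serves as the intercept and the group standard deviation as the slope can be sketched with simulated "ideal" data. In a normal Q-Q plot the x-axis holds standard normal quantiles z, and a normal sample with mean m and standard deviation s should fall near the line m + s * z. The values 3 and 2 below are arbitrary choices for illustration.

```r
# Theoretical standard normal quantiles, as used on the x-axis of a Q-Q plot
n <- 200
z <- qnorm(ppoints(n))

# An idealized sample whose quantiles sit exactly on the line 3 + 2 * z
x <- 3 + 2 * z

# The sample mean and sd approximately recover that intercept and slope
c(intercept = mean(x), slope = sd(x))
```

The slope estimate falls slightly below 2 because a finite set of quantiles does not reach the far tails, which is one reason the reference line is a guide rather than an exact fit.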

With the intercept and slope prepared, we proceed to construct Q-Q plots for LOpening within the ‘Family’ and ‘Adult’ groups, facilitating a comparison of their distributions to a normal distribution:

ggplot(movies, aes(sample = LOpening)) +
  stat_qq() +
  facet_grid(MergedRating ~ .) +
  geom_abline(aes(intercept = intercept, slope = slope), color = "blue", linetype = "dashed") +
  ggtitle("Q-Q Plots of Log Opening Revenue by Rating Category")

Conducting the T-Test

  5. Carry Out the Hypothesis Test: Since the assumptions are valid, we carry out the hypothesis test at a 0.01 significance level to determine whether movies rated ‘Family’ and movies rated ‘Adult’ differ with respect to the log-transformed opening weekend revenue per theater (LOpening).

For this analysis, we use the formula interface of the t.test() function, which allows for a concise specification of the groups being compared:

t.test(LOpening ~ MergedRating, data = movies, 
                        mu = 0, conf.level = 0.99, 
                        paired = FALSE, alternative = "two.sided", 
                        var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  LOpening by MergedRating
## t = -2.5144, df = 29.12, p-value = 0.0177
## alternative hypothesis: true difference in means between group Adult and group Family is not equal to 0
## 99 percent confidence interval:
##  -1.917939  0.087768
## sample estimates:
##  mean in group Adult mean in group Family 
##             2.316125             3.231211
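To read this output against the 0.01 significance level, the p-value can be recomputed from the reported statistic and degrees of freedom. This is a small sketch using the rounded values printed above, so it reproduces the p-value only approximately; the variable names are hypothetical.

```r
# Values copied from the printed Welch test output (rounded)
t_stat   <- -2.5144
df_welch <- 29.12
alpha    <- 0.01
ci       <- c(-1.917939, 0.087768)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df_welch)

# Decision at the chosen level, and the matching confidence interval check
reject_null      <- p_value < alpha
ci_contains_zero <- ci[1] < 0 && ci[2] > 0

c(p_value = p_value, reject_null = reject_null, ci_contains_zero = ci_contains_zero)
```

Because the p-value (about 0.0177) exceeds 0.01, the null hypothesis is not rejected at this level, which agrees with the 99 percent confidence interval containing 0.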

t Procedures for Two-Sample Matched Pairs

In situations where you’re comparing two related groups—such as before-and-after measurements in a controlled experiment or matched pairs in observational studies—a two-sample paired t-test provides a powerful tool for analysis. This test focuses on the differences between paired observations, which means you’ll need to create a new variable representing these differences.

Creating a Difference Variable

To conduct a paired t-test, the first step involves calculating the differences between each pair of matched observations. This new variable, let’s call it diff, captures the essence of the paired design by isolating and emphasizing the change or effect of interest.

The direction in which you calculate these differences (i.e., variable1 - variable2 vs. variable2 - variable1) is a matter of context or convention and does not influence the statistical validity of the test. However, it’s essential to be consistent with the hypothesized direction of the effect. For example, if you’re expected to estimate the mean difference of a - b, then your difference calculation should reflect this order.

Code for Creating the Difference Variable

While we won’t repeat the specifics here, remember that the difference variable can be created with simple subtraction. It typically looks something like this:

# Assuming 'data' is your dataframe, and 'before' and 'after' are the paired observations
data$diff <- data$after - data$before
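The effect of the subtraction order can be checked directly. In this sketch, `before` and `after` are made-up measurements (not from any course data set); reversing the order flips the sign of the test statistic and the confidence interval but leaves the two-sided p-value unchanged.

```r
# Hypothetical paired measurements, used only for illustration
before <- c(10.2, 9.8, 11.5, 10.9, 12.1)
after  <- c(11.0, 10.1, 11.4, 12.0, 12.9)

# One-sample t-tests on the differences, taken in each direction
res_ab <- t.test(after - before)
res_ba <- t.test(before - after)

# Same magnitude, opposite sign; identical two-sided p-value
c(t_after_minus_before = unname(res_ab$statistic),
  t_before_minus_after = unname(res_ba$statistic))
c(res_ab$p.value, res_ba$p.value)
```

For a one-sided alternative the direction does matter: switching the order of subtraction requires switching between "less" and "greater" as well.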

Example: Fuel Efficiency Comparison

(Data Set: ex07-39mpgdiff.csv) A researcher records the mpg (miles per gallon, a measure of fuel economy) of his car each time he fills the tank. He computes it by dividing the miles driven since the last fill-up by the number of gallons pumped at fill-up. He wants to determine whether these calculations differ from what his car’s computer estimates.

For the paired t-test, we focus on the Diff variable, which represents the difference between the computer estimates and the driver measurements (Computer - Driver). This variable highlights the discrepancy of interest and serves as the basis for our analysis. If this variable were not already calculated, we would need to compute it as described above.

mpg <- read.csv("Data/ex07-39mpgdiff.csv")
kable(mpg, caption = "Miles Per Gallon Data")
Table: Miles Per Gallon Data

| Fill.up | Computer | Driver | Diff |
|---|---|---|---|
| 1 | 41.5 | 36.5 | 5.0 |
| 2 | 50.7 | 44.2 | 6.5 |
| 3 | 36.6 | 37.2 | -0.6 |
| 4 | 37.3 | 35.6 | 1.7 |
| 5 | 34.2 | 30.5 | 3.7 |
| 6 | 45.0 | 40.5 | 4.5 |
| 7 | 48.0 | 40.0 | 8.0 |
| 8 | 43.2 | 41.0 | 2.2 |
| 9 | 47.7 | 42.8 | 4.9 |
| 10 | 42.2 | 39.2 | 3.0 |
| 11 | 43.2 | 38.8 | 4.4 |
| 12 | 44.6 | 44.5 | 0.1 |
| 13 | 48.4 | 45.4 | 3.0 |
| 14 | 46.4 | 45.3 | 1.1 |
| 15 | 46.8 | 45.7 | 1.1 |
| 16 | 39.2 | 34.2 | 5.0 |
| 17 | 37.3 | 35.2 | 2.1 |
| 18 | 43.5 | 39.8 | 3.7 |
| 19 | 44.3 | 44.9 | -0.6 |
| 20 | 43.3 | 47.5 | -4.2 |

  1. Test Selection: For our analysis, a two-sample paired t-test is ideal since it compares the means of related observations. Here, each pair of observations consists of the MPG as calculated by the car’s computer and as measured by the driver for the same fill-up, making them inherently paired. This test allows us to assess if there’s a statistically significant difference between the computer’s estimates and the driver’s measurements.

  2. Alternative Hypothesis: We aim to determine whether there’s a significant discrepancy between the car’s computer MPG estimates and the driver’s MPG measurements. Thus, our alternative hypothesis posits that the mean difference between the computer’s estimates and the driver’s measurements is not equal to zero, indicating a systematic bias in either the computer’s or the driver’s favor.

  3. Data Visualization: To visualize the distribution of MPG differences (Computer MPG - Driver MPG), histograms and boxplots can be informative. These plots will help us understand the spread and central tendency of the MPG differences, alongside any potential outliers or skewness in the data. The code is similar to one-sample procedures and will not be repeated.

  4. Diagnostics: Determine whether the assumptions required for inference are valid in this situation. You do not need to repeat any graphs already presented in step 3, but additional plots may be needed. Be sure to list all of the assumptions, whether or not they can be checked from the graphs. The code is similar to that for one-sample procedures and will not be repeated.

Conducting the T-Test

  5. Carry Out the Hypothesis Test: The one outlier is suspect, but it does not seem too extreme relative to the scale of the data. Since the assumptions are valid, we carry out the hypothesis test at a 0.05 significance level to determine whether there is a significant difference between the computer’s estimates and the driver’s measurements.
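The outlier mentioned in the last step can be located numerically with the usual 1.5 × IQR rule. This sketch retypes the Diff column from the table above and uses the default quartiles from quantile(); it is one possible check, not necessarily the procedure used to produce the course plots.

```r
# Diff column (Computer - Driver) copied from the table above
diff_mpg <- c(5.0, 6.5, -0.6, 1.7, 3.7, 4.5, 8.0, 2.2, 4.9, 3.0,
              4.4, 0.1, 3.0, 1.1, 1.1, 5.0, 2.1, 3.7, -0.6, -4.2)

# Quartiles (default quantile definition) and the 1.5 * IQR fences
q     <- unname(quantile(diff_mpg, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Observations falling outside the fences
outliers <- diff_mpg[diff_mpg < lower | diff_mpg > upper]
outliers
```

The flagged value lies only just beyond the lower fence. Other quartile conventions place the fence slightly differently, so a borderline point like this one may not be flagged by every tool.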

For this analysis, we pass vectors directly to the t.test() function rather than using the formula interface. We can either use the ‘Diff’ variable in a one-sample procedure or supply the two variables ‘Computer’ and ‘Driver’ to a paired procedure; both give the same results:

One-Sample Approach Using the ‘Diff’ Variable: If we choose to focus on the already calculated differences between the car’s computer estimates and the driver’s measurements (Diff), we can apply a one-sample t-test. This approach treats the set of differences as a single sample being tested against a hypothesized mean difference of zero.

t.test.results <- t.test(mpg$Diff, mu = 0, conf.level = 0.95, alternative = "two.sided")
t.test.results
## 
##  One Sample t-test
## 
## data:  mpg$Diff
## t = 4.358, df = 19, p-value = 0.0003386
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.418847 4.041153
## sample estimates:
## mean of x 
##      2.73

Paired Two-Sample Approach Using ‘Computer’ and ‘Driver’ Variables: Alternatively, we can directly compare the Computer and Driver variables using a paired two-sample t-test. This method implicitly calculates the differences between each pair of corresponding observations, aligning closely with the nature of our data as paired measurements from the same fill-up events.

t.test(mpg$Computer, mpg$Driver, mu = 0, conf.level = 0.95, alternative = "two.sided", paired = TRUE)
## 
##  Paired t-test
## 
## data:  mpg$Computer and mpg$Driver
## t = 4.358, df = 19, p-value = 0.0003386
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  1.418847 4.041153
## sample estimates:
## mean difference 
##            2.73
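The agreement between the two outputs follows from how the paired statistic is defined: it is the one-sample t statistic of the differences, t = mean(d) / (sd(d) / sqrt(n)) with n - 1 degrees of freedom. The sketch below retypes the Computer and Driver columns from the table above (rather than reading the CSV file) and computes the statistic by hand.

```r
# Computer and Driver columns copied from the table above
computer <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
              43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
driver   <- c(36.5, 44.2, 37.2, 35.6, 30.5, 40.5, 40.0, 41.0, 42.8, 39.2,
              38.8, 44.5, 45.4, 45.3, 45.7, 34.2, 35.2, 39.8, 44.9, 47.5)

# Paired differences and the hand-computed one-sample t statistic
d <- computer - driver
n <- length(d)
t_manual <- mean(d) / (sd(d) / sqrt(n))

# The paired test reports the same statistic and degrees of freedom
paired <- t.test(computer, driver, paired = TRUE)
c(manual = t_manual, paired = unname(paired$statistic), df = n - 1)
```

Because the paired procedure only ever uses the differences, it makes no assumption about the individual Computer and Driver distributions, only about the distribution of Diff.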