One-way ANOVA is a statistical method used to compare the means of three or more populations/groups to determine if at least one group mean is statistically different from the others. It’s particularly useful in experimental design to assess the effect of different treatments or conditions on a measured response variable.

The base R installation provides functions like aov() for conducting ANOVA tests and TukeyHSD() for post-hoc analysis, which allows for multiple comparisons after the ANOVA to determine which groups differ.

The aov() function in R is a fundamental tool for conducting Analysis of Variance (ANOVA). It’s part of R’s base packages, meaning it’s readily available without the need for installing additional libraries. ANOVA is a technique used to compare means across two or more groups, and aov() is specifically designed to handle one-way ANOVA, as well as other ANOVA-related tests.

Basics of ANOVA in R

The Analysis of Variance function aov()

The basic syntax of the aov() function is as follows:

aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
  • formula: A formula specifying the model. It typically takes the form response ~ factor(s), where response is the dependent variable and factor(s) indicates the independent variable(s) or factor(s) that group the data.
  • data: A data frame in which the variables specified in the formula will be found. If missing, the variables are searched for in the environment. If R doesn’t find the variables in the current environment, it then searches through the environments of attached packages and datasets.
  • projections: A logical flag indicating whether the projections of the data onto the model terms should be returned. This is covered in a more advanced course.
  • qr: A logical flag indicating whether the QR decomposition should be returned. QR decomposition is a numerical method used in linear algebra to solve linear equations and least squares problems; keeping qr = TRUE (the default) stores the decomposition with the fitted model, which supports later computations on it.
  • contrasts: The contrasts parameter allows for specifying how categorical variables (factors) are coded into numerical contrasts, which can influence the interpretation of the model’s coefficients. R has default contrast coding schemes. This is covered in a more advanced course.

In practice, you’ll often use aov() with just the first two arguments, formula and data, especially for straightforward ANOVA analyses:

results <- aov(formula, data = NULL)
summary(results)
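For example, a quick one-way ANOVA can be run on R’s built-in PlantGrowth data set, where weight is the response and group is a factor with a control and two treatment levels:

# One-way ANOVA on the built-in PlantGrowth data
pg_fit <- aov(weight ~ group, data = PlantGrowth)
summary(pg_fit)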

Summary of aov() Output

The summary() function in R is a generic function used to produce result summaries of various objects in R, such as data frames, model objects (e.g., linear models, ANOVA models), and even basic structures like vectors. The type of summary returned depends on the class of the object for which it is called. This function is highly versatile and one of the most commonly used functions for data analysis in R because it provides a quick, high-level overview of the object, making it easier to understand the data or the model at a glance.

What Does summary() Do?

The actions performed by summary() vary depending on the object’s class:

  • Data Frames: For data frames, summary() provides basic descriptive statistics for each column, such as Min, 1st Qu. (first quartile), Median, Mean, 3rd Qu. (third quartile), and Max for numerical columns, and counts of each level for factor columns (or a brief summary for character columns).
  • Model Objects: For model objects like those returned by aov() (ANOVA models), summary() provides detailed information about the model fit, including coefficients, standard errors, statistical significance, and residuals, among other statistics. For aov() objects, it specifically includes the ANOVA table, which shows the degrees of freedom, sum of squares, mean squares, F-statistic, and the \(p\)-value.
  • Vectors and Matrices: For basic data structures like vectors and matrices, the summary includes elements like Min, 1st Quartile, Median, Mean, 3rd Quartile, and Max.
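As a concrete illustration of this class-dependent behavior, the built-in iris data set (used here only as an example) can be summarized directly:

# Data frame: per-column statistics, including counts of each Species level
summary(iris)
# Numeric vector: Min, quartiles, Mean, and Max
summary(iris$Sepal.Length)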

The summary() function, when applied to an object returned by aov(), provides a concise summary of the ANOVA analysis. It includes:

  • Degrees of Freedom (Df): The number of degrees of freedom associated with both the model (e.g., the factors or independent variables) and the residuals (error).
  • Sum Sq (Sum of Squares): The total variance in the dependent variable that is explained by the model (for each factor) and the unexplained variance (residuals).
  • Mean Sq (Mean Square): The average variance explained by the model (for each factor) and the average variance of the residuals. It’s calculated as the sum of squares divided by the corresponding degrees of freedom.
  • F value: The ratio of the mean square of the factor to the mean square of the residuals. It indicates how much more variance is explained by the model than would be expected by chance.
  • Pr(>F): The p-value associated with the F statistic. It tells you whether the variance explained by the model (for each factor) is statistically significantly greater than what would be expected by chance.

This summary is crucial for interpreting the results of ANOVA, as it provides the statistical evidence needed to determine whether there are significant differences in the means of the groups defined by the independent variables.
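If you need individual numbers from this table programmatically, they can be extracted from the summary object; a minimal sketch, assuming the fitted one-way model results from above:

# summary() on an aov object returns a list whose first element is the ANOVA table
anova_tab <- summary(results)[[1]]
anova_tab[["F value"]][1]   # F statistic for the factor
anova_tab[["Pr(>F)"]][1]    # corresponding p-value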

Factor Variables

In R, it’s generally a good practice to ensure that categorical variables are explicitly declared as factors before performing Analysis of Variance (ANOVA) or other statistical modeling where the distinction between categorical and continuous variables is important. While R often does a good job of automatically treating character variables as categorical (factors), there are situations where you might need to manually convert them using the factor() function. Here’s why and when converting to factors is important:

Why Convert Categorical Variables to Factors?

  • Model Interpretation: Factors ensure that statistical models treat the variable as categorical, influencing the model’s structure, interpretation, and output. For ANOVA, factors denote the groups being compared.
  • Levels Control: Converting to a factor allows you to specify the order of levels explicitly, which can be important for ordinal data and for the interpretation of model coefficients.
  • Data Integrity: Ensuring that categorical variables are explicitly declared as factors can prevent accidental misuse as continuous variables in calculations or models, preserving the integrity of your analysis. This is especially important if your categories are represented numerically (1, 2, …).

Converting an ordinal variable into a factor with specified order:

data$OrdinalVar <- factor(data$OrdinalVar, levels = c("Low", "Medium", "High"))
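If the categories are stored as numeric codes instead of text, the same conversion applies; a hypothetical sketch (the variable and label names below are made up for illustration):

# Group coded as 1, 2, 3: attach descriptive labels while converting to a factor
data$Group <- factor(data$Group, levels = c(1, 2, 3),
                     labels = c("Control", "TreatmentA", "TreatmentB"))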

Effects plot

An effects plot, often used in the context of statistical models, displays the estimated effects of categorical variables (usually represented as factors) on a response variable, allowing for visual interpretation of the model’s findings. In R, ggplot2 can be used for plotting statistical summaries, including the effects of different levels of a factor on a response variable. It helps in understanding how the response changes across different levels of the factor. Here’s how you can create an effects plot using ggplot2.

Assuming you have a dataset data with a response variable response and a factor factorVar, here’s a guide to creating an effects plot.

library(ggplot2)

ggplot(data, aes(x = factorVar, y = response)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line", aes(group = 1)) +
  ggtitle("Effects Plot of Response by FactorVar")
  • stat_summary(): Calculates summary statistics. Here, it’s used twice: first to plot points at the mean of response for each factorVar level, and then to connect these points with a line.
  • fun = mean: Specifies that the summary function is the mean.
  • geom = "point" and geom = "line": Indicate that the summary statistics should be represented as points and lines, respectively.
  • aes(group = 1): Ensures that the line spans all factor levels. Without it, ggplot2 might treat each level as a separate group and not draw a connecting line.
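If you also want a rough visual indication of the uncertainty around each mean, stat_summary() can add error bars; one possible extension of the plot above uses ggplot2’s mean_se() helper, which computes the mean plus or minus one standard error:

ggplot(data, aes(x = factorVar, y = response)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line", aes(group = 1)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1) +
  ggtitle("Effects Plot of Response by FactorVar")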

Post-hoc Analysis (TukeyHSD)

Post-hoc analysis is a critical step in statistical analysis, particularly after conducting ANOVA (Analysis of Variance) tests. When an ANOVA indicates significant differences between group means, it tells us that not all groups are the same, but it doesn’t specify which groups differ from each other. This is where post-hoc analyses like Tukey’s Honestly Significant Difference (TukeyHSD) test come into play.

  • Identify Specific Differences: Post-hoc tests help pinpoint exactly which means differ from one another after a significant ANOVA result.
  • Control Type I Error: When making multiple comparisons, the chance of a Type I error (falsely claiming significance) increases. Post-hoc tests adjust for this increased risk.

Application in R:

After conducting an ANOVA in R, the TukeyHSD() function performs the post-hoc analysis, allowing for an examination of pairwise mean differences with adjusted confidence intervals:

# C is your chosen confidence level, e.g. 0.95
tukey_result <- TukeyHSD(fit, conf.level = C)
plot(tukey_result)

This code conducts a TukeyHSD test and then plots the confidence intervals for the mean differences between groups. The plot visually represents the magnitude of differences and their statistical significance, offering an intuitive understanding of which groups significantly differ from each other.

Example: Educational Product Assessment

(Data Set: eduproduct.csv) Evaluation of a New Educational Product. A company designed two versions of a new educational product to improve children’s reading comprehension. They would like to claim that children would acquire better reading comprehension skills utilizing either of the two versions than with the traditional approach. You are helping the company to prepare the marketing material, which will include the results of a study conducted to compare the two versions of the new product with the traditional method. The standard method is called Basal (B), and the two variations of the new method are called DRTA (D) and Strat (S).

Education researchers randomly divided 66 children into three groups of \(22\). Each group was taught by one of the three methods. The response variable is a measure of reading comprehension called Comp that was obtained by a test taken after the instruction was completed.

The study aims to determine which of three educational methods most effectively improves children’s reading comprehension. Through visual and statistical analysis, including a one-way ANOVA at the \(\alpha=0.05\) significance level, we will identify any significant differences in efficacy between methods. If significant differences are found, post-hoc analysis will identify which methods stand out. The results will be summarized in a report that interprets the educational impact in straightforward terms, accessible to a general audience.

Load data and generate effects plot

edu.df <- read.csv("Data/eduproduct.csv", header = TRUE)

# Convert to factor variable
edu.df$GroupName <- factor(edu.df$GroupName)


ggplot(data = edu.df, aes(x = GroupName, y = Comp)) +
  stat_summary(fun = mean, geom = "point") + 
  stat_summary(fun = mean, geom = "line", aes(group = 1)) +
  ggtitle("Effects Plot of Comp by GroupName")

The effects plot offers a succinct visualization of the impacts associated with various educational approaches: the traditional Basal (B), and the innovative DRTA (D) and Strat (S) methodologies. Particularly in the context of a one-way ANOVA, this plot becomes helpful for post-hoc analysis following the determination of statistical significance. It facilitates an immediate, clear comparison of the mean outcomes for each educational method. By delineating the relative performance of B, D, and S, the effects plot conveys the practical ranking among these methods based on their average impacts on reading comprehension (B<S<D).

Compute groupwise statistics

n <- tapply(edu.df$Comp, edu.df$GroupName, length)
xbar <- tapply(edu.df$Comp, edu.df$GroupName, mean)
s <- tapply(edu.df$Comp, edu.df$GroupName, sd)

# Normal density of each Comp value under its own group's mean and sd
# (used later to overlay the fitted normal curve on the histograms)
edu.df$normal.density <- apply(edu.df, 1, function(x){
  dnorm(as.numeric(x["Comp"]), 
        xbar[x["GroupName"]], s[x["GroupName"]])})
summary_table <- data.frame(
  GroupName = names(n),
  SampleSize = n,
  Mean = xbar,
  StandardDeviation = s
)
cat("Summary Statistics by Group\n")
print(summary_table, row.names = FALSE)
Table 1: Summary Statistics by Group

GroupName  SampleSize      Mean  StandardDeviation
        B          22  41.04545           5.635578
        D          22  46.72727           7.388420
        S          22  44.27273           5.766750

Generate side-by-side box plot

ggplot(edu.df, aes(x = GroupName, y = Comp)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  stat_summary(fun = mean, colour = "black", geom = "point", size = 3) +
  ggtitle("Boxplots of Comp by GroupName")

The side-by-side boxplots greatly overlap, so it is difficult to tell whether there are any significant differences. However, the innovative DRTA (D) does appear to be greater than the traditional Basal (B), as \(Q_1\) of D is above the sample mean of B. Keep in mind that visual analysis is very subjective. Also, D and S appear to be negatively skewed. No outliers are present in any group.

Generate histograms

# Rule of thumb: roughly the square root of the largest group size, but at least 5 bins
n_bins <- round(max(sqrt(max(n) + 2), 5))
                    
ggplot(edu.df, aes(x = Comp)) +
  geom_histogram(aes(y = after_stat(density)), bins = n_bins, fill = "grey", col = "black") +
  facet_grid(GroupName ~ .) +
  geom_density(col = "red", lwd = 1) +
  geom_line(aes(y = normal.density), col = "blue", lwd = 1) +
  ggtitle("Histograms of Comp by GroupName")

The histograms for groups D and S appear to be slightly negatively skewed but without outliers.

Generate QQ-plot

#
# (1) Group-specific mean (intercept) and sd (slope) for the QQ reference lines
#
edu.df$intercept <- apply(edu.df, 1, function(x){xbar[x["GroupName"]]})
edu.df$slope <- apply(edu.df, 1, function(x){s[x["GroupName"]]})
#
# (2) Make the QQ plot
#
ggplot(edu.df, aes(sample = Comp)) +
  stat_qq() +
  facet_grid(GroupName ~ .) +
  geom_abline(data = edu.df, aes(intercept = intercept, slope = slope)) +
  ggtitle("QQ Plots of Comp by GroupName")

The QQ-plots for groups D and S also show a slight negative skew.

Assumptions

  1. SRS: We assume that the samples are independently and randomly selected from the population, with each group treated independently from the others. This assumption is fundamental to the validity of ANOVA results, ensuring that the observed differences between group means are attributable to the treatment effects rather than to sampling bias.
  2. Normality: Examination of the histograms and QQ-plots reveals a slight negative skew in groups D and S. While the data are not perfectly normal, the skewness is minor. Given the Central Limit Theorem, which posits that the distribution of sample means approaches normality as sample sizes increase, we consider this assumption satisfied for the analysis.
  3. Homogeneity of Variance (Equality of Variance): To assess the assumption of homogeneity of variances, we examine the ratio of the largest to the smallest sample standard deviation across groups. In this case, \(\frac{s_{\text{max}}}{s_{\text{min}}}=\frac{7.388420}{5.635578}=1.311<2\). Since this ratio is less than 2, the variances are not substantially different, and by this rule of thumb we proceed under the assumption that the variance within each group is approximately equal (a quick check in R is shown below). Each of these points communicates the rationale behind the assumption checks for ANOVA, grounded in statistical principles. It’s important to note that while ANOVA is robust to minor violations of normality and homogeneity of variances, especially with larger sample sizes, the independence assumption is crucial and should not be violated.
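A minimal check of the rule of thumb, reusing the group standard deviations s computed with tapply() in the summary statistics step:

# Rule-of-thumb check: largest group sd divided by smallest group sd (should be < 2)
max(s) / min(s)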

Fit ANOVA model and conduct the four-step process

fit <- aov(Comp~GroupName, data = edu.df)
summary(fit)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## GroupName    2  357.3  178.65   4.481 0.0152 *
## Residuals   63 2511.7   39.87                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 1: Define parameters

  • \(\mu_{\textbf B}\) is the population mean Comp score for the Basal method (control).
  • \(\mu_{\textbf D}\) is the population mean Comp score for the DRTA method.
  • \(\mu_{\textbf S}\) is the population mean Comp score for the Strat method.

Step 2: State the hypotheses

  • \(H_0\): \(\mu_{\textbf B}=\mu_{\textbf D}=\mu_{\textbf S}\)
  • \(H_a\): at least two \(\mu_i\)’s are different.

Step 3: Find the test statistic, p-value, report degrees of freedom

  • \(F_{\text{ts}}= 4.481\)
  • \(df1 = 2\), \(df2 = 63\)
  • \(p\)-value \(= 0.0152\)
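If you want to verify the reported \(p\)-value from the F statistic and its degrees of freedom, the F distribution can be used directly; a quick check:

# Upper-tail probability of F = 4.481 with df1 = 2 and df2 = 63
pf(4.481, df1 = 2, df2 = 63, lower.tail = FALSE)   # approximately 0.0152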

Step 4: Conclusion:

Since \(p\)-value \(=0.0152 < 0.05=\alpha\), we reject \(H_0\).

The data provide evidence (\(p\)-value \(=0.0152\)) to support the claim that the population mean Comp score for at least one of the education methods differs from the rest.

The ANOVA test indicated a difference. Keep in mind that the results of the hypothesis test are more objective than the subjective analysis of the plots. However, you should always look at the plots to get a feel for the data before you do the inference.

Post-hoc Analysis

tukey_result <- TukeyHSD(fit, conf.level = 0.95)
tukey_result
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Comp ~ GroupName, data = edu.df)
## 
## $GroupName
##          diff       lwr       upr     p adj
## D-B  5.681818  1.112137 10.251499 0.0111135
## S-B  3.227273 -1.342408  7.796953 0.2149995
## S-D -2.454545 -7.024226  2.115135 0.4064363
plot(tukey_result)

# Tukey critical value used for the confidence intervals (can be combined with the MSE for manual calculations)
qtukey(p = 0.05,nmeans = 3, df = 63, lower.tail = FALSE)/sqrt(2)
## [1] 2.400326
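As a check, the D-B interval from the Tukey output can be reproduced by hand from the ANOVA table (MSE from the Residuals row, equal group sizes of 22) together with the critical value above; small rounding differences are expected since the inputs are taken from the printed output:

# Tukey margin of error: q* x sqrt(MSE x (1/n_i + 1/n_j))
q_star <- qtukey(p = 0.05, nmeans = 3, df = 63, lower.tail = FALSE) / sqrt(2)
mse <- 2511.7 / 63                        # Mean Sq for Residuals
me <- q_star * sqrt(mse * (1/22 + 1/22))
(46.72727 - 41.04545) + c(-1, 1) * me     # close to (1.112, 10.251) for D-B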

There are two ways to determine which pairs are significant and which are not:

  1. Check whether \(0\) is in the interval (lwr, upr); if it is, then the pair is not significantly different.
  2. Check whether the p adj value is smaller than the significance level. Note that you might not have this information on the exams.

In this case, we have statistical evidence that D and B are different, but we do not have evidence that B is different from S, nor that D is different from S. Also notice that in the plot above, the confidence interval for D-B does not contain \(0\), and therefore that difference is significant.

The following is the graphical representation; you must construct it by hand (unless you want to program R to do it yourself). Remember that you need to include the means in ascending or descending order.

Posthoc Means Graphical Display
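As a starting point, R can at least put the group means in ascending order; based on the Tukey results, B and S would share one underline and S and D another, while B and D would not be joined:

# Group means in ascending order: B (41.05) < S (44.27) < D (46.73)
sort(xbar)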