One-way ANOVA

Overview
Dataset
Computing the One-Way ANOVA
Pairwise-Comparisons
ANOVA Assumptions
- Homoscedasicity
- Normality of Residuals

Overview

The one-way ANOVA is an extension of the independent samples t-test in which more than 2 groups (\(K\) > 2) are compared with each other. While you could run a slew of paired-samples t-tests on the dataset, the one-way ANOVA provides a much more powerful analysis and controls for Type 1 error inflation, or when the null hypothesis (\(H_O\)) is true, but is rejected by the test.

One-way ANOVAs have three assumptions:

The residuals from the data are normally distributed.
The variances of the sampled populations are equal.
The data are independent and normally distributed.

ANOVAs are fairly robust to violations to these assumptions – particularly the normality assumption – but take great caution if and when these assumptions are violated.

Before starting the computations in R, take a moment to install these dependencies if you have not already:

# Required packages
install.packages("car")
install.packages("multcomp")
install.packages("DescTools")

The car package will come in handy when we run Levene’s Test, and the multicomp and DescTools packages will be useful when we run multiple comparisions after the one-way ANOVA.

Dataset

What we will be using in this document is a dataset describing the effectiveness of different insecticides. Six different sprays (\(K = 6\)) of 12 trials each (\(n = 12\); \(N = 72\)) were tested in controlled trials, with the outcome measure being the number of bugs left in the sample after the insecticide had been administered. Put into different words, the fewer insects left after a spray is administered, the more effective the spray is.

This dataset is built into R and can be accessed simply by typing the name of the dataset (InsectSprays) into the console:

(Instead of showing the entire dataset, I provided a figure to illustrate InsectSprays.)

As you can see from the boxplot, some insecticides are clearly more effective (“C”, “D”, and “E”) than others (“A”, “B”, and “C”). But how do we determine whole-group differences statistically? And how can we tell exactly which groups are statistically more effective than others?

Computing the One-Way ANOVA

In testing for differences amongst groups using the one-way ANOVA, we are testing the hypothesis that the group means in the dataset are not the same, thus:

\(H_O: \mu_1 = \mu_2 = ... = \mu_K\)

\(H_A:\) Not \(H_O\)

Computing a one-way ANOVA in R is fairly simple (given that your data is structured in the correct way, of course). In the base “stats” package, the command aov is responsible for computing analyses of variance. For the one-way ANOVA, aov just requires 3 parameters: The dependent variable, the single independent variable, and the name of the dataset.

# Dummy code
aov(response ~ factor, data=data_name)

When applied to the InsectSprays dataset, we will designate the “response”" (i.e. the DV) as “count” and the “factor” (i.e. the IV) as “spray”. Because we are going to use this particular ANOVA computation in future calculations, we will save it as “ins.aov”.

ins.aov <- aov(count ~ spray, data = InsectSprays)
summary(ins.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is important in the output are the F- and p-values. Because the F-value is greater than the critical value (2.37 for alpha = .05 and df = 5,66), and because the p-value is accordingly below our alpha value, the ANOVA is significant. This shows that there are significant differences between insecticides regarding the mean number of bugs that populate a given area after a spray has been administered, F(5,66) = 34.7, p < .005.

The results of the F-test should always follow APA format when reporting significance (or non-significance). The degrees of freedom (df) are also important to include – the computations of which are briefly shown below:

Between-groups degrees of freedom:

\(df_{between} = K - 1\)

Within-groups degrees of freedom:

\(df_{within} = K(n - 1)\)

Total degrees of freedom:

\(df_{total} = df_{between} + df_{within} = N - 1\)

Where \(K\) = # of groups, \(n\) = # of observations in each group, and \(N\) = the total # of observations.

Pairwise-Comparisons

Now that we have determined that the six insecticides are significantly different from each other, we need to figure out specifically which insecticides are different from each other. This is accomplished using pairwise-comparisons.

Many textbook ANOVA examples will have datasets with exactly 3 groups in them, which conveniently necessitates as many comparisons as there are groups: A vs. B, B vs. C, and A vs. C. In larger datasets, however, the number of groups may be so large that it would be impossible to count the number of comparisons that you need off the top of your head. Instead, use this formula to compute the number of orthogonal (i.e., non-repeating) contrasts required:

\(\# Comparisons = \frac{K (K - 1)}{2}\)

For the InsectSprays dataset, there will be \(15\), or \(\frac{6(6-1)}{2}\) pairwise comparisions.

Tukey’s HSD

A classic posthoc comparison is Tukey’s Honest Significant Differences (HSD). Like the one-way ANOVA, it is easy to call and submit data to the analysis. In this case, the only parameter we will put into the function is the ANOVA that we saved earlier, i.e. ins.aov.

TukeyHSD(ins.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##            diff        lwr       upr     p adj
## B-A   0.8333333  -3.866075  5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A  -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A   2.1666667  -2.532742  6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B   1.3333333  -3.366075  6.032742 0.9603075
## D-C   2.8333333  -1.866075  7.532742 0.4920707
## E-C   1.4166667  -3.282742  6.116075 0.9488669
## F-C  14.5833333   9.883925 19.282742 0.0000000
## E-D  -1.4166667  -6.116075  3.282742 0.9488669
## F-D  11.7500000   7.050591 16.449409 0.0000000
## F-E  13.1666667   8.467258 17.866075 0.0000000

Here, we can see the 15 pairwise comparisons for the 6 spray groups. diff is the difference in means between the two groups, lwr is the lower estimate of the 95% confidence interval of the difference in means, upr is the upper estimate of the same 95% confidence interval, and p adj is the significance of the test after correcting for family-wise error rate. The adjustment of the p-value is necessary in controlling for Type 1 error inflation.

Out of the 18 pairwise comparisons, 9 exhibited statistically significant differences between means (\(p < .05\)) while 6 did not (\(p > .05\)). Using the output from Tukey’s HSD as well as the box plot shown previously, you can guess which insecticides do not differ significantly from each other.

Multiple Comparisons - General Linear Hypothesis Test

Another way to run the same post-hoc test is the multiple comparisons approach using the general linear hypothesis test, which is supported by the multicomp package. Take a moment to go ahead and load the package in your R environment (assuming you have already installed the package):

require(multcomp)

The general linear hypothesis tests function is built into the multicomp package and can be called using glht. The function is simple - glht(model, lincft = mcp(group = "Method")) - and only requires specification of the ANOVA model (or model in the dummy code) as well as the linear hypotheses to be tested (linfct). For this particular example, we will indicate that the groups we want to compare the different sprays (spray) and that we want to correct the critical value using the Tukey Studentized Range statistic, which multiplies the tabled critical value by \(\sqrt{2}\).

By replacing the terms in the dummy code with ones from the current dataset, we can derive the results of the test:

summary(glht(ins.aov, linfct = mcp(spray = "Tukey")))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)    
## B - A == 0   0.8333     1.6011   0.520    0.995    
## C - A == 0 -12.4167     1.6011  -7.755   <0.001 ***
## D - A == 0  -9.5833     1.6011  -5.985   <0.001 ***
## E - A == 0 -11.0000     1.6011  -6.870   <0.001 ***
## F - A == 0   2.1667     1.6011   1.353    0.754    
## C - B == 0 -13.2500     1.6011  -8.276   <0.001 ***
## D - B == 0 -10.4167     1.6011  -6.506   <0.001 ***
## E - B == 0 -11.8333     1.6011  -7.391   <0.001 ***
## F - B == 0   1.3333     1.6011   0.833    0.960    
## D - C == 0   2.8333     1.6011   1.770    0.492    
## E - C == 0   1.4167     1.6011   0.885    0.949    
## F - C == 0  14.5833     1.6011   9.108   <0.001 ***
## E - D == 0  -1.4167     1.6011  -0.885    0.949    
## F - D == 0  11.7500     1.6011   7.339   <0.001 ***
## F - E == 0  13.1667     1.6011   8.223   <0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Compare these results with the ones we got using Tukey’s HSD and you will notice that they are exactly the same.

Scheffe’s Test

While this method is only beginning to be taught in behavioral statistics classes, Scheffe’s method is an extremely powerful post-hoc comparisons test. Many professors will generally say that Scheffe’s Test is the preferred stepdown procedure, although the realities of this claim is much more nuanced than I will allow myself to get into in this document.

In general, Scheffe’s method accounts for family-wise error rate by weighting the test statistic (\(C\)) by the mean squared error (\(MSE\)), the between-samples DF (\(k - 1\)), and group sizes (if inequal):

\(C = \sqrt{(k - 1)~F~MSE~(\frac{1}{n_i} + \frac{1}{n_j})}\)

This statistic is used to build a confidence interval around \(C\) using the pre-specified alpha (\(\alpha\)) value.

In r, to run this statistic, you must first load the DescTools package:

require(DescTools)

From there, the syntax is quite simple to run: ScheffeTest(x, which = NULL, contrasts = NULL, conf.level = 0.95), where x is the fitted ANOVA model objects (ins.aov), which is the specification of the terms for which confidence intervals are calculated (defaults to all contrasts), and contrasts is a matrix specifying the contrasts to be computed (factor levels x number of contrasts), and conf.level is the pre-determined confidence level (defaults to 95%).

For our test, Scheffe’s method is quite easy:

ScheffeTest(ins.aov)

## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $spray
##            diff     lwr.ci    upr.ci    pval    
## B-A   0.8333333  -4.659440  6.326107  0.9981    
## C-A -12.4166667 -17.909440 -6.923893 2.7e-08 ***
## D-A  -9.5833333 -15.076107 -4.090560 2.1e-05 ***
## E-A -11.0000000 -16.492773 -5.507227 8.0e-07 ***
## F-A   2.1666667  -3.326107  7.659440  0.8699    
## C-B -13.2500000 -18.742773 -7.757227 3.6e-09 ***
## D-B -10.4166667 -15.909440 -4.923893 3.1e-06 ***
## E-B -11.8333333 -17.326107 -6.340560 1.1e-07 ***
## F-B   1.3333333  -4.159440  6.826107  0.9827    
## D-C   2.8333333  -2.659440  8.326107  0.6802    
## E-C   1.4166667  -4.076107  6.909440  0.9773    
## F-C  14.5833333   9.090560 20.076107 1.4e-10 ***
## E-D  -1.4166667  -6.909440  4.076107  0.9773    
## F-D  11.7500000   6.257227 17.242773 1.3e-07 ***
## F-E  13.1666667   7.673893 18.659440 4.4e-09 ***
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, we see that exactly 9 contrasts are significant while 6 are not.

Pairwise T-Tests

Perhaps you have decided that using the previous methods are not for you. Instead, you may decide to run pairwise t-tests using Bonferroni’s methods.

Again, the base code is quite simple - pairwise.t.test(DV, IV, p.adjust.method = "method"). In our example, we would specify the DV as the number of living bugs remaining in a plot after being sprayed (count) and the IV as the different sprays used in the study (spray).

Differently from the other post-hoc analyses described here, we are going to specify that we want to statistically account for multiple comparisons by adjusting our alpha using the Bonferroni correction. This correction is quite simple and just divides the “normal” alpha value (\(\alpha\)) by the number of comparisons being run (\(m\)):

\(\alpha_{corr} = \frac{\alpha}{m}\)

Thus, the critical alpha value for this test using the Bonferroni correction is \(.003\), or \(\frac{.05}{15}\).

Let’s go ahead and calculate the results for our paired t-tests:

pairwise.t.test(InsectSprays$count, InsectSprays$spray, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  InsectSprays$count and InsectSprays$spray 
## 
##   A       B       C       D       E      
## B 1       -       -       -       -      
## C 1.1e-09 1.3e-10 -       -       -      
## D 1.5e-06 1.8e-07 1       -       -      
## E 4.1e-08 4.9e-09 1       1       -      
## F 1       1       4.2e-12 6.1e-09 1.6e-10
## 
## P value adjustment method: bonferroni

Despite the differences in methodology, the results are the same as the other post-hoc comparisons: 9 comparisons are statistically significant while 6 are not.

ANOVA Assumptions

Remember way back at the beginning of this document when I said that ANOVAs are based off of three main assumptions? They are actually quite important for the ANOVA and should be tested with each model.

Here they are again, in case you forgot:

The residuals from the data are normally distributed.
The variances of the sampled populations are equal.
The data are independent and normally distributed.

With the correct sampling procedures, we can usually count on the assumption of independent and normally distributed data (#3) to be true. What is not quite within our hands, however, are the first two assumptions. Here, we will use both statistical tests as well as “eyeball” tests to determine whether our ANOVA violates a critical assumption.

Homoscedasicity

In order for us to assume that we are comparing two sets of data sampled from the same general population, it is necessary to determine if the two samples have similar variances. Otherwise, we cannot make any strong inferences about the test because we cannot rule out the possibility that the data were sampled from separate, and potentially orthogonal, populations.

The first order of business in testing this assumption is running Levene’s Test, which is a mathematical way of determining homogeneity of variance. This step is so important that many statistical computing packages (such as SPSS) just go ahead and throw in Levene’s Test any time that you run an ANOVA.

To run the test in R, we simply load the car package into our computing environment:

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:DescTools':
## 
##     Recode

And then run Levene’s Test using this simple syntax - leveneTest(DV ~ IV, data = "dataset") - where DV specifies the outcome (dependent) measure, IV specifies the independent variables, and data specifies the dataset that we are testing. With a few simple substitutions, we can run it on our insect spray data:

leveneTest(count ~ spray, data = InsectSprays)

In all cases, you want Levene’s Test statistic to be non-significant. In the case that it is significant (as it is here), you can either a) ignore this violation based on your own a priori knowledge of the distributional characteristics of the population being sampled or b) relax the assumption of homoscedasicity and run the Welch one-way test, which does not require that assumption.

Another way to examine this assumption is to plot the residuals vs. the fitted values for all grounds. In general, you do not want the average of the values (i.e. the red line) to deviate much, if at all, from zero. Here, we can use both Levene’s Test and the plot of the residuals to determine that the variances are inequal in the current sample.

plot(ins.aov,1)

Normality of Residuals

Finally, we can test the assuption that the residuals in our data are normally distributed by graphing the standardized residuals by their theoretical quantiles. In a perfect dataset, these values would create a perfect diagonal line from quadrant III to quadrant I:

plot(ins.aov, 2)

If there is deviation from the “ideal” diagonal, you can run the Shapiro-Wilk test on the residuals to mathematically determine if the assumption has been violated:

ins.residuals <- residuals(object = ins.aov)
shapiro.test(x = ins.residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  ins.residuals
## W = 0.96006, p-value = 0.02226

In the general linear model (the basis of most statistical tests), the assumption of normally distributed residuals applies only to the error term (i.e., noise in the model) and not to the independent variables, so in theory, violations of this assumption have minimal consequences, if any at all.