We need ANOVA because we can’t just use t tests to compare more than two samples. A regular t test only uses the information in two samples at a time, and running one for every pair of samples artificially raises the true probability of a Type I error:


number of samples   number of pairwise tests needed   true \(\alpha\), if we start at 0.05
        2                         1                                 0.05
        3                         3                                 0.14
        4                         6                                 0.26
        5                        10                                 0.40
        6                        15                                 0.54
       10                        45                                 0.90


So, if we have 10 samples, we’d have to do 45 two-sample t tests, and the true probability of falsely rejecting \(H_0\) at least once would be much, much higher than our desired \(\alpha\) of 0.05. I would call that cheating the t test, so we need an alternative if we want to compare more than 2 means at the same time.
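
Where do those numbers come from? If the pairwise tests were independent (an approximation, but it makes the point), the chance of at least one false rejection in m tests is \(1 - (1 - 0.05)^m\):

# Familywise Type I error rate for m independent tests at alpha = 0.05
m <- choose(c(2, 3, 4, 5, 6, 10), 2)   # pairwise tests needed for k samples
round(1 - 0.95^m, 2)
## [1] 0.05 0.14 0.26 0.40 0.54 0.90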


ANOVA

The ANalysis Of VAriance compares the means of several samples simultaneously. The first design we’ll learn is called one-way or single-factor ANOVA. This is used when we are looking for differences among sample means only. Later we should be able to tackle more complicated designs that include blocking.

Despite the name, ANOVA tests for differences in means while taking their SDs into account. It tests these hypotheses:

\(H_0\): \(\mu_1 = \mu_2 = \dots = \mu_k\) (all group means are equal)
\(H_A\): at least one \(\mu_i\) differs from the others

To test them, ANOVA partitions the observed variation into three pieces:

  • the groups (among) variation: the difference between each sample mean and the grand mean
  • the error variation: the spread within each sample, comparing each observation with its sample mean
  • the total variation: the dispersion from the grand mean (of all the observations in all groups)

From these, we get the mean squares that are compared in the F test.

  • error mean square (MS\(_e\)): the pooled variation within groups
  • group mean square (MS\(_g\)): the variation of the group means around the grand mean

If the null hypothesis were true, then MS\(_{groups}\) should be about equal to MS\(_{error}\); but if the variation among the groups is significantly greater than the variation within the groups (based on the F table), then there are real differences between at least one mean and at least one of the others.

The variance ratio (F) is \(F=\frac{MS_{groups}}{MS_{error}}\). For degrees of freedom, MS\(_{groups}\) has DF k – 1 and MS\(_{error}\) has DF N – k, where k is the number of groups and N is the total number of observations. It’s traditional to summarize all the calculated values in an ANOVA table, which has these parts:


Source of variation   Sum of squares        DF      Mean squares   F                   P
Groups (treatment)    SS\(_g\)              k – 1   SS\(_g\)/DF    MS\(_g\)/MS\(_e\)   from the F table
Error                 SS\(_e\)              N – k   SS\(_e\)/DF
Total                 SS\(_g\) + SS\(_e\)   N – 1
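
Here is a minimal sketch of where each table entry comes from, computed by hand in R on a small made-up response y and grouping factor g (hypothetical data, just for illustration):

# Hypothetical toy data: y is the response, g is the grouping factor
y <- c(10, 12, 11, 15, 14, 16, 20, 21, 19)
g <- factor(rep(c("A", "B", "C"), each = 3))

k <- nlevels(g)                   # number of groups
N <- length(y)                    # total number of observations
grandMean <- mean(y)
groupMeans <- tapply(y, g, mean)

SSg <- sum(table(g) * (groupMeans - grandMean)^2)  # among-group SS
SSe <- sum((y - groupMeans[g])^2)                  # within-group (error) SS

MSg <- SSg / (k - 1)
MSe <- SSe / (N - k)
Fratio <- MSg / MSe
pf(Fratio, k - 1, N - k, lower.tail = FALSE)       # the P-value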


If we do reject \(H_0\), then we do multiple comparisons to find out where the differences lie.


Doing ANOVA in R

Consider this example of 30 mice, each fed one of three diets (Control, Junk food, or Health food), to compare differences in their masses.

diets <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/diets%20(2).csv"))

# Calculate the sample statistics and view

meanDiets <- tapply(diets$mass, diets$group, mean)
sdevDiets <- tapply(diets$mass, diets$group, sd)
n <- tapply(diets$mass, diets$group, length)
data.frame(mean = meanDiets, std.dev = sdevDiets, n = n)
##              mean   std.dev  n
## Control     10.28 0.5750362 10
## Health food  8.58 0.8664102 10
## Junk food   12.21 1.1337744 10
# Show the data in a strip chart

stripchart(mass ~ group, data = diets, method = "jitter", vertical = TRUE,
           xlab = "Diet type", col = c("red", "darkgreen", "black"))

# You can add standard error bars to the strip chart, but it takes some steps
# and uses the objects created above

seDiets <- sdevDiets / sqrt(n)
adjustAmount <- 0.15   ## This moves the bars over from the point cloud
segments(c(1,2,3) + adjustAmount, meanDiets - seDiets,
         c(1,2,3) + adjustAmount, meanDiets + seDiets)
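
If you also want to mark the group means next to the bars, points() can reuse the same objects:

# Add the group means as filled points beside the error bars
points(c(1,2,3) + adjustAmount, meanDiets, pch = 16)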

dietsANOVA <- lm(diets$mass ~ diets$group)
anova(dietsANOVA)
## Analysis of Variance Table
## 
## Response: diets$mass
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## diets$group  2 65.973  32.986  41.812 5.389e-09 ***
## Residuals   27 21.301   0.789                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

That value for F is pretty huge (usually you’ll find that F > 4 or so will be significant), and we see that P < 0.001. We can reject \(H_0\), meaning that at least one of these means is different from another.
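
If you want the exact cutoff instead of the “F > 4 or so” rule of thumb, qf() looks it up for the relevant degrees of freedom (2 and 27 in this design):

qf(0.95, df1 = 2, df2 = 27)   # critical F at alpha = 0.05
## [1] 3.354131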


If you wanted to know what fraction of the variation is described by the treatment groups, use this:

dietsANOVAsummary <- summary(dietsANOVA)

dietsANOVAsummary$r.squared
## [1] 0.7559287

So, about 76% of the variation in the total sample is described by the grouping.
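
This is the same \(R^2\) you could compute from the ANOVA table, since it is just SS\(_g\) divided by the total SS (the tiny difference from above is just rounding in the printed table):

# Same value, using the sums of squares printed in the ANOVA table
65.973 / (65.973 + 21.301)
## [1] 0.7559296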


The Kruskal-Wallis test: Nonparametric ANOVA

Like all statistical tests, ANOVA has some assumptions:

  • random sampling
  • the variable is normally distributed in all k populations that the samples come from
  • equal variances in all k populations

ANOVA is a robust test with large sample sizes, and it can handle unequal variances if n is the same in each sample.
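
Before trusting those assumptions, you can check them informally on the model we already fit; bartlett.test() is one standard option for the equal-variance assumption:

# Informal assumption checks on the diets model from above
hist(residuals(dietsANOVA))     # roughly bell-shaped?
qqnorm(residuals(dietsANOVA))   # roughly a straight line?
qqline(residuals(dietsANOVA))

bartlett.test(mass ~ group, data = diets)   # H0: all k variances are equal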

If our samples fail these assumptions, we can use the Kruskal-Wallis test, another rank-order test (like the Wilcoxon/Mann-Whitney U was). From it, you will calculate a \(\chi^2\) value, using k – 1 degrees of freedom.


For those same data, it looks like this:

kruskal.test(mass ~ group, data = diets)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mass by group
## Kruskal-Wallis chi-squared = 22.945, df = 2, p-value = 1.041e-05

It’s still a very significant result.
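
If you then want nonparametric pairwise follow-ups, one option (there are others) is pairwise Wilcoxon tests with a correction for multiple comparisons:

# Pairwise Mann-Whitney U tests with Holm-adjusted P-values
pairwise.wilcox.test(diets$mass, diets$group, p.adjust.method = "holm")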


Pairwise tests using Tukey’s HSD

If we reject H\(_0\), all that tells us is that at least one of those sample means is different from one other sample mean. This is nice, but it doesn’t really give us the information that we want. Now that we’ve calculated the \(MS_e\), we can look for where those differences are without artificially raising our true \(\alpha\).

John Tukey was a well-regarded American statistician. His Honest Significant Difference test earnestly gives you the pairwise comparisons between samples. If you failed to reject \(H_0\) in the ANOVA, all of the HSD results would be insignificant too. If you did reject H\(_0\) in the ANOVA, at least one of the sample means will be shown to be different from another one.

There are a few ways to do this, but I’ll demonstrate using base R rather than the way W&S suggest. Doing it this way requires you to fit the model with the aov function first.

dietAOV <- aov(mass ~ group, data = diets)

dietAOV
## Call:
##    aov(formula = mass ~ group, data = diets)
## 
## Terms:
##                    group Residuals
## Sum of Squares  65.97267  21.30100
## Deg. of Freedom        2        27
## 
## Residual standard error: 0.888215
## Estimated effects may be unbalanced
TukeyHSD(dietAOV, ordered = FALSE, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mass ~ group, data = diets)
## 
## $group
##                        diff        lwr        upr     p adj
## Health food-Control   -1.70 -2.6848788 -0.7151212 0.0005982
## Junk food-Control      1.93  0.9451212  2.9148788 0.0001283
## Junk food-Health food  3.63  2.6451212  4.6148788 0.0000000
# diff = difference between groups
# lwr/upr = bounds of the 95% CI
# p adj = the adjusted P-value for each comparison

In this case, we find that all three samples are significantly different from each other, but that won’t always be true.
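
Base R can also plot those family-wise confidence intervals, which makes it easy to see which comparisons cross zero:

# Intervals that cross 0 are not significant at the family-wise level
plot(TukeyHSD(dietAOV))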


Planned comparisons using the multcomp package

Tukey’s HSD is an example of an unplanned comparison: we use it only if a “global” ANOVA finds at least one significant difference among the means. Sometimes we instead make planned comparisons, when we decide in advance to test only a specific comparison. For example, load the InsectSprays dataset:

data("InsectSprays")

summary(InsectSprays)  ## This displays whether the dataset is balanced.
##      count       spray 
##  Min.   : 0.00   A:12  
##  1st Qu.: 3.00   B:12  
##  Median : 7.00   C:12  
##  Mean   : 9.50   D:12  
##  3rd Qu.:14.25   E:12  
##  Max.   :26.00   F:12


What if we only wanted to compare groups A and D? We could do that in base R, but the multcomp package is a little more intuitive once you already have a fitted linear model from a global ANOVA:

library(multcomp)

# First do the global ANOVA using the linear model

sprayANOVA <- lm(count ~ spray, data = InsectSprays)

# Inspect the table, if you like

anova(sprayANOVA)


## Analysis of Variance Table
## 
## Response: count
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## spray      5 2668.8  533.77  34.702 < 2.2e-16 ***
## Residuals 66 1015.2   15.38                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# The planned comparison of just sprays A and D is done by making a new object:

sprayPlanned <- glht(sprayANOVA, linfct = mcp(spray = c("A - D = 0")))

## glht is a function for general linear hypotheses
## linfct says which linear hypotheses are to be tested. 
#### In this case, we're testing the null hypothesis that sprays A and D are the same (that is, = 0) 

## Once that is done, the 95% CI and summary table are easy:

confint(sprayPlanned)

##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: lm(formula = count ~ spray, data = InsectSprays)
## 
## Quantile = 1.9966
## 95% family-wise confidence level
## 
## 
## Linear Hypotheses:
##            Estimate lwr    upr    
## A - D == 0 9.5833   6.3866 12.7801

## For a different CI, specify the level (0.95 is the default)

confint(sprayPlanned, level = 0.99)

##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: lm(formula = count ~ spray, data = InsectSprays)
## 
## Quantile = 2.6524
## 99% family-wise confidence level
## 
## 
## Linear Hypotheses:
##            Estimate lwr    upr    
## A - D == 0 9.5833   5.3366 13.8301

## Here is the results table

summary(sprayPlanned)

##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: lm(formula = count ~ spray, data = InsectSprays)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)    
## A - D == 0    9.583      1.601   5.985 9.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
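
If you want several planned contrasts at once, mcp() accepts a vector of them; the A - D and B - F pair here is just a hypothetical set:

# Two user-defined contrasts tested simultaneously
sprayPlanned2 <- glht(sprayANOVA,
                      linfct = mcp(spray = c("A - D = 0",
                                             "B - F = 0")))
summary(sprayPlanned2)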


That’s it for now!