Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among group means in a sample. ANOVA was developed by statistician and evolutionary biologist Ronald Fisher. The ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

Weight Gain data

## Loading required package: tools

The data arise from an experiment to study the gain in weight of rats fed on four different diets, distinguished by amount of protein (low and high) and by source of protein (beef and cereal).

What data looks like

head(weightgain[19:23,], n=5)
##    source type weightgain
## 19   Beef High        117
## 20   Beef High        111
## 21 Cereal  Low        107
## 22 Cereal  Low         95
## 23 Cereal  Low         97

Now some stats

print("mean:")
## [1] "mean:"
tapply(weightgain$weightgain,list(weightgain$source, weightgain$type), mean)
##         High  Low
## Beef   100.0 79.2
## Cereal  85.9 83.9
print("standard deviation:")
## [1] "standard deviation:"
tapply(weightgain$weightgain, list(weightgain$source, weightgain$type), sd)
##            High      Low
## Beef   15.13642 13.88684
## Cereal 15.02184 15.70881

There seems to be differences on weight gain by the infuences of the two factors, but how can we know if the observed differences aren’t from random effect?

To answer this question, we can apply ANOVA to the data. The model formula specifies a two-way layout with interaction term between the two factor variables

wg_aov <- aov(weightgain ~ source * type, data = weightgain) #"*" between source and type specify interaction term required. If not required, use "+" symbol

The estimates of the intercept and the main and interaction effects can be extracted from the model fit by

coef(wg_aov)
##          (Intercept)         sourceCereal              typeLow 
##                100.0                -14.1                -20.8 
## sourceCereal:typeLow 
##                 18.8

Plot of mean weight gain for each level of the two factors

plot.design(weightgain)

Lastly Summary of the ANOVA model:

summary(wg_aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## source       1    221   220.9   0.988 0.3269  
## type         1   1300  1299.6   5.812 0.0211 *
## source:type  1    884   883.6   3.952 0.0545 .
## Residuals   36   8049   223.6                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, source’s p-value is too large, but type and the interaction term (between source and type) seems to be pretty significant (PR(>F) <0.1), and let’s examine the interaction between the two factor variables.

interaction.plot(weightgain$type, weightgain$source, weightgain$weightgain)

Aha! This plot reveals that the source variable queal to cereal does not have much impact on the weight gain - rather a flat negatively sloped line, whereas steep neg slope is observed when source equals beef (moving from high to low from the ‘type’ factor), thus the low p-value for the interaction term, but not the ‘source’ variable by itself.

Foster Feeding of Rats of Different Genotype

Data:

head(foster[46:48,])
##    litgen motgen weight
## 46      I      J   54.5
## 47      J      A   59.0
## 48      J      A   57.4

The data are from a foster feeding experiment with rat mothers and litters of four different genotypes. The measurement is the litter weight after a trial feeding period.

We can derive the two analyses of variance tables for the foster feeding

summary(aov(weight ~ litgen * motgen, data = foster))
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## litgen         3   60.2   20.05   0.370 0.77522   
## motgen         3  775.1  258.36   4.763 0.00574 **
## litgen:motgen  9  824.1   91.56   1.688 0.12005   
## Residuals     45 2440.8   54.24                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now let’s switch the order of the two factors

summary(aov(weight ~ motgen * litgen, data = foster))
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## motgen         3  771.6  257.20   4.742 0.00587 **
## litgen         3   63.6   21.21   0.391 0.76000   
## motgen:litgen  9  824.1   91.56   1.688 0.12005   
## Residuals     45 2440.8   54.24                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot.design(foster)

There seems to be small differences in the sum of squares for the two main effects, and consequentlym in the associated F-tests and p-values. We can investigate the effect of genotype B on litter weight in more detail by the use of multiple comparison procedures. Such procedures allow a comparison of all pairs of levels of a factor while maintaining the nominal significance level at its specified value and producing adjusted confidence intervals for mean differences. One such procedure is “Tukey honest significant differences”.

unique(foster$motgen)
## [1] A B I J
## Levels: A B I J
foster_aov <- aov(weight ~ litgen * motgen, data = foster)
foster_hsd <- TukeyHSD(foster_aov, "motgen")
foster_hsd
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ litgen * motgen, data = foster)
## 
## $motgen
##          diff        lwr        upr     p adj
## B-A  3.330369  -3.859729 10.5204672 0.6078581
## I-A -1.895574  -8.841869  5.0507207 0.8853702
## J-A -6.566168 -13.627285  0.4949498 0.0767540
## I-B -5.225943 -12.416041  1.9641552 0.2266493
## J-B -9.896537 -17.197624 -2.5954489 0.0040509
## J-I -4.670593 -11.731711  2.3905240 0.3035490

As you can see from the Tuckey test, out of the 6 unique combinations of the factor “mortgen”, only “J and B”’s Confidence interval does NOT contain 0, implying significance at 95% Confidence level or at 5% alpha.

Now let’s plot the result:

plot(foster_hsd)

Water Hardness and Mortality

Data:

head(water, n=4)
##   location       town mortality hardness
## 1    South       Bath      1247      105
## 2    North Birkenhead      1668       17
## 3    South Birmingham      1466        5
## 4    North  Blackburn      1800       14

The mortality and drinking water hardness for 61 cities in England and Wales.

Let’s assess the differences of both hardness and mortality in the North or South. The hypothesis that the two-dimensional mean-vector of water hardness and mortality is the same for cities in the North and the South can be tested by Hotelling-Lawley test in a multivariate analysis of variance framework. The manova can be used to fit such a model and the corresponding summary method performs the test specified by the test argument

summary(manova(cbind(hardness, mortality) ~ location, data = water), test = "Hotelling-Lawley")
##           Df Hotelling-Lawley approx F num Df den Df    Pr(>F)    
## location   1          0.90021   26.106      2     58 8.217e-09 ***
## Residuals 59                                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very small, suggesting strong evidence that the mean vectors of the two variables are not the same in the two regions.

Let’s look at the sample means separately:

tapply(water$hardness, water$location, mean)
##    North    South 
## 30.40000 69.76923
tapply(water$mortality, water$location, mean)
##    North    South 
## 1633.600 1376.808

we see large differences in the two regions both in water hardness and mortality, where low mortality is associated with hard water in the South and high mortality with soft water in the North

Male Egyptian Skulls

Data:

head(skulls, n=4)
##     epoch  mb  bh bl nh
## 1 c4000BC 131 138 89 49
## 2 c4000BC 125 131 92 48
## 3 c4000BC 131 132 99 50
## 4 c4000BC 119 132 96 44

The data describe measurements made on Egyptian skulls from five epochs.

means <- aggregate(skulls[,c("mb", "bh", "bl", "nh")], list(epoch = skulls$epoch), mean)
means
##     epoch       mb       bh       bl       nh
## 1 c4000BC 131.3667 133.6000 99.16667 50.53333
## 2 c3300BC 132.3667 132.7000 99.06667 50.23333
## 3 c1850BC 134.4667 133.8000 96.03333 50.56667
## 4  c200BC 135.5000 132.3000 94.53333 51.96667
## 5  cAD150 136.1667 130.3333 93.50000 51.36667

There appear to be quite large differences between the epoch means. We can now test for a difference more formally by using MANOVA with the following R code to apply each of the four possible test criteria mentioned earlier

skulls_manova <- manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls)
summary(skulls_manova, test = "Pillai")
##            Df  Pillai approx F num Df den Df    Pr(>F)    
## epoch       4 0.35331    3.512     16    580 4.675e-06 ***
## Residuals 145                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(skulls_manova, test = "Wilks")
##            Df   Wilks approx F num Df den Df   Pr(>F)    
## epoch       4 0.66359   3.9009     16 434.45 7.01e-07 ***
## Residuals 145                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(skulls_manova, test = "Hotelling-Lawley")
##            Df Hotelling-Lawley approx F num Df den Df    Pr(>F)    
## epoch       4          0.48182    4.231     16    562 8.278e-08 ***
## Residuals 145                                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(skulls_manova, test = "Roy")
##            Df    Roy approx F num Df den Df    Pr(>F)    
## epoch       4 0.4251    15.41      4    145 1.588e-10 ***
## Residuals 145                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value associated with each four test criteria is very small and there is strong evidence that the skull measurements differ between the five epochs. We might now move on to investigate which epochs differ and on which variables. We can look at the univariate F-tests for each of the four variables by using the code

summary.aov(skulls_manova)
##  Response mb :
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## epoch         4  502.83 125.707  5.9546 0.0001826 ***
## Residuals   145 3061.07  21.111                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response bh :
##              Df Sum Sq Mean Sq F value  Pr(>F)  
## epoch         4  229.9  57.477  2.4474 0.04897 *
## Residuals   145 3405.3  23.485                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response bl :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## epoch         4  803.3 200.823  8.3057 4.636e-06 ***
## Residuals   145 3506.0  24.179                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response nh :
##              Df Sum Sq Mean Sq F value Pr(>F)
## epoch         4   61.2  15.300   1.507 0.2032
## Residuals   145 1472.1  10.153

We see that the results for the maximum breadths (mb) and basialiveolar length (bl) are highly significant, with those for the other two variables, in particular for nasal heights (nh), suggesting little evidence of a difference. To look at the pairwise multivariate tests (any of the four test criteria are equivalent in the case of a one-way layout with two levels only) we can use the summary method and manova function as follows:

summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls, subset = epoch %in% c("c4000BC", "c3300BC")))
##           Df   Pillai approx F num Df den Df Pr(>F)
## epoch      1 0.027674  0.39135      4     55 0.8139
## Residuals 58
summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls, subset = epoch %in% c("c4000BC", "c1850BC")))
##           Df  Pillai approx F num Df den Df  Pr(>F)  
## epoch      1 0.18757   3.1744      4     55 0.02035 *
## Residuals 58                                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls, subset = epoch %in% c("c4000BC", "c200BC")))
##           Df  Pillai approx F num Df den Df    Pr(>F)    
## epoch      1 0.30297   5.9766      4     55 0.0004564 ***
## Residuals 58                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls, subset = epoch %in% c("c4000BC", "cAD150")))
##           Df  Pillai approx F num Df den Df    Pr(>F)    
## epoch      1 0.36182   7.7956      4     55 4.736e-05 ***
## Residuals 58                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To keep the overall significance level for the set of all pairwise multivariate tests under some control (and still maintain a reasonable power), it is recommended to setting the nominal level alpha = 0.15 and carrying out each test at the alpha/m level where m is the number of tests performed. The results of the four pairwise tests suggest that as the epochs become further separated in time the four skull measurements become increasingly distinct.