March 05, 2019

Analysis of variance (ANOVA)

## Using the built-in data set PlantGrowth 
data(PlantGrowth) 
str(PlantGrowth)
## 'data.frame':    30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(PlantGrowth) 
##      weight       group   
##  Min.   :3.590   ctrl:10  
##  1st Qu.:4.550   trt1:10  
##  Median :5.155   trt2:10  
##  Mean   :5.073            
##  3rd Qu.:5.530            
##  Max.   :6.310
Week 2

Analysis of variance (ANOVA)

Always carry out sanity checks (str, summary) and plot your data before you get carried away with statistical modelling. What do you conclude looking at the boxplot below?

Week 2

Analysis of variance (ANOVA)

If we carry out an ANOVA with the aov command, we obtain a summary stating an overall P-value for the predictor variable. So, it only indicates whether the predictor variable had a statistically significant effect but it does not tell us where those differences lie, i.e. which of the levels of the predictor variable differ significantly from each other. They could all differ signficantly from each other but a single significant difference between any two of the factor levels is enough to give an overall significance for the effect of the predictor variable.

##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Week 2

Analysis of variance (ANOVA)

Explanatory variable(s) Model name
1 One-way ANOVA
2 Two-way ANOVA
3 Three-way ANOVA
n-way ANOVA
Week 1

Analysis of variance (ANOVA)

An analysis of variance with an explanatory variable that has four levels (e.g. habitat type: estuarine, dryland, wetland, swamp), is called a:

  1. one-way ANOVA
  2. four-way ANOVA
  3. three-way ANOVA using estuarine habitat as reference level
  4. None of the above

Analysis of variance (ANOVA)

An analysis of variance with an explanatory variable that has four levels (e.g. habitat type: estuarine, dryland, wetland, swamp), is called a:

  1. one-way ANOVA
  2. four-way ANOVA
  3. three-way ANOVA using estuarine habitat as reference level
  4. None of the above

Analysis of variance (ANOVA)

An analysis of variance with two explanatory variables, each with three levels (e.g. fertiliser: low, med, high; plant species: A, B, C), is called a:

  1. three-way ANOVA
  2. six-way ANOVA
  3. two-way ANOVA
  4. None of the above

Analysis of variance (ANOVA)

An analysis of variance with two explanatory variables, each with three levels (e.g. fertiliser: low, med, high; plant species: A, B, C), is called a:

  1. three-way ANOVA
  2. six-way ANOVA
  3. two-way ANOVA
  4. None of the above

Analysis of variance (ANOVA)

Interpretation of a classical one-way ANOVA table associated with the aov.

Week 2

Analysis of variance (ANOVA)

We can also run an analysis of variance using the lm function, which results in the output below.  

 

As you can see this output style returns partial pairwise-comparisons. The intercept term gives the mean of the reference group (which is determined alphabetically by default but can be set using factor(x, levels = ...)!) and the associated P-value tests if its value is significantly different from zero. The remaining parameter estimates give the differences between the means of the respective factor levels and the reference mean. The associated P-values test whether these differences betweenmeans are significantly different from zero. If this is not the case, then obsviously there is no statistically significant difference.

Week 2

Analysis of variance (ANOVA)

We can quickly check this by calculating the means and their differences.  

library(dplyr) 
treat.means <- group_by(PlantGrowth, group) %>% 
  summarise(aver = mean(weight)) %>% mutate(diff = aver - aver[1]) 
treat.means
## # A tibble: 3 x 3
##   group  aver   diff
##   <fct> <dbl>  <dbl>
## 1 ctrl   5.03  0    
## 2 trt1   4.66 -0.371
## 3 trt2   5.53  0.494
## We could also have computed the desired differences like this: #
treat.means$aver[2:3] - treat.means$aver[1] 
## [1] -0.371  0.494
Week 2

Post-hoc analysis (multiple comparison test)

Post-hoc originates from Latin meaning `after this' and involves multiple pairwise comparisons to ascertain differences between the levels of a factor (following an overall significant effect determined by a preceding, overarching model). Because each comparison running under the hood of a post-hoc analysis is essentially a separate statistical test, the Type I error probability goes through the roof with increasing number of comparisons. To counteract the inflation of Type I errors, we need to adjust the resulting P-values.
Week 2

Post-hoc analysis (multiple comparison test)

The most conservative and easiest to understand adjmustment method is the Bonferroni correction which simply multiplies the P-values by the number of comparisons. This method relies on the family-wise error rate, the probability of at least one Type I error (family = a series of comparisons). It is inferior to more modern methods and should hence not be used anymore.
## Dummy P-values from a post-hoc analysis 
p <- c(0.04, 0.001, 0.01, 0.021) 
p * 4 
## [1] 0.160 0.004 0.040 0.084
p.adjust(p = p, method = "bonferroni") 
## [1] 0.160 0.004 0.040 0.084
Week 2

Post-hoc analysis (multiple comparison test)

An more modern, alternative approach to controlling the family-wise error rate is to control the false discovery rate, which is based on the expected proportion of Type I errors among the rejected null hypotheses (false positives; Benjamini & Hochberg 1995).

## Dummy P-values from a post-hoc analysis 
p <- c(0.04, 0.001, 0.01, 0.021) 
p.adjust(p = p, method = "BH") # Benjamini & Hochberg method
## [1] 0.040 0.004 0.020 0.028
Benjamini Y & Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
Week 2

Post-hoc analysis (multiple comparison test)

The contemporary approach is to combine the functionality of the emmeans and the multcomp package.

library(emmeans)
library(multcomp)
m1 <- lm(weight ~ group, data = PlantGrowth)

ems <- emmeans(object = m1, specs = "group")
pairs(ems)
##  contrast    estimate    SE df t.ratio p.value
##  ctrl - trt1    0.371 0.279 27  1.331  0.3909 
##  ctrl - trt2   -0.494 0.279 27 -1.772  0.1980 
##  trt1 - trt2   -0.865 0.279 27 -3.103  0.0120 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
summary(object = as.glht(pairs(ems)), test = adjusted(type = "BH"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)  
## ctrl - trt1 == 0   0.3710     0.2788   1.331   0.1944  
## ctrl - trt2 == 0  -0.4940     0.2788  -1.772   0.1315  
## trt1 - trt2 == 0  -0.8650     0.2788  -3.103   0.0134 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- BH method)
Week 2

Post-hoc analysis (multiple comparison test)

…continued.

ems <- emmeans(object = m1, specs = "group")
## pairs(ems)
summary(object = as.glht(pairs(ems)), test = adjusted(type = "BH"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)  
## ctrl - trt1 == 0   0.3710     0.2788   1.331   0.1944  
## ctrl - trt2 == 0  -0.4940     0.2788  -1.772   0.1315  
## trt1 - trt2 == 0  -0.8650     0.2788  -3.103   0.0134 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- BH method)
Week 2

Post-hoc analysis for two-way ANOVA with interaction

m1 <- lm(p ~ loc * horizon, data = soil)
anova(m1)
## Analysis of Variance Table
## 
## Response: p
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## loc          1  41993   41993 22.0146 2.282e-05 ***
## horizon      2  62683   31342 16.4305 3.665e-06 ***
## loc:horizon  2  17697    8848  4.6386    0.0144 *  
## Residuals   48  91561    1908                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Week 2

Post-hoc analysis for two-way ANOVA with interaction

Comparing habitats within soil horizons.

ems <- emmeans(object = m1, specs = "loc", by = "horizon")
summary(object = as.glht(pairs(ems)), test = adjusted(type = "BH"))

Week 2

Post-hoc analysis for two-way ANOVA with interaction

And now the other way around…

ems <- emmeans(object = m1, specs = "horizon", by = "loc")
summary(object = as.glht(pairs(ems)), test = adjusted(type = "BH"))

Week 2