Material used from Chapter One of Grafen and Hails: Modern Statistics for the Life Sciences
In a simple case we consider the comparison of three means. This is done by the analysis of variance (ANOVA). In this case we will go through an example in detail and work out all the mechanics, but once we have done that and seen how the output is derived from the input we will not need to do it again. We will use R to do the heavy lifting. We will just need to know when it is appropriate to use ANOVA, how to get R to do it and how to interpret the output that R produces.
Suppose we have three fertilizers and wish to compare their efficacy. This has been done in a field experiment where each fertilizer is applied to 10 plots and the 30 plots are later harvested, with the crop yields being calculated. We end up with three groups of 10 figures and we wish to know if there are any differences between these groups.
When we plot the data we see that the fertilizers do differ in the amount of yield produced but that there is also a lot of variation between the plots that were given the same fertilizer.
An ANOVA, where ANOVA stands for ANalysis Of VAriance, attempts to determine whether the differences between the effects of the fertilizers are significant by investigating the variability in the data. We see how the variability between groups compares to the variability within groups.
First we calculate the ‘grand mean’, the mean of the yields across all 30 plots:
## [1] 4.643667
and the total sum of squares of the deviations around this grand mean:

## [1] 36.4449
SST is the total sum of squares. It is the sum of squares of the deviations of the data around the grand mean. This is a measure of the total variability of the dataset.
SSE is the error sum of squares. It is the sum of the squares of the deviations of the data around the three separate group means. This is a measure of the variation between plots that have been given the same fertilizer.
SSF is the fertilizer sum of squares. This is the sum of the squares of the deviations of the group means from the grand mean. This is a measure of the variation between plots given different fertilizers.
When the three group means are fitted, there is an obvious reduction in variability around the three means compared to that around the grand mean, but it is not obvious if the fertilizers have had an effect on yield.
At what point do we decide if the amount of variation explained by fitting the means is significant? By this we mean, "When is the variability between the group means greater than we would expect by chance alone?"
First, we note that SSF and SSE partition between them the total variability in the data:
SST:

## [1] 36.4449

SSF:

## [1] 10.82275

SSE:

## [1] 25.62215

SSF + SSE:

## [1] 36.4449
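As a concrete illustration, here is a minimal sketch of the by-hand calculation in R. It assumes the data are in a data frame called `fertilizer` with a numeric `YIELD` column and a `FERTIL` grouping column (the same names used in the analysis section below).

```{r ss_by_hand}
# Grand mean of all 30 yields
grand.mean <- mean(fertilizer$YIELD)

# Total sum of squares: deviations of every plot from the grand mean
SST <- sum((fertilizer$YIELD - grand.mean)^2)

# Each plot's own group (fertilizer) mean
group.means <- ave(fertilizer$YIELD, fertilizer$FERTIL)

# Error SS: deviations of plots from their own group mean
SSE <- sum((fertilizer$YIELD - group.means)^2)

# Fertilizer SS: deviations of the group means from the grand mean
SSF <- sum((group.means - grand.mean)^2)

SSF + SSE   # equals SST: the sums of squares partition
```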
So the total variability has been divided into two components: that due to differences between plots given different treatments, and that due to differences between plots given the same treatment. Variability must be due to one or the other of these components. Separating the total SS into its component SS is known as partitioning the sums of squares.
A comparison of SSF and SSE is going to indicate whether fitting the three fertilizer means accounts for a significant amount of variability.
However, to make a proper comparison, we really need to compare the variability per degree of freedom ie the variance.
Every sum of squares (SS) has been calculated using a number of independent pieces of information. In each case, we call this number the number of degrees of freedom for the SS.
For SST this number is one less than the number of data points n. This is because when we calculate the deviations of each data point around a grand mean there are only n-1 of them that are independent, since by definition the sum of these deviations is zero, and so when n-1 of them have been calculated, the final one is pre-determined.
Similarly, when we calculate SSF, which measures the deviation of the group means from the grand mean, we have \(k-1\) degrees of freedom (where in the present example \(k\), the number of treatments, is equal to three), since the deviations must sum to zero, so when \(k-1\) of them have been calculated, the last one is pre-determined.
Finally, SSE, which measures deviation around the group means, will have n-k degrees of freedom, since the deviations around each group mean must sum to zero, and so when all but one of them have been calculated, the final one is pre-determined. There are \(k\) group means, so the total degrees of freedom for SSE is n-k.
The degrees of freedom are additive: \[ df(\text{SST}) = df(\text{SSE}) + df(\text{SSF}) \] Check:
\[\begin{align*} df(\text{SST}) &= n-1\\ df(\text{SSF}) &= k-1\\ df(\text{SSE}) &= n-k\\ \therefore df(\text{SSF}) + df(\text{SSE}) &= (k-1) + (n-k)\\ &=n-1\\ &=df(\text{SST}) \end{align*}\]
Now we can calculate the variances, which are a measure of the amount of variability per degree of freedom.
In this context, we call them mean squares. To find each one, we divide each of the sums of squares (SS) by its corresponding degrees of freedom.
Fertilizer Mean Square (FMS) = SSF / (k - 1). This is the variation per df between plots given different fertilizers.
Error Mean Square (EMS) = SSE / (n - k). This is the variation per df between plots given the same fertilizer.
Total Mean Square (TMS) = SST / (n - 1). This is the total variance per df of the dataset.
Unlike the SS, the MS are not additive. That is, FMS + EMS \(\neq\) TMS.
If none of the fertilizers influenced yield, we would expect as much variation between the plots treated with the same fertilizer as between the plots treated with different fertilizers.
We can express this in terms of the mean squares: the mean square for fertilizer would be the same as the mean square for error:
\[ \frac{\text{FMS}}{\text{EMS}}=1 \] We call this ratio the F-ratio. It is the end result of ANOVA.
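Continuing the by-hand sketch from above (assuming the `SSF` and `SSE` objects already computed):

```{r mean_squares_by_hand}
n <- nrow(fertilizer)                   # 30 plots
k <- length(unique(fertilizer$FERTIL))  # 3 fertilizers

FMS <- SSF / (k - 1)   # fertilizer mean square
EMS <- SSE / (n - k)   # error mean square

F.ratio <- FMS / EMS
F.ratio                # about 5.7 for these data
```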
Even if the fertilizers were identical, the F-ratio is unlikely to be exactly 1 - it could by chance take a whole range of values. The F-distribution represents the range and likelihood of all possible F-ratios under the null hypothesis, ie when the fertilizers are identical.
If the fertilizers were very different, then the FMS would be much greater than the EMS and the F-ratio would be greater than one. However it can be quite large even when there are no treatment differences. So how do we decide when the size of the F-ratio is due to treatment rather than to chance?
Traditionally, we decide that it is sufficiently larger than one to be due to treatment differences if it would be this large or larger under the null hypothesis only 5% or less of the time. If we had inside knowledge that the null hypothesis was in fact true then we would still get an F-ratio that large or larger 5% of the time.
Our p-value, ie the probability that the F-ratio would have been as large as it is or larger under the null hypothesis, represents the strength of evidence against the null hypothesis. The smaller it is, the stronger the evidence, and only when it is less than 0.05 do we regard the evidence as strong enough to reject the null.
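Continuing the sketch, the p-value can be read off the F-distribution with \(k-1\) and \(n-k\) degrees of freedom:

```{r f_p_value}
# Probability of an F-ratio this large or larger under the null hypothesis
pf(F.ratio, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
```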
Note that in choosing a threshold value of 0.05 we are adopting a convention. This choice means we accept a Type 1 error rate of 5% - the proportion of times we will reject the null hypothesis when it is in fact true.
We will carry out the ANOVA analysis of the fertilizer data discussed on the previous tab.
Our question is whether yield depends on fertilizer.
What is our null hypothesis?
Start a new notebook with these two code chunks to begin with:
```{r global-options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, warning=FALSE, message=FALSE,echo=FALSE)
```
```{r load packages, message=FALSE,warning=FALSE,echo=FALSE}
library(tidyverse)
library(here)
library(cowplot)
library(gridExtra)
library(ggfortify)
```
Load the fertilizer.csv data into an object called `fertilizer`.
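A sketch of the loading step, following the same pattern used for the melons data later on, and assuming fertilizer.csv sits in the project's data folder:

```{r load_fertilizer}
filepath <- here("data", "fertilizer.csv")
fertilizer <- read_csv(filepath)
glimpse(fertilizer)
```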
Is it tidy data? If not, tidy it.
Convert the FERTIL column to a factor, using this code:
```{r make_factor}
fertilizer <- fertilizer %>%
mutate(FERTIL=as.factor(FERTIL))
```
Make a box plot of yield vs fertilizer, like the one on the previous tab.
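One possible version of the plot, assuming the `YIELD` and `FERTIL` columns loaded above:

```{r yield_boxplot}
fertilizer %>%
  ggplot(aes(x = FERTIL, y = YIELD)) +
  geom_boxplot() +
  labs(x = "Fertilizer", y = "Yield") +
  theme_cowplot()
```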
Now use the lm() function to create the ANOVA model. (There are several ways to do this in R - this is just one.)
fertil.model<-lm(YIELD~FERTIL,data=fertilizer)
Now inspect the model:
anova(fertil.model)
So we find that we can reject the null hypothesis. There is thus evidence that fertilizer does affect yield (ANOVA, df = 2,27, F = 5.7, p < 0.01).
What this test has not done so far is show us where the differences lie. An ANOVA is a holistic test that tells you whether or not there is evidence for a difference between at least one pair of the groups being compared. To identify which groups, if any, are different, we need to do so-called post-hoc tests.
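We will not pursue them here, but as a sketch, one common post-hoc option is Tukey's honest significant difference test, which compares every pair of group means while controlling the overall error rate (this is just one possibility, not necessarily the approach used later in the course):

```{r tukey_sketch}
# Refit the same model with aov() so that TukeyHSD() can be applied
fertil.aov <- aov(YIELD ~ FERTIL, data = fertilizer)
TukeyHSD(fertil.aov)
```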
An experiment was performed to compare four melon varieties. It was designed so that each variety was grown in six plots, but two plots growing variety 3 were accidentally destroyed.
We wish to find out if there is evidence for a difference in yield between the varieties.
What is the null hypothesis of this study?
The data are in the melons.csv dataset.
Write code chunks to load the melons.csv data and inspect it:
## Rows: 22
## Columns: 2
## $ YIELDM <dbl> 25.12, 17.25, 26.42, 16.08, 22.15, 15.92, 40.25, 35.25, 31.98,…
## $ VARIETY <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,…
Ensure that the VARIETY column is a factor
Create a scatter plot of the yield.
Create a summary table that shows the mean yield for each variety.
melons.model<-lm(YIELDM~VARIETY,data=melons)
autoplot(melons.model,smooth.colour=NA) + theme_cowplot()
Do the data look as though they meet the criteria for an ANOVA? We need the variance of the residuals to be roughly constant across all groups and the qq-plot to be fairly straight. We could confirm with a normality test if we like - for example, a Shapiro-Wilk test.
What is the null hypothesis of this test?
shapiro.test(melons.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: melons.model$residuals
## W = 0.94567, p-value = 0.2586
What do we conclude from this test?
anova(melons.model)
What conclusions would you draw from the output of this model and the table of mean yields for each variety?
Let us find the 95% confidence intervals for each mean
These we calculate as
\[ \text{Mean}\pm t_{\text{crit}}\text{SE}_{\text{mean}} \] For a 95% confidence interval and 18 degrees of freedom, \(t_{\text{crit}}\) is 2.1, so each interval is the group mean plus or minus 2.1 standard errors.
An experiment was performed to compare four melon varieties. It was designed so that each variety was grown in six plots, but two plots growing variety 3 were accidentally destroyed.
We wish to find out if there is evidence for a difference in yield between the varieties.
What is the null hypothesis of this study?
The data are in the melons.csv dataset.
Write code chunks to load the melons.csv data and inspect it:
filepath<-here("data","melons.csv")
melons<-read_csv(filepath)
glimpse(melons)
## Rows: 22
## Columns: 2
## $ YIELDM <dbl> 25.12, 17.25, 26.42, 16.08, 22.15, 15.92, 40.25, 35.25, 31.98,…
## $ VARIETY <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,…
Ensure that the VARIETY column is a factor
melons <- melons %>%
mutate(VARIETY=as.factor(VARIETY))
Create a scatter plot of the yield.
melons %>%
ggplot(aes(x=VARIETY,y=YIELDM)) +
geom_point() +
scale_y_continuous(limits=c(0,45),breaks=seq(0,45,5))+
labs(x = "Melon variety", y="Yield") +
theme_cowplot()
Create a summary table that shows the mean yield for each variety.
melons %>%
group_by(VARIETY) %>%
summarise(
N=n(),
Mean=round(mean(YIELDM),2)
)
melons.model<-lm(YIELDM~VARIETY,data=melons)
autoplot(melons.model,smooth.colour=NA) + theme_cowplot()
The data look as though they meet the criteria for an ANOVA - the variance of the residuals is roughly constant across all groups and the qq-plot is fairly straight. We could confirm with a normality test if we like. For example, we could use a Shapiro-Wilk test.
The null hypothesis of this test is that the residuals are drawn from a population that is normally distributed
shapiro.test(melons.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: melons.model$residuals
## W = 0.94567, p-value = 0.2586
Since p > 0.05, we conclude that there is no reason to reject the null hypothesis that the residuals are normally distributed.
anova(melons.model)
What conclusions would you draw from the output of this model and the table of mean yields for each variety?
We see that the null hypothesis is rejected with a p-value of less than 0.001. We conclude that there are significant differences in the mean yield of melons across the varieties. We estimate that variety 2 has the highest mean yield and varieties 1 and 3 have the lowest mean yields.
The unexplained variance, ie the error mean square, is 15.6 with 18 degrees of freedom. So the standard error for each group mean is \(\frac{s}{\sqrt{n}}\), where \(s=\sqrt{15.6}=3.95\) and \(n\) is the number of plots in that group, giving us standard errors of 1.97 for variety 3 and 1.61 for the other varieties.
Let us find the 95% confidence intervals for each mean
These we calculate as
\[ \text{Mean}\pm t_{\text{crit}}\text{SE}_{\text{mean}} \] For a 95% confidence interval and 18 degrees of freedom, \(t_{\text{crit}}\) is 2.1, so each interval is the group mean plus or minus 2.1 standard errors.
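A sketch of that calculation in R, assuming the `melons` data frame and `melons.model` object from above:

```{r melon_cis}
# Pooled error mean square and critical t on 18 df (about 2.1)
ems    <- anova(melons.model)["Residuals", "Mean Sq"]
t.crit <- qt(0.975, df = df.residual(melons.model))

melons %>%
  group_by(VARIETY) %>%
  summarise(N = n(), Mean = mean(YIELDM)) %>%
  mutate(SE    = sqrt(ems / N),
         Lower = Mean - t.crit * SE,
         Upper = Mean + t.crit * SE)
```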
Material used from Chapter 6 of Beckerman, Childs and Petchey: Getting Started with R
Two-way ANOVAs correspond to data that have structure in two dimensions: the response variable may vary with each of two explanatory variables. It may also be that the way it depends on one variable depends on the value of the other. This is a so-called statistical interaction.
In this example, we look at the growth rate of cows and how this depends on their main diet, with a choice of wheat, oats or barley, and on a dietary supplement, with a choice of four: control (ie no supplement), agrimore, supergain and supersupp. We want to see in particular if the supplements affect growth rate and if the way they do so depends on the diet.
Thus we have two factors - diet and supplement, with diet having three levels and supplement having four.
We choose a fully factorial design, whereby we try out every possible combination of levels of each factor.
How many possible combinations are there?
How many degrees of freedom are there associated with each ‘main effect’ (diet and supplement) and with the interaction?
library(tidyverse)
library(here)
library(ggfortify)
pathway<-here("data","growth.csv")
growth.moo<-read_csv(pathway)
glimpse(growth.moo)
## Rows: 48
## Columns: 3
## $ supplement <chr> "supergain", "supergain", "supergain", "supergain", "contro…
## $ diet <chr> "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "whea…
## $ gain <dbl> 17.37125, 16.81489, 18.08184, 15.78175, 17.70656, 18.22717,…
growth.moo$diet<-as.factor(growth.moo$diet)
growth.moo$supplement<-as.factor(growth.moo$supplement)
In our ANOVA summary output we will see the mean levels of growth rate for every main effect and every interaction. These will be presented as differences from an 'intercept', which is built from what R decides is the reference level for each factor. By default, R orders levels alphabetically
levels(growth.moo$diet)
## [1] "barley" "oats" "wheat"
levels(growth.moo$supplement)
## [1] "agrimore" "control" "supergain" "supersupp"
...but this might not make sense for us. We might want oats to be the reference diet, because perhaps we have always used oats and we want to see what difference it makes if we use barley or wheat instead, and we probably do want to make control the reference level for the dietary supplement.
growth.moo<-mutate(growth.moo,diet=relevel(diet,ref='oats'))
levels(growth.moo$diet)
## [1] "oats" "barley" "wheat"
growth.moo<-mutate(growth.moo,supplement=relevel(supplement,ref='control'))
levels(growth.moo$supplement)
## [1] "control" "agrimore" "supergain" "supersupp"
That’s better.
sumMoo<-growth.moo %>%
group_by(diet,supplement) %>%
summarise(meanGrow=mean(gain))
sumMoo
ggplot(sumMoo,aes(x=supplement,y=meanGrow))+geom_point()
## add colours and lines
ggplot(sumMoo,aes(x=supplement,y=meanGrow,colour=diet,group=diet))+
geom_point()+
geom_line()+
theme_classic()
Both diet and supplement appear to have main effects. The lines are roughly parallel, which suggests that there is no interaction: the effect of supplement does not appear to depend on diet. Since each factor has an effect but there is no obvious interaction, this suggests an additive rather than an interactive model.
Specify the null hypothesis:
‘The effect of supplement type on cow weight gain does not depend on the diet.’
The null hypothesis corresponds to an additive model; the alternative is that there is an interaction. Hence we will fit an interactive model, which will allow us to test between the additive model and the non-additive alternative.
In the R model formula, `diet*supplement` means `diet + supplement + diet:supplement`, ie a main effect of diet, a main effect of supplement and an interaction effect (`:`) between the two.
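The fitting step itself is not shown above, so here is a sketch of it (the object name `growth.model` is just a choice; the formula matches the summary output shown below), together with the diagnostic plots:

```{r fit_growth_model}
growth.model <- lm(gain ~ diet * supplement, data = growth.moo)
autoplot(growth.model, smooth.colour = NA) + theme_classic()
```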
The diagnostic plots look OK - not great, but OK.
Check using anova() and summary()
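Assuming the `growth.model` object from the sketch above:

```{r check_growth_model}
anova(growth.model)
summary(growth.model)
```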
Watch out for the order of testing in the ANOVA table. If the data are balanced and orthogonal, then the order of testing in the ANOVA table does not matter. That is the case here, because this is a designed experiment with a balanced, orthogonal layout. In general, however, in most observational studies this will not be the case, and the order of testing in the ANOVA table WILL matter: the mean squares, F-values and p-values will depend on the order of testing.
We could stop right here, but let’s look at the summary() table
##
## Call:
## lm(formula = gain ~ diet * supplement, data = growth.moo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48756 -1.00368 -0.07452 1.03496 2.68069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.4936649 0.6555863 31.260 < 2e-16 ***
## dietbarley 2.8029851 0.9271390 3.023 0.00459 **
## dietwheat -3.0881467 0.9271390 -3.331 0.00201 **
## supplementagrimore 2.8047189 0.9271390 3.025 0.00457 **
## supplementsupergain -0.8306614 0.9271390 -0.896 0.37624
## supplementsupersupp 1.3665697 0.9271390 1.474 0.14918
## dietbarley:supplementagrimore 0.2471088 1.3111726 0.188 0.85157
## dietwheat:supplementagrimore -0.5711641 1.3111726 -0.436 0.66572
## dietbarley:supplementsupergain 0.0001351 1.3111726 0.000 0.99992
## dietwheat:supplementsupergain 0.4375746 1.3111726 0.334 0.74052
## dietbarley:supplementsupersupp 0.9120830 1.3111726 0.696 0.49113
## dietwheat:supplementsupersupp 0.8962530 1.3111726 0.684 0.49864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.311 on 36 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.8182
## F-statistic: 20.22 on 11 and 36 DF, p-value: 3.295e-12
What is the variation around each mean value?
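One way to answer this, continuing the sketch: in this balanced design there are 4 cows for each diet-and-supplement combination, so the standard error of each treatment-combination mean comes directly from the residual standard error:

```{r cell_se}
s <- sigma(growth.model)  # residual standard error, about 1.31
s / sqrt(4)               # SE of each diet x supplement mean, about 0.66
                          # (matches the intercept's std. error in the summary above)
```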