Material used from Chapter One of Grafen and Hails: Modern Statistics for the Life Sciences
In a simple case we consider the comparison of three means. This is done by the analysis of variance (ANOVA). In this case we will go through an example in detail and work out all the mechanics, but once we have done that and seen how the output is derived from the input we will not need to do it again. We will use R to do the heavy lifting. We will just need to know when it is appropriate to use ANOVA, how to get R to do it and how to interpret the output that R produces.
Suppose we have three fertilizers and wish to compare their efficacy. This has been done in a field experiment where each fertilizer is applied to 10 plots and the 30 plots are later harvested, with the crop yields being calculated. We end up with three groups of 10 figures and we wish to know if there are any differences between these groups.
When we plot the data we see that the fertilizers do differ in the amount of yield produced but that there is also a lot of variation between the plots that were given the same fertilizer.
An ANOVA, where ANOVA stands for ANalysis Of VAriance, attempts to determine whether the differences between the effects of the fertilizers are significant by investigating the variability in the data. We see how the variability between groups compares to the variability within groups.
First we calculate the ‘grand mean’, the mean of the yields across all 30 plots:
## [1] 4.643667
and the total sum of squares of the deviations around this grand mean:

## [1] 36.4449
SST is the total sum of squares. It is the sum of squares of the deviations of the data around the grand mean. This is a measure of the total variability of the dataset.
SSE is the error sum of squares. It is the sum of the squares of the deviations of the data around the three separate group means. This is a measure of the variation between plots that have been given the same fertilizer.
SSF is the fertilizer sum of squares. This is the sum of the squares of the deviations of the group means from the grand mean. This is a measure of the variation between plots given different fertilizers.
When the three group means are fitted, there is an obvious reduction in variability around the three means compared to that around the grand mean, but it is not obvious if the fertilizers have had an effect on yield.
At what point do we decide if the amount of variation explained by fitting the means is significant? By this we mean, "When is the variability between the group means greater than we would expect by chance alone?"
First, we note that SSF and SSE partition between them the total variability in the data:
SST:

## [1] 36.4449

SSF:

## [1] 10.82275

SSE:

## [1] 25.62215

SSF + SSE:

## [1] 36.4449
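As a concrete illustration, here is a minimal sketch of the by-hand calculation in R. It assumes the data are in a data frame called `fertilizer` with a numeric `YIELD` column and a `FERTIL` grouping column (the same names used in the analysis section below).

```{r ss_by_hand}
# Grand mean of all 30 yields
grand.mean <- mean(fertilizer$YIELD)

# Total sum of squares: deviations of every plot from the grand mean
SST <- sum((fertilizer$YIELD - grand.mean)^2)

# Each plot's own group (fertilizer) mean
group.means <- ave(fertilizer$YIELD, fertilizer$FERTIL)

# Error SS: deviations of plots from their own group mean
SSE <- sum((fertilizer$YIELD - group.means)^2)

# Fertilizer SS: deviations of the group means from the grand mean
SSF <- sum((group.means - grand.mean)^2)

SSF + SSE   # equals SST: the sums of squares partition
```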
So the total variability has been divided into two components: that due to differences between plots given different treatments, and that due to differences between plots given the same treatment. Variability must be due to one or the other of these components. Separating the total SS into its component SS is known as partitioning the sums of squares.
A comparison of SSF and SSE is going to indicate whether fitting the three fertilizer means accounts for a significant amount of variability.
However, to make a proper comparison, we really need to compare the variability per degree of freedom ie the variance.
Every sum of squares (SS) has been calculated using a number of independent pieces of information. In each case, we call this number the number of degrees of freedom for the SS.
For SST this number is one less than the number of data points n. This is because when we calculate the deviations of each data point around a grand mean there are only n-1 of them that are independent, since by definition the sum of these deviations is zero, and so when n-1 of them have been calculated, the final one is pre-determined.
Similarly, when we calculate SSF, which measures the deviation of the group means from the grand mean, we have \(k-1\) degrees of freedom (where in the present example \(k\), the number of treatments, is equal to three), since the deviations must sum to zero, so when \(k-1\) of them have been calculated, the last one is pre-determined.
Finally, SSE, which measures deviation around the group means, will have n-k degrees of freedom, since the deviations around each group mean must sum to zero, and so when all but one of them have been calculated, the final one is pre-determined. There are \(k\) group means, so the total degrees of freedom for SSE is n-k.
The degrees of freedom are additive: \[ df(\text{SST}) = df(\text{SSE}) + df(\text{SSF}) \] Check:
\[\begin{align*} df(\text{SST}) &= n-1\\ df(\text{SSF}) &= k-1\\ df(\text{SSE}) &= n-k\\ \therefore df(\text{SSF}) + df(\text{SSE}) &= (k-1) + (n-k)\\ &=n-1\\ &=df(\text{SST}) \end{align*}\]
Now we can calculate the variances, which are a measure of the amount of variability per degree of freedom.
In this context, we call them mean squares. To find each one, we divide each of the sums of squares (SS) by its corresponding degrees of freedom.
Fertilizer Mean Square (FMS) = SSF / (k - 1). This is the variation per df between plots given different fertilizers.
Error Mean Square (EMS) = SSE / (n - k). This is the variation per df between plots given the same fertilizer.
Total Mean Square (TMS) = SST / (n - 1). This is the total variance per df of the dataset.
Unlike the SS, the MS are not additive. That is, FMS + EMS \(\neq\) TMS.
If none of the fertilizers influenced yield, we would expect as much variation between the plots treated with the same fertilizer as between the plots treated with different fertilizers.
We can express this in terms of the mean squares: the mean square for fertilizer would be the same as the mean square for error:
\[ \frac{\text{FMS}}{\text{EMS}}=1 \] We call this ratio the F-ratio. It is the end result of ANOVA.
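Continuing the by-hand sketch from above (assuming the `SSF` and `SSE` objects already computed):

```{r mean_squares_by_hand}
n <- nrow(fertilizer)                   # 30 plots
k <- length(unique(fertilizer$FERTIL))  # 3 fertilizers

FMS <- SSF / (k - 1)   # fertilizer mean square
EMS <- SSE / (n - k)   # error mean square

F.ratio <- FMS / EMS
F.ratio                # about 5.7 for these data
```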
Even if the fertilizers were identical, the F-ratio is unlikely to be exactly 1 - it could by chance take a whole range of values. The F-distribution represents the range and likelihood of all possible F-ratios under the null hypothesis, ie when the fertilizers are identical.
If the fertilizers were very different, then the FMS would be much greater than the EMS and the F-ratio would be greater than one. However it can be quite large even when there are no treatment differences. So how do we decide when the size of the F-ratio is due to treatment rather than to chance?
Traditionally, we decide that it is sufficiently larger than one to be due to treatment differences if it would be this large or larger under the null hypothesis only 5% or less of the time. If we had inside knowledge that the null hypothesis was in fact true then we would still get an F-ratio that large or larger 5% of the time.
Our p-value, ie the probability that the F-ratio would have been as large as it is or larger under the null hypothesis, represents the strength of evidence against the null hypothesis. The smaller it is, the stronger the evidence, and only when it is less than 0.05 do we regard the evidence as strong enough to reject the null.
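Continuing the sketch, the p-value can be read off the F-distribution with \(k-1\) and \(n-k\) degrees of freedom:

```{r f_p_value}
# Probability of an F-ratio this large or larger under the null hypothesis
pf(F.ratio, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
```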
Note that in choosing a threshold value of 0.05 we are adopting a convention. This choice means we accept a Type 1 error rate of 5% - the proportion of times we will reject the null hypothesis when it is in fact true.
We will carry out the ANOVA analysis of the fertilizer data discussed on the previous tab.
Our question is whether yield depends on fertilizer.
What is our null hypothesis?
Start a new notebook with these two code chunks to begin with:
```{r global-options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, warning=FALSE, message=FALSE,echo=FALSE)
```
```{r load packages, message=FALSE,warning=FALSE,echo=FALSE}
library(tidyverse)
library(here)
library(cowplot)
library(gridExtra)
library(ggfortify)
```
Load the fertilizer.csv data into an object called `fertilizer`.
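A sketch of the loading step, following the same pattern used for the melons data later on, and assuming fertilizer.csv sits in the project's data folder:

```{r load_fertilizer}
filepath <- here("data", "fertilizer.csv")
fertilizer <- read_csv(filepath)
glimpse(fertilizer)
```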
Is it tidy data? If not, tidy it.
Convert the FERTIL column to a factor, using this code:
```{r make_factor}
fertilizer <- fertilizer %>%
mutate(FERTIL=as.factor(FERTIL))
```
Make a box plot of yield vs fertilizer, like the one on the previous tab.
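One possible version of the plot, assuming the `YIELD` and `FERTIL` columns loaded above:

```{r yield_boxplot}
fertilizer %>%
  ggplot(aes(x = FERTIL, y = YIELD)) +
  geom_boxplot() +
  labs(x = "Fertilizer", y = "Yield") +
  theme_cowplot()
```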
Now use the lm() function to create the ANOVA model. (There are several ways to do this in R - this is just one.)
fertil.model<-lm(YIELD~FERTIL,data=fertilizer)
Now inspect the model:
anova(fertil.model)
So we find that we can reject the null hypothesis. There is thus evidence that fertilizer does affect yield (ANOVA, df = 2,27, F = 5.7, p < 0.01).
What this test has not done so far is show us where the differences lie. An ANOVA is a holistic test that tells you whether or not there is evidence for a difference between at least one pair of the groups being compared. To identify which groups, if any, are different, we need to do so-called post-hoc tests.
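We will not pursue them here, but as a sketch, one common post-hoc option is Tukey's honest significant difference test, which compares every pair of group means while controlling the overall error rate (this is just one possibility, not necessarily the approach used later in the course):

```{r tukey_sketch}
# Refit the same model with aov() so that TukeyHSD() can be applied
fertil.aov <- aov(YIELD ~ FERTIL, data = fertilizer)
TukeyHSD(fertil.aov)
```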
An experiment was performed to compare four melon varieties. It was designed so that each variety was grown in six plots, but two plots growing variety 3 were accidentally destroyed.
We wish to find out if there is evidence for a difference in yield between the varieties.
What is the null hypothesis of this study?
The data are in the melons.csv dataset.
Write code chunks to load the melons.csv data and inspect it:
## Rows: 22
## Columns: 2
## $ YIELDM <dbl> 25.12, 17.25, 26.42, 16.08, 22.15, 15.92, 40.25, 35.25, 31.98,…
## $ VARIETY <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,…
Ensure that the VARIETY column is a factor
Create a scatter plot of the yield.
Create a summary table that shows the mean yield for each variety.
melons.model<-lm(YIELDM~VARIETY,data=melons)
autoplot(melons.model,smooth.colour=NA) + theme_cowplot()
Do the data look as though they meet the criteria for an ANOVA? We need the variance of the residuals to be roughly constant across all groups and the qq-plot to be fairly straight. We could confirm with a normality test if we like - for example, a Shapiro-Wilk test.
What is the null hypothesis of this test?
shapiro.test(melons.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: melons.model$residuals
## W = 0.94567, p-value = 0.2586
What do we conclude from this test?
anova(melons.model)
What conclusions would you draw from the output of this model and the table of mean yields for each variety?
Let us find the 95% confidence intervals for each mean
These we calculate as
\[ \text{Mean}\pm t_{\text{crit}}\text{SE}_{\text{mean}} \] For a 95% confidence interval and 18 degrees of freedom, \(t_{\text{crit}}\) is 2.1, so each interval is the group mean plus or minus 2.1 standard errors.
An experiment was performed to compare four melon varieties. It was designed so that each variety was grown in six plots, but two plots growing variety 3 were accidentally destroyed.
We wish to find out if there is evidence for a difference in yield between the varieties.
What is the null hypothesis of this study?
The data are in the melons.csv dataset.
Write code chunks to load the melons.csv data and inspect it:
filepath<-here("data","melons.csv")
melons<-read_csv(filepath)
glimpse(melons)
## Rows: 22
## Columns: 2
## $ YIELDM <dbl> 25.12, 17.25, 26.42, 16.08, 22.15, 15.92, 40.25, 35.25, 31.98,…
## $ VARIETY <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,…
Ensure that the VARIETY column is a factor
melons <- melons %>%
mutate(VARIETY=as.factor(VARIETY))
Create a scatter plot of the yield.
melons %>%
ggplot(aes(x=VARIETY,y=YIELDM)) +
geom_point() +
scale_y_continuous(limits=c(0,45),breaks=seq(0,45,5))+
labs(x = "Melon variety", y="Yield") +
theme_cowplot()
Create a summary table that shows the mean yield for each variety.
melons %>%
group_by(VARIETY) %>%
summarise(
N=n(),
Mean=round(mean(YIELDM),2)
)
melons.model<-lm(YIELDM~VARIETY,data=melons)
autoplot(melons.model,smooth.colour=NA) + theme_cowplot()
The data look as though they meet the criteria for an ANOVA - the variance of the residuals is roughly constant across all groups and the qq-plot is fairly straight. We could confirm with a normality test if we like. For example, we could use a Shapiro-Wilk test.
The null hypothesis of this test is that the residuals are drawn from a population that is normally distributed
shapiro.test(melons.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: melons.model$residuals
## W = 0.94567, p-value = 0.2586
Since p > 0.05, we conclude that there is no reason to reject the null hypothesis that the residuals are normally distributed.
anova(melons.model)
What conclusions would you draw from the output of this model and the table of mean yields for each variety?
We see that the null hypothesis is rejected with a p-value of less than 0.001. We conclude that there are significant differences in the mean yield of melons across the varieties. We estimate that variety 2 has the highest mean yield and varieties 1 and 3 have the lowest mean yields.
The unexplained variance, ie the error mean square, is 15.6 with 18 degrees of freedom. So the standard error for each group mean is \(\frac{s}{\sqrt{n}}\), where \(s=\sqrt{15.6}=3.95\) and \(n\) is the number of plots in that group, giving us standard errors of 1.97 for variety 3 and 1.61 for the other varieties.
Let us find the 95% confidence intervals for each mean
These we calculate as
\[ \text{Mean}\pm t_{\text{crit}}\text{SE}_{\text{mean}} \] For a 95% confidence interval and 18 degrees of freedom, \(t_{\text{crit}}\) is 2.1, so each interval is the group mean plus or minus 2.1 standard errors.
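A sketch of that calculation in R, assuming the `melons` data frame and `melons.model` object from above:

```{r melon_cis}
# Pooled error mean square and critical t on 18 df (about 2.1)
ems    <- anova(melons.model)["Residuals", "Mean Sq"]
t.crit <- qt(0.975, df = df.residual(melons.model))

melons %>%
  group_by(VARIETY) %>%
  summarise(N = n(), Mean = mean(YIELDM)) %>%
  mutate(SE    = sqrt(ems / N),
         Lower = Mean - t.crit * SE,
         Upper = Mean + t.crit * SE)
```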
Material used from Chapter 6 of Beckerman, Childs and Petchey: Getting Started with R
Two-way ANOVAs correspond to data that have structure in two dimensions: the response variable may vary with each of two explanatory variables. It may also be that the way it depends on one variable depends on the value of the other. This is a so-called statistical interaction.
In this example, we look at the growth rate of cows and how this depends on their main diet, with a choice of wheat, oats or barley, and on a dietary supplement, with a choice of four: control (ie no supplement), agrimore, supergain and supersupp. We want to see in particular if the supplements affect growth rate and if the way they do so depends on the diet.
Thus we have two factors - diet and supplement, with diet having three levels and supplement having four.
We choose a fully factorial design, whereby we try out every possible combination of levels of each factor.
How many possible combinations are there?
How many degrees of freedom are there associated with each ‘main effect’ (diet and supplement) and with the interaction?
library(tidyverse)
library(here)
library(ggfortify)
pathway<-here("data","growth.csv")
growth.moo<-read_csv(pathway)
glimpse(growth.moo)
## Rows: 48
## Columns: 3
## $ supplement <chr> "supergain", "supergain", "supergain", "supergain", "contro…
## $ diet <chr> "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "whea…
## $ gain <dbl> 17.37125, 16.81489, 18.08184, 15.78175, 17.70656, 18.22717,…
growth.moo$diet<-as.factor(growth.moo$diet)
growth.moo$supplement<-as.factor(growth.moo$supplement)
In our ANOVA summary output we will see the mean levels of growth rate for every main effect and every interaction. These will be presented as differences from an 'intercept', which is built from what R decides is the reference level for each factor. By default, R orders levels alphabetically
levels(growth.moo$diet)
## [1] "barley" "oats" "wheat"
levels(growth.moo$supplement)
## [1] "agrimore" "control" "supergain" "supersupp"
...but this might not make sense for us. We might want oats to be the reference diet, because perhaps we have always used oats and we want to see what difference it makes if we use barley or wheat instead, and we probably do want to make control the reference level for the dietary supplement.
growth.moo<-mutate(growth.moo,diet=relevel(diet,ref='oats'))
levels(growth.moo$diet)
## [1] "oats" "barley" "wheat"
growth.moo<-mutate(growth.moo,supplement=relevel(supplement,ref='control'))
levels(growth.moo$supplement)
## [1] "control" "agrimore" "supergain" "supersupp"
That’s better.
sumMoo<-growth.moo %>%
group_by(diet,supplement) %>%
summarise(meanGrow=mean(gain))
sumMoo
ggplot(sumMoo,aes(x=supplement,y=meanGrow))+geom_point()
## add colours and lines
ggplot(sumMoo,aes(x=supplement,y=meanGrow,colour=diet,group=diet))+
geom_point()+
geom_line()+
theme_classic()
Both diet and supplement appear to have main effects. The lines are roughly parallel, which suggests that there is no interaction: the effect of supplement does not appear to depend on diet. Since each factor has an effect but there is no obvious interaction, this suggests an additive rather than an interactive model.
Specify the null hypothesis:
‘The effect of supplement type on cow weight gain does not depend on the diet.’
The null hypothesis corresponds to an additive model; the alternative is that there is an interaction. Hence we will fit an interactive model, which will allow us to test between the additive model and the non-additive alternative.
In the R model formula, `diet*supplement` means `diet + supplement + diet:supplement`, ie a main effect of diet, a main effect of supplement and an interaction effect (`:`) between the two.
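The fitting step itself is not shown above, so here is a sketch of it (the object name `growth.model` is just a choice; the formula matches the summary output shown below), together with the diagnostic plots:

```{r fit_growth_model}
growth.model <- lm(gain ~ diet * supplement, data = growth.moo)
autoplot(growth.model, smooth.colour = NA) + theme_classic()
```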
The diagnostic plots look OK - not great, but OK.
Check using anova() and summary()
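Assuming the `growth.model` object from the sketch above:

```{r check_growth_model}
anova(growth.model)
summary(growth.model)
```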
Watch out for the order of testing in the ANOVA table. If the data are balanced and orthogonal, then the order of testing in the ANOVA table does not matter. That is the case here, because this is a designed experiment with a balanced, orthogonal layout. In general, however, in most observational studies this will not be the case, and the order of testing in the ANOVA table WILL matter: the mean squares, F-values and p-values will depend on the order of testing.
We could stop right here, but let’s look at the summary() table
##
## Call:
## lm(formula = gain ~ diet * supplement, data = growth.moo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48756 -1.00368 -0.07452 1.03496 2.68069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.4936649 0.6555863 31.260 < 2e-16 ***
## dietbarley 2.8029851 0.9271390 3.023 0.00459 **
## dietwheat -3.0881467 0.9271390 -3.331 0.00201 **
## supplementagrimore 2.8047189 0.9271390 3.025 0.00457 **
## supplementsupergain -0.8306614 0.9271390 -0.896 0.37624
## supplementsupersupp 1.3665697 0.9271390 1.474 0.14918
## dietbarley:supplementagrimore 0.2471088 1.3111726 0.188 0.85157
## dietwheat:supplementagrimore -0.5711641 1.3111726 -0.436 0.66572
## dietbarley:supplementsupergain 0.0001351 1.3111726 0.000 0.99992
## dietwheat:supplementsupergain 0.4375746 1.3111726 0.334 0.74052
## dietbarley:supplementsupersupp 0.9120830 1.3111726 0.696 0.49113
## dietwheat:supplementsupersupp 0.8962530 1.3111726 0.684 0.49864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.311 on 36 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.8182
## F-statistic: 20.22 on 11 and 36 DF, p-value: 3.295e-12
What is the variation around each mean value?
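One way to answer this, continuing the sketch: in this balanced design there are 4 cows for each diet-and-supplement combination, so the standard error of each treatment-combination mean comes directly from the residual standard error:

```{r cell_se}
s <- sigma(growth.model)  # residual standard error, about 1.31
s / sqrt(4)               # SE of each diet x supplement mean, about 0.66
                          # (matches the intercept's std. error in the summary above)
```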