This exercise sheet is heavily indebted to Michael Crawley’s Statistics: An Introduction Using R, 2nd Ed, Wiley, 2015. That book emphasises statistics over R (in fact, much of the R it presents would be better written using tidyverse code). It is very useful and is pitched at a higher level than Beckerman, Childs and Petchey’s Getting Started in R: An Introduction for Biologists, 2nd Ed, OUP, 2017.
The best model is the one that adequately explains the data with the fewest parameters; that is, the one that uses the smallest possible number of degrees of freedom.
If we have a very large number of parameters, a model can fit any data set but be of limited use in generalising beyond it (we will have overfitted the data). If we have too few we will not explain much of the variance of the data. A balance must be struck. Hence we want the minimal adequate model.
As Einstein almost said, a model should be as simple as possible, but no simpler.
A factorial experiment has two or more factors, each with two or more levels, plus replication for each combination of factor levels. This means that we can investigate statistical interactions in which the effect of one factor depends on the value of another factor.
We take an example from a farm trial of animal diets. There are two factors: diet and supplement. diet is a factor with three levels: barley, oats and wheat. supplement is a factor with four levels: agrimore, control, supergain and supersupp, where control means the absence of any supplement. The response variable, gain, is weight gain after 6 weeks. There were 48 individual cows in total, with 4 per combination of diet and supplement, so this is a balanced design.
In the following, we present the code you need to analyse these data together with some explanatory text. You could just read the text, but if you want to learn how to implement the analysis yourself, a much better idea would be to start a new R Notebook in your R Project and type in the code so that you can run the analysis yourself. You are also encouraged to view the code presented here as just one way to do it: feel free to hack away at it, change things and try different approaches to see what happens. That way you will learn.
# this code chunk is opaque but useful. It affects the look of the knitted document
knitr::opts_chunk$set(fig.width=12, fig.height=8, warning=FALSE, message=FALSE)
library(tidyverse)
library(here)
library(cowplot)
library(gridExtra)
library(ggfortify)
library(kableExtra)
filepath<-here("data","growth.csv")
weights<-read_csv(filepath)
glimpse(weights)
## Rows: 48
## Columns: 3
## $ supplement <chr> "supergain", "supergain", "supergain", "supergain", "contro…
## $ diet <chr> "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "whea…
## $ gain <dbl> 17.37125, 16.81489, 18.08184, 15.78175, 17.70656, 18.22717,…
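glimpse() shows 48 rows and three columns, as expected. Since the analysis below relies on the design being balanced, it is worth a quick check (a small sanity check, not essential to the analysis) that every combination of diet and supplement really does occur four times:
weights %>%
  count(diet, supplement)
This should return 12 rows, each with n = 4.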
At the moment, the variables supplement and diet are not recognised as factors. R just thinks of them as text. Let us fix that, as it will be useful later on for them to be recognised as what they are:
weights <- weights %>%
mutate(supplement=as.factor(supplement),
diet=as.factor(diet))
R puts the levels of a factor in alphabetical order. Since differences are later calculated with respect to the level that comes first, this is not necessarily what we want. Normally, for example, we want the control to be the reference level. To ensure that it is, we can reorder the levels:
weights <- weights %>%
mutate(supplement=fct_relevel(supplement, "control", "agrimore", "supersupp", "supergain"))
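If you want to confirm that the reordering has taken effect, you can inspect the levels directly (a quick check):
levels(weights$supplement)
control should now come first, which makes it the reference level in the models fitted below.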
We will calculate the mean and standard error of the mean for each combination of diet and supplement:
growth_summary<-weights %>%
group_by(diet,supplement) %>%
summarise(mean_gain=mean(gain),se_gain=sqrt(var(gain)/n()))
# now produce a pretty table of the summary
growth_summary %>%
kbl(digits=3) %>%
kable_styling(full_width=FALSE)
| diet | supplement | mean_gain | se_gain |
|---|---|---|---|
| barley | control | 23.297 | 0.703 |
| barley | agrimore | 26.348 | 0.919 |
| barley | supersupp | 25.575 | 1.060 |
| barley | supergain | 22.466 | 0.771 |
| oats | control | 20.494 | 0.506 |
| oats | agrimore | 23.298 | 0.613 |
| oats | supersupp | 21.860 | 0.413 |
| oats | supergain | 19.663 | 0.349 |
| wheat | control | 17.406 | 0.460 |
| wheat | agrimore | 19.639 | 0.710 |
| wheat | supersupp | 19.668 | 0.475 |
| wheat | supergain | 17.012 | 0.485 |
We could inspect the data in various ways:
growth_summary %>%
ggplot(aes(x=supplement,y=mean_gain,fill=diet)) +
geom_col(position=position_dodge()) +
geom_errorbar(aes(ymin=mean_gain-se_gain,ymax=mean_gain+se_gain),width=0.2,position=position_dodge(width=0.9)) +
labs(x="Supplement",
y="Mean weight gain") +
scale_fill_brewer() +
theme_cowplot()
growth_summary %>%
ggplot(aes(x=supplement,y=mean_gain,colour=diet,group=diet)) +
geom_point(size=2) +
geom_line() +
geom_errorbar(aes(ymin=mean_gain-se_gain,ymax=mean_gain+se_gain),width=0.1) +
labs(x="Supplement",
y="Mean weight gain") +
scale_colour_brewer() +
theme_cowplot()
Note that on both plots the error bars are standard errors of the mean. Any caption to a figure that contains error bars should explain what those error bars represent. In particular, it should say whether they are standard deviations of the sample, standard errors of the mean or confidence intervals; these are all different from each other. A good explanation of the differences is given by [Cumming et al. (2007)][1]. The line plot, sometimes called an interaction plot in this context, is useful: we see that both diet and supplement have an effect on growth, and that the effect of one is altered little by the value of the other. The lines are more or less parallel. This suggests that we have main effects of both diet and supplement, but little or no interaction between them.
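If you wanted to plot 95% confidence intervals instead of standard errors, one way to get them (a sketch, using the t distribution with n - 1 degrees of freedom within each cell; the name growth_ci and the column ci_gain are ours) is to compute the half-width of the interval alongside the other summary statistics:
growth_ci <- weights %>%
  group_by(diet, supplement) %>%
  summarise(mean_gain = mean(gain),
            se_gain = sd(gain) / sqrt(n()),
            ci_gain = qt(0.975, n() - 1) * se_gain) # half-width of a 95% confidence interval
You would then use mean_gain - ci_gain and mean_gain + ci_gain as ymin and ymax in geom_errorbar(), and say in the caption that the bars are 95% confidence intervals.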
Now we can use aov() or lm() to fit a factorial ANOVA (the choice affects only whether we get an ANOVA table or a list of parameter estimates as the default output from summary()).
We estimate parameters for the main effects of each level of diet and each level of supplement, plus terms for the interaction between diet and supplement.
The interaction degrees of freedom are the product of those for diet and supplement, i.e. (3 − 1) × (4 − 1) = 6.
The model is:
gain ~ diet + supplement + diet:supplement
which can be written more simply using the asterisk notation as:
model0<-aov(gain ~ diet * supplement,data=weights)
summary(model0)
## Df Sum Sq Mean Sq F value Pr(>F)
## diet 2 287.17 143.59 83.52 3.00e-14 ***
## supplement 3 91.88 30.63 17.82 2.95e-07 ***
## diet:supplement 6 3.41 0.57 0.33 0.917
## Residuals 36 61.89 1.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table shows that there is no hint of any interaction between diet and supplement (p = 0.917). Clearly the effects of diet and supplement are simply additive (i.e. the effect on growth of choosing a particular level of one factor does not depend on which level of the other factor is chosen), and both are highly significant.
The ANOVA table does not show us effect sizes or allow us to work out which, if any, of the levels of the two factors are significantly different. For this, summary.lm() is more useful:
summary.lm(model0)
##
## Call:
## aov(formula = gain ~ diet * supplement, data = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48756 -1.00368 -0.07452 1.03496 2.68069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.2966499 0.6555863 35.536 < 2e-16 ***
## dietoats -2.8029851 0.9271390 -3.023 0.00459 **
## dietwheat -5.8911317 0.9271390 -6.354 2.34e-07 ***
## supplementagrimore 3.0518277 0.9271390 3.292 0.00224 **
## supplementsupersupp 2.2786527 0.9271390 2.458 0.01893 *
## supplementsupergain -0.8305263 0.9271390 -0.896 0.37631
## dietoats:supplementagrimore -0.2471088 1.3111726 -0.188 0.85157
## dietwheat:supplementagrimore -0.8182729 1.3111726 -0.624 0.53651
## dietoats:supplementsupersupp -0.9120830 1.3111726 -0.696 0.49113
## dietwheat:supplementsupersupp -0.0158299 1.3111726 -0.012 0.99043
## dietoats:supplementsupergain -0.0001351 1.3111726 0.000 0.99992
## dietwheat:supplementsupergain 0.4374395 1.3111726 0.334 0.74060
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.311 on 36 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.8182
## F-statistic: 20.22 on 11 and 36 DF, p-value: 3.295e-12
This is a complex model as there are 12 estimated parameters: an intercept, five main-effect differences and six interaction terms. Notice that the reference levels of diet and supplement (barley and control) do not appear by name in the table. The intercept gives us the mean for the combination of the two reference levels, barley as diet and control as supplement, and the other effect values are given as differences from this reference. See if you can tally the effect values in the summary table with the mean values given in the table above for each combination of diet and supplement.
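For example, the fitted mean for oats with agrimore is 23.2966 - 2.8030 + 3.0518 - 0.2471 ≈ 23.298, which matches the oats/agrimore mean in the summary table. If you would rather let R do the adding up, you can ask the full model for its fitted value for every combination of the two factors (a sketch; for a model containing the full interaction these fitted values reproduce the cell means exactly):
cell_means <- weights %>%
  distinct(diet, supplement)
cell_means %>%
  mutate(fitted_gain = predict(model0, newdata = cell_means))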
This table re-emphasises that none of the interaction terms is significant. It also suggests that a minimal adequate model will contain 5 parameters: an intercept, a difference due to oats, a difference due to wheat, a difference due to agrimore and a difference due to supersupp.
Given the results of the full interaction model, we begin model simplification by leaving out the interaction terms, to leave us with an additive model:
model1<-lm(gain~diet+supplement,data=weights)
summary(model1)
##
## Call:
## lm(formula = gain ~ diet + supplement, data = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30792 -0.85929 -0.07713 0.92052 2.90615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4263 0.4408 53.141 < 2e-16 ***
## dietoats -3.0928 0.4408 -7.016 1.38e-08 ***
## dietwheat -5.9903 0.4408 -13.589 < 2e-16 ***
## supplementagrimore 2.6967 0.5090 5.298 4.03e-06 ***
## supplementsupersupp 1.9693 0.5090 3.869 0.000375 ***
## supplementsupergain -0.6848 0.5090 -1.345 0.185772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.247 on 42 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8356
## F-statistic: 48.76 on 5 and 42 DF, p-value: < 2.2e-16
We ought to pause here for a moment and just check that we are OK to go ahead and analyse our data using a general linear model (of which ANOVA is an example, linear regression and t-tests being others). We will use autoplot() from the ggfortify package, which gives us the standard four diagnostic plots.
autoplot(model1) + theme_cowplot() # autoplot() is from the ggfortify package.
Well, that all looks fine. In particular, from the top-left figure we see that the variance of the residuals is more or less constant and from the top-right figure, the quantile-quantile plot, we get a pretty good approximation of a straight line which tells us that the residuals are more or less normally distributed. These are two key assumptions that must be satisfied by data if it is going to make any sense to use a linear model to analyse it. We won’t discuss here the other two diagnostic plots, but they look fine too. So we are good to go using ANOVA with this data.
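If you like to back up the visual checks with a number, you could also (as a supplementary check, not part of the original analysis) run a Shapiro-Wilk test on the residuals:
shapiro.test(residuals(model1))
A large p-value is consistent with normally distributed residuals, although the diagnostic plots are usually the more informative guide.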
Back to interpreting the output of the ANOVA:
It is clear that we need to retain all three levels of diet, since their effect values differ from each other by amounts that are several times the standard errors, so that t >> 1. It is not clear that we need all the levels of supplement, however. supersupp is not obviously different from agrimore (difference = -0.727 with standard error = 0.509), yet both are clearly different from control. Conversely, supergain is not obviously different from control (difference = -0.68, standard error = 0.509). Hence we are tempted to try a new model with just two levels of the factor supplement, which we might sensibly call "best", meaning agrimore or supersupp, and "worst", meaning control or supergain.
weights <- weights %>%
mutate(supp2=ifelse(supplement %in% c("agrimore","supersupp"),"best","worst"))
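It is worth a quick check that the new variable maps the old levels as intended:
weights %>%
  count(supplement, supp2)
agrimore and supersupp should be labelled best; control and supergain should be labelled worst.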
Now we will fit the simpler model and compare the two additive models:
model2<-lm(gain ~ diet + supp2,data=weights)
anova(model1,model2)
When we use anova() in this way it tests whether the more complex model explains significantly more of the variance in the data than the simpler one. Its null hypothesis is that the two models explain the data equally well, i.e. that the extra parameters in the more complex model do not improve the fit.
The simpler model has saved two degrees of freedom and its explanatory power is not significantly different from that of the more complex model (p = 0.158). Hence it is a better candidate for the minimal adequate model. All its parameters are significantly different from zero and from each other.
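If you would like an information-theoretic comparison as well, AIC rewards goodness of fit but penalises extra parameters, and the model with the lower value is preferred (this is a supplementary check, not part of the original analysis):
AIC(model1, model2)
Do not be surprised if the two values come out very close: when two models fit almost equally well, the principle of parsimony favours the simpler one.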
summary(model2)
##
## Call:
## lm(formula = gain ~ diet + supp2, data = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6716 -0.9432 -0.1918 0.9293 3.2698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.7593 0.3674 70.106 < 2e-16 ***
## dietoats -3.0928 0.4500 -6.873 1.76e-08 ***
## dietwheat -5.9903 0.4500 -13.311 < 2e-16 ***
## supp2worst -2.6754 0.3674 -7.281 4.43e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.273 on 44 degrees of freedom
## Multiple R-squared: 0.8396, Adjusted R-squared: 0.8286
## F-statistic: 76.76 on 3 and 44 DF, p-value: < 2.2e-16
We have now reduced our initial 12-parameter model to a four-parameter model that is much more tractable and easier to communicate. Our advice would be that, for maximum weight gain, a diet of barley with a supplement of agrimore or supersupp would be best.
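If you want to turn that advice into a predicted weight gain, predict() on the simplified model will give it, together with a confidence interval (a sketch; best_diet is just a name made up here):
best_diet <- tibble(diet = "barley", supp2 = "best")
predict(model2, newdata = best_diet, interval = "confidence")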
[1]: Cumming, G., Fidler, F., & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177(1), 7–11. https://doi.org/10.1083/jcb.200611141