---
title: "Practical 9 - AESC40180 25/26"
author: "Virginia Morera-Pujol"
format:
html:
theme: flatly
highlight-style: github
code-tools: true
fontsize: 1em
toc: true
toc-location: left
editor: visual
---
## Introduction
In this practical, you'll practice running generalised linear mixed models, identifying random effects, assessing model fit, etc. For that, we'll use a dataset that studies the development of aggression behaviour in adolescent apes of different species. In order to do that, aggression scores were measured on several individuals of 5 species of ape at different times in their adolescent development. Each individual was measured more than once, and there's several individuals of each species. The data is contained in the file `apes.csv`
## Housekeeping
Use the well known code to clear environment, set working directory if necessary, and load the data. We will also need to load the packages to run the GLMMs and some other packages to produce plots.
> Remember, if new packages are not installed in your computer, you will get an error message when running `library(lme4)`. You will need to install it by using `install.packages("lme4")` (the quotation marks are important here).
```{r}
#| echo: true
rm(list=ls()) # this clears the environment from any leftover objects
library(tidyverse) #this loads the packages we'll need for easy coding
library(lme4) # to run the models
library(lmerTest) # to obtain p-values for our fixed effects
library(lattice)
library(DHARMa)
apes <- read_csv("Data/apes.csv")
```
## Data cleaning and organisation
Check if any data cleaning is necessary by running a summary of the data and a plot. You already know how to do this but because I've added a new `geom_` type, `geom_jitter`, I've displayed the code so you see how it's done.
```{r}
summary(apes)
ggplot(apes) +
geom_jitter(aes(x = species, y = dominance, col = age), width = 0.2) +
geom_boxplot(aes(x = species, y = dominance), alpha = 0) +
scale_color_viridis_c() +
theme_bw()
```
We see several issues:
- a miss-spelling of the word "gorilla"
- some ages that are set at -99, which is a typical value to use when the value has not been registered.
- two dominance values seem very far away from the rest and may be outliers. We are going to correct the spelling and remove the observations with no real age first with the following code:
```{r}
apes_clean <- apes %>%
filter(age > 0) %>%
mutate(species = recode(species, "gorrila" = "gorilla"))
```
And now we re-run the summary and the plot to see what has changed (code hidden as I've shown it above, try and replicate it!)
```{r}
#| echo: false
summary(apes_clean)
ggplot(apes_clean) +
geom_jitter(aes(x = species, y = dominance, col = age), width = 0.2) +
geom_boxplot(aes(x = species, y = dominance), alpha = 0) +
scale_color_viridis_c() +
theme_bw()
```
We also see that `species` and `ape` (individual's name) are characters, and need to be factors for the models to work. We will also remove the observations that have a dominance rank of 19.4 and -21.2. as we consider them outliers (but in real life you'd consult with whoever has collected the data and try and figure out if they're really mistakes or very angry gorillas before removing them). With the code provided so far you should be able to do this on your own now, and plot again:
```{r}
#| echo: false
apes_clean <- apes_clean %>%
filter(abs(dominance) < 19) %>%
mutate(species = as.factor(species),
ape = as.factor(ape))
ggplot(apes_clean) +
geom_jitter(aes(x = species, y = dominance, col = age), width = 0.2) +
geom_boxplot(aes(x = species, y = dominance), alpha = 0) +
scale_color_viridis_c() +
theme_bw()
```
Now it's much easier to see what the data may be telling us:
- Do you think older apes seem to be more dominant?
- Do you think there are clear and significant differences in dominance between species?
- Does this plot allow us to see the differences between individuals?
- Can you think of other ways of plotting this? Try them!
## Modelling
The code to run a GLMM is very similar to what you already know from running GLMs. The only bit you'll be unfamiliar with will be the specification of the random effect. We are going to run several models of increasing complexity and compare them in the end with the function `anova()` to see which one has a better fit.
### Null model
#### Run model
The null model is the model that contains only the intercept as an explanatory variable. However, in the context of mixed models, the null model will have the intercept *and* the random effect:
```{r}
m0 <- glmer(dominance ~ 1 + (1|ape), data = apes_clean)
```
Do you see a warning message? Can you interpret what it means?
To see the output of this model, we will do our usual summary function, which I'm not giving you the code for because you know it already
```{r}
#| echo: false
summary(m0)
```
In this summary you see a fixed effects section which is interpreted exactly the same way as a GLM or a LM, but only has the intercept at the moment because it's the null model. It also has a section with the results for the random effect. There, you have two rows, one for the random effect you've specified (`ape`), and another one named "Residual". The variance and sd values displayed tell us how much of the variability in the data is explained by our random factor. If you divide the variance explained by the random factor by the total residual variance using the code below, you'll obtain a measure of the explanatory power of the random effect.
```{r}
0.8993/(0.1027+0.8993)
```
#### Random effect assessment
Additionally, you can visualise how the random effect separates the groups. For this, we will use a function of the package `lattice`, that we've installed and loaded at the beginning.
First, we need to extract the actual value of the random effects from the model output
```{r}
re0 <- ranef(m0)
head(re0$ape)
```
And now we plot it
```{r}
dotplot(re0)
```
Do the group means seem normally distributed? They should be!
#### Model evaluation
Now we need to produce a couple of plots to evaluate the models. What we are looking for is residuals that have no structure, but only in gaussian (normal) GLMMs you can use a qqplot to test the normality of residuals, as poisson and binomial models don't expect the errors to be normally distributed. However, the package DHARMa lets us bypass that issue by creating plots that look like the familiar qq and residual plots but are appropriate for all families (look at the package documentation for more info, at this level this is all you need to know!).
```{r}
sim_res <- simulateResiduals(m0)
plot(sim_res)
```
We can also just plot the residuals and look for structure
```{r}
plot(residuals(m0))
```
### Model 1
Now we can start adding fixed effects
#### Run model
We run this exactly the same as the null model but adding the age as an explanatory variable that may explain dominance. You should be able to code this by yourselves now and display the summary
```{r}
#| echo: false
m1 <- glmer(dominance ~ 1 + age + (1|ape), data = apes_clean)
summary(m1)
```
Here we see a bit more than in the other model. We see that there is an effect of age that is positive, but quite small. With every year of age, apes seem to increase their aggression score in 0.03, which doesn't seem much. How do we know if this is significant, if we don't have a p-value? We can use an ANOVA function to compare this and the null model (and any other models we run with increasing complexity)
```{r}
anova(m0, m1)
```
Here we see a Chisq test and a p-value, which are testing whether the addition of age is improving the model fit significantly (that is, that the effect of age is significant)
- Do you think it is?
We can check if the amount of variance explained by the random effect has changed much. Remember, variance of age divided by the sum of variance of age and residual variance (you get the values from the model summary)
```{r}
#| echo: false
0.90060/(0.09522+0.90060)
```
- Has the proportion of variance explained by ape changed between this and the null model?
#### Random effect assessment
Let's visualise the random effect again:
```{r}
#| echo: false
re1 <- ranef(m1)
dotplot(re1)
```
- Does this look different from the residuals of the null model?
- Why?
#### Model evaluation
Use the functions of the DHARMa package to produce diagnostic plots, and also plot the residuals
```{r}
#| echo: false
sim_res <- simulateResiduals(m1)
plot(sim_res)
plot(residuals(m1))
```
### Model 2
Can you think of what more we can add to the model to help explain the data better? Try it!
### Nested model
Think about the structure of the data. Do you think we can say that the random factor "ape" is nested within "species"?
Let's try a nested random effects model then! This is how you specify nested random effects
```{r}
m3 <- glmer(dominance ~ age + (1|species/ape), data = apes_clean, family = gaussian)
summary(m3)
```
From now on, you can produce additional outputs in the same way as with the other models. The only thing you can't do is use the anova function to compare models, as this has a different random effect structure, so it's no longer easily comparable to the rest.
After looking at the outputs from this nested model. answer the following questions:
- Has the size of the effect of age on dominance changed after changing species from fixed to nested random effect?
- What explains more variability in the data, ape, or species?
- Has the explanatory power of ape been reduced by the introduction of species as a random nested effect? (compare this model to model 2)
- How many random effect plots do you get now with the function `dotplot`? What do they tell us?
- Finally, have a think: if you're not satisfied with the diagnostics of this nested model, what other information about the apes, the data collection method, etc. could have been useful to have to add to the model and explain the data better?