Practical 9 - AESC40180 25/26

Author

Virginia Morera-Pujol

Introduction

In this practical, you’ll practice running generalised linear mixed models, identifying random effects, assessing model fit, etc. For that, we’ll use a dataset that studies the development of aggression behaviour in adolescent apes of different species. In order to do that, aggression scores were measured on several individuals of 5 species of ape at different times in their adolescent development. Each individual was measured more than once, and there’s several individuals of each species. The data is contained in the file apes.csv

Housekeeping

Use the well known code to clear environment, set working directory if necessary, and load the data. We will also need to load the packages to run the GLMMs and some other packages to produce plots.

Remember, if new packages are not installed in your computer, you will get an error message when running library(lme4). You will need to install it by using install.packages("lme4") (the quotation marks are important here).

rm(list=ls()) # this clears the environment from any leftover objects

library(tidyverse) #this loads the packages we'll need for easy coding
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lme4) # to run the models
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack
library(lmerTest) # to obtain p-values for our fixed effects
Warning: package 'lmerTest' was built under R version 4.5.3

Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step
library(lattice)
library(DHARMa)
Warning: package 'DHARMa' was built under R version 4.5.3
This is DHARMa 0.4.7. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
apes <- read_csv("Data/apes.csv")
New names:
Rows: 830 Columns: 5
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): ape, species dbl (3): ...1, age, dominance
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Data cleaning and organisation

Check if any data cleaning is necessary by running a summary of the data and a plot. You already know how to do this but because I’ve added a new geom_ type, geom_jitter, I’ve displayed the code so you see how it’s done.

summary(apes)
      ...1           ape                 age           dominance         
 Min.   :  1.0   Length:830         Min.   :-99.00   Min.   :-21.200000  
 1st Qu.:208.2   Class :character   1st Qu.:  3.00   1st Qu.: -0.700000  
 Median :415.5   Mode  :character   Median :  6.00   Median : -0.100000  
 Mean   :415.5                      Mean   :  5.18   Mean   : -0.002169  
 3rd Qu.:622.8                      3rd Qu.:  8.00   3rd Qu.:  0.700000  
 Max.   :830.0                      Max.   : 10.00   Max.   : 19.400000  
   species         
 Length:830        
 Class :character  
 Mode  :character  
                   
                   
                   
ggplot(apes) + 
  geom_jitter(aes(x = species, y = dominance, col = age), width = 0.2) + 
  geom_boxplot(aes(x = species, y = dominance), alpha = 0) + 
  scale_color_viridis_c() + 
  theme_bw()

We see several issues:

  • a miss-spelling of the word “gorilla”

  • some ages that are set at -99, which is a typical value to use when the value has not been registered.

  • two dominance values seem very far away from the rest and may be outliers. We are going to correct the spelling and remove the observations with no real age first with the following code:

apes_clean <- apes %>% 
  filter(age > 0) %>% 
  mutate(species = recode(species, "gorrila" = "gorilla"))

And now we re-run the summary and the plot to see what has changed (code hidden as I’ve shown it above, try and replicate it!)

#| echo: false

summary(apes_clean)
      ...1           ape                 age           dominance         
 Min.   :  1.0   Length:827         Min.   : 1.000   Min.   :-21.200000  
 1st Qu.:209.5   Class :character   1st Qu.: 3.000   1st Qu.: -0.700000  
 Median :416.0   Mode  :character   Median : 6.000   Median : -0.100000  
 Mean   :416.1                      Mean   : 5.557   Mean   : -0.001572  
 3rd Qu.:623.5                      3rd Qu.: 8.000   3rd Qu.:  0.700000  
 Max.   :830.0                      Max.   :10.000   Max.   : 19.400000  
   species         
 Length:827        
 Class :character  
 Mode  :character  
                   
                   
                   
ggplot(apes_clean) + 
  geom_jitter(aes(x = species, y = dominance, col = age), width = 0.2) + 
  geom_boxplot(aes(x = species, y = dominance), alpha = 0) + 
  scale_color_viridis_c() + 
  theme_bw()

We also see that species and ape (individual’s name) are characters, and need to be factors for the models to work. We will also remove the observations that have a dominance rank of 19.4 and -21.2. as we consider them outliers (but in real life you’d consult with whoever has collected the data and try and figure out if they’re really mistakes or very angry gorillas before removing them). With the code provided so far you should be able to do this on your own now, and plot again:

Now it’s much easier to see what the data may be telling us:

  • Do you think older apes seem to be more dominant?
  • Do you think there are clear and significant differences in dominance between species?
  • Does this plot allow us to see the differences between individuals?
  • Can you think of other ways of plotting this? Try them!

Modelling

The code to run a GLMM is very similar to what you already know from running GLMs. The only bit you’ll be unfamiliar with will be the specification of the random effect. We are going to run several models of increasing complexity and compare them in the end with the function anova() to see which one has a better fit.

Null model

Run model

The null model is the model that contains only the intercept as an explanatory variable. However, in the context of mixed models, the null model will have the intercept and the random effect:

m0 <- glmer(dominance ~ 1 +  (1|ape), data = apes_clean)
Warning in glmer(dominance ~ 1 + (1 | ape), data = apes_clean): calling glmer()
with family=gaussian (identity link) as a shortcut to lmer() is deprecated;
please call lmer() directly

Do you see a warning message? Can you interpret what it means?

To see the output of this model, we will do our usual summary function, which I’m not giving you the code for because you know it already

Linear mixed model fit by REML ['lmerMod']
Formula: dominance ~ 1 + (1 | ape)
   Data: apes_clean

REML criterion at convergence: 1088.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8361 -0.6286 -0.0065  0.5819  3.3453 

Random effects:
 Groups   Name        Variance Std.Dev.
 ape      (Intercept) 0.8993   0.9483  
 Residual             0.1027   0.3204  
Number of obs: 825, groups:  ape, 168

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.06286    0.07416   0.848

In this summary you see a fixed effects section which is interpreted exactly the same way as a GLM or a LM, but only has the intercept at the moment because it’s the null model. It also has a section with the results for the random effect. There, you have two rows, one for the random effect you’ve specified (ape), and another one named “Residual”. The variance and sd values displayed tell us how much of the variability in the data is explained by our random factor. If you divide the variance explained by the random factor by the total residual variance using the code below, you’ll obtain a measure of the explanatory power of the random effect.

0.8993/(0.1027+0.8993)
[1] 0.897505

Random effect assessment

Additionally, you can visualise how the random effect separates the groups. For this, we will use a function of the package lattice, that we’ve installed and loaded at the beginning.

First, we need to extract the actual value of the random effects from the model output

re0 <- ranef(m0)
head(re0$ape)
           (Intercept)
Abdul Noor  -0.4014023
Abigail     -0.5678828
Aileen      -0.9121691
Airiana     -0.4525263
Alan         1.3463982
Alexander   -0.3148665

And now we plot it

dotplot(re0)
$ape

Do the group means seem normally distributed? They should be!

Model evaluation

Now we need to produce a couple of plots to evaluate the models. What we are looking for is residuals that have no structure, but only in gaussian (normal) GLMMs you can use a qqplot to test the normality of residuals, as poisson and binomial models don’t expect the errors to be normally distributed. However, the package DHARMa lets us bypass that issue by creating plots that look like the familiar qq and residual plots but are appropriate for all families (look at the package documentation for more info, at this level this is all you need to know!).

sim_res <- simulateResiduals(m0)
plot(sim_res)

We can also just plot the residuals and look for structure

plot(residuals(m0))

Model 1

Now we can start adding fixed effects

Run model

We run this exactly the same as the null model but adding the age as an explanatory variable that may explain dominance. You should be able to code this by yourselves now and display the summary

Warning in glmer(dominance ~ 1 + age + (1 | ape), data = apes_clean): calling
glmer() with family=gaussian (identity link) as a shortcut to lmer() is
deprecated; please call lmer() directly
Linear mixed model fit by REML ['lmerMod']
Formula: dominance ~ 1 + age + (1 | ape)
   Data: apes_clean

REML criterion at convergence: 1047

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7381 -0.6434  0.0194  0.5900  3.1891 

Random effects:
 Groups   Name        Variance Std.Dev.
 ape      (Intercept) 0.90060  0.9490  
 Residual             0.09522  0.3086  
Number of obs: 825, groups:  ape, 168

Fixed effects:
             Estimate Std. Error t value
(Intercept) -0.120973   0.078370  -1.544
age          0.033297   0.004595   7.246

Correlation of Fixed Effects:
    (Intr)
age -0.324

Here we see a bit more than in the other model. We see that there is an effect of age that is positive, but quite small. With every year of age, apes seem to increase their aggression score in 0.03, which doesn’t seem much. How do we know if this is significant, if we don’t have a p-value? We can use an ANOVA function to compare this and the null model (and any other models we run with increasing complexity)

anova(m0, m1)
refitting model(s) with ML (instead of REML)
Data: apes_clean
Models:
m0: dominance ~ 1 + (1 | ape)
m1: dominance ~ 1 + age + (1 | ape)
   npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
m0    3 1091.3 1105.4 -542.63    1085.3                         
m1    4 1042.7 1061.5 -517.34    1034.7 50.593  1  1.137e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see a Chisq test and a p-value, which are testing whether the addition of age is improving the model fit significantly (that is, that the effect of age is significant)

  • Do you think it is?

We can check if the amount of variance explained by the random effect has changed much. Remember, variance of age divided by the sum of variance of age and residual variance (you get the values from the model summary)

[1] 0.9043803
  • Has the proportion of variance explained by ape changed between this and the null model?

Random effect assessment

Let’s visualise the random effect again:

$ape

  • Does this look different from the residuals of the null model?

  • Why?

Model evaluation

Use the functions of the DHARMa package to produce diagnostic plots, and also plot the residuals

#| echo: false
sim_res <- simulateResiduals(m1)
plot(sim_res)

plot(residuals(m1))

Model 2

Can you think of what more we can add to the model to help explain the data better? Try it!

Nested model

Think about the structure of the data. Do you think we can say that the random factor “ape” is nested within “species”?

Let’s try a nested random effects model then! This is how you specify nested random effects

m3 <- glmer(dominance ~ age + (1|species/ape), data = apes_clean, family = gaussian)
Warning in glmer(dominance ~ age + (1 | species/ape), data = apes_clean, :
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is
deprecated; please call lmer() directly
summary(m3)
Linear mixed model fit by REML ['lmerMod']
Formula: dominance ~ age + (1 | species/ape)
   Data: apes_clean

REML criterion at convergence: 1036.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7460 -0.6260  0.0220  0.5842  3.2061 

Random effects:
 Groups      Name        Variance Std.Dev.
 ape:species (Intercept) 0.80831  0.8991  
 species     (Intercept) 0.11511  0.3393  
 Residual                0.09521  0.3086  
Number of obs: 825, groups:  ape:species, 168; species, 5

Fixed effects:
             Estimate Std. Error t value
(Intercept) -0.136412   0.169593  -0.804
age          0.033184   0.004594   7.224

Correlation of Fixed Effects:
    (Intr)
age -0.149

From now on, you can produce additional outputs in the same way as with the other models. The only thing you can’t do is use the anova function to compare models, as this has a different random effect structure, so it’s no longer easily comparable to the rest.

After looking at the outputs from this nested model. answer the following questions:

  • Has the size of the effect of age on dominance changed after changing species from fixed to nested random effect?
  • What explains more variability in the data, ape, or species?
  • Has the explanatory power of ape been reduced by the introduction of species as a random nested effect? (compare this model to model 2)
  • How many random effect plots do you get now with the function dotplot? What do they tell us?
  • Finally, have a think: if you’re not satisfied with the diagnostics of this nested model, what other information about the apes, the data collection method, etc. could have been useful to have to add to the model and explain the data better?