Mixed effect models in R - a step by step guide

Hello!

This is an RMarkdown document that I’ve created for you to help you through your mixed effects models - it allows me to embed code into a document so you can copy and paste it into your RStudio session and run it without needing me to be there to talk you through it 😀 It’s a very snazzy tool that you can use within R - so I’ve written this document using my RStudio window (and I’m playing with it’s functionality for teaching, so using you as a guineapig!)

Mixed effects models in R (or GLMMs)

So, first things first we will library all your important packages, and then import your data into R in the way that you’ve always done, using the file.choose() argument wrapped inside the read.csv() function…

library(tidyverse) # This has lots of your basic R functionality in - including ggplot2
library(lme4) # This is the package you use to run your mixed effects models
library(car) # This is the package you use to extract your p-values from your models with the Anova() function
library(cAIC4) # This package will help you in your stepwise model selection process - see later on in the document!
library(emmeans) # This is the package you use to run your pairwise comparisons

coviddata <- read.csv(file.choose(), header = TRUE, stringsAsFactors = TRUE, na.strings = "")

Check that all of your columns are the format that you’re expecting them to be, and don’t forget that number of Ocrelizumab infusions will be numeric rather than a factor, so we should convert that with the following code:

coviddata$number.of.Ocrelizumab.infusions <- as.factor(coviddata$number.of.Ocrelizumab.infusions)

Right, now we’ve got our data imported into R, and everything is the correct class, the fun can begin! We need to run what is known as a fully saturated model (i.e. a model that includes ALL fixed terms, all sensible interactions between these fixed terms, and our random factor of patient ID). We use the lmer() function within the lme4 package to do this, like so:

m1 <- lmer(Spike.Bau ~ number.of.Ocrelizumab.infusions*days.between.blood.test.and.last.infusion*infusion.after.vaccine*time.interval.between.covid.vaccine.and.infusion+(1|K.number), data = coviddata)

…and then you can check your AIC

AIC(m1)

Now we get to the clever bit - there is a nifty package called cAIC4 which will do your stepwise deletions and compare your AICs for you - remember that the model with the AIC closest to zero is the model that best explains why your data look as they do (although here we’re using something called a conditional AIC, which is a derivation of a traditional AIC often used specifically for mixed effects models for mathematical reasons that you don’t need to worry too much about.

So, you use the stepcAIC() function from within the cAIC4 package to do this stepwise deletion for you, and produce an output with the best fitting model and its cAIC, we do this as follows:

stepwise <- stepcAIC(m1, direction = "backward", returnResult = TRUE, trace = TRUE)

This looks a bit scary and it’s not obvious whats going on here, so I’ll explain what each of the bits is doing!

I’ve called the object 'stepwise', as it’s a stepwise deletion process (simple stuff so far)
stepcAIC(...) is the function that you’re using from the package - it will run every model iteration possible for you
direction = "backward" - this tells the function that we want to start with a fully saturated model and work backwards to find the best fitting model. We can work the other way by specifying direction = "forward" but that involves more complicated code for no real benefit!
returnResult = TRUE - this is fairly self explanatory, and will return the best cAIC to you in the output
trace = TRUE - you need this argument for the function to print you the result of the best model. Without it, it will tell you the best cAIC, but counter-intuitively it will not tell you how it got there. This is one of R’s weird little features.

When you run this line of code, you should end up with an output that tells you the best fitting model. In my case (and yours!), we end up with the best fitting model m2. When we run m2, and ask the Anova() function to extract us some p-values, we can see some significance in our results!

m2 <- lmer(Spike.Bau ~ number.of.Ocrelizumab.infusions + days.between.blood.test.and.last.infusion + infusion.after.vaccine + time.interval.between.covid.vaccine.and.infusion + (1 | K.number), data = coviddata)

Anova(m2)

Note: you do have to add some bits to the model yourself to make it run properly - the output that stepcAIC gives you is not functional R code

Now we have the best fitting model (write it down, it’s important for you to mention both the above process and what the final model is in your results section!!), we can begin our pairwise comparisons, using the emmeans package. Generally, you would only do pairwise comparisons on significant results, but here we are specifically interested in what might be hiding in the Ocrelizumab treatment numbers factor in our data - it’s possible that a significant result might be being masked by lots of non-significance elsewhere, so we’ll investigate:

pairsm2 <- emmeans(m2, specs = "number.of.Ocrelizuman.infusions", adjust = "none")

This bit of code works like this:

We’ve called the object “pairsm2”, as it’s the pairwise comparisons of m2 (obviously)
then we use the function emmeans() from the emmeans package, to actually perform the pairwise comparisons
m2 specifies the model that you’re asking emmeans() to do comparisons within this specific model
specs = ... specifies what factor you want to perform the comparisons within, here it’s number of Ocrelizumab treatments
adjust = "none" tells us that we’re not doing any extra p-value adjustments (don’t worry about this)

Once you’ve run that, you can ask R to show you a table of the pairwise comparisons

pairs(pairsm2)

You should get a print-out of your results that looks a little like this…

… which contains all your test statistics and p-values and other useful stuff that you need to write about your results properly! You can either copy these into an excel document, or if you’re feeling clever, you can write this output directly to a .csv file to avoid faffing with copying and pasting R output (which is famously a bit of a pain!)

write.csv(pairs(pairsm2), file = "Ocrelizumab_infusions_pairs.csv")

Plotting your data

Now you’ve got your results in your table, you’ll want to produce plots to accompany them (note: it can be useful to produce plots first, but as we learned with the number of Ocrelizumab treatments, it can cause you to be tempted to ‘p-value fish’ in a way that is not statistical best-practise, so be careful of this!)

Here, we will use the package ggplot2 to create you some nice figures to accompany your GLMM results, and convince both yourself and your reader that your results are reflecting what you know to be true of your data. The “gg” in ggplot2 stands for “Grammar of Graphics”, and ggplot2 follows a defined set of grammatical (sort of) rules, some of which you’ll learn about below…

In order to get a basic boxplot of your data, we use the ggplot2 package, and the following code:

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_boxplot()

The way the code for this is structured is as follows:

First, we have the ggplot() function, which tells R that we want to use the ggplot2 package to create this graph. Note - the package is ggplot2, but the function is just ggplot(), without the 2!
Next up, we have this aes() function within the ggplot() wrapper. This refers to your plots aesthetics. Each aesthetic of a plot is essentially just ‘something you can see’. Each aesthetic, or aes() is a mapping between a visual cue and a variable (i.e. aes(x = ...) tells R that you want to make a link between your specified variable and the x axis, or, you want to plot that variable along the x axis. This will crop up again later!
Then, outside of the aes() wrapper, we have data = coviddata, which is just telling R what data.frame you want it to look for the information that you are plotting in.
then we have a + symbol at the end of the line, but outside of all of our closing brackets. This tells R that even though the brackets are resolved (i.e. all the opening brackets are matched with closing brackets), there is a bit of extra stuff that we want to add to this code.
Finally, geom_boxplot(). In ggplot2, an object placed on a plot is known as a geom, and with the function geom_boxplot we are specifying that we want it to be a boxplot. This general logic applies throughout ggplot, and if you wanted to create a histogram, you might type geom_histogram() for example.

As you can see, this isn’t a particularly pretty boxplot - we can improve it by adding new aesthetics, like this:

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_boxplot(aes(fill = number.of.Ocrelizumab.infusions))

Here, we’ve added a new aesthetic, or aes to our plot, and we are asking R to create a mapping or a link between the fill colour of the boxes and our variable for number of Ocrelizumab infusions (fill = number.of.Ocrelizumab.infusions).

That looks a bit nicer, although we’ve just found that none of the pairwise comparisons are significantly different, but this graph makes it look as though they are! We can sense-check ourselevs and our results by thinking about what the distribution of points within these boxes looks like. We can do that by adding an extra line of code to our plot…

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_boxplot(aes(fill = number.of.Ocrelizumab.infusions))+
  geom_point()

Notice that we put the geom_point() underneath the geom_boxplot() function - this is beacuse ggplot2 will plot things in the order that you tell it to - if we tell it to plot the points first, our graph looks like this:

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_point()+
  geom_boxplot(aes(fill = number.of.Ocrelizumab.infusions))

Going back to the plot above this last one, where we can actually see the points, we begin to realise that with only 4 points in our 2 Ocrelizumab treatments group, perhaps our initial boxplot might be a little misleading. We can either remove the boxplot all together, leaving only the points, or we can convert the boxplot to a violin plot to give us an idea of the density of the points in each group. We do this by simply swapping geom_boxplot(...) for geom_violin(...)

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_violin(aes(fill = number.of.Ocrelizumab.infusions))+
  geom_point()

This is much better, and helps us to understand why our pairwise comparisons are the way that they are. However, if we look at the results of our best fitting model (m2), we can see that whether a participant has had their vaccine before or after their infusion is significant, so this might be something that we might want to explore graphically. We will run similar code to before, but swap the fill aesthetic from the violins to the points, and tell R that we want the aes() to map the colour of the points to data about whether the patient has had an infusion before their covid vaccine. We do this like this:

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_violin()+
  geom_point(aes(colour = infusion.after.vaccine))

Notice that when we swapped our aes down into the geom_point() we also changed fill = ... to colour = .... This is just because the default point shape that R uses is a solid circle, so you’re changing the colour of a solid line, not the fill of an empty circle!

Now, we’ve got a graph that shows us something potentially quite useful. We could make it a little easier to understand though - the points are crowded on top of oneanother and a bit tricky to see… We will swap the geom_point(...) for geom_jitter() and see if we can improve it!

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_violin()+
  geom_jitter(aes(colour = infusion.after.vaccine), width = 0.1)

We’ve used geom_jitter() because it spreads the points that are on top of oneanother out slightly to make them easier to see. You can play with the width = 0.1 argument to edit how far the points spread. Although this is much more useful than before, it’s still not the best looking plot we could make it. We will play with the aes of the geom_jitter() to try and make it a little neater!

ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
  geom_violin()+
  geom_jitter(aes(fill = infusion.after.vaccine), pch = 21, size = 2, colour = "black", alpha = 0.7, width = 0.1)

This looks a bit nicer, but what have we changed?

pch = 21 this changes the shape of the points. 21 is a black circle with a coloured fill, so notice I’ve changed the aesthetics back to say fill = ... instead of colour = ...
size = 2 changes the size of the dots, the default setting is 1.
colour = black adds a black outline to the dots. Notice that it isn’t within the aes() wrapper because I’m not mapping it to any features of our variables. If I wanted the colour to be different according to infusion and vaccine status, it would live within the aesthetics, but because I’ve changed it to black simply because I think it looks nicer, it stays outside.
alpha = 0.7 changes the transparancy of the points. An alpha value of 1 is a fully opaque point.

Now, go away and have fun playing with making your plot look pretty! I’ve had a quick go, and am challenging you to make yours look as nice as this one… 😀😀

All the stuff you need to be able to this is available here: http://www.sthda.com/english/wiki/ggplot2-essentials

Mixed effect models in R - a step by step guide

Hannah Jackson

2023-09-14

Hello!

Mixed effects models in R (or GLMMs)

Plotting your data