This is an RMarkdown document that I’ve created for you to help you through your mixed effects models - it allows me to embed code into a document so you can copy and paste it into your RStudio session and run it without needing me to be there to talk you through it 😀 It’s a very snazzy tool that you can use within R - so I’ve written this document using my RStudio window (and I’m playing with it’s functionality for teaching, so using you as a guineapig!)
So, first things first we will library all your important packages,
and then import your data into R in the way that you’ve always done,
using the file.choose()
argument wrapped inside the
read.csv()
function…
library(tidyverse) # This has lots of your basic R functionality in - including ggplot2
library(lme4) # This is the package you use to run your mixed effects models
library(car) # This is the package you use to extract your p-values from your models with the Anova() function
library(cAIC4) # This package will help you in your stepwise model selection process - see later on in the document!
library(emmeans) # This is the package you use to run your pairwise comparisons
coviddata <- read.csv(file.choose(), header = TRUE, stringsAsFactors = TRUE, na.strings = "")
Check that all of your columns are the format that you’re expecting them to be, and don’t forget that number of Ocrelizumab infusions will be numeric rather than a factor, so we should convert that with the following code:
coviddata$number.of.Ocrelizumab.infusions <- as.factor(coviddata$number.of.Ocrelizumab.infusions)
Right, now we’ve got our data imported into R, and everything is the
correct class, the fun can begin! We need to run what is known as a
fully saturated model (i.e. a model that includes ALL fixed terms, all
sensible interactions between these fixed terms, and our random
factor of patient ID). We use the lmer()
function within
the lme4 package to do this, like so:
m1 <- lmer(Spike.Bau ~ number.of.Ocrelizumab.infusions*days.between.blood.test.and.last.infusion*infusion.after.vaccine*time.interval.between.covid.vaccine.and.infusion+(1|K.number), data = coviddata)
…and then you can check your AIC
AIC(m1)
Now we get to the clever bit - there is a nifty package called
cAIC4
which will do your stepwise deletions and compare
your AICs for you - remember that the model with the AIC closest to zero
is the model that best explains why your data look as they do (although
here we’re using something called a conditional AIC, which is a
derivation of a traditional AIC often used specifically for mixed
effects models for mathematical reasons that you don’t need to worry too
much about.
So, you use the stepcAIC()
function from within the
cAIC4
package to do this stepwise deletion for you, and
produce an output with the best fitting model and its cAIC, we do this
as follows:
stepwise <- stepcAIC(m1, direction = "backward", returnResult = TRUE, trace = TRUE)
This looks a bit scary and it’s not obvious whats going on here, so I’ll explain what each of the bits is doing!
'stepwise'
, as it’s a stepwise
deletion process (simple stuff so far)stepcAIC(...)
is the function that you’re using from
the package - it will run every model iteration possible for youdirection = "backward"
- this tells the function that
we want to start with a fully saturated model and work backwards to find
the best fitting model. We can work the other way by specifying
direction = "forward"
but that involves more complicated
code for no real benefit!returnResult = TRUE
- this is fairly self explanatory,
and will return the best cAIC to you in the outputtrace = TRUE
- you need this argument for the function
to print you the result of the best model. Without it, it will tell you
the best cAIC, but counter-intuitively it will not tell you how it got
there. This is one of R’s weird little features.When you run this line of code, you should end up with an output that
tells you the best fitting model. In my case (and yours!), we end up
with the best fitting model m2. When we run m2, and ask the
Anova()
function to extract us some p-values, we can see
some significance in our results!
m2 <- lmer(Spike.Bau ~ number.of.Ocrelizumab.infusions + days.between.blood.test.and.last.infusion + infusion.after.vaccine + time.interval.between.covid.vaccine.and.infusion + (1 | K.number), data = coviddata)
Anova(m2)
Note: you do have to add some bits to the model yourself to make it run properly - the output that stepcAIC gives you is not functional R code
Now we have the best fitting model (write it down, it’s important for
you to mention both the above process and what the final model is in
your results section!!), we can begin our pairwise comparisons, using
the emmeans
package. Generally, you would only do pairwise
comparisons on significant results, but here we are specifically
interested in what might be hiding in the Ocrelizumab treatment numbers
factor in our data - it’s possible that a significant result might be
being masked by lots of non-significance elsewhere, so we’ll
investigate:
pairsm2 <- emmeans(m2, specs = "number.of.Ocrelizuman.infusions", adjust = "none")
This bit of code works like this:
emmeans()
from the
emmeans
package, to actually perform the pairwise
comparisonsm2
specifies the model that you’re asking
emmeans()
to do comparisons within this specific modelspecs = ...
specifies what factor you want to perform
the comparisons within, here it’s number of Ocrelizumab treatmentsadjust = "none"
tells us that we’re not doing any extra
p-value adjustments (don’t worry about this)Once you’ve run that, you can ask R to show you a table of the pairwise comparisons
pairs(pairsm2)
You should get a print-out of your results that looks a little like this…
… which contains all your test statistics and p-values and other useful stuff that you need to write about your results properly! You can either copy these into an excel document, or if you’re feeling clever, you can write this output directly to a .csv file to avoid faffing with copying and pasting R output (which is famously a bit of a pain!)
write.csv(pairs(pairsm2), file = "Ocrelizumab_infusions_pairs.csv")
Now you’ve got your results in your table, you’ll want to produce plots to accompany them (note: it can be useful to produce plots first, but as we learned with the number of Ocrelizumab treatments, it can cause you to be tempted to ‘p-value fish’ in a way that is not statistical best-practise, so be careful of this!)
Here, we will use the package ggplot2
to create you some
nice figures to accompany your GLMM results, and convince both yourself
and your reader that your results are reflecting what you know to be
true of your data. The “gg” in ggplot2
stands for “Grammar
of Graphics”, and ggplot2
follows a defined set of
grammatical (sort of) rules, some of which you’ll learn about below…
In order to get a basic boxplot of your data, we use the
ggplot2
package, and the following code:
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_boxplot()
The way the code for this is structured is as follows:
ggplot()
function, which tells R
that we want to use the ggplot2
package to create this
graph. Note - the package is ggplot2
, but the function
is just ggplot()
, without the 2!aes()
function within the
ggplot()
wrapper. This refers to your plots aesthetics.
Each aesthetic of a plot is essentially just ‘something you can see’.
Each aesthetic, or aes()
is a mapping between a visual cue
and a variable (i.e. aes(x = ...)
tells R that you want to
make a link between your specified variable and the x axis, or, you want
to plot that variable along the x axis. This will crop up again
later!aes()
wrapper, we have
data = coviddata
, which is just telling R what data.frame
you want it to look for the information that you are plotting in.+
symbol at the end of the line, but
outside of all of our closing brackets. This tells R that even though
the brackets are resolved (i.e. all the opening brackets are matched
with closing brackets), there is a bit of extra stuff that we want to
add to this code.geom_boxplot()
. In ggplot2
, an
object placed on a plot is known as a geom
, and with the
function geom_boxplot
we are specifying that we want it to
be a boxplot. This general logic applies throughout ggplot, and if you
wanted to create a histogram, you might type
geom_histogram()
for example.As you can see, this isn’t a particularly pretty boxplot - we can improve it by adding new aesthetics, like this:
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_boxplot(aes(fill = number.of.Ocrelizumab.infusions))
Here, we’ve added a new aesthetic, or aes
to our plot,
and we are asking R to create a mapping or a link between the fill
colour of the boxes and our variable for number of Ocrelizumab infusions
(fill = number.of.Ocrelizumab.infusions
).
That looks a bit nicer, although we’ve just found that none of the pairwise comparisons are significantly different, but this graph makes it look as though they are! We can sense-check ourselevs and our results by thinking about what the distribution of points within these boxes looks like. We can do that by adding an extra line of code to our plot…
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_boxplot(aes(fill = number.of.Ocrelizumab.infusions))+
geom_point()
Notice that we put the geom_point() underneath the geom_boxplot() function - this is beacuse ggplot2 will plot things in the order that you tell it to - if we tell it to plot the points first, our graph looks like this:
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_point()+
geom_boxplot(aes(fill = number.of.Ocrelizumab.infusions))
Going back to the plot above this last one, where we can actually see
the points, we begin to realise that with only 4 points in our 2
Ocrelizumab treatments group, perhaps our initial boxplot might be a
little misleading. We can either remove the boxplot all together,
leaving only the points, or we can convert the boxplot to a violin plot
to give us an idea of the density of the points in each group. We do
this by simply swapping geom_boxplot(...)
for
geom_violin(...)
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_violin(aes(fill = number.of.Ocrelizumab.infusions))+
geom_point()
This is much better, and helps us to understand why our pairwise
comparisons are the way that they are. However, if we look at the
results of our best fitting model (m2), we can see that whether a
participant has had their vaccine before or after their infusion is
significant, so this might be something that we might want to explore
graphically. We will run similar code to before, but swap the fill
aesthetic from the violins to the points, and tell R that we want the
aes()
to map the colour of the points to data about whether
the patient has had an infusion before their covid vaccine. We do this
like this:
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_violin()+
geom_point(aes(colour = infusion.after.vaccine))
Notice that when we swapped
our
aes
down into the geom_point()
we also
changed fill = ...
to colour = ...
. This is
just because the default point shape that R uses is a solid circle, so
you’re changing the colour of a solid line, not the fill of an empty
circle!
Now, we’ve got a graph that shows us something potentially quite
useful. We could make it a little easier to understand though - the
points are crowded on top of oneanother and a bit tricky to see… We will
swap the geom_point(...)
for geom_jitter()
and
see if we can improve it!
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_violin()+
geom_jitter(aes(colour = infusion.after.vaccine), width = 0.1)
We’ve used geom_jitter()
because it spreads the points
that are on top of oneanother out slightly to make them easier to see.
You can play with the width = 0.1
argument to edit how far
the points spread. Although this is much more useful than before, it’s
still not the best looking plot we could make it. We will play with the
aes
of the geom_jitter()
to try and make it a
little neater!
ggplot(aes(x = number.of.Ocrelizumab.infusions, y = Spike.Bau), data = coviddata)+
geom_violin()+
geom_jitter(aes(fill = infusion.after.vaccine), pch = 21, size = 2, colour = "black", alpha = 0.7, width = 0.1)
This looks a bit nicer, but
what have we changed?
pch = 21
this changes the shape of the points. 21 is a
black circle with a coloured fill, so notice I’ve changed the aesthetics
back to say fill = ...
instead of
colour = ...
size = 2
changes the size of the dots, the default
setting is 1.colour = black
adds a black outline to the dots. Notice
that it isn’t within the aes()
wrapper because I’m not
mapping it to any features of our variables. If I wanted the colour to
be different according to infusion and vaccine status, it would live
within the aesthetics, but because I’ve changed it to black simply
because I think it looks nicer, it stays outside.alpha = 0.7
changes the transparancy of the points. An
alpha value of 1 is a fully opaque point.Now, go away and have fun playing with making your plot look pretty! I’ve had a quick go, and am challenging you to make yours look as nice as this one… 😀😀
All the stuff you need to be able to this is available here: http://www.sthda.com/english/wiki/ggplot2-essentials