This material corresponds to Week 11 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.
It documents Linear Mixed Models with Random Effects using R. The code loads necessary libraries, simulates data from simulation, fits loggistic distributions to simulated and real data, and adds random effects lmer().
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(MuMIn)
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## arm (Version 1.14-4, built: 2024-4-1)
##
## Working directory is D:/1_MisDocumentos/Documents/R_Class/4th_Class/4class
##
##
## Attaching package: 'arm'
##
## The following object is masked from 'package:MuMIn':
##
## coefplot
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:arm':
##
## logit
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
You may ask when to use mixed models (lmer() function), whenever the data structure violates independence because observations are:
| Random effect type | When used | Model syntax | |
|---|---|---|---|
| Random intercepts | Subjects/groups differ in baseline level of the response. | (1 | group) |
|
| Random slopes | Subjects/groups differ in the effect of a predictor. | (predictor | group) |
|
| Random intercepts + slopes | Both intercepts and slopes vary across groups; most flexible. | (1 + predictor | group) |
We will take the Corn yield example, which as you notice is equivalent to a paired t-test but using a mixed model.
reg <- c(1903,1935,1910,2496,2108,1961,2060,1444,1612,1316,1511)
kiln <- c(2009,1915,2011,2463,2180,1925,2122,1482,1542,1443,1535)
t.test(reg, kiln, paired = TRUE)
##
## Paired t-test
##
## data: reg and kiln
## t = -1.6905, df = 10, p-value = 0.1218
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -78.18164 10.72710
## sample estimates:
## mean difference
## -33.72727
test1 <- data.frame(
trt = factor(rep(c("reg","kiln"), each=11)),
subject = rep(1:11, 2),
yield = c(reg, kiln)
)
Then fit the model…
mod1 <- lmer(yield ~ trt + (1|subject), data=test1)
And not forgettttttt about summary and anova functions please, never and ever.
summary(mod1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: yield ~ trt + (1 | subject)
## Data: test1
##
## REML criterion at convergence: 261.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.16666 -0.57844 -0.08182 0.70200 1.05021
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 111940 334.57
## Residual 2189 46.79
## Number of obs: 22, groups: subject, 11
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1875.18 101.86 18.41
## trtreg -33.73 19.95 -1.69
##
## Correlation of Fixed Effects:
## (Intr)
## trtreg -0.098
anova(mod1)
## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## trt 1 6256.4 6256.4 2.8577
Ask your self, why we are not using vif from car in this case?
Interpretation, (1|subject) absorbs between-subject differences. Equivalent inference to a paired t-test but extendable to more complex designs.
Random intercepts vs random slopes. It is a classic multi-site biodiversity experiment conducted across eight European grasslands, designed to test a fundamental ecological question, does plant species richness increase primary productivity (yield)?
The dataset contains 480 plots across 8 Sites across Europe (Germany, Greece, Ireland, Portugal, Sheffield, Silwood, Sweden, Switzerland). Also it has two blocks per site (15 blocks in total; Portugal has a special structure), 200 different species mixtures, Diversity manipulated experimentally (1, 2, 4, 8, 16, 32 species), and Replicated compositions (each mixture appears once per block).
This structure creates multiple sources of non-independence, as you notice now it is like the normal work for us in science, no dependency…
Take into account, 1. Plots within the same site share climate & soil. 2. Plots within a block share local microsite conditions. 3. Plots with the same species mixture share compositional identity. 4. Diversity effects may differ across sites (interaction).
These sources generate hierarchical, crossed, and partially crossed random effects, which makes standard lm() invalid.Then Mixed models are the only appropriate tool.
Load and Prepare Data:
biodepth <- read.csv("Biodepth_data.csv", header=TRUE)
biodepth <- biodepth %>%
mutate(
Mix = factor(Mix),
Diversity2 = log(Diversity, 2), # log2: each unit = doubling of species
Yield = Shoot2,
Site = factor(Site),
Block = factor(Block)
)
Before any model is fitted, you should visualize the structure.Plot Yield vs Diversity separated by Site:
ggplot(biodepth, aes(Diversity, Yield)) +
geom_point(alpha=.5) +
geom_smooth(method="lm") +
scale_x_continuous(trans="log2") +
facet_wrap(~Site, scales="free")
There is a trend but also what this figure shows,
Intercept differences Some sites (e.g., Portugal) have low productivity; others (Ireland, Silwood) have much higher productivity.Thennnn -> indicates need for random intercepts.
Slope differences In some sites, diversity strongly predicts yield (Germany); in others, effect is weak (Greece).Thennnn -> indicates need for random slopes.
Heteroscedasticity across sites Some sites show larger residual variance. Thennnn -> random structure must absorb this.
And not talk about differences in outliers…
Please ask yourself and answer, Why is a single regression line inappropriate for all sites? Which visual patterns imply that random intercepts are required? How does ignoring site structure change scientific conclusions?
Each site has its own intercept and slope, estimated independently.
lmList(Yield ~ Diversity2 | Site, data = biodepth)
## Call: lmList(formula = Yield ~ Diversity2 | Site, data = biodepth)
## Coefficients:
## (Intercept) Diversity2
## Germany 492.7615 129.077441
## Greece 216.9712 9.028654
## Ireland 496.2981 100.384693
## Portugal 123.9790 70.571644
## Sheffield 378.4164 126.242377
## Silwood 469.6008 62.021579
## Sweden 180.1298 64.108158
## Switzerland 347.7757 98.794057
##
## Degrees of freedom: 464 total; 448 residual
## Residual standard error: 224.8364
Characteristics:
Zero pooling No information sharing High variance, large uncertainty Strong slopes or flat slopes depending on site Not valid for population-level inference
Key question for you at this part
What problems arise from estimating 8 separate regressions?
(Hint: no shared information -> inflated variance, overfitting, cannot estimate overall relationship)
Model 1 (we can called m_int). Random intercepts only. It assumes slopes identical across sites and allows different baseline productivity.
m_int <- lmer(Yield ~ Diversity2 + (1|Site), data = biodepth)
Remember our first graph… if you dont do so, please go back an take a look, does the plot support equal slopes?
Model 2 (we can called m_slp). In this case it allows sites to have different intercepts and slopes, and the most biologically realistic model.
m_slp <- lmer(Yield ~ Diversity2 + (1 + Diversity2 | Site), data = biodepth)
anova(m_int, m_slp)
## refitting model(s) with ML (instead of REML)
## Data: biodepth
## Models:
## m_int: Yield ~ Diversity2 + (1 | Site)
## m_slp: Yield ~ Diversity2 + (1 + Diversity2 | Site)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m_int 4 6392.6 6409.2 -3192.3 6384.6
## m_slp 6 6383.8 6408.7 -3185.9 6371.8 12.796 2 0.001665 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What to look for:
Large χ²
p < 0.05 (0.0017) -> evidence that random slopes improve model fit.
Please look for or google the next questions and answer then, What does the likelihood ratio test compare? Why must random effects be tested via LRT and not F-tests?
VarCorr(m_slp)
## Groups Name Std.Dev. Corr
## Site (Intercept) 143.215
## Diversity2 34.012 0.858
## Residual 224.729
Which variance component is largest, and what does that tell you about ecological variation?
And now R² for mixed models,
r.squaredGLMM(m_slp)
## R2m R2c
## [1,] 0.1163115 0.4854576
Interpretation, marginal R² = fixed effects only, and conditional R² = fixed + random effects.
coef(m_slp) # group-specific intercepts and slopes
## $Site
## (Intercept) Diversity2
## Germany 500.3653 121.45845
## Greece 173.1250 33.43000
## Ireland 478.0543 109.73929
## Portugal 157.1408 50.91277
## Sheffield 405.8088 105.75970
## Silwood 421.9654 87.16501
## Sweden 197.5279 55.20702
## Switzerland 360.8827 91.30291
##
## attr(,"class")
## [1] "coef.mer"
fixef(m_slp) # population-level effects
## (Intercept) Diversity2
## 336.85877 81.87189
ranef(m_slp) # deviations from population mean
## $Site
## (Intercept) Diversity2
## Germany 163.50654 39.586554
## Greece -163.73377 -48.441892
## Ireland 141.19553 27.867397
## Portugal -179.71799 -30.959126
## Sheffield 68.95005 23.887806
## Silwood 85.10663 5.293116
## Sweden -139.33087 -26.664870
## Switzerland 24.02388 9.431016
##
## with conditional variances for "Site"
Shrinkage, BLUPs “shrink” estimates toward the global mean:
Ask yourself, why is shrinkage a good thing in ecological multi-site studies?
Population-level predictions are so important for publications. Normally what you see at a paper are not the real regressions. They are the predictions from the fit final model as follows,
newdat <- expand.grid(
Diversity2 = seq(min(biodepth$Diversity2), max(biodepth$Diversity2), length=100),
Site = levels(biodepth$Site)
)
p_fix <- predict(m_slp, newdata=newdat, re.form=NA, se.fit=TRUE)
newdat$fit <- p_fix$fit
newdat$se <- p_fix$se.fit
newdat$upper <- newdat$fit + 1.96 * newdat$se
newdat$lower <- newdat$fit - 1.96 * newdat$se
Plotting the general prediction ignoring all random effects,
ggplot(newdat, aes(Diversity2, fit)) +
geom_line(size=1.2) +
geom_ribbon(aes(ymin=lower, ymax=upper), alpha=0.2) +
labs(y = "Predicted Yield") +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Y^=β0+β1*Diversity2
p_site <- predict(m_slp, newdata=newdat, re.form=NULL)
newdat$fit_site <- p_site
ggplot(newdat, aes(Diversity2, fit_site, colour=Site)) +
geom_line(size=1) +
theme_bw()
Please to go a bit further answer,
How does a mixed model serve as a compromise between complete pooling and no pooling? What ecological processes might cause slope heterogeneity among sites? If slope variance were ~0, what would that mean about biodiversity effects?
Why would removing blocks or mixtures bias the Diversity effect?