This material corresponds to Week 11 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.

It documents Linear Mixed Models with Random Effects using R. The code loads necessary libraries, simulates data from simulation, fits loggistic distributions to simulated and real data, and adds random effects lmer().

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(MuMIn)
library(arm)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## arm (Version 1.14-4, built: 2024-4-1)
## 
## Working directory is D:/1_MisDocumentos/Documents/R_Class/4th_Class/4class
## 
## 
## Attaching package: 'arm'
## 
## The following object is masked from 'package:MuMIn':
## 
##     coefplot
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:arm':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Conceptual Foundations (Essential)

You may ask when to use mixed models (lmer() function), whenever the data structure violates independence because observations are:

  • Clustered (subjects, sites, blocks, plots)
  • Repeated (paired data, temporal data)
  • Hierarchical (subjects within blocks, plots within sites)
  • Crossed (Site × Species Mix in BIODEPTH)

Random Effects: when each applies

Random effect type When used Model syntax
Random intercepts Subjects/groups differ in baseline level of the response. (1 | group)
Random slopes Subjects/groups differ in the effect of a predictor. (predictor | group)
Random intercepts + slopes Both intercepts and slopes vary across groups; most flexible. (1 + predictor | group)

Paired Data → Random Intercepts

We will take the Corn yield example, which as you notice is equivalent to a paired t-test but using a mixed model.

reg <- c(1903,1935,1910,2496,2108,1961,2060,1444,1612,1316,1511)
kiln <- c(2009,1915,2011,2463,2180,1925,2122,1482,1542,1443,1535)


t.test(reg, kiln, paired = TRUE)
## 
##  Paired t-test
## 
## data:  reg and kiln
## t = -1.6905, df = 10, p-value = 0.1218
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -78.18164  10.72710
## sample estimates:
## mean difference 
##       -33.72727
test1 <- data.frame(
  trt = factor(rep(c("reg","kiln"), each=11)),
  subject = rep(1:11, 2),
  yield = c(reg, kiln)
)

Then fit the model…

mod1 <- lmer(yield ~ trt + (1|subject), data=test1)

And not forgettttttt about summary and anova functions please, never and ever.

summary(mod1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: yield ~ trt + (1 | subject)
##    Data: test1
## 
## REML criterion at convergence: 261.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.16666 -0.57844 -0.08182  0.70200  1.05021 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 111940   334.57  
##  Residual               2189    46.79  
## Number of obs: 22, groups:  subject, 11
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1875.18     101.86   18.41
## trtreg        -33.73      19.95   -1.69
## 
## Correlation of Fixed Effects:
##        (Intr)
## trtreg -0.098
anova(mod1)
## Analysis of Variance Table
##     npar Sum Sq Mean Sq F value
## trt    1 6256.4  6256.4  2.8577

Ask your self, why we are not using vif from car in this case?

Interpretation, (1|subject) absorbs between-subject differences. Equivalent inference to a paired t-test but extendable to more complex designs.

BIODEPTH Example

Random intercepts vs random slopes. It is a classic multi-site biodiversity experiment conducted across eight European grasslands, designed to test a fundamental ecological question, does plant species richness increase primary productivity (yield)?

The dataset contains 480 plots across 8 Sites across Europe (Germany, Greece, Ireland, Portugal, Sheffield, Silwood, Sweden, Switzerland). Also it has two blocks per site (15 blocks in total; Portugal has a special structure), 200 different species mixtures, Diversity manipulated experimentally (1, 2, 4, 8, 16, 32 species), and Replicated compositions (each mixture appears once per block).

This structure creates multiple sources of non-independence, as you notice now it is like the normal work for us in science, no dependency…

Take into account, 1. Plots within the same site share climate & soil. 2. Plots within a block share local microsite conditions. 3. Plots with the same species mixture share compositional identity. 4. Diversity effects may differ across sites (interaction).

These sources generate hierarchical, crossed, and partially crossed random effects, which makes standard lm() invalid.Then Mixed models are the only appropriate tool.

Load and Prepare Data:

biodepth <- read.csv("Biodepth_data.csv", header=TRUE)

biodepth <- biodepth %>%
  mutate(
    Mix        = factor(Mix),
    Diversity2 = log(Diversity, 2),   # log2: each unit = doubling of species
    Yield      = Shoot2,
    Site       = factor(Site),
    Block      = factor(Block)
  )

Understanding the Structure Before Modelling

Before any model is fitted, you should visualize the structure.Plot Yield vs Diversity separated by Site:

ggplot(biodepth, aes(Diversity, Yield)) +
  geom_point(alpha=.5) +
  geom_smooth(method="lm") +
  scale_x_continuous(trans="log2") +
  facet_wrap(~Site, scales="free")

There is a trend but also what this figure shows,

Intercept differences Some sites (e.g., Portugal) have low productivity; others (Ireland, Silwood) have much higher productivity.Thennnn -> indicates need for random intercepts.

Slope differences In some sites, diversity strongly predicts yield (Germany); in others, effect is weak (Greece).Thennnn -> indicates need for random slopes.

Heteroscedasticity across sites Some sites show larger residual variance. Thennnn -> random structure must absorb this.

And not talk about differences in outliers…

Please ask yourself and answer, Why is a single regression line inappropriate for all sites? Which visual patterns imply that random intercepts are required? How does ignoring site structure change scientific conclusions?

Separate Regressions (No Pooling)

Each site has its own intercept and slope, estimated independently.

lmList(Yield ~ Diversity2 | Site, data = biodepth)
## Call: lmList(formula = Yield ~ Diversity2 | Site, data = biodepth) 
## Coefficients:
##             (Intercept) Diversity2
## Germany        492.7615 129.077441
## Greece         216.9712   9.028654
## Ireland        496.2981 100.384693
## Portugal       123.9790  70.571644
## Sheffield      378.4164 126.242377
## Silwood        469.6008  62.021579
## Sweden         180.1298  64.108158
## Switzerland    347.7757  98.794057
## 
## Degrees of freedom: 464 total; 448 residual
## Residual standard error: 224.8364

Characteristics:

Zero pooling No information sharing High variance, large uncertainty Strong slopes or flat slopes depending on site Not valid for population-level inference

Key question for you at this part

What problems arise from estimating 8 separate regressions?

(Hint: no shared information -> inflated variance, overfitting, cannot estimate overall relationship)

Mixed Models, stepwise Model Building

Model 1 (we can called m_int). Random intercepts only. It assumes slopes identical across sites and allows different baseline productivity.

m_int <- lmer(Yield ~ Diversity2 + (1|Site), data = biodepth)

Remember our first graph… if you dont do so, please go back an take a look, does the plot support equal slopes?

Random intercepts + random slopes

Model 2 (we can called m_slp). In this case it allows sites to have different intercepts and slopes, and the most biologically realistic model.

m_slp <- lmer(Yield ~ Diversity2 + (1 + Diversity2 | Site), data = biodepth)

Compare Models, LRT (Likelihood Ratio Test)

anova(m_int, m_slp)
## refitting model(s) with ML (instead of REML)
## Data: biodepth
## Models:
## m_int: Yield ~ Diversity2 + (1 | Site)
## m_slp: Yield ~ Diversity2 + (1 + Diversity2 | Site)
##       npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
## m_int    4 6392.6 6409.2 -3192.3   6384.6                        
## m_slp    6 6383.8 6408.7 -3185.9   6371.8 12.796  2   0.001665 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What to look for:

Large χ²

p < 0.05 (0.0017) -> evidence that random slopes improve model fit.

Please look for or google the next questions and answer then, What does the likelihood ratio test compare? Why must random effects be tested via LRT and not F-tests?

Inspect Variance Components

VarCorr(m_slp)
##  Groups   Name        Std.Dev. Corr 
##  Site     (Intercept) 143.215       
##           Diversity2   34.012  0.858
##  Residual             224.729

Which variance component is largest, and what does that tell you about ecological variation?

And now R² for mixed models,

r.squaredGLMM(m_slp)
##            R2m       R2c
## [1,] 0.1163115 0.4854576

Interpretation, marginal R² = fixed effects only, and conditional R² = fixed + random effects.

Extracting and Interpreting BLUPs

coef(m_slp)    # group-specific intercepts and slopes
## $Site
##             (Intercept) Diversity2
## Germany        500.3653  121.45845
## Greece         173.1250   33.43000
## Ireland        478.0543  109.73929
## Portugal       157.1408   50.91277
## Sheffield      405.8088  105.75970
## Silwood        421.9654   87.16501
## Sweden         197.5279   55.20702
## Switzerland    360.8827   91.30291
## 
## attr(,"class")
## [1] "coef.mer"
fixef(m_slp)   # population-level effects
## (Intercept)  Diversity2 
##   336.85877    81.87189
ranef(m_slp)   # deviations from population mean
## $Site
##             (Intercept) Diversity2
## Germany       163.50654  39.586554
## Greece       -163.73377 -48.441892
## Ireland       141.19553  27.867397
## Portugal     -179.71799 -30.959126
## Sheffield      68.95005  23.887806
## Silwood        85.10663   5.293116
## Sweden       -139.33087 -26.664870
## Switzerland    24.02388   9.431016
## 
## with conditional variances for "Site"

Shrinkage, BLUPs “shrink” estimates toward the global mean:

  1. Extreme site slopes become more moderate
  2. Poorly-sampled sites borrow strength from well-sampled ones

Ask yourself, why is shrinkage a good thing in ecological multi-site studies?

Prediction, Mean + Uncertainty

Population-level predictions are so important for publications. Normally what you see at a paper are not the real regressions. They are the predictions from the fit final model as follows,

newdat <- expand.grid(
  Diversity2 = seq(min(biodepth$Diversity2), max(biodepth$Diversity2), length=100),
  Site       = levels(biodepth$Site)
)

p_fix <- predict(m_slp, newdata=newdat, re.form=NA, se.fit=TRUE)

newdat$fit   <- p_fix$fit
newdat$se    <- p_fix$se.fit
newdat$upper <- newdat$fit + 1.96 * newdat$se
newdat$lower <- newdat$fit - 1.96 * newdat$se

Plotting the general prediction ignoring all random effects,

ggplot(newdat, aes(Diversity2, fit)) +
  geom_line(size=1.2) +
  geom_ribbon(aes(ymin=lower, ymax=upper), alpha=0.2) +
  labs(y = "Predicted Yield") +
  theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Y^=β0+β1*Diversity2

Site-specific predictions

p_site <- predict(m_slp, newdata=newdat, re.form=NULL)
newdat$fit_site <- p_site

ggplot(newdat, aes(Diversity2, fit_site, colour=Site)) +
  geom_line(size=1) +
  theme_bw()

Final Reflection Questions

Please to go a bit further answer,

How does a mixed model serve as a compromise between complete pooling and no pooling? What ecological processes might cause slope heterogeneity among sites? If slope variance were ~0, what would that mean about biodiversity effects?

Why would removing blocks or mixtures bias the Diversity effect?