library(MASS)
library(dplyr)
library(ggplot2)
library(brms)
library(posterior)Using the Multivariate Normal Distribution to Simulate Correlated Random Effects in a Poisson GLMM
1 Overview
This tutorial shows one of the most useful applications of the multivariate normal distribution in ecology: estimating correlated random intercepts and random slopes in a generalized linear mixed model (GLMM).
We will simulate trout abundance in lakes as a function of lake temperature, where:
- trout abundance follows a Poisson distribution
- warmer lakes tend to have fewer trout
- each lake has its own baseline value of trout abundance, modeled by a random intercept
- each lake also has its own effect of temperature on trout abundance (for example, lakes that have more water movement may have different temperature effects), represented by a random slope for temperature
- the random intercepts and slopes are correlated
We will simulate the lake-level random effects from a bivariate normal distribution using MASS::mvrnorm(), then fit the model in brms and check whether the fitted model recovers the true correlation.
2 Why the multivariate normal matters here
In mixed models, a random-intercept/random-slope structure usually assumes that the group-level effects are jointly distributed as multivariate normal. For lake \(j\), we can write:
\[ \begin{bmatrix} b_{0j} \\ b_{1j} \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{01} & \sigma_{1}^{2} \end{bmatrix} \right) \]
where:
- \(b_{0j}\) is the random intercept for lake \(j\)
- \(b_{1j}\) is the random slope for temperature in lake \(j\)
- \(\sigma_0^2\) is the variance among lake intercepts
- \(\sigma_1^2\) is the variance among lake slopes
- \(\sigma_{01}\) is the covariance between them
The correlation between random intercepts and slopes is:
\[ \rho = \frac{\sigma_{01}}{\sigma_0 \sigma_1} \]
This is exactly the kind of structure the multivariate normal distribution is designed to represent.
3 Model used for simulation
We simulate counts from a Poisson GLMM with a log link:
\[ y_{ij} \sim \text{Poisson}(\lambda_{ij}) \]
\[ \log(\lambda_{ij}) = \beta_0 + b_{0j} + (\beta_1 + b_{1j}) \cdot \text{temp}_{ij} \]
where:
- \(y_{ij}\) is trout abundance in observation \(i\) from lake \(j\)
- \(\beta_0\) is the population-level intercept
- \(\beta_1\) is the population-level temperature effect
- \(b_{0j}\) and \(b_{1j}\) are the lake-specific deviations
We set \(\beta_1 < 0\) so that higher temperature leads to lower trout abundance.
4 Packages
5 Choose true parameter values
We begin by specifying the true fixed effects and the true variance-covariance matrix for the lake-level random effects.
set.seed(123)
n_lakes <- 30
obs_per_lake <- 25
beta_0 <- 2.6
beta_1 <- -0.12
sd_intercept <- 0.7
sd_slope <- 0.08
rho <- -0.5
Sigma_b <- matrix(
c(sd_intercept^2,
rho * sd_intercept * sd_slope,
rho * sd_intercept * sd_slope,
sd_slope^2),
nrow = 2,
byrow = TRUE
)
Sigma_b [,1] [,2]
[1,] 0.490 -0.0280
[2,] -0.028 0.0064
7 Build the lake-level data set
We next create repeated observations within each lake and assign temperatures.
sim_dat <- expand.grid(
lake = factor(seq_len(n_lakes)),
obs = seq_len(obs_per_lake)
) |>
left_join(lake_effects, by = "lake") |>
mutate(
temp = rnorm(n(), mean = 16, sd = 3)
)
head(sim_dat) lake obs b0_lake b1_lake temp
1 1 1 0.39062722 -0.052088432 17.13892
2 2 1 0.16229743 0.011077901 14.49303
3 3 1 -1.09464504 0.001150769 15.00038
4 4 1 -0.05285426 -0.057789835 12.94427
5 5 1 -0.09377379 -0.051510392 12.78463
6 6 1 -1.20327014 0.021725080 16.91059
8 Simulate Poisson trout abundance
We now construct the linear predictor and simulate counts.
sim_dat <- sim_dat |>
mutate(
eta = beta_0 + b0_lake + (beta_1 + b1_lake) * temp,
lambda = exp(eta),
trout_abundance = rpois(n(), lambda = lambda)
)
head(sim_dat) lake obs b0_lake b1_lake temp eta lambda
1 1 1 0.39062722 -0.052088432 17.13892 0.04121761 1.0420788
2 2 1 0.16229743 0.011077901 14.49303 1.18368623 3.2663927
3 3 1 -1.09464504 0.001150769 15.00038 -0.27742841 0.7577298
4 4 1 -0.05285426 -0.057789835 12.94427 0.24578542 1.2786252
5 5 1 -0.09377379 -0.051510392 12.78463 0.31352994 1.3682464
6 6 1 -1.20327014 0.021725080 16.91059 -0.26515661 0.7670858
trout_abundance
1 0
2 0
3 0
4 2
5 1
6 2
8.1 Quick look at the simulated data
ggplot(sim_dat, aes(x = temp, y = trout_abundance)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
theme_minimal() +
labs(
x = "Lake temperature",
y = "Trout abundance",
title = "Simulated trout abundance declines with temperature"
)8.2 Plot by lake
ggplot(sim_dat, aes(x = temp, y = trout_abundance, color = lake)) +
geom_point(alpha = 0.5, show.legend = FALSE) +
theme_minimal() +
labs(
x = "Lake temperature",
y = "Trout abundance",
title = "Different lakes have different intercepts and slopes"
)9 Fit the model in brms
We fit the same random-intercept/random-slope Poisson GLMM using brms:
fit <- brm(
formula = trout_abundance ~ temp + (1 + temp | lake),
data = sim_dat,
family = poisson(),
chains = 4,
cores = 4,
iter = 2000,
seed = 123
)The term (1 + temp | lake) tells brms to estimate:
- a random intercept for each lake
- a random slope for temperature for each lake
- the correlation between them
10 Summarize the fitted model
summary(fit) Family: poisson
Links: mu = log
Formula: trout_abundance ~ temp + (1 + temp | lake)
Data: sim_dat (Number of observations: 750)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~lake (Number of levels: 30)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.62 0.21 0.23 1.05 1.00 1088 1214
sd(temp) 0.08 0.01 0.05 0.11 1.00 495 1007
cor(Intercept,temp) -0.55 0.23 -0.88 0.01 1.01 338 614
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 2.53 0.20 2.12 2.90 1.00 1325 2489
temp -0.13 0.02 -0.16 -0.09 1.00 1100 1790
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Look in the group-level effects section for:
sd(Intercept)sd(temp)cor(Intercept,temp)
These are the posterior estimates of the random-effect standard deviations and their correlation.
11 Compare estimated values to the truth
Here are the true values used in the simulation:
true_values <- data.frame(
parameter = c("sd_intercept", "sd_slope", "cor_intercept_slope"),
true_value = c(sd_intercept, sd_slope, rho)
)
true_values parameter true_value
1 sd_intercept 0.70
2 sd_slope 0.08
3 cor_intercept_slope -0.50
Extract posterior summaries from brms:
posterior_summary_df <- as.data.frame(posterior_summary(fit))
posterior_summary_df$parameter <- rownames(posterior_summary_df)
posterior_summary_df |>
filter(parameter %in% c("sd_lake__Intercept", "sd_lake__temp", "cor_lake__Intercept__temp")) Estimate Est.Error Q2.5 Q97.5
sd_lake__Intercept 0.6180205 0.20534587 0.22615401 1.0490536
sd_lake__temp 0.0762465 0.01428161 0.05254139 0.1071981
cor_lake__Intercept__temp -0.5517503 0.23089471 -0.87580547 0.0110716
parameter
sd_lake__Intercept sd_lake__Intercept
sd_lake__temp sd_lake__temp
cor_lake__Intercept__temp cor_lake__Intercept__temp
11.1 Posterior draws for the correlation
cor_draws <- as_draws_df(fit) |>
select(starts_with("cor_lake__Intercept__temp"))
head(cor_draws)# A tibble: 6 × 1
cor_lake__Intercept__temp
<dbl>
1 -0.605
2 -0.609
3 -0.761
4 -0.730
5 -0.747
6 -0.787
ggplot(cor_draws, aes(x = cor_lake__Intercept__temp)) +
geom_histogram(bins = 30) +
geom_vline(xintercept = rho, linetype = 2) +
theme_minimal() +
labs(
x = "Posterior draws of random-effect correlation",
y = "Count",
title = "Posterior for the intercept-slope correlation"
)The dashed vertical line is the true correlation used in the simulation.
12 What does “recovering the correlation” mean?
In a simulation exercise like this, we know the true parameter values because we chose them ourselves. After fitting the model, we ask whether the posterior distribution is centered near the true value and whether the credible interval contains it.
Because this is a finite data set:
- the estimated correlation will not exactly equal the true correlation
- recovery improves with more lakes and more observations per lake
- weak random-slope variation can make correlation recovery harder
13 Why the multivariate normal is so useful
This example shows why the multivariate normal distribution is so important in hierarchical modeling.
It lets us represent several related random quantities jointly. Here, the lake-level intercept and slope are not simulated independently; instead, they are generated together from a covariance matrix:
\[ \boldsymbol{\Sigma}_b = \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{01} & \sigma_{1}^{2} \end{bmatrix} \]
That matrix determines:
- how much intercepts vary among lakes
- how much slopes vary among lakes
- whether lakes with higher baseline trout abundance tend to have stronger or weaker temperature effects
14 Exercises
- Change the true correlation
rhofrom-0.5to0.5. What changes in the simulated data and fitted model? - Increase the number of lakes. Does recovery improve?
- Reduce the random-slope standard deviation. What happens to the estimated correlation?
- Add an observation-level predictor besides temperature.
- Fit the same model with fewer observations per lake. How does that affect uncertainty?
15 Key takeaways
- Correlated random intercepts and slopes are naturally represented with a multivariate normal distribution.
- The covariance matrix controls both the amount of variation and the correlation among group-level effects.
- A Poisson GLMM can be simulated by combining a log link, random effects, and Poisson sampling.
brmscan estimate the correlation among random effects through formulas such as(1 + temp | lake).