Using the Multivariate Normal Distribution to Simulate Correlated Random Effects in a Poisson GLMM

Author

Trevor Caughlin

1 Overview

This tutorial shows one of the most useful applications of the multivariate normal distribution in ecology: estimating correlated random intercepts and random slopes in a generalized linear mixed model (GLMM).

We will simulate trout abundance in lakes as a function of lake temperature, where:

  • trout abundance follows a Poisson distribution
  • warmer lakes tend to have fewer trout
  • each lake has its own baseline value of trout abundance, modeled by a random intercept
  • each lake also has its own effect of temperature on trout abundance (for example, lakes that have more water movement may have different temperature effects), represented by a random slope for temperature
  • the random intercepts and slopes are correlated

We will simulate the lake-level random effects from a bivariate normal distribution using MASS::mvrnorm(), then fit the model in brms and check whether the fitted model recovers the true correlation.

2 Why the multivariate normal matters here

In mixed models, a random-intercept/random-slope structure usually assumes that the group-level effects are jointly distributed as multivariate normal. For lake \(j\), we can write:

\[ \begin{bmatrix} b_{0j} \\ b_{1j} \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{01} & \sigma_{1}^{2} \end{bmatrix} \right) \]

where:

  • \(b_{0j}\) is the random intercept for lake \(j\)
  • \(b_{1j}\) is the random slope for temperature in lake \(j\)
  • \(\sigma_0^2\) is the variance among lake intercepts
  • \(\sigma_1^2\) is the variance among lake slopes
  • \(\sigma_{01}\) is the covariance between them

The correlation between random intercepts and slopes is:

\[ \rho = \frac{\sigma_{01}}{\sigma_0 \sigma_1} \]

This is exactly the kind of structure the multivariate normal distribution is designed to represent.

3 Model used for simulation

We simulate counts from a Poisson GLMM with a log link:

\[ y_{ij} \sim \text{Poisson}(\lambda_{ij}) \]

\[ \log(\lambda_{ij}) = \beta_0 + b_{0j} + (\beta_1 + b_{1j}) \cdot \text{temp}_{ij} \]

where:

  • \(y_{ij}\) is trout abundance in observation \(i\) from lake \(j\)
  • \(\beta_0\) is the population-level intercept
  • \(\beta_1\) is the population-level temperature effect
  • \(b_{0j}\) and \(b_{1j}\) are the lake-specific deviations

We set \(\beta_1 < 0\) so that higher temperature leads to lower trout abundance.

4 Packages

library(MASS)
library(dplyr)
library(ggplot2)
library(brms)
library(posterior)

5 Choose true parameter values

We begin by specifying the true fixed effects and the true variance-covariance matrix for the lake-level random effects.

set.seed(123)

n_lakes <- 30
obs_per_lake <- 25

beta_0 <- 2.6
beta_1 <- -0.12

sd_intercept <- 0.7
sd_slope <- 0.08
rho <- -0.5

Sigma_b <- matrix(
  c(sd_intercept^2,
    rho * sd_intercept * sd_slope,
    rho * sd_intercept * sd_slope,
    sd_slope^2),
  nrow = 2,
  byrow = TRUE
)

Sigma_b
       [,1]    [,2]
[1,]  0.490 -0.0280
[2,] -0.028  0.0064

6 Simulate correlated random intercepts and slopes

Now we simulate one pair of random effects for each lake:

lake_effects <- MASS::mvrnorm(
  n = n_lakes,
  mu = c(0, 0),
  Sigma = Sigma_b
)

lake_effects <- as.data.frame(lake_effects)
names(lake_effects) <- c("b0_lake", "b1_lake")
lake_effects$lake <- factor(seq_len(n_lakes))

head(lake_effects)
      b0_lake      b1_lake lake
1  0.39062722 -0.052088432    1
2  0.16229743  0.011077901    2
3 -1.09464504  0.001150769    3
4 -0.05285426 -0.057789835    4
5 -0.09377379 -0.051510392    5
6 -1.20327014  0.021725080    6

These are draws from a bivariate normal distribution. Each row is one lake, and the two columns are the lake-specific intercept and slope deviations.

6.1 Check the simulated random-effect relationship

ggplot(lake_effects, aes(x = b0_lake, y = b1_lake)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Random intercept",
    y = "Random slope for temperature",
    title = "Simulated lake-level random effects"
  )

cor(lake_effects$b0_lake, lake_effects$b1_lake)
[1] -0.6463289

The sample correlation in the simulated random effects will not equal the true value exactly, but it should be in the neighborhood of the true correlation.

7 Build the lake-level data set

We next create repeated observations within each lake and assign temperatures.

sim_dat <- expand.grid(
  lake = factor(seq_len(n_lakes)),
  obs = seq_len(obs_per_lake)
) |>
  left_join(lake_effects, by = "lake") |>
  mutate(
    temp = rnorm(n(), mean = 16, sd = 3)
  )

head(sim_dat)
  lake obs     b0_lake      b1_lake     temp
1    1   1  0.39062722 -0.052088432 17.13892
2    2   1  0.16229743  0.011077901 14.49303
3    3   1 -1.09464504  0.001150769 15.00038
4    4   1 -0.05285426 -0.057789835 12.94427
5    5   1 -0.09377379 -0.051510392 12.78463
6    6   1 -1.20327014  0.021725080 16.91059

8 Simulate Poisson trout abundance

We now construct the linear predictor and simulate counts.

sim_dat <- sim_dat |>
  mutate(
    eta = beta_0 + b0_lake + (beta_1 + b1_lake) * temp,
    lambda = exp(eta),
    trout_abundance = rpois(n(), lambda = lambda)
  )

head(sim_dat)
  lake obs     b0_lake      b1_lake     temp         eta    lambda
1    1   1  0.39062722 -0.052088432 17.13892  0.04121761 1.0420788
2    2   1  0.16229743  0.011077901 14.49303  1.18368623 3.2663927
3    3   1 -1.09464504  0.001150769 15.00038 -0.27742841 0.7577298
4    4   1 -0.05285426 -0.057789835 12.94427  0.24578542 1.2786252
5    5   1 -0.09377379 -0.051510392 12.78463  0.31352994 1.3682464
6    6   1 -1.20327014  0.021725080 16.91059 -0.26515661 0.7670858
  trout_abundance
1               0
2               0
3               0
4               2
5               1
6               2

8.1 Quick look at the simulated data

ggplot(sim_dat, aes(x = temp, y = trout_abundance)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal() +
  labs(
    x = "Lake temperature",
    y = "Trout abundance",
    title = "Simulated trout abundance declines with temperature"
  )

8.2 Plot by lake

ggplot(sim_dat, aes(x = temp, y = trout_abundance, color = lake)) +
  geom_point(alpha = 0.5, show.legend = FALSE) +
  theme_minimal() +
  labs(
    x = "Lake temperature",
    y = "Trout abundance",
    title = "Different lakes have different intercepts and slopes"
  )

9 Fit the model in brms

We fit the same random-intercept/random-slope Poisson GLMM using brms:

fit <- brm(
  formula = trout_abundance ~ temp + (1 + temp | lake),
  data = sim_dat,
  family = poisson(),
  chains = 4,
  cores = 4,
  iter = 2000,
  seed = 123
)

The term (1 + temp | lake) tells brms to estimate:

  • a random intercept for each lake
  • a random slope for temperature for each lake
  • the correlation between them

10 Summarize the fitted model

summary(fit)
 Family: poisson 
  Links: mu = log 
Formula: trout_abundance ~ temp + (1 + temp | lake) 
   Data: sim_dat (Number of observations: 750) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~lake (Number of levels: 30) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)           0.62      0.21     0.23     1.05 1.00     1088     1214
sd(temp)                0.08      0.01     0.05     0.11 1.00      495     1007
cor(Intercept,temp)    -0.55      0.23    -0.88     0.01 1.01      338      614

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     2.53      0.20     2.12     2.90 1.00     1325     2489
temp         -0.13      0.02    -0.16    -0.09 1.00     1100     1790

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Look in the group-level effects section for:

  • sd(Intercept)
  • sd(temp)
  • cor(Intercept,temp)

These are the posterior estimates of the random-effect standard deviations and their correlation.

11 Compare estimated values to the truth

Here are the true values used in the simulation:

true_values <- data.frame(
  parameter = c("sd_intercept", "sd_slope", "cor_intercept_slope"),
  true_value = c(sd_intercept, sd_slope, rho)
)

true_values
            parameter true_value
1        sd_intercept       0.70
2            sd_slope       0.08
3 cor_intercept_slope      -0.50

Extract posterior summaries from brms:

posterior_summary_df <- as.data.frame(posterior_summary(fit))
posterior_summary_df$parameter <- rownames(posterior_summary_df)

posterior_summary_df |>
  filter(parameter %in% c("sd_lake__Intercept", "sd_lake__temp", "cor_lake__Intercept__temp"))
                            Estimate  Est.Error        Q2.5     Q97.5
sd_lake__Intercept         0.6180205 0.20534587  0.22615401 1.0490536
sd_lake__temp              0.0762465 0.01428161  0.05254139 0.1071981
cor_lake__Intercept__temp -0.5517503 0.23089471 -0.87580547 0.0110716
                                          parameter
sd_lake__Intercept               sd_lake__Intercept
sd_lake__temp                         sd_lake__temp
cor_lake__Intercept__temp cor_lake__Intercept__temp

11.1 Posterior draws for the correlation

cor_draws <- as_draws_df(fit) |>
  select(starts_with("cor_lake__Intercept__temp"))

head(cor_draws)
# A tibble: 6 × 1
  cor_lake__Intercept__temp
                      <dbl>
1                    -0.605
2                    -0.609
3                    -0.761
4                    -0.730
5                    -0.747
6                    -0.787
ggplot(cor_draws, aes(x = cor_lake__Intercept__temp)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = rho, linetype = 2) +
  theme_minimal() +
  labs(
    x = "Posterior draws of random-effect correlation",
    y = "Count",
    title = "Posterior for the intercept-slope correlation"
  )

The dashed vertical line is the true correlation used in the simulation.

12 What does “recovering the correlation” mean?

In a simulation exercise like this, we know the true parameter values because we chose them ourselves. After fitting the model, we ask whether the posterior distribution is centered near the true value and whether the credible interval contains it.

Because this is a finite data set:

  • the estimated correlation will not exactly equal the true correlation
  • recovery improves with more lakes and more observations per lake
  • weak random-slope variation can make correlation recovery harder

13 Why the multivariate normal is so useful

This example shows why the multivariate normal distribution is so important in hierarchical modeling.

It lets us represent several related random quantities jointly. Here, the lake-level intercept and slope are not simulated independently; instead, they are generated together from a covariance matrix:

\[ \boldsymbol{\Sigma}_b = \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{01} & \sigma_{1}^{2} \end{bmatrix} \]

That matrix determines:

  • how much intercepts vary among lakes
  • how much slopes vary among lakes
  • whether lakes with higher baseline trout abundance tend to have stronger or weaker temperature effects

14 Exercises

  1. Change the true correlation rho from -0.5 to 0.5. What changes in the simulated data and fitted model?
  2. Increase the number of lakes. Does recovery improve?
  3. Reduce the random-slope standard deviation. What happens to the estimated correlation?
  4. Add an observation-level predictor besides temperature.
  5. Fit the same model with fewer observations per lake. How does that affect uncertainty?

15 Key takeaways

  • Correlated random intercepts and slopes are naturally represented with a multivariate normal distribution.
  • The covariance matrix controls both the amount of variation and the correlation among group-level effects.
  • A Poisson GLMM can be simulated by combining a log link, random effects, and Poisson sampling.
  • brms can estimate the correlation among random effects through formulas such as (1 + temp | lake).