Random effects are the most basic hierarchical structures that can be added to models. They allow for dependence between members of groups. Here we use long-term observations from a CO2 fumigation experiment to show how a hierarchical model can change the interpretation of size and treatment effects.
Reference: Duke Forest FACE

Data: dataTreeFACE.csv file on Sakai

Software: Get the updated clarkFunctions2024.R file on Sakai and load it, together with rjags:
source('clarkFunctions2024.r')
library(rjags)
How does rising CO2 affect tree growth, when each individual experiences CO2 concentration in its own way? The FACE experimental inventory data provide an opportunity to examine the heterogeneity of responses in the context of individual differences.
To determine how rising CO2 can affect a forested ecosystem, the Free-Air CO2 Enrichment (FACE) experiment subjected trees to elevated CO2 levels in 30-m diameter rings at the Duke Forest. Four plots received elevated atmospheric CO2 concentration, and four plots were instrumented but treated as ambient controls. The experiment began in June 1994 as a prototype, with the full experiment implemented with additional plots in August 1996. Plots were partitioned in early 2005 to include a CO2 × nitrogen treatment (main effects and factorial). In addition to measurements on all aspects of ecophysiology and atmosphere-biosphere exchange, the plots were sampled as forest inventory plots, including diameter growth. The study was funded by the Office of Science (BER) of the U.S. Department of Energy.
In this analysis we use tree data from three ambient and three elevated CO2 rings. Rings are 30 m wide and instrumented for CO2 fumigation and monitoring.
Here I use a hierarchical model to examine the effects of tree size (diameter), CO2 fertilization, and their interaction on growth rates. There are large differences between individuals, so I include a random effect on trees. To appreciate the role of heterogeneity, I first fit a single-stage regression that ignores individual differences. Here is the data set:
data <- read.csv('dataTreeFACE.csv',header=T)
data$y <- log( data[,'dnow'] ) # log growth rate
Exploratory data analysis (EDA) is the essential first step in an analysis to understand relationships between variables.
The CO2 treatment in the data set is coded as 0 (ambient) or 1 (elevated). To see the balance of observations I make a table:
table(data$trt)
| trt | Freq |
|---|---|
| 0 | 3669 |
| 1 | 3145 |
Note that treatment and control observations are well balanced.
Here is a plot of growth rates against year:
plot( jitter(data$t), data$dnow, xlab = 'Year', ylab = 'Growth rate (cm)',
cex=.1, log = 'y', bty = 'n')
q <- tapply( data$dnow, data$t, quantile, c(.5, .05, .95), na.rm = T)
q <- matrix( unlist(q), ncol = 3, byrow = T )
lines( q[,1], lwd = 3 )
lines( q[,2], lty = 2 )
lines( q[,3], lty = 2 )
Attrition of slow growers in later years.
Note the use of tapply to evaluate quantiles by year.
What percent of observations lie between the dashed lines?
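One way to check is to count, for each year, the fraction of observations that fall between the two quantiles used for the dashed lines (the 0.05 and 0.95 quantiles). A minimal sketch, assuming the matrix q from the plotting code above:

# fraction of observations between the dashed 5% and 95% lines, for each year
yrs <- sort( unique(data$t) )                 # years, in the order used by tapply
fracIn <- mapply( function(tt, lo, hi){
  d <- data$dnow[ data$t == tt ]
  mean( d >= lo & d <= hi, na.rm = TRUE )
}, yrs, q[,2], q[,3] )
round( fracIn, 2 )                            # close to 0.90 by construction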
I find it hard to interpret this plot, which suggests bi-modality, i.e., clusters of low and high growth rates. Frequency plots of growth rates by year confirm this difference:
# histogram by year
lr <- range( data$y, na.rm=T )
bins <- seq( lr[1], lr[2], length=50) # bins for log growth rate
mids <- ( bins[-1] + bins[-50] )/2 # midpoints of bins
ybin <- findInterval( data$y, bins, all.inside = T) # assign growth rates to bins
dByt <- table( data$t, factor(ybin, levels = 1:length(mids)) ) # counts by year and bin
xlim <- c(1, 12)
ylim <- range(data$y, na.rm=T)
plot( NA, xlim = xlim, ylim = ylim, cex = .1,
xlab = 'Year', ylab = 'log growth rate (cm)', bty = 'n')
nt <- max(data$t)
scale <- 1/max(dByt) # horizontal scale fits time increment
for(t in 1:nt){
lines( t + scale*dByt[t,], mids, type = 's', lwd=2)
lines( c(t,t), ylim, lwd=2 )
}
Distributions of individual rates by year.
The distributions of growth rates are certainly not Gaussian, but perhaps they would be approximately Gaussian if I allow for individual differences. As a quick check, here are frequency distributions by individual tree:
par(mfrow=c(2,1), mar=c(4,5,2,4), las = 1)
growthStandard <- (data$y - mean(data$y, na.rm=T))/sd(data$y, na.rm=T) #center, standardize over all obs
hist( growthStandard, nclass = 100, xlim=c(-3, 3), col = 'grey', border = 'grey',
main = "Centered on population mean", xlab = '') # centered
imu <- tapply( data$y, data$i, mean, na.rm=T) # mean for each individual
isd <- tapply( data$y, data$i, sd, na.rm=T)   # sd for each individual
dev <- (data$y - imu[ as.character(data$i) ])/isd[ as.character(data$i) ] # standardize within individual
hist( dev, nclass=100, xlim=c(-3, 3), col = 'grey',
border = 'grey', main = "Centered on individual means", xlab = 'Standardized log growth')
Distributions of standardized rates, centered on the population mean (top) and on individual means (bottom).
The upper plot shows the strong bimodality. When taken as departures from their individual means, the distribution is not so bimodal. It is still not Gaussian, but predictors in the model are expected to explain variation in the data. I can include predictors in a regression.
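Before setting up the Bayesian model, a quick least-squares fit of the same fixed-effects regression gives a point of reference. This is only an exploratory check, not part of the jags workflow that follows:

# ordinary least squares fit of log growth on diameter, CO2 treatment, and their interaction
fitLM <- lm( y ~ diam*trt, data )
summary( fitLM )$coefficients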
The rjags package can be used to simulate the posterior distribution in R.
A Bayesian regression model starts with main effects of diameter
diam and CO2 treatment trt. The graph for this
model shows coefficients \(\boldsymbol{\beta}\) and a residual
variance \(\sigma^2\),
\[y_{it} = \mathbf{x}'_{it} \boldsymbol{\beta} + \epsilon_{it}\]
A formula in R has the syntax
y ~ predictor1 + predictor2. An interaction has the syntax
predictor1*predictor2. Using this syntax for an
interaction, both main effects will also be included in the model.
Recall that \(\mathbf{x}'_{it}\)
is one row of the full design matrix \(\mathbf{X}\). I want to look at the design
matrix \(\mathbf{X}\), so I generate it
using model.frame and model.matrix:
tmp <- model.frame(y ~ diam*trt, data)
x <- model.matrix(y ~ diam*trt, tmp) # design matrix
y <- tmp$y # response (log growth rate)
n <- length(y)
head(x)
## (Intercept) diam trt diam:trt
## 1 1 19.17660 0 0
## 2 1 19.72210 0 0
## 3 1 20.34015 0 0
## 4 1 20.91494 0 0
## 5 1 21.24999 0 0
## 6 1 21.93743 0 0
Notice how the interaction appears in the design matrix x: there is a product term for diam:trt (the fourth column). The syntax diam*trt in the formula includes the main effects of both variables together with their interaction.
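As a quick check on the formula syntax, the expanded formula with explicit main effects and interaction produces the same design matrix (a minimal sketch using the model frame tmp created above):

x2 <- model.matrix( y ~ diam + trt + diam:trt, tmp )  # explicit main effects + interaction
all.equal( x, x2 )                                     # TRUE: same design matrix as diam*trt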
The Bayesian regression will have fitted coefficients and a residual
variance. Here is a model for rjags, which is written to a file
growthFixed.txt:
cat( "model{
for(i in 1:n){
y[i] ~ dnorm(mu[i],tau)
mu[i] <- b1 + b2*x[i,2] + b3*x[i,3] + b4*x[i,4]
}
tau ~ dgamma(0.001,0.001)
sigma <- sqrt(1/tau)
b1 ~ dnorm(0.0,1.0E-06)
b2 ~ dnorm(0.0,1.0E-06)
b3 ~ dnorm(0.0,1.0E-06)
b4 ~ dnorm(0.0,1.0E-06)
}",
fill=T, file="growthFixed.txt"
)
The function cat converts this character string to a
text file. There are three blocks within the model string.
The first block looks like a loop, but it does not execute that way. It
will be interpreted to set up the graphical model for the mean of the
regression mu and for the data y. rjags
differs from most of R in that the normal distribution is parameterized by the precision tau, which is the reciprocal of the variance sigma^2. The precision is given a gamma prior, and sigma is computed as the square root of its reciprocal, so sigma^2 has an inverse gamma prior. The coefficients have vague (effectively non-informative), independent normal priors.
The inverse gamma distribution for \(\sigma^2\) is a gamma distribution of \(\sigma^{-2}\). It is often used as a prior distribution for a variance, because it is conjugate with the normal likelihood, meaning that the posterior distribution is also inverse gamma.
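As a small illustration of that conjugacy, here is a simulation sketch of the inverse gamma update for a variance when the mean is known. The sample and the value mu0 are made up for illustration; the prior values mirror those used in the jags models here:

# conjugate update for sigma^2: prior IG(a0, b0), normal data with known mean mu0
set.seed(1)
mu0 <- 0; a0 <- 0.001; b0 <- 0.001           # prior shape and rate, as in the jags model
z     <- rnorm( 100, mu0, 2 )                # simulated data, true sd = 2
aPost <- a0 + length(z)/2                    # posterior shape
bPost <- b0 + sum( (z - mu0)^2 )/2           # posterior rate
sig2  <- 1/rgamma( 5000, aPost, rate = bPost )  # draws from the inverse gamma posterior
quantile( sig2, c(.025, .5, .975) )          # should bracket the true variance, 4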
The full model is
\[ \begin{align} [\boldsymbol{\beta}, \sigma^2| \mathbf{y}, \mathbf{X}] &\propto \prod_{i} \prod_{t}N( y_{it} | \mathbf{x}'_{it} \boldsymbol{\beta}, \sigma^2) \\ &\times \prod_q N( \beta_q | 0, 10^6) \\ &\times IG( \sigma^2 | 0.001, 0.001 ) \end{align} \] The graph for this model has a data stage (fixed \(y_{it}\)), a random parameter stage (\(\boldsymbol{\beta}, \sigma^2\)), and a fixed prior distribution.
Graph of the regression model showing the prior distribution for coefficients. Note that parameters have distributions, but that randomness is viewed as “uncertainty”, not as true variation. Uncertainty comes from the fact that parameters are estimated.
Here I generate the model graph:
treeData <- list(y = y, x = x, n = n ) # define data
growModel <- jags.model(data = treeData, file = "growthFixed.txt") # generate graph
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 6814
## Unobserved stochastic nodes: 5
## Total graph size: 50858
##
## Initializing model
Here I fit the model:
growFit <- coda.samples(growModel, variable.names = c("b1","b2","b3","b4"),
n.iter = 5000)
tmp <- summary(growFit)
print(tmp$statistics)
## Mean SD Naive SE Time-series SE
## b1 -4.05526538 0.105335655 1.489671e-03 0.0087254026
## b2 0.15060217 0.005557222 7.859099e-05 0.0004696314
## b3 0.64026440 0.103240489 1.460041e-03 0.0124472497
## b4 -0.02608361 0.005471030 7.737205e-05 0.0006629146
par(mar = c(3,3,1,1))
plot( as.mcmc(growFit), bty = 'n' )
MCMC chains and posterior densities.
These plots show MCMC chains for each of the four \(\beta\) values on the left and density estimates at right. We will talk more in coming weeks about the need for convergence and good mixing in these chains. For now, I just note that they quickly converge to values that are then stationary. The density plots at right are generated from these chains.
Markov chain Monte Carlo (MCMC) draws repeated samples from the posterior distribution that can be viewed as chains (iterations by parameter value).
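The coda package (loaded with rjags) provides quick numerical checks on mixing. A minimal sketch, assuming growFit from above:

effectiveSize( growFit )            # effective sample size for each coefficient
autocorr.diag( as.mcmc(growFit) )   # autocorrelation in the chains at increasing lags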
Objects can be extracted from the summary,
betaFixed <- cbind( tmp$statistics[,1:2], tmp$quantiles[,c('2.5%','97.5%')] )
signif( betaFixed, 3 )
| | Mean | SD | 2.5% | 97.5% |
|---|---|---|---|---|
| b1 | -4.0600 | 0.10500 | -4.1300 | -3.9900 |
| b2 | 0.1510 | 0.00556 | 0.1470 | 0.1540 |
| b3 | 0.6400 | 0.10300 | 0.5390 | 0.7440 |
| b4 | -0.0261 | 0.00547 | -0.0316 | -0.0207 |
The matrix betaFixed summarizes the posterior
distribution for \(\boldsymbol{\beta}\).
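A quick way to see which effects are well resolved is to check whether each 95% credible interval excludes zero (using betaFixed from above):

# TRUE where the 2.5% and 97.5% bounds have the same sign, i.e., the interval excludes zero
sign( betaFixed[ ,'2.5%'] ) == sign( betaFixed[ ,'97.5%'] )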
Here are the data and the mean prediction for ambient (black) and elevated (red) treatments:
xlim <- range( x[,'diam'] )
xseq <- seq( xlim[1], xlim[2], length=100 ) # sequence of diameter values
plot(x[,'diam'], y, col=x[,'trt'] + 1, cex=.1, # color code by trt
xlab='Diameter (cm)', ylab = 'log growth rate (cm/yr)', bty = 'n' )
mu <- betaFixed[1,1] + betaFixed[2,1]*xseq
lines(xseq, mu, lwd = 2, lty = 2)
lines(xseq, mu + betaFixed[3,1] + betaFixed[4,1]*xseq, lwd = 2, col = 2, lty = 2)
legend('bottomright', c('ambient','elevated'), text.col = c(1,2), bty='n')
Fitted model showing two treatment levels with observed values.
Exercise 1: Interpret the main effects and the interaction in this model. Explain what each coefficient means.
Exercise 2: Run the model again, including
nfert (nitrogen fertilization) in addition to other
variables. Interpret its effect.
A model becomes hierarchical when a stage is added to the prior. When this is done, the “prior” is no longer fixed, thus allowing for variability at an additional stage. Random effects can be thought of as parameters that have a hyperprior distribution.
The model can be extended hierarchically to allow for differences between individuals. I assign random effects to individuals under the assumption that individuals are drawn at random from a much larger population of individuals that I might have selected for the study. In the simple regression context this is termed a mixed model, referring to the fact that there will be both fixed and random effects. Note that the added stage does not affect the structure of the previous model that contained only fixed effects.
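For readers who know classical mixed models, the same random-intercept, random-slope structure could be fit by maximum likelihood with the lme4 package. This is only a point of comparison, and it assumes lme4 is installed; it is not part of the jags analysis below:

# classical mixed-model counterpart, fit by REML rather than MCMC
library(lme4)
data$tree <- paste( data$j, data$i, sep = '-' )          # unique tree label (plot-tree)
fitLmer  <- lmer( y ~ diam*trt + (diam | tree), data = data )
summary( fitLmer )$coefficients                          # fixed effects
VarCorr( fitLmer )                                       # random-effects standard deviations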
Graph with random coefficients. There are “fixed” differences between individuals coming from beta. There are also random differences coming from alpha_i.
For random effects, I start with labels for the observations by tree
in the vector ij, which uniquely identifies an individual.
I use match against the unique values to assign each row in
x to an individual:
ij <- paste(data$j, data$i, sep = '-') # plot, tree
tn <- sort(unique(ij)) # tree labels
ix <- match(ij,tn) # index rows by tree
ntree <- max(ix) # no. trees
The model will have random effects for the intercept and the diameter:
\[ y_{it} = \beta_0 + \beta_d d_{it} + \beta_c c_{it} + \beta_{d \times c} d_{it}c_{it} + \alpha_{i0} + \alpha_{id} d_{it} + \epsilon_{it} \] where \(d, c\) refer to diameter and CO2 treatment, respectively. Note that the \(\alpha\) coefficients have a subscript \(i\), whereas the \(\beta\)’s do not.
Exercise 3: Why didn’t I include random effects on CO2 treatment? To answer this you need to consider how random effects are assigned to groups.
There are ntree = 696 individual trees. Each observation
is now labeled by the tree to which it belongs in the vector
ix. To follow the jags model, I need to recognize the
vectorization ix[i], which gives me the individual identity
for each row in x. Now I vectorize again to get the random
intercept assigned to this individual, a1[ix[i]]. Here is a
model:
cat( "model{
for(i in 1:n){
y[i] ~ dnorm(mu[i],tau)
mu[i] <- b1 + b2*x[i,2] + b3*x[i,3] + b4*x[i,4] +
a1[ix[i]] + a2[ix[i]]*x[i,2]
}
for(j in 1:ntree){
a1[j] ~ dnorm(0.0,tau1)
a2[j] ~ dnorm(0.0,tau2)
}
tau1 ~ dgamma(0.001,0.001) # RE precision, intercept
sigma1 <- 1/sqrt(tau1) # RE standard deviation, intercept
tau2 ~ dgamma(0.001,0.001) # RE precision, diam
sigma2 <- 1/sqrt(tau2) # RE standard deviation, diam
tau ~ dgamma(0.001,0.001)
sigma <- sqrt(1/tau)
b1 ~ dnorm(0.0,1.0E-06)
b2 ~ dnorm(0.0,1.0E-06)
b3 ~ dnorm(0.0,1.0E-06)
b4 ~ dnorm(0.0,1.0E-06)
}",
fill=T,file="growthRE.txt" )
Note that there are fixed effects, b1, ..., b4. Each random effect, a1 and a2, has 696 values (one per tree). Again, the index ix identifies the proper individual for each observation row in x, having index i.
The hierarchical nature of the model comes from the fact that the random-effects standard deviations, sigma1 and sigma2 (through their precisions tau1 and tau2), also have prior distributions.
Here is a model fit using three parallel chains:
treeData <- list(y = y, x = x, ix = ix, ntree = ntree, n = n)
reModel <- jags.model(data = treeData, file = "growthRE.txt", n.chains = 3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 6814
## Unobserved stochastic nodes: 1399
## Total graph size: 65885
##
## Initializing model
growthRE <- coda.samples(reModel,
variable.names=c("b1","b2","b3","b4","a1","a2","sigma1","sigma2"),
n.iter=5000)
tmp <- summary( growthRE )
betaRE <- cbind( tmp$statistics[,1:2], tmp$quantiles[,c('2.5%','97.5%')] )
Here are the fixed effects and the random-effects standard deviations:
signif( betaRE[ c('b1','b2','b3','b4','sigma1','sigma2'), ], 3 )
| | Mean | SD | 2.5% | 97.5% |
|---|---|---|---|---|
| b1 | -1.680000 | 0.43000 | -2.00000 | -0.32000 |
| b2 | 0.007370 | 0.00336 | 0.00188 | 0.01510 |
| b3 | 0.365000 | 0.17700 | -0.03430 | 0.60900 |
| b4 | -0.000689 | 0.00399 | -0.01060 | 0.00564 |
| sigma1 | 1.160000 | 0.16000 | 1.04000 | 1.74000 |
| sigma2 | 0.013900 | 0.00405 | 0.01010 | 0.02480 |
As before, there are four beta estimates; the random-effects standard deviations sigma1 and sigma2 are also summarized.
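Because three chains were run, the Gelman-Rubin diagnostic gives a numerical convergence check for the fixed effects and variance parameters. A minimal sketch, assuming growthRE from above; values near 1 indicate convergence:

bcols <- c('b1','b2','b3','b4','sigma1','sigma2')
gelman.diag( growthRE[ ,bcols], multivariate = FALSE )   # potential scale reduction factors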
The estimates in betaRE are obtained from MCMC chains
that can be accessed this way:
cnames <- colnames(growthRE[[1]]) # parameters are column names
a1 <- which(startsWith(cnames, 'a1')) # a parameters for each tree
a2 <- which(startsWith(cnames, 'a2'))
bg <- which(startsWith(cnames, 'b')) # four beta parameters
sg <- which(startsWith(cnames, 's')) # sigma1, sigma2
bgibbs <- growthRE[[1]][,bg] # extract chains for beta, sigma
sgibbs <- growthRE[[1]][,sg]
beta <- apply(bgibbs,2, mean) # means agree with betaRE
# tree random effect parameter means
alpha <- cbind( apply(growthRE[[1]][,a1],2, mean),
apply(growthRE[[1]][,a2],2, mean))
colnames(alpha) <- c( 'a1', 'a2' )
rownames(alpha) <- NULL
head( alpha )   # first few tree-level random effects
| a1 | a2 |
|---|---|
| 0.852 | 0.00396 |
| -2.000 | -0.00214 |
| 1.030 | -0.00285 |
| 1.220 | -0.00198 |
| 0.731 | 0.01460 |
| 0.574 | -0.00222 |
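The spread of the fitted random intercepts can be compared with the estimated sigma1; because individual estimates are shrunk toward zero, their standard deviation is typically somewhat smaller than sigma1. A quick check, assuming alpha from above:

hist( alpha[,'a1'], nclass = 50, col = 'grey', border = 'grey',
      xlab = 'Random intercept', main = 'Tree-level intercepts' )
sd( alpha[,'a1'] )      # compare with the posterior mean of sigma1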
Here are mean predictions for diameter at two CO2 treatments.
ylim <- range(y)
mu <- betaFixed[1,1] + betaFixed[2,1]*xseq # ambient
plot(xseq, mu, type='l',lwd=2, ylim=ylim, xlab='Diameter (cm)',
ylab=' log growth rate', lty = 2, bty = 'n')
mu2 <- mu + betaFixed[3,1] + betaFixed[4,1]*xseq # elevated CO2
lines(xseq, mu2, type='l', lty = 2, lwd=2, col=2)
# size range by individual
irange <- matrix( unlist(tapply(x[,'diam'], ix, range, na.rm=T)), ncol=2, byrow=T)
# regression fit by individual
for(i in 1:ntree){
xi <- xseq[xseq > irange[i,1] & xseq < irange[i,2]]
mu <- beta[1] + alpha[i,1] + (beta[2] + alpha[i,2])*xi # fixed b, individual a
wi <- which(ix == i)[1]
if(x[wi,'trt'] == 0){ # ambient
lines(xi, mu, col='grey')
next
}
mu <- mu + beta[3] + beta[4]*xi # elevated
lines(xi, mu, type='l',col=2)
}
Random effects model, with short lines for each individual, compared with the fixed effects model (dashed lines).
Note that I first plotted the lines for mean fixed effects. The loop adds the individual effects for each tree, showing elevated CO2 in red.
Exercise 4: Interpret the main effects and the interaction in this hierarchical model. How does it compare with the model that included only fixed effects? Think about the role of variation between individuals versus within individuals. Speculate on why the estimates differ.
Exercise 5: Is this analysis an example of the ecological fallacy? Why or why not?
Models are selected to fit, but not over-fit, the data. A penalty for model size typically counts independent parameters. Hierarchical models do not have a clear number of parameters, because many are drawn from distributions that also have fitted parameters.
The deviance information criterion (DIC) is an index used for model selection with hierarchical models, based on the deviance, \(D(\theta) = -2 \log( [\mathbf{y} | \theta] )\), where \(\theta\) is a vector of fitted parameters, and \([\mathbf{y} | \theta]\) is the likelihood. We need a penalty for model complexity, i.e., for the number of parameters. The problem for hierarchical models is that it is unclear how to ‘count’ parameters, because each parameter is part of a distribution.
The DIC is given by
\[DIC = D(\bar{\theta}) + 2p_D\]
where \(D(\bar{\theta})\) is the deviance calculated at the posterior mean parameter values, and \(p_D = \bar{D} - D(\bar{\theta})\) is the ‘effective number of parameters’. This is the penalty for model complexity. The model with the lowest DIC is viewed as the best model. Here is the DIC using \(2p_D\) as the penalty:
dic.samples(reModel, n.iter = 100, thin = 1, type = 'pD')
This penalized deviance (i.e., the DIC) can be compared with other models fitted to the same data.
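For example, the penalized deviance of the random effects model can be compared with that of the fixed-effects model fitted earlier. This sketch assumes both fitted model objects (growModel, reModel) are still in memory; with only 100 iterations the comparison is rough:

dicFixed <- dic.samples( growModel, n.iter = 100, type = 'pD' )
dicRE    <- dic.samples( reModel,   n.iter = 100, type = 'pD' )
diffdic( dicRE, dicFixed )   # a negative difference favors the random effects model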
Exercise 6: Add nitrogen fertilization, nfert, to the hierarchical analysis. Use DIC to determine whether it improves the model.
Random effects represent a common type of hierarchical model. Random effects can be used when group differences are viewed as having been drawn at random from a large population of such groups, i.e., they are exchangeable. A random effects model may be limited to random intercepts, or it can additionally have random slopes. The effects become random when I add a prior on the prior, i.e., a hyperprior.
Hierarchical models are amenable to Gibbs sampling, a type of MCMC, just like single-stage models. They can be fitted in R, including with jags. In upcoming units we’ll write our own Gibbs samplers.
DIC provides a simple index for selecting variables in hierarchical models. It does so by estimating the effective number of parameters in the model and using that as the penalty for model complexity.
Hierarchical models can apply to all kinds of problems, where additional stages allow us to specify and compute based on conditional relationships.
Here is a bit more background on the hierarchical model that we take up in later units.
Consider a linear model with Gaussian error. A random effects model for a regression parameter can be written with two terms,
\[y_{ij} = \beta + \alpha_j + \epsilon_{ij}\]
The first term is the fixed effect. It will have a prior distribution. If \(\alpha_j\) is a random effect assigned to the \(j^{th}\) group, then it will have a prior distribution with parameters that also have prior distributions.
In a regression the random effects can be organized as a \(J \times q\) matrix, with one row per group. I give it a distribution, e.g., \(vec(\boldsymbol{\alpha}) \sim MVN(0, \mathbf{V})\), where \(\mathbf{V}\) is a \(Jq \times Jq\) covariance matrix. If there are predictors in the model I can write it this way:
\[y_{ij} = \boldsymbol{\beta'}\mathbf{x}_{ij} + \boldsymbol{\alpha}'_j \mathbf{w}_{ij} + \epsilon_{ij}\]
where \(\mathbf{x}_{ij}\) is a length-\(p\) vector of covariates and/or factors that enter as fixed effects, \(\mathbf{w}_{ij}\) is a length-\(q\) subset of \(\mathbf{x}_{ij}\), where \(q \leq p\), \(\boldsymbol{\beta}\) is the length-\(p\) vector of fixed effects, and \(\boldsymbol{\alpha}_j\) is the length-\(q\) vector of random effects for group \(j\). It is the \(j^{th}\) row in the random effects matrix.
For a simple model with random effects on both slope and intercept, \(\mathbf{x}_{ij} = \mathbf{w}_{ij} =(1, x_{ij})'\), \(\boldsymbol{\beta} = (\beta_1, \beta_2)'\), and \(\boldsymbol{\alpha}_j = (\alpha_{j1}, \alpha_{j2})'\). Now an observation in group \(j\) has intercept \(\beta_1 + \alpha_{j1}\) and slope \(\beta_2 + \alpha_{j2}\). If the model has a random intercept, but not a random slope, then all groups have slope \(\beta_2\).
The model for \(n_j\) observations in group \(j\) has this expected value:
\[E \left[\mathbf{y}_{\mathbf{j}} | \boldsymbol{\alpha}_{\mathbf{j}} \right] = \mathbf{X}_{\mathbf{j}} \boldsymbol{\beta} + \mathbf{W}_{\mathbf{j}} \boldsymbol{\alpha}_{\mathbf{j}}\]
The design matrices \(\mathbf{X}_{\mathbf{j}}\) and \(\mathbf{W}_{\mathbf{j}}\) have dimensions \(n_j \times p\) and \(n_j \times q\), respectively. The vector of random effects has the distribution \(\boldsymbol{\alpha} \sim MVN(\mathbf{0}, \mathbf{V})\), where \(\mathbf{0}\) is the length-\(qJ\) vector of zeros, and \(\mathbf{V}\) is the \(qJ \times qJ\) matrix of covariances. Because this is a random effects model, there will also be a prior on \(\mathbf{V} \sim IW(v, \mathbf{V}_0)\).
The likelihood is
\[\mathbf{y}_{\mathbf{j}} \sim MVN \left( \mathbf{X}_{\mathbf{j}} \boldsymbol{\beta} + \mathbf{W}_{\mathbf{j}} \boldsymbol{\alpha}_{\mathbf{j}}, \sigma^2 \mathbf{I}_{n_j} \right)\]
where \(\mathbf{I}_{n_j}\) is the \(n_j \times n_j\) identity matrix and \(\boldsymbol{\alpha}_{\mathbf{j}}\) is the length-\(q\) vector of random effects for group \(j\). The full model is
\[ \begin{aligned} \boldsymbol{\beta}, \boldsymbol{\alpha}, \mathbf{V} | \mathbf{X}, \mathbf{y} & \sim \prod_i^n N \left( \mathbf{x}'_{ij} \boldsymbol{\beta} + \mathbf{w}'_{ij} \boldsymbol{\alpha}_{j}, \sigma^2 \right) \\ & \times MVN \left( \boldsymbol{\beta} | 0, \mathbf{V}_b \right) \\ & \times MVN \left( vec(\boldsymbol{\alpha}) | 0, \mathbf{V} \right) \\ & \times IW(\mathbf{V} | v, \mathbf{V}_0) \\ & \times IG(\sigma^2 | s_1, s_2 ) \end{aligned} \] Note that both \(\boldsymbol{\beta}\) and \(\boldsymbol{\alpha}\) have prior distributions. However, the variance on \(\boldsymbol{\alpha}\) also has a prior distribution, the inverse Wishart for the covariance between random effects. This distribution estimates how the random intercepts and slopes covary across groups. In the example above, I made the variances a priori independent, \(\mathbf{V}^{-1}_{ii} = \tau_i \sim gamma(0.001, 0.001)\).
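For reference, here is a sketch of how the correlated version could be written in jags, with a Wishart prior on the random-effects precision matrix. This model is not fitted in this vignette; the file name and the objects zero = c(0,0) and R = diag(2), which would be added to the data list, are illustrative:

cat( "model{
  for(i in 1:n){
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- b1 + b2*x[i,2] + b3*x[i,3] + b4*x[i,4] +
             a[ix[i],1] + a[ix[i],2]*x[i,2]
  }
  for(j in 1:ntree){
    a[j,1:2] ~ dmnorm( zero[1:2], Omega[1:2,1:2] )   # correlated intercept and slope
  }
  Omega[1:2,1:2] ~ dwish( R[1:2,1:2], 3 )            # Wishart prior on RE precision matrix
  Vre[1:2,1:2] <- inverse( Omega[1:2,1:2] )          # implied RE covariance matrix
  tau ~ dgamma(0.001,0.001)
  sigma <- sqrt(1/tau)
  b1 ~ dnorm(0.0,1.0E-06)
  b2 ~ dnorm(0.0,1.0E-06)
  b3 ~ dnorm(0.0,1.0E-06)
  b4 ~ dnorm(0.0,1.0E-06)
}",
fill=T, file="growthREwishart.txt" )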