Random effects are the most basic hierarchical structures that can be added to models. They allow for dependence between members of groups. Here we use long-term observations of tree growth from a CO2 fumigation experiment to show how a hierarchical model can change the interpretation of size and treatment effects.

Logistics

Today

  • in groups: negotiate exercises from Unit 5, present to class

  • this vignette

  • Learning goals:

    • recognize and use random effects as a type of hierarchical model
    • implement a hierarchical model in rjags
    • interpret examples of the ‘ecological fallacy’
    • use DIC as a model selection criterion

Resources

Reference: Duke Forest FACE

Data: dataTreeFACE.csv file on Canvas

Software: Get the updated clarkFunctions2026.R file on Canvas, add libraries:

source('clarkFunctions2026.R')
library(rjags)

For next time

  • this vignette


Hierarchical analysis of experimental inventory

How does rising CO2 affect tree growth, when each individual experiences CO2 concentration in its own way? The Free-Air CO2 Experiment (FACE) inventory data provide an opportunity to examine the heterogeneity of responses in the context of individual differences.

To determine how rising CO2 can affect a forested ecosystem, the Free-Air CO2 Enrichment (FACE) experiment subjected trees to elevated CO2 levels in 30-m diameter rings at the Duke Forest. Four plots received elevated atmospheric CO2 concentration, and four plots were instrumented but treated as ambient controls. The experiment began in June 1994 as a prototype, with the full experiment implemented with additional plots in August 1996. Plots were partitioned in early 2005 to include a CO2 × nitrogen treatment (main effects and factorial). In addition to measurements on all aspects of ecophysiology and atmosphere-biosphere exchange, the plots were sampled as forest inventory plots, including diameter growth. The study was funded by the Office of Science (BER) of the U.S. Department of Energy.

*In this analysis we use tree data from three ambient and three elevated CO2 rings. Rings are 30 m wide and instrumented for CO2 fumigation and monitoring.*


Here I use a hierarchical model to examine the effects of tree size (diameter), CO2 fertilization, and their interaction on growth rates. There are large differences between individuals, so I include a random effect on trees. To appreciate the role of heterogeneity, I start first with a single-stage regression that ignores individual differences. Here is a data set:

data   <- read.csv('dataTreeFACE.csv',header=T)
data$y <- log( data$dnow )                     # log growth rate

I use log growth rate as the response due to the right skew in the data:

par( mfrow = c(1, 2) )                # side-by-side panels
ylim <- c(0, 800 )
hist( data$dnow, nclass = 40, col = '#1f78b4', border = '#1f78b4',
      ylim = ylim, xlab = 'cm', main = 'Growth rate' )
hist( data$y, nclass = 40, col = '#1f78b4', border = '#1f78b4', yaxt = 'n',
      ylim = ylim, xlab = 'log cm', ylab = '', main = 'log scale'  )
*Diameter growth rate*


Some EDA

Exploratory data analysis (EDA) is the essential first step in an analysis to understand relationships between variables.

The CO2 treatment in the data set is coded as [0, 1]. To see the balance of observations I make a table:

table(data$trt)
Var1 Freq
   0 3669
   1 3145

Note that the treated and control observations are reasonably well balanced.

Here is a plot of growth rates against year:

plot( jitter(data$t), data$dnow, xlab = 'Year', ylab = 'Growth rate (cm)',
     cex=.1, log = 'y')
q <- tapply( data$dnow, data$t, quantile, c(.5, .05, .95), na.rm = T) # 90% of observations
q <- matrix( unlist(q), ncol = 3, byrow = T )
lines( q[,1], lwd = 3 )
lines( q[,2], lty = 2 )
lines( q[,3], lty = 2 )
*Attrition of slow growers in later years. Dashed lines bound 90% of observations.*



Note the use of tapply to evaluate quantiles by year.
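
As a reminder of how tapply works, here is a toy example with made-up values (not the FACE data): the function in the third argument is applied to subsets of the first argument defined by the second.

vals  <- c(1, 2, 3, 10, 20, 30)                    # made-up values
group <- c('a','a','a','b','b','b')                # grouping variable
tapply( vals, group, quantile, c(.5, .05, .95) )   # one set of quantiles per group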

I find this plot hard to interpret; it suggests bimodality, i.e., clusters of low and high growth rates. Frequency plots of growth rates by year confirm this pattern:

lr   <- range( data$y, na.rm=T )
bins <- seq( lr[1], lr[2], length=50)                # bins for log growth rate
mids <- ( bins[-1] + bins[-50] )/2                   # midpoints of bins
ybin <- findInterval( data$y, bins, all.inside = T)  # assign growth rates to bins

dByt <- table( data$t, ybin )                        # growth by year
xlim <- c(1, 12)
ylim <- range(data$y, na.rm=T)

plot( NA, xlim = xlim, ylim = ylim, cex = .1, 
      xlab = 'Year', ylab = 'log growth rate (cm)' )
nt <- max(data$t)
scale <- 1/max(dByt)                                 # horizontal scale fits time increment
for(t in 1:nt){
  lines( t + scale*dByt[t,], mids, type = 's', lwd=2)
  lines( c(t,t), ylim, lwd=2 )
}
*Distributions of individual rates by year.*



The distributions of growth rates are certainly not Gaussian, but perhaps they would be approximately Gaussian if I allow for individual differences. As a quick check, here are frequency distributions by individual tree centered on their mean values:

par( mfrow = c(2, 1) )                                 # stacked panels
growthStandard <- (data$y - mean(data$y, na.rm=T))/sd(data$y, na.rm=T)   # center, standardize over all obs
hist( growthStandard, nclass = 100, xlim=c(-3, 3), col = 'grey', border = 'grey',
      main = "Centered on population mean", xlab = '') # centered on the population mean

imu <- tapply( data$y, data$i, mean, na.rm=T)          # mean log growth for each individual
isd <- tapply( data$y, data$i, sd, na.rm=T)            # sd for each individual
dev <- (data$y - imu[data$i])/isd[data$i]              # standardize by each individual's mean and sd
hist( dev, nclass=100, xlim=c(-3, 3), col = 'grey',
      border = 'grey', main = "Centered on individual means", xlab = 'Standardized log growth')
*Distributions of individual rates.*



The upper plot shows the strong bimodality. When growth rates are taken as departures from their individual means (lower plot), the distribution is much less bimodal. It is still not Gaussian, but predictors in the model are expected to explain some of this variation. I can now include those predictors in a regression.


Bayesian regression in jags

rjags can be used to simulate the posterior distribution in R.

*Graph of the regression model showing the prior distribution for coefficients. Note that parameters have distributions, but that randomness is viewed as uncertainty, not as true variation. Uncertainty comes from the fact that parameters are estimated.*


A Bayesian regression model starts with main effects of diameter diam and CO2 treatment trt. The graph for this model shows coefficients \(\boldsymbol{\beta}\) and a residual variance \(\sigma^2\),

\[y_{it} = \mathbf{x'}_{it} \boldsymbol{\beta} + \epsilon_{it}\]

A formula in R has the syntax y ~ predictor1 + predictor2. To include main effects and the interaction I use the syntax predictor1*predictor2.
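
As a quick check of the formula syntax, a toy data frame with hypothetical values shows that diam*trt expands to the two main effects plus their interaction:

toy <- data.frame( diam = c(10, 20, 30), trt = c(0, 1, 1) )   # hypothetical rows
colnames( model.matrix( ~ diam*trt, toy ) )                   # (Intercept), diam, trt, diam:trt
colnames( model.matrix( ~ diam + trt + diam:trt, toy ) )      # the same columns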

Recall that \(\mathbf{x}'_{it}\) is one row of the full design matrix \(\mathbf{X}\). I want to look at the design matrix \(\mathbf{X}\), so I generate it using model.frame and model.matrix:

tmp    <- model.frame(y ~ diam*trt, data)
x      <- model.matrix(y ~ diam*trt, tmp)  # design matrix
y      <- tmp$y                            # response (log growth rate)
n      <- length(y)
head(x)
##   (Intercept)     diam trt diam:trt
## 1           1 19.17660   0        0
## 2           1 19.72210   0        0
## 3           1 20.34015   0        0
## 4           1 20.91494   0        0
## 5           1 21.24999   0        0
## 6           1 21.93743   0        0

Notice how the interaction diam:trt appears as its own column in the design matrix x (zero in these rows because they have trt = 0).


The Bayesian regression will have fitted coefficients and a residual variance. Here is a model for rjags, which is written to a file growthFixed.txt:

cat( "model{                   
    for(i in 1:n){
      y[i]  ~ dnorm(mu[i],tau)
      mu[i] <- b1 + b2*x[i,2] + b3*x[i,3] + b4*x[i,4]
    }

    tau     ~ dgamma(0.001,0.001)
    sigma   <- sqrt(1/tau)
    
    b1 ~ dnorm(0.0,1.0E-06)
    b2 ~ dnorm(0.0,1.0E-06)
    b3 ~ dnorm(0.0,1.0E-06)
    b4 ~ dnorm(0.0,1.0E-06)
  }", 
  fill=T, file="growthFixed.txt"
)

The cat function writes this character string to a text file. There are three blocks within the model string. The first block looks like a loop, but it does not execute that way; it is interpreted to set up the graphical model for the regression mean mu and for the data y. Unlike distribution functions in R, JAGS parameterizes the normal distribution with a precision tau, the reciprocal of the variance sigma^2. The precision is given a gamma prior, and sigma is defined as sqrt(1/tau). This means that sigma^2 has an inverse gamma prior. The coefficients have non-informative, independent normal priors.
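
As a small reminder of the precision parameterization (values here are purely illustrative), the JAGS statement dnorm(mu, tau) corresponds to R's dnorm() with sd = 1/sqrt(tau):

tau   <- 4                            # an illustrative precision
sigma <- sqrt( 1/tau )                # implied standard deviation, 0.5
dnorm( .3, mean = 0, sd = sigma )     # density at 0.3 under mean 0, precision tau = 4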

The inverse gamma distribution for \(\sigma^2\) is a gamma distribution of \(\sigma^{-2}\). It is often used as a prior distribution for a variance, because it is conjugate with the normal likelihood, meaning that the posterior distribution is also inverse gamma.
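
For reference, conjugacy means the update has a closed form. With known means \(\mu_i\) and an \(IG(a, b)\) prior (shape \(a\), scale \(b\)), the posterior is

\[ \sigma^2 | \mathbf{y} \sim IG\left( a + \frac{n}{2}, \; b + \frac{1}{2}\sum_{i=1}^n (y_i - \mu_i)^2 \right) \]

so a vague prior like \(IG(0.001, 0.001)\) is quickly overwhelmed when \(n\) is large.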

The full model is

\[ \begin{align} [\boldsymbol{\beta}, \sigma^2| \mathbf{y}, \mathbf{X}] &= \prod_{i} \prod_{t}N( y_{it} | \mathbf{x'}_{it} \boldsymbol{\beta}, \sigma^2) \\ &\times \prod_q N( \beta_q | 0, 10^6) \\ &\times IG( \sigma^2 | 0.001, 0.001 ) \end{align} \] The graph for this model has a data stage (fixed \(y_i\)), a random parameter stage (\(\boldsymbol{\beta}, \sigma^2\)), and a fixed prior distribution.

Here I generate the model graph:

treeData  <- list(y = y, x = x, n = n )                            # define data
growModel <- jags.model(data = treeData, file = "growthFixed.txt") # generate graph
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 6814
##    Unobserved stochastic nodes: 5
##    Total graph size: 50858
## 
## Initializing model

Here I fit the model:

growFit <- coda.samples(growModel, variable.names = c("b1","b2","b3","b4"), 
                       n.iter = 5000)
tmp <- summary(growFit)
print(tmp$statistics)
##           Mean          SD     Naive SE Time-series SE
## b1 -4.05104671 0.105312709 1.489347e-03   0.0085402419
## b2  0.15036989 0.005576753 7.886720e-05   0.0004328869
## b3  0.63523040 0.100650329 1.423411e-03   0.0119843057
## b4 -0.02582705 0.005342871 7.555961e-05   0.0006239347
plot( as.mcmc(growFit) )
*MCMC chains and posterior densities.*


Check also the other objects held in the list tmp.


These plots show MCMC chains for each of the four \(\boldsymbol{\beta}\) values on the left and density estimates at right. We will talk more in coming weeks about the need for convergence and good mixing in these chains. For now, I just note that they quickly converge to values that are then stationary. The density plots at right are generated from these chains.

Markov chain Monte Carlo (MCMC) draws repeated samples from the posterior distribution that can be viewed as chains (iterations by parameter value).
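
A couple of quick diagnostics from the coda package (loaded with rjags) can be run on these chains; a minimal sketch, noting that growFit has a single chain, so the Gelman-Rubin diagnostic would require refitting with n.chains greater than 1:

effectiveSize( growFit )        # effective sample size for each parameter
autocorr.diag( growFit )        # autocorrelation at increasing lags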

Objects can be extracted from the summary,

betaFixed <- cbind( tmp$statistics[,1:2], tmp$quantiles[,c('2.5%','97.5%')] )
signif( betaFixed, 3 )
        Mean      SD    2.5%   97.5%
b1   -4.0500 0.10500 -4.1300 -3.9900
b2    0.1500 0.00558  0.1470  0.1550
b3    0.6350 0.10100  0.5370  0.7460
b4   -0.0258 0.00534 -0.0317 -0.0205


The matrix betaFixed summarizes the posterior distribution for \(\boldsymbol{\beta}\).

Here are the data and the mean prediction for ambient (black) and elevated (red) treatments:

xlim <- range( x[,'diam'] )
xseq <- seq( xlim[1], xlim[2], length=100 )    # sequence of diameter values

plot(x[,'diam'], y, col = x[,'trt'] + 1, cex = .1, # color code by trt
     xlab='Diameter (cm)', ylab = 'log growth rate (cm/yr)' )
mu <- betaFixed[1,1] + betaFixed[2,1]*xseq
lines(xseq, mu, lwd = 2, lty = 2)
lines(xseq, mu + betaFixed[3,1] + betaFixed[4,1]*xseq, lwd = 2, col = 2, lty = 2)
legend('bottomright', c('ambient','elevated'), text.col = c(1,2), bty='n')
*Fitted model showing two treatment levels with observed values.*



Exercise 1: Interpret the main effects and the interaction in this model, explaining what each coefficient means for growth.


Exercise 2: Run the model again, including nfert (nitrogen fertilization) in addition to the other variables. Interpret its effect.


A hierarchical model for random effects

A model becomes hierarchical when a stage is added to the prior. When this is done, the “prior” is no longer fixed, thus allowing for variability at an additional stage. Random effects can be thought of as parameters that have a hyperprior distribution.

The model can be extended hierarchically to allow for differences between individuals. I assign random effects to individuals under the assumption that individuals are drawn at random from a much larger population of individuals that I might have selected for the study. In the simple regression context this is termed a mixed model, referring to the fact that there will be both fixed and random effects. Note that the added stage does not affect the structure of the previous model that contained only fixed effects.

*Graph with random coefficients. There are fixed differences between individuals coming from beta. There are also random differences coming from alpha_i, which now have variances that are fitted to the data.*


For random effects, I start with labels for the observations by tree in the vector ij, which uniquely identifies an individual. I use match against the unique values to assign each row in x to an individual:

ij    <- paste(data$j, data$i, sep = '-')  # plot, tree
tn    <- sort(unique(ij))                  # tree labels
ix    <- match(ij,tn)                      # index rows by tree
ntree <- max(ix)                           # no. trees


The model will have random effects for the intercept and the diameter:

\[ y_{it} = \beta_0 + \beta_d d_{it} + \beta_c c_{it} + \beta_{d \times c} d_{it}c_{it} + \alpha_{i0} + \alpha_{id} d_{it} + \epsilon_{it} \] where \(d, c\) refer to diameter and CO2 treatment, respectively. Note that the \(\alpha\) coefficients have a subscript \(i\), whereas the \(\beta\)'s do not.

The full model is now:

\[ \begin{align} [\boldsymbol{\beta}, \boldsymbol{\alpha}, \sigma^2, \sigma^2_1, \sigma^2_2 | \mathbf{y}, \mathbf{X}] &= \prod_{i} \prod_{t}N( y_{it} | \mathbf{x'}_{it} \boldsymbol{\beta} + \mathbf{w'}_{it} \boldsymbol{\alpha}_i, \sigma^2) \\ &\times \prod_q N( \beta_q | 0, 10^6) \\ &\times IG( \sigma^2 | 0.001, 0.001 ) \\ &\times \prod_i \prod_{q \in (1,2)} N( \alpha_{iq} | 0, \sigma_q^2) \\ &\times \prod_{q \in (1,2)} IG( \sigma_q^2 | 0.001, 0.001 ) \end{align} \] The new design vector \(\mathbf{w}_{it}\) repeats variables already in \(\mathbf{x}_{it}\): the intercept and diameter. Again, the term "mixed model" refers to the inclusion of both fixed and random effects. The random effects are centered on zero, so an individual's expected growth rate is the fixed effect plus its random deviation. In this case, \(\mathbf{x}'_{it} = (1, d_{it}, c_{it}, d_{it}c_{it})\) and \(\mathbf{w}'_{it} = (1, d_{it})\). In general, the random-effects design \(\mathbf{w}\) is a subset of the fixed-effects design \(\mathbf{x}\).
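
To make \(\mathbf{w}\) concrete, it is just the intercept and diameter columns of the design matrix built above. The JAGS code below indexes x directly, so this object is only for illustration:

w <- x[, c('(Intercept)', 'diam')]   # random-effects design: intercept and diameter
head( w )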

Important: Random effects are assigned to groups of observations. In this case the group is an individual tree, which is observed repeatedly. If there is no repetition at the group level, then the estimate for that group comes only from the prior. Avoid this by following this rule: assign random effects to groups that have at least one fewer subscript than the response variable. In this example, the response has subscript \(it\) (year within individual), and the random effect has subscript \(i\) (individual).
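
A quick way to check this rule is to count observations per group. The vector ix constructed above indexes trees, so table(ix) gives the number of rows per tree:

nObs <- table( ix )           # observations per tree
summary( as.numeric(nObs) )   # distribution of the number of rows per tree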

The important additions to the model are the distributions for the random intercepts and diameter effects. Each individual gets its own intercept and its own coefficient for the diameter effect.

Exercise 3: Why didn’t I include random effects on CO2 treatment? To answer this you need to consider how random effects are assigned to groups.

There are ntree = 696 individual trees. Each observation is now labeled by the tree to which it belongs in the vector ix. To follow the jags model, note the indexing ix[i], which gives the individual identity for row i of x. Indexing again, a1[ix[i]] gives the random intercept assigned to that individual. Here is a model:

cat( "model{
    for(i in 1:n){
      y[i]  ~ dnorm(mu[i],tau)
      mu[i] <- b1 + b2*x[i,2] + b3*x[i,3] + b4*x[i,4] + 
               a1[ix[i]] + a2[ix[i]]*x[i,2]
    }
    for(j in 1:ntree){
      a1[j]  ~ dnorm(0.0,tau1)
      a2[j]  ~ dnorm(0.0,tau2)
    }
    tau1   ~ dgamma(0.001,0.001)    # RE precision, intercept
    sigma1 <- 1/sqrt(tau1)          # RE standard deviation, intercept
    tau2   ~ dgamma(0.001,0.001)    # RE precision, diam
    sigma2 <- 1/sqrt(tau2)          # RE standard deviation, diam
    tau    ~ dgamma(0.001,0.001)    # residual precision
    sigma  <- sqrt(1/tau)           # residual standard deviation
    
    b1 ~ dnorm(0.0,1.0E-06)
    b2 ~ dnorm(0.0,1.0E-06)
    b3 ~ dnorm(0.0,1.0E-06)
    b4 ~ dnorm(0.0,1.0E-06)
    }", 
    fill=T,file="growthRE.txt" )

Note that there are four fixed effects, b1, ..., b4. Each random effect, a1 and a2, has 696 values, one per tree. Again, the index ix identifies the individual for each observation row i in x. The hierarchical nature of the model comes from the fact that the random effects a1 and a2 have their own variance parameters (tau1, tau2), which also have prior distributions.
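
In R terms, the construction a1[ix[i]] is ordinary vector indexing: a vector with one entry per tree, indexed by ix, yields one value per observation row. A toy illustration with made-up numbers:

a1toy <- c( -0.2, 0.1, 0.4 )        # one random intercept per tree (made up)
ixtoy <- c( 1, 1, 2, 3, 3, 3 )      # tree index for each observation
a1toy[ ixtoy ]                      # per-observation random intercepts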

Here is a model fit using three parallel chains:

treeData <- list(y = y, x = x, ix = ix, ntree = ntree, n = n)
reModel  <- jags.model(data = treeData, file = "growthRE.txt", n.chains = 3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 6814
##    Unobserved stochastic nodes: 1399
##    Total graph size: 65885
## 
## Initializing model
growthRE <- coda.samples(reModel,
                         variable.names=c("b1","b2","b3","b4","a1","a2","sigma1","sigma2"), 
                         n.iter=5000)
tmp    <- summary( growthRE )
alpha  <- cbind( tmp$statistics[,1:2], tmp$quantiles[,c('2.5%','97.5%')] )

The matrix alpha has an intercept for each tree, with row names a1[1] \(\ldots\) a1[ntree], and a diameter coefficient for each tree, with row names a2[1] \(\ldots\) a2[ntree]. Here are the first few lines of alpha:
          Mean    SD   2.5%  97.5%
a1[1]    0.906 0.453 -0.377   1.58
a1[2]   -1.970 0.426 -3.320  -1.56
a1[3]    1.040 0.391 -0.102   1.59
a1[4]    1.270 0.404  0.359   1.96
a1[5]    0.707 0.567 -0.970   1.41
a1[6]    0.635 0.383 -0.466   1.18
a1[7]   -1.650 0.433 -3.030  -1.19
a1[8]   -1.910 0.423 -3.270  -1.50
a1[9]   -0.970 0.419 -2.300  -0.55
a1[10]   1.290 0.383  0.271   1.89

All of the alpha values are followed by the fixed effects and the random-effects standard deviations, obtained with tail(alpha):

             Mean      SD     2.5%    97.5%
a2[693] -2.53e-03 0.01550 -0.03440  0.02600
a2[694] -1.98e-03 0.01350 -0.03180  0.02170
a2[695] -3.73e-03 0.01490 -0.03290  0.02470
a2[696] -3.60e-03 0.01520 -0.03620  0.02380
b1      -1.67e+00 0.41900 -1.99000 -0.28300
b2       7.37e-03 0.00300  0.00183  0.01280
b3       3.30e-01 0.15900 -0.03660  0.60400
b4       7.77e-05 0.00337 -0.00650  0.00547
sigma1   1.16e+00 0.16100  1.04000  1.75000
sigma2   1.43e-02 0.00494  0.00968  0.02810

As before we have four beta estimates.

The estimates in alpha are obtained from MCMC chains that can be accessed this way:

cnames <- colnames(growthRE[[1]])          # parameters are column names
a1     <- which(startsWith(cnames, 'a1'))  # random intercept columns, one per tree
a2     <- which(startsWith(cnames, 'a2'))  # random diameter-slope columns
bg     <- which(startsWith(cnames, 'b'))   # the four beta parameters
sg     <- which(startsWith(cnames, 's'))   # sigma1, sigma2

bgibbs <- growthRE[[1]][,bg]              # extract chains for beta
sgibbs <- growthRE[[1]][,sg]              # extract chains for sigma1, sigma2
beta   <- apply(bgibbs,2, mean)           # posterior means (agree with the summary above)

# tree random effect parameter means
alpha <- cbind( apply(growthRE[[1]][,a1],2, mean), 
                apply(growthRE[[1]][,a2],2, mean))
colnames(alpha) <- c( 'a1', 'a2' )
rownames(alpha) <- NULL
      a1        a2
   0.920  0.001490
  -1.940 -0.001750
   1.070 -0.003750
   1.290 -0.004550
   0.704  0.016700
   0.656 -0.005480
  -1.620 -0.000758
  -1.880 -0.001130
  -0.941 -0.001220
   1.300 -0.012100


Here are mean predictions for diameter at two CO2 treatments.

ylim <- range(y)
mu   <- betaFixed[1,1] + betaFixed[2,1]*xseq                # ambient
plot(xseq, mu, type='l',lwd=2, ylim=ylim, xlab='Diameter (cm)', 
     ylab=' log growth rate', lty = 2, bty = 'n')

mu2  <- mu + betaFixed[3,1] + betaFixed[4,1]*xseq           # elevated CO2
lines(xseq, mu2, type='l', lty = 2, lwd=2, col=2)

# size range by individual
irange <- matrix( unlist(tapply(x[,'diam'], ix, range, na.rm=T)), ncol=2, byrow=T)

# regression fit by individual
for(i in 1:ntree){
  xi <- xseq[xseq > irange[i,1] & xseq < irange[i,2]]
  mu <- beta[1] + alpha[i,1] + (beta[2] + alpha[i,2])*xi  # fixed b, individual a
  wi <- which(ix == i)[1]
  if(x[wi,'trt'] == 0){                                   # ambient
    lines(xi, mu, col='grey')
    next
  }
  mu <- mu + beta[3] + beta[4]*xi                         # elevated
  lines(xi, mu, type='l',col=2)
}
*Random effects model, with short lines for each individual, compared with the fixed effects model (dashed lines).*



Note that I first plotted the dashed lines for the model with only fixed effects. The loop then adds a line for each tree from the random effects model, with elevated CO2 in red.

Exercise 4: Interpret the main effects and interactions in this hierarchical model. How does it compare with the model that included only fixed effects? Think about the role of variation between individuals versus within individuals. Speculate on why the differences arise.


Exercise 5: Is this analysis an example of the ecological fallacy? Why or why not?


A distribution for random effects

The random-effects standard deviations, sigma1 and sigma2, are summarized in a table:

sigmaAlpha <- cbind( colMeans( sgibbs ), 
                     apply( sgibbs, 2, sd ), 
                     t( apply( sgibbs, 2, quantile, c( .025, .975 ) ) ) )
colnames( sigmaAlpha )[1:2] <- c( 'Estimate', 'SE' )
       Estimate       SE     2.5%   97.5%
sigma1  1.15700 0.171500 1.036000 1.77000
sigma2  0.01422 0.004812 0.009398 0.02674

Again, this is the spread among individuals: sigma1 for their intercepts and sigma2 for how their growth rate associates with their own diameter. Here are histograms of the alpha estimates shown with these estimates:

par( mfrow = c(1, 2), oma = c(0, 2, 0, 0) )   # two panels, outer margin for the shared y label
xlim <- c(-2,2)
hist( alpha[,1], nclass = 30, col = '#d6604d', border = '#d6604d', main = 'alpha_1',
     xlab = '', xlim = xlim, ylim = c(0, 100) )
arrows( -sigmaAlpha[1,1], 90, sigmaAlpha[1,1], 90, code = 3, length = .05, angle = 20, lwd = 2 )

xlim <- c(-.02, .02)
hist( alpha[,2], nclass = 30, col = '#d6604d', border = '#d6604d', main = 'alpha_2',
     xlab = '', yaxt = 'n', ylab = '', xlim = xlim, ylim = c(0, 100))
arrows( -sigmaAlpha[2,1], 90, sigmaAlpha[2,1], 90, code = 3, length = .05, angle = 20, lwd = 2 )
mtext( 'Frequency', 2, outer = T, las = 0, line = 1 )
*Histograms of alpha estimates, with arrows spanning plus and minus one estimated standard deviation from the object* sigmaAlpha.


Deviance information criterion

Models are selected to fit, but not over-fit, the data. A penalty for model size typically counts independent parameters. Hierarchical models do not have a clear number of parameters, because many are drawn from distributions that also have fitted parameters.

Model selection criteria typically describe how well the model fits the data with a penalty for model size. The penalty is important because a larger model can always improve the fit. Over-fitting describes the case where a model fits the “noise”, in the sense that it would not predict another sample drawn from the same population.

Out-of-sample prediction is one approach to the overfitting problem. This can be done by partitioning a data set, then using one part for fitting and the other for prediction. Cross-validation involves repeated model fitting, sequentially leaving out a portion of the data and checking the predictions. Leave-one-out (LOO) cross-validation does this one observation at a time. The disadvantage of cross-validation arises when models require substantial time to fit. When using cross-validation, model selection can be based on mean squared prediction error.
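
As a sketch of the idea, using a simple lm() stand-in rather than the hierarchical model and an assumed 80/20 split, out-of-sample mean squared prediction error could be computed this way:

set.seed(1)
ok    <- which( is.finite(data$y) & is.finite(data$diam) )         # usable rows
train <- sample( ok, round( .8*length(ok) ) )                      # 80% for fitting
test  <- setdiff( ok, train )                                      # 20% held out
fit   <- lm( y ~ diam*trt, data = data[train, ] )
mspe  <- mean( ( data$y[test] - predict( fit, data[test, ] ) )^2 )
mspe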

Among traditional methods, the Akaike Information Criterion (AIC) penalizes a model \(j\) for its number of parameters \(k_j\),

\[ AIC_j = 2k_j - 2 \ln \hat{L}_j \] where \(\hat{L}_j\) is the maximum likelihood for model \(j\). The model with the lowest \(AIC_j\) is preferred. The more parameters (higher \(k_j\)), the larger the penalty.
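
For non-hierarchical stand-ins fit by maximum likelihood, R computes AIC directly. For example, comparing main effects only against the model with the interaction (restricting to finite log growth rates):

fitMain <- lm( y ~ diam + trt, data = data, subset = is.finite(y) )   # main effects only
fitInt  <- lm( y ~ diam*trt,  data = data, subset = is.finite(y) )    # adds the interaction
AIC( fitMain, fitInt )                                                # lower AIC is preferred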

Model selection is more challenging for hierarchical models because I do not know how to ‘count’ parameters; each parameter is part of a distribution. In this case, the number of parameters must be estimated. The deviance information criterion (DIC) is an index used for selection in hierarchical models, based on the deviance, \(D(\theta) = -2 \log( [\mathbf{y} | \theta] )\), where \(\theta\) is a vector of fitted parameters, and \([\mathbf{y} | \theta]\) is the likelihood. We need a penalty for model complexity, i.e., the number of parameters.

The DIC is given by

\[DIC = D(\bar{\theta}) + 2p_D\]

where \(D(\bar{\theta})\) is the deviance evaluated at the posterior mean parameter values, and \(p_D = \bar{D} - D(\bar{\theta})\) is the 'effective number of parameters'. This is the penalty for model complexity. The model with the lowest DIC is viewed as the best model. Here is the DIC using \(2p_D\) as the penalty:

dic.samples(reModel, n.iter = 100, thin = 1, type = 'pD')

This penalized deviance (i.e., the DIC) can be compared with other models fitted to the same data.
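
For example, the fixed-effects and random-effects models could be compared with rjags's diffdic function; the iteration counts here are illustrative and would normally be larger:

dicFixed <- dic.samples( growModel, n.iter = 1000, type = 'pD' )   # fixed effects only
dicRE    <- dic.samples( reModel,   n.iter = 1000, type = 'pD' )   # random effects
diffdic( dicFixed, dicRE )   # a positive difference favors the second (random effects) model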

Exercise 6: Add nitrogen fertilization, nfert, to the hierarchical analysis. Use DIC to determine whether it improves the model.


Summary

Random effects represent a common type of hierarchical model. Random effects can be used when group differences are viewed as having been drawn at random from a large population of such groups, i.e., they are exchangeable. A random effects model may be limited to random intercepts, or it can additionally have random slopes. The effects became random in this example when I added a prior to the prior, a hyperprior.

Hierarchical models lend themselves to Gibbs sampling, a type of MCMC. They can be fitted in R, including with jags. In upcoming units we'll write our own Gibbs samplers.

DIC provides a simple index for variable selection in hierarchical models. It does so by estimating the effective number of parameters in the model and using that estimate as the penalty for model complexity.

Hierarchical models can apply to all kinds of problems, where additional stages allow us to specify and compute based on conditional relationships.


Appendix

Here is a bit more background on the hierarchical model that we take up in later units.

Consider a linear model with Gaussian error. A random effects model for a regression parameter can be written with two terms,

\[y_{ij} = \beta + \alpha_j + \epsilon_{ij}\]

The first term, \(\beta\), is the fixed effect; it has a prior distribution. The second term, \(\alpha_j\), is a random effect assigned to the \(j^{th}\) group; it has a prior distribution with parameters that themselves have prior distributions.

In a regression the random effects can be organized as a \(J \times q\) matrix, with one row per group. I give it a distribution, e.g., \(vec(\boldsymbol{\alpha}) \sim MVN(0, \mathbf{V})\), where \(\mathbf{V}\) is a \(Jq \times Jq\) covariance matrix. If there are predictors in the model I can write it this way:

\[y_{ij} = \boldsymbol{\beta'}\mathbf{x}_{ij} + \boldsymbol{\alpha}'_j \mathbf{w}_{ij} + \epsilon_{ij}\]

where \(\mathbf{x}_{ij}\) is a length-\(p\) vector of covariates and/or factors that enter as fixed effects, \(\mathbf{w}_{ij}\) is a length-\(q\) subset of \(\mathbf{x}_{ij}\), with \(q \leq p\), \(\boldsymbol{\beta}\) is the length-\(p\) vector of fixed effects, and \(\boldsymbol{\alpha}_j\) is the length-\(q\) vector of random effects for group \(j\). It is the \(j^{th}\) row of the random effects matrix.

For a simple model with random effects on both slope and intercept, \(\mathbf{x}_{ij} = \mathbf{w}_{ij} =(1, x_{ij})'\), \(\boldsymbol{\beta} = (\beta_1, \beta_2)'\), and \(\boldsymbol{\alpha}_j = (\alpha_{j1}, \alpha_{j2})'\). Now an observation in group \(j\) has intercept \(\beta_1 + \alpha_{j1}\) and slope \(\beta_2 + \alpha_{j2}\). If the model has a random intercept, but not a random slope, then all groups share the slope \(\beta_2\).

The model for \(n_j\) observations in group \(j\) has this expected value:

\[E \left[\mathbf{y}_{\mathbf{j}} | \boldsymbol{\alpha}_{\mathbf{j}} \right] = \mathbf{X}_{\mathbf{j}} \boldsymbol{\beta} + \mathbf{W}_{\mathbf{j}} \boldsymbol{\alpha}_{\mathbf{j}}\]

The design matrices \(\mathbf{X}_{\mathbf{j}}\) and \(\mathbf{W}_{\mathbf{j}}\) have dimensions \(n_j \times p\) and \(n_j \times q\), respectively. The vector of random effects has the distribution \(\boldsymbol{\alpha} \sim MVN(\mathbf{0}, \mathbf{V})\), where \(\mathbf{0}\) is the length-\(qJ\) vector of zeros, and \(\mathbf{V}\) is the \(qJ \times qJ\) matrix of covariances. Because this is a random effects model, there will also be a prior on \(\mathbf{V} \sim IW(v, \mathbf{V}_0)\).

The likelihood is

\[\mathbf{y}_{\mathbf{j}} \sim MVN \left( \mathbf{X}_{\mathbf{j}} \boldsymbol{\beta} + \mathbf{W}_{\mathbf{j}} \boldsymbol{\alpha}_{\mathbf{j}}, \sigma^2 \mathbf{I}_{n_j} \right)\]

where \(\mathbf{W}_{\mathbf{j}}\) is the \(n_j \times q\) random-effects design matrix whose rows correspond to the observations in group \(j\). The full model is

\[ \begin{aligned} [\boldsymbol{\beta}, \boldsymbol{\alpha}, \mathbf{V}, \sigma^2 | \mathbf{X}, \mathbf{y}] & \propto \prod_i^n N \left( y_{ij} | \mathbf{x}'_{ij} \boldsymbol{\beta} + \mathbf{w}'_{ij} \boldsymbol{\alpha}_{j}, \sigma^2 \right) \\ & \times MVN \left( \boldsymbol{\beta} | 0, \mathbf{V}_b \right) \\ & \times MVN \left( vec(\boldsymbol{\alpha}) | 0, \mathbf{V} \right) \\ & \times IW(\mathbf{V} | v, \mathbf{V}_0) \\ & \times IG(\sigma^2 | s_1, s_2 ) \end{aligned} \] Note that both \(\boldsymbol{\beta}\) and \(\boldsymbol{\alpha}\) have prior distributions. However, the covariance of \(\boldsymbol{\alpha}\), the matrix \(\mathbf{V}\), also has a prior distribution, the inverse Wishart. This stage estimates how the random-effect coefficients covary across groups. In the JAGS example above, I made the variances a priori independent, \(\mathbf{V}^{-1}_{ii} = \tau_i \sim gamma(0.001, 0.001)\).