Unlike traditional methods that rely on optimization to find a ‘best’ estimate, Bayesian analysis results in a posterior distribution that often must be simulated. Markov chain Monte Carlo (MCMC) is used to draw samples from the posterior distribution, which can then be summarized in many ways. We discuss direct methods (Gibbs sampling) and indirect methods (Metropolis) of sampling the posterior using conditional distributions taken from the model graph. We generate our own MCMC algorithms, and we implement rjags.


Resources

Files on Sakai

  • on github: FACEtrees.txt
  • clarkFunctions2024.r

Software

  • packages: TruncatedNormal

Today

This vignette

Objectives

source('clarkFunctions2024.r')


A few preliminaries

I begin by summarizing/reviewing a few concepts that will come up in this vignette.

Multivariate normal prior on coefficients

I will use a multivariate normal prior distribution for the coefficient vector. Recall the notation

\[ MVN_q \left(\boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) \right) \] refers to a multivariate normal density for a length-\(q\) vector \(\boldsymbol{\beta}\) having a length-\(q\) mean vector \(\mathbf{b}\) and \(q \times q\) covariance matrix \(\mathbf{B}\).

I will use the function .dMVN to obtain the density for a random vector and the function .rMVN to drawn m random vectors. Here is a random sample of size \(m\) random vectors of length \(q = 2\),

m <- 10
q <- 2
b <- matrix(0, q, 1)
B <- diag(q)
B[1,2] <- B[2,1] <- -.7     # negative covariance
z <- .rMVN( m, b, B)


Look at the mean vector b, the covariance matrix B, and the random sample z. The covariance in this random sample is not exactly the one I used to generate it. Here’s B and the covariance in the sample:

print( B )
##      [,1] [,2]
## [1,]  1.0 -0.7
## [2,] -0.7  1.0
print( cov(z) )
##           [,1]      [,2]
## [1,]  1.261291 -1.260763
## [2,] -1.260763  1.990534

If I increase the sample size (increase m), then cov(z) will approach B. I can find the probability associated with each random vector this way:

mmat <- matrix( b[,rep(1, m)], nrow=m, ncol=q) # repeat mean vector for each z
dz   <- .dMVN(z, mmat , B)
lims <- c(-3,3)
plot( z[,1], z[,2], cex=.01, xlim = lims, ylim = lims )
abline(h=0, lty=2, col='grey')
abline(v=0, lty=2, col='grey')
text( z[, 1], z[, 2], round( dz, 3) )
*Random vectors centered far from the mean have lower probability density. Dashed lines are 95% and 99% ellipsoids.*

Random vectors centered far from the mean have lower probability density. Dashed lines are 95% and 99% ellipsoids.


Note that there is one probability density value assigned to each row of z. From the few values I have plotted here note how these values decline with distance from the mean of the distribution.

Exercise 1. A valid covariance matrix must have an inverse. If it does not, it is singular or not full rank. A full rank covariance matrix has enough observations to contain information on all pairs of columns in the design. Use the function .rMVN to construct a design matrix with three variables (columns) in x. Use the cov function to obtain its covariance matrix. Use the solve function to obtain its inverse. How many observations (rows in x) are needed to get an inverse? How does the answer change if I have four columns in x?


Application to regression

I revisit regression with focus on Gibbs sampling, a type of MCMC. Recall that the posterior distribution has the form \(MVN_p( \beta | \mathbf{Vv}, \mathbf{V})\) with parameter vectors for \(\mathbf{V}\) and \(\mathbf{v}\),

\[ MVN(\mathbf{y} | \mathbf{X} \boldsymbol{\beta}, \Sigma) \times MVN(\boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) \]

\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \]

To show that I could draw samples from this distribution, start with data set:

 n <- 50
 p <- 2
 beta  <- matrix( c(.5,1.7), ncol = 1)
 sig2  <- .1
 x     <- cbind( 1, rnorm(n) )
 mu    <- x%*%beta
 y     <- mu + rnorm(n, 0, sqrt( sig2 ) )

Now I write a function to evaluate parameter vectors:

priorb <- beta*0
priorB <- diag(1000, p)
IB     <- solve(priorB)

getVv <- function(x, y, sig2, priorIVB, priorB){
  V  <- solve( crossprod(x)/sig2 + priorIVB )
  v  <- crossprod(x,y)/sig2 + priorIVB%*%priorB
  list( V = V, mu = V%*%v )
}

Here are some samples:

Vv <- getVv(x, y, sig2, IB, priorb)
b  <- .rMVN(1000, Vv$mu, Vv$V)
plot( b[,1], b[,2], cex=.1)
*Samples from the distribution for beta.*

Samples from the distribution for beta.


This example shows that I can draw samples from the posterior distribution for \(\beta\). This will lead us to MCMC and Gibbs sampling. I will use this MVN prior in the Metropolis sampler that follows.


Simulate a posterior distribution

The broad class of methods based on stochastic simulation techniques are referred to as Markov chain Monte Carlo (MCMC). The Markov chain part refers to the fact that each draw can be conditional on the current state. That is the definition of a Markov process.

For most models it is not possible to solve for the posterior distribution. Recall that the numerator is a product of likelihood and prior distribution. The denominator contains an integral,

\[[\theta | \mathbf{y}] = \frac{[\mathbf{y} | \theta][\theta]}{\int [\mathbf{y} | \theta][\theta] d\theta} = \frac{[\mathbf{y} | \theta][\theta]}{[\mathbf{y}]}\]

The denominator is a normalizing constant for this distribution. It is the marginal distribution of \(\mathbf{y}\). It is a distribution, but I refer to is as a ‘constant’, because it is constant with respect to \(\theta\). It determines the “height” of the posterior, but not its shape. Even if I cannot solve this integral, I may still be able to draw samples from it. If so, I can normalize by dividing by the number of samples I draw,

\[[\theta | \mathbf{y}] \approx \frac{1}{G} \sum_g^G [\theta_g | \mathbf{y}]\]

where \([\theta_g | \mathbf{y}]\) is a sample from the distribution \([\theta | \mathbf{y}]\). Importantly, this can be an unnormalized version of the posterior. The point here is that I can often draw samples from a distribution solely based on knowledge of its shape, which depends only on factors containing \(\theta\). It is normalized when I divide by \(G\). This normalized sample is an approximation to the posterior distribution.

Consider the example of a binomial distribution,

\[ binom(y | n, \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \] The leading binomial coefficient ‘normalizes’ this distribution; it does not contain \(\theta\). [This binomial coefficient evaluates the number of ways that \(y\) successes can be drawn in \(n\) trials.] If I did not know this normalizer, I could draw repeatedly (say \(G\)) ’observations, each of of size-\(n\) trials from \(Bernoulli( \theta )\).

Here is a simple example in R implemented as an independence sampler (not a Markov chain):

par(bty='n')
theta <- .4
G <- 500
n <- 10

yv <- 0:n                           # the sample space is the integers from 0 to n
yg <- rep(0, n+1)                   # vector to hold draws

for(g in 1:G){
  y <- 0
  for(i in 1:n){                    # n trials
    y <- y + rbinom(1, 1, theta)    # a Bernoulli trial (yes, this is crazy)
  }
  yg[y+1] <- yg[y+1] + 1
}
py <- yg/G                          # divide by the number of samples drawn
plot( yv, cumsum(py), type = 's', ylim=c(0,1), xlab = 'y in n trials', 
      ylab = 'Probability' )
segments( yv, yv*0, yv, py, lwd=4 )
*A stupid algorithm to simulate the binomial by drawing samples*.

A stupid algorithm to simulate the binomial by drawing samples.

The thing to point out here is that I never normalized these Bernoulli trials with the binomial coefficient. Instead I just (stupidly) drew samples and divided by the number I drew, G. This technique works even when the denominator (normalizer) involves integration.

Markov chain Monte Carlo (MCMC) is a simulation technique that can be used to draw samples from a posterior distribution where each update is conditionally dependent only on the current stage. [Independence sampling is a special case where the sample does not depend on the current state; it is not a Markov process.] Because the sampler lacks memory of past states, the sampler can be relatively simple.

Gibbs sampling

Gibbs sampling is a MCMC technique that allows sampling from a posterior distribution that contains multiple parameters, each by conditioning on the current values of the others.

More often than not, the posterior distribution is multivariate–there is more than one parameter in most models. When this is the case I can draw from conditional distributions sequentially. Consider the posterior distribution for two parameters having initial values \(\left[\theta^{(0)}_1,\theta^{(0)}_2 \right]\). The superscript \((0)\) in this notation means these are the initial values. I want to draw a new vector, which can start from this initial state, i.e., \(\left[\theta^{(1)}_1,\theta^{(1)}_2 | \theta^{(0)}_1,\theta^{(0)}_2, \mathbf{y} \right]\), the algorithm could look like this:

  • draw \(\left[\theta^{(1)}_1 | \theta^{(0)}_2, \mathbf{y} \right]\)
  • draw \(\left[\theta^{(1)}_2 | \theta^{(1)}_1, \mathbf{y} \right]\)

Gibbs sampling is a MCMC technique based on sequential draws from conditional posteriors. In some cases, I can sample directly from the conditionals, as when they result from conjugate prior/likelihood combinations. In other cases I can embed within the Gibbs sampler other types of sampling techniques. I implemented this scheme in the previous unit for regression, before introducing this basic notion of Gibbs sampling. Here again is a simple normal distribution described from the perspective of Gibbs sampling.

I previously estimated the mean and variance of the normal distribution from the model

\[N(\mathbf{y}|\mu,\sigma^2) N(\mu | m, M) IG(\sigma^2 | s_1, s_2)\] I wrote down the conditional distribution for mean given variance,

\[[\mu | \sigma^2, \mathbf{y}] = N(\mu | Vv, V)\]

where

\[ \begin{aligned} V &= \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1} \\ v &= \frac{n \bar{y}}{\sigma^2} + \frac{m}{M} \end{aligned} \] This distribution for \(\mu\) depends on \(\sigma^2\) (I can see it in both \(V\) and \(v\)).

I can also write down the conditional distribution for the variance given the mean,

\[[\sigma^2 | \mu, \mathbf{y}] = IG \left( s_1 + \frac{n}{2}, s_2 + \frac{1}{2} \sum_i (y_i - \mu)^2 \right)\] I now have two conditional distributions, \([\mu | \sigma^2, \mathbf{y}]\) and \([\sigma^2 | \mu, \mathbf{y}]\). I can use Gibbs sampling to obtain the joint posterior distribution \([\mu, \sigma^2 | \mathbf{y}]\). Here are functions to draw samples from the two conditional distributions:

getMu <- function(sig2, y, M, m){ # conditional distributions
  n  <- length(y)
  V  <- 1/( n/sig2 + 1/M )
  v  <- sum(y)/sig2 + m/M
  rnorm( 1, V*v, sqrt(V) )
}
getSigma <- function(mu, y, s1, s2){
  n  <- length(y)
  u1 <- s1 + n/2
  u2 <- s2 + 1/2*sum( (y - mu)^2 )
  1/rgamma(1, u1, u2) 
}


Here are a few draws (\(G = 5\)) executed in a loop, showing the progress of the sampler:

n    <- 10                           # simulate data
sig2 <- 2
mu   <- -1
y    <- rnorm( n, mu, sqrt(sig2) )

m  <- 0                            # prior parameter values
M  <- 1
s1 <- s2 <- 1

mg <- 0                            # initial values
sg <- 1
G  <- 5

plot( mg, sg, xlim = mu + c(-2,2) ,ylim = c(0,6), 
      xlab='mean parameter', ylab='variance parameter', bty = 'n', pch = 16 )

# one step at a time

for(g in 1:G){
  mn <- getMu(sg, y, M, m);      lines( c(mg,mn), c(sg,sg), col='blue')
  sn <- getSigma(mn, y, s1, s2); lines( c(mn,mn), c(sg,sn), col='green')
  points( mn, sn, pch = 15, col = 'white', cex = 2 )
  text( mn, sn, g)
  mg <- mn
  sg <- sn
  readline('return to continue')
}
*A few Gibbs steps for the Gaussian mean and variance.*

A few Gibbs steps for the Gaussian mean and variance.

Note that each draw consists of a horizontal (blue) line \(\mu|\sigma^2\) and a vertical (green) line \(\sigma^2|\mu\). Run this multiple times to get a feel for Gibbs sampling.

Here is a larger sample, where I save the chains of estimates and plot them with the scatter plot of values and marginal densities in blue:

par( mfcol=c(2,2), mar=c(4,4,1,1), bty='n')

ng     <- 500
chains <- matrix(NA,ng,2)

for(g in 1:ng){
  mg <- getMu(sg, y, M, m)
  sg <- getSigma(mg, y, s1, s2)
  chains[g,] <- c(mg, sg)
}
qc <- apply(chains,2,quantile,c(.025,.975))
ml <- 'Mean parameter'
vl <- 'Variance parameter'

xlim <- range(chains[,1])
ylim <- range(chains[,2])

# marginal densities
mdist <- density( chains[,1] )
vdist <- density( chains[,2] )

plot(chains[,1],chains[,2],cex=.3, xlab='', ylab=vl, log='y')
abline(v=qc[,1],lty=2); abline(h=qc[,2], lty=2)
  
polygon( mdist$x, ylim[1] + mdist$y, col='#3288bd', border='#3288bd' )
polygon( xlim[1] + vdist$y, vdist$x, col = '#01665e', border = '#01665e' )

plot(chains[,1],c(1:ng),type='l', xlab=ml, ylab='Index', col = '#3288bd')
  abline(v=qc[,1],lty=2, col = '#3288bd')
plot(chains[,2],type='l', ylab='', log='y', col = '#01665e')
  abline(h=qc[,2], lty=2, col = '#01665e')
*Chains for mean and variance with 95% credible intervals*. Marginals for the two parameters are shown in the margins at top left.

Chains for mean and variance with 95% credible intervals. Marginals for the two parameters are shown in the margins at top left.

The MCMC chains in these plots mix well; there is no sense of a long-term trend in the sampler. The scatter plot approximates the posterior distribution.

Exercise 2. Summarize the estimates as posterior mean colMeans(chains), median apply(chains,2,median), credible interval apply(chains,2,quantile,c(.025,.975)), and parameter correlation cor(chains).

Metropolis sampler

Metropolis sampling is a technique that allows indirect sampling from a distribution, by proposing values and accepting or rejecting them based on their relationship to the current value and to the distribution that is being sampled.

In the previous example, the conditional posteriors for parameters could be sampled directly from a product of normal likelihood \(\times\) normal prior for the mean and normal likelihood \(\times\) inverse gamma prior for the variance. I could do this because there are solutions for the two conditional posterior distributions. For a GLM, I cannot solve for the conditional posterior and thus, I cannot directly sample them. The Metropolis algorithm allows me to sample indirectly. By ‘indirect’, I mean that I will propose a random value from a standard distribution and then accept or reject that value based on a comparison involving the current value. I introduce the Metropolis algorithm in the context of a GLM example.

Application to logistic regression

Recall that a GLM combines a non-Gaussian sampling distribution with a link function and a linear function of parameters.

Basic model

Here is a GLM for binary outcomes using a Bernoulli distribution with a logit link,

\[ \begin{aligned} y_i &\sim Bernoulli(\theta_i) \\ logit(\theta_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \]

The Bernoulli distribution is

\[ Bernoulli(y_i | \theta_i) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i} \]

The logit function, which is linear for the log odds ratio, is the link function for this example,

\[ logit(\theta_i) = \ln \left(\frac{\theta_i}{1 - \theta_i} \right) = \mathbf{x}_i'\boldsymbol{\beta} \]

For the Bernoulli distribution, the expected value is \(E[y_i] = \theta_i\). To invert the link function, I solve for \(\theta_i\),

\[ \theta_i = \frac{exp(\mathbf{x}'_i\boldsymbol{\beta})}{1 + exp(\mathbf{x}'_i\boldsymbol{\beta})} \]

Functions for the logit and inverse logit are contained in clarkFunctions2024.R. Here is a simulated data set for this GLM,

n     <- 2000
p     <- 2
x     <- matrix(rnorm(n*p),n,p)        # design matrix
x[,1] <- 1                             # intercept
beta  <- matrix(c(-3,3),p,1)           # parameter vector
ltg   <- x %*% beta                    # logit theta
tg    <- invlogit(ltg)                 # initial values for theta
y     <- rbinom(n,1,tg)                # data

To see what this looks like, plot the values of y and tg against the predictor variable x[,2]. The values of y are jittered and shown in blue,

par(bty='n')
plot(x[,2],jitter(y),col='#3288bd', xlab='x_2', ylab='y, theta', cex=.2)
points(x[,2], tg, cex = .2)
abline(h=c(0,1),col='grey',lty=2)
*Logistic regression with simulated zeros and ones (blue dots). The black dots are the mean values used to simulate data*.

Logistic regression with simulated zeros and ones (blue dots). The black dots are the mean values used to simulate data.

The black dots show values for \(\theta\), and blue dots show simulated data \(\mathbf{y}\).

Prior distribution

I cannot specify a prior distribution for \(\boldsymbol{\beta}\) that will allow me to draw samples directly–there is no conjugate prior for regression coefficients in this non-linear model. The Metropolis algorithm provides a way to sample from a distribution based on proposals.

The Metropolis algorithm is a Markov chain Monte Carlo (MCMC) technique. As before, each new sample is conditional on the previous one. Again, I cannot draw these conditional samples directly from a standard distribution. I begin by proposing new values for \(\boldsymbol{\beta}\), comparing the conditional posterior for this proposal to the current one, and accepting or rejecting it probabilistically.

Here is a non-informative prior for the slope and intercept parameters, followed by initial values,

priorB  <- matrix(0,p,1)          # priors for reg parameters
priorVB <- diag(1000,p)           # prior covariance matrix
bg      <- beta*0 + rnorm(p)      # initial values for beta’s

Suppose the currently imputed value of \(\boldsymbol{\beta}\) is \(\boldsymbol{\beta}^g\), where \(g\) represents the \(g^{th}\) iteration of a MCMC simulation. I draw a new sample from a standard distribution having this mean value. It could be a normal distribution, with a covariance matrix that determines how the proposals might depart from the current value–steps could be large or small. I know that proposals will be efficient when they are similar to the posterior distribution. Of course, I don’t know the posterior distribution, but a place to start could be something like this,

cmat   <- .1*solve(crossprod(x))    #proposal from design

The matrix cmat would be proportional to the conditional covariance of \(\beta\) if this were a Gaussian likelihood. Again, this is just a crude approximation to get us started.

Metropolis ratio

Let bg be my current parameter vector. Here is a proposal from the multivariate normal distribution

bs <- t( .rMVN(n = 1, mu = bg, sigma = cmat) )

This random vector differs from the vector used as the mean for this density. Here it is centered on the current vector bg. I now have a current value and a new proposal bs. The idea now is to accept the proposal or not using a criterion that assures convergence of the sequence of values to the posterior distribution. That acceptance criterion is

\[ a = \frac{[\boldsymbol{\beta}^*]}{[\boldsymbol{\beta}^g]} \] where \([\boldsymbol{\beta}]\) is the density I wish to simulate using the current values in the denominator and the proposed values in the numerator. I plug this proposasl into the density I want to simulate (likelihood times prior),

\[ [\boldsymbol{\beta}^*] = \prod_{i=1}^n Bernoulli(y_i | \theta^*_i) N(\boldsymbol{\beta}^* | \mathbf{b}, \mathbf{B} ) \] where

\[ \theta^*_i = \frac{exp(\mathbf{x}'\boldsymbol{\beta}^*)}{1 + exp(\mathbf{x}'\boldsymbol{\beta}^*)} \] The distribution \([\boldsymbol{\beta}^g]\) differs only in that the indicator \(g\) (current value) is substituted for the \(^*\) in the proposal.

It is a good idea to evaluate the acceptance criterion on the log scale and then translate back to the new scale at the end,

\[ \log{a} = \log{[\boldsymbol{\beta}^*]} - \log{[\boldsymbol{\beta}^g]} \] This means that the likelihood and prior are also on the log scale. The products become sums. Here’s the log likelihood:

\[ \log{[ \mathbf{y} | \boldsymbol{\beta}^*]} = \sum_{i=1}^n\log{[y_i | \boldsymbol{\beta}^*]} \]

I can now evaluate the acceptance criterion \(a = \exp{( \log{a}) }\) and accept this proposal with probability \(min(1, a)\). Code could look like this:

# a proposal
bs <- matrix( .rMVN(1, mu = bg, sigma = cmat), ncol=1 )  

# link to theta
thetag <- invlogit(  x%*%bg )                                  
thetas <- invlogit(  x%*%bs )

# log likelihood + log prior
pnew <- dbinom( y, n, thetas, log = T) + .dMVN(bs, priorB, priorVB, log=T ) 
pnow <- dbinom( y, n, thetag, log = T) + .dMVN(bg, priorB, priorVB, log=T ) 

# accept proposal with probability min(a, 1)
a <- exp( sum(pnew - pnow) )
if( a > runif(1,0,1) )bg <- bs


The larger the acceptance value \(a\), the greater the chance of acceptance. When the proposal is more probable than the current value, I always accept (\(a > 1\)). When it is less probable, I accept with probability \(a\). To accept or not, I draw a random uniform deviate on \((0, 1)\). If it is less than \(a\), I accept the proposal. Otherwise, I retain the current value.

Here is \(\log [\boldsymbol{\beta}^*]\) for the Bernoulli likelihood and Gaussian prior distribution. I include it in a generic function that I will also use for a Poisson model:

metRatio <- function( beta, x, y, likelihood, link,
                      priorB=beta*0, priorVB=diag(1000, length(beta)) ){
  B <- T
  if(likelihood == 'dpois')B <- F
  if(link == 'log')link <- 'exp'        # invert link function
  if(link == 'logit')link <- 'invlogit'
  
  link  <- match.fun(link)
  like  <- match.fun(likelihood)
  theta <- link(x %*% beta)      
  
  if(B){
    p1 <- like(y, 1, theta, log=T)
  }else{
    p1 <- like(y, theta, log=T)
  }
  sum( p1 ) + .dMVN(beta, priorB, priorVB, log=T) # log likelihood plus log prior
}

The function metRatio returns the conditional for a value of \(\theta\). Here is a function that uses metRatio to update the coefficients as part of a Gibbs sampler:

updateBetaGLM <- function(bg, cmat, x, y, likelihood='binom',
                          link = 'logit', lo = NULL, hi=NULL, ...){
  
  # metropolis sampler for GLM reg parameters
  # lo, hi - truncation points for MVN proposals
  
  require(TruncatedNormal)
  
  ac <- 0
  if(is.null(lo)){             # proposal
    bs <- t( .rMVN(1,bg,cmat) )      
  }else{
    bs <- matrix( rtmvnorm(1, bg, cmat, lb = lo, ub = hi), ncol = 1)
  }

  pnow <- metRatio(bg, x, y, likelihood, link, ...)
  pnew <- metRatio(bs, x, y, likelihood, link, ...)
  
  a    <- exp(pnew - pnow)
  z    <- runif(1,0,1)
  if(a > z){
    bg <- bs
    ac <- 1
  }
  list(beta = bg, accept = ac)
}

Again, this is the product of likelihood and prior, using the proposed values of \(\boldsymbol{\beta}\) in the numerator and the currently imputed values in the denominator. The calls to metRatio return log probabilities for current and proposed values. The acceptance criterion a requires exponentiation. To accept the proposal or not, I draw a uniform random variate z between 0 and 1 and accept the proposal if the variate I draw is less than a. If z is not less a, the current value is retained. Here is a Metropolis sampler:

ng     <- 5000
bgibbs <- matrix(NA,ng,2)         # for regression parameters
colnames(bgibbs) <- c('b0','b1')
accept <- 0

bg <- beta*0

for(g in 1:ng){
  
  tmp <- updateBetaGLM(bg, cmat, x, y, likelihood='dbinom',
                       link='logit', priorB = priorB, priorVB = priorVB)
  bgibbs[g,] <- bg  <- tmp$beta
  accept <- accept + tmp$accept
  
  if(g %in% c(200, 500, 1000))cmat <- var(bgibbs[1:(g-1),])  #adapt proposal
}

The Gibbs sampler recovers the values used to simulate the data (colored lines), with the expected negative correlation in slope and intercept. First look at the chains:

Here is a table of values with the burnin removed: The Gibbs sampler recovers the values used to simulate the data, with the expected negative correlation in slope and intercept. First look at the chains:

chainPlot( bgibbs, trueValues = beta, burnin = 2000 )
*Chains are a Markov process, each draw conditional on the current value.*

Chains are a Markov process, each draw conditional on the current value.

Here are density plots with true values and 95% credible intervals:

chain2density( bgibbs, trueValues = beta, burnin = 2000 )
*Chains and posterior densities.*

Chains and posterior densities.

Here is a table of values with the burnin removed:

chain2tab( bgibbs[1000:ng, ] )

Here is progress of the sampler:

plot( bgibbs[,1], bgibbs[,2], type = 'l', cex = .2, col = 'grey' )
*Draws from the posterior with ellipsoids shown for post-convergence draws.*

Draws from the posterior with ellipsoids shown for post-convergence draws.

Compare this posterior correlation with the final proposal distribution cmat.

print( cov2cor(cmat) )
##            b0         b1
## b0  1.0000000 -0.9690648
## b1 -0.9690648  1.0000000

This code shows the negative correlation in the proposal distribution cmat. At iteration 200 my updated proposal distribution was changed to match the covariance in the chains. Thus, a simple adaptation scheme can adjust proposals to learn about the posterior distribution. You can see the effect of this adaptation in the proposals from the traceplot of the chains. I return to efficient proposals in the next section.

Of course, I would need to run this longer to assure that it was converged and the posterior had been thoroughly explored. To compare this result with a traditional analysis you could try this:

summary(glm(y ~ x[, 2], family = binomial))

Exercise 3. Write a Metropolis algorithm. Considering the model below, conduct the following experiment.

\[[\lambda|\mathbf{y}] \propto \prod_{i=1}^nPoi(y_i | \lambda) unif(\lambda|0, 10)\]

  1. Define \(n\) and \(\lambda\)
  2. Draw a random sample for \(\mathbf{y}\)
  3. Use a Metropolis algorithm to estimate \(\lambda\):
  4. Select an initial value for \(\lambda\)
  5. Use function .tnorm in clarkFunctions2024.r to propose a new value \(\lambda^*\) centered on the current value but that agrees with your prior
  6. Evaluate the model for current and proposed values and their ratio
  7. Draw a uniform random value on \((0, 1)\)
  8. Accept the proposal if the acceptance ratio is higher than your uniform variate
  9. embed this code in a loop
  10. Determine how your proposal distribution affects the efficiency of the algorithm

MCMC with Jags

The same model in Jags can be written as a function:

file <- "logisReg.txt" 

cat("model{
  for (i in 1:n) {
    y[i] ~ dbern(theta[i])
    logit(theta[i]) <- b0 + b1*x1[i]
  }
  b0 ~ dnorm(0, .001)  
  b1 ~ dnorm(0, .001)
}", file = file)

Again, there is a loop defining the likelihood, followed by the prior. I need to define the data and parameters:

logisData <- list(y = y, x1 = x[,2], n = length(y))
parNames  <- c('b0','b1')

parInit <- function(){
  list(b0 = rnorm(1), b1 = rnorm(1))
}

Here is the analysis:

library(rjags)
jagsfit <- jags.model(data=logisData, inits=parInit,file=file)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2000
##    Unobserved stochastic nodes: 2
##    Total graph size: 10005
## 
## Initializing model
update(jagsfit)

jagsLogit <- coda.samples(jagsfit, variable.names=c("b0","b1"), 
                       n.iter=5000)
tmp <- summary(jagsLogit)
print(tmp$statistics)
##         Mean        SD    Naive SE Time-series SE
## b0 -2.915150 0.1355237 0.001916595    0.005548203
## b1  2.929743 0.1537710 0.002174650    0.006146098

Thus, three different algorithms give similar, but not identical estimates. The two Bayesian Gibbs samplers, mine and jags, should give near identical results given a large number of iterations. So simulation studies are a good way to test the software. If I provide both Gibbs samplers a large data set I expect nearly identical answers—there will always be MCMC error, which can be reduced with longer chains. Try comparing results for samples of size n = 10000.

Efficiency

Indirect sampling is efficient when proposals are not too big and not too small. Small proposals don’t move fast enough. Big proposals are rejected too often. Proposal distributions often need to be tuned. They can be adaptive.

Because not all proposals are accepted, there will be need for some tuning of the algorithm. If the proposals are too close to the current values, they will all be accepted, but the sampler may move too slowly. If the proposals are too far from the current values, most will be rejected, and the sampler will not move. I want to keep track of the acceptance rate, which can be determined by accept/ng = 0.1216. To obtain good mixing it is helpful to have acceptance rates near 30%. When it is below that number I might want to increase the size of proposals. It also helps to draw from a proposal distribution that is similar in shape to the posterior distribution. In this example, the two parameters in \(\boldsymbol{\beta}\) are highly correlated. I tracked this covariance and used it to generate proposals from cmat.

A Poisson GLM

Here I implement a Poisson model for cone counts, now with an informative prior distribution. I want to use my knowledge of the signs of the effects of predictors, without specifying their magnitudes. The data for this example come from the Free-Air CO2 (FACE) experiment that we considered previously, looking at growth and cone production of pines grown at elevated CO2. Here is Poisson likelihood with log link function:

\[ \begin{aligned} y_i &\sim Poi(\lambda_i) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \]

An informative prior distribution

CO2 (trt), size (diam) have a positive effect on cone production, the variable cones. The effect of N (nfert) is less clear. The interaction between CO2 and N could be positive (a stronger response to N at high CO2 and vice versa). I now want a prior distribution on the main effects of CO2, diam, and N that is limited to positive values. The prior distribution can be truncated at zero by modifying the arguments to update regression parameters as follows. The analysis from LaDeau and Clark is the basis for this figure:

*(A) Cone production per tree increases with stem diameter. Fitted model (upper solid line) for elevated [CO2] trees is significantly different from the fitted ambient model (lower solid line) and bootstrapped 95% confidence intervals (dashed line). (B) Probability of reproductive maturation as a function of stem diameter. Fitted model for probability of being reproductively mature in elevated [CO2] rings is significantly different from the ambient probability and bootstrapped 95% confidence interval.*

(A) Cone production per tree increases with stem diameter. Fitted model (upper solid line) for elevated [CO2] trees is significantly different from the fitted ambient model (lower solid line) and bootstrapped 95% confidence intervals (dashed line). (B) Probability of reproductive maturation as a function of stem diameter. Fitted model for probability of being reproductively mature in elevated [CO2] rings is significantly different from the ambient probability and bootstrapped 95% confidence interval.

Here is the design matrix:

data <- repmis::source_data("https://github.com/jimclarkatduke/gjam/blob/master/FACEtrees.txt?raw=True",
                    rdata = F, header = T)
form  <- as.formula(cones ~ nfert*trt + diam)
X     <- model.matrix(form, data=data)
Y     <- model.frame(form, data=data)$cones
p     <- ncol(X)
n     <- nrow(X)

Here is the prior distribution that is truncated to positive values:

bg <- priorB <- matrix(0,p)
xnames <- colnames(X)
rownames(bg) <- xnames
priorVB <- diag(p)*1000
hi      <- bg + 100                           # big enough to not matter
lo      <- -hi
lo[xnames %in% c('nfert','trt','diam')] <- 0  # positive main effects
cmat    <- .1*solve(crossprod(X))

The function updateBetaGLM will be called for Metropolis updates. Here is the sampler with this prior that allows the proposal to adapt:

ng     <- 10000
bgibbs <- matrix(NA,ng,p)  #for regression parameters
colnames(bgibbs) <- colnames(X)
accept <- 0

cupdate <- c(200, 500, 1000, 2000)
accept  <- 0

for(g in 1:ng){
  
  tmp <- updateBetaGLM(bg, cmat, X, Y, likelihood='dpois',
                       link='log', priorB = priorB, priorVB = priorVB,
                       lo = lo, hi = hi)
  bgibbs[g,] <- bg <- tmp$beta
  accept     <- accept + tmp$accept
  
  if(g %in% cupdate){
    cmat <- .05*var(bgibbs[1:(g-1),])  #adapt proposal
    diag(cmat) <- diag(cmat)*1.001
  }
}

Here are chains so far:

chainPlot(bgibbs, burnin = 5000)
*Chains for parameter estimates.*

Chains for parameter estimates.

Try a restart, initiallizing bg from the end of this run and plot again. Here are chains:

chainPlot(bgibbs, burnin = 5000)
*Chains following restart look closer to convergence.*

Chains following restart look closer to convergence.

Exercise 4. What is the acceptance rate? How similar is the posterior covariance in parameters and the proposal distribution cmat? (Hint: turn both into correlation matrices using cov2cor. How might it be changed to improve efficiency?

With rjags

Here is a jags model:

file <- "regModel.txt" 

cat("model{
  for (i in 1:n){
    Y[i] ~ dpois(lambda[i])                  
    lambda[i] <- exp( inprod(beta[],X[i,]) ) 
  }
  beta[1] ~ dnorm(0, .1)
  beta[2] ~ dunif(0, 10)
  beta[3] ~ dunif(0, 10)
  beta[4] ~ dnorm(0, .1)
  beta[5] ~ dnorm(0, .1)
}", file = file)

The for loop says that Y[i] is Poisson distributed with mean lambda[i]. The second line defines lambda[i] as the log regression with intercept and slope in X[i,]. Then I specify prior distributions for the regression parameters.

Here’s the Gibbs sampler:

library(rjags)
outjags <- jags.model(file = file, data = list(Y=Y,n=nrow(X),X = X))
update(outjags, 200)                # burnin 

outjags <- coda.samples(outjags, variable.names="beta", 
        n.iter=2000)
tmp <- summary(outjags)
print(tmp$statistics)

traceplot(outjags)

Non-symmetric proposals

I’ve tended to use normal or truncated normal distributions for Metropolis proposals, because they are symmetric and, thus, are not biased left or right. Asymmetric proposals are fine, but they require a correction for the fact that they would otherwise pull the outcome one direction or the other. For example, if a parameter is only defined on the positive real line, then I cannot propose negative values. I could propose from an asymmetric distribution to avoid this problem, but it is typically expedient to simply truncate a normal at zero.

Significant factor levels

Unless we look at specific contrasts, coefficients for factors simply add to the reference factor level. ‘Significant’ factor levels are taken in comparison to the reference level. This makes sense if I view one level as a null hypothesis that I want to compare other levels to. It does not make sense if there is no such reference level. For example, we would typically not think of a standard reference class for soil type, land cover, gender, and so on. Where there is no such reference, the significant statement makes no sense.

In a classical setting, one can explore different types of contrasts by redistributing the coefficients. This can get complicated, but it is definitely possible.

In a Bayesian framework, were P values are not the goal, all comparisons can be done directly with the posterior distribution. There is a marginal distribution for each coefficient. Their degree of overlap is easy to extract.

Interpreting interactions

It is easy to get confused by interactions. What does it mean if the interaction term is positive or negative? Consider a regression model with an interaction between predictors \(a\) and \(b\),

\[ y = \dots + \beta_a x_{a} + \beta_b x_{b} + \beta_{ab} x_{a} x_{b} + \dots \] The interaction term contains a product between \(a\) and \(b\). It might be surprising to hear that the easiest way to see through an interaction is to simply differentiate it. For example, how does \(b\) change the effect of \(a\)? Here is a derivative,

\[ \frac{\partial}{\partial x_a} = \beta_a + \beta_{ab} x_{b} \]

It is now clear that an interaction amplifies a main effect when the interaction coefficient has the same sign as the main effect. If \(\beta_a\) is positive, then positive \(\beta_{ab}\) increases the response (more positive) when \(x_{b}\) is large. Likewise, if the main effect is negative, then a negative interaction amplifies the response in the negative direction.

An interaction has a buffering effect when main effects and interaction differ in sign.

It is important to note that the interaction can buffer the response to \(a\) while amplifying the response to \(b\). Here is the other derivative,

\[ \frac{\partial}{\partial x_b} = \beta_b + \beta_{ab} x_{a} \]

If main effects have opposite signs (e.g., negative \(\beta_a\) and positive \(\beta_b\)), then a positive interaction (\(\beta_{ab}\)) means that \(x_b\) buffers the response to \(x_a\), while \(x_a\) amplifies the response to \(x_b\).

One more point: Recall that the estimate of a coefficient depends on the distribution (e.g., the range) of variation in the predictor. Because it is a product, the estimate of an interaction depends on variation in both predictors. It therefore is sensitive to their covariance in the data.