Unlike traditional methods that rely on optimization to find a ‘best’ estimate, Bayesian analysis results in a posterior distribution that often must be simulated. Markov chain Monte Carlo (MCMC) is used to draw samples from the posterior distribution, which can then be summarized in many ways. We discuss direct methods (Gibbs sampling) and indirect methods (Metropolis) of sampling the posterior using conditional distributions taken from the model graph. We generate our own MCMC algorithms, and we implement rjags.
Trends in wolves and moose on Isle Royale since 1980 have led to inbreeding depression.
Resources
Files on Sakai
Resources/data/FACEtrees.txtResources/code/clarkFunctions2021.r
Software
- packages:
TruncatedNormal
Readings
Rising CO2 Levels and the Fecundity of Forest Trees, LaDeau and Clark on the direct and indirect effect of high CO2 on tree reproduction
High CO2 Levels May Give Fast-Growing Trees an Edge, commentary
Objectives
- articulate the concept of a Markov chain
- determine when posterior simulation is needed
- translate the product of prior and likelihood to conditional distributions
- understand conjugacy and the distinction between direct and indirect sampling
- implement a basic Metropolis sampler on your own, and with jags
- incorporate into algorithms the concepts of prior weight and prior dependence
A few preliminaries
I begin by summarizing/reviewing a few concepts that will come up in this vignette.
Multivariate normal prior on coefficients
I will use a multivariate normal prior distribution for the coefficient vector. Recall the notation
\[ MVN_q \left(\boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) \right) \] refers to a multivariate normal density for a length-\(q\) vector \(\boldsymbol{\beta}\) having a length-\(q\) mean vector \(\mathbf{b}\) and \(q \times q\) covariance matrix \(\mathbf{B}\).
I will use the function .dMVN to obtain the density for a random vector and the function .rMVN to drawn m random vectors. Here is random sample of size \(m\) random vectors of length \(q = 2\),
Look at the mean vector b, the covariance matrix B, and the random sample z. The covariance in this random sample is not exactly the one I used to generate it,
## [,1] [,2]
## [1,] 0.8118501 0.3948995
## [2,] 0.3948995 0.9969448
Compare this covariance to B and understand why they are not the same. I can find the probability associated with each random vector this way:
mmat <- matrix( b[,rep(1, m)], nrow=m, ncol=q) # repeat mean vector for each z
dz <- .dMVN(z, mmat , B)
plot( z[,1], z[,2], cex=.01 )
abline(h=0, lty=2, col='grey')
abline(v=0, lty=2, col='grey')
text( z[, 1], z[, 2], round( dz, 3) )Random vectors far from the mean have lower probability density.
Note that there is one probability density value assigned to each \(q\)-length vector. From the few values I have plotted here note how these values decline with distance from the mean of the distribution.
I will use this MVN prior in the Metropolis sampler that follows.
Significant factor levels
Unless we look at specific contrasts, coefficients for factors simply add to the reference factor level. ‘Significant’ factor levels are taken in comparison to the reference level. This makes sense if I view one level as a null hypothesis that I want to compare other levels to. It does not make sense if there is no such reference level. For example, we would typically not think of a standard reference class for soil type, land cover, gender, and so on. Where there is no such reference, the significant statement makes no sense.
In a classical setting, one can explore different types of contrasts by redistributing the coefficients. This is can get complicated, but it is definitely possible.
In a Bayesian framework, were P values are not the goal, all comparisons can be done directly with the posterior distribution. There is a marginal distribution for each coefficient. Their degree of overlap is easy to extract.
Interpreting interactions
It is easy to get confused by interactions. What does it mean if the interaction term is positive or negative? Consider a regression model with an interaction between predictors \(a\) and \(b\),
\[ y = \dots + \beta_a x_{a} + \beta_b x_{b} + \beta_{ab} x_{a} x_{b} + \dots \] The interaction term contains a product between \(a\) and \(b\). It might be surprising to hear that the easiest way to see through an interaction is to simply differentiate it. For example, how does \(b\) change the effect of \(a\)? Here is a derivative,
\[ \frac{\partial}{\partial x_a} = \beta_a + \beta_{ab} x_{b} \]
It is now clear that an interaction amplifies a main effect when the interaction coefficient has the same sign as the main effect. If \(\beta_a\) is positive, then positive \(\beta_{ab}\) increases the response (more positive) when \(x_{b}\) is large. Likewise, if the main effect is negative, then a negative interaction amplifies the response in the negative direction.
An interaction has a buffering effect when main effects and interaction differ in sign.
It is important to note that the interaction can buffer the response to \(a\) while amplifying the response to \(b\). Here is the other derivative,
\[ \frac{\partial}{\partial x_b} = \beta_b + \beta_{ab} x_{a} \]
If main effects have opposite signs (e.g., negative \(\beta_a\) and positive \(\beta_b\)), then a positive interaction (\(\beta_{ab}\)) means that \(x_b\) buffers the response to \(x_a\), while \(x_a\) amplifies the response to \(x_b\).
One more point: Recall that the estimate of a coefficient depends on the distribution (e.g., the range) of variation in the predictor. Because it is a product, the estimate of an interaction depends on variation in both predictors. It therefore is sensitive to their covariance in the data.
Simulate a posterior distribution
The broad class of methods based on stochastic simulation techniques are referred to as Markov chain Monte Carlo (MCMC). The Markov chain part refers to the fact that each draw can be conditional on the current state. That is the definition of a Markov process.
For most models it is not possible to solve for the posterior distribution. Recall that the numerator is a product of likelihood and prior distribution. The denominator contains an integral,
\[[\theta | \mathbf{y}] = \frac{[\mathbf{y} | \theta][\theta]}{\int [\mathbf{y} | \theta][\theta] d\theta} = \frac{[\mathbf{y} | \theta][\theta]}{[\mathbf{y}]}\]
The denominator is a normalizing constant for this distribution. It is the marginal distribution of \(\mathbf{y}\). It is a distribution, but I refer to is as a ‘constant’, because it is constant with respect to \(\theta\). Even if I cannot solve this integral, I may still be able to draw samples from it. If so, I can normalize by dividing by the number of samples I draw,
\[[\theta | \mathbf{y}] \approx \frac{1}{G} \sum_g^G [\theta_g | \mathbf{y}]\]
where \([\theta_g | \mathbf{y}]\) is a sample from the distribution \([\theta | \mathbf{y}]\). Importantly, this can be an unnormalized version of the posterior. The point here is that I can often draw samples from a distribution solely based on knowledge of its shape, which depends only on factors containing \(\theta\). It is normalized when I divide by \(G\). This normalized sample is an approximation to the posterior distribution.
Consider the example of a binomial distribution,
\[ binom(y | n, \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \] The leading binomial coefficient ‘normalizes’ this distribution; it does not contain \(\theta\). If I did not know this normalizer, I could draw repeatedly (say \(G\)) samples of size-\(n\) samples from \(Bernoulli( \theta )\).
Here is a simple example in R implemented as an independence sampler (not a Markov chain):
par(bty='n')
theta <- .4
G <- 500
n <- 10
yv <- 0:n # the sample space is the inteegers from 0 to n
yg <- rep(0, n+1) # vector to hold draws
for(g in 1:G){
y <- 0
for(i in 1:n){ # n trials
y <- y + rbinom(1, 1, theta) # a Bernoulli trial (yes, this is crazy)
}
yg[y+1] <- yg[y+1] + 1
}
py <- yg/G # divide by the number of samples drawn
plot( yv, cumsum(py), type = 's', ylim=c(0,1), xlab = 'y in n trials', ylab = 'Probability' )
segments( yv, yv*0, yv, py, lwd=4 )A stupid algorithm to simulate the binomial by drawing samples.
The thing to point out here is that I never normalized these Bernoulli trials with the binomial coefficient. Instead I just (stupidly) drew samples and divided by the number I drew. This technique works even when the denominator (normalizer) involves integration.
Markov chain Monte Carlo (MCMC) is a simulation technique that can be used to draw samples from a posterior distribution where each update is conditionally dependent only on the current stage. [Independence sampling is a special case where the sample does not depend on the current state; it is not a Markov proccess.] Because the sampler lacks memory of past states, the sampler can be relatively simple.
Gibbs sampling
Gibbs sampling is a MCMC technique that allows sampling from a posterior disribution that contains multiple parameters, each by conditioning on the current values of the others.
More often than not, the posterior distribution is multivariate–there is more than one parameter in most models. When this is the case I can draw from conditional distributions sequentially. Consider the posterior distribution for two parameters having initial values \(\left[\theta^{(0)}_1,\theta^{(0)}_2 \right]\). The superscript \((0)\) in this notation means these are the initial values. I want to draw a new vector, which can start from this initial state, i.e., \(\left[\theta^{(1)}_1,\theta^{(1)}_2 | \theta^{(0)}_1,\theta^{(0)}_2, \mathbf{y} \right]\), the algorithm could look like this:
- draw \(\left[\theta^{(1)}_1 | \theta^{(0)}_2, \mathbf{y} \right]\)
- draw \(\left[\theta^{(1)}_2 | \theta^{(1)}_1, \mathbf{y} \right]\)
Gibbs sampling is a MCMC technique for simulating a joint posterior, based on sequential draws from conditional posteriors. In some cases, I can sample directly from the conditionals, as when they result from conjugate prior/likelihood combinations. In other cases I can embed within the Gibbs sampler other types of sampling techniques. I implemented this scheme in the previous unit for regression, before introducing this basic notion of Gibbs sampling. Here again is a simple normal distribution described from the perspective of Gibbs sampling.
I previously estimated the mean and variance of the normal distribution from the model
\[N(\mathbf{y}|\mu,\sigma^2) N(\mu | m, M) IG(\sigma^2 | s_1, s_2)\] I wrote down the conditional distribution for mean given variance,
\[[\mu | \sigma^2, \mathbf{y}] = N(\mu | Vv, V)\]
where
\[ \begin{aligned} V &= \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1} \\ v &= \frac{n \bar{y}}{\sigma^2} + \frac{m}{M} \end{aligned} \] This distribution for \(\mu\) depends on \(\sigma^2\).
I can also write down the conditional distribution for the variance given the mean,
\[[\sigma^2 | \mu, \mathbf{y}] = IG \left( s_1 + \frac{n}{2}, s_2 + \frac{1}{2} \sum_i (y_i - \mu)^2 \right)\] I now have two conditional distributions, \([\mu | \sigma^2, \mathbf{y}]\) and \([\sigma^2 | \mu, \mathbf{y}]\). I can use Gibbs sampling to obtain the joint posterior distribution \([\mu, \sigma^2 | \mathbf{y}]\). Here are functions to draw samples from the two conditional distributions:
getMu <- function(sigma, y, M, m){ # conditional distributions
n <- length(y)
V <- 1/( n/sigma + 1/M )
v <- sum(y)/sigma + m/M
rnorm(1, V*v, sqrt(V))
}
getSigma <- function(mu, y, s1, s2){
n <- length(y)
u1 <- s1 + n/2
u2 <- s2 + 1/2*sum( (y - mu)^2 )
1/rgamma(1, u1, u2)
}Here are a few samples (\(G = 10\)) in a loop, showing the progress of the sampler:
par(bty='n')
n <- 10 # simulate data
s <- 1
mu <- -2
y <- rnorm(n, mu, sqrt(s))
m <- 0 # prior parameter values
M <- 1
s1 <- s2 <- 1
mg <- 0
sg <- 2
G <- 10
plot(mg, sg, xlim= mg + c(-2.5,2.5) ,ylim = sg + c(-2.5,2.5),
xlab='mean', ylab='variance')
# one step at a time
for(g in 1:G){
mn <- getMu(sg, y, M, m); lines(c(mg,mn),c(sg,sg),col='blue')
sn <- getSigma(mn, y, s1, s2); lines(c(mn,mn),c(sg,sn),col='green')
text(mn,sn,g)
mg <- mn
sg <- sn
readline('return to continue')
}A few Gibbs steps for the Gaussian mean and variance.
## return to continue
## return to continue
## return to continue
## return to continue
## return to continue
## return to continue
## return to continue
## return to continue
## return to continue
## return to continue
Note that each draw consists of a horizontal (blue) line \(\mu|\sigma^2\) and a vertical (green) line \(\sigma^2|\mu\). Run this multiple times to get a feel for Gibbs sampling.
Here is a larger sample, where I save the chains of estimates and plot them along side a scatter plot of the values:
par(mfcol=c(2,2),mar=c(4,4,1,1),bty='n')
G <- 1000
chains <- matrix(NA,G,2)
for(g in 1:G){
mg <- getMu(sg, y, M, m)
sg <- getSigma(mg, y, s1, s2)
chains[g,] <- c(mg, sg)
}
qc <- apply(chains,2,quantile,c(.025,.975))
ml <- 'Mean parameter'
vl <- 'Variance parameter'
plot(chains[,1],chains[,2],cex=.3, xlab='', ylab=vl, log='y')
abline(v=qc[,1],lty=2); abline(h=qc[,2],lty=2)
plot(chains[,1],c(1:G),type='l', xlab=ml, ylab='Index')
abline(v=qc[,1],lty=2)
plot(chains[,2],type='l', ylab='', log='y')
abline(h=qc[,2],lty=2)Chains for mean and variance with 95% credible intervals.
The MCMC chains in these plots mix well; there is no sense of a long-term trend in the sampler. The scatter plot approximates the posterior distribution.
Exercise 1. Summarize the estimates as posterior mean colMeans(chains), median apply(chains,2,median), credible interval apply(chains,2,quantile,c(.025,.975)), and parameter correlation cor(chains).
Metropolis sampler
Metropolis sampling is a technique that allows indirect sampling from a distribution, by proposing values and accepting or rejecting them based on their relationship to the current value and to the distribution that is being sampled.
In the previous example, the conditional posteriors for parameters could be sampled directly from a product of normal likelihood \(\times\) normal prior for the mean and normal likelihood \(\times\) inverse gamma prior for the variance. I could do this because there are solutions for the two conditional posterior distributions. For a GLM, I cannot solve for the conditional posterior and thus, I cannot directly sample them. The Metropolis algorithm allows me to sample indirectly. By ‘indirect’, I mean that I will propose a random value from a standard distribution and then accept or reject that value based on a comparison involving the current value. I introduce the Metropolis algorithm in the context of a GLM example.
Application to logistic regression
Recall that a GLM combines a non-Gaussian sampling distribution with a link function and a linear function of parameters.
Basic model
Here is a GLM for binary outcomes using a Bernoulli distribution with a logit link,
\[ \begin{aligned} y_i &\sim Bernoulli(\theta_i) \\ logit(\theta_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \]
The Bernoulli distribution is
\[ Bernoulli(y_i | \theta_i) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i} \]
The logit function, which is linear for the log odds ratio, is the link function for this example,
\[ logit(\theta_i) = \ln \left(\frac{\theta_i}{1 - \theta_i} \right) = \mathbf{x}_i'\boldsymbol{\beta} \]
For the Bernoulli distribution, the expected value is \(E[y_i] = \theta_i\). To invert the link function, I solve for \(\theta_i\),
\[ \theta_i = \frac{exp(\mathbf{x}'_i\boldsymbol{\beta})}{1 + exp(\mathbf{x}'_i\boldsymbol{\beta})} \]
Functions for the logit and inverse logit are contained in clarkFunctions2021.R. Here is a simulated data set for this GLM,
n <- 500
p <- 2
x <- matrix(rnorm(n*p),n,p) # design matrix
x[,1] <- 1 # intercept
beta <- matrix(c(-3,3),p,1) # parameter vector
ltg <- x %*% beta # logit theta
tg <- invlogit(ltg) # initial values for theta
y <- rbinom(n,1,tg) # dataTo see what this looks like, plot the values of y and tg against the predictor variable x[,2]. The values of y are jittered and shown in blue,
par(bty='n')
plot(x[,2],jitter(y),col='blue', xlab='x_2', ylab='y, theta', cex=.2)
points(x[,2], tg, cex = .2)
abline(h=c(0,1),col='grey',lty=2)Logistic regression with simulated zeros and ones (blue).
The black dots show values for \(\theta\), and blue dots show simulated data \(\mathbf{y}\).
Prior distribution
I cannot specify a prior distribution for \(\boldsymbol{\beta}\) that will allow me to draw samples directly–there is no conjugate prior for regression coefficients. The Metropolis algorithm provides a way to sample from a distribution based on proposals.
The Metropolis algorithm is a Markov chain Monte Carlo (MCMC) technique, where each new sample is conditional on the previous one. I begin by proposing new values for \(\boldsymbol{\beta}\), comparing the conditional posterior for this proposal to the current one, and accepting or rejecting it probabilistically.
Here is a non-informative prior for the slope and intercept parameters, followed by initial values,
priorB <- matrix(0,p,1) # priors for reg parameters
priorVB <- diag(1000,p) # prior covariance matrix
bg <- beta*0 + rnorm(p) # initial values for beta’sSuppose the currently imputed value of \(\boldsymbol{\beta}\) is \(\boldsymbol{\beta}^g\), where \(g\) represents the \(g^{th}\) iteration of a MCMC simulation. I draw a new sample from a standard distribution having this mean value. It could be a normal distribution, with a covariance matrix that determines how the proposals might depart from the current value–steps could be large or small. I know that proposals will be efficient when they are similar to the posterior distribution. Of course, I don’t know the posterior distribution, but a crude approximation could be something like this,
The matrix cmat would be proportional to the conditional covariance of \(\beta\) if this were a Gaussian likelihood. This is just a crude approximation to get us started.
Metropolis ratio
Let bg be my current parameter vector. Here is a proposal from the multivariate normal distribution
This random vector differs from the vector used as the mean for this density, in this case the current value bg. I now have a current value and a new proposal bs. The idea now is to accept the proposal or not using a criterion that assures convergence of the sequence of values to the posterior distribution. That acceptance criterion is
\[ a = \frac{[\boldsymbol{\beta}^*]}{[\boldsymbol{\beta}^g]} \] where \([\boldsymbol{\beta}]\) is the density I wish to simulate using the current values in the denominator and the proposed values in the numerator. In this case, that density is
\[ [\boldsymbol{\beta}^*] = \prod_{i=1}^n Bernoulli(y_i | \theta^*_i) N(\boldsymbol{\beta}^* | \mathbf{b}, \mathbf{B} ) \] where
\[ \theta^*_i = \frac{exp(\mathbf{x}'\boldsymbol{\beta}^*)}{1 + exp(\mathbf{x}'\boldsymbol{\beta}^*)} \] The distribution \([\boldsymbol{\beta}^g]\) differs only in that the indicator \(g\) (current value) is substituted for the \(^*\) in the proposal.
# a proposal
bs <- matrix( .rMVN(1, mu = bg, sigma = cmat), ncol=1)
# link to theta
thetag <- invlogit( x%*%bg )
thetas <- invlogit( x%*%bs )
# log likelihood + log prior
pnew <- dbinom( y, n, thetas, log = T) + .dMVN(bs, priorB, priorVB, log=T)
pnow <- dbinom( y, n, thetag, log = T) + .dMVN(bg, priorB, priorVB, log=T)
# accept proposal with probability min(a, 1)
a <- exp( sum(pnew - pnow) )
if( a > runif(1,0,1) )bg <- bsI accept the proposal with probability \(min(a, 1)\). In other words, the larger the probability assigned to the proposal, the greater the chance of acceptance. When the proposal is more probable than the current value, I always accept. When it is less probable, I accept with probability \(a\).
To accept or not, I draw a random uniform deviate on \((0, 1)\). If it is less than \(a\), I accept the proposal. Otherwise, I retain the current value.
Here is \(\log [\boldsymbol{\beta}^*]\) for the Bernoulli likelihood and Gaussian prior distribution. I include it in a generic function that I will also use for a Poisson model:
metRatio <- function(beta, x, y, likelihood, link,
priorB=beta*0, priorVB=diag(1000,length(beta))){
B <- T
if(likelihood == 'dpois')B <- F
if(link == 'log')link <- 'exp' # invert link function
if(link == 'logit')link <- 'invlogit'
link <- match.fun(link)
like <- match.fun(likelihood)
theta <- link(x %*% beta)
if(B){
p1 <- like(y, 1, theta, log=T)
}else{
p1 <- like(y, theta, log=T)
}
sum( p1 ) + .dMVN(beta, priorB, priorVB, log=T) # log likelihood plus log prior
}The function metRatio returns the conditional for a value of \(\theta\). Here is a function that uses metRatio to update the coefficients as part of a Gibbs sampler:
updateBetaGLM <- function(bg, cmat, x, y, likelihood='binom',
link = 'logit', lo = NULL, hi=NULL, ...){
# metropolis sampler for GLM reg parameters
# lo, hi - truncation points for MVN proposals
require(TruncatedNormal)
ac <- 0
if(is.null(lo)){ # proposal
bs <- t( .rMVN(1,bg,cmat) )
}else{
bs <- matrix( rtmvnorm(1, bg, cmat, lb = lo, ub = hi), ncol = 1)
}
pnow <- metRatio(bg, x, y, likelihood, link, ...)
pnew <- metRatio(bs, x, y, likelihood, link, ...)
a <- exp(pnew - pnow)
z <- runif(1,0,1)
if(a > z){
bg <- bs
ac <- 1
}
list(beta = bg, accept = ac)
}Again, this is the product of likelihood and prior, using the proposed values of \(\boldsymbol{\beta}\) in the numerator and the currently imputed values in the denominator. The calls to metRatio return log probabilities for current and proposed values. The acceptance criterion a requires exponentiation. To accept the proposal or not, I draw a uniform random variate z between 0 and 1 and accept the proposal if the variate I draw is less than a. If z is not less a, the current value is retained. Here is a Metropolis sampler:
ng <- 5000
bgibbs <- matrix(NA,ng,2) # for regression parameters
colnames(bgibbs) <- c('b0','b1')
accept <- 0
bg <- beta*0
for(g in 1:ng){
tmp <- updateBetaGLM(bg, cmat, x, y, likelihood='dbinom',
link='logit', priorB = priorB, priorVB = priorVB)
bgibbs[g,] <- bg <- tmp$beta
accept <- accept + tmp$accept
if(g %in% c(200, 500, 1000))cmat <- var(bgibbs[1:(g-1),]) #adapt proposal
}## Loading required package: TruncatedNormal
The Gibbs sampler recovers the values used to simulate the data, with the expected negative correlation in slope and intercept. First look at the chains:
Here is a table of values with the burnin removed: The Gibbs sampler recovers the values used to simulate the data, with the expected negative correlation in slope and intercept. First look at the chains:
Chains and posterior densities.
## $summary
## estimate se 0.025 0.975 true value
## b0 -2.836 0.5806 -3.4900 -0.4435 -3
## b1 2.824 0.5958 0.4051 3.5440 3
Here is a table of values with the burnin removed:
Posterior densities without burnin.
## $summary
## estimate se 0.025 0.975 true value
## b0 -2.943 0.2780 -3.500 -2.445 -3
## b1 2.944 0.2966 2.416 3.572 3
Of course, I would need to run this longer to assure that it was converged and the posterior had been thoroughly explored. To compare this result with a traditional analysis you could try this:
Exercise 2. Considering the model below, conduct the following experiment.
\[[\lambda|\mathbf{y}] \propto \prod_{i=1}^nPoi(y_i | \lambda) unif(\lambda|0, 10)\]
- Define \(n\) and \(\lambda\)
- Draw a random sample for \(\mathbf{y}\)
- Use a Metropolis algorithm to estimate \(\lambda\):
- Select an initial value for \(\lambda\)
-
Use function
.tnorminclarkFunctions2021.rto propose a new value \(\lambda^*\) centered on the current value but that agrees with your prior - Evaluate the model for current and proposed values and their ratio
- Draw a uniform random value on \((0, 1)\)
- Accept the proposal if the acceptance ratio is higher than your uniform variate
- mbed this code in a loop
-
Determine how your proposal distribution affects the efficiency of the algorithm
Efficiency
Indirect sampling is efficient when proposals are not too big and not too small. Small proposals don’t move fast enough. Big proposals are rejected too often. Proposal distributions often need to be tuned. They can be adaptive.
Because not all proposals are accepted, there will be need for some tuning of the algorithm. If the proposals are too close to the current values, they will all be accepted, but the sampler may move too slowly. If the proposals are too far from the current values, most will be rejected, and the sampler will not move. I want to keep track of the acceptance rate, which can be determined by accept/ng = 0.2978. To obtain good mixing it is helpful to have acceptance rates near 30%. It also helps to draw from a proposal distribution that is similar in shape to the posterior distribution. In this example, the two parameters in \(\boldsymbol{\beta}\) are highly correlated.
Parameter correlation
## b0 b1
## b0 1.0000000 -0.9499259
## b1 -0.9499259 1.0000000
Compare this posterior correlation with the final proposal distribution cmat.
## b0 b1
## b0 1.0000000 -0.9882432
## b1 -0.9882432 1.0000000
This code shows the negative correlation in the proposal distribution cmat. At iteration 200 my updated proposal distribution was changed to match the covariance in the chains. Thus, a simple adaptation scheme can adjust proposals to learn about the posterior distribution. You can see the effect of this adaptation in the proposals from the traceplot of the chains.
With Jags
The same model in Jags can be written as a function:
file <- "logisReg.txt"
cat("model{
for (i in 1:n) {
y[i] ~ dbern(theta[i])
logit(theta[i]) <- b0 + b1*x1[i]
}
b0 ~ dnorm(0, .001)
b1 ~ dnorm(0, .001)
}", file = file)Again, there is a loop defining the likelihood, followed by the prior. I need to define the data and parameters:
logisData <- list(y = y, x1 = x[,2], n = length(y))
parNames <- c('b0','b1')
parInit <- function(){
list(b0 = rnorm(1), b1 = rnorm(1))
}Here is the analysis:
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 500
## Unobserved stochastic nodes: 2
## Total graph size: 2505
##
## Initializing model
update(jagsfit)
jagsLogit <- coda.samples(jagsfit, variable.names=c("b0","b1"),
n.iter=5000)
tmp <- summary(jagsLogit)
print(tmp$statistics)## Mean SD Naive SE Time-series SE
## b0 -2.917671 0.2841704 0.004018776 0.01111048
## b1 2.924935 0.3080867 0.004357004 0.01223939
Thus, three different algorithms give similar, but not identical estimates. The two Bayesian Gibbs samplers, mine and jags, should give near identical results given a large number of iterations. So simulation studies are a good way to test the software. If I provide both Gibbs samplers a large data set I expect nearly identical answers—there will always be MCMC error, which can be reduced with longer chains. Try comparing results for samples of size n = 10000.
A Poisson GLM
Here I implement a Poisson model for cone counts, now with an informative prior distribution. I want to use my knowledge of the signs of the effects of predictors, without specifying their magnitudes. The data for this example come from the Free-Air CO2 (FACE) experiment that we considered previously, looking at growth and cone production of pines grown at elevated CO2. Here is Poisson likelihood with log link function:
\[ \begin{aligned} y_i &\sim Poi(\lambda_i) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \]
An informative prior distribution
CO2 (trt), size (diam) have a positive effect on cone production, the variable cones. The effect of N (nfert) is less clear. The interaction between CO2 and N could be positive (a stronger response to N at high CO2 and vice versa). I now want a prior distribution on the main effects of CO2, diam, and N that is limited to positive values. The prior distribution can be truncated at zero by modifying the arguments to update regression parameters as follows. The analysis from LaDeau and Clark is the basis for this figure:
(A) Cone production per tree increases with stem diameter. Fitted model (upper solid line) for elevated [CO2] trees is significantly different from the fitted ambient model (lower solid line) and bootstrapped 95% confidence intervals (dashed line). (B) Probability of reproductive maturation as a function of stem diameter. Fitted model for probability of being reproductively mature in elevated [CO2] rings is significantly different from the ambient probability and bootstrapped 95% confidence interval.
Here is the design matrix:
data <- read.table('FACEtrees.txt',header=T)
form <- as.formula(cones ~ nfert*trt + diam)
X <- model.matrix(form, data=data)
Y <- model.frame(form, data=data)$cones
p <- ncol(X)
n <- nrow(X)Here is the prior distribution that is truncated to positive values:
bg <- priorB <- matrix(0,p)
xnames <- colnames(X)
rownames(bg) <- xnames
priorVB <- diag(p)*1000
hi <- bg + 100 # big enough to not matter
lo <- -hi
lo[xnames %in% c('nfert','trt','diam')] <- 0 # positive main effects
cmat <- .1*solve(crossprod(X))Note the vectors lo and hi for truncation. The function updateBetaGLM will be called for Metropolis updates. Here is the sampler with this prior:
ng <- 10000
bgibbs <- matrix(NA,ng,p) #for regression parameters
colnames(bgibbs) <- colnames(X)
accept <- 0
cupdate <- c(200, 500, 1000, 2000)
accept <- 0
for(g in 1:ng){
tmp <- updateBetaGLM(bg, cmat, X, Y, likelihood='dpois',
link='log', priorB = priorB, priorVB = priorVB,
lo = lo, hi = hi)
bgibbs[g,] <- bg <- tmp$beta
accept <- accept + tmp$accept
if(g %in% cupdate){
cmat <- .1*var(bgibbs[1:(g-1),]) #adapt proposal
diag(cmat) <- diag(cmat)*1.001
}
}Here are chains so far:
## $summary
## estimate se 0.025 0.975
## (Intercept) -4.4670 0.68110 -4.7270 -2.32600
## nfert 0.2901 0.10740 0.1482 0.68640
## trt 0.5430 0.09080 0.3164 0.66350
## diam 0.2179 0.03172 0.1311 0.23040
## nfert:trt -0.2198 0.08536 -0.4077 -0.06916
Chains and posterior densities.
Try a restart and plot again:
## $summary
## estimate se 0.025 0.975
## (Intercept) -4.6370 0.04835 -4.7270 -4.53300
## nfert 0.2511 0.04039 0.1482 0.31710
## trt 0.5156 0.01664 0.4831 0.54830
## diam 0.2267 0.00198 0.2224 0.23060
## nfert:trt -0.1949 0.04407 -0.2639 -0.07923
Chains and posterior densities.
Exercise 3. Compare this model with one not truncated at zero. Determine a) the interaction term and b) the effect of the prior distribution on main effects.
With rjags
Here is a jags model:
file <- "regModel.txt"
cat("model{
for (i in 1:n){
Y[i] ~ dpois(lambda[i])
lambda[i] <- exp( inprod(beta[],X[i,]) )
}
beta[1] ~ dnorm(0, .1)
beta[2] ~ dunif(0, 10)
beta[3] ~ dunif(0, 10)
beta[4] ~ dnorm(0, .1)
beta[5] ~ dnorm(0, .1)
}", file = file)The for loop says that Y[i] is Poisson distributed with mean lambda[i]. The second line defines lambda[i] as the log regression with intercept and slope in X[i,]. Then I specify prior distributions for the regression parameters.
Here’s the Gibbs sampler:
Non-symmetric proposals
I’ve tended to use normal or truncated normal distributions for Metropolis proposals, because they are symmetric and, thus, are not biased left or right. Asymmetric proposals are fine, but they require a correction for the fact that they would otherwise pull the outcome one direction or the other. For example, if a parameter is only defined on the positive real line, then I cannot propose negative values. I could propose from an assymetric distribution to avoid this problem, but it is typically expedient to simply truncate a normal at zero.
Prior dependence: inbreeding from allele frequencies
It can be the case that the allowable range for one parameter depends on the value of a second parameter. This prior dependence can be incorporated into the proposals.
Based on counts of genotypes for two alleles at one locus, I want to estimate the frequency of each allele, which depends on inbreeding. An individual can be homozygous for either allele, or it can be heterozygous, having one copy of each. The greater the degree of inbreeding the lower the fraction of heterozygotes. An example of isolation effects on inbreeding for Isle Royale wolves is shown in the figure from Robinson et al. 2019.
A, Inbred Isle Royale wolves contain fewer heterozygotes and B, more homozygotes than Minnesota wolves, for both damaging and benign SNPs. C shows that the total number of derived alleles per individual is unaffected by recent inbreeding. D to G show two-dimensional allele frequency spectra showing the correlation in derived allele frequencies (AF) between outbred North American wolves and Isle Royale or Minnesota wolves for variants present within Isle Royale or Minnesota wolves. Derived nonsynonymous variants occur at low frequencies due to negative selection in large outbred populations, (D and F) but nonsynonymous variants segregating in Isle Royale wolves occur at much higher frequencies due to drift and high relatedness among individuals (E and G). This effect is apparent for putatively damaging and benign variants. The dashed line represents the diagonal y = x, and the solid line represents the regression line.
In this example, I use a Metropolis step to estimate gene frequency and the inbreeding coefficient, employing multiple chains. First, recall that I have tended to specify a prior distribution that is independent for multiple parameters, e.g., \([\theta_1, \theta_2]\) is given a prior \([\theta_1][\theta_2]\). This prior does not mean that I think that parameter estimates are independent. What it really means is that I might have no idea what the prior dependence should be or, if I did know, I would not know how to specify it. When my prior distribution is independent, it just means that their posterior relationship is coming from the data, i.e., the likelihood.
In the case of inbreeding and allele frequency, I cannot avoid specifying prior dependence, because prior independence would give impossible values–the range of possible allele frequencies depends on inbreeding, and vice versa. This example demonstrates prior dependence in parameter values. I know how to write \([\theta_1, \theta_2]\) (jointly) only if I factor it into a conditional and marginal, \([\theta_1| \theta_2][\theta_2]\).
Genotypes shed light on inbreeding, because heterozygotes decline as inbreeding increases.
For gene frequency and the inbreeding coefficient I use the prior distribution to specify a conditional relationship between two parameters, the inbreeding coefficient \(f\) and the frequency of allele \(a\), which is a parameter \(p\),
\[[f, p] = [f | p][p]\] I can impose this conditional relationship at the proposal stage of the Metropolis step. First, just a tiny review of genotypes for two alleles at one locus.
With random mating (\(f = 0\)), the expected frequency for genotypes is
\[\begin{aligned} P_{aa} &= p^2 \\ P_{ab} + P_{ba} &= 2p(1 - p) \\ P_{bb} &= (1 - p)^2 \end{aligned}\]
With inbreeding (\(f \neq 0\)), the frequencies become
\[\begin{aligned} P_{aa} &= p^2 + fp(1 - p) \\ P_{ab} &= 2p(1 - p)(1 - f) \\ P_{bb} &= (1 - p)^2 + fp(1 - p) \end{aligned}\]
Notice that both homozygotes have increased due to the second term that includes inbreeding \(f\). At the same time, the factor \((1 - f)\) reduces the heterozygote fraction. With complete inbreeding (\(f = 1\)) heteozygotes disappear completely,
\[\begin{aligned} P_{aa} &= p^2 + p(1 - p) \\ P_{ab} &= 0 \\ P_{bb} &= (1 - p)^2 + p(1 - p) \end{aligned}\]
I have written a function that evaluates these values from a vector \((p, f)\) in clarkFunctions2021.r, pmake(pars), where pars is the vector of p and f. pmake returns the frequencies \((P_{aa}, P_{ab}, P_{bb})\),
pmake <- function(pars){
p <- pars[1] # frequency of a
f <- pars[2] # inbreeding coefficient
paa <- p^2 + f*p*(1 - p) # frequency of p.aa
pab <- 2*p*(1 - p)*(1 - f) # frequency of p.ab
pbb <- (1 - p)^2 + f*p*(1 - p) # frequency of p.bb
c(paa,pab,pbb)
}For pa = 0.5, here is the difference between genotype frequencies for f = 0 and f = 0.2,
## [1] 0.25 0.50 0.25
## [1] 0.3 0.4 0.3
Note again how inbreeding decreases the proportion of heterozygotes. Likewise, a tendency to avoid inbreeding increases the frequency of heterozygotes,
## [1] 0.2 0.6 0.2
Prior dependence comes from the fact that some values of \(p\) cannot occur (at equilibrium) if \(f\) is low. This is the reason why allele frequencies shed light on inbreeding. The inbreeding coefficient can be negative and has a lower limit \(f_l\) that depends on \(p\),
\[
f_l(p) =
\begin{cases}
-(1 - p)/p, & p > (1 - p) \\
-p/(1 - p), & p \leq (1 - p) \\
\end{cases}
\] \(f\) can be as low as \(-1\) for equally represented alleles \((p = 0.5)\) and increases to \(0\) as either allele dominates (\(p = 0\) or \(1\)). A function that determines this limit is minf(p), where the argument p is the parameter \(p\),
minf <- function(p){ # minimum inbreeding coefficient
lof <- rep(-1,length(p))
lop <- -(1 - p)/p
hip <- -p/(1 - p)
lof[p > (1 - p)] <- lop[p > (1 - p)]
lof[p < (1 - p)] <- hip[p < (1 - p)]
lof
}Here is a plot of the minimum allowable values of \(f\),
pseq <- seq(.01,.99,length=100)
plot(pseq,minf(pseq),type='l', xlab='p',ylab='minimum f', ylim = c(-1,1))Minimum inbreeding coefficient at different equilibrium frequencies of allele a. Negative f means inbreeding avoidance.
It is now apparent that I can impose this prior dependence by proposing values this way:
- Propose a value of \(p^*\) close the current value of \(p\)
- Propose a value for \(f^*\) greater than the minimum value \(f_l|p^*\)
The goal is to estimate the frequency of allele \(a\), i.e., \(p\), and inbreeding \(f\), based on simulated data for the three genotypes, while imposing prior dependence.
I begin by simulating a data set. Simulated genotypes for a pair of \((p, f)\) values and population size 50 can be generated from the multinomial as
m <- 50 # individuals in sample
p <- 0.6 # frequency of allele a
f <- -.2 # inbreeding coefficient
pr <- pmake(c(p ,f)) # values for (Paa,Pab,Pbb)
y <- rmultinom(10,m,pr) # a random vector of size m
y## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 12 12 16 11 15 20 7 13 11 17
## [2,] 27 33 29 31 29 25 41 28 33 29
## [3,] 11 5 5 8 6 5 2 9 6 4
The three rows represent the number of individuals having each of three genotypes. It is important not to select a value of \(f\) that is incompatible with \(p\), which can be checked using minf(p) = 0.6.
I used a value of \(f = -0.2\), which is above this lower limit. For \(n = 1\) populations, with \(m\) individuals per population, a simulated data set could be generated with
The simulated data y is now a matrix, each column of which contains three values, one for each of the three genotypes. I now want to sample from the posterior for this model. The likelihood for population i is
\[\begin{aligned} \mathbf{y}_i &\sim multinom(n_i,\boldsymbol{\theta})_i \propto \prod_{k=1}^3 \theta_{ik}^{y_{ik}} \\ \boldsymbol{\theta}_i &= \left( P_{aa}, P_{ab}, P_{bb} \right)_i \end{aligned}\]
The three values in \(\boldsymbol{\theta}_i\) depend on the allele frequencies and inbreeding. Including all priors for these parameters and the likelihood taken over all n populations, the posterior is
\[[p,f|\mathbf{Y}] \propto \prod_{i=1}^m multinom(n_i,\boldsymbol{\theta}_i) beta(p|p_1, p_2)N(f|f_1, f_2)I(f_l(p) < f < 1)\]
Note the normal prior for \(f\) truncated at lower and upper values. The advantage of this prior is that I can propose from a symmetric jump distribution, while building in the prior dependence between \(f\) and \(p\). I will use inverse distribution sampling to draw samples for this Metropolis simulation. Here are some prior parameter values
g1 <- 1 # beta parameters for p
g2 <- 1
priorF <- 0 # mean and sd for f
priorFSD <- 1
fg <- f # initial value of fFirst propose values of \(p\) and \(f\) based on current values pg and fg. Here is a way this can be done using truncated normals
propp <- .tnorm(1, .02, .98, p, .002) # propose p
fl <- minf(propp) # lower f limit for pa
propf <- .tnorm(1, fl, 1, fg, .01) # propose f > flThe log posterior is proportional to the sum of the log multinomial, the beta prior for pa and the normal prior for f. This is done in the function pf.update(p,f),
pf.update <- function(y, p, f){ #log posterior
multlik(y, c(p,f)) + dbeta(p,g1,g2,log=T) +
dnorm(f,priorF,priorFSD,log=T)
}
multlik <- function(y, pars){ #multinom likelihood
npop <- ncol(y)
p <- pmake(pars)
pmat <- matrix(rep(p,npop),3,npop)
sum(y*log(pmat))
}
update_pf <- function(y, pg, fg){
ac <- 0
propp <- .tnorm(1,.02,.98,pg,.002) #propose pa
fl <- minf(propp) #lower f limit for pa
propf <- .tnorm(1,fl,1,fg,.01) #propose f > fl
pnew <- pf.update(y, propp, propf)
pnow <- pf.update(y, pg, fg)
a <- exp(pnew - pnow)
z <- runif(1,0,1)
if(z < a){
pg <- propp
fg <- propf
ac <- 1
}
list(pg = pg, fg = fg, accept = ac)
}A Metropolis algorithm is used to propose and accept values based on probabilities calculated using this function. Here is a one for multiple chains (outer j loop) with a nested inner loop for a single chain:
ng <- 5000
nchain <- 10
nt <- 500
thin <- round(seq(1, ng, length=nt))
pgibbs <- matrix(0,nt,nchain)
fgibbs <- pgibbs
accept <- 0
for(j in 1:nchain){
pg <- rbeta(1,1,1) #draw initial values from prior
lg <- minf(pg)
fg <- .tnorm(1,lg,1,priorF,priorFSD)
k <- 0
for(g in 1:ng){
pf <- update_pf(y, pg, fg)
pg <- pf$pg
fg <- pf$fg
if(g %in% thin){
k <- k + 1
pgibbs[k,j] <- pg
fgibbs[k,j] <- fg
}
}
}Here is a plot showing all chains:
par(mfrow=c(2,1), mar=c(4,4,1,1), bty='n')
plot(fgibbs[,1], type='l', ylim=c(-1,1), ylab='f')
for(j in 2:nchain)lines(fgibbs[,j])
abline(h=f, lty=2)
plot(pgibbs[,1],type='l',ylim=c(0,1), ylab='p')
for(j in 2:nchain)lines(pgibbs[,j])
abline(h=p, lty=2)Multiple chains for f and p.
Here is a plot showing the joint distribution bounded by the prior dependence, with the convergence from my horrible initial values:
par(mfrow=c(1,1), bty='n')
plot(pgibbs, fgibbs, xlim=c(0,1), ylim=c(-1,1), cex=.2, xlab='p', ylab='f')
lines(pseq,minf(pseq))Progress of the MCMC through parameter space for multiple chains from different initial values.
Here is a summary and density plot when all chains analyzed together, omitting a burnin of 1000:
fp <- cbind( as.vector(fgibbs[-c(1:100),]), as.vector(pgibbs[-c(1:100),]) )
colnames(fp) <- c('f', 'p')
.processPars(fp, xtrue= c(f, p) ,DPLOT=T)Posterior distribution for (f, p).
## $summary
## estimate se 0.025 0.975 true value
## f -0.2121 0.029990 -0.2708 -0.1512 -0.2
## p 0.6113 0.009301 0.5931 0.6292 0.6
Note that I passed to .processPars the matrix of chains transformed into vectors, omitting the burnin rows.
Exercise 4. Simulate populations of size n = 5, 10, 20 individuals with positive inbreeding. Use 10 chains to see if you can recover parameter estimates. Is recovery better, worse, or the same as with negative inbreeding.
Gibbs sampling and Metropolis as MCMC techniques
Both Gibbs sampling and Metropolis are MCMC techniques–both draw from the posterior distribution that is conditioned on the the current values of parameters. Gibbs sampling refers to the case where each MCMC step includes several pieces, one for each conditional distribution. Sampling from those conditional distributions be direct when there is a solution available for the conditional. It can be indirect, when there is no solution, in which case there are proposals that can be accepted or not. Metropolis is one of the most common indirect sampling methods. When available, direct sampling has the advantage of high efficiency.
Recap
Posterior simulation is done as a last resort, when a posterior distribution cannot be solved by conventional methods. Markov chain Monte Carlo (MCMC) algorithms can be developed for a broad range of inference problems. The most important technique is Gibbs sampling, which allows simulation based on conditional distributions.
The concept of conjugacy is important for determining when a prior \(\times\) likelihood can be sampled directly (efficient) versus indirectly, through methods such as Metropolis. Indirect sampling is less efficient, because it involves proposals, which can be hard to match with the posterior distribution.
Prior dependence describes the situation where the prior distribution on one parameter may depend on the currently imputed value of another, e.g., \(\beta_1 > \beta_2\). With Metropolis sampling, prior dependence may be achieved through proposals.