Much of the explanation below is taken from Kruschke 2015: Doing Bayesian Data Analysis.
I highly recommend this video to start. It focuses on the Metropolis-Hastings algorithm, but offers a great general introductory discussion of the MCMC sampling process. Also, for a more in-depth, yet approachable, dive into the math, I recommend this series of videos.
# Load libraries
library("rjags")
library("coda")
library("matrixStats")
library("tidyr")
library("bfw")
# Read in the methane flux data
dat <- read.csv("Noa_MethaneFlux.csv")
source("kruschke.R") # Run the functions in this script. I use some in the code below. This comes from Kruschke 2015
##
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************
# Convert the data from wide to long
dat <- gather(dat, measurement_num, methane_flux, M1:M8, factor_key = TRUE)
# Remove NAs
dat <- dat[complete.cases(dat), ]
# Scale the response
dat$methane_flux <- as.numeric(scale(dat$methane_flux))
# Strip the leading "M" from M1..M8, then treat measurement number as numeric and scale it
dat$measurement_num <- sub('.', '', dat$measurement_num)
dat$measurement_num <- scale(as.numeric(dat$measurement_num))
dat$Type <- as.factor(dat$Type)
JAGS (Just Another Gibbs Sampler) implements Gibbs sampling, an MCMC algorithm, to sample from the posterior distribution of a Bayesian model.
## put the data into the list format JAGS needs
dataForJags <- list(methane_flux = dat$methane_flux,
                    Type = dat$Type,
                    measurement_num = as.numeric(dat$measurement_num), # note that we changed the measurement number to a numeric value, so it now represents a continuous time variable
                    Nlevs = length(unique(dat$Type)),
                    N = length(dat$methane_flux))
A model of data specifies the probability of particular data values given the model's structure and parameter values (the likelihood function). The model also specifies the probability of various parameter values (the priors).
In other words, the model specifies the likelihood and the prior.
We then use Bayes' rule to find what we really want to know, which is how strongly we should believe in the various parameter values, given the data:
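Written out, with D denoting the data and θ the parameters, Bayes' rule is:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

where $p(D \mid \theta)$ is the likelihood, $p(\theta)$ is the prior, and $p(D)$ is the marginal likelihood that normalizes the posterior.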
Summary:
Together, the shapes of the prior and likelihood distributions determine the shape and spread of the posterior. Note in the image below that the posterior distribution lands in the middle, so to speak, of the prior and likelihood distributions. To see how these interplay, watch this.
In the images below, observe how the prior and likelihood distributions influence the shape and dispersion (and hence the precision) of the posterior distribution (bottom panels). Note that the panels also illustrate what happens with differences in sample size (N). For a more detailed explanation, see pp. 112-114 of Kruschke 2015.
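To see this interplay numerically, here is a toy grid-approximation sketch in the spirit of those figures (a hypothetical coin-flip example with a beta prior, not our flux model; try increasing N to watch the posterior tighten around the likelihood):

theta <- seq(0, 1, length.out = 1001)   # grid of candidate parameter values
prior <- dbeta(theta, 2, 2)             # mildly informative prior
z <- 7; N <- 10                         # toy data: 7 successes in N = 10 trials
likelihood <- dbinom(z, N, theta)       # likelihood of the data at each theta
posterior <- prior * likelihood         # prior x likelihood (unnormalized)
posterior <- posterior / sum(posterior) # normalize over the grid
plot(theta, posterior, type = "l", main = "Posterior via grid approximation")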
model <- "model{
  for(i in 1:N){
    # the likelihood function
    methane_flux[i] ~ dnorm(alpha[i], tau)
    alpha[i] <- mu[Type[i]] + beta1[Type[i]]*measurement_num[i]
    # interpret as: the ith value of methane flux (yi) is distributed as a normal
    # distribution with mean alpha[i] and precision tau (note that JAGS parameterizes
    # dnorm with precision = 1/variance, not variance)
  }
  # coefficients for each level of Type (conditional priors)
  for(j in 1:Nlevs){
    beta1[j] ~ dnorm(beta1mu, beta1tau)
    # interpreted as: beta1 is distributed as a normal distribution with mean beta1mu
    # and precision beta1tau. beta1mu can take on any value from a normal distribution
    # centered on zero (no effect) with precision 0.001, i.e., variance 1000
    # (see the hyperprior section below)
    mu[j] ~ dnorm(mumu, mutau)
    # the mean or intercept can take on any value from a normal distribution centered
    # on zero with precision 0.001 (see the hyperprior section below)
  }
  tau ~ dgamma(0.1,0.1) # prior on the likelihood precision
  # hyperpriors for coefficients
  beta1mu ~ dnorm(0,0.001)
  mumu ~ dnorm(0,0.001)
  # hyperpriors for precisions
  beta1tau ~ dgamma(0.1,0.1)
  mutau ~ dgamma(0.1,0.1)
}"
To understand the utility of the Gibbs sampler, first understand that calculating the true posterior distribution directly is intractable. This resource provides a brief but approachable explanation in more detail. The posterior is a well-defined distribution, but integrating over ALL possible values of θ is computationally prohibitive for joint probability distributions with many parameters. An alternative is to draw samples to approximate the posterior. The Gibbs sampler is a sampling method that, when used in a Bayesian framework, approximates the target posterior distribution (an entity whose parameters we do not actually know, but which we can estimate given a sufficient number of samples). The samples referred to here are proposed parameter or "candidate" values for θ.
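Concretely, the intractable piece is the normalizing constant in the denominator of Bayes' rule:

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$

For a model with many parameters this is a high-dimensional integral, which is why we sample to approximate the posterior instead of integrating.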
In the figure below, the upper-left panel is an example of an "exact" distribution of p(θ). The other three panels represent approximations of the "exact" p(θ) distribution for progressively larger samples. These three illustrate how sampling estimates the target posterior distribution (i.e., the top-left panel).
Data, an informed model, and the prior are the inputs to the sampler. Let's follow the flow chart in the "PROCESS" section below for one iteration. We start at the "sampler" box:
The prior and the likelihood function are multiplied together to evaluate the posterior probability for that particular candidate value. It is important to note that the sampler doesn't make completely random guesses; instead, depending on the particular algorithm, it chooses each value of θ in a somewhat "intelligent" way. This is done repeatedly for many samples of θ, which generates a distribution of the posterior probabilities.
The sampler works using the flow chart below:
Markov chain Monte Carlo sampling provides a class of algorithms for systematic random sampling from high-dimensional probability distributions. Many iterations, or samples, should eventually get the Markov chain to the desired distribution, also called the stationary distribution of the Markov chain. For Bayesian applications this stationary phase represents the target posterior distribution. The Markov chain only needs to be simulated until stationarity is achieved. The process of drawing lots of samples from a target or population distribution to obtain a representative sample distribution is typical of Monte Carlo methods.
In the image below, the target distribution (blue) is being estimated with each iteration of the Gibbs sampler.
Another way to look at this is to consider the images below. A) Traceplots illustrating the random walk. Stationarity is reached after the 200th time step (before that we would specify burn-in). Between iterations 0-200, the samples are NOT representative of the target posterior distribution and convergence has NOT occurred. After the 200th iteration, the chain has converged on the target posterior distribution and samples are now drawn from it. Similarly, in panel B the red line is the non-stationary phase of the random walk; this phase is not sampling values of theta that represent the target posterior distribution. The blue region represents the stationary phase, where samples are drawn from the target posterior distribution.
Each sample, or iteration, is dependent on the previous value (in our case the value is a parameter estimate), so this can be thought of as a type of dependent sampling algorithm. Monte Carlo methods typically assume that we can efficiently draw samples from the target distribution (the joint posterior distribution). From the samples that are drawn, we can then estimate sums or integrals of interest, e.g., via the mean or variance of the drawn samples.
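As a toy illustration of that Monte Carlo idea (hypothetical independent draws standing in for MCMC output):

set.seed(42)
draws <- rnorm(10000, mean = 2, sd = 1) # pretend these are posterior samples of theta
mean(draws) # Monte Carlo estimate of the posterior mean (~2)
var(draws)  # Monte Carlo estimate of the posterior variance (~1)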
The Gibbs sampling algorithm can be thought of as a special case of another MCMC sampler: the Metropolis-Hastings algorithm. Like the Metropolis algorithm, its procedure involves a random walk. The walk starts at some arbitrary point, and at each point in the walk the next step depends only on the current position. In short, it differs from the Metropolis algorithm in that each step of the random walk is drawn from the conditional distribution of one parameter given the current values of all the others.
Recall:
I won’t get too much into Gibbs sampling, but here are some links to explore and visualize an MCMC sampling process:
The posterior distributions are the marginal distributions of this process.
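To make the conditional-sampling idea concrete, here is a minimal, self-contained Gibbs sketch for a toy bivariate normal target (purely illustrative, not part of our flux model; rho is an assumed correlation):

set.seed(1)
rho <- 0.8             # assumed correlation between theta1 and theta2
n_iter <- 5000
theta <- matrix(NA, n_iter, 2)
theta[1, ] <- c(-3, 3) # arbitrary starting point for the random walk
for (i in 2:n_iter) {
  # draw theta1 from its full conditional given the current theta2
  theta[i, 1] <- rnorm(1, rho * theta[i - 1, 2], sqrt(1 - rho^2))
  # draw theta2 from its full conditional given the just-updated theta1
  theta[i, 2] <- rnorm(1, rho * theta[i, 1], sqrt(1 - rho^2))
}
hist(theta[-(1:500), 1]) # marginal posterior of theta1 after discarding burn-in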
Gibbs sampling is especially useful for complex models (i.e., hierarchical ones). For simpler models it is not strictly necessary, but such overkill should not alter the results.
## compile
mod <- jags.model(textConnection(model), data = dataForJags, n.chains = 5)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 220
## Unobserved stochastic nodes: 11
## Total graph size: 731
##
## Initializing model
## run it
samples <- jags.samples(model = mod,
                        variable.names = c("beta1","mu","beta1mu","mumu","beta1tau","mutau","tau"),
                        n.iter = 1000)
JAGS arguments:
Adaptive phase
JAGS has an adaptive mode built into the process, in which the samplers are optimized/tuned. The MCMC samplers that JAGS uses to sample the posterior are governed by tunable parameters that affect their precise behavior. Proper tuning of these parameters can produce gains in the speed or de-correlation of the sampling. JAGS contains machinery to tune these parameters automatically, and does so as it draws posterior samples. This process is called adaptation, but it is non-Markovian; the resulting samples do not constitute a Markov chain. Therefore, burn-in must be performed separately after adaptation. It is incorrect to substitute the adaptation period for the burn-in, although sometimes only a relatively short burn-in is necessary post-adaptation.

During adaptation, the sampler parameters (e.g., the step size for a random-walk sampler) are TUNED to give good MIXING. Too little tuning does not affect the validity of the MCMC chain, but it does give poor mixing, with high autocorrelation and low effective sample size. JAGS has a separate adaptation phase, which is skipped if the samplers do not need tuning, and the user decides how many iterations to use for adaptation. With R2jags::jags, the entire burn-in phase is passed to JAGS as an adaptation phase. That makes sense if adaptation is necessary, but if JAGS skips adaptation we get no burn-in! jagsUI::jags has separate arguments for n.adapt and n.burnin, but I suspect most users are like me and plug in thousands of iterations for n.burnin and leave the default of n.adapt = 100. We would get much better mixing if we used thousands of iterations for n.adapt and none for post-adaptation burn-in. But we'd have to check the trace plots (as we always do…) to pick up the cases where adaptation was skipped and burn-in is needed.

When people say "mixing" in the context of Markov chain Monte Carlo (MCMC), they are (knowingly or unknowingly) referring to the "mixing time" of the Markov chain. Intuitively, the mixing time is the number of steps required for the Markov chain to come close to the stationary distribution (or, in the world of Bayesian statistics, the posterior distribution).

Complex models may require longer adaptive phases. If the adaptive phase is not sufficient for JAGS to optimize the samplers, a warning message will be printed (see example below). For a simple model like ours it's likely sufficient to leave the defaults, but this would be something to consider for more complex models.
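As a sketch of controlling this explicitly in rjags (n.adapt is a real argument of jags.model(); the iteration counts here are arbitrary choices, not recommendations):

mod_tuned <- jags.model(textConnection(model), data = dataForJags,
                        n.chains = 5, n.adapt = 1000) # explicit adaptation phase
update(mod_tuned, n.iter = 500) # separate post-adaptation burn-in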
When doing Bayesian analysis with samples from an MCMC algorithm, a key concern is convergence. Recall that MCMC algorithms SIMULATE the posterior distribution. Theory guarantees that the output from an MCMC run will eventually converge on the right posterior distribution (the target distribution), BUT for a less-than-infinite run we have to be careful. Under certain conditions (i.e., when the model is an accurate descriptor, with an appropriate number of iterations or combination of thinning and burn-in arguments), MCMC algorithms will draw a sample from the target posterior distribution after converging to equilibrium. In equilibrium, the distribution of samples from the chains should be the same regardless of the chains' initial starting values. This also implies that in equilibrium, the mean of every chain is the same.
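A quick empirical check of that last point, using the samples object created above: the per-chain means of a parameter should agree once the chains are in equilibrium.

colMeans(samples$beta1[1, , ]) # mean of beta1[1] in each of the 5 chains; these should be very similar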
When asking if the MCMC simulation has "converged", you are in part asking, "Has the simulated Markov chain fully explored the target posterior distribution so far, or do we need longer simulations?" There are diagnostics to make sure the MCMC samples converged on the target distribution. Visually, you can check traceplots, autocorrelation plots, and the posterior probability density for each estimated parameter.
In traceplots:
In autocorrelation plots:
The section below is taken from here.
Autocorrelation refers to a pattern of serial correlation in the chain, where sequential draws of a parameter, say β1, from the conditional distribution are correlated. The cause of autocorrelation is that the parameters in our model may be highly correlated, so the Gibbs Sampler will be slow to explore the entire posterior distribution.
Consider an MCMC chain that is strongly autocorrelated. Autocorrelation produces clumpy samples that are unrepresentative, in the short run, of the true underlying posterior distribution. Therefore, if possible, we would like to get rid of autocorrelation so that the MCMC sample provides a more precise estimate of the posterior. Typically, the level of autocorrelation will decline with an increasing number of lags in the chain (e.g., as we go from the 1000th to the 1010th lag, the level of autocorrelation will often decline). When this dampening doesn't occur, you have a problem and will probably want to re-parameterize the model (more on this below).
Only the first few lags should be highly correlated, since we used a Markov chain and it is natural to this process that each step depends on the previous one. The lag-k autocorrelation is the correlation between every sample and the sample k steps before it. This autocorrelation should become smaller as k increases, i.e., samples eventually become effectively independent. If, on the other hand, autocorrelation remains high for larger values of k, this indicates a high degree of correlation between our samples. See the image below for examples of non-autocorrelated and correlated samples.
It is common practice to thin the sample to reduce autocorrelation; however, this can reduce the precision of the estimate. If we keep 50,000 thinned steps with small autocorrelation, then we very probably have a more precise estimate of the posterior than from 50,000 unthinned steps with high autocorrelation. But to get 50,000 kept steps in a thinned chain, we needed to generate n*50,000 steps. With such a long chain, the clumpy autocorrelation has probably all been averaged out anyway! In fact, Link and Eaton show that the longer (unthinned) chain usually yields better estimates of the true posterior than the shorter thinned chain, even for percentiles in the tail of the distribution, at least for the particular cases they consider. So, consider keeping the thinning interval small and the number of iterations large.
The tricky part is knowing how long of a chain is long enough to smooth out short-run autocorrelation. The more severe the autocorrelation is, the longer the chain needs to be. I have not seen rules of thumb for how to translate an autocorrelation function into a recommended chain length. Link and Eaton suggest monitoring different independent chains and assaying whether the estimates produced by the different chains are suitably similar to each other. For extreme autocorrelation, it’s best not to rely on long-run averaging out, but instead to use other techniques that actually get rid of the autocorrelation. This usually involves reparameterization, as appropriate for the particular model.
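To quantify autocorrelation for our model, here is a sketch using coda (assuming the samples object from jags.samples() above; rjags provides as.mcmc.list() to convert a monitored node into a coda mcmc.list):

beta1_mcmc <- as.mcmc.list(samples$beta1) # one mcmc chain per JAGS chain
autocorr.plot(beta1_mcmc) # lag-k autocorrelation for each parameter and chain
effectiveSize(beta1_mcmc) # effective sample size after accounting for autocorrelation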
Gelman-Rubin Convergence Diagnostic & plot:
The Gelman-Rubin diagnostic evaluates MCMC convergence by analyzing the difference between multiple Markov chains. Convergence is assessed by comparing the estimated between-chains and within-chain variances for each model parameter. Large differences between these variances indicate nonconvergence. See the bottom-left panel in the figure above; values should be close to 1 for most iterations.
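A sketch of computing it with coda (again assuming the samples object from above, converted to an mcmc.list):

beta1_mcmc <- as.mcmc.list(samples$beta1)
gelman.diag(beta1_mcmc) # potential scale reduction factor; values near 1 indicate convergence
gelman.plot(beta1_mcmc) # shrink factor across iterations, as in the panel described above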
Troubleshooting:
If you do not have random, uncorrelated patterns:
I like using jags.samples() because the structure of its output helps me understand what is actually sampled. The MCMC object is a list of arrays, one for each estimated parameter. Recall that an array is a "stack" of matrices, so it has 3 dimensions: [rows, columns, number of matrices]. In this case the rows of the array represent the parameters estimated (here, the levels of Type), the columns represent the iterations, and the matrices represent the chains.
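A quick way to see that structure for yourself (assuming the samples object created above):

dim(samples$beta1)       # here 3 x 1000 x 5: [parameter levels, iterations, chains]
samples$beta1[2, 1:5, 1] # first five draws of beta1 for Type level 2, chain 1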
Below we check out some diagnostic plots before we get the effect sizes. Here we use the plot() function to output the traceplots. I like this method because you can practice extracting values from the MCMC object:
## trace plots: index as [parameter level, iteration, chain]
plot(samples$beta1[1,,1])  # beta1 for Type level 1, chain 1
plot(samples$beta1[2,,1])  # beta1 for Type level 2, chain 1
plot(samples$beta1[3,,1])  # beta1 for Type level 3, chain 1
plot(samples$beta1mu[,,1]) # hyperprior mean of beta1, chain 1
plot(samples$tau[,,1])     # likelihood precision, chain 1
Below we can check out the posterior distributions for our beta1 parameters. If you wish to include histograms of the posterior distributions in your publication, you would typically add details such as the HDI region (e.g., the 95% HDI) and a vertical line at the mean, or report the mode. There are wrapper plotting functions that do this, but I wanted to show how to hand-code these posteriors from the outputs.
# let's look at beta1 (treat)
plot(density(samples$beta1[1,,]), main = "Posterior Distribution")
summary(samples$beta1[1,,]) # gives you summary stats for all 5 chains. If your chains have converged, you can use one chain for reporting statistics
## V1 V2 V3 V4
## Min. :-0.20552 Min. :-0.21169 Min. :-0.18379 Min. :-0.27172
## 1st Qu.: 0.06069 1st Qu.: 0.05935 1st Qu.: 0.05774 1st Qu.: 0.06021
## Median : 0.13391 Median : 0.12980 Median : 0.13231 Median : 0.12762
## Mean : 0.13599 Mean : 0.13445 Mean : 0.13438 Mean : 0.13279
## 3rd Qu.: 0.21210 3rd Qu.: 0.21104 3rd Qu.: 0.21199 3rd Qu.: 0.21190
## Max. : 0.50950 Max. : 0.51510 Max. : 0.55107 Max. : 0.51274
## V5
## Min. :-0.22757
## 1st Qu.: 0.05222
## Median : 0.12784
## Mean : 0.12857
## 3rd Qu.: 0.20838
## Max. : 0.54093
plot(density(samples$beta1[2,,]), main = "Posterior Distribution")
summary(samples$beta1[2,,])
## V1 V2 V3 V4
## Min. :-0.2655 Min. :-0.21819 Min. :-0.32776 Min. :-0.33258
## 1st Qu.: 0.0475 1st Qu.: 0.03889 1st Qu.: 0.03344 1st Qu.: 0.03144
## Median : 0.1221 Median : 0.11447 Median : 0.11018 Median : 0.11209
## Mean : 0.1244 Mean : 0.11325 Mean : 0.10937 Mean : 0.11179
## 3rd Qu.: 0.2064 3rd Qu.: 0.18691 3rd Qu.: 0.18835 3rd Qu.: 0.18698
## Max. : 0.4529 Max. : 0.54620 Max. : 0.45933 Max. : 0.57282
## V5
## Min. :-0.33443
## 1st Qu.: 0.02866
## Median : 0.10736
## Mean : 0.10970
## 3rd Qu.: 0.19363
## Max. : 0.56303
plot(density(samples$beta1[3,,]), main = "Posterior Distribution")
summary(samples$beta1[3,,])
## V1 V2 V3 V4
## Min. :-0.17674 Min. :-0.24101 Min. :-0.21023 Min. :-0.22686
## 1st Qu.: 0.05851 1st Qu.: 0.05473 1st Qu.: 0.05457 1st Qu.: 0.04885
## Median : 0.13120 Median : 0.12994 Median : 0.12774 Median : 0.12221
## Mean : 0.13435 Mean : 0.13071 Mean : 0.12820 Mean : 0.12152
## 3rd Qu.: 0.20840 3rd Qu.: 0.20166 3rd Qu.: 0.20233 3rd Qu.: 0.19344
## Max. : 0.44945 Max. : 0.54783 Max. : 0.47132 Max. : 0.42307
## V5
## Min. :-0.21036
## 1st Qu.: 0.06024
## Median : 0.13185
## Mean : 0.13034
## 3rd Qu.: 0.19929
## Max. : 0.55079
There are other functions within rjags to do Gibbs sampling. coda.samples(), as opposed to jags.samples(), has convenient plotting and summary functions associated with it. coda.samples() is a wrapper for jags.samples(). Let's run this sampling function and then pull diagnostic AND output values from it.
jags_out <- rjags::coda.samples(
  mod,
  variable.names = c("beta1","mu","beta1mu","mumu","beta1tau","mutau","tau"),
  n.iter = 1000) # same arguments as we used previously
Below is a caterpillar plot of the posteriors for each level of treatment. In this case it represents how methane flux changes over time for each treatment level. Overall we see no difference among the treatments, but across all treatments methane flux increases over time.
library(MCMCvis)
MCMCplot(jags_out, params = 'beta1', ci = c(50, 95), col = 'blue',
         sz_labels = 1.5, sz_med = 2, sz_thick = 7, sz_thin = 3,
         sz_ax = 4, guide_lines = TRUE, HPD = TRUE, rank = TRUE)
The argument ci = c(50, 95) in the plotting function above indicates that the thick blue line is the 50% HDI and the thin blue line is the 95% HDI. HDI stands for Highest Density Interval. This is a way to summarize the posterior distribution, in addition to the mean: it indicates which points of a distribution are most credible and cover most of the distribution. A 95% HDI is the interval of values that covers 95% of the distribution.
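If you want the interval numbers themselves, coda's HPDinterval() computes highest posterior density intervals (a sketch using the jags_out object from above; here for chain 1):

HPDinterval(jags_out[[1]], prob = 0.95) # 95% HDI for each monitored parameter, chain 1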
Below is a summary of the stats you would report (i.e., the mean of the posterior and its credible intervals).
MCMCsummary(jags_out, round = 2)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## beta1[1] 0.14 0.12 -0.09 0.13 0.36 1.00 4312
## beta1[2] 0.11 0.12 -0.12 0.11 0.35 1.00 4665
## beta1[3] 0.13 0.11 -0.08 0.13 0.34 1.00 5065
## beta1mu 0.12 0.57 -0.59 0.12 0.83 1.11 6871
## beta1tau 9.92 9.54 0.33 7.19 35.54 1.00 2952
## mu[1] -0.05 0.11 -0.27 -0.05 0.18 1.00 5123
## mu[2] 0.05 0.12 -0.18 0.05 0.28 1.00 4729
## mu[3] 0.00 0.11 -0.22 0.00 0.21 1.00 5000
## mumu 0.01 0.43 -0.71 0.01 0.74 1.01 4668
## mutau 10.10 9.64 0.31 7.17 35.11 1.00 2907
## tau 1.00 0.09 0.83 0.99 1.19 1.00 4843
Outputting diagnostic plots using the MCMCvis package.
MCMCtrace(jags_out, pdf=FALSE)
This is another diagnostic plot using the functions in the kruschke.R file. Be sure to run that script to load those functions before running the code below. I like this one because it includes the autocorrelation plot, but the function is inefficient for large numbers of iterations and more complex models.
diagMCMC(jags_out) # the diagnostic plot will save to your folder. For models with many iterations this may not produce a complete output; if you want this output, consider decreasing the number of iterations in the model to 100-1000.