Much of the explanation below is taken from Kruschke 2015: Doing Bayesian Data Analysis.
I highly recommend this video to start. It focuses on the Metropolis-Hastings algorithm, but offers a great general introductory discussion of the MCMC sampling process. Also, for a more in-depth, yet approachable, dive into the math, I recommend this series of videos.
# Load libraries
library("rjags")
library("coda")
library("matrixStats")
library("tidyr")
library("bfw")
# Read in the methane flux data
dat <- read.csv("Noa_MethaneFlux.csv")
source("kruschke.R") # Run the functions in this script. I use some in the code below. This comes from Kruschke 2015
##
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************
# Convert the data from wide to long
dat <- gather(dat, measurement_num, methane_flux, M1:M8, factor_key = TRUE)
# Remove NAs
dat <- dat[complete.cases(dat), ]
# Scale the response
dat$methane_flux <- as.numeric(scale(dat$methane_flux))
# Strip the leading "M" from M1..M8, then treat measurement number as numeric and scale it
dat$measurement_num <- sub('.', '', dat$measurement_num)
dat$measurement_num <- scale(as.numeric(dat$measurement_num))
dat$Type <- as.factor(dat$Type)
JAGS (Just Another Gibbs Sampler) implements Gibbs sampling, an MCMC algorithm, to sample from the posterior distribution of a Bayesian model.
## put the data into the list format JAGS needs
dataForJags <- list(methane_flux = dat$methane_flux,
                    Type = dat$Type,
                    measurement_num = as.numeric(dat$measurement_num), # note that we changed the measurement number to a numeric value, so it now represents a continuous time variable
                    Nlevs = length(unique(dat$Type)),
                    N = length(dat$methane_flux))
A model of data specifies the probability of particular data values given the model's structure and parameter values (the likelihood function). The model also specifies the probability of various parameter values (the priors).
In other words, the model specifies the likelihood and the prior.
We then use Bayes' rule to find what we really want to know, which is how strongly we should believe in the various parameter values, given the data:
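Written out, with D denoting the data and θ the parameters, Bayes' rule is:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

where $p(D \mid \theta)$ is the likelihood, $p(\theta)$ is the prior, and $p(D)$ is the marginal likelihood that normalizes the posterior.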
Summary:
Together, the shapes of the prior and likelihood distributions determine the shape and spread of the posterior. Note in the image below that the posterior distribution lands in the middle, so to speak, of the prior and likelihood distributions. To see how these interplay, watch this.
In the images below, observe how the prior and likelihood distributions influence the shape and dispersion (and hence the precision) of the posterior distribution (bottom panels). Note that the panels also illustrate what happens with differences in sample size (N). For a more detailed explanation, see pp. 112-114 of Kruschke 2015.
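To see this interplay numerically, here is a toy grid-approximation sketch in the spirit of those figures (a hypothetical coin-flip example with a beta prior, not our flux model; try increasing N to watch the posterior tighten around the likelihood):

theta <- seq(0, 1, length.out = 1001)   # grid of candidate parameter values
prior <- dbeta(theta, 2, 2)             # mildly informative prior
z <- 7; N <- 10                         # toy data: 7 successes in N = 10 trials
likelihood <- dbinom(z, N, theta)       # likelihood of the data at each theta
posterior <- prior * likelihood         # prior x likelihood (unnormalized)
posterior <- posterior / sum(posterior) # normalize over the grid
plot(theta, posterior, type = "l", main = "Posterior via grid approximation")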
model <- "model{
  for(i in 1:N){
    # the likelihood function
    methane_flux[i] ~ dnorm(alpha[i], tau)
    alpha[i] <- mu[Type[i]] + beta1[Type[i]]*measurement_num[i]
    # interpret as: the ith value of methane flux (yi) is distributed as a normal
    # distribution with mean alpha[i] and precision tau (note that JAGS parameterizes
    # dnorm with precision = 1/variance, not variance)
  }
  # coefficients for each level of Type (conditional priors)
  for(j in 1:Nlevs){
    beta1[j] ~ dnorm(beta1mu, beta1tau)
    # interpreted as: beta1 is distributed as a normal distribution with mean beta1mu
    # and precision beta1tau. beta1mu can take on any value from a normal distribution
    # centered on zero (no effect) with precision 0.001, i.e., variance 1000
    # (see the hyperprior section below)
    mu[j] ~ dnorm(mumu, mutau)
    # the mean or intercept can take on any value from a normal distribution centered
    # on zero with precision 0.001 (see the hyperprior section below)
  }
  tau ~ dgamma(0.1,0.1) # prior on the likelihood precision
  # hyperpriors for coefficients
  beta1mu ~ dnorm(0,0.001)
  mumu ~ dnorm(0,0.001)
  # hyperpriors for precisions
  beta1tau ~ dgamma(0.1,0.1)
  mutau ~ dgamma(0.1,0.1)
}"
To understand the utility of the Gibbs sampler, first understand that calculating the true posterior distribution directly is intractable. This resource provides a brief but approachable explanation in more detail. The posterior is a well-defined distribution, but integrating over ALL possible values of θ is computationally prohibitive for joint probability distributions with many parameters. An alternative is to draw samples to approximate the posterior. The Gibbs sampler is a sampling method that, when used in a Bayesian framework, approximates the target posterior distribution (an entity whose parameters we do not actually know, but which we can estimate given a sufficient number of samples). The samples referred to here are proposed parameter or "candidate" values for θ.
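Concretely, the intractable piece is the normalizing constant in the denominator of Bayes' rule:

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$

For a model with many parameters this is a high-dimensional integral, which is why we sample to approximate the posterior instead of integrating.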
In the figure below, the upper-left panel is an example of an "exact" distribution of p(θ). The other three panels represent approximations of the "exact" p(θ) distribution for progressively larger samples. These three illustrate how sampling estimates the target posterior distribution (i.e., the top-left panel).
Data, an informed model, and the prior are the inputs to the sampler. Let's follow the flow chart in the "PROCESS" section below for one iteration. We start at the "sampler" box:
The prior and the likelihood function are multiplied together to evaluate the posterior probability for that particular candidate value. It is important to note that the sampler doesn't make completely random guesses; instead, depending on the particular algorithm, it chooses each value of θ in a somewhat "intelligent" way. This is done repeatedly for many samples of θ, which generates a distribution of the posterior probabilities.
The sampler works using the flow chart below:
Markov chain Monte Carlo sampling provides a class of algorithms for systematic random sampling from high-dimensional probability distributions. Many iterations, or samples, should eventually get the Markov chain to the desired distribution, also called the stationary distribution of the Markov chain. For Bayesian applications this stationary phase represents the target posterior distribution. The Markov chain only needs to be simulated until stationarity is achieved. The process of drawing lots of samples from a target or population distribution to obtain a representative sample distribution is typical of Monte Carlo methods.
In the image below, the target distribution (blue) is being estimated with each iteration of the Gibbs sampler.
Another way to look at this is to consider the images below. A) Traceplots illustrating the random walk. Stationarity is reached after the 200th time step (before that we would specify burn-in). Between iterations 0-200, the samples are NOT representative of the target posterior distribution and convergence has NOT occurred. After the 200th iteration, the chain has converged on the target posterior distribution and samples are now drawn from it. Similarly, in panel B the red line is the non-stationary phase of the random walk; this phase is not sampling values of theta that represent the target posterior distribution. The blue region represents the stationary phase, where samples are drawn from the target posterior distribution.
Each sample, or iteration, is dependent on the previous value (in our case the value is a parameter estimate), so this can be thought of as a type of dependent sampling algorithm. Monte Carlo methods typically assume that we can efficiently draw samples from the target distribution (the joint posterior distribution). From the samples that are drawn, we can then estimate sums or integrals of interest, e.g., via the mean or variance of the drawn samples.
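As a toy illustration of that Monte Carlo idea (hypothetical independent draws standing in for MCMC output):

set.seed(42)
draws <- rnorm(10000, mean = 2, sd = 1) # pretend these are posterior samples of theta
mean(draws) # Monte Carlo estimate of the posterior mean (~2)
var(draws)  # Monte Carlo estimate of the posterior variance (~1)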
The Gibbs sampling algorithm can be thought of as a special case of another MCMC sampler: the Metropolis-Hastings algorithm. Like the Metropolis algorithm, its procedure involves a random walk. The walk starts at some arbitrary point, and at each point in the walk the next step depends only on the current position. In short, it differs from the Metropolis algorithm in that each step of the random walk is drawn from the conditional distribution of one parameter given the current values of all the others.
Recall:
I won’t get too much into Gibbs sampling, but here are some links to explore and visualize an MCMC sampling process:
The posterior distributions are the marginal distributions of this process.
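To make the conditional-sampling idea concrete, here is a minimal, self-contained Gibbs sketch for a toy bivariate normal target (purely illustrative, not part of our flux model; rho is an assumed correlation):

set.seed(1)
rho <- 0.8             # assumed correlation between theta1 and theta2
n_iter <- 5000
theta <- matrix(NA, n_iter, 2)
theta[1, ] <- c(-3, 3) # arbitrary starting point for the random walk
for (i in 2:n_iter) {
  # draw theta1 from its full conditional given the current theta2
  theta[i, 1] <- rnorm(1, rho * theta[i - 1, 2], sqrt(1 - rho^2))
  # draw theta2 from its full conditional given the just-updated theta1
  theta[i, 2] <- rnorm(1, rho * theta[i, 1], sqrt(1 - rho^2))
}
hist(theta[-(1:500), 1]) # marginal posterior of theta1 after discarding burn-in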
Gibbs sampling is especially useful for complex models (i.e., hierarchical ones). For simpler models it is not strictly necessary, but such overkill should not alter the results.
## compile
mod <- jags.model(textConnection(model), data = dataForJags, n.chains = 5)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 220
## Unobserved stochastic nodes: 11
## Total graph size: 731
##
## Initializing model
## run it
samples <- jags.samples(model = mod,
                        variable.names = c("beta1","mu","beta1mu","mumu","beta1tau","mutau","tau"),
                        n.iter = 1000)
JAGS arguments:
Adaptive phase
JAGS has an adaptive mode built into the process, in which the samplers are optimized/tuned. The MCMC samplers that JAGS uses to sample the posterior are governed by tunable parameters that affect their precise behavior. Proper tuning of these parameters can produce gains in the speed or de-correlation of the sampling. JAGS contains machinery to tune these parameters automatically, and does so as it draws posterior samples. This process is called adaptation, but it is non-Markovian; the resulting samples do not constitute a Markov chain. Therefore, burn-in must be performed separately after adaptation. It is incorrect to substitute the adaptation period for the burn-in, although sometimes only a relatively short burn-in is necessary post-adaptation.

During adaptation, the sampler parameters (e.g., the step size for a random-walk sampler) are TUNED to give good MIXING. Too little tuning does not affect the validity of the MCMC chain, but it does give poor mixing, with high autocorrelation and low effective sample size. JAGS has a separate adaptation phase, which is skipped if the samplers do not need tuning, and the user decides how many iterations to use for adaptation. With R2jags::jags, the entire burn-in phase is passed to JAGS as an adaptation phase. That makes sense if adaptation is necessary, but if JAGS skips adaptation we get no burn-in! jagsUI::jags has separate arguments for n.adapt and n.burnin, but I suspect most users are like me and plug in thousands of iterations for n.burnin and leave the default of n.adapt = 100. We would get much better mixing if we used thousands of iterations for n.adapt and none for post-adaptation burn-in. But we'd have to check the trace plots (as we always do…) to pick up the cases where adaptation was skipped and burn-in is needed.

When people say "mixing" in the context of Markov chain Monte Carlo (MCMC), they are (knowingly or unknowingly) referring to the "mixing time" of the Markov chain. Intuitively, the mixing time is the number of steps required for the Markov chain to come close to the stationary distribution (or, in the world of Bayesian statistics, the posterior distribution).

Complex models may require longer adaptive phases. If the adaptive phase is not sufficient for JAGS to optimize the samplers, a warning message will be printed (see example below). For a simple model like ours it's likely sufficient to leave the defaults, but this would be something to consider for more complex models.
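As a sketch of controlling this explicitly in rjags (n.adapt is a real argument of jags.model(); the iteration counts here are arbitrary choices, not recommendations):

mod_tuned <- jags.model(textConnection(model), data = dataForJags,
                        n.chains = 5, n.adapt = 1000) # explicit adaptation phase
update(mod_tuned, n.iter = 500) # separate post-adaptation burn-in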
When doing Bayesian analysis with samples from an MCMC algorithm, a key concern is convergence. Recall that MCMC algorithms SIMULATE the posterior distribution. Theory guarantees that the output from an MCMC run will eventually converge on the right posterior distribution (the target distribution), BUT for a less-than-infinite run we have to be careful. Under certain conditions (i.e., when the model is an accurate descriptor, with an appropriate number of iterations or combination of thinning and burn-in arguments), MCMC algorithms will draw a sample from the target posterior distribution after converging to equilibrium. In equilibrium, the distribution of samples from the chains should be the same regardless of the chains' initial starting values. This also implies that in equilibrium, the mean of every chain is the same.
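A quick empirical check of that last point, using the samples object created above: the per-chain means of a parameter should agree once the chains are in equilibrium.

colMeans(samples$beta1[1, , ]) # mean of beta1[1] in each of the 5 chains; these should be very similar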
When asking if the MCMC simulation has "converged", you are in part asking, "Has the simulated Markov chain fully explored the target posterior distribution so far, or do we need longer simulations?" There are diagnostics to make sure the MCMC samples converged on the target distribution. Visually, you can check traceplots, autocorrelation plots, and the posterior probability density for each estimated parameter.
In traceplots:
In autocorrelation plots:
The section below is taken from here.
Autocorrelation refers to a pattern of serial correlation in the chain, where sequential draws of a parameter, say β1, from the conditional distribution are correlated. The cause of autocorrelation is that the parameters in our model may be highly correlated, so the Gibbs Sampler will be slow to explore the entire posterior distribution.
Consider an MCMC chain that is strongly autocorrelated. Autocorrelation produces clumpy samples that are unrepresentative, in the short run, of the true underlying posterior distribution. Therefore, if possible, we would like to get rid of autocorrelation so that the MCMC sample provides a more precise estimate of the posterior. Typically, the level of autocorrelation will decline with an increasing number of lags in the chain (e.g., as we go from the 1000th to the 1010th lag, the level of autocorrelation will often decline). When this dampening doesn't occur, you have a problem and will probably want to re-parameterize the model (more on this below).
Only the first few lags should be highly correlated, since we used a Markov chain and it is natural to this process that each step depends on the previous one. The lag-k autocorrelation is the correlation between every sample and the sample k steps before it. This autocorrelation should become smaller as k increases, i.e., samples eventually become effectively independent. If, on the other hand, autocorrelation remains high for larger values of k, this indicates a high degree of correlation between our samples. See the image below for examples of non-autocorrelated and correlated samples.
It is common practice to thin the sample to reduce autocorrelation; however, this can reduce the precision of the estimate. If we keep 50,000 thinned steps with small autocorrelation, then we very probably have a more precise estimate of the posterior than from 50,000 unthinned steps with high autocorrelation. But to get 50,000 kept steps in a thinned chain, we needed to generate n*50,000 steps. With such a long chain, the clumpy autocorrelation has probably all been averaged out anyway! In fact, Link and Eaton show that the longer (unthinned) chain usually yields better estimates of the true posterior than the shorter thinned chain, even for percentiles in the tail of the distribution, at least for the particular cases they consider. So, consider keeping the thinning interval small and the number of iterations large.
The tricky part is knowing how long of a chain is long enough to smooth out short-run autocorrelation. The more severe the autocorrelation is, the longer the chain needs to be. I have not seen rules of thumb for how to translate an autocorrelation function into a recommended chain length. Link and Eaton suggest monitoring different independent chains and assaying whether the estimates produced by the different chains are suitably similar to each other. For extreme autocorrelation, it’s best not to rely on long-run averaging out, but instead to use other techniques that actually get rid of the autocorrelation. This usually involves reparameterization, as appropriate for the particular model.
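To quantify autocorrelation for our model, here is a sketch using coda (assuming the samples object from jags.samples() above; rjags provides as.mcmc.list() to convert a monitored node into a coda mcmc.list):

beta1_mcmc <- as.mcmc.list(samples$beta1) # one mcmc chain per JAGS chain
autocorr.plot(beta1_mcmc) # lag-k autocorrelation for each parameter and chain
effectiveSize(beta1_mcmc) # effective sample size after accounting for autocorrelation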
Gelman-Rubin Convergence Diagnostic & plot:
The Gelman-Rubin diagnostic evaluates MCMC convergence by analyzing the difference between multiple Markov chains. Convergence is assessed by comparing the estimated between-chains and within-chain variances for each model parameter. Large differences between these variances indicate nonconvergence. See the bottom-left panel in the figure above; values should be close to 1 for most iterations.
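A sketch of computing it with coda (again assuming the samples object from above, converted to an mcmc.list):

beta1_mcmc <- as.mcmc.list(samples$beta1)
gelman.diag(beta1_mcmc) # potential scale reduction factor; values near 1 indicate convergence
gelman.plot(beta1_mcmc) # shrink factor across iterations, as in the panel described above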
Troubleshooting:
If you do not have random, uncorrelated patterns:
I like using jags.samples() because the structure of its output helps me understand what is actually sampled. The MCMC object is a list of arrays, one for each estimated parameter. Recall that an array is a "stack" of matrices, so it has 3 dimensions: [rows, columns, number of matrices]. In this case the rows of the array represent the parameters estimated (here, the levels of Type), the columns represent the iterations, and the matrices represent the chains.
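A quick way to see that structure for yourself (assuming the samples object created above):

dim(samples$beta1)       # here 3 x 1000 x 5: [parameter levels, iterations, chains]
samples$beta1[2, 1:5, 1] # first five draws of beta1 for Type level 2, chain 1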
Below we check out some diagnostic plots before we get the effect sizes. Here we use the plot() function to output the traceplots. I like this method because you can practice extracting values from the MCMC object:
## trace plots: index as [parameter level, iteration, chain]
plot(samples$beta1[1,,1])  # beta1 for Type level 1, chain 1
plot(samples$beta1[2,,1])  # beta1 for Type level 2, chain 1
plot(samples$beta1[3,,1])  # beta1 for Type level 3, chain 1
plot(samples$beta1mu[,,1]) # hyperprior mean of beta1, chain 1
plot(samples$tau[,,1])     # likelihood precision, chain 1
Below we can check out the posterior distributions for our beta1 parameters. If you wish to include histograms of the posterior distributions in your publication, you would typically add details such as the HDI region (e.g., the 95% HDI) and a vertical line at the mean, or report the mode. There are wrapper plotting functions that do this, but I wanted to show how to hand-code these posteriors from the outputs.
# let's look at beta1 (treat)
plot(density(samples$beta1[1,,]), main = "Posterior Distribution")
summary(samples$beta1[1,,]) # gives you summary stats for all 5 chains. If your chains have converged, you can use one chain for reporting statistics
## V1 V2 V3 V4
## Min. :-0.20552 Min. :-0.21169 Min. :-0.18379 Min. :-0.27172
## 1st Qu.: 0.06069 1st Qu.: 0.05935 1st Qu.: 0.05774 1st Qu.: 0.06021
## Median : 0.13391 Median : 0.12980 Median : 0.13231 Median : 0.12762
## Mean : 0.13599 Mean : 0.13445 Mean : 0.13438 Mean : 0.13279
## 3rd Qu.: 0.21210 3rd Qu.: 0.21104 3rd Qu.: 0.21199 3rd Qu.: 0.21190
## Max. : 0.50950 Max. : 0.51510 Max. : 0.55107 Max. : 0.51274
## V5
## Min. :-0.22757
## 1st Qu.: 0.05222
## Median : 0.12784
## Mean : 0.12857
## 3rd Qu.: 0.20838
## Max. : 0.54093
plot(density(samples$beta1[2,,]), main = "Posterior Distribution")
summary(samples$beta1[2,,])
## V1 V2 V3 V4
## Min. :-0.2655 Min. :-0.21819 Min. :-0.32776 Min. :-0.33258
## 1st Qu.: 0.0475 1st Qu.: 0.03889 1st Qu.: 0.03344 1st Qu.: 0.03144
## Median : 0.1221 Median : 0.11447 Median : 0.11018 Median : 0.11209
## Mean : 0.1244 Mean : 0.11325 Mean : 0.10937 Mean : 0.11179
## 3rd Qu.: 0.2064 3rd Qu.: 0.18691 3rd Qu.: 0.18835 3rd Qu.: 0.18698
## Max. : 0.4529 Max. : 0.54620 Max. : 0.45933 Max. : 0.57282
## V5
## Min. :-0.33443
## 1st Qu.: 0.02866
## Median : 0.10736
## Mean : 0.10970
## 3rd Qu.: 0.19363
## Max. : 0.56303
plot(density(samples$beta1[3,,]), main = "Posterior Distribution")
summary(samples$beta1[3,,])
## V1 V2 V3 V4
## Min. :-0.17674 Min. :-0.24101 Min. :-0.21023 Min. :-0.22686
## 1st Qu.: 0.05851 1st Qu.: 0.05473 1st Qu.: 0.05457 1st Qu.: 0.04885
## Median : 0.13120 Median : 0.12994 Median : 0.12774 Median : 0.12221
## Mean : 0.13435 Mean : 0.13071 Mean : 0.12820 Mean : 0.12152
## 3rd Qu.: 0.20840 3rd Qu.: 0.20166 3rd Qu.: 0.20233 3rd Qu.: 0.19344
## Max. : 0.44945 Max. : 0.54783 Max. : 0.47132 Max. : 0.42307
## V5
## Min. :-0.21036
## 1st Qu.: 0.06024
## Median : 0.13185
## Mean : 0.13034
## 3rd Qu.: 0.19929
## Max. : 0.55079
There are other functions within rjags to do Gibbs sampling. coda.samples(), as opposed to jags.samples(), has convenient plotting and summary functions associated with it. coda.samples() is a wrapper for jags.samples(). Let's run this sampling function and then pull diagnostic AND output values from it.
jags_out <- rjags::coda.samples(
  mod,
  variable.names = c("beta1","mu","beta1mu","mumu","beta1tau","mutau","tau"),
  n.iter = 1000) # same arguments as we used previously
Below is a caterpillar plot of the posteriors for each level of treatment. In this case it represents how methane flux changes over time for each treatment level. Overall we see no difference among the treatments, but across all treatments methane flux increases over time.
library(MCMCvis)
MCMCplot(jags_out, params = 'beta1', ci = c(50, 95), col = 'blue',
         sz_labels = 1.5, sz_med = 2, sz_thick = 7, sz_thin = 3,
         sz_ax = 4, guide_lines = TRUE, HPD = TRUE, rank = TRUE)
The argument ci = c(50, 95) in the plotting function above indicates that the thick blue line is the 50% HDI and the thin blue line is the 95% HDI. HDI stands for Highest Density Interval. This is a way to summarize the posterior distribution, in addition to the mean: it indicates which points of a distribution are most credible and cover most of the distribution. A 95% HDI is the interval of values that covers 95% of the distribution.
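If you want the interval numbers themselves, coda's HPDinterval() computes highest posterior density intervals (a sketch using the jags_out object from above; here for chain 1):

HPDinterval(jags_out[[1]], prob = 0.95) # 95% HDI for each monitored parameter, chain 1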
Below is a summary of the stats you would report (i.e., the mean of the posterior and its credible intervals).
MCMCsummary(jags_out, round = 2)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## beta1[1] 0.14 0.12 -0.09 0.13 0.36 1.00 4312
## beta1[2] 0.11 0.12 -0.12 0.11 0.35 1.00 4665
## beta1[3] 0.13 0.11 -0.08 0.13 0.34 1.00 5065
## beta1mu 0.12 0.57 -0.59 0.12 0.83 1.11 6871
## beta1tau 9.92 9.54 0.33 7.19 35.54 1.00 2952
## mu[1] -0.05 0.11 -0.27 -0.05 0.18 1.00 5123
## mu[2] 0.05 0.12 -0.18 0.05 0.28 1.00 4729
## mu[3] 0.00 0.11 -0.22 0.00 0.21 1.00 5000
## mumu 0.01 0.43 -0.71 0.01 0.74 1.01 4668
## mutau 10.10 9.64 0.31 7.17 35.11 1.00 2907
## tau 1.00 0.09 0.83 0.99 1.19 1.00 4843
Outputting diagnostic plots using the MCMCvis package.
MCMCtrace(jags_out, pdf=FALSE)
This is another diagnostic plot using the functions in the kruschke.R file. Be sure to run that script to load those functions before running the code below. I like this one because it includes the autocorrelation plot, but the function is inefficient for large numbers of iterations and more complex models.
diagMCMC(jags_out) # the diagnostic plot will save to your folder. For models with many iterations this may not produce a complete output; if you want this output, consider decreasing the number of iterations in the model to 100-1000.