Example 1. Linear Regression using M-H vs HMC

1. Create the dataset and define the statistical model

We first create some test data that will be used to fit our model.

trueA = 5
trueB = 0
trueSd = 10
sampleSize = 50
 
# create independent x-values
x =(-(sampleSize-1)/2):((sampleSize-1)/2)
# create dependent values according to ax + b + N(0,sd)
y = trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd)
data <- data.frame(x, y)
 
plot(x,y, main="Test Data")

So the true model is \(y = 5x + \varepsilon,\ \varepsilon \sim N(0,10)\), where 10 is the error standard deviation.

We are going to compare two algorithms, M-H vs HMC, based on the assumed model: \[ \begin{aligned} y &\sim Normal(\mu, \sigma)\\ \mu &= a x + b \\ a &\sim Unif(0,10)\\ b &\sim Normal(0,5)\\ \sigma &\sim Unif(0,30) \end{aligned} \]

2. Metropolis-Hastings MCMC (simple linear model)

Given initial values a = 4, b = 0, sigma = 10, fit the linear model using the Metropolis-Hastings method with 10000 iterations:

para0<-c(4,0,10)
chain = run_metropolis_MCMC(para0, 10000)
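
The helper run_metropolis_MCMC() is not defined in this section; a minimal sketch of what it could look like, consistent with the priors assumed above (a ~ Unif(0,10), b ~ Normal(0,5), sigma ~ Unif(0,30)) and using a Gaussian random-walk proposal with hypothetical step sizes, is:

# Sketch only: log-likelihood, log-prior, and a random-walk Metropolis sampler
# for param = c(a, b, sd); the proposal step sizes below are hypothetical tuning choices.
loglik <- function(param) {
  a <- param[1]; b <- param[2]; sd <- param[3]
  sum(dnorm(y, mean = a * x + b, sd = sd, log = TRUE))
}
logprior <- function(param) {
  dunif(param[1], 0, 10, log = TRUE) +
    dnorm(param[2], 0, 5, log = TRUE) +
    dunif(param[3], 0, 30, log = TRUE)
}
logposterior <- function(param) {
  lp <- logprior(param)
  if (!is.finite(lp)) return(-Inf)  # proposal outside the prior support
  lp + loglik(param)
}
proposal <- function(param) rnorm(3, mean = param, sd = c(0.1, 0.5, 0.3))

run_metropolis_MCMC <- function(startvalue, iterations) {
  chain <- matrix(NA, nrow = iterations + 1, ncol = 3)
  chain[1, ] <- startvalue
  for (i in 1:iterations) {
    prop <- proposal(chain[i, ])
    # accept with probability min(1, posterior ratio), computed on the log scale
    if (log(runif(1)) < logposterior(prop) - logposterior(chain[i, ])) {
      chain[i + 1, ] <- prop
    } else {
      chain[i + 1, ] <- chain[i, ]
    }
  }
  chain
}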

Calculate the acceptance rate (about 68.9%) after discarding a 3000-iteration burn-in:

burnIn = 3000
acceptance = 1-mean(duplicated(chain[-(1:burnIn),]))
acceptance
#> [1] 0.6890444

Have a look at the graphical summaries of a, b, and sigma. From the trace plots we find that the chain values, especially for b and sigma, do not settle around their true values, and the plots also show apparent autocorrelation, in spite of the high acceptance rate.
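
The code for those graphical summaries is not shown above; assuming the chain matrix layout from the sketch of run_metropolis_MCMC() (columns a, b, sd), one way to draw them is:

par(mfrow = c(2, 3))
# posterior histograms with the true values marked in red
hist(chain[-(1:burnIn), 1], main = "Posterior of a",  xlab = "a");  abline(v = trueA,  col = "red")
hist(chain[-(1:burnIn), 2], main = "Posterior of b",  xlab = "b");  abline(v = trueB,  col = "red")
hist(chain[-(1:burnIn), 3], main = "Posterior of sd", xlab = "sd"); abline(v = trueSd, col = "red")
# trace plots of the post-burn-in samples
plot(chain[-(1:burnIn), 1], type = "l", ylab = "a",  main = "Trace of a");  abline(h = trueA,  col = "red")
plot(chain[-(1:burnIn), 2], type = "l", ylab = "b",  main = "Trace of b");  abline(h = trueB,  col = "red")
plot(chain[-(1:burnIn), 3], type = "l", ylab = "sd", main = "Trace of sd"); abline(h = trueSd, col = "red")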

3. HMC (simple linear model)

Next, we use HMC to fit the linear model via brms. The code below runs four independent Markov chains for the simple linear model using HMC and distributes them across separate cores:

library(brms)
HMC_lin <- brm(y ~ x, data = data, family = gaussian(),
               prior = c(set_prior("uniform(0, 10)", class = "b", coef = "x"),
                         set_prior("normal(0, 5)", class = "Intercept"),
                         set_prior("uniform(0, 30)", class = "sigma")),
               chains = 4, cores = 4, seed = 9)

When we look at the following summary, compare the simulated estimates of the intercept, the coefficient of x, and sigma with the true values 0, 5, and 10. The HMC estimates look relatively close to the true values, compared with the M-H method. Rhat for each parameter is 1, indicating convergence of the 4 chains.

print(HMC_lin) #true value: intercept=0, coef(x)=5, sigma=10
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: y ~ x 
#>    Data: data (Number of observations: 50) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept     1.05      1.73    -2.41     4.42 1.00     3959     2816
#> x             5.08      0.12     4.85     5.32 1.00     3864     2726
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma    12.32      1.34    10.06    15.28 1.00     3555     2721
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Also notice the sample-size arithmetic: each chain is run for iter = 2000 iterations and, by default, the first half (warmup = 1000) is used for adaptation and discarded, so the 4 chains contribute 4 × 1000 = 4000 post-warmup samples in total. Bulk_ESS and Tail_ESS should therefore be judged against 4000 draws, not 2000, and their large values (roughly 2700-4000 here) indicate low autocorrelation in the draws. If more draws were wanted, the number of iterations could simply be increased (say, iter = 4000), as sketched below.
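
A possible way to do that, using brms's update() method and keeping the other settings (the numbers here are illustrative):

# Hypothetical: rerun the same model with more iterations per chain
HMC_lin_more <- update(HMC_lin, iter = 4000, warmup = 1000,
                       chains = 4, cores = 4, seed = 9)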

 

Use pairs() to get a direct appreciation of the parameter correlations and of what the actual posterior density has turned out to look like:

pairs(HMC_lin,
      off_diag_args = list(size = 1/3, alpha = 1/3))

The simulated posterior distribution looks nearly multivariate Gaussian (so the Markov chains appear to be behaving correctly), while the density for sigma is right-skewed. (Q: How to explain the skewness of sigma? A likely reason is that sigma is constrained to be positive, and the marginal posterior of a scale parameter is typically right-skewed.) To further obtain Pearson's correlation coefficients:
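
The matrix below matches what vcov() on a brmsfit returns for the population-level effects; a possible call is:

vcov(HMC_lin, correlation = TRUE)  # correlation matrix of the population-level effects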

#>            Intercept          x
#> Intercept 1.00000000 0.01480511
#> x         0.01480511 1.00000000

Compared with other algorithms like M-H, HMC can tell us when things are going wrong. Let's look at two chain visualizations: the trace plot and the trace rank plot (trank plot). Trace plot: merely plots the samples in sequential order, joined by a line. In brms, calling plot() on a brm() fit returns both density and trace plots for the parameters.
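
For example:

plot(HMC_lin)  # density and trace plots for b_Intercept, b_x, and sigma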

Recall: what makes a chain a healthy one? Three things to look for in these trace plots: (1) stationarity (the traces stay around a stable mean), (2) good mixing (the chain rapidly explores the full region; it doesn't wander slowly, but rather rapidly zig-zags around), and (3) convergence (multiple, independent chains stick around the same region of high probability). Here we make autocorrelation plots for all model parameters, one for each HMC chain:
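
The object post used below is not defined earlier in this section; presumably it holds the posterior draws in a format bayesplot accepts, e.g. an iterations × chains × parameters array:

library(bayesplot)
library(ggplot2)           # provides vars() for parameter selection
post <- as.array(HMC_lin)  # draws array with one slice per chain, so mcmc_acf() can show each chain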

mcmc_acf(post, pars = vars(b_Intercept, b_x, sigma), lags = 5)

The autocorrelations drop rapidly to 0 by lag 1, meaning draws 1 iteration apart are essentially independent (so thin = 1 is fine). Good!


Trank plot: a histogram of sample ranks for each individual chain. If the chains are exploring the same space efficiently, the histograms should be similar to one another and largely overlapping.

library(ggplot2)
library(magrittr)
post %>% 
  mcmc_rank_overlay(pars = vars(b_Intercept, b_x,sigma)) +
  coord_cartesian(ylim = c(30, NA)) 

This trank plot is what we hope for: Histograms that overlap and stay within the same range.

Example 2: random walk Metropolis vs HMC

Consider the famous example on genetic linkage (Rao, 1965, pp. 368-369). A particular genetic model gives the probabilities of the four genotypes AB, Ab, aB and ab as \(\frac{1}{4} (2+\theta)\), \(\frac{1}{4} (1-\theta)\), \(\frac{1}{4} (1-\theta)\) and \(\frac{1}{4} \theta\), where \(0< \theta<1\) is the square of the recombination fraction and is the parameter we wish to estimate. This is a multinomial distribution with four cells, and the likelihood function is
\[ \begin{equation} f(\textbf{x}|\theta)\propto \left[\frac{1}{4} (2+\theta)\right]^{x_1} \left[\frac{1}{4} (1-\theta)\right]^{x_2}\left[\frac{1}{4} (1-\theta)\right]^{x_3} \left[\frac{1}{4} \theta\right]^{x_4} \end{equation} \] where \(\textbf{x}=(x_1,\dots,x_4)\) are the observed frequencies of the four genotypes. We observe the frequencies \(x_1=125, x_2=18, x_3=20, x_4=34\).

1. Using random walk Metropolis with a non-informative prior

Given the non-informative prior \(\pi(\theta)\propto 1\), we first use the random walk Metropolis algorithm to generate samples of \(\theta\) from the posterior distribution; the acceptance probability is then just the ratio of the likelihoods: \[ \begin{equation} \alpha=\min \left\{ 1,\frac{\left[\frac{1}{4} (2+\theta^*)\right]^{125} \left[\frac{1}{4} (1-\theta^*)\right]^{18}\left[\frac{1}{4} (1-\theta^*)\right]^{20} \left[\frac{1}{4} \theta^*\right]^{34}}{\left[\frac{1}{4} (2+\theta)\right]^{125} \left[\frac{1}{4} (1-\theta)\right]^{18}\left[\frac{1}{4} (1-\theta)\right]^{20} \left[\frac{1}{4} \theta\right]^{34}} \right\},\qquad 0<\theta<1. \end{equation} \] The corresponding R code for the random walk Metropolis algorithm is:

genetic.metropolis <- function(m, sigma, theta0) {
  accept <- 0
  theta <- numeric(m)   # preallocate the chain
  theta[1] <- theta0
  for (i in 1:(m - 1)) {
    # normal random-walk proposal, with sigma to be tuned
    prop <- rnorm(1, theta[i], sigma)
    if (prop < 0 || prop > 1) {
      # proposal outside (0,1): zero posterior density, always rejected
      ratio <- 0
    } else {
      # likelihood at the proposed and current values (the 1/4 factors cancel in the ratio)
      lp <- (0.25 * (2 + prop)^125) * (0.25 * (1 - prop)^18) * (0.25 * (1 - prop)^20) * (0.25 * prop^34)
      lt <- (0.25 * (2 + theta[i])^125) * (0.25 * (1 - theta[i])^18) * (0.25 * (1 - theta[i])^20) * (0.25 * theta[i]^34)
      ratio <- lp / lt
    }
    alpha <- min(1, ratio)
    if (runif(1) < alpha) {
      theta[i + 1] <- prop
      accept <- accept + 1
    } else {
      theta[i + 1] <- theta[i]
    }
  }
  return(list(theta = theta, accept = accept / m))
}

Set the proposal scale \(\sigma = 0.2\), which gives an acceptance rate of roughly 0.2-0.3 (in the commonly recommended range), then apply the algorithm and examine the samples of \(\theta\):

theta_M<-genetic.metropolis(10000, 0.2, 0)
theta_M$accept
#> [1] 0.2993
plot(density(theta_M$theta),main="Kernel density estimate",xlab="theta")

summary(theta_M$theta)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.5871  0.6240  0.6219  0.6579  0.8059

We find that the density plot has a long left-hand tail, which may be caused by the choice of starting value \(\theta^{(0)}=0\). Check by looking at the trace plot:

plot(1:10000,theta_M$theta,type="l",xlab="iteration",ylab="theta",main="Trace plot")

 
acf(theta_M$theta,ylab="autocorrelation",main="Autocorrelation plot")

Trace plot: the chain looks healthy and mixes well, and is not stuck in any local region; we may want to drop the first 1000 observations as burn-in. Autocorrelation plot: the correlation drops rapidly to 0 by lag 10, meaning draws 10 iterations apart are roughly independent (so a thinning interval of nthin = 10 would be reasonable; see the sketch below).
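
A sketch of applying that burn-in and thinning to the stored draws (the cutoffs are the ones suggested above):

# drop a 1000-iteration burn-in, then keep every 10th draw
theta_post <- theta_M$theta[-(1:1000)]
theta_thin <- theta_post[seq(1, length(theta_post), by = 10)]
summary(theta_thin)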

2. Question: I cannot get my head around whether HMC could be applied in this case.