We first create some test data that will be used to fit our model.
trueA = 5
trueB = 0
trueSd = 10
sampleSize = 50
# create independent x-values
x =(-(sampleSize-1)/2):((sampleSize-1)/2)
# create dependent values according to ax + b + N(0,sd)
y = trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd)
data <- data.frame(x, y)
plot(x,y, main="Test Data")
So the true model is \(y = 5x + \varepsilon\), with \(\varepsilon \sim N(0, 10)\).
We are going to compare two algorithms, M-H vs. HMC, based on the assumed model: \[ \begin{equation} y \sim Normal(\mu, \sigma)\\ \mu = a x + b \\ a\sim Unif(0,10)\\ b\sim Normal(0,5)\\ \sigma\sim Unif(0,30) \end{equation} \]
Given initial values of a = 4, b = 0, sigma = 10, fit the linear model using the Metropolis-Hastings method with 10,000 iterations:
para0<-c(4,0,10)
chain = run_metropolis_MCMC(para0, 10000)
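The sampler run_metropolis_MCMC() is assumed to have been defined earlier in the document. If not, a minimal sketch consistent with the model above could look like the following (the proposal standard deviations are illustrative choices, so the acceptance rate they produce need not match the one reported below):
# Hedged sketch of a random-walk Metropolis-Hastings sampler for the model above;
# it relies on the x and y simulated earlier.
loglik <- function(param) {
  a <- param[1]; b <- param[2]; sd <- param[3]
  sum(dnorm(y, mean = a * x + b, sd = sd, log = TRUE))
}
logprior <- function(param) {
  dunif(param[1], min = 0, max = 10, log = TRUE) +
    dnorm(param[2], mean = 0, sd = 5, log = TRUE) +
    dunif(param[3], min = 0, max = 30, log = TRUE)
}
logposterior <- function(param) {
  lp <- logprior(param)
  if (!is.finite(lp)) return(-Inf)  # outside the prior support: never accept
  lp + loglik(param)
}
run_metropolis_MCMC <- function(startvalue, iterations) {
  chain <- matrix(NA_real_, nrow = iterations + 1, ncol = 3)
  chain[1, ] <- startvalue
  for (i in 1:iterations) {
    # symmetric normal proposal around the current state (step sizes are illustrative)
    proposal <- rnorm(3, mean = chain[i, ], sd = c(0.1, 0.5, 0.3))
    if (runif(1) < exp(logposterior(proposal) - logposterior(chain[i, ]))) {
      chain[i + 1, ] <- proposal
    } else {
      chain[i + 1, ] <- chain[i, ]
    }
  }
  chain
}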
Calculate the acceptance rate (about 68.9%) after discarding a burn-in of 3000 iterations:
burnIn = 3000
acceptance = 1-mean(duplicated(chain[-(1:burnIn),]))
acceptance
#> [1] 0.6890444
Have a look at the graphical summaries of a, b and sigma. The trace plots show that the chain values, especially for b and sigma, do not settle around their true values, and the plots also show apparent autocorrelation, in spite of the high acceptance rate.
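A possible way to reproduce these graphical summaries from the chain matrix (columns in the order a, b, sd, matching the initial values) is sketched below:
# Hedged sketch: histograms and trace plots of a, b and sd after burn-in,
# with the true values marked in red.
par(mfrow = c(2, 3))
hist(chain[-(1:burnIn), 1], main = "Posterior of a", xlab = "a"); abline(v = trueA, col = "red")
hist(chain[-(1:burnIn), 2], main = "Posterior of b", xlab = "b"); abline(v = trueB, col = "red")
hist(chain[-(1:burnIn), 3], main = "Posterior of sd", xlab = "sd"); abline(v = trueSd, col = "red")
plot(chain[-(1:burnIn), 1], type = "l", ylab = "a", main = "Trace of a"); abline(h = trueA, col = "red")
plot(chain[-(1:burnIn), 2], type = "l", ylab = "b", main = "Trace of b"); abline(h = trueB, col = "red")
plot(chain[-(1:burnIn), 3], type = "l", ylab = "sd", main = "Trace of sd"); abline(h = trueSd, col = "red")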
Next, we use HMC to fit the linear model via brms. The following code runs four independent Markov chains for the simple linear model using HMC, distributing them across separate cores:
library(brms)
HMC_lin <- brm(data = data, y ~ x, family = gaussian(),
               prior = c(set_prior("uniform(0, 10)", coef = "x"),
                         set_prior("normal(0, 5)", class = "Intercept"),
                         set_prior("uniform(0, 30)", class = "sigma")),
               chains = 4, cores = 4, seed = 9)
Looking at the following summary, we can compare the simulated estimates of the intercept, the coefficient of x, and sigma with the true values 0, 5 and 10. The HMC estimates look relatively close to the true values, compared with the M-H method. Rhat for each parameter is 1, indicating convergence of the 4 chains.
print(HMC_lin) #true value: intercept=0, coef(x)=5, sigma=10
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: y ~ x
#> Data: data (Number of observations: 50)
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup samples = 4000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 1.05 1.73 -2.41 4.42 1.00 3959 2816
#> x 5.08 0.12 4.85 5.32 1.00 3864 2726
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 12.32 1.34 10.06 15.28 1.00 3555 2721
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Note from the summary that there are 4000 post-warmup samples in total: by default each chain runs for 2000 iterations and uses the first half (1000) of them for warmup/adaptation, leaving 1000 post-warmup draws per chain across the 4 chains. The Bulk_ESS and Tail_ESS values should therefore be compared against this total of 4000; values this close to the total indicate that the post-warmup draws behave almost like independent samples. If more precision were needed, we could increase iter to draw a larger sample.
Use pairs() to get a direct appreciation of the parameter correlations and of what the actual posterior density looks like:
pairs(HMC_lin,
off_diag_args = list(size = 1/3, alpha = 1/3))
The simulated posterior distribution looks nearly multivariate Gaussian (so the Markov chains appear to be behaving correctly), while the density for sigma is skewed. (Q: How to explain the skewness of sigma?) To further obtain Pearson's correlation coefficients between the population-level parameters:
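One way to obtain such a matrix (an assumption about how the output below was produced) is vcov(), which for a brms fit returns the posterior covariance or, with correlation = TRUE, the correlation matrix of the population-level effects:
# Pearson correlation matrix of the population-level effects (Intercept and x)
vcov(HMC_lin, correlation = TRUE)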
#> Intercept x
#> Intercept 1.00000000 0.01480511
#> x 0.01480511 1.00000000
Compared with other algorithms like M-H, HMC can tell us when things are going wrong. Let's look at two chain visualizations: the trace plot and the trace rank plot (trank plot). Trace plot: simply plots the samples in sequential order, joined by a line. In brms, calling plot() on a brm() fit returns both density and trace plots for the parameters.
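For instance, for the model fitted above:
# density and trace plots for all parameters of HMC_lin
plot(HMC_lin)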
Recall: what makes a chain a healthy one? Three things to look for in these trace plots: (1) stationarity (the trace stays around a stable mean), (2) good mixing (the chain rapidly explores the full region; it does not slowly wander, but rapidly zig-zags around), and (3) convergence (multiple independent chains stick around the same region of high probability). Here we make autocorrelation plots for all model parameters, one for each HMC chain:
library(bayesplot)
library(ggplot2)  # for vars()
# posterior draws with chain information (if not already extracted earlier);
# newer brms versions prefer as_draws_df() over posterior_samples()
post <- posterior_samples(HMC_lin, add_chain = TRUE)
mcmc_acf(post, pars = vars(b_Intercept, b_x, sigma), lags = 5)
The correlations drop rapidly to 0 by lag 1, meaning draws 1 iteration apart are essentially independent (thin = 1). Good!
Trank plot: a histogram of sample ranks for each individual chain. If the chains are exploring the same space efficiently, the histograms should be similar to one another and largely overlapping.
library(ggplot2)
library(magrittr)
post %>%
  mcmc_rank_overlay(pars = vars(b_Intercept, b_x, sigma)) +
  coord_cartesian(ylim = c(30, NA))
This trank plot is what we hope for: Histograms that overlap and stay within the same range.
Consider the famous example on genetic linkage (Rao, 1965, pp. 368-369). A particular genetic model gives the probabilities of the four genotypes AB, Ab, aB and ab as \(\frac{1}{4} (2+\theta)\), \(\frac{1}{4} (1-\theta)\), \(\frac{1}{4} (1-\theta)\) and \(\frac{1}{4} \theta\), where \(0< \theta<1\) is the square of the recombination fraction and is the parameter we wish to estimate. This is a multinomial distribution with four cells, and the likelihood function is
\[
\begin{equation}
f(\textbf{x}|\theta)\propto \left[\frac{1}{4} (2+\theta)\right]^{x_1} \left[\frac{1}{4} (1-\theta)\right]^{x_2}\left[\frac{1}{4} (1-\theta)\right]^{x_3} \left[\frac{1}{4} \theta\right]^{x_4}
\end{equation}
\] where \(\textbf{x}=(x_1,\dots,x_4)\) are the observed frequencies of the four genotypes. We observe the frequencies \(x_1=125, x_2=18, x_3=20, x_4=34\).
Given the non-informative prior \(\pi(\theta)\propto 1\), we first use a random walk Metropolis algorithm to generate samples of \(\theta\) from the posterior distribution. Since the normal proposal is symmetric and the prior is flat, the acceptance probability is just the ratio of the likelihoods: \[ \begin{equation} \alpha=\min \left\{ 1,\frac{\left[\frac{1}{4} (2+\theta^*)\right]^{125} \left[\frac{1}{4} (1-\theta^*)\right]^{18}\left[\frac{1}{4} (1-\theta^*)\right]^{20} \left[\frac{1}{4} \theta^*\right]^{34}}{\left[\frac{1}{4} (2+\theta)\right]^{125} \left[\frac{1}{4} (1-\theta)\right]^{18}\left[\frac{1}{4} (1-\theta)\right]^{20} \left[\frac{1}{4} \theta\right]^{34}} \right\},\qquad 0<\theta^*<1. \end{equation} \] The corresponding R code for the random walk Metropolis algorithm is:
genetic.metropolis <- function(m, sigma, theta0) {
  accept <- 0
  theta <- numeric(m)   # pre-allocate the chain
  theta[1] <- theta0
  for (i in 1:(m - 1)) {
    # normal random walk proposal around the current value; sigma is the scaling to be tuned
    prop <- rnorm(1, theta[i], sigma)
    if (prop < 0 || prop > 1) {
      # proposal outside (0, 1): posterior density is zero, so always reject
      ratio <- 0
    } else {
      # likelihoods at the proposed and current values (the 1/4 constants cancel in the ratio)
      lp <- (0.25 * (2 + prop)^125) * (0.25 * (1 - prop)^18) * (0.25 * (1 - prop)^20) * (0.25 * prop^34)
      lt <- (0.25 * (2 + theta[i])^125) * (0.25 * (1 - theta[i])^18) * (0.25 * (1 - theta[i])^20) * (0.25 * theta[i]^34)
      ratio <- lp / lt
    }
    alpha <- min(1, ratio)
    if (runif(1) < alpha) {
      theta[i + 1] <- prop
      accept <- accept + 1
    } else {
      theta[i + 1] <- theta[i]
    }
  }
  return(list(theta = theta, accept = accept / m))
}
Set the proposal scaling \(\sigma = 0.2\) to achieve a moderate acceptance rate (about 0.3 here), then apply the algorithm and examine the samples of \(\theta\):
theta_M<-genetic.metropolis(10000, 0.2, 0)
theta_M$accept
#> [1] 0.2993
plot(density(theta_M$theta),main="Kernel density estimate",xlab="theta")
summary(theta_M$theta)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0000 0.5871 0.6240 0.6219 0.6579 0.8059
We find that the density plot has a long left-hand tail, which may be caused by the choice of starting value \(\theta^{(0)}=0\). Check by looking at the trace plot:
plot(1:10000,theta_M$theta,type="l",xlab="iteration",ylab="theta",main="Trace plot")
acf(theta_M$theta,ylab="autocorrelation",main="Autocorrelation plot")
Trace plot: a healthy chain, mixing well and not stuck in any local region; we may want to drop the first 1000 observations as burn-in. Autocorrelation plot: the correlation drops rapidly to 0 by lag 10, meaning draws 10 iterations apart are approximately independent (suggesting nthin = 10).
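A possible way to apply the burn-in and thinning just suggested (the values 1000 and 10 are taken from the discussion above):
# drop the first 1000 draws as burn-in, then keep every 10th draw
theta_kept <- theta_M$theta[seq(from = 1001, to = 10000, by = 10)]
summary(theta_kept)
acf(theta_kept, ylab = "autocorrelation", main = "Autocorrelation after burn-in and thinning")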