Markov Chain Monte Carlo Method

A regime for random sample generation

Arpan Dutta Soumyajit Roy

Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is a Markov Chain Monte Carlo (MCMC) method used to generate samples from a target distribution when direct sampling is difficult. It is particularly useful for Bayesian inference and other probabilistic models.

  1. Choose an initial state \(x^{(0)}\).

  2. For each iteration \(t\):

    • Propose a new state \(x'\) from a proposal distribution \(q(x'|x^{(t)})\).

    • Calculate the acceptance probability \(\alpha(x^{(t)}, x') = \min \left(1, \frac{p(x') q(x^{(t)}|x')}{p(x^{(t)}) q(x'|x^{(t)})}\right)\), where \(p(x)\) is the target distribution.

    • Generate a uniform random number \(u \sim U(0, 1)\).

    • If \(u \leq \alpha(x^{(t)}, x')\), set \(x^{(t+1)} = x'\), otherwise set \(x^{(t+1)} = x^{(t)}\).

  3. Repeat step 2 for a sufficient number of iterations to obtain the desired number of samples.

Properties

  • The Metropolis-Hastings algorithm generates a Markov chain whose stationary distribution is the target distribution \(p(x)\).
  • The algorithm requires a proposal distribution \(q(x'|x)\), which should satisfy some conditions such as being symmetric if possible.
  • The choice of proposal distribution can impact the efficiency of the algorithm, with a good choice leading to faster convergence.
Metropolis-Hastings Algorithm
mcmc=function(x,p,n,pi)
{
  data=x
  while(length(data)<n){
    y=p(x,sam=TRUE) #-------a sample from p(theta|x)
    alpha=min(pi(y)*p(y)/(pi(x)*p(x)),1)
    u=runif(1)
    if(alpha>u){
      data=c(data,y)
      x=y }
    else{
      data=c(data,x)}
    }
  return(data)
}

p=function(x,sam=FALSE)
{
  if(sam){return(rnorm(1,x,1)) }
  else { return(dnorm(x,mean = x,sd = 1))}
}

Example

Taking sample from a mixed distribution

Code
p=function(x,sam=FALSE)
{
  if(sam)
  {
    return(rnorm(1,x,1))
  }
  else
  {
    return(dnorm(x,mean = x,sd = 1))
  }
}

pi=function(x) {return(dnorm(x,0,1)*0.9+dunif(x)*0.1)}

#--------need to draw a sample from p(theta|x)
da=mcmc(1,p,10000,pi)
x=seq(-20,20,length=1000)
par(mfrow=c(2,1))
hist((da[-(1:500)]),prob=TRUE,col = "cyan",main="Histogram of sample",xlab="value")
lines(density(da[-(1:500)]))
a=density(da)
polygon(a$x,a$y,density =50,col="blue")



curve((dnorm(x,0,1)*0.9+dunif(x)*0.1),xlim=c(-4,4),ylab="density")
polygon(x,(dnorm(x,0,1)*0.9+dunif(x)*0.1),density=50,col="green")
mtext("N(0,1)*0.9+unif(0,1)*0.1",outer = TRUE,line = -1)

Taking sample from a complex posterior from a Bayesian Distribution

\[ f\left(x\right)\propto\exp\left(-\frac{x^{2}}{2}\right)\left(\sin\left(6x\right)^{2}+3\cos\left(x\right)^{2}\sin\left(4x\right)^{2}+1\right) \]

Prior
pi1=function(x) 
{
  return((sin((6*x))**2)+3*(cos(x))**2*(sin(4*x))**2+1)
}
Proposed
p1=function(x,sam=FALSE)
{
  if(sam) { return(rnorm(1,0,1))}
  else { return(dnorm(x,mean = 0,sd = 1))}
}

Code
a=mcmc(x=0,p=p1,pi=pi1,n=10000)
par(mfrow=c(2,1))
hist(a,probability = TRUE,ylim=c(0,1),col="cyan")
lines(density(a),col="red")

yden=density(a)
polygon(yden$x,yden$y,density = 20,col="blue")

curve(pi1(x)*exp(-0.5*x**2),xlim = c(-3,3))
polygon(x=x,y=pi1(x)*exp(-0.5*x**2),col="blue")
mtext("(sin((6*x))^2)+3*(cos(x))^2*(sin(4*x))^2+1",outer = FALSE,line = -1)

Beta(2.5,5.9) generate

Code
p=function(x,sam=FALSE)
{
  if(sam)
  {
    return(rnorm(1,x,1))
  }
  else
  {
    return(dnorm(x,mean = x,sd = 1))
  }
}

pi=function(x) {return(dbeta(x,2.5,5.9))}

#--------need to draw a sample from p(theta|x)
da=mcmc(0.5,p,10000,pi)
x=seq(-20,20,length=1000)
par(mfrow=c(2,1))
hist((da[-(1:500)]),prob=TRUE,col = "cyan",main="Histogram of sample",xlab="value")
lines(density(da[-(1:500)]))
a=density(da)
polygon(a$x,a$y,density =50,col="blue")



curve(pi(x),xlim=c(0,1))
polygon(x,pi(x),density=50,col="green")
mtext("Beta example",outer = TRUE,line = -1)

Limitations

We first simulate two sets of observations \(X_1, . . . , X_n ∼ Normal(\mu_X,1) and Y1, . . . , Yn ∼ Normal(\mu_Y, 1).\) However, suppose these data are not observable and we only know $ Z_i = X_i + Y_i $

$ i = 1, . . . , n $ Thus, $ Z_i ∼ Normal(_X + _Y, 2).$We want to estimate \(\mu_X\) and \(\mu_X\). We choose priors \(\mu_X,\mu_Y \sim N(0,100^2)\)

Code
mu.X <- 2
mu.Y <- 4
n=100
set.seed(100)
X <- rnorm(n, mean = mu.X, sd = 1)
Y <- rnorm(n, mean = mu.Y, sd = 1)
Z <- X + Y
data <- list(Z = Z, n = n)

Contd.

Code
require(rjags,quietly = TRUE)
model_string <- textConnection("model{
# Likelihood (dnorm uses a precision, not variance)
for(i in 1:n){
Z[i] ~ dnorm(mu.X + mu.Y, 1 / 2)
}
# Priors
mu.X ~ dnorm(0, 0.0001)
mu.Y ~ dnorm(0, 0.0001)
}")

inits <- list(mu.X = mean(Z)/2, mu.Y = mean(Z)/2)
model <- jags.model(model_string, data = data,inits = inits, quiet = TRUE, n.chains =2)
update(model, 1000, progress.bar = "none")
params <- c("mu.X", "mu.Y")
samples <- coda.samples(model,

variable.names = params,
n.iter = 5000, progress.bar = "none")
plot(samples )

Linear Regression

Let us consider that we have a dataset with covariates as say temperature(in Farenhite)and response variable is a factor variable denoting “success” by 1 and failure by 0.In classical approach we can fit a regression model.Let us start with \[ y_i\sim N(\beta_0+\beta_1 x_i,\sigma^2) ,\ i=1,2,...,n \]

where \((\beta_0,\beta_1,\sigma)\) are the unknown parametres.

Estimation(Bayesian Analysis)

In classical approach we fit a Linear model and found the LS estimates to be \(\hat\beta_0,\hat\beta_1\)

Now we want to use prior information about \(\beta's\) for estimation.Here \(\beta's\) takes value in \(\mathbb R\) so we are taking Prior distribution as independent \(N(0,100^2)\) and \(\sigma^2\sim InverseGamma(0.001,0.001)\).Our posterior distribution is the following, \[ \pi(\alpha|x,y) \propto L(\alpha|x,y)\pi(\alpha|\hat\alpha,\hat\sigma_\alpha) \]

Trace plot

Code
require(rjags)
X <- runif(500)
Y <- rnorm(500, mean = 5 + 2 * X, sd = 1)

data <- list(Y =Y, X = X, n = length(X))
model_string <- textConnection("model{
# Likelihood (dnorm uses a precision, not variance)
for(i in 1:n){
Y[i] ~ dnorm(beta0 + beta1 * X[i], tau) #tau = 1/sigma^2
}
# Priors
tau ~ dgamma(0.01, 0.01)
sigma <- 1/sqrt(tau)
beta0 ~ dnorm(0, 0.0001)
beta1 ~ dnorm(0, 0.0001)
}")

inits <- list(beta0 = mean(X), beta1 = 0, tau = 1 / var(X))
model <- jags.model(model_string, data = data, inits = inits, quiet = TRUE)
update(model, 10000, progress.bar = "none")

params <- c("beta0", "beta1", "sigma")
samples <- coda.samples(model,
variable.names = params,
n.iter = 20000, progress.bar = "none")
plot(samples)

Gibbs Sampling Method

Gibbs sampling for the gaussian model

\[ Y_{i} \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2)\hspace{0.2cm} i = 1,2,...,n\] \[ \mu \sim \mathcal{N}(0,\sigma_m ^2),\hspace{0.2cm} \sigma^{2} \sim \text{InvGamma}(a, b)\]

\[ \mu|\sigma^{2},Y \sim \mathcal{N} (\dfrac{n\bar Y\sigma^{-2}+\mu_0\sigma_m ^{-2}}{n\sigma^{-2}+\sigma_m ^{-2}}, \dfrac{1}{n\sigma^{-2}+\sigma_m ^{-2}})\]

\[\sigma^{2} | \mu, Y \sim InvGamma ( a + \frac{n}{2}, \dfrac{1}{2} \sum_{i=1}^{n} (Y_{i} - \mu)^{2} + b )\] - we want to generate samples from posterior distribution

How Does it works?

  • step 1: take a initial value: \(\mu_o\) and \(\sigma_0 ^2\)
  • step 2:
    generate \(\mu_1\) from: \(\mu|y,\sigma_0 ^2\sim \mathcal{N} (\dfrac{n\bar Y\sigma_0 ^{-2}+\mu_0\sigma_m ^{-2}}{n\sigma_0 ^{-2}+\sigma_m ^{-2}}, \dfrac{1}{n\sigma_0 ^{-2}+\sigma_m ^{-2}})\)
    generate \(\sigma_1 ^2\) from: \(\sigma^{2} | \mu_0, y\sim InvGamma ( a + \frac{n}{2}, \dfrac{1}{2} \sum_{i=1}^{n} (Y_{i} - \mu_0)^{2} + b )\)
    \(\mu_1\) and \(\sigma_1 ^2\) is our required sample
  • step 3: We repeat step 2 n times and get the posterior draws

R Code

R Code
# given data from N(5,1) of size 100
mu <- 5
sigmaSq <- 1
set.seed(100)
Y <- rnorm(100, mean = mu,sd = sqrt(sigmaSq))

mu.update <- function(Y, sigmaSq, mu0, sigmaSq0){
mu.posvar.inv <- length(Y) / sigmaSq + 1 / sigmaSq0
mu.posvar <- 1 / mu.posvar.inv
mu.posmean <- (sum(Y) / sigmaSq + mu0 / sigmaSq0) / mu.posvar.inv
out <- mu.posmean + sqrt(mu.posvar) * rnorm(1)
out}
sigmaSq.update <- function(Y, mu, a, b){
sigmaSq.inv <- rgamma(1, shape = a + length(Y) / 2,
rate = b + sum((Y - mu)^2) / 2)
sigmaSq <- 1 / sigmaSq.inv
sigmaSq}

MCMC <- function(Y, mu.init, sigmaSq.init, mu0, sigmaSq0, a, b, iters){
# chain initiation
mu <- mu.init
sigmaSq <- sigmaSq.init
# define chains
mu.chain <- rep(NA, iters)
sigmaSq.chain <- rep(NA, iters)
# start MCMC
for(i in 1:iters){
mu <- mu.update(Y, sigmaSq, mu0, sigmaSq0)
sigmaSq <- sigmaSq.update(Y, mu, a, b)
mu.chain[i] <- mu
sigmaSq.chain[i] <- sigmaSq
}
# return chains
out <- list(mu.chain = mu.chain,
sigmaSq.chain = sigmaSq.chain)
return(out)}

Gibbs sampling for the gaussian model

initial choice of \(\mu\) = \(\bar X\) inital choice of \(\sigma^2\) = \(var(X)\)

Code
par(mfrow= c(1,2))

m = mean(Y)
v = var(Y)

MCMC.out <- MCMC(Y = Y,
mu.init = m,
sigmaSq.init = v,
mu0 = 0,
sigmaSq0 = 100,
a = 0.1,
b = 0.1,
iters = 3000)

plot(MCMC.out$mu.chain, xlab = "Iteration", ylab = "mu", type = "l", sub = paste('initial: ', m ))
abline(h = 5, col = 'red')
plot(MCMC.out$sigmaSq.chain, xlab = "Iteration", ylab = "sigmaSq", type = "l", sub = paste('initial: ', v ))
abline(h = 1, col = 'red')

Gibbs sampling for the gaussian model

initial choice of \(mu\) = -1000 inital choice of \(\sigma^2\) = 500

Code
par(mfrow= c(1,2))

m = -1000
v =  500

MCMC.out <- MCMC(Y = Y,
mu.init = m,
sigmaSq.init = v,
mu0 = 0,
sigmaSq0 = 100,
a = 0.1,
b = 0.1,
iters = 3000)

plot(MCMC.out$mu.chain, xlab = "Iteration", ylab = "mu", type = "l", sub = paste('initial: ', m ))
abline(h = 5, col = 'red')
plot(MCMC.out$sigmaSq.chain, xlab = "Iteration", ylab = "sigmaSq", type = "l", sub = paste('initial: ', v ))
abline(h = 1, col = 'red')

Using rjags

Code
X<-rnorm(100,5,1)
library(rjags)
data<-list(X=X,n=length(X))

#(1) Define the model as a string
model_string<- textConnection("model{
      #Likelihood(dnorm in jags uses precision, not variance)
      for(i in 1:n){
          X[i]~ dnorm(mu,tau)#tau=1/sigma^2
      }                    
      #Priors
      tau~ dgamma(0.01,0.01)
      sigma<-1/sqrt(tau)
      mu~ dnorm(0,0.0001)
      
      }")
#(2) Load the data and compile the MCMC code
inits<- list(mu=mean(X),tau=1/var(X))
model<-rjags::jags.model(model_string, data=data, inits=inits, quiet=TRUE)

#(3) Burn in for 10000 samples
update(model,10000,progress.bar="none")

#(4) Generate 20000 post-burn-in samples and retain the parameters named in params
params<-c("mu","sigma")
samples<-rjags::coda.samples(model,variable.names=params,n.iter=20000,
                       progress.bar="none")

#(6)Plot Samples
plot(samples)

Example

\[y_i\overset{iid}{\sim} poisson(\lambda)\hspace{0.8cm} i =1,2,3,\ldots,n \]

\[y|\lambda,b\sim poisson(\lambda)\] \[ \lambda|b\sim exp(b) \] \[ b\sim exp(1)\] Then, \[\lambda|b, y\sim Gamma(shape = y+1, rate = b+1) \] \[b|\lambda, y\sim Gamma(shape = 2, rate = \lambda +1) \]

Rcode

R Code
lam = 10
# taking a poisson sample of lambda = 10 of size = 100

Y = rpois(100, lambda = lam)

lam.update <- function(Y,b){
  pos.shape = Y+1
  pos.rate =  b+1 
  
  out = rgamma(1, shape = pos.shape, rate = pos.rate)
  out}

b.update <- function(lam){
  pos.shape = 2
  pos.rate =  lam + 1 
  
  out = rgamma(1, shape = pos.shape, rate = pos.rate)
  out}
  
  
 

MCMC <- function(Y, lam.init, iters){
  # chain initiation
 lam = lam.init
  # define chains
  lam.chain <- rep(NA, iters)

  # start MCMC
  for(i in 1:iters){
    b = b.update(lam)
    lam = lam.update(mean(Y), b)
    lam.chain[i] <- lam
  }
  # return chains
  lam.chain}

Example

initial value of \(\lambda\) = \(\bar X\)

Code
par(mfrow= c(1,1))

out<- MCMC(Y = Y,
                 lam.init = mean(Y),
                 iters = 30000)

plot(out, xlab = "Iteration", ylab = "lamda", type = "l")
abline(h = lam, col = 'red')

Example

initial value of \(\lambda\) = 50

Code
par(mfrow= c(1,1))

out <- MCMC(Y = Y,
            lam.init =  50,
            iters = 30000)
plot(out, xlab = "Iteration", ylab = "lamda", type = "l")
abline(h = lam, col = 'red')

Application: generating sample from Bivariate Normal distribution

Let \((X,Y)\sim N_2(\begin{bmatrix} 0\\0 \end{bmatrix},\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix})\)

  • Let us take a sample when \(\rho=0.89\)
  • take some initial value \(x_0\) and \(y_0\)
  • At t’th step:
  • for \(X_{t+1}\) generate: \(x|y_t\sim N(\rho y_t,1-\rho^2)\)
  • for \(Y_{t+1}\) generate: \(y|x_{t+1}\sim N(\rho x_{t+1},1-\rho^2)\)
  • \((X_{t+1},Y_{t+1})\) is our required sample

code

R code
rho=0.89
n=10000
sam=list("x","y")
init=c(0,0)
for(i in 1:n)
{
  a=rnorm(1,init[2]*rho,1-rho**2)
  sam$x=c(sam$x,a)
  init[1]=a
  b=rnorm(1,init[1]*rho,1-rho**2)
  sam$y=c(sam$y,b)
  init[2]=b
}

Visualization

Code
require(psych)
sam$x=sam$x[-(1:100)]
sam$y=sam$y[-(1:100)]
scatter.hist(x=sam$x,y=sam$y,ellipse=T,density = T)

Limitations

  • deriving these conditional distributions can be analytically intractable
  • Dependency on Initial Values
  • Convergence Rate

Softwer Options

  • R functions for Bayesian Linear Regression, BLR() in BGLR package
  • OpenBUGS, JAGS, Proc MCMC, STAN.

Acknowledgement

We want to say a big thank you to everyone who helped make this project possible. Special thanks to Professor Deepayan Sarkar and to Subhrangsu, Sourav, Subhendu for helping us.