knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE,
  fig.width = 7, fig.height = 4.5, dpi = 160
)
set.seed(2025)

In many statistical and scientific problems, the goal is to generate samples from a complex, high-dimensional probability distribution, which we denote as \(\pi(\theta)\). This is a common task in Bayesian statistics, where \(\pi(\theta)\) represents a posterior distribution, or in statistical physics, where it can be a joint density.

However, direct sampling from such a distribution is often computationally intractable or mathematically impossible due to its complex form or high dimensionality.

This is where methods come in. MCMC is a class of algorithms that construct a special type of stochastic process called a , denoted by a sequence of random variables \(\{\theta^{(t)}\}\). The key property of this Markov chain is that its is precisely the target distribution, \(\pi(\theta)\).

After an initial “burn-in” period, during which the chain converges to its stationary distribution, the subsequent draws from the chain, \(\theta^{(t+1)}, \theta^{(t+2)}, \ldots\), can be treated as approximate samples from \(\pi(\theta)\). While these samples are not truly independent, they are often sufficiently decorrelated to provide a good approximation of the desired distribution.

MCMC methods are built upon several fundamental algorithms, with two being particularly prominent:

is a powerful and intuitive Markov Chain Monte Carlo (MCMC) algorithm used to generate a sequence of samples from a complex, high-dimensional joint probability distribution, \(\pi(\theta)\). It’s particularly useful when direct sampling from the joint distribution is difficult, but sampling from the full conditional distributions of each component is relatively straightforward.

The algorithm works by breaking down the high-dimensional problem into a series of one-dimensional sampling steps. Let’s consider a parameter vector \(\theta = (\theta_1, \ldots, \theta_p)\) with a joint density \(\pi(\theta)\). The full conditional distribution for a component \(\theta_j\) is its distribution given the values of all other components, denoted by \(\theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_p)\). The algorithm iteratively samples each parameter from its full conditional distribution:

Each full sweep through all components constitutes one iteration of the algorithm. Under mild technical conditions, this iterative process generates a Markov chain that eventually converges to the target joint distribution, \(\pi(\theta)\). The samples obtained after a sufficient “burn-in” period can then be used for inference.

A classic illustration of the Gibbs sampling idea involves a bivariate normal distribution. Consider a joint density of \((X, Y)\) where we only know the full conditional distributions: \[ X \mid Y=y \sim \mathcal{N}(y, 1), \quad Y \mid X=x \sim \mathcal{N}(x, 1). \] While this particular joint density corresponds to a bivariate normal with strong correlation (and could be sampled directly), it serves as a simple and effective example to demonstrate the Gibbs sampling procedure. To generate samples from this joint distribution, we would start with an arbitrary initial value for one variable (say, \(Y^{(0)}\)) and then alternate between the conditional distributions:

This process, if repeated many times, will produce a sequence of pairs \((X, Y)\) that are samples from the target bivariate normal distribution.

gibbs_bvn <- function(n_iter=5000, burn=1000, x0=0, y0=0) {
  X <- Y <- numeric(n_iter)
  X[1] <- x0; Y[1] <- y0
  for (t in 2:n_iter) {
    X[t] <- rnorm(1, mean = Y[t-1], sd = 1)
    Y[t] <- rnorm(1, mean = X[t], sd = 1)
  }
  list(X=X[-(1:burn)], Y=Y[-(1:burn)])
}

out <- gibbs_bvn()
par(mfrow=c(1,2))
plot(out$X, type="l", main="Trace of X", xlab="Iteration", ylab="X")
plot(out$X, out$Y, pch=16, cex=0.4, main="Gibbs Draws", xlab="X", ylab="Y")

par(mfrow=c(1,1))
cor(out$X, out$Y)  # estimated correlation
## [1] 0.9996881

Gibbs sampling is particularly powerful in Bayesian hierarchical models where parameters are conditionally dependent. A classic example is Bayesian linear regression. We start with the standard linear regression model: \[ y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,\sigma^2 I), \] where \(y\) is the vector of response variables, \(X\) is the design matrix, \(\beta\) is the vector of regression coefficients, and \(\varepsilon\) is the error term, assumed to be normally distributed with mean 0 and variance \(\sigma^2 I\).

In a Bayesian framework, we place prior distributions on the unknown parameters \(\beta\) and \(\sigma^2\). A common choice is to use conjugate priors, which simplify the conditional distributions and make them analytically tractable. We choose:

This choice of priors ensures that the resulting full conditional distributions for \(\beta\) and \(\sigma^2\) are also from familiar families (Normal and Inverse-Gamma, respectively), making them easy to sample from.

With these conjugate priors, we can set up a two-block Gibbs sampler. We iteratively sample from the full conditional distributions of each parameter, conditioned on the current values of all other parameters and the observed data \(y\).

By repeatedly sampling from these two distributions in turn, the Gibbs sampler generates a sequence of parameter pairs \((\beta^{(t)}, \sigma^{2(t)})\) that, after a burn-in period, represents a set of draws from the joint posterior distribution \(\pi(\beta, \sigma^2 \mid y)\). This allows us to make inferences about the regression coefficients and the error variance without complex analytical integration.

# Simulated regression data
n <- 100; p <- 2
X <- cbind(1, rnorm(n))
beta_true <- c(2, -1)
y <- X %*% beta_true + rnorm(n, sd=1)

# Gibbs sampler
gibbs_reg <- function(y, X, n_iter=5000) {
  n <- length(y); p <- ncol(X)
  beta <- rep(0,p); sigma2 <- 1
  out <- matrix(NA, n_iter, p+1); colnames(out) <- c(paste0("beta",0:(p-1)),"sigma2")
  for (t in 1:n_iter) {
    # beta | sigma2
    V <- solve(t(X)%*%X + diag(1,p))
    m <- V %*% t(X) %*% y
    beta <- mvtnorm::rmvnorm(1, mean=m, sigma=sigma2*V)
    # sigma2 | beta
    resid <- y - X %*% t(beta)
    a <- (n/2)+1; b <- sum(resid^2)/2+1
    sigma2 <- 1/rgamma(1, shape=a, rate=b)
    out[t,] <- c(beta, sigma2)
  }
  out
}

library(mvtnorm)
out_reg <- gibbs_reg(y,X)
apply(out_reg,2,mean)
##     beta0     beta1    sigma2 
##  1.826477 -1.157503  1.166155

The algorithm is a very general and fundamental Markov Chain Monte Carlo (MCMC) method. It’s used to generate samples from a target probability distribution \(\pi(\theta)\), even when we only know its density function up to a normalizing constant. This is a common situation in Bayesian inference, where the posterior density is proportional to the likelihood times the prior.

The core of the algorithm is an iterative process of proposing a new state and then deciding whether to accept or reject it. At each step \(t\), starting from the current state \(\theta^{(t)}\), the algorithm works as follows:

Let’s consider a target distribution that is not a standard one, such as \(\pi(x) \propto e^{-x^4/2}\). Since we only need to know the density up to a constant, the MH algorithm is a perfect fit. We could choose a simple proposal distribution, such as a Normal distribution centered at the current state, i.e., \(q(x^\star \mid x^{(t)}) = \mathcal{N}(x^{(t)}, \sigma^2)\). This is a symmetric proposal, so the ratio of proposal densities simplifies to 1. The acceptance probability would then be: \[ \alpha = \min\left(1, \frac{\pi(x^\star)}{\pi(x^{(t)})}\right) = \min\left(1, \frac{e^{-x^{\star 4}/2}}{e^{-x^{(t)4}/2}}\right). \] This simple example illustrates how the MH algorithm can be used to sample from a wide range of distributions where direct sampling is not possible.

mh_univ <- function(n_iter=10000, proposal_sd=1) {
  x <- numeric(n_iter); x[1] <- 0
  target <- function(x) exp(-x^4/2)
  for (t in 2:n_iter) {
    cand <- rnorm(1, x[t-1], proposal_sd)
    alpha <- min(1, target(cand)/target(x[t-1]))
    if (runif(1) < alpha) x[t] <- cand else x[t] <- x[t-1]
  }
  x
}

out_mh <- mh_univ()
par(mfrow=c(1,2))
plot(out_mh[1:500], type="l", main="MH Trace", xlab="Iter", ylab="x")
hist(out_mh, breaks=50, freq=FALSE, main="MH Approximation")
curve(exp(-x^4/2)/integrate(function(z) exp(-z^4/2), -Inf, Inf)$value,
      add=TRUE, col="red", lwd=2)

par(mfrow=c(1,1))

Target: \(\pi(x)\sim N(0,\Sigma)\) with \(\Sigma=\begin{pmatrix}1&0.8\0.8&1\end{pmatrix}\).

mh_mvnorm <- function(n_iter=5000, prop_sd=0.5) {
  x <- matrix(0, n_iter, 2)
  Sigma <- matrix(c(1,0.8,0.8,1),2,2)
  target <- function(z) dmvnorm(z, sigma=Sigma)
  for (t in 2:n_iter) {
    cand <- x[t-1,] + rnorm(2,0,prop_sd)
    alpha <- min(1, target(cand)/target(x[t-1,]))
    if (runif(1)<alpha) x[t,] <- cand else x[t,] <- x[t-1,]
  }
  x
}

library(mvtnorm)
out_mh2 <- mh_mvnorm()
plot(out_mh2, pch=16, cex=0.4, col=rgb(0,0,1,0.3),
     main="MH draws from bivariate Normal", xlab="x1", ylab="x2")