Big Picture

During the previous tutorials, we have already worked with two MCMC algorithms: Metropolis-Hastings and Gibbs sampling.

Quick Recap:

Metropolis-Hastings (MH) Algorithm:

With the MH algorithm, we had an acceptance vs. rejection mechanism at each step of MCMC. Let’s say we want to estimate the mean (μ) of a normal distribution and assume we already know the variance (σ²).

In each step of the simulation, we propose a new value for μ using a proposal distribution. Then, we compare the posterior density of the candidate value with that of the current value. If the candidate has a higher posterior density, we accept it; otherwise, we may still accept it with a certain probability (based on the acceptance ratio).

If accepted, we store the new μ and move forward. If it’s rejected, we keep the current value. That was our Metropolis-Hastings algorithm.
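
For reference, the core of a single MH update for μ might look like the sketch below (a minimal sketch, assuming σ² is known, a normal prior on μ, and a symmetric random-walk proposal; the names log_post, mh_step, and proposal_sd are illustrative, not from the earlier tutorials):

# Log posterior of mu (known sigma), up to a constant
log_post <- function(mu, y, sigma, mu0, sigma0) {
  sum(dnorm(y, mu, sigma, log = TRUE)) + dnorm(mu, mu0, sigma0, log = TRUE)
}

# One Metropolis-Hastings step
mh_step <- function(mu_current, y, sigma, mu0, sigma0, proposal_sd = 0.5) {
  mu_candidate <- rnorm(1, mu_current, proposal_sd)            # propose
  log_ratio <- log_post(mu_candidate, y, sigma, mu0, sigma0) -
               log_post(mu_current,  y, sigma, mu0, sigma0)    # compare (log scale)
  if (log(runif(1)) < log_ratio) mu_candidate else mu_current  # accept or keep
}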

Gibbs Sampling:

Gibbs uses a different mechanism. Instead of an acceptance-rejection step, we sample from conditional distributions at each step.

When we have multiple parameters (e.g., μ and σ²), it’s difficult to sample them simultaneously. So, in each step, we sample one parameter conditional on the current value of the other parameter(s).

Then, we alternate and sample the other parameter conditional on the updated value of the first. Repeating this for many iterations gradually explores the joint parameter space. There is no accept/reject step, so every draw is kept, but the sampler can still be inefficient, especially when the parameters are strongly correlated.
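
As a sketch, one Gibbs sweep for this normal model could look like the function below (assuming conjugate priors: a normal prior on μ with mean mu0 and variance tau0_sq, and an inverse-gamma prior on σ² with parameters a0 and b0; these names are illustrative):

# One Gibbs sweep: draw each parameter from its full conditional
gibbs_sweep <- function(mu, sigma_sq, y, mu0, tau0_sq, a0, b0) {
  n <- length(y)
  # Draw mu | sigma_sq, y (normal full conditional)
  v  <- 1 / (n / sigma_sq + 1 / tau0_sq)
  m  <- v * (sum(y) / sigma_sq + mu0 / tau0_sq)
  mu <- rnorm(1, m, sqrt(v))
  # Draw sigma_sq | mu, y (inverse-gamma full conditional)
  sigma_sq <- 1 / rgamma(1, shape = a0 + n / 2, rate = b0 + 0.5 * sum((y - mu)^2))
  c(mu = mu, sigma_sq = sigma_sq)
}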

Why HMC?

Now, while these algorithms are useful, they are not always efficient. These methods can lead to slow exploration of the parameter space, where the chain might get stuck in less interesting regions and randomly walk around. This random walk behavior makes it difficult to fully and efficiently explore the posterior distribution, especially in high-dimensional spaces.

Hamiltonian Monte Carlo (HMC) steps in here. HMC covers the parameter space much more efficiently because it introduces auxiliary momentum variables and uses the gradient of the log-posterior to guide its moves.

Instead of taking small random steps like MH or Gibbs, HMC simulates the trajectory of the parameters under this momentum, allowing larger, more informed jumps through the parameter space.

What are we gonna do today?

We’ll start by exploring a custom HMC sampler function written in R. This will give us a hands-on understanding of how HMC works under the hood. Then, we’ll switch gears and use the rstan package to implement HMC in a more efficient and robust way.

Two Ways to Implement HMC

So, what exactly is Hamiltonian Monte Carlo? In a nutshell, HMC is an MCMC algorithm that uses concepts from physics to make sampling from complex distributions more efficient.

Traditional MCMC methods like Metropolis-Hastings can struggle with high-dimensional spaces because they tend to take small, random steps, leading to slow exploration.

HMC, on the other hand, uses gradient information to take larger, informed steps, reducing the “random walk” behavior and improving convergence.
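
At the core of this is the leapfrog integrator, which alternates half-steps on the momentum with full steps on the parameters (a generic sketch; grad_U stands for the gradient of the potential energy, i.e. the negative log-posterior, and the custom sampler below uses a slightly different per-parameter ordering of the same updates):

# One leapfrog step for parameters theta and momentum p
leapfrog_step <- function(theta, p, grad_U, epsilon) {
  p     <- p - 0.5 * epsilon * grad_U(theta)  # half-step: momentum
  theta <- theta + epsilon * p                # full step: parameters
  p     <- p - 0.5 * epsilon * grad_U(theta)  # half-step: momentum
  list(theta = theta, p = p)
}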

Custom HMC Sampler Function

Let’s get our hands dirty with a custom HMC sampler function. This function is designed to sample from the posterior distribution of a normal model with unknown mean and variance:

# Hamiltonian Monte Carlo sampler function
hmc_sampler <- function(y, mu0, sigma0_sq, n_iter, epsilon, L, mu_start, logsigmasq_start) {
  n <- length(y)
  mu_samples <- numeric(n_iter)
  theta_samples <- numeric(n_iter)
  accept <- logical(n_iter)
  
  # Initialize
  mu <- mu_start
  theta <- logsigmasq_start
  
  # Potential energy: negative log joint density, with theta = log(sigma^2)
  U <- function(mu, theta) {
    -sum(dnorm(y, mu, sqrt(exp(theta)), log = TRUE)) -  # likelihood
    dnorm(mu, mu0, sqrt(sigma0_sq), log = TRUE) -       # normal prior on mu
    log(dnorm(theta, 0, 1))                             # standard normal prior on theta
  }
  
  # Gradient of the potential energy function
  grad_U <- function(mu, theta) {
    dU_dmu <- -sum((y - mu) / exp(theta)) + (mu - mu0) / sigma0_sq
    dU_dtheta <- 0.5 * n - 0.5 * sum((y - mu)^2) / exp(theta) + theta

    c(dU_dmu, dU_dtheta)
  }
  
  for (t in 1:n_iter) {
    # Sample momentum
    p_mu <- rnorm(1)
    p_theta <- rnorm(1)
    
    # Save current values
    mu_old <- mu
    theta_old <- theta
    p_mu_old <- p_mu
    p_theta_old <- p_theta
    
    # Perform L leapfrog steps
    for (i in 1:L) {
      p_mu <- p_mu - 0.5 * epsilon * grad_U(mu, theta)[1]
      mu <- mu + epsilon * p_mu
      
      p_theta <- p_theta - 0.5 * epsilon * grad_U(mu, theta)[2]
      theta <- theta + epsilon * p_theta
      
      p_mu <- p_mu - 0.5 * epsilon * grad_U(mu, theta)[1]
      p_theta <- p_theta - 0.5 * epsilon * grad_U(mu, theta)[2]
    }

    # Compute the Hamiltonian
    H_old <- U(mu_old, theta_old) + 0.5 * (p_mu_old^2 + p_theta_old^2)
    H_new <- U(mu, theta) + 0.5 * (p_mu^2 + p_theta^2)

    # Check for NaN values in Hamiltonian
    if (is.nan(H_old) || is.nan(H_new)) {
      stop("NaN detected in Hamiltonian calculation")
    }

    # Accept or reject the proposal
    if (runif(1) < exp(H_old - H_new)) {
      accept[t] <- TRUE
    } else {
      mu <- mu_old
      theta <- theta_old
      accept[t] <- FALSE
    }
    
    # Store samples
    mu_samples[t] <- mu
    theta_samples[t] <- theta
  }
  
  list(mu_samples = mu_samples, theta_samples = theta_samples, accept = accept)
}

Breakdown

Let’s unpack what’s happening here:

Arguments in the function:

- y: the observed data vector.
- mu0, sigma0_sq: mean and variance of the normal prior on mu.
- n_iter: number of HMC iterations to run.
- epsilon: step size of the leapfrog integrator.
- L: number of leapfrog steps per iteration.
- mu_start, logsigmasq_start: starting values for mu and theta = log(sigma^2).

Explanation of epsilon and L (Leapfrog Steps):

Each HMC iteration simulates a short trajectory by taking L leapfrog steps of size epsilon. Together they determine how far a proposal travels: a small epsilon with many steps gives accurate but slow trajectories, while a large epsilon can make the integrator unstable and hurt the acceptance rate.

Outputs:

The function returns a list with the sampled values of mu (mu_samples), the sampled values of theta = log(sigma^2) (theta_samples), and a logical vector recording which proposals were accepted (accept).
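
For reference, here is the math the code implements (with theta = log sigma^2 throughout): the potential energy U() is the negative log joint density, and grad_U() returns its two partial derivatives.

U(\mu, \theta) = -\sum_{i=1}^{n} \log N(y_i \mid \mu, e^{\theta}) - \log N(\mu \mid \mu_0, \sigma_0^2) - \log N(\theta \mid 0, 1)

\frac{\partial U}{\partial \mu} = -\sum_{i=1}^{n} \frac{y_i - \mu}{e^{\theta}} + \frac{\mu - \mu_0}{\sigma_0^2},
\qquad
\frac{\partial U}{\partial \theta} = \frac{n}{2} - \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{2 e^{\theta}} + \theta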

Simulated Data Example

Let’s generate some fake data to test our sampler. We’ll simulate data from a normal distribution with known mean and standard deviation.

set.seed(123)

# True parameters
true_mu <- 5
true_sigma <- 2

# Simulate data
n <- 100
y <- rnorm(n, mean = true_mu, sd = true_sigma)

# Plot the data
hist(y, breaks = 20, main = "Histogram of Simulated Data", xlab = "y")

Running the Sampler

Now, we’ll set up our sampler’s parameters and run it using the simulated data.

# Prior parameters
mu0 <- 0
sigma0_sq <- 10^2  # A vague prior

# HMC parameters
n_iter <- 5000
epsilon <- 0.01
L <- 20
mu_start <- 0
logsigmasq_start <- log(1)

# Run the HMC sampler
set.seed(123)
hmc_results <- hmc_sampler(y, mu0, sigma0_sq, n_iter, epsilon, L, mu_start, logsigmasq_start)

Now, let’s see how our sampler did:

# Extract samples
mu_samples <- hmc_results$mu_samples
theta_samples <- hmc_results$theta_samples
acceptance_rate <- mean(hmc_results$accept)

# Convert theta_samples to sigma_samples
sigma_samples <- sqrt(exp(theta_samples))

# Trace plots
par(mfrow = c(2, 1))
plot(mu_samples, type = "l", main = "Trace Plot of mu", ylab = "mu")
plot(sigma_samples, type = "l", main = "Trace Plot of sigma", ylab = "sigma")

# Histograms
par(mfrow = c(1, 2))
hist(mu_samples, breaks = 30, main = "Posterior of mu", xlab = "mu", probability = TRUE)
abline(v = true_mu, col = "red", lwd = 2)
hist(sigma_samples, breaks = 30, main = "Posterior of sigma", xlab = "sigma", probability = TRUE)
abline(v = true_sigma, col = "red", lwd = 2)

# Acceptance rate
print(paste("Acceptance Rate:", round(acceptance_rate * 100, 2), "%"))
#> [1] "Acceptance Rate: 99.92 %"

Using RStan for HMC

While our custom sampler is great for understanding the mechanics of HMC, using a dedicated library like rstan can make things a lot easier and more efficient. Plus, we get access to diagnostic tools and advanced features.

Let’s start by making sure rstan is installed and loaded.

# Install rstan if you haven't already (uncomment the line below)
# install.packages("rstan", dependencies = TRUE, repos = "https://cloud.r-project.org/")

# Load rstan
library(rstan)
#> Loading required package: StanHeaders
#> 
#> rstan version 2.32.6 (Stan version 2.32.2)
#> For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores()).
#> To avoid recompilation of unchanged Stan programs, we recommend calling
#> rstan_options(auto_write = TRUE)
#> For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
#> change `threads_per_chain` option:
#> rstan_options(threads_per_chain = 1)
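
Following the advice in the startup message, it is usually worth setting these two options before compiling anything: one to run chains in parallel and one to cache compiled models.

# Recommended rstan options (from the startup message above)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)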

We’ll define a Stan model that mirrors our earlier setup: a normal model with unknown mean and variance. (The priors differ slightly: here sigma gets a half-Cauchy prior, rather than the standard normal prior we placed on log sigma² in the custom sampler.)

You must have a Stan file saved as normal_model.stan with the following content, in the same directory as your R script.

// Stan model saved as 'normal_model.stan'
data {
  int<lower=0> N;          // Number of observations
  vector[N] y;             // Observed data
}

parameters {
  real mu;                 // Mean parameter
  real<lower=0> sigma;     // Standard deviation
}

model {
  mu ~ normal(0, 10);      // Prior for mu
  sigma ~ cauchy(0, 2.5);  // Prior for sigma
  y ~ normal(mu, sigma);   // Likelihood
}

We’ll use the same simulated data y from earlier.

So we feed the data to Stan and let it work its magic.

# Data for Stan
stan_data <- list(N = n, y = y)

Fitting the Model with RStan

Now, we’ll compile and fit the Stan model.

# Compile the model
stan_model <- stan_model(file = 'normal_model.stan')

# Fit the model
fit <- sampling(stan_model, data = stan_data, iter = 5000, chains = 4, seed = 123)
#> 
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 6e-06 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 1: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 1: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 1: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 1: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 0.011 seconds (Warm-up)
#> Chain 1:                0.01 seconds (Sampling)
#> Chain 1:                0.021 seconds (Total)
#> Chain 1: 
#> 
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
#> Chain 2: 
#> Chain 2: Gradient evaluation took 3e-06 seconds
#> Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
#> Chain 2: Adjust your expectations accordingly!
#> Chain 2: 
#> Chain 2: 
#> Chain 2: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 2: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 2: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 2: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 2: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 2: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 2: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 2: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 2: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 2: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 2: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 2: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 2: 
#> Chain 2:  Elapsed Time: 0.01 seconds (Warm-up)
#> Chain 2:                0.009 seconds (Sampling)
#> Chain 2:                0.019 seconds (Total)
#> Chain 2: 
#> 
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
#> Chain 3: 
#> Chain 3: Gradient evaluation took 1e-06 seconds
#> Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
#> Chain 3: Adjust your expectations accordingly!
#> Chain 3: 
#> Chain 3: 
#> Chain 3: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 3: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 3: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 3: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 3: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 3: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 3: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 3: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 3: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 3: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 3: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 3: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 3: 
#> Chain 3:  Elapsed Time: 0.01 seconds (Warm-up)
#> Chain 3:                0.012 seconds (Sampling)
#> Chain 3:                0.022 seconds (Total)
#> Chain 3: 
#> 
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
#> Chain 4: 
#> Chain 4: Gradient evaluation took 1e-06 seconds
#> Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
#> Chain 4: Adjust your expectations accordingly!
#> Chain 4: 
#> Chain 4: 
#> Chain 4: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 4: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 4: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 4: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 4: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 4: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 4: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 4: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 4: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 4: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 4: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 4: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 4: 
#> Chain 4:  Elapsed Time: 0.01 seconds (Warm-up)
#> Chain 4:                0.01 seconds (Sampling)
#> Chain 4:                0.02 seconds (Total)
#> Chain 4:

Diagnosing Convergence

Stan provides various diagnostics to help us check if our chains have converged.
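
Before digging into the summary, a quick automated check is also available (optional; it looks for divergent transitions, saturated tree depths, and low E-BFMI; its output is not shown here):

# Automated check for common HMC/NUTS pathologies
check_hmc_diagnostics(fit)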

# Print summary
print(fit)
#> Inference for Stan model: anon_model.
#> 4 chains, each with iter=5000; warmup=2500; thin=1; 
#> post-warmup draws per chain=2500, total post-warmup draws=10000.
#> 
#>          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
#> mu       5.18    0.00 0.18    4.82    5.05    5.18    5.30    5.54  8680    1
#> sigma    1.84    0.00 0.13    1.60    1.75    1.83    1.93    2.13  8251    1
#> lp__  -110.66    0.01 1.01 -113.35 -111.06 -110.36 -109.93 -109.68  4632    1
#> 
#> Samples were drawn using NUTS(diag_e) at Mon Oct 21 03:48:08 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

# Extract Rhat values
rhat_values <- summary(fit)$summary[, "Rhat"]
print(rhat_values)
#>        mu     sigma      lp__ 
#> 1.0003650 1.0001896 0.9999744

Interpretation

All Rhat values are essentially 1 and the effective sample sizes are large, which suggests the four chains have mixed well and are sampling from the same posterior distribution.

# Trace plots
traceplot(fit, pars = c("mu", "sigma"), nrow=2, ncol=1)

# Other arguments of traceplot() that we didn't use:
# traceplot(object, pars, include = TRUE, unconstrain = FALSE,
#           inc_warmup = FALSE, window = NULL, nrow = NULL, ncol = NULL, ...)
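
For example, setting inc_warmup = TRUE overlays the warm-up iterations, which makes it easy to see how quickly the chains settle:

# Trace plot including the warm-up phase
traceplot(fit, pars = c("mu", "sigma"), inc_warmup = TRUE, nrow = 2, ncol = 1)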

Posterior Analysis

Let’s take a closer look at the posterior distributions.

# Extract samples
posterior_samples <- extract(fit)

# Posterior means
posterior_mu_mean <- mean(posterior_samples$mu)
posterior_sigma_mean <- mean(posterior_samples$sigma)

# Histograms
par(mfrow = c(1, 2))
hist(posterior_samples$mu, breaks = 30, main = "Posterior of mu", xlab = "mu", probability = TRUE)
abline(v = true_mu, col = "red", lwd = 2)
hist(posterior_samples$sigma, breaks = 30, main = "Posterior of sigma", xlab = "sigma", probability = TRUE)
abline(v = true_sigma, col = "red", lwd = 2)

Interpretation

The posterior distributions are closely centered around the true parameter values (the red vertical lines in the histograms).
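
To put numbers on this (a quick sketch; output not shown), we can compare the 95% credible intervals with the true values:

# 95% credible intervals from the Stan fit
quantile(posterior_samples$mu, probs = c(0.025, 0.975))     # true_mu = 5
quantile(posterior_samples$sigma, probs = c(0.025, 0.975))  # true_sigma = 2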

Assignment: Comparing HMC with Metropolis-Hastings & Model Diagnostics

1. Different Values

Play with the inputs of the custom (non-Stan) HMC sampler and re-run it on the same data. What changes do you observe in the trace plots or the posterior distributions?

2. Comparing Acceptance Rates

What was the acceptance rate for the custom HMC sampler? How does it compare to the acceptance rates we got in the previous assignments using different samplers?

3. BRMS

Install and load the brms package and implement the equivalent model in brms. How similar are your results to the rstan results we obtained today? Tip: you can use the brm() function with an intercept-only model (y ~ 1), as in the sketch below.
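
A possible starting point (a minimal sketch, assuming brms is installed; it uses brms's default priors, so the results will not match rstan exactly):

# Intercept-only model: the intercept plays the role of mu
library(brms)
fit_brms <- brm(y ~ 1, data = data.frame(y = y), seed = 123)
summary(fit_brms)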

Created on 2024-10-21 with reprex v2.1.1