During the tutorials, we have already worked with two MCMC samplers: Metropolis-Hastings (MH) and Gibbs sampling. Let's briefly recap them.
With the MH algorithm, we had an acceptance vs. rejection mechanism at each step of MCMC. Let’s say we want to estimate the mean (μ) of a normal distribution and assume we already know the variance (σ²).
In each step of the simulation, we propose a new value for μ using a proposal distribution. Then, we compare the likelihood (or, more generally, the posterior density) of the candidate value with that of the current one. If the candidate has the higher value, we accept it; otherwise, we might still accept it with a certain probability (given by the acceptance ratio).
If accepted, we store the new μ and move forward. If it’s rejected, we keep the current value. That was our Metropolis-Hastings algorithm.
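As a reminder of the mechanics, here is a minimal sketch of a single MH update for mu. It is an illustration rather than the exact code from the tutorial, and it assumes a flat prior on mu, a known sigma, and a normal random-walk proposal with standard deviation prop_sd:

# One Metropolis-Hastings update for mu (random-walk proposal, flat prior on mu)
mh_step <- function(mu_current, y, sigma, prop_sd) {
  mu_candidate <- rnorm(1, mean = mu_current, sd = prop_sd)  # propose a new value
  # Log-likelihood of the data under each value of mu
  loglik_candidate <- sum(dnorm(y, mu_candidate, sigma, log = TRUE))
  loglik_current   <- sum(dnorm(y, mu_current,  sigma, log = TRUE))
  # Accept with probability min(1, ratio); work on the log scale for stability
  if (log(runif(1)) < loglik_candidate - loglik_current) {
    mu_candidate  # accept: move to the candidate
  } else {
    mu_current    # reject: keep the current value
  }
}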
Gibbs uses a different mechanism. Instead of an acceptance-rejection step, we sample from conditional distributions at each step.
When we have multiple parameters (e.g., μ and σ²), it’s difficult to sample them simultaneously. So, in each step, we sample one parameter conditional on the current value of the other parameter(s).
Then, we alternate and sample the other parameter based on the updated value of the first. Repeating this alternation for many iterations gradually traces out the joint posterior. It’s more systematic than MH (every draw is accepted), but it can still be inefficient when the parameters are strongly correlated.
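For comparison, one Gibbs iteration for (mu, sigma²) in this normal model could look like the sketch below. Again, this is only an illustration; it assumes conjugate priors (mu ~ Normal(mu0, tau0_sq) and sigma² ~ Inverse-Gamma(a, b)) so that both full conditionals have closed forms:

# One Gibbs iteration for (mu, sigma2), assuming conjugate priors
gibbs_step <- function(mu, sigma2, y, mu0, tau0_sq, a, b) {
  n <- length(y)
  # 1) Sample mu | sigma2, y  (normal full conditional)
  post_var  <- 1 / (1 / tau0_sq + n / sigma2)
  post_mean <- post_var * (mu0 / tau0_sq + sum(y) / sigma2)
  mu <- rnorm(1, post_mean, sqrt(post_var))
  # 2) Sample sigma2 | mu, y  (inverse-gamma full conditional)
  sigma2 <- 1 / rgamma(1, shape = a + n / 2, rate = b + 0.5 * sum((y - mu)^2))
  c(mu = mu, sigma2 = sigma2)
}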
Now, while these algorithms are useful, they are not always efficient. These methods can lead to slow exploration of the parameter space, where the chain might get stuck in less interesting regions and randomly walk around. This random walk behavior makes it difficult to fully and efficiently explore the posterior distribution, especially in high-dimensional spaces.
Hamiltonian Monte Carlo (HMC) steps in here. HMC explores the parameter space much more efficiently because it introduces auxiliary momentum variables and uses the gradient of the log-posterior to guide its moves.
Instead of taking small random steps like MH or Gibbs, HMC simulates a trajectory for the parameters driven by that momentum, allowing for larger, more informed jumps in the parameter space.
We’ll start by exploring a custom HMC sampler function written in R. This will give us a hands-on understanding of how HMC works under the hood. Then, we’ll switch gears and use the rstan package to implement HMC in a more efficient and robust way.
Two ways we can implement HMC in practice:
- rstan
- brms
So, what exactly is Hamiltonian Monte Carlo? In a nutshell, HMC is an MCMC algorithm that uses concepts from physics to make sampling from complex distributions more efficient.
Traditional MCMC methods like Metropolis-Hastings can struggle with high-dimensional spaces because they tend to take small, random steps, leading to slow exploration.
HMC, on the other hand, uses gradient information to take larger, informed steps, reducing the “random walk” behavior and improving convergence.
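To make that concrete: HMC treats the parameter vector theta as a position variable, pairs it with a randomly drawn momentum vector p (typically standard normal), and simulates the dynamics defined by the Hamiltonian H(theta, p) = U(theta) + 0.5 * sum(p^2), where U(theta) = -log(posterior density). The leapfrog integrator approximates these dynamics using the gradient of U, and the end point of the trajectory is accepted with probability min(1, exp(H_old - H_new)). These are exactly the quantities (U, grad_U, H_old, H_new) that appear in the custom sampler below.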
Let’s get our hands dirty with a custom HMC sampler function. This function is designed to sample from the posterior distribution of a normal model with unknown mean and variance:
# Hamiltonian Monte Carlo sampler function
hmc_sampler <- function(y, mu0, sigma0_sq, n_iter, epsilon, L, mu_start, logsigmasq_start) {
  n <- length(y)
  mu_samples <- numeric(n_iter)
  theta_samples <- numeric(n_iter)
  accept <- logical(n_iter)

  # Initialize
  mu <- mu_start
  theta <- logsigmasq_start

  # Potential energy function
  U <- function(mu, theta) {
    -sum(dnorm(y, mu, sqrt(exp(theta)), log = TRUE)) -
      dnorm(mu, mu0, sqrt(sigma0_sq), log = TRUE) -
      log(dnorm(theta, 0, 1))
  }

  # Gradient of the potential energy function
  grad_U <- function(mu, theta) {
    dU_dmu <- -sum((y - mu) / exp(theta)) + (mu - mu0) / sigma0_sq
    dU_dtheta <- 0.5 * n - 0.5 * sum((y - mu)^2) / exp(theta) + theta
    c(dU_dmu, dU_dtheta)
  }

  for (t in 1:n_iter) {
    # Sample momentum
    p_mu <- rnorm(1)
    p_theta <- rnorm(1)

    # Save current values
    mu_old <- mu
    theta_old <- theta
    p_mu_old <- p_mu
    p_theta_old <- p_theta

    # Perform L leapfrog steps
    for (i in 1:L) {
      p_mu <- p_mu - 0.5 * epsilon * grad_U(mu, theta)[1]
      mu <- mu + epsilon * p_mu

      p_theta <- p_theta - 0.5 * epsilon * grad_U(mu, theta)[2]
      theta <- theta + epsilon * p_theta

      p_mu <- p_mu - 0.5 * epsilon * grad_U(mu, theta)[1]
      p_theta <- p_theta - 0.5 * epsilon * grad_U(mu, theta)[2]
    }

    # Compute the Hamiltonian
    H_old <- U(mu_old, theta_old) + 0.5 * (p_mu_old^2 + p_theta_old^2)
    H_new <- U(mu, theta) + 0.5 * (p_mu^2 + p_theta^2)

    # Check for NaN values in Hamiltonian
    if (is.nan(H_old) || is.nan(H_new)) {
      stop("NaN detected in Hamiltonian calculation")
    }

    # Accept or reject the proposal
    if (runif(1) < exp(H_old - H_new)) {
      accept[t] <- TRUE
    } else {
      mu <- mu_old
      theta <- theta_old
      accept[t] <- FALSE
    }

    # Store samples
    mu_samples[t] <- mu
    theta_samples[t] <- theta
  }

  list(mu_samples = mu_samples, theta_samples = theta_samples, accept = accept)
}
Let’s unpack what’s happening here:
- y: Our observed data.
- mu0, sigma0_sq: Prior mean and variance for mu.
- n_iter: Number of iterations we want to run.
- epsilon: Step size for our leapfrog integrator.
- L: Number of leapfrog steps per iteration.
- mu_start, logsigmasq_start: Starting values for mu and log(sigma^2).

A closer look at epsilon and L (the leapfrog settings):

- epsilon (step size): This determines how big each update step is during the leapfrog integration. Larger values of epsilon allow bigger jumps in the parameter space, but:
  - if it’s too large, it may lead to inaccurate proposals and higher rejection rates;
  - if it’s too small, the algorithm will take very small steps, which can slow down convergence.
- L (number of leapfrog steps): This defines how many times the leapfrog algorithm updates the parameters in a single iteration. The leapfrog integrator simulates the Hamiltonian dynamics by alternately updating the position (parameter values) and the momentum (auxiliary variables). A higher number of steps L means longer simulated trajectories, allowing the sampler to explore more of the parameter space in one iteration. However, too many steps can make the algorithm inefficient by taking overly long trajectories, so L should be balanced against the complexity of the model.

The function returns the samples of mu, the samples of theta (which is log(sigma^2)), and an indicator of whether each proposal was accepted.

Let’s generate some fake data to test our sampler. We’ll simulate data from a normal distribution with known mean and standard deviation.
set.seed(123)
# True parameters
true_mu <- 5
true_sigma <- 2
# Simulate data
n <- 100
y <- rnorm(n, mean = true_mu, sd = true_sigma)
# Plot the data
hist(y, breaks = 20, main = "Histogram of Simulated Data", xlab = "y")
Now, we’ll set up our sampler’s parameters and run it using the simulated data.
# Prior parameters
mu0 <- 0
sigma0_sq <- 10^2  # A vague prior
# HMC parameters
n_iter <- 5000
epsilon <- 0.01
L <- 20
mu_start <- 0
logsigmasq_start <- log(1)
# Run the HMC sampler
set.seed(123)
hmc_results <- hmc_sampler(y, mu0, sigma0_sq, n_iter, epsilon, L, mu_start, logsigmasq_start)
Now, let’s see how our sampler did:
# Extract samples
mu_samples <- hmc_results$mu_samples
theta_samples <- hmc_results$theta_samples
acceptance_rate <- mean(hmc_results$accept)

# Convert theta_samples to sigma_samples
sigma_samples <- sqrt(exp(theta_samples))
# Trace plots
par(mfrow = c(2, 1))
plot(mu_samples, type = "l", main = "Trace Plot of mu", ylab = "mu")
plot(sigma_samples, type = "l", main = "Trace Plot of sigma", ylab = "sigma")
# Histograms
par(mfrow = c(1, 2))
hist(mu_samples, breaks = 30, main = "Posterior of mu", xlab = "mu", probability = TRUE)
abline(v = true_mu, col = "red", lwd = 2)
hist(sigma_samples, breaks = 30, main = "Posterior of sigma", xlab = "sigma", probability = TRUE)
abline(v = true_sigma, col = "red", lwd = 2)
# Acceptance rate
print(paste("Acceptance Rate:", round(acceptance_rate * 100, 2), "%"))
#> [1] "Acceptance Rate: 99.92 %"
While our custom sampler is great for understanding the mechanics of HMC, using a dedicated library like rstan can make things a lot easier and more efficient. Plus, we get access to diagnostic tools and advanced features.
Let’s start: make sure we have rstan installed and loaded first.
# Install rstan if you haven't already (uncomment the line below)
# install.packages("rstan", dependencies = TRUE, repos = "https://cloud.r-project.org/")
# Load rstan
library(rstan)
#> Loading required package: StanHeaders
#>
#> rstan version 2.32.6 (Stan version 2.32.2)
#> For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores()).
#> To avoid recompilation of unchanged Stan programs, we recommend calling
#> rstan_options(auto_write = TRUE)
#> For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
#> change `threads_per_chain` option:
#> rstan_options(threads_per_chain = 1)
We’ll define a Stan model that mirrors our earlier setup: a normal model with unknown mean and variance.
You must have a Stan file saved as normal_model.stan, with the following content, in the same directory as your R script.
// Stan model saved as 'normal_model.stan'
data {
  int<lower=0> N;       // Number of observations
  vector[N] y;          // Observed data
}
parameters {
  real mu;              // Mean parameter
  real<lower=0> sigma;  // Standard deviation
}
model {
  mu ~ normal(0, 10);      // Prior for mu
  sigma ~ cauchy(0, 2.5);  // Prior for sigma
  y ~ normal(mu, sigma);   // Likelihood
}
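A quick note on the priors: the normal(0, 10) prior on mu corresponds to the mu0 = 0, sigma0_sq = 10^2 prior we used in the custom sampler, while the half-Cauchy(0, 2.5) prior on sigma is not the same as the standard-normal prior we placed on theta = log(sigma^2). With 100 observations the likelihood dominates either choice, so the posteriors should still be very similar.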
We’ll use the same simulated data y from earlier.
So we feed the data to Stan and let it work its magic.
# Data for Stan
stan_data <- list(N = n, y = y)
Now, we’ll compile and fit the Stan model.
# Compile the model
stan_model <- stan_model(file = 'normal_model.stan')

# Fit the model
fit <- sampling(stan_model, data = stan_data, iter = 5000, chains = 4, seed = 123)
#>
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 6e-06 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration: 1 / 5000 [ 0%] (Warmup)
#> Chain 1: Iteration: 500 / 5000 [ 10%] (Warmup)
#> Chain 1: Iteration: 1000 / 5000 [ 20%] (Warmup)
#> Chain 1: Iteration: 1500 / 5000 [ 30%] (Warmup)
#> Chain 1: Iteration: 2000 / 5000 [ 40%] (Warmup)
#> Chain 1: Iteration: 2500 / 5000 [ 50%] (Warmup)
#> Chain 1: Iteration: 2501 / 5000 [ 50%] (Sampling)
#> Chain 1: Iteration: 3000 / 5000 [ 60%] (Sampling)
#> Chain 1: Iteration: 3500 / 5000 [ 70%] (Sampling)
#> Chain 1: Iteration: 4000 / 5000 [ 80%] (Sampling)
#> Chain 1: Iteration: 4500 / 5000 [ 90%] (Sampling)
#> Chain 1: Iteration: 5000 / 5000 [100%] (Sampling)
#> Chain 1:
#> Chain 1: Elapsed Time: 0.011 seconds (Warm-up)
#> Chain 1: 0.01 seconds (Sampling)
#> Chain 1: 0.021 seconds (Total)
#> Chain 1:
#>
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
#> Chain 2:
#> Chain 2: Gradient evaluation took 3e-06 seconds
#> Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
#> Chain 2: Adjust your expectations accordingly!
#> Chain 2:
#> Chain 2:
#> Chain 2: Iteration: 1 / 5000 [ 0%] (Warmup)
#> Chain 2: Iteration: 500 / 5000 [ 10%] (Warmup)
#> Chain 2: Iteration: 1000 / 5000 [ 20%] (Warmup)
#> Chain 2: Iteration: 1500 / 5000 [ 30%] (Warmup)
#> Chain 2: Iteration: 2000 / 5000 [ 40%] (Warmup)
#> Chain 2: Iteration: 2500 / 5000 [ 50%] (Warmup)
#> Chain 2: Iteration: 2501 / 5000 [ 50%] (Sampling)
#> Chain 2: Iteration: 3000 / 5000 [ 60%] (Sampling)
#> Chain 2: Iteration: 3500 / 5000 [ 70%] (Sampling)
#> Chain 2: Iteration: 4000 / 5000 [ 80%] (Sampling)
#> Chain 2: Iteration: 4500 / 5000 [ 90%] (Sampling)
#> Chain 2: Iteration: 5000 / 5000 [100%] (Sampling)
#> Chain 2:
#> Chain 2: Elapsed Time: 0.01 seconds (Warm-up)
#> Chain 2: 0.009 seconds (Sampling)
#> Chain 2: 0.019 seconds (Total)
#> Chain 2:
#>
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
#> Chain 3:
#> Chain 3: Gradient evaluation took 1e-06 seconds
#> Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
#> Chain 3: Adjust your expectations accordingly!
#> Chain 3:
#> Chain 3:
#> Chain 3: Iteration: 1 / 5000 [ 0%] (Warmup)
#> Chain 3: Iteration: 500 / 5000 [ 10%] (Warmup)
#> Chain 3: Iteration: 1000 / 5000 [ 20%] (Warmup)
#> Chain 3: Iteration: 1500 / 5000 [ 30%] (Warmup)
#> Chain 3: Iteration: 2000 / 5000 [ 40%] (Warmup)
#> Chain 3: Iteration: 2500 / 5000 [ 50%] (Warmup)
#> Chain 3: Iteration: 2501 / 5000 [ 50%] (Sampling)
#> Chain 3: Iteration: 3000 / 5000 [ 60%] (Sampling)
#> Chain 3: Iteration: 3500 / 5000 [ 70%] (Sampling)
#> Chain 3: Iteration: 4000 / 5000 [ 80%] (Sampling)
#> Chain 3: Iteration: 4500 / 5000 [ 90%] (Sampling)
#> Chain 3: Iteration: 5000 / 5000 [100%] (Sampling)
#> Chain 3:
#> Chain 3: Elapsed Time: 0.01 seconds (Warm-up)
#> Chain 3: 0.012 seconds (Sampling)
#> Chain 3: 0.022 seconds (Total)
#> Chain 3:
#>
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
#> Chain 4:
#> Chain 4: Gradient evaluation took 1e-06 seconds
#> Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
#> Chain 4: Adjust your expectations accordingly!
#> Chain 4:
#> Chain 4:
#> Chain 4: Iteration: 1 / 5000 [ 0%] (Warmup)
#> Chain 4: Iteration: 500 / 5000 [ 10%] (Warmup)
#> Chain 4: Iteration: 1000 / 5000 [ 20%] (Warmup)
#> Chain 4: Iteration: 1500 / 5000 [ 30%] (Warmup)
#> Chain 4: Iteration: 2000 / 5000 [ 40%] (Warmup)
#> Chain 4: Iteration: 2500 / 5000 [ 50%] (Warmup)
#> Chain 4: Iteration: 2501 / 5000 [ 50%] (Sampling)
#> Chain 4: Iteration: 3000 / 5000 [ 60%] (Sampling)
#> Chain 4: Iteration: 3500 / 5000 [ 70%] (Sampling)
#> Chain 4: Iteration: 4000 / 5000 [ 80%] (Sampling)
#> Chain 4: Iteration: 4500 / 5000 [ 90%] (Sampling)
#> Chain 4: Iteration: 5000 / 5000 [100%] (Sampling)
#> Chain 4:
#> Chain 4: Elapsed Time: 0.01 seconds (Warm-up)
#> Chain 4: 0.01 seconds (Sampling)
#> Chain 4: 0.02 seconds (Total)
#> Chain 4:
Stan provides various diagnostics to help us check if our chains have converged.
# Print summary
print(fit)
#> Inference for Stan model: anon_model.
#> 4 chains, each with iter=5000; warmup=2500; thin=1;
#> post-warmup draws per chain=2500, total post-warmup draws=10000.
#>
#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#> mu 5.18 0.00 0.18 4.82 5.05 5.18 5.30 5.54 8680 1
#> sigma 1.84 0.00 0.13 1.60 1.75 1.83 1.93 2.13 8251 1
#> lp__ -110.66 0.01 1.01 -113.35 -111.06 -110.36 -109.93 -109.68 4632 1
#>
#> Samples were drawn using NUTS(diag_e) at Mon Oct 21 03:48:08 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at
#> convergence, Rhat=1).
# Extract Rhat values
rhat_values <- summary(fit)$summary[, "Rhat"]
print(rhat_values)
#> mu sigma lp__
#> 1.0003650 1.0001896 0.9999744
- mu: The estimated mean is 5.18, close to the true value of 5. The 95% credible interval is [4.82, 5.54], which comfortably covers the true value.
- sigma: The estimated standard deviation is 1.84, slightly lower than the true value of 2, but well within the credible interval [1.60, 2.13].
- n_eff: Effective sample sizes for mu and sigma are over 8000, indicating that the chains mix well and provide reliable estimates.
- Rhat: Values are close to 1 for all parameters, indicating that the chains have converged.
- se_mean: The Monte Carlo standard errors for mu and sigma are tiny (reported as 0.00 after rounding), showing that the posterior means are estimated precisely.
- sd: The posterior standard deviation is 0.18 for mu and 0.13 for sigma, reflecting low uncertainty in the estimates.
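Beyond the summary table, rstan also provides HMC-specific diagnostics. As an optional check on our fit object:

# Divergent transitions, maximum tree depth, and E-BFMI in one call
check_hmc_diagnostics(fit)

# Effective sample sizes, extracted the same way as the Rhat values above
n_eff_values <- summary(fit)$summary[, "n_eff"]
print(n_eff_values)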
# Trace plots
traceplot(fit, pars = c("mu", "sigma"), nrow=2, ncol=1)
# other arguments in the traceplot that we didn't use:
# traceplot(object, pars, include = TRUE, unconstrain = FALSE,
# inc_warmup = FALSE, window = NULL, nrow = NULL, ncol = NULL, ...)
Let’s take a closer look at the posterior distributions.
# Extract samples
posterior_samples <- extract(fit)

# Posterior means
posterior_mu_mean <- mean(posterior_samples$mu)
posterior_sigma_mean <- mean(posterior_samples$sigma)
# Histograms
par(mfrow = c(1, 2))
hist(posterior_samples$mu, breaks = 30, main = "Posterior of mu", xlab = "mu", probability = TRUE)
abline(v = true_mu, col = "red", lwd = 2)
hist(posterior_samples$sigma, breaks = 30, main = "Posterior of sigma", xlab = "sigma", probability = TRUE)
abline(v = true_sigma, col = "red", lwd = 2)
The posterior distributions are closely centered around the true parameter values.
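As an optional side-by-side check (not part of the original output; the custom-sampler numbers depend on the settings above and include no warm-up discard, so you may want to drop an initial burn-in from mu_samples and sigma_samples first), we can compare the posterior means from both samplers:

# Compare posterior means from the custom HMC sampler and from Stan
comparison <- data.frame(
  parameter  = c("mu", "sigma"),
  true_value = c(true_mu, true_sigma),
  custom_hmc = c(mean(mu_samples), mean(sigma_samples)),
  rstan      = c(posterior_mu_mean, posterior_sigma_mean)
)
print(comparison)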
Play with the inputs of the first, manual (non-Stan) HMC sampler and re-run it with the same data. What changes do you observe in the trace plots or the posterior distributions?
What was the acceptance rate for the custom HMC sampler? How does it compare to the acceptance rates we got in the previous assignments using different samplers?
Install and load the brms library, and implement the equivalent model in brms. How similar are your results to the rstan results we obtained today? Tip: you can use the brm function with an intercept-only model (y ~ 1).
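If you want a starting point for that last exercise, a minimal sketch of the brm call could look like the following; it leaves priors and most settings at their brms defaults, so it will not match the Stan model above exactly:

library(brms)

# Intercept-only Gaussian model: the intercept corresponds to mu,
# and the residual standard deviation corresponds to sigma
fit_brms <- brm(y ~ 1, data = data.frame(y = y), family = gaussian(),
                iter = 5000, chains = 4, seed = 123)
summary(fit_brms)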
Created on 2024-10-21 with reprex v2.1.1