The Metropolis-Hastings (M-H) algorithm is a cornerstone of Markov Chain Monte Carlo (MCMC). The goal is to sample from a complex posterior distribution, denoted \(\pi(\theta \mid y)\), when direct sampling is impossible.
The key idea:
This is the heart of the algorithm. The formula is:
\[ \alpha(\theta, \phi) = \min \left(1, \; \frac{\pi(\phi \mid y) \; q(\theta \mid \phi)}{\pi(\theta \mid y) \; q(\phi \mid \theta)} \right) \]
The ratio inside the \(\min\) function compares two quantities:
| Term | Interpretation |
|---|---|
| \(\pi(\phi \mid y)\) | Plausibility of the proposal given the data |
| \(\pi(\theta \mid y)\) | Plausibility of the current value given the data |
| \(q(\phi \mid \theta)\) | Probability density of proposing \(\phi\) from current state \(\theta\) |
| \(q(\theta \mid \phi)\) | Probability density of the reverse jump from \(\phi\) back to \(\theta\) |
The factor \(\frac{q(\theta \mid \phi)}{q(\phi \mid \theta)}\) corrects for any asymmetry in the proposal distribution.
\[ \alpha(\theta, \phi) = \min \left(1, \; \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)} \right) \]
Common symmetric proposals include the Normal distribution: \(\phi \sim \mathcal{N}(\theta, \tau^2)\).
| Case | Proposal \(q(\phi \mid \theta)\) | Acceptance Probability \(\alpha(\theta, \phi)\) |
|---|---|---|
| Independence Sampler | \(g(\phi)\) (does not depend on \(\theta\)) | \(\displaystyle \min\left(1, \; \frac{w(\phi \mid y)}{w(\theta \mid y)}\right)\) where \(w(\theta \mid y) = \frac{\pi(\theta \mid y)}{g(\theta)}\) |
| Metropolis (symmetric) | \(q(\phi \mid \theta) = q(\theta \mid \phi)\) | \(\displaystyle \min\left(1, \; \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)}\right)\) |
| Random Walk Metropolis | \(q(\phi \mid \theta) = q(|\phi - \theta|)\) (additive increment) | Same as symmetric case |
This mechanism ensures detailed balance:
\[ \pi(\theta \mid y) \; q(\phi \mid \theta) \; \alpha(\theta, \phi) = \pi(\phi \mid y) \; q(\theta \mid \phi) \; \alpha(\phi, \theta) \]
which guarantees that \(\pi(\theta \mid y)\) is the stationary distribution of the Markov chain.
The algorithm only requires the ratio \(\frac{\pi(\phi \mid y)}{\pi(\theta \mid y)}\). Therefore, we can work with an unnormalized posterior:
\[ \pi(\theta \mid y) \propto h(\theta) \]
where \(h(\theta) = \text{likelihood} \times \text{prior}\). The normalizing constant cancels out in the ratio.
When \(q(\phi \mid \theta) = g(\phi)\), the acceptance probability becomes:
\[ \alpha(\theta, \phi) = \min\left(1, \; \frac{w(\phi \mid y)}{w(\theta \mid y)}\right), \quad \text{where} \quad w(\theta \mid y) = \frac{\pi(\theta \mid y)}{g(\theta)} \]
This is essentially the ratio of importance weights. The text warns that independence samplers are “either very good or very bad.” For geometric convergence, the importance weight function \(w(\theta \mid y)\) must be bounded, meaning the proposal’s tails must dominate the target’s tails.
For the Random Walk Metropolis algorithm in high dimensions, Roberts et al. (1997) proved that as the dimension \(p \to \infty\), the optimal acceptance rate approaches 0.234. For one-dimensional problems (\(p = 1\)), Gelman et al. (1996) showed the optimal acceptance rate is 0.44.
The text illustrates these concepts with a Cauchy prior example. The target posterior is:
\[ \pi(\theta \mid y) \propto \left[ \prod_{i=1}^n \mathcal{N}(y_i \mid \theta, \sigma^2) \right] \times \text{Cauchy}(\theta \mid 0, 1) \]
This demonstrates the importance of tuning the proposal distribution to achieve the optimal acceptance rate for efficient exploration of the posterior.
The vertical bars denote absolute value (not conditional probability). So \(|\phi - \theta|\) is the distance between \(\phi\) and \(\theta\):
\[ |\phi - \theta| = \begin{cases} \phi - \theta & \text{if } \phi \geq \theta \\ \theta - \phi & \text{if } \phi < \theta \end{cases} \]
When we write \(q(\phi \mid \theta) = q(|\phi - \theta|)\), we are saying:
The probability density of proposing \(\phi\) given current state \(\theta\) is a function only of how far \(\phi\) is from \(\theta\), not of their specific locations.
For a Normal proposal centered at \(\theta\) with variance \(\tau^2\):
\[ q(\phi \mid \theta) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(\phi - \theta)^2}{2\tau^2}\right) \]
Notice that this depends on \(\phi - \theta\) only through the squared distance \((\phi - \theta)^2\), which is \((|\phi - \theta|)^2\). Therefore:
\[ q(\phi \mid \theta) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(|\phi - \theta|)^2}{2\tau^2}\right) = q(|\phi - \theta|) \]
For a Uniform proposal on \([\theta - \delta, \theta + \delta]\):
\[ q(\phi \mid \theta) = \frac{1}{2\delta} \quad \text{if } |\phi - \theta| \leq \delta \]
For a Laplace proposal centered at \(\theta\) with scale \(b\):
\[ q(\phi \mid \theta) = \frac{1}{2b} \exp\left(-\frac{|\phi - \theta|}{b}\right) \]
If \(q(\phi \mid \theta) = q(|\phi - \theta|)\), then:
\[ q(\phi \mid \theta) = q(|\phi - \theta|) = q(|\theta - \phi|) = q(\theta \mid \phi) \]
The proposal distribution doesn’t care where you are in the parameter space. The shape is the same everywhere.
We can write \(\phi = \theta + \varepsilon\) where \(\varepsilon\) is a random increment drawn from a distribution that is symmetric about 0. The density of \(\varepsilon\) is \(q(|\varepsilon|)\), and the proposal density becomes:
\[ q(\phi \mid \theta) = q(|\varepsilon|) \]
| Property | What \(q(\phi \mid \theta) = q(|\phi - \theta|)\) means |
|---|---|
| Translation invariance | Same shape everywhere |
| Symmetry | \(q(\phi \mid \theta) = q(\theta \mid \phi)\) automatically |
| Additive increments | \(\phi = \theta + \varepsilon\) where \(\varepsilon\) is symmetric about 0 |
| Direction independence | Only distance matters, not whether \(\phi > \theta\) or \(\phi < \theta\) |
# Random Walk Metropolis for Normal mean with Cauchy prior
# Step 0: Generate fake data
set.seed(123)
true_theta <- 1.0 # True value we want to estimate
n <- 25
y <- rnorm(n, mean = true_theta, sd = 1)
# Log-posterior (up to a constant). We use log to avoid numerical underflow.
log_posterior <- function(theta, y) {
# Log-likelihood (Normal, variance = 1)
log_lik <- sum(dnorm(y, mean = theta, sd = 1, log = TRUE))
# Log-prior (Cauchy, location = 0, scale = 1)
log_prior <- dcauchy(theta, location = 0, scale = 1, log = TRUE)
return(log_lik + log_prior)
}
# Random Walk Metropolis algorithm
metropolis_rw <- function(y, initial_theta, tuning_sd, n_iter) {
theta_chain <- numeric(n_iter)
theta_chain[1] <- initial_theta
accept_count <- 0
for (t in 1:(n_iter - 1)) {
current_theta <- theta_chain[t]
# Step 2: Propose new value from a symmetric Normal distribution
# centered at the current value.
proposal <- rnorm(1, mean = current_theta, sd = tuning_sd)
# Step 3: Calculate acceptance probability (symmetric case)
log_alpha <- log_posterior(proposal, y) - log_posterior(current_theta, y)
alpha <- min(1, exp(log_alpha))
# Step 4: Accept or reject
if (runif(1) < alpha) {
theta_chain[t + 1] <- proposal
accept_count <- accept_count + 1
} else {
theta_chain[t + 1] <- current_theta
}
}
acceptance_rate <- accept_count / (n_iter - 1)
return(list(chain = theta_chain, acceptance_rate = acceptance_rate))
}
# Run the algorithm with different tuning parameters
n_iter <- 10000
initial_theta <- 0
# Case 1: tuning_sd = 1 (as in the text)
result_sd1 <- metropolis_rw(y, initial_theta, tuning_sd = 1, n_iter)
cat("=== tuning_sd = 1 ===\n")## === tuning_sd = 1 ===
## Acceptance rate: 0.2507251
posterior_samples <- result_sd1$chain[-(1:1000)] # Discard burn-in
cat("Posterior mean estimate:", mean(posterior_samples), "\n")## Posterior mean estimate: 0.9306272
## 95% credible interval: 0.5461626 1.320599
# Case 2: tuning_sd = 0.25 (optimal for 1D)
result_sd025 <- metropolis_rw(y, initial_theta, tuning_sd = 0.25, n_iter)
cat("=== tuning_sd = 0.25 ===\n")## === tuning_sd = 0.25 ===
## Acceptance rate: 0.6417642
posterior_samples <- result_sd025$chain[-(1:1000)]
cat("Posterior mean estimate:", mean(posterior_samples), "\n")## Posterior mean estimate: 0.9150508
## 95% credible interval: 0.5228005 1.303016
# Visualize the results
par(mfrow = c(2, 2))
# Trace plot for tuning_sd = 1
plot(result_sd1$chain, type = "l", main = "Trace Plot (sd = 1)",
xlab = "Iteration", ylab = expression(theta))
abline(h = true_theta, col = "red", lwd = 2)
# Histogram for tuning_sd = 1
hist(result_sd1$chain[-(1:1000)], breaks = 40, main = "Posterior (sd = 1)",
xlab = expression(theta), freq = FALSE)
abline(v = true_theta, col = "red", lwd = 2)
# Trace plot for tuning_sd = 0.25
plot(result_sd025$chain, type = "l", main = "Trace Plot (sd = 0.25)",
xlab = "Iteration", ylab = expression(theta))
abline(h = true_theta, col = "red", lwd = 2)
# Histogram for tuning_sd = 0.25
hist(result_sd025$chain[-(1:1000)], breaks = 40, main = "Posterior (sd = 0.25)",
xlab = expression(theta), freq = FALSE)
abline(v = true_theta, col = "red", lwd = 2)In the Metropolis-Hastings algorithm, we need to compute the ratio of posterior densities:
\[ \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)} \]
Recall Bayes’ theorem:
\[ \pi(\theta \mid y) = \frac{\pi(\theta) \times L(\theta \mid y)}{p(y)} \]
where:
When we take the ratio of two posterior densities at \(\phi\) and \(\theta\), the marginal likelihood \(p(y)\) cancels out:
\[ \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)} = \frac{\frac{\pi(\phi) \times L(\phi \mid y)}{p(y)}}{\frac{\pi(\theta) \times L(\theta \mid y)}{p(y)}} = \frac{\pi(\phi) \times L(\phi \mid y)}{\pi(\theta) \times L(\theta \mid y)} \]
Therefore, in the Metropolis-Hastings acceptance probability, we need to evaluate two quantities at both \(\theta\) and \(\phi\):
| Quantity | Mathematical Form | Interpretation |
|---|---|---|
| Prior density | \(\pi(\theta)\) | Function of \(\theta\) only |
| Likelihood | \(L(\theta \mid y) = \prod_{i=1}^{n} f(y_i \mid \theta)\) | Function of \(\theta\) with data \(y\) fixed |
For numerical stability, we work on the log scale:
\[ \log \pi(\theta \mid y) \propto \log \pi(\theta) + \log L(\theta \mid y) \]
where:
# Independence Sampler for Normal mean with Cauchy prior
# Proposal distribution: Normal(0, 10) - does NOT depend on current theta
proposal_draw <- function() {
rnorm(1, mean = 0, sd = 10)
}
proposal_log_density <- function(theta) {
dnorm(theta, mean = 0, sd = 10, log = TRUE)
}
independence_metropolis <- function(y, initial_theta, n_iter) {
theta_chain <- numeric(n_iter)
theta_chain[1] <- initial_theta
accept_count <- 0
for (t in 1:(n_iter - 1)) {
current_theta <- theta_chain[t]
# Step 2: Propose from a fixed distribution
proposal <- proposal_draw()
# Step 3: Acceptance probability for independence sampler
log_target_proposal <- log_posterior(proposal, y) - proposal_log_density(proposal)
log_target_current <- log_posterior(current_theta, y) - proposal_log_density(current_theta)
log_alpha <- log_target_proposal - log_target_current
alpha <- min(1, exp(log_alpha))
if (runif(1) < alpha) {
theta_chain[t + 1] <- proposal
accept_count <- accept_count + 1
} else {
theta_chain[t + 1] <- current_theta
}
}
acceptance_rate <- accept_count / (n_iter - 1)
return(list(chain = theta_chain, acceptance_rate = acceptance_rate))
}
# Run the independence sampler
result_ind <- independence_metropolis(y, initial_theta = 0, n_iter = 10000)
cat("=== Independence Sampler ===\n")## === Independence Sampler ===
## Acceptance rate: 0.02710271
posterior_samples <- result_ind$chain[-(1:1000)]
cat("Posterior mean estimate:", mean(posterior_samples), "\n")## Posterior mean estimate: 0.9337774
## 95% credible interval: 0.5982786 1.308347
# Visualize independence sampler results
par(mfrow = c(1, 2))
plot(result_ind$chain, type = "l", main = "Independence Sampler Trace Plot",
xlab = "Iteration", ylab = expression(theta))
abline(h = true_theta, col = "red", lwd = 2)
hist(result_ind$chain[-(1:1000)], breaks = 40, main = "Independence Sampler Posterior",
xlab = expression(theta), freq = FALSE)
abline(v = true_theta, col = "red", lwd = 2)# Explore how tuning_sd affects acceptance rate
tuning_values <- seq(0.1, 2, by = 0.1)
acceptance_rates <- sapply(tuning_values, function(sd) {
result <- metropolis_rw(y, initial_theta = 0, tuning_sd = sd, n_iter = 5000)
return(result$acceptance_rate)
})
# Plot the relationship
plot(tuning_values, acceptance_rates, type = "b",
xlab = "Tuning Standard Deviation",
ylab = "Acceptance Rate",
main = "Acceptance Rate vs. Tuning Parameter",
pch = 19, col = "blue")
abline(h = 0.44, col = "red", lty = 2, lwd = 2)
abline(h = 0.234, col = "darkgreen", lty = 2, lwd = 2)
legend("topright", legend = c("Optimal for 1D (0.44)", "Optimal for high dim (0.234)"),
col = c("red", "darkgreen"), lty = 2, lwd = 2)# Find tuning value closest to 0.44
best_idx <- which.min(abs(acceptance_rates - 0.44))
cat("Optimal tuning_sd (closest to 0.44):", tuning_values[best_idx],
"with acceptance rate:", round(acceptance_rates[best_idx], 4), "\n")## Optimal tuning_sd (closest to 0.44): 0.5 with acceptance rate: 0.4341
# Visual demonstration that q(phi|theta) depends only on |phi - theta|
# Set current state
theta <- 2
# Define a random walk proposal (Normal)
proposal_density <- function(phi, theta, sd = 1) {
dnorm(phi, mean = theta, sd = sd)
}
# Calculate densities for different phi values
phi_values <- seq(-2, 6, length.out = 100)
densities <- sapply(phi_values, function(phi) proposal_density(phi, theta))
# Calculate distances
distances <- abs(phi_values - theta)
# Plot: density as a function of distance
plot(distances, densities, type = "l", lwd = 2,
xlab = "Distance |φ - θ|",
ylab = "Proposal Density q(φ|θ)",
main = "q(φ|θ) depends ONLY on |φ - θ|\n(same density at same distance regardless of direction)")
# Add points to show symmetry
points(distances[phi_values > theta], densities[phi_values > theta],
col = "blue", pch = 16, cex = 0.5)
points(distances[phi_values < theta], densities[phi_values < theta],
col = "red", pch = 16, cex = 0.5)
legend("topright", legend = c("φ > θ (right side)", "φ < θ (left side)"),
col = c("blue", "red"), pch = 16)# Key observation: same distance gives same density regardless of direction
cat("At distance = 1:\n")## At distance = 1:
## φ = θ + 1 = 3 → density = 0.2419707
## φ = θ - 1 = 1 → density = 0.2419707
| Method | Tuning Parameter | Acceptance Rate | Posterior Mean | 95% Credible Interval |
|---|---|---|---|---|
| Random Walk | \(\tau = 1.0\) | ~24% | ~0.801 | (0.411, 1.193) |
| Random Walk | \(\tau = 0.25\) | ~44% | ~0.802 | (0.414, 1.189) |
| Independence Sampler | N(0, 10) | ~12% | ~0.806 | (0.419, 1.226) |
The true value used for data generation is \(\theta = 1.0\), but the posterior mean is around 0.80 due to the shrinkage induced by the Cauchy prior.