1. Summary of the Metropolis-Hastings (M-H) algorithm

The Metropolis-Hastings (M-H) algorithm is a cornerstone of Markov Chain Monte Carlo (MCMC). The goal is to sample from a complex posterior distribution, denoted \(\pi(\theta \mid y)\), when direct sampling is impossible.

The key idea:

  1. Start with a current parameter value \(\theta^{(t)}\) at iteration \(t\).
  2. Propose a new value \(\phi\) from an easy-to-sample proposal distribution \(q(\phi \mid \theta)\).
  3. Calculate an acceptance probability \(\alpha(\theta, \phi)\).
  4. Accept \(\phi\) as \(\theta^{(t+1)}\) with probability \(\alpha\); otherwise, stay at \(\theta^{(t)}\).
  5. As \(t \to \infty\), the Markov chain’s stationary distribution becomes the target posterior \(\pi(\theta \mid y)\).

2. Detailed Explanation of Step 3: Calculate the Acceptance Probability

This is the heart of the algorithm. The formula is:

\[ \alpha(\theta, \phi) = \min \left(1, \; \frac{\pi(\phi \mid y) \; q(\theta \mid \phi)}{\pi(\theta \mid y) \; q(\phi \mid \theta)} \right) \]

Why this formula?

The ratio inside the \(\min\) function compares two quantities:

  • Numerator: The target posterior density at the proposal \(\phi\), multiplied by the probability of jumping back from \(\phi\) to the current value \(\theta\).
  • Denominator: The target posterior density at the current value \(\theta\), multiplied by the probability of jumping from \(\theta\) to the proposal \(\phi\).

What does each term represent?

Term Interpretation
\(\pi(\phi \mid y)\) Plausibility of the proposal given the data
\(\pi(\theta \mid y)\) Plausibility of the current value given the data
\(q(\phi \mid \theta)\) Probability density of proposing \(\phi\) from current state \(\theta\)
\(q(\theta \mid \phi)\) Probability density of the reverse jump from \(\phi\) back to \(\theta\)

Why include the ratio of proposal densities?

The factor \(\frac{q(\theta \mid \phi)}{q(\phi \mid \theta)}\) corrects for any asymmetry in the proposal distribution.

  • Symmetric proposal: If \(q(\phi \mid \theta) = q(\theta \mid \phi)\) for all \(\theta, \phi\), then this ratio equals \(1\). The acceptance probability simplifies to the original Metropolis algorithm:

\[ \alpha(\theta, \phi) = \min \left(1, \; \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)} \right) \]

Common symmetric proposals include the Normal distribution: \(\phi \sim \mathcal{N}(\theta, \tau^2)\).

  • Asymmetric proposal: If the proposal distribution is not symmetric (e.g., Gamma, log-Normal, or any distribution with different forward and reverse densities), this ratio corrects the bias. Without it, the chain would not converge to the correct stationary distribution.

Special cases discussed in the text:

Case Proposal \(q(\phi \mid \theta)\) Acceptance Probability \(\alpha(\theta, \phi)\)
Independence Sampler \(g(\phi)\) (does not depend on \(\theta\)) \(\displaystyle \min\left(1, \; \frac{w(\phi \mid y)}{w(\theta \mid y)}\right)\) where \(w(\theta \mid y) = \frac{\pi(\theta \mid y)}{g(\theta)}\)
Metropolis (symmetric) \(q(\phi \mid \theta) = q(\theta \mid \phi)\) \(\displaystyle \min\left(1, \; \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)}\right)\)
Random Walk Metropolis \(q(\phi \mid \theta) = q(|\phi - \theta|)\) (additive increment) Same as symmetric case

Understanding the \(\min(1, \cdot)\) function:

  • If the ratio is greater than 1, the proposal is more plausible (higher posterior density) or has a higher reverse probability than the current state. The algorithm always accepts (probability = 1).
  • If the ratio is less than 1, the algorithm accepts the proposal with probability equal to that ratio. For example, if the ratio equals \(0.3\), you accept with a \(30\%\) chance.

This mechanism ensures detailed balance:

\[ \pi(\theta \mid y) \; q(\phi \mid \theta) \; \alpha(\theta, \phi) = \pi(\phi \mid y) \; q(\theta \mid \phi) \; \alpha(\phi, \theta) \]

which guarantees that \(\pi(\theta \mid y)\) is the stationary distribution of the Markov chain.

3. Important Practical Remarks from the Text

No normalizing constant needed

The algorithm only requires the ratio \(\frac{\pi(\phi \mid y)}{\pi(\theta \mid y)}\). Therefore, we can work with an unnormalized posterior:

\[ \pi(\theta \mid y) \propto h(\theta) \]

where \(h(\theta) = \text{likelihood} \times \text{prior}\). The normalizing constant cancels out in the ratio.

Independence sampler behavior

When \(q(\phi \mid \theta) = g(\phi)\), the acceptance probability becomes:

\[ \alpha(\theta, \phi) = \min\left(1, \; \frac{w(\phi \mid y)}{w(\theta \mid y)}\right), \quad \text{where} \quad w(\theta \mid y) = \frac{\pi(\theta \mid y)}{g(\theta)} \]

This is essentially the ratio of importance weights. The text warns that independence samplers are “either very good or very bad.” For geometric convergence, the importance weight function \(w(\theta \mid y)\) must be bounded, meaning the proposal’s tails must dominate the target’s tails.

Theoretical optimal acceptance rates

For the Random Walk Metropolis algorithm in high dimensions, Roberts et al. (1997) proved that as the dimension \(p \to \infty\), the optimal acceptance rate approaches 0.234. For one-dimensional problems (\(p = 1\)), Gelman et al. (1996) showed the optimal acceptance rate is 0.44.

Practical tuning guidelines

  • Tuning standard deviation too small: High acceptance rate (e.g., \(> 0.5\)), but the chain moves slowly and exhibits high autocorrelation. Many iterations are needed to explore the parameter space.
  • Tuning standard deviation too large: Low acceptance rate (e.g., \(< 0.1\)). Most proposed jumps land in regions of low posterior density and are rejected, causing the chain to stagnate.
  • Target acceptance rates: Aim for approximately 0.44 for univariate problems, and 0.234 for high-dimensional problems.

4. Example from the Text: Cauchy Prior

The text illustrates these concepts with a Cauchy prior example. The target posterior is:

\[ \pi(\theta \mid y) \propto \left[ \prod_{i=1}^n \mathcal{N}(y_i \mid \theta, \sigma^2) \right] \times \text{Cauchy}(\theta \mid 0, 1) \]

  • Independence sampler using a Cauchy proposal distribution yielded an acceptance rate of only about 12%, which is suboptimal. This suggests the proposal does not match the target well.
  • Random Walk Metropolis with tuning standard deviation \(\tau = 1\) gave a 24.2% acceptance rate and reasonable estimates.
  • With a better-tuned proposal (\(\tau = 0.25\)), the acceptance rate improved to 43.71%, very close to the theoretical optimum of 44%, producing the most accurate point estimates.

This demonstrates the importance of tuning the proposal distribution to achieve the optimal acceptance rate for efficient exploration of the posterior.

5. The Notation \(q(\phi \mid \theta) = q(|\phi - \theta|)\) Explained

What \(|\phi - \theta|\) represents

The vertical bars denote absolute value (not conditional probability). So \(|\phi - \theta|\) is the distance between \(\phi\) and \(\theta\):

\[ |\phi - \theta| = \begin{cases} \phi - \theta & \text{if } \phi \geq \theta \\ \theta - \phi & \text{if } \phi < \theta \end{cases} \]

What the equation means

When we write \(q(\phi \mid \theta) = q(|\phi - \theta|)\), we are saying:

The probability density of proposing \(\phi\) given current state \(\theta\) is a function only of how far \(\phi\) is from \(\theta\), not of their specific locations.

Concrete Examples

Example 1: Normal Distribution (Random Walk)

For a Normal proposal centered at \(\theta\) with variance \(\tau^2\):

\[ q(\phi \mid \theta) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(\phi - \theta)^2}{2\tau^2}\right) \]

Notice that this depends on \(\phi - \theta\) only through the squared distance \((\phi - \theta)^2\), which is \((|\phi - \theta|)^2\). Therefore:

\[ q(\phi \mid \theta) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(|\phi - \theta|)^2}{2\tau^2}\right) = q(|\phi - \theta|) \]

Example 2: Uniform Distribution

For a Uniform proposal on \([\theta - \delta, \theta + \delta]\):

\[ q(\phi \mid \theta) = \frac{1}{2\delta} \quad \text{if } |\phi - \theta| \leq \delta \]

Example 3: Laplace (Double Exponential) Distribution

For a Laplace proposal centered at \(\theta\) with scale \(b\):

\[ q(\phi \mid \theta) = \frac{1}{2b} \exp\left(-\frac{|\phi - \theta|}{b}\right) \]

Why This Form Matters

1. Symmetry is automatically satisfied

If \(q(\phi \mid \theta) = q(|\phi - \theta|)\), then:

\[ q(\phi \mid \theta) = q(|\phi - \theta|) = q(|\theta - \phi|) = q(\theta \mid \phi) \]

2. Translation invariance

The proposal distribution doesn’t care where you are in the parameter space. The shape is the same everywhere.

3. Additive increments

We can write \(\phi = \theta + \varepsilon\) where \(\varepsilon\) is a random increment drawn from a distribution that is symmetric about 0. The density of \(\varepsilon\) is \(q(|\varepsilon|)\), and the proposal density becomes:

\[ q(\phi \mid \theta) = q(|\varepsilon|) \]

Summary Table

Property What \(q(\phi \mid \theta) = q(|\phi - \theta|)\) means
Translation invariance Same shape everywhere
Symmetry \(q(\phi \mid \theta) = q(\theta \mid \phi)\) automatically
Additive increments \(\phi = \theta + \varepsilon\) where \(\varepsilon\) is symmetric about 0
Direction independence Only distance matters, not whether \(\phi > \theta\) or \(\phi < \theta\)

6. R Code Examples

Example 1: Random Walk Metropolis for Normal Mean with Cauchy Prior

# Random Walk Metropolis for Normal mean with Cauchy prior

# Step 0: Generate fake data
set.seed(123)
true_theta <- 1.0          # True value we want to estimate
n <- 25
y <- rnorm(n, mean = true_theta, sd = 1)

# Log-posterior (up to a constant). We use log to avoid numerical underflow.
log_posterior <- function(theta, y) {
  # Log-likelihood (Normal, variance = 1)
  log_lik <- sum(dnorm(y, mean = theta, sd = 1, log = TRUE))
  # Log-prior (Cauchy, location = 0, scale = 1)
  log_prior <- dcauchy(theta, location = 0, scale = 1, log = TRUE)
  return(log_lik + log_prior)
}

# Random Walk Metropolis algorithm
metropolis_rw <- function(y, initial_theta, tuning_sd, n_iter) {
  theta_chain <- numeric(n_iter)
  theta_chain[1] <- initial_theta
  accept_count <- 0
  
  for (t in 1:(n_iter - 1)) {
    current_theta <- theta_chain[t]
    
    # Step 2: Propose new value from a symmetric Normal distribution
    # centered at the current value.
    proposal <- rnorm(1, mean = current_theta, sd = tuning_sd)
    
    # Step 3: Calculate acceptance probability (symmetric case)
    log_alpha <- log_posterior(proposal, y) - log_posterior(current_theta, y)
    alpha <- min(1, exp(log_alpha))
    
    # Step 4: Accept or reject
    if (runif(1) < alpha) {
      theta_chain[t + 1] <- proposal
      accept_count <- accept_count + 1
    } else {
      theta_chain[t + 1] <- current_theta
    }
  }
  
  acceptance_rate <- accept_count / (n_iter - 1)
  return(list(chain = theta_chain, acceptance_rate = acceptance_rate))
}

# Run the algorithm with different tuning parameters
n_iter <- 10000
initial_theta <- 0

# Case 1: tuning_sd = 1 (as in the text)
result_sd1 <- metropolis_rw(y, initial_theta, tuning_sd = 1, n_iter)
cat("=== tuning_sd = 1 ===\n")
## === tuning_sd = 1 ===
cat("Acceptance rate:", result_sd1$acceptance_rate, "\n")
## Acceptance rate: 0.2507251
posterior_samples <- result_sd1$chain[-(1:1000)]  # Discard burn-in
cat("Posterior mean estimate:", mean(posterior_samples), "\n")
## Posterior mean estimate: 0.9306272
cat("95% credible interval:", quantile(posterior_samples, c(0.025, 0.975)), "\n\n")
## 95% credible interval: 0.5461626 1.320599
# Case 2: tuning_sd = 0.25 (optimal for 1D)
result_sd025 <- metropolis_rw(y, initial_theta, tuning_sd = 0.25, n_iter)
cat("=== tuning_sd = 0.25 ===\n")
## === tuning_sd = 0.25 ===
cat("Acceptance rate:", result_sd025$acceptance_rate, "\n")
## Acceptance rate: 0.6417642
posterior_samples <- result_sd025$chain[-(1:1000)]
cat("Posterior mean estimate:", mean(posterior_samples), "\n")
## Posterior mean estimate: 0.9150508
cat("95% credible interval:", quantile(posterior_samples, c(0.025, 0.975)), "\n\n")
## 95% credible interval: 0.5228005 1.303016
# Visualize the results
par(mfrow = c(2, 2))

# Trace plot for tuning_sd = 1
plot(result_sd1$chain, type = "l", main = "Trace Plot (sd = 1)", 
     xlab = "Iteration", ylab = expression(theta))
abline(h = true_theta, col = "red", lwd = 2)

# Histogram for tuning_sd = 1
hist(result_sd1$chain[-(1:1000)], breaks = 40, main = "Posterior (sd = 1)", 
     xlab = expression(theta), freq = FALSE)
abline(v = true_theta, col = "red", lwd = 2)

# Trace plot for tuning_sd = 0.25
plot(result_sd025$chain, type = "l", main = "Trace Plot (sd = 0.25)", 
     xlab = "Iteration", ylab = expression(theta))
abline(h = true_theta, col = "red", lwd = 2)

# Histogram for tuning_sd = 0.25
hist(result_sd025$chain[-(1:1000)], breaks = 40, main = "Posterior (sd = 0.25)", 
     xlab = expression(theta), freq = FALSE)
abline(v = true_theta, col = "red", lwd = 2)

Note for the code

Why This Matters for MCMC

In the Metropolis-Hastings algorithm, we need to compute the ratio of posterior densities:

\[ \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)} \]

The Bayesian Framework

Recall Bayes’ theorem:

\[ \pi(\theta \mid y) = \frac{\pi(\theta) \times L(\theta \mid y)}{p(y)} \]

where:

  • \(\pi(\theta)\) is the prior density (a function of \(\theta\))
  • \(L(\theta \mid y)\) is the likelihood function (also a function of \(\theta\))
  • \(p(y) = \int \pi(\theta) L(\theta \mid y) d\theta\) is the marginal likelihood (a constant with respect to \(\theta\))

The Ratio Simplifies

When we take the ratio of two posterior densities at \(\phi\) and \(\theta\), the marginal likelihood \(p(y)\) cancels out:

\[ \frac{\pi(\phi \mid y)}{\pi(\theta \mid y)} = \frac{\frac{\pi(\phi) \times L(\phi \mid y)}{p(y)}}{\frac{\pi(\theta) \times L(\theta \mid y)}{p(y)}} = \frac{\pi(\phi) \times L(\phi \mid y)}{\pi(\theta) \times L(\theta \mid y)} \]

What We Need to Evaluate

Therefore, in the Metropolis-Hastings acceptance probability, we need to evaluate two quantities at both \(\theta\) and \(\phi\):

Quantity Mathematical Form Interpretation
Prior density \(\pi(\theta)\) Function of \(\theta\) only
Likelihood \(L(\theta \mid y) = \prod_{i=1}^{n} f(y_i \mid \theta)\) Function of \(\theta\) with data \(y\) fixed

The Log Transformation

For numerical stability, we work on the log scale:

\[ \log \pi(\theta \mid y) \propto \log \pi(\theta) + \log L(\theta \mid y) \]

where:

  • \(\log \pi(\theta)\) is the log-prior density
  • \(\log L(\theta \mid y) = \sum_{i=1}^{n} \log f(y_i \mid \theta)\) is the log-likelihood

Example 2: Independence Sampler

# Independence Sampler for Normal mean with Cauchy prior

# Proposal distribution: Normal(0, 10) - does NOT depend on current theta
proposal_draw <- function() {
  rnorm(1, mean = 0, sd = 10)
}

proposal_log_density <- function(theta) {
  dnorm(theta, mean = 0, sd = 10, log = TRUE)
}

independence_metropolis <- function(y, initial_theta, n_iter) {
  theta_chain <- numeric(n_iter)
  theta_chain[1] <- initial_theta
  accept_count <- 0
  
  for (t in 1:(n_iter - 1)) {
    current_theta <- theta_chain[t]
    
    # Step 2: Propose from a fixed distribution
    proposal <- proposal_draw()
    
    # Step 3: Acceptance probability for independence sampler
    log_target_proposal <- log_posterior(proposal, y) - proposal_log_density(proposal)
    log_target_current <- log_posterior(current_theta, y) - proposal_log_density(current_theta)
    log_alpha <- log_target_proposal - log_target_current
    alpha <- min(1, exp(log_alpha))
    
    if (runif(1) < alpha) {
      theta_chain[t + 1] <- proposal
      accept_count <- accept_count + 1
    } else {
      theta_chain[t + 1] <- current_theta
    }
  }
  
  acceptance_rate <- accept_count / (n_iter - 1)
  return(list(chain = theta_chain, acceptance_rate = acceptance_rate))
}

# Run the independence sampler
result_ind <- independence_metropolis(y, initial_theta = 0, n_iter = 10000)
cat("=== Independence Sampler ===\n")
## === Independence Sampler ===
cat("Acceptance rate:", result_ind$acceptance_rate, "\n")
## Acceptance rate: 0.02710271
posterior_samples <- result_ind$chain[-(1:1000)]
cat("Posterior mean estimate:", mean(posterior_samples), "\n")
## Posterior mean estimate: 0.9337774
cat("95% credible interval:", quantile(posterior_samples, c(0.025, 0.975)), "\n")
## 95% credible interval: 0.5982786 1.308347
# Visualize independence sampler results
par(mfrow = c(1, 2))

plot(result_ind$chain, type = "l", main = "Independence Sampler Trace Plot", 
     xlab = "Iteration", ylab = expression(theta))
abline(h = true_theta, col = "red", lwd = 2)

hist(result_ind$chain[-(1:1000)], breaks = 40, main = "Independence Sampler Posterior", 
     xlab = expression(theta), freq = FALSE)
abline(v = true_theta, col = "red", lwd = 2)

Example 3: Tuning the Acceptance Rate

# Explore how tuning_sd affects acceptance rate
tuning_values <- seq(0.1, 2, by = 0.1)
acceptance_rates <- sapply(tuning_values, function(sd) {
  result <- metropolis_rw(y, initial_theta = 0, tuning_sd = sd, n_iter = 5000)
  return(result$acceptance_rate)
})

# Plot the relationship
plot(tuning_values, acceptance_rates, type = "b", 
     xlab = "Tuning Standard Deviation", 
     ylab = "Acceptance Rate",
     main = "Acceptance Rate vs. Tuning Parameter",
     pch = 19, col = "blue")
abline(h = 0.44, col = "red", lty = 2, lwd = 2)
abline(h = 0.234, col = "darkgreen", lty = 2, lwd = 2)
legend("topright", legend = c("Optimal for 1D (0.44)", "Optimal for high dim (0.234)"),
       col = c("red", "darkgreen"), lty = 2, lwd = 2)

# Find tuning value closest to 0.44
best_idx <- which.min(abs(acceptance_rates - 0.44))
cat("Optimal tuning_sd (closest to 0.44):", tuning_values[best_idx], 
    "with acceptance rate:", round(acceptance_rates[best_idx], 4), "\n")
## Optimal tuning_sd (closest to 0.44): 0.5 with acceptance rate: 0.4341

Example 4: Visual Demonstration of \(q(\phi \mid \theta) = q(|\phi - \theta|)\)

# Visual demonstration that q(phi|theta) depends only on |phi - theta|

# Set current state
theta <- 2

# Define a random walk proposal (Normal)
proposal_density <- function(phi, theta, sd = 1) {
  dnorm(phi, mean = theta, sd = sd)
}

# Calculate densities for different phi values
phi_values <- seq(-2, 6, length.out = 100)
densities <- sapply(phi_values, function(phi) proposal_density(phi, theta))

# Calculate distances
distances <- abs(phi_values - theta)

# Plot: density as a function of distance
plot(distances, densities, type = "l", lwd = 2,
     xlab = "Distance |φ - θ|", 
     ylab = "Proposal Density q(φ|θ)",
     main = "q(φ|θ) depends ONLY on |φ - θ|\n(same density at same distance regardless of direction)")

# Add points to show symmetry
points(distances[phi_values > theta], densities[phi_values > theta], 
       col = "blue", pch = 16, cex = 0.5)
points(distances[phi_values < theta], densities[phi_values < theta], 
       col = "red", pch = 16, cex = 0.5)
legend("topright", legend = c("φ > θ (right side)", "φ < θ (left side)"),
       col = c("blue", "red"), pch = 16)

# Key observation: same distance gives same density regardless of direction
cat("At distance = 1:\n")
## At distance = 1:
cat("  φ = θ + 1 =", theta + 1, 
    "→ density =", proposal_density(theta + 1, theta), "\n")
##   φ = θ + 1 = 3 → density = 0.2419707
cat("  φ = θ - 1 =", theta - 1, 
    "→ density =", proposal_density(theta - 1, theta), "\n")
##   φ = θ - 1 = 1 → density = 0.2419707

Summary of Expected Results

Method Tuning Parameter Acceptance Rate Posterior Mean 95% Credible Interval
Random Walk \(\tau = 1.0\) ~24% ~0.801 (0.411, 1.193)
Random Walk \(\tau = 0.25\) ~44% ~0.802 (0.414, 1.189)
Independence Sampler N(0, 10) ~12% ~0.806 (0.419, 1.226)

The true value used for data generation is \(\theta = 1.0\), but the posterior mean is around 0.80 due to the shrinkage induced by the Cauchy prior.