Understanding Proposal Distributions in Metropolis-Hastings

The Problem We’re Trying to Solve

We want to sample from the posterior distribution:

\[ \pi(\theta \mid y) \propto \pi(\theta) \times L(\theta) \]

But this posterior is not a standard distribution we know how to sample from directly (not Normal, Beta, Gamma, etc.).

What a Proposal Distribution Does

The proposal distribution \(q(\theta' \mid \theta)\) is our tool for exploring the parameter space. Think of it as:

A suggestion mechanism - “Let’s try moving to \(\theta'\)”
We can choose any distribution we want (that’s the beauty of MH!)
We just need to be able to draw samples from it easily

In This Specific Algorithm

We chose \(q(\theta' \mid \theta) = \pi(\theta')\) (the prior) as our proposal distribution.

Why would we do this? Because:

It’s easy to sample from (just draw \(z \sim \mathcal{N}(0,1)\) and transform)
It leads to a simple acceptance probability (just likelihood ratio)
It explores the whole parameter space (since prior has support on \((0,1)\))

The Role of the Proposal Distribution

Step	What happens
Current state	We’re at \(\theta\)
Propose	Use \(q\) to suggest a new \(\theta'\)
Accept/Reject	Use MH formula to decide whether to move there
Repeat	Eventually, the samples approximate \(\pi(\theta \mid y)\)

Analogy

Imagine you’re trying to map a mountain (the posterior):

Proposal distribution = Your method for choosing where to hike next

You could choose: - Random jumps anywhere (independence sampler - what we did) - Small steps from current position (random walk) - Guided by gradient (Hamiltonian Monte Carlo)

The MH algorithm corrects for your exploration strategy so that you end up sampling from the true posterior.

Without a Proposal Distribution?

You can’t explore the space! You need some rule for generating candidate values to try. The proposal distribution is that rule.

Summary

The proposal distribution is our chosen exploration strategy. We can pick any convenient distribution, and MH tells us how to accept/reject proposals so we eventually sample from the correct posterior.

Complete MH Algorithm for Our Problem

Problem Setup

Observations: \(y_1, \ldots, y_n \stackrel{\text{i.i.d.}}{\sim} \text{Bernoulli}(\theta)\)

Prior: \(\log\left( \frac{\theta}{1-\theta} \right) \sim \mathcal{N}(0, 1)\)

Prior density: \[ \pi(\theta) = \frac{1}{\sqrt{2\pi} \, \theta(1-\theta)} \exp\left[ -\frac12 \left( \log\frac{\theta}{1-\theta} \right)^2 \right], \quad 0<\theta<1 \]

Posterior: \[ \pi(\theta \mid y) \propto \pi(\theta) \times \theta^S (1-\theta)^{n-S} \] where \(S = \sum_{i=1}^n y_i\).

Proposal Distribution

We choose \(q(\theta' \mid \theta) = \pi(\theta')\) (the prior).

To draw from this proposal: 1. Draw \(z \sim \mathcal{N}(0, 1)\) 2. Set \(\theta' = \frac{e^z}{1 + e^z}\)

Acceptance Probability

\[ \alpha(\theta, \theta') = \min\left\{ 1, \frac{(\theta')^{S} (1-\theta')^{n-S}}{\theta^{S} (1-\theta)^{n-S}} \right\} \]

Algorithm Steps

Initialization: 1. Draw \(z^{(0)} \sim \mathcal{N}(0, 1)\) 2. Set \(\theta^{(0)} = \frac{e^{z^{(0)}}}{1 + e^{z^{(0)}}}\)

For \(t = 1, 2, \ldots, T\):

Propose: Draw \(z' \sim \mathcal{N}(0, 1)\), set \(\theta' = \frac{e^{z'}}{1 + e^{z'}}\)
Compute: \(r = \frac{(\theta')^{S} (1-\theta')^{n-S}}{(\theta^{(t-1)})^{S} (1-\theta^{(t-1)})^{n-S}}\), \(\alpha = \min(1, r)\)
Accept/Reject: Draw \(u \sim \text{Uniform}(0,1)\). If \(u < \alpha\), set \(\theta^{(t)} = \theta'\); otherwise \(\theta^{(t)} = \theta^{(t-1)}\)

R Implementation

# Generate synthetic data
set.seed(123)
n <- 100
true_theta <- 0.3
y <- rbinom(n, 1, true_theta)
S <- sum(y)
cat("Observed data: n =", n, ", S =", S, "\n")

## Observed data: n = 100 , S = 29

cat("True theta =", true_theta, "\n")

## True theta = 0.3

# Function to draw from prior (proposal distribution)
draw_from_prior <- function() {
  z <- rnorm(1, 0, 1)
  theta <- exp(z) / (1 + exp(z))
  return(theta)
}

# Metropolis-Hastings independence sampler
mh_independence <- function(S, n, n_iter) {
  # Initialize: draw from prior
  theta_current <- draw_from_prior()
  
  # Store samples
  samples <- numeric(n_iter)
  samples[1] <- theta_current
  acceptance_count <- 0
  
  # MCMC loop
  for (t in 2:n_iter) {
    # Propose from prior (independent of current)
    theta_proposed <- draw_from_prior()
    
    # Compute likelihood ratio
    log_r <- S * (log(theta_proposed) - log(theta_current)) + 
             (n-S) * (log(1-theta_proposed) - log(1-theta_current))
    r <- exp(min(log_r, 700))  # Cap to avoid Inf
    alpha <- min(1, r)
    
    # Accept/reject
    if (runif(1) < alpha) {
      theta_current <- theta_proposed
      acceptance_count <- acceptance_count + 1
    }
    
    samples[t] <- theta_current
  }
  
  acceptance_rate <- acceptance_count / (n_iter - 1)
  return(list(samples = samples, acceptance_rate = acceptance_rate))
}

# Run the sampler
n_iter <- 10000
results <- mh_independence(S, n, n_iter)

cat("\nAcceptance rate:", round(results$acceptance_rate * 100, 2), "%\n")

## 
## Acceptance rate: 18.69 %

cat("Posterior mean:", round(mean(results$samples), 4), "\n")

## Posterior mean: 0.2979

cat("MLE:", round(S/n, 4), "\n")

## MLE: 0.29

Diagnostics

# Trace plot
plot(results$samples, type = "l", 
     main = "Trace Plot of θ",
     xlab = "Iteration", ylab = "θ",
     col = "steelblue")
abline(h = true_theta, col = "red", lwd = 2, lty = 2)

# Histogram (after burn-in)
burn_in <- 1000
post_burnin <- results$samples[-(1:burn_in)]

hist(post_burnin, breaks = 50, 
     main = "Posterior Distribution of θ",
     xlab = "θ", ylab = "Frequency",
     col = "lightblue", border = "white")
abline(v = true_theta, col = "red", lwd = 2, lty = 2)
abline(v = mean(post_burnin), col = "darkgreen", lwd = 2)

# Autocorrelation
acf(post_burnin, main = "Autocorrelation of θ (after burn-in)")

Key Takeaways

Proposal distribution = exploration strategy for parameter space
Independence sampler = proposal does not depend on current state
Our choice = prior distribution (easy to sample, simple acceptance)
Acceptance probability = \(\min(1, \text{likelihood ratio})\) - prior cancels!
Works because MH corrects for any proposal distribution

This algorithm allows us to sample from an intractable posterior using only: - Easy draws from the prior - Likelihood calculations - The MH acceptance formula

Overview: Which Methods Need Proposal Distributions?

Method	Needs Proposal Distribution?	What It Uses Instead
Rejection Sampling	✅ Yes	Proposal distribution \(g(x)\) (plus a constant \(M\))
Metropolis-Hastings	✅ Yes	Proposal distribution \(q(\theta' \mid \theta)\)
Importance Sampling	✅ Yes	Importance distribution \(g(x)\) (for weighting)
Gibbs Sampling	❌ No	Full conditional distributions (derived from model)

1. Rejection Sampling - NEEDS Proposal

Proposal: \(g(x)\) (easy to sample from)

How it works: - Sample \(x \sim g(x)\) - Sample \(u \sim \text{Uniform}(0, M g(x))\) - Accept if \(u < f(x)\) (target)

Why needed: You need some distribution to propose candidates from, then reject some to match target \(f(x)\).

Example (R code):

# Target: Beta(2,5) distribution
f <- function(x) dbeta(x, 2, 5)

# Proposal: Uniform(0,1)
g <- function(x) dunif(x, 0, 1)

# Find M (maximum of f/g)
M <- optimize(function(x) f(x)/g(x), interval=c(0,1), maximum=TRUE)$objective

# Rejection sampling
n_samples <- 10000
samples <- numeric(n_samples)
accepted <- 0

while(accepted < n_samples) {
  x <- runif(1)  # Sample from proposal
  u <- runif(1, 0, M * g(x))
  
  if(u < f(x)) {
    accepted <- accepted + 1
    samples[accepted] <- x
  }
}

hist(samples, breaks=50, main="Rejection Sampling from Beta(2,5)", 
     xlab="x", col="lightblue", probability=TRUE)
curve(dbeta(x,2,5), add=TRUE, col="red", lwd=2)

2. Metropolis-Hastings - NEEDS Proposal

Proposal: \(q(\theta' \mid \theta)\) (your exploration strategy)

How it works: - Propose \(\theta' \sim q(\cdot \mid \theta)\) - Accept with probability \(\alpha = \min\left(1, \frac{f(\theta') q(\theta \mid \theta')}{f(\theta) q(\theta' \mid \theta)}\right)\)

Why needed: You need a way to explore the parameter space. The proposal can be anything (random walk, independence sampler, etc.).

Example (Random Walk MH):

# Target: N(0,1)
target <- function(x) dnorm(x, 0, 1)

# Proposal: Random walk N(current, 0.5^2)
propose <- function(current) {
  rnorm(1, current, 0.5)
}

# MH algorithm
n_iter <- 10000
samples <- numeric(n_iter)
samples[1] <- 0  # Initial value
accept_count <- 0

for(t in 2:n_iter) {
  # Propose
  theta_prime <- propose(samples[t-1])
  
  # Compute acceptance probability
  alpha <- min(1, target(theta_prime) / target(samples[t-1]))
  
  # Accept/reject
  if(runif(1) < alpha) {
    samples[t] <- theta_prime
    accept_count <- accept_count + 1
  } else {
    samples[t] <- samples[t-1]
  }
}

cat("Acceptance rate:", accept_count/(n_iter-1)*100, "%\n")

## Acceptance rate: 84.48845 %

hist(samples, breaks=50, main="MH Sampling from N(0,1)", 
     xlab="x", col="lightblue", probability=TRUE)
curve(dnorm(x), add=TRUE, col="red", lwd=2)

3. Importance Sampling - NEEDS Proposal

Proposal: \(g(x)\) (importance distribution)

How it works: - Sample \(x_i \sim g(x)\) - Compute weights \(w_i = f(x_i) / g(x_i)\) - Estimate \(E[h(X)] = \frac{\sum w_i h(x_i)}{\sum w_i}\)

Why needed: You sample from \(g\) instead of target \(f\), then reweight. No acceptance/rejection.

Example:

# Target: Beta(2,5)
f <- function(x) dbeta(x, 2, 5)

# Proposal: Uniform(0,1)
g <- function(x) dunif(x, 0, 1)

# Importance sampling
n_samples <- 10000
x <- runif(n_samples)  # Sample from proposal
weights <- f(x) / g(x)
weights <- weights / sum(weights)  # Normalize

# Estimate mean
est_mean <- sum(x * weights)
true_mean <- 2/(2+5)  # Beta(2,5) mean

cat("True mean:", true_mean, "\n")

## True mean: 0.2857143

cat("Importance sampling estimate:", est_mean, "\n")

## Importance sampling estimate: 0.2843986

# Weighted histogram
hist(x, breaks=50, main="Importance Sampling from Beta(2,5)", 
     xlab="x", col="lightblue", probability=TRUE, weights=weights)

## Warning in plot.window(xlim, ylim, log, ...): "weights" is not a graphical
## parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...):
## "weights" is not a graphical parameter

## Warning in axis(1, ...): "weights" is not a graphical parameter

## Warning in axis(2, at = yt, ...): "weights" is not a graphical parameter

curve(dbeta(x,2,5), add=TRUE, col="red", lwd=2)

4. Gibbs Sampling - NO Proposal Distribution!

What it uses: Full conditional distributions

How it works (for parameters \(\theta_1, \theta_2, \ldots, \theta_k\)):

For \(t = 1, 2, \ldots\): - Draw \(\theta_1^{(t)} \sim \pi(\theta_1 \mid \theta_2^{(t-1)}, \theta_3^{(t-1)}, \ldots, y)\) - Draw \(\theta_2^{(t)} \sim \pi(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, y)\) - Draw \(\theta_3^{(t)} \sim \pi(\theta_3 \mid \theta_1^{(t)}, \theta_2^{(t)}, \ldots, y)\) - And so on…

Why no proposal needed: - You sample directly from full conditionals - Every sample is accepted automatically (acceptance probability = 1) - But you need to be able to sample from these conditionals

Example (Bivariate Normal):

# Target: Bivariate Normal with correlation
# Using Gibbs sampling

n_iter <- 5000
n_burnin <- 1000
samples <- matrix(0, nrow=n_iter, ncol=2)
colnames(samples) <- c("x", "y")

# Parameters
mu_x <- 0
mu_y <- 0
sigma_x <- 1
sigma_y <- 1
rho <- 0.8

# Initialize
samples[1,] <- c(0, 0)

# Gibbs sampling
for(t in 2:n_iter) {
  # Sample x given y
  mu_x_given_y <- mu_x + rho * (sigma_x/sigma_y) * (samples[t-1,2] - mu_y)
  sigma_x_given_y <- sigma_x * sqrt(1 - rho^2)
  samples[t,1] <- rnorm(1, mu_x_given_y, sigma_x_given_y)
  
  # Sample y given x
  mu_y_given_x <- mu_y + rho * (sigma_y/sigma_x) * (samples[t,1] - mu_x)
  sigma_y_given_x <- sigma_y * sqrt(1 - rho^2)
  samples[t,2] <- rnorm(1, mu_y_given_x, sigma_y_given_x)
}

# Plot results
par(mfrow=c(1,2))
plot(samples[(n_burnin+1):n_iter,], pch=20, cex=0.5, 
     main="Gibbs Samples from Bivariate Normal", 
     xlab="x", ylab="y", col="steelblue")

hist(samples[(n_burnin+1):n_iter,1], breaks=50, 
     main="Marginal of x", xlab="x", col="lightblue", probability=TRUE)
curve(dnorm(x, mu_x, sigma_x), add=TRUE, col="red", lwd=2)

Visual Comparison of Methods

Rejection Sampling:
Target f(x) ←→ Proposal g(x) [NEEDED]
        ↓
    Accept/Reject

Metropolis-Hastings:
Target f(θ) ←→ Proposal q(θ'|θ) [NEEDED]
        ↓
    Accept with α

Importance Sampling:
Target f(x) ←→ Proposal g(x) [NEEDED]
        ↓
    Weight samples

Gibbs Sampling:
Target π(θ|y) ←→ Full conditionals [NO PROPOSAL]
        ↓
    Direct sampling

When to Use Which?

Method	Best when…
Rejection Sampling	Target is bounded, proposal is close to target, low dimension
Metropolis-Hastings	General purpose, any target, any dimension
Importance Sampling	Estimating expectations, target hard to sample but easy to evaluate
Gibbs Sampling	Full conditionals are known and easy to sample from

Key Insights

Rejection, MH, and Importance all need a proposal/importance distribution because they sample from something other than the target
Gibbs samples directly from pieces of the target (the conditionals), so no proposal needed
The choice of proposal distribution dramatically affects:
- Rejection Sampling: Efficiency (acceptance rate)
- Metropolis-Hastings: Mixing and convergence speed
- Importance Sampling: Variance of estimators
Gibbs is the special one that doesn’t require a user-specified proposal distribution - but it requires deriving full conditionals!

Summary Table of Requirements

Method	Proposal Needed?	Acceptance Step?	Can Sample Any Target?
Rejection Sampling	Yes	Yes (accept/reject)	No (needs bounded f/g)
Metropolis-Hastings	Yes	Yes (probabilistic)	Yes (with good proposal)
Importance Sampling	Yes	No (uses weights)	Yes (but high variance possible)
Gibbs Sampling	No	No (always accept)	No (needs conditional forms)