March 25, 2025

Introduction: Why MCMC Matters

The Challenge: Real-world problems require integration over complex probability distributions.

Traditional approaches struggle with:

  • Analytical integration: Often mathematically impossible
  • Direct sampling: Computationally infeasible
  • Grid-based methods: Curse of dimensionality
  • Importance sampling: Finding good proposals

The Solution: MCMC methods sample from virtually any probability distribution.

  • Generates samples via carefully constructed Markov chain
  • Explores probability space efficiently
  • Converts integration problems into simulations
  • Scales to high dimensions

What is MCMC?

Markov Chain Monte Carlo combines:

  1. Markov Chain: Random variables where each state depends only on the previous state

  2. Monte Carlo: Using random sampling for numerical results

Key Insight:

  • Construct a chain with target distribution as equilibrium
  • Samples eventually behave as if from target distribution

Practical Applications

  • Bayesian Statistics

    • Parameter estimation
    • Model comparison
  • Machine Learning

    • Bayesian neural networks
    • Uncertainty quantification
  • Physics & Finance

    • Molecular dynamics
    • Risk assessment
    • Option pricing

The Mathematical Challenge

We need to compute expectations:

\[\mathbb{E}[f(X)] = \int f(x)p(x)dx\]

Where:

  • \(p(x)\) is a complex probability density
  • Direct integration is intractable

Examples:

  • Bayesian posterior expectations
  • Statistical physics ensemble averages

The MCMC Solution:

Sample \(\{x_1, x_2, ..., x_n\}\) from \(p(x)\) to estimate:

\[\mathbb{E}[f(X)] \approx \frac{1}{n}\sum_{i=1}^{n}f(x_i)\]

Converges to true expectation as \(n \rightarrow \infty\)

Intuition: Exploring a Probability Landscape

Imagine hiking in a probability landscape:

  • Height = probability density
  • Goal: spend time proportional to height
  • Can only see immediate surroundings

The Metropolis Algorithm:

  1. Current position: location \(x\)
  2. Proposal: Suggest nearby location \(y\)
  3. Decision:
    • If higher: Always go there
    • If lower: Go with probability \(\frac{p(y)}{p(x)}\)

Key Insight:

  • Random walk that favors high-probability regions
  • Proportion of visits converges to target distribution
  • Allows exploration of entire space

The Metropolis-Hastings Algorithm

Algorithm Steps:

  1. Start at initial state \(x_0\)
  2. For \(t = 0, 1, 2, \ldots\):
    • Sample proposal \(y \sim q(y|x_t)\)
    • Calculate acceptance ratio: \[\alpha = \min\left(1, \frac{p(y)q(x_t|y)}{p(x_t)q(y|x_t)}\right)\]
    • Set \(x_{t+1} = y\) with probability \(\alpha\)
    • Otherwise, set \(x_{t+1} = x_t\)

Key Components:

  • \(p(x)\): Target distribution (can be unnormalized)
  • \(q(y|x)\): Proposal distribution
  • \(\alpha\): Acceptance probability

Important Properties:

  1. Asymptotic correctness: Converges to target distribution

  2. Proposal flexibility: Any ergodic proposal works (affects efficiency)

  3. Detailed balance: Ensures stationary distribution

When to use this algorithm:

  • Works with any target distribution
  • Only requires ability to evaluate density ratio
  • Simpler to implement than HMC methods

Implementing a 2D MCMC Sampler

# Metropolis-Hastings sampler for a 2D distribution
metropolis_hastings_2d <- function(target_density, proposal_sd,
                                 n_samples, init_state) {
  # init chain
  chain <- matrix(0, nrow = n_samples, ncol = 2)
  chain[1,] <- init_state
  
  current_log_density <- log(target_density(chain[1,1], chain[1,2]))
  
  # run chain
  for (i in 2:n_samples) {
    # Propose new state
    proposal <- c(
      rnorm(1, mean = chain[i-1, 1], sd = proposal_sd),
      rnorm(1, mean = chain[i-1, 2], sd = proposal_sd)
    )
    
    # calca] log density at proposal
    proposal_log_density <- log(target_density(proposal[1], proposal[2]))
    
    # calc acceptance probability
    log_accept_prob <- proposal_log_density - current_log_density
    
    # accept or reject
    if (log(runif(1)) < log_accept_prob) {
      chain[i,] <- proposal
      current_log_density <- proposal_log_density
    } else {
      chain[i,] <- chain[i-1,]
    }
  }
  
  return(chain)
}

Our Target Distribution

bimodal_target <- function(x, y) {
  mu1 <- c(1, 1)    # First peak center
  mu2 <- c(-1, -1)  # Second peak center
  sigma <- 0.3      # Spread of peaks
  
  # calc Gaussian densities
  z1 <- (1/(2*pi*sigma^2)) * 
    exp(-0.5*((x-mu1[1])^2 + 
              (y-mu1[2])^2)/sigma^2)
  z2 <- (1/(2*pi*sigma^2)) * 
    exp(-0.5*((x-mu2[1])^2 + 
              (y-mu2[2])^2)/sigma^2)
  
  # combine with equal weights
  return(0.5 * z1 + 0.5 * z2)
}

MCMC in Action

3D Visualization

Understanding the Sampling Process

What We’re Observing:

  1. Initial exploration (burn-in)

    • Chain explores from starting point
    • “Forgets” initial state
  2. Convergence

    • More time in high-density regions
    • Proportion in each mode ≈ 50%
  3. Mode-switching

    • Chain jumps between modes
    • More difficult in higher dimensions
  4. Stationarity

    • Distribution of samples stabilizes
    • Chain reaches target distribution

Convergence Assessment

Convergence Diagnostics

Assessing Convergence

Trace plots help determine:

  • Sufficient burn-in
  • Chain “mixing” quality
  • Local mode problems

Warning Signs:

  • Slow parameter drift
  • Long stationary periods
  • Poor mode-jumping
  • High autocorrelation

Effective Sample Size

Effective Sample Size (ESS)

Due to autocorrelation, MCMC samples are dependent. ESS estimates equivalent independent samples:

\[\text{ESS} = \frac{n}{1 + 2\sum_{k=1}^{\infty}\rho_k}\]

where:

  • \(n\) = number of samples
  • \(\rho_k\) = autocorrelation at lag \(k\)

Higher ESS means:

  • More accurate estimates
  • Better exploration
  • Lower Monte Carlo error
Parameter Total Effective Efficiency % Accept %
X dimension 3500 516.5 14.8 35.8
Y dimension 3500 507.8 14.5 35.8

Tuning MCMC Performance

Key Tuning Parameters:

  1. Proposal scale
    • Too small: High acceptance, slow
    • Too large: Low acceptance, poor
    • Optimal: ~20-40% acceptance

Critical Considerations

  • Burn-in period length
  • Multiple starting points
  • Parameter correlations

Advanced MCMC Methods

Hamiltonian Monte Carlo (HMC)

  • Uses gradient information
  • Physics-inspired sampling
  • Better in high dimensions
  • Implemented in Stan, PyMC

Gibbs Sampling

  • Updates one dimension at a time
  • Uses conditional distributions
  • Efficient for hierarchical models
  • Perfect for conditional conjugacy

No-U-Turn Sampler (NUTS)

  • Adaptive extension of HMC
  • Auto-tunes parameters
  • Avoids inefficient trajectories
  • State-of-the-art performance

Reversible Jump MCMC

  • Handles varying dimensions
  • Used for model selection
  • Enables trans-dimensional moves
  • Useful for mixture models

Conclusion: Power of MCMC

Key Takeaways:

  1. MCMC enables sampling from complex distributions that would be intractable with direct methods

  2. Metropolis-Hastings provides a flexible framework adaptable to almost any problem

  3. Proper diagnostics and tuning are essential for reliable results

  4. Advanced methods improve efficiency for complex problems

  5. Balancing exploration and exploitation is crucial for efficient sampling

Why MCMC Remains Essential:

  • Enables complex Bayesian analysis
  • Handles high-dimensional problems
  • Provides uncertainty quantification
  • Adaptable across domains
  • Supports robust decision-making

Resources:

  • “Bayesian Data Analysis” (Gelman)
  • “Monte Carlo Statistical Methods” (Robert & Casella)
  • Stan, PyMC documentation
  • “Probabilistic Programming & Bayesian Methods for Hackers”