Exploring Probability Spaces with Markov Chain Monte Carlo

March 25, 2025

Introduction: Why MCMC Matters

The Challenge: Real-world problems require integration over complex probability distributions.

Traditional approaches struggle with:

Analytical integration: Often mathematically impossible
Direct sampling: Computationally infeasible
Grid-based methods: Curse of dimensionality
Importance sampling: Finding good proposals

The Solution: MCMC methods sample from virtually any probability distribution.

Generates samples via carefully constructed Markov chain
Explores probability space efficiently
Converts integration problems into simulations
Scales to high dimensions

What is MCMC?

Markov Chain Monte Carlo combines:

Markov Chain: Random variables where each state depends only on the previous state
Monte Carlo: Using random sampling for numerical results

Key Insight:

Construct a chain with target distribution as equilibrium
Samples eventually behave as if from target distribution

Practical Applications

Bayesian Statistics
- Parameter estimation
- Model comparison
Machine Learning
- Bayesian neural networks
- Uncertainty quantification
Physics & Finance
- Molecular dynamics
- Risk assessment
- Option pricing

The Mathematical Challenge

We need to compute expectations:

\[\mathbb{E}[f(X)] = \int f(x)p(x)dx\]

Where:

\(p(x)\) is a complex probability density
Direct integration is intractable

Examples:

Bayesian posterior expectations
Statistical physics ensemble averages

The MCMC Solution:

Sample \(\{x_1, x_2, ..., x_n\}\) from \(p(x)\) to estimate:

\[\mathbb{E}[f(X)] \approx \frac{1}{n}\sum_{i=1}^{n}f(x_i)\]

Converges to true expectation as \(n \rightarrow \infty\)

Intuition: Exploring a Probability Landscape

Imagine hiking in a probability landscape:

Height = probability density
Goal: spend time proportional to height
Can only see immediate surroundings

The Metropolis Algorithm:

Current position: location \(x\)
Proposal: Suggest nearby location \(y\)
Decision:
- If higher: Always go there
- If lower: Go with probability \(\frac{p(y)}{p(x)}\)

Key Insight:

Random walk that favors high-probability regions
Proportion of visits converges to target distribution
Allows exploration of entire space

The Metropolis-Hastings Algorithm

Algorithm Steps:

Start at initial state \(x_0\)
For \(t = 0, 1, 2, \ldots\):
- Sample proposal \(y \sim q(y|x_t)\)
- Calculate acceptance ratio: \[\alpha = \min\left(1, \frac{p(y)q(x_t|y)}{p(x_t)q(y|x_t)}\right)\]
- Set \(x_{t+1} = y\) with probability \(\alpha\)
- Otherwise, set \(x_{t+1} = x_t\)

Key Components:

\(p(x)\): Target distribution (can be unnormalized)
\(q(y|x)\): Proposal distribution
\(\alpha\): Acceptance probability

Important Properties:

Asymptotic correctness: Converges to target distribution
Proposal flexibility: Any ergodic proposal works (affects efficiency)
Detailed balance: Ensures stationary distribution

When to use this algorithm:

Works with any target distribution
Only requires ability to evaluate density ratio
Simpler to implement than HMC methods

Implementing a 2D MCMC Sampler

# Metropolis-Hastings sampler for a 2D distribution
metropolis_hastings_2d <- function(target_density, proposal_sd,
                                 n_samples, init_state) {
  # init chain
  chain <- matrix(0, nrow = n_samples, ncol = 2)
  chain[1,] <- init_state
  
  current_log_density <- log(target_density(chain[1,1], chain[1,2]))
  
  # run chain
  for (i in 2:n_samples) {
    # Propose new state
    proposal <- c(
      rnorm(1, mean = chain[i-1, 1], sd = proposal_sd),
      rnorm(1, mean = chain[i-1, 2], sd = proposal_sd)
    )
    
    # calca] log density at proposal
    proposal_log_density <- log(target_density(proposal[1], proposal[2]))
    
    # calc acceptance probability
    log_accept_prob <- proposal_log_density - current_log_density
    
    # accept or reject
    if (log(runif(1)) < log_accept_prob) {
      chain[i,] <- proposal
      current_log_density <- proposal_log_density
    } else {
      chain[i,] <- chain[i-1,]
    }
  }
  
  return(chain)
}

Our Target Distribution

bimodal_target <- function(x, y) {
  mu1 <- c(1, 1)    # First peak center
  mu2 <- c(-1, -1)  # Second peak center
  sigma <- 0.3      # Spread of peaks
  
  # calc Gaussian densities
  z1 <- (1/(2*pi*sigma^2)) * 
    exp(-0.5*((x-mu1[1])^2 + 
              (y-mu1[2])^2)/sigma^2)
  z2 <- (1/(2*pi*sigma^2)) * 
    exp(-0.5*((x-mu2[1])^2 + 
              (y-mu2[2])^2)/sigma^2)
  
  # combine with equal weights
  return(0.5 * z1 + 0.5 * z2)
}

MCMC in Action

3D Visualization

Understanding the Sampling Process

What We’re Observing:

Initial exploration (burn-in)
- Chain explores from starting point
- “Forgets” initial state
Convergence
- More time in high-density regions
- Proportion in each mode ≈ 50%
Mode-switching
- Chain jumps between modes
- More difficult in higher dimensions
Stationarity
- Distribution of samples stabilizes
- Chain reaches target distribution

Convergence Assessment

Convergence Diagnostics

Assessing Convergence

Trace plots help determine:

Sufficient burn-in
Chain “mixing” quality
Local mode problems

Warning Signs:

Slow parameter drift
Long stationary periods
Poor mode-jumping
High autocorrelation

Effective Sample Size

Effective Sample Size (ESS)

Due to autocorrelation, MCMC samples are dependent. ESS estimates equivalent independent samples:

\[\text{ESS} = \frac{n}{1 + 2\sum_{k=1}^{\infty}\rho_k}\]

where:

\(n\) = number of samples
\(\rho_k\) = autocorrelation at lag \(k\)

Higher ESS means:

More accurate estimates
Better exploration
Lower Monte Carlo error

Parameter	Total	Effective	Efficiency %	Accept %
X dimension	3500	516.5	14.8	35.8
Y dimension	3500	507.8	14.5	35.8

Tuning MCMC Performance

Key Tuning Parameters:

Proposal scale
- Too small: High acceptance, slow
- Too large: Low acceptance, poor
- Optimal: ~20-40% acceptance

Critical Considerations

Burn-in period length
Multiple starting points
Parameter correlations

Advanced MCMC Methods

Hamiltonian Monte Carlo (HMC)

Uses gradient information
Physics-inspired sampling
Better in high dimensions
Implemented in Stan, PyMC

Gibbs Sampling

Updates one dimension at a time
Uses conditional distributions
Efficient for hierarchical models
Perfect for conditional conjugacy

No-U-Turn Sampler (NUTS)

Adaptive extension of HMC
Auto-tunes parameters
Avoids inefficient trajectories
State-of-the-art performance

Reversible Jump MCMC

Handles varying dimensions
Used for model selection
Enables trans-dimensional moves
Useful for mixture models

Conclusion: Power of MCMC

Key Takeaways:

MCMC enables sampling from complex distributions that would be intractable with direct methods
Metropolis-Hastings provides a flexible framework adaptable to almost any problem
Proper diagnostics and tuning are essential for reliable results
Advanced methods improve efficiency for complex problems
Balancing exploration and exploitation is crucial for efficient sampling

Why MCMC Remains Essential:

Enables complex Bayesian analysis
Handles high-dimensional problems
Provides uncertainty quantification
Adaptable across domains
Supports robust decision-making

Resources:

“Bayesian Data Analysis” (Gelman)
“Monte Carlo Statistical Methods” (Robert & Casella)
Stan, PyMC documentation
“Probabilistic Programming & Bayesian Methods for Hackers”