Importance Sampling: Mathematical Derivation

The importance sampling estimator for the posterior expectation is:

\[ \hat{E}_{\pi}[b(\theta) \mid y] = \frac{\sum_{j=1}^{N} b(\theta^{(j)}) w^{(j)}}{\sum_{j=1}^{N} w^{(j)}} \]

where:

\[ w^{(j)} = \frac{h(\theta^{(j)})}{g(\theta^{(j)})} \]

Breaking Down Each Part

1. The Notation \(E_{\pi}\) vs. \(E_g\)

  • \(E_{\pi}\) means expectation taken with respect to the posterior distribution \(\pi(\theta \mid y)\) — this is what we want but cannot compute directly

  • \(E_g\) means expectation taken with respect to the importance sampling distribution \(g(\theta)\) — this is what we can compute because we can sample from \(g\)

2. What We Need to Estimate

From the derivation in your text, the posterior expectation equals:

\[ E_{\pi}[b(\theta) \mid y] = \frac{E_g[b(\theta) w(\theta)]}{E_g[w(\theta)]} \]

This is a ratio of two expectations under \(g\):

  • Numerator: \(E_g[b(\theta) w(\theta)]\) — expected value of \(b(\theta) \times w(\theta)\)

  • Denominator: \(E_g[w(\theta)]\) — expected value of just \(w(\theta)\)

3. The Monte Carlo Approximation

We draw \(N\) independent samples from \(g(\theta)\):

\[ \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \sim g(\theta) \]

Then we approximate each expectation by the sample average:

Numerator estimate:

\[ \hat{E}_g[b(\theta) w(\theta)] = \frac{1}{N} \sum_{j=1}^{N} b(\theta^{(j)}) w^{(j)} \]

Denominator estimate:

\[ \hat{E}_g[w(\theta)] = \frac{1}{N} \sum_{j=1}^{N} w^{(j)} \]

4. The Final Estimator

Taking the ratio (the \(1/N\) cancels out):

\[ \hat{E}_{\pi}[b(\theta) \mid y] = \frac{\frac{1}{N} \sum_{j=1}^{N} b(\theta^{(j)}) w^{(j)}}{\frac{1}{N} \sum_{j=1}^{N} w^{(j)}} = \frac{\sum_{j=1}^{N} b(\theta^{(j)}) w^{(j)}}{\sum_{j=1}^{N} w^{(j)}} \]

Why This Works Intuitively

Think of the weights \(w^{(j)}\) as importance weights that correct for sampling from \(g\) instead of \(\pi\):

  • If \(\theta^{(j)}\) is likely under \(\pi\) but unlikely under \(g\), then \(w^{(j)}\) is large (boosting its contribution)

  • If \(\theta^{(j)}\) is unlikely under \(\pi\) but likely under \(g\), then \(w^{(j)}\) is small (down-weighting its contribution)

  • The denominator \(\sum w^{(j)}\) ensures the weights are properly normalized (sum to 1).

Summary Table

Component Description
\(\pi(\theta \mid y)\) Target posterior distribution (difficult to sample from)
\(g(\theta)\) Proposal/importance distribution (easy to sample from)
\(w^{(j)} = h(\theta^{(j)})/g(\theta^{(j)})\) Importance weight (unnormalized)
\(b(\theta)\) Function of interest (e.g., \(\theta\) for posterior mean)
\(\hat{E}_{\pi}[b(\theta) \mid y]\) Importance sampling estimator

Problem Description

We simulate n = 25 observations from N(θ = 1, 1) distribution.
With random seed fixed at 44, the resulting sample mean is ȳ = 0.847.

We perform importance sampling to estimate the posterior mean: - Prior: Standard Cauchy distribution - Proposal distribution: Standard Cauchy (same as prior) - Target distribution: Posterior ∝ Prior × Likelihood

Step-by-Step Calculation

Step 1: Draw samples from standard Cauchy distribution

We draw N = 10,000 samples from the standard Cauchy distribution:

# Set parameters
set.seed(44)  # for reproducibility

n <- 25
true_theta <- 1
sigma <- 1

# Generate n observations from N(theta=1, variance=1)
y <- rnorm(n, mean = true_theta, sd = sigma)

# Calculate sample mean
ybar <- mean(y)

N <- 10000
# Draw from standard Cauchy (location=0, scale=1)
theta <- rcauchy(N, location = 0, scale = 1)

# Display first 10 samples
head(theta, 10)
##  [1]  0.03761153  0.22766209  1.37844753 -1.62301391 -0.38655000 -4.73897934
##  [7] -0.20646008  3.70470032 -4.29933192  0.45876771

Step 2: Calculate importance weights

The weight for each sample \(\theta^{(j)}\) is:

\[ w^{(j)} = \exp\left(-\frac{n}{2}(\theta^{(j)} - \bar{y})^2\right) \]

Since the prior (Cauchy) and proposal (Cauchy) cancel, the weight simplifies to just the likelihood.

# Calculate weights
weights <- exp(- (n/2) * (theta - ybar)^2)

# Display first 10 weights
head(weights, 10)
##  [1]  2.779912e-04  8.279142e-03  2.927328e-02  7.604073e-34  5.496678e-09
##  [6] 4.081481e-170  9.460315e-07  4.628972e-45 1.680587e-144  1.520421e-01
# Summary of weights
summary(weights)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.093150 0.005565 1.000000

Step 3: Compute the posterior mean using importance sampling

The importance sampling estimator for the posterior mean is:

\[ \hat{E}_{\pi}[b(\theta) \mid y]=\bar{\theta}_{IS} = \frac{\sum_{j=1}^{N} \theta^{(j)} w^{(j)}}{\sum_{j=1}^{N} w^{(j)}} \]

# Calculate weighted average
posterior_mean <- sum(theta * weights) / sum(weights)

# Display result
posterior_mean
## [1] 0.8110299

The estimated posterior mean is 0.811.

Step 4: Check effective sample size (ESS)

Due to the heavy tails of the Cauchy distribution, many samples receive very small weights. The effective sample size tells us how many independent samples our weighted sample is equivalent to:

# Normalize weights
weights_norm <- weights / sum(weights)

# Calculate ESS
ESS <- 1 / sum(weights_norm^2)

# Display ESS
ESS
## [1] 1336.264
# Efficiency percentage
efficiency <- (ESS / N) * 100
efficiency
## [1] 13.36264

Only about 13.4% of the samples are effectively contributing to the estimate.

Interpretation

The posterior mean (0.811) differs from the sample mean (0.847) because:

  1. Cauchy prior (centered at 0) pulls the estimate toward 0
  2. Likelihood (centered at 0.847) pulls the estimate toward the data
  3. The posterior is a compromise between prior and likelihood

With n = 25, the likelihood is fairly strong, but the Cauchy prior still exerts some influence, resulting in shrinkage from 0.847 to 0.811.

Complete Code (Single Block)

Here’s all the code together for easy copying:

set.seed(44)
n <- 25
true_theta <- 1
sigma <- 1

# Generate n observations from N(theta=1, variance=1)
y <- rnorm(n, mean = true_theta, sd = sigma)

# Calculate sample mean
ybar <- mean(y)

N <- 10000

# Step 1: Draw from Cauchy
theta <- rcauchy(N, 0, 1)

# Step 2: Calculate weights
weights <- exp(- (n/2) * (theta - ybar)^2)

# Step 3: Compute posterior mean
post_mean <- sum(theta * weights) / sum(weights)
print(post_mean)  # Should be ~0.811
## [1] 0.8110299
# Optional: Check effective sample size
weights_norm <- weights / sum(weights)
ESS <- 1 / sum(weights_norm^2)
print(ESS)  
## [1] 1336.264

Notes

  • The weights are unnormalized - we normalize them implicitly when dividing by their sum
  • The Cauchy proposal works but is inefficient (low ESS)
  • A better proposal would be centered near ȳ (e.g., t-distribution or normal centered at 0.847)