pkgs <- c("ggplot2", "patchwork", "coda", "mvtnorm")
inst <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(inst)) install.packages(inst, dependencies = TRUE)
invisible(lapply(pkgs, require, character.only = TRUE))

Monte Carlo uses random sampling to approximate quantities like integrals, probabilities, and expectations.
If \(X \sim f\) and we want \(\mathbb{E}[g(X)]\), then with IID draws \(X_1, \dots, X_n\),
\[ \hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} g(X_i) \xrightarrow{n \to \infty} \mathbb{E}[g(X)] \]
by the Law of Large Numbers; the error typically scales like \(n^{-1/2}\) (by the CLT).
Let \(X\sim f\) and \(\mu=\mathbb{E}[g(X)]\). With IID draws \(X_{1:n}\), \[ \widehat{\mu}_n=\frac{1}{n}\sum_{i=1}^n g(X_i),\qquad \mathbb{E}[\widehat{\mu}_n]=\mu,\quad \operatorname{Var}(\widehat{\mu}_n)=\frac{\operatorname{Var}[g(X)]}{n}. \] By the LLN, \(\widehat{\mu}_n\to \mu\) a.s. As \(n\to\infty\), a CLT gives \[ \sqrt{n}\,(\widehat{\mu}_n-\mu)\;\xrightarrow{d}\; \mathcal{N}\!\big(0,\ \sigma_g^2\big),\quad \sigma_g^2=\operatorname{Var}[g(X)]. \] A plug-in Monte Carlo SE (MCSE) is \[ \widehat{\mathrm{MCSE}}=\sqrt{\widehat{\operatorname{Var}}(g(X))/n}, \] yielding an approximate \(100(1-\alpha)\%\) CI: \(\widehat{\mu}_n \pm z_{1-\alpha/2}\,\widehat{\mathrm{MCSE}}\).
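To make the recipe concrete, here is a minimal sketch (a toy example of my own, not part of the dartboard demo below): estimate \(\mathbb{E}[X^2]=1\) for \(X\sim\mathcal{N}(0,1)\) and report the plug-in MCSE and an approximate 95% CI.

set.seed(1)
n_mc <- 1e5
gx <- rnorm(n_mc)^2                               # g(X_i) = X_i^2 with X_i ~ N(0, 1)
mu_hat <- mean(gx)                                # Monte Carlo estimate of E[X^2] = 1
mcse   <- sqrt(var(gx) / n_mc)                    # plug-in Monte Carlo standard error
ci95   <- mu_hat + c(-1, 1) * qnorm(0.975) * mcse # approximate 95% CI
c(estimate = mu_hat, MCSE = mcse, lower = ci95[1], upper = ci95[2])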
Variance reduction (at a glance):

- Antithetics: pair \(U\) with \(1-U\).
- Control variates: \(\tilde{g}=g-\beta(h-\mathbb{E}h)\) with \(\beta^\star=\operatorname{Cov}(g,h)/\operatorname{Var}(h)\).
- Importance sampling: draw \(X\sim q\), use \(\widehat{\mu}^{\mathrm{IS}}=\tfrac{1}{n}\sum g(X_i)\,w(X_i)\) where \(w=f/q\). A self-normalized version uses \(\sum g(X_i)w_i / \sum w_i\); its weight ESS is \(n_{\mathrm{eff}} = \frac{(\sum w_i)^2}{\sum w_i^2}\) (larger is better). See the sketch below.
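As a concrete illustration of the importance-sampling bullet, here is a minimal sketch under an assumed toy setup (estimating the tail probability \(P(X>3)\) for \(X\sim\mathcal{N}(0,1)\) with proposal \(q=\mathcal{N}(3,1)\)); it reports the plain and self-normalized estimates and the weight ESS.

set.seed(1)
n_is <- 1e5
x_is <- rnorm(n_is, mean = 3)                 # draws from the proposal q = N(3, 1)
w    <- dnorm(x_is) / dnorm(x_is, mean = 3)   # importance weights w = f / q
g_is <- as.numeric(x_is > 3)                  # g(x) = 1{x > 3}
mu_is   <- mean(g_is * w)                     # plain IS estimate
mu_snis <- sum(g_is * w) / sum(w)             # self-normalized IS estimate
n_eff   <- sum(w)^2 / sum(w^2)                # weight ESS (larger is better)
c(IS = mu_is, SNIS = mu_snis, ESS = n_eff, truth = pnorm(3, lower.tail = FALSE))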
Example: Estimating \(\pi\) with “darts”
Sample uniformly in the square \([-1,1]^2\), count points inside the unit circle.
n <- 50000
xy <- data.frame(
x = runif(n, -1, 1),
y = runif(n, -1, 1)
)
xy$inside <- with(xy, x^2 + y^2 <= 1)
pi_hat <- 4 * mean(xy$inside)
pi_hat
## [1] 3.14432
# Running estimate of pi
run_est <- 4 * cumsum(xy$inside) / seq_len(n)
df_run <- data.frame(n = seq_len(n), pi_hat = run_est)
g1 <- ggplot(df_run, aes(n, pi_hat)) +
geom_line() +
geom_hline(yintercept = pi, linetype = "dashed", size = 0.4) +
labs(title = "Running Estimate of π", x = "Samples (n)", y = expression(hat(pi)))
# Circle for plotting
theta <- seq(0, 2*pi, length.out = 400)
circle <- data.frame(cx = cos(theta), cy = sin(theta))
# A random subset of points for display
idx <- sample.int(n, 2000)
g2 <- ggplot() +
geom_point(data = xy[idx, ], aes(x, y, color = inside), alpha = 0.6) +
geom_path(data = circle, aes(cx, cy), size = 1) +
coord_fixed() +
labs(title = "Dartboard Monte Carlo", x = "x", y = "y") +
guides(color = "none")
g1 / g2

When IID sampling from a target \(\pi(\cdot)\) is hard, we construct a Markov chain with \(\pi\) as its stationary distribution. After an initial burn-in, the draws look like they come from \(\pi\), though they are autocorrelated.
Key terms:

- Burn-in: discard early draws taken before the chain reaches stationarity.
- Mixing: how fast the chain explores the target.
- Acceptance rate: the fraction of Metropolis–Hastings proposals accepted.
- ESS (effective sample size): how many independent draws the autocorrelated chain is effectively worth.
We will demo:

- Metropolis–Hastings (MH): accept/reject proposals to preserve the target.
- Gibbs sampling: sample sequentially from full conditionals (when available).
- Small helper plotting functions.
An MCMC algorithm constructs a Markov chain \(\{X_t\}\) with invariant (stationary) distribution \(\pi\). If the chain is \(\pi\)-irreducible, aperiodic, and positive Harris recurrent (collectively, ergodic), then for any integrable \(h\), \[ \frac{1}{n}\sum_{t=1}^n h(X_t)\ \xrightarrow{\text{a.s.}}\ \mathbb{E}_\pi[h(X)]. \] A Markov chain CLT often holds: \[ \sqrt{n}\big(\bar{h}_n-\mathbb{E}_\pi h\big)\ \xrightarrow{d}\ \mathcal{N}\!\big(0,\ \sigma_h^2\big), \] with \[ \sigma_h^2=\operatorname{Var}_\pi(h(X_0))\left(1+2\sum_{k\ge1}\rho_h(k)\right) =\operatorname{Var}_\pi(h)\,\tau_{\mathrm{int}}, \] where \(\rho_h(k)\) are autocorrelations and \(\tau_{\mathrm{int}}\) is the integrated autocorrelation time. Define ESS by \(n_{\mathrm{eff}}=n/\tau_{\mathrm{int}}\). In practice, estimate \(\sigma_h^2\) by batch means or spectral estimators; report MCSE \(= \sqrt{\widehat{\sigma}_h^2/n}\).
Burn-in & mixing. The burn-in removes pre-stationary draws; mixing is how rapidly the chain explores \(\pi\). Good mixing \(\Rightarrow\) small \(\tau_{\mathrm{int}}\) \(\Rightarrow\) large ESS.
trace_plot <- function(x, title = "Trace") {
df <- data.frame(iter = seq_along(x), value = x)
ggplot(df, aes(iter, value)) +
geom_line() +
labs(title = title, x = "Iteration", y = "Value")
}
acf_plot <- function(x, lags = 50, title = "ACF") {
ac <- acf(x, plot = FALSE, lag.max = lags)
df <- data.frame(lag = as.numeric(ac$lag), acf = as.numeric(ac$acf))
df <- df[df$lag >= 0, ]
ggplot(df, aes(lag, acf)) +
geom_col(width = 0.9) +
geom_hline(yintercept = 0, size = 0.3) +
labs(title = title, x = "Lag", y = "ACF")
}

Target density (unnormalized), bimodal:
\[ \pi(x) \propto \exp\left(-\frac{x^4}{4} + \frac{x^2}{2}\right) \]
No closed-form sampler — perfect for MCMC.
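As a quick sanity check (a small addition, not required for the sampler), the sketch below plots the unnormalized target and shows its two modes near \(x=\pm1\).

xs_peek <- seq(-2.5, 2.5, length.out = 400)
df_peek <- data.frame(x = xs_peek, target = exp(-xs_peek^4/4 + xs_peek^2/2))
ggplot(df_peek, aes(x, target)) +
  geom_line() +
  labs(title = "Unnormalized Target", x = "x", y = "Unnormalized density")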
Given current \(x\) and proposal density \(q(x\to y)\), propose \(Y\sim q(x\to \cdot)\) and accept with probability \[ \alpha(x,y)=\min\!\left\{1,\ \frac{\pi(y)\,q(y\to x)}{\pi(x)\,q(x\to y)}\right\}. \] This kernel satisfies detailed balance with \(\pi\): \[ \pi(x)\,q(x\to y)\,\alpha(x,y)=\pi(y)\,q(y\to x)\,\alpha(y,x), \] hence \(\pi\) is invariant. Under mild conditions (irreducible/aperiodic), the chain is ergodic.
Random-walk MH (RWMH): \(q(x\to y)=\mathcal{N}(y; x,\sigma^2 I)\) is symmetric, so \(\alpha(x,y)=\min\{1,\pi(y)/\pi(x)\}\).
Efficiency & tuning. - Too small \(\sigma\): high acceptance, tiny steps \(\Rightarrow\) slow exploration. - Too large \(\sigma\): low acceptance, many rejections. - In moderate/high dimension, theory suggests an optimal RWMH acceptance near \(\approx 0.23\) under isotropic targets; in 1D, values around \(0.3\!-\!0.5\) are often reasonable heuristics. - Report: proposal scale, acceptance rate, trace/ACF, and ESS for key functionals.
log_target <- function(x) { -x^4/4 + x^2/2 }
mh <- function(n = 20000, init = 0, prop_sd = 1) {
x <- numeric(n)
x[1] <- init
acc <- 0L
for (t in 2:n) {
cand <- rnorm(1, mean = x[t-1], sd = prop_sd) # symmetric proposal
logA <- log_target(cand) - log_target(x[t-1])
if (log(runif(1)) < logA) { x[t] <- cand; acc <- acc + 1L } else { x[t] <- x[t-1] }
}
list(draws = x, acc_rate = acc/(n-1))
}
out <- mh(n = 25000, init = 0, prop_sd = 0.9)
mh_chain <- out$draws
acc_rate <- out$acc_rate
acc_rate
## [1] 0.7551102
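This acceptance rate is on the high side relative to the 1D heuristic above, suggesting the proposal could take larger steps. A quick scan over proposal scales, reusing the mh() helper defined above (the grid of scales is my own choice), compares acceptance rate and ESS:

# Compare a few proposal scales by acceptance rate and ESS
scales <- c(0.5, 1, 2, 4)
tune <- sapply(scales, function(s) {
  fit <- mh(n = 10000, init = 0, prop_sd = s)
  c(prop_sd  = s,
    acc_rate = fit$acc_rate,
    ESS      = unname(coda::effectiveSize(fit$draws)))
})
round(t(tune), 2)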
burn <- 3000
keep <- mh_chain[(burn+1):length(mh_chain)]
p_trace <- trace_plot(keep, title = sprintf("MH Trace (burn = %d), acc ≈ %.1f%%", burn, 100*acc_rate))
p_acf <- acf_plot(keep, lags = 60, title = "MH ACF (first 60 lags)")
p_trace / p_acf

# Normalizing constant of the target via numerical integration
Z <- integrate(function(x) exp(log_target(x)), lower = -Inf, upper = Inf)$value
dens_fun <- function(x) exp(log_target(x))/Z
xs <- seq(-3, 3, length.out = 400)
df_true <- data.frame(x = xs, d = dens_fun(xs))
df_samp <- data.frame(x = keep)
ggplot(df_samp, aes(x)) +
geom_histogram(aes(y = ..density..), bins = 60, alpha = 0.6) +
geom_line(data = df_true, aes(x, d), size = 1) +
labs(title = "MH: Sample Histogram vs True Density", x = "x", y = "Density")

For a scalar function \(h(X)\) of the chain, \[ \widehat{\theta}=\frac{1}{n}\sum_{t=1}^n h(X_t),\qquad \mathrm{MCSE}(\widehat{\theta})\approx \sqrt{\widehat{\sigma}_h^2/n}. \] Batch means: split the chain into \(B\) contiguous batches of size \(m\) (so \(n=Bm\)). Let \(\bar{h}_b\) be the mean in batch \(b\). Then \[ \widehat{\sigma}_h^2 = \frac{m}{B-1}\sum_{b=1}^B\big(\bar{h}_b-\bar{h}_\cdot\big)^2, \quad \bar{h}_\cdot=\frac{1}{B}\sum_{b=1}^B \bar{h}_b. \] Choose \(m\) large enough (e.g., \(m\approx \lfloor n^{1/2}\rfloor\)) so batches are weakly correlated.
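A minimal batch-means sketch for the retained MH draws (taking \(h(x)=x\) and the \(m\approx\lfloor\sqrt{n}\rfloor\) rule above; coda's spectral ESS is shown only for comparison):

# Batch-means MCSE for the mean of the retained MH draws
h      <- keep                              # h(x) = x on the retained draws
n_keep <- length(h)
m      <- floor(sqrt(n_keep))               # batch size ~ sqrt(n)
B      <- floor(n_keep / m)                 # number of full batches
hb     <- colMeans(matrix(h[1:(B * m)], nrow = m))   # one column per contiguous batch
sigma2_hat <- m * var(hb)                   # batch-means estimate of sigma_h^2
c(mean     = mean(h),
  MCSE     = sqrt(sigma2_hat / (B * m)),
  ESS_coda = unname(coda::effectiveSize(h)))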
Target is a bivariate Normal with correlation \(\rho\), zero mean, unit variances:
\[ (X, Y) \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \]
Full conditionals:
\[ X \mid Y = y \sim \mathcal{N}(\rho y, 1 - \rho^2), \quad Y \mid X = x \sim \mathcal{N}(\rho x, 1 - \rho^2) \]
Suppose \(\pi(x_1,\dots,x_d)\) is the target and each full conditional \(\pi(x_j\mid x_{-j})\) is tractable. The Gibbs kernel that updates one coordinate \[ x_j'\sim \pi(x_j\mid x_{-j}) \] leaves \(\pi\) invariant. Composing updates over \(j=1,\dots,d\) (systematically or randomly) also leaves \(\pi\) invariant. Under irreducibility and aperiodicity (often satisfied on \(\mathbb{R}^d\) with proper \(\pi\)), the chain is ergodic.
Compatibility: Full conditionals must be compatible, i.e., arise from some joint \(\pi\). In our bivariate Normal example, the conditionals \[ X\mid Y=y\sim \mathcal{N}(\rho y,\ 1-\rho^2),\quad Y\mid X=x\sim \mathcal{N}(\rho x,\ 1-\rho^2) \] are compatible with the joint \(\mathcal{N}_2(0,\Sigma)\) where \(\Sigma=\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\).
gibbs_bvn <- function(n = 20000, rho = 0.9, init = c(0,0)) {
x <- y <- numeric(n)
x[1] <- init[1]; y[1] <- init[2]
s2 <- 1 - rho^2
for (t in 2:n) {
x[t] <- rnorm(1, mean = rho * y[t-1], sd = sqrt(s2))
y[t] <- rnorm(1, mean = rho * x[t], sd = sqrt(s2))
}
cbind(x, y)
}
S <- gibbs_bvn(n = 25000, rho = 0.9, init = c(0,0))
burn <- 3000
S_keep <- S[(burn+1):nrow(S), , drop = FALSE]
cor_est <- cor(S_keep[,1], S_keep[,2])
c(Correlation_Est = cor_est)
## Correlation_Est
## 0.9014974
p_tx <- trace_plot(S_keep[,1], "Gibbs: Trace of X")
p_ty <- trace_plot(S_keep[,2], "Gibbs: Trace of Y")
p_ax <- acf_plot(S_keep[,1], 60, "Gibbs: ACF of X")
p_ay <- acf_plot(S_keep[,2], 60, "Gibbs: ACF of Y")
(p_tx / p_ax) | (p_ty / p_ay)

High \(|\rho|\) induces strong autocorrelation in single-site Gibbs. Remedies include:

- Blocking: sample \((X,Y)\) jointly (when feasible); see the sketch below.
- Reparameterization/centering: transform to weakly correlated coordinates.
- Overrelaxation: draw from a reflected conditional to reduce lag-1 correlation (advanced).

Always inspect trace/ACF plots and report ESS for each component.
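A minimal blocking sketch for this target (my own comparison; since the bivariate Normal can be sampled exactly, the "blocked" update is simply an IID joint draw via mvtnorm::rmvnorm): compare per-component ESS of the single-site Gibbs output with a joint sampler of the same length.

set.seed(1)
n_blk     <- nrow(S_keep)
Sigma_blk <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
S_block   <- mvtnorm::rmvnorm(n_blk, mean = c(0, 0), sigma = Sigma_blk)  # exact joint draws
colnames(S_block) <- c("x", "y")
rbind(gibbs_ESS = coda::effectiveSize(S_keep),
      block_ESS = coda::effectiveSize(S_block))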
# True density grid for contours
xs <- seq(-3, 3, length.out = 200)
ys <- seq(-3, 3, length.out = 200)
grid <- expand.grid(x = xs, y = ys)
rho <- 0.9
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
grid$dens <- mvtnorm::dmvnorm(grid, mean = c(0,0), sigma = Sigma)
dfS <- as.data.frame(S_keep[sample.int(nrow(S_keep), 4000), , drop = FALSE])
names(dfS) <- c("x","y")
ggplot() +
geom_point(data = dfS, aes(x, y), alpha = 0.35) +
geom_contour(data = grid, aes(x, y, z = dens), size = 0.4) +
coord_equal() +
labs(title = "Gibbs Sampling: Bivariate Normal",
subtitle = paste0("Sample corr ≈ ", sprintf("%.3f", cor_est)),
x = "X", y = "Y")

Model: \(y_i\mid \mu,\sigma^2 \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu,\sigma^2)\), \(\mu\mid \sigma^2 \sim \mathcal{N}(\mu_0,\sigma^2/\kappa_0)\), \(\sigma^2 \sim \operatorname{InvGamma}(a_0,b_0)\). Let \(n\) be the sample size, \(\bar{y}\) the sample mean, and \(S=\sum_{i=1}^n (y_i-\bar{y})^2\).
Conditional for \(\mu\): completing the square, \[ \mu\mid \sigma^2,\mathbf{y}\ \sim\ \mathcal{N}\!\left( \frac{\kappa_0\mu_0+n\bar{y}}{\kappa_0+n},\ \frac{\sigma^2}{\kappa_0+n} \right). \]
Conditional for \(\sigma^2\): collecting quadratic terms, \[ \sigma^2\mid \mu,\mathbf{y}\ \sim\ \operatorname{InvGamma}\!\left( a_0+\frac{n}{2},\ b_0+\frac{1}{2}\sum_{i=1}^n (y_i-\mu)^2 \right). \] Hence Gibbs alternates Normal and Inverse-Gamma draws. Marginally, \(\mu\mid \mathbf{y}\) is Student-t (not needed for Gibbs, but good to know).
Normal data with unknown \((\mu, \sigma^2)\), conjugate priors:
\[ \mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 / \kappa_0), \quad \sigma^2 \sim \text{InvGamma}(a_0, b_0) \]
Then:
\[ \mu \mid \sigma^2, y \sim \mathcal{N} \left( \frac{\kappa_0 \mu_0 + n \bar{y}}{\kappa_0 + n}, \frac{\sigma^2}{\kappa_0 + n} \right) \]
\[ \sigma^2 \mid \mu, y \sim \text{InvGamma} \left( a_0 + \frac{n}{2}, b_0 + \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right) \]
set.seed(42)
y <- rnorm(100, mean = 3, sd = 2)
mu0 <- 0; k0 <- 0.01
a0 <- 2; b0 <- 2
gibbs_norm <- function(y, niter = 20000, mu0 = 0, k0 = 0.01, a0 = 2, b0 = 2) {
n <- length(y)
mu <- numeric(niter)
sig2 <- numeric(niter)
mu[1] <- mean(y); sig2[1] <- var(y)
for (t in 2:niter) {
# mu | sig2, y
post_mean <- (k0*mu0 + n*mean(y)) / (k0 + n)
post_var <- sig2[t-1] / (k0 + n)
mu[t] <- rnorm(1, post_mean, sqrt(post_var))
# sig2 | mu, y ~ InvGamma (sample via reciprocal of Gamma)
a <- a0 + n/2
b <- b0 + 0.5*sum( (y - mu[t])^2 )
sig2[t] <- 1 / rgamma(1, shape = a, rate = b)
}
cbind(mu = mu, sig2 = sig2)
}
S2 <- gibbs_norm(y, niter = 20000, mu0 = mu0, k0 = k0, a0 = a0, b0 = b0)
burn <- 3000
S2k <- S2[(burn+1):nrow(S2), , drop = FALSE]
trace_plot(S2k[, "mu"], "Gibbs (Bayes): μ") /
trace_plot(S2k[, "sig2"], "Gibbs (Bayes): σ²")