knitr::opts_chunk$set(
echo = TRUE, message = FALSE, warning = FALSE,
fig.width = 7, fig.height = 4.5, dpi = 160
)
set.seed(2025)
Imagine you want to generate samples from a joint probability distribution \(f(x, y)\).
But maybe you cannot sample from \(f(x, y)\) directly. What you *can* do is sample from the conditional distributions \(f(x \mid y)\) and \(f(y \mid x)\). That's when Gibbs sampling becomes your best friend!

The Gibbs sampling steps are:

1. Choose a starting value \(y^{(0)}\).
2. At iteration \(t\), draw \(x^{(t)} \sim f(x \mid y^{(t-1)})\), then draw \(y^{(t)} \sim f(y \mid x^{(t)})\).
3. Repeat this for many iterations — the sequence of \((x^{(t)}, y^{(t)})\) will eventually look like samples from \(f(x, y)\)!
For a concrete example, suppose the conditionals we can sample from are \(X \mid Y \sim \mathcal{N}(0.9\,Y,\ 1 - 0.9^2)\) and \(Y \mid X \sim \mathcal{N}(0.9\,X,\ 1 - 0.9^2)\). We're not told the joint distribution directly, but we can sample from both conditionals! Let's write code for this.
gibbs_normal <- function(n_iter = 5000, x0 = 0, y0 = 0, rho = 0.9) {
  # Alternately sample from the two full conditionals:
  #   X | Y ~ N(rho * Y, 1 - rho^2)  and  Y | X ~ N(rho * X, 1 - rho^2)
  X <- numeric(n_iter); Y <- numeric(n_iter)
  X[1] <- x0; Y[1] <- y0
  for (t in 2:n_iter) {
    X[t] <- rnorm(1, mean = rho * Y[t - 1], sd = sqrt(1 - rho^2))
    Y[t] <- rnorm(1, mean = rho * X[t],     sd = sqrt(1 - rho^2))
  }
  data.frame(X = X, Y = Y)
}
out <- gibbs_normal()
head(out)
# Traceplot
plot(out$X[1:200], type = "l", col = "darkblue", lwd = 2,
main = "Traceplot of X (first 200 iterations)", ylab = "X", xlab = "Iteration")
# Scatterplot of draws
plot(out$X, out$Y, pch = 21, bg = "tomato", col = "gray20",
main = "Gibbs samples from bivariate normal", xlab = "X", ylab = "Y")
This iterative process, known as Gibbs sampling, produces a sequence of pairs \((x, y)\) whose distribution, after a sufficient number of steps, approximates the true joint distribution (successive draws are correlated with each other, but their marginal distribution converges to the target). It turns out that this specific process produces samples from a standard bivariate normal distribution with a strong positive correlation (\(\rho = 0.9\)).
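As a quick numerical check (the burn-in of 500 iterations is an arbitrary choice), we can compare the sample correlation of the draws with the correlation of 0.9 used in the conditionals above:

# Sample correlation of the draws after discarding a short burn-in
burn <- 500
cor(out$X[-(1:burn)], out$Y[-(1:burn)])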
You have a box of chocolates arranged in a cube, and for each chocolate we record its weight (\(W\)) and size (\(S\)). We are going to try to perform Gibbs sampling on their joint distribution.

The problem, as stated, is a fun but tricky example that highlights a common misconception about Gibbs sampling. Suppose we are given the following conditional distributions:

- \(W \mid S \sim \mathcal{N}(S,\ 0.2^2)\)
- \(S \mid W \sim \mathcal{N}(W,\ 0.3^2)\)

Can we Gibbs sample the \((W, S)\) joint distribution from these?
Gibbs sampling requires that we have a well-defined joint distribution from which we can derive the full conditional distributions. The two conditional distributions provided are inconsistent with each other. They create a logical circular dependency without a shared, underlying joint probability distribution.
Think of it this way:

- The first statement says the mean of \(W\) is \(S\).
- The second statement says the mean of \(S\) is \(W\).
This is a paradoxical situation because for a valid joint distribution, the parameters of the conditional distributions are derived from the joint distribution’s mean vector and covariance matrix. You cannot simply define two arbitrary conditional distributions and assume they are from a single, coherent joint distribution.
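To see the problem empirically, here is a minimal sketch that naively alternates the two stated conditionals (using the variances \(0.2^2\) and \(0.3^2\) from above). Instead of settling around a stationary distribution, the chain drifts like a random walk:

# Naively alternate the inconsistent conditionals and watch the chain drift
n_bad <- 5000
W_bad <- S_bad <- numeric(n_bad)
for (t in 2:n_bad) {
  W_bad[t] <- rnorm(1, mean = S_bad[t - 1], sd = 0.2)
  S_bad[t] <- rnorm(1, mean = W_bad[t],     sd = 0.3)
}
plot(W_bad, type = "l", col = "firebrick",
     main = "Inconsistent conditionals: the chain never stabilizes",
     xlab = "Iteration", ylab = "W")

The drift is exactly what the argument above predicts: each sweep just adds fresh noise on top of the previous value, so there is no distribution for the chain to settle into.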
For a valid bivariate normal distribution, the conditional mean of each variable is a linear function of the other variable, scaled by the correlation \(\rho\). Requiring \(E[W \mid S] = S\) and \(E[S \mid W] = W\) simultaneously would force \(\rho = 1\), which contradicts the nonzero conditional variances — so no proper joint distribution has these two conditionals.
To make this a valid Gibbs sampling problem, we must first define a proper joint distribution, for example, a bivariate normal distribution:
\[\begin{pmatrix} W \\ S \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_W \\ \mu_S \end{pmatrix}, \begin{pmatrix} \sigma_W^2 & \rho\sigma_W\sigma_S \\ \rho\sigma_W\sigma_S & \sigma_S^2 \end{pmatrix}\right)\]
From this joint distribution, we can mathematically derive the two consistent conditional distributions that are required for Gibbs sampling:
\[ \begin{aligned} W \mid S &\sim \mathcal{N}\left(\mu_W + \rho\frac{\sigma_W}{\sigma_S}(S - \mu_S), \sigma_W^2(1 - \rho^2)\right) \\ S \mid W &\sim \mathcal{N}\left(\mu_S + \rho\frac{\sigma_S}{\sigma_W}(W - \mu_W), \sigma_S^2(1 - \rho^2)\right) \end{aligned} \]
Once we have these valid, consistent conditional distributions, we can then perform the Gibbs sampling algorithm:

1. Initialize a starting value for \(S\), say \(S^{(0)}\).
2. Iterate to get new samples:
   - Draw \(W^{(t+1)} \sim p(W \mid S^{(t)})\).
   - Draw \(S^{(t+1)} \sim p(S \mid W^{(t+1)})\).
3. Repeat this process many times to get samples from the joint distribution.
gibbs_chocolate <- function(n_iter = 5000, mu_W = 0, mu_S = 0,
                            sigma_W = 0.2, sigma_S = 0.3, rho = 0.8) {
  # Gibbs sampler built from the *consistent* conditionals derived above from a
  # bivariate normal joint distribution (the parameter values, e.g. rho = 0.8,
  # are illustrative choices)
  W <- S <- numeric(n_iter)
  W[1] <- mu_W; S[1] <- mu_S
  for (t in 2:n_iter) {
    W[t] <- rnorm(1, mean = mu_W + rho * (sigma_W / sigma_S) * (S[t - 1] - mu_S),
                  sd = sigma_W * sqrt(1 - rho^2))
    S[t] <- rnorm(1, mean = mu_S + rho * (sigma_S / sigma_W) * (W[t] - mu_W),
                  sd = sigma_S * sqrt(1 - rho^2))
  }
  data.frame(W = W, S = S)
}
out2 <- gibbs_chocolate()
# Joint plot
plot(out2$W, out2$S, pch = 21, col = "black", bg = "orange",
xlab = "Weight", ylab = "Size", main = "Chocolate Gibbs Samples")
### 4. Heatmap View of the Density
Let’s plot the approximate density using Gibbs draws:
library(ggplot2)
ggplot(out2, aes(W, S)) +
stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", contour = TRUE) +
scale_fill_viridis_c() +
theme_minimal() +
labs(title = "Estimated Joint Density (Gibbs Samples)",
x = "Weight", y = "Size")
Now let's extend the concept of Gibbs sampling to three dimensions. Suppose we have three variables, \(X\), \(Y\), and \(Z\), and we can sample from each variable's full conditional distribution given the other two. (In the code below, these are the full conditionals of a trivariate normal with zero means, unit variances, and pairwise correlation \(0.5\).)
We will sample all three variables in turn to draw from their joint distribution.
The algorithm for this 3D case follows the same iterative, sequential logic as the 2D case. We initialize starting values for all variables and then iteratively sample each variable from its full conditional distribution, using the most recent values of the other variables.
Here is the step-by-step process:
Initialization: Choose starting values for \(Y\) and \(Z\), say \(Y^{(0)}\) and \(Z^{(0)}\). A good practice is to draw these from a reasonable initial distribution, or simply set them to a common value like zero.
Iteration: For \(t = 1, 2, 3, \ldots\), perform the following steps in sequence:

- Draw \(X^{(t)} \sim p(X \mid Y^{(t-1)}, Z^{(t-1)})\).
- Draw \(Y^{(t)} \sim p(Y \mid X^{(t)}, Z^{(t-1)})\).
- Draw \(Z^{(t)} \sim p(Z \mid X^{(t)}, Y^{(t)})\).
Collection: The triplet of values \((X^{(t)}, Y^{(t)}, Z^{(t)})\) at each iteration represents a single sample from the joint distribution. After a “burn-in” period (a sufficient number of initial iterations to allow the sampler to converge), you can collect these samples to approximate the target joint distribution.
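For example, here is a minimal sketch of the collection step, applied to the `out` data frame from the first sampler purely for illustration (the burn-in length of 1,000 iterations is an arbitrary choice):

# Discard a burn-in and keep the remaining draws for inference
burn_in <- 1000
samples <- out[-(1:burn_in), ]
nrow(samples)  # number of retained draws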
This is a classic multivariate normal example. The full conditional distributions used here are consistent with a joint normal distribution in which the variables are linearly related: the mean of each conditional is a linear combination of the other two variables, which encodes the correlation structure between \(X\), \(Y\), and \(Z\). The Gibbs sampler will therefore produce samples from this multivariate normal distribution.
gibbs3d <- function(n_iter = 4000) {
  # Full conditionals of a trivariate normal with zero means, unit variances,
  # and pairwise correlation 0.5:
  #   X | Y, Z ~ N((Y + Z) / 3, 2/3), and similarly for Y and Z
  X <- Y <- Z <- numeric(n_iter)
  for (t in 2:n_iter) {
    X[t] <- rnorm(1, mean = (Y[t - 1] + Z[t - 1]) / 3, sd = sqrt(2/3))
    Y[t] <- rnorm(1, mean = (X[t] + Z[t - 1]) / 3,     sd = sqrt(2/3))
    Z[t] <- rnorm(1, mean = (X[t] + Y[t]) / 3,         sd = sqrt(2/3))
  }
  data.frame(X, Y, Z)
}
out3 <- gibbs3d()
pairs(out3[seq(1, 4000, by = 5), ], pch = 21, bg = "skyblue", main = "Pairs Plot (3D Gibbs)")
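As a quick check (the 1,000-iteration burn-in is an arbitrary choice), the pairwise sample correlations should come out close to the correlation of 0.5 used in the sampler above:

# Empirical correlation matrix of the draws after discarding a burn-in
round(cor(out3[-(1:1000), ]), 2)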
### 6. When Does Gibbs Work? 🤔
Gibbs sampling is a powerful tool, but its effectiveness and efficiency depend on several key factors:

- You must be able to draw samples from every full conditional distribution.
- The conditionals must all be derived from a single, consistent joint distribution (as the chocolate example showed).
- Strongly correlated variables slow down mixing: the chain takes small steps, so more iterations are needed to explore the joint distribution.
The Gibbs sampling algorithm is an iterative process that cycles through each variable in the model. But why does cycling through the full conditionals produce samples from the joint distribution?
The central idea behind Gibbs sampling is to construct a Markov chain whose stationary distribution is the target joint distribution we wish to sample from. The proof of its correctness relies on a fundamental principle of Markov Chain Monte Carlo (MCMC): the Detailed Balance Condition.
A Markov chain with transition kernel \(P(\cdot, \cdot)\) and stationary distribution \(\pi(\cdot)\) satisfies the detailed balance condition if for all states \(A\) and \(B\):
\[ \pi(A) P(A \to B) = \pi(B) P(B \to A) \]
If a Markov chain satisfies this condition, it is guaranteed that the chain will eventually converge to \(\pi(\cdot)\). The Gibbs sampler constructs just such a Markov chain.
Let’s consider a simple two-variable case with a joint distribution \(p(x, y)\). The Gibbs sampler samples from the full conditionals \(p(x|y)\) and \(p(y|x)\) in a two-step process.
We want to show that the stationary distribution of this process is our target joint distribution \(p(x, y)\). We can do this by showing that each step of the Gibbs sampler satisfies detailed balance with respect to the joint distribution.
Consider a single step of the sampler where we move from a state \((x, y)\) to a new state \((x, y')\) by sampling \(y'\) from \(p(y'|x)\). The transition probability for this single step is given by:
\[ P((x, y) \to (x, y')) = p(y' \mid x) \]
The reverse transition, from \((x, y')\) to \((x, y)\), has a probability of:
\[ P((x, y') \to (x, y)) = p(y \mid x) \]
Now, we check the detailed balance condition for this single step. We need to show:
\[ p(x, y) \cdot P((x, y) \to (x, y')) = p(x, y') \cdot P((x, y') \to (x, y)) \]
Substituting the transition probabilities, we get:
\[ p(x, y) \cdot p(y' \mid x) = p(x, y') \cdot p(y \mid x) \]
Using the definition of conditional probability, \(p(A, B) = p(A \mid B)\,p(B)\), we can rewrite both sides:

\[ p(x)\,p(y \mid x)\,p(y' \mid x) = p(x)\,p(y' \mid x)\,p(y \mid x) \]

Since the two sides are identical, the detailed balance condition holds for this single sampling step.
A similar argument can be made for the second step, where we sample \(x'\) from \(p(x' \mid y')\). The full Gibbs sampling transition kernel is the product of these two steps, and since each individual step satisfies detailed balance, the combined two-step process also satisfies detailed balance with respect to the joint distribution \(p(x, y)\).
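As a sanity check, here is a minimal numerical sketch on a small, made-up \(2 \times 2\) discrete joint distribution (the probability values are arbitrary), verifying that the \(y\)-update satisfies detailed balance for every combination of states:

# A toy 2x2 joint distribution p(x, y); rows index x, columns index y
p <- matrix(c(0.1, 0.2,
              0.3, 0.4), nrow = 2, byrow = TRUE)
p_y_given_x <- p / rowSums(p)  # conditional p(y | x), row-normalized

# Check p(x, y) * p(y' | x) == p(x, y') * p(y | x) for every x, y, y'
for (x in 1:2) for (y in 1:2) for (yp in 1:2) {
  lhs <- p[x, y]  * p_y_given_x[x, yp]
  rhs <- p[x, yp] * p_y_given_x[x, y]
  stopifnot(isTRUE(all.equal(lhs, rhs)))
}
cat("Detailed balance holds for every (x, y, y') combination.\n")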
Because the Gibbs sampling procedure constructs a Markov chain that is:

- irreducible and aperiodic (under mild conditions on the target distribution), and
- in detailed balance with the target joint distribution \(p(x, y)\),

it is guaranteed that the stationary distribution of the Markov chain is the target joint distribution \(p(x, y)\). Therefore, after a sufficient "burn-in" period, the samples produced by the Gibbs sampler are valid draws from the desired joint distribution.

### 9. Exercises (Try Yourself!) ✍️
Suppose you have the following system:

- \(X \mid Y \sim \mathcal{N}(0.5Y,\ 1)\)
- \(Y \mid X \sim \mathcal{N}(0.5X,\ 1)\)
Use Gibbs sampling to generate samples from this system.
Hint: This will produce a bivariate normal distribution. Compared to a high-correlation example, the samples you generate will be more dispersed, reflecting a lower correlation between \(X\) and \(Y\).
Let's revisit the chocolate example with a new set of scales: rerun `gibbs_chocolate()` with \(\sigma_W = 0.5\) and \(\sigma_S = 1\).

Plot the joint density of the generated samples. Also try different values of \(\rho\) and of the variances, and observe how the plot changes.

Hint: The correlation \(\rho\) controls the strength of the dependency (how elongated the cloud is), while \(\sigma_W\) and \(\sigma_S\) control its spread along each axis. A smaller \(\rho\) means the conditional mean carries less information about the other variable, so the relationship looks less deterministic.
Take your Gibbs sampler from any of the previous exercises and plot the mean of your samples for one of the variables, say \(X^{(1:t)}\), over the number of iterations \(t\).
Hint: You can calculate the cumulative mean in R with `cumsum(X) / seq_along(X)`. This plot will show you how the sample mean stabilizes and converges to the true mean of the distribution as you collect more samples, and it is a simple visual way to assess whether your sampler has converged.
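For instance, applied to the output of the first sampler (a sketch of the idea rather than a full solution to the exercise):

# Running (cumulative) mean of X from the first sampler
cum_mean <- cumsum(out$X) / seq_along(out$X)
plot(cum_mean, type = "l", col = "darkgreen", lwd = 2,
     xlab = "Iteration", ylab = "Cumulative mean of X",
     main = "Running mean of X stabilizing near the true mean (0)")
abline(h = 0, lty = 2)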