knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE,
  fig.width = 7, fig.height = 4.5, dpi = 160
)
set.seed(2025)

1. What is Gibbs Sampling?

Imagine you want to generate samples from a joint probability distribution \(f(x, y)\).

But maybe:

  • you don’t know \(f(x, y)\) in closed form, or
  • it is too awkward to sample from directly.

What you can do is sample from the conditional distributions \(f(x \mid y)\) and \(f(y \mid x)\). That’s when Gibbs Sampling becomes your best friend!

So, let’s say you want samples from some unknown joint distribution \(f(x, y)\), but you can draw from both of its conditionals. Then the Gibbs Sampling steps are:

  1. Initialize \(y^{(0)}\) with some starting value.
  2. For \(t = 1, 2, \dots\):
     • Draw \(x^{(t)} \sim f(x \mid y^{(t-1)})\).
     • Draw \(y^{(t)} \sim f(y \mid x^{(t)})\).

Repeat this for many iterations; the sequence of \((x^{(t)}, y^{(t)})\) will eventually look like samples from \(f(x, y)\)!
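In code, the loop is short. Here is a generic sketch of that recipe in R (the functions sample_x_given_y() and sample_y_given_x() are placeholders for whatever conditional samplers your problem provides; they are not defined here):

gibbs_generic <- function(n_iter, y0, sample_x_given_y, sample_y_given_x) {
  X <- numeric(n_iter); Y <- numeric(n_iter)
  y_current <- y0
  for (t in 1:n_iter) {
    X[t] <- sample_x_given_y(y_current)  # x^(t) ~ f(x | y^(t-1))
    Y[t] <- sample_y_given_x(X[t])       # y^(t) ~ f(y | x^(t))
    y_current <- Y[t]
  }
  data.frame(X = X, Y = Y)
}

The sampler in the next section is exactly this pattern, with both conditional draws done by rnorm().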

2. A Visual Example: “Chasing the Mean”

Suppose:

  • \(X \mid Y \sim \mathcal{N}(Y, 1)\)
  • \(Y \mid X \sim \mathcal{N}(X, 1)\)

We’re not told the joint distribution directly, but we can sample from both conditionals! Let’s write code for this.

gibbs_normal <- function(n_iter = 5000, x0 = 0, y0 = 0) {
  X <- numeric(n_iter); Y <- numeric(n_iter)
  X[1] <- x0; Y[1] <- y0
  for (t in 2:n_iter) {
    X[t] <- rnorm(1, mean = Y[t-1], sd = 1)  # draw X | Y = y^(t-1)
    Y[t] <- rnorm(1, mean = X[t], sd = 1)    # draw Y | X = x^(t)
  }
  data.frame(X = X, Y = Y)
}

out <- gibbs_normal()
head(out)

Plot: Gibbs Sampler in Action

# Traceplot
plot(out$X[1:200], type = "l", col = "darkblue", lwd = 2,
     main = "Traceplot of X (first 200 iterations)", ylab = "X", xlab = "Iteration")

# Scatterplot of draws
plot(out$X, out$Y, pch = 21, bg = "tomato", col = "gray20",
     main = "Gibbs samples from bivariate normal", xlab = "X", ylab = "Y")

What did we do?

Even though we didn’t know the joint distribution \(f(x, y)\), we were able to generate samples by iteratively sampling from the conditional distributions:

  • \(f(x \mid y)\), here \(X \mid Y \sim \mathcal{N}(Y, 1)\)
  • \(f(y \mid x)\), here \(Y \mid X \sim \mathcal{N}(X, 1)\)

This iterative process, known as Gibbs sampling, produces a sequence of pairs \((x, y)\) that, after a sufficient number of steps, behaves like (correlated) draws from the target joint distribution. In this example the draws hug the line \(y = x\): the sampler produces \(X\) and \(Y\) values with a very strong positive correlation. (Section 3 takes a closer look at whether a pair of conditionals specified this way actually pins down a proper joint distribution.)
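As a quick numerical check of that correlation, we can look at the empirical correlation of the draws (the exact value varies from run to run):

# Empirical correlation of the Gibbs draws
cor(out$X, out$Y)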

3. A Funny Example: “The Chocolate Cube” 🍫

You have a box of chocolates arranged in a cube, and we are going to try to Gibbs sample the joint distribution of their weight (\(W\)) and size (\(S\)).

The problem, as stated, is a fun but tricky example that highlights a common misconception about Gibbs sampling. We are given the following conditional distributions:

  • The weight \(W\) of a chocolate depends on its size \(S\): \[W \mid S \sim \mathcal{N}(S, 0.2^2)\]
  • The size \(S\) depends on the weight \(W\): \[S \mid W \sim \mathcal{N}(W, 0.3^2)\]

Let’s try to Gibbs sample their \((W, S)\) joint distribution.


Why This Example Is Not a Valid Gibbs Sampling Problem

Gibbs sampling requires that we have a well-defined joint distribution from which we can derive the full conditional distributions. The two conditional distributions provided are inconsistent with each other. They create a logical circular dependency without a shared, underlying joint probability distribution.

Think of it this way:

  • The first statement says the conditional mean of \(W\) is exactly \(S\) (regression coefficient 1, intercept 0).
  • The second statement says the conditional mean of \(S\) is exactly \(W\).

This is a paradoxical situation because, for a valid joint distribution, the parameters of the conditional distributions are derived from the joint distribution’s mean vector and covariance matrix. For a bivariate normal, the two regression coefficients are \(\rho\,\sigma_W/\sigma_S\) and \(\rho\,\sigma_S/\sigma_W\), whose product is \(\rho^2\); coefficients of 1 in both directions therefore force \(\rho^2 = 1\), which would make the conditional variances zero rather than \(0.2^2\) and \(0.3^2\). You cannot simply define two arbitrary conditional distributions and assume they are from a single, coherent joint distribution.

For a valid bivariate normal distribution, the conditional means are linear functions of the other variable, scaled by a correlation term, and the conditional variances are reduced by the factor \(1 - \rho^2\).


How to Fix the Problem for Gibbs Sampling

To make this a valid Gibbs sampling problem, we must first define a proper joint distribution, for example, a bivariate normal distribution:

\[\begin{pmatrix} W \\ S \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_W \\ \mu_S \end{pmatrix}, \begin{pmatrix} \sigma_W^2 & \rho\sigma_W\sigma_S \\ \rho\sigma_W\sigma_S & \sigma_S^2 \end{pmatrix}\right)\]

From this joint distribution, we can mathematically derive the two consistent conditional distributions that are required for Gibbs sampling:

\[ \begin{aligned} W \mid S &\sim \mathcal{N}\left(\mu_W + \rho\frac{\sigma_W}{\sigma_S}(S - \mu_S), \sigma_W^2(1 - \rho^2)\right) \\ S \mid W &\sim \mathcal{N}\left(\mu_S + \rho\frac{\sigma_S}{\sigma_W}(W - \mu_W), \sigma_S^2(1 - \rho^2)\right) \end{aligned} \]

Once we have these valid, consistent conditional distributions, we can then perform the Gibbs sampling algorithm:

  1. Initialize a starting value for \(S\), say \(S^{(0)}\).
  2. Iterate to get new samples:
     • Draw \(W^{(t+1)} \sim p(W \mid S^{(t)})\).
     • Draw \(S^{(t+1)} \sim p(S \mid W^{(t+1)})\).
  3. Repeat this process many times to get samples from the joint distribution.

gibbs_chocolate <- function(n_iter = 5000,
                            mu_W = 10, mu_S = 10,            # NOTE: the means, sds and rho
                            sigma_W = 0.2, sigma_S = 0.3,    # used here are illustrative choices
                            rho = 0.8) {
  W <- S <- numeric(n_iter)
  W[1] <- mu_W; S[1] <- mu_S
  for (t in 2:n_iter) {
    # Draw from the consistent conditionals derived above
    W[t] <- rnorm(1, mean = mu_W + rho * (sigma_W / sigma_S) * (S[t-1] - mu_S),
                  sd = sigma_W * sqrt(1 - rho^2))
    S[t] <- rnorm(1, mean = mu_S + rho * (sigma_S / sigma_W) * (W[t] - mu_W),
                  sd = sigma_S * sqrt(1 - rho^2))
  }
  data.frame(W = W, S = S)
}

out2 <- gibbs_chocolate()

# Joint plot
plot(out2$W, out2$S, pch = 21, col = "black", bg = "orange",
     xlab = "Weight", ylab = "Size", main = "Chocolate Gibbs Samples")
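As a rough sanity check, the sample means and the sample correlation should settle near the parameters of the bivariate normal we sampled from (values vary from run to run):

colMeans(out2)        # compare with mu_W and mu_S
cor(out2$W, out2$S)   # compare with rho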

4. Heatmap View of the Density

Let’s plot the approximate density using Gibbs draws:

library(ggplot2)
ggplot(out2, aes(W, S)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", contour = TRUE) +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(title = "Estimated Joint Density (Gibbs Samples)",
       x = "Weight", y = "Size")
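If you prefer base graphics, a kernel density estimate on a grid gives a similar heatmap. This is a sketch assuming the MASS package is available:

library(MASS)  # for kde2d()
dens <- kde2d(out2$W, out2$S, n = 100)  # 2D kernel density estimate on a grid
image(dens, col = hcl.colors(20, "viridis"),
      xlab = "Weight", ylab = "Size", main = "KDE heatmap of Gibbs draws")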

5. Extension: Gibbs Sampling in 3D 🎲

Now let’s extend the concept of Gibbs sampling to three dimensions. Suppose we have three variables, \(X\), \(Y\), and \(Z\), and their full conditional distributions are given by:

  • \(X \mid Y,Z \sim \mathcal{N}((Y+Z)/2, 1)\)
  • \(Y \mid X,Z \sim \mathcal{N}((X+Z)/2, 1)\)
  • \(Z \mid X,Y \sim \mathcal{N}((X+Y)/2, 1)\)

We will sample all three variables in turn to draw from their joint distribution.


The Gibbs Sampling Algorithm

The algorithm for this 3D case follows the same iterative, sequential logic as the 2D case. We initialize starting values for all variables and then iteratively sample each variable from its full conditional distribution, using the most recent values of the other variables.

Here is the step-by-step process:

  1. Initialization: Choose starting values for \(Y\) and \(Z\), say \(Y^{(0)}\) and \(Z^{(0)}\). A good practice is to draw these from a reasonable initial distribution, or simply set them to a common value like zero.

  2. Iteration: For \(t = 1, 2, 3, ...\), perform the following steps in sequence:

    1. Sample \(X^{(t)}\): Draw a new sample for \(X\) using the current values of \(Y\) and \(Z\): \[X^{(t)} \sim \mathcal{N}((Y^{(t-1)} + Z^{(t-1)})/2, 1)\]
    2. Sample \(Y^{(t)}\): Draw a new sample for \(Y\) using the new value of \(X\) and the current value of \(Z\): \[Y^{(t)} \sim \mathcal{N}((X^{(t)} + Z^{(t-1)})/2, 1)\]
    3. Sample \(Z^{(t)}\): Draw a new sample for \(Z\) using the new values of both \(X\) and \(Y\): \[Z^{(t)} \sim \mathcal{N}((X^{(t)} + Y^{(t)})/2, 1)\]
  3. Collection: The triplet of values \((X^{(t)}, Y^{(t)}, Z^{(t)})\) at each iteration represents a single sample from the joint distribution. After a “burn-in” period (a sufficient number of initial iterations to allow the sampler to converge), you can collect these samples to approximate the target joint distribution.


What is the Target Joint Distribution?

The target here is a multivariate (in this case trivariate) normal distribution: each full conditional is normal, and each conditional mean is a linear function of the other variables, namely their average, which implies a symmetric positive correlation structure among \(X\), \(Y\), and \(Z\). As Section 3 showed, conditionals written down directly like this should be checked for consistency with an actual joint distribution; when they are derived from a proper multivariate normal, the Gibbs sampler produces samples from that multivariate normal distribution.

gibbs3d <- function(n_iter = 4000) {
  X <- Y <- Z <- numeric(n_iter)   # all three chains start at 0 (numeric() fills with zeros)
  for (t in 2:n_iter) {
    X[t] <- rnorm(1, mean = (Y[t-1] + Z[t-1])/2, sd = 1)  # X | Y, Z
    Y[t] <- rnorm(1, mean = (X[t] + Z[t-1])/2, sd = 1)    # Y | X, Z
    Z[t] <- rnorm(1, mean = (X[t] + Y[t])/2, sd = 1)      # Z | X, Y
  }
  data.frame(X, Y, Z)
}

out3 <- gibbs3d()
pairs(out3[seq(1, nrow(out3), by = 5), ], pch = 21, bg = "skyblue", main = "Pairs Plot (3D Gibbs)")
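The pairwise sample correlations give a quick numerical summary of how strongly the three chains move together (values vary from run to run):

round(cor(out3), 2)   # 3 x 3 matrix of pairwise sample correlations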

6. When Does Gibbs Work? 🤔

Gibbs sampling is a powerful tool, but its effectiveness and efficiency depend on several key factors:

  • You can easily sample from all full conditionals: The core of Gibbs sampling is drawing from the conditional distributions. If these distributions are complex or non-standard, the method becomes impractical. It works best when the conditionals are from common distributions (e.g., Normal, Gamma, Beta, etc.) that are easy to sample from.
  • The variables are not too dependent: Gibbs sampling takes a “random walk” through the parameter space. If the variables are highly correlated (either positively or negatively), the sampler can become slow and inefficient. This is because it takes many small, incremental steps to explore the joint distribution, leading to slow convergence. In contrast, a low-correlation scenario allows the sampler to explore the space more rapidly.
  • You allow enough burn-in and enough samples: The initial samples from the Gibbs sampler are not representative of the target distribution. This initial phase is called burn-in, where the Markov chain moves from its starting point to the typical set of the target distribution. After burn-in, you should collect a sufficient number of samples to get a good approximation of the joint distribution; a small example of discarding burn-in draws follows this list.
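For instance, with the 3D sampler from Section 5, discarding burn-in and summarizing the remaining draws might look like this (the burn-in length of 1,000 iterations is an arbitrary illustrative choice):

burn_in <- 1000
post <- out3[-(1:burn_in), ]   # drop the burn-in draws
colMeans(post)                 # summaries computed on the retained draws only
apply(post, 2, sd)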

7. Summary: Gibbs Sampling Steps 📝

The Gibbs sampling algorithm is an iterative process that cycles through each variable in the model:

  1. Initialize all variables with starting values.
  2. At each iteration, sample each variable in turn from its full conditional distribution, always plugging in the most recent values of the other variables.
  3. Discard an initial burn-in portion of the chain.
  4. Treat the remaining draws as (correlated) samples from the joint distribution and use them for plots, means, and other summaries.


8. Theoretical Proof of Gibbs Sampling

The central idea behind Gibbs sampling is to construct a Markov chain whose stationary distribution is the target joint distribution we wish to sample from. The proof of its correctness relies on a fundamental principle of Markov Chain Monte Carlo (MCMC): the Detailed Balance Condition.

A Markov chain with transition kernel \(P(\cdot, \cdot)\) and stationary distribution \(\pi(\cdot)\) satisfies the detailed balance condition if for all states \(A\) and \(B\):

\[ \pi(A) P(A \to B) = \pi(B) P(B \to A) \]

If a Markov chain satisfies this condition, then \(\pi(\cdot)\) is a stationary distribution of the chain, and together with irreducibility and aperiodicity (discussed in the Conclusion below) this guarantees that the chain eventually converges to \(\pi(\cdot)\). The Gibbs sampler constructs just such a Markov chain.


Proof for the Two-Variable Case

Let’s consider a simple two-variable case with a joint distribution \(p(x, y)\). The Gibbs sampler samples from the full conditionals \(p(x|y)\) and \(p(y|x)\) in a two-step process.

We want to show that the stationary distribution of this process is our target joint distribution \(p(x, y)\). We can do this by showing that each step of the Gibbs sampler satisfies detailed balance with respect to the joint distribution.

Consider a single step of the sampler where we move from a state \((x, y)\) to a new state \((x, y')\) by sampling \(y'\) from \(p(y'|x)\). The transition probability for this single step is given by:

\[ P((x, y) \to (x, y')) = p(y' \mid x) \]

The reverse transition, from \((x, y')\) to \((x, y)\), has a probability of:

\[ P((x, y') \to (x, y)) = p(y \mid x) \]

Now, we check the detailed balance condition for this single step. We need to show:

\[ p(x, y) \cdot P((x, y) \to (x, y')) = p(x, y') \cdot P((x, y') \to (x, y)) \]

Substituting the transition probabilities, we get:

\[ p(x, y) \cdot p(y' \mid x) = p(x, y') \cdot p(y \mid x) \]

Using the definition of conditional probability, \(p(A, B) = p(A|B)p(B)\), we can rewrite both sides:

  • Left-hand side: \(p(x) \cdot p(y \mid x) \cdot p(y' \mid x)\)
  • Right-hand side: \(p(x) \cdot p(y' \mid x) \cdot p(y \mid x)\)

Since the two sides are equal, the detailed balance condition holds for this single sampling step.
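We can also verify this identity numerically on a tiny discrete example (the 2 x 2 joint distribution below is an arbitrary illustrative choice):

# Toy joint distribution p(x, y) on {1, 2} x {1, 2}: rows index x, columns index y
p_joint <- matrix(c(0.30, 0.20,
                    0.10, 0.40), nrow = 2, byrow = TRUE)
p_x <- rowSums(p_joint)         # marginal p(x)
p_y_given_x <- p_joint / p_x    # p(y | x): row i divided by p(x = i)

# Detailed balance for the y-update: p(x, y) p(y' | x) == p(x, y') p(y | x)
x <- 1; y <- 1; y_new <- 2
lhs <- p_joint[x, y]     * p_y_given_x[x, y_new]
rhs <- p_joint[x, y_new] * p_y_given_x[x, y]
all.equal(lhs, rhs)   # TRUE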

A similar argument can be made for the second step, where we sample \(x'\) from \(p(x' \mid y')\). Each individual step therefore satisfies detailed balance with respect to \(p(x, y)\) and so leaves the joint distribution invariant; the full Gibbs transition kernel, which composes the two steps, also leaves \(p(x, y)\) invariant. (The composed two-step kernel need not itself be reversible, but invariance of \(p(x, y)\) is what the conclusion below requires.)

Conclusion

Because the Gibbs sampling procedure constructs a Markov chain that is:

  1. Irreducible: The chain can reach any point in the state space from any other point.
  2. Aperiodic: The chain does not get trapped in a cycle.
  3. Leaves \(p(x, y)\) invariant: each conditional update satisfies detailed balance with respect to \(p(x, y)\), as proven above.

It is guaranteed that the stationary distribution of the Markov chain is the target joint distribution \(p(x, y)\). Therefore, after a sufficient “burn-in” period, the samples produced by the Gibbs sampler are valid draws from the desired joint distribution.

9. Exercises (Try Yourself!) ✍️

Exercise 1: Simple Gaussian Chain

Suppose you have the following system:

  • \(X \mid Y \sim \mathcal{N}(0.5Y, 1)\)
  • \(Y \mid X \sim \mathcal{N}(0.5X, 1)\)

Use Gibbs sampling to generate samples from this system.

Hint: This will produce a bivariate normal distribution. Compared to a high-correlation example, the samples you generate will be more dispersed, reflecting a lower correlation between \(X\) and \(Y\).

Exercise 2: Mix Chocolate with Noise

Let’s revisit the chocolate example with a new set of variances:

  • \(W \mid S \sim \mathcal{N}(S, 0.5^2)\)
  • \(S \mid W \sim \mathcal{N}(W, 1^2)\)

Try Gibbs sampling with these conditional distributions and then plot the joint density of the generated samples. Also, try different variance values and observe how the plot changes.

Hint: The larger the conditional variance, the weaker the dependence between the variables. This is because a larger variance introduces more “noise,” making the relationship between the conditional mean and the sampled value less deterministic.

Exercise 3: Explore Convergence

Take your Gibbs sampler from any of the previous exercises and plot the mean of your samples for one of the variables, say \(X^{(1:t)}\), over the number of iterations \(t\).

Hint: You can calculate the cumulative mean of a vector X with cumsum(X) / seq_along(X) in R. This plot will show you how the sample mean stabilizes and converges to the true mean of the distribution as you collect more samples. This is a visual way to assess whether your sampler has converged.
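A minimal template, using the Section 2 output out as a stand-in for your own sampler:

# Running (cumulative) mean of X: it should level off as the chain converges
cum_mean <- cumsum(out$X) / seq_along(out$X)
plot(cum_mean, type = "l", col = "darkblue", lwd = 2,
     xlab = "Iteration", ylab = "Cumulative mean of X",
     main = "Running mean of X")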