Full Conditional Distributions and Gibbs Sampling

The Core Definition

A full conditional distribution is the probability distribution of a single parameter (or a group of parameters) given fixed values of all other parameters in the model and the observed data.

In mathematical notation, for parameters \(\theta_1, \theta_2, \ldots, \theta_k\):

\[p(\theta_j \mid \theta_1, \theta_2, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_k, \text{data})\]

The key phrase is “given everything else”. You condition on:

All other parameters
The observed data

Why “Full” Conditional?

It’s called “full” because we condition on all other parameters, not just a subset. This distinguishes it from:

Type	Expression	Description
Full conditional	\(p(\theta_j \mid \text{all other } \theta, \text{data})\)	Conditions on all other parameters
Marginal distribution	\(p(\theta_j \mid \text{data})\)	Integrates out other parameters
Partial conditional	\(p(\theta_j \mid \text{some parameters}, \text{data})\)	Conditions on only a subset

Intuitive Example: Three Parameters

Imagine we have three parameters: \(\alpha\), \(\beta\), \(\gamma\)

Parameter	Full Conditional
\(\alpha\)	\(p(\alpha \mid \beta, \gamma, \text{data})\)
\(\beta\)	\(p(\beta \mid \alpha, \gamma, \text{data})\)
\(\gamma\)	\(p(\gamma \mid \alpha, \beta, \text{data})\)

Each one treats the other two as known constants.

Why Full Conditional Distributions Are Often Simple

This is one of the most useful features of Bayesian computation. Full conditionals are often simple, sometimes even standard distributions, when the joint posterior has no recognizable form at all.

The Core Reason: Conditioning Breaks Dependencies

The Joint Posterior as a Product

The joint posterior typically looks like:

\[p(\theta_1, \theta_2, \ldots, \theta_k \mid \text{data}) \propto \text{Likelihood} \times \text{Prior}_1 \times \text{Prior}_2 \times \ldots \times \text{Prior}_k\]

This product can be very complicated because parameters interact through the likelihood.

The Full Conditional Factors

When you condition on all other parameters, most of the product becomes constant:

\[p(\theta_j \mid \theta_{-j}, \text{data}) \propto [\text{terms containing } \theta_j \text{ from likelihood}] \times [\text{prior for } \theta_j] \times [\text{constants}]\]

Key insight: Factors that don’t contain \(\theta_j\) are constant with respect to \(\theta_j\), so they are absorbed into the normalizing constant!
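To spell out why those factors drop, here is a short derivation. It assumes the unnormalized joint posterior can be written as \(f(\theta_j, \theta_{-j})\, g(\theta_{-j})\), where \(g\) collects every factor free of \(\theta_j\); the symbols \(f\) and \(g\) are introduced here only for illustration.

\[p(\theta_j \mid \theta_{-j}, \text{data}) = \frac{f(\theta_j, \theta_{-j})\, g(\theta_{-j})}{\int f(\theta_j', \theta_{-j})\, g(\theta_{-j})\, d\theta_j'} = \frac{f(\theta_j, \theta_{-j})}{\int f(\theta_j', \theta_{-j})\, d\theta_j'} \propto f(\theta_j, \theta_{-j})\]

Since \(g(\theta_{-j})\) appears in both numerator and denominator, it never needs to be computed.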

Concrete Example 1: Normal Distribution

The Model

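\[y_1, \ldots, y_n \sim \text{iid Normal}(\mu, \sigma^2)\]

with independent priors

\[\mu \sim \text{Normal}(\mu_0, \tau_0^2), \quad \sigma^2 \sim \text{Inverse-Gamma}(a_0, b_0)\]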

The Joint Posterior (Complex!)

\[p(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{\sum(y_i-\mu)^2}{2\sigma^2}\right) \times \exp\left(-\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right) \times (\sigma^2)^{-a_0-1} \exp\left(-\frac{b_0}{\sigma^2}\right)\]

This looks messy! \(\mu\) and \(\sigma^2\) are tangled together.

Full Conditional for \(\mu\) (Simple!)

When we condition on \(\sigma^2\), anything without \(\mu\) becomes constant:

\[p(\mu \mid \sigma^2, y) \propto \exp\left(-\frac{\sum(y_i-\mu)^2}{2\sigma^2}\right) \times \exp\left(-\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right)\]

This is just a Normal distribution! The \((\sigma^2)^{-n/2}\) and \((\sigma^2)^{-a_0-1} \exp(-b_0/\sigma^2)\) factors don’t involve \(\mu\), so they are absorbed into the normalizing constant.

Result: \[\mu \mid \sigma^2, y \sim \text{Normal}\left( \text{mean} = \frac{n\bar{y}/\sigma^2 + \mu_0/\tau_0^2}{n/\sigma^2 + 1/\tau_0^2}, \text{ variance} = \frac{1}{n/\sigma^2 + 1/\tau_0^2} \right)\]
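The mean and variance above can be checked by completing the square in \(\mu\). A sketch of the step, where "const" stands for terms that do not involve \(\mu\):

\[-\frac{\sum(y_i-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\tau_0^2} = -\frac{1}{2}\left[ \left(\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\right)\mu^2 - 2\left(\frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\right)\mu \right] + \text{const}\]

A Normal density with mean \(m\) and variance \(v\) has exponent \(-\frac{1}{2}\left[\mu^2/v - 2m\mu/v\right] + \text{const}\). Matching coefficients gives \(1/v = n/\sigma^2 + 1/\tau_0^2\) and \(m/v = n\bar{y}/\sigma^2 + \mu_0/\tau_0^2\).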

Full Conditional for \(\sigma^2\) (Also Simple!)

When we condition on \(\mu\):

\[p(\sigma^2 \mid \mu, y) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{\sum(y_i-\mu)^2}{2\sigma^2}\right) \times (\sigma^2)^{-a_0-1} \exp\left(-\frac{b_0}{\sigma^2}\right)\]

This is an Inverse-Gamma distribution! The Normal prior for \(\mu\) disappears because it doesn’t contain \(\sigma^2\).

Result: \[\sigma^2 \mid \mu, y \sim \text{Inverse-Gamma}\left( a_0 + \frac{n}{2}, b_0 + \frac{\sum(y_i-\mu)^2}{2} \right)\]
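The two full conditionals for this model can be alternated in a small Gibbs sampler. The following is a minimal sketch in Python with NumPy, not a definitive implementation: the function name `gibbs_normal`, the hyperparameter defaults, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def gibbs_normal(y, mu0=0.0, tau0_sq=100.0, a0=2.0, b0=2.0,
                 n_iter=5000, seed=0):
    """Gibbs sampler for y_i ~ Normal(mu, sigma2) with priors
    mu ~ Normal(mu0, tau0_sq) and sigma2 ~ Inverse-Gamma(a0, b0).
    Hyperparameter defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, ybar = y.size, y.mean()
    mu, sigma2 = ybar, y.var()  # starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # mu | sigma2, y ~ Normal(precision-weighted mean, 1 / total precision)
        prec = n / sigma2 + 1.0 / tau0_sq
        mean = (n * ybar / sigma2 + mu0 / tau0_sq) / prec
        mu = rng.normal(mean, np.sqrt(1.0 / prec))
        # sigma2 | mu, y ~ Inverse-Gamma(a0 + n/2, b0 + sum((y - mu)^2) / 2)
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * np.sum((y - mu) ** 2)
        # If G ~ Gamma(shape, scale=1/rate), then 1/G ~ Inverse-Gamma(shape, rate)
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        draws[t] = mu, sigma2
    return draws

# Simulated data (assumed true values: mu = 3, sigma = 2)
data_rng = np.random.default_rng(1)
y = data_rng.normal(3.0, 2.0, size=200)
draws = gibbs_normal(y)[1000:]  # discard the first 1000 draws as burn-in
post_mean_mu, post_mean_sigma2 = draws.mean(axis=0)
```

Each iteration uses only direct draws from a Normal and a Gamma distribution, which is what makes the full conditionals convenient here.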

Concrete Example 2: Linear Regression

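The Model

\[y \mid \beta, \sigma^2 \sim \text{Normal}(X\beta, \sigma^2 I), \quad \beta \sim \text{Normal}(0, \Sigma_0), \quad \sigma^2 \sim \text{Inverse-Gamma}(a_0, b_0)\]

Joint Posterior (Complex!)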

\[p(\beta, \sigma^2 \mid y, X) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2}\right) \times \exp\left(-\frac{1}{2} \beta'\Sigma_0^{-1}\beta\right) \times (\sigma^2)^{-a_0-1} \exp\left(-\frac{b_0}{\sigma^2}\right)\]

Parameters are tangled: \(\beta\) and \(\sigma^2\) appear together in the likelihood.

Full Conditional for \(\beta\) (Given \(\sigma^2\)) \(\rightarrow\) Multivariate Normal

\[p(\beta \mid \sigma^2, y, X) \propto \exp\left(-\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2}\right) \times \exp\left(-\frac{1}{2} \beta'\Sigma_0^{-1}\beta\right)\]

The Inverse-Gamma prior for \(\sigma^2\) disappears. This is Multivariate Normal!
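Completing the square in \(\beta\) gives the parameters explicitly. This is a sketch that assumes the zero-mean \(\text{Normal}(0, \Sigma_0)\) prior implied by the \(\beta'\Sigma_0^{-1}\beta\) term; \(V_\beta\) is a symbol introduced here for the conditional covariance.

\[\beta \mid \sigma^2, y, X \sim \text{Normal}\left( V_\beta X'y/\sigma^2,\; V_\beta \right), \quad V_\beta = \left( X'X/\sigma^2 + \Sigma_0^{-1} \right)^{-1}\]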

Full Conditional for \(\sigma^2\) (Given \(\beta\)) \(\rightarrow\) Inverse-Gamma

\[p(\sigma^2 \mid \beta, y, X) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2}\right) \times (\sigma^2)^{-a_0-1} \exp\left(-\frac{b_0}{\sigma^2}\right)\]

The Normal prior for \(\beta\) disappears. This is Inverse-Gamma!
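Matching the kernel above to the Inverse-Gamma form gives \(\sigma^2 \mid \beta, y, X \sim \text{Inverse-Gamma}(a_0 + n/2,\; b_0 + (y-X\beta)'(y-X\beta)/2)\), and the Multivariate Normal conditional for \(\beta\) has covariance \((X'X/\sigma^2 + \Sigma_0^{-1})^{-1}\) and mean equal to that covariance times \(X'y/\sigma^2\). A minimal Gibbs sketch under those assumptions follows; the function name `gibbs_regression`, the hyperparameter values, and the simulated data are illustrative only.

```python
import numpy as np

def gibbs_regression(X, y, Sigma0, a0=2.0, b0=2.0, n_iter=3000, seed=0):
    """Gibbs sampler for y ~ Normal(X beta, sigma2 I) with priors
    beta ~ Normal(0, Sigma0) and sigma2 ~ Inverse-Gamma(a0, b0)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    Sigma0_inv = np.linalg.inv(Sigma0)
    sigma2 = 1.0  # starting value
    betas = np.empty((n_iter, p))
    sigma2s = np.empty(n_iter)
    for t in range(n_iter):
        # beta | sigma2, y, X ~ MVN(V X'y / sigma2, V)
        V = np.linalg.inv(XtX / sigma2 + Sigma0_inv)
        m = V @ Xty / sigma2
        L = np.linalg.cholesky(V)  # V = L L'
        beta = m + L @ rng.standard_normal(p)
        # sigma2 | beta, y, X ~ Inverse-Gamma(a0 + n/2, b0 + RSS/2)
        resid = y - X @ beta
        rate = b0 + 0.5 * (resid @ resid)
        sigma2 = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / rate)
        betas[t], sigma2s[t] = beta, sigma2
    return betas, sigma2s

# Simulated data (assumed truth: beta = (1, -2), sigma = 0.5)
data_rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), data_rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + data_rng.normal(0.0, 0.5, size=n)
betas, sigma2s = gibbs_regression(X, y, Sigma0=100.0 * np.eye(2))
beta_hat = betas[500:].mean(axis=0)  # posterior mean after burn-in
```

Drawing \(\beta\) as a single block, rather than one coefficient at a time, is a design choice that avoids slow mixing when the coefficients are correlated.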

The Mathematical Reason: Exponential Family + Conjugate Priors

This simplicity isn’t accidental. It happens when:

The likelihood is from the exponential family (Normal, Poisson, Binomial, Gamma, etc.)
We use conjugate priors (priors chosen so that the posterior stays in the same family as the prior)

Exponential Family Form

\[p(y \mid \theta) = h(y) \exp\left(\eta(\theta)' T(y) - A(\theta)\right)\]

When each parameter’s prior is conjugate given the other parameters (conditionally conjugate), its full conditional stays in the same family as its prior, with updated parameters.
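As a sketch of why this works, take a prior whose kernel has the same shape as the likelihood in \(\theta\). The hyperparameters \(\nu\) and \(\kappa\) are notation assumed here for illustration:

\[p(\theta) \propto \exp\left(\eta(\theta)'\nu - \kappa A(\theta)\right)\]

Multiplying by the likelihood of \(n\) iid observations gives

\[p(\theta \mid y_1, \ldots, y_n) \propto \exp\left(\eta(\theta)'\left[\nu + \sum_i T(y_i)\right] - (\kappa + n) A(\theta)\right)\]

which has the same form as the prior with \(\nu\) replaced by \(\nu + \sum_i T(y_i)\) and \(\kappa\) replaced by \(\kappa + n\).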

Example 3: Poisson Regression (More Complex Joint, Simple Full Conditionals)

Model

\[y_i \sim \text{Poisson}(\lambda_i), \quad \lambda_i = \exp(x_i'\beta)\] \[\beta_j \sim \text{Normal}(0, 1000) \text{ (independent priors)}\]

Joint Posterior (Very Complex!)

\[p(\beta \mid y, X) \propto \prod_i \left[ \frac{\exp(x_i'\beta)^{y_i} \exp(-\exp(x_i'\beta))}{y_i!} \right] \times \prod_j \left[ \frac{1}{\sqrt{2000\pi}} \exp\left(-\frac{\beta_j^2}{2000}\right) \right]\]

No closed form! The coefficients are coupled through \(\exp(x_i'\beta)\) and can be strongly correlated in the posterior.

Full Conditional for \(\beta_j\) (Still Relatively Simple!)

\[p(\beta_j \mid \beta_{-j}, y, X) \propto \exp\left( \sum_i \left[ y_i x_{ij} \beta_j - \exp(x_i'\beta) \right] \right) \times \exp\left( -\frac{\beta_j^2}{2000} \right)\]

Why is this simpler?

All terms without \(\beta_j\) are constant
The sum over \(i\) only involves \(\beta_j\) through \(x_{ij}\beta_j\) and \(\exp(x_i'\beta)\)
It’s log-concave in \(\beta_j\), so it can be sampled with adaptive rejection sampling or a simple Metropolis-Hastings step
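Because this full conditional is not a standard distribution, one option is a random-walk Metropolis-Hastings update for each \(\beta_j\) in turn. The sketch below assumes that approach; the function names, the step size of 0.1, and the simulated data are illustrative choices, not part of the model above.

```python
import numpy as np

def log_full_conditional(j, beta, X, y, prior_var=1000.0):
    """Log of p(beta_j | beta_{-j}, y, X), up to an additive constant."""
    eta = X @ beta
    loglik = np.sum(y * X[:, j] * beta[j] - np.exp(eta))
    logprior = -beta[j] ** 2 / (2.0 * prior_var)
    return loglik + logprior

def mh_within_gibbs(X, y, n_iter=4000, step=0.1, seed=0):
    """Update each coefficient with a random-walk Metropolis-Hastings
    step that targets its full conditional."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        for j in range(p):
            proposal = beta.copy()
            proposal[j] += step * rng.normal()
            # Symmetric proposal, so the acceptance ratio is the density ratio
            log_ratio = (log_full_conditional(j, proposal, X, y)
                         - log_full_conditional(j, beta, X, y))
            if np.log(rng.uniform()) < log_ratio:
                beta = proposal
        draws[t] = beta
    return draws

# Simulated data (assumed truth: beta = (0.5, 0.8))
data_rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), data_rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = data_rng.poisson(np.exp(X @ beta_true))
draws = mh_within_gibbs(X, y)
beta_hat = draws[1000:].mean(axis=0)  # posterior mean after burn-in
```

The step size controls the acceptance rate and would normally be tuned for the data at hand.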

Why “Simple” Doesn’t Always Mean “Standard”

Sometimes “simple” means:

Type	Meaning	Sampling Method
Standard distribution	Normal, Gamma, Beta	Direct sampling
Log-concave	Log of density is concave	Adaptive rejection sampling
Low-dimensional	1 or 2 parameters	Metropolis-Hastings within Gibbs
Conditionally independent	Breaks into smaller pieces	Block Gibbs sampling

Why Full Conditionals Matter in Gibbs Sampling

The Gibbs Sampling Step

At each iteration, you sample:

\[\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \theta_3^{(t-1)}, \ldots, \theta_k^{(t-1)}, \text{data})\]

\[\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_k^{(t-1)}, \text{data})\]

… and so on

Each of these is a full conditional distribution.
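The update order can be made concrete with a toy target. The sketch below assumes a standard bivariate Normal with correlation \(\rho = 0.8\), chosen here only for illustration, whose full conditionals are \(\theta_1 \mid \theta_2 \sim \text{Normal}(\rho\theta_2, 1-\rho^2)\) and symmetrically for \(\theta_2\). The helper name `gibbs_sweep` is hypothetical.

```python
import numpy as np

def gibbs_sweep(theta, conditional_samplers, rng):
    """One Gibbs iteration: update each coordinate in turn, always
    conditioning on the most recent values of the others."""
    theta = theta.copy()
    for j, sample_j in enumerate(conditional_samplers):
        theta[j] = sample_j(theta, rng)
    return theta

rho = 0.8
cond_sd = np.sqrt(1.0 - rho ** 2)
samplers = [
    lambda th, rng: rng.normal(rho * th[1], cond_sd),  # theta_1 | theta_2
    lambda th, rng: rng.normal(rho * th[0], cond_sd),  # theta_2 | theta_1
]

rng = np.random.default_rng(0)
theta = np.zeros(2)
chain = np.empty((20000, 2))
for t in range(len(chain)):
    theta = gibbs_sweep(theta, samplers, rng)
    chain[t] = theta

# The draws should reproduce the correlation of the joint target
sample_corr = np.corrcoef(chain.T)[0, 1]
```

Note that the second sampler sees the freshly updated first coordinate, matching the superscripts \((t)\) and \((t-1)\) in the equations above.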

Key Property

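Sampling \(\theta_j\) from its full conditional while holding the other parameters fixed leaves the joint posterior invariant: if the current state is a draw from the joint posterior, so is the updated state. Cycling through all the parameters therefore gives a Markov chain whose stationary distribution is the target joint posterior.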

The Ultimate Reason: Hammersley-Clifford Theorem

This theorem (in the context of Gibbs sampling) essentially says:

Under a positivity condition on the joint density, the full conditionals together determine the joint distribution. The reverse is also true: the joint distribution determines each full conditional, and each one tends to be simpler because the other parameters enter only as fixed values.

Summary Table: Joint vs. Full Conditional

Aspect	Joint Posterior	Full Conditional
Complexity	High (all parameters interact)	Low (other parameters are fixed constants)
Parameter interactions	Fully present	Conditioned away
Form	Often no closed form	Often standard distribution
Dimensionality	Full parameter space	Single parameter (or small block)
Sampling	Difficult (needs MCMC)	Easy (direct or simple MCMC step)