Introduction

In a typical Dirichlet Process Mixture Model (DPMM), the data \(\mathbf{y} = (y_1, \dots, y_n)\) are assumed to be independent and identically distributed (i.i.d.), conditional on \(G\) and \(\phi\), from the mixture distribution: \[ y_i \mid G, \phi \overset{\text{iid}}{\sim} f(\cdot \mid G, \phi) = \int k(\cdot \mid \theta, \phi) \, \textsf{d}G(\theta), \quad i = 1, \dots, n. \] Here, \(G \sim \textsf{DP}(\alpha, G_0)\) is a random mixing measure following a Dirichlet Process (DP) with concentration parameter \(\alpha\) and base measure \(G_0\). A parametric prior \(p(\phi)\) is placed on \(\phi\), which governs additional parameters in the mixture components.

If the model incorporates a regression component, the observations \(y_i\) also depend on covariate vectors \(\mathbf{x}_1, \dots, \mathbf{x}_n\), and in such cases, \(\phi\) typically includes the vector of regression coefficients.

Moreover, prior distributions may be assigned to \(\alpha\) or the parameters \(\psi\) of the base measure \(G_0 = G_0(\cdot \mid \psi)\).

In a hierarchical formulation for DPMMs, we introduce latent mixing parameters \(\theta_i\) for each observation \(y_i\), structured as follows:

  1. Observation model: Each \(y_i\) is conditionally independent given \(\theta_i\) and a global parameter \(\phi\): \[ y_i \mid \theta_i, \phi \sim k(y_i \mid \theta_i, \phi), \quad i = 1, \ldots, n. \]

  2. Latent mixing parameters: The \(\theta_i\) values are i.i.d. from a random distribution \(G\): \[ \theta_i \mid G \sim G, \quad i = 1, \ldots, n. \]

  3. DP prior: The distribution \(G\) is drawn from a DP with concentration parameter \(\alpha\) and base measure \(G_0\): \[ G \mid \alpha, \psi \sim \textsf{DP}(\alpha, G_0(\cdot \mid \psi)). \]

  4. Priors on model parameters: Independent priors are placed on \(\phi, \alpha, \psi\): \[ \phi, \alpha, \psi \sim p(\phi) \, p(\alpha) \, p(\psi). \]

This hierarchical setup allows multiple observations to share the same value of \(\theta\), which naturally induces clusters. It also provides flexibility, since the number of clusters adapts to the complexity of the data.

The focus of inference is on the parameter \(\phi\) of \(f( \cdot \mid G, \phi)\), the latent mixing parameters \(\theta_1, \dots, \theta_n\), the parameters \(\alpha\) and \(\psi\), and functionals \(H(F(\cdot \mid G, \phi))\) of the random mixture \(F(\cdot \mid G, \phi)\), such as mean, variance, and percentile functionals.

Full inference, given the observed data, relies on the joint posterior distribution of the DP mixture model:
\[ p(G, \phi, \boldsymbol{\theta}, \alpha, \psi \mid \mathbf{y}). \] This posterior distribution encapsulates all uncertainty about the random quantities of interest, enabling comprehensive Bayesian inference.

Marginal posterior simulation

We can avoid the need to explicitly handle the infinite-dimensional random measure \(G\) by integrating it out, leveraging the Pólya urn (PU) characterization of the DP. This results in a finite-dimensional representation of the model:

  1. Observation model: Each \(y_i\) is conditionally independent given \(\theta_i\) and \(\phi\): \[ y_i \mid \theta_i, \phi \sim k(y_i \mid \theta_i, \phi), \quad i = 1, \dots, n. \]

  2. Prior on latent parameters: After integrating out \(G\), the joint distribution of \(\boldsymbol{\theta}\) follows the PU scheme: \[ \boldsymbol{\theta} \mid \alpha, \psi \sim \textsf{PU}(\alpha, G_0(\cdot \mid \psi)), \] where \(\textsf{PU}(\alpha, G_0)\) represents the induced PU process.

  3. Priors on model parameters: Independent priors are placed on \(\phi, \alpha, \psi\): \[ \phi, \alpha, \psi \sim p(\phi) \, p(\alpha) \, p(\psi). \]

The posterior distribution in a DPMM is given by:
\[ p(\boldsymbol{\theta}, \phi, \alpha, \psi \mid \mathbf{y}) \propto \prod_{i=1}^{n} k(y_i \mid \theta_i, \phi) \cdot p(\boldsymbol{\theta} \mid \alpha, \psi) \, p(\alpha) \, p(\psi) \, p(\phi). \]

To conduct posterior inference, we require samples from the full conditional distributions of the model parameters. Specifically, the following four sets of full conditionals must be derived: \(p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi, \phi, \mathbf{y})\), \(p(\alpha \mid \boldsymbol{\theta})\), \(p(\psi \mid \boldsymbol{\theta})\), and \(p(\phi \mid \boldsymbol{\theta}, \mathbf{y})\).

We focus on \(p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi, \phi, \mathbf{y})\), which determines the assignment of each \(\theta_i\) to either an existing cluster or a new cluster.

The full conditional distribution for \(\theta_i\) is given by: \[ p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi, \phi, \mathbf{y}) \propto k(y_i \mid \theta_i, \phi) \, p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi). \] Using the PU characterization of the DP, the full conditional prior distribution for any \(\theta_i\), given the remaining mixing parameters, is:

\[ p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi) = \sum_{k=1}^{K_{-i}} \frac{n_{-i,k}}{n-1+\alpha} \delta_{\vartheta_{-i,k}}(\theta_i) + \frac{\alpha}{n-1+\alpha} G_0(\theta_i \mid \psi). \] Here:

  • \(K_{-i}\) is the number of distinct clusters after removing \(\theta_i\).
  • \(n_{-i,k}\) is the size of the \(k\)-th cluster after removing \(\theta_i\).
  • \(\vartheta_{-i,k}\) are the unique values in \(\boldsymbol{\theta}_{-i}\).
  • \(\delta_{\vartheta_{-i,k}}(\theta_i)\) is a point mass at \(\vartheta_{-i,k}\) evaluated at \(\theta_i\).
  • \(G_0(\theta_i \mid \psi)\) is the base measure evaluated at \(\theta_i\).

The negative subscript \(-i\) denotes that these quantities are computed after removing observation \(i\).
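
As a quick illustration of the PU scheme as a prior, the following sketch draws \(\theta_1, \dots, \theta_n\) sequentially from it; the choices \(\alpha = 1\), \(G_0 = \textsf{N}(0, 1)\), and \(n = 20\) are illustrative only and separate from the model fitting below. The ties among the draws are exactly the clusters induced by the DP.

# Minimal illustrative sketch: sequential draws from the Polya urn prior with
# alpha = 1 and G_0 = N(0, 1) (illustrative values, unrelated to the example below)
set.seed(42)
alpha_pu <- 1
n_pu     <- 20
theta_pu <- numeric(n_pu)
theta_pu[1] <- rnorm(1)                          # theta_1 ~ G_0
for (i in 2:n_pu) {
  if (runif(1) < alpha_pu/(i - 1 + alpha_pu)) {
    theta_pu[i] <- rnorm(1)                      # new value drawn from G_0
  } else {
    j <- sample.int(i - 1, size = 1)             # copy an earlier draw
    theta_pu[i] <- theta_pu[j]                   # (each with probability 1/(i-1))
  }
}
table(round(theta_pu, 3))                        # ties reveal the induced clusters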

Multiplying by the likelihood term \(k(y_i \mid \theta_i, \phi)\), we obtain the posterior full conditional: \[ p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi, \phi, \mathbf{y}) \propto k(y_i \mid \theta_i, \phi) \left( \sum_{k=1}^{K_{-i}} n_{-i,k} \, \delta_{\vartheta_{-i,k}}(\theta_i) + \alpha \, G_0(\theta_i \mid \psi) \right). \] Thus, the full conditional distribution for \(\theta_i\) is: \[ p(\theta_i \mid \boldsymbol{\theta}_{-i}, \alpha, \psi, \phi, \mathbf{y}) \propto \sum_{k=1}^{K_{-i}} n_{-i,k} \, k(y_i \mid \vartheta_{-i,k}, \phi) \, \delta_{\vartheta_{-i,k}}(\theta_i) + \alpha \, p(y_i \mid \phi, \psi) \, \frac{k(y_i \mid \theta_i, \phi) \, G_0(\theta_i \mid \psi)}{p(y_i \mid \phi, \psi) }, \] where the marginal likelihood of \(y_i\) under the base measure \(G_0\) is given by: \[ p(y_i \mid \phi, \psi) = \int k(y_i \mid \theta_i, \phi) \, \textsf{d}G_0(\theta_i \mid \psi). \]

This formulation leads to the following interpretation:

  • Assignment to an existing cluster: With probability proportional to \(n_{-i,k} \, k(y_i \mid \vartheta_{-i,k}, \phi)\), the parameter \(\theta_i\) is assigned to an existing cluster \(k\), meaning it takes the value of one of the previously observed cluster representatives \(\vartheta_{-i,k}\). The term \(k(y_i \mid \vartheta_{-i,k}, \phi)\) ensures that \(y_i\) is more likely to join clusters where it fits well, while \(n_{-i,k}\) favors clusters with more members.

  • Formation of a new cluster: With probability proportional to \(\alpha \, p(y_i \mid \phi, \psi)\), a new cluster is created, and \(\theta_i\) is sampled from the posterior distribution \(p(\theta_i \mid \phi, \psi, y_i) \propto k(y_i \mid \theta_i, \phi) \, G_0(\theta_i \mid \psi)\). This ensures that new clusters are informed by both the base measure \(G_0\) and the likelihood of \(y_i\). The concentration parameter \(\alpha\) controls the likelihood of creating new clusters: larger values of \(\alpha\) increase the tendency to form new clusters, while smaller values favor merging into existing ones.

Using the PU scheme directly in a Gibbs sampler leads to slow mixing: because the \(\theta_i\) are updated one at a time, the common value shared by the members of a cluster is never resampled jointly, and it can only change through a slow sequence of individual reassignments. This causes high autocorrelation and inefficient exploration of the posterior. Mixing improves if we work with cluster assignment indicators, as done next, so that each cluster-specific parameter can be resampled given all observations currently assigned to it; auxiliary-variable and split-merge moves offer further improvements.

Introducing cluster assignment indicators

To improve sampling efficiency, we introduce cluster assignment indicators \(\xi_i \in \mathbb{N}\) and define a set of unique cluster parameters \(\vartheta_1, \dots, \vartheta_K\), where \(K = \max \{ \xi_1,\ldots,\xi_n \}\) is the current number of clusters.

Each \(\theta_i\) is linked to its corresponding cluster representative via \(\vartheta_{\xi_i} = \theta_i\).

To ensure proper bookkeeping, the indices \(\xi_i\) are always labeled continuously, starting from \(1\), meaning there are no gaps in the sequence of cluster labels.

With this setup, the DPMM can be rewritten as:

  1. Observation model:
    \[ y_i \mid \xi_i, \vartheta_{\xi_i}, \phi \sim k(y_i \mid \vartheta_{\xi_i}, \phi), \quad i = 1, \dots, n. \]

  2. Cluster parameter prior:
    \[ \vartheta_k \mid \psi \sim G_0(\psi), \quad k = 1, \dots, K. \]

  3. Cluster assignments prior:
    \[ \boldsymbol{\xi} \mid \alpha \sim \textsf{CRP}(\alpha), \] where \(\textsf{CRP}(\alpha)\) denotes the Chinese restaurant process, i.e., the distribution over partitions induced by the DP once \(G\) is integrated out.

  4. Prior distributions:
    \[ \alpha, \phi, \psi \sim p(\alpha) \, p(\phi) \, p(\psi). \]

This reformulation decouples cluster assignments from parameter updates, allowing for more efficient posterior sampling by avoiding direct manipulation of individual \(\theta_i\) values. Instead, it operates on a compressed representation of unique cluster parameters, improving mixing and computational efficiency in Gibbs sampling.

Joint posterior distribution

The joint posterior distribution of the cluster parameters \(\{\vartheta_k\}\), the cluster assignments \(\{\xi_i\}\), and the hyperparameters \(\phi, \alpha, \psi\) given the data \(\mathbf{y}\) is: \[ p(\{\vartheta_k\}, \{\xi_i\}, \phi, \alpha, \psi \mid \mathbf{y}) \propto \prod_{i=1}^{n} k(y_i \mid \vartheta_{\xi_i}, \phi) \cdot \prod_{k=1}^{K} G_0(\vartheta_k \mid \psi) \cdot p(\boldsymbol{\xi} \mid \alpha) \cdot p(\alpha) \, p(\psi) \, p(\phi) , \] where:

  • \(k(y_i \mid \vartheta_{\xi_i}, \phi)\) is the likelihood of each observation \(y_i\), given its cluster assignment \(\xi_i\) and associated parameter \(\vartheta_{\xi_i}\).
  • \(G_0(\vartheta_k \mid \psi)\) is the prior over the unique cluster parameters \(\{\vartheta_k\}\).
  • \(p(\boldsymbol{\xi} \mid \alpha)\) is the prior distribution over cluster assignments, following a CRP prior.
  • \(p(\alpha)\), \(p(\phi)\), and \(p(\psi)\) are the prior distributions for the model parameters.

The distribution over partitions of the data, given \(\alpha\), follows the Ewens’ Sampling Formula: \[ p(\boldsymbol{\xi} \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}\, \alpha^{K} \prod_{k=1}^{K} \Gamma(n_k), \] where \(K\) is the number of nonempty clusters, and \(n_k\) is the size of the \(k\)-th cluster.
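
For concreteness, a minimal sketch of this formula as an R function; the name log_ewens and the label-vector encoding of the partition are our own and are not used in the implementation below.

# Minimal sketch: log of the Ewens sampling formula for a partition encoded by
# a vector of cluster labels xi (illustrative helper only)
log_ewens <- function(xi, alpha) {
  n   <- length(xi)
  n_k <- as.numeric(table(xi))   # cluster sizes n_1, ..., n_K
  K   <- length(n_k)             # number of nonempty clusters
  lgamma(alpha) - lgamma(alpha + n) + K*log(alpha) + sum(lgamma(n_k))
}
log_ewens(c(1, 1, 2, 3, 3, 3), alpha = 1)   # example partition of n = 6 items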

Since the number of clusters is not fixed, inference on the number of components \(K\) is done indirectly through inference on \(\{\xi_i\}\), since \(K = \max \{\xi_i\}\).

Full conditional distribution for cluster assignments

Given the DP formulation, the full conditional prior distribution for the cluster assignment \(\xi_i\) follows a PU process: \[ \xi_i \mid \boldsymbol{\xi}_{-i}, \alpha \sim \sum_{k=1}^{K_{-i}} \frac{n_{-i,k}}{n - 1 + \alpha} \delta_k + \frac{\alpha}{n - 1 + \alpha} \delta_{K_{-i} + 1}, \] where:

  • \(K_{-i}\) is the number of distinct clusters after removing observation \(i\).
  • \(n_{-i,k}\) is the number of observations currently assigned to cluster \(k\) after removing \(i\).
  • The first term represents the probability of assigning \(\xi_i\) to an existing cluster \(k\), which is proportional to the cluster’s current size, \(n_{-i,k}\).
  • The second term corresponds to the creation of a new cluster, introducing a new parameter \(\vartheta_{K_{-i}+1} \sim G_0(\psi)\), with a probability proportional to \(\alpha\).

Thus, the full conditional probability of assigning \(\xi_i\) given all other parameters is: \[ p(\xi_i \mid \boldsymbol{\xi}_{-i}, \boldsymbol{\vartheta}_{-i}, \alpha, \psi, \phi, \mathbf{y}) \propto k(y_i \mid \vartheta_{\xi_i}, \phi) \, p(\xi_i \mid \boldsymbol{\xi}_{-i}, \alpha). \] Expanding \(p(\xi_i \mid \boldsymbol{\xi}_{-i}, \alpha)\): \[ p(\xi_i \mid \boldsymbol{\xi}_{-i}, \alpha) = \begin{cases} \frac{n_{-i,k}}{n-1+\alpha}, & \text{if } \xi_i = k, \\ \frac{\alpha}{n-1+\alpha}, & \text{if } \xi_i = K_{-i} + 1. \end{cases} \] Thus: \[ p(\xi_i \mid \boldsymbol{\xi}_{-i}, \boldsymbol{\vartheta}, \alpha, \psi, \phi, \mathbf{y}) \propto \sum_{k=1}^{K_{-i}} n_{-i,k} \, k(y_i \mid \vartheta_k, \phi) \cdot \delta_k(\xi_i) + \alpha \, p(y_i \mid \phi, \psi) \cdot \delta_{K_{-i} + 1}(\xi_i), \] where \[ p(y_i \mid \phi, \psi) = \int k(y_i \mid \vartheta, \phi) \, \textsf{d}G_0(\vartheta \mid \psi). \]

This formulation leads to the following two possible updates:

  • Assignment to an existing cluster: With probability proportional to \(n_{-i,k} \, k(y_i \mid \vartheta_k, \phi)\), the observation \(y_i\) is assigned to an existing cluster \(k\). This probability is higher for clusters that already contain more observations (\(n_{-i,k}\)), and for which the likelihood \(k(y_i \mid \vartheta_k, \phi)\) is greater, indicating that \(y_i\) is well-explained by the cluster’s parameter \(\vartheta_k\).

  • Formation of a new cluster: With probability proportional to \(\alpha \, p(y_i \mid \phi, \psi)\), a new cluster is formed. Rather than sampling \(\vartheta_{K_{-i} + 1}\) directly from the prior \(G_0\), it is drawn from its posterior distribution \(p(\vartheta_{K_{-i} + 1} \mid y_i, \phi, \psi) \propto k(y_i \mid \vartheta, \phi) G_0(\vartheta \mid \psi)\). This approach ensures that the newly formed cluster’s parameter \(\vartheta_{K_{-i} + 1}\) is not determined solely by the prior but is instead updated in response to the observed data \(y_i\).

Full conditional distribution for \(\vartheta_k\)

The posterior distribution for each cluster parameter \(\vartheta_k\) is given by: \[ p(\vartheta_k \mid \dots) \propto \prod_{j: \xi_j = k} k(y_j \mid \vartheta_k, \phi) \cdot G_0(\vartheta_k \mid \psi), \] where:

  • The likelihood term \(k(y_j \mid \vartheta_k, \phi)\) accounts for all observations \(y_j\) assigned to cluster \(k\).
  • The prior \(G_0(\vartheta_k \mid \psi)\) regularizes the distribution of \(\vartheta_k\), ensuring that new clusters receive meaningful parameters.

Full conditional distribution for \(\phi\)

The posterior distribution for \(\phi\) follows: \[ p(\phi \mid \dots) \propto \prod_{i=1}^{n} k(y_i \mid \vartheta_{\xi_i}, \phi) \cdot p(\phi), \] where:

  • The likelihood term includes contributions from all observations \(y_i\), conditioned on their respective cluster parameters \(\vartheta_{\xi_i}\).
  • The prior \(p(\phi)\) imposes constraints on the global parameters.

Full conditional distribution for \(\psi\)

Since the cluster parameters \(\vartheta_k\) are drawn i.i.d. from \(G_0(\psi)\), the posterior for \(\psi\) follows: \[ p(\psi \mid \dots) \propto \prod_{k=1}^{K} G_0(\vartheta_k \mid \psi) \cdot p(\psi). \]

where:

  • The product runs over the unique cluster parameters \(\vartheta_k\).
  • The prior \(p(\psi)\) provides additional regularization.

Full conditional distribution for \(\alpha\)

The concentration parameter \(\alpha\) controls the prior on the number of clusters \(K\). From the Ewens sampling formula, its full conditional is: \[ p(\alpha \mid \dots) \propto p(\alpha) \, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \, \alpha^{K}. \]

To facilitate Gibbs sampling, we introduce the auxiliary variable \(\eta\) such that: \[ \eta \mid \alpha, \dots \sim \textsf{Beta}(\alpha + 1, n), \] and \[ \alpha \mid \eta, \dots \sim \delta \, \textsf{Gamma}(a_{\alpha} + K, b_{\alpha} - \log \eta) + (1 - \delta) \, \textsf{Gamma}(a_{\alpha} + K - 1, b_{\alpha} - \log \eta), \] where: \[ \delta = \frac{a_{\alpha} + K - 1}{a_{\alpha} + K - 1 + n (b_{\alpha} - \log \eta)}. \]
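
This is the auxiliary-variable construction of Escobar and West (1995). As a brief sketch of the identity behind it, assuming the \(\textsf{G}(a_{\alpha}, b_{\alpha})\) prior on \(\alpha\), note that \[ \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} = \frac{(\alpha + n) \, B(\alpha + 1, n)}{\alpha \, \Gamma(n)} \propto \alpha^{-1} (\alpha + n) \int_0^1 \eta^{\alpha} (1 - \eta)^{n - 1} \, \textsf{d}\eta, \] so the augmented density \[ p(\alpha, \eta \mid K, \dots) \propto \alpha^{a_{\alpha} + K - 2} \, (\alpha + n) \, e^{-(b_{\alpha} - \log \eta) \alpha} \, (1 - \eta)^{n - 1} \] recovers the full conditional for \(\alpha\) after integrating out \(\eta\), and its conditionals are exactly the \(\textsf{Beta}(\alpha + 1, n)\) distribution for \(\eta\) and the two-component Gamma mixture for \(\alpha\) stated above.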

This formulation allows efficient posterior inference using Gibbs sampling.

Example: Location Normal DPMM

Consider the location-normal DPMM: \[ f(y \mid G, \phi) = \int \textsf{N}(y \mid \theta, \phi) \, \textsf{d}G(\theta), \] where the mixing distribution \(G\) follows a DP prior: \(G \mid \alpha, \mu, \tau^2 \sim \textsf{DP}(\alpha, G_0 = \textsf{N}(\mu,\tau^2))\).

We assume the following prior distributions: \[ \phi \sim \textsf{IG}(a_{\phi}, b_{\phi}),\quad \alpha \sim \textsf{G}(a_{\alpha}, b_{\alpha}),\quad \mu \sim \textsf{N}(a_{\mu}, b_{\mu}),\quad \tau^2 \sim \textsf{IG}(a_{\tau}, b_{\tau}). \]

Thus, the hierarchical representation of this semiparametric DPMM is: \[ \begin{align*} y_i \mid \theta_i, \phi &\overset{\text{ind}}{\sim} \textsf{N}(y_i \mid \theta_i, \phi), \qquad i = 1, \dots, n, \\ \theta_i \mid G &\overset{\text{i.i.d.}}{\sim} G, \qquad i = 1, \dots, n, \\ G \mid \alpha, \mu, \tau^2 &\sim \textsf{DP}(\alpha, G_0 = \textsf{N}(\mu,\tau^2)), \\ \alpha, \mu, \tau^2, \phi &\sim p(\alpha) \, p(\mu) \, p(\tau^2) \, p(\phi), \end{align*} \] where the independent prior distributions \(p(\alpha)\), \(p(\mu)\), \(p(\tau^2)\), and \(p(\phi)\) are specified as above.

To study inference under this model, we consider synthetic data generated from a mixture of Normal distributions. Specifically, let \(n = 250\) observations be drawn from the following mixture model: \[ 0.25 \, \textsf{N}(-5, 1) + 0.5 \, \textsf{N}(0, 1) + 0.25 \, \textsf{N}(5, 1). \] This mixture comprises three normal components with means \(-5\), \(0\), and \(5\), each having unit variance, and respective mixing proportions of \(0.25\), \(0.5\), and \(0.25\).

# Sample size
n <- 250

# Define mixture proportions
probs <- c(0.25, 0.5, 0.25)

# Define component means and standard deviations
means <- c(-5, 0, 5)
sds   <- c(1, 1, 1)

# Generate 250 samples
set.seed(123)
components <- sample(1:3, size = n, replace = TRUE, prob = probs)
y <- rnorm(n, mean = means[components], sd = sds[components])

# Plot histogram using base R
hist(y, breaks = 30, freq = F, col = "lightgray", border = "lightgray",
     main = "Synthetic mixture data", xlab = "y", ylab = "Density")

Model formulation using cluster assignments

By integrating \(G\) out, the CRP formulation yields a finite-dimensional representation in which the cluster assignments \(\xi_i\) determine which cluster each observation \(y_i\) belongs to. The model can be rewritten in hierarchical form as follows:

  1. Observation model: Each observation \(y_i\) follows a Normal distribution with a mean \(\vartheta_{\xi_i}\) corresponding to its assigned cluster and a global variance parameter \(\phi\): \[ y_i \mid \xi_i, \vartheta_{\xi_i}, \phi \overset{\text{ind}}{\sim} \textsf{N}(\vartheta_{\xi_i}, \phi), \quad i = 1, \dots, n. \]

  2. Cluster assignments prior: We use the CRP prior to define the distribution of cluster assignments \(\xi_i\): \[ p(\xi_i \mid \boldsymbol{\xi}_{-i}, \alpha) = \begin{cases} \frac{n_{-i,k}}{n - 1 + \alpha}, & \text{if } \xi_i = k \qquad\quad \,\, \text{ (existing cluster)}, \\ \frac{\alpha}{n - 1 + \alpha}, & \text{if } \xi_i = K_{-i} + 1 \,\,\, \text{ (new cluster)}. \end{cases} \]

  3. Cluster mean prior: Each cluster-specific mean \(\vartheta_k\) is drawn from the base measure \(G_0\), which follows a Normal distribution: \[ \vartheta_k \mid \mu, \tau^2 \overset{\text{i.i.d.}}{\sim} \textsf{N}(\mu, \tau^2), \quad k = 1, \dots, K. \]

  4. Priors for model parameters: To complete the hierarchical model, we place priors on the rest of model parameters: \[ \phi \sim \textsf{IG}(a_{\phi}, b_{\phi}),\quad \alpha \sim \textsf{G}(a_{\alpha}, b_{\alpha}),\quad \mu \sim \textsf{N}(a_{\mu}, b_{\mu}),\quad \tau^2 \sim \textsf{IG}(a_{\tau}, b_{\tau}). \]

Posterior Inference

By integrating out \(G\), the joint posterior distribution of the model parameters is: \[ p(\boldsymbol{\xi}, \boldsymbol{\vartheta}, \phi, \alpha, \mu, \tau^2 \mid \mathbf{y}) \propto \prod_{i=1}^{n} \textsf{N}(y_i \mid \vartheta_{\xi_i}, \phi) \cdot \prod_{k=1}^{K} \textsf{N}(\vartheta_k \mid \mu, \tau^2) \cdot p(\boldsymbol{\xi} \mid \alpha) \cdot p(\alpha) \, p(\mu) \, p(\tau^2) \, p(\phi). \]

We now derive the full conditional distributions for posterior inference via Gibbs sampling.

Full conditional for the cluster assignments \(\xi_i\)

\[ p(\xi_i \mid \boldsymbol{\xi}_{-i}, \boldsymbol{\vartheta}, \alpha, \phi, \mu, \tau^2, y_i) \propto \sum_{k=1}^{K_{-i}} n_{-i,k} \cdot \textsf{N}(y_i \mid \vartheta_k, \phi) \cdot \delta_k(\xi_i) + \alpha \cdot \textsf{N}(y_i \mid \mu, \tau^2 + \phi) \cdot \delta_{K_{-i} + 1}(\xi_i), \] where:

  • Existing cluster assignment probability: \(\propto n_{-i,k} \cdot \textsf{N}(y_i \mid \vartheta_k, \phi)\).
  • New cluster creation probability: \(\propto \alpha \cdot \textsf{N}(y_i \mid \mu, \tau^2 + \phi)\).

If \(y_i\) is assigned to an existing cluster \(k\), then \(\xi_i\) takes a previously observed value \(k\). The prior probability of this assignment follows: \[ p(\xi_i = k \mid \boldsymbol{\xi}_{-i}, \alpha) = \frac{n_{-i,k}}{n - 1 + \alpha}. \] Multiplying by the likelihood of \(y_i\) given \(\vartheta_k\) we obtain: \[ p(\xi_i = k \mid \dots) \propto n_{-i,k} \cdot \textsf{N}(y_i \mid \vartheta_k, \phi). \]

If \(y_i\) is assigned to a new cluster, we must account for the prior on \(\vartheta_{K+1}\), which follows \(\vartheta_{K+1} \mid \mu, \tau^2 \sim \textsf{N}(\mu, \tau^2)\).

The marginal likelihood of assigning \(y_i\) to a new cluster is computed by integrating out \(\vartheta_{K+1}\): \[ p(y_i \mid \mu, \tau^2, \phi) = \int \textsf{N}(y_i \mid \vartheta, \phi) \, \textsf{N}(\vartheta \mid \mu, \tau^2) \, \textsf{d}\vartheta = \textsf{N}(y_i \mid \mu, \tau^2 + \phi). \]
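
This Normal-Normal convolution identity can be checked numerically; the values below are illustrative and unrelated to the hyperparameters used later.

# Minimal numerical check of the marginal likelihood identity (illustrative values)
y0 <- 1.3; mu0 <- 0; tau2_0 <- 2; phi0 <- 1
num <- integrate(function(th) dnorm(y0, mean = th, sd = sqrt(phi0)) *
                              dnorm(th, mean = mu0, sd = sqrt(tau2_0)),
                 lower = -Inf, upper = Inf)$value
c(numerical = num, closed_form = dnorm(y0, mean = mu0, sd = sqrt(tau2_0 + phi0)))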

The prior probability for creating a new cluster is: \[ p(\xi_i = K_{-i} + 1 \mid \boldsymbol{\xi}_{-i}, \alpha) = \frac{\alpha}{n - 1 + \alpha}. \]

Multiplying by this marginal likelihood of \(y_i\) given \(\mu,\tau^2,\phi\), we obtain: \[ p(\xi_i = K_{-i} + 1 \mid \dots) \propto \alpha \cdot \textsf{N}(y_i \mid \mu, \tau^2 + \phi). \]

If \(\xi_i\) is assigned to a new cluster, we must sample a new value for \(\vartheta_{K+1}\) from its posterior distribution: \[ p(\vartheta_{K+1} \mid y_i, \mu, \tau^2, \phi) \propto \textsf{N}(y_i \mid \vartheta, \phi) \cdot \textsf{N}(\vartheta \mid \mu, \tau^2) \propto \textsf{N}(\mu^*, \tau^{*2}) \] where: \[ \mu^* = \frac{\frac{1}{\tau^2} \, \mu + \frac{1}{\phi} \, y_i}{\frac{1}{\tau^2} + \frac{1}{\phi}}, \qquad \tau^{*2} = \frac{1}{\frac{1}{\tau^2} + \frac{1}{\phi}}. \]

Full conditional for the cluster means \(\vartheta_k\)

\[ p(\vartheta_k \mid \{ y_i : \xi_i = k \}, \mu, \tau^2, \phi) \propto \prod_{i: \xi_i = k} \textsf{N}(y_i \mid \vartheta_k, \phi) \cdot \textsf{N}(\vartheta_k \mid \mu, \tau^2). \] This follows a Normal posterior \(\vartheta_k \mid \dots \sim \textsf{N}(M_k, V_k)\), where: \[ M_k = \frac{\frac{1}{\tau^2} \, \mu + \frac{n_k}{\phi} \, \bar{y}_k}{\frac{1}{\tau^2} + \frac{n_k}{\phi}}, \qquad V_k = \frac{1}{\frac{1}{\tau^2} + \frac{n_k}{\phi}}. \] Here, the term \(\bar{y}_k\) represents the sample mean of all observations assigned to cluster \(k\): \[ \bar{y}_k = \frac{1}{n_k} \sum_{i: \xi_i = k} y_i. \] where \(n_k\) is the number of observations assigned to cluster \(k\), i.e., \(n_k = \sum_{i} \mathbb{1}(\xi_i = k)\).

Full conditional for the global mean \(\mu\)

\[ p(\mu \mid \boldsymbol{\vartheta}, \tau^2) \propto \prod_{k=1}^{K} \textsf{N}(\vartheta_k \mid \mu, \tau^2) \cdot \textsf{N}(\mu \mid a_{\mu}, b_{\mu}). \] This follows a Normal posterior \(\mu \mid \dots \sim \textsf{N}(M_{\mu}, V_{\mu})\), where: \[ M_{\mu} = \frac{\frac{1}{b_{\mu}} \, a_{\mu} + \frac{K}{\tau^2} \, \bar{\vartheta}}{\frac{1}{b_{\mu}} + \frac{K}{\tau^2}},\qquad V_{\mu} = \frac{1}{\frac{1}{b_{\mu}} + \frac{K}{\tau^2}}. \] Here, the term \(\bar{\vartheta}\) represents the sample mean of all cluster means: \[ \bar{\vartheta} = \frac{1}{K} \sum_{k=1}^{K} \vartheta_k. \]

Full conditional for the cluster variance \(\tau^2\)

\[ \tau^2 \mid \vartheta, \mu \sim \textsf{IG}\left( a_{\tau} + \frac{K}{2}, b_{\tau} + \frac{1}{2} \sum_{k=1}^{K} (\vartheta_k - \mu)^2 \right). \]

Full conditional for the observation variance \(\phi\)

\[ \phi \mid \boldsymbol{\vartheta}, \boldsymbol{\xi}, \mathbf{y} \sim \textsf{IG}\left( a_{\phi} + \frac{n}{2}, b_{\phi} + \frac{1}{2} \sum_{i=1}^{n} (y_i - \vartheta_{\xi_i})^2 \right). \]

Implementation

sample_xi <- function(y, xi, theta, alpha, phi, mu, tau2)
{
  n <- length(y)
 
  for (i in 1:n) {
    # Remove current assignment and relabel clusters
    xi_new <- xi
    xi_new[i] <- 0  # Temporarily remove xi[i]
    
    # Relabel clusters (continuous labels)
    unique_xi  <- unique(xi_new[-i])
    unique_xi  <- unique_xi[order(unique_xi)]
    xi_new[-i] <- as.numeric(factor(xi_new[-i], levels = unique_xi, labels = seq_along(unique_xi)))
    
    # Get new number of clusters (excluding removed assignment)
    K_new <- max(xi_new)
    
    # Ensure theta is updated: Keep only active cluster parameters
    theta <- theta[unique_xi]
    
    # Compute assignment probabilities
    log_probs <- numeric(K_new + 1)
    
    for (k in 1:K_new) {
      n_k <- sum(xi_new == k)
      log_probs[k] <- log(n_k) + dnorm(y[i], mean = theta[k], sd = sqrt(phi), log = TRUE)
    }
    
    # Compute probability of forming a new cluster
    log_probs[K_new + 1] <- log(alpha) + dnorm(y[i], mean = mu, sd = sqrt(tau2 + phi), log = TRUE)
    
    # Sample new assignment
    new_cluster <- sample(1:(K_new + 1), size = 1, prob = exp(log_probs - max(log_probs)))
    xi_new[i] <- new_cluster
    
    # If a new cluster is formed, sample a new theta
    if (new_cluster == (K_new + 1)) {
      tau_star2 <- 1/(1/tau2 + 1/phi)
      mu_star <- tau_star2*(mu/tau2 + y[i]/phi)
      new_theta <- rnorm(1, mean = mu_star, sd = sqrt(tau_star2))
      theta <- c(theta, new_theta)
    }
    
    # Update assignment
    xi <- xi_new
  }
  
  return(
     list(
          xi = xi, 
          theta = theta
     )
  )
}
sample_theta <- function(y, xi, mu, tau2, phi) 
{
  unique_clusters <- unique(xi)
  theta <- numeric(length(unique_clusters))
  
  for (k in unique_clusters) {
    members <- which(xi == k)
    n_k <- length(members)
    
    if (n_k > 0) {
      s_k <- sum(y[members])
      V_k <- 1/(n_k/phi + 1/tau2)
      M_k <- V_k*(mu/tau2 + s_k/phi)
      theta[k] <- rnorm(1, mean = M_k, sd = sqrt(V_k))
    }
  }
  return(theta)
}
sample_mu <- function(theta, tau2, a_mu, b_mu)
{
  K <- length(theta)
  sum_theta <- sum(theta)   # sum of the K cluster means (equals K * theta_bar)
  
  V_mu <- 1/(1/b_mu + K/tau2)
  M_mu <- V_mu*(a_mu/b_mu + sum_theta/tau2)
  
  return(rnorm(1, mean = M_mu, sd = sqrt(V_mu)))
}
sample_tau2 <- function(theta, mu, a_tau, b_tau) 
{
  K <- length(theta)
  shape <- a_tau + 0.5*K
  rate  <- b_tau + 0.5*sum((theta - mu)^2)
  # Inverse-gamma draw: if X ~ Gamma(shape, rate), then 1/X ~ IG(shape, rate)
  return(1/rgamma(1, shape = shape, rate = rate))
}
sample_phi <- function(y, xi, theta, a_phi, b_phi) 
{
  n <- length(y)
  residuals <- y - theta[xi]
  shape <- a_phi + 0.5*n
  rate  <- b_phi + 0.5*sum(residuals^2)
  # Inverse-gamma draw via the reciprocal of a Gamma draw
  return(1/rgamma(1, shape = shape, rate = rate))
}
sample_alpha <- function(alpha, xi, n, a_alpha, b_alpha) 
{
  K <- length(unique(xi))
  eta <- rbeta(1, shape1 = alpha + 1, shape2 = n)
  delta <- (a_alpha + K - 1)/(a_alpha + K - 1 + n*(b_alpha - log(eta)))
  
  if (runif(1) < delta) {
    return(rgamma(1, shape = a_alpha + K, rate = b_alpha - log(eta)))
  } else {
    return(rgamma(1, shape = a_alpha + K - 1, rate = b_alpha - log(eta)))
  }
}
gibbs_sampler <- function(y, iter, burn, 
                          a_phi, b_phi, 
                          a_alpha, b_alpha, 
                          a_mu, b_mu, 
                          a_tau, b_tau) 
{
  n <- length(y)
  max_K <- floor(n / 2)  # Range of labels used to initialize cluster assignments
  
  # Storage (only for post-burn samples)
  B <- iter - burn
  xi_store    <- matrix(NA, nrow = B, ncol = n)
  phi_store   <- numeric(B)
  alpha_store <- numeric(B)
  mu_store    <- numeric(B)
  tau2_store  <- numeric(B)
  theta_store <- vector("list", B)

  # Initialize parameters
  alpha <- rgamma(1, a_alpha, b_alpha)
  phi   <- 1 / rgamma(1, a_phi, b_phi)
  mu    <- rnorm(1, a_mu, sqrt(b_mu))
  tau2  <- 1/rgamma(1, a_tau, b_tau)  # inverse-gamma draw, consistent with the updates above
  
  # Initialize cluster assignments
  xi <- sample(1:max_K, n, replace = TRUE)
  xi <- as.numeric(factor(xi, levels = unique(xi), labels = seq_along(unique(xi))))
  
  # Initialize theta based on the number of unique clusters
  K <- length(unique(xi))
  theta <- rnorm(K, mu, sqrt(tau2))

  for (t in 1:iter) {
    
    # Update parameters
    xi_theta <- sample_xi(y, xi, theta, alpha, phi, mu, tau2)
    xi    <- xi_theta$xi
    theta <- xi_theta$theta
    
    theta <- sample_theta(y, xi, mu, tau2, phi) 
    mu    <- sample_mu(theta, tau2, a_mu, b_mu)
    tau2  <- sample_tau2(theta, mu, a_tau, b_tau)
    phi   <- sample_phi(y, xi, theta, a_phi, b_phi)
    alpha <- sample_alpha(alpha, xi, n, a_alpha, b_alpha)

    # Store samples (only after burn-in)
    if (t > burn) {
      i <- t - burn
      xi_store   [i, ] <- xi
      phi_store  [i]   <- phi
      alpha_store[i]   <- alpha
      mu_store   [i]   <- mu
      tau2_store [i]   <- tau2
      theta_store[[i]] <- theta
    }
    
    # Print progress every 10%
    if (t %% (iter / 10) == 0) {
      cat("Iteration", t, "/", iter, "| Clusters:", length(unique(xi)), "\n")
    }
  }
  
  return(
    list(
      xi    = xi_store, 
      theta = theta_store,
      phi   = phi_store, 
      alpha = alpha_store, 
      mu    = mu_store, 
      tau2  = tau2_store
    )
  )
}
# Hyperparameters
a_phi   <- 2      
b_phi   <- var(y)  
a_alpha <- 1      
b_alpha <- 1      
a_mu    <- mean(y) 
b_mu    <- 2 * var(y)  
a_tau   <- 2      
b_tau   <- var(y)

# Settings
iter <- 12500
burn <- 2500

# Run Gibbs sampler
samples <- gibbs_sampler(y, iter, burn, 
                         a_phi, b_phi, 
                         a_alpha, b_alpha, 
                         a_mu, b_mu, a_tau, b_tau)

save(samples, file = "samples_dpmm_1.RData")

To monitor convergence of the chain, we track the log-likelihood of the data across MCMC iterations. Taking the logarithm of the likelihood at iteration \(b\), we obtain:
\[ \log p(\mathbf{y} \mid \boldsymbol{\vartheta}^{(b)}, \boldsymbol{\xi}^{(b)}, \phi^{(b)}) = \sum_{i=1}^{n} \log \textsf{N}(y_i \mid \vartheta_{\xi_i}^{(b)}, \phi^{(b)}), \quad b = 1,\ldots,B, \] where \(B\) is the number of posterior samples.

compute_log_likelihood <- function(y, samples) {
  B  <- length(samples$theta)
  ll <- numeric(B)
  
  for (b in 1:B) {
    xi_b <- samples$xi[b, ]
    theta_b <- samples$theta[[b]]
    phi_b <- samples$phi[b]
    
    # Compute log-likelihood for this iteration
    ll[b] <- sum(dnorm(y, mean = theta_b[xi_b], sd = sqrt(phi_b), log = T))
  }
  
  return(ll)
}
# Compute log-likelihood at each MCMC iteration
ll <- compute_log_likelihood(y, samples)

# Plot log-likelihood trace
par(mfrow = c(1,1), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
plot(ll, type = "p", pch = ".", col = "blue", 
     main = "Log-Likelihood Trace Plot",
     xlab = "Iteration", ylab = "Log-Likelihood")


Inference on the number of clusters

We estimate the posterior distribution of the number of clusters by computing \(K^{(b)} = \max \boldsymbol{\xi}^{(b)}\) (equivalently, the number of distinct labels) at each iteration. The relative frequencies of these values approximate \(p(K \mid \mathbf{y})\), which we visualize below.

# Compute the number of clusters K at each iteration
K_values <- apply(samples$xi, 1, function(x) length(unique(x)))

# Compute the posterior distribution of K
K_table <- table(K_values) / length(K_values)

par(mfrow = c(1,1), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
plot(as.numeric(names(K_table)), K_table, type = "h", lwd = 3, col = "blue",
     main = "Posterior distribution of number of clusters",
     xlab = "Number of clusters", ylab = "Density",
     yaxt = "n")  # Suppress default y-axis

axis(2, at = pretty(K_table), labels = pretty(K_table))


Inference on the density function

In a DPMM, the number of clusters \(K\) varies across posterior samples, so the density estimate is obtained by averaging over the draws: \[ \hat{g}(x) = \frac{1}{B} \sum_{b=1}^B \sum_{k=1}^{K^{(b)}} \frac{n_k^{(b)}}{n} \, \textsf{N}(x \mid \vartheta_k^{(b)}, \phi^{(b)}). \] This estimator combines posterior samples to estimate the population density function while incorporating uncertainty in both the number of clusters and the parameter values.

posterior_density_estimate <- function(samples, x_seq) {
  B <- length(samples$theta)
  n <- ncol(samples$xi)
  M <- length(x_seq)
  
  # Store density estimates
  FE <- matrix(NA, nrow = B, ncol = M)
  
  for (b in 1:B) {
    xi_b <- samples$xi[b, ]
    theta_b <- samples$theta[[b]]
    phi_b <- samples$phi[b]
    
    # Unique cluster IDs
    cluster_labels <- sort(unique(xi_b))
    cluster_counts <- as.numeric(table(xi_b)[as.character(cluster_labels)])
    
    # Cluster probabilities
    weight_b <- cluster_counts / sum(cluster_counts)
    
    for (i in 1:M) {
      FE[b, i] <- sum(weight_b * dnorm(x_seq[i], mean = theta_b, sd = sqrt(phi_b)))
    }
  }
  
  f_hat <- colMeans(FE)  # Posterior mean density
  f_inf <- apply(FE, 2, quantile, probs = 0.025)  # 2.5%  credible interval
  f_sup <- apply(FE, 2, quantile, probs = 0.975)  # 97.5% credible interval
  
  return(list(f_hat = f_hat, f_inf = f_inf, f_sup = f_sup))
}
# Define a sequence of x values for density estimation
x_seq <- seq(min(y) - 1, max(y) + 1, length.out = 250)

# Compute posterior density estimate and credible intervals
density_estimate <- posterior_density_estimate(samples, x_seq)

# Extract results
f_hat <- density_estimate$f_hat
f_inf <- density_estimate$f_inf
f_sup <- density_estimate$f_sup

# Plot the histogram
par(mfrow = c(1,1), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
hist(y, breaks = 30, freq = FALSE, col = "lightgray", border = "lightgray",
     main = "Posterior predictive density estimate",
     xlab = "x", ylab = "Density", ylim = c(0, max(f_sup)))

# Add the 95% credible interval as a shaded region
polygon(c(x_seq, rev(x_seq)), c(f_inf, rev(f_sup)), col = rgb(0, 0, 1, 0.2), border = NA)

# Overlay the posterior density estimate as a blue line
lines(x_seq, f_hat, col = "blue", lwd = 2)
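
The same posterior output can be used for other functionals of \(F\) mentioned in the introduction. As a minimal sketch (our own addition, using the same cluster-weight approximation as the density estimator above), the posterior of the mean functional \(\sum_{k} (n_k/n) \, \vartheta_k\) can be obtained as follows.

# Minimal sketch: posterior draws of the mean functional of F, using the same
# cluster-weight approximation as the density estimator above
mean_functional <- sapply(1:nrow(samples$xi), function(b) {
  xi_b    <- samples$xi[b, ]
  theta_b <- samples$theta[[b]]
  w_b     <- as.numeric(table(factor(xi_b, levels = seq_along(theta_b))))/length(xi_b)
  sum(w_b*theta_b)
})
quantile(mean_functional, probs = c(0.025, 0.5, 0.975))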


Inference on the cluster assignments

The co-clustering matrix \(\mathbf{A} = [a_{i,j}]\) is an \(n \times n\) symmetric matrix that represents the posterior probability that observations \(i\) and \(j\) belong to the same cluster. It is defined as: \[ a_{i,j} = \textsf{Pr}(\xi_i = \xi_j \mid \boldsymbol{y}) \approx \frac{1}{B} \sum_{b=1}^{B} I\left\{ \xi_i^{(b)} = \xi_j^{(b)} \right\}, \] where \(I(A)\) is the indicator function that equals 1 if \(A\) is true and 0 otherwise.

Since cluster assignments are identical in both directions, \(\mathbf{A}\) is symmetric: \[ \textsf{Pr}(\xi_i = \xi_j \mid \boldsymbol{y}) = \textsf{Pr}(\xi_j = \xi_i \mid \boldsymbol{y}). \] Furthermore, the diagonal elements satisfy \(a_{i,i} = 1\) because each observation is always in the same cluster as itself.

# Compute co-clustering matrix A
compute_co_clustering_matrix <- function(samples) {
  n <- ncol(samples$xi)
  B <- nrow(samples$xi)
  A <- matrix(0, nrow = n, ncol = n)
  
  for (b in 1:B) {
    xi_b <- samples$xi[b, ]
    for (k in unique(xi_b)) {
      cluster_members <- which(xi_b == k)
      A[cluster_members, cluster_members] <- A[cluster_members, cluster_members] + 1
    }
  }
  
  A <- A / B
  diag(A) <- 1
  
  return(A)
}
# Compute and reorder A based on true partition
A <- compute_co_clustering_matrix(samples)

# Sort observations by true partition
A <- A[order(components), order(components)]

# Visualize co-clustering matrix
par(mar = c(2.75, 2.75, 0.5, 0.5), mgp = c(1.7, 0.7, 0))
colorscale <- c("white", rev(heat.colors(100)))
corrplot::corrplot(A, is.corr = FALSE, method = "color", col = colorscale,
                   tl.pos = "n", addgrid.col = NA)


Inference on the cluster means

In mixture models, the issue of label switching arises because group labels are non-identifiable, meaning that permuting the labels does not change the likelihood of the model.

This problem causes parameter samples from the MCMC chain to be unordered, leading to incorrect inferences. Without proper correction, posterior summaries such as group means or variances become meaningless, as they average across different groups due to random label permutations.

To address this, we align the samples in the permuted label space: conditioning on a fixed number of clusters, we compute a reference vector of cluster means and then, for each posterior sample, apply the label permutation that brings its cluster means closest to this reference. This correction enables valid and reliable inference on cluster-specific parameters, removing the ambiguity caused by label switching.

# Posterior distribution of K (number of clusters)
K_max  <- 25   # upper bound used only to tabulate the posterior counts of K
K_post <- apply(samples$xi, 1, function(x) length(unique(x)))
K_tab  <- table(factor(K_post, levels = 1:K_max))
suppressMessages(suppressWarnings(library(gtools)))

B <- nrow(samples$xi)

# Define the target number of clusters
K <- 3

# Compute posterior mean of theta for iterations where K = 3
theta_pos <- 0
for (b in 1:B) {
  if (length(unique(samples$xi[b, ])) == K) {
    theta_pos <- theta_pos + samples$theta[[b]][sort(unique(samples$xi[b, ]))] / K_tab[K]
  }
}

# Generate all possible label permutations
permu <- permutations(n = K, r = K)

# Average over permuted spaces for all samples where K = 3
theta_corrected <- NULL
for (b in 1:B) {
  if (length(unique(samples$xi[b, ])) == K) {
    theta_current <- samples$theta[[b]][sort(unique(samples$xi[b, ]))]

    # Compute distances for all permutations
    dist <- apply(permu, 1, function(p) sum((theta_current[p] - theta_pos)^2))

    # Select the best permutation
    best_permu <- permu[which.min(dist), ]
    theta_corrected <- rbind(theta_corrected, theta_current[best_permu])
  }
}
# Compute posterior mean and credible intervals
tab <- rbind(colMeans(theta_corrected),
             apply(theta_corrected, 2, quantile, probs = c(0.025, 0.975)))
colnames(tab) <- paste("Cluster", 1:K)
rownames(tab) <- c("Mean", "2.5%", "97.5%")

round(t(tab), 3)
##             Mean   2.5%  97.5%
## Cluster 1 -5.033 -5.338 -4.744
## Cluster 2  0.006 -0.209  0.208
## Cluster 3  4.892  4.602  5.185