A mixture model is a statistical model that assumes data arises from a combination of several underlying probability distributions, each representing a distinct subgroup within the data. Each data point is assumed to be generated by one of these components, with each component following its own probability distribution. In this way, mixture models provide a flexible way to model data from heterogeneous sources.
A Dirichlet process mixture model (DPMM) is a type of mixture model in which the mixing distribution is unknown and assigned a Dirichlet Process (DP) prior. This approach differs from a mixture of DPs, where the parameters of the DP are treated as random.
The idea is to incorporate the DP within a hierarchical Bayesian model to impose a prior on the unknown distribution of certain parameters, resulting in semiparametric Bayesian models.
Rather than managing the large number of parameters required by finite mixture models with many components, it is often more practical to adopt an infinite-dimensional specification, assuming a random mixing distribution that is not restricted to a specific parametric family.
A finite mixture model can be reformulated in different ways. For instance, in the case of \(K=2\) components, we can introduce auxiliary random variables \(L_1, \ldots, L_n\) such that \(L_i = 1\) if \(y_i\) originates from the \(\textsf{N}(\mu_1, \sigma_1^2)\) component, and \(L_i = 2\) if \(y_i\) is drawn from the \(\textsf{N}(\mu_2, \sigma_2^2)\) component.
The model can then be expressed as follows: \[ \begin{align} y_i \mid L_i, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2 &\sim \textsf{N}(y_i \mid \mu_{L_i}, \sigma_{L_i}^2) \\ \Pr(L_i = 1 \mid w) & = w \\ \Pr(L_i = 2 \mid w) & = 1 - w \\ (w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) &\sim p(w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) \end{align} \] If we marginalize over \(L_i\) for \(i = 1, \ldots, n\), we recover the original mixture formulation. Indeed, to marginalize over \(L_i\), we compute the probability distribution of \(y_i\) by summing over the possible values of \(L_i\): \[ p(y_i \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, w) = \sum_{L_i=1}^{2} p(y_i \mid L_i, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) \, p(L_i \mid w)\,. \] Since there are only two possible values for \(L_i\), this becomes: \[ p(y_i \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, w) = p(y_i \mid L_i = 1, \mu_1, \sigma_1^2) \cdot w + p(y_i \mid L_i = 2, \mu_2, \sigma_2^2) \cdot (1 - w)\,. \] Now, substituting the conditional distributions of \(y_i\) given \(L_i\), we get: \[ p(y_i \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, w) = w \cdot \textsf{N}(y_i \mid \mu_1, \sigma_1^2) + (1 - w) \cdot \textsf{N}(y_i \mid \mu_2, \sigma_2^2)\,. \]
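To make the latent-label formulation concrete, here is a minimal Python sketch that simulates from the two-component mixture by first drawing the labels \(L_i\) and then \(y_i \mid L_i\), and evaluates the marginal density derived above. The specific values of \(w\), \(\mu_1, \mu_2\), \(\sigma_1, \sigma_2\) and the helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (assumed) parameter values for the two-component mixture
w = 0.4                       # Pr(L_i = 1)
mu = np.array([0.0, 3.0])     # component means (mu_1, mu_2)
sigma = np.array([1.0, 0.5])  # component standard deviations (sigma_1, sigma_2)
n = 10_000

# Step 1: draw latent labels L_i in {1, 2}; Step 2: draw y_i | L_i ~ N(mu_{L_i}, sigma_{L_i}^2)
L = rng.choice([1, 2], size=n, p=[w, 1 - w])
y = rng.normal(loc=mu[L - 1], scale=sigma[L - 1])

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mixture_density(x):
    # Marginalizing over L_i gives w * N(x | mu_1, sigma_1^2) + (1 - w) * N(x | mu_2, sigma_2^2)
    return w * normal_pdf(x, mu[0], sigma[0]) + (1 - w) * normal_pdf(x, mu[1], sigma[1])

print(mixture_density(np.array([0.0, 1.5, 3.0])))
```

A histogram of `y` closely matches `mixture_density`, illustrating that marginalizing over the labels recovers the two-component mixture density derived above.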
We can also express the mixture model as follows: \[ w \cdot \textsf{N}(y_i \mid \mu_1, \sigma_1^2) + (1 - w) \cdot \textsf{N}(y_i \mid \mu_2, \sigma_2^2) = \int \textsf{N}(y_i \mid \mu, \sigma^2) \, \textsf{d}G(\mu, \sigma^2), \] where \[ G(\cdot,\cdot) = w \delta_{(\mu_1, \sigma_1^2)}(\cdot,\cdot) + (1 - w) \delta_{(\mu_2, \sigma_2^2)}(\cdot,\cdot). \] Here \(G\) represents a mixing distribution over the parameters \((\mu, \sigma^2)\), \(\delta_{(\mu_1, \sigma_1^2)}\) and \(\delta_{(\mu_2, \sigma_2^2)}\) are Dirac delta functions centered at \((\mu_1, \sigma_1^2)\) and \((\mu_2, \sigma_2^2)\), respectively, and the weights \(w\) and \(1 - w\) in \(G\) specify the probability of each component.
The previous equality follows directly from the property of the Dirac delta function \(\delta_{a}\), which states that for any function \(f(x)\): \[ \int f(x) \, \delta_{a}(x) \, \textsf{d}x = f(a). \]
A similar approach can be applied to a general \(K\)-component mixture model. In this context, the mixing distribution \(G\) is discrete (and random); a natural alternative is to assign a Dirichlet Process (DP) prior to \(G\), resulting in a DPMM.
By employing a countable mixture model instead of a finite one, we gain both theoretical and practical benefits. Theoretically, a countable mixture provides full support, meaning it can approximate essentially any distribution arbitrarily well. Practically, this approach allows the number of mixture components to be inferred from the data, as the DP prior supports a countable, potentially infinite, number of components. This flexibility enables the model to adaptively estimate the appropriate number of mixture components based on the observed data.
In a Dirichlet Process Mixture Model (DPMM), the Dirichlet Process (DP) is used as a prior for the mixing distribution \(G\), allowing for an unknown and potentially infinite number of mixture components. This approach is especially useful when dealing with data that has complex structure or when the number of underlying subgroups (clusters) is unknown. The model is formulated as: \[ F(\cdot \mid G) = \int K(\cdot \mid \theta) \, \textsf{d}G(\theta), \]
where \(G \sim DP(\alpha, G_0)\) signifies that \(G\) is drawn from a DP with concentration parameter \(\alpha\) and base measure \(G_0\), and \(K(\cdot \mid \theta)\) is a parametric distribution characterized by parameters \(\theta\). Each \(\theta\) defines the parameters for one component in the mixture.
The DP prior makes \(G\) a discrete distribution, potentially consisting of infinitely many “atoms” or components, each with a unique parameter \(\theta\). This setup enables the DPMM to adaptively adjust the number of components based on the observed data.
In a DPM model, we place a DP prior on \(G\), the distribution of the parameters \(\theta\). Data \(y_i\) are then generated by first sampling a parameter \(\theta_i\) from \(G\), followed by generating \(y_i\) from \(K(\cdot \mid \theta_i)\). Formally, \(\theta_i \mid G \overset{\text{iid}}{\sim} G\) and \(y_i \mid \theta_i \overset{\text{ind}}{\sim} K(\cdot \mid \theta_i)\), for \(i = 1, \ldots, n\).
Because \(G\) is discrete (although with potentially infinitely many atoms), many of the \(\theta_i\) values will coincide, leading to natural groupings or clusters among the \(y_i\) values.
The corresponding mixture density or probability mass function for the DPM model is: \[ f(\cdot \mid G) = \int k(\cdot \mid \theta) \, \textsf{d}G(\theta), \]
where \(k(\cdot \mid \theta)\) represents the density (or probability mass function) of \(K(\cdot \mid \theta)\). Because \(G\) is random, the mixture density \(f(\cdot \mid G)\) and cumulative distribution \(F(\cdot \mid G)\) are also random, giving rise to a Bayesian nonparametric model that can flexibly accommodate complex data structures.
In the context of DPMM, the (almost sure) discreteness of realizations \(G\) from the \(DP(\alpha, G_0)\) prior is advantageous. This discreteness leads to ties in the mixing parameters, making DP mixture models particularly useful for applications such as density estimation, clustering, and regression.
Using the constructive definition of the DP, a random distribution \(G\) drawn from a \(DP(\alpha, G_0)\) can be represented as \[ G = \sum_{\ell=1}^{\infty} \omega_{\ell} \delta_{\vartheta_{\ell}}, \] where \(\delta_{a}\) is a Dirac delta function centered at \(a\). This construction, known as the stick-breaking process, offers an intuitive way to understand \(G\) as a countable mixture of point masses, each located at parameters \(\vartheta_{\ell}\) with weights \(\omega_{\ell}\).
Thus, the probability model \(f(\cdot \mid G)\) can be expressed as a countable mixture of parametric densities: \[ f(\cdot \mid G) = \sum_{\ell=1}^{\infty} \omega_{\ell} k(\cdot \mid \vartheta_{\ell}), \] where \(k(\cdot \mid \vartheta_{\ell})\) is a parametric density function determined by the parameter \(\vartheta_{\ell}\), and \(\omega_{\ell}\) are weights generated by the stick-breaking process. The first weight is given by \(\omega_1 = z_1\), while subsequent weights are defined by \(\omega_{\ell} = z_{\ell} \prod_{r=1}^{\ell-1} (1 - z_r)\), where \(z_r \overset{\text{iid}}{\sim} \textsf{Beta}(1, \alpha)\), for \(r=1,2,\ldots\). The stick-breaking construction ensures that the weights \(\omega_1,\omega_2,\ldots\) sum to 1 (almost surely), forming a valid probability distribution across the infinite components. The locations \(\vartheta_{\ell}\) are drawn independently from the base measure \(G_0\), which encodes the prior knowledge about the component parameters. The sequences \(\{z_r : r = 1, 2, \dots\}\) and \(\{\vartheta_{\ell} : \ell = 1, 2, \dots\}\) are independent of each other, ensuring that weight generation and location selection are separate.
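The stick-breaking construction is straightforward to simulate by truncating the infinite sum at a large level \(L\). The sketch below is illustrative only: the truncation level, the choice \(G_0 = \textsf{N}(0, 2^2)\), the kernel \(k(\cdot \mid \vartheta) = \textsf{N}(\cdot \mid \vartheta, 1)\), and the function name `stick_breaking` are assumptions, not part of the formal definition.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, g0_sampler, L=200):
    """Truncated stick-breaking approximation of G ~ DP(alpha, G0).

    Returns weights omega_1,...,omega_L and atoms vartheta_1,...,vartheta_L.
    L is a truncation level (assumed large enough that the leftover mass is negligible).
    """
    z = rng.beta(1.0, alpha, size=L)                               # z_r ~ Beta(1, alpha), iid
    omega = z * np.concatenate(([1.0], np.cumprod(1.0 - z)[:-1]))  # omega_l = z_l * prod_{r<l}(1 - z_r)
    atoms = g0_sampler(L)                                          # vartheta_l ~ G0, iid
    return omega, atoms

# Assumed example: G0 = N(0, 2^2), kernel k(y | vartheta) = N(y | vartheta, 1)
alpha = 2.0
omega, atoms = stick_breaking(alpha, lambda size: rng.normal(0.0, 2.0, size))

# Evaluate the (truncated) mixture density f(y | G) = sum_l omega_l N(y | vartheta_l, 1)
y_grid = np.linspace(-6, 6, 5)
dens = (omega[:, None] *
        np.exp(-0.5 * (y_grid[None, :] - atoms[:, None]) ** 2) / np.sqrt(2 * np.pi)).sum(axis=0)
print(dens)
```

Because the \(z_r\) are \(\textsf{Beta}(1, \alpha)\), the leftover mass after \(L\) sticks has expectation \((\alpha/(\alpha+1))^L\), which is negligible for large \(L\); this is what justifies the truncation.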
In the context of a DPMM, this stick-breaking representation of \(G\) enables the prior probability model \(f(\cdot \mid G)\) to be a potentially infinite mixture of parametric densities, making the DPMM highly flexible. Because \(G\) is almost surely discrete, many of the sampled values for \(\theta\) (associated with observed data points \(y_i\)) will coincide, naturally creating clusters in the data. This discreteness makes DPMM particularly useful for applications like density estimation and clustering, where the number of components or clusters is unknown and determined by the data itself.
DPMMs are highly adaptable, modeling both discrete and continuous distributions.
For discrete data, the component \(K(\cdot | \theta)\) can be Poisson or binomial, effectively clustering count-based observations.
For continuous data, \(K(\cdot | \theta)\) may be normal, gamma, or uniform in univariate cases and multivariate normal for higher dimensions, capturing diverse data structures.
Typically, semiparametric DPMMs are specified as follows: \[ y_i \mid G, \phi \overset{\text{iid}}{\sim} f(\cdot \mid G, \phi) = \int k(\cdot \mid \theta,\phi) \, \textsf{d}G(\theta), \quad i = 1, \ldots, n, \] where \(G \sim DP(\alpha, G_0)\) and a parametric prior \(p(\phi)\) is placed on \(\phi\). Additionally, hyperpriors may be assigned to \(\alpha\) or to the parameters \(\psi\) of \(G_0 = G_0(\cdot \mid \psi)\).
In a hierarchical formulation for DPMMs, we introduce latent mixing parameters \(\theta_i\) for each observation \(y_i\), structured as follows:
Observation Model: Each \(y_i\) is conditionally independent given \(\theta_i\) and a global parameter \(\phi\): \[ y_i \mid \theta_i, \phi \sim k(y_i \mid \theta_i, \phi), \quad i = 1, \ldots, n. \]
Latent Mixing Parameters: The \(\theta_i\) values are i.i.d. from a random distribution \(G\): \[ \theta_i \mid G \sim G, \quad i = 1, \ldots, n. \]
Dirichlet Process Prior: The distribution \(G\) is drawn from a Dirichlet Process with concentration parameter \(\alpha\) and base measure \(G_0\): \[ G \mid \alpha, \psi \sim \textsf{DP}(\alpha, G_0(\cdot \mid \psi)). \]
Priors on Model Parameters: Independent priors are placed on \(\phi\), \(\alpha\), and \(\psi\): \[ \phi, \alpha, \psi \sim p(\phi) \, p(\alpha) \, p(\psi). \]
This hierarchical setup allows multiple observations to share the same value of \(\theta\), naturally inducing clusters among the \(y_i\). It also provides flexibility, as \(G\) can adapt the number of clusters to the complexity of the data.
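The following sketch simulates this hierarchy under assumed choices: a normal kernel \(k(y \mid \theta, \phi) = \textsf{N}(y \mid \theta, \phi)\) with \(\phi\) the variance, \(G_0 = \textsf{N}(0, 3^2)\), fixed values for \(\alpha\) and \(\phi\) (rather than priors), and a truncated stick-breaking approximation of \(G\). All of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed choices: fixed alpha and phi (a full analysis would place priors p(alpha), p(phi))
alpha, phi = 1.0, 0.5 ** 2
n, L = 50, 500  # sample size and stick-breaking truncation level

# G | alpha, psi ~ DP(alpha, G0) via truncated stick-breaking, with G0 = N(0, 3^2)
z = rng.beta(1.0, alpha, size=L)
omega = z * np.concatenate(([1.0], np.cumprod(1.0 - z)[:-1]))
omega /= omega.sum()                       # renormalize the truncated weights
atoms = rng.normal(0.0, 3.0, size=L)       # vartheta_l ~ G0

# theta_i | G ~ G (iid): pick atoms with probabilities omega
idx = rng.choice(L, size=n, p=omega)
theta = atoms[idx]

# y_i | theta_i, phi ~ k(y_i | theta_i, phi) = N(theta_i, phi)
y = rng.normal(theta, np.sqrt(phi))

print("number of distinct theta values (clusters):", np.unique(theta).size)
```

Because several entries of `idx` repeat, the sampled `theta` contains ties, which is exactly the clustering behavior described above.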
As \(\alpha \to 0^+\), the DP mixture model’s tendency to create new clusters diminishes, resulting in all observations being drawn from a single component. This limiting behavior simplifies the DP mixture to a non-mixture, single-component model, where all observations share the same latent parameter \(\theta\). Thus, when \(\alpha \to 0^+\), the model reduces to: \[ \begin{align} y_i \mid \theta, \phi &\overset{\text{iid}}{\sim} k(y_i \mid \theta, \phi), \quad i = 1, \ldots, n \\ \theta \mid \psi &\sim G_0(\cdot \mid \psi) \\ \phi, \psi &\sim p(\phi)\,p(\psi) \end{align} \]
As \(\alpha \to \infty\), the probability of starting a new cluster approaches 1 for each observation. Consequently, almost every observation forms its own cluster. In this case, the DPMM approximates a fully non-clustered model where each \(y_i\) is generated independently from its own unique parameter value \(\theta_i\), drawn independently from the base measure \(G_0\). Thus, when \(\alpha \to \infty\), the model reduces to: \[ \begin{align} y_i \mid \theta_i, \phi &\overset{\text{ind}}{\sim} k(y_i \mid \theta_i, \phi), \quad i = 1, \ldots, n \\ \theta_i \mid \psi &\overset{\text{iid}}{\sim} G_0(\cdot \mid \psi) \\ \phi, \psi &\sim p(\phi)\,p(\psi) \end{align} \]
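The two limiting regimes can be seen directly in the stick-breaking weights of \(G\): for very small \(\alpha\) essentially all mass falls on a single atom, while for large \(\alpha\) the mass spreads over many atoms. The truncation level and the specific \(\alpha\) values in this sketch are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def stick_weights(alpha, L=5000):
    # Truncated stick-breaking weights: omega_l = z_l * prod_{r<l} (1 - z_r)
    z = rng.beta(1.0, alpha, size=L)
    return z * np.concatenate(([1.0], np.cumprod(1.0 - z)[:-1]))

for alpha in (0.01, 1.0, 100.0):
    w = stick_weights(alpha)
    n_99 = np.searchsorted(np.cumsum(w), 0.99) + 1  # atoms needed for 99% of the mass
    print(f"alpha = {alpha:6.2f}: largest weight = {w.max():.4f}, atoms for 99% mass = {n_99}")
```

With \(\alpha = 0.01\) a single atom typically carries almost all of the mass (the single-component limit), whereas with \(\alpha = 100\) hundreds of atoms are needed, so nearly every observation would draw its own parameter (the \(\alpha \to \infty\) limit).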
The countable sum representation of the DPMM links DP mixture models with limits of finite mixture models in which the weights are assigned a symmetric Dirichlet prior (e.g., Ishwaran and Zarepour, 2000).
Consider a finite mixture model with \(J\) components, given by \[ \sum_{j=1}^{J} q_j \, k(y \mid \vartheta_j), \] where \((q_1, \ldots, q_J) \sim \textsf{Dir}(\alpha/J, \ldots, \alpha/J)\) and \(\vartheta_j \overset{\text{iid}}{\sim} G_0\), for \(j = 1, \ldots, J\).
As the number of components \(J \to \infty\), this finite mixture model approximates a DPMM with kernel \(k\) and a DP prior \(DP(\alpha, G_0)\) for the mixing distribution.
As \(J \to \infty\), the discrete distribution \(\sum_{j=1}^{J} q_j \delta_{\vartheta_j}\) converges weakly to \[ \sum_{\ell=1}^{\infty} \omega_\ell \delta_{\vartheta_\ell} \sim \textsf{DP}(\alpha, G_0). \] This convergence indicates that as \(J\) becomes very large, the finite mixture with \(J\) components behaves like an infinite DP mixture, where the weights \(\omega_\ell\) and component parameters \(\vartheta_\ell\) follow the DP’s stochastic structure.
This result shows that a finite mixture model with a large number of components can approximate the flexibility of a DP mixture. It also reveals that DP mixtures can be viewed as the infinite-component limit of finite mixtures, making them an effective tool for modeling distributions with an unknown number of clusters or latent subgroups.
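A quick way to see this connection numerically is to allocate \(n\) observations to components using weights drawn from \(\textsf{Dir}(\alpha/J, \ldots, \alpha/J)\), and compare the average number of occupied components with the DP prior expectation \(\textsf{E}(n^* \mid \alpha)\) given further below. The values of \(\alpha\), \(n\), \(J\), and the number of replications are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed settings for illustration
alpha, n = 2.0, 200

def expected_clusters_dp(alpha, n):
    # E(n* | alpha) = sum_{i=1}^n alpha / (alpha + i - 1)
    i = np.arange(1, n + 1)
    return np.sum(alpha / (alpha + i - 1))

def occupied_components_finite(J, alpha, n, reps=500):
    # Average number of occupied components when n items are allocated
    # with weights (q_1,...,q_J) ~ Dir(alpha/J, ..., alpha/J)
    counts = []
    for _ in range(reps):
        q = rng.dirichlet(np.full(J, alpha / J))
        labels = rng.choice(J, size=n, p=q)
        counts.append(np.unique(labels).size)
    return np.mean(counts)

print("DP prior expectation of n*:", expected_clusters_dp(alpha, n))
for J in (5, 20, 100, 500):
    print(f"J = {J:4d}: average occupied components = {occupied_components_finite(J, alpha, n):.2f}")
```

As \(J\) grows, the average number of occupied components stabilizes near the DP value, consistent with the weak convergence result.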
By taking the expectation over \(G\) with respect to its DP prior \(\textsf{DP}(\alpha, G_0)\), we obtain: \[ \textsf{E}[F(x \mid G, \phi)] = F(x \mid G_0, \phi) \quad\text{and}\quad \textsf{E}[f(x \mid G, \phi)] = f(x \mid G_0, \phi). \] These expressions are useful for specifying the prior distribution of the parameters \(\psi\) in the base measure \(G_0(x \mid \psi)\).
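These identities follow from the DP property \(\textsf{E}[G] = G_0\) together with exchanging expectation and integration: \[ \textsf{E}[f(x \mid G, \phi)] = \textsf{E}\left[\int k(x \mid \theta, \phi)\,\textsf{d}G(\theta)\right] = \int k(x \mid \theta, \phi)\,\textsf{d}G_0(\theta) = f(x \mid G_0, \phi), \] and the same argument applied to the kernel c.d.f. \(K(\cdot \mid \theta, \phi)\) gives the expression for \(F\).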
Additionally, the concentration parameter \(\alpha\) controls the degree to which a realization \(G\) approximates \(G_0\) and influences the level of discreteness in \(G\). Consequently, in a DPMM, \(\alpha\) determines the prior distribution of the number of unique elements \(n^*\) within the vector \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)\). If \(\alpha\) is large, there is a higher probability of drawing new values from \(G_0\), leading to more unique clusters. If \(\alpha\) is small, new values are drawn from \(G_0\) less often, so observations tend to be assigned to existing clusters, producing fewer, larger clusters.
Now, consider the joint prior distribution for \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)\) that arises from \(\theta_i \mid G \stackrel{\text{iid}}{\sim} G\), with \(G \mid \alpha, \psi \sim \textsf{DP}(\alpha, G_0(\cdot \mid \psi))\), after integrating \(G\) over its DP prior. Using the Pólya urn representation of the DP, the joint prior distribution is given by: \[ p(\boldsymbol{\theta} \mid \alpha, \psi) = G_0(\theta_1 \mid \psi) \prod_{i=2}^{n} \left( \frac{\alpha}{\alpha + i - 1} G_0(\theta_i \mid \psi) + \frac{1}{\alpha + i - 1} \sum_{j=1}^{i-1} \delta_{\theta_j}(\theta_i) \right), \] where \(\delta_{\theta_j}\) denotes a point mass at \(\theta_j\), so that \(\theta_i\) coincides with each previously drawn value \(\theta_j\) with probability \(1/(\alpha + i - 1)\).
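The Pólya urn scheme also gives a direct way to simulate \(\boldsymbol{\theta}\) with \(G\) integrated out. In this sketch the base measure \(G_0 = \textsf{N}(0,1)\), the sample size, \(\alpha\), and the function name `polya_urn_theta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)

def polya_urn_theta(n, alpha, g0_sampler):
    """Draw theta_1,...,theta_n from the Polya urn (G marginalized out).

    theta_1 ~ G0; for i >= 2, theta_i is a fresh draw from G0 with probability
    alpha / (alpha + i - 1), otherwise a uniformly chosen previous value.
    """
    theta = np.empty(n)
    theta[0] = g0_sampler()
    for i in range(1, n):                        # index i corresponds to draw i + 1
        if rng.uniform() < alpha / (alpha + i):  # prob. alpha / (alpha + (i + 1) - 1)
            theta[i] = g0_sampler()              # new value from the base measure
        else:
            theta[i] = theta[rng.integers(i)]    # copy one of the previous values
    return theta

# Assumed example: G0 = N(0, 1), n = 100, alpha = 1
theta = polya_urn_theta(100, 1.0, lambda: rng.normal())
print("distinct values n*:", np.unique(theta).size)
```

Running this with a very small \(\alpha\) typically yields a single distinct value, while a very large \(\alpha\) yields nearly \(n\) distinct values, matching the limiting cases discussed earlier.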
The prior moments of \(n^*\) can guide the choice of \(\alpha\), or of the prior on \(\alpha\), since \[ \textsf{E}(n^* \mid \alpha) = \sum_{i=1}^n \frac{\alpha}{\alpha + i - 1} \quad\text{and}\quad \textsf{Var}(n^* \mid \alpha) = \sum_{i=1}^n \frac{\alpha(i - 1)}{(\alpha + i - 1)^2}. \] For fixed \(\alpha\), \(\textsf{E}(n^* \mid \alpha)\) grows at a logarithmic rate in \(n\), so \(\textsf{E}(n^* \mid \alpha) \to \infty\) as \(n \to \infty\).
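A short helper (the name `nstar_prior_moments` and the grids of \(\alpha\) and \(n\) are assumptions) evaluates these moments and illustrates the logarithmic growth of \(\textsf{E}(n^* \mid \alpha)\) in \(n\):

```python
import numpy as np

def nstar_prior_moments(alpha, n):
    """Prior mean and variance of the number of distinct values n* under DP(alpha, G0)."""
    i = np.arange(1, n + 1)
    mean = np.sum(alpha / (alpha + i - 1))
    var = np.sum(alpha * (i - 1) / (alpha + i - 1) ** 2)
    return mean, var

# Assumed illustration: growth of E(n* | alpha) with n for a few values of alpha
for alpha in (0.5, 1.0, 5.0):
    for n in (10, 100, 1000):
        m, v = nstar_prior_moments(alpha, n)
        print(f"alpha = {alpha:3.1f}, n = {n:5d}: E(n*) = {m:6.2f}, Var(n*) = {v:6.2f}")
```

Matching \(\textsf{E}(n^* \mid \alpha)\) to a prior guess of the number of clusters at the actual sample size \(n\) is one way to fix \(\alpha\) or to center a prior \(p(\alpha)\).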