NCM Workshop on Introduction to Statistical Learning Techniques
Bayesian Clustering
The Bayesian approach to clustering is mainly model-based.
It is assumed that the data points are conditionally iid with density
\[ f(y \mid P) = \int K(y \mid \theta) d P(\theta) \]
where \(K(\cdot \mid \theta)\) is a parametric density (the kernel) with parameter \(\theta \in \Theta\), and \(P\) is a probability measure on \(\Theta\).
Inducing clusters using suitable prior
The possible clusters in the data can be interpreted as the effect of an underlying mixture of sub-populations. Suppose the number of clusters is known to be \(K\). Then one can set a prior on \(P\) as follows:
\(P = \sum_{j=1}^{K} w_{j} \delta_{\theta_{j}}\) implying that the conditional distribution of \(y\) is \(f(y \mid P) = \sum_{j=1}^{K}w_{j} K(y \mid \theta_{j})\).
The above construction induces a random finite partition of the data points.
Data points belong to the same cluster if they are generated from the same mixture component.
Here \(\{\theta_{1}, \ldots, \theta_{K}\}\) are the parameters (or, parameter vectors) characterizing the densities of the \(K\) clusters,
and \(\{w_{1}, \ldots, w_{K}\}\) are the cluster weights.
As \(\{w_{1}, \ldots, w_{K}\}\) and \(\{\theta_{1}, \ldots, \theta_{K}\}\) are the parameters of the mixture model, one assigns priors on them:
- As \(\{w_{1}, \ldots, w_{K}\}\) is a non-negative vector which sums to one, the prior on \(\{w_{1}, \ldots, w_{K}\}\) should be supported on the \((K-1)\)-dimensional unit simplex. Usually, a Dirichlet prior is used for this purpose.
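To make the construction concrete, the following minimal sketch (in Python with NumPy; the univariate normal kernel, the hyperparameter values, and the sample size are illustrative assumptions, not taken from the notes) draws the weights from a symmetric Dirichlet prior and then generates data from the induced finite mixture:
```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 3, 200                           # number of components and sample size (illustrative)
w = rng.dirichlet(np.ones(K))           # cluster weights on the (K-1)-dimensional simplex
theta = rng.normal(0.0, 5.0, size=K)    # component means, drawn from a spread-out prior

z = rng.choice(K, size=n, p=w)          # latent cluster labels: P(z_i = j) = w_j
y = rng.normal(theta[z], 1.0)           # y_i ~ K(. | theta_{z_i}), here a N(theta_{z_i}, 1) kernel
```
Points generated with the same label \(z_{i}\) form one cluster, which is exactly the random partition described above.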
Example: Consider a data set whose plot (not reproduced here) makes it clear that there are at least three clusters.
Suppose a location-scale mixture of \(3\) normal distributions is used to model the data. Then the prior on \(P\) is \(\sum_{j=1}^{3} w_{j} \delta_{\boldsymbol{\theta}_{j}}\), where \(\boldsymbol{\theta}_{j} = (\boldsymbol{\mu}_{j}, \sigma_{j}^{2})^{\top}\). The induced model on a data point \({\bf y}\) is
\[ f({\bf y} \mid P) = \sum_{j = 1}^{3} w_{j}\, \sigma_{j}^{-4}\, \phi_{4}\!\left( \frac{{\bf y} - \boldsymbol{\mu}_{j}}{\sigma_{j}} \right), \]
where \(\phi_{4}\) denotes the density of the standard \(4\)-dimensional normal distribution (the observations here are taken to be \(4\)-dimensional).
The next task is to assign suitable priors on the parameters:
\({\bf w} = (w_{1}, w_{2}, w_{3} ) \sim \text{Dirichlet}(1/3, 1/3, 1/3),\)
\(\boldsymbol{\mu}_{j} \mid \sigma_{j}^{2} \stackrel{ind}{\sim} N({\bf 0}, \sigma_{j}^{2} \tau I)\), \(j = 1, \ldots, 3\),
\(\sigma_{j}^{2} \stackrel{iid}{\sim} IG(a_{\sigma}, b_{\sigma})\), \(j = 1, \ldots, 3\).
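A minimal sketch of forward simulation from this prior specification (assuming \(4\)-dimensional observations as in the density above; the hyperparameter values \(\tau\), \(a_{\sigma}\), \(b_{\sigma}\) and the sample size are hypothetical choices for illustration):
```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)

K, d, n = 3, 4, 300                      # 3 components, 4-dimensional data, sample size (illustrative)
tau, a_sigma, b_sigma = 10.0, 2.0, 1.0   # hypothetical hyperparameter values

w = rng.dirichlet(np.full(K, 1.0 / 3))                    # w ~ Dirichlet(1/3, 1/3, 1/3)
sigma2 = invgamma.rvs(a_sigma, scale=b_sigma, size=K,
                      random_state=rng)                   # sigma_j^2 ~ IG(a_sigma, b_sigma), iid
mu = rng.normal(0.0, np.sqrt(tau * sigma2)[:, None],
                size=(K, d))                              # mu_j | sigma_j^2 ~ N(0, sigma_j^2 tau I)

z = rng.choice(K, size=n, p=w)                            # component labels
y = rng.normal(mu[z], np.sqrt(sigma2)[z, None])           # y_i ~ N(mu_{z_i}, sigma_{z_i}^2 I)
```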
Interpretation of the Mixture Model
The mixture distribution with mixing probabilities \({\bf w}\) might be thought of as a discrete prior distribution on the parameters \(\theta_{j}\), \(j = 1, \ldots, K\); however, it seems more appropriate to think of this prior, or mixing distribution, as a description of the variation in \(\theta\) across the population of interest.
In this respect the mixture model more closely resembles a hierarchical model. This resemblance is enhanced with the introduction of unobserved (or missing) indicator variables \(z_{i}\) where
\[z_{i} = c \qquad \text{if the $i$-th data point is drawn from the $c$-th mixture component}.\]
Given \({\bf w}\), the distribution of each indicator \(z_{i}\) is \(\text{Discrete}(w_{1}, \ldots, w_{K})\), i.e.,
\(P(z_{i} = c) = w_{c}\) for \(c = 1, \ldots, K\).
In this case the mixture parameters \({\bf w}\) are thought of as hyperparameters determining the distribution of \({\bf z}\).
The joint distribution of the observed data \({\bf y}\) and the unobserved indicators \({\bf z}\) conditional on the model parameters can be written as
\[ f({\bf y}, {\bf z} \mid \boldsymbol{\theta}, {\bf w}) =f({\bf z} \mid {\bf w} ) f({\bf y} \mid {\bf z}, \boldsymbol{\theta}) = \prod_{i = 1}^{n} \prod_{c = 1}^{K} \left\{ w_{c} f(y_{i} \mid \theta_{c}) \right\}^{I(z_{i} = c)} \]
with \(z_{i} = c\) for exactly one choice of \(c\).
Identifiability and label switching: Parameters in a model are not identified if the same likelihood function is obtained for more than one choice of the model parameters. All finite mixture models are nonidentifiable in this sense: the distribution is unchanged if the group labels are permuted, which gives rise to the well-known label-switching problem in posterior simulation.
The likelihood contribution of a single observation \(y_{i}\), obtained by marginalizing out \(z_{i}\), is \(\sum_{j=1}^{K} w_{j} f(y_{i} \mid \theta_{j})\); the full likelihood of \({\bf y}\) is the product of these terms over \(i\).
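Both the complete-data likelihood above and its marginalized version can be evaluated on the log scale. A minimal sketch for a univariate normal kernel (an illustrative choice; the arrays `y`, `z`, `w`, `mu`, `sigma` are assumed to be given):
```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def complete_data_loglik(y, z, w, mu, sigma):
    # log f(y, z | theta, w) = sum_i [ log w_{z_i} + log f(y_i | theta_{z_i}) ]
    return np.sum(np.log(w[z]) + norm.logpdf(y, loc=mu[z], scale=sigma[z]))

def marginal_loglik(y, w, mu, sigma):
    # log f(y | theta, w) = sum_i log sum_c w_c f(y_i | theta_c),
    # with the inner sum computed stably via log-sum-exp
    log_terms = np.log(w) + norm.logpdf(y[:, None], loc=mu, scale=sigma)  # shape (n, K)
    return np.sum(logsumexp(log_terms, axis=1))
```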
Posterior inference
The above priors provide a model-based framework for Bayesian cluster analysis, yielding a posterior distribution over a large space of partitions that expresses belief and uncertainty about the clustering given the data.
For mixture models, the Gibbs sampler alternates two major steps (a minimal sketch follows below):
- drawing the indicators from their full conditional distribution given the model parameters;
- drawing the model parameters from their full conditional distribution given the indicators.
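A minimal Gibbs sampler sketch for a univariate \(K\)-component normal mixture under conjugate priors of the type described earlier (the univariate setting, the default hyperparameter values, and the function name are assumptions made for illustration, not the notes' prescribed implementation):
```python
import numpy as np
from scipy.stats import invgamma

def gibbs_normal_mixture(y, K, n_iter=2000, alpha=1.0, tau=10.0,
                         a_sigma=2.0, b_sigma=1.0, seed=0):
    """Gibbs sampler for a univariate K-component normal mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Initialisation: random labels, data-driven starting values for the parameters.
    z = rng.integers(K, size=n)
    mu = rng.normal(y.mean(), y.std(), size=K)
    sigma2 = np.full(K, y.var())
    w = np.full(K, 1.0 / K)

    draws = {"w": [], "mu": [], "sigma2": [], "z": []}
    for _ in range(n_iter):
        # Step 1: draw the indicators given the model parameters.
        log_p = (np.log(w)
                 - 0.5 * np.log(2.0 * np.pi * sigma2)
                 - 0.5 * (y[:, None] - mu) ** 2 / sigma2)            # shape (n, K)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p[i]) for i in range(n)])

        # Step 2: draw the model parameters given the indicators.
        counts = np.bincount(z, minlength=K)
        w = rng.dirichlet(alpha + counts)                            # w | z ~ Dirichlet(alpha + n_j)
        for j in range(K):
            yj = y[z == j]
            # mu_j | sigma_j^2, z, y: conjugate update for the N(0, sigma_j^2 tau) prior
            post_var = sigma2[j] / (counts[j] + 1.0 / tau)
            post_mean = yj.sum() / (counts[j] + 1.0 / tau)
            mu[j] = rng.normal(post_mean, np.sqrt(post_var))
            # sigma_j^2 | mu_j, z, y: inverse-gamma update
            a_post = a_sigma + 0.5 * (counts[j] + 1)
            b_post = b_sigma + 0.5 * (np.sum((yj - mu[j]) ** 2) + mu[j] ** 2 / tau)
            sigma2[j] = invgamma.rvs(a_post, scale=b_post, random_state=rng)

        draws["w"].append(w.copy())
        draws["mu"].append(mu.copy())
        draws["sigma2"].append(sigma2.copy())
        draws["z"].append(z.copy())

    return {k: np.array(v) for k, v in draws.items()}
```
A call such as `draws = gibbs_normal_mixture(y, K=3)` would return the stored posterior draws; in practice an initial portion of the chain is discarded as burn-in.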
Once the MCMC has converged, the posterior draws of the weights \({\bf w}\) indicate the dominant cluster identities as well as the associated level of uncertainty, while the posterior draws of the component parameters provide insight into the distributions of the sub-populations.
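One common way to summarize this clustering uncertainty (an illustrative choice, not prescribed in the notes) is the posterior co-clustering matrix, i.e., the posterior probability that two data points are assigned to the same component, computed from the stored label draws (here the hypothetical `draws["z"]` array from the sketch above):
```python
import numpy as np

def coclustering_matrix(z_draws):
    # z_draws: (n_draws, n) array of label vectors from the MCMC output
    n_draws, n = z_draws.shape
    sim = np.zeros((n, n))
    for z in z_draws:
        sim += (z[:, None] == z[None, :])   # 1 where points i and k share a label
    return sim / n_draws                    # entry (i, k): posterior co-clustering probability
```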
How to choose \(K\)?
There are many ways in which \(K\) can be determined.
Mixture of finite mixtures (MFM): One may place a prior on the number of mixture components. Typically, a truncated Poisson prior is used, and posterior computation is carried out via reversible jump MCMC. However, since the dimension of the parameter space differs across choices of \(K\), this approach is quite difficult to implement.
Overfitted model: An easy alternative is to start with an overfitted model, i.e., with a large value of \(K\). Here \(K\) can be viewed as an upper bound on the number of mixture components, as some of these components may remain unoccupied. One may define \(K_{n}\) as the number of components actually occupied by the data once the MCMC has converged (see the sketch below). Ideally, \(K_{n}\) should not be sensitive to the choice of the upper bound \(K\).
Infinite mixtures (Bayesian nonparametrics): Alternatively, one may place a nonparametric prior on \(P\) (e.g., a Dirichlet process prior) that supports infinitely many components, so that the number of occupied clusters is inferred from the data.
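For the overfitted-model strategy, a quick way to monitor \(K_{n}\) is to count the occupied components in each stored label draw (again reusing the hypothetical `draws["z"]` array from the Gibbs sketch above):
```python
import numpy as np

def occupied_clusters(z_draws):
    # K_n for each posterior draw: the number of distinct occupied components
    return np.array([np.unique(z).size for z in z_draws])
```
The posterior distribution of these counts indicates how many clusters the data actually support.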