Dirichlet Process Clustering

J Dorsey

Roadmap

  • The Problem: How can we cluster data without specifying the number of clusters manually?
  • Proposed Solution: Fit a model with a potentially infinite number of clusters.
  • Bayesian fixed-cluster model
  • Dirichlet distribution
  • Chinese restaurant process
  • Dirichlet process
  • Gibbs sampling
  • Application
  • Extensions
  • Further Resources

The Problem

  • How can we cluster a dataset without specifying the number of clusters beforehand?
  • E.g. k-means requires a fixed number of clusters specified in advance
  • This is very inconvenient when we don't know how many clusters to expect
  • This talk presents a nonparametric Bayesian solution to this problem
  • But first, we will discuss a Bayesian approach for a fixed number of clusters.

Bayesian Approach to Clustering

  • Model the data as being generated from a mixture distribution with \( K \) components
  • For concreteness, we'll use a mixture of multivariate normal distributions
  • We introduce a latent variable for each observation.
  • The latent variable says which mixture component (cluster) the observation belongs to. \[ \begin{aligned} y_i | c_i, \boldsymbol{\mu}, \boldsymbol{\Sigma} & \sim \mathcal{N}(\mu_{c_i}, \Sigma_{c_i}) \\ c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \end{aligned} \]
  • To make a complete Bayesian model we must add priors for the mixing proportions and for the parameters for each mixture component.

Priors for Each Mixture Component

  • The easiest priors to work with are conjugate or conditionally conjugate priors
  • A family of priors is conjugate to a given likelihood when the posterior distribution is in the same family.
  • I.e. conjugate priors are closed under Bayesian updating.
  • A family of priors for a given parameter is conditionally conjugate when it is conjugate after conditioning on the values of all other parameters.
  • Allows easy Gibbs sampling.

Priors for the Multivariate Normal Distribution

  • For a normal distribution with known variance, the normal distribution itself is a conjugate prior for the mean
  • For a normal distribution with known mean, the inverse Wishart distribution is a conjugate prior for the covariance matrix
  • The inverse Wishart distribution is a distribution over covariance matrices
  • You can think of it as a multivariate generalization of the inverse gamma distribution
  • Since each is conjugate given the value of the other parameter, we have conditionally conjugate priors
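
  • As a concrete one-dimensional illustration of the known-variance case: if \( y_1, \dots, y_n \sim \mathcal{N}(\mu, \sigma^2) \) with \( \sigma^2 \) known and prior \( \mu \sim \mathcal{N}(\mu_0, \tau_0^2) \), then the posterior is again normal:

\[ \mu | y \sim \mathcal{N}(\mu_n, \tau_n^2), \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{\sum_i y_i}{\sigma^2} \right) \]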

Prior for the Mixing Proportions

  • We need a distribution over \( K \)-dimensional discrete distributions
  • We would like it to be conjugate to the discrete (aka categorical/multinoulli) distribution
  • So we need the posterior (the LHS below) to have the same functional form as whatever we pick for the prior.

\[ P(\boldsymbol{p}|c) \propto P(c|\boldsymbol{p})P(\boldsymbol{p}) = \prod_{i = 1}^K p_i^{I[c = i]} P(\boldsymbol{p}) \]

  • A little thought shows the following works:

\[ P(\boldsymbol{p}) \propto \prod_{i = 1}^K p_i^{\alpha_i - 1} \] where \( \boldsymbol{\alpha} \) is a hyperparameter that controls dispersion.
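
  • Indeed, multiplying this prior by the likelihood just adds the indicator to the exponent, so the posterior has the same form with updated exponents:

\[ P(\boldsymbol{p}|c) \propto \prod_{i = 1}^K p_i^{\alpha_i + I[c = i] - 1} \]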

The Dirichlet Distribution

\[ P(\boldsymbol{p}) \propto \prod_{i = 1}^K p_i^{\alpha_i - 1} \]

  • Called the Dirichlet Distribution
  • Oftentimes, we set all values of \( \boldsymbol{\alpha} \) to be equal.
  • This is called the symmetric Dirichlet distribution.
  • \( \alpha = 1 \) gives a uniform distribution.
  • \( \alpha < 1 \) is “sparse”; in a sample, most of the mass concentrates on a few components and the rest are near zero
  • \( \alpha > 1 \) is “dense”; in a sample, the components are all of similar size, close to \( 1/K \)
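
  • A quick way to see this in R: draw a symmetric Dirichlet sample by normalizing independent Gamma(\( \alpha \), 1) draws (a standard construction; rdirichlet_sym is just an illustrative name):

```r
# Draw one sample from a symmetric Dirichlet(alpha, ..., alpha) in K dimensions
# by normalizing independent Gamma(alpha, 1) draws.
rdirichlet_sym <- function(K, alpha) {
  g <- rgamma(K, shape = alpha)
  g / sum(g)
}

set.seed(1)
round(rdirichlet_sym(10, 0.1), 3)  # alpha < 1: a few components carry almost all the mass
round(rdirichlet_sym(10, 1), 3)    # alpha = 1: uniform over the simplex
round(rdirichlet_sym(10, 10), 3)   # alpha > 1: components all roughly 1/K
```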

Full Model with Priors

  • The full model: \[ \begin{aligned} y_i | c_i, \boldsymbol{\mu}, \boldsymbol{\Sigma} & \sim \mathcal{N}(\mu_{c_i}, \Sigma_{c_i}) \\ c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \\ \mu_c & \sim \mathcal{N}(\eta, \Lambda) \\ \Sigma_c &\sim \text{inv-Wishart}(\nu, S) \\ \boldsymbol{p} & \sim \text{Dirichlet}(\alpha / K, \dots, \alpha / K) \end{aligned} \]

  • Once fit, for each observation \( y_i \) we get a distribution over its cluster assignment \( c_i \)

  • To get a single hard cluster assignment, we can, for example, take the most likely cluster for each observation
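
  • As a sanity check, here is a minimal R sketch of the generative process above in two dimensions, with arbitrary illustrative hyperparameter values (not recommendations); it uses MASS::mvrnorm for multivariate normal draws and obtains an inverse-Wishart draw by inverting a Wishart draw:

```r
library(MASS)  # for mvrnorm

set.seed(42)
K <- 4; n <- 500; alpha <- 1
eta <- c(0, 0); Lambda <- diag(2) * 9   # normal prior on cluster means
nu <- 5; S <- diag(2)                   # inverse-Wishart prior on covariances

# p ~ Dirichlet(alpha/K, ..., alpha/K), via normalized Gamma draws
g <- rgamma(K, shape = alpha / K); p <- g / sum(g)

# Cluster parameters: mu_c ~ N(eta, Lambda), Sigma_c ~ inv-Wishart(nu, S)
mu <- lapply(1:K, function(k) mvrnorm(1, eta, Lambda))
Sigma <- lapply(1:K, function(k) solve(rWishart(1, nu, solve(S))[, , 1]))

# Latent assignments and observations
c_i <- sample(1:K, n, replace = TRUE, prob = p)
y <- t(sapply(1:n, function(i) mvrnorm(1, mu[[c_i[i]]], Sigma[[c_i[i]]])))

plot(y, col = c_i, asp = 1)  # observations coloured by their true cluster
```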

Taking the Limit

  • What happens as we take the limit \( K \to \infty \)?
  • Does every point just get its own cluster? That would be useless.
  • How can we fit such a model?
  • Most MCMC procedures would involve sampling \( \boldsymbol{p} \), which becomes infinitely large.
  • So something we'll probably have to do is integrate out \( \boldsymbol{p} \)
  • Doing this will actually give insight into the more important issue of what it means to let \( K\to\infty \).

Integrating Out p

  • Recall: \[ \begin{aligned} c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \\ \boldsymbol{p} & \sim \text{Dirichlet}(\alpha / K, \dots, \alpha / K) \end{aligned} \]
  • As \( K \to \infty \), sampling even a single \( c_i \) this way would require sampling an ever larger vector \( \boldsymbol{p} \).
  • It would be better to understand the direct relationship between \( \alpha \) and the \( c_i \).

\[ \begin{aligned} P(c_1, \dots, c_i) &= \int P(c_1, \dots, c_i | \boldsymbol{p}) P(\boldsymbol{p}) d\boldsymbol{p}\\ &\propto \int \prod_{j = 1}^i p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p} \\ \end{aligned} \]

Integrating Out p

  • Using the joint distribution from the last slide:

\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= P(c_1, \dots, c_{i-1}, c_i) / P(c_1, \dots, c_{i - 1}) \\ &= \frac{\int \prod_{j = 1}^i p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p}}{\int \prod_{j = 1}^{i-1} p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p}} \\ &= \frac{n_{i, c_i} + \alpha / K}{i - 1 + \alpha} \end{aligned} \]

where \( n_{i, c_i} \) is the number of \( c_j \), \( j < i \) with \( c_j = c_i \).

  • This gives a prior over \( \boldsymbol{c} \) directly by multiplying together those conditional distributions above
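
  • For completeness: both integrals evaluate in closed form via the Dirichlet normalization constant

\[ \int \prod_{j = 1}^K p_j^{a_j - 1} \, d\boldsymbol{p} = \frac{\prod_{j = 1}^K \Gamma(a_j)}{\Gamma\left(\sum_{j = 1}^K a_j\right)} \]

and the ratio above then simplifies to \( \frac{n_{i, c_i} + \alpha/K}{i - 1 + \alpha} \) using \( \Gamma(x + 1) = x\Gamma(x) \).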

Taking the Limit

  • Now when \( K \to \infty \) we get

\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= \frac{n_{i, c_i}}{i - 1 + \alpha} \\ P(c_i \neq c_j \text{ for all } j < i | c_1, \dots, c_{i-1}) &= \frac{\alpha}{i - 1 + \alpha} \end{aligned} \]

  • The first line is a straightforward limit (the \( \alpha / K \) term vanishes); it gives the probability of each cluster that has already been seen
  • The second line follows from additivity (the probabilities must sum to one); it gives the total probability of all values not already seen
  • Tells us what happens with infinitely many possible clusters: still a non-zero probability of putting an observation in an already existing cluster
  • Since not all clusters must be represented, this has the potential to solve the unknown number of clusters problem

The Chinese Restaurant Process

\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= \frac{n_{i, c_i}}{i - 1 + \alpha} \\ P(c_i \neq c_j \text{ for all } j < i | c_1, \dots, c_{i-1}) &= \frac{\alpha}{i - 1 + \alpha} \end{aligned} \]

  • The distribution on the \( c_i \) after taking the limit \( K\to\infty \) is called the Chinese Restaurant Process
  • Idea: customers enter a Chinese restaurant one at a time.
  • Each table has different food.
  • The customer chooses a table with probability proportional to how many customers are at the table (lots of people means the food must be good!)
  • Or the customer chooses a new, empty table.
  • Probability of starting a new table controlled by \( \alpha \).
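
  • A minimal R sketch of the seating process (the function name rcrp is just for illustration):

```r
# Seat n customers according to the Chinese restaurant process and return each
# customer's table number. Tables are labelled 1, 2, ... in order of creation.
rcrp <- function(n, alpha) {
  tables <- integer(n)
  tables[1] <- 1
  for (i in 2:n) {
    counts <- table(tables[1:(i - 1)])           # occupancy of the existing tables
    probs <- c(counts, alpha) / (i - 1 + alpha)  # existing tables, then a new table
    tables[i] <- sample(length(probs), 1, prob = probs)
  }
  tables
}

set.seed(2)
z <- rcrp(1000, alpha = 1)
prop.table(table(z))  # table occupancy proportions (an approximate draw of p; see the next slide)
```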

The Dirichlet Process

  • You can imagine running the Chinese restaurant process to completion, to seat infinitely many customers
  • At the end you'll have a probability distribution over tables
  • This tells us what \( \boldsymbol{p} \) would have been, had we drawn it from an “infinite” Dirichlet distribution, and then sampled the \( c_i \) from that potentially infinite \( \boldsymbol{p} \).
  • The “infinite” Dirichlet distribution that generates those “infinite” \( \boldsymbol{p} \)'s is called the Dirichlet Process
  • (Disclaimer: This has not been rigorous, and the characterization I give of the Dirichlet process isn't quite right, but I don't want to bother with the exact definitions.)

Fitting the Model

  • In general it is very difficult to analytically compute the integrals required for Bayesian inference
  • In particular, our multivariate normal Dirichlet process model cannot be fit analytically
  • One approach is to use Markov Chain Monte Carlo (MCMC) techniques
  • Set up an ergodic Markov chain such that the posterior distribution is the stationary distribution of the chain
  • For our model Gibbs sampling is a promising lead

Gibbs Sampling Review

  • If we want to sample from \( P(x, y) \), we can do the following:
  • Choose some initial value \( x_0 \)
  • Repeatedly sample \( y_{i + 1} \sim P(y | x = x_i) \), \( x_{i + 1} \sim P(x | y = y_{i + 1}) \)
  • The values generated this way form a Markov chain whose stationary distribution is \( P(x, y) \)
  • This scheme easily generalizes to more than two variables
  • In a Bayesian context, the target distribution is \( P(\boldsymbol{\theta}|D) \)
  • So to perform Gibbs sampling we need to know \( P(\theta_j | \boldsymbol{\theta}_{-j}, D) \)
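
  • A minimal toy example in R (unrelated to the clustering model): Gibbs sampling from a standard bivariate normal with correlation \( \rho \), whose full conditionals are \( x | y \sim \mathcal{N}(\rho y, 1 - \rho^2) \) and \( y | x \sim \mathcal{N}(\rho x, 1 - \rho^2) \):

```r
# Gibbs sampling from a standard bivariate normal with correlation rho,
# alternating draws from the two known univariate normal full conditionals.
gibbs_binorm <- function(n_iter, rho, x0 = 0) {
  out <- matrix(NA_real_, n_iter, 2, dimnames = list(NULL, c("x", "y")))
  x <- x0
  for (t in 1:n_iter) {
    y <- rnorm(1, mean = rho * x, sd = sqrt(1 - rho^2))
    x <- rnorm(1, mean = rho * y, sd = sqrt(1 - rho^2))
    out[t, ] <- c(x, y)
  }
  out
}

set.seed(3)
draws <- gibbs_binorm(5000, rho = 0.9)
cor(draws)  # off-diagonal entries should be close to 0.9 (after discarding burn-in)
```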

Gibbs Sampling the Dirichlet Process Model

  • This is just a sketch of how Gibbs sampling could work for the model we have
  • Assuming we know which cluster each data point belongs to, we can update the cluster means using conditional conjugacy, and only conditioning each mean on the data that it is actually responsible for
  • Of course, we cannot handle infinitely many cluster means at once, so we only store and update the means corresponding to clusters that actually have inhabitants
  • Assuming we know the means of each inhabited cluster, we can choose a new cluster for each data point using the Chinese restaurant process probabilities, weighted by how well each cluster's mean explains that data point (the likelihood); see the sketch below
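
  • To make this concrete, here is one possible sweep in R for the simplest conjugate case: a DP mixture of univariate normals with known variance \( \sigma^2 \) and a \( \mathcal{N}(0, \tau^2) \) prior on each cluster mean. This is a sketch of the idea under those assumptions, with illustrative names, not the scheme used by any particular package:

```r
# One Gibbs sweep for a toy DP mixture of univariate normals:
#   y_i | c_i, mu ~ N(mu[c_i], sigma2),  mu_k ~ N(0, tau2),  c ~ CRP(alpha).
# z holds the current cluster labels (consecutive integers starting at 1),
# mu holds the current mean of each occupied cluster.
dp_gibbs_sweep <- function(y, z, mu, alpha, sigma2, tau2) {
  # (1) Update each occupied cluster's mean by conditional conjugacy,
  #     conditioning only on the data currently assigned to that cluster.
  for (k in seq_along(mu)) {
    yk <- y[z == k]
    v <- 1 / (length(yk) / sigma2 + 1 / tau2)   # posterior variance of mu_k
    mu[k] <- rnorm(1, v * sum(yk) / sigma2, sqrt(v))
  }

  # (2) Update each assignment: CRP prior times the likelihood under each
  #     cluster mean; a brand-new cluster uses the prior predictive N(0, sigma2 + tau2).
  for (i in seq_along(y)) {
    old <- z[i]; z[i] <- NA
    counts <- tabulate(z[-i], nbins = length(mu))
    if (counts[old] == 0) {                     # removing y_i emptied its cluster:
      mu <- mu[-old]; counts <- counts[-old]    # drop it and relabel to stay consecutive
      z[!is.na(z) & z > old] <- z[!is.na(z) & z > old] - 1
    }
    probs <- c(counts * dnorm(y[i], mu, sqrt(sigma2)),
               alpha * dnorm(y[i], 0, sqrt(sigma2 + tau2)))
    k_new <- sample(length(probs), 1, prob = probs)
    if (k_new > length(mu)) {                   # start a new cluster: draw its mean
      v <- 1 / (1 / sigma2 + 1 / tau2)          # from the posterior given y_i alone
      mu <- c(mu, rnorm(1, v * y[i] / sigma2, sqrt(v)))
    }
    z[i] <- k_new
  }
  list(z = z, mu = mu)
}
```

Repeating such sweeps from some initialization (for example, all points in a single cluster) produces samples of the cluster assignments and cluster means.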

Practical Implementations of the Dirichlet Process Model

  • Practical implementations usually allow a prior on \( \alpha \)
  • They also have ways to handle non-conjugacy
  • The R package dirichletprocess automatically handles most of the details for us

Examples on Simple Data

  • Now we'll show some examples of using the dirichletprocess package on simple data to see how the Dirichlet Process can automatically infer the number of clusters
  • For these examples, all I had to do was standardize the data and then use the DirichletProcessMvnormal function.
  • No hand-picking the number of clusters or other hand-tuning is needed.
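
  • Roughly, the calls look like this (shown with R's built-in faithful data as a stand-in for the example datasets, not the exact code behind the plots; arguments and defaults may differ across package versions):

```r
library(dirichletprocess)

y <- scale(faithful)               # standardize the data
dp <- DirichletProcessMvnormal(y)  # Dirichlet process mixture of multivariate normals
dp <- Fit(dp, 1000)                # run the sampler for 1000 iterations
plot(dp)                           # data with the inferred clusters
```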

Examples on Simple Data

[Plots: the fitted dirichletprocess clusterings on the example datasets]

Extensions

  • This talk just scratches the surface of what's possible
  • Density estimation is another application of Dirichlet Processes
  • There are elaborate MCMC procedures to handle both non-conjugacy and more levels in the model
  • Instead of having a Dirichlet process mixture of normals, you could have a Dirichlet process mixture of Dirichlet process mixtures of normals
  • Called a Hierarchical Dirichlet Process, and used to handle grouped data
  • As with all Bayesian methods, you can just keep adding more and more complexity to the model…until you can't fit it anymore

Further Resources