Dirichlet Process Clustering

J Dorsey

Roadmap

  • The Problem: How can we cluster data without specifying the number of clusters manually?
  • Proposed Solution: Fit a model with a potentially infinite number of clusters.
  • Bayesian fixed-cluster model
  • Dirichlet distribution
  • Chinese restaurant process
  • Dirichlet process
  • Gibbs sampling
  • Application
  • Extensions
  • Further Resources

The Problem

  • How can we cluster a dataset without specifying the number of clusters beforehand?
  • E.g. k-means requires a fixed number of clusters specified in advance
  • This is very inconvenient when we don't know how many clusters to expect
  • This talk presents a nonparametric Bayesian solution to this problem
  • But first, we will discuss a Bayesian approach for a fixed number of clusters.

Bayesian Approach to Clustering

  • Model the data as being generated from a mixture distribution with \( K \) components
  • For concreteness, we'll use a mixture of multivariate normal distributions
  • We introduce a latent variable for each observation.
  • The latent variable says which mixture component (cluster) the observation belongs to. \[ \begin{aligned} y_i | c_i, \boldsymbol{\mu}, \boldsymbol{\Sigma} & \sim \mathcal{N}(\mu_{c_i}, \Sigma_{c_i}) \\ c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \end{aligned} \]
  • To make a complete Bayesian model we must add priors for the mixing proportions and for the parameters for each mixture component.

Priors for Each Mixture Component

  • The easiest priors to work with are conjugate or conditionally conjugate priors
  • A family of priors is conjugate to a given likelihood when the posterior distribution is in the same family.
  • I.e. conjugate priors are closed under Bayesian updating.
  • A family of priors for a given parameter is conditionally conjugate when it is conjugate after conditioning on the values of all other parameters.
  • Allows easy Gibbs sampling.

Priors for the Multivariate Normal Distribution

  • For a normal distribution with known variance, the normal distribution itself is a conjugate prior for the mean
  • For a normal distribution with known mean, the inverse Wishart distribution is a conjugate prior for the covariance matrix
  • The inverse Wishart distribution is a distribution over covariance matrices
  • You can think of it as a multivariate generalization of the inverse gamma distribution
  • Since each is conjugate given the value of the other parameter, we have conditionally conjugate priors
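
  • As a concrete one-dimensional illustration of the known-variance case: if \( y_1, \dots, y_n \sim \mathcal{N}(\mu, \sigma^2) \) with \( \sigma^2 \) known and prior \( \mu \sim \mathcal{N}(\mu_0, \tau_0^2) \), then the posterior is again normal:

\[ \mu | y \sim \mathcal{N}(\mu_n, \tau_n^2), \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{\sum_i y_i}{\sigma^2} \right) \]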

Prior for the Mixing Proportions

  • We need a distribution over \( K \)-dimensional discrete distributions
  • We would like it to be conjugate to the discrete (aka categorical/multinoulli) distribution
  • So we need the posterior (the LHS below) to have the same functional form as whatever we pick for the prior.

\[ P(\boldsymbol{p}|c) \propto P(c|\boldsymbol{p})P(\boldsymbol{p}) = \prod_{i = 1}^K p_i^{I[c = i]} P(\boldsymbol{p}) \]

  • A little thought shows the following works:

\[ P(\boldsymbol{p}) \propto \prod_{i = 1}^K p_i^{\alpha_i - 1} \] where \( \boldsymbol{\alpha} \) is a hyperparameter that controls dispersion.
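
  • Indeed, multiplying this prior by the likelihood just adds the indicator to the exponent, so the posterior has the same form with updated exponents:

\[ P(\boldsymbol{p}|c) \propto \prod_{i = 1}^K p_i^{\alpha_i + I[c = i] - 1} \]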

The Dirichlet Distribution

\[ P(\boldsymbol{p}) \propto \prod_{i = 1}^K p_i^{\alpha_i - 1} \]

  • Called the Dirichlet Distribution
  • Oftentimes, we set all values of \( \boldsymbol{\alpha} \) to be equal.
  • This is called the symmetric Dirichlet distribution.
  • \( \alpha = 1 \) gives a uniform distribution.
  • \( \alpha < 1 \) is “sparse”; in a sample, most of the mass concentrates on a few components and the rest are near zero
  • \( \alpha > 1 \) is “dense”; in a sample, the components are all of similar size, close to \( 1/K \)
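
  • A quick way to see this in R: draw a symmetric Dirichlet sample by normalizing independent Gamma(\( \alpha \), 1) draws (a standard construction; rdirichlet_sym is just an illustrative name):

```r
# Draw one sample from a symmetric Dirichlet(alpha, ..., alpha) in K dimensions
# by normalizing independent Gamma(alpha, 1) draws.
rdirichlet_sym <- function(K, alpha) {
  g <- rgamma(K, shape = alpha)
  g / sum(g)
}

set.seed(1)
round(rdirichlet_sym(10, 0.1), 3)  # alpha < 1: a few components carry almost all the mass
round(rdirichlet_sym(10, 1), 3)    # alpha = 1: uniform over the simplex
round(rdirichlet_sym(10, 10), 3)   # alpha > 1: components all roughly 1/K
```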

Full Model with Priors

  • The full model: \[ \begin{aligned} y_i | c_i, \boldsymbol{\mu}, \boldsymbol{\Sigma} & \sim \mathcal{N}(\mu_{c_i}, \Sigma_{c_i}) \\ c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \\ \mu_c & \sim \mathcal{N}(\eta, \Lambda) \\ \Sigma_c &\sim \text{inv-Wishart}(\nu, S) \\ \boldsymbol{p} & \sim \text{Dirichlet}(\alpha / K, \dots, \alpha / K) \end{aligned} \]

  • Once fit, for each observation \( y_i \) we get a distribution over its cluster assignment \( c_i \)

  • To get a single hard cluster assignment, we can, for example, take the most likely cluster for each observation
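
  • As a sanity check, here is a minimal R sketch of the generative process above in two dimensions, with arbitrary illustrative hyperparameter values (not recommendations); it uses MASS::mvrnorm for multivariate normal draws and obtains an inverse-Wishart draw by inverting a Wishart draw:

```r
library(MASS)  # for mvrnorm

set.seed(42)
K <- 4; n <- 500; alpha <- 1
eta <- c(0, 0); Lambda <- diag(2) * 9   # normal prior on cluster means
nu <- 5; S <- diag(2)                   # inverse-Wishart prior on covariances

# p ~ Dirichlet(alpha/K, ..., alpha/K), via normalized Gamma draws
g <- rgamma(K, shape = alpha / K); p <- g / sum(g)

# Cluster parameters: mu_c ~ N(eta, Lambda), Sigma_c ~ inv-Wishart(nu, S)
mu <- lapply(1:K, function(k) mvrnorm(1, eta, Lambda))
Sigma <- lapply(1:K, function(k) solve(rWishart(1, nu, solve(S))[, , 1]))

# Latent assignments and observations
c_i <- sample(1:K, n, replace = TRUE, prob = p)
y <- t(sapply(1:n, function(i) mvrnorm(1, mu[[c_i[i]]], Sigma[[c_i[i]]])))

plot(y, col = c_i, asp = 1)  # observations coloured by their true cluster
```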

Taking the Limit

  • What happens as we take the limit \( K \to \infty \)?
  • Does every point just get its own cluster? That would be useless.
  • How can we fit such a model?
  • Most MCMC procedures would involve sampling \( \boldsymbol{p} \), which becomes infinitely large.
  • So something we'll probably have to do is integrate out \( \boldsymbol{p} \)
  • Doing this will actually give insight into the more important issue of what it means to let \( K\to\infty \).

Integrating Out p

  • Recall: \[ \begin{aligned} c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \\ \boldsymbol{p} & \sim \text{Dirichlet}(\alpha / K, \dots, \alpha / K) \end{aligned} \]
  • As \( K \to \infty \), sampling even a single \( c_i \) this way would require sampling an ever larger vector \( \boldsymbol{p} \).
  • It would be better to understand the direct relationship between \( \alpha \) and the \( c_i \).

\[ \begin{aligned} P(c_1, \dots, c_i) &= \int P(c_1, \dots, c_i | \boldsymbol{p}) P(\boldsymbol{p}) d\boldsymbol{p}\\ &\propto \int \prod_{j = 1}^i p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p} \\ \end{aligned} \]

Integrating Out p

  • Using the joint distribution from the last slide:

\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= P(c_1, \dots, c_{i-1}, c_i) / P(c_1, \dots, c_{i - 1}) \\ &= \frac{\int \prod_{j = 1}^i p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p}}{\int \prod_{j = 1}^{i-1} p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p}} \\ &= \frac{n_{i, c_i} + \alpha / K}{i - 1 + \alpha} \end{aligned} \]

where \( n_{i, c_i} \) is the number of \( c_j \), \( j < i \) with \( c_j = c_i \).

  • This gives a prior over \( \boldsymbol{c} \) directly by multiplying together those conditional distributions above
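
  • For completeness: both integrals evaluate in closed form via the Dirichlet normalization constant

\[ \int \prod_{j = 1}^K p_j^{a_j - 1} \, d\boldsymbol{p} = \frac{\prod_{j = 1}^K \Gamma(a_j)}{\Gamma\left(\sum_{j = 1}^K a_j\right)} \]

and the ratio above then simplifies to \( \frac{n_{i, c_i} + \alpha/K}{i - 1 + \alpha} \) using \( \Gamma(x + 1) = x\Gamma(x) \).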

Taking the Limit

  • Now when \( K \to \infty \) we get

\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= \frac{n_{i, c_i}}{i - 1 + \alpha} \\ P(c_i \neq c_j \text{ for all } j < i | c_1, \dots, c_{i-1}) &= \frac{\alpha}{i - 1 + \alpha} \end{aligned} \]

  • The first line is a straightforward limit (the \( \alpha / K \) term vanishes); it gives the probability of each cluster that has already been seen
  • The second line follows from additivity (the probabilities must sum to one); it gives the total probability of all values not already seen
  • Tells us what happens with infinitely many possible clusters: still a non-zero probability of putting an observation in an already existing cluster
  • Since not all clusters must be represented, this has the potential to solve the unknown number of clusters problem

The Chinese Restaurant Process

\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= \frac{n_{i, c_i}}{i - 1 + \alpha} \\ P(c_i \neq c_j \text{ for all } j < i | c_1, \dots, c_{i-1}) &= \frac{\alpha}{i - 1 + \alpha} \end{aligned} \]

  • The distribution on the \( c_i \) after taking the limit \( K\to\infty \) is called the Chinese Restaurant Process
  • Idea: customers enter a Chinese restaurant one at a time.
  • Each table has different food.
  • The customer chooses a table with probability proportional to how many customers are at the table (lots of people means the food must be good!)
  • Or the customer chooses a new, empty table.
  • Probability of starting a new table controlled by \( \alpha \).
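
  • A minimal R sketch of the seating process (the function name rcrp is just for illustration):

```r
# Seat n customers according to the Chinese restaurant process and return each
# customer's table number. Tables are labelled 1, 2, ... in order of creation.
rcrp <- function(n, alpha) {
  tables <- integer(n)
  tables[1] <- 1
  for (i in 2:n) {
    counts <- table(tables[1:(i - 1)])           # occupancy of the existing tables
    probs <- c(counts, alpha) / (i - 1 + alpha)  # existing tables, then a new table
    tables[i] <- sample(length(probs), 1, prob = probs)
  }
  tables
}

set.seed(2)
z <- rcrp(1000, alpha = 1)
prop.table(table(z))  # table occupancy proportions (an approximate draw of p; see the next slide)
```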

The Dirichlet Process

  • You can imagine running the Chinese restaurant process to completion, to seat infinitely many customers
  • At the end you'll have a probability distribution over tables
  • This tells us what \( \boldsymbol{p} \) would have been, had we drawn it from an “infinite” Dirichlet distribution, and then sampled the \( c_i \) from that potentially infinite \( \boldsymbol{p} \).
  • The “infinite” Dirichlet distribution that generates those “infinite” \( \boldsymbol{p} \)'s is called the Dirichlet Process
  • (Disclaimer: This has not been rigorous, and the characterization I give of the Dirichlet process isn't quite right, but I don't want to bother with the exact definitions.)

Fitting the Model

  • In general it is very difficult to analytically compute the integrals required for Bayesian inference
  • In particular, our multivariate normal Dirichlet process model cannot be fit analytically
  • One approach is to use Markov Chain Monte Carlo (MCMC) techniques
  • Set up an ergodic Markov chain such that the posterior distribution is the stationary distribution of the chain
  • For our model Gibbs sampling is a promising lead

Gibbs Sampling Review

  • If we want to sample from \( P(x, y) \), we can do the following:
  • Choose some initial value \( x_0 \)
  • Repeatedly sample \( y_{i + 1} \sim P(y | x = x_i) \), \( x_{i + 1} \sim P(x | y = y_{i + 1}) \)
  • The values generated this way form a Markov chain whose stationary distribution is \( P(x, y) \)
  • This scheme easily generalizes to more than two variables
  • In a Bayesian context, the target distribution is \( P(\boldsymbol{\theta}|D) \)
  • So to perform Gibbs sampling we need to know \( P(\theta_j | \boldsymbol{\theta}_{-j}, D) \)
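
  • A minimal toy example in R (unrelated to the clustering model): Gibbs sampling from a standard bivariate normal with correlation \( \rho \), whose full conditionals are \( x | y \sim \mathcal{N}(\rho y, 1 - \rho^2) \) and \( y | x \sim \mathcal{N}(\rho x, 1 - \rho^2) \):

```r
# Gibbs sampling from a standard bivariate normal with correlation rho,
# alternating draws from the two known univariate normal full conditionals.
gibbs_binorm <- function(n_iter, rho, x0 = 0) {
  out <- matrix(NA_real_, n_iter, 2, dimnames = list(NULL, c("x", "y")))
  x <- x0
  for (t in 1:n_iter) {
    y <- rnorm(1, mean = rho * x, sd = sqrt(1 - rho^2))
    x <- rnorm(1, mean = rho * y, sd = sqrt(1 - rho^2))
    out[t, ] <- c(x, y)
  }
  out
}

set.seed(3)
draws <- gibbs_binorm(5000, rho = 0.9)
cor(draws)  # off-diagonal entries should be close to 0.9 (after discarding burn-in)
```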

Gibbs Sampling the Dirichlet Process Model

  • This is just a sketch of how Gibbs sampling could work for the model we have
  • Assuming we know which cluster each data point belongs to, we can update the cluster means using conditional conjugacy, and only conditioning each mean on the data that it is actually responsible for
  • Of course, we cannot handle infinitely many cluster means at once, so we only store and update the means corresponding to clusters that actually have inhabitants
  • Assuming we know the means of each inhabited cluster, we can choose a new cluster for each data point using the Chinese restaurant process probabilities, weighted by how well each cluster's mean explains that data point (the likelihood); see the sketch below
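
  • To make this concrete, here is one possible sweep in R for the simplest conjugate case: a DP mixture of univariate normals with known variance \( \sigma^2 \) and a \( \mathcal{N}(0, \tau^2) \) prior on each cluster mean. This is a sketch of the idea under those assumptions, with illustrative names, not the scheme used by any particular package:

```r
# One Gibbs sweep for a toy DP mixture of univariate normals:
#   y_i | c_i, mu ~ N(mu[c_i], sigma2),  mu_k ~ N(0, tau2),  c ~ CRP(alpha).
# z holds the current cluster labels (consecutive integers starting at 1),
# mu holds the current mean of each occupied cluster.
dp_gibbs_sweep <- function(y, z, mu, alpha, sigma2, tau2) {
  # (1) Update each occupied cluster's mean by conditional conjugacy,
  #     conditioning only on the data currently assigned to that cluster.
  for (k in seq_along(mu)) {
    yk <- y[z == k]
    v <- 1 / (length(yk) / sigma2 + 1 / tau2)   # posterior variance of mu_k
    mu[k] <- rnorm(1, v * sum(yk) / sigma2, sqrt(v))
  }

  # (2) Update each assignment: CRP prior times the likelihood under each
  #     cluster mean; a brand-new cluster uses the prior predictive N(0, sigma2 + tau2).
  for (i in seq_along(y)) {
    old <- z[i]; z[i] <- NA
    counts <- tabulate(z[-i], nbins = length(mu))
    if (counts[old] == 0) {                     # removing y_i emptied its cluster:
      mu <- mu[-old]; counts <- counts[-old]    # drop it and relabel to stay consecutive
      z[!is.na(z) & z > old] <- z[!is.na(z) & z > old] - 1
    }
    probs <- c(counts * dnorm(y[i], mu, sqrt(sigma2)),
               alpha * dnorm(y[i], 0, sqrt(sigma2 + tau2)))
    k_new <- sample(length(probs), 1, prob = probs)
    if (k_new > length(mu)) {                   # start a new cluster: draw its mean
      v <- 1 / (1 / sigma2 + 1 / tau2)          # from the posterior given y_i alone
      mu <- c(mu, rnorm(1, v * y[i] / sigma2, sqrt(v)))
    }
    z[i] <- k_new
  }
  list(z = z, mu = mu)
}
```

Repeating such sweeps from some initialization (for example, all points in a single cluster) produces samples of the cluster assignments and cluster means.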

Practical Implementations of the Dirichlet Process Model

  • Practical implementations usually allow a prior on \( \alpha \)
  • They also have ways to handle non-conjugacy
  • The R package dirichletprocess automatically handles most of the details for us

Examples on Simple Data

  • Now we'll show some examples of using the dirichletprocess package on simple data to see how the Dirichlet Process can automatically infer the number of clusters
  • For these examples, all I had to do was standardize the data and then use the DirichletProcessMvnormal function.
  • No hand-picking the number of clusters or other hand-tuning is needed.
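
  • Roughly, the calls look like this (shown with R's built-in faithful data as a stand-in for the example datasets, not the exact code behind the plots; arguments and defaults may differ across package versions):

```r
library(dirichletprocess)

y <- scale(faithful)               # standardize the data
dp <- DirichletProcessMvnormal(y)  # Dirichlet process mixture of multivariate normals
dp <- Fit(dp, 1000)                # run the sampler for 1000 iterations
plot(dp)                           # data with the inferred clusters
```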

Examples on Simple Data

[Plots: the fitted dirichletprocess clusterings on the example datasets]

Extensions

  • This talk just scratches the surface of what's possible
  • Density estimation is another application of Dirichlet Processes
  • There are elaborate MCMC procedures to handle both non-conjugacy and more levels in the model
  • Instead of having a Dirichlet process mixture of normals, you could have a Dirichlet process mixture of Dirichlet process mixtures of normals
  • Called a Hierarchical Dirichlet Process, and used to handle grouped data
  • As with all Bayesian methods, you can just keep adding more and more complexity to the model…until you can't fit it anymore

Further Resources