J Dorsey
\[ P(\boldsymbol{p}|c) \propto P(c|\boldsymbol{p})P(\boldsymbol{p}) = \prod_{i = 1}^K p_i^{I[c = i]} P(\boldsymbol{p}) \]
\[ P(\boldsymbol{p}) \propto \prod_{i = 1}^K p_i^{\alpha_i - 1} \] where \( \boldsymbol{\alpha} \) is a hyperparameter that controls dispersion.
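Because this prior has the same product form as the likelihood \( P(c|\boldsymbol{p}) \), multiplying the two only adds one to the exponent of the observed class, so the posterior is again a Dirichlet. A minimal Python sketch of that update follows; the function names `posterior_alpha` and `dirichlet_mean` and the 0-based class index are illustrative choices, not part of any library.

```python
def posterior_alpha(alpha, c):
    """Return Dirichlet parameters after observing one draw in class c (0-based)."""
    # The likelihood contributes p_c^1 and the prior contributes p_i^(alpha_i - 1),
    # so only the exponent of class c increases by one.
    return [a + 1 if i == c else a for i, a in enumerate(alpha)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


prior = [0.5, 0.5, 0.5]
post = posterior_alpha(prior, c=1)
print(post)                  # [0.5, 1.5, 0.5]
print(dirichlet_mean(post))  # [0.2, 0.6, 0.2]
```

A single observation in class 1 moves the expected probability of that class from 1/3 to 0.6 here, because the prior parameters are small relative to one observation.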
The full model: \[ \begin{aligned} y_i | c_i, \boldsymbol{\mu}, \boldsymbol{\Sigma} & \sim \mathcal{N}(\mu_{c_i}, \Sigma_{c_i}) \\ c_i | \boldsymbol{p} & \sim \text{Discrete}(p_1, \dots, p_K) \\ \mu_c & \sim \mathcal{N}(\eta, \Lambda) \\ \Sigma_c &\sim \text{inv-Wishart}(\nu, S) \\ \boldsymbol{p} & \sim \text{Dirichlet}(\alpha / K, \dots, \alpha / K) \end{aligned} \]
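To get a feel for the prior on assignments, the last two lines of the model can be simulated on their own. The sketch below uses only the Python standard library, draws the Dirichlet by normalising Gamma variates, and leaves out the Gaussian and inverse-Wishart parts; the helper names and the values of `n`, `K`, and `alpha` are assumptions chosen for illustration.

```python
import random
from collections import Counter


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_assignments(n, K, alpha, rng):
    """Sample p ~ Dirichlet(alpha/K, ..., alpha/K), then n labels c_i ~ Discrete(p)."""
    p = sample_dirichlet([alpha / K] * K, rng)
    return rng.choices(range(K), weights=p, k=n)


rng = random.Random(0)
labels = sample_assignments(n=200, K=50, alpha=1.0, rng=rng)
# With alpha / K small, most of the mass of p tends to sit on a few clusters.
print(Counter(labels).most_common())
```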
Once fit, for each observation \( y_i \) we get a distribution over its cluster assignment \( c_i \).
To get a single cluster assignment, we can, for example, take the most likely cluster.
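Taking the most likely cluster amounts to an argmax over each observation's assignment probabilities. A small Python sketch, where the probability table is made up for illustration and `map_assignments` is a hypothetical helper name:

```python
def map_assignments(probs):
    """For each observation, return the index of its most probable cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical posterior assignment probabilities: one row per observation.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
print(map_assignments(probs))  # [0, 2, 1]
```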
\[ \begin{aligned} P(c_1, \dots, c_i) &= \int P(c_1, \dots, c_i | \boldsymbol{p}) P(\boldsymbol{p}) d\boldsymbol{p}\\ &\propto \int \prod_{j = 1}^i p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p} \\ \end{aligned} \]
\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= P(c_1, \dots, c_{i-1}, c_i) / P(c_1, \dots, c_{i - 1}) \\ &= \frac{\int \prod_{j = 1}^i p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p}}{\int \prod_{j = 1}^{i-1} p_{c_j} \prod_{j = 1}^K p_j^{\alpha/K - 1} d\boldsymbol{p}} \\ &= \frac{n_{i, c_i} + \alpha / K}{i - 1 + \alpha} \end{aligned} \]
where \( n_{i, c_i} \) is the number of \( c_j \), \( j < i \) with \( c_j = c_i \).
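The closed form above is easy to check numerically. The following Python sketch evaluates it for every cluster given the earlier labels; `predictive_probs` and its argument names are illustrative, and clusters are indexed from 0 rather than 1.

```python
from collections import Counter


def predictive_probs(previous, K, alpha):
    """P(c_i = k | c_1..c_{i-1}) for k = 0..K-1 under a symmetric Dirichlet(alpha/K) prior."""
    counts = Counter(previous)      # n_{i,k}: number of earlier labels equal to k
    denom = len(previous) + alpha   # i - 1 + alpha
    return [(counts[k] + alpha / K) / denom for k in range(K)]


probs = predictive_probs(previous=[0, 0, 2], K=4, alpha=1.0)
print(probs)       # [0.5625, 0.0625, 0.3125, 0.0625]
print(sum(probs))  # 1.0
```

Clusters that already hold observations are favoured in proportion to their size, while each unused cluster keeps a small probability of \( (\alpha / K) / (i - 1 + \alpha) \).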
Letting \( K \to \infty \), the \( \alpha / K \) term vanishes for clusters that have already been used, and the remaining probability mass goes to a cluster not yet seen:
\[ \begin{aligned} P(c_i | c_1, \dots, c_{i-1}) &= \frac{n_{i, c_i}}{i - 1 + \alpha} \\ P(c_i \neq c_j \text{ for all } j < i | c_1, \dots, c_{i-1}) &= \frac{\alpha}{i - 1 + \alpha} \end{aligned} \]
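These two probabilities define a sequential sampling scheme, often called the Chinese restaurant process: each new observation joins an existing cluster with probability proportional to its size, or opens a new one with probability proportional to \( \alpha \). A minimal Python sketch under those rules, with illustrative function names and an arbitrary seed:

```python
import random


def crp_probs(counts, alpha):
    """Return (probabilities of joining each existing cluster, probability of a new cluster)."""
    denom = sum(counts) + alpha  # i - 1 + alpha
    return [n / denom for n in counts], alpha / denom


def sample_crp(n, alpha, rng):
    """Sequentially sample n cluster labels; a new cluster gets the next unused index."""
    counts, labels = [], []
    for _ in range(n):
        # Weights are proportional to cluster sizes, plus alpha for a new cluster.
        k = rng.choices(range(len(counts) + 1), weights=counts + [alpha])[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        labels.append(k)
    return labels


print(crp_probs([3, 1], alpha=1.0))  # ([0.6, 0.2], 0.2)
labels = sample_crp(n=100, alpha=1.0, rng=random.Random(0))
print(len(set(labels)))              # number of clusters that emerged in this draw
```

The number of clusters is not fixed in advance here; it grows with the data, and larger values of `alpha` make new clusters more likely.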
The dirichletprocess package automatically handles most of the details for us. We can use the dirichletprocess package on simple data to see how the Dirichlet Process can automatically infer the number of clusters, using the DirichletProcessMvnormal function.