title: “Bayesian clustering using Hamiltonian Monte Carlo” date: “2021-09-27” output:

Introduction

Many statistical analyses use clustering as a key focus, as it is a useful technique for exploratory data analysis and simplification of complex data.There are various techniques in the literature such as distance-based, density-based, and model-based. Let \(y_{i} \in Y\), for \(i = 1;\cdots ; n\), denote the data and let \(d(y; y')\) denote a distance between data points \(y\) and \(y'\). Then, distance-based clustering algorithms are typically applied to the n \(\times\) n matrix of pairwise distances. \(D_{(n)\times(n)}={d_{i,i'}}\), with \(d_{i,i'}=d(y_{i}; y_{i'})\) for all \(i,i'\). Model-based clustering, on the other hand, builds a model for the original data \(y(n)\) using a likelihood-based technique with \((n)={1,2,3, \cdots,n}\), that has the form:

\[y_{i}\overset{iid}{ \sim} f,\;\; f(y)=\sum_{m=1}^{g}\pi_{m}G(y;\theta_{m})\] where \(\pi=(\pi_{1},\cdots,\pi_{g})'\) is a vector of probability weights in a finite mixture model, m is a cluster index, and \(G(y;\theta_{m})\) is the density of the data within cluster m. By marginalizing off the cluster index \(ci\in {1,2,3,\cdots,g}\) in the following model, the finite mixture model (1) may be derived. \[y_{i}\sim G(\theta_{ci}), \; pr(c_{i}=m)=\pi_{m}\] An expectation-maximization approach can be used to derive maximum likelihood estimates of the model parameters \(\pi\) and \(\theta=\theta_{m}\) using this data-augmented form. Bayesian approaches, on the other hand, are extensively used to include prior knowledge on the parameters and quantify parameter uncertainty. Distance-based methods have the advantage of being conceptually and computationally straightforward, but lack of quantification of uncertainty in clustering estimates and associated inferences is a major concern. While model-based techniques can solve these difficulties by utilizing a likelihood-based framework, one major drawback is the high sensitivity to the kernel \(G(.; \theta)\) chosen. Kernels are frequently chosen for their ease of use and computational efficiency.

Kernels are frequently chosen for their simplicity and computational convenience, and they impose rigid assumptions on cluster structure that are not warranted by the applicable situation. We are not the first to notice this issue, and there is a literature devoted to solving kernel robustness problems in model-based clustering. One approach is to use a flexible class of kernels that may represent a wide range of densities. For example, instead of a Gaussian kernel, one can use one that accommodates asymmetry, skewness, and/or heavier tails (Karlis and Santourian (2009); Juarez and Steel (2010); O’Hagan et al. (2016); Gallaugher and McNicholas (2018)).

Nonparametrically estimating the kernels particular to each cluster while imposing minimal identifiability constraints, such as unimodality and sufficiently light tails, is a similar direction. This direction is connected to Li et al. (2007)’s mode-based clustering methods; for a Bayesian approach utilizing unimodal kernels,(Rodriguez and Walker (2014)). Unfortunately, as Hennig et al. (2015) point out, a too flexible kernel leads to ambiguity in cluster definition and identifiability issues: for example, one cluster can be the union of several closely related clusters. In practice, such flexible kernels necessitate a high number of parameters, posing a tremendous computational challenge.

Replacing the likelihood with a robust alternative is a potential new method. Coretto and Hennig (2016) present a robust multivariate clustering strategy based on pseudo-likelihood, which catches outliers with an extra incorrect uniform component. Miller and Dunson (2018) present a coarsened Bayes strategy for Bayesian inference robustification and apply it to clustering challenges. In establishing a Bayesian method, they condition on the empirical probability mass function of the observed data being within some small neighborhood of that for the presumed model, rather than assuming that the actual data are exactly derived from (1). Small variances are allowed in both of these strategies. Extending these algorithms to data with great complexity, such as clustering numerous time series, pictures, and so on, is tough. We offer a new approach for pairwise distances based on a Bayesian model, which avoids a comprehensive specification of the likelihood function for the data \(y(n)\). Moreover, there literature suggesting Bayesian techniques can substitute an exact likelihood function with some alternative. Chernozhukov and Hong (2003) investigate a wide range of quasi-posterior distributions.

In Bayesian inference, Jeffreys (1961) suggested a substitution likelihood for quantiles;(Dunson and Taylor (2005)). Hoff (2007) suggested a Bayesian approach to copula model inference that avoids describing marginal distribution models via extended rank likelihood. Instead of using the data directly, Johnson (2005) offered Bayesian tests based on modeling frequentist test statistics. These are only a few of the many examples available. While gaining some of the benefits of model-based clustering, such as uncertainty quantification and flexibility, our proposed Bayesian Distance Clustering technique considerably simplifies the model specification task. Furthermore, many Bayesian approximation techniques have been used such as the Markov Chain Monte Carlo (MCMC), variational Bayes and Laplace approximation. Laplace approximation and variational Bayes, are based on replacing the Bayesian posterior density with a computationally convenient approximation. In addition, such methods have the advantages of speed and scalability, however they may raise the question of how closely the resultant approximate Bayesian inference can be trusted to replicate the true Bayesian inference. MCMC is a type of dependent sampling in which the samples are obtained from successive states of a discrete-time Markov chain. The Markov chain is designed to be easy to simulate and intended to (eventually) produce samples that have a distribution arbitrarily close to the posterior distribution. However, MCMC methods can be slow especially when dealing with high dimensional data.

Partial likelihoods for distances

Suppose that data \(y(n)\) are generated from model (1) or equivalently (2). We focus on the case in which \((y_{i} = (y_{i,\cdots},y_{i,p})^{T} \in Y \subset \mathbb{R}\). The conditional likelihood of the data \(y(n)\) given clustering indices \(c(n)\) can be expressed as \[L(y_{(n)};c_{(n)}) = \prod_{m=1}^{g}\prod_{i:c_{i}=m}G_{m}(y_{i})=\prod_{m=1}^{g}L_{m}(y^{[m]})\] where we let \(G_{m}(y_{i})\) denote the density of data within cluster h, and \(y^{[m]} = {y_{i}:c_{i}=m}=y_{i}^{[m]},i=1,\cdots,n_{m}\) is the data in cluster h. Since the information of \(c_{(n)}\) is stored by the index with \([m]\), we will omit \(c_{(n)}\) in the notation when \([m]\) appears. Referring to \(y_{1}^{[h]}\) as the seed for cluster \(h\), we can express the likelihood \(L_{h}(y^{[h]})\) as \[L_{m}(y^{[m]}) = G_{m}(y_{1}^{[m]})\prod_{i=2}^{n_{m}} D_{m}(\tilde{d}_{i,1}^{[m]})|y_{1}^{[m]}\] \[=G_{m}(y_{1}^{[m]}|\tilde{d}_{2,1}^{[m]}, \cdots,\tilde{d}_{_{n_{m}},1}^{[m]} )D_{m}(\tilde{d}_{2,1}^{[m]}, \cdots,\tilde{d}_{_{n_{m}},1}^{[m]})\] where \(\tilde{d}_{i,1}^{[m]}\) denotes the difference between \(y_{i}^{[m]}\) and \(y_{1}^{[m]}\);for now, let us define difference by subtraction \((\tilde{d}_{2,1}^{[m]}= y_{i}^{[m]}-y_{1}^{[m]}\) 1 . Expression is a product of the densities of the seed and \((n_{m}-1)\) differences. As the cluster size \(n_{m}\) increases, the relative contribution of the seed density \(D_(m)(y^{[m]}|.)\) will decrease and the likelihood becomes dominated by \(D_{h}\). The term \(G_(m)(y^{[m]}|.)\) can intuitively be discarded with little impact on the inferences of \(c_{(n)}\).

Prior specification

In Bayesian clustering it is useful to choose the prior parameters in a reasonable range. Recall in our gamma density, \(\alpha_{m}\ge 1\) determines the mode for \({d}_{i,i^{'}}^{[m]}\) at \((\alpha_{m}-1)\sigma_{m}\). To favor small values for the mode while accommodating a moderate degree of uncertainty, we use a shifted Gamma prior \(\alpha_{m} \sim\) Gamma(0:5; 1:0) + 1. To select a prior for \(\sigma_{m}\), we associate it with a pre-specified maximum cluster number \(k\). We can view \(k\) as a packing number— that is, how many balls (clusters) we can fit in a container of the data. To formalize, imagine a \(p\)-dimensional ellipsoid in \(\mathbb{R}^{p}\) enclosing all the observed data. The smallest volume of such an ellipsoid is \[vol(Data)=M \quad min_{\mu \in \mathbb {R}^{p}}(det \quad Q)^{-\frac{1}{2}}, s.t. (y_{i}-\mu)^TQ(y_i-\mu) \quad \le 1 \quad for i=1,\cdots, n \] which can be obtained via a fast convex optimization algorithm, with \(M = \tilde{\pi}^{\frac{p}{2}}/\Gamma{p/2+1}\) and \(\tilde{\pi}\approx 3.14\)

Posterior Computation

Since the likelihood is formed in a pairwise way, updating one cluster assignment \(c_i\) has an impact on the others. Therefore, conventional Gibbs samplers updating one \(c_i\) at a time are not ideal. Instead, we consider Hamiltonian Monte Carlo (HMC) to update the whole binary matrix C in, via a lift-and-project strategy, first lift each row \(C_i\) (on a simplex vertex) into the interior of the simplex, and denote the relaxation by \(W_i\in \Delta_{\delta}^{(g-1)}\),run leap-frog updates and then project it back to the nearest vertex as a proposal. We call this algorithm lift-and-project HMC. This iterates in the following steps:

Initialize each \(W_i\) and sample momentum variable \(Q\in \mathbb{R}^{n\times g}\), with each \(q_{i;j}\sim N(0; \sigma^{2})\)
Run the leap-frog algorithm in the simplex interior \(\Delta_\delta^{(g-1)}\) for L steps using kinetic and potential functions: \[K(Q) = \frac{1}{2\sigma_{q}^{2}}tr(Q^{T}Q)\] \[U(W) = -tr\lbrace{W^{T}(log D)W\wedge(W^{T}W)^{-1}}\rbrace + tr\lbrace{W^{T}DW(\sum W^{T}W)^{-1}}\rbrace\], with W the matrix of \({\textbf{W_i}}\). Denote the last state \((W^{*}, Q^{*})\).
Compute vertex projection \(C_i^*\) by setting the largest coordinate in \({\textbf{W}_i}\) to and others to 0 (corresponding to minimize Hellinger distance between \(C_i^*\) and \({\textbf{W}_i}\)).
Run Metropolis-Hastings, and accept proposal \(C*\) if \[u <\frac{U(C^*)K(Q^*)}{U(C)K(Q)}\] where \(u\sim Uniform(0,1)\)
Sample \[\sigma_{t} \sim Inverse-Gamma\{\frac{(n_{m}-1)}{2}+2, \frac{\sum_{i,i'}d_{i,i'}^{m}}{2n_{m}} + \beta_{\sigma}\}\],if \(n_m\)>1; otherwise update from \(\sigma_{m} \sim Inverse-Gamma(2, \beta_{\sigma})\).
Sample (\((\pi_{1},\cdots,\pi_{g})) \sim Dir(\alpha+n_{(1)}, \cdots, \alpha+n_{(m)})\).
Sample _{h} using random-walk Metropolis.

To prevent a negative impact of the projection on the acceptance rate, we make \({\textbf{W}}_i\) very close to its vertex projection \(C_i^*\). This is done via a tempered softmax re-parameterization. \[w_{i,m} = \frac{exp(v_{i,m}/t)}{\sum_{m'=1}^{g}exp(v_{i,m'/t})}, m=1, \cdots, g\] At small t, if one \(v_{i,m}\) is slightly larger than the rest in \(v_{i,1}, \cdots, v_{i,g}\), then \(w_{i,m}\) will be close to 1. In this article, we choose t = 0,1 and prevent the leap-frog stepsize from being too small, so that during Hamiltonian updates, each \(W_i\) is very close to a vertex with high probability. We utilize the auto-differentiation toolbox in Tensorflow. To produce a point estimate of \(\hat c_{(n)}\), we minimize the variation of information as the loss function, using the algorithm provided in Wade and Ghahramani (2018). In the posterior computation of clustering models, a particular challenge has been the label-switching issue due to the likelihood equivalence when permuting the labels \(m \in {(1, \cdots, g)}\). This often leads to difficulties in diagnosing convergence and assessing the assignment probability \(pr(c_i = m)\). To completely circumvent this issue, we instead track the matrix product \(CC^T\) for convergence, which is invariant to the permutation of the columns of C (corresponding to the labels). The posterior samples of \(CC^T\) give an estimate of the pairwise co-assignment probabilities \(pr(c_i = c_{i'} ) = \sum_{m=1}^{g}pr(c_{i'}=m)\). To obtain estimates for \(pr(c_i = m)\), we use symmetric simplex matrix factorization (Duan, 2019) on \(\{pr(c_i =c_{i'}\}_{i,i'}\) to obtain an \(n\times k\) matrix corresponding to \(\{pr(c_i = m\}_{i,m}\). The factorization can be done almost instantaneously.