

Cluster analysis is an unsupervised learning problem that aims to find the underlying structure of a system without reference to a known grouping. A good clustering algorithm will partition observations into groups such that the dissimilarity of observations within groups is small or, equivalently, the dissimilarity between groups is large.

Agglomerative hierarchical clustering partitions observations by iteratively merging a selected pair of clusters, beginning with N individual clusters and ending with one single cluster. The result is a hierarchical grouping (a tree) with N - 1 levels. A user cuts the tree at a desired level to get cluster assignments.

The benefits of agglomerative hierarchical clustering are that:

  1. A user does not have to specify the number of clusters to partition the data, which is useful in situations where there is no known group structure for comparison.
  2. A user can simultaneously view partitions at different granularities, or levels of fineness.
  3. The method is visual and easy to interpret across all fields.

The algorithm takes as input an NxN matrix of pairwise dissimilarities between observations, called a “proximity matrix”, and a user-specified notion of group dissimilarity, called a “linkage method”. At each iteration the algorithm computes the dissimilarity between each pair of clusters and merges the two most similar clusters into a single group.


How it Works

A basic algorithm is as follows:

  1. Begin with N singleton clusters (each observation is its own cluster)
  2. For each cluster compute its dissimilarity to the other clusters (this depends on the linkage method specified by the user).
  3. Merge the two most similar clusters.
  4. Repeat steps 2 and 3 until the last two clusters are merged into one.

Several variations of this algorithm are discussed in the book Introduction to Information Retrieval.
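To make the steps concrete, here is a minimal, naive sketch of the procedure above in R using single linkage. The function name and toy data are made up for illustration; in practice one would use an optimized implementation such as hclust() (see Implementations below).

```r
# Naive sketch of agglomerative clustering with single linkage.
# For illustration only; hclust() is the standard implementation in R.
naive_agglomerative <- function(d) {
  D <- as.matrix(d)                  # N x N proximity matrix
  n <- nrow(D)
  clusters <- as.list(seq_len(n))    # step 1: N singleton clusters
  merges <- list()

  while (length(clusters) > 1) {
    # step 2: single-linkage dissimilarity between every pair of clusters
    best <- c(1, 2); best_d <- Inf
    for (a in seq_along(clusters)) {
      for (b in seq_along(clusters)) {
        if (a < b) {
          dab <- min(D[clusters[[a]], clusters[[b]]])
          if (dab < best_d) { best_d <- dab; best <- c(a, b) }
        }
      }
    }
    # step 3: merge the two most similar clusters and record the merge height
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    merges[[length(merges) + 1]] <- list(members = merged, height = best_d)
    clusters <- c(clusters[-best], list(merged))
  }
  merges                             # step 4: repeated until one cluster remains
}

# Toy example (assumed data): 10 observations in two dimensions
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
result <- naive_agglomerative(dist(x))
```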

Agglomerative hierarchical clustering runs in O(n^3) time at worst and O(n^2) at best, depending on the variation of the algorithm. Therefore, this method is extremely slow for large datasets and is not ideal for big data.


Choice of Dissimilarity Measure

As mentioned, hierarchical clustering is performed on an NxN dissimilarity matrix of observations, which can be computed with any notion of dissimilarity between pairs of observations:

For observations x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) we define their dissimilarity d_{xy} to be:

d_{xy} = \displaystyle \left(\sum_{i = 1}^{n} |x_i - y_i|^p\right)^{\frac{1}{p}}

for p \geq 1.

The most popular method is Euclidean distance, where p=2. One can also use Manhattan distance, where p=1.
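As a brief illustration, the proximity matrix can be computed in R with dist(), whose method and p arguments correspond to the formula above (the toy data here is assumed for demonstration):

```r
# Pairwise dissimilarities for a few toy observations
x <- matrix(rnorm(10), ncol = 2)                      # 5 observations, 2 features

d_euclidean <- dist(x, method = "euclidean")          # p = 2
d_manhattan <- dist(x, method = "manhattan")          # p = 1
d_minkowski <- dist(x, method = "minkowski", p = 3)   # general p

round(as.matrix(d_euclidean), 2)                      # the N x N proximity matrix
```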

However, not all types of data are expressed in Euclidean space. Ordinal, categorical and binary data all require different notions of dissimilarity. Furthermore, some linkage methods, such as Ward’s Method, assume Euclidean dissimilarities. Therefore, it is essential to choose a dissimilarity measure and a linkage method that best correspond to the data being clustered.


Linkage Methods

A user specifies a linkage method that decides how to measure dissimilarity between clusters in order for the algorithm to decide which clusters to merge, or “link”. Three popular linkage methods are single, complete, and average. There are several more, including Ward’s method, centroid linkage, and median linkage.

Dissimilarity between clusters is defined as follows:

Let C and C' represent two clusters. The dissimilarity d(C, C') is computed using the proximity matrix, or set of pairwise dissimilarities d_{ii'}. A pair (i, i') is two observations belonging to C and C', respectively.

Single Linkage

Single linkage defines dissimilarity between clusters as the smallest pairwise dissimilarity between the two groups:

d_{SL}(C, C') = \displaystyle \min_{i \in C, i' \in C'}d_{ii'}

Complete Linkage

Complete linkage defines dissimilarity between clusters as the largest pairwise dissimilarity between the two groups:

d_{CL}(C, C') = \displaystyle \max_{i \in C, i' \in C'}d_{ii'}

Average Linkage

Average linkage is a compromise between single and complete linkage, computing the group average dissimilarity:

d_{AL}(C, C') = \frac{1}{N_{C}N_{C'}}\sum_{i \in C}\sum_{i' \in C'}d_{ii'}

Each linkage method has its benefits and drawbacks. If the dissimilarities between observations have a strong clustering tendency (i.e. the clusters are well-defined), then each linkage method produces relatively similar results. If not, the methods can produce very different results.
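As a small illustration of the three formulas, each linkage dissimilarity can be computed directly from a proximity matrix for two clusters given as index vectors (the data and cluster memberships below are made up):

```r
# Toy proximity matrix for 8 observations
set.seed(2)
D <- as.matrix(dist(matrix(rnorm(16), ncol = 2)))

C1 <- c(1, 2, 3)                # members of cluster C  (assumed for illustration)
C2 <- c(4, 5)                   # members of cluster C'

d_single   <- min(D[C1, C2])    # smallest pairwise dissimilarity
d_complete <- max(D[C1, C2])    # largest pairwise dissimilarity
d_average  <- mean(D[C1, C2])   # group-average dissimilarity
```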


Dendrograms

The outcome of hierarchical clustering can be represented by a binary tree structure called a dendrogram. This is because agglomerative methods have the property that the dissimilarity between merged clusters is monotonically increasing (i.e. as groups get larger, they become more dissimilar). Therefore, the tree can be plotted such that the height of each parent node is proportional to the dissimilarity between the two clusters it merges.

Here is an example of three dendrograms of a hierarchical clustering of the rock data (built into R, accessed by running the command data("rock")), which has 48 measurements on petroleum rock samples, labeled 1 through 48:

Dendrograms are easy to interpret and therefore quite popular to use. The example above shows how different linkage methods produce different trees. A user decides at which level to cut the tree in order to produce k clusters.
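Dendrograms like these can be sketched in R with hclust() on the rock data; the use of plain Euclidean distance here is an assumption of the example:

```r
# Three dendrograms of the rock data, one per linkage method
data("rock")
d <- dist(rock)                  # Euclidean proximity matrix

par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  plot(hclust(d, method = m), main = paste(m, "linkage"), xlab = "", sub = "")
}
```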

Cutting the Tree

There is no correct or incorrect number of clusters to define - much of it depends on what a user considers a “natural” grouping to be. A clustering outcome can be measured by two properties: compactness, where observations within each cluster are similar to each other, and closeness, where observations grouped together are closer to each other than they are to observations in other groups.

Silhouette Coefficient

The silhouette coefficient (Rousseeuw 1987) averages, over all observations, a score comparing each observation’s compactness (its average dissimilarity to its own cluster) with its closeness to the nearest other cluster; the coefficient ranges from -1 to 1. For agglomerative hierarchical clustering, a silhouette coefficient can be computed for several cuts (k = 2, ..., N-1) and plotted, and the user selects the k with the maximum silhouette coefficient. The measure can be used with any distance metric and does not require computing cluster centers, making it well-suited to validating hierarchical clustering.

Plot of silhouette coefficients for hierarchical clustering of the rock data, with k ranging from 3 to 10


Here the highest silhouette coefficient is for k=4, so the user would select 4 clusters.
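A sketch of this procedure in R, assuming the silhouette() function from the cluster package (an assumption of this example, not something specified in the post); exact values depend on the linkage and dissimilarity chosen:

```r
library(cluster)

data("rock")
d  <- dist(rock)
hc <- hclust(d, method = "average")

# Average silhouette width for cuts k = 2, ..., 10
ks  <- 2:10
sil <- sapply(ks, function(k) {
  labels <- cutree(hc, k = k)                     # cut the tree into k clusters
  mean(silhouette(labels, d)[, "sil_width"])      # average silhouette width
})

plot(ks, sil, type = "b", xlab = "k", ylab = "Average silhouette width")
ks[which.max(sil)]                                # k with the highest coefficient
```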

Visually

Another way to validate results is simply by looking at how well the data clusters visually. A user can plot a heatmap of the data ordered by clustering assignments and inspect whether or not the clusters are well-formed, and how many clusters appear to form:

Heatmap of the proximity matrix of the rock data, ordered by hierarchical clustering

The presence of large squares of similar color indicates clusters. If the squares are well-defined (i.e. dark), then the data cluster well; if not, the data do not cluster well. In this case, there do not appear to be very strong clusters, and it is difficult to decide on an optimal number.
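A rough sketch of such a heatmap in R, using the base heatmap() function (the post suggests {superheat} below as a nicer alternative):

```r
# Heatmap of the proximity matrix, rows and columns ordered by the clustering
data("rock")
D  <- as.matrix(dist(rock))
hc <- hclust(dist(rock), method = "average")

heatmap(D, Rowv = as.dendrogram(hc), Colv = as.dendrogram(hc),
        scale = "none", main = "Proximity matrix ordered by clustering")
```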

Other methods

There are many other methods to consider, such as the Gap Statistic and methods based on sum-of-squares measures.


Implementations

Some implementations of this algorithm are {hclust} in R and scipy in Python. Both allow a user to plot dendrograms. To visualize results in R, consider the packages {dendextend} for prettier dendrograms and {superheat} for excellent heatmaps. plotly allows for interactive heatmaps and dendrograms in Python.
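For example, a dendrogram can be drawn and colored by a chosen cut with a few lines of R; the dendextend usage here follows the package’s standard interface and is an assumption of this sketch:

```r
library(dendextend)

data("rock")
hc   <- hclust(dist(rock), method = "average")
dend <- as.dendrogram(hc)

dend <- color_branches(dend, k = 4)   # color branches by a 4-cluster cut
plot(dend)
```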