Hierarchical clustering (HCA) is a method of cluster analysis that builds a hierarchy of clusters. Hierarchical clustering comes in two types:
Agglomerative: A bottom-up approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy, until all observations belong to a single cluster.
Divisive: A top-down approach in which all observations begin in the same cluster and clusters are split recursively as one moves down the hierarchical structure.
In both types of algorithms, the merge or split at each step is chosen to be locally optimal (i.e., these algorithms are greedy).
A benefit of this stepwise, greedy construction is that the results are often accompanied by a visualization of the sequence of merges or splits, known as a dendrogram.
In this section, we discuss only agglomerative hierarchical clustering.
The choice of distance metric is important as clustering with different metrics can result in different agglomerations. While there are some classic distance metrics such as the Euclidean metric \[d(\mathbf{u},\mathbf{v}) = \sqrt{\sum_{i=1}^n (u_i - v_i)^2},\] and the Manhattan metric \[d(\mathbf{u},\mathbf{v}) = \sum_{i=1}^n |u_i-v_i|, \] with \(\mathbf{u}\) and \(\mathbf{v} \in \mathbb{R}^n\), some other choices for measuring similarity (or dissimilarity) are also often used.
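As a quick sanity check of these formulas, we can compute both distances for a pair of made-up vectors directly from the definitions and with R's dist function; the vectors here are purely illustrative.
u <- c(1, 2, 3)
v <- c(4, 0, 3)
sqrt(sum((u - v)^2)) # Euclidean distance from the formula: sqrt(13)
dist(rbind(u, v), method = "euclidean") # Same value from dist()
sum(abs(u - v)) # Manhattan distance from the formula: 5
dist(rbind(u, v), method = "manhattan") # Same value from dist()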
For categorical variables, it is often better to define similarity or dissimilarity based on how many binary variables agree between two observations. We define the matching coefficient as \[\text{Matching Coefficient} = MC(\mathbf{u},\mathbf{v})= \dfrac{\text{# of variables with matching values in } \mathbf{u} \text{ and } \mathbf{v}}{\text{total number of variables}}.\] Thus, a measure of dissimilarity of \(\mathbf{u}\) and \(\mathbf{v}\) is given by \[d(\mathbf{u},\mathbf{v}) = 1 - MC(\mathbf{u}, \mathbf{v}).\]
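A minimal sketch of this computation, using a small helper function written only for illustration (matching_dissim is not a base R function) and two made-up binary vectors:
matching_dissim <- function(u, v) 1 - mean(u == v) # 1 - matching coefficient
u <- c(1, 0, 1, 1, 0)
v <- c(1, 1, 1, 0, 0)
matching_dissim(u, v) # 3 of 5 variables match, so the dissimilarity is 1 - 3/5 = 0.4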
Consider the variable DrivesAVan. A 0 entry does not indicate that the person observed does not drive. In fact, that person may well drive a car - or they may not drive at all, or they may drive a truck, or… Sometimes, an entry of 0 does not add any information to a dissimilarity analysis. Hence, we define the Jaccard coefficient \[\text{Jaccard Coefficient} = JC(\mathbf{u},\mathbf{v}) = \dfrac{\text{Number of variables with matching 1 values in } \mathbf{u} \text{ and } \mathbf{v}}{\text{total number of variables} - \text{number of variables with matching 0 values}}.\] Thus, the dissimilarity measure for the Jaccard coefficient is defined to be \[d(\mathbf{u}, \mathbf{v}) = 1 - JC(\mathbf{u},\mathbf{v}).\]
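A corresponding sketch for the Jaccard dissimilarity, again with a small helper written only for illustration (jaccard_dissim is not a base R function); note that base R's dist with method = "binary" returns the same quantity for 0/1 data.
jaccard_dissim <- function(u, v) {
  both_one  <- sum(u == 1 & v == 1) # variables with matching 1 values
  both_zero <- sum(u == 0 & v == 0) # variables with matching 0 values
  1 - both_one / (length(u) - both_zero)
}
u <- c(1, 0, 1, 1, 0)
v <- c(1, 1, 1, 0, 0)
jaccard_dissim(u, v) # 1 - 2/(5 - 1) = 0.5
dist(rbind(u, v), method = "binary") # 0.5 as well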
In hierarchical clustering, some of the measures that are commonly used (aside from Jaccard, matching, Manhattan and Euclidean) are cosine (essentially comparing the angle between vectors), Dice (similar to Jaccard, but with double weight given to agreeing items), Pearson correlation, and others. Type ?dist into the R console to see the metrics available in base R (or see the R documentation on the dist function); measures such as cosine and Dice are available through add-on packages such as proxy.
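For instance, cosine similarity can be computed directly from its definition; this short sketch (with made-up vectors) shows that two vectors pointing in the same direction have a cosine dissimilarity of 0.
u <- c(1, 2, 3)
v <- c(2, 4, 6) # v is a scalar multiple of u, so the angle between them is 0
cos_sim <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
1 - cos_sim # Cosine dissimilarity: 0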
The linkage criterion determines how we compute the distance between clusters from the distances between their individual members. Let \(A\), \(B\) and \(C\) be clusters. Here are some examples of linkage criteria for comparing distances:
Maximum or complete linkage: \(\max \{d(a,b) | a \in A, b \in B\}\).
Minimum or single linkage: \(\min \{d(a,b) | a \in A, b \in B\}\).
Unweighted average linkage (UPGMA): \(\dfrac{1}{|A|\cdot|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)\).
McQuitty linkage (WPGMA): \(d(A \cup C, B) = \dfrac{d(A,B) + d(C,B)}{2}\).
Centroid linkage (UPGMC): \(||c_A - c_B||\) where \(c_A\) and \(c_B\) are the centroids of \(A\) and \(B\) respectively.
Ward’s linkage: At each step, the pair of clusters whose merger produces the minimum increase in total within-cluster variance is merged.
Selecting a different linkage can, and often will, result in a different clustering. For example, here are four dendrograms with different linkage methods applied.
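A sketch of how such a comparison might be produced; the data here are simulated purely for illustration.
set.seed(1) # Reproducible simulated data
x <- dist(matrix(rnorm(30 * 2), ncol = 2)) # Pairwise distances for 30 random points
op <- par(mfrow = c(2, 2)) # 2 x 2 grid of plots
for (m in c("single", "complete", "average", "ward.D2")) {
  plot(hclust(x, method = m), main = m, cex = 0.6)
}
par(op) # Restore the plotting layout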
We will cluster the USArrests data frame. First, we remove missing values and scale the data.
data("USArrests") #US Arrests Data Set
df <- na.omit(USArrests) #Remove any missing data
df <- scale(df) #Scale the data set
set.seed(123) #Reproducible seed
head(df, n=5)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
Next, we need to calculate the dissimilarity between each of the data values. Here, we use the Euclidean metric; however, we could use a variety of metrics. The command ?dist will show you a list of the available metrics.
DistAnalysis <- dist(df, method = "euclidean") #Dissimilarity Matrix
The dissimilarity matrix records the dissimilarity between each pair of data points. Note that DistAnalysis is stored as a lower-triangular dist object rather than as a full symmetric matrix.
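One way to display the first 5 rows and columns in symmetric form is to coerce the dist object with as.matrix, for example:
as.matrix(DistAnalysis)[1:5, 1:5] # First 5 rows and columns of the symmetric form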
## Alabama Alaska Arizona Arkansas California
## Alabama 0.000000 2.703754 2.293520 1.289810 3.263110
## Alaska 2.703754 0.000000 2.700643 2.826039 3.012541
## Arizona 2.293520 2.700643 0.000000 2.717758 1.310484
## Arkansas 1.289810 2.826039 2.717758 0.000000 3.763641
## California 3.263110 3.012541 1.310484 3.763641 0.000000
The next thing we need to consider is the type of linkage we are going to use. The values that the hclust command recognizes include “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC). Once we select a method, we use it in the hclust command and then plot the results.
HclustAnalysis <- hclust(d = DistAnalysis,
method = "ward.D2")
plot(HclustAnalysis, cex = 0.6, hang = -1)
Alternatively, we can use an fviz_dend-style dendrogram from the factoextra package.
library(factoextra) #Provides fviz_dend, fviz_nbclust, and fviz_cluster
fviz_dend(HclustAnalysis, cex = 0.4)
We can also use ggdendrogram.
library(ggplot2)
library(ggdendro)
ggdendrogram(HclustAnalysis)
At this point, we can select the number of clusters based on the height at which we feel comfortable cutting the tree. Base the decision on the vertical heights in the dendrogram: the higher the link joining two groups, the more dissimilar those groups are. Choosing four clusters makes sense when looking at the large vertical gaps in the previous dendrograms.
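If you prefer numbers to eyeballing the plot, the merge heights are stored in the hclust object, and large jumps between successive heights suggest natural places to cut; a minimal sketch:
tail(HclustAnalysis$height, 6) # Heights of the last few merges
diff(tail(HclustAnalysis$height, 6)) # Gaps between successive merge heights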
Of course, any approach to determining the optimal number of clusters that worked for k-means will also work here - at least as a starting point.
fviz_nbclust(df, FUN = hcut, method = "wss")
Again, selecting 4 clusters seems appropriate based on the elbow.
The cutree function plays the role that \(k\) plays in the k-means algorithm: it cuts the tree into a specified number of groups. So, if we want to group our data into 4 subsets, we use the command below.
group <- cutree(HclustAnalysis, k=4) #Group the data into clusters
table(group) #Show how many data points in each group
## group
## 1 2 3 4
## 7 12 19 12
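Alternatively, cutree can cut the tree at a chosen height rather than at a chosen number of groups; the height of 5 below is purely illustrative.
group_h <- cutree(HclustAnalysis, h = 5) # Cut at a (hypothetical) height of 5
table(group_h) # Number of data points in each resulting group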
We can attach the group to the data.
USArrests$cluster <- group
head(USArrests,5)
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 2
## Arizona 8.1 294 80 31.0 2
## Arkansas 8.8 190 50 19.5 3
## California 9.0 276 91 40.6 2
Here are two graphics that can accomplish the task of visualization, taking advantage of code we have already used.
plot(HclustAnalysis, cex = 0.6, hang = -1)
rect.hclust(HclustAnalysis, k = 4, border = 2:5) #start the border with 2 to avoid a black outline.
# Plot full dendrogram
fviz_dend(
HclustAnalysis,
k = 4,
horiz = TRUE,
rect = TRUE,
rect_fill = TRUE,
rect_border = "jco",
k_colors = "jco",
cex = 0.351
)
As with k-means, we can use fviz_cluster to visualize the result in a scatter plot using PCA.
fviz_cluster(list(data = df, cluster = group))
The statistics from the analysis can be output using cluster.stats from the fpc package.
library(fpc)
cluster.stats(DistAnalysis, group)
## $n
## [1] 50
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 7 12 19 12
##
## $min.cluster.size
## [1] 7
##
## $noisen
## [1] 0
##
## $diameter
## [1] 2.337465 3.290377 3.088343 2.401992
##
## $average.distance
## [1] 1.352945 1.708815 1.535869 1.223130
##
## $median.distance
## [1] 1.283191 1.608977 1.471118 1.165197
##
## $separation
## [1] 1.2413874 1.1654171 0.5279092 0.5279092
##
## $average.toother
## [1] 3.045407 3.244007 2.624701 3.162359
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 1.273914 1.2413874 2.2494023
## [2,] 1.273914 0.000000 1.1654171 2.7568751
## [3,] 1.241387 1.165417 0.0000000 0.5279092
## [4,] 2.249402 2.756875 0.5279092 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 2.564101 2.911942 3.738033
## [2,] 2.564101 0.000000 2.838821 4.282163
## [3,] 2.911942 2.838821 0.000000 2.243024
## [4,] 3.738033 4.282163 2.243024 0.000000
##
## $average.between
## [1] 2.987747
##
## $average.within
## [1] 1.476709
##
## $n.between
## [1] 901
##
## $n.within
## [1] 324
##
## $max.diameter
## [1] 3.290377
##
## $min.separation
## [1] 0.5279092
##
## $within.cluster.ss
## [1] 57.9427
##
## $clus.avg.silwidths
## 1 2 3 4
## 0.4615690 0.2924672 0.2626758 0.4266255
##
## $avg.silwidth
## [1] 0.3370187
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5826496
##
## $dunn
## [1] 0.1604403
##
## $dunn2
## [1] 1.312619
##
## $entropy
## [1] 1.327954
##
## $wb.ratio
## [1] 0.4942553
##
## $ch
## [1] 36.534
##
## $cwidegap
## [1] 1.0476313 2.0580889 1.1802929 0.9824857
##
## $widestgap
## [1] 2.058089
##
## $sindex
## [1] 0.7068277
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Of course, common sense must be used when selecting linkages and distance metrics. If we want to compare two different linkages, we can create a tanglegram.
library(dendextend)
hc1 <- hclust(DistAnalysis, method = "complete")
hc2 <- hclust(DistAnalysis, method = "average")
# Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
dend_list <- dendlist(dend1, dend2)
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = TRUE, # Turn-on line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)
The output displays the links between each pair of leaves in the two trees. The quality of the alignment of the two trees can be measured using the function entanglement. Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement); a lower entanglement coefficient corresponds to a better alignment.
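If the initial alignment is poor, the dendextend package also provides untangle, which searches for leaf orderings that reduce the entanglement; a sketch, assuming the "step2side" search method (one of several options untangle offers):
entanglement(dend_list) # Entanglement of the original alignment
dend_list2 <- untangle(dend_list, method = "step2side") # Search for a better leaf ordering
entanglement(dend_list2) # Usually lower after untangling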
If you have entirely categorical data with a finite number of categories, it may not make much sense to run a hierarchical clustering.
Consider a situation where you have 3 binary variables. You can end up with at most 8 clusters, since there are only 8 possible combinations of values. Moreover, the distances between \((1,0,0)\), \((0,1,0)\) and \((0,0,1)\) are all the same, so there is a question of the order in which they would be clustered.
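The short sketch below lists all \(2^3 = 8\) possible observations and confirms that the three "one-hot" points are mutually equidistant under the Euclidean metric.
pts <- expand.grid(v1 = 0:1, v2 = 0:1, v3 = 0:1) # All 8 possible binary patterns
dist(pts[c(2, 3, 5), ]) # (1,0,0), (0,1,0), (0,0,1): every pairwise distance is sqrt(2)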
However, if your categorical variables have ordered levels, a hierarchical clustering might make sense. As always, you need to be smart. In some cases, coding “good”, “better”, “best” as “1”, “2”, “3” might work. In other cases, implying that “best” provides three times the benefit of “good” may not.
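A sketch of this kind of coding with an ordered factor; the variable and its values are made up, and the equal 1/2/3 spacing is an assumption rather than something the data dictate.
quality <- factor(c("good", "best", "better", "good"),
                  levels = c("good", "better", "best"),
                  ordered = TRUE) # Made-up ordinal variable
as.numeric(quality) # Codes the levels as 1, 3, 2, 1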
If you have a mix of different data types, you can consider hierarchical clustering - but again, think about the potential problems ahead of your analysis. Do not simply load a data set into R and begin clustering without considering what the results mean.
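For mixed numeric and categorical data, one common option is the Gower dissimilarity from the cluster package, which hclust accepts directly; here is a sketch using iris only as a stand-in mixed data set (four numeric columns plus a factor).
library(cluster) # Provides daisy
d_mixed <- daisy(iris, metric = "gower") # Gower handles numeric and factor columns together
hc_mixed <- hclust(d_mixed, method = "average")
plot(hc_mixed, cex = 0.3, hang = -1)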
Hierarchical clustering has some advantages:
We do not need to pre-specify the number of clusters.
Once the clustering is complete, we get a nice graphic (the dendrogram) that supports the process.
It also has some disadvantages:
There are a large number of decisions you have to make that impact your results (the distance metric and linkage type, for example).
The choice of linkage method significantly impacts the results.
While we do not pre-select the number of clusters, we do need to make a judgment as to where the cluster cut-off is.
The method has time complexity of \(O(n^3)\), where \(n\) is the number of data points, which means that it slows down very quickly for large sets of data.
Best practice is to attempt several different clusterings based on changing the linkage and distance metric. Keep in mind that analysis like this is not “correct/incorrect” but rather a method of developing insight about the data.