This tutorial will introduce hierarchical clustering, explain how the agglomerative algorithm works, and walk through two examples in R: crime statistics for the US states and a small data set of animal attributes.
Hierarchical clustering is excellent at portraying the cluster structure of a data set. It has an advantage over K-means clustering in that it does not require you to choose a specific value of K in advance. It also produces a dendrogram, an "upside-down tree" that makes it easy to see where relationships and deviations occur within a data set.
There are two main types of hierarchical clustering: agglomerative and divisive.
Agglomerative clustering works from the bottom up: each observation starts as its own leaf, and as the algorithm runs these leaves are joined into branches of progressively larger clusters.
Divisive clustering is the opposite and works from the top down: it begins with every observation in a single cluster and repeatedly splits clusters until each observation is in its own cluster.
Throughout the remainder of this tutorial we will use the agglomerative hierarchical clustering method.
The hierarchical clustering algorithm works by defining a dissimilarity measure between each pair of observations (Euclidean distance is commonly used). Starting from the bottom of the dendrogram, where each observation is its own cluster, it joins the two clusters that are most similar to one another and repeats this process until every observation has been merged and the dendrogram is complete.
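To make this concrete, here is a minimal sketch (using a tiny made-up matrix, not the data set analyzed later) of how a dissimilarity matrix feeds the agglomerative algorithm in base R:
# Toy illustration: compute pairwise Euclidean dissimilarities, then let
# hclust() merge the two closest clusters at each step.
set.seed(1)
x <- matrix(rnorm(10), ncol = 2)      # 5 hypothetical observations, 2 variables
d <- dist(x, method = "euclidean")    # pairwise dissimilarity matrix
hc <- hclust(d, method = "complete")  # agglomerative hierarchical clustering
hc$merge                              # which clusters were joined at each step
hc$height                             # dissimilarity at which each merge occurred
plot(hc)                              # the completed dendrogram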
There are four linkage types. Linkage defines how the dissimilarity between two clusters that each contain multiple observations is measured, and the options are categorized below:
Complete: uses the maximum distance between points belonging to the two different clusters
Single: uses the minimum distance between points belonging to the two different clusters
Average: averages the distances over all possible pairs of points, one from each cluster
Centroid: uses the distance between the centroid of cluster A and the centroid of cluster B
Average and complete linkage are generally preferred, as they tend to produce more balanced dendrograms.
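As a quick sketch (previewing the USArrests data used in the next section), the linkage type is simply the method argument of base R's hclust(); the tutorial's own analysis below uses agnes() instead:
# Sketch: cluster the same distance matrix with each of the four linkages
d <- dist(scale(USArrests))                    # Euclidean distances on standardized data
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")
hc_centroid <- hclust(d, method = "centroid")
par(mfrow = c(1, 2))
plot(hc_complete, cex = 0.6, main = "Complete linkage")
plot(hc_single, cex = 0.6, main = "Single linkage")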
Let's take a look at an example. First, let's load the USArrests data set, which contains arrest statistics per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973, as well as the percentage of each state's population living in urban areas. Hierarchical clustering of this data set can show us which areas have higher or lower crime rates.
We start by omitting missing values and scaling the data set to normalize it.
Scaling the USArrests data calculates the mean and standard deviation of each column, then standardizes each element by subtracting its column's mean and dividing by its column's standard deviation.
df <- USArrests
df <- na.omit(df)   # drop any rows with missing values
df <- scale(df)     # standardize each column (mean 0, standard deviation 1)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
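As a quick aside (not part of the original code), we can verify what scale() did by reproducing the first scaled Murder value by hand:
# Standardize Alabama's Murder value using the column mean and standard deviation
(USArrests$Murder[1] - mean(USArrests$Murder)) / sd(USArrests$Murder)
## [1] 1.242564
This matches the Alabama entry in the Murder column of the scaled output above.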
Next, we can determine which linkage method produces the highest agglomerative coefficient (a measure of how strong the clustering structure is, with values closer to 1 indicating stronger structure) so that we can use that method for our hierarchical clustering. The agnes() function from the cluster package reports this coefficient.
library(cluster)   # provides agnes() and pltree()

# linkage methods to compare
m <- c("average", "single", "complete", "ward")
names(m) <- m

# function to compute the agglomerative coefficient for a given linkage method
ac <- function(x) {
  agnes(df, method = x)$ac
}

# calculate the agglomerative coefficient for each linkage method
sapply(m, ac)
## average single complete ward
## 0.7379371 0.6276128 0.8531583 0.9346210
Since Ward's method has the highest coefficient, we can use it when generating our dendrogram.
Ward's minimum variance method merges, at each step, the pair of clusters whose merge increases the total within-cluster variance the least, keeping that growth as small as possible.
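As an aside, base R offers the same criterion through hclust(); here is a small sketch, and the result should closely match the agnes() call below:
# Ward's criterion on the Euclidean distance matrix, using base R
hc_ward <- hclust(dist(df), method = "ward.D2")
plot(hc_ward, cex = 0.6, hang = -1, main = "Ward's method via hclust()")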
Now let's generate the dendrogram:
#perform hierarchical clustering using Ward's minimum variance
clust <- agnes(df, method = "ward")
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram", xlab="", sub="")
Each state listed along the bottom is a leaf, and as you move up the tree, notice how different leaves are joined together to form branches. This is important because it communicates how certain points in our data set are related. The height of these branches, measured on the vertical axis, indicates how different the observations are from one another: observations joined toward the bottom of the tree are more similar to each other than those joined toward the top.
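If you want to inspect those heights directly, one option (an optional aside, not part of the original code) is to convert the agnes object and look at the merge heights:
# Small heights near the bottom mean very similar observations were joined;
# the largest heights correspond to the final, most dissimilar merges.
heights <- sort(as.hclust(clust)$height)
head(heights)   # earliest (most similar) merges
tail(heights)   # last (most dissimilar) merges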
Several other packages can be used to visualize clusters as well. For example, using the factoextra package we can draw the dendrogram sideways, which may be more helpful in discerning clusters.
library(factoextra)   # provides fviz_dend()

suppressWarnings(
  fviz_dend(clust, cex = 0.5, horiz = TRUE)
)
Furthermore, we can add colors to clearly define the clusters formed within the dendrogram. We cut the tree into k groups and use k_colors to set the color of each cluster.
fviz_dend(clust, k = 4,
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
ggtheme = theme_gray() # Change theme
)
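To work with these four groups directly rather than just color them, one option (a small sketch, not part of the original code) is to cut the tree with cutree():
# Cut the agnes tree into 4 groups and see how many states fall in each
groups <- cutree(as.hclust(clust), k = 4)
table(groups)
head(names(which(groups == 1)))   # a few states assigned to the first group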
Now let's move on to an example about animal attributes.
The animals data set from the cluster package contains six binary attributes (warm-blooded, can fly, vertebrate, endangered, lives in groups, and has hair) for 20 animals. Hierarchical clustering in this case can be useful for grouping similar animals and possibly revealing patterns in which groups of animals are becoming endangered or extinct.
Here we load the data and define descriptive column names for each attribute:
animals <- cluster::animals
colnames(animals) <- c("warm-blooded",
"can fly",
"vertebrate",
"endangered",
"live in groups",
"have hair")
Next, we can compute hierarchical clusterings with hclust and display the resulting dendrograms alongside a heatmap using the gplots package.
Ward's method is used for the row (animal) dendrogram, while complete linkage was chosen for the column (attribute) dendrogram because it tends to produce tighter clusters.
library(gplots)       # heatmap.2()
library(dendextend)   # color_branches()
library(colorspace)   # diverge_hcl()
library(magrittr)     # the %>% pipe

# Define the color palette function for the heatmap
some_col_func <- function(n) diverge_hcl(n, h = c(246, 40), c = 96, l = c(65, 90))
# Calculate hierarchical clustering for rows and columns
dend_r <- animals %>% dist(method = "man") %>% hclust(method = "ward.D") %>% as.dendrogram
dend_r <- color_branches(dend_r, k = 4)
dend_c <- t(animals) %>% dist(method = "man") %>% hclust(method = "complete") %>% as.dendrogram
dend_c <- color_branches(dend_c, k = 3)
# Set margins
par(mar = c(3, 3, 1, .5)) # You can adjust these margin values
# Create the heatmap
heatmap.2(
as.matrix(animals - 1),
Rowv = dend_r,
Colv = dend_c,
trace = "row",
hline = NA,
tracecol = "darkgrey",
key.xlab = "no / yes",
denscol = "grey",
density.info = "density",
col = some_col_func
)
Heatmaps like this are a great way to visualize the data itself alongside the clustering. A heatmap sorts the rows and columns of a matrix according to hierarchical clusterings produced by hclust: it first treats the rows of the matrix as observations and clusters them, then treats the columns as observations and clusters those as well (above, we supplied these dendrograms explicitly through Rowv and Colv). In the end you get a dendrogram attached to both the rows and the columns of the matrix, which helps you spot patterns in the data.
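For comparison, here is a one-line sketch of that default behaviour using base R's heatmap(), which clusters the rows and columns itself when no dendrograms are supplied:
# Default behaviour: heatmap() computes its own row and column clusterings via hclust()
heatmap(as.matrix(animals - 1), scale = "none")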
The heatmap above portrays the different attributes of the animals. From the dendrogram you can observe that the cold-blooded non-vertebrates in this data set are not endangered. It can also be seen that the warm-blooded vertebrates without hair form a group in which some can fly and some are endangered.
Clustering is great in the unsupervised setting (where there is no target variable and no model to train), but it is important to note that decisions must be made about the appropriate dissimilarity measure, linkage, and cut height when building dendrograms and clusters. It can sometimes be difficult to conclude that the clusters are a true representation of subgroups in the data, and further analysis and tests will be needed to verify them. Clustering on subsets of the data can help test the robustness of the results, as sketched below.
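One rough way to probe that robustness (a sketch, not a formal test) is to recluster a random subset of the states and compare the assignments with those from the full data:
# Cluster the full data and a random subset, then cross-tabulate the assignments.
# Cluster numbers are arbitrary labels, so look for rows that map mostly to a
# single column rather than for matching numbers.
set.seed(42)
keep <- sample(rownames(df), 35)
full_groups <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)
sub_groups  <- cutree(hclust(dist(df[keep, ]), method = "ward.D2"), k = 4)
table(full = full_groups[keep], subset = sub_groups)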
Overall, I think hierarchical clustering is a great tool for a first look at your data set, giving a quick view of where relationships form. This can lead to hypotheses and further examination of those observations, helping you gain a more detailed understanding of your data.