In this tutorial we will explore a clustering technique called hierarchical clustering, specifically focusing on bottom-up hierarchical clustering. First, we will look at examples from a synthetic dataset to understand the concept. Then, we will apply the clustering to a real-life dataset from the Palmer Penguins package.
Clustering refers to a broad set of methods that seek to group like observations together to create groups or clusters without requiring labelled data. This is an unsupervised technique, i.e. it works without labels. Rather than learning from predefined categories, clustering algorithms try to reveal hidden structure by grouping similar observations. As we will see, hierarchical clustering groups observations iteratively based on proximity and produces a tree-like dendrogram which helps visualize the results.
A real-life application of hierarchical clustering could be organizing music. For example, if you are Spotify, it would be useful to categorize all available music into broad groups, and further categorize each group into smaller subgroups. These categorizations would then be useful when recommending music to users based on their listening history, pulling music from the same or a nearby cluster.
Let us start by creating a synthetic dataset with 9 observations to explore the concept of hierarchical clustering. This dataset contains heights and widths of three types of plants: Asters, Tulips, and Crocuses. Asters tends to be quite large and bushy, while tulips and crocuses are small, flowering as single stems. Crocuses are very short, while tulips are of a medium height. Overall, the three kinds of plants are very different from each other in terms of height and width and should form distinct groups.
df <-data.frame(
width = c(12,15,18,6,7,5,1,2,1),
height = c(5,4,4.5,2,3,2,0.5,0.3,0.2),
plant_family = c("Aster","Aster","Aster","Tulip","Tulip","Tulip","Crocus","Crocus","Crocus")
)
df$obs_num <- 1:nrow(df)
df
## width height plant_family obs_num
## 1 12 5.0 Aster 1
## 2 15 4.0 Aster 2
## 3 18 4.5 Aster 3
## 4 6 2.0 Tulip 4
## 5 7 3.0 Tulip 5
## 6 5 2.0 Tulip 6
## 7 1 0.5 Crocus 7
## 8 2 0.3 Crocus 8
## 9 1 0.2 Crocus 9
Let’s look plot the widths and heights in a scatterplot color-coded by plant type.
ggplot(df, aes(x = width, y = height, color = plant_family)) +
geom_point() +
geom_text_repel(aes(label = obs_num), size = 3) +
labs(title = "Width vs Height of Plants", x = "Width (inches)", y = "Height (ft)", color = "Plant Type") +
theme_minimal()
As expected, the three kinds of plants form distinct clusters. Now, we
can try the method of hierarchical clustering to see if it can correctly
cluster these plants. But first, let’s talk about how hierarchical
clustering works.
Initially, every data point is treated as its own cluster. Then, the distance between each of the points is measured using a distance metric which you can specify (Euclidean, Manhattan, etc.), and the two closest points are fused together into a cluster. So, in this case, we would start out with 9 clusters, then the two closest points would become a cluster, resulting in 8 clusters. This process will continue until we run out of observations and we are left with one cluster.
To get an intuitive understanding, let’s look at a distance matrix that shows the pairwise distance between each plant. In this case, I have chosen Euclidean distance:
df_sub <- df %>% select(width, height)
dist_matrix <- dist(df_sub, method = "euclidean")
dist_table <- as.matrix(dist_matrix)
# Row and Column Names
rownames(dist_table) <- as.character(1:nrow(dist_table))
colnames(dist_table) <- as.character(1:ncol(dist_table))
pheatmap(
dist_table,
cluster_rows = FALSE,
cluster_cols = FALSE,
display_numbers = TRUE,
number_format = "%.2f",
color = colorRampPalette(brewer.pal(5, "Blues"))(100),
labels_row = rownames(dist_table),
labels_col = colnames(dist_table),
fontsize_row = 8,
fontsize_col = 8,
main = "Distance Matrix Heatmap"
)
This distance matrix is color-coded to help parse through the large
table. Lighter colors indicate a smaller distance or high similarity,
while darker colors indicate a higher distance or a small similarity.
Note that this is a symmetric matrix, so looking at either the top half
or the bottom half will suffice. We can also ignore the diagonal. Two
patterns emerge: 1. We see 3x3 areas with light colors representing the
plant type 2. It looks like plants 1-3 (Asters) are very different from
the remaining plants, while plants 4-9 are closer to each other. This is
consistent with the scatterplot that we saw above.
Let’s find the three smallest distances (that are not self distances) by focusing on the light blue boxes: The smallest distance is 0.3 between plants 7 and 9, followed by 1.00 between plants 8 and 9 as well as plants 4 and 6, followed by 1.02 between plants 7 and 8. We’ll keep this in mind as we look at the dendrogram produced by hierarchical clustering. In order to perform hierarchical clustering, we simply call the hclust function and provide the distance matrix that we created above. We also specify a method which we will get to shortly. Then, we can plot the results of the clustering using the plot command, which produces a dendrogram.
hc <- hclust(dist_matrix, method = "average")
plot(hc, main = "Hierarchical Clustering of Plants")
The dendrogram can be read from the bottom to the top, scanning in
horizontal lines. The height of the fusion tells us how similar the
points are. So, for example, the fusion at the lowest point is 7 and 9.
This tells us that 7 and 9 are very similar, and additionally that 7 and
9 are the most similar points and therefore are the first to be fused
into a cluster. This is in line with the distance matrix we saw above,
where 7 and 9 were the closest points. On the other hand, the last
fusion to take place is between the cluster containing plant 1, 2, and
3, and the cluster containing the remaining plants. This means that the
two clusters are quite different.
As we move up from the first fusion, it is hard to tell which fusion happens next so we can add in a reference line at 1 using the abline function:
hc <- hclust(dist_matrix, method = "average")
dend <- as.dendrogram(hc)
plot(dend, main = "Hierarchical Clustering of Plants")
abline(h = 1, col = "blue", lty = 2, lwd = 1)
We see that the next closest cluster is 4 and 6, as well as 8 fused with
the previous cluster of 7 and 9. This brings up an important question:
how do we calculate the distance between the lone plant 8 and the group
of plants 7 and 9? The distance matrix we calculated above only
accounted for pairwise distances. To do this, we define linkages which
measure similarity between groups of observations.
We can have different types of linkages. The commonly used linkages are listed below: 1. Complete: Compute the pairwise distances between all the points in each cluster and record the largest distance. 2. Single: Compute the pairwise distances between all the points in each cluster and record the smallest distance. 3. Average: Compute the pairwise distances between all the points in each cluster and record the average distance. 4. Centroid: Calculate the centroid of each cluster and measure the distance betweeen the centroids.
As you might recall, I chose the average linkage for this example, but you might choose a different linkage for your analysis. Keep in mind that different linkages will result in different dendrograms, and that centroid linkages in particular tend to lead to inversions which are not desirable.
Going back to the dendrogram, the horizontal lines serve another purpose: they divide the dendrograms into clusters. For example, if we make a line at height 5, it divides the plants into the three clusters corresponding to plant type. We can implement colored branches in the dendrogram by using the color_branches function and specifying the height from the dendextend package.
dend_colored <- color_branches(dend, h = 5)
plot(dend_colored, main = "Dendrogram Colored by Height \n Plants") +
abline(h = 5, col = "black", lty = 2, lwd = 1)
## integer(0)
But if we draw a horizontal line at a different level, we’ll get a different number of clusters. For example:
dend_colored <- color_branches(dend, h = 4)
plot(dend_colored, main = "Dendrogram Colored by Height") +
abline(h = 4, col = "black", lty = 2, lwd = 1)
## integer(0)
A line at height 4 divides the plants into four clusters: 1 falls into its own cluster, plants 2 and 3 form the second cluster, plants 7, 8, and 9 form the third cluster, and finally, plants 4, 5, and 6 form the fourth cluster.
Therefore, hierarchical clustering can give you any number of clusters depending on the height chosen.
Now that we know how hierarchical clustering works, let’s try to apply it to the penguins dataset, which contains body measurements (bill length and depth, flipper length, body mass) for three species of penguins. We start by dropping missings from the dataset, which gives us 333 observations.
df <- penguins %>% na.omit()
df
## # A tibble: 333 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # ℹ 323 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Next, we can look at the number of penguins in each species. We can then use hierarchical clustering to see if the body measurements can distinguish the penguins into distinct groups.
df %>% count(species)
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 146
## 2 Chinstrap 68
## 3 Gentoo 119
We can focus on the bill depth and length variables for the body measurements. Here is a scatterplot colored by species:
ggplot(df, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(size = 3, alpha = 0.8) +
theme_minimal() +
labs(
title = "Bill Length vs Bill Depth by Species",
x = "Bill Length (mm)",
y = "Bill Depth (mm)",
color = "Species"
) + theme_minimal()
We see that the three species do indeed form relatively distinct
clusters. We can then implement hierarchical clustering to see if it is
able to distinguish this hidden structure.
As we did above, we can make a distance matrix using the body measurement variables and perform hierarchical clustering.
penguin_sub = df %>% select(bill_length_mm, bill_depth_mm)
dist_matrix <- dist(penguin_sub, method = "euclidean")
hc <- hclust(dist_matrix, method = "average")
dend <- as.dendrogram(hc)
dend_colored <- color_branches(dend, h = 8)
labels(dend_colored) <- rep("", length(labels(dend_colored)))
plot(dend_colored, main = "Dendrogram Colored by Height \n Penguins")
abline(h = 8, col = "red", lty = 2, lwd = 1)
A horizontal cut at height 8 gives us three clusters. We can then
compare the groups created by the hierarchical clustering against the
species in the dataset. The cutree function gives the clusters from the
dendrogram when provided with the cut heigh, and we can compare these
against the species variable from the dataset.
clusters <- cutree(hc, h = 8)
species <- df$species
table(clusters, species)
## species
## clusters Adelie Chinstrap Gentoo
## 1 142 6 0
## 2 4 61 115
## 3 0 1 4
We see that the clusters do not seem to correspond to the species particularly well. Cluster 1 does seem to correspond to the Adelie penguins, but cluster 2 captures both the Gentoo and the Chinstraps, while cluster 3 only captures 5 penguins. In this case, domain knowledge might indicate the choice of a different distance metric, linkage or cutting height.
Overall, hierarchical clustering can be a useful tool to discover hidden structure for exploratory analysis when labels are not available. One of the key advantages of hierarchical clustering is that it provides nested groupings which allows for broad as well as granular clusters. However, the flip side of this clustering is that it assumes an underlying hierarchical structure that might not actually exist in the data. Additionally, domain knowledge might be required to select the most appropriate distance metric and linkage as different choices might result in very different clusters.