Agglomorative Hierarchical Clustering

Step 1: Exploratory Analysis

Let’s start by looking at a correlation plot:

ggcorr(
  data = crime,
  low = "red3",
  high = "blue3",
  mid = "white",
  label = T,
  label_round = 2
)

Other than burglary and larceny, there aren’t a lot of strong correlations between pairs of crimes.

Next, let’s create a biplot of the first 2 PCs to see if there are any groups in the data using the fviz_pca() in factoextra

fviz_pca(
  prcomp(
    x = crime,
    scale. = T
  ), 
  geom = "text"
)

With only 16 cities, it’s a little difficult to see how many groups there are, but some cities are more similar than others!

First 2 PCs account for 70%, let’s look at the 1st and 3rd PC as well

fviz_pca(
  prcomp(
    x = crime,
    scale. = T
  ),
  axes = c(1,3),
  geom = "text"
)

From the biplots looks like there may be 2 or 3 groups

Hierarchical Clustering

There are many different functions in R that will do hierarchical clustering. We’ll be using hcut() function in the factoextra package.

There is also hclust() in base R and agnes() in the cluster package

hcut() needs 2 arguments:

x = a dataset or distance matrix
hc_method = link method to measure distances between clusters
k = # of clusters (optional). Defaults to 2 if nothing is specified

Note: just using method will give you a different result, need to use the hc_method argument!

Additional Note: If x is an non-standarized dataset, make sure to use stand=T

Using different link methods

Single Linkage

Single linkage will use the closest points between to clusters to measure how far apart each cluster is from one another.

crime_sin <- 
  hcut(
    x = crime, 
    hc_method = "single", 
    stand = T, 
    k = 2
  )

There are a couple of ways of plotting the dendrogram:

crime_sin |> 
  as.dendrogram() |> 
plot(main = "Dendrogram for Single Linkage")

Not the best looking dendrogram

More visually appealing using fviz_dend() in factoextra package. The downside is that it will show the cluster choice from hcut()

gg_dend_single <- 
  fviz_dend(
    crime_sin, 
    main = "Cluster Dendrogram for Single Linkage"
  )

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

gg_dend_single

rect = T adds a red rectangle around the clusters and you can specify the color

fviz_dend(
  crime_sin, 
  main = "Cluster Dendrogram for Single Linkage", 
  rect = T, 
  rect_border = "black"
)

Control the color of the clusters with k_colors:

fviz_dend(
  crime_sin, 
  main = "Cluster Dendrogram for Single Linkage",
  rect = T, 
  rect_border = "red", 
  k_colors = c("blue","orange2")
)

Need to rerun the hcut() function to choose a different k :(

hcut(
  x = crime, 
  hc_method = "single", 
  stand = T,
  k = 3
) |> 
  fviz_dend(main = "Cluster Dendrogram for Single Linkage")

You can get the cluster ID by using $cluster

crime_sin$cluster

##    Atlanta     Boston    Chicago     Dallas     Denver    Detroit   Hartford 
##          1          2          1          1          1          1          1 
##   Honolulu    Houston         KC         LA         NO         NY   Portland 
##          1          1          1          1          1          1          1 
##     Tucson Washington 
##          1          1

Alternatively, you can get the cluster ID using cutree() for multiple choices of k without having to rerun the algorithm again:

cutree(crime_sin, k=2:5)

##            2 3 4 5
## Atlanta    1 1 1 1
## Boston     2 2 2 2
## Chicago    1 1 3 3
## Dallas     1 1 3 4
## Denver     1 1 3 4
## Detroit    1 1 3 4
## Hartford   1 3 4 5
## Honolulu   1 1 3 4
## Houston    1 1 3 4
## KC         1 1 3 4
## LA         1 1 3 4
## NO         1 1 3 4
## NY         1 1 3 4
## Portland   1 1 3 4
## Tucson     1 3 4 5
## Washington 1 1 3 4

Plot of the data by clusters

We can use fviz_cluster() to visualize the results

Since we have more than 2 variables, fviz_cluster() will use PCA to display the first 2 PCs (by default)

fviz_cluster(
  crime_sin, 
  geom = "text", 
  show.clust.cent = F, 
  ellipse = F
) + 
  
  labs(title = "Single Linkage Clusters: k = 2") +
  
  theme_bw() + 
  theme(legend.position = "none")

Can’t tell what makes Boston special just from the first 2 PCs, so let’s check the 3rd PC again:

fviz_cluster(
  crime_sin, 
             geom = "text", 
             show.clust.cent = F, 
             ellipse = F,
  axes = c(1, 3)
) + 
  
  labs(title = "Single Linkage Clusters: k = 2") +
  
  theme_bw() + 
  
  theme(legend.position = "none") + 
  
  scale_y_continuous(
    expand = c(0.1, 0, 0.1, 0)
  )

Whatever crime(s) the third PC represents, Boston has a much higher score than the other cities:

prcomp(x = crime,
       scale. = T)

## Standard deviations (1, .., p=6):
## [1] 1.6975148 1.1526431 0.9506645 0.6122080 0.5926527 0.4000732
## 
## Rotation (n x k) = (6 x 6):
##                  PC1         PC2         PC3         PC4         PC5
## Murder    -0.2979396  0.61604145 -0.38478654 -0.31523950  0.35016625
## Robbery   -0.4640654  0.23316839  0.27835099  0.72318122  0.32657811
## Assault   -0.4771708  0.27718662 -0.07422189 -0.03602238 -0.81222298
## Burglary  -0.4353121 -0.38171250 -0.40641957 -0.25758947  0.29710735
## Larceny   -0.3898544 -0.57921095 -0.18227326  0.18970979 -0.09836761
## Autotheft -0.3559498 -0.09001807  0.75534810 -0.52344363  0.11434682
##                   PC6
## Murder     0.40208445
## Robbery   -0.15215833
## Assault   -0.17018493
## Burglary  -0.58736443
## Larceny    0.65851250
## Autotheft  0.08704792

Looks like the third PC is mostly Autotheft!

Complete Link

The complete link is the opposite of single link - It uses the farthest points between the two clusters to determine the distances.

crime_com <- 
  hcut(
    x = crime, 
    hc_method = "complete", 
    stand = T
  )

gg_dend_complete <- 
  fviz_dend(crime_com) +
  labs(title = "Complete Linkage")

gg_dend_complete

If it’s difficult to see where the first large merger distance is, you can add lines to the dendrogram to help see where the distances the clusters were merged starts to become large:

fviz_dend(crime_com) +
  
  labs(title = "Cluster Dendrogram with Complete Linkage") + 
  
  geom_hline(
    yintercept = crime_com$height, 
    linetype="dashed"
  )

Examining the dendrogram from complete linkage, it appears that there are 4 clusters

Let’s look at the results when k = 4

crime_com <- 
  hcut(
    x = crime, 
    hc_method = "complete", 
    stand = T, 
    k = 4
  )

fviz_dend(crime_com) +
  labs(title = "Dendrogram with Linkage = Complete and k = 4")

So what do cities in the same clusters have in common? What separates the 4 cities?

Scatterplots to illustrate the clusters

I prefer adding the cluster results to the biplot from earlier to help see where the differences in clusters are

We can specify groups with fviz_pca() by using habillage = group ID

Let’s look at 2 clusters and 4 clusters:

# k = 2
prcomp(
  x = crime,
       scale. = T
  ) |> 
  # Plotting the results with two clusters
  fviz_pca(
    geom = "text", 
    col.var = "black",
    habillage = cutree(crime_com,
                       k = 2),
    show.legend = F
  )

# k = 4
prcomp(
  x = crime,
  scale. = T
) |>  
  
  fviz_pca(
    axes = c(1, 2),
    geom = "text", 
    col.var = "black",
    habillage = cutree(crime_com,
                       k = 4),
    show.legend = F
  )

You can use hcut with k = multiple cluster # and have it return a matrix with each column corresponding to a choice of the cluster

cutree(
  crime_com, 
  k = c(2, 4)
)  |>  
  
  data.frame() |>  
  
  rename(
    k2 = X2, 
    k4 = X4
  )  |>  
  
  rownames_to_column(var = "City") |> 
  
  left_join(
    y = crime  |> rownames_to_column(var = "City"), 
    by = "City"
  ) |> 
  
  pivot_longer(
    cols = starts_with("k"),
    names_to = "total_cluster",
    values_to = "clusterID"
  ) |>  
  
  summarize(
    .by = c(total_cluster, clusterID),
    across(
      .cols = Murder:Autotheft,
      .fns = mean
    )
  )  |>  
  
  
  
  pivot_longer(
    cols = Murder:Autotheft,
    names_to = "crime",
    values_to = "rate"
  ) |>  
  
  mutate(
    clusterID = factor(clusterID),
    crime = as_factor(crime)
  ) |>  
  
  ggplot(
    mapping = aes(
      x = clusterID,
      y = rate,
      fill = clusterID
    )
  ) +
  
  geom_col(
    position = "dodge",
    color = "black"
  ) +
  
  facet_wrap(
    facets = ~ total_cluster + crime,
    scales = "free",
    nrow = 2
  ) +
  
  theme_test() +
  
  theme(
    legend.position = "none",
    panel.grid = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  ) + 
  
  labs(y = "Crime Rate") + 
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

If there are 2 clusters, it appears that there is a “low crime” group and a “high crime group” with the murder rate being about the same while assaults are much higher for the “high crime group”

If there are 4 clusters:

A “high murder” cluster (Red)
A “lower crime but high autotheft” group (green)
A “high crime other than murder” group (Blue)
A “low crime” group (purple)

cutree(
  crime_com, 
       k = c(2,4)
  ) |>   
  
  data.frame() |>
  
  rownames_to_column(var = "City") |> 
  
  arrange(X2, X4) |> 
  
  rename(k2 = X2, k4 = X4) |> 
  
  gt::gt()

City	k2	k4
Atlanta	1	1
Dallas	1	1
Houston	1	1
KC	1	1
Boston	1	2
Chicago	1	2
NO	1	2
Hartford	1	4
Honolulu	1	4
Portland	1	4
Tucson	1	4
Denver	2	3
Detroit	2	3
LA	2	3
NY	2	3
Washington	2	3

Average Link

The distance apart clusters A & B are using the average link is calculated by:

Finding the distance for every $(y_{ij}, y_{kl})$ pair
Averaging all pairs of distances calculated in step 1)

\[d^2_{avg} = \frac{1}{ab}\sum_j \sum_l (y_{ij} - y_{kl})^2\]

crime_avg <-
  hcut(
    crime, 
       hc_method = "average", 
       k = 2, 
       stand = T
    )

# Dendrogram for average linkage
gg_dend_avg <- 
  fviz_dend(
    crime_avg, 
    main = "Average Link"
  ) 

gg_dend_avg

This dendrogram doesn’t indicate a clear cut off, so it is a little hard to tell how many clusters there are when using average linkage

# A plot visualizing the results
fviz_cluster(
  crime_avg, 
  geom = "text", 
  ellipse = F
) +
  theme_bw() + 
  labs(title = "City AHC with Average Link") + 
  theme(legend.position = "none")

biplot to see how the clusters are different

prcomp(
  crime,
  scale. = T
) |> 
  fviz_pca_biplot(
    geom = "text", 
    habillage = cutree(crime_avg, 
                       k = 2), 
    col.var = "black"
  ) +
  
  theme_bw() + 
  
  labs(title = "City AHC with Average Link") + 
  
  theme(legend.position = "none")

Centroid Link

Using the centroid link, we calculate the average (centroid) for each cluster, then measure how far apart the centroids are:

\[d^2_{cen} = (\bar{y}_i - \bar{y}_k)^2 \]

crime_cen <- 
  hcut(
    crime, 
    hc_method = "centroid", 
    k = 2, 
    stand = T
  )

fviz_dend(
  crime_cen, 
  main = "Centroid Link"
)

## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled

Well that doesn’t look good. The results aren’t ultrametric :(

By adding a city into a cluster, the next merger was closer than the previous one

In presidential terms: BAD!

The centroid link often isn’t used because it isn’t ultrametric!

data.frame(crime_cen.height = crime_cen$height) |> 
  
  mutate(
    merger_number = row_number(),
    height_change = crime_cen.height - lag(crime_cen.height)
  ) |> 
  
  dplyr::select(merger_number, everything())

##    merger_number crime_cen.height height_change
## 1              1        0.8558817            NA
## 2              2        0.9917272    0.13584546
## 3              3        1.4438898    0.45216260
## 4              4        1.5435297    0.09963994
## 5              5        1.4404744   -0.10305532
## 6              6        1.6088562    0.16838182
## 7              7        1.6894597    0.08060354
## 8              8        1.8436513    0.15419152
## 9              9        1.8664776    0.02282634
## 10            10        2.0007712    0.13429357
## 11            11        1.7690125   -0.23175872
## 12            12        2.0236073    0.25459487
## 13            13        1.9005101   -0.12309720
## 14            14        2.0613848    0.16087470
## 15            15        2.4995502    0.43816538

# Visualizing the heights:
tibble(merge_distance = crime_cen$height) |> 
  
  mutate(
    merge_order = row_number(),
    bad_mergers = if_else(merge_distance < lag(merge_distance) & 
                            merge_order != 1,
                          "Bad",
                          "Good")
  ) |> 
  
  ggplot(
    mapping = aes(
      x = merge_order,
      y = merge_distance
    )
  ) + 
  
  geom_line() + 
  
  geom_point() + 
  
  geom_point(
    mapping = aes(alpha = bad_mergers),
    size = 6, 
    color = "red",
    shape = 21,
    show.legend = F
  )  +
  
  scale_alpha_discrete(range = c(1, 0))

## Warning: Using alpha for a discrete variable is not advised.

Rule of thumb: don’t use the results if they aren’t ultrametric

Median Link

The median link is like the average link, but instead of finding the average distance for all points in Clusters A and B, it finds the median distance:

crime_med <- 
  hcut(
    crime, 
    hc_method = "median", 
    k = 2, 
    stand = T
  )

fviz_dend(
  crime_med, 
  main = "Dendrogram with Median Link"
)

## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled

Seems we have the same ultrametric problem again….

tibble(merge_distance = crime_med$height)  |>  
  
  mutate(
    merge_order = row_number(),
    bad_mergers = if_else(merge_distance < lag(merge_distance) & merge_order != 1,
                          "Bad",
                          "Good")
  ) |> 
  
  ggplot(
    mapping = aes(x = merge_order,
                  y = merge_distance)
  ) + 
  
  labs(title = "Median Linkage") +
  
  geom_line() + 
  
  geom_point() + 
  
  geom_point(
    mapping = aes(alpha = bad_mergers),
    size = 6, 
    color = "red",
    shape = 21,
    show.legend = F
  )  +
  
  scale_alpha_discrete(range = c(1, 0))

## Warning: Using alpha for a discrete variable is not advised.

Wards’ Distance

Wards’ distance is a little bit of a misnomer because it doesn’t calculate the distances between clusters.

Instead, for each possible merger between two clusters, it will calculate the Within-Sums-of-Squares (WSS) and choose to merge the two clusters that results in the lowest WSS.

\[d^2_{ward} = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2\]

Typically, Wards’ distance gives the “best” clustering result, but not always!

crime_ward <-
  hcut(
    crime, 
    hc_method = "ward.D", 
    k = 3, 
    stand = T
  )

### Dendrogram when k = 3. Looks about right :)
gg_dend_ward <- 
  fviz_dend(
    crime_ward, 
    main = "Wards' Distance"
  )
gg_dend_ward

Using Wards’ distance, it looks like there are 3 clusters:

# A plot visualizing the results
fviz_cluster(
  crime_ward, 
  geom = "text", 
  show.clust.cent = F, 
  ellipse = F
) + 
  
  labs(title = "Wards' Distance: k = 3") + 
  theme_bw() +
  theme(legend.position = "none")

# biplot to see how the clusters are different
prcomp(
  crime, 
  scale. = T
) |> 
  fviz_pca(
    geom = "text", 
    col.var = "black",
    habillage = cutree(crime_ward, 
                       k = 3)
  ) + 
  labs(title = "Wards' Distance: k = 3") + 
  theme_bw() +
  theme(legend.position = "none")

Let’s compare the average crime stats for the 3 clusters:

crime |> 
  mutate(cluster = crime_ward$cluster) |> 
  
  summarize(
    .by = cluster, 
    across(
      .cols = Murder:Autotheft,
      .fns = mean
    )
  )

##   cluster   Murder  Robbery  Assault Burglary   Larceny Autotheft
## 1       1 14.01667 240.0000 229.5000 1274.500  865.5000  663.1667
## 2       2  4.55000 107.8333 117.3333 1287.667  983.3333  634.8333
## 3       3 11.15000 452.2500 264.7500 1659.500 1241.0000  809.2500

Group 1 appears to have high Murder and medium rates for the other crimes

Group 2 has the lowest violent crime rate and typical theft rate

Group 3 has the highest non-murder crime rates.

crime %>% 
  mutate(cluster = crime_ward$cluster) |> 
  
  mutate(
    .by = cluster,
    cluster_order = row_number()
  ) |> 
  ggplot(
    mapping = aes(
      x = cluster,
      y = cluster_order,
      color = factor(cluster)
    )
  ) +
  
  geom_text(
    mapping = aes(label = rownames(crime)),
    show.legend = F
  ) + 
  
  labs(y = NULL) + 
  
  scale_x_continuous(breaks = 1:max(crime_ward$cluster)) + 
  
  theme(axis.text.y = element_blank())

Which linkage method is best?

The choice of link method has a large impact on how the clusters formed. So what is the best choice for these data?

A common way of measuring how well a AHC method clusters the data is to calculate the cophenetic correlation coefficient (often just referred as the cophenetic correlation). It measures how faithfully a dendrogram preserves the pairwise distances between the original unclustered points.

We can calculate the cophenetic distance using the cor_cophenetic() function in the dendextend package.

We need to give it two objects:

The result of the hclust() function
A distance matrix of the original (rescaled) data

cor_cophenetic(
  crime_ward,
  dist(scale(crime))
)

## [1] 0.5942779

Correlation with Wards’ distance seems pretty low, but how low is it relative to the other methods?

Let’s do it for all of them!

crime_coph_cor <- 
  tibble(
    single   = cor_cophenetic(crime_sin, dist(scale(crime))),
    complete = cor_cophenetic(crime_com, dist(scale(crime))),
    average  = cor_cophenetic(crime_avg, dist(scale(crime))),
    centroid = cor_cophenetic(crime_cen, dist(scale(crime))),
    median   = cor_cophenetic(crime_med, dist(scale(crime))),
    ward     = cor_cophenetic(crime_ward, dist(scale(crime)))
  ) |> 
  
  pivot_longer(
    cols = everything(),
    names_to = "link",
    values_to = "coph_cor"
  ) |> 
  
  arrange(-coph_cor)

gt::gt(crime_coph_cor)

link	coph_cor
average	0.6441523
centroid	0.6132771
ward	0.5942779
complete	0.5612580
single	0.5321645
median	0.3513478

Looks like average link has the highest correlation with Wards’ distance being the second highest (of the ultrametric methods anyway)!

Let’s plot the results!

ggplot(
  data = crime_coph_cor,
  mapping = aes(
    x = fct_reorder(link, -coph_cor),
    y = coph_cor
  )
) + 
  
  geom_col(
    fill = "steelblue",
    color = "black"
  ) + 
  
  labs(
    x = "Link Method",
    y = "Cophenetic Correlation"
  ) +
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    limits = c(0, 1)
  )

For the rest of the example, we’ll be using the results from the average link.

Determining the number of clusters

One advantage of AHC is that you can create a dendrogram to help determine the number of clusters in the data.

But if we need a little help deciding on the number of clusters, the same methods we saw with K-means will work with agglomerative clustering

Let’s create a silhouette plot using the fviz_nbclust() function:

fviz_nbclust(
  x = scale(crime), 
             FUNcluster = hcut, 
             method = "silhouette"
  )

Notice we didn’t specify the link method. That’s because hcut() defaults to Wards’ Distance for the link method. Thankfully, the fviz_nbclust() function allows us to give it the same arguments as the clustering method we are using!

So we can give fviz_nbclust() the hc_method argument that hcut() has!

fviz_nbclust(
  x = scale(crime), 
             FUNcluster = hcut,
             hc_method = "average",
             method = "silhouette"
  )

Comparing Clustering Results

There are a couple of tools we can use to compare the results of different methods/links used.

Tanglegram

A tanglegram aligns and plot two dendrograms side by side to compare the end results of each clustering method

Let’s compare the two best ultrametric methods for our data: Average and Ward with k = 6

# Need to store the results in a dendlist:
dendlist(
  avg  = as.dendrogram(crime_avg),
  ward = as.dendrogram(crime_ward)
)  |> 
  
  untangle(method = "step1side")  |>  # Find the best alignment layout
  
  tanglegram()                       # Draw the two dendrograms

The tanglegram shows that the difference between Wards’ distance and average linkage is when Boston is merged.

Average merges it in the second to last step while ward merges it much earlier with Tucson and Hartford.

There are some alternative options we can use in the tanglegram:

dendlist(
  avg = as.dendrogram(crime_avg),
  med = as.dendrogram(crime_med)
) |>
  
  untangle(method = "step1side") |> 
  
  tanglegram(
    highlight_distinct_edges = T,       # Turn-off/on dashed lines
    common_subtrees_color_lines = T,    # Turn-off/on line colors
    common_subtrees_color_branches = T  # Color common branches 
  )

We can measure how dissimilar two dendrograms are with the entanglement() function.

Lower value = similar, higher value = dissimilar

dendlist(
  avg   = as.dendrogram(crime_avg),
  ward  = as.dendrogram(crime_ward)
) |> 
  untangle(method = "step1side") |> 
  entanglement()

## [1] 0.01181431

Low value for ward and average, indicating that they are similar, which we saw in the tanglegram!

Let’s compare average and median:

dendlist(
  avg   = as.dendrogram(crime_avg),
  med  = as.dendrogram(crime_med)
)  |> 
  untangle(method = "step1side") |>  
  entanglement()

## [1] 0.2684704

It’s higher than ward and average, but still not that high.

Dendrogram Correlation

First let’s store our results in a dendlist object

dendlist() creates a list object of multiple dendrograms, which needs the object to be a dendrogram type object
as.dendrogram() turns the hcut() object into a dendrogram object

crime_dl <- 
  dendlist(
    sin  = as.dendrogram(crime_sin),
    com  = as.dendrogram(crime_com),
    avg  = as.dendrogram(crime_avg),
    cen  = as.dendrogram(crime_cen),
    med  = as.dendrogram(crime_med),
    ward = as.dendrogram(crime_ward)
  )

The function cor.dendlist() is used to compute Baker or Cophenetic correlation matrix between a list of trees.

The value can range between -1 to 1.

Values near 0 indicate that the two trees are not statistically similar.

Cophenetic correlation matrix

cor.dendlist(
  crime_dl, 
  method = "cophenetic"
)

##            sin       com       avg       cen        med       ward
## sin  1.0000000 0.2188113 0.5129018 0.9093251 0.30444129 0.31457381
## com  0.2188113 1.0000000 0.6185544 0.2820687 0.16857966 0.52214099
## avg  0.5129018 0.6185544 1.0000000 0.5304093 0.17041620 0.92092613
## cen  0.9093251 0.2820687 0.5304093 1.0000000 0.35117690 0.34322060
## med  0.3044413 0.1685797 0.1704162 0.3511769 1.00000000 0.07622216
## ward 0.3145738 0.5221410 0.9209261 0.3432206 0.07622216 1.00000000

Let’s create a correlation plot for the cophenetic correlation:

ggcorr(
  data = NULL,
  cor_matrix = cor.dendlist(
    crime_dl, 
    method = "cophenetic"
  ),
  high = "blue3",
  low = "red3",
  mid = "white",
  label = T,
  label_round = 2
)