Step 1: Exploratory Analysis

Let’s start by looking at a correlation plot:

ggcorr(
  data = crime,
  low = "red3",
  high = "blue3",
  mid = "white",
  label = T,
  label_round = 2
)

Other than burglary and larceny, there aren’t a lot of strong correlations between pairs of crimes.

Next, let’s create a biplot of the first 2 PCs to see if there are any groups in the data using the fviz_pca() in factoextra

fviz_pca(
  prcomp(
    x = crime,
    scale. = T
  ), 
  geom = "text"
) 

With only 16 cities, it’s a little difficult to see how many groups there are, but some cities are more similar than others!

First 2 PCs account for 70%, let’s look at the 1st and 3rd PC as well

fviz_pca(
  prcomp(
    x = crime,
    scale. = T
  ),
  axes = c(1,3),
  geom = "text"
)

From the biplots looks like there may be 2 or 3 groups

Hierarchical Clustering

There are many different functions in R that will do hierarchical clustering. We’ll be using hcut() function in the factoextra package.

There is also hclust() in base R and agnes() in the cluster package

hcut() needs 2 arguments:

Note: just using method will give you a different result, need to use the hc_method argument!

Additional Note: If x is an non-standarized dataset, make sure to use stand=T

Which linkage method is best?

The choice of link method has a large impact on how the clusters formed. So what is the best choice for these data?

A common way of measuring how well a AHC method clusters the data is to calculate the cophenetic correlation coefficient (often just referred as the cophenetic correlation). It measures how faithfully a dendrogram preserves the pairwise distances between the original unclustered points.

We can calculate the cophenetic distance using the cor_cophenetic() function in the dendextend package.

We need to give it two objects:

cor_cophenetic(
  crime_ward,
  dist(scale(crime))
)
## [1] 0.5942779

Correlation with Wards’ distance seems pretty low, but how low is it relative to the other methods?

Let’s do it for all of them!

crime_coph_cor <- 
  tibble(
    single   = cor_cophenetic(crime_sin, dist(scale(crime))),
    complete = cor_cophenetic(crime_com, dist(scale(crime))),
    average  = cor_cophenetic(crime_avg, dist(scale(crime))),
    centroid = cor_cophenetic(crime_cen, dist(scale(crime))),
    median   = cor_cophenetic(crime_med, dist(scale(crime))),
    ward     = cor_cophenetic(crime_ward, dist(scale(crime)))
  ) |> 
  
  pivot_longer(
    cols = everything(),
    names_to = "link",
    values_to = "coph_cor"
  ) |> 
  
  arrange(-coph_cor)

gt::gt(crime_coph_cor)
link coph_cor
average 0.6441523
centroid 0.6132771
ward 0.5942779
complete 0.5612580
single 0.5321645
median 0.3513478

Looks like average link has the highest correlation with Wards’ distance being the second highest (of the ultrametric methods anyway)!

Let’s plot the results!

ggplot(
  data = crime_coph_cor,
  mapping = aes(
    x = fct_reorder(link, -coph_cor),
    y = coph_cor
  )
) + 
  
  geom_col(
    fill = "steelblue",
    color = "black"
  ) + 
  
  labs(
    x = "Link Method",
    y = "Cophenetic Correlation"
  ) +
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    limits = c(0, 1)
  )

For the rest of the example, we’ll be using the results from the average link.

Determining the number of clusters

One advantage of AHC is that you can create a dendrogram to help determine the number of clusters in the data.

But if we need a little help deciding on the number of clusters, the same methods we saw with K-means will work with agglomerative clustering

Let’s create a silhouette plot using the fviz_nbclust() function:

fviz_nbclust(
  x = scale(crime), 
             FUNcluster = hcut, 
             method = "silhouette"
  )

Notice we didn’t specify the link method. That’s because hcut() defaults to Wards’ Distance for the link method. Thankfully, the fviz_nbclust() function allows us to give it the same arguments as the clustering method we are using!

So we can give fviz_nbclust() the hc_method argument that hcut() has!

fviz_nbclust(
  x = scale(crime), 
             FUNcluster = hcut,
             hc_method = "average",
             method = "silhouette"
  )

Comparing Clustering Results

There are a couple of tools we can use to compare the results of different methods/links used.

Tanglegram

A tanglegram aligns and plot two dendrograms side by side to compare the end results of each clustering method

Let’s compare the two best ultrametric methods for our data: Average and Ward with k = 6

# Need to store the results in a dendlist:
dendlist(
  avg  = as.dendrogram(crime_avg),
  ward = as.dendrogram(crime_ward)
)  |> 
  
  untangle(method = "step1side")  |>  # Find the best alignment layout
  
  tanglegram()                       # Draw the two dendrograms

The tanglegram shows that the difference between Wards’ distance and average linkage is when Boston is merged.

Average merges it in the second to last step while ward merges it much earlier with Tucson and Hartford.

There are some alternative options we can use in the tanglegram:

dendlist(
  avg = as.dendrogram(crime_avg),
  med = as.dendrogram(crime_med)
) |>
  
  untangle(method = "step1side") |> 
  
  tanglegram(
    highlight_distinct_edges = T,       # Turn-off/on dashed lines
    common_subtrees_color_lines = T,    # Turn-off/on line colors
    common_subtrees_color_branches = T  # Color common branches 
  )                       

We can measure how dissimilar two dendrograms are with the entanglement() function.

Lower value = similar, higher value = dissimilar

dendlist(
  avg   = as.dendrogram(crime_avg),
  ward  = as.dendrogram(crime_ward)
) |> 
  untangle(method = "step1side") |> 
  entanglement()
## [1] 0.01181431

Low value for ward and average, indicating that they are similar, which we saw in the tanglegram!

Let’s compare average and median:

dendlist(
  avg   = as.dendrogram(crime_avg),
  med  = as.dendrogram(crime_med)
)  |> 
  untangle(method = "step1side") |>  
  entanglement()
## [1] 0.2684704

It’s higher than ward and average, but still not that high.

Dendrogram Correlation

First let’s store our results in a dendlist object

  • dendlist() creates a list object of multiple dendrograms, which needs the object to be a dendrogram type object
  • as.dendrogram() turns the hcut() object into a dendrogram object
crime_dl <- 
  dendlist(
    sin  = as.dendrogram(crime_sin),
    com  = as.dendrogram(crime_com),
    avg  = as.dendrogram(crime_avg),
    cen  = as.dendrogram(crime_cen),
    med  = as.dendrogram(crime_med),
    ward = as.dendrogram(crime_ward)
  )

The function cor.dendlist() is used to compute Baker or Cophenetic correlation matrix between a list of trees.

The value can range between -1 to 1.

Values near 0 indicate that the two trees are not statistically similar.

Cophenetic correlation matrix

cor.dendlist(
  crime_dl, 
  method = "cophenetic"
)
##            sin       com       avg       cen        med       ward
## sin  1.0000000 0.2188113 0.5129018 0.9093251 0.30444129 0.31457381
## com  0.2188113 1.0000000 0.6185544 0.2820687 0.16857966 0.52214099
## avg  0.5129018 0.6185544 1.0000000 0.5304093 0.17041620 0.92092613
## cen  0.9093251 0.2820687 0.5304093 1.0000000 0.35117690 0.34322060
## med  0.3044413 0.1685797 0.1704162 0.3511769 1.00000000 0.07622216
## ward 0.3145738 0.5221410 0.9209261 0.3432206 0.07622216 1.00000000

Let’s create a correlation plot for the cophenetic correlation:

ggcorr(
  data = NULL,
  cor_matrix = cor.dendlist(
    crime_dl, 
    method = "cophenetic"
  ),
  high = "blue3",
  low = "red3",
  mid = "white",
  label = T,
  label_round = 2
)