Let’s start by looking at a correlation plot:
ggcorr(
data = crime,
low = "red3",
high = "blue3",
mid = "white",
label = T,
label_round = 2
)
Other than burglary and larceny, there aren’t a lot of strong correlations between pairs of crimes.
Next, let’s create a biplot of the first 2 PCs to see if there are
any groups in the data using the fviz_pca()
in
factoextra
fviz_pca(
prcomp(
x = crime,
scale. = T
),
geom = "text"
)
With only 16 cities, it’s a little difficult to see how many groups there are, but some cities are more similar than others!
First 2 PCs account for 70%, let’s look at the 1st and 3rd PC as well
fviz_pca(
prcomp(
x = crime,
scale. = T
),
axes = c(1,3),
geom = "text"
)
From the biplots looks like there may be 2 or 3 groups
There are many different functions in R that will do hierarchical
clustering. We’ll be using hcut()
function in the
factoextra
package.
There is also hclust()
in base R
and
agnes()
in the cluster
package
hcut()
needs 2 arguments:
x =
a dataset or distance matrixhc_method =
link method to measure distances between
clustersk =
# of clusters (optional). Defaults to 2 if nothing
is specifiedNote: just using method
will give you a
different result, need to use the hc_method
argument!
Additional Note: If x is an non-standarized dataset, make sure to use stand=T
Single linkage will use the closest points between to clusters to measure how far apart each cluster is from one another.
crime_sin <-
hcut(
x = crime,
hc_method = "single",
stand = T,
k = 2
)
There are a couple of ways of plotting the dendrogram:
crime_sin |>
as.dendrogram() |>
plot(main = "Dendrogram for Single Linkage")
Not the best looking dendrogram
More visually appealing using fviz_dend()
in
factoextra
package. The downside is that it will show the
cluster choice from hcut()
gg_dend_single <-
fviz_dend(
crime_sin,
main = "Cluster Dendrogram for Single Linkage"
)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
gg_dend_single
rect = T
adds a red rectangle around the clusters and
you can specify the color
fviz_dend(
crime_sin,
main = "Cluster Dendrogram for Single Linkage",
rect = T,
rect_border = "black"
)
Control the color of the clusters with k_colors
:
fviz_dend(
crime_sin,
main = "Cluster Dendrogram for Single Linkage",
rect = T,
rect_border = "red",
k_colors = c("blue","orange2")
)
Need to rerun the hcut()
function to choose a different
k
:(
hcut(
x = crime,
hc_method = "single",
stand = T,
k = 3
) |>
fviz_dend(main = "Cluster Dendrogram for Single Linkage")
You can get the cluster ID by using $cluster
crime_sin$cluster
## Atlanta Boston Chicago Dallas Denver Detroit Hartford
## 1 2 1 1 1 1 1
## Honolulu Houston KC LA NO NY Portland
## 1 1 1 1 1 1 1
## Tucson Washington
## 1 1
Alternatively, you can get the cluster ID using cutree()
for multiple choices of k
without having to rerun the
algorithm again:
cutree(crime_sin, k=2:5)
## 2 3 4 5
## Atlanta 1 1 1 1
## Boston 2 2 2 2
## Chicago 1 1 3 3
## Dallas 1 1 3 4
## Denver 1 1 3 4
## Detroit 1 1 3 4
## Hartford 1 3 4 5
## Honolulu 1 1 3 4
## Houston 1 1 3 4
## KC 1 1 3 4
## LA 1 1 3 4
## NO 1 1 3 4
## NY 1 1 3 4
## Portland 1 1 3 4
## Tucson 1 3 4 5
## Washington 1 1 3 4
We can use fviz_cluster()
to visualize the results
Since we have more than 2 variables, fviz_cluster()
will
use PCA to display the first 2 PCs (by default)
fviz_cluster(
crime_sin,
geom = "text",
show.clust.cent = F,
ellipse = F
) +
labs(title = "Single Linkage Clusters: k = 2") +
theme_bw() +
theme(legend.position = "none")
Can’t tell what makes Boston special just from the first 2 PCs, so let’s check the 3rd PC again:
fviz_cluster(
crime_sin,
geom = "text",
show.clust.cent = F,
ellipse = F,
axes = c(1, 3)
) +
labs(title = "Single Linkage Clusters: k = 2") +
theme_bw() +
theme(legend.position = "none") +
scale_y_continuous(
expand = c(0.1, 0, 0.1, 0)
)
Whatever crime(s) the third PC represents, Boston has a much higher score than the other cities:
prcomp(x = crime,
scale. = T)
## Standard deviations (1, .., p=6):
## [1] 1.6975148 1.1526431 0.9506645 0.6122080 0.5926527 0.4000732
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4 PC5
## Murder -0.2979396 0.61604145 -0.38478654 -0.31523950 0.35016625
## Robbery -0.4640654 0.23316839 0.27835099 0.72318122 0.32657811
## Assault -0.4771708 0.27718662 -0.07422189 -0.03602238 -0.81222298
## Burglary -0.4353121 -0.38171250 -0.40641957 -0.25758947 0.29710735
## Larceny -0.3898544 -0.57921095 -0.18227326 0.18970979 -0.09836761
## Autotheft -0.3559498 -0.09001807 0.75534810 -0.52344363 0.11434682
## PC6
## Murder 0.40208445
## Robbery -0.15215833
## Assault -0.17018493
## Burglary -0.58736443
## Larceny 0.65851250
## Autotheft 0.08704792
Looks like the third PC is mostly Autotheft!
The complete link is the opposite of single link - It uses the farthest points between the two clusters to determine the distances.
crime_com <-
hcut(
x = crime,
hc_method = "complete",
stand = T
)
gg_dend_complete <-
fviz_dend(crime_com) +
labs(title = "Complete Linkage")
gg_dend_complete
If it’s difficult to see where the first large merger distance is, you can add lines to the dendrogram to help see where the distances the clusters were merged starts to become large:
fviz_dend(crime_com) +
labs(title = "Cluster Dendrogram with Complete Linkage") +
geom_hline(
yintercept = crime_com$height,
linetype="dashed"
)
Examining the dendrogram from complete linkage, it appears that there are 4 clusters
Let’s look at the results when k = 4
crime_com <-
hcut(
x = crime,
hc_method = "complete",
stand = T,
k = 4
)
fviz_dend(crime_com) +
labs(title = "Dendrogram with Linkage = Complete and k = 4")
So what do cities in the same clusters have in common? What separates the 4 cities?
I prefer adding the cluster results to the biplot from earlier to help see where the differences in clusters are
We can specify groups with fviz_pca()
by using
habillage = group ID
Let’s look at 2 clusters and 4 clusters:
# k = 2
prcomp(
x = crime,
scale. = T
) |>
# Plotting the results with two clusters
fviz_pca(
geom = "text",
col.var = "black",
habillage = cutree(crime_com,
k = 2),
show.legend = F
)
# k = 4
prcomp(
x = crime,
scale. = T
) |>
fviz_pca(
axes = c(1, 2),
geom = "text",
col.var = "black",
habillage = cutree(crime_com,
k = 4),
show.legend = F
)
You can use hcut with k = multiple cluster # and have it return a matrix with each column corresponding to a choice of the cluster
cutree(
crime_com,
k = c(2, 4)
) |>
data.frame() |>
rename(
k2 = X2,
k4 = X4
) |>
rownames_to_column(var = "City") |>
left_join(
y = crime |> rownames_to_column(var = "City"),
by = "City"
) |>
pivot_longer(
cols = starts_with("k"),
names_to = "total_cluster",
values_to = "clusterID"
) |>
summarize(
.by = c(total_cluster, clusterID),
across(
.cols = Murder:Autotheft,
.fns = mean
)
) |>
pivot_longer(
cols = Murder:Autotheft,
names_to = "crime",
values_to = "rate"
) |>
mutate(
clusterID = factor(clusterID),
crime = as_factor(crime)
) |>
ggplot(
mapping = aes(
x = clusterID,
y = rate,
fill = clusterID
)
) +
geom_col(
position = "dodge",
color = "black"
) +
facet_wrap(
facets = ~ total_cluster + crime,
scales = "free",
nrow = 2
) +
theme_test() +
theme(
legend.position = "none",
panel.grid = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank()
) +
labs(y = "Crime Rate") +
scale_y_continuous(expand = c(0, 0, 0.05, 0))
If there are 2 clusters, it appears that there is a “low crime” group and a “high crime group” with the murder rate being about the same while assaults are much higher for the “high crime group”
If there are 4 clusters:
cutree(
crime_com,
k = c(2,4)
) |>
data.frame() |>
rownames_to_column(var = "City") |>
arrange(X2, X4) |>
rename(k2 = X2, k4 = X4) |>
gt::gt()
City | k2 | k4 |
---|---|---|
Atlanta | 1 | 1 |
Dallas | 1 | 1 |
Houston | 1 | 1 |
KC | 1 | 1 |
Boston | 1 | 2 |
Chicago | 1 | 2 |
NO | 1 | 2 |
Hartford | 1 | 4 |
Honolulu | 1 | 4 |
Portland | 1 | 4 |
Tucson | 1 | 4 |
Denver | 2 | 3 |
Detroit | 2 | 3 |
LA | 2 | 3 |
NY | 2 | 3 |
Washington | 2 | 3 |
The distance apart clusters A & B are using the average link is calculated by:
\[d^2_{avg} = \frac{1}{ab}\sum_j \sum_l (y_{ij} - y_{kl})^2\]
crime_avg <-
hcut(
crime,
hc_method = "average",
k = 2,
stand = T
)
# Dendrogram for average linkage
gg_dend_avg <-
fviz_dend(
crime_avg,
main = "Average Link"
)
gg_dend_avg
This dendrogram doesn’t indicate a clear cut off, so it is a little hard to tell how many clusters there are when using average linkage
# A plot visualizing the results
fviz_cluster(
crime_avg,
geom = "text",
ellipse = F
) +
theme_bw() +
labs(title = "City AHC with Average Link") +
theme(legend.position = "none")
biplot to see how the clusters are different
prcomp(
crime,
scale. = T
) |>
fviz_pca_biplot(
geom = "text",
habillage = cutree(crime_avg,
k = 2),
col.var = "black"
) +
theme_bw() +
labs(title = "City AHC with Average Link") +
theme(legend.position = "none")
Using the centroid link, we calculate the average (centroid) for each cluster, then measure how far apart the centroids are:
\[d^2_{cen} = (\bar{y}_i - \bar{y}_k)^2 \]
crime_cen <-
hcut(
crime,
hc_method = "centroid",
k = 2,
stand = T
)
fviz_dend(
crime_cen,
main = "Centroid Link"
)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
Well that doesn’t look good. The results aren’t ultrametric :(
By adding a city into a cluster, the next merger was closer than the previous one
In presidential terms: BAD!
The centroid link often isn’t used because it isn’t ultrametric!
data.frame(crime_cen.height = crime_cen$height) |>
mutate(
merger_number = row_number(),
height_change = crime_cen.height - lag(crime_cen.height)
) |>
dplyr::select(merger_number, everything())
## merger_number crime_cen.height height_change
## 1 1 0.8558817 NA
## 2 2 0.9917272 0.13584546
## 3 3 1.4438898 0.45216260
## 4 4 1.5435297 0.09963994
## 5 5 1.4404744 -0.10305532
## 6 6 1.6088562 0.16838182
## 7 7 1.6894597 0.08060354
## 8 8 1.8436513 0.15419152
## 9 9 1.8664776 0.02282634
## 10 10 2.0007712 0.13429357
## 11 11 1.7690125 -0.23175872
## 12 12 2.0236073 0.25459487
## 13 13 1.9005101 -0.12309720
## 14 14 2.0613848 0.16087470
## 15 15 2.4995502 0.43816538
# Visualizing the heights:
tibble(merge_distance = crime_cen$height) |>
mutate(
merge_order = row_number(),
bad_mergers = if_else(merge_distance < lag(merge_distance) &
merge_order != 1,
"Bad",
"Good")
) |>
ggplot(
mapping = aes(
x = merge_order,
y = merge_distance
)
) +
geom_line() +
geom_point() +
geom_point(
mapping = aes(alpha = bad_mergers),
size = 6,
color = "red",
shape = 21,
show.legend = F
) +
scale_alpha_discrete(range = c(1, 0))
## Warning: Using alpha for a discrete variable is not advised.
Rule of thumb: don’t use the results if they aren’t ultrametric
The median link is like the average link, but instead of finding the average distance for all points in Clusters A and B, it finds the median distance:
crime_med <-
hcut(
crime,
hc_method = "median",
k = 2,
stand = T
)
fviz_dend(
crime_med,
main = "Dendrogram with Median Link"
)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
Seems we have the same ultrametric problem again….
tibble(merge_distance = crime_med$height) |>
mutate(
merge_order = row_number(),
bad_mergers = if_else(merge_distance < lag(merge_distance) & merge_order != 1,
"Bad",
"Good")
) |>
ggplot(
mapping = aes(x = merge_order,
y = merge_distance)
) +
labs(title = "Median Linkage") +
geom_line() +
geom_point() +
geom_point(
mapping = aes(alpha = bad_mergers),
size = 6,
color = "red",
shape = 21,
show.legend = F
) +
scale_alpha_discrete(range = c(1, 0))
## Warning: Using alpha for a discrete variable is not advised.
Wards’ distance is a little bit of a misnomer because it doesn’t calculate the distances between clusters.
Instead, for each possible merger between two clusters, it will calculate the Within-Sums-of-Squares (WSS) and choose to merge the two clusters that results in the lowest WSS.
\[d^2_{ward} = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2\]
Typically, Wards’ distance gives the “best” clustering result, but not always!
crime_ward <-
hcut(
crime,
hc_method = "ward.D",
k = 3,
stand = T
)
### Dendrogram when k = 3. Looks about right :)
gg_dend_ward <-
fviz_dend(
crime_ward,
main = "Wards' Distance"
)
gg_dend_ward
Using Wards’ distance, it looks like there are 3 clusters:
# A plot visualizing the results
fviz_cluster(
crime_ward,
geom = "text",
show.clust.cent = F,
ellipse = F
) +
labs(title = "Wards' Distance: k = 3") +
theme_bw() +
theme(legend.position = "none")
# biplot to see how the clusters are different
prcomp(
crime,
scale. = T
) |>
fviz_pca(
geom = "text",
col.var = "black",
habillage = cutree(crime_ward,
k = 3)
) +
labs(title = "Wards' Distance: k = 3") +
theme_bw() +
theme(legend.position = "none")
Let’s compare the average crime stats for the 3 clusters:
crime |>
mutate(cluster = crime_ward$cluster) |>
summarize(
.by = cluster,
across(
.cols = Murder:Autotheft,
.fns = mean
)
)
## cluster Murder Robbery Assault Burglary Larceny Autotheft
## 1 1 14.01667 240.0000 229.5000 1274.500 865.5000 663.1667
## 2 2 4.55000 107.8333 117.3333 1287.667 983.3333 634.8333
## 3 3 11.15000 452.2500 264.7500 1659.500 1241.0000 809.2500
Group 1 appears to have high Murder and medium rates for the other crimes
Group 2 has the lowest violent crime rate and typical theft rate
Group 3 has the highest non-murder crime rates.
crime %>%
mutate(cluster = crime_ward$cluster) |>
mutate(
.by = cluster,
cluster_order = row_number()
) |>
ggplot(
mapping = aes(
x = cluster,
y = cluster_order,
color = factor(cluster)
)
) +
geom_text(
mapping = aes(label = rownames(crime)),
show.legend = F
) +
labs(y = NULL) +
scale_x_continuous(breaks = 1:max(crime_ward$cluster)) +
theme(axis.text.y = element_blank())
The choice of link method has a large impact on how the clusters formed. So what is the best choice for these data?
A common way of measuring how well a AHC method clusters the data is to calculate the cophenetic correlation coefficient (often just referred as the cophenetic correlation). It measures how faithfully a dendrogram preserves the pairwise distances between the original unclustered points.
We can calculate the cophenetic distance using the
cor_cophenetic()
function in the dendextend
package.
We need to give it two objects:
hclust()
functioncor_cophenetic(
crime_ward,
dist(scale(crime))
)
## [1] 0.5942779
Correlation with Wards’ distance seems pretty low, but how low is it relative to the other methods?
Let’s do it for all of them!
crime_coph_cor <-
tibble(
single = cor_cophenetic(crime_sin, dist(scale(crime))),
complete = cor_cophenetic(crime_com, dist(scale(crime))),
average = cor_cophenetic(crime_avg, dist(scale(crime))),
centroid = cor_cophenetic(crime_cen, dist(scale(crime))),
median = cor_cophenetic(crime_med, dist(scale(crime))),
ward = cor_cophenetic(crime_ward, dist(scale(crime)))
) |>
pivot_longer(
cols = everything(),
names_to = "link",
values_to = "coph_cor"
) |>
arrange(-coph_cor)
gt::gt(crime_coph_cor)
link | coph_cor |
---|---|
average | 0.6441523 |
centroid | 0.6132771 |
ward | 0.5942779 |
complete | 0.5612580 |
single | 0.5321645 |
median | 0.3513478 |
Looks like average link has the highest correlation with Wards’ distance being the second highest (of the ultrametric methods anyway)!
Let’s plot the results!
ggplot(
data = crime_coph_cor,
mapping = aes(
x = fct_reorder(link, -coph_cor),
y = coph_cor
)
) +
geom_col(
fill = "steelblue",
color = "black"
) +
labs(
x = "Link Method",
y = "Cophenetic Correlation"
) +
scale_y_continuous(
expand = c(0, 0, 0.05, 0),
limits = c(0, 1)
)
For the rest of the example, we’ll be using the results from the average link.
One advantage of AHC is that you can create a dendrogram to help determine the number of clusters in the data.
But if we need a little help deciding on the number of clusters, the same methods we saw with K-means will work with agglomerative clustering
Let’s create a silhouette plot using the fviz_nbclust()
function:
fviz_nbclust(
x = scale(crime),
FUNcluster = hcut,
method = "silhouette"
)
Notice we didn’t specify the link method. That’s because
hcut()
defaults to Wards’ Distance for the link method.
Thankfully, the fviz_nbclust()
function allows us to give
it the same arguments as the clustering method we are using!
So we can give fviz_nbclust()
the hc_method
argument that hcut()
has!
fviz_nbclust(
x = scale(crime),
FUNcluster = hcut,
hc_method = "average",
method = "silhouette"
)
There are a couple of tools we can use to compare the results of different methods/links used.
A tanglegram aligns and plot two dendrograms side by side to compare the end results of each clustering method
Let’s compare the two best ultrametric methods for our data: Average and Ward with k = 6
# Need to store the results in a dendlist:
dendlist(
avg = as.dendrogram(crime_avg),
ward = as.dendrogram(crime_ward)
) |>
untangle(method = "step1side") |> # Find the best alignment layout
tanglegram() # Draw the two dendrograms
The tanglegram shows that the difference between Wards’ distance and average linkage is when Boston is merged.
Average merges it in the second to last step while ward merges it much earlier with Tucson and Hartford.
There are some alternative options we can use in the tanglegram:
dendlist(
avg = as.dendrogram(crime_avg),
med = as.dendrogram(crime_med)
) |>
untangle(method = "step1side") |>
tanglegram(
highlight_distinct_edges = T, # Turn-off/on dashed lines
common_subtrees_color_lines = T, # Turn-off/on line colors
common_subtrees_color_branches = T # Color common branches
)
We can measure how dissimilar two dendrograms are with the
entanglement()
function.
Lower value = similar, higher value = dissimilar
dendlist(
avg = as.dendrogram(crime_avg),
ward = as.dendrogram(crime_ward)
) |>
untangle(method = "step1side") |>
entanglement()
## [1] 0.01181431
Low value for ward and average, indicating that they are similar, which we saw in the tanglegram!
Let’s compare average and median:
dendlist(
avg = as.dendrogram(crime_avg),
med = as.dendrogram(crime_med)
) |>
untangle(method = "step1side") |>
entanglement()
## [1] 0.2684704
It’s higher than ward and average, but still not that high.
First let’s store our results in a dendlist object
dendlist()
creates a list object of multiple
dendrograms, which needs the object to be a dendrogram type objectas.dendrogram()
turns the hcut()
object
into a dendrogram objectcrime_dl <-
dendlist(
sin = as.dendrogram(crime_sin),
com = as.dendrogram(crime_com),
avg = as.dendrogram(crime_avg),
cen = as.dendrogram(crime_cen),
med = as.dendrogram(crime_med),
ward = as.dendrogram(crime_ward)
)
The function cor.dendlist()
is used to compute Baker or
Cophenetic correlation matrix between a list of trees.
The value can range between -1 to 1.
Values near 0 indicate that the two trees are not statistically similar.
Cophenetic correlation matrix
cor.dendlist(
crime_dl,
method = "cophenetic"
)
## sin com avg cen med ward
## sin 1.0000000 0.2188113 0.5129018 0.9093251 0.30444129 0.31457381
## com 0.2188113 1.0000000 0.6185544 0.2820687 0.16857966 0.52214099
## avg 0.5129018 0.6185544 1.0000000 0.5304093 0.17041620 0.92092613
## cen 0.9093251 0.2820687 0.5304093 1.0000000 0.35117690 0.34322060
## med 0.3044413 0.1685797 0.1704162 0.3511769 1.00000000 0.07622216
## ward 0.3145738 0.5221410 0.9209261 0.3432206 0.07622216 1.00000000
Let’s create a correlation plot for the cophenetic correlation:
ggcorr(
data = NULL,
cor_matrix = cor.dendlist(
crime_dl,
method = "cophenetic"
),
high = "blue3",
low = "red3",
mid = "white",
label = T,
label_round = 2
)