Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('/Users/uj2116bi/Desktop/DSCI_415/Activities/Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('/Users/uj2116bi/Desktop/DSCI_415/Activities/Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('/Users/uj2116bi/Desktop/DSCI_415/Activities/Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

Which linkage works best for which scenario?
Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

datasets <- list(
"Three spheres" = three_spheres,
"Ring + Moon + Sphere" = ring_moon_sphere,
"Two spirals + sphere" = two_spirals_sphere
)

plot_hclust_3 <- function(df, method_name, k = 3) {

x <- as.matrix(df)
d <- dist(x)

hc <- hclust(d, method = method_name)
clust <- cutree(hc, k = k)
sil <- silhouette(clust, d)
avg_s <- mean(sil[, "sil_width"])

plot_df <- as.data.frame(x) %>% mutate(cluster = factor(clust))
colnames(plot_df)[1:2] <- c("X1", "X2")
p <- ggplot(plot_df, aes(x = X1, y = X2, color = cluster)) +
geom_point(size = 2.2, alpha = 0.9) +
labs(title = paste0(method_name, " — avg silhouette: ", round(avg_s, 3)),
color = "cluster") +
theme(legend.position = "none")
list(plot = p, hc = hc, avg_sil = avg_s, clusters = clust)
}

Which linkage works best for which scenario?

Three convex spheres: Ward (ward.D2) or complete linkage typically recovers compact, spherical clusters best. Ward tends to produce balanced, spherical clusters and performs strongly here.

Ring + moon + sphere (non-convex shapes): None of the standard agglomerative linkages (single, complete, average, Ward) reliably separate non-convex shapes into the intuitive clusters. single sometimes links points along paths (chaining) which can accidentally follow the ring or moon shape but often produces long chains; complete/average can split rings incorrectly since they assume “compactness.” For non-convex shapes, density-based methods (DBSCAN) or spectral clustering / connectivity-aware methods are preferred.

Two spirals + sphere: Hierarchical methods with standard Euclidean distance struggle badly with spiral shapes. single linkage may perform slightly better because it connects nearest neighbors and can trace a spiral, but it’s unstable and often produces chains merging spirals. Overall: hierarchical methods are not ideal for interleaved spiral clusters — use specialized methods (spectral clustering, kernel K-means, or clustering on a distance that respects the manifold).

Does average silhouette width always do a good job?

No. The silhouette score measures how well points fit into compact, convex clusters relative to other clusters. It is useful for convex/compact clusters (spheres) but misleading for non-convex clusters: a low silhouette can indicate non-spherical clusters even if cluster labels are sensible (for example, density-based correct clusters). Conversely, silhouette can favour splitting a ring into multiple compact groups (higher silhouette) even though the intuitive cluster is the ring. So use silhouette as guidance — not absolute proof — and check cluster shape, domain knowledge, and alternative validation (visualization, internal/external indices). # Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('/Users/uj2116bi/Desktop/DSCI_415/Activities/Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

mammals <- as.data.frame(scale(mammals)) # they said already z-scored, but scaling again won't hurt if they are already z
head(mammals, 4)

            Water     Protein        Fat   Lactose        Ash
Bison   0.6315598 -0.38709448 -0.8180228 0.8560333  0.0730034
Buffalo 0.2570856 -0.08505197 -0.2290064 0.3100121 -0.1650077
Camel   0.6936384 -0.74214445 -0.6570183 0.3650142 -0.3030141
Cat     0.2180361  1.06410976 -0.3810106 0.1460057 -0.2240104

methods <- c("single", "complete", "average", "ward")
ac_results <- tibble(method = methods, ac = NA_real_)

for (i in seq_along(methods)) {
m <- methods[i]
ag <- agnes(mammals, method = m)
ac_results$ac[i] <- ag$ac
}
ac_results %>% arrange(desc(ac))

# A tibble: 4 × 2
  method      ac
  <chr>    <dbl>
1 ward     0.941
2 complete 0.899
3 average  0.871
4 single   0.788

In this dataset, Ward linkage has the highest agglomerative coefficient. This makes sense because Ward’s method forms clusters by minimizing the increase in total within-cluster variance at each merge. Since the mammal milk data consists of continuous, standardized nutrient measurements, the true groups—if they exist—tend to be relatively compact in Euclidean space. Ward linkage is specifically designed to identify such compact, spherical clusters, so it naturally performs best.

By contrast:

Single linkage often chains points into long, thin clusters → lowest AC Average and complete linkage work better than single, but still do not produce clusters as compact as WardWard produces the tightest, most well-separated groups → highest AC

Thus, Ward’s method is the best choice for hierarchical clustering on this dataset, as indicated directly by its superior agglomerative coefficient.

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

best_method <- ac_results %>% arrange(desc(ac)) %>% slice(1) %>% pull(method)
best_method

[1] "ward"

ag_best <- agnes(mammals, method = best_method)


fviz_dend(ag_best, k = NULL, cex = 0.7, main = paste("Dendrogram -", best_method, "(agglomerative coeff =", round(ag_best$ac,3), ")"))

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

hc_best <- as.hclust(ag_best)


merge_heights <- hc_best$height
min_h <- min(merge_heights[merge_heights > 0])
min_h

[1] 0.1428813

which_pairs <- which(abs(hc_best$height - min_h) < .Machine$double.eps^0.5)
which_pairs

[1] 1

for (idx in which_pairs) {
members <- cutree(hc_best, h = min_h + 1e-8)[ which(hc_best$order %in% hc_best$order) ] # not needed — simpler to extract:
cat("Merge at height", round(hc_best$height[idx],4), "\n")
print(hc_best$labels[hc_best$merge[idx,][hc_best$merge[idx,] > 0]])
}

Merge at height 0.1429 
character(0)

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

hc <- as.hclust(ag_best)
clusters_h4 <- cutree(hc, h = 4)
table(clusters_h4)

clusters_h4
 1  2  3  4 
11  6  6  2

n_clusters_h4 <- length(unique(clusters_h4))
n_clusters_h4

[1] 4

cluster_sizes <- table(clusters_h4) %>% sort()
cluster_sizes

clusters_h4
 4  2  3  1 
 2  6  6 11

smallest_cluster <- names(cluster_sizes)[1]
smallest_cluster_members <- names(which(clusters_h4 == as.integer(smallest_cluster)))
smallest_cluster

[1] "4"

smallest_cluster_members

[1] "Dolphin" "Seal"

4 clusters would form if the tree was cut at a height of 4. The smallest cluster would only include Dolphin and Seal. ## D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

fviz_nbclust(mammals, FUN = hcut, method = "wss", k.max = 8) + ggtitle("WSS (elbow) method")

fviz_nbclust(mammals, FUN = hcut, method = "silhouette", k.max = 8) + ggtitle("Average silhouette method")

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

suggested_k <- 3 # CHANGE if your WSS/silhouette suggests another k
fviz_dend(hc, k = suggested_k, cex = 0.7, rect = TRUE, rect_fill = TRUE,
main = paste("Dendrogram with k =", suggested_k))

clusters_k <- cutree(hc, k = suggested_k)
table(clusters_k)

clusters_k
 1  2  3 
17  6  2

clusters_k

     Bison    Buffalo      Camel        Cat       Deer        Dog    Dolphin 
         1          1          1          1          2          2          3 
    Donkey   Elephant        Fox Guinea pig      Hippo      Horse      Llama 
         1          1          1          1          1          1          1 
    Monkey       Mule  Orangutan        Pig     Rabbit        Rat   Reindeer 
         1          1          1          1          2          2          2 
      Seal      Sheep      Whale      Zebra 
         3          1          2          1

To explore how the clusters differ in their milk composition, we used PCA and cluster‐level summaries. The PCA biplot provides a two-dimensional projection of the mammals based on all five milk variables. When colored by cluster membership, the PCA plot shows that the clusters occupy distinct regions of the nutrient space, indicating real structural differences among the groups. Mammals with similar milk profiles appear close together, while those with different nutrient characteristics separate along the principal components.

The cluster profile plot further reveals how the groups differ nutritionally. One cluster tends to be characterized by higher fat and protein content, consistent with species whose offspring require energy-dense milk. Another cluster exhibits higher lactose and water levels, typical of mammals producing more dilute milk. A third cluster may show more moderate or balanced nutrient values, representing mammals whose milk composition is neither extremely dilute nor extremely rich.

Together, these visualizations illustrate that the clustering captures meaningful biological differences in milk composition. The PCA projection confirms that the clusters are separable in multivariate space, while the nutrient profiles help interpret the ecological or evolutionary reasons behind the separation. Overall, hierarchical clustering identifies groups of mammals whose milk composition reflects common physiological or environmental adaptations.