Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider the three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where the straight line segment between any two points in the cluster stays inside the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare how well various hierarchical methods recover clusters of different shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

read_xy <- function(path){
  df <- read.csv(path)
  names(df)[1:2] <- c("X1", "X2")
  df
}

three_spheres       <- read_xy("Data/cluster_data1.csv")
ring_moon_sphere    <- read_xy("Data/cluster_data2.csv")
two_spirals_sphere  <- read_xy("Data/cluster_data3.csv")


make_cluster_plot <- function(df, linkage_name, k = 3){

  d  <- dist(df[, c("X1","X2")])
  hc <- hclust(d, method = linkage_name)
  cl <- cutree(hc, k = k)

  sil      <- silhouette(cl, d)
  avg_sil  <- round(mean(sil[,3]), 2)
  title_sm <- paste0(linkage_name, " (", avg_sil, ")")

  df %>%
    mutate(cluster = factor(cl)) %>%
    ggplot(aes(X1, X2, color = cluster)) +
    geom_point(size = 1.4, alpha = 0.9) +
    theme_minimal(base_size = 10) +
    labs(title = title_sm) +
    theme(
      plot.title = element_text(face="bold", size=10),
      legend.position = "none"
    )
}

linkages <- c("single","complete","average","ward.D2")

plots_1 <- lapply(linkages, \(m) make_cluster_plot(three_spheres, m))
plots_2 <- lapply(linkages, \(m) make_cluster_plot(ring_moon_sphere, m))
plots_3 <- lapply(linkages, \(m) make_cluster_plot(two_spirals_sphere, m))


col_titles <- c(
  "Three Spheres", 
  "Ring / Moon / Sphere", 
  "Two Spirals + Sphere"
)

# Label the top plot of each column with its data set name.
# Map() over four plots and three titles recycles the titles and warns, and
# titles set with plot_annotation() only take effect at the top level of a
# patchwork, so they are dropped once the plots are nested in the grid.
label_column <- function(p, label){
  p + labs(subtitle = label)
}

plots_1[[1]] <- label_column(plots_1[[1]], col_titles[1])
plots_2[[1]] <- label_column(plots_2[[1]], col_titles[2])
plots_3[[1]] <- label_column(plots_3[[1]], col_titles[3])

final_grid <- (
  plots_1[[1]] | plots_2[[1]] | plots_3[[1]] ) /
  (plots_1[[2]] | plots_2[[2]] | plots_3[[2]] ) /
  (plots_1[[3]] | plots_2[[3]] | plots_3[[3]] ) /
  (plots_1[[4]] | plots_2[[4]] | plots_3[[4]] )

final_grid

Discuss the following:

Which linkage works best for which scenario?

For the three convex spheres, complete, average, and Ward linkage work best because they form tight, well-separated groups that match the data’s true structure. For the ring–moon–sphere and spiral data sets, single linkage works better because it chains nearby points together and can follow the curved shapes without breaking them apart the way the other methods do.

Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

Silhouette width does not always measure clustering quality well. It assumes clusters are compact and roughly convex, so it works well for the three-spheres dataset where the groups are round and clearly separated. However, it becomes misleading for non-convex shapes like rings and spirals. In those cases, a method like single linkage can capture the true structure even if it ends up with a lower silhouette score, while methods with higher scores may actually distort the real shape of the clusters.
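
The point about non-convex shapes can be checked on a small synthetic example. The sketch below does not use the assignment data: it builds two concentric rings (the `ring()` helper and the radii are illustrative assumptions), confirms that single linkage recovers the rings, and prints the average silhouette width for single and Ward linkage so the two can be compared.

```r
library(cluster)

# Synthetic data (not the assignment files): two concentric rings
ring <- function(r, n) {
  theta <- seq(0, 2 * pi, length.out = n + 1)[-1]   # n evenly spaced angles
  data.frame(X1 = r * cos(theta), X2 = r * sin(theta))
}
rings <- rbind(ring(1, 100), ring(5, 100))
truth <- rep(1:2, each = 100)                       # 1 = inner ring, 2 = outer ring

d <- dist(rings)

# Average silhouette width of the 2-cluster solution for a given linkage
avg_sil <- function(method) {
  cl <- cutree(hclust(d, method = method), k = 2)
  mean(silhouette(cl, d)[, "sil_width"])
}

# Single linkage chains along each ring, so the cut separates the rings
cl_single <- cutree(hclust(d, method = "single"), k = 2)
table(truth, cl_single)

# Compare silhouette widths; a correct ring solution need not score highest
sil <- sapply(c(single = "single", ward.D2 = "ward.D2"), avg_sil)
round(sil, 2)
```

Points on the outer ring are, on average, farther from each other than from the inner ring, which is why a correct solution can still receive a modest silhouette score.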

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)
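
For reference, the distance-based linkages compared in Question 1 differ only in how they summarize the pairwise distances between two clusters: single linkage takes the minimum, complete linkage the maximum, and average linkage the mean. (Ward linkage instead merges the pair of clusters that gives the smallest increase in within-cluster sum of squares.) A minimal worked example, using two hypothetical clusters `A` and `B` that are not part of the assignment data:

```r
# Two small hypothetical clusters on a line
A <- matrix(c(0, 0,
              1, 0), ncol = 2, byrow = TRUE)
B <- matrix(c(3, 0,
              5, 0), ncol = 2, byrow = TRUE)

# Pairwise Euclidean distances between each point in A (rows) and B (columns)
pair_d <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

linkage_d <- c(single   = min(pair_d),    # closest pair
               complete = max(pair_d),    # farthest pair
               average  = mean(pair_d))   # mean over all pairs
linkage_d
```

Here the pairwise distances are 3, 5, 2, and 4, so the single, complete, and average linkage distances are 2, 5, and 3.5.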

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

library(cluster)


d <- dist(mammals)


ac_single   <- agnes(d, method = "single")$ac
ac_complete <- agnes(d, method = "complete")$ac
ac_average  <- agnes(d, method = "average")$ac
ac_ward     <- agnes(d, method = "ward")$ac

c(
  single   = ac_single,
  complete = ac_complete,
  average  = ac_average,
  ward     = ac_ward
)

   single  complete   average      ward 
0.7875718 0.8985539 0.8706571 0.9413994

Ward linkage has the highest agglomerative coefficient (0.94), so it finds the strongest clustering structure in this data set.
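
The four `agnes()` calls can also be written as one loop. In the `cluster` package, the agglomerative coefficient is the average over observations of one minus the ratio of the dissimilarity at which the observation is first merged to the dissimilarity of the final merge, so values near 1 indicate stronger structure. The sketch below uses the built-in `USArrests` data as a stand-in, since the mammal file is not reproduced here.

```r
library(cluster)

# Stand-in data (not the mammal milk file), standardized to z-scores
x <- scale(USArrests)

linkages <- c(single = "single", complete = "complete",
              average = "average", ward = "ward")

# Agglomerative coefficient for each linkage
ac <- sapply(linkages, function(m) agnes(x, method = m)$ac)
round(ac, 3)

# Linkage with the largest coefficient
names(which.max(ac))
```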

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

hc_ward <- hclust(dist(mammals), method = "ward.D2")
plot(hc_ward,
     main = "Ward Dendrogram for Mammal Milk Data",
     xlab = "",
     sub = "")

The mammals that cluster together first are fox and buffalo.
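
Reading the lowest join off a crowded dendrogram is error-prone, so it can help to check it against the fitted object. In an `hclust` result, row 1 of `merge` records the first merge, and negative entries index original observations. A sketch on the stand-in `USArrests` data (not the mammal data):

```r
# Stand-in data (not the mammal milk file)
x  <- scale(USArrests)
d  <- dist(x)
hc <- hclust(d, method = "ward.D2")

# Row 1 of hc$merge is the first merge; negative values are observations
first_pair <- hc$labels[-hc$merge[1, ]]
first_pair

# Cross-check: the first merge should be the closest pair of observations
m <- as.matrix(d)
diag(m) <- Inf
closest <- rownames(m)[which(m == min(m), arr.ind = TRUE)[1, ]]
closest
```

Applying the same two lines to the Ward tree for the mammals would confirm which pair joins first.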

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

hc_ward <- hclust(dist(mammals), method = "ward.D2")

clusters_h4 <- cutree(hc_ward, h = 4)
table(clusters_h4)

clusters_h4
 1  2  3  4 
11  6  6  2

names(clusters_h4[clusters_h4 == 4])

[1] "Dolphin" "Seal"

When I cut the dendrogram at a height of 4, it splits into four different clusters. One of those clusters is much smaller than the others and only has two mammals in it. Looking at the tree, those two mammals are the Dolphin and the Seal. They end up grouped together because their milk compositions are more similar to each other than to any of the other animals in the dataset.
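
The number of clusters produced by a height cut can also be read from the merge heights: each merge above the cut height is one that has not happened yet, so the cluster count is the number of such merges plus one. A sketch on the stand-in `USArrests` data, assuming a monotone tree such as the Ward tree used here:

```r
# Stand-in data (not the mammal milk file)
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
h  <- 4

# Merges above the cut height are undone, leaving one extra cluster each
k_from_heights <- sum(hc$height > h) + 1
k_from_cutree  <- length(unique(cutree(hc, h = h)))
c(heights = k_from_heights, cutree = k_from_cutree)

# Cluster sizes, smallest first
sort(table(cutree(hc, h = h)))
```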

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

library(factoextra)


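# Use hcut() (hierarchical clustering, Ward by default) instead of kmeans so
# the diagnostics describe the same kind of solution as the dendrogram below
fviz_nbclust(mammals, hcut, method = "wss") +
  labs(title = "WSS (Elbow Method) for Mammal Milk Data")

fviz_nbclust(mammals, hcut, method = "silhouette") +
  labs(title = "Average Silhouette Width for k = 2–10")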

hc_ward <- hclust(dist(mammals), method = "ward.D2")


k <- 3  
clusters_k <- cutree(hc_ward, k)


plot(hc_ward,
     main = paste("Ward Dendrogram with", k, "Clusters"),
     xlab = "", sub = "")

rect.hclust(hc_ward, k = k, border = 2:(k+1))

The WSS elbow plot and the average silhouette plot both point toward a small number of clusters for this data set, two or three. I use k = 3 and cut the Ward dendrogram there. Outlining the clusters on the dendrogram makes it much easier to see how the mammals group by similarity in milk composition, and to check visually that this choice of k agrees with what the silhouette and WSS methods suggest.
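
Both diagnostics can also be computed directly from a single Ward tree, which makes explicit what the plots summarize. The sketch below uses the stand-in `USArrests` data and an illustrative range `ks` of 2 to 8 clusters; WSS is the sum of squared deviations of each observation from its cluster centroid.

```r
library(cluster)

# Stand-in data (not the mammal milk file)
x  <- scale(USArrests)
d  <- dist(x)
hc <- hclust(d, method = "ward.D2")
ks <- 2:8

# Average silhouette width for each cut of the tree
avg_sil <- sapply(ks, function(k) {
  mean(silhouette(cutree(hc, k = k), d)[, "sil_width"])
})

# Total within-cluster sum of squares for each cut of the tree
wss <- sapply(ks, function(k) {
  groups <- split(as.data.frame(x), cutree(hc, k = k))
  sum(sapply(groups, function(g) sum(scale(g, scale = FALSE)^2)))
})

data.frame(k = ks, avg_sil = round(avg_sil, 3), wss = round(wss, 1))
```

Because the cuts of one tree are nested, WSS can only decrease as k grows, so the elbow (not the minimum) is what matters; the silhouette width is the one to maximize.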

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

library(ggplot2)
library(dplyr)
library(FactoMineR)
library(factoextra)


hc_ward <- hclust(dist(mammals), method = "ward.D2")
k <- 3
clusters_k <- cutree(hc_ward, k)


mammals_clustered <- mammals %>%
  mutate(Cluster = factor(clusters_k),
         Mammal = rownames(mammals))


pca_out <- PCA(mammals, scale.unit = FALSE, graph = FALSE)


fviz_pca_ind(pca_out,
             geom.ind = "point",
             habillage = mammals_clustered$Cluster,
             addEllipses = TRUE,
             palette = "Dark2",
             title = "PCA of Mammal Milk Composition by Cluster")

Too few points to calculate an ellipse

fviz_pca_var(pca_out,
             col.var = "contrib",
             gradient.cols = c("grey80", "orange", "red"),
             title = "Variable Contributions to PCA")

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.

library(GGally)

ggparcoord(mammals_clustered,
           columns = 1:5,
           groupColumn = "Cluster",
           scale = "uniminmax") +
  theme_minimal() +
  labs(title = "Parallel Coordinate Plot of Milk Nutrients by Cluster",
       x = "Nutrient", y = "Scaled Value")

Using PCA made the differences between the milk composition clusters much easier to see. In the PCA plot, the three groups separate mostly along the first principal component, with one cluster far to the right, another to the left, and a small group lower on the plot. This shows that the clusters differ in consistent ways across the five nutrients.

The variable contribution plot helps explain this separation. Fat and protein load strongly in the positive direction, while water pulls in the opposite direction. This means the right-hand cluster represents species with richer, more nutrient-dense milk, while the left-hand cluster has higher water content. Lactose plays a smaller role but still helps separate the groups along the second component.

The parallel coordinate plot supports these patterns by showing how each nutrient level changes across clusters. One group has consistently higher fat and protein, another shows much more water, and the smallest group has its own distinct mix. Together, these visuals clearly show that the clusters reflect meaningful differences in milk composition and that PCA is an effective way to highlight these relationships.
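
A table of per-cluster means gives a numeric counterpart to these plots. The sketch below uses base R `prcomp()` and `aggregate()` on the stand-in `USArrests` data rather than the mammal data, so the variable names differ; with z-scored inputs, a cluster mean well above or below 0 marks a variable that distinguishes that cluster.

```r
# Stand-in data (not the mammal milk file), standardized to z-scores
x  <- scale(USArrests)
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# PCA on the already-standardized variables
pca  <- prcomp(x, center = TRUE, scale. = FALSE)
prop <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
round(prop, 2)

# Scores on the first two components, tagged with cluster membership
scores <- data.frame(pca$x[, 1:2], cluster = factor(cl))
head(scores)

# Mean z-score of each variable within each cluster
profile <- aggregate(as.data.frame(x), by = list(cluster = cl), FUN = mean)
profile
```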