Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions
#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')
library(cluster)
library(ggplot2)
library(patchwork)
library(dplyr)
datasets <- list(
  spheres = three_spheres,
  complex = ring_moon_sphere,
  spirals = two_spirals_sphere
)
cluster_and_plot <- function(data, method, dataset_name) {
  dist_mat <- dist(data)
  hc <- hclust(dist_mat, method = method)
  clusters <- cutree(hc, k = 3)
  sil <- silhouette(clusters, dist_mat)
  avg_sil <- mean(sil[, 3])
  
  df <- as.data.frame(data)
  df$cluster <- as.factor(clusters)
  
  ggplot(df, aes(x = df[,1], y = df[,2], color = cluster)) +
    geom_point(size = 1.5) +
    labs(
      title = paste0(dataset_name, " - ", method, "\n", "Avg Silhouette: ", round(avg_sil, 3)),
      x = "", y = ""
    ) +
    theme_minimal() +
    theme(legend.position = "none")
}
methods <- c("single", "complete", "average", "ward.D2")
plot_list <- list()

combinations <- expand.grid(dataset_name = names(datasets), method = methods, stringsAsFactors = FALSE)

plot_list <- setNames(
  lapply(seq_len(nrow(combinations)), function(i) {
    row <- combinations[i, ]
    cluster_and_plot(datasets[[row$dataset_name]], row$method, row$dataset_name)
  }),
  paste(combinations$dataset_name, combinations$method, sep = "_")
)
final_plot <- wrap_plots(plot_list, ncol = 4)
final_plot

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

  1. Which linkage works best for which scenario?

    Three spheres - average and wards (both had average silhouettes of 0.727)

    Ring, Moon, Sphere - complete (average silhouette of 0.533)

    Two spirals, One sphere - average and wards (both had average silhouettes of 0.432)

  2. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

    It works well when looking at convex clusters like spheres as the silhouette width aligns with the visual quality. It also works for the ward and average linkage on spherical data as high scores match good clustering. It does not work for non-convex shapes as silhouette does not assess complex shapes well. It also does not work for single linkage on spirals. It visually preserves the structure but it gets a low score which can be misleading.

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

library(tidyverse)
library(tibble)
mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

methods <- c("single", "complete", "average", "ward")
agglomerative_coefficients <- sapply(methods, function(m) {
  agnes(mammals, method = m)$ac
})
agglomerative_coefficients
   single  complete   average      ward 
0.7875718 0.8985539 0.8706571 0.9413994 

The best method (highest coefficient) would be ward (0.941).

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

ward_clustering <- agnes(mammals, method = "ward")
plot(ward_clustering, which.plots = 2, main = "Dendrogram - Ward Linkage")

first_merge <- ward_clustering$merge[1, ]
labels <- rownames(mammals)
first_pair <- labels[abs(first_merge)]
cat("First mammals to cluster together:", paste(first_pair, collapse = " and "))
First mammals to cluster together: Deer and Reindeer

The first mammals to cluster together are the Deer and Reindeer.

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

clusters <- cutree(as.hclust(ward_clustering), h = 4)
num_clusters <- length(unique(clusters))
cat("Number of clusters formed:", num_clusters, "\n")
Number of clusters formed: 4 
cluster_sizes <- table(clusters)
smallest_cluster_id <- names(which.min(cluster_sizes))
smallest_cluster_members <- names(clusters[clusters == as.numeric(smallest_cluster_id)])

cat("Cluster with the fewest mammals:", smallest_cluster_id, "\n")
Cluster with the fewest mammals: 4 
cat("Mammals in this cluster:\n")
Mammals in this cluster:
print(smallest_cluster_members)
[1] "Dolphin" "Seal"   

4 clusters will be formed. The cluster with the fewest mammals is 4 and they will be the Dolphin and the Seal.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

fviz_nbclust(mammals, FUN = hcut, method = "wss", hc_method = "ward", k.max = 10) +
  labs(title = "Elbow Method (WSS) - Ward Linkage")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

fviz_nbclust(mammals, FUN = hcut, method = "silhouette", hc_method = "ward", k.max = 10) +
  labs(title = "Silhouette Method - Ward Linkage")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

ward_clustering <- agnes(mammals, method = "ward")
clusters <- cutree(as.hclust(ward_clustering), k = 3)
fviz_dend(ward_clustering, k = 3, 
          rect = TRUE, 
          rect_border = "jco", 
          rect_fill = TRUE,
          main = "Ward Dendrogram with Cluster Memberships")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

I would suggest 3 clusters.

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

ward_clustering <- agnes(mammals, method = "ward")
clusters <- cutree(as.hclust(ward_clustering), k = 3)
mammals_clustered <- mammals %>% mutate(cluster = as.factor(clusters))
fviz_pca_ind(prcomp(mammals, scale. = TRUE),
             geom.ind = "point",
             col.ind = mammals_clustered$cluster,
             palette = "jco",
             addEllipses = TRUE,
             legend.title = "Cluster",
             title = "PCA of Milk Characteristics")
Too few points to calculate an ellipse

library(tidyverse)
library(tibble)
library(cluster)
library(factoextra)
library(Rtsne)
Warning: package 'Rtsne' was built under R version 4.5.2
set.seed(42)
tsne_out <- Rtsne(scale(mammals), dims = 2, perplexity = 5)
tsne_df <- as.data.frame(tsne_out$Y)
tsne_df$cluster <- mammals_clustered$cluster

ggplot(tsne_df, aes(x = V1, y = V2, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "t-SNE of Milk Characteristics", x = "Dim 1", y = "Dim 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2")

PCA shows how clusters differ based on linear combinations of milk traits. T-SNE captures the non linear relationships, which reveals hidden structures especially useful if clusters are curved. Clusters can reflect nutritional strategies . Outliers might indicate specialized species.