Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

# Load the packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider the three data sets below. Each contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A cluster is convex if the straight line segment joining any two of its points stays entirely inside the cluster.) In the second, one cluster is a sphere, one is a ring, and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare how well various hierarchical methods recover clusters of different shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

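# Cluster one data set with one linkage, cut the tree into three clusters,
# and return a scatterplot titled with the linkage and average silhouette width.
# Assumes df holds only the two coordinate columns, named x and y.
make_cluster_plot <- function(df, linkage_name) {
  # Euclidean distances and agglomerative tree
  d <- dist(df)
  tree <- hclust(d, method = linkage_name)

  # Three-cluster solution
  clust <- cutree(tree, k = 3)

  # Average silhouette width over all points
  sil <- silhouette(clust, d)
  avg_sil <- mean(sil[, "sil_width"])

  # Scatterplot colored by assigned cluster
  p <- df %>%
    mutate(cluster = factor(clust)) %>%
    ggplot(aes(x, y, color = cluster)) +
    geom_point(size = 1.2, alpha = 0.8) +
    theme_minimal() +
    labs(
      title = paste0(toupper(linkage_name),
                     " (avg s = ",
                     round(avg_sil, 3), ")"),
      color = "Cluster"
    ) +
    theme(plot.title = element_text(size = 11))

  return(p)
}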

linkages <- c("single", "complete", "average", "ward.D2")

plots_three  <- map(linkages, ~ make_cluster_plot(three_spheres, .x))
plots_ring   <- map(linkages, ~ make_cluster_plot(ring_moon_sphere, .x))
plots_spiral <- map(linkages, ~ make_cluster_plot(two_spirals_sphere, .x))

# Grid with one row per data set (3 rows) and one column per linkage (4 columns)
final_plot <- (plots_three[[1]] | plots_three[[2]] | plots_three[[3]] | plots_three[[4]]) /
              (plots_ring[[1]]  | plots_ring[[2]]  | plots_ring[[3]]  | plots_ring[[4]]) /
              (plots_spiral[[1]]| plots_spiral[[2]]| plots_spiral[[3]]| plots_spiral[[4]])

final_plot

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

Discuss the following:

Which linkage works best for which scenario?

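Convex spheres: complete, average, and Ward linkage work best because they favor compact groups; single linkage tends to chain neighboring points together.

Ring, half-moon, and spiral shapes: single linkage works best because it joins nearest neighbors and can therefore follow curved, non-convex clusters; the other linkages tend to break these shapes apart.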
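The contrast between single and complete linkage on a non-convex shape can be checked on a small synthetic example. The sketch below is not part of the assignment data: it builds two noise-free concentric rings (the names `rings` and `truth` are made up for illustration) and cross-tabulates each 2-cluster solution against the true rings.

```r
# Synthetic, noise-free example: two concentric rings (a non-convex "truth")
theta <- seq(0, 2 * pi, length.out = 61)[-61]          # 60 evenly spaced angles
rings <- rbind(
  data.frame(x = 1 * cos(theta), y = 1 * sin(theta)),  # inner ring, radius 1
  data.frame(x = 5 * cos(theta), y = 5 * sin(theta))   # outer ring, radius 5
)
truth <- rep(1:2, each = 60)

d <- dist(rings)
single_clusters   <- cutree(hclust(d, method = "single"),   k = 2)
complete_clusters <- cutree(hclust(d, method = "complete"), k = 2)

# Cross-tabulate the recovered clusters against the true rings
table(truth, single_clusters)
table(truth, complete_clusters)
```

In this setup single linkage separates the two rings exactly, because every nearest-neighbor gap within a ring is smaller than the gap between rings. Complete linkage cannot keep the outer ring whole, since the ring's diameter exceeds its largest distance to the inner ring.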

Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

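No. The silhouette width compares each point's average distance to its own cluster with its average distance to the nearest other cluster, so it rewards compact, well-separated, roughly round clusters. Curved or elongated clusters such as rings and spirals score poorly even when the clustering recovers them correctly.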
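To see why, recall that the silhouette width of point \(i\) is \(s(i) = (b(i) - a(i)) / \max\{a(i), b(i)\}\), where \(a(i)\) is its mean distance to the other members of its own cluster and \(b(i)\) is its mean distance to the nearest other cluster. The sketch below scores a perfectly correct clustering of two synthetic concentric rings (illustrative data, not the assignment files).

```r
library(cluster)  # silhouette()

# Synthetic, noise-free example: two concentric rings with known membership
theta <- seq(0, 2 * pi, length.out = 61)[-61]          # 60 evenly spaced angles
rings <- rbind(
  data.frame(x = 1 * cos(theta), y = 1 * sin(theta)),  # inner ring, radius 1
  data.frame(x = 5 * cos(theta), y = 5 * sin(theta))   # outer ring, radius 5
)
truth <- rep(1:2, each = 60)

# Silhouette widths for the TRUE labels, i.e. a perfect clustering
sil <- silhouette(truth, dist(rings))
avg_by_ring <- tapply(sil[, "sil_width"], truth, mean)

avg_by_ring                 # per-ring average silhouette width
mean(sil[, "sil_width"])    # overall average silhouette width
```

Points on the outer ring are, on average, farther from each other than from the inner ring, so their silhouette widths come out negative and pull the overall average down, even though every point is labeled correctly.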

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

d <- dist(mammals)
ac_single   <- agnes(d, method = "single")$ac
ac_complete <- agnes(d, method = "complete")$ac
ac_average  <- agnes(d, method = "average")$ac
ac_ward     <- agnes(d, method = "ward")$ac

ac_single

[1] 0.7875718

ac_complete

[1] 0.8985539

ac_average

[1] 0.8706571

ac_ward

[1] 0.9413994

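Ward linkage has the highest agglomerative coefficient (0.941), ahead of complete (0.899), average (0.871), and single (0.788), so by this measure Ward gives the strongest clustering structure for this data set.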
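For intuition about what the coefficient measures: for each observation, take the height at which it is first merged, divide by the height of the final merge, and average one minus that ratio over all observations. The sketch below reproduces this by hand from an `hclust` tree. The helper `ac_from_hclust` is hypothetical (not a package function), and the built-in `USArrests` data with average linkage stands in for the mammal data.

```r
library(cluster)  # agnes()

# Hypothetical helper: agglomerative coefficient computed from an hclust tree
ac_from_hclust <- function(tree) {
  n <- length(tree$order)
  first_merge <- numeric(n)
  for (step in seq_len(nrow(tree$merge))) {
    # Negative entries in a merge row are single observations joining a cluster
    singles <- -tree$merge[step, tree$merge[step, ] < 0]
    first_merge[singles] <- tree$height[step]
  }
  # Average of 1 - (first-merge height / final-merge height)
  mean(1 - first_merge / max(tree$height))
}

d_demo <- dist(scale(USArrests))   # stand-in data, standardized

ac_by_hand <- ac_from_hclust(hclust(d_demo, method = "average"))
ac_agnes   <- agnes(d_demo, method = "average")$ac

c(by_hand = ac_by_hand, agnes = ac_agnes)
```

Values near 1 mean most observations join a cluster at a height that is small relative to the final merge. The `cluster` documentation notes that the coefficient tends to grow with the number of observations, so it is best used to compare linkages on the same data set, as done here.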

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

ward_tree <- hclust(d, method = "ward.D2")
plot(ward_tree)
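The first merge can also be read directly from the tree object instead of from the plot. A minimal sketch on hypothetical toy data (the points `A` to `E` are invented for illustration); the same two lines applied to `ward_tree` would name the first pair of mammals:

```r
# Hypothetical toy data: five labeled points in two dimensions
toy <- data.frame(
  x = c(0.0, 0.1, 2.0, 2.5, 5.0),
  y = c(0.0, 0.0, 2.0, 2.0, 5.0),
  row.names = c("A", "B", "C", "D", "E")
)
tree <- hclust(dist(toy), method = "ward.D2")

# Row 1 of tree$merge is the first merge; negative entries index single observations
first_step <- tree$merge[1, ]
first_pair <- tree$labels[-first_step[first_step < 0]]

first_pair        # labels of the two observations merged first
tree$height[1]    # height of that merge
```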

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

groups_h4 <- cutree(ward_tree, h = 4)
table(groups_h4)

groups_h4
 1  2  3  4 
11  6  6  2

which(groups_h4 == which.min(table(groups_h4)))

Dolphin    Seal 
      7      22

Four clusters form when the tree is cut at a height of 4, with 11, 6, 6, and 2 mammals. The smallest cluster contains two mammals: Dolphin and Seal.
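The number of clusters at a given cut height can be cross-checked against the merge heights: each merge at or below the cut removes one cluster. The sketch below uses hypothetical toy data and also looks up the smallest cluster by its label rather than by its position in the table.

```r
# Hypothetical toy data: two tight pairs and one isolated point
toy <- data.frame(
  x = c(0, 0.2, 3, 3.3, 9),
  y = c(0, 0.1, 3, 3.2, 9),
  row.names = c("A", "B", "C", "D", "E")
)
tree <- hclust(dist(toy), method = "ward.D2")

h <- 1
groups <- cutree(tree, h = h)

# Two equivalent ways to count the clusters at height h
n_clusters     <- length(unique(groups))
n_from_heights <- nrow(toy) - sum(tree$height <= h)

# Members of the smallest cluster, looked up by cluster label
sizes            <- table(groups)
smallest_label   <- as.integer(names(sizes)[which.min(sizes)])
smallest_members <- names(groups)[groups == smallest_label]

n_clusters
smallest_members
```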

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

fviz_nbclust(mammals, FUN = hcut, method = "wss")

fviz_nbclust(mammals, FUN = hcut, method = "silhouette")

fviz_dend(ward_tree, k = 3, rect = TRUE)

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

# PCA on the pre-scaled variables; color points by the 3-cluster Ward solution
pca <- prcomp(mammals)
fviz_pca_ind(pca, habillage = factor(cutree(ward_tree, k = 3)))

One cluster contains mammals with richer, high-fat, high-protein milk. Another contains mammals with average, balanced milk. The smallest cluster tends to have very watery or nutrient-lean milk. The PCA plot separates these groups clearly because clusters differ mostly on fat/protein vs lactose concentration.
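A table of per-cluster means is a simple way to back up a description like this with numbers. The sketch below uses hypothetical toy values for two z-scored variables (`fat` and `lactose` are illustrative column names, not necessarily those in the mammal file); the same `aggregate()` call applied to `mammals` and the 3-cluster assignment would give the actual profiles.

```r
# Hypothetical toy data: two z-scored milk variables for six animals
toy <- data.frame(
  fat     = c(-1.0, -0.8, -0.9,  1.1,  0.9,  1.0),
  lactose = c( 0.9,  1.0,  0.8, -1.0, -0.9, -1.1),
  row.names = c("A", "B", "C", "D", "E", "F")
)
tree    <- hclust(dist(toy), method = "ward.D2")
cluster <- cutree(tree, k = 2)

# Mean of every variable within each cluster
profile <- aggregate(toy, by = list(cluster = cluster), FUN = mean)
profile
```

Because the variables are z-scores, a cluster mean well above or below 0 marks a characteristic on which that cluster differs from the typical mammal in the data.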