Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

library(cluster)
library(ggplot2)
library(patchwork)
library(dplyr)

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

datasets <- list(spheres = three_spheres,
                 complex = ring_moon_sphere,
                 spirals = two_spirals_sphere)

cluster_and_plot <- function(data, method, dataset_name) {
  data <- data[, 1:2]
  
  dist_mat <- dist(data)
  hc <- hclust(dist_mat, method = method)
  clusters <- cutree(hc, k = 3)
  
  sil <- silhouette(clusters, dist_mat)
  avg_sil <- mean(sil[, 3])
  
  df <- as.data.frame(data)
  names(df)[1:2] <- c("x", "y")
  df$cluster <- as.factor(clusters)
  
  ggplot(df, aes(x = df[,1], y = df[,2], color = cluster)) +
    geom_point(size = 1.5) +
    scale_color_manual(
      values = c(
        "1" = "plum",
        "2" = "tomato",
        "3" = "navy"
      )
    ) +
  labs(
  title = paste0(method, "\nAvg: ", round(avg_sil, 3)),
  x = "", y = ""
) +
    theme_minimal() +
    theme(legend.position = "none")
}

methods <- c("single", "complete", "average", "ward.D2")

combinations <- expand.grid(dataset_name = names(datasets), method = methods, stringsAsFactors = FALSE)

plot_list <- setNames(
  lapply(seq_len(nrow(combinations)), function(i) {
    row <- combinations[i, ]
    cluster_and_plot(datasets[[row$dataset_name]], row$method, row$dataset_name)
  }),
  paste(combinations$dataset_name, combinations$method, sep = "_")
)

final_plot <- wrap_plots(plot_list, ncol = 4)
final_plot

Discuss the following:

Which linkage works best for which scenario? Three spheres works best for the complete, average, and wardD2. Ring/moon/sphere works best for the single. Two spirals + sphere works best for the single.
Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

No, average silhouette doesn’t always work. It’s great when clusters are round and well separated, but for weird shapes like rings, spirals, and moons, the clustering that actually follows the shapes can get a lower silhouette than a wrong solution that chops them into smaller.

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

methods <- c("single", "complete", "average", "ward")
agglo_coefficients <- sapply(methods, function(m) {
  agnes(mammals, method = m)$ac
})
agglo_coefficients

   single  complete   average      ward 
0.7875718 0.8985539 0.8706571 0.9413994

The highest coefficient would be ward i.e. 0.941.

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

ward_clustering <- agnes(mammals, method = "ward")
plot(ward_clustering, which.plots = 2, main = "Dendrogram with the highest AC")

first_merge <- ward_clustering$merge[1, ]
labels <- rownames(mammals)
first_pair <- labels[abs(first_merge)]
cat("Mammals to cluster together first:", paste(first_pair, collapse = " and "))

Mammals to cluster together first: Deer and Reindeer

Deer and Reindeer cluster together first.

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

clusters <- cutree(as.hclust(ward_clustering), h = 4)

num_clusters <- length(unique(clusters))
cat("Numbers of clusters:", num_clusters, "\n")

Numbers of clusters: 4

cluster_sizes <- table(clusters)
small_cluster_id <- as.numeric(names(which.min(cluster_sizes)))

small_cluster_members <- names(clusters[clusters == small_cluster_id])

cat("Fewest mammals cluster:", small_cluster_id, "\n")

Fewest mammals cluster: 4

cat("Which mammals is it?:\n")

Which mammals is it?:

print(small_cluster_members)

[1] "Dolphin" "Seal"

4 clusters will be formed. Cluster 4 will have the fewest mammals, and they will be “Dolphin” and “Seal”.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

WSS_plot <- fviz_nbclust(mammals, FUNcluster = hcut, method = "wss",
                         k.max = 10, hc_method = "ward.D2")

silhouette_plot <- fviz_nbclust(mammals, FUNcluster = hcut, method = "silhouette", k.max = 10, hc_method = "ward.D2")

WSS_plot + silhouette_plot

ward_hc <- as.hclust(ward_clustering)
clusters <- cutree(ward_hc, k = 4)

plot(ward_hc, main = "Ward dendrogram", xlab = "", sub = "")

cat("Number of clusters: 4\n")

Number of clusters: 4

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

library(factoextra)
library(GGally)
library(ggplot2)
library(dplyr)
library(tibble)
library(FactoMineR)

hc_ward <- hclust(dist(mammals), method = "ward.D2")
k <- 4
cluster_ids <- cutree(hc_ward, k = k)

mammals_clustered <- mammals %>%
  as.data.frame() %>%
  rownames_to_column("Mammal") %>%
  mutate(Cluster = factor(cluster_ids))

pca_res <- PCA(
  mammals_clustered %>% select(-Mammal, -Cluster),
  scale.unit = TRUE,
  graph = FALSE
)

fviz_pca_ind(
  pca_res,
  geom.ind = "point",
  col.ind = mammals_clustered$Cluster,
  palette = "Dark2",
  legend.title = "Cluster",
  repel = TRUE,
  title = "PCA of Mammal Milk Characteristics"
)

Ignoring unknown labels:
• fill : "Cluster"
• linetype : "Cluster"

Dim1 is 78.1% of the variation, so most of the differences between mammals are left to right. Cluster 1 sits mostly on the left side with lower Dim1 values. Cluster2 is in the middle-right and higher on Dim2. Cluster 3 is far to the bottom-right, separated from everyone else. So basically, cluster 1 is “one type” of milk, cluster 2 is a richer type, and cluster 3 is an extreme type. Cluster 4 is the rich and supper fatty milk group. So, this milk is much more concentrated and energy dense than the other 3.

fviz_pca_var(
  pca_res,
  col.var = "contrib",
  gradient.cols = c("darkgreen", "grey", "navy"),
  title = "Nutrient Contributions in Mammal Milk"
)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Ash, protein, and fat arrows are pointing to the right, so higher Dim1 is the milk that is richer in fat, protein, and ash. Water and Lactose arrows are pointing to the left, so lower Dim1 is the milk that is more water and rich in lactose. Looking at the Dim2, ash and protein are up but the fat is down which is separating heavy protein/ash and heavy fat milk.

library(ggplot2)

ggplot(mammals_clustered,
       aes(x = Fat, y = Protein, color = Cluster)) +
  geom_point(size = 2) +
  theme_minimal() +
  labs(
    title = "Fat vs Protein in Mammal Milk",
    x = "Fat",
    y = "Protein"
  )

Cluster 1 has mostly low fat and protein so it is a dilute milk. Cluster 3 has moderate fat but high protein so it is richer and has more protein milk. Cluster 3 has very high fat and has reasonably high protein so it is the most energy dense milk.