Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions
#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

  1. Which linkage works best for which scenario?
  2. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

make_plot <- function(data, linkage) {
  
  hc <- agnes(data, metric = "euclidean", method = linkage)
  
  clust <- cutree(hc, k = 3)
  
  dmat <- dist(data)
  sil <- silhouette(clust, dmat)
  avg_sil <- round(mean(sil[, 3]), 3)
  
  df <- data.frame(
    x = data[,1],
    y = data[,2],
    cluster = as.factor(clust)
  )
  
  p<- ggplot(df, aes(x, y, color = cluster)) +
    geom_point(size = 1.6, alpha = 0.9) +
    theme_minimal() +
    theme(legend.position = "none") +
    labs(
      title = paste0(linkage, " linkage  (avg s = ", avg_sil, ")")
    )
return(p)
}
linkages <- c("single", "complete", "average", "ward")

plots_1 <- lapply(linkages, make_plot, data = three_spheres)
plots_2 <- lapply(linkages, make_plot, data = ring_moon_sphere)
plots_3 <- lapply(linkages, make_plot, data = two_spirals_sphere)

all_plots <- c(plots_1, plots_2, plots_3)
final_grid <- wrap_plots(all_plots, ncol = 4) +
  plot_annotation(
    title = "Hierarchical Clustering: 12 Solutions",
    subtitle = "Rows = datasets, Columns = linkages"
  )

final_grid

For the three_spheres data set, complete, average, and ward linkages are best; all of them have the same avg.silhouette value.

For the ring_moon_sphere data set, single linkage is performing better than the other methods, even though the avg.silhouette is lower.

For the two_spirals_sphere data set, again, single linkage is the best method despite having the lowest avg.silhouette.

answer 2

The average silhouette doesn’t always do a good job measuring the quality of the clustering solution. For the 2nd and 3rd data sets, we saw that the single linkage method was the best even though it has the lowest avg.silhouette score.

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

cluster_mammals_single <- agnes(mammals, metric='euclidean', method='single')
cluster_mammals_complete <- agnes(mammals, metric='euclidean', method='complete')
cluster_mammals_average <- agnes(mammals, metric='euclidean', method='average')
cluster_mammals_ward <- agnes(mammals, metric='euclidean', method='ward')
cluster_mammals_single$ac
[1] 0.7875718
cluster_mammals_complete$ac
[1] 0.8985539
cluster_mammals_average$ac
[1] 0.8706571
cluster_mammals_ward$ac
[1] 0.9413994

Ward linkage has a highest AC.

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

library(factoextra)
fviz_dend(cluster_mammals_ward) + 
  labs(title='Ward linkage')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Camel & Zebra

Donkey & Mule

Fox & Sheep

Cat & Guinea pig

Dog & Rat

Rabbit & Dolphin

Bison & Llama

Dolphin and Seal

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

fviz_dend(cluster_mammals_ward, h =4) + 
  labs(title='Ward linkage')

If we cut the tree at height of 4 it will form 5 clusters. The 5th cluster will have fewest amount of mammals. it will have Dolphin and Seal.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

WSS

fviz_nbclust(mammals,
             FUNcluster = hcut,
             k.max = 10,
             method='wss',
             hc_method='ward',
             hc_func = 'agnes')

k=4 clusters seems like the best choice.

fviz_dend(cluster_mammals_ward, k = 4,  
          rect = TRUE,
          show_labels = TRUE,
          main = "Ward Dendrogram with Cluster Memberships")

Average silhouette

fviz_nbclust(mammals, FUN = hcut, method = "silhouette", hc_method = "ward.D2") +
  labs(subtitle = "Average silhouette method")

fviz_dend(cluster_mammals_ward, k = 3,  
          rect = TRUE,
          show_labels = TRUE,
          main = "Ward Dendrogram with Cluster Memberships")

combined

fviz_dend(cluster_mammals_ward,k=3)+ ggtitle('Ward linkage, 2 clusters') +
fviz_dend(cluster_mammals_ward,k=4)+ ggtitle('Ward linkage, 3 clusters')

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

mammals_scaled <- scale(mammals)
milk_pca <- prcomp(mammals_scaled, center=TRUE, scale. = TRUE)
(ward_clusters <- cutree(cluster_mammals_ward, k = 3:4) 
    %>% data.frame()
    %>% setNames(c('k3','k4'))
) %>% head
  k3 k4
1  1  1
2  1  2
3  1  1
4  1  2
5  2  3
6  2  3
k3_biplot <- fviz_pca(milk_pca, 
         habillage = factor(ward_clusters$k3),
         repel = TRUE) + 
      ggtitle('3-cluster solution') + 
      guides(color='none',shape='none')
k4_biplot <- fviz_pca(milk_pca, 
         habillage = factor(ward_clusters$k4),
         repel = TRUE) + 
        ggtitle('4-cluster solution') + 
        guides(color='none',shape='none')
k3_biplot + k4_biplot 

I would say the k = 4 cluster is better at explaining the different milk characteristics. Rat, whale, deer, reindeer, and dog milks are high in protein. Bison, llama, and camel milks are high in water and lactose. Rabbit milk is high in ash, and seal and dolphin milks are high in fat.

Monkey, elephant, hippo, donkey, mule, and orangutan milks are low in protein and ash. Sheep, fox, cat, buffalo, and pig have equal amounts of everything, with no real high in one specific characteristic.