Activity 4-1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions
#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

All printed s/silhouettes for each plot:

 [1] "Three Spheres - Single - s̄ = 0.442"        
 [2] "Three Spheres - Complete - s̄ = 0.727"      
 [3] "Three Spheres - Average - s̄ = 0.727"       
 [4] "Three Spheres - Ward - s̄ = 0.727"          
 [5] "Ring–Moon–Sphere - Single - s̄ = 0.407"     
 [6] "Ring–Moon–Sphere - Complete - s̄ = 0.533"   
 [7] "Ring–Moon–Sphere - Average - s̄ = 0.575"    
 [8] "Ring–Moon–Sphere - Ward - s̄ = 0.485"       
 [9] "Two Spirals + Sphere - Single - s̄ = 0.158" 
[10] "Two Spirals + Sphere - Complete - s̄ = 0.35"
[11] "Two Spirals + Sphere - Average - s̄ = 0.432"
[12] "Two Spirals + Sphere - Ward - s̄ = 0.389"   

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

  1. Which linkage works best for which scenario?

    Three spheres: All linkages except for single work best.
    Ring-Moon-Sphere: Complete and Average linkages do a fair job.
    Spirals: None of the linkages do a fantastic job. Average has the highest s value…

  2. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

In our case, yes. The average silhouette linkage consistently had the highest/tied for the highest silhouette value in each method.

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

(mammal_single <- agnes(mammals, method='single'))$ac
[1] 0.7875718
(mammal_complete <- agnes(mammals, method='complete'))$ac
[1] 0.8985539
(mammal_average <- agnes(mammals, method='average'))$ac
[1] 0.8706571
(mammal_ward <- agnes(mammals, method='ward'))$ac
[1] 0.9413994

The ward method leads to the highest AC value

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

m_single <- fviz_dend(mammal_single, cex = 0.5) + ggtitle("Single linkage")
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
m_complete <- fviz_dend(mammal_complete, cex = 0.5) + ggtitle("Complete linkage")
m_average <- fviz_dend(mammal_average, cex = 0.5) + ggtitle("Average linkage")
m_ward <- fviz_dend(mammal_ward, cex = 0.5) + ggtitle("Ward linkage")

(m_single + m_complete)/(m_average + m_ward)

Some animals that cluster first on all four dendrograms include:
Bison & Llama
Camel & Zebra
Dog & Rat
Dolphin & Seal

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

Single doesn’t have any clusters above 4…
Complete has 3 clusters left
Average has 2 clusters left
Ward has 4 clusters left

Obviously single has the least mammals (0), but in the instance this could be wrong, then Average would be next. The mammals in the smallest cluster are dolphin and seal

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

mammals_numeric <- mammals[, -1]  

# Dissimilarity
mammal_dist <- dist(mammals_numeric)

# WSS 
fviz_nbclust(mammals_numeric, 
             FUNcluster = hcut, 
             method = "wss", 
             diss = mammal_dist, 
             hc_method = "ward", 
             hc_func = "agnes") +
  labs(title = "Plot of WSS vs k")

# Average silhouette 
fviz_nbclust(mammals_numeric, 
             FUNcluster = hcut, 
             method = "silhouette", 
             diss = mammal_dist, 
             hc_method = "ward", 
             hc_func = "agnes") +
  labs(title = "Plot of Average Silhouette vs k")

Recreating the dendrogram:

mammal_ward <- agnes(mammals_numeric, method = "ward")

# Create 2 Clusters
k <- 2                                       
clusters <- cutree(as.hclust(mammal_ward), k = 2)
clusters                                    
     Bison    Buffalo      Camel        Cat       Deer        Dog    Dolphin 
         1          1          1          2          2          2          2 
    Donkey   Elephant        Fox Guinea pig      Hippo      Horse      Llama 
         1          1          1          2          1          1          1 
    Monkey       Mule  Orangutan        Pig     Rabbit        Rat   Reindeer 
         1          1          1          2          2          2          2 
      Seal      Sheep      Whale      Zebra 
         2          1          2          1 
# Plot Dendrogram
fviz_dend(mammal_ward, 
          k = 2,                 
          rect = TRUE,           
          cex = 0.6,             
          k_colors = c("red", "blue")) + 
  ggtitle(paste("Dendrogram with", k, "clusters"))

Plotting with k=3 clusters onlyput Dolphin and Seal in one single group, which doesn’t make it a very suitable option. Two different clusters keeps things simple without losing much variability in the data.

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

# Box Plot
mammals_numeric <- mammals %>% select(where(is.numeric))
mammals_numeric$cluster <- factor(clusters)

# Characterizes each column
mammals_long <- mammals_numeric %>%
  rownames_to_column(var = "Animal") %>%
  pivot_longer(cols = -c(Animal, cluster), 
               names_to = "Characteristic", 
               values_to = "Value")

p_box <- ggplot(mammals_long, aes(x = cluster, y = Value, fill = cluster)) +
  geom_boxplot() +
  facet_wrap(~Characteristic, scales = "free_y") +
  theme_classic() +
  labs(title = "Milk Characteristics by Cluster") +
  theme(axis.text.x = element_text(size = 10))
p_box

# PCA
mammal_data <- mammals %>% select(where(is.numeric))

mammal_pca <- prcomp(mammal_data)  

fviz_pca(mammal_pca,
         habillage = factor(clusters),  
         repel = TRUE) +
  ggtitle("PCA with 2 Clusters") +
  guides(color = "none", shape = "none") +
  theme(plot.title = element_text(hjust = 0.5, size = 20))

  • Cluster 2 seems to have higher protein and ash levels compared to cluster 1, while cluster 1 has higher lactose and water levels
  • Particularly, llama and bison have high water levels
  • Deer, whales, and reindeer all have high protein levels
  • Seals and dolphins have very high fat levels…

Overall, the pox plot and PCA graph better reveal the contents of each individual mammal’s milk.