Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider the three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where any two points in the cluster can be joined by a straight line segment that does not leave the cluster.) In the second, one cluster is a sphere, one is a ring, and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare how well various hierarchical methods recover clusters of different shapes.

three_spheres <- read.csv('./Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('./Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('./Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

Which linkage works best for which scenario?
Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

cluster_link <- \(data, link) {

  # Standardize once so the tree and the silhouette use the same distances
  scaled <- scale(data)
  clusters <- agnes(scaled, metric = 'euclidean', method = link)

  # Cut the tree into three clusters
  membership <- cutree(clusters, k = 3)

  clusters_df <- (
    data.frame(cluster = as.factor(membership))
    %>% bind_cols(data)
  )

  # Average silhouette width of the 3-cluster solution
  sil <- silhouette(x = membership, dist = dist(scaled))
  sil_avg <- round(mean(sil[, 'sil_width']), 2)

  g <- (ggplot(data = clusters_df, aes(x = x, y = y, col = cluster)) +
          geom_point() +
          theme_classic() +
          labs(title = paste0(str_to_title(link), ' Linkage'),
               subtitle = paste0('Average Silhouette Width: ', sil_avg))
        )
  return(g)
}

cluster_link(three_spheres, 'single') + cluster_link(ring_moon_sphere, 'single') + cluster_link(two_spirals_sphere, 'single') +
  cluster_link(three_spheres, 'complete') + cluster_link(ring_moon_sphere, 'complete') + cluster_link(two_spirals_sphere, 'complete') +
  cluster_link(three_spheres, 'average') + cluster_link(ring_moon_sphere, 'average') + cluster_link(two_spirals_sphere, 'average') +
  cluster_link(three_spheres, 'ward') + cluster_link(ring_moon_sphere, 'ward') + cluster_link(two_spirals_sphere, 'ward') +
  plot_layout(ncol = 3)

Starting with the three spheres scenario, we can see that each linkage method does a great job of identifying the clusters. There is one point that the single and Ward linkages assigned to cluster 2 and the other two linkages assigned to cluster 3. Other than that, the linkage methods agree. They also all have an average silhouette width of 0.73. Given the high average silhouette width and the nearly identical clusters, any of the four methods would work for this situation.

Next we will look at the ring, moon, and sphere scenario. The single linkage has the lowest average silhouette width, but it is the only method that identifies the full ring as one cluster and the only one that identifies the moon in the center as its own cluster. It still falls short, however, because it includes the sphere as part of the ring. The other linkage methods have a notably higher average silhouette width, but each of them splits the ring across all 3 clusters. Overall, I would prefer the single linkage method since it found the ring and the moon, even though it had the lowest average silhouette width.

Lastly, we have the two spirals and sphere scenario. The single linkage again has the lowest average silhouette width, but it correctly identified all 3 clusters. Every other linkage method just divided the data into 3 arbitrary blobs. I would say the single linkage definitely performs the best here.

After looking at all of these scenarios, I would say that the average silhouette width does not always do a great job of evaluating these linkage methods. In the last 2 examples, I preferred the single linkage, but it had the lowest average silhouette width. The silhouette rewards compact, well-separated clusters, so it penalizes correct solutions whose clusters are rings or spirals. I think that silhouette width can help make decisions, but it is important to check the clusters visually as well.
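As a small illustration of why silhouette width can mislead on non-convex shapes, the sketch below builds two concentric rings (synthetic data made up for this example, not one of the assignment's data sets) and clusters them with single linkage using base R's `hclust()`. Under these assumptions the rings are recovered exactly, yet the outer ring still gets a negative mean silhouette width, because its points are on average closer to the inner ring than to the far side of their own ring.

```r
library(cluster)  # silhouette()

# Synthetic data (an assumption for illustration): two concentric rings
theta <- seq(0, 2 * pi, length.out = 101)[-101]   # 100 evenly spaced angles
rings <- rbind(
  data.frame(x = cos(theta),     y = sin(theta),     truth = 1L),  # inner ring
  data.frame(x = 5 * cos(theta), y = 5 * sin(theta), truth = 2L)   # outer ring
)

d <- dist(rings[, c("x", "y")])
single_labels <- cutree(hclust(d, method = "single"), k = 2)

# TRUE when every true ring maps onto exactly one single-linkage cluster
recovered <- all(table(single_labels, rings$truth) %in% c(0, 100))

# Mean silhouette width per true ring for the single-linkage solution
sil <- silhouette(single_labels, d)
sil_by_ring <- tapply(sil[, "sil_width"], rings$truth, mean)

recovered
sil_by_ring
```

The per-ring breakdown is the useful part: a single overall average hides which cluster is being penalized and why.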

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('./Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

head(mammals)

         Water Protein    Fat Lactose    Ash
Bison    0.681  -0.387 -0.818   0.856  0.073
Buffalo  0.307  -0.085 -0.229   0.310 -0.165
Camel    0.743  -0.742 -0.657   0.365 -0.303
Cat      0.268   1.064 -0.381   0.146 -0.224
Deer    -0.955   1.147  0.893  -0.836  1.063
Dog     -0.145   0.845 -0.077  -0.618  0.667

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

(mammals_single <- agnes(mammals, method = 'single'))$ac

[1] 0.7875718

(mammals_complete <- agnes(mammals, method = 'complete'))$ac

[1] 0.8985539

(mammals_average <- agnes(mammals, method = 'average'))$ac

[1] 0.8706571

(mammals_ward <- agnes(mammals, method = 'ward'))$ac

[1] 0.9413994

The best agglomerative coefficient belongs to the Ward linkage (AC ≈ 0.941), followed by complete (0.899), average (0.871), and single (0.788).
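The four separate calls can also be collapsed into one comparison. The sketch below is a minimal version of that idea; it uses the built-in `USArrests` data as a stand-in (an assumption, since `mammal_milk.csv` is not available here), so the numbers it prints are not the mammal results.

```r
library(cluster)  # agnes()

# Stand-in data (assumption): scaled USArrests instead of the mammals table
x <- scale(USArrests)

# Fit one tree per linkage and pull out its agglomerative coefficient
linkages <- c("single", "complete", "average", "ward")
ac <- sapply(linkages, function(m) agnes(x, method = m)$ac)

# Highest coefficient first
round(sort(ac, decreasing = TRUE), 3)
```

Replacing `x` with `mammals` should reproduce the four coefficients reported above.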

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

fviz_dend(mammals_ward) + ggtitle('Mammals Dendrogram with Ward Linkage')

Many of the mammals that cluster together first are closely related or similar species. For example, donkey, mule, and horse clustered together very early. Dolphin and seal clustered together, as did monkey and orangutan, and deer and reindeer. A few early pairings do not make obvious sense, but for the most part similar species joined early on.

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

If the tree is cut at a height of 4, there will be 4 clusters. Cluster 4 will have the fewest mammals: only dolphin and seal. It makes sense that no other mammals join this cluster, since these are the aquatic mammals in the data set and their milk differs from that of the land mammals.
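The answer above was read off the dendrogram, but a height cut can also be checked in code. The sketch below is a minimal version using `USArrests` as stand-in data (an assumption), so the cluster sizes it prints are not the mammal ones. It converts the `agnes` tree with `as.hclust()` before calling `cutree()` with `h`, which is the safe route because `cutree()` expects merge heights in increasing order.

```r
library(cluster)  # agnes()

# Stand-in data (assumption): scaled USArrests instead of the mammals table
fit  <- agnes(scale(USArrests), method = "ward")
tree <- as.hclust(fit)

# Cut at a fixed height rather than a fixed number of clusters
h <- 4
groups <- cutree(tree, h = h)

# Cluster sizes, and the members of the smallest cluster
sizes <- table(groups)
smallest <- names(groups)[groups == which.min(sizes)]

sizes
smallest
```

With the mammals tree, `as.hclust(mammals_ward)` and `h = 4` would give the cluster count and smallest cluster directly.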

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

fviz_nbclust(mammals,
             FUNcluster = hcut,
             k.max = 10,
             method='wss',
             hc_method='ward',
             hc_func = 'agnes')

fviz_nbclust(mammals,
             FUNcluster = hcut,
             k.max = 10,
             method='silhouette',
             hc_method='ward',
             hc_func = 'agnes')

From the WSS plot, there appears to be an elbow at 4 clusters, although it is not as clear as in other examples we have seen. An argument could also be made for an elbow at 3 clusters, but it is less pronounced than the one at 4. In the silhouette plot, the optimal number of clusters is 3, with 2 close behind. Going from 3 clusters to 4 produces a sizable drop in average silhouette width. Given this drop, and since the WSS plot does not rule out 3, I believe that 3 clusters is the right number moving forward.

fviz_dend(mammals_ward, k = 3) + ggtitle('Ward Linkage with 3 Clusters')
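For readers who want to see what the WSS curve measures, the sketch below computes the total within-cluster sum of squares by hand for k = 1 to 10 on a Ward tree. It uses `USArrests` as stand-in data, and `wss_at_k` is a hypothetical helper written for this example, not a function from `factoextra`.

```r
library(cluster)  # agnes()

# Stand-in data (assumption): scaled USArrests instead of the mammals table
x    <- scale(USArrests)
tree <- as.hclust(agnes(x, method = "ward"))

# Hypothetical helper: total within-cluster sum of squares for a k-cluster cut
wss_at_k <- function(k) {
  labels <- cutree(tree, k = k)
  per_cluster <- sapply(
    split(as.data.frame(x), labels),
    function(g) sum(scale(g, scale = FALSE)^2)  # squared deviations from centroid
  )
  sum(per_cluster)
}

wss <- sapply(1:10, wss_at_k)
round(wss, 1)
```

Because the cuts of one tree are nested, this curve can only go down as k grows; the elbow is where the decrease flattens out.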

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

cluster_membership <- (cutree(mammals_ward, k = 3)
                       %>% data.frame()
                       %>% setNames('Cluster') 
                       )

mammals_pca <- prcomp(mammals)

fviz_pca(mammals_pca, 
    habillage = factor(cluster_membership$Cluster),
    repel = TRUE) + 
    ggtitle('3-cluster solution') + 
    guides(shape='none')

mammals_up <- (
  mammals
  %>% rownames_to_column('Mammal')
  %>% mutate(Cluster = factor(cluster_membership$Cluster))  
  %>% pivot_longer(cols=Water:Ash, 
                       names_to = 'variable', 
                       values_to = 'value')
)

ggplot(data = mammals_up) + 
  geom_boxplot(aes(x = Cluster, y = value, fill = Cluster)) + 
  facet_wrap(~variable, scales = 'free_y') + 
  theme_classic()

From the PCA biplot, we can see right away that the red cluster (cluster 1) has high water and lactose levels in its milk. The box plots confirm this: cluster 1 has the highest lactose and water content. The box plots also show that cluster 1 has low levels of everything else. This is supported by the biplot, where every other arrow points away from cluster 1.

Next, we can look at the green cluster (cluster 2). These mammals have high protein and ash levels in their milk. The arrows for the other variables are roughly perpendicular to the arrows that point toward cluster 2, which suggests that membership in cluster 2 says little about the water, lactose, and fat content of the milk. The box plots support this, as cluster 2 sits in the middle for these 3 variables. The box plots also confirm the high levels of ash and protein.

Lastly, the blue cluster (cluster 3) is off on its own in the bottom right corner. The arrow pointing in this direction is the fat arrow, which tells us that the animals in cluster 3 have a very high fat content in their milk. Again, the box plots confirm this: these animals have a much higher fat content than the animals in the other clusters. One thing the box plots show that is not immediately clear from the biplot is that cluster 3 has a very high protein content as well.
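A table of cluster means is a compact numeric companion to the box plots. The sketch below shows one way to build it, again on `USArrests` as stand-in data (an assumption); because the variables are z-scores, a positive mean indicates a cluster above the overall average on that variable and a negative mean one below it.

```r
library(cluster)  # agnes()

# Stand-in data (assumption): scaled USArrests instead of the mammals table
x      <- scale(USArrests)
labels <- cutree(as.hclust(agnes(x, method = "ward")), k = 3)

# Mean z-score of every variable within each cluster
centroids <- aggregate(as.data.frame(x), by = list(Cluster = labels), FUN = mean)
sizes     <- as.vector(table(labels))

cbind(centroids["Cluster"], n = sizes, round(centroids[-1], 2))
```

Applied to `mammals` with `cluster_membership$Cluster` as the grouping, the same table would put numbers on the high-fat, high-protein profile of cluster 3.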