Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

Which linkage works best for which scenario?
Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

cluster_all_patchwork_with_silhouette <- function(data_list, 
                                                  k = 3, 
                                                  linkage_methods = c("single", "complete", "average", "ward"),
                                                  metric = "euclidean",
                                                  scale_data = TRUE) {
  plot_list <- list()
  
  for (dataset_name in names(data_list)) {
    data <- data_list[[dataset_name]]
    if (scale_data) {
      data <- scale(data)
    }
    for (method in linkage_methods) {
      hc <- agnes(data, method = method, metric = metric)
      clusters <- cutree(hc, k = k)
      clusters_factor <- as.factor(clusters)
      sil <- silhouette(clusters, dist(data))
      avg_sil_width <- round(mean(sil[, "sil_width"]), 3)
      df_plot <- bind_cols(as.data.frame(data), data.frame(cluster = clusters_factor))
      p <- ggplot(df_plot, aes(x = x, y = y, col = cluster)) +
        geom_point(size = 3) +
        theme_classic(base_size = 14) +
        guides(color = 'none') +
        labs(title = paste(dataset_name, "-", method, "s =", avg_sil_width))
      plot_list[[paste(dataset_name, method, sep = "_")]] <- p
    }
  }
  combined_patchwork <- wrap_plots(plot_list, nrow = length(data_list), ncol = length(linkage_methods))
  
  return(combined_patchwork)
}
data_list <- list(SSS = three_spheres, RMS = ring_moon_sphere, SpS = two_spirals_sphere)
all_12_plot <- cluster_all_patchwork_with_silhouette(data_list, k = 3, metric = "euclidean", scale_data = TRUE)
all_12_plot

For the Three Spheres Dataset the best linkage type to use would be any of the 4 linkage types they all cluster the points equally and this is further proven with the silhouette widths for all 4 being the same value of 0.72 which also shows for this dataset the silhouette widths are doing a good job.

For the Ring Moon Sphere dataset the best linkage type to use would be either the average or ward linkage because they appear to be the closest to having them split into the ring being its own color, the sphere being its own color, and the moon being its own color. The next closest would be the single but the sphere does not have its own color and gets included in the ring. When looking at silhouette widths we can see that the two highest average widths are the average linkage and ward linkage showing that the silhouette widths do a good job for quality of clustering methods.

For the Two Spiral and One Sphere dataset the best linkage type would be the single as it clusters by each spiral and the sphere. When looking at silhouette widths the best for clustering would be the ward linkage with a average width of .423 but this is not what we expected from the visual test. Here the silhouette width does not do a great job measuring the quality of the cluster solution.

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

(mammals_single <- agnes(mammals, method='single'))$ac

[1] 0.7875718

(mammals_complete <- agnes(mammals, method='complete'))$ac

[1] 0.8985539

(mammals_average <- agnes(mammals, method='average'))$ac

[1] 0.8706571

(mammals_ward <- agnes(mammals, method='ward'))$ac

[1] 0.9413994

We can see that ward has the best AC with 0.94.

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

fviz_dend(mammals_ward)+ ggtitle('Ward linkage')

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

The mammals that appear to cluster together first are the bison, llama, camel, and zebra, the donkey, mule, horse, monkey, and orangutan, the buffalo, fox, and sheep. Every other mammal apears to cluster together later on in the dendrogram.

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

fviz_dend(mammals_ward,k=4)+ ggtitle('Ward linkage, 4 clusters')

This dendrogram shows the clusters we would expect if we cut the tree at a height of 4. We would expect 4 different clusters with our cluster with the fewest mammals to be the purple cluster with dolphin and seal.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

fviz_nbclust(mammals, 
             FUNcluster = hcut,
             method='wss',
             hc_method='ward',
             hc_func = 'agnes') + 
  labs(title = 'Plot of WSS vs k')

fviz_nbclust(mammals, 
             FUNcluster = hcut,
             method='silhouette',
             hc_method='ward',
             hc_func = 'agnes') + 
  labs(title = 'Plot of WSS vs k')

fviz_dend(mammals_ward,k=3)+ ggtitle('Ward linkage, 3 clusters') +
fviz_dend(mammals_ward,k=4)+ ggtitle('Ward linkage, 4 clusters')

From the WSS plot we can see that the recommended amount of clusters is 4 and from the silhouette plot the recommended amount of clusters is 3. When looking at the dendrograms the best solution appears to be the 4 clusters.

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

mammals_pca <- prcomp(mammals, center=TRUE)
(ward_clusters <- cutree(mammals_ward, k = 4) 
    %>% data.frame()
    %>% setNames(c('k4')) 
)  %>% head(3)

k4_biplot <- fviz_pca(mammals_pca, 
         habillage = factor(ward_clusters$k4),
         repel = TRUE) + 
        ggtitle('4-cluster solution') + 
        guides(color='none',shape='none')
k4_biplot

Looking at the plot we can see the characteristics that begin to separate our clusters. We can see that our orangish cluster is higher in water and lactose content but low in ash and protein content. We can see our blue cluster is higher in Ash and Protein content. Our purple cluster is high in Fat content. but lower in water and lactose content. Our green appears to be in the middle of all the characteristics showing around equal value for all characteristics except for a slightly lower fat content.