Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)
library(dplyr)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

#automated iteration

datasets <- list(
  "Three_Spheres" = three_spheres,
  "Ring_Moon_Sphere" = ring_moon_sphere,
  "Two_Spirals_Sphere" = two_spirals_sphere
)

linkages <- c("single", "complete", "average", "ward.D2")
all_plots <- list()

hclust_and_plot <- function(data = datasets, method, k=3, data_name) {
  
  data_coords <- data %>% select(1:2)
  colnames(data_coords) <-c("X1", "X2")
  d_matrix <- dist(data_coords, method = "euclidean")
  #hierarchical clustering
  hc <- hclust(d_matrix, method = method)
  #cut tree
  data_coords$cluster <- as.factor(cutree(hc, k = k))
  #avg silhouette width
  sil <- silhouette(data_coords$cluster, d_matrix)
  avg_sil <- mean(sil[, 3])
  
  #plot
  
  title_text <- bquote(
    .(data_name) ~ "|" ~ .(method) ~ linkage ~ "|" ~ bar(s) ~ "=" ~ .(avg_sil)
  )
  
  plt <- ggplot(data, aes(x = X1, y = X2, color = cluster)) +
    geom_point(size = 1.5, alpha = 0.7) +
    labs(
      title = title_text,
      x = NULL,
      y = NULL,
      color = "Cluster"
    )+
    scale_color_brewer(palette = "Set1") +
    theme_minimal()+
    theme(
      legend.position = "none",
      plot.title = element_text(size = 10, face = "bold", hjust = 0.5)
    )
  return(plt)
  
}

``{r}

p_grid <- ( all_plots\(Three_Spheres_single + all_plots\)Ring_Moon_Sphere_single + all_plots\(Two_Spirals_Sphere_single ) / ( all_plots\)Three_Spheres_complete + all_plots\(Ring_Moon_Sphere_complete + all_plots\)Two_Spirals_Sphere_complete ) / ( all_plots\(Three_Spheres_average + all_plots\)Ring_Moon_Sphere_average + all_plots\(Two_Spirals_Sphere_average ) / ( all_plots\)Three_Spheres_ward.D2 + all_plots\(Ring_Moon_Sphere_ward.D2 + all_plots\)Two_Spirals_Sphere_ward.D2 ) + plot_annotation( title = ‘Agglomerative Hierarchical Clustering Comparison (k=3)’, caption = ‘Columns: Dataset | Rows: Linkage Method’ ) & theme(plot.title = element_text(size = 14, face = “bold”))

print(p_grid)




Discuss the following:

1. Which linkage works best for which scenario?
Ward is best with three spheres
complete is best with ring, moon, sphere
ward or average for two spirals and sphere

2. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?  

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)


# Question 2

Consider the data set below on milk content of 25 mammals.  The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary.  (Data source: Everitt et al. *Cluster analysis 4ed*)

::: {.cell}

```{.r .cell-code}
mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

:::

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

library(cluster)

mammals_single <- agnes(mammals, metric = 'euclidean', method = 'single')
mammals_complete <- agnes(mammals, metric = 'euclidean', method = 'complete')
mammals_average <- agnes(mammals, metric = 'euclidean', method = 'average')
mammals_ward <- agnes(mammals, metric = 'euclidean', method = 'ward')

mammals_single$ac

[1] 0.7875718

mammals_complete$ac

[1] 0.8985539

mammals_average$ac

[1] 0.8706571

mammals_ward$ac

[1] 0.9413994

Ward linkage method is the closest to 1 so it is the most strong-well-defined clustering structure out of the 4

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

library(factoextra)

fviz_dend(mammals_ward) +
  labs(title = 'Ward linkage')

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Reindeer and deer were linked together first

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

fviz_dend(mammals_ward, h = 4) +
  labs(title = 'Ward Linkage')

5 clusters form The cluster with the fewest mammals is the purple cluster which has dolphin and seal clustered together

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

fviz_nbclust(mammals, FUNcluster = hcut,
             k.max = 10,
             method = 'wss',
             )

(k3_sils <- silhouette(x = cutree(mammals_ward, k = 3), dist = dist(mammals))
 )%>%head

     cluster neighbor sil_width
[1,]       1        2 0.6206877
[2,]       1        2 0.5331256
[3,]       1        2 0.6686388
[4,]       1        2 0.1739062
[5,]       2        3 0.6464723
[6,]       2        1 0.3426564

(k4_sils <- silhouette(x = cutree(mammals_ward, k = 4), dist = dist(mammals))
 )%>%head

     cluster neighbor    sil_width
[1,]       1        2  0.004880895
[2,]       2        1  0.460679617
[3,]       1        2  0.205222792
[4,]       2        1  0.508208636
[5,]       3        2  0.563503814
[6,]       3        2 -0.156807895

(k2_sils <- silhouette(x = cutree(mammals_ward, k = 2), dist = dist(mammals))
 )%>%head

     cluster neighbor  sil_width
[1,]       1        2 0.67757784
[2,]       1        2 0.60907997
[3,]       1        2 0.71210469
[4,]       1        2 0.33005599
[5,]       2        1 0.52696171
[6,]       2        1 0.04645858

fviz_silhouette(k2_sils)

  cluster size ave.sil.width
1       1   17          0.60
2       2    8          0.39

fviz_silhouette(k3_sils)

  cluster size ave.sil.width
1       1   17          0.53
2       2    6          0.54
3       3    2          0.70

fviz_silhouette(k4_sils)

  cluster size ave.sil.width
1       1   11          0.37
2       2    6          0.51
3       3    6          0.37
4       4    2          0.70

3 clusters had the better average silhouette width of .55 2 clusters had .54 4 clusters had .43

fviz_dend(mammals_ward, k = 3) +
  labs(title = 'Ward Linkage')

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

mammals$cluster <- as.factor(cutree(mammals_ward, k = 3))

library(factoextra)
library(FactoMineR)

Warning: package 'FactoMineR' was built under R version 4.5.2

mammals_pca <- PCA(
  mammals[, -ncol(mammals)], 
  graph = FALSE,             
  scale.unit = TRUE          
)

fviz_pca_ind(
  mammals_pca,
  geom.ind = "point",       
  col.ind = mammals$cluster, 
  pointshape = 21,           
  fill.ind = mammals$cluster, 
#  palette = "jco",           
  addEllipses = TRUE,       
  legend.title = "H-Cluster",
  title = paste("PCA Biplot Colored by Ward Linkage Clusters (k=", 3, ")")
)

Too few points to calculate an ellipse