Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/cluster_data3.csv')

cluster_and_silhouette <- function(data, linkage_method, dataset_name) {
  
  # 1. Run AGNES clustering
  ag <- agnes(data, metric = "euclidean", method = linkage_method)
  
  # 2. Cut into 3 clusters
  cl <- cutree(ag, k = 3)
  
  # 3. Compute silhouettes
  sil <- silhouette(cl, dist(data))
  mean_sil <- mean(sil[, 3])
  
  # 4. Build scatterplot with clusters
  df <- data
  df$cluster <- factor(cl)
  
  p <- ggplot(df, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    theme_minimal() +
    labs(
      title = paste0(dataset_name, " — ", linkage_method,
                     " (avg sil = ", round(mean_sil, 3), ")")
    ) +
    scale_color_brewer(palette = "Dark2")
  
  return(list(
    plot = p,
    mean_sil = mean_sil,
    clusters = cl
  ))
}

#Three-Spheres
p1 <- cluster_and_silhouette(three_spheres, "single", "Three Spheres")
p2 <- cluster_and_silhouette(three_spheres, "complete", "Three Spheres")
p3 <- cluster_and_silhouette(three_spheres, "average", "Three Spheres")
p4 <- cluster_and_silhouette(three_spheres, "ward", "Three Spheres")

#Ringmoon Spheres
p5 <- cluster_and_silhouette(ring_moon_sphere, "single", "Ring Moon")
p6 <- cluster_and_silhouette(ring_moon_sphere, "complete", "Ring Moon")
p7 <- cluster_and_silhouette(ring_moon_sphere, "average", "Ring Moon")
p8 <- cluster_and_silhouette(ring_moon_sphere, "ward", "Ring Moon")

#Two Spiral Sphere
p9  <- cluster_and_silhouette(two_spirals_sphere, "single", "Two Spirals")
p10 <- cluster_and_silhouette(two_spirals_sphere, "complete", "Two Spirals")
p11 <- cluster_and_silhouette(two_spirals_sphere, "average", "Two Spirals")
p12 <- cluster_and_silhouette(two_spirals_sphere, "ward", "Two Spirals")

(p1$plot | p2$plot | p3$plot | p4$plot) /
(p5$plot | p6$plot | p7$plot | p8$plot) /
(p9$plot | p10$plot | p11$plot | p12$plot)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

Which linkage works best for which scenario?

The linkage that works best would probably be best for the Three Spheres would be average or ward this is both judging by the fact that the clusters 3 are most clear in those graphs spheres. This is also supported by the fact that the values are the highest for ward and average values (single is weakest). For the ringmoon dataset, the single dataset is the one that works the best by visuals. However the one that has the highest silhouette value is the average. For the Two Spirals data set the one that works the best is the single linkage just based on the visuals although it still suffers from chaining. Overall, hierarchical clustering is not good enough to really try and which linkage is best.
Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

Short answer - no. Because there are some instances (especially in two spirals) where the one with the lowest silhouette average actually had the a very close image to what was described in the question prompt. And in some of the datasets it really wasn’t a huge deal what the average silhouette was when compared to how the visuals of each linkage type stacked up. Ring Moon is a good example of this. In general the average silhouette width is not a great indicator of measuring the quality of clustering solutions (single, complete,average, ward)

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

mammal_single_linkage <- agnes(mammals,metric = 'euclidean',method = 'single')

mammal_average_linkage <- agnes(mammals, metric = 'euclidean', method = 'average')

mammal_complete_linkage <- agnes(mammals, metric = 'euclidean', method = 'complete')

mammal_ward_linkage <- agnes(mammals, metric = 'euclidean', method = 'ward')

#Checking the AC for each linkage

mammal_single_linkage$ac

[1] 0.7875718

mammal_average_linkage$ac

[1] 0.8706571

mammal_complete_linkage$ac

[1] 0.8985539

mammal_ward_linkage$ac

[1] 0.9413994

#Ward has the best AC out of the four linkage methods #with 0.94 coefficient.

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

#library(factextra)

#Plotting Dendrogram
fviz_dend(mammal_ward_linkage) + 
labs(title = 'Mammal Ward Linkage')

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

#Finding the first cluster of mammals 
first_merge <- mammal_ward_linkage$merge[1,] 

mammal_names <- rownames(mammals)

first_pair <- mammal_names[abs(first_merge)]
first_pair

[1] "Deer"     "Reindeer"

#Deer & Reindeer are the first mammals 
#that cluster together

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

#If a tree is cut at height of 4
#Using a dendrogram graph 

fviz_dend(mammal_ward_linkage, k=4) + 
  labs(title= "Mammal Ward Linkage W/ 4 level cluster")

#Based on cutting a height of 4, we 
#can see that there will be 4 clusters
#created from using that value as a 
#standpoint value. The cluster with the #fewest mammals #is the fourth (purple)  
#cluster and the mammels in that cluster
#are Dolphins & Seals.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

#WSS & average silhouette method 

fviz_nbclust(mammals, 
  FUNcluster = hcut,
  k.max = 10, 
  method = 'wss',
  hc_method = 'ward',
  hc_func = 'agnes'
)

#we can tell by looking at the elbow plot
#itself that by the line with the biggest #slope overall is at k=2 9although k=3 could work too). We can recreate the cluster with the dendrogram specified

#average silhoette method 

fviz_nbclust(
  mammals,
  FUNcluster = hcut,
  k.max = 10,
  method = "silhouette",
  hc_method = "ward.D2"
)

#In this case looking at the average silhouette method the number of clusters we should be using is at k=3. In which case we can go with cut tree height too.  
#If we cut the tree at k=2 
fviz_dend(mammal_ward_linkage, k=2) + 
labs(title = 'Mammal Ward Linkage W/ k=2 clustering')

#If we cut the tree at k=3 

fviz_dend(mammal_ward_linkage, k=3) +
labs(title = 'Mammal Ward Linkage W/ k=3 clustering')

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

#Using PCA and a biplot of the mammal #data. The 


mammals_pca <- prcomp(mammals, center = TRUE, scale = TRUE) 


#PCA Biplot with k=2
k2 <- cutree(mammal_ward_linkage, k = 2)


k2_biplot <- fviz_pca(mammals_pca, 
  habillage = factor(k2),
  repel = TRUE)+
  ggtitle('2-cluster solution') +
  guides(color = 'none')
k2_biplot

#With a clustering of k=2 biplot we can see that the 2 dimensional PCA #space, a two cluster solution. The mammal milk dataset projected into a 2 dimensional #PCA space. PC1 explains 78.1% of the  #variance in the data set while PC2 
#explains 17.2% of the dataset. 
#(the vectors) are more common by cluster
#We see that the first cluster is more
#abundant in milk with Water & Lactose. 
#While lacking in higher fat such as 
#ash, protein and fat. The second cluster #is more abundant in milk with ash, protein & fat (fatier acid. 
#



#PCA Biplot with k=3 
k3 <- cutree(mammal_ward_linkage,k=3)

k3_biplot <- fviz_pca(mammals_pca,
                habillage = factor(k3), repel = TRUE) +
  ggtitle("3-cluster solution") +
  guides(color = 'none')

k3_biplot

#With clustering k=3 biplot we can see #that within the 2 dimensional PCA #space with 3 clusters spread #throughout. PC1 again explains about #78.1% of the variance in the plot, 
#PC2 explains about 17.2% of the #variance. We that the first cluster #is abundant in water & lactose fatty
#acids. The first cluster (Red) is #more abundant in acids like water & 
#lactose while being substantially #less abundant in fat. 
#The second cluster is more abundant
#in nutrients like ash and protein 
#while being less abundant in water 
#& lactose. The third cluster (blue) #mammals milk are more rich in fat #but is 
#substantially less rich in nutrients
#like water & lactose.