Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions
#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

the template for the chart.

#  create a chart template 
plot_cluster_scatter <- function(data, plot_title = "Cluster scatterplot", subtitle_text = "silhouette_amt") {
  ggplot(data, aes(x = x, y = y, color = factor(clust3))) +
    geom_point(size = 3, alpha = 0.8) +
    scale_color_brewer(palette = "Dark2", name = "Cluster 3") +
    theme_classic(base_size = 14) +
    labs(
      title = plot_title,
      subtitle  = paste("AVG silhouette:" ,subtitle_text),
      x = "X Coordinate",
      y = "Y Coordinate"
    )
}

Heyyo, It’s worth noting that I hid the code chunks for the other two data tables and only showed the data for the two_spirals data, this is because their code is fairly repetative.

# do it for two_spirals_sphere
two_spirals_sphere_ward <- (cutree(agnes(two_spirals_sphere, method='ward'), k = 3) %>% 
  data.frame() %>% 
  setNames(c('clust3')) %>% 
  mutate(across(everything(), as.factor)) %>% 
  bind_cols(two_spirals_sphere) %>% 
  plot_cluster_scatter("spirals_sphere – ward", 
                       round(mean(silhouette(x= cutree(agnes(two_spirals_sphere, method='ward'), k = 3), dist =  dist(two_spirals_sphere))[,"sil_width"]),2)
                       )
  )

two_spirals_sphere_average <- (cutree(agnes(two_spirals_sphere, method='average'), k = 3) %>% 
  data.frame() %>% 
  setNames(c('clust3')) %>% 
  mutate(across(everything(), as.factor)) %>% 
  bind_cols(two_spirals_sphere)) %>% 
  plot_cluster_scatter("spirals_sphere – Average", 
                       round(mean(silhouette(x= cutree(agnes(two_spirals_sphere, method='average'), k = 3), dist =  dist(two_spirals_sphere))[,"sil_width"]),2)
                       )

two_spirals_sphere_complete <- (cutree(agnes(two_spirals_sphere, method='complete'), k = 3) %>% 
  data.frame() %>% 
  setNames(c('clust3')) %>% 
  mutate(across(everything(), as.factor)) %>% 
  bind_cols(two_spirals_sphere)) %>% 
  plot_cluster_scatter("spirals_sphere – Complete", 
                       round(mean(silhouette(x= cutree(agnes(two_spirals_sphere, method='complete'), k = 3), dist =  dist(two_spirals_sphere))[,"sil_width"]),2)
                       )

two_spirals_sphere_single <- (cutree(agnes(two_spirals_sphere, method='single'), k = 3) %>% 
  data.frame() %>% 
  setNames(c('clust3')) %>% 
  mutate(across(everything(), as.factor)) %>% 
  bind_cols(two_spirals_sphere)) %>% 
  plot_cluster_scatter("spirals_sphere – single", 
                       round(mean(silhouette(x= cutree(agnes(two_spirals_sphere, method='single'), k = 3), dist =  dist(two_spirals_sphere))[,"sil_width"]),2)
                       )
three_row <- three_spheres_ward        | three_spheres_average |
             three_spheres_complete    | three_spheres_single

ring_row  <- ring_moon_sphere_ward     | ring_moon_sphere_average |
             ring_moon_sphere_complete | ring_moon_sphere_single

two_row   <- two_spirals_sphere_ward   | two_spirals_sphere_average |
             two_spirals_sphere_complete | two_spirals_sphere_single

full_grid <- (three_row / ring_row / two_row) +
  plot_layout(guides = "collect")

full_grid & theme(legend.position = "bottom")

Discuss the following:

  1. Which linkage works best for which scenario?
  • clearly only method that actually handles the non convex data well at all is the single linkage method. however, itt does poorly at dealing with the convex data in three spheres.
  • for the other three methods they have the opposite situation, where they are really bad at non-convex data, but amazing at convex data.
  1. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?
  • for the most part it does a good job, however, it does a really bad job at guaging how well single linkage does when it is being ran on non-convex data. in those situations it literally did a perfect job of figururing out what point belonged in which group, but in some spots it marks it as though single did even worse than the other methods that did significantly worse when compared to single linkage.

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.) s

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

Ward has the best coefficient.

 agnes(mammals,metric = 'euclidean', method='single')$ac #mammals_single <-
[1] 0.7875718
 agnes(mammals,metric = 'euclidean', method='complete')$ac #mammals_complete<-
[1] 0.8985539
 agnes(mammals,metric = 'euclidean', method='average')$ac  #mammals_average <-
[1] 0.8706571
 agnes(mammals,metric = 'euclidean', method='ward') -> mammals_ward 
 mammals_ward$ac
[1] 0.9413994

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

fviz_dend(mammals_ward) + 
  labs(title='ward linkage')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

we would have exactly 4 clusters at that point.

the dolphin / seal cluster will be the one with the fewest mammals.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

based off of the below skree plotss, I’d say that we should probably have it at 3 or 4 depending on which chart you fancy. I’m deciding to go with three though as the silhouette skree seems to change pretty decisively in direction at that point.

fviz_nbclust(mammals,
             FUNcluster = hcut,
             k.max = 10,
             method='wss',
             hc_method='ward',
             hc_func = 'agnes')

fviz_nbclust(mammals,
             FUNcluster = hcut,
             k.max = 10,
             method='silhouette',
             hc_method='ward',
             hc_func = 'agnes')

fviz_dend(mammals_ward,k=3)+ ggtitle('Ward linkage, 3 clusters')

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

it’s interesting to me, that the reason dolphins and seals are in a separate cluster because they have a higher concentration of fat than others.They also contain little lactose in their milk, and supposedly little water… which seems odd to me for aquatic animals.

I noticed that rabits milk contain the most ash. I think this is likely because rabbits are clearly ash-holes through and through ;) joking aside the green cluster contains mammal milk that has high levels of ash and protein.

then we have the others. These animals are charecterized by containing large quantities of lactose and water in their milk

mammal_pca <-prcomp(mammals, center=TRUE, scale. = TRUE)

mammals_clusters<-(cutree(mammals_ward, k = 3) 
    %>% data.frame()
    %>% setNames(c('k3'))
)

k3_biplot <- fviz_pca(mammal_pca, 
         habillage = factor(mammals_clusters$k3),
         repel = TRUE) + 
        ggtitle('3-cluster solution') + 
        guides(color='none',shape='none')