j— title: “Activity 4.1 - Hierarchical clustering” format: html editor_options: chunk_output_type: console


SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions
#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

head(three_spheres)
          x        y
1 0.8131339 4.084861
2 1.0518924 4.410097
3 0.8745572 3.866620
4 1.1435424 3.742791
5 1.2999460 4.345134
6 1.1079499 3.820448
head(ring_moon_sphere)
           x         y
1  0.8257353 0.5452301
2 -0.8033661 0.5292291
3  0.3204441 0.8884549
4  0.4289379 0.8779873
5 -0.2574707 1.0539014
6 -0.2874405 0.9242571
head(two_spirals_sphere)
           x        y
1 -0.1537584 1.895018
2 -0.6533304 1.780637
3 -0.3837657 1.922041
4 -0.7466500 1.743856
5 -0.7956078 2.012580
6 -0.9080305 1.985178
datasets <- list(
  spheres = three_spheres,
  ringmoon = ring_moon_sphere,
  spirals = two_spirals_sphere
)

linkages <- c("single", "complete", "average", "ward")
plot_raw <- function(data, title) {
  ggplot(data, aes(x, y)) +
    geom_point(size = 1.5, alpha = 0.7) +
    labs(title = title) +
    theme_minimal()
}
p1 <- plot_raw(three_spheres, "Three Spheres")
p2 <- plot_raw(ring_moon_sphere, "Ring / Moon / Sphere")
p3 <- plot_raw(two_spirals_sphere, "Two Spirals + Sphere")

p1

p2

p3

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

Discuss the following:

  1. Which linkage works best for which scenario?
  2. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

make_cluster_plot <- function(data, linkage) {
  
  
  agn <- agnes(data, method = linkage, metric = "euclidean")
  
  
  cl <- cutree(agn, k = 3)
  
 
  d <- dist(data)
  sil <- silhouette(cl, d)
  avg_s <- mean(sil[, 3])
  
 ggplot(data, aes(x, y, color = factor(cl))) +
  geom_point(size = 1.6) +
  labs(
    title = paste(linkage, "linkage | avg s =", round(avg_s, 3)),
    color = "Cluster"
  ) +
  theme_minimal() +
theme(
  plot.title = element_text(size = 7),
  plot.margin = margin(t = 5, r = 6, b = 8, l = 8),
  legend.position = "none"       
)

}
plot_list <- list()

for (dname in names(datasets)) {
  for (link in linkages) {
    dat <- datasets[[dname]]
    plot_list[[paste(dname, link, sep = "_")]] <-
      make_cluster_plot(dat, link)
  }
}

# Display 4x3 grid
final_grid <- wrap_plots(plot_list, ncol = 4, nrow = 3)
final_grid

Part 1: Spheres: Average linkage, ward linkage, and complete linkage all have the same score and seem to do a good job with the clustering, the only difference is that complete linkage has a one point difference than the other two clustering methods. Ring: Single linkage looks the most accurate but score wise is not the highest. Spiral: Single linkage looks the best but again score wise is not the most accurate.

Part 2: The silhoutte score does not seem to be a very good indicator of clustering accuracy as for the ring and spiral the one with the lowest score was the most accurate. It finds the average length of the center of the cluster which is not important when you have very oddly shaped clusters.

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

head(mammals)
         Water Protein    Fat Lactose    Ash
Bison    0.681  -0.387 -0.818   0.856  0.073
Buffalo  0.307  -0.085 -0.229   0.310 -0.165
Camel    0.743  -0.742 -0.657   0.365 -0.303
Cat      0.268   1.064 -0.381   0.146 -0.224
Deer    -0.955   1.147  0.893  -0.836  1.063
Dog     -0.145   0.845 -0.077  -0.618  0.667

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

# 2A) agnes with 4 linkages
agn_single   <- agnes(mammals, method = "single")
agn_complete <- agnes(mammals, method = "complete")
agn_average  <- agnes(mammals, method = "average")
agn_ward     <- agnes(mammals, method = "ward")

# agglomerative coefficients
agn_list <- list(
  single   = agn_single,
  complete = agn_complete,
  average  = agn_average,
  ward     = agn_ward
)

ac_values <- sapply(agn_list, function(x) x$ac)
ac_values
   single  complete   average      ward 
0.7875718 0.8985539 0.8706571 0.9413994 

Ward has the best agglomerative coefficient with a value of 0.941

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

# find best method automatically
best_name <- names(ac_values)[which.max(ac_values)]
best_agn  <- agn_list[[best_name]]

# convert to hclust for plotting
best_hclust <- as.hclust(best_agn)

# 2B) dendrogram of best method
fviz_dend(best_hclust, cex = 0.7, main = paste("Dendrogram -", best_name, "linkage"))
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

# mammals that cluster together first (first merge in the tree)
first_merge <- best_hclust$merge[1, ]
first_pair  <- best_hclust$labels[abs(first_merge)]
first_pair
[1] "Deer"     "Reindeer"

Deer and reindeer are the lowest on the dendrogram meaning that they merge first.

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

If it was cut at a height of 4 there would be 4 clusters, dolphin and seal will have the fewest.

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

# 2D) WSS and silhouette to choose k
fviz_nbclust(mammals, FUN = hcut, method = "wss",
             k.max = 10, hc_method = best_name)
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

fviz_nbclust(mammals, FUN = hcut, method = "silhouette",
             k.max = 10, hc_method = best_name)
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

k_opt <- 3   # <--- set this to your chosen k

cl_k <- cutree(best_hclust, k = k_opt)

fviz_dend(
  best_hclust,
  k = k_opt,
  rect = TRUE,
  k_colors = "jco",
  cex = 0.7,
  main = paste("Dendrogram with", k_opt, "clusters (", best_name, "linkage)")
)

WSS and Silhouette both appear to have an agreement on 3 being a good number of clusters so that is the number that I will go with.

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

# Perform PCA on the mammals dataset (already standardized)
pca_mammals <- prcomp(mammals, scale. = FALSE)

# Convert PCA scores to a data frame and add cluster labels
pca_df <- as.data.frame(pca_mammals$x) %>%
  mutate(
    Cluster = factor(cl_k),          
    Mammal  = rownames(mammals)
  )
# PCA loadings for PC1 + PC2
loadings <- as.data.frame(pca_mammals$rotation[, 1:2])
loadings$Var <- rownames(loadings)

# Scale arrows so they are visible
arrow_scale <- 3

# PCA biplot – SINGLE ggplot() call
ggplot() +
  # Points = mammals colored by cluster (NO labels)
  geom_point(data = pca_df, aes(PC1, PC2, color = Cluster), size = 2) +
  
  # Arrows = variable directions
  geom_segment(
    data = loadings,
    aes(
      x = 0, y = 0,
      xend = PC1 * arrow_scale,
      yend = PC2 * arrow_scale
    ),
    arrow = arrow(length = unit(0.25, "cm")),
    color = "black"
  ) +
  
  # Variable names at arrow tips
  geom_text(
    data = loadings,
    aes(
      x = PC1 * arrow_scale * 1.15,
      y = PC2 * arrow_scale * 1.15,
      label = Var
    ),
    size = 4,
    fontface = "bold"
  ) +
  
  # High-contrast cluster colors
  scale_color_manual(values = c(
    "1" = "#E41A1C",   # bright red
    "2" = "#377EB8",   # strong blue
    "3" = "#4DAF4A"    # vivid green
  )) +
  
  theme_minimal() +
  labs(
    title = "PCA Biplot of Mammal Milk Characteristics",
    x = "PC1",
    y = "PC2",
    color = "Cluster"
  )