Activity 4.1 - Hierarchical clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions
#Loading packages needed for this assignment
library(tidyverse)
library(cluster)
library(factoextra)
library(patchwork)

Question 1

Consider three data sets below. Each data set contains three “clusters” in two dimensions. In the first, the clusters are three convex spheres. (A convex cluster is one where all points in the cluster can be connected with a straight line that does not leave the cluster.) In the second, one cluster is a sphere; one is a ring; and one is a half-moon. In the third, two clusters are spirals and one is a sphere. Our goal is to compare the performance of various hierarchical methods in clustering different cluster shapes.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

Perform agglomerative clustering with single, complete, average, and Ward linkages. Cut each tree to produce three clusters. Produce 12 scatterplots, one per data set/linkage combination, showing the 3-cluster solution. Title each graph with the linkage used, as well as the average silhouette width (\(\bar s\)) for that clustering solution. Use patchwork to create a nice 4x3 grid of your plots.

# Helper: select first two numeric columns and standardize names
get_xy <- function(df) {
  num_cols <- df %>% select(where(is.numeric))
  if (ncol(num_cols) < 2) {
    stop("Data must contain at least two numeric columns.")
  }
  num_cols %>%
    select(1:2) %>%
    rename(x = 1, y = 2)
}

# Core function: hierarchical clustering + silhouette + plot
cluster_plot <- function(df, method = c("single","complete","average","ward.D2"), 
                         k = 3, dataset_label = "") {
  method <- match.arg(method)
  xy <- get_xy(df)

  # Distances and clustering
  d <- dist(xy, method = "euclidean")
  hc <- hclust(d, method = method)
  cl <- cutree(hc, k = k)

  # Silhouette (using precomputed distance)
  sil <- cluster::silhouette(cl, d)
  avg_s <- mean(sil[, "sil_width"])
  ttl <- paste0(dataset_label, " — ", method, 
                " | avg silhouette = ", sprintf("%.3f", avg_s))

  # Plot
  ggplot(xy, aes(x = x, y = y, color = factor(cl))) +
    geom_point(size = 1.6, alpha = 0.9) +
    scale_color_brewer(palette = "Dark2", name = "Cluster") +
    labs(title = ttl, x = "X", y = "Y") +
    theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(face = "bold"),
      legend.position = "none",
      panel.grid = element_line(size = 0.2, color = "grey90")
    )
}

# Read the data
three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

# Methods to evaluate
methods <- c("single", "complete", "average", "ward.D2")

# Generate plots for each dataset-method combination
plots_1 <- purrr::map(methods, ~ cluster_plot(three_spheres, method = .x, k = 3, dataset_label = "Three spheres"))
Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
plots_2 <- purrr::map(methods, ~ cluster_plot(ring_moon_sphere, method = .x, k = 3, dataset_label = "Ring–moon–sphere"))
plots_3 <- purrr::map(methods, ~ cluster_plot(two_spirals_sphere, method = .x, k = 3, dataset_label = "Two spirals + sphere"))

# Arrange each dataset’s plots in a 2x2 grid
grid_1 <- patchwork::wrap_plots(plots_1, ncol = 1) +
  patchwork::plot_annotation(title = "Three spheres — 4 linkage methods")

grid_2 <- patchwork::wrap_plots(plots_2, ncol = 1) +
  patchwork::plot_annotation(title = "Ring–moon–sphere — 4 linkage methods")

grid_3 <- patchwork::wrap_plots(plots_3, ncol = 1) +
  patchwork::plot_annotation(title = "Two spirals + sphere — 4 linkage methods")

# Display them one after another
grid_1

grid_2

grid_3

# If you prefer them stacked in one figure:
# grid_1 / grid_2 / grid_3

Discuss the following:

  1. Which linkage works best for which scenario? Three Spheres - average and ward.D2 look the same with a silhouette of about 0.727 and a better point distribution in my opinion

Ring-moon-sphere - average has the highest silhouette score of 0.575

Two spirals + sphere - average has the highest silhouette score of 0.432

  1. Does the average silhouette width always do a good job of measuring the quality of the clustering solution?

Overall it seems to do the best but I also feel that its measuring skill also is dependent on what you are trying to look at.

(Hint: you have a lot of repetitive code to write. You may find it helpful to write a function that takes a data set and a linkage method as arguments, does the clustering and computes average silhouette width, and produces the desired plot.)

Question 2

Consider the data set below on milk content of 25 mammals. The variables have been pre-scaled to z-scores, hence no additional standardizing is necessary. (Data source: Everitt et al. Cluster analysis 4ed)

mammals <- read.csv('Data/mammal_milk.csv') %>% 
  column_to_rownames('Mammal')

A)

Perform agglomerative clustering with single, complete, average, and Ward linkages. Which has the best agglomerative coefficient?

# Compute agglomerative coefficients for each linkage method
methods <- c("single", "complete", "average", "ward")

# Use purrr to iterate and extract coefficients
agg_coeffs <- purrr::map_dbl(methods, function(m) {
  cluster::agnes(mammals, method = m)$ac
})

# Create a tibble for easy viewing
tibble(
  Method = methods,
  Agglomerative_Coefficient = round(agg_coeffs, 3)
) %>%
  arrange(desc(Agglomerative_Coefficient))
# A tibble: 4 × 2
  Method   Agglomerative_Coefficient
  <chr>                        <dbl>
1 ward                         0.941
2 complete                     0.899
3 average                      0.871
4 single                       0.788

Ward has the best Agglomerative_Coefficient

B)

Plot a dendrogram of the method with the highest AC. Which mammals cluster together first?

# Perform Ward clustering using agnes
ward_clust <- cluster::agnes(mammals, method = "ward")

# Plot dendrogram
plot(ward_clust, which.plot = 2, main = "Ward Linkage Dendrogram — Mammal Milk Data")

Looks like Monkey and Orangutan would cluster together first

C)

If the tree is cut at a height of 4, how many clusters will form? Which cluster will have the fewest mammals, and which mammals will they be?

There would be 4 clusters. the fewest would be cluster 4 containing Dolphin and Seal

D)

Use WSS and average silhouette method to suggest the optimal number of clusters. Re-create the dendrogram with the cluster memberships indicated.

# WSS method (Elbow)
fviz_nbclust(mammals, FUN = hcut, method = "wss", k.max = 10, hc_method = "ward") +
  labs(title = "Elbow Method — WSS vs. Number of Clusters")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

# Silhouette method
fviz_nbclust(mammals, FUN = hcut, method = "silhouette", k.max = 10, hc_method = "ward") +
  labs(title = "Average Silhouette Width — Optimal Number of Clusters")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

# Ward clustering
ward_clust <- cluster::agnes(mammals, method = "ward")

# Convert to hclust for cutting and plotting
ward_hc <- as.hclust(ward_clust)

# Cut tree into 4 clusters
clust4 <- cutree(ward_hc, k = 4)

# Dendrogram with colored branches
fviz_dend(ward_hc, k = 4, 
          cex = 0.8, 
          k_colors = "jco", 
          rect = TRUE, 
          rect_border = "jco", 
          rect_fill = TRUE,
          main = "Ward Dendrogram — 4 Cluster Solution")
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

E)

Use suitable visualizations, including dimension reduction techniques, to explore the different milk characteristics of the assigned clusters. Discuss.

# Perform PCA
milk_pca <- prcomp(mammals, scale. = FALSE)

# Cut Ward tree into 4 clusters
ward_hc <- as.hclust(cluster::agnes(mammals, method = "ward"))
milk_clusters <- cutree(ward_hc, k = 4)

# Create PCA plot with cluster labels
fviz_pca_ind(milk_pca,
             geom.ind = "point",
             col.ind = factor(milk_clusters),
             palette = "jco",
             addEllipses = TRUE,
             ellipse.level = 0.95,
             legend.title = "Cluster",
             repel = TRUE,
             title = "PCA of Mammal Milk Composition — Cluster Overlay")
Ignoring unknown labels:
• linetype : "Cluster"
Too few points to calculate an ellipse

# Visualize variable loadings
fviz_pca_var(milk_pca,
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = "Variable Contributions to Principal Components")

Looking at the PCA cluster overlay it seems that the clusters do good at covering the points needed, however it looks like cluster 3 might be a little to big and a 5th cluster for the top right point could be helpful depending on what we are looking for. Using the Principal Components visual It seems like Ash, Fat and Water are the most prominent variables with Lactose and Protein still significant.