STAT 223 - Homework 3

Question 1: K-means

Part a) Determine the number of clusters

Use 3 different methods to determine the number of clusters to use in k-means clustering. For each method, describe how many clusters it recommends.

Biplot: First 2 PCs

wine2 |> 
  prcomp() |> 
  fviz_pca_ind(geom = "point")

From the biplot, there appear to be 2 or 3 groups.

Elbow Plot:

fviz_nbclust(x = wine2, 
             FUNcluster = kmeans,
             method = "wss",
             nstart = 10,
             nboot = 100)

The elbow plot suggests 3 clusters.
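
To make explicit what the elbow plot traces, here is a minimal sketch that computes the total within-cluster sum of squares by hand for k = 1 to 6. It uses the built-in iris measurements as a stand-in for the wine data (an assumption for illustration), and the names X, ks, and wss are hypothetical.

```r
# Stand-in data (an assumption): standardized numeric columns of iris
X <- scale(iris[, 1:4])

set.seed(223)
ks <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(X, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing WSS by much
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k = 1 the within-cluster sum of squares equals the total sum of squares, so the curve always starts at its maximum and the question is only where the drop levels off.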

Silhouette Score Plot:

fviz_nbclust(x = wine2, 
             FUNcluster = kmeans,
             method = "silhouette",
             nstart = 10,
             nboot = 100)

The silhouette plot suggests anywhere from 2 to 4 clusters.

Gap Stat Plot:

fviz_nbclust(x = wine2, 
             FUNcluster = kmeans,
             method = "gap_stat",
             nstart = 10,
             nboot = 100)

The gap statistic suggests 3 clusters.

Overall, we’ll use 3 clusters.

Part b) Run K-means

With the number of clusters you determined in part 1a), run k-means clustering. Display a plot of the resulting data set.

set.seed(223)
wine_km3 <- 
  kmeans(x = wine2,
         centers = 3,
         nstart = 10)

fviz_cluster(object = wine_km3,
             data = wine,
             geom = "point") + 
  
  theme_bw()

Part c) Silhouette score

Calculate the silhouette score for each wine. Which, if any, of them appear to be misclustered?

silhouette(wine_km3$cluster, 
           dist = dist(wine2)) |> 
  fviz_silhouette() + 
  theme_classic() +
  theme(legend.position = "none") + 
  coord_cartesian(expand = FALSE) + 
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

##   cluster size ave.sil.width
## 1       1   59          0.44
## 2       2   64          0.24
## 3       3   55          0.36

The wines with a negative silhouette score are likely misclustered, since they are closer on average to the wines in a neighboring cluster than they are to the wines in their own cluster.
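
As a rough illustration of why a negative score signals a misclustered case, the sketch below computes silhouette widths by hand using s = (b - a) / max(a, b), where a is the average distance to the other cases in the same cluster and b is the smallest average distance to any other cluster. The function name silhouette_width and the toy data are hypothetical, not part of the assignment.

```r
# Silhouette width for each case, given a distance object and cluster labels
silhouette_width <- function(d, cluster) {
  d <- as.matrix(d)
  labels <- unique(cluster)
  sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                      # exclude the case itself
    if (!any(own)) return(0)             # convention for singleton clusters
    a <- mean(d[i, own])                 # average distance within own cluster
    b <- min(sapply(setdiff(labels, cluster[i]),
                    function(g) mean(d[i, cluster == g])))
    (b - a) / max(a, b)
  })
}

# Toy 1-D example: the third case (x = 2) is deliberately put in the wrong cluster
x <- c(0, 1, 2, 10, 11, 12)
cluster <- c(1, 1, 2, 2, 2, 2)

sil <- silhouette_width(dist(x), cluster)
round(sil, 2)
```

Only the third case gets a negative width here, because it sits much closer to cluster 1 than to the rest of cluster 2.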

Extra: Students don’t need to show this plot

silhouette(wine_km3$cluster, 
           dist = dist(wine2)) |> 
  data.frame() |> 
  mutate(misclustered = sil_width < 0,
         better_cluster = if_else(sil_width < 0,
                                  neighbor,
                                  cluster)) |> 
  
  bind_cols(prcomp(wine2) |> 
              pluck("x") |> 
              data.frame() |> 
              dplyr::select(PC1:PC2)) |> 
  
  ggplot(mapping = aes(x = PC1, 
                       y = PC2, 
                       color = factor(cluster),
                       shape = factor(better_cluster),
                       size = misclustered*2 + 1)) + 
  
  geom_point() + 
  
  scale_size_identity() + 
  
  theme_test() + 
  
  labs(color = "Actual \nCluster",
       shape = "Best \nCluster") + 
  
  theme(legend.position = "top")

It looks like 5 of the wines in the middle cluster belong in a neighboring cluster instead: 3 in the left cluster and 2 in the right cluster. All of the wines in the left and right clusters are in the correct cluster.

Part d) Calculate the average value of each of the original seven variables for the different clusters

Briefly describe any differences between the clusters

wine |> 
  mutate(cluster = factor(wine_km3$cluster)) |> 
  group_by(cluster) |> 
  summarize(across(.cols = alcohol:malic_acid,
                   .fns = mean))

## # A tibble: 3 × 8
##   cluster alcohol magnesium phenols flavanoids proline   hue malic_acid
##   <fct>     <dbl>     <dbl>   <dbl>      <dbl>   <dbl> <dbl>      <dbl>
## 1 1          13.7     108.     2.85      2.98    1124. 1.07        1.93
## 2 2          12.3      92.6    2.28      2.10     511. 1.07        1.77
## 3 3          13.1      99.2    1.71      0.926    618. 0.703       3.42

Bonus plot displaying the differences

Not something the students are expected to do, just including it as a way of displaying the results in the posted solutions.

wine2 |> 
  mutate(cluster = factor(wine_km3$cluster)) |> 
  group_by(cluster) |> 
  summarize(across(.cols = alcohol:malic_acid,
                   .fns = mean)) |> 
  
  pivot_longer(cols = alcohol:malic_acid) |> 
  
  mutate(name = as_factor(name)) |> 
  
  ggplot(mapping = aes(x = name,
                       y = cluster,
                       fill = value)) + 
  
  geom_tile(color = "white",
            linewidth = 0.5) + 
  
  theme_test() + 
  
  coord_cartesian(expand = FALSE) + 
  
  labs(fill = "z-score",
       x = NULL) + 
  
  scale_fill_gradient2(low = "darkred",
                      mid = "white",
                      high = "darkblue",
                      midpoint = 0) + 
  
  geom_text(mapping = aes(label = round(value, digits = 2)),
            color = "white",
            size = 5,
            fontface = "bold")

Cluster 1 has higher values than the other 2 for all variables except malic acid.
Cluster 2 has low or mid values for 6 of the 7 variables, with hue being above average.
Cluster 3 has the lowest values for several variables and is the highest for malic acid.

Question 2: DBSCAN

Part a) Determine \(\varepsilon\)

When the minimum number of points to form a cluster is 5, find the distance for the DBSCAN algorithm.

kNNdistplot(x = wine2,
            k = 5)

If the minimum number of points to form a cluster is 5, \(\varepsilon \approx 2\).
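
For intuition about what the k-NN distance plot shows, here is a minimal base R sketch that computes each point's distance to its 5th nearest neighbor and plots those distances in sorted order. It uses standardized iris measurements as a stand-in for the wine data (an assumption), and it only approximates what kNNdistplot() draws.

```r
# Stand-in data (an assumption): standardized numeric columns of iris
X <- scale(iris[, 1:4])
k <- 5

# Full pairwise Euclidean distance matrix
D <- as.matrix(dist(X))

# For each point, sort its distances; position 1 is the point itself
# (distance 0), so the k-th nearest neighbor sits at position k + 1
knn_dist <- apply(D, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: look for the "knee" where the curve turns upward
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))
```

The height of the knee is a reasonable candidate for the DBSCAN distance, since most points have k neighbors within that radius while the points past the knee do not.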

Part b)

Why is the choice of \(\varepsilon\) so much higher for the wine data set than the multishape data set used in the class example?

The multishape data has 2 variables (x & y) and the wine data has 7 variables.

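The more variables included, the higher the dimension of the data. In higher dimensions, two points tend to be farther apart, because each additional variable adds a non-negative term to the squared Euclidean distance between them.

This is one aspect of what is often referred to as the “Curse of Dimensionality”.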
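
A small simulation can make this concrete. The sketch below assumes independent, standardized normal variables (a simplification, not the wine data) and compares the average pairwise distance in 2, 7, and 20 dimensions; the function name mean_pairwise_dist is hypothetical.

```r
set.seed(223)
n <- 200

# Average pairwise Euclidean distance between n standardized points
# in p dimensions
mean_pairwise_dist <- function(p) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  mean(dist(X))
}

dims <- c(2, 7, 20)
avg_dist <- sapply(dims, mean_pairwise_dist)
names(avg_dist) <- paste0("p = ", dims)
avg_dist
```

Under these assumptions the expected squared distance between two points is 2p, so the typical distance grows roughly like the square root of 2p as variables are added.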

Part c) DBSCAN

Cluster the data using DBSCAN with the appropriate min points and \(\varepsilon\). How many clusters are created? How many outliers are there?

wine_db <- 
  dbscan(x = wine2,
         eps = 2,
         minPts = 5)

wine_db

## DBSCAN clustering for 178 objects.
## Parameters: eps = 2, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 5 noise points.
## 
##   0   1 
##   5 173 
## 
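## Available fields: cluster, eps, minPts, dist, borderPoints

With eps = 2 and minPts = 5, DBSCAN creates 1 cluster containing 173 wines and labels the remaining 5 wines as outliers (noise points).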

Part d) Visualize the result

fviz_cluster(object = wine_db,
             data = wine,
             geom = "point",
             axes = c(1, 2)) + 
  
  theme_bw() + 
  
  labs(title = "Clustering result of DBSCAN with PC 1 & 2")

The outliers don’t appear to stand out when examining the first 2 dimensions.

fviz_cluster(object = wine_db,
             data = wine,
             geom = "point",
             axes = c(1, 3)) +
  
  theme_bw() + 
  
  labs(title = "Clustering result of DBSCAN with PC 1 & 3")

Four of the five outliers stand out along PC 3.

Question 3: Agglomerative Hierarchical Clustering

Part a) Find the best link choice

For complete, single, average, and ward links, which choice is the most appropriate for the data?

wine_coph_cor <- 
  
  tibble(
    single   = cor_cophenetic(hcut(x = wine2, 
                                   hc_method = "single"), 
                              dist(wine2)),
    
    complete = cor_cophenetic(hcut(x = wine2, 
                                   hc_method = "complete"), 
                              dist(wine2)),
    
    average  = cor_cophenetic(hcut(x = wine2, 
                                   hc_method = "average"), 
                              dist(wine2)),
    
    ward     = cor_cophenetic(hcut(x = wine2, 
                                   hc_method = "ward.D"), 
                              dist(wine2))
  ) |> 
  
  pivot_longer(cols = everything(),
               names_to = "link",
               values_to = "coph_cor") |> 
  
  arrange(-coph_cor)

gt::gt(wine_coph_cor)

link	coph_cor
average	0.7411620
complete	0.6857130
ward	0.6499653
single	0.4651009

Average linkage has the highest cophenetic correlation, so it is the best choice for clustering the wine data. Complete and Ward linkage are fairly comparable to each other, while single linkage is much worse.
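
For readers who want to see what the cophenetic correlation measures, here is a minimal base R sketch using hclust() and cophenetic() directly. It uses the built-in USArrests data as a stand-in (an assumption), and it uses "ward.D2" for the Ward link, which is a choice made for this illustration rather than the setting used above.

```r
# Stand-in data (an assumption): standardized built-in USArrests data
X <- scale(USArrests)
d <- dist(X)

# "ward.D2" applies Ward's criterion to unsquared Euclidean distances
links <- c("single", "complete", "average", "ward.D2")

# Cophenetic correlation: correlation between the original distances and
# the heights at which each pair of cases is first merged in the tree
coph_cor <- sapply(links, function(link) {
  hc <- hclust(d, method = link)
  cor(d, cophenetic(hc))
})

sort(coph_cor, decreasing = TRUE)
```

A value close to 1 means the tree's merge heights preserve the original pairwise distances well, which is why the link with the highest value is preferred.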

Part b) Dendrogram

Using the link method you picked in part a), create a dendrogram for the clustering method. How many clusters do there appear to be?

wine_avg <-
  hcut(wine, 
       hc_method = "average", 
       k = 6, 
       stand = TRUE)

# Dendrogram for average linkage
gg_dend_avg <- 
  fviz_dend(x = wine_avg, 
            main = "Average Link") +
  
  geom_hline(yintercept = 3.4)

gg_dend_avg

From the dendrogram, there appear to be 6 clusters.
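
The link between a horizontal cut line and the number of clusters can be sketched with base R's cutree(). The example below uses the built-in USArrests data and k = 4 purely as stand-ins (assumptions); the names h_cut, by_height, and by_count are hypothetical.

```r
# Stand-in data (an assumption): standardized built-in USArrests data
hc <- hclust(dist(scale(USArrests)), method = "average")
n <- length(hc$order)
k <- 4

# hc$height holds the n - 1 merge heights in increasing order.
# After n - k merges there are k clusters, so any cut height between
# merge n - k and merge n - k + 1 yields exactly k clusters.
h_cut <- mean(hc$height[c(n - k, n - k + 1)])

by_height <- cutree(hc, h = h_cut)
by_count  <- cutree(hc, k = k)

table(by_height)
all(by_height == by_count)
```

Counting the vertical branches that a horizontal line crosses on the dendrogram is the visual version of the same calculation.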

Part c) Display the clustering results

Visually display the clustering results for your choice in part b). Briefly describe the graph(s) you create.

gg_3c_12 <- 
  fviz_cluster(wine_avg, 
               geom = "point", 
               ellipse = FALSE, 
               show.clust.cent = FALSE) +
  theme_bw() + 
  labs(title = "Wine AHC with Average Link") + 
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

gg_3c_34 <- 
  fviz_cluster(wine_avg, 
               geom = "point", 
               ellipse = FALSE, 
               show.clust.cent = FALSE,
               axes = c(3, 4)) +
  theme_bw() + 
  labs(title = NULL) + 
  theme(legend.position = "none")


gg_3c_12 / gg_3c_34

There appear to be 3 main clusters, separated by PC 1 and PC 2. These 3 are all fairly intermixed on PC 3 and PC 4.

One cluster only has 3 cases, and all of them have very low PC3 values.

Two clusters contain just a single case. One case has a very low PC2 value, and another has a very low PC4 value.

Part d) Calculate the average value of each of the original seven variables for the different clusters

Briefly describe any differences between the clusters. Ignore any clusters that have fewer than 5 cases.

wine |> 
  mutate(cluster = factor(wine_avg$cluster)) |> 
  group_by(cluster) |> 
  summarize(wines = n(),
            across(.cols = alcohol:malic_acid,
                   .fns = mean)) |> 
  filter(wines >= 5) |> 
  gt::gt()

cluster	wines	alcohol	magnesium	phenols	flavanoids	proline	hue	malic_acid
1	57	13.76053	107.33333	2.866140	3.0022807	1128.5439	1.0738596	1.906667
2	62	12.34661	90.64516	2.304677	2.1119355	511.8065	1.0480645	1.927581
4	54	13.04944	99.31481	1.673704	0.8609259	622.4259	0.7054815	3.341111

Bonus plot displaying the differences

Not something the students are expected to do, just including it as a way of displaying the results in the posted solutions.

wine2 |> 
  mutate(cluster = factor(wine_avg$cluster)) |> 
  group_by(cluster) |> 
  summarize(wines = n(),
            across(.cols = alcohol:malic_acid,
                   .fns = mean)) |> 
  
  filter(wines >= 5) |> 
  
  pivot_longer(cols = alcohol:malic_acid) |> 
  
  mutate(name = as_factor(name)) |> 
  
  ggplot(mapping = aes(x = name,
                       y = cluster,
                       fill = value)) + 
  
  geom_tile(color = "white",
            linewidth = 0.5) + 
  
  theme_test() + 
  
  coord_cartesian(expand = FALSE) + 
  
  labs(fill = "z-score",
       x = NULL) + 
  
  scale_fill_gradient2(low = "darkred",
                      mid = "white",
                      high = "darkblue",
                      midpoint = 0) + 
  
  geom_text(mapping = aes(label = round(value, digits = 2)),
            color = "white",
            size = 5,
            fontface = "bold")

Cluster 1 has higher values than the other 2 for all variables except malic acid.
Cluster 2 has low or mid values for 6 of the 7 variables, with hue being above average.
Cluster 4 has the lowest values for several variables and is the highest for malic acid.

STAT 223 - Homework 3 - Clustering

Solutions

2024-03-01
