Question 1: K-means

Part a) Determine the number of clusters

Use 3 different methods to determine the number of clusters to use in k-means clustering. For each method, describe how many clusters it recommends.

PCA Plot: First 2 PCs

wine2 |> 
  prcomp() |> 
  fviz_pca_ind(geom = "point")

From the plot of the first 2 PCs, there appear to be 2 or 3 groups.

Elbow Plot:

fviz_nbclust(x = wine2, 
             FUNcluster = kmeans,
             method = "wss",
             nstart = 10,
             nboot = 100)

The elbow plot suggests 3 clusters.
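
The quantity on the vertical axis of an elbow plot is the total within-cluster sum of squares for each candidate number of clusters. As a rough sketch of that calculation in base R, the code below uses the scaled `iris` measurements as stand-in data (an assumption, since `wine2` is not reproduced here); the names `x`, `k_values`, and `wss` are illustrative only.

```r
# Stand-in data (an assumption): the four scaled measurements in iris.
# Swap in wine2 to mirror the elbow plot above.
x <- scale(iris[, 1:4])

set.seed(223)
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing wss by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve always decreases as k grows, so the choice rests on where the decrease levels off rather than on the minimum.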

Silhouette Score Plot:

fviz_nbclust(x = wine2, 
             FUNcluster = kmeans,
             method = "silhouette",
             nstart = 10,
             nboot = 100)

The silhouette plot suggests between 2 and 4 clusters.

Gap Stat Plot:

fviz_nbclust(x = wine2, 
             FUNcluster = kmeans,
             method = "gap_stat",
             nstart = 10,
             nboot = 100)

The gap statistic suggests 3 clusters.
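
For reference, the gap statistic compares the observed within-cluster dispersion with what would be expected from data that have no cluster structure. In the usual notation (the symbols are not part of the assignment), \(W_k\) is the pooled within-cluster sum of squares for \(k\) clusters and \(\mathrm{E}^*\) is the expectation under a reference distribution, estimated from the simulated data sets:

\[
\mathrm{Gap}(k) = \mathrm{E}^*\left[\log W_k\right] - \log W_k
\]

A larger gap means the data cluster more tightly than the reference data do at that value of \(k\).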

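Overall, we’ll use 3 clusters: the elbow plot and gap statistic both point to 3, and 3 is within the range suggested by the PCA plot and the silhouette plot.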

Part b) Run K-means

With the number of clusters you determined in part 1a), run k-means clustering. Display a plot of the resulting data set.

set.seed(223)
wine_km3 <- 
  kmeans(x = wine2,
         centers = 3,
         nstart = 10)

fviz_cluster(object = wine_km3,
             data = wine,
             geom = "point") + 
  
  theme_bw()

Part c) Silhouette score

Calculate the silhouette score for each wine. Which, if any, of them appear to be misclustered?

silhouette(wine_km3$cluster, 
           dist = dist(wine2)) |> 
  fviz_silhouette() + 
  theme_classic() +
  theme(legend.position = "none") + 
  coord_cartesian(expand = FALSE) + 
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
##   cluster size ave.sil.width
## 1       1   59          0.44
## 2       2   64          0.24
## 3       3   55          0.36

The wines with a negative silhouette score are likely misclustered, since they are closer on average to the wines in a neighboring cluster than to the wines in their own cluster.
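
To make the sign argument explicit, here is the standard definition of the silhouette score, written with notation that is assumed here rather than taken from the assignment. For wine \(i\), let \(a(i)\) be its average distance to the other wines in its own cluster and \(b(i)\) the smallest average distance to the wines of any other cluster (its neighbor cluster):

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
\]

So \(s(i) < 0\) exactly when \(a(i) > b(i)\), that is, when the wine is on average farther from its own cluster than from its neighbor cluster.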

Extra: Students don’t need to show this plot

silhouette(wine_km3$cluster, 
           dist = dist(wine2)) |> 
  data.frame() |> 
  mutate(misclustered = sil_width < 0,
         better_cluster = if_else(sil_width < 0,
                                  neighbor,
                                  cluster)) |> 
  
  bind_cols(prcomp(wine2) |> 
              pluck("x") |> 
              data.frame() |> 
              dplyr::select(PC1:PC2)) |> 
  
  ggplot(mapping = aes(x = PC1, 
                       y = PC2, 
                       color = factor(cluster),
                       shape = factor(better_cluster),
                       size = misclustered*2 + 1)) + 
  
  geom_point() + 
  
  scale_size_identity() + 
  
  theme_test() + 
  
  labs(color = "Actual \nCluster",
       shape = "Best \nCluster") + 
  
  theme(legend.position = "top")

It looks like 5 of the wines in the middle cluster belong in the left cluster (3 wines) or the right cluster (2 wines). All of the leftmost and rightmost wines are in the correct cluster.

Part d) Calculate the average value of each of the original seven variables for the different clusters

Briefly describe any differences between the clusters

wine |> 
  mutate(cluster = factor(wine_km3$cluster)) |> 
  group_by(cluster) |> 
  summarize(across(.cols = alcohol:malic_acid,
                   .fns = mean))
## # A tibble: 3 × 8
##   cluster alcohol magnesium phenols flavanoids proline   hue malic_acid
##   <fct>     <dbl>     <dbl>   <dbl>      <dbl>   <dbl> <dbl>      <dbl>
## 1 1          13.7     108.     2.85      2.98    1124. 1.07        1.93
## 2 2          12.3      92.6    2.28      2.10     511. 1.07        1.77
## 3 3          13.1      99.2    1.71      0.926    618. 0.703       3.42

Bonus plot displaying the differences

Students are not expected to do this; it is included only as a way of displaying the results in the posted solutions.

wine2 |> 
  mutate(cluster = factor(wine_km3$cluster)) |> 
  group_by(cluster) |> 
  summarize(across(.cols = alcohol:malic_acid,
                   .fns = mean)) |> 
  
  pivot_longer(cols = alcohol:malic_acid) |> 
  
  mutate(name = as_factor(name)) |> 
  
  ggplot(mapping = aes(x = name,
                       y = cluster,
                       fill = value)) + 
  
  geom_tile(color = "white",
            linewidth = 0.5) +  # linewidth needs ggplot2 >= 3.4.0; use size on older versions
  
  theme_test() + 
  
  coord_cartesian(expand = FALSE) + 
  
  labs(fill = "z-score",
       x = NULL) + 
  
  scale_fill_gradient2(low = "darkred",
                      mid = "white",
                      high = "darkblue",
                      midpoint = 0) + 
  
  geom_text(mapping = aes(label = round(value, digits = 2)),
            color = "white",
            size = 5,
            fontface = "bold")

  • Cluster 1 has the highest values for all variables except malic acid (its hue is about the same as cluster 2’s).

  • Cluster 2 has low or mid values for 6 of the 7 variables, with hue being above average.

  • Cluster 3 has the lowest values for several variables (phenols, flavanoids, and hue) and is the highest for malic acid.

Question 2: DBSCAN

Part a) Determine \(\varepsilon\)

When the minimum number of points to form a cluster is 5, find the neighborhood distance \(\varepsilon\) for the DBSCAN algorithm.

kNNdistplot(x = wine2,
            k = 5)

With a minimum of 5 points to form a cluster, the plot gives \(\varepsilon \approx 2\).
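
The k-nearest-neighbor distance plot sorts the points by the distance to their k-th nearest neighbor; \(\varepsilon\) is usually read off where the curve bends sharply upward. The base R sketch below computes roughly the same quantity by hand on stand-in data (the scaled `iris` measurements, an assumption). The names `d` and `knn_dist` are illustrative, and the details of what `kNNdistplot()` draws may differ between versions of the dbscan package.

```r
# Stand-in data (an assumption): scaled iris measurements; swap in wine2
x <- scale(iris[, 1:4])
k <- 5

# Full pairwise Euclidean distance matrix
d <- as.matrix(dist(x))

# Distance from each point to its k-th nearest neighbor.
# The smallest entry in each row is the point's zero distance to itself,
# so the k-th neighbor sits at position k + 1 after sorting.
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: look for the knee of this curve
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))
```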

Part b)

Why is the choice of \(\varepsilon\) so much higher for the wine data set than the multishape data set used in the class example?

The multishape data has 2 variables (x & y) and the wine data has 7 variables.

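The more variables included, the higher the dimension of the data. Each added variable contributes a non-negative term to the squared Euclidean distance, so the higher the dimension, the farther apart two points tend to be. A larger \(\varepsilon\) is therefore needed to capture the same number of neighbors.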

This is one aspect of what is often referred to as the “curse of dimensionality.”
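
A small simulation can illustrate the effect. The sketch below uses made-up standard normal data (not the wine or multishape data) with 2 and 7 variables, matching the two dimensions being compared, and reports the average pairwise Euclidean distance in each case. The name `mean_dist` is illustrative.

```r
set.seed(1)

# Average pairwise distance between 200 simulated standardized points
# in p = 2 and p = 7 dimensions
mean_dist <- sapply(c(2, 7), function(p) {
  x <- matrix(rnorm(200 * p), ncol = p)
  mean(dist(x))
})
names(mean_dist) <- c("p = 2", "p = 7")

mean_dist
```

The average distance is noticeably larger with 7 variables than with 2, even though every variable is on the same scale.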

Part c) DBSCAN

Cluster the data using DBSCAN with the appropriate min points and \(\varepsilon\). How many clusters are created? How many outliers are there?

wine_db <- 
  dbscan(x = wine2,
         eps = 2,
         minPts = 5)

wine_db
## DBSCAN clustering for 178 objects.
## Parameters: eps = 2, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 5 noise points.
## 
##   0   1 
##   5 173 
## 
## Available fields: cluster, eps, minPts, dist, borderPoints

Part d) Visualize the result

fviz_cluster(object = wine_db,
             data = wine,
             geom = "point",
             axes = c(1, 2)) + 
  
  theme_bw() + 
  
  labs(title = "Clustering result of DBSCAN with PC 1 & 2")

The outliers don’t appear to stand out when examining the first 2 principal components.

fviz_cluster(object = wine_db,
             data = wine,
             geom = "point",
             axes = c(1, 3)) +
  
  theme_bw() + 
  
  labs(title = "Clustering result of DBSCAN with PC 1 & 3")

Four of the five outliers stand apart from the rest of the wines along PC 3.

Question 3: Agglomerative Hierarchical Clustering

Part b) Dendrogram

Using the link method you picked in part a), create a dendrogram for the clustering method. How many clusters do there appear to be?

wine_avg <-
  hcut(wine, 
       hc_method = "average", 
       k = 6, 
       stand = TRUE)

# Dendrogram for average linkage
gg_dend_avg <- 
  fviz_dend(x = wine_avg, 
            main = "Average Link") +
  
  geom_hline(yintercept = 3.4)

gg_dend_avg

From the dendrogram, there appear to be 6 clusters when the tree is cut at a height of about 3.4 (the horizontal line).
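
Choosing a number of clusters and choosing a cut height are two ways of describing the same cut of the tree. A minimal sketch with base R’s `hclust()` and `cutree()` on stand-in data (the scaled `iris` measurements, an assumption) shows the correspondence; the height `h` computed below is specific to the stand-in data and is not the 3.4 used for the wine data.

```r
# Stand-in data (an assumption): scaled iris measurements
x <- scale(iris[, 1:4])
n <- nrow(x)

# Average-linkage agglomerative clustering on Euclidean distances
hc <- hclust(dist(x), method = "average")

# Cut by number of clusters
by_k <- cutree(hc, k = 6)

# Cut by height: any height between the merge that leaves 6 clusters
# and the merge that leaves 5 clusters gives the same 6 clusters
h <- mean(hc$height[c(n - 6, n - 5)])
by_h <- cutree(hc, h = h)

identical(by_k, by_h)
table(by_k)
```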

Part c) Display the clustering results

Visually display the clustering results for your choice in part b). Briefly describe the graph(s) you create.

gg_3c_12 <- 
  fviz_cluster(wine_avg, 
               geom = "point", 
               ellipse = FALSE, 
               show.clust.cent = FALSE) +
  theme_bw() + 
  labs(title = "Wine AHC with Average Link") + 
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

gg_3c_34 <- 
  fviz_cluster(wine_avg, 
               geom = "point", 
               ellipse = FALSE, 
               show.clust.cent = FALSE,
               axes = c(3, 4)) +
  theme_bw() + 
  labs(title = NULL) + 
  theme(legend.position = "none")


gg_3c_12 / gg_3c_34

There appear to be 3 main clusters, separated by PC 1 and PC 2. These 3 are all fairly intermixed on PC 3 and PC 4.

One cluster only has 3 cases, and all of them have very low PC3 values.

Two clusters contain just a single case each: one has a very low PC2 value, and the other has a very low PC4 value.

Part d) Calculate the average value of each of the original seven variables for the different clusters

Briefly describe any differences between the clusters. Ignore any clusters that have fewer than 5 cases.

wine |> 
  mutate(cluster = factor(wine_avg$cluster)) |> 
  group_by(cluster) |> 
  summarize(wines = n(),
            across(.cols = alcohol:malic_acid,
                   .fns = mean)) |> 
  filter(wines >= 5) |>  # keep clusters with at least 5 cases, as the question asks
  gt::gt()
cluster  wines  alcohol   magnesium  phenols   flavanoids  proline    hue        malic_acid
1        57     13.76053  107.33333  2.866140  3.0022807   1128.5439  1.0738596  1.906667
2        62     12.34661   90.64516  2.304677  2.1119355    511.8065  1.0480645  1.927581
4        54     13.04944   99.31481  1.673704  0.8609259    622.4259  0.7054815  3.341111

Bonus plot displaying the differences

Students are not expected to do this; it is included only as a way of displaying the results in the posted solutions.

wine2 |> 
  mutate(cluster = factor(wine_avg$cluster)) |> 
  group_by(cluster) |> 
  summarize(wines = n(),
            across(.cols = alcohol:malic_acid,
                   .fns = mean)) |> 
  
  filter(wines >= 5) |>  # keep clusters with at least 5 cases, as the question asks
  
  pivot_longer(cols = alcohol:malic_acid) |> 
  
  mutate(name = as_factor(name)) |> 
  
  ggplot(mapping = aes(x = name,
                       y = cluster,
                       fill = value)) + 
  
  geom_tile(color = "white",
            linewidth = 0.5) +  # linewidth needs ggplot2 >= 3.4.0; use size on older versions
  
  theme_test() + 
  
  coord_cartesian(expand = FALSE) + 
  
  labs(fill = "z-score",
       x = NULL) + 
  
  scale_fill_gradient2(low = "darkred",
                      mid = "white",
                      high = "darkblue",
                      midpoint = 0) + 
  
  geom_text(mapping = aes(label = round(value, digits = 2)),
            color = "white",
            size = 5,
            fontface = "bold")

  • Cluster 1 has higher values than the other 2 large clusters for all variables except malic acid.

  • Cluster 2 has low or mid values for 6 of the 7 variables, with hue being above average.

  • Cluster 4 has the lowest values for several variables (phenols, flavanoids, and hue) and is the highest for malic acid.