Use 3 different methods to determine the number of clusters to use in k-means clustering. For each method, describe how many clusters it recommends.
wine2 |>
prcomp() |>
fviz_pca_ind(geom = "point")
From the PCA score plot, there appear to be 2 or 3 groups.
fviz_nbclust(x = wine2,
FUNcluster = kmeans,
method = "wss",
nstart = 10,
nboot = 100)
The elbow plot suggests 3 clusters.
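The quantity behind the elbow plot can also be computed directly. The following is a minimal sketch, assuming simulated stand-in data (`toy`, three well-separated groups) rather than the wine data; it records the total within-cluster sum of squares for k = 1 to 6 and the drop gained by each extra cluster.

```r
# Toy stand-in data (NOT the wine data): three well-separated groups in 2-D
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares (WSS) for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Reduction in WSS from adding one more cluster; the "elbow" is the
# value of k after which these reductions become small
wss_drop <- -diff(wss)

round(wss, 1)
round(wss_drop, 1)
```

For data like this, the reductions should be large up to k = 3 and small afterwards, which is the pattern the elbow plot displays visually.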
fviz_nbclust(x = wine2,
FUNcluster = kmeans,
method = "silhouette",
nstart = 10,
nboot = 100)
The silhouette plot suggests anywhere from 2 to 4 clusters.
fviz_nbclust(x = wine2,
FUNcluster = kmeans,
method = "gap_stat",
nstart = 10,
nboot = 100)
The gap statistic suggests 3 clusters.
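Overall, we’ll use 3 clusters: the elbow plot and the gap statistic both point to 3, and 3 falls within the range suggested by the silhouette plot.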
With the number of clusters you determined in part 1a), run k-means clustering. Display a plot of the resulting data set.
set.seed(223)
wine_km3 <-
kmeans(x = wine2,
centers = 3,
nstart = 10)
fviz_cluster(object = wine_km3,
data = wine,
geom = "point") +
theme_bw()
Calculate the silhouette score for each wine. Which, if any, of them appear to be misclustered?
silhouette(wine_km3$cluster,
dist = dist(wine2)) |>
fviz_silhouette() +
theme_classic() +
theme(legend.position = "none") +
coord_cartesian(expand = FALSE) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
## cluster size ave.sil.width
## 1 1 59 0.44
## 2 2 64 0.24
## 3 3 55 0.36
The wines with a negative silhouette score are likely misclustered, since they are closer on average to the wines in a neighboring cluster than they are to the wines in their own cluster.
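To make the sign of the silhouette score concrete, here is a minimal sketch that computes it by hand. The helper `sil_width_by_hand()` and the five-point data set are hypothetical illustrations, not part of the assignment, and the helper does not handle single-point clusters.

```r
# Silhouette width of observation i, computed by hand.
# d: full distance matrix; cl: vector of cluster labels
sil_width_by_hand <- function(i, d, cl) {
  own <- cl == cl[i]
  own[i] <- FALSE                        # exclude the point itself
  a <- mean(d[i, own])                   # mean distance to its own cluster
  others <- setdiff(unique(cl), cl[i])
  # smallest mean distance to any other cluster (the "neighbor" cluster)
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

# Toy 1-D data (not the wine data): the point at 4 is labeled cluster 1
# even though it sits next to the cluster-2 points
x  <- c(0, 1, 4, 5, 6)
cl <- c(1, 1, 1, 2, 2)
d  <- as.matrix(dist(x))

s <- sapply(seq_along(x), sil_width_by_hand, d = d, cl = cl)
round(s, 2)
```

Only the third observation gets a negative width: its mean distance to its own cluster (3.5) exceeds its mean distance to the other cluster (1.5).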
Extra: Students don’t need to show this plot
silhouette(wine_km3$cluster,
dist = dist(wine2)) |>
data.frame() |>
mutate(misclustered = sil_width < 0,
better_cluster = if_else(sil_width < 0,
neighbor,
cluster)) |>
bind_cols(prcomp(wine2) |>
pluck("x") |>
data.frame() |>
dplyr::select(PC1:PC2)) |>
ggplot(mapping = aes(x = PC1,
y = PC2,
color = factor(cluster),
shape = factor(better_cluster),
size = misclustered*2 + 1)) +
geom_point() +
scale_size_identity() +
theme_test() +
labs(color = "Actual \nCluster",
shape = "Best \nCluster") +
theme(legend.position = "top")
It looks like 5 of the wines in the middle cluster would fit better in a neighboring cluster: 3 in the left cluster and 2 in the right cluster. All of the wines in the left and right clusters appear to be correctly clustered.
Briefly describe any differences between the clusters
wine |>
mutate(cluster = factor(wine_km3$cluster)) |>
group_by(cluster) |>
summarize(across(.cols = alcohol:malic_acid,
.fns = mean))
## # A tibble: 3 × 8
## cluster alcohol magnesium phenols flavanoids proline hue malic_acid
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 13.7 108. 2.85 2.98 1124. 1.07 1.93
## 2 2 12.3 92.6 2.28 2.10 511. 1.07 1.77
## 3 3 13.1 99.2 1.71 0.926 618. 0.703 3.42
Not something the students are expected to do, just including it as a way of displaying the results in the posted solutions.
wine2 |>
mutate(cluster = factor(wine_km3$cluster)) |>
group_by(cluster) |>
summarize(across(.cols = alcohol:malic_acid,
.fns = mean)) |>
pivot_longer(cols = alcohol:malic_acid) |>
mutate(name = as_factor(name)) |>
ggplot(mapping = aes(x = name,
y = cluster,
fill = value)) +
geom_tile(color = "white",
size = 0.5) +
theme_test() +
coord_cartesian(expand = FALSE) +
labs(fill = "z-score",
x = NULL) +
scale_fill_gradient2(low = "darkred",
mid = "white",
high = "darkblue",
midpoint = 0) +
geom_text(mapping = aes(label = round(value, digits = 2)),
color = "white",
size = 5,
fontface = "bold")
Cluster 1 has the highest (or tied for highest) mean for every variable except malic acid.
Cluster 2 has low or mid values for 6 of the 7 variables, with hue being above average.
Cluster 3 has the lowest means for phenols, flavanoids, and hue, and the highest mean for malic acid.
When the minimum number of points to form a cluster is 5, find an appropriate distance \(\varepsilon\) for the DBSCAN algorithm.
kNNdistplot(x = wine2,
k = 5)
With a minimum of 5 points to form a cluster, the k-NN distance plot suggests \(\varepsilon \approx 2\).
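The idea behind reading \(\varepsilon\) off a k-NN distance plot can be sketched in base R. This is only an illustration on simulated data (`toy` is an assumption, not the wine data), and it is not claimed to match the plotting function's internals exactly: it sorts every point's distance to its k-th nearest neighbor and plots the result, so that the "knee" of the curve marks a candidate \(\varepsilon\).

```r
# Toy stand-in data (NOT the wine data): 100 points in 2-D
set.seed(2)
toy <- matrix(rnorm(200), ncol = 2)
k   <- 5

d <- as.matrix(dist(toy))

# In each sorted row, position 1 is the point itself (distance 0),
# so the k-th nearest neighbor sits at position k + 1
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: most points lie on the flat part of the curve,
# and the sharp rise at the right end corresponds to isolated points
sorted_knn <- sort(knn_dist)
plot(sorted_knn, type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))
```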
Why is the choice of \(\varepsilon\) so much higher for the wine data set than the multishape data set used in the class example?
The multishape data has 2 variables (x & y) and the wine data has 7 variables.
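The more variables included, the higher the dimension of the data. Because Euclidean distance adds a non-negative squared difference for each variable, distances between points tend to grow as the dimension increases, so a larger \(\varepsilon\) is needed to capture the same number of neighbors.
This is a consequence of what is often referred to as the “Curse of Dimensionality”.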
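A quick simulation illustrates the effect. This is a minimal sketch assuming independent standard normal variables (a simplification; the wine variables are correlated), comparing the average pairwise Euclidean distance in 2 and 7 dimensions. The helper `mean_dist()` is hypothetical.

```r
set.seed(3)
n <- 200

# Average pairwise Euclidean distance among n simulated points with
# p independent N(0, 1) variables (an assumption made for illustration)
mean_dist <- function(p) {
  x <- matrix(rnorm(n * p), ncol = p)
  mean(dist(x))
}

dims <- c(2, 7)
avg  <- sapply(dims, mean_dist)
names(avg) <- paste0("p = ", dims)
avg
```

Under this assumption the expected squared distance between two points is 2p, so the typical distance grows roughly with the square root of the number of variables.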
Cluster the data using DBSCAN with the appropriate min points and \(\varepsilon\). How many clusters are created? How many outliers are there?
wine_db <-
dbscan(x = wine2,
eps = 2,
minPts = 5)
wine_db
## DBSCAN clustering for 178 objects.
## Parameters: eps = 2, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 5 noise points.
##
## 0 1
## 5 173
##
## Available fields: cluster, eps, minPts, dist, borderPoints
fviz_cluster(object = wine_db,
data = wine,
geom = "point",
axes = c(1, 2)) +
theme_bw() +
labs(title = "Clustering result of DBSCAN with PC 1 & 2")
The outliers don’t stand out when examining the first 2 principal components.
fviz_cluster(object = wine_db,
data = wine,
geom = "point",
axes = c(1, 3)) +
theme_bw() +
labs(title = "Clustering result of DBSCAN with PC 1 & 3")
Four of the five outliers are separated from the main cluster along PC 3.
Among the complete, single, average, and Ward linkage methods, which is the most appropriate for the data?
wine_coph_cor <-
tibble(
single = cor_cophenetic(hcut(x = wine2,
hc_method = "single"),
dist(wine2)),
complete = cor_cophenetic(hcut(x = wine2,
hc_method = "complete"),
dist(wine2)),
average = cor_cophenetic(hcut(x = wine2,
hc_method = "average"),
dist(wine2)),
ward = cor_cophenetic(hcut(x = wine2,
hc_method = "ward.D"),
dist(wine2))
) |>
pivot_longer(cols = everything(),
names_to = "link",
values_to = "coph_cor") |>
arrange(-coph_cor)
gt::gt(wine_coph_cor)
link | coph_cor |
---|---|
average | 0.7411620 |
complete | 0.6857130 |
ward | 0.6499653 |
single | 0.4651009 |
Average linkage has the highest cophenetic correlation (0.74), so it is the best choice for clustering the wine data. Complete (0.69) and Ward (0.65) are somewhat lower and comparable to each other, while single linkage (0.47) is much worse.
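The cophenetic correlation measures how well the heights at which pairs of points merge in the dendrogram track their original distances. A minimal base R sketch of the same idea, assuming simulated stand-in data (`toy`) and using `hclust()` and `cophenetic()` from the stats package instead of the helpers used above:

```r
# Toy stand-in data (NOT the wine data): 50 points, 3 standardized variables
set.seed(4)
toy <- scale(matrix(rnorm(150), ncol = 3))
d   <- dist(toy)

methods <- c("single", "complete", "average", "ward.D")

# For each linkage: correlate the original pairwise distances with the
# cophenetic distances (the merge heights) implied by the dendrogram
coph_cor <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(cophenetic(hc)), as.vector(d))
})

sort(coph_cor, decreasing = TRUE)
```

A value closer to 1 means the dendrogram distorts the original distances less.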
Using the link method you picked in part a), create a dendrogram for the clustering method. How many clusters do there appear to be?
wine_avg <-
hcut(wine,
hc_method = "average",
k = 6,
stand = TRUE)
# Dendrogram for average linkage
gg_dend_avg <-
fviz_dend(x = wine_avg,
main = "Average Link") +
geom_hline(yintercept = 3.4)
gg_dend_avg
From the dendrogram, there appear to be 6 clusters when the tree is cut at the horizontal line (height 3.4).
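Drawing a horizontal line on a dendrogram corresponds to cutting the tree at a fixed height. A minimal sketch of that correspondence in base R, assuming simulated stand-in data (`toy`) and a cut height chosen between two consecutive merge heights so that exactly 3 clusters remain:

```r
# Toy stand-in data (NOT the wine data): 40 points, 2 variables
set.seed(5)
toy <- matrix(rnorm(80), ncol = 2)
hc  <- hclust(dist(toy), method = "average")

n <- nrow(toy)
k <- 3

# hc$height holds the n - 1 merge heights in increasing order.
# After n - k merges there are k clusters, so any height between
# merge n - k and merge n - k + 1 cuts the tree into k clusters
h_cut <- mean(hc$height[c(n - k, n - k + 1)])

cl <- cutree(hc, h = h_cut)
table(cl)
```

The number of branches the line crosses equals the number of clusters returned by `cutree()` at that height.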
Visually display the clustering results for your choice in part b). Briefly describe the graph(s) you create.
gg_3c_12 <-
fviz_cluster(wine_avg,
geom = "point",
ellipse = FALSE,
show.clust.cent = FALSE) +
theme_bw() +
labs(title = "Wine AHC with Average Link") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
gg_3c_34 <-
fviz_cluster(wine_avg,
geom = "point",
ellipse = FALSE,
show.clust.cent = FALSE,
axes = c(3, 4)) +
theme_bw() +
labs(title = NULL) +
theme(legend.position = "none")
gg_3c_12 / gg_3c_34
There appear to be 3 main clusters separated by PC 1 & 2. These 3 are all fairly intermixed for PC 3 and PC 4.
One cluster only has 3 cases, and all of them have very low PC3 values.
Two clusters contain just a single case. One case has a very low PC2 value, and another has a very low PC4 value.
Briefly describe any differences between the clusters. Ignore any clusters that have fewer than 5 cases.
wine |>
mutate(cluster = factor(wine_avg$cluster)) |>
group_by(cluster) |>
summarize(wines = n(),
across(.cols = alcohol:malic_acid,
.fns = mean)) |>
filter(wines >= 5) |>
gt::gt()
cluster | wines | alcohol | magnesium | phenols | flavanoids | proline | hue | malic_acid |
---|---|---|---|---|---|---|---|---|
1 | 57 | 13.76053 | 107.33333 | 2.866140 | 3.0022807 | 1128.5439 | 1.0738596 | 1.906667 |
2 | 62 | 12.34661 | 90.64516 | 2.304677 | 2.1119355 | 511.8065 | 1.0480645 | 1.927581 |
4 | 54 | 13.04944 | 99.31481 | 1.673704 | 0.8609259 | 622.4259 | 0.7054815 | 3.341111 |
Not something the students are expected to do, just including it as a way of displaying the results in the posted solutions.
wine2 |>
mutate(cluster = factor(wine_avg$cluster)) |>
group_by(cluster) |>
summarize(wines = n(),
across(.cols = alcohol:malic_acid,
.fns = mean)) |>
filter(wines >= 5) |>
pivot_longer(cols = alcohol:malic_acid) |>
mutate(name = as_factor(name)) |>
ggplot(mapping = aes(x = name,
y = cluster,
fill = value)) +
geom_tile(color = "white",
size = 0.5) +
theme_test() +
coord_cartesian(expand = FALSE) +
labs(fill = "z-score",
x = NULL) +
scale_fill_gradient2(low = "darkred",
mid = "white",
high = "darkblue",
midpoint = 0) +
geom_text(mapping = aes(label = round(value, digits = 2)),
color = "white",
size = 5,
fontface = "bold")
Cluster 1 has the highest mean for every variable except malic acid.
Cluster 2 has low or mid values for 6 of the 7 variables, with hue being above average.
Cluster 4 has the lowest means for phenols, flavanoids, and hue, and the highest mean for malic acid.