Activity 4.2 - Kmeans, PAM, and DBSCAN clustering
SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions
Loading required packages:
library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)
Question 1
Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.
three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')
A)
With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)
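As a rough sketch of what a k-nearest-neighbor distance plot shows, the curve can be rebuilt by hand with `kNNdist()` from the dbscan package. The data here are synthetic stand-ins, not the activity's CSV files. The 95% quantile is only a hypothetical starting guess, and as I understand the dbscan documentation, `minPts` corresponds to `k = minPts - 1`.

```r
library(dbscan)

# Synthetic stand-in data (an assumption; not one of the activity's data sets)
set.seed(1)
toy <- data.frame(
  x = c(rnorm(50, 0, 0.3), rnorm(50, 3, 0.3)),
  y = c(rnorm(50, 0, 0.3), rnorm(50, 3, 0.3))
)

minPts <- 4

# Distance from each point to its (minPts - 1)-th nearest neighbor, sorted.
# This is the curve a k-NN distance plot draws for the chosen minPts.
knn_d <- sort(kNNdist(as.matrix(toy), k = minPts - 1))

# Draw the curve by hand; the "knee" marks a candidate eps
plot(knn_d, type = "l",
     xlab = "Points sorted by distance", ylab = "3-NN distance")

# Hypothetical starting guess only: a high quantile of the distances.
# The value should still be read off the knee of the plot and then revised.
eps_guess <- unname(quantile(knn_d, 0.95))
abline(h = eps_guess, lty = 2)
```

Points above the horizontal line sit farther from their neighbors than most of the data, so they are the ones DBSCAN is likely to label as noise at that eps.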
kNNdistplot(three_spheres, minPts = 4)
abline(h = .21)
kNNdistplot(ring_moon_sphere, minPts = 4)
abline(h = .22)
kNNdistplot(two_spirals_sphere, minPts = 4)
abline(h = 1.05)
B)
Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:
- Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
- Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the values of eps and minPts used to create the clusters!!
Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.
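One way to make the "revise eps until it looks right" step less ad hoc is to tabulate the number of clusters and noise points over a small grid of eps values before plotting. The sketch below uses synthetic data and a hypothetical grid (`eps_grid`); neither comes from the activity.

```r
library(dbscan)

# Synthetic stand-in data (an assumption; not one of the activity's data sets)
set.seed(1)
toy <- cbind(
  x = c(rnorm(50, 0, 0.3), rnorm(50, 3, 0.3)),
  y = c(rnorm(50, 0, 0.3), rnorm(50, 3, 0.3))
)

# Hypothetical candidate values, e.g. bracketing a value read off a k-NN plot
eps_grid <- c(0.1, 0.2, 0.3, 0.5)

# For each eps, run DBSCAN and count clusters and noise points.
# dbscan() labels noise points with cluster 0.
sweep <- do.call(rbind, lapply(eps_grid, function(e) {
  cl <- dbscan(toy, eps = e, minPts = 4)$cluster
  data.frame(
    eps        = e,
    n_clusters = length(setdiff(unique(cl), 0)),
    n_noise    = sum(cl == 0)
  )
}))
sweep
```

Rows where `n_clusters` matches the expected number of shapes and `n_noise` is small are the natural candidates to pass to a plotting function for a visual check.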
plot_dbscan_results <- function(df, eps, minPts) {
  db <- dbscan(df, eps = eps, minPts = minPts)
  df_plot <- df %>%
    mutate(cluster = factor(db$cluster))
  ggplot(df_plot, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 2, alpha = 0.9) +
    labs(
      title = paste0("DBSCAN Results (eps = ", eps, ", minPts = ", minPts, ")"),
      color = "Cluster"
    ) +
    theme_minimal()
}
plot_dbscan_results(three_spheres, eps = 0.21, minPts = 4)
plot_dbscan_results(ring_moon_sphere, eps = 0.3, minPts = 4)
plot_dbscan_results(two_spirals_sphere, eps = 1.2, minPts = 4)
C)
Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.
plot_dbscan_results <- function(df, eps, minPts, name) {
  db <- dbscan(df, eps = eps, minPts = minPts)
  df$cluster <- as.factor(db$cluster)
  ggplot(df, aes(x, y, color = cluster)) +
    geom_point(size = 2) +
    ggtitle(paste0(name, " – DBSCAN \n(eps = ", eps, ", minPts = ", minPts, ")")) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 6)
    )
}
plot_kmeans <- function(df, k, name) {
  km <- kmeans(df, centers = k)
  df$cluster <- as.factor(km$cluster)
  ggplot(df, aes(x, y, color = cluster)) +
    geom_point(size = 2) +
    ggtitle(paste0(name, " – k-means (k = ", k, ")")) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 6)
    )
}
plot_pam <- function(df, k, name) {
  pam_res <- pam(df, k)
  df$cluster <- as.factor(pam_res$clustering)
  ggplot(df, aes(x, y, color = cluster)) +
    geom_point(size = 2) +
    ggtitle(paste0(name, " – PAM (k = ", k, ")")) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 6)
    )
}
p1 <- plot_dbscan_results(df = three_spheres, eps = .21, minPts = 4, 'Three Spheres')
p2 <- plot_kmeans(three_spheres, 3, 'Three Spheres')
p3 <- plot_pam(three_spheres, 3, 'Three Spheres')
p4 <- plot_dbscan_results(ring_moon_sphere, eps = .3, minPts = 4, 'Ring Moon Sphere')
p5 <- plot_kmeans(ring_moon_sphere, 3, 'Ring Moon Sphere')
p6 <- plot_pam(ring_moon_sphere, 3, 'Ring Moon Sphere')
p7 <- plot_dbscan_results(two_spirals_sphere, eps = 1.2, minPts = 4, 'Two Spirals Sphere')
p8 <- plot_kmeans(two_spirals_sphere, 3, 'Two Spirals Sphere')
p9 <- plot_pam(two_spirals_sphere, 3, 'Two Spirals Sphere')
(p1 | p2 | p3) /
(p4 | p5 | p6) /
(p7 | p8 | p9)
DBSCAN was the best of the three algorithms at recovering the cluster shapes and flagging outliers. In the three spheres data set it correctly identified the three spheres. In the ring moon sphere data set it correctly separated the ring, the moon, and the sphere into distinct clusters. In the spiral data set it correctly identified the spirals and flagged outliers as noise.
K-means and PAM work well when the clusters are well separated and convex, but they do poorly on other shapes. For example, neither algorithm did a good job of identifying the moon shape in the ring moon sphere data set.
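The convexity point can be illustrated with a minimal sketch on two concentric rings, a shape where no straight boundary separates the groups. The data are synthetic and the `eps = 1` setting is an assumption chosen to be larger than the spacing within a ring and smaller than the gap between rings.

```r
library(dbscan)

# Two noisy concentric rings (synthetic; radii 1 and 4)
set.seed(42)
n     <- 200
theta <- runif(2 * n, 0, 2 * pi)
r     <- rep(c(1, 4), each = n) + rnorm(2 * n, sd = 0.1)
rings <- cbind(x = r * cos(theta), y = r * sin(theta))
truth <- rep(1:2, each = n)   # 1 = inner ring, 2 = outer ring

# k-means partitions space by distance to centroids, which yields
# convex regions; it cannot wrap one cluster around another
km <- kmeans(rings, centers = 2, nstart = 10)

# DBSCAN grows clusters through chains of nearby points,
# so it can follow a ring of any shape
db <- dbscan(rings, eps = 1, minPts = 4)

# Cross-tabulate true ring membership against each solution
table(truth, kmeans = km$cluster)
table(truth, dbscan = db$cluster)
```

In the k-means table each cluster should mix points from both rings, while in the DBSCAN table each cluster should line up with a single ring.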
Question 2
In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:
- life_expectancy: average life expectancy at birth
- gdp: GDP per capita, in 2015 USD
- co2: CO2 emissions, in metric tons per capita
- fert_rate: annual births per 1000 women
- health: percentage of GDP spent on health care
- imports and exports: imports and exports as a percentage of GDP
- internet and electricity: percentage of population with access to internet and electricity, respectively
- infant_mort: infant mortality rate, infant deaths per 1000 live births
- inflation: consumer price inflation, as annual percentage
- income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv')
head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320
Focus on using kmeans for this problem.
A)
My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.
wdi_numeric <- wdi %>% select_if(is.numeric)
wdi_scaled <- scale(wdi_numeric)
fviz_nbclust(
  wdi_scaled,
  kmeans,
  method = "wss"
) +
  ggtitle("Elbow Method for Optimal k")
fviz_nbclust(
  wdi_scaled,
  kmeans,
  method = "silhouette"
) +
  ggtitle("Silhouette Scores for k")
The elbow in the WSS plot falls somewhere between 3 and 5 clusters, so any of those values is defensible. The silhouette plot has its highest average silhouette width at 4 clusters, with 3 clusters almost as high. Both plots therefore support the claim that 3-5 clusters are reasonable for this data set.
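To show what the silhouette plot summarizes, here is a minimal sketch that computes the average silhouette width for several values of k directly with `cluster::silhouette()`. It uses the built-in `USArrests` data as a stand-in, which is an assumption; the WDI extract is not used here, so the numbers say nothing about the answer above.

```r
library(cluster)

# Built-in stand-in data (an assumption; not the WDI extract)
X <- scale(USArrests)
d <- dist(X)   # pairwise Euclidean distances on the scaled data

set.seed(123)  # k-means starts are random; fix the seed for repeatability
ks <- 2:8

# For each k: fit k-means, then average the per-observation silhouette widths
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

data.frame(k = ks, avg_sil_width = round(avg_sil, 3))

# The k with the largest average width is the silhouette criterion's pick
ks[which.max(avg_sil)]
```

Values of k whose average width is close to the maximum are near-ties, which is why a silhouette plot is usually read together with the elbow plot rather than on its own.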
B)
Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.
countries <- wdi %>% select(country)
kmeans4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)
cluster <- factor(kmeans4$cluster)
wdi$cluster <- factor(kmeans4$cluster)
pca_res <- prcomp(wdi_scaled, center = TRUE, scale. = FALSE)
pca_scores <- as.data.frame(pca_res$x)
pca_scores$cluster <- wdi$cluster
pca_scores$country <- wdi$country
centroids <- pca_scores %>%
  group_by(cluster) %>%
  summarise(
    c1 = mean(PC1),
    c2 = mean(PC2)
  )
distances <- pca_scores %>%
  left_join(centroids, by = "cluster") %>%
  mutate(distance = sqrt((PC1 - c1)^2 + (PC2 - c2)^2))
representatives <- distances %>%
  group_by(cluster) %>%
  slice_min(order_by = distance, n = 3) %>%
  select(cluster, country, PC1, PC2, distance)
representatives
# A tibble: 12 × 5
# Groups:   cluster [4]
   cluster country           PC1    PC2 distance
   <fct>   <chr>           <dbl>  <dbl>    <dbl>
 1 1       Angola        -3.61   -1.16    0.350
 2 1       Togo          -3.31   -0.515   0.366
 3 1       Cote d'Ivoire -3.51   -0.385   0.439
 4 2       Cabo Verde     0.0747  0.493   0.0279
 5 2       Azerbaijan     0.127   0.596   0.102
 6 2       Mexico         0.115   0.389   0.139
 7 3       Austria        3.17    0.101   0.229
 8 3       Sweden         3.27    0.392   0.407
 9 3       Finland        3.05    0.626   0.489
10 4       Singapore      5.22   -5.59    0.393
11 4       Ireland        4.68   -3.94    1.92
12 4       Luxembourg     6.94   -7.31    2.15
ggplot(pca_scores, aes(PC1, PC2, color = cluster)) +
  geom_point(alpha = 0.8) +
  geom_text_repel(
    data = representatives,
    aes(label = country),
    color = "black",
    size = 3,
    fontface = "bold",
    nudge_y = 0.2
  ) +
  theme_minimal() +
  ggtitle("PCA Biplot of WDI Data with Cluster Representatives (k = 4)")
The countries labeled in the final plot are the most typical members of each cluster. They were chosen as the 3 countries closest to their cluster's centroid, with distance measured in the PC1-PC2 plane.
The cluster numbers below follow the table above; k-means labels are arbitrary and can change from run to run.
In cluster 1 the countries that best represent that cluster are Angola, Togo, and Cote d'Ivoire.
In cluster 2 the countries that best represent that cluster are Cabo Verde, Azerbaijan, and Mexico.
In cluster 3 the countries that best represent that cluster are Austria, Sweden, and Finland.
In cluster 4 the countries that best represent that cluster are Singapore, Ireland, and Luxembourg.
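Beyond naming representative countries, clusters are often characterized by their average indicator values and by the variables that drive the first two principal components. The sketch below shows that pattern on the built-in `USArrests` data, used as a stand-in; the data set and the object names are assumptions, not part of the WDI analysis.

```r
# Built-in stand-in data (an assumption; not the WDI extract)
X <- scale(USArrests)

set.seed(123)
km <- kmeans(X, centers = 4, nstart = 25)

# Cluster profile: mean of every variable, on its original scale, per cluster
profile <- aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)
profile

# PCA loadings: large absolute values mark the variables that
# define the PC1 and PC2 axes of a score plot
pca <- prcomp(X)
round(pca$rotation[, 1:2], 2)
```

Reading the profile table next to the loadings lets each cluster's position on the score plot be described in terms of the original variables.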
C)
Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?
wdi_removed <- wdi %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))
wdi_removed_numeric <- wdi_removed %>% select_if(is.numeric)
wdi_removed_scaled <- scale(wdi_removed_numeric)
km_removed <- kmeans(wdi_removed_scaled, centers = 4)
wdi_removed$cluster <- factor(km_removed$cluster)
pca_removed <- prcomp(wdi_removed_scaled, center = TRUE, scale. = FALSE)
scores_removed <- as.data.frame(pca_removed$x[, 1:2])
colnames(scores_removed) <- c("PC1", "PC2")
scores_removed$country <- wdi_removed$country
scores_removed$cluster <- wdi_removed$cluster
ggplot(scores_removed, aes(PC1, PC2, color = cluster, label = country)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    title = "K-Means Clusters (4 clusters) after removing Ireland, Singapore, Luxembourg",
    x = "PC1",
    y = "PC2"
  )
The clusters are now different. The cluster from the previous problem that contained only the three outlying countries (Ireland, Singapore, and Luxembourg) is gone. In its place is a larger cluster whose members are not all outliers of the full data set. Apart from that one heavy change, the remaining clusters sit in roughly the same locations.
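A cross-tabulation of old and new labels is one way to check how cluster definitions shift when observations are removed, since it does not depend on which arbitrary number k-means gives each cluster. The sketch below uses the built-in `USArrests` data, and the three dropped states are hypothetical picks for illustration only.

```r
# Built-in stand-in data (an assumption; not the WDI extract)
X_all <- scale(USArrests)

set.seed(123)
km_all <- kmeans(X_all, centers = 4, nstart = 25)

# Hypothetical rows to drop, chosen only for illustration
drop <- c("Alaska", "Florida", "Nevada")
keep <- !(rownames(USArrests) %in% drop)

# Rescale after removal, then refit with the same k
X_sub  <- scale(USArrests[keep, ])
km_sub <- kmeans(X_sub, centers = 4, nstart = 25)

# Rows: cluster before removal; columns: cluster after removal.
# A row concentrated in one column means that cluster survived largely intact;
# a row spread over several columns means it was split up.
xtab <- table(before = km_all$cluster[keep], after = km_sub$cluster)
xtab
```

Because the labels are matched through the table rather than by number, the comparison stays valid even when the refit relabels the clusters.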