Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to HTML
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)

kNNdistplot(three_spheres[,1:2], minPts = 4)
abline(h = 0.2)

kNNdistplot(ring_moon_sphere[,1:2], minPts = 4)
abline(h = 0.25)

kNNdistplot(two_spirals_sphere[,1:2], minPts = 4)
abline(h = 1)

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- function(df, eps, minPts) {
  # Run DBSCAN with the supplied eps and minPts; cluster 0 marks noise points
  db <- dbscan(df, eps = eps, minPts = minPts)
  df$dbcluster <- factor(db$cluster)

  # Scatterplot of the first two columns, color-coded by assigned cluster
  xvar <- names(df)[1]
  yvar <- names(df)[2]
  ggplot(df, aes(x = .data[[xvar]], y = .data[[yvar]], color = dbcluster)) +
    geom_point(size = 2) +
    labs(
      color = "Cluster",
      title = paste0("DBSCAN (eps = ", eps, ", minPts = ", minPts, ")")
    ) +
    theme_classic(base_size = 12)
}
plot_dbscan_results(three_spheres[,1:2], eps = 0.2, minPts = 4)

plot_dbscan_results(ring_moon_sphere[,1:2], eps = 0.35, minPts = 4)

plot_dbscan_results(two_spirals_sphere[,1:2], eps = 1.2, minPts = 4)
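
A convenient way to iterate on eps, as described above, is to sweep a few candidate values for one data set and patch the resulting plots together. A minimal sketch (the eps_grid values and object names are just illustrative):

eps_grid <- c(0.25, 0.35, 0.45)   # example candidates near the kNN-distance "knee"
eps_plots <- lapply(eps_grid, function(e) {
  plot_dbscan_results(ring_moon_sphere[, 1:2], eps = e, minPts = 4)
})
wrap_plots(eps_plots, nrow = 1)   # patchwork: one panel per candidate eps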

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

p_db_three <- plot_dbscan_results(three_spheres[,1:2], eps = 0.2, minPts = 4)
p_db_ring <- plot_dbscan_results(ring_moon_sphere[,1:2], eps = 0.35, minPts = 4)
p_db_spiral  <- plot_dbscan_results(two_spirals_sphere[,1:2], eps = 1.2, minPts = 4)

K-means

run_kmeans <- function(df, centers = 3, iter.max = 20, nstart = 10) {
  # Cluster on the x/y coordinates only, then attach the labels to the data frame
  numeric_only <- df %>% select(x, y)
  km <- kmeans(numeric_only, centers = centers, iter.max = iter.max, nstart = nstart)
  df$kmeans_clusters <- factor(km$cluster)
  return(df)
}
three_spheres      <- run_kmeans(three_spheres)
ring_moon_sphere   <- run_kmeans(ring_moon_sphere)
two_spirals_sphere <- run_kmeans(two_spirals_sphere)
p_km_three <- ggplot(three_spheres, aes(x = x, y = y, color = kmeans_clusters)) +
  geom_point() + guides(color = "none") + theme_classic(base_size = 12) +
  ggtitle("K-means: three_spheres")

p_km_ring <- ggplot(ring_moon_sphere, aes(x = x, y = y, color = kmeans_clusters)) +
  geom_point() + guides(color = "none") + theme_classic(base_size = 12) +
  ggtitle("K-means: ring moon")

p_km_spiral <- ggplot(two_spirals_sphere, aes(x = x, y = y, color = kmeans_clusters)) +
  geom_point() + guides(color = "none") + theme_classic(base_size = 12) +
  ggtitle("K-means: two spirals")

PAM

library(cluster)
run_pam <- function(df, k = 3, nstart = 10) {
  # Cluster on the x/y coordinates only, then attach the labels to the data frame
  numeric_only <- df[, c("x", "y")]
  pam_result <- pam(numeric_only, k = k, nstart = nstart)
  df$pam_clusters <- factor(pam_result$clustering)
  return(df)
}
three_spheres      <- run_pam(three_spheres)
ring_moon_sphere   <- run_pam(ring_moon_sphere)
two_spirals_sphere <- run_pam(two_spirals_sphere)
p_pam_three <- ggplot(three_spheres, aes(x = x, y = y, color = pam_clusters)) +
  geom_point() + guides(color = "none") + theme_classic(base_size = 12) +
  ggtitle("PAM: three_spheres")

p_pam_ring <- ggplot(ring_moon_sphere, aes(x = x, y = y, color = pam_clusters)) +
  geom_point() + guides(color = "none") + theme_classic(base_size = 12) +
  ggtitle("PAM: ring moon")

p_pam_spiral <- ggplot(two_spirals_sphere, aes(x = x, y = y, color = pam_clusters)) +
  geom_point() + guides(color = "none") + theme_classic(base_size = 12) +
  ggtitle("PAM: two spirals")
library(patchwork)

(p_db_three | p_km_three | p_pam_three) /
(p_db_ring | p_km_ring | p_pam_ring) /
(p_db_spiral | p_km_spiral | p_pam_spiral)

All three methods perform well on the first data set, where the clusters are compact and roughly convex. On the ring/moon and spiral data sets, however, k-means and PAM break the non-convex shapes apart, since both methods can only form convex, centroid- or medoid-based clusters. With the right minPts and eps values, DBSCAN does an excellent job recovering the non-convex cluster shapes.
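
As an optional check on that comparison, one can cross-tabulate the k-means labels against the DBSCAN labels for one of the data sets. A sketch, assuming the objects created above (db_ring is just an illustrative name; the 0 column is DBSCAN noise):

db_ring <- dbscan(ring_moon_sphere[, c("x", "y")], eps = 0.35, minPts = 4)
table(kmeans = ring_moon_sphere$kmeans_clusters, dbscan = db_ring$cluster)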

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: total fertility rate, births per woman
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv') %>%
  column_to_rownames('country')
head(wdi)
            life_expectancy        gdp      co2 fert_rate    health internet
Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
Albania            77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
Algeria            73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
Angola             63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
Argentina          75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
Armenia            73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
            infant_mort electricity  imports inflation  exports    income
Afghanistan        55.3        97.7 36.28908  5.601888 10.42082  475.7181
Albania             8.1       100.0 36.97995  1.620887 22.54076 4322.5497
Algeria            20.4        99.7 24.85456  2.415131 15.53520 2689.8725
Angola             42.3        47.0 27.62749 22.271539 38.31454 1100.2175
Argentina           8.7       100.0 13.59828 42.015095 16.60541 7241.0303
Armenia            10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.
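
Note that the indicators sit on very different scales (income and GDP are in the thousands of dollars, while fertility is roughly 1-5), which is why the data are standardized before clustering below. A quick sketch to confirm the spread of each column:

round(sapply(wdi, sd), 1)   # standard deviation of each indicator on its raw scale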

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

wdi_scaled <- scale(wdi)
fviz_nbclust(wdi_scaled,
             FUNcluster = kmeans,
             method = 'wss') +
  labs(title = 'Plot of WSS vs k using kmeans')

fviz_nbclust(wdi_scaled,
             FUNcluster = kmeans,
             method = 'silhouette') +
  labs(title = 'Plot of avg silhouette vs k using kmeans')

I agree with the claim. The elbow in the WSS plot sits somewhere around 3-5 clusters, and the silhouette plot sharpens the picture: k = 4 looks like the best choice, since its average silhouette width is higher than at k = 5 and about the same as at k = 3. I would not pick 5, because the average silhouette drops noticeably there.
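
To back the visual read with numbers, here is a short sketch that computes the average silhouette width for k = 3, 4, and 5 directly (it assumes wdi_scaled from above; the seed and the sil_width name are arbitrary, and the exact values will vary slightly with the random starts):

set.seed(123)   # arbitrary seed, only so the random starts are reproducible
sil_width <- sapply(3:5, function(k) {
  km <- kmeans(wdi_scaled, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(wdi_scaled))[, "sil_width"])
})
names(sil_width) <- paste0("k = ", 3:5)
round(sil_width, 3)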

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

wdi_kmeans_4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)

wdi_pca <- prcomp(wdi, center = TRUE, scale. = TRUE)
pcs <- as.data.frame(wdi_pca$x[, 1:2])
pcs$Country <- rownames(pcs)
pcs$outlier <- abs(scale(pcs$PC1)) > 1.4 | abs(scale(pcs$PC2)) > 1.1
pcs$Cluster <- factor(wdi_kmeans_4$cluster)
fviz_pca(
  wdi_pca,
  habillage = pcs$Cluster,  
  label = "var",             
  repel = TRUE
) +
  geom_text_repel(
    data = pcs[pcs$outlier, ],
    aes(x = PC1, y = PC2, label = Country),
    color = "black",
    fontface = "bold",
    size = 3
  ) +
  ggtitle("K-means 4-cluster") +
  guides(color = "none", shape = "none") +
  theme_minimal()
Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Countries like the US, Japan, Australia, and Norway cluster together because they all have high health spending, electricity access, life expectancy, and internet usage, along with very low fertility and infant mortality rates. In that same cluster, places like Qatar, the UAE, and Belgium stand out for high income, GDP, and CO₂ emissions, and Ireland, Singapore, and Luxembourg are pulled in mainly by their very high import and export shares (green).

On the other hand, countries like Brazil, Argentina, China, Colombia, and Russia group together with relatively low import and export levels (red), and countries such as Angola, Chad, and Benin group together due to high infant mortality and fertility rates (blue). Zimbabwe ends up in a cluster of its own, on the high-fertility, high-infant-mortality side of the plot (pink).
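
As a numeric complement to the biplot, one can look at the mean of each standardized indicator by cluster. A sketch, assuming wdi_scaled and wdi_kmeans_4 from above (cluster_means is just an illustrative name; positive values are above the overall average, negative values below):

cluster_means <- aggregate(as.data.frame(wdi_scaled),
                           by = list(cluster = wdi_kmeans_4$cluster),
                           FUN = mean)
round(cluster_means, 2)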

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

wdi_clean <- wdi[!rownames(wdi) %in% c("Ireland", "Singapore", "Luxembourg"), ]

wdi_clean_scaled <- scale(wdi_clean)
wdi_clean_kmeans_4 <- kmeans(wdi_clean_scaled, centers = 4, nstart = 10)

wdi_clean_pca <- prcomp(wdi_clean, center = TRUE, scale. = TRUE)
pcs_clean <- as.data.frame(wdi_clean_pca$x[, 1:2])
pcs_clean$Country <- rownames(pcs_clean) 

pcs_clean$outlier <- abs(scale(pcs_clean$PC1)) > 1.4 | abs(scale(pcs_clean$PC2)) > 1.2
pcs_clean$Cluster <- factor(wdi_clean_kmeans_4$cluster)

kmeans_biplot_4_clean <- fviz_pca(
  wdi_clean_pca,
  habillage = pcs_clean$Cluster,  
  label = "var",             
  repel = TRUE
) +
  geom_text_repel(
    data = pcs_clean[pcs_clean$outlier, ],
    aes(x = PC1, y = PC2, label = Country),
    color = "black",
    fontface = "bold",
    size = 3
  ) +
  ggtitle("K-means 4-cluster") +
  guides(color = "none", shape = "none") +
  theme_minimal()

kmeans_biplot_4_clean

After removing Ireland, Singapore, and Luxembourg, the clusters look much cleaner. Zimbabwe is no longer in a cluster by itself; it now groups with the other countries that have high fertility and infant mortality. The big cluster that contained the high-health, high-income, high-GDP, high-trade countries splits into two: one of the new clusters includes places like the UAE, Hungary, Belarus, Estonia, and Cyprus, mostly Eastern European countries, and it sits in the direction of high imports and exports.
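
To see this reshuffling directly, one can cross-tabulate the original 4-cluster assignments against the new ones for the countries that remain. A sketch, assuming the objects above (the helper names common, orig_clusters, and new_clusters are illustrative; the cluster numbers are arbitrary labels, so the point is the block pattern rather than matching numbers):

common <- rownames(wdi_clean)                                       # countries kept in the reduced data
orig_clusters <- setNames(wdi_kmeans_4$cluster, rownames(wdi))      # original 4-cluster labels
new_clusters  <- setNames(wdi_clean_kmeans_4$cluster, rownames(wdi_clean))
table(original = orig_clusters[common], no_outliers = new_clusters[common])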