Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With k-means and PAM, we can specify that we want 3 clusters. With DBSCAN, recall that we instead select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon\).

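# Draw the three k-NN distance plots side by side
par(mfrow = c(1,3))

# Note: the dbscan documentation suggests k = minPts - 1 (3 here);
# k = 4 is kept because the eps values below were read from these plots.

# Three Spheres
kNNdistplot(three_spheres, k = 4)
title("Three Spheres")
abline(h = 0.15, col = "red", lwd = 2)  # candidate eps

# Ring–Moon–Sphere
kNNdistplot(ring_moon_sphere, k = 4)
title("Ring–Moon–Sphere")
abline(h = 0.1, col = "red", lwd = 2)   # candidate eps

# Two Spirals + Sphere
kNNdistplot(two_spirals_sphere, k = 4)
title("Two Spirals + Sphere")
abline(h = 0.12, col = "red", lwd = 2)  # candidate eps

# Restore the default single-panel layout
par(mfrow = c(1,1))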

Three Spheres ≈ 0.15

Ring–Moon–Sphere ≈ 0.1

Two Spirals + Sphere ≈ 0.12
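As a numeric cross-check on values read off the plots, the sorted k-NN distances can also be inspected directly. The following is a minimal sketch on synthetic data: the `toy` matrix and the chosen quantiles are illustrative assumptions, not part of the assignment, and `kNNdist()` is the dbscan function that underlies `kNNdistplot()`.

```r
library(dbscan)

set.seed(1)
# Hypothetical stand-in for one of the assignment data sets: two tight blobs
toy <- cbind(
  x = c(rnorm(50, mean = 0, sd = 0.1), rnorm(50, mean = 3, sd = 0.1)),
  y = c(rnorm(50, mean = 0, sd = 0.1), rnorm(50, mean = 3, sd = 0.1))
)

# Distance from each point to its 4th nearest neighbor, sorted as in kNNdistplot()
knn_d <- sort(kNNdist(toy, k = 4))

# Upper quantiles give a rough range in which to look for the "knee"
quantile(knn_d, probs = c(0.50, 0.90, 0.95))
```

The quantiles only narrow the search; the final eps still has to be confirmed by looking at the resulting clusters.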

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- function(df, eps, minPts) {
  # Run DBSCAN on the supplied data frame
  db <- dbscan(df, eps = eps, minPts = minPts)

  # Store cluster labels as a factor; dbscan() uses label 0 for noise points
  df$cluster <- factor(db$cluster)

  # Plot the first two columns, colored by cluster membership
  x_var <- colnames(df)[1]
  y_var <- colnames(df)[2]
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], color = cluster)) +
    geom_point(size = 2) +
    labs(
      title = paste0("DBSCAN Clusters (eps = ", eps, ", minPts = ", minPts, ")"),
      x = x_var,
      y = y_var,
      color = "Cluster"
    ) +
    theme_minimal()
}

# Examples: eps values revised from the candidates in A)
plot_dbscan_results(three_spheres, eps = 0.25, minPts = 3)

plot_dbscan_results(ring_moon_sphere, eps = 0.35, minPts = 3)

plot_dbscan_results(two_spirals_sphere, eps = 0.11, minPts = 3)

Personally, I needed a little more context for this question. I wasn’t sure whether each plot needed exactly 3 clusters, or whether minPts had to stay at 4 as in the previous part. I will check the answer key when I’m done to see what I got wrong.

I’m also not sure why, but no epsilon value I tried gave 3 clear clusters for the spirals. The result was always the circle in the middle, one giant spiral surrounding it, and a single point shown as its own cluster. If that lone point carries label 0, it is DBSCAN’s noise label rather than a real cluster. I’m not sure whether there is a way to change this, but two clusters seem to work best for the spiral data.
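One way to check whether a lone "cluster" is really noise is to tabulate the labels that `dbscan()` returns. This is a minimal sketch on synthetic data (the `toy` matrix, `eps = 0.5`, and `minPts = 3` are illustrative assumptions); in the dbscan package, label 0 is reserved for noise points.

```r
library(dbscan)

set.seed(1)
# Hypothetical data: two dense blobs plus one isolated point (row 101)
toy <- rbind(
  cbind(rnorm(50, 0, 0.1), rnorm(50, 0, 0.1)),
  cbind(rnorm(50, 3, 0.1), rnorm(50, 3, 0.1)),
  c(10, 10)
)

db <- dbscan(toy, eps = 0.5, minPts = 3)

# Label 0 marks noise; positive labels are the actual clusters
table(db$cluster)
n_clusters <- length(setdiff(unique(db$cluster), 0))
noise_rows <- which(db$cluster == 0)
```

Here `n_clusters` counts only the positive labels, so a plot legend showing three colors can still correspond to two real clusters plus noise.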

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

# Generic helper: scatterplot of the first two columns, colored by a label vector
plot_clusters <- function(df, clusters, method_name) {
  df$cluster <- factor(clusters)

  x_var <- colnames(df)[1]
  y_var <- colnames(df)[2]
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], color = cluster)) +
    geom_point(size = 2) +
    labs(
      title = method_name,
      x = x_var,
      y = y_var,
      color = "Cluster"
    ) +
    theme_minimal()
}

# DBSCAN function from B), redefined with a shorter title for the grid
plot_dbscan_results <- function(df, eps, minPts) {
  db <- dbscan(df, eps = eps, minPts = minPts)
  df$cluster <- factor(db$cluster)  # label 0 = noise

  x_var <- colnames(df)[1]
  y_var <- colnames(df)[2]
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], color = cluster)) +
    geom_point(size = 2) +
    labs(
      title = paste0("DBSCAN (eps=", eps, ", minPts=", minPts, ")"),
      x = x_var,
      y = y_var,
      color = "Cluster"
    ) +
    theme_minimal()
}


# Fix the seed and use several random starts so the k-means results are reproducible
set.seed(123)

# Three Spheres (the dbscan_* objects are plots, not label vectors)
kmeans_ts <- kmeans(three_spheres, centers = 3, nstart = 25)$cluster
pam_ts <- pam(three_spheres, k = 3)$cluster
dbscan_ts <- plot_dbscan_results(three_spheres, eps = 0.25, minPts = 4)

# Ring–Moon–Sphere
kmeans_rms <- kmeans(ring_moon_sphere, centers = 3, nstart = 25)$cluster
pam_rms <- pam(ring_moon_sphere, k = 3)$cluster
dbscan_rms <- plot_dbscan_results(ring_moon_sphere, eps = 0.35, minPts = 4)

# Two Spirals + Sphere
kmeans_tss <- kmeans(two_spirals_sphere, centers = 3, nstart = 25)$cluster
pam_tss <- pam(two_spirals_sphere, k = 3)$cluster
dbscan_tss <- plot_dbscan_results(two_spirals_sphere, eps = 0.11, minPts = 4)


kmeans_ts_plot <- plot_clusters(three_spheres, kmeans_ts, "k-means")
pam_ts_plot <- plot_clusters(three_spheres, pam_ts, "PAM")

kmeans_rms_plot <- plot_clusters(ring_moon_sphere, kmeans_rms, "k-means")
pam_rms_plot <- plot_clusters(ring_moon_sphere, pam_rms, "PAM")

kmeans_tss_plot <- plot_clusters(two_spirals_sphere, kmeans_tss, "k-means")
pam_tss_plot <- plot_clusters(two_spirals_sphere, pam_tss, "PAM")


(kmeans_ts_plot | pam_ts_plot | dbscan_ts) /
(kmeans_rms_plot | pam_rms_plot | dbscan_rms) /
(kmeans_tss_plot | pam_tss_plot | dbscan_tss)
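To put numbers behind a visual comparison like the grid above, the labels from each method can be cross-tabulated against a known grouping. The sketch below uses synthetic data, not the assignment files: the blob-inside-a-ring layout, the `truth` vector, and the `eps` and `minPts` values are all illustrative assumptions.

```r
library(dbscan)

set.seed(1)
# Hypothetical data: a tight blob at the origin surrounded by a ring of radius 2
angle <- seq(0, 2 * pi, length.out = 201)[-1]
radius <- 2 + rnorm(200, sd = 0.03)
toy <- rbind(
  cbind(rnorm(100, sd = 0.1), rnorm(100, sd = 0.1)),
  cbind(radius * cos(angle), radius * sin(angle))
)
truth <- rep(c("blob", "ring"), times = c(100, 200))

km <- kmeans(toy, centers = 2, nstart = 25)
db <- dbscan(toy, eps = 0.3, minPts = 4)

# Cross-tabulate the true shape against each method's labels
table(truth, kmeans = km$cluster)
table(truth, dbscan = db$cluster)
```

With two centers, k-means separates points by a straight line, so the ring ends up split across both clusters. DBSCAN follows density instead and can keep the ring and the blob apart.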

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

life_expectancy: average life expectancy at birth
gdp: GDP per capita, in 2015 USD
co2: CO2 emissions, in metric tons per capita
fert_rate: total fertility rate, in births per woman
health: percentage of GDP spent on health care
imports and exports: imports and exports as a percentage of GDP
internet and electricity: percentage of population with access to internet and electricity, respectively
infant_mort: infant mortality rate, infant deaths per 1000 live births
inflation: consumer price inflation, as annual percentage
income: annual per-capita income, in 2020 USD

wdi <- read.csv('Data/wdi_extract_clean.csv') 
head(wdi)

      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

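# Keep only the numeric indicators (this drops the country column)
wdi_num <- wdi[, sapply(wdi, is.numeric)]

# Standardize so that large-valued variables such as gdp do not dominate the distances
wdi_scaled <- scale(wdi_num)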


fviz_nbclust(wdi_scaled, kmeans, method = "wss") +
  ggtitle("Elbow Method for Optimal k")

fviz_nbclust(wdi_scaled, kmeans, method = "silhouette") +
  ggtitle("Silhouette Method for Optimal k")

The elbow does begin at k = 3, but the curve only really flattens out at k = 5. The silhouette plot supports k = 4. All three values fall within the claimed range of 3-5 clusters, so the plots support the claim.
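The quantities behind both plots can also be tabulated by hand, which makes it easier to compare candidate values of k side by side. This is a minimal sketch that assumes the built-in `USArrests` data as a stand-in for the scaled WDI data; `silhouette()` is from the cluster package, and the range 2 to 6 is an arbitrary choice.

```r
library(cluster)

# USArrests ships with base R and stands in here for the scaled WDI data
x <- scale(USArrests)
d <- dist(x)

set.seed(123)
ks <- 2:6
res <- t(sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  # Total within-cluster sum of squares (elbow) and average silhouette width
  c(k = k, wss = km$tot.withinss, avg_sil = mean(sil[, "sil_width"]))
}))
res
```

A good k shows a large drop in `wss` followed by small ones, together with a relatively high `avg_sil`.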

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

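set.seed(123)
k <- 4  # number of clusters requested in the question
# Note: centers = 5 rather than k, an attempt to keep a single outlying point
# from taking up one of the 4 clusters
kmeans_res <- kmeans(wdi_scaled, centers = 5, nstart = 25)
wdi$cluster <- factor(kmeans_res$cluster)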

# Perform PCA 
pca_res <- prcomp(wdi_scaled, center = TRUE, scale. = FALSE)

# Create data frame for plotting
pca_df <- data.frame(
  PC1 = pca_res$x[,1],
  PC2 = pca_res$x[,2],
  cluster = wdi$cluster,
  country = wdi$country  # country names rather than row numbers
)

# Plot PCA with clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  # Draw ellipses only for clusters with at least 3 points;
  # a single-point cluster gets no ellipse
  stat_ellipse(data = subset(pca_df, cluster %in% names(table(cluster))[table(cluster) >= 3]),
               type = "norm", linetype = 2) +
  labs(title = "PCA of WDI Data, k-means with 5 centers",
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_minimal()

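Again, I’m having the same issue where a single point shows up as its own cluster, and I don’t know how to change it. The plot above shows 5 clusters but only 4 ellipses, because no ellipse is drawn for the single-point cluster. In the cluster summary below, cluster 4 has a mean inflation of about 557%, which suggests that this point is an extreme inflation outlier.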
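A quick way to confirm which observations sit in very small clusters is to tabulate the cluster sizes and list the members of the small ones. This is a minimal sketch with made-up labels: the `toy` data frame and its `Country_*` names are hypothetical placeholders for `wdi$country` and `wdi$cluster`.

```r
# Hypothetical labels: cluster 3 contains a single observation
toy <- data.frame(
  country = paste0("Country_", LETTERS[1:7]),
  cluster = factor(c(1, 1, 1, 2, 2, 2, 3))
)

# Cluster sizes; a size of 1 signals a lone outlier
sizes <- table(toy$cluster)
sizes

# List the members of any cluster with fewer than 3 points
small <- names(sizes)[sizes < 3]
singletons <- toy[toy$cluster %in% small, ]
singletons
```

Once the outlying country is identified, its raw indicator values show which variable pulls it away from the rest.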

cluster_summary <- aggregate(wdi_num, by = list(cluster = wdi$cluster), FUN = mean)

colnames(cluster_summary)[1] <- "Cluster"

print(cluster_summary)

  Cluster life_expectancy       gdp        co2 fert_rate    health internet
1       1        73.11275  7223.219  3.7575470  2.068763  6.838331 70.48229
2       2        81.46807 40937.926  8.0461472  1.575404 10.602173 90.52674
3       3        80.61227 56788.838 19.7066481  1.530000  5.037042 96.87044
4       4        61.53000  1230.192  0.5842832  3.754000  2.954401 29.29860
5       5        62.93771  1405.343  0.5006377  4.432794  5.501136 28.57167
  infant_mort electricity  imports   inflation   exports    income
1   13.423684    97.49211 40.22340   4.4820091  34.31290  5973.559
2    2.961538   100.00000 41.15122   0.6953442  43.20762 35104.973
3    4.585714   100.00000 97.13441  -0.6679204 114.12021 36922.741
4   44.900000    52.70000 25.02031 557.2018174  22.29307  1171.849
5   44.250000    50.09118 35.29576  10.5131462  26.14813  1092.885

Below is the AI-generated code I used to find the country closest to the center of each cluster. If there is a better method for finding a representative country, I will look for it in the answer key.

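wdi_scaled_df <- as.data.frame(wdi_scaled)
wdi$cluster <- factor(kmeans_res$cluster)

# For each cluster, find the row closest to its centroid in the scaled space.
# Note: 1:k covers clusters 1 to 4 only, although kmeans_res has 5 centers.
representatives <- sapply(1:k, function(cl) {
  idx <- which(wdi$cluster == cl)            # rows assigned to this cluster
  cluster_points <- wdi_scaled_df[idx, ]     # their scaled coordinates
  centroid <- kmeans_res$centers[cl, ]       # cluster centroid
  # Euclidean distance from each point to the centroid
  distances <- apply(cluster_points, 1, function(x) sqrt(sum((x - centroid)^2)))
  # rownames(wdi) are row numbers, so this returns a row index, not a country name
  rep_country <- rownames(wdi)[idx[which.min(distances)]]
  return(rep_country)
})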

representatives

[1] "88"  "8"   "137" "144"

Looking at the original csv, we find these countries to be the best representatives (least distance to the center) of their respective clusters:

8: Australia
88: Mauritius
144: Zambia
132: Timor-Leste
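The csv lookup can be avoided by indexing the country column directly and looping over every centroid that k-means returned. This is a minimal sketch on made-up data: the `toy` data frame, its `Country_*` names, and the choice of 2 centers are hypothetical, and this is one possible approach rather than the answer key's method.

```r
# Hypothetical data: two well-separated groups of three countries each
toy <- data.frame(
  country = paste0("Country_", LETTERS[1:6]),
  x = c(0, 0.2, 0.1, 5, 5.2, 5.1),
  y = c(0, 0.1, 0.2, 5, 5.1, 5.2)
)
X <- scale(toy[, c("x", "y")])

set.seed(1)
km <- kmeans(X, centers = 2, nstart = 10)

# Loop over all centroids, not a separately stored k
reps <- sapply(seq_len(nrow(km$centers)), function(cl) {
  idx <- which(km$cluster == cl)
  # Euclidean distance of each member to its own centroid
  diffs <- sweep(X[idx, , drop = FALSE], 2, km$centers[cl, ])
  d <- sqrt(rowSums(diffs^2))
  # Index the country column so the result is a name, not a row number
  toy$country[idx[which.min(d)]]
})
reps
```

Using `seq_len(nrow(km$centers))` keeps the loop in step with the number of clusters actually fitted.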

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?
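The mechanics of this step can be sketched as follows: filter by country name, then standardize again before re-running k-means, since the means and standard deviations change once rows are dropped. The `toy` data frame, its values, and the `Country_*` names are hypothetical, and 2 centers are used only because the toy data has 8 remaining rows; the assignment itself asks for 4.

```r
# Hypothetical extract: three rows carry the names to be dropped
toy <- data.frame(
  country = c("Ireland", "Singapore", "Luxembourg", paste0("Country_", LETTERS[1:8])),
  gdp = c(80, 60, 100, 1, 2, 3, 4, 20, 21, 22, 23),
  internet = c(92, 90, 98, 20, 25, 22, 28, 70, 75, 72, 78)
)

drop <- c("Ireland", "Singapore", "Luxembourg")
toy_sub <- toy[!toy$country %in% drop, ]

# Re-standardize after filtering so means and SDs reflect the remaining rows
toy_scaled <- scale(toy_sub[, sapply(toy_sub, is.numeric)])

set.seed(123)
km_sub <- kmeans(toy_scaled, centers = 2, nstart = 25)
table(km_sub$cluster)
```

Comparing the cluster means before and after the removal then shows how the cluster definitions shift.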