Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)

library(ggplot2)
library(gridExtra)
Warning: package 'gridExtra' was built under R version 4.5.2

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
plot_kNN_and_dbscan <- function(data, dataset_name, minPts = 4, eps_candidates = NULL) {
  x <- as.matrix(data)
  kNNdistances <- kNNdist(x, k = minPts)
  df <- data.frame(
    index = 1:length(kNNdistances),
    distance = sort(kNNdistances)
  )
  
  p <- ggplot(df, aes(x = index, y = distance)) +
    geom_line(color = "steelblue") +
    labs(title = paste(dataset_name, "- kNN Distance Plot (minPts =", minPts, ")"),
         x = "Points (sorted)", y = "Distance") +
    theme_minimal()

  if (!is.null(eps_candidates)) {
    for (eps in eps_candidates) {
      p <- p + geom_hline(yintercept = eps, linetype = "dashed", color = "red")
    }
  }
  
  print(p)

  if (!is.null(eps_candidates)) {
    for (eps in eps_candidates) {
      cat("\n--- DBSCAN results for", dataset_name, "with eps =", eps, " ---\n")
      res <- dbscan(x, eps = eps, minPts = minPts)
      print(res)

      plot(x, col = res$cluster + 1, pch = 19,
           main = paste(dataset_name, "- DBSCAN (eps =", eps, ", minPts =", minPts, ")"))
    }
  }
}

plot_kNN_and_dbscan(three_spheres, "Three Spheres", minPts = 4, eps_candidates = c(0.3, 0.5))


--- DBSCAN results for Three Spheres with eps = 0.3  ---
DBSCAN clustering for 300 objects.
Parameters: eps = 0.3, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 0 noise points.

  1   2 
100 200 

Available fields: cluster, eps, minPts, metric, borderPoints


--- DBSCAN results for Three Spheres with eps = 0.5  ---
DBSCAN clustering for 300 objects.
Parameters: eps = 0.5, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 1 cluster(s) and 0 noise points.

  1 
300 

Available fields: cluster, eps, minPts, metric, borderPoints

plot_kNN_and_dbscan(ring_moon_sphere, "Ring/Moon/Sphere", minPts = 4, eps_candidates = c(0.4, 0.6))


--- DBSCAN results for Ring/Moon/Sphere with eps = 0.4  ---
DBSCAN clustering for 750 objects.
Parameters: eps = 0.4, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 0 noise points.

  1   2 
250 500 

Available fields: cluster, eps, minPts, metric, borderPoints


--- DBSCAN results for Ring/Moon/Sphere with eps = 0.6  ---
DBSCAN clustering for 750 objects.
Parameters: eps = 0.6, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 0 noise points.

  1   2 
250 500 

Available fields: cluster, eps, minPts, metric, borderPoints

plot_kNN_and_dbscan(two_spirals_sphere, "Two Spirals + Sphere", minPts = 4, eps_candidates = c(0.5, 0.7))


--- DBSCAN results for Two Spirals + Sphere with eps = 0.5  ---
DBSCAN clustering for 815 objects.
Parameters: eps = 0.5, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 30 cluster(s) and 254 noise points.

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
254  18  19  24  11   4   9   7   6   5   7   5   4   4  35  31  10   5   5   6 
 20  21  22  23  24  25  26  27  28  29  30 
  4   6   6   4   5   4   4   4   5   4 300 

Available fields: cluster, eps, minPts, metric, borderPoints


--- DBSCAN results for Two Spirals + Sphere with eps = 0.7  ---
DBSCAN clustering for 815 objects.
Parameters: eps = 0.7, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 32 cluster(s) and 116 noise points.

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
116  62  49   7   7   3   5   8   3   7   6   4   6   7   4   4   4   5   4  81 
 20  21  22  23  24  25  26  27  28  29  30  31  32 
 61  12   6   9   5   6   4   5   3   4   4   4 300 

Available fields: cluster, eps, minPts, metric, borderPoints

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- function(df, eps, minPts) {
  x <- as.matrix(df)
  res <- dbscan(x, eps = eps, minPts = minPts)
  df$cluster <- factor(res$cluster) 
  p <- ggplot(df, aes(x = df[,1], y = df[,2], color = cluster)) +
    geom_point(size = 2) +
    labs(title = paste("DBSCAN Results (eps =", eps, ", minPts =", minPts, ")"),
         x = "X1", y = "X2") +
    theme_minimal()
  
  print(p)
  cat("\n--- DBSCAN summary ---\n")
  print(res)
}
plot_dbscan_results(three_spheres, eps = 0.295, minPts = 4)


--- DBSCAN summary ---
DBSCAN clustering for 300 objects.
Parameters: eps = 0.295, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 3 cluster(s) and 0 noise points.

  1   2   3 
100 100 100 

Available fields: cluster, eps, minPts, metric, borderPoints
plot_dbscan_results(ring_moon_sphere, eps = 0.35, minPts = 4)


--- DBSCAN summary ---
DBSCAN clustering for 750 objects.
Parameters: eps = 0.35, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 3 cluster(s) and 0 noise points.

  1   2   3 
250 250 250 

Available fields: cluster, eps, minPts, metric, borderPoints
plot_dbscan_results(two_spirals_sphere, eps = 1.3, minPts = 4)


--- DBSCAN summary ---
DBSCAN clustering for 815 objects.
Parameters: eps = 1.3, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 3 cluster(s) and 0 noise points.

  1   2   3 
257 258 300 

Available fields: cluster, eps, minPts, metric, borderPoints

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

plot_dbscan_results <- function(df, eps, minPts, title = "") {
  x <- as.matrix(df)
  res <- dbscan(x, eps = eps, minPts = minPts)
  df$cluster <- factor(res$cluster)
  
  ggplot(df, aes(x = df[,1], y = df[,2], color = cluster)) +
    geom_point(size = 2) +
    labs(title = paste("DBSCAN:", title, "\n(eps =", eps, ", minPts =", minPts, ")"),
         x = "X1", y = "X2") +
    theme_minimal()
}

plot_kmeans <- function(df, centers = 3, title = "") {
  x <- as.matrix(df)
  km <- kmeans(x, centers = centers)
  df$cluster <- factor(km$cluster)
  
  ggplot(df, aes(x = df[,1], y = df[,2], color = cluster)) +
    geom_point(size = 2) +
    labs(title = paste("K-means:", title),
         x = "X1", y = "X2") +
    theme_minimal()
}

plot_pam <- function(df, centers = 3, title = "") {
  x <- as.matrix(df)
  pam_res <- pam(x, k = centers)
  df$cluster <- factor(pam_res$clustering)
  
  ggplot(df, aes(x = df[,1], y = df[,2], color = cluster)) +
    geom_point(size = 2) +
    labs(title = paste("PAM:", title),
         x = "X1", y = "X2") +
    theme_minimal()
}

p1 <- plot_dbscan_results(three_spheres, eps = 0.295, minPts = 4, title = "Three Spheres")
p2 <- plot_dbscan_results(ring_moon_sphere, eps = 0.35, minPts = 4, title = "Ring/Moon/Sphere")
p3 <- plot_dbscan_results(two_spirals_sphere, eps = 1.3, minPts = 4, title = "Two Spirals + Sphere")

p4 <- plot_kmeans(three_spheres, centers = 3, title = "Three Spheres")
p5 <- plot_kmeans(ring_moon_sphere, centers = 3, title = "Ring/Moon/Sphere")
p6 <- plot_kmeans(two_spirals_sphere, centers = 3, title = "Two Spirals + Sphere")

p7 <- plot_pam(three_spheres, centers = 3, title = "Three Spheres")
p8 <- plot_pam(ring_moon_sphere, centers = 3, title = "Ring/Moon/Sphere")
p9 <- plot_pam(two_spirals_sphere, centers = 3, title = "Two Spirals + Sphere")

(p1 | p2 | p3) / (p4 | p5 | p6) / (p7 | p8 | p9)

For the three spheres dataset, all three methods had a similar result so any could be used. For the ring/moon/sphere and the two spirals datasets the dbscan was the best clustering method.

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: annual births per 1000 women
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv') 
head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

country_names <- wdi$country
wdi_numeric <- wdi %>% select(where(is.numeric))
wdi_scaled <- scale(wdi_numeric)

set.seed(123)  
kmeans_res <- kmeans(wdi_scaled, centers = 5)

pca_res <- prcomp(wdi_scaled)
plot_df <- data.frame(
  PC1 = pca_res$x[,1],
  PC2 = pca_res$x[,2],
  cluster = factor(kmeans_res$cluster),
  country = country_names
)

ggplot(plot_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "K-means Clustering (k = 5) on WDI Data",
       x = "PC1", y = "PC2") +
  theme_minimal()

I would agree, I think the bottom points could be considered outliers. I would suggest about 4 clusters as I believe 5 is too many.

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

wdi$cluster <- factor(kmeans_res$cluster)
aggregate(wdi_numeric, by = list(cluster = kmeans_res$cluster), FUN = mean)
  cluster life_expectancy       gdp        co2 fert_rate    health internet
1       1        79.29097 36634.723 12.7143768  1.641056  6.983962 91.47530
2       2        81.87663 44357.350  8.2001498  1.579975 10.912171 91.15175
3       3        62.66213  1344.987  0.4839863  4.482375  5.577630 27.11023
4       4        72.72532  6425.242  3.3067435  2.123795  6.783954 68.80495
5       5        61.53000  1230.192  0.5842832  3.754000  2.954401 29.29860
  infant_mort electricity  imports   inflation  exports    income
1    4.505556    100.0000 81.32275   0.3678409 88.97402 26890.816
2    3.080000    100.0000 32.03153   0.7021106 33.45626 37791.137
3   45.571875     48.4500 34.79889  10.6946978 26.21992  1008.142
4   14.147945     96.7411 38.20414   4.7751985 31.63508  5345.466
5   44.900000     52.7000 25.02031 557.2018174 22.29307  1171.849
split(wdi$country, wdi$cluster)
$`1`
 [1] "Bahrain"              "Belgium"              "Brunei Darussalam"   
 [4] "Cyprus"               "Czechia"              "Estonia"             
 [7] "Hungary"              "Ireland"              "Lithuania"           
[10] "Luxembourg"           "Netherlands"          "Oman"                
[13] "Qatar"                "Seychelles"           "Singapore"           
[16] "Slovak Republic"      "Slovenia"             "United Arab Emirates"

$`2`
 [1] "Australia"     "Austria"       "Canada"        "Denmark"      
 [5] "Finland"       "France"        "Germany"       "Iceland"      
 [9] "Israel"        "Italy"         "Japan"         "Korea, Rep."  
[13] "New Zealand"   "Norway"        "Portugal"      "Saudi Arabia" 
[17] "Spain"         "Sweden"        "Switzerland"   "United States"

$`3`
 [1] "Afghanistan"       "Angola"            "Benin"            
 [4] "Burkina Faso"      "Burundi"           "Cameroon"         
 [7] "Chad"              "Congo, Rep."       "Cote d'Ivoire"    
[10] "Djibouti"          "Equatorial Guinea" "Ethiopia"         
[13] "Gambia, The"       "Guinea"            "Guinea-Bissau"    
[16] "Lesotho"           "Madagascar"        "Mali"             
[19] "Mauritania"        "Mozambique"        "Namibia"          
[22] "Niger"             "Pakistan"          "Rwanda"           
[25] "Senegal"           "Sierra Leone"      "Solomon Islands"  
[28] "Sudan"             "Tanzania"          "Togo"             
[31] "Uganda"            "Zambia"           

$`4`
 [1] "Albania"                "Algeria"                "Argentina"             
 [4] "Armenia"                "Azerbaijan"             "Bahamas, The"          
 [7] "Bangladesh"             "Belarus"                "Belize"                
[10] "Bhutan"                 "Bolivia"                "Bosnia and Herzegovina"
[13] "Botswana"               "Brazil"                 "Bulgaria"              
[16] "Cabo Verde"             "Cambodia"               "Chile"                 
[19] "China"                  "Colombia"               "Costa Rica"            
[22] "Croatia"                "Dominican Republic"     "Ecuador"               
[25] "Egypt, Arab Rep."       "El Salvador"            "Fiji"                  
[28] "Gabon"                  "Georgia"                "Ghana"                 
[31] "Greece"                 "Guatemala"              "Honduras"              
[34] "India"                  "Indonesia"              "Iran, Islamic Rep."    
[37] "Iraq"                   "Jordan"                 "Kazakhstan"            
[40] "Kyrgyz Republic"        "Latvia"                 "Lebanon"               
[43] "Libya"                  "Malaysia"               "Maldives"              
[46] "Mauritius"              "Mexico"                 "Micronesia, Fed. Sts." 
[49] "Moldova"                "Mongolia"               "Morocco"               
[52] "Nepal"                  "Nicaragua"              "North Macedonia"       
[55] "Panama"                 "Paraguay"               "Peru"                  
[58] "Philippines"            "Poland"                 "Romania"               
[61] "Russian Federation"     "Samoa"                  "South Africa"          
[64] "Sri Lanka"              "Thailand"               "Timor-Leste"           
[67] "Tunisia"                "Turkiye"                "Ukraine"               
[70] "Uruguay"                "Uzbekistan"             "Vanuatu"               
[73] "Viet Nam"              

$`5`
[1] "Zimbabwe"
country_names <- wdi$country
wdi_numeric <- wdi %>% select(where(is.numeric))
wdi_scaled <- scale(wdi_numeric)
pca_res <- prcomp(wdi_scaled)
kmeans_res <- kmeans(wdi_scaled, centers = 4)

plot_df <- data.frame(
  PC1 = pca_res$x[,1],
  PC2 = pca_res$x[,2],
  cluster = factor(kmeans_res$cluster),
  country = country_names
)
loadings <- data.frame(pca_res$rotation[,1:2]) 
loadings$variable <- rownames(loadings)

arrow_scale <- 5 
loadings_scaled <- loadings %>%
  mutate(PC1 = PC1 * arrow_scale,
         PC2 = PC2 * arrow_scale)

ggplot(plot_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  geom_segment(data = loadings_scaled,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "black") +
  geom_text(data = loadings_scaled,
            aes(x = PC1, y = PC2, label = variable),
            color = "black", size = 3, vjust = -0.5) +
  labs(title = "K-means Clustering (k = 4) with PCA Arrows",
       x = "PC1", y = "PC2") +
  theme_minimal()

Countries in cluster 4 have a higher infant mortality rate and fertility rate. Countries in this cluster include Middle Eastern countries like Afganistan, Angola, and Pakistan. Countries in cluster 2 have a medium level of inflation, electricity, and health rates. Countries in this cluster include countries in South America and Norther Africa including Brazil, Egypt, and Belize. Countries in cluster 3 have a high level of life expectancy and internet. Countries in this cluster include more first world countries like France, the US, and Iceland. Countries in cluster 1 have a higher income, gdp, imports, and exports. Countries in this cluster include European countries like Hungary, Belgium, and Lithuania.

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

wdi_filtered <- wdi %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))
country_names <- wdi_filtered$country
wdi_numeric <- wdi_filtered %>% select(where(is.numeric))
sapply(wdi_numeric, class)
life_expectancy             gdp             co2       fert_rate          health 
      "numeric"       "numeric"       "numeric"       "numeric"       "numeric" 
       internet     infant_mort     electricity         imports       inflation 
      "numeric"       "numeric"       "numeric"       "numeric"       "numeric" 
        exports          income 
      "numeric"       "numeric" 
wdi_scaled <- scale(wdi_numeric)

set.seed(123)  
kmeans_res <- kmeans(wdi_scaled, centers = 4)

pca_res <- prcomp(wdi_scaled)
plot_df <- data.frame(
  PC1 = pca_res$x[,1],
  PC2 = pca_res$x[,2],
  cluster = factor(kmeans_res$cluster),
  country = country_names
)

ggplot(plot_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "K-means Clustering (k = 4) on WDI Data",
       x = "PC1", y = "PC2") +
  theme_minimal()

It keeps a fairly similar pattern of the clusters however removes the outliers that were in the bottom right hand corner.