Activity 4.2 - Kmeans, PAM, and DBSCAN clustering
SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions
Loading required packages:
library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)
Question 1
Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.
three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')
A)
With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)
library(dbscan)
library(ggplot2)
library(dplyr)
library(patchwork)
knn_eps_plot <- function(df, dataset_name, eps_value) {
  # 4-NN distance: each point's distance to its 4th nearest neighbor
  knn <- kNN(df, k = 4)
  dist4 <- apply(knn$dist, 1, max)
  df_knn <- data.frame(index = 1:length(dist4),
                       dist4 = sort(dist4))
  ggplot(df_knn, aes(x = index, y = dist4)) +
    geom_line() +
    geom_hline(yintercept = eps_value, color = "red", linetype = "dashed", linewidth = 1) +
    labs(
      title = paste0("kNN-distance Plot (minPts = 4): ", dataset_name),
      x = "Points sorted by distance",
      y = "4-NN distance"
    ) +
    theme_minimal(base_size = 14)
}
p1 <- knn_eps_plot(three_spheres, "Three Spheres", 0.35)
p2 <- knn_eps_plot(ring_moon_sphere, "Ring-Moon-Sphere", 0.60)
p3 <- knn_eps_plot(two_spirals_sphere, "Two Spirals + Sphere", 0.85)
(p1 | p2 | p3) + plot_layout(nrow = 1)
When I choose epsilon for DBSCAN, I look for the point on the 4-nearest-neighbor distance plot where the curve first starts to rise sharply. For the Three Spheres data, the curve stays fairly flat until around 0.12 to 0.14 and begins bending upward near 0.17 to 0.20, so I placed my candidate line a little above that knee, at 0.35. For the Ring-Moon-Sphere data, the bend happens slightly later, around 0.20 to 0.22, so I drew my candidate line at 0.60. For the Two Spirals plus Sphere data, the curve begins to rise much more noticeably around 0.40 to 0.50, so I set my candidate line at 0.85. The big jumps at the far right of each plot are too large: an epsilon up there would cause DBSCAN to merge clusters that should not be merged, so I avoid those. I focus on where the distances first turn upward because that knee marks the transition from points inside dense regions to points that would be left as noise, which is exactly the boundary that eps controls.
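For a quicker check, the dbscan package's built-in kNNdistplot() produces the same sorted-distance curve, and base-graphics abline() can mark the candidate lines. A minimal sketch, using the same candidate eps values as above:
library(dbscan)
# Built-in alternative to the custom function: kNNdistplot() sorts and plots
# each point's distance to its k-th nearest neighbor
kNNdistplot(three_spheres, k = 4)
abline(h = 0.35, col = "red", lty = 2)  # candidate eps for Three Spheres
kNNdistplot(ring_moon_sphere, k = 4)
abline(h = 0.60, col = "red", lty = 2)  # candidate eps for Ring-Moon-Sphere
kNNdistplot(two_spirals_sphere, k = 4)
abline(h = 0.85, col = "red", lty = 2)  # candidate eps for Two Spirals + Sphere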
B)
Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:
- Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
- Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!
Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.
library(dbscan)
library(ggplot2)
library(dplyr)
read_xy <- function(path){
  df <- read.csv(path)
  names(df)[1:2] <- c("X1", "X2")  # standardize coordinate column names for plotting
  df
}
three_spheres <- read_xy('Data/cluster_data1.csv')
ring_moon_sphere <- read_xy('Data/cluster_data2.csv')
two_spirals_sphere <- read_xy('Data/cluster_data3.csv')
plot_dbscan_results <- function(df, eps, minPts) {
  # Fit DBSCAN; cluster 0 in the output denotes noise points
  db <- dbscan(df, eps = eps, minPts = minPts)
  df_plot <- df %>%
    mutate(cluster = factor(db$cluster))
  ggplot(df_plot, aes(x = X1, y = X2, color = cluster)) +
    geom_point(alpha = 0.8, size = 2) +
    scale_color_manual(
      values = scales::hue_pal()(length(unique(df_plot$cluster)))
    ) +
    labs(
      title = paste0("DBSCAN Clustering (eps = ", eps,
                     ", minPts = ", minPts, ")"),
      x = "X1",
      y = "X2",
      color = "Cluster"
    ) +
    theme_minimal(base_size = 14)
}
plot_dbscan_results(three_spheres, eps = 0.35, minPts = 4)
plot_dbscan_results(ring_moon_sphere, eps = 0.60, minPts = 4)
plot_dbscan_results(two_spirals_sphere, eps = 1.25, minPts = 4)
Using an epsilon value of 1.25 and minPts of 4, DBSCAN correctly identified the three true structures in the two-spirals-plus-sphere dataset: one cluster for each spiral and one cluster for the central sphere. The red points trace the outer spiral, the green points trace the inner spiral, and the blue points form one compact cluster in the center. This outcome is consistent with the k-NN distance plot from part A: the candidate value of eps = 0.85, drawn just above where the 4-NN curve turns upward, was not quite large enough to connect the spiral points smoothly. Increasing epsilon above that initial estimate allowed DBSCAN to link points along each spiral without merging the spirals together, while still keeping the central sphere separate. So the final choice of eps = 1.25 builds directly on the candidate value from part A and shows how tuning epsilon upward helped DBSCAN recover the three meaningful clusters exactly as intended.
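Settling on eps = 1.25 took some iteration, and a small sweep makes that tuning more systematic. The sketch below (the grid values are just illustrative choices around the part A candidate) reuses plot_dbscan_results() and tiles the results with patchwork::wrap_plots():
library(patchwork)
# Try a few eps values around the kNN-plot candidate and compare side by side
eps_grid <- c(0.85, 1.00, 1.25, 1.50)
sweep_plots <- lapply(eps_grid, function(e)
  plot_dbscan_results(two_spirals_sphere, eps = e, minPts = 4))
wrap_plots(sweep_plots, nrow = 2)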
C)
Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.
library(cluster)
library(ggplot2)
library(patchwork)
library(dplyr)
plot_kmeans <- function(df, k = 3) {
  km <- kmeans(df, centers = k, nstart = 20)
  df_plot <- df %>% mutate(cluster = factor(km$cluster))
  ggplot(df_plot, aes(X1, X2, color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    labs(title = "K-means (k = 3)", color = "Cluster") +
    theme_minimal(base_size = 13)
}
plot_pam <- function(df, k = 3) {
  pam_res <- pam(df, k = k)
  df_plot <- df %>% mutate(cluster = factor(pam_res$clustering))
  ggplot(df_plot, aes(X1, X2, color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    labs(title = "PAM (k = 3)", color = "Cluster") +
    theme_minimal(base_size = 13)
}
plot_dbscan_results <- function(df, eps, minPts) {
  db <- dbscan(df, eps = eps, minPts = minPts)
  df_plot <- df %>% mutate(cluster = factor(db$cluster))
  ggplot(df_plot, aes(X1, X2, color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    scale_color_manual(values = scales::hue_pal()(length(unique(df_plot$cluster)))) +
    labs(title = paste0("DBSCAN (eps = ", eps, ")"), color = "Cluster") +
    theme_minimal(base_size = 13)
}
p1 <- plot_kmeans(three_spheres)
p2 <- plot_pam(three_spheres)
p3 <- plot_dbscan_results(three_spheres, eps = 0.35, minPts = 4)
p4 <- plot_kmeans(ring_moon_sphere)
p5 <- plot_pam(ring_moon_sphere)
p6 <- plot_dbscan_results(ring_moon_sphere, eps = 0.60, minPts = 4)
p7 <- plot_kmeans(two_spirals_sphere)
p8 <- plot_pam(two_spirals_sphere)
p9 <- plot_dbscan_results(two_spirals_sphere, eps = 1.25, minPts = 4)
(p1 | p2 | p3) /
  (p4 | p5 | p6) /
  (p7 | p8 | p9)
When I compared k-means, PAM, and DBSCAN across all three datasets, the differences between the methods became clear. K-means and PAM worked fine for the simple spherical clusters, since those shapes match the assumptions these methods make. As soon as the shapes became curved or non-convex, though, both k-means and PAM struggled: they split the ring and moon into awkward chunks and mashed the spirals into mixed groups, because they force everything into roughly round, convex cluster boundaries.
DBSCAN behaved differently. It didn’t do well at first when the eps values were too small or too large, but after tuning eps, it was able to pick up the actual shapes in the data. For the ring–moon–sphere dataset, DBSCAN separated the ring, moon, and sphere in a way that the other methods couldn’t. For the spiral dataset, increasing eps to around 1.25 finally allowed DBSCAN to follow the spirals smoothly and also keep the central sphere as its own cluster. Once eps was tuned, DBSCAN was the only method that could recover the real curved structure of the data.
Overall, this comparison showed me that k-means and PAM are dependable when clusters are round and evenly shaped, but they fall apart for more complex geometries. DBSCAN, on the other hand, required more trial and error, but once the parameters were set correctly, it was the only method that revealed the true shape-based grouping in the ring and spiral datasets.
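These visual impressions can be backed with a quick numeric check: DBSCAN labels noise points as cluster 0, so tabulating the assignments shows both how many clusters each tuned fit found and how many points it left as noise. A small sketch using the settings from above:
# Cluster sizes for the tuned DBSCAN fits; the count under label 0 is noise
table(dbscan(three_spheres, eps = 0.35, minPts = 4)$cluster)
table(dbscan(ring_moon_sphere, eps = 0.60, minPts = 4)$cluster)
table(dbscan(two_spirals_sphere, eps = 1.25, minPts = 4)$cluster)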
Question 2
In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:
- life_expectancy: average life expectancy at birth
- gdp: GDP per capita, in 2015 USD
- co2: CO2 emissions, in metric tons per capita
- fert_rate: annual births per 1000 women
- health: percentage of GDP spent on health care
- imports and exports: imports and exports as a percentage of GDP
- internet and electricity: percentage of population with access to internet and electricity, respectively
- infant_mort: infant mortality rate, infant deaths per 1000 live births
- inflation: consumer price inflation, as annual percentage
- income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv')
head(wdi)
  country life_expectancy gdp co2 fert_rate health internet
1 Afghanistan 61.45400 527.8346 0.180555 5.145 15.533614 17.0485
2 Albania 77.82400 4437.6535 1.607133 1.371 7.503894 72.2377
3 Algeria 73.25700 4363.6853 3.902928 2.940 5.638317 63.4727
4 Angola 63.11600 2433.3764 0.619139 5.371 3.274885 36.6347
5 Argentina 75.87800 11393.0506 3.764393 1.601 10.450306 85.5144
6 Armenia 73.37561 4032.0904 2.334560 1.700 12.240562 76.5077
infant_mort electricity imports inflation exports income
1 55.3 97.7 36.28908 5.601888 10.42082 475.7181
2 8.1 100.0 36.97995 1.620887 22.54076 4322.5497
3 20.4 99.7 24.85456 2.415131 15.53520 2689.8725
4 42.3 47.0 27.62749 22.271539 38.31454 1100.2175
5 8.7 100.0 13.59828 42.015095 16.60541 7241.0303
6 10.2 100.0 39.72382 1.211436 29.76499 3617.0320
Focus on using kmeans for this problem.
A)
My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.
library(tidyverse)
library(cluster)
library(factoextra)
wdi <- read.csv("Data/wdi_extract_clean.csv")
wdi_num <- wdi %>% select(where(is.numeric))
wdi_scaled <- scale(wdi_num)
fviz_nbclust(wdi_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Plot for Choosing k")
fviz_nbclust(wdi_scaled, kmeans, method = "silhouette") +
  labs(title = "Silhouette Scores for k")
set.seed(123)
fviz_nbclust(wdi_scaled, kmeans, method = "gap_stat") +
  labs(title = "Gap Statistic for Choosing k")
When I looked at the elbow plot, the biggest drop in the curve happened between one and three clusters. After that, the line started to level off, which usually means that adding more clusters does not improve the fit very much. This makes three to five clusters look like a reasonable range.
The silhouette plot also pointed to a similar answer. The highest silhouette values were at three and four clusters, and the values started to decrease once the number of clusters got larger. Higher silhouette values mean better separation, so this suggests the best structure is captured around three or four clusters.
The gap statistic did peak at seven clusters, but the improvement slowed down a lot after four or five clusters. The rise from five to seven was small compared to the earlier increases. This tells me the main grouping pattern is already captured by around three to five clusters, and choosing more than that may not add much.
Putting the three plots together, they all seem to agree that the most meaningful structure in the data shows up within the range of three to five clusters.
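The numbers behind the gap-statistic plot can also be inspected directly with cluster::clusGap(), whose print method reports the k selected under different rules; the conservative "firstSEmax" rule tends to favor a smaller k than the global maximum. A minimal sketch (B = 50 bootstrap references, kept small for speed):
set.seed(123)
# Compute the gap statistic for k = 1..10; extra arguments are passed to kmeans
gap <- clusGap(wdi_scaled, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
print(gap, method = "firstSEmax")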
B)
Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.
set.seed(123)
wdi_num <- wdi %>% select(where(is.numeric))
wdi_scaled <- scale(wdi_num)
km4 <- kmeans(wdi_scaled, centers = 4, nstart = 25)
wdi_clustered <- wdi %>%
  mutate(cluster = factor(km4$cluster))
pca_res <- prcomp(wdi_scaled, scale. = FALSE)
library(factoextra)
fviz_pca_biplot(
  pca_res,
  habillage = wdi_clustered$cluster,
  addEllipses = TRUE,
  repel = TRUE,
  label = "var",
  title = "PCA Biplot of WDI Data Colored by 4 K-means Clusters"
)
Too few points to calculate an ellipse
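The "Too few points to calculate an ellipse" message hints that one cluster is tiny; tabulating the assignments is a quick way to confirm how many countries landed in each cluster:
# Cluster sizes from the 4-cluster k-means fit; a count of 1 explains the warning
table(km4$cluster)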
wdi_clustered %>%
  group_by(cluster) %>%
  slice_head(n = 5) %>%
  select(country, cluster)
# A tibble: 16 × 2
# Groups: cluster [4]
country cluster
<chr> <fct>
1 Albania 1
2 Algeria 1
3 Argentina 1
4 Armenia 1
5 Azerbaijan 1
6 Afghanistan 2
7 Angola 2
8 Benin 2
9 Burkina Faso 2
10 Burundi 2
11 Zimbabwe 3
12 Australia 4
13 Austria 4
14 Bahrain 4
15 Belgium 4
16 Brunei Darussalam 4
The four clusters represent different levels of development. Cluster 1 includes middle-income countries like Albania, Algeria, Argentina, Armenia, and Azerbaijan; these countries have average levels of income, life expectancy, and access to services. Cluster 2 includes low-income countries such as Afghanistan, Angola, Benin, Burkina Faso, and Burundi; these places have low income, low life expectancy, high infant mortality, and limited access to electricity and the internet. Cluster 3 contains a single country, Zimbabwe (which is why the ellipse warning appeared above): it does not fit cleanly with the poorest countries but is also not as stable or well-off as the middle-income group. Cluster 4 contains high-income, highly developed countries such as Australia, Austria, Bahrain, Belgium, and Brunei, with high income, long life expectancy, strong health systems, and widespread access to electricity and the internet. Overall, the clusters show a clear gradient from low-income to high-income countries.
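To characterize the clusters in their original units rather than standardized scores, the k-means centers can be back-transformed with the attributes that scale() stores; a sketch:
# Un-scale the cluster centers: multiply by the stored SDs, then add back the means
centers_orig <- t(apply(km4$centers, 1, function(r)
  r * attr(wdi_scaled, "scaled:scale") + attr(wdi_scaled, "scaled:center")))
round(centers_orig, 1)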
C)
Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?
library(dplyr)
library(ggplot2)
library(factoextra)
wdi_no_outliers <- wdi %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))
wdi_scaled_no <- wdi_no_outliers %>%
  select(-country) %>%
  scale()
set.seed(123)
km_no_outliers <- kmeans(wdi_scaled_no, centers = 4, nstart = 25)
wdi_no_outliers$cluster_new <- factor(km_no_outliers$cluster)
wdi_no_outliers %>%
  select(country, cluster_new) %>%
  group_by(cluster_new) %>%
  slice_head(n = 8)
# A tibble: 32 × 2
# Groups: cluster_new [4]
country cluster_new
<chr> <fct>
1 Bahrain 1
2 Belarus 1
3 Belgium 1
4 Brunei Darussalam 1
5 Bulgaria 1
6 Cyprus 1
7 Czechia 1
8 Djibouti 1
9 Albania 2
10 Algeria 2
# ℹ 22 more rows
library(factoextra)
library(ggplot2)
pca_no <- prcomp(wdi_scaled_no, scale. = FALSE)
fviz_pca_biplot(
  pca_no,
  habillage = wdi_no_outliers$cluster_new,
  addEllipses = TRUE,
  repel = TRUE,
  label = "var",
  title = "PCA Biplot (4 Clusters, Outliers Removed)",
  ggtheme = theme_minimal()
)
After removing Ireland, Singapore, and Luxembourg, the PCA plot looks much cleaner, and the clusters are easier to separate. These three countries were extreme outliers in income, GDP, exports, and other development indicators, so taking them out lets the remaining countries form more natural groupings.
Cluster 1 (red) now mostly represents middle-income countries with moderate GDP, average life expectancy, and higher imports/exports. They lean toward the positive side of PC1, which is strongly driven by income, GDP, electricity access, and internet usage. Examples from the new clustering summary include Bahrain, Belgium, and Czechia.
Cluster 2 (green) stays tight on the left side of PC1. These are lower-income countries with higher fertility rates and higher infant mortality; they sit opposite the arrows for the economic-development variables, which matches the fact that these countries score lower on GDP, income, and internet access. Examples include Algeria, Benin, and Burkina Faso.
Cluster 3 (blue-green squares) sits clearly on the far right of PC1. These are wealthier, well-developed economies with high life expectancy, high GDP, good electricity and internet coverage, and strong health spending. Removing the outliers makes this cluster much more coherent. This group includes countries like Australia and Austria.
Cluster 4 (purple) is now mostly countries with mixed development, leaning lower on PC1 while also having lower fertility and moderate performance on economic factors. It sits on the left of PC1 and is separated from the other left-side cluster mainly along PC2. These tend to be middle-to-lower development countries that do not fit cleanly into the income-driven clusters; examples include Afghanistan, Angola, and Burundi.
Overall, removing the three extreme outliers allowed the PCA to spread out the remaining countries in a much more meaningful way. The "shape" of each cluster is clearer, the ellipses are more balanced, and the variable arrows line up better with the actual characteristics of each group.
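One concrete way to see how the definitions shifted is to cross-tabulate the part B and part C assignments for the countries present in both fits. A sketch, assuming wdi_clustered from part B is still in scope (the rows align because both data frames derive from wdi in the same order):
# Compare part B vs. part C labels; k-means labels are arbitrary, so look for
# how rows split across columns rather than for matching label numbers
old_labels <- wdi_clustered %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg")) %>%
  pull(cluster)
table(partB = old_labels, partC = wdi_no_outliers$cluster_new)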