Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)

# Function to create a kNN distance plot with horizontal line(s) at candidate eps values
plot_knn <- function(data, k, eps_candidates, title) {
  kNNdistplot(data, k = k)                          # sorted k-nearest-neighbor distances (base graphics)
  abline(h = eps_candidates, col = "red", lty = 2)  # mark the candidate epsilon value(s)
  title(main = title)
}
par(mfrow = c(1,3))  # 3 plots side by side

# Three spheres
plot_knn(three_spheres, k = 4, eps_candidates = 0.25, 
         title = "Three Spheres")

# Ring-moon-sphere
plot_knn(ring_moon_sphere, k = 4, eps_candidates = 0.3, 
         title = "Ring-Moon-Sphere")

# Two spirals
plot_knn(two_spirals_sphere, k = 4, eps_candidates = 0.35, 
         title = "Two Spirals")

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- function(df, eps, minPts) {
  # Run DBSCAN
  db <- dbscan(df, eps = eps, minPts = minPts)
  
  # Add cluster labels to data frame
  df_clustered <- df %>%
    mutate(cluster = factor(db$cluster))
  
  # Create scatterplot
  p <- ggplot(df_clustered, aes(x = .data[[names(df)[1]]], y = .data[[names(df)[2]]], color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    labs(
      title = paste("DBSCAN Results (eps =", eps, ", minPts =", minPts, ")"),
      color = "Cluster"
    ) +
    theme_minimal()
  
  return(p)
}
# Candidate eps values from elbow plots
eps_three_spheres <- 0.25
eps_ring_moon <- 0.30
eps_two_spirals <- 0.35

# Run DBSCAN and plot
p1 <- plot_dbscan_results(three_spheres, eps = eps_three_spheres, minPts = 3)
p2 <- plot_dbscan_results(ring_moon_sphere, eps = eps_ring_moon, minPts = 3)
p3 <- plot_dbscan_results(two_spirals_sphere, eps = eps_two_spirals, minPts = 3)

# Combine plots side by side
library(patchwork)
p1 + p2 + p3
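
As a quick sanity check that each choice really yields three shape clusters (DBSCAN labels noise points as cluster 0), the assignments can be tabulated directly. A minimal sketch, reusing the eps values and minPts = 3 from the calls above:

# Cluster sizes for each DBSCAN fit; label 0 counts points flagged as noise
list(
  three_spheres      = dbscan(three_spheres,      eps = eps_three_spheres, minPts = 3)$cluster,
  ring_moon_sphere   = dbscan(ring_moon_sphere,   eps = eps_ring_moon,     minPts = 3)$cluster,
  two_spirals_sphere = dbscan(two_spirals_sphere, eps = eps_two_spirals,   minPts = 3)$cluster
) %>%
  lapply(table)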

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

library(cluster)
library(factoextra)
library(patchwork)

# K-means plot
plot_kmeans_results <- function(df, k) {
  km <- kmeans(df, centers = k, nstart = 25)
  df_clustered <- df %>% mutate(cluster = factor(km$cluster))
  
  ggplot(df_clustered, aes(x = .data[[names(df)[1]]], y = .data[[names(df)[2]]], color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    labs(title = paste("K-means (k =", k, ")"), color = "Cluster") +
    theme_minimal()
}

# PAM plot
plot_pam_results <- function(df, k) {
  pam_fit <- pam(df, k = k)
  df_clustered <- df %>% mutate(cluster = factor(pam_fit$clustering))
  
  ggplot(df_clustered, aes(x = .data[[names(df)[1]]], y = .data[[names(df)[2]]], color = cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    labs(title = paste("PAM (k =", k, ")"), color = "Cluster") +
    theme_minimal()
}
# Three spheres
p1 <- plot_kmeans_results(three_spheres, 3)
p2 <- plot_pam_results(three_spheres, 3)
p3 <- plot_dbscan_results(three_spheres, eps = 0.25, minPts = 4)

# Ring-moon-sphere
p4 <- plot_kmeans_results(ring_moon_sphere, 3)
p5 <- plot_pam_results(ring_moon_sphere, 3)
p6 <- plot_dbscan_results(ring_moon_sphere, eps = 0.32, minPts = 4)

# Two spirals
p7 <- plot_kmeans_results(two_spirals_sphere, 3)
p8 <- plot_pam_results(two_spirals_sphere, 3)
p9 <- plot_dbscan_results(two_spirals_sphere, eps = 0.38, minPts = 4)

# Combine into 3x3 grid
(p1 | p2 | p3) /
(p4 | p5 | p6) /
(p7 | p8 | p9)

The three methods behave quite differently. k-means essentially carves each plot into three compact regions, and PAM does much the same, splitting the points into three stripes around its medoids, so neither recovers the ring, moon, or spiral shapes. DBSCAN is more sensitive to its tuning (small changes in eps can merge shapes or flag extra noise points), but once eps is chosen well it identifies the three distinct, separate shapes in each data set.

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: annual births per 1000 women
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv') 
head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

# Step 1: Keep only numeric columns (exclude country names, etc.)
wdi_num <- wdi %>% select(where(is.numeric))

# Step 2: Scale numeric data for k-means
wdi_scaled <- scale(wdi_num)

# Step 3: Elbow method (Within-cluster sum of squares)
p1 <- fviz_nbclust(wdi_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method for WDI Data")

# Step 4: Silhouette method
p2 <- fviz_nbclust(wdi_scaled, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for WDI Data")

# Step 5: Gap statistic
set.seed(123)  # reproducibility
p3 <- fviz_nbclust(wdi_scaled, kmeans, nstart = 25, method = "gap_stat") +
  labs(title = "Gap Statistic for WDI Data")
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
# Step 6: Combine plots side by side
library(patchwork)
p1 + p2 + p3

Based on the elbow, silhouette, and gap-statistic plots above, 3-5 clusters looks close to, if not exactly, optimal for this data set, so the visualizations support the claim.
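
To put numbers behind the plots, the average silhouette width for each candidate k can also be computed directly. A minimal sketch reusing wdi_scaled from above; the k range 2-6 is an illustrative choice:

# Average silhouette width for k = 2..6 on the scaled WDI data
set.seed(123)
d <- dist(wdi_scaled)
sil_by_k <- sapply(2:6, function(k) {
  km <- kmeans(wdi_scaled, centers = k, nstart = 25)
  mean(cluster::silhouette(km$cluster, d)[, "sil_width"])
})
data.frame(k = 2:6, avg_sil_width = round(sil_by_k, 3))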

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

# 1) Separate identifiers from numeric indicators
stopifnot("country" %in% names(wdi))  # ensure we have a country column
countries <- wdi$country
wdi_num <- wdi %>% select(where(is.numeric))

# 2) Scale numeric indicators (k-means requires comparable scales)
wdi_scaled <- scale(wdi_num)
set.seed(123)
km4 <- kmeans(wdi_scaled, centers = 4, nstart = 50)

# Attach cluster labels
wdi_clusters <- wdi %>%
  mutate(cluster = factor(km4$cluster))

# Optional: cluster sizes and withinss
cluster_sizes <- table(km4$cluster)
cluster_withinss <- km4$withinss
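# Optional: print the cluster sizes and within-cluster SS computed above as a quick balance check
cluster_sizes
cluster_withinss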
# PCA on the scaled indicators
pca <- prcomp(wdi_scaled, center = FALSE, scale. = FALSE)

# Scores (PC space) with cluster labels and country names
scores_df <- as_tibble(pca$x[, 1:2], .name_repair = "minimal") %>%
  setNames(c("PC1", "PC2")) %>%
  mutate(country = countries,
         cluster = factor(km4$cluster))

# Loadings (how indicators map onto PCs)
loadings_df <- as_tibble(pca$rotation[, 1:2], rownames = "indicator") %>%
  setNames(c("indicator", "PC1_loading", "PC2_loading"))
# 1) PCA scatter colored by cluster, with subdued country labels
p_scores <- ggplot(scores_df, aes(PC1, PC2, color = cluster)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_text_repel(aes(label = country),
                  size = 3, max.overlaps = 30, alpha = 0.6, show.legend = FALSE) +
  labs(title = "PCA scores (PC1–PC2) colored by k-means clusters (k = 4)",
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_minimal()

# 2) PCA biplot: countries (scores) and indicators (loadings)
# (factoextra biplot supports habillage for coloring by clusters)
p_biplot <- fviz_pca_biplot(pca,
                            label = "var",            # show variable loadings labels
                            habillage = km4$cluster, # color points by cluster
                            addEllipses = TRUE,       # cluster ellipses
                            repel = TRUE) +
  ggtitle("PCA biplot with k-means clusters (k = 4)")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
p_scores + p_biplot
Too few points to calculate an ellipse
Warning: ggrepel: 98 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
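
Since the question asks for a characterization via dimension reduction, it also helps to report how much variance the first two PCs capture and which indicators drive them. A minimal sketch using the pca and loadings_df objects defined above:

# Proportion of variance explained by the leading principal components
summary(pca)$importance[, 1:4]

# Indicators ordered by the size of their PC1 loading (PC2 shown alongside)
loadings_df %>%
  arrange(desc(abs(PC1_loading))) %>%
  print(n = Inf)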

# Cluster means on standardized (z) scale: highlights relative highs/lows
prof_z <- as_tibble(wdi_scaled) %>%
  mutate(cluster = km4$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean), .groups = "drop") %>%
  mutate(cluster = factor(cluster))

# Cluster means on original scale: raw economic interpretation
prof_raw <- wdi_num %>%
  mutate(cluster = km4$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~mean(.x, na.rm = TRUE)), .groups = "drop") %>%
  mutate(cluster = factor(cluster))

# Heatmap of standardized cluster profiles (z-means)
prof_z_long <- prof_z %>%
  pivot_longer(-cluster, names_to = "indicator", values_to = "z_mean")

p_heatmap <- ggplot(prof_z_long,
                    aes(x = indicator, y = cluster, fill = z_mean)) +
  geom_tile() +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B", midpoint = 0) +
  coord_flip() +
  labs(title = "Cluster profiles (standardized means per indicator)",
       x = "Indicator", y = "Cluster", fill = "z-mean") +
  theme_minimal()

p_heatmap
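
The heatmap shows relative (standardized) highs and lows; the raw-scale table prof_raw computed above gives the same profiles in original units, which makes the clusters easier to name. A small sketch; the subset of indicators shown is an arbitrary illustrative choice:

# Raw-scale cluster means for a few headline indicators (original units)
prof_raw %>%
  select(cluster, life_expectancy, gdp, income, infant_mort, fert_rate, internet)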

# Compute Euclidean distance to the assigned cluster center in scaled space
centers <- km4$centers
assigned <- km4$cluster
dist_to_center <- sqrt(rowSums((wdi_scaled - centers[assigned, ])^2))

representatives <- tibble(
  country = countries,
  cluster = factor(assigned),
  dist_to_center = dist_to_center
) %>%
  group_by(cluster) %>%
  arrange(dist_to_center, .by_group = TRUE) %>%
  slice_head(n = 5) %>%   # top 5 per cluster closest to the center
  ungroup()

representatives
# A tibble: 16 × 3
   country     cluster dist_to_center
   <chr>       <fct>            <dbl>
 1 Mexico      1                0.544
 2 Tunisia     1                0.566
 3 Mauritius   1                0.729
 4 Paraguay    1                0.750
 5 Romania     1                0.819
 6 Togo        2                0.570
 7 Zambia      2                1.01 
 8 Gambia, The 2                1.17 
 9 Congo, Rep. 2                1.19 
10 Madagascar  2                1.22 
11 Zimbabwe    3                0    
12 Austria     4                1.17 
13 Finland     4                1.41 
14 Netherlands 4                1.43 
15 Belgium     4                1.48 
16 Korea, Rep. 4                1.67 

Reading the representatives table together with the PCA plot: cluster 1 is typified by middle-income countries such as Mexico, Tunisia, and Paraguay (Brazil also falls in this group in the plot); cluster 2 by lower-income countries such as Togo, Zambia, and Madagascar (Guinea appears here); cluster 3 contains only Zimbabwe; and cluster 4 by high-income countries such as Austria, Finland, and the Netherlands, with Luxembourg sitting far from every cluster center in the plot.
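
Beyond the five closest countries per cluster, the full membership lists make the characterization more thorough. A small sketch using wdi_clusters from above:

# Complete list of countries assigned to each of the 4 clusters
wdi_clusters %>%
  select(country, cluster) %>%
  arrange(cluster, country) %>%
  split(.$cluster)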

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

# Step 1: Remove the three countries
wdi_reduced <- wdi %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))

# Step 2: Scale numeric indicators
wdi_num_reduced <- wdi_reduced %>% select(where(is.numeric))
wdi_scaled_reduced <- scale(wdi_num_reduced)

# Step 3: Run k-means with 4 clusters
set.seed(123)
km4_reduced <- kmeans(wdi_scaled_reduced, centers = 4, nstart = 50)

# Step 4: Attach cluster labels
wdi_clusters_reduced <- wdi_reduced %>%
  mutate(cluster = factor(km4_reduced$cluster))

# Step 5: Visualize with PCA
pca_reduced <- prcomp(wdi_scaled_reduced)

fviz_pca_biplot(pca_reduced,
                habillage = wdi_clusters_reduced$cluster,
                addEllipses = TRUE,
                repel = TRUE) +
  ggtitle("PCA biplot with k-means clusters (k = 4, Ireland/Singapore/Luxembourg removed)")

With those three countries removed, the clusters are much better defined. Ireland, Singapore, and Luxembourg are extreme on GDP per capita and on trade as a share of GDP, so dropping them lets the remaining countries spread out in the PCA plot, making it easier to see how they relate to one another and which cluster each belongs to.
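
To make "how the cluster definitions change" concrete, the standardized cluster profiles can be recomputed for the reduced fit and set beside prof_z from B). A minimal sketch reusing the objects defined above; prof_z_reduced is a new name introduced here:

# Standardized cluster profiles after removing Ireland, Singapore, and Luxembourg
prof_z_reduced <- as_tibble(wdi_scaled_reduced) %>%
  mutate(cluster = km4_reduced$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean), .groups = "drop") %>%
  mutate(cluster = factor(cluster))

prof_z_reduced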