Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon\).

kNNdistplot(three_spheres, minPts = 4)
abline(h = .21)

kNNdistplot(ring_moon_sphere, minPts = 4)
abline(h=.21)

kNNdistplot(two_spirals_sphere, minPts = 4)
abline(h=1.03)
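
As a rough illustration of what a k-nearest-neighbor distance plot shows, the sketch below rebuilds the curve by hand in base R. It runs on simulated stand-in data, not the assignment's CSV files. The names toy and knn_dist are hypothetical, and the choice k = minPts - 1 reflects my understanding that the point itself counts toward minPts. The dashed line is only a crude placeholder for reading the knee by eye.

```r
# Simulated stand-in data (NOT the assignment's data): two tight blobs.
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)

minPts <- 4
k <- minPts - 1  # assumption: the point itself counts toward minPts

# Full pairwise Euclidean distance matrix.
d <- as.matrix(dist(toy))

# For each point, sort its distances; position 1 is the zero distance
# to itself, so position k + 1 is the distance to the k-th neighbor.
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted curve: a sharp bend ("knee") suggests a candidate eps.
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))

# Placeholder guide line for the toy data only; pick the knee by eye.
abline(h = quantile(knn_dist, 0.95), lty = 2)
```

Points on the flat part of the curve sit in dense regions, while points after the bend are relatively isolated, which is why the height of the bend is a natural starting value for eps.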

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- function(df, eps, minPts) {
  # Relies on dbscan and ggplot2 (via tidyverse), both loaded at the top
  db <- dbscan(df, eps = eps, minPts = minPts)

  # Store cluster labels as a factor; label 0 marks noise points
  df$cluster <- factor(db$cluster)

  ggplot(df, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 1.8) +
    theme_minimal() +
    ggtitle(paste0("DBSCAN Results (eps = ", eps, ", minPts = ", minPts, ")"))
}

plot_dbscan_results(three_spheres, eps = 0.295, minPts = 4)

plot_dbscan_results(ring_moon_sphere, eps = 0.32, minPts = 4)

plot_dbscan_results(two_spirals_sphere, eps = 1.2, minPts = 4)
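
When revising eps, it can help to check the number of clusters and noise points numerically as well as visually. The sketch below is a minimal example on simulated stand-in data (two blobs plus one isolated point), not the assignment's data sets. The names toy, n_clusters, and n_noise are hypothetical. It relies on dbscan() labelling noise points as cluster 0.

```r
library(dbscan)

# Simulated stand-in data: two tight blobs and one isolated point.
set.seed(1)
toy <- data.frame(
  x = c(rnorm(50, 0, 0.1), rnorm(50, 5, 0.1), 2.5),
  y = c(rnorm(50, 0, 0.1), rnorm(50, 5, 0.1), 2.5)
)

db <- dbscan(toy, eps = 0.6, minPts = 4)

# Cluster label 0 is reserved for noise, so exclude it from the count.
n_clusters <- length(setdiff(unique(db$cluster), 0))
n_noise <- sum(db$cluster == 0)

# Sizes of each label, including the noise label 0.
table(db$cluster)
```

If n_clusters is larger than expected, eps is probably too small (shapes are being fragmented). If it is smaller than expected, eps is probably too large (shapes are being merged).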

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

# Redefine the helper from B) with a shorter title so it fits the 3x3 grid
plot_dbscan_results <- function(df, eps, minPts) {
  db <- dbscan(df, eps = eps, minPts = minPts)
  df$cluster <- factor(db$cluster)

  ggplot(df, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 1.8) +
    theme_minimal() +
    ggtitle(paste0("DBSCAN (eps = ", eps, ", minPts = ", minPts, ")"))
}

kmeans3sphere <- kmeans(three_spheres, centers = 3, nstart = 10)
pam3sphere    <- pam(three_spheres, k = 3, nstart = 10)

p1 <- ggplot(three_spheres, aes(x = x, y = y,
                                color = factor(kmeans3sphere$cluster))) +
  geom_point() + guides(color='none') + theme_classic() +
  ggtitle("K-means (three spheres)")

p2 <- ggplot(three_spheres, aes(x = x, y = y,
                                color = factor(pam3sphere$cluster))) +
  geom_point() + guides(color='none') + theme_classic() +
  ggtitle("PAM (three spheres)")

p3 <- plot_dbscan_results(three_spheres, eps = 0.295, minPts = 4)

kmeans3ring <- kmeans(ring_moon_sphere, centers = 3, nstart = 10)
pam3ring    <- pam(ring_moon_sphere, k = 3, nstart = 10)

p4 <- ggplot(ring_moon_sphere, aes(x = x, y = y,
                                   color = factor(kmeans3ring$cluster))) +
  geom_point() + guides(color='none') + theme_classic() +
  ggtitle("K-means (ring/moon/sphere)")

p5 <- ggplot(ring_moon_sphere, aes(x = x, y = y,
                                   color = factor(pam3ring$cluster))) +
  geom_point() + guides(color='none') + theme_classic() +
  ggtitle("PAM (ring/moon/sphere)")

p6 <- plot_dbscan_results(ring_moon_sphere, eps = 0.32, minPts = 4)

kmeans3spiral <- kmeans(two_spirals_sphere, centers = 3, nstart = 10)
pam3spiral    <- pam(two_spirals_sphere, k = 3, nstart = 10)

p7 <- ggplot(two_spirals_sphere, aes(x = x, y = y,
                                     color = factor(kmeans3spiral$cluster))) +
  geom_point() + guides(color='none') + theme_classic() +
  ggtitle("K-means (spirals/sphere)")

p8 <- ggplot(two_spirals_sphere, aes(x = x, y = y,
                                     color = factor(pam3spiral$cluster))) +
  geom_point() + guides(color='none') + theme_classic() +
  ggtitle("PAM (spirals/sphere)")

p9 <- plot_dbscan_results(two_spirals_sphere, eps = 1.2, minPts = 4)


(p1 | p2 | p3) /
(p4 | p5 | p6) /
(p7 | p8 | p9)

For the three spheres data (first row), all three methods produce essentially the same solution, with each group of points forming its own cluster, so any of the three methods is effective here.

For the ring/moon/sphere data (second row), the three solutions differ. K-means groups the points by area, with roughly each third of the plot forming a cluster. PAM appears to split the plot into triangular regions, each one a cluster. DBSCAN recovers the pattern we would expect: the sphere, the moon, and the ring are each their own cluster, so DBSCAN is the most effective method for this data set.

For the two spirals/sphere data (third row), the three solutions differ again. K-means splits the plot into thirds, similar to its ring/moon/sphere solution, and PAM again splits it into triangular regions, similar to its ring/moon/sphere solution. DBSCAN once more divides the points the way we would expect, with each of the two spirals and the sphere forming its own cluster.

Overall, DBSCAN is the only one of the three methods that clusters the points by their individual shapes in all three data sets.
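
The area-based splitting described above is what one would expect from k-means, since each point is assigned to its nearest center and the boundaries between clusters are straight lines. The sketch below illustrates this on simulated stand-in data (a ring around a central blob), not the assignment's data. The names toy, truth, and shape_vs_kmeans are hypothetical.

```r
# Simulated stand-in data: a ring of radius 3 around a central blob.
set.seed(1)
n_ring <- 200
n_blob <- 100
angle <- runif(n_ring, 0, 2 * pi)
radius <- rnorm(n_ring, mean = 3, sd = 0.1)

toy <- data.frame(
  x = c(radius * cos(angle), rnorm(n_blob, 0, 0.2)),
  y = c(radius * sin(angle), rnorm(n_blob, 0, 0.2)),
  truth = rep(c("ring", "blob"), times = c(n_ring, n_blob))
)

# Ask k-means for two clusters, matching the two true shapes.
km <- kmeans(toy[, c("x", "y")], centers = 2, nstart = 10)

# Rows are the true shapes, columns are the k-means labels.
shape_vs_kmeans <- table(truth = toy$truth, kmeans = km$cluster)
shape_vs_kmeans
```

Because a straight boundary cannot enclose the blob without also cutting the ring, the ring row of the table has points in both k-means clusters. A density-based method does not have this restriction.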

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: total fertility rate, in births per woman
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD

wdi <- read.csv('Data/wdi_extract_clean.csv')
head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

wdi_numerics <- wdi %>% select(-country)
wdi_scaled <- scale(wdi_numerics)

# Elbow (WSS) and average-silhouette plots, placed side by side with patchwork
p_wss <- fviz_nbclust(wdi_scaled, FUNcluster = kmeans, method = 'wss') +
  labs(title = 'Plot of WSS vs k using kmeans')
p_sil <- fviz_nbclust(wdi_scaled, FUNcluster = kmeans, method = 'silhouette') +
  labs(title = 'Plot of avg silhouette vs k using kmeans')
p_wss + p_sil

I support this claim. Both plots, which are used to choose the optimal number of clusters k, suggest that 4 is the best value for clustering, and 4 falls within the claimed range of 3-5.
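
The two criteria behind these plots can also be tabulated directly, which makes it easier to compare values of k side by side. The sketch below is a minimal version that uses the built-in USArrests data as a stand-in, not the WDI extract. The names ks, wss, and sil, and the choice nstart = 25, are assumptions made for illustration.

```r
library(cluster)  # silhouette()

# Built-in stand-in data (NOT the WDI extract), standardized as above.
x <- scale(USArrests)
d <- dist(x)

set.seed(1)
ks <- 2:8
wss <- numeric(length(ks))
sil <- numeric(length(ks))

for (i in seq_along(ks)) {
  km <- kmeans(x, centers = ks[i], nstart = 25)
  # Total within-cluster sum of squares (the elbow criterion).
  wss[i] <- km$tot.withinss
  # Average silhouette width over all observations.
  sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

data.frame(k = ks, wss = round(wss, 1), avg_silhouette = round(sil, 3))
```

A good k is one where the WSS stops dropping sharply and the average silhouette is at or near its maximum. When the two criteria disagree, a range of k values (such as 3-5) is a reasonable conclusion.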

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

kmeans4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)
wdi_pca <- prcomp(wdi_numerics, scale. = TRUE)
PCS_with_labels <- bind_cols(wdi_pca$x[,1:2], Country = wdi$country)
fviz_pca(wdi_pca,
         habillage = factor(kmeans4$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('K-means 4-cluster solution') +
  guides(color = 'none', shape = 'none') +
  geom_text_repel(aes(x = PC1, y = PC2, label = Country),
                  data = PCS_with_labels,
                  max.overlaps = 7,
                  size = 3)
Warning: ggrepel: 118 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Cluster 1 (the red points) is represented by countries such as Mali and Chad, which are high in fertility rate and infant mortality but low in life expectancy, internet access, and electricity access. In contrast, cluster 2 (the blue points) includes countries such as the United States, Australia, and Norway, which are high in life expectancy, internet, and electricity but low in fertility rate and infant mortality. Cluster 3 (the purple points) sits between these two clusters: its countries are neither high nor low on those variables, but they are low in exports and imports. Finally, cluster 4 (the green points) is high in exports and imports and, on the other variables, resembles cluster 2 (the blue points) more closely than cluster 1 (the red points).
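
A characterization read off a biplot can be cross-checked against the k-means centers, and representative members can be picked as the observations closest to each center. The sketch below is one way to do this, shown on the built-in USArrests data as a stand-in for the WDI extract. The names dist_to_center and representatives are hypothetical, and "closest three to the center" is just one possible definition of representative.

```r
# Built-in stand-in data (NOT the WDI extract), standardized.
x <- scale(USArrests)

set.seed(1)
km <- kmeans(x, centers = 4, nstart = 25)

# Cluster centers in z-score units: positive means above average,
# negative means below average on that variable.
round(km$centers, 2)

# Euclidean distance from each observation to its own cluster center.
dist_to_center <- sqrt(rowSums((x - km$centers[km$cluster, ])^2))

# The three observations nearest each center serve as representatives.
representatives <- lapply(
  split(dist_to_center, km$cluster),
  function(d) head(names(sort(d)), 3)
)
representatives
```

Because cluster numbers depend on the random starts, the labels can change between runs unless a seed is set, so descriptions tied to a cluster number or color should be checked against the current output.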

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

wdi_filtered <- wdi %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))
wdi_filtered_numerics <- wdi_filtered %>% select(-country)
wdi_scaled_filtered <- scale(wdi_filtered_numerics) 
kmeans4filtered <- kmeans(wdi_scaled_filtered, centers = 4, nstart = 10)
# Scale once inside prcomp, matching the PCA call in B)
wdi_pca_filtered <- prcomp(wdi_filtered_numerics, scale. = TRUE)
fviz_pca(wdi_pca_filtered,
         habillage = factor(kmeans4filtered$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('K-means 4-cluster solution (Ireland, Singapore, Luxembourg removed)')

With these three countries removed, the cluster definitions change: exports and imports now separate from the other variables (health, co2, internet, and so on). This solution divides the countries among the 4 clusters more accurately and definitively.
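
One way to quantify how much a solution changes after removing observations is to cross-tabulate the old and new labels for the observations that remain. The sketch below shows the idea on the built-in USArrests data as a stand-in for the WDI extract. The three dropped states are an arbitrary choice for illustration, and the names keep, km_all, km_sub, and agreement are hypothetical.

```r
# Built-in stand-in data (NOT the WDI extract).
x_all <- scale(USArrests)

# Arbitrary rows to drop, purely for illustration.
drop <- c("Alaska", "Florida", "North Carolina")
keep <- !rownames(USArrests) %in% drop

# Re-standardize after removal, since means and SDs change.
x_sub <- scale(USArrests[keep, ])

set.seed(1)
km_all <- kmeans(x_all, centers = 4, nstart = 25)
km_sub <- kmeans(x_sub, centers = 4, nstart = 25)

# Rows: labels from the full fit; columns: labels after removal.
agreement <- table(before = km_all$cluster[keep], after = km_sub$cluster)
agreement
```

Cluster numbers are arbitrary, so the table should be read by rows: a row concentrated in a single column means that cluster survived intact, while a row spread over several columns means it was redefined.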