Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)

kNNdistplot(three_spheres, minPts = 4)
abline(h = .1)
abline(h = .21)

kNNdistplot(ring_moon_sphere, minPts = 4)
abline(h = .21)

kNNdistplot(two_spirals_sphere, minPts = 4)
abline(h = .10)
abline(h= 1)
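
For intuition about what the k-nearest-neighbor distance plots above display, the following base-R sketch computes the sorted k-NN distances by hand. It assumes k = minPts - 1 = 3, which is the convention the dbscan documentation describes for kNNdistplot when minPts is supplied. The data set `toy`, the helper `knn_dist`, and the 95th-percentile reference line are hypothetical illustrations, not part of the activity.

```r
# Hypothetical stand-in data: two Gaussian blobs (not the activity's CSV files)
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2)
)

# Distance from each point to its k-th nearest neighbour, using base R only
knn_dist <- function(x, k) {
  d <- as.matrix(dist(x))
  # After sorting a row, position 1 is the point itself (distance 0),
  # so the k-th neighbour sits at position k + 1
  apply(d, 1, function(row) sort(row)[k + 1])
}

k <- 3  # minPts - 1 for minPts = 4
kd <- sort(knn_dist(toy, k))

# The "knee" of this curve is the usual starting point for eps
plot(kd, type = "l",
     xlab = "Points sorted by distance", ylab = "3-NN distance")
abline(h = quantile(kd, 0.95), lty = 2)  # illustrative reference line, not a rule
```

Points on the flat part of the curve sit in dense regions; the sharp rise at the right marks points whose neighbors are far away, which DBSCAN would tend to treat as noise.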

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- \(df, eps, minPts) {
  # Fit DBSCAN; a cluster label of 0 marks noise points
  db <- dbscan(df, eps = eps, minPts = minPts)

  df$cluster <- db$cluster

  ggplot(data = df, aes(x = x, y = y, color = factor(cluster))) +
    geom_point() +
    theme_classic() +
    labs(color = 'Cluster') +
    ggtitle(paste0('DBSCAN Clustering Results with \nEpsilon = ', eps, ' and Minimum Points = ', minPts))
}
plot_dbscan_results(three_spheres, .295, 4)

plot_dbscan_results(ring_moon_sphere, .31, 4)

plot_dbscan_results(two_spirals_sphere, 1.2, 4)

After much trial and error, these are the values of epsilon that correctly identify the clustering structure:

  • Three Spheres: Min Points = 4, Epsilon = .295
  • Ring Moon Sphere: Min Points = 4, Epsilon = .31
  • Two Spirals Sphere: Min Points = 4, Epsilon = 1.2
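
The trial-and-error search over epsilon can also be tabulated rather than judged only by eye. The sketch below is one possible way to do that, assuming the dbscan package is installed; the toy data, the helper name `summarise_eps`, and the grid of eps values are hypothetical and chosen only for illustration.

```r
library(dbscan)

# Hypothetical toy data: two well-separated blobs (not the activity's data)
set.seed(42)
toy <- data.frame(
  x = c(rnorm(60, 0, 0.3), rnorm(60, 3, 0.3)),
  y = c(rnorm(60, 0, 0.3), rnorm(60, 3, 0.3))
)

# For each eps, record the number of clusters and the number of noise points
# (DBSCAN labels noise as cluster 0)
summarise_eps <- function(df, eps_grid, minPts = 4) {
  rows <- lapply(eps_grid, function(eps) {
    cl <- dbscan(df, eps = eps, minPts = minPts)$cluster
    data.frame(eps = eps,
               n_clusters = length(setdiff(unique(cl), 0)),
               n_noise = sum(cl == 0))
  })
  do.call(rbind, rows)
}

eps_results <- summarise_eps(toy, eps_grid = c(0.05, 0.1, 0.2, 0.4, 0.8))
print(eps_results)
```

A range of eps values that gives the expected number of clusters with few noise points is a reasonable region to explore further with the plotting function.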

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

plot_k_means <- \(df) {
  k_means <- kmeans(df,
                    centers = 3,
                    iter.max = 20,
                    nstart = 10
                    )
  df$cluster <- k_means$cluster
  ggplot(data = df, aes(x = x, y = y, color = factor(cluster))) +
    geom_point() +
    theme_classic() +
    ggtitle('K-means Clustering Results with 3 Clusters')
}


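plot_PAM <- \(df) {
  # Named pam_fit so the result does not shadow cluster::pam()
  pam_fit <- pam(df,
                 k = 3,
                 nstart = 10
                 )
  # pam() stores cluster assignments in $clustering
  df$cluster <- pam_fit$clustering
  ggplot(data = df, aes(x = x, y = y, color = factor(cluster))) +
    geom_point() +
    theme_classic() +
    ggtitle('PAM Clustering Results with 3 Clusters')
}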
plot_dbscan_results(three_spheres, .295, 4) + plot_k_means(three_spheres) + plot_PAM(three_spheres) +
  plot_dbscan_results(ring_moon_sphere, .31, 4) + plot_k_means(ring_moon_sphere) + plot_PAM(ring_moon_sphere) +
  plot_dbscan_results(two_spirals_sphere, 1.2, 4) + plot_k_means(two_spirals_sphere) + plot_PAM(two_spirals_sphere) +
  plot_layout(ncol = 3)

From this plot we can see that DBSCAN correctly identified the clustering structure in each example. K-means and PAM both correctly identified the three_spheres example, but they failed on the other two. This makes sense: both methods assign each point to its nearest centroid or medoid, which works for compact, roughly convex clusters, but a single center is a poor summary of a ring, a moon, or a spiral.
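
The failure of centroid-based methods on non-convex shapes can be reproduced on a small example. This is a minimal base-R sketch using a hypothetical toy data set (a tight blob surrounded by a ring), not the activity's data. Because k-means separates two clusters with a straight boundary, it cannot isolate the blob from the ring around it.

```r
# Hypothetical toy data: a tight blob inside a ring (not the activity's data)
set.seed(7)
theta <- runif(200, 0, 2 * pi)
ring <- cbind(x = 5 * cos(theta) + rnorm(200, sd = 0.2),
              y = 5 * sin(theta) + rnorm(200, sd = 0.2))
blob <- cbind(x = rnorm(100, sd = 0.3), y = rnorm(100, sd = 0.3))
toy <- rbind(ring, blob)
truth <- rep(c(1, 2), times = c(200, 100))  # 1 = ring, 2 = blob

km <- kmeans(toy, centers = 2, nstart = 10)

# Share of points whose k-means label matches the true shape,
# taking the better of the two possible label matchings
agreement <- max(mean(km$cluster == truth), mean(km$cluster != truth))
agreement

# Cross-tabulate true shapes against k-means labels
table(truth = truth, kmeans = km$cluster)
```

The cross-tabulation shows the ring being split across both k-means clusters, which mirrors what happens to the ring, moon, and spiral shapes in the grid above.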

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: fertility rate, in births per woman
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD
wdi <- read.csv('./Data/wdi_extract_clean.csv') 
head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

wd_scaled <- (
  wdi
  %>% select(-country)
  %>% scale()
)
fviz_nbclust(wd_scaled,
             FUNcluster = kmeans,
             method = 'wss'
             ) +
fviz_nbclust(wd_scaled,
             FUNcluster = kmeans,
             method = 'silhouette'
             )

Looking at the WSS plot, we can see an elbow around 3 clusters and another around 5. In the silhouette plot, the 4-cluster solution has the highest average silhouette width, and the 2- and 3-cluster solutions also have high average silhouette widths. Based on the silhouette plot and the location of the elbows, the claim that 3-5 clusters is optimal appears to be supported.
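
The two curves that fviz_nbclust draws can also be computed directly, which makes it easier to read off exact values. The sketch below is one way to do this, assuming the cluster package is available for silhouette(); it uses the built-in USArrests data purely as a stand-in for the WDI data, and the range of k values is an arbitrary choice.

```r
library(cluster)  # for silhouette()

x <- scale(USArrests)  # built-in data, used only as a stand-in
set.seed(123)

ks <- 2:8
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25))

# Total within-cluster sum of squares for each k (the "elbow" curve)
wss <- sapply(fits, function(fit) fit$tot.withinss)

# Average silhouette width for each k
d <- dist(x)
avg_sil <- sapply(fits, function(fit) {
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

data.frame(k = ks, wss = round(wss, 1), avg_sil = round(avg_sil, 3))
```

Using nstart greater than 1 makes both curves less sensitive to the random starting centers.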

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

kmeans_4 <- kmeans(wd_scaled,
                   centers = 4,
                   iter.max = 20,
                   nstart = 10
                   )
wd_pca <- prcomp(wd_scaled)


kmeans_biplot <- fviz_pca(wd_pca, 
         habillage = factor(kmeans_4$cluster),
         repel = TRUE) +
      ggtitle('K-means 4-cluster solution')
kmeans_biplot
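
When reading a biplot, it can help to check how much of the total variance the first two principal components capture and which variables load on each. The following base-R sketch shows one way to inspect this; it uses the built-in USArrests data as a stand-in, so the numbers it prints say nothing about the WDI data.

```r
# Built-in data, used only as a stand-in for the scaled WDI matrix
pca <- prcomp(scale(USArrests))

# Proportion of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings on the first two components: these are the directions
# drawn as arrows in a biplot
round(pca$rotation[, 1:2], 3)
```

If the first two components explain only a modest share of the variance, clusters that overlap in the biplot may still be well separated in the full variable space.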

(
  wdi
  %>% mutate(cluster = kmeans_4$cluster)
  %>% filter(country %in% c('Austria', 'Azerbaijan', 'Singapore', 'Togo'))
  %>% column_to_rownames(var = "country")
)
           life_expectancy        gdp       co2 fert_rate    health internet
Austria           81.19268 43428.6983 7.0203830     1.440 11.316411  87.5294
Azerbaijan        70.31200  5087.5002 3.5077356     1.700  5.848177  84.6000
Singapore         83.54390 59189.7039 9.4761043     1.100  5.616088  92.0043
Togo              61.12300   806.9565 0.2555331     4.386  6.189284  29.0237
           infant_mort electricity   imports  inflation   exports     income
Austria            2.9       100.0  48.28512  1.3819106  51.66309 39515.5574
Azerbaijan        15.6       100.0  36.39431  2.7598095  35.62356  3530.6605
Singapore          1.9       100.0 150.25878 -0.1721119 181.78202 40860.4807
Togo              39.1        54.1  32.28501  1.6992846  23.26353   786.0325
           cluster
Austria          3
Azerbaijan       2
Singapore        4
Togo             1

Cluster 1: Countries in cluster 1 have higher infant mortality and fertility rates, and some also have high inflation. Togo is a good representative of this cluster: it is close to the centroid and has a high fertility rate (4.386) and a high infant mortality rate (39.1 per 1000 live births), although its own inflation (about 1.7%) is modest.

Cluster 2: Cluster 2 sits in the middle of the plot, with its centroid very close to the origin. This tells us that countries in Cluster 2 tend not to have extreme values of any variable. Azerbaijan is a good representative: it is close to the cluster centroid and does not have extremely large or small values of any variable.

Cluster 3: Cluster 3 sits on the far right of the plot. These countries have high GDP, income, life expectancy, and internet access, among other variables. Austria is a good representative of this cluster: it has high GDP, life expectancy, income, and internet access, and it is close to the cluster centroid.

Cluster 4: Cluster 4 sits on the bottom right of the plot and contains only 3 countries. These countries have very high imports and exports as a percentage of GDP. Singapore is a good example: its imports (about 150% of GDP) and exports (about 182% of GDP) are both very high, and it is close to the cluster centroid.
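
Choosing a representative that is "close to the centroid" can be done numerically instead of by reading the biplot. The sketch below shows one possible approach in base R: compute each observation's distance to its own cluster center in the scaled variable space and take the closest one per cluster. It uses the built-in USArrests data as a stand-in, and the names `d_own` and `reps` are hypothetical.

```r
x <- scale(USArrests)  # built-in data, used only as a stand-in
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)

# Euclidean distance from each observation to its own cluster centre
d_own <- sqrt(rowSums((x - km$centers[km$cluster, ])^2))

# For each cluster, the observation with the smallest such distance
reps <- tapply(seq_len(nrow(x)), km$cluster, function(i) {
  rownames(x)[i[which.min(d_own[i])]]
})
reps
```

Distances here are measured in all scaled variables, so the result can differ from what looks closest in a two-dimensional PCA plot.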

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

wd_scale_c <- (
  wdi
  %>% filter(country != 'Singapore', country != 'Ireland', country != 'Luxembourg')
  %>% select(-country)
  %>% scale()
)

kmeans_4_c <- kmeans(wd_scale_c,
                     centers = 4,
                     iter.max = 20,
                     nstart = 10
                     )
wd_pca_c <- prcomp(wd_scale_c)


kmeans_biplot_c <- fviz_pca(wd_pca_c, 
         habillage = factor(kmeans_4_c$cluster),
         repel = TRUE) +
      ggtitle('K-means 4-cluster solution')

kmeans_biplot_c

Removing those 3 countries produces several changes. First, Cluster 1 and Cluster 3 switch labels: Cluster 1 is now the cluster with high GDP, income, and life expectancy, and Cluster 3 is now the cluster with high fertility and infant mortality rates. Because k-means label numbers are arbitrary, this swap by itself does not change the cluster definitions. Cluster 2 stays largely the same. The inflation arrow now points toward a portion of Cluster 2 instead of directly toward the high-fertility cluster. The biggest change is in Cluster 4, which now contains several different countries. These countries lie between the import and export arrows and the arrows pointing toward the new Cluster 1, which tells us that the countries in the new Cluster 4 are similar to those in Cluster 1 except that they also have high imports and exports.
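
Since k-means numbers its clusters arbitrarily, a cross-tabulation of the two solutions is a simple way to see which old cluster corresponds to which new one. The following is a minimal base-R sketch under stated assumptions: it uses the built-in USArrests data as a stand-in, and the three rows dropped are an arbitrary choice made only to mimic removing three observations.

```r
x_all <- scale(USArrests)  # built-in data, used only as a stand-in

# Arbitrary rows to remove, for illustration only
dropped <- c("Alaska", "Hawaii", "Nevada")
keep <- !(rownames(USArrests) %in% dropped)
x_sub <- scale(USArrests[keep, ])  # re-scale after removing rows

set.seed(1)
km_all <- kmeans(x_all, centers = 4, nstart = 25)
km_sub <- kmeans(x_sub, centers = 4, nstart = 25)

# Rows: labels from the full fit (kept observations only)
# Columns: labels from the fit on the reduced data
tab <- table(full = km_all$cluster[keep], reduced = km_sub$cluster)
tab
```

A row with one dominant cell indicates a cluster that was merely relabeled; a row spread across several columns indicates a cluster whose membership actually changed.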