Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

getwd()
[1] "C:/Users/wi7536ul/OneDrive - Minnesota State/DSCI 415/notes"
setwd("C:/Users/wi7536ul/OneDrive - Minnesota State/DSCI 415/notes")
three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)

kNNdistplot(three_spheres[,1:2], minPts = 4)
abline(h = 0.15)

kNNdistplot(ring_moon_sphere[,1:2], minPts = 4)
abline(h = 0.20)

kNNdistplot(two_spirals_sphere[,1:2], minPts = 4)
abline(h = 1.0)

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!
plot_dbscan_results <- \(dataset, eps, minPts) {

  db_out <- dbscan(dataset[,1:2], eps = eps, minPts = minPts)
  dataset$cluster <- factor(db_out$cluster)

  the_plot <- ggplot(data = dataset) +
    geom_point(aes(x = x, y = y, color = cluster)) +
    guides(color = 'none') +
    ggtitle(paste("DBSCAN: eps =", eps, ", minPts =", minPts)) +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5, size = 12))

  the_plot
}

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results(three_spheres, 0.15, 4) + 
plot_dbscan_results(ring_moon_sphere, 0.20, 4) +
plot_dbscan_results(two_spirals_sphere, 1.0, 4)

plot_dbscan_results(three_spheres, 0.20, 4)

plot_dbscan_results(ring_moon_sphere, 0.27, 4)

plot_dbscan_results(two_spirals_sphere, 1.1, 4)

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

#three_spheres

#k-means
numeric_only_three <- three_spheres %>%
  select(x,y)
kmeans_clusters <- kmeans(numeric_only_three,
                         centers = 3,
                         iter.max = 20,
                         nstart = 10
                         )
three_spheres$kmeans_clusters <- factor(kmeans_clusters$cluster)

km_three_s <- ggplot(data = three_spheres, aes(x = x, y = y, color=kmeans_clusters)) +
  geom_point() +
  guides(color='none') +
  theme_classic(base_size = 16)

#PAM
pam_clusters <- pam(numeric_only_three,
                        k = 3,
                        nstart = 10
                    )
three_spheres$pam_clusters <- factor(pam_clusters$cluster)

pam_three_s <- ggplot(data = three_spheres, aes(x = x, y = y, color=pam_clusters)) +
  geom_point() + 
  guides(color='none') + 
  theme_classic(base_size = 16)


#dbscan
dbscan_three_s <- plot_dbscan_results(three_spheres, 0.20, 4)
#ring_moon_sphere

#k-means
numeric_only_rings <- ring_moon_sphere %>%
  select(x,y)
kmeans_clusters <- kmeans(numeric_only_rings,
                         centers = 3,
                         iter.max = 20,
                         nstart = 10
                         )
ring_moon_sphere$kmeans_clusters <- factor(kmeans_clusters$cluster)

km_rm <- ggplot(data = ring_moon_sphere, aes(x = x, y = y, color=kmeans_clusters)) +
  geom_point() +
  guides(color='none') +
  theme_classic(base_size = 16)
 
#PAM 
pam_clusters <- pam(numeric_only_rings, k = 3, nstart = 10 )
ring_moon_sphere$pam_clusters <- factor(pam_clusters$cluster)

pam_rm <- ggplot(data = ring_moon_sphere, aes(x = x, y = y, color=pam_clusters)) +
geom_point() + guides(color='none') + theme_classic(base_size = 16)


#dbscan
dbscan_rm <- plot_dbscan_results(ring_moon_sphere, 0.27, 4)
#two_spirals_sphere

#k-means
numeric_only_two <- two_spirals_sphere %>%
  select(x,y)
kmeans_clusters <- kmeans(numeric_only_two,
                         centers = 3,
                         iter.max = 20,
                         nstart = 10
                         )
two_spirals_sphere$kmeans_clusters <- factor(kmeans_clusters$cluster)

km_two_s <- ggplot(data = two_spirals_sphere, aes(x = x, y = y, color=kmeans_clusters)) +
  geom_point() +
  guides(color='none') +
  theme_classic(base_size = 16)
 
#PAM 
pam_clusters <- pam(numeric_only_two, k = 3, nstart = 10 )
two_spirals_sphere$pam_clusters <- factor(pam_clusters$cluster)

pam_two_s <- ggplot(data = two_spirals_sphere, aes(x = x, y = y, color=pam_clusters)) +
geom_point() + guides(color='none') + theme_classic(base_size = 16)


#dbscan
dbscan_two_s <- plot_dbscan_results(two_spirals_sphere, 1.1, 4)
km_three_s + km_rm + km_two_s +
pam_three_s    + pam_rm    + km_two_s    +
dbscan_three_s     + dbscan_rm     + dbscan_two_s     +
  plot_layout(nrow = 3, ncol = 3)

Three spheres - all of the methods I used seem to work well on this dataset

Ring moon sphere- of the three methods, DBSCAN was the only one that correctly captured the ring, arc, and blob part of this dataset correctly

Two spirals sphere - neither k-means nor PAM did a job here … DBSCAN kept the proper structure, though

Overall findings - k-means and PAM seem to work best for more simple clusters, while DBSCAN did a decent job of working with the more complex shapes

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: annual births per 1000 women
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv') 
head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

ggplot(wdi, aes(x = income, y = life_expectancy)) +
  geom_point() +
  theme_classic(base_size = 16)

numeric_only_wdi <- wdi %>%
  select(income, life_expectancy)

kmeans_clusters <- kmeans(
  numeric_only_wdi,
  centers = 3,
  iter.max = 20,
  nstart = 10
)

wdi$kmeans_clusters <- factor(kmeans_clusters$cluster)

ggplot(wdi, aes(x = income, y = life_expectancy, color = kmeans_clusters)) +
  geom_point() +
  theme_classic(base_size = 16)

numeric_wdi <- wdi %>% 
  select(where(is.numeric))

wdi_scaled <- scale(numeric_wdi)


cov(wdi_scaled)
                life_expectancy        gdp         co2  fert_rate      health
life_expectancy       1.0000000  0.6973222  0.54427116 -0.8309322  0.41805627
gdp                   0.6973222  1.0000000  0.56750894 -0.4786177  0.42277299
co2                   0.5442712  0.5675089  1.00000000 -0.4381669  0.09900871
fert_rate            -0.8309322 -0.4786177 -0.43816686  1.0000000 -0.38741451
health                0.4180563  0.4227730  0.09900871 -0.3874145  1.00000000
internet              0.8474164  0.6231724  0.62175088 -0.8385215  0.35157547
infant_mort          -0.9052890 -0.5256942 -0.49356034  0.8620178 -0.37287726
electricity           0.7617501  0.3758888  0.40045385 -0.8390548  0.27238951
imports               0.2369366  0.3842341  0.20901177 -0.2718620  0.02029878
inflation            -0.1690938 -0.1035817 -0.09806097  0.1385672 -0.15997566
exports               0.3738612  0.5399184  0.34085433 -0.3461416 -0.02207017
income                0.7268468  0.9766515  0.54856805 -0.5054745  0.49580116
                  internet infant_mort electricity     imports   inflation
life_expectancy  0.8474164  -0.9052890   0.7617501  0.23693656 -0.16909376
gdp              0.6231724  -0.5256942   0.3758888  0.38423409 -0.10358167
co2              0.6217509  -0.4935603   0.4004538  0.20901177 -0.09806097
fert_rate       -0.8385215   0.8620178  -0.8390548 -0.27186199  0.13856718
health           0.3515755  -0.3728773   0.2723895  0.02029878 -0.15997566
internet         1.0000000  -0.8310365   0.8026254  0.29209228 -0.16469141
infant_mort     -0.8310365   1.0000000  -0.8084987 -0.20453761  0.17575100
electricity      0.8026254  -0.8084987   1.0000000  0.16645057 -0.15857829
imports          0.2920923  -0.2045376   0.1664506  1.00000000 -0.12115127
inflation       -0.1646914   0.1757510  -0.1585783 -0.12115127  1.00000000
exports          0.4192930  -0.3217737   0.2422221  0.90427206 -0.10451967
income           0.6464023  -0.5570535   0.3963467  0.31523438 -0.11093212
                    exports     income
life_expectancy  0.37386117  0.7268468
gdp              0.53991836  0.9766515
co2              0.34085433  0.5485681
fert_rate       -0.34614160 -0.5054745
health          -0.02207017  0.4958012
internet         0.41929297  0.6464023
infant_mort     -0.32177373 -0.5570535
electricity      0.24222210  0.3963467
imports          0.90427206  0.3152344
inflation       -0.10451967 -0.1109321
exports          1.00000000  0.4676356
income           0.46763558  1.0000000
fviz_nbclust(wdi_scaled, 
             FUNcluster = kmeans,
             method='wss',
             ) + 
  labs(title = 'Plot of WSS vs k using kmeans') + 
fviz_nbclust(wdi_scaled, 
             FUNcluster = kmeans,
             method='silhouette',
             ) + 
  labs(title = 'Plot of avg silhouette vs k using kmeans') +
fviz_nbclust(wdi_scaled, 
             FUNcluster = pam,
             method='wss',
             )

After testing 2, 3, 4, 5, and 6 clusters, and looking at some plots, I would say I agree with your claim. I think either 3 or 4 clusters did the best job of showing distinct groups, but 3-5 all did a decent job.

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

numeric_wdi <- wdi %>% 
  select(where(is.numeric))

rownames(numeric_wdi) <- wdi$country


wdi_scaled <- scale(numeric_wdi)
kmeans4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)

wdi_pca <- prcomp(numeric_wdi, center = TRUE, scale. = TRUE)

kmeans_biplot <- fviz_pca_biplot(
  wdi_pca,
  habillage = factor(kmeans4$cluster),
  repel = TRUE
) +
  ggtitle('K-means 4-cluster solution') +
  guides(color = 'none', shape = 'none')
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
kmeans_biplot

Cluster 1 - red

-high life expectancy, income, electricity/internet

-United States, France, Korean Republic, Norway, Germany

Cluster 2 - blue

-overall moderate/middle

-moderate life expectancy, health

-Brazil, China, Russia, Iran, Mexico

Cluster 3 - green

-high exports, imports

-high GDP

-seems to overall be notably wealthy

-Ireland, Singapore, Luxembourg

Cluster 4 - purple

-high infant mortality, fertility rate

-low income, internet/electricity

-Chad, Niger, Malawi, Mozambique

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

# C)

wdi_filtered <- wdi %>%
  filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))

numeric_wdi <- wdi_filtered %>%
  select(-country) %>%
  mutate(across(everything(), as.numeric))

rownames(numeric_wdi) <- wdi_filtered$country

wdi_scaled <- scale(numeric_wdi)

kmeans4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)

wdi_pca <- prcomp(numeric_wdi, center = TRUE, scale. = TRUE)

kmeans_biplot <- fviz_pca_biplot(
  wdi_pca,
  habillage = factor(kmeans4$cluster),
  repel = TRUE
) +
  ggtitle("K-means 4-cluster solution") +
  guides(color = "none", shape = "none")

kmeans_biplot

After removing Ireland, Singapore, and Luxembourg from the data set …

-The very wealthy cluster that originally included the aforementioned countries is now less extreme … this cluster is more compact and the outliers (ex: exports/imports) are not having an impact

-The countries with low income, high infant mortality, etc, stayed more or less the same, which I would assume is because the removed countries didn’t have much to do with this group

-The overall spreading of the data is tighter now … the clusters seem closer together because the extremes (Ireland, Singapore, Luxembourg) are no longer part of the data set