Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

Loading required packages:

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)
library(ggplot2)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('Data/cluster_data1.csv') %>%  scale() 
ring_moon_sphere <- read.csv('Data/cluster_data2.csv') %>% scale()
two_spirals_sphere <- read.csv('Data/cluster_data3.csv') %>%  scale()

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon\).

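# Note: minPts = 5 is used here and in B) and C), although the prompt suggests 4.
# With minPts = 5, kNNdistplot() shows 4-nearest-neighbor distances (k = minPts - 1).
kNNdistplot(three_spheres, minPts = 5)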
abline(h = 0.279)
title("3 sphere e = 0.279")

kNNdistplot(ring_moon_sphere, minPts = 5) 
abline(h = 0.267)
title("ring moon e = 0.267")

kNNdistplot(two_spirals_sphere, minPts = 5)
abline(h = 0.3065)
title("two_spirals e =  0.3065")
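
For intuition, the curve drawn by kNNdistplot() can be approximated in base R: take each point's distance to its k-th nearest neighbor and sort those distances. The sketch below is only an illustration on simulated data (`toy` is a hypothetical stand-in, not one of the course data files), with k = 4 to match minPts = 5.

```r
# Minimal base-R sketch of a k-NN distance curve (simulated data, not the course files).
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
))

k <- 4  # k = minPts - 1, so this corresponds to minPts = 5

# Full pairwise Euclidean distance matrix (fine for small n).
d <- as.matrix(dist(toy))

# For each point, sort its distances. Position 1 is the point itself (distance 0),
# so the k-th nearest neighbor sits at position k + 1.
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted curve: a sharp bend ("knee") suggests a candidate eps.
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance", ylab = paste0(k, "-NN distance"))
```

Points to the right of the knee have unusually distant neighbors, so an eps near the bend treats them as likely noise.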

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

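plot_dbscan_results <- function(df, eps, minPts, title) {
  # Run DBSCAN on the coordinates; cluster 0 marks noise points
  do_dbscan <- dbscan(df, eps = eps, minPts = minPts)

  # Add cluster labels to the data set
  df$dbcluster <- factor(do_dbscan$cluster)

  ggplot(data = data.frame(df), aes(x = x, y = y, color = dbcluster)) +
    geom_point() +
    labs(color = 'Cluster') +
    theme_classic(base_size = 16) +
    # Show eps and minPts on the plot, as the prompt asks
    ggtitle(title, subtitle = paste0("eps = ", eps, ", minPts = ", minPts))
}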

plot_dbscan_results(data.frame(three_spheres),0.279,5,"3 sphere e= .279")

plot_dbscan_results(data.frame(ring_moon_sphere),.267,5,"ring moon e= .267")

plot_dbscan_results(data.frame(two_spirals_sphere),0.3065,5,"two_spirals e= .3065")
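
Since the eps values usually need revising, it can help to tabulate how the number of clusters and noise points changes over a small grid of candidates before plotting. The following is a rough sketch on simulated data; `toy` and `eps_grid` are hypothetical names chosen for illustration, not part of the assignment.

```r
library(dbscan)

# Simulated two-blob data; a hypothetical stand-in for the course data.
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2)
))

eps_grid <- c(0.05, 0.1, 0.2, 0.3, 0.5)

# For each eps, record the number of clusters and the number of noise points.
# dbscan() labels noise as cluster 0, so 0 is excluded from the cluster count.
sweep <- do.call(rbind, lapply(eps_grid, function(e) {
  cl <- dbscan(toy, eps = e, minPts = 5)$cluster
  data.frame(eps = e,
             n_clusters = length(setdiff(unique(cl), 0)),
             n_noise = sum(cl == 0))
}))

print(sweep)
```

Very small eps values tend to leave most points as noise, while very large ones tend to merge everything into one cluster; the useful range lies in between.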

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

Discussion:

3 sphere: All three methods did fairly well on the three_spheres data. I like that DBSCAN can flag outliers, though I’m not sure the points it flagged as outliers tell us much.

ring_moon: DBSCAN is clearly the only method that identified the clusters in this data. This reminds me of Activity 4.1: DBSCAN seems able to recover reasonable clusters for both convex and non-convex data sets.

two_spirals: This is the same situation as the ring_moon data set. Thinking back to single linkage, though, it is interesting that I could not get DBSCAN to return exactly 3 clusters no matter how I tweaked the epsilon value; this was the closest I could manage.

# k-means plotting function
plot_kmeans_results <- function(df, centers, title) {
  numeric_only <- df %>%
    select(x, y)

  # centers sets the number of clusters
  kmeans_clusters <- kmeans(numeric_only,
                            centers = centers,
                            iter.max = 20,
                            nstart = 10)

  # Add clusters to data frame
  df$kmeans_clusters <- factor(kmeans_clusters$cluster)

  ggplot(data = df, aes(x = x, y = y, color = kmeans_clusters)) +
    geom_point() +
    guides(color = 'none') +
    theme_classic(base_size = 16) +
    ggtitle(title)
}

# PAM plotting function
plot_pam_results <- function(df, k, title) {
  numeric_only <- df %>%
    select(x, y)

  pam_clusters <- pam(numeric_only,
                      k = k,
                      nstart = 10)

  # Add clusters to data frame (pam() stores assignments in $clustering)
  df$pam_clusters <- factor(pam_clusters$clustering)

  ggplot(data = df, aes(x = x, y = y, color = pam_clusters)) +
    geom_point() +
    guides(color = 'none') +
    theme_classic(base_size = 16) +
    ggtitle(title)
}

#set charts pre patchwork
#dbscan
db_3s <- plot_dbscan_results(data.frame(three_spheres), 0.279, 5, "3 sphere,dbs")
db_rm <- plot_dbscan_results(data.frame(ring_moon_sphere), 0.267, 5, "ring moon,dbs")
db_2s <- plot_dbscan_results(data.frame(two_spirals_sphere), 0.3065, 5, "two_spirals,dbs")

#kmeans
km_3s <- plot_kmeans_results(data.frame(three_spheres),3,"3 sphere,k-m")
km_rm <- plot_kmeans_results(data.frame(ring_moon_sphere),3,"ring_moon,k-m")
km_2s <- plot_kmeans_results(data.frame(two_spirals_sphere),3,"2 spiral, k-m")

#Pam
pam_3s <- plot_pam_results(data.frame(three_spheres),3,"3 sphere,pam")
pam_rm <- plot_pam_results(data.frame(ring_moon_sphere),3,"ring_moon,pam")
pam_2s <- plot_pam_results(data.frame(two_spirals_sphere),3,"2 spiral,pam")

#setup patchwork
(db_3s + db_rm + db_2s)/
(km_3s + km_rm + km_2s)/
(pam_3s + pam_rm + pam_2s)
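
Alongside the visual comparison, a cross-tabulation of the cluster assignments gives a quick numeric check of how closely two methods agree. This is a small sketch on simulated data (`toy` is a hypothetical stand-in for the course data), comparing k-means and PAM.

```r
library(cluster)  # pam()

# Three simulated blobs; a hypothetical stand-in for the course data.
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.4), ncol = 2)
))

km <- kmeans(toy, centers = 3, nstart = 10)
pm <- pam(toy, k = 3)

# Cluster numbers are arbitrary labels, so agreement shows up as one dominant
# cell per row and column rather than as a filled diagonal.
agreement <- table(kmeans = km$cluster, pam = pm$clustering)
print(agreement)
```

The same table can be built for DBSCAN versus k-means; in that case a row or column labelled 0 holds the points DBSCAN treats as noise.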

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

life_expectancy: average life expectancy at birth
gdp: GDP per capita, in 2015 USD
co2: CO2 emissions, in metric tons per capita
fert_rate: total fertility rate, in births per woman
health: percentage of GDP spent on health care
imports and exports: imports and exports as a percentage of GDP
internet and electricity: percentage of population with access to internet and electricity, respectively
infant_mort: infant mortality rate, infant deaths per 1000 live births
inflation: consumer price inflation, as annual percentage
income: annual per-capita income, in 2020 USD

wdi <- read.csv('Data/wdi_extract_clean.csv') 
head(wdi)

      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

Discussion: I refute your claim. I would honestly say that the raw data is a single cluster of points with outliers (the points that fall below -2.5 on the Dim2 axis).

However, k-means cannot set outliers aside in their own group, so I would say there are 2 clusters. Specifically, there are 2 trailing lines of points that seem too far from any group to count as part of a cluster, yet they are separated very cleanly across the Dim2 = 0 line, hence my claim that 2 clusters would be appropriate.

(After completing this problem, I don’t think you actually wanted me to chart all of these. Whoops XD)

# k-means + PCA plotting function for the WDI data
plot_kmeans_results <- function(df, centers, title, label) {
  # Move country names to row names so only numeric indicators remain
  numeric_only <- df %>% column_to_rownames('country')

  # Cluster on the standardized indicators
  kmeans_clusters <- kmeans(numeric_only %>% scale(),
                            centers = centers,
                            iter.max = 20,
                            nstart = 10)

  # PCA on the same standardized data, used for display
  wdi_pca <- prcomp(numeric_only, center = TRUE, scale. = TRUE)

  fviz_pca(wdi_pca,
           habillage = factor(kmeans_clusters$cluster),
           # select.ind = list(name = NULL, ind = sample(nrow(wdi_pca$x), 20)),
           repel = TRUE, label = label) +
    ggtitle(title) +
    guides(shape = 'none') +
    labs(color = 'Cluster', shape = 'Cluster')
}

# setup patchwork
wdi_km_2 <- plot_kmeans_results(wdi, 2, "wdi,k-m,2clust", "none")

wdi_km_3 <- plot_kmeans_results(wdi, 3, "wdi,k-m,3clust", "none")
wdi_km_4 <- plot_kmeans_results(wdi, 4, "wdi,k-m,4clust", "none")
wdi_km_5 <- plot_kmeans_results(wdi, 5, "wdi,k-m,5clust", "none")

#patching
(wdi_km_2 + wdi_km_3 ) / (wdi_km_4 + wdi_km_5)
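
The PCA scatterplots show how a given k partitions the data, but they do not directly compare values of k. One common complement is an elbow plot of the total within-cluster sum of squares against k. The sketch below is only an illustration and uses the built-in USArrests data as a stand-in, since it does not assume access to the WDI file; the same steps would apply to the scaled WDI indicators.

```r
# Stand-in data: the built-in USArrests set, standardized (not the WDI extract).
x <- scale(USArrests)

ks <- 1:8
set.seed(1)

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 20)$tot.withinss
})

# Look for the k after which adding clusters stops reducing WSS by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

A clear bend supports the k at which it occurs; a smooth curve with no bend suggests the data do not have strongly separated groups.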

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

Sorry in advance for the really bad labeling :3 I needed to get this done quickly to work on other classes :/

clust1: Qatar, Japan, and Germany. These tend to be fairly developed countries where internet and electricity access are common, with low fertility and infant mortality rates.

clust2: This is only Zimbabwe, a cluster of 1. It probably has a very high level of inflation. Remember that this low-dimensional chart does not preserve the original distances between points, so Zimbabwe is likely far enough from every other country in the full-dimensional space to be marked as its own cluster.

clust3: Sri Lanka, Algeria, and Paraguay. These countries tend to have low imports and exports and lowish GDP, which are common marks of countries in this cluster.

clust4: Mali, Chad, and Niger are good examples of this cluster, as these countries have high fertility and infant mortality rates, which are the strongest traits of this cluster.

plot_kmeans_results(wdi,4,"wdi,k-m,4clust","all")
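
To back up a verbal characterization like the one above, the per-cluster means of the standardized variables can be tabulated, and fixing the random seed keeps the cluster numbers stable between renders. This is a minimal sketch using the built-in USArrests data as a stand-in (not the WDI extract); `cluster_means` and `examples` are hypothetical names.

```r
# Stand-in data: the built-in USArrests set, standardized (not the WDI extract).
x <- scale(USArrests)

# k-means labels depend on the random start; fixing the seed keeps
# cluster numbers reproducible across runs.
set.seed(1)
km <- kmeans(x, centers = 4, nstart = 10, iter.max = 20)

# Mean of each standardized variable within each cluster: values well above
# or below 0 indicate what distinguishes that cluster.
cluster_means <- aggregate(as.data.frame(x),
                           by = list(cluster = km$cluster), FUN = mean)
print(round(cluster_means, 2))

# A few example members per cluster (row names are the observation labels).
examples <- tapply(rownames(x), km$cluster, head, n = 3)
print(examples)
```

With the WDI data, the row names would be country names, so `examples` would list candidate representative countries for each cluster.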

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

OK, I am forced to admit that my hypothesis above was incorrect: this doesn’t look bad at all as a 4-cluster data set. Although if I weren’t looking at the chart with the colors on it, I might have a tougher time discerning the clusters.

plot_kmeans_results(wdi %>% filter(!country %in% c("Ireland", "Singapore", "Luxembourg")),
                    4, "wdi,k-m,4clust", "none")