library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)Activity 4.2 - Kmeans, PAM, and DBSCAN clustering
SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions
Loading required packages:
Question 1
Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.
getwd()[1] "C:/Users/wi7536ul/OneDrive - Minnesota State/DSCI 415/notes"
setwd("C:/Users/wi7536ul/OneDrive - Minnesota State/DSCI 415/notes")three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')A)
With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)
kNNdistplot(three_spheres[,1:2], minPts = 4)
abline(h = 0.15)kNNdistplot(ring_moon_sphere[,1:2], minPts = 4)
abline(h = 0.20)kNNdistplot(two_spirals_sphere[,1:2], minPts = 4)
abline(h = 1.0)B)
Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:
- Runs DBSCAN on the inputted data frame
df, given theepsandminPtsvalues; - Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of
epsandminPtsused to create the clusters!!
plot_dbscan_results <- \(dataset, eps, minPts) {
db_out <- dbscan(dataset[,1:2], eps = eps, minPts = minPts)
dataset$cluster <- factor(db_out$cluster)
the_plot <- ggplot(data = dataset) +
geom_point(aes(x = x, y = y, color = cluster)) +
guides(color = 'none') +
ggtitle(paste("DBSCAN: eps =", eps, ", minPts =", minPts)) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5, size = 12))
the_plot
}Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.
plot_dbscan_results(three_spheres, 0.15, 4) +
plot_dbscan_results(ring_moon_sphere, 0.20, 4) +
plot_dbscan_results(two_spirals_sphere, 1.0, 4)plot_dbscan_results(three_spheres, 0.20, 4)plot_dbscan_results(ring_moon_sphere, 0.27, 4)plot_dbscan_results(two_spirals_sphere, 1.1, 4)C)
Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.
#three_spheres
#k-means
numeric_only_three <- three_spheres %>%
select(x,y)
kmeans_clusters <- kmeans(numeric_only_three,
centers = 3,
iter.max = 20,
nstart = 10
)
three_spheres$kmeans_clusters <- factor(kmeans_clusters$cluster)
km_three_s <- ggplot(data = three_spheres, aes(x = x, y = y, color=kmeans_clusters)) +
geom_point() +
guides(color='none') +
theme_classic(base_size = 16)
#PAM
pam_clusters <- pam(numeric_only_three,
k = 3,
nstart = 10
)
three_spheres$pam_clusters <- factor(pam_clusters$cluster)
pam_three_s <- ggplot(data = three_spheres, aes(x = x, y = y, color=pam_clusters)) +
geom_point() +
guides(color='none') +
theme_classic(base_size = 16)
#dbscan
dbscan_three_s <- plot_dbscan_results(three_spheres, 0.20, 4)#ring_moon_sphere
#k-means
numeric_only_rings <- ring_moon_sphere %>%
select(x,y)
kmeans_clusters <- kmeans(numeric_only_rings,
centers = 3,
iter.max = 20,
nstart = 10
)
ring_moon_sphere$kmeans_clusters <- factor(kmeans_clusters$cluster)
km_rm <- ggplot(data = ring_moon_sphere, aes(x = x, y = y, color=kmeans_clusters)) +
geom_point() +
guides(color='none') +
theme_classic(base_size = 16)
#PAM
pam_clusters <- pam(numeric_only_rings, k = 3, nstart = 10 )
ring_moon_sphere$pam_clusters <- factor(pam_clusters$cluster)
pam_rm <- ggplot(data = ring_moon_sphere, aes(x = x, y = y, color=pam_clusters)) +
geom_point() + guides(color='none') + theme_classic(base_size = 16)
#dbscan
dbscan_rm <- plot_dbscan_results(ring_moon_sphere, 0.27, 4)#two_spirals_sphere
#k-means
numeric_only_two <- two_spirals_sphere %>%
select(x,y)
kmeans_clusters <- kmeans(numeric_only_two,
centers = 3,
iter.max = 20,
nstart = 10
)
two_spirals_sphere$kmeans_clusters <- factor(kmeans_clusters$cluster)
km_two_s <- ggplot(data = two_spirals_sphere, aes(x = x, y = y, color=kmeans_clusters)) +
geom_point() +
guides(color='none') +
theme_classic(base_size = 16)
#PAM
pam_clusters <- pam(numeric_only_two, k = 3, nstart = 10 )
two_spirals_sphere$pam_clusters <- factor(pam_clusters$cluster)
pam_two_s <- ggplot(data = two_spirals_sphere, aes(x = x, y = y, color=pam_clusters)) +
geom_point() + guides(color='none') + theme_classic(base_size = 16)
#dbscan
dbscan_two_s <- plot_dbscan_results(two_spirals_sphere, 1.1, 4)km_three_s + km_rm + km_two_s +
pam_three_s + pam_rm + km_two_s +
dbscan_three_s + dbscan_rm + dbscan_two_s +
plot_layout(nrow = 3, ncol = 3)Three spheres - all of the methods I used seem to work well on this dataset
Ring moon sphere- of the three methods, DBSCAN was the only one that correctly captured the ring, arc, and blob part of this dataset correctly
Two spirals sphere - neither k-means nor PAM did a job here … DBSCAN kept the proper structure, though
Overall findings - k-means and PAM seem to work best for more simple clusters, while DBSCAN did a decent job of working with the more complex shapes
Question 2
In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:
life_expectancy: average life expectancy at birthgdp: GDP per capita, in 2015 USDco2: CO2 emissions, in metric tons per capitafert_rate: annual births per 1000 womenhealth: percentage of GDP spent on health careimportsandexports: imports and exports as a percentage of GDPinternetandelectricity: percentage of population with access to internet and electricity, respectivelyinfant_mort: infant mortality rate, infant deaths per 1000 live birthsinflation: consumer price inflation, as annual percentageincome: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv')
head(wdi) country life_expectancy gdp co2 fert_rate health internet
1 Afghanistan 61.45400 527.8346 0.180555 5.145 15.533614 17.0485
2 Albania 77.82400 4437.6535 1.607133 1.371 7.503894 72.2377
3 Algeria 73.25700 4363.6853 3.902928 2.940 5.638317 63.4727
4 Angola 63.11600 2433.3764 0.619139 5.371 3.274885 36.6347
5 Argentina 75.87800 11393.0506 3.764393 1.601 10.450306 85.5144
6 Armenia 73.37561 4032.0904 2.334560 1.700 12.240562 76.5077
infant_mort electricity imports inflation exports income
1 55.3 97.7 36.28908 5.601888 10.42082 475.7181
2 8.1 100.0 36.97995 1.620887 22.54076 4322.5497
3 20.4 99.7 24.85456 2.415131 15.53520 2689.8725
4 42.3 47.0 27.62749 22.271539 38.31454 1100.2175
5 8.7 100.0 13.59828 42.015095 16.60541 7241.0303
6 10.2 100.0 39.72382 1.211436 29.76499 3617.0320
Focus on using kmeans for this problem.
A)
My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.
ggplot(wdi, aes(x = income, y = life_expectancy)) +
geom_point() +
theme_classic(base_size = 16)numeric_only_wdi <- wdi %>%
select(income, life_expectancy)
kmeans_clusters <- kmeans(
numeric_only_wdi,
centers = 3,
iter.max = 20,
nstart = 10
)
wdi$kmeans_clusters <- factor(kmeans_clusters$cluster)
ggplot(wdi, aes(x = income, y = life_expectancy, color = kmeans_clusters)) +
geom_point() +
theme_classic(base_size = 16)numeric_wdi <- wdi %>%
select(where(is.numeric))
wdi_scaled <- scale(numeric_wdi)
cov(wdi_scaled) life_expectancy gdp co2 fert_rate health
life_expectancy 1.0000000 0.6973222 0.54427116 -0.8309322 0.41805627
gdp 0.6973222 1.0000000 0.56750894 -0.4786177 0.42277299
co2 0.5442712 0.5675089 1.00000000 -0.4381669 0.09900871
fert_rate -0.8309322 -0.4786177 -0.43816686 1.0000000 -0.38741451
health 0.4180563 0.4227730 0.09900871 -0.3874145 1.00000000
internet 0.8474164 0.6231724 0.62175088 -0.8385215 0.35157547
infant_mort -0.9052890 -0.5256942 -0.49356034 0.8620178 -0.37287726
electricity 0.7617501 0.3758888 0.40045385 -0.8390548 0.27238951
imports 0.2369366 0.3842341 0.20901177 -0.2718620 0.02029878
inflation -0.1690938 -0.1035817 -0.09806097 0.1385672 -0.15997566
exports 0.3738612 0.5399184 0.34085433 -0.3461416 -0.02207017
income 0.7268468 0.9766515 0.54856805 -0.5054745 0.49580116
internet infant_mort electricity imports inflation
life_expectancy 0.8474164 -0.9052890 0.7617501 0.23693656 -0.16909376
gdp 0.6231724 -0.5256942 0.3758888 0.38423409 -0.10358167
co2 0.6217509 -0.4935603 0.4004538 0.20901177 -0.09806097
fert_rate -0.8385215 0.8620178 -0.8390548 -0.27186199 0.13856718
health 0.3515755 -0.3728773 0.2723895 0.02029878 -0.15997566
internet 1.0000000 -0.8310365 0.8026254 0.29209228 -0.16469141
infant_mort -0.8310365 1.0000000 -0.8084987 -0.20453761 0.17575100
electricity 0.8026254 -0.8084987 1.0000000 0.16645057 -0.15857829
imports 0.2920923 -0.2045376 0.1664506 1.00000000 -0.12115127
inflation -0.1646914 0.1757510 -0.1585783 -0.12115127 1.00000000
exports 0.4192930 -0.3217737 0.2422221 0.90427206 -0.10451967
income 0.6464023 -0.5570535 0.3963467 0.31523438 -0.11093212
exports income
life_expectancy 0.37386117 0.7268468
gdp 0.53991836 0.9766515
co2 0.34085433 0.5485681
fert_rate -0.34614160 -0.5054745
health -0.02207017 0.4958012
internet 0.41929297 0.6464023
infant_mort -0.32177373 -0.5570535
electricity 0.24222210 0.3963467
imports 0.90427206 0.3152344
inflation -0.10451967 -0.1109321
exports 1.00000000 0.4676356
income 0.46763558 1.0000000
fviz_nbclust(wdi_scaled,
FUNcluster = kmeans,
method='wss',
) +
labs(title = 'Plot of WSS vs k using kmeans') +
fviz_nbclust(wdi_scaled,
FUNcluster = kmeans,
method='silhouette',
) +
labs(title = 'Plot of avg silhouette vs k using kmeans') +
fviz_nbclust(wdi_scaled,
FUNcluster = pam,
method='wss',
)After testing 2, 3, 4, 5, and 6 clusters, and looking at some plots, I would say I agree with your claim. I think either 3 or 4 clusters did the best job of showing distinct groups, but 3-5 all did a decent job.
B)
Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.
numeric_wdi <- wdi %>%
select(where(is.numeric))
rownames(numeric_wdi) <- wdi$country
wdi_scaled <- scale(numeric_wdi)
kmeans4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)
wdi_pca <- prcomp(numeric_wdi, center = TRUE, scale. = TRUE)
kmeans_biplot <- fviz_pca_biplot(
wdi_pca,
habillage = factor(kmeans4$cluster),
repel = TRUE
) +
ggtitle('K-means 4-cluster solution') +
guides(color = 'none', shape = 'none')Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
kmeans_biplotCluster 1 - red
-high life expectancy, income, electricity/internet
-United States, France, Korean Republic, Norway, Germany
Cluster 2 - blue
-overall moderate/middle
-moderate life expectancy, health
-Brazil, China, Russia, Iran, Mexico
Cluster 3 - green
-high exports, imports
-high GDP
-seems to overall be notably wealthy
-Ireland, Singapore, Luxembourg
Cluster 4 - purple
-high infant mortality, fertility rate
-low income, internet/electricity
-Chad, Niger, Malawi, Mozambique
C)
Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?
# C)
wdi_filtered <- wdi %>%
filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))
numeric_wdi <- wdi_filtered %>%
select(-country) %>%
mutate(across(everything(), as.numeric))
rownames(numeric_wdi) <- wdi_filtered$country
wdi_scaled <- scale(numeric_wdi)
kmeans4 <- kmeans(wdi_scaled, centers = 4, nstart = 10)
wdi_pca <- prcomp(numeric_wdi, center = TRUE, scale. = TRUE)
kmeans_biplot <- fviz_pca_biplot(
wdi_pca,
habillage = factor(kmeans4$cluster),
repel = TRUE
) +
ggtitle("K-means 4-cluster solution") +
guides(color = "none", shape = "none")
kmeans_biplotAfter removing Ireland, Singapore, and Luxembourg from the data set …
-The very wealthy cluster that originally included the aforementioned countries is now less extreme … this cluster is more compact and the outliers (ex: exports/imports) are not having an impact
-The countries with low income, high infant mortality, etc, stayed more or less the same, which I would assume is because the removed countries didn’t have much to do with this group
-The overall spreading of the data is tighter now … the clusters seem closer together because the extremes (Ireland, Singapore, Luxembourg) are no longer part of the data set