Activity 4.2 - Kmeans, PAM, and DBSCAN clustering
SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions
Loading required packages:
library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)
Question 1
Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.
three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')
A)
With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)
par(mfrow = c(1,3))
# Three Spheres
kNNdistplot(three_spheres, k = 4)
title("Three Spheres")
abline(h = 0.15, col = "red", lwd = 2)
# Ring–Moon–Sphere
kNNdistplot(ring_moon_sphere, k = 4)
title("Ring–Moon–Sphere")
abline(h = 0.1, col = "red", lwd = 2)
# Two Spirals + Sphere
kNNdistplot(two_spirals_sphere, k = 4)
title("Two Spirals + Sphere")
abline(h = 0.12, col = "red", lwd = 2)
Candidate eps values read from the plots:
- Three Spheres ≈ 0.15
- Ring–Moon–Sphere ≈ 0.1
- Two Spirals + Sphere ≈ 0.12
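As a rough cross-check on values read off the plots, the sorted k-nearest-neighbor distances can also be computed by hand. The sketch below uses base R only and a made-up toy data set (`toy` is hypothetical, not one of the three course files). The dbscan documentation describes minPts as counting the point itself, so k = minPts - 1 is a commonly suggested choice; k = 4 is kept here to match the plots above.

```r
set.seed(1)

# Hypothetical toy data: two tight blobs plus a few scattered points
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 2, sd = 0.1), ncol = 2),
  matrix(runif(10, min = -1, max = 3), ncol = 2)
)

k <- 4
d <- as.matrix(dist(toy))

# For each point, sort its distances; position 1 is the zero self-distance,
# so position k + 1 is the distance to the k-th nearest neighbor
knn_dist <- apply(d, 2, function(col) sort(col)[k + 1])
sorted_knn <- sort(knn_dist)

# Same shape of curve that a k-NN distance plot shows; look for the knee
plot(sorted_knn, type = "l", xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))

# Upper quantiles give a numeric feel for where the curve turns upward
quantile(knn_dist, c(0.5, 0.9, 0.95))
```

A candidate eps is a value just below the point where the sorted curve bends sharply upward; the quantiles are only a guide, not a rule.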
B)
Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:
- Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
- Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!
Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.
plot_dbscan_results <- function(df, eps, minPts) {
  # Run DBSCAN on the original columns
  db <- dbscan(df, eps = eps, minPts = minPts)
  # Record the axis columns before the cluster column is added
  x_var <- colnames(df)[1]
  y_var <- colnames(df)[2]
  # Add cluster labels to data
  df$cluster <- factor(db$cluster)
  # .data[[...]] maps columns by name instead of indexing df inside aes()
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], color = cluster)) +
    geom_point(size = 2) +
    labs(
      title = paste0("DBSCAN Clusters (eps = ", eps, ", minPts = ", minPts, ")"),
      x = x_var,
      y = y_var,
      color = "Cluster"
    ) +
    theme_minimal()
}
# Examples
plot_dbscan_results(three_spheres, eps = 0.25, minPts = 3)
plot_dbscan_results(ring_moon_sphere, eps = 0.35, minPts = 3)
plot_dbscan_results(two_spirals_sphere, eps = 0.11, minPts = 3)
I think I needed a little more context for this question. I wasn’t sure whether we needed 3 clusters for each plot, given that the previous question specified minPts = 4. I will check the answer key when done to see what was wrong.
Also, I’m not entirely sure why, but no epsilon value I tried gave 3 clear clusters for the spiral data. The result was always the circle in the middle, one giant spiral surrounding the circle, and then a single point as an entire cluster. I’m not sure if there is a way to change this, but two clusters seem to work best for the spiral.
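One thing that may be worth checking here, offered as a possibility rather than a diagnosis of these results: in the dbscan package, points labeled 0 are noise, not a cluster of their own, so a lone point with its own color may simply be a noise point. A minimal sketch of separating noise from real clusters, using a made-up label vector (`ids` and `label_noise` are hypothetical; `ids` stands in for `dbscan(...)$cluster`):

```r
# Hypothetical cluster labels, shaped like the $cluster element returned by dbscan()
ids <- c(1, 1, 1, 2, 2, 0, 2, 1, 2, 2)

# Relabel 0 as "noise" so a plot legend does not present it as a cluster
label_noise <- function(cluster_ids) {
  factor(ifelse(cluster_ids == 0, "noise", as.character(cluster_ids)))
}

labels <- label_noise(ids)
table(labels)

# Count real clusters (excluding the noise label) and noise points
n_clusters <- length(setdiff(unique(ids), 0))
n_noise <- sum(ids == 0)
```

Running `table()` on the real `$cluster` vector would show whether the "single point" is label 0 or a genuine cluster id.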
C)
Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.
plot_clusters <- function(df, clusters, method_name) {
  # Record the axis columns before the cluster column is added
  x_var <- colnames(df)[1]
  y_var <- colnames(df)[2]
  df$cluster <- factor(clusters)
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], color = cluster)) +
    geom_point(size = 2) +
    labs(
      title = method_name,
      x = x_var,
      y = y_var,
      color = "Cluster"
    ) +
    theme_minimal()
}
# DBSCAN function, redefined with a shorter title for the grid
plot_dbscan_results <- function(df, eps, minPts) {
  db <- dbscan(df, eps = eps, minPts = minPts)
  x_var <- colnames(df)[1]
  y_var <- colnames(df)[2]
  df$cluster <- factor(db$cluster)
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], color = cluster)) +
    geom_point(size = 2) +
    labs(
      title = paste0("DBSCAN (eps=", eps, ", minPts=", minPts, ")"),
      x = x_var,
      y = y_var,
      color = "Cluster"
    ) +
    theme_minimal()
}
kmeans_ts <- kmeans(three_spheres, centers = 3)$cluster
pam_ts <- pam(three_spheres, k = 3)$cluster
dbscan_ts <- plot_dbscan_results(three_spheres, eps = 0.25, minPts = 4)
kmeans_rms <- kmeans(ring_moon_sphere, centers = 3)$cluster
pam_rms <- pam(ring_moon_sphere, k = 3)$cluster
dbscan_rms <- plot_dbscan_results(ring_moon_sphere, eps = 0.35, minPts = 4)
kmeans_tss <- kmeans(two_spirals_sphere, centers = 3)$cluster
pam_tss <- pam(two_spirals_sphere, k = 3)$cluster
dbscan_tss <- plot_dbscan_results(two_spirals_sphere, eps = 0.11, minPts = 4)
kmeans_ts_plot <- plot_clusters(three_spheres, kmeans_ts, "k-means")
pam_ts_plot <- plot_clusters(three_spheres, pam_ts, "PAM")
kmeans_rms_plot <- plot_clusters(ring_moon_sphere, kmeans_rms, "k-means")
pam_rms_plot <- plot_clusters(ring_moon_sphere, pam_rms, "PAM")
kmeans_tss_plot <- plot_clusters(two_spirals_sphere, kmeans_tss, "k-means")
pam_tss_plot <- plot_clusters(two_spirals_sphere, pam_tss, "PAM")
(kmeans_ts_plot | pam_ts_plot | dbscan_ts) /
(kmeans_rms_plot | pam_rms_plot | dbscan_rms) /
(kmeans_tss_plot | pam_tss_plot | dbscan_tss)
Question 2
In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:
- life_expectancy: average life expectancy at birth
- gdp: GDP per capita, in 2015 USD
- co2: CO2 emissions, in metric tons per capita
- fert_rate: annual births per 1000 women
- health: percentage of GDP spent on health care
- imports and exports: imports and exports as a percentage of GDP
- internet and electricity: percentage of population with access to internet and electricity, respectively
- infant_mort: infant mortality rate, infant deaths per 1000 live births
- inflation: consumer price inflation, as annual percentage
- income: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv')
head(wdi)
country life_expectancy gdp co2 fert_rate health internet
1 Afghanistan 61.45400 527.8346 0.180555 5.145 15.533614 17.0485
2 Albania 77.82400 4437.6535 1.607133 1.371 7.503894 72.2377
3 Algeria 73.25700 4363.6853 3.902928 2.940 5.638317 63.4727
4 Angola 63.11600 2433.3764 0.619139 5.371 3.274885 36.6347
5 Argentina 75.87800 11393.0506 3.764393 1.601 10.450306 85.5144
6 Armenia 73.37561 4032.0904 2.334560 1.700 12.240562 76.5077
infant_mort electricity imports inflation exports income
1 55.3 97.7 36.28908 5.601888 10.42082 475.7181
2 8.1 100.0 36.97995 1.620887 22.54076 4322.5497
3 20.4 99.7 24.85456 2.415131 15.53520 2689.8725
4 42.3 47.0 27.62749 22.271539 38.31454 1100.2175
5 8.7 100.0 13.59828 42.015095 16.60541 7241.0303
6 10.2 100.0 39.72382 1.211436 29.76499 3617.0320
Focus on using kmeans for this problem.
A)
My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.
wdi_num <- wdi[, sapply(wdi, is.numeric)]
# Standardize the data
wdi_scaled <- scale(wdi_num)
fviz_nbclust(wdi_scaled, kmeans, method = "wss") +
  ggtitle("Elbow Method for Optimal k")
fviz_nbclust(wdi_scaled, kmeans, method = "silhouette") +
  ggtitle("Silhouette Method for Optimal k")
The elbow does indeed begin at k = 3, but the curve only really flattens out at k = 5. The silhouette plot supports using k = 4. Both values fall inside the 3-5 range, so the claim is supported.
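For reference, the quantity behind the elbow plot can be computed directly. This is a small sketch on the built-in USArrests data (a stand-in, not the WDI data); it records the total within-cluster sum of squares for k = 1 to 8 and the relative drop at each step.

```r
set.seed(123)

# Stand-in data: built-in USArrests, standardized like wdi_scaled
x <- scale(USArrests)

ks <- 1:8
# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Relative drop in WSS when moving from k - 1 to k clusters
rel_drop <- round(-diff(wss) / head(wss, -1), 3)
names(rel_drop) <- paste0("k=", ks[-1])
rel_drop
```

The elbow is roughly where the relative drops become small and stay small; reading it is still a judgment call, which is why pairing it with the silhouette plot is useful.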
B)
Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.
set.seed(123)
k <- 4
# centers = 5 rather than k: one point ends up as a cluster by itself, so a fifth center is used to offset it
kmeans_res <- kmeans(wdi_scaled, centers = 5, nstart = 25)
wdi$cluster <- factor(kmeans_res$cluster)
# Perform PCA
pca_res <- prcomp(wdi_scaled, center = TRUE, scale. = FALSE)
# Create data frame for plotting
pca_df <- data.frame(
  PC1 = pca_res$x[,1],
  PC2 = pca_res$x[,2],
  cluster = wdi$cluster,
  country = wdi$country  # country names; rownames(wdi) would only give row numbers
)
# Plot PCA with clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  # Draw ellipses only for clusters with at least 3 points; smaller clusters get no ellipse
  stat_ellipse(data = subset(pca_df, cluster %in% names(table(cluster))[table(cluster) >= 3]),
               type = "norm", linetype = 2) +
  labs(title = "PCA of WDI Data with k=4",
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_minimal()
Again, I’m having the same issue where a single dot shows up as an entire cluster, and I don’t know how to change this. The plot above shows 5 clusters but only 4 ellipses.
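To characterize clusters from a PCA plot, it helps to know what each axis represents. A minimal sketch, again on the built-in USArrests data as a stand-in for the WDI data, of pulling out the share of variance explained and the loadings of the first two components:

```r
# Stand-in data: built-in USArrests, standardized inside prcomp()
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: how strongly each original variable contributes to PC1 and PC2
round(pca$rotation[, 1:2], 2)
```

Variables with large absolute loadings on PC1 are the ones that separate points along the horizontal axis; the same reading would apply to `pca_res$rotation` here.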
cluster_summary <- aggregate(wdi_num, by = list(cluster = wdi$cluster), FUN = mean)
colnames(cluster_summary)[1] <- "Cluster"
print(cluster_summary)
Cluster life_expectancy gdp co2 fert_rate health internet
1 1 73.11275 7223.219 3.7575470 2.068763 6.838331 70.48229
2 2 81.46807 40937.926 8.0461472 1.575404 10.602173 90.52674
3 3 80.61227 56788.838 19.7066481 1.530000 5.037042 96.87044
4 4 61.53000 1230.192 0.5842832 3.754000 2.954401 29.29860
5 5 62.93771 1405.343 0.5006377 4.432794 5.501136 28.57167
infant_mort electricity imports inflation exports income
1 13.423684 97.49211 40.22340 4.4820091 34.31290 5973.559
2 2.961538 100.00000 41.15122 0.6953442 43.20762 35104.973
3 4.585714 100.00000 97.13441 -0.6679204 114.12021 36922.741
4 44.900000 52.70000 25.02031 557.2018174 22.29307 1171.849
5 44.250000 50.09118 35.29576 10.5131462 26.14813 1092.885
Below is the approach AI suggested for finding the country closest to the center of each cluster. If there is a better method for picking a representative country, I will find it in the answer key.
wdi_scaled_df <- as.data.frame(wdi_scaled)
wdi$cluster <- factor(kmeans_res$cluster)
# k = 4, so only clusters 1 to 4 are considered here
representatives <- sapply(1:k, function(cl) {
  idx <- which(wdi$cluster == cl)
  cluster_points <- wdi_scaled_df[idx, ]
  centroid <- kmeans_res$centers[cl, ] # centroid
  # Euclidean distance from each point to centroid
  distances <- apply(cluster_points, 1, function(x) sqrt(sum((x - centroid)^2)))
  # rownames(wdi) are row numbers, so this returns the row label of the closest country
  rep_country <- rownames(wdi)[idx[which.min(distances)]]
  return(rep_country)
})
representatives
[1] "88" "8" "137" "144"
Looking at the original csv, we find these countries to be the best (least distance to center) for their respective cluster:
- 8: Australia
- 88: Mauritius
- 144: Zambia
- 132: Timor-Leste
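Looking names up by row number in the csv is easy to get wrong, so one alternative is to return the names directly from the name column. The sketch below is one way this could look, shown on the built-in USArrests data as a stand-in (`states` and its `state` column are hypothetical stand-ins for `wdi` and `wdi$country`). It loops over every row of the centers matrix, so no cluster is skipped.

```r
set.seed(123)

# Stand-in data frame with a name column plus numeric indicators
states <- data.frame(state = rownames(USArrests), USArrests, row.names = NULL)
x <- scale(states[, sapply(states, is.numeric)])

km <- kmeans(x, centers = 4, nstart = 25)

# Euclidean distance from every row to the centroid of its own cluster
own_centroid <- km$centers[km$cluster, , drop = FALSE]
dist_to_centroid <- sqrt(rowSums((x - own_centroid)^2))

# For each cluster, report the name of the row closest to the centroid
representatives <- sapply(seq_len(nrow(km$centers)), function(cl) {
  idx <- which(km$cluster == cl)
  states$state[idx[which.min(dist_to_centroid[idx])]]
})
representatives
```

Applied to the WDI data, the same pattern with `wdi$country` in place of `states$state` would print country names rather than row labels.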
C)
Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?