library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)Activity 4.2 - Kmeans, PAM, and DBSCAN clustering
SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions
Loading required packages:
Question 1
Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.
three_spheres <- read.csv('Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('Data/cluster_data3.csv')A)
With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)
# Dataset 1
kNNdistplot(three_spheres, k = 4)
abline(h = 0.295, col = "red", lty = 2) # Dataset 2
kNNdistplot(ring_moon_sphere, k = 4)
abline(h = 0.32, col = "red", lty = 2) # Dataset 3
kNNdistplot(two_spirals_sphere, k = 4)
abline(h = 1.2, col = "red", lty = 2) B)
Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:
- Runs DBSCAN on the inputted data frame
df, given theepsandminPtsvalues; - Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of
epsandminPtsused to create the clusters!!
Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.
plot_dbscan_results <- function(df, eps, minPts) {
# Run DBSCAN
db <- dbscan::dbscan(df, eps = eps, minPts = minPts)
# Add cluster labels to data frame
df$cluster <- factor(db$cluster)
# Scatterplot of results
ggplot(df, aes(x = x, y = y, color = cluster)) +
geom_point() +
theme_classic() +
ggtitle(paste0("DBSCAN Results (eps = ", eps,
", minPts = ", minPts, ")"))
}plot_dbscan_results(three_spheres, eps = 0.295, minPts = 4)plot_dbscan_results(ring_moon_sphere, eps = 0.32, minPts = 4)plot_dbscan_results(two_spirals_sphere, eps = 1.2, minPts = 4)C)
Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.
plot_kmeans_results <- function(df, k = 3) {
km <- kmeans(df[, c("x", "y")], centers = k, nstart = 10)
df$cluster <- factor(km$cluster)
ggplot(df, aes(x = x, y = y, color = cluster)) +
geom_point() +
theme_classic() +
guides(color = "none") +
ggtitle(paste0("K-means (k = ", k, ")"))
}plot_pam_results <- function(df, k = 3) {
pam_fit <- cluster::pam(df[, c("x", "y")], k = k)
df$cluster <- factor(pam_fit$clustering)
ggplot(df, aes(x = x, y = y, color = cluster)) +
geom_point() +
theme_classic() +
guides(color = "none") +
ggtitle(paste0("PAM (k = ", k, ")"))
}plot_dbscan_results <- function(df, eps, minPts) {
db <- dbscan::dbscan(df[, c("x","y")], eps = eps, minPts = minPts)
df$cluster <- factor(db$cluster)
ggplot(df, aes(x = x, y = y, color = cluster)) +
geom_point() +
theme_classic() +
guides(color = "none") +
ggtitle(paste0("DB (eps ", eps, ", minPts ", minPts, ")"))
}p_db_1 <- plot_dbscan_results(three_spheres, eps = 0.295, minPts = 4)
p_db_2 <- plot_dbscan_results(ring_moon_sphere, eps = 0.32, minPts = 4)
p_db_3 <- plot_dbscan_results(two_spirals_sphere, eps = 1.2, minPts = 4)p_km_1 <- plot_kmeans_results(three_spheres)
p_km_2 <- plot_kmeans_results(ring_moon_sphere)
p_km_3 <- plot_kmeans_results(two_spirals_sphere)p_pam_1 <- plot_pam_results(three_spheres)
p_pam_2 <- plot_pam_results(ring_moon_sphere)
p_pam_3 <- plot_pam_results(two_spirals_sphere)library(patchwork)
options(repr.plot.width = 12, repr.plot.height = 12)
(p_km_1 | p_pam_1 | p_db_1) /
(p_km_2 | p_pam_2 | p_db_2) /
(p_km_3 | p_pam_3 | p_db_3)All 3, K means, PAM, and DBScan worked for normal shaped clusters, though it is more time consuming finding the correct eps for DBscan. For the abnormal shaped clusters (both datasets 2 and 3) K means and PAM failed to correctly identify the clusters shape both producing different results. DBScan was able to correctly identify the irregular shapes for both datasets.
Question 2
In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:
life_expectancy: average life expectancy at birthgdp: GDP per capita, in 2015 USDco2: CO2 emissions, in metric tons per capitafert_rate: annual births per 1000 womenhealth: percentage of GDP spent on health careimportsandexports: imports and exports as a percentage of GDPinternetandelectricity: percentage of population with access to internet and electricity, respectivelyinfant_mort: infant mortality rate, infant deaths per 1000 live birthsinflation: consumer price inflation, as annual percentageincome: annual per-capita income, in 2020 USD
wdi <- read.csv('Data/wdi_extract_clean.csv')
head(wdi) country life_expectancy gdp co2 fert_rate health internet
1 Afghanistan 61.45400 527.8346 0.180555 5.145 15.533614 17.0485
2 Albania 77.82400 4437.6535 1.607133 1.371 7.503894 72.2377
3 Algeria 73.25700 4363.6853 3.902928 2.940 5.638317 63.4727
4 Angola 63.11600 2433.3764 0.619139 5.371 3.274885 36.6347
5 Argentina 75.87800 11393.0506 3.764393 1.601 10.450306 85.5144
6 Armenia 73.37561 4032.0904 2.334560 1.700 12.240562 76.5077
infant_mort electricity imports inflation exports income
1 55.3 97.7 36.28908 5.601888 10.42082 475.7181
2 8.1 100.0 36.97995 1.620887 22.54076 4322.5497
3 20.4 99.7 24.85456 2.415131 15.53520 2689.8725
4 42.3 47.0 27.62749 22.271539 38.31454 1100.2175
5 8.7 100.0 13.59828 42.015095 16.60541 7241.0303
6 10.2 100.0 39.72382 1.211436 29.76499 3617.0320
Focus on using kmeans for this problem.
A)
My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.
library(tidyverse)
library(factoextra)
wdi_num <- wdi %>%
select(-country) %>%
scale()fviz_nbclust(wdi_num,
FUNcluster = kmeans,
method = "wss") +
labs(title = "WSS vs k for WDI Data (k-means)")fviz_nbclust(wdi_num,
FUNcluster = kmeans,
method = "silhouette") +
labs(title = "Average Silhouette vs k for WDI Data (k-means)")library(patchwork)
p_wss <- fviz_nbclust(wdi_num, kmeans, method = "wss")
p_sil <- fviz_nbclust(wdi_num, kmeans, method = "silhouette")
p_wss + p_sil3-5 does seem to be a reasonable cluster range in both graphs we see the greated drop in that range with the total sums of square begining to become less steep after 3 and the average silhoutte length having a huge drop off after 4. I would more argue that 4 seems to be a better representation after looking at both graphs but that does fall within your range.
B)
Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.
# Load required libraries
library(tidyverse)
library(factoextra)
# Scale numeric data
wdi_num <- wdi %>%
select(-country) %>%
scale()
# Run k-means with 4 clusters
set.seed(123)
wdi_k4 <- kmeans(wdi_num, centers = 4, nstart = 25)
# Add cluster assignments to original data
wdi$cluster <- factor(wdi_k4$cluster)
# PCA visualization
wdi_pca <- prcomp(wdi_num, center = TRUE, scale. = TRUE)
fviz_pca_ind(wdi_pca,
geom.ind = "point",
col.ind = wdi$cluster,
palette = "jco",
addEllipses = TRUE,
legend.title = "Cluster") +
labs(title = "PCA of WDI Data with K-means Clusters (k = 4)")Ignoring unknown labels:
• linetype : "Cluster"
Too few points to calculate an ellipse
# View cluster centers (scaled values)
round(wdi_k4$centers, 2) life_expectancy gdp co2 fert_rate health internet infant_mort electricity
1 0.06 -0.38 -0.20 -0.36 -0.07 0.17 -0.28 0.46
2 -1.34 -0.66 -0.72 1.54 -0.53 -1.43 1.49 -1.60
3 -1.53 -0.67 -0.70 0.98 -1.41 -1.37 1.50 -1.44
4 1.17 1.45 1.13 -0.72 0.69 1.02 -0.85 0.57
imports inflation exports income
1 -0.12 -0.09 -0.16 -0.38
2 -0.27 0.04 -0.41 -0.70
3 -0.69 11.26 -0.54 -0.69
4 0.54 -0.17 0.75 1.50
# Representative countries from each cluster
wdi %>%
group_by(cluster) %>%
slice_head(n = 3) %>%
select(country, cluster)# A tibble: 10 × 2
# Groups: cluster [4]
country cluster
<chr> <fct>
1 Albania 1
2 Algeria 1
3 Argentina 1
4 Afghanistan 2
5 Angola 2
6 Benin 2
7 Zimbabwe 3
8 Australia 4
9 Austria 4
10 Bahrain 4
# Load required libraries
library(tidyverse)
library(factoextra)
# Load and scale numeric data
wdi <- read.csv("Data/wdi_extract_clean.csv")
wdi_num <- wdi %>%
select(-country) %>%
scale()
# Apply k-means clustering with k = 4
set.seed(123)
wdi_k4 <- kmeans(wdi_num, centers = 4, nstart = 25)
# Perform PCA
wdi_pca <- prcomp(wdi_num, center = TRUE, scale. = TRUE)
# Create PCA biplot with clusters and variable vectors
fviz_pca_biplot(wdi_pca,
geom.ind = "point",
col.ind = factor(wdi_k4$cluster),
palette = "jco",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = TRUE) +
labs(title = "PCA Biplot of WDI Data with K-means Clusters (k = 4)",
color = "Cluster")Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
Too few points to calculate an ellipse
# Load required libraries
library(tidyverse)
library(factoextra)
library(ggrepel)
# Load and scale data
wdi <- read.csv("Data/wdi_extract_clean.csv")
wdi_num <- wdi %>%
select(-country) %>%
scale()
# Apply k-means clustering with k = 4
set.seed(123)
wdi_k4 <- kmeans(wdi_num, centers = 4, nstart = 25)
wdi$cluster <- factor(wdi_k4$cluster)
# Perform PCA
wdi_pca <- prcomp(wdi_num, center = TRUE, scale. = TRUE)
pca_coords <- as.data.frame(wdi_pca$x)
pca_coords$country <- wdi$country
pca_coords$cluster <- wdi$cluster
# Define color and shape mappings (match prior biplot)
cluster_colors <- c("1" = "#0073C2FF", "2" = "#EFC000FF", "3" = "#868686FF", "4" = "#CD534CFF")
cluster_shapes <- c("1" = 16, "2" = 17, "3" = 15, "4" = 3) # circle, triangle, square, plus
# Axis labels with explained variance
xlab_pc1 <- paste0("PC1 (", round(summary(wdi_pca)$importance[2,1] * 100, 1), "%)")
ylab_pc2 <- paste0("PC2 (", round(summary(wdi_pca)$importance[2,2] * 100, 1), "%)")
# Function to plot an individual cluster
plot_cluster <- function(cluster_id) {
pca_coords %>%
filter(cluster == cluster_id) %>%
ggplot(aes(x = PC1, y = PC2, label = country)) +
geom_point(aes(color = cluster, shape = cluster), size = 3) +
geom_text_repel(size = 3.5, max.overlaps = 50) +
scale_color_manual(values = cluster_colors) +
scale_shape_manual(values = cluster_shapes) +
labs(title = paste("Cluster", cluster_id),
x = xlab_pc1, y = ylab_pc2) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
}
# Generate plots for clusters 1–4 (each runs separately)
plot_cluster(1)plot_cluster(2)plot_cluster(3)plot_cluster(4)Cluster 1 includes countries such as Brazil, Argentina, China, Turkey, and Greece that seem to have no extremes within health, electricity, life expectancy, internet, etc but also no extremes in fertility rate and infant mortality. Cluster 2 contains countries such as Ethiopia, Uganda, Tanzania, Cameroon, and Sudan which have very high infant mortality rates and a high fertility rate and are also low on many of the well developed country features such as overall health, electricity, life expectancy etc. Cluster 3 for some reason only contains Zimbabwe which visually does not seem to be accurate as it should directly fall within cluster 2 but is also high in infant mortality rate and fertility rate. Cluster 4 contains countries such as United States, Canada, Germany, Japan, and Austrailia. These are more well developed countries that are high in overall health, imports, exports, income, gdp, etc. but also low in fertility rate and infant mortality rate.
C)
Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?
# Load required libraries
library(tidyverse)
library(factoextra)
library(ggrepel)
# Load and filter data
wdi <- read.csv("Data/wdi_extract_clean.csv") %>%
filter(!country %in% c("Ireland", "Singapore", "Luxembourg"))
# Scale numeric data
wdi_num <- wdi %>%
select(-country) %>%
scale()
# Apply k-means clustering with k = 4
set.seed(123)
wdi_k4 <- kmeans(wdi_num, centers = 4, nstart = 25)
wdi$cluster <- factor(wdi_k4$cluster)
# Perform PCA
wdi_pca <- prcomp(wdi_num, center = TRUE, scale. = TRUE)
pca_coords <- as.data.frame(wdi_pca$x)
pca_coords$country <- wdi$country
pca_coords$cluster <- wdi$cluster
# Define color and shape mappings
cluster_colors <- c("1" = "#0073C2FF", "2" = "#EFC000FF", "3" = "#868686FF", "4" = "#CD534CFF")
cluster_shapes <- c("1" = 16, "2" = 17, "3" = 15, "4" = 3)
# Axis labels with explained variance
xlab_pc1 <- paste0("PC1 (", round(summary(wdi_pca)$importance[2,1] * 100, 1), "%)")
ylab_pc2 <- paste0("PC2 (", round(summary(wdi_pca)$importance[2,2] * 100, 1), "%)")
# PCA biplot with variable vectors
fviz_pca_biplot(wdi_pca,
geom.ind = "point",
col.ind = wdi$cluster,
palette = "jco",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = TRUE) +
labs(title = "PCA Biplot of WDI Data with K-means Clusters",
color = "Cluster")# Cluster-specific PCA plots with country labels
plot_cluster <- function(cluster_id) {
pca_coords %>%
filter(cluster == cluster_id) %>%
ggplot(aes(x = PC1, y = PC2, label = country)) +
geom_point(aes(color = cluster, shape = cluster), size = 3) +
geom_text_repel(size = 3.5, max.overlaps = 50) +
scale_color_manual(values = cluster_colors) +
scale_shape_manual(values = cluster_shapes) +
labs(title = paste("Cluster", cluster_id),
x = xlab_pc1, y = ylab_pc2) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
}
# Generate updated cluster plots
plot_cluster(1)plot_cluster(2)plot_cluster(3)plot_cluster(4)# View updated cluster centers
round(wdi_k4$centers, 2) life_expectancy gdp co2 fert_rate health internet infant_mort electricity
1 0.59 0.27 0.80 -0.66 -0.09 0.80 -0.66 0.51
2 1.34 1.99 1.00 -0.76 1.19 1.05 -0.90 0.58
3 0.03 -0.39 -0.25 -0.27 -0.11 0.09 -0.22 0.41
4 -1.40 -0.69 -0.70 1.58 -0.50 -1.48 1.56 -1.65
imports inflation exports income
1 1.52 -0.16 1.73 0.28
2 -0.30 -0.17 0.07 2.02
3 -0.23 -0.08 -0.32 -0.40
4 -0.39 0.40 -0.57 -0.70
# Representative countries from each updated cluster
wdi %>%
group_by(cluster) %>%
slice_head(n = 3) %>%
select(country, cluster)# A tibble: 12 × 2
# Groups: cluster [4]
country cluster
<chr> <fct>
1 Bahrain 1
2 Belarus 1
3 Belgium 1
4 Australia 2
5 Austria 2
6 Canada 2
7 Albania 3
8 Algeria 3
9 Argentina 3
10 Afghanistan 4
11 Angola 4
12 Benin 4
Another cluster was formed within my analysis after removing those countries and the overall clusters seem to be much more accurate. Cluster one now consists of the countries with better overall health, income, gdp, life expectancy, internet, and co2. These countries being United States, Switzerland, Qatar, Australia, etc.
These countries are now separated from now “cluster 1” which is separated by its higher imports and exports.These countries being United Arab Emirates, Belgium, Slovenia, Oman, etc.
There is still a cluster that resides directly in the middle but now there is an arrow that shows inflation meaning that the countries in the middle may not be high in any of the other categories but some of them are higher in inflation. These countries being a large mix but include Thailand, Croatia, Greece, Portugal, Uruguay, China, etc.
The last cluster is still very similar with the cluster now being slightly more extreme with the fertility rates and infant moralities.These countries being Zimbabwe (no longer in its own), Sudan, Lesotha, Rwanda, Chad, etc.