This report is the result of our work on Module 3 — Clustering on the Multivariate Analyis course. Clustering itself is an unsupervised learning method which aims to group data by its cluster based on the similarity of characteristics (high homogenity within cluster and high heterogenity between cluster)
Clustering methods used are:
| Method | Type | Centroid | Distance |
|---|---|---|---|
| K-Means | Partition | Mean | Euclidean |
| K-Medians | Partition | Median | Manhattan |
| DBSCAN | Density | — | Euclidean |
| Mean Shift | Density | Mode | Euclidean |
| Fuzzy C-Means | Partition (Fuzzy) | Mean (Weighted) | Euclidean |
In this work, we decided to use HCV (Hepatitis C Virus) Dataset which can be acquired from UCI Machine Learning Repository. This dataset contains laboratory data from patients with various liver-related conditions, ranging from healthy blood donors to those with cirrhosis.
## Dimension: 615 row x 13 column
As can be seen above, this dataset contains of 615 instances and 13 columns. The 13 columns are:
| Variable | Type | Explanation |
|---|---|---|
| Category | Nominal | Target label (Blood Donor / Hepatitis / etc) |
| Age | Numeric | Patient’s age (year) |
| Sex | Nominal | Sex types (m/f) |
| ALB | Numeric | Albumin (g/dL) |
| ALP | Numeric | Alkaline Phosphatase (U/L) |
| ALT | Numeric | Alanine Aminotransferase (U/L) |
| AST | Numeric | Aspartate Aminotransferase (U/L) |
| BIL | Numeric | Bilirubin (µmol/L) |
| CHE | Numeric | Cholinesterase (kU/L) |
| CHOL | Numeric | Cholesterol (mmol/L) |
| CREA | Numeric | Creatinine (µmol/L) |
| GGT | Numeric | Gamma-Glutamyl Transferase (U/L) |
| PROT | Numeric | Total Protein (g/dL) |
We can see that there is a target column of ‘Category’. This is due to the dataset itself is originally a classification dataset. However, its characteristis is also perfect to be a clustering datasets. Thus, we will drop the target label in later process. There is also column ‘Sex’ which have types of Nominal. This categorical string column is not suitable for clustering so we will also drop it
df_raw %>%
summarise(across(where(is.numeric), ~ skewness(., na.rm = TRUE))) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Skewness") %>%
arrange(desc(Skewness))df_raw %>%
pivot_longer(cols = where(is.numeric), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "#1E88E5", color = "white", alpha = 0.85) +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
labs(title = "Distribution per Feature",
x = "Value", y = "Frequency") +
theme_minimal(base_size = 10) +
theme(strip.text = element_text(face = "bold"))Some features such as CREA, BIL, GGT, and ALT shows distribution that is heavily right-skewed, indicating a presence of extreme outlier which is common characteristics of clinical laboratorium data
count_outliers <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
}
df_raw %>%
summarise(across(where(is.numeric), count_outliers)) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Outlier_Count") %>%
mutate(Percentage = round(Outlier_Count / nrow(df_raw) * 100, 2)) %>%
arrange(desc(Outlier_Count))df_raw %>%
pivot_longer(cols = where(is.numeric), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = "", y = Value, fill = Variable)) +
geom_boxplot(alpha = 0.7, outlier.size = 0.7, outlier.alpha = 0.4) +
facet_wrap(~ Variable, scales = "free_y", nrow = 3) +
labs(title = "Boxplot",
x = NULL,
y = "Value") +
theme_minimal(base_size = 10) +
theme(legend.position = "none",
axis.text.x = element_blank(),
axis.ticks.x = element_blank())As mentioned earlier, our dataset consist of massive outliers. Some column such as GGT, AST, BIL and ALT even score an absurd amount of percentage with more than 5% of its data being outliers.
cor_matrix <- cor(df_raw[, sapply(df_raw, is.numeric)], use = "complete.obs")
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#1E88E5", "white", "#E53935"),
title = "Heatmap Correlation",
ggtheme = theme_minimal(base_size = 10))From the heatmap, we can conclude that there is no things such as multicollinearity problem be found on our dataset. Even though some have moderate score (ALB - PROT, AST - GGE), they still are acceptable.
As stated earlier, we need to do additional work for ‘Category’ and ‘Sex’.
data.frame(
Variable = names(df),
Missing = colSums(is.na(df)),
Percentage = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% filter(Missing > 0)## Missing data founded: 31
As we seen there are missing data in the column of ALB, ALP, ALT, CHOL, and PROT. Even though it is mostly safe to just remove the data, we decided to do imputation. To decide which method is appropriate we need to check the skewness of the data first. According to our EDA earlier, due to most of the data being skewed (mostly even heavily right-skewed), we decided to do median imputation
df <- df %>%
mutate(across(where(is.numeric),
~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
data.frame(
Variable = names(df),
Missing = colSums(is.na(df)),
Percentage = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% filter(Missing > 0)## Missing data founded (after imputation): 0
Clustering itself is a method of measuring numeric distance between variables. If some variables have massive range (i.e Age), the clustering process will be unfair. Thus, scaling the data is needed. According to our EDA earlier, due to the massive outliers percentage, we cannot do regular z-score standardization. There are actually two methods we consider to deal with this: robust scaling, or log transform followed by standard scaling. After a few judgements we decided to do log transform because it reveals meaningful disease stages by compressing extreme outliers into a readable scale, which robust scaler failed to do.
df_transformed <- df %>%
mutate(across(everything(), log1p))
df_scaled <- df_transformed %>%
mutate(across(everything(), scale)) %>%
as.data.frame()To ensure we do well, we need to check the skewness histogram and outlier boxplot again
As we can see there are notable differencies in distribution for all previously-skewed variables such as CREA and BIL now have rather normally distributed. Also, even though several outliers still visible in the boxplot, the percentage of outlier is notably been compressed, especially in GGT.
In Elbow method, we usually do subjective opinion. Such methods is actually good but what if the elbow is rather hard to find? Then, we also need 2nd opinion from a mathematical side, which in this case we use Kneedle. Kneedle point is an objective algorithm that identifies the exact elbow by finding the point on the curve that has the maximum vertical distance from a straight line connecting the first and last points.
set.seed(123)
wss_values <- sapply(1:10, function(k) {
kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})
k_max <- length(wss_values)
y1 <- wss_values[1]
y_end <- wss_values[k_max]
m <- (y_end - y1) / (k_max - 1)
c <- y1 - m * 1
line_y <- m * (1:k_max) + c
distances <- line_y - wss_values
optimal_k <- which.max(distances)
cat("K based on maximum distance (Kneedle point):", optimal_k, "\n")## K based on maximum distance (Kneedle point): 3
ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
geom_line(color = "#1E88E5") +
geom_point(size = 3, color = "#E53935") +
geom_vline(xintercept = optimal_k, linetype = "dashed", color = "darkgray") +
scale_x_continuous(breaks = 1:10) +
labs(title = "Elbow Method",
subtitle = paste("The objective bend (Kneedle point) indicates the ideal cluster is K =", optimal_k),
x = "Number of Clusters (K)", y = "Total Within-Cluster SS") +
theme_minimal()Unlike Elbow, Silhoutte Method is more objective even without additional help. It calculates the average distance between a point and its own cluster compared to the next nearest cluster, providing a score from -1 to 1 that measures how well each object has been classified. K with the highest score won. Totally objective.
set.seed(123)
sil_values <- sapply(2:10, function(k) {
km <- kmeans(df_scaled, centers = k, nstart = 25)
mean(silhouette(km$cluster, dist(df_scaled))[, 3])
})
ggplot(data.frame(k = 2:10, Sil = sil_values), aes(x = k, y = Sil)) +
geom_line(color = "#1E88E5") +
geom_point(size = 3, color = "#E53935") +
scale_x_continuous(breaks = 2:10) +
labs(title = "Silhouette Method",
subtitle = "Highest value won",
x = "Total Cluster (K)", y = "Average Silhouette Width") +
theme_minimal()So, unexpectedly, both methods shown in the module have different result: Elbow says 3 is best, Silhoutte says 2. Personally, we think that 3 is better because, the more the merrier, but we cannot just ignore the massive drop of Silhoutte Score. On the other hand, if we follows Silhoutte Score by 2, the elbow in K=3 earlier is not that significant so it’s not as risky. Thus, we choose to have 2 as our optimal cluster
\(k = 2\)
K-Means works by: (1) initializing K random centroids, (2) assigning each point to the nearest centroid using Euclidean distance, (3) updating centroids to the mean of the points in their cluster, and (4) repeating until convergence.
set.seed(123)
km_result <- kmeans(df_scaled, centers = K_OPTIMAL,
nstart = 25, iter.max = 100)
cat("Iteration until convergent:", km_result$iter, "\n")## Iteration until convergent: 1
## Ratio Between-SS / Total-SS: 15.19 %
pca_result <- prcomp(df_scaled, scale. = FALSE)
pca_var <- summary(pca_result)$importance[2, 1:2] * 100
pca_km <- data.frame(pca_result$x[, 1:2],
Cluster = factor(km_result$cluster))
ggplot(pca_km, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(alpha = 0.6, size = 1.8) +
stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
scale_fill_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
labs(title = "K-Means Clustering (K = 3)",
subtitle = "2D Projection using PCA",
x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
y = paste0("PC2 (", round(pca_var[2], 1), "%)")) +
theme_minimal()df %>%
mutate(Cluster = factor(km_result$cluster)) %>%
group_by(Cluster) %>%
summarise(across(everything(), ~ round(mean(.), 3)))Even though there are many overlap point between cluster 1 and 2, the method K-Means still doing acceptably well to cluster the data. Almost every variables shows that cluster 3 shows distinction with others, while for cluster 1 and 2 the most and probably the only variables that can distinct both is ALT
K-Medians is a variant of K-Means with two primary differences: use median as centroid instead of median and Manhattan distance (\(L_1\) norm) instead of Euclidean. These changes making it more robust against outliers, even though it increases the overall computational complexity. The process are all same though.w
set.seed(123)
kmed_result <- kcca(df_scaled, k = K_OPTIMAL,
family = kccaFamily("kmedians"),
control = list(iter.max = 100))
kmed_clusters <- clusters(kmed_result)
data.frame(
Cluster = names(table(kmed_clusters)),
Size = as.integer(table(kmed_clusters))
)pca_result <- prcomp(df_scaled, scale. = FALSE)
pca_df <- as.data.frame(pca_result$x)
pca_df$Cluster <- as.factor(kmed_clusters)
pca_var <- summary(pca_result)$importance[2, 1:5] * 100
ggplot(pca_df, aes(x = PC2, y = PC3, color = Cluster)) +
geom_point(alpha = 0.6, size = 1.8) +
stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
scale_fill_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
labs(
title = "K-Medians Clustering (K = 3)",
subtitle = "2D Projection using PCA (PC2 vs PC3)",
x = paste0("PC2 (", round(pca_var[1], 1), "%)"),
y = paste0("PC3 (", round(pca_var[2], 1), "%)")
) +
theme_minimal()df %>%
mutate(Cluster = factor(kmed_clusters)) %>%
group_by(Cluster) %>%
summarise(across(everything(), ~ round(median(.), 3)))Unlike the K-Means which exist clusters that shows distinction, in K-Medians method lot of the cluster overlap between each others. This indicating that K-Medians is less influenced by the extreme outliers in the sparse disease tail, causing its mathematical boundaries to blur across the dense core of the data.
DBSCAN didnt require the K to be determined from beginning, instead it will automatically find the K by the density of the data around it. Parameter” \(ε\) (eps) as search radius, MinPts as minimum point int radius \(ε\). Data point that failed the method will got labeled as noise (0).
As mentioned earlier, we need to find \(ε\) (eps) first, which in order to do that we needs to assign the value of MinPts first. According to the module, the general rule of thumbs is \(MinPts >= D+1\), which is in our case then it would be \(12+1=13\)
k_vals_all <- sort(kNNdist(df_scaled, k = MINPTS - 1))
trim_level <- 0.999
n_trimmed <- floor(length(k_vals_all) * trim_level)
k_vals <- k_vals_all[1:n_trimmed]
n_pts <- length(k_vals)
x_coords <- 1:n_pts
y_start <- k_vals[1]
y_end <- k_vals[n_pts]
slope <- (y_end - y_start) / (n_pts - 1)
intercept <- y_start - slope * 1
expected_y <- slope * x_coords + intercept
kneedle_eps <- k_vals[which.max(expected_y - k_vals)]
test_eps <- seq(kneedle_eps, 0.5, by = -0.05)
final_eps <- kneedle_eps # Fallback
for (e in test_eps) {
tmp_db <- dbscan::dbscan(df_scaled, eps = e, minPts = MINPTS)
n_cl <- length(unique(tmp_db$cluster[tmp_db$cluster > 0]))
if (n_cl > 1) {
final_eps <- e
break
}
}
db_result <- dbscan::dbscan(df_scaled, eps = final_eps, minPts = MINPTS)
cat("Kneedle EPS was:", round(kneedle_eps, 3), "| Final Split EPS used:", round(final_eps, 3), "\n")## Kneedle EPS was: 2.87 | Final Split EPS used: 1.52
## DBSCAN clustering for 615 objects.
## Parameters: eps = 1.52011140635023, minPts = 12
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 480 noise points.
##
## 0 1 2
## 480 127 8
##
## Available fields: cluster, eps, minPts, metric, borderPoints
kNNdistplot(df_scaled, k = MINPTS - 1)
abline(h = final_eps, col = "blue", lty = 2, lwd = 2)
title(main = "k-Distance Plot: Forced Split Selection")
legend("topleft", legend = paste("Split eps =", round(final_eps, 3)), col = "blue", lty = 2)As we can see above, the optimal eps value is 1.52
set.seed(123)
db_result <- dbscan::dbscan(df_scaled, eps = EPS_VALUE, minPts = MINPTS)
n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise <- sum(db_result$cluster == 0)
as.data.frame(table(db_result$cluster)) %>%
rename(Cluster = Var1, Size = Freq)pca_data <- prcomp(df_scaled)$x[, 1:2] %>% as.data.frame()
pca_data$Cluster <- as.factor(db_result$cluster)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c("0" = "grey70", "1" = "#E41A1C", "2" = "#377EB8", "3" = "#429871")) +
theme_minimal() +
labs(title = "DBSCAN Projection (PCA)",
subtitle = "Grey points (0) are noise",
x = "PC 1", y = "PC 2")Sadly, DBScan performs worse. With 2 clusters but 78% noises is just unacceptable. However, this is rather expected by us because the use of log transform creates extreme variations in spatial density, effectively breaking the core assumption of DBSCAN that valid clusters maintain a relatively uniform density throughout the feature space.
Mean Shift is a non-parametric clustering algorithm that uses kernel density estimation to iteratively shift data points toward the local maxima of a feature space. The process relies on a bandwidth parameter to define the gradient search window, allowing the algorithm to partition data without requiring a predefined number of clusters. All points converging to the same local mode are assigned to a single cluster, effectively isolating dense cores while naturally incorporating low-density regions.
library(meanShiftR)
library(cluster)
library(dplyr)
quantiles_to_test <- seq(0.05, 0.30, by = 0.05)
best_sil <- -1
best_bw <- NULL
for (q in quantiles_to_test) {
bw <- quantile(dist(df_scaled), q)
ms_temp <- meanShiftR::meanShift(
as.matrix(df_scaled),
bandwidth = rep(bw, ncol(df_scaled)),
nNeighbors = 50
)
n_clusters <- length(unique(ms_temp$assignment))
if (n_clusters > 1 && n_clusters < nrow(df_scaled)) {
sil_score <- mean(silhouette(ms_temp$assignment, dist(df_scaled))[, 3])
if (sil_score > best_sil) {
best_sil <- sil_score
best_bw <- bw
}
}
}
micro_bw <- quantile(dist(df_scaled), 0.02)
cat("Forced Micro-Bandwidth:", round(micro_bw, 3), "\n")## Forced Micro-Bandwidth: 1.902
ms_forced <- meanShiftR::meanShift(
as.matrix(df_scaled),
bandwidth = rep(micro_bw, ncol(df_scaled)),
nNeighbors = 50
)
data.frame(
Cluster = names(table(ms_forced$assignment)),
Size = as.integer(table(ms_forced$assignment))
)library(ggplot2)
library(dplyr)
pca_ms <- prcomp(df_scaled)$x[, 1:2] %>% as.data.frame()
pca_ms$MS_Cluster <- as.factor(ms_forced$assignment)
ggplot(pca_ms, aes(x = PC1, y = PC2, color = MS_Cluster)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("1" = "#E74C3C")) +
theme_minimal() +
labs(title = "Mean Shift Result",
subtitle = "The entire dataset is drawn to a single density peak.",
x = "Principal Component 1", y = "Principal Component 2") +
theme(legend.position = "bottom")This is the worst of all. Mean shift failed to detect any clusters at all. Our assumption is, in addition to the use of log transform, mean shift acts this way because it strictly relies on finding distinct density peaks (local maxima) separated by deep valleys. The log transform compressed the extreme values of the disease tail just enough to smooth over any minor bumps, leaving a single, continuous uphill density gradient that pulled every single patient into the massive healthy core.
Fuzzy C-Means is a soft clustering algorithm that allows data points to belong to multiple clusters simultaneously by assigning a membership probability to each group. It iteratively minimizes a mathematical objective function to update cluster centroids and membership grades based on the spatial distance between points and centers. This probabilistic approach is highly effective for overlapping clinical datasets because it quantifies the exact degree of belongingness rather than forcing a rigid boundary.
library(e1071)
fcm_result <- e1071::cmeans(
as.matrix(df_scaled),
centers = K_OPTIMAL,
iter.max = 100,
m = 3
)
fcm_clusters <- fcm_result$cluster
data.frame(
Cluster = names(table(fcm_clusters)),
Size = as.integer(table(fcm_clusters))
)library(ggplot2)
library(dplyr)
pca_fcm <- prcomp(df_scaled)$x[, 1:2] %>% as.data.frame()
pca_fcm$FCM_Cluster <- as.factor(fcm_result$cluster)
hull_data_fcm <- pca_fcm %>%
group_by(FCM_Cluster) %>%
slice(chull(PC1, PC2))
ggplot(pca_fcm, aes(x = PC1, y = PC2, color = FCM_Cluster)) +
geom_polygon(data = hull_data_fcm, aes(fill = FCM_Cluster), alpha = 0.3, linetype = "dashed") +
geom_point(alpha = 0.7, size = 2) +
scale_color_manual(values = c("1" = "#2C3E50", "2" = "#E74C3C", "3" = "#27AE60", "4" = "#F39C12")) +
scale_fill_manual(values = c("1" = "#2C3E50", "2" = "#E74C3C", "3" = "#27AE60", "4" = "#F39C12")) +
theme_minimal() +
labs(title = "Fuzzy C-Means Clustering Result",
subtitle = "Showing the highest probability cluster assignment for each patient.",
x = "Principal Component 1", y = "Principal Component 2",
color = "Cluster",
fill = "Cluster") +
theme(legend.position = "bottom")df %>%
mutate(Cluster = factor(fcm_clusters)) %>%
group_by(Cluster) %>%
summarise(across(everything(), ~ round(mean(.), 3)))A fresh air from the previous horrondeus run from density-based method. Fuzzy C-Means perfoming above average (in our standard). Altough variables seems to overlap, some of them such as ALT, ALP, and GGT shows high differencies between each others, becoming the main source of distinction in clustering process
library(mclust)
library(aricode)
library(fpc)
actual_labels <- df_raw$Category
dist_matrix <- dist(df_scaled)
sil_km <- mean(silhouette(km_result$cluster, dist_matrix)[, 3])
sil_kmed <- mean(silhouette(kmed_clusters, dist_matrix)[, 3])
sil_fcm <- mean(silhouette(fcm_clusters, dist_matrix)[, 3])
st_km <- cluster.stats(dist_matrix, km_result$cluster, silhouette = FALSE)
st_kmed <- cluster.stats(dist_matrix, kmed_clusters, silhouette = FALSE)
st_fcm <- cluster.stats(dist_matrix, fcm_clusters, silhouette = FALSE)
n_clusters_ms <- length(unique(ms_forced$assignment))
ms_labels <- as.vector(ms_forced$assignment)
if (n_clusters_ms > 1 && n_clusters_ms < nrow(df_scaled)) {
sil_ms <- mean(silhouette(ms_labels, dist_matrix)[, 3])
st_ms <- tryCatch(
cluster.stats(dist_matrix, ms_labels, silhouette = FALSE),
error = function(e) NULL
)
} else {
sil_ms <- NA
st_ms <- NULL
}
db_mask <- db_result$cluster != 0
db_nonoise <- db_result$cluster[db_mask]
df_nonoise <- df_scaled[db_mask, ]
n_clusters_db_valid <- length(unique(db_nonoise))
if (n_clusters_db_valid > 1 && n_clusters_db_valid < nrow(df_nonoise)) {
sil_db <- mean(silhouette(db_nonoise, dist(df_nonoise))[, 3])
st_db <- tryCatch(
cluster.stats(dist(df_nonoise), db_nonoise, silhouette = FALSE),
error = function(e) NULL
)
} else {
sil_db <- NA
st_db <- NULL
}
labels_vec <- as.vector(unlist(actual_labels))
ari_km <- adjustedRandIndex(as.vector(km_result$cluster), labels_vec)
ari_kmed <- adjustedRandIndex(as.vector(kmed_clusters), labels_vec)
ari_db <- adjustedRandIndex(as.vector(db_result$cluster), labels_vec)
ari_ms <- adjustedRandIndex(as.vector(ms_forced$assignment), labels_vec)
ari_fcm <- adjustedRandIndex(as.vector(fcm_clusters), labels_vec)
nmi_km <- NMI(as.vector(km_result$cluster), labels_vec)
nmi_kmed <- NMI(as.vector(kmed_clusters), labels_vec)
nmi_db <- NMI(as.vector(db_result$cluster), labels_vec)
nmi_ms <- NMI(as.vector(ms_forced$assignment), labels_vec)
nmi_fcm <- NMI(as.vector(fcm_clusters), labels_vec)
safe_dunn <- function(st_obj) {
if (is.list(st_obj) && "dunn" %in% names(st_obj)) {
return(st_obj$dunn)
} else {
return(NA_real_)
}
}
tibble(
Method = c("K-Means", "K-Medians", "DBSCAN", "Mean Shift", "Fuzzy C-Means"),
`Clusters` = c(K_OPTIMAL, K_OPTIMAL, n_clusters_db, n_clusters_ms, length(unique(fcm_clusters))),
Silhouette = round(c(sil_km, sil_kmed, sil_db, sil_ms, sil_fcm), 4),
`Dunn Index` = round(c(safe_dunn(st_km),
safe_dunn(st_kmed),
safe_dunn(st_db),
safe_dunn(st_ms),
safe_dunn(st_fcm)), 4),
`ARI Score` = round(c(ari_km, ari_kmed, ari_db, ari_ms, ari_fcm), 4),
`NMI Score` = round(c(nmi_km, nmi_kmed, nmi_db, nmi_ms, nmi_fcm), 4)
)par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
plot(silhouette(km_result$cluster, dist_matrix),
col = c("#E53935","#1E88E5","#43A047"),
border = NA, main = "Silhouette K-Means")
plot(silhouette(kmed_clusters, dist_matrix),
col = c("#E53935","#1E88E5","#43A047"),
border = NA, main = "Silhouette K-Medians")
n_clusters_ms <- length(unique(ms_forced$assignment))
if (n_clusters_ms > 1 && n_clusters_ms < nrow(df_scaled)) {
ms_colors <- colorRampPalette(c("#E53935", "#1E88E5", "#43A047"))(n_clusters_ms)
plot(silhouette(as.vector(ms_forced$assignment), dist_matrix),
col = ms_colors,
border = NA, main = "Silhouette Mean Shift")
} else {
plot(1, type = "n", axes = FALSE, xlab = "", ylab = "",
main = "Silhouette Mean Shift")
text(1, 1, "Only 1 Cluster Detected\n(No Silhouette Plot)", cex = 1.2)
}
plot(silhouette(fcm_clusters, dist_matrix),
col = c("#E53935","#1E88E5","#43A047", "#8E24AA"),
border = NA, main = "Silhouette Fuzzy C-Means")
if (n_clusters_db_valid > 1 && n_clusters_db_valid < nrow(df_nonoise)) {
db_colors <- colorRampPalette(c("#E53935", "#1E88E5", "#43A047"))(n_clusters_db_valid)
plot(silhouette(db_nonoise, dist(df_nonoise)),
col = db_colors,
border = NA, main = "Silhouette DBSCAN (No Noise)")
} else {
plot(1, type = "n", axes = FALSE, xlab = "", ylab = "",
main = "Silhouette DBSCAN")
text(1, 1, "Not Enough Valid Clusters\n(No Silhouette Plot)", cex = 1.2)
}
par(mfrow = c(1, 1))pca_compare <- data.frame(
pca_result$x[, 1:2],
KMeans = factor(km_result$cluster),
KMedians = factor(kmed_clusters),
DBSCAN = factor(db_result$cluster),
MeanShift = factor(ms_forced$assignment),
FuzzyCM = factor(fcm_clusters)
)
pca_x <- paste0("PC1 (", round(pca_var[1], 1), "%)")
pca_y <- paste0("PC2 (", round(pca_var[2], 1), "%)")
p1 <- ggplot(pca_compare, aes(PC1, PC2, color = KMeans)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
labs(title = "K-Means", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
p2 <- ggplot(pca_compare, aes(PC1, PC2, color = KMedians)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
labs(title = "K-Medians", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
p3 <- ggplot(pca_compare, aes(PC1, PC2, color = DBSCAN)) +
geom_point(alpha = 0.7, size = 1.5) +
labs(title = "DBSCAN", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
p4 <- ggplot(pca_compare, aes(PC1, PC2, color = MeanShift)) +
geom_point(alpha = 0.7, size = 1.5) +
labs(title = "Mean Shift", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
p5 <- ggplot(pca_compare, aes(PC1, PC2, color = FuzzyCM)) +
geom_point(alpha = 0.7, size = 1.5) +
scale_color_manual(values = c("#E53935","#1E88E5","#43A047", "#8E24AA")) +
labs(title = "Fuzzy C-Means", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
grid.arrange(p1, p2, p3, p4, p5, ncol = 3,
top = "Clustering Results Comparison PCA Projection")To be fair, most of the algorithms struggle with the dataset. K-Means performed the best out of the group, achieving an ARI of 0.4788 and an NMI of 0.2464. As the highest on the board, an ARI near 0.48 indicates a strong agreement with the actual true categories. Its Silhouette score of 0.4991 also confirms that the clusters it formed are distinct and tight.
Meanwhile, for the other two distance-based, they failed completely. Even after adjusting Fuzzy C-Means to two clusters, its ARI is close to zero, meaning its soft boundaries caused it to lose the pattern entirely and act no better than random chance.
For the density-based, it just sad. For DBSCAN, scoring negative in ARI indicates that the dense regions DBSCAN found completely contradict the actual ground-truth labels. Mean Shift assigned every single data point into exactly 1 cluster, that’s most unnacceptable.
Because K-Means is the only algorithm succeeding across the different clustering philosophies (centroid, median, density, and soft clustering), it is clear the data geometry relies heavily on variance. The success of K-Means compared to the low scores across the rest of the board indicates that the data points have clear, spherical boundaries separating the true categories, which the median, density, and soft-clustering algorithms simply fail to capture.