1 Introduction

This report is the result of our work on Module 3 — Clustering on the Multivariate Analyis course. Clustering itself is an unsupervised learning method which aims to group data by its cluster based on the similarity of characteristics (high homogenity within cluster and high heterogenity between cluster)

Clustering methods used are:

Method	Type	Centroid	Distance
K-Means	Partition	Mean	Euclidean
K-Medians	Partition	Median	Manhattan
DBSCAN	Density	—	Euclidean
Mean Shift	Density	Mode	Euclidean
Fuzzy C-Means	Partition (Fuzzy)	Mean (Weighted)	Euclidean

2 Dataset

In this work, we decided to use HCV (Hepatitis C Virus) Dataset which can be acquired from UCI Machine Learning Repository. This dataset contains laboratory data from patients with various liver-related conditions, ranging from healthy blood donors to those with cirrhosis.

df_raw <- read.csv("hcvdat0.csv", sep = ",", row.names = 1, stringsAsFactors = FALSE)

head(df_raw, 8)

## Dimension: 615 row x 13 column

As can be seen above, this dataset contains of 615 instances and 13 columns. The 13 columns are:

Variable	Type	Explanation
Category	Nominal	Target label (Blood Donor / Hepatitis / etc)
Age	Numeric	Patient’s age (year)
Sex	Nominal	Sex types (m/f)
ALB	Numeric	Albumin (g/dL)
ALP	Numeric	Alkaline Phosphatase (U/L)
ALT	Numeric	Alanine Aminotransferase (U/L)
AST	Numeric	Aspartate Aminotransferase (U/L)
BIL	Numeric	Bilirubin (µmol/L)
CHE	Numeric	Cholinesterase (kU/L)
CHOL	Numeric	Cholesterol (mmol/L)
CREA	Numeric	Creatinine (µmol/L)
GGT	Numeric	Gamma-Glutamyl Transferase (U/L)
PROT	Numeric	Total Protein (g/dL)

We can see that there is a target column of ‘Category’. This is due to the dataset itself is originally a classification dataset. However, its characteristis is also perfect to be a clustering datasets. Thus, we will drop the target label in later process. There is also column ‘Sex’ which have types of Nominal. This categorical string column is not suitable for clustering so we will also drop it

3 Exploratory Data Analysis (EDA)

3.1 Class Distribution

imbalance_check <- df_raw |> 
  dplyr::group_by(Category) |> 
  dplyr::summarise(Count = n()) |> 
  dplyr::mutate(Percentage = round(Count / sum(Count) * 100, 2))

print(imbalance_check)

## # A tibble: 5 × 3
##   Category               Count Percentage
##   <chr>                  <int>      <dbl>
## 1 0=Blood Donor            533      86.7 
## 2 0s=suspect Blood Donor     7       1.14
## 3 1=Hepatitis               24       3.9 
## 4 2=Fibrosis                21       3.41
## 5 3=Cirrhosis               30       4.88

Such heavy class imbalancies, even though may looks like it is ok because we do unsupervised learning, is actually sign of warning. This indicating that most of the data will be heavily clustered into one. Dominant clusters pull centroids toward their mass, often swallowing smaller groups to minimize global variance. Meanwhile, sparse minority points are frequently dismissed as noise by density-based models, with internal metrics masking this failure by only reflecting majority cohesion.

3.2 Variable Distribution

df_raw %>%
  summarise(across(where(is.numeric), ~ skewness(., na.rm = TRUE))) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Skewness") %>%
  arrange(desc(Skewness))

df_raw %>%
  pivot_longer(cols = where(is.numeric), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#1E88E5", color = "white", alpha = 0.85) +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  labs(title = "Distribution per Feature",
       x = "Value", y = "Frequency") +
  theme_minimal(base_size = 10) +
  theme(strip.text = element_text(face = "bold"))

Some features such as CREA, BIL, GGT, and ALT shows distribution that is heavily right-skewed, indicating a presence of extreme outlier which is common characteristics of clinical laboratorium data

3.3 Outlier Identification

count_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  
  sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
}

df_raw %>%
  summarise(across(where(is.numeric), count_outliers)) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Outlier_Count") %>%
  mutate(Percentage = round(Outlier_Count / nrow(df_raw) * 100, 2)) %>%
  arrange(desc(Outlier_Count))

df_raw %>%
  pivot_longer(cols = where(is.numeric), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = "", y = Value, fill = Variable)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.7, outlier.alpha = 0.4) +
  facet_wrap(~ Variable, scales = "free_y", nrow = 3) +
  labs(title = "Boxplot",
       x = NULL, 
       y = "Value") +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none",
        axis.text.x = element_blank(), 
        axis.ticks.x = element_blank())

As mentioned earlier, our dataset consist of massive outliers. Some column such as GGT, AST, BIL and ALT even score an absurd amount of percentage with more than 5% of its data being outliers.

3.4 Correlation Heatmap

cor_matrix <- cor(df_raw[, sapply(df_raw, is.numeric)], use = "complete.obs")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#1E88E5", "white", "#E53935"),
           title    = "Heatmap Correlation",
           ggtheme  = theme_minimal(base_size = 10))

From the heatmap, we can conclude that there is no things such as multicollinearity problem be found on our dataset. Even though some have moderate score (ALB - PROT, AST - GGE), they still are acceptable.

4 Preprocessing

4.1 Drop Categorical

As stated earlier, we need to do additional work for ‘Category’ and ‘Sex’.

df <- df_raw %>% select(-Category, -Sex)

4.2 Handling Missing Data

data.frame(
  Variable = names(df),
  Missing = colSums(is.na(df)),
  Percentage = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% filter(Missing > 0)

## Missing data founded: 31

As we seen there are missing data in the column of ALB, ALP, ALT, CHOL, and PROT. Even though it is mostly safe to just remove the data, we decided to do imputation. To decide which method is appropriate we need to check the skewness of the data first. According to our EDA earlier, due to most of the data being skewed (mostly even heavily right-skewed), we decided to do median imputation

df <- df %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

data.frame(
  Variable = names(df),
  Missing = colSums(is.na(df)),
  Percentage = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% filter(Missing > 0)

## Missing data founded (after imputation): 0

4.3 Data Scaling

Clustering itself is a method of measuring numeric distance between variables. If some variables have massive range (i.e Age), the clustering process will be unfair. Thus, scaling the data is needed. According to our EDA earlier, due to the massive outliers percentage, we cannot do regular z-score standardization. There are actually two methods we consider to deal with this: robust scaling, or log transform followed by standard scaling. After a few judgements we decided to do log transform because it reveals meaningful disease stages by compressing extreme outliers into a readable scale, which robust scaler failed to do.

df_transformed <- df %>%
  mutate(across(everything(), log1p))

df_scaled <- df_transformed %>%
  mutate(across(everything(), scale)) %>%
  as.data.frame()

To ensure we do well, we need to check the skewness histogram and outlier boxplot again

As we can see there are notable differencies in distribution for all previously-skewed variables such as CREA and BIL now have rather normally distributed. Also, even though several outliers still visible in the boxplot, the percentage of outlier is notably been compressed, especially in GGT.

5 Optimal cluster search

5.1 Elbow Method

In Elbow method, we usually do subjective opinion. Such methods is actually good but what if the elbow is rather hard to find? Then, we also need 2nd opinion from a mathematical side, which in this case we use Kneedle. Kneedle point is an objective algorithm that identifies the exact elbow by finding the point on the curve that has the maximum vertical distance from a straight line connecting the first and last points.

set.seed(123)

wss_values <- sapply(1:10, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

k_max <- length(wss_values)
y1 <- wss_values[1]
y_end <- wss_values[k_max]

m <- (y_end - y1) / (k_max - 1)
c <- y1 - m * 1

line_y <- m * (1:k_max) + c
distances <- line_y - wss_values
optimal_k <- which.max(distances)

cat("K based on maximum distance (Kneedle point):", optimal_k, "\n")

## K based on maximum distance (Kneedle point): 3

ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
  geom_line(color = "#1E88E5") +
  geom_point(size = 3, color = "#E53935") +
  geom_vline(xintercept = optimal_k, linetype = "dashed", color = "darkgray") +
  scale_x_continuous(breaks = 1:10) +
  labs(title    = "Elbow Method",
       subtitle = paste("The objective bend (Kneedle point) indicates the ideal cluster is K =", optimal_k),
       x = "Number of Clusters (K)", y = "Total Within-Cluster SS") +
  theme_minimal()

5.2 Silhouette Method

Unlike Elbow, Silhoutte Method is more objective even without additional help. It calculates the average distance between a point and its own cluster compared to the next nearest cluster, providing a score from -1 to 1 that measures how well each object has been classified. K with the highest score won. Totally objective.

set.seed(123)
sil_values <- sapply(2:10, function(k) {
  km <- kmeans(df_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df_scaled))[, 3])
})

ggplot(data.frame(k = 2:10, Sil = sil_values), aes(x = k, y = Sil)) +
  geom_line(color = "#1E88E5") +
  geom_point(size = 3, color = "#E53935") +
  scale_x_continuous(breaks = 2:10) +
  labs(title    = "Silhouette Method",
       subtitle = "Highest value won",
       x = "Total Cluster (K)", y = "Average Silhouette Width") +
  theme_minimal()

So, unexpectedly, both methods shown in the module have different result: Elbow says 3 is best, Silhoutte says 2. Personally, we think that 3 is better because, the more the merrier, but we cannot just ignore the massive drop of Silhoutte Score. On the other hand, if we follows Silhoutte Score by 2, the elbow in K=3 earlier is not that significant so it’s not as risky. Thus, we choose to have 2 as our optimal cluster

\(k = 2\)

K_OPTIMAL <- 2

6 K-Means

6.1 Algorithm

K-Means works by: (1) initializing K random centroids, (2) assigning each point to the nearest centroid using Euclidean distance, (3) updating centroids to the mean of the points in their cluster, and (4) repeating until convergence.

6.2 Result of Clustering

set.seed(123)
km_result <- kmeans(df_scaled, centers = K_OPTIMAL,
                    nstart = 25, iter.max = 100)

cat("Iteration until convergent:", km_result$iter, "\n")

## Iteration until convergent: 1

cat("Ratio Between-SS / Total-SS:",
    round(km_result$betweenss / km_result$totss * 100, 2), "%\n\n")

## Ratio Between-SS / Total-SS: 15.19 %

data.frame(
  Cluster = 1:K_OPTIMAL,
  size    = km_result$size,
  WSS     = round(km_result$withinss, 2)
)

6.3 Visualization

pca_result <- prcomp(df_scaled, scale. = FALSE)
pca_var    <- summary(pca_result)$importance[2, 1:2] * 100

pca_km <- data.frame(pca_result$x[, 1:2],
                     Cluster = factor(km_result$cluster))

ggplot(pca_km, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6, size = 1.8) +
  stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
  scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  scale_fill_manual(values  = c("#E53935", "#1E88E5", "#43A047")) +
  labs(title    = "K-Means Clustering (K = 3)",
       subtitle = "2D Projection using PCA",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)")) +
  theme_minimal()

6.4 Cluster Profile

df %>%
  mutate(Cluster = factor(km_result$cluster)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), ~ round(mean(.), 3)))

6.5 Interpretation

Even though there are many overlap point between cluster 1 and 2, the method K-Means still doing acceptably well to cluster the data. Almost every variables shows that cluster 3 shows distinction with others, while for cluster 1 and 2 the most and probably the only variables that can distinct both is ALT

7 K-Medians Clustering

7.1 Algoritma

K-Medians is a variant of K-Means with two primary differences: use median as centroid instead of median and Manhattan distance (\(L_1\) norm) instead of Euclidean. These changes making it more robust against outliers, even though it increases the overall computational complexity. The process are all same though.w

7.2 Result of Clustering

set.seed(123)
kmed_result <- kcca(df_scaled, k = K_OPTIMAL,
                    family  = kccaFamily("kmedians"),
                    control = list(iter.max = 100))

kmed_clusters <- clusters(kmed_result)

data.frame(
  Cluster = names(table(kmed_clusters)),
  Size  = as.integer(table(kmed_clusters))
)

7.3 Visualisasi

pca_result <- prcomp(df_scaled, scale. = FALSE)

pca_df <- as.data.frame(pca_result$x)
pca_df$Cluster <- as.factor(kmed_clusters)

pca_var <- summary(pca_result)$importance[2, 1:5] * 100

ggplot(pca_df, aes(x = PC2, y = PC3, color = Cluster)) +
  geom_point(alpha = 0.6, size = 1.8) +
  stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
  scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  scale_fill_manual(values  = c("#E53935", "#1E88E5", "#43A047")) +
  labs(
    title    = "K-Medians Clustering (K = 3)",
    subtitle = "2D Projection using PCA (PC2 vs PC3)",
    x = paste0("PC2 (", round(pca_var[1], 1), "%)"),
    y = paste0("PC3 (", round(pca_var[2], 1), "%)")
  ) +
  theme_minimal()

7.4 Cluster Profile

df %>%
  mutate(Cluster = factor(kmed_clusters)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), ~ round(median(.), 3)))

7.5 Interpretation

Unlike the K-Means which exist clusters that shows distinction, in K-Medians method lot of the cluster overlap between each others. This indicating that K-Medians is less influenced by the extreme outliers in the sparse disease tail, causing its mathematical boundaries to blur across the dense core of the data.

8 DBSCAN Clustering

8.1 Algorithm

DBSCAN didnt require the K to be determined from beginning, instead it will automatically find the K by the density of the data around it. Parameter” \(ε\) (eps) as search radius, MinPts as minimum point int radius \(ε\). Data point that failed the method will got labeled as noise (0).

As mentioned earlier, we need to find \(ε\) (eps) first, which in order to do that we needs to assign the value of MinPts first. According to the module, the general rule of thumbs is \(MinPts >= D+1\), which is in our case then it would be \(12+1=13\)

MINPTS <- 12

8.2 Parameter \(ε\) Search

k_vals_all <- sort(kNNdist(df_scaled, k = MINPTS - 1))
trim_level <- 0.999
n_trimmed <- floor(length(k_vals_all) * trim_level)
k_vals <- k_vals_all[1:n_trimmed]

n_pts <- length(k_vals)
x_coords <- 1:n_pts
y_start <- k_vals[1]
y_end <- k_vals[n_pts]
slope <- (y_end - y_start) / (n_pts - 1)
intercept <- y_start - slope * 1
expected_y <- slope * x_coords + intercept
kneedle_eps <- k_vals[which.max(expected_y - k_vals)]

test_eps <- seq(kneedle_eps, 0.5, by = -0.05)
final_eps <- kneedle_eps # Fallback

for (e in test_eps) {
  tmp_db <- dbscan::dbscan(df_scaled, eps = e, minPts = MINPTS)
  n_cl <- length(unique(tmp_db$cluster[tmp_db$cluster > 0]))
  
  if (n_cl > 1) {
    final_eps <- e
    break
  }
}

db_result <- dbscan::dbscan(df_scaled, eps = final_eps, minPts = MINPTS)
cat("Kneedle EPS was:", round(kneedle_eps, 3), "| Final Split EPS used:", round(final_eps, 3), "\n")

## Kneedle EPS was: 2.87 | Final Split EPS used: 1.52

print(db_result)

## DBSCAN clustering for 615 objects.
## Parameters: eps = 1.52011140635023, minPts = 12
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 480 noise points.
## 
##   0   1   2 
## 480 127   8 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints

kNNdistplot(df_scaled, k = MINPTS - 1)
abline(h = final_eps, col = "blue", lty = 2, lwd = 2)
title(main = "k-Distance Plot: Forced Split Selection")
legend("topleft", legend = paste("Split eps =", round(final_eps, 3)), col = "blue", lty = 2)

As we can see above, the optimal eps value is 1.52

EPS_VALUE <- 1.52

8.3 Result of Clustering

set.seed(123)
db_result <- dbscan::dbscan(df_scaled, eps = EPS_VALUE, minPts = MINPTS)

n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise       <- sum(db_result$cluster == 0)

as.data.frame(table(db_result$cluster)) %>%
  rename(Cluster = Var1, Size = Freq)

8.4 Visualization

pca_data <- prcomp(df_scaled)$x[, 1:2] %>% as.data.frame()
pca_data$Cluster <- as.factor(db_result$cluster)

ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("0" = "grey70", "1" = "#E41A1C", "2" = "#377EB8", "3" = "#429871")) +
  theme_minimal() +
  labs(title = "DBSCAN Projection (PCA)",
       subtitle = "Grey points (0) are noise",
       x = "PC 1", y = "PC 2")

8.5 Interpretation

Sadly, DBScan performs worse. With 2 clusters but 78% noises is just unacceptable. However, this is rather expected by us because the use of log transform creates extreme variations in spatial density, effectively breaking the core assumption of DBSCAN that valid clusters maintain a relatively uniform density throughout the feature space.

9 Mean Shift Clustering

9.1 Algorithm

Mean Shift is a non-parametric clustering algorithm that uses kernel density estimation to iteratively shift data points toward the local maxima of a feature space. The process relies on a bandwidth parameter to define the gradient search window, allowing the algorithm to partition data without requiring a predefined number of clusters. All points converging to the same local mode are assigned to a single cluster, effectively isolating dense cores while naturally incorporating low-density regions.

9.2 Result of Clustering

library(meanShiftR)
library(cluster)
library(dplyr)

quantiles_to_test <- seq(0.05, 0.30, by = 0.05)
best_sil <- -1
best_bw <- NULL

for (q in quantiles_to_test) {
  bw <- quantile(dist(df_scaled), q)
  ms_temp <- meanShiftR::meanShift(
    as.matrix(df_scaled), 
    bandwidth = rep(bw, ncol(df_scaled)),
    nNeighbors = 50
  )
  
  n_clusters <- length(unique(ms_temp$assignment))
  
  if (n_clusters > 1 && n_clusters < nrow(df_scaled)) {
    sil_score <- mean(silhouette(ms_temp$assignment, dist(df_scaled))[, 3])
    if (sil_score > best_sil) {
      best_sil <- sil_score
      best_bw <- bw
    }
  }
}

micro_bw <- quantile(dist(df_scaled), 0.02)

cat("Forced Micro-Bandwidth:", round(micro_bw, 3), "\n")

## Forced Micro-Bandwidth: 1.902

ms_forced <- meanShiftR::meanShift(
  as.matrix(df_scaled), 
  bandwidth = rep(micro_bw, ncol(df_scaled)),
  nNeighbors = 50 
)

data.frame(
  Cluster = names(table(ms_forced$assignment)),
  Size  = as.integer(table(ms_forced$assignment))
)

9.3 Visualization

library(ggplot2)
library(dplyr)

pca_ms <- prcomp(df_scaled)$x[, 1:2] %>% as.data.frame()

pca_ms$MS_Cluster <- as.factor(ms_forced$assignment)

ggplot(pca_ms, aes(x = PC1, y = PC2, color = MS_Cluster)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("1" = "#E74C3C")) + 
  theme_minimal() +
  labs(title = "Mean Shift Result",
       subtitle = "The entire dataset is drawn to a single density peak.",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme(legend.position = "bottom")

9.4 Interpretation

This is the worst of all. Mean shift failed to detect any clusters at all. Our assumption is, in addition to the use of log transform, mean shift acts this way because it strictly relies on finding distinct density peaks (local maxima) separated by deep valleys. The log transform compressed the extreme values of the disease tail just enough to smooth over any minor bumps, leaving a single, continuous uphill density gradient that pulled every single patient into the massive healthy core.

10 Fuzzy C-Means

10.1 Algorithm

Fuzzy C-Means is a soft clustering algorithm that allows data points to belong to multiple clusters simultaneously by assigning a membership probability to each group. It iteratively minimizes a mathematical objective function to update cluster centroids and membership grades based on the spatial distance between points and centers. This probabilistic approach is highly effective for overlapping clinical datasets because it quantifies the exact degree of belongingness rather than forcing a rigid boundary.

10.2 Result of Clustering

library(e1071)

fcm_result <- e1071::cmeans(
  as.matrix(df_scaled), 
  centers = K_OPTIMAL, 
  iter.max = 100, 
  m = 3
)

fcm_clusters <- fcm_result$cluster

data.frame(
  Cluster = names(table(fcm_clusters)),
  Size  = as.integer(table(fcm_clusters))
)

10.3 Visualization

library(ggplot2)
library(dplyr)

pca_fcm <- prcomp(df_scaled)$x[, 1:2] %>% as.data.frame()

pca_fcm$FCM_Cluster <- as.factor(fcm_result$cluster)
hull_data_fcm <- pca_fcm %>%
  group_by(FCM_Cluster) %>%
  slice(chull(PC1, PC2))
ggplot(pca_fcm, aes(x = PC1, y = PC2, color = FCM_Cluster)) +
  geom_polygon(data = hull_data_fcm, aes(fill = FCM_Cluster), alpha = 0.3, linetype = "dashed") +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_manual(values = c("1" = "#2C3E50", "2" = "#E74C3C", "3" = "#27AE60", "4" = "#F39C12")) + 
  scale_fill_manual(values  = c("1" = "#2C3E50", "2" = "#E74C3C", "3" = "#27AE60", "4" = "#F39C12")) +
  theme_minimal() +
  labs(title = "Fuzzy C-Means Clustering Result",
       subtitle = "Showing the highest probability cluster assignment for each patient.",
       x = "Principal Component 1", y = "Principal Component 2",
       color = "Cluster", 
       fill = "Cluster") + 
  theme(legend.position = "bottom")

10.4 Cluster Profile

df %>%
  mutate(Cluster = factor(fcm_clusters)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), ~ round(mean(.), 3)))

10.5 Interpretation

A fresh air from the previous horrondeus run from density-based method. Fuzzy C-Means perfoming above average (in our standard). Altough variables seems to overlap, some of them such as ALT, ALP, and GGT shows high differencies between each others, becoming the main source of distinction in clustering process

11 Metric Evaluation

11.1 Comparison Table

library(mclust)  
library(aricode) 
library(fpc)     

actual_labels <- df_raw$Category 

dist_matrix <- dist(df_scaled)

sil_km   <- mean(silhouette(km_result$cluster, dist_matrix)[, 3])
sil_kmed <- mean(silhouette(kmed_clusters,     dist_matrix)[, 3])
sil_fcm  <- mean(silhouette(fcm_clusters,      dist_matrix)[, 3])

st_km   <- cluster.stats(dist_matrix, km_result$cluster, silhouette = FALSE)
st_kmed <- cluster.stats(dist_matrix, kmed_clusters,     silhouette = FALSE)
st_fcm  <- cluster.stats(dist_matrix, fcm_clusters,      silhouette = FALSE)

n_clusters_ms <- length(unique(ms_forced$assignment))
ms_labels <- as.vector(ms_forced$assignment)

if (n_clusters_ms > 1 && n_clusters_ms < nrow(df_scaled)) {
  sil_ms  <- mean(silhouette(ms_labels, dist_matrix)[, 3])
  st_ms   <- tryCatch(
               cluster.stats(dist_matrix, ms_labels, silhouette = FALSE),
               error = function(e) NULL
             )
} else {
  sil_ms  <- NA
  st_ms   <- NULL
}

db_mask    <- db_result$cluster != 0
db_nonoise <- db_result$cluster[db_mask]
df_nonoise <- df_scaled[db_mask, ]
n_clusters_db_valid <- length(unique(db_nonoise))

if (n_clusters_db_valid > 1 && n_clusters_db_valid < nrow(df_nonoise)) {
  sil_db  <- mean(silhouette(db_nonoise, dist(df_nonoise))[, 3])
  st_db   <- tryCatch(
               cluster.stats(dist(df_nonoise), db_nonoise, silhouette = FALSE),
               error = function(e) NULL
             )
} else {
  sil_db  <- NA
  st_db   <- NULL
}

labels_vec <- as.vector(unlist(actual_labels))

ari_km   <- adjustedRandIndex(as.vector(km_result$cluster), labels_vec)
ari_kmed <- adjustedRandIndex(as.vector(kmed_clusters), labels_vec)
ari_db   <- adjustedRandIndex(as.vector(db_result$cluster), labels_vec)
ari_ms   <- adjustedRandIndex(as.vector(ms_forced$assignment), labels_vec)
ari_fcm  <- adjustedRandIndex(as.vector(fcm_clusters), labels_vec)

nmi_km   <- NMI(as.vector(km_result$cluster), labels_vec)
nmi_kmed <- NMI(as.vector(kmed_clusters), labels_vec)
nmi_db   <- NMI(as.vector(db_result$cluster), labels_vec)
nmi_ms   <- NMI(as.vector(ms_forced$assignment), labels_vec)
nmi_fcm  <- NMI(as.vector(fcm_clusters), labels_vec)

safe_dunn <- function(st_obj) {
  if (is.list(st_obj) && "dunn" %in% names(st_obj)) {
    return(st_obj$dunn)
  } else {
    return(NA_real_)
  }
}

tibble(
  Method        = c("K-Means", "K-Medians", "DBSCAN", "Mean Shift", "Fuzzy C-Means"),
  `Clusters`    = c(K_OPTIMAL, K_OPTIMAL, n_clusters_db, n_clusters_ms, length(unique(fcm_clusters))),
  Silhouette    = round(c(sil_km, sil_kmed, sil_db, sil_ms, sil_fcm), 4),
  `Dunn Index`  = round(c(safe_dunn(st_km), 
                          safe_dunn(st_kmed), 
                          safe_dunn(st_db), 
                          safe_dunn(st_ms), 
                          safe_dunn(st_fcm)), 4),
  `ARI Score`   = round(c(ari_km, ari_kmed, ari_db, ari_ms, ari_fcm), 4),
  `NMI Score`   = round(c(nmi_km, nmi_kmed, nmi_db, nmi_ms, nmi_fcm), 4)
)

11.2 Silhouette Plot

par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))

plot(silhouette(km_result$cluster, dist_matrix),
     col = c("#E53935","#1E88E5","#43A047"),
     border = NA, main = "Silhouette K-Means")

plot(silhouette(kmed_clusters, dist_matrix),
     col = c("#E53935","#1E88E5","#43A047"),
     border = NA, main = "Silhouette K-Medians")

n_clusters_ms <- length(unique(ms_forced$assignment))
if (n_clusters_ms > 1 && n_clusters_ms < nrow(df_scaled)) {
  ms_colors <- colorRampPalette(c("#E53935", "#1E88E5", "#43A047"))(n_clusters_ms)
  plot(silhouette(as.vector(ms_forced$assignment), dist_matrix),
       col = ms_colors,
       border = NA, main = "Silhouette Mean Shift")
} else {
  plot(1, type = "n", axes = FALSE, xlab = "", ylab = "", 
       main = "Silhouette Mean Shift")
  text(1, 1, "Only 1 Cluster Detected\n(No Silhouette Plot)", cex = 1.2)
}

plot(silhouette(fcm_clusters, dist_matrix),
     col = c("#E53935","#1E88E5","#43A047", "#8E24AA"),
     border = NA, main = "Silhouette Fuzzy C-Means")

if (n_clusters_db_valid > 1 && n_clusters_db_valid < nrow(df_nonoise)) {
  db_colors <- colorRampPalette(c("#E53935", "#1E88E5", "#43A047"))(n_clusters_db_valid)
  plot(silhouette(db_nonoise, dist(df_nonoise)),
       col = db_colors,
       border = NA, main = "Silhouette DBSCAN (No Noise)")
} else {
  plot(1, type = "n", axes = FALSE, xlab = "", ylab = "", 
       main = "Silhouette DBSCAN")
  text(1, 1, "Not Enough Valid Clusters\n(No Silhouette Plot)", cex = 1.2)
}

par(mfrow = c(1, 1))

11.3 Visual Comparison

pca_compare <- data.frame(
  pca_result$x[, 1:2],
  KMeans    = factor(km_result$cluster),
  KMedians  = factor(kmed_clusters),
  DBSCAN    = factor(db_result$cluster),
  MeanShift = factor(ms_forced$assignment),
  FuzzyCM   = factor(fcm_clusters)
)

pca_x <- paste0("PC1 (", round(pca_var[1], 1), "%)")
pca_y <- paste0("PC2 (", round(pca_var[2], 1), "%)")

p1 <- ggplot(pca_compare, aes(PC1, PC2, color = KMeans)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
  labs(title = "K-Means", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

p2 <- ggplot(pca_compare, aes(PC1, PC2, color = KMedians)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
  labs(title = "K-Medians", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

p3 <- ggplot(pca_compare, aes(PC1, PC2, color = DBSCAN)) +
  geom_point(alpha = 0.7, size = 1.5) +
  labs(title = "DBSCAN", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

p4 <- ggplot(pca_compare, aes(PC1, PC2, color = MeanShift)) +
  geom_point(alpha = 0.7, size = 1.5) +
  labs(title = "Mean Shift", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

p5 <- ggplot(pca_compare, aes(PC1, PC2, color = FuzzyCM)) +
  geom_point(alpha = 0.7, size = 1.5) +
  scale_color_manual(values = c("#E53935","#1E88E5","#43A047", "#8E24AA")) +
  labs(title = "Fuzzy C-Means", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

grid.arrange(p1, p2, p3, p4, p5, ncol = 3,
             top = "Clustering Results Comparison PCA Projection")

11.4 Conclusion

To be fair, most of the algorithms struggle with the dataset. K-Means performed the best out of the group, achieving an ARI of 0.4788 and an NMI of 0.2464. As the highest on the board, an ARI near 0.48 indicates a strong agreement with the actual true categories. Its Silhouette score of 0.4991 also confirms that the clusters it formed are distinct and tight.

Meanwhile, for the other two distance-based, they failed completely. Even after adjusting Fuzzy C-Means to two clusters, its ARI is close to zero, meaning its soft boundaries caused it to lose the pattern entirely and act no better than random chance.

For the density-based, it just sad. For DBSCAN, scoring negative in ARI indicates that the dense regions DBSCAN found completely contradict the actual ground-truth labels. Mean Shift assigned every single data point into exactly 1 cluster, that’s most unnacceptable.

Because K-Means is the only algorithm succeeding across the different clustering philosophies (centroid, median, density, and soft clustering), it is clear the data geometry relies heavily on variance. The success of K-Means compared to the low scores across the rest of the board indicates that the data points have clear, spherical boundaries separating the true categories, which the median, density, and soft-clustering algorithms simply fail to capture.

Module 3 — Clustering: K-Means, K-Medians, DBSCAN, Mean Shift, and Fuzzy C Means

Multivariate Analysis | Sains Data, FMIPA UNESA

Alfin Jayadi (082) & Faishal Muflih (170)

2026-04-03