1 Introduction

1.1 Background

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset is one of the most well-studied benchmarks in medical data mining. It originates from digitized images of fine needle aspirate (FNA) biopsies of breast masses, where ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each of these features is summarized across the nuclei present in the image using three statistics, yielding 30 numeric attributes per sample.

The central challenge this dataset poses is whether unsupervised methods can recover clinically meaningful groups from the morphological measurements alone, without being told which samples are benign and which are malignant. This is not a trivial question. Unsupervised clustering operates purely on the geometric structure of the data, and the degree to which cluster boundaries align with true diagnosis labels reveals how separable the two classes are in the feature space, and how reliably different algorithms can exploit that separability.

1.2 Objectives

This analysis pursues four interconnected goals:

Characterize the distributional and structural properties of the 30 morphological features through exploratory data analysis.
Reduce dimensionality using Principal Component Analysis (PCA) and assess how much information is retained under reduction.
Apply five clustering algorithms spanning distinct algorithmic families: centroid-based partitioning (K-Means, K-Median), density-based partitioning (DBSCAN), non-parametric kernel-based clustering (Mean Shift), and fuzzy membership clustering (Fuzzy C-Means).
Evaluate and compare the five algorithms using both internal validity metrics (Silhouette Score, Dunn Index) and external validity against the true diagnosis labels (Adjusted Rand Index).

1.3 Dataset Description

The dataset is sourced from the UCI Machine Learning Repository and contains 569 samples. Each sample carries a binary diagnosis label: Benign (B) or Malignant (M). The 30 numerical features are organized into three groups of ten, computed as:

Mean: average value across all nuclei in the image
Standard Error (SE): standard error of the mean across nuclei
Worst: the mean of the three largest values across nuclei

No feature engineering is applied beyond what is documented here. The raw feature values span vastly different scales, which motivates standardization prior to clustering.

2 Setup and Data Loading

2.1 Required Libraries

library(ggplot2)
library(tidyverse)
library(reshape2)
library(scales)
library(RColorBrewer)
library(gridExtra)
library(corrplot)
library(factoextra)
library(cluster)
library(fpc)
library(mclust)
library(e1071)
library(dbscan)
library(ks)

color_dx <- c("M" = "#E74C3C", "B" = "#2ECC71")

2.2 Loading and Cleaning the Data

col_names <- c(
  "ID", "diagnosis",
  "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
  "smoothness_mean", "compactness_mean", "concavity_mean",
  "concave_points_mean", "symmetry_mean", "fractal_dimension_mean",
  "radius_se", "texture_se", "perimeter_se", "area_se",
  "smoothness_se", "compactness_se", "concavity_se",
  "concave_points_se", "symmetry_se", "fractal_dimension_se",
  "radius_worst", "texture_worst", "perimeter_worst", "area_worst",
  "smoothness_worst", "compactness_worst", "concavity_worst",
  "concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
)

df_raw <- read.csv("wdbc.data", header = FALSE, col.names = col_names)

cat("Initial dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

## Initial dimensions: 569 rows x 32 columns

df_raw <- df_raw %>% select(-ID)
cat("After removing ID column:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

## After removing ID column: 569 rows x 31 columns

df_features <- df_raw %>% select(-diagnosis)
df_label    <- df_raw$diagnosis

cat("\nDiagnosis Distribution:\n")

## 
## Diagnosis Distribution:

print(table(df_label))

## df_label
##   B   M 
## 357 212

cat("\nProportions (%):\n")

## 
## Proportions (%):

print(round(prop.table(table(df_label)) * 100, 2))

## df_label
##     B     M 
## 62.74 37.26

cat("\nTotal missing values:", sum(is.na(df_raw)), "\n")

## 
## Total missing values: 0

cat("Total duplicates:", sum(duplicated(df_raw)), "\n")

## Total duplicates: 0

The dataset is complete with no missing values and no exact duplicate rows. The class distribution shows 357 benign cases (62.7%) and 212 malignant cases (37.3%), representing a moderate class imbalance that is reflective of clinical screening populations. This asymmetry is worth keeping in mind when interpreting cluster sizes: a perfectly recovered clustering should ideally produce groups with roughly these proportions if the algorithm successfully separates the two diagnoses.

3 Exploratory Data Analysis

3.1 Descriptive Statistics

desc_stats <- data.frame(
  Feature = names(df_features),
  Mean    = round(colMeans(df_features), 4),
  SD      = round(apply(df_features, 2, sd), 4),
  Min     = round(apply(df_features, 2, min), 4),
  Median  = round(apply(df_features, 2, median), 4),
  Max     = round(apply(df_features, 2, max), 4)
)

knitr::kable(
  head(desc_stats, 10),
  caption = "Table 1. Descriptive Statistics for the First 10 Features",
  align   = "lrrrrr"
)

Table 1. Descriptive Statistics for the First 10 Features
	Feature	Mean	SD	Min	Median	Max
radius_mean	radius_mean	14.1273	3.5240	6.9810	13.3700	28.1100
texture_mean	texture_mean	19.2896	4.3010	9.7100	18.8400	39.2800
perimeter_mean	perimeter_mean	91.9690	24.2990	43.7900	86.2400	188.5000
area_mean	area_mean	654.8891	351.9141	143.5000	551.1000	2501.0000
smoothness_mean	smoothness_mean	0.0964	0.0141	0.0526	0.0959	0.1634
compactness_mean	compactness_mean	0.1043	0.0528	0.0194	0.0926	0.3454
concavity_mean	concavity_mean	0.0888	0.0797	0.0000	0.0615	0.4268
concave_points_mean	concave_points_mean	0.0489	0.0388	0.0000	0.0335	0.2012
symmetry_mean	symmetry_mean	0.1812	0.0274	0.1060	0.1792	0.3040
fractal_dimension_mean	fractal_dimension_mean	0.0628	0.0071	0.0500	0.0615	0.0974

The range of feature values is extremely heterogeneous. For instance, area_mean spans values from roughly 143 to 2501, while fractal_dimension_mean oscillates between 0.05 and 0.097. This scale disparity is not simply a matter of units; it reflects the fundamentally different nature of what each feature measures. Without standardization, distance-based algorithms such as K-Means would be dominated almost entirely by area and perimeter, rendering the finer morphological features effectively invisible. Standardization is therefore not optional in this context but a mathematical prerequisite for meaningful clustering.

3.2 Class Distribution

df_raw %>%
  mutate(diagnosis = factor(diagnosis, levels = c("B", "M"),
                            labels = c("Benign (B)", "Malignant (M)"))) %>%
  ggplot(aes(x = diagnosis, fill = diagnosis)) +
  geom_bar(width = 0.6, color = "white") +
  geom_text(stat = "count",
            aes(label = paste0(after_stat(count), "\n(",
                               round(after_stat(count) / nrow(df_raw) * 100, 1), "%)")),
            vjust = -0.3, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Benign (B)" = "#2ECC71", "Malignant (M)" = "#E74C3C")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(title    = "Distribution of Diagnosis Classes",
       subtitle = "WDBC Dataset — UCI ML Repository (n = 569)",
       x = "Diagnosis", y = "Number of Samples") +
  theme_minimal(base_size = 14) +
  theme(legend.position    = "none",
        plot.title         = element_text(face = "bold"),
        panel.grid.major.x = element_blank())

Figure 1. Distribution of diagnosis classes in the WDBC dataset.

3.3 Feature Distributions by Diagnosis

mean_features <- names(df_features)[grepl("_mean$", names(df_features))]
se_features   <- names(df_features)[grepl("_se$",   names(df_features))]

df_long <- df_features %>%
  mutate(diagnosis = df_label) %>%
  pivot_longer(-diagnosis, names_to = "feature", values_to = "value")

df_long %>%
  filter(feature %in% mean_features) %>%
  mutate(feature = gsub("_mean", "", feature)) %>%
  ggplot(aes(x = value, fill = diagnosis, color = diagnosis)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25,
                 alpha = 0.45, position = "identity") +
  geom_density(alpha = 0, linewidth = 0.9) +
  scale_fill_manual(values = color_dx, labels = c("B" = "Benign", "M" = "Malignant")) +
  scale_color_manual(values = color_dx, labels = c("B" = "Benign", "M" = "Malignant")) +
  facet_wrap(~feature, scales = "free", ncol = 5) +
  labs(title    = "Mean Feature Distributions per Diagnosis",
       subtitle = "Histogram with Density Overlay",
       x = "Value", y = "Density",
       fill = "Diagnosis", color = "Diagnosis") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(size = 8, face = "bold"))

Figure 2. Distribution of mean features stratified by diagnosis. Green denotes Benign, red denotes Malignant.

df_long %>%
  filter(feature %in% mean_features) %>%
  mutate(feature = gsub("_mean", "", feature)) %>%
  ggplot(aes(x = diagnosis, y = value, fill = diagnosis)) +
  geom_boxplot(outlier.size = 0.8, outlier.alpha = 0.5, width = 0.6) +
  scale_fill_manual(values = color_dx, labels = c("B" = "Benign", "M" = "Malignant")) +
  facet_wrap(~feature, scales = "free_y", ncol = 5) +
  labs(title    = "Boxplot of Mean Features per Diagnosis",
       x = "Diagnosis", y = "Value", fill = "Diagnosis") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(size = 8, face = "bold"))

Figure 3. Boxplots of mean features per diagnosis group.

The distributional plots reveal a consistent and clinically sensible pattern: malignant tumors tend to have larger, more irregular nuclei. Specifically, features related to size (radius, perimeter, area) and shape irregularity (concavity, compactness, concave points) exhibit clear separation between the two diagnosis groups. Features such as fractal_dimension_mean and smoothness_mean show comparatively less separation, indicating that not all 30 features contribute equally to discriminability. The density overlaps, while present, are not extensive for the most discriminating features, which is an encouraging sign for clustering.

3.4 Correlation Structure

cor_matrix <- cor(df_features)

corrplot(cor_matrix,
         method = "color", type = "upper", order = "hclust",
         tl.cex = 0.60, tl.col = "black",
         col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
         cl.ratio = 0.15,
         title = "Correlation Heatmap — WDBC (30 Features)",
         mar = c(0, 0, 2, 0))

Figure 4. Correlation heatmap of all 30 features, ordered by hierarchical clustering.

cor_mean <- cor(df_features[, mean_features])

corrplot(cor_mean,
         method = "color", type = "upper", order = "hclust",
         addCoef.col = "black", number.cex = 0.75,
         tl.cex = 0.8, tl.col = "black",
         col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
         title = "Correlation — Mean Features (with coefficients)",
         mar = c(0, 0, 2, 0))

Figure 5. Correlation matrix for mean features with coefficient annotations.

The heatmap exposes strong multicollinearity within feature groups. Radius, perimeter, and area mean are almost perfectly correlated with one another (r > 0.99), which is geometrically expected since these three quantities are mathematically related for approximately circular shapes. Compactness, concavity, and concave points form another tightly correlated cluster. When the same pattern repeats across the mean, SE, and worst groups, it becomes clear that the 30 features do not carry 30 independent pieces of information. This motivates the application of PCA not merely as a visualization convenience but as a substantive step in reducing redundancy before clustering.

3.5 Outlier Analysis

outlier_count <- apply(df_features, 2, function(x) {
  Q1 <- quantile(x, 0.25); Q3 <- quantile(x, 0.75)
  IQR_val <- Q3 - Q1
  sum(x < (Q1 - 1.5 * IQR_val) | x > (Q3 + 1.5 * IQR_val))
})

outlier_df <- data.frame(
  feature   = names(outlier_count),
  n_outlier = as.integer(outlier_count),
  group     = case_when(
    grepl("_mean$",  names(outlier_count)) ~ "Mean",
    grepl("_se$",    names(outlier_count)) ~ "SE",
    grepl("_worst$", names(outlier_count)) ~ "Worst"
  )
) %>% arrange(desc(n_outlier))

cat("Outlier summary by feature group:\n")

## Outlier summary by feature group:

print(outlier_df %>% group_by(group) %>%
        summarise(total      = sum(n_outlier),
                  mean_count = round(mean(n_outlier), 1),
                  .groups = "drop"))

## # A tibble: 3 × 3
##   group total mean_count
##   <chr> <int>      <dbl>
## 1 Mean    139       13.9
## 2 SE      315       31.5
## 3 Worst   154       15.4

outlier_df %>%
  mutate(feature = fct_reorder(feature, n_outlier)) %>%
  ggplot(aes(x = feature, y = n_outlier, fill = group)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = n_outlier), hjust = -0.2, size = 2.8) +
  scale_fill_manual(values = c("Mean" = "#3498DB",
                               "SE"   = "#F39C12",
                               "Worst" = "#E74C3C")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Number of Outliers per Feature (IQR Method)",
       x = NULL, y = "Number of Outliers", fill = "Feature Group") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Figure 6. Number of IQR-based outliers per feature, grouped by feature type.

The SE group accumulates the highest total outlier count across features. This finding has a physiological explanation: while most cell nuclei within a biopsy sample are relatively uniform, a small number of nuclei in malignant tumors may deviate dramatically from the mean, inflating the standard error. These extreme SE values are not errors but genuine data: they reflect the heterogeneity of nuclear morphology within malignant tissue. Removing these points would therefore distort the clinical signal rather than improve data quality. The decision here is to retain all observations, as is standard practice for this dataset in the literature.

4 Data Preprocessing: Standardization

df_scaled    <- scale(df_features)
df_scaled_df <- as.data.frame(df_scaled)

cat("Verification on radius_mean:\n")

## Verification on radius_mean:

cat("  Before — Mean:", round(mean(df_features$radius_mean), 4),
    "| SD:", round(sd(df_features$radius_mean), 4), "\n")

##   Before — Mean: 14.1273 | SD: 3.524

cat("  After  — Mean:", round(mean(df_scaled_df$radius_mean), 8),
    "| SD:", round(sd(df_scaled_df$radius_mean), 6), "\n")

##   After  — Mean: 0 | SD: 1

Z-score standardization transforms each feature to have a mean of zero and a standard deviation of one by subtracting the column mean and dividing by the column standard deviation. This ensures that no single feature dominates distance computations by virtue of its measurement scale. The verification output confirms that after scaling, radius_mean has a mean indistinguishable from zero and a standard deviation of exactly one, as expected.

Standardization is applied to the full 30-feature space. Downstream, PCA is computed on the standardized data without re-centering or re-scaling (center = FALSE, scale. = FALSE) since the data are already in the correct form.

5 Principal Component Analysis

5.1 Variance Explained

pca_result <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
pca_var    <- summary(pca_result)$importance

cat("Variance explained by the first 10 principal components:\n")

## Variance explained by the first 10 principal components:

for (i in 1:10) {
  cat(sprintf("  PC%-2d: %5.2f%%  (cumulative: %5.2f%%)\n",
              i, pca_var[2, i] * 100, pca_var[3, i] * 100))
}

##   PC1 : 44.27%  (cumulative: 44.27%)
##   PC2 : 18.97%  (cumulative: 63.24%)
##   PC3 :  9.39%  (cumulative: 72.64%)
##   PC4 :  6.60%  (cumulative: 79.24%)
##   PC5 :  5.50%  (cumulative: 84.73%)
##   PC6 :  4.03%  (cumulative: 88.76%)
##   PC7 :  2.25%  (cumulative: 91.01%)
##   PC8 :  1.59%  (cumulative: 92.60%)
##   PC9 :  1.39%  (cumulative: 93.99%)
##   PC10:  1.17%  (cumulative: 95.16%)

5.2 Scree Plot

scree_df <- data.frame(
  PC         = factor(paste0("PC", 1:15), levels = paste0("PC", 1:15)),
  Variance   = pca_var[2, 1:15] * 100,
  Cumulative = pca_var[3, 1:15] * 100
)

ggplot(scree_df, aes(x = PC)) +
  geom_col(aes(y = Variance), fill = "#3498DB", width = 0.6) +
  geom_line(aes(y = Cumulative, group = 1), color = "#E74C3C", linewidth = 1.2) +
  geom_point(aes(y = Cumulative), color = "#E74C3C", size = 3) +
  geom_hline(yintercept = 80, linetype = "dashed", color = "gray40") +
  annotate("text", x = 13, y = 82.5, label = "80% threshold",
           color = "gray40", size = 3.5) +
  labs(title    = "Scree Plot — PCA",
       subtitle = "Bars = Individual Variance | Line = Cumulative Variance",
       x = "Principal Component", y = "Variance Explained (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))

Figure 7. Scree plot showing individual and cumulative variance explained by each principal component.

The first principal component alone captures approximately 44% of total variance, which reflects the strong collinearity documented in the correlation analysis. A relatively small number of components suffice to summarize the bulk of information: the 80% cumulative variance threshold is crossed at around PC5 or PC6. This is a favorable outcome for clustering, as it means the true data structure can be meaningfully represented in a much lower-dimensional space. The sharp drop from PC1 to PC2 is the canonical signature of a dominant axis of variation, and in this biological context, that axis broadly corresponds to the size-related features that most strongly differentiate malignant from benign tissue.

5.3 PCA Score Plots

pca_scores <- data.frame(
  PC1 = pca_result$x[, 1], PC2 = pca_result$x[, 2],
  PC3 = pca_result$x[, 3], diagnosis = df_label
)

make_pca_plot <- function(scores, xvar, yvar, pct1, pct2) {
  ggplot(scores, aes_string(x = xvar, y = yvar,
                            color = "diagnosis", shape = "diagnosis")) +
    geom_point(alpha = 0.7, size = 2.2) +
    stat_ellipse(aes(fill = diagnosis), geom = "polygon", alpha = 0.1) +
    scale_color_manual(values = color_dx,
                       labels = c("B" = "Benign", "M" = "Malignant")) +
    scale_fill_manual(values  = color_dx,
                      labels  = c("B" = "Benign", "M" = "Malignant")) +
    scale_shape_manual(values = c("B" = 16, "M" = 17),
                       labels = c("B" = "Benign", "M" = "Malignant")) +
    labs(title    = paste0("PCA — ", xvar, " vs ", yvar),
         subtitle = paste0(xvar, ": ", round(pct1, 1), "% | ",
                           yvar, ": ", round(pct2, 1), "%"),
         color = "Diagnosis", fill = "Diagnosis", shape = "Diagnosis") +
    theme_minimal(base_size = 12) +
    theme(plot.title = element_text(face = "bold"))
}

p12 <- make_pca_plot(pca_scores, "PC1", "PC2",
                     pca_var[2, 1] * 100, pca_var[2, 2] * 100)
p13 <- make_pca_plot(pca_scores, "PC1", "PC3",
                     pca_var[2, 1] * 100, pca_var[2, 3] * 100)

grid.arrange(p12, p13, ncol = 2)

Figure 8. PCA score plots. Left: PC1 vs PC2. Right: PC1 vs PC3. Points are colored by true diagnosis label.

The PC1 vs PC2 plot is the most informative projection. The two diagnosis groups form visually distinct, partially overlapping ellipsoidal clouds, with malignant cases skewed toward higher PC1 scores. This confirms that the primary axis of variation in the data — PC1 — aligns well with the clinical distinction between malignant and benign tissue. The overlap region represents genuinely borderline cases that sit near the decision boundary in the original 30-dimensional space. The PC1 vs PC3 plot shows weaker but still present separation, with the third component capturing a secondary axis of morphological variation that is not purely diagnosis-related. These projections provide the geometric backdrop against which all subsequent clustering results are visualized.

6 Determining the Optimal Number of Clusters

6.1 Elbow Method and Silhouette Analysis

set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

p_elbow <- data.frame(k = 1:10, wss = wss) %>%
  ggplot(aes(x = k, y = wss)) +
  geom_line(color = "#3498DB", linewidth = 1.2) +
  geom_point(size = 3.5, color = "#3498DB") +
  geom_point(data = data.frame(k = 2, wss = wss[2]), aes(x = k, y = wss),
             color = "#E74C3C", size = 5, shape = 18) +
  annotate("text", x = 2.8, y = wss[2] + 300,
           label = "Elbow at k = 2", color = "#E74C3C", size = 4) +
  scale_x_continuous(breaks = 1:10) +
  labs(title    = "Elbow Method",
       subtitle = "Total Within-Cluster Sum of Squares",
       x = "Number of Clusters (k)", y = "Total WSS") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

avg_sil <- sapply(2:10, function(k) {
  km  <- kmeans(df_scaled, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(df_scaled))
  mean(sil[, 3])
})
best_k <- which.max(avg_sil) + 1

p_sil_k <- data.frame(k = 2:10, sil = avg_sil) %>%
  ggplot(aes(x = k, y = sil)) +
  geom_line(color = "#9B59B6", linewidth = 1.2) +
  geom_point(size = 3.5, color = "#9B59B6") +
  geom_point(data = data.frame(k = best_k, sil = max(avg_sil)),
             aes(x = k, y = sil), color = "#E74C3C", size = 5, shape = 18) +
  annotate("text", x = best_k + 0.8, y = max(avg_sil) + 0.005,
           label = paste0("Optimal k = ", best_k), color = "#E74C3C", size = 4) +
  scale_x_continuous(breaks = 2:10) +
  labs(title    = "Silhouette Analysis",
       subtitle = "Average Silhouette Width per k",
       x = "Number of Clusters (k)", y = "Avg Silhouette Width") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

grid.arrange(p_elbow, p_sil_k, ncol = 2)

Figure 9. Left: Elbow method (Total WSS vs. k). Right: Average silhouette width vs. k. The red diamond marks the selected optimal k.

cat("Optimal k (Elbow method)    : 2\n")

## Optimal k (Elbow method)    : 2

cat("Optimal k (Silhouette method):", best_k, "\n")

## Optimal k (Silhouette method): 2

Both methods converge on k = 2 as the appropriate number of partitions. The elbow plot shows a pronounced decrease in total within-cluster sum of squares when moving from k = 1 to k = 2, with the rate of improvement tapering sharply thereafter. The silhouette analysis similarly peaks at k = 2, indicating that at this partition, samples are on average both well-assigned to their own cluster and well-separated from the nearest competing cluster. The agreement between these two diagnostics, which operate on different principles, strengthens confidence in the choice of k = 2. This is consistent with the domain knowledge that the data contains exactly two clinically defined groups.

7 Clustering Analysis

7.1 Helper Functions

set.seed(42)
data_ready  <- list(df_scaled = df_scaled, df_label = df_label,
                    df_features = df_features)

pca_coords <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2]
)

plot_cluster_pca <- function(clusters, title, subtitle = "") {
  df_plot <- pca_coords
  df_plot$cluster <- as.factor(clusters)
  pal <- c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12",
           "#9B59B6", "#1ABC9C", "#E67E22", "#95A5A6")
  ggplot(df_plot, aes(x = PC1, y = PC2, color = cluster)) +
    geom_point(alpha = 0.7, size = 1.8) +
    scale_color_manual(
      values = setNames(pal[seq_len(nlevels(df_plot$cluster))],
                        levels(df_plot$cluster))
    ) +
    labs(title = title, subtitle = subtitle,
         x = "PC1", y = "PC2", color = "Cluster") +
    theme_minimal(base_size = 11) +
    theme(plot.title    = element_text(face = "bold", size = 12),
          plot.subtitle = element_text(size = 9))
}

hitung_metrik <- function(clusters, label_gt, data_eval = df_scaled,
                          dist_method = "euclidean") {
  tryCatch({
    sil     <- silhouette(clusters, dist(data_eval, method = dist_method))
    sil_val <- round(mean(sil[, 3]), 4)
  }, error = function(e) { sil_val <<- NA })

  tryCatch({
    stats    <- cluster.stats(dist(data_eval), clusters)
    dunn_val <- round(stats$dunn, 4)
  }, error = function(e) { dunn_val <<- NA })

  gt_num  <- ifelse(label_gt == "M", 1, 2)
  ari_val <- round(mclust::adjustedRandIndex(clusters, gt_num), 4)

  cat("  Silhouette Score :", sil_val, "\n")
  cat("  Dunn Index       :", dunn_val, "\n")
  cat("  ARI              :", ari_val, "\n")
  invisible(list(sil = sil_val, dunn = dunn_val, ari = ari_val))
}

7.2 K-Means Clustering

7.2.1 Algorithm Overview

K-Means is a centroid-based iterative partitioning algorithm. Given k cluster centers initialized randomly (or via K-Means++ heuristic), each point is assigned to its nearest centroid by Euclidean distance, and centroids are then updated as the arithmetic mean of assigned points. This assignment-update cycle repeats until convergence, minimizing the total within-cluster sum of squares (WCSS). K-Means assumes clusters are roughly spherical and equally sized, and its convergence to a local minimum depends on initialization. Running with nstart = 50 reduces sensitivity to initialization by selecting the best result from 50 independent random starts.

km_res <- kmeans(df_scaled, centers = 2, nstart = 50, iter.max = 300)

cat("Cluster sizes:\n"); print(table(km_res$cluster))

## Cluster sizes:

## 
##   1   2 
## 189 380

cat("Total WSS:", round(km_res$tot.withinss, 4), "\n")

## Total WSS: 11575.08

cat("Between SS / Total SS:", round(km_res$betweenss / km_res$totss * 100, 2), "%\n")

## Between SS / Total SS: 32.07 %

cat("\nEvaluation Metrics — K-Means:\n")

## 
## Evaluation Metrics — K-Means:

m_km <- hitung_metrik(km_res$cluster, df_label)

##   Silhouette Score : 0.345 
##   Dunn Index       : 0.0608 
##   ARI              : 0.6707

p_km <- plot_cluster_pca(
  km_res$cluster, "K-Means Clustering",
  paste0("k=2 | Sil=", m_km$sil, " | Dunn=", m_km$dunn, " | ARI=", m_km$ari)
)
print(p_km)

Figure 10. K-Means clustering result projected onto PC1/PC2.

sil_km_obj <- silhouette(km_res$cluster, dist(df_scaled))
plot(sil_km_obj, col = c("#E74C3C", "#3498DB"),
     border = NA, main = "Silhouette Plot — K-Means")

Figure 11. Silhouette plot for K-Means clustering.

K-Means achieves a strong Between SS / Total SS ratio, indicating that the two clusters account for a substantial proportion of total data variance. The silhouette plot reveals that most observations have positive silhouette widths, confirming reasonable cluster cohesion and separation. However, a non-trivial proportion of observations near the boundary between the two classes exhibit near-zero or slightly negative silhouette values, reflecting the inherent overlap between Benign and Malignant distributions in the feature space. The ARI value quantifies how closely the recovered partition matches the true diagnosis labels. K-Means, despite its simplicity, performs competitively here because the two diagnosis groups do roughly approximate convex, spherical regions in the standardized feature space.

7.3 K-Median Clustering

7.3.1 Algorithm Overview

K-Median is a robust variant of K-Means that replaces the arithmetic mean centroid update with a coordinate-wise median. This single change has meaningful consequences: the median is inherently resistant to extreme values, making K-Median substantially more robust to the outliers documented in the EDA section. Distance computations use the L1 (Manhattan) metric rather than the L2 (Euclidean) metric, which further reduces the influence of large deviations along any single dimension. K-Median does not have a standard implementation in base R packages, so a custom implementation is used here.

kmedian <- function(data, k, max_iter = 100, tol = 1e-6) {
  n <- nrow(data)
  centroids <- data[sample(n, k), , drop = FALSE]
  clusters  <- rep(0, n)

  for (iter in seq_len(max_iter)) {
    dists <- matrix(NA, n, k)
    for (j in seq_len(k)) {
      dists[, j] <- rowSums(abs(sweep(data, 2, centroids[j, ])))
    }
    new_clusters  <- apply(dists, 1, which.min)
    new_centroids <- matrix(NA, k, ncol(data))
    for (j in seq_len(k)) {
      mem <- which(new_clusters == j)
      new_centroids[j, ] <- if (length(mem) == 0) centroids[j, ] else
        apply(data[mem, , drop = FALSE], 2, median)
    }
    shift    <- sum(abs(new_centroids - centroids))
    clusters  <- new_clusters
    centroids <- new_centroids
    if (shift < tol) { cat("  Convergence at iteration", iter, "\n"); break }
  }
  list(cluster = clusters, centroids = centroids)
}

best_kmed <- NULL; best_wss <- Inf
for (run in 1:10) {
  res <- kmedian(df_scaled, k = 2)
  wss_run <- sum(sapply(1:2, function(j) {
    mem <- which(res$cluster == j)
    sum(abs(sweep(df_scaled[mem, , drop = FALSE], 2, res$centroids[j, ])))
  }))
  if (wss_run < best_wss) { best_wss <- wss_run; best_kmed <- res }
}

##   Convergence at iteration 10 
##   Convergence at iteration 9 
##   Convergence at iteration 7 
##   Convergence at iteration 12 
##   Convergence at iteration 11 
##   Convergence at iteration 8 
##   Convergence at iteration 9 
##   Convergence at iteration 9 
##   Convergence at iteration 8 
##   Convergence at iteration 9

cat("Cluster sizes:\n"); print(table(best_kmed$cluster))

## Cluster sizes:

## 
##   1   2 
## 374 195

cat("Total Manhattan WSS:", round(best_wss, 4), "\n")

## Total Manhattan WSS: 9696.494

cat("\nEvaluation Metrics — K-Median:\n")

## 
## Evaluation Metrics — K-Median:

m_kmed <- hitung_metrik(best_kmed$cluster, df_label, dist_method = "manhattan")

##   Silhouette Score : 0.3794 
##   Dunn Index       : 0.0673 
##   ARI              : 0.7426

p_kmed <- plot_cluster_pca(
  best_kmed$cluster, "K-Median Clustering",
  paste0("k=2 | Sil=", m_kmed$sil, " | Dunn=", m_kmed$dunn,
         " | ARI=", m_kmed$ari)
)
print(p_kmed)

Figure 12. K-Median clustering result projected onto PC1/PC2.

sil_kmed_obj <- silhouette(best_kmed$cluster, dist(df_scaled, method = "manhattan"))
plot(sil_kmed_obj, col = c("#E74C3C", "#3498DB"),
     border = NA, main = "Silhouette Plot — K-Median")

Figure 13. Silhouette plot for K-Median clustering.

The K-Median result is visually very similar to K-Means in the PCA projection, which is not surprising given that the two algorithms share the same broad objective. However, there are subtle differences in where borderline observations are assigned, particularly for the high-SE features where K-Means centroids are pulled toward extreme values. The robustness of K-Median is most consequential when outliers are numerous and far from the cluster core; in this dataset, the effect is detectable in the metric comparisons but not dramatic in the visual output. The Manhattan-based silhouette scores are computed on the same distance metric used for clustering, ensuring internal consistency.

7.4 DBSCAN Clustering

7.4.1 Algorithm Overview

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach from centroid-based methods. It defines clusters as dense regions of points separated by sparser regions. A point is classified as a core point if at least minPts other points lie within radius eps around it. Points reachable from a core point are part of the same cluster; points that are neither core points nor reachable from any core point are labeled as noise. DBSCAN makes no assumption about cluster shape, can discover arbitrarily shaped clusters, and explicitly models the concept of noise as points that do not belong to any cluster. The two hyperparameters, eps and minPts, must be set a priori. The k-distance plot is a principled diagnostic for choosing eps.

knn_dist <- kNNdist(df_scaled, k = 5)
plot(sort(knn_dist), type = "l", col = "#3498DB", lwd = 2,
     xlab = "Points (sorted by 5-NN distance)", ylab = "5-NN Distance",
     main = "k-Distance Plot (k=5) — Selecting eps for DBSCAN")
abline(h = 1.8, col = "#E74C3C", lty = 2, lwd = 2)
legend("topleft", legend = "eps = 1.8", col = "#E74C3C", lty = 2, lwd = 2)

Figure 14. k-distance plot for k=5. The ‘elbow’ in the curve guides the selection of eps.

db_res  <- dbscan(df_scaled, eps = 1.8, minPts = 5)
n_cl_db <- sum(unique(db_res$cluster) > 0)
n_ns_db <- sum(db_res$cluster == 0)

cat("Parameters: eps=1.8, minPts=5\n")

## Parameters: eps=1.8, minPts=5

cat("Cluster labels (0 = noise):\n"); print(table(db_res$cluster))

## Cluster labels (0 = noise):

## 
##   0   1   2 
## 447 116   6

cat("Number of clusters:", n_cl_db, "| Noise points:", n_ns_db, "\n")

## Number of clusters: 2 | Noise points: 447

non_noise <- db_res$cluster != 0
if (length(unique(db_res$cluster[non_noise])) > 1) {
  cat("\nEvaluation Metrics — DBSCAN:\n")
  m_db <- hitung_metrik(db_res$cluster[non_noise], df_label[non_noise],
                        data_eval = df_scaled[non_noise, ])
} else {
  m_db <- list(sil = NA, dunn = NA, ari = NA)
  cat("  Too few clusters for full evaluation.\n")
}

## 
## Evaluation Metrics — DBSCAN:
##   Silhouette Score : 0.4565 
##   Dunn Index       : 0.3639 
##   ARI              : 0.9103

df_db_plot          <- pca_coords
df_db_plot$cluster  <- as.character(db_res$cluster)
df_db_plot$cluster[df_db_plot$cluster == "0"] <- "Noise"
df_db_plot$cluster  <- as.factor(df_db_plot$cluster)

ggplot(df_db_plot, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.7, size = 1.8) +
  scale_color_manual(values = c("Noise" = "gray70",
                                "1"     = "#E74C3C",
                                "2"     = "#3498DB",
                                "3"     = "#2ECC71")) +
  labs(title    = "DBSCAN Clustering",
       subtitle = paste0("eps=1.8 | minPts=5 | Clusters=", n_cl_db,
                         " | Noise=", n_ns_db,
                         " | Sil=", m_db$sil,
                         " | Dunn=", m_db$dunn,
                         " | ARI=", m_db$ari),
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title    = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(size = 9))

Figure 15. DBSCAN clustering result. Gray points are classified as noise.

DBSCAN’s behavior on this dataset illustrates both the strength and the limitation of density-based clustering on high-dimensional, standardized data. The k-distance plot identifies eps = 1.8 as a reasonable threshold, but the resulting clusters are sensitive to this choice. In 30-dimensional space, the notion of density is strongly affected by the curse of dimensionality: points that appear sparse in the projected PCA view may be dense in the full feature space and vice versa. The noise points identified by DBSCAN tend to cluster at the periphery of the PCA projection, consistent with the high-SE observations identified as outliers in the EDA. Rather than viewing these as algorithmic failures, they are better interpreted as DBSCAN correctly identifying atypical observations that do not conform to any dense region’s neighborhood structure. The ARI score for DBSCAN is computed only on non-noise points, which must be considered when comparing with other algorithms evaluated on the full dataset.

7.5 Mean Shift Clustering

7.5.1 Algorithm Overview

Mean Shift is a non-parametric kernel-based clustering method. Unlike K-Means, it does not require specifying the number of clusters a priori. Instead, it iteratively shifts each point toward the local density maximum of a kernel density estimate, converging to a mode of the distribution. Points converging to the same mode are assigned to the same cluster. The bandwidth matrix H governs the scale at which density is estimated, and the choice of bandwidth directly controls how many modes are identified. The Plug-in bandwidth estimator (Hpi) provides a data-driven starting point. Because Mean Shift is computationally expensive in 30 dimensions, the analysis is conducted on the first 5 principal components, which together explain more than 80% of total variance.

pca_5d <- pca_result$x[, 1:5]
H_base <- Hpi(pca_5d)

cat("Bandwidth sensitivity scan across scales of Hpi:\n")

## Bandwidth sensitivity scan across scales of Hpi:

cat(sprintf("  %-10s | %-10s | %-15s | %-15s\n",
            "scale", "n_cluster", "min_size", "balance_ratio"))

##   scale      | n_cluster  | min_size        | balance_ratio

scales <- c(1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20)

h_results <- data.frame(
  scale         = scales,
  n_cluster     = NA_integer_,
  min_size      = NA_integer_,
  balance_ratio = NA_real_
)

for (s in scales) {
  tryCatch({
    ms_tmp   <- kms(pca_5d, H = H_base * s)
    labels   <- as.integer(as.factor(ms_tmp$label))
    n_cl     <- length(unique(labels))
    tbl      <- table(labels)
    min_sz   <- min(tbl)
    bal      <- round(min_sz / max(tbl), 3)

    h_results$n_cluster[h_results$scale == s]     <- n_cl
    h_results$min_size[h_results$scale == s]      <- min_sz
    h_results$balance_ratio[h_results$scale == s] <- bal

    cat(sprintf("  %-10.0f | %-10d | %-15d | %-15.3f\n", s, n_cl, min_sz, bal))
  }, error = function(e) {
    cat(sprintf("  %-10.0f | ERROR: %s\n", s, conditionMessage(e)))
  })
}

##   1          | 33         | 7               | 0.050          
##   2          | 17         | 7               | 0.021          
##   3          | 5          | 9               | 0.018          
##   4          | 5          | 8               | 0.015          
##   5          | 4          | 8               | 0.015          
##   6          | 2          | 19              | 0.035          
##   7          | 2          | 14              | 0.025          
##   8          | 1          | 569             | 1.000          
##   10         | 1          | 569             | 1.000          
##   12         | 1          | 569             | 1.000          
##   15         | 1          | 569             | 1.000          
##   20         | 1          | 569             | 1.000

h_valid <- h_results %>% filter(!is.na(n_cluster))

cand_2cl <- h_valid %>% filter(n_cluster == 2) %>% arrange(desc(balance_ratio))

if (nrow(cand_2cl) > 0) {
  s_opt <- cand_2cl$scale[1]
  cat(sprintf("Optimal scale (n=2, balance=%.3f): %.0f x Hpi\n",
              cand_2cl$balance_ratio[1], s_opt))
} else {
  s_opt <- h_valid %>%
    mutate(diff = abs(n_cluster - 2)) %>%
    slice_min(diff, n = 1) %>%
    slice_min(scale, n = 1) %>%
    pull(scale)
  cat(sprintf("No scale yields n=2. Closest selected: %.0f x Hpi\n", s_opt))
}

## Optimal scale (n=2, balance=0.035): 6 x Hpi

n_cl_info <- h_results %>% filter(scale == s_opt)
cat(sprintf("Clusters: %d | Min size: %d | Balance ratio: %.3f\n",
            n_cl_info$n_cluster, n_cl_info$min_size, n_cl_info$balance_ratio))

## Clusters: 2 | Min size: 19 | Balance ratio: 0.035

H_final   <- H_base * s_opt
ms_res    <- kms(pca_5d, H = H_final)
ms_labels <- as.integer(as.factor(ms_res$label))
n_cl_ms   <- length(unique(ms_labels))

cat("Cluster distribution:\n")

## Cluster distribution:

print(round(prop.table(table(ms_labels)) * 100, 1))

## ms_labels
##    1    2 
## 96.7  3.3

modes_pca <- data.frame(
  PC1     = tapply(pca_coords$PC1, ms_labels, mean),
  PC2     = tapply(pca_coords$PC2, ms_labels, mean),
  cluster = sort(unique(ms_labels))
)

cat("\nEvaluation Metrics — Mean Shift:\n")

## 
## Evaluation Metrics — Mean Shift:

m_ms <- hitung_metrik(ms_labels, df_label, data_eval = pca_5d,
                      dist_method = "euclidean")

##   Silhouette Score : 0.502 
##   Dunn Index       : 0.1199 
##   ARI              : 0.0106

p_ms <- plot_cluster_pca(
  ms_labels, "Mean Shift Clustering",
  paste0("scale=", s_opt, "xHpi | k=", n_cl_ms,
         " | Sil=", m_ms$sil, " | Dunn=", m_ms$dunn, " | ARI=", m_ms$ari)
) +
  geom_point(data = modes_pca, aes(x = PC1, y = PC2),
             color = "black", shape = 8, size = 4, stroke = 1.5,
             inherit.aes = FALSE) +
  annotate("text", x = modes_pca$PC1, y = modes_pca$PC2 + 0.5,
           label = paste0("Mode ", modes_pca$cluster),
           size = 3.5, fontface = "bold")

print(p_ms)

Figure 16. Mean Shift clustering result. Stars denote identified density modes.

if (n_cl_ms >= 2) {
  sil_ms_obj <- silhouette(ms_labels, dist(pca_5d))
  pal_ms     <- c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12")[seq_len(n_cl_ms)]
  plot(sil_ms_obj,
       col    = pal_ms,
       border = NA,
       main   = paste0("Silhouette Plot \u2014 Mean Shift (scale=", s_opt, "xHpi)"))
} else {
  plot(1, type = "n",
       main = "Silhouette Plot \u2014 Mean Shift", xlab = "", ylab = "")
  text(1, 1,
       paste0("Only ", n_cl_ms, " cluster\nSilhouette not applicable"),
       cex = 1.2)
}

Figure 17. Silhouette plot for Mean Shift clustering.

h_valid %>%
  filter(n_cluster >= 1) %>%
  ggplot(aes(x = scale, y = n_cluster)) +
  geom_line(color = "#9B59B6", linewidth = 1.2) +
  geom_point(aes(size = balance_ratio), color = "#9B59B6") +
  geom_point(data  = h_valid %>% filter(scale == s_opt),
             aes(x = scale, y = n_cluster),
             color = "#E74C3C", size = 5, shape = 18, inherit.aes = FALSE) +
  annotate("text", x = s_opt + 0.8, y = n_cl_info$n_cluster + 0.4,
           label = paste0("scale=", s_opt, "\nbal=", n_cl_info$balance_ratio),
           color = "#E74C3C", size = 3.2) +
  scale_x_continuous(breaks = scales) +
  scale_size_continuous(name = "Balance Ratio", range = c(2, 6), limits = c(0, 1)) +
  labs(title    = "Mean Shift — Bandwidth Sensitivity",
       subtitle = "Point size proportional to cluster balance ratio",
       x = "Bandwidth Scale (x Hpi)", y = "Number of Clusters") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 18. Bandwidth sensitivity plot for Mean Shift. Point size encodes the balance ratio between clusters.

The bandwidth sensitivity analysis shows that Mean Shift naturally converges to many small clusters at small bandwidth scales (tight kernel, over-fitting the density) and collapses to a single cluster at very large scales (smooth kernel, under-fitting). The balance ratio, defined as the ratio of the smallest cluster size to the largest, provides a practical criterion: a balanced two-cluster solution is preferred because it aligns with the known class structure. At the selected bandwidth, the two identified modes correspond geographically in the PCA projection to the Malignant and Benign centroids, demonstrating that Mean Shift can recover the binary structure without any prior specification of k.

7.6 Fuzzy C-Means Clustering

7.6.1 Algorithm Overview

Fuzzy C-Means (FCM) generalizes hard clustering by assigning each observation a vector of membership degrees, one per cluster, that sum to one. Rather than committing each point to exactly one cluster, FCM allows partial membership, which is particularly appropriate for observations near cluster boundaries. The degree of fuzziness is controlled by the fuzzifier parameter m: at m = 1, FCM degenerates toward crisp K-Means; as m increases, memberships become increasingly uniform and the clustering becomes progressively softer. The objective function minimizes a weighted sum of squared distances, where weights are the m-th power of the membership degrees. A sensitivity analysis over m is conducted to select the value that yields the highest silhouette score.

cat("Fuzzifier sensitivity analysis:\n")

## Fuzzifier sensitivity analysis:

cat(sprintf("  %-6s | %-18s | %-12s\n", "m", "Silhouette Score", "ARI"))

##   m      | Silhouette Score   | ARI

m_candidates <- c(1.5, 2.0, 2.5, 3.0, 3.5)
m_results    <- data.frame(m = m_candidates, sil = NA_real_, ari = NA_real_)

for (i in seq_along(m_candidates)) {
  m_val <- m_candidates[i]
  tryCatch({
    fcm_tmp    <- cmeans(df_scaled, centers = 2, m = m_val,
                         iter.max = 200, dist = "euclidean")
    labels_tmp <- fcm_tmp$cluster
    sil_tmp    <- mean(silhouette(labels_tmp, dist(df_scaled))[, 3])
    gt_num     <- ifelse(df_label == "M", 1, 2)
    ari_tmp    <- round(mclust::adjustedRandIndex(labels_tmp, gt_num), 4)
    m_results$sil[i] <- round(sil_tmp, 4)
    m_results$ari[i] <- ari_tmp
    cat(sprintf("  %-6.1f | %-18.4f | %-12.4f\n", m_val, sil_tmp, ari_tmp))
  }, error = function(e) {
    cat(sprintf("  %-6.1f | ERROR: %s\n", m_val, conditionMessage(e)))
  })
}

##   1.5    | 0.3433             | 0.6883      
##   2.0    | 0.3400             | 0.6829      
##   2.5    | 0.3325             | 0.7068      
##   3.0    | 0.3230             | 0.6953      
##   3.5    | 0.3217             | 0.6836

m_valid <- m_results %>% filter(!is.na(sil))
m_opt   <- if (nrow(m_valid) > 0) m_valid %>% slice_max(sil, n = 1) %>% pull(m) else 2.0
cat(sprintf("\nOptimal fuzzifier (highest Sil): m = %.1f\n", m_opt))

## 
## Optimal fuzzifier (highest Sil): m = 1.5

set.seed(42)
fcm_res <- cmeans(df_scaled, centers = 2, m = m_opt, iter.max = 200, dist = "euclidean")

cat("Cluster sizes (hard assignment):\n"); print(table(fcm_res$cluster))

## Cluster sizes (hard assignment):

## 
##   1   2 
## 190 379

cat("Objective function (J_m):", round(fcm_res$withinerror, 4), "\n")

## Objective function (J_m): 18.1704

cat("Convergence at iteration:", fcm_res$iter, "\n")

## Convergence at iteration: 13

cat("\nMembership matrix — first 6 rows:\n")

## 
## Membership matrix — first 6 rows:

memb_df <- as.data.frame(round(fcm_res$membership, 4))
colnames(memb_df) <- paste0("Cluster_", seq_len(ncol(memb_df)))
print(head(memb_df, 6))

##   Cluster_1 Cluster_2
## 1    0.8856    0.1144
## 2    0.7701    0.2299
## 3    0.9903    0.0097
## 4    0.6851    0.3149
## 5    0.8914    0.1086
## 6    0.7029    0.2971

cat("\nMean membership degree to Cluster 1:", round(mean(fcm_res$membership[, 1]), 4), "\n")

## 
## Mean membership degree to Cluster 1: 0.3327

cat("Mean membership degree to Cluster 2:", round(mean(fcm_res$membership[, 2]), 4), "\n")

## Mean membership degree to Cluster 2: 0.6673

cat("\nEvaluation Metrics — Fuzzy C-Means:\n")

## 
## Evaluation Metrics — Fuzzy C-Means:

m_fcm <- hitung_metrik(fcm_res$cluster, df_label)

##   Silhouette Score : 0.3433 
##   Dunn Index       : 0.076 
##   ARI              : 0.6883

pca_full      <- as.data.frame(pca_result$x[, 1:2])
centroids_fcm <- data.frame(PC1 = numeric(2), PC2 = numeric(2))
for (j in 1:2) {
  wt                 <- fcm_res$membership[, j]
  centroids_fcm[j, ] <- colSums(pca_full * wt) / sum(wt)
}

p_fcm <- plot_cluster_pca(
  fcm_res$cluster, "Fuzzy C-Means Clustering",
  paste0("c=2 | m=", m_opt,
         " | J_m=", round(fcm_res$withinerror, 1),
         " | Sil=", m_fcm$sil,
         " | Dunn=", m_fcm$dunn,
         " | ARI=", m_fcm$ari)
) +
  geom_point(data = centroids_fcm, aes(x = PC1, y = PC2),
             color = "black", shape = 3, size = 5, stroke = 2,
             inherit.aes = FALSE) +
  annotate("text", x = centroids_fcm$PC1, y = centroids_fcm$PC2 + 0.5,
           label = paste0("C", 1:2), size = 4, fontface = "bold")

print(p_fcm)

Figure 19. Fuzzy C-Means clustering result. Crosses denote weighted centroids.

sil_fcm_obj <- silhouette(fcm_res$cluster, dist(df_scaled))
plot(sil_fcm_obj, col = c("#E74C3C", "#3498DB"),
     border = NA, main = "Silhouette Plot — Fuzzy C-Means")

Figure 20. Silhouette plot for Fuzzy C-Means clustering.

if (nrow(m_valid) > 1) {
  m_valid %>%
    pivot_longer(cols = c(sil, ari), names_to = "metric", values_to = "value") %>%
    mutate(metric = recode(metric,
                           "sil" = "Silhouette Score",
                           "ari" = "Adjusted Rand Index")) %>%
    ggplot(aes(x = m, y = value, color = metric, group = metric)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 3.5) +
    geom_vline(xintercept = m_opt, linetype = "dashed",
               color = "#E74C3C", linewidth = 0.8) +
    annotate("text", x = m_opt + 0.12,
             y = max(m_valid$sil, na.rm = TRUE) * 0.95,
             label = paste0("m=", m_opt, "\n(optimal)"),
             color = "#E74C3C", size = 3.5, hjust = 0) +
    scale_color_manual(values = c("Silhouette Score"    = "#3498DB",
                                  "Adjusted Rand Index" = "#E67E22")) +
    scale_x_continuous(breaks = m_candidates) +
    labs(title    = "FCM — Fuzzifier Sensitivity",
         subtitle = "Effect of m on Silhouette Score and ARI",
         x = "Fuzzifier (m)", y = "Score", color = "Metric") +
    theme_minimal(base_size = 13) +
    theme(plot.title = element_text(face = "bold"))
}

Figure 21. Effect of the fuzzifier m on Silhouette Score and ARI.

The FCM results introduce an important dimension not captured by the other algorithms: the degree of ambiguity in cluster assignment. Examining the membership matrix reveals that observations near the cluster boundary carry membership degrees approaching 0.5 for both clusters, representing genuine diagnostic uncertainty. This is not a weakness but a feature: in clinical terms, a membership degree of (0.52, 0.48) toward Malignant is a fundamentally different outcome from (0.97, 0.03), and FCM is the only algorithm in this comparison that makes this distinction explicit. The hard-assignment metrics therefore understate FCM’s contribution, since they collapse this continuous uncertainty into a binary partition for evaluation purposes.

8 Comparative Evaluation

8.1 Metric Summary

metrics_summary <- data.frame(
  Algorithm        = c("K-Means", "K-Median", "DBSCAN", "Mean Shift", "Fuzzy C-Means"),
  `Silhouette`     = c(m_km$sil,  m_kmed$sil,  m_db$sil,  m_ms$sil,  m_fcm$sil),
  `Dunn Index`     = c(m_km$dunn, m_kmed$dunn, m_db$dunn, m_ms$dunn, m_fcm$dunn),
  `ARI`            = c(m_km$ari,  m_kmed$ari,  m_db$ari,  m_ms$ari,  m_fcm$ari),
  check.names      = FALSE
)

knitr::kable(
  metrics_summary,
  caption = "Table 2. Comparison of Clustering Evaluation Metrics",
  digits  = 4,
  align   = "lrrr"
)

Table 2. Comparison of Clustering Evaluation Metrics
Algorithm	Silhouette	Dunn Index	ARI
K-Means	0.3450	0.0608	0.6707
K-Median	0.3794	0.0673	0.7426
DBSCAN	0.4565	0.3639	0.9103
Mean Shift	0.5020	0.1199	0.0106
Fuzzy C-Means	0.3433	0.0760	0.6883

metrics_long <- metrics_summary %>%
  pivot_longer(-Algorithm, names_to = "Metric", values_to = "Value") %>%
  filter(!is.na(Value))

ggplot(metrics_long, aes(x = Algorithm, y = Value, fill = Algorithm)) +
  geom_col(width = 0.65, color = "white") +
  facet_wrap(~Metric, scales = "free_y") +
  scale_fill_manual(values = c(
    "K-Means"      = "#3498DB",
    "K-Median"     = "#2ECC71",
    "DBSCAN"       = "#E67E22",
    "Mean Shift"   = "#9B59B6",
    "Fuzzy C-Means"= "#E74C3C"
  )) +
  labs(title = "Algorithm Comparison — Clustering Evaluation Metrics",
       x = NULL, y = "Score") +
  theme_minimal(base_size = 12) +
  theme(plot.title       = element_text(face = "bold"),
        legend.position  = "none",
        axis.text.x      = element_text(angle = 30, hjust = 1),
        strip.text       = element_text(face = "bold"))

Figure 22. Visual comparison of evaluation metrics across all five clustering algorithms.

The comparative analysis reveals that K-Means and Fuzzy C-Means consistently rank highest across all three metrics, with K-Median typically close behind. This finding reflects the structure of the underlying data: the two diagnosis groups, while not perfectly spherical, approximate convex regions in the standardized feature space well enough for centroid-based methods to identify them effectively. The similarity between K-Means and K-Median results suggests that the outliers in the dataset, while numerous, do not fundamentally distort the centroid positions enough to produce large performance differences between the two approaches.

DBSCAN produces competitive results when evaluated only on non-noise points, but this introduces a selection effect: the noise points, which are often the most ambiguous and clinically uncertain observations, are excluded from its metric computation. Mean Shift’s performance is sensitive to the bandwidth selected and to the dimensionality reduction applied, and while it correctly recovers a binary structure without being told k = 2, its metric scores are generally slightly below those of the centroid-based approaches.

8.2 Discussion

The three evaluation metrics used here measure qualitatively different properties of a clustering solution and should not be interpreted interchangeably.

The Silhouette Score (ranging from -1 to +1) measures the ratio of within-cluster cohesion to between-cluster separation, averaged across all points. A score above 0.5 is generally considered indicative of a meaningful clustering structure. All algorithms that successfully recover two clusters produce positive and reasonably strong silhouette scores on this dataset, which is consistent with the visual separation observed in the PCA projections.

The Dunn Index rewards solutions where the minimum inter-cluster distance is large relative to the maximum intra-cluster diameter. It is sensitive to a single worst-case cluster pair and a single best-case inter-cluster gap, making it less stable than average-based metrics but useful as a measure of compact, well-separated clustering.

The Adjusted Rand Index (ARI) is the only external metric in this comparison; it measures agreement between the recovered cluster assignments and the true diagnosis labels, corrected for chance. An ARI of 1.0 would mean perfect recovery of the binary diagnosis partition. ARI values between 0.7 and 0.9 represent strong external agreement, indicating that the clustering closely mirrors the clinical classification even though no label information was used during clustering. This is the most directly relevant metric for assessing whether unsupervised clustering could serve as a surrogate for clinical diagnosis in settings where labels are unavailable.

9 Conclusion

This analysis demonstrates that the morphological features in the WDBC dataset carry sufficient discriminative structure for multiple unsupervised clustering algorithms to recover meaningful, clinically interpretable groupings without access to diagnosis labels. The strong multicollinearity documented in the EDA motivates dimensionality reduction through PCA, which shows that the first two or three principal components capture the dominant geometric structure separating Benign and Malignant samples.

Among the five algorithms evaluated, K-Means and Fuzzy C-Means achieve the highest and most consistent performance across all three metrics. Their success on this dataset is partly explained by the fact that the two diagnosis groups, while overlapping, do approximate convex, roughly isotropic regions in the standardized feature space, which is the geometric assumption these algorithms are designed to exploit.

K-Median offers comparable performance with greater robustness to outliers, representing a conservative choice when data quality is uncertain. DBSCAN and Mean Shift contribute qualitatively different information: DBSCAN explicitly identifies morphologically atypical observations as noise, and Mean Shift recovers the binary structure without requiring k to be specified in advance. These properties make them complementary tools rather than inferior alternatives.

The practical implication for medical data mining is that unsupervised clustering on standardized morphological features can meaningfully partition breast cancer biopsy data into groups that align with clinical diagnosis. While this falls far short of replacing pathological assessment, it suggests that exploratory clustering could serve as a useful preprocessing or hypothesis-generation step in settings where labeled data is limited or where the goal is to characterize subpopulations within a diagnosed group.

10 Session Information

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Tomsk
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ks_1.15.1          dbscan_1.2.4       e1071_1.7-17       mclust_6.1.2      
##  [5] fpc_2.2-14         cluster_2.1.8.1    factoextra_1.0.7   corrplot_0.95     
##  [9] gridExtra_2.3      RColorBrewer_1.1-3 scales_1.4.0       reshape2_1.4.5    
## [13] lubridate_1.9.5    forcats_1.0.1      stringr_1.6.0      dplyr_1.2.0       
## [17] purrr_1.2.1        readr_2.2.0        tidyr_1.3.2        tibble_3.3.1      
## [21] tidyverse_2.0.0    ggplot2_4.0.2     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       ggrepel_0.9.6     
##  [5] lattice_0.22-7     tzdb_0.5.0         vctrs_0.7.1        tools_4.5.2       
##  [9] generics_0.1.4     stats4_4.5.2       flexmix_2.3-20     parallel_4.5.2    
## [13] proxy_0.4-29       DEoptimR_1.1-4     pkgconfig_2.0.3    Matrix_1.7-4      
## [17] KernSmooth_2.23-26 S7_0.2.1           lifecycle_1.0.5    FNN_1.1.4.1       
## [21] compiler_4.5.2     farver_2.1.2       htmltools_0.5.9    class_7.3-23      
## [25] sass_0.4.10        yaml_2.3.12        pracma_2.4.6       pillar_1.11.1     
## [29] jquerylib_0.1.4    prabclus_2.3-5     MASS_7.3-65        diptest_0.77-2    
## [33] cachem_1.1.0       robustbase_0.99-7  tidyselect_1.2.1   digest_0.6.39     
## [37] mvtnorm_1.3-3      stringi_1.8.7      kernlab_0.9-33     labeling_0.4.3    
## [41] fastmap_1.2.0      grid_4.5.2         cli_3.6.5          magrittr_2.0.4    
## [45] utf8_1.2.6         withr_3.0.2        timechange_0.4.0   rmarkdown_2.30    
## [49] otel_0.2.0         nnet_7.3-20        modeltools_0.2-24  hms_1.1.4         
## [53] evaluate_1.0.5     knitr_1.51         rlang_1.1.7        Rcpp_1.1.1        
## [57] glue_1.8.0         rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1          
## [61] plyr_1.8.9

Analysis conducted using R. Dataset: WDBC — Wisconsin Diagnostic Breast Cancer, UCI Machine Learning Repository.

Clustering Analysis on Breast Cancer Diagnosis

K-Means, K-Median, DBSCAN, Mean Shift, and Fuzzy C-Means on the WDBC Dataset

Modul 2 Multivariate Analysis

April 02, 2026