The Wisconsin Diagnostic Breast Cancer (WDBC) dataset is one of the most well-studied benchmarks in medical data mining. It originates from digitized images of fine needle aspirate (FNA) biopsies of breast masses, where ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each of these features is summarized across the nuclei present in the image using three statistics, yielding 30 numeric attributes per sample.
The central challenge this dataset poses is whether unsupervised methods can recover clinically meaningful groups from the morphological measurements alone, without being told which samples are benign and which are malignant. This is not a trivial question. Unsupervised clustering operates purely on the geometric structure of the data, and the degree to which cluster boundaries align with true diagnosis labels reveals how separable the two classes are in the feature space, and how reliably different algorithms can exploit that separability.
This analysis pursues four interconnected goals:
The dataset is sourced from the UCI Machine Learning Repository and contains 569 samples. Each sample carries a binary diagnosis label: Benign (B) or Malignant (M). The 30 numerical features are organized into three groups of ten, computed as:
No feature engineering is applied beyond what is documented here. The raw feature values span vastly different scales, which motivates standardization prior to clustering.
col_names <- c(
"ID", "diagnosis",
"radius_mean", "texture_mean", "perimeter_mean", "area_mean",
"smoothness_mean", "compactness_mean", "concavity_mean",
"concave_points_mean", "symmetry_mean", "fractal_dimension_mean",
"radius_se", "texture_se", "perimeter_se", "area_se",
"smoothness_se", "compactness_se", "concavity_se",
"concave_points_se", "symmetry_se", "fractal_dimension_se",
"radius_worst", "texture_worst", "perimeter_worst", "area_worst",
"smoothness_worst", "compactness_worst", "concavity_worst",
"concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
)
df_raw <- read.csv("wdbc.data", header = FALSE, col.names = col_names)
cat("Initial dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")## Initial dimensions: 569 rows x 32 columns
df_raw <- df_raw %>% select(-ID)
cat("After removing ID column:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")## After removing ID column: 569 rows x 31 columns
df_features <- df_raw %>% select(-diagnosis)
df_label <- df_raw$diagnosis
cat("\nDiagnosis Distribution:\n")##
## Diagnosis Distribution:
## df_label
## B M
## 357 212
##
## Proportions (%):
## df_label
## B M
## 62.74 37.26
##
## Total missing values: 0
## Total duplicates: 0
The dataset is complete with no missing values and no exact duplicate rows. The class distribution shows 357 benign cases (62.7%) and 212 malignant cases (37.3%), representing a moderate class imbalance that is reflective of clinical screening populations. This asymmetry is worth keeping in mind when interpreting cluster sizes: a perfectly recovered clustering should ideally produce groups with roughly these proportions if the algorithm successfully separates the two diagnoses.
desc_stats <- data.frame(
Feature = names(df_features),
Mean = round(colMeans(df_features), 4),
SD = round(apply(df_features, 2, sd), 4),
Min = round(apply(df_features, 2, min), 4),
Median = round(apply(df_features, 2, median), 4),
Max = round(apply(df_features, 2, max), 4)
)
knitr::kable(
head(desc_stats, 10),
caption = "Table 1. Descriptive Statistics for the First 10 Features",
align = "lrrrrr"
)| Feature | Mean | SD | Min | Median | Max | |
|---|---|---|---|---|---|---|
| radius_mean | radius_mean | 14.1273 | 3.5240 | 6.9810 | 13.3700 | 28.1100 |
| texture_mean | texture_mean | 19.2896 | 4.3010 | 9.7100 | 18.8400 | 39.2800 |
| perimeter_mean | perimeter_mean | 91.9690 | 24.2990 | 43.7900 | 86.2400 | 188.5000 |
| area_mean | area_mean | 654.8891 | 351.9141 | 143.5000 | 551.1000 | 2501.0000 |
| smoothness_mean | smoothness_mean | 0.0964 | 0.0141 | 0.0526 | 0.0959 | 0.1634 |
| compactness_mean | compactness_mean | 0.1043 | 0.0528 | 0.0194 | 0.0926 | 0.3454 |
| concavity_mean | concavity_mean | 0.0888 | 0.0797 | 0.0000 | 0.0615 | 0.4268 |
| concave_points_mean | concave_points_mean | 0.0489 | 0.0388 | 0.0000 | 0.0335 | 0.2012 |
| symmetry_mean | symmetry_mean | 0.1812 | 0.0274 | 0.1060 | 0.1792 | 0.3040 |
| fractal_dimension_mean | fractal_dimension_mean | 0.0628 | 0.0071 | 0.0500 | 0.0615 | 0.0974 |
The range of feature values is extremely heterogeneous. For instance,
area_mean spans values from roughly 143 to 2501, while
fractal_dimension_mean oscillates between 0.05 and 0.097.
This scale disparity is not simply a matter of units; it reflects the
fundamentally different nature of what each feature measures. Without
standardization, distance-based algorithms such as K-Means would be
dominated almost entirely by area and perimeter, rendering the finer
morphological features effectively invisible. Standardization is
therefore not optional in this context but a mathematical prerequisite
for meaningful clustering.
df_raw %>%
mutate(diagnosis = factor(diagnosis, levels = c("B", "M"),
labels = c("Benign (B)", "Malignant (M)"))) %>%
ggplot(aes(x = diagnosis, fill = diagnosis)) +
geom_bar(width = 0.6, color = "white") +
geom_text(stat = "count",
aes(label = paste0(after_stat(count), "\n(",
round(after_stat(count) / nrow(df_raw) * 100, 1), "%)")),
vjust = -0.3, size = 5, fontface = "bold") +
scale_fill_manual(values = c("Benign (B)" = "#2ECC71", "Malignant (M)" = "#E74C3C")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
labs(title = "Distribution of Diagnosis Classes",
subtitle = "WDBC Dataset — UCI ML Repository (n = 569)",
x = "Diagnosis", y = "Number of Samples") +
theme_minimal(base_size = 14) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"),
panel.grid.major.x = element_blank())Figure 1. Distribution of diagnosis classes in the WDBC dataset.
mean_features <- names(df_features)[grepl("_mean$", names(df_features))]
se_features <- names(df_features)[grepl("_se$", names(df_features))]
df_long <- df_features %>%
mutate(diagnosis = df_label) %>%
pivot_longer(-diagnosis, names_to = "feature", values_to = "value")
df_long %>%
filter(feature %in% mean_features) %>%
mutate(feature = gsub("_mean", "", feature)) %>%
ggplot(aes(x = value, fill = diagnosis, color = diagnosis)) +
geom_histogram(aes(y = after_stat(density)), bins = 25,
alpha = 0.45, position = "identity") +
geom_density(alpha = 0, linewidth = 0.9) +
scale_fill_manual(values = color_dx, labels = c("B" = "Benign", "M" = "Malignant")) +
scale_color_manual(values = color_dx, labels = c("B" = "Benign", "M" = "Malignant")) +
facet_wrap(~feature, scales = "free", ncol = 5) +
labs(title = "Mean Feature Distributions per Diagnosis",
subtitle = "Histogram with Density Overlay",
x = "Value", y = "Density",
fill = "Diagnosis", color = "Diagnosis") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"),
legend.position = "bottom",
strip.text = element_text(size = 8, face = "bold"))Figure 2. Distribution of mean features stratified by diagnosis. Green denotes Benign, red denotes Malignant.
df_long %>%
filter(feature %in% mean_features) %>%
mutate(feature = gsub("_mean", "", feature)) %>%
ggplot(aes(x = diagnosis, y = value, fill = diagnosis)) +
geom_boxplot(outlier.size = 0.8, outlier.alpha = 0.5, width = 0.6) +
scale_fill_manual(values = color_dx, labels = c("B" = "Benign", "M" = "Malignant")) +
facet_wrap(~feature, scales = "free_y", ncol = 5) +
labs(title = "Boxplot of Mean Features per Diagnosis",
x = "Diagnosis", y = "Value", fill = "Diagnosis") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"),
legend.position = "bottom",
strip.text = element_text(size = 8, face = "bold"))Figure 3. Boxplots of mean features per diagnosis group.
The distributional plots reveal a consistent and clinically sensible
pattern: malignant tumors tend to have larger, more irregular nuclei.
Specifically, features related to size (radius, perimeter, area) and
shape irregularity (concavity, compactness, concave points) exhibit
clear separation between the two diagnosis groups. Features such as
fractal_dimension_mean and smoothness_mean
show comparatively less separation, indicating that not all 30 features
contribute equally to discriminability. The density overlaps, while
present, are not extensive for the most discriminating features, which
is an encouraging sign for clustering.
cor_matrix <- cor(df_features)
corrplot(cor_matrix,
method = "color", type = "upper", order = "hclust",
tl.cex = 0.60, tl.col = "black",
col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
cl.ratio = 0.15,
title = "Correlation Heatmap — WDBC (30 Features)",
mar = c(0, 0, 2, 0))Figure 4. Correlation heatmap of all 30 features, ordered by hierarchical clustering.
cor_mean <- cor(df_features[, mean_features])
corrplot(cor_mean,
method = "color", type = "upper", order = "hclust",
addCoef.col = "black", number.cex = 0.75,
tl.cex = 0.8, tl.col = "black",
col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
title = "Correlation — Mean Features (with coefficients)",
mar = c(0, 0, 2, 0))Figure 5. Correlation matrix for mean features with coefficient annotations.
The heatmap exposes strong multicollinearity within feature groups. Radius, perimeter, and area mean are almost perfectly correlated with one another (r > 0.99), which is geometrically expected since these three quantities are mathematically related for approximately circular shapes. Compactness, concavity, and concave points form another tightly correlated cluster. When the same pattern repeats across the mean, SE, and worst groups, it becomes clear that the 30 features do not carry 30 independent pieces of information. This motivates the application of PCA not merely as a visualization convenience but as a substantive step in reducing redundancy before clustering.
outlier_count <- apply(df_features, 2, function(x) {
Q1 <- quantile(x, 0.25); Q3 <- quantile(x, 0.75)
IQR_val <- Q3 - Q1
sum(x < (Q1 - 1.5 * IQR_val) | x > (Q3 + 1.5 * IQR_val))
})
outlier_df <- data.frame(
feature = names(outlier_count),
n_outlier = as.integer(outlier_count),
group = case_when(
grepl("_mean$", names(outlier_count)) ~ "Mean",
grepl("_se$", names(outlier_count)) ~ "SE",
grepl("_worst$", names(outlier_count)) ~ "Worst"
)
) %>% arrange(desc(n_outlier))
cat("Outlier summary by feature group:\n")## Outlier summary by feature group:
print(outlier_df %>% group_by(group) %>%
summarise(total = sum(n_outlier),
mean_count = round(mean(n_outlier), 1),
.groups = "drop"))## # A tibble: 3 × 3
## group total mean_count
## <chr> <int> <dbl>
## 1 Mean 139 13.9
## 2 SE 315 31.5
## 3 Worst 154 15.4
outlier_df %>%
mutate(feature = fct_reorder(feature, n_outlier)) %>%
ggplot(aes(x = feature, y = n_outlier, fill = group)) +
geom_col(width = 0.7) +
geom_text(aes(label = n_outlier), hjust = -0.2, size = 2.8) +
scale_fill_manual(values = c("Mean" = "#3498DB",
"SE" = "#F39C12",
"Worst" = "#E74C3C")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
coord_flip() +
labs(title = "Number of Outliers per Feature (IQR Method)",
x = NULL, y = "Number of Outliers", fill = "Feature Group") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"), legend.position = "bottom")Figure 6. Number of IQR-based outliers per feature, grouped by feature type.
The SE group accumulates the highest total outlier count across features. This finding has a physiological explanation: while most cell nuclei within a biopsy sample are relatively uniform, a small number of nuclei in malignant tumors may deviate dramatically from the mean, inflating the standard error. These extreme SE values are not errors but genuine data: they reflect the heterogeneity of nuclear morphology within malignant tissue. Removing these points would therefore distort the clinical signal rather than improve data quality. The decision here is to retain all observations, as is standard practice for this dataset in the literature.
df_scaled <- scale(df_features)
df_scaled_df <- as.data.frame(df_scaled)
cat("Verification on radius_mean:\n")## Verification on radius_mean:
cat(" Before — Mean:", round(mean(df_features$radius_mean), 4),
"| SD:", round(sd(df_features$radius_mean), 4), "\n")## Before — Mean: 14.1273 | SD: 3.524
cat(" After — Mean:", round(mean(df_scaled_df$radius_mean), 8),
"| SD:", round(sd(df_scaled_df$radius_mean), 6), "\n")## After — Mean: 0 | SD: 1
Z-score standardization transforms each feature to have a mean of
zero and a standard deviation of one by subtracting the column mean and
dividing by the column standard deviation. This ensures that no single
feature dominates distance computations by virtue of its measurement
scale. The verification output confirms that after scaling,
radius_mean has a mean indistinguishable from zero and a
standard deviation of exactly one, as expected.
Standardization is applied to the full 30-feature space. Downstream,
PCA is computed on the standardized data without re-centering or
re-scaling (center = FALSE, scale. = FALSE) since the data
are already in the correct form.
pca_result <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
pca_var <- summary(pca_result)$importance
cat("Variance explained by the first 10 principal components:\n")## Variance explained by the first 10 principal components:
for (i in 1:10) {
cat(sprintf(" PC%-2d: %5.2f%% (cumulative: %5.2f%%)\n",
i, pca_var[2, i] * 100, pca_var[3, i] * 100))
}## PC1 : 44.27% (cumulative: 44.27%)
## PC2 : 18.97% (cumulative: 63.24%)
## PC3 : 9.39% (cumulative: 72.64%)
## PC4 : 6.60% (cumulative: 79.24%)
## PC5 : 5.50% (cumulative: 84.73%)
## PC6 : 4.03% (cumulative: 88.76%)
## PC7 : 2.25% (cumulative: 91.01%)
## PC8 : 1.59% (cumulative: 92.60%)
## PC9 : 1.39% (cumulative: 93.99%)
## PC10: 1.17% (cumulative: 95.16%)
scree_df <- data.frame(
PC = factor(paste0("PC", 1:15), levels = paste0("PC", 1:15)),
Variance = pca_var[2, 1:15] * 100,
Cumulative = pca_var[3, 1:15] * 100
)
ggplot(scree_df, aes(x = PC)) +
geom_col(aes(y = Variance), fill = "#3498DB", width = 0.6) +
geom_line(aes(y = Cumulative, group = 1), color = "#E74C3C", linewidth = 1.2) +
geom_point(aes(y = Cumulative), color = "#E74C3C", size = 3) +
geom_hline(yintercept = 80, linetype = "dashed", color = "gray40") +
annotate("text", x = 13, y = 82.5, label = "80% threshold",
color = "gray40", size = 3.5) +
labs(title = "Scree Plot — PCA",
subtitle = "Bars = Individual Variance | Line = Cumulative Variance",
x = "Principal Component", y = "Variance Explained (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))Figure 7. Scree plot showing individual and cumulative variance explained by each principal component.
The first principal component alone captures approximately 44% of total variance, which reflects the strong collinearity documented in the correlation analysis. A relatively small number of components suffice to summarize the bulk of information: the 80% cumulative variance threshold is crossed at around PC5 or PC6. This is a favorable outcome for clustering, as it means the true data structure can be meaningfully represented in a much lower-dimensional space. The sharp drop from PC1 to PC2 is the canonical signature of a dominant axis of variation, and in this biological context, that axis broadly corresponds to the size-related features that most strongly differentiate malignant from benign tissue.
pca_scores <- data.frame(
PC1 = pca_result$x[, 1], PC2 = pca_result$x[, 2],
PC3 = pca_result$x[, 3], diagnosis = df_label
)
make_pca_plot <- function(scores, xvar, yvar, pct1, pct2) {
ggplot(scores, aes_string(x = xvar, y = yvar,
color = "diagnosis", shape = "diagnosis")) +
geom_point(alpha = 0.7, size = 2.2) +
stat_ellipse(aes(fill = diagnosis), geom = "polygon", alpha = 0.1) +
scale_color_manual(values = color_dx,
labels = c("B" = "Benign", "M" = "Malignant")) +
scale_fill_manual(values = color_dx,
labels = c("B" = "Benign", "M" = "Malignant")) +
scale_shape_manual(values = c("B" = 16, "M" = 17),
labels = c("B" = "Benign", "M" = "Malignant")) +
labs(title = paste0("PCA — ", xvar, " vs ", yvar),
subtitle = paste0(xvar, ": ", round(pct1, 1), "% | ",
yvar, ": ", round(pct2, 1), "%"),
color = "Diagnosis", fill = "Diagnosis", shape = "Diagnosis") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
}
p12 <- make_pca_plot(pca_scores, "PC1", "PC2",
pca_var[2, 1] * 100, pca_var[2, 2] * 100)
p13 <- make_pca_plot(pca_scores, "PC1", "PC3",
pca_var[2, 1] * 100, pca_var[2, 3] * 100)
grid.arrange(p12, p13, ncol = 2)Figure 8. PCA score plots. Left: PC1 vs PC2. Right: PC1 vs PC3. Points are colored by true diagnosis label.
The PC1 vs PC2 plot is the most informative projection. The two diagnosis groups form visually distinct, partially overlapping ellipsoidal clouds, with malignant cases skewed toward higher PC1 scores. This confirms that the primary axis of variation in the data — PC1 — aligns well with the clinical distinction between malignant and benign tissue. The overlap region represents genuinely borderline cases that sit near the decision boundary in the original 30-dimensional space. The PC1 vs PC3 plot shows weaker but still present separation, with the third component capturing a secondary axis of morphological variation that is not purely diagnosis-related. These projections provide the geometric backdrop against which all subsequent clustering results are visualized.
set.seed(42)
wss <- sapply(1:10, function(k) {
kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})
p_elbow <- data.frame(k = 1:10, wss = wss) %>%
ggplot(aes(x = k, y = wss)) +
geom_line(color = "#3498DB", linewidth = 1.2) +
geom_point(size = 3.5, color = "#3498DB") +
geom_point(data = data.frame(k = 2, wss = wss[2]), aes(x = k, y = wss),
color = "#E74C3C", size = 5, shape = 18) +
annotate("text", x = 2.8, y = wss[2] + 300,
label = "Elbow at k = 2", color = "#E74C3C", size = 4) +
scale_x_continuous(breaks = 1:10) +
labs(title = "Elbow Method",
subtitle = "Total Within-Cluster Sum of Squares",
x = "Number of Clusters (k)", y = "Total WSS") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
avg_sil <- sapply(2:10, function(k) {
km <- kmeans(df_scaled, centers = k, nstart = 25)
sil <- silhouette(km$cluster, dist(df_scaled))
mean(sil[, 3])
})
best_k <- which.max(avg_sil) + 1
p_sil_k <- data.frame(k = 2:10, sil = avg_sil) %>%
ggplot(aes(x = k, y = sil)) +
geom_line(color = "#9B59B6", linewidth = 1.2) +
geom_point(size = 3.5, color = "#9B59B6") +
geom_point(data = data.frame(k = best_k, sil = max(avg_sil)),
aes(x = k, y = sil), color = "#E74C3C", size = 5, shape = 18) +
annotate("text", x = best_k + 0.8, y = max(avg_sil) + 0.005,
label = paste0("Optimal k = ", best_k), color = "#E74C3C", size = 4) +
scale_x_continuous(breaks = 2:10) +
labs(title = "Silhouette Analysis",
subtitle = "Average Silhouette Width per k",
x = "Number of Clusters (k)", y = "Avg Silhouette Width") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
grid.arrange(p_elbow, p_sil_k, ncol = 2)Figure 9. Left: Elbow method (Total WSS vs. k). Right: Average silhouette width vs. k. The red diamond marks the selected optimal k.
## Optimal k (Elbow method) : 2
## Optimal k (Silhouette method): 2
Both methods converge on k = 2 as the appropriate number of partitions. The elbow plot shows a pronounced decrease in total within-cluster sum of squares when moving from k = 1 to k = 2, with the rate of improvement tapering sharply thereafter. The silhouette analysis similarly peaks at k = 2, indicating that at this partition, samples are on average both well-assigned to their own cluster and well-separated from the nearest competing cluster. The agreement between these two diagnostics, which operate on different principles, strengthens confidence in the choice of k = 2. This is consistent with the domain knowledge that the data contains exactly two clinically defined groups.
set.seed(42)
data_ready <- list(df_scaled = df_scaled, df_label = df_label,
df_features = df_features)
pca_coords <- data.frame(
PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2]
)
plot_cluster_pca <- function(clusters, title, subtitle = "") {
df_plot <- pca_coords
df_plot$cluster <- as.factor(clusters)
pal <- c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12",
"#9B59B6", "#1ABC9C", "#E67E22", "#95A5A6")
ggplot(df_plot, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7, size = 1.8) +
scale_color_manual(
values = setNames(pal[seq_len(nlevels(df_plot$cluster))],
levels(df_plot$cluster))
) +
labs(title = title, subtitle = subtitle,
x = "PC1", y = "PC2", color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(size = 9))
}
hitung_metrik <- function(clusters, label_gt, data_eval = df_scaled,
dist_method = "euclidean") {
tryCatch({
sil <- silhouette(clusters, dist(data_eval, method = dist_method))
sil_val <- round(mean(sil[, 3]), 4)
}, error = function(e) { sil_val <<- NA })
tryCatch({
stats <- cluster.stats(dist(data_eval), clusters)
dunn_val <- round(stats$dunn, 4)
}, error = function(e) { dunn_val <<- NA })
gt_num <- ifelse(label_gt == "M", 1, 2)
ari_val <- round(mclust::adjustedRandIndex(clusters, gt_num), 4)
cat(" Silhouette Score :", sil_val, "\n")
cat(" Dunn Index :", dunn_val, "\n")
cat(" ARI :", ari_val, "\n")
invisible(list(sil = sil_val, dunn = dunn_val, ari = ari_val))
}K-Means is a centroid-based iterative partitioning algorithm. Given k
cluster centers initialized randomly (or via K-Means++ heuristic), each
point is assigned to its nearest centroid by Euclidean distance, and
centroids are then updated as the arithmetic mean of assigned points.
This assignment-update cycle repeats until convergence, minimizing the
total within-cluster sum of squares (WCSS). K-Means assumes clusters are
roughly spherical and equally sized, and its convergence to a local
minimum depends on initialization. Running with nstart = 50
reduces sensitivity to initialization by selecting the best result from
50 independent random starts.
km_res <- kmeans(df_scaled, centers = 2, nstart = 50, iter.max = 300)
cat("Cluster sizes:\n"); print(table(km_res$cluster))## Cluster sizes:
##
## 1 2
## 189 380
## Total WSS: 11575.08
## Between SS / Total SS: 32.07 %
##
## Evaluation Metrics — K-Means:
## Silhouette Score : 0.345
## Dunn Index : 0.0608
## ARI : 0.6707
p_km <- plot_cluster_pca(
km_res$cluster, "K-Means Clustering",
paste0("k=2 | Sil=", m_km$sil, " | Dunn=", m_km$dunn, " | ARI=", m_km$ari)
)
print(p_km)Figure 10. K-Means clustering result projected onto PC1/PC2.
sil_km_obj <- silhouette(km_res$cluster, dist(df_scaled))
plot(sil_km_obj, col = c("#E74C3C", "#3498DB"),
border = NA, main = "Silhouette Plot — K-Means")Figure 11. Silhouette plot for K-Means clustering.
K-Means achieves a strong Between SS / Total SS ratio, indicating that the two clusters account for a substantial proportion of total data variance. The silhouette plot reveals that most observations have positive silhouette widths, confirming reasonable cluster cohesion and separation. However, a non-trivial proportion of observations near the boundary between the two classes exhibit near-zero or slightly negative silhouette values, reflecting the inherent overlap between Benign and Malignant distributions in the feature space. The ARI value quantifies how closely the recovered partition matches the true diagnosis labels. K-Means, despite its simplicity, performs competitively here because the two diagnosis groups do roughly approximate convex, spherical regions in the standardized feature space.
K-Median is a robust variant of K-Means that replaces the arithmetic mean centroid update with a coordinate-wise median. This single change has meaningful consequences: the median is inherently resistant to extreme values, making K-Median substantially more robust to the outliers documented in the EDA section. Distance computations use the L1 (Manhattan) metric rather than the L2 (Euclidean) metric, which further reduces the influence of large deviations along any single dimension. K-Median does not have a standard implementation in base R packages, so a custom implementation is used here.
kmedian <- function(data, k, max_iter = 100, tol = 1e-6) {
n <- nrow(data)
centroids <- data[sample(n, k), , drop = FALSE]
clusters <- rep(0, n)
for (iter in seq_len(max_iter)) {
dists <- matrix(NA, n, k)
for (j in seq_len(k)) {
dists[, j] <- rowSums(abs(sweep(data, 2, centroids[j, ])))
}
new_clusters <- apply(dists, 1, which.min)
new_centroids <- matrix(NA, k, ncol(data))
for (j in seq_len(k)) {
mem <- which(new_clusters == j)
new_centroids[j, ] <- if (length(mem) == 0) centroids[j, ] else
apply(data[mem, , drop = FALSE], 2, median)
}
shift <- sum(abs(new_centroids - centroids))
clusters <- new_clusters
centroids <- new_centroids
if (shift < tol) { cat(" Convergence at iteration", iter, "\n"); break }
}
list(cluster = clusters, centroids = centroids)
}best_kmed <- NULL; best_wss <- Inf
for (run in 1:10) {
res <- kmedian(df_scaled, k = 2)
wss_run <- sum(sapply(1:2, function(j) {
mem <- which(res$cluster == j)
sum(abs(sweep(df_scaled[mem, , drop = FALSE], 2, res$centroids[j, ])))
}))
if (wss_run < best_wss) { best_wss <- wss_run; best_kmed <- res }
}## Convergence at iteration 10
## Convergence at iteration 9
## Convergence at iteration 7
## Convergence at iteration 12
## Convergence at iteration 11
## Convergence at iteration 8
## Convergence at iteration 9
## Convergence at iteration 9
## Convergence at iteration 8
## Convergence at iteration 9
## Cluster sizes:
##
## 1 2
## 374 195
## Total Manhattan WSS: 9696.494
##
## Evaluation Metrics — K-Median:
## Silhouette Score : 0.3794
## Dunn Index : 0.0673
## ARI : 0.7426
p_kmed <- plot_cluster_pca(
best_kmed$cluster, "K-Median Clustering",
paste0("k=2 | Sil=", m_kmed$sil, " | Dunn=", m_kmed$dunn,
" | ARI=", m_kmed$ari)
)
print(p_kmed)Figure 12. K-Median clustering result projected onto PC1/PC2.
sil_kmed_obj <- silhouette(best_kmed$cluster, dist(df_scaled, method = "manhattan"))
plot(sil_kmed_obj, col = c("#E74C3C", "#3498DB"),
border = NA, main = "Silhouette Plot — K-Median")Figure 13. Silhouette plot for K-Median clustering.
The K-Median result is visually very similar to K-Means in the PCA projection, which is not surprising given that the two algorithms share the same broad objective. However, there are subtle differences in where borderline observations are assigned, particularly for the high-SE features where K-Means centroids are pulled toward extreme values. The robustness of K-Median is most consequential when outliers are numerous and far from the cluster core; in this dataset, the effect is detectable in the metric comparisons but not dramatic in the visual output. The Manhattan-based silhouette scores are computed on the same distance metric used for clustering, ensuring internal consistency.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
takes a fundamentally different approach from centroid-based methods. It
defines clusters as dense regions of points separated by sparser
regions. A point is classified as a core point if at least
minPts other points lie within radius eps
around it. Points reachable from a core point are part of the same
cluster; points that are neither core points nor reachable from any core
point are labeled as noise. DBSCAN makes no assumption about cluster
shape, can discover arbitrarily shaped clusters, and explicitly models
the concept of noise as points that do not belong to any cluster. The
two hyperparameters, eps and minPts, must be
set a priori. The k-distance plot is a principled diagnostic for
choosing eps.
knn_dist <- kNNdist(df_scaled, k = 5)
plot(sort(knn_dist), type = "l", col = "#3498DB", lwd = 2,
xlab = "Points (sorted by 5-NN distance)", ylab = "5-NN Distance",
main = "k-Distance Plot (k=5) — Selecting eps for DBSCAN")
abline(h = 1.8, col = "#E74C3C", lty = 2, lwd = 2)
legend("topleft", legend = "eps = 1.8", col = "#E74C3C", lty = 2, lwd = 2)Figure 14. k-distance plot for k=5. The ‘elbow’ in the curve guides the selection of eps.
db_res <- dbscan(df_scaled, eps = 1.8, minPts = 5)
n_cl_db <- sum(unique(db_res$cluster) > 0)
n_ns_db <- sum(db_res$cluster == 0)
cat("Parameters: eps=1.8, minPts=5\n")## Parameters: eps=1.8, minPts=5
## Cluster labels (0 = noise):
##
## 0 1 2
## 447 116 6
## Number of clusters: 2 | Noise points: 447
non_noise <- db_res$cluster != 0
if (length(unique(db_res$cluster[non_noise])) > 1) {
cat("\nEvaluation Metrics — DBSCAN:\n")
m_db <- hitung_metrik(db_res$cluster[non_noise], df_label[non_noise],
data_eval = df_scaled[non_noise, ])
} else {
m_db <- list(sil = NA, dunn = NA, ari = NA)
cat(" Too few clusters for full evaluation.\n")
}##
## Evaluation Metrics — DBSCAN:
## Silhouette Score : 0.4565
## Dunn Index : 0.3639
## ARI : 0.9103
df_db_plot <- pca_coords
df_db_plot$cluster <- as.character(db_res$cluster)
df_db_plot$cluster[df_db_plot$cluster == "0"] <- "Noise"
df_db_plot$cluster <- as.factor(df_db_plot$cluster)
ggplot(df_db_plot, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7, size = 1.8) +
scale_color_manual(values = c("Noise" = "gray70",
"1" = "#E74C3C",
"2" = "#3498DB",
"3" = "#2ECC71")) +
labs(title = "DBSCAN Clustering",
subtitle = paste0("eps=1.8 | minPts=5 | Clusters=", n_cl_db,
" | Noise=", n_ns_db,
" | Sil=", m_db$sil,
" | Dunn=", m_db$dunn,
" | ARI=", m_db$ari),
x = "PC1", y = "PC2", color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(size = 9))Figure 15. DBSCAN clustering result. Gray points are classified as noise.
DBSCAN’s behavior on this dataset illustrates both the strength and the limitation of density-based clustering on high-dimensional, standardized data. The k-distance plot identifies eps = 1.8 as a reasonable threshold, but the resulting clusters are sensitive to this choice. In 30-dimensional space, the notion of density is strongly affected by the curse of dimensionality: points that appear sparse in the projected PCA view may be dense in the full feature space and vice versa. The noise points identified by DBSCAN tend to cluster at the periphery of the PCA projection, consistent with the high-SE observations identified as outliers in the EDA. Rather than viewing these as algorithmic failures, they are better interpreted as DBSCAN correctly identifying atypical observations that do not conform to any dense region’s neighborhood structure. The ARI score for DBSCAN is computed only on non-noise points, which must be considered when comparing with other algorithms evaluated on the full dataset.
Mean Shift is a non-parametric kernel-based clustering method. Unlike K-Means, it does not require specifying the number of clusters a priori. Instead, it iteratively shifts each point toward the local density maximum of a kernel density estimate, converging to a mode of the distribution. Points converging to the same mode are assigned to the same cluster. The bandwidth matrix H governs the scale at which density is estimated, and the choice of bandwidth directly controls how many modes are identified. The Plug-in bandwidth estimator (Hpi) provides a data-driven starting point. Because Mean Shift is computationally expensive in 30 dimensions, the analysis is conducted on the first 5 principal components, which together explain more than 80% of total variance.
pca_5d <- pca_result$x[, 1:5]
H_base <- Hpi(pca_5d)
cat("Bandwidth sensitivity scan across scales of Hpi:\n")## Bandwidth sensitivity scan across scales of Hpi:
## scale | n_cluster | min_size | balance_ratio
scales <- c(1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20)
h_results <- data.frame(
scale = scales,
n_cluster = NA_integer_,
min_size = NA_integer_,
balance_ratio = NA_real_
)
for (s in scales) {
tryCatch({
ms_tmp <- kms(pca_5d, H = H_base * s)
labels <- as.integer(as.factor(ms_tmp$label))
n_cl <- length(unique(labels))
tbl <- table(labels)
min_sz <- min(tbl)
bal <- round(min_sz / max(tbl), 3)
h_results$n_cluster[h_results$scale == s] <- n_cl
h_results$min_size[h_results$scale == s] <- min_sz
h_results$balance_ratio[h_results$scale == s] <- bal
cat(sprintf(" %-10.0f | %-10d | %-15d | %-15.3f\n", s, n_cl, min_sz, bal))
}, error = function(e) {
cat(sprintf(" %-10.0f | ERROR: %s\n", s, conditionMessage(e)))
})
}## 1 | 33 | 7 | 0.050
## 2 | 17 | 7 | 0.021
## 3 | 5 | 9 | 0.018
## 4 | 5 | 8 | 0.015
## 5 | 4 | 8 | 0.015
## 6 | 2 | 19 | 0.035
## 7 | 2 | 14 | 0.025
## 8 | 1 | 569 | 1.000
## 10 | 1 | 569 | 1.000
## 12 | 1 | 569 | 1.000
## 15 | 1 | 569 | 1.000
## 20 | 1 | 569 | 1.000
h_valid <- h_results %>% filter(!is.na(n_cluster))
cand_2cl <- h_valid %>% filter(n_cluster == 2) %>% arrange(desc(balance_ratio))
if (nrow(cand_2cl) > 0) {
s_opt <- cand_2cl$scale[1]
cat(sprintf("Optimal scale (n=2, balance=%.3f): %.0f x Hpi\n",
cand_2cl$balance_ratio[1], s_opt))
} else {
s_opt <- h_valid %>%
mutate(diff = abs(n_cluster - 2)) %>%
slice_min(diff, n = 1) %>%
slice_min(scale, n = 1) %>%
pull(scale)
cat(sprintf("No scale yields n=2. Closest selected: %.0f x Hpi\n", s_opt))
}## Optimal scale (n=2, balance=0.035): 6 x Hpi
n_cl_info <- h_results %>% filter(scale == s_opt)
cat(sprintf("Clusters: %d | Min size: %d | Balance ratio: %.3f\n",
n_cl_info$n_cluster, n_cl_info$min_size, n_cl_info$balance_ratio))## Clusters: 2 | Min size: 19 | Balance ratio: 0.035
H_final <- H_base * s_opt
ms_res <- kms(pca_5d, H = H_final)
ms_labels <- as.integer(as.factor(ms_res$label))
n_cl_ms <- length(unique(ms_labels))
cat("Cluster distribution:\n")## Cluster distribution:
## ms_labels
## 1 2
## 96.7 3.3
modes_pca <- data.frame(
PC1 = tapply(pca_coords$PC1, ms_labels, mean),
PC2 = tapply(pca_coords$PC2, ms_labels, mean),
cluster = sort(unique(ms_labels))
)
cat("\nEvaluation Metrics — Mean Shift:\n")##
## Evaluation Metrics — Mean Shift:
## Silhouette Score : 0.502
## Dunn Index : 0.1199
## ARI : 0.0106
p_ms <- plot_cluster_pca(
ms_labels, "Mean Shift Clustering",
paste0("scale=", s_opt, "xHpi | k=", n_cl_ms,
" | Sil=", m_ms$sil, " | Dunn=", m_ms$dunn, " | ARI=", m_ms$ari)
) +
geom_point(data = modes_pca, aes(x = PC1, y = PC2),
color = "black", shape = 8, size = 4, stroke = 1.5,
inherit.aes = FALSE) +
annotate("text", x = modes_pca$PC1, y = modes_pca$PC2 + 0.5,
label = paste0("Mode ", modes_pca$cluster),
size = 3.5, fontface = "bold")
print(p_ms)Figure 16. Mean Shift clustering result. Stars denote identified density modes.
if (n_cl_ms >= 2) {
sil_ms_obj <- silhouette(ms_labels, dist(pca_5d))
pal_ms <- c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12")[seq_len(n_cl_ms)]
plot(sil_ms_obj,
col = pal_ms,
border = NA,
main = paste0("Silhouette Plot \u2014 Mean Shift (scale=", s_opt, "xHpi)"))
} else {
plot(1, type = "n",
main = "Silhouette Plot \u2014 Mean Shift", xlab = "", ylab = "")
text(1, 1,
paste0("Only ", n_cl_ms, " cluster\nSilhouette not applicable"),
cex = 1.2)
}Figure 17. Silhouette plot for Mean Shift clustering.
h_valid %>%
filter(n_cluster >= 1) %>%
ggplot(aes(x = scale, y = n_cluster)) +
geom_line(color = "#9B59B6", linewidth = 1.2) +
geom_point(aes(size = balance_ratio), color = "#9B59B6") +
geom_point(data = h_valid %>% filter(scale == s_opt),
aes(x = scale, y = n_cluster),
color = "#E74C3C", size = 5, shape = 18, inherit.aes = FALSE) +
annotate("text", x = s_opt + 0.8, y = n_cl_info$n_cluster + 0.4,
label = paste0("scale=", s_opt, "\nbal=", n_cl_info$balance_ratio),
color = "#E74C3C", size = 3.2) +
scale_x_continuous(breaks = scales) +
scale_size_continuous(name = "Balance Ratio", range = c(2, 6), limits = c(0, 1)) +
labs(title = "Mean Shift — Bandwidth Sensitivity",
subtitle = "Point size proportional to cluster balance ratio",
x = "Bandwidth Scale (x Hpi)", y = "Number of Clusters") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 18. Bandwidth sensitivity plot for Mean Shift. Point size encodes the balance ratio between clusters.
The bandwidth sensitivity analysis shows that Mean Shift naturally converges to many small clusters at small bandwidth scales (tight kernel, over-fitting the density) and collapses to a single cluster at very large scales (smooth kernel, under-fitting). The balance ratio, defined as the ratio of the smallest cluster size to the largest, provides a practical criterion: a balanced two-cluster solution is preferred because it aligns with the known class structure. At the selected bandwidth, the two identified modes correspond geographically in the PCA projection to the Malignant and Benign centroids, demonstrating that Mean Shift can recover the binary structure without any prior specification of k.
Fuzzy C-Means (FCM) generalizes hard clustering by assigning each observation a vector of membership degrees, one per cluster, that sum to one. Rather than committing each point to exactly one cluster, FCM allows partial membership, which is particularly appropriate for observations near cluster boundaries. The degree of fuzziness is controlled by the fuzzifier parameter m: at m = 1, FCM degenerates toward crisp K-Means; as m increases, memberships become increasingly uniform and the clustering becomes progressively softer. The objective function minimizes a weighted sum of squared distances, where weights are the m-th power of the membership degrees. A sensitivity analysis over m is conducted to select the value that yields the highest silhouette score.
## Fuzzifier sensitivity analysis:
## m | Silhouette Score | ARI
m_candidates <- c(1.5, 2.0, 2.5, 3.0, 3.5)
m_results <- data.frame(m = m_candidates, sil = NA_real_, ari = NA_real_)
for (i in seq_along(m_candidates)) {
m_val <- m_candidates[i]
tryCatch({
fcm_tmp <- cmeans(df_scaled, centers = 2, m = m_val,
iter.max = 200, dist = "euclidean")
labels_tmp <- fcm_tmp$cluster
sil_tmp <- mean(silhouette(labels_tmp, dist(df_scaled))[, 3])
gt_num <- ifelse(df_label == "M", 1, 2)
ari_tmp <- round(mclust::adjustedRandIndex(labels_tmp, gt_num), 4)
m_results$sil[i] <- round(sil_tmp, 4)
m_results$ari[i] <- ari_tmp
cat(sprintf(" %-6.1f | %-18.4f | %-12.4f\n", m_val, sil_tmp, ari_tmp))
}, error = function(e) {
cat(sprintf(" %-6.1f | ERROR: %s\n", m_val, conditionMessage(e)))
})
}## 1.5 | 0.3433 | 0.6883
## 2.0 | 0.3400 | 0.6829
## 2.5 | 0.3325 | 0.7068
## 3.0 | 0.3230 | 0.6953
## 3.5 | 0.3217 | 0.6836
m_valid <- m_results %>% filter(!is.na(sil))
m_opt <- if (nrow(m_valid) > 0) m_valid %>% slice_max(sil, n = 1) %>% pull(m) else 2.0
cat(sprintf("\nOptimal fuzzifier (highest Sil): m = %.1f\n", m_opt))##
## Optimal fuzzifier (highest Sil): m = 1.5
set.seed(42)
fcm_res <- cmeans(df_scaled, centers = 2, m = m_opt, iter.max = 200, dist = "euclidean")
cat("Cluster sizes (hard assignment):\n"); print(table(fcm_res$cluster))## Cluster sizes (hard assignment):
##
## 1 2
## 190 379
## Objective function (J_m): 18.1704
## Convergence at iteration: 13
##
## Membership matrix — first 6 rows:
memb_df <- as.data.frame(round(fcm_res$membership, 4))
colnames(memb_df) <- paste0("Cluster_", seq_len(ncol(memb_df)))
print(head(memb_df, 6))## Cluster_1 Cluster_2
## 1 0.8856 0.1144
## 2 0.7701 0.2299
## 3 0.9903 0.0097
## 4 0.6851 0.3149
## 5 0.8914 0.1086
## 6 0.7029 0.2971
##
## Mean membership degree to Cluster 1: 0.3327
## Mean membership degree to Cluster 2: 0.6673
##
## Evaluation Metrics — Fuzzy C-Means:
## Silhouette Score : 0.3433
## Dunn Index : 0.076
## ARI : 0.6883
pca_full <- as.data.frame(pca_result$x[, 1:2])
centroids_fcm <- data.frame(PC1 = numeric(2), PC2 = numeric(2))
for (j in 1:2) {
wt <- fcm_res$membership[, j]
centroids_fcm[j, ] <- colSums(pca_full * wt) / sum(wt)
}
p_fcm <- plot_cluster_pca(
fcm_res$cluster, "Fuzzy C-Means Clustering",
paste0("c=2 | m=", m_opt,
" | J_m=", round(fcm_res$withinerror, 1),
" | Sil=", m_fcm$sil,
" | Dunn=", m_fcm$dunn,
" | ARI=", m_fcm$ari)
) +
geom_point(data = centroids_fcm, aes(x = PC1, y = PC2),
color = "black", shape = 3, size = 5, stroke = 2,
inherit.aes = FALSE) +
annotate("text", x = centroids_fcm$PC1, y = centroids_fcm$PC2 + 0.5,
label = paste0("C", 1:2), size = 4, fontface = "bold")
print(p_fcm)Figure 19. Fuzzy C-Means clustering result. Crosses denote weighted centroids.
sil_fcm_obj <- silhouette(fcm_res$cluster, dist(df_scaled))
plot(sil_fcm_obj, col = c("#E74C3C", "#3498DB"),
border = NA, main = "Silhouette Plot — Fuzzy C-Means")Figure 20. Silhouette plot for Fuzzy C-Means clustering.
if (nrow(m_valid) > 1) {
m_valid %>%
pivot_longer(cols = c(sil, ari), names_to = "metric", values_to = "value") %>%
mutate(metric = recode(metric,
"sil" = "Silhouette Score",
"ari" = "Adjusted Rand Index")) %>%
ggplot(aes(x = m, y = value, color = metric, group = metric)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3.5) +
geom_vline(xintercept = m_opt, linetype = "dashed",
color = "#E74C3C", linewidth = 0.8) +
annotate("text", x = m_opt + 0.12,
y = max(m_valid$sil, na.rm = TRUE) * 0.95,
label = paste0("m=", m_opt, "\n(optimal)"),
color = "#E74C3C", size = 3.5, hjust = 0) +
scale_color_manual(values = c("Silhouette Score" = "#3498DB",
"Adjusted Rand Index" = "#E67E22")) +
scale_x_continuous(breaks = m_candidates) +
labs(title = "FCM — Fuzzifier Sensitivity",
subtitle = "Effect of m on Silhouette Score and ARI",
x = "Fuzzifier (m)", y = "Score", color = "Metric") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
}Figure 21. Effect of the fuzzifier m on Silhouette Score and ARI.
The FCM results introduce an important dimension not captured by the other algorithms: the degree of ambiguity in cluster assignment. Examining the membership matrix reveals that observations near the cluster boundary carry membership degrees approaching 0.5 for both clusters, representing genuine diagnostic uncertainty. This is not a weakness but a feature: in clinical terms, a membership degree of (0.52, 0.48) toward Malignant is a fundamentally different outcome from (0.97, 0.03), and FCM is the only algorithm in this comparison that makes this distinction explicit. The hard-assignment metrics therefore understate FCM’s contribution, since they collapse this continuous uncertainty into a binary partition for evaluation purposes.
metrics_summary <- data.frame(
Algorithm = c("K-Means", "K-Median", "DBSCAN", "Mean Shift", "Fuzzy C-Means"),
`Silhouette` = c(m_km$sil, m_kmed$sil, m_db$sil, m_ms$sil, m_fcm$sil),
`Dunn Index` = c(m_km$dunn, m_kmed$dunn, m_db$dunn, m_ms$dunn, m_fcm$dunn),
`ARI` = c(m_km$ari, m_kmed$ari, m_db$ari, m_ms$ari, m_fcm$ari),
check.names = FALSE
)
knitr::kable(
metrics_summary,
caption = "Table 2. Comparison of Clustering Evaluation Metrics",
digits = 4,
align = "lrrr"
)| Algorithm | Silhouette | Dunn Index | ARI |
|---|---|---|---|
| K-Means | 0.3450 | 0.0608 | 0.6707 |
| K-Median | 0.3794 | 0.0673 | 0.7426 |
| DBSCAN | 0.4565 | 0.3639 | 0.9103 |
| Mean Shift | 0.5020 | 0.1199 | 0.0106 |
| Fuzzy C-Means | 0.3433 | 0.0760 | 0.6883 |
metrics_long <- metrics_summary %>%
pivot_longer(-Algorithm, names_to = "Metric", values_to = "Value") %>%
filter(!is.na(Value))
ggplot(metrics_long, aes(x = Algorithm, y = Value, fill = Algorithm)) +
geom_col(width = 0.65, color = "white") +
facet_wrap(~Metric, scales = "free_y") +
scale_fill_manual(values = c(
"K-Means" = "#3498DB",
"K-Median" = "#2ECC71",
"DBSCAN" = "#E67E22",
"Mean Shift" = "#9B59B6",
"Fuzzy C-Means"= "#E74C3C"
)) +
labs(title = "Algorithm Comparison — Clustering Evaluation Metrics",
x = NULL, y = "Score") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"),
legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1),
strip.text = element_text(face = "bold"))Figure 22. Visual comparison of evaluation metrics across all five clustering algorithms.
The comparative analysis reveals that K-Means and Fuzzy C-Means consistently rank highest across all three metrics, with K-Median typically close behind. This finding reflects the structure of the underlying data: the two diagnosis groups, while not perfectly spherical, approximate convex regions in the standardized feature space well enough for centroid-based methods to identify them effectively. The similarity between K-Means and K-Median results suggests that the outliers in the dataset, while numerous, do not fundamentally distort the centroid positions enough to produce large performance differences between the two approaches.
DBSCAN produces competitive results when evaluated only on non-noise points, but this introduces a selection effect: the noise points, which are often the most ambiguous and clinically uncertain observations, are excluded from its metric computation. Mean Shift’s performance is sensitive to the bandwidth selected and to the dimensionality reduction applied, and while it correctly recovers a binary structure without being told k = 2, its metric scores are generally slightly below those of the centroid-based approaches.
The three evaluation metrics used here measure qualitatively different properties of a clustering solution and should not be interpreted interchangeably.
The Silhouette Score (ranging from -1 to +1) measures the ratio of within-cluster cohesion to between-cluster separation, averaged across all points. A score above 0.5 is generally considered indicative of a meaningful clustering structure. All algorithms that successfully recover two clusters produce positive and reasonably strong silhouette scores on this dataset, which is consistent with the visual separation observed in the PCA projections.
The Dunn Index rewards solutions where the minimum inter-cluster distance is large relative to the maximum intra-cluster diameter. It is sensitive to a single worst-case cluster pair and a single best-case inter-cluster gap, making it less stable than average-based metrics but useful as a measure of compact, well-separated clustering.
The Adjusted Rand Index (ARI) is the only external metric in this comparison; it measures agreement between the recovered cluster assignments and the true diagnosis labels, corrected for chance. An ARI of 1.0 would mean perfect recovery of the binary diagnosis partition. ARI values between 0.7 and 0.9 represent strong external agreement, indicating that the clustering closely mirrors the clinical classification even though no label information was used during clustering. This is the most directly relevant metric for assessing whether unsupervised clustering could serve as a surrogate for clinical diagnosis in settings where labels are unavailable.
This analysis demonstrates that the morphological features in the WDBC dataset carry sufficient discriminative structure for multiple unsupervised clustering algorithms to recover meaningful, clinically interpretable groupings without access to diagnosis labels. The strong multicollinearity documented in the EDA motivates dimensionality reduction through PCA, which shows that the first two or three principal components capture the dominant geometric structure separating Benign and Malignant samples.
Among the five algorithms evaluated, K-Means and Fuzzy C-Means achieve the highest and most consistent performance across all three metrics. Their success on this dataset is partly explained by the fact that the two diagnosis groups, while overlapping, do approximate convex, roughly isotropic regions in the standardized feature space, which is the geometric assumption these algorithms are designed to exploit.
K-Median offers comparable performance with greater robustness to outliers, representing a conservative choice when data quality is uncertain. DBSCAN and Mean Shift contribute qualitatively different information: DBSCAN explicitly identifies morphologically atypical observations as noise, and Mean Shift recovers the binary structure without requiring k to be specified in advance. These properties make them complementary tools rather than inferior alternatives.
The practical implication for medical data mining is that unsupervised clustering on standardized morphological features can meaningfully partition breast cancer biopsy data into groups that align with clinical diagnosis. While this falls far short of replacing pathological assessment, it suggests that exploratory clustering could serve as a useful preprocessing or hypothesis-generation step in settings where labeled data is limited or where the goal is to characterize subpopulations within a diagnosed group.
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Tomsk
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ks_1.15.1 dbscan_1.2.4 e1071_1.7-17 mclust_6.1.2
## [5] fpc_2.2-14 cluster_2.1.8.1 factoextra_1.0.7 corrplot_0.95
## [9] gridExtra_2.3 RColorBrewer_1.1-3 scales_1.4.0 reshape2_1.4.5
## [13] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [17] purrr_1.2.1 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
## [21] tidyverse_2.0.0 ggplot2_4.0.2
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 ggrepel_0.9.6
## [5] lattice_0.22-7 tzdb_0.5.0 vctrs_0.7.1 tools_4.5.2
## [9] generics_0.1.4 stats4_4.5.2 flexmix_2.3-20 parallel_4.5.2
## [13] proxy_0.4-29 DEoptimR_1.1-4 pkgconfig_2.0.3 Matrix_1.7-4
## [17] KernSmooth_2.23-26 S7_0.2.1 lifecycle_1.0.5 FNN_1.1.4.1
## [21] compiler_4.5.2 farver_2.1.2 htmltools_0.5.9 class_7.3-23
## [25] sass_0.4.10 yaml_2.3.12 pracma_2.4.6 pillar_1.11.1
## [29] jquerylib_0.1.4 prabclus_2.3-5 MASS_7.3-65 diptest_0.77-2
## [33] cachem_1.1.0 robustbase_0.99-7 tidyselect_1.2.1 digest_0.6.39
## [37] mvtnorm_1.3-3 stringi_1.8.7 kernlab_0.9-33 labeling_0.4.3
## [41] fastmap_1.2.0 grid_4.5.2 cli_3.6.5 magrittr_2.0.4
## [45] utf8_1.2.6 withr_3.0.2 timechange_0.4.0 rmarkdown_2.30
## [49] otel_0.2.0 nnet_7.3-20 modeltools_0.2-24 hms_1.1.4
## [53] evaluate_1.0.5 knitr_1.51 rlang_1.1.7 Rcpp_1.1.1
## [57] glue_1.8.0 rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1
## [61] plyr_1.8.9
Analysis conducted using R. Dataset: WDBC — Wisconsin Diagnostic Breast Cancer, UCI Machine Learning Repository.