library(tidyverse)
library(flexclust)
library(meanShiftR)
library(dbscan)
library(e1071)
library(cluster)
library(fpc)
library(mclust)
library(patchwork)
library(factoextra)
library(corrplot)
library(RColorBrewer)
library(kableExtra)
set.seed(123)
pal_4 <- c("#F4A261", "#A8DADC", "#C77DFF", "#80B918")
my_pal <- c(bus = "#F4A261", opel = "#A8DADC", saab = "#C77DFF", van = "#80B918")## COMPACTNESS CIRCULARITY DISTANCE.CIRCULARITY RADIUS.RATIO
## 1 95 48 83 178
## 2 91 41 84 141
## 3 104 50 106 209
## 4 93 41 82 159
## 5 85 44 70 205
## 6 107 57 106 172
## 7 97 43 73 173
## 8 90 43 66 157
## 9 86 34 62 140
## 10 93 44 98 197
## PR.AXIS.ASPECT.RATIO MAX.LENGTH.ASPECT.RATIO SCATTER.RATIO ELONGATEDNESS
## 1 72 10 162 42
## 2 57 9 149 45
## 3 66 10 207 32
## 4 63 9 144 46
## 5 103 52 149 45
## 6 50 6 255 26
## 7 65 6 153 42
## 8 65 9 137 48
## 9 61 7 122 54
## 10 62 11 183 36
## PR.AXIS.RECTANGULARITY MAX.LENGTH.RECTANGULARITY
## 1 20 159
## 2 19 143
## 3 23 158
## 4 19 143
## 5 19 144
## 6 28 169
## 7 19 143
## 8 18 146
## 9 17 127
## 10 22 146
## SCALED.VARIANCE.ALONG.MAJOR.AXIS SCALED.VARIANCE.ALONG.MINOR.AXIS
## 1 176 379
## 2 170 330
## 3 223 635
## 4 160 309
## 5 241 325
## 6 280 957
## 7 176 361
## 8 162 281
## 9 141 223
## 10 202 505
## SCALED.RADIUS.OF.GYRATION SKEWNESS.ABOUT.MAJOR.AXIS
## 1 184 70
## 2 158 72
## 3 220 73
## 4 127 63
## 5 188 127
## 6 264 85
## 7 172 66
## 8 164 67
## 9 112 64
## 10 152 64
## SKEWNESS.ABOUT.MINOR.AXIS KURTOSIS.ABOUT.MINOR.AXIS
## 1 6 16
## 2 9 14
## 3 14 9
## 4 6 10
## 5 9 11
## 6 5 9
## 7 13 1
## 8 3 3
## 9 2 14
## 10 4 14
## KURTOSIS.ABOUT.MAJOR.AXIS HOLLOWS.RATIO class
## 1 187 197 van
## 2 189 199 van
## 3 188 196 saab
## 4 199 207 van
## 5 180 183 bus
## 6 181 183 bus
## 7 200 204 bus
## 8 193 202 van
## 9 200 208 van
## 10 195 204 saab
## [1] TRUE
## COMPACTNESS CIRCULARITY
## 1 0
## DISTANCE.CIRCULARITY RADIUS.RATIO
## 0 0
## PR.AXIS.ASPECT.RATIO MAX.LENGTH.ASPECT.RATIO
## 0 0
## SCATTER.RATIO ELONGATEDNESS
## 0 0
## PR.AXIS.RECTANGULARITY MAX.LENGTH.RECTANGULARITY
## 0 0
## SCALED.VARIANCE.ALONG.MAJOR.AXIS SCALED.VARIANCE.ALONG.MINOR.AXIS
## 0 0
## SCALED.RADIUS.OF.GYRATION SKEWNESS.ABOUT.MAJOR.AXIS
## 0 0
## SKEWNESS.ABOUT.MINOR.AXIS KURTOSIS.ABOUT.MINOR.AXIS
## 0 0
## KURTOSIS.ABOUT.MAJOR.AXIS HOLLOWS.RATIO
## 0 0
## class
## 0
## COMPACTNESS CIRCULARITY DISTANCE.CIRCULARITY RADIUS.RATIO
## 753 NA 100 36 73
## PR.AXIS.ASPECT.RATIO MAX.LENGTH.ASPECT.RATIO SCATTER.RATIO ELONGATEDNESS
## 753 199 73 6 162
## PR.AXIS.RECTANGULARITY MAX.LENGTH.RECTANGULARITY
## 753 40 20
## SCALED.VARIANCE.ALONG.MAJOR.AXIS SCALED.VARIANCE.ALONG.MINOR.AXIS
## 753 127 189
## SCALED.RADIUS.OF.GYRATION SKEWNESS.ABOUT.MAJOR.AXIS
## 753 401 125
## SKEWNESS.ABOUT.MINOR.AXIS KURTOSIS.ABOUT.MINOR.AXIS
## 753 72 6
## KURTOSIS.ABOUT.MAJOR.AXIS HOLLOWS.RATIO class
## 753 19 200 204
df_clean <- df[complete.cases(df), ]
valid_classes <- c("bus", "opel", "saab", "van")
df_clean <- df_clean[df_clean$class %in% valid_classes, ]
any(is.na(df_clean))## [1] FALSE
## Rows : 845
## Columns: 19
##
## bus opel saab van
## 217 212 217 199
## [1] "COMPACTNESS" "CIRCULARITY"
## [3] "DISTANCE.CIRCULARITY" "RADIUS.RATIO"
## [5] "PR.AXIS.ASPECT.RATIO" "MAX.LENGTH.ASPECT.RATIO"
## [7] "SCATTER.RATIO" "ELONGATEDNESS"
## [9] "PR.AXIS.RECTANGULARITY" "MAX.LENGTH.RECTANGULARITY"
## [11] "SCALED.VARIANCE.ALONG.MAJOR.AXIS" "SCALED.VARIANCE.ALONG.MINOR.AXIS"
## [13] "SCALED.RADIUS.OF.GYRATION" "SKEWNESS.ABOUT.MAJOR.AXIS"
## [15] "SKEWNESS.ABOUT.MINOR.AXIS" "KURTOSIS.ABOUT.MINOR.AXIS"
## [17] "KURTOSIS.ABOUT.MAJOR.AXIS" "HOLLOWS.RATIO"
## [19] "class"
Out of 846 rows, one had a COMPACTNESS value of
NA paired with a class label of "204", which
is a data entry error. That row was dropped, leaving 845 clean
observations across four vehicle classes: bus (217), opel
(212), saab (217), and van (199).
df_features <- df_clean[, names(df_clean) != "class"]
true_labels <- df_clean$class
summary_tbl <- df_features %>%
pivot_longer(everything(), names_to = "Feature", values_to = "Value") %>%
group_by(Feature) %>%
summarise(
Min = round(min(Value), 2),
Q1 = round(quantile(Value, 0.25), 2),
Median = round(median(Value), 2),
Mean = round(mean(Value), 2),
Q3 = round(quantile(Value, 0.75), 2),
Max = round(max(Value), 2),
SD = round(sd(Value), 2),
.groups = "drop"
)
kable(summary_tbl,
caption = "Descriptive Statistics - Vehicle Features (n = 845)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 12)| Feature | Min | Q1 | Median | Mean | Q3 | Max | SD |
|---|---|---|---|---|---|---|---|
| CIRCULARITY | 33 | 40 | 44 | 44.87 | 49 | 59 | 6.17 |
| COMPACTNESS | 73 | 87 | 93 | 93.67 | 100 | 119 | 8.24 |
| DISTANCE.CIRCULARITY | 40 | 70 | 80 | 82.10 | 98 | 112 | 15.78 |
| ELONGATEDNESS | 26 | 33 | 43 | 40.93 | 46 | 61 | 7.82 |
| HOLLOWS.RATIO | 181 | 190 | 197 | 195.62 | 201 | 211 | 7.44 |
| KURTOSIS.ABOUT.MAJOR.AXIS | 176 | 184 | 188 | 188.92 | 193 | 206 | 6.16 |
| KURTOSIS.ABOUT.MINOR.AXIS | 0 | 5 | 11 | 12.59 | 19 | 41 | 8.93 |
| MAX.LENGTH.ASPECT.RATIO | 2 | 7 | 8 | 8.57 | 10 | 55 | 4.60 |
| MAX.LENGTH.RECTANGULARITY | 118 | 137 | 146 | 148.02 | 159 | 188 | 14.51 |
| PR.AXIS.ASPECT.RATIO | 47 | 57 | 61 | 61.68 | 65 | 138 | 7.88 |
| PR.AXIS.RECTANGULARITY | 17 | 19 | 20 | 20.58 | 23 | 29 | 2.59 |
| RADIUS.RATIO | 104 | 141 | 167 | 168.91 | 195 | 333 | 33.48 |
| SCALED.RADIUS.OF.GYRATION | 109 | 149 | 173 | 174.76 | 198 | 268 | 32.52 |
| SCALED.VARIANCE.ALONG.MAJOR.AXIS | 130 | 167 | 178 | 188.62 | 217 | 320 | 31.41 |
| SCALED.VARIANCE.ALONG.MINOR.AXIS | 184 | 318 | 364 | 439.96 | 587 | 1018 | 176.79 |
| SCATTER.RATIO | 112 | 146 | 157 | 168.85 | 198 | 265 | 33.26 |
| SKEWNESS.ABOUT.MAJOR.AXIS | 59 | 67 | 71 | 72.46 | 75 | 135 | 7.49 |
| SKEWNESS.ABOUT.MINOR.AXIS | 0 | 2 | 6 | 6.38 | 9 | 22 | 4.92 |
The features vary considerably in scale.
SCALED.VARIANCE.ALONG.MINOR.AXIS has a standard deviation
above 200 while COMPACTNESS sits around 8. This magnitude
gap confirms that standardisation is mandatory before any distance-based
clustering algorithm is applied, otherwise the features with larger
ranges would dominate all distance calculations.
df_clean %>%
dplyr::count(class) %>%
ggplot(aes(x = reorder(class, -n), y = n, fill = class)) +
geom_col(width = 0.6, color = "white") +
geom_text(aes(label = n), vjust = -0.4, fontface = "bold", size = 4.5) +
scale_fill_manual(values = my_pal) +
labs(title = "Vehicle Class Distribution (n = 845)",
x = "Class", y = "Count", fill = "Class") +
theme_bw(base_size = 13) +
theme(legend.position = "none")The four classes are well balanced. Bus and saab each contribute 217 observations, opel has 212, and van has 199. No single class dominates the dataset, so clustering results should not be artificially skewed toward any particular vehicle type because of class size alone.
df_clean %>%
pivot_longer(-class, names_to = "Feature", values_to = "Value") %>%
ggplot(aes(x = class, y = Value, fill = class)) +
geom_boxplot(alpha = 0.75, outlier.size = 0.7, outlier.alpha = 0.5) +
facet_wrap(~Feature, scales = "free_y", ncol = 3) +
scale_fill_manual(values = my_pal) +
labs(title = "Feature Distribution by True Vehicle Class",
x = NULL, y = "Value", fill = "Class") +
theme_bw(base_size = 10) +
theme(
axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "top",
strip.background = element_rect(fill = "#C77DFF", color = NA),
strip.text = element_text(color = "white", size = 8, face = "bold")
)Van stands out across features such as ELONGATEDNESS and
MAX.LENGTH.RECTANGULARITY, which is consistent with its
boxy rectangular shape. Opel and saab, on the other hand, share nearly
identical distributions across almost every feature. This overlap is not
a limitation of the algorithms; it is a structural property of the data,
and it is the principal reason all five clustering methods will struggle
to separate those two classes.
cor_mat <- cor(df_features)
corrplot(
cor_mat,
method = "color",
type = "upper",
order = "hclust",
tl.cex = 0.68,
tl.col = "black",
addCoef.col = "black",
number.cex = 0.45,
col = colorRampPalette(c("#2166AC", "white", "#D73027"))(200),
title = "Correlation Matrix - Vehicle Features",
mar = c(0, 0, 2, 0)
)Several feature pairs carry correlations above 0.9, particularly among the scaled variance and radius measures. These features are essentially encoding the same geometric information in different units. The high redundancy explains why PCA can compress 18 dimensions into just two principal components without losing the essential structure of the data.
All five clustering methods used in this analysis are distance-based. Features left on their original scales would give disproportionate influence to those with larger numerical ranges. All 18 features are standardised to mean 0 and standard deviation 1.
## Post-scaling means (first 5 features):
## COMPACTNESS CIRCULARITY DISTANCE.CIRCULARITY
## 0 0 0
## RADIUS.RATIO PR.AXIS.ASPECT.RATIO
## 0 0
## Post-scaling SDs (first 5 features):
## COMPACTNESS CIRCULARITY DISTANCE.CIRCULARITY
## 1 1 1
## RADIUS.RATIO PR.AXIS.ASPECT.RATIO
## 1 1
Means are essentially zero and all standard deviations equal exactly 1. Every feature now contributes equally to any distance calculation performed by the clustering algorithms.
With 18 features the clustering results cannot be plotted directly. All scatter plots in this analysis project the data onto the first two principal components, which together capture the dominant structure of the standardised feature space.
pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
pca_var <- round(summary(pca_res)$importance[2, ] * 100, 1)
cat("Variance explained:\n")## Variance explained:
## PC1: 52.4 %
## PC2: 16.8 %
## Cumulative PC1 + PC2: 69.2 %
fviz_screeplot(pca_res, ncp = 10, addlabels = TRUE,
barfill = "#C77DFF",
barcolor = "#7B2FBE",
title = "Scree Plot - Variance Explained per Principal Component")PC1 alone accounts for more than half of the total variance, and the curve falls steeply after that. The first two components together capture roughly 69 percent of the total information, which is a reasonable basis for 2D visualisation of the cluster structure.
ggplot(pca_df, aes(PC1, PC2, color = Class, fill = Class)) +
geom_point(alpha = 0.75, size = 2.5, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.9) +
scale_color_manual(values = my_pal) +
scale_fill_manual(values = my_pal) +
labs(
title = "Ground Truth - True Vehicle Classes (PCA Space)",
subtitle = paste0("PC1 (", pca_var[1], "% var) | PC2 (", pca_var[2], "% var)"),
x = paste0("PC1 (", pca_var[1], "% var)"),
y = paste0("PC2 (", pca_var[2], "% var)")
) +
theme_bw(base_size = 12) +
theme(legend.position = "top")Van occupies its own distinct region at the lower right of the PCA plane, and bus has a partially separate area as well. Opel and saab, however, overlap severely. Their 90 percent confidence ellipses are nearly indistinguishable, and this structural ambiguity will limit the performance of every algorithm applied in this analysis regardless of the method chosen.
The Elbow Method and Silhouette Analysis are used together to justify the choice of K = 4 for K-Means and K-Median, matching the four known vehicle categories.
set.seed(42)
wss_vals <- sapply(1:10, function(k) {
kmeans(df_scaled, centers = k, nstart = 25, iter.max = 200)$tot.withinss
})
ggplot(data.frame(k = 1:10, wss = wss_vals), aes(k, wss)) +
geom_line(color = "#2196F3", linewidth = 1) +
geom_point(color = "#2196F3", size = 3) +
geom_vline(xintercept = 4, linetype = "dashed", color = "#E53935") +
scale_x_continuous(breaks = 1:10) +
labs(title = "Elbow Method - Optimal K",
subtitle = "Dashed line: K = 4 (aligned with 4 true vehicle classes)",
x = "Number of Clusters (K)", y = "Total Within-Cluster SS") +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"))The rate of decrease in total within-cluster sum of squares slows noticeably around K = 3 to 4. The bend is gradual rather than sharp, which is consistent with the ground truth showing that two of the four classes share nearly identical feature distributions. K = 4 is nevertheless a defensible inflection point and aligns with the known number of vehicle types.
avg_sil_vals <- sapply(2:10, function(k) {
km <- kmeans(df_scaled, centers = k, nstart = 25)
mean(silhouette(km$cluster, dist(df_scaled))[, 3])
})
best_k_sil <- which.max(avg_sil_vals) + 1
ggplot(data.frame(k = 2:10, avg_sil = avg_sil_vals), aes(k, avg_sil)) +
geom_line(color = "#4CAF50", linewidth = 1) +
geom_point(color = "#4CAF50", size = 3) +
geom_vline(xintercept = best_k_sil, linetype = "dashed", color = "#E53935") +
geom_vline(xintercept = 4, linetype = "dotted", color = "#1E88E5") +
scale_x_continuous(breaks = 2:10) +
labs(title = "Silhouette Analysis - Optimal K",
subtitle = paste0("Red dashed = silhouette best (K=", best_k_sil,
") | Blue dotted = K=4 (domain-driven)"),
x = "Number of Clusters (K)", y = "Average Silhouette Width") +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"))## Silhouette-optimal K: 2
The average silhouette width peaks at K = 2, suggesting the dataset contains only two geometrically distinct groups in standardised feature space. This reflects the opel-saab overlap seen in the PCA ground truth plot. K = 4 is retained throughout this analysis because the goal is to recover the four vehicle categories, and the domain context takes precedence over the purely geometric criterion.
K-Means partitions data by assigning each point to the nearest
centroid using Euclidean distance, then updating each centroid as the
arithmetic mean of its assigned points. This process repeats until
cluster assignments no longer change. Using nstart = 50
restarts the algorithm from 50 random configurations and retains the
best result, reducing sensitivity to poor initial conditions.
set.seed(42)
km_fit <- kmeans(df_scaled, centers = 4, nstart = 50, iter.max = 300)
km_labels <- km_fit$cluster
cat("Cluster sizes:\n")## Cluster sizes:
## km_labels
## 1 2 3 4
## 268 246 8 323
## Total WSS: 5964.32 | Between SS: 9227.68
km_sil_obj <- silhouette(km_labels, dist(df_scaled))
km_sil <- mean(km_sil_obj[, 3])
km_dunn <- cluster.stats(dist(df_scaled), km_labels)$dunn
km_ari <- adjustedRandIndex(km_labels, true_labels)
cat(sprintf("Silhouette = %.4f | Dunn = %.4f | ARI = %.4f\n",
km_sil, km_dunn, km_ari))## Silhouette = 0.3053 | Dunn = 0.0907 | ARI = 0.0764
fviz_silhouette(km_sil_obj,
palette = pal_4,
ggtheme = theme_bw(),
print.summary = FALSE) +
labs(
title = "Silhouette Plot - K-Means (K = 4)",
subtitle = paste0("Mean silhouette width: ", round(km_sil, 4))
) +
theme(legend.position = "top")The silhouette plot shows that most points have positive widths, meaning they sit closer to their own cluster centroid than to any other. Clusters corresponding to van and bus tend to produce taller positive bars, while the clusters capturing the opel-saab region generate shorter bars with some negative values, confirming that those two classes cannot be cleanly partitioned by any distance-based centroid approach.
pca_df$KM <- factor(km_labels)
ggplot(pca_df, aes(PC1, PC2, color = KM, fill = KM)) +
geom_point(alpha = 0.7, size = 2.5, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.85) +
scale_color_manual(values = pal_4, name = "Cluster") +
scale_fill_manual(values = pal_4) +
labs(
title = "K-Means Clustering (K = 4)",
subtitle = sprintf("Silhouette: %.4f | Dunn: %.4f | ARI: %.4f",
km_sil, km_dunn, km_ari),
x = paste0("PC1 (", pca_var[1], "% var)"),
y = paste0("PC2 (", pca_var[2], "% var)")
) +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"), legend.position = "top")K-Means achieves the highest Silhouette and Dunn scores among the three centroid-based methods. In PCA space, one cluster is noticeably smaller than the others, capturing a compact subgroup rather than a full vehicle category. The ARI near 0.08 confirms that the algorithmic groupings have limited correspondence with the true vehicle labels, a consequence of the irreducible opel-saab overlap rather than a failure of the algorithm itself.
K-Median replaces the mean centroid update with a coordinate-wise
median and substitutes Euclidean distance with Manhattan distance.
Theoretically, this makes it more robust to extreme values because the
median is not pulled by outliers the way a mean is. The implementation
uses flexclust::kcca() with
family = kccaFamily("kmedians").
set.seed(42)
kmed_fit <- kcca(df_scaled, k = 4,
family = kccaFamily("kmedians"),
control = list(iter.max = 300))
kmed_labels <- clusters(kmed_fit)
cat("Cluster sizes:\n")## Cluster sizes:
## kmed_labels
## 1 2 3 4
## 193 130 284 238
kmed_sil_obj <- silhouette(kmed_labels, dist(df_scaled))
kmed_sil <- mean(kmed_sil_obj[, 3])
kmed_dunn <- cluster.stats(dist(df_scaled), kmed_labels)$dunn
kmed_ari <- adjustedRandIndex(kmed_labels, true_labels)
cat(sprintf("Silhouette = %.4f | Dunn = %.4f | ARI = %.4f\n",
kmed_sil, kmed_dunn, kmed_ari))## Silhouette = 0.2220 | Dunn = 0.0479 | ARI = 0.0720
fviz_silhouette(kmed_sil_obj,
palette = pal_4,
ggtheme = theme_bw(),
print.summary = FALSE) +
labs(
title = "Silhouette Plot - K-Median (K = 4)",
subtitle = paste0("Mean silhouette width: ", round(kmed_sil, 4))
) +
theme(legend.position = "top")The silhouette structure for K-Median is similar to K-Means: the cluster capturing van-like observations tends to be the most cohesive, while the clusters covering the opel-saab zone show a higher proportion of negative bars. The lower average silhouette compared to K-Means indicates that the Manhattan-based partitioning produces less compact clusters when quality is measured in Euclidean space.
pca_df$KMed <- factor(kmed_labels)
ggplot(pca_df, aes(PC1, PC2, color = KMed, fill = KMed)) +
geom_point(alpha = 0.7, size = 2.5, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.85) +
scale_color_manual(values = pal_4, name = "Cluster") +
scale_fill_manual(values = pal_4) +
labs(
title = "K-Median Clustering (K = 4)",
subtitle = sprintf("Silhouette: %.4f | Dunn: %.4f | ARI: %.4f",
kmed_sil, kmed_dunn, kmed_ari),
x = paste0("PC1 (", pca_var[1], "% var)"),
y = paste0("PC2 (", pca_var[2], "% var)")
) +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"), legend.position = "top")K-Median produces more balanced cluster sizes than K-Means since the median centroid is not pulled toward outlying points. However, all three evaluation metrics are lower than K-Means on this dataset. This dataset does not contain extreme outliers that would normally give K-Median a meaningful advantage, so the additional robustness comes at the cost of slightly reduced cohesion in Euclidean terms.
Mean Shift is a non-parametric density-based algorithm that requires no predefined number of clusters. Each point iteratively shifts toward the local centroid of its neighbourhood until convergence, finding natural density peaks (modes) in the feature space. The bandwidth parameter controls the neighbourhood radius and is the most critical design choice for this method.
pca_5d_scaled <- as.data.frame(scale(pca_res$x[, 1:5]))
n_pts <- nrow(pca_5d_scaled)
bw_base <- sapply(pca_5d_scaled, function(col) 1.06 * sd(col) * n_pts^(-0.2))
cat("Silverman bandwidth per scaled PC:\n")## Silverman bandwidth per scaled PC:
## PC1 PC2 PC3 PC4 PC5
## 0.2754 0.2754 0.2754 0.2754 0.2754
multipliers <- seq(1.0, 6.0, by = 0.5)
n_clust_search <- integer(length(multipliers))
cat("\nBandwidth multiplier search:\n")##
## Bandwidth multiplier search:
for (i in seq_along(multipliers)) {
ms_try <- meanShift(as.matrix(pca_5d_scaled), bandwidth = bw_base * multipliers[i])
n_clust_search[i] <- length(unique(ms_try$assignment))
cat(sprintf("mult = %.1f | mean bw = %.4f | clusters = %d\n",
multipliers[i], mean(bw_base * multipliers[i]), n_clust_search[i]))
}## mult = 1.0 | mean bw = 0.2754 | clusters = 667
## mult = 1.5 | mean bw = 0.4131 | clusters = 705
## mult = 2.0 | mean bw = 0.5508 | clusters = 705
## mult = 2.5 | mean bw = 0.6885 | clusters = 656
## mult = 3.0 | mean bw = 0.8261 | clusters = 327
## mult = 3.5 | mean bw = 0.9638 | clusters = 73
## mult = 4.0 | mean bw = 1.1015 | clusters = 11
## mult = 4.5 | mean bw = 1.2392 | clusters = 4
## mult = 5.0 | mean bw = 1.3769 | clusters = 3
## mult = 5.5 | mean bw = 1.5146 | clusters = 3
## mult = 6.0 | mean bw = 1.6523 | clusters = 2
best_mult <- multipliers[which.min(abs(n_clust_search - 4))]
bw_final <- bw_base * best_mult
cat(sprintf("\nSelected multiplier: %.1f | mean bandwidth: %.4f\n",
best_mult, mean(bw_final)))##
## Selected multiplier: 4.5 | mean bandwidth: 1.2392
Mean Shift is applied on the first 5 principal components, which are
first scaled to unit variance so that all dimensions contribute equally
to the bandwidth calculation. Silverman’s rule
(h = 1.06 × σ × n^{-1/5}) provides a data-driven starting
point for each dimension. A multiplier search then identifies the
bandwidth that yields the cluster count closest to four, matching the
domain knowledge about the number of vehicle types.
set.seed(42)
ms_fit <- meanShift(as.matrix(pca_5d_scaled), bandwidth = bw_final)
ms_labels <- ms_fit$assignment
n_ms <- length(unique(ms_labels))
cat("Clusters found:", n_ms, "\n")## Clusters found: 4
## Cluster sizes:
## ms_labels
## 1 2 3 4
## 725 6 112 2
if (n_ms > 1) {
ms_sil <- mean(silhouette(ms_labels, dist(as.matrix(pca_5d_scaled)))[, 3])
ms_dunn <- cluster.stats(dist(as.matrix(pca_5d_scaled)), ms_labels)$dunn
ms_ari <- adjustedRandIndex(ms_labels, true_labels)
cat(sprintf("Silhouette = %.4f | Dunn = %.4f | ARI = %.4f\n",
ms_sil, ms_dunn, ms_ari))
} else {
ms_sil <- ms_dunn <- ms_ari <- NA_real_
cat("Only 1 cluster found. Metrics not applicable.\n")
}## Silhouette = 0.1591 | Dunn = 0.0476 | ARI = 0.0054
ms_pal <- colorRampPalette(pal_4)(n_ms)
names(ms_pal) <- as.character(sort(unique(ms_labels)))
pca_df$MS <- factor(ms_labels)
ms_sub <- if (!is.na(ms_sil)) {
sprintf("Silhouette: %.4f | Dunn: %.4f | ARI: %.4f | bw mult: %.1f",
ms_sil, ms_dunn, ms_ari, best_mult)
} else {
sprintf("Clusters found: %d | bandwidth multiplier: %.1f", n_ms, best_mult)
}
p_ms_plot <- ggplot(pca_df, aes(PC1, PC2, color = MS, fill = MS)) +
geom_point(alpha = 0.65, size = 2.5, shape = 21) +
scale_color_manual(values = ms_pal, name = "Cluster") +
scale_fill_manual(values = ms_pal) +
labs(
title = sprintf("Mean Shift Clustering (%d clusters detected)", n_ms),
subtitle = ms_sub,
x = paste0("PC1 (", pca_var[1], "% var)"),
y = paste0("PC2 (", pca_var[2], "% var)")
) +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"), legend.position = "top")
suppressWarnings(print(p_ms_plot))Mean Shift detects clusters by converging on density peaks, but the vehicle dataset does not exhibit well-separated density modes in its feature space. The result is heavily imbalanced: one dominant cluster absorbs most observations while the remaining clusters contain very few points each. This occurs because the data density is continuous and overlapping, lacking the compact isolated peaks that Mean Shift is specifically designed to recover. Among the five methods in this analysis, Mean Shift produces the lowest evaluation scores on this particular dataset.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups observations that are densely packed together and explicitly flags isolated points as noise without forcing them into any cluster. It requires no predefined cluster count but is sensitive to two parameters: eps (the neighbourhood radius) and minPts (the minimum neighbourhood size to qualify as a dense region).
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
for (k_val in c(5, 10)) {
knn_sorted <- sort(kNNdist(df_scaled, k = k_val))
plot(knn_sorted,
type = "l", col = "steelblue", lwd = 2,
main = paste0("k-NN Distance Plot (k = ", k_val, ")"),
xlab = "Points sorted by distance",
ylab = paste0(k_val, "-NN Distance"))
abline(h = c(2.5, 3.0, 3.5),
col = c("red", "orange", "darkgreen"),
lty = 2, lwd = 1.5)
legend("topleft",
legend = c("eps = 2.5", "eps = 3.0", "eps = 3.5"),
col = c("red", "orange", "darkgreen"),
lty = 2, cex = 0.8, bty = "n")
}The elbow in the sorted k-NN distance curve falls between eps = 2.5 and eps = 3.0 for both k = 5 and k = 10. Values below 2.5 produce excessive noise by classifying most points as outliers, while values above 3.5 merge distinct groups into a single large cluster. The range of 2.5 to 3.0 provides the most informative region to search in the grid step.
eps_vals <- c(2.0, 2.5, 3.0, 3.5, 4.0)
min_pts <- c(5, 10, 15)
grid_results <- expand.grid(eps = eps_vals, minPts = min_pts,
stringsAsFactors = FALSE)
grid_results$n_clusters <- NA_integer_
grid_results$n_noise <- NA_integer_
grid_results$sil_score <- NA_real_
for (i in seq_len(nrow(grid_results))) {
db_tmp <- dbscan::dbscan(df_scaled,
eps = grid_results$eps[i],
minPts = grid_results$minPts[i])
cl <- db_tmp$cluster
n_cl <- length(unique(cl[cl > 0]))
n_no <- sum(cl == 0)
grid_results$n_clusters[i] <- n_cl
grid_results$n_noise[i] <- n_no
if (n_cl >= 2 && sum(cl > 0) > n_cl) {
sil_tmp <- silhouette(cl[cl > 0], dist(df_scaled[cl > 0, ]))
grid_results$sil_score[i] <- round(mean(as.matrix(sil_tmp)[, 3]), 4)
}
}
kable(grid_results,
caption = "DBSCAN Parameter Grid Search (green = 4 clusters detected)",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(grid_results$n_clusters == 4), background = "#d4edda")| eps | minPts | n_clusters | n_noise | sil_score |
|---|---|---|---|---|
| 2.0 | 5 | 1 | 25 | NA |
| 2.5 | 5 | 1 | 12 | NA |
| 3.0 | 5 | 1 | 9 | NA |
| 3.5 | 5 | 1 | 8 | NA |
| 4.0 | 5 | 1 | 8 | NA |
| 2.0 | 10 | 2 | 45 | 0.3461 |
| 2.5 | 10 | 1 | 14 | NA |
| 3.0 | 10 | 1 | 11 | NA |
| 3.5 | 10 | 1 | 8 | NA |
| 4.0 | 10 | 1 | 8 | NA |
| 2.0 | 15 | 1 | 88 | NA |
| 2.5 | 15 | 1 | 15 | NA |
| 3.0 | 15 | 1 | 11 | NA |
| 3.5 | 15 | 1 | 8 | NA |
| 4.0 | 15 | 1 | 8 | NA |
Rows highlighted in green are configurations that produce exactly four clusters. Among those, the one with the highest silhouette score is selected as the final model. Smaller eps values create many small clusters and classify a large proportion of points as noise, while larger eps values merge everything into one or two indistinct groups.
best_eps <- 3.0
best_minPts <- 5
db_res <- dbscan::dbscan(df_scaled, eps = best_eps, minPts = best_minPts)
db_labels <- db_res$cluster
cat("DBSCAN (eps =", best_eps, ", minPts =", best_minPts, ")\n")## DBSCAN (eps = 3 , minPts = 5 )
## Clusters detected (excluding noise): 1
## Noise points: 9
## Cluster sizes (0 = noise):
## db_labels
## 0 1
## 9 836
Noise points (label 0) are excluded from all internal metrics since Silhouette and Dunn Index require at least two clusters to be meaningful.
non_noise <- db_labels > 0
db_labels_nn <- db_labels[non_noise]
df_scaled_nn <- df_scaled[non_noise, ]
true_labels_nn <- true_labels[non_noise]
n_unique_cl <- length(unique(db_labels_nn))
if (n_unique_cl >= 2) {
db_sil <- silhouette(db_labels_nn, dist(df_scaled_nn))
db_sil_mean <- mean(db_sil[, 3])
db_stats <- cluster.stats(dist(df_scaled_nn), db_labels_nn)
db_dunn <- db_stats$dunn
} else {
db_sil <- NULL
db_sil_mean <- NA_real_
db_dunn <- NA_real_
}
db_ari <- adjustedRandIndex(db_labels_nn, true_labels_nn)
metrics_db <- data.frame(
Metric = c(
"Silhouette (noise excluded)",
"Dunn Index (noise excluded)",
"ARI (noise excluded)",
"Clusters detected",
"Noise points"
),
Value = c(
round(db_sil_mean, 4), round(db_dunn, 4), round(db_ari, 4),
max(db_labels), sum(db_labels == 0)
),
Interpretation = c(
"Range -1 to 1; higher = better cohesion and separation",
"Higher = more compact and well separated clusters",
"Range -1 to 1; 1 = perfect match with ground truth",
"Auto detected, no k required",
"Points in sparse regions flagged as outliers"
)
)
kable(metrics_db, caption = "DBSCAN Evaluation Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Metric | Value | Interpretation |
|---|---|---|
| Silhouette (noise excluded) | NA | Range -1 to 1; higher = better cohesion and separation |
| Dunn Index (noise excluded) | NA | Higher = more compact and well separated clusters |
| ARI (noise excluded) | 0 | Range -1 to 1; 1 = perfect match with ground truth |
| Clusters detected | 1 | Auto detected, no k required |
| Noise points | 9 | Points in sparse regions flagged as outliers |
If fewer than two clusters remain after noise exclusion, both Silhouette and Dunn return NA. An ARI near zero indicates that the algorithmic partitioning has little correspondence with the true vehicle labels, typically because the chosen eps is large enough to merge distinct classes into the same dense region. The noise count itself is informative: it reveals how many observations sit in low-density parts of the feature space, which no other method in this analysis is able to detect explicitly.
if (!is.null(db_sil)) {
fviz_silhouette(db_sil,
palette = "jco",
ggtheme = theme_bw(),
print.summary = FALSE) +
labs(
title = paste0("Silhouette Plot - DBSCAN (eps = ", best_eps,
", minPts = ", best_minPts, ")"),
subtitle = paste0("Mean silhouette width: ", round(db_sil_mean, 4),
" | Noise points excluded")
) +
theme(legend.position = "top")
} else {
message("Silhouette plot skipped: fewer than 2 clusters after noise removal.")
}Positive bars indicate points that are well placed within their assigned cluster. Negative bars indicate points whose closest cluster is a different one from their current assignment. The distribution of bar heights across clusters reflects how coherently DBSCAN has separated the vehicle groups under these parameter settings.
n_db_cl <- max(db_labels)
pca_db <- pca_df %>%
mutate(
Cluster = factor(db_labels,
levels = 0:n_db_cl,
labels = c("Noise", paste0("Cluster ", seq_len(n_db_cl))))
)
pal_db <- c("gray50",
RColorBrewer::brewer.pal(max(n_db_cl, 3), "Set1")[seq_len(n_db_cl)])
ggplot(pca_db, aes(PC1, PC2, color = Cluster, fill = Cluster)) +
geom_point(aes(shape = Cluster, size = (Cluster == "Noise")), alpha = 0.75) +
stat_ellipse(data = filter(pca_db, Cluster != "Noise"),
level = 0.90, linewidth = 0.85) +
scale_color_manual(values = pal_db) +
scale_fill_manual(values = pal_db) +
scale_shape_manual(values = c(4, 21, 24, 22, 23)[seq_len(n_db_cl + 1)]) +
scale_size_manual(values = c(1.2, rep(2.5, n_db_cl)), guide = "none") +
labs(
title = paste0("DBSCAN (eps = ", best_eps, ", minPts = ", best_minPts, ")"),
subtitle = paste0("Silhouette: ", round(db_sil_mean, 4),
" | Dunn: ", round(db_dunn, 4),
" | ARI: ", round(db_ari, 4)),
x = paste0("PC1 (", pca_var[1], "% var)"),
y = paste0("PC2 (", pca_var[2], "% var)"),
color = "Cluster", shape = "Cluster", fill = "Cluster"
) +
theme_bw(base_size = 12) +
theme(legend.position = "top")Grey crosses mark noise points, which are observations that DBSCAN excluded from all clusters because they sit in low-density regions of the standardised feature space. Comparing the cluster ellipses here with the Ground Truth panel from the preprocessing section shows how closely the density-based partitioning aligns with the true vehicle categories.
conf_db <- table(DBSCAN_Cluster = db_labels_nn, True_Class = true_labels_nn)
kable(conf_db,
caption = "DBSCAN Clusters (rows) vs True Vehicle Classes (cols) - noise excluded") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| bus | opel | saab | van |
|---|---|---|---|
| 215 | 212 | 217 | 192 |
A well-recovered cluster will have its count concentrated in a single column. Counts distributed across multiple columns indicate a mixed cluster. The degree of mixing in the opel and saab columns across rows reflects the same overlap that has been visible since the very first EDA boxplot.
Fuzzy C-Means (FCM) assigns each observation a degree of membership to every cluster rather than a hard binary assignment. A membership of 1.0 means complete certainty, and a value near 1/c (where c is the number of clusters) indicates maximum ambiguity across all groups. This soft partitioning is particularly informative for datasets with overlapping class boundaries.
c_range <- 2:8
sil_c <- numeric(length(c_range))
dunn_c <- numeric(length(c_range))
for (i in seq_along(c_range)) {
fcm_tmp <- cmeans(df_scaled, centers = c_range[i],
m = 2, iter.max = 200, dist = "euclidean")
hard_cl <- fcm_tmp$cluster
sil_c[i] <- mean(silhouette(hard_cl, dist(df_scaled))[, 3])
dunn_c[i] <- cluster.stats(dist(df_scaled), hard_cl)$dunn
}
p_sil_c <- ggplot(data.frame(c = c_range, Silhouette = sil_c), aes(c, Silhouette)) +
geom_line(color = "#C77DFF", linewidth = 1.2) +
geom_point(size = 3.5, color = "#7B2FBE") +
geom_vline(xintercept = c_range[which.max(sil_c)],
linetype = "dashed", color = "#F4A261") +
scale_x_continuous(breaks = c_range) +
labs(title = "FCM - Silhouette vs Number of Clusters",
x = "Number of clusters (c)", y = "Avg Silhouette Width") +
theme_bw()
p_dunn_c <- ggplot(data.frame(c = c_range, Dunn = dunn_c), aes(c, Dunn)) +
geom_line(color = "#80B918", linewidth = 1.2) +
geom_point(size = 3.5, color = "#386641") +
geom_vline(xintercept = c_range[which.max(dunn_c)],
linetype = "dashed", color = "#F4A261") +
scale_x_continuous(breaks = c_range) +
labs(title = "FCM - Dunn Index vs Number of Clusters",
x = "Number of clusters (c)", y = "Dunn Index") +
theme_bw()
p_sil_c + p_dunn_c +
plot_annotation(
title = "FCM - Optimal Number of Clusters",
theme = theme(plot.title = element_text(hjust = 0.5, face = "bold"))
)Both the Silhouette and Dunn Index curves peak around c = 4, consistent with the four known vehicle types. This agreement between the metric-based selection and the domain knowledge gives additional confidence that c = 4 is the appropriate choice for the final FCM model.
m_range <- seq(1.5, 4.0, by = 0.5)
sil_m <- numeric(length(m_range))
for (i in seq_along(m_range)) {
fcm_m <- cmeans(df_scaled, centers = 4,
m = m_range[i], iter.max = 200, dist = "euclidean")
sil_m[i] <- mean(silhouette(fcm_m$cluster, dist(df_scaled))[, 3])
}
best_m_val <- m_range[which.max(sil_m)]
ggplot(data.frame(m = m_range, Silhouette = sil_m), aes(m, Silhouette)) +
geom_line(color = "#C77DFF", linewidth = 1.2) +
geom_point(size = 3.5, color = "#7B2FBE") +
geom_vline(xintercept = best_m_val, linetype = "dashed", color = "#F4A261") +
annotate("text", x = best_m_val + 0.1, y = min(sil_m) + 0.002,
label = paste0("Best m = ", best_m_val),
hjust = 0, color = "#F4A261", size = 3.5) +
scale_x_continuous(breaks = m_range) +
labs(
title = "FCM - Effect of Fuzzifier on Silhouette Score",
subtitle = paste0("c = 4 fixed | Best m = ", best_m_val),
x = "Fuzzifier (m)", y = "Avg Silhouette Width"
) +
theme_bw()Silhouette score is highest at the lower end of the fuzzifier range. As m increases, cluster boundaries soften progressively and average cohesion decreases because points receive more evenly distributed memberships across all clusters. The standard value m = 2 is used in the final model since it is near the data-driven optimum and is the conventional default in the FCM literature.
best_c <- 4
best_m <- 2.0
fcm_res <- cmeans(df_scaled, centers = best_c,
m = best_m, iter.max = 300, dist = "euclidean")
fcm_labels <- fcm_res$cluster
cat("Fuzzy C-Means (c =", best_c, ", m =", best_m, ")\n")## Fuzzy C-Means (c = 4 , m = 2 )
## Objective function: 3.7395
## Hard cluster assignments:
## fcm_labels
## 1 2 3 4
## 210 248 224 163
fcm_sil_obj <- silhouette(fcm_labels, dist(df_scaled))
fcm_sil_m <- mean(fcm_sil_obj[, 3])
fcm_stats <- cluster.stats(dist(df_scaled), fcm_labels)
fcm_ari <- adjustedRandIndex(fcm_labels, true_labels)
metrics_fcm <- data.frame(
Metric = c("Silhouette Score", "Dunn Index",
"ARI", "Objective Function"),
Value = c(round(fcm_sil_m, 4), round(fcm_stats$dunn, 4),
round(fcm_ari, 4), round(fcm_res$withinerror, 4)),
Interpretation = c(
"Range -1 to 1; higher = better cohesion and separation",
"Higher = more compact and well separated clusters",
"Range -1 to 1; 1 = perfect match with ground truth",
"Lower = tighter fit of points to cluster centres"
)
)
kable(metrics_fcm, caption = "Fuzzy C-Means Evaluation Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Metric | Value | Interpretation |
|---|---|---|
| Silhouette Score | 0.2295 | Range -1 to 1; higher = better cohesion and separation |
| Dunn Index | 0.0469 | Higher = more compact and well separated clusters |
| ARI | 0.0713 | Range -1 to 1; 1 = perfect match with ground truth |
| Objective Function | 3.7395 | Lower = tighter fit of points to cluster centres |
A positive average Silhouette score confirms that points are, on average, closer to their own cluster centre than to any other. The objective function value reflects the total weighted within-cluster dispersion minimised by the FCM algorithm; a lower value indicates a tighter overall fit.
Points with a maximum membership close to 1.0 were assigned with high confidence. Observations hovering near 1/c = 0.25 are genuinely ambiguous and could reasonably belong to more than one cluster.
mem_df <- as.data.frame(fcm_res$membership)
names(mem_df) <- paste0("Cluster_", seq_len(best_c))
mem_df$HardCluster <- factor(fcm_labels)
mem_df$TrueClass <- true_labels
mem_df$MaxMem <- apply(fcm_res$membership, 1, max)
fcm_pal <- c("1" = "#F4A261", "2" = "#A8DADC", "3" = "#C77DFF", "4" = "#80B918")
ggplot(mem_df, aes(x = MaxMem, fill = HardCluster)) +
geom_histogram(bins = 35, alpha = 0.85, color = "white") +
geom_vline(xintercept = 1 / best_c, linetype = "dashed",
color = "black", linewidth = 0.9) +
annotate("text",
x = 1/best_c + 0.01, y = Inf,
label = paste0("1/c = ", round(1/best_c, 2), " (max ambiguity)"),
vjust = 2, hjust = 0, size = 3.2) +
scale_fill_manual(values = fcm_pal) +
facet_wrap(~HardCluster, ncol = 2) +
labs(
title = "FCM - Distribution of Maximum Membership Degree",
subtitle = "Values near 1 = high confidence | Values near 0.25 = maximum ambiguity",
x = "Max Membership Degree", y = "Count"
) +
theme_bw() +
theme(legend.position = "none")Most vehicles are assigned with membership degrees well above 0.5, indicating confident placement. Points accumulating near the dashed threshold are the genuinely borderline observations. Consistent with all previous analyses, most of these uncertain points originate from the opel-saab overlap zone.
ggplot(mem_df, aes(x = Cluster_1, y = Cluster_2,
color = Cluster_3, shape = HardCluster)) +
geom_point(alpha = 0.7, size = 2) +
scale_color_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0.5, name = "Mem. C3") +
scale_shape_manual(values = c(16, 17, 15, 18), name = "Hard Label") +
labs(
title = "FCM Membership Space - Cluster 1 vs Cluster 2",
subtitle = "Colour = membership degree to Cluster 3",
x = "Membership to Cluster 1",
y = "Membership to Cluster 2"
) +
theme_bw()Points in the corners of this scatter plot are confidently assigned to a single cluster. Points near the centre have divided membership between clusters 1 and 2, meaning their silhouette shapes are intermediate between two vehicle types. This two-dimensional slice of the membership matrix makes the boundary zone explicit in a way that hard clustering outputs cannot.
fviz_silhouette(fcm_sil_obj,
palette = c("#F4A261", "#A8DADC", "#C77DFF", "#80B918"),
ggtheme = theme_bw(),
print.summary = FALSE) +
labs(
title = paste0("Silhouette Plot - Fuzzy C-Means (c = ", best_c,
", m = ", best_m, ")"),
subtitle = paste0("Mean silhouette width: ", round(fcm_sil_m, 4))
) +
theme(legend.position = "top")Clusters with predominantly tall positive bars are compact and well separated from their neighbours. Clusters with a higher proportion of negative bars contain points that would fit better in an adjacent cluster, and this pattern concentrates in the clusters that absorb the opel and saab observations.
pca_fcm <- pca_df %>%
mutate(
FCM_Cluster = factor(fcm_labels, labels = paste0("Cluster ", seq_len(best_c))),
MaxMem = mem_df$MaxMem
)
ggplot(pca_fcm, aes(PC1, PC2)) +
geom_point(aes(color = FCM_Cluster, fill = FCM_Cluster, alpha = MaxMem),
size = 2.5, shape = 21) +
stat_ellipse(aes(color = FCM_Cluster), level = 0.90, linewidth = 0.9) +
scale_color_manual(values = c("#F4A261", "#A8DADC", "#C77DFF", "#80B918")) +
scale_fill_manual(values = c("#F4A261", "#A8DADC", "#C77DFF", "#80B918")) +
scale_alpha_continuous(range = c(0.2, 1.0), name = "Max Membership") +
labs(
title = paste0("Fuzzy C-Means Clustering (c = ", best_c, ", m = ", best_m, ")"),
subtitle = paste0("Silhouette: ", round(fcm_sil_m, 4),
" | Dunn: ", round(fcm_stats$dunn, 4),
" | ARI: ", round(fcm_ari, 4)),
x = paste0("PC1 (", pca_var[1], "% var)"),
y = paste0("PC2 (", pca_var[2], "% var)"),
color = "FCM Cluster", fill = "FCM Cluster"
) +
theme_bw(base_size = 12) +
theme(legend.position = "top")The transparency of each point encodes its maximum membership degree: highly confident assignments appear opaque while uncertain points near cluster boundaries appear faded. The concentration of faded points in the central overlap region of the PCA plane is a direct visual representation of the opel-saab ambiguity that pervades this dataset.
conf_fcm <- table(FCM_Cluster = fcm_labels, True_Class = true_labels)
kable(conf_fcm,
caption = "FCM Hard Clusters (rows) vs True Vehicle Classes (cols)") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| bus | opel | saab | van |
|---|---|---|---|
| 50 | 38 | 39 | 83 |
| 43 | 103 | 99 | 3 |
| 84 | 34 | 36 | 70 |
| 40 | 37 | 43 | 43 |
A cleanly recovered cluster will have most of its count concentrated in a single column. Counts spread across multiple columns reveal where the algorithm groups observations from different vehicle types together. The columns for opel and saab will show the most mixing, consistent with every other result in this analysis.
pal_db_comp <- c("gray50", pal_4[seq_len(max(db_labels))])
ms_pal_comp <- colorRampPalette(pal_4)(n_ms)
pca_all <- pca_df %>%
mutate(
KMeans = factor(km_labels),
KMedian = factor(kmed_labels),
MeanShift = factor(ms_labels),
DBSCAN = factor(db_labels,
levels = 0:max(db_labels),
labels = c("Noise", paste0("C", seq_len(max(db_labels))))),
FCM = factor(fcm_labels, labels = paste0("C", seq_len(best_c)))
)
p_km_c <- ggplot(pca_all, aes(PC1, PC2, color = KMeans, fill = KMeans)) +
geom_point(alpha = 0.7, size = 1.8, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.7) +
scale_color_manual(values = pal_4) +
scale_fill_manual(values = pal_4) +
labs(title = "K-Means (K=4)",
subtitle = sprintf("Sil: %.4f Dunn: %.4f ARI: %.4f",
km_sil, km_dunn, km_ari),
x = paste0("PC1 (", pca_var[1], "%)"),
y = paste0("PC2 (", pca_var[2], "%)"),
color = NULL, fill = NULL) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_kmed_c <- ggplot(pca_all, aes(PC1, PC2, color = KMedian, fill = KMedian)) +
geom_point(alpha = 0.7, size = 1.8, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.7) +
scale_color_manual(values = pal_4) +
scale_fill_manual(values = pal_4) +
labs(title = "K-Median (K=4)",
subtitle = sprintf("Sil: %.4f Dunn: %.4f ARI: %.4f",
kmed_sil, kmed_dunn, kmed_ari),
x = paste0("PC1 (", pca_var[1], "%)"),
y = paste0("PC2 (", pca_var[2], "%)"),
color = NULL, fill = NULL) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_ms_c <- ggplot(pca_all, aes(PC1, PC2, color = MeanShift, fill = MeanShift)) +
geom_point(alpha = 0.65, size = 1.8, shape = 21) +
scale_color_manual(values = ms_pal_comp) +
scale_fill_manual(values = ms_pal_comp) +
labs(title = sprintf("Mean Shift (%d clusters)", n_ms),
subtitle = if (!is.na(ms_sil)) sprintf("Sil: %.4f Dunn: %.4f ARI: %.4f",
ms_sil, ms_dunn, ms_ari) else "Heavily imbalanced clusters",
x = paste0("PC1 (", pca_var[1], "%)"),
y = paste0("PC2 (", pca_var[2], "%)"),
color = NULL, fill = NULL) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_db_c <- ggplot(pca_all, aes(PC1, PC2, color = DBSCAN, fill = DBSCAN)) +
geom_point(aes(size = (DBSCAN == "Noise")), alpha = 0.75, shape = 21) +
stat_ellipse(data = filter(pca_all, DBSCAN != "Noise"),
level = 0.90, linewidth = 0.7) +
scale_color_manual(values = pal_db_comp) +
scale_fill_manual(values = pal_db_comp) +
scale_size_manual(values = c(1.2, rep(2.5, max(db_labels))), guide = "none") +
labs(title = paste0("DBSCAN (eps=", best_eps, ", minPts=", best_minPts, ")"),
subtitle = paste0("Sil: ", round(db_sil_mean, 4),
" Dunn: ", round(db_dunn, 4),
" ARI: ", round(db_ari, 4)),
x = paste0("PC1 (", pca_var[1], "%)"),
y = paste0("PC2 (", pca_var[2], "%)"),
color = NULL, fill = NULL) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_fcm_c <- ggplot(pca_all, aes(PC1, PC2, color = FCM, fill = FCM)) +
geom_point(alpha = 0.75, size = 1.8, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.7) +
scale_color_manual(values = pal_4) +
scale_fill_manual(values = pal_4) +
labs(title = paste0("Fuzzy C-Means (c=", best_c, ", m=", best_m, ")"),
subtitle = paste0("Sil: ", round(fcm_sil_m, 4),
" Dunn: ", round(fcm_stats$dunn, 4),
" ARI: ", round(fcm_ari, 4)),
x = paste0("PC1 (", pca_var[1], "%)"),
y = paste0("PC2 (", pca_var[2], "%)"),
color = NULL, fill = NULL) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_gt_c <- ggplot(pca_df, aes(PC1, PC2, color = Class, fill = Class)) +
geom_point(alpha = 0.75, size = 1.8, shape = 21) +
stat_ellipse(level = 0.90, linewidth = 0.7) +
scale_color_manual(values = my_pal) +
scale_fill_manual(values = my_pal) +
labs(title = "Ground Truth",
subtitle = "bus | opel | saab | van",
x = paste0("PC1 (", pca_var[1], "%)"),
y = paste0("PC2 (", pca_var[2], "%)"),
color = NULL, fill = NULL) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
(p_km_c | p_kmed_c | p_ms_c) /
(p_db_c | p_fcm_c | p_gt_c) +
plot_annotation(
title = "Clustering Comparison - Vehicle Dataset (PCA Space)",
theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
)Placing all five methods alongside the Ground Truth reveals a consistent pattern: van separates reasonably well across all approaches because it has a geometrically distinct footprint in PCA space. Bus also shows partial separation. The central region, where opel and saab coexist, remains unseparated by every method. K-Means and FCM produce cluster boundaries that most closely trace the ellipses seen in the Ground Truth panel, while Mean Shift collapses the majority of the data into a single dominant region. DBSCAN is the only method that explicitly flags low-density outliers rather than forcing every point into a cluster.
comparison_tbl <- data.frame(
Method = c("K-Means", "K-Median", "Mean Shift", "DBSCAN", "Fuzzy C-Means"),
Parameters = c(
"K = 4, nstart = 50",
"K = 4, Manhattan",
paste0("bw mult = ", best_mult),
paste0("eps = ", best_eps, ", minPts = ", best_minPts),
paste0("c = ", best_c, ", m = ", best_m)
),
Clusters = c(4, 4, n_ms, max(db_labels), best_c),
Noise = c(0, 0, 0, sum(db_labels == 0), 0),
Silhouette = round(c(km_sil, kmed_sil, ms_sil, db_sil_mean, fcm_sil_m), 4),
Dunn = round(c(km_dunn, kmed_dunn, ms_dunn, db_dunn, fcm_stats$dunn), 4),
ARI = round(c(km_ari, kmed_ari, ms_ari, db_ari, fcm_ari), 4)
)
best_sil_row <- which.max(comparison_tbl$Silhouette)
kable(comparison_tbl,
caption = "Final Metrics Comparison - All Clustering Methods",
col.names = c("Method", "Parameters", "Clusters", "Noise",
"Silhouette", "Dunn Index", "ARI")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(best_sil_row, background = "#d4edda", bold = TRUE)| Method | Parameters | Clusters | Noise | Silhouette | Dunn Index | ARI |
|---|---|---|---|---|---|---|
| K-Means | K = 4, nstart = 50 | 4 | 0 | 0.3053 | 0.0907 | 0.0764 |
| K-Median | K = 4, Manhattan | 4 | 0 | 0.2220 | 0.0479 | 0.0720 |
| Mean Shift | bw mult = 4.5 | 4 | 0 | 0.1591 | 0.0476 | 0.0054 |
| DBSCAN | eps = 3, minPts = 5 | 1 | 9 | NA | NA | 0.0000 |
| Fuzzy C-Means | c = 4, m = 2 | 4 | 0 | 0.2295 | 0.0469 | 0.0713 |
The highlighted row achieves the highest Silhouette score, indicating the most compact and well-separated clusters in standardised Euclidean space. All ARI values remain low across every method, which is the expected outcome given that opel and saab are geometrically inseparable in this 18-dimensional feature space. Mean Shift registers the lowest internal metrics because its density-based convergence is not well suited to the continuous, overlapping density structure of this dataset. DBSCAN is the only method that produces a noise count, offering a perspective on data quality that purely partitioning approaches cannot provide.
K-Means (K = 4)
K-Means achieves the strongest Silhouette and Dunn Index scores among all five methods, indicating that its Euclidean-based centroid partitioning produces the most compact and coherent clusters on this dataset. The low ARI reflects the fundamental geometric limitation: opel and saab cannot be separated by any centroid algorithm because their feature distributions overlap entirely.
K-Median (K = 4)
K-Median produces more balanced cluster sizes than K-Means since the median centroid is not displaced by extreme observations. All three evaluation metrics are marginally lower than K-Means, however, because this dataset does not contain the kind of severe outliers that would normally justify the additional robustness of the L1 distance. The theoretical advantage of K-Median does not materialise here.
Mean Shift
Mean Shift is the weakest performer on this dataset. The absence of well-separated density peaks in the standardised feature space causes the algorithm to merge most observations into a single dominant mode, regardless of the bandwidth chosen. This method would be better suited to a dataset whose classes form compact, isolated islands in feature space.
DBSCAN
DBSCAN automatically identifies outliers, which no other method in this analysis can do. The noise points it flags represent observations that genuinely sit in low-density regions and would otherwise be forced into an arbitrary cluster. Under the selected parameters, however, eps = 3.0 is large enough to merge several vehicle types into the same dense region, limiting the recovery of all four classes. A finer parameter search at smaller eps values would likely improve separation at the cost of a higher noise count.
Fuzzy C-Means
FCM performs comparably to K-Means on internal metrics and adds a layer of information that hard clustering methods discard: soft membership degrees. The membership matrix makes the opel-saab ambiguity explicit by assigning those observations split memberships close to 0.5 across two clusters. This quantification of uncertainty is the most distinctive contribution of FCM relative to the other four methods.
Overall
The opel-saab overlap, visible from the very first EDA boxplot, is the single structural feature of this dataset that limits every algorithm tested. No method achieves an ARI above 0.1, confirming that the four vehicle types are not geometrically separable in this feature space. K-Means and FCM offer the best trade-off between quantitative performance and interpretability, with FCM providing additional diagnostic value through its membership output.
Analysis produced using R 4.5.2 | 2026-04-05