1 Setup & Package Loading

library(tidyverse)
library(flexclust)
library(meanShiftR)
library(dbscan)
library(e1071)
library(cluster)
library(fpc)
library(mclust)
library(patchwork)
library(factoextra)
library(corrplot)
library(RColorBrewer)
library(kableExtra)

set.seed(123)

pal_4 <- c("#F4A261", "#A8DADC", "#C77DFF", "#80B918")
my_pal <- c(bus = "#F4A261", opel = "#A8DADC", saab = "#C77DFF", van = "#80B918")

2 Data Loading & Cleaning

2.1 Load Raw Data

df <- read.csv("vehicle.csv")
head(df, 10)

##    COMPACTNESS CIRCULARITY DISTANCE.CIRCULARITY RADIUS.RATIO
## 1           95          48                   83          178
## 2           91          41                   84          141
## 3          104          50                  106          209
## 4           93          41                   82          159
## 5           85          44                   70          205
## 6          107          57                  106          172
## 7           97          43                   73          173
## 8           90          43                   66          157
## 9           86          34                   62          140
## 10          93          44                   98          197
##    PR.AXIS.ASPECT.RATIO MAX.LENGTH.ASPECT.RATIO SCATTER.RATIO ELONGATEDNESS
## 1                    72                      10           162            42
## 2                    57                       9           149            45
## 3                    66                      10           207            32
## 4                    63                       9           144            46
## 5                   103                      52           149            45
## 6                    50                       6           255            26
## 7                    65                       6           153            42
## 8                    65                       9           137            48
## 9                    61                       7           122            54
## 10                   62                      11           183            36
##    PR.AXIS.RECTANGULARITY MAX.LENGTH.RECTANGULARITY
## 1                      20                       159
## 2                      19                       143
## 3                      23                       158
## 4                      19                       143
## 5                      19                       144
## 6                      28                       169
## 7                      19                       143
## 8                      18                       146
## 9                      17                       127
## 10                     22                       146
##    SCALED.VARIANCE.ALONG.MAJOR.AXIS SCALED.VARIANCE.ALONG.MINOR.AXIS
## 1                               176                              379
## 2                               170                              330
## 3                               223                              635
## 4                               160                              309
## 5                               241                              325
## 6                               280                              957
## 7                               176                              361
## 8                               162                              281
## 9                               141                              223
## 10                              202                              505
##    SCALED.RADIUS.OF.GYRATION SKEWNESS.ABOUT.MAJOR.AXIS
## 1                        184                        70
## 2                        158                        72
## 3                        220                        73
## 4                        127                        63
## 5                        188                       127
## 6                        264                        85
## 7                        172                        66
## 8                        164                        67
## 9                        112                        64
## 10                       152                        64
##    SKEWNESS.ABOUT.MINOR.AXIS KURTOSIS.ABOUT.MINOR.AXIS
## 1                          6                        16
## 2                          9                        14
## 3                         14                         9
## 4                          6                        10
## 5                          9                        11
## 6                          5                         9
## 7                         13                         1
## 8                          3                         3
## 9                          2                        14
## 10                         4                        14
##    KURTOSIS.ABOUT.MAJOR.AXIS HOLLOWS.RATIO class
## 1                        187           197   van
## 2                        189           199   van
## 3                        188           196  saab
## 4                        199           207   van
## 5                        180           183   bus
## 6                        181           183   bus
## 7                        200           204   bus
## 8                        193           202   van
## 9                        200           208   van
## 10                       195           204  saab

any(is.na(df))

## [1] TRUE

colSums(is.na(df))

##                      COMPACTNESS                      CIRCULARITY 
##                                1                                0 
##             DISTANCE.CIRCULARITY                     RADIUS.RATIO 
##                                0                                0 
##             PR.AXIS.ASPECT.RATIO          MAX.LENGTH.ASPECT.RATIO 
##                                0                                0 
##                    SCATTER.RATIO                    ELONGATEDNESS 
##                                0                                0 
##           PR.AXIS.RECTANGULARITY        MAX.LENGTH.RECTANGULARITY 
##                                0                                0 
## SCALED.VARIANCE.ALONG.MAJOR.AXIS SCALED.VARIANCE.ALONG.MINOR.AXIS 
##                                0                                0 
##        SCALED.RADIUS.OF.GYRATION        SKEWNESS.ABOUT.MAJOR.AXIS 
##                                0                                0 
##        SKEWNESS.ABOUT.MINOR.AXIS        KURTOSIS.ABOUT.MINOR.AXIS 
##                                0                                0 
##        KURTOSIS.ABOUT.MAJOR.AXIS                    HOLLOWS.RATIO 
##                                0                                0 
##                            class 
##                                0

df[!complete.cases(df), ]

##     COMPACTNESS CIRCULARITY DISTANCE.CIRCULARITY RADIUS.RATIO
## 753          NA         100                   36           73
##     PR.AXIS.ASPECT.RATIO MAX.LENGTH.ASPECT.RATIO SCATTER.RATIO ELONGATEDNESS
## 753                  199                      73             6           162
##     PR.AXIS.RECTANGULARITY MAX.LENGTH.RECTANGULARITY
## 753                     40                        20
##     SCALED.VARIANCE.ALONG.MAJOR.AXIS SCALED.VARIANCE.ALONG.MINOR.AXIS
## 753                              127                              189
##     SCALED.RADIUS.OF.GYRATION SKEWNESS.ABOUT.MAJOR.AXIS
## 753                       401                       125
##     SKEWNESS.ABOUT.MINOR.AXIS KURTOSIS.ABOUT.MINOR.AXIS
## 753                        72                         6
##     KURTOSIS.ABOUT.MAJOR.AXIS HOLLOWS.RATIO class
## 753                        19           200   204

2.2 Clean Data

df_clean <- df[complete.cases(df), ]
valid_classes <- c("bus", "opel", "saab", "van")
df_clean <- df_clean[df_clean$class %in% valid_classes, ]
any(is.na(df_clean))

## [1] FALSE

2.3 Verify Cleaned Data

cat("Rows   :", nrow(df_clean), "\n")

## Rows   : 845

cat("Columns:", ncol(df_clean), "\n")

## Columns: 19

print(table(df_clean$class))

## 
##  bus opel saab  van 
##  217  212  217  199

print(names(df_clean))

##  [1] "COMPACTNESS"                      "CIRCULARITY"                     
##  [3] "DISTANCE.CIRCULARITY"             "RADIUS.RATIO"                    
##  [5] "PR.AXIS.ASPECT.RATIO"             "MAX.LENGTH.ASPECT.RATIO"         
##  [7] "SCATTER.RATIO"                    "ELONGATEDNESS"                   
##  [9] "PR.AXIS.RECTANGULARITY"           "MAX.LENGTH.RECTANGULARITY"       
## [11] "SCALED.VARIANCE.ALONG.MAJOR.AXIS" "SCALED.VARIANCE.ALONG.MINOR.AXIS"
## [13] "SCALED.RADIUS.OF.GYRATION"        "SKEWNESS.ABOUT.MAJOR.AXIS"       
## [15] "SKEWNESS.ABOUT.MINOR.AXIS"        "KURTOSIS.ABOUT.MINOR.AXIS"       
## [17] "KURTOSIS.ABOUT.MAJOR.AXIS"        "HOLLOWS.RATIO"                   
## [19] "class"

Out of 846 rows, one had a COMPACTNESS value of NA paired with a class label of "204", which is a data entry error. That row was dropped, leaving 845 clean observations across four vehicle classes: bus (217), opel (212), saab (217), and van (199).

3 Exploratory Data Analysis

3.1 Descriptive Statistics

df_features <- df_clean[, names(df_clean) != "class"]
true_labels <- df_clean$class

summary_tbl <- df_features %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Value") %>%
  group_by(Feature) %>%
  summarise(
    Min    = round(min(Value), 2),
    Q1     = round(quantile(Value, 0.25), 2),
    Median = round(median(Value), 2),
    Mean   = round(mean(Value), 2),
    Q3     = round(quantile(Value, 0.75), 2),
    Max    = round(max(Value), 2),
    SD     = round(sd(Value), 2),
    .groups = "drop"
  )

kable(summary_tbl,
  caption = "Descriptive Statistics - Vehicle Features (n = 845)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE, font_size = 12)

Descriptive Statistics - Vehicle Features (n = 845)
Feature	Min	Q1	Median	Mean	Q3	Max	SD
CIRCULARITY	33	40	44	44.87	49	59	6.17
COMPACTNESS	73	87	93	93.67	100	119	8.24
DISTANCE.CIRCULARITY	40	70	80	82.10	98	112	15.78
ELONGATEDNESS	26	33	43	40.93	46	61	7.82
HOLLOWS.RATIO	181	190	197	195.62	201	211	7.44
KURTOSIS.ABOUT.MAJOR.AXIS	176	184	188	188.92	193	206	6.16
KURTOSIS.ABOUT.MINOR.AXIS	0	5	11	12.59	19	41	8.93
MAX.LENGTH.ASPECT.RATIO	2	7	8	8.57	10	55	4.60
MAX.LENGTH.RECTANGULARITY	118	137	146	148.02	159	188	14.51
PR.AXIS.ASPECT.RATIO	47	57	61	61.68	65	138	7.88
PR.AXIS.RECTANGULARITY	17	19	20	20.58	23	29	2.59
RADIUS.RATIO	104	141	167	168.91	195	333	33.48
SCALED.RADIUS.OF.GYRATION	109	149	173	174.76	198	268	32.52
SCALED.VARIANCE.ALONG.MAJOR.AXIS	130	167	178	188.62	217	320	31.41
SCALED.VARIANCE.ALONG.MINOR.AXIS	184	318	364	439.96	587	1018	176.79
SCATTER.RATIO	112	146	157	168.85	198	265	33.26
SKEWNESS.ABOUT.MAJOR.AXIS	59	67	71	72.46	75	135	7.49
SKEWNESS.ABOUT.MINOR.AXIS	0	2	6	6.38	9	22	4.92

The features vary considerably in scale. SCALED.VARIANCE.ALONG.MINOR.AXIS has a standard deviation above 200 while COMPACTNESS sits around 8. This magnitude gap confirms that standardisation is mandatory before any distance-based clustering algorithm is applied, otherwise the features with larger ranges would dominate all distance calculations.

3.2 Class Distribution

df_clean %>%
  dplyr::count(class) %>%
  ggplot(aes(x = reorder(class, -n), y = n, fill = class)) +
  geom_col(width = 0.6, color = "white") +
  geom_text(aes(label = n), vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = my_pal) +
  labs(title = "Vehicle Class Distribution (n = 845)",
    x = "Class", y = "Count", fill = "Class") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

The four classes are well balanced. Bus and saab each contribute 217 observations, opel has 212, and van has 199. No single class dominates the dataset, so clustering results should not be artificially skewed toward any particular vehicle type because of class size alone.

3.3 Feature Distribution by Class

df_clean %>%
  pivot_longer(-class, names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = class, y = Value, fill = class)) +
  geom_boxplot(alpha = 0.75, outlier.size = 0.7, outlier.alpha = 0.5) +
  facet_wrap(~Feature, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = my_pal) +
  labs(title = "Feature Distribution by True Vehicle Class",
    x = NULL, y = "Value", fill = "Class") +
  theme_bw(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    legend.position = "top",
    strip.background = element_rect(fill = "#C77DFF", color = NA),
    strip.text = element_text(color = "white", size = 8, face = "bold")
  )

Van stands out across features such as ELONGATEDNESS and MAX.LENGTH.RECTANGULARITY, which is consistent with its boxy rectangular shape. Opel and saab, on the other hand, share nearly identical distributions across almost every feature. This overlap is not a limitation of the algorithms; it is a structural property of the data, and it is the principal reason all five clustering methods will struggle to separate those two classes.

3.4 Correlation Heatmap

cor_mat <- cor(df_features)

corrplot(
  cor_mat,
  method = "color",
  type = "upper",
  order = "hclust",
  tl.cex = 0.68,
  tl.col = "black",
  addCoef.col = "black",
  number.cex = 0.45,
  col = colorRampPalette(c("#2166AC", "white", "#D73027"))(200),
  title = "Correlation Matrix - Vehicle Features",
  mar = c(0, 0, 2, 0)
)

Several feature pairs carry correlations above 0.9, particularly among the scaled variance and radius measures. These features are essentially encoding the same geometric information in different units. The high redundancy explains why PCA can compress 18 dimensions into just two principal components without losing the essential structure of the data.

4 Preprocessing

4.1 Feature Scaling

All five clustering methods used in this analysis are distance-based. Features left on their original scales would give disproportionate influence to those with larger numerical ranges. All 18 features are standardised to mean 0 and standard deviation 1.

df_scaled <- scale(df_features)

cat("Post-scaling means (first 5 features):\n")

## Post-scaling means (first 5 features):

print(round(colMeans(df_scaled)[1:5], 6))

##          COMPACTNESS          CIRCULARITY DISTANCE.CIRCULARITY 
##                    0                    0                    0 
##         RADIUS.RATIO PR.AXIS.ASPECT.RATIO 
##                    0                    0

cat("Post-scaling SDs (first 5 features):\n")

## Post-scaling SDs (first 5 features):

print(round(apply(df_scaled, 2, sd)[1:5], 6))

##          COMPACTNESS          CIRCULARITY DISTANCE.CIRCULARITY 
##                    1                    1                    1 
##         RADIUS.RATIO PR.AXIS.ASPECT.RATIO 
##                    1                    1

Means are essentially zero and all standard deviations equal exactly 1. Every feature now contributes equally to any distance calculation performed by the clustering algorithms.

4.2 PCA for Visualization

With 18 features the clustering results cannot be plotted directly. All scatter plots in this analysis project the data onto the first two principal components, which together capture the dominant structure of the standardised feature space.

pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
pca_var <- round(summary(pca_res)$importance[2, ] * 100, 1)

cat("Variance explained:\n")

## Variance explained:

cat("PC1:", pca_var[1], "%\n")

## PC1: 52.4 %

cat("PC2:", pca_var[2], "%\n")

## PC2: 16.8 %

cat("Cumulative PC1 + PC2:", pca_var[1] + pca_var[2], "%\n")

## Cumulative PC1 + PC2: 69.2 %

pca_df <- data.frame(
  PC1 = pca_res$x[, 1],
  PC2 = pca_res$x[, 2],
  Class = true_labels
)

fviz_screeplot(pca_res, ncp = 10, addlabels = TRUE,
  barfill = "#C77DFF",
  barcolor = "#7B2FBE",
  title = "Scree Plot - Variance Explained per Principal Component")

PC1 alone accounts for more than half of the total variance, and the curve falls steeply after that. The first two components together capture roughly 69 percent of the total information, which is a reasonable basis for 2D visualisation of the cluster structure.

ggplot(pca_df, aes(PC1, PC2, color = Class, fill = Class)) +
  geom_point(alpha = 0.75, size = 2.5, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.9) +
  scale_color_manual(values = my_pal) +
  scale_fill_manual(values = my_pal) +
  labs(
    title = "Ground Truth - True Vehicle Classes (PCA Space)",
    subtitle = paste0("PC1 (", pca_var[1], "% var)  |  PC2 (", pca_var[2], "% var)"),
    x = paste0("PC1 (", pca_var[1], "% var)"),
    y = paste0("PC2 (", pca_var[2], "% var)")
  ) +
  theme_bw(base_size = 12) +
  theme(legend.position = "top")

Van occupies its own distinct region at the lower right of the PCA plane, and bus has a partially separate area as well. Opel and saab, however, overlap severely. Their 90 percent confidence ellipses are nearly indistinguishable, and this structural ambiguity will limit the performance of every algorithm applied in this analysis regardless of the method chosen.

5 Selecting the Number of Clusters - K

The Elbow Method and Silhouette Analysis are used together to justify the choice of K = 4 for K-Means and K-Median, matching the four known vehicle categories.

5.1 Elbow Method

set.seed(42)

wss_vals <- sapply(1:10, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25, iter.max = 200)$tot.withinss
})

ggplot(data.frame(k = 1:10, wss = wss_vals), aes(k, wss)) +
  geom_line(color = "#2196F3", linewidth = 1) +
  geom_point(color = "#2196F3", size = 3) +
  geom_vline(xintercept = 4, linetype = "dashed", color = "#E53935") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method - Optimal K",
    subtitle = "Dashed line: K = 4 (aligned with 4 true vehicle classes)",
    x = "Number of Clusters (K)", y = "Total Within-Cluster SS") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

The rate of decrease in total within-cluster sum of squares slows noticeably around K = 3 to 4. The bend is gradual rather than sharp, which is consistent with the ground truth showing that two of the four classes share nearly identical feature distributions. K = 4 is nevertheless a defensible inflection point and aligns with the known number of vehicle types.

5.2 Silhouette Analysis

avg_sil_vals <- sapply(2:10, function(k) {
  km <- kmeans(df_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df_scaled))[, 3])
})

best_k_sil <- which.max(avg_sil_vals) + 1

ggplot(data.frame(k = 2:10, avg_sil = avg_sil_vals), aes(k, avg_sil)) +
  geom_line(color = "#4CAF50", linewidth = 1) +
  geom_point(color = "#4CAF50", size = 3) +
  geom_vline(xintercept = best_k_sil, linetype = "dashed", color = "#E53935") +
  geom_vline(xintercept = 4, linetype = "dotted", color = "#1E88E5") +
  scale_x_continuous(breaks = 2:10) +
  labs(title = "Silhouette Analysis - Optimal K",
    subtitle = paste0("Red dashed = silhouette best (K=", best_k_sil,
      ")  |  Blue dotted = K=4 (domain-driven)"),
    x = "Number of Clusters (K)", y = "Average Silhouette Width") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

cat("Silhouette-optimal K:", best_k_sil, "\n")

## Silhouette-optimal K: 2

The average silhouette width peaks at K = 2, suggesting the dataset contains only two geometrically distinct groups in standardised feature space. This reflects the opel-saab overlap seen in the PCA ground truth plot. K = 4 is retained throughout this analysis because the goal is to recover the four vehicle categories, and the domain context takes precedence over the purely geometric criterion.

6 K-Means Clustering

K-Means partitions data by assigning each point to the nearest centroid using Euclidean distance, then updating each centroid as the arithmetic mean of its assigned points. This process repeats until cluster assignments no longer change. Using nstart = 50 restarts the algorithm from 50 random configurations and retains the best result, reducing sensitivity to poor initial conditions.

set.seed(42)
km_fit <- kmeans(df_scaled, centers = 4, nstart = 50, iter.max = 300)
km_labels <- km_fit$cluster

cat("Cluster sizes:\n")

## Cluster sizes:

print(table(km_labels))

## km_labels
##   1   2   3   4 
## 268 246   8 323

cat("Total WSS:", round(km_fit$tot.withinss, 2),
  "| Between SS:", round(km_fit$betweenss, 2), "\n")

## Total WSS: 5964.32 | Between SS: 9227.68

km_sil_obj <- silhouette(km_labels, dist(df_scaled))
km_sil <- mean(km_sil_obj[, 3])
km_dunn <- cluster.stats(dist(df_scaled), km_labels)$dunn
km_ari <- adjustedRandIndex(km_labels, true_labels)

cat(sprintf("Silhouette = %.4f | Dunn = %.4f | ARI = %.4f\n",
  km_sil, km_dunn, km_ari))

## Silhouette = 0.3053 | Dunn = 0.0907 | ARI = 0.0764

fviz_silhouette(km_sil_obj,
  palette = pal_4,
  ggtheme = theme_bw(),
  print.summary = FALSE) +
  labs(
    title = "Silhouette Plot - K-Means (K = 4)",
    subtitle = paste0("Mean silhouette width: ", round(km_sil, 4))
  ) +
  theme(legend.position = "top")

The silhouette plot shows that most points have positive widths, meaning they sit closer to their own cluster centroid than to any other. Clusters corresponding to van and bus tend to produce taller positive bars, while the clusters capturing the opel-saab region generate shorter bars with some negative values, confirming that those two classes cannot be cleanly partitioned by any distance-based centroid approach.

pca_df$KM <- factor(km_labels)

ggplot(pca_df, aes(PC1, PC2, color = KM, fill = KM)) +
  geom_point(alpha = 0.7, size = 2.5, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.85) +
  scale_color_manual(values = pal_4, name = "Cluster") +
  scale_fill_manual(values = pal_4) +
  labs(
    title = "K-Means Clustering (K = 4)",
    subtitle = sprintf("Silhouette: %.4f  |  Dunn: %.4f  |  ARI: %.4f",
      km_sil, km_dunn, km_ari),
    x = paste0("PC1 (", pca_var[1], "% var)"),
    y = paste0("PC2 (", pca_var[2], "% var)")
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "top")

K-Means achieves the highest Silhouette and Dunn scores among the three centroid-based methods. In PCA space, one cluster is noticeably smaller than the others, capturing a compact subgroup rather than a full vehicle category. The ARI near 0.08 confirms that the algorithmic groupings have limited correspondence with the true vehicle labels, a consequence of the irreducible opel-saab overlap rather than a failure of the algorithm itself.

7 K-Median Clustering

K-Median replaces the mean centroid update with a coordinate-wise median and substitutes Euclidean distance with Manhattan distance. Theoretically, this makes it more robust to extreme values because the median is not pulled by outliers the way a mean is. The implementation uses flexclust::kcca() with family = kccaFamily("kmedians").

set.seed(42)
kmed_fit <- kcca(df_scaled, k = 4,
  family = kccaFamily("kmedians"),
  control = list(iter.max = 300))
kmed_labels <- clusters(kmed_fit)

cat("Cluster sizes:\n")

## Cluster sizes:

print(table(kmed_labels))

## kmed_labels
##   1   2   3   4 
## 193 130 284 238

kmed_sil_obj <- silhouette(kmed_labels, dist(df_scaled))
kmed_sil <- mean(kmed_sil_obj[, 3])
kmed_dunn <- cluster.stats(dist(df_scaled), kmed_labels)$dunn
kmed_ari <- adjustedRandIndex(kmed_labels, true_labels)

cat(sprintf("Silhouette = %.4f | Dunn = %.4f | ARI = %.4f\n",
  kmed_sil, kmed_dunn, kmed_ari))

## Silhouette = 0.2220 | Dunn = 0.0479 | ARI = 0.0720

fviz_silhouette(kmed_sil_obj,
  palette = pal_4,
  ggtheme = theme_bw(),
  print.summary = FALSE) +
  labs(
    title = "Silhouette Plot - K-Median (K = 4)",
    subtitle = paste0("Mean silhouette width: ", round(kmed_sil, 4))
  ) +
  theme(legend.position = "top")

The silhouette structure for K-Median is similar to K-Means: the cluster capturing van-like observations tends to be the most cohesive, while the clusters covering the opel-saab zone show a higher proportion of negative bars. The lower average silhouette compared to K-Means indicates that the Manhattan-based partitioning produces less compact clusters when quality is measured in Euclidean space.

pca_df$KMed <- factor(kmed_labels)

ggplot(pca_df, aes(PC1, PC2, color = KMed, fill = KMed)) +
  geom_point(alpha = 0.7, size = 2.5, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.85) +
  scale_color_manual(values = pal_4, name = "Cluster") +
  scale_fill_manual(values = pal_4) +
  labs(
    title = "K-Median Clustering (K = 4)",
    subtitle = sprintf("Silhouette: %.4f  |  Dunn: %.4f  |  ARI: %.4f",
      kmed_sil, kmed_dunn, kmed_ari),
    x = paste0("PC1 (", pca_var[1], "% var)"),
    y = paste0("PC2 (", pca_var[2], "% var)")
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "top")

K-Median produces more balanced cluster sizes than K-Means since the median centroid is not pulled toward outlying points. However, all three evaluation metrics are lower than K-Means on this dataset. This dataset does not contain extreme outliers that would normally give K-Median a meaningful advantage, so the additional robustness comes at the cost of slightly reduced cohesion in Euclidean terms.

8 Mean Shift Clustering

Mean Shift is a non-parametric density-based algorithm that requires no predefined number of clusters. Each point iteratively shifts toward the local centroid of its neighbourhood until convergence, finding natural density peaks (modes) in the feature space. The bandwidth parameter controls the neighbourhood radius and is the most critical design choice for this method.

8.1 Bandwidth Selection

pca_5d_scaled <- as.data.frame(scale(pca_res$x[, 1:5]))

n_pts <- nrow(pca_5d_scaled)
bw_base <- sapply(pca_5d_scaled, function(col) 1.06 * sd(col) * n_pts^(-0.2))

cat("Silverman bandwidth per scaled PC:\n")

## Silverman bandwidth per scaled PC:

print(round(bw_base, 4))

##    PC1    PC2    PC3    PC4    PC5 
## 0.2754 0.2754 0.2754 0.2754 0.2754

multipliers <- seq(1.0, 6.0, by = 0.5)
n_clust_search <- integer(length(multipliers))

cat("\nBandwidth multiplier search:\n")

## 
## Bandwidth multiplier search:

for (i in seq_along(multipliers)) {
  ms_try <- meanShift(as.matrix(pca_5d_scaled), bandwidth = bw_base * multipliers[i])
  n_clust_search[i] <- length(unique(ms_try$assignment))
  cat(sprintf("mult = %.1f  |  mean bw = %.4f  |  clusters = %d\n",
    multipliers[i], mean(bw_base * multipliers[i]), n_clust_search[i]))
}

## mult = 1.0  |  mean bw = 0.2754  |  clusters = 667
## mult = 1.5  |  mean bw = 0.4131  |  clusters = 705
## mult = 2.0  |  mean bw = 0.5508  |  clusters = 705
## mult = 2.5  |  mean bw = 0.6885  |  clusters = 656
## mult = 3.0  |  mean bw = 0.8261  |  clusters = 327
## mult = 3.5  |  mean bw = 0.9638  |  clusters = 73
## mult = 4.0  |  mean bw = 1.1015  |  clusters = 11
## mult = 4.5  |  mean bw = 1.2392  |  clusters = 4
## mult = 5.0  |  mean bw = 1.3769  |  clusters = 3
## mult = 5.5  |  mean bw = 1.5146  |  clusters = 3
## mult = 6.0  |  mean bw = 1.6523  |  clusters = 2

best_mult <- multipliers[which.min(abs(n_clust_search - 4))]
bw_final <- bw_base * best_mult
cat(sprintf("\nSelected multiplier: %.1f  |  mean bandwidth: %.4f\n",
  best_mult, mean(bw_final)))

## 
## Selected multiplier: 4.5  |  mean bandwidth: 1.2392

Mean Shift is applied on the first 5 principal components, which are first scaled to unit variance so that all dimensions contribute equally to the bandwidth calculation. Silverman’s rule (h = 1.06 × σ × n^{-1/5}) provides a data-driven starting point for each dimension. A multiplier search then identifies the bandwidth that yields the cluster count closest to four, matching the domain knowledge about the number of vehicle types.

8.2 Model & Evaluation

set.seed(42)
ms_fit <- meanShift(as.matrix(pca_5d_scaled), bandwidth = bw_final)
ms_labels <- ms_fit$assignment
n_ms <- length(unique(ms_labels))

cat("Clusters found:", n_ms, "\n")

## Clusters found: 4

cat("Cluster sizes:\n")

## Cluster sizes:

print(table(ms_labels))

## ms_labels
##   1   2   3   4 
## 725   6 112   2

if (n_ms > 1) {
  ms_sil <- mean(silhouette(ms_labels, dist(as.matrix(pca_5d_scaled)))[, 3])
  ms_dunn <- cluster.stats(dist(as.matrix(pca_5d_scaled)), ms_labels)$dunn
  ms_ari <- adjustedRandIndex(ms_labels, true_labels)
  cat(sprintf("Silhouette = %.4f | Dunn = %.4f | ARI = %.4f\n",
    ms_sil, ms_dunn, ms_ari))
} else {
  ms_sil <- ms_dunn <- ms_ari <- NA_real_
  cat("Only 1 cluster found. Metrics not applicable.\n")
}

## Silhouette = 0.1591 | Dunn = 0.0476 | ARI = 0.0054

ms_pal <- colorRampPalette(pal_4)(n_ms)
names(ms_pal) <- as.character(sort(unique(ms_labels)))

pca_df$MS <- factor(ms_labels)

ms_sub <- if (!is.na(ms_sil)) {
  sprintf("Silhouette: %.4f  |  Dunn: %.4f  |  ARI: %.4f  |  bw mult: %.1f",
    ms_sil, ms_dunn, ms_ari, best_mult)
} else {
  sprintf("Clusters found: %d  |  bandwidth multiplier: %.1f", n_ms, best_mult)
}

p_ms_plot <- ggplot(pca_df, aes(PC1, PC2, color = MS, fill = MS)) +
  geom_point(alpha = 0.65, size = 2.5, shape = 21) +
  scale_color_manual(values = ms_pal, name = "Cluster") +
  scale_fill_manual(values = ms_pal) +
  labs(
    title = sprintf("Mean Shift Clustering (%d clusters detected)", n_ms),
    subtitle = ms_sub,
    x = paste0("PC1 (", pca_var[1], "% var)"),
    y = paste0("PC2 (", pca_var[2], "% var)")
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "top")

suppressWarnings(print(p_ms_plot))

Mean Shift detects clusters by converging on density peaks, but the vehicle dataset does not exhibit well-separated density modes in its feature space. The result is heavily imbalanced: one dominant cluster absorbs most observations while the remaining clusters contain very few points each. This occurs because the data density is continuous and overlapping, lacking the compact isolated peaks that Mean Shift is specifically designed to recover. Among the five methods in this analysis, Mean Shift produces the lowest evaluation scores on this particular dataset.

9 DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups observations that are densely packed together and explicitly flags isolated points as noise without forcing them into any cluster. It requires no predefined cluster count but is sensitive to two parameters: eps (the neighbourhood radius) and minPts (the minimum neighbourhood size to qualify as a dense region).

9.1 Parameter Tuning - k-NN Distance Plot

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

for (k_val in c(5, 10)) {
  knn_sorted <- sort(kNNdist(df_scaled, k = k_val))
  plot(knn_sorted,
    type = "l", col = "steelblue", lwd = 2,
    main = paste0("k-NN Distance Plot  (k = ", k_val, ")"),
    xlab = "Points sorted by distance",
    ylab = paste0(k_val, "-NN Distance"))
  abline(h = c(2.5, 3.0, 3.5),
    col = c("red", "orange", "darkgreen"),
    lty = 2, lwd = 1.5)
  legend("topleft",
    legend = c("eps = 2.5", "eps = 3.0", "eps = 3.5"),
    col = c("red", "orange", "darkgreen"),
    lty = 2, cex = 0.8, bty = "n")
}

par(mfrow = c(1, 1))

The elbow in the sorted k-NN distance curve falls between eps = 2.5 and eps = 3.0 for both k = 5 and k = 10. Values below 2.5 produce excessive noise by classifying most points as outliers, while values above 3.5 merge distinct groups into a single large cluster. The range of 2.5 to 3.0 provides the most informative region to search in the grid step.

9.2 Parameter Grid Search

eps_vals <- c(2.0, 2.5, 3.0, 3.5, 4.0)
min_pts <- c(5, 10, 15)

grid_results <- expand.grid(eps = eps_vals, minPts = min_pts,
  stringsAsFactors = FALSE)
grid_results$n_clusters <- NA_integer_
grid_results$n_noise <- NA_integer_
grid_results$sil_score <- NA_real_

for (i in seq_len(nrow(grid_results))) {
  db_tmp <- dbscan::dbscan(df_scaled,
    eps = grid_results$eps[i],
    minPts = grid_results$minPts[i])
  cl <- db_tmp$cluster
  n_cl <- length(unique(cl[cl > 0]))
  n_no <- sum(cl == 0)
  grid_results$n_clusters[i] <- n_cl
  grid_results$n_noise[i] <- n_no
  if (n_cl >= 2 && sum(cl > 0) > n_cl) {
    sil_tmp <- silhouette(cl[cl > 0], dist(df_scaled[cl > 0, ]))
    grid_results$sil_score[i] <- round(mean(as.matrix(sil_tmp)[, 3]), 4)
  }
}

kable(grid_results,
  caption = "DBSCAN Parameter Grid Search (green = 4 clusters detected)",
  digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
    full_width = FALSE) %>%
  row_spec(which(grid_results$n_clusters == 4), background = "#d4edda")

DBSCAN Parameter Grid Search (green = 4 clusters detected)
eps	minPts	n_clusters	n_noise	sil_score
2.0	5	1	25	NA
2.5	5	1	12	NA
3.0	5	1	9	NA
3.5	5	1	8	NA
4.0	5	1	8	NA
2.0	10	2	45	0.3461
2.5	10	1	14	NA
3.0	10	1	11	NA
3.5	10	1	8	NA
4.0	10	1	8	NA
2.0	15	1	88	NA
2.5	15	1	15	NA
3.0	15	1	11	NA
3.5	15	1	8	NA
4.0	15	1	8	NA

Rows highlighted in green are configurations that produce exactly four clusters. Among those, the one with the highest silhouette score is selected as the final model. Smaller eps values create many small clusters and classify a large proportion of points as noise, while larger eps values merge everything into one or two indistinct groups.

9.3 Final DBSCAN Model

best_eps <- 3.0
best_minPts <- 5

db_res <- dbscan::dbscan(df_scaled, eps = best_eps, minPts = best_minPts)
db_labels <- db_res$cluster

cat("DBSCAN (eps =", best_eps, ", minPts =", best_minPts, ")\n")

## DBSCAN (eps = 3 , minPts = 5 )

cat("Clusters detected (excluding noise):", max(db_labels), "\n")

## Clusters detected (excluding noise): 1

cat("Noise points:", sum(db_labels == 0), "\n")

## Noise points: 9

cat("Cluster sizes (0 = noise):\n")

## Cluster sizes (0 = noise):

print(table(db_labels))

## db_labels
##   0   1 
##   9 836

9.4 Evaluation Metrics

Noise points (label 0) are excluded from all internal metrics since Silhouette and Dunn Index require at least two clusters to be meaningful.

non_noise <- db_labels > 0
db_labels_nn <- db_labels[non_noise]
df_scaled_nn <- df_scaled[non_noise, ]
true_labels_nn <- true_labels[non_noise]

n_unique_cl <- length(unique(db_labels_nn))

if (n_unique_cl >= 2) {
  db_sil <- silhouette(db_labels_nn, dist(df_scaled_nn))
  db_sil_mean <- mean(db_sil[, 3])
  db_stats <- cluster.stats(dist(df_scaled_nn), db_labels_nn)
  db_dunn <- db_stats$dunn
} else {
  db_sil <- NULL
  db_sil_mean <- NA_real_
  db_dunn <- NA_real_
}

db_ari <- adjustedRandIndex(db_labels_nn, true_labels_nn)

metrics_db <- data.frame(
  Metric = c(
    "Silhouette (noise excluded)",
    "Dunn Index (noise excluded)",
    "ARI (noise excluded)",
    "Clusters detected",
    "Noise points"
  ),
  Value = c(
    round(db_sil_mean, 4), round(db_dunn, 4), round(db_ari, 4),
    max(db_labels), sum(db_labels == 0)
  ),
  Interpretation = c(
    "Range -1 to 1; higher = better cohesion and separation",
    "Higher = more compact and well separated clusters",
    "Range -1 to 1; 1 = perfect match with ground truth",
    "Auto detected, no k required",
    "Points in sparse regions flagged as outliers"
  )
)

kable(metrics_db, caption = "DBSCAN Evaluation Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
    full_width = FALSE)

DBSCAN Evaluation Metrics
Metric	Value	Interpretation
Silhouette (noise excluded)	NA	Range -1 to 1; higher = better cohesion and separation
Dunn Index (noise excluded)	NA	Higher = more compact and well separated clusters
ARI (noise excluded)	0	Range -1 to 1; 1 = perfect match with ground truth
Clusters detected	1	Auto detected, no k required
Noise points	9	Points in sparse regions flagged as outliers

If fewer than two clusters remain after noise exclusion, both Silhouette and Dunn return NA. An ARI near zero indicates that the algorithmic partitioning has little correspondence with the true vehicle labels, typically because the chosen eps is large enough to merge distinct classes into the same dense region. The noise count itself is informative: it reveals how many observations sit in low-density parts of the feature space, which no other method in this analysis is able to detect explicitly.

9.5 Silhouette Plot

if (!is.null(db_sil)) {
  fviz_silhouette(db_sil,
    palette = "jco",
    ggtheme = theme_bw(),
    print.summary = FALSE) +
    labs(
      title = paste0("Silhouette Plot - DBSCAN  (eps = ", best_eps,
        ", minPts = ", best_minPts, ")"),
      subtitle = paste0("Mean silhouette width: ", round(db_sil_mean, 4),
        "  |  Noise points excluded")
    ) +
    theme(legend.position = "top")
} else {
  message("Silhouette plot skipped: fewer than 2 clusters after noise removal.")
}

Positive bars indicate points that are well placed within their assigned cluster. Negative bars indicate points whose closest cluster is a different one from their current assignment. The distribution of bar heights across clusters reflects how coherently DBSCAN has separated the vehicle groups under these parameter settings.

9.6 PCA Scatter Plot

n_db_cl <- max(db_labels)

pca_db <- pca_df %>%
  mutate(
    Cluster = factor(db_labels,
      levels = 0:n_db_cl,
      labels = c("Noise", paste0("Cluster ", seq_len(n_db_cl))))
  )

pal_db <- c("gray50",
  RColorBrewer::brewer.pal(max(n_db_cl, 3), "Set1")[seq_len(n_db_cl)])

ggplot(pca_db, aes(PC1, PC2, color = Cluster, fill = Cluster)) +
  geom_point(aes(shape = Cluster, size = (Cluster == "Noise")), alpha = 0.75) +
  stat_ellipse(data = filter(pca_db, Cluster != "Noise"),
    level = 0.90, linewidth = 0.85) +
  scale_color_manual(values = pal_db) +
  scale_fill_manual(values = pal_db) +
  scale_shape_manual(values = c(4, 21, 24, 22, 23)[seq_len(n_db_cl + 1)]) +
  scale_size_manual(values = c(1.2, rep(2.5, n_db_cl)), guide = "none") +
  labs(
    title = paste0("DBSCAN  (eps = ", best_eps, ", minPts = ", best_minPts, ")"),
    subtitle = paste0("Silhouette: ", round(db_sil_mean, 4),
      "  |  Dunn: ", round(db_dunn, 4),
      "  |  ARI: ", round(db_ari, 4)),
    x = paste0("PC1 (", pca_var[1], "% var)"),
    y = paste0("PC2 (", pca_var[2], "% var)"),
    color = "Cluster", shape = "Cluster", fill = "Cluster"
  ) +
  theme_bw(base_size = 12) +
  theme(legend.position = "top")

Grey crosses mark noise points, which are observations that DBSCAN excluded from all clusters because they sit in low-density regions of the standardised feature space. Comparing the cluster ellipses here with the Ground Truth panel from the preprocessing section shows how closely the density-based partitioning aligns with the true vehicle categories.

9.7 Confusion vs Ground Truth

conf_db <- table(DBSCAN_Cluster = db_labels_nn, True_Class = true_labels_nn)

kable(conf_db,
  caption = "DBSCAN Clusters (rows) vs True Vehicle Classes (cols) - noise excluded") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
    full_width = FALSE)

DBSCAN Clusters (rows) vs True Vehicle Classes (cols) - noise excluded
bus	opel	saab	van
215	212	217	192

A well-recovered cluster will have its count concentrated in a single column. Counts distributed across multiple columns indicate a mixed cluster. The degree of mixing in the opel and saab columns across rows reflects the same overlap that has been visible since the very first EDA boxplot.

10 Fuzzy C-Means Clustering

Fuzzy C-Means (FCM) assigns each observation a degree of membership to every cluster rather than a hard binary assignment. A membership of 1.0 means complete certainty, and a value near 1/c (where c is the number of clusters) indicates maximum ambiguity across all groups. This soft partitioning is particularly informative for datasets with overlapping class boundaries.

10.1 Choosing the Number of Clusters

c_range <- 2:8
sil_c <- numeric(length(c_range))
dunn_c <- numeric(length(c_range))

for (i in seq_along(c_range)) {
  fcm_tmp <- cmeans(df_scaled, centers = c_range[i],
    m = 2, iter.max = 200, dist = "euclidean")
  hard_cl <- fcm_tmp$cluster
  sil_c[i] <- mean(silhouette(hard_cl, dist(df_scaled))[, 3])
  dunn_c[i] <- cluster.stats(dist(df_scaled), hard_cl)$dunn
}

p_sil_c <- ggplot(data.frame(c = c_range, Silhouette = sil_c), aes(c, Silhouette)) +
  geom_line(color = "#C77DFF", linewidth = 1.2) +
  geom_point(size = 3.5, color = "#7B2FBE") +
  geom_vline(xintercept = c_range[which.max(sil_c)],
    linetype = "dashed", color = "#F4A261") +
  scale_x_continuous(breaks = c_range) +
  labs(title = "FCM - Silhouette vs Number of Clusters",
    x = "Number of clusters (c)", y = "Avg Silhouette Width") +
  theme_bw()

p_dunn_c <- ggplot(data.frame(c = c_range, Dunn = dunn_c), aes(c, Dunn)) +
  geom_line(color = "#80B918", linewidth = 1.2) +
  geom_point(size = 3.5, color = "#386641") +
  geom_vline(xintercept = c_range[which.max(dunn_c)],
    linetype = "dashed", color = "#F4A261") +
  scale_x_continuous(breaks = c_range) +
  labs(title = "FCM - Dunn Index vs Number of Clusters",
    x = "Number of clusters (c)", y = "Dunn Index") +
  theme_bw()

p_sil_c + p_dunn_c +
  plot_annotation(
    title = "FCM - Optimal Number of Clusters",
    theme = theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  )

Both the Silhouette and Dunn Index curves peak around c = 4, consistent with the four known vehicle types. This agreement between the metric-based selection and the domain knowledge gives additional confidence that c = 4 is the appropriate choice for the final FCM model.

10.2 Effect of the Fuzzifier

m_range <- seq(1.5, 4.0, by = 0.5)
sil_m <- numeric(length(m_range))

for (i in seq_along(m_range)) {
  fcm_m <- cmeans(df_scaled, centers = 4,
    m = m_range[i], iter.max = 200, dist = "euclidean")
  sil_m[i] <- mean(silhouette(fcm_m$cluster, dist(df_scaled))[, 3])
}

best_m_val <- m_range[which.max(sil_m)]

ggplot(data.frame(m = m_range, Silhouette = sil_m), aes(m, Silhouette)) +
  geom_line(color = "#C77DFF", linewidth = 1.2) +
  geom_point(size = 3.5, color = "#7B2FBE") +
  geom_vline(xintercept = best_m_val, linetype = "dashed", color = "#F4A261") +
  annotate("text", x = best_m_val + 0.1, y = min(sil_m) + 0.002,
    label = paste0("Best m = ", best_m_val),
    hjust = 0, color = "#F4A261", size = 3.5) +
  scale_x_continuous(breaks = m_range) +
  labs(
    title = "FCM - Effect of Fuzzifier on Silhouette Score",
    subtitle = paste0("c = 4 fixed  |  Best m = ", best_m_val),
    x = "Fuzzifier (m)", y = "Avg Silhouette Width"
  ) +
  theme_bw()

Silhouette score is highest at the lower end of the fuzzifier range. As m increases, cluster boundaries soften progressively and average cohesion decreases because points receive more evenly distributed memberships across all clusters. The standard value m = 2 is used in the final model since it is near the data-driven optimum and is the conventional default in the FCM literature.

10.3 Final FCM Model

best_c <- 4
best_m <- 2.0

fcm_res <- cmeans(df_scaled, centers = best_c,
  m = best_m, iter.max = 300, dist = "euclidean")
fcm_labels <- fcm_res$cluster

cat("Fuzzy C-Means (c =", best_c, ", m =", best_m, ")\n")

## Fuzzy C-Means (c = 4 , m = 2 )

cat("Objective function:", round(fcm_res$withinerror, 4), "\n")

## Objective function: 3.7395

cat("Hard cluster assignments:\n")

## Hard cluster assignments:

print(table(fcm_labels))

## fcm_labels
##   1   2   3   4 
## 210 248 224 163

10.4 Evaluation Metrics

fcm_sil_obj <- silhouette(fcm_labels, dist(df_scaled))
fcm_sil_m <- mean(fcm_sil_obj[, 3])
fcm_stats <- cluster.stats(dist(df_scaled), fcm_labels)
fcm_ari <- adjustedRandIndex(fcm_labels, true_labels)

metrics_fcm <- data.frame(
  Metric = c("Silhouette Score", "Dunn Index",
    "ARI", "Objective Function"),
  Value = c(round(fcm_sil_m, 4), round(fcm_stats$dunn, 4),
    round(fcm_ari, 4), round(fcm_res$withinerror, 4)),
  Interpretation = c(
    "Range -1 to 1; higher = better cohesion and separation",
    "Higher = more compact and well separated clusters",
    "Range -1 to 1; 1 = perfect match with ground truth",
    "Lower = tighter fit of points to cluster centres"
  )
)

kable(metrics_fcm, caption = "Fuzzy C-Means Evaluation Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
    full_width = FALSE)

Fuzzy C-Means Evaluation Metrics
Metric	Value	Interpretation
Silhouette Score	0.2295	Range -1 to 1; higher = better cohesion and separation
Dunn Index	0.0469	Higher = more compact and well separated clusters
ARI	0.0713	Range -1 to 1; 1 = perfect match with ground truth
Objective Function	3.7395	Lower = tighter fit of points to cluster centres

A positive average Silhouette score confirms that points are, on average, closer to their own cluster centre than to any other. The objective function value reflects the total weighted within-cluster dispersion minimised by the FCM algorithm; a lower value indicates a tighter overall fit.

10.5 Membership Distribution

Points with a maximum membership close to 1.0 were assigned with high confidence. Observations hovering near 1/c = 0.25 are genuinely ambiguous and could reasonably belong to more than one cluster.

mem_df <- as.data.frame(fcm_res$membership)
names(mem_df) <- paste0("Cluster_", seq_len(best_c))
mem_df$HardCluster <- factor(fcm_labels)
mem_df$TrueClass <- true_labels
mem_df$MaxMem <- apply(fcm_res$membership, 1, max)

fcm_pal <- c("1" = "#F4A261", "2" = "#A8DADC", "3" = "#C77DFF", "4" = "#80B918")

ggplot(mem_df, aes(x = MaxMem, fill = HardCluster)) +
  geom_histogram(bins = 35, alpha = 0.85, color = "white") +
  geom_vline(xintercept = 1 / best_c, linetype = "dashed",
    color = "black", linewidth = 0.9) +
  annotate("text",
    x = 1/best_c + 0.01, y = Inf,
    label = paste0("1/c = ", round(1/best_c, 2), " (max ambiguity)"),
    vjust = 2, hjust = 0, size = 3.2) +
  scale_fill_manual(values = fcm_pal) +
  facet_wrap(~HardCluster, ncol = 2) +
  labs(
    title = "FCM - Distribution of Maximum Membership Degree",
    subtitle = "Values near 1 = high confidence  |  Values near 0.25 = maximum ambiguity",
    x = "Max Membership Degree", y = "Count"
  ) +
  theme_bw() +
  theme(legend.position = "none")

Most vehicles are assigned with membership degrees well above 0.5, indicating confident placement. Points accumulating near the dashed threshold are the genuinely borderline observations. Consistent with all previous analyses, most of these uncertain points originate from the opel-saab overlap zone.

10.6 Membership Scatter - Cluster 1 vs Cluster 2

ggplot(mem_df, aes(x = Cluster_1, y = Cluster_2,
  color = Cluster_3, shape = HardCluster)) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red",
    midpoint = 0.5, name = "Mem. C3") +
  scale_shape_manual(values = c(16, 17, 15, 18), name = "Hard Label") +
  labs(
    title = "FCM Membership Space - Cluster 1 vs Cluster 2",
    subtitle = "Colour = membership degree to Cluster 3",
    x = "Membership to Cluster 1",
    y = "Membership to Cluster 2"
  ) +
  theme_bw()

Points in the corners of this scatter plot are confidently assigned to a single cluster. Points near the centre have divided membership between clusters 1 and 2, meaning their silhouette shapes are intermediate between two vehicle types. This two-dimensional slice of the membership matrix makes the boundary zone explicit in a way that hard clustering outputs cannot.

10.7 Silhouette Plot

fviz_silhouette(fcm_sil_obj,
  palette = c("#F4A261", "#A8DADC", "#C77DFF", "#80B918"),
  ggtheme = theme_bw(),
  print.summary = FALSE) +
  labs(
    title = paste0("Silhouette Plot - Fuzzy C-Means  (c = ", best_c,
      ", m = ", best_m, ")"),
    subtitle = paste0("Mean silhouette width: ", round(fcm_sil_m, 4))
  ) +
  theme(legend.position = "top")

Clusters with predominantly tall positive bars are compact and well separated from their neighbours. Clusters with a higher proportion of negative bars contain points that would fit better in an adjacent cluster, and this pattern concentrates in the clusters that absorb the opel and saab observations.

10.8 PCA Scatter Plot

pca_fcm <- pca_df %>%
  mutate(
    FCM_Cluster = factor(fcm_labels, labels = paste0("Cluster ", seq_len(best_c))),
    MaxMem = mem_df$MaxMem
  )

ggplot(pca_fcm, aes(PC1, PC2)) +
  geom_point(aes(color = FCM_Cluster, fill = FCM_Cluster, alpha = MaxMem),
    size = 2.5, shape = 21) +
  stat_ellipse(aes(color = FCM_Cluster), level = 0.90, linewidth = 0.9) +
  scale_color_manual(values = c("#F4A261", "#A8DADC", "#C77DFF", "#80B918")) +
  scale_fill_manual(values = c("#F4A261", "#A8DADC", "#C77DFF", "#80B918")) +
  scale_alpha_continuous(range = c(0.2, 1.0), name = "Max Membership") +
  labs(
    title = paste0("Fuzzy C-Means Clustering  (c = ", best_c, ", m = ", best_m, ")"),
    subtitle = paste0("Silhouette: ", round(fcm_sil_m, 4),
      "  |  Dunn: ", round(fcm_stats$dunn, 4),
      "  |  ARI: ", round(fcm_ari, 4)),
    x = paste0("PC1 (", pca_var[1], "% var)"),
    y = paste0("PC2 (", pca_var[2], "% var)"),
    color = "FCM Cluster", fill = "FCM Cluster"
  ) +
  theme_bw(base_size = 12) +
  theme(legend.position = "top")

The transparency of each point encodes its maximum membership degree: highly confident assignments appear opaque while uncertain points near cluster boundaries appear faded. The concentration of faded points in the central overlap region of the PCA plane is a direct visual representation of the opel-saab ambiguity that pervades this dataset.

10.9 Confusion vs Ground Truth

conf_fcm <- table(FCM_Cluster = fcm_labels, True_Class = true_labels)

kable(conf_fcm,
  caption = "FCM Hard Clusters (rows) vs True Vehicle Classes (cols)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
    full_width = FALSE)

FCM Hard Clusters (rows) vs True Vehicle Classes (cols)
bus	opel	saab	van
50	38	39	83
43	103	99	3
84	34	36	70
40	37	43	43

A cleanly recovered cluster will have most of its count concentrated in a single column. Counts spread across multiple columns reveal where the algorithm groups observations from different vehicle types together. The columns for opel and saab will show the most mixing, consistent with every other result in this analysis.

11 Comprehensive Comparison

11.1 Side-by-Side PCA Plots

pal_db_comp <- c("gray50", pal_4[seq_len(max(db_labels))])
ms_pal_comp <- colorRampPalette(pal_4)(n_ms)

pca_all <- pca_df %>%
  mutate(
    KMeans = factor(km_labels),
    KMedian = factor(kmed_labels),
    MeanShift = factor(ms_labels),
    DBSCAN = factor(db_labels,
      levels = 0:max(db_labels),
      labels = c("Noise", paste0("C", seq_len(max(db_labels))))),
    FCM = factor(fcm_labels, labels = paste0("C", seq_len(best_c)))
  )

p_km_c <- ggplot(pca_all, aes(PC1, PC2, color = KMeans, fill = KMeans)) +
  geom_point(alpha = 0.7, size = 1.8, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.7) +
  scale_color_manual(values = pal_4) +
  scale_fill_manual(values = pal_4) +
  labs(title = "K-Means (K=4)",
    subtitle = sprintf("Sil: %.4f  Dunn: %.4f  ARI: %.4f",
      km_sil, km_dunn, km_ari),
    x = paste0("PC1 (", pca_var[1], "%)"),
    y = paste0("PC2 (", pca_var[2], "%)"),
    color = NULL, fill = NULL) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

p_kmed_c <- ggplot(pca_all, aes(PC1, PC2, color = KMedian, fill = KMedian)) +
  geom_point(alpha = 0.7, size = 1.8, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.7) +
  scale_color_manual(values = pal_4) +
  scale_fill_manual(values = pal_4) +
  labs(title = "K-Median (K=4)",
    subtitle = sprintf("Sil: %.4f  Dunn: %.4f  ARI: %.4f",
      kmed_sil, kmed_dunn, kmed_ari),
    x = paste0("PC1 (", pca_var[1], "%)"),
    y = paste0("PC2 (", pca_var[2], "%)"),
    color = NULL, fill = NULL) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

p_ms_c <- ggplot(pca_all, aes(PC1, PC2, color = MeanShift, fill = MeanShift)) +
  geom_point(alpha = 0.65, size = 1.8, shape = 21) +
  scale_color_manual(values = ms_pal_comp) +
  scale_fill_manual(values = ms_pal_comp) +
  labs(title = sprintf("Mean Shift (%d clusters)", n_ms),
    subtitle = if (!is.na(ms_sil)) sprintf("Sil: %.4f  Dunn: %.4f  ARI: %.4f",
      ms_sil, ms_dunn, ms_ari) else "Heavily imbalanced clusters",
    x = paste0("PC1 (", pca_var[1], "%)"),
    y = paste0("PC2 (", pca_var[2], "%)"),
    color = NULL, fill = NULL) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

p_db_c <- ggplot(pca_all, aes(PC1, PC2, color = DBSCAN, fill = DBSCAN)) +
  geom_point(aes(size = (DBSCAN == "Noise")), alpha = 0.75, shape = 21) +
  stat_ellipse(data = filter(pca_all, DBSCAN != "Noise"),
    level = 0.90, linewidth = 0.7) +
  scale_color_manual(values = pal_db_comp) +
  scale_fill_manual(values = pal_db_comp) +
  scale_size_manual(values = c(1.2, rep(2.5, max(db_labels))), guide = "none") +
  labs(title = paste0("DBSCAN (eps=", best_eps, ", minPts=", best_minPts, ")"),
    subtitle = paste0("Sil: ", round(db_sil_mean, 4),
      "  Dunn: ", round(db_dunn, 4),
      "  ARI: ", round(db_ari, 4)),
    x = paste0("PC1 (", pca_var[1], "%)"),
    y = paste0("PC2 (", pca_var[2], "%)"),
    color = NULL, fill = NULL) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

p_fcm_c <- ggplot(pca_all, aes(PC1, PC2, color = FCM, fill = FCM)) +
  geom_point(alpha = 0.75, size = 1.8, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.7) +
  scale_color_manual(values = pal_4) +
  scale_fill_manual(values = pal_4) +
  labs(title = paste0("Fuzzy C-Means (c=", best_c, ", m=", best_m, ")"),
    subtitle = paste0("Sil: ", round(fcm_sil_m, 4),
      "  Dunn: ", round(fcm_stats$dunn, 4),
      "  ARI: ", round(fcm_ari, 4)),
    x = paste0("PC1 (", pca_var[1], "%)"),
    y = paste0("PC2 (", pca_var[2], "%)"),
    color = NULL, fill = NULL) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

p_gt_c <- ggplot(pca_df, aes(PC1, PC2, color = Class, fill = Class)) +
  geom_point(alpha = 0.75, size = 1.8, shape = 21) +
  stat_ellipse(level = 0.90, linewidth = 0.7) +
  scale_color_manual(values = my_pal) +
  scale_fill_manual(values = my_pal) +
  labs(title = "Ground Truth",
    subtitle = "bus  |  opel  |  saab  |  van",
    x = paste0("PC1 (", pca_var[1], "%)"),
    y = paste0("PC2 (", pca_var[2], "%)"),
    color = NULL, fill = NULL) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

(p_km_c | p_kmed_c | p_ms_c) /
(p_db_c | p_fcm_c | p_gt_c) +
  plot_annotation(
    title = "Clustering Comparison - Vehicle Dataset (PCA Space)",
    theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
  )

Placing all five methods alongside the Ground Truth reveals a consistent pattern: van separates reasonably well across all approaches because it has a geometrically distinct footprint in PCA space. Bus also shows partial separation. The central region, where opel and saab coexist, remains unseparated by every method. K-Means and FCM produce cluster boundaries that most closely trace the ellipses seen in the Ground Truth panel, while Mean Shift collapses the majority of the data into a single dominant region. DBSCAN is the only method that explicitly flags low-density outliers rather than forcing every point into a cluster.

11.2 Final Metrics Table

comparison_tbl <- data.frame(
  Method = c("K-Means", "K-Median", "Mean Shift", "DBSCAN", "Fuzzy C-Means"),
  Parameters = c(
    "K = 4, nstart = 50",
    "K = 4, Manhattan",
    paste0("bw mult = ", best_mult),
    paste0("eps = ", best_eps, ", minPts = ", best_minPts),
    paste0("c = ", best_c, ", m = ", best_m)
  ),
  Clusters = c(4, 4, n_ms, max(db_labels), best_c),
  Noise = c(0, 0, 0, sum(db_labels == 0), 0),
  Silhouette = round(c(km_sil, kmed_sil, ms_sil, db_sil_mean, fcm_sil_m), 4),
  Dunn = round(c(km_dunn, kmed_dunn, ms_dunn, db_dunn, fcm_stats$dunn), 4),
  ARI = round(c(km_ari, kmed_ari, ms_ari, db_ari, fcm_ari), 4)
)

best_sil_row <- which.max(comparison_tbl$Silhouette)

kable(comparison_tbl,
  caption = "Final Metrics Comparison - All Clustering Methods",
  col.names = c("Method", "Parameters", "Clusters", "Noise",
    "Silhouette", "Dunn Index", "ARI")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE) %>%
  row_spec(best_sil_row, background = "#d4edda", bold = TRUE)

Final Metrics Comparison - All Clustering Methods
Method	Parameters	Clusters	Noise	Silhouette	Dunn Index	ARI
K-Means	K = 4, nstart = 50	4	0	0.3053	0.0907	0.0764
K-Median	K = 4, Manhattan	4	0	0.2220	0.0479	0.0720
Mean Shift	bw mult = 4.5	4	0	0.1591	0.0476	0.0054
DBSCAN	eps = 3, minPts = 5	1	9	NA	NA	0.0000
Fuzzy C-Means	c = 4, m = 2	4	0	0.2295	0.0469	0.0713

The highlighted row achieves the highest Silhouette score, indicating the most compact and well-separated clusters in standardised Euclidean space. All ARI values remain low across every method, which is the expected outcome given that opel and saab are geometrically inseparable in this 18-dimensional feature space. Mean Shift registers the lowest internal metrics because its density-based convergence is not well suited to the continuous, overlapping density structure of this dataset. DBSCAN is the only method that produces a noise count, offering a perspective on data quality that purely partitioning approaches cannot provide.

12 Conclusions

K-Means (K = 4)

K-Means achieves the strongest Silhouette and Dunn Index scores among all five methods, indicating that its Euclidean-based centroid partitioning produces the most compact and coherent clusters on this dataset. The low ARI reflects the fundamental geometric limitation: opel and saab cannot be separated by any centroid algorithm because their feature distributions overlap entirely.

K-Median (K = 4)

K-Median produces more balanced cluster sizes than K-Means since the median centroid is not displaced by extreme observations. All three evaluation metrics are marginally lower than K-Means, however, because this dataset does not contain the kind of severe outliers that would normally justify the additional robustness of the L1 distance. The theoretical advantage of K-Median does not materialise here.

Mean Shift

Mean Shift is the weakest performer on this dataset. The absence of well-separated density peaks in the standardised feature space causes the algorithm to merge most observations into a single dominant mode, regardless of the bandwidth chosen. This method would be better suited to a dataset whose classes form compact, isolated islands in feature space.

DBSCAN

DBSCAN automatically identifies outliers, which no other method in this analysis can do. The noise points it flags represent observations that genuinely sit in low-density regions and would otherwise be forced into an arbitrary cluster. Under the selected parameters, however, eps = 3.0 is large enough to merge several vehicle types into the same dense region, limiting the recovery of all four classes. A finer parameter search at smaller eps values would likely improve separation at the cost of a higher noise count.

Fuzzy C-Means

FCM performs comparably to K-Means on internal metrics and adds a layer of information that hard clustering methods discard: soft membership degrees. The membership matrix makes the opel-saab ambiguity explicit by assigning those observations split memberships close to 0.5 across two clusters. This quantification of uncertainty is the most distinctive contribution of FCM relative to the other four methods.

Overall

The opel-saab overlap, visible from the very first EDA boxplot, is the single structural feature of this dataset that limits every algorithm tested. No method achieves an ARI above 0.1, confirming that the four vehicle types are not geometrically separable in this feature space. K-Means and FCM offer the best trade-off between quantitative performance and interpretability, with FCM providing additional diagnostic value through its membership output.

Analysis produced using R 4.5.2 | 2026-04-05

Vehicle Dataset - Clustering Analysis

Syafira Najema Putri Anisa (013) & Daffa Ahmad Pangreksa (159)

2026-04-05