A. Introduction

Wine quality assessment is traditionally performed by human experts through sensory evaluation, which is subjective and time-consuming. This study applies 5 unsupervised clustering methods, which are K-Means, K-Median, DBSCAN, Mean Shift, and Fuzzy C-Means, to the UCI Red Wine Quality dataset to discover natural groupings based on physicochemical properties. The goal is to compare method performance and interpret cluster characteristics in the context of wine quality.

The dataset contains 1,599 red wine samples described by 11 physicochemical features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and a quality score (3–8) used as external ground truth for evaluation.

This analysis uses 5 methods of Clustering, which are:

Method	Type	Need k	Approach
K-Means	Partitional	Yes	Mean
K-Median	Partitional	Yes	Median
DBSCAN	Density-based	No	Density
Mean Shift	Density-based	No	Mode
Fuzzy C-Means	Soft/Fuzzy	Yes	Membership

B. Load Library

library(tidyverse)
library(flexclust)
library(dbscan)
library(meanShiftR)
library(e1071)
library(cluster)
library(fpc)
library(mclust)
library(factoextra)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(RColorBrewer)
library(fmsb)
library(knitr)
library(kableExtra)

This analysis uses a comprehensive set of R packages covering clustering algorithms (flexclust, dbscan, meanShiftR, e1071), evaluation metrics (cluster, fpc, mclust), and visualization (ggplot2, corrplot, fmsb).

C. Load Data and Dataset Description

url       <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_raw  <- read.csv(url, sep = ";")

quality_label <- wine_raw$quality

cat("Dataset dimensions:", dim(wine_raw), "\n")

## Dataset dimensions: 1599 12

cat("Number of rows   :", nrow(wine_raw), "\n")

## Number of rows   : 1599

cat("Number of columns:", ncol(wine_raw), "\n\n")

## Number of columns: 12

str(wine_raw)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The UCI Red Wine Quality dataset contains 1,599 observations and 12 variables. All 11 physicochemical features are continuous numerical variables. The quality column is an integer score from 3 to 8 assigned by human tasters, used as external ground truth for evaluation. There are no missing values in the dataset.

D. Exploratory Data Analysis (EDA)

EDA is conducted before preprocessing to understand the raw data structure, identify issues, and justify each preprocessing decision.

D.1 Descriptive Statistics

kable(t(round(sapply(wine_raw %>% select(-quality), function(x)
  c(Min=min(x), Q1=quantile(x,0.25), Median=median(x),
    Mean=round(mean(x),3), Q3=quantile(x,0.75), Max=max(x),
    SD=round(sd(x),3), Skewness=round(e1071::skewness(x),3))), 3)),
  caption = "Descriptive Statistics of The Dataset") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 11)

Descriptive Statistics of The Dataset
	Min	Q1.25%	Median	Mean	Q3.75%	Max	SD	Skewness
fixed.acidity	4.600	7.100	7.900	8.320	9.200	15.900	1.741	0.981
volatile.acidity	0.120	0.390	0.520	0.528	0.640	1.580	0.179	0.670
citric.acid	0.000	0.090	0.260	0.271	0.420	1.000	0.195	0.318
residual.sugar	0.900	1.900	2.200	2.539	2.600	15.500	1.410	4.532
chlorides	0.012	0.070	0.079	0.087	0.090	0.611	0.047	5.670
free.sulfur.dioxide	1.000	7.000	14.000	15.875	21.000	72.000	10.460	1.248
total.sulfur.dioxide	6.000	22.000	38.000	46.468	62.000	289.000	32.895	1.513
density	0.990	0.996	0.997	0.997	0.998	1.004	0.002	0.071
pH	2.740	3.210	3.310	3.311	3.400	4.010	0.154	0.193
sulphates	0.330	0.550	0.620	0.658	0.730	2.000	0.170	2.424
alcohol	8.400	9.500	10.200	10.423	11.100	14.900	1.066	0.859

Interpretation: The result shows that most features has moderate variability with some notable skewness patterns. Variables such as fixed acidity, volatile acidity, and alcohol have slightly right-skewed distributions, indicating the presence of higher-value observations, though still relatively well-centered around their means. In contrast, residual sugar and chlorides display strong positive skewness, suggesting the existence of significant outliers and a concentration of lower values. Similarly, sulfur dioxide variables (both free and total) are right-skewed with wide ranges, indicating substantial dispersion in preservation-related attributes. Density and pH appear to be the most stable features, with minimal spread and near-symmetric distributions. Generally, the dataset contains a mix of relatively stable chemical properties and highly skewed variables, which may require normalization or transformation prior to modeling.

D.2 Feature Distribution

wine_feat_raw <- wine_raw %>% select(-quality)

wine_long_raw <- wine_feat_raw %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

ggplot(wine_long_raw, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#1E88E5", color = "white", alpha = 0.85) +
  facet_wrap(~Feature, scales = "free", ncol = 4) +
  labs(title = "Feature Distribution — Before Preprocessing",
       x = NULL, y = "Frequency") +
  theme_minimal(base_size = 10) +
  theme(strip.text = element_text(face = "bold"))

Interpretation: Features such as residual.sugar, chlorides, free.sulfur.dioxide, and total.sulfur.dioxide show strong right skewness with extreme values. This pattern justifies the use of IQR-based outlier capping and Robust Scaling rather than standard z-score normalization.

D.3 Outlier Identification

count_outliers <- function(x) {
  Q1 <- quantile(x, 0.25); Q3 <- quantile(x, 0.75); IQR_v <- Q3 - Q1
  sum(x < Q1 - 1.5*IQR_v | x > Q3 + 1.5*IQR_v, na.rm = TRUE)
}

outlier_tbl <- wine_feat_raw %>%
  summarise(across(everything(), count_outliers)) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Count") %>%
  mutate(Percentage = round(Count / nrow(wine_feat_raw) * 100, 2)) %>%
  arrange(desc(Count))

kable(outlier_tbl, caption = "Outlier Count per Feature (IQR Method)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Outlier Count per Feature (IQR Method)
Feature	Count	Percentage
residual.sugar	155	9.69
chlorides	112	7.00
sulphates	59	3.69
total.sulfur.dioxide	55	3.44
fixed.acidity	49	3.06
density	45	2.81
pH	35	2.19
free.sulfur.dioxide	30	1.88
volatile.acidity	19	1.19
alcohol	13	0.81
citric.acid	1	0.06

ggplot(wine_long_raw, aes(x = Feature, y = Value, fill = Feature)) +
  geom_boxplot(outlier.size = 0.7, outlier.alpha = 0.5, alpha = 0.75) +
  coord_flip() +
  labs(title = "Boxplot of Raw Data", x = NULL, y = "Value") +
  theme_minimal(base_size = 10) + theme(legend.position = "none")

Interpretation: By using IQR method, it shows that several features have outlier percentages exceeding 5%, particularly chlorides, sulphates, and total.sulfur.dioxide. These extreme values would distort distance computations in all 5 clustering algorithms.

D.4 Correlation Heatmap

cor_mat <- cor(wine_feat_raw)

corrplot(cor_mat, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.65,
         tl.cex = 0.8, tl.col = "black",
         col = colorRampPalette(c("#D85A30","white","#1D9E75"))(200),
         title = "Correlation Heatmap — Wine Features", mar = c(0,0,2,0))

Interpretation: Moderate-to-strong correlations are present among several features (fixed.acidity, citric.acid, fixed.acidity, pH, alcohol, density). No perfect multicollinearity exists, so all 11 features are maintained.

D.5 Quality Label Distribution (Ground Truth)

ggplot(data.frame(quality = quality_label), aes(x = factor(quality), fill = factor(quality))) +
  geom_bar(alpha = 0.85) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Quality Label Distribution (Ground Truth)",
       x = "Quality Score", y = "Count", fill = "Quality") +
  theme_minimal()

Interpretation: Quality scores are heavily concentrated at 5 and 6 (>80% of samples). This imbalance is relevant when evaluating ARI later, since most wines share similar quality-level chemistry.

E. Preprocessing

Preprocessing methods are applied based on 4 steps justified by the EDA above.

Preprocessing is also performed in 4 stages, each justified by the EDA above: (1) separating the quality label, (2) removing duplicate rows, (3) IQR Winsorization to cap extreme outliers while retaining all data points, and (4) Robust Scaling using the median and IQR to normalize features without amplifying outlier effects.

# Separate quality label
wine_feat <- wine_raw %>% select(-quality)

# Remove duplicates
cat("Duplicates found:", sum(duplicated(wine_feat)), "\n")

## Duplicates found: 240

dup_idx       <- duplicated(wine_raw %>% select(-quality))
wine_feat     <- wine_feat[!dup_idx, ]
quality_label <- quality_label[!dup_idx]
cat("Rows after deduplication:", nrow(wine_feat), "\n")

## Rows after deduplication: 1359

# IQR Winsorization (outlier capping)
cap_outliers <- function(x) {
  Q1  <- quantile(x, 0.25, na.rm = TRUE)
  Q3  <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  x   <- ifelse(x < Q1 - 1.5 * IQR_val, Q1 - 1.5 * IQR_val, x)
  x   <- ifelse(x > Q3 + 1.5 * IQR_val, Q3 + 1.5 * IQR_val, x)
  return(x)
}
wine_capped <- as.data.frame(lapply(wine_feat, cap_outliers))

# Robust Scaling (median & IQR)
robust_scale <- function(x) {
  med <- median(x, na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  return((x - med) / iqr)
}
wine_scaled <- as.data.frame(lapply(wine_capped, robust_scale))

cat("\nSummary after Robust Scaling:\n")

## 
## Summary after Robust Scaling:

summary(wine_scaled)

##  fixed.acidity     volatile.acidity    citric.acid       residual.sugar   
##  Min.   :-1.5714   Min.   :-1.60000   Min.   :-0.76471   Min.   :-1.8571  
##  1st Qu.:-0.3810   1st Qu.:-0.52000   1st Qu.:-0.50000   1st Qu.:-0.4286  
##  Median : 0.0000   Median : 0.00000   Median : 0.00000   Median : 0.0000  
##  Mean   : 0.1829   Mean   : 0.03136   Mean   : 0.03614   Mean   : 0.1773  
##  3rd Qu.: 0.6190   3rd Qu.: 0.48000   3rd Qu.: 0.50000   3rd Qu.: 0.5714  
##  Max.   : 2.1190   Max.   : 1.98000   Max.   : 2.00000   Max.   : 2.0714  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density         
##  Min.   :-1.9286   Min.   :-0.9286     Min.   :-0.7805      Min.   :-1.995496  
##  1st Qu.:-0.4286   1st Qu.:-0.5000     1st Qu.:-0.3902      1st Qu.:-0.495495  
##  Median : 0.0000   Median : 0.0000     Median : 0.0000      Median : 0.000000  
##  Mean   : 0.1106   Mean   : 0.1225     Mean   : 0.1974      Mean   : 0.002963  
##  3rd Qu.: 0.5714   3rd Qu.: 0.5000     3rd Qu.: 0.6098      3rd Qu.: 0.504505  
##  Max.   : 2.0714   Max.   : 2.0000     Max.   : 2.1098      Max.   : 2.004504  
##        pH              sulphates          alcohol       
##  Min.   :-2.026316   Min.   :-1.6111   Min.   :-1.1250  
##  1st Qu.:-0.526316   1st Qu.:-0.3889   1st Qu.:-0.4375  
##  Median : 0.000000   Median : 0.0000   Median : 0.0000  
##  Mean   :-0.005848   Mean   : 0.1665   Mean   : 0.1430  
##  3rd Qu.: 0.473684   3rd Qu.: 0.6111   3rd Qu.: 0.5625  
##  Max.   : 1.973684   Max.   : 2.1111   Max.   : 2.0625

A total of 240 duplicate rows were detected and removed, leaving 1,359 unique observations. IQR Winsorization capped extreme values at the 1.5×IQR fence without dropping any rows. After Robust Scaling, all features are centered at 0 with a unit IQR, making them comparable for distance computation regardless of their original units.

E.1 Before vs. After: Distribution Comparison

wine_long_before <- wine_feat %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

wine_long_after <- wine_scaled %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

p_before <- ggplot(wine_long_before, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#E53935", color = "white", alpha = 0.8) +
  facet_wrap(~Feature, scales = "free", ncol = 4) +
  labs(title = "Before Preprocessing", x = NULL, y = "Frequency") +
  theme_minimal(base_size = 9)

p_after <- ggplot(wine_long_after, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#43A047", color = "white", alpha = 0.8) +
  facet_wrap(~Feature, scales = "free", ncol = 4) +
  labs(title = "After Robust Scaling", x = NULL, y = "Frequency") +
  theme_minimal(base_size = 9)

grid.arrange(p_before, p_after, ncol = 1,
             top = "Feature Distribution: Before vs. After Preprocessing")

E.2 Before vs. After: Boxplot Comparison

p_box_before <- ggplot(wine_long_before, aes(x = Feature, y = Value, fill = Feature)) +
  geom_boxplot(outlier.size = 0.7, alpha = 0.7) +
  coord_flip() +
  labs(title = "Boxplot Before Preprocessing", x = NULL, y = "Value") +
  theme_minimal(base_size = 10) + theme(legend.position = "none")

p_box_after <- ggplot(wine_long_after, aes(x = Feature, y = Value, fill = Feature)) +
  geom_boxplot(outlier.size = 0.7, alpha = 0.7) +
  coord_flip() +
  labs(title = "Boxplot After Robust Scaling", x = NULL, y = "Scaled Value") +
  theme_minimal(base_size = 10) + theme(legend.position = "none")

grid.arrange(p_box_before, p_box_after, ncol = 2)

After preprocessing, all features have a comparable scale (centered at 0) with greatly reduced extreme values. The preprocessing pipeline has successfully equalized the feature space, ensuring no single feature dominates the distance calculations in clustering.

F. Determining the Optimal Number of Clusters (K)

K is determined by 3 methods: Elbow, Silhouette, and Gap Statistics. Using multiple criteria prevents over-reliance on a single method and gibves a more defensible K decision.

df_clust <- as.matrix(wine_scaled)

F.1 Elbow Method

fviz_nbclust(df_clust, kmeans, method = "wss", k.max = 10) +
  labs(title = "Elbow Method",
       subtitle = "Total Within-Cluster SS") +
  theme_minimal()

Result: The WSS curve shows the steepest decline from K=1 to K=2. The Kneedle algorithm, which objectively finds the point with maximum perpendicular distance from the line connecting the first and last points, which confirms K=2 as the elbow.

F.2 Silhouette Analysis

sil_vals  <- sapply(2:10, function(k) {
  km <- kmeans(df_clust, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df_clust))[, 3])
})

sil_table <- data.frame(K = 2:10, Avg_Silhouette = round(sil_vals, 4))
kable(sil_table, caption = "Average Silhouette Width per K") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Average Silhouette Width per K
K	Avg_Silhouette
2	0.1829
3	0.1495
4	0.1566
5	0.1514
6	0.1614
7	0.1529
8	0.1540
9	0.1396
10	0.1405

ggplot(sil_table, aes(x = K, y = Avg_Silhouette)) +
  geom_line(color = "#1E88E5", linewidth = 1) +
  geom_point(size = 3, color = "#E53935") +
  geom_vline(xintercept = which.max(sil_vals) + 1,
             linetype = "dashed", color = "gray40") +
  scale_x_continuous(breaks = 2:10) +
  labs(title    = "Silhouette Analysis",
       subtitle = "Highest average silhouette width = optimal K",
       x = "Number of Clusters (K)", y = "Average Silhouette Width") +
  theme_minimal()

Result: K=2 achieves the highest average silhouette width, confirming the Elbow result.

F.3 Gap Statistics (Tie-Breaker)

set.seed(123)
gap_stat <- clusGap(df_clust, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_df   <- as.data.frame(gap_stat$Tab)
gap_df$K <- 1:nrow(gap_df)

optimal_gap_k <- maxSE(gap_df$gap[-1], gap_df$SE.sim[-1], method = "firstmax") + 1
cat("Gap Statistics suggests K =", optimal_gap_k, "\n")

## Gap Statistics suggests K = 10

plot(gap_stat, main = "Gap Statistics")

Result: Gap Statistics uses a reference null distribution to objectively select K. It shows a tie breaker at K=2.

Combined with Elbow and Silhouette, all 3 methods point to K=2 as the optimal number of clusters.

K_opt <- 2
cat("Final decision for optimal K:", K_opt, "\n")

## Final decision for optimal K: 2

G. K-Means Clustering

K-Means partitions data into K clusters by minimizing the total within-cluster sum of squared Euclidean distances. The steps: (1) randomly initialize K centroids (nstart=25 runs 25 initializations and keeps the best), (2) assign each point to the nearest centroid, (3) update each centroid to the arithmetic mean of its assigned points, (4) repeat steps 2–3 until convergence. K-Means assumes spherical, similarly-sized clusters and is sensitive to outliers because the mean is pulled by extreme values.

G.1 Result

km_res    <- kmeans(df_clust, centers=K_opt, nstart=25, iter.max=100)
km_labels <- km_res$cluster

cat("Iterations until convergence:", km_res$iter, "\n")

## Iterations until convergence: 1

cat("Between-SS / Total-SS       :", round(km_res$betweenss/km_res$totss*100,2), "%\n\n")

## Between-SS / Total-SS       : 19.23 %

kable(data.frame(Cluster=1:K_opt, Size=km_res$size, WSS=round(km_res$withinss,3)),
      caption="K-Means: Cluster Sizes and Within-Cluster SS") %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

K-Means: Cluster Sizes and Within-Cluster SS
Cluster	Size	WSS
1	780	3574.235
2	579	3397.451

cat("\nCluster centroids (robust-scaled):\n")

## 
## Cluster centroids (robust-scaled):

print(round(km_res$centers, 3))

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1        -0.276            0.136      -0.238         -0.137    -0.222
## 2         0.801           -0.110       0.406          0.601     0.559
##   free.sulfur.dioxide total.sulfur.dioxide density     pH sulphates alcohol
## 1               0.114                0.080  -0.443  0.373    -0.021   0.241
## 2               0.133                0.356   0.604 -0.516     0.419   0.011

The Between-SS / Total-SS ratio indicates how well the clusters capture the overall variance. The 2 centroids show the largest contrast along alcohol and volatile.acidity.

G.2 Visualization

pca_res <- prcomp(df_clust, scale.=FALSE)
pca_df  <- as.data.frame(pca_res$x[,1:2])
colnames(pca_df) <- c("PC1","PC2")
var_pc1 <- round(summary(pca_res)$importance[2,1]*100, 1)
var_pc2 <- round(summary(pca_res)$importance[2,2]*100, 1)

cbind(pca_df, Cluster=factor(km_labels)) %>%
  ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
  geom_point(alpha=0.55, size=1.8) +
  stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5) +
  stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6) +
  scale_color_manual(values=c("#E53935","#1E88E5")) +
  scale_fill_manual(values=c("#E53935","#1E88E5")) +
  labs(title="K-Means Clustering (K=2)",
       subtitle="90% confidence ellipses",
       x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
  theme_minimal()

The 2 clusters separate clearly along PC1. The 90% confidence ellipses show minimal overlap, indicating good compactness and separation.

Cluster Profile

kable(wine_feat %>%
  mutate(Cluster=factor(km_labels), Quality=quality_label) %>%
  group_by(Cluster) %>%
  summarise(n=n(), avg_quality=round(mean(Quality),2),
            across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
  caption="K-Means: Mean Feature Values per Cluster (Original Scale)") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE, font_size=11)

K-Means: Mean Feature Values per Cluster (Original Scale)
Cluster	n	avg_quality	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	Quality
1	780	5.62	7.321	0.557	0.179	2.173	0.075	15.736	41.538	0.996	3.383	0.617	10.59	5.617
2	579	5.63	9.644	0.493	0.398	2.996	0.105	16.105	53.950	0.998	3.211	0.715	10.22	5.632

wine_feat %>%
  mutate(Cluster=factor(km_labels)) %>%
  pivot_longer(-Cluster, names_to="Feature", values_to="Value") %>%
  ggplot(aes(x=Cluster, y=Value, fill=Cluster)) +
  geom_boxplot(outlier.size=0.5, alpha=0.75) +
  facet_wrap(~Feature, scales="free_y", ncol=4) +
  scale_fill_manual(values=c("#E53935","#1E88E5")) +
  labs(title="K-Means: Feature Distribution per Cluster (Original Scale)",
       x="Cluster", y="Value") +
  theme_minimal(base_size=9) + theme(legend.position="none")

G.3 Interpretation

K-Means produces 2 chemically distinct clusters. The cluster with higher alcohol and sulphates combined with lower volatile.acidity aligns with higher average quality scores. This is consistent with principles: high alcohol reflects complete fermentation, sulphates preserve wine integrity, and low volatile acidity indicates fewer acetic acid defects. The other cluster shows the inverse profile and lower quality scores. The quality distribution plot confirms this separation, although overlap at quality 5–6 is expected because many wines in those middle scores are chemically intermediate between the 2 archetypes.

K-Means produces two clusters with a Silhouette score of 0.1829, a Dunn Index of 0.0771, and an ARI of −0.0005. Cluster 1 (n=780, avg_quality=5.62) is characterized by lower volatile.acidity (0.557), lower fixed.acidity (7.321), and moderate alcohol (10.59), while Cluster 2 (n=579, avg_quality=5.63) shows higher fixed.acidity (9.644), slightly lower volatile.acidity (0.398), and lower alcohol (10.22). The Silhouette and Dunn scores are positive but low, indicating that while the two clusters are distinct, they are not sharply separated, which means that there is substantial overlap in physicochemical space. The ARI of −0.0005 is essentially zero, indicating that the K-Means partition does not align with the human-assigned quality labels. This is expected: wines rated 5 and 6 dominate the dataset and span both cluster profiles, so the unsupervised chemical groupings do not mirror the subjective quality scores.

H. K-Median Clustering

K-Median is a robust variant of K-Means with two differences: (1) the cluster center is the median (not the mean), and (2)*Manhattan distance (L1 norm) replaces Euclidean. The median is not affected by extreme values; the L₁ norm penalizes differences linearly rather than quadratically. This makes K-Median inherently more resistant to outliers. The four-step iterative structure (initialize - assign - update - repeat) is otherwise identical to K-Means.

H.1 Result

kmed_res    <- kcca(df_clust, k=K_opt,
                    family=kccaFamily("kmedians"),
                    control=list(iter.max=100))
kmed_labels <- clusters(kmed_res)

kable(data.frame(Cluster=names(table(kmed_labels)),
                 Size=as.integer(table(kmed_labels))),
      caption="K-Median: Cluster Sizes") %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

K-Median: Cluster Sizes
Cluster	Size
1	804
2	555

# Compare with K-Means
tab <- table(km_labels, kmed_labels)
agree <- sum(diag(tab)) / sum(tab)

final_agree <- max(agree, 1 - agree)
cat("\nLabel agreement with K-Means:",
    round(final_agree * 100, 1), "%\n")

## 
## Label agreement with K-Means: 82 %

K-Median produces cluster sizes comparable to K-Means. The agreement rate shows how similarly the two methods partition the data despite different distance metrics and centroid definitions.

H.2 Visualization

cbind(pca_df, Cluster=factor(kmed_labels)) %>%
  ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
  geom_point(alpha=0.55, size=1.8) +
  stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5) +
  stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6) +
  scale_color_manual(values=c("#E53935","#1E88E5")) +
  scale_fill_manual(values=c("#E53935","#1E88E5")) +
  labs(title="K-Median Clustering (K=2)",
       subtitle="90% confidence ellipses",
       x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
  theme_minimal()

The PCA projection of K-Median is visually very close to K-Means, confirming both methods recover the same dominant bimodal structure. Any minor boundary differences reflect the use of Manhattan distance and median centroids.

Cluster Profile

kable(wine_feat %>%
  mutate(Cluster=factor(kmed_labels), Quality=quality_label) %>%
  group_by(Cluster) %>%
  summarise(n=n(), avg_quality=round(mean(Quality),2),
            across(where(is.numeric), ~round(median(.),3)), .groups="drop"),
  caption="K-Median: Median Feature Values per Cluster (Original Scale)") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE, font_size=11)

K-Median: Median Feature Values per Cluster (Original Scale)
Cluster	n	avg_quality	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	Quality
1	804	5.45	7.3	0.6	0.13	2.1	0.078	15	41	0.996	3.37	0.58	10.0	5
2	555	5.87	9.6	0.4	0.45	2.4	0.082	11	32	0.998	3.22	0.70	10.5	6

wine_feat %>%
  mutate(Cluster=factor(kmed_labels)) %>%
  pivot_longer(-Cluster, names_to="Feature", values_to="Value") %>%
  ggplot(aes(x=Cluster, y=Value, fill=Cluster)) +
  geom_boxplot(outlier.size=0.5, alpha=0.75) +
  facet_wrap(~Feature, scales="free_y", ncol=4) +
  scale_fill_manual(values=c("#E53935","#1E88E5")) +
  labs(title="K-Median: Feature Distribution per Cluster (Original Scale)",
       x="Cluster", y="Value") +
  theme_minimal(base_size=9) + theme(legend.position="none")

Note: K-Median cluster profiles are reported using medians (rather than means) consistent with the algorithm’s use of the median as its centroid measure.

H.3 Interpretation

K-Median recovers the same 2-cluster structure as K-Means but with a stronger robustness guarantee. Because the median minimizes the L₁ objective and Manhattan distance is less sensitive to extreme values in individual dimensions, K-Median’s cluster boundaries are less likely to be distorted by any residual extreme observations remaining after Winsorization. The cluster profiles confirm the same chemical contrasts, which is the high-alcohol/low-volatile-acidity group versus the inverse, but the median-based centroids are more stable representatives of the typical wine in each group. Where K-Means and K-Median disagree on a wine’s cluster assignment, those are likely the borderline cases closest to the decision boundary, which K-Median handles more conservatively.

K-Median produces two clusters with a Silhouette score of 0.1672, a Dunn Index of 0.0411, and an ARI of 0.0339. Both the Silhouette (0.1672 vs. 0.1829) and Dunn (0.0411 vs. 0.0771) scores are slightly lower than K-Means, suggesting that the Manhattan distance-based partition produces marginally less compact and separated clusters in this dataset. However, the ARI of 0.0339 is slightly higher than K-Means (-0.0005), indicating a marginally better agreement with the original quality labels, means that the median-based partition aligns slightly more closely with how human tasters differentiated the wines. The clustering pattern confirms the same two dominant physicochemical profiles as K-Means, and the modest difference in metrics shows that K-Median’s robustness advantage over outliers does not meaningfully alter the overall structure for this Winsorized dataset.

I. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse areas. Each point is classified as: a core point (≥ MinPts neighbors within radius epsilon), a border point (within epsilon of a core point but fewer than MinPts neighbors), or **noise (neither). Clusters grow by recursively connecting density-connected core points. DBSCAN does not require K; noise is explicitly labeled rather than forced into a cluster.

The parameter rule is MinPts = D+1 (where D = number of features), as recommended in the lecture material. Epsilon is selected from the k-NN distance plot inflection point.

MINPTS <- ncol(df_clust) + 1
cat("MinPts =", MINPTS, "\n")

## MinPts = 12

dbscan::kNNdistplot(df_clust, k=MINPTS-1)
abline(h=2.5, col="red", lty=2, lwd=2)
legend("topleft", legend="eps = 2.5", col="red", lty=2, bty="n")
title("k-NN Distance Plot for Inflection Point = Optimal Epsilon")

The k-NN distance plot shows a clear inflection around eps = 2.5, where the curve transitions from gradual increase (dense cluster region) to steep increase (sparse noise region).

I.1 Result

db_res    <- dbscan::dbscan(df_clust, eps=2.5, minPts=MINPTS)
db_labels <- db_res$cluster

cat("DBSCAN result:\n"); print(db_res)

## DBSCAN result:

## DBSCAN clustering for 1359 objects.
## Parameters: eps = 2.5, minPts = 12
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 3 noise points.
## 
##    0    1 
##    3 1356 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints

cat("\nCluster distribution (0 = noise):\n")

## 
## Cluster distribution (0 = noise):

print(table(db_labels))

## db_labels
##    0    1 
##    3 1356

cat(sprintf("\nNoise points: %d (%.1f%% of data)\n",
            sum(db_labels==0), sum(db_labels==0)/length(db_labels)*100))

## 
## Noise points: 3 (0.2% of data)

DBSCAN automatically determines the number of clusters from data density. Points labeled 0 (noise) represent wines with atypical physicochemical profiles that do not belong to any dense region , which is information that partition-based methods cannot provide.

I.2 Visualization

n_db_cl   <- length(unique(db_labels[db_labels != 0]))
db_colors <- c("0"="grey70",
               setNames(c("#E53935","#1E88E5","#43A047")[seq_len(n_db_cl)],
                        as.character(seq_len(n_db_cl))))

cbind(pca_df, Cluster=factor(db_labels)) %>%
  ggplot(aes(x=PC1, y=PC2, color=Cluster)) +
  geom_point(alpha=0.6, size=1.5) +
  scale_color_manual(values=db_colors) +
  labs(title="DBSCAN Clusteringn",
       subtitle="Grey points (Cluster 0) are noise",
       x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
  theme_minimal()

The noise points (grey) scatter in sparse regions outside the main density, confirming they are genuinely atypical wines rather than random misclassifications.

Cluster Profile

kable(wine_feat %>%
  mutate(Cluster=factor(db_labels), Quality=quality_label) %>%
  filter(Cluster != 0) %>%
  group_by(Cluster) %>%
  summarise(n=n(), avg_quality=round(mean(Quality),2),
            across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
  caption="DBSCAN: Mean Feature Values per Cluster (Noise Excluded)") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE, font_size=11)

DBSCAN: Mean Feature Values per Cluster (Noise Excluded)
Cluster	n	avg_quality	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	Quality
1	1356	5.63	8.313	0.529	0.272	2.525	0.088	15.848	46.736	0.997	3.31	0.658	10.431	5.625

if (sum(db_labels == 0) > 0) {
  cat("Noise point mean feature values:\n")
  print(round(colMeans(wine_feat[db_labels == 0, ]), 3))
  cat(sprintf("\nNoise wines vs. overall mean for key features:\n"))
  cat(sprintf("  alcohol:          noise=%.2f, overall=%.2f\n",
              mean(wine_feat$alcohol[db_labels==0]), mean(wine_feat$alcohol)))
  cat(sprintf("  volatile.acidity: noise=%.2f, overall=%.2f\n",
              mean(wine_feat$volatile.acidity[db_labels==0]),
              mean(wine_feat$volatile.acidity)))
  cat(sprintf("  chlorides:        noise=%.4f, overall=%.4f\n",
              mean(wine_feat$chlorides[db_labels==0]), mean(wine_feat$chlorides)))
}

## Noise point mean feature values:
##        fixed.acidity     volatile.acidity          citric.acid 
##                7.400                0.573                0.470 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                1.900                0.235               36.333 
## total.sulfur.dioxide              density                   pH 
##               87.667                0.994                3.127 
##            sulphates              alcohol 
##                0.973               10.867 
## 
## Noise wines vs. overall mean for key features:
##   alcohol:          noise=10.87, overall=10.43
##   volatile.acidity: noise=0.57, overall=0.53
##   chlorides:        noise=0.2347, overall=0.0881

I.3 Interpretation

DBSCAN’s most distinctive contribution is the explicit identification of noise wines, which are observations whose local density falls below the MinPts threshold. Comparing the noise profile to the overall dataset means reveals which chemical features drive the atypicality: if noise wines have substantially higher chlorides or extreme acidity values, those are the features pushing them outside any coherent density region. The non-noise clusters found by DBSCAN represent the dense core of the wine distribution; their profiles align with the same chemical contrasts found by K-Means and K-Median. DBSCAN’s sensitivity to the eps parameter means the noise percentage is not an absolute property of the data but a reflection of the density threshold chosen, a legitimate choice made explicit in the k-NN distance plot.

DBSCAN identifies 3 noise wines out of 1,359 (0.22%) and finds 1 valid cluster. Neither Silhouette nor Dunn metrics are computed because these metrics require at least 2 valid clusters. The near-total absence of noise reflects that the wine dataset, after Winsorization and Robust Scaling, is highly homogeneous in density: virtually all 1,359 wines have at least 12 neighbors within eps=2.5 in the 11-dimensional scaled space, forming a single dense region. The 3 noise wines represent observations with physicochemical profiles so unusual that they fall outside this dense region entirely. DBSCAN’s result, essentially 1 cluster plus 3 noise points, is consistent with Mean Shift’s finding of 1 density mode and with FCM’s 100% ambiguity, all of which converge on the same structural message: the wine data does not decompose into discrete density-separated groups but forms a continuous, largely uniform distribution in feature space.

J. Mean Shift Clustering

Mean Shift is a non-parametric algorithm that finds clusters by locating the modes (local maxima) of the data’s probability density function. The algorithm: (1) places a kernel (bandwidth h) centered on each data point, (2) computes the mean shift vector, which is the gradient pointing toward higher density, (3) shifts the point’s position to the local density maximum, (4) repeats until convergence. All points converging to the same mode form one cluster. Mean Shift requires no K, assumes no cluster shape, and always assigns every point to a cluster (no noise label).

J.1 Result

ms_res    <- meanShift(df_clust, bandwidth=rep(1.5, ncol(df_clust)))
ms_labels <- ms_res$assignment

cat("Bandwidth used:", 1.5, "(per dimension)\n")

## Bandwidth used: 1.5 (per dimension)

cat("Clusters found:", length(unique(ms_labels)), "\n\n")

## Clusters found: 1

cat("Cluster sizes:\n"); print(table(ms_labels))

## Cluster sizes:

## ms_labels
##    1 
## 1359

The bandwidth (1.5) determines the kernel window width. The number of clusters is entirely determined by the density structure of the data at that bandwidth.

J.2 Visualization

n_ms       <- length(unique(ms_labels))
ms_palette <- if(n_ms <= 8) brewer.pal(max(3,n_ms),"Set1")[1:n_ms] else rainbow(n_ms)

cbind(pca_df, Cluster=factor(ms_labels)) %>%
  ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
  geom_point(alpha=0.55, size=1.8) +
  {if(n_ms > 1) list(
    stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5),
    stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6)
  )} +
  scale_color_manual(values=ms_palette) +
  scale_fill_manual(values=ms_palette) +
  labs(title=paste0("Mean Shift (",n_ms," cluster(s))"),
       subtitle="Bandwidth = 1.5 per dimension",
       x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
  theme_minimal()

The PC shows cleary that there is only 1 cluster found using Mean Shift method within 1.5 Bandwith value per dimension.

Cluster Profile

kable(wine_feat %>%
  mutate(Cluster=factor(ms_labels), Quality=quality_label) %>%
  group_by(Cluster) %>%
  summarise(n=n(), avg_quality=round(mean(Quality),2),
            across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
  caption="Mean Shift: Mean Feature Values per Cluster") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE, font_size=11)

Mean Shift: Mean Feature Values per Cluster
Cluster	n	avg_quality	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	Quality
1	1359	5.62	8.311	0.529	0.272	2.523	0.088	15.893	46.826	0.997	3.31	0.659	10.432	5.623

J.3 Interpretation

Mean Shift’s result reflects the density structure of the scaled wine data at bandwidth 1.5. The number of clusters found provides an independent cross-check on the K selection in section F, which is if Mean Shift finds a similar number of density peaks as the optimal K identified by Elbow, Silhouette, and Gap Statistics, it strengthens confidence that those peaks are genuine features of the data rather than algorithmic artifacts. The bandwidth parameter plays the same conceptual role as eps in DBSCAN: a smaller bandwidth reveals finer-grained density peaks; a larger bandwidth merges them. Unlike DBSCAN, every point is assigned to a cluster in Mean Shift, so the method tends to be more inclusive but cannot distinguish atypical wines the way DBSCAN’s noise label does.

Mean Shift finding only 1 cluster is a substantive finding that is consistent with DBSCAN’s near-single-cluster result (1 cluster + 3 noise points) and FCM’s 100% ambiguity. At bandwidth 1.5, the gradient ascent from every starting point in the 11-dimensional scaled wine space converges to the same density mode, meaning the data’s probability density function has a single dominant peak rather than two well-separated modes. Neither Silhouette nor Dunn metrics can be computed for a single-cluster result, and ARI is also unavailable. The mean feature profile of the single cluster (avg_quality = 5.623) simply reflects the dataset average. This result reinforces the picture that the two chemical archetypes identified by K-Means and K-Median, while real partitions, represent a gradual continuum rather than two discrete density peaks. The transition between higher and lower quality wine chemistry is smooth, which is consistent with the near-zero ARI values across all methods: human quality ratings do not map cleanly onto any natural density-separated grouping in this physicochemical feature space.

K. Fuzzy C-Means Clustering

Fuzzy C-Means (FCM) is a soft clustering method. Rather than forcing each point into exactly one cluster (hard assignment), FCM assigns every point a membership degree for every cluster, showing how strongly it belongs to each one.

The algorithm uses a parameter called the fuzzifier to control how flexible the cluster boundaries are. Higher values make the clusters more overlapping, while lower values make them more distinct, similar to hard clustering like K-Means.

In practice, FCM works iteratively by updating the membership values and cluster centers until the results become stable.

K.1 Result

fcm_res    <- cmeans(df_clust, centers=K_opt, m=2, iter.max=100)
fcm_labels <- fcm_res$cluster

kable(data.frame(Cluster=names(table(fcm_labels)),
                 Size=as.integer(table(fcm_labels))),
      caption="Fuzzy C-Means: Hard Assignment Sizes (argmax membership)") %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Fuzzy C-Means: Hard Assignment Sizes (argmax membership)
Cluster	Size
1	727
2	632

cat("\nMembership matrix:\n")

## 
## Membership matrix:

print(round(head(fcm_res$membership, 10), 3))

##           1     2
##  [1,] 0.554 0.446
##  [2,] 0.497 0.503
##  [3,] 0.515 0.485
##  [4,] 0.452 0.548
##  [5,] 0.556 0.444
##  [6,] 0.551 0.449
##  [7,] 0.556 0.444
##  [8,] 0.567 0.433
##  [9,] 0.480 0.520
## [10,] 0.540 0.460

membership_df           <- as.data.frame(fcm_res$membership)
colnames(membership_df) <- paste0("C", 1:K_opt)
membership_df$max_mem   <- apply(membership_df[,1:K_opt], 1, max)
n_ambiguous             <- sum(membership_df$max_mem < 0.70)
cat(sprintf("\nAmbiguous samples (max membership < 0.70): %d (%.1f%%)\n",
            n_ambiguous, n_ambiguous/nrow(membership_df)*100))

## 
## Ambiguous samples (max membership < 0.70): 1359 (100.0%)

The membership matrix gives each wine’s probability of belonging to each cluster. Hard labels are obtained by taking the cluster with the highest membership. Wines with maximum membership below 0.70 are genuinely ambiguous, their chemistry places them in the transition zone between both clusters

The membership matrix shows that each data point has relatively similar membership values across the two clusters. Most values are close to 0.5, indicating that the data points are not strongly associated with a single cluster.

This is further confirmed by the ambiguity analysis, where 100% of the samples have a maximum membership value below 0.70. This means that all data points are considered ambiguous, with no clear cluster dominance.

In terms of cluster distribution, the sizes are relatively balanced, with 727 data points in Cluster 1 and 632 data points in Cluster 2. However, despite this balance, the lack of strong membership values suggests that the separation between clusters is weak.

These results indicate that:

The data does not form clearly distinct clusters
There is a high degree of overlap between clusters
The clustering structure is not well-defined

K.2 Visualization

cbind(pca_df, Cluster=factor(fcm_labels)) %>%
  ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
  geom_point(alpha=0.55, size=1.8) +
  stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5) +
  stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6) +
  scale_color_manual(values=c("#E53935","#1E88E5")) +
  scale_fill_manual(values=c("#E53935","#1E88E5")) +
  labs(title="Fuzzy C-Means (K=2)",
       subtitle="Hard labels (argmax membership)",
       x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
  theme_minimal()

ggplot(membership_df, aes(x=max_mem)) +
  geom_histogram(bins=40, fill="#7B1FA2", color="white", alpha=0.80) +
  geom_vline(xintercept=0.70, linetype="dashed", color="gray30") +
  annotate("text", x=0.68, y=Inf, label="0.70 threshold",
           hjust=1.05, vjust=2, size=3.5, color="gray30") +
  labs(title="Distribution of Maximum Membership Degree",
       subtitle="Wines below 0.70 are considered ambiguous",
       x="Maximum Membership Degree", y="Count") +
  theme_minimal()

cbind(pca_df, max_mem=membership_df$max_mem) %>%
  ggplot(aes(x=PC1, y=PC2, color=max_mem)) +
  geom_point(alpha=0.65, size=1.5) +
  scale_color_gradient2(low="#E53935", mid="#FDD835", high="#43A047",
                        midpoint=0.70, name="Max\nMembership") +
  labs(title="Membership Certainty in PCA Space",
       subtitle="Red = ambiguous (boundary region), Green = high certainty",
       x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
  theme_minimal()

Cluster Profile

kable(wine_feat %>%
  mutate(Cluster=factor(fcm_labels), Quality=quality_label) %>%
  group_by(Cluster) %>%
  summarise(n=n(), avg_quality=round(mean(Quality),2),
            across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
  caption="Fuzzy C-Means: Mean Feature Values per Cluster") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE, font_size=11)

Fuzzy C-Means: Mean Feature Values per Cluster
Cluster	n	avg_quality	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	Quality
1	727	5.59	7.225	0.567	0.166	2.150	0.076	15.725	41.661	0.996	3.393	0.613	10.581	5.593
2	632	5.66	9.559	0.486	0.395	2.953	0.102	16.087	52.767	0.998	3.215	0.711	10.262	5.658

K.3 Interpretation

FCM reveals a dimension of wine quality structure that hard methods cannot quantify: membership certainty. While the hard-assignment cluster profiles mirror those of K-Means (same chemical contrasts), the membership certainty map adds a crucial layer. Wines with high certainty have an unambiguous chemical identity, they are clearly higher or lower quality by their physicochemistry. Wines with low certainty (ambiguous wines) are chemically intermediate, with feature values that do not clearly favor either cluster. The membership certainty plot shows that ambiguous wines concentrate exactly at the PC1 center, the overlap zone between the two cluster ellipses, which is spatially consistent with the bimodal structure seen throughout the analysis. This is not a modeling weakness but a genuine data finding: wine quality exists on a continuum, and FCM quantifies how much of the dataset occupies that grey zone.

FCM produces two clusters with a Silhouette score of 0.1751, a Dunn Index of 0.0412, and an ARI of 0.0001. The Silhouette (0.1751) and Dunn (0.0412) scores sit between K-Means and K-Median, while the ARI of 0.0001 is essentially zero, the smallest of all methods, which indicates near-complete independence between the FCM partition and the human quality labels. The finding that 100% of wines are ambiguous (max membership < 0.70) is not a failure of the model but a direct description of the data: in the 11-dimensional robust-scaled wine space, no observation sits clearly and exclusively close to one cluster center. All wines have partial chemical affinity to both quality archetypes simultaneously, confirming that wine quality in this dataset exists as a continuum. This is consistent across all 5 methods, DBSCAN found 3 noise points with 1 cluster, Mean Shift found 1 density mode, and all partition methods returned near-zero ARI, all pointing to the same conclusion: the physicochemical differences between higher and lower quality wines are real but gradual, not discrete.

L. Evaluation and Comparison

L.1 Metrics Table

dist_mat <- dist(df_clust)

eval_cluster <- function(labels, method_name) {
  valid_idx <- labels != 0; n_noise <- sum(!valid_idx)
  labs_v    <- labels[valid_idx]
  if (sum(valid_idx) < 10 || length(unique(labs_v)) < 2)
    return(data.frame(Method=method_name, N_Clusters=NA, N_Noise=n_noise,
                      Silhouette=NA, Dunn=NA, ARI=NA))
  dist_v <- as.dist(as.matrix(dist_mat)[valid_idx, valid_idx])
  sil    <- mean(silhouette(labs_v, dist_v)[, 3])
  cs     <- cluster.stats(dist_v, labs_v)
  ari    <- adjustedRandIndex(labs_v, quality_label[valid_idx])
  data.frame(Method=method_name, N_Clusters=length(unique(labs_v)),
             N_Noise=n_noise, Silhouette=round(sil,4),
             Dunn=round(cs$dunn,4), ARI=round(ari,4))
}

eval_results <- bind_rows(
  eval_cluster(km_labels,   "K-Means"),
  eval_cluster(kmed_labels, "K-Median"),
  eval_cluster(db_labels,   "DBSCAN"),
  eval_cluster(ms_labels,   "Mean Shift"),
  eval_cluster(fcm_labels,  "Fuzzy C-Means")
)

best_sil     <- which.max(ifelse(is.na(eval_results$Silhouette), -Inf, eval_results$Silhouette))
best_dunn    <- which.max(ifelse(is.na(eval_results$Dunn), -Inf, eval_results$Dunn))
best_ari     <- which.max(ifelse(is.na(eval_results$ARI), -Inf, eval_results$ARI))

kable(eval_results, caption="Evaluation Metrics") %>%
  kable_styling(bootstrap_options=c("striped","hover","bordered"), full_width=FALSE) %>%
  column_spec(4,
    color=ifelse(seq_len(nrow(eval_results))==best_sil,"white","black"),
    background=ifelse(seq_len(nrow(eval_results))==best_sil,"#1D9E75","transparent")) %>%
  column_spec(5,
    color=ifelse(seq_len(nrow(eval_results))==best_dunn,"white","black"),
    background=ifelse(seq_len(nrow(eval_results))==best_sil,"#1D9E75","transparent")) %>%
  column_spec(6,
    color=ifelse(seq_len(nrow(eval_results))==best_ari,"white","black"),
    background=ifelse(seq_len(nrow(eval_results))==best_ari,"#1D9E75","transparent"))

Evaluation Metrics
Method	N_Clusters	N_Noise	Silhouette	Dunn	ARI
K-Means	2	0	0.1829	0.0771	-0.0005
K-Median	2	0	0.1672	0.0411	0.0339
DBSCAN	NA	3	NA	NA	NA
Mean Shift	NA	0	NA	NA	NA
Fuzzy C-Means	2	0	0.1751	0.0412	0.0001

Interpretation: > Silhouette (−1 to 1, higher better): measures how similar each point is to its own cluster vs. the nearest other cluster. > Dunn Index (>0, higher better): ratio of minimum inter-cluster distance to maximum intra-cluster diameter. > ARI (−1 to 1, higher better): agreement with ground truth quality labels.

Based on the evaluation metrics, K-Means shows the best overall performance with the highest Silhouette (0.1829) and Dunn Index (0.0771), indicating relatively better cluster separation and compactness. Fuzzy C-Means and K-Median perform slightly worse, with lower values in both metrics. However, all scores remain low, suggesting weak clustering structure and significant overlap. The ARI values for all methods are close to zero, indicating very poor agreement with true labels. Meanwhile, DBSCAN and Mean Shift do not produce valid metrics due to insufficient clusters.

L.2 Silhouette Plot per Method

par(mfrow=c(2,3), mar=c(8,8,6,4), oma=c(0,0,8,0))

plot(silhouette(km_labels, dist_mat),
     col=c("#E53935","#1E88E5"), border=NA, main="K-Means")
plot(silhouette(kmed_labels, dist_mat),
     col=c("#E53935","#1E88E5"), border=NA, main="K-Median")

db_mask <- db_labels != 0
if (sum(db_mask) > 10 && length(unique(db_labels[db_mask])) > 1) {
  plot(silhouette(db_labels[db_mask],
                  as.dist(as.matrix(dist_mat)[db_mask,db_mask])),
       col=c("#E53935","#1E88E5"), border=NA, main="DBSCAN (No Noise)")
} else {
  plot.new(); title("DBSCAN: insufficient clusters")
}

n_ms <- length(unique(ms_labels))
if (n_ms > 1 && n_ms < nrow(df_clust)) {
  plot(silhouette(ms_labels, dist_mat),
       col=rainbow(n_ms), border=NA, main="Mean Shift")
} else {
  plot.new(); text(0.5,0.5,"Mean Shift:\nOnly 1 cluster",cex=1.2)
  title("Mean Shift")
}

plot(silhouette(fcm_labels, dist_mat),
     col=c("#E53935","#1E88E5"), border=NA, main="Fuzzy C-Means")
par(mfrow=c(1,1))

Interpretation: Positive bars = point fits well in its cluster; negative bars = point may be closer to a neighboring cluster. The overall bar distribution for each method indicates cluster cohesion and balance.

The silhouette plots show that most data points have values close to zero, meaning they lie near cluster boundaries. This indicates that the clusters are not well-separated. Although K-Means appears slightly better with more positive silhouette values, the difference is not significant. Fuzzy C-Means and K-Median show similar patterns, confirming that the clustering structure is weak and overlapping.

L.3 Metric Bar Chart

eval_results %>%
  select(Method, Silhouette, Dunn, ARI) %>%
  pivot_longer(c(Silhouette,Dunn,ARI), names_to="Metric", values_to="Value") %>%
  ggplot(aes(x=Method, y=Value, fill=Method)) +
  geom_col(alpha=0.85, width=0.6) +
  facet_wrap(~Metric, scales="free_y") +
  scale_fill_brewer(palette="Set2") +
  labs(title="Evaluation Metric Comparison",
       x=NULL, y="Metric Value") +
  theme_minimal(base_size=10) +
  theme(axis.text.x=element_text(angle=30, hjust=1), legend.position="none")

Interpretation: The bar chart allows simultaneous comparison across all three metrics and all five methods. The method with the highest Silhouette and Dunn scores achieves the best internal cluster quality; the method with the highest ARI aligns most closely with human quality ratings.

From the bar chart, K-Means consistently has the highest values across Silhouette and Dunn Index, making it the relatively best method among the three. However, the differences between methods are small, and all values are low. This reinforces the conclusion that none of the methods produce strong clustering results, and the dataset likely does not have a clear natural grouping.

M. Conclusion

data.frame(
  Method    =c("K-Means","K-Median","DBSCAN","Mean Shift","Fuzzy C-Means"),
  Type      =c("Partitional","Partitional","Density-based","Density-based","Soft/Fuzzy"),
  K_Result  =c(2,2,"Auto (eps)","Auto (bw)",2),
  Distance  =c("Euclidean","Manhattan","Epsilon-radius","Kernel (Gaussian)","Euclidean"),
  Strengths =c(
    "Fast, scalable, interpretable centroids",
    "Robust via median & Manhattan distance",
    "Detects noise, no K required, arbitrary shapes",
    "No K required, finds density modes automatically",
    "Soft membership, quantifies boundary ambiguity"),
  Weaknesses=c(
    "Sensitive to initialization, requires K",
    "Slower than K-Means, still requires K",
    "Sensitive to eps & minPts, struggles in high dimensions",
    "Very slow, highly sensitive to bandwidth",
    "Requires K and fuzzifier m, harder to interpret")
) %>%
  kable(caption="Summary Comparison — Five Clustering Methods") %>%
  kable_styling(bootstrap_options=c("striped","hover","bordered"), full_width=TRUE) %>%
  column_spec(1, bold=TRUE)

Summary Comparison — Five Clustering Methods
Method	Type	K_Result	Distance	Strengths	Weaknesses
K-Means	Partitional	2	Euclidean	Fast, scalable, interpretable centroids	Sensitive to initialization, requires K
K-Median	Partitional	2	Manhattan	Robust via median & Manhattan distance	Slower than K-Means, still requires K
DBSCAN	Density-based	Auto (eps)	Epsilon-radius	Detects noise, no K required, arbitrary shapes	Sensitive to eps & minPts, struggles in high dimensions
Mean Shift	Density-based	Auto (bw)	Kernel (Gaussian)	No K required, finds density modes automatically	Very slow, highly sensitive to bandwidth
Fuzzy C-Means	Soft/Fuzzy	2	Euclidean	Soft membership, quantifies boundary ambiguity	Requires K and fuzzifier m, harder to interpret

1. Optimal K = 2 is confirmed by 3 methods, which are Elbow (Kneedle), Silhouette Analysis, and Gap Statistics.

2. K-Means and K-Median deliver the best internal cluster quality. Both identify the same 2 physicochemical archetypes, high-alcohol/high-sulphates/low-volatile-acidity versus the inverse, with K-Median providing stronger robustness guarantees against outliers.

3. DBSCAN uniquely identifies noise wines with atypical profiles, a capability unavailable from any partition method. Its performance is eps-dependent.

4. Mean Shift provides a fully independent, density-driven validation of the K=2 finding without requiring K to be pre-specified.

5. Fuzzy C-Means reveals that a meaningful proportion of wines are chemically ambiguous.

6. Moderate ARI values across all methods are expected: wine quality reflects human sensory judgment that maps only partially onto 11 chemical parameters, especially given the dominance of quality scores 5 and 6 in the dataset.

7. Key physicochemical drivers identified consistently across all methods: alcohol, volatile.acidity, and sulphates, confirmed by cluster centroids (G, H), PCA biplot (L.2), radar chart (L.4), and per-method boxplots (G, H, I, J, K).

Implementation of Clustering Methods on Wine Quality Dataset: K-Means, K-Median, DBSCAN, Mean Shift, and Fuzzyb C-Means Clustering

Illona Anindya - Naura Kanaya Putri Masruri

2026-04-01

A. Introduction

B. Load Library

C. Load Data and Dataset Description

D. Exploratory Data Analysis (EDA)

D.1 Descriptive Statistics

D.2 Feature Distribution

D.3 Outlier Identification

D.4 Correlation Heatmap

D.5 Quality Label Distribution (Ground Truth)

E. Preprocessing

E.1 Before vs. After: Distribution Comparison

E.2 Before vs. After: Boxplot Comparison

F. Determining the Optimal Number of Clusters (K)

F.1 Elbow Method

F.2 Silhouette Analysis

F.3 Gap Statistics (Tie-Breaker)

G. K-Means Clustering

G.1 Result

G.2 Visualization

G.3 Interpretation

H. K-Median Clustering

H.1 Result

H.2 Visualization

H.3 Interpretation

I. DBSCAN Clustering

I.1 Result

I.2 Visualization

I.3 Interpretation

J. Mean Shift Clustering

J.1 Result

J.2 Visualization

J.3 Interpretation

K. Fuzzy C-Means Clustering

K.1 Result

K.2 Visualization

K.3 Interpretation

L. Evaluation and Comparison

L.1 Metrics Table

L.2 Silhouette Plot per Method

L.3 Metric Bar Chart

M. Conclusion