Wine quality assessment is traditionally performed by human experts through sensory evaluation, which is subjective and time-consuming. This study applies 5 unsupervised clustering methods, which are K-Means, K-Median, DBSCAN, Mean Shift, and Fuzzy C-Means, to the UCI Red Wine Quality dataset to discover natural groupings based on physicochemical properties. The goal is to compare method performance and interpret cluster characteristics in the context of wine quality.
The dataset contains 1,599 red wine samples described by 11 physicochemical features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and a quality score (3–8) used as external ground truth for evaluation.
This analysis uses 5 methods of Clustering, which are:
| Method | Type | Need k | Approach |
|---|---|---|---|
| K-Means | Partitional | Yes | Mean |
| K-Median | Partitional | Yes | Median |
| DBSCAN | Density-based | No | Density |
| Mean Shift | Density-based | No | Mode |
| Fuzzy C-Means | Soft/Fuzzy | Yes | Membership |
library(tidyverse)
library(flexclust)
library(dbscan)
library(meanShiftR)
library(e1071)
library(cluster)
library(fpc)
library(mclust)
library(factoextra)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(RColorBrewer)
library(fmsb)
library(knitr)
library(kableExtra)
This analysis uses a comprehensive set of R packages covering
clustering algorithms (flexclust, dbscan,
meanShiftR, e1071), evaluation metrics
(cluster, fpc, mclust), and
visualization (ggplot2, corrplot,
fmsb).
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_raw <- read.csv(url, sep = ";")
quality_label <- wine_raw$quality
cat("Dataset dimensions:", dim(wine_raw), "\n")
## Dataset dimensions: 1599 12
cat("Number of rows :", nrow(wine_raw), "\n")
## Number of rows : 1599
cat("Number of columns:", ncol(wine_raw), "\n\n")
## Number of columns: 12
str(wine_raw)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The UCI Red Wine Quality dataset contains 1,599
observations and 12 variables. All 11
physicochemical features are continuous numerical variables. The
quality column is an integer score from 3 to 8 assigned by
human tasters, used as external ground truth for evaluation. There are
no missing values in the dataset.
EDA is conducted before preprocessing to understand the raw data structure, identify issues, and justify each preprocessing decision.
kable(t(round(sapply(wine_raw %>% select(-quality), function(x)
c(Min=min(x), Q1=quantile(x,0.25), Median=median(x),
Mean=round(mean(x),3), Q3=quantile(x,0.75), Max=max(x),
SD=round(sd(x),3), Skewness=round(e1071::skewness(x),3))), 3)),
caption = "Descriptive Statistics of The Dataset") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE, font_size = 11)
| Min | Q1.25% | Median | Mean | Q3.75% | Max | SD | Skewness | |
|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 4.600 | 7.100 | 7.900 | 8.320 | 9.200 | 15.900 | 1.741 | 0.981 |
| volatile.acidity | 0.120 | 0.390 | 0.520 | 0.528 | 0.640 | 1.580 | 0.179 | 0.670 |
| citric.acid | 0.000 | 0.090 | 0.260 | 0.271 | 0.420 | 1.000 | 0.195 | 0.318 |
| residual.sugar | 0.900 | 1.900 | 2.200 | 2.539 | 2.600 | 15.500 | 1.410 | 4.532 |
| chlorides | 0.012 | 0.070 | 0.079 | 0.087 | 0.090 | 0.611 | 0.047 | 5.670 |
| free.sulfur.dioxide | 1.000 | 7.000 | 14.000 | 15.875 | 21.000 | 72.000 | 10.460 | 1.248 |
| total.sulfur.dioxide | 6.000 | 22.000 | 38.000 | 46.468 | 62.000 | 289.000 | 32.895 | 1.513 |
| density | 0.990 | 0.996 | 0.997 | 0.997 | 0.998 | 1.004 | 0.002 | 0.071 |
| pH | 2.740 | 3.210 | 3.310 | 3.311 | 3.400 | 4.010 | 0.154 | 0.193 |
| sulphates | 0.330 | 0.550 | 0.620 | 0.658 | 0.730 | 2.000 | 0.170 | 2.424 |
| alcohol | 8.400 | 9.500 | 10.200 | 10.423 | 11.100 | 14.900 | 1.066 | 0.859 |
Interpretation: The result shows that most features has moderate variability with some notable skewness patterns. Variables such as fixed acidity, volatile acidity, and alcohol have slightly right-skewed distributions, indicating the presence of higher-value observations, though still relatively well-centered around their means. In contrast, residual sugar and chlorides display strong positive skewness, suggesting the existence of significant outliers and a concentration of lower values. Similarly, sulfur dioxide variables (both free and total) are right-skewed with wide ranges, indicating substantial dispersion in preservation-related attributes. Density and pH appear to be the most stable features, with minimal spread and near-symmetric distributions. Generally, the dataset contains a mix of relatively stable chemical properties and highly skewed variables, which may require normalization or transformation prior to modeling.
wine_feat_raw <- wine_raw %>% select(-quality)
wine_long_raw <- wine_feat_raw %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
ggplot(wine_long_raw, aes(x = Value)) +
geom_histogram(bins = 30, fill = "#1E88E5", color = "white", alpha = 0.85) +
facet_wrap(~Feature, scales = "free", ncol = 4) +
labs(title = "Feature Distribution — Before Preprocessing",
x = NULL, y = "Frequency") +
theme_minimal(base_size = 10) +
theme(strip.text = element_text(face = "bold"))
Interpretation: Features such as
residual.sugar, chlorides,
free.sulfur.dioxide, and total.sulfur.dioxide
show strong right skewness with extreme values. This pattern justifies
the use of IQR-based outlier capping and Robust Scaling rather than
standard z-score normalization.
count_outliers <- function(x) {
Q1 <- quantile(x, 0.25); Q3 <- quantile(x, 0.75); IQR_v <- Q3 - Q1
sum(x < Q1 - 1.5*IQR_v | x > Q3 + 1.5*IQR_v, na.rm = TRUE)
}
outlier_tbl <- wine_feat_raw %>%
summarise(across(everything(), count_outliers)) %>%
pivot_longer(everything(), names_to = "Feature", values_to = "Count") %>%
mutate(Percentage = round(Count / nrow(wine_feat_raw) * 100, 2)) %>%
arrange(desc(Count))
kable(outlier_tbl, caption = "Outlier Count per Feature (IQR Method)") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
| Feature | Count | Percentage |
|---|---|---|
| residual.sugar | 155 | 9.69 |
| chlorides | 112 | 7.00 |
| sulphates | 59 | 3.69 |
| total.sulfur.dioxide | 55 | 3.44 |
| fixed.acidity | 49 | 3.06 |
| density | 45 | 2.81 |
| pH | 35 | 2.19 |
| free.sulfur.dioxide | 30 | 1.88 |
| volatile.acidity | 19 | 1.19 |
| alcohol | 13 | 0.81 |
| citric.acid | 1 | 0.06 |
ggplot(wine_long_raw, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot(outlier.size = 0.7, outlier.alpha = 0.5, alpha = 0.75) +
coord_flip() +
labs(title = "Boxplot of Raw Data", x = NULL, y = "Value") +
theme_minimal(base_size = 10) + theme(legend.position = "none")
Interpretation: By using IQR method, it shows that
several features have outlier percentages exceeding 5%, particularly
chlorides, sulphates, and
total.sulfur.dioxide. These extreme values would distort
distance computations in all 5 clustering algorithms.
cor_mat <- cor(wine_feat_raw)
corrplot(cor_mat, method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.65,
tl.cex = 0.8, tl.col = "black",
col = colorRampPalette(c("#D85A30","white","#1D9E75"))(200),
title = "Correlation Heatmap — Wine Features", mar = c(0,0,2,0))
Interpretation: Moderate-to-strong correlations are
present among several features (fixed.acidity,
citric.acid, fixed.acidity, pH,
alcohol, density). No perfect
multicollinearity exists, so all 11 features are maintained.
ggplot(data.frame(quality = quality_label), aes(x = factor(quality), fill = factor(quality))) +
geom_bar(alpha = 0.85) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Quality Label Distribution (Ground Truth)",
x = "Quality Score", y = "Count", fill = "Quality") +
theme_minimal()
Interpretation: Quality scores are heavily concentrated at 5 and 6 (>80% of samples). This imbalance is relevant when evaluating ARI later, since most wines share similar quality-level chemistry.
Preprocessing methods are applied based on 4 steps justified by the EDA above.
Preprocessing is also performed in 4 stages, each justified by the EDA above: (1) separating the quality label, (2) removing duplicate rows, (3) IQR Winsorization to cap extreme outliers while retaining all data points, and (4) Robust Scaling using the median and IQR to normalize features without amplifying outlier effects.
# Separate quality label
wine_feat <- wine_raw %>% select(-quality)
# Remove duplicates
cat("Duplicates found:", sum(duplicated(wine_feat)), "\n")
## Duplicates found: 240
dup_idx <- duplicated(wine_raw %>% select(-quality))
wine_feat <- wine_feat[!dup_idx, ]
quality_label <- quality_label[!dup_idx]
cat("Rows after deduplication:", nrow(wine_feat), "\n")
## Rows after deduplication: 1359
# IQR Winsorization (outlier capping)
cap_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
x <- ifelse(x < Q1 - 1.5 * IQR_val, Q1 - 1.5 * IQR_val, x)
x <- ifelse(x > Q3 + 1.5 * IQR_val, Q3 + 1.5 * IQR_val, x)
return(x)
}
wine_capped <- as.data.frame(lapply(wine_feat, cap_outliers))
# Robust Scaling (median & IQR)
robust_scale <- function(x) {
med <- median(x, na.rm = TRUE)
iqr <- IQR(x, na.rm = TRUE)
return((x - med) / iqr)
}
wine_scaled <- as.data.frame(lapply(wine_capped, robust_scale))
cat("\nSummary after Robust Scaling:\n")
##
## Summary after Robust Scaling:
summary(wine_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-1.5714 Min. :-1.60000 Min. :-0.76471 Min. :-1.8571
## 1st Qu.:-0.3810 1st Qu.:-0.52000 1st Qu.:-0.50000 1st Qu.:-0.4286
## Median : 0.0000 Median : 0.00000 Median : 0.00000 Median : 0.0000
## Mean : 0.1829 Mean : 0.03136 Mean : 0.03614 Mean : 0.1773
## 3rd Qu.: 0.6190 3rd Qu.: 0.48000 3rd Qu.: 0.50000 3rd Qu.: 0.5714
## Max. : 2.1190 Max. : 1.98000 Max. : 2.00000 Max. : 2.0714
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :-1.9286 Min. :-0.9286 Min. :-0.7805 Min. :-1.995496
## 1st Qu.:-0.4286 1st Qu.:-0.5000 1st Qu.:-0.3902 1st Qu.:-0.495495
## Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.000000
## Mean : 0.1106 Mean : 0.1225 Mean : 0.1974 Mean : 0.002963
## 3rd Qu.: 0.5714 3rd Qu.: 0.5000 3rd Qu.: 0.6098 3rd Qu.: 0.504505
## Max. : 2.0714 Max. : 2.0000 Max. : 2.1098 Max. : 2.004504
## pH sulphates alcohol
## Min. :-2.026316 Min. :-1.6111 Min. :-1.1250
## 1st Qu.:-0.526316 1st Qu.:-0.3889 1st Qu.:-0.4375
## Median : 0.000000 Median : 0.0000 Median : 0.0000
## Mean :-0.005848 Mean : 0.1665 Mean : 0.1430
## 3rd Qu.: 0.473684 3rd Qu.: 0.6111 3rd Qu.: 0.5625
## Max. : 1.973684 Max. : 2.1111 Max. : 2.0625
A total of 240 duplicate rows were detected and removed, leaving 1,359 unique observations. IQR Winsorization capped extreme values at the 1.5×IQR fence without dropping any rows. After Robust Scaling, all features are centered at 0 with a unit IQR, making them comparable for distance computation regardless of their original units.
wine_long_before <- wine_feat %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
wine_long_after <- wine_scaled %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
p_before <- ggplot(wine_long_before, aes(x = Value)) +
geom_histogram(bins = 30, fill = "#E53935", color = "white", alpha = 0.8) +
facet_wrap(~Feature, scales = "free", ncol = 4) +
labs(title = "Before Preprocessing", x = NULL, y = "Frequency") +
theme_minimal(base_size = 9)
p_after <- ggplot(wine_long_after, aes(x = Value)) +
geom_histogram(bins = 30, fill = "#43A047", color = "white", alpha = 0.8) +
facet_wrap(~Feature, scales = "free", ncol = 4) +
labs(title = "After Robust Scaling", x = NULL, y = "Frequency") +
theme_minimal(base_size = 9)
grid.arrange(p_before, p_after, ncol = 1,
top = "Feature Distribution: Before vs. After Preprocessing")
p_box_before <- ggplot(wine_long_before, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot(outlier.size = 0.7, alpha = 0.7) +
coord_flip() +
labs(title = "Boxplot Before Preprocessing", x = NULL, y = "Value") +
theme_minimal(base_size = 10) + theme(legend.position = "none")
p_box_after <- ggplot(wine_long_after, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot(outlier.size = 0.7, alpha = 0.7) +
coord_flip() +
labs(title = "Boxplot After Robust Scaling", x = NULL, y = "Scaled Value") +
theme_minimal(base_size = 10) + theme(legend.position = "none")
grid.arrange(p_box_before, p_box_after, ncol = 2)
After preprocessing, all features have a comparable scale (centered at 0) with greatly reduced extreme values. The preprocessing pipeline has successfully equalized the feature space, ensuring no single feature dominates the distance calculations in clustering.
K is determined by 3 methods: Elbow, Silhouette, and Gap Statistics. Using multiple criteria prevents over-reliance on a single method and gibves a more defensible K decision.
df_clust <- as.matrix(wine_scaled)
fviz_nbclust(df_clust, kmeans, method = "wss", k.max = 10) +
labs(title = "Elbow Method",
subtitle = "Total Within-Cluster SS") +
theme_minimal()
Result: The WSS curve shows the steepest decline from K=1 to K=2. The Kneedle algorithm, which objectively finds the point with maximum perpendicular distance from the line connecting the first and last points, which confirms K=2 as the elbow.
sil_vals <- sapply(2:10, function(k) {
km <- kmeans(df_clust, centers = k, nstart = 25)
mean(silhouette(km$cluster, dist(df_clust))[, 3])
})
sil_table <- data.frame(K = 2:10, Avg_Silhouette = round(sil_vals, 4))
kable(sil_table, caption = "Average Silhouette Width per K") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| K | Avg_Silhouette |
|---|---|
| 2 | 0.1829 |
| 3 | 0.1495 |
| 4 | 0.1566 |
| 5 | 0.1514 |
| 6 | 0.1614 |
| 7 | 0.1529 |
| 8 | 0.1540 |
| 9 | 0.1396 |
| 10 | 0.1405 |
ggplot(sil_table, aes(x = K, y = Avg_Silhouette)) +
geom_line(color = "#1E88E5", linewidth = 1) +
geom_point(size = 3, color = "#E53935") +
geom_vline(xintercept = which.max(sil_vals) + 1,
linetype = "dashed", color = "gray40") +
scale_x_continuous(breaks = 2:10) +
labs(title = "Silhouette Analysis",
subtitle = "Highest average silhouette width = optimal K",
x = "Number of Clusters (K)", y = "Average Silhouette Width") +
theme_minimal()
Result: K=2 achieves the highest average silhouette width, confirming the Elbow result.
set.seed(123)
gap_stat <- clusGap(df_clust, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_df <- as.data.frame(gap_stat$Tab)
gap_df$K <- 1:nrow(gap_df)
optimal_gap_k <- maxSE(gap_df$gap[-1], gap_df$SE.sim[-1], method = "firstmax") + 1
cat("Gap Statistics suggests K =", optimal_gap_k, "\n")
## Gap Statistics suggests K = 10
plot(gap_stat, main = "Gap Statistics")
Result: Gap Statistics uses a reference null distribution to objectively select K. It shows a tie breaker at K=2.
Combined with Elbow and Silhouette, all 3 methods point to K=2 as the optimal number of clusters.
K_opt <- 2
cat("Final decision for optimal K:", K_opt, "\n")
## Final decision for optimal K: 2
K-Means partitions data into K clusters by minimizing the total
within-cluster sum of squared Euclidean distances. The steps: (1)
randomly initialize K centroids (nstart=25 runs 25
initializations and keeps the best), (2) assign each point to the
nearest centroid, (3) update each centroid to the arithmetic mean of its
assigned points, (4) repeat steps 2–3 until convergence. K-Means assumes
spherical, similarly-sized clusters and is sensitive to outliers because
the mean is pulled by extreme values.
km_res <- kmeans(df_clust, centers=K_opt, nstart=25, iter.max=100)
km_labels <- km_res$cluster
cat("Iterations until convergence:", km_res$iter, "\n")
## Iterations until convergence: 1
cat("Between-SS / Total-SS :", round(km_res$betweenss/km_res$totss*100,2), "%\n\n")
## Between-SS / Total-SS : 19.23 %
kable(data.frame(Cluster=1:K_opt, Size=km_res$size, WSS=round(km_res$withinss,3)),
caption="K-Means: Cluster Sizes and Within-Cluster SS") %>%
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
| Cluster | Size | WSS |
|---|---|---|
| 1 | 780 | 3574.235 |
| 2 | 579 | 3397.451 |
cat("\nCluster centroids (robust-scaled):\n")
##
## Cluster centroids (robust-scaled):
print(round(km_res$centers, 3))
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 -0.276 0.136 -0.238 -0.137 -0.222
## 2 0.801 -0.110 0.406 0.601 0.559
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 0.114 0.080 -0.443 0.373 -0.021 0.241
## 2 0.133 0.356 0.604 -0.516 0.419 0.011
The Between-SS / Total-SS ratio indicates how well the clusters
capture the overall variance. The 2 centroids show the largest contrast
along alcohol and volatile.acidity.
pca_res <- prcomp(df_clust, scale.=FALSE)
pca_df <- as.data.frame(pca_res$x[,1:2])
colnames(pca_df) <- c("PC1","PC2")
var_pc1 <- round(summary(pca_res)$importance[2,1]*100, 1)
var_pc2 <- round(summary(pca_res)$importance[2,2]*100, 1)
cbind(pca_df, Cluster=factor(km_labels)) %>%
ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
geom_point(alpha=0.55, size=1.8) +
stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5) +
stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6) +
scale_color_manual(values=c("#E53935","#1E88E5")) +
scale_fill_manual(values=c("#E53935","#1E88E5")) +
labs(title="K-Means Clustering (K=2)",
subtitle="90% confidence ellipses",
x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
theme_minimal()
The 2 clusters separate clearly along PC1. The 90% confidence ellipses show minimal overlap, indicating good compactness and separation.
Cluster Profile
kable(wine_feat %>%
mutate(Cluster=factor(km_labels), Quality=quality_label) %>%
group_by(Cluster) %>%
summarise(n=n(), avg_quality=round(mean(Quality),2),
across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
caption="K-Means: Mean Feature Values per Cluster (Original Scale)") %>%
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=TRUE, font_size=11)
| Cluster | n | avg_quality | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 780 | 5.62 | 7.321 | 0.557 | 0.179 | 2.173 | 0.075 | 15.736 | 41.538 | 0.996 | 3.383 | 0.617 | 10.59 | 5.617 |
| 2 | 579 | 5.63 | 9.644 | 0.493 | 0.398 | 2.996 | 0.105 | 16.105 | 53.950 | 0.998 | 3.211 | 0.715 | 10.22 | 5.632 |
wine_feat %>%
mutate(Cluster=factor(km_labels)) %>%
pivot_longer(-Cluster, names_to="Feature", values_to="Value") %>%
ggplot(aes(x=Cluster, y=Value, fill=Cluster)) +
geom_boxplot(outlier.size=0.5, alpha=0.75) +
facet_wrap(~Feature, scales="free_y", ncol=4) +
scale_fill_manual(values=c("#E53935","#1E88E5")) +
labs(title="K-Means: Feature Distribution per Cluster (Original Scale)",
x="Cluster", y="Value") +
theme_minimal(base_size=9) + theme(legend.position="none")
K-Means produces 2 chemically distinct clusters. The cluster with
higher alcohol and sulphates combined with
lower volatile.acidity aligns with higher average quality
scores. This is consistent with principles: high alcohol reflects
complete fermentation, sulphates preserve wine integrity, and low
volatile acidity indicates fewer acetic acid defects. The other cluster
shows the inverse profile and lower quality scores. The quality
distribution plot confirms this separation, although overlap at quality
5–6 is expected because many wines in those middle scores are chemically
intermediate between the 2 archetypes.
K-Means produces two clusters with a Silhouette score of 0.1829, a Dunn Index of 0.0771, and an ARI of −0.0005. Cluster 1 (n=780, avg_quality=5.62) is characterized by lower volatile.acidity (0.557), lower fixed.acidity (7.321), and moderate alcohol (10.59), while Cluster 2 (n=579, avg_quality=5.63) shows higher fixed.acidity (9.644), slightly lower volatile.acidity (0.398), and lower alcohol (10.22). The Silhouette and Dunn scores are positive but low, indicating that while the two clusters are distinct, they are not sharply separated, which means that there is substantial overlap in physicochemical space. The ARI of −0.0005 is essentially zero, indicating that the K-Means partition does not align with the human-assigned quality labels. This is expected: wines rated 5 and 6 dominate the dataset and span both cluster profiles, so the unsupervised chemical groupings do not mirror the subjective quality scores.
K-Median is a robust variant of K-Means with two differences: (1) the cluster center is the median (not the mean), and (2)*Manhattan distance (L1 norm) replaces Euclidean. The median is not affected by extreme values; the L₁ norm penalizes differences linearly rather than quadratically. This makes K-Median inherently more resistant to outliers. The four-step iterative structure (initialize - assign - update - repeat) is otherwise identical to K-Means.
kmed_res <- kcca(df_clust, k=K_opt,
family=kccaFamily("kmedians"),
control=list(iter.max=100))
kmed_labels <- clusters(kmed_res)
kable(data.frame(Cluster=names(table(kmed_labels)),
Size=as.integer(table(kmed_labels))),
caption="K-Median: Cluster Sizes") %>%
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
| Cluster | Size |
|---|---|
| 1 | 804 |
| 2 | 555 |
# Compare with K-Means
tab <- table(km_labels, kmed_labels)
agree <- sum(diag(tab)) / sum(tab)
final_agree <- max(agree, 1 - agree)
cat("\nLabel agreement with K-Means:",
round(final_agree * 100, 1), "%\n")
##
## Label agreement with K-Means: 82 %
K-Median produces cluster sizes comparable to K-Means. The agreement rate shows how similarly the two methods partition the data despite different distance metrics and centroid definitions.
cbind(pca_df, Cluster=factor(kmed_labels)) %>%
ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
geom_point(alpha=0.55, size=1.8) +
stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5) +
stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6) +
scale_color_manual(values=c("#E53935","#1E88E5")) +
scale_fill_manual(values=c("#E53935","#1E88E5")) +
labs(title="K-Median Clustering (K=2)",
subtitle="90% confidence ellipses",
x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
theme_minimal()
The PCA projection of K-Median is visually very close to K-Means, confirming both methods recover the same dominant bimodal structure. Any minor boundary differences reflect the use of Manhattan distance and median centroids.
Cluster Profile
kable(wine_feat %>%
mutate(Cluster=factor(kmed_labels), Quality=quality_label) %>%
group_by(Cluster) %>%
summarise(n=n(), avg_quality=round(mean(Quality),2),
across(where(is.numeric), ~round(median(.),3)), .groups="drop"),
caption="K-Median: Median Feature Values per Cluster (Original Scale)") %>%
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=TRUE, font_size=11)
| Cluster | n | avg_quality | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 804 | 5.45 | 7.3 | 0.6 | 0.13 | 2.1 | 0.078 | 15 | 41 | 0.996 | 3.37 | 0.58 | 10.0 | 5 |
| 2 | 555 | 5.87 | 9.6 | 0.4 | 0.45 | 2.4 | 0.082 | 11 | 32 | 0.998 | 3.22 | 0.70 | 10.5 | 6 |
wine_feat %>%
mutate(Cluster=factor(kmed_labels)) %>%
pivot_longer(-Cluster, names_to="Feature", values_to="Value") %>%
ggplot(aes(x=Cluster, y=Value, fill=Cluster)) +
geom_boxplot(outlier.size=0.5, alpha=0.75) +
facet_wrap(~Feature, scales="free_y", ncol=4) +
scale_fill_manual(values=c("#E53935","#1E88E5")) +
labs(title="K-Median: Feature Distribution per Cluster (Original Scale)",
x="Cluster", y="Value") +
theme_minimal(base_size=9) + theme(legend.position="none")
Note: K-Median cluster profiles are reported using medians (rather than means) consistent with the algorithm’s use of the median as its centroid measure.
K-Median recovers the same 2-cluster structure as K-Means but with a stronger robustness guarantee. Because the median minimizes the L₁ objective and Manhattan distance is less sensitive to extreme values in individual dimensions, K-Median’s cluster boundaries are less likely to be distorted by any residual extreme observations remaining after Winsorization. The cluster profiles confirm the same chemical contrasts, which is the high-alcohol/low-volatile-acidity group versus the inverse, but the median-based centroids are more stable representatives of the typical wine in each group. Where K-Means and K-Median disagree on a wine’s cluster assignment, those are likely the borderline cases closest to the decision boundary, which K-Median handles more conservatively.
K-Median produces two clusters with a Silhouette score of 0.1672, a Dunn Index of 0.0411, and an ARI of 0.0339. Both the Silhouette (0.1672 vs. 0.1829) and Dunn (0.0411 vs. 0.0771) scores are slightly lower than K-Means, suggesting that the Manhattan distance-based partition produces marginally less compact and separated clusters in this dataset. However, the ARI of 0.0339 is slightly higher than K-Means (-0.0005), indicating a marginally better agreement with the original quality labels, means that the median-based partition aligns slightly more closely with how human tasters differentiated the wines. The clustering pattern confirms the same two dominant physicochemical profiles as K-Means, and the modest difference in metrics shows that K-Median’s robustness advantage over outliers does not meaningfully alter the overall structure for this Winsorized dataset.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse areas. Each point is classified as: a core point (≥ MinPts neighbors within radius epsilon), a border point (within epsilon of a core point but fewer than MinPts neighbors), or **noise (neither). Clusters grow by recursively connecting density-connected core points. DBSCAN does not require K; noise is explicitly labeled rather than forced into a cluster.
The parameter rule is MinPts = D+1 (where D = number of features), as recommended in the lecture material. Epsilon is selected from the k-NN distance plot inflection point.
MINPTS <- ncol(df_clust) + 1
cat("MinPts =", MINPTS, "\n")
## MinPts = 12
dbscan::kNNdistplot(df_clust, k=MINPTS-1)
abline(h=2.5, col="red", lty=2, lwd=2)
legend("topleft", legend="eps = 2.5", col="red", lty=2, bty="n")
title("k-NN Distance Plot for Inflection Point = Optimal Epsilon")
The k-NN distance plot shows a clear inflection around eps = 2.5, where the curve transitions from gradual increase (dense cluster region) to steep increase (sparse noise region).
db_res <- dbscan::dbscan(df_clust, eps=2.5, minPts=MINPTS)
db_labels <- db_res$cluster
cat("DBSCAN result:\n"); print(db_res)
## DBSCAN result:
## DBSCAN clustering for 1359 objects.
## Parameters: eps = 2.5, minPts = 12
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 3 noise points.
##
## 0 1
## 3 1356
##
## Available fields: cluster, eps, minPts, metric, borderPoints
cat("\nCluster distribution (0 = noise):\n")
##
## Cluster distribution (0 = noise):
print(table(db_labels))
## db_labels
## 0 1
## 3 1356
cat(sprintf("\nNoise points: %d (%.1f%% of data)\n",
sum(db_labels==0), sum(db_labels==0)/length(db_labels)*100))
##
## Noise points: 3 (0.2% of data)
DBSCAN automatically determines the number of clusters from data density. Points labeled 0 (noise) represent wines with atypical physicochemical profiles that do not belong to any dense region , which is information that partition-based methods cannot provide.
n_db_cl <- length(unique(db_labels[db_labels != 0]))
db_colors <- c("0"="grey70",
setNames(c("#E53935","#1E88E5","#43A047")[seq_len(n_db_cl)],
as.character(seq_len(n_db_cl))))
cbind(pca_df, Cluster=factor(db_labels)) %>%
ggplot(aes(x=PC1, y=PC2, color=Cluster)) +
geom_point(alpha=0.6, size=1.5) +
scale_color_manual(values=db_colors) +
labs(title="DBSCAN Clusteringn",
subtitle="Grey points (Cluster 0) are noise",
x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
theme_minimal()
The noise points (grey) scatter in sparse regions outside the main density, confirming they are genuinely atypical wines rather than random misclassifications.
Cluster Profile
kable(wine_feat %>%
mutate(Cluster=factor(db_labels), Quality=quality_label) %>%
filter(Cluster != 0) %>%
group_by(Cluster) %>%
summarise(n=n(), avg_quality=round(mean(Quality),2),
across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
caption="DBSCAN: Mean Feature Values per Cluster (Noise Excluded)") %>%
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=TRUE, font_size=11)
| Cluster | n | avg_quality | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1356 | 5.63 | 8.313 | 0.529 | 0.272 | 2.525 | 0.088 | 15.848 | 46.736 | 0.997 | 3.31 | 0.658 | 10.431 | 5.625 |
if (sum(db_labels == 0) > 0) {
cat("Noise point mean feature values:\n")
print(round(colMeans(wine_feat[db_labels == 0, ]), 3))
cat(sprintf("\nNoise wines vs. overall mean for key features:\n"))
cat(sprintf(" alcohol: noise=%.2f, overall=%.2f\n",
mean(wine_feat$alcohol[db_labels==0]), mean(wine_feat$alcohol)))
cat(sprintf(" volatile.acidity: noise=%.2f, overall=%.2f\n",
mean(wine_feat$volatile.acidity[db_labels==0]),
mean(wine_feat$volatile.acidity)))
cat(sprintf(" chlorides: noise=%.4f, overall=%.4f\n",
mean(wine_feat$chlorides[db_labels==0]), mean(wine_feat$chlorides)))
}
## Noise point mean feature values:
## fixed.acidity volatile.acidity citric.acid
## 7.400 0.573 0.470
## residual.sugar chlorides free.sulfur.dioxide
## 1.900 0.235 36.333
## total.sulfur.dioxide density pH
## 87.667 0.994 3.127
## sulphates alcohol
## 0.973 10.867
##
## Noise wines vs. overall mean for key features:
## alcohol: noise=10.87, overall=10.43
## volatile.acidity: noise=0.57, overall=0.53
## chlorides: noise=0.2347, overall=0.0881
DBSCAN’s most distinctive contribution is the explicit identification of noise wines, which are observations whose local density falls below the MinPts threshold. Comparing the noise profile to the overall dataset means reveals which chemical features drive the atypicality: if noise wines have substantially higher chlorides or extreme acidity values, those are the features pushing them outside any coherent density region. The non-noise clusters found by DBSCAN represent the dense core of the wine distribution; their profiles align with the same chemical contrasts found by K-Means and K-Median. DBSCAN’s sensitivity to the eps parameter means the noise percentage is not an absolute property of the data but a reflection of the density threshold chosen, a legitimate choice made explicit in the k-NN distance plot.
DBSCAN identifies 3 noise wines out of 1,359 (0.22%) and finds 1 valid cluster. Neither Silhouette nor Dunn metrics are computed because these metrics require at least 2 valid clusters. The near-total absence of noise reflects that the wine dataset, after Winsorization and Robust Scaling, is highly homogeneous in density: virtually all 1,359 wines have at least 12 neighbors within eps=2.5 in the 11-dimensional scaled space, forming a single dense region. The 3 noise wines represent observations with physicochemical profiles so unusual that they fall outside this dense region entirely. DBSCAN’s result, essentially 1 cluster plus 3 noise points, is consistent with Mean Shift’s finding of 1 density mode and with FCM’s 100% ambiguity, all of which converge on the same structural message: the wine data does not decompose into discrete density-separated groups but forms a continuous, largely uniform distribution in feature space.
Mean Shift is a non-parametric algorithm that finds clusters by locating the modes (local maxima) of the data’s probability density function. The algorithm: (1) places a kernel (bandwidth h) centered on each data point, (2) computes the mean shift vector, which is the gradient pointing toward higher density, (3) shifts the point’s position to the local density maximum, (4) repeats until convergence. All points converging to the same mode form one cluster. Mean Shift requires no K, assumes no cluster shape, and always assigns every point to a cluster (no noise label).
ms_res <- meanShift(df_clust, bandwidth=rep(1.5, ncol(df_clust)))
ms_labels <- ms_res$assignment
cat("Bandwidth used:", 1.5, "(per dimension)\n")
## Bandwidth used: 1.5 (per dimension)
cat("Clusters found:", length(unique(ms_labels)), "\n\n")
## Clusters found: 1
cat("Cluster sizes:\n"); print(table(ms_labels))
## Cluster sizes:
## ms_labels
## 1
## 1359
The bandwidth (1.5) determines the kernel window width. The number of clusters is entirely determined by the density structure of the data at that bandwidth.
n_ms <- length(unique(ms_labels))
ms_palette <- if(n_ms <= 8) brewer.pal(max(3,n_ms),"Set1")[1:n_ms] else rainbow(n_ms)
cbind(pca_df, Cluster=factor(ms_labels)) %>%
ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
geom_point(alpha=0.55, size=1.8) +
{if(n_ms > 1) list(
stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5),
stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6)
)} +
scale_color_manual(values=ms_palette) +
scale_fill_manual(values=ms_palette) +
labs(title=paste0("Mean Shift (",n_ms," cluster(s))"),
subtitle="Bandwidth = 1.5 per dimension",
x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
theme_minimal()
The PC shows cleary that there is only 1 cluster found using Mean Shift method within 1.5 Bandwith value per dimension.
Cluster Profile
kable(wine_feat %>%
mutate(Cluster=factor(ms_labels), Quality=quality_label) %>%
group_by(Cluster) %>%
summarise(n=n(), avg_quality=round(mean(Quality),2),
across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
caption="Mean Shift: Mean Feature Values per Cluster") %>%
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=TRUE, font_size=11)
| Cluster | n | avg_quality | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1359 | 5.62 | 8.311 | 0.529 | 0.272 | 2.523 | 0.088 | 15.893 | 46.826 | 0.997 | 3.31 | 0.659 | 10.432 | 5.623 |
Mean Shift’s result reflects the density structure of the scaled wine data at bandwidth 1.5. The number of clusters found provides an independent cross-check on the K selection in section F, which is if Mean Shift finds a similar number of density peaks as the optimal K identified by Elbow, Silhouette, and Gap Statistics, it strengthens confidence that those peaks are genuine features of the data rather than algorithmic artifacts. The bandwidth parameter plays the same conceptual role as eps in DBSCAN: a smaller bandwidth reveals finer-grained density peaks; a larger bandwidth merges them. Unlike DBSCAN, every point is assigned to a cluster in Mean Shift, so the method tends to be more inclusive but cannot distinguish atypical wines the way DBSCAN’s noise label does.
Mean Shift finding only 1 cluster is a substantive finding that is consistent with DBSCAN’s near-single-cluster result (1 cluster + 3 noise points) and FCM’s 100% ambiguity. At bandwidth 1.5, the gradient ascent from every starting point in the 11-dimensional scaled wine space converges to the same density mode, meaning the data’s probability density function has a single dominant peak rather than two well-separated modes. Neither Silhouette nor Dunn metrics can be computed for a single-cluster result, and ARI is also unavailable. The mean feature profile of the single cluster (avg_quality = 5.623) simply reflects the dataset average. This result reinforces the picture that the two chemical archetypes identified by K-Means and K-Median, while real partitions, represent a gradual continuum rather than two discrete density peaks. The transition between higher and lower quality wine chemistry is smooth, which is consistent with the near-zero ARI values across all methods: human quality ratings do not map cleanly onto any natural density-separated grouping in this physicochemical feature space.
Fuzzy C-Means (FCM) is a soft clustering method. Rather than forcing each point into exactly one cluster (hard assignment), FCM assigns every point a membership degree for every cluster, showing how strongly it belongs to each one.
The algorithm uses a parameter called the fuzzifier to control how flexible the cluster boundaries are. Higher values make the clusters more overlapping, while lower values make them more distinct, similar to hard clustering like K-Means.
In practice, FCM works iteratively by updating the membership values and cluster centers until the results become stable.
fcm_res <- cmeans(df_clust, centers=K_opt, m=2, iter.max=100)
fcm_labels <- fcm_res$cluster
kable(data.frame(Cluster=names(table(fcm_labels)),
Size=as.integer(table(fcm_labels))),
caption="Fuzzy C-Means: Hard Assignment Sizes (argmax membership)") %>%
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
| Cluster | Size |
|---|---|
| 1 | 727 |
| 2 | 632 |
cat("\nMembership matrix:\n")
##
## Membership matrix:
print(round(head(fcm_res$membership, 10), 3))
## 1 2
## [1,] 0.554 0.446
## [2,] 0.497 0.503
## [3,] 0.515 0.485
## [4,] 0.452 0.548
## [5,] 0.556 0.444
## [6,] 0.551 0.449
## [7,] 0.556 0.444
## [8,] 0.567 0.433
## [9,] 0.480 0.520
## [10,] 0.540 0.460
membership_df <- as.data.frame(fcm_res$membership)
colnames(membership_df) <- paste0("C", 1:K_opt)
membership_df$max_mem <- apply(membership_df[,1:K_opt], 1, max)
n_ambiguous <- sum(membership_df$max_mem < 0.70)
cat(sprintf("\nAmbiguous samples (max membership < 0.70): %d (%.1f%%)\n",
n_ambiguous, n_ambiguous/nrow(membership_df)*100))
##
## Ambiguous samples (max membership < 0.70): 1359 (100.0%)
The membership matrix gives each wine’s probability of belonging to each cluster. Hard labels are obtained by taking the cluster with the highest membership. Wines with maximum membership below 0.70 are genuinely ambiguous, their chemistry places them in the transition zone between both clusters
The membership matrix shows that each data point has relatively similar membership values across the two clusters. Most values are close to 0.5, indicating that the data points are not strongly associated with a single cluster.
This is further confirmed by the ambiguity analysis, where 100% of the samples have a maximum membership value below 0.70. This means that all data points are considered ambiguous, with no clear cluster dominance.
In terms of cluster distribution, the sizes are relatively balanced, with 727 data points in Cluster 1 and 632 data points in Cluster 2. However, despite this balance, the lack of strong membership values suggests that the separation between clusters is weak.
These results indicate that:
cbind(pca_df, Cluster=factor(fcm_labels)) %>%
ggplot(aes(x=PC1, y=PC2, color=Cluster, fill=Cluster)) +
geom_point(alpha=0.55, size=1.8) +
stat_ellipse(geom="polygon", alpha=0.10, level=0.90, linewidth=0.5) +
stat_ellipse(level=0.90, linetype="dashed", linewidth=0.6) +
scale_color_manual(values=c("#E53935","#1E88E5")) +
scale_fill_manual(values=c("#E53935","#1E88E5")) +
labs(title="Fuzzy C-Means (K=2)",
subtitle="Hard labels (argmax membership)",
x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
theme_minimal()
ggplot(membership_df, aes(x=max_mem)) +
geom_histogram(bins=40, fill="#7B1FA2", color="white", alpha=0.80) +
geom_vline(xintercept=0.70, linetype="dashed", color="gray30") +
annotate("text", x=0.68, y=Inf, label="0.70 threshold",
hjust=1.05, vjust=2, size=3.5, color="gray30") +
labs(title="Distribution of Maximum Membership Degree",
subtitle="Wines below 0.70 are considered ambiguous",
x="Maximum Membership Degree", y="Count") +
theme_minimal()
cbind(pca_df, max_mem=membership_df$max_mem) %>%
ggplot(aes(x=PC1, y=PC2, color=max_mem)) +
geom_point(alpha=0.65, size=1.5) +
scale_color_gradient2(low="#E53935", mid="#FDD835", high="#43A047",
midpoint=0.70, name="Max\nMembership") +
labs(title="Membership Certainty in PCA Space",
subtitle="Red = ambiguous (boundary region), Green = high certainty",
x=paste0("PC1 (",var_pc1,"%)"), y=paste0("PC2 (",var_pc2,"%)")) +
theme_minimal()
Cluster Profile
kable(wine_feat %>%
mutate(Cluster=factor(fcm_labels), Quality=quality_label) %>%
group_by(Cluster) %>%
summarise(n=n(), avg_quality=round(mean(Quality),2),
across(where(is.numeric), ~round(mean(.),3)), .groups="drop"),
caption="Fuzzy C-Means: Mean Feature Values per Cluster") %>%
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=TRUE, font_size=11)
| Cluster | n | avg_quality | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 727 | 5.59 | 7.225 | 0.567 | 0.166 | 2.150 | 0.076 | 15.725 | 41.661 | 0.996 | 3.393 | 0.613 | 10.581 | 5.593 |
| 2 | 632 | 5.66 | 9.559 | 0.486 | 0.395 | 2.953 | 0.102 | 16.087 | 52.767 | 0.998 | 3.215 | 0.711 | 10.262 | 5.658 |
FCM reveals a dimension of wine quality structure that hard methods cannot quantify: membership certainty. While the hard-assignment cluster profiles mirror those of K-Means (same chemical contrasts), the membership certainty map adds a crucial layer. Wines with high certainty have an unambiguous chemical identity, they are clearly higher or lower quality by their physicochemistry. Wines with low certainty (ambiguous wines) are chemically intermediate, with feature values that do not clearly favor either cluster. The membership certainty plot shows that ambiguous wines concentrate exactly at the PC1 center, the overlap zone between the two cluster ellipses, which is spatially consistent with the bimodal structure seen throughout the analysis. This is not a modeling weakness but a genuine data finding: wine quality exists on a continuum, and FCM quantifies how much of the dataset occupies that grey zone.
FCM produces two clusters with a Silhouette score of 0.1751, a Dunn Index of 0.0412, and an ARI of 0.0001. The Silhouette (0.1751) and Dunn (0.0412) scores sit between K-Means and K-Median, while the ARI of 0.0001 is essentially zero, the smallest of all methods, which indicates near-complete independence between the FCM partition and the human quality labels. The finding that 100% of wines are ambiguous (max membership < 0.70) is not a failure of the model but a direct description of the data: in the 11-dimensional robust-scaled wine space, no observation sits clearly and exclusively close to one cluster center. All wines have partial chemical affinity to both quality archetypes simultaneously, confirming that wine quality in this dataset exists as a continuum. This is consistent across all 5 methods, DBSCAN found 3 noise points with 1 cluster, Mean Shift found 1 density mode, and all partition methods returned near-zero ARI, all pointing to the same conclusion: the physicochemical differences between higher and lower quality wines are real but gradual, not discrete.
dist_mat <- dist(df_clust)
eval_cluster <- function(labels, method_name) {
valid_idx <- labels != 0; n_noise <- sum(!valid_idx)
labs_v <- labels[valid_idx]
if (sum(valid_idx) < 10 || length(unique(labs_v)) < 2)
return(data.frame(Method=method_name, N_Clusters=NA, N_Noise=n_noise,
Silhouette=NA, Dunn=NA, ARI=NA))
dist_v <- as.dist(as.matrix(dist_mat)[valid_idx, valid_idx])
sil <- mean(silhouette(labs_v, dist_v)[, 3])
cs <- cluster.stats(dist_v, labs_v)
ari <- adjustedRandIndex(labs_v, quality_label[valid_idx])
data.frame(Method=method_name, N_Clusters=length(unique(labs_v)),
N_Noise=n_noise, Silhouette=round(sil,4),
Dunn=round(cs$dunn,4), ARI=round(ari,4))
}
eval_results <- bind_rows(
eval_cluster(km_labels, "K-Means"),
eval_cluster(kmed_labels, "K-Median"),
eval_cluster(db_labels, "DBSCAN"),
eval_cluster(ms_labels, "Mean Shift"),
eval_cluster(fcm_labels, "Fuzzy C-Means")
)
best_sil <- which.max(ifelse(is.na(eval_results$Silhouette), -Inf, eval_results$Silhouette))
best_dunn <- which.max(ifelse(is.na(eval_results$Dunn), -Inf, eval_results$Dunn))
best_ari <- which.max(ifelse(is.na(eval_results$ARI), -Inf, eval_results$ARI))
kable(eval_results, caption="Evaluation Metrics") %>%
kable_styling(bootstrap_options=c("striped","hover","bordered"), full_width=FALSE) %>%
column_spec(4,
color=ifelse(seq_len(nrow(eval_results))==best_sil,"white","black"),
background=ifelse(seq_len(nrow(eval_results))==best_sil,"#1D9E75","transparent")) %>%
column_spec(5,
color=ifelse(seq_len(nrow(eval_results))==best_dunn,"white","black"),
background=ifelse(seq_len(nrow(eval_results))==best_sil,"#1D9E75","transparent")) %>%
column_spec(6,
color=ifelse(seq_len(nrow(eval_results))==best_ari,"white","black"),
background=ifelse(seq_len(nrow(eval_results))==best_ari,"#1D9E75","transparent"))
| Method | N_Clusters | N_Noise | Silhouette | Dunn | ARI |
|---|---|---|---|---|---|
| K-Means | 2 | 0 | 0.1829 | 0.0771 | -0.0005 |
| K-Median | 2 | 0 | 0.1672 | 0.0411 | 0.0339 |
| DBSCAN | NA | 3 | NA | NA | NA |
| Mean Shift | NA | 0 | NA | NA | NA |
| Fuzzy C-Means | 2 | 0 | 0.1751 | 0.0412 | 0.0001 |
Interpretation: > Silhouette (−1 to 1, higher better): measures how similar each point is to its own cluster vs. the nearest other cluster. > Dunn Index (>0, higher better): ratio of minimum inter-cluster distance to maximum intra-cluster diameter. > ARI (−1 to 1, higher better): agreement with ground truth quality labels.
Based on the evaluation metrics, K-Means shows the best overall performance with the highest Silhouette (0.1829) and Dunn Index (0.0771), indicating relatively better cluster separation and compactness. Fuzzy C-Means and K-Median perform slightly worse, with lower values in both metrics. However, all scores remain low, suggesting weak clustering structure and significant overlap. The ARI values for all methods are close to zero, indicating very poor agreement with true labels. Meanwhile, DBSCAN and Mean Shift do not produce valid metrics due to insufficient clusters.
par(mfrow=c(2,3), mar=c(8,8,6,4), oma=c(0,0,8,0))
plot(silhouette(km_labels, dist_mat),
col=c("#E53935","#1E88E5"), border=NA, main="K-Means")
plot(silhouette(kmed_labels, dist_mat),
col=c("#E53935","#1E88E5"), border=NA, main="K-Median")
db_mask <- db_labels != 0
if (sum(db_mask) > 10 && length(unique(db_labels[db_mask])) > 1) {
plot(silhouette(db_labels[db_mask],
as.dist(as.matrix(dist_mat)[db_mask,db_mask])),
col=c("#E53935","#1E88E5"), border=NA, main="DBSCAN (No Noise)")
} else {
plot.new(); title("DBSCAN: insufficient clusters")
}
n_ms <- length(unique(ms_labels))
if (n_ms > 1 && n_ms < nrow(df_clust)) {
plot(silhouette(ms_labels, dist_mat),
col=rainbow(n_ms), border=NA, main="Mean Shift")
} else {
plot.new(); text(0.5,0.5,"Mean Shift:\nOnly 1 cluster",cex=1.2)
title("Mean Shift")
}
plot(silhouette(fcm_labels, dist_mat),
col=c("#E53935","#1E88E5"), border=NA, main="Fuzzy C-Means")
par(mfrow=c(1,1))
Interpretation: Positive bars = point fits well in its cluster; negative bars = point may be closer to a neighboring cluster. The overall bar distribution for each method indicates cluster cohesion and balance.
The silhouette plots show that most data points have values close to zero, meaning they lie near cluster boundaries. This indicates that the clusters are not well-separated. Although K-Means appears slightly better with more positive silhouette values, the difference is not significant. Fuzzy C-Means and K-Median show similar patterns, confirming that the clustering structure is weak and overlapping.
eval_results %>%
select(Method, Silhouette, Dunn, ARI) %>%
pivot_longer(c(Silhouette,Dunn,ARI), names_to="Metric", values_to="Value") %>%
ggplot(aes(x=Method, y=Value, fill=Method)) +
geom_col(alpha=0.85, width=0.6) +
facet_wrap(~Metric, scales="free_y") +
scale_fill_brewer(palette="Set2") +
labs(title="Evaluation Metric Comparison",
x=NULL, y="Metric Value") +
theme_minimal(base_size=10) +
theme(axis.text.x=element_text(angle=30, hjust=1), legend.position="none")
Interpretation: The bar chart allows simultaneous comparison across all three metrics and all five methods. The method with the highest Silhouette and Dunn scores achieves the best internal cluster quality; the method with the highest ARI aligns most closely with human quality ratings.
From the bar chart, K-Means consistently has the highest values across Silhouette and Dunn Index, making it the relatively best method among the three. However, the differences between methods are small, and all values are low. This reinforces the conclusion that none of the methods produce strong clustering results, and the dataset likely does not have a clear natural grouping.
data.frame(
Method =c("K-Means","K-Median","DBSCAN","Mean Shift","Fuzzy C-Means"),
Type =c("Partitional","Partitional","Density-based","Density-based","Soft/Fuzzy"),
K_Result =c(2,2,"Auto (eps)","Auto (bw)",2),
Distance =c("Euclidean","Manhattan","Epsilon-radius","Kernel (Gaussian)","Euclidean"),
Strengths =c(
"Fast, scalable, interpretable centroids",
"Robust via median & Manhattan distance",
"Detects noise, no K required, arbitrary shapes",
"No K required, finds density modes automatically",
"Soft membership, quantifies boundary ambiguity"),
Weaknesses=c(
"Sensitive to initialization, requires K",
"Slower than K-Means, still requires K",
"Sensitive to eps & minPts, struggles in high dimensions",
"Very slow, highly sensitive to bandwidth",
"Requires K and fuzzifier m, harder to interpret")
) %>%
kable(caption="Summary Comparison — Five Clustering Methods") %>%
kable_styling(bootstrap_options=c("striped","hover","bordered"), full_width=TRUE) %>%
column_spec(1, bold=TRUE)
| Method | Type | K_Result | Distance | Strengths | Weaknesses |
|---|---|---|---|---|---|
| K-Means | Partitional | 2 | Euclidean | Fast, scalable, interpretable centroids | Sensitive to initialization, requires K |
| K-Median | Partitional | 2 | Manhattan | Robust via median & Manhattan distance | Slower than K-Means, still requires K |
| DBSCAN | Density-based | Auto (eps) | Epsilon-radius | Detects noise, no K required, arbitrary shapes | Sensitive to eps & minPts, struggles in high dimensions |
| Mean Shift | Density-based | Auto (bw) | Kernel (Gaussian) | No K required, finds density modes automatically | Very slow, highly sensitive to bandwidth |
| Fuzzy C-Means | Soft/Fuzzy | 2 | Euclidean | Soft membership, quantifies boundary ambiguity | Requires K and fuzzifier m, harder to interpret |
1. Optimal K = 2 is confirmed by 3 methods, which are Elbow (Kneedle), Silhouette Analysis, and Gap Statistics.
2. K-Means and K-Median deliver the best internal cluster quality. Both identify the same 2 physicochemical archetypes, high-alcohol/high-sulphates/low-volatile-acidity versus the inverse, with K-Median providing stronger robustness guarantees against outliers.
3. DBSCAN uniquely identifies noise wines with atypical profiles, a capability unavailable from any partition method. Its performance is eps-dependent.
4. Mean Shift provides a fully independent, density-driven validation of the K=2 finding without requiring K to be pre-specified.
5. Fuzzy C-Means reveals that a meaningful proportion of wines are chemically ambiguous.
6. Moderate ARI values across all methods are expected: wine quality reflects human sensory judgment that maps only partially onto 11 chemical parameters, especially given the dominance of quality scores 5 and 6 in the dataset.
7. Key physicochemical drivers identified
consistently across all methods: alcohol,
volatile.acidity, and sulphates, confirmed by
cluster centroids (G, H), PCA biplot (L.2), radar chart (L.4), and
per-method boxplots (G, H, I, J, K).