The dataset used is Wine Quality (Red Wine) from the UCI Machine Learning Repository.

| Information | Detail |
|---|---|
| Name | Wine Quality (Red Wine) |
| Source | UCI Machine Learning Repository |
| URL | https://archive.ics.uci.edu/dataset/186/wine+quality |
| Number of Records | 1,599 rows |
| Variables | 12 (11 numeric + 1 target) |
| ML Task | Clustering (Unsupervised) |

The dataset contains physicochemical test results for red Vinho Verde wine from Portugal. It is well suited to clustering because its 11 continuous numeric variables describe the chemical characteristics of each wine.
Analysis goal: group wines by their chemical characteristics to segment quality automatically, which is useful for wine producers, retailers, and consumers.
| No | Variable | Description | Unit / Scale |
|---|---|---|---|
| 1 | fixed.acidity | Fixed acidity | g/L |
| 2 | volatile.acidity | Volatile acidity (acetic acid) | g/L |
| 3 | citric.acid | Citric acid, adds freshness | g/L |
| 4 | residual.sugar | Sugar remaining after fermentation | g/L |
| 5 | chlorides | Salt content | g/L |
| 6 | free.sulfur.dioxide | Free SO2, prevents oxidation | mg/L |
| 7 | total.sulfur.dioxide | Total SO2 (free + bound) | mg/L |
| 8 | density | Density of the wine | g/mL |
| 9 | pH | Acidity level | 0–14 |
| 10 | sulphates | Antimicrobial additive | g/L |
| 11 | alcohol | Alcohol content | % volume |
| 12 | quality | Quality score (label) | 0–10 |

Note: The variable quality is used only for validation; it is not used in the clustering process.
# Load all required packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(cluster)
library(factoextra)
library(corrplot)
library(dendextend)
library(dbscan)
library(scales)
library(knitr)
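library(kableExtra)
wine <- read.csv("winequality-red.csv", sep = ";", header = TRUE)
cat("Successfully loaded the Wine Quality (Red Wine) dataset\n")
cat("Dimensions:", nrow(wine), "rows x", ncol(wine), "columns\n")
# Structure and summary statistics (these calls are implied by the output below)
str(wine)
summary(wine)
## Successfully loaded the Wine Quality (Red Wine) dataset
## Dimensions: 1599 rows x 12 columns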
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
# Visualize the distribution of every variable
wine_long <- wine %>%
pivot_longer(cols = everything(), names_to = "Variabel", values_to = "Nilai")
ggplot(wine_long, aes(x = Nilai, fill = Variabel)) +
geom_histogram(bins = 30, color = "white", alpha = 0.8) +
facet_wrap(~ Variabel, scales = "free", ncol = 4) +
scale_fill_viridis_d(guide = "none") +
labs(title = "Distribution of Each Variable - Wine Quality Dataset",
x = "Value", y = "Frequency") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
strip.text = element_text(face = "bold", size = 9))
# Check missing values per column
missing_df <- data.frame(
Variabel = names(wine),
Missing = colSums(is.na(wine)),
Persen = round(colMeans(is.na(wine)) * 100, 2)
)
missing_df %>%
kable(caption = "Number of Missing Values per Variable", row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Variabel | Missing | Persen |
|---|---|---|
| fixed.acidity | 0 | 0 |
| volatile.acidity | 0 | 0 |
| citric.acid | 0 | 0 |
| residual.sugar | 0 | 0 |
| chlorides | 0 | 0 |
| free.sulfur.dioxide | 0 | 0 |
| total.sulfur.dioxide | 0 | 0 |
| density | 0 | 0 |
| pH | 0 | 0 |
| sulphates | 0 | 0 |
| alcohol | 0 | 0 |
| quality | 0 | 0 |
##
## Total missing value: 0
✅ No missing values. The dataset is clean and ready for further processing.
# Select the 11 physicochemical variables (excluding the label 'quality')
wine_features <- wine[, 1:11]
# Z-score normalization (mean = 0, sd = 1)
wine_scaled <- as.data.frame(scale(wine_features))
# Verify the normalization
norm_check <- data.frame(
Variabel = names(wine_scaled),
Mean_Before = round(apply(wine_features, 2, mean), 4),
SD_Before = round(apply(wine_features, 2, sd), 4),
Mean_After = round(apply(wine_scaled, 2, mean), 6),
SD_After = round(apply(wine_scaled, 2, sd), 4)
)
norm_check %>%
kable(caption = "Summary Statistics Before and After Normalization",
row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Variabel | Mean_Before | SD_Before | Mean_After | SD_After |
|---|---|---|---|---|
| fixed.acidity | 8.3196 | 1.7411 | 0 | 1 |
| volatile.acidity | 0.5278 | 0.1791 | 0 | 1 |
| citric.acid | 0.2710 | 0.1948 | 0 | 1 |
| residual.sugar | 2.5388 | 1.4099 | 0 | 1 |
| chlorides | 0.0875 | 0.0471 | 0 | 1 |
| free.sulfur.dioxide | 15.8749 | 10.4602 | 0 | 1 |
| total.sulfur.dioxide | 46.4678 | 32.8953 | 0 | 1 |
| density | 0.9967 | 0.0019 | 0 | 1 |
| pH | 3.3111 | 0.1544 | 0 | 1 |
| sulphates | 0.6581 | 0.1695 | 0 | 1 |
| alcohol | 10.4230 | 1.0657 | 0 | 1 |
Method: Z-score standardization. Each variable has its mean subtracted and is then divided by its standard deviation, so all variables share the same scale (mean ≈ 0, sd = 1).
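The standardization described above can be checked by hand on a small example. The sketch below uses a hypothetical five-row data frame (the values and the column name `total_so2` are illustrative, not taken from the report) and compares the manual formula with `scale()`.

```r
# Toy data: two hypothetical features on very different scales
x <- data.frame(
  alcohol   = c(9.4, 9.8, 10.5, 11.2, 12.0),
  total_so2 = c(34, 67, 54, 102, 18)
)

# Manual z-score: subtract the column mean, divide by the sample sd
z_manual <- as.data.frame(lapply(x, function(v) (v - mean(v)) / sd(v)))

# scale() centers and scales each column the same way by default
z_scale <- scale(x)

# Largest absolute difference between the two results (floating-point noise)
max(abs(as.matrix(z_manual) - z_scale))

# After standardization each column has mean ~0 and sd 1
round(colMeans(z_manual), 10)
sapply(z_manual, sd)
```

Without this step, a variable with a large range such as total sulfur dioxide would dominate the Euclidean distances used by the clustering methods.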
# Scatter Plot 1: Alcohol vs Volatile Acidity
ggplot(wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
geom_point(alpha = 0.55, size = 2) +
scale_color_brewer(palette = "RdYlGn", name = "Quality") +
labs(title = "Scatter Plot: Alcohol vs Volatile Acidity",
subtitle = "Color indicates the wine quality score",
x = "Alcohol (% volume)", y = "Volatile Acidity (g/L)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# Scatter Plot 2: Fixed Acidity vs Citric Acid
ggplot(wine, aes(x = fixed.acidity, y = citric.acid, color = factor(quality))) +
geom_point(alpha = 0.55, size = 2) +
scale_color_brewer(palette = "RdYlGn", name = "Quality") +
labs(title = "Scatter Plot: Fixed Acidity vs Citric Acid",
subtitle = "Color indicates the wine quality score",
x = "Fixed Acidity (g/L)", y = "Citric Acid (g/L)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# Correlation Matrix
cor_matrix <- cor(wine_features)
corrplot(cor_matrix, method = "color", type = "upper",
tl.cex = 0.85, addCoef.col = "black", number.cex = 0.65,
col = colorRampPalette(c("#D73027", "white", "#1A9641"))(200),
title = "Correlation Matrix - Wine Quality Features",
mar = c(0, 0, 2, 0))
set.seed(42)
wss <- sapply(1:10, function(k) {
kmeans(wine_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})
elbow_df <- data.frame(k = 1:10, wss = wss)
ggplot(elbow_df, aes(x = k, y = wss)) +
geom_line(color = "#E74C3C", linewidth = 1.3) +
geom_point(color = "#C0392B", size = 4) +
geom_vline(xintercept = 3, linetype = "dashed",
color = "steelblue", linewidth = 1.1) +
annotate("text", x = 3.4, y = max(wss) * 0.9,
label = "optimal k = 3", color = "steelblue",
fontface = "bold", size = 4.5) +
scale_x_continuous(breaks = 1:10) +
labs(title = "Elbow Method - Choosing the Number of Clusters",
subtitle = "K-Means | Wine Quality Dataset",
x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares (WSS)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))

📌 Based on the Elbow Method, the decrease in WSS slows noticeably from k = 3 onward, so k = 3 is chosen as the number of clusters.
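Reading an elbow from a plot is subjective, so it can help to look at the numbers behind it. The following is a minimal sketch on synthetic data with three well-separated groups (not the wine data); the names `toy`, `wss_toy`, and `rel_drop` are illustrative. It computes the relative drop in WSS for each additional cluster.

```r
set.seed(42)

# Synthetic 2-D data with three well-separated groups (not the wine data)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 5, sd = 0.5))
)

# Total within-cluster sum of squares for k = 1..6
wss_toy <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

# Relative drop in WSS when going from k - 1 to k clusters;
# the elbow is where this drop becomes small
rel_drop <- -diff(wss_toy) / head(wss_toy, -1)
data.frame(k = 2:6, rel_drop = round(rel_drop, 3))
```

On data with a clear structure the drop falls sharply after the true number of groups. On the wine data the curve is smoother, which is one reason the choice of k = 3 is a judgment call rather than a unique answer.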
set.seed(42)
kmeans_result <- kmeans(wine_scaled, centers = 3, nstart = 25, iter.max = 100)
# Cluster distribution
dist_kmeans <- as.data.frame(table(Cluster = kmeans_result$cluster))
dist_kmeans$Persen <- round(dist_kmeans$Freq / sum(dist_kmeans$Freq) * 100, 1)
dist_kmeans %>%
kable(caption = "Data Distribution per Cluster (K-Means)", col.names = c("Cluster", "Count", "Percent (%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Cluster | Count | Percent (%) |
|---|---|---|
| 1 | 502 | 31.4 |
| 2 | 724 | 45.3 |
| 3 | 373 | 23.3 |
fviz_cluster(kmeans_result,
data = wine_scaled,
palette = c("#E74C3C", "#2ECC71", "#3498DB"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "K-Means Clustering Visualization (k=3)",
subtitle = "Wine Quality Dataset - 2-D PCA Projection") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# Store the result in the original data
wine$cluster_kmeans <- kmeans_result$cluster
# Average profile per cluster
profile_kmeans <- wine %>%
group_by(`Cluster K-Means` = cluster_kmeans) %>%
summarise(
N = n(),
Alcohol = round(mean(alcohol), 2),
Vol_Acid = round(mean(volatile.acidity), 3),
Fix_Acid = round(mean(fixed.acidity), 2),
Citric = round(mean(citric.acid), 3),
Sulphates = round(mean(sulphates), 3),
pH = round(mean(pH), 3),
Quality = round(mean(quality), 2)
)
profile_kmeans %>%
kable(caption = "Average Profile per Cluster (K-Means)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Cluster K-Means | N | Alcohol | Vol_Acid | Fix_Acid | Citric | Sulphates | pH | Quality |
|---|---|---|---|---|---|---|---|---|
| 1 | 502 | 10.72 | 0.405 | 10.07 | 0.470 | 0.752 | 3.195 | 5.96 |
| 2 | 724 | 10.50 | 0.609 | 7.19 | 0.123 | 0.609 | 3.406 | 5.55 |
| 3 | 373 | 9.88 | 0.535 | 8.16 | 0.290 | 0.626 | 3.283 | 5.36 |
# Use a random sample to keep the dendrogram efficient and readable
set.seed(42)
n_sample <- 150
idx_sample <- sample(1:nrow(wine_scaled), n_sample)
wine_sample <- wine_scaled[idx_sample, ]
# Compute distances
dist_matrix <- dist(wine_sample, method = "euclidean")
# Hierarchical Clustering - Ward.D2
hclust_result <- hclust(dist_matrix, method = "ward.D2")
cat("Linkage method : Ward.D2\n")
cat("Distance metric: Euclidean\n")
cat("Sample size    :", n_sample, "samples\n")
## Linkage method : Ward.D2
## Distance metric: Euclidean
## Sample size    : 150 samples
# Color the dendrogram
dend <- as.dendrogram(hclust_result)
dend_colored <- color_branches(dend, k = 3,
col = c("#E74C3C", "#2ECC71", "#3498DB"))
par(mar = c(5, 4, 4, 2), cex = 0.6)
plot(dend_colored,
main = "Dendrogram - Hierarchical Clustering (Ward.D2)\nWine Quality Dataset",
xlab = paste("n =", n_sample, "samples"),
ylab = "Height",
leaflab = "none")
abline(h = 8, col = "#8E44AD", lty = 2, lwd = 2)
legend("topright",
legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cut point"),
col = c("#E74C3C", "#2ECC71", "#3498DB", "#8E44AD"),
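lty = c(1,1,1,2), lwd = 2, cex = 0.85, bty = "n")
# Cut the tree into 3 clusters (step implied by the output below and by the plot that follows)
hclust_cut <- cutree(hclust_result, k = 3)
cat("Cluster Distribution (Hierarchical):\n")
print(table(hclust_cut))
## Cluster Distribution (Hierarchical):
## hclust_cut
## 1 2 3
## 30 118 2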
fviz_cluster(list(data = wine_sample, cluster = hclust_cut),
palette = c("#E74C3C", "#2ECC71", "#3498DB"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "Hierarchical Clustering (k=3) - Wine Quality",
subtitle = paste("n =", n_sample, "samples | Ward.D2 linkage")) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# MinPts = 5 (common rule of thumb)
minPts_val <- 5
# k-NN Distance Plot
knn_dist <- dbscan::kNNdist(wine_scaled, k = minPts_val)
knn_sorted <- sort(knn_dist)
plot(knn_sorted,
type = "l", col = "#8E44AD", lwd = 2,
main = "k-NN Distance Plot for Choosing Epsilon",
xlab = "Data Points (sorted by distance)",
ylab = paste0(minPts_val, "-NN Distance"),
sub = "The 'elbow' point suggests the epsilon value")
abline(h = 0.9, col = "#E74C3C", lty = 2, lwd = 2)
legend("topleft",
legend = c(paste0(minPts_val, "-NN Distance"), "eps = 0.9"),
col = c("#8E44AD", "#E74C3C"), lty = c(1,2), lwd = 2, cex = 0.9)

📌 DBSCAN parameters:
- eps (epsilon) = 0.9, read from the "elbow" of the k-NN Distance Plot
- MinPts = 5, a common rule of thumb (at least 5, or 2 × the number of dimensions, which would be 22 for the 11 features used here)
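To make the link between the k-NN distance and eps concrete, here is a minimal base-R sketch that computes the distance to the k-th nearest neighbour directly from a distance matrix. It runs on synthetic standardized Gaussian data with 11 columns as a stand-in for `wine_scaled`, so the numbers say nothing about the wine data itself; `toy`, `knn_k`, and `candidate_eps` are illustrative names.

```r
set.seed(42)

# Synthetic standardized data with 11 columns (stand-in for wine_scaled)
toy <- scale(matrix(rnorm(200 * 11), ncol = 11))

k <- 5  # same value as MinPts in the report

# Full pairwise Euclidean distance matrix
d <- as.matrix(dist(toy))

# Distance from each point to its k-th nearest neighbour;
# after sorting, position 1 is the point itself (distance 0), so take k + 1
knn_k <- apply(d, 1, function(row) sort(row)[k + 1])

# Share of points that have fewer than k other points within each candidate eps
candidate_eps <- c(0.9, 2, 3, 4)
share_sparse <- sapply(candidate_eps, function(e) mean(knn_k > e))
data.frame(eps = candidate_eps, share_sparse = round(share_sparse, 3))
```

In a space with many dimensions, points tend to lie far apart, so a small eps leaves most points without enough neighbours. Tabulating this share for several eps values before running DBSCAN gives an early warning of a result dominated by noise.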
set.seed(42)
dbscan_result <- dbscan::dbscan(wine_scaled, eps = 0.9, minPts = 5)
# Cluster distribution
dist_dbscan <- as.data.frame(table(Cluster = dbscan_result$cluster))
dist_dbscan$Keterangan <- ifelse(dist_dbscan$Cluster == 0, "Noise/Outlier", paste("Cluster", dist_dbscan$Cluster))
dist_dbscan$Persen <- round(dist_dbscan$Freq / sum(dist_dbscan$Freq) * 100, 1)
dist_dbscan %>%
kable(caption = "DBSCAN Data Distribution (0 = Noise)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Cluster | Freq | Keterangan | Persen |
|---|---|---|---|
| 0 | 1366 | Noise/Outlier | 85.4 |
| 1 | 92 | Cluster 1 | 5.8 |
| 2 | 5 | Cluster 2 | 0.3 |
| 3 | 16 | Cluster 3 | 1.0 |
| 4 | 10 | Cluster 4 | 0.6 |
| 5 | 5 | Cluster 5 | 0.3 |
| 6 | 5 | Cluster 6 | 0.3 |
| 7 | 4 | Cluster 7 | 0.3 |
| 8 | 3 | Cluster 8 | 0.2 |
| 9 | 5 | Cluster 9 | 0.3 |
| 10 | 6 | Cluster 10 | 0.4 |
| 11 | 14 | Cluster 11 | 0.9 |
| 12 | 6 | Cluster 12 | 0.4 |
| 13 | 9 | Cluster 13 | 0.6 |
| 14 | 5 | Cluster 14 | 0.3 |
| 15 | 15 | Cluster 15 | 0.9 |
| 16 | 5 | Cluster 16 | 0.3 |
| 17 | 5 | Cluster 17 | 0.3 |
| 18 | 6 | Cluster 18 | 0.4 |
| 19 | 6 | Cluster 19 | 0.4 |
| 20 | 6 | Cluster 20 | 0.4 |
| 21 | 5 | Cluster 21 | 0.3 |
noise_pct <- round(sum(dbscan_result$cluster == 0) / nrow(wine_scaled) * 100, 1)
fviz_cluster(dbscan_result,
data = wine_scaled,
geom = "point",
palette = "Set2",
ggtheme = theme_minimal()) +
labs(title = "DBSCAN Clustering — Wine Quality",
subtitle = paste0("eps = 0.9 | MinPts = 5 | Noise = ", noise_pct, "%")) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# --- Silhouette K-Means ---
sil_kmeans <- cluster::silhouette(kmeans_result$cluster, dist(wine_scaled))
avg_sil_km <- round(mean(sil_kmeans[, 3]), 4)
# --- Silhouette Hierarchical (first 500 rows, not a random sample) ---
set.seed(42)
wine_sub <- wine_scaled[1:500, ]
hcut_sub <- cutree(hclust(dist(wine_sub, method = "euclidean"),
method = "ward.D2"), k = 3)
sil_hclust <- cluster::silhouette(hcut_sub, dist(wine_sub))
avg_sil_hc <- round(mean(sil_hclust[, 3]), 4)
# --- Silhouette DBSCAN ---
idx_nonoise <- dbscan_result$cluster != 0
dbscan_label <- dbscan_result$cluster[idx_nonoise]
wine_nonoise <- wine_scaled[idx_nonoise, ]
if (length(unique(dbscan_label)) > 1) {
sil_dbscan <- cluster::silhouette(dbscan_label, dist(wine_nonoise))
avg_sil_db <- round(mean(sil_dbscan[, 3]), 4)
} else {
avg_sil_db <- NA
}
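# Comparison table
eval_df <- data.frame(
Metode = c("K-Means", "Hierarchical Clustering", "DBSCAN"),
Jumlah_Cluster = c(length(unique(kmeans_result$cluster)),
3,
length(unique(dbscan_label))),
Silhouette_Score = c(avg_sil_km, avg_sil_hc, avg_sil_db),
Keterangan = c("Spherical, compact clusters",
"Ward.D2 linkage",
paste0("Noise ", noise_pct, "% excluded"))
)
# which.max() already skips NA and returns the original row index,
# so na.omit() is not needed and could shift the highlighted row
eval_df %>%
kable(caption = "Comparison of Clustering Method Evaluation") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(which.max(eval_df$Silhouette_Score),
bold = TRUE, background = "#d4efdf")

| Metode | Jumlah_Cluster | Silhouette_Score | Keterangan |
|---|---|---|---|
| K-Means | 3 | 0.1894 | Spherical, compact clusters |
| Hierarchical Clustering | 3 | 0.2581 | Ward.D2 linkage |
| DBSCAN | 21 | 0.2579 | Noise 85.4% excluded |

🏆 The best method, highlighted in green in the rendered table, is the one with the highest Silhouette Score: Hierarchical Clustering (0.2581). The scores are not computed on the same data: K-Means uses all 1,599 rows, Hierarchical Clustering the first 500 rows, and DBSCAN only the 233 points not labeled as noise, so the ranking is indicative rather than conclusive.
A Silhouette Score near +1 indicates well-separated clusters; near 0 indicates overlapping clusters; negative values indicate points that were probably assigned to the wrong cluster. All three scores here are below 0.3, which points to weakly separated clusters.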
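For a single observation i, the silhouette width is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes this by hand on six hypothetical one-dimensional values; `silhouette_manual` is an illustrative helper, not a function from the `cluster` package, and it assumes every cluster has at least two points.

```r
# Six hypothetical 1-D observations in two obvious groups
x  <- c(1.0, 1.2, 1.4, 5.0, 5.2, 5.4)
cl <- c(1, 1, 1, 2, 2, 2)

d <- as.matrix(dist(x))

# Silhouette width of observation i (assumes its cluster has >= 2 points)
silhouette_manual <- function(i, d, cl) {
  same <- cl == cl[i]
  same[i] <- FALSE                     # exclude the point itself
  a <- mean(d[i, same])                # mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  # mean distance to each other cluster; b is the smallest of these
  b <- min(sapply(others, function(g) mean(d[i, cl == g])))
  (b - a) / max(a, b)
}

s <- sapply(seq_along(x), silhouette_manual, d = d, cl = cl)
round(s, 3)
mean(s)  # average silhouette width, the quantity compared in the table
```

For the same labels and distances, `cluster::silhouette(cl, dist(x))` should give the same widths in its third column. The toy groups are far apart, so every width is close to 1, in contrast to the values around 0.2 reported for the wine data.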
fviz_silhouette(sil_kmeans,
palette = c("#E74C3C", "#2ECC71", "#3498DB"),
ggtheme = theme_minimal()) +
labs(title = "Silhouette Plot - K-Means (k=3)",
subtitle = paste("Average Silhouette Score =", avg_sil_km)) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
## cluster size ave.sil.width
## 1 1 502 0.14
## 2 2 724 0.26
## 3 3 373 0.11
fviz_silhouette(sil_hclust,
palette = c("#E74C3C", "#2ECC71", "#3498DB"),
ggtheme = theme_minimal()) +
labs(title = "Silhouette Plot - Hierarchical Clustering (k=3)",
subtitle = paste("Average Silhouette Score =", avg_sil_hc)) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
## cluster size ave.sil.width
## 1 1 373 0.24
## 2 2 22 0.18
## 3 3 105 0.34
wine$cluster_label <- factor(wine$cluster_kmeans,
labels = c("Cluster 1", "Cluster 2", "Cluster 3"))
cluster_full <- wine %>%
group_by(`Cluster` = cluster_label) %>%
summarise(
`N (data)` = n(),
`Alcohol (%)` = round(mean(alcohol), 2),
`Volatile Acidity` = round(mean(volatile.acidity), 3),
`Fixed Acidity` = round(mean(fixed.acidity), 2),
`Citric Acid` = round(mean(citric.acid), 3),
`Sulphates` = round(mean(sulphates), 3),
`pH` = round(mean(pH), 3),
`Total SO2` = round(mean(total.sulfur.dioxide), 1),
`Quality (avg)` = round(mean(quality), 2)
)
cluster_full %>%
kable(caption = "Full Profile of Each Cluster (K-Means, k=3)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, background = "#2c3e50", color = "white")

| Cluster | N (data) | Alcohol (%) | Volatile Acidity | Fixed Acidity | Citric Acid | Sulphates | pH | Total SO2 | Quality (avg) |
|---|---|---|---|---|---|---|---|---|---|
| Cluster 1 | 502 | 10.72 | 0.405 | 10.07 | 0.470 | 0.752 | 3.195 | 30.6 | 5.96 |
| Cluster 2 | 724 | 10.50 | 0.609 | 7.19 | 0.123 | 0.609 | 3.406 | 35.0 | 5.55 |
| Cluster 3 | 373 | 9.88 | 0.535 | 8.16 | 0.290 | 0.626 | 3.283 | 90.1 | 5.36 |
cluster_long <- cluster_full %>%
select(Cluster, `Alcohol (%)`, `Volatile Acidity`, `Citric Acid`, `Sulphates`, pH) %>%
pivot_longer(-Cluster, names_to = "Variabel", values_to = "Nilai") %>%
group_by(Variabel) %>%
mutate(Nilai_Norm = rescale(Nilai, to = c(0, 1)))
ggplot(cluster_long, aes(x = Variabel, y = Nilai_Norm, fill = Cluster)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.88, width = 0.7) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71", "#3498DB")) +
labs(title = "Characteristic Profile of Each Cluster (K-Means)",
subtitle = "Values rescaled to 0-1 for comparison across variables",
x = "Chemical Variable", y = "Rescaled Value", fill = "Cluster") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.text.x = element_text(angle = 15, hjust = 1))
ggplot(wine, aes(x = cluster_label, y = alcohol, fill = cluster_label)) +
geom_boxplot(alpha = 0.85, outlier.color = "gray50") +
scale_fill_manual(values = c("#E74C3C", "#2ECC71", "#3498DB"), guide = "none") +
labs(title = "Distribution of Alcohol Content per Cluster",
x = "Cluster", y = "Alcohol (% volume)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))

| Cluster | Characteristics | Interpretation | Segmentation |
|---|---|---|---|
| Cluster 1: Premium Wine | Highest alcohol (10.72%), highest sulphates (0.752) and citric acid (0.470), lowest volatile acidity (0.405) | Highest average quality score (5.96); the low volatile acidity suggests a clean fermentation | Premium wine / upper-middle market |
| Cluster 2: Mainstream Wine | Highest volatile acidity (0.609), lowest citric acid (0.123) and sulphates (0.609), highest pH (3.406) | Sharper, more acetic taste; average quality in the middle (5.55); the largest cluster (45.3% of wines) | Middle market / everyday drinking |
| Cluster 3: Economy Wine | Lowest alcohol (9.88%), highest total SO2 (90.1 mg/L), other listed variables between those of Clusters 1 and 2 | Lowest average quality score (5.36) | Economy market / cooking wine |

The cluster numbers follow the K-Means profile table. The gaps in average quality are modest (5.36 to 5.96), so the segment names are indicative only.
📊 Practical Implications:
Wine producers can use this clustering for:
- 🔍 Automated quality control: flagging wines with elevated volatile acidity (Cluster 2 has the highest mean, 0.609 g/L)
- 💰 Pricing based on chemical profile, not only on subjective assessment
- 🏭 Production optimization: steering the process toward the Cluster 1 (premium) profile
Retailers can segment wine products to give customers more relevant recommendations.
Cluster 2 (mainstream) is the largest market segment, so its production volume is worth maintaining.
| Aspect | Result |
|---|---|
| Dataset | Wine Quality Red (UCI): 1,599 rows, 11 numeric variables plus the quality label, no missing values |
| Preprocessing | Z-score normalization applied to all 11 physicochemical variables |
| K-Means | k=3 (Elbow Method); Silhouette = 0.1894 |
| Hierarchical | k=3 (Ward.D2); Silhouette = 0.2581 |
| DBSCAN | eps=0.9, MinPts=5; Noise = 85.4%; Silhouette = 0.2579 (non-noise points only) |
| Best Method | Hierarchical Clustering (highest Silhouette = 0.2581) |

Report created with R Markdown | Dataset: UCI Machine Learning Repository, Wine Quality