1 Data Understanding

1.1 Dataset Information

The dataset used is Wine Quality (Red Wine) from the UCI Machine Learning Repository.

| Information | Detail |
|---|---|
| Name | Wine Quality (Red Wine) |
| Source | UCI Machine Learning Repository |
| URL | https://archive.ics.uci.edu/dataset/186/wine+quality |
| Number of records | 1,599 rows |
| Variables | 12 (11 numeric features + 1 target) |
| ML task | Clustering (unsupervised) |

1.1.1 Dataset Background

This dataset contains physicochemical test results for red Vinho Verde wines from Portugal. It suits clustering because its 11 continuous numeric variables describe the chemical characteristics of each wine.

Analysis goal: group wines by their chemical characteristics to support automated quality segmentation, which may be useful to wine producers, retailers, and consumers.

1.2 Variable Description

| No | Variable | Description | Unit |
|---|---|---|---|
| 1 | fixed.acidity | Fixed acidity | g/L |
| 2 | volatile.acidity | Volatile acidity (acetic acid) | g/L |
| 3 | citric.acid | Citric acid, adds freshness | g/L |
| 4 | residual.sugar | Sugar remaining after fermentation | g/L |
| 5 | chlorides | Salt content | g/L |
| 6 | free.sulfur.dioxide | Free SO2, prevents oxidation | mg/L |
| 7 | total.sulfur.dioxide | Total SO2 (free + bound) | mg/L |
| 8 | density | Density of the wine | g/mL |
| 9 | pH | Acidity level | 0–14 |
| 10 | sulphates | Antimicrobial additive | g/L |
| 11 | alcohol | Alcohol content | % by volume |
| 12 | quality | Quality score (label) | 0–10 |

Note: The quality variable is used only for validation; it is not used in the clustering itself.

1.3 Loading the Data and Structure

# Load all required packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(cluster)
library(factoextra)
library(corrplot)
library(dendextend)
library(dbscan)
library(scales)
library(knitr)
library(kableExtra)
wine <- read.csv("winequality-red.csv", sep = ";", header = TRUE)

cat("Wine Quality (Red Wine) dataset loaded successfully\n")
## Wine Quality (Red Wine) dataset loaded successfully
cat("Dimensions:", nrow(wine), "rows x", ncol(wine), "columns\n")
## Dimensions: 1599 rows x 12 columns
# Inspect the data structure
str(wine)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

1.4 Descriptive Statistics

# Descriptive statistics
summary_table <- summary(wine)
print(summary_table)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
# Visualize the distribution of every variable
wine_long <- wine %>%
  pivot_longer(cols = everything(), names_to = "Variabel", values_to = "Nilai")

ggplot(wine_long, aes(x = Nilai, fill = Variabel)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8) +
  facet_wrap(~ Variabel, scales = "free", ncol = 4) +
  scale_fill_viridis_d(guide = "none") +
  labs(title = "Distribution of Each Variable - Wine Quality Dataset",
       x = "Value", y = "Frequency") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        strip.text = element_text(face = "bold", size = 9))


2 Data Preprocessing

2.1 Checking for Missing Values

# Check missing values per column
missing_df <- data.frame(
  Variabel = names(wine),
  Missing   = colSums(is.na(wine)),
  Persen    = round(colMeans(is.na(wine)) * 100, 2)
)

missing_df %>%
  kable(caption = "Number of Missing Values per Variable", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Number of Missing Values per Variable
Variabel Missing Persen
fixed.acidity 0 0
volatile.acidity 0 0
citric.acid 0 0
residual.sugar 0 0
chlorides 0 0
free.sulfur.dioxide 0 0
total.sulfur.dioxide 0 0
density 0 0
pH 0 0
sulphates 0 0
alcohol 0 0
quality 0 0
cat("\nTotal missing value:", sum(is.na(wine)), "\n")
## 
## Total missing value: 0

There are no missing values. The dataset is clean and ready for further processing.

2.2 Data Normalization

# Select the 11 physicochemical variables (excluding 'quality', which is the label)
wine_features <- wine[, 1:11]

# Z-score normalization (mean = 0, sd = 1)
wine_scaled <- as.data.frame(scale(wine_features))

# Verify the normalization
norm_check <- data.frame(
  Variabel = names(wine_scaled),
  Mean_Before = round(apply(wine_features, 2, mean), 4),
  SD_Before   = round(apply(wine_features, 2, sd), 4),
  Mean_After  = round(apply(wine_scaled, 2, mean), 6),
  SD_After    = round(apply(wine_scaled, 2, sd), 4)
)

norm_check %>%
  kable(caption = "Statistics Before and After Normalization",
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Statistics Before and After Normalization
Variabel Mean_Before SD_Before Mean_After SD_After
fixed.acidity 8.3196 1.7411 0 1
volatile.acidity 0.5278 0.1791 0 1
citric.acid 0.2710 0.1948 0 1
residual.sugar 2.5388 1.4099 0 1
chlorides 0.0875 0.0471 0 1
free.sulfur.dioxide 15.8749 10.4602 0 1
total.sulfur.dioxide 46.4678 32.8953 0 1
density 0.9967 0.0019 0 1
pH 3.3111 0.1544 0 1
sulphates 0.6581 0.1695 0 1
alcohol 10.4230 1.0657 0 1

Method: Z-score standardization. Each variable has its mean subtracted and is then divided by its standard deviation, so all variables share the same scale (mean ≈ 0, sd = 1).
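
The standardization described above can be checked by hand on a single column. The sketch below uses a small illustrative vector (an assumption, not the report's data) and compares the manual formula with `scale()`.

```r
# Toy vector standing in for one feature column (values are illustrative only)
x <- c(9.4, 9.8, 9.8, 9.8, 9.4, 10.0, 9.5, 10.5)

# Manual z-score: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() applies the same formula column-wise and returns a matrix
z_scale <- as.numeric(scale(x))

# Both approaches should agree up to floating-point error
print(all.equal(z_manual, z_scale))

# After standardization the mean is ~0 and the standard deviation is ~1
print(c(mean = mean(z_manual), sd = sd(z_manual)))
```

Note that `scale()` divides by the sample standard deviation (denominator n - 1), which is why it matches `sd()` here.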

2.3 Initial Visualization (Scatter Plots)

# Scatter Plot 1: Alcohol vs Volatile Acidity
ggplot(wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
  geom_point(alpha = 0.55, size = 2) +
  scale_color_brewer(palette = "RdYlGn", name = "Quality") +
  labs(title = "Scatter Plot: Alcohol vs Volatile Acidity",
       subtitle = "Colored by wine quality score",
       x = "Alcohol (% volume)", y = "Volatile Acidity (g/L)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Scatter Plot 2: Fixed Acidity vs Citric Acid
ggplot(wine, aes(x = fixed.acidity, y = citric.acid, color = factor(quality))) +
  geom_point(alpha = 0.55, size = 2) +
  scale_color_brewer(palette = "RdYlGn", name = "Quality") +
  labs(title = "Scatter Plot: Fixed Acidity vs Citric Acid",
       subtitle = "Colored by wine quality score",
       x = "Fixed Acidity (g/L)", y = "Citric Acid (g/L)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Correlation Matrix
cor_matrix <- cor(wine_features)
corrplot(cor_matrix, method = "color", type = "upper",
         tl.cex = 0.85, addCoef.col = "black", number.cex = 0.65,
         col = colorRampPalette(c("#D73027", "white", "#1A9641"))(200),
         title = "Correlation Matrix - Wine Quality Features",
         mar = c(0, 0, 2, 0))
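
A correlation plot is easier to act on when the strongest pairs are also listed as a table. The following is a minimal sketch of one way to do that; it uses the built-in `mtcars` data as a stand-in, and the name `pairs_df` is hypothetical. In this report the same steps would apply to `cor_matrix`.

```r
# Stand-in data: seven numeric columns from the built-in mtcars data set
cm <- cor(mtcars[, 1:7])

# Blank out the upper triangle and the diagonal so each pair appears only once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Convert the matrix to long form: one row per (var1, var2, r)
pairs_df <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# Sort by absolute correlation, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(head(pairs_df, 5))
```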


3 Clustering

3.1 K-Means

3.1.1 Choosing the Number of Clusters (Elbow Method)

set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(wine_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

elbow_df <- data.frame(k = 1:10, wss = wss)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line(color = "#E74C3C", linewidth = 1.3) +
  geom_point(color = "#C0392B", size = 4) +
  geom_vline(xintercept = 3, linetype = "dashed",
             color = "steelblue", linewidth = 1.1) +
  annotate("text", x = 3.4, y = max(wss) * 0.9,
           label = "optimal k = 3", color = "steelblue",
           fontface = "bold", size = 4.5) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method - Choosing the Number of Clusters",
       subtitle = "K-Means | Wine Quality Dataset",
       x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares (WSS)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

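📌 Based on the Elbow Method, the decrease in WSS slows noticeably from k = 3 onward, so k = 3 is chosen as the number of clusters. The elbow is read visually from the plot, so this choice is a judgment call rather than a computed optimum.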
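
Because the elbow is judged by eye, a second criterion can help confirm the choice of k. The sketch below computes the average silhouette width for several values of k. It is an illustration under assumptions: the built-in `iris` measurements stand in for `wine_scaled`, and the names `avg_sil` and `sil_df` are hypothetical.

```r
library(cluster)

# Stand-in data: the four numeric iris columns, z-scored like wine_scaled
x <- scale(iris[, 1:4])
d <- dist(x)

set.seed(42)
k_values <- 2:6

# Average silhouette width for each candidate k (silhouette needs k >= 2)
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(x, centers = k, nstart = 25, iter.max = 100)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

sil_df <- data.frame(k = k_values, avg_sil = round(avg_sil, 4))
print(sil_df)

# k with the highest average silhouette width
print(sil_df$k[which.max(sil_df$avg_sil)])
```

If the silhouette criterion and the elbow disagree, that is a sign the cluster structure is weak and the choice of k deserves a caveat.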

3.1.2 K-Means Results (k=3)

set.seed(42)
kmeans_result <- kmeans(wine_scaled, centers = 3, nstart = 25, iter.max = 100)

# Cluster distribution
dist_kmeans <- as.data.frame(table(Cluster = kmeans_result$cluster))
dist_kmeans$Persen <- round(dist_kmeans$Freq / sum(dist_kmeans$Freq) * 100, 1)

dist_kmeans %>%
  kable(caption = "Data Distribution per Cluster (K-Means)", col.names = c("Cluster", "Count", "Percent (%)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Data Distribution per Cluster (K-Means)
Cluster Count Percent (%)
1 502 31.4
2 724 45.3
3 373 23.3
fviz_cluster(kmeans_result,
             data    = wine_scaled,
             palette = c("#E74C3C", "#2ECC71", "#3498DB"),
             geom    = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal()) +
  labs(title = "K-Means Clustering Visualization (k=3)",
       subtitle = "Wine Quality Dataset | 2-D PCA projection") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Store the cluster labels in the original data
wine$cluster_kmeans <- kmeans_result$cluster

# Mean profile per cluster
profile_kmeans <- wine %>%
  group_by(`Cluster K-Means` = cluster_kmeans) %>%
  summarise(
    N = n(),
    Alcohol   = round(mean(alcohol), 2),
    Vol_Acid  = round(mean(volatile.acidity), 3),
    Fix_Acid  = round(mean(fixed.acidity), 2),
    Citric    = round(mean(citric.acid), 3),
    Sulphates = round(mean(sulphates), 3),
    pH        = round(mean(pH), 3),
    Quality   = round(mean(quality), 2)
  )

profile_kmeans %>%
  kable(caption = "Mean Profile per Cluster (K-Means)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Mean Profile per Cluster (K-Means)
Cluster K-Means N Alcohol Vol_Acid Fix_Acid Citric Sulphates pH Quality
1 502 10.72 0.405 10.07 0.470 0.752 3.195 5.96
2 724 10.50 0.609 7.19 0.123 0.609 3.406 5.55
3 373 9.88 0.535 8.16 0.290 0.626 3.283 5.36
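
Since `quality` is held out for validation, the cluster labels can be compared with it directly rather than only through cluster means. The sketch below shows one possible check on a hypothetical stand-in data frame `toy` with randomly generated values; in the report the same calls would use `wine$cluster_kmeans` and `wine$quality`.

```r
set.seed(42)

# Hypothetical stand-in for the report's `wine` data frame (random values)
toy <- data.frame(
  quality        = sample(3:8, 200, replace = TRUE),
  cluster_kmeans = sample(1:3, 200, replace = TRUE)
)

# Contingency table: rows are clusters, columns are quality scores
tab <- table(Cluster = toy$cluster_kmeans, Quality = toy$quality)
print(tab)

# Row-wise proportions show the quality mix inside each cluster
row_prop <- prop.table(tab, margin = 1)
print(round(row_prop, 2))

# Rank-based test of whether quality differs between clusters
# (suits an ordinal score better than comparing means alone)
print(kruskal.test(quality ~ factor(cluster_kmeans), data = toy))
```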

3.2 Hierarchical Clustering

# Use a sample to keep the dendrogram efficient to compute and plot
set.seed(42)
n_sample <- 150
idx_sample <- sample(1:nrow(wine_scaled), n_sample)
wine_sample <- wine_scaled[idx_sample, ]

# Compute the distance matrix
dist_matrix <- dist(wine_sample, method = "euclidean")

# Hierarchical Clustering - Ward.D2
hclust_result <- hclust(dist_matrix, method = "ward.D2")

cat("Linkage method : Ward.D2\n")
## Linkage method : Ward.D2
cat("Distance metric: Euclidean\n")
## Distance metric: Euclidean
cat("Sample size    :", n_sample, "samples\n")
## Sample size    : 150 samples

3.2.1 Dendrogram

# Color the dendrogram branches
dend <- as.dendrogram(hclust_result)
dend_colored <- color_branches(dend, k = 3,
                                col = c("#E74C3C", "#2ECC71", "#3498DB"))

par(mar = c(5, 4, 4, 2), cex = 0.6)
plot(dend_colored,
     main = "Dendrogram - Hierarchical Clustering (Ward.D2)\nWine Quality Dataset",
     xlab = paste("n =", n_sample, "samples"),
     ylab = "Height",
     leaflab = "none")
abline(h = 8, col = "#8E44AD", lty = 2, lwd = 2)
legend("topright",
       legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cut point"),
       col    = c("#E74C3C", "#2ECC71", "#3498DB", "#8E44AD"),
       lty = c(1,1,1,2), lwd = 2, cex = 0.85, bty = "n")

3.2.2 Cluster Visualization

hclust_cut <- cutree(hclust_result, k = 3)

cat("Cluster distribution (Hierarchical):\n")
## Cluster distribution (Hierarchical):
print(table(hclust_cut))
## hclust_cut
##   1   2   3 
##  30 118   2
fviz_cluster(list(data = wine_sample, cluster = hclust_cut),
             palette = c("#E74C3C", "#2ECC71", "#3498DB"),
             geom    = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal()) +
  labs(title = "Hierarchical Clustering (k=3) — Wine Quality",
       subtitle = paste("n =", n_sample, "samples | Ward.D2 linkage")) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
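
Hierarchical and K-Means labels are numbered arbitrarily, so comparing the two methods calls for a cross-tabulation on the same rows. The following sketch illustrates this on hypothetical stand-in data (three synthetic Gaussian blobs); the names `hc_labels`, `km_labels`, and `agreement` are not part of the report.

```r
set.seed(42)

# Hypothetical stand-in data: three Gaussian blobs in two dimensions
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
x <- scale(x)

# Ward.D2 hierarchical clustering cut into 3 groups
hc_labels <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# K-Means with 3 centers on the same rows
km_labels <- kmeans(x, centers = 3, nstart = 25)$cluster

# Cross-tabulate the two labelings; agreement shows up as one dominant
# cell per row, regardless of how the labels are numbered
agreement <- table(Hierarchical = hc_labels, KMeans = km_labels)
print(agreement)
```

A table with several large cells per row would indicate that the two methods partition the data differently.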

3.3 DBSCAN

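3.3.1 Choosing the Epsilon and MinPts Parameters

# MinPts = 5 (common rule of thumb)
minPts_val <- 5

# k-NN distance plot
# (dbscan's minPts counts the point itself, so k = minPts - 1 is the usual choice here)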
knn_dist <- dbscan::kNNdist(wine_scaled, k = minPts_val)
knn_sorted <- sort(knn_dist)

plot(knn_sorted,
     type = "l", col = "#8E44AD", lwd = 2,
     main = "k-NN Distance Plot for Choosing Epsilon",
     xlab = "Data points (sorted by distance)",
     ylab = paste0(minPts_val, "-NN Distance"),
     sub  = "The 'elbow' marks a candidate epsilon value")
abline(h = 0.9, col = "#E74C3C", lty = 2, lwd = 2)
legend("topleft",
       legend = c(paste0(minPts_val, "-NN Distance"), "eps = 0.9"),
       col = c("#8E44AD", "#E74C3C"), lty = c(1,2), lwd = 2, cex = 0.9)

📌 DBSCAN parameters:
- eps (epsilon) = 0.9, read from the "elbow" of the k-NN distance plot
- MinPts = 5, following a common rule of thumb: at least 5, or 2 × the number of dimensions (which would be 22 for the 11 features used here)
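
Reading eps off a plot is approximate, so it can help to see how the result changes across a few eps values. The sketch below is an illustration on hypothetical stand-in data (random points in 11 standardized dimensions); `eps_grid` and `sweep` are made-up names, and the grid values are arbitrary.

```r
library(dbscan)

set.seed(42)

# Hypothetical stand-in: 300 random points in 11 standardized dimensions
x <- scale(matrix(rnorm(300 * 11), ncol = 11))

# Candidate eps values to compare (arbitrary grid for illustration)
eps_grid <- c(0.9, 1.5, 2.0, 2.5, 3.0)

# For each eps, record the number of clusters and the share of noise points
sweep <- do.call(rbind, lapply(eps_grid, function(e) {
  cl <- dbscan::dbscan(x, eps = e, minPts = 5)$cluster
  data.frame(
    eps        = e,
    n_clusters = length(setdiff(unique(cl), 0)),  # label 0 marks noise
    noise_pct  = round(mean(cl == 0) * 100, 1)
  )
}))
print(sweep)
```

With minPts fixed, the noise share can only stay the same or fall as eps grows, so a table like this shows how much larger eps must be before most points join a cluster.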

3.3.2 DBSCAN Results

set.seed(42)
dbscan_result <- dbscan::dbscan(wine_scaled, eps = 0.9, minPts = 5)

# Cluster distribution
dist_dbscan <- as.data.frame(table(Cluster = dbscan_result$cluster))
dist_dbscan$Keterangan <- ifelse(dist_dbscan$Cluster == 0, "Noise/Outlier", paste("Cluster", dist_dbscan$Cluster))
dist_dbscan$Persen <- round(dist_dbscan$Freq / sum(dist_dbscan$Freq) * 100, 1)

dist_dbscan %>%
  kable(caption = "DBSCAN Data Distribution (0 = Noise)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
DBSCAN Data Distribution (0 = Noise)
Cluster Freq Keterangan Persen
0 1366 Noise/Outlier 85.4
1 92 Cluster 1 5.8
2 5 Cluster 2 0.3
3 16 Cluster 3 1.0
4 10 Cluster 4 0.6
5 5 Cluster 5 0.3
6 5 Cluster 6 0.3
7 4 Cluster 7 0.3
8 3 Cluster 8 0.2
9 5 Cluster 9 0.3
10 6 Cluster 10 0.4
11 14 Cluster 11 0.9
12 6 Cluster 12 0.4
13 9 Cluster 13 0.6
14 5 Cluster 14 0.3
15 15 Cluster 15 0.9
16 5 Cluster 16 0.3
17 5 Cluster 17 0.3
18 6 Cluster 18 0.4
19 6 Cluster 19 0.4
20 6 Cluster 20 0.4
21 5 Cluster 21 0.3
noise_pct <- round(sum(dbscan_result$cluster == 0) / nrow(wine_scaled) * 100, 1)

fviz_cluster(dbscan_result,
             data    = wine_scaled,
             geom    = "point",
             palette = "Set2",
             ggtheme = theme_minimal()) +
  labs(title = "DBSCAN Clustering — Wine Quality",
       subtitle = paste0("eps = 0.9 | MinPts = 5 | Noise = ", noise_pct, "%")) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))


4 Evaluation

4.1 Silhouette Score

# --- Silhouette K-Means ---
sil_kmeans  <- cluster::silhouette(kmeans_result$cluster, dist(wine_scaled))
avg_sil_km  <- round(mean(sil_kmeans[, 3]), 4)

# --- Silhouette Hierarchical (first 500 rows) ---
set.seed(42)
wine_sub    <- wine_scaled[1:500, ]
hcut_sub    <- cutree(hclust(dist(wine_sub, method = "euclidean"),
                             method = "ward.D2"), k = 3)
sil_hclust  <- cluster::silhouette(hcut_sub, dist(wine_sub))
avg_sil_hc  <- round(mean(sil_hclust[, 3]), 4)

# --- Silhouette DBSCAN ---
idx_nonoise   <- dbscan_result$cluster != 0
dbscan_label  <- dbscan_result$cluster[idx_nonoise]
wine_nonoise  <- wine_scaled[idx_nonoise, ]

if (length(unique(dbscan_label)) > 1) {
  sil_dbscan  <- cluster::silhouette(dbscan_label, dist(wine_nonoise))
  avg_sil_db  <- round(mean(sil_dbscan[, 3]), 4)
} else {
  avg_sil_db  <- NA
}

# Comparison table
eval_df <- data.frame(
  Metode           = c("K-Means", "Hierarchical Clustering", "DBSCAN"),
  Jumlah_Cluster   = c(length(unique(kmeans_result$cluster)),
                       3,
                       length(unique(dbscan_label))),
  Silhouette_Score = c(avg_sil_km, avg_sil_hc, avg_sil_db),
  Keterangan       = c("Spherical, compact clusters",
                       "Ward.D2 linkage",
                       paste0("Noise ", noise_pct, "% ignored"))
)

eval_df %>%
  kable(caption = "Comparison of Clustering Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  # which.max() already skips NA; wrapping in na.omit() would shift the row index
  row_spec(which.max(eval_df$Silhouette_Score),
           bold = TRUE, background = "#d4efdf")
Comparison of Clustering Methods
Metode Jumlah_Cluster Silhouette_Score Keterangan
K-Means 3 0.1894 Spherical, compact clusters
Hierarchical Clustering 3 0.2581 Ward.D2 linkage
DBSCAN 21 0.2579 Noise 85.4% ignored

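🏆 The best method, highlighted in green, is the one with the highest Silhouette Score. The three scores are not computed on the same data, though: K-Means uses all 1,599 rows, hierarchical clustering uses the first 500 rows, and DBSCAN uses only the 14.6% of points not labeled as noise, so the ranking should be read with caution.

A Silhouette Score close to +1 indicates well-separated clusters; a score near 0 indicates overlapping clusters; a negative score suggests points assigned to the wrong cluster.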
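
For a single point i, the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the nearest other cluster. The sketch below works this out by hand on a toy example (six made-up 1-D points, not the wine data) and compares it with `cluster::silhouette()`.

```r
library(cluster)

# Toy data: six 1-D points forming two obvious groups
x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
labels <- rep(1:2, each = 3)
d <- as.matrix(dist(x))

i <- 1  # point with value 1, in cluster 1

# a(i): mean distance to the other members of the same cluster -> (1 + 2) / 2
same <- labels == labels[i] & seq_along(labels) != i
a_i <- mean(d[i, same])

# b(i): mean distance to the nearest other cluster (only one other cluster
# exists here) -> (9 + 10 + 11) / 3
b_i <- mean(d[i, labels != labels[i]])

# Silhouette width: (10 - 1.5) / 10 = 0.85
s_i <- (b_i - a_i) / max(a_i, b_i)
print(s_i)

# Same value from the cluster package
s_pkg <- as.numeric(silhouette(labels, dist(x))[i, "sil_width"])
print(all.equal(s_i, s_pkg))
```

The average of these per-point widths over all rows is what the comparison table reports for each method.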

4.2 Silhouette Plot

fviz_silhouette(sil_kmeans,
                palette = c("#E74C3C", "#2ECC71", "#3498DB"),
                ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot — K-Means (k=3)",
       subtitle = paste("Average Silhouette Score =", avg_sil_km)) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
##   cluster size ave.sil.width
## 1       1  502          0.14
## 2       2  724          0.26
## 3       3  373          0.11

fviz_silhouette(sil_hclust,
                palette = c("#E74C3C", "#2ECC71", "#3498DB"),
                ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot — Hierarchical Clustering (k=3)",
       subtitle = paste("Average Silhouette Score =", avg_sil_hc)) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
##   cluster size ave.sil.width
## 1       1  373          0.24
## 2       2   22          0.18
## 3       3  105          0.34


5 Interpretation

5.1 Profile of Each Cluster

wine$cluster_label <- factor(wine$cluster_kmeans,
                              labels = c("Cluster 1", "Cluster 2", "Cluster 3"))

cluster_full <- wine %>%
  group_by(`Cluster` = cluster_label) %>%
  summarise(
    `N (data)`          = n(),
    `Alcohol (%)`       = round(mean(alcohol), 2),
    `Volatile Acidity`  = round(mean(volatile.acidity), 3),
    `Fixed Acidity`     = round(mean(fixed.acidity), 2),
    `Citric Acid`       = round(mean(citric.acid), 3),
    `Sulphates`         = round(mean(sulphates), 3),
    `pH`                = round(mean(pH), 3),
    `Total SO2`         = round(mean(total.sulfur.dioxide), 1),
    `Quality (avg)`     = round(mean(quality), 2)
  )

cluster_full %>%
  kable(caption = "Full Profile of Each Cluster (K-Means, k=3)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#2c3e50", color = "white")
Full Profile of Each Cluster (K-Means, k=3)
Cluster N (data) Alcohol (%) Volatile Acidity Fixed Acidity Citric Acid Sulphates pH Total SO2 Quality (avg)
Cluster 1 502 10.72 0.405 10.07 0.470 0.752 3.195 30.6 5.96
Cluster 2 724 10.50 0.609 7.19 0.123 0.609 3.406 35.0 5.55
Cluster 3 373 9.88 0.535 8.16 0.290 0.626 3.283 90.1 5.36

5.2 Profile Visualization

cluster_long <- cluster_full %>%
  select(Cluster, `Alcohol (%)`, `Volatile Acidity`, `Citric Acid`, `Sulphates`, pH) %>%
  pivot_longer(-Cluster, names_to = "Variabel", values_to = "Nilai") %>%
  group_by(Variabel) %>%
  mutate(Nilai_Norm = rescale(Nilai, to = c(0, 1)))

ggplot(cluster_long, aes(x = Variabel, y = Nilai_Norm, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.88, width = 0.7) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71", "#3498DB")) +
  labs(title    = "Characteristic Profile of Each Cluster (K-Means)",
       subtitle  = "Values rescaled to 0-1 for comparison across variables",
       x = "Chemical Variable", y = "Rescaled Value", fill = "Cluster") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(wine, aes(x = cluster_label, y = alcohol, fill = cluster_label)) +
  geom_boxplot(alpha = 0.85, outlier.color = "gray50") +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71", "#3498DB"), guide = "none") +
  labs(title = "Distribution of Alcohol Content per Cluster",
       x = "Cluster", y = "Alcohol (% volume)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

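5.3 Meaning and Insights for Each Cluster

Meaning and Segmentation of Each Cluster

| Cluster | Characteristics | Interpretation | Segmentation |
|---|---|---|---|
| Cluster 1: Structured, higher-quality wine | Highest alcohol, sulphates, fixed acidity, and citric acid; lowest volatile acidity and pH | Cleaner profile with little acetic character; highest average quality (5.96) | Premium / upper-mid market |
| Cluster 2: Light wine with high volatile acidity | Highest volatile acidity and pH; lowest fixed acidity, citric acid, and sulphates | Sharper, more acetic taste; mid-range average quality (5.55); largest group | Mid-market / everyday consumption |
| Cluster 3: Low-alcohol, high-SO2 wine | Lowest alcohol; total SO2 around 90 mg/L, well above the other two clusters | Lowest average quality (5.36) | Economy market / cooking wine |

The characteristics follow the cluster means in Section 5.1. Average quality differs only modestly between clusters (5.36 to 5.96), so these labels are indicative rather than definitive.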

5.3.1 Business Insights

📊 Practical implications:

  1. Wine producers can use this clustering for:

    • 🔍 Automated quality control: flagging wines with elevated volatile acidity (highest on average in Cluster 2, 0.609 g/L)
    • 💰 Pricing based on the chemical profile rather than subjective assessment alone
    • 🏭 Production optimization: steering the process toward the Cluster 1 profile, which has the highest average quality (5.96)
  2. Retailers can segment wine products to give customers better-targeted recommendations.

  3. Cluster 2 is the largest segment (724 wines, 45.3%), so maintaining its production volume matters most.


6 Conclusion

Summary of Clustering Results, Wine Quality

| Aspect | Result |
|---|---|
| Dataset | Wine Quality Red (UCI): 1,599 rows, 11 numeric features plus the quality label, no missing values |
| Preprocessing | Z-score normalization applied to all 11 physicochemical variables |
| K-Means | k=3 (Elbow Method); Silhouette = 0.1894 |
| Hierarchical | k=3 (Ward.D2); Silhouette = 0.2581 |
| DBSCAN | eps=0.9, MinPts=5; Noise = 85.4%; Silhouette = 0.2579 |
| Best method | Hierarchical clustering (highest Silhouette = 0.2581), narrowly ahead of DBSCAN (0.2579, computed on non-noise points only) |

Report generated with R Markdown | Dataset: UCI Machine Learning Repository, Wine Quality