1 Data Understanding

1.1 Dataset Information

The dataset used is Wine Quality (Red Wine) from the UCI Machine Learning Repository.

Information	Detail
Name	Wine Quality (Red Wine)
Source	UCI Machine Learning Repository
URL	`https://archive.ics.uci.edu/dataset/186/wine+quality`
Number of Records	1,599 rows
Variables	12 (11 physicochemical features + 1 target)
ML Task	Clustering (Unsupervised)

1.1.1 Dataset Background

This dataset contains physicochemical test results for red Vinho Verde wine from Portugal. It suits clustering because all 11 input variables are continuous numeric measurements of the wine's chemical characteristics.

Analysis goal: group wines by their chemical characteristics to support automatic quality segmentation, which may be useful to wine producers, retailers, and consumers.

1.2 Variable Descriptions

No	Variable	Description	Unit
1	`fixed.acidity`	Fixed acidity	g/L
2	`volatile.acidity`	Volatile acidity (acetic acid)	g/L
3	`citric.acid`	Citric acid, adds freshness	g/L
4	`residual.sugar`	Sugar remaining after fermentation	g/L
5	`chlorides`	Salt content	g/L
6	`free.sulfur.dioxide`	Free SO2, prevents oxidation	mg/L
7	`total.sulfur.dioxide`	Total SO2 (free + bound)	mg/L
8	`density`	Wine density	g/mL
9	`pH`	Acidity level	0–14 scale
10	`sulphates`	Antimicrobial additive	g/L
11	`alcohol`	Alcohol content	% volume
12	`quality`	Quality score (label)	0–10

Note: The quality variable is used only for validation; it is not used in the clustering itself.

1.3 Loading the Data & Structure

# Load all required packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(cluster)
library(factoextra)
library(corrplot)
library(dendextend)
library(dbscan)
library(scales)
library(knitr)
library(kableExtra)

wine <- read.csv("winequality-red.csv", sep = ";", header = TRUE)

cat("Successfully loaded the Wine Quality (Red Wine) dataset\n")

## Successfully loaded the Wine Quality (Red Wine) dataset

cat("Dimensions:", nrow(wine), "rows x", ncol(wine), "columns\n")

## Dimensions: 1599 rows x 12 columns

# Inspect the data structure
str(wine)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

1.4 Descriptive Statistics

# Descriptive statistics
summary_table <- summary(wine)
print(summary_table)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

# Visualize the distribution of every variable
wine_long <- wine %>%
  pivot_longer(cols = everything(), names_to = "Variabel", values_to = "Nilai")

ggplot(wine_long, aes(x = Nilai, fill = Variabel)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8) +
  facet_wrap(~ Variabel, scales = "free", ncol = 4) +
  scale_fill_viridis_d(guide = "none") +
  labs(title = "Distribution of Each Variable - Wine Quality Dataset",
       x = "Value", y = "Frequency") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        strip.text = element_text(face = "bold", size = 9))

2 Data Preprocessing

2.1 Checking for Missing Values

# Check missing values per column
missing_df <- data.frame(
  Variabel = names(wine),
  Missing   = colSums(is.na(wine)),
  Persen    = round(colMeans(is.na(wine)) * 100, 2)
)

missing_df %>%
  kable(caption = "Number of Missing Values per Variable", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Number of Missing Values per Variable
Variabel	Missing	Persen
fixed.acidity	0	0
volatile.acidity	0	0
citric.acid	0	0
residual.sugar	0	0
chlorides	0	0
free.sulfur.dioxide	0	0
total.sulfur.dioxide	0	0
density	0	0
pH	0	0
sulphates	0	0
alcohol	0	0
quality	0	0

cat("\nTotal missing values:", sum(is.na(wine)), "\n")

## 
## Total missing values: 0

✅ No missing values. The dataset is clean and ready for further processing.

2.2 Data Normalization

# Select the 11 physicochemical variables (excluding the 'quality' label)
wine_features <- wine[, 1:11]

# Z-score normalization (mean = 0, sd = 1)
wine_scaled <- as.data.frame(scale(wine_features))

# Verify the normalization
norm_check <- data.frame(
  Variabel = names(wine_scaled),
  Mean_Before = round(apply(wine_features, 2, mean), 4),
  SD_Before   = round(apply(wine_features, 2, sd), 4),
  Mean_After  = round(apply(wine_scaled, 2, mean), 6),
  SD_After    = round(apply(wine_scaled, 2, sd), 4)
)

norm_check %>%
  kable(caption = "Statistics Before and After Normalization",
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Statistics Before and After Normalization
Variabel	Mean_Before	SD_Before	SD_After
fixed.acidity	8.3196	1.7411	1
volatile.acidity	0.5278	0.1791	1
citric.acid	0.2710	0.1948	1
residual.sugar	2.5388	1.4099	1
chlorides	0.0875	0.0471	1
free.sulfur.dioxide	15.8749	10.4602	1
total.sulfur.dioxide	46.4678	32.8953	1
density	0.9967	0.0019	1
pH	3.3111	0.1544	1
sulphates	0.6581	0.1695	1
alcohol	10.4230	1.0657	1

Method: Z-score standardization. Each variable has its mean subtracted and is then divided by its standard deviation, so all variables share the same scale (mean ≈ 0, sd = 1).
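The standardization described above can be checked by hand. The following is a minimal sketch on hypothetical toy data (the data frame `toy` and its columns `a` and `b` are assumptions made for illustration, not part of the wine dataset); it compares a manual z-score with the result of `scale()`.

```r
# Hypothetical toy data: two columns on very different scales
set.seed(1)
toy <- data.frame(a = rnorm(50, mean = 10, sd = 2),
                  b = rnorm(50, mean = 0.5, sd = 0.05))

# Manual z-score: subtract the column mean, divide by the sample sd
manual_z <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Same transformation via scale(), which centers and scales by default
scaled_z <- as.data.frame(scale(toy))

# The two results should agree up to floating-point error
max_diff <- max(abs(as.matrix(manual_z) - as.matrix(scaled_z)))
print(max_diff < 1e-12)
```

Because `scale()` divides by the sample standard deviation (n - 1 denominator), it matches `sd()` exactly, which is why the two versions agree.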

2.3 Initial Visualization (Scatter Plots)

# Scatter Plot 1: Alcohol vs Volatile Acidity
ggplot(wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
  geom_point(alpha = 0.55, size = 2) +
  scale_color_brewer(palette = "RdYlGn", name = "Quality") +
  labs(title = "Scatter Plot: Alcohol vs Volatile Acidity",
       subtitle = "Colored by wine quality score",
       x = "Alcohol (% volume)", y = "Volatile Acidity (g/L)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Scatter Plot 2: Fixed Acidity vs Citric Acid
ggplot(wine, aes(x = fixed.acidity, y = citric.acid, color = factor(quality))) +
  geom_point(alpha = 0.55, size = 2) +
  scale_color_brewer(palette = "RdYlGn", name = "Quality") +
  labs(title = "Scatter Plot: Fixed Acidity vs Citric Acid",
       subtitle = "Colored by wine quality score",
       x = "Fixed Acidity (g/L)", y = "Citric Acid (g/L)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Correlation Matrix
cor_matrix <- cor(wine_features)
corrplot(cor_matrix, method = "color", type = "upper",
         tl.cex = 0.85, addCoef.col = "black", number.cex = 0.65,
         col = colorRampPalette(c("#D73027", "white", "#1A9641"))(200),
         title = "Correlation Matrix - Wine Quality Features",
         mar = c(0, 0, 2, 0))

3 Clustering

3.1 K-Means

3.1.1 Determining the Number of Clusters (Elbow Method)

set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(wine_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

elbow_df <- data.frame(k = 1:10, wss = wss)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line(color = "#E74C3C", linewidth = 1.3) +
  geom_point(color = "#C0392B", size = 4) +
  geom_vline(xintercept = 3, linetype = "dashed",
             color = "steelblue", linewidth = 1.1) +
  annotate("text", x = 3.4, y = max(wss) * 0.9,
           label = "optimal k = 3", color = "steelblue",
           fontface = "bold", size = 4.5) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method - Determining the Optimal Number of Clusters",
       subtitle = "K-Means | Wine Quality Dataset",
       x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares (WSS)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

📌 Based on the Elbow Method, the decrease in WSS slows noticeably from k = 3 onward, so k = 3 is chosen as the number of clusters.
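Reading an elbow by eye can be subjective, so it may help to look at how much WSS drops at each step. The sketch below uses synthetic data (three hypothetical, well-separated blobs named `toy`, not the wine data) to show what a clear elbow looks like numerically; on real data such as this one the drop-off is usually less sharp.

```r
# Synthetic data (assumption): three well-separated 2-D blobs of 50 points each
set.seed(42)
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss_toy <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

# Drop in WSS when moving from k to k + 1, as a percentage of WSS at k = 1
drop_pct <- -diff(wss_toy) / wss_toy[1] * 100
print(data.frame(step = paste(ks[-6], "->", ks[-1]),
                 drop_pct = round(drop_pct, 1)))
```

In this toy case the drops for 1 -> 2 and 2 -> 3 are large and the drop for 3 -> 4 is small, which is the pattern the elbow heuristic looks for.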

3.1.2 K-Means Results (k=3)

set.seed(42)
kmeans_result <- kmeans(wine_scaled, centers = 3, nstart = 25, iter.max = 100)

# Cluster distribution
dist_kmeans <- as.data.frame(table(Cluster = kmeans_result$cluster))
dist_kmeans$Persen <- round(dist_kmeans$Freq / sum(dist_kmeans$Freq) * 100, 1)

dist_kmeans %>%
  kable(caption = "Data Distribution per Cluster (K-Means)", col.names = c("Cluster", "Count", "Percent (%)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Data Distribution per Cluster (K-Means)
Cluster	Count	Percent (%)
1	502	31.4
2	724	45.3
3	373	23.3

fviz_cluster(kmeans_result,
             data    = wine_scaled,
             palette = c("#E74C3C", "#2ECC71", "#3498DB"),
             geom    = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal()) +
  labs(title = "K-Means Clustering Visualization (k=3)",
       subtitle = "Wine Quality Dataset, 2-D PCA Projection") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Store the results in the original data
wine$cluster_kmeans <- kmeans_result$cluster

# Mean profile per cluster
profile_kmeans <- wine %>%
  group_by(`Cluster K-Means` = cluster_kmeans) %>%
  summarise(
    N = n(),
    Alcohol   = round(mean(alcohol), 2),
    Vol_Acid  = round(mean(volatile.acidity), 3),
    Fix_Acid  = round(mean(fixed.acidity), 2),
    Citric    = round(mean(citric.acid), 3),
    Sulphates = round(mean(sulphates), 3),
    pH        = round(mean(pH), 3),
    Quality   = round(mean(quality), 2)
  )

profile_kmeans %>%
  kable(caption = "Mean Profile per Cluster (K-Means)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Mean Profile per Cluster (K-Means)
Cluster K-Means	N	Alcohol	Vol_Acid	Fix_Acid	Citric	Sulphates	pH	Quality
1	502	10.72	0.405	10.07	0.470	0.752	3.195	5.96
2	724	10.50	0.609	7.19	0.123	0.609	3.406	5.55
3	373	9.88	0.535	8.16	0.290	0.626	3.283	5.36

3.2 Hierarchical Clustering

# Use a random sample to keep the dendrogram efficient and readable
set.seed(42)
n_sample <- 150
idx_sample <- sample(1:nrow(wine_scaled), n_sample)
wine_sample <- wine_scaled[idx_sample, ]

# Compute pairwise distances
dist_matrix <- dist(wine_sample, method = "euclidean")

# Hierarchical Clustering - Ward.D2
hclust_result <- hclust(dist_matrix, method = "ward.D2")

cat("Linkage method: Ward.D2\n")

## Linkage method: Ward.D2

cat("Distance metric: Euclidean\n")

## Distance metric: Euclidean

cat("Sample size:", n_sample, "samples\n")

## Sample size: 150 samples

3.2.1 Dendrogram

# Color the dendrogram branches
dend <- as.dendrogram(hclust_result)
dend_colored <- color_branches(dend, k = 3,
                                col = c("#E74C3C", "#2ECC71", "#3498DB"))

par(mar = c(5, 4, 4, 2), cex = 0.6)
plot(dend_colored,
     main = "Dendrogram - Hierarchical Clustering (Ward.D2)\nWine Quality Dataset",
     xlab = paste("n =", n_sample, "samples"),
     ylab = "Height",
     leaflab = "none")
abline(h = 8, col = "#8E44AD", lty = 2, lwd = 2)
legend("topright",
       legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cut point"),
       col    = c("#E74C3C", "#2ECC71", "#3498DB", "#8E44AD"),
       lty = c(1,1,1,2), lwd = 2, cex = 0.85, bty = "n")

3.2.2 Cluster Visualization

hclust_cut <- cutree(hclust_result, k = 3)

cat("Cluster distribution (Hierarchical):\n")

## Cluster distribution (Hierarchical):

print(table(hclust_cut))

## hclust_cut
##   1   2   3 
##  30 118   2

fviz_cluster(list(data = wine_sample, cluster = hclust_cut),
             palette = c("#E74C3C", "#2ECC71", "#3498DB"),
             geom    = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal()) +
  labs(title = "Hierarchical Clustering (k=3) — Wine Quality",
       subtitle = paste("n =", n_sample, "samples | Ward.D2 linkage")) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

3.3 DBSCAN

3.3.1 Choosing the Epsilon & MinPts Parameters

# MinPts = 5 (common rule of thumb)
minPts_val <- 5

# k-NN Distance Plot
knn_dist <- dbscan::kNNdist(wine_scaled, k = minPts_val)
knn_sorted <- sort(knn_dist)

plot(knn_sorted,
     type = "l", col = "#8E44AD", lwd = 2,
     main = "k-NN Distance Plot for Choosing Epsilon",
     xlab = "Data points (sorted by distance)",
     ylab = paste0(minPts_val, "-NN Distance"),
     sub  = "The 'elbow' point suggests a suitable epsilon")
abline(h = 0.9, col = "#E74C3C", lty = 2, lwd = 2)
legend("topleft",
       legend = c(paste0(minPts_val, "-NN Distance"), "eps = 0.9"),
       col = c("#8E44AD", "#E74C3C"), lty = c(1,2), lwd = 2, cex = 0.9)

📌 DBSCAN parameters:
- eps (epsilon) = 0.9, read from the "elbow" of the k-NN distance plot
- MinPts = 5, a commonly used minimum; the alternative rule of thumb, 2 × the number of dimensions, would give 22 for these 11 features
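To make the link between the k-NN distance curve and eps concrete, here is a minimal base-R sketch on hypothetical toy data (`toy`, `candidate_eps`, and the choice k = MinPts - 1 are assumptions for illustration; the report itself uses `dbscan::kNNdist` with k = 5). It computes each point's k-th nearest-neighbor distance and shows what share of points could not be core points at a given eps.

```r
# Hypothetical toy data: 200 points in 2-D, standardized
set.seed(42)
toy <- scale(matrix(rnorm(400), ncol = 2))

k <- 4  # assumption: k = MinPts - 1, since DBSCAN counts the point itself

# Full pairwise Euclidean distance matrix (fine for small n)
d <- as.matrix(dist(toy))

# For each point, sort its distances; position 1 is the point itself
# (distance 0), so the k-th nearest neighbor sits at position k + 1
knn_k <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted curve, as drawn in a k-NN distance plot
knn_sorted <- sort(knn_k)

# Share of points whose k-NN distance exceeds a candidate eps.
# Such points cannot be core points; they end up as border points or noise.
candidate_eps <- c(0.1, 0.3, 0.5)
share_above <- sapply(candidate_eps, function(e) mean(knn_k > e))
print(data.frame(eps = candidate_eps, share_above = round(share_above, 3)))
```

If a large share of points lies above the chosen eps, a high noise fraction is to be expected, and a larger eps (or a smaller MinPts) may be worth trying.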

3.3.2 DBSCAN Results

set.seed(42)
dbscan_result <- dbscan::dbscan(wine_scaled, eps = 0.9, minPts = 5)

# Cluster distribution
dist_dbscan <- as.data.frame(table(Cluster = dbscan_result$cluster))
dist_dbscan$Keterangan <- ifelse(dist_dbscan$Cluster == 0, "Noise/Outlier", paste("Cluster", dist_dbscan$Cluster))
dist_dbscan$Persen <- round(dist_dbscan$Freq / sum(dist_dbscan$Freq) * 100, 1)

dist_dbscan %>%
  kable(caption = "DBSCAN Data Distribution (0 = Noise)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

DBSCAN Data Distribution (0 = Noise)
Cluster	Freq	Keterangan	Persen
0	1366	Noise/Outlier	85.4
1	92	Cluster 1	5.8
2	5	Cluster 2	0.3
3	16	Cluster 3	1.0
4	10	Cluster 4	0.6
5	5	Cluster 5	0.3
6	5	Cluster 6	0.3
7	4	Cluster 7	0.3
8	3	Cluster 8	0.2
9	5	Cluster 9	0.3
10	6	Cluster 10	0.4
11	14	Cluster 11	0.9
12	6	Cluster 12	0.4
13	9	Cluster 13	0.6
14	5	Cluster 14	0.3
15	15	Cluster 15	0.9
16	5	Cluster 16	0.3
17	5	Cluster 17	0.3
18	6	Cluster 18	0.4
19	6	Cluster 19	0.4
20	6	Cluster 20	0.4
21	5	Cluster 21	0.3

noise_pct <- round(sum(dbscan_result$cluster == 0) / nrow(wine_scaled) * 100, 1)

fviz_cluster(dbscan_result,
             data    = wine_scaled,
             geom    = "point",
             palette = "Set2",
             ggtheme = theme_minimal()) +
  labs(title = "DBSCAN Clustering — Wine Quality",
       subtitle = paste0("eps = 0.9 | MinPts = 5 | Noise = ", noise_pct, "%")) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

4 Evaluation

4.1 Silhouette Score

# --- Silhouette K-Means ---
sil_kmeans  <- cluster::silhouette(kmeans_result$cluster, dist(wine_scaled))
avg_sil_km  <- round(mean(sil_kmeans[, 3]), 4)

# --- Silhouette Hierarchical (first 500 rows) ---
set.seed(42)
wine_sub    <- wine_scaled[1:500, ]
hcut_sub    <- cutree(hclust(dist(wine_sub, method = "euclidean"),
                             method = "ward.D2"), k = 3)
sil_hclust  <- cluster::silhouette(hcut_sub, dist(wine_sub))
avg_sil_hc  <- round(mean(sil_hclust[, 3]), 4)

# --- Silhouette DBSCAN ---
idx_nonoise   <- dbscan_result$cluster != 0
dbscan_label  <- dbscan_result$cluster[idx_nonoise]
wine_nonoise  <- wine_scaled[idx_nonoise, ]

if (length(unique(dbscan_label)) > 1) {
  sil_dbscan  <- cluster::silhouette(dbscan_label, dist(wine_nonoise))
  avg_sil_db  <- round(mean(sil_dbscan[, 3]), 4)
} else {
  avg_sil_db  <- NA
}

# Comparison table
eval_df <- data.frame(
  Metode           = c("K-Means", "Hierarchical Clustering", "DBSCAN"),
  Jumlah_Cluster   = c(length(unique(kmeans_result$cluster)),
                       3,
                       length(unique(dbscan_label))),
  Silhouette_Score = c(avg_sil_km, avg_sil_hc, avg_sil_db),
  Keterangan       = c("Spherical, compact clusters",
                       "Ward.D2 linkage",
                       paste0("Noise ", noise_pct, "% excluded"))
)

eval_df %>%
  kable(caption = "Evaluation Comparison of Clustering Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  # which.max() skips NA and returns the position in the original vector,
  # so no na.omit() is needed (it would shift the row index)
  row_spec(which.max(eval_df$Silhouette_Score),
           bold = TRUE, background = "#d4efdf")

Evaluation Comparison of Clustering Methods
Metode	Jumlah_Cluster	Silhouette_Score	Keterangan
K-Means	3	0.1894	Spherical, compact clusters
Hierarchical Clustering	3	0.2581	Ward.D2 linkage
DBSCAN	21	0.2579	Noise 85.4% excluded

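🏆 The method with the highest Silhouette Score is highlighted in green. The scores are not computed on the same data: K-Means uses all 1,599 rows, Hierarchical Clustering uses the first 500 rows, and DBSCAN uses only the 14.6% of points not labeled as noise, so the comparison is indicative only.

A Silhouette Score near +1 indicates well-separated clusters; a score near 0 indicates overlapping clusters; a negative score indicates points that are likely assigned to the wrong cluster.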
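For intuition about where these values come from, the silhouette width of a point i is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to any other cluster. The sketch below computes this by hand on a hypothetical six-point example (`x`, `labels`, and `silhouette_width` are illustrative names, not from the report); it assumes every cluster has at least two points.

```r
# Hypothetical toy data: two small 1-D clusters
x <- c(1, 2, 3, 10, 11, 12)
labels <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))

silhouette_width <- function(i) {
  own <- labels == labels[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # mean distance within own cluster
  others <- setdiff(unique(labels), labels[i])
  # smallest mean distance to any other cluster
  b <- min(sapply(others, function(cl) mean(d[i, labels == cl])))
  (b - a) / max(a, b)
}

s <- sapply(seq_along(x), silhouette_width)
print(round(s, 3))        # per-point silhouette widths
print(round(mean(s), 3))  # average silhouette width
```

For the first point, a = 1.5 and b = 10, giving s = 0.85; values this close to 1 reflect the wide gap between the two toy clusters, in contrast to the much lower scores in the table above.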

4.2 Silhouette Plot

fviz_silhouette(sil_kmeans,
                palette = c("#E74C3C", "#2ECC71", "#3498DB"),
                ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot — K-Means (k=3)",
       subtitle = paste("Average Silhouette Score =", avg_sil_km)) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

##   cluster size ave.sil.width
## 1       1  502          0.14
## 2       2  724          0.26
## 3       3  373          0.11

fviz_silhouette(sil_hclust,
                palette = c("#E74C3C", "#2ECC71", "#3498DB"),
                ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot — Hierarchical Clustering (k=3)",
       subtitle = paste("Average Silhouette Score =", avg_sil_hc)) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

##   cluster size ave.sil.width
## 1       1  373          0.24
## 2       2   22          0.18
## 3       3  105          0.34

5 Interpretation

5.1 Profile of Each Cluster

wine$cluster_label <- factor(wine$cluster_kmeans,
                              labels = c("Cluster 1", "Cluster 2", "Cluster 3"))

cluster_full <- wine %>%
  group_by(`Cluster` = cluster_label) %>%
  summarise(
    `N (data)`          = n(),
    `Alcohol (%)`       = round(mean(alcohol), 2),
    `Volatile Acidity`  = round(mean(volatile.acidity), 3),
    `Fixed Acidity`     = round(mean(fixed.acidity), 2),
    `Citric Acid`       = round(mean(citric.acid), 3),
    `Sulphates`         = round(mean(sulphates), 3),
    `pH`                = round(mean(pH), 3),
    `Total SO2`         = round(mean(total.sulfur.dioxide), 1),
    `Quality (avg)`     = round(mean(quality), 2)
  )

cluster_full %>%
  kable(caption = "Full Profile of Each Cluster (K-Means, k=3)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#2c3e50", color = "white")

Full Profile of Each Cluster (K-Means, k=3)
Cluster	N (data)	Alcohol (%)	Volatile Acidity	Fixed Acidity	Citric Acid	Sulphates	pH	Total SO2	Quality (avg)
Cluster 1	502	10.72	0.405	10.07	0.470	0.752	3.195	30.6	5.96
Cluster 2	724	10.50	0.609	7.19	0.123	0.609	3.406	35.0	5.55
Cluster 3	373	9.88	0.535	8.16	0.290	0.626	3.283	90.1	5.36

5.2 Profile Visualization

cluster_long <- cluster_full %>%
  select(Cluster, `Alcohol (%)`, `Volatile Acidity`, `Citric Acid`, `Sulphates`, pH) %>%
  pivot_longer(-Cluster, names_to = "Variabel", values_to = "Nilai") %>%
  group_by(Variabel) %>%
  mutate(Nilai_Norm = rescale(Nilai, to = c(0, 1)))

ggplot(cluster_long, aes(x = Variabel, y = Nilai_Norm, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.88, width = 0.7) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71", "#3498DB")) +
  labs(title    = "Characteristic Profile of Each Cluster (K-Means)",
       subtitle  = "Values rescaled to 0–1 for comparison across variables",
       x = "Chemical Variable", y = "Rescaled Value", fill = "Cluster") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(wine, aes(x = cluster_label, y = alcohol, fill = cluster_label)) +
  geom_boxplot(alpha = 0.85, outlier.color = "gray50") +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71", "#3498DB"), guide = "none") +
  labs(title = "Distribution of Alcohol Content per Cluster",
       x = "Cluster", y = "Alcohol (% volume)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

5.3 Meaning & Insights for Each Cluster

Meaning and Segmentation of Each Cluster
Cluster	Characteristics	Interpretation	Segmentation
Cluster 1: Premium Wine	Highest alcohol (10.72%), highest sulphates (0.752), lowest volatile acidity (0.405), highest fixed and citric acidity	Highest average quality (5.96)	Premium wine / upper-middle market
Cluster 2: Light Wine with High Volatile Acidity	Highest volatile acidity (0.609), lowest citric acid (0.123) and fixed acidity (7.19), highest pH (3.406)	Sharper taste, mid-range average quality (5.55); the largest cluster (724 wines)	Mid-market / everyday consumption
Cluster 3: Low-Alcohol, High-SO2 Wine	Lowest alcohol (9.88%), highest total SO2 (90.1 mg/L), other variables near the middle	Lowest average quality (5.36)	Economy market / cooking wine

The differences in average quality between clusters are modest (5.36 to 5.96), so these labels describe tendencies rather than sharp quality tiers.

5.3.1 Business Insights

📊 Practical implications:

Wine producers can use this clustering for:

🔍 Automated quality control: flagging wines with elevated volatile acidity (the Cluster 2 profile)

💰 Pricing based on chemical profile rather than subjective assessment alone

🏭 Production optimization: steering the process toward the Cluster 1 (premium) profile

Retailers can segment wine products for more accurate customer recommendations.

Cluster 2 is the largest segment (45.3% of wines), so its production volume is worth maintaining.

6 Conclusion

Summary of Clustering Analysis Results: Wine Quality
Aspect	Result
Dataset	Wine Quality Red (UCI): 1,599 rows, 11 numeric features, no missing values
Preprocessing	Z-score normalization applied to all 11 physicochemical variables
K-Means	k=3 (Elbow Method) | Silhouette = 0.1894
Hierarchical	k=3 (Ward.D2) | Silhouette = 0.2581
DBSCAN	eps=0.9, MinPts=5 | Noise = 85.4% | Silhouette = 0.2579
Best Method	Hierarchical Clustering (highest Silhouette = 0.2581)

Report generated with R Markdown | Dataset: UCI Machine Learning Repository, Wine Quality

Clustering Analysis - Wine Quality Dataset

K-Means, Hierarchical Clustering, and DBSCAN

[Student Name] - NIM [NIM]

26 April 2026