This report presents the work for Module 3 (Clustering) of the Multivariate Analysis course. Cluster analysis is an unsupervised learning technique that groups observations into clusters by similarity, aiming for high homogeneity within each cluster and high heterogeneity between clusters.
The three clustering methods used in this report are:
| Method | Type | Centroid | Distance |
|---|---|---|---|
| K-Means | Partitioning | Mean | Euclidean |
| K-Medians | Partitioning | Median | Manhattan |
| DBSCAN | Density-based | None | Euclidean |
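To make the Distance column concrete, the following base-R sketch compares the two metrics on a pair of made-up points. The points are illustrative only and are not taken from the HCV data.

```r
# Two illustrative 2-D points (made up for this example)
pts <- rbind(a = c(0, 0), b = c(3, 4))

# Euclidean (L2) distance: sqrt(3^2 + 4^2)
d_euclid <- as.numeric(dist(pts, method = "euclidean"))

# Manhattan (L1) distance: |3| + |4|
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))

c(euclidean = d_euclid, manhattan = d_manhattan)
```

Because the Euclidean distance squares each coordinate difference, a single very large difference on one feature dominates it more than it dominates the Manhattan distance.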
The dataset is the HCV (Hepatitis C Virus) dataset from the UCI Machine Learning Repository. It contains laboratory values from patients with a range of liver-related conditions, from healthy blood donors to cirrhosis.
library(tidyverse)
library(flexclust)
library(dbscan)
library(cluster)
library(fpc)
library(factoextra)
library(ggcorrplot)
library(gridExtra)

df_raw <- read.csv("hcvdat0.csv", sep = ";", row.names = 1,
                   stringsAsFactors = FALSE)
cat("Dataset dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")
## Dataset dimensions: 615 rows x 13 columns
| Variable | Type | Description |
|---|---|---|
| Category | Nominal | Target label (Blood Donor / Hepatitis / …) |
| Age | Numeric | Patient age (years) |
| Sex | Nominal | Sex (m/f) |
| ALB | Numeric | Albumin (g/L) |
| ALP | Numeric | Alkaline Phosphatase (U/L) |
| ALT | Numeric | Alanine Aminotransferase (U/L) |
| AST | Numeric | Aspartate Aminotransferase (U/L) |
| BIL | Numeric | Bilirubin (µmol/L) |
| CHE | Numeric | Cholinesterase (kU/L) |
| CHOL | Numeric | Cholesterol (mmol/L) |
| CREA | Numeric | Creatinine (µmol/L) |
| GGT | Numeric | Gamma-Glutamyl Transferase (U/L) |
| PROT | Numeric | Total Protein (g/L) |
df_raw %>%
  count(Category, name = "Count") %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
  arrange(desc(Count))

Because clustering is an unsupervised method, the Category column (the target) is dropped from the analysis. The Sex column is recoded as binary (male = 1, female = 0).
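The chunk that performs this preprocessing is not shown in the report. As a minimal sketch of the two steps described above, assuming nothing beyond base R, the example below uses a small made-up data frame (`df_raw_demo` and `df_demo` are hypothetical names, and the rows are invented for illustration).

```r
# Toy stand-in for df_raw; the rows are made up for illustration
df_raw_demo <- data.frame(
  Category = c("0=Blood Donor", "1=Hepatitis", "3=Cirrhosis"),
  Age      = c(32, 45, 59),
  Sex      = c("m", "f", "m"),
  ALB      = c(38.5, 41.0, 27.3),
  stringsAsFactors = FALSE
)

# Drop the target column, then recode Sex as 1 (male) / 0 (female)
df_demo <- df_raw_demo[, setdiff(names(df_raw_demo), "Category")]
df_demo$Sex <- ifelse(df_demo$Sex == "m", 1, 0)

str(df_demo)
```

Applied to `df_raw`, the same two steps would presumably yield the `df` object used by the chunks that follow.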
data.frame(
  Column  = names(df),
  Missing = colSums(is.na(df)),
  Percent = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% filter(Missing > 0)

Missing values occur in the ALB, ALP, ALT, CHOL, and PROT columns. They are imputed with the per-column median, which is less sensitive to extreme values than the mean, because several variables are right-skewed.
df <- df %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
cat("Missing values after imputation:", sum(is.na(df)), "\n")
## Missing values after imputation: 0
df %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#1E88E5", color = "white", alpha = 0.85) +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  labs(title = "Distribution of Each Feature",
       subtitle = "HCV dataset, before standardization",
       x = "Value", y = "Frequency") +
  theme_minimal(base_size = 10) +
  theme(strip.text = element_text(face = "bold"))

Several features, such as ALT, BIL, GGT, and CREA, show strongly right-skewed distributions. This points to extreme outliers, which are typical of clinical laboratory data.
df %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value, fill = Variable)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.7, outlier.alpha = 0.4) +
  labs(title = "Boxplot of All Features",
       subtitle = "Points beyond the whiskers indicate outliers",
       x = "Feature", y = "Value") +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

cor_matrix <- cor(df, use = "complete.obs")
ggcorrplot(cor_matrix,
           method = "square",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("#E53935", "white", "#1E88E5"),
           title = "Correlation Heatmap of the Features",
           ggtheme = theme_minimal(base_size = 10))

There are fairly strong positive correlations between ALB and CHE and between AST and ALT. Both are clinically plausible, since these variables are markers of liver function.
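The histogram subtitle mentions standardization, and the chunks that follow use an object named `df_scaled` whose definition is not shown. A plausible assumption is z-score standardization with base R's `scale()`, so that features measured on very different scales contribute comparably to the distances. The sketch below illustrates this on made-up values (`toy` and `toy_scaled` are hypothetical names).

```r
# Toy data with very different scales (made-up values)
toy <- data.frame(ALB = c(38.5, 41.0, 27.3, 44.1),
                  GGT = c(12.1, 150.0, 33.4, 650.9))

# z-score standardization: subtract the column mean, divide by the column SD
toy_scaled <- scale(toy)

round(colMeans(toy_scaled), 10)  # each column mean is ~0
apply(toy_scaled, 2, sd)         # each column SD is 1
```

In the report's context this would presumably correspond to `df_scaled <- scale(df)`, although the original chunk may differ.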
set.seed(123)
wss_values <- sapply(1:10, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})
ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
  geom_line(color = "#1E88E5") +
  geom_point(size = 3, color = "#E53935") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method: Choosing the Optimal K",
       subtitle = "The bend in the curve suggests a suitable number of clusters",
       x = "Number of Clusters (K)", y = "Total Within-Cluster SS") +
  theme_minimal()

set.seed(123)
sil_values <- sapply(2:10, function(k) {
  km <- kmeans(df_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df_scaled))[, 3])
})
ggplot(data.frame(k = 2:10, Sil = sil_values), aes(x = k, y = Sil)) +
  geom_line(color = "#1E88E5") +
  geom_point(size = 3, color = "#E53935") +
  scale_x_continuous(breaks = 2:10) +
  labs(title = "Silhouette Method: Choosing the Optimal K",
       subtitle = "The highest value indicates the most cohesive number of clusters",
       x = "Number of Clusters (K)", y = "Average Silhouette Width") +
  theme_minimal()

Based on both methods, K = 3 is chosen as the optimal number of clusters.
K-Means works as follows: (1) initialize K centroids at random; (2) assign each point to its nearest centroid by Euclidean distance; (3) update each centroid to the mean of the points in its cluster; (4) repeat steps 2 and 3 until convergence.
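The four steps can be sketched directly in base R. This is an illustrative Lloyd-style implementation run on simulated data, not the routine the report uses: the report calls `stats::kmeans`, whose default algorithm is Hartigan-Wong. The function name `simple_kmeans` and the toy data are hypothetical.

```r
# Minimal Lloyd-style K-Means sketch (illustrative only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # (1) pick k distinct rows as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # (2) squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of nearest centroid
    # (4) stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # (3) move each centroid to the mean of its members
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(123)
# Simulated data: two well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

A single random start can end in a poor local optimum on harder data, which is one reason to run several starts, as `nstart = 25` does in the report's `kmeans()` calls.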
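K_OPTIMAL <- 3  # assumed definition (K = 3 as chosen above); the original chunk that sets it is not shown

set.seed(123)
km_result <- kmeans(df_scaled, centers = K_OPTIMAL,
                    nstart = 25, iter.max = 100)
cat("Iterations until convergence:", km_result$iter, "\n")
## Iterations until convergence: 4
## Between-SS / Total-SS ratio: 23.28 %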
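The second output line reports the share of the total sum of squares that lies between cluster centroids. The chunk that printed it is not shown, so the following is only an assumption about how such a figure can be obtained from a `kmeans` object, using its `betweenss` and `totss` components. The data are simulated, so the number will differ from the report's.

```r
set.seed(123)
# Simulated data: two loose groups in 2-D (not the HCV data)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))

km_toy <- kmeans(toy, centers = 2, nstart = 25)

# Share of the total sum of squares that lies between cluster centroids
bss_ratio <- km_toy$betweenss / km_toy$totss * 100
cat("Between-SS / Total-SS ratio:", round(bss_ratio, 2), "%\n")
```

Read this way, a ratio of 23.28% would mean that most of the variation remains within clusters, which suggests the three K-Means clusters are not sharply separated.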
pca_result <- prcomp(df_scaled, scale. = FALSE)
pca_var <- summary(pca_result)$importance[2, 1:2] * 100
pca_km <- data.frame(pca_result$x[, 1:2],
                     Cluster = factor(km_result$cluster))
ggplot(pca_km, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6, size = 1.8) +
  stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
  scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  scale_fill_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  labs(title = "K-Means Clustering (K = 3)",
       subtitle = "2D PCA Projection",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)")) +
  theme_minimal()

K-Medians is a variant of K-Means with two differences: the centroid is the median rather than the mean, and the distance is Manhattan (L₁ norm). Both choices make it more robust to outliers.
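The robustness claim is easy to see on a small made-up example: one extreme value drags the mean far from the bulk of the data but leaves the median almost untouched. The values below are invented for illustration.

```r
# Five made-up values on one feature; the last one is an extreme outlier
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22: pulled toward the outlier
median(x)  # 3: stays with the bulk of the data

# The same idea per feature: a cluster "centroid" under each rule
cluster_pts <- cbind(f1 = x, f2 = c(10, 11, 12, 13, 14))
colMeans(cluster_pts)           # mean-based centroid (K-Means style)
apply(cluster_pts, 2, median)   # median-based centroid (K-Medians style)
```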
set.seed(123)
kmed_result <- kcca(df_scaled, k = K_OPTIMAL,
                    family = kccaFamily("kmedians"),
                    control = list(iter.max = 100))
kmed_clusters <- clusters(kmed_result)
data.frame(
  Cluster = names(table(kmed_clusters)),
  Size    = as.integer(table(kmed_clusters))
)

pca_df <- data.frame(pca_result$x[, 1:2],
                     Cluster = factor(kmed_clusters))
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6, size = 1.8) +
  stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
  scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  scale_fill_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  labs(title = "K-Medians Clustering (K = 3)",
       subtitle = "2D PCA Projection",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)")) +
  theme_minimal()

DBSCAN does not require K to be specified in advance. Its parameters are ε (eps), the search radius, and MinPts, the minimum number of points that must lie within radius ε for a point to count as a core point. Points that are neither core points nor within ε of a core point are labeled as noise (label 0).
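The core-point condition can be checked by hand with a distance matrix. The base-R sketch below does so on simulated data; it only identifies core points and does not implement the full algorithm, which also distinguishes border points from noise. It assumes the convention, which I believe the `dbscan` package follows, that a point counts toward its own neighborhood. The names `pts`, `eps`, and `min_pts` are hypothetical.

```r
set.seed(123)
# 30 tightly packed points plus one isolated point (simulated data)
pts <- rbind(matrix(rnorm(60, sd = 0.2), ncol = 2), c(5, 5))

eps     <- 0.5
min_pts <- 5

# Number of points within eps of each point, counting the point itself
d <- as.matrix(dist(pts))
n_within_eps <- rowSums(d <= eps)

# A point is a core point when its eps-neighborhood holds at least min_pts points
is_core <- n_within_eps >= min_pts

table(is_core)
is_core[31]  # the isolated point is not a core point
```

Under the same convention, the k-distance plot used to choose eps looks at the (MinPts - 1)-th nearest neighbor, since the point itself already accounts for one of the MinPts.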
MINPTS <- 5
kNNdistplot(df_scaled, k = MINPTS - 1)
title(main = "k-Distance Plot for Choosing eps")
abline(h = 1.5, col = "red", lty = 2, lwd = 2)
legend("topleft", legend = "eps = 1.5", col = "red", lty = 2)

The “knee” of the curve lies at around eps = 1.5.
EPS_VALUE <- 1.5
set.seed(123)
db_result <- dbscan::dbscan(df_scaled, eps = EPS_VALUE, minPts = MINPTS)
n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise <- sum(db_result$cluster == 0)
cat("Clusters detected :", n_clusters_db, "\n")
## Clusters detected : 2
cat("Noise points      :", n_noise,
    paste0("(", round(n_noise / nrow(df_scaled) * 100, 1), "%)\n"))
## Noise points      : 189 (30.7%)
pca_df_db <- data.frame(pca_result$x[, 1:2],
                        Cluster = factor(db_result$cluster),
                        IsNoise = db_result$cluster == 0)
ggplot(pca_df_db, aes(x = PC1, y = PC2, color = Cluster, shape = IsNoise)) +
  geom_point(alpha = 0.7, size = 1.8) +
  scale_shape_manual(values = c(`FALSE` = 16, `TRUE` = 4),
                     labels = c("Data Point", "Noise"),
                     name = "Type") +
  labs(title = paste0("DBSCAN (eps = ", EPS_VALUE, ", MinPts = ", MINPTS, ")"),
       subtitle = "× marks noise points",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)"),
       color = "Cluster") +
  theme_minimal()

dist_matrix <- dist(df_scaled)
sil_km <- mean(silhouette(km_result$cluster, dist_matrix)[, 3])
sil_kmed <- mean(silhouette(kmed_clusters, dist_matrix)[, 3])
db_mask <- db_result$cluster != 0
db_nonoise <- db_result$cluster[db_mask]
df_nonoise <- df_scaled[db_mask, ]
sil_db <- mean(silhouette(db_nonoise, dist(df_nonoise))[, 3])
st_km <- cluster.stats(dist_matrix, km_result$cluster)
st_kmed <- cluster.stats(dist_matrix, kmed_clusters)
st_db <- tryCatch(cluster.stats(dist(df_nonoise), db_nonoise),
error = function(e) NULL)
dunn_db <- if (!is.null(st_db)) st_db$dunn else NA
tibble(
  Method        = c("K-Means", "K-Medians", "DBSCAN"),
  `K / Cluster` = c(K_OPTIMAL, K_OPTIMAL, n_clusters_db),
  Silhouette    = round(c(sil_km, sil_kmed, sil_db), 4),
  `Dunn Index`  = round(c(st_km$dunn, st_kmed$dunn, dunn_db), 4),
  `CH Index`    = round(c(st_km$ch, st_kmed$ch, NA_real_), 2),
  `Within-SS`   = round(c(st_km$within.cluster.ss,
                          st_kmed$within.cluster.ss, NA_real_), 2)
)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
plot(silhouette(km_result$cluster, dist_matrix),
     col = c("#E53935", "#1E88E5", "#43A047"),
     border = NA, main = "Silhouette: K-Means")
plot(silhouette(kmed_clusters, dist_matrix),
     col = c("#E53935", "#1E88E5", "#43A047"),
     border = NA, main = "Silhouette: K-Medians")

pca_compare <- data.frame(pca_result$x[, 1:2],
                          KMeans   = factor(km_result$cluster),
                          KMedians = factor(kmed_clusters),
                          DBSCAN   = factor(db_result$cluster))
pca_x <- paste0("PC1 (", round(pca_var[1], 1), "%)")
pca_y <- paste0("PC2 (", round(pca_var[2], 1), "%)")
p1 <- ggplot(pca_compare, aes(PC1, PC2, color = KMeans)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
labs(title = "K-Means", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
p2 <- ggplot(pca_compare, aes(PC1, PC2, color = KMedians)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
labs(title = "K-Medians", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
p3 <- ggplot(pca_compare, aes(PC1, PC2, color = DBSCAN)) +
geom_point(alpha = 0.7, size = 1.5) +
labs(title = "DBSCAN", x = pca_x, y = pca_y, color = "Cluster") +
theme_minimal(base_size = 10)
grid.arrange(p1, p2, p3, ncol = 3,
             top = "Comparison of Clustering Results: PCA Projection")

Based on the clustering analysis of the HCV dataset with 615 observations and 12 numeric features:
K-Means produces 3 clusters (351 | 35 | 229 observations). The small Cluster 2 shows an extreme biomarker profile, with AST (106), BIL (52), and CREA (129), and likely represents patients with severe liver conditions.
K-Medians produces a more balanced split (150 | 116 | 349 observations), because the median and the Manhattan distance are more robust to outliers.
DBSCAN detects 2 clusters with 189 noise points (30.7%). The high noise share reflects the many atypical clinical observations in the dataset. DBSCAN yields the highest Silhouette (0.2423) and Dunn Index (0.3636), but both are computed on the non-noise points only, so they are not directly comparable with the K-Means and K-Medians values, which cover all 615 observations.
K-Means leads on the CH Index (92.88), which was not computed for DBSCAN, while DBSCAN leads on Silhouette and Dunn Index. The choice of method depends on the goal of the analysis and the tolerance for noise.
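For reference, the three indices compared above are commonly defined as follows. The notation is introduced here for illustration and does not appear in the report: $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster, $\delta(C_k, C_l)$ is the smallest distance between a point in $C_k$ and a point in $C_l$, $\Delta(C_m)$ is the largest distance between two points in $C_m$, $B$ and $W$ are the between-cluster and within-cluster sums of squares, $n$ is the number of observations, and $K$ is the number of clusters.

```latex
% Silhouette width of point i (the reported value is the average over all i)
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1

% Dunn index: minimum separation over maximum diameter
D = \frac{\min_{k \neq l} \delta(C_k, C_l)}{\max_{m} \Delta(C_m)}

% Calinski-Harabasz (CH) index
\mathrm{CH} = \frac{B / (K - 1)}{W / (n - K)}
```

Higher values are better for all three indices.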
This report was produced with R Markdown | Multivariate Analysis, FMIPA UNESA