1 Introduction

This report presents the work for Module 3 (Clustering) of the Analisis Multivariat (Multivariate Analysis) course. Cluster analysis is an unsupervised learning technique that groups data into clusters based on similarity, aiming for high homogeneity within each cluster and high heterogeneity between clusters.

The three clustering methods used in this report are:

Method     Type       Centroid  Distance
K-Means    Partition  Mean      Euclidean
K-Medians  Partition  Median    Manhattan
DBSCAN     Density    -         Euclidean
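
To make the two distance measures in the table concrete, the sketch below computes both for a pair of points. The two points are hypothetical values chosen for easy arithmetic; they are not taken from the HCV data.

```r
# Two hypothetical observations with three features each
a <- c(0, 0, 0)
b <- c(3, 4, 0)
pts <- rbind(a, b)

# Euclidean: square root of the sum of squared differences
euclid <- as.numeric(dist(pts, method = "euclidean"))     # sqrt(3^2 + 4^2) = 5

# Manhattan: sum of absolute differences
manhattan <- as.numeric(dist(pts, method = "manhattan"))  # |3| + |4| = 7

cat("Euclidean:", euclid, " Manhattan:", manhattan, "\n")
```

Because Euclidean distance squares each difference, a single large gap in one feature carries more weight than it does under Manhattan distance.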

2 Dataset

2.1 Dataset Description

The dataset used is the HCV (Hepatitis C Virus) dataset from the UCI Machine Learning Repository. It contains laboratory data from patients with a range of liver-related conditions, from healthy blood donors to cirrhosis.

library(tidyverse)
library(flexclust)
library(dbscan)
library(cluster)
library(fpc)
library(factoextra)
library(ggcorrplot)
library(gridExtra)
df_raw <- read.csv("hcvdat0.csv", sep = ";", row.names = 1,
                   stringsAsFactors = FALSE)

cat("Dataset dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")
## Dataset dimensions: 615 rows x 13 columns
head(df_raw, 8)

2.2 Variable Description

Variable  Type     Description
Category  Nominal  Target label (Blood Donor / Hepatitis / …)
Age       Numeric  Patient age (years)
Sex       Nominal  Sex (m/f)
ALB       Numeric  Albumin (g/dL)
ALP       Numeric  Alkaline Phosphatase (U/L)
ALT       Numeric  Alanine Aminotransferase (U/L)
AST       Numeric  Aspartate Aminotransferase (U/L)
BIL       Numeric  Bilirubin (µmol/L)
CHE       Numeric  Cholinesterase (kU/L)
CHOL      Numeric  Cholesterol (mmol/L)
CREA      Numeric  Creatinine (µmol/L)
GGT       Numeric  Gamma-Glutamyl Transferase (U/L)
PROT      Numeric  Total Protein (g/dL)

df_raw %>%
  count(Category, name = "Jumlah") %>%
  mutate(Persentase = round(Jumlah / sum(Jumlah) * 100, 1)) %>%
  arrange(desc(Jumlah))

3 Preprocessing

3.1 Dropping the Target Column & Encoding

Because clustering is an unsupervised method, the Category (target) column is dropped from the analysis. The Sex column is converted to a binary indicator (male = 1, female = 0).

df <- df_raw %>% select(-Category)
df$Sex <- ifelse(df$Sex == "m", 1, 0)

3.2 Handling Missing Values

data.frame(
  Kolom   = names(df),
  Missing = colSums(is.na(df)),
  Persen  = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% filter(Missing > 0)

Missing values are present in the ALB, ALP, ALT, CHOL, and PROT columns. They are imputed with each column's median, because several variables are right-skewed and the median is less sensitive to extreme values than the mean.

df <- df %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

cat("Missing values after imputation:", sum(is.na(df)), "\n")
## Missing values after imputation: 0
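
As a small illustration of why the median is preferred for skewed columns, the sketch below compares the two candidate fill values on a made-up vector. The numbers are hypothetical and only mimic a right-skewed lab measurement with one extreme reading.

```r
# Hypothetical right-skewed values with one missing entry
x <- c(10, 12, 11, 13, 500, NA)

mean_fill   <- mean(x, na.rm = TRUE)    # pulled upward by the extreme value 500
median_fill <- median(x, na.rm = TRUE)  # stays near the bulk of the data

# Median imputation, same pattern as used for the real columns
x_imputed <- ifelse(is.na(x), median_fill, x)

cat("Mean fill:", mean_fill, " Median fill:", median_fill, "\n")
```

Here the mean would fill the gap with 109.2, a value unlike any typical observation, while the median fills it with 12.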

3.3 Standardization

df_scaled <- scale(df) %>% as.data.frame()

4 Exploratory Data Analysis (EDA)

4.1 Feature Distributions

df %>%
  pivot_longer(cols = everything(), names_to = "Variabel", values_to = "Nilai") %>%
  ggplot(aes(x = Nilai)) +
  geom_histogram(bins = 30, fill = "#1E88E5", color = "white", alpha = 0.85) +
  facet_wrap(~ Variabel, scales = "free", ncol = 4) +
  labs(title = "Distribution of Each Feature",
       subtitle = "HCV dataset, before standardization",
       x = "Value", y = "Frequency") +
  theme_minimal(base_size = 10) +
  theme(strip.text = element_text(face = "bold"))

Several features, such as ALT, BIL, GGT, and CREA, have strongly right-skewed distributions, indicating extreme outliers, which are typical of clinical laboratory data.

4.2 Boxplot: Outlier Identification

df %>%
  pivot_longer(cols = everything(), names_to = "Variabel", values_to = "Nilai") %>%
  ggplot(aes(x = Variabel, y = Nilai, fill = Variabel)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.7, outlier.alpha = 0.4) +
  labs(title = "Boxplot of All Features",
       subtitle = "Points beyond the whiskers indicate outliers",
       x = "Feature", y = "Value") +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

4.3 Correlation Heatmap

cor_matrix <- cor(df, use = "complete.obs")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#E53935", "white", "#1E88E5"),
           title    = "Correlation Heatmap Between Features",
           ggtheme  = theme_minimal(base_size = 10))

There is a fairly strong positive correlation between ALB and CHE, and between AST and ALT, which is clinically consistent with their role as markers of liver function.


5 Determining the Optimal Number of Clusters

5.1 Elbow Method

set.seed(123)
wss_values <- sapply(1:10, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
  geom_line(color = "#1E88E5") +
  geom_point(size = 3, color = "#E53935") +
  scale_x_continuous(breaks = 1:10) +
  labs(title    = "Elbow Method: Choosing the Optimal K",
       subtitle = "The bend marks a candidate number of clusters",
       x = "Number of Clusters (K)", y = "Total Within-Cluster SS") +
  theme_minimal()

5.2 Silhouette Method

set.seed(123)
sil_values <- sapply(2:10, function(k) {
  km <- kmeans(df_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df_scaled))[, 3])
})

ggplot(data.frame(k = 2:10, Sil = sil_values), aes(x = k, y = Sil)) +
  geom_line(color = "#1E88E5") +
  geom_point(size = 3, color = "#E53935") +
  scale_x_continuous(breaks = 2:10) +
  labs(title    = "Silhouette Method: Choosing the Optimal K",
       subtitle = "The highest value marks the most cohesive clustering",
       x = "Number of Clusters (K)", y = "Average Silhouette Width") +
  theme_minimal()

Based on both methods, K = 3 is chosen as the optimal number of clusters.

K_OPTIMAL <- 3
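
The silhouette criterion can also be read off programmatically instead of from the plot. The following is a minimal sketch on hypothetical toy data with three well-separated groups; `toy`, `k_grid`, and `best_k` are illustrative names, and the result says nothing about the HCV data itself.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Hypothetical toy data: three well-separated 2-D groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.4), ncol = 2),
             matrix(rnorm(60, mean = 5,  sd = 0.4), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 0.4), ncol = 2))
toy_dist <- dist(toy)  # computed once and reused for every K

k_grid <- 2:6
avg_sil <- sapply(k_grid, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, toy_dist)[, 3])  # column 3 holds the widths
})

# K with the highest average silhouette width
best_k <- k_grid[which.max(avg_sil)]
cat("Best K by average silhouette:", best_k, "\n")
```

On real data the maximum is often less clear-cut, so it is reasonable to weigh it against the elbow plot rather than rely on `which.max()` alone.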

6 K-Means Clustering

6.1 Algorithm

K-Means works as follows: (1) initialize K random centroids, (2) assign each point to the nearest centroid by Euclidean distance, (3) update each centroid to the mean of the points in its cluster, and (4) repeat steps 2 and 3 until convergence.
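
The four steps can be sketched in a few lines of base R. This is an illustrative loop on hypothetical toy data, not the implementation behind `kmeans()`; for reproducibility it starts from one fixed point per group instead of a random draw, which is an assumption of the sketch.

```r
set.seed(1)
# Hypothetical toy data: two well-separated 2-D groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
k <- 2

# (1) Initial centroids: fixed rows here (assumption), random in practice
centers <- x[c(1, 21), , drop = FALSE]

for (iter in 1:100) {
  # (2) Squared Euclidean distance of every point to every centroid
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cl <- max.col(-d2, ties.method = "first")  # index of the nearest centroid

  # (3) New centroid = column means of the points assigned to it
  new_centers <- t(sapply(seq_len(k), function(j)
    colMeans(x[cl == j, , drop = FALSE])))

  # (4) Stop once the centroids no longer move
  if (all(abs(new_centers - centers) < 1e-9)) break
  centers <- new_centers
}

cat("Stopped after", iter, "iterations; cluster sizes:", table(cl), "\n")
```

The sketch does not handle empty clusters or multiple restarts, both of which `kmeans()` with `nstart` covers.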

6.2 Clustering Results

set.seed(123)
km_result <- kmeans(df_scaled, centers = K_OPTIMAL,
                    nstart = 25, iter.max = 100)

cat("Iterations to convergence:", km_result$iter, "\n")
## Iterations to convergence: 4
cat("Between-SS / Total-SS ratio:",
    round(km_result$betweenss / km_result$totss * 100, 2), "%\n\n")
## Between-SS / Total-SS ratio: 23.28 %
data.frame(
  Cluster = 1:K_OPTIMAL,
  Ukuran  = km_result$size,
  WSS     = round(km_result$withinss, 2)
)

6.3 Visualization

pca_result <- prcomp(df_scaled, scale. = FALSE)
pca_var    <- summary(pca_result)$importance[2, 1:2] * 100

pca_km <- data.frame(pca_result$x[, 1:2],
                     Cluster = factor(km_result$cluster))

ggplot(pca_km, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6, size = 1.8) +
  stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
  scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  scale_fill_manual(values  = c("#E53935", "#1E88E5", "#43A047")) +
  labs(title    = "K-Means Clustering (K = 3)",
       subtitle = "2-D PCA Projection",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)")) +
  theme_minimal()

6.4 Cluster Profiles

df %>%
  mutate(Cluster = factor(km_result$cluster)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), ~ round(mean(.), 3)))

7 K-Medians Clustering

7.1 Algorithm

K-Medians is a variant of K-Means with two differences: the centroid is the median (not the mean) and the distance is Manhattan (L₁ norm), which makes it more robust to outliers.
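
The robustness claim rests on the fact that the median minimizes the sum of absolute (L₁) deviations. A minimal one-dimensional sketch, using made-up values with a single extreme point, shows the effect; `l1_cost` is a helper defined only for this illustration.

```r
# Hypothetical 1-D cluster containing one extreme value
x <- c(1, 2, 3, 4, 100)

# Total Manhattan (L1) distance from all points to a candidate center
l1_cost <- function(center) sum(abs(x - center))

c_mean   <- mean(x)    # dragged toward the outlier
c_median <- median(x)  # stays with the bulk of the points

cat("Mean center:", c_mean, " L1 cost:", l1_cost(c_mean), "\n")
cat("Median center:", c_median, " L1 cost:", l1_cost(c_median), "\n")
```

The median center sits at 3 with an L₁ cost of 101, while the mean center is pulled to 22 and costs 156.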

7.2 Clustering Results

set.seed(123)
kmed_result <- kcca(df_scaled, k = K_OPTIMAL,
                    family  = kccaFamily("kmedians"),
                    control = list(iter.max = 100))

kmed_clusters <- clusters(kmed_result)

data.frame(
  Cluster = names(table(kmed_clusters)),
  Ukuran  = as.integer(table(kmed_clusters))
)

7.3 Visualization

pca_df <- data.frame(pca_result$x[, 1:2],
                     Cluster = factor(kmed_clusters))

ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6, size = 1.8) +
  stat_ellipse(aes(fill = Cluster), alpha = 0.15, geom = "polygon") +
  scale_color_manual(values = c("#E53935", "#1E88E5", "#43A047")) +
  scale_fill_manual(values  = c("#E53935", "#1E88E5", "#43A047")) +
  labs(title    = "K-Medians Clustering (K = 3)",
       subtitle = "2-D PCA Projection",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)")) +
  theme_minimal()

7.4 Cluster Profiles

df %>%
  mutate(Cluster = factor(kmed_clusters)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), ~ round(median(.), 3)))

8 DBSCAN Clustering

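8.1 Algorithm

DBSCAN does not require K to be specified in advance. Its parameters are ε (eps), the search radius, and MinPts, the minimum number of points within radius ε. Points that do not meet this density requirement, that is, points that are neither dense themselves nor within ε of a dense point, are labeled as noise (label 0).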
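
The roles that DBSCAN assigns can be worked out by hand on a tiny example. The sketch below uses hypothetical one-dimensional points and base R only; it assumes the common convention that a point counts toward its own neighborhood, and it classifies points without forming the clusters themselves.

```r
# Hypothetical 1-D points: a dense group, one nearby point, one isolated point
x <- matrix(c(0, 0.1, 0.2, 0.3, 0.75, 5), ncol = 1)
eps     <- 0.5
min_pts <- 4

d <- as.matrix(dist(x))

# Neighbors within eps, counting the point itself (assumed convention)
n_within <- rowSums(d <= eps)
is_core  <- n_within >= min_pts

# A non-core point within eps of some core point is a border point
near_core <- apply(d[, is_core, drop = FALSE] <= eps, 1, any)

role <- ifelse(is_core, "core", ifelse(near_core, "border", "noise"))
print(data.frame(x = x[, 1], neighbors = n_within, role = role))
```

The first four points are core points, 0.75 is a border point because it lies within ε of the core point 0.3, and 5 is noise.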

8.2 Choosing the eps Parameter

MINPTS <- 5

kNNdistplot(df_scaled, k = MINPTS - 1)
title(main = "k-Distance Plot for Choosing eps")
abline(h = 1.5, col = "red", lty = 2, lwd = 2)
legend("topleft", legend = "eps = 1.5", col = "red", lty = 2)

The “knee” of the plot lies at around eps = 1.5.

8.3 Clustering Results

EPS_VALUE <- 1.5

set.seed(123)
db_result <- dbscan::dbscan(df_scaled, eps = EPS_VALUE, minPts = MINPTS)

n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise       <- sum(db_result$cluster == 0)

cat("Clusters detected :", n_clusters_db, "\n")
## Clusters detected : 2
cat("Noise points      :", n_noise,
    paste0("(", round(n_noise / nrow(df_scaled) * 100, 1), "%)\n"))
## Noise points      : 189 (30.7%)
as.data.frame(table(db_result$cluster)) %>%
  rename(Cluster = Var1, Jumlah = Freq)

8.4 Visualization

pca_df_db <- data.frame(pca_result$x[, 1:2],
                         Cluster = factor(db_result$cluster),
                         IsNoise  = db_result$cluster == 0)

ggplot(pca_df_db, aes(x = PC1, y = PC2, color = Cluster, shape = IsNoise)) +
  geom_point(alpha = 0.7, size = 1.8) +
  scale_shape_manual(values = c(`FALSE` = 16, `TRUE` = 4),
                     labels = c("Data Point", "Noise"),
                     name   = "Tipe") +
  labs(title    = paste0("DBSCAN (eps = ", EPS_VALUE, ", MinPts = ", MINPTS, ")"),
       subtitle = "The × marker denotes a noise point",
       x = paste0("PC1 (", round(pca_var[1], 1), "%)"),
       y = paste0("PC2 (", round(pca_var[2], 1), "%)"),
       color = "Cluster") +
  theme_minimal()


9 Metric Evaluation

9.1 Comparison Table

dist_matrix <- dist(df_scaled)

sil_km   <- mean(silhouette(km_result$cluster, dist_matrix)[, 3])
sil_kmed <- mean(silhouette(kmed_clusters,      dist_matrix)[, 3])

db_mask    <- db_result$cluster != 0
db_nonoise <- db_result$cluster[db_mask]
df_nonoise <- df_scaled[db_mask, ]

sil_db  <- mean(silhouette(db_nonoise, dist(df_nonoise))[, 3])
st_km   <- cluster.stats(dist_matrix, km_result$cluster)
st_kmed <- cluster.stats(dist_matrix, kmed_clusters)
st_db   <- tryCatch(cluster.stats(dist(df_nonoise), db_nonoise),
                    error = function(e) NULL)
dunn_db <- if (!is.null(st_db)) st_db$dunn else NA

tibble(
  Metode        = c("K-Means", "K-Medians", "DBSCAN"),
  `K / Cluster` = c(K_OPTIMAL, K_OPTIMAL, n_clusters_db),
  Silhouette    = round(c(sil_km, sil_kmed, sil_db), 4),
  `Dunn Index`  = round(c(st_km$dunn, st_kmed$dunn, dunn_db), 4),
  `CH Index`    = round(c(st_km$ch, st_kmed$ch, NA_real_), 2),
  `Within-SS`   = round(c(st_km$within.cluster.ss,
                           st_kmed$within.cluster.ss, NA_real_), 2)
)

9.2 Silhouette Plot

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

plot(silhouette(km_result$cluster, dist_matrix),
     col = c("#E53935","#1E88E5","#43A047"),
     border = NA, main = "Silhouette — K-Means")

plot(silhouette(kmed_clusters, dist_matrix),
     col = c("#E53935","#1E88E5","#43A047"),
     border = NA, main = "Silhouette — K-Medians")

par(mfrow = c(1, 1))

10 Visual Comparison of the Three Methods

pca_compare <- data.frame(pca_result$x[, 1:2],
                            KMeans   = factor(km_result$cluster),
                            KMedians = factor(kmed_clusters),
                            DBSCAN   = factor(db_result$cluster))

pca_x <- paste0("PC1 (", round(pca_var[1], 1), "%)")
pca_y <- paste0("PC2 (", round(pca_var[2], 1), "%)")

p1 <- ggplot(pca_compare, aes(PC1, PC2, color = KMeans)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
  labs(title = "K-Means", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

p2 <- ggplot(pca_compare, aes(PC1, PC2, color = KMedians)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("#E53935","#1E88E5","#43A047")) +
  labs(title = "K-Medians", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

p3 <- ggplot(pca_compare, aes(PC1, PC2, color = DBSCAN)) +
  geom_point(alpha = 0.7, size = 1.5) +
  labs(title = "DBSCAN", x = pca_x, y = pca_y, color = "Cluster") +
  theme_minimal(base_size = 10)

grid.arrange(p1, p2, p3, ncol = 3,
             top = "Comparison of Clustering Results: PCA Projection")


11 Conclusion

Based on the clustering analysis of the HCV dataset with 615 observations and 12 numeric features:

  1. K-Means produces 3 clusters (351 | 35 | 229 observations). The small Cluster 2 shows an extreme biomarker profile, with AST (106), BIL (52), and CREA (129), and likely represents patients with severe liver conditions.

  2. K-Medians produces a more balanced split (150 | 116 | 349 observations), because the median and the Manhattan distance are more robust to outliers.

  3. DBSCAN detects 2 clusters with 189 noise points (30.7%). The high noise share reflects the many atypical clinical observations in the dataset. DBSCAN yields the highest Silhouette (0.2423) and Dunn Index (0.3636), although both are computed after excluding the noise points, so they are not directly comparable with the K-Means and K-Medians values.

  4. K-Means leads on the CH Index (92.88), while DBSCAN leads on Silhouette and Dunn Index. The choice of method depends on the goal of the analysis and the tolerance for noise.


This report was produced with R Markdown | Analisis Multivariat, FMIPA UNESA