Deskripsi Singkat

Notebook ini digunakan untuk melakukan analisis clustering pada dataset obesitas dengan lima metode, yaitu K-Means, K-Median, DBSCAN, Mean Shift, dan Fuzzy C-Means.

Catatan Penting

Dataset mentah masih berisi campuran variabel numerik dan kategorikal. Agar dapat digunakan untuk clustering, variabel kategorikal akan diubah menjadi numerik menggunakan one-hot encoding, kemudian seluruh fitur akan distandardisasi. Setelah preprocessing, data yang masuk ke model sudah berbentuk numerik dan jumlah fiturnya lebih dari 10.

Alur Analisis

  1. Import library
  2. Membaca dan mengecek data awal
  3. Preprocessing data
  4. Eksplorasi data
  5. Clustering dengan 5 metode
  6. Evaluasi dan perbandingan hasil
  7. Penyimpanan output

1. Import Library

Pada tahap ini dilakukan import library yang diperlukan untuk membaca data, preprocessing, eksplorasi data, visualisasi, clustering, dan evaluasi hasil clustering.

# Tujuan: instalasi package jika belum tersedia (jalankan sekali saja).
install.packages(c(
  "tidyverse", "cluster", "factoextra", "dbscan",
  "e1071", "corrplot", "reshape2", "knitr", "kableExtra"
))
# Tujuan: mengimpor library utama yang dibutuhkan dalam analisis clustering.

library(tidyverse)    # manipulasi data & visualisasi (ggplot2, dplyr, dll.)
library(cluster)      # silhouette score, k-medoids
library(factoextra)   # visualisasi clustering & elbow method
library(dbscan)       # DBSCAN & Mean Shift
library(e1071)        # Fuzzy C-Means (cmeans)
library(corrplot)     # heatmap korelasi
library(reshape2)     # melt data frame
library(knitr)        # kable
library(kableExtra)   # kable styling

2. Membaca dan Mengecek Data Awal

Tahap ini bertujuan untuk memahami struktur awal dataset, termasuk ukuran data, tipe data, missing value, duplikasi, statistik deskriptif, dan distribusi kategori.

# Tujuan: membaca dataset obesity ke dalam dataframe.

df <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv", stringsAsFactors = FALSE)
head(df)
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
Female 21 1.62 64.0 yes no 2 3 Sometimes no 2 no 0 1 no Public_Transportation Normal_Weight
Female 21 1.52 56.0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation Normal_Weight
Male 23 1.80 77.0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation Normal_Weight
Male 27 1.80 87.0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking Overweight_Level_I
Male 22 1.78 89.8 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation Overweight_Level_II
Male 29 1.62 53.0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile Normal_Weight
# Tujuan: melihat ukuran dataset, nama kolom, dan tipe data setiap variabel.

cat("Ukuran dataset:", nrow(df), "baris x", ncol(df), "kolom\n")
## Ukuran dataset: 2111 baris x 17 kolom
cat("\nNama kolom:\n")
## 
## Nama kolom:
print(colnames(df))
##  [1] "Gender"                         "Age"                           
##  [3] "Height"                         "Weight"                        
##  [5] "family_history_with_overweight" "FAVC"                          
##  [7] "FCVC"                           "NCP"                           
##  [9] "CAEC"                           "SMOKE"                         
## [11] "CH2O"                           "SCC"                           
## [13] "FAF"                            "TUE"                           
## [15] "CALC"                           "MTRANS"                        
## [17] "NObeyesdad"
cat("\nTipe data:\n")
## 
## Tipe data:
print(sapply(df, class))
##                         Gender                            Age 
##                    "character"                      "numeric" 
##                         Height                         Weight 
##                      "numeric"                      "numeric" 
## family_history_with_overweight                           FAVC 
##                    "character"                    "character" 
##                           FCVC                            NCP 
##                      "numeric"                      "numeric" 
##                           CAEC                          SMOKE 
##                    "character"                    "character" 
##                           CH2O                            SCC 
##                      "numeric"                    "character" 
##                            FAF                            TUE 
##                      "numeric"                      "numeric" 
##                           CALC                         MTRANS 
##                    "character"                    "character" 
##                     NObeyesdad 
##                    "character"
# Tujuan: mengecek missing value dan jumlah data duplikat.

cat("Missing value per kolom:\n")
## Missing value per kolom:
print(colSums(is.na(df)))
##                         Gender                            Age 
##                              0                              0 
##                         Height                         Weight 
##                              0                              0 
## family_history_with_overweight                           FAVC 
##                              0                              0 
##                           FCVC                            NCP 
##                              0                              0 
##                           CAEC                          SMOKE 
##                              0                              0 
##                           CH2O                            SCC 
##                              0                              0 
##                            FAF                            TUE 
##                              0                              0 
##                           CALC                         MTRANS 
##                              0                              0 
##                     NObeyesdad 
##                              0
cat("\nJumlah data duplikat:", sum(duplicated(df)), "\n")
## 
## Jumlah data duplikat: 24
# Tujuan: menampilkan statistik deskriptif untuk variabel numerik.

num_cols_raw <- names(df)[sapply(df, is.numeric)]
summary(df[, num_cols_raw])
##       Age            Height          Weight            FCVC      
##  Min.   :14.00   Min.   :1.450   Min.   : 39.00   Min.   :1.000  
##  1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47   1st Qu.:2.000  
##  Median :22.78   Median :1.700   Median : 83.00   Median :2.386  
##  Mean   :24.31   Mean   :1.702   Mean   : 86.59   Mean   :2.419  
##  3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43   3rd Qu.:3.000  
##  Max.   :61.00   Max.   :1.980   Max.   :173.00   Max.   :3.000  
##       NCP             CH2O            FAF              TUE        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.659   1st Qu.:1.585   1st Qu.:0.1245   1st Qu.:0.0000  
##  Median :3.000   Median :2.000   Median :1.0000   Median :0.6253  
##  Mean   :2.686   Mean   :2.008   Mean   :1.0103   Mean   :0.6579  
##  3rd Qu.:3.000   3rd Qu.:2.477   3rd Qu.:1.6667   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :3.000   Max.   :3.0000   Max.   :2.0000
# Tujuan: menampilkan distribusi kategori untuk setiap variabel kategorikal.

cat_cols_raw <- names(df)[sapply(df, is.character)]
for (col in cat_cols_raw) {
  cat("\nDistribusi kategori pada kolom", col, ":\n")
  print(sort(table(df[[col]]), decreasing = TRUE))
}
## 
## Distribusi kategori pada kolom Gender :
## 
##   Male Female 
##   1068   1043 
## 
## Distribusi kategori pada kolom family_history_with_overweight :
## 
##  yes   no 
## 1726  385 
## 
## Distribusi kategori pada kolom FAVC :
## 
##  yes   no 
## 1866  245 
## 
## Distribusi kategori pada kolom CAEC :
## 
##  Sometimes Frequently     Always         no 
##       1765        242         53         51 
## 
## Distribusi kategori pada kolom SMOKE :
## 
##   no  yes 
## 2067   44 
## 
## Distribusi kategori pada kolom SCC :
## 
##   no  yes 
## 2015   96 
## 
## Distribusi kategori pada kolom CALC :
## 
##  Sometimes         no Frequently     Always 
##       1401        639         70          1 
## 
## Distribusi kategori pada kolom MTRANS :
## 
## Public_Transportation            Automobile               Walking 
##                  1580                   457                    56 
##             Motorbike                  Bike 
##                    11                     7 
## 
## Distribusi kategori pada kolom NObeyesdad :
## 
##      Obesity_Type_I    Obesity_Type_III     Obesity_Type_II  Overweight_Level_I 
##                 351                 324                 297                 290 
## Overweight_Level_II       Normal_Weight Insufficient_Weight 
##                 290                 287                 272

4. Preprocessing Data

Tahap preprocessing dilakukan agar seluruh variabel dapat digunakan dalam clustering. Variabel target NObeyesdad tidak dipakai sebagai fitur clustering, melainkan hanya disimpan untuk interpretasi tambahan.

Pada tahap ini dilakukan:

  • pemisahan fitur dan label asli,
  • identifikasi kolom numerik dan kategorikal,
  • encoding untuk kolom kategorikal,
  • standardisasi fitur.
# Tujuan: memisahkan fitur input untuk clustering dan menyimpan label asli
# hanya untuk interpretasi tambahan.

target_col  <- "NObeyesdad"
X_raw       <- df[, setdiff(colnames(df), target_col)]
label_asli  <- df[[target_col]]

num_cols <- names(X_raw)[sapply(X_raw, is.numeric)]
cat_cols <- names(X_raw)[sapply(X_raw, is.character)]

cat("Jumlah kolom numerik awal:", length(num_cols), "\n")
## Jumlah kolom numerik awal: 8
cat("Kolom numerik:", paste(num_cols, collapse = ", "), "\n")
## Kolom numerik: Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
cat("\nJumlah kolom kategorikal awal:", length(cat_cols), "\n")
## 
## Jumlah kolom kategorikal awal: 8
cat("Kolom kategorikal:", paste(cat_cols, collapse = ", "), "\n")
## Kolom kategorikal: Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CALC, MTRANS
# Tujuan: mengubah variabel kategorikal menjadi numerik (one-hot encoding,
# drop first level) dan melakukan standardisasi seluruh fitur.

# --- One-Hot Encoding (drop first level, setara drop="first" di sklearn) ---
ohe_list <- lapply(cat_cols, function(col) {
  lvls <- sort(unique(X_raw[[col]]))
  lvls_keep <- lvls[-1]   # buang level pertama (setara drop="first")
  mat <- sapply(lvls_keep, function(lv) as.integer(X_raw[[col]] == lv))
  colnames(mat) <- paste0(col, "_", lvls_keep)
  mat
})
X_cat_encoded <- do.call(cbind, ohe_list)

# --- Standardisasi variabel numerik ---
X_num_scaled <- scale(X_raw[, num_cols])

# --- Gabungkan ---
X <- cbind(X_num_scaled, X_cat_encoded)
X_df <- as.data.frame(X)

cat("Shape data sebelum preprocessing:", nrow(X_raw), "x", ncol(X_raw), "\n")
## Shape data sebelum preprocessing: 2111 x 16
cat("Shape data setelah preprocessing:", nrow(X_df), "x", ncol(X_df), "\n")
## Shape data setelah preprocessing: 2111 x 23
head(X_df)
Age Height Weight FCVC NCP CH2O FAF TUE Gender_Male family_history_with_overweight_yes FAVC_yes CAEC_Frequently CAEC_no CAEC_Sometimes SMOKE_yes SCC_yes CALC_Frequently CALC_no CALC_Sometimes MTRANS_Bike MTRANS_Motorbike MTRANS_Public_Transportation MTRANS_Walking
-0.5220007 -0.8753819 -0.8623539 -0.7848327 0.404057 -0.0130702 -1.187758 0.5618636 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
-0.5220007 -1.9471379 -1.1678003 1.0880839 0.404057 1.6183751 2.339196 -1.0803687 0 1 0 0 0 1 1 1 0 0 1 0 0 1 0
-0.2068400 1.0537789 -0.3660034 -0.7848327 0.404057 -0.0130702 1.163545 0.5618636 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0
0.4234815 1.0537789 0.0158046 1.0880839 0.404057 -0.0130702 1.163545 -1.0803687 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1
-0.3644203 0.8394277 0.1227109 -0.7848327 -2.166509 -0.0130702 -1.187758 -1.0803687 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0
0.7386422 -0.8753819 -1.2823427 -0.7848327 0.404057 -0.0130702 -1.187758 -1.0803687 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0

5. Eksplorasi Data

Eksplorasi dilakukan untuk memahami pola data sebelum dan sesudah preprocessing. Visualisasi yang digunakan meliputi korelasi antar variabel numerik, distribusi fitur, dan reduksi dimensi menggunakan PCA agar cluster lebih mudah divisualisasikan.

# Tujuan: memvisualisasikan korelasi antar variabel numerik pada data asli.

cor_mat <- cor(df[, num_cols], use = "complete.obs")
corrplot(cor_mat,
         method  = "color",
         type    = "upper",
         addCoef.col = "black",
         tl.col  = "black",
         tl.srt  = 45,
         col     = colorRampPalette(c("blue", "white", "red"))(200),
         title   = "Heatmap Korelasi Variabel Numerik",
         mar     = c(0, 0, 2, 0))

# Tujuan: menampilkan distribusi beberapa variabel numerik utama.

df_long <- melt(df[, num_cols])
ggplot(df_long, aes(x = value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Distribusi Variabel Numerik", x = "", y = "Frekuensi") +
  theme_bw()

# Tujuan: mereduksi dimensi data menjadi 2 komponen utama agar hasil clustering
# mudah divisualisasikan.

set.seed(42)
pca_res <- prcomp(X, center = FALSE, scale. = FALSE)   # sudah discale sebelumnya
X_pca   <- pca_res$x[, 1:2]

var_exp <- summary(pca_res)$importance[2, 1:2]
cat("Explained variance ratio:\n")
## Explained variance ratio:
cat("  PC1:", round(var_exp[1], 4), "\n")
##   PC1: 0.287
cat("  PC2:", round(var_exp[2], 4), "\n")
##   PC2: 0.1412
pca_plot_df <- data.frame(PC1 = X_pca[, 1], PC2 = X_pca[, 2])
ggplot(pca_plot_df, aes(x = PC1, y = PC2)) +
  geom_point(size = 1, alpha = 0.7, color = "steelblue") +
  labs(title = "Visualisasi PCA 2D") +
  theme_bw()


6. Fungsi Bantu

Agar analisis lebih rapi, dibuat beberapa fungsi bantu untuk:

  • mengevaluasi hasil clustering,
  • memvisualisasikan cluster pada hasil PCA,
  • membuat profil rata-rata variabel numerik pada setiap cluster.
# Tujuan: membuat fungsi bantu untuk evaluasi, visualisasi, dan profiling cluster.

# ---- evaluate_clustering ----
evaluate_clustering <- function(X_data, labels, method_name = "Model") {
  labels       <- as.integer(labels)
  unique_lbl   <- unique(labels)
  n_noise      <- sum(labels == -1)
  n_clusters   <- length(setdiff(unique_lbl, -1L))

  if (n_clusters < 2) {
    return(data.frame(
      Method           = method_name,
      n_clusters       = n_clusters,
      Silhouette       = NA_real_,
      `Calinski-Harabasz` = NA_real_,
      `Davies-Bouldin` = NA_real_,
      check.names = FALSE
    ))
  }

  if (-1L %in% labels) {
    mask        <- labels != -1L
    X_eval      <- X_data[mask, , drop = FALSE]
    labels_eval <- labels[mask]
    if (length(unique(labels_eval)) < 2) {
      return(data.frame(
        Method           = method_name,
        n_clusters       = n_clusters,
        Silhouette       = NA_real_,
        `Calinski-Harabasz` = NA_real_,
        `Davies-Bouldin` = NA_real_,
        check.names = FALSE
      ))
    }
  } else {
    X_eval      <- X_data
    labels_eval <- labels
  }

  # Silhouette
  sil <- mean(silhouette(labels_eval, dist(X_eval))[, 3])

  # Calinski-Harabasz
  k    <- length(unique(labels_eval))
  n    <- nrow(X_eval)
  cent_all <- colMeans(X_eval)
  SS_B <- sum(sapply(unique(labels_eval), function(cl) {
    idx <- labels_eval == cl
    nk  <- sum(idx)
    ck  <- colMeans(X_eval[idx, , drop = FALSE])
    nk * sum((ck - cent_all)^2)
  }))
  SS_W <- sum(sapply(unique(labels_eval), function(cl) {
    idx <- labels_eval == cl
    xk  <- X_eval[idx, , drop = FALSE]
    ck  <- colMeans(xk)
    sum(sweep(xk, 2, ck)^2)
  }))
  ch <- (SS_B / (k - 1)) / (SS_W / (n - k))

  # Davies-Bouldin
  centers <- t(sapply(unique(labels_eval), function(cl) {
    colMeans(X_eval[labels_eval == cl, , drop = FALSE])
  }))
  s_i <- sapply(unique(labels_eval), function(cl) {
    idx <- labels_eval == cl
    mean(sqrt(rowSums(sweep(X_eval[idx, , drop = FALSE], 2,
                            centers[match(cl, unique(labels_eval)), ])^2)))
  })
  db_vals <- sapply(seq_len(k), function(i) {
    max(sapply(setdiff(seq_len(k), i), function(j) {
      d_ij <- sqrt(sum((centers[i, ] - centers[j, ])^2))
      (s_i[i] + s_i[j]) / d_ij
    }))
  })
  db <- mean(db_vals)

  data.frame(
    Method              = method_name,
    n_clusters          = n_clusters,
    Silhouette          = round(sil, 4),
    `Calinski-Harabasz` = round(ch, 4),
    `Davies-Bouldin`    = round(db, 4),
    check.names = FALSE
  )
}

# ---- plot_cluster_pca ----
plot_cluster_pca <- function(X_pca_data, labels, title_str) {
  plot_df <- data.frame(
    PC1     = X_pca_data[, 1],
    PC2     = X_pca_data[, 2],
    Cluster = factor(labels)
  )
  print(
    ggplot(plot_df, aes(x = PC1, y = PC2, color = Cluster)) +
      geom_point(size = 1.5, alpha = 0.7) +
      labs(title = title_str, color = "Cluster") +
      theme_bw() +
      theme(legend.position = "right")
  )
}

# ---- profile_cluster ----
profile_cluster <- function(df_source, labels, numeric_columns) {
  temp <- df_source[, numeric_columns, drop = FALSE]
  temp$cluster <- labels
  aggregate(. ~ cluster, data = temp, FUN = mean)
}

cat("Fungsi bantu berhasil didefinisikan.\n")
## Fungsi bantu berhasil didefinisikan.

7. Menentukan Jumlah Cluster untuk Metode Berbasis Pusat

Pada bagian ini dilakukan pencarian jumlah cluster terbaik untuk K-Means dan K-Median menggunakan rentang cluster 2 sampai 10. Evaluasi utama yang dipakai adalah Silhouette Score.

# Tujuan: mencari jumlah cluster terbaik untuk metode K-Means.

set.seed(42)
k_range       <- 2:10
results_kmeans <- data.frame()

for (k in k_range) {
  km      <- kmeans(X, centers = k, nstart = 10, iter.max = 300)
  labels  <- km$cluster
  sil_val <- mean(silhouette(labels, dist(X))[, 3])
  n       <- nrow(X)
  k_n     <- k

  cent_all <- colMeans(X)
  SS_B <- sum(sapply(unique(labels), function(cl) {
    idx <- labels == cl; nk <- sum(idx)
    ck  <- colMeans(X[idx, , drop = FALSE])
    nk * sum((ck - cent_all)^2)
  }))
  SS_W <- sum(sapply(unique(labels), function(cl) {
    idx <- labels == cl
    xk  <- X[idx, , drop = FALSE]
    ck  <- colMeans(xk)
    sum(sweep(xk, 2, ck)^2)
  }))
  ch_val <- (SS_B / (k_n - 1)) / (SS_W / (n - k_n))

  results_kmeans <- rbind(results_kmeans, data.frame(
    k         = k,
    inertia   = km$tot.withinss,
    silhouette = round(sil_val, 4),
    calinski   = round(ch_val, 4)
  ))
}

results_kmeans
k inertia silhouette calinski
2 17393.97 0.1333 325.8457
3 15701.51 0.1346 294.0092
4 14311.60 0.1545 283.1489
5 13330.78 0.1519 266.6151
6 12520.69 0.1350 254.2234
7 11878.65 0.1496 242.1506
8 11348.78 0.1416 231.1724
9 10980.39 0.1473 217.7780
10 10644.30 0.1547 206.9686
# Tujuan: memvisualisasikan hasil evaluasi jumlah cluster pada K-Means.

par(mfrow = c(1, 2))

plot(results_kmeans$k, results_kmeans$inertia,
     type = "b", pch = 19, col = "steelblue",
     main = "Elbow Method - K-Means",
     xlab = "Jumlah Cluster (k)", ylab = "Inertia")

plot(results_kmeans$k, results_kmeans$silhouette,
     type = "b", pch = 19, col = "darkorange",
     main = "Silhouette Score - K-Means",
     xlab = "Jumlah Cluster (k)", ylab = "Silhouette Score")

par(mfrow = c(1, 1))

best_k <- results_kmeans$k[which.max(results_kmeans$silhouette)]
cat("Jumlah cluster terbaik untuk K-Means:", best_k, "\n")
## Jumlah cluster terbaik untuk K-Means: 10

8. Clustering dengan K-Means

Setelah jumlah cluster terbaik diperoleh, model K-Means dijalankan pada seluruh data hasil preprocessing. Hasil cluster kemudian divisualisasikan pada bidang PCA dan diprofilkan menggunakan rata-rata variabel numerik asli.

# Tujuan: menjalankan model K-Means dengan jumlah cluster terbaik dan menghitung evaluasinya.

set.seed(42)
kmeans_model  <- kmeans(X, centers = best_k, nstart = 10, iter.max = 300)
kmeans_labels <- kmeans_model$cluster - 1L   # 0-based agar sesuai Python

eval_kmeans <- evaluate_clustering(X, kmeans_labels, "K-Means")
eval_kmeans
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin
K-Means 10 0.1531 207.1689 1.8285
# Tujuan: memvisualisasikan hasil clustering K-Means dan membuat profil setiap cluster.

plot_cluster_pca(X_pca, kmeans_labels, "Clustering K-Means pada Bidang PCA")

profile_kmeans <- profile_cluster(df, kmeans_labels, num_cols)
profile_kmeans
cluster Age Height Weight FCVC NCP CH2O FAF TUE
0 21.00885 1.729847 74.59931 2.045724 2.988222 2.383150 2.0433920 1.5051664
1 20.43797 1.795845 72.29852 2.747625 3.195388 2.246581 2.0734491 0.4729259
2 20.68879 1.674327 67.73492 1.850280 3.044678 1.518783 0.5080278 1.0873518
3 25.40079 1.790809 102.94124 1.992448 2.869506 2.141781 0.8319401 0.4248549
4 23.28716 1.780663 128.50416 2.930114 2.906180 2.213367 1.3277361 0.7689575
5 37.43439 1.664558 85.01595 2.406856 2.522664 1.779903 0.9390338 0.2108454
6 25.46559 1.642251 106.53990 2.985410 2.999164 2.201919 0.0895633 0.4795781
7 22.02833 1.602985 62.98200 2.366240 1.144334 1.969030 0.6522025 0.1723208
8 21.27590 1.666091 82.67152 2.351729 1.299973 1.825494 0.6359490 1.4012938
9 20.86573 1.627131 61.31398 2.667724 3.114092 1.891794 1.3171571 0.4737666

9. Clustering dengan K-Median

Metode K-Median tidak tersedia secara langsung di R base, sehingga dibuat implementasi sederhana berbasis:

  • jarak Manhattan,
  • pusat cluster berupa median dari tiap cluster.
# Tujuan: membuat fungsi K-Median secara manual untuk kebutuhan analisis clustering.

kmedian_fit <- function(X_data, n_clusters = 3, max_iter = 100, random_state = 42) {
  set.seed(random_state)
  n          <- nrow(X_data)
  init_idx   <- sample(n, n_clusters, replace = FALSE)
  medians    <- X_data[init_idx, , drop = FALSE]
  labels     <- integer(n)

  for (iter in seq_len(max_iter)) {
    # Hitung jarak Manhattan ke setiap median
    dist_mat <- sapply(seq_len(n_clusters), function(j) {
      rowSums(abs(sweep(X_data, 2, medians[j, ])))
    })
    new_labels <- apply(dist_mat, 1, which.min) - 1L   # 0-based

    # Update median
    new_medians <- t(sapply(seq_len(n_clusters), function(j) {
      idx <- new_labels == (j - 1L)
      if (any(idx)) apply(X_data[idx, , drop = FALSE], 2, median)
      else medians[j, ]
    }))

    if (all(abs(medians - new_medians) < 1e-9)) {
      labels <- new_labels
      break
    }
    medians <- new_medians
    labels  <- new_labels
  }

  list(labels = labels, medians = medians)
}

cat("Fungsi kmedian_fit berhasil didefinisikan.\n")
## Fungsi kmedian_fit berhasil didefinisikan.
# Tujuan: mencari jumlah cluster terbaik untuk metode K-Median.

results_kmedian <- data.frame()

for (k in k_range) {
  res     <- kmedian_fit(X, n_clusters = k, max_iter = 100, random_state = 42)
  labels  <- res$labels
  sil_val <- mean(silhouette(labels, dist(X))[, 3])

  n <- nrow(X)
  cent_all <- colMeans(X)
  SS_B <- sum(sapply(unique(labels), function(cl) {
    idx <- labels == cl; nk <- sum(idx)
    ck  <- colMeans(X[idx, , drop = FALSE])
    nk * sum((ck - cent_all)^2)
  }))
  SS_W <- sum(sapply(unique(labels), function(cl) {
    idx <- labels == cl
    xk  <- X[idx, , drop = FALSE]
    ck  <- colMeans(xk)
    sum(sweep(xk, 2, ck)^2)
  }))
  ch_val <- (SS_B / (k - 1)) / (SS_W / (n - k))

  results_kmedian <- rbind(results_kmedian, data.frame(
    k          = k,
    silhouette = round(sil_val, 4),
    calinski   = round(ch_val, 4)
  ))
}

results_kmedian
k silhouette calinski
2 0.0890 182.4663
3 0.1086 211.6491
4 0.1125 207.5246
5 0.1110 186.0217
6 0.1098 184.8880
7 0.1157 183.0589
8 0.1118 169.9067
9 0.1196 165.4861
10 0.1182 157.8589
# Tujuan: memvisualisasikan evaluasi jumlah cluster K-Median dan menjalankan model final.

plot(results_kmedian$k, results_kmedian$silhouette,
     type = "b", pch = 19, col = "steelblue",
     main = "Silhouette Score - K-Median",
     xlab = "Jumlah Cluster (k)", ylab = "Silhouette Score")

best_k_kmedian <- results_kmedian$k[which.max(results_kmedian$silhouette)]
cat("Jumlah cluster terbaik untuk K-Median:", best_k_kmedian, "\n")
## Jumlah cluster terbaik untuk K-Median: 9
kmedian_res    <- kmedian_fit(X, n_clusters = best_k_kmedian, max_iter = 100, random_state = 42)
kmedian_labels <- kmedian_res$labels

eval_kmedian <- evaluate_clustering(X, kmedian_labels, "K-Median")
eval_kmedian
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin
K-Median 9 0.1196 165.4861 2.2029
# Tujuan: memvisualisasikan hasil clustering K-Median dan membuat profil setiap cluster.

plot_cluster_pca(X_pca, kmedian_labels, "Clustering K-Median pada Bidang PCA")

profile_kmedian <- profile_cluster(df, kmedian_labels, num_cols)
profile_kmedian
cluster Age Height Weight FCVC NCP CH2O FAF TUE
0 21.26388 1.635362 64.59837 2.580234 2.726455 1.927141 1.0957280 0.8991648
1 28.63076 1.665050 92.20591 2.606705 1.739690 1.432352 0.9238745 0.3551007
2 20.78880 1.699987 71.60830 1.965530 2.867670 1.910991 0.8648035 1.0118062
3 22.64937 1.766499 78.50196 2.170286 2.994546 2.221035 2.0468241 0.9351300
4 25.27394 1.640005 105.64789 2.978989 2.944323 2.209656 0.1103169 0.4786664
5 27.11538 1.800603 108.92870 2.070658 2.798827 2.147342 0.9341045 0.4125383
6 36.39108 1.621518 75.82291 2.280542 2.666225 1.664620 0.4616460 0.1996102
7 22.72935 1.782004 116.58584 2.939923 2.939500 2.288728 1.5298585 0.7350692
8 21.00499 1.598305 55.98495 2.476344 1.589590 1.854580 0.8651569 0.3320110

10. Clustering dengan DBSCAN

Metode DBSCAN tidak memerlukan jumlah cluster di awal, tetapi membutuhkan penentuan parameter eps dan minPts. Oleh karena itu dilakukan eksplorasi beberapa kombinasi parameter untuk mencari hasil terbaik.

# Tujuan: mencoba beberapa kombinasi parameter DBSCAN dan memilih hasil terbaik
# berdasarkan silhouette score.

eps_values        <- c(0.5, 0.8, 1.0, 1.2, 1.5, 2.0)
min_samples_values <- c(3, 5, 7, 10)

dbscan_results <- data.frame()

for (eps in eps_values) {
  for (min_s in min_samples_values) {
    db_res  <- dbscan::dbscan(X, eps = eps, minPts = min_s)
    labels  <- db_res$cluster - 1L   # 0-based; noise tetap 0-1 = -1 setelah konversi

    # Konversi: dbscan R: 0 = noise, 1..k = cluster → Python: -1 = noise, 0..k-1
    labels_py <- ifelse(db_res$cluster == 0, -1L, db_res$cluster - 1L)

    ev <- evaluate_clustering(X, labels_py, paste0("DBSCAN_eps", eps, "_min", min_s))
    ev$eps        <- eps
    ev$min_samples <- min_s
    ev$n_noise    <- sum(labels_py == -1)
    dbscan_results <- rbind(dbscan_results, ev)
  }
}

dbscan_eval_df <- dbscan_results[order(-dbscan_results$Silhouette), ]
dbscan_eval_df
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin eps min_samples n_noise
4 DBSCAN_eps0.5_min10 8 0.6795 722.6154 0.3973 0.5 10 1808
3 DBSCAN_eps0.5_min7 11 0.5778 211.1967 0.4379 0.5 7 1768
2 DBSCAN_eps0.5_min5 27 0.5614 149.3517 0.5119 0.5 5 1658
8 DBSCAN_eps0.8_min10 10 0.5422 235.7461 0.5744 0.8 10 1626
1 DBSCAN_eps0.5_min3 67 0.4990 114.5672 0.5103 0.5 3 1487
7 DBSCAN_eps0.8_min7 21 0.4919 171.2694 0.6184 0.8 7 1505
6 DBSCAN_eps0.8_min5 29 0.4502 133.9020 0.7271 0.8 5 1406
5 DBSCAN_eps0.8_min3 93 0.3934 70.2794 0.6782 0.8 3 1133
12 DBSCAN_eps1_min10 11 0.3790 126.5631 0.8130 1.0 10 1476
11 DBSCAN_eps1_min7 23 0.3355 92.0111 0.8073 1.0 7 1329
16 DBSCAN_eps1.2_min10 15 0.3330 117.0876 0.9554 1.2 10 1305
10 DBSCAN_eps1_min5 34 0.3082 72.9351 0.8348 1.0 5 1206
9 DBSCAN_eps1_min3 104 0.2422 39.6556 0.8152 1.0 3 911
15 DBSCAN_eps1.2_min7 26 0.2227 70.0029 0.9630 1.2 7 1100
22 DBSCAN_eps2_min5 2 0.1978 7.8793 0.9899 2.0 5 215
14 DBSCAN_eps1.2_min5 43 0.1970 49.7220 1.0063 1.2 5 921
13 DBSCAN_eps1.2_min3 83 0.1484 33.4589 1.0101 1.2 3 684
24 DBSCAN_eps2_min10 2 0.1329 38.3578 0.9683 2.0 10 283
20 DBSCAN_eps1.5_min10 18 0.0953 48.7055 1.0651 1.5 10 902
19 DBSCAN_eps1.5_min7 22 0.0565 35.8673 1.0788 1.5 7 741
21 DBSCAN_eps2_min3 6 -0.0254 4.3452 1.1558 2.0 3 183
18 DBSCAN_eps1.5_min5 30 -0.0767 20.2619 1.1411 1.5 5 568
17 DBSCAN_eps1.5_min3 46 -0.0979 15.5633 1.1945 1.5 3 437
23 DBSCAN_eps2_min7 1 NA NA NA 2.0 7 245
# Tujuan: menjalankan DBSCAN final berdasarkan kombinasi parameter terbaik.

best_row       <- dbscan_eval_df[1, ]
best_eps       <- best_row$eps
best_min_samples <- best_row$min_samples

cat("Parameter terbaik DBSCAN:\n")
## Parameter terbaik DBSCAN:
cat("eps =", best_eps, "\n")
## eps = 0.5
cat("min_samples =", best_min_samples, "\n")
## min_samples = 10
db_final    <- dbscan::dbscan(X, eps = best_eps, minPts = best_min_samples)
dbscan_labels <- ifelse(db_final$cluster == 0, -1L, db_final$cluster - 1L)

eval_dbscan <- evaluate_clustering(X, dbscan_labels, "DBSCAN")
eval_dbscan
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin
DBSCAN 8 0.6795 722.6154 0.3973
# Tujuan: memvisualisasikan hasil clustering DBSCAN dan melihat jumlah anggota tiap label cluster.

plot_cluster_pca(X_pca, dbscan_labels, "Clustering DBSCAN pada Bidang PCA")

sort(table(dbscan_labels))
## dbscan_labels
##    3    4    7    0    6    2    5    1   -1 
##   10   11   11   19   36   45   55  116 1808

11. Clustering dengan Mean Shift

Metode Mean Shift akan mencari pusat cluster secara otomatis berdasarkan kepadatan data. Parameter penting yang digunakan adalah bandwidth, yang dapat diestimasi dari data.

# Tujuan: mengestimasi bandwidth, menjalankan Mean Shift, dan menghitung evaluasi clustering.

# ---- Implementasi Mean Shift manual ----
mean_shift_fit <- function(X_data, bandwidth, tol = 1e-3, max_iter = 200) {
  n <- nrow(X_data)
  # Setiap titik mulai dari posisi awalnya
  shifted <- X_data

  for (iter in seq_len(max_iter)) {
    new_shifted <- shifted
    for (i in seq_len(n)) {
      # Hitung jarak titik i ke semua titik asli
      dists <- sqrt(rowSums(sweep(X_data, 2, shifted[i, ])^2))
      # Titik dalam bandwidth
      in_bw <- dists <= bandwidth
      if (sum(in_bw) == 0) next
      # Geser ke rata-rata titik dalam bandwidth (flat kernel)
      new_shifted[i, ] <- colMeans(X_data[in_bw, , drop = FALSE])
    }
    # Cek konvergensi
    shift_dist <- max(sqrt(rowSums((new_shifted - shifted)^2)))
    shifted <- new_shifted
    if (shift_dist < tol) break
  }

  # Cluster: titik dengan mode yang sama (dalam jarak kecil) digabung
  labels <- integer(n)
  label_counter <- 0L
  cluster_modes <- list()

  for (i in seq_len(n)) {
    mode_i <- shifted[i, ]
    assigned <- FALSE
    for (j in seq_along(cluster_modes)) {
      if (sqrt(sum((mode_i - cluster_modes[[j]])^2)) < bandwidth * 0.5) {
        labels[i] <- j - 1L
        assigned <- TRUE
        break
      }
    }
    if (!assigned) {
      label_counter <- label_counter + 1L
      cluster_modes[[label_counter]] <- mode_i
      labels[i] <- label_counter - 1L
    }
  }
  list(labels = labels, modes = do.call(rbind, cluster_modes))
}

# Estimasi bandwidth: kuantil ke-20 dari jarak antar sampel
set.seed(42)
sample_idx  <- sample(nrow(X), min(300, nrow(X)))
X_sample    <- X[sample_idx, ]
dist_sample <- as.matrix(dist(X_sample))
bandwidth   <- quantile(dist_sample[upper.tri(dist_sample)], probs = 0.20)
cat("Bandwidth yang digunakan:", round(bandwidth, 4), "\n")
## Bandwidth yang digunakan: 3.4329
ms_res           <- mean_shift_fit(X, bandwidth = bandwidth)
meanshift_labels <- ms_res$labels

eval_meanshift <- evaluate_clustering(X, meanshift_labels, "Mean Shift")
eval_meanshift
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin
Mean Shift 3 0.3666 4.379 0.4707
# Tujuan: memvisualisasikan hasil clustering Mean Shift dan melihat ukuran setiap cluster.

plot_cluster_pca(X_pca, meanshift_labels, "Clustering Mean Shift pada Bidang PCA")

sort(table(meanshift_labels))
## meanshift_labels
##    1    2    0 
##    1    1 2109

12. Clustering dengan Fuzzy C-Means

Berbeda dari metode hard clustering, Fuzzy C-Means memberikan derajat keanggotaan setiap data pada seluruh cluster. Pada tahap ini dicari jumlah cluster terbaik menggunakan FPC dan Silhouette Score.

# Tujuan: mencari jumlah cluster terbaik untuk metode Fuzzy C-Means.

set.seed(42)
X_mat      <- as.matrix(X)
fcm_results <- data.frame()

for (c in 2:10) {
  fcm_res <- e1071::cmeans(X_mat, centers = c, iter.max = 1000,
                           m = 2, method = "cmeans", verbose = FALSE)
  labels  <- fcm_res$cluster - 1L   # 0-based

  # Fuzzy Partition Coefficient
  u   <- fcm_res$membership          # matrix n x c
  fpc <- sum(u^2) / nrow(u)

  sil_val <- mean(silhouette(labels, dist(X_mat))[, 3])

  n <- nrow(X_mat)
  cent_all <- colMeans(X_mat)
  SS_B <- sum(sapply(unique(labels), function(cl) {
    idx <- labels == cl; nk <- sum(idx)
    ck  <- colMeans(X_mat[idx, , drop = FALSE])
    nk * sum((ck - cent_all)^2)
  }))
  SS_W <- sum(sapply(unique(labels), function(cl) {
    idx <- labels == cl
    xk  <- X_mat[idx, , drop = FALSE]
    ck  <- colMeans(xk)
    sum(sweep(xk, 2, ck)^2)
  }))
  ch_val <- (SS_B / (c - 1)) / (SS_W / (n - c))

  fcm_results <- rbind(fcm_results, data.frame(
    c          = c,
    FPC        = round(fpc, 4),
    silhouette = round(sil_val, 4),
    calinski   = round(ch_val, 4)
  ))
}

fcm_results
c FPC silhouette calinski
2 0.5000 0.1313 321.4452
3 0.3333 0.0603 173.2288
4 0.2500 0.0731 109.1558
5 0.2000 0.0610 83.2500
6 0.1667 0.0807 66.9434
7 0.1429 0.0807 54.8565
8 0.1250 -0.0286 55.2273
9 0.1111 0.0124 40.8763
10 0.1247 -0.0251 54.6141
# Tujuan: memvisualisasikan evaluasi Fuzzy C-Means dan menjalankan model final
# dengan jumlah cluster terbaik.

par(mfrow = c(1, 2))
plot(fcm_results$c, fcm_results$FPC,
     type = "b", pch = 19, col = "steelblue",
     main = "FPC - Fuzzy C-Means",
     xlab = "Jumlah Cluster", ylab = "FPC")

plot(fcm_results$c, fcm_results$silhouette,
     type = "b", pch = 19, col = "darkorange",
     main = "Silhouette Score - Fuzzy C-Means",
     xlab = "Jumlah Cluster", ylab = "Silhouette Score")

par(mfrow = c(1, 1))

best_c <- fcm_results$c[which.max(fcm_results$silhouette)]
cat("Jumlah cluster terbaik untuk Fuzzy C-Means:", best_c, "\n")
## Jumlah cluster terbaik untuk Fuzzy C-Means: 2
set.seed(42)
fcm_final  <- e1071::cmeans(X_mat, centers = best_c, iter.max = 1000,
                             m = 2, method = "cmeans", verbose = FALSE)
fcm_labels <- fcm_final$cluster - 1L   # 0-based

eval_fcm <- evaluate_clustering(X, fcm_labels, "Fuzzy C-Means")
eval_fcm
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin
Fuzzy C-Means 2 0.1313 321.4452 2.4779
# Tujuan: memvisualisasikan hasil clustering Fuzzy C-Means dan melihat derajat
# keanggotaan beberapa data pertama.

plot_cluster_pca(X_pca, fcm_labels, "Clustering Fuzzy C-Means pada Bidang PCA")

u_matrix    <- fcm_final$membership
colnames(u_matrix) <- paste0("cluster_", 0:(best_c - 1))
membership_df <- as.data.frame(u_matrix)
head(membership_df)
cluster_0 cluster_1
0.5002205 0.4997795
0.5000282 0.4999718
0.4999929 0.5000071
0.4999085 0.5000915
0.5000096 0.4999904
0.5001174 0.4998826

13. Perbandingan Hasil Seluruh Metode

Pada bagian ini seluruh metode dibandingkan menggunakan metrik evaluasi yang sama:

  • Silhouette Score (semakin besar semakin baik)
  • Calinski-Harabasz Index (semakin besar semakin baik)
  • Davies-Bouldin Index (semakin kecil semakin baik)
# Tujuan: menyusun tabel perbandingan seluruh metode clustering.

all_eval <- rbind(eval_kmeans, eval_kmedian, eval_dbscan, eval_meanshift, eval_fcm)
all_eval_sorted <- all_eval[order(-all_eval$Silhouette, na.last = TRUE), ]
all_eval_sorted
Method n_clusters Silhouette Calinski-Harabasz Davies-Bouldin
3 DBSCAN 8 0.6795 722.6154 0.3973
4 Mean Shift 3 0.3666 4.3790 0.4707
1 K-Means 10 0.1531 207.1689 1.8285
5 Fuzzy C-Means 2 0.1313 321.4452 2.4779
2 K-Median 9 0.1196 165.4861 2.2029
# Tujuan: memvisualisasikan perbandingan silhouette score dari seluruh metode clustering.

eval_plot <- all_eval_sorted[!is.na(all_eval_sorted$Silhouette), ]
eval_plot$Method <- factor(eval_plot$Method,
                           levels = eval_plot$Method[order(-eval_plot$Silhouette)])

ggplot(eval_plot, aes(x = Method, y = Silhouette, fill = Method)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = round(Silhouette, 4)), vjust = -0.4, size = 3.5) +
  labs(title = "Perbandingan Silhouette Score Antar Metode",
       x = "", y = "Silhouette Score") +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 20, hjust = 1))

# Tujuan: membandingkan label cluster dengan label obesitas asli untuk keperluan
# interpretasi tambahan.

cat("Crosstab K-Means vs label asli\n")
## Crosstab K-Means vs label asli
ct_kmeans <- prop.table(table(kmeans_labels, label_asli), margin = 1)
print(round(ct_kmeans, 4))
##              label_asli
## kmeans_labels Insufficient_Weight Normal_Weight Obesity_Type_I Obesity_Type_II
##             0              0.2229        0.2048         0.2169          0.0060
##             1              0.3571        0.2143         0.0390          0.0000
##             2              0.2174        0.3000         0.1609          0.0000
##             3              0.0000        0.0505         0.2271          0.4069
##             4              0.0000        0.0000         0.0943          0.3279
##             5              0.0080        0.0478         0.3147          0.2072
##             6              0.0000        0.0299         0.0100          0.0000
##             7              0.2654        0.1728         0.1543          0.0000
##             8              0.0486        0.1389         0.3125          0.2153
##             9              0.3223        0.2851         0.1074          0.0165
##              label_asli
## kmeans_labels Obesity_Type_III Overweight_Level_I Overweight_Level_II
##             0           0.0000             0.1807              0.1687
##             1           0.0000             0.2403              0.1494
##             2           0.0000             0.1696              0.1522
##             3           0.0000             0.1167              0.1987
##             4           0.5656             0.0000              0.0123
##             5           0.0000             0.1713              0.2510
##             6           0.9254             0.0000              0.0348
##             7           0.0000             0.2778              0.1296
##             8           0.0000             0.1806              0.1042
##             9           0.0000             0.1364              0.1322
cat("\nCrosstab Fuzzy C-Means vs label asli\n")
## 
## Crosstab Fuzzy C-Means vs label asli
ct_fcm <- prop.table(table(fcm_labels, label_asli), margin = 1)
print(round(ct_fcm, 4))
##           label_asli
## fcm_labels Insufficient_Weight Normal_Weight Obesity_Type_I Obesity_Type_II
##          0              0.2171        0.2208         0.1711          0.0359
##          1              0.0352        0.0459         0.1611          0.2520
##           label_asli
## fcm_labels Obesity_Type_III Overweight_Level_I Overweight_Level_II
##          0           0.0175             0.1895              0.1481
##          1           0.2979             0.0820              0.1260

14. Penyimpanan Hasil

Sebagai tahap akhir, hasil clustering dan tabel evaluasi disimpan ke file .csv agar dapat digunakan kembali dalam laporan atau lampiran tugas.

# Tujuan: menyimpan hasil clustering dan evaluasi ke dalam file CSV.

df_result <- df
df_result$cluster_kmeans   <- kmeans_labels
df_result$cluster_kmedian  <- kmedian_labels
df_result$cluster_dbscan   <- dbscan_labels
df_result$cluster_meanshift <- meanshift_labels
df_result$cluster_fcm      <- fcm_labels

write.csv(df_result,     "hasil_clustering_obesity.csv",   row.names = FALSE)
write.csv(all_eval,      "evaluasi_metode_clustering.csv", row.names = FALSE)
write.csv(profile_kmeans,  "profil_cluster_kmeans.csv",   row.names = TRUE)
write.csv(profile_kmedian, "profil_cluster_kmedian.csv",  row.names = TRUE)

cat("File berhasil disimpan:\n")
## File berhasil disimpan:
cat("- hasil_clustering_obesity.csv\n")
## - hasil_clustering_obesity.csv
cat("- evaluasi_metode_clustering.csv\n")
## - evaluasi_metode_clustering.csv
cat("- profil_cluster_kmeans.csv\n")
## - profil_cluster_kmeans.csv
cat("- profil_cluster_kmedian.csv\n")
## - profil_cluster_kmedian.csv