Notebook ini digunakan untuk melakukan analisis clustering pada dataset obesitas dengan lima metode, yaitu K-Means, K-Median, DBSCAN, Mean Shift, dan Fuzzy C-Means.
Dataset mentah masih berisi campuran variabel numerik dan kategorikal. Agar dapat digunakan untuk clustering, variabel kategorikal akan diubah menjadi numerik menggunakan one-hot encoding, kemudian seluruh fitur akan distandardisasi. Setelah preprocessing, data yang masuk ke model sudah berbentuk numerik dan jumlah fiturnya lebih dari 10.
Pada tahap ini dilakukan import library yang diperlukan untuk membaca data, preprocessing, eksplorasi data, visualisasi, clustering, dan evaluasi hasil clustering.
# Tujuan: instalasi package jika belum tersedia (jalankan sekali saja).
install.packages(c(
"tidyverse", "cluster", "factoextra", "dbscan",
"e1071", "corrplot", "reshape2", "knitr", "kableExtra"
))# Tujuan: mengimpor library utama yang dibutuhkan dalam analisis clustering.
library(tidyverse) # manipulasi data & visualisasi (ggplot2, dplyr, dll.)
library(cluster) # silhouette score, k-medoids
library(factoextra) # visualisasi clustering & elbow method
library(dbscan) # DBSCAN & Mean Shift
library(e1071) # Fuzzy C-Means (cmeans)
library(corrplot) # heatmap korelasi
library(reshape2) # melt data frame
library(knitr) # kable
library(kableExtra) # kable stylingTahap ini bertujuan untuk memahami struktur awal dataset, termasuk ukuran data, tipe data, missing value, duplikasi, statistik deskriptif, dan distribusi kategori.
# Tujuan: membaca dataset obesity ke dalam dataframe.
df <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv", stringsAsFactors = FALSE)
head(df)| Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 21 | 1.62 | 64.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 0 | 1 | no | Public_Transportation | Normal_Weight |
| Female | 21 | 1.52 | 56.0 | yes | no | 3 | 3 | Sometimes | yes | 3 | yes | 3 | 0 | Sometimes | Public_Transportation | Normal_Weight |
| Male | 23 | 1.80 | 77.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 2 | 1 | Frequently | Public_Transportation | Normal_Weight |
| Male | 27 | 1.80 | 87.0 | no | no | 3 | 3 | Sometimes | no | 2 | no | 2 | 0 | Frequently | Walking | Overweight_Level_I |
| Male | 22 | 1.78 | 89.8 | no | no | 2 | 1 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Public_Transportation | Overweight_Level_II |
| Male | 29 | 1.62 | 53.0 | no | yes | 2 | 3 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Automobile | Normal_Weight |
# Tujuan: melihat ukuran dataset, nama kolom, dan tipe data setiap variabel.
cat("Ukuran dataset:", nrow(df), "baris x", ncol(df), "kolom\n")## Ukuran dataset: 2111 baris x 17 kolom
##
## Nama kolom:
## [1] "Gender" "Age"
## [3] "Height" "Weight"
## [5] "family_history_with_overweight" "FAVC"
## [7] "FCVC" "NCP"
## [9] "CAEC" "SMOKE"
## [11] "CH2O" "SCC"
## [13] "FAF" "TUE"
## [15] "CALC" "MTRANS"
## [17] "NObeyesdad"
##
## Tipe data:
## Gender Age
## "character" "numeric"
## Height Weight
## "numeric" "numeric"
## family_history_with_overweight FAVC
## "character" "character"
## FCVC NCP
## "numeric" "numeric"
## CAEC SMOKE
## "character" "character"
## CH2O SCC
## "numeric" "character"
## FAF TUE
## "numeric" "numeric"
## CALC MTRANS
## "character" "character"
## NObeyesdad
## "character"
## Missing value per kolom:
## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad
## 0
##
## Jumlah data duplikat: 24
# Tujuan: menampilkan statistik deskriptif untuk variabel numerik.
num_cols_raw <- names(df)[sapply(df, is.numeric)]
summary(df[, num_cols_raw])## Age Height Weight FCVC
## Min. :14.00 Min. :1.450 Min. : 39.00 Min. :1.000
## 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47 1st Qu.:2.000
## Median :22.78 Median :1.700 Median : 83.00 Median :2.386
## Mean :24.31 Mean :1.702 Mean : 86.59 Mean :2.419
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43 3rd Qu.:3.000
## Max. :61.00 Max. :1.980 Max. :173.00 Max. :3.000
## NCP CH2O FAF TUE
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.659 1st Qu.:1.585 1st Qu.:0.1245 1st Qu.:0.0000
## Median :3.000 Median :2.000 Median :1.0000 Median :0.6253
## Mean :2.686 Mean :2.008 Mean :1.0103 Mean :0.6579
## 3rd Qu.:3.000 3rd Qu.:2.477 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :4.000 Max. :3.000 Max. :3.0000 Max. :2.0000
# Tujuan: menampilkan distribusi kategori untuk setiap variabel kategorikal.
cat_cols_raw <- names(df)[sapply(df, is.character)]
for (col in cat_cols_raw) {
cat("\nDistribusi kategori pada kolom", col, ":\n")
print(sort(table(df[[col]]), decreasing = TRUE))
}##
## Distribusi kategori pada kolom Gender :
##
## Male Female
## 1068 1043
##
## Distribusi kategori pada kolom family_history_with_overweight :
##
## yes no
## 1726 385
##
## Distribusi kategori pada kolom FAVC :
##
## yes no
## 1866 245
##
## Distribusi kategori pada kolom CAEC :
##
## Sometimes Frequently Always no
## 1765 242 53 51
##
## Distribusi kategori pada kolom SMOKE :
##
## no yes
## 2067 44
##
## Distribusi kategori pada kolom SCC :
##
## no yes
## 2015 96
##
## Distribusi kategori pada kolom CALC :
##
## Sometimes no Frequently Always
## 1401 639 70 1
##
## Distribusi kategori pada kolom MTRANS :
##
## Public_Transportation Automobile Walking
## 1580 457 56
## Motorbike Bike
## 11 7
##
## Distribusi kategori pada kolom NObeyesdad :
##
## Obesity_Type_I Obesity_Type_III Obesity_Type_II Overweight_Level_I
## 351 324 297 290
## Overweight_Level_II Normal_Weight Insufficient_Weight
## 290 287 272
Tahap preprocessing dilakukan agar seluruh variabel dapat digunakan
dalam clustering. Variabel target NObeyesdad tidak dipakai
sebagai fitur clustering, melainkan hanya disimpan untuk interpretasi
tambahan.
Pada tahap ini dilakukan:
# Tujuan: memisahkan fitur input untuk clustering dan menyimpan label asli
# hanya untuk interpretasi tambahan.
target_col <- "NObeyesdad"
X_raw <- df[, setdiff(colnames(df), target_col)]
label_asli <- df[[target_col]]
num_cols <- names(X_raw)[sapply(X_raw, is.numeric)]
cat_cols <- names(X_raw)[sapply(X_raw, is.character)]
cat("Jumlah kolom numerik awal:", length(num_cols), "\n")## Jumlah kolom numerik awal: 8
## Kolom numerik: Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
##
## Jumlah kolom kategorikal awal: 8
## Kolom kategorikal: Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CALC, MTRANS
# Tujuan: mengubah variabel kategorikal menjadi numerik (one-hot encoding,
# drop first level) dan melakukan standardisasi seluruh fitur.
# --- One-Hot Encoding (drop first level, setara drop="first" di sklearn) ---
ohe_list <- lapply(cat_cols, function(col) {
lvls <- sort(unique(X_raw[[col]]))
lvls_keep <- lvls[-1] # buang level pertama (setara drop="first")
mat <- sapply(lvls_keep, function(lv) as.integer(X_raw[[col]] == lv))
colnames(mat) <- paste0(col, "_", lvls_keep)
mat
})
X_cat_encoded <- do.call(cbind, ohe_list)
# --- Standardisasi variabel numerik ---
X_num_scaled <- scale(X_raw[, num_cols])
# --- Gabungkan ---
X <- cbind(X_num_scaled, X_cat_encoded)
X_df <- as.data.frame(X)
cat("Shape data sebelum preprocessing:", nrow(X_raw), "x", ncol(X_raw), "\n")## Shape data sebelum preprocessing: 2111 x 16
## Shape data setelah preprocessing: 2111 x 23
| Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Male | family_history_with_overweight_yes | FAVC_yes | CAEC_Frequently | CAEC_no | CAEC_Sometimes | SMOKE_yes | SCC_yes | CALC_Frequently | CALC_no | CALC_Sometimes | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.5220007 | -0.8753819 | -0.8623539 | -0.7848327 | 0.404057 | -0.0130702 | -1.187758 | 0.5618636 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| -0.5220007 | -1.9471379 | -1.1678003 | 1.0880839 | 0.404057 | 1.6183751 | 2.339196 | -1.0803687 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| -0.2068400 | 1.0537789 | -0.3660034 | -0.7848327 | 0.404057 | -0.0130702 | 1.163545 | 0.5618636 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0.4234815 | 1.0537789 | 0.0158046 | 1.0880839 | 0.404057 | -0.0130702 | 1.163545 | -1.0803687 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| -0.3644203 | 0.8394277 | 0.1227109 | -0.7848327 | -2.166509 | -0.0130702 | -1.187758 | -1.0803687 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0.7386422 | -0.8753819 | -1.2823427 | -0.7848327 | 0.404057 | -0.0130702 | -1.187758 | -1.0803687 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Eksplorasi dilakukan untuk memahami pola data sebelum dan sesudah preprocessing. Visualisasi yang digunakan meliputi korelasi antar variabel numerik, distribusi fitur, dan reduksi dimensi menggunakan PCA agar cluster lebih mudah divisualisasikan.
# Tujuan: memvisualisasikan korelasi antar variabel numerik pada data asli.
cor_mat <- cor(df[, num_cols], use = "complete.obs")
corrplot(cor_mat,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
col = colorRampPalette(c("blue", "white", "red"))(200),
title = "Heatmap Korelasi Variabel Numerik",
mar = c(0, 0, 2, 0))# Tujuan: menampilkan distribusi beberapa variabel numerik utama.
df_long <- melt(df[, num_cols])
ggplot(df_long, aes(x = value)) +
geom_histogram(bins = 20, fill = "steelblue", color = "white") +
facet_wrap(~variable, scales = "free") +
labs(title = "Distribusi Variabel Numerik", x = "", y = "Frekuensi") +
theme_bw()# Tujuan: mereduksi dimensi data menjadi 2 komponen utama agar hasil clustering
# mudah divisualisasikan.
set.seed(42)
pca_res <- prcomp(X, center = FALSE, scale. = FALSE) # sudah discale sebelumnya
X_pca <- pca_res$x[, 1:2]
var_exp <- summary(pca_res)$importance[2, 1:2]
cat("Explained variance ratio:\n")## Explained variance ratio:
## PC1: 0.287
## PC2: 0.1412
pca_plot_df <- data.frame(PC1 = X_pca[, 1], PC2 = X_pca[, 2])
ggplot(pca_plot_df, aes(x = PC1, y = PC2)) +
geom_point(size = 1, alpha = 0.7, color = "steelblue") +
labs(title = "Visualisasi PCA 2D") +
theme_bw()Agar analisis lebih rapi, dibuat beberapa fungsi bantu untuk:
# Tujuan: membuat fungsi bantu untuk evaluasi, visualisasi, dan profiling cluster.
# ---- evaluate_clustering ----
evaluate_clustering <- function(X_data, labels, method_name = "Model") {
labels <- as.integer(labels)
unique_lbl <- unique(labels)
n_noise <- sum(labels == -1)
n_clusters <- length(setdiff(unique_lbl, -1L))
if (n_clusters < 2) {
return(data.frame(
Method = method_name,
n_clusters = n_clusters,
Silhouette = NA_real_,
`Calinski-Harabasz` = NA_real_,
`Davies-Bouldin` = NA_real_,
check.names = FALSE
))
}
if (-1L %in% labels) {
mask <- labels != -1L
X_eval <- X_data[mask, , drop = FALSE]
labels_eval <- labels[mask]
if (length(unique(labels_eval)) < 2) {
return(data.frame(
Method = method_name,
n_clusters = n_clusters,
Silhouette = NA_real_,
`Calinski-Harabasz` = NA_real_,
`Davies-Bouldin` = NA_real_,
check.names = FALSE
))
}
} else {
X_eval <- X_data
labels_eval <- labels
}
# Silhouette
sil <- mean(silhouette(labels_eval, dist(X_eval))[, 3])
# Calinski-Harabasz
k <- length(unique(labels_eval))
n <- nrow(X_eval)
cent_all <- colMeans(X_eval)
SS_B <- sum(sapply(unique(labels_eval), function(cl) {
idx <- labels_eval == cl
nk <- sum(idx)
ck <- colMeans(X_eval[idx, , drop = FALSE])
nk * sum((ck - cent_all)^2)
}))
SS_W <- sum(sapply(unique(labels_eval), function(cl) {
idx <- labels_eval == cl
xk <- X_eval[idx, , drop = FALSE]
ck <- colMeans(xk)
sum(sweep(xk, 2, ck)^2)
}))
ch <- (SS_B / (k - 1)) / (SS_W / (n - k))
# Davies-Bouldin
centers <- t(sapply(unique(labels_eval), function(cl) {
colMeans(X_eval[labels_eval == cl, , drop = FALSE])
}))
s_i <- sapply(unique(labels_eval), function(cl) {
idx <- labels_eval == cl
mean(sqrt(rowSums(sweep(X_eval[idx, , drop = FALSE], 2,
centers[match(cl, unique(labels_eval)), ])^2)))
})
db_vals <- sapply(seq_len(k), function(i) {
max(sapply(setdiff(seq_len(k), i), function(j) {
d_ij <- sqrt(sum((centers[i, ] - centers[j, ])^2))
(s_i[i] + s_i[j]) / d_ij
}))
})
db <- mean(db_vals)
data.frame(
Method = method_name,
n_clusters = n_clusters,
Silhouette = round(sil, 4),
`Calinski-Harabasz` = round(ch, 4),
`Davies-Bouldin` = round(db, 4),
check.names = FALSE
)
}
# ---- plot_cluster_pca ----
plot_cluster_pca <- function(X_pca_data, labels, title_str) {
plot_df <- data.frame(
PC1 = X_pca_data[, 1],
PC2 = X_pca_data[, 2],
Cluster = factor(labels)
)
print(
ggplot(plot_df, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 1.5, alpha = 0.7) +
labs(title = title_str, color = "Cluster") +
theme_bw() +
theme(legend.position = "right")
)
}
# ---- profile_cluster ----
profile_cluster <- function(df_source, labels, numeric_columns) {
temp <- df_source[, numeric_columns, drop = FALSE]
temp$cluster <- labels
aggregate(. ~ cluster, data = temp, FUN = mean)
}
cat("Fungsi bantu berhasil didefinisikan.\n")## Fungsi bantu berhasil didefinisikan.
Pada bagian ini dilakukan pencarian jumlah cluster terbaik untuk K-Means dan K-Median menggunakan rentang cluster 2 sampai 10. Evaluasi utama yang dipakai adalah Silhouette Score.
# Tujuan: mencari jumlah cluster terbaik untuk metode K-Means.
set.seed(42)
k_range <- 2:10
results_kmeans <- data.frame()
for (k in k_range) {
km <- kmeans(X, centers = k, nstart = 10, iter.max = 300)
labels <- km$cluster
sil_val <- mean(silhouette(labels, dist(X))[, 3])
n <- nrow(X)
k_n <- k
cent_all <- colMeans(X)
SS_B <- sum(sapply(unique(labels), function(cl) {
idx <- labels == cl; nk <- sum(idx)
ck <- colMeans(X[idx, , drop = FALSE])
nk * sum((ck - cent_all)^2)
}))
SS_W <- sum(sapply(unique(labels), function(cl) {
idx <- labels == cl
xk <- X[idx, , drop = FALSE]
ck <- colMeans(xk)
sum(sweep(xk, 2, ck)^2)
}))
ch_val <- (SS_B / (k_n - 1)) / (SS_W / (n - k_n))
results_kmeans <- rbind(results_kmeans, data.frame(
k = k,
inertia = km$tot.withinss,
silhouette = round(sil_val, 4),
calinski = round(ch_val, 4)
))
}
results_kmeans| k | inertia | silhouette | calinski |
|---|---|---|---|
| 2 | 17393.97 | 0.1333 | 325.8457 |
| 3 | 15701.51 | 0.1346 | 294.0092 |
| 4 | 14311.60 | 0.1545 | 283.1489 |
| 5 | 13330.78 | 0.1519 | 266.6151 |
| 6 | 12520.69 | 0.1350 | 254.2234 |
| 7 | 11878.65 | 0.1496 | 242.1506 |
| 8 | 11348.78 | 0.1416 | 231.1724 |
| 9 | 10980.39 | 0.1473 | 217.7780 |
| 10 | 10644.30 | 0.1547 | 206.9686 |
# Tujuan: memvisualisasikan hasil evaluasi jumlah cluster pada K-Means.
par(mfrow = c(1, 2))
plot(results_kmeans$k, results_kmeans$inertia,
type = "b", pch = 19, col = "steelblue",
main = "Elbow Method - K-Means",
xlab = "Jumlah Cluster (k)", ylab = "Inertia")
plot(results_kmeans$k, results_kmeans$silhouette,
type = "b", pch = 19, col = "darkorange",
main = "Silhouette Score - K-Means",
xlab = "Jumlah Cluster (k)", ylab = "Silhouette Score")par(mfrow = c(1, 1))
best_k <- results_kmeans$k[which.max(results_kmeans$silhouette)]
cat("Jumlah cluster terbaik untuk K-Means:", best_k, "\n")## Jumlah cluster terbaik untuk K-Means: 10
Setelah jumlah cluster terbaik diperoleh, model K-Means dijalankan pada seluruh data hasil preprocessing. Hasil cluster kemudian divisualisasikan pada bidang PCA dan diprofilkan menggunakan rata-rata variabel numerik asli.
# Tujuan: menjalankan model K-Means dengan jumlah cluster terbaik dan menghitung evaluasinya.
set.seed(42)
kmeans_model <- kmeans(X, centers = best_k, nstart = 10, iter.max = 300)
kmeans_labels <- kmeans_model$cluster - 1L # 0-based agar sesuai Python
eval_kmeans <- evaluate_clustering(X, kmeans_labels, "K-Means")
eval_kmeans| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| K-Means | 10 | 0.1531 | 207.1689 | 1.8285 |
# Tujuan: memvisualisasikan hasil clustering K-Means dan membuat profil setiap cluster.
plot_cluster_pca(X_pca, kmeans_labels, "Clustering K-Means pada Bidang PCA")| cluster | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
|---|---|---|---|---|---|---|---|---|
| 0 | 21.00885 | 1.729847 | 74.59931 | 2.045724 | 2.988222 | 2.383150 | 2.0433920 | 1.5051664 |
| 1 | 20.43797 | 1.795845 | 72.29852 | 2.747625 | 3.195388 | 2.246581 | 2.0734491 | 0.4729259 |
| 2 | 20.68879 | 1.674327 | 67.73492 | 1.850280 | 3.044678 | 1.518783 | 0.5080278 | 1.0873518 |
| 3 | 25.40079 | 1.790809 | 102.94124 | 1.992448 | 2.869506 | 2.141781 | 0.8319401 | 0.4248549 |
| 4 | 23.28716 | 1.780663 | 128.50416 | 2.930114 | 2.906180 | 2.213367 | 1.3277361 | 0.7689575 |
| 5 | 37.43439 | 1.664558 | 85.01595 | 2.406856 | 2.522664 | 1.779903 | 0.9390338 | 0.2108454 |
| 6 | 25.46559 | 1.642251 | 106.53990 | 2.985410 | 2.999164 | 2.201919 | 0.0895633 | 0.4795781 |
| 7 | 22.02833 | 1.602985 | 62.98200 | 2.366240 | 1.144334 | 1.969030 | 0.6522025 | 0.1723208 |
| 8 | 21.27590 | 1.666091 | 82.67152 | 2.351729 | 1.299973 | 1.825494 | 0.6359490 | 1.4012938 |
| 9 | 20.86573 | 1.627131 | 61.31398 | 2.667724 | 3.114092 | 1.891794 | 1.3171571 | 0.4737666 |
Metode K-Median tidak tersedia secara langsung di R base, sehingga dibuat implementasi sederhana berbasis:
# Tujuan: membuat fungsi K-Median secara manual untuk kebutuhan analisis clustering.
kmedian_fit <- function(X_data, n_clusters = 3, max_iter = 100, random_state = 42) {
set.seed(random_state)
n <- nrow(X_data)
init_idx <- sample(n, n_clusters, replace = FALSE)
medians <- X_data[init_idx, , drop = FALSE]
labels <- integer(n)
for (iter in seq_len(max_iter)) {
# Hitung jarak Manhattan ke setiap median
dist_mat <- sapply(seq_len(n_clusters), function(j) {
rowSums(abs(sweep(X_data, 2, medians[j, ])))
})
new_labels <- apply(dist_mat, 1, which.min) - 1L # 0-based
# Update median
new_medians <- t(sapply(seq_len(n_clusters), function(j) {
idx <- new_labels == (j - 1L)
if (any(idx)) apply(X_data[idx, , drop = FALSE], 2, median)
else medians[j, ]
}))
if (all(abs(medians - new_medians) < 1e-9)) {
labels <- new_labels
break
}
medians <- new_medians
labels <- new_labels
}
list(labels = labels, medians = medians)
}
cat("Fungsi kmedian_fit berhasil didefinisikan.\n")## Fungsi kmedian_fit berhasil didefinisikan.
# Tujuan: mencari jumlah cluster terbaik untuk metode K-Median.
results_kmedian <- data.frame()
for (k in k_range) {
res <- kmedian_fit(X, n_clusters = k, max_iter = 100, random_state = 42)
labels <- res$labels
sil_val <- mean(silhouette(labels, dist(X))[, 3])
n <- nrow(X)
cent_all <- colMeans(X)
SS_B <- sum(sapply(unique(labels), function(cl) {
idx <- labels == cl; nk <- sum(idx)
ck <- colMeans(X[idx, , drop = FALSE])
nk * sum((ck - cent_all)^2)
}))
SS_W <- sum(sapply(unique(labels), function(cl) {
idx <- labels == cl
xk <- X[idx, , drop = FALSE]
ck <- colMeans(xk)
sum(sweep(xk, 2, ck)^2)
}))
ch_val <- (SS_B / (k - 1)) / (SS_W / (n - k))
results_kmedian <- rbind(results_kmedian, data.frame(
k = k,
silhouette = round(sil_val, 4),
calinski = round(ch_val, 4)
))
}
results_kmedian| k | silhouette | calinski |
|---|---|---|
| 2 | 0.0890 | 182.4663 |
| 3 | 0.1086 | 211.6491 |
| 4 | 0.1125 | 207.5246 |
| 5 | 0.1110 | 186.0217 |
| 6 | 0.1098 | 184.8880 |
| 7 | 0.1157 | 183.0589 |
| 8 | 0.1118 | 169.9067 |
| 9 | 0.1196 | 165.4861 |
| 10 | 0.1182 | 157.8589 |
# Tujuan: memvisualisasikan evaluasi jumlah cluster K-Median dan menjalankan model final.
plot(results_kmedian$k, results_kmedian$silhouette,
type = "b", pch = 19, col = "steelblue",
main = "Silhouette Score - K-Median",
xlab = "Jumlah Cluster (k)", ylab = "Silhouette Score")best_k_kmedian <- results_kmedian$k[which.max(results_kmedian$silhouette)]
cat("Jumlah cluster terbaik untuk K-Median:", best_k_kmedian, "\n")## Jumlah cluster terbaik untuk K-Median: 9
kmedian_res <- kmedian_fit(X, n_clusters = best_k_kmedian, max_iter = 100, random_state = 42)
kmedian_labels <- kmedian_res$labels
eval_kmedian <- evaluate_clustering(X, kmedian_labels, "K-Median")
eval_kmedian| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| K-Median | 9 | 0.1196 | 165.4861 | 2.2029 |
# Tujuan: memvisualisasikan hasil clustering K-Median dan membuat profil setiap cluster.
plot_cluster_pca(X_pca, kmedian_labels, "Clustering K-Median pada Bidang PCA")| cluster | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
|---|---|---|---|---|---|---|---|---|
| 0 | 21.26388 | 1.635362 | 64.59837 | 2.580234 | 2.726455 | 1.927141 | 1.0957280 | 0.8991648 |
| 1 | 28.63076 | 1.665050 | 92.20591 | 2.606705 | 1.739690 | 1.432352 | 0.9238745 | 0.3551007 |
| 2 | 20.78880 | 1.699987 | 71.60830 | 1.965530 | 2.867670 | 1.910991 | 0.8648035 | 1.0118062 |
| 3 | 22.64937 | 1.766499 | 78.50196 | 2.170286 | 2.994546 | 2.221035 | 2.0468241 | 0.9351300 |
| 4 | 25.27394 | 1.640005 | 105.64789 | 2.978989 | 2.944323 | 2.209656 | 0.1103169 | 0.4786664 |
| 5 | 27.11538 | 1.800603 | 108.92870 | 2.070658 | 2.798827 | 2.147342 | 0.9341045 | 0.4125383 |
| 6 | 36.39108 | 1.621518 | 75.82291 | 2.280542 | 2.666225 | 1.664620 | 0.4616460 | 0.1996102 |
| 7 | 22.72935 | 1.782004 | 116.58584 | 2.939923 | 2.939500 | 2.288728 | 1.5298585 | 0.7350692 |
| 8 | 21.00499 | 1.598305 | 55.98495 | 2.476344 | 1.589590 | 1.854580 | 0.8651569 | 0.3320110 |
Metode DBSCAN tidak memerlukan jumlah cluster di
awal, tetapi membutuhkan penentuan parameter eps dan
minPts. Oleh karena itu dilakukan eksplorasi beberapa
kombinasi parameter untuk mencari hasil terbaik.
# Tujuan: mencoba beberapa kombinasi parameter DBSCAN dan memilih hasil terbaik
# berdasarkan silhouette score.
eps_values <- c(0.5, 0.8, 1.0, 1.2, 1.5, 2.0)
min_samples_values <- c(3, 5, 7, 10)
dbscan_results <- data.frame()
for (eps in eps_values) {
for (min_s in min_samples_values) {
db_res <- dbscan::dbscan(X, eps = eps, minPts = min_s)
labels <- db_res$cluster - 1L # 0-based; noise tetap 0-1 = -1 setelah konversi
# Konversi: dbscan R: 0 = noise, 1..k = cluster → Python: -1 = noise, 0..k-1
labels_py <- ifelse(db_res$cluster == 0, -1L, db_res$cluster - 1L)
ev <- evaluate_clustering(X, labels_py, paste0("DBSCAN_eps", eps, "_min", min_s))
ev$eps <- eps
ev$min_samples <- min_s
ev$n_noise <- sum(labels_py == -1)
dbscan_results <- rbind(dbscan_results, ev)
}
}
dbscan_eval_df <- dbscan_results[order(-dbscan_results$Silhouette), ]
dbscan_eval_df| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin | eps | min_samples | n_noise | |
|---|---|---|---|---|---|---|---|---|
| 4 | DBSCAN_eps0.5_min10 | 8 | 0.6795 | 722.6154 | 0.3973 | 0.5 | 10 | 1808 |
| 3 | DBSCAN_eps0.5_min7 | 11 | 0.5778 | 211.1967 | 0.4379 | 0.5 | 7 | 1768 |
| 2 | DBSCAN_eps0.5_min5 | 27 | 0.5614 | 149.3517 | 0.5119 | 0.5 | 5 | 1658 |
| 8 | DBSCAN_eps0.8_min10 | 10 | 0.5422 | 235.7461 | 0.5744 | 0.8 | 10 | 1626 |
| 1 | DBSCAN_eps0.5_min3 | 67 | 0.4990 | 114.5672 | 0.5103 | 0.5 | 3 | 1487 |
| 7 | DBSCAN_eps0.8_min7 | 21 | 0.4919 | 171.2694 | 0.6184 | 0.8 | 7 | 1505 |
| 6 | DBSCAN_eps0.8_min5 | 29 | 0.4502 | 133.9020 | 0.7271 | 0.8 | 5 | 1406 |
| 5 | DBSCAN_eps0.8_min3 | 93 | 0.3934 | 70.2794 | 0.6782 | 0.8 | 3 | 1133 |
| 12 | DBSCAN_eps1_min10 | 11 | 0.3790 | 126.5631 | 0.8130 | 1.0 | 10 | 1476 |
| 11 | DBSCAN_eps1_min7 | 23 | 0.3355 | 92.0111 | 0.8073 | 1.0 | 7 | 1329 |
| 16 | DBSCAN_eps1.2_min10 | 15 | 0.3330 | 117.0876 | 0.9554 | 1.2 | 10 | 1305 |
| 10 | DBSCAN_eps1_min5 | 34 | 0.3082 | 72.9351 | 0.8348 | 1.0 | 5 | 1206 |
| 9 | DBSCAN_eps1_min3 | 104 | 0.2422 | 39.6556 | 0.8152 | 1.0 | 3 | 911 |
| 15 | DBSCAN_eps1.2_min7 | 26 | 0.2227 | 70.0029 | 0.9630 | 1.2 | 7 | 1100 |
| 22 | DBSCAN_eps2_min5 | 2 | 0.1978 | 7.8793 | 0.9899 | 2.0 | 5 | 215 |
| 14 | DBSCAN_eps1.2_min5 | 43 | 0.1970 | 49.7220 | 1.0063 | 1.2 | 5 | 921 |
| 13 | DBSCAN_eps1.2_min3 | 83 | 0.1484 | 33.4589 | 1.0101 | 1.2 | 3 | 684 |
| 24 | DBSCAN_eps2_min10 | 2 | 0.1329 | 38.3578 | 0.9683 | 2.0 | 10 | 283 |
| 20 | DBSCAN_eps1.5_min10 | 18 | 0.0953 | 48.7055 | 1.0651 | 1.5 | 10 | 902 |
| 19 | DBSCAN_eps1.5_min7 | 22 | 0.0565 | 35.8673 | 1.0788 | 1.5 | 7 | 741 |
| 21 | DBSCAN_eps2_min3 | 6 | -0.0254 | 4.3452 | 1.1558 | 2.0 | 3 | 183 |
| 18 | DBSCAN_eps1.5_min5 | 30 | -0.0767 | 20.2619 | 1.1411 | 1.5 | 5 | 568 |
| 17 | DBSCAN_eps1.5_min3 | 46 | -0.0979 | 15.5633 | 1.1945 | 1.5 | 3 | 437 |
| 23 | DBSCAN_eps2_min7 | 1 | NA | NA | NA | 2.0 | 7 | 245 |
# Tujuan: menjalankan DBSCAN final berdasarkan kombinasi parameter terbaik.
best_row <- dbscan_eval_df[1, ]
best_eps <- best_row$eps
best_min_samples <- best_row$min_samples
cat("Parameter terbaik DBSCAN:\n")## Parameter terbaik DBSCAN:
## eps = 0.5
## min_samples = 10
db_final <- dbscan::dbscan(X, eps = best_eps, minPts = best_min_samples)
dbscan_labels <- ifelse(db_final$cluster == 0, -1L, db_final$cluster - 1L)
eval_dbscan <- evaluate_clustering(X, dbscan_labels, "DBSCAN")
eval_dbscan| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| DBSCAN | 8 | 0.6795 | 722.6154 | 0.3973 |
# Tujuan: memvisualisasikan hasil clustering DBSCAN dan melihat jumlah anggota tiap label cluster.
plot_cluster_pca(X_pca, dbscan_labels, "Clustering DBSCAN pada Bidang PCA")## dbscan_labels
## 3 4 7 0 6 2 5 1 -1
## 10 11 11 19 36 45 55 116 1808
Metode Mean Shift akan mencari pusat cluster secara
otomatis berdasarkan kepadatan data. Parameter penting yang digunakan
adalah bandwidth, yang dapat diestimasi dari data.
# Tujuan: mengestimasi bandwidth, menjalankan Mean Shift, dan menghitung evaluasi clustering.
# ---- Implementasi Mean Shift manual ----
mean_shift_fit <- function(X_data, bandwidth, tol = 1e-3, max_iter = 200) {
n <- nrow(X_data)
# Setiap titik mulai dari posisi awalnya
shifted <- X_data
for (iter in seq_len(max_iter)) {
new_shifted <- shifted
for (i in seq_len(n)) {
# Hitung jarak titik i ke semua titik asli
dists <- sqrt(rowSums(sweep(X_data, 2, shifted[i, ])^2))
# Titik dalam bandwidth
in_bw <- dists <= bandwidth
if (sum(in_bw) == 0) next
# Geser ke rata-rata titik dalam bandwidth (flat kernel)
new_shifted[i, ] <- colMeans(X_data[in_bw, , drop = FALSE])
}
# Cek konvergensi
shift_dist <- max(sqrt(rowSums((new_shifted - shifted)^2)))
shifted <- new_shifted
if (shift_dist < tol) break
}
# Cluster: titik dengan mode yang sama (dalam jarak kecil) digabung
labels <- integer(n)
label_counter <- 0L
cluster_modes <- list()
for (i in seq_len(n)) {
mode_i <- shifted[i, ]
assigned <- FALSE
for (j in seq_along(cluster_modes)) {
if (sqrt(sum((mode_i - cluster_modes[[j]])^2)) < bandwidth * 0.5) {
labels[i] <- j - 1L
assigned <- TRUE
break
}
}
if (!assigned) {
label_counter <- label_counter + 1L
cluster_modes[[label_counter]] <- mode_i
labels[i] <- label_counter - 1L
}
}
list(labels = labels, modes = do.call(rbind, cluster_modes))
}
# Estimasi bandwidth: kuantil ke-20 dari jarak antar sampel
set.seed(42)
sample_idx <- sample(nrow(X), min(300, nrow(X)))
X_sample <- X[sample_idx, ]
dist_sample <- as.matrix(dist(X_sample))
bandwidth <- quantile(dist_sample[upper.tri(dist_sample)], probs = 0.20)
cat("Bandwidth yang digunakan:", round(bandwidth, 4), "\n")## Bandwidth yang digunakan: 3.4329
ms_res <- mean_shift_fit(X, bandwidth = bandwidth)
meanshift_labels <- ms_res$labels
eval_meanshift <- evaluate_clustering(X, meanshift_labels, "Mean Shift")
eval_meanshift| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| Mean Shift | 3 | 0.3666 | 4.379 | 0.4707 |
# Tujuan: memvisualisasikan hasil clustering Mean Shift dan melihat ukuran setiap cluster.
plot_cluster_pca(X_pca, meanshift_labels, "Clustering Mean Shift pada Bidang PCA")## meanshift_labels
## 1 2 0
## 1 1 2109
Berbeda dari metode hard clustering, Fuzzy C-Means memberikan derajat keanggotaan setiap data pada seluruh cluster. Pada tahap ini dicari jumlah cluster terbaik menggunakan FPC dan Silhouette Score.
# Tujuan: mencari jumlah cluster terbaik untuk metode Fuzzy C-Means.
set.seed(42)
X_mat <- as.matrix(X)
fcm_results <- data.frame()
for (c in 2:10) {
fcm_res <- e1071::cmeans(X_mat, centers = c, iter.max = 1000,
m = 2, method = "cmeans", verbose = FALSE)
labels <- fcm_res$cluster - 1L # 0-based
# Fuzzy Partition Coefficient
u <- fcm_res$membership # matrix n x c
fpc <- sum(u^2) / nrow(u)
sil_val <- mean(silhouette(labels, dist(X_mat))[, 3])
n <- nrow(X_mat)
cent_all <- colMeans(X_mat)
SS_B <- sum(sapply(unique(labels), function(cl) {
idx <- labels == cl; nk <- sum(idx)
ck <- colMeans(X_mat[idx, , drop = FALSE])
nk * sum((ck - cent_all)^2)
}))
SS_W <- sum(sapply(unique(labels), function(cl) {
idx <- labels == cl
xk <- X_mat[idx, , drop = FALSE]
ck <- colMeans(xk)
sum(sweep(xk, 2, ck)^2)
}))
ch_val <- (SS_B / (c - 1)) / (SS_W / (n - c))
fcm_results <- rbind(fcm_results, data.frame(
c = c,
FPC = round(fpc, 4),
silhouette = round(sil_val, 4),
calinski = round(ch_val, 4)
))
}
fcm_results| c | FPC | silhouette | calinski |
|---|---|---|---|
| 2 | 0.5000 | 0.1313 | 321.4452 |
| 3 | 0.3333 | 0.0603 | 173.2288 |
| 4 | 0.2500 | 0.0731 | 109.1558 |
| 5 | 0.2000 | 0.0610 | 83.2500 |
| 6 | 0.1667 | 0.0807 | 66.9434 |
| 7 | 0.1429 | 0.0807 | 54.8565 |
| 8 | 0.1250 | -0.0286 | 55.2273 |
| 9 | 0.1111 | 0.0124 | 40.8763 |
| 10 | 0.1247 | -0.0251 | 54.6141 |
# Tujuan: memvisualisasikan evaluasi Fuzzy C-Means dan menjalankan model final
# dengan jumlah cluster terbaik.
par(mfrow = c(1, 2))
plot(fcm_results$c, fcm_results$FPC,
type = "b", pch = 19, col = "steelblue",
main = "FPC - Fuzzy C-Means",
xlab = "Jumlah Cluster", ylab = "FPC")
plot(fcm_results$c, fcm_results$silhouette,
type = "b", pch = 19, col = "darkorange",
main = "Silhouette Score - Fuzzy C-Means",
xlab = "Jumlah Cluster", ylab = "Silhouette Score")par(mfrow = c(1, 1))
best_c <- fcm_results$c[which.max(fcm_results$silhouette)]
cat("Jumlah cluster terbaik untuk Fuzzy C-Means:", best_c, "\n")## Jumlah cluster terbaik untuk Fuzzy C-Means: 2
set.seed(42)
fcm_final <- e1071::cmeans(X_mat, centers = best_c, iter.max = 1000,
m = 2, method = "cmeans", verbose = FALSE)
fcm_labels <- fcm_final$cluster - 1L # 0-based
eval_fcm <- evaluate_clustering(X, fcm_labels, "Fuzzy C-Means")
eval_fcm| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| Fuzzy C-Means | 2 | 0.1313 | 321.4452 | 2.4779 |
# Tujuan: memvisualisasikan hasil clustering Fuzzy C-Means dan melihat derajat
# keanggotaan beberapa data pertama.
plot_cluster_pca(X_pca, fcm_labels, "Clustering Fuzzy C-Means pada Bidang PCA")u_matrix <- fcm_final$membership
colnames(u_matrix) <- paste0("cluster_", 0:(best_c - 1))
membership_df <- as.data.frame(u_matrix)
head(membership_df)| cluster_0 | cluster_1 |
|---|---|
| 0.5002205 | 0.4997795 |
| 0.5000282 | 0.4999718 |
| 0.4999929 | 0.5000071 |
| 0.4999085 | 0.5000915 |
| 0.5000096 | 0.4999904 |
| 0.5001174 | 0.4998826 |
Pada bagian ini seluruh metode dibandingkan menggunakan metrik evaluasi yang sama:
# Tujuan: menyusun tabel perbandingan seluruh metode clustering.
all_eval <- rbind(eval_kmeans, eval_kmedian, eval_dbscan, eval_meanshift, eval_fcm)
all_eval_sorted <- all_eval[order(-all_eval$Silhouette, na.last = TRUE), ]
all_eval_sorted| Method | n_clusters | Silhouette | Calinski-Harabasz | Davies-Bouldin | |
|---|---|---|---|---|---|
| 3 | DBSCAN | 8 | 0.6795 | 722.6154 | 0.3973 |
| 4 | Mean Shift | 3 | 0.3666 | 4.3790 | 0.4707 |
| 1 | K-Means | 10 | 0.1531 | 207.1689 | 1.8285 |
| 5 | Fuzzy C-Means | 2 | 0.1313 | 321.4452 | 2.4779 |
| 2 | K-Median | 9 | 0.1196 | 165.4861 | 2.2029 |
# Tujuan: memvisualisasikan perbandingan silhouette score dari seluruh metode clustering.
eval_plot <- all_eval_sorted[!is.na(all_eval_sorted$Silhouette), ]
eval_plot$Method <- factor(eval_plot$Method,
levels = eval_plot$Method[order(-eval_plot$Silhouette)])
ggplot(eval_plot, aes(x = Method, y = Silhouette, fill = Method)) +
geom_col(width = 0.6) +
geom_text(aes(label = round(Silhouette, 4)), vjust = -0.4, size = 3.5) +
labs(title = "Perbandingan Silhouette Score Antar Metode",
x = "", y = "Silhouette Score") +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 20, hjust = 1))# Tujuan: membandingkan label cluster dengan label obesitas asli untuk keperluan
# interpretasi tambahan.
cat("Crosstab K-Means vs label asli\n")## Crosstab K-Means vs label asli
## label_asli
## kmeans_labels Insufficient_Weight Normal_Weight Obesity_Type_I Obesity_Type_II
## 0 0.2229 0.2048 0.2169 0.0060
## 1 0.3571 0.2143 0.0390 0.0000
## 2 0.2174 0.3000 0.1609 0.0000
## 3 0.0000 0.0505 0.2271 0.4069
## 4 0.0000 0.0000 0.0943 0.3279
## 5 0.0080 0.0478 0.3147 0.2072
## 6 0.0000 0.0299 0.0100 0.0000
## 7 0.2654 0.1728 0.1543 0.0000
## 8 0.0486 0.1389 0.3125 0.2153
## 9 0.3223 0.2851 0.1074 0.0165
## label_asli
## kmeans_labels Obesity_Type_III Overweight_Level_I Overweight_Level_II
## 0 0.0000 0.1807 0.1687
## 1 0.0000 0.2403 0.1494
## 2 0.0000 0.1696 0.1522
## 3 0.0000 0.1167 0.1987
## 4 0.5656 0.0000 0.0123
## 5 0.0000 0.1713 0.2510
## 6 0.9254 0.0000 0.0348
## 7 0.0000 0.2778 0.1296
## 8 0.0000 0.1806 0.1042
## 9 0.0000 0.1364 0.1322
##
## Crosstab Fuzzy C-Means vs label asli
## label_asli
## fcm_labels Insufficient_Weight Normal_Weight Obesity_Type_I Obesity_Type_II
## 0 0.2171 0.2208 0.1711 0.0359
## 1 0.0352 0.0459 0.1611 0.2520
## label_asli
## fcm_labels Obesity_Type_III Overweight_Level_I Overweight_Level_II
## 0 0.0175 0.1895 0.1481
## 1 0.2979 0.0820 0.1260
Sebagai tahap akhir, hasil clustering dan tabel evaluasi disimpan ke
file .csv agar dapat digunakan kembali dalam laporan atau
lampiran tugas.
# Tujuan: menyimpan hasil clustering dan evaluasi ke dalam file CSV.
df_result <- df
df_result$cluster_kmeans <- kmeans_labels
df_result$cluster_kmedian <- kmedian_labels
df_result$cluster_dbscan <- dbscan_labels
df_result$cluster_meanshift <- meanshift_labels
df_result$cluster_fcm <- fcm_labels
write.csv(df_result, "hasil_clustering_obesity.csv", row.names = FALSE)
write.csv(all_eval, "evaluasi_metode_clustering.csv", row.names = FALSE)
write.csv(profile_kmeans, "profil_cluster_kmeans.csv", row.names = TRUE)
write.csv(profile_kmedian, "profil_cluster_kmedian.csv", row.names = TRUE)
cat("File berhasil disimpan:\n")## File berhasil disimpan:
## - hasil_clustering_obesity.csv
## - evaluasi_metode_clustering.csv
## - profil_cluster_kmeans.csv
## - profil_cluster_kmedian.csv