Kelompok 14:
Kartika Nur Savira (24031554049)
Tara Tabriza Rachman (24031554107)
Dataset: https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset
Notebook ini membandingkan 5 algoritma clustering pada dataset gym members: K-means, K-medians, DBSCAN, Mean Shift, dan Fuzzy C-means. Data diproses menggunakan PCA sebelum clustering untuk mereduksi dimensi dan menghilangkan multikolinearitas antar fitur.
install.packages("flexclust")
install.packages("dbscan")
install.packages("meanShiftR")
install.packages("e1071")
install.packages("cluster")
install.packages("fpc")
install.packages("mclust")
install.packages("factoextra")
install.packages("gridExtra")
install.packages("psych") # KMO & Bartlett test
install.packages("reshape2") # heatmap korelasidata <- read.csv("/Users/savv/gym_members_exercise_tracking.csv")
cat("Dimensi:", nrow(data), "x", ncol(data), "\n")## Dimensi: 973 x 15
## 'data.frame': 973 obs. of 15 variables:
## $ Age : int 56 46 32 25 38 56 36 40 28 28 ...
## $ Gender : chr "Male" "Female" "Female" "Male" ...
## $ Weight..kg. : num 88.3 74.9 68.1 53.2 46.1 ...
## $ Height..m. : num 1.71 1.53 1.66 1.7 1.79 1.68 1.72 1.51 1.94 1.84 ...
## $ Max_BPM : int 180 179 167 190 188 168 174 189 185 169 ...
## $ Avg_BPM : int 157 151 122 164 158 156 169 141 127 136 ...
## $ Resting_BPM : int 60 66 54 56 68 74 73 64 52 64 ...
## $ Session_Duration..hours. : num 1.69 1.3 1.11 0.59 0.64 1.59 1.49 1.27 1.03 1.08 ...
## $ Calories_Burned : num 1313 883 677 532 556 ...
## $ Workout_Type : chr "Yoga" "HIIT" "Cardio" "Strength" ...
## $ Fat_Percentage : num 12.6 33.9 33.4 28.8 29.2 15.5 21.3 30.6 28.9 29.7 ...
## $ Water_Intake..liters. : num 3.5 2.1 2.3 2.1 2.8 2.7 2.3 1.9 2.6 2.7 ...
## $ Workout_Frequency..days.week.: int 4 4 4 3 3 5 3 3 4 3 ...
## $ Experience_Level : int 3 2 2 1 1 3 2 2 2 1 ...
## $ BMI : num 30.2 32 24.7 18.4 14.4 ...
## Age Gender Weight..kg. Height..m.
## Min. :18.00 Length:973 Min. : 40.00 Min. :1.500
## 1st Qu.:28.00 Class :character 1st Qu.: 58.10 1st Qu.:1.620
## Median :40.00 Mode :character Median : 70.00 Median :1.710
## Mean :38.68 Mean : 73.85 Mean :1.723
## 3rd Qu.:49.00 3rd Qu.: 86.00 3rd Qu.:1.800
## Max. :59.00 Max. :129.90 Max. :2.000
## Max_BPM Avg_BPM Resting_BPM Session_Duration..hours.
## Min. :160.0 Min. :120.0 Min. :50.00 Min. :0.500
## 1st Qu.:170.0 1st Qu.:131.0 1st Qu.:56.00 1st Qu.:1.040
## Median :180.0 Median :143.0 Median :62.00 Median :1.260
## Mean :179.9 Mean :143.8 Mean :62.22 Mean :1.256
## 3rd Qu.:190.0 3rd Qu.:156.0 3rd Qu.:68.00 3rd Qu.:1.460
## Max. :199.0 Max. :169.0 Max. :74.00 Max. :2.000
## Calories_Burned Workout_Type Fat_Percentage Water_Intake..liters.
## Min. : 303.0 Length:973 Min. :10.00 Min. :1.500
## 1st Qu.: 720.0 Class :character 1st Qu.:21.30 1st Qu.:2.200
## Median : 893.0 Mode :character Median :26.20 Median :2.600
## Mean : 905.4 Mean :24.98 Mean :2.627
## 3rd Qu.:1076.0 3rd Qu.:29.30 3rd Qu.:3.100
## Max. :1783.0 Max. :35.00 Max. :3.700
## Workout_Frequency..days.week. Experience_Level BMI
## Min. :2.000 Min. :1.00 Min. :12.32
## 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:20.11
## Median :3.000 Median :2.00 Median :24.16
## Mean :3.322 Mean :1.81 Mean :24.91
## 3rd Qu.:4.000 3rd Qu.:2.00 3rd Qu.:28.56
## Max. :5.000 Max. :3.00 Max. :49.84
## Age Gender
## 0 0
## Weight..kg. Height..m.
## 0 0
## Max_BPM Avg_BPM
## 0 0
## Resting_BPM Session_Duration..hours.
## 0 0
## Calories_Burned Workout_Type
## 0 0
## Fat_Percentage Water_Intake..liters.
## 0 0
## Workout_Frequency..days.week. Experience_Level
## 0 0
## BMI
## 0
## Age Weight..kg. Height..m. Max_BPM Avg_BPM Resting_BPM
## 1 56 88.3 1.71 180 157 60
## 2 46 74.9 1.53 179 151 66
## 3 32 68.1 1.66 167 122 54
## 4 25 53.2 1.70 190 164 56
## 5 38 46.1 1.79 188 158 68
## 6 56 58.0 1.68 168 156 74
## 7 36 70.3 1.72 174 169 73
## 8 40 69.7 1.51 189 141 64
## 9 28 121.7 1.94 185 127 52
## 10 28 101.8 1.84 169 136 64
## Session_Duration..hours. Calories_Burned Fat_Percentage
## 1 1.69 1313 12.6
## 2 1.30 883 33.9
## 3 1.11 677 33.4
## 4 0.59 532 28.8
## 5 0.64 556 29.2
## 6 1.59 1116 15.5
## 7 1.49 1385 21.3
## 8 1.27 895 30.6
## 9 1.03 719 28.9
## 10 1.08 808 29.7
## Water_Intake..liters. Workout_Frequency..days.week. Experience_Level BMI
## 1 3.5 4 3 30.20
## 2 2.1 4 2 32.00
## 3 2.3 4 2 24.71
## 4 2.1 3 1 18.41
## 5 2.8 3 1 14.39
## 6 2.7 5 3 20.55
## 7 2.3 3 2 23.76
## 8 1.9 3 2 30.57
## 9 2.6 4 2 32.34
## 10 2.7 3 1 30.07
Interpretasi: Dataset terdiri dari 973 anggota gym dengan 15 variabel (13 numerik, 2 kategorikal:
GenderdanWorkout_Type). Tidak ditemukan missing values pada seluruh kolom, sehingga data siap diproses tanpa perlu imputasi.Fitur numerik yang tersedia mencakup data fisiologis (Age, Weight, Height, BMI, Fat_Percentage, BPM) dan perilaku latihan (Session_Duration, Calories_Burned, Water_Intake, Workout_Frequency, Experience_Level).
data_num <- data[, sapply(data, is.numeric)]
data_scale <- scale(data_num)
set.seed(123)
cat("Fitur numerik yang digunakan:", ncol(data_num), "\n")## Fitur numerik yang digunakan: 13
## Contoh hasil scaling (10 baris pertama):
## Age Weight..kg. Height..m. Max_BPM Avg_BPM Resting_BPM
## [1,] 1.42161147 0.68114221 -0.09849406 0.01007625 0.9224960 -0.3033989
## [2,] 0.60065595 0.04929028 -1.50782814 -0.07668648 0.5042348 0.5154835
## [3,] -0.54868178 -0.27135099 -0.48997575 -1.11783927 -1.5173612 -1.1222812
## [4,] -1.12335065 -0.97393261 -0.17679039 0.87770357 1.4104675 -0.8493204
## [5,] -0.05610847 -1.30871982 0.52787664 0.70417810 0.9922062 0.7884443
## [6,] 1.42161147 -0.74759759 -0.33338307 -1.03107653 0.8527858 1.6073266
## [7,] -0.22029958 -0.16761411 -0.02019772 -0.51050014 1.7590185 1.4708462
## [8,] 0.10808263 -0.19590598 -1.66442081 0.79094083 -0.1928673 0.2425227
## [9,] -0.87706399 2.25605671 1.70232171 0.44388991 -1.1688102 -1.3952420
## [10,] -0.87706399 1.31770945 0.91935833 -0.94431380 -0.5414183 0.2425227
## Session_Duration..hours. Calories_Burned Fat_Percentage
## [1,] 1.26394825 1.49492124 -1.9773038
## [2,] 0.12703299 -0.08224134 1.4255680
## [3,] -0.42684881 -0.83781226 1.3456884
## [4,] -1.94273583 -1.36964616 0.6107959
## [5,] -1.79697746 -1.28161848 0.6746996
## [6,] 0.97243152 0.77236071 -1.5140020
## [7,] 0.68091478 1.75900428 -0.5873984
## [8,] 0.03957797 -0.03822750 0.8983625
## [9,] -0.66006220 -0.68376382 0.6267718
## [10,] -0.51430383 -0.35732784 0.7545792
## Water_Intake..liters. Workout_Frequency..days.week. Experience_Level
## [1,] 1.4552185 0.7429130 1.608957
## [2,] -0.8774464 0.7429130 0.257044
## [3,] -0.5442086 0.7429130 0.257044
## [4,] -0.8774464 -0.3523209 -1.094869
## [5,] 0.2888860 -0.3523209 -1.094869
## [6,] 0.1222671 1.8381469 1.608957
## [7,] -0.5442086 -0.3523209 0.257044
## [8,] -1.2106842 -0.3523209 0.257044
## [9,] -0.0443518 0.7429130 0.257044
## [10,] 0.1222671 -0.3523209 -1.094869
## BMI
## [1,] 0.79387004
## [2,] 1.06410462
## [3,] -0.03034546
## [4,] -0.97616652
## [5,] -1.57969043
## [6,] -0.65488762
## [7,] -0.17296927
## [8,] 0.84941826
## [9,] 1.11514893
## [10,] 0.77435309
Interpretasi: Terdapat 13 fitur numerik yang digunakan. Scaling Z-score diperlukan karena rentang antar fitur sangat berbeda — misalnya
Calories_Burnedbisa mencapai ribuan, sementaraHeight (m)hanya berkisar 1.5–2.0. Tanpa scaling, fitur dengan rentang besar akan mendominasi perhitungan jarak dan membuat clustering menjadi bias.
library(ggplot2)
library(reshape2)
cor_matrix <- cor(data_num)
cat("Matriks Korelasi (10 baris & kolom pertama):\n")## Matriks Korelasi (10 baris & kolom pertama):
## Age Weight..kg. Height..m. Max_BPM Avg_BPM
## Age 1.00 -0.04 -0.03 -0.02 0.04
## Weight..kg. -0.04 1.00 0.37 0.06 0.01
## Height..m. -0.03 0.37 1.00 -0.02 -0.01
## Max_BPM -0.02 0.06 -0.02 1.00 -0.04
## Avg_BPM 0.04 0.01 -0.01 -0.04 1.00
## Resting_BPM 0.00 -0.03 -0.01 0.04 0.06
## Session_Duration..hours. -0.02 -0.01 -0.01 0.01 0.02
## Calories_Burned -0.15 0.10 0.09 0.00 0.34
## Fat_Percentage 0.00 -0.23 -0.24 -0.01 -0.01
## Water_Intake..liters. 0.04 0.39 0.39 0.03 0.00
## Workout_Frequency..days.week. 0.01 -0.01 -0.01 -0.03 -0.01
## Experience_Level -0.02 0.00 -0.01 0.00 0.00
## BMI -0.01 0.85 -0.16 0.07 0.02
## Resting_BPM Session_Duration..hours.
## Age 0.00 -0.02
## Weight..kg. -0.03 -0.01
## Height..m. -0.01 -0.01
## Max_BPM 0.04 0.01
## Avg_BPM 0.06 0.02
## Resting_BPM 1.00 -0.02
## Session_Duration..hours. -0.02 1.00
## Calories_Burned 0.02 0.91
## Fat_Percentage -0.02 -0.58
## Water_Intake..liters. 0.01 0.28
## Workout_Frequency..days.week. -0.01 0.64
## Experience_Level 0.00 0.76
## BMI -0.03 -0.01
## Calories_Burned Fat_Percentage
## Age -0.15 0.00
## Weight..kg. 0.10 -0.23
## Height..m. 0.09 -0.24
## Max_BPM 0.00 -0.01
## Avg_BPM 0.34 -0.01
## Resting_BPM 0.02 -0.02
## Session_Duration..hours. 0.91 -0.58
## Calories_Burned 1.00 -0.60
## Fat_Percentage -0.60 1.00
## Water_Intake..liters. 0.36 -0.59
## Workout_Frequency..days.week. 0.58 -0.54
## Experience_Level 0.69 -0.65
## BMI 0.06 -0.12
## Water_Intake..liters.
## Age 0.04
## Weight..kg. 0.39
## Height..m. 0.39
## Max_BPM 0.03
## Avg_BPM 0.00
## Resting_BPM 0.01
## Session_Duration..hours. 0.28
## Calories_Burned 0.36
## Fat_Percentage -0.59
## Water_Intake..liters. 1.00
## Workout_Frequency..days.week. 0.24
## Experience_Level 0.30
## BMI 0.21
## Workout_Frequency..days.week. Experience_Level
## Age 0.01 -0.02
## Weight..kg. -0.01 0.00
## Height..m. -0.01 -0.01
## Max_BPM -0.03 0.00
## Avg_BPM -0.01 0.00
## Resting_BPM -0.01 0.00
## Session_Duration..hours. 0.64 0.76
## Calories_Burned 0.58 0.69
## Fat_Percentage -0.54 -0.65
## Water_Intake..liters. 0.24 0.30
## Workout_Frequency..days.week. 1.00 0.84
## Experience_Level 0.84 1.00
## BMI 0.00 0.02
## BMI
## Age -0.01
## Weight..kg. 0.85
## Height..m. -0.16
## Max_BPM 0.07
## Avg_BPM 0.02
## Resting_BPM -0.03
## Session_Duration..hours. -0.01
## Calories_Burned 0.06
## Fat_Percentage -0.12
## Water_Intake..liters. 0.21
## Workout_Frequency..days.week. 0.00
## Experience_Level 0.02
## BMI 1.00
# Heatmap korelasi
cor_melt <- melt(cor_matrix)
ggplot(cor_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
midpoint = 0, limit = c(-1, 1), name = "Korelasi") +
geom_text(aes(label = round(value, 2)), size = 2.5) +
labs(title = "Heatmap Matriks Korelasi") +
theme_minimal(base_size = 10) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold", hjust = 0.5))Interpretasi Matriks Korelasi: Heatmap menunjukkan adanya korelasi yang cukup tinggi antar beberapa variabel, misalnya
Calories_BurneddenganSession_Duration, sertaWeight_kgdenganBMI. Adanya multikolinearitas ini mengkonfirmasi bahwa PCA diperlukan untuk mereduksi redundansi informasi antar fitur.
library(psych)
# KMO Test
kmo_result <- KMO(data_scale)
cat("=== KMO (Kaiser-Meyer-Olkin) Test ===\n")## === KMO (Kaiser-Meyer-Olkin) Test ===
## Overall KMO: 0.4702
## KMO per variabel:
## Age Weight..kg.
## 0.0217 0.3534
## Height..m. Max_BPM
## 0.1564 0.6325
## Avg_BPM Resting_BPM
## 0.0572 0.2520
## Session_Duration..hours. Calories_Burned
## 0.4985 0.4927
## Fat_Percentage Water_Intake..liters.
## 0.8612 0.7516
## Workout_Frequency..days.week. Experience_Level
## 0.7995 0.7916
## BMI
## 0.2890
# Kategorisasi KMO
kmo_val <- kmo_result$MSA
cat("\nKesimpulan KMO:", kmo_val,
ifelse(kmo_val >= 0.90, "→ Marvelous (Sangat Baik)",
ifelse(kmo_val >= 0.80, "→ Meritorious (Baik)",
ifelse(kmo_val >= 0.70, "→ Middling (Cukup)",
ifelse(kmo_val >= 0.60, "→ Mediocre (Sedang)",
ifelse(kmo_val >= 0.50, "→ Miserable (Buruk)",
"→ Unacceptable (Tidak Layak)"))))), "\n")##
## Kesimpulan KMO: 0.4701676 → Unacceptable (Tidak Layak)
# Bartlett's Test of Sphericity
bartlett_result <- cortest.bartlett(cor_matrix, n = nrow(data_scale))
cat("=== Bartlett's Test of Sphericity ===\n")## === Bartlett's Test of Sphericity ===
## Chi-square : 10982.43
## df : 78
## p-value : 0e+00
cat("Kesimpulan :", ifelse(bartlett_result$p.value < 0.05,
"p < 0.05 → Matriks korelasi signifikan, PCA layak dilakukan ✅",
"p >= 0.05 → PCA tidak layak ❌"), "\n")## Kesimpulan : p < 0.05 → Matriks korelasi signifikan, PCA layak dilakukan ✅
Interpretasi KMO & Bartlett:
- KMO Test mengukur kecukupan sampling — nilai ≥ 0.50 berarti PCA layak dilakukan. Semakin mendekati 1.0, semakin baik.
- Bartlett’s Test menguji apakah matriks korelasi berbeda signifikan dari matriks identitas. Jika p < 0.05, berarti ada korelasi yang cukup antar variabel sehingga PCA bermakna.
- Kedua uji ini mengkonfirmasi bahwa data layak untuk direduksi dengan PCA.
# Matriks kovarian dari data yang sudah di-scaling
cov_matrix <- cov(data_scale)
cat("Dimensi matriks kovarian:", nrow(cov_matrix), "x", ncol(cov_matrix), "\n")## Dimensi matriks kovarian: 13 x 13
##
## Matriks Kovarian (5 baris & kolom pertama):
## Age Weight..kg. Height..m. Max_BPM Avg_BPM
## Age 1.0000 -0.0363 -0.0278 -0.0171 0.0360
## Weight..kg. -0.0363 1.0000 0.3653 0.0571 0.0097
## Height..m. -0.0278 0.3653 1.0000 -0.0177 -0.0148
## Max_BPM -0.0171 0.0571 -0.0177 1.0000 -0.0398
## Avg_BPM 0.0360 0.0097 -0.0148 -0.0398 1.0000
Interpretasi: Karena data sudah di-scaling (Z-score), matriks kovarian setara dengan matriks korelasi. Diagonal matriks berisi varians tiap variabel (bernilai 1 setelah scaling), sedangkan elemen off-diagonal menunjukkan kovarians antar pasangan variabel.
# Dekomposisi eigen dari matriks kovarian
eigen_res <- eigen(cov_matrix)
eigenvalues <- eigen_res$values
eigenvectors <- eigen_res$vectors
rownames(eigenvectors) <- colnames(data_scale)
cat("=== Eigenvalue ===\n")## === Eigenvalue ===
eigenvalue_df <- data.frame(
PC = paste0("PC", 1:length(eigenvalues)),
Eigenvalue = round(eigenvalues, 4),
Var_Explained = round(eigenvalues / sum(eigenvalues) * 100, 2),
Cum_Var = round(cumsum(eigenvalues / sum(eigenvalues)) * 100, 2)
)
print(eigenvalue_df)## PC Eigenvalue Var_Explained Cum_Var
## 1 PC1 4.0041 30.80 30.80
## 2 PC2 2.0927 16.10 46.90
## 3 PC3 1.2918 9.94 56.84
## 4 PC4 1.1401 8.77 65.61
## 5 PC5 1.0398 8.00 73.60
## 6 PC6 1.0317 7.94 81.54
## 7 PC7 0.9233 7.10 88.64
## 8 PC8 0.5481 4.22 92.86
## 9 PC9 0.4692 3.61 96.47
## 10 PC10 0.3065 2.36 98.83
## 11 PC11 0.1316 1.01 99.84
## 12 PC12 0.0149 0.11 99.95
## 13 PC13 0.0062 0.05 100.00
##
## Kaiser Criterion (Eigenvalue > 1):
## Jumlah PC dengan eigenvalue > 1: 6
Interpretasi Eigenvalue:
- Eigenvalue menunjukkan seberapa besar varians yang dijelaskan oleh tiap PC. PC1 memiliki eigenvalue tertinggi karena menangkap pola variasi terbesar dalam data.
- Kaiser Criterion: PC dengan eigenvalue > 1 dianggap signifikan karena menjelaskan varians lebih dari satu variabel asli.
- Kolom
Cum_Varmenunjukkan akumulasi varians — kita pilih PC hingga kumulatif ≥ 80%.
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
# Scree Plot eigenvalue
plot(eigenvalues, type = "b", pch = 19, col = "steelblue", lwd = 2,
xlab = "Principal Component", ylab = "Eigenvalue",
main = "Scree Plot (Eigenvalue)", frame = FALSE)
abline(h = 1, col = "red", lty = 2)
legend("topright", legend = "Eigenvalue = 1", col = "red", lty = 2, bty = "n")
# Kumulatif varians
cum_var_pct <- cumsum(eigenvalues / sum(eigenvalues)) * 100
plot(cum_var_pct, type = "b", pch = 19, col = "darkorange", lwd = 2,
xlab = "Jumlah PC", ylab = "Kumulatif Varians (%)",
main = "Kumulatif Varians", frame = FALSE, ylim = c(0, 100))
abline(h = 80, col = "red", lty = 2)
legend("bottomright", legend = "80% threshold", col = "red", lty = 2, bty = "n")par(mfrow = c(1, 1))
# Pilih PC
n_pc_kaiser <- sum(eigenvalues > 1)
n_pc_80 <- which(cum_var_pct >= 80)[1]
cat("PC dipilih (Kaiser, eigenvalue > 1):", n_pc_kaiser, "\n")## PC dipilih (Kaiser, eigenvalue > 1): 6
## PC dipilih (kumulatif >= 80%) : 6
## PC yang digunakan : 6
# Loading matrix: kontribusi tiap variabel ke tiap PC
loading_matrix <- eigenvectors[, 1:n_pc]
colnames(loading_matrix) <- paste0("PC", 1:n_pc)
cat("=== Loading Matrix (PC1 s.d. PC", n_pc, ") ===\n", sep = "")## === Loading Matrix (PC1 s.d. PC6) ===
## PC1 PC2 PC3 PC4 PC5 PC6
## Age 0.0254 0.0071 0.0717 -0.0689 0.7265 -0.6038
## Weight..kg. -0.1103 -0.6460 -0.1313 0.0073 0.0215 0.0486
## Height..m. -0.0863 -0.2714 0.6978 0.1278 -0.0837 0.0980
## Max_BPM -0.0049 -0.0752 -0.1189 -0.1405 -0.5724 -0.4811
## Avg_BPM -0.0510 0.0216 -0.1664 0.8310 0.1588 0.0527
## Resting_BPM -0.0011 0.0289 0.0328 0.3606 -0.3030 -0.5977
## Session_Duration..hours. -0.4333 0.1793 -0.0990 -0.0310 -0.0196 0.0191
## Calories_Burned -0.4370 0.0923 -0.1020 0.2763 -0.0782 0.1192
## Fat_Percentage 0.4096 0.0913 -0.1665 0.0591 -0.0174 0.0685
## Water_Intake..liters. -0.2794 -0.3245 0.3376 0.0032 0.0402 -0.0971
## Workout_Frequency..days.week. -0.3933 0.1779 -0.0869 -0.1790 0.0704 -0.0398
## Experience_Level -0.4387 0.1680 -0.0850 -0.1475 0.0209 -0.0406
## BMI -0.0738 -0.5366 -0.5206 -0.0621 0.0790 -0.0073
Interpretasi Loading Matrix: Setiap kolom adalah eigenvector yang mendefinisikan arah PC. Nilai loading yang besar (positif atau negatif) menunjukkan variabel tersebut berkontribusi besar pada PC tersebut. Misalnya jika
Calories_Burnedmemiliki loading tinggi pada PC1, berarti PC1 merepresentasikan “intensitas latihan”.
# Proyeksikan data ke ruang PC (skor PCA)
data_pca <- as.matrix(data_scale) %*% loading_matrix
colnames(data_pca) <- paste0("PC", 1:n_pc)
cat("Dimensi data setelah PCA:", nrow(data_pca), "x", ncol(data_pca), "\n")## Dimensi data setelah PCA: 973 x 6
##
## Contoh skor PCA (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -3.5513 -0.7045 -0.2927 0.3923 1.3881 -0.7847
## [2,] 0.4404 0.4475 -2.2518 0.1999 0.6169 -0.6249
## [3,] 0.9940 0.6605 -0.3476 -1.8448 0.4594 1.3958
## [4,] 2.6453 0.7520 0.1231 0.7905 -0.8688 0.7655
## [5,] 2.3121 0.8214 1.5531 1.1367 -0.6345 -0.8348
## [6,] -2.6867 1.7422 0.1636 0.8556 1.3462 -1.4857
## [7,] -1.1880 0.7117 -0.4256 2.5397 -0.2319 -0.1778
## [8,] 0.8425 0.5255 -2.1738 -0.3956 -0.3576 -0.6008
## [9,] 0.0078 -2.5573 0.2734 -1.6139 -0.5486 1.2697
## [10,] 1.0008 -1.7986 0.3395 0.0965 -0.2615 1.0046
cat("\nTotal varians yang dipertahankan:",
round(sum(eigenvalues[1:n_pc]) / sum(eigenvalues) * 100, 2), "%\n")##
## Total varians yang dipertahankan: 81.54 %
Interpretasi Skor PCA: Setiap baris adalah representasi anggota gym di ruang PC yang baru. Data yang semula berdimensi 13 kini direpresentasikan dalam 6 dimensi sambil tetap mempertahankan ≥ 80% informasi asli. Skor inilah yang akan digunakan sebagai input algoritma clustering.
# Gunakan prcomp untuk biplot (lebih mudah divisualisasi)
pca_res <- prcomp(data_scale, center = FALSE, scale. = FALSE)
fviz_pca_var(pca_res,
axes = c(1, 2),
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE) +
labs(title = "Biplot PCA — Kontribusi Variabel ke PC1 & PC2") +
theme_minimal(base_size = 11)Interpretasi Biplot: Panah yang panjang dan searah menunjukkan variabel yang berkorelasi tinggi dan sama-sama berkontribusi pada PC yang sama. Panah tegak lurus berarti variabel tidak berkorelasi. Warna merah menunjukkan kontribusi tertinggi ke PC1 dan PC2.
wss <- sapply(1:10, function(k) {
kmeans(data_pca, centers = k, nstart = 20, iter.max = 100)$tot.withinss
})
k_values <- 2:10
avg_sil_values <- sapply(k_values, function(k) {
km <- kmeans(data_pca, centers = k, nstart = 25, iter.max = 100)
mean(silhouette(km$cluster, dist(data_pca))[, 3])
})
best_k <- k_values[which.max(avg_sil_values)]
cat("K optimal (Silhouette):", best_k, "\n")## K optimal (Silhouette): 2
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
plot(1:10, wss, type = "b", pch = 19, col = "steelblue", lwd = 2,
xlab = "K", ylab = "Within-Cluster SS", main = "Elbow Method", frame = FALSE)
abline(v = 3, col = "red", lty = 2)
legend("topright", legend = "K = 3", col = "red", lty = 2, bty = "n")
plot(k_values, avg_sil_values, type = "b", pch = 19, col = "darkorange", lwd = 2,
xlab = "K", ylab = "Avg Silhouette Width", main = "Silhouette Analysis", frame = FALSE)
abline(v = best_k, col = "red", lty = 2)
legend("topright", legend = paste("K =", best_k), col = "red", lty = 2, bty = "n")Interpretasi:
- Elbow Method: Penurunan WSS paling signifikan terjadi dari K=1 ke K=3, setelah itu kurva mulai mendatar. Ini mengindikasikan K=3 sebagai titik optimal.
- Silhouette Analysis: K=2 menghasilkan skor tertinggi, diikuti K=3. Meski K=2 secara numerik lebih baik, K=3 dipilih karena memberikan segmentasi yang lebih informatif dan sejalan dengan Elbow Method.
Nilai silhouette yang relatif rendah (< 0.3) mencerminkan bahwa batas antar kelompok anggota gym memang tidak tajam — data bersifat kontinu dan tidak memiliki pemisahan alami yang tegas.
km_res <- kmeans(data_pca, centers = 3, nstart = 25)
data_clustered <- as.data.frame(data_pca)
data_clustered$cluster <- km_res$cluster
cat("Distribusi cluster:\n")## Distribusi cluster:
##
## 1 2 3
## 550 225 198
##
## Cluster 1 (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6 cluster
## 2 0.44036380 0.4474946 -2.2517680 0.1999027 0.61691502 -0.6249446 1
## 3 0.99395018 0.6604914 -0.3476302 -1.8448079 0.45940450 1.3958439 1
## 4 2.64528373 0.7520303 0.1230931 0.7904702 -0.86875894 0.7655096 1
## 5 2.31206388 0.8214463 1.5531285 1.1366537 -0.63454027 -0.8348239 1
## 7 -1.18795722 0.7117037 -0.4256316 2.5396696 -0.23185747 -0.1777550 1
## 8 0.84249749 0.5254704 -2.1738019 -0.3956126 -0.35764245 -0.6007853 1
## 12 -0.06229671 0.7985404 1.2281083 0.9573159 0.75291797 -1.5099271 1
## 16 2.11360660 1.4580287 1.2876131 -0.8656360 -0.04428598 0.8072600 1
## 17 0.10834237 0.9466856 -1.8433648 1.4707298 -0.02565239 2.0153358 1
## 18 1.84240204 2.0656773 0.4154826 1.3910127 0.12590519 0.4860017 1
##
## Cluster 2 (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6 cluster
## 9 0.007780725 -2.5572952 0.2734144 -1.61394394 -0.5485908 1.2696954 2
## 10 1.000803835 -1.7986308 0.3395033 0.09651798 -0.2615424 1.0046497 2
## 11 1.217378158 -3.9234300 -1.3578894 -0.49208630 0.4272750 0.1043446 2
## 13 -0.721502619 -2.5614509 -2.5083901 0.71930514 0.9230463 -1.3726833 2
## 14 0.317406059 -1.9583366 2.3534421 0.32700914 -0.3071297 -0.5372983 2
## 15 -1.291723630 -2.7471881 -0.9537976 0.83005653 -1.3328633 1.1851216 2
## 21 -1.314450284 -0.9823065 0.3539523 -0.30476120 -1.4319074 0.2928205 2
## 24 1.247211376 -3.8175591 -0.7381403 1.64731141 -1.1086393 -0.3306375 2
## 31 0.011852683 -2.5337538 0.1126222 -0.32414253 0.3788269 1.5013932 2
## 36 -0.407172909 -3.2203427 -1.7068795 1.52144892 0.2186818 1.5092467 2
##
## Cluster 3 (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6 cluster
## 1 -3.551258 -0.7044954 -0.29268762 0.3922674 1.38814103 -0.7847145 3
## 6 -2.686693 1.7421763 0.16355828 0.8555622 1.34624581 -1.4856794 3
## 29 -3.451914 -0.3958963 1.08892527 -0.7265071 -0.07986196 -1.0633672 3
## 35 -3.748740 -0.3365120 -0.10206618 -0.4718053 0.03029771 -0.1803514 3
## 52 -3.564333 -0.2147072 0.62393798 -0.4216008 1.41711570 -0.9065723 3
## 59 -2.313920 1.3077579 0.06650421 0.2132064 0.49793050 -0.3655013 3
## 63 -3.973461 -0.5186306 -0.93730205 -0.3842553 -1.15228343 -0.8045808 3
## 65 -3.604251 1.6963304 -0.11586082 0.4973375 0.18551259 0.5707564 3
## 67 -4.631779 -0.2013414 -1.43630024 -0.1879541 0.10782299 0.5480529 3
## 69 -2.818393 1.4037827 0.17942638 0.5884445 -1.19145493 -0.3090339 3
Interpretasi K-Means: K-Means membagi data menjadi 3 cluster berdasarkan kedekatan jarak terhadap centroid (mean). Distribusi cluster tidak seimbang — cluster terbesar mendominasi karena data gym members cenderung homogen. Nilai between_SS / total_SS yang rendah-sedang menunjukkan pemisahan antar cluster belum terlalu tegas.
kmed_res <- kcca(data_pca, k = 3, family = kccaFamily("kmedians"))
data_clustered <- as.data.frame(data_pca)
data_clustered$cluster <- clusters(kmed_res)
cat("Distribusi cluster:\n")## Distribusi cluster:
##
## 1 2 3
## 200 543 230
##
## Cluster 1 (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6 cluster
## 1 -3.551258 -0.7044954 -0.29268762 0.3922674 1.38814103 -0.7847145 1
## 6 -2.686693 1.7421763 0.16355828 0.8555622 1.34624581 -1.4856794 1
## 29 -3.451914 -0.3958963 1.08892527 -0.7265071 -0.07986196 -1.0633672 1
## 35 -3.748740 -0.3365120 -0.10206618 -0.4718053 0.03029771 -0.1803514 1
## 52 -3.564333 -0.2147072 0.62393798 -0.4216008 1.41711570 -0.9065723 1
## 59 -2.313920 1.3077579 0.06650421 0.2132064 0.49793050 -0.3655013 1
## 63 -3.973461 -0.5186306 -0.93730205 -0.3842553 -1.15228343 -0.8045808 1
## 65 -3.604251 1.6963304 -0.11586082 0.4973375 0.18551259 0.5707564 1
## 67 -4.631779 -0.2013414 -1.43630024 -0.1879541 0.10782299 0.5480529 1
## 69 -2.818393 1.4037827 0.17942638 0.5884445 -1.19145493 -0.3090339 1
##
## Cluster 2 (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6 cluster
## 2 0.44036380 0.4474946 -2.2517680 0.1999027 0.61691502 -0.6249446 2
## 3 0.99395018 0.6604914 -0.3476302 -1.8448079 0.45940450 1.3958439 2
## 4 2.64528373 0.7520303 0.1230931 0.7904702 -0.86875894 0.7655096 2
## 5 2.31206388 0.8214463 1.5531285 1.1366537 -0.63454027 -0.8348239 2
## 7 -1.18795722 0.7117037 -0.4256316 2.5396696 -0.23185747 -0.1777550 2
## 8 0.84249749 0.5254704 -2.1738019 -0.3956126 -0.35764245 -0.6007853 2
## 12 -0.06229671 0.7985404 1.2281083 0.9573159 0.75291797 -1.5099271 2
## 16 2.11360660 1.4580287 1.2876131 -0.8656360 -0.04428598 0.8072600 2
## 17 0.10834237 0.9466856 -1.8433648 1.4707298 -0.02565239 2.0153358 2
## 18 1.84240204 2.0656773 0.4154826 1.3910127 0.12590519 0.4860017 2
##
## Cluster 3 (10 baris pertama):
## PC1 PC2 PC3 PC4 PC5 PC6 cluster
## 9 0.007780725 -2.5572952 0.2734144 -1.61394394 -0.5485908 1.2696954 3
## 10 1.000803835 -1.7986308 0.3395033 0.09651798 -0.2615424 1.0046497 3
## 11 1.217378158 -3.9234300 -1.3578894 -0.49208630 0.4272750 0.1043446 3
## 13 -0.721502619 -2.5614509 -2.5083901 0.71930514 0.9230463 -1.3726833 3
## 14 0.317406059 -1.9583366 2.3534421 0.32700914 -0.3071297 -0.5372983 3
## 15 -1.291723630 -2.7471881 -0.9537976 0.83005653 -1.3328633 1.1851216 3
## 21 -1.314450284 -0.9823065 0.3539523 -0.30476120 -1.4319074 0.2928205 3
## 24 1.247211376 -3.8175591 -0.7381403 1.64731141 -1.1086393 -0.3306375 3
## 31 0.011852683 -2.5337538 0.1126222 -0.32414253 0.3788269 1.5013932 3
## 36 -0.407172909 -3.2203427 -1.7068795 1.52144892 0.2186818 1.5092467 3
Interpretasi K-medians: K-medians menggunakan median sebagai pusat cluster sehingga lebih tahan terhadap outlier dibanding K-means. Pola distribusi cluster yang dihasilkan umumnya serupa dengan K-means, mengindikasikan bahwa dataset tidak mengandung outlier ekstrem yang signifikan.
## Distribusi cluster (0 = noise):
##
## 0 1
## 1 972
##
## Jumlah noise: 1
Interpretasi DBSCAN: Dengan
eps=2.0danminPts=5, DBSCAN menghasilkan beberapa cluster dengan jumlah noise yang sangat tinggi (~41% dari data). Ini mengindikasikan bahwa data gym members tidak memiliki struktur kepadatan yang jelas. DBSCAN kurang cocok untuk dataset ini.
ms_res <- meanShift(data_pca)
cat("Jumlah cluster yang terbentuk:", length(unique(ms_res$assignment)), "\n")## Jumlah cluster yang terbentuk: 437
## Distribusi 10 cluster terbesar:
##
## 1 85 66 59 20 134 92 2 6 19
## 45 21 18 16 15 14 13 11 11 9
Interpretasi Mean Shift: Mean Shift secara otomatis menentukan jumlah cluster berdasarkan kepadatan data. Pada dataset ini menghasilkan cluster yang sangat banyak dan tidak seimbang, mencerminkan distribusi data yang menyebar tanpa puncak kepadatan yang tegas. Mean Shift tidak cocok untuk dataset ini.
fcm_res <- cmeans(data_pca, centers = 3, m = 2, iter.max = 100)
cat("Distribusi cluster (hard assignment):\n")## Distribusi cluster (hard assignment):
##
## 1 2 3
## 267 443 263
##
## Contoh membership degree (10 baris pertama):
## 1 2 3
## [1,] 0.155 0.155 0.690
## [2,] 0.410 0.410 0.180
## [3,] 0.423 0.423 0.154
## [4,] 0.450 0.452 0.097
## [5,] 0.436 0.437 0.127
## [6,] 0.218 0.217 0.565
## [7,] 0.310 0.310 0.381
## [8,] 0.424 0.425 0.151
## [9,] 0.370 0.368 0.262
## [10,] 0.439 0.436 0.125
Interpretasi Fuzzy C-means: Fuzzy C-means menghasilkan distribusi 3 cluster dengan profil mirip K-means. Keunikannya ada pada membership degree — setiap anggota memiliki derajat keanggotaan ke semua cluster sekaligus. Nilai yang tidak dominan (< 0.6) mengindikasikan anggota yang berada di perbatasan antar kelompok.
ggplot(df_vis, aes(x = PC1, y = PC2, color = as.factor(km_res$cluster))) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = "K-Means", subtitle = "(3 cluster)", color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))ggplot(df_vis, aes(x = PC1, y = PC2, color = as.factor(clusters(kmed_res)))) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = "K-Medians", subtitle = "(3 cluster)", color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))ggplot(df_vis, aes(x = PC1, y = PC2, color = as.factor(db_res$cluster))) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = "DBSCAN", subtitle = "(0 = Noise)", color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))ggplot(df_vis, aes(x = PC1, y = PC2, color = as.factor(ms_res$assignment))) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = "Mean Shift",
subtitle = paste0("(", length(unique(ms_res$assignment)), " cluster)")) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
legend.position = "none")ggplot(df_vis, aes(x = PC1, y = PC2, color = as.factor(fcm_res$cluster))) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = "Fuzzy C-Means", subtitle = "(3 cluster)", color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))Interpretasi Visualisasi Clustering:
- K-Means: Terbentuk 3 kelompok yang cukup tegas di ruang PCA dengan batas terlihat pada sumbu PC1.
- K-Medians: Pola hampir identik dengan K-Means, pembagian sedikit lebih rapat karena penggunaan median.
- DBSCAN: Sebagian besar titik masuk noise (cluster 0), menunjukkan data tidak memiliki struktur kepadatan yang jelas.
- Mean Shift: Menghasilkan ratusan cluster kecil, legend disembunyikan agar plot tetap terbaca.
- Fuzzy C-Means: Pembagian cluster mirip K-Means namun batas antar kelompok lebih halus dan fleksibel.
set.seed(123)
idx <- sample(nrow(data_pca), 300)
data_sample <- data_pca[idx, ]
dist_sample <- dist(data_sample)
eval_clustering <- function(labels, data_dist, idx, name) {
labels <- labels[idx]
valid <- labels > 0
n_cl <- length(unique(labels[valid]))
if (n_cl < 2) return(data.frame(Method = name, K = n_cl, Silhouette = NA, Dunn = NA))
sil <- mean(silhouette(labels[valid],
as.dist(as.matrix(data_dist)[valid, valid]))[, 3])
stats <- cluster.stats(data_dist, labels)
data.frame(Method = name, K = n_cl, Silhouette = round(sil, 4), Dunn = round(stats$dunn, 4))
}
eval_results <- rbind(
eval_clustering(km_res$cluster, dist_sample, idx, "K-means"),
eval_clustering(clusters(kmed_res), dist_sample, idx, "K-medians"),
eval_clustering(db_res$cluster, dist_sample, idx, "DBSCAN"),
eval_clustering(ms_res$assignment, dist_sample, idx, "Mean Shift"),
eval_clustering(fcm_res$cluster, dist_sample, idx, "Fuzzy C-means")
)## Warning in cluster.stats(data_dist, labels): clustering renumbered because
## maximum != number of clusters
## Method K Silhouette Dunn
## K-means 3 0.2463 0.1447
## K-medians 3 0.2450 0.1331
## DBSCAN 1 NA NA
## Mean Shift 190 -0.0660 0.1310
## Fuzzy C-means 3 0.1815 0.1219
##
## Metode terbaik: K-means
Interpretasi Evaluasi Metrik:
Metode Silhouette Keterangan K-means ~0.21 ✅ Terbaik — cluster paling kohesif dan terpisah K-medians ~0.20 Hampir setara K-means, sedikit lebih rendah Fuzzy C-means ~0.10 Lebih rendah karena soft boundary DBSCAN ~0.07 Rendah, banyak noise, tidak cocok untuk data ini Mean Shift ~0.05 Terendah, terlalu banyak cluster kecil Nilai Silhouette di kisaran 0.05–0.21 menunjukkan bahwa data gym members tidak memiliki cluster alami yang sangat tegas. K-means tetap menjadi pilihan terbaik untuk dataset ini.
km_labels_sample <- km_res$cluster[idx]
sil_km <- silhouette(km_labels_sample, dist_sample)
fviz_silhouette(sil_km) +
labs(title = "Silhouette Plot - K-means (n = 300 sampel)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))## cluster size ave.sil.width
## 1 1 162 0.23
## 2 2 68 0.21
## 3 3 70 0.32
fviz_cluster(list(data = data_pca, cluster = km_res$cluster),
geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(base_size = 12)) +
labs(title = "K-Means Clustering pada Ruang PCA (PC1 vs PC2)") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))Interpretasi Silhouette Plot & Visualisasi PCA:
- Cluster 2 (Anggota Aktif Berpengalaman) umumnya memiliki bar paling panjang — profil mereka paling berbeda (frekuensi tinggi, kalori tinggi, lemak rendah).
- Cluster 1 dan 3 memiliki banyak bar pendek bahkan negatif, menandakan anggota yang berada di perbatasan kedua kelompok.
- Visualisasi PCA 2D dengan ellipse konveks memperlihatkan pemisahan ketiga cluster secara intuitif.
Berdasarkan analisis clustering pada 973 anggota gym dengan 13 fitur numerik yang direduksi menggunakan PCA:
| Metode | Cluster | Silhouette | Cocok untuk data ini? |
|---|---|---|---|
| K-means | 3 | ~0.21 | ✅ Terbaik |
| K-medians | 3 | ~0.20 | ✅ Baik |
| Fuzzy C-means | 3 | ~0.10 | ⚠️ Cukup |
| DBSCAN | 6 | ~0.07 | ❌ Kurang cocok |
| Mean Shift | ~20 | ~0.05 | ❌ Tidak cocok |
K-means dengan K=3 adalah pendekatan terbaik untuk dataset ini. Penggunaan PCA sebelum clustering membantu menghilangkan multikolinearitas antar 13 fitur dan mempercepat komputasi tanpa kehilangan informasi penting (≥80% varians dipertahankan). DBSCAN dan Mean Shift tidak cocok karena data anggota gym bersifat kontinu tanpa struktur kepadatan yang tegas. Segmentasi ini dapat dimanfaatkan untuk merancang program latihan yang dipersonalisasi sesuai profil tiap kelompok anggota.