Pendahuluan
Analisis Faktor (Factor Analysis) → untuk menemukan struktur laten di balik variabel yang berkorelasi
Analisis Gerombol (Cluster Analysis) → untuk mengelompokkan individu berdasarkan kesamaan karakteristik
Tujuan Pembelajaran:
Memahami konsep dan aplikasi analisis faktor dan analisis gerombol
1. ANALISIS FAKTOR
Konsep Dasar
Analisis faktor digunakan untuk:
Mereduksi dimensi variabel yang saling berkorelasi
Mengidentifikasi struktur laten di balik variabel teramati
Mengelompokkan variabel berdasarkan korelasi yang tinggi
Contoh Aplikasi:
Mengukur intelegensi dari berbagai tes kemampuan
Mengidentifikasi faktor kepuasan pelanggan
Menganalisis faktor lingkungan yang mempengaruhi produktivitas
Langkah 1: Persiapan Data Simulasi
# Install paket jika belum tersedia
#install.packages(c("psych", "GPArotation", "factoextra", "dplyr"))
library (psych) # Analisis faktor & psikometri
library (GPArotation) #Rotasi faktor
Attaching package: 'GPArotation'
The following objects are masked from 'package:psych':
equamax, varimin
library (factoextra) # Visualisasi multivariat
Loading required package: ggplot2
Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':
%+%, alpha
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library (dplyr) # Manipulasi data
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Fungsi Utama packages:
psych
Analisis faktor, KMO, Bartlett test
GPArotation
Rotasi Varimax, Promax, dll.
factoextra
Visualisasi cluster, scree plot
dplyr
Transformasi dan manipulasi data
set.seed (123 ) # Untuk hasil yang reproducible
n <- 100 # Jumlah observasi
# Membuat 2 faktor laten
faktor1 <- rnorm (n)
faktor2 <- rnorm (n)
# Membuat variabel teramati dengan struktur faktor jelas
data <- data.frame (
A1 = 0.7 * faktor1 + 0.3 * rnorm (n), # Loading tinggi faktor1
A2 = 0.8 * faktor1 + 0.2 * rnorm (n), # Loading tinggi faktor1
A3= 0.6 * faktor1 + 0.4 * rnorm (n), # Loading tinggi faktor1
B1 = 0.7 * faktor2 + 0.3 * rnorm (n), # Loading tinggi faktor2
B2 = 0.8 * faktor2 + 0.2 * rnorm (n), # Loading tinggi faktor2
B3 = 0.6 * faktor2 + 0.4 * rnorm (n) # Loading tinggi faktor2
)
# Melihat matriks korelasi
cat ("Matriks Korelasi: \n " )
A1 A2 A3 B1 B2 B3
A1 1.000 0.864 0.685 -0.048 -0.040 0.044
A2 0.864 1.000 0.737 -0.062 -0.047 0.099
A3 0.685 0.737 1.000 -0.084 -0.131 -0.066
B1 -0.048 -0.062 -0.084 1.000 0.894 0.779
B2 -0.040 -0.047 -0.131 0.894 1.000 0.789
B3 0.044 0.099 -0.066 0.779 0.789 1.000
Interpretasi Awal:
A1, A2, A3 berkorelasi tinggi (0.6-0.9) → Faktor 1
B1, B2, B3 berkorelasi tinggi (0.8-0.9) → Faktor 2
Korelasi antar kelompok rendah (-0.13 sampai 0.10) → struktur faktor jelas
Langkah 2: Uji Kecocokan Analisis Faktor
# Uji KMO (Kaiser-Meyer-Olkin)
kmo_result <- KMO (data)
cat ("KMO Overall MSA:" , round (kmo_result$ MSA, 3 ), " \n " )
# Uji Bartlett untuk Sphericity
bartlett_result <- cortest.bartlett (cor (data), n = nrow (data))
cat ("Bartlett Test p-value:" , bartlett_result$ p.value, " \n " )
Bartlett Test p-value: 6.892494e-93
Kriteria Kecocokan:
KMO
> 0.6
Layak untuk analisis faktor
Bartlett
p-value < 0.05
Korelasi signifikan
Data Layak untuk analisis faktor (KMO = 0.7, p-value ≈ 0)
Langkah 3: Menentukan Jumlah Faktor
# Scree plot untuk menentukan jumlah faktor
pca <- principal (data, nfactors = 6 , rotate = "none" )
eigen_values <- pca$ values
plot (eigen_values, type = "b" , main = "Scree Plot" ,
xlab = "Komponen" , ylab = "Eigenvalue" , pch = 16 , col = "blue" )
abline (h = 1 , col = "red" , lty = 2 , lwd = 2 )
text (eigen_values, labels = round (eigen_values, 2 ), pos = 3 , cex = 0.8 )
legend ("topright" , legend = "Eigenvalue = 1" , col = "red" , lty = 2 , lwd = 2 )
Aturan Penentuan Jumlah Faktor:
Kaiser Criterion: Eigenvalue > 1
Scree Test: Titik elbow pada grafik
Varians Dijelaskan: Minimal 60-70%
2 faktor direkomendasikan (eigenvalue > 1, elbow jelas di komponen 3)
Langkah 4: Ekstraksi Faktor (Maximum Likelihood)
# Analisis faktor dengan 2 faktor
fa_model <- fa (data, nfactors = 2 , rotate = "none" , fm = "ml" )
cat ("Loading Faktor (Sebelum Rotasi): \n " )
Loading Faktor (Sebelum Rotasi):
print (fa_model$ loadings, cutoff = 0.3 )
Loadings:
ML2 ML1
A1 0.879
A2 0.971
A3 0.760
B1 0.923
B2 0.939
B3 0.841
ML2 ML1
SS loadings 2.464 2.345
Proportion Var 0.411 0.391
Cumulative Var 0.411 0.802
cat (" \n Varians Dijelaskan: \n " )
print (round (fa_model$ Vaccounted, 3 ))
ML2 ML1
SS loadings 2.464 2.345
Proportion Var 0.411 0.391
Cumulative Var 0.411 0.802
Proportion Explained 0.512 0.488
Cumulative Proportion 0.512 1.000
Interpretasi :
Faktor 1: B1, B2, B3 (loading tinggi)
Faktor 2: A1, A2, A3 (loading tinggi)
80.2% varians dijelaskan → sangat baik
Langkah 5: Rotasi Faktor
# Rotasi Varimax (ortogonal)
fa_varimax <- fa (data, nfactors = 2 , rotate = "varimax" , fm = "ml" )
# Rotasi Promax (miring)
fa_promax <- fa (data, nfactors = 2 , rotate = "promax" , fm = "ml" )
cat ("Varimax Rotation: \n " )
print (fa_varimax$ loadings, cutoff = 0.3 )
Loadings:
ML2 ML1
A1 0.884
A2 0.977
A3 0.757
B1 0.934
B2 0.950
B3 0.838
ML2 ML1
SS loadings 2.482 2.327
Proportion Var 0.414 0.388
Cumulative Var 0.414 0.802
cat (" \n Promax Rotation: \n " )
print (fa_promax$ loadings, cutoff = 0.3 )
Loadings:
ML2 ML1
A1 0.885
A2 0.979
A3 0.756
B1 0.934
B2 0.949
B3 0.840
ML2 ML1
SS loadings 2.484 2.326
Proportion Var 0.414 0.388
Cumulative Var 0.414 0.802
# Korelasi antar faktor (hanya untuk rotasi miring)
cat (" \n Korelasi Antar Faktor (Promax): \n " )
Korelasi Antar Faktor (Promax):
print (round (fa_promax$ Phi, 3 ))
ML2 ML1
ML2 1.00 -0.04
ML1 -0.04 1.00
Langkah 6: Interpretasi Akhir
# Komunalitas
cat ("Komunalitas: \n " )
print (round (fa_varimax$ communalities, 3 ))
A1 A2 A3 B1 B2 B3
0.782 0.956 0.578 0.880 0.907 0.707
# Ringkasan hasil
cat (" \n Ringkasan Model: \n " )
print (fa_varimax$ Vaccounted)
ML2 ML1
SS loadings 2.4824390 2.3270399
Proportion Var 0.4137398 0.3878400
Cumulative Var 0.4137398 0.8015798
Proportion Explained 0.5161555 0.4838445
Cumulative Proportion 0.5161555 1.0000000
Interpretasi
Faktor 1: B1, B2, B3 → beri nama sesuai konteks (misal: “Faktor Kemampuan Numerik”)
Faktor 2: A1, A2, A3 → beri nama sesuai konteks (misal: “Faktor Kemampuan Verbal”)
Komunalitas tinggi (> 0.5) → model menjelaskan varians dengan baik
Struktur bersih tanpa cross-loading
2. ANALISIS GEROMBOL
Konsep Dasar
Analisis gerombol digunakan untuk:
- Mengelompokkan objek berdasarkan kemiripan karakteristik
- Mengidentifikasi pola dalam data multivariat
- Segmentasi untuk berbagai aplikasi bisnis dan penelitian
Jenis Metode:
- Hierarki: Dendrogram, linkage methods
- Non-Hierarki: K-Means, K-Medoids (PAM)
A. Hierarchical Clustering
Standarisasi & Matriks Jarak
Tahapan awal dalam analisis Hierarchical Clustering adalah menstandarkan data agar setiap variabel memiliki skala yang sebanding.
Hal ini penting karena variabel dengan satuan besar dapat mendominasi perhitungan jarak.
library (cluster)
library (factoextra)
library (clusterCrit)
library (dplyr)
library (ggplot2)
library (tidyr)
library (scales)
Attaching package: 'scales'
The following objects are masked from 'package:psych':
alpha, rescale
library (RColorBrewer)
data <- data %>% select (where (is.numeric)) %>% na.omit ()
data_scaled <- scale (data)
# Matriks jarak Euclidean
dist_matrix <- dist (data_scaled, method = "euclidean" )#"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"
head (dist_matrix,10 )
[1] 1.186582 3.221468 1.494887 1.724855 3.566237 1.361880 3.108316 1.237156
[9] 3.297424 3.579704
# Hierarchical clustering dengan metode Ward
hc_ward <- hclust (dist_matrix, method = "ward.D2" ) #"single", "complete", "average" (= UPGMA), "mcquitty" "median" or "centroid"
hc_ward
Call:
hclust(d = dist_matrix, method = "ward.D2")
Cluster method : ward.D2
Distance : euclidean
Number of objects: 100
Dendrogram
Dendrogram menggambarkan proses penggabungan antar objek atau cluster berdasarkan tingkat kemiripan. Setiap percabangan menunjukkan dua cluster yang digabung, dengan tinggi cabang mencerminkan jarak antar cluster.
fviz_dend (hc_ward, k = 3 , rect = TRUE , palette = "jco" ,
main = "Dendrogram Hierarchical Clustering (Ward.D2)" )
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Hasil Clustering
cluster_hc <- cutree (hc_ward, k = 3 )
table (cluster_hc)
cluster_hc
1 2 3
27 45 28
head (data.frame (Observasi = 1 : length (cluster_hc),
Cluster = cluster_hc),10 )
Observasi Cluster
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 1
9 9 1
10 10 3
Komponen hasil
Fungsi hclust() menghasilkan sejumlah komponen yang menjelaskan proses pengelompokan.
Beberapa komponen penting antara lain:
merge: pasangan cluster yang digabung di setiap langkah,
height: jarak antar cluster pada saat digabung,
order: urutan observasi di dendrogram,
labels: label atau nama observasi,
method & dist.method: metode penggabungan dan jenis jarak yang digunakan.
[1] "merge" "height" "order" "labels" "method"
[6] "call" "dist.method"
[,1] [,2]
[1,] -66 -93
[2,] -17 -79
[3,] -20 -62
[4,] -80 -99
[5,] -14 -28
[6,] -51 -67
[7,] -24 -85
[8,] -52 -53
[9,] -75 3
[10,] -38 8
[1] 0.4234035 0.4641453 0.4690650 0.4875733 0.5092152 0.6765199
[7] 0.7140978 0.7430553 0.7559972 0.7630201 0.7649903 0.7802216
[13] 0.7820095 0.7840824 0.8095486 0.8212717 0.8374955 0.8482324
[19] 0.8539783 0.8695295 0.8761644 0.8779808 0.8860642 0.9080021
[25] 0.9415906 0.9436227 0.9437432 0.9540593 0.9642794 0.9663148
[31] 0.9966534 1.0168749 1.0266621 1.0569648 1.0644014 1.0681005
[37] 1.1189110 1.1575998 1.1642385 1.1844863 1.1923534 1.1956224
[43] 1.2007538 1.2312531 1.2978697 1.3137802 1.3203862 1.3453711
[49] 1.3539209 1.3567902 1.3760002 1.3899316 1.4087559 1.4285394
[55] 1.4706334 1.5056346 1.5111813 1.5122790 1.6555399 1.6555632
[61] 1.6788716 1.6864547 1.6979095 1.7533173 1.7614047 1.7795909
[67] 1.8766330 1.9024291 1.9025262 1.9160378 1.9856777 1.9937020
[73] 2.0487407 2.1490390 2.3078826 2.3989787 2.4157281 2.5583885
[79] 2.5989212 2.6374357 2.6444172 2.8171655 2.8467163 3.0118017
[85] 3.1094588 3.2462284 3.3512510 3.3936216 3.8937435 4.5920003
[91] 4.6711809 5.0551856 5.6516754 6.9609842 7.4537350 9.6919782
[97] 9.9291102 15.9742352 17.0236664
[1] 39 96 25 74 49 64 59 89 55 77 10 36 38 52 53 32 48 31
[19] 51 67 61 82 66 93 12 69 27 87 46 29 100 75 20 62 65 43
[37] 94 2 9 1 50 8 47 40 63 81 72 18 26 57 15 41 21 23
[55] 78 60 17 79 42 14 28 68 71 4 80 99 24 85 7 83 33 92
[73] 88 58 86 5 22 13 37 84 11 54 19 76 44 98 35 95 45 90
[91] 70 97 16 91 34 30 73 56 3 6
Evaluasi Clustering
Beberapa metrik yang digunakan antara lain:
Silhouette Coefficient: mengukur seberapa mirip observasi dengan cluster-nya sendiri dibandingkan cluster lain.
Calinski-Harabasz Index: semakin tinggi nilainya, semakin baik pemisahan antar cluster
Dunn Index: semakin tinggi nilai, semakin jauh jarak antar cluster relatif terhadap ukuran dalam cluster.
eval_cluster <- function (data_scaled, cluster, dist_matrix) {
sil <- mean (silhouette (cluster, dist_matrix)[, 3 ])
ch <- intCriteria (traj = as.matrix (data_scaled),
part = cluster, crit = "Calinski_Harabasz" )$ calinski_harabasz
dunn <- intCriteria (traj = as.matrix (data_scaled),
part = cluster, crit = "Dunn" )$ dunn
return (data.frame (Silhouette = sil,
Calinski_Harabasz = ch,
Dunn = dunn))
}
eval_cluster (data_scaled, cluster_hc, dist_matrix)
Silhouette Calinski_Harabasz Dunn
1 0.2373443 41.1055 0.1213656
Karakteristik Cluster
Tahap ini bertujuan untuk melihat karakteristik atau profil dari masing-masing cluster yang terbentuk.
Nilai rata-rata tiap variabel pada setiap cluster membantu dalam memahami perbedaan utama antar kelompok.
Cluster dapat diinterpretasikan sebagai kelompok observasi dengan pola nilai variabel yang serupa.
data_clustered <- data.frame (data, cluster = factor (cluster_hc))
cluster_summary <- data_clustered %>%
group_by (cluster) %>%
summarise (across (where (is.numeric), mean, .names = "{.col}" ))
cluster_summary
# A tibble: 3 × 7
cluster A1 A2 A3 B1 B2 B3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.519 -0.705 -0.532 -0.502 -0.510 -0.277
2 2 0.497 0.542 0.423 -0.405 -0.444 -0.312
3 3 0.0573 0.0418 0.179 0.820 0.791 0.690
Visualisasi Clustering
Visualisasi digunakan untuk memperjelas pola karakteristik antar cluster.
Heatmap menampilkan perbedaan nilai rata-rata tiap variabel antar cluster secara visual melalui gradasi warna.
Bar Plot menunjukkan nilai rata-rata tiap variabel secara kuantitatif untuk setiap cluster, memudahkan interpretasi perbandingan.
Data untuk visualisasi
heat_data <- cluster_summary %>%
pivot_longer (- cluster, names_to = "Variabel" , values_to = "Rata2" )
heat_data
# A tibble: 18 × 3
cluster Variabel Rata2
<fct> <chr> <dbl>
1 1 A1 -0.519
2 1 A2 -0.705
3 1 A3 -0.532
4 1 B1 -0.502
5 1 B2 -0.510
6 1 B3 -0.277
7 2 A1 0.497
8 2 A2 0.542
9 2 A3 0.423
10 2 B1 -0.405
11 2 B2 -0.444
12 2 B3 -0.312
13 3 A1 0.0573
14 3 A2 0.0418
15 3 A3 0.179
16 3 B1 0.820
17 3 B2 0.791
18 3 B3 0.690
Heatmap
ggplot (heat_data, aes (x = Variabel, y = cluster, fill = Rata2)) +
geom_tile (color = "white" ) +
scale_fill_gradientn (colors = c ("#56B1F7" , "white" , "#FF6B6B" )) +
labs (title = "Heatmap Karakteristik Cluster" ,
x = "Variabel" , y = "Cluster" ) +
theme_minimal (base_size = 12 ) +
theme (axis.text.x = element_text (angle = 45 , hjust = 1 ))
Bar Plot
ggplot (heat_data, aes (x = Variabel, y = Rata2, fill = cluster)) +
geom_bar (stat = "identity" , position = "dodge" ) +
scale_fill_brewer (palette = "Set2" ) +
labs (title = "Bar Plot Karakteristik Cluster" ,
x = "Variabel" , y = "Rata-rata Nilai" ) +
theme_minimal (base_size = 12 ) +
theme (axis.text.x = element_text (angle = 45 , hjust = 1 ))
Menentukan Jumlah Cluster Optimal
Langkah pertama adalah menentukan jumlah cluster terbaik (k ) menggunakan dua metode populer:
Elbow Method: melihat titik siku (elbow point ) pada grafik within-cluster sum of squares (WSS) .
Silhouette Method: menilai kualitas pemisahan cluster; semakin tinggi nilai silhouette , semakin baik hasil clustering.
library (factoextra)
library (cluster)
library (clusterCrit)
library (dplyr)
library (ggplot2)
library (tidyr)
library (scales)
library (RColorBrewer)
data <- data %>% select (where (is.numeric)) %>% na.omit ()
data_scaled <- scale (data)
fviz_nbclust (data_scaled, kmeans, method = "wss" ) +
labs (title = "Elbow Method" )
fviz_nbclust (data_scaled, kmeans, method = "silhouette" ) +
labs (title = "Silhouette Method" )
Penerapan Algoritma K-Means
Setelah jumlah cluster optimal dipilih (misal k = 3), dilakukan proses clustering menggunakan fungsi kmeans().
Parameter nstart = 25 digunakan untuk memastikan stabilitas hasil dengan mencoba 25 inisialisasi acak.
set.seed (123 )
kmeans_result <- kmeans (data_scaled, centers = 3 , nstart = 25 )
kmeans_result
K-means clustering with 3 clusters of sizes 25, 41, 34
Cluster means:
A1 A2 A3 B1 B2 B3
1 -0.3197838 -0.2084213 -0.2240879 1.2662463 1.256284 1.1955286
2 -0.6467932 -0.6928074 -0.5357591 -0.4906116 -0.500987 -0.5400056
3 1.0150917 0.9886952 0.8108330 -0.3394436 -0.319607 -0.2278819
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
2 2 3 2 2 3 2 2 2 1 3 1 3 2 1 3 3 2 3 2
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 2 2 2 1 2 3 2 2 3 1 1 3 3 3 1 3 1 1 2
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 2 2 3 3 2 2 1 1 2 1 1 1 3 2 3 1 3 1 2
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 2 2 1 2 3 1 2 3 3 2 2 3 1 2 3 2 2 3 2
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
2 1 2 3 2 3 1 3 1 3 3 3 3 2 3 1 3 3 2 2
Within cluster sum of squares by cluster:
[1] 87.9679 102.4448 102.9224
(between_SS / total_SS = 50.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
Hasil Clustering
table (kmeans_result$ cluster)
head (data.frame (Observasi = 1 : length (kmeans_result$ cluster),
Cluster = kmeans_result$ cluster),10 )
Observasi Cluster
1 1 2
2 2 2
3 3 3
4 4 2
5 5 2
6 6 3
7 7 2
8 8 2
9 9 2
10 10 1
Komponen Hasil K-Means
Fungsi kmeans() menghasilkan sejumlah komponen penting untuk memahami hasil pengelompokan:
cluster
Label cluster untuk setiap observasi
centers
Koordinat pusat (centroid ) tiap cluster
totss
Total variasi data (Total Sum of Squares)
withinss
Variasi dalam masing-masing cluster
tot.withinss
Jumlah keseluruhan variasi dalam cluster
betweenss
Variasi antar cluster
size
Jumlah anggota setiap cluster
iter
Jumlah iterasi hingga konvergen
ifault
Status hasil (0 = sukses)
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
2 2 3 2 2 3 2 2 2 1 3 1 3 2 1 3 3 2 3 2
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 2 2 2 1 2 3 2 2 3 1 1 3 3 3 1 3 1 1 2
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 2 2 3 3 2 2 1 1 2 1 1 1 3 2 3 1 3 1 2
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 2 2 1 2 3 1 2 3 3 2 2 3 1 2 3 2 2 3 2
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
2 1 2 3 2 3 1 3 1 3 3 3 3 2 3 1 3 3 2 2
[1] 87.9679 102.4448 102.9224
kmeans_result$ tot.withinss
Evaluasi Clustering
cluster_km <- kmeans_result$ cluster
dist_matrix <- dist (data_scaled, method = "euclidean" )
sil <- mean (silhouette (cluster_km, dist_matrix)[, 3 ])
ch <- intCriteria (traj = as.matrix (data_scaled),
part = cluster_km,
crit = "Calinski_Harabasz" )$ calinski_harabasz
dunn <- intCriteria (traj = as.matrix (data_scaled),
part = cluster_km,
crit = "Dunn" )$ dunn
data.frame (Silhouette = sil,
Calinski_Harabasz = ch,
Dunn = dunn)
Silhouette Calinski_Harabasz Dunn
1 0.2893723 49.71189 0.133526
Karakteristik Cluster
data_clustered <- data.frame (data, cluster = factor (cluster_km))
cluster_summary <- data_clustered %>%
group_by (cluster) %>%
summarise (across (where (is.numeric), mean, .names = "{.col}" ))
cluster_summary
# A tibble: 3 × 7
cluster A1 A2 A3 B1 B2 B3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.113 -0.0913 -0.0403 0.877 0.909 0.810
2 2 -0.331 -0.455 -0.231 -0.462 -0.525 -0.398
3 3 0.775 0.807 0.592 -0.347 -0.377 -0.181
Visualisasi Clustering
fviz_cluster (kmeans_result, data = data_scaled,
palette = "jco" , geom = "point" ,
main = "Visualisasi Hasil K-Means Clustering (3 Cluster)" )
heat_data <- cluster_summary %>%
pivot_longer (- cluster, names_to = "Variabel" , values_to = "Rata2" )
heat_data
# A tibble: 18 × 3
cluster Variabel Rata2
<fct> <chr> <dbl>
1 1 A1 -0.113
2 1 A2 -0.0913
3 1 A3 -0.0403
4 1 B1 0.877
5 1 B2 0.909
6 1 B3 0.810
7 2 A1 -0.331
8 2 A2 -0.455
9 2 A3 -0.231
10 2 B1 -0.462
11 2 B2 -0.525
12 2 B3 -0.398
13 3 A1 0.775
14 3 A2 0.807
15 3 A3 0.592
16 3 B1 -0.347
17 3 B2 -0.377
18 3 B3 -0.181
ggplot (heat_data, aes (x = Variabel, y = cluster, fill = Rata2)) +
geom_tile (color = "white" ) +
scale_fill_gradientn (colors = c ("#56B1F7" , "white" , "#FF6B6B" )) +
labs (title = "Heatmap Karakteristik Cluster" ,
x = "Variabel" , y = "Cluster" ) +
theme_minimal (base_size = 12 ) +
theme (axis.text.x = element_text (angle = 45 , hjust = 1 ))
ggplot (heat_data, aes (x = Variabel, y = Rata2, fill = cluster)) +
geom_bar (stat = "identity" , position = "dodge" ) +
scale_fill_brewer (palette = "Set2" ) +
labs (title = "Heatmap Karakteristik Cluster" ,
x = "Variabel" , y = "Rata-rata Nilai" ) +
theme_minimal (base_size = 12 ) +
theme (axis.text.x = element_text (angle = 45 , hjust = 1 ))
C. K-Medoids Clustering (PAM)
library (cluster)
library (factoextra)
library (clusterCrit)
library (dplyr)
library (ggplot2)
library (tidyr)
library (scales)
library (RColorBrewer)
data <- data %>% select (where (is.numeric)) %>% na.omit ()
data_scaled <- scale (data)
Eksperimen dengan Berbagai Jarak
dist_euc <- dist (data_scaled, method = "euclidean" )
dist_man <- dist (data_scaled, method = "manhattan" )
dist_can <- dist (data_scaled, method = "canberra" )
pam_euc <- pam (dist_euc, k = 3 )
pam_man <- pam (dist_man, k = 3 )
pam_can <- pam (dist_can, k = 3 )
Komponen Hasil PAM
Fungsi pam() menghasilkan beberapa komponen utama yang membantu memahami hasil pengelompokan, seperti:
medoids
Titik pusat (medoid) dari setiap cluster
id.med
Posisi medoid dalam data asli
clustering
Label cluster tiap observasi
objective
Total jarak dalam cluster
clusinfo
Ringkasan tiap cluster (ukuran, diameter, separation)
silinfo
Nilai silhouette untuk setiap cluster
diss
Matriks jarak yang digunakan
[1] "medoids" "id.med" "clustering" "objective" "isolation"
[6] "clusinfo" "silinfo" "diss" "call"
build swap
1.735455 1.629242
head (pam_euc$ clusinfo,10 )
size max_diss av_diss diameter separation
[1,] 39 3.722629 1.489213 5.168688 0.5092152
[2,] 26 3.634913 1.649934 5.228481 0.8095486
[3,] 35 4.416731 1.769902 5.959716 0.5092152
$widths
cluster neighbor sil_width
75 1 2 0.5371369529
62 1 3 0.5326154872
94 1 3 0.5238706025
20 1 3 0.5179771928
100 1 2 0.5012156945
80 1 3 0.4926891659
43 1 2 0.4872654802
29 1 2 0.4818657008
24 1 3 0.4459609586
65 1 3 0.4402794071
46 1 3 0.4217914391
1 1 2 0.4199646007
23 1 3 0.4190102477
81 1 2 0.4150904262
63 1 2 0.3916238155
9 1 3 0.3905548094
99 1 3 0.3893213458
50 1 2 0.3836195472
8 1 2 0.3651706147
26 1 3 0.3645989827
40 1 2 0.3601769030
18 1 3 0.3476033969
85 1 2 0.3467458007
83 1 2 0.3386233885
47 1 2 0.3240275720
2 1 3 0.2874248282
72 1 3 0.2594053802
4 1 2 0.2549258693
7 1 2 0.2483319608
21 1 3 0.2423607973
22 1 2 0.2341364866
68 1 2 0.2291613951
78 1 3 0.2231493094
42 1 3 0.1715376250
5 1 2 0.1593088424
60 1 2 0.1511932353
14 1 3 0.1366312684
71 1 2 0.1191941788
88 1 2 -0.0862055863
54 2 1 0.4878591280
11 2 3 0.4737113941
90 2 1 0.4658548139
44 2 1 0.4370037419
3 2 3 0.4168899709
95 2 1 0.4152541722
98 2 1 0.3913702396
45 2 1 0.3799930219
56 2 1 0.3580399572
76 2 1 0.3506604108
6 2 3 0.3390997632
33 2 1 0.3232939245
91 2 3 0.3169488670
19 2 1 0.2715504904
16 2 3 0.2627541811
73 2 3 0.2280420892
92 2 1 0.2225404634
97 2 3 0.2175334050
34 2 3 0.2025329711
86 2 1 0.1997626399
35 2 1 0.1913517402
58 2 1 0.1892973764
70 2 3 0.1584482140
37 2 1 0.1294618290
84 2 1 0.1164640678
13 2 1 0.1158431384
36 3 1 0.4448975014
61 3 2 0.4281507035
82 3 2 0.4249836837
38 3 1 0.4147204968
67 3 2 0.3793773385
49 3 2 0.3766451662
10 3 1 0.3734331856
51 3 2 0.3611022916
31 3 2 0.3497294471
25 3 1 0.3475955158
96 3 1 0.3440511253
39 3 1 0.3367710641
53 3 1 0.3276955856
74 3 1 0.3164806716
64 3 1 0.3121967272
52 3 1 0.2916486071
12 3 2 0.2451239835
59 3 1 0.2137781219
79 3 1 0.1754939617
48 3 1 0.1701887268
87 3 2 0.1689637654
15 3 1 0.1678257435
93 3 2 0.1656272820
66 3 2 0.1451620345
89 3 1 0.1424653198
17 3 2 0.1408691191
57 3 1 0.0988212617
32 3 1 0.0905050536
41 3 1 0.0715718207
55 3 1 0.0122640715
69 3 2 0.0005886862
27 3 2 -0.0082471283
28 3 1 -0.0376324247
77 3 1 -0.0590002316
30 3 2 -0.1214765700
$clus.avg.widths
[1] 0.3402399 0.2946755 0.2174963
$avg.width
[1] 0.2854329
Evaluasi Clustering
eval_pam <- function (data_scaled, pam_model, dist_matrix) {
cl <- pam_model$ clustering
sil <- mean (silhouette (cl, dist_matrix)[, 3 ])
ch <- intCriteria (traj = as.matrix (data_scaled),
part = cl, crit = "Calinski_Harabasz" )$ calinski_harabasz
dunn <- intCriteria (traj = as.matrix (data_scaled),
part = cl, crit = "Dunn" )$ dunn
return (data.frame (Silhouette = sil, CH = ch, Dunn = dunn))
}
eval_result <- rbind (
Euclidean = eval_pam (data_scaled, pam_euc, dist_euc),
Manhattan = eval_pam (data_scaled, pam_man, dist_man),
Canberra = eval_pam (data_scaled, pam_can, dist_can)
)
eval_result
Silhouette CH Dunn
Euclidean 0.2854329 47.64919 0.08544286
Manhattan 0.2729617 42.56597 0.10159179
Canberra 0.2031891 30.04782 0.05786000
eval_long <- eval_result %>%
mutate (Jarak = rownames (eval_result)) %>%
pivot_longer (cols = c (Silhouette, CH, Dunn),
names_to = "Metrik" , values_to = "Nilai" )
eval_long
# A tibble: 9 × 3
Jarak Metrik Nilai
<chr> <chr> <dbl>
1 Euclidean Silhouette 0.285
2 Euclidean CH 47.6
3 Euclidean Dunn 0.0854
4 Manhattan Silhouette 0.273
5 Manhattan CH 42.6
6 Manhattan Dunn 0.102
7 Canberra Silhouette 0.203
8 Canberra CH 30.0
9 Canberra Dunn 0.0579
ggplot (eval_long, aes (x = Jarak, y = Nilai, fill = Metrik)) +
geom_bar (stat = "identity" , position = "dodge" , width = 0.7 ) +
scale_fill_brewer (palette = "Set2" ) +
labs (title = "Perbandingan Metrik Evaluasi" ,
x = "Jenis Jarak" ,
y = "Nilai Metrik" ,
fill = "Metrik" ) +
theme_minimal (base_size = 12 ) +
theme (
plot.title = element_text (face = "bold" , hjust = 0.5 ),
axis.text.x = element_text (face = "bold" )
)
ggplot (filter (eval_long, Metrik == "Silhouette" ),
aes (x = Jarak, y = Nilai, fill = Jarak)) +
geom_bar (stat = "identity" , width = 0.6 , show.legend = FALSE ) +
scale_fill_brewer (palette = "Set2" ) +
labs (title = "Perbandingan Nilai Silhouette" ,
x = "Jenis Jarak" , y = "Nilai" ) +
theme_minimal (base_size = 12 ) +
theme (plot.title = element_text (face = "bold" , hjust = 0.5 ))
ggplot (filter (eval_long, Metrik == "CH" ),
aes (x = Jarak, y = Nilai, fill = Jarak)) +
geom_bar (stat = "identity" , width = 0.6 , show.legend = FALSE ) +
scale_fill_brewer (palette = "Set3" ) +
labs (title = "Perbandingan Nilai Calinski-Harabasz" ,
x = "Jenis Jarak" , y = "Nilai" ) +
theme_minimal (base_size = 12 ) +
theme (plot.title = element_text (face = "bold" , hjust = 0.5 ))
ggplot (filter (eval_long, Metrik == "Dunn" ),
aes (x = Jarak, y = Nilai, fill = Jarak)) +
geom_bar (stat = "identity" , width = 0.6 , show.legend = FALSE ) +
scale_fill_brewer (palette = "Pastel1" ) +
labs (title = "Perbandingan Nilai Dunn Index" ,
x = "Jenis Jarak" , y = "Nilai" ) +
theme_minimal (base_size = 12 ) +
theme (plot.title = element_text (face = "bold" , hjust = 0.5 ))
Hasil Clustering
Jarak terbaik euclidean
head (data.frame (Observasi = 1 : length (pam_euc$ cluster),
Cluster = pam_euc$ cluster),10 )
Observasi Cluster
1 1 1
2 2 1
3 3 2
4 4 1
5 5 1
6 6 2
7 7 1
8 8 1
9 9 1
10 10 3
D. PERBANDINGAN HASIL CLUSTERING
# Membandingkan hasil berbagai metode
comparison <- data.frame (
Observasi = 1 : nrow (data),
Hierarchical = cluster_hc,
KMeans = kmeans_result$ cluster,
PAM = pam_euc$ clustering
)
print (head (comparison, 10 ))
Observasi Hierarchical KMeans PAM
1 1 1 2 1
2 2 1 2 1
3 3 2 3 2
4 4 2 2 1
5 5 2 2 1
6 6 2 3 2
7 7 2 2 1
8 8 1 2 1
9 9 1 2 1
10 10 3 1 3
3. LATIHAN
Gunakan dataset USArrests dan lakukan:
Hierarchical clustering dengan complete linkage
K-Means clustering dengan 3 cluster
Bandingkan hasil kedua metode
Visualisasi hasil terbaik
# Data untuk latihan
data (USArrests)
Attaching package: 'maps'
The following object is masked from 'package:cluster':
votes.repub
library (ggplot2)
library (dplyr)
hc <- hclust (dist (scale (USArrests)), method = "complete" )
cluster_map <- cutree (hc, k = 3 )
usa <- map_data ("state" )
us_cluster <- data.frame (state = tolower (rownames (USArrests)),
cluster = factor (cluster_map))
usa_cluster <- left_join (usa, us_cluster, by = c ("region" = "state" )) %>%
filter (! is.na (cluster))
ggplot (usa_cluster, aes (long, lat, group = group, fill = cluster)) +
geom_polygon (color = "white" ) +
scale_fill_brewer (palette = "Set2" ) +
coord_fixed (1.3 ) +
labs (title = "Peta Hasil Hierarchical Clustering (USArrests)" ,
fill = "Cluster" ) +
theme_void (base_size = 13 ) +
theme (
plot.title = element_text (face = "bold" , hjust = 0.5 ),
legend.position = "bottom"
)