Analisis Faktor & Gerombol

Teknik Peubah Ganda

Authors

Wawan Saputra

Statistika Dan Sains Data

Pendahuluan

  1. Analisis Faktor (Factor Analysis) → untuk menemukan struktur laten di balik variabel yang berkorelasi
  2. Analisis Gerombol (Cluster Analysis) → untuk mengelompokkan individu berdasarkan kesamaan karakteristik

Tujuan Pembelajaran:

  • Memahami konsep dan aplikasi analisis faktor dan analisis gerombol

1. ANALISIS FAKTOR

Konsep Dasar

Analisis faktor digunakan untuk:

  • Mereduksi dimensi variabel yang saling berkorelasi

  • Mengidentifikasi struktur laten di balik variabel teramati

  • Mengelompokkan variabel berdasarkan korelasi yang tinggi

Contoh Aplikasi:

  • Mengukur intelegensi dari berbagai tes kemampuan
  • Mengidentifikasi faktor kepuasan pelanggan
  • Menganalisis faktor lingkungan yang mempengaruhi produktivitas

Langkah 1: Persiapan Data Simulasi

# Install paket jika belum tersedia
#install.packages(c("psych", "GPArotation", "factoextra", "dplyr"))

library(psych)        # Analisis faktor & psikometri 
library(GPArotation)  #Rotasi faktor 

Attaching package: 'GPArotation'
The following objects are masked from 'package:psych':

    equamax, varimin
library(factoextra)   # Visualisasi multivariat 
Loading required package: ggplot2

Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':

    %+%, alpha
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dplyr)        # Manipulasi data

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Fungsi Utama packages:

Paket Fungsi
psych Analisis faktor, KMO, Bartlett test
GPArotation Rotasi Varimax, Promax, dll.
factoextra Visualisasi cluster, scree plot
dplyr Transformasi dan manipulasi data

set.seed(123)  # Untuk hasil yang reproducible
n <- 100       # Jumlah observasi

# Membuat 2 faktor laten
faktor1 <- rnorm(n)
faktor2 <- rnorm(n)

# Membuat variabel teramati dengan struktur faktor jelas
data <- data.frame(
  A1 = 0.7 * faktor1 + 0.3 * rnorm(n),  # Loading tinggi faktor1
  A2 = 0.8 * faktor1 + 0.2 * rnorm(n),  # Loading tinggi faktor1  
  A3= 0.6 * faktor1 + 0.4 * rnorm(n),  # Loading tinggi faktor1
  B1 = 0.7 * faktor2 + 0.3 * rnorm(n),  # Loading tinggi faktor2
  B2 = 0.8 * faktor2 + 0.2 * rnorm(n),  # Loading tinggi faktor2
  B3 = 0.6 * faktor2 + 0.4 * rnorm(n)   # Loading tinggi faktor2
)

# Melihat matriks korelasi
cat("Matriks Korelasi:\n")
Matriks Korelasi:
round(cor(data), 3)
       A1     A2     A3     B1     B2     B3
A1  1.000  0.864  0.685 -0.048 -0.040  0.044
A2  0.864  1.000  0.737 -0.062 -0.047  0.099
A3  0.685  0.737  1.000 -0.084 -0.131 -0.066
B1 -0.048 -0.062 -0.084  1.000  0.894  0.779
B2 -0.040 -0.047 -0.131  0.894  1.000  0.789
B3  0.044  0.099 -0.066  0.779  0.789  1.000

Interpretasi Awal:

  • A1, A2, A3 berkorelasi tinggi (0.6-0.9) → Faktor 1
  • B1, B2, B3 berkorelasi tinggi (0.8-0.9) → Faktor 2
  • Korelasi antar kelompok rendah (-0.13 sampai 0.10) → struktur faktor jelas

Langkah 2: Uji Kecocokan Analisis Faktor

# Uji KMO (Kaiser-Meyer-Olkin)
kmo_result <- KMO(data)
cat("KMO Overall MSA:", round(kmo_result$MSA, 3), "\n")
KMO Overall MSA: 0.696 
# Uji Bartlett untuk Sphericity
bartlett_result <- cortest.bartlett(cor(data), n = nrow(data))
cat("Bartlett Test p-value:", bartlett_result$p.value, "\n")
Bartlett Test p-value: 6.892494e-93 

Kriteria Kecocokan:

Uji Kriteria Interpretasi
KMO > 0.6 Layak untuk analisis faktor
Bartlett p-value < 0.05 Korelasi signifikan

Data Layak untuk analisis faktor (KMO = 0.7, p-value ≈ 0)


Langkah 3: Menentukan Jumlah Faktor

# Scree plot untuk menentukan jumlah faktor
pca <- principal(data, nfactors = 6, rotate = "none")
eigen_values <- pca$values

plot(eigen_values, type = "b", main = "Scree Plot", 
     xlab = "Komponen", ylab = "Eigenvalue", pch = 16, col = "blue")
abline(h = 1, col = "red", lty = 2, lwd = 2)
text(eigen_values, labels = round(eigen_values, 2), pos = 3, cex = 0.8)
legend("topright", legend = "Eigenvalue = 1", col = "red", lty = 2, lwd = 2)

Aturan Penentuan Jumlah Faktor:

  • Kaiser Criterion: Eigenvalue > 1
  • Scree Test: Titik elbow pada grafik
  • Varians Dijelaskan: Minimal 60-70%
  • 2 faktor direkomendasikan (eigenvalue > 1, elbow jelas di komponen 3)

Langkah 4: Ekstraksi Faktor (Maximum Likelihood)

# Analisis faktor dengan 2 faktor
fa_model <- fa(data, nfactors = 2, rotate = "none", fm = "ml")

cat("Loading Faktor (Sebelum Rotasi):\n")
Loading Faktor (Sebelum Rotasi):
print(fa_model$loadings, cutoff = 0.3)

Loadings:
   ML2    ML1   
A1         0.879
A2         0.971
A3         0.760
B1  0.923       
B2  0.939       
B3  0.841       

                 ML2   ML1
SS loadings    2.464 2.345
Proportion Var 0.411 0.391
Cumulative Var 0.411 0.802
cat("\nVarians Dijelaskan:\n")

Varians Dijelaskan:
print(round(fa_model$Vaccounted, 3))
                        ML2   ML1
SS loadings           2.464 2.345
Proportion Var        0.411 0.391
Cumulative Var        0.411 0.802
Proportion Explained  0.512 0.488
Cumulative Proportion 0.512 1.000

Interpretasi :

  • Faktor 1: B1, B2, B3 (loading tinggi)
  • Faktor 2: A1, A2, A3 (loading tinggi)
  • 80.2% varians dijelaskan → sangat baik

Langkah 5: Rotasi Faktor

# Rotasi Varimax (ortogonal)
fa_varimax <- fa(data, nfactors = 2, rotate = "varimax", fm = "ml")

# Rotasi Promax (miring)  
fa_promax <- fa(data, nfactors = 2, rotate = "promax", fm = "ml")

cat("Varimax Rotation:\n")
Varimax Rotation:
print(fa_varimax$loadings, cutoff = 0.3)

Loadings:
   ML2    ML1   
A1         0.884
A2         0.977
A3         0.757
B1  0.934       
B2  0.950       
B3  0.838       

                 ML2   ML1
SS loadings    2.482 2.327
Proportion Var 0.414 0.388
Cumulative Var 0.414 0.802
cat("\nPromax Rotation:\n") 

Promax Rotation:
print(fa_promax$loadings, cutoff = 0.3)

Loadings:
   ML2    ML1   
A1         0.885
A2         0.979
A3         0.756
B1  0.934       
B2  0.949       
B3  0.840       

                 ML2   ML1
SS loadings    2.484 2.326
Proportion Var 0.414 0.388
Cumulative Var 0.414 0.802
# Korelasi antar faktor (hanya untuk rotasi miring)
cat("\nKorelasi Antar Faktor (Promax):\n")

Korelasi Antar Faktor (Promax):
print(round(fa_promax$Phi, 3))
      ML2   ML1
ML2  1.00 -0.04
ML1 -0.04  1.00

Langkah 6: Interpretasi Akhir

# Komunalitas
cat("Komunalitas:\n")
Komunalitas:
print(round(fa_varimax$communalities, 3))
   A1    A2    A3    B1    B2    B3 
0.782 0.956 0.578 0.880 0.907 0.707 
# Ringkasan hasil
cat("\nRingkasan Model:\n")

Ringkasan Model:
print(fa_varimax$Vaccounted)
                            ML2       ML1
SS loadings           2.4824390 2.3270399
Proportion Var        0.4137398 0.3878400
Cumulative Var        0.4137398 0.8015798
Proportion Explained  0.5161555 0.4838445
Cumulative Proportion 0.5161555 1.0000000

Interpretasi

  • Faktor 1: B1, B2, B3 → beri nama sesuai konteks (misal: “Faktor Kemampuan Numerik”)
  • Faktor 2: A1, A2, A3 → beri nama sesuai konteks (misal: “Faktor Kemampuan Verbal”)
  • Komunalitas tinggi (> 0.5) → model menjelaskan varians dengan baik
  • Struktur bersih tanpa cross-loading

2. ANALISIS GEROMBOL

Konsep Dasar

Analisis gerombol digunakan untuk:

- Mengelompokkan objek berdasarkan kemiripan karakteristik

- Mengidentifikasi pola dalam data multivariat

- Segmentasi untuk berbagai aplikasi bisnis dan penelitian

Jenis Metode:

- Hierarki: Dendrogram, linkage methods

- Non-Hierarki: K-Means, K-Medoids (PAM)

A. Hierarchical Clustering

Standarisasi & Matriks Jarak

  • Tahapan awal dalam analisis Hierarchical Clustering adalah menstandarkan data agar setiap variabel memiliki skala yang sebanding.
  • Hal ini penting karena variabel dengan satuan besar dapat mendominasi perhitungan jarak.
library(cluster)
library(factoextra)
library(clusterCrit)
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)

Attaching package: 'scales'
The following objects are masked from 'package:psych':

    alpha, rescale
library(RColorBrewer)

data <- data %>% select(where(is.numeric)) %>% na.omit()
data_scaled <- scale(data)
# Matriks jarak Euclidean
dist_matrix <- dist(data_scaled, method = "euclidean")#"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"
head(dist_matrix,10)
 [1] 1.186582 3.221468 1.494887 1.724855 3.566237 1.361880 3.108316 1.237156
 [9] 3.297424 3.579704
# Hierarchical clustering dengan metode Ward
hc_ward <- hclust(dist_matrix, method = "ward.D2") #"single", "complete", "average" (= UPGMA), "mcquitty" "median" or "centroid"
hc_ward

Call:
hclust(d = dist_matrix, method = "ward.D2")

Cluster method   : ward.D2 
Distance         : euclidean 
Number of objects: 100 

Dendrogram

Dendrogram menggambarkan proses penggabungan antar objek atau cluster berdasarkan tingkat kemiripan. Setiap percabangan menunjukkan dua cluster yang digabung, dengan tinggi cabang mencerminkan jarak antar cluster.

fviz_dend(hc_ward, k = 3, rect = TRUE, palette = "jco",
main = "Dendrogram Hierarchical Clustering (Ward.D2)")
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Hasil Clustering

cluster_hc <- cutree(hc_ward, k = 3)
table(cluster_hc)
cluster_hc
 1  2  3 
27 45 28 
head(data.frame(Observasi = 1:length(cluster_hc),
           Cluster = cluster_hc),10)
   Observasi Cluster
1          1       1
2          2       1
3          3       2
4          4       2
5          5       2
6          6       2
7          7       2
8          8       1
9          9       1
10        10       3

Komponen hasil

Fungsi hclust() menghasilkan sejumlah komponen yang menjelaskan proses pengelompokan.
Beberapa komponen penting antara lain:

  • merge: pasangan cluster yang digabung di setiap langkah,

  • height: jarak antar cluster pada saat digabung,

  • order: urutan observasi di dendrogram,

  • labels: label atau nama observasi,

  • method & dist.method: metode penggabungan dan jenis jarak yang digunakan.

names(hc_ward)
[1] "merge"       "height"      "order"       "labels"      "method"     
[6] "call"        "dist.method"
head(hc_ward$merge, 10)
      [,1] [,2]
 [1,]  -66  -93
 [2,]  -17  -79
 [3,]  -20  -62
 [4,]  -80  -99
 [5,]  -14  -28
 [6,]  -51  -67
 [7,]  -24  -85
 [8,]  -52  -53
 [9,]  -75    3
[10,]  -38    8
hc_ward$height
 [1]  0.4234035  0.4641453  0.4690650  0.4875733  0.5092152  0.6765199
 [7]  0.7140978  0.7430553  0.7559972  0.7630201  0.7649903  0.7802216
[13]  0.7820095  0.7840824  0.8095486  0.8212717  0.8374955  0.8482324
[19]  0.8539783  0.8695295  0.8761644  0.8779808  0.8860642  0.9080021
[25]  0.9415906  0.9436227  0.9437432  0.9540593  0.9642794  0.9663148
[31]  0.9966534  1.0168749  1.0266621  1.0569648  1.0644014  1.0681005
[37]  1.1189110  1.1575998  1.1642385  1.1844863  1.1923534  1.1956224
[43]  1.2007538  1.2312531  1.2978697  1.3137802  1.3203862  1.3453711
[49]  1.3539209  1.3567902  1.3760002  1.3899316  1.4087559  1.4285394
[55]  1.4706334  1.5056346  1.5111813  1.5122790  1.6555399  1.6555632
[61]  1.6788716  1.6864547  1.6979095  1.7533173  1.7614047  1.7795909
[67]  1.8766330  1.9024291  1.9025262  1.9160378  1.9856777  1.9937020
[73]  2.0487407  2.1490390  2.3078826  2.3989787  2.4157281  2.5583885
[79]  2.5989212  2.6374357  2.6444172  2.8171655  2.8467163  3.0118017
[85]  3.1094588  3.2462284  3.3512510  3.3936216  3.8937435  4.5920003
[91]  4.6711809  5.0551856  5.6516754  6.9609842  7.4537350  9.6919782
[97]  9.9291102 15.9742352 17.0236664
hc_ward$order
  [1]  39  96  25  74  49  64  59  89  55  77  10  36  38  52  53  32  48  31
 [19]  51  67  61  82  66  93  12  69  27  87  46  29 100  75  20  62  65  43
 [37]  94   2   9   1  50   8  47  40  63  81  72  18  26  57  15  41  21  23
 [55]  78  60  17  79  42  14  28  68  71   4  80  99  24  85   7  83  33  92
 [73]  88  58  86   5  22  13  37  84  11  54  19  76  44  98  35  95  45  90
 [91]  70  97  16  91  34  30  73  56   3   6

Evaluasi Clustering

Beberapa metrik yang digunakan antara lain:

  • Silhouette Coefficient: mengukur seberapa mirip observasi dengan cluster-nya sendiri dibandingkan cluster lain.
  • Calinski-Harabasz Index: semakin tinggi nilainya, semakin baik pemisahan antar cluster
  • Dunn Index: semakin tinggi nilai, semakin jauh jarak antar cluster relatif terhadap ukuran dalam cluster.
eval_cluster <- function(data_scaled, cluster, dist_matrix) {
sil <- mean(silhouette(cluster, dist_matrix)[, 3])
ch <- intCriteria(traj = as.matrix(data_scaled),
part = cluster, crit = "Calinski_Harabasz")$calinski_harabasz
dunn <- intCriteria(traj = as.matrix(data_scaled),
part = cluster, crit = "Dunn")$dunn
return(data.frame(Silhouette = sil,
Calinski_Harabasz = ch,
Dunn = dunn))
}

eval_cluster(data_scaled, cluster_hc, dist_matrix)
  Silhouette Calinski_Harabasz      Dunn
1  0.2373443           41.1055 0.1213656

Karakteristik Cluster

  • Tahap ini bertujuan untuk melihat karakteristik atau profil dari masing-masing cluster yang terbentuk.
  • Nilai rata-rata tiap variabel pada setiap cluster membantu dalam memahami perbedaan utama antar kelompok.
  • Cluster dapat diinterpretasikan sebagai kelompok observasi dengan pola nilai variabel yang serupa.
data_clustered <- data.frame(data, cluster = factor(cluster_hc))
cluster_summary <- data_clustered %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), mean, .names = "{.col}"))
cluster_summary
# A tibble: 3 × 7
  cluster      A1      A2     A3     B1     B2     B3
  <fct>     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 1       -0.519  -0.705  -0.532 -0.502 -0.510 -0.277
2 2        0.497   0.542   0.423 -0.405 -0.444 -0.312
3 3        0.0573  0.0418  0.179  0.820  0.791  0.690

Visualisasi Clustering

Visualisasi digunakan untuk memperjelas pola karakteristik antar cluster.

  • Heatmap menampilkan perbedaan nilai rata-rata tiap variabel antar cluster secara visual melalui gradasi warna.

  • Bar Plot menunjukkan nilai rata-rata tiap variabel secara kuantitatif untuk setiap cluster, memudahkan interpretasi perbandingan.

Data untuk visualisasi

heat_data <- cluster_summary %>%
pivot_longer(-cluster, names_to = "Variabel", values_to = "Rata2")
heat_data
# A tibble: 18 × 3
   cluster Variabel   Rata2
   <fct>   <chr>      <dbl>
 1 1       A1       -0.519 
 2 1       A2       -0.705 
 3 1       A3       -0.532 
 4 1       B1       -0.502 
 5 1       B2       -0.510 
 6 1       B3       -0.277 
 7 2       A1        0.497 
 8 2       A2        0.542 
 9 2       A3        0.423 
10 2       B1       -0.405 
11 2       B2       -0.444 
12 2       B3       -0.312 
13 3       A1        0.0573
14 3       A2        0.0418
15 3       A3        0.179 
16 3       B1        0.820 
17 3       B2        0.791 
18 3       B3        0.690 

Heatmap

ggplot(heat_data, aes(x = Variabel, y = cluster, fill = Rata2)) +
geom_tile(color = "white") +
scale_fill_gradientn(colors = c("#56B1F7", "white", "#FF6B6B")) +
labs(title = "Heatmap Karakteristik Cluster",
x = "Variabel", y = "Cluster") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar Plot

ggplot(heat_data, aes(x = Variabel, y = Rata2, fill = cluster)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
labs(title = "Bar Plot Karakteristik Cluster",
x = "Variabel", y = "Rata-rata Nilai") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

B. K-Means Clustering

Menentukan Jumlah Cluster Optimal

Langkah pertama adalah menentukan jumlah cluster terbaik (k) menggunakan dua metode populer:

  • Elbow Method: melihat titik siku (elbow point) pada grafik within-cluster sum of squares (WSS).
  • Silhouette Method: menilai kualitas pemisahan cluster; semakin tinggi nilai silhouette, semakin baik hasil clustering.
library(factoextra)
library(cluster)
library(clusterCrit)
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
library(RColorBrewer)

data <- data %>% select(where(is.numeric)) %>% na.omit()
data_scaled <- scale(data)
fviz_nbclust(data_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method")

fviz_nbclust(data_scaled, kmeans, method = "silhouette") +
labs(title = "Silhouette Method")

Penerapan Algoritma K-Means

Setelah jumlah cluster optimal dipilih (misal k = 3), dilakukan proses clustering menggunakan fungsi kmeans().

Parameter nstart = 25 digunakan untuk memastikan stabilitas hasil dengan mencoba 25 inisialisasi acak.

set.seed(123)
kmeans_result <- kmeans(data_scaled, centers = 3, nstart = 25)
kmeans_result
K-means clustering with 3 clusters of sizes 25, 41, 34

Cluster means:
          A1         A2         A3         B1        B2         B3
1 -0.3197838 -0.2084213 -0.2240879  1.2662463  1.256284  1.1955286
2 -0.6467932 -0.6928074 -0.5357591 -0.4906116 -0.500987 -0.5400056
3  1.0150917  0.9886952  0.8108330 -0.3394436 -0.319607 -0.2278819

Clustering vector:
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  2   2   3   2   2   3   2   2   2   1   3   1   3   2   1   3   3   2   3   2 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
  2   2   2   2   1   2   3   2   2   3   1   1   3   3   3   1   3   1   1   2 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
  1   2   2   3   3   2   2   1   1   2   1   1   1   3   2   3   1   3   1   2 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  1   2   2   1   2   3   1   2   3   3   2   2   3   1   2   3   2   2   3   2 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
  2   1   2   3   2   3   1   3   1   3   3   3   3   2   3   1   3   3   2   2 

Within cluster sum of squares by cluster:
[1]  87.9679 102.4448 102.9224
 (between_SS / total_SS =  50.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Hasil Clustering

table(kmeans_result$cluster)

 1  2  3 
25 41 34 
head(data.frame(Observasi = 1:length(kmeans_result$cluster),
           Cluster = kmeans_result$cluster),10)
   Observasi Cluster
1          1       2
2          2       2
3          3       3
4          4       2
5          5       2
6          6       3
7          7       2
8          8       2
9          9       2
10        10       1

Komponen Hasil K-Means

Fungsi kmeans() menghasilkan sejumlah komponen penting untuk memahami hasil pengelompokan:

Komponen Penjelasan
cluster Label cluster untuk setiap observasi
centers Koordinat pusat (centroid) tiap cluster
totss Total variasi data (Total Sum of Squares)
withinss Variasi dalam masing-masing cluster
tot.withinss Jumlah keseluruhan variasi dalam cluster
betweenss Variasi antar cluster
size Jumlah anggota setiap cluster
iter Jumlah iterasi hingga konvergen
ifault Status hasil (0 = sukses)
names(kmeans_result)
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
kmeans_result$cluster
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  2   2   3   2   2   3   2   2   2   1   3   1   3   2   1   3   3   2   3   2 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
  2   2   2   2   1   2   3   2   2   3   1   1   3   3   3   1   3   1   1   2 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
  1   2   2   3   3   2   2   1   1   2   1   1   1   3   2   3   1   3   1   2 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  1   2   2   1   2   3   1   2   3   3   2   2   3   1   2   3   2   2   3   2 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
  2   1   2   3   2   3   1   3   1   3   3   3   3   2   3   1   3   3   2   2 
kmeans_result$totss
[1] 594
kmeans_result$withinss
[1]  87.9679 102.4448 102.9224
kmeans_result$tot.withinss
[1] 293.3351
kmeans_result$betweenss
[1] 300.6649
kmeans_result$iter
[1] 3

Evaluasi Clustering

cluster_km <- kmeans_result$cluster
dist_matrix <- dist(data_scaled, method = "euclidean")

sil <- mean(silhouette(cluster_km, dist_matrix)[, 3])
ch <- intCriteria(traj = as.matrix(data_scaled),
part = cluster_km,
crit = "Calinski_Harabasz")$calinski_harabasz
dunn <- intCriteria(traj = as.matrix(data_scaled),
part = cluster_km,
crit = "Dunn")$dunn

data.frame(Silhouette = sil,
Calinski_Harabasz = ch,
Dunn = dunn)
  Silhouette Calinski_Harabasz     Dunn
1  0.2893723          49.71189 0.133526

Karakteristik Cluster

data_clustered <- data.frame(data, cluster = factor(cluster_km))

cluster_summary <- data_clustered %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), mean, .names = "{.col}"))

cluster_summary
# A tibble: 3 × 7
  cluster     A1      A2      A3     B1     B2     B3
  <fct>    <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 1       -0.113 -0.0913 -0.0403  0.877  0.909  0.810
2 2       -0.331 -0.455  -0.231  -0.462 -0.525 -0.398
3 3        0.775  0.807   0.592  -0.347 -0.377 -0.181

Visualisasi Clustering

fviz_cluster(kmeans_result, data = data_scaled,
palette = "jco", geom = "point",
main = "Visualisasi Hasil K-Means Clustering (3 Cluster)")

heat_data <- cluster_summary %>%
pivot_longer(-cluster, names_to = "Variabel", values_to = "Rata2")
heat_data
# A tibble: 18 × 3
   cluster Variabel   Rata2
   <fct>   <chr>      <dbl>
 1 1       A1       -0.113 
 2 1       A2       -0.0913
 3 1       A3       -0.0403
 4 1       B1        0.877 
 5 1       B2        0.909 
 6 1       B3        0.810 
 7 2       A1       -0.331 
 8 2       A2       -0.455 
 9 2       A3       -0.231 
10 2       B1       -0.462 
11 2       B2       -0.525 
12 2       B3       -0.398 
13 3       A1        0.775 
14 3       A2        0.807 
15 3       A3        0.592 
16 3       B1       -0.347 
17 3       B2       -0.377 
18 3       B3       -0.181 
ggplot(heat_data, aes(x = Variabel, y = cluster, fill = Rata2)) +
geom_tile(color = "white") +
scale_fill_gradientn(colors = c("#56B1F7", "white", "#FF6B6B")) +
labs(title = "Heatmap Karakteristik Cluster",
x = "Variabel", y = "Cluster") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(heat_data, aes(x = Variabel, y = Rata2, fill = cluster)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
labs(title = "Heatmap Karakteristik Cluster",
x = "Variabel", y = "Rata-rata Nilai") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

C. K-Medoids Clustering (PAM)

library(cluster)
library(factoextra)
library(clusterCrit)
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
library(RColorBrewer)
data <- data %>% select(where(is.numeric)) %>% na.omit()
data_scaled <- scale(data)

Eksperimen dengan Berbagai Jarak

dist_euc <- dist(data_scaled, method = "euclidean")
dist_man <- dist(data_scaled, method = "manhattan")
dist_can <- dist(data_scaled, method = "canberra")

pam_euc <- pam(dist_euc, k = 3)
pam_man <- pam(dist_man, k = 3)
pam_can <- pam(dist_can, k = 3)

Komponen Hasil PAM

Fungsi pam() menghasilkan beberapa komponen utama yang membantu memahami hasil pengelompokan, seperti:

Komponen Penjelasan
medoids Titik pusat (medoid) dari setiap cluster
id.med Posisi medoid dalam data asli
clustering Label cluster tiap observasi
objective Total jarak dalam cluster
clusinfo Ringkasan tiap cluster (ukuran, diameter, separation)
silinfo Nilai silhouette untuk setiap cluster
diss Matriks jarak yang digunakan
names(pam_euc)
[1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
[6] "clusinfo"   "silinfo"    "diss"       "call"      
head(pam_euc$clustering)
1 2 3 4 5 6 
1 1 2 1 1 2 
pam_euc$medoids
[1] "80" "90" "53"
pam_euc$id.med
[1] 80 90 53
pam_euc$objective
   build     swap 
1.735455 1.629242 
head(pam_euc$clusinfo,10)
     size max_diss  av_diss diameter separation
[1,]   39 3.722629 1.489213 5.168688  0.5092152
[2,]   26 3.634913 1.649934 5.228481  0.8095486
[3,]   35 4.416731 1.769902 5.959716  0.5092152
pam_euc$silinfo
$widths
    cluster neighbor     sil_width
75        1        2  0.5371369529
62        1        3  0.5326154872
94        1        3  0.5238706025
20        1        3  0.5179771928
100       1        2  0.5012156945
80        1        3  0.4926891659
43        1        2  0.4872654802
29        1        2  0.4818657008
24        1        3  0.4459609586
65        1        3  0.4402794071
46        1        3  0.4217914391
1         1        2  0.4199646007
23        1        3  0.4190102477
81        1        2  0.4150904262
63        1        2  0.3916238155
9         1        3  0.3905548094
99        1        3  0.3893213458
50        1        2  0.3836195472
8         1        2  0.3651706147
26        1        3  0.3645989827
40        1        2  0.3601769030
18        1        3  0.3476033969
85        1        2  0.3467458007
83        1        2  0.3386233885
47        1        2  0.3240275720
2         1        3  0.2874248282
72        1        3  0.2594053802
4         1        2  0.2549258693
7         1        2  0.2483319608
21        1        3  0.2423607973
22        1        2  0.2341364866
68        1        2  0.2291613951
78        1        3  0.2231493094
42        1        3  0.1715376250
5         1        2  0.1593088424
60        1        2  0.1511932353
14        1        3  0.1366312684
71        1        2  0.1191941788
88        1        2 -0.0862055863
54        2        1  0.4878591280
11        2        3  0.4737113941
90        2        1  0.4658548139
44        2        1  0.4370037419
3         2        3  0.4168899709
95        2        1  0.4152541722
98        2        1  0.3913702396
45        2        1  0.3799930219
56        2        1  0.3580399572
76        2        1  0.3506604108
6         2        3  0.3390997632
33        2        1  0.3232939245
91        2        3  0.3169488670
19        2        1  0.2715504904
16        2        3  0.2627541811
73        2        3  0.2280420892
92        2        1  0.2225404634
97        2        3  0.2175334050
34        2        3  0.2025329711
86        2        1  0.1997626399
35        2        1  0.1913517402
58        2        1  0.1892973764
70        2        3  0.1584482140
37        2        1  0.1294618290
84        2        1  0.1164640678
13        2        1  0.1158431384
36        3        1  0.4448975014
61        3        2  0.4281507035
82        3        2  0.4249836837
38        3        1  0.4147204968
67        3        2  0.3793773385
49        3        2  0.3766451662
10        3        1  0.3734331856
51        3        2  0.3611022916
31        3        2  0.3497294471
25        3        1  0.3475955158
96        3        1  0.3440511253
39        3        1  0.3367710641
53        3        1  0.3276955856
74        3        1  0.3164806716
64        3        1  0.3121967272
52        3        1  0.2916486071
12        3        2  0.2451239835
59        3        1  0.2137781219
79        3        1  0.1754939617
48        3        1  0.1701887268
87        3        2  0.1689637654
15        3        1  0.1678257435
93        3        2  0.1656272820
66        3        2  0.1451620345
89        3        1  0.1424653198
17        3        2  0.1408691191
57        3        1  0.0988212617
32        3        1  0.0905050536
41        3        1  0.0715718207
55        3        1  0.0122640715
69        3        2  0.0005886862
27        3        2 -0.0082471283
28        3        1 -0.0376324247
77        3        1 -0.0590002316
30        3        2 -0.1214765700

$clus.avg.widths
[1] 0.3402399 0.2946755 0.2174963

$avg.width
[1] 0.2854329
pam_euc$call
pam(x = dist_euc, k = 3)

Evaluasi Clustering

eval_pam <- function(data_scaled, pam_model, dist_matrix) {
cl <- pam_model$clustering
sil <- mean(silhouette(cl, dist_matrix)[, 3])
ch <- intCriteria(traj = as.matrix(data_scaled),
part = cl, crit = "Calinski_Harabasz")$calinski_harabasz
dunn <- intCriteria(traj = as.matrix(data_scaled),
part = cl, crit = "Dunn")$dunn
return(data.frame(Silhouette = sil, CH = ch, Dunn = dunn))
}

eval_result <- rbind(
Euclidean = eval_pam(data_scaled, pam_euc, dist_euc),
Manhattan = eval_pam(data_scaled, pam_man, dist_man),
Canberra = eval_pam(data_scaled, pam_can, dist_can)
)

eval_result
          Silhouette       CH       Dunn
Euclidean  0.2854329 47.64919 0.08544286
Manhattan  0.2729617 42.56597 0.10159179
Canberra   0.2031891 30.04782 0.05786000
eval_long <- eval_result %>%
mutate(Jarak = rownames(eval_result)) %>%
pivot_longer(cols = c(Silhouette, CH, Dunn),
names_to = "Metrik", values_to = "Nilai")
eval_long 
# A tibble: 9 × 3
  Jarak     Metrik       Nilai
  <chr>     <chr>        <dbl>
1 Euclidean Silhouette  0.285 
2 Euclidean CH         47.6   
3 Euclidean Dunn        0.0854
4 Manhattan Silhouette  0.273 
5 Manhattan CH         42.6   
6 Manhattan Dunn        0.102 
7 Canberra  Silhouette  0.203 
8 Canberra  CH         30.0   
9 Canberra  Dunn        0.0579
ggplot(eval_long, aes(x = Jarak, y = Nilai, fill = Metrik)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Perbandingan Metrik Evaluasi",
x = "Jenis Jarak",
y = "Nilai Metrik",
fill = "Metrik") +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.text.x = element_text(face = "bold")
)

ggplot(filter(eval_long, Metrik == "Silhouette"),
aes(x = Jarak, y = Nilai, fill = Jarak)) +
geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Perbandingan Nilai Silhouette",
x = "Jenis Jarak", y = "Nilai") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(filter(eval_long, Metrik == "CH"),
aes(x = Jarak, y = Nilai, fill = Jarak)) +
geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Perbandingan Nilai Calinski-Harabasz",
x = "Jenis Jarak", y = "Nilai") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(filter(eval_long, Metrik == "Dunn"),
aes(x = Jarak, y = Nilai, fill = Jarak)) +
geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
scale_fill_brewer(palette = "Pastel1") +
labs(title = "Perbandingan Nilai Dunn Index",
x = "Jenis Jarak", y = "Nilai") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))

Hasil Clustering

Jarak terbaik euclidean

table(pam_euc$cluster)

 1  2  3 
39 26 35 
head(data.frame(Observasi = 1:length(pam_euc$cluster),
           Cluster = pam_euc$cluster),10)
   Observasi Cluster
1          1       1
2          2       1
3          3       2
4          4       1
5          5       1
6          6       2
7          7       1
8          8       1
9          9       1
10        10       3

D. PERBANDINGAN HASIL CLUSTERING

# Membandingkan hasil berbagai metode
comparison <- data.frame(
  Observasi = 1:nrow(data),
  Hierarchical = cluster_hc,
  KMeans = kmeans_result$cluster,
  PAM = pam_euc$clustering
)

print(head(comparison, 10))
   Observasi Hierarchical KMeans PAM
1          1            1      2   1
2          2            1      2   1
3          3            2      3   2
4          4            2      2   1
5          5            2      2   1
6          6            2      3   2
7          7            2      2   1
8          8            1      2   1
9          9            1      2   1
10        10            3      1   3

3. LATIHAN

Gunakan dataset USArrests dan lakukan:

  1. Hierarchical clustering dengan complete linkage
  2. K-Means clustering dengan 3 cluster
  3. Bandingkan hasil kedua metode
  4. Visualisasi hasil terbaik
# Data untuk latihan
data(USArrests)

library(maps)

Attaching package: 'maps'
The following object is masked from 'package:cluster':

    votes.repub
library(ggplot2)
library(dplyr)

hc <- hclust(dist(scale(USArrests)), method = "complete")
cluster_map <- cutree(hc, k = 3)

usa <- map_data("state")

us_cluster <- data.frame(state = tolower(rownames(USArrests)),
                         cluster = factor(cluster_map))

usa_cluster <- left_join(usa, us_cluster, by = c("region" = "state")) %>%
  filter(!is.na(cluster))

ggplot(usa_cluster, aes(long, lat, group = group, fill = cluster)) +
  geom_polygon(color = "white") +
  scale_fill_brewer(palette = "Set2") +
  coord_fixed(1.3) +
  labs(title = "Peta Hasil Hierarchical Clustering (USArrests)",
       fill = "Cluster") +
  theme_void(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "bottom"
  )

TERIMA KASIH