The data used are dummy data consisting of two numeric variables, X1 and X2.
set.seed(123)
data_dummy <- data.frame(
X1 = c(rnorm(10,5,1), rnorm(10,20,1), rnorm(10,40,1)),
X2 = c(rnorm(10,6,1), rnorm(10,22,1), rnorm(10,42,1))
)
data_dummy
## X1 X2
## 1 4.439524 6.426464
## 2 4.769823 5.704929
## 3 6.558708 6.895126
## 4 5.070508 6.878133
## 5 5.129288 6.821581
## 6 6.715065 6.688640
## 7 5.460916 6.553918
## 8 3.734939 5.938088
## 9 4.313147 5.694037
## 10 4.554338 5.619529
## 11 21.224082 21.305293
## 12 20.359814 21.792083
## 13 20.400771 20.734604
## 14 20.110683 24.168956
## 15 19.444159 23.207962
## 16 21.786913 20.876891
## 17 20.497850 21.597115
## 18 18.033383 21.533345
## 19 20.701356 22.779965
## 20 19.527209 21.916631
## 21 38.932176 42.253319
## 22 39.782025 41.971453
## 23 38.973996 41.957130
## 24 39.271109 43.368602
## 25 39.374961 41.774229
## 26 38.313307 43.516471
## 27 40.837787 40.451247
## 28 40.153373 42.584614
## 29 38.861863 42.123854
## 30 41.253815 42.215942
## X1 X2
## Min. : 3.735 Min. : 5.620
## 1st Qu.: 5.735 1st Qu.: 6.836
## Median :20.380 Median :21.695
## Mean :21.620 Mean :23.512
## 3rd Qu.:38.915 3rd Qu.:41.911
## Max. :41.254 Max. :43.516
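The table above has the layout that base R's `summary()` prints for a data frame. The call itself does not appear in the source, so the following is only a minimal sketch, assuming the report ran `summary()` directly on `data_dummy`:

```r
# Rebuild the dummy data exactly as defined at the top of the report.
set.seed(123)
data_dummy <- data.frame(
  X1 = c(rnorm(10, 5, 1), rnorm(10, 20, 1), rnorm(10, 40, 1)),
  X2 = c(rnorm(10, 6, 1), rnorm(10, 22, 1), rnorm(10, 42, 1))
)

# Assumed call: minimum, quartiles, median, mean and maximum of each column.
summary(data_dummy)
```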
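plot(data_dummy$X1, data_dummy$X2,
     pch = 19,
     col = "purple",
     xlab = "Variable X1",
     ylab = "Variable X2",
     main = "Scatter Plot of the Dummy Data")
Before clustering, each variable is standardized to a z-score:
\[ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \]
where \(x_{ij}\) is the value of observation \(i\) on variable \(j\), \(\bar{x}_j\) is the mean of variable \(j\), and \(s_j\) is the standard deviation of variable \(j\).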
## X1 X2
## [1,] -1.19369004 -1.14135619
## [2,] -1.17074052 -1.18955749
## [3,] -1.04644655 -1.11004784
## [4,] -1.14984850 -1.11118298
## [5,] -1.14576444 -1.11496090
## [6,] -1.03558270 -1.12384185
## [7,] -1.12272249 -1.13284182
## [8,] -1.24264550 -1.17398154
## [9,] -1.20247088 -1.19028507
## [10,] -1.18571264 -1.19526250
## [11,] -0.02747851 -0.14739440
## [12,] -0.08752890 -0.11487502
## [13,] -0.08468312 -0.18551860
## [14,] -0.10483884 0.04390907
## [15,] -0.15114973 -0.02028895
## [16,] 0.01162770 -0.17601324
## [17,] -0.07793795 -0.12789958
## [18,] -0.24917217 -0.13215969
## [19,] -0.06379814 -0.04888075
## [20,] -0.14537933 -0.10655473
## [21,] 1.20290150 1.25201258
## [22,] 1.26195003 1.23318292
## [23,] 1.20580715 1.23222604
## [24,] 1.22645094 1.32651774
## [25,] 1.23366670 1.22000760
## [26,] 1.15990168 1.33639590
## [27,] 1.33530566 1.13162743
## [28,] 1.28775174 1.27414435
## [29,] 1.19801605 1.24336388
## [30,] 1.36421178 1.24951566
## attr(,"scaled:center")
## X1 X2
## 21.61956 23.51167
## attr(,"scaled:scale")
## X1 X2
## 14.39238 14.96922
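The `scaled:center` and `scaled:scale` attributes in the output above are what base R's `scale()` attaches to its result, so `data_scaled` was presumably created with `scale(data_dummy)`; that call is not shown in the source. The sketch below uses a small hypothetical matrix (not the report's data) to check that `scale()` matches the z-score formula applied by hand:

```r
# Tiny illustrative matrix (hypothetical values, not from the report).
x <- cbind(X1 = c(2, 4, 6), X2 = c(10, 20, 60))

# Manual z-scores: subtract each column mean, then divide by the column's sample SD.
centered <- sweep(x, 2, colMeans(x), "-")
z_manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# scale() does the same and stores the means and SDs as attributes.
z_scale <- scale(x)

# Compare the values only, ignoring the extra attributes added by scale().
all.equal(z_manual, z_scale, check.attributes = FALSE)
```

Note that `sd()` uses the sample standard deviation (denominator n - 1), which is also what `scale()` uses by default.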
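The Silhouette method is used to determine the optimal number of clusters.
The silhouette value of observation \(i\) is computed as:
\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]
where \(a(i)\) is the mean distance from observation \(i\) to the other members of its own cluster, and \(b(i)\) is the smallest mean distance from observation \(i\) to the members of any other cluster.
The silhouette value lies in the range:
\[ -1 \le s(i) \le 1 \]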
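To make the formula concrete, the sketch below computes \(s(i)\) by hand for one point and compares it with `silhouette()` from the `cluster` package. The data and labels are hypothetical and chosen only so the arithmetic is easy to follow:

```r
library(cluster)  # provides silhouette()

# Hypothetical 1-D toy data with two obvious groups.
x  <- c(1, 2, 10, 11)
cl <- c(1, 1, 2, 2)
d  <- as.matrix(dist(x))

i <- 1
same  <- setdiff(which(cl == cl[i]), i)   # other members of i's own cluster
a_i   <- mean(d[i, same])                 # mean within-cluster distance
other <- setdiff(unique(cl), cl[i])       # labels of the remaining clusters
b_i   <- min(sapply(other, function(k) mean(d[i, cl == k])))  # nearest other cluster
s_i   <- (b_i - a_i) / max(a_i, b_i)

# Value reported by the package for the same point (column 3 is the width).
s_pkg <- silhouette(cl, dist(x))[i, 3]
c(manual = s_i, package = s_pkg)
```

For point 1, \(a(i) = 1\) and \(b(i) = 9.5\), so \(s(i) = 8.5 / 9.5 \approx 0.895\), close to the upper bound of 1 because the two groups are far apart.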
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.3
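sil_width <- c()  # sil_width[k] holds the average silhouette width for k clusters
for (k in 2:6) {
  model <- kmeans(data_scaled, centers = k, nstart = 25)
  ss <- silhouette(model$cluster, dist(data_scaled))
  sil_width[k] <- mean(ss[, 3])  # column 3 is each observation's silhouette width
}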
sil_width
## [1] NA 0.7576476 0.9307402 0.7381757 0.5774467 0.4337640
## [1] 3
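The `3` printed above is presumably the position of the largest average silhouette width, stored as `k_opt` (that name is used later in the legend code, but its assignment is not shown in the source). A minimal sketch, using the widths exactly as printed above:

```r
# Average silhouette widths copied from the printed output; position = number of clusters k.
sil_width <- c(NA, 0.7576476, 0.9307402, 0.7381757, 0.5774467, 0.4337640)

# which.max() discards NA and returns the position of the maximum, i.e. the best k.
k_opt <- which.max(sil_width)
k_opt
```

Because the vector is indexed by `k`, the position returned by `which.max()` can be used directly as the number of clusters.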
Based on the Silhouette method, the optimal number of clusters is 3, because it has the largest average silhouette width, 0.9307.
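K-Means partitions the observations so as to minimize the total within-cluster sum of squares:
\[ W = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \]
where \(K\) is the number of clusters, \(C_k\) is the set of observations assigned to cluster \(k\), and \(\mu_k\) is the centroid (mean) of cluster \(k\).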
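The plotting code later in the report uses an object named `model_kmeans`, but the call that creates it is not shown in the source. The sketch below assumes it was fitted on the standardized data with `centers = k_opt` and `nstart = 25` (mirroring the silhouette loop), and checks that `W` computed from the definition agrees with the `tot.withinss` component returned by `kmeans()`:

```r
# Rebuild and standardize the dummy data as in the report.
set.seed(123)
data_dummy <- data.frame(
  X1 = c(rnorm(10, 5, 1), rnorm(10, 20, 1), rnorm(10, 40, 1)),
  X2 = c(rnorm(10, 6, 1), rnorm(10, 22, 1), rnorm(10, 42, 1))
)
data_scaled <- scale(data_dummy)

k_opt <- 3  # value selected by the silhouette method in the report

# Assumed fit: the exact arguments used in the report are not shown.
model_kmeans <- kmeans(data_scaled, centers = k_opt, nstart = 25)

# W from the definition: squared distance of every point to its own centroid.
own_centroid <- model_kmeans$centers[model_kmeans$cluster, ]
W_manual <- sum((data_scaled - own_centroid)^2)

table(model_kmeans$cluster)                     # cluster sizes
all.equal(W_manual, model_kmeans$tot.withinss)  # definition vs. kmeans() output
```

The cluster labels assigned by `kmeans()` are arbitrary, so the numbering may differ between runs even when the grouping is identical.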
K-Means clustering produces 3 clusters, each containing 10 observations. Judging by the cluster means, cluster 1 has the lowest values, cluster 2 has intermediate values, and cluster 3 has the highest values on both variables.
plot(data_dummy$X1, data_dummy$X2,
     col = model_kmeans$cluster,
     pch = 19,
     xlab = "Variable X1",
     ylab = "Variable X2",
     main = "K-Means Clustering Results")
legend("topleft",
       legend = paste("Cluster", 1:k_opt),
       col = 1:k_opt,
       pch = 19)
## Cluster X1 X2
## 1 1 5.074626 6.322045
## 2 2 20.208622 21.991284
## 3 3 39.575441 42.221686
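The table above lists the mean of each variable per cluster, but the code that produced it is not shown in the source. One common way to build such a table is base R's `aggregate()`; the sketch below assumes that approach and uses hypothetical toy data and labels rather than the report's objects:

```r
# Hypothetical toy data and cluster labels, for illustration only.
df <- data.frame(X1 = c(1, 3, 10, 12), X2 = c(2, 4, 20, 22))
cluster <- c(1, 1, 2, 2)

# Mean of every variable within each cluster; the grouping column is named "Cluster".
cluster_means <- aggregate(df, by = list(Cluster = cluster), FUN = mean)
cluster_means
```

Applied to the report's data, the equivalent call would presumably be `aggregate(data_dummy, by = list(Cluster = model_kmeans$cluster), FUN = mean)`, which reports means on the original scale rather than the standardized one.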
CONCLUSION:
Based on the Silhouette method, the optimal number of clusters is 3, with an average silhouette width of 0.9307.
Clustering with K-Means yields three clusters of 10 observations each.
Based on the mean of each variable within each cluster, the first cluster has the lowest values, the second cluster has intermediate values, and the third cluster has the highest values.