Sumber Data : https://www.kaggle.com/datasets/abcsds/pokemon
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.4 âś” readr 2.1.5
## âś” forcats 1.0.0 âś” stringr 1.5.1
## âś” ggplot2 4.0.0 âś” tibble 3.2.1
## âś” lubridate 1.9.4 âś” tidyr 1.3.1
## âś” purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(clusterCrit)
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(factoextra)
library(dplyr)
library(tidyr)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(RColorBrewer)
datapokemon <- read.csv("C:/Users/user/Documents/Semester 5/Teknik Peubah Ganda/Pokemon.csv")
datapokemon <- datapokemon[1:200, 6:10]
datapokemon
## HP Attack Defense Sp..Atk Sp..Def
## 1 45 49 49 65 65
## 2 60 62 63 80 80
## 3 80 82 83 100 100
## 4 80 100 123 122 120
## 5 39 52 43 60 50
## 6 58 64 58 80 65
## 7 78 84 78 109 85
## 8 78 130 111 130 85
## 9 78 104 78 159 115
## 10 44 48 65 50 64
## 11 59 63 80 65 80
## 12 79 83 100 85 105
## 13 79 103 120 135 115
## 14 45 30 35 20 20
## 15 50 20 55 25 25
## 16 60 45 50 90 80
## 17 40 35 30 20 20
## 18 45 25 50 25 25
## 19 65 90 40 45 80
## 20 65 150 40 15 80
## 21 40 45 40 35 35
## 22 63 60 55 50 50
## 23 83 80 75 70 70
## 24 83 80 80 135 80
## 25 30 56 35 25 35
## 26 55 81 60 50 70
## 27 40 60 30 31 31
## 28 65 90 65 61 61
## 29 35 60 44 40 54
## 30 60 85 69 65 79
## 31 35 55 40 50 50
## 32 60 90 55 90 80
## 33 50 75 85 20 30
## 34 75 100 110 45 55
## 35 55 47 52 40 40
## 36 70 62 67 55 55
## 37 90 92 87 75 85
## 38 46 57 40 40 40
## 39 61 72 57 55 55
## 40 81 102 77 85 75
## 41 70 45 48 60 65
## 42 95 70 73 95 90
## 43 38 41 40 50 65
## 44 73 76 75 81 100
## 45 115 45 20 45 25
## 46 140 70 45 85 50
## 47 40 45 35 30 40
## 48 75 80 70 65 75
## 49 45 50 55 75 65
## 50 60 65 70 85 75
## 51 75 80 85 110 90
## 52 35 70 55 45 55
## 53 60 95 80 60 80
## 54 60 55 50 40 55
## 55 70 65 60 90 75
## 56 10 55 25 35 45
## 57 35 80 50 50 70
## 58 40 45 35 40 40
## 59 65 70 60 65 65
## 60 50 52 48 65 50
## 61 80 82 78 95 80
## 62 40 80 35 35 45
## 63 65 105 60 60 70
## 64 55 70 45 70 50
## 65 90 110 80 100 80
## 66 40 50 40 40 40
## 67 65 65 65 50 50
## 68 90 95 95 70 90
## 69 25 20 15 105 55
## 70 40 35 30 120 70
## 71 55 50 45 135 95
## 72 55 50 65 175 95
## 73 70 80 50 35 35
## 74 80 100 70 50 60
## 75 90 130 80 65 85
## 76 50 75 35 70 30
## 77 65 90 50 85 45
## 78 80 105 65 100 70
## 79 40 40 35 50 100
## 80 80 70 65 80 120
## 81 40 80 100 30 30
## 82 55 95 115 45 45
## 83 80 120 130 55 65
## 84 50 85 55 65 65
## 85 65 100 70 80 80
## 86 90 65 65 40 40
## 87 95 75 110 100 80
## 88 95 75 180 130 80
## 89 25 35 70 95 55
## 90 50 60 95 120 70
## 91 52 65 55 58 62
## 92 35 85 45 35 35
## 93 60 110 70 60 60
## 94 65 45 55 45 70
## 95 90 70 80 70 95
## 96 80 80 50 40 50
## 97 105 105 75 65 100
## 98 30 65 100 45 25
## 99 50 95 180 85 45
## 100 30 35 30 100 35
## 101 45 50 45 115 55
## 102 60 65 60 130 75
## 103 60 65 80 170 95
## 104 35 45 160 30 45
## 105 60 48 45 43 90
## 106 85 73 70 73 115
## 107 30 105 90 25 25
## 108 55 130 115 50 50
## 109 40 30 50 55 55
## 110 60 50 70 80 80
## 111 60 40 80 60 45
## 112 95 95 85 125 65
## 113 50 50 95 40 50
## 114 60 80 110 50 80
## 115 50 120 53 35 110
## 116 50 105 79 35 110
## 117 90 55 75 60 75
## 118 40 65 95 60 45
## 119 65 90 120 85 70
## 120 80 85 95 30 30
## 121 105 130 120 45 45
## 122 250 5 5 35 105
## 123 65 55 115 100 40
## 124 105 95 80 40 80
## 125 105 125 100 60 100
## 126 30 40 70 70 25
## 127 55 65 95 95 45
## 128 45 67 60 35 50
## 129 80 92 65 65 80
## 130 30 45 55 70 55
## 131 60 75 85 100 85
## 132 40 45 65 100 120
## 133 70 110 80 55 80
## 134 65 50 35 115 95
## 135 65 83 57 95 85
## 136 65 95 57 100 85
## 137 65 125 100 55 70
## 138 65 155 120 65 90
## 139 75 100 95 40 70
## 140 20 10 55 15 20
## 141 95 125 79 60 100
## 142 95 155 109 70 130
## 143 130 85 80 85 95
## 144 48 48 48 48 48
## 145 55 55 50 45 65
## 146 130 65 60 110 95
## 147 65 65 60 110 95
## 148 65 130 60 95 110
## 149 65 60 70 85 75
## 150 35 40 100 90 55
## 151 70 60 125 115 70
## 152 30 80 90 55 45
## 153 60 115 105 65 70
## 154 80 105 65 60 75
## 155 80 135 85 70 95
## 156 160 110 65 65 110
## 157 90 85 100 95 125
## 158 90 90 85 125 90
## 159 90 100 90 125 85
## 160 41 64 45 50 50
## 161 61 84 65 70 70
## 162 91 134 95 100 100
## 163 106 110 90 154 90
## 164 106 190 100 154 100
## 165 106 150 70 194 120
## 166 100 100 100 100 100
## 167 45 49 65 49 65
## 168 60 62 80 63 80
## 169 80 82 100 83 100
## 170 39 52 43 60 50
## 171 58 64 58 80 65
## 172 78 84 78 109 85
## 173 50 65 64 44 48
## 174 65 80 80 59 63
## 175 85 105 100 79 83
## 176 35 46 34 35 45
## 177 85 76 64 45 55
## 178 60 30 30 36 56
## 179 100 50 50 76 96
## 180 40 20 30 40 80
## 181 55 35 50 55 110
## 182 40 60 40 40 40
## 183 70 90 70 60 60
## 184 85 90 80 70 80
## 185 75 38 38 56 56
## 186 125 58 58 76 76
## 187 20 40 15 35 35
## 188 50 25 28 45 55
## 189 90 30 15 40 20
## 190 35 20 65 40 65
## 191 55 40 85 80 105
## 192 40 50 45 70 45
## 193 65 75 70 95 70
## 194 55 40 40 65 45
## 195 70 55 55 80 60
## 196 90 75 85 115 90
## 197 90 95 105 165 110
## 198 75 80 95 90 100
## 199 70 20 50 20 50
## 200 100 50 80 60 80
summary(datapokemon)
## HP Attack Defense Sp..Atk
## Min. : 10.00 Min. : 5.00 Min. : 5.00 Min. : 15.00
## 1st Qu.: 47.50 1st Qu.: 50.00 1st Qu.: 50.00 1st Qu.: 45.00
## Median : 64.00 Median : 70.00 Median : 65.00 Median : 65.00
## Mean : 65.81 Mean : 73.27 Mean : 69.04 Mean : 70.69
## 3rd Qu.: 80.00 3rd Qu.: 90.50 3rd Qu.: 85.00 3rd Qu.: 90.00
## Max. :250.00 Max. :190.00 Max. :180.00 Max. :194.00
## Sp..Def
## Min. : 20.00
## 1st Qu.: 50.00
## Median : 70.00
## Mean : 68.96
## 3rd Qu.: 85.00
## Max. :130.00
str(datapokemon)
## 'data.frame': 200 obs. of 5 variables:
## $ HP : int 45 60 80 80 39 58 78 78 78 44 ...
## $ Attack : int 49 62 82 100 52 64 84 130 104 48 ...
## $ Defense: int 49 63 83 123 43 58 78 111 78 65 ...
## $ Sp..Atk: int 65 80 100 122 60 80 109 130 159 50 ...
## $ Sp..Def: int 65 80 100 120 50 65 85 85 115 64 ...
datapokemon_scaled <- scale(datapokemon)
head(datapokemon_scaled)
## HP Attack Defense Sp..Atk Sp..Def
## 1 -0.7624101 -0.7953306 -0.7108024 -0.1696888 -0.1584238
## 2 -0.2127273 -0.3692314 -0.2142339 0.2780389 0.4416663
## 3 0.5201832 0.2863059 0.4951498 0.8750092 1.2417863
## 4 0.5201832 0.8762895 1.9139171 1.5316765 2.0419064
## 5 -0.9822832 -0.6970000 -0.9236175 -0.3189314 -0.7585138
## 6 -0.2860183 -0.3036777 -0.3915798 0.2780389 -0.1584238
distance_matrix <- dist(datapokemon_scaled, method = "euclidean")
head(distance_matrix,10)
## [1] 1.1361603 2.7057875 4.3663038 0.6969109 0.8780860 2.4902868 4.2132441
## [8] 4.2004477 0.7256321 1.4292939
c_ward <- hclust(distance_matrix, method = "ward.D2")
c_ward
##
## Call:
## hclust(d = distance_matrix, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 200
fviz_dend(c_ward, k = 3,
rect = TRUE,
rect_fill = TRUE,
main = "Dendrogram Metode Ward's")
clusters <- cutree(c_ward, k = 3)
table(clusters)
## clusters
## 1 2 3
## 79 70 51
names(c_ward)
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
Pada fungsi hclust, beberapa komponen penting yang dihasilkan antara lain:
merge : matriks yang menunjukkan bagaimana klaster
digabungkan pada setiap langkah.height : vektor yang menunjukkan jarak pada setiap
langkah penggabungan klaster.order : urutan objek dalam dendrogram.labels : label asli dari data yang digunakan.method & dist.method : metode penggabungan yang
digunakan (dalam hal ini “ward.D2”).call : panggilan fungsi yang digunakan untuk
menghasilkan objek hclust.head(c_ward$merge,10)
## [,1] [,2]
## [1,] -5 -170
## [2,] -6 -171
## [3,] -7 -172
## [4,] -10 -167
## [5,] -11 -168
## [6,] -12 -169
## [7,] -38 -182
## [8,] -58 -66
## [9,] -50 -149
## [10,] -28 -183
head(c_ward$height,20)
## [1] 0.00000000 0.00000000 0.00000000 0.07006162 0.07733662 0.21445969
## [7] 0.24085908 0.24147390 0.24582601 0.25983715 0.30312044 0.30312044
## [13] 0.30616487 0.33250393 0.36560237 0.36804615 0.36954866 0.37848174
## [19] 0.39047444 0.39183132
cluster_hclust <- cutree(c_ward, k = 3)
# menambahkan cluster ke data asli untuk intepretasi
data_with_clusters <- datapokemon %>%
mutate(Cluster = as.factor(cluster_hclust))
#melihat profil rata-rata setiap cluster
cluster_profiles <- data_with_clusters %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_profiles)
## # A tibble: 3 Ă— 6
## Cluster HP Attack Defense Sp..Atk Sp..Def
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 48.2 52.4 51.5 49.8 46.4
## 2 2 82.0 77.9 73.6 99.5 90.7
## 3 3 70.9 99.3 90.0 63.4 74.1
Terdapat beberapa metrik evaluasi klaster yang dapat digunakan untuk menilai kualitas hasil clustering:
silhouette_values <- silhouette(cluster_hclust, distance_matrix)
avg_silhouette <- mean(silhouette_values[, 3])
cat("Average Silhouette Score:", avg_silhouette, "\n")
## Average Silhouette Score: 0.2270905
dunn_index <- intCriteria(as.matrix(datapokemon_scaled),
as.integer(cluster_hclust),
c("Dunn"))$dunn
cat("Dunn Index:", dunn_index, "\n")
## Dunn Index: 0.05913987
ch_index <- intCriteria(as.matrix(datapokemon_scaled),
as.integer(cluster_hclust),
c("Calinski_Harabasz"))$calinski_harabasz
cat("Calinski-Harabasz Index:", ch_index, "\n")
## Calinski-Harabasz Index: 66.28871
datapokemon_clustered <- datapokemon %>%
mutate(Cluster = as.factor(cluster_hclust))
# Melihat profil rata-rata setiap klaster
cluster_profiles <- datapokemon_clustered %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_profiles)
## # A tibble: 3 Ă— 6
## Cluster HP Attack Defense Sp..Atk Sp..Def
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 48.2 52.4 51.5 49.8 46.4
## 2 2 82.0 77.9 73.6 99.5 90.7
## 3 3 70.9 99.3 90.0 63.4 74.1
pca_result <- prcomp(datapokemon_scaled)
pca_data <- data.frame(pca_result$x, Cluster = as.factor(cluster_hclust))
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "Visualisasi Cluster menggunakan PCA",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
Berdasarkan visualisasi PCA, kita dapat mengamati bahwa klaster-klaster yang terbentuk memiliki pemisahan yang belum cukup baik, menunjukkan bahwa metode Ward’s belum cukup efektif dalam mengelompokkan data berdasarkan karakteristiknya.
# Pastikan hanya variabel numerik (selain kolom 'Cluster') yang digunakan
datapokemon_num <- datapokemon_clustered[, sapply(datapokemon_clustered, is.numeric)]
datapokemon_num$Cluster <- datapokemon_clustered$Cluster
# Ubah data menjadi format long
datapokemon_melted <- melt(datapokemon_num, id.vars = "Cluster")
# Buat heatmap
ggplot(datapokemon_melted, aes(x = variable, y = as.factor(Cluster), fill = value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap Karakteristik Cluster",
x = "Fitur",
y = "Cluster") +
theme_minimal()
Dapat dilihat bahwa value yang terendah dimulai dari Cluster 1, kemudian 2, lalu 3.
Beberapa metode yang umum digunakan untuk menentukan k antara lain:
fviz_nbclust(datapokemon_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal k")
fviz_nbclust(datapokemon_scaled, kmeans, method = "silhouette") +
labs(title = "Silhouette Method for Optimal k")
Didapat bahwa jumlah klaster yang optimum adalah k=2
set.seed(123) # Untuk reproduktifitas
k <- 2 # Jumlah klaster optimal
kmeans_result <- kmeans(datapokemon_scaled, centers = k, nstart = 25)
table(kmeans_result$cluster)
##
## 1 2
## 107 93
names(kmeans_result)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Beberapa komponen penting yang dihasilkan oleh objek kmeans antara lain:
kmeans_result$centers
## HP Attack Defense Sp..Atk Sp..Def
## 1 0.5318275 0.5485208 0.4898460 0.4917208 0.6517913
## 2 -0.6118876 -0.6310939 -0.5635863 -0.5657433 -0.7499104
kmeans_result$tot.withinss
## [1] 652.074
kmeans_result$size
## [1] 107 93
kmeans_result$iter
## [1] 1
kmeans_result$withinss
## [1] 438.2028 213.8712
kmeans_result$betweenss
## [1] 342.926
Metrik evaluasi klaster yang digunakan sama seperti sebelumnya yaitu:
distance_matrix <- dist(datapokemon_scaled)
silhouette_values_kmeans <- silhouette(kmeans_result$cluster, distance_matrix)
avg_silhouette_kmeans <- mean(silhouette_values_kmeans[, 3])
cat("Average Silhouette Score (K-Means):", avg_silhouette_kmeans, "\n")
## Average Silhouette Score (K-Means): 0.3078682
dunn_index_kmeans <- intCriteria(
as.matrix(datapokemon_scaled),
as.integer(kmeans_result$cluster),
c("Dunn")
)$dunn
cat("Dunn Index (K-Means):", dunn_index_kmeans, "\n")
## Dunn Index (K-Means): 0.02489531
ch_index_kmeans <- intCriteria(
as.matrix(datapokemon_scaled),
as.integer(kmeans_result$cluster),
c("Calinski_Harabasz")
)$calinski_harabasz
cat("Calinski-Harabasz Index (K-Means):", ch_index_kmeans, "\n")
## Calinski-Harabasz Index (K-Means): 104.1283
data_kmeans_clustered <- datapokemon %>%
mutate(Cluster = as.factor(kmeans_result$cluster))
# Melihat profil rata-rata setiap klaster
cluster_profiles_kmeans <- data_kmeans_clustered %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_profiles_kmeans)
## # A tibble: 2 Ă— 6
## Cluster HP Attack Defense Sp..Atk Sp..Def
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 80.3 90 82.9 87.2 85.3
## 2 2 49.1 54.0 53.2 51.7 50.2
fviz_cluster(kmeans_result, data = datapokemon_scaled,
geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "Visualisasi Cluster K-Means")
Visualisasi cluster K-Means menunjukkan bagaimana data terdistribusi di antara klaster-klaster yang telah terbentuk. Dengan menggunakan metode PCA untuk mereduksi dimensi data, kita dapat melihat pemisahan yang jelas antara klaster-klaster tersebut. Hal ini mengindikasikan bahwa K-Means berhasil mengelompokkan data berdasarkan karakteristiknya dengan baik.
# Pastikan hanya variabel numerik (selain kolom 'Cluster') yang digunakan
data_num_kmeans <- data_kmeans_clustered[, sapply(data_kmeans_clustered, is.numeric)]
data_num_kmeans$Cluster <- data_kmeans_clustered$Cluster
# Ubah data menjadi format long
data_melted_kmeans <- melt(data_num_kmeans, id.vars = "Cluster")
# Buat heatmap
ggplot(data_melted_kmeans, aes(x = variable, y = as.factor(Cluster), fill = value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Heatmap Karakteristik Klaster K-Means",
x = "Fitur",
y = "Klaster") +
theme_minimal()
Berdasarkan heatmap untuk K-Means, kita dapat melihat pola karakteristik yang berbeda antara klaster-klaster. Klaster 1 memiliki atribut-atribut yang nilainya lebih tinggi dari klaster 2.
comparison_table <- data.frame(
Method = c("Hierarchical (Ward's)", "K-Means"),
Silhouette_Score = c(avg_silhouette, avg_silhouette_kmeans),
Dunn_Index = c(dunn_index, dunn_index_kmeans),
Calinski_Harabasz_Index = c(ch_index, ch_index_kmeans)
)
print(comparison_table)
## Method Silhouette_Score Dunn_Index Calinski_Harabasz_Index
## 1 Hierarchical (Ward's) 0.2270905 0.05913987 66.28871
## 2 K-Means 0.3078682 0.02489531 104.12829
Berdasarkan tabel perbandingan di atas, kita dapat melihat bagaimana kedua metode clustering berkinerja berdasarkan metrik evaluasi yang telah dihitung. Pada metrik evaluasi Silhouette Score dan Calinski Harabasz, clustering dengan metode K-Means hasilnya lebih baik karena nilainya lebih tinggi Namun pada metrik evaluasi Dunn Index, clustering dengan metode Ward lebih baik karena nilainya lebih tinggi.
Dari ketiga metrik evaluasi tersebut, kita dapat mengambil kesimpulan bahwa K-Means cenderung lebih baik dalam clustering.