ℹ️ About This Document
This document contains the complete clustering analysis workflow for the provinces of Indonesia, using two methods: K-Means and DBSCAN. Click the "Code" button on each chunk to view the R code used.
# ── Data Wrangling ─────────────────────────────
library(tidyverse)
library(dplyr)
library(readr)
# ── Visualization ──────────────────────────────
library(factoextra) # Visualization of clustering results
library(gridExtra) # Multi-panel plots
library(GGally) # Correlation matrix
library(ggiraphExtra) # Interactive radar chart
library(sf) # Reading spatial data
library(plotly) # Interactive plotting
library(Rtsne) # Dimensionality reduction (t-SNE)
library(tmap) # Thematic maps
library(RColorBrewer) # Color palettes
# ── Machine Learning - Clustering ──────────────
library(cluster)
library(mvnTest)
library(clusterSim)
library(dbscan)
<div class="metric-icon">📦</div>
<div class="metric-value">3</div>
<div class="metric-label">Wrangling Libraries</div>
<div class="metric-icon">📈</div>
<div class="metric-value">8</div>
<div class="metric-label">Visualization Libraries</div>
<div class="metric-icon">🤖</div>
<div class="metric-value">5</div>
<div class="metric-label">ML / Clustering Libraries</div>
<div class="metric-icon">🗺️</div>
<div class="metric-value">38</div>
<div class="metric-label">Provinces Analyzed</div>
data_cluster <- read.csv2(
"C:/Users/Muhammad Rizqa Salas/Downloads/Skripsi Lesgoo/Data/Data fix.csv",
sep = ";"
)
glimpse(data_cluster)
Rows: 38
Columns: 16
$ Provinsi <chr> "ACEH", "SUMATERA UTARA", "SUMATERA BARAT", "RIAU", "JAMBI", …
$ TPAK <dbl> 65.32, 72.29, 71.34, 66.55, 68.37, 69.88, 71.78, 70.44, 69.38…
$ GR <dbl> 0.274, 0.283, 0.280, 0.304, 0.291, 0.298, 0.339, 0.287, 0.214…
$ PpK <int> 11191, 11898, 12041, 12233, 12018, 12416, 12197, 11683, 13837…
$ TPT <dbl> 5.64, 5.32, 5.62, 4.16, 4.26, 3.69, 3.41, 4.21, 4.45, 6.45, 6…
$ PDRB <dbl> 257502.4, 1236193.6, 352189.4, 1201383.8, 349662.6, 720205.2,…
$ PM <dbl> 12.33, 7.36, 5.35, 6.16, 7.19, 10.15, 12.08, 10.00, 5.00, 4.4…
$ AHH <dbl> 70.485, 70.330, 70.380, 72.570, 72.090, 70.975, 70.180, 71.55…
$ RLS <dbl> 9.95, 10.08, 9.77, 9.55, 8.95, 8.91, 9.23, 8.61, 8.65, 10.72,…
$ PPI <dbl> 88.95, 92.30, 91.89, 94.16, 90.59, 90.94, 90.62, 91.57, 92.14…
$ DMS <int> 5840, 5360, 1204, 1710, 1431, 2954, 1382, 2466, 392, 377, 267…
$ PMJ <dbl> 98.18, 68.69, 78.19, 73.30, 61.04, 73.33, 76.20, 71.90, 84.11…
$ PKKP <dbl> 8.60, 7.20, 7.67, 10.90, 10.22, 5.94, 9.50, 10.62, 10.00, 9.0…
$ PRTS <dbl> 82.21, 87.47, 74.59, 91.21, 85.88, 85.50, 85.75, 87.29, 96.45…
$ AMHP <dbl> 99.59, 99.52, 99.43, 99.54, 98.66, 99.50, 99.08, 98.89, 98.89…
$ APPT <dbl> 123.62, 125.61, 136.87, 118.59, 127.62, 153.85, 155.76, 165.5…
📋 Data Structure: The dataset contains socio-economic indicators for each province in Indonesia, with 15 main variables covering employment, the economy, education, and health.
Provinsi TPAK GR PpK
Length:38 Min. :63.98 Min. :0.2140 Min. : 5861
Class :character 1st Qu.:67.44 1st Qu.:0.2848 1st Qu.:10598
Mode :character Median :69.56 Median :0.3275 Median :11978
Mean :70.52 Mean :0.3284 Mean :11962
3rd Qu.:72.37 3rd Qu.:0.3605 3rd Qu.:12766
Max. :90.66 Max. :0.4260 Max. :20676
TPT PDRB PM AHH
Min. :1.490 Min. : 28378 Min. : 3.720 Min. :64.75
1st Qu.:3.500 1st Qu.: 121014 1st Qu.: 5.675 1st Qu.:68.87
Median :4.210 Median : 249279 Median : 9.490 Median :70.73
Mean :4.468 Mean : 622001 Mean :10.611 Mean :70.58
3rd Qu.:5.545 3rd Qu.: 671116 3rd Qu.:12.268 3rd Qu.:72.32
Max. :6.960 Max. :3926153 Max. :30.030 Max. :75.53
RLS PPI DMS PMJ
Min. : 4.300 Min. :12.15 Min. : 245.0 Min. :61.04
1st Qu.: 8.445 1st Qu.:87.44 1st Qu.: 660.8 1st Qu.:73.05
Median : 9.050 Median :90.38 Median :1183.0 Median :77.82
Mean : 9.029 Mean :86.27 Mean :1947.7 Mean :79.19
3rd Qu.: 9.765 3rd Qu.:92.16 3rd Qu.:2003.2 3rd Qu.:85.97
Max. :11.590 Max. :98.18 Max. :8333.0 Max. :98.18
PKKP PRTS AMHP APPT
Min. : 2.670 Min. :16.34 Min. :90.79 Min. : 62.0
1st Qu.: 7.147 1st Qu.:81.15 1st Qu.:98.53 1st Qu.:109.1
Median : 9.295 Median :85.99 Median :99.14 Median :121.9
Mean :12.202 Mean :82.96 Mean :98.53 Mean :124.4
3rd Qu.:14.297 3rd Qu.:90.12 3rd Qu.:99.53 3rd Qu.:128.9
Max. :32.300 Max. :98.20 Max. :99.85 Max. :260.2
🔍 Interpretation: The correlation matrix below shows the pairwise relationships between variables. Values close to 1 or -1 indicate a strong linear correlation; values close to 0 indicate little or no linear correlation.
ggcorr(data_cluster[, 2:16], label = TRUE, label_size = 3,
low = "#c53030", mid = "white", high = "#2b6cb0",
layout.exp = 1, hjust = 0.85)
Figure 1. Correlation Matrix Between Variables
⚠️ Note: A VIF above 10 indicates high multicollinearity. Variables with a high VIF need attention before proceeding to the clustering analysis.
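The VIF of variable j is usually defined as 1 / (1 - R²_j), where R²_j comes from regressing variable j on all the other variables. The same numbers appear on the diagonal of the inverse correlation matrix, which is the shortcut the report's `CekVIF()` relies on. A minimal sketch of that equivalence, assuming the built-in `mtcars` data as a stand-in for the province indicators (the column choice is arbitrary):

```r
# Assumption: mtcars (shipped with R) stands in for the province indicators.
X <- mtcars[, c("mpg", "disp", "hp", "wt")]

# VIF from the diagonal of the inverse correlation matrix
vif_inverse <- diag(solve(cor(X)))

# VIF from the regression definition: 1 / (1 - R^2_j)
vif_regression <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

# The two columns should agree up to floating-point error
print(round(cbind(vif_inverse, vif_regression), 4))
```

The shortcut needs an invertible correlation matrix, so `solve()` fails when one variable is an exact linear combination of others.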
# Compute the VIF of each variable as the diagonal of the inverse correlation matrix
CekVIF <- function(data) {
  corr <- as.matrix(cor(data))
  VIF <- diag(solve(corr))
  return(VIF)
}
vif_result <- CekVIF(data_cluster[, 2:16])
vif_df <- data.frame(Variabel = names(vif_result), VIF = round(vif_result, 4))
vif_df <- vif_df[order(-vif_df$VIF), ]  # sort from highest to lowest VIF
rownames(vif_df) <- NULL
vif_df
# Names of the 15 indicators, used to draw one box plot per variable
type <- c("TPAK", "GR", "PpK", "TPT", "PDRB", "PM",
          "AHH", "RLS", "PPI", "DMS", "PMJ", "PKKP",
          "PRTS", "AMHP", "APPT")
plots <- lapply(type, function(i) {
ggplot(data_cluster, aes(y = .data[[i]])) +
geom_boxplot(fill = "#bee3f8", color = "#2b6cb0",
outlier.colour = "#c53030", outlier.size = 2.5,
width = 0.5) +
labs(title = i, y = NULL, x = NULL) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", color = "#1a365d"),
panel.grid.minor = element_blank(),
axis.text.x = element_blank()
)
})
gridExtra::grid.arrange(grobs = plots, ncol = 5)
Figure 2. Box-Plot Distribution per Variable
data_kmeans <- data_cluster %>%
dplyr::select(TPAK, GR, PpK, TPT, PDRB, PM,
AHH, RLS, PPI, DMS, PMJ, PKKP, PRTS, AMHP, APPT)
scale_data_kmeans <- scale(data_kmeans)
cat("✅ Data successfully standardized (Z-score)\n")
✅ Data successfully standardized (Z-score)
📐 Data dimensions: 38 rows x 15 columns
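Standardization matters here because the indicators are on very different scales: PDRB runs into the hundreds of thousands while GR stays below 1, so unscaled Euclidean distances would be driven almost entirely by PDRB. A small sketch of what `scale()` does, using the first three PDRB and GR values from the `glimpse()` output above as a toy table:

```r
# Toy table: the first three PDRB and GR values shown in the glimpse() output.
toy <- data.frame(
  PDRB = c(257502.4, 1236193.6, 352189.4),
  GR   = c(0.274, 0.283, 0.280)
)

# scale() centers each column on its mean and divides by its standard deviation
z <- scale(toy)

# The same z-scores computed by hand
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

print(round(z, 3))
print(round(colMeans(z), 10))   # each column now has mean 0
print(apply(z, 2, sd))          # and standard deviation 1
```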
📖 About K-Means: K-Means partitions the data into k clusters by assigning each observation to the centroid nearest in Euclidean distance. The algorithm works best for roughly spherical clusters of similar size.
set.seed(1000)
fviz_nbclust(
x = scale_data_kmeans,
FUNcluster = kmeans,
method = "wss",
k.max = 10,
linecolor = "#3182ce"
) +
labs(title = "Elbow Method: Determining the Optimal Number of Clusters",
subtitle = "The 'elbow' point marks the ideal value of k") +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", color = "#1a365d"),
plot.subtitle = element_text(color = "#718096")
)
Figure 3. Elbow Method for Determining the Optimal K
✅ Decision: Based on the Elbow method, the number of clusters chosen is k = 3.
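The quantity behind the elbow plot is the total within-cluster sum of squares (WSS) for each candidate k. A minimal sketch of computing it directly with base R, assuming the built-in `USArrests` data as a stand-in for the scaled province matrix; `nstart = 25` is an addition here to stabilize each fit and is not taken from the report:

```r
# Assumption: USArrests (shipped with R) stands in for scale_data_kmeans.
X <- scale(USArrests)

set.seed(1000)
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})

print(round(wss, 2))

# The "elbow" is where adding another cluster stops reducing WSS by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow off the curve is a visual judgment, so it helps to cross-check the chosen k against the silhouette and other indices reported later.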
RNGkind(sample.kind = "Rounding")
set.seed(1000)
prov_kmeans <- kmeans(x = scale_data_kmeans, centers = 3)
fviz_cluster(
object = prov_kmeans,
data = scale_data_kmeans,
palette = c("#3182ce", "#e53e3e", "#38a169"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal(base_size = 12)
) +
labs(title = "K-Means Clustering Results",
subtitle = paste("Number of clusters: 3 | Total observations:", nrow(scale_data_kmeans))) +
theme(
plot.title = element_text(face = "bold", color = "#1a365d"),
plot.subtitle = element_text(color = "#718096")
)
Figure 4. Visualization of the K-Means Clustering Results (k = 3)
data_cluster$Cluster_KMeans <- as.factor(prov_kmeans$cluster)
kmeans_clustering <- data_cluster
kmeans_clustering$Cluster <- prov_kmeans$cluster
# Show the members of each cluster
for (k in 1:3) {
cat(paste0("\n🔵 **Cluster ", k, ":**\n"))
anggota <- kmeans_clustering %>% filter(Cluster == k) %>% pull(Provinsi)
cat(paste(anggota, collapse = " | "), "\n")
}
🔵 **Cluster 1:**
DKI JAKARTA | JAWA BARAT | JAWA TENGAH | JAWA TIMUR
🔵 **Cluster 2:**
PAPUA TENGAH | PAPUA PEGUNUNGAN
🔵 **Cluster 3:**
ACEH | SUMATERA UTARA | SUMATERA BARAT | RIAU | JAMBI | SUMATERA SELATAN | BENGKULU | LAMPUNG | KEPULAUAN BANGKA BELITUNG | KEPULAUAN RIAU | DAERAH ISTIMEWA YOGYAKARTA | BANTEN | BALI | NUSA TENGGARA BARAT | NUSA TENGGARA TIMUR | KALIMANTAN BARAT | KALIMANTAN TENGAH | KALIMANTAN SELATAN | KALIMANTAN TIMUR | KALIMANTAN UTARA | SULAWESI UTARA | SULAWESI TENGAH | SULAWESI SELATAN | SULAWESI TENGGARA | GORONTALO | SULAWESI BARAT | MALUKU | MALUKU UTARA | PAPUA BARAT | PAPUA BARAT DAYA | PAPUA | PAPUA SELATAN
📖 About DBSCAN: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data by density. Unlike K-Means, DBSCAN can detect clusters of arbitrary shape and automatically flags outliers as noise.
min_pts <- 2
dbscan::kNNdistplot(scale_data_kmeans, k = min_pts)
title(main = "K-NN Distance Plot", col.main = "#1a365d", font.main = 2)
abline(h = 3.5, col = "#e53e3e", lty = 2, lwd = 2)
text(x = 5, y = 3.7, labels = "ε = 3.5", col = "#e53e3e", font = 2)
Figure 6. K-NN Distance Plot for Determining the Epsilon Value
⚙️ DBSCAN Parameters: ε (epsilon) = 3.5 | MinPts = 2
DBSCAN clustering for 38 objects.
Parameters: eps = 3.5, minPts = 2
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 5 noise points.
0 1 2
5 30 3
Available fields: cluster, eps, minPts, metric, borderPoints
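The call that produced the summary above is not shown in this section. Judging from the printed parameters and the object name used later, it was presumably something like `res_dbscan <- dbscan::dbscan(scale_data_kmeans, eps = 3.5, minPts = min_pts)`, but that is an assumption. The sketch below runs the same kind of call on synthetic data (the points, `eps`, and object names are all hypothetical) to show how noise ends up with label 0:

```r
library(dbscan)

set.seed(1000)
# Assumption: synthetic 2-D points stand in for the scaled province matrix.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),  # dense group near (0, 0)
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),  # dense group near (5, 5)
  c(20, 20)                                         # one isolated point
)

# minPts counts the point itself, so with minPts = 2 a point becomes a core
# point as soon as it has one neighbour within eps.
res_toy <- dbscan(toy, eps = 1.5, minPts = 2)

print(res_toy)
print(table(res_toy$cluster))   # label 0 is noise, 1..K are clusters
```

With `minPts = 2`, only observations with no neighbour inside the ε radius are labelled noise, so the outlier list is driven almost entirely by the choice of ε.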
fviz_cluster(
res_dbscan,
data = scale_data_kmeans,
stand = FALSE,
ellipse = TRUE,
geom = "point",
palette = "Set1",
ggtheme = theme_minimal(base_size = 12),
main = "DBSCAN Clustering Results"
) +
theme(
plot.title = element_text(face = "bold", color = "#1a365d")
)
Figure 7. Visualization of the DBSCAN Clustering Results
📊 DBSCAN Cluster Distribution:
0 1 2
5 30 3
outliers <- data_cluster %>%
filter(Cluster_DBSCAN == 0) %>%
dplyr::select(Provinsi)
cat("\n🔴 Provinces flagged as Noise/Outliers by DBSCAN:\n")
🔴 Provinces flagged as Noise/Outliers by DBSCAN:
Provinsi
1 DKI JAKARTA
2 NUSA TENGGARA BARAT
3 KALIMANTAN UTARA
4 PAPUA TENGAH
5 PAPUA PEGUNUNGAN
data_dbscan_radar <- data_kmeans
data_dbscan_radar$Cluster <- as.factor(res_dbscan$cluster)
levels(data_dbscan_radar$Cluster)[
levels(data_dbscan_radar$Cluster) == "0"
] <- "Noise"
ggRadar(
data = data_dbscan_radar,
mapping = aes(colours = Cluster),
rescale = TRUE,
interactive = TRUE
)
Figure 8. Radar Chart of DBSCAN Cluster Characteristics
# Silhouette Index
sil_kmeans <- silhouette(prov_kmeans$cluster, dist(scale_data_kmeans))
fviz_silhouette(sil_kmeans) +
labs(title = "Silhouette Plot — K-Means") +
scale_fill_manual(values = c("#3182ce", "#e53e3e", "#38a169")) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", color = "#1a365d"))
  cluster size ave.sil.width
1       1    4          0.20
2       2    2          0.19
3       3   32          0.32
avg_sil_kmeans <- mean(sil_kmeans[, 3])
dbi_kmeans <- index.DB(scale_data_kmeans, prov_kmeans$cluster)$DB
chi_kmeans <- index.G1(scale_data_kmeans, prov_kmeans$cluster)
cat("─────────────────────────────────\n")
─────────────────────────────────
📊 K-Means Evaluation Metrics
─────────────────────────────────
Avg Silhouette Index : 0.3023
Davies-Bouldin Index : 1.1976 (↓ lower is better)
Calinski-Harabasz : 11.1732 (↑ higher is better)
─────────────────────────────────
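For reference, the silhouette width of observation i is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster. The sketch below computes this by hand and compares it with `cluster::silhouette()`, assuming the built-in `USArrests` data as a stand-in for the province matrix (`nstart = 25` is also an addition):

```r
library(cluster)

# Assumption: USArrests (shipped with R) stands in for scale_data_kmeans.
X <- scale(USArrests)
set.seed(1000)
km <- kmeans(X, centers = 3, nstart = 25)

D  <- as.matrix(dist(X))   # pairwise Euclidean distances
cl <- km$cluster

sil_manual <- sapply(seq_len(nrow(X)), function(i) {
  own <- cl == cl[i]
  if (sum(own) == 1) return(0)              # singleton clusters get width 0
  # a: mean distance to the other members of the own cluster (D[i, i] is 0)
  a <- sum(D[i, own]) / (sum(own) - 1)
  # b: smallest mean distance to the members of any other cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]), function(k) mean(D[i, cl == k])))
  (b - a) / max(a, b)
})

sil_pkg <- silhouette(cl, dist(X))[, "sil_width"]

print(round(mean(sil_manual), 4))   # average silhouette width
```

Values near 1 mean an observation sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may belong to a neighbouring cluster.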
is_not_noise <- res_dbscan$cluster != 0
data_dbscan_eval <- scale_data_kmeans[is_not_noise, ]
cluster_dbscan_eval <- res_dbscan$cluster[is_not_noise]
if (length(unique(cluster_dbscan_eval)) > 1) {
sil_dbscan <- silhouette(cluster_dbscan_eval, dist(data_dbscan_eval))
avg_sil_dbscan <- mean(sil_dbscan[, 3])
dbi_dbscan <- index.DB(data_dbscan_eval, cluster_dbscan_eval)$DB
chi_dbscan <- index.G1(data_dbscan_eval, cluster_dbscan_eval)
cat("──────────────────────────────────────────\n")
cat(" 📊 DBSCAN Evaluation Metrics (Excl. Noise)\n")
cat("──────────────────────────────────────────\n")
cat(sprintf(" Avg Silhouette Index : %.4f\n", avg_sil_dbscan))
cat(sprintf(" Davies-Bouldin Index : %.4f (↓ lower is better)\n", dbi_dbscan))
cat(sprintf(" Calinski-Harabasz : %.4f (↑ higher is better)\n", chi_dbscan))
cat("──────────────────────────────────────────\n")
} else {
cat("⚠️ DBSCAN did not produce enough clusters to evaluate.\n")
}
──────────────────────────────────────────
📊 DBSCAN Evaluation Metrics (Excl. Noise)
──────────────────────────────────────────
Avg Silhouette Index : 0.3116
Davies-Bouldin Index : 1.0204 (↓ lower is better)
Calinski-Harabasz : 7.0028 (↑ higher is better)
──────────────────────────────────────────
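The Calinski-Harabasz index reported by `index.G1()` is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = (BSS / (k - 1)) / (WSS / (n - k)). A minimal sketch of that formula using the sums of squares stored in a `kmeans` object, again assuming `USArrests` as stand-in data:

```r
# Assumption: USArrests (shipped with R) stands in for scale_data_kmeans.
X <- scale(USArrests)
set.seed(1000)
km <- kmeans(X, centers = 3, nstart = 25)

n <- nrow(X)            # number of observations
k <- length(km$size)    # number of clusters

# Between-cluster and within-cluster dispersion, each scaled by its
# degrees of freedom
ch <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))

print(round(ch, 4))
```

Because the formula depends on both n and k, the DBSCAN value above (computed on the 33 non-noise provinces with 2 clusters) and the K-Means value (38 provinces, 3 clusters) are not on exactly the same footing, so it is safer to treat the comparison as indicative rather than decisive.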
| Metric | K-Means | DBSCAN |
|---|---|---|
| Number of Clusters | 3 | Automatic (2 clusters + 5 noise points) |
| Outlier Detection | ❌ No | ✅ Yes |
| Cluster Shape | Spherical | Arbitrary |
| Parameter Sensitivity | k (number of clusters) | ε and MinPts |
| Silhouette Index | 0.3023 | 0.3116 (excl. noise) |
| Davies-Bouldin | 1.1976 | 1.0204 (excl. noise) |
| Calinski-Harabasz | 11.1732 | 7.0028 (excl. noise) |
data_cluster$Cluster_KMeans <- as.factor(prov_kmeans$cluster)
data_cluster$Cluster_DBSCAN <- as.factor(res_dbscan$cluster)
indo_shp <- st_read(
"C:/Users/Muhammad Rizqa Salas/Downloads/Skripsi Lesgoo/[LapakGIS.com] Batas Wilayah Provinsi 2024/LapakGIS_Batas_Provinsi_2024.shp"
) %>%
st_make_valid() %>%
mutate(WADMPR = str_to_upper(WADMPR))
peta_merged <- indo_shp %>%
left_join(
data_cluster %>% mutate(Provinsi = str_to_upper(Provinsi)),
by = c("WADMPR" = "Provinsi")
) %>%
filter(!st_is_empty(geometry))
tmap_mode("plot")
tm_shape(peta_merged) +
tm_polygons(
"Cluster_KMeans",
palette = "Set2",
title = "Cluster K-Means",
border.col = "white",
colorNA = "grey90"
) +
tm_text("WADMPR", size = 0.35, col = "#2d3748") +
tm_layout(
main.title = "Province Cluster Map: K-Means Method",
main.title.size = 1.1,
main.title.color = "#1a365d",
main.title.fontface = "bold",
legend.outside = TRUE,
frame = FALSE,
bg.color = "#f7fafc"
)
Figure 9. Map of K-Means Clusters by Province
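Any province whose name is spelled differently in the shapefile and in the data table comes out of the `left_join` with `NA` cluster labels, which the K-Means map draws in grey (`colorNA = "grey90"`). A quick way to catch this before plotting is an `anti_join` on the same key. The sketch below uses two small hypothetical tables; the spelling variants are invented for illustration and are not taken from the actual files:

```r
library(dplyr)

# Hypothetical stand-ins for the shapefile attribute table and the data table.
shp_names <- tibble(
  WADMPR = c("ACEH", "DAERAH ISTIMEWA YOGYAKARTA", "DKI JAKARTA")
)
data_names <- tibble(
  Provinsi = c("Aceh", "DI Yogyakarta", "DKI Jakarta"),
  Cluster  = c(3, 3, 1)
)

# Rows of the shapefile that would find no match in the data after upper-casing
unmatched <- shp_names %>%
  anti_join(
    data_names %>% mutate(Provinsi = toupper(Provinsi)),
    by = c("WADMPR" = "Provinsi")
  )

print(unmatched)
```

An empty result means every polygon will receive a cluster label; any rows listed need their names harmonized before the join.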
peta_merged <- peta_merged %>%
mutate(
Cluster_DB_Label = as.character(Cluster_DBSCAN),
Cluster_DB_Label = ifelse(
Cluster_DB_Label == "0",
"Noise/Outlier",
paste("Cluster", Cluster_DB_Label)
),
Cluster_DB_Label = as.factor(Cluster_DB_Label)
)
tm_shape(peta_merged) +
tm_polygons(
"Cluster_DB_Label",
palette = "Set1",
title = "Cluster DBSCAN",
border.col = "white"
) +
tm_text("WADMPR", size = 0.35, col = "#2d3748") +
tm_layout(
main.title = "Province Cluster Map: DBSCAN Method",
main.title.size = 1.1,
main.title.color = "#1a365d",
main.title.fontface = "bold",
legend.outside = TRUE,
frame = FALSE,
bg.color = "#f7fafc"
)
Figure 10. Map of DBSCAN Clusters by Province