📊 Clustering Analysis of Indonesian Provinces

A Comparison of K-Means and DBSCAN on Socio-Economic Indicators

<span>👤 Muhammad Rizqa Salas</span>
<span>📅 04 May 2026</span>
<span>🎓 Undergraduate Thesis</span>
<span>🔬 Data Analysis</span>

1 Libraries & Setup

ℹ️ About This Document
This document walks through the full clustering workflow for Indonesia's provinces using two methods: K-Means and DBSCAN.

1.1 Libraries Used

# ── Data Wrangling ─────────────────────────────
library(tidyverse)
library(dplyr)
library(readr)

# ── Visualization ──────────────────────────────
library(factoextra)    # Clustering result visualization
library(gridExtra)     # Multi-panel plots
library(GGally)        # Correlation matrix
library(ggiraphExtra)  # Interactive radar chart
library(sf)            # Reading spatial data
library(plotly)        # Interactive plotting
library(Rtsne)         # Dimensionality reduction (t-SNE)
library(tmap)          # Thematic maps
library(RColorBrewer)  # Color palettes

# ── Machine Learning - Clustering ──────────────
library(cluster)
library(mvnTest)
library(clusterSim)
library(dbscan)

1.2 Library Summary

<div class="metric-icon">📦</div>
<div class="metric-value">3</div>
<div class="metric-label">Wrangling Libraries</div>
<div class="metric-icon">📈</div>
<div class="metric-value">9</div>
<div class="metric-label">Visualization Libraries</div>
<div class="metric-icon">🤖</div>
<div class="metric-value">4</div>
<div class="metric-label">ML / Clustering Libraries</div>
<div class="metric-icon">🗺️</div>
<div class="metric-value">38</div>
<div class="metric-label">Provinces Analyzed</div>

2 Data Import

data_cluster <- read.csv2(
  "C:/Users/Muhammad Rizqa Salas/Downloads/Skripsi Lesgoo/Data/Data fix.csv",
  sep = ";"
)
glimpse(data_cluster)
Rows: 38
Columns: 16
$ Provinsi <chr> "ACEH", "SUMATERA UTARA", "SUMATERA BARAT", "RIAU", "JAMBI", …
$ TPAK     <dbl> 65.32, 72.29, 71.34, 66.55, 68.37, 69.88, 71.78, 70.44, 69.38…
$ GR       <dbl> 0.274, 0.283, 0.280, 0.304, 0.291, 0.298, 0.339, 0.287, 0.214…
$ PpK      <int> 11191, 11898, 12041, 12233, 12018, 12416, 12197, 11683, 13837…
$ TPT      <dbl> 5.64, 5.32, 5.62, 4.16, 4.26, 3.69, 3.41, 4.21, 4.45, 6.45, 6…
$ PDRB     <dbl> 257502.4, 1236193.6, 352189.4, 1201383.8, 349662.6, 720205.2,…
$ PM       <dbl> 12.33, 7.36, 5.35, 6.16, 7.19, 10.15, 12.08, 10.00, 5.00, 4.4…
$ AHH      <dbl> 70.485, 70.330, 70.380, 72.570, 72.090, 70.975, 70.180, 71.55…
$ RLS      <dbl> 9.95, 10.08, 9.77, 9.55, 8.95, 8.91, 9.23, 8.61, 8.65, 10.72,…
$ PPI      <dbl> 88.95, 92.30, 91.89, 94.16, 90.59, 90.94, 90.62, 91.57, 92.14…
$ DMS      <int> 5840, 5360, 1204, 1710, 1431, 2954, 1382, 2466, 392, 377, 267…
$ PMJ      <dbl> 98.18, 68.69, 78.19, 73.30, 61.04, 73.33, 76.20, 71.90, 84.11…
$ PKKP     <dbl> 8.60, 7.20, 7.67, 10.90, 10.22, 5.94, 9.50, 10.62, 10.00, 9.0…
$ PRTS     <dbl> 82.21, 87.47, 74.59, 91.21, 85.88, 85.50, 85.75, 87.29, 96.45…
$ AMHP     <dbl> 99.59, 99.52, 99.43, 99.54, 98.66, 99.50, 99.08, 98.89, 98.89…
$ APPT     <dbl> 123.62, 125.61, 136.87, 118.59, 127.62, 153.85, 155.76, 165.5…
head(data_cluster)

📋 Data Structure: The dataset holds socio-economic indicators for 38 Indonesian provinces: one province-name column plus 15 indicator variables covering employment, the economy, education, and health.


3 Data Exploration & Standardization

3.1 Descriptive Statistics

summary(data_cluster)
   Provinsi              TPAK             GR              PpK       
 Length:38          Min.   :63.98   Min.   :0.2140   Min.   : 5861  
 Class :character   1st Qu.:67.44   1st Qu.:0.2848   1st Qu.:10598  
 Mode  :character   Median :69.56   Median :0.3275   Median :11978  
                    Mean   :70.52   Mean   :0.3284   Mean   :11962  
                    3rd Qu.:72.37   3rd Qu.:0.3605   3rd Qu.:12766  
                    Max.   :90.66   Max.   :0.4260   Max.   :20676  
      TPT             PDRB               PM              AHH       
 Min.   :1.490   Min.   :  28378   Min.   : 3.720   Min.   :64.75  
 1st Qu.:3.500   1st Qu.: 121014   1st Qu.: 5.675   1st Qu.:68.87  
 Median :4.210   Median : 249279   Median : 9.490   Median :70.73  
 Mean   :4.468   Mean   : 622001   Mean   :10.611   Mean   :70.58  
 3rd Qu.:5.545   3rd Qu.: 671116   3rd Qu.:12.268   3rd Qu.:72.32  
 Max.   :6.960   Max.   :3926153   Max.   :30.030   Max.   :75.53  
      RLS              PPI             DMS              PMJ       
 Min.   : 4.300   Min.   :12.15   Min.   : 245.0   Min.   :61.04  
 1st Qu.: 8.445   1st Qu.:87.44   1st Qu.: 660.8   1st Qu.:73.05  
 Median : 9.050   Median :90.38   Median :1183.0   Median :77.82  
 Mean   : 9.029   Mean   :86.27   Mean   :1947.7   Mean   :79.19  
 3rd Qu.: 9.765   3rd Qu.:92.16   3rd Qu.:2003.2   3rd Qu.:85.97  
 Max.   :11.590   Max.   :98.18   Max.   :8333.0   Max.   :98.18  
      PKKP             PRTS            AMHP            APPT      
 Min.   : 2.670   Min.   :16.34   Min.   :90.79   Min.   : 62.0  
 1st Qu.: 7.147   1st Qu.:81.15   1st Qu.:98.53   1st Qu.:109.1  
 Median : 9.295   Median :85.99   Median :99.14   Median :121.9  
 Mean   :12.202   Mean   :82.96   Mean   :98.53   Mean   :124.4  
 3rd Qu.:14.297   3rd Qu.:90.12   3rd Qu.:99.53   3rd Qu.:128.9  
 Max.   :32.300   Max.   :98.20   Max.   :99.85   Max.   :260.2  

3.2 Correlation Matrix

🔍 Interpretation: The correlation matrix below shows the pairwise linear relationship between the 15 indicators. Values near 1 or -1 indicate a strong positive or negative correlation; values near 0 indicate little or no linear correlation.

ggcorr(data_cluster[, 2:16], label = TRUE, label_size = 3,
       low = "#c53030", mid = "white", high = "#2b6cb0",
       layout.exp = 1, hjust = 0.85)
Figure 1. Correlation Matrix Between Variables
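To read a correlation matrix more systematically, the strongly correlated pairs can be listed directly. The following is a minimal sketch on toy data; the data frame `toy`, the 0.8 cutoff, and the name `strong_pairs` are illustrative assumptions and not part of the original analysis.

```r
# Toy data standing in for the indicator columns (hypothetical values)
set.seed(1)
x <- rnorm(50)
toy <- data.frame(
  a = x,
  b = x + rnorm(50, sd = 0.1),  # built to be strongly correlated with a
  c = rnorm(50)                 # independent noise
)

corr <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above an assumed cutoff
idx <- which(upper.tri(corr) & abs(corr) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
strong_pairs
```

On the real data the same steps could be applied to `cor(data_cluster[, 2:16])`; the cutoff is a judgment call rather than a fixed rule.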

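3.3 Variance Inflation Factor (VIF)

⚠️ Note: A VIF above 10 indicates high multicollinearity. Variables with a high VIF deserve attention before moving on to clustering, because strongly collinear indicators effectively count the same information more than once in a distance calculation.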

CekVIF <- function(data) {
  corr <- as.matrix(cor(data))
  VIF  <- diag(solve(corr))
  return(VIF)
}

vif_result <- CekVIF(data_cluster[, 2:16])
vif_df     <- data.frame(Variabel = names(vif_result), VIF = round(vif_result, 4))
vif_df     <- vif_df[order(-vif_df$VIF), ]
rownames(vif_df) <- NULL
vif_df
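The `CekVIF` function relies on the fact that the diagonal of the inverse correlation matrix equals 1 / (1 - R²), where R² comes from regressing each variable on all the others. A small sketch on toy data, assuming nothing beyond base R, illustrates that equivalence; `toy`, `vif_inverse`, and `vif_regression` are hypothetical names.

```r
# Toy predictors: x2 is deliberately built from x1, x3 is independent
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# Route 1: diagonal of the inverse correlation matrix (as in CekVIF)
vif_inverse <- diag(solve(cor(toy)))

# Route 2: 1 / (1 - R^2) from regressing each variable on the rest
vif_regression <- sapply(names(toy), function(v) {
  fit <- lm(reformulate(setdiff(names(toy), v), response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})

round(rbind(vif_inverse, vif_regression), 4)
all.equal(unname(vif_inverse), unname(vif_regression))
```

Both routes should agree up to floating-point error; the matrix route is simply faster when there are many variables.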

3.4 Box Plots per Variable

type <- c("TPAK", "GR", "PpK", "TPT", "PDRB", "PM",
          "AHH", "RLS", "PPI", "DMS", "PMJ", "PKKP",
          "PRTS", "AMHP", "APPT")

plots <- lapply(type, function(i) {
  ggplot(data_cluster, aes(y = .data[[i]])) +
    geom_boxplot(fill = "#bee3f8", color = "#2b6cb0",
                 outlier.colour = "#c53030", outlier.size = 2.5,
                 width = 0.5) +
    labs(title = i, y = NULL, x = NULL) +
    theme_minimal(base_size = 11) +
    theme(
      plot.title    = element_text(hjust = 0.5, face = "bold", color = "#1a365d"),
      panel.grid.minor = element_blank(),
      axis.text.x  = element_blank()
    )
})

gridExtra::grid.arrange(grobs = plots, ncol = 5)
Figure 2. Box-Plot Distribution per Variable

3.5 Data Standardization

data_kmeans <- data_cluster %>%
  dplyr::select(TPAK, GR, PpK, TPT, PDRB, PM,
                AHH, RLS, PPI, DMS, PMJ, PKKP, PRTS, AMHP, APPT)

scale_data_kmeans <- scale(data_kmeans)

cat("✅ Data standardized successfully (Z-score)\n")
✅ Data standardized successfully (Z-score)
cat("📐 Data dimensions:", nrow(scale_data_kmeans), "rows x",
    ncol(scale_data_kmeans), "columns\n")
📐 Data dimensions: 38 rows x 15 columns
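As a quick illustration of what Z-score standardization does, the sketch below applies `scale()` to a tiny hypothetical matrix and reproduces it by hand as (x - mean) / sd per column. The matrix `toy` and the name `z_manual` are assumptions made for the example.

```r
# Two toy columns on very different scales (hypothetical values)
toy <- matrix(c(10, 20, 30, 40,
                 1,  3,  5,  7),
              ncol = 2, dimnames = list(NULL, c("u", "v")))

z <- scale(toy)  # centers each column and divides by its sample sd

# Same computation written out explicitly
z_manual <- sweep(sweep(toy, 2, colMeans(toy), "-"), 2, apply(toy, 2, sd), "/")

colMeans(z)        # each column mean is (numerically) zero
apply(z, 2, sd)    # each column sd is one
all.equal(as.vector(z), as.vector(z_manual))
```

After this step every indicator contributes on a comparable scale to Euclidean distances, which matters here because variables such as PDRB are several orders of magnitude larger than, say, GR.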

4 K-Means Algorithm

📖 About K-Means: K-Means partitions the data into k clusters by assigning each observation to the centroid nearest in Euclidean distance, then recomputing the centroids until the assignments stop changing. It works best when clusters are roughly spherical and of similar size.
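The assign-then-update loop described above can be sketched in a few lines of base R. This is only an illustration on toy data with hand-picked starting centroids; it is not the algorithm variant that `kmeans()` uses by default, and all names (`toy`, `centers`, `assignment`) are hypothetical.

```r
set.seed(1)
# Two well-separated toy blobs in 2-D (hypothetical data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

k <- 2
# Start from one point of each blob to keep the sketch simple
centers <- toy[c(1, 21), , drop = FALSE]

for (iter in 1:10) {
  # Assignment step: squared Euclidean distance of every point to every centroid
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(toy, 2, centers[j, ])^2))
  assignment <- max.col(-d2, ties.method = "first")

  # Update step: each centroid becomes the mean of its assigned points
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(toy[assignment == j, , drop = FALSE])
  }))

  # Stop once the centroids no longer move
  if (all(abs(new_centers - centers) < 1e-9)) break
  centers <- new_centers
}

table(assignment)
```

Because the result depends on the starting centroids, real analyses usually run several random starts (for example via the `nstart` argument of `kmeans()`) and keep the best one.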

4.1 Choosing the Optimal k

set.seed(1000)
fviz_nbclust(
  x          = scale_data_kmeans,
  FUNcluster = kmeans,
  method     = "wss",
  k.max      = 10,
  linecolor  = "#3182ce"
) +
  labs(title = "Elbow Method: Choosing the Optimal Number of Clusters",
       subtitle = "The 'elbow' point marks a suitable value of k") +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", color = "#1a365d"),
    plot.subtitle = element_text(color = "#718096")
  )
Figure 3. Elbow Method for Determining the Optimal k

✅ Decision: Based on the Elbow method, k = 3 is chosen as the number of clusters.
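Reading an elbow by eye is subjective, so it can help to look at the numbers behind the curve. The sketch below computes the total within-cluster sum of squares for k = 1 to 8 on toy data with three built-in groups and reports how much each extra cluster improves it. The data and the names `wss` and `gain` are assumptions for illustration.

```r
set.seed(1000)
# Three toy blobs (hypothetical data), 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.4), rnorm(30, mean = 4, sd = 0.4))
)

# Total within-cluster sum of squares for each candidate k
wss <- sapply(1:8, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Improvement gained by moving from k - 1 to k clusters
gain <- c(NA, -diff(wss))

data.frame(k = 1:8, wss = round(wss, 2), gain = round(gain, 2))
```

The elbow sits where `gain` drops sharply; in this toy case that happens after k = 3. On real data the drop is often less clear-cut, which is one reason to cross-check with another criterion such as the average silhouette width.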

4.2 Clustering Results

RNGkind(sample.kind = "Rounding")
set.seed(1000)

prov_kmeans <- kmeans(x = scale_data_kmeans, centers = 3)

fviz_cluster(
  object    = prov_kmeans,
  data      = scale_data_kmeans,
  palette   = c("#3182ce", "#e53e3e", "#38a169"),
  geom      = "point",
  ellipse.type = "convex",
  ggtheme   = theme_minimal(base_size = 12)
) +
  labs(title = "K-Means Clustering Result",
       subtitle = paste("Number of clusters: 3 | Total observations:", nrow(scale_data_kmeans))) +
  theme(
    plot.title    = element_text(face = "bold", color = "#1a365d"),
    plot.subtitle = element_text(color = "#718096")
  )
Figure 4. K-Means Clustering Result (k = 3)

4.3 Cluster Members

data_cluster$Cluster_KMeans <- as.factor(prov_kmeans$cluster)
kmeans_clustering            <- data_cluster
kmeans_clustering$Cluster    <- prov_kmeans$cluster

# Print the members of each cluster
for (k in 1:3) {
  cat(paste0("\n🔵 **Cluster ", k, ":**\n"))
  anggota <- kmeans_clustering %>% filter(Cluster == k) %>% pull(Provinsi)
  cat(paste(anggota, collapse = " | "), "\n")
}

🔵 **Cluster 1:**
DKI JAKARTA | JAWA BARAT | JAWA TENGAH | JAWA TIMUR 

🔵 **Cluster 2:**
PAPUA TENGAH | PAPUA PEGUNUNGAN 

🔵 **Cluster 3:**
ACEH | SUMATERA UTARA | SUMATERA BARAT | RIAU | JAMBI | SUMATERA SELATAN | BENGKULU | LAMPUNG | KEPULAUAN BANGKA BELITUNG | KEPULAUAN RIAU | DAERAH ISTIMEWA YOGYAKARTA | BANTEN | BALI | NUSA TENGGARA BARAT | NUSA TENGGARA TIMUR | KALIMANTAN BARAT | KALIMANTAN TENGAH | KALIMANTAN SELATAN | KALIMANTAN TIMUR | KALIMANTAN UTARA | SULAWESI UTARA | SULAWESI TENGAH | SULAWESI SELATAN | SULAWESI TENGGARA | GORONTALO | SULAWESI BARAT | MALUKU | MALUKU UTARA | PAPUA BARAT | PAPUA BARAT DAYA | PAPUA | PAPUA SELATAN 

4.4 K-Means Radar Chart

data_kmeans$Cluster <- prov_kmeans$cluster

ggRadar(
  data        = data_kmeans,
  mapping     = aes(colours = Cluster),
  interactive = TRUE
)

Figure 5. Radar Chart of K-Means Cluster Characteristics


5 DBSCAN Algorithm

📖 About DBSCAN: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups observations by density. Unlike K-Means, it can detect clusters of arbitrary shape and automatically labels points in low-density regions as outliers (noise).
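The density idea can be made concrete with a compact base-R sketch: a point is a core point if its ε-neighbourhood (counting itself) holds at least MinPts points, clusters grow outward from core points, and anything never reached stays noise. This is a simplified teaching version on hypothetical data, not the implementation in the `dbscan` package; the names `toy`, `is_core`, and `cluster_id` are assumptions.

```r
# Toy data: two dense groups plus one isolated point (hypothetical values)
toy <- rbind(
  c(0.0, 0.0), c(0.1, 0.0), c(0.0, 0.1), c(0.1, 0.1),
  c(3.0, 3.0), c(3.1, 3.0), c(3.0, 3.1),
  c(10, 10)
)
eps     <- 0.5
min_pts <- 3   # neighbourhood size counts the point itself

d         <- as.matrix(dist(toy))
neighbors <- d <= eps                       # logical matrix, diagonal is TRUE
is_core   <- rowSums(neighbors) >= min_pts

cluster_id <- integer(nrow(toy))            # 0 = noise / not yet assigned
current    <- 0L
for (i in which(is_core)) {
  if (cluster_id[i] != 0L) next             # already absorbed by an earlier cluster
  current <- current + 1L
  queue <- i
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    if (cluster_id[p] != 0L) next
    cluster_id[p] <- current
    # Only core points pass the cluster on to their unassigned neighbours
    if (is_core[p]) {
      queue <- c(queue, which(neighbors[p, ] & cluster_id == 0L))
    }
  }
}

cluster_id
# 1 1 1 1 2 2 2 0
```

The isolated point keeps label 0, which mirrors how the `dbscan` output reports noise under cluster 0.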

5.1 Choosing Epsilon (ε)

min_pts <- 2

dbscan::kNNdistplot(scale_data_kmeans, k = min_pts)
title(main = "K-NN Distance Plot", col.main = "#1a365d", font.main = 2)
abline(h = 3.5, col = "#e53e3e", lty = 2, lwd = 2)
text(x = 5, y = 3.7, labels = "ε = 3.5", col = "#e53e3e", font = 2)
Figure 6. K-NN Distance Plot for Choosing Epsilon

⚙️ DBSCAN Parameters: ε (epsilon) = 3.5 | MinPts = 2
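The k-NN distance plot is just the sorted distance from each point to its k-th nearest neighbour; ε is read off where that curve bends upward. A minimal base-R sketch of the underlying computation follows, using hypothetical data and names (`toy`, `knn_dist`). One point worth double-checking: the `dbscan` documentation, as we read it, pairs a given minPts with k = minPts - 1 because minPts counts the point itself, whereas the chunk above passes k = min_pts.

```r
set.seed(1)
# One tight toy group plus one isolated point in row 31 (hypothetical data)
toy <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2), c(5, 5))

k <- 2
d <- as.matrix(dist(toy))

# Sort each row: position 1 is the zero distance to the point itself,
# so the k-th nearest neighbour sits at position k + 1
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# The sorted values are what a k-NN distance plot displays
sorted_knn <- sort(knn_dist)
tail(round(sorted_knn, 3), 3)
```

Points whose k-NN distance lies far above the bulk of the curve, such as row 31 here, are the ones DBSCAN would label as noise for any ε chosen below that jump.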

5.2 DBSCAN Clustering Results

set.seed(1000)
res_dbscan <- dbscan(scale_data_kmeans, eps = 3.5, minPts = min_pts)
res_dbscan
DBSCAN clustering for 38 objects.
Parameters: eps = 3.5, minPts = 2
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 5 noise points.

 0  1  2 
 5 30  3 

Available fields: cluster, eps, minPts, metric, borderPoints
fviz_cluster(
  res_dbscan,
  data      = scale_data_kmeans,
  stand     = FALSE,
  ellipse   = TRUE,
  geom      = "point",
  palette   = "Set1",
  ggtheme   = theme_minimal(base_size = 12),
  main      = "DBSCAN Clustering Result"
) +
  theme(
    plot.title = element_text(face = "bold", color = "#1a365d")
  )
Figure 7. DBSCAN Clustering Result

5.3 Outlier Identification

data_cluster$Cluster_DBSCAN <- res_dbscan$cluster

cat("📊 DBSCAN cluster distribution:\n")
📊 DBSCAN cluster distribution:
print(table(data_cluster$Cluster_DBSCAN))

 0  1  2 
 5 30  3 
outliers <- data_cluster %>%
  filter(Cluster_DBSCAN == 0) %>%
  dplyr::select(Provinsi)

cat("\n🔴 Provinces flagged as noise/outliers by DBSCAN:\n")

🔴 Provinces flagged as noise/outliers by DBSCAN:
print(outliers)
             Provinsi
1         DKI JAKARTA
2 NUSA TENGGARA BARAT
3    KALIMANTAN UTARA
4        PAPUA TENGAH
5    PAPUA PEGUNUNGAN

5.4 DBSCAN Radar Chart

data_dbscan_radar         <- data_kmeans
data_dbscan_radar$Cluster <- as.factor(res_dbscan$cluster)
levels(data_dbscan_radar$Cluster)[
  levels(data_dbscan_radar$Cluster) == "0"
] <- "Noise"

ggRadar(
  data        = data_dbscan_radar,
  mapping     = aes(colours = Cluster),
  rescale     = TRUE,
  interactive = TRUE
)

Figure 8. Radar Chart of DBSCAN Cluster Characteristics


6 Evaluation & Comparison

6.1 K-Means Metrics

# Silhouette Index
sil_kmeans     <- silhouette(prov_kmeans$cluster, dist(scale_data_kmeans))
fviz_silhouette(sil_kmeans) +
  labs(title = "Silhouette Plot — K-Means") +
  scale_fill_manual(values = c("#3182ce", "#e53e3e", "#38a169")) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", color = "#1a365d"))
  cluster size ave.sil.width
1       1    4          0.20
2       2    2          0.19
3       3   32          0.32

avg_sil_kmeans <- mean(sil_kmeans[, 3])
dbi_kmeans     <- index.DB(scale_data_kmeans, prov_kmeans$cluster)$DB
chi_kmeans     <- index.G1(scale_data_kmeans, prov_kmeans$cluster)

cat("─────────────────────────────────\n")
─────────────────────────────────
cat(" 📊 K-Means Evaluation Metrics\n")
 📊 K-Means Evaluation Metrics
cat("─────────────────────────────────\n")
─────────────────────────────────
cat(sprintf(" Avg Silhouette Index : %.4f\n", avg_sil_kmeans))
 Avg Silhouette Index : 0.3023
cat(sprintf(" Davies-Bouldin Index : %.4f  (↓ lower is better)\n", dbi_kmeans))
 Davies-Bouldin Index : 1.1976  (↓ lower is better)
cat(sprintf(" Calinski-Harabasz   : %.4f  (↑ higher is better)\n", chi_kmeans))
 Calinski-Harabasz   : 11.1732  (↑ higher is better)
cat("─────────────────────────────────\n")
─────────────────────────────────
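For intuition about the Calinski-Harabasz value, it is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: (BSS / (k - 1)) / (WSS / (n - k)). The sketch below computes it two ways on toy data, once from the components `kmeans()` already returns and once by hand. The data and variable names are assumptions, and the sketch does not call `clusterSim`.

```r
set.seed(1000)
# Three toy blobs, standardized as in the main analysis (hypothetical data)
toy <- scale(rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.5), ncol = 2)
))

km <- kmeans(toy, centers = 3, nstart = 25)
n  <- nrow(toy)
k  <- length(km$size)

# Route 1: use the sums of squares stored in the kmeans object
ch_from_kmeans <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))

# Route 2: rebuild both sums of squares by hand
grand <- colMeans(toy)
bss <- sum(sapply(seq_len(k), function(j) {
  km$size[j] * sum((km$centers[j, ] - grand)^2)
}))
wss <- sum(sapply(seq_len(k), function(j) {
  pts <- toy[km$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[j, ])^2)
}))
ch_by_hand <- (bss / (k - 1)) / (wss / (n - k))

c(from_kmeans = ch_from_kmeans, by_hand = ch_by_hand)
```

A larger value means clusters are compact relative to how far apart their centroids are, which is why the index is read as higher is better.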

6.2 DBSCAN Metrics

is_not_noise       <- res_dbscan$cluster != 0
data_dbscan_eval   <- scale_data_kmeans[is_not_noise, ]
cluster_dbscan_eval <- res_dbscan$cluster[is_not_noise]

if (length(unique(cluster_dbscan_eval)) > 1) {
  sil_dbscan     <- silhouette(cluster_dbscan_eval, dist(data_dbscan_eval))
  avg_sil_dbscan <- mean(sil_dbscan[, 3])
  dbi_dbscan     <- index.DB(data_dbscan_eval, cluster_dbscan_eval)$DB
  chi_dbscan     <- index.G1(data_dbscan_eval, cluster_dbscan_eval)

  cat("──────────────────────────────────────────\n")
  cat(" 📊 DBSCAN Evaluation Metrics (Excl. Noise)\n")
  cat("──────────────────────────────────────────\n")
  cat(sprintf(" Avg Silhouette Index : %.4f\n", avg_sil_dbscan))
  cat(sprintf(" Davies-Bouldin Index : %.4f  (↓ lower is better)\n", dbi_dbscan))
  cat(sprintf(" Calinski-Harabasz   : %.4f  (↑ higher is better)\n", chi_dbscan))
  cat("──────────────────────────────────────────\n")
} else {
  cat("⚠️ DBSCAN did not produce enough clusters to evaluate.\n")
}
──────────────────────────────────────────
 📊 DBSCAN Evaluation Metrics (Excl. Noise)
──────────────────────────────────────────
 Avg Silhouette Index : 0.3116
 Davies-Bouldin Index : 1.0204  (↓ lower is better)
 Calinski-Harabasz   : 7.0028  (↑ higher is better)
──────────────────────────────────────────

6.3 Comparison Table

Metric                  K-Means                   DBSCAN
Number of clusters      3 (set via k)             2 (found automatically), plus 5 noise points
Outlier detection       ❌ No                     ✅ Yes
Cluster shape           Spherical                 Arbitrary
Sensitive parameters    k (number of clusters)    ε and MinPts
Silhouette Index        0.3023                    0.3116 (excluding noise)
Davies-Bouldin          1.1976                    1.0204 (excluding noise)
Calinski-Harabasz       11.1732                   7.0028 (excluding noise)
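The three evaluation metrics reported in sections 6.1 and 6.2 can be gathered into one data frame so that the direction of each metric is applied consistently. The sketch below hard-codes the printed values; the column names and the `better` rule are illustrative assumptions. Keep in mind that the DBSCAN figures were computed on the 33 non-noise provinces while the K-Means figures use all 38, so the comparison is indicative rather than strictly like-for-like.

```r
# Values copied from the printed evaluation output above
comparison <- data.frame(
  metric           = c("Silhouette", "Davies-Bouldin", "Calinski-Harabasz"),
  kmeans           = c(0.3023, 1.1976, 11.1732),
  dbscan           = c(0.3116, 1.0204, 7.0028),
  higher_is_better = c(TRUE, FALSE, TRUE)
)

# Pick the method favoured by each metric, respecting its direction
kmeans_wins <- with(comparison,
                    ifelse(higher_is_better, kmeans > dbscan, kmeans < dbscan))
comparison$better <- ifelse(kmeans_wins, "K-Means", "DBSCAN")

comparison
```

Here DBSCAN is favoured on Silhouette and Davies-Bouldin, while K-Means is favoured on Calinski-Harabasz.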

7 Map Visualization

data_cluster$Cluster_KMeans <- as.factor(prov_kmeans$cluster)
data_cluster$Cluster_DBSCAN <- as.factor(res_dbscan$cluster)

indo_shp <- st_read(
  "C:/Users/Muhammad Rizqa Salas/Downloads/Skripsi Lesgoo/[LapakGIS.com] Batas Wilayah Provinsi 2024/LapakGIS_Batas_Provinsi_2024.shp"
) %>%
  st_make_valid() %>%
  mutate(WADMPR = str_to_upper(WADMPR))

peta_merged <- indo_shp %>%
  left_join(
    data_cluster %>% mutate(Provinsi = str_to_upper(Provinsi)),
    by = c("WADMPR" = "Provinsi")
  ) %>%
  filter(!st_is_empty(geometry))
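Because the join matches on province names, any spelling difference between the shapefile and the data leaves that province without a cluster on the map. A quick check with `setdiff()` can surface such mismatches before plotting. The name vectors below are hypothetical examples, not the actual contents of either file.

```r
# Hypothetical name lists standing in for the shapefile and the data
shp_names  <- toupper(c("Aceh", "DKI Jakarta", "Daerah Istimewa Yogyakarta", "Bali"))
data_names <- toupper(c("ACEH", "DKI JAKARTA", "DI YOGYAKARTA", "BALI"))

# Names present in the data but missing from the shapefile, and vice versa
only_in_data <- setdiff(data_names, shp_names)
only_in_shp  <- setdiff(shp_names, data_names)

only_in_data
only_in_shp

# With the real objects this would be along the lines of:
# setdiff(toupper(data_cluster$Provinsi), indo_shp$WADMPR)
```

An empty result in both directions means every province found its polygon; anything else needs a manual recode before the join.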

7.1 K-Means Map

tmap_mode("plot")
tm_shape(peta_merged) +
  tm_polygons(
    "Cluster_KMeans",
    palette   = "Set2",
    title     = "Cluster K-Means",
    border.col = "white",
    colorNA   = "grey90"
  ) +
  tm_text("WADMPR", size = 0.35, col = "#2d3748") +
  tm_layout(
    main.title       = "Provincial Cluster Map: K-Means",
    main.title.size  = 1.1,
    main.title.color = "#1a365d",
    main.title.fontface = "bold",
    legend.outside   = TRUE,
    frame            = FALSE,
    bg.color         = "#f7fafc"
  )
Figure 9. Map of K-Means Clusters by Province

7.2 DBSCAN Map

peta_merged <- peta_merged %>%
  mutate(
    Cluster_DB_Label = as.character(Cluster_DBSCAN),
    Cluster_DB_Label = ifelse(
      Cluster_DB_Label == "0",
      "Noise/Outlier",
      paste("Cluster", Cluster_DB_Label)
    ),
    Cluster_DB_Label = as.factor(Cluster_DB_Label)
  )

tm_shape(peta_merged) +
  tm_polygons(
    "Cluster_DB_Label",
    palette   = "Set1",
    title     = "Cluster DBSCAN",
    border.col = "white"
  ) +
  tm_text("WADMPR", size = 0.35, col = "#2d3748") +
  tm_layout(
    main.title       = "Provincial Cluster Map: DBSCAN",
    main.title.size  = 1.1,
    main.title.color = "#1a365d",
    main.title.fontface = "bold",
    legend.outside   = TRUE,
    frame            = FALSE,
    bg.color         = "#f7fafc"
  )
Figure 10. Map of DBSCAN Clusters by Province