1. Introduction

This report presents a comprehensive clustering analysis of Indonesian provincial socioeconomic indicators. Clustering, as an unsupervised machine learning technique, aims to partition observations into groups such that observations within the same group are more similar to each other than to those in other groups. The dataset encompasses 151 observations across 34 Indonesian provinces over multiple years, covering indicators related to education participation, labor market performance, poverty lines, poverty rates, and human development indices.

Five clustering methodologies are employed and systematically compared in this study:

  1. K-Means a centroid-based partitional algorithm
  2. K-Medoids (PAM) a medoid-based robust alternative to K-Means
  3. DBSCAN a density-based spatial clustering algorithm capable of detecting arbitrarily-shaped clusters and noise
  4. Mean Shift a non-parametric, kernel-density-based algorithm that does not require pre-specifying the number of clusters
  5. Fuzzy C-Means (FCM) a soft-partitioning algorithm that allows each observation to belong to multiple clusters with varying degrees of membership

The analysis follows a structured pipeline: data loading → exploratory data analysis → preprocessing → optimal cluster determination → clustering execution → validation → interpretation.


2. Data Loading and Initial Inspection

Prior to any analytical procedure, it is essential to examine the raw structure of the dataset to identify potential issues such as missing values, data type mismatches, and distributional anomalies.

if (file.exists("data_5_clustering.xlsx")) {
  df_raw <- read_excel("data_5_clustering.xlsx")
} else if (file.exists("data_5_clustering.csv")) {
  df_raw <- read.csv("data_5_clustering.csv", stringsAsFactors = FALSE)
} else {
  stop("Dataset not found. Place data_5_clustering.xlsx or .csv in the working directory.")
}

colnames(df_raw) <- c(
  "PROVINSI", "TAHUN",
  "APS_13_15", "APS_16_18", "APS_19_24",
  "TPT_Feb", "TPT_Agt",
  "TPAK_Feb", "TPAK_Agt",
  "GK_Maret", "GK_Sept",
  "JPM_Maret", "JPM_Sept",
  "PPM_Maret", "PPM_Sept",
  "IPM", "RLS_15Plus", "HLS"
)

cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))
## Rows: 151 | Columns: 18
str(df_raw)
## 'data.frame':    151 obs. of  18 variables:
##  $ PROVINSI  : chr  "SUMATERA UTARA" "SUMATERA BARAT" "RIAU" "JAMBI" ...
##  $ TAHUN     : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
##  $ APS_13_15 : num  97 96.6 95.7 96.4 94.8 ...
##  $ APS_16_18 : num  78.7 84.1 77.8 72.5 71.5 ...
##  $ APS_19_24 : num  27.1 36.4 28.8 24.1 18.8 ...
##  $ TPT_Feb   : num  6.01 6.67 4.96 4.76 5.17 ...
##  $ TPT_Agt   : num  6.33 6.52 4.42 5.09 4.98 3.65 4.69 5.03 9.91 8.5 ...
##  $ TPAK_Feb  : num  69.4 68.4 65.8 67.3 70 ...
##  $ TPAK_Agt  : num  69.1 67.7 65 67.2 68.8 ...
##  $ GK_Maret  : int  525756 568703 565937 506355 457455 548934 471439 752203 642425 697638 ...
##  $ GK_Sept   : int  648336 714991 702620 658100 564462 672816 599018 917673 807602 846085 ...
##  $ JPM_Maret : num  1344 371 501 294 1114 ...
##  $ JPM_Sept  : num  1111 315 473 273 949 ...
##  $ PPM_Maret : num  9.01 6.63 7.12 8.09 12.84 ...
##  $ PPM_Sept  : num  7.19 5.42 6.36 7.26 10.51 ...
##  $ IPM       : num  73.8 74.6 73.9 72.6 71.8 ...
##  $ RLS_15Plus: num  9.88 9.46 9.52 9.03 8.78 ...
##  $ HLS       : num  13.3 14.1 13.3 13 12.5 ...
missing_summary <- data.frame(
  Column   = names(df_raw),
  Missing  = sapply(df_raw, function(x) sum(is.na(x))),
  Pct_Miss = sapply(df_raw, function(x) round(mean(is.na(x)) * 100, 2))
)
rownames(missing_summary) <- NULL

kable(missing_summary, caption = "Missing Values per Variable",
      col.names = c("Variable", "Missing Count", "Missing (%)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)
Missing Values per Variable
Variable Missing Count Missing (%)
PROVINSI 0 0
TAHUN 0 0
APS_13_15 0 0
APS_16_18 0 0
APS_19_24 0 0
TPT_Feb 0 0
TPT_Agt 0 0
TPAK_Feb 0 0
TPAK_Agt 0 0
GK_Maret 0 0
GK_Sept 0 0
JPM_Maret 0 0
JPM_Sept 0 0
PPM_Maret 0 0
PPM_Sept 0 0
IPM 0 0
RLS_15Plus 0 0
HLS 0 0
kable(head(df_raw), caption = "Preview: First 6 Rows of Dataset") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Preview: First 6 Rows of Dataset
PROVINSI TAHUN APS_13_15 APS_16_18 APS_19_24 TPT_Feb TPT_Agt TPAK_Feb TPAK_Agt GK_Maret GK_Sept JPM_Maret JPM_Sept PPM_Maret PPM_Sept IPM RLS_15Plus HLS
SUMATERA UTARA 2021 96.99 78.66 27.05 6.01 6.33 69.39 69.10 525756 648336 1343.86 1110.92 9.01 7.19 73.84 9.88 13.27
SUMATERA BARAT 2021 96.63 84.07 36.41 6.67 6.52 68.41 67.72 568703 714991 370.67 315.43 6.63 5.42 74.56 9.46 14.09
RIAU 2021 95.66 77.81 28.79 4.96 4.42 65.81 65.03 565937 702620 500.81 473.04 7.12 6.36 73.89 9.52 13.28
JAMBI 2021 96.39 72.50 24.14 4.76 5.09 67.30 67.17 506355 658100 293.86 272.70 8.09 7.26 72.62 9.03 13.04
SUMATERA SELATAN 2021 94.85 71.53 18.81 5.17 4.98 69.95 68.77 457455 564462 1113.76 948.84 12.84 10.51 71.83 8.78 12.54
BENGKULU 2021 97.49 79.75 30.46 3.72 3.65 71.74 69.75 548934 672816 306.00 261.15 15.22 12.52 73.16 9.26 13.67

3. Exploratory Data Analysis (EDA)

3.1 Descriptive Statistics

Descriptive statistics provide a foundational understanding of the central tendency, dispersion, and shape of each variable’s distribution. Examining skewness and kurtosis prior to multivariate analysis is critical for assessing distributional assumptions. Variables with absolute skewness exceeding 2.0 or kurtosis exceeding 7.0 may warrant transformation.

numeric_vars <- df_raw %>% select(where(is.numeric)) %>% select(-TAHUN)

desc_stats <- data.frame(
  Variable = names(numeric_vars),
  N        = sapply(numeric_vars, function(x) sum(!is.na(x))),
  Mean     = sapply(numeric_vars, mean, na.rm = TRUE),
  SD       = sapply(numeric_vars, sd, na.rm = TRUE),
  Min      = sapply(numeric_vars, min, na.rm = TRUE),
  Q1       = sapply(numeric_vars, quantile, 0.25, na.rm = TRUE),
  Median   = sapply(numeric_vars, median, na.rm = TRUE),
  Q3       = sapply(numeric_vars, quantile, 0.75, na.rm = TRUE),
  Max      = sapply(numeric_vars, max, na.rm = TRUE),
  Skewness = sapply(numeric_vars, skewness, na.rm = TRUE),
  Kurtosis = sapply(numeric_vars, kurtosis, na.rm = TRUE)
)
rownames(desc_stats) <- NULL

kable(desc_stats %>% mutate(across(where(is.numeric), ~round(.x, 3))),
      caption = "Descriptive Statistics of Numeric Variables") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Descriptive Statistics of Numeric Variables
Variable N Mean SD Min Q1 Median Q3 Max Skewness Kurtosis
APS_13_15 151 94.392 5.109 80.02 93.270 95.93 97.615 99.43 -1.871 5.451
APS_16_18 151 74.570 6.390 63.98 69.915 74.07 79.280 91.17 0.304 2.519
APS_19_24 151 27.616 6.132 17.59 23.740 26.42 30.740 51.60 1.541 6.916
TPT_Feb 151 4.953 1.524 3.04 3.730 4.53 5.785 10.12 1.006 3.420
TPT_Agt 151 4.765 1.615 2.27 3.390 4.56 5.840 9.91 0.706 3.286
TPAK_Feb 151 69.487 4.032 61.97 66.500 69.36 71.720 80.23 0.576 3.034
TPAK_Agt 151 69.577 4.125 62.15 66.365 69.31 71.055 78.29 0.541 2.585
GK_Maret 151 583860.291 123126.095 364251.00 490047.000 570368.00 663415.500 1007060.00 0.583 3.135
GK_Sept 151 652458.444 119598.356 460283.00 547751.000 644107.00 708811.000 917673.00 0.345 2.370
JPM_Maret 151 747.771 1026.577 47.83 191.390 350.25 915.150 4572.73 2.584 8.730
JPM_Sept 151 626.133 940.524 41.11 161.070 272.70 711.770 3893.82 2.648 8.912
PPM_Maret 151 11.705 6.769 4.00 6.455 10.47 14.595 32.97 1.104 3.374
PPM_Sept 151 10.652 6.322 3.80 5.700 9.56 12.640 29.66 1.325 4.280
IPM 151 72.264 4.738 61.40 70.470 72.96 74.650 83.55 -0.442 3.412
RLS_15Plus 151 9.120 1.022 5.10 8.515 9.22 9.850 11.49 -0.686 4.235
HLS 151 13.133 0.884 9.97 12.790 13.17 13.645 15.70 -0.450 5.280
skew_df <- data.frame(
  Variable  = names(numeric_vars),
  Skewness  = round(sapply(numeric_vars, skewness, na.rm = TRUE), 3)
) %>%
  mutate(
    Direction      = ifelse(Skewness > 0, "Positive (Right)", "Negative (Left)"),
    Interpretation = ifelse(abs(Skewness) < 0.5, "Approximately Symmetric",
                     ifelse(abs(Skewness) < 1.0, "Moderately Skewed",
                     ifelse(abs(Skewness) < 2.0, "Highly Skewed", "Extremely Skewed")))
  )

kable(skew_df, caption = "Skewness Interpretation per Variable") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)
Skewness Interpretation per Variable
Variable Skewness Direction Interpretation
APS_13_15 APS_13_15 -1.871 Negative (Left) Highly Skewed
APS_16_18 APS_16_18 0.304 Positive (Right) Approximately Symmetric
APS_19_24 APS_19_24 1.541 Positive (Right) Highly Skewed
TPT_Feb TPT_Feb 1.006 Positive (Right) Highly Skewed
TPT_Agt TPT_Agt 0.706 Positive (Right) Moderately Skewed
TPAK_Feb TPAK_Feb 0.576 Positive (Right) Moderately Skewed
TPAK_Agt TPAK_Agt 0.541 Positive (Right) Moderately Skewed
GK_Maret GK_Maret 0.583 Positive (Right) Moderately Skewed
GK_Sept GK_Sept 0.345 Positive (Right) Approximately Symmetric
JPM_Maret JPM_Maret 2.584 Positive (Right) Extremely Skewed
JPM_Sept JPM_Sept 2.648 Positive (Right) Extremely Skewed
PPM_Maret PPM_Maret 1.104 Positive (Right) Highly Skewed
PPM_Sept PPM_Sept 1.325 Positive (Right) Highly Skewed
IPM IPM -0.442 Negative (Left) Approximately Symmetric
RLS_15Plus RLS_15Plus -0.686 Negative (Left) Moderately Skewed
HLS HLS -0.450 Negative (Left) Approximately Symmetric

3.2 Distribution Visualization

df_long <- numeric_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(df_long, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "#2E86AB", color = "white", alpha = 0.7) +
  geom_density(color = "#E84855", linewidth = 0.8) +
  facet_wrap(~Variable, scales = "free", ncol = 4) +
  labs(title = "Distribution of Socioeconomic Variables",
       subtitle = "Histogram overlaid with kernel density estimate",
       x = NULL, y = "Density") +
  theme_minimal(base_size = 10) +
  theme(strip.text = element_text(face = "bold", size = 8),
        plot.title = element_text(face = "bold", size = 13))
Histogram and Density Plots of All Numeric Variables

Histogram and Density Plots of All Numeric Variables

3.3 Boxplot Analysis

Boxplots enable the simultaneous visualization of central tendency, dispersion, and outlier presence across all variables. Outliers, defined as values beyond 1.5 × IQR from the quartile boundaries, may distort centroid-based algorithms such as K-Means and warrant careful consideration.

df_scaled_long <- as.data.frame(scale(numeric_vars)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(df_scaled_long, aes(x = reorder(Variable, Value, median), y = Value, fill = Variable)) +
  geom_boxplot(outlier.shape = 21, outlier.fill = "red", outlier.size = 1.5, alpha = 0.7) +
  coord_flip() +
  scale_fill_viridis_d(option = "turbo", guide = "none") +
  labs(title = "Boxplot of Standardized Variables",
       subtitle = "Z-score normalized; red dots indicate outliers",
       x = NULL, y = "Standardized Value (Z-score)") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"))
Boxplots of Standardized Variables

Boxplots of Standardized Variables

3.4 Correlation Analysis

Correlation analysis identifies linear relationships among variables. In cluster analysis, multicollinearity is less problematic than in regression, but understanding inter-variable relationships aids in interpreting cluster profiles and selecting features.

cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 2.5,
           colors   = c("#E84855", "white", "#2E86AB"),
           title    = "Pearson Correlation Matrix",
           ggtheme  = theme_minimal(base_size = 9)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
        axis.text.y = element_text(size = 7),
        plot.title  = element_text(face = "bold"))
Correlation Matrix Heatmap

Correlation Matrix Heatmap

cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "r")
high_cor <- cor_df %>%
  filter(as.character(Var1) < as.character(Var2), abs(r) > 0.80) %>%
  arrange(desc(abs(r))) %>%
  mutate(r = round(r, 3))

kable(high_cor, caption = "Variable Pairs with |r| > 0.80 (High Correlation)",
      col.names = c("Variable 1", "Variable 2", "r")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Variable Pairs with |r| > 0.80 (High Correlation)
Variable 1 Variable 2 r
JPM_Maret JPM_Sept 0.979
PPM_Maret PPM_Sept 0.967
TPT_Agt TPT_Feb 0.953
TPAK_Agt TPAK_Feb 0.916
GK_Maret GK_Sept 0.889
IPM PPM_Maret -0.814

3.5 Temporal Overview

yearly_summary <- df_raw %>%
  group_by(TAHUN) %>%
  summarise(Mean_IPM = mean(IPM, na.rm = TRUE),
            Mean_PPM = mean(PPM_Maret, na.rm = TRUE),
            .groups  = "drop")

p1 <- ggplot(yearly_summary, aes(x = factor(TAHUN), y = Mean_IPM, group = 1)) +
  geom_line(color = "#2E86AB", linewidth = 1.2) +
  geom_point(color = "#2E86AB", size = 3) +
  labs(title = "Average Human Development Index (IPM) by Year",
       x = "Year", y = "Mean IPM") +
  theme_minimal()

p2 <- ggplot(yearly_summary, aes(x = factor(TAHUN), y = Mean_PPM, group = 1)) +
  geom_line(color = "#E84855", linewidth = 1.2) +
  geom_point(color = "#E84855", size = 3) +
  labs(title = "Average Poverty Rate - March by Year",
       x = "Year", y = "Mean PPM (%)") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)
Mean IPM and Poverty Rate by Year

Mean IPM and Poverty Rate by Year


4. Data Preprocessing

4.1 Feature Selection

For this clustering analysis, the most recent year of data is selected for each province to obtain a cross-sectional snapshot. All numeric socioeconomic indicators are retained as clustering features, excluding the identifier variables PROVINSI and TAHUN.

df_latest <- df_raw %>%
  group_by(PROVINSI) %>%
  slice_max(order_by = TAHUN, n = 1) %>%
  ungroup()

features <- df_latest %>% select(-PROVINSI, -TAHUN)

cat(sprintf("Provinces: %d | Features: %d\n", nrow(df_latest), ncol(features)))
## Provinces: 38 | Features: 16

4.2 Standardization

Standardization (Z-score normalization) is a critical preprocessing step in distance-based clustering algorithms. Without standardization, variables measured on larger scales (e.g., Poverty Line in Rupiah) will disproportionately influence distance calculations relative to variables on smaller scales.

features_scaled <- as.data.frame(scale(features))
rownames(features_scaled) <- df_latest$PROVINSI

cat(sprintf("Mean of scaled data : %.8f (expected = 0)\n", mean(as.matrix(features_scaled))))
## Mean of scaled data : 0.00000000 (expected = 0)
cat(sprintf("SD of scaled data   : %.8f (expected = 1)\n", sd(as.matrix(features_scaled))))
## SD of scaled data   : 0.98756686 (expected = 1)

4.3 PCA for Visualization

Principal Component Analysis (PCA) is applied here solely for visualization purposes projecting high-dimensional cluster solutions into two-dimensional space without loss of interpretive clarity.

pca_result  <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
pca_var     <- pca_result$sdev^2
pca_var_pct <- pca_var / sum(pca_var) * 100
pca_cumvar  <- cumsum(pca_var_pct)

pca_scree <- data.frame(PC = 1:length(pca_var_pct), Var = pca_var_pct, Cum = pca_cumvar)

ggplot(pca_scree[1:10, ], aes(x = PC)) +
  geom_col(aes(y = Var), fill = "#2E86AB", alpha = 0.8) +
  geom_line(aes(y = Cum), color = "#E84855", linewidth = 1.2, group = 1) +
  geom_point(aes(y = Cum), color = "#E84855", size = 3) +
  geom_hline(yintercept = 80, linetype = "dashed", color = "gray40") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "PCA Scree Plot",
       subtitle = "Bars = individual variance explained; Line = cumulative variance",
       x = "Principal Component", y = "Variance Explained (%)") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
PCA Scree Plot

PCA Scree Plot

cat(sprintf("PC1: %.1f%% | PC2: %.1f%% | Combined: %.1f%%\n",
            pca_var_pct[1], pca_var_pct[2], pca_cumvar[2]))
## PC1: 39.2% | PC2: 17.7% | Combined: 56.9%
pcs <- as.data.frame(pca_result$x[, 1:2])
pcs$PROVINSI <- df_latest$PROVINSI

5. Optimal Number of Clusters

5.1 Elbow Method (WSS)

The Elbow Method evaluates the Within-Cluster Sum of Squares (WSS) across a range of k values. The optimal k is identified at the “elbow” point where the rate of decrease in WSS sharply decelerates, indicating diminishing returns from adding more clusters.

set.seed(42)
wss_values <- sapply(1:10, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
  geom_line(color = "#2E86AB", linewidth = 1.2) +
  geom_point(color = "#2E86AB", size = 3.5) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method for Optimal k",
       subtitle = "Within-Cluster Sum of Squares vs. Number of Clusters",
       x = "Number of Clusters (k)", y = "Total WSS") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Elbow Method

Elbow Method

elbow_k <- which.min(diff(diff(wss_values))) + 1
cat(sprintf("Elbow heuristic suggests k = %d\n", elbow_k))
## Elbow heuristic suggests k = 7

5.2 Silhouette Method

The Silhouette Width measures how similar an observation is to its own cluster compared to other clusters. Values range from -1 (poor fit) to +1 (excellent fit). The optimal k maximizes the average silhouette width.

set.seed(42)
sil_values <- sapply(2:10, function(k) {
  km <- kmeans(features_scaled, centers = k, nstart = 25, iter.max = 100)
  mean(silhouette(km$cluster, dist(features_scaled))[, 3])
})

sil_df     <- data.frame(k = 2:10, Silhouette = sil_values)
best_sil_k <- sil_df$k[which.max(sil_df$Silhouette)]

ggplot(sil_df, aes(x = k, y = Silhouette)) +
  geom_line(color = "#E84855", linewidth = 1.2) +
  geom_point(color = "#E84855", size = 3.5) +
  geom_vline(xintercept = best_sil_k, linetype = "dashed", color = "gray40") +
  scale_x_continuous(breaks = 2:10) +
  labs(title = "Silhouette Method for Optimal k",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Average Silhouette Width

Average Silhouette Width

cat(sprintf("Optimal k by Silhouette: k = %d (avg sil = %.4f)\n", best_sil_k, max(sil_values)))
## Optimal k by Silhouette: k = 2 (avg sil = 0.4024)

5.3 Gap Statistic

The Gap Statistic compares within-cluster dispersion against its expected value under a null reference distribution. The optimal k is where the gap statistic is largest, indicating that the observed clustering structure substantially exceeds chance-level structure.

set.seed(42)
gap_stat   <- clusGap(features_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
best_gap_k <- which.max(gap_stat$Tab[, "gap"])

fviz_gap_stat(gap_stat) +
  labs(title = "Gap Statistic for Optimal k") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Gap Statistic

Gap Statistic

cat(sprintf("Optimal k by Gap Statistic: k = %d\n", best_gap_k))
## Optimal k by Gap Statistic: k = 10

5.4 Summary: Optimal k Selection

k_votes <- c(elbow_k, best_sil_k, best_gap_k)
final_k  <- as.integer(names(sort(table(k_votes), decreasing = TRUE))[1])

kable(data.frame(
  Method    = c("Elbow Method", "Silhouette Method", "Gap Statistic", "Final (Majority Vote)"),
  Optimal_k = c(elbow_k, best_sil_k, best_gap_k, final_k)
), caption = "Optimal k Determination Summary",
   col.names = c("Method", "Suggested k")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Optimal k Determination Summary
Method Suggested k
Elbow Method 7
Silhouette Method 2
Gap Statistic 10
Final (Majority Vote) 2

6. K-Means Clustering

6.1 Methodology

K-Means is one of the most widely used partitional clustering algorithms. The algorithm iteratively assigns each observation to the nearest centroid using Euclidean distance and updates centroids as the mean of assigned observations, minimizing total within-cluster sum of squared deviations. To mitigate sensitivity to random initialization, 25 random restarts (nstart = 25) are employed and the solution with the lowest total WSS is retained.

set.seed(42)
km_result <- kmeans(features_scaled, centers = final_k, nstart = 25, iter.max = 200)

df_latest$KMeans_Cluster <- factor(km_result$cluster)

kable(data.frame(
  Metric = c("Number of Clusters", "Total WSS", "Between-SS", "Total SS", "BSS/TSS (%)"),
  Value  = c(final_k,
             round(km_result$tot.withinss, 4),
             round(km_result$betweenss, 4),
             round(km_result$totss, 4),
             round(km_result$betweenss / km_result$totss * 100, 2))
), caption = "K-Means Summary Statistics", col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
K-Means Summary Statistics
Metric Value
Number of Clusters 2.0000
Total WSS 435.6562
Between-SS 156.3438
Total SS 592.0000
BSS/TSS (%) 26.4100
kable(data.frame(Cluster = 1:final_k, Size = km_result$size),
      caption = "K-Means Cluster Sizes") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
K-Means Cluster Sizes
Cluster Size
1 4
2 34

6.2 Cluster Visualization (PCA Space)

pcs$KMeans <- factor(km_result$cluster)

ggplot(pcs, aes(x = PC1, y = PC2, color = KMeans, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  stat_ellipse(aes(fill = KMeans), geom = "polygon", alpha = 0.1, level = 0.90) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = paste0("K-Means Clustering (k = ", final_k, ") PCA Projection"),
       subtitle = "90% confidence ellipses; labels = province names",
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster", fill = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
K-Means Clusters PCA Projection

K-Means Clusters PCA Projection

6.3 Cluster Centroids

km_centers_orig <- as.data.frame(km_result$centers)
names(km_centers_orig) <- names(features)

kable(round(km_centers_orig, 3), caption = "K-Means Cluster Centroids (Standardized Scale)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
  scroll_box(width = "100%")
K-Means Cluster Centroids (Standardized Scale)
APS_13_15 APS_16_18 APS_19_24 TPT_Feb TPT_Agt TPAK_Feb TPAK_Agt GK_Maret GK_Sept JPM_Maret JPM_Sept PPM_Maret PPM_Sept IPM RLS_15Plus HLS
-2.629 -1.605 -0.669 -0.870 -1.249 1.793 1.804 0.728 0.467 -0.437 -0.492 1.958 2.043 -2.076 -1.351 -0.958
0.309 0.189 0.079 0.102 0.147 -0.211 -0.212 -0.086 -0.055 0.051 0.058 -0.230 -0.240 0.244 0.159 0.113
km_centers_long <- km_centers_orig %>%
  mutate(Cluster = factor(1:nrow(.))) %>%
  pivot_longer(-Cluster, names_to = "Variable", values_to = "Value")

ggplot(km_centers_long, aes(x = reorder(Variable, Value), y = Value, fill = Cluster)) +
  geom_col(position = "dodge", alpha = 0.85) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "K-Means Cluster Centroids by Variable",
       subtitle = "Standardized values",
       x = NULL, y = "Standardized Centroid Value", fill = "Cluster") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"))
K-Means Cluster Centroid Profile

K-Means Cluster Centroid Profile

6.4 Silhouette Analysis

km_sil     <- silhouette(km_result$cluster, dist(features_scaled))
avg_sil_km <- mean(km_sil[, 3])

fviz_silhouette(km_sil, palette = "Set1") +
  labs(title = paste0("Silhouette Plot K-Means (k = ", final_k, ")")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
##   cluster size ave.sil.width
## 1       1    4          0.36
## 2       2   34          0.41
Silhouette Plot K-Means

Silhouette Plot K-Means

sil_label_km <- ifelse(avg_sil_km >= 0.70, "Strong Structure",
                ifelse(avg_sil_km >= 0.50, "Reasonable Structure",
                ifelse(avg_sil_km >= 0.25, "Weak Structure", "No Substantial Structure")))
cat(sprintf("Average Silhouette Width: %.4f %s\n", avg_sil_km, sil_label_km))
## Average Silhouette Width: 0.4024 Weak Structure

6.5 Province Assignment

kable(df_latest %>% select(PROVINSI, TAHUN, KMeans_Cluster) %>%
        arrange(KMeans_Cluster, PROVINSI),
      caption = "Province Cluster Assignments K-Means",
      col.names = c("Province", "Year", "Cluster")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  scroll_box(height = "400px")
Province Cluster Assignments K-Means
Province Year Cluster
PAPUA 2024 1
PAPUA PEGUNUNGAN 2024 1
PAPUA SELATAN 2024 1
PAPUA TENGAH 2024 1
ACEH 2024 2
BALI 2024 2
BANTEN 2024 2
BENGKULU 2024 2
DI YOGYAKARTA 2024 2
DKI JAKARTA 2024 2
GORONTALO 2024 2
JAMBI 2024 2
JAWA BARAT 2024 2
JAWA TENGAH 2024 2
JAWA TIMUR 2024 2
KALIMANTAN BARAT 2024 2
KALIMANTAN SELATAN 2024 2
KALIMANTAN TENGAH 2024 2
KALIMANTAN TIMUR 2024 2
KALIMANTAN UTARA 2024 2
KEP. BANGKA BELITUNG 2024 2
KEP. RIAU 2024 2
LAMPUNG 2024 2
MALUKU 2024 2
MALUKU UTARA 2024 2
NUSA TENGGARA BARAT 2024 2
NUSA TENGGARA TIMUR 2024 2
PAPUA BARAT 2024 2
PAPUA BARAT DAYA 2024 2
RIAU 2024 2
SULAWESI BARAT 2024 2
SULAWESI SELATAN 2024 2
SULAWESI TENGAH 2024 2
SULAWESI TENGGARA 2024 2
SULAWESI UTARA 2024 2
SUMATERA BARAT 2024 2
SUMATERA SELATAN 2024 2
SUMATERA UTARA 2024 2

7. K-Medoids (PAM) Clustering

7.1 Methodology

K-Medoids, implemented via the Partitioning Around Medoids (PAM) algorithm, selects actual data points (medoids) as cluster representatives rather than computed centroids. This approach minimizes the sum of dissimilarities and is significantly more resistant to outliers and noise than K-Means a particularly desirable property when working with heterogeneous regional socioeconomic data.

set.seed(42)
pam_result <- pam(features_scaled, k = final_k, metric = "euclidean")

df_latest$PAM_Cluster <- factor(pam_result$clustering)
pcs$PAM               <- factor(pam_result$clustering)

kable(data.frame(
  Cluster = 1:final_k,
  Medoid  = rownames(features_scaled)[pam_result$id.med],
  Size    = as.integer(table(pam_result$clustering))
), caption = "PAM Medoids and Cluster Sizes",
   col.names = c("Cluster", "Medoid Province", "Size")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
PAM Medoids and Cluster Sizes
Cluster Medoid Province Size
1 JAMBI 33
2 PAPUA SELATAN 5

7.2 PAM Visualization

ggplot(pcs, aes(x = PC1, y = PC2, color = PAM, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  stat_ellipse(aes(fill = PAM), geom = "polygon", alpha = 0.1, level = 0.90) +
  scale_color_brewer(palette = "Set2") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = paste0("K-Medoids (PAM) Clustering (k = ", final_k, ") PCA Projection"),
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster", fill = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
PAM Clusters PCA Projection

PAM Clusters PCA Projection

7.3 PAM Silhouette Analysis

pam_sil     <- silhouette(pam_result$clustering, dist(features_scaled))
avg_sil_pam <- mean(pam_sil[, 3])

fviz_silhouette(pam_sil, palette = "Set2") +
  labs(title = paste0("Silhouette Plot PAM (k = ", final_k, ")")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
##   cluster size ave.sil.width
## 1       1   33          0.37
## 2       2    5          0.26
Silhouette Plot PAM

Silhouette Plot PAM

sil_label_pam <- ifelse(avg_sil_pam >= 0.70, "Strong Structure",
                  ifelse(avg_sil_pam >= 0.50, "Reasonable Structure",
                  ifelse(avg_sil_pam >= 0.25, "Weak Structure", "No Substantial Structure")))
cat(sprintf("Average Silhouette Width (PAM): %.4f %s\n", avg_sil_pam, sil_label_pam))
## Average Silhouette Width (PAM): 0.3588 Weak Structure

7.4 K-Means vs PAM Agreement

ari_kmpam <- adjustedRandIndex(km_result$cluster, pam_result$clustering)
ari_label <- ifelse(ari_kmpam >= 0.80, "Near-Perfect Agreement",
              ifelse(ari_kmpam >= 0.60, "Strong Agreement",
              ifelse(ari_kmpam >= 0.40, "Moderate Agreement", "Weak Agreement")))

kable(data.frame(Metric = c("Adjusted Rand Index (ARI)", "Interpretation"),
                 Value  = c(round(ari_kmpam, 4), ari_label)),
      caption = "K-Means vs PAM Agreement",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
K-Means vs PAM Agreement
Metric Value
Adjusted Rand Index (ARI) 0.844
Interpretation Near-Perfect Agreement

8. DBSCAN Clustering

8.1 Methodology

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies clusters as regions of high point density separated by regions of lower density. Unlike partitional methods, DBSCAN does not require k to be specified a priori and is capable of discovering clusters of arbitrary shape. Two key hyperparameters must be tuned: ε (epsilon), the neighborhood radius; and MinPts, the minimum number of points required to form a dense region. Points unreachable from any core point are labeled as noise.

8.2 Epsilon Selection (k-NN Distance Plot)

min_pts    <- max(4, 2 * ncol(features_scaled))
knn_dist   <- kNNdist(as.matrix(features_scaled), k = min_pts)
knn_sorted <- if (is.matrix(knn_dist)) sort(knn_dist[, min_pts]) else sort(knn_dist)

ggplot(data.frame(Index = seq_along(knn_sorted), Distance = knn_sorted),
       aes(x = Index, y = Distance)) +
  geom_line(color = "#2E86AB", linewidth = 1) +
  geom_hline(yintercept = quantile(knn_sorted, 0.9), color = "red",
             linetype = "dashed", linewidth = 0.8) +
  labs(title = paste0(min_pts, "-NN Distance Plot for DBSCAN ε Selection"),
       subtitle = "Red dashed line = 90th percentile (suggested ε)",
       x = "Points (sorted)", y = paste0(min_pts, "-NN Distance")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
k-NN Distance Plot for DBSCAN ε Selection

k-NN Distance Plot for DBSCAN ε Selection

eps_val <- round(quantile(knn_sorted, 0.9), 2)
cat(sprintf("MinPts = %d | Suggested ε = %.4f\n", min_pts, eps_val))
## MinPts = 32 | Suggested ε = 8.2000

8.3 DBSCAN Execution

set.seed(42)
db_result <- dbscan(as.matrix(features_scaled), eps = eps_val, minPts = min_pts)

df_latest$DBSCAN_Cluster <- factor(db_result$cluster)
pcs$DBSCAN               <- factor(db_result$cluster)

n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise_db    <- sum(db_result$cluster == 0)

kable(as.data.frame(table(Cluster = db_result$cluster)),
      caption = "DBSCAN Cluster Sizes (Cluster 0 = Noise)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
DBSCAN Cluster Sizes (Cluster 0 = Noise)
Cluster Freq
1 38
cat(sprintf("Clusters found: %d | Noise points: %d (%.1f%%)\n",
            n_clusters_db, n_noise_db, n_noise_db / nrow(features_scaled) * 100))
## Clusters found: 1 | Noise points: 0 (0.0%)
if (n_noise_db > 0) {
  cat("Noise provinces:", paste(df_latest$PROVINSI[db_result$cluster == 0], collapse = ", "), "\n")
}

8.4 DBSCAN Visualization

pcs_db <- pcs
pcs_db$DBSCAN_label <- ifelse(as.character(pcs_db$DBSCAN) == "0", "Noise",
                               paste0("Cluster ", pcs_db$DBSCAN))

n_pal      <- max(3, n_clusters_db)
db_pal     <- setNames(brewer.pal(n_pal, "Dark2")[1:n_clusters_db],
                       paste0("Cluster ", 1:n_clusters_db))
color_vals <- c("Noise" = "gray60", db_pal)
shape_vals <- c("Noise" = 4, setNames(rep(16, n_clusters_db),
                                      paste0("Cluster ", 1:n_clusters_db)))

ggplot(pcs_db, aes(x = PC1, y = PC2, color = DBSCAN_label,
                    shape = DBSCAN_label, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  scale_color_manual(values = color_vals) +
  scale_shape_manual(values = shape_vals) +
  labs(title = "DBSCAN Clustering PCA Projection",
       subtitle = paste0("ε = ", eps_val, ", MinPts = ", min_pts, "; x = noise"),
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Group", shape = "Group") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
DBSCAN Clusters PCA Projection

DBSCAN Clusters PCA Projection

8.5 DBSCAN Parameter Sensitivity

eps_grid    <- round(seq(eps_val * 0.6, eps_val * 1.6, length.out = 7), 3)
minpts_grid <- c(min_pts - 2, min_pts, min_pts + 2)

sens_results <- expand.grid(eps = eps_grid, minPts = minpts_grid) %>%
  rowwise() %>%
  mutate(
    res        = list(dbscan(as.matrix(features_scaled), eps = eps, minPts = minPts)),
    n_clusters = length(unique(res$cluster[res$cluster != 0])),
    n_noise    = sum(res$cluster == 0)
  ) %>%
  select(-res) %>%
  ungroup()

kable(sens_results, caption = "DBSCAN Sensitivity: Clusters & Noise Across Parameter Grid",
      col.names = c("ε", "MinPts", "Clusters", "Noise Points")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
DBSCAN Sensitivity: Clusters & Noise Across Parameter Grid
ε MinPts Clusters Noise Points
4.920 30 1 6
6.287 30 1 1
7.653 30 1 1
9.020 30 1 0
10.387 30 1 0
11.753 30 1 0
13.120 30 1 0
4.920 32 1 6
6.287 32 1 1
7.653 32 1 1
9.020 32 1 0
10.387 32 1 0
11.753 32 1 0
13.120 32 1 0
4.920 34 0 38
6.287 34 1 2
7.653 34 1 1
9.020 34 1 0
10.387 34 1 0
11.753 34 1 0
13.120 34 1 0

9. Mean Shift Clustering

9.1 Methodology

Mean Shift is a non-parametric, kernel-density-based clustering algorithm that does not require the number of clusters to be specified in advance. The algorithm iteratively shifts each data point toward the local mode of the estimated kernel density function. Points converging to the same mode are assigned to the same cluster. The bandwidth parameter h controls the smoothness of the kernel density estimate: smaller bandwidths produce more clusters, while larger bandwidths merge modes into fewer clusters. Given computational intensity in high dimensions, PCA-reduced data is used, consistent with standard practice in the literature.

n_pcs_ms <- min(5, ncol(pca_result$x))
pcs_ms   <- pca_result$x[, 1:n_pcs_ms]

bw_val <- mean(apply(pcs_ms, 2, sd)) * nrow(pcs_ms)^(-1 / (n_pcs_ms + 4)) * 1.5

ms_result <- meanShift(pcs_ms, bandwidth = rep(bw_val, n_pcs_ms))

ms_labels <- ms_result$assignment
df_latest$MeanShift_Cluster <- factor(ms_labels)
pcs$MeanShift               <- factor(ms_labels)
n_ms_clusters               <- length(unique(ms_labels))

kable(as.data.frame(table(Cluster = ms_labels)),
      caption = "Mean Shift Cluster Sizes") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Mean Shift Cluster Sizes
Cluster Freq
1 2
2 13
3 2
4 1
5 1
6 1
7 1
8 1
9 2
10 3
11 1
12 4
13 1
14 1
15 1
16 1
17 1
18 1
cat(sprintf("Applied on %d PCs (%.1f%% variance) | Bandwidth = %.4f | Clusters = %d\n",
            n_pcs_ms, pca_cumvar[n_pcs_ms], bw_val, n_ms_clusters))
## Applied on 5 PCs (88.6% variance) | Bandwidth = 1.6076 | Clusters = 18

9.2 Mean Shift Visualization

ggplot(pcs, aes(x = PC1, y = PC2, color = MeanShift, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  stat_ellipse(aes(fill = MeanShift), geom = "polygon", alpha = 0.1, level = 0.90) +
  scale_color_brewer(palette = "Set3") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Mean Shift Clustering PCA Projection",
       subtitle = paste0("Bandwidth h = ", round(bw_val, 4),
                         "; applied on ", n_pcs_ms, " PCs"),
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster", fill = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Mean Shift Clusters PCA Projection

Mean Shift Clusters PCA Projection


10. Fuzzy C-Means (FCM) Clustering

10.1 Methodology

Fuzzy C-Means (FCM) is a soft partitioning algorithm that assigns each observation a membership degree to each cluster rather than a crisp binary assignment. Membership degrees range from 0 to 1 and sum to 1 across all clusters for each observation. This probabilistic assignment is particularly valuable for socioeconomic data where provinces may share characteristics across multiple development clusters, reflecting gradual developmental transitions rather than abrupt boundaries. The fuzziness parameter m = 2 is the standard choice in the literature.

set.seed(42)
fcm_result <- ppclust::fcm(as.matrix(features_scaled), centers = final_k, m = 2, nstart = 10)

fcm_hard <- apply(fcm_result$u, 1, which.max)
df_latest$FCM_Cluster <- factor(fcm_hard)
pcs$FCM               <- factor(fcm_hard)

kable(as.data.frame(table(Cluster = fcm_hard)),
      caption = "FCM Cluster Sizes (Hard Assignment)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
FCM Cluster Sizes (Hard Assignment)
Cluster Freq
1 19
2 19
cat(sprintf("Clusters: %d | m = 2 | Objective: %.4f\n", final_k, fcm_result$func.val))
## Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874

10.2 Membership Matrix

membership_df <- as.data.frame(round(fcm_result$u, 4))
names(membership_df) <- paste0("Cluster_", 1:final_k)
membership_df$PROVINSI    <- df_latest$PROVINSI
membership_df$HardCluster <- fcm_hard
membership_df <- membership_df %>%
  select(PROVINSI, HardCluster, everything()) %>%
  arrange(HardCluster, PROVINSI)

kable(membership_df, caption = "Fuzzy Membership Degrees per Province",
      col.names = c("Province", "Hard Cluster", paste0("mu(C", 1:final_k, ")"))) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  scroll_box(height = "400px")
Fuzzy Membership Degrees per Province
Province Hard Cluster mu(C1) mu(C2)
ACEH ACEH 1 0.7383 0.2617
BANTEN BANTEN 1 0.6846 0.3154
DI YOGYAKARTA DI YOGYAKARTA 1 0.5318 0.4682
DKI JAKARTA DKI JAKARTA 1 0.6856 0.3144
JAMBI JAMBI 1 0.6423 0.3577
JAWA BARAT JAWA BARAT 1 0.5413 0.4587
KALIMANTAN TIMUR KALIMANTAN TIMUR 1 0.7479 0.2521
KALIMANTAN UTARA KALIMANTAN UTARA 1 0.6538 0.3462
KEP. BANGKA BELITUNG KEP. BANGKA BELITUNG 1 0.5488 0.4512
KEP. RIAU KEP. RIAU 1 0.6971 0.3029
MALUKU MALUKU 1 0.6989 0.3011
MALUKU UTARA MALUKU UTARA 1 0.7108 0.2892
PAPUA BARAT PAPUA BARAT 1 0.5774 0.4226
PAPUA BARAT DAYA PAPUA BARAT DAYA 1 0.6359 0.3641
RIAU RIAU 1 0.7717 0.2283
SULAWESI SELATAN SULAWESI SELATAN 1 0.5361 0.4639
SULAWESI UTARA SULAWESI UTARA 1 0.6367 0.3633
SUMATERA BARAT SUMATERA BARAT 1 0.7260 0.2740
SUMATERA UTARA SUMATERA UTARA 1 0.7466 0.2534
BALI BALI 2 0.4567 0.5433
BENGKULU BENGKULU 2 0.4167 0.5833
GORONTALO GORONTALO 2 0.2455 0.7545
JAWA TENGAH JAWA TENGAH 2 0.4489 0.5511
JAWA TIMUR JAWA TIMUR 2 0.4631 0.5369
KALIMANTAN BARAT KALIMANTAN BARAT 2 0.4152 0.5848
KALIMANTAN SELATAN KALIMANTAN SELATAN 2 0.4937 0.5063
KALIMANTAN TENGAH KALIMANTAN TENGAH 2 0.4805 0.5195
LAMPUNG LAMPUNG 2 0.2345 0.7655
NUSA TENGGARA BARAT NUSA TENGGARA BARAT 2 0.2666 0.7334
NUSA TENGGARA TIMUR NUSA TENGGARA TIMUR 2 0.2590 0.7410
PAPUA PAPUA 2 0.3706 0.6294
PAPUA PEGUNUNGAN PAPUA PEGUNUNGAN 2 0.4227 0.5773
PAPUA SELATAN PAPUA SELATAN 2 0.3370 0.6630
PAPUA TENGAH PAPUA TENGAH 2 0.3845 0.6155
SULAWESI BARAT SULAWESI BARAT 2 0.2785 0.7215
SULAWESI TENGAH SULAWESI TENGAH 2 0.1874 0.8126
SULAWESI TENGGARA SULAWESI TENGGARA 2 0.3685 0.6315
SUMATERA SELATAN SUMATERA SELATAN 2 0.2887 0.7113

10.3 Ambiguous Membership Detection

max_membership <- apply(fcm_result$u, 1, max)
ambiguous_idx  <- which(max_membership < 0.60)

if (length(ambiguous_idx) == 0) {
  kable(data.frame(Result = "No provinces with ambiguous membership (max mu < 0.60) detected."),
        col.names = NULL) %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
} else {
  kable(data.frame(
    Province    = df_latest$PROVINSI[ambiguous_idx],
    Max_Mu      = round(max_membership[ambiguous_idx], 4),
    HardCluster = fcm_hard[ambiguous_idx]
  ), caption = "Provinces with Ambiguous Membership (max mu < 0.60)",
     col.names = c("Province", "Max Membership", "Hard Cluster")) %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
}
Provinces with Ambiguous Membership (max mu < 0.60)
Province Max Membership Hard Cluster
BALI BALI 0.5433 2
BENGKULU BENGKULU 0.5833 2
DI YOGYAKARTA DI YOGYAKARTA 0.5318 1
JAWA BARAT JAWA BARAT 0.5413 1
JAWA TENGAH JAWA TENGAH 0.5511 2
JAWA TIMUR JAWA TIMUR 0.5369 2
KALIMANTAN BARAT KALIMANTAN BARAT 0.5848 2
KALIMANTAN SELATAN KALIMANTAN SELATAN 0.5063 2
KALIMANTAN TENGAH KALIMANTAN TENGAH 0.5195 2
KEP. BANGKA BELITUNG KEP. BANGKA BELITUNG 0.5488 1
PAPUA BARAT PAPUA BARAT 0.5774 1
PAPUA PEGUNUNGAN PAPUA PEGUNUNGAN 0.5773 2
SULAWESI SELATAN SULAWESI SELATAN 0.5361 1

10.4 FCM Visualization

pcs$MaxMembership <- max_membership

ggplot(pcs, aes(x = PC1, y = PC2, color = FCM, size = MaxMembership, label = PROVINSI)) +
  geom_point(alpha = 0.85) +
  geom_text_repel(size = 2.3, max.overlaps = 15, segment.alpha = 0.5) +
  scale_color_brewer(palette = "Paired") +
  scale_size_continuous(range = c(2, 6), name = "Max Membership") +
  labs(title = paste0("Fuzzy C-Means (k = ", final_k, ") PCA Projection"),
       subtitle = "Point size proportional to maximum membership degree",
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
FCM Clusters PCA Projection

FCM Clusters PCA Projection

10.5 FCM Cluster Centroids

fcm_centers <- as.data.frame(fcm_result$v)
names(fcm_centers) <- names(features)

kable(round(fcm_centers, 3), caption = "FCM Cluster Centroids (Standardized)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
  scroll_box(width = "100%")
FCM Cluster Centroids (Standardized)
APS_13_15 APS_16_18 APS_19_24 TPT_Feb TPT_Agt TPAK_Feb TPAK_Agt GK_Maret GK_Sept JPM_Maret JPM_Sept PPM_Maret PPM_Sept IPM RLS_15Plus HLS
Cluster 1 0.272 0.274 0.139 0.394 0.418 -0.359 -0.377 0.259 0.308 -0.058 -0.059 -0.222 -0.237 0.273 0.309 0.177
Cluster 2 -0.265 -0.245 -0.140 -0.418 -0.450 0.343 0.368 -0.289 -0.333 0.029 0.031 0.228 0.243 -0.280 -0.299 -0.158

11. Cluster Validation

11.1 Internal Validation Indices

Internal validation assesses the quality of a clustering solution using only the data itself, without reference to external ground-truth labels. Key indices include:

Index Criterion Preferred Value
Average Silhouette Width Cohesion & separation Higher (max = 1)
Dunn Index Min inter-cluster / Max intra-cluster distance Higher
Calinski-Harabasz Index Between-cluster to within-cluster SS ratio Higher
Davies-Bouldin Index Average similarity between clusters Lower
dist_matrix <- dist(features_scaled)

dunn_index <- function(labels, dist_mat) {
  labels    <- as.integer(factor(labels))
  k         <- length(unique(labels))
  dm        <- as.matrix(dist_mat)
  intra     <- sapply(1:k, function(cl) {
    idx <- which(labels == cl)
    if (length(idx) < 2) return(0)
    max(dm[idx, idx])
  })
  inter_min <- Inf
  for (i in 1:(k - 1)) {
    for (j in (i + 1):k) {
      d <- min(dm[which(labels == i), which(labels == j)])
      if (d < inter_min) inter_min <- d
    }
  }
  inter_min / max(intra)
}

ch_index <- function(labels, data) {
  labels     <- as.integer(factor(labels))
  k          <- length(unique(labels))
  n          <- nrow(data)
  grand_mean <- colMeans(data)
  ssb <- sum(sapply(unique(labels), function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    nrow(pts) * sum((colMeans(pts) - grand_mean)^2)
  }))
  ssw <- sum(sapply(unique(labels), function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    if (nrow(pts) < 2) return(0)
    sum(apply(pts, 1, function(r) sum((r - colMeans(pts))^2)))
  }))
  (ssb / (k - 1)) / (ssw / (n - k))
}

db_index <- function(labels, data) {
  labels <- as.integer(factor(labels))
  k      <- length(unique(labels))
  cents  <- sapply(1:k, function(cl) colMeans(data[labels == cl, , drop = FALSE]))
  s      <- sapply(1:k, function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, cents[, cl])^2)))
  })
  mean(sapply(1:k, function(i) {
    max(sapply(setdiff(1:k, i), function(j) {
      (s[i] + s[j]) / sqrt(sum((cents[, i] - cents[, j])^2))
    }))
  }))
}

method_labels <- list(
  "K-Means"    = km_result$cluster,
  "PAM"        = as.integer(pam_result$clustering),
  "DBSCAN"     = db_result$cluster,
  "Mean Shift" = as.integer(factor(ms_labels)),
  "FCM"        = fcm_hard
)

data_matrix <- as.matrix(features_scaled)

valid_df <- lapply(names(method_labels), function(mname) {
  labels <- method_labels[[mname]]
  if (mname == "DBSCAN") {
    valid_idx <- labels != 0
    if (sum(valid_idx) < 5 || length(unique(labels[valid_idx])) < 2) {
      return(data.frame(Method = mname, N_Clusters = NA,
                         Avg_Silhouette = NA, Dunn = NA, CH = NA, DB = NA))
    }
    labels_v <- labels[valid_idx]
    data_v   <- data_matrix[valid_idx, ]
    dist_v   <- dist(data_v)
  } else {
    labels_v <- labels
    data_v   <- data_matrix
    dist_v   <- dist_matrix
  }
  if (length(unique(labels_v)) < 2) {
    return(data.frame(Method = mname, N_Clusters = 1,
                       Avg_Silhouette = NA, Dunn = NA, CH = NA, DB = NA))
  }
  data.frame(
    Method         = mname,
    N_Clusters     = length(unique(labels_v)),
    Avg_Silhouette = round(mean(silhouette(as.integer(factor(labels_v)), dist_v)[, 3]), 4),
    Dunn           = round(tryCatch(dunn_index(labels_v, dist_v), error = function(e) NA), 4),
    CH             = round(tryCatch(ch_index(labels_v, data_v), error = function(e) NA), 2),
    DB             = round(tryCatch(db_index(labels_v, data_v), error = function(e) NA), 4)
  )
}) %>% bind_rows()

kable(valid_df, caption = "Internal Validation Indices Across All Clustering Methods",
      col.names = c("Method", "Clusters", "Avg Silhouette (up)", "Dunn (up)", "CH (up)", "DB (down)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)
Internal Validation Indices Across All Clustering Methods
Method Clusters Avg Silhouette (up) Dunn (up) CH (up) DB (down)
K-Means 2 0.4024 0.4475 12.92 0.9556
PAM 2 0.3588 0.2605 12.78 1.0859
DBSCAN NA NA NA NA NA
Mean Shift 18 0.0029 0.2681 5.89 0.8378
FCM 2 0.1824 0.1189 10.30 1.7095
best_sil   <- valid_df$Method[which.max(valid_df$Avg_Silhouette)]
best_ch    <- valid_df$Method[which.max(valid_df$CH)]
best_db    <- valid_df$Method[which.min(valid_df$DB)]
best_dunn  <- valid_df$Method[which.max(valid_df$Dunn)]
vote_counts  <- table(c(best_sil, best_ch, best_db, best_dunn))
overall_best <- names(which.max(vote_counts))

kable(data.frame(
  Index  = c("Best Avg Silhouette", "Best CH Index", "Best DB Index",
              "Best Dunn Index", "Overall Best (Majority Vote)"),
  Method = c(best_sil, best_ch, best_db, best_dunn, overall_best)
), caption = "Validation Summary", col.names = c("Index", "Best Method")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Validation Summary
Index Best Method
Best Avg Silhouette K-Means
Best CH Index K-Means
Best DB Index Mean Shift
Best Dunn Index K-Means
Overall Best (Majority Vote) K-Means

11.2 Cluster Stability (Bootstrap)

Bootstrap-based stability assessment evaluates how consistently the clustering solution replicates across repeated resamples of the data. A stability score above 0.85 indicates high stability; values below 0.60 suggest an unstable solution.

set.seed(42)
n_boot    <- 100
n_obs     <- nrow(features_scaled)

boot_stab <- sapply(1:n_boot, function(b) {
  idx         <- sample(1:n_obs, n_obs, replace = TRUE)
  boot_data   <- features_scaled[idx, ]
  km_boot     <- kmeans(boot_data, centers = final_k, nstart = 10, iter.max = 100)
  orig_labels <- km_result$cluster[idx]
  boot_labels <- km_boot$cluster
  mean(sapply(1:final_k, function(cl) {
    orig_idx_cl <- which(orig_labels == cl)
    if (length(orig_idx_cl) < 2) return(1)
    if (length(orig_idx_cl) > 50) orig_idx_cl <- sample(orig_idx_cl, 50)
    pairs <- combn(orig_idx_cl, 2)
    mean(boot_labels[pairs[1, ]] == boot_labels[pairs[2, ]])
  }))
})

stab_label <- ifelse(mean(boot_stab) >= 0.85, "High Stability",
               ifelse(mean(boot_stab) >= 0.75, "Acceptable Stability",
               ifelse(mean(boot_stab) >= 0.60, "Moderate Stability", "Low Stability")))

kable(data.frame(
  Metric = c("Mean Stability Score", "Std Deviation", "Interpretation"),
  Value  = c(round(mean(boot_stab), 4), round(sd(boot_stab), 4), stab_label)
), caption = paste0("Bootstrap Stability K-Means (n = ", n_boot, " resamples)"),
   col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Bootstrap Stability K-Means (n = 100 resamples)
Metric Value
Mean Stability Score 0.9069
Std Deviation 0.1105
Interpretation High Stability

12. Cross-Method Comparison

12.1 Agreement Matrix (Adjusted Rand Index)

The Adjusted Rand Index (ARI) quantifies the agreement between two clustering solutions, corrected for chance. ARI = 1 indicates perfect agreement, ARI = 0 indicates chance-level agreement.

method_names <- names(method_labels)
ari_mat <- matrix(NA, length(method_names), length(method_names),
                  dimnames = list(method_names, method_names))

for (i in seq_along(method_names)) {
  for (j in seq_along(method_names)) {
    ari_mat[i, j] <- ifelse(i == j, 1.0,
      round(adjustedRandIndex(method_labels[[i]], method_labels[[j]]), 3))
  }
}

ari_melt <- melt(ari_mat)
names(ari_melt) <- c("Method1", "Method2", "ARI")

ggplot(ari_melt, aes(x = Method1, y = Method2, fill = ARI, label = round(ARI, 2))) +
  geom_tile(color = "white") +
  geom_text(size = 4, fontface = "bold") +
  scale_fill_gradient2(low = "#E84855", mid = "white", high = "#2E86AB",
                       midpoint = 0.5, limits = c(-0.1, 1), name = "ARI") +
  labs(title = "Pairwise Adjusted Rand Index",
       subtitle = "ARI = 1: identical | ARI = 0: random",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(plot.title  = element_text(face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))
Pairwise ARI Heatmap

Pairwise ARI Heatmap

12.2 Cluster Assignment Comparison Table

kable(df_latest %>%
        select(PROVINSI, KMeans_Cluster, PAM_Cluster, DBSCAN_Cluster,
               MeanShift_Cluster, FCM_Cluster) %>%
        arrange(KMeans_Cluster, PROVINSI),
      caption = "Province Cluster Assignments Across All Methods",
      col.names = c("Province", "K-Means", "PAM", "DBSCAN", "Mean Shift", "FCM")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  scroll_box(height = "500px")
Province Cluster Assignments Across All Methods
Province K-Means PAM DBSCAN Mean Shift FCM
PAPUA 1 2 1 14 2
PAPUA PEGUNUNGAN 1 2 1 16 2
PAPUA SELATAN 1 2 1 17 2
PAPUA TENGAH 1 2 1 18 2
ACEH 2 1 1 1 1
BALI 2 1 1 2 2
BANTEN 2 1 1 3 1
BENGKULU 2 1 1 2 2
DI YOGYAKARTA 2 1 1 4 1
DKI JAKARTA 2 1 1 5 1
GORONTALO 2 1 1 2 2
JAMBI 2 1 1 2 1
JAWA BARAT 2 1 1 6 1
JAWA TENGAH 2 1 1 7 2
JAWA TIMUR 2 1 1 8 2
KALIMANTAN BARAT 2 1 1 2 2
KALIMANTAN SELATAN 2 1 1 2 2
KALIMANTAN TENGAH 2 1 1 2 2
KALIMANTAN TIMUR 2 1 1 9 1
KALIMANTAN UTARA 2 1 1 10 1
KEP. BANGKA BELITUNG 2 1 1 10 1
KEP. RIAU 2 1 1 11 1
LAMPUNG 2 1 1 2 2
MALUKU 2 1 1 9 1
MALUKU UTARA 2 1 1 12 1
NUSA TENGGARA BARAT 2 1 1 2 2
NUSA TENGGARA TIMUR 2 2 1 13 2
PAPUA BARAT 2 1 1 15 1
PAPUA BARAT DAYA 2 1 1 1 1
RIAU 2 1 1 10 1
SULAWESI BARAT 2 1 1 2 2
SULAWESI SELATAN 2 1 1 12 1
SULAWESI TENGAH 2 1 1 2 2
SULAWESI TENGGARA 2 1 1 2 2
SULAWESI UTARA 2 1 1 12 1
SUMATERA BARAT 2 1 1 3 1
SUMATERA SELATAN 2 1 1 2 2
SUMATERA UTARA 2 1 1 12 1

12.3 Visual Comparison: All Methods

pcs_long <- pcs %>%
  select(PC1, PC2, PROVINSI, KMeans, PAM, DBSCAN, MeanShift, FCM) %>%
  pivot_longer(cols = c(KMeans, PAM, DBSCAN, MeanShift, FCM),
               names_to = "Method", values_to = "Cluster")

ggplot(pcs_long, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2.5, alpha = 0.8) +
  facet_wrap(~Method, ncol = 3, scales = "fixed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Clustering Results Comparison PCA Projection",
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2])) +
  theme_minimal(base_size = 10) +
  theme(plot.title      = element_text(face = "bold", size = 13),
        strip.text      = element_text(face = "bold"),
        legend.position = "bottom")
PCA Projection: All Clustering Methods

PCA Projection: All Clustering Methods


13. Cluster Interpretation

13.1 Socioeconomic Profiling (K-Means)

Using K-Means as the primary interpretive framework, cluster profiles are generated by computing the mean of each original (unstandardized) variable per cluster, enabling substantive socioeconomic interpretation of group differences across Indonesian provinces.

profile_df <- df_latest %>%
  select(KMeans_Cluster, all_of(names(features))) %>%
  group_by(KMeans_Cluster) %>%
  summarise(across(everything(), mean, na.rm = TRUE), .groups = "drop")

kable(profile_df %>% mutate(across(-KMeans_Cluster, ~round(.x, 2))),
      caption = "Mean Socioeconomic Indicators by K-Means Cluster (Original Scale)") %>%
  kable_styling(bootstrap_options = c("hover"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Mean Socioeconomic Indicators by K-Means Cluster (Original Scale)
KMeans_Cluster APS_13_15 APS_16_18 APS_19_24 TPT_Feb TPT_Agt TPAK_Feb TPAK_Agt GK_Maret GK_Sept JPM_Maret JPM_Sept PPM_Maret PPM_Sept IPM RLS_15Plus HLS
1 80.91 64.15 23.89 3.49 2.67 76.79 77.20 741381.2 708811.0 229.76 161.07 24.36 23.68 63.01 7.74 12.52
2 96.04 75.98 28.48 4.87 4.69 68.79 69.05 632247.1 645927.2 714.71 681.53 9.59 9.13 73.90 9.47 13.41

13.2 Key Indicator Profiles

key_vars   <- c("IPM", "PPM_Maret", "TPT_Feb", "APS_13_15", "GK_Maret", "RLS_15Plus")
var_labels <- c(IPM = "Human Dev. Index (IPM)",
                PPM_Maret  = "Poverty Rate - March (%)",
                TPT_Feb    = "Unemployment Rate - Feb (%)",
                APS_13_15  = "School Participation 13-15",
                GK_Maret   = "Poverty Line - March (Rp)",
                RLS_15Plus = "Avg Schooling Years (15+)")

profile_long <- profile_df %>%
  select(KMeans_Cluster, all_of(key_vars)) %>%
  pivot_longer(-KMeans_Cluster, names_to = "Variable", values_to = "Mean_Value") %>%
  mutate(Variable = var_labels[Variable])

ggplot(profile_long, aes(x = KMeans_Cluster, y = Mean_Value, fill = KMeans_Cluster)) +
  geom_col(alpha = 0.85, color = "white") +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_brewer(palette = "Set1", guide = "none") +
  labs(title = "Cluster Profiles: Key Socioeconomic Indicators",
       subtitle = "Mean values per K-Means cluster; original scale",
       x = "Cluster", y = "Mean Value") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))
Cluster Profiles: Key Socioeconomic Indicators

Cluster Profiles: Key Socioeconomic Indicators

13.3 Cluster Narrative Interpretation

ipm_means <- profile_df$IPM
ppm_means <- profile_df$PPM_Maret
tpt_means <- profile_df$TPT_Feb
gk_means  <- profile_df$GK_Maret

narrative_df <- lapply(1:nrow(profile_df), function(i) {
  cl        <- profile_df$KMeans_Cluster[i]
  provs     <- paste(sort(df_latest$PROVINSI[df_latest$KMeans_Cluster == cl]), collapse = ", ")
  ipm_rank  <- rank(-ipm_means)[i]
  ppm_rank  <- rank(ppm_means)[i]
  dev_level <- ifelse(ipm_rank == 1, "Most Developed",
                ifelse(ipm_rank == nrow(profile_df), "Least Developed", "Developing"))
  pov_level <- ifelse(ppm_rank == 1, "Lowest Poverty",
                ifelse(ppm_rank == nrow(profile_df), "Highest Poverty", "Moderate Poverty"))
  data.frame(
    Cluster       = as.integer(cl),
    Dev_Level     = dev_level,
    Poverty_Level = pov_level,
    N             = sum(df_latest$KMeans_Cluster == cl),
    Mean_IPM      = round(ipm_means[i], 2),
    Mean_PPM      = round(ppm_means[i], 2),
    Mean_TPT      = round(tpt_means[i], 2),
    Mean_GK       = format(round(gk_means[i]), big.mark = "."),
    Provinces     = provs
  )
}) %>% bind_rows()

kable(narrative_df, caption = "Cluster Socioeconomic Narrative (K-Means)",
      col.names = c("Cluster", "Dev. Level", "Poverty Level", "N", "Mean IPM",
                    "Mean PPM (%)", "Mean TPT (%)", "Mean GK (Rp)", "Provinces")) %>%
  kable_styling(bootstrap_options = c("hover"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Cluster Socioeconomic Narrative (K-Means)
Cluster Dev. Level Poverty Level N Mean IPM Mean PPM (%) Mean TPT (%) Mean GK (Rp) Provinces
1 Least Developed Highest Poverty 4 63.01 24.36 3.49 741.381 PAPUA, PAPUA PEGUNUNGAN, PAPUA SELATAN, PAPUA TENGAH
2 Most Developed Lowest Poverty 34 73.90 9.59 4.87 632.247 ACEH, BALI, BANTEN, BENGKULU, DI YOGYAKARTA, DKI JAKARTA, GORONTALO, JAMBI, JAWA BARAT, JAWA TENGAH, JAWA TIMUR, KALIMANTAN BARAT, KALIMANTAN SELATAN, KALIMANTAN TENGAH, KALIMANTAN TIMUR, KALIMANTAN UTARA, KEP. BANGKA BELITUNG, KEP. RIAU, LAMPUNG, MALUKU, MALUKU UTARA, NUSA TENGGARA BARAT, NUSA TENGGARA TIMUR, PAPUA BARAT, PAPUA BARAT DAYA, RIAU, SULAWESI BARAT, SULAWESI SELATAN, SULAWESI TENGAH, SULAWESI TENGGARA, SULAWESI UTARA, SUMATERA BARAT, SUMATERA SELATAN, SUMATERA UTARA

13.4 Final Province Assignment Table

kable(df_latest %>%
        select(PROVINSI, TAHUN, KMeans_Cluster, IPM, PPM_Maret, TPT_Feb, GK_Maret) %>%
        arrange(KMeans_Cluster, desc(IPM)) %>%
        mutate(IPM       = round(IPM, 2),
               PPM_Maret = round(PPM_Maret, 2),
               TPT_Feb   = round(TPT_Feb, 2),
               GK_Maret  = format(GK_Maret, big.mark = ".")),
      caption = "Final Cluster Assignments with Key Indicators",
      col.names = c("Province", "Year", "Cluster", "IPM",
                    "Poverty Rate (%)", "Unemployment (%)", "Poverty Line (Rp)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(height = "500px")
Final Cluster Assignments with Key Indicators
Province Year Cluster IPM Poverty Rate (%) Unemployment (%) Poverty Line (Rp)
PAPUA 2024 1 63.01 17.26 3.49 674.371
PAPUA PEGUNUNGAN 2024 1 63.01 32.97 3.49 1.007.060
PAPUA SELATAN 2024 1 63.01 17.44 3.49 519.979
PAPUA TENGAH 2024 1 63.01 29.76 3.49 764.115
DKI JAKARTA 2024 2 83.55 4.30 7.57 825.288
DI YOGYAKARTA 2024 2 81.09 10.83 3.58 602.437
KEP. RIAU 2024 2 79.08 5.37 7.61 787.211
KALIMANTAN TIMUR 2024 2 78.20 5.78 6.37 833.955
BALI 2024 2 78.01 4.00 3.73 568.510
BANTEN 2024 2 75.77 5.84 7.97 654.213
SUMATERA BARAT 2024 2 75.64 5.97 5.90 708.416
SUMATERA UTARA 2024 2 75.13 7.99 5.24 642.423
SULAWESI UTARA 2024 2 75.04 7.25 6.19 490.719
RIAU 2024 2 74.95 6.67 4.25 697.296
ACEH 2024 2 74.70 14.23 5.75 661.227
KALIMANTAN SELATAN 2024 2 74.66 4.11 3.95 632.739
JAWA TIMUR 2024 2 74.65 9.79 4.33 536.122
SULAWESI SELATAN 2024 2 74.60 8.06 5.26 459.226
BENGKULU 2024 2 74.30 13.56 3.21 671.095
JAWA BARAT 2024 2 74.24 7.46 7.89 524.052
KEP. BANGKA BELITUNG 2024 2 74.09 4.55 3.89 908.397
JAMBI 2024 2 73.73 7.10 4.50 650.115
KALIMANTAN TENGAH 2024 2 73.73 5.17 3.84 623.954
JAWA TENGAH 2024 2 73.39 10.47 5.24 507.001
SUMATERA SELATAN 2024 2 73.18 10.97 4.53 554.197
SULAWESI TENGGARA 2024 2 72.94 11.21 3.66 462.715
KALIMANTAN UTARA 2024 2 72.88 6.32 4.10 854.294
MALUKU 2024 2 72.75 16.05 6.08 713.503
LAMPUNG 2024 2 72.48 10.69 4.18 586.551
NUSA TENGGARA BARAT 2024 2 72.37 12.91 3.73 534.703
SULAWESI TENGAH 2024 2 71.66 11.77 3.49 600.872
GORONTALO 2024 2 71.25 14.57 3.07 473.006
MALUKU UTARA 2024 2 70.98 6.32 4.60 604.460
KALIMANTAN BARAT 2024 2 70.47 6.32 4.52 595.509
SULAWESI BARAT 2024 2 69.80 11.21 3.04 454.879
NUSA TENGGARA TIMUR 2024 2 68.40 19.48 3.10 527.275
PAPUA BARAT 2024 2 67.47 21.66 5.53 793.804
PAPUA BARAT DAYA 2024 2 67.47 18.13 5.53 756.237

14. Conclusions and Recommendations

14.1 Summary of Findings

kable(data.frame(
  Method         = c("K-Means", "PAM", "DBSCAN", "Mean Shift", "FCM"),
  N_Clusters     = c(final_k, final_k, n_clusters_db, n_ms_clusters, final_k),
  Characteristic = c(
    "Fast; sensitive to outliers; spherical cluster assumption",
    "Robust to outliers; uses actual data points as representatives",
    "Detects arbitrary shapes; identifies noise/outlier provinces",
    "Non-parametric; auto-determines k via kernel density modes",
    "Soft assignments; captures membership ambiguity across clusters"
  )
), caption = "Algorithm Comparison Summary",
   col.names = c("Method", "Clusters Found", "Key Characteristic")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE)
Algorithm Comparison Summary
Method Clusters Found Key Characteristic
K-Means 2 Fast; sensitive to outliers; spherical cluster assumption
PAM 2 Robust to outliers; uses actual data points as representatives
DBSCAN 1 Detects arbitrary shapes; identifies noise/outlier provinces
Mean Shift 18 Non-parametric; auto-determines k via kernel density modes
FCM 2 Soft assignments; captures membership ambiguity across clusters
cat(sprintf("Overall best method (internal validation): %s\n", overall_best))
## Overall best method (internal validation): K-Means
cat(sprintf("K-Means bootstrap stability: %s (score = %.4f)\n", stab_label, mean(boot_stab)))
## K-Means bootstrap stability: High Stability (score = 0.9069)

14.2 Policy Recommendations

Based on the clustering results, the following data-driven policy recommendations are proposed for each provincial development group.

kable(lapply(1:nrow(profile_df), function(i) {
  cl  <- profile_df$KMeans_Cluster[i]
  ipm <- ipm_means[i]
  ppm <- ppm_means[i]
  rec <- if (ipm >= quantile(ipm_means, 0.66)) {
    "Sustain development momentum; promote as model province; reduce intra-cluster inequality."
  } else if (ipm >= quantile(ipm_means, 0.33)) {
    "Accelerate education quality (RLS & HLS); expand labor market programs; strengthen social protection."
  } else {
    "Priority intervention: basic education access, poverty alleviation, affirmative budget allocation."
  }
  data.frame(Cluster = as.integer(cl), Mean_IPM = round(ipm, 2),
             Mean_PPM = round(ppm, 2), Recommendation = rec)
}) %>% bind_rows(),
caption = "Policy Recommendations by Cluster Development Profile",
col.names = c("Cluster", "Mean IPM", "Mean Poverty Rate (%)", "Recommendation")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE)
Policy Recommendations by Cluster Development Profile
Cluster Mean IPM Mean Poverty Rate (%) Recommendation
1 63.01 24.36 Priority intervention: basic education access, poverty alleviation, affirmative budget allocation.
2 73.90 9.59 Sustain development momentum; promote as model province; reduce intra-cluster inequality.

14.3 Methodological Limitations

# Limitation Implication
1 Cross-sectional analysis (latest year only) Temporal dynamics in development trajectories are not captured
2 K-Means assumes spherical, equally-sized clusters Assumption likely violated in heterogeneous regional data
3 DBSCAN is sensitive to ε and MinPts Small parameter changes can substantially alter results
4 Mean Shift applied on PCA-reduced space Information in lower-ranked components may be lost
5 Highly correlated variables included Certain dimensions may be implicitly over-weighted
6 IPM used alongside its sub-components Multicollinearity between IPM, RLS, and HLS

Document generated using R Markdown. All analyses conducted in R (version >= 4.1.0). Reproducibility ensured via set.seed(42) in all stochastic procedures.