1. Introduction

This report presents a comprehensive clustering analysis of Indonesian provincial socioeconomic indicators. Clustering, as an unsupervised machine learning technique, aims to partition observations into groups such that observations within the same group are more similar to each other than to those in other groups. The dataset encompasses 151 observations across 34 Indonesian provinces over multiple years, covering indicators related to education participation, labor market performance, poverty lines, poverty rates, and human development indices.

Five clustering methodologies are employed and systematically compared in this study:

K-Means a centroid-based partitional algorithm
K-Medoids (PAM) a medoid-based robust alternative to K-Means
DBSCAN a density-based spatial clustering algorithm capable of detecting arbitrarily-shaped clusters and noise
Mean Shift a non-parametric, kernel-density-based algorithm that does not require pre-specifying the number of clusters
Fuzzy C-Means (FCM) a soft-partitioning algorithm that allows each observation to belong to multiple clusters with varying degrees of membership

The analysis follows a structured pipeline: data loading → exploratory data analysis → preprocessing → optimal cluster determination → clustering execution → validation → interpretation.

2. Data Loading and Initial Inspection

Prior to any analytical procedure, it is essential to examine the raw structure of the dataset to identify potential issues such as missing values, data type mismatches, and distributional anomalies.

if (file.exists("data_5_clustering.xlsx")) {
  df_raw <- read_excel("data_5_clustering.xlsx")
} else if (file.exists("data_5_clustering.csv")) {
  df_raw <- read.csv("data_5_clustering.csv", stringsAsFactors = FALSE)
} else {
  stop("Dataset not found. Place data_5_clustering.xlsx or .csv in the working directory.")
}

colnames(df_raw) <- c(
  "PROVINSI", "TAHUN",
  "APS_13_15", "APS_16_18", "APS_19_24",
  "TPT_Feb", "TPT_Agt",
  "TPAK_Feb", "TPAK_Agt",
  "GK_Maret", "GK_Sept",
  "JPM_Maret", "JPM_Sept",
  "PPM_Maret", "PPM_Sept",
  "IPM", "RLS_15Plus", "HLS"
)

cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))

## Rows: 151 | Columns: 18

str(df_raw)

## 'data.frame':    151 obs. of  18 variables:
##  $ PROVINSI  : chr  "SUMATERA UTARA" "SUMATERA BARAT" "RIAU" "JAMBI" ...
##  $ TAHUN     : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
##  $ APS_13_15 : num  97 96.6 95.7 96.4 94.8 ...
##  $ APS_16_18 : num  78.7 84.1 77.8 72.5 71.5 ...
##  $ APS_19_24 : num  27.1 36.4 28.8 24.1 18.8 ...
##  $ TPT_Feb   : num  6.01 6.67 4.96 4.76 5.17 ...
##  $ TPT_Agt   : num  6.33 6.52 4.42 5.09 4.98 3.65 4.69 5.03 9.91 8.5 ...
##  $ TPAK_Feb  : num  69.4 68.4 65.8 67.3 70 ...
##  $ TPAK_Agt  : num  69.1 67.7 65 67.2 68.8 ...
##  $ GK_Maret  : int  525756 568703 565937 506355 457455 548934 471439 752203 642425 697638 ...
##  $ GK_Sept   : int  648336 714991 702620 658100 564462 672816 599018 917673 807602 846085 ...
##  $ JPM_Maret : num  1344 371 501 294 1114 ...
##  $ JPM_Sept  : num  1111 315 473 273 949 ...
##  $ PPM_Maret : num  9.01 6.63 7.12 8.09 12.84 ...
##  $ PPM_Sept  : num  7.19 5.42 6.36 7.26 10.51 ...
##  $ IPM       : num  73.8 74.6 73.9 72.6 71.8 ...
##  $ RLS_15Plus: num  9.88 9.46 9.52 9.03 8.78 ...
##  $ HLS       : num  13.3 14.1 13.3 13 12.5 ...

missing_summary <- data.frame(
  Column   = names(df_raw),
  Missing  = sapply(df_raw, function(x) sum(is.na(x))),
  Pct_Miss = sapply(df_raw, function(x) round(mean(is.na(x)) * 100, 2))
)
rownames(missing_summary) <- NULL

kable(missing_summary, caption = "Missing Values per Variable",
      col.names = c("Variable", "Missing Count", "Missing (%)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)

Missing Values per Variable
Variable	Missing Count	Missing (%)
PROVINSI	0	0
TAHUN	0	0
APS_13_15	0	0
APS_16_18	0	0
APS_19_24	0	0
TPT_Feb	0	0
TPT_Agt	0	0
TPAK_Feb	0	0
TPAK_Agt	0	0
GK_Maret	0	0
GK_Sept	0	0
JPM_Maret	0	0
JPM_Sept	0	0
PPM_Maret	0	0
PPM_Sept	0	0
IPM	0	0
RLS_15Plus	0	0
HLS	0	0

kable(head(df_raw), caption = "Preview: First 6 Rows of Dataset") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Preview: First 6 Rows of Dataset
PROVINSI	TAHUN	APS_13_15	APS_16_18	APS_19_24	TPT_Feb	TPT_Agt	TPAK_Feb	TPAK_Agt	GK_Maret	GK_Sept	JPM_Maret	JPM_Sept	PPM_Maret	PPM_Sept	IPM	RLS_15Plus	HLS
SUMATERA UTARA	2021	96.99	78.66	27.05	6.01	6.33	69.39	69.10	525756	648336	1343.86	1110.92	9.01	7.19	73.84	9.88	13.27
SUMATERA BARAT	2021	96.63	84.07	36.41	6.67	6.52	68.41	67.72	568703	714991	370.67	315.43	6.63	5.42	74.56	9.46	14.09
RIAU	2021	95.66	77.81	28.79	4.96	4.42	65.81	65.03	565937	702620	500.81	473.04	7.12	6.36	73.89	9.52	13.28
JAMBI	2021	96.39	72.50	24.14	4.76	5.09	67.30	67.17	506355	658100	293.86	272.70	8.09	7.26	72.62	9.03	13.04
SUMATERA SELATAN	2021	94.85	71.53	18.81	5.17	4.98	69.95	68.77	457455	564462	1113.76	948.84	12.84	10.51	71.83	8.78	12.54
BENGKULU	2021	97.49	79.75	30.46	3.72	3.65	71.74	69.75	548934	672816	306.00	261.15	15.22	12.52	73.16	9.26	13.67

3. Exploratory Data Analysis (EDA)

3.1 Descriptive Statistics

Descriptive statistics provide a foundational understanding of the central tendency, dispersion, and shape of each variable’s distribution. Examining skewness and kurtosis prior to multivariate analysis is critical for assessing distributional assumptions. Variables with absolute skewness exceeding 2.0 or kurtosis exceeding 7.0 may warrant transformation.

numeric_vars <- df_raw %>% select(where(is.numeric)) %>% select(-TAHUN)

desc_stats <- data.frame(
  Variable = names(numeric_vars),
  N        = sapply(numeric_vars, function(x) sum(!is.na(x))),
  Mean     = sapply(numeric_vars, mean, na.rm = TRUE),
  SD       = sapply(numeric_vars, sd, na.rm = TRUE),
  Min      = sapply(numeric_vars, min, na.rm = TRUE),
  Q1       = sapply(numeric_vars, quantile, 0.25, na.rm = TRUE),
  Median   = sapply(numeric_vars, median, na.rm = TRUE),
  Q3       = sapply(numeric_vars, quantile, 0.75, na.rm = TRUE),
  Max      = sapply(numeric_vars, max, na.rm = TRUE),
  Skewness = sapply(numeric_vars, skewness, na.rm = TRUE),
  Kurtosis = sapply(numeric_vars, kurtosis, na.rm = TRUE)
)
rownames(desc_stats) <- NULL

kable(desc_stats %>% mutate(across(where(is.numeric), ~round(.x, 3))),
      caption = "Descriptive Statistics of Numeric Variables") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Descriptive Statistics of Numeric Variables
Variable	N	Mean	SD	Min	Q1	Median	Q3	Max	Skewness	Kurtosis
APS_13_15	151	94.392	5.109	80.02	93.270	95.93	97.615	99.43	-1.871	5.451
APS_16_18	151	74.570	6.390	63.98	69.915	74.07	79.280	91.17	0.304	2.519
APS_19_24	151	27.616	6.132	17.59	23.740	26.42	30.740	51.60	1.541	6.916
TPT_Feb	151	4.953	1.524	3.04	3.730	4.53	5.785	10.12	1.006	3.420
TPT_Agt	151	4.765	1.615	2.27	3.390	4.56	5.840	9.91	0.706	3.286
TPAK_Feb	151	69.487	4.032	61.97	66.500	69.36	71.720	80.23	0.576	3.034
TPAK_Agt	151	69.577	4.125	62.15	66.365	69.31	71.055	78.29	0.541	2.585
GK_Maret	151	583860.291	123126.095	364251.00	490047.000	570368.00	663415.500	1007060.00	0.583	3.135
GK_Sept	151	652458.444	119598.356	460283.00	547751.000	644107.00	708811.000	917673.00	0.345	2.370
JPM_Maret	151	747.771	1026.577	47.83	191.390	350.25	915.150	4572.73	2.584	8.730
JPM_Sept	151	626.133	940.524	41.11	161.070	272.70	711.770	3893.82	2.648	8.912
PPM_Maret	151	11.705	6.769	4.00	6.455	10.47	14.595	32.97	1.104	3.374
PPM_Sept	151	10.652	6.322	3.80	5.700	9.56	12.640	29.66	1.325	4.280
IPM	151	72.264	4.738	61.40	70.470	72.96	74.650	83.55	-0.442	3.412
RLS_15Plus	151	9.120	1.022	5.10	8.515	9.22	9.850	11.49	-0.686	4.235
HLS	151	13.133	0.884	9.97	12.790	13.17	13.645	15.70	-0.450	5.280

skew_df <- data.frame(
  Variable  = names(numeric_vars),
  Skewness  = round(sapply(numeric_vars, skewness, na.rm = TRUE), 3)
) %>%
  mutate(
    Direction      = ifelse(Skewness > 0, "Positive (Right)", "Negative (Left)"),
    Interpretation = ifelse(abs(Skewness) < 0.5, "Approximately Symmetric",
                     ifelse(abs(Skewness) < 1.0, "Moderately Skewed",
                     ifelse(abs(Skewness) < 2.0, "Highly Skewed", "Extremely Skewed")))
  )

kable(skew_df, caption = "Skewness Interpretation per Variable") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)

Skewness Interpretation per Variable
	Variable	Skewness	Direction	Interpretation
APS_13_15	APS_13_15	-1.871	Negative (Left)	Highly Skewed
APS_16_18	APS_16_18	0.304	Positive (Right)	Approximately Symmetric
APS_19_24	APS_19_24	1.541	Positive (Right)	Highly Skewed
TPT_Feb	TPT_Feb	1.006	Positive (Right)	Highly Skewed
TPT_Agt	TPT_Agt	0.706	Positive (Right)	Moderately Skewed
TPAK_Feb	TPAK_Feb	0.576	Positive (Right)	Moderately Skewed
TPAK_Agt	TPAK_Agt	0.541	Positive (Right)	Moderately Skewed
GK_Maret	GK_Maret	0.583	Positive (Right)	Moderately Skewed
GK_Sept	GK_Sept	0.345	Positive (Right)	Approximately Symmetric
JPM_Maret	JPM_Maret	2.584	Positive (Right)	Extremely Skewed
JPM_Sept	JPM_Sept	2.648	Positive (Right)	Extremely Skewed
PPM_Maret	PPM_Maret	1.104	Positive (Right)	Highly Skewed
PPM_Sept	PPM_Sept	1.325	Positive (Right)	Highly Skewed
IPM	IPM	-0.442	Negative (Left)	Approximately Symmetric
RLS_15Plus	RLS_15Plus	-0.686	Negative (Left)	Moderately Skewed
HLS	HLS	-0.450	Negative (Left)	Approximately Symmetric

3.2 Distribution Visualization

df_long <- numeric_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(df_long, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "#2E86AB", color = "white", alpha = 0.7) +
  geom_density(color = "#E84855", linewidth = 0.8) +
  facet_wrap(~Variable, scales = "free", ncol = 4) +
  labs(title = "Distribution of Socioeconomic Variables",
       subtitle = "Histogram overlaid with kernel density estimate",
       x = NULL, y = "Density") +
  theme_minimal(base_size = 10) +
  theme(strip.text = element_text(face = "bold", size = 8),
        plot.title = element_text(face = "bold", size = 13))

Histogram and Density Plots of All Numeric Variables

3.3 Boxplot Analysis

Boxplots enable the simultaneous visualization of central tendency, dispersion, and outlier presence across all variables. Outliers, defined as values beyond 1.5 × IQR from the quartile boundaries, may distort centroid-based algorithms such as K-Means and warrant careful consideration.

df_scaled_long <- as.data.frame(scale(numeric_vars)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(df_scaled_long, aes(x = reorder(Variable, Value, median), y = Value, fill = Variable)) +
  geom_boxplot(outlier.shape = 21, outlier.fill = "red", outlier.size = 1.5, alpha = 0.7) +
  coord_flip() +
  scale_fill_viridis_d(option = "turbo", guide = "none") +
  labs(title = "Boxplot of Standardized Variables",
       subtitle = "Z-score normalized; red dots indicate outliers",
       x = NULL, y = "Standardized Value (Z-score)") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"))

Boxplots of Standardized Variables

3.4 Correlation Analysis

Correlation analysis identifies linear relationships among variables. In cluster analysis, multicollinearity is less problematic than in regression, but understanding inter-variable relationships aids in interpreting cluster profiles and selecting features.

cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 2.5,
           colors   = c("#E84855", "white", "#2E86AB"),
           title    = "Pearson Correlation Matrix",
           ggtheme  = theme_minimal(base_size = 9)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
        axis.text.y = element_text(size = 7),
        plot.title  = element_text(face = "bold"))

Correlation Matrix Heatmap

cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "r")
high_cor <- cor_df %>%
  filter(as.character(Var1) < as.character(Var2), abs(r) > 0.80) %>%
  arrange(desc(abs(r))) %>%
  mutate(r = round(r, 3))

kable(high_cor, caption = "Variable Pairs with |r| > 0.80 (High Correlation)",
      col.names = c("Variable 1", "Variable 2", "r")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Variable Pairs with |r| > 0.80 (High Correlation)
Variable 1	Variable 2	r
JPM_Maret	JPM_Sept	0.979
PPM_Maret	PPM_Sept	0.967
TPT_Agt	TPT_Feb	0.953
TPAK_Agt	TPAK_Feb	0.916
GK_Maret	GK_Sept	0.889
IPM	PPM_Maret	-0.814

3.5 Temporal Overview

yearly_summary <- df_raw %>%
  group_by(TAHUN) %>%
  summarise(Mean_IPM = mean(IPM, na.rm = TRUE),
            Mean_PPM = mean(PPM_Maret, na.rm = TRUE),
            .groups  = "drop")

p1 <- ggplot(yearly_summary, aes(x = factor(TAHUN), y = Mean_IPM, group = 1)) +
  geom_line(color = "#2E86AB", linewidth = 1.2) +
  geom_point(color = "#2E86AB", size = 3) +
  labs(title = "Average Human Development Index (IPM) by Year",
       x = "Year", y = "Mean IPM") +
  theme_minimal()

p2 <- ggplot(yearly_summary, aes(x = factor(TAHUN), y = Mean_PPM, group = 1)) +
  geom_line(color = "#E84855", linewidth = 1.2) +
  geom_point(color = "#E84855", size = 3) +
  labs(title = "Average Poverty Rate - March by Year",
       x = "Year", y = "Mean PPM (%)") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

Mean IPM and Poverty Rate by Year

4. Data Preprocessing

4.1 Feature Selection

For this clustering analysis, the most recent year of data is selected for each province to obtain a cross-sectional snapshot. All numeric socioeconomic indicators are retained as clustering features, excluding the identifier variables PROVINSI and TAHUN.

df_latest <- df_raw %>%
  group_by(PROVINSI) %>%
  slice_max(order_by = TAHUN, n = 1) %>%
  ungroup()

features <- df_latest %>% select(-PROVINSI, -TAHUN)

cat(sprintf("Provinces: %d | Features: %d\n", nrow(df_latest), ncol(features)))

## Provinces: 38 | Features: 16

4.2 Standardization

Standardization (Z-score normalization) is a critical preprocessing step in distance-based clustering algorithms. Without standardization, variables measured on larger scales (e.g., Poverty Line in Rupiah) will disproportionately influence distance calculations relative to variables on smaller scales.

features_scaled <- as.data.frame(scale(features))
rownames(features_scaled) <- df_latest$PROVINSI

cat(sprintf("Mean of scaled data : %.8f (expected = 0)\n", mean(as.matrix(features_scaled))))

## Mean of scaled data : 0.00000000 (expected = 0)

cat(sprintf("SD of scaled data   : %.8f (expected = 1)\n", sd(as.matrix(features_scaled))))

## SD of scaled data   : 0.98756686 (expected = 1)

4.3 PCA for Visualization

Principal Component Analysis (PCA) is applied here solely for visualization purposes projecting high-dimensional cluster solutions into two-dimensional space without loss of interpretive clarity.

pca_result  <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
pca_var     <- pca_result$sdev^2
pca_var_pct <- pca_var / sum(pca_var) * 100
pca_cumvar  <- cumsum(pca_var_pct)

pca_scree <- data.frame(PC = 1:length(pca_var_pct), Var = pca_var_pct, Cum = pca_cumvar)

ggplot(pca_scree[1:10, ], aes(x = PC)) +
  geom_col(aes(y = Var), fill = "#2E86AB", alpha = 0.8) +
  geom_line(aes(y = Cum), color = "#E84855", linewidth = 1.2, group = 1) +
  geom_point(aes(y = Cum), color = "#E84855", size = 3) +
  geom_hline(yintercept = 80, linetype = "dashed", color = "gray40") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "PCA Scree Plot",
       subtitle = "Bars = individual variance explained; Line = cumulative variance",
       x = "Principal Component", y = "Variance Explained (%)") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

PCA Scree Plot

cat(sprintf("PC1: %.1f%% | PC2: %.1f%% | Combined: %.1f%%\n",
            pca_var_pct[1], pca_var_pct[2], pca_cumvar[2]))

## PC1: 39.2% | PC2: 17.7% | Combined: 56.9%

pcs <- as.data.frame(pca_result$x[, 1:2])
pcs$PROVINSI <- df_latest$PROVINSI

5. Optimal Number of Clusters

5.1 Elbow Method (WSS)

The Elbow Method evaluates the Within-Cluster Sum of Squares (WSS) across a range of k values. The optimal k is identified at the “elbow” point where the rate of decrease in WSS sharply decelerates, indicating diminishing returns from adding more clusters.

set.seed(42)
wss_values <- sapply(1:10, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
  geom_line(color = "#2E86AB", linewidth = 1.2) +
  geom_point(color = "#2E86AB", size = 3.5) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Elbow Method for Optimal k",
       subtitle = "Within-Cluster Sum of Squares vs. Number of Clusters",
       x = "Number of Clusters (k)", y = "Total WSS") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Elbow Method

elbow_k <- which.min(diff(diff(wss_values))) + 1
cat(sprintf("Elbow heuristic suggests k = %d\n", elbow_k))

## Elbow heuristic suggests k = 7

5.2 Silhouette Method

The Silhouette Width measures how similar an observation is to its own cluster compared to other clusters. Values range from -1 (poor fit) to +1 (excellent fit). The optimal k maximizes the average silhouette width.

set.seed(42)
sil_values <- sapply(2:10, function(k) {
  km <- kmeans(features_scaled, centers = k, nstart = 25, iter.max = 100)
  mean(silhouette(km$cluster, dist(features_scaled))[, 3])
})

sil_df     <- data.frame(k = 2:10, Silhouette = sil_values)
best_sil_k <- sil_df$k[which.max(sil_df$Silhouette)]

ggplot(sil_df, aes(x = k, y = Silhouette)) +
  geom_line(color = "#E84855", linewidth = 1.2) +
  geom_point(color = "#E84855", size = 3.5) +
  geom_vline(xintercept = best_sil_k, linetype = "dashed", color = "gray40") +
  scale_x_continuous(breaks = 2:10) +
  labs(title = "Silhouette Method for Optimal k",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Average Silhouette Width

cat(sprintf("Optimal k by Silhouette: k = %d (avg sil = %.4f)\n", best_sil_k, max(sil_values)))

## Optimal k by Silhouette: k = 2 (avg sil = 0.4024)

5.3 Gap Statistic

The Gap Statistic compares within-cluster dispersion against its expected value under a null reference distribution. The optimal k is where the gap statistic is largest, indicating that the observed clustering structure substantially exceeds chance-level structure.

set.seed(42)
gap_stat   <- clusGap(features_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
best_gap_k <- which.max(gap_stat$Tab[, "gap"])

fviz_gap_stat(gap_stat) +
  labs(title = "Gap Statistic for Optimal k") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Gap Statistic

cat(sprintf("Optimal k by Gap Statistic: k = %d\n", best_gap_k))

## Optimal k by Gap Statistic: k = 10

5.4 Summary: Optimal k Selection

k_votes <- c(elbow_k, best_sil_k, best_gap_k)
final_k  <- as.integer(names(sort(table(k_votes), decreasing = TRUE))[1])

kable(data.frame(
  Method    = c("Elbow Method", "Silhouette Method", "Gap Statistic", "Final (Majority Vote)"),
  Optimal_k = c(elbow_k, best_sil_k, best_gap_k, final_k)
), caption = "Optimal k Determination Summary",
   col.names = c("Method", "Suggested k")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Optimal k Determination Summary
Method	Suggested k
Elbow Method	7
Silhouette Method	2
Gap Statistic	10
Final (Majority Vote)	2

6. K-Means Clustering

6.1 Methodology

K-Means is one of the most widely used partitional clustering algorithms. The algorithm iteratively assigns each observation to the nearest centroid using Euclidean distance and updates centroids as the mean of assigned observations, minimizing total within-cluster sum of squared deviations. To mitigate sensitivity to random initialization, 25 random restarts (nstart = 25) are employed and the solution with the lowest total WSS is retained.

set.seed(42)
km_result <- kmeans(features_scaled, centers = final_k, nstart = 25, iter.max = 200)

df_latest$KMeans_Cluster <- factor(km_result$cluster)

kable(data.frame(
  Metric = c("Number of Clusters", "Total WSS", "Between-SS", "Total SS", "BSS/TSS (%)"),
  Value  = c(final_k,
             round(km_result$tot.withinss, 4),
             round(km_result$betweenss, 4),
             round(km_result$totss, 4),
             round(km_result$betweenss / km_result$totss * 100, 2))
), caption = "K-Means Summary Statistics", col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

K-Means Summary Statistics
Metric	Value
Number of Clusters	2.0000
Total WSS	435.6562
Between-SS	156.3438
Total SS	592.0000
BSS/TSS (%)	26.4100

kable(data.frame(Cluster = 1:final_k, Size = km_result$size),
      caption = "K-Means Cluster Sizes") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

K-Means Cluster Sizes
Cluster	Size
1	4
2	34

6.2 Cluster Visualization (PCA Space)

pcs$KMeans <- factor(km_result$cluster)

ggplot(pcs, aes(x = PC1, y = PC2, color = KMeans, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  stat_ellipse(aes(fill = KMeans), geom = "polygon", alpha = 0.1, level = 0.90) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = paste0("K-Means Clustering (k = ", final_k, ") PCA Projection"),
       subtitle = "90% confidence ellipses; labels = province names",
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster", fill = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

K-Means Clusters PCA Projection

6.3 Cluster Centroids

km_centers_orig <- as.data.frame(km_result$centers)
names(km_centers_orig) <- names(features)

kable(round(km_centers_orig, 3), caption = "K-Means Cluster Centroids (Standardized Scale)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
  scroll_box(width = "100%")

K-Means Cluster Centroids (Standardized Scale)
APS_13_15	APS_16_18	APS_19_24	TPT_Feb	TPT_Agt	TPAK_Feb	TPAK_Agt	GK_Maret	GK_Sept	JPM_Maret	JPM_Sept	PPM_Maret	PPM_Sept	IPM	RLS_15Plus	HLS
-2.629	-1.605	-0.669	-0.870	-1.249	1.793	1.804	0.728	0.467	-0.437	-0.492	1.958	2.043	-2.076	-1.351	-0.958
0.309	0.189	0.079	0.102	0.147	-0.211	-0.212	-0.086	-0.055	0.051	0.058	-0.230	-0.240	0.244	0.159	0.113

km_centers_long <- km_centers_orig %>%
  mutate(Cluster = factor(1:nrow(.))) %>%
  pivot_longer(-Cluster, names_to = "Variable", values_to = "Value")

ggplot(km_centers_long, aes(x = reorder(Variable, Value), y = Value, fill = Cluster)) +
  geom_col(position = "dodge", alpha = 0.85) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "K-Means Cluster Centroids by Variable",
       subtitle = "Standardized values",
       x = NULL, y = "Standardized Centroid Value", fill = "Cluster") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"))

K-Means Cluster Centroid Profile

6.4 Silhouette Analysis

km_sil     <- silhouette(km_result$cluster, dist(features_scaled))
avg_sil_km <- mean(km_sil[, 3])

fviz_silhouette(km_sil, palette = "Set1") +
  labs(title = paste0("Silhouette Plot K-Means (k = ", final_k, ")")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

##   cluster size ave.sil.width
## 1       1    4          0.36
## 2       2   34          0.41

Silhouette Plot K-Means

sil_label_km <- ifelse(avg_sil_km >= 0.70, "Strong Structure",
                ifelse(avg_sil_km >= 0.50, "Reasonable Structure",
                ifelse(avg_sil_km >= 0.25, "Weak Structure", "No Substantial Structure")))
cat(sprintf("Average Silhouette Width: %.4f %s\n", avg_sil_km, sil_label_km))

## Average Silhouette Width: 0.4024 Weak Structure

6.5 Province Assignment

kable(df_latest %>% select(PROVINSI, TAHUN, KMeans_Cluster) %>%
        arrange(KMeans_Cluster, PROVINSI),
      caption = "Province Cluster Assignments K-Means",
      col.names = c("Province", "Year", "Cluster")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  scroll_box(height = "400px")

Province Cluster Assignments K-Means
Province	Year	Cluster
PAPUA	2024	1
PAPUA PEGUNUNGAN	2024	1
PAPUA SELATAN	2024	1
PAPUA TENGAH	2024	1
ACEH	2024	2
BALI	2024	2
BANTEN	2024	2
BENGKULU	2024	2
DI YOGYAKARTA	2024	2
DKI JAKARTA	2024	2
GORONTALO	2024	2
JAMBI	2024	2
JAWA BARAT	2024	2
JAWA TENGAH	2024	2
JAWA TIMUR	2024	2
KALIMANTAN BARAT	2024	2
KALIMANTAN SELATAN	2024	2
KALIMANTAN TENGAH	2024	2
KALIMANTAN TIMUR	2024	2
KALIMANTAN UTARA	2024	2
KEP. BANGKA BELITUNG	2024	2
KEP. RIAU	2024	2
LAMPUNG	2024	2
MALUKU	2024	2
MALUKU UTARA	2024	2
NUSA TENGGARA BARAT	2024	2
NUSA TENGGARA TIMUR	2024	2
PAPUA BARAT	2024	2
PAPUA BARAT DAYA	2024	2
RIAU	2024	2
SULAWESI BARAT	2024	2
SULAWESI SELATAN	2024	2
SULAWESI TENGAH	2024	2
SULAWESI TENGGARA	2024	2
SULAWESI UTARA	2024	2
SUMATERA BARAT	2024	2
SUMATERA SELATAN	2024	2
SUMATERA UTARA	2024	2

7. K-Medoids (PAM) Clustering

7.1 Methodology

K-Medoids, implemented via the Partitioning Around Medoids (PAM) algorithm, selects actual data points (medoids) as cluster representatives rather than computed centroids. This approach minimizes the sum of dissimilarities and is significantly more resistant to outliers and noise than K-Means a particularly desirable property when working with heterogeneous regional socioeconomic data.

set.seed(42)
pam_result <- pam(features_scaled, k = final_k, metric = "euclidean")

df_latest$PAM_Cluster <- factor(pam_result$clustering)
pcs$PAM               <- factor(pam_result$clustering)

kable(data.frame(
  Cluster = 1:final_k,
  Medoid  = rownames(features_scaled)[pam_result$id.med],
  Size    = as.integer(table(pam_result$clustering))
), caption = "PAM Medoids and Cluster Sizes",
   col.names = c("Cluster", "Medoid Province", "Size")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

PAM Medoids and Cluster Sizes
Cluster	Medoid Province	Size
1	JAMBI	33
2	PAPUA SELATAN	5

7.2 PAM Visualization

ggplot(pcs, aes(x = PC1, y = PC2, color = PAM, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  stat_ellipse(aes(fill = PAM), geom = "polygon", alpha = 0.1, level = 0.90) +
  scale_color_brewer(palette = "Set2") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = paste0("K-Medoids (PAM) Clustering (k = ", final_k, ") PCA Projection"),
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster", fill = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

PAM Clusters PCA Projection

7.3 PAM Silhouette Analysis

pam_sil     <- silhouette(pam_result$clustering, dist(features_scaled))
avg_sil_pam <- mean(pam_sil[, 3])

fviz_silhouette(pam_sil, palette = "Set2") +
  labs(title = paste0("Silhouette Plot PAM (k = ", final_k, ")")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

##   cluster size ave.sil.width
## 1       1   33          0.37
## 2       2    5          0.26

Silhouette Plot PAM

sil_label_pam <- ifelse(avg_sil_pam >= 0.70, "Strong Structure",
                  ifelse(avg_sil_pam >= 0.50, "Reasonable Structure",
                  ifelse(avg_sil_pam >= 0.25, "Weak Structure", "No Substantial Structure")))
cat(sprintf("Average Silhouette Width (PAM): %.4f %s\n", avg_sil_pam, sil_label_pam))

## Average Silhouette Width (PAM): 0.3588 Weak Structure

7.4 K-Means vs PAM Agreement

ari_kmpam <- adjustedRandIndex(km_result$cluster, pam_result$clustering)
ari_label <- ifelse(ari_kmpam >= 0.80, "Near-Perfect Agreement",
              ifelse(ari_kmpam >= 0.60, "Strong Agreement",
              ifelse(ari_kmpam >= 0.40, "Moderate Agreement", "Weak Agreement")))

kable(data.frame(Metric = c("Adjusted Rand Index (ARI)", "Interpretation"),
                 Value  = c(round(ari_kmpam, 4), ari_label)),
      caption = "K-Means vs PAM Agreement",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

K-Means vs PAM Agreement
Metric	Value
Adjusted Rand Index (ARI)	0.844
Interpretation	Near-Perfect Agreement

8. DBSCAN Clustering

8.1 Methodology

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies clusters as regions of high point density separated by regions of lower density. Unlike partitional methods, DBSCAN does not require k to be specified a priori and is capable of discovering clusters of arbitrary shape. Two key hyperparameters must be tuned: ε (epsilon), the neighborhood radius; and MinPts, the minimum number of points required to form a dense region. Points unreachable from any core point are labeled as noise.

8.2 Epsilon Selection (k-NN Distance Plot)

min_pts    <- max(4, 2 * ncol(features_scaled))
knn_dist   <- kNNdist(as.matrix(features_scaled), k = min_pts)
knn_sorted <- if (is.matrix(knn_dist)) sort(knn_dist[, min_pts]) else sort(knn_dist)

ggplot(data.frame(Index = seq_along(knn_sorted), Distance = knn_sorted),
       aes(x = Index, y = Distance)) +
  geom_line(color = "#2E86AB", linewidth = 1) +
  geom_hline(yintercept = quantile(knn_sorted, 0.9), color = "red",
             linetype = "dashed", linewidth = 0.8) +
  labs(title = paste0(min_pts, "-NN Distance Plot for DBSCAN ε Selection"),
       subtitle = "Red dashed line = 90th percentile (suggested ε)",
       x = "Points (sorted)", y = paste0(min_pts, "-NN Distance")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

k-NN Distance Plot for DBSCAN ε Selection

eps_val <- round(quantile(knn_sorted, 0.9), 2)
cat(sprintf("MinPts = %d | Suggested ε = %.4f\n", min_pts, eps_val))

## MinPts = 32 | Suggested ε = 8.2000

8.3 DBSCAN Execution

set.seed(42)
db_result <- dbscan(as.matrix(features_scaled), eps = eps_val, minPts = min_pts)

df_latest$DBSCAN_Cluster <- factor(db_result$cluster)
pcs$DBSCAN               <- factor(db_result$cluster)

n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise_db    <- sum(db_result$cluster == 0)

kable(as.data.frame(table(Cluster = db_result$cluster)),
      caption = "DBSCAN Cluster Sizes (Cluster 0 = Noise)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

DBSCAN Cluster Sizes (Cluster 0 = Noise)
Cluster	Freq
1	38

cat(sprintf("Clusters found: %d | Noise points: %d (%.1f%%)\n",
            n_clusters_db, n_noise_db, n_noise_db / nrow(features_scaled) * 100))

## Clusters found: 1 | Noise points: 0 (0.0%)

if (n_noise_db > 0) {
  cat("Noise provinces:", paste(df_latest$PROVINSI[db_result$cluster == 0], collapse = ", "), "\n")
}

8.4 DBSCAN Visualization

pcs_db <- pcs
pcs_db$DBSCAN_label <- ifelse(as.character(pcs_db$DBSCAN) == "0", "Noise",
                               paste0("Cluster ", pcs_db$DBSCAN))

n_pal      <- max(3, n_clusters_db)
db_pal     <- setNames(brewer.pal(n_pal, "Dark2")[1:n_clusters_db],
                       paste0("Cluster ", 1:n_clusters_db))
color_vals <- c("Noise" = "gray60", db_pal)
shape_vals <- c("Noise" = 4, setNames(rep(16, n_clusters_db),
                                      paste0("Cluster ", 1:n_clusters_db)))

ggplot(pcs_db, aes(x = PC1, y = PC2, color = DBSCAN_label,
                    shape = DBSCAN_label, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  scale_color_manual(values = color_vals) +
  scale_shape_manual(values = shape_vals) +
  labs(title = "DBSCAN Clustering PCA Projection",
       subtitle = paste0("ε = ", eps_val, ", MinPts = ", min_pts, "; x = noise"),
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Group", shape = "Group") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

DBSCAN Clusters PCA Projection

8.5 DBSCAN Parameter Sensitivity

eps_grid    <- round(seq(eps_val * 0.6, eps_val * 1.6, length.out = 7), 3)
minpts_grid <- c(min_pts - 2, min_pts, min_pts + 2)

sens_results <- expand.grid(eps = eps_grid, minPts = minpts_grid) %>%
  rowwise() %>%
  mutate(
    res        = list(dbscan(as.matrix(features_scaled), eps = eps, minPts = minPts)),
    n_clusters = length(unique(res$cluster[res$cluster != 0])),
    n_noise    = sum(res$cluster == 0)
  ) %>%
  select(-res) %>%
  ungroup()

kable(sens_results, caption = "DBSCAN Sensitivity: Clusters & Noise Across Parameter Grid",
      col.names = c("ε", "MinPts", "Clusters", "Noise Points")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

DBSCAN Sensitivity: Clusters & Noise Across Parameter Grid
ε	MinPts	Clusters	Noise Points
4.920	30	1	6
6.287	30	1	1
7.653	30	1	1
9.020	30	1	0
10.387	30	1	0
11.753	30	1	0
13.120	30	1	0
4.920	32	1	6
6.287	32	1	1
7.653	32	1	1
9.020	32	1	0
10.387	32	1	0
11.753	32	1	0
13.120	32	1	0
4.920	34	0	38
6.287	34	1	2
7.653	34	1	1
9.020	34	1	0
10.387	34	1	0
11.753	34	1	0
13.120	34	1	0

9. Mean Shift Clustering

9.1 Methodology

Mean Shift is a non-parametric, kernel-density-based clustering algorithm that does not require the number of clusters to be specified in advance. The algorithm iteratively shifts each data point toward the local mode of the estimated kernel density function. Points converging to the same mode are assigned to the same cluster. The bandwidth parameter h controls the smoothness of the kernel density estimate: smaller bandwidths produce more clusters, while larger bandwidths merge modes into fewer clusters. Given computational intensity in high dimensions, PCA-reduced data is used, consistent with standard practice in the literature.

n_pcs_ms <- min(5, ncol(pca_result$x))
pcs_ms   <- pca_result$x[, 1:n_pcs_ms]

bw_val <- mean(apply(pcs_ms, 2, sd)) * nrow(pcs_ms)^(-1 / (n_pcs_ms + 4)) * 1.5

ms_result <- meanShift(pcs_ms, bandwidth = rep(bw_val, n_pcs_ms))

ms_labels <- ms_result$assignment
df_latest$MeanShift_Cluster <- factor(ms_labels)
pcs$MeanShift               <- factor(ms_labels)
n_ms_clusters               <- length(unique(ms_labels))

kable(as.data.frame(table(Cluster = ms_labels)),
      caption = "Mean Shift Cluster Sizes") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Mean Shift Cluster Sizes
Cluster	Freq
1	2
2	13
3	2
4	1
5	1
6	1
7	1
8	1
9	2
10	3
11	1
12	4
13	1
14	1
15	1
16	1
17	1
18	1

cat(sprintf("Applied on %d PCs (%.1f%% variance) | Bandwidth = %.4f | Clusters = %d\n",
            n_pcs_ms, pca_cumvar[n_pcs_ms], bw_val, n_ms_clusters))

## Applied on 5 PCs (88.6% variance) | Bandwidth = 1.6076 | Clusters = 18

9.2 Mean Shift Visualization

ggplot(pcs, aes(x = PC1, y = PC2, color = MeanShift, label = PROVINSI)) +
  geom_point(size = 3.5, alpha = 0.85) +
  geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
  stat_ellipse(aes(fill = MeanShift), geom = "polygon", alpha = 0.1, level = 0.90) +
  scale_color_brewer(palette = "Set3") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Mean Shift Clustering PCA Projection",
       subtitle = paste0("Bandwidth h = ", round(bw_val, 4),
                         "; applied on ", n_pcs_ms, " PCs"),
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster", fill = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Mean Shift Clusters PCA Projection

10. Fuzzy C-Means (FCM) Clustering

10.1 Methodology

Fuzzy C-Means (FCM) is a soft partitioning algorithm that assigns each observation a membership degree to each cluster rather than a crisp binary assignment. Membership degrees range from 0 to 1 and sum to 1 across all clusters for each observation. This probabilistic assignment is particularly valuable for socioeconomic data where provinces may share characteristics across multiple development clusters, reflecting gradual developmental transitions rather than abrupt boundaries. The fuzziness parameter m = 2 is the standard choice in the literature.

set.seed(42)
fcm_result <- ppclust::fcm(as.matrix(features_scaled), centers = final_k, m = 2, nstart = 10)

fcm_hard <- apply(fcm_result$u, 1, which.max)
df_latest$FCM_Cluster <- factor(fcm_hard)
pcs$FCM               <- factor(fcm_hard)

kable(as.data.frame(table(Cluster = fcm_hard)),
      caption = "FCM Cluster Sizes (Hard Assignment)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

FCM Cluster Sizes (Hard Assignment)
Cluster	Freq
1	19
2	19

cat(sprintf("Clusters: %d | m = 2 | Objective: %.4f\n", final_k, fcm_result$func.val))

## Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874
##  Clusters: 2 | m = 2 | Objective: 293.1874

10.2 Membership Matrix

membership_df <- as.data.frame(round(fcm_result$u, 4))
names(membership_df) <- paste0("Cluster_", 1:final_k)
membership_df$PROVINSI    <- df_latest$PROVINSI
membership_df$HardCluster <- fcm_hard
membership_df <- membership_df %>%
  select(PROVINSI, HardCluster, everything()) %>%
  arrange(HardCluster, PROVINSI)

kable(membership_df, caption = "Fuzzy Membership Degrees per Province",
      col.names = c("Province", "Hard Cluster", paste0("mu(C", 1:final_k, ")"))) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  scroll_box(height = "400px")

Fuzzy Membership Degrees per Province
	Province	Hard Cluster	mu(C1)	mu(C2)
ACEH	ACEH	1	0.7383	0.2617
BANTEN	BANTEN	1	0.6846	0.3154
DI YOGYAKARTA	DI YOGYAKARTA	1	0.5318	0.4682
DKI JAKARTA	DKI JAKARTA	1	0.6856	0.3144
JAMBI	JAMBI	1	0.6423	0.3577
JAWA BARAT	JAWA BARAT	1	0.5413	0.4587
KALIMANTAN TIMUR	KALIMANTAN TIMUR	1	0.7479	0.2521
KALIMANTAN UTARA	KALIMANTAN UTARA	1	0.6538	0.3462
KEP. BANGKA BELITUNG	KEP. BANGKA BELITUNG	1	0.5488	0.4512
KEP. RIAU	KEP. RIAU	1	0.6971	0.3029
MALUKU	MALUKU	1	0.6989	0.3011
MALUKU UTARA	MALUKU UTARA	1	0.7108	0.2892
PAPUA BARAT	PAPUA BARAT	1	0.5774	0.4226
PAPUA BARAT DAYA	PAPUA BARAT DAYA	1	0.6359	0.3641
RIAU	RIAU	1	0.7717	0.2283
SULAWESI SELATAN	SULAWESI SELATAN	1	0.5361	0.4639
SULAWESI UTARA	SULAWESI UTARA	1	0.6367	0.3633
SUMATERA BARAT	SUMATERA BARAT	1	0.7260	0.2740
SUMATERA UTARA	SUMATERA UTARA	1	0.7466	0.2534
BALI	BALI	2	0.4567	0.5433
BENGKULU	BENGKULU	2	0.4167	0.5833
GORONTALO	GORONTALO	2	0.2455	0.7545
JAWA TENGAH	JAWA TENGAH	2	0.4489	0.5511
JAWA TIMUR	JAWA TIMUR	2	0.4631	0.5369
KALIMANTAN BARAT	KALIMANTAN BARAT	2	0.4152	0.5848
KALIMANTAN SELATAN	KALIMANTAN SELATAN	2	0.4937	0.5063
KALIMANTAN TENGAH	KALIMANTAN TENGAH	2	0.4805	0.5195
LAMPUNG	LAMPUNG	2	0.2345	0.7655
NUSA TENGGARA BARAT	NUSA TENGGARA BARAT	2	0.2666	0.7334
NUSA TENGGARA TIMUR	NUSA TENGGARA TIMUR	2	0.2590	0.7410
PAPUA	PAPUA	2	0.3706	0.6294
PAPUA PEGUNUNGAN	PAPUA PEGUNUNGAN	2	0.4227	0.5773
PAPUA SELATAN	PAPUA SELATAN	2	0.3370	0.6630
PAPUA TENGAH	PAPUA TENGAH	2	0.3845	0.6155
SULAWESI BARAT	SULAWESI BARAT	2	0.2785	0.7215
SULAWESI TENGAH	SULAWESI TENGAH	2	0.1874	0.8126
SULAWESI TENGGARA	SULAWESI TENGGARA	2	0.3685	0.6315
SUMATERA SELATAN	SUMATERA SELATAN	2	0.2887	0.7113

10.3 Ambiguous Membership Detection

max_membership <- apply(fcm_result$u, 1, max)
ambiguous_idx  <- which(max_membership < 0.60)

if (length(ambiguous_idx) == 0) {
  kable(data.frame(Result = "No provinces with ambiguous membership (max mu < 0.60) detected."),
        col.names = NULL) %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
} else {
  kable(data.frame(
    Province    = df_latest$PROVINSI[ambiguous_idx],
    Max_Mu      = round(max_membership[ambiguous_idx], 4),
    HardCluster = fcm_hard[ambiguous_idx]
  ), caption = "Provinces with Ambiguous Membership (max mu < 0.60)",
     col.names = c("Province", "Max Membership", "Hard Cluster")) %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
}

Provinces with Ambiguous Membership (max mu < 0.60)
	Province	Max Membership	Hard Cluster
BALI	BALI	0.5433	2
BENGKULU	BENGKULU	0.5833	2
DI YOGYAKARTA	DI YOGYAKARTA	0.5318	1
JAWA BARAT	JAWA BARAT	0.5413	1
JAWA TENGAH	JAWA TENGAH	0.5511	2
JAWA TIMUR	JAWA TIMUR	0.5369	2
KALIMANTAN BARAT	KALIMANTAN BARAT	0.5848	2
KALIMANTAN SELATAN	KALIMANTAN SELATAN	0.5063	2
KALIMANTAN TENGAH	KALIMANTAN TENGAH	0.5195	2
KEP. BANGKA BELITUNG	KEP. BANGKA BELITUNG	0.5488	1
PAPUA BARAT	PAPUA BARAT	0.5774	1
PAPUA PEGUNUNGAN	PAPUA PEGUNUNGAN	0.5773	2
SULAWESI SELATAN	SULAWESI SELATAN	0.5361	1

10.4 FCM Visualization

pcs$MaxMembership <- max_membership

ggplot(pcs, aes(x = PC1, y = PC2, color = FCM, size = MaxMembership, label = PROVINSI)) +
  geom_point(alpha = 0.85) +
  geom_text_repel(size = 2.3, max.overlaps = 15, segment.alpha = 0.5) +
  scale_color_brewer(palette = "Paired") +
  scale_size_continuous(range = c(2, 6), name = "Max Membership") +
  labs(title = paste0("Fuzzy C-Means (k = ", final_k, ") PCA Projection"),
       subtitle = "Point size proportional to maximum membership degree",
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
       color = "Cluster") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

FCM Clusters PCA Projection

10.5 FCM Cluster Centroids

fcm_centers <- as.data.frame(fcm_result$v)
names(fcm_centers) <- names(features)

kable(round(fcm_centers, 3), caption = "FCM Cluster Centroids (Standardized)") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
  scroll_box(width = "100%")

FCM Cluster Centroids (Standardized)
	APS_13_15	APS_16_18	APS_19_24	TPT_Feb	TPT_Agt	TPAK_Feb	TPAK_Agt	GK_Maret	GK_Sept	JPM_Maret	JPM_Sept	PPM_Maret	PPM_Sept	IPM	RLS_15Plus	HLS
Cluster 1	0.272	0.274	0.139	0.394	0.418	-0.359	-0.377	0.259	0.308	-0.058	-0.059	-0.222	-0.237	0.273	0.309	0.177
Cluster 2	-0.265	-0.245	-0.140	-0.418	-0.450	0.343	0.368	-0.289	-0.333	0.029	0.031	0.228	0.243	-0.280	-0.299	-0.158

11. Cluster Validation

11.1 Internal Validation Indices

Internal validation assesses the quality of a clustering solution using only the data itself, without reference to external ground-truth labels. Key indices include:

Index	Criterion	Preferred Value
Average Silhouette Width	Cohesion & separation	Higher (max = 1)
Dunn Index	Min inter-cluster / Max intra-cluster distance	Higher
Calinski-Harabasz Index	Between-cluster to within-cluster SS ratio	Higher
Davies-Bouldin Index	Average similarity between clusters	Lower

dist_matrix <- dist(features_scaled)

dunn_index <- function(labels, dist_mat) {
  labels    <- as.integer(factor(labels))
  k         <- length(unique(labels))
  dm        <- as.matrix(dist_mat)
  intra     <- sapply(1:k, function(cl) {
    idx <- which(labels == cl)
    if (length(idx) < 2) return(0)
    max(dm[idx, idx])
  })
  inter_min <- Inf
  for (i in 1:(k - 1)) {
    for (j in (i + 1):k) {
      d <- min(dm[which(labels == i), which(labels == j)])
      if (d < inter_min) inter_min <- d
    }
  }
  inter_min / max(intra)
}

ch_index <- function(labels, data) {
  labels     <- as.integer(factor(labels))
  k          <- length(unique(labels))
  n          <- nrow(data)
  grand_mean <- colMeans(data)
  ssb <- sum(sapply(unique(labels), function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    nrow(pts) * sum((colMeans(pts) - grand_mean)^2)
  }))
  ssw <- sum(sapply(unique(labels), function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    if (nrow(pts) < 2) return(0)
    sum(apply(pts, 1, function(r) sum((r - colMeans(pts))^2)))
  }))
  (ssb / (k - 1)) / (ssw / (n - k))
}

db_index <- function(labels, data) {
  labels <- as.integer(factor(labels))
  k      <- length(unique(labels))
  cents  <- sapply(1:k, function(cl) colMeans(data[labels == cl, , drop = FALSE]))
  s      <- sapply(1:k, function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, cents[, cl])^2)))
  })
  mean(sapply(1:k, function(i) {
    max(sapply(setdiff(1:k, i), function(j) {
      (s[i] + s[j]) / sqrt(sum((cents[, i] - cents[, j])^2))
    }))
  }))
}

method_labels <- list(
  "K-Means"    = km_result$cluster,
  "PAM"        = as.integer(pam_result$clustering),
  "DBSCAN"     = db_result$cluster,
  "Mean Shift" = as.integer(factor(ms_labels)),
  "FCM"        = fcm_hard
)

data_matrix <- as.matrix(features_scaled)

valid_df <- lapply(names(method_labels), function(mname) {
  labels <- method_labels[[mname]]
  if (mname == "DBSCAN") {
    valid_idx <- labels != 0
    if (sum(valid_idx) < 5 || length(unique(labels[valid_idx])) < 2) {
      return(data.frame(Method = mname, N_Clusters = NA,
                         Avg_Silhouette = NA, Dunn = NA, CH = NA, DB = NA))
    }
    labels_v <- labels[valid_idx]
    data_v   <- data_matrix[valid_idx, ]
    dist_v   <- dist(data_v)
  } else {
    labels_v <- labels
    data_v   <- data_matrix
    dist_v   <- dist_matrix
  }
  if (length(unique(labels_v)) < 2) {
    return(data.frame(Method = mname, N_Clusters = 1,
                       Avg_Silhouette = NA, Dunn = NA, CH = NA, DB = NA))
  }
  data.frame(
    Method         = mname,
    N_Clusters     = length(unique(labels_v)),
    Avg_Silhouette = round(mean(silhouette(as.integer(factor(labels_v)), dist_v)[, 3]), 4),
    Dunn           = round(tryCatch(dunn_index(labels_v, dist_v), error = function(e) NA), 4),
    CH             = round(tryCatch(ch_index(labels_v, data_v), error = function(e) NA), 2),
    DB             = round(tryCatch(db_index(labels_v, data_v), error = function(e) NA), 4)
  )
}) %>% bind_rows()

kable(valid_df, caption = "Internal Validation Indices Across All Clustering Methods",
      col.names = c("Method", "Clusters", "Avg Silhouette (up)", "Dunn (up)", "CH (up)", "DB (down)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)

Internal Validation Indices Across All Clustering Methods
Method	Clusters	Avg Silhouette (up)	Dunn (up)	CH (up)	DB (down)
K-Means	2	0.4024	0.4475	12.92	0.9556
PAM	2	0.3588	0.2605	12.78	1.0859
DBSCAN	NA	NA	NA	NA	NA
Mean Shift	18	0.0029	0.2681	5.89	0.8378
FCM	2	0.1824	0.1189	10.30	1.7095

best_sil   <- valid_df$Method[which.max(valid_df$Avg_Silhouette)]
best_ch    <- valid_df$Method[which.max(valid_df$CH)]
best_db    <- valid_df$Method[which.min(valid_df$DB)]
best_dunn  <- valid_df$Method[which.max(valid_df$Dunn)]
vote_counts  <- table(c(best_sil, best_ch, best_db, best_dunn))
overall_best <- names(which.max(vote_counts))

kable(data.frame(
  Index  = c("Best Avg Silhouette", "Best CH Index", "Best DB Index",
              "Best Dunn Index", "Overall Best (Majority Vote)"),
  Method = c(best_sil, best_ch, best_db, best_dunn, overall_best)
), caption = "Validation Summary", col.names = c("Index", "Best Method")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Validation Summary
Index	Best Method
Best Avg Silhouette	K-Means
Best CH Index	K-Means
Best DB Index	Mean Shift
Best Dunn Index	K-Means
Overall Best (Majority Vote)	K-Means

11.2 Cluster Stability (Bootstrap)

Bootstrap-based stability assessment evaluates how consistently the clustering solution replicates across repeated resamples of the data. A stability score above 0.85 indicates high stability; values below 0.60 suggest an unstable solution.

set.seed(42)
n_boot    <- 100
n_obs     <- nrow(features_scaled)

boot_stab <- sapply(1:n_boot, function(b) {
  idx         <- sample(1:n_obs, n_obs, replace = TRUE)
  boot_data   <- features_scaled[idx, ]
  km_boot     <- kmeans(boot_data, centers = final_k, nstart = 10, iter.max = 100)
  orig_labels <- km_result$cluster[idx]
  boot_labels <- km_boot$cluster
  mean(sapply(1:final_k, function(cl) {
    orig_idx_cl <- which(orig_labels == cl)
    if (length(orig_idx_cl) < 2) return(1)
    if (length(orig_idx_cl) > 50) orig_idx_cl <- sample(orig_idx_cl, 50)
    pairs <- combn(orig_idx_cl, 2)
    mean(boot_labels[pairs[1, ]] == boot_labels[pairs[2, ]])
  }))
})

stab_label <- ifelse(mean(boot_stab) >= 0.85, "High Stability",
               ifelse(mean(boot_stab) >= 0.75, "Acceptable Stability",
               ifelse(mean(boot_stab) >= 0.60, "Moderate Stability", "Low Stability")))

kable(data.frame(
  Metric = c("Mean Stability Score", "Std Deviation", "Interpretation"),
  Value  = c(round(mean(boot_stab), 4), round(sd(boot_stab), 4), stab_label)
), caption = paste0("Bootstrap Stability K-Means (n = ", n_boot, " resamples)"),
   col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Bootstrap Stability K-Means (n = 100 resamples)
Metric	Value
Mean Stability Score	0.9069
Std Deviation	0.1105
Interpretation	High Stability

12. Cross-Method Comparison

12.1 Agreement Matrix (Adjusted Rand Index)

The Adjusted Rand Index (ARI) quantifies the agreement between two clustering solutions, corrected for chance. ARI = 1 indicates perfect agreement, ARI = 0 indicates chance-level agreement.

method_names <- names(method_labels)
ari_mat <- matrix(NA, length(method_names), length(method_names),
                  dimnames = list(method_names, method_names))

for (i in seq_along(method_names)) {
  for (j in seq_along(method_names)) {
    ari_mat[i, j] <- ifelse(i == j, 1.0,
      round(adjustedRandIndex(method_labels[[i]], method_labels[[j]]), 3))
  }
}

ari_melt <- melt(ari_mat)
names(ari_melt) <- c("Method1", "Method2", "ARI")

ggplot(ari_melt, aes(x = Method1, y = Method2, fill = ARI, label = round(ARI, 2))) +
  geom_tile(color = "white") +
  geom_text(size = 4, fontface = "bold") +
  scale_fill_gradient2(low = "#E84855", mid = "white", high = "#2E86AB",
                       midpoint = 0.5, limits = c(-0.1, 1), name = "ARI") +
  labs(title = "Pairwise Adjusted Rand Index",
       subtitle = "ARI = 1: identical | ARI = 0: random",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(plot.title  = element_text(face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))

Pairwise ARI Heatmap

12.2 Cluster Assignment Comparison Table

kable(df_latest %>%
        select(PROVINSI, KMeans_Cluster, PAM_Cluster, DBSCAN_Cluster,
               MeanShift_Cluster, FCM_Cluster) %>%
        arrange(KMeans_Cluster, PROVINSI),
      caption = "Province Cluster Assignments Across All Methods",
      col.names = c("Province", "K-Means", "PAM", "DBSCAN", "Mean Shift", "FCM")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  scroll_box(height = "500px")

Province Cluster Assignments Across All Methods
Province	K-Means	PAM	DBSCAN	Mean Shift	FCM
PAPUA	1	2	1	14	2
PAPUA PEGUNUNGAN	1	2	1	16	2
PAPUA SELATAN	1	2	1	17	2
PAPUA TENGAH	1	2	1	18	2
ACEH	2	1	1	1	1
BALI	2	1	1	2	2
BANTEN	2	1	1	3	1
BENGKULU	2	1	1	2	2
DI YOGYAKARTA	2	1	1	4	1
DKI JAKARTA	2	1	1	5	1
GORONTALO	2	1	1	2	2
JAMBI	2	1	1	2	1
JAWA BARAT	2	1	1	6	1
JAWA TENGAH	2	1	1	7	2
JAWA TIMUR	2	1	1	8	2
KALIMANTAN BARAT	2	1	1	2	2
KALIMANTAN SELATAN	2	1	1	2	2
KALIMANTAN TENGAH	2	1	1	2	2
KALIMANTAN TIMUR	2	1	1	9	1
KALIMANTAN UTARA	2	1	1	10	1
KEP. BANGKA BELITUNG	2	1	1	10	1
KEP. RIAU	2	1	1	11	1
LAMPUNG	2	1	1	2	2
MALUKU	2	1	1	9	1
MALUKU UTARA	2	1	1	12	1
NUSA TENGGARA BARAT	2	1	1	2	2
NUSA TENGGARA TIMUR	2	2	1	13	2
PAPUA BARAT	2	1	1	15	1
PAPUA BARAT DAYA	2	1	1	1	1
RIAU	2	1	1	10	1
SULAWESI BARAT	2	1	1	2	2
SULAWESI SELATAN	2	1	1	12	1
SULAWESI TENGAH	2	1	1	2	2
SULAWESI TENGGARA	2	1	1	2	2
SULAWESI UTARA	2	1	1	12	1
SUMATERA BARAT	2	1	1	3	1
SUMATERA SELATAN	2	1	1	2	2
SUMATERA UTARA	2	1	1	12	1

12.3 Visual Comparison: All Methods

pcs_long <- pcs %>%
  select(PC1, PC2, PROVINSI, KMeans, PAM, DBSCAN, MeanShift, FCM) %>%
  pivot_longer(cols = c(KMeans, PAM, DBSCAN, MeanShift, FCM),
               names_to = "Method", values_to = "Cluster")

ggplot(pcs_long, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2.5, alpha = 0.8) +
  facet_wrap(~Method, ncol = 3, scales = "fixed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Clustering Results Comparison PCA Projection",
       x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_pct[2])) +
  theme_minimal(base_size = 10) +
  theme(plot.title      = element_text(face = "bold", size = 13),
        strip.text      = element_text(face = "bold"),
        legend.position = "bottom")

PCA Projection: All Clustering Methods

13. Cluster Interpretation

13.1 Socioeconomic Profiling (K-Means)

Using K-Means as the primary interpretive framework, cluster profiles are generated by computing the mean of each original (unstandardized) variable per cluster, enabling substantive socioeconomic interpretation of group differences across Indonesian provinces.

profile_df <- df_latest %>%
  select(KMeans_Cluster, all_of(names(features))) %>%
  group_by(KMeans_Cluster) %>%
  summarise(across(everything(), mean, na.rm = TRUE), .groups = "drop")

kable(profile_df %>% mutate(across(-KMeans_Cluster, ~round(.x, 2))),
      caption = "Mean Socioeconomic Indicators by K-Means Cluster (Original Scale)") %>%
  kable_styling(bootstrap_options = c("hover"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Mean Socioeconomic Indicators by K-Means Cluster (Original Scale)
KMeans_Cluster	APS_13_15	APS_16_18	APS_19_24	TPT_Feb	TPT_Agt	TPAK_Feb	TPAK_Agt	GK_Maret	GK_Sept	JPM_Maret	JPM_Sept	PPM_Maret	PPM_Sept	IPM	RLS_15Plus	HLS
1	80.91	64.15	23.89	3.49	2.67	76.79	77.20	741381.2	708811.0	229.76	161.07	24.36	23.68	63.01	7.74	12.52
2	96.04	75.98	28.48	4.87	4.69	68.79	69.05	632247.1	645927.2	714.71	681.53	9.59	9.13	73.90	9.47	13.41

13.2 Key Indicator Profiles

key_vars   <- c("IPM", "PPM_Maret", "TPT_Feb", "APS_13_15", "GK_Maret", "RLS_15Plus")
var_labels <- c(IPM = "Human Dev. Index (IPM)",
                PPM_Maret  = "Poverty Rate - March (%)",
                TPT_Feb    = "Unemployment Rate - Feb (%)",
                APS_13_15  = "School Participation 13-15",
                GK_Maret   = "Poverty Line - March (Rp)",
                RLS_15Plus = "Avg Schooling Years (15+)")

profile_long <- profile_df %>%
  select(KMeans_Cluster, all_of(key_vars)) %>%
  pivot_longer(-KMeans_Cluster, names_to = "Variable", values_to = "Mean_Value") %>%
  mutate(Variable = var_labels[Variable])

ggplot(profile_long, aes(x = KMeans_Cluster, y = Mean_Value, fill = KMeans_Cluster)) +
  geom_col(alpha = 0.85, color = "white") +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_brewer(palette = "Set1", guide = "none") +
  labs(title = "Cluster Profiles: Key Socioeconomic Indicators",
       subtitle = "Mean values per K-Means cluster; original scale",
       x = "Cluster", y = "Mean Value") +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

Cluster Profiles: Key Socioeconomic Indicators

13.3 Cluster Narrative Interpretation

ipm_means <- profile_df$IPM
ppm_means <- profile_df$PPM_Maret
tpt_means <- profile_df$TPT_Feb
gk_means  <- profile_df$GK_Maret

narrative_df <- lapply(1:nrow(profile_df), function(i) {
  cl        <- profile_df$KMeans_Cluster[i]
  provs     <- paste(sort(df_latest$PROVINSI[df_latest$KMeans_Cluster == cl]), collapse = ", ")
  ipm_rank  <- rank(-ipm_means)[i]
  ppm_rank  <- rank(ppm_means)[i]
  dev_level <- ifelse(ipm_rank == 1, "Most Developed",
                ifelse(ipm_rank == nrow(profile_df), "Least Developed", "Developing"))
  pov_level <- ifelse(ppm_rank == 1, "Lowest Poverty",
                ifelse(ppm_rank == nrow(profile_df), "Highest Poverty", "Moderate Poverty"))
  data.frame(
    Cluster       = as.integer(cl),
    Dev_Level     = dev_level,
    Poverty_Level = pov_level,
    N             = sum(df_latest$KMeans_Cluster == cl),
    Mean_IPM      = round(ipm_means[i], 2),
    Mean_PPM      = round(ppm_means[i], 2),
    Mean_TPT      = round(tpt_means[i], 2),
    Mean_GK       = format(round(gk_means[i]), big.mark = "."),
    Provinces     = provs
  )
}) %>% bind_rows()

kable(narrative_df, caption = "Cluster Socioeconomic Narrative (K-Means)",
      col.names = c("Cluster", "Dev. Level", "Poverty Level", "N", "Mean IPM",
                    "Mean PPM (%)", "Mean TPT (%)", "Mean GK (Rp)", "Provinces")) %>%
  kable_styling(bootstrap_options = c("hover"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Cluster Socioeconomic Narrative (K-Means)
Cluster	Dev. Level	Poverty Level	N	Mean IPM	Mean PPM (%)	Mean TPT (%)	Mean GK (Rp)	Provinces
1	Least Developed	Highest Poverty	4	63.01	24.36	3.49	741.381	PAPUA, PAPUA PEGUNUNGAN, PAPUA SELATAN, PAPUA TENGAH
2	Most Developed	Lowest Poverty	34	73.90	9.59	4.87	632.247	ACEH, BALI, BANTEN, BENGKULU, DI YOGYAKARTA, DKI JAKARTA, GORONTALO, JAMBI, JAWA BARAT, JAWA TENGAH, JAWA TIMUR, KALIMANTAN BARAT, KALIMANTAN SELATAN, KALIMANTAN TENGAH, KALIMANTAN TIMUR, KALIMANTAN UTARA, KEP. BANGKA BELITUNG, KEP. RIAU, LAMPUNG, MALUKU, MALUKU UTARA, NUSA TENGGARA BARAT, NUSA TENGGARA TIMUR, PAPUA BARAT, PAPUA BARAT DAYA, RIAU, SULAWESI BARAT, SULAWESI SELATAN, SULAWESI TENGAH, SULAWESI TENGGARA, SULAWESI UTARA, SUMATERA BARAT, SUMATERA SELATAN, SUMATERA UTARA

13.4 Final Province Assignment Table

kable(df_latest %>%
        select(PROVINSI, TAHUN, KMeans_Cluster, IPM, PPM_Maret, TPT_Feb, GK_Maret) %>%
        arrange(KMeans_Cluster, desc(IPM)) %>%
        mutate(IPM       = round(IPM, 2),
               PPM_Maret = round(PPM_Maret, 2),
               TPT_Feb   = round(TPT_Feb, 2),
               GK_Maret  = format(GK_Maret, big.mark = ".")),
      caption = "Final Cluster Assignments with Key Indicators",
      col.names = c("Province", "Year", "Cluster", "IPM",
                    "Poverty Rate (%)", "Unemployment (%)", "Poverty Line (Rp)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(height = "500px")

Final Cluster Assignments with Key Indicators
Province	Year	Cluster	IPM	Poverty Rate (%)	Unemployment (%)	Poverty Line (Rp)
PAPUA	2024	1	63.01	17.26	3.49	674.371
PAPUA PEGUNUNGAN	2024	1	63.01	32.97	3.49	1.007.060
PAPUA SELATAN	2024	1	63.01	17.44	3.49	519.979
PAPUA TENGAH	2024	1	63.01	29.76	3.49	764.115
DKI JAKARTA	2024	2	83.55	4.30	7.57	825.288
DI YOGYAKARTA	2024	2	81.09	10.83	3.58	602.437
KEP. RIAU	2024	2	79.08	5.37	7.61	787.211
KALIMANTAN TIMUR	2024	2	78.20	5.78	6.37	833.955
BALI	2024	2	78.01	4.00	3.73	568.510
BANTEN	2024	2	75.77	5.84	7.97	654.213
SUMATERA BARAT	2024	2	75.64	5.97	5.90	708.416
SUMATERA UTARA	2024	2	75.13	7.99	5.24	642.423
SULAWESI UTARA	2024	2	75.04	7.25	6.19	490.719
RIAU	2024	2	74.95	6.67	4.25	697.296
ACEH	2024	2	74.70	14.23	5.75	661.227
KALIMANTAN SELATAN	2024	2	74.66	4.11	3.95	632.739
JAWA TIMUR	2024	2	74.65	9.79	4.33	536.122
SULAWESI SELATAN	2024	2	74.60	8.06	5.26	459.226
BENGKULU	2024	2	74.30	13.56	3.21	671.095
JAWA BARAT	2024	2	74.24	7.46	7.89	524.052
KEP. BANGKA BELITUNG	2024	2	74.09	4.55	3.89	908.397
JAMBI	2024	2	73.73	7.10	4.50	650.115
KALIMANTAN TENGAH	2024	2	73.73	5.17	3.84	623.954
JAWA TENGAH	2024	2	73.39	10.47	5.24	507.001
SUMATERA SELATAN	2024	2	73.18	10.97	4.53	554.197
SULAWESI TENGGARA	2024	2	72.94	11.21	3.66	462.715
KALIMANTAN UTARA	2024	2	72.88	6.32	4.10	854.294
MALUKU	2024	2	72.75	16.05	6.08	713.503
LAMPUNG	2024	2	72.48	10.69	4.18	586.551
NUSA TENGGARA BARAT	2024	2	72.37	12.91	3.73	534.703
SULAWESI TENGAH	2024	2	71.66	11.77	3.49	600.872
GORONTALO	2024	2	71.25	14.57	3.07	473.006
MALUKU UTARA	2024	2	70.98	6.32	4.60	604.460
KALIMANTAN BARAT	2024	2	70.47	6.32	4.52	595.509
SULAWESI BARAT	2024	2	69.80	11.21	3.04	454.879
NUSA TENGGARA TIMUR	2024	2	68.40	19.48	3.10	527.275
PAPUA BARAT	2024	2	67.47	21.66	5.53	793.804
PAPUA BARAT DAYA	2024	2	67.47	18.13	5.53	756.237

14. Conclusions and Recommendations

14.1 Summary of Findings

kable(data.frame(
  Method         = c("K-Means", "PAM", "DBSCAN", "Mean Shift", "FCM"),
  N_Clusters     = c(final_k, final_k, n_clusters_db, n_ms_clusters, final_k),
  Characteristic = c(
    "Fast; sensitive to outliers; spherical cluster assumption",
    "Robust to outliers; uses actual data points as representatives",
    "Detects arbitrary shapes; identifies noise/outlier provinces",
    "Non-parametric; auto-determines k via kernel density modes",
    "Soft assignments; captures membership ambiguity across clusters"
  )
), caption = "Algorithm Comparison Summary",
   col.names = c("Method", "Clusters Found", "Key Characteristic")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE)

Algorithm Comparison Summary
Method	Clusters Found	Key Characteristic
K-Means	2	Fast; sensitive to outliers; spherical cluster assumption
PAM	2	Robust to outliers; uses actual data points as representatives
DBSCAN	1	Detects arbitrary shapes; identifies noise/outlier provinces
Mean Shift	18	Non-parametric; auto-determines k via kernel density modes
FCM	2	Soft assignments; captures membership ambiguity across clusters

cat(sprintf("Overall best method (internal validation): %s\n", overall_best))

## Overall best method (internal validation): K-Means

cat(sprintf("K-Means bootstrap stability: %s (score = %.4f)\n", stab_label, mean(boot_stab)))

## K-Means bootstrap stability: High Stability (score = 0.9069)

14.2 Policy Recommendations

Based on the clustering results, the following data-driven policy recommendations are proposed for each provincial development group.

kable(lapply(1:nrow(profile_df), function(i) {
  cl  <- profile_df$KMeans_Cluster[i]
  ipm <- ipm_means[i]
  ppm <- ppm_means[i]
  rec <- if (ipm >= quantile(ipm_means, 0.66)) {
    "Sustain development momentum; promote as model province; reduce intra-cluster inequality."
  } else if (ipm >= quantile(ipm_means, 0.33)) {
    "Accelerate education quality (RLS & HLS); expand labor market programs; strengthen social protection."
  } else {
    "Priority intervention: basic education access, poverty alleviation, affirmative budget allocation."
  }
  data.frame(Cluster = as.integer(cl), Mean_IPM = round(ipm, 2),
             Mean_PPM = round(ppm, 2), Recommendation = rec)
}) %>% bind_rows(),
caption = "Policy Recommendations by Cluster Development Profile",
col.names = c("Cluster", "Mean IPM", "Mean Poverty Rate (%)", "Recommendation")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE)

Policy Recommendations by Cluster Development Profile
Cluster	Mean IPM	Mean Poverty Rate (%)	Recommendation
1	63.01	24.36	Priority intervention: basic education access, poverty alleviation, affirmative budget allocation.
2	73.90	9.59	Sustain development momentum; promote as model province; reduce intra-cluster inequality.

14.3 Methodological Limitations

#	Limitation	Implication
1	Cross-sectional analysis (latest year only)	Temporal dynamics in development trajectories are not captured
2	K-Means assumes spherical, equally-sized clusters	Assumption likely violated in heterogeneous regional data
3	DBSCAN is sensitive to ε and MinPts	Small parameter changes can substantially alter results
4	Mean Shift applied on PCA-reduced space	Information in lower-ranked components may be lost
5	Highly correlated variables included	Certain dimensions may be implicitly over-weighted
6	IPM used alongside its sub-components	Multicollinearity between IPM, RLS, and HLS

Document generated using R Markdown. All analyses conducted in R (version >= 4.1.0). Reproducibility ensured via set.seed(42) in all stochastic procedures.

Clustering Analysis of Indonesian Provincial Socioeconomic Indicators

Nazril Ravi Pratama

2026-04-03