This report presents a comprehensive clustering analysis of Indonesian provincial socioeconomic indicators. Clustering, as an unsupervised machine learning technique, aims to partition observations into groups such that observations within the same group are more similar to each other than to those in other groups. The dataset encompasses 151 observations across 34 Indonesian provinces over multiple years, covering indicators related to education participation, labor market performance, poverty lines, poverty rates, and human development indices.
Five clustering methodologies are employed and systematically compared in this study:
The analysis follows a structured pipeline: data loading → exploratory data analysis → preprocessing → optimal cluster determination → clustering execution → validation → interpretation.
Prior to any analytical procedure, it is essential to examine the raw structure of the dataset to identify potential issues such as missing values, data type mismatches, and distributional anomalies.
if (file.exists("data_5_clustering.xlsx")) {
df_raw <- read_excel("data_5_clustering.xlsx")
} else if (file.exists("data_5_clustering.csv")) {
df_raw <- read.csv("data_5_clustering.csv", stringsAsFactors = FALSE)
} else {
stop("Dataset not found. Place data_5_clustering.xlsx or .csv in the working directory.")
}
colnames(df_raw) <- c(
"PROVINSI", "TAHUN",
"APS_13_15", "APS_16_18", "APS_19_24",
"TPT_Feb", "TPT_Agt",
"TPAK_Feb", "TPAK_Agt",
"GK_Maret", "GK_Sept",
"JPM_Maret", "JPM_Sept",
"PPM_Maret", "PPM_Sept",
"IPM", "RLS_15Plus", "HLS"
)
cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))## Rows: 151 | Columns: 18
## 'data.frame': 151 obs. of 18 variables:
## $ PROVINSI : chr "SUMATERA UTARA" "SUMATERA BARAT" "RIAU" "JAMBI" ...
## $ TAHUN : int 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
## $ APS_13_15 : num 97 96.6 95.7 96.4 94.8 ...
## $ APS_16_18 : num 78.7 84.1 77.8 72.5 71.5 ...
## $ APS_19_24 : num 27.1 36.4 28.8 24.1 18.8 ...
## $ TPT_Feb : num 6.01 6.67 4.96 4.76 5.17 ...
## $ TPT_Agt : num 6.33 6.52 4.42 5.09 4.98 3.65 4.69 5.03 9.91 8.5 ...
## $ TPAK_Feb : num 69.4 68.4 65.8 67.3 70 ...
## $ TPAK_Agt : num 69.1 67.7 65 67.2 68.8 ...
## $ GK_Maret : int 525756 568703 565937 506355 457455 548934 471439 752203 642425 697638 ...
## $ GK_Sept : int 648336 714991 702620 658100 564462 672816 599018 917673 807602 846085 ...
## $ JPM_Maret : num 1344 371 501 294 1114 ...
## $ JPM_Sept : num 1111 315 473 273 949 ...
## $ PPM_Maret : num 9.01 6.63 7.12 8.09 12.84 ...
## $ PPM_Sept : num 7.19 5.42 6.36 7.26 10.51 ...
## $ IPM : num 73.8 74.6 73.9 72.6 71.8 ...
## $ RLS_15Plus: num 9.88 9.46 9.52 9.03 8.78 ...
## $ HLS : num 13.3 14.1 13.3 13 12.5 ...
missing_summary <- data.frame(
Column = names(df_raw),
Missing = sapply(df_raw, function(x) sum(is.na(x))),
Pct_Miss = sapply(df_raw, function(x) round(mean(is.na(x)) * 100, 2))
)
rownames(missing_summary) <- NULL
kable(missing_summary, caption = "Missing Values per Variable",
col.names = c("Variable", "Missing Count", "Missing (%)")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)| Variable | Missing Count | Missing (%) |
|---|---|---|
| PROVINSI | 0 | 0 |
| TAHUN | 0 | 0 |
| APS_13_15 | 0 | 0 |
| APS_16_18 | 0 | 0 |
| APS_19_24 | 0 | 0 |
| TPT_Feb | 0 | 0 |
| TPT_Agt | 0 | 0 |
| TPAK_Feb | 0 | 0 |
| TPAK_Agt | 0 | 0 |
| GK_Maret | 0 | 0 |
| GK_Sept | 0 | 0 |
| JPM_Maret | 0 | 0 |
| JPM_Sept | 0 | 0 |
| PPM_Maret | 0 | 0 |
| PPM_Sept | 0 | 0 |
| IPM | 0 | 0 |
| RLS_15Plus | 0 | 0 |
| HLS | 0 | 0 |
kable(head(df_raw), caption = "Preview: First 6 Rows of Dataset") %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| PROVINSI | TAHUN | APS_13_15 | APS_16_18 | APS_19_24 | TPT_Feb | TPT_Agt | TPAK_Feb | TPAK_Agt | GK_Maret | GK_Sept | JPM_Maret | JPM_Sept | PPM_Maret | PPM_Sept | IPM | RLS_15Plus | HLS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SUMATERA UTARA | 2021 | 96.99 | 78.66 | 27.05 | 6.01 | 6.33 | 69.39 | 69.10 | 525756 | 648336 | 1343.86 | 1110.92 | 9.01 | 7.19 | 73.84 | 9.88 | 13.27 |
| SUMATERA BARAT | 2021 | 96.63 | 84.07 | 36.41 | 6.67 | 6.52 | 68.41 | 67.72 | 568703 | 714991 | 370.67 | 315.43 | 6.63 | 5.42 | 74.56 | 9.46 | 14.09 |
| RIAU | 2021 | 95.66 | 77.81 | 28.79 | 4.96 | 4.42 | 65.81 | 65.03 | 565937 | 702620 | 500.81 | 473.04 | 7.12 | 6.36 | 73.89 | 9.52 | 13.28 |
| JAMBI | 2021 | 96.39 | 72.50 | 24.14 | 4.76 | 5.09 | 67.30 | 67.17 | 506355 | 658100 | 293.86 | 272.70 | 8.09 | 7.26 | 72.62 | 9.03 | 13.04 |
| SUMATERA SELATAN | 2021 | 94.85 | 71.53 | 18.81 | 5.17 | 4.98 | 69.95 | 68.77 | 457455 | 564462 | 1113.76 | 948.84 | 12.84 | 10.51 | 71.83 | 8.78 | 12.54 |
| BENGKULU | 2021 | 97.49 | 79.75 | 30.46 | 3.72 | 3.65 | 71.74 | 69.75 | 548934 | 672816 | 306.00 | 261.15 | 15.22 | 12.52 | 73.16 | 9.26 | 13.67 |
Descriptive statistics provide a foundational understanding of the central tendency, dispersion, and shape of each variable’s distribution. Examining skewness and kurtosis prior to multivariate analysis is critical for assessing distributional assumptions. Variables with absolute skewness exceeding 2.0 or kurtosis exceeding 7.0 may warrant transformation.
numeric_vars <- df_raw %>% select(where(is.numeric)) %>% select(-TAHUN)
desc_stats <- data.frame(
Variable = names(numeric_vars),
N = sapply(numeric_vars, function(x) sum(!is.na(x))),
Mean = sapply(numeric_vars, mean, na.rm = TRUE),
SD = sapply(numeric_vars, sd, na.rm = TRUE),
Min = sapply(numeric_vars, min, na.rm = TRUE),
Q1 = sapply(numeric_vars, quantile, 0.25, na.rm = TRUE),
Median = sapply(numeric_vars, median, na.rm = TRUE),
Q3 = sapply(numeric_vars, quantile, 0.75, na.rm = TRUE),
Max = sapply(numeric_vars, max, na.rm = TRUE),
Skewness = sapply(numeric_vars, skewness, na.rm = TRUE),
Kurtosis = sapply(numeric_vars, kurtosis, na.rm = TRUE)
)
rownames(desc_stats) <- NULL
kable(desc_stats %>% mutate(across(where(is.numeric), ~round(.x, 3))),
caption = "Descriptive Statistics of Numeric Variables") %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|
| APS_13_15 | 151 | 94.392 | 5.109 | 80.02 | 93.270 | 95.93 | 97.615 | 99.43 | -1.871 | 5.451 |
| APS_16_18 | 151 | 74.570 | 6.390 | 63.98 | 69.915 | 74.07 | 79.280 | 91.17 | 0.304 | 2.519 |
| APS_19_24 | 151 | 27.616 | 6.132 | 17.59 | 23.740 | 26.42 | 30.740 | 51.60 | 1.541 | 6.916 |
| TPT_Feb | 151 | 4.953 | 1.524 | 3.04 | 3.730 | 4.53 | 5.785 | 10.12 | 1.006 | 3.420 |
| TPT_Agt | 151 | 4.765 | 1.615 | 2.27 | 3.390 | 4.56 | 5.840 | 9.91 | 0.706 | 3.286 |
| TPAK_Feb | 151 | 69.487 | 4.032 | 61.97 | 66.500 | 69.36 | 71.720 | 80.23 | 0.576 | 3.034 |
| TPAK_Agt | 151 | 69.577 | 4.125 | 62.15 | 66.365 | 69.31 | 71.055 | 78.29 | 0.541 | 2.585 |
| GK_Maret | 151 | 583860.291 | 123126.095 | 364251.00 | 490047.000 | 570368.00 | 663415.500 | 1007060.00 | 0.583 | 3.135 |
| GK_Sept | 151 | 652458.444 | 119598.356 | 460283.00 | 547751.000 | 644107.00 | 708811.000 | 917673.00 | 0.345 | 2.370 |
| JPM_Maret | 151 | 747.771 | 1026.577 | 47.83 | 191.390 | 350.25 | 915.150 | 4572.73 | 2.584 | 8.730 |
| JPM_Sept | 151 | 626.133 | 940.524 | 41.11 | 161.070 | 272.70 | 711.770 | 3893.82 | 2.648 | 8.912 |
| PPM_Maret | 151 | 11.705 | 6.769 | 4.00 | 6.455 | 10.47 | 14.595 | 32.97 | 1.104 | 3.374 |
| PPM_Sept | 151 | 10.652 | 6.322 | 3.80 | 5.700 | 9.56 | 12.640 | 29.66 | 1.325 | 4.280 |
| IPM | 151 | 72.264 | 4.738 | 61.40 | 70.470 | 72.96 | 74.650 | 83.55 | -0.442 | 3.412 |
| RLS_15Plus | 151 | 9.120 | 1.022 | 5.10 | 8.515 | 9.22 | 9.850 | 11.49 | -0.686 | 4.235 |
| HLS | 151 | 13.133 | 0.884 | 9.97 | 12.790 | 13.17 | 13.645 | 15.70 | -0.450 | 5.280 |
skew_df <- data.frame(
Variable = names(numeric_vars),
Skewness = round(sapply(numeric_vars, skewness, na.rm = TRUE), 3)
) %>%
mutate(
Direction = ifelse(Skewness > 0, "Positive (Right)", "Negative (Left)"),
Interpretation = ifelse(abs(Skewness) < 0.5, "Approximately Symmetric",
ifelse(abs(Skewness) < 1.0, "Moderately Skewed",
ifelse(abs(Skewness) < 2.0, "Highly Skewed", "Extremely Skewed")))
)
kable(skew_df, caption = "Skewness Interpretation per Variable") %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)| Variable | Skewness | Direction | Interpretation | |
|---|---|---|---|---|
| APS_13_15 | APS_13_15 | -1.871 | Negative (Left) | Highly Skewed |
| APS_16_18 | APS_16_18 | 0.304 | Positive (Right) | Approximately Symmetric |
| APS_19_24 | APS_19_24 | 1.541 | Positive (Right) | Highly Skewed |
| TPT_Feb | TPT_Feb | 1.006 | Positive (Right) | Highly Skewed |
| TPT_Agt | TPT_Agt | 0.706 | Positive (Right) | Moderately Skewed |
| TPAK_Feb | TPAK_Feb | 0.576 | Positive (Right) | Moderately Skewed |
| TPAK_Agt | TPAK_Agt | 0.541 | Positive (Right) | Moderately Skewed |
| GK_Maret | GK_Maret | 0.583 | Positive (Right) | Moderately Skewed |
| GK_Sept | GK_Sept | 0.345 | Positive (Right) | Approximately Symmetric |
| JPM_Maret | JPM_Maret | 2.584 | Positive (Right) | Extremely Skewed |
| JPM_Sept | JPM_Sept | 2.648 | Positive (Right) | Extremely Skewed |
| PPM_Maret | PPM_Maret | 1.104 | Positive (Right) | Highly Skewed |
| PPM_Sept | PPM_Sept | 1.325 | Positive (Right) | Highly Skewed |
| IPM | IPM | -0.442 | Negative (Left) | Approximately Symmetric |
| RLS_15Plus | RLS_15Plus | -0.686 | Negative (Left) | Moderately Skewed |
| HLS | HLS | -0.450 | Negative (Left) | Approximately Symmetric |
df_long <- numeric_vars %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(df_long, aes(x = Value)) +
geom_histogram(aes(y = after_stat(density)), bins = 20,
fill = "#2E86AB", color = "white", alpha = 0.7) +
geom_density(color = "#E84855", linewidth = 0.8) +
facet_wrap(~Variable, scales = "free", ncol = 4) +
labs(title = "Distribution of Socioeconomic Variables",
subtitle = "Histogram overlaid with kernel density estimate",
x = NULL, y = "Density") +
theme_minimal(base_size = 10) +
theme(strip.text = element_text(face = "bold", size = 8),
plot.title = element_text(face = "bold", size = 13))Histogram and Density Plots of All Numeric Variables
Boxplots enable the simultaneous visualization of central tendency, dispersion, and outlier presence across all variables. Outliers, defined as values beyond 1.5 × IQR from the quartile boundaries, may distort centroid-based algorithms such as K-Means and warrant careful consideration.
df_scaled_long <- as.data.frame(scale(numeric_vars)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(df_scaled_long, aes(x = reorder(Variable, Value, median), y = Value, fill = Variable)) +
geom_boxplot(outlier.shape = 21, outlier.fill = "red", outlier.size = 1.5, alpha = 0.7) +
coord_flip() +
scale_fill_viridis_d(option = "turbo", guide = "none") +
labs(title = "Boxplot of Standardized Variables",
subtitle = "Z-score normalized; red dots indicate outliers",
x = NULL, y = "Standardized Value (Z-score)") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))Boxplots of Standardized Variables
Correlation analysis identifies linear relationships among variables. In cluster analysis, multicollinearity is less problematic than in regression, but understanding inter-variable relationships aids in interpreting cluster profiles and selecting features.
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 2.5,
colors = c("#E84855", "white", "#2E86AB"),
title = "Pearson Correlation Matrix",
ggtheme = theme_minimal(base_size = 9)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
axis.text.y = element_text(size = 7),
plot.title = element_text(face = "bold"))Correlation Matrix Heatmap
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "r")
high_cor <- cor_df %>%
filter(as.character(Var1) < as.character(Var2), abs(r) > 0.80) %>%
arrange(desc(abs(r))) %>%
mutate(r = round(r, 3))
kable(high_cor, caption = "Variable Pairs with |r| > 0.80 (High Correlation)",
col.names = c("Variable 1", "Variable 2", "r")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Variable 1 | Variable 2 | r |
|---|---|---|
| JPM_Maret | JPM_Sept | 0.979 |
| PPM_Maret | PPM_Sept | 0.967 |
| TPT_Agt | TPT_Feb | 0.953 |
| TPAK_Agt | TPAK_Feb | 0.916 |
| GK_Maret | GK_Sept | 0.889 |
| IPM | PPM_Maret | -0.814 |
yearly_summary <- df_raw %>%
group_by(TAHUN) %>%
summarise(Mean_IPM = mean(IPM, na.rm = TRUE),
Mean_PPM = mean(PPM_Maret, na.rm = TRUE),
.groups = "drop")
p1 <- ggplot(yearly_summary, aes(x = factor(TAHUN), y = Mean_IPM, group = 1)) +
geom_line(color = "#2E86AB", linewidth = 1.2) +
geom_point(color = "#2E86AB", size = 3) +
labs(title = "Average Human Development Index (IPM) by Year",
x = "Year", y = "Mean IPM") +
theme_minimal()
p2 <- ggplot(yearly_summary, aes(x = factor(TAHUN), y = Mean_PPM, group = 1)) +
geom_line(color = "#E84855", linewidth = 1.2) +
geom_point(color = "#E84855", size = 3) +
labs(title = "Average Poverty Rate - March by Year",
x = "Year", y = "Mean PPM (%)") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)Mean IPM and Poverty Rate by Year
For this clustering analysis, the most recent year of data is selected for each province to obtain a cross-sectional snapshot. All numeric socioeconomic indicators are retained as clustering features, excluding the identifier variables PROVINSI and TAHUN.
df_latest <- df_raw %>%
group_by(PROVINSI) %>%
slice_max(order_by = TAHUN, n = 1) %>%
ungroup()
features <- df_latest %>% select(-PROVINSI, -TAHUN)
cat(sprintf("Provinces: %d | Features: %d\n", nrow(df_latest), ncol(features)))## Provinces: 38 | Features: 16
Standardization (Z-score normalization) is a critical preprocessing step in distance-based clustering algorithms. Without standardization, variables measured on larger scales (e.g., Poverty Line in Rupiah) will disproportionately influence distance calculations relative to variables on smaller scales.
features_scaled <- as.data.frame(scale(features))
rownames(features_scaled) <- df_latest$PROVINSI
cat(sprintf("Mean of scaled data : %.8f (expected = 0)\n", mean(as.matrix(features_scaled))))## Mean of scaled data : 0.00000000 (expected = 0)
## SD of scaled data : 0.98756686 (expected = 1)
Principal Component Analysis (PCA) is applied here solely for visualization purposes projecting high-dimensional cluster solutions into two-dimensional space without loss of interpretive clarity.
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
pca_var <- pca_result$sdev^2
pca_var_pct <- pca_var / sum(pca_var) * 100
pca_cumvar <- cumsum(pca_var_pct)
pca_scree <- data.frame(PC = 1:length(pca_var_pct), Var = pca_var_pct, Cum = pca_cumvar)
ggplot(pca_scree[1:10, ], aes(x = PC)) +
geom_col(aes(y = Var), fill = "#2E86AB", alpha = 0.8) +
geom_line(aes(y = Cum), color = "#E84855", linewidth = 1.2, group = 1) +
geom_point(aes(y = Cum), color = "#E84855", size = 3) +
geom_hline(yintercept = 80, linetype = "dashed", color = "gray40") +
scale_x_continuous(breaks = 1:10) +
labs(title = "PCA Scree Plot",
subtitle = "Bars = individual variance explained; Line = cumulative variance",
x = "Principal Component", y = "Variance Explained (%)") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))PCA Scree Plot
cat(sprintf("PC1: %.1f%% | PC2: %.1f%% | Combined: %.1f%%\n",
pca_var_pct[1], pca_var_pct[2], pca_cumvar[2]))## PC1: 39.2% | PC2: 17.7% | Combined: 56.9%
The Elbow Method evaluates the Within-Cluster Sum of Squares (WSS) across a range of k values. The optimal k is identified at the “elbow” point where the rate of decrease in WSS sharply decelerates, indicating diminishing returns from adding more clusters.
set.seed(42)
wss_values <- sapply(1:10, function(k) {
kmeans(features_scaled, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})
ggplot(data.frame(k = 1:10, WSS = wss_values), aes(x = k, y = WSS)) +
geom_line(color = "#2E86AB", linewidth = 1.2) +
geom_point(color = "#2E86AB", size = 3.5) +
scale_x_continuous(breaks = 1:10) +
labs(title = "Elbow Method for Optimal k",
subtitle = "Within-Cluster Sum of Squares vs. Number of Clusters",
x = "Number of Clusters (k)", y = "Total WSS") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Elbow Method
elbow_k <- which.min(diff(diff(wss_values))) + 1
cat(sprintf("Elbow heuristic suggests k = %d\n", elbow_k))## Elbow heuristic suggests k = 7
The Silhouette Width measures how similar an observation is to its own cluster compared to other clusters. Values range from -1 (poor fit) to +1 (excellent fit). The optimal k maximizes the average silhouette width.
set.seed(42)
sil_values <- sapply(2:10, function(k) {
km <- kmeans(features_scaled, centers = k, nstart = 25, iter.max = 100)
mean(silhouette(km$cluster, dist(features_scaled))[, 3])
})
sil_df <- data.frame(k = 2:10, Silhouette = sil_values)
best_sil_k <- sil_df$k[which.max(sil_df$Silhouette)]
ggplot(sil_df, aes(x = k, y = Silhouette)) +
geom_line(color = "#E84855", linewidth = 1.2) +
geom_point(color = "#E84855", size = 3.5) +
geom_vline(xintercept = best_sil_k, linetype = "dashed", color = "gray40") +
scale_x_continuous(breaks = 2:10) +
labs(title = "Silhouette Method for Optimal k",
x = "Number of Clusters (k)", y = "Average Silhouette Width") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Average Silhouette Width
## Optimal k by Silhouette: k = 2 (avg sil = 0.4024)
The Gap Statistic compares within-cluster dispersion against its expected value under a null reference distribution. The optimal k is where the gap statistic is largest, indicating that the observed clustering structure substantially exceeds chance-level structure.
set.seed(42)
gap_stat <- clusGap(features_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
best_gap_k <- which.max(gap_stat$Tab[, "gap"])
fviz_gap_stat(gap_stat) +
labs(title = "Gap Statistic for Optimal k") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Gap Statistic
## Optimal k by Gap Statistic: k = 10
k_votes <- c(elbow_k, best_sil_k, best_gap_k)
final_k <- as.integer(names(sort(table(k_votes), decreasing = TRUE))[1])
kable(data.frame(
Method = c("Elbow Method", "Silhouette Method", "Gap Statistic", "Final (Majority Vote)"),
Optimal_k = c(elbow_k, best_sil_k, best_gap_k, final_k)
), caption = "Optimal k Determination Summary",
col.names = c("Method", "Suggested k")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Method | Suggested k |
|---|---|
| Elbow Method | 7 |
| Silhouette Method | 2 |
| Gap Statistic | 10 |
| Final (Majority Vote) | 2 |
K-Means is one of the most widely used partitional clustering
algorithms. The algorithm iteratively assigns each observation to the
nearest centroid using Euclidean distance and updates centroids as the
mean of assigned observations, minimizing total within-cluster sum of
squared deviations. To mitigate sensitivity to random initialization, 25
random restarts (nstart = 25) are employed and the solution
with the lowest total WSS is retained.
set.seed(42)
km_result <- kmeans(features_scaled, centers = final_k, nstart = 25, iter.max = 200)
df_latest$KMeans_Cluster <- factor(km_result$cluster)
kable(data.frame(
Metric = c("Number of Clusters", "Total WSS", "Between-SS", "Total SS", "BSS/TSS (%)"),
Value = c(final_k,
round(km_result$tot.withinss, 4),
round(km_result$betweenss, 4),
round(km_result$totss, 4),
round(km_result$betweenss / km_result$totss * 100, 2))
), caption = "K-Means Summary Statistics", col.names = c("Metric", "Value")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Metric | Value |
|---|---|
| Number of Clusters | 2.0000 |
| Total WSS | 435.6562 |
| Between-SS | 156.3438 |
| Total SS | 592.0000 |
| BSS/TSS (%) | 26.4100 |
kable(data.frame(Cluster = 1:final_k, Size = km_result$size),
caption = "K-Means Cluster Sizes") %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Cluster | Size |
|---|---|
| 1 | 4 |
| 2 | 34 |
pcs$KMeans <- factor(km_result$cluster)
ggplot(pcs, aes(x = PC1, y = PC2, color = KMeans, label = PROVINSI)) +
geom_point(size = 3.5, alpha = 0.85) +
geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
stat_ellipse(aes(fill = KMeans), geom = "polygon", alpha = 0.1, level = 0.90) +
scale_color_brewer(palette = "Set1") +
scale_fill_brewer(palette = "Set1") +
labs(title = paste0("K-Means Clustering (k = ", final_k, ") PCA Projection"),
subtitle = "90% confidence ellipses; labels = province names",
x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
color = "Cluster", fill = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))K-Means Clusters PCA Projection
km_centers_orig <- as.data.frame(km_result$centers)
names(km_centers_orig) <- names(features)
kable(round(km_centers_orig, 3), caption = "K-Means Cluster Centroids (Standardized Scale)") %>%
kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
scroll_box(width = "100%")| APS_13_15 | APS_16_18 | APS_19_24 | TPT_Feb | TPT_Agt | TPAK_Feb | TPAK_Agt | GK_Maret | GK_Sept | JPM_Maret | JPM_Sept | PPM_Maret | PPM_Sept | IPM | RLS_15Plus | HLS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -2.629 | -1.605 | -0.669 | -0.870 | -1.249 | 1.793 | 1.804 | 0.728 | 0.467 | -0.437 | -0.492 | 1.958 | 2.043 | -2.076 | -1.351 | -0.958 |
| 0.309 | 0.189 | 0.079 | 0.102 | 0.147 | -0.211 | -0.212 | -0.086 | -0.055 | 0.051 | 0.058 | -0.230 | -0.240 | 0.244 | 0.159 | 0.113 |
km_centers_long <- km_centers_orig %>%
mutate(Cluster = factor(1:nrow(.))) %>%
pivot_longer(-Cluster, names_to = "Variable", values_to = "Value")
ggplot(km_centers_long, aes(x = reorder(Variable, Value), y = Value, fill = Cluster)) +
geom_col(position = "dodge", alpha = 0.85) +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
labs(title = "K-Means Cluster Centroids by Variable",
subtitle = "Standardized values",
x = NULL, y = "Standardized Centroid Value", fill = "Cluster") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))K-Means Cluster Centroid Profile
km_sil <- silhouette(km_result$cluster, dist(features_scaled))
avg_sil_km <- mean(km_sil[, 3])
fviz_silhouette(km_sil, palette = "Set1") +
labs(title = paste0("Silhouette Plot K-Means (k = ", final_k, ")")) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))## cluster size ave.sil.width
## 1 1 4 0.36
## 2 2 34 0.41
Silhouette Plot K-Means
sil_label_km <- ifelse(avg_sil_km >= 0.70, "Strong Structure",
ifelse(avg_sil_km >= 0.50, "Reasonable Structure",
ifelse(avg_sil_km >= 0.25, "Weak Structure", "No Substantial Structure")))
cat(sprintf("Average Silhouette Width: %.4f %s\n", avg_sil_km, sil_label_km))## Average Silhouette Width: 0.4024 Weak Structure
kable(df_latest %>% select(PROVINSI, TAHUN, KMeans_Cluster) %>%
arrange(KMeans_Cluster, PROVINSI),
caption = "Province Cluster Assignments K-Means",
col.names = c("Province", "Year", "Cluster")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
scroll_box(height = "400px")| Province | Year | Cluster |
|---|---|---|
| PAPUA | 2024 | 1 |
| PAPUA PEGUNUNGAN | 2024 | 1 |
| PAPUA SELATAN | 2024 | 1 |
| PAPUA TENGAH | 2024 | 1 |
| ACEH | 2024 | 2 |
| BALI | 2024 | 2 |
| BANTEN | 2024 | 2 |
| BENGKULU | 2024 | 2 |
| DI YOGYAKARTA | 2024 | 2 |
| DKI JAKARTA | 2024 | 2 |
| GORONTALO | 2024 | 2 |
| JAMBI | 2024 | 2 |
| JAWA BARAT | 2024 | 2 |
| JAWA TENGAH | 2024 | 2 |
| JAWA TIMUR | 2024 | 2 |
| KALIMANTAN BARAT | 2024 | 2 |
| KALIMANTAN SELATAN | 2024 | 2 |
| KALIMANTAN TENGAH | 2024 | 2 |
| KALIMANTAN TIMUR | 2024 | 2 |
| KALIMANTAN UTARA | 2024 | 2 |
| KEP. BANGKA BELITUNG | 2024 | 2 |
| KEP. RIAU | 2024 | 2 |
| LAMPUNG | 2024 | 2 |
| MALUKU | 2024 | 2 |
| MALUKU UTARA | 2024 | 2 |
| NUSA TENGGARA BARAT | 2024 | 2 |
| NUSA TENGGARA TIMUR | 2024 | 2 |
| PAPUA BARAT | 2024 | 2 |
| PAPUA BARAT DAYA | 2024 | 2 |
| RIAU | 2024 | 2 |
| SULAWESI BARAT | 2024 | 2 |
| SULAWESI SELATAN | 2024 | 2 |
| SULAWESI TENGAH | 2024 | 2 |
| SULAWESI TENGGARA | 2024 | 2 |
| SULAWESI UTARA | 2024 | 2 |
| SUMATERA BARAT | 2024 | 2 |
| SUMATERA SELATAN | 2024 | 2 |
| SUMATERA UTARA | 2024 | 2 |
K-Medoids, implemented via the Partitioning Around Medoids (PAM) algorithm, selects actual data points (medoids) as cluster representatives rather than computed centroids. This approach minimizes the sum of dissimilarities and is significantly more resistant to outliers and noise than K-Means a particularly desirable property when working with heterogeneous regional socioeconomic data.
set.seed(42)
pam_result <- pam(features_scaled, k = final_k, metric = "euclidean")
df_latest$PAM_Cluster <- factor(pam_result$clustering)
pcs$PAM <- factor(pam_result$clustering)
kable(data.frame(
Cluster = 1:final_k,
Medoid = rownames(features_scaled)[pam_result$id.med],
Size = as.integer(table(pam_result$clustering))
), caption = "PAM Medoids and Cluster Sizes",
col.names = c("Cluster", "Medoid Province", "Size")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Cluster | Medoid Province | Size |
|---|---|---|
| 1 | JAMBI | 33 |
| 2 | PAPUA SELATAN | 5 |
ggplot(pcs, aes(x = PC1, y = PC2, color = PAM, label = PROVINSI)) +
geom_point(size = 3.5, alpha = 0.85) +
geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
stat_ellipse(aes(fill = PAM), geom = "polygon", alpha = 0.1, level = 0.90) +
scale_color_brewer(palette = "Set2") +
scale_fill_brewer(palette = "Set2") +
labs(title = paste0("K-Medoids (PAM) Clustering (k = ", final_k, ") PCA Projection"),
x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
color = "Cluster", fill = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))PAM Clusters PCA Projection
pam_sil <- silhouette(pam_result$clustering, dist(features_scaled))
avg_sil_pam <- mean(pam_sil[, 3])
fviz_silhouette(pam_sil, palette = "Set2") +
labs(title = paste0("Silhouette Plot PAM (k = ", final_k, ")")) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))## cluster size ave.sil.width
## 1 1 33 0.37
## 2 2 5 0.26
Silhouette Plot PAM
sil_label_pam <- ifelse(avg_sil_pam >= 0.70, "Strong Structure",
ifelse(avg_sil_pam >= 0.50, "Reasonable Structure",
ifelse(avg_sil_pam >= 0.25, "Weak Structure", "No Substantial Structure")))
cat(sprintf("Average Silhouette Width (PAM): %.4f %s\n", avg_sil_pam, sil_label_pam))## Average Silhouette Width (PAM): 0.3588 Weak Structure
ari_kmpam <- adjustedRandIndex(km_result$cluster, pam_result$clustering)
ari_label <- ifelse(ari_kmpam >= 0.80, "Near-Perfect Agreement",
ifelse(ari_kmpam >= 0.60, "Strong Agreement",
ifelse(ari_kmpam >= 0.40, "Moderate Agreement", "Weak Agreement")))
kable(data.frame(Metric = c("Adjusted Rand Index (ARI)", "Interpretation"),
Value = c(round(ari_kmpam, 4), ari_label)),
caption = "K-Means vs PAM Agreement",
col.names = c("Metric", "Value")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Metric | Value |
|---|---|
| Adjusted Rand Index (ARI) | 0.844 |
| Interpretation | Near-Perfect Agreement |
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies clusters as regions of high point density separated by regions of lower density. Unlike partitional methods, DBSCAN does not require k to be specified a priori and is capable of discovering clusters of arbitrary shape. Two key hyperparameters must be tuned: ε (epsilon), the neighborhood radius; and MinPts, the minimum number of points required to form a dense region. Points unreachable from any core point are labeled as noise.
min_pts <- max(4, 2 * ncol(features_scaled))
knn_dist <- kNNdist(as.matrix(features_scaled), k = min_pts)
knn_sorted <- if (is.matrix(knn_dist)) sort(knn_dist[, min_pts]) else sort(knn_dist)
ggplot(data.frame(Index = seq_along(knn_sorted), Distance = knn_sorted),
aes(x = Index, y = Distance)) +
geom_line(color = "#2E86AB", linewidth = 1) +
geom_hline(yintercept = quantile(knn_sorted, 0.9), color = "red",
linetype = "dashed", linewidth = 0.8) +
labs(title = paste0(min_pts, "-NN Distance Plot for DBSCAN ε Selection"),
subtitle = "Red dashed line = 90th percentile (suggested ε)",
x = "Points (sorted)", y = paste0(min_pts, "-NN Distance")) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))k-NN Distance Plot for DBSCAN ε Selection
eps_val <- round(quantile(knn_sorted, 0.9), 2)
cat(sprintf("MinPts = %d | Suggested ε = %.4f\n", min_pts, eps_val))## MinPts = 32 | Suggested ε = 8.2000
set.seed(42)
db_result <- dbscan(as.matrix(features_scaled), eps = eps_val, minPts = min_pts)
df_latest$DBSCAN_Cluster <- factor(db_result$cluster)
pcs$DBSCAN <- factor(db_result$cluster)
n_clusters_db <- length(unique(db_result$cluster[db_result$cluster != 0]))
n_noise_db <- sum(db_result$cluster == 0)
kable(as.data.frame(table(Cluster = db_result$cluster)),
caption = "DBSCAN Cluster Sizes (Cluster 0 = Noise)") %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Cluster | Freq |
|---|---|
| 1 | 38 |
cat(sprintf("Clusters found: %d | Noise points: %d (%.1f%%)\n",
n_clusters_db, n_noise_db, n_noise_db / nrow(features_scaled) * 100))## Clusters found: 1 | Noise points: 0 (0.0%)
pcs_db <- pcs
pcs_db$DBSCAN_label <- ifelse(as.character(pcs_db$DBSCAN) == "0", "Noise",
paste0("Cluster ", pcs_db$DBSCAN))
n_pal <- max(3, n_clusters_db)
db_pal <- setNames(brewer.pal(n_pal, "Dark2")[1:n_clusters_db],
paste0("Cluster ", 1:n_clusters_db))
color_vals <- c("Noise" = "gray60", db_pal)
shape_vals <- c("Noise" = 4, setNames(rep(16, n_clusters_db),
paste0("Cluster ", 1:n_clusters_db)))
ggplot(pcs_db, aes(x = PC1, y = PC2, color = DBSCAN_label,
shape = DBSCAN_label, label = PROVINSI)) +
geom_point(size = 3.5, alpha = 0.85) +
geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
scale_color_manual(values = color_vals) +
scale_shape_manual(values = shape_vals) +
labs(title = "DBSCAN Clustering PCA Projection",
subtitle = paste0("ε = ", eps_val, ", MinPts = ", min_pts, "; x = noise"),
x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
color = "Group", shape = "Group") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))DBSCAN Clusters PCA Projection
eps_grid <- round(seq(eps_val * 0.6, eps_val * 1.6, length.out = 7), 3)
minpts_grid <- c(min_pts - 2, min_pts, min_pts + 2)
sens_results <- expand.grid(eps = eps_grid, minPts = minpts_grid) %>%
rowwise() %>%
mutate(
res = list(dbscan(as.matrix(features_scaled), eps = eps, minPts = minPts)),
n_clusters = length(unique(res$cluster[res$cluster != 0])),
n_noise = sum(res$cluster == 0)
) %>%
select(-res) %>%
ungroup()
kable(sens_results, caption = "DBSCAN Sensitivity: Clusters & Noise Across Parameter Grid",
col.names = c("ε", "MinPts", "Clusters", "Noise Points")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| ε | MinPts | Clusters | Noise Points |
|---|---|---|---|
| 4.920 | 30 | 1 | 6 |
| 6.287 | 30 | 1 | 1 |
| 7.653 | 30 | 1 | 1 |
| 9.020 | 30 | 1 | 0 |
| 10.387 | 30 | 1 | 0 |
| 11.753 | 30 | 1 | 0 |
| 13.120 | 30 | 1 | 0 |
| 4.920 | 32 | 1 | 6 |
| 6.287 | 32 | 1 | 1 |
| 7.653 | 32 | 1 | 1 |
| 9.020 | 32 | 1 | 0 |
| 10.387 | 32 | 1 | 0 |
| 11.753 | 32 | 1 | 0 |
| 13.120 | 32 | 1 | 0 |
| 4.920 | 34 | 0 | 38 |
| 6.287 | 34 | 1 | 2 |
| 7.653 | 34 | 1 | 1 |
| 9.020 | 34 | 1 | 0 |
| 10.387 | 34 | 1 | 0 |
| 11.753 | 34 | 1 | 0 |
| 13.120 | 34 | 1 | 0 |
Mean Shift is a non-parametric, kernel-density-based clustering algorithm that does not require the number of clusters to be specified in advance. The algorithm iteratively shifts each data point toward the local mode of the estimated kernel density function. Points converging to the same mode are assigned to the same cluster. The bandwidth parameter h controls the smoothness of the kernel density estimate: smaller bandwidths produce more clusters, while larger bandwidths merge modes into fewer clusters. Given computational intensity in high dimensions, PCA-reduced data is used, consistent with standard practice in the literature.
n_pcs_ms <- min(5, ncol(pca_result$x))
pcs_ms <- pca_result$x[, 1:n_pcs_ms]
bw_val <- mean(apply(pcs_ms, 2, sd)) * nrow(pcs_ms)^(-1 / (n_pcs_ms + 4)) * 1.5
ms_result <- meanShift(pcs_ms, bandwidth = rep(bw_val, n_pcs_ms))
ms_labels <- ms_result$assignment
df_latest$MeanShift_Cluster <- factor(ms_labels)
pcs$MeanShift <- factor(ms_labels)
n_ms_clusters <- length(unique(ms_labels))
kable(as.data.frame(table(Cluster = ms_labels)),
caption = "Mean Shift Cluster Sizes") %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Cluster | Freq |
|---|---|
| 1 | 2 |
| 2 | 13 |
| 3 | 2 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 2 |
| 10 | 3 |
| 11 | 1 |
| 12 | 4 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 1 |
cat(sprintf("Applied on %d PCs (%.1f%% variance) | Bandwidth = %.4f | Clusters = %d\n",
n_pcs_ms, pca_cumvar[n_pcs_ms], bw_val, n_ms_clusters))## Applied on 5 PCs (88.6% variance) | Bandwidth = 1.6076 | Clusters = 18
ggplot(pcs, aes(x = PC1, y = PC2, color = MeanShift, label = PROVINSI)) +
geom_point(size = 3.5, alpha = 0.85) +
geom_text_repel(size = 2.5, max.overlaps = 15, segment.alpha = 0.5) +
stat_ellipse(aes(fill = MeanShift), geom = "polygon", alpha = 0.1, level = 0.90) +
scale_color_brewer(palette = "Set3") +
scale_fill_brewer(palette = "Set3") +
labs(title = "Mean Shift Clustering PCA Projection",
subtitle = paste0("Bandwidth h = ", round(bw_val, 4),
"; applied on ", n_pcs_ms, " PCs"),
x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
color = "Cluster", fill = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Mean Shift Clusters PCA Projection
Fuzzy C-Means (FCM) is a soft partitioning algorithm that assigns each observation a membership degree to each cluster rather than a crisp binary assignment. Membership degrees range from 0 to 1 and sum to 1 across all clusters for each observation. This probabilistic assignment is particularly valuable for socioeconomic data where provinces may share characteristics across multiple development clusters, reflecting gradual developmental transitions rather than abrupt boundaries. The fuzziness parameter m = 2 is the standard choice in the literature.
set.seed(42)
fcm_result <- ppclust::fcm(as.matrix(features_scaled), centers = final_k, m = 2, nstart = 10)
fcm_hard <- apply(fcm_result$u, 1, which.max)
df_latest$FCM_Cluster <- factor(fcm_hard)
pcs$FCM <- factor(fcm_hard)
kable(as.data.frame(table(Cluster = fcm_hard)),
caption = "FCM Cluster Sizes (Hard Assignment)") %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Cluster | Freq |
|---|---|
| 1 | 19 |
| 2 | 19 |
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
## Clusters: 2 | m = 2 | Objective: 293.1874
membership_df <- as.data.frame(round(fcm_result$u, 4))
names(membership_df) <- paste0("Cluster_", 1:final_k)
membership_df$PROVINSI <- df_latest$PROVINSI
membership_df$HardCluster <- fcm_hard
membership_df <- membership_df %>%
select(PROVINSI, HardCluster, everything()) %>%
arrange(HardCluster, PROVINSI)
kable(membership_df, caption = "Fuzzy Membership Degrees per Province",
col.names = c("Province", "Hard Cluster", paste0("mu(C", 1:final_k, ")"))) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = FALSE, font_size = 11) %>%
scroll_box(height = "400px")| Province | Hard Cluster | mu(C1) | mu(C2) | |
|---|---|---|---|---|
| ACEH | ACEH | 1 | 0.7383 | 0.2617 |
| BANTEN | BANTEN | 1 | 0.6846 | 0.3154 |
| DI YOGYAKARTA | DI YOGYAKARTA | 1 | 0.5318 | 0.4682 |
| DKI JAKARTA | DKI JAKARTA | 1 | 0.6856 | 0.3144 |
| JAMBI | JAMBI | 1 | 0.6423 | 0.3577 |
| JAWA BARAT | JAWA BARAT | 1 | 0.5413 | 0.4587 |
| KALIMANTAN TIMUR | KALIMANTAN TIMUR | 1 | 0.7479 | 0.2521 |
| KALIMANTAN UTARA | KALIMANTAN UTARA | 1 | 0.6538 | 0.3462 |
| KEP. BANGKA BELITUNG | KEP. BANGKA BELITUNG | 1 | 0.5488 | 0.4512 |
| KEP. RIAU | KEP. RIAU | 1 | 0.6971 | 0.3029 |
| MALUKU | MALUKU | 1 | 0.6989 | 0.3011 |
| MALUKU UTARA | MALUKU UTARA | 1 | 0.7108 | 0.2892 |
| PAPUA BARAT | PAPUA BARAT | 1 | 0.5774 | 0.4226 |
| PAPUA BARAT DAYA | PAPUA BARAT DAYA | 1 | 0.6359 | 0.3641 |
| RIAU | RIAU | 1 | 0.7717 | 0.2283 |
| SULAWESI SELATAN | SULAWESI SELATAN | 1 | 0.5361 | 0.4639 |
| SULAWESI UTARA | SULAWESI UTARA | 1 | 0.6367 | 0.3633 |
| SUMATERA BARAT | SUMATERA BARAT | 1 | 0.7260 | 0.2740 |
| SUMATERA UTARA | SUMATERA UTARA | 1 | 0.7466 | 0.2534 |
| BALI | BALI | 2 | 0.4567 | 0.5433 |
| BENGKULU | BENGKULU | 2 | 0.4167 | 0.5833 |
| GORONTALO | GORONTALO | 2 | 0.2455 | 0.7545 |
| JAWA TENGAH | JAWA TENGAH | 2 | 0.4489 | 0.5511 |
| JAWA TIMUR | JAWA TIMUR | 2 | 0.4631 | 0.5369 |
| KALIMANTAN BARAT | KALIMANTAN BARAT | 2 | 0.4152 | 0.5848 |
| KALIMANTAN SELATAN | KALIMANTAN SELATAN | 2 | 0.4937 | 0.5063 |
| KALIMANTAN TENGAH | KALIMANTAN TENGAH | 2 | 0.4805 | 0.5195 |
| LAMPUNG | LAMPUNG | 2 | 0.2345 | 0.7655 |
| NUSA TENGGARA BARAT | NUSA TENGGARA BARAT | 2 | 0.2666 | 0.7334 |
| NUSA TENGGARA TIMUR | NUSA TENGGARA TIMUR | 2 | 0.2590 | 0.7410 |
| PAPUA | PAPUA | 2 | 0.3706 | 0.6294 |
| PAPUA PEGUNUNGAN | PAPUA PEGUNUNGAN | 2 | 0.4227 | 0.5773 |
| PAPUA SELATAN | PAPUA SELATAN | 2 | 0.3370 | 0.6630 |
| PAPUA TENGAH | PAPUA TENGAH | 2 | 0.3845 | 0.6155 |
| SULAWESI BARAT | SULAWESI BARAT | 2 | 0.2785 | 0.7215 |
| SULAWESI TENGAH | SULAWESI TENGAH | 2 | 0.1874 | 0.8126 |
| SULAWESI TENGGARA | SULAWESI TENGGARA | 2 | 0.3685 | 0.6315 |
| SUMATERA SELATAN | SUMATERA SELATAN | 2 | 0.2887 | 0.7113 |
max_membership <- apply(fcm_result$u, 1, max)
ambiguous_idx <- which(max_membership < 0.60)
if (length(ambiguous_idx) == 0) {
kable(data.frame(Result = "No provinces with ambiguous membership (max mu < 0.60) detected."),
col.names = NULL) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
} else {
kable(data.frame(
Province = df_latest$PROVINSI[ambiguous_idx],
Max_Mu = round(max_membership[ambiguous_idx], 4),
HardCluster = fcm_hard[ambiguous_idx]
), caption = "Provinces with Ambiguous Membership (max mu < 0.60)",
col.names = c("Province", "Max Membership", "Hard Cluster")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
}| Province | Max Membership | Hard Cluster | |
|---|---|---|---|
| BALI | BALI | 0.5433 | 2 |
| BENGKULU | BENGKULU | 0.5833 | 2 |
| DI YOGYAKARTA | DI YOGYAKARTA | 0.5318 | 1 |
| JAWA BARAT | JAWA BARAT | 0.5413 | 1 |
| JAWA TENGAH | JAWA TENGAH | 0.5511 | 2 |
| JAWA TIMUR | JAWA TIMUR | 0.5369 | 2 |
| KALIMANTAN BARAT | KALIMANTAN BARAT | 0.5848 | 2 |
| KALIMANTAN SELATAN | KALIMANTAN SELATAN | 0.5063 | 2 |
| KALIMANTAN TENGAH | KALIMANTAN TENGAH | 0.5195 | 2 |
| KEP. BANGKA BELITUNG | KEP. BANGKA BELITUNG | 0.5488 | 1 |
| PAPUA BARAT | PAPUA BARAT | 0.5774 | 1 |
| PAPUA PEGUNUNGAN | PAPUA PEGUNUNGAN | 0.5773 | 2 |
| SULAWESI SELATAN | SULAWESI SELATAN | 0.5361 | 1 |
pcs$MaxMembership <- max_membership
ggplot(pcs, aes(x = PC1, y = PC2, color = FCM, size = MaxMembership, label = PROVINSI)) +
geom_point(alpha = 0.85) +
geom_text_repel(size = 2.3, max.overlaps = 15, segment.alpha = 0.5) +
scale_color_brewer(palette = "Paired") +
scale_size_continuous(range = c(2, 6), name = "Max Membership") +
labs(title = paste0("Fuzzy C-Means (k = ", final_k, ") PCA Projection"),
subtitle = "Point size proportional to maximum membership degree",
x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
y = sprintf("PC2 (%.1f%%)", pca_var_pct[2]),
color = "Cluster") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))FCM Clusters PCA Projection
fcm_centers <- as.data.frame(fcm_result$v)
names(fcm_centers) <- names(features)
kable(round(fcm_centers, 3), caption = "FCM Cluster Centroids (Standardized)") %>%
kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
scroll_box(width = "100%")| APS_13_15 | APS_16_18 | APS_19_24 | TPT_Feb | TPT_Agt | TPAK_Feb | TPAK_Agt | GK_Maret | GK_Sept | JPM_Maret | JPM_Sept | PPM_Maret | PPM_Sept | IPM | RLS_15Plus | HLS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster 1 | 0.272 | 0.274 | 0.139 | 0.394 | 0.418 | -0.359 | -0.377 | 0.259 | 0.308 | -0.058 | -0.059 | -0.222 | -0.237 | 0.273 | 0.309 | 0.177 |
| Cluster 2 | -0.265 | -0.245 | -0.140 | -0.418 | -0.450 | 0.343 | 0.368 | -0.289 | -0.333 | 0.029 | 0.031 | 0.228 | 0.243 | -0.280 | -0.299 | -0.158 |
Internal validation assesses the quality of a clustering solution using only the data itself, without reference to external ground-truth labels. Key indices include:
| Index | Criterion | Preferred Value |
|---|---|---|
| Average Silhouette Width | Cohesion & separation | Higher (max = 1) |
| Dunn Index | Min inter-cluster / Max intra-cluster distance | Higher |
| Calinski-Harabasz Index | Between-cluster to within-cluster SS ratio | Higher |
| Davies-Bouldin Index | Average similarity between clusters | Lower |
dist_matrix <- dist(features_scaled)
dunn_index <- function(labels, dist_mat) {
labels <- as.integer(factor(labels))
k <- length(unique(labels))
dm <- as.matrix(dist_mat)
intra <- sapply(1:k, function(cl) {
idx <- which(labels == cl)
if (length(idx) < 2) return(0)
max(dm[idx, idx])
})
inter_min <- Inf
for (i in 1:(k - 1)) {
for (j in (i + 1):k) {
d <- min(dm[which(labels == i), which(labels == j)])
if (d < inter_min) inter_min <- d
}
}
inter_min / max(intra)
}
ch_index <- function(labels, data) {
labels <- as.integer(factor(labels))
k <- length(unique(labels))
n <- nrow(data)
grand_mean <- colMeans(data)
ssb <- sum(sapply(unique(labels), function(cl) {
pts <- data[labels == cl, , drop = FALSE]
nrow(pts) * sum((colMeans(pts) - grand_mean)^2)
}))
ssw <- sum(sapply(unique(labels), function(cl) {
pts <- data[labels == cl, , drop = FALSE]
if (nrow(pts) < 2) return(0)
sum(apply(pts, 1, function(r) sum((r - colMeans(pts))^2)))
}))
(ssb / (k - 1)) / (ssw / (n - k))
}
db_index <- function(labels, data) {
labels <- as.integer(factor(labels))
k <- length(unique(labels))
cents <- sapply(1:k, function(cl) colMeans(data[labels == cl, , drop = FALSE]))
s <- sapply(1:k, function(cl) {
pts <- data[labels == cl, , drop = FALSE]
mean(sqrt(rowSums(sweep(pts, 2, cents[, cl])^2)))
})
mean(sapply(1:k, function(i) {
max(sapply(setdiff(1:k, i), function(j) {
(s[i] + s[j]) / sqrt(sum((cents[, i] - cents[, j])^2))
}))
}))
}
method_labels <- list(
"K-Means" = km_result$cluster,
"PAM" = as.integer(pam_result$clustering),
"DBSCAN" = db_result$cluster,
"Mean Shift" = as.integer(factor(ms_labels)),
"FCM" = fcm_hard
)
data_matrix <- as.matrix(features_scaled)
valid_df <- lapply(names(method_labels), function(mname) {
labels <- method_labels[[mname]]
if (mname == "DBSCAN") {
valid_idx <- labels != 0
if (sum(valid_idx) < 5 || length(unique(labels[valid_idx])) < 2) {
return(data.frame(Method = mname, N_Clusters = NA,
Avg_Silhouette = NA, Dunn = NA, CH = NA, DB = NA))
}
labels_v <- labels[valid_idx]
data_v <- data_matrix[valid_idx, ]
dist_v <- dist(data_v)
} else {
labels_v <- labels
data_v <- data_matrix
dist_v <- dist_matrix
}
if (length(unique(labels_v)) < 2) {
return(data.frame(Method = mname, N_Clusters = 1,
Avg_Silhouette = NA, Dunn = NA, CH = NA, DB = NA))
}
data.frame(
Method = mname,
N_Clusters = length(unique(labels_v)),
Avg_Silhouette = round(mean(silhouette(as.integer(factor(labels_v)), dist_v)[, 3]), 4),
Dunn = round(tryCatch(dunn_index(labels_v, dist_v), error = function(e) NA), 4),
CH = round(tryCatch(ch_index(labels_v, data_v), error = function(e) NA), 2),
DB = round(tryCatch(db_index(labels_v, data_v), error = function(e) NA), 4)
)
}) %>% bind_rows()
kable(valid_df, caption = "Internal Validation Indices Across All Clustering Methods",
col.names = c("Method", "Clusters", "Avg Silhouette (up)", "Dunn (up)", "CH (up)", "DB (down)")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)| Method | Clusters | Avg Silhouette (up) | Dunn (up) | CH (up) | DB (down) |
|---|---|---|---|---|---|
| K-Means | 2 | 0.4024 | 0.4475 | 12.92 | 0.9556 |
| PAM | 2 | 0.3588 | 0.2605 | 12.78 | 1.0859 |
| DBSCAN | NA | NA | NA | NA | NA |
| Mean Shift | 18 | 0.0029 | 0.2681 | 5.89 | 0.8378 |
| FCM | 2 | 0.1824 | 0.1189 | 10.30 | 1.7095 |
best_sil <- valid_df$Method[which.max(valid_df$Avg_Silhouette)]
best_ch <- valid_df$Method[which.max(valid_df$CH)]
best_db <- valid_df$Method[which.min(valid_df$DB)]
best_dunn <- valid_df$Method[which.max(valid_df$Dunn)]
vote_counts <- table(c(best_sil, best_ch, best_db, best_dunn))
overall_best <- names(which.max(vote_counts))
kable(data.frame(
Index = c("Best Avg Silhouette", "Best CH Index", "Best DB Index",
"Best Dunn Index", "Overall Best (Majority Vote)"),
Method = c(best_sil, best_ch, best_db, best_dunn, overall_best)
), caption = "Validation Summary", col.names = c("Index", "Best Method")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Index | Best Method |
|---|---|
| Best Avg Silhouette | K-Means |
| Best CH Index | K-Means |
| Best DB Index | Mean Shift |
| Best Dunn Index | K-Means |
| Overall Best (Majority Vote) | K-Means |
Bootstrap-based stability assessment evaluates how consistently the clustering solution replicates across repeated resamples of the data. A stability score above 0.85 indicates high stability; values below 0.60 suggest an unstable solution.
set.seed(42)
n_boot <- 100
n_obs <- nrow(features_scaled)
boot_stab <- sapply(1:n_boot, function(b) {
idx <- sample(1:n_obs, n_obs, replace = TRUE)
boot_data <- features_scaled[idx, ]
km_boot <- kmeans(boot_data, centers = final_k, nstart = 10, iter.max = 100)
orig_labels <- km_result$cluster[idx]
boot_labels <- km_boot$cluster
mean(sapply(1:final_k, function(cl) {
orig_idx_cl <- which(orig_labels == cl)
if (length(orig_idx_cl) < 2) return(1)
if (length(orig_idx_cl) > 50) orig_idx_cl <- sample(orig_idx_cl, 50)
pairs <- combn(orig_idx_cl, 2)
mean(boot_labels[pairs[1, ]] == boot_labels[pairs[2, ]])
}))
})
stab_label <- ifelse(mean(boot_stab) >= 0.85, "High Stability",
ifelse(mean(boot_stab) >= 0.75, "Acceptable Stability",
ifelse(mean(boot_stab) >= 0.60, "Moderate Stability", "Low Stability")))
kable(data.frame(
Metric = c("Mean Stability Score", "Std Deviation", "Interpretation"),
Value = c(round(mean(boot_stab), 4), round(sd(boot_stab), 4), stab_label)
), caption = paste0("Bootstrap Stability K-Means (n = ", n_boot, " resamples)"),
col.names = c("Metric", "Value")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Metric | Value |
|---|---|
| Mean Stability Score | 0.9069 |
| Std Deviation | 0.1105 |
| Interpretation | High Stability |
The Adjusted Rand Index (ARI) quantifies the agreement between two clustering solutions, corrected for chance. ARI = 1 indicates perfect agreement, ARI = 0 indicates chance-level agreement.
method_names <- names(method_labels)
ari_mat <- matrix(NA, length(method_names), length(method_names),
dimnames = list(method_names, method_names))
for (i in seq_along(method_names)) {
for (j in seq_along(method_names)) {
ari_mat[i, j] <- ifelse(i == j, 1.0,
round(adjustedRandIndex(method_labels[[i]], method_labels[[j]]), 3))
}
}
ari_melt <- melt(ari_mat)
names(ari_melt) <- c("Method1", "Method2", "ARI")
ggplot(ari_melt, aes(x = Method1, y = Method2, fill = ARI, label = round(ARI, 2))) +
geom_tile(color = "white") +
geom_text(size = 4, fontface = "bold") +
scale_fill_gradient2(low = "#E84855", mid = "white", high = "#2E86AB",
midpoint = 0.5, limits = c(-0.1, 1), name = "ARI") +
labs(title = "Pairwise Adjusted Rand Index",
subtitle = "ARI = 1: identical | ARI = 0: random",
x = NULL, y = NULL) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 30, hjust = 1))Pairwise ARI Heatmap
kable(df_latest %>%
select(PROVINSI, KMeans_Cluster, PAM_Cluster, DBSCAN_Cluster,
MeanShift_Cluster, FCM_Cluster) %>%
arrange(KMeans_Cluster, PROVINSI),
caption = "Province Cluster Assignments Across All Methods",
col.names = c("Province", "K-Means", "PAM", "DBSCAN", "Mean Shift", "FCM")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = FALSE, font_size = 11) %>%
scroll_box(height = "500px")| Province | K-Means | PAM | DBSCAN | Mean Shift | FCM |
|---|---|---|---|---|---|
| PAPUA | 1 | 2 | 1 | 14 | 2 |
| PAPUA PEGUNUNGAN | 1 | 2 | 1 | 16 | 2 |
| PAPUA SELATAN | 1 | 2 | 1 | 17 | 2 |
| PAPUA TENGAH | 1 | 2 | 1 | 18 | 2 |
| ACEH | 2 | 1 | 1 | 1 | 1 |
| BALI | 2 | 1 | 1 | 2 | 2 |
| BANTEN | 2 | 1 | 1 | 3 | 1 |
| BENGKULU | 2 | 1 | 1 | 2 | 2 |
| DI YOGYAKARTA | 2 | 1 | 1 | 4 | 1 |
| DKI JAKARTA | 2 | 1 | 1 | 5 | 1 |
| GORONTALO | 2 | 1 | 1 | 2 | 2 |
| JAMBI | 2 | 1 | 1 | 2 | 1 |
| JAWA BARAT | 2 | 1 | 1 | 6 | 1 |
| JAWA TENGAH | 2 | 1 | 1 | 7 | 2 |
| JAWA TIMUR | 2 | 1 | 1 | 8 | 2 |
| KALIMANTAN BARAT | 2 | 1 | 1 | 2 | 2 |
| KALIMANTAN SELATAN | 2 | 1 | 1 | 2 | 2 |
| KALIMANTAN TENGAH | 2 | 1 | 1 | 2 | 2 |
| KALIMANTAN TIMUR | 2 | 1 | 1 | 9 | 1 |
| KALIMANTAN UTARA | 2 | 1 | 1 | 10 | 1 |
| KEP. BANGKA BELITUNG | 2 | 1 | 1 | 10 | 1 |
| KEP. RIAU | 2 | 1 | 1 | 11 | 1 |
| LAMPUNG | 2 | 1 | 1 | 2 | 2 |
| MALUKU | 2 | 1 | 1 | 9 | 1 |
| MALUKU UTARA | 2 | 1 | 1 | 12 | 1 |
| NUSA TENGGARA BARAT | 2 | 1 | 1 | 2 | 2 |
| NUSA TENGGARA TIMUR | 2 | 2 | 1 | 13 | 2 |
| PAPUA BARAT | 2 | 1 | 1 | 15 | 1 |
| PAPUA BARAT DAYA | 2 | 1 | 1 | 1 | 1 |
| RIAU | 2 | 1 | 1 | 10 | 1 |
| SULAWESI BARAT | 2 | 1 | 1 | 2 | 2 |
| SULAWESI SELATAN | 2 | 1 | 1 | 12 | 1 |
| SULAWESI TENGAH | 2 | 1 | 1 | 2 | 2 |
| SULAWESI TENGGARA | 2 | 1 | 1 | 2 | 2 |
| SULAWESI UTARA | 2 | 1 | 1 | 12 | 1 |
| SUMATERA BARAT | 2 | 1 | 1 | 3 | 1 |
| SUMATERA SELATAN | 2 | 1 | 1 | 2 | 2 |
| SUMATERA UTARA | 2 | 1 | 1 | 12 | 1 |
pcs_long <- pcs %>%
select(PC1, PC2, PROVINSI, KMeans, PAM, DBSCAN, MeanShift, FCM) %>%
pivot_longer(cols = c(KMeans, PAM, DBSCAN, MeanShift, FCM),
names_to = "Method", values_to = "Cluster")
ggplot(pcs_long, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2.5, alpha = 0.8) +
facet_wrap(~Method, ncol = 3, scales = "fixed") +
scale_color_brewer(palette = "Set1") +
labs(title = "Clustering Results Comparison PCA Projection",
x = sprintf("PC1 (%.1f%%)", pca_var_pct[1]),
y = sprintf("PC2 (%.1f%%)", pca_var_pct[2])) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold", size = 13),
strip.text = element_text(face = "bold"),
legend.position = "bottom")PCA Projection: All Clustering Methods
Using K-Means as the primary interpretive framework, cluster profiles are generated by computing the mean of each original (unstandardized) variable per cluster, enabling substantive socioeconomic interpretation of group differences across Indonesian provinces.
profile_df <- df_latest %>%
select(KMeans_Cluster, all_of(names(features))) %>%
group_by(KMeans_Cluster) %>%
summarise(across(everything(), mean, na.rm = TRUE), .groups = "drop")
kable(profile_df %>% mutate(across(-KMeans_Cluster, ~round(.x, 2))),
caption = "Mean Socioeconomic Indicators by K-Means Cluster (Original Scale)") %>%
kable_styling(bootstrap_options = c("hover"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| KMeans_Cluster | APS_13_15 | APS_16_18 | APS_19_24 | TPT_Feb | TPT_Agt | TPAK_Feb | TPAK_Agt | GK_Maret | GK_Sept | JPM_Maret | JPM_Sept | PPM_Maret | PPM_Sept | IPM | RLS_15Plus | HLS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 80.91 | 64.15 | 23.89 | 3.49 | 2.67 | 76.79 | 77.20 | 741381.2 | 708811.0 | 229.76 | 161.07 | 24.36 | 23.68 | 63.01 | 7.74 | 12.52 |
| 2 | 96.04 | 75.98 | 28.48 | 4.87 | 4.69 | 68.79 | 69.05 | 632247.1 | 645927.2 | 714.71 | 681.53 | 9.59 | 9.13 | 73.90 | 9.47 | 13.41 |
key_vars <- c("IPM", "PPM_Maret", "TPT_Feb", "APS_13_15", "GK_Maret", "RLS_15Plus")
var_labels <- c(IPM = "Human Dev. Index (IPM)",
PPM_Maret = "Poverty Rate - March (%)",
TPT_Feb = "Unemployment Rate - Feb (%)",
APS_13_15 = "School Participation 13-15",
GK_Maret = "Poverty Line - March (Rp)",
RLS_15Plus = "Avg Schooling Years (15+)")
profile_long <- profile_df %>%
select(KMeans_Cluster, all_of(key_vars)) %>%
pivot_longer(-KMeans_Cluster, names_to = "Variable", values_to = "Mean_Value") %>%
mutate(Variable = var_labels[Variable])
ggplot(profile_long, aes(x = KMeans_Cluster, y = Mean_Value, fill = KMeans_Cluster)) +
geom_col(alpha = 0.85, color = "white") +
facet_wrap(~Variable, scales = "free_y", ncol = 3) +
scale_fill_brewer(palette = "Set1", guide = "none") +
labs(title = "Cluster Profiles: Key Socioeconomic Indicators",
subtitle = "Mean values per K-Means cluster; original scale",
x = "Cluster", y = "Mean Value") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))Cluster Profiles: Key Socioeconomic Indicators
ipm_means <- profile_df$IPM
ppm_means <- profile_df$PPM_Maret
tpt_means <- profile_df$TPT_Feb
gk_means <- profile_df$GK_Maret
narrative_df <- lapply(1:nrow(profile_df), function(i) {
cl <- profile_df$KMeans_Cluster[i]
provs <- paste(sort(df_latest$PROVINSI[df_latest$KMeans_Cluster == cl]), collapse = ", ")
ipm_rank <- rank(-ipm_means)[i]
ppm_rank <- rank(ppm_means)[i]
dev_level <- ifelse(ipm_rank == 1, "Most Developed",
ifelse(ipm_rank == nrow(profile_df), "Least Developed", "Developing"))
pov_level <- ifelse(ppm_rank == 1, "Lowest Poverty",
ifelse(ppm_rank == nrow(profile_df), "Highest Poverty", "Moderate Poverty"))
data.frame(
Cluster = as.integer(cl),
Dev_Level = dev_level,
Poverty_Level = pov_level,
N = sum(df_latest$KMeans_Cluster == cl),
Mean_IPM = round(ipm_means[i], 2),
Mean_PPM = round(ppm_means[i], 2),
Mean_TPT = round(tpt_means[i], 2),
Mean_GK = format(round(gk_means[i]), big.mark = "."),
Provinces = provs
)
}) %>% bind_rows()
kable(narrative_df, caption = "Cluster Socioeconomic Narrative (K-Means)",
col.names = c("Cluster", "Dev. Level", "Poverty Level", "N", "Mean IPM",
"Mean PPM (%)", "Mean TPT (%)", "Mean GK (Rp)", "Provinces")) %>%
kable_styling(bootstrap_options = c("hover"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| Cluster | Dev. Level | Poverty Level | N | Mean IPM | Mean PPM (%) | Mean TPT (%) | Mean GK (Rp) | Provinces |
|---|---|---|---|---|---|---|---|---|
| 1 | Least Developed | Highest Poverty | 4 | 63.01 | 24.36 | 3.49 | 741.381 | PAPUA, PAPUA PEGUNUNGAN, PAPUA SELATAN, PAPUA TENGAH |
| 2 | Most Developed | Lowest Poverty | 34 | 73.90 | 9.59 | 4.87 | 632.247 | ACEH, BALI, BANTEN, BENGKULU, DI YOGYAKARTA, DKI JAKARTA, GORONTALO, JAMBI, JAWA BARAT, JAWA TENGAH, JAWA TIMUR, KALIMANTAN BARAT, KALIMANTAN SELATAN, KALIMANTAN TENGAH, KALIMANTAN TIMUR, KALIMANTAN UTARA, KEP. BANGKA BELITUNG, KEP. RIAU, LAMPUNG, MALUKU, MALUKU UTARA, NUSA TENGGARA BARAT, NUSA TENGGARA TIMUR, PAPUA BARAT, PAPUA BARAT DAYA, RIAU, SULAWESI BARAT, SULAWESI SELATAN, SULAWESI TENGAH, SULAWESI TENGGARA, SULAWESI UTARA, SUMATERA BARAT, SUMATERA SELATAN, SUMATERA UTARA |
kable(df_latest %>%
select(PROVINSI, TAHUN, KMeans_Cluster, IPM, PPM_Maret, TPT_Feb, GK_Maret) %>%
arrange(KMeans_Cluster, desc(IPM)) %>%
mutate(IPM = round(IPM, 2),
PPM_Maret = round(PPM_Maret, 2),
TPT_Feb = round(TPT_Feb, 2),
GK_Maret = format(GK_Maret, big.mark = ".")),
caption = "Final Cluster Assignments with Key Indicators",
col.names = c("Province", "Year", "Cluster", "IPM",
"Poverty Rate (%)", "Unemployment (%)", "Poverty Line (Rp)")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(height = "500px")| Province | Year | Cluster | IPM | Poverty Rate (%) | Unemployment (%) | Poverty Line (Rp) |
|---|---|---|---|---|---|---|
| PAPUA | 2024 | 1 | 63.01 | 17.26 | 3.49 | 674.371 |
| PAPUA PEGUNUNGAN | 2024 | 1 | 63.01 | 32.97 | 3.49 | 1.007.060 |
| PAPUA SELATAN | 2024 | 1 | 63.01 | 17.44 | 3.49 | 519.979 |
| PAPUA TENGAH | 2024 | 1 | 63.01 | 29.76 | 3.49 | 764.115 |
| DKI JAKARTA | 2024 | 2 | 83.55 | 4.30 | 7.57 | 825.288 |
| DI YOGYAKARTA | 2024 | 2 | 81.09 | 10.83 | 3.58 | 602.437 |
| KEP. RIAU | 2024 | 2 | 79.08 | 5.37 | 7.61 | 787.211 |
| KALIMANTAN TIMUR | 2024 | 2 | 78.20 | 5.78 | 6.37 | 833.955 |
| BALI | 2024 | 2 | 78.01 | 4.00 | 3.73 | 568.510 |
| BANTEN | 2024 | 2 | 75.77 | 5.84 | 7.97 | 654.213 |
| SUMATERA BARAT | 2024 | 2 | 75.64 | 5.97 | 5.90 | 708.416 |
| SUMATERA UTARA | 2024 | 2 | 75.13 | 7.99 | 5.24 | 642.423 |
| SULAWESI UTARA | 2024 | 2 | 75.04 | 7.25 | 6.19 | 490.719 |
| RIAU | 2024 | 2 | 74.95 | 6.67 | 4.25 | 697.296 |
| ACEH | 2024 | 2 | 74.70 | 14.23 | 5.75 | 661.227 |
| KALIMANTAN SELATAN | 2024 | 2 | 74.66 | 4.11 | 3.95 | 632.739 |
| JAWA TIMUR | 2024 | 2 | 74.65 | 9.79 | 4.33 | 536.122 |
| SULAWESI SELATAN | 2024 | 2 | 74.60 | 8.06 | 5.26 | 459.226 |
| BENGKULU | 2024 | 2 | 74.30 | 13.56 | 3.21 | 671.095 |
| JAWA BARAT | 2024 | 2 | 74.24 | 7.46 | 7.89 | 524.052 |
| KEP. BANGKA BELITUNG | 2024 | 2 | 74.09 | 4.55 | 3.89 | 908.397 |
| JAMBI | 2024 | 2 | 73.73 | 7.10 | 4.50 | 650.115 |
| KALIMANTAN TENGAH | 2024 | 2 | 73.73 | 5.17 | 3.84 | 623.954 |
| JAWA TENGAH | 2024 | 2 | 73.39 | 10.47 | 5.24 | 507.001 |
| SUMATERA SELATAN | 2024 | 2 | 73.18 | 10.97 | 4.53 | 554.197 |
| SULAWESI TENGGARA | 2024 | 2 | 72.94 | 11.21 | 3.66 | 462.715 |
| KALIMANTAN UTARA | 2024 | 2 | 72.88 | 6.32 | 4.10 | 854.294 |
| MALUKU | 2024 | 2 | 72.75 | 16.05 | 6.08 | 713.503 |
| LAMPUNG | 2024 | 2 | 72.48 | 10.69 | 4.18 | 586.551 |
| NUSA TENGGARA BARAT | 2024 | 2 | 72.37 | 12.91 | 3.73 | 534.703 |
| SULAWESI TENGAH | 2024 | 2 | 71.66 | 11.77 | 3.49 | 600.872 |
| GORONTALO | 2024 | 2 | 71.25 | 14.57 | 3.07 | 473.006 |
| MALUKU UTARA | 2024 | 2 | 70.98 | 6.32 | 4.60 | 604.460 |
| KALIMANTAN BARAT | 2024 | 2 | 70.47 | 6.32 | 4.52 | 595.509 |
| SULAWESI BARAT | 2024 | 2 | 69.80 | 11.21 | 3.04 | 454.879 |
| NUSA TENGGARA TIMUR | 2024 | 2 | 68.40 | 19.48 | 3.10 | 527.275 |
| PAPUA BARAT | 2024 | 2 | 67.47 | 21.66 | 5.53 | 793.804 |
| PAPUA BARAT DAYA | 2024 | 2 | 67.47 | 18.13 | 5.53 | 756.237 |
kable(data.frame(
Method = c("K-Means", "PAM", "DBSCAN", "Mean Shift", "FCM"),
N_Clusters = c(final_k, final_k, n_clusters_db, n_ms_clusters, final_k),
Characteristic = c(
"Fast; sensitive to outliers; spherical cluster assumption",
"Robust to outliers; uses actual data points as representatives",
"Detects arbitrary shapes; identifies noise/outlier provinces",
"Non-parametric; auto-determines k via kernel density modes",
"Soft assignments; captures membership ambiguity across clusters"
)
), caption = "Algorithm Comparison Summary",
col.names = c("Method", "Clusters Found", "Key Characteristic")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = TRUE)| Method | Clusters Found | Key Characteristic |
|---|---|---|
| K-Means | 2 | Fast; sensitive to outliers; spherical cluster assumption |
| PAM | 2 | Robust to outliers; uses actual data points as representatives |
| DBSCAN | 1 | Detects arbitrary shapes; identifies noise/outlier provinces |
| Mean Shift | 18 | Non-parametric; auto-determines k via kernel density modes |
| FCM | 2 | Soft assignments; captures membership ambiguity across clusters |
## Overall best method (internal validation): K-Means
## K-Means bootstrap stability: High Stability (score = 0.9069)
Based on the clustering results, the following data-driven policy recommendations are proposed for each provincial development group.
kable(lapply(1:nrow(profile_df), function(i) {
cl <- profile_df$KMeans_Cluster[i]
ipm <- ipm_means[i]
ppm <- ppm_means[i]
rec <- if (ipm >= quantile(ipm_means, 0.66)) {
"Sustain development momentum; promote as model province; reduce intra-cluster inequality."
} else if (ipm >= quantile(ipm_means, 0.33)) {
"Accelerate education quality (RLS & HLS); expand labor market programs; strengthen social protection."
} else {
"Priority intervention: basic education access, poverty alleviation, affirmative budget allocation."
}
data.frame(Cluster = as.integer(cl), Mean_IPM = round(ipm, 2),
Mean_PPM = round(ppm, 2), Recommendation = rec)
}) %>% bind_rows(),
caption = "Policy Recommendations by Cluster Development Profile",
col.names = c("Cluster", "Mean IPM", "Mean Poverty Rate (%)", "Recommendation")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = TRUE)| Cluster | Mean IPM | Mean Poverty Rate (%) | Recommendation |
|---|---|---|---|
| 1 | 63.01 | 24.36 | Priority intervention: basic education access, poverty alleviation, affirmative budget allocation. |
| 2 | 73.90 | 9.59 | Sustain development momentum; promote as model province; reduce intra-cluster inequality. |
| # | Limitation | Implication |
|---|---|---|
| 1 | Cross-sectional analysis (latest year only) | Temporal dynamics in development trajectories are not captured |
| 2 | K-Means assumes spherical, equally-sized clusters | Assumption likely violated in heterogeneous regional data |
| 3 | DBSCAN is sensitive to ε and MinPts | Small parameter changes can substantially alter results |
| 4 | Mean Shift applied on PCA-reduced space | Information in lower-ranked components may be lost |
| 5 | Highly correlated variables included | Certain dimensions may be implicitly over-weighted |
| 6 | IPM used alongside its sub-components | Multicollinearity between IPM, RLS, and HLS |
Document generated using R Markdown. All analyses conducted in R
(version >= 4.1.0). Reproducibility ensured via
set.seed(42) in all stochastic procedures.