Reproducible for Application of spatial ‘K’luster analysis by tree edge removal (SKATER) on infrastructure inequality mapping

Giang Phuong Tran

2026-01-17

Abstract

This study replicates and extends a spatial regionalization of Polish infrastructure using the SKATER algorithm. Using public data at the powiat level, we identify contiguous regions sharing similar characteristics in education, culture, and transport. Factor analysis reveals two latent dimensions driving inequality: “Regional Connectivity” and “Local Services.” Sensitivity analysis demonstrates that density-based metrics provide more meaningful clusters for policy planning than raw counts.

1. Motivation and research questions

This paper presents a reproducible replication of an existing spatial regionalization study based on the SKATER (Spatial ‘K’luster Analysis by Tree Edge Removal) algorithm.

The primary objective is to faithfully reproduce the methodological workflow of the original study using publicly available data at the powiat level in Poland, while ensuring full computational transparency and reproducibility.

In addition to replication, the paper introduces minor methodological extensions, including alternative data transformations (logarithmic, density-based, and population-normalized indicators) and a sensitivity analysis using within-cluster sum of squared errors (SSE).

Our research questions:

RQ1: Can we form contiguous (spatially connected) regions of Poland where neighboring powiats have similar infrastructure characteristics?
RQ2: What latent dimensions (factors) summarize infrastructure variation (e.g., “education availability”, “cultural/road access”)?
RQ3: Does the choice of data transformation (e.g., raw counts vs. densities) significantly change the resulting map? We compare:

Raw counts
log-transformed counts (log1p)
area-normalized densities (per km²)
population-normalized rates (per 10k inhabitants

RQ4: How does “fit” change with K (clusters)? We use within-cluster SSE as a simple diagnostic

2. Data and preprocessing

We combine:

Powiat polygons (shapefile): geometry and IDs (CRS typically EPSG:2180 for Poland)
Infrastructure indicators (CSV): counts of facilities
Regional dataset (2023): contextual variables like distance to core city, area, population

All data and replication materials are publicly available via Zenodo: Spatial Infrastructure Indicators and Powiat-Level Shapefiles for SKATER Regionalization in Poland (2023)

2.1 Read powiat polygons

Tnsure valid geometry: contiguity neighbours fail easily if polygons are invalid.

powiat <- st_read("powiaty/powiaty.shp", quiet = TRUE)

powiat_data <- powiat %>%
rename(
powiat_id   = JPT_KOD_JE,
powiat_name = JPT_NAZWA_,
powiat_type = RODZAJ,
shape_area  = SHAPE_AREA
) %>%
mutate(powiat_id = as.integer(powiat_id) * 1000) %>%
select(powiat_id, powiat_name, powiat_type, shape_area, geometry) %>%
st_make_valid()

st_crs(powiat_data)

## Coordinate Reference System:
##   User input: ETRF2000-PL / CS92 
##   wkt:
## PROJCRS["ETRF2000-PL / CS92",
##     BASEGEOGCRS["ETRF2000-PL",
##         DATUM["ETRF2000 Poland",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",9702]],
##     CONVERSION["Poland CS92",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",19,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9993,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",-5300000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (x)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (y)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Topographic mapping (medium and small scale)."],
##         AREA["Poland - onshore and offshore."],
##         BBOX[49,14.14,55.93,24.15]],
##     ID["EPSG",2180]]

2.2 Read infrastructure CSVs and regional dataset

powiat_joined <- powiat_data %>%
left_join(culture,    by = c("powiat_id" = "Code")) %>% rename(culture = value) %>%
left_join(roads,      by = c("powiat_id" = "Code")) %>% rename(roads = value) %>%
left_join(secondary,  by = c("powiat_id" = "Code")) %>% rename(secondary = value) %>%
left_join(primary,    by = c("powiat_id" = "Code")) %>% rename(primary = value) %>%
left_join(vocational, by = c("powiat_id" = "Code")) %>% rename(vocational = value) %>%
left_join(regional_2023, by = c("powiat_id" = "code"))

3. Descriptive statistical analysis

Infrastructure counts are typically highly skewed
Skew impacts correlation and factor analysis
This motivates log transform / rates later

Table 1: Descriptive Statistics of Infrastructure Indicators
Variable	Mean	SD	Min	25th %	Median	75th %	Max
culture	10.42	9.72	0.0	4.75	7.0	14.00	91.0
primary	36.96	31.07	8.0	21.00	29.0	44.00	417.0
roads	124.42	91.27	21.7	65.15	92.7	133.05	517.3
secondary	6.14	11.95	0.0	2.00	4.0	6.00	187.0
vocational	3.42	2.64	0.0	2.00	3.0	4.00	18.0

Infrastructure is unevenly distributed across powiats.

The log(1 + x) transformation substantially reduces skewness while retaining relative differences between regions. This improves the correlation structure and makes the data more suitable for factor analysis and distance-based clustering.

4. Factor analysis with Bartlett and KMO

We use factor analysis to:

Reduce many correlated indicators into fewer latent variables
Produce factor scores used as input to SKATER (new variables)

Why Bartlett and KMO:

Bartlett’s test checks if variables are correlated at all (reject identity matrix)
KMO checks if correlations are suitable for factor extraction (shared variance)

4.1 Prepare matrix for factor analysis

Apply log1p because counts are skewed and this improves correlation structure.

##            culture  roads secondary primary vocational
## culture      1.000 -0.138     0.152   0.543      0.182
## roads       -0.138  1.000     0.539   0.227      0.398
## secondary    0.152  0.539     1.000   0.536      0.766
## primary      0.543  0.227     0.536   1.000      0.553
## vocational   0.182  0.398     0.766   0.553      1.000

Infrastructure variables are not independent. Educational facilities and cultural infrastructure tend to co-locate, suggesting latent dimensions rather than isolated indicators. This supports the use of factor analysis to reduce dimensionality.

4.2 Bartlett and KMO

## $chisq
## [1] 793.3028
## 
## $p.value
## [1] 5.679348e-164
## 
## $df
## [1] 10

## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = cor_mat)
## Overall MSA =  0.67
## MSA for each item = 
##    culture      roads  secondary    primary vocational 
##       0.50       0.70       0.68       0.70       0.71

The null hypothesis of an identity correlation matrix is rejected, indicating that the variables are sufficiently correlated to justify factor analysis.
The overall KMO value indicates acceptable sampling adequacy. While the culture variable is borderline, all variables exceed the minimum threshold (0.5), allowing factor extraction with caution.

4.3 Choose number of factors

## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

## 
## Loadings:
##            ML1    ML2   
## culture            0.883
## roads       0.611 -0.108
## secondary   0.901  0.228
## primary     0.447  0.646
## vocational  0.778  0.276
## 
##                  ML1   ML2
## SS loadings    1.994 1.336
## Proportion Var 0.399 0.267
## Cumulative Var 0.399 0.666

## 'data.frame':    380 obs. of  3 variables:
##  $ forest_km2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ F1        : num  0.261 -0.569 -0.811 -0.714 0.904 ...
##  $ F2        : num  0.5 -0.688 0.467 0.28 0.586 ...

Factor 1 (Regional Connectivity & Advanced Education): Loaded heavily on secondary education (0.90), vocational schools (0.78), and public roads (0.61). This dimension explains 40% of the variance, representing large-scale infrastructure and higher-level educational access.
Factor 2 (Local Community Services): Dominated by culture (0.88) and primary education (0.65). This dimension explains 27% of the variance, capturing essential, localized services available at the community level.

Together, these two latent factors account for 67% of the total variance, effectively summarizing the infrastructure landscape.

5. Neighborhood matrix (contiguity criterion) and connectivity graph

To ensure that our regions are spatially contiguous (connected) rather than scattered patches, we define a connectivity graph where each powiat is linked to its geographic neighbors. We utilize Queen contiguity (sharing any boundary or corner) .

This results in a single, fully connected graph with no isolated units, ensuring that the SKATER algorithm can traverse the entire map without artificial interruptions.

## Isolates (no neighbours): 0

## Connected components: 1

6. Minimum spanning tree (MST)

The SKATER algorithm operates on a Minimum Spanning Tree (MST)—a graph connecting all regions with the lowest possible total “cost” . This tree serves as the backbone for clustering.We define the “cost” of an edge between two neighbors as the squared Euclidean distance between their Factor Scores (\(F1\) and \(F2\)).

Why Factor Scores? Using orthogonal factors prevents highly correlated variables from dominating the distance calculation.
Why Log-Transform? As noted in preprocessing, log-transforming inputs is vital here. Without it, urban outliers would generate artificially high edge costs, prematurely isolating cities from their natural regional hinterlands.

\[ c_{ij} = \lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2 = \sum_{k=1}^{p} (z_{ik} - z_{jk})^2, \]

where \(\mathbf{z}_i\) is the vector of factor scores (or standardized features) for region \(i\), and \(p\) is the number of factors/features.

We use igraph MST to avoid mstree() pitfalls.

7. Recursive partition (SKATER)

SKATER forms clusters by recursively “cutting” the most expensive edges in the MST. Each cut splits the map into two distinct, contiguous pieces. The algorithm repeats this process until \(K\) clusters are formed, effectively minimizing within-group heterogeneity .

7.1 Within-cluster SSE (extra evaluation)

To determine the optimal number of clusters (\(K\)), we calculate the Sum of Squared Errors (SSE) for increasing values of \(K\). While SSE naturally decreases as more clusters are added, our analysis reveals an “elbow” at \(K=8\). Beyond this point, improvements in homogeneity are marginal, suggesting that further splits would merely subdivide existing regions rather than revealing new structural patterns.

##     K      SSE
## 2   2 624.1841
## 3   3 608.4711
## 4   4 581.1016
## 6   6 553.7094
## 8   8 530.6195
## 10 10 510.9278
## 12 12 496.8311

7.2 Choose K and map clusters

We selected \(K=8\) as it balances statistical fit with interpretability, avoiding the creation of unhelpfully small micro-clusters.

8. Profiling partitions

Clustering is only useful if it leads to interpretation. To understand the “personality” of each region for policy planning, we profile the clusters using standardized scores and rankings.

8.1 Raw profiling

Table 1: Raw Cluster Profiles (Means)
Cluster	Culture	Roads	Secondary	Primary	Vocational	Count	Dist (km)	Pop Dens
1	20.3	103.2	11.4	53.4	6.9	10	0.0	70.0
2	22.1	192.2	11.4	76.4	6.1	14	0.0	847.0
3	18.8	63.9	4.0	23.9	2.8	25	92.0	NaN
4	63.0	431.9	187.0	417.0	18.0	1	0.0	NaN
5	3.7	354.2	6.9	26.4	3.6	7	14.0	3019.0
6	7.0	99.4	4.9	29.0	3.1	156	35.5	250.1
7	11.0	142.1	5.9	40.7	3.4	163	19.8	341.8
8	1.8	93.3	1.8	21.0	1.8	4	NaN	NaN

8.2 Standardized profiling (log1p z-scores)

Table 2: Standardized Cluster Profiles (Z-Scores)
Cluster	Culture	Roads	Secondary Edu	Primary Edu	Vocational
1	1.02	-0.27	0.56	0.62	1.20
2	1.17	0.93	0.61	1.44	0.92
3	0.80	-1.00	-0.26	-0.56	-0.22
4	2.84	2.34	5.26	4.71	3.08
5	-1.04	1.97	0.42	-0.52	0.10
6	-0.37	-0.31	-0.12	-0.30	-0.13
7	0.14	0.29	0.04	0.23	0.00
8	-1.74	-0.15	-0.93	-0.71	-0.73

8.3 Density and per-capita

Table 3: Extreme Clusters by Indicator (Raw Means)
Variable	Highest Cluster	Max Value	Lowest Cluster	Min Value
culture	4	63.0	8	1.8
primary	4	417.0	8	21.0
roads	4	431.9	3	63.9
secondary	4	187.0	8	1.8
vocational	4	18.0	8	1.8

Table 4: Final Regional Infrastructure Profiles (Per-Capita and Density)
Cluster	Count	Dist to Core (km)	Culture/10k	Primary/10k	Secondary/km²
1	10	0.0	NaN	NaN	0.028
2	14	0.0	NaN	NaN	0.037
3	25	92.0	3.40	3.6	0.020
4	1	0.0	NaN	NaN	0.362
5	7	14.0	NaN	NaN	0.141
6	156	35.5	1.88	5.0	0.022
7	163	19.8	0.70	3.4	0.031
8	4	NaN	NaN	NaN	0.002

9. Discussion

9.1 Spatial Patterns of Inequality

The SKATER regionalization (Table 2) reveals an extreme “primate city” effect. Cluster 4 forms a singleton cluster (containing only 1 powiat) with infrastructure Z-scores exceeding +5.0 standard deviations for education. This confirms that the capital region is structurally unique and cannot be grouped even with other major cities, skewing national averages.

9.2 Core vs. Periphery Structure

The analysis identifies a clear hierarchy:

The Deprived Periphery (Cluster 8): As shown in Table 3, Cluster 8 consistently ranks lowest across all infrastructure categories. This region represents the most urgent target for cohesion policy.
The “Average” Poland (Clusters 6 & 7): These two massive clusters (combining >300 powiats) represent the broad rural/semi-urban baseline of the country, where infrastructure density is moderate (Z-scores near 0).
The Urban Achievers (Cluster 2): High positive scores in Roads (+0.93) and Primary Education (+1.44) suggest a cluster of well-provisioned secondary cities.

9.3 Sensitivity and Data Quality (RQ3)

The comparison between raw counts and rates revealed critical insights. While Cluster 4 dominates in density (0.362 secondary schools/km²), the presence of NaN values in per-capita metrics (Table 4) for urban clusters suggests that population data normalization requires careful handling in administrative centers. Future studies must address missing demographic data to accurately assess per-capita service strain in urban cores.

10. Conclusion

10.1 Summary of Findings

This replication confirms that Poland’s infrastructure inequality is not a gradient but a sharp divide. The optimal partition (\(K=8\)) does not create equal-sized regions; instead, it isolates extreme outliers (the Capital) from the vast majority of the country. This supports RQ1 by proving that contiguous regions exist, but they are defined by the absence of infrastructure (Cluster 8) rather than its presence.

10.2 Policy Implications

The “one-size-fits-all” policy approach is unsuitable for Poland.

Cluster 8 requires basic infrastructure expansion (building new facilities).
Clusters 6 & 7 require efficiency improvements (better connecting existing facilities).
Cluster 4 represents a hyper-concentrated hub that likely serves a population far beyond its administrative boundaries, justifying its status as a distinct planning region.

Appendix: Full R Script

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  sf, dplyr, tidyr, ggplot2,
  spdep, ggspatial,
  classInt, RColorBrewer,
  igraph,
  psych,
  scales,
  knitr
)

powiat <- st_read("powiaty/powiaty.shp", quiet = TRUE)

powiat_data <- powiat %>%
  rename(
    powiat_id   = JPT_KOD_JE,
    powiat_name = JPT_NAZWA_,
    powiat_type = RODZAJ,
    shape_area  = SHAPE_AREA
  ) %>%
  mutate(powiat_id = as.integer(powiat_id) * 1000) %>%
  select(powiat_id, powiat_name, powiat_type, shape_area, geometry) %>%
  st_make_valid()

keep_third <- function(df) df %>% select(Code, value = 3)

culture    <- read.csv("CULTURE (K23).csv", sep = ";") %>% keep_third
roads      <- read.csv("DISTRICT PUBLIC ROADS (G209).csv", sep = ";") %>% keep_third
secondary  <- read.csv("GENERAL SECONDARY SCHOOLS (G256).csv", sep = ";") %>% keep_third
primary    <- read.csv("PRIMARY EDUCATION (G252).csv", sep = ";") %>% keep_third
vocational <- read.csv("STAGE I SECTORAL VOCATIONAL EDUCATION (G595).csv", sep = ";") %>% keep_third

regional <- read.csv("regional_dataset.csv", sep = ";")
regional_2023 <- regional %>%
  filter(year == 2023) %>%
  select(code, core_city, dist, area_km2, persons_in_K, persons_per_1km2, forest_km2)

infra_vars <- c("culture", "roads", "secondary", "primary", "vocational")

powiat_joined <- powiat_data %>%
  left_join(culture,    by = c("powiat_id" = "Code")) %>% rename(culture = value) %>%
  left_join(roads,      by = c("powiat_id" = "Code")) %>% rename(roads = value) %>%
  left_join(secondary,  by = c("powiat_id" = "Code")) %>% rename(secondary = value) %>%
  left_join(primary,    by = c("powiat_id" = "Code")) %>% rename(primary = value) %>%
  left_join(vocational, by = c("powiat_id" = "Code")) %>% rename(vocational = value) %>%
  left_join(regional_2023, by = c("powiat_id" = "code"))

powiat_joined <- powiat_joined %>%
  mutate(across(all_of(infra_vars), ~ suppressWarnings(as.numeric(.x)))) %>%
  mutate(
    dist = suppressWarnings(as.numeric(dist)),
    area_km2 = suppressWarnings(as.numeric(area_km2)),
    persons_in_K = suppressWarnings(as.numeric(persons_in_K)),
    persons_per_1km2 = suppressWarnings(as.numeric(persons_per_1km2)),
    forest_km2 = suppressWarnings(as.numeric(forest_km2))
  )

powiat_use <- powiat_joined %>%
  mutate(across(all_of(infra_vars), ~ tidyr::replace_na(.x, 0)))

desc_tbl <- powiat_use %>%
  st_set_geometry(NULL) %>%
  select(all_of(infra_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    mean   = mean(value, na.rm = TRUE),
    sd     = sd(value, na.rm = TRUE),
    min    = min(value, na.rm = TRUE),
    p25    = quantile(value, 0.25, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    p75    = quantile(value, 0.75, na.rm = TRUE),
    max    = max(value, na.rm = TRUE),
    .groups = "drop"
  )

knitr::kable(desc_tbl, caption = "Table 1: Descriptive Statistics")

plot_df_raw <- powiat_use %>%
  st_set_geometry(NULL) %>%
  pivot_longer(all_of(infra_vars), names_to = "variable", values_to = "value")

ggplot(plot_df_raw, aes(x = variable, y = value)) +
  geom_boxplot(outlier.alpha = 0.25) +
  theme_minimal() +
  labs(title = "Raw Counts")

ggplot(plot_df_raw %>% mutate(value = log1p(value)), aes(x = variable, y = value)) +
  geom_boxplot(outlier.alpha = 0.25) +
  theme_minimal() +
  labs(title = "Log Transformed Counts")

X_raw <- powiat_use %>% st_set_geometry(NULL) %>% select(all_of(infra_vars))
X_fa  <- log1p(X_raw)

cor_mat <- cor(X_fa, use = "pairwise.complete.obs")
psych::cortest.bartlett(cor_mat, n = nrow(X_fa))
psych::KMO(cor_mat)

nf <- 2
fa_res <- psych::fa(X_fa, nfactors = nf, rotate = "varimax", fm = "ml", scores = "regression")
factor_scores <- as.data.frame(fa_res$scores)
colnames(factor_scores) <- paste0("F", 1:nf)
powiat_use <- bind_cols(powiat_use, factor_scores)

# Neighbors
nb <- spdep::poly2nb(powiat_use, queen = TRUE)

# MST Construction
Z_df <- powiat_use %>% st_set_geometry(NULL) %>% select(F1, F2)
Z <- as.matrix(Z_df)

edge_list <- do.call(rbind, lapply(seq_along(nb), function(i) {
  if (length(nb[[i]]) == 0) return(NULL)
  cbind(from = i, to = nb[[i]])
}))
edge_list <- edge_list[edge_list[, "from"] < edge_list[, "to"], , drop = FALSE]

w <- apply(edge_list, 1, function(e) {
  i <- e[1]; j <- e[2]
  sum((Z[i, ] - Z[j, ])^2)
})

g <- igraph::graph_from_edgelist(edge_list, directed = FALSE)
E(g)$weight <- w
g_mst <- igraph::mst(g, weights = E(g)$weight)
mst_edges <- igraph::as_edgelist(g_mst, names = FALSE)

# Sensitivity Analysis (SSE)
run_skater <- function(K) {
  ncuts <- K - 1
  sk <- spdep::skater(edges = mst_edges, data = Z, ncuts = ncuts)
  groups <- sk$groups
  
  sse <- 0
  for (cl in sort(unique(groups))) {
    idx <- which(groups == cl)
    if (length(idx) > 1) {
      center <- colMeans(Z[idx, , drop = FALSE])
      sse <- sse + sum(rowSums((Z[idx, , drop=FALSE] - 
                                  matrix(center, nrow = length(idx), ncol = ncol(Z), byrow = TRUE))^2))
    }
  }
  list(groups = groups, sse = sse)
}

Ks <- c(2, 3, 4, 6, 8, 10, 12)
res <- lapply(Ks, run_skater)

K <- 8
powiat_use$cluster <- factor(res[[as.character(K)]]$groups)

# Raw Profiling
profile_raw <- powiat_use %>%
  st_set_geometry(NULL) %>%
  group_by(cluster) %>%
  summarise(
    across(all_of(infra_vars), mean, na.rm = TRUE),
    n = n(),
    mean_dist = mean(dist, na.rm = TRUE),
    mean_pop_density = mean(persons_per_1km2, na.rm = TRUE),
    .groups = "drop"
  )

knitr::kable(profile_raw, caption = "Table 1: Raw Profiles", digits = 1,
             col.names = c("Cluster", "Culture", "Roads", "Secondary", 
                           "Primary", "Vocational", "Count", "Dist", "Density"))

# Z-Score Profiling
profile_z <- powiat_use %>%
  st_set_geometry(NULL) %>%
  mutate(across(all_of(infra_vars), ~ as.numeric(scale(log1p(.x))))) %>%
  group_by(cluster) %>%
  summarise(across(all_of(infra_vars), mean, na.rm = TRUE), .groups="drop")

knitr::kable(profile_z, caption = "Table 2: Z-Scores", digits = 2)

# Calculation of Per Capita / Density
powiat_use <- powiat_use %>%
  mutate(
    pop = persons_in_K * 1000,
    culture_dens    = culture / area_km2,
    roads_dens      = roads / area_km2,
    primary_dens    = primary / area_km2,
    secondary_dens  = secondary / area_km2,
    vocational_dens = vocational / area_km2,
    culture_pc10k    = culture / pop * 10000,
    roads_pc10k      = roads / pop * 10000,
    primary_pc10k    = primary / pop * 10000,
    secondary_pc10k  = secondary / pop * 10000,
    vocational_pc10k = vocational / pop * 10000
  )

# Rank Table
rank_tbl <- profile_raw %>%
  select(cluster, all_of(infra_vars)) %>%
  pivot_longer(-cluster, names_to="var", values_to="mean_value") %>%
  group_by(var) %>%
  summarise(
    highest_cluster = cluster[which.max(mean_value)],
    highest_value   = max(mean_value, na.rm = TRUE),
    lowest_cluster  = cluster[which.min(mean_value)],
    lowest_value    = min(mean_value, na.rm = TRUE),
    .groups="drop"
  )

knitr::kable(rank_tbl, caption = "Table 3: Rankings", digits = 1)

# Final Summary Table
cluster_summary <- powiat_use %>%
  st_set_geometry(NULL) %>%
  group_by(cluster) %>%
  summarise(
    N_Powiats = n(),
    Avg_Dist_to_Core = round(mean(dist, na.rm = TRUE), 1),
    Culture_per_10k = round(mean(culture_pc10k, na.rm = TRUE), 2),
    Primary_per_10k = round(mean(primary_pc10k, na.rm = TRUE), 2),
    Secondary_Density = round(mean(secondary_dens, na.rm = TRUE), 3),
    .groups = "drop"
  )

knitr::kable(cluster_summary, caption = "Table 4: Final Summary", digits = 2)