Abstract
This study replicates and extends a spatial regionalization of Polish infrastructure using the SKATER algorithm. Using public data at the powiat level, we identify contiguous regions sharing similar characteristics in education, culture, and transport. Factor analysis reveals two latent dimensions driving inequality: “Regional Connectivity” and “Local Services.” Sensitivity analysis demonstrates that density-based metrics provide more meaningful clusters for policy planning than raw counts.
This paper presents a reproducible replication of an existing spatial regionalization study based on the SKATER (Spatial ‘K’luster Analysis by Tree Edge Removal) algorithm.
The primary objective is to faithfully reproduce the methodological workflow of the original study using publicly available data at the powiat level in Poland, while ensuring full computational transparency and reproducibility.
In addition to replication, the paper introduces minor methodological extensions, including alternative data transformations (logarithmic, density-based, and population-normalized indicators) and a sensitivity analysis using within-cluster sum of squared errors (SSE).
Our research questions:
RQ1: Can we form contiguous (spatially connected) regions of Poland where neighboring powiats have similar infrastructure characteristics?
RQ2: What latent dimensions (factors) summarize infrastructure variation (e.g., “education availability”, “cultural/road access”)?
RQ3: Does the choice of data transformation (e.g., raw counts vs. densities) significantly change the resulting map? We compare:
Raw counts
log-transformed counts (log1p)
area-normalized densities (per km²)
population-normalized rates (per 10k inhabitants
We combine:
Powiat polygons (shapefile): geometry and IDs (CRS typically EPSG:2180 for Poland)
Infrastructure indicators (CSV): counts of facilities
Regional dataset (2023): contextual variables like distance to core city, area, population
All data and replication materials are publicly available via Zenodo: Spatial Infrastructure Indicators and Powiat-Level Shapefiles for SKATER Regionalization in Poland (2023)
Tnsure valid geometry: contiguity neighbours fail easily if polygons are invalid.
powiat <- st_read("powiaty/powiaty.shp", quiet = TRUE)
powiat_data <- powiat %>%
rename(
powiat_id = JPT_KOD_JE,
powiat_name = JPT_NAZWA_,
powiat_type = RODZAJ,
shape_area = SHAPE_AREA
) %>%
mutate(powiat_id = as.integer(powiat_id) * 1000) %>%
select(powiat_id, powiat_name, powiat_type, shape_area, geometry) %>%
st_make_valid()
st_crs(powiat_data)
## Coordinate Reference System:
## User input: ETRF2000-PL / CS92
## wkt:
## PROJCRS["ETRF2000-PL / CS92",
## BASEGEOGCRS["ETRF2000-PL",
## DATUM["ETRF2000 Poland",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",9702]],
## CONVERSION["Poland CS92",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",19,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9993,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",-5300000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["northing (x)",north,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["easting (y)",east,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Topographic mapping (medium and small scale)."],
## AREA["Poland - onshore and offshore."],
## BBOX[49,14.14,55.93,24.15]],
## ID["EPSG",2180]]
powiat_joined <- powiat_data %>%
left_join(culture, by = c("powiat_id" = "Code")) %>% rename(culture = value) %>%
left_join(roads, by = c("powiat_id" = "Code")) %>% rename(roads = value) %>%
left_join(secondary, by = c("powiat_id" = "Code")) %>% rename(secondary = value) %>%
left_join(primary, by = c("powiat_id" = "Code")) %>% rename(primary = value) %>%
left_join(vocational, by = c("powiat_id" = "Code")) %>% rename(vocational = value) %>%
left_join(regional_2023, by = c("powiat_id" = "code"))
| Variable | Mean | SD | Min | 25th % | Median | 75th % | Max |
|---|---|---|---|---|---|---|---|
| culture | 10.42 | 9.72 | 0.0 | 4.75 | 7.0 | 14.00 | 91.0 |
| primary | 36.96 | 31.07 | 8.0 | 21.00 | 29.0 | 44.00 | 417.0 |
| roads | 124.42 | 91.27 | 21.7 | 65.15 | 92.7 | 133.05 | 517.3 |
| secondary | 6.14 | 11.95 | 0.0 | 2.00 | 4.0 | 6.00 | 187.0 |
| vocational | 3.42 | 2.64 | 0.0 | 2.00 | 3.0 | 4.00 | 18.0 |
Infrastructure is unevenly distributed across powiats.
The log(1 + x) transformation substantially reduces skewness while retaining relative differences between regions. This improves the correlation structure and makes the data more suitable for factor analysis and distance-based clustering.
We use factor analysis to:
Why Bartlett and KMO:
Bartlett’s test checks if variables are correlated at all (reject identity matrix)
KMO checks if correlations are suitable for factor extraction (shared variance)
Apply log1p because counts are skewed and this improves correlation structure.
## culture roads secondary primary vocational
## culture 1.000 -0.138 0.152 0.543 0.182
## roads -0.138 1.000 0.539 0.227 0.398
## secondary 0.152 0.539 1.000 0.536 0.766
## primary 0.543 0.227 0.536 1.000 0.553
## vocational 0.182 0.398 0.766 0.553 1.000
Infrastructure variables are not independent. Educational facilities and cultural infrastructure tend to co-locate, suggesting latent dimensions rather than isolated indicators. This supports the use of factor analysis to reduce dimensionality.
## $chisq
## [1] 793.3028
##
## $p.value
## [1] 5.679348e-164
##
## $df
## [1] 10
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = cor_mat)
## Overall MSA = 0.67
## MSA for each item =
## culture roads secondary primary vocational
## 0.50 0.70 0.68 0.70 0.71
The null hypothesis of an identity correlation matrix is rejected, indicating that the variables are sufficiently correlated to justify factor analysis.
The overall KMO value indicates acceptable sampling adequacy. While the culture variable is borderline, all variables exceed the minimum threshold (0.5), allowing factor extraction with caution.
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
##
## Loadings:
## ML1 ML2
## culture 0.883
## roads 0.611 -0.108
## secondary 0.901 0.228
## primary 0.447 0.646
## vocational 0.778 0.276
##
## ML1 ML2
## SS loadings 1.994 1.336
## Proportion Var 0.399 0.267
## Cumulative Var 0.399 0.666
## 'data.frame': 380 obs. of 3 variables:
## $ forest_km2: num NA NA NA NA NA NA NA NA NA NA ...
## $ F1 : num 0.261 -0.569 -0.811 -0.714 0.904 ...
## $ F2 : num 0.5 -0.688 0.467 0.28 0.586 ...
Factor 1 (Regional Connectivity & Advanced Education): Loaded heavily on secondary education (0.90), vocational schools (0.78), and public roads (0.61). This dimension explains 40% of the variance, representing large-scale infrastructure and higher-level educational access.
Factor 2 (Local Community Services): Dominated by culture (0.88) and primary education (0.65). This dimension explains 27% of the variance, capturing essential, localized services available at the community level.
Together, these two latent factors account for 67% of the total variance, effectively summarizing the infrastructure landscape.
To ensure that our regions are spatially contiguous (connected) rather than scattered patches, we define a connectivity graph where each powiat is linked to its geographic neighbors. We utilize Queen contiguity (sharing any boundary or corner) .
This results in a single, fully connected graph with no isolated units, ensuring that the SKATER algorithm can traverse the entire map without artificial interruptions.
## Isolates (no neighbours): 0
## Connected components: 1
The SKATER algorithm operates on a Minimum Spanning Tree (MST)—a graph connecting all regions with the lowest possible total “cost” . This tree serves as the backbone for clustering.We define the “cost” of an edge between two neighbors as the squared Euclidean distance between their Factor Scores (\(F1\) and \(F2\)).
Why Factor Scores? Using orthogonal factors prevents highly correlated variables from dominating the distance calculation.
Why Log-Transform? As noted in preprocessing, log-transforming inputs is vital here. Without it, urban outliers would generate artificially high edge costs, prematurely isolating cities from their natural regional hinterlands.
\[ c_{ij} = \lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2 = \sum_{k=1}^{p} (z_{ik} - z_{jk})^2, \]
where \(\mathbf{z}_i\) is the vector of factor scores (or standardized features) for region \(i\), and \(p\) is the number of factors/features.
We use igraph MST to avoid mstree() pitfalls.
SKATER forms clusters by recursively “cutting” the most expensive edges in the MST. Each cut splits the map into two distinct, contiguous pieces. The algorithm repeats this process until \(K\) clusters are formed, effectively minimizing within-group heterogeneity .
To determine the optimal number of clusters (\(K\)), we calculate the Sum of Squared Errors (SSE) for increasing values of \(K\). While SSE naturally decreases as more clusters are added, our analysis reveals an “elbow” at \(K=8\). Beyond this point, improvements in homogeneity are marginal, suggesting that further splits would merely subdivide existing regions rather than revealing new structural patterns.
## K SSE
## 2 2 624.1841
## 3 3 608.4711
## 4 4 581.1016
## 6 6 553.7094
## 8 8 530.6195
## 10 10 510.9278
## 12 12 496.8311
We selected \(K=8\) as it balances statistical fit with interpretability, avoiding the creation of unhelpfully small micro-clusters.
Clustering is only useful if it leads to interpretation. To understand the “personality” of each region for policy planning, we profile the clusters using standardized scores and rankings.
| Cluster | Culture | Roads | Secondary | Primary | Vocational | Count | Dist (km) | Pop Dens |
|---|---|---|---|---|---|---|---|---|
| 1 | 20.3 | 103.2 | 11.4 | 53.4 | 6.9 | 10 | 0.0 | 70.0 |
| 2 | 22.1 | 192.2 | 11.4 | 76.4 | 6.1 | 14 | 0.0 | 847.0 |
| 3 | 18.8 | 63.9 | 4.0 | 23.9 | 2.8 | 25 | 92.0 | NaN |
| 4 | 63.0 | 431.9 | 187.0 | 417.0 | 18.0 | 1 | 0.0 | NaN |
| 5 | 3.7 | 354.2 | 6.9 | 26.4 | 3.6 | 7 | 14.0 | 3019.0 |
| 6 | 7.0 | 99.4 | 4.9 | 29.0 | 3.1 | 156 | 35.5 | 250.1 |
| 7 | 11.0 | 142.1 | 5.9 | 40.7 | 3.4 | 163 | 19.8 | 341.8 |
| 8 | 1.8 | 93.3 | 1.8 | 21.0 | 1.8 | 4 | NaN | NaN |
| Cluster | Culture | Roads | Secondary Edu | Primary Edu | Vocational |
|---|---|---|---|---|---|
| 1 | 1.02 | -0.27 | 0.56 | 0.62 | 1.20 |
| 2 | 1.17 | 0.93 | 0.61 | 1.44 | 0.92 |
| 3 | 0.80 | -1.00 | -0.26 | -0.56 | -0.22 |
| 4 | 2.84 | 2.34 | 5.26 | 4.71 | 3.08 |
| 5 | -1.04 | 1.97 | 0.42 | -0.52 | 0.10 |
| 6 | -0.37 | -0.31 | -0.12 | -0.30 | -0.13 |
| 7 | 0.14 | 0.29 | 0.04 | 0.23 | 0.00 |
| 8 | -1.74 | -0.15 | -0.93 | -0.71 | -0.73 |
| Variable | Highest Cluster | Max Value | Lowest Cluster | Min Value |
|---|---|---|---|---|
| culture | 4 | 63.0 | 8 | 1.8 |
| primary | 4 | 417.0 | 8 | 21.0 |
| roads | 4 | 431.9 | 3 | 63.9 |
| secondary | 4 | 187.0 | 8 | 1.8 |
| vocational | 4 | 18.0 | 8 | 1.8 |
| Cluster | Count | Dist to Core (km) | Culture/10k | Primary/10k | Secondary/km² |
|---|---|---|---|---|---|
| 1 | 10 | 0.0 | NaN | NaN | 0.028 |
| 2 | 14 | 0.0 | NaN | NaN | 0.037 |
| 3 | 25 | 92.0 | 3.40 | 3.6 | 0.020 |
| 4 | 1 | 0.0 | NaN | NaN | 0.362 |
| 5 | 7 | 14.0 | NaN | NaN | 0.141 |
| 6 | 156 | 35.5 | 1.88 | 5.0 | 0.022 |
| 7 | 163 | 19.8 | 0.70 | 3.4 | 0.031 |
| 8 | 4 | NaN | NaN | NaN | 0.002 |
The SKATER regionalization (Table 2) reveals an extreme “primate city” effect. Cluster 4 forms a singleton cluster (containing only 1 powiat) with infrastructure Z-scores exceeding +5.0 standard deviations for education. This confirms that the capital region is structurally unique and cannot be grouped even with other major cities, skewing national averages.
The analysis identifies a clear hierarchy:
The Deprived Periphery (Cluster 8): As shown in Table 3, Cluster 8 consistently ranks lowest across all infrastructure categories. This region represents the most urgent target for cohesion policy.
The “Average” Poland (Clusters 6 & 7): These two massive clusters (combining >300 powiats) represent the broad rural/semi-urban baseline of the country, where infrastructure density is moderate (Z-scores near 0).
The Urban Achievers (Cluster 2): High positive scores in Roads (+0.93) and Primary Education (+1.44) suggest a cluster of well-provisioned secondary cities.
The comparison between raw counts and rates revealed critical insights. While Cluster 4 dominates in density (0.362 secondary schools/km²), the presence of NaN values in per-capita metrics (Table 4) for urban clusters suggests that population data normalization requires careful handling in administrative centers. Future studies must address missing demographic data to accurately assess per-capita service strain in urban cores.
This replication confirms that Poland’s infrastructure inequality is not a gradient but a sharp divide. The optimal partition (\(K=8\)) does not create equal-sized regions; instead, it isolates extreme outliers (the Capital) from the vast majority of the country. This supports RQ1 by proving that contiguous regions exist, but they are defined by the absence of infrastructure (Cluster 8) rather than its presence.
The “one-size-fits-all” policy approach is unsuitable for Poland.
Cluster 8 requires basic infrastructure expansion (building new facilities).
Clusters 6 & 7 require efficiency improvements (better connecting existing facilities).
Cluster 4 represents a hyper-concentrated hub that likely serves a population far beyond its administrative boundaries, justifying its status as a distinct planning region.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
sf, dplyr, tidyr, ggplot2,
spdep, ggspatial,
classInt, RColorBrewer,
igraph,
psych,
scales,
knitr
)
powiat <- st_read("powiaty/powiaty.shp", quiet = TRUE)
powiat_data <- powiat %>%
rename(
powiat_id = JPT_KOD_JE,
powiat_name = JPT_NAZWA_,
powiat_type = RODZAJ,
shape_area = SHAPE_AREA
) %>%
mutate(powiat_id = as.integer(powiat_id) * 1000) %>%
select(powiat_id, powiat_name, powiat_type, shape_area, geometry) %>%
st_make_valid()
keep_third <- function(df) df %>% select(Code, value = 3)
culture <- read.csv("CULTURE (K23).csv", sep = ";") %>% keep_third
roads <- read.csv("DISTRICT PUBLIC ROADS (G209).csv", sep = ";") %>% keep_third
secondary <- read.csv("GENERAL SECONDARY SCHOOLS (G256).csv", sep = ";") %>% keep_third
primary <- read.csv("PRIMARY EDUCATION (G252).csv", sep = ";") %>% keep_third
vocational <- read.csv("STAGE I SECTORAL VOCATIONAL EDUCATION (G595).csv", sep = ";") %>% keep_third
regional <- read.csv("regional_dataset.csv", sep = ";")
regional_2023 <- regional %>%
filter(year == 2023) %>%
select(code, core_city, dist, area_km2, persons_in_K, persons_per_1km2, forest_km2)
infra_vars <- c("culture", "roads", "secondary", "primary", "vocational")
powiat_joined <- powiat_data %>%
left_join(culture, by = c("powiat_id" = "Code")) %>% rename(culture = value) %>%
left_join(roads, by = c("powiat_id" = "Code")) %>% rename(roads = value) %>%
left_join(secondary, by = c("powiat_id" = "Code")) %>% rename(secondary = value) %>%
left_join(primary, by = c("powiat_id" = "Code")) %>% rename(primary = value) %>%
left_join(vocational, by = c("powiat_id" = "Code")) %>% rename(vocational = value) %>%
left_join(regional_2023, by = c("powiat_id" = "code"))
powiat_joined <- powiat_joined %>%
mutate(across(all_of(infra_vars), ~ suppressWarnings(as.numeric(.x)))) %>%
mutate(
dist = suppressWarnings(as.numeric(dist)),
area_km2 = suppressWarnings(as.numeric(area_km2)),
persons_in_K = suppressWarnings(as.numeric(persons_in_K)),
persons_per_1km2 = suppressWarnings(as.numeric(persons_per_1km2)),
forest_km2 = suppressWarnings(as.numeric(forest_km2))
)
powiat_use <- powiat_joined %>%
mutate(across(all_of(infra_vars), ~ tidyr::replace_na(.x, 0)))
desc_tbl <- powiat_use %>%
st_set_geometry(NULL) %>%
select(all_of(infra_vars)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
group_by(variable) %>%
summarise(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
p25 = quantile(value, 0.25, na.rm = TRUE),
median = median(value, na.rm = TRUE),
p75 = quantile(value, 0.75, na.rm = TRUE),
max = max(value, na.rm = TRUE),
.groups = "drop"
)
knitr::kable(desc_tbl, caption = "Table 1: Descriptive Statistics")
plot_df_raw <- powiat_use %>%
st_set_geometry(NULL) %>%
pivot_longer(all_of(infra_vars), names_to = "variable", values_to = "value")
ggplot(plot_df_raw, aes(x = variable, y = value)) +
geom_boxplot(outlier.alpha = 0.25) +
theme_minimal() +
labs(title = "Raw Counts")
ggplot(plot_df_raw %>% mutate(value = log1p(value)), aes(x = variable, y = value)) +
geom_boxplot(outlier.alpha = 0.25) +
theme_minimal() +
labs(title = "Log Transformed Counts")
X_raw <- powiat_use %>% st_set_geometry(NULL) %>% select(all_of(infra_vars))
X_fa <- log1p(X_raw)
cor_mat <- cor(X_fa, use = "pairwise.complete.obs")
psych::cortest.bartlett(cor_mat, n = nrow(X_fa))
psych::KMO(cor_mat)
nf <- 2
fa_res <- psych::fa(X_fa, nfactors = nf, rotate = "varimax", fm = "ml", scores = "regression")
factor_scores <- as.data.frame(fa_res$scores)
colnames(factor_scores) <- paste0("F", 1:nf)
powiat_use <- bind_cols(powiat_use, factor_scores)
# Neighbors
nb <- spdep::poly2nb(powiat_use, queen = TRUE)
# MST Construction
Z_df <- powiat_use %>% st_set_geometry(NULL) %>% select(F1, F2)
Z <- as.matrix(Z_df)
edge_list <- do.call(rbind, lapply(seq_along(nb), function(i) {
if (length(nb[[i]]) == 0) return(NULL)
cbind(from = i, to = nb[[i]])
}))
edge_list <- edge_list[edge_list[, "from"] < edge_list[, "to"], , drop = FALSE]
w <- apply(edge_list, 1, function(e) {
i <- e[1]; j <- e[2]
sum((Z[i, ] - Z[j, ])^2)
})
g <- igraph::graph_from_edgelist(edge_list, directed = FALSE)
E(g)$weight <- w
g_mst <- igraph::mst(g, weights = E(g)$weight)
mst_edges <- igraph::as_edgelist(g_mst, names = FALSE)
# Sensitivity Analysis (SSE)
run_skater <- function(K) {
ncuts <- K - 1
sk <- spdep::skater(edges = mst_edges, data = Z, ncuts = ncuts)
groups <- sk$groups
sse <- 0
for (cl in sort(unique(groups))) {
idx <- which(groups == cl)
if (length(idx) > 1) {
center <- colMeans(Z[idx, , drop = FALSE])
sse <- sse + sum(rowSums((Z[idx, , drop=FALSE] -
matrix(center, nrow = length(idx), ncol = ncol(Z), byrow = TRUE))^2))
}
}
list(groups = groups, sse = sse)
}
Ks <- c(2, 3, 4, 6, 8, 10, 12)
res <- lapply(Ks, run_skater)
K <- 8
powiat_use$cluster <- factor(res[[as.character(K)]]$groups)
# Raw Profiling
profile_raw <- powiat_use %>%
st_set_geometry(NULL) %>%
group_by(cluster) %>%
summarise(
across(all_of(infra_vars), mean, na.rm = TRUE),
n = n(),
mean_dist = mean(dist, na.rm = TRUE),
mean_pop_density = mean(persons_per_1km2, na.rm = TRUE),
.groups = "drop"
)
knitr::kable(profile_raw, caption = "Table 1: Raw Profiles", digits = 1,
col.names = c("Cluster", "Culture", "Roads", "Secondary",
"Primary", "Vocational", "Count", "Dist", "Density"))
# Z-Score Profiling
profile_z <- powiat_use %>%
st_set_geometry(NULL) %>%
mutate(across(all_of(infra_vars), ~ as.numeric(scale(log1p(.x))))) %>%
group_by(cluster) %>%
summarise(across(all_of(infra_vars), mean, na.rm = TRUE), .groups="drop")
knitr::kable(profile_z, caption = "Table 2: Z-Scores", digits = 2)
# Calculation of Per Capita / Density
powiat_use <- powiat_use %>%
mutate(
pop = persons_in_K * 1000,
culture_dens = culture / area_km2,
roads_dens = roads / area_km2,
primary_dens = primary / area_km2,
secondary_dens = secondary / area_km2,
vocational_dens = vocational / area_km2,
culture_pc10k = culture / pop * 10000,
roads_pc10k = roads / pop * 10000,
primary_pc10k = primary / pop * 10000,
secondary_pc10k = secondary / pop * 10000,
vocational_pc10k = vocational / pop * 10000
)
# Rank Table
rank_tbl <- profile_raw %>%
select(cluster, all_of(infra_vars)) %>%
pivot_longer(-cluster, names_to="var", values_to="mean_value") %>%
group_by(var) %>%
summarise(
highest_cluster = cluster[which.max(mean_value)],
highest_value = max(mean_value, na.rm = TRUE),
lowest_cluster = cluster[which.min(mean_value)],
lowest_value = min(mean_value, na.rm = TRUE),
.groups="drop"
)
knitr::kable(rank_tbl, caption = "Table 3: Rankings", digits = 1)
# Final Summary Table
cluster_summary <- powiat_use %>%
st_set_geometry(NULL) %>%
group_by(cluster) %>%
summarise(
N_Powiats = n(),
Avg_Dist_to_Core = round(mean(dist, na.rm = TRUE), 1),
Culture_per_10k = round(mean(culture_pc10k, na.rm = TRUE), 2),
Primary_per_10k = round(mean(primary_pc10k, na.rm = TRUE), 2),
Secondary_Density = round(mean(secondary_dens, na.rm = TRUE), 3),
.groups = "drop"
)
knitr::kable(cluster_summary, caption = "Table 4: Final Summary", digits = 2)