This paper presents a reproducible replication of an existing spatial regionalization study based on the SKATER (Spatial ‘K’luster Analysis by Tree Edge Removal) algorithm.
The primary objective is to faithfully reproduce the methodological workflow of the original study using publicly available data at the powiat level in Poland, while ensuring full computational transparency and reproducibility.
In addition to replication, the paper introduces minor methodological extensions, including alternative data transformations (logarithmic, density-based, and population-normalized indicators) and a sensitivity analysis using within-cluster sum of squared errors (SSE).
Our research questions:
RQ1: Can we form contiguous (spatially connected) regions of Poland where neighboring powiats have similar infrastructure characteristics?
RQ2: What latent dimensions (factors) summarize infrastructure variation (e.g., “education availability”, “cultural/road access”)?
RQ3: How sensitive are regionalizations to data representation choices? We compare:
Raw counts
log-transformed counts (log1p)
area-normalized densities (per km²)
population-normalized rates (per 10k inhabitants
We combine:
Powiat polygons (shapefile): geometry and IDs (CRS typically EPSG:2180 for Poland)
Infrastructure indicators (CSV): counts of facilities
Regional dataset (2023): contextual variables like distance to core city, area, population
All data and replication materials are publicly available via Zenodo: Spatial Infrastructure Indicators and Powiat-Level Shapefiles for SKATER Regionalization in Poland (2023)
Why this is needed:
SKATER requires spatial units (polygons) and a neighborhood graph.
Factor analysis and clustering require a numeric feature matrix.
Context variables (distance/area/pop) enable robustness via densities and rates
Also ensure valid geometry: contiguity neighbours fail easily if polygons are invalid.
powiat <- st_read("powiaty/powiaty.shp", quiet = TRUE)
powiat_data <- powiat %>%
rename(
powiat_id = JPT_KOD_JE,
powiat_name = JPT_NAZWA_,
powiat_type = RODZAJ,
shape_area = SHAPE_AREA
) %>%
mutate(powiat_id = as.integer(powiat_id) * 1000) %>%
select(powiat_id, powiat_name, powiat_type, shape_area, geometry) %>%
st_make_valid()
st_crs(powiat_data)
## Coordinate Reference System:
## User input: ETRF2000-PL / CS92
## wkt:
## PROJCRS["ETRF2000-PL / CS92",
## BASEGEOGCRS["ETRF2000-PL",
## DATUM["ETRF2000 Poland",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",9702]],
## CONVERSION["Poland CS92",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",19,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9993,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",-5300000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["northing (x)",north,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["easting (y)",east,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Topographic mapping (medium and small scale)."],
## AREA["Poland - onshore and offshore."],
## BBOX[49,14.14,55.93,24.15]],
## ID["EPSG",2180]]
powiat_joined <- powiat_data %>%
left_join(culture, by = c("powiat_id" = "Code")) %>% rename(culture = value) %>%
left_join(roads, by = c("powiat_id" = "Code")) %>% rename(roads = value) %>%
left_join(secondary, by = c("powiat_id" = "Code")) %>% rename(secondary = value) %>%
left_join(primary, by = c("powiat_id" = "Code")) %>% rename(primary = value) %>%
left_join(vocational, by = c("powiat_id" = "Code")) %>% rename(vocational = value) %>%
left_join(regional_2023, by = c("powiat_id" = "code"))
## # A tibble: 5 × 8
## variable mean sd min p25 median p75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 culture 10.4 9.72 0 4.75 7 14 91
## 2 primary 37.0 31.1 8 21 29 44 417
## 3 roads 124. 91.3 21.7 65.1 92.7 133. 517.
## 4 secondary 6.14 12.0 0 2 4 6 187
## 5 vocational 3.42 2.64 0 2 3 4 18
Infrastructure is unevenly distributed across powiats, with a small number of highly endowed units and many low-infrastructure areas. This confirms that raw counts reflect strong size and urbanization effects and motivates transformation prior to multivariate analysis.
The log(1 + x) transformation substantially reduces skewness while retaining relative differences between regions. This improves the correlation structure and makes the data more suitable for factor analysis and distance-based clustering.
Use factor analysis to:
Why Bartlett and KMO:
Bartlett’s test checks if variables are correlated at all (reject identity matrix)
KMO checks if correlations are suitable for factor extraction (shared variance)
Apply log1p because counts are skewed and this improves correlation structure.
## culture roads secondary primary vocational
## culture 1.000 -0.138 0.152 0.543 0.182
## roads -0.138 1.000 0.539 0.227 0.398
## secondary 0.152 0.539 1.000 0.536 0.766
## primary 0.543 0.227 0.536 1.000 0.553
## vocational 0.182 0.398 0.766 0.553 1.000
Infrastructure variables are not independent. Educational facilities and cultural infrastructure tend to co-locate, suggesting latent dimensions rather than isolated indicators. This supports the use of factor analysis to reduce dimensionality.
## $chisq
## [1] 793.3028
##
## $p.value
## [1] 5.679348e-164
##
## $df
## [1] 10
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = cor_mat)
## Overall MSA = 0.67
## MSA for each item =
## culture roads secondary primary vocational
## 0.50 0.70 0.68 0.70 0.71
The null hypothesis of an identity correlation matrix is decisively rejected, indicating that the variables are sufficiently correlated to justify factor analysis.
The overall KMO value indicates acceptable sampling adequacy. While the culture variable is borderline, all variables exceed the minimum threshold (0.5), allowing factor extraction with caution.
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
##
## Loadings:
## ML1 ML2
## culture 0.883
## roads 0.611 -0.108
## secondary 0.901 0.228
## primary 0.447 0.646
## vocational 0.778 0.276
##
## ML1 ML2
## SS loadings 1.994 1.336
## Proportion Var 0.399 0.267
## Cumulative Var 0.399 0.666
## 'data.frame': 380 obs. of 3 variables:
## $ forest_km2: num NA NA NA NA NA NA NA NA NA NA ...
## $ F1 : num 0.261 -0.569 -0.811 -0.714 0.904 ...
## $ F2 : num 0.5 -0.688 0.467 0.28 0.586 ...
Factor analysis identified two independent latent dimensions of infrastructure provision. The first factor, explaining 39.9% of total variance, is strongly associated with secondary and vocational education and moderately with road infrastructure, reflecting the concentration of post-primary educational and transport facilities. The second factor, explaining an additional 26.7% of variance, is dominated by cultural infrastructure and primary education, capturing local social infrastructure oriented toward everyday community needs. Together, these factors explain 66.6% of total variance and reveal a clear separation between regional service-oriented infrastructure and locally distributed basic services.
A connectivity graph: each region is connected to its geographic neighbors.
Why contiguity neighbors:
Ensures clusters are contiguous (no disconnected patches)
Reflects spatial dependence: neighboring areas tend to be similar
We use Queen contiguity (sharing boundary or corner).
## Isolates (no neighbours): 0
## Connected components: 1
The queen contiguity neighborhood matrix forms a single connected graph with no isolated powiats. This ensures that spatial regionalization is feasible without artificial links or disconnected components.
The MST is a tree that connects all nodes with minimum total cost.
Why MST:
SKATER partitions the MST by cutting edges
The MST is the backbone for contiguous clustering
The edge cost between neighboring regions \(i\) and \(j\) is defined as squared Euclidean distance in factor-score space:
\[ c_{ij} = \lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2 = \sum_{k=1}^{p} (z_{ik} - z_{jk})^2, \]
where \(\mathbf{z}_i\) is the vector of factor scores (or standardized features) for region \(i\), and \(p\) is the number of factors/features.
We use igraph MST to avoid mstree() pitfalls.
The MST identifies the most critical spatial links between neighboring powiats in terms of infrastructure similarity. High-cost edges represent sharp transitions and are natural candidates for partitioning during SKATER regionalization.
For each K SKATER cuts MST edges to create K contiguous groups
Why recursive partition:
SSE (Sum of Squared Errors):
Within each cluster, compute squared distance of points to their cluster center
Smaller SSE → more homogeneous clusters
SSE always decreases as K increases, so we look for an “elbow”
## K SSE
## 2 2 624.1841
## 3 3 608.4711
## 4 4 581.1016
## 6 6 553.7094
## 8 8 530.6195
## 10 10 510.9278
## 12 12 496.8311
While within-cluster SSE decreases with increasing K, the rate of improvement slows notably after K ≈ 8. This suggests that additional clusters primarily subdivide existing regions rather than revealing fundamentally new spatial structures.
Choose a K based on:
Interpretable regions
No extreme tiny clusters
Reasonable elbow in SSE
Profiling explains cluster meaning by computing average indicator values per cluster.
Why this matters:
Clustering itself is not the goal; interpretation is
Policy or planning insights come from understanding “what characterizes each region”
We profile:
Raw counts
Standardized log counts
Contextual variables (distance, density)
Densities and per-capita rates
## # A tibble: 8 × 9
## cluster culture roads secondary primary vocational n mean_dist
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 20.3 103. 11.4 53.4 6.9 10 0
## 2 2 22.1 192. 11.4 76.4 6.07 14 0
## 3 3 18.8 63.9 3.96 23.9 2.84 25 92
## 4 4 63 432. 187 417 18 1 0
## 5 5 3.71 354. 6.86 26.4 3.57 7 14
## 6 6 6.99 99.4 4.89 29 3.05 156 35.5
## 7 7 11.0 142. 5.86 40.7 3.37 163 19.8
## 8 8 1.75 93.3 1.75 21 1.75 4 NaN
## # ℹ 1 more variable: mean_pop_density <dbl>
##
## 1 2 3 4 5 6 7 8
## 10 14 25 1 7 156 163 4
## # A tibble: 8 × 6
## cluster culture roads secondary primary vocational
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.02 -0.268 0.560 0.616 1.20
## 2 2 1.17 0.926 0.614 1.44 0.917
## 3 3 0.795 -0.999 -0.259 -0.557 -0.220
## 4 4 2.84 2.34 5.26 4.71 3.08
## 5 5 -1.04 1.97 0.420 -0.521 0.0976
## 6 6 -0.375 -0.306 -0.123 -0.304 -0.134
## 7 7 0.144 0.288 0.0431 0.226 0.00453
## 8 8 -1.74 -0.147 -0.927 -0.714 -0.729
## # A tibble: 5 × 5
## var highest_cluster highest_value lowest_cluster lowest_value
## <chr> <fct> <dbl> <fct> <dbl>
## 1 culture 4 63 8 1.75
## 2 primary 4 417 8 21
## 3 roads 4 432. 3 63.9
## 4 secondary 4 187 8 1.75
## 5 vocational 4 18 8 1.75