1. Motivation and research questions

This paper presents a reproducible replication of an existing spatial regionalization study based on the SKATER (Spatial ‘K’luster Analysis by Tree Edge Removal) algorithm.

The primary objective is to faithfully reproduce the methodological workflow of the original study using publicly available data at the powiat level in Poland, while ensuring full computational transparency and reproducibility.

In addition to replication, the paper introduces minor methodological extensions, including alternative data transformations (logarithmic, density-based, and population-normalized indicators) and a sensitivity analysis using within-cluster sum of squared errors (SSE).

Our research questions:

  • RQ1: Can we form contiguous (spatially connected) regions of Poland where neighboring powiats have similar infrastructure characteristics?

  • RQ2: What latent dimensions (factors) summarize infrastructure variation (e.g., “education availability”, “cultural/road access”)?

  • RQ3: How sensitive are regionalizations to data representation choices? We compare:

  1. Raw counts

  2. log-transformed counts (log1p)

  3. area-normalized densities (per km²)

  4. population-normalized rates (per 10k inhabitants

  • RQ4: How does “fit” change with K (clusters)? We use within-cluster SSE as a simple diagnostic

2. Data and preprocessing

We combine:

  • Powiat polygons (shapefile): geometry and IDs (CRS typically EPSG:2180 for Poland)

  • Infrastructure indicators (CSV): counts of facilities

  • Regional dataset (2023): contextual variables like distance to core city, area, population

All data and replication materials are publicly available via Zenodo: Spatial Infrastructure Indicators and Powiat-Level Shapefiles for SKATER Regionalization in Poland (2023)

Why this is needed:

  • SKATER requires spatial units (polygons) and a neighborhood graph.

  • Factor analysis and clustering require a numeric feature matrix.

  • Context variables (distance/area/pop) enable robustness via densities and rates

2.1 Read powiat polygons

Also ensure valid geometry: contiguity neighbours fail easily if polygons are invalid.

powiat <- st_read("powiaty/powiaty.shp", quiet = TRUE)

powiat_data <- powiat %>%
rename(
powiat_id   = JPT_KOD_JE,
powiat_name = JPT_NAZWA_,
powiat_type = RODZAJ,
shape_area  = SHAPE_AREA
) %>%
mutate(powiat_id = as.integer(powiat_id) * 1000) %>%
select(powiat_id, powiat_name, powiat_type, shape_area, geometry) %>%
st_make_valid()

st_crs(powiat_data)
## Coordinate Reference System:
##   User input: ETRF2000-PL / CS92 
##   wkt:
## PROJCRS["ETRF2000-PL / CS92",
##     BASEGEOGCRS["ETRF2000-PL",
##         DATUM["ETRF2000 Poland",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",9702]],
##     CONVERSION["Poland CS92",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",19,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9993,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",-5300000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (x)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (y)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Topographic mapping (medium and small scale)."],
##         AREA["Poland - onshore and offshore."],
##         BBOX[49,14.14,55.93,24.15]],
##     ID["EPSG",2180]]

2.2 Read infrastructure CSVs and regional dataset

powiat_joined <- powiat_data %>%
left_join(culture,    by = c("powiat_id" = "Code")) %>% rename(culture = value) %>%
left_join(roads,      by = c("powiat_id" = "Code")) %>% rename(roads = value) %>%
left_join(secondary,  by = c("powiat_id" = "Code")) %>% rename(secondary = value) %>%
left_join(primary,    by = c("powiat_id" = "Code")) %>% rename(primary = value) %>%
left_join(vocational, by = c("powiat_id" = "Code")) %>% rename(vocational = value) %>%
left_join(regional_2023, by = c("powiat_id" = "code"))

3. Descriptive statistical analysis

  • Infrastructure counts are typically highly skewed
  • Skew impacts correlation and factor analysis
  • This motivates log transform / rates later
## # A tibble: 5 × 8
##   variable     mean    sd   min   p25 median   p75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 culture     10.4   9.72   0    4.75    7     14    91 
## 2 primary     37.0  31.1    8   21      29     44   417 
## 3 roads      124.   91.3   21.7 65.1    92.7  133.  517.
## 4 secondary    6.14 12.0    0    2       4      6   187 
## 5 vocational   3.42  2.64   0    2       3      4    18

Infrastructure is unevenly distributed across powiats, with a small number of highly endowed units and many low-infrastructure areas. This confirms that raw counts reflect strong size and urbanization effects and motivates transformation prior to multivariate analysis.

The log(1 + x) transformation substantially reduces skewness while retaining relative differences between regions. This improves the correlation structure and makes the data more suitable for factor analysis and distance-based clustering.

4. Factor analysis with Bartlett and KMO

Use factor analysis to:

  • Reduce many correlated indicators into fewer latent variables
  • Produce factor scores used as input to SKATER (new variables)

Why Bartlett and KMO:

  • Bartlett’s test checks if variables are correlated at all (reject identity matrix)

  • KMO checks if correlations are suitable for factor extraction (shared variance)

4.1 Prepare matrix for factor analysis

Apply log1p because counts are skewed and this improves correlation structure.

##            culture  roads secondary primary vocational
## culture      1.000 -0.138     0.152   0.543      0.182
## roads       -0.138  1.000     0.539   0.227      0.398
## secondary    0.152  0.539     1.000   0.536      0.766
## primary      0.543  0.227     0.536   1.000      0.553
## vocational   0.182  0.398     0.766   0.553      1.000

Infrastructure variables are not independent. Educational facilities and cultural infrastructure tend to co-locate, suggesting latent dimensions rather than isolated indicators. This supports the use of factor analysis to reduce dimensionality.

4.2 Bartlett and KMO

## $chisq
## [1] 793.3028
## 
## $p.value
## [1] 5.679348e-164
## 
## $df
## [1] 10
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = cor_mat)
## Overall MSA =  0.67
## MSA for each item = 
##    culture      roads  secondary    primary vocational 
##       0.50       0.70       0.68       0.70       0.71
  • The null hypothesis of an identity correlation matrix is decisively rejected, indicating that the variables are sufficiently correlated to justify factor analysis.

  • The overall KMO value indicates acceptable sampling adequacy. While the culture variable is borderline, all variables exceed the minimum threshold (0.5), allowing factor extraction with caution.

4.3 Choose number of factors

## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA
## 
## Loadings:
##            ML1    ML2   
## culture            0.883
## roads       0.611 -0.108
## secondary   0.901  0.228
## primary     0.447  0.646
## vocational  0.778  0.276
## 
##                  ML1   ML2
## SS loadings    1.994 1.336
## Proportion Var 0.399 0.267
## Cumulative Var 0.399 0.666
## 'data.frame':    380 obs. of  3 variables:
##  $ forest_km2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ F1        : num  0.261 -0.569 -0.811 -0.714 0.904 ...
##  $ F2        : num  0.5 -0.688 0.467 0.28 0.586 ...

Factor analysis identified two independent latent dimensions of infrastructure provision. The first factor, explaining 39.9% of total variance, is strongly associated with secondary and vocational education and moderately with road infrastructure, reflecting the concentration of post-primary educational and transport facilities. The second factor, explaining an additional 26.7% of variance, is dominated by cultural infrastructure and primary education, capturing local social infrastructure oriented toward everyday community needs. Together, these factors explain 66.6% of total variance and reveal a clear separation between regional service-oriented infrastructure and locally distributed basic services.

5. Neighborhood matrix (contiguity criterion) and connectivity graph

A connectivity graph: each region is connected to its geographic neighbors.

Why contiguity neighbors:

  • Ensures clusters are contiguous (no disconnected patches)

  • Reflects spatial dependence: neighboring areas tend to be similar

We use Queen contiguity (sharing boundary or corner).

## Isolates (no neighbours): 0
## Connected components: 1

The queen contiguity neighborhood matrix forms a single connected graph with no isolated powiats. This ensures that spatial regionalization is feasible without artificial links or disconnected components.

6. Minimum spanning tree (MST)

The MST is a tree that connects all nodes with minimum total cost.

Why MST:

  • SKATER partitions the MST by cutting edges

  • The MST is the backbone for contiguous clustering

The edge cost between neighboring regions \(i\) and \(j\) is defined as squared Euclidean distance in factor-score space:

\[ c_{ij} = \lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2 = \sum_{k=1}^{p} (z_{ik} - z_{jk})^2, \]

where \(\mathbf{z}_i\) is the vector of factor scores (or standardized features) for region \(i\), and \(p\) is the number of factors/features.

We use igraph MST to avoid mstree() pitfalls.

The MST identifies the most critical spatial links between neighboring powiats in terms of infrastructure similarity. High-cost edges represent sharp transitions and are natural candidates for partitioning during SKATER regionalization.

7. Recursive partition (SKATER)

For each K SKATER cuts MST edges to create K contiguous groups

Why recursive partition:

  • MST ensures connectivity
  • Cutting edges yields contiguous regions
  • Objective is to minimize within-group heterogeneity

7.1 Within-cluster SSE (extra evaluation)

SSE (Sum of Squared Errors):

  • Within each cluster, compute squared distance of points to their cluster center

  • Smaller SSE → more homogeneous clusters

  • SSE always decreases as K increases, so we look for an “elbow”

##     K      SSE
## 2   2 624.1841
## 3   3 608.4711
## 4   4 581.1016
## 6   6 553.7094
## 8   8 530.6195
## 10 10 510.9278
## 12 12 496.8311

While within-cluster SSE decreases with increasing K, the rate of improvement slows notably after K ≈ 8. This suggests that additional clusters primarily subdivide existing regions rather than revealing fundamentally new spatial structures.

7.2 Choose K and map clusters

Choose a K based on:

  • Interpretable regions

  • No extreme tiny clusters

  • Reasonable elbow in SSE

8. Profiling partitions

Profiling explains cluster meaning by computing average indicator values per cluster.

Why this matters:

  • Clustering itself is not the goal; interpretation is

  • Policy or planning insights come from understanding “what characterizes each region”

We profile:

  • Raw counts

  • Standardized log counts

  • Contextual variables (distance, density)

  • Densities and per-capita rates

8.1 Raw profiling

## # A tibble: 8 × 9
##   cluster culture roads secondary primary vocational     n mean_dist
##   <fct>     <dbl> <dbl>     <dbl>   <dbl>      <dbl> <int>     <dbl>
## 1 1         20.3  103.      11.4     53.4       6.9     10       0  
## 2 2         22.1  192.      11.4     76.4       6.07    14       0  
## 3 3         18.8   63.9      3.96    23.9       2.84    25      92  
## 4 4         63    432.     187      417        18        1       0  
## 5 5          3.71 354.       6.86    26.4       3.57     7      14  
## 6 6          6.99  99.4      4.89    29         3.05   156      35.5
## 7 7         11.0  142.       5.86    40.7       3.37   163      19.8
## 8 8          1.75  93.3      1.75    21         1.75     4     NaN  
## # ℹ 1 more variable: mean_pop_density <dbl>
## 
##   1   2   3   4   5   6   7   8 
##  10  14  25   1   7 156 163   4

8.2 Standardized profiling (log1p z-scores)

## # A tibble: 8 × 6
##   cluster culture  roads secondary primary vocational
##   <fct>     <dbl>  <dbl>     <dbl>   <dbl>      <dbl>
## 1 1         1.02  -0.268    0.560    0.616    1.20   
## 2 2         1.17   0.926    0.614    1.44     0.917  
## 3 3         0.795 -0.999   -0.259   -0.557   -0.220  
## 4 4         2.84   2.34     5.26     4.71     3.08   
## 5 5        -1.04   1.97     0.420   -0.521    0.0976 
## 6 6        -0.375 -0.306   -0.123   -0.304   -0.134  
## 7 7         0.144  0.288    0.0431   0.226    0.00453
## 8 8        -1.74  -0.147   -0.927   -0.714   -0.729

8.3 Density and per-capita

## # A tibble: 5 × 5
##   var        highest_cluster highest_value lowest_cluster lowest_value
##   <chr>      <fct>                   <dbl> <fct>                 <dbl>
## 1 culture    4                         63  8                      1.75
## 2 primary    4                        417  8                     21   
## 3 roads      4                        432. 3                     63.9 
## 4 secondary  4                        187  8                      1.75
## 5 vocational 4                         18  8                      1.75