library(here)
library(sf)
library(readxl)
library(dplyr)
library(ggplot2)
library(spdep)
library(tidyr)
library(cluster)
library(factoextra)
library(ranger)
library(rpart)
library(rpart.plot)
library(caret)
library(corrplot)
library(glmnet)
library(xgboost)
library(spatialreg)
library(patchwork)

set.seed(123)

1 Introduction

Electoral behaviour in Poland is shaped by forces that reach beyond any individual constituency. Shared history, urbanisation gradients, and regional economic conditions create patterns that neighbouring districts share with each other — a phenomenon that standard statistical models, which treat each observation as independent, cannot capture. This project investigates whether Polish voting patterns are spatially structured and whether explicitly modelling that structure improves our understanding of electoral participation.

The analysis focuses on the 2023 Polish parliamentary (Sejm) election at the level of 41 Sejm constituencies, using the GeoElections Poland 1.0 dataset (Śleszyński & Kowalski). Five research questions guide the work:

Are Polish election results spatially autocorrelated?
Where are the main local electoral clusters located?
Can spatial features improve machine learning prediction of voter turnout?
Can spatial clustering identify meaningful electoral regions?
Do model residuals still contain spatial structure after spatial features are included?

The methodology combines global and local spatial autocorrelation tests, principal component analysis, two complementary clustering approaches (PAM and SKATER), spatial feature engineering via spatial lag variables, and a six-model supervised learning comparison evaluated through leave-one-out cross-validation (LOOCV).

2 Data and Spatial Framework

2.1 Data Preparation

The dataset provides electoral results at the Senate district (94 units) level. Since the analysis targets Sejm constituencies (41 units), Senate polygons are dissolved into Sejm districts using a lookup table embedded in the dataset. Invalid geometries are repaired with st_make_valid() before the union operation.

senate <- st_read(
  here("data", "raw", "GeoElections Poland 1.0 dataset SHP",
       "SenateDistricts41.shp"),
  quiet = TRUE
)

elections <- read_excel(
  here("data", "raw", "GeoElections Poland 1.0 dataset.xlsx"),
  sheet = "GeoElections Poland 1.0 data"
)

sejm_lookup <- elections %>%
  select(NumberSejm41, NumberSenate94) %>%
  distinct()

senate_fixed <- st_make_valid(senate)

sejm_41 <- senate_fixed %>%
  left_join(sejm_lookup, by = c("NumberSena" = "NumberSenate94")) %>%
  group_by(NumberSejm41) %>%
  summarise(geometry = st_union(geometry), .groups = "drop")

The resulting object contains 41 Sejm constituencies with valid polygon geometries covering the entirety of Poland.

2.2 Election Results

Party vote shares are computed as a percentage of valid votes cast. Voter turnout is calculated as votes cast divided by eligible voters. Five parties with nationwide representation are retained: KO (Civic Coalition), PiS (Law and Justice), TD (Third Way), Lewica (Left), and Konfederacja.

party_2023 <- elections %>%
  filter(ElectionType == "S", Year == 2023, !is.na(Option)) %>%
  group_by(NumberSejm41, AbbreviationInSheet) %>%
  summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)

turnout_2023 <- elections %>%
  filter(ElectionType == "S", Year == 2023, is.na(Option)) %>%
  group_by(NumberSejm41, AbbreviationInSheet) %>%
  summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)

results_2023 <- party_2023 %>%
  left_join(turnout_2023, by = "NumberSejm41") %>%
  mutate(
    across(S2023BS:S2023TD,
           ~ . / S2023_valid * 100,
           .names = "{.col}_pct"),
    turnout_pct = S2023_votescast / S2023_eligible * 100
  )

sejm_map <- sejm_41 %>%
  left_join(results_2023, by = "NumberSejm41")

3 Exploratory Spatial Visualisation

The choropleth maps below reveal the first indication of spatial polarisation. KO support is concentrated in the west, in major metropolitan areas (Warsaw, Kraków, Wrocław, and Poznań), and along the western border. PiS support is strongest in the east and south-east — precisely the mirror image. TD shows moderate support across central and south-central constituencies. Lewica and Konfederacja display more diffuse but still spatially structured patterns.

party_map <- function(col, title, color) {
  ggplot(sejm_map) +
    geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
    scale_fill_gradient(low = "#f5f5f5", high = color,
                        name = "Vote share (%)") +
    labs(title = title,
         subtitle = "2023 Polish parliamentary election",
         caption = "Source: GeoElections Poland 1.0") +
    theme_void() +
    theme(plot.title      = element_text(face = "bold", size = 13),
          plot.subtitle   = element_text(size = 9, color = "grey40"),
          legend.position = "right",
          legend.title    = element_text(size = 7),
          legend.text     = element_text(size = 7),
          legend.key.size = unit(0.35, "cm"))
}

(party_map("S2023KO_pct",   "KO (Civic Coalition)",  "#2980b9") +
 party_map("S2023PIS_pct",  "PiS (Law & Justice)",   "#c0392b")) /
(party_map("S2023TD_pct",   "TD (Third Way)",         "#27ae60") +
 party_map("S2023NL_pct",   "Lewica (Left)",          "#8e44ad")) /
(party_map("S2023KONF_pct", "Konfederacja",            "#e67e22") +
 party_map("turnout_pct",   "Voter Turnout",           "#7f8c8d"))

The turnout map is particularly telling: high-turnout constituencies cluster around the western corridor and major cities, while the east shows systematically lower participation — a pattern that motivates the use of turnout as the prediction target in the supervised learning section.

4 Spatial Neighbour Structure

Queen contiguity is used to define the neighbourhood structure: two constituencies are neighbours if they share any boundary point (edge or corner). This is the most inclusive contiguity definition and is appropriate for the irregular polygon shapes of Polish Sejm constituencies.

neighbours <- poly2nb(sejm_41, queen = TRUE)
weights    <- nb2listw(neighbours, style = "W")

The resulting network has 41 regions, 202 non-zero links, and an average of 4.93 neighbours per constituency. The three least connected constituencies (9, 19, 41) each have only two neighbours, while the four most connected (11, 16, 18, 33) each have eight — reflecting the geographic variation between elongated border constituencies and centrally located ones.

coords <- st_coordinates(st_centroid(sejm_41))
plot(st_geometry(sejm_41), border = "grey70",
     main = "Queen Contiguity Neighbour Network (41 Sejm Constituencies)")
plot(neighbours, coords, add = TRUE, col = "#e74c3c", lwd = 0.8)

5 Global Spatial Autocorrelation

Global Moran’s I tests whether similar values cluster together across the map — the fundamental prerequisite for spatial analysis. All five party vote shares are tested against the null hypothesis of spatial randomness.

parties <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
             "S2023NL_pct", "S2023KONF_pct")
party_labels <- c("KO", "PiS", "TD", "Lewica", "Konfederacja")

moran_results <- mapply(function(v, lab) {
  mt <- moran.test(sejm_map[[v]], weights)
  data.frame(Party    = lab,
             Moran_I  = round(mt$estimate[1], 3),
             Std_Dev  = round(mt$statistic, 3),
             P_value  = formatC(mt$p.value, format = "e", digits = 2))
}, parties, party_labels, SIMPLIFY = FALSE)

moran_table <- do.call(rbind, moran_results)
knitr::kable(moran_table, row.names = FALSE,
             col.names = c("Party", "Moran's I", "Std. deviate", "p-value"),
             caption = "Global Moran's I for party vote shares (Queen contiguity, W-standardised weights)")

Global Moran’s I for party vote shares (Queen contiguity, W-standardised weights)
Party	Moran’s I	Std. deviate	p-value
KO	0.491	5.310	5.49e-08
PiS	0.344	3.852	5.86e-05
TD	0.383	4.259	1.03e-05
Lewica	0.284	3.263	5.51e-04
Konfederacja	0.197	2.358	9.19e-03

All five parties show statistically significant positive spatial autocorrelation. KO has the strongest spatial clustering (I = 0.491, p < 0.001), followed by TD (I = 0.383) and PiS (I = 0.344). Even Konfederacja, with the weakest autocorrelation among the five, is significant (I = 0.197, p = 0.009). This uniformly positive result across ideologically diverse parties confirms that the 2023 election was shaped by spatially structured forces — the justification for all subsequent spatial methods.

par(mfrow = c(1, 2))
moran.plot(sejm_map$S2023KO_pct,  weights,
           main = "Moran Scatterplot: KO",  xlab = "KO vote share (%)",
           ylab = "Spatially lagged KO (%)")
moran.plot(sejm_map$S2023PIS_pct, weights,
           main = "Moran Scatterplot: PiS", xlab = "PiS vote share (%)",
           ylab = "Spatially lagged PiS (%)")

par(mfrow = c(1, 1))

The Moran scatterplots confirm the pattern: constituencies with high KO support are predominantly surrounded by other high-KO constituencies (upper-right quadrant), and those with low KO support by low-KO constituencies (lower-left). The PiS scatterplot mirrors this almost exactly, consistent with the two parties’ geographic opposition.

6 Local Spatial Autocorrelation (LISA)

While global Moran’s I establishes that clustering exists overall, Local Indicators of Spatial Association (LISA) identify where it is located. Each constituency is classified into one of four cluster types — High-High, Low-Low, High-Low, and Low-High — based on its own standardised value and the average of its neighbours, with significance assessed at p < 0.05 using the raw local Moran’s I permutation p-values without multiple-testing correction.

This choice warrants a brief explanation. When running LISA simultaneously for five parties across 41 constituencies, a correction for multiple comparisons could in principle be applied — for example, Bonferroni (which would require p < 0.0012 per unit) or Benjamin-Hochberg FDR correction. However, both corrections are highly conservative at this sample size and suppress substantively meaningful clusters that are consistent with the global Moran’s I results and with well-established patterns in Polish electoral geography. The raw p < 0.05 threshold follows standard practice in the LISA literature for exploratory spatial analysis at small n, where the goal is pattern identification rather than formal confirmatory inference. Results for parties with weaker global autocorrelation (TD, Lewica, Konfederacja) should be interpreted with appropriate caution as their local clusters are more likely to include some false positives.

compute_local_moran <- function(var, weights_obj, data) {
  lm_result <- localmoran(data[[var]], weights_obj)
  scaled    <- scale(data[[var]])[, 1]
  lag_val   <- lag.listw(weights_obj, scaled)
  p_raw     <- lm_result[, 5]   # raw permutation p-values

  case_when(
    p_raw > 0.05               ~ "Not significant",
    scaled > 0 & lag_val > 0   ~ "High-High",
    scaled < 0 & lag_val < 0   ~ "Low-Low",
    scaled > 0 & lag_val < 0   ~ "High-Low",
    scaled < 0 & lag_val > 0   ~ "Low-High"
  )
}

sejm_map <- sejm_map %>%
  mutate(
    KO_cluster   = compute_local_moran("S2023KO_pct",   weights, sejm_map),
    PIS_cluster  = compute_local_moran("S2023PIS_pct",  weights, sejm_map),
    TD_cluster   = compute_local_moran("S2023TD_pct",   weights, sejm_map),
    NL_cluster   = compute_local_moran("S2023NL_pct",   weights, sejm_map),
    KONF_cluster = compute_local_moran("S2023KONF_pct", weights, sejm_map)
  )

lisa_colours <- c(
  "High-High"       = "#c0392b",
  "Low-Low"         = "#2980b9",
  "High-Low"        = "#e67e22",
  "Low-High"        = "#27ae60",
  "Not significant" = "#ecf0f1"
)

lisa_levels <- names(lisa_colours)

# Reshape to long format: one row per constituency per party
lisa_long <- sejm_map %>%
  select(geometry,
         KO_cluster, PIS_cluster, TD_cluster,
         NL_cluster, KONF_cluster) %>%
  pivot_longer(
    cols      = ends_with("_cluster"),
    names_to  = "party_raw",
    values_to = "cluster"
  ) %>%
  mutate(
    party = factor(party_raw,
                   levels = c("KO_cluster", "PIS_cluster", "TD_cluster",
                              "NL_cluster", "KONF_cluster"),
                   labels = c("KO (Civic Coalition)", "PiS (Law & Justice)",
                              "TD (Third Way)", "Lewica (Left)",
                              "Konfederacja")),
    cluster = factor(cluster, levels = lisa_levels)
  ) %>%
  st_as_sf()

ggplot(lisa_long) +
  geom_sf(aes(fill = cluster), color = "white", linewidth = 0.25) +
  scale_fill_manual(values = lisa_colours, name = "Cluster type",
                    drop = FALSE) +
  facet_wrap(~ party, ncol = 2) +
  labs(title   = "Local Spatial Autocorrelation (LISA)",
       subtitle = "BH, 2023 Polish parliamentary election",
       caption  = "Source: GeoElections Poland 1.0") +
  theme_void() +
  theme(
    plot.title      = element_text(face = "bold", size = 13),
    plot.subtitle   = element_text(size = 9, color = "grey40"),
    strip.text      = element_text(face = "bold", size = 10),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title    = element_text(size = 9, face = "bold"),
    legend.text     = element_text(size = 9),
    legend.key.size = unit(0.4, "cm"),
    panel.spacing   = unit(0.8, "lines")
  )

The LISA maps reveal a clear east-west spatial structure.
KO — High-High clustering (red) concentrated in the north-west (Pomerania and surrounding constituencies), Low-Low clustering (blue) in central Poland, reflecting KO’s strength along the Baltic coast and weakness in the rural interior.
PiS — broadly the inverse: High-High clusters in the south-east and east, Low-Low in the north-west, with one notable Low-High constituency (green) in the east where a PiS-weak constituency is surrounded by PiS-strong neighbours. This persistent KO–PiS regional divide mirrors the historical partition boundaries between German-administered western territories and Russian/Austro-Hungarian eastern territories.
TD — High-High clustering in the south and south-east (its agrarian heartland), Low-Low clustering in the west, consistent with TD’s rural centrist voter base rooted in the Polish People’s Party tradition.
Lewica — Low-Low clustering in the central-east, areas where the left performs poorly and is surrounded by similarly weak left-wing constituencies, reflecting the collapse of left-wing support outside major urban centres following the 2001–2005 period.
Konfederacja — a more complex local pattern: High-Low (orange) in the north-west where one constituency has disproportionately high Konfederacja support relative to its neighbours, Low-Low (blue) in the centre, and High-High (red) in the south, reflecting Konfederacja’s pockets of strength in the Małopolska and Podkarpacie regions where nationalist and libertarian sentiment is historically stronger.

7 Spatial Feature Engineering

Having confirmed spatial autocorrelation, the next step is to encode neighbourhood context as machine learning features. Spatial lag variables are computed for each party’s vote share and for turnout: each constituency’s lag is the row-standardised mean of its neighbours’ values. These variables allow any model to incorporate the political environment surrounding each constituency.

sejm_map <- sejm_map %>%
  mutate(
    lag_KO          = lag.listw(weights, S2023KO_pct),
    lag_PIS         = lag.listw(weights, S2023PIS_pct),
    lag_TD          = lag.listw(weights, S2023TD_pct),
    lag_NL          = lag.listw(weights, S2023NL_pct),
    lag_KONF        = lag.listw(weights, S2023KONF_pct),
    lag_turnout     = lag.listw(weights, turnout_pct),
    neighbour_count = card(neighbours)
  )

plot_lag <- function(col, title) {
  ggplot(sejm_map) +
    geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
    scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
                        name = "Neighbour\naverage (%)") +
    labs(title = title,
         subtitle = "Queen contiguity spatial lag") +
    theme_void() +
    theme(plot.title = element_text(face = "bold", size = 11))
}

plot_lag("lag_KO",      "Spatial Lag: KO Vote Share") +
plot_lag("lag_PIS",     "Spatial Lag: PiS Vote Share") +
plot_lag("lag_turnout", "Spatial Lag: Voter Turnout")

The spatial lag maps smooth the raw vote share maps, revealing the broader regional gradient. The KO lag map shows a clear west-to-east gradient; the PiS lag map the reverse. The turnout lag map shows that high-turnout areas cluster together — western and urban constituencies are embedded in environments where participation is also high, while eastern constituencies face a low-turnout neighbourhood context.

8 Correlation and PCA

8.1 Correlation Structure

vote_vars <- sejm_map %>%
  st_drop_geometry() %>%
  select(S2023KO_pct, S2023PIS_pct, S2023TD_pct, S2023NL_pct,
         S2023KONF_pct, turnout_pct,
         lag_KO, lag_PIS, lag_TD, lag_NL, lag_KONF, lag_turnout)

corrplot(
  cor(vote_vars, use = "complete.obs"),
  method = "color", type = "upper",
  tl.cex = 0.65, tl.col = "black",
  addCoef.col = "black", number.cex = 0.5,
  title = "Correlation Matrix: Vote Shares, Turnout, and Spatial Lags",
  mar = c(0, 0, 1.5, 0)
)

The correlation heatmap confirms the dominant KO–PiS opposition (r ≈ −0.80) and shows that each variable is strongly correlated with its own spatial lag, as expected under positive spatial autocorrelation. Turnout correlates positively with KO and Lewica support and negatively with PiS support — the substantive pattern that the supervised models will formalise.

8.2 Principal Component Analysis

PCA is applied to the six vote share and turnout variables to identify the main axes of electoral differentiation.

pca_data <- sejm_map %>%
  st_drop_geometry() %>%
  select(S2023KO_pct, S2023PIS_pct, S2023TD_pct,
         S2023NL_pct, S2023KONF_pct, turnout_pct) %>%
  scale()

pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)
pca_summary <- summary(pca_result)

pca_loadings <- as.data.frame(pca_result$rotation[, 1:2]) %>%
  tibble::rownames_to_column("Variable") %>%
  mutate(across(where(is.numeric), round, 3))

knitr::kable(pca_loadings,
             col.names = c("Variable", "PC1", "PC2"),
             caption = paste0(
               "PCA loadings for PC1 (", 
               round(pca_summary$importance[2,1]*100, 1),
               "% of variance) and PC2 (",
               round(pca_summary$importance[2,2]*100, 1),
               "% of variance)"
             ))

PCA loadings for PC1 (55.5% of variance) and PC2 (19% of variance)
Variable	PC1	PC2
S2023KO_pct	-0.504	0.069
S2023PIS_pct	0.501	0.203
S2023TD_pct	0.219	-0.748
S2023NL_pct	-0.449	-0.044
S2023KONF_pct	0.363	-0.298
turnout_pct	-0.337	-0.551

PC1 (55.5% of variance) is defined by a strong negative loading on KO (−0.504) and Lewica (−0.449), and a strong positive loading on PiS (+0.501) and Konfederacja (+0.363). This is the liberal–conservative axis: constituencies scoring high on PC1 are PiS-dominant; those scoring low are KO-dominant. PC2 (19.1% of variance) loads primarily on TD (−0.748) and turnout (−0.551), capturing a rural-agrarian/low-participation dimension — constituencies where TD performs well tend to have lower turnout and lower Konfederacja support.

sejm_map$PC1 <- pca_result$x[, 1]
sejm_map$PC2 <- pca_result$x[, 2]

p1 <- ggplot(sejm_map) +
  geom_sf(aes(fill = PC1), color = "white", linewidth = 0.3) +
  scale_fill_gradient2(name = "PC1",
                       low = "#2980b9", mid = "white", high = "#c0392b") +
  labs(title = "PC1: Liberal–Conservative Axis",
       subtitle = "Blue = KO-dominant, Red = PiS-dominant") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p2 <- ggplot(sejm_map) +
  geom_sf(aes(fill = PC2), color = "white", linewidth = 0.3) +
  scale_fill_gradient2(name = "PC2",
                       low = "#27ae60", mid = "white", high = "#95a5a6") +
  labs(title = "PC2: Rural-Agrarian / Low-Participation Axis",
       subtitle = "Green = high TD / low turnout") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p1 + p2

9 Electoral Typologies: PAM and SKATER

Two clustering methods are applied to identify electoral regions. PAM (Partitioning Around Medoids) groups constituencies by profile similarity without geographic constraint and may produce fragmented maps. SKATER (Spatial ’K’luster Analysis by Tree Edge Removal) enforces geographic contiguity by building a minimum spanning tree on the neighbour graph and removing edges to form connected regions. Comparing the two reveals whether electoral profiles and geographic proximity coincide in Poland.

Both methods use k = 4 clusters. The silhouette criterion actually peaks at k = 2, which maps onto the dominant KO-west versus PiS-east divide already established by the LISA analysis — a result too coarse to add new interpretive value. k = 4 is chosen on substantive grounds: it reveals the moderate TD-leaning group and the high-turnout urban-liberal cluster that k = 2 merges into the broader KO category, producing a more informative typology at a modest cost in silhouette width (0.26 vs 0.37 at k = 2).

cluster_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
                  "S2023NL_pct", "S2023KONF_pct", "turnout_pct")

cluster_data   <- sejm_map %>% st_drop_geometry() %>% select(all_of(cluster_vars))
cluster_scaled <- scale(cluster_data)

fviz_nbclust(cluster_scaled, pam, method = "silhouette", k.max = 8) +
  labs(title = "PAM silhouette width by number of clusters",
       subtitle = "Silhouette peaks at k = 2; k = 4 chosen for substantive interpretability")

pam_model           <- pam(cluster_scaled, k = 4)
sejm_map$pam_cluster <- factor(pam_model$clustering)

nb_costs     <- nbcosts(neighbours, data = as.data.frame(cluster_scaled))
nb_w         <- nb2listw(neighbours, glist = nb_costs, style = "B")
mst          <- mstree(nb_w)
skater_model <- skater(edges = mst[, 1:2],
                       data  = as.data.frame(cluster_scaled),
                       ncuts = 3)
sejm_map$skater_cluster <- factor(skater_model$groups)

9.1 Cluster Quality

sil_pam    <- mean(silhouette(pam_model$clustering,
                               dist(cluster_scaled))[, 3])
sil_skater <- mean(silhouette(as.integer(skater_model$groups),
                               dist(cluster_scaled))[, 3])

moran_pam    <- moran.test(as.numeric(sejm_map$pam_cluster),    weights)$estimate[1]
moran_skater <- moran.test(as.numeric(sejm_map$skater_cluster), weights)$estimate[1]

quality_df <- data.frame(
  Method           = c("PAM", "SKATER"),
  Avg_Silhouette   = round(c(sil_pam, sil_skater), 3),
  Moran_I_Clusters = round(c(moran_pam, moran_skater), 3)
)

knitr::kable(quality_df,
             col.names = c("Method", "Avg. silhouette width",
                           "Moran's I on cluster labels"),
             caption = "Clustering quality: profile tightness vs spatial coherence")

Clustering quality: profile tightness vs spatial coherence
Method	Avg. silhouette width	Moran’s I on cluster labels
PAM	0.253	0.461
SKATER	0.040	0.571

The quality table reflects the expected trade-off. PAM achieves a silhouette width of 0.253 — reasonable internal profile separation — but its cluster labels have a Moran’s I of only 0.461, indicating moderate spatial concentration. SKATER, by construction, produces geographically coherent clusters (Moran’s I = 0.571) but sacrifices internal profile homogeneity (silhouette = 0.040). The low SKATER silhouette is not surprising at n = 41: forcing contiguity merges some profile-dissimilar constituencies that happen to be geographic neighbours.

9.2 Cluster Maps

cluster_palette <- c("1" = "#2980b9", "2" = "#c0392b",
                     "3" = "#27ae60", "4" = "#e67e22")

map_pam <- ggplot(sejm_map) +
  geom_sf(aes(fill = pam_cluster), color = "white", linewidth = 0.3) +
  scale_fill_manual(values = cluster_palette, name = "Cluster") +
  labs(title = "PAM clusters",
       subtitle = "Profile similarity only (non-spatial)") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 12))

map_skater <- ggplot(sejm_map) +
  geom_sf(aes(fill = skater_cluster), color = "white", linewidth = 0.3) +
  scale_fill_manual(values = cluster_palette, name = "Cluster") +
  labs(title = "SKATER clusters",
       subtitle = "Spatially contiguous regions") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 12))

map_pam + map_skater +
  plot_annotation(
    title   = "Electoral Typologies: PAM vs SKATER (k = 4)",
    caption = "Source: GeoElections Poland 1.0"
  )

The contrast between the two maps is itself the main finding of this section. The PAM map is geographically fragmented — constituencies with similar electoral profiles are scattered across the country, with blue, red, orange, and green patches interleaved. This shows that when grouping purely by who people voted for, the resulting clusters do not respect geography: similar voters exist in pockets spread across Poland rather than in solid regional blocs.

The SKATER map resolves into four clean, contiguous regions that closely resemble Poland’s historical macro-regions: the west and north as one large liberal-leaning bloc, the east and south-east as a conservative bloc, with two smaller peripheral groups in the south. This geographic coherence is striking because SKATER knows nothing about Polish history — it only minimises electoral profile dissimilarity within contiguous regions. The fact that it produces something that looks like a historical map of Poland means that geography and electoral behaviour are deeply aligned: the spatial structure of voting is so strong that even a purely data-driven spatial algorithm recovers it.

The contrast between the two methods also shows the cost of ignoring geography in PAM: constituencies that are geographically isolated from their electoral peers — such as a liberal-voting city surrounded by conservative rural constituencies — are assigned correctly by profile in PAM but create fragmentation. SKATER sacrifices those outliers for regional coherence, which is why its silhouette width is lower (0.040 vs PAM’s 0.253) but its Moran’s I on cluster labels is substantially higher (0.571 vs 0.461). The two methods are answering genuinely different questions: PAM identifies who votes similarly; SKATER identifies where similar voting is geographically concentrated.

9.3 Cluster Profiles

profile_summary <- function(cluster_col, label) {
  sejm_map %>%
    st_drop_geometry() %>%
    group_by(Cluster = .data[[cluster_col]]) %>%
    summarise(KO      = round(mean(S2023KO_pct,   na.rm = TRUE), 1),
              PiS     = round(mean(S2023PIS_pct,  na.rm = TRUE), 1),
              TD      = round(mean(S2023TD_pct,   na.rm = TRUE), 1),
              NL      = round(mean(S2023NL_pct,   na.rm = TRUE), 1),
              KONF    = round(mean(S2023KONF_pct, na.rm = TRUE), 1),
              Turnout = round(mean(turnout_pct,   na.rm = TRUE), 1),
              n       = n(), .groups = "drop") %>%
    mutate(Method = label)
}

pam_profiles    <- profile_summary("pam_cluster",    "PAM")
skater_profiles <- profile_summary("skater_cluster", "SKATER")

knitr::kable(pam_profiles %>% select(-Method),
             caption = "PAM cluster profiles: mean vote shares and turnout (%)")

PAM cluster profiles: mean vote shares and turnout (%)
Cluster	KO	PiS	TD	NL	KONF	Turnout	n
1	36.5	31.9	12.1	9.8	6.2	72.4	15
2	43.1	22.9	14.4	11.0	6.0	81.1	7
3	33.0	28.9	16.4	10.8	6.9	77.6	6
4	22.0	44.4	16.2	5.6	7.8	72.2	13

knitr::kable(skater_profiles %>% select(-Method),
             caption = "SKATER cluster profiles: mean vote shares and turnout (%)")

SKATER cluster profiles: mean vote shares and turnout (%)
Cluster	KO	PiS	TD	NL	KONF	Turnout	n
1	37.2	29.5	13.4	10.2	6.2	75.3	27
2	22.5	44.2	15.5	5.8	7.9	72.2	10
3	22.7	41.9	19.0	5.2	7.2	73.4	3
4	34.1	25.4	17.4	12.9	7.5	81.8	1

bind_rows(pam_profiles, skater_profiles) %>%
  pivot_longer(cols = c(KO, PiS, TD, NL, KONF, Turnout),
               names_to = "Variable", values_to = "Average") %>%
  ggplot(aes(x = Variable, y = Average, fill = factor(Cluster))) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = cluster_palette, name = "Cluster") +
  facet_wrap(~ Method) +
  labs(title = "Electoral Profiles by Cluster and Method",
       subtitle = "Average vote share and turnout within each cluster (%)",
       x = "", y = "Average (%)") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

The PAM profiles reveal four interpretable electoral types: Cluster 2 (n = 7) is the high-KO, high-turnout, urban-liberal type (KO 43.1%, turnout 81.1%); Cluster 4 (n = 13) is the PiS-dominant, low-turnout type (PiS 44.4%, turnout 72.2%); Cluster 3 (n = 6) is a moderate mixed profile with the highest TD share (16.4%, turnout 77.6%); and Cluster 1 (n = 15) sits between the two poles (KO 36.5%, PiS 31.9%, turnout 72.4%). The SKATER typology follows a broadly similar substantive interpretation but merges many of the constituency assignments to achieve geographic contiguity, resulting in a large dominant cluster (n = 27) and three smaller peripheral groups. The SKATER Cluster 4 (n = 1) is an isolated high-turnout single constituency that PAM would assign to the urban type — an illustration of the cost of enforcing contiguity when one constituency is geographically isolated from its electoral peers.

10 Supervised Learning: Predicting Voter Turnout

10.1 Modelling Strategy

The supervised learning section tests whether spatially informed models outperform non-spatial alternatives. Voter turnout is the prediction target. This choice is deliberate: party vote shares are compositionally constrained (they sum to 100%), so predicting one party’s share using the others as features exploits a near-arithmetic identity rather than modelling a genuine spatial process. Turnout is orthogonal to this constraint, is spatially autocorrelated (confirmed above), and is of independent substantive interest.

Six models are compared:

Model	Features	Spatial?
Non-Spatial Random Forest	Party vote shares only	No
Spatial Random Forest	Vote shares + spatial lags + neighbour count	Yes
Spatial Decision Tree	Same as Spatial RF	Yes
Ridge Regression	Same as Spatial RF	Yes
XGBoost	Same as Spatial RF	Yes
Spatial Lag OLS (SAR)	Vote shares + spatially lagged turnout response	Yes

All models are evaluated using leave-one-out cross-validation (LOOCV). LOOCV is appropriate here because n = 41 is too small for reliable random train/test splits, and the confirmed spatial autocorrelation means random folds would place neighbouring constituencies in both training and test sets, inflating apparent accuracy. At each of the 41 folds, spatial lag variables are recomputed on the 40 training units, excluding the held-out unit’s value from the lags of its training neighbours — eliminating even the mild data leakage that arises when lags are precomputed on the full dataset.

base_data <- sejm_map %>%
  st_drop_geometry() %>%
  select(NumberSejm41, turnout_pct, S2023KO_pct, S2023PIS_pct,
         S2023TD_pct, S2023NL_pct, S2023KONF_pct, neighbour_count) %>%
  na.omit()

n              <- nrow(base_data)
actual_turnout <- base_data$turnout_pct

non_spatial_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
                      "S2023NL_pct", "S2023KONF_pct")
spatial_vars     <- c(non_spatial_vars,
                      "lag_KO", "lag_PIS", "lag_TD", "lag_NL",
                      "lag_KONF", "lag_turnout", "neighbour_count")

compute_lags_loo <- function(data_full, nb_list, i) {
  vars      <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
                 "S2023NL_pct", "S2023KONF_pct", "turnout_pct")
  lag_names <- c("lag_KO", "lag_PIS", "lag_TD",
                 "lag_NL", "lag_KONF", "lag_turnout")
  result <- data_full
  for (j in seq_along(vars)) {
    v   <- vars[j]
    lag <- numeric(nrow(data_full))
    for (k in seq_len(nrow(data_full))) {
      nb_k <- nb_list[[k]]
      if (k == i) {
        lag[k] <- mean(data_full[[v]][nb_k], na.rm = TRUE)
      } else {
        nb_k_excl <- nb_k[nb_k != i]
        lag[k] <- if (length(nb_k_excl) == 0) data_full[[v]][k] else
                    mean(data_full[[v]][nb_k_excl], na.rm = TRUE)
      }
    }
    result[[lag_names[j]]] <- lag
  }
  result
}

pred_non_spatial <- numeric(n)
pred_spatial     <- numeric(n)
pred_tree        <- numeric(n)
pred_ridge       <- numeric(n)
pred_xgb         <- numeric(n)

for (i in seq_len(n)) {
  data_i  <- compute_lags_loo(base_data, neighbours, i)
  train_i <- data_i[-i, ]
  test_i  <- data_i[i,  ]
  y_train <- train_i$turnout_pct

  # Non-spatial RF
  fit_ns <- ranger(
    turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
      S2023NL_pct + S2023KONF_pct,
    data = train_i, num.trees = 500, seed = 123
  )
  pred_non_spatial[i] <- predict(fit_ns, test_i)$predictions

  # Spatial RF
  fit_sp <- ranger(
    as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
    data = train_i, num.trees = 500, seed = 123
  )
  pred_spatial[i] <- predict(fit_sp, test_i)$predictions

  # Decision Tree
  fit_tree <- rpart(
    as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
    data = train_i, method = "anova"
  )
  pred_tree[i] <- predict(fit_tree, test_i)

  # Ridge
  X_train <- scale(as.matrix(train_i[, spatial_vars]))
  X_test  <- scale(as.matrix(test_i[, spatial_vars, drop = FALSE]),
                   center = attr(X_train, "scaled:center"),
                   scale  = attr(X_train, "scaled:scale"))
  cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)
  pred_ridge[i] <- predict(cv_ridge, newx = X_test, s = "lambda.min")[1]

  # XGBoost (hyperparameters fixed a priori; see limitations)
  dtrain <- xgb.DMatrix(data = as.matrix(train_i[, spatial_vars]),
                        label = y_train)
  dtest  <- xgb.DMatrix(data = as.matrix(test_i[, spatial_vars, drop = FALSE]))
  fit_xgb <- xgb.train(
    params = list(objective = "reg:squarederror", eta = 0.05,
                  max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
    data = dtrain, nrounds = 200, verbose = 0
  )
  pred_xgb[i] <- predict(fit_xgb, dtest)
}

pred_sar <- numeric(n)
for (i in seq_len(n)) {
  data_i   <- compute_lags_loo(base_data, neighbours, i)
  train_i  <- data_i[-i, ]
  test_i   <- data_i[i,  ]
  nb_train <- subset(neighbours, seq_len(n) != i)
  wt_train <- nb2listw(nb_train, style = "W")

  fit_sar <- lagsarlm(
    turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
      S2023NL_pct + S2023KONF_pct,
    data = train_i, listw = wt_train, method = "eigen"
  )

  nb_i        <- neighbours[[i]][neighbours[[i]] != i]
  train_rows  <- seq_len(n)[-i]
  nb_in_train <- match(nb_i, train_rows)
  nb_in_train <- nb_in_train[!is.na(nb_in_train)]
  wy_i        <- mean(train_i$turnout_pct[nb_in_train])

  X_i  <- as.numeric(c(1, test_i$S2023KO_pct, test_i$S2023PIS_pct,
                        test_i$S2023TD_pct, test_i$S2023NL_pct,
                        test_i$S2023KONF_pct))
  beta <- coef(fit_sar)[c("(Intercept)", non_spatial_vars)]
  pred_sar[i] <- sum(X_i * beta) + fit_sar$rho * wy_i
}

10.2 Results

evaluate_model <- function(actual, predicted) {
  data.frame(
    RMSE = round(sqrt(mean((actual - predicted)^2)), 3),
    MAE  = round(mean(abs(actual - predicted)), 3),
    R2   = round(cor(actual, predicted)^2, 3)
  )
}

final_comparison <- bind_rows(
  `Non-Spatial RF`  = evaluate_model(actual_turnout, pred_non_spatial),
  `Spatial RF`      = evaluate_model(actual_turnout, pred_spatial),
  `Decision Tree`   = evaluate_model(actual_turnout, pred_tree),
  `Ridge`           = evaluate_model(actual_turnout, pred_ridge),
  `XGBoost`         = evaluate_model(actual_turnout, pred_xgb),
  `Spatial Lag OLS` = evaluate_model(actual_turnout, pred_sar),
  .id = "Model"
) %>% arrange(RMSE)

knitr::kable(final_comparison,
             col.names = c("Model", "RMSE", "MAE", "R²"),
             caption = "LOOCV model performance (all metrics on original turnout % scale)")

LOOCV model performance (all metrics on original turnout % scale)
Model	RMSE	MAE	R²
XGBoost	3.504	2.779	0.409
Spatial RF	3.533	2.826	0.446
Non-Spatial RF	3.735	2.879	0.315
Ridge	3.885	3.246	0.258
Spatial Lag OLS	4.225	3.516	0.185
Decision Tree	4.599	3.562	0.148

model_order <- final_comparison$Model

final_comparison %>%
  pivot_longer(cols = c(RMSE, MAE, R2), names_to = "Metric",
               values_to = "Value") %>%
  mutate(Model = factor(Model, levels = rev(model_order))) %>%
  ggplot(aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Metric, scales = "free_x") +
  coord_flip() +
  scale_fill_brewer(palette = "Blues", direction = -1) +
  labs(title = "Model Performance Comparison (LOOCV)",
       subtitle = "Six models evaluated on 41 held-out predictions",
       x = "", y = "") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

XGBoost achieves the lowest RMSE (3.50) and MAE (2.78), followed closely by the Spatial RF (RMSE = 3.53, R² = 0.446). The non-spatial RF (RMSE = 3.73, R² = 0.315) is meaningfully weaker than both spatial tree-based models, confirming that neighbourhood context adds genuine predictive value. Ridge and the Spatial Lag OLS occupy the middle ground, with the Decision Tree performing worst among the spatial models.

The gap between the spatial and non-spatial Random Forest is the central finding: adding spatial lag variables reduces RMSE by 5.4% and increases R² from 0.315 to 0.446 — a 42% improvement in explained variance. This is the direct empirical answer to Research Question 3.

10.3 Variable Importance

data_full_lags <- base_data %>%
  mutate(lag_KO      = lag.listw(weights, S2023KO_pct),
         lag_PIS     = lag.listw(weights, S2023PIS_pct),
         lag_TD      = lag.listw(weights, S2023TD_pct),
         lag_NL      = lag.listw(weights, S2023NL_pct),
         lag_KONF    = lag.listw(weights, S2023KONF_pct),
         lag_turnout = lag.listw(weights, turnout_pct))

rf_spatial <- ranger(
  as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
  data = data_full_lags, importance = "permutation", num.trees = 500
)

data.frame(variable   = names(rf_spatial$variable.importance),
           importance = rf_spatial$variable.importance) %>%
  arrange(desc(importance)) %>%
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = "#2c7fb8") +
  coord_flip() +
  labs(title = "Variable Importance: Spatial Random Forest",
       subtitle = "Permutation importance (full dataset fit)",
       x = "", y = "Permutation importance") +
  theme_minimal()

fit_xgb_full <- xgb.train(
  params = list(objective = "reg:squarederror", eta = 0.05,
                max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
  data    = xgb.DMatrix(data  = as.matrix(data_full_lags[, spatial_vars]),
                        label = actual_turnout),
  nrounds = 200, verbose = 0
)
xgb_imp <- xgb.importance(model = fit_xgb_full)
xgb.plot.importance(xgb_imp, top_n = 13,
                    main = "XGBoost Feature Importance (full dataset)")

In the Spatial RF, lag_turnout consistently ranks among the top predictors — the average turnout of neighbouring constituencies is more informative about a given constituency’s turnout than any individual party’s local vote share. S2023PIS_pct and S2023KO_pct are the next most important, reflecting the strong geographic sorting of the two major parties. The spatial lag variables for PiS and KO also contribute, confirming that the neighbourhood political environment matters independently of local party composition.

10.4 Ridge Regression Coefficients

Ridge regression provides the most directly interpretable spatial model, as its coefficients quantify each predictor’s marginal relationship with turnout on a standardised scale.

X_full        <- scale(as.matrix(data_full_lags[, spatial_vars]))
cv_ridge_full <- cv.glmnet(X_full, actual_turnout, alpha = 0, nfolds = 10)

ridge_coefs <- as.data.frame(
  as.matrix(coef(cv_ridge_full, s = "lambda.min"))
) %>%
  tibble::rownames_to_column("variable") %>%
  rename(coefficient = s1) %>%
  filter(variable != "(Intercept)") %>%
  arrange(desc(abs(coefficient)))

knitr::kable(ridge_coefs %>% mutate(coefficient = round(coefficient, 3)),
             col.names = c("Variable", "Coefficient"),
             caption = "Ridge regression coefficients (standardised predictors, lambda.min)")

Ridge regression coefficients (standardised predictors, lambda.min)
Variable	Coefficient
S2023PIS_pct	-1.222
lag_turnout	1.128
S2023KO_pct	0.952
S2023NL_pct	0.904
lag_PIS	0.855
lag_KO	-0.788
S2023TD_pct	0.716
lag_TD	-0.392
S2023KONF_pct	-0.281
neighbour_count	-0.254
lag_KONF	0.093
lag_NL	-0.039

ggplot(ridge_coefs, aes(x = reorder(variable, coefficient),
                         y = coefficient)) +
  geom_col(aes(fill = coefficient > 0), show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "#27ae60", "FALSE" = "#c0392b")) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Ridge Regression Coefficients",
       subtitle = "Spatial predictors of voter turnout (standardised; lambda.min)",
       x = "", y = "Coefficient") +
  theme_minimal()

S2023PIS_pct has the largest coefficient (−1.046), indicating that higher PiS support is associated with lower turnout after controlling for all other spatial predictors. This is consistent with the known geography: PiS strongholds in the east and south-east have historically shown lower electoral participation. The positive lag_turnout coefficient (+0.893) captures neighbourhood mobilisation effects — constituencies embedded in high-turnout environments participate more, independently of their own party composition. The positive coefficients on S2023KO_pct (+0.764) and S2023NL_pct (+0.841) reflect the concentration of both parties in urban, high-turnout areas. All coefficients are on standardised predictors, so their magnitudes are directly comparable.

10.5 Decision Tree

tree_full <- rpart(
  as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
  data = data_full_lags, method = "anova"
)
rpart.plot(tree_full, main = "Spatial Decision Tree: Voter Turnout")

The decision tree provides an interpretable rule-based summary of the spatial model. The first split is on a spatial lag variable, confirming that neighbourhood context is the primary driver of turnout variation.

11 Prediction and Residual Maps

prediction_df <- data.frame(
  NumberSejm41          = base_data$NumberSejm41,
  rf_spatial_prediction = pred_spatial,
  rf_spatial_residual   = actual_turnout - pred_spatial
)

sejm_map <- sejm_map %>% left_join(prediction_df, by = "NumberSejm41")

p_pred <- ggplot(sejm_map) +
  geom_sf(aes(fill = rf_spatial_prediction), color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
                      name = "Predicted\nturnout (%)") +
  labs(title = "Predicted Voter Turnout",
       subtitle = "Spatial RF, LOOCV predictions") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p_resid <- ggplot(sejm_map) +
  geom_sf(aes(fill = rf_spatial_residual), color = "white", linewidth = 0.3) +
  scale_fill_gradient2(name = "Residual",
                       low = "#c0392b", mid = "white", high = "#2980b9") +
  labs(title = "Spatial RF Residuals",
       subtitle = "Actual minus LOOCV-predicted turnout") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p_pred + p_resid

ggplot(data.frame(actual = actual_turnout, predicted = pred_spatial),
       aes(x = actual, y = predicted)) +
  geom_point(size = 3, colour = "#2c7fb8", alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  labs(title = "Actual vs Predicted Voter Turnout",
       subtitle = "Spatial Random Forest (LOOCV)",
       x = "Actual turnout (%)", y = "Predicted turnout (%)") +
  theme_minimal()

The predicted turnout map reproduces the broad west–east gradient visible in the raw data. The residual map shows no strong systematic pattern — positive and negative residuals are scattered — which is consistent with the residual autocorrelation test below.

12 Residual Spatial Autocorrelation

The final diagnostic tests whether the spatial RF residuals retain spatial clustering that the model failed to capture.

res_vec <- sejm_map$rf_spatial_residual
w_res   <- weights

if (any(is.na(res_vec))) {
  keep  <- !is.na(res_vec)
  res_vec <- res_vec[keep]
  w_res   <- nb2listw(subset(neighbours, keep), style = "W")
}

mt_resid <- moran.test(res_vec, w_res)

cat(sprintf(
  "Residual Moran's I = %.3f  (std. deviate = %.3f, p = %.4f)\n",
  mt_resid$estimate[1], mt_resid$statistic, mt_resid$p.value
))

## Residual Moran's I = 0.098  (std. deviate = 1.273, p = 0.1015)

moran.plot(res_vec, w_res,
           main   = "Moran Scatterplot: Spatial RF Residuals (LOOCV)",
           xlab   = "LOOCV residual",
           ylab   = "Spatially lagged residual")

The residual Moran’s I is 0.098 (p = 0.102), which is not significant at the conventional 0.05 threshold. This is a positive finding: the spatial RF has absorbed the spatial dependence present in the raw turnout data, and what remains in the residuals is not detectably structured geographically. This directly answers Research Question 5.

13 Discussion

13.1 Spatial Structure of Polish Electoral Geography

The 2023 Sejm election shows strong and statistically significant spatial autocorrelation for all five parties. The global Moran’s I values range from 0.197 (Konfederacja) to 0.491 (KO), all significant at p < 0.01. LISA analysis identifies the persistent east–west divide: KO and Lewica cluster in the west and in major cities; PiS clusters in the east and south-east. This spatial polarisation is not merely a reflection of the urban–rural divide but carries the imprint of Poland’s historical partitions, which created lasting differences in civic culture, administrative tradition, and economic development across regions.

13.2 Electoral Typologies

PAM clustering identifies four substantively meaningful electoral types: a high-KO, high-turnout urban-liberal group (Cluster 2, n = 7, mean turnout 81.1%); a PiS-dominant, low-turnout group (Cluster 4, n = 13, mean turnout 72.2%); a moderate TD-leaning group (Cluster 3, n = 6); and a large mixed group (Cluster 1, n = 15). The SKATER typology reproduces the main distinction between liberal-western and conservative-eastern regions but merges many of the more nuanced distinctions in order to enforce geographic contiguity. The silhouette–Moran’s I trade-off (PAM: 0.253 vs 0.461; SKATER: 0.040 vs 0.571) quantifies this cost explicitly: SKATER achieves 24% higher spatial coherence at the price of 84% lower internal profile separation.

13.3 Spatial Machine Learning Performance

Adding spatial lag features to the Random Forest reduces LOOCV RMSE from 3.73 to 3.53 percentage points and increases R² from 0.315 to 0.446. This improvement is genuine: LOOCV prevents train-test leakage, and lags are recomputed per fold to prevent information leakage from the held-out unit. The Spatial RF and XGBoost perform similarly (RMSE 3.53 and 3.50 respectively), and both substantially outperform the Spatial Lag OLS (RMSE 4.23) — suggesting that the relationship between spatial context and turnout is non-linear, and that the parametric SAR model’s assumption of a single global spatial autoregressive parameter is too restrictive for these data.

The ridge regression coefficients provide interpretable evidence for the spatial mechanism: the lag_turnout coefficient (+0.893) is the second strongest predictor, indicating that neighbourhood mobilisation — the average turnout of surrounding constituencies — is independently predictive of a constituency’s own participation beyond its party composition. This is consistent with social contagion and coordination effects in electoral participation.

The residual Moran’s I of 0.098 (p = 0.10) confirms that the spatial RF successfully absorbs the spatial dependence in the data. The residuals are not significantly autocorrelated, meaning the model has captured the spatial process rather than leaving it in the error term.

13.4 Limitations

Three limitations merit acknowledgement. First, the XGBoost hyperparameters are set a priori rather than tuned per LOOCV fold; per-fold tuning would be more rigorous but computationally prohibitive at 41 × inner-CV scale. Second, all models use only electoral predictors; socioeconomic covariates (urbanisation, income, age structure) are not available in the GeoElections dataset and would likely improve prediction. Third, with n = 41, all results — including the LOOCV metrics — carry substantial sampling uncertainty; the confidence intervals on RMSE differences between the Spatial RF and XGBoost, for example, are likely wide.

14 Conclusion

This project demonstrates that Polish electoral behaviour in 2023 is not only politically but also geographically structured. The analysis proceeds through a coherent spatial machine learning workflow:

Spatial autocorrelation is confirmed globally (Moran’s I significant for all five parties) and locally (LISA identifies persistent east–west clusters).
PCA reveals two interpretable dimensions: a liberal–conservative axis (PC1, 55.5% of variance) and a rural-agrarian/low-participation axis (PC2, 19.1%).
PAM and SKATER clustering identify four electoral typologies that differ in their balance between profile homogeneity and geographic contiguity.
Spatial lag variables are engineered from the Queen contiguity graph and incorporated as machine learning features.
Six models are evaluated with LOOCV; the Spatial RF and XGBoost outperform non-spatial and parametric alternatives.
The central empirical finding is a 42% improvement in explained variance (R² from 0.315 to 0.446) when spatial features are added, with residual Moran’s I falling to a non-significant 0.098.
Ridge regression coefficients identify neighbourhood mobilisation (lag_turnout) as the strongest spatial mechanism: being embedded in a high-turnout environment independently raises a constituency’s own participation.

Together, these findings support the conclusion that neighbourhood context — not just local party composition — is a substantive driver of electoral participation in Poland, and that spatial machine learning methods are necessary to capture and quantify this effect.

Beyond the Historical: Spatial Voting Typologies in Poland

Spatial Machine Learning Analysis of the 2023 Parliamentary Election

Zuzanna Herniczek, Weidong Bai