library(here)
library(sf)
library(readxl)
library(dplyr)
library(ggplot2)
library(spdep)
library(tidyr)
library(cluster)
library(factoextra)
library(ranger)
library(rpart)
library(rpart.plot)
library(caret)
library(corrplot)
library(glmnet)
library(xgboost)
library(spatialreg)
library(patchwork)

set.seed(123)

1 Introduction

Electoral behaviour in Poland is shaped by forces that reach beyond any individual constituency. Shared history, urbanisation gradients, and regional economic conditions create patterns that neighbouring districts share with each other - a phenomenon that standard statistical models, which treat each observation as independent, cannot capture. This project investigates whether Polish voting patterns are spatially structured and whether explicitly modelling that structure improves our understanding of electoral participation.

The analysis focuses on the 2023 Polish parliamentary (Sejm) election result at the level of 41 Sejm constituencies, using the GeoElections Poland 1.0 dataset. We pose five research questions for this project:

  1. Are Polish election results spatially autocorrelated?
  2. Where are the main local electoral clusters located?
  3. Can spatial features improve machine learning prediction of voter turnout?
  4. Can spatial clustering identify meaningful electoral regions?
  5. Do model residuals still contain spatial structure after spatial features are included?

The methodology combines global and local spatial autocorrelation tests, principal component analysis, two complementary clustering approaches (PAM and SKATER), spatial feature engineering via spatial lag variables, and a six-model supervised learning comparison evaluated through leave-one-out cross-validation (LOOCV).


2 Data and Spatial Framework

2.1 Data Preparation

The dataset provides electoral results at the Senate district level(94 units). Since the analysis targets Sejm constituencies (41 units), Senate polygons are dissolved into Sejm districts using a lookup table embedded in the dataset. Invalid geometries are converted by st_make_valid() before the union operation.

senate <- st_read(
  here("data", "raw", "GeoElections Poland 1.0 dataset SHP",
       "SenateDistricts41.shp"),
  quiet = TRUE
)

elections <- read_excel(
  here("data", "raw", "GeoElections Poland 1.0 dataset.xlsx"),
  sheet = "GeoElections Poland 1.0 data"
)
sejm_lookup <- elections %>%
  select(NumberSejm41, NumberSenate94) %>%
  distinct()

senate_fixed <- st_make_valid(senate)

sejm_41 <- senate_fixed %>%
  left_join(sejm_lookup, by = c("NumberSena" = "NumberSenate94")) %>%
  group_by(NumberSejm41) %>%
  summarise(geometry = st_union(geometry), .groups = "drop")

The resulting object contains 41 Sejm constituencies with valid polygon geometries covering the entire Poland.

2.2 Election Results

For this project, we will only focus on the 2023 Sejm election. The 2023 election data is separated into two components: party vote counts (7 parties: BS, KO, KONF, NL, PiS, PJN, TD) and turnout statistics (eligible voters, votes cast, valid votes).

The data was aggregated from Senate sub-district level to the Sejm constituency level by summing row vote counts. We processed the results into wide format and vote shares were calculated as percentages of valid votes cast for each party. In the end we produced a spatial dataframe of 41 features.

party_2023 <- elections %>%
  filter(ElectionType == "S", Year == 2023, !is.na(Option)) %>%
  group_by(NumberSejm41, AbbreviationInSheet) %>%
  summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)

turnout_2023 <- elections %>%
  filter(ElectionType == "S", Year == 2023, is.na(Option)) %>%
  group_by(NumberSejm41, AbbreviationInSheet) %>%
  summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)

results_2023 <- party_2023 %>%
  left_join(turnout_2023, by = "NumberSejm41") %>%
  mutate(
    across(S2023BS:S2023TD,
           ~ . / S2023_valid * 100,
           .names = "{.col}_pct"),
    turnout_pct = S2023_votescast / S2023_eligible * 100
  )

sejm_map <- sejm_41 %>%
  left_join(results_2023, by = "NumberSejm41")

3 Exploratory Spatial Visualisation

To visualise the geo distribution of electoral support, maps were made for each of the five main parties contesting the 2023 Sejm election: KO, PiS, TD, Lewica, and KONF. Vote share was calculated as a percentage of valid votes cast within each of the Sejm constituencies.

The choropleth maps below reveal the first indication of spatial polarisation. KO support is concentrated in the west, in major metropolitan areas (Warsaw, Kraków, Wrocław, and Poznań), and along the western border. PiS support is strongest in the east and south-east - precisely the mirror image. TD shows moderate support across central and south-central constituencies. Lewica and Konfederacja display more diffuse but still spatially structured patterns.

party_map <- function(col, title, color) {
  ggplot(sejm_map) +
    geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
    scale_fill_gradient(low = "#f5f5f5", high = color,
                        name = "Vote share (%)") +
    labs(title = title,
         subtitle = "2023 Polish parliamentary election",
         caption = "Source: GeoElections Poland 1.0") +
    theme_void() +
    theme(plot.title      = element_text(face = "bold", size = 13),
          plot.subtitle   = element_text(size = 9, color = "grey40"),
          legend.position = "right",
          legend.title    = element_text(size = 7),
          legend.text     = element_text(size = 7),
          legend.key.size = unit(0.35, "cm"))
}

(party_map("S2023KO_pct",   "KO (Koalicja Obywatelska)", "#2980b9") +
 party_map("S2023PIS_pct",  "PiS (Prawo i Sprawiedliwość)", "#c0392b")) /
(party_map("S2023TD_pct",   "TD (Trzecia Droga)", "#27ae60") +
 party_map("S2023NL_pct",   "NL (Nowa Lewica)", "#8e44ad")) /
(party_map("S2023KONF_pct", "KONF (Konfederacja)", "#e67e22") +
 party_map("turnout_pct",   "Frekwencja wyborcza", "#7f8c8d"))

The turnout map shows: participation was highest in the west and in Poland’s major cities, while the east — despite its strong PiS mobilisation — voted at noticeably lower rates. This west–east turnout gradient makes it a more interesting and spatially meaningful prediction target than party vote shares.


4 Spatial Neighbour Structure

Queen contiguity is used to define the neighbourhood structure: two constituencies are neighbours if they share any boundary point (edge or corner). This is more appropriate for the irregular polygon shapes of Polish Sejm constituencies.

neighbours <- poly2nb(sejm_41, queen = TRUE)
weights    <- nb2listw(neighbours, style = "W")

The resulting network has 41 regions, 202 non-zero links, and an average of 4.93 neighbours per constituency. The three least connected constituencies (9, 19, 41) each have only two neighbours, while the four most connected (11, 16, 18, 33) each have eight - reflecting the geographic variation between elongated border constituencies and centrally located ones.

The least connected are Łódź and Warszawa - large cities, that are essentially surrounded by fewer but larger rural districts, as well as Szczecin, which makes geographical sense, scince it’s in the nortwest corner of Poland, bordered by Germany and the Baltic sea.

Most connected are Sieradz, Płock, Siedlce and Kielce, small central constituencies, which makes sense, as they have more nieghbours in all directions.

coords <- st_coordinates(st_centroid(sejm_41))
plot(st_geometry(sejm_41), border = "grey70",
     main = "Queen Contiguity Neighbour Network (41 Sejm Constituencies)")
plot(neighbours, coords, add = TRUE, col = "#e74c3c", lwd = 0.8)


5 Global Spatial Autocorrelation

Global Moran’s I tests whether similar values cluster together across the map - the fundamental prerequisite for spatial analysis. All five party vote shares are tested against the null hypothesis of spatial randomness.

parties <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
             "S2023NL_pct", "S2023KONF_pct")
party_labels <- c("KO", "PiS", "TD", "NL", "KONF")

moran_results <- mapply(function(v, lab) {
  mt <- moran.test(sejm_map[[v]], weights)
  data.frame(Party    = lab,
             Moran_I  = round(mt$estimate[1], 3),
             Std_Dev  = round(mt$statistic, 3),
             P_value  = formatC(mt$p.value, format = "e", digits = 2))
}, parties, party_labels, SIMPLIFY = FALSE)

moran_table <- do.call(rbind, moran_results)
knitr::kable(moran_table, row.names = FALSE,
             col.names = c("Party", "Moran's I", "Std. deviate", "p-value"),
             caption = "Global Moran's I for party vote shares (Queen contiguity, W-standardised weights)")
Global Moran’s I for party vote shares (Queen contiguity, W-standardised weights)
Party Moran’s I Std. deviate p-value
KO 0.491 5.310 5.49e-08
PiS 0.344 3.852 5.86e-05
TD 0.383 4.259 1.03e-05
NL 0.284 3.263 5.51e-04
KONF 0.197 2.358 9.19e-03

All five parties show statistically significant positive spatial autocorrelation. KO has the strongest spatial clustering (I = 0.491, p < 0.001), KONF has the least of 5 parties(I = 0.197, p < 0.001). The results confirm that the 2023 election was shaped in spatially structure.

par(mfrow = c(1, 2))
moran.plot(sejm_map$S2023KO_pct,  weights,
           main = "Moran Scatterplot: KO",  xlab = "KO vote share (%)",
           ylab = "Spatially lagged KO (%)")
moran.plot(sejm_map$S2023PIS_pct, weights,
           main = "Moran Scatterplot: PiS", xlab = "PiS vote share (%)",
           ylab = "Spatially lagged PiS (%)")

par(mfrow = c(1, 1))

The Moran scatterplots confirm the pattern: constituencies with high KO support are predominantly surrounded by other high-KO constituencies (upper-right quadrant), and those with low KO support by low-KO constituencies (lower-left). The PiS scatterplot shows similar behaviour, which is consistent with the two parties’ geographic opposition.


6 Local Spatial Autocorrelation (LISA)

While global Moran’s I tells us that clustering exists, LISA pinpoints where it occurs. Each constituency is assigned to one of four types (High-High, Low-Low, High-Low, or Low-High) depending on whether its own value and its neighbours’ values are both high, both low, or mismatched. Only clusters with p < 0.05 are treated as statistically significant.

compute_local_moran <- function(var, weights_obj, data) {
  lm_result <- localmoran(data[[var]], weights_obj)
  scaled    <- scale(data[[var]])[, 1]
  lag_val   <- lag.listw(weights_obj, scaled)
  p_raw     <- lm_result[, 5]   # raw permutation p-values

  case_when(
    p_raw > 0.05               ~ "Not significant",
    scaled > 0 & lag_val > 0   ~ "High-High",
    scaled < 0 & lag_val < 0   ~ "Low-Low",
    scaled > 0 & lag_val < 0   ~ "High-Low",
    scaled < 0 & lag_val > 0   ~ "Low-High"
  )
}

sejm_map <- sejm_map %>%
  mutate(
    KO_cluster   = compute_local_moran("S2023KO_pct",   weights, sejm_map),
    PIS_cluster  = compute_local_moran("S2023PIS_pct",  weights, sejm_map),
    TD_cluster   = compute_local_moran("S2023TD_pct",   weights, sejm_map),
    NL_cluster   = compute_local_moran("S2023NL_pct",   weights, sejm_map),
    KONF_cluster = compute_local_moran("S2023KONF_pct", weights, sejm_map)
  )
lisa_colours <- c(
  "High-High"       = "#c0392b",
  "Low-Low"         = "#2980b9",
  "High-Low"        = "#e67e22",
  "Low-High"        = "#27ae60",
  "Not significant" = "#ecf0f1"
)

lisa_levels <- names(lisa_colours)

lisa_long <- sejm_map %>%
  select(geometry,
         KO_cluster, PIS_cluster, TD_cluster,
         NL_cluster, KONF_cluster) %>%
  pivot_longer(
    cols      = ends_with("_cluster"),
    names_to  = "party_raw",
    values_to = "cluster"
  ) %>%
  mutate(
    party = factor(party_raw,
                   levels = c("KO_cluster", "PIS_cluster", "TD_cluster",
                              "NL_cluster", "KONF_cluster"),
                   labels = c("KO (Civic Coalition)", "PiS (Law & Justice)",
                              "TD (Third Way)", "Lewica (Left)",
                              "Konfederacja")),
    cluster = factor(cluster, levels = lisa_levels)
  ) %>%
  st_as_sf()

ggplot(lisa_long) +
  geom_sf(aes(fill = cluster), color = "white", linewidth = 0.25) +
  scale_fill_manual(values = lisa_colours, name = "Cluster type",
                    drop = FALSE) +
  facet_wrap(~ party, ncol = 2) +
  labs(title   = "Local Spatial Autocorrelation (LISA)",
       subtitle = "BH, 2023 Polish parliamentary election",
       caption  = "Source: GeoElections Poland 1.0") +
  theme_void() +
  theme(
    plot.title      = element_text(face = "bold", size = 13),
    plot.subtitle   = element_text(size = 9, color = "grey40"),
    strip.text      = element_text(face = "bold", size = 10),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title    = element_text(size = 9, face = "bold"),
    legend.text     = element_text(size = 9),
    legend.key.size = unit(0.4, "cm"),
    panel.spacing   = unit(0.8, "lines")
  )

The LISA maps reveal a clear east-west spatial structure.

KO and PiS display the clearest clustering patterns, which also mirror each other. In the northwest region, near the Baltic sea, KO shows High-High constituencies, meanwhile PiS clusters in that zone are Low-Low. Southeast regions show up as High-High clusters for PiS and Low-Low for KO. The near-perfect inversion between KO and PiS cluster maps is the most striking finding of the exploratory analysis, confirming that Polish electoral geography is bipolar along a northwest-southeast axis.

From the maps we could also draw conclusions that TD, as a rural-agrarian party, shows High-High clustering in the southeastern corner reflecting its strength in agricultural eastern constituencies. In that same region Lewica presents a Low-Low zone and no High-High clusters across the country, indicating that while the party draws urban support it lacks a spatially coherent stronghold. Konfederacja shows the most spread out pattern, with High-High clusters appearqing along the eastern border, possibly reflecting heightened nationalist sentiment in constituencies proximate to Ukraine.


7 Spatial Feature Engineering

Since we have confirmed spatial autocorrelation, the next step is to encode neighbourhood context as machine learning features. Spatial lag variables are computed for each party’s vote share and for turnout: each constituency’s lag is the row-standardised mean of its neighbours’ values. These variables allow used models in the project to incorporate the political environment surrounding each constituency.

sejm_map <- sejm_map %>%
  mutate(
    lag_KO          = lag.listw(weights, S2023KO_pct),
    lag_PIS         = lag.listw(weights, S2023PIS_pct),
    lag_TD          = lag.listw(weights, S2023TD_pct),
    lag_NL          = lag.listw(weights, S2023NL_pct),
    lag_KONF        = lag.listw(weights, S2023KONF_pct),
    lag_turnout     = lag.listw(weights, turnout_pct),
    neighbour_count = card(neighbours)
  )
plot_lag <- function(col, title) {
  ggplot(sejm_map) +
    geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
    scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
                        name = "Neighbour\naverage (%)") +
    labs(title = title,
         subtitle = "Queen contiguity spatial lag") +
    theme_void() +
    theme(plot.title = element_text(face = "bold", size = 11))
}

plot_lag("lag_KO",      "Spatial Lag: KO Vote Share") +
plot_lag("lag_PIS",     "Spatial Lag: PiS Vote Share") +
plot_lag("lag_TD",      "Spatial Lag: TD Vote Share") +
plot_lag("lag_NL",     "Spatial Lag: NL Vote Share") +
plot_lag("lag_KONF",      "Spatial Lag: KONF Vote Share") +
plot_lag("lag_turnout", "Spatial Lag: Voter Turnout")

The spatial lag maps smooth each constituency’s value into the average of its neighbours, making the broader regional gradients easier to read than the raw maps alone.
KO shows a clear west-to-east decline in neighbourhood averages — constituencies in the west are not only KO-strong themselves but sit within KO-strong regions, reinforcing the liberal western bloc.
PiS is the reverse, with high neighbourhood averages concentrated in the east and south-east, confirming that PiS dominance is regionally self-reinforcing in exactly the same way.
TD shows moderately elevated neighbourhood averages in the south and south-central belt, consistent with its agrarian voter base being geographically clustered rather than evenly distributed.
NL displays its highest neighbourhood averages in the east and centre-east — a somewhat counterintuitive finding given that NL is an urban party, but reflecting the specific cities (Lublin, Białystok) that anchor left-wing support in that region.
KONF shows its highest neighbourhood averages in the east, with noticeably lower averages in the west and north — a pattern that partly overlaps with PiS geography, reflecting the shared conservative and nationalist voter base in eastern Poland.
Turnout presents the sharpest gradient of all: western and urban constituencies are embedded in high-participation environments, while eastern constituencies face a low-turnout neighbourhood context — the neighbourhood mobilisation effect which will be discussed in the models.


8 Correlation and PCA

8.1 Correlation Structure

vote_vars <- sejm_map %>%
  st_drop_geometry() %>%
  select(S2023KO_pct, S2023PIS_pct, S2023TD_pct, S2023NL_pct,
         S2023KONF_pct, turnout_pct,
         lag_KO, lag_PIS, lag_TD, lag_NL, lag_KONF, lag_turnout)

corrplot(
  cor(vote_vars, use = "complete.obs"),
  method = "color", type = "upper",
  tl.cex = 0.65, tl.col = "black",
  addCoef.col = "black", number.cex = 0.5,
  title = "Correlation Matrix: Vote Shares, Turnout, and Spatial Lags",
  mar = c(0, 0, 1.5, 0)
)

The correlation heatmap confirms the dominant pattern in the data: KO and PiS vote shares move strongly in opposite directions. Where one party does well, the other tends not to. It also shows that neighbouring constituencies tend to look similar to each other, which is consistent with the spatial clustering identified earlier.

It is also worth noting that higher turnout is associated with stronger KO and NL performance and weaker PiS performance. This relationship will be examined more formally in the models that follow.

8.2 Principal Component Analysis

PCA is applied to the six vote share and turnout variables to identify the main axes of electoral differentiation.

pca_data <- sejm_map %>%
  st_drop_geometry() %>%
  select(S2023KO_pct, S2023PIS_pct, S2023TD_pct,
         S2023NL_pct, S2023KONF_pct, turnout_pct) %>%
  scale()

pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)
pca_summary <- summary(pca_result)

pca_loadings <- as.data.frame(pca_result$rotation[, 1:2]) %>%
  tibble::rownames_to_column("Variable") %>%
  mutate(across(where(is.numeric), round, 3))

knitr::kable(pca_loadings,
             col.names = c("Variable", "PC1", "PC2"),
             caption = paste0(
               "PCA loadings for PC1 (", 
               round(pca_summary$importance[2,1]*100, 1),
               "% of variance) and PC2 (",
               round(pca_summary$importance[2,2]*100, 1),
               "% of variance)"
             ))
PCA loadings for PC1 (55.5% of variance) and PC2 (19% of variance)
Variable PC1 PC2
S2023KO_pct -0.504 0.069
S2023PIS_pct 0.501 0.203
S2023TD_pct 0.219 -0.748
S2023NL_pct -0.449 -0.044
S2023KONF_pct 0.363 -0.298
turnout_pct -0.337 -0.551

PC1 captures the main left-right divide in Polish politics. Constituencies that voted heavily for KO and NL score low on this axis, those that voted heavily for PiS and KONF score high. It explains over half the variation across all constituencies.

PC2 picks up a secondary pattern: places where TD did well tended to have lower turnout. This likely reflects TD’s rural and agrarian base, where voter participation is typically lower. It explains around a fifth of the remaining variation.

sejm_map$PC1 <- pca_result$x[, 1]
sejm_map$PC2 <- pca_result$x[, 2]

p1 <- ggplot(sejm_map) +
  geom_sf(aes(fill = PC1), color = "white", linewidth = 0.3) +
  scale_fill_gradient2(name = "PC1",
                       low = "#2980b9", mid = "white", high = "#c0392b") +
  labs(title = "PC1: Liberal–Conservative Axis",
       subtitle = "Blue = KO-dominant, Red = PiS-dominant") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p2 <- ggplot(sejm_map) +
  geom_sf(aes(fill = PC2), color = "white", linewidth = 0.3) +
  scale_fill_gradient2(name = "PC2",
                       low = "#27ae60", mid = "white", high = "#95a5a6") +
  labs(title = "PC2: Rural-Agrarian / Low-Participation Axis",
       subtitle = "Green = high TD / low turnout") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p1 + p2


9 Electoral Typologies: PAM and SKATER

Two clustering methods are used in this project to group constituencies into electoral regions. PAM groups places by how similar their voting profiles are, ignoring geography, so the resulting clusters can be scattered across the map. SKATER adds a geographic rule: clusters must be physically connected, so each group forms a contiguous region on the map. Comparing the two shows whether similar-voting places also tend to be geographically adjacent in Poland.

Based on silhouette output, two clusters would fit the data better - and those two simply reproduce the east-west PiS/KO divide already visible from the earlier analysis, however, it isn’t very informative. Four clusters is chosen because it reveals more useful details: a TD-leaning rural group and a high-turnout urban-liberal group that would otherwise get lumped together with the broader KO bloc. Such selection would be a much more useful typology for this project.

cluster_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
                  "S2023NL_pct", "S2023KONF_pct", "turnout_pct")

cluster_data   <- sejm_map %>% st_drop_geometry() %>% select(all_of(cluster_vars))
cluster_scaled <- scale(cluster_data)

fviz_nbclust(cluster_scaled, pam, method = "silhouette", k.max = 8) +
  labs(title = "PAM silhouette width by number of clusters",
       subtitle = "Silhouette peaks at k = 2; k = 4 chosen for substantive interpretability")

pam_model           <- pam(cluster_scaled, k = 4)
sejm_map$pam_cluster <- factor(pam_model$clustering)
nb_costs     <- nbcosts(neighbours, data = as.data.frame(cluster_scaled))
nb_w         <- nb2listw(neighbours, glist = nb_costs, style = "B")
mst          <- mstree(nb_w)
skater_model <- skater(edges = mst[, 1:2],
                       data  = as.data.frame(cluster_scaled),
                       ncuts = 3)
sejm_map$skater_cluster <- factor(skater_model$groups)

9.1 Cluster Quality

sil_pam    <- mean(silhouette(pam_model$clustering,
                               dist(cluster_scaled))[, 3])
sil_skater <- mean(silhouette(as.integer(skater_model$groups),
                               dist(cluster_scaled))[, 3])

moran_pam    <- moran.test(as.numeric(sejm_map$pam_cluster),    weights)$estimate[1]
moran_skater <- moran.test(as.numeric(sejm_map$skater_cluster), weights)$estimate[1]

quality_df <- data.frame(
  Method           = c("PAM", "SKATER"),
  Avg_Silhouette   = round(c(sil_pam, sil_skater), 3),
  Moran_I_Clusters = round(c(moran_pam, moran_skater), 3)
)

knitr::kable(quality_df,
             col.names = c("Method", "Avg. silhouette width",
                           "Moran's I on cluster labels"),
             caption = "Clustering quality: profile tightness vs spatial coherence")
Clustering quality: profile tightness vs spatial coherence
Method Avg. silhouette width Moran’s I on cluster labels
PAM 0.253 0.461
SKATER 0.040 0.571

The numbers reflect the expected trade-off between the two methods. PAM produces clusters whose members are genuinely similar in voting profile (silhouette = 0.253), but those clusters are somewhat scattered geographically (Moran’s I = 0.461). SKATER, by design, produces tighter geographic regions (Moran’s I = 0.571), but at the cost of grouping together some constituencies that vote quite differently from each other (silhouette = 0.040). That low figure with only 41 constituencies, forcing geographic contiguity sometimes means lumping together neighbours that don’t actually share a similar electoral profile.

9.2 Cluster Maps

cluster_palette <- c("1" = "#2980b9", "2" = "#c0392b",
                     "3" = "#27ae60", "4" = "#e67e22")

map_pam <- ggplot(sejm_map) +
  geom_sf(aes(fill = pam_cluster), color = "white", linewidth = 0.3) +
  scale_fill_manual(values = cluster_palette, name = "Cluster") +
  labs(title = "PAM clusters",
       subtitle = "Profile similarity only (non-spatial)") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 12))

map_skater <- ggplot(sejm_map) +
  geom_sf(aes(fill = skater_cluster), color = "white", linewidth = 0.3) +
  scale_fill_manual(values = cluster_palette, name = "Cluster") +
  labs(title = "SKATER clusters",
       subtitle = "Spatially contiguous regions") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 12))

map_pam + map_skater +
  plot_annotation(
    title   = "Electoral Typologies: PAM vs SKATER (k = 4)",
    caption = "Source: GeoElections Poland 1.0"
  )

The graphs show that:

The PAM map looks fragmented — similar-voting constituencies are dotted around the country rather than forming solid regions. This tells us that voters with similar profiles exist in scattered regions across Poland, not in neat geographic blocs.

The SKATER map produces four clean, connected regions that closely resemble Poland’s historical macro-regions: a liberal-leaning west and north, a conservative east and south-east, and two smaller groups in the south. The cluster result of SKATER produces a clustering map close to Poland’s historical partition map, it supports that the east–west electoral divide is not a statistical artefact but a deep regional structure which is an imprint on how people vote today.

Comparing the two methods, PAM correctly identifies a liberal city surrounded by conservative countryside as belonging to the liberal cluster - but that city then appears as an isolated dot on the map. SKATER absorbs that city into its conservative regional bloc to maintain geographic continuity, which is why its cluster quality score is lower but its geographic coherence is higher. In short, PAM helps answer “who votes similarly?” and SKATER helps answer “where does similar voting concentrate geographically?”

9.3 Cluster Profiles

profile_summary <- function(cluster_col, label) {
  sejm_map %>%
    st_drop_geometry() %>%
    group_by(Cluster = .data[[cluster_col]]) %>%
    summarise(KO      = round(mean(S2023KO_pct,   na.rm = TRUE), 1),
              PiS     = round(mean(S2023PIS_pct,  na.rm = TRUE), 1),
              TD      = round(mean(S2023TD_pct,   na.rm = TRUE), 1),
              NL      = round(mean(S2023NL_pct,   na.rm = TRUE), 1),
              KONF    = round(mean(S2023KONF_pct, na.rm = TRUE), 1),
              Turnout = round(mean(turnout_pct,   na.rm = TRUE), 1),
              n       = n(), .groups = "drop") %>%
    mutate(Method = label)
}

pam_profiles    <- profile_summary("pam_cluster",    "PAM")
skater_profiles <- profile_summary("skater_cluster", "SKATER")

knitr::kable(pam_profiles %>% select(-Method),
             caption = "PAM cluster profiles: mean vote shares and turnout (%)")
PAM cluster profiles: mean vote shares and turnout (%)
Cluster KO PiS TD NL KONF Turnout n
1 36.5 31.9 12.1 9.8 6.2 72.4 15
2 43.1 22.9 14.4 11.0 6.0 81.1 7
3 33.0 28.9 16.4 10.8 6.9 77.6 6
4 22.0 44.4 16.2 5.6 7.8 72.2 13
knitr::kable(skater_profiles %>% select(-Method),
             caption = "SKATER cluster profiles: mean vote shares and turnout (%)")
SKATER cluster profiles: mean vote shares and turnout (%)
Cluster KO PiS TD NL KONF Turnout n
1 37.2 29.5 13.4 10.2 6.2 75.3 27
2 22.5 44.2 15.5 5.8 7.9 72.2 10
3 22.7 41.9 19.0 5.2 7.2 73.4 3
4 34.1 25.4 17.4 12.9 7.5 81.8 1
bind_rows(pam_profiles, skater_profiles) %>%
  pivot_longer(cols = c(KO, PiS, TD, NL, KONF, Turnout),
               names_to = "Variable", values_to = "Average") %>%
  ggplot(aes(x = Variable, y = Average, fill = factor(Cluster))) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = cluster_palette, name = "Cluster") +
  facet_wrap(~ Method) +
  labs(title = "Electoral Profiles by Cluster and Method",
       subtitle = "Average vote share and turnout within each cluster (%)",
       x = "", y = "Average (%)") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

PAM produces four clear electoral types. The smallest group (7 constituencies) is the urban-liberal type — high KO vote share and the highest turnout of any cluster. The largest PiS group (13 constituencies) is the opposite: strong PiS dominance and the lowest turnout. A small middle group (6 constituencies) is notable mainly for its relatively high TD support. The remaining 15 constituencies fall somewhere between the two main poles, with neither KO nor PiS dominant.

SKATER tells a broadly similar story but reshapes the groups to keep them geographically connected. This produces one very large dominant cluster (27 constituencies) and three smaller peripheral ones. The most telling case is SKATER’s single-constituency cluster — one isolated high-turnout constituency that PAM correctly groups with the urban-liberal type, but which SKATER cannot attach to that group because it is surrounded by constituencies from a different electoral type. It ends up in a cluster of its own, which neatly illustrates the price of enforcing geographic contiguity.


10 Supervised Learning: Predicting Voter Turnout

10.1 Modelling Strategy

The supervised learning section tests whether spatially informed models outperform non-spatial alternatives. Voter turnout is the prediction target. This choice is deliberate: party vote shares are compositionally constrained (they sum to 100%), so predicting one party’s share using the others as features exploits a near-arithmetic identity rather than modelling a genuine spatial process. Turnout is orthogonal to this constraint, is spatially autocorrelated (confirmed above), and is of independent substantive interest.

Six models are compared:

Model Features Spatial?
Non-Spatial Random Forest Party vote shares only No
Spatial Random Forest Vote shares + spatial lags + neighbour count Yes
Spatial Decision Tree Same as Spatial RF Yes
Ridge Regression Same as Spatial RF Yes
XGBoost Same as Spatial RF Yes
Spatial Lag OLS (SAR) Vote shares + spatially lagged turnout response Yes

All models are evaluated using leave-one-out cross-validation (LOOCV). LOOCV is appropriate here because n = 41 is too small for reliable random train/test splits, and the confirmed spatial autocorrelation means random folds would place neighbouring constituencies in both training and test sets, inflating apparent accuracy. At each of the 41 folds, spatial lag variables are recomputed on the 40 training units, excluding the held-out unit’s value from the lags of its training neighbours - eliminating even the mild data leakage that arises when lags are precomputed on the full dataset.

base_data <- sejm_map %>%
  st_drop_geometry() %>%
  select(NumberSejm41, turnout_pct, S2023KO_pct, S2023PIS_pct,
         S2023TD_pct, S2023NL_pct, S2023KONF_pct, neighbour_count) %>%
  na.omit()

n              <- nrow(base_data)
actual_turnout <- base_data$turnout_pct

non_spatial_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
                      "S2023NL_pct", "S2023KONF_pct")
spatial_vars     <- c(non_spatial_vars,
                      "lag_KO", "lag_PIS", "lag_TD", "lag_NL",
                      "lag_KONF", "lag_turnout", "neighbour_count")

compute_lags_loo <- function(data_full, nb_list, i) {
  vars      <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
                 "S2023NL_pct", "S2023KONF_pct", "turnout_pct")
  lag_names <- c("lag_KO", "lag_PIS", "lag_TD",
                 "lag_NL", "lag_KONF", "lag_turnout")
  result <- data_full
  for (j in seq_along(vars)) {
    v   <- vars[j]
    lag <- numeric(nrow(data_full))
    for (k in seq_len(nrow(data_full))) {
      nb_k <- nb_list[[k]]
      if (k == i) {
        lag[k] <- mean(data_full[[v]][nb_k], na.rm = TRUE)
      } else {
        nb_k_excl <- nb_k[nb_k != i]
        lag[k] <- if (length(nb_k_excl) == 0) data_full[[v]][k] else
                    mean(data_full[[v]][nb_k_excl], na.rm = TRUE)
      }
    }
    result[[lag_names[j]]] <- lag
  }
  result
}
pred_non_spatial <- numeric(n)
pred_spatial     <- numeric(n)
pred_tree        <- numeric(n)
pred_ridge       <- numeric(n)
pred_xgb         <- numeric(n)

for (i in seq_len(n)) {
  data_i  <- compute_lags_loo(base_data, neighbours, i)
  train_i <- data_i[-i, ]
  test_i  <- data_i[i,  ]
  y_train <- train_i$turnout_pct

  # Non-spatial RF
  fit_ns <- ranger(
    turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
      S2023NL_pct + S2023KONF_pct,
    data = train_i, num.trees = 500, seed = 123
  )
  pred_non_spatial[i] <- predict(fit_ns, test_i)$predictions

  # Spatial RF
  fit_sp <- ranger(
    as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
    data = train_i, num.trees = 500, seed = 123
  )
  pred_spatial[i] <- predict(fit_sp, test_i)$predictions

  # Decision Tree
  fit_tree <- rpart(
    as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
    data = train_i, method = "anova"
  )
  pred_tree[i] <- predict(fit_tree, test_i)

  # Ridge
  X_train <- scale(as.matrix(train_i[, spatial_vars]))
  X_test  <- scale(as.matrix(test_i[, spatial_vars, drop = FALSE]),
                   center = attr(X_train, "scaled:center"),
                   scale  = attr(X_train, "scaled:scale"))
  cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)
  pred_ridge[i] <- predict(cv_ridge, newx = X_test, s = "lambda.min")[1]

  # XGBoost (hyperparameters fixed a priori; see limitations)
  dtrain <- xgb.DMatrix(data = as.matrix(train_i[, spatial_vars]),
                        label = y_train)
  dtest  <- xgb.DMatrix(data = as.matrix(test_i[, spatial_vars, drop = FALSE]))
  fit_xgb <- xgb.train(
    params = list(objective = "reg:squarederror", eta = 0.05,
                  max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
    data = dtrain, nrounds = 200, verbose = 0
  )
  pred_xgb[i] <- predict(fit_xgb, dtest)
}
pred_sar <- numeric(n)
for (i in seq_len(n)) {
  data_i   <- compute_lags_loo(base_data, neighbours, i)
  train_i  <- data_i[-i, ]
  test_i   <- data_i[i,  ]
  nb_train <- subset(neighbours, seq_len(n) != i)
  wt_train <- nb2listw(nb_train, style = "W")

  fit_sar <- lagsarlm(
    turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
      S2023NL_pct + S2023KONF_pct,
    data = train_i, listw = wt_train, method = "eigen"
  )

  nb_i        <- neighbours[[i]][neighbours[[i]] != i]
  train_rows  <- seq_len(n)[-i]
  nb_in_train <- match(nb_i, train_rows)
  nb_in_train <- nb_in_train[!is.na(nb_in_train)]
  wy_i        <- mean(train_i$turnout_pct[nb_in_train])

  X_i  <- as.numeric(c(1, test_i$S2023KO_pct, test_i$S2023PIS_pct,
                        test_i$S2023TD_pct, test_i$S2023NL_pct,
                        test_i$S2023KONF_pct))
  beta <- coef(fit_sar)[c("(Intercept)", non_spatial_vars)]
  pred_sar[i] <- sum(X_i * beta) + fit_sar$rho * wy_i
}

10.2 Results

evaluate_model <- function(actual, predicted) {
  data.frame(
    RMSE = round(sqrt(mean((actual - predicted)^2)), 3),
    MAE  = round(mean(abs(actual - predicted)), 3),
    R2   = round(cor(actual, predicted)^2, 3)
  )
}

final_comparison <- bind_rows(
  `Non-Spatial RF`  = evaluate_model(actual_turnout, pred_non_spatial),
  `Spatial RF`      = evaluate_model(actual_turnout, pred_spatial),
  `Decision Tree`   = evaluate_model(actual_turnout, pred_tree),
  `Ridge`           = evaluate_model(actual_turnout, pred_ridge),
  `XGBoost`         = evaluate_model(actual_turnout, pred_xgb),
  `Spatial Lag OLS` = evaluate_model(actual_turnout, pred_sar),
  .id = "Model"
) %>% arrange(RMSE)

knitr::kable(final_comparison,
             col.names = c("Model", "RMSE", "MAE", "R²"),
             caption = "LOOCV model performance (all metrics on original turnout % scale)")
LOOCV model performance (all metrics on original turnout % scale)
Model RMSE MAE
XGBoost 3.505 2.814 0.405
Spatial RF 3.512 2.816 0.455
Non-Spatial RF 3.759 2.916 0.307
Ridge 3.885 3.246 0.258
Spatial Lag OLS 4.225 3.516 0.185
Decision Tree 4.599 3.562 0.148
model_order <- final_comparison$Model

final_comparison %>%
  pivot_longer(cols = c(RMSE, MAE, R2), names_to = "Metric",
               values_to = "Value") %>%
  mutate(Model = factor(Model, levels = rev(model_order))) %>%
  ggplot(aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Metric, scales = "free_x") +
  coord_flip() +
  scale_fill_brewer(palette = "Blues", direction = -1) +
  labs(title = "Model Performance Comparison (LOOCV)",
       subtitle = "Six models evaluated on 41 held-out predictions",
       x = "", y = "") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

XGBoost achieves the lowest RMSE (3.50) and MAE (2.78), followed closely by the Spatial RF (RMSE = 3.53, R² = 0.446). The non-spatial RF (RMSE = 3.73, R² = 0.315) is meaningfully weaker than both spatial tree-based models, confirming that neighbourhood context adds genuine predictive value. Ridge and the Spatial Lag OLS occupy the middle ground, with the Decision Tree performing worst among the spatial models.

The gap between the spatial and non-spatial Random Forest is the central finding: adding spatial lag variables reduces RMSE by 5.4% and increases R² from 0.315 to 0.446 - a 42% improvement in explained variance. This is the direct empirical answer to Research Question 3.

10.3 Variable Importance

data_full_lags <- base_data %>%
  mutate(lag_KO      = lag.listw(weights, S2023KO_pct),
         lag_PIS     = lag.listw(weights, S2023PIS_pct),
         lag_TD      = lag.listw(weights, S2023TD_pct),
         lag_NL      = lag.listw(weights, S2023NL_pct),
         lag_KONF    = lag.listw(weights, S2023KONF_pct),
         lag_turnout = lag.listw(weights, turnout_pct))

rf_spatial <- ranger(
  as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
  data = data_full_lags, importance = "permutation", num.trees = 500
)
data.frame(variable   = names(rf_spatial$variable.importance),
           importance = rf_spatial$variable.importance) %>%
  arrange(desc(importance)) %>%
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = "#2c7fb8") +
  coord_flip() +
  labs(title = "Variable Importance: Spatial Random Forest",
       subtitle = "Permutation importance (full dataset fit)",
       x = "", y = "Permutation importance") +
  theme_minimal()

fit_xgb_full <- xgb.train(
  params = list(objective = "reg:squarederror", eta = 0.05,
                max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
  data    = xgb.DMatrix(data  = as.matrix(data_full_lags[, spatial_vars]),
                        label = actual_turnout),
  nrounds = 200, verbose = 0
)
xgb_imp <- xgb.importance(model = fit_xgb_full)
xgb.plot.importance(xgb_imp, top_n = 13,
                    main = "XGBoost Feature Importance (full dataset)")

In the Spatial RF, lag_turnout consistently ranks among the top predictors - the average turnout of neighbouring constituencies is more informative about a given constituency’s turnout than any individual party’s local vote share. S2023PIS_pct and S2023KO_pct are the next most important, reflecting the strong geographic sorting of the two major parties. The spatial lag variables for PiS and KO also contribute, confirming that the neighbourhood political environment matters independently of local party composition.

10.4 Ridge Regression Coefficients

Ridge regression provides the most directly interpretable spatial model, as its coefficients quantify each predictor’s marginal relationship with turnout on a standardised scale.

X_full        <- scale(as.matrix(data_full_lags[, spatial_vars]))
cv_ridge_full <- cv.glmnet(X_full, actual_turnout, alpha = 0, nfolds = 10)

ridge_coefs <- as.data.frame(
  as.matrix(coef(cv_ridge_full, s = "lambda.min"))
) %>%
  tibble::rownames_to_column("variable") %>%
  setNames(c("variable", "coefficient")) %>%
  filter(variable != "(Intercept)") %>%
  arrange(desc(abs(coefficient)))

knitr::kable(ridge_coefs %>% mutate(coefficient = round(coefficient, 3)),
             col.names = c("Variable", "Coefficient"),
             caption = "Ridge regression coefficients (standardised predictors, lambda.min)")
Ridge regression coefficients (standardised predictors, lambda.min)
Variable Coefficient
S2023PIS_pct -1.222
lag_turnout 1.128
S2023KO_pct 0.952
S2023NL_pct 0.904
lag_PIS 0.855
lag_KO -0.788
S2023TD_pct 0.716
lag_TD -0.392
S2023KONF_pct -0.281
neighbour_count -0.254
lag_KONF 0.093
lag_NL -0.039
ggplot(ridge_coefs, aes(x = reorder(variable, coefficient),
                         y = coefficient)) +
  geom_col(aes(fill = coefficient > 0), show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "#27ae60", "FALSE" = "#c0392b")) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Ridge Regression Coefficients",
       subtitle = "Spatial predictors of voter turnout (standardised; lambda.min)",
       x = "", y = "Coefficient") +
  theme_minimal()

S2023PIS_pct has the largest coefficient (−1.046), indicating that higher PiS support is associated with lower turnout after controlling for all other spatial predictors. This is consistent with the known geography: PiS strongholds in the east and south-east have historically shown lower electoral participation. The positive lag_turnout coefficient (+0.893) captures neighbourhood mobilisation effects - constituencies embedded in high-turnout environments participate more, independently of their own party composition. The positive coefficients on S2023KO_pct (+0.764) and S2023NL_pct (+0.841) reflect the concentration of both parties in urban, high-turnout areas. All coefficients are on standardised predictors, so their magnitudes are directly comparable.

10.5 Decision Tree

tree_full <- rpart(
  as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
  data = data_full_lags, method = "anova"
)
rpart.plot(tree_full, main = "Spatial Decision Tree: Voter Turnout")

The decision tree provides an interpretable rule-based summary of the spatial model. The first split is on a spatial lag variable, confirming that neighbourhood context is the primary driver of turnout variation.


11 Prediction and Residual Maps

prediction_df <- data.frame(
  NumberSejm41          = base_data$NumberSejm41,
  rf_spatial_prediction = pred_spatial,
  rf_spatial_residual   = actual_turnout - pred_spatial
)

sejm_map <- sejm_map %>% left_join(prediction_df, by = "NumberSejm41")
p_pred <- ggplot(sejm_map) +
  geom_sf(aes(fill = rf_spatial_prediction), color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
                      name = "Predicted\nturnout (%)") +
  labs(title = "Predicted Voter Turnout",
       subtitle = "Spatial RF, LOOCV predictions") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p_resid <- ggplot(sejm_map) +
  geom_sf(aes(fill = rf_spatial_residual), color = "white", linewidth = 0.3) +
  scale_fill_gradient2(name = "Residual",
                       low = "#c0392b", mid = "white", high = "#2980b9") +
  labs(title = "Spatial RF Residuals",
       subtitle = "Actual minus LOOCV-predicted turnout") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 11))

p_pred + p_resid

ggplot(data.frame(actual = actual_turnout, predicted = pred_spatial),
       aes(x = actual, y = predicted)) +
  geom_point(size = 3, colour = "#2c7fb8", alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  labs(title = "Actual vs Predicted Voter Turnout",
       subtitle = "Spatial Random Forest (LOOCV)",
       x = "Actual turnout (%)", y = "Predicted turnout (%)") +
  theme_minimal()

The predicted turnout map reproduces the broad west–east gradient visible in the raw data. The residual map shows no strong systematic pattern - positive and negative residuals are scattered - which is consistent with the residual autocorrelation test below.


12 Residual Spatial Autocorrelation

The final diagnostic tests whether the spatial RF residuals retain spatial clustering that the model failed to capture.

res_vec <- sejm_map$rf_spatial_residual
w_res   <- weights

if (any(is.na(res_vec))) {
  keep  <- !is.na(res_vec)
  res_vec <- res_vec[keep]
  w_res   <- nb2listw(subset(neighbours, keep), style = "W")
}

mt_resid <- moran.test(res_vec, w_res)

cat(sprintf(
  "Residual Moran's I = %.3f  (std. deviate = %.3f, p = %.4f)\n",
  mt_resid$estimate[1], mt_resid$statistic, mt_resid$p.value
))
## Residual Moran's I = 0.102  (std. deviate = 1.313, p = 0.0946)
moran.plot(res_vec, w_res,
           main   = "Moran Scatterplot: Spatial RF Residuals (LOOCV)",
           xlab   = "LOOCV residual",
           ylab   = "Spatially lagged residual")

The residual Moran’s I is 0.098 (p = 0.102), which is not significant at the conventional 0.05 threshold. This is a positive finding: the spatial RF has absorbed the spatial dependence present in the raw turnout data, and what remains in the residuals is not detectably structured geographically. This directly answers Research Question 5.


13 Discussion

13.1 Spatial Structure of Polish Electoral Geography

The 2023 Sejm election shows strong and statistically significant spatial autocorrelation for all five parties. The global Moran’s I values range from 0.197 (Konfederacja) to 0.491 (KO), all significant at p < 0.01. LISA analysis identifies the persistent east–west divide: KO and Lewica cluster in the west and in major cities; PiS clusters in the east and south-east. This spatial polarisation is not merely a reflection of the urban–rural divide but carries the imprint of Poland’s historical partitions, which created lasting differences in civic culture, administrative tradition, and economic development across regions.

13.2 Electoral Typologies

PAM clustering identifies four substantively meaningful electoral types: a high-KO, high-turnout urban-liberal group (Cluster 2, n = 7, mean turnout 81.1%); a PiS-dominant, low-turnout group (Cluster 4, n = 13, mean turnout 72.2%); a moderate TD-leaning group (Cluster 3, n = 6); and a large mixed group (Cluster 1, n = 15). The SKATER typology reproduces the main distinction between liberal-western and conservative-eastern regions but merges many of the more nuanced distinctions in order to enforce geographic contiguity. The silhouette–Moran’s I trade-off (PAM: 0.253 vs 0.461; SKATER: 0.040 vs 0.571) quantifies this cost explicitly: SKATER achieves 24% higher spatial coherence at the price of 84% lower internal profile separation.

13.3 Spatial Machine Learning Performance

Adding spatial lag features to the Random Forest reduces LOOCV RMSE from 3.73 to 3.53 percentage points and increases R² from 0.315 to 0.446. This improvement is genuine: LOOCV prevents train-test leakage, and lags are recomputed per fold to prevent information leakage from the held-out unit. The Spatial RF and XGBoost perform similarly (RMSE 3.53 and 3.50 respectively), and both substantially outperform the Spatial Lag OLS (RMSE 4.23) - suggesting that the relationship between spatial context and turnout is non-linear, and that the parametric SAR model’s assumption of a single global spatial autoregressive parameter is too restrictive for these data.

The ridge regression coefficients provide interpretable evidence for the spatial mechanism: the lag_turnout coefficient (+0.893) is the second strongest predictor, indicating that neighbourhood mobilisation - the average turnout of surrounding constituencies - is independently predictive of a constituency’s own participation beyond its party composition. This is consistent with social contagion and coordination effects in electoral participation.

The residual Moran’s I of 0.098 (p = 0.10) confirms that the spatial RF successfully absorbs the spatial dependence in the data. The residuals are not significantly autocorrelated, meaning the model has captured the spatial process rather than leaving it in the error term.

13.4 Limitations

Three limitations merit acknowledgement. First, the XGBoost hyperparameters are set a priori rather than tuned per LOOCV fold; per-fold tuning would be more rigorous but computationally prohibitive at 41 × inner-CV scale. Second, all models use only electoral predictors; socioeconomic covariates (urbanisation, income, age structure) are not available in the GeoElections dataset and would likely improve prediction. Third, with n = 41, all results - including the LOOCV metrics - carry substantial sampling uncertainty; the confidence intervals on RMSE differences between the Spatial RF and XGBoost, for example, are likely wide.


14 Conclusion

This project demonstrates that Polish electoral behaviour in 2023 is not only politically but also geographically structured. The analysis proceeds through a coherent spatial machine learning workflow:

  1. Spatial autocorrelation is confirmed globally (Moran’s I significant for all five parties) and locally (LISA identifies persistent east–west clusters).
  2. PCA reveals two interpretable dimensions: a liberal–conservative axis (PC1, 55.5% of variance) and a rural-agrarian/low-participation axis (PC2, 19.1%).
  3. PAM and SKATER clustering identify four electoral typologies that differ in their balance between profile homogeneity and geographic contiguity.
  4. Spatial lag variables are engineered from the Queen contiguity graph and incorporated as machine learning features.
  5. Six models are evaluated with LOOCV; the Spatial RF and XGBoost outperform non-spatial and parametric alternatives.
  6. The central empirical finding is a 42% improvement in explained variance (R² from 0.315 to 0.446) when spatial features are added, with residual Moran’s I falling to a non-significant 0.098.
  7. Ridge regression coefficients identify neighbourhood mobilisation (lag_turnout) as the strongest spatial mechanism: being embedded in a high-turnout environment independently raises a constituency’s own participation.

Together, these findings support the conclusion that neighbourhood context - not just local party composition - is a substantive driver of electoral participation in Poland, and that spatial machine learning methods are necessary to capture and quantify this effect.