library(here)
library(sf)
library(readxl)
library(dplyr)
library(ggplot2)
library(spdep)
library(tidyr)
library(cluster)
library(factoextra)
library(ranger)
library(rpart)
library(rpart.plot)
library(caret)
library(corrplot)
library(glmnet)
library(xgboost)
library(spatialreg)
library(patchwork)
set.seed(123)Electoral behaviour in Poland is shaped by forces that reach beyond any individual constituency. Shared history, urbanisation gradients, and regional economic conditions create patterns that neighbouring districts share with each other — a phenomenon that standard statistical models, which treat each observation as independent, cannot capture. This project investigates whether Polish voting patterns are spatially structured and whether explicitly modelling that structure improves our understanding of electoral participation.
The analysis focuses on the 2023 Polish parliamentary (Sejm) election at the level of 41 Sejm constituencies, using the GeoElections Poland 1.0 dataset (Śleszyński & Kowalski). Five research questions guide the work:
The methodology combines global and local spatial autocorrelation tests, principal component analysis, two complementary clustering approaches (PAM and SKATER), spatial feature engineering via spatial lag variables, and a six-model supervised learning comparison evaluated through leave-one-out cross-validation (LOOCV).
The dataset provides electoral results at the Senate district (94
units) level. Since the analysis targets Sejm constituencies (41 units),
Senate polygons are dissolved into Sejm districts using a lookup table
embedded in the dataset. Invalid geometries are repaired with
st_make_valid() before the union operation.
senate <- st_read(
here("data", "raw", "GeoElections Poland 1.0 dataset SHP",
"SenateDistricts41.shp"),
quiet = TRUE
)
elections <- read_excel(
here("data", "raw", "GeoElections Poland 1.0 dataset.xlsx"),
sheet = "GeoElections Poland 1.0 data"
)sejm_lookup <- elections %>%
select(NumberSejm41, NumberSenate94) %>%
distinct()
senate_fixed <- st_make_valid(senate)
sejm_41 <- senate_fixed %>%
left_join(sejm_lookup, by = c("NumberSena" = "NumberSenate94")) %>%
group_by(NumberSejm41) %>%
summarise(geometry = st_union(geometry), .groups = "drop")The resulting object contains 41 Sejm constituencies with valid polygon geometries covering the entirety of Poland.
Party vote shares are computed as a percentage of valid votes cast. Voter turnout is calculated as votes cast divided by eligible voters. Five parties with nationwide representation are retained: KO (Civic Coalition), PiS (Law and Justice), TD (Third Way), Lewica (Left), and Konfederacja.
party_2023 <- elections %>%
filter(ElectionType == "S", Year == 2023, !is.na(Option)) %>%
group_by(NumberSejm41, AbbreviationInSheet) %>%
summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)
turnout_2023 <- elections %>%
filter(ElectionType == "S", Year == 2023, is.na(Option)) %>%
group_by(NumberSejm41, AbbreviationInSheet) %>%
summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)
results_2023 <- party_2023 %>%
left_join(turnout_2023, by = "NumberSejm41") %>%
mutate(
across(S2023BS:S2023TD,
~ . / S2023_valid * 100,
.names = "{.col}_pct"),
turnout_pct = S2023_votescast / S2023_eligible * 100
)
sejm_map <- sejm_41 %>%
left_join(results_2023, by = "NumberSejm41")The choropleth maps below reveal the first indication of spatial polarisation. KO support is concentrated in the west, in major metropolitan areas (Warsaw, Kraków, Wrocław, and Poznań), and along the western border. PiS support is strongest in the east and south-east — precisely the mirror image. TD shows moderate support across central and south-central constituencies. Lewica and Konfederacja display more diffuse but still spatially structured patterns.
party_map <- function(col, title, color) {
ggplot(sejm_map) +
geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#f5f5f5", high = color,
name = "Vote share (%)") +
labs(title = title,
subtitle = "2023 Polish parliamentary election",
caption = "Source: GeoElections Poland 1.0") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 9, color = "grey40"),
legend.position = "right",
legend.title = element_text(size = 7),
legend.text = element_text(size = 7),
legend.key.size = unit(0.35, "cm"))
}
(party_map("S2023KO_pct", "KO (Civic Coalition)", "#2980b9") +
party_map("S2023PIS_pct", "PiS (Law & Justice)", "#c0392b")) /
(party_map("S2023TD_pct", "TD (Third Way)", "#27ae60") +
party_map("S2023NL_pct", "Lewica (Left)", "#8e44ad")) /
(party_map("S2023KONF_pct", "Konfederacja", "#e67e22") +
party_map("turnout_pct", "Voter Turnout", "#7f8c8d"))The turnout map is particularly telling: high-turnout constituencies cluster around the western corridor and major cities, while the east shows systematically lower participation — a pattern that motivates the use of turnout as the prediction target in the supervised learning section.
Queen contiguity is used to define the neighbourhood structure: two constituencies are neighbours if they share any boundary point (edge or corner). This is the most inclusive contiguity definition and is appropriate for the irregular polygon shapes of Polish Sejm constituencies.
The resulting network has 41 regions, 202 non-zero links, and an average of 4.93 neighbours per constituency. The three least connected constituencies (9, 19, 41) each have only two neighbours, while the four most connected (11, 16, 18, 33) each have eight — reflecting the geographic variation between elongated border constituencies and centrally located ones.
coords <- st_coordinates(st_centroid(sejm_41))
plot(st_geometry(sejm_41), border = "grey70",
main = "Queen Contiguity Neighbour Network (41 Sejm Constituencies)")
plot(neighbours, coords, add = TRUE, col = "#e74c3c", lwd = 0.8)Global Moran’s I tests whether similar values cluster together across the map — the fundamental prerequisite for spatial analysis. All five party vote shares are tested against the null hypothesis of spatial randomness.
parties <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct")
party_labels <- c("KO", "PiS", "TD", "Lewica", "Konfederacja")
moran_results <- mapply(function(v, lab) {
mt <- moran.test(sejm_map[[v]], weights)
data.frame(Party = lab,
Moran_I = round(mt$estimate[1], 3),
Std_Dev = round(mt$statistic, 3),
P_value = formatC(mt$p.value, format = "e", digits = 2))
}, parties, party_labels, SIMPLIFY = FALSE)
moran_table <- do.call(rbind, moran_results)
knitr::kable(moran_table, row.names = FALSE,
col.names = c("Party", "Moran's I", "Std. deviate", "p-value"),
caption = "Global Moran's I for party vote shares (Queen contiguity, W-standardised weights)")| Party | Moran’s I | Std. deviate | p-value |
|---|---|---|---|
| KO | 0.491 | 5.310 | 5.49e-08 |
| PiS | 0.344 | 3.852 | 5.86e-05 |
| TD | 0.383 | 4.259 | 1.03e-05 |
| Lewica | 0.284 | 3.263 | 5.51e-04 |
| Konfederacja | 0.197 | 2.358 | 9.19e-03 |
All five parties show statistically significant positive spatial autocorrelation. KO has the strongest spatial clustering (I = 0.491, p < 0.001), followed by TD (I = 0.383) and PiS (I = 0.344). Even Konfederacja, with the weakest autocorrelation among the five, is significant (I = 0.197, p = 0.009). This uniformly positive result across ideologically diverse parties confirms that the 2023 election was shaped by spatially structured forces — the justification for all subsequent spatial methods.
par(mfrow = c(1, 2))
moran.plot(sejm_map$S2023KO_pct, weights,
main = "Moran Scatterplot: KO", xlab = "KO vote share (%)",
ylab = "Spatially lagged KO (%)")
moran.plot(sejm_map$S2023PIS_pct, weights,
main = "Moran Scatterplot: PiS", xlab = "PiS vote share (%)",
ylab = "Spatially lagged PiS (%)")The Moran scatterplots confirm the pattern: constituencies with high KO support are predominantly surrounded by other high-KO constituencies (upper-right quadrant), and those with low KO support by low-KO constituencies (lower-left). The PiS scatterplot mirrors this almost exactly, consistent with the two parties’ geographic opposition.
While global Moran’s I establishes that clustering exists overall, Local Indicators of Spatial Association (LISA) identify where it is located. Each constituency is classified into one of four cluster types — High-High, Low-Low, High-Low, and Low-High — based on its own standardised value and the average of its neighbours, with significance assessed at p < 0.05 using the raw local Moran’s I permutation p-values without multiple-testing correction.
This choice warrants a brief explanation. When running LISA simultaneously for five parties across 41 constituencies, a correction for multiple comparisons could in principle be applied — for example, Bonferroni (which would require p < 0.0012 per unit) or Benjamin-Hochberg FDR correction. However, both corrections are highly conservative at this sample size and suppress substantively meaningful clusters that are consistent with the global Moran’s I results and with well-established patterns in Polish electoral geography. The raw p < 0.05 threshold follows standard practice in the LISA literature for exploratory spatial analysis at small n, where the goal is pattern identification rather than formal confirmatory inference. Results for parties with weaker global autocorrelation (TD, Lewica, Konfederacja) should be interpreted with appropriate caution as their local clusters are more likely to include some false positives.
compute_local_moran <- function(var, weights_obj, data) {
lm_result <- localmoran(data[[var]], weights_obj)
scaled <- scale(data[[var]])[, 1]
lag_val <- lag.listw(weights_obj, scaled)
p_raw <- lm_result[, 5] # raw permutation p-values
case_when(
p_raw > 0.05 ~ "Not significant",
scaled > 0 & lag_val > 0 ~ "High-High",
scaled < 0 & lag_val < 0 ~ "Low-Low",
scaled > 0 & lag_val < 0 ~ "High-Low",
scaled < 0 & lag_val > 0 ~ "Low-High"
)
}
sejm_map <- sejm_map %>%
mutate(
KO_cluster = compute_local_moran("S2023KO_pct", weights, sejm_map),
PIS_cluster = compute_local_moran("S2023PIS_pct", weights, sejm_map),
TD_cluster = compute_local_moran("S2023TD_pct", weights, sejm_map),
NL_cluster = compute_local_moran("S2023NL_pct", weights, sejm_map),
KONF_cluster = compute_local_moran("S2023KONF_pct", weights, sejm_map)
)lisa_colours <- c(
"High-High" = "#c0392b",
"Low-Low" = "#2980b9",
"High-Low" = "#e67e22",
"Low-High" = "#27ae60",
"Not significant" = "#ecf0f1"
)
lisa_levels <- names(lisa_colours)
# Reshape to long format: one row per constituency per party
lisa_long <- sejm_map %>%
select(geometry,
KO_cluster, PIS_cluster, TD_cluster,
NL_cluster, KONF_cluster) %>%
pivot_longer(
cols = ends_with("_cluster"),
names_to = "party_raw",
values_to = "cluster"
) %>%
mutate(
party = factor(party_raw,
levels = c("KO_cluster", "PIS_cluster", "TD_cluster",
"NL_cluster", "KONF_cluster"),
labels = c("KO (Civic Coalition)", "PiS (Law & Justice)",
"TD (Third Way)", "Lewica (Left)",
"Konfederacja")),
cluster = factor(cluster, levels = lisa_levels)
) %>%
st_as_sf()
ggplot(lisa_long) +
geom_sf(aes(fill = cluster), color = "white", linewidth = 0.25) +
scale_fill_manual(values = lisa_colours, name = "Cluster type",
drop = FALSE) +
facet_wrap(~ party, ncol = 2) +
labs(title = "Local Spatial Autocorrelation (LISA)",
subtitle = "BH, 2023 Polish parliamentary election",
caption = "Source: GeoElections Poland 1.0") +
theme_void() +
theme(
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 9, color = "grey40"),
strip.text = element_text(face = "bold", size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 9, face = "bold"),
legend.text = element_text(size = 9),
legend.key.size = unit(0.4, "cm"),
panel.spacing = unit(0.8, "lines")
)The LISA maps reveal a clear east-west spatial structure.
KO — High-High clustering (red) concentrated in the
north-west (Pomerania and surrounding constituencies), Low-Low
clustering (blue) in central Poland, reflecting KO’s strength along the
Baltic coast and weakness in the rural interior.
PiS — broadly the inverse: High-High clusters in the
south-east and east, Low-Low in the north-west, with one notable
Low-High constituency (green) in the east where a PiS-weak constituency
is surrounded by PiS-strong neighbours. This persistent KO–PiS regional
divide mirrors the historical partition boundaries between
German-administered western territories and Russian/Austro-Hungarian
eastern territories.
TD — High-High clustering in
the south and south-east (its agrarian heartland), Low-Low clustering in
the west, consistent with TD’s rural centrist voter base rooted in the
Polish People’s Party tradition.
Lewica — Low-Low
clustering in the central-east, areas where the left performs poorly and
is surrounded by similarly weak left-wing constituencies, reflecting the
collapse of left-wing support outside major urban centres following the
2001–2005 period.
Konfederacja — a more complex
local pattern: High-Low (orange) in the north-west where one
constituency has disproportionately high Konfederacja support relative
to its neighbours, Low-Low (blue) in the centre, and High-High (red) in
the south, reflecting Konfederacja’s pockets of strength in the
Małopolska and Podkarpacie regions where nationalist and libertarian
sentiment is historically stronger.
Having confirmed spatial autocorrelation, the next step is to encode neighbourhood context as machine learning features. Spatial lag variables are computed for each party’s vote share and for turnout: each constituency’s lag is the row-standardised mean of its neighbours’ values. These variables allow any model to incorporate the political environment surrounding each constituency.
sejm_map <- sejm_map %>%
mutate(
lag_KO = lag.listw(weights, S2023KO_pct),
lag_PIS = lag.listw(weights, S2023PIS_pct),
lag_TD = lag.listw(weights, S2023TD_pct),
lag_NL = lag.listw(weights, S2023NL_pct),
lag_KONF = lag.listw(weights, S2023KONF_pct),
lag_turnout = lag.listw(weights, turnout_pct),
neighbour_count = card(neighbours)
)plot_lag <- function(col, title) {
ggplot(sejm_map) +
geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
name = "Neighbour\naverage (%)") +
labs(title = title,
subtitle = "Queen contiguity spatial lag") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
}
plot_lag("lag_KO", "Spatial Lag: KO Vote Share") +
plot_lag("lag_PIS", "Spatial Lag: PiS Vote Share") +
plot_lag("lag_turnout", "Spatial Lag: Voter Turnout")The spatial lag maps smooth the raw vote share maps, revealing the broader regional gradient. The KO lag map shows a clear west-to-east gradient; the PiS lag map the reverse. The turnout lag map shows that high-turnout areas cluster together — western and urban constituencies are embedded in environments where participation is also high, while eastern constituencies face a low-turnout neighbourhood context.
vote_vars <- sejm_map %>%
st_drop_geometry() %>%
select(S2023KO_pct, S2023PIS_pct, S2023TD_pct, S2023NL_pct,
S2023KONF_pct, turnout_pct,
lag_KO, lag_PIS, lag_TD, lag_NL, lag_KONF, lag_turnout)
corrplot(
cor(vote_vars, use = "complete.obs"),
method = "color", type = "upper",
tl.cex = 0.65, tl.col = "black",
addCoef.col = "black", number.cex = 0.5,
title = "Correlation Matrix: Vote Shares, Turnout, and Spatial Lags",
mar = c(0, 0, 1.5, 0)
)The correlation heatmap confirms the dominant KO–PiS opposition (r ≈ −0.80) and shows that each variable is strongly correlated with its own spatial lag, as expected under positive spatial autocorrelation. Turnout correlates positively with KO and Lewica support and negatively with PiS support — the substantive pattern that the supervised models will formalise.
PCA is applied to the six vote share and turnout variables to identify the main axes of electoral differentiation.
pca_data <- sejm_map %>%
st_drop_geometry() %>%
select(S2023KO_pct, S2023PIS_pct, S2023TD_pct,
S2023NL_pct, S2023KONF_pct, turnout_pct) %>%
scale()
pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)
pca_summary <- summary(pca_result)
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2]) %>%
tibble::rownames_to_column("Variable") %>%
mutate(across(where(is.numeric), round, 3))
knitr::kable(pca_loadings,
col.names = c("Variable", "PC1", "PC2"),
caption = paste0(
"PCA loadings for PC1 (",
round(pca_summary$importance[2,1]*100, 1),
"% of variance) and PC2 (",
round(pca_summary$importance[2,2]*100, 1),
"% of variance)"
))| Variable | PC1 | PC2 |
|---|---|---|
| S2023KO_pct | -0.504 | 0.069 |
| S2023PIS_pct | 0.501 | 0.203 |
| S2023TD_pct | 0.219 | -0.748 |
| S2023NL_pct | -0.449 | -0.044 |
| S2023KONF_pct | 0.363 | -0.298 |
| turnout_pct | -0.337 | -0.551 |
PC1 (55.5% of variance) is defined by a strong negative loading on KO (−0.504) and Lewica (−0.449), and a strong positive loading on PiS (+0.501) and Konfederacja (+0.363). This is the liberal–conservative axis: constituencies scoring high on PC1 are PiS-dominant; those scoring low are KO-dominant. PC2 (19.1% of variance) loads primarily on TD (−0.748) and turnout (−0.551), capturing a rural-agrarian/low-participation dimension — constituencies where TD performs well tend to have lower turnout and lower Konfederacja support.
sejm_map$PC1 <- pca_result$x[, 1]
sejm_map$PC2 <- pca_result$x[, 2]
p1 <- ggplot(sejm_map) +
geom_sf(aes(fill = PC1), color = "white", linewidth = 0.3) +
scale_fill_gradient2(name = "PC1",
low = "#2980b9", mid = "white", high = "#c0392b") +
labs(title = "PC1: Liberal–Conservative Axis",
subtitle = "Blue = KO-dominant, Red = PiS-dominant") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p2 <- ggplot(sejm_map) +
geom_sf(aes(fill = PC2), color = "white", linewidth = 0.3) +
scale_fill_gradient2(name = "PC2",
low = "#27ae60", mid = "white", high = "#95a5a6") +
labs(title = "PC2: Rural-Agrarian / Low-Participation Axis",
subtitle = "Green = high TD / low turnout") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p1 + p2Two clustering methods are applied to identify electoral regions. PAM (Partitioning Around Medoids) groups constituencies by profile similarity without geographic constraint and may produce fragmented maps. SKATER (Spatial ’K’luster Analysis by Tree Edge Removal) enforces geographic contiguity by building a minimum spanning tree on the neighbour graph and removing edges to form connected regions. Comparing the two reveals whether electoral profiles and geographic proximity coincide in Poland.
Both methods use k = 4 clusters. The silhouette criterion actually peaks at k = 2, which maps onto the dominant KO-west versus PiS-east divide already established by the LISA analysis — a result too coarse to add new interpretive value. k = 4 is chosen on substantive grounds: it reveals the moderate TD-leaning group and the high-turnout urban-liberal cluster that k = 2 merges into the broader KO category, producing a more informative typology at a modest cost in silhouette width (0.26 vs 0.37 at k = 2).
cluster_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct", "turnout_pct")
cluster_data <- sejm_map %>% st_drop_geometry() %>% select(all_of(cluster_vars))
cluster_scaled <- scale(cluster_data)
fviz_nbclust(cluster_scaled, pam, method = "silhouette", k.max = 8) +
labs(title = "PAM silhouette width by number of clusters",
subtitle = "Silhouette peaks at k = 2; k = 4 chosen for substantive interpretability")nb_costs <- nbcosts(neighbours, data = as.data.frame(cluster_scaled))
nb_w <- nb2listw(neighbours, glist = nb_costs, style = "B")
mst <- mstree(nb_w)
skater_model <- skater(edges = mst[, 1:2],
data = as.data.frame(cluster_scaled),
ncuts = 3)
sejm_map$skater_cluster <- factor(skater_model$groups)sil_pam <- mean(silhouette(pam_model$clustering,
dist(cluster_scaled))[, 3])
sil_skater <- mean(silhouette(as.integer(skater_model$groups),
dist(cluster_scaled))[, 3])
moran_pam <- moran.test(as.numeric(sejm_map$pam_cluster), weights)$estimate[1]
moran_skater <- moran.test(as.numeric(sejm_map$skater_cluster), weights)$estimate[1]
quality_df <- data.frame(
Method = c("PAM", "SKATER"),
Avg_Silhouette = round(c(sil_pam, sil_skater), 3),
Moran_I_Clusters = round(c(moran_pam, moran_skater), 3)
)
knitr::kable(quality_df,
col.names = c("Method", "Avg. silhouette width",
"Moran's I on cluster labels"),
caption = "Clustering quality: profile tightness vs spatial coherence")| Method | Avg. silhouette width | Moran’s I on cluster labels |
|---|---|---|
| PAM | 0.253 | 0.461 |
| SKATER | 0.040 | 0.571 |
The quality table reflects the expected trade-off. PAM achieves a silhouette width of 0.253 — reasonable internal profile separation — but its cluster labels have a Moran’s I of only 0.461, indicating moderate spatial concentration. SKATER, by construction, produces geographically coherent clusters (Moran’s I = 0.571) but sacrifices internal profile homogeneity (silhouette = 0.040). The low SKATER silhouette is not surprising at n = 41: forcing contiguity merges some profile-dissimilar constituencies that happen to be geographic neighbours.
cluster_palette <- c("1" = "#2980b9", "2" = "#c0392b",
"3" = "#27ae60", "4" = "#e67e22")
map_pam <- ggplot(sejm_map) +
geom_sf(aes(fill = pam_cluster), color = "white", linewidth = 0.3) +
scale_fill_manual(values = cluster_palette, name = "Cluster") +
labs(title = "PAM clusters",
subtitle = "Profile similarity only (non-spatial)") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 12))
map_skater <- ggplot(sejm_map) +
geom_sf(aes(fill = skater_cluster), color = "white", linewidth = 0.3) +
scale_fill_manual(values = cluster_palette, name = "Cluster") +
labs(title = "SKATER clusters",
subtitle = "Spatially contiguous regions") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 12))
map_pam + map_skater +
plot_annotation(
title = "Electoral Typologies: PAM vs SKATER (k = 4)",
caption = "Source: GeoElections Poland 1.0"
)The contrast between the two maps is itself the main finding of this section. The PAM map is geographically fragmented — constituencies with similar electoral profiles are scattered across the country, with blue, red, orange, and green patches interleaved. This shows that when grouping purely by who people voted for, the resulting clusters do not respect geography: similar voters exist in pockets spread across Poland rather than in solid regional blocs.
The SKATER map resolves into four clean, contiguous regions that closely resemble Poland’s historical macro-regions: the west and north as one large liberal-leaning bloc, the east and south-east as a conservative bloc, with two smaller peripheral groups in the south. This geographic coherence is striking because SKATER knows nothing about Polish history — it only minimises electoral profile dissimilarity within contiguous regions. The fact that it produces something that looks like a historical map of Poland means that geography and electoral behaviour are deeply aligned: the spatial structure of voting is so strong that even a purely data-driven spatial algorithm recovers it.
The contrast between the two methods also shows the cost of ignoring geography in PAM: constituencies that are geographically isolated from their electoral peers — such as a liberal-voting city surrounded by conservative rural constituencies — are assigned correctly by profile in PAM but create fragmentation. SKATER sacrifices those outliers for regional coherence, which is why its silhouette width is lower (0.040 vs PAM’s 0.253) but its Moran’s I on cluster labels is substantially higher (0.571 vs 0.461). The two methods are answering genuinely different questions: PAM identifies who votes similarly; SKATER identifies where similar voting is geographically concentrated.
profile_summary <- function(cluster_col, label) {
sejm_map %>%
st_drop_geometry() %>%
group_by(Cluster = .data[[cluster_col]]) %>%
summarise(KO = round(mean(S2023KO_pct, na.rm = TRUE), 1),
PiS = round(mean(S2023PIS_pct, na.rm = TRUE), 1),
TD = round(mean(S2023TD_pct, na.rm = TRUE), 1),
NL = round(mean(S2023NL_pct, na.rm = TRUE), 1),
KONF = round(mean(S2023KONF_pct, na.rm = TRUE), 1),
Turnout = round(mean(turnout_pct, na.rm = TRUE), 1),
n = n(), .groups = "drop") %>%
mutate(Method = label)
}
pam_profiles <- profile_summary("pam_cluster", "PAM")
skater_profiles <- profile_summary("skater_cluster", "SKATER")
knitr::kable(pam_profiles %>% select(-Method),
caption = "PAM cluster profiles: mean vote shares and turnout (%)")| Cluster | KO | PiS | TD | NL | KONF | Turnout | n |
|---|---|---|---|---|---|---|---|
| 1 | 36.5 | 31.9 | 12.1 | 9.8 | 6.2 | 72.4 | 15 |
| 2 | 43.1 | 22.9 | 14.4 | 11.0 | 6.0 | 81.1 | 7 |
| 3 | 33.0 | 28.9 | 16.4 | 10.8 | 6.9 | 77.6 | 6 |
| 4 | 22.0 | 44.4 | 16.2 | 5.6 | 7.8 | 72.2 | 13 |
knitr::kable(skater_profiles %>% select(-Method),
caption = "SKATER cluster profiles: mean vote shares and turnout (%)")| Cluster | KO | PiS | TD | NL | KONF | Turnout | n |
|---|---|---|---|---|---|---|---|
| 1 | 37.2 | 29.5 | 13.4 | 10.2 | 6.2 | 75.3 | 27 |
| 2 | 22.5 | 44.2 | 15.5 | 5.8 | 7.9 | 72.2 | 10 |
| 3 | 22.7 | 41.9 | 19.0 | 5.2 | 7.2 | 73.4 | 3 |
| 4 | 34.1 | 25.4 | 17.4 | 12.9 | 7.5 | 81.8 | 1 |
bind_rows(pam_profiles, skater_profiles) %>%
pivot_longer(cols = c(KO, PiS, TD, NL, KONF, Turnout),
names_to = "Variable", values_to = "Average") %>%
ggplot(aes(x = Variable, y = Average, fill = factor(Cluster))) +
geom_col(position = "dodge") +
scale_fill_manual(values = cluster_palette, name = "Cluster") +
facet_wrap(~ Method) +
labs(title = "Electoral Profiles by Cluster and Method",
subtitle = "Average vote share and turnout within each cluster (%)",
x = "", y = "Average (%)") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))The PAM profiles reveal four interpretable electoral types: Cluster 2 (n = 7) is the high-KO, high-turnout, urban-liberal type (KO 43.1%, turnout 81.1%); Cluster 4 (n = 13) is the PiS-dominant, low-turnout type (PiS 44.4%, turnout 72.2%); Cluster 3 (n = 6) is a moderate mixed profile with the highest TD share (16.4%, turnout 77.6%); and Cluster 1 (n = 15) sits between the two poles (KO 36.5%, PiS 31.9%, turnout 72.4%). The SKATER typology follows a broadly similar substantive interpretation but merges many of the constituency assignments to achieve geographic contiguity, resulting in a large dominant cluster (n = 27) and three smaller peripheral groups. The SKATER Cluster 4 (n = 1) is an isolated high-turnout single constituency that PAM would assign to the urban type — an illustration of the cost of enforcing contiguity when one constituency is geographically isolated from its electoral peers.
The supervised learning section tests whether spatially informed models outperform non-spatial alternatives. Voter turnout is the prediction target. This choice is deliberate: party vote shares are compositionally constrained (they sum to 100%), so predicting one party’s share using the others as features exploits a near-arithmetic identity rather than modelling a genuine spatial process. Turnout is orthogonal to this constraint, is spatially autocorrelated (confirmed above), and is of independent substantive interest.
Six models are compared:
| Model | Features | Spatial? |
|---|---|---|
| Non-Spatial Random Forest | Party vote shares only | No |
| Spatial Random Forest | Vote shares + spatial lags + neighbour count | Yes |
| Spatial Decision Tree | Same as Spatial RF | Yes |
| Ridge Regression | Same as Spatial RF | Yes |
| XGBoost | Same as Spatial RF | Yes |
| Spatial Lag OLS (SAR) | Vote shares + spatially lagged turnout response | Yes |
All models are evaluated using leave-one-out cross-validation (LOOCV). LOOCV is appropriate here because n = 41 is too small for reliable random train/test splits, and the confirmed spatial autocorrelation means random folds would place neighbouring constituencies in both training and test sets, inflating apparent accuracy. At each of the 41 folds, spatial lag variables are recomputed on the 40 training units, excluding the held-out unit’s value from the lags of its training neighbours — eliminating even the mild data leakage that arises when lags are precomputed on the full dataset.
base_data <- sejm_map %>%
st_drop_geometry() %>%
select(NumberSejm41, turnout_pct, S2023KO_pct, S2023PIS_pct,
S2023TD_pct, S2023NL_pct, S2023KONF_pct, neighbour_count) %>%
na.omit()
n <- nrow(base_data)
actual_turnout <- base_data$turnout_pct
non_spatial_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct")
spatial_vars <- c(non_spatial_vars,
"lag_KO", "lag_PIS", "lag_TD", "lag_NL",
"lag_KONF", "lag_turnout", "neighbour_count")
compute_lags_loo <- function(data_full, nb_list, i) {
vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct", "turnout_pct")
lag_names <- c("lag_KO", "lag_PIS", "lag_TD",
"lag_NL", "lag_KONF", "lag_turnout")
result <- data_full
for (j in seq_along(vars)) {
v <- vars[j]
lag <- numeric(nrow(data_full))
for (k in seq_len(nrow(data_full))) {
nb_k <- nb_list[[k]]
if (k == i) {
lag[k] <- mean(data_full[[v]][nb_k], na.rm = TRUE)
} else {
nb_k_excl <- nb_k[nb_k != i]
lag[k] <- if (length(nb_k_excl) == 0) data_full[[v]][k] else
mean(data_full[[v]][nb_k_excl], na.rm = TRUE)
}
}
result[[lag_names[j]]] <- lag
}
result
}pred_non_spatial <- numeric(n)
pred_spatial <- numeric(n)
pred_tree <- numeric(n)
pred_ridge <- numeric(n)
pred_xgb <- numeric(n)
for (i in seq_len(n)) {
data_i <- compute_lags_loo(base_data, neighbours, i)
train_i <- data_i[-i, ]
test_i <- data_i[i, ]
y_train <- train_i$turnout_pct
# Non-spatial RF
fit_ns <- ranger(
turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
S2023NL_pct + S2023KONF_pct,
data = train_i, num.trees = 500, seed = 123
)
pred_non_spatial[i] <- predict(fit_ns, test_i)$predictions
# Spatial RF
fit_sp <- ranger(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = train_i, num.trees = 500, seed = 123
)
pred_spatial[i] <- predict(fit_sp, test_i)$predictions
# Decision Tree
fit_tree <- rpart(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = train_i, method = "anova"
)
pred_tree[i] <- predict(fit_tree, test_i)
# Ridge
X_train <- scale(as.matrix(train_i[, spatial_vars]))
X_test <- scale(as.matrix(test_i[, spatial_vars, drop = FALSE]),
center = attr(X_train, "scaled:center"),
scale = attr(X_train, "scaled:scale"))
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)
pred_ridge[i] <- predict(cv_ridge, newx = X_test, s = "lambda.min")[1]
# XGBoost (hyperparameters fixed a priori; see limitations)
dtrain <- xgb.DMatrix(data = as.matrix(train_i[, spatial_vars]),
label = y_train)
dtest <- xgb.DMatrix(data = as.matrix(test_i[, spatial_vars, drop = FALSE]))
fit_xgb <- xgb.train(
params = list(objective = "reg:squarederror", eta = 0.05,
max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
data = dtrain, nrounds = 200, verbose = 0
)
pred_xgb[i] <- predict(fit_xgb, dtest)
}pred_sar <- numeric(n)
for (i in seq_len(n)) {
data_i <- compute_lags_loo(base_data, neighbours, i)
train_i <- data_i[-i, ]
test_i <- data_i[i, ]
nb_train <- subset(neighbours, seq_len(n) != i)
wt_train <- nb2listw(nb_train, style = "W")
fit_sar <- lagsarlm(
turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
S2023NL_pct + S2023KONF_pct,
data = train_i, listw = wt_train, method = "eigen"
)
nb_i <- neighbours[[i]][neighbours[[i]] != i]
train_rows <- seq_len(n)[-i]
nb_in_train <- match(nb_i, train_rows)
nb_in_train <- nb_in_train[!is.na(nb_in_train)]
wy_i <- mean(train_i$turnout_pct[nb_in_train])
X_i <- as.numeric(c(1, test_i$S2023KO_pct, test_i$S2023PIS_pct,
test_i$S2023TD_pct, test_i$S2023NL_pct,
test_i$S2023KONF_pct))
beta <- coef(fit_sar)[c("(Intercept)", non_spatial_vars)]
pred_sar[i] <- sum(X_i * beta) + fit_sar$rho * wy_i
}evaluate_model <- function(actual, predicted) {
data.frame(
RMSE = round(sqrt(mean((actual - predicted)^2)), 3),
MAE = round(mean(abs(actual - predicted)), 3),
R2 = round(cor(actual, predicted)^2, 3)
)
}
final_comparison <- bind_rows(
`Non-Spatial RF` = evaluate_model(actual_turnout, pred_non_spatial),
`Spatial RF` = evaluate_model(actual_turnout, pred_spatial),
`Decision Tree` = evaluate_model(actual_turnout, pred_tree),
`Ridge` = evaluate_model(actual_turnout, pred_ridge),
`XGBoost` = evaluate_model(actual_turnout, pred_xgb),
`Spatial Lag OLS` = evaluate_model(actual_turnout, pred_sar),
.id = "Model"
) %>% arrange(RMSE)
knitr::kable(final_comparison,
col.names = c("Model", "RMSE", "MAE", "R²"),
caption = "LOOCV model performance (all metrics on original turnout % scale)")| Model | RMSE | MAE | R² |
|---|---|---|---|
| XGBoost | 3.504 | 2.779 | 0.409 |
| Spatial RF | 3.533 | 2.826 | 0.446 |
| Non-Spatial RF | 3.735 | 2.879 | 0.315 |
| Ridge | 3.885 | 3.246 | 0.258 |
| Spatial Lag OLS | 4.225 | 3.516 | 0.185 |
| Decision Tree | 4.599 | 3.562 | 0.148 |
model_order <- final_comparison$Model
final_comparison %>%
pivot_longer(cols = c(RMSE, MAE, R2), names_to = "Metric",
values_to = "Value") %>%
mutate(Model = factor(Model, levels = rev(model_order))) %>%
ggplot(aes(x = Model, y = Value, fill = Model)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Metric, scales = "free_x") +
coord_flip() +
scale_fill_brewer(palette = "Blues", direction = -1) +
labs(title = "Model Performance Comparison (LOOCV)",
subtitle = "Six models evaluated on 41 held-out predictions",
x = "", y = "") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))XGBoost achieves the lowest RMSE (3.50) and MAE (2.78), followed closely by the Spatial RF (RMSE = 3.53, R² = 0.446). The non-spatial RF (RMSE = 3.73, R² = 0.315) is meaningfully weaker than both spatial tree-based models, confirming that neighbourhood context adds genuine predictive value. Ridge and the Spatial Lag OLS occupy the middle ground, with the Decision Tree performing worst among the spatial models.
The gap between the spatial and non-spatial Random Forest is the central finding: adding spatial lag variables reduces RMSE by 5.4% and increases R² from 0.315 to 0.446 — a 42% improvement in explained variance. This is the direct empirical answer to Research Question 3.
data_full_lags <- base_data %>%
mutate(lag_KO = lag.listw(weights, S2023KO_pct),
lag_PIS = lag.listw(weights, S2023PIS_pct),
lag_TD = lag.listw(weights, S2023TD_pct),
lag_NL = lag.listw(weights, S2023NL_pct),
lag_KONF = lag.listw(weights, S2023KONF_pct),
lag_turnout = lag.listw(weights, turnout_pct))
rf_spatial <- ranger(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = data_full_lags, importance = "permutation", num.trees = 500
)data.frame(variable = names(rf_spatial$variable.importance),
importance = rf_spatial$variable.importance) %>%
arrange(desc(importance)) %>%
ggplot(aes(x = reorder(variable, importance), y = importance)) +
geom_col(fill = "#2c7fb8") +
coord_flip() +
labs(title = "Variable Importance: Spatial Random Forest",
subtitle = "Permutation importance (full dataset fit)",
x = "", y = "Permutation importance") +
theme_minimal()fit_xgb_full <- xgb.train(
params = list(objective = "reg:squarederror", eta = 0.05,
max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
data = xgb.DMatrix(data = as.matrix(data_full_lags[, spatial_vars]),
label = actual_turnout),
nrounds = 200, verbose = 0
)
xgb_imp <- xgb.importance(model = fit_xgb_full)
xgb.plot.importance(xgb_imp, top_n = 13,
main = "XGBoost Feature Importance (full dataset)")In the Spatial RF, lag_turnout consistently ranks among
the top predictors — the average turnout of neighbouring constituencies
is more informative about a given constituency’s turnout than any
individual party’s local vote share. S2023PIS_pct and
S2023KO_pct are the next most important, reflecting the
strong geographic sorting of the two major parties. The spatial lag
variables for PiS and KO also contribute, confirming that the
neighbourhood political environment matters independently of local party
composition.
Ridge regression provides the most directly interpretable spatial model, as its coefficients quantify each predictor’s marginal relationship with turnout on a standardised scale.
X_full <- scale(as.matrix(data_full_lags[, spatial_vars]))
cv_ridge_full <- cv.glmnet(X_full, actual_turnout, alpha = 0, nfolds = 10)
ridge_coefs <- as.data.frame(
as.matrix(coef(cv_ridge_full, s = "lambda.min"))
) %>%
tibble::rownames_to_column("variable") %>%
rename(coefficient = s1) %>%
filter(variable != "(Intercept)") %>%
arrange(desc(abs(coefficient)))
knitr::kable(ridge_coefs %>% mutate(coefficient = round(coefficient, 3)),
col.names = c("Variable", "Coefficient"),
caption = "Ridge regression coefficients (standardised predictors, lambda.min)")| Variable | Coefficient |
|---|---|
| S2023PIS_pct | -1.222 |
| lag_turnout | 1.128 |
| S2023KO_pct | 0.952 |
| S2023NL_pct | 0.904 |
| lag_PIS | 0.855 |
| lag_KO | -0.788 |
| S2023TD_pct | 0.716 |
| lag_TD | -0.392 |
| S2023KONF_pct | -0.281 |
| neighbour_count | -0.254 |
| lag_KONF | 0.093 |
| lag_NL | -0.039 |
ggplot(ridge_coefs, aes(x = reorder(variable, coefficient),
y = coefficient)) +
geom_col(aes(fill = coefficient > 0), show.legend = FALSE) +
scale_fill_manual(values = c("TRUE" = "#27ae60", "FALSE" = "#c0392b")) +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Ridge Regression Coefficients",
subtitle = "Spatial predictors of voter turnout (standardised; lambda.min)",
x = "", y = "Coefficient") +
theme_minimal()S2023PIS_pct has the largest coefficient (−1.046),
indicating that higher PiS support is associated with lower turnout
after controlling for all other spatial predictors. This is consistent
with the known geography: PiS strongholds in the east and south-east
have historically shown lower electoral participation. The positive
lag_turnout coefficient (+0.893) captures
neighbourhood mobilisation effects — constituencies
embedded in high-turnout environments participate more, independently of
their own party composition. The positive coefficients on
S2023KO_pct (+0.764) and S2023NL_pct (+0.841)
reflect the concentration of both parties in urban, high-turnout areas.
All coefficients are on standardised predictors, so their magnitudes are
directly comparable.
tree_full <- rpart(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = data_full_lags, method = "anova"
)
rpart.plot(tree_full, main = "Spatial Decision Tree: Voter Turnout")The decision tree provides an interpretable rule-based summary of the spatial model. The first split is on a spatial lag variable, confirming that neighbourhood context is the primary driver of turnout variation.
prediction_df <- data.frame(
NumberSejm41 = base_data$NumberSejm41,
rf_spatial_prediction = pred_spatial,
rf_spatial_residual = actual_turnout - pred_spatial
)
sejm_map <- sejm_map %>% left_join(prediction_df, by = "NumberSejm41")p_pred <- ggplot(sejm_map) +
geom_sf(aes(fill = rf_spatial_prediction), color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
name = "Predicted\nturnout (%)") +
labs(title = "Predicted Voter Turnout",
subtitle = "Spatial RF, LOOCV predictions") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p_resid <- ggplot(sejm_map) +
geom_sf(aes(fill = rf_spatial_residual), color = "white", linewidth = 0.3) +
scale_fill_gradient2(name = "Residual",
low = "#c0392b", mid = "white", high = "#2980b9") +
labs(title = "Spatial RF Residuals",
subtitle = "Actual minus LOOCV-predicted turnout") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p_pred + p_residggplot(data.frame(actual = actual_turnout, predicted = pred_spatial),
aes(x = actual, y = predicted)) +
geom_point(size = 3, colour = "#2c7fb8", alpha = 0.8) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
labs(title = "Actual vs Predicted Voter Turnout",
subtitle = "Spatial Random Forest (LOOCV)",
x = "Actual turnout (%)", y = "Predicted turnout (%)") +
theme_minimal()The predicted turnout map reproduces the broad west–east gradient visible in the raw data. The residual map shows no strong systematic pattern — positive and negative residuals are scattered — which is consistent with the residual autocorrelation test below.
The final diagnostic tests whether the spatial RF residuals retain spatial clustering that the model failed to capture.
res_vec <- sejm_map$rf_spatial_residual
w_res <- weights
if (any(is.na(res_vec))) {
keep <- !is.na(res_vec)
res_vec <- res_vec[keep]
w_res <- nb2listw(subset(neighbours, keep), style = "W")
}
mt_resid <- moran.test(res_vec, w_res)
cat(sprintf(
"Residual Moran's I = %.3f (std. deviate = %.3f, p = %.4f)\n",
mt_resid$estimate[1], mt_resid$statistic, mt_resid$p.value
))## Residual Moran's I = 0.098 (std. deviate = 1.273, p = 0.1015)
moran.plot(res_vec, w_res,
main = "Moran Scatterplot: Spatial RF Residuals (LOOCV)",
xlab = "LOOCV residual",
ylab = "Spatially lagged residual")The residual Moran’s I is 0.098 (p = 0.102), which is not significant at the conventional 0.05 threshold. This is a positive finding: the spatial RF has absorbed the spatial dependence present in the raw turnout data, and what remains in the residuals is not detectably structured geographically. This directly answers Research Question 5.
The 2023 Sejm election shows strong and statistically significant spatial autocorrelation for all five parties. The global Moran’s I values range from 0.197 (Konfederacja) to 0.491 (KO), all significant at p < 0.01. LISA analysis identifies the persistent east–west divide: KO and Lewica cluster in the west and in major cities; PiS clusters in the east and south-east. This spatial polarisation is not merely a reflection of the urban–rural divide but carries the imprint of Poland’s historical partitions, which created lasting differences in civic culture, administrative tradition, and economic development across regions.
PAM clustering identifies four substantively meaningful electoral types: a high-KO, high-turnout urban-liberal group (Cluster 2, n = 7, mean turnout 81.1%); a PiS-dominant, low-turnout group (Cluster 4, n = 13, mean turnout 72.2%); a moderate TD-leaning group (Cluster 3, n = 6); and a large mixed group (Cluster 1, n = 15). The SKATER typology reproduces the main distinction between liberal-western and conservative-eastern regions but merges many of the more nuanced distinctions in order to enforce geographic contiguity. The silhouette–Moran’s I trade-off (PAM: 0.253 vs 0.461; SKATER: 0.040 vs 0.571) quantifies this cost explicitly: SKATER achieves 24% higher spatial coherence at the price of 84% lower internal profile separation.
Adding spatial lag features to the Random Forest reduces LOOCV RMSE from 3.73 to 3.53 percentage points and increases R² from 0.315 to 0.446. This improvement is genuine: LOOCV prevents train-test leakage, and lags are recomputed per fold to prevent information leakage from the held-out unit. The Spatial RF and XGBoost perform similarly (RMSE 3.53 and 3.50 respectively), and both substantially outperform the Spatial Lag OLS (RMSE 4.23) — suggesting that the relationship between spatial context and turnout is non-linear, and that the parametric SAR model’s assumption of a single global spatial autoregressive parameter is too restrictive for these data.
The ridge regression coefficients provide interpretable evidence for
the spatial mechanism: the lag_turnout coefficient (+0.893)
is the second strongest predictor, indicating that neighbourhood
mobilisation — the average turnout of surrounding constituencies — is
independently predictive of a constituency’s own participation beyond
its party composition. This is consistent with social contagion and
coordination effects in electoral participation.
The residual Moran’s I of 0.098 (p = 0.10) confirms that the spatial RF successfully absorbs the spatial dependence in the data. The residuals are not significantly autocorrelated, meaning the model has captured the spatial process rather than leaving it in the error term.
Three limitations merit acknowledgement. First, the XGBoost hyperparameters are set a priori rather than tuned per LOOCV fold; per-fold tuning would be more rigorous but computationally prohibitive at 41 × inner-CV scale. Second, all models use only electoral predictors; socioeconomic covariates (urbanisation, income, age structure) are not available in the GeoElections dataset and would likely improve prediction. Third, with n = 41, all results — including the LOOCV metrics — carry substantial sampling uncertainty; the confidence intervals on RMSE differences between the Spatial RF and XGBoost, for example, are likely wide.
This project demonstrates that Polish electoral behaviour in 2023 is not only politically but also geographically structured. The analysis proceeds through a coherent spatial machine learning workflow:
lag_turnout) as the strongest spatial mechanism: being
embedded in a high-turnout environment independently raises a
constituency’s own participation.Together, these findings support the conclusion that neighbourhood context — not just local party composition — is a substantive driver of electoral participation in Poland, and that spatial machine learning methods are necessary to capture and quantify this effect.