library(here)
library(sf)
library(readxl)
library(dplyr)
library(ggplot2)
library(spdep)
library(tidyr)
library(cluster)
library(factoextra)
library(ranger)
library(rpart)
library(rpart.plot)
library(caret)
library(corrplot)
library(glmnet)
library(xgboost)
library(spatialreg)
library(patchwork)
set.seed(123)Electoral behaviour in Poland is shaped by forces that reach beyond any individual constituency. Shared history, urbanisation gradients, and regional economic conditions create patterns that neighbouring districts share with each other - a phenomenon that standard statistical models, which treat each observation as independent, cannot capture. This project investigates whether Polish voting patterns are spatially structured and whether explicitly modelling that structure improves our understanding of electoral participation.
The analysis focuses on the 2023 Polish parliamentary (Sejm) election result at the level of 41 Sejm constituencies, using the GeoElections Poland 1.0 dataset. We pose five research questions for this project:
The methodology combines global and local spatial autocorrelation tests, principal component analysis, two complementary clustering approaches (PAM and SKATER), spatial feature engineering via spatial lag variables, and a six-model supervised learning comparison evaluated through leave-one-out cross-validation (LOOCV).
The dataset provides electoral results at the Senate district
level(94 units). Since the analysis targets Sejm constituencies (41
units), Senate polygons are dissolved into Sejm districts using a lookup
table embedded in the dataset. Invalid geometries are converted by
st_make_valid() before the union operation.
senate <- st_read(
here("data", "raw", "GeoElections Poland 1.0 dataset SHP",
"SenateDistricts41.shp"),
quiet = TRUE
)
elections <- read_excel(
here("data", "raw", "GeoElections Poland 1.0 dataset.xlsx"),
sheet = "GeoElections Poland 1.0 data"
)sejm_lookup <- elections %>%
select(NumberSejm41, NumberSenate94) %>%
distinct()
senate_fixed <- st_make_valid(senate)
sejm_41 <- senate_fixed %>%
left_join(sejm_lookup, by = c("NumberSena" = "NumberSenate94")) %>%
group_by(NumberSejm41) %>%
summarise(geometry = st_union(geometry), .groups = "drop")The resulting object contains 41 Sejm constituencies with valid polygon geometries covering the entire Poland.
For this project, we will only focus on the 2023 Sejm election. The 2023 election data is separated into two components: party vote counts (7 parties: BS, KO, KONF, NL, PiS, PJN, TD) and turnout statistics (eligible voters, votes cast, valid votes).
The data was aggregated from Senate sub-district level to the Sejm constituency level by summing row vote counts. We processed the results into wide format and vote shares were calculated as percentages of valid votes cast for each party. In the end we produced a spatial dataframe of 41 features.
party_2023 <- elections %>%
filter(ElectionType == "S", Year == 2023, !is.na(Option)) %>%
group_by(NumberSejm41, AbbreviationInSheet) %>%
summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)
turnout_2023 <- elections %>%
filter(ElectionType == "S", Year == 2023, is.na(Option)) %>%
group_by(NumberSejm41, AbbreviationInSheet) %>%
summarise(Votes = sum(Votes, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = AbbreviationInSheet, values_from = Votes)
results_2023 <- party_2023 %>%
left_join(turnout_2023, by = "NumberSejm41") %>%
mutate(
across(S2023BS:S2023TD,
~ . / S2023_valid * 100,
.names = "{.col}_pct"),
turnout_pct = S2023_votescast / S2023_eligible * 100
)
sejm_map <- sejm_41 %>%
left_join(results_2023, by = "NumberSejm41")To visualise the geo distribution of electoral support, maps were made for each of the five main parties contesting the 2023 Sejm election: KO, PiS, TD, Lewica, and KONF. Vote share was calculated as a percentage of valid votes cast within each of the Sejm constituencies.
The choropleth maps below reveal the first indication of spatial polarisation. KO support is concentrated in the west, in major metropolitan areas (Warsaw, Kraków, Wrocław, and Poznań), and along the western border. PiS support is strongest in the east and south-east - precisely the mirror image. TD shows moderate support across central and south-central constituencies. Lewica and Konfederacja display more diffuse but still spatially structured patterns.
party_map <- function(col, title, color) {
ggplot(sejm_map) +
geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#f5f5f5", high = color,
name = "Vote share (%)") +
labs(title = title,
subtitle = "2023 Polish parliamentary election",
caption = "Source: GeoElections Poland 1.0") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 9, color = "grey40"),
legend.position = "right",
legend.title = element_text(size = 7),
legend.text = element_text(size = 7),
legend.key.size = unit(0.35, "cm"))
}
(party_map("S2023KO_pct", "KO (Koalicja Obywatelska)", "#2980b9") +
party_map("S2023PIS_pct", "PiS (Prawo i Sprawiedliwość)", "#c0392b")) /
(party_map("S2023TD_pct", "TD (Trzecia Droga)", "#27ae60") +
party_map("S2023NL_pct", "NL (Nowa Lewica)", "#8e44ad")) /
(party_map("S2023KONF_pct", "KONF (Konfederacja)", "#e67e22") +
party_map("turnout_pct", "Frekwencja wyborcza", "#7f8c8d"))The turnout map shows: participation was highest in the west and in Poland’s major cities, while the east — despite its strong PiS mobilisation — voted at noticeably lower rates. This west–east turnout gradient makes it a more interesting and spatially meaningful prediction target than party vote shares.
Queen contiguity is used to define the neighbourhood structure: two constituencies are neighbours if they share any boundary point (edge or corner). This is more appropriate for the irregular polygon shapes of Polish Sejm constituencies.
The resulting network has 41 regions, 202 non-zero links, and an average of 4.93 neighbours per constituency. The three least connected constituencies (9, 19, 41) each have only two neighbours, while the four most connected (11, 16, 18, 33) each have eight - reflecting the geographic variation between elongated border constituencies and centrally located ones.
The least connected are Łódź and Warszawa - large cities, that are essentially surrounded by fewer but larger rural districts, as well as Szczecin, which makes geographical sense, scince it’s in the nortwest corner of Poland, bordered by Germany and the Baltic sea.
Most connected are Sieradz, Płock, Siedlce and Kielce, small central constituencies, which makes sense, as they have more nieghbours in all directions.
coords <- st_coordinates(st_centroid(sejm_41))
plot(st_geometry(sejm_41), border = "grey70",
main = "Queen Contiguity Neighbour Network (41 Sejm Constituencies)")
plot(neighbours, coords, add = TRUE, col = "#e74c3c", lwd = 0.8)Global Moran’s I tests whether similar values cluster together across the map - the fundamental prerequisite for spatial analysis. All five party vote shares are tested against the null hypothesis of spatial randomness.
parties <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct")
party_labels <- c("KO", "PiS", "TD", "NL", "KONF")
moran_results <- mapply(function(v, lab) {
mt <- moran.test(sejm_map[[v]], weights)
data.frame(Party = lab,
Moran_I = round(mt$estimate[1], 3),
Std_Dev = round(mt$statistic, 3),
P_value = formatC(mt$p.value, format = "e", digits = 2))
}, parties, party_labels, SIMPLIFY = FALSE)
moran_table <- do.call(rbind, moran_results)
knitr::kable(moran_table, row.names = FALSE,
col.names = c("Party", "Moran's I", "Std. deviate", "p-value"),
caption = "Global Moran's I for party vote shares (Queen contiguity, W-standardised weights)")| Party | Moran’s I | Std. deviate | p-value |
|---|---|---|---|
| KO | 0.491 | 5.310 | 5.49e-08 |
| PiS | 0.344 | 3.852 | 5.86e-05 |
| TD | 0.383 | 4.259 | 1.03e-05 |
| NL | 0.284 | 3.263 | 5.51e-04 |
| KONF | 0.197 | 2.358 | 9.19e-03 |
All five parties show statistically significant positive spatial autocorrelation. KO has the strongest spatial clustering (I = 0.491, p < 0.001), KONF has the least of 5 parties(I = 0.197, p < 0.001). The results confirm that the 2023 election was shaped in spatially structure.
par(mfrow = c(1, 2))
moran.plot(sejm_map$S2023KO_pct, weights,
main = "Moran Scatterplot: KO", xlab = "KO vote share (%)",
ylab = "Spatially lagged KO (%)")
moran.plot(sejm_map$S2023PIS_pct, weights,
main = "Moran Scatterplot: PiS", xlab = "PiS vote share (%)",
ylab = "Spatially lagged PiS (%)")The Moran scatterplots confirm the pattern: constituencies with high KO support are predominantly surrounded by other high-KO constituencies (upper-right quadrant), and those with low KO support by low-KO constituencies (lower-left). The PiS scatterplot shows similar behaviour, which is consistent with the two parties’ geographic opposition.
While global Moran’s I tells us that clustering exists, LISA pinpoints where it occurs. Each constituency is assigned to one of four types (High-High, Low-Low, High-Low, or Low-High) depending on whether its own value and its neighbours’ values are both high, both low, or mismatched. Only clusters with p < 0.05 are treated as statistically significant.
compute_local_moran <- function(var, weights_obj, data) {
lm_result <- localmoran(data[[var]], weights_obj)
scaled <- scale(data[[var]])[, 1]
lag_val <- lag.listw(weights_obj, scaled)
p_raw <- lm_result[, 5] # raw permutation p-values
case_when(
p_raw > 0.05 ~ "Not significant",
scaled > 0 & lag_val > 0 ~ "High-High",
scaled < 0 & lag_val < 0 ~ "Low-Low",
scaled > 0 & lag_val < 0 ~ "High-Low",
scaled < 0 & lag_val > 0 ~ "Low-High"
)
}
sejm_map <- sejm_map %>%
mutate(
KO_cluster = compute_local_moran("S2023KO_pct", weights, sejm_map),
PIS_cluster = compute_local_moran("S2023PIS_pct", weights, sejm_map),
TD_cluster = compute_local_moran("S2023TD_pct", weights, sejm_map),
NL_cluster = compute_local_moran("S2023NL_pct", weights, sejm_map),
KONF_cluster = compute_local_moran("S2023KONF_pct", weights, sejm_map)
)lisa_colours <- c(
"High-High" = "#c0392b",
"Low-Low" = "#2980b9",
"High-Low" = "#e67e22",
"Low-High" = "#27ae60",
"Not significant" = "#ecf0f1"
)
lisa_levels <- names(lisa_colours)
lisa_long <- sejm_map %>%
select(geometry,
KO_cluster, PIS_cluster, TD_cluster,
NL_cluster, KONF_cluster) %>%
pivot_longer(
cols = ends_with("_cluster"),
names_to = "party_raw",
values_to = "cluster"
) %>%
mutate(
party = factor(party_raw,
levels = c("KO_cluster", "PIS_cluster", "TD_cluster",
"NL_cluster", "KONF_cluster"),
labels = c("KO (Civic Coalition)", "PiS (Law & Justice)",
"TD (Third Way)", "Lewica (Left)",
"Konfederacja")),
cluster = factor(cluster, levels = lisa_levels)
) %>%
st_as_sf()
ggplot(lisa_long) +
geom_sf(aes(fill = cluster), color = "white", linewidth = 0.25) +
scale_fill_manual(values = lisa_colours, name = "Cluster type",
drop = FALSE) +
facet_wrap(~ party, ncol = 2) +
labs(title = "Local Spatial Autocorrelation (LISA)",
subtitle = "BH, 2023 Polish parliamentary election",
caption = "Source: GeoElections Poland 1.0") +
theme_void() +
theme(
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 9, color = "grey40"),
strip.text = element_text(face = "bold", size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 9, face = "bold"),
legend.text = element_text(size = 9),
legend.key.size = unit(0.4, "cm"),
panel.spacing = unit(0.8, "lines")
)The LISA maps reveal a clear east-west spatial structure.
KO and PiS display the clearest clustering patterns, which also mirror each other. In the northwest region, near the Baltic sea, KO shows High-High constituencies, meanwhile PiS clusters in that zone are Low-Low. Southeast regions show up as High-High clusters for PiS and Low-Low for KO. The near-perfect inversion between KO and PiS cluster maps is the most striking finding of the exploratory analysis, confirming that Polish electoral geography is bipolar along a northwest-southeast axis.
From the maps we could also draw conclusions that TD, as a rural-agrarian party, shows High-High clustering in the southeastern corner reflecting its strength in agricultural eastern constituencies. In that same region Lewica presents a Low-Low zone and no High-High clusters across the country, indicating that while the party draws urban support it lacks a spatially coherent stronghold. Konfederacja shows the most spread out pattern, with High-High clusters appearqing along the eastern border, possibly reflecting heightened nationalist sentiment in constituencies proximate to Ukraine.
Since we have confirmed spatial autocorrelation, the next step is to encode neighbourhood context as machine learning features. Spatial lag variables are computed for each party’s vote share and for turnout: each constituency’s lag is the row-standardised mean of its neighbours’ values. These variables allow used models in the project to incorporate the political environment surrounding each constituency.
sejm_map <- sejm_map %>%
mutate(
lag_KO = lag.listw(weights, S2023KO_pct),
lag_PIS = lag.listw(weights, S2023PIS_pct),
lag_TD = lag.listw(weights, S2023TD_pct),
lag_NL = lag.listw(weights, S2023NL_pct),
lag_KONF = lag.listw(weights, S2023KONF_pct),
lag_turnout = lag.listw(weights, turnout_pct),
neighbour_count = card(neighbours)
)plot_lag <- function(col, title) {
ggplot(sejm_map) +
geom_sf(aes(fill = .data[[col]]), color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
name = "Neighbour\naverage (%)") +
labs(title = title,
subtitle = "Queen contiguity spatial lag") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
}
plot_lag("lag_KO", "Spatial Lag: KO Vote Share") +
plot_lag("lag_PIS", "Spatial Lag: PiS Vote Share") +
plot_lag("lag_TD", "Spatial Lag: TD Vote Share") +
plot_lag("lag_NL", "Spatial Lag: NL Vote Share") +
plot_lag("lag_KONF", "Spatial Lag: KONF Vote Share") +
plot_lag("lag_turnout", "Spatial Lag: Voter Turnout")The spatial lag maps smooth each constituency’s value into the
average of its neighbours, making the broader regional gradients easier
to read than the raw maps alone.
KO shows a clear west-to-east
decline in neighbourhood averages — constituencies in the west are not
only KO-strong themselves but sit within KO-strong regions, reinforcing
the liberal western bloc.
PiS is the reverse, with high
neighbourhood averages concentrated in the east and south-east,
confirming that PiS dominance is regionally self-reinforcing in exactly
the same way.
TD shows moderately elevated neighbourhood averages in
the south and south-central belt, consistent with its agrarian voter
base being geographically clustered rather than evenly distributed.
NL displays its highest neighbourhood averages in the east and
centre-east — a somewhat counterintuitive finding given that NL is an
urban party, but reflecting the specific cities (Lublin, Białystok) that
anchor left-wing support in that region.
KONF shows its highest
neighbourhood averages in the east, with noticeably lower averages in
the west and north — a pattern that partly overlaps with PiS geography,
reflecting the shared conservative and nationalist voter base in eastern
Poland.
Turnout presents the sharpest gradient of all: western and
urban constituencies are embedded in high-participation environments,
while eastern constituencies face a low-turnout neighbourhood context —
the neighbourhood mobilisation effect which will be discussed in the
models.
vote_vars <- sejm_map %>%
st_drop_geometry() %>%
select(S2023KO_pct, S2023PIS_pct, S2023TD_pct, S2023NL_pct,
S2023KONF_pct, turnout_pct,
lag_KO, lag_PIS, lag_TD, lag_NL, lag_KONF, lag_turnout)
corrplot(
cor(vote_vars, use = "complete.obs"),
method = "color", type = "upper",
tl.cex = 0.65, tl.col = "black",
addCoef.col = "black", number.cex = 0.5,
title = "Correlation Matrix: Vote Shares, Turnout, and Spatial Lags",
mar = c(0, 0, 1.5, 0)
)The correlation heatmap confirms the dominant pattern in the data: KO and PiS vote shares move strongly in opposite directions. Where one party does well, the other tends not to. It also shows that neighbouring constituencies tend to look similar to each other, which is consistent with the spatial clustering identified earlier.
It is also worth noting that higher turnout is associated with stronger KO and NL performance and weaker PiS performance. This relationship will be examined more formally in the models that follow.
PCA is applied to the six vote share and turnout variables to identify the main axes of electoral differentiation.
pca_data <- sejm_map %>%
st_drop_geometry() %>%
select(S2023KO_pct, S2023PIS_pct, S2023TD_pct,
S2023NL_pct, S2023KONF_pct, turnout_pct) %>%
scale()
pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)
pca_summary <- summary(pca_result)
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2]) %>%
tibble::rownames_to_column("Variable") %>%
mutate(across(where(is.numeric), round, 3))
knitr::kable(pca_loadings,
col.names = c("Variable", "PC1", "PC2"),
caption = paste0(
"PCA loadings for PC1 (",
round(pca_summary$importance[2,1]*100, 1),
"% of variance) and PC2 (",
round(pca_summary$importance[2,2]*100, 1),
"% of variance)"
))| Variable | PC1 | PC2 |
|---|---|---|
| S2023KO_pct | -0.504 | 0.069 |
| S2023PIS_pct | 0.501 | 0.203 |
| S2023TD_pct | 0.219 | -0.748 |
| S2023NL_pct | -0.449 | -0.044 |
| S2023KONF_pct | 0.363 | -0.298 |
| turnout_pct | -0.337 | -0.551 |
PC1 captures the main left-right divide in Polish politics. Constituencies that voted heavily for KO and NL score low on this axis, those that voted heavily for PiS and KONF score high. It explains over half the variation across all constituencies.
PC2 picks up a secondary pattern: places where TD did well tended to have lower turnout. This likely reflects TD’s rural and agrarian base, where voter participation is typically lower. It explains around a fifth of the remaining variation.
sejm_map$PC1 <- pca_result$x[, 1]
sejm_map$PC2 <- pca_result$x[, 2]
p1 <- ggplot(sejm_map) +
geom_sf(aes(fill = PC1), color = "white", linewidth = 0.3) +
scale_fill_gradient2(name = "PC1",
low = "#2980b9", mid = "white", high = "#c0392b") +
labs(title = "PC1: Liberal–Conservative Axis",
subtitle = "Blue = KO-dominant, Red = PiS-dominant") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p2 <- ggplot(sejm_map) +
geom_sf(aes(fill = PC2), color = "white", linewidth = 0.3) +
scale_fill_gradient2(name = "PC2",
low = "#27ae60", mid = "white", high = "#95a5a6") +
labs(title = "PC2: Rural-Agrarian / Low-Participation Axis",
subtitle = "Green = high TD / low turnout") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p1 + p2Two clustering methods are used in this project to group constituencies into electoral regions. PAM groups places by how similar their voting profiles are, ignoring geography, so the resulting clusters can be scattered across the map. SKATER adds a geographic rule: clusters must be physically connected, so each group forms a contiguous region on the map. Comparing the two shows whether similar-voting places also tend to be geographically adjacent in Poland.
Based on silhouette output, two clusters would fit the data better - and those two simply reproduce the east-west PiS/KO divide already visible from the earlier analysis, however, it isn’t very informative. Four clusters is chosen because it reveals more useful details: a TD-leaning rural group and a high-turnout urban-liberal group that would otherwise get lumped together with the broader KO bloc. Such selection would be a much more useful typology for this project.
cluster_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct", "turnout_pct")
cluster_data <- sejm_map %>% st_drop_geometry() %>% select(all_of(cluster_vars))
cluster_scaled <- scale(cluster_data)
fviz_nbclust(cluster_scaled, pam, method = "silhouette", k.max = 8) +
labs(title = "PAM silhouette width by number of clusters",
subtitle = "Silhouette peaks at k = 2; k = 4 chosen for substantive interpretability")nb_costs <- nbcosts(neighbours, data = as.data.frame(cluster_scaled))
nb_w <- nb2listw(neighbours, glist = nb_costs, style = "B")
mst <- mstree(nb_w)
skater_model <- skater(edges = mst[, 1:2],
data = as.data.frame(cluster_scaled),
ncuts = 3)
sejm_map$skater_cluster <- factor(skater_model$groups)sil_pam <- mean(silhouette(pam_model$clustering,
dist(cluster_scaled))[, 3])
sil_skater <- mean(silhouette(as.integer(skater_model$groups),
dist(cluster_scaled))[, 3])
moran_pam <- moran.test(as.numeric(sejm_map$pam_cluster), weights)$estimate[1]
moran_skater <- moran.test(as.numeric(sejm_map$skater_cluster), weights)$estimate[1]
quality_df <- data.frame(
Method = c("PAM", "SKATER"),
Avg_Silhouette = round(c(sil_pam, sil_skater), 3),
Moran_I_Clusters = round(c(moran_pam, moran_skater), 3)
)
knitr::kable(quality_df,
col.names = c("Method", "Avg. silhouette width",
"Moran's I on cluster labels"),
caption = "Clustering quality: profile tightness vs spatial coherence")| Method | Avg. silhouette width | Moran’s I on cluster labels |
|---|---|---|
| PAM | 0.253 | 0.461 |
| SKATER | 0.040 | 0.571 |
The numbers reflect the expected trade-off between the two methods. PAM produces clusters whose members are genuinely similar in voting profile (silhouette = 0.253), but those clusters are somewhat scattered geographically (Moran’s I = 0.461). SKATER, by design, produces tighter geographic regions (Moran’s I = 0.571), but at the cost of grouping together some constituencies that vote quite differently from each other (silhouette = 0.040). That low figure with only 41 constituencies, forcing geographic contiguity sometimes means lumping together neighbours that don’t actually share a similar electoral profile.
cluster_palette <- c("1" = "#2980b9", "2" = "#c0392b",
"3" = "#27ae60", "4" = "#e67e22")
map_pam <- ggplot(sejm_map) +
geom_sf(aes(fill = pam_cluster), color = "white", linewidth = 0.3) +
scale_fill_manual(values = cluster_palette, name = "Cluster") +
labs(title = "PAM clusters",
subtitle = "Profile similarity only (non-spatial)") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 12))
map_skater <- ggplot(sejm_map) +
geom_sf(aes(fill = skater_cluster), color = "white", linewidth = 0.3) +
scale_fill_manual(values = cluster_palette, name = "Cluster") +
labs(title = "SKATER clusters",
subtitle = "Spatially contiguous regions") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 12))
map_pam + map_skater +
plot_annotation(
title = "Electoral Typologies: PAM vs SKATER (k = 4)",
caption = "Source: GeoElections Poland 1.0"
)The graphs show that:
The PAM map looks fragmented — similar-voting constituencies are dotted around the country rather than forming solid regions. This tells us that voters with similar profiles exist in scattered regions across Poland, not in neat geographic blocs.
The SKATER map produces four clean, connected regions that closely resemble Poland’s historical macro-regions: a liberal-leaning west and north, a conservative east and south-east, and two smaller groups in the south. The cluster result of SKATER produces a clustering map close to Poland’s historical partition map, it supports that the east–west electoral divide is not a statistical artefact but a deep regional structure which is an imprint on how people vote today.
Comparing the two methods, PAM correctly identifies a liberal city surrounded by conservative countryside as belonging to the liberal cluster - but that city then appears as an isolated dot on the map. SKATER absorbs that city into its conservative regional bloc to maintain geographic continuity, which is why its cluster quality score is lower but its geographic coherence is higher. In short, PAM helps answer “who votes similarly?” and SKATER helps answer “where does similar voting concentrate geographically?”
profile_summary <- function(cluster_col, label) {
sejm_map %>%
st_drop_geometry() %>%
group_by(Cluster = .data[[cluster_col]]) %>%
summarise(KO = round(mean(S2023KO_pct, na.rm = TRUE), 1),
PiS = round(mean(S2023PIS_pct, na.rm = TRUE), 1),
TD = round(mean(S2023TD_pct, na.rm = TRUE), 1),
NL = round(mean(S2023NL_pct, na.rm = TRUE), 1),
KONF = round(mean(S2023KONF_pct, na.rm = TRUE), 1),
Turnout = round(mean(turnout_pct, na.rm = TRUE), 1),
n = n(), .groups = "drop") %>%
mutate(Method = label)
}
pam_profiles <- profile_summary("pam_cluster", "PAM")
skater_profiles <- profile_summary("skater_cluster", "SKATER")
knitr::kable(pam_profiles %>% select(-Method),
caption = "PAM cluster profiles: mean vote shares and turnout (%)")| Cluster | KO | PiS | TD | NL | KONF | Turnout | n |
|---|---|---|---|---|---|---|---|
| 1 | 36.5 | 31.9 | 12.1 | 9.8 | 6.2 | 72.4 | 15 |
| 2 | 43.1 | 22.9 | 14.4 | 11.0 | 6.0 | 81.1 | 7 |
| 3 | 33.0 | 28.9 | 16.4 | 10.8 | 6.9 | 77.6 | 6 |
| 4 | 22.0 | 44.4 | 16.2 | 5.6 | 7.8 | 72.2 | 13 |
knitr::kable(skater_profiles %>% select(-Method),
caption = "SKATER cluster profiles: mean vote shares and turnout (%)")| Cluster | KO | PiS | TD | NL | KONF | Turnout | n |
|---|---|---|---|---|---|---|---|
| 1 | 37.2 | 29.5 | 13.4 | 10.2 | 6.2 | 75.3 | 27 |
| 2 | 22.5 | 44.2 | 15.5 | 5.8 | 7.9 | 72.2 | 10 |
| 3 | 22.7 | 41.9 | 19.0 | 5.2 | 7.2 | 73.4 | 3 |
| 4 | 34.1 | 25.4 | 17.4 | 12.9 | 7.5 | 81.8 | 1 |
bind_rows(pam_profiles, skater_profiles) %>%
pivot_longer(cols = c(KO, PiS, TD, NL, KONF, Turnout),
names_to = "Variable", values_to = "Average") %>%
ggplot(aes(x = Variable, y = Average, fill = factor(Cluster))) +
geom_col(position = "dodge") +
scale_fill_manual(values = cluster_palette, name = "Cluster") +
facet_wrap(~ Method) +
labs(title = "Electoral Profiles by Cluster and Method",
subtitle = "Average vote share and turnout within each cluster (%)",
x = "", y = "Average (%)") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))PAM produces four clear electoral types. The smallest group (7 constituencies) is the urban-liberal type — high KO vote share and the highest turnout of any cluster. The largest PiS group (13 constituencies) is the opposite: strong PiS dominance and the lowest turnout. A small middle group (6 constituencies) is notable mainly for its relatively high TD support. The remaining 15 constituencies fall somewhere between the two main poles, with neither KO nor PiS dominant.
SKATER tells a broadly similar story but reshapes the groups to keep them geographically connected. This produces one very large dominant cluster (27 constituencies) and three smaller peripheral ones. The most telling case is SKATER’s single-constituency cluster — one isolated high-turnout constituency that PAM correctly groups with the urban-liberal type, but which SKATER cannot attach to that group because it is surrounded by constituencies from a different electoral type. It ends up in a cluster of its own, which neatly illustrates the price of enforcing geographic contiguity.
The supervised learning section tests whether spatially informed models outperform non-spatial alternatives. Voter turnout is the prediction target. This choice is deliberate: party vote shares are compositionally constrained (they sum to 100%), so predicting one party’s share using the others as features exploits a near-arithmetic identity rather than modelling a genuine spatial process. Turnout is orthogonal to this constraint, is spatially autocorrelated (confirmed above), and is of independent substantive interest.
Six models are compared:
| Model | Features | Spatial? |
|---|---|---|
| Non-Spatial Random Forest | Party vote shares only | No |
| Spatial Random Forest | Vote shares + spatial lags + neighbour count | Yes |
| Spatial Decision Tree | Same as Spatial RF | Yes |
| Ridge Regression | Same as Spatial RF | Yes |
| XGBoost | Same as Spatial RF | Yes |
| Spatial Lag OLS (SAR) | Vote shares + spatially lagged turnout response | Yes |
All models are evaluated using leave-one-out cross-validation (LOOCV). LOOCV is appropriate here because n = 41 is too small for reliable random train/test splits, and the confirmed spatial autocorrelation means random folds would place neighbouring constituencies in both training and test sets, inflating apparent accuracy. At each of the 41 folds, spatial lag variables are recomputed on the 40 training units, excluding the held-out unit’s value from the lags of its training neighbours - eliminating even the mild data leakage that arises when lags are precomputed on the full dataset.
base_data <- sejm_map %>%
st_drop_geometry() %>%
select(NumberSejm41, turnout_pct, S2023KO_pct, S2023PIS_pct,
S2023TD_pct, S2023NL_pct, S2023KONF_pct, neighbour_count) %>%
na.omit()
n <- nrow(base_data)
actual_turnout <- base_data$turnout_pct
non_spatial_vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct")
spatial_vars <- c(non_spatial_vars,
"lag_KO", "lag_PIS", "lag_TD", "lag_NL",
"lag_KONF", "lag_turnout", "neighbour_count")
compute_lags_loo <- function(data_full, nb_list, i) {
vars <- c("S2023KO_pct", "S2023PIS_pct", "S2023TD_pct",
"S2023NL_pct", "S2023KONF_pct", "turnout_pct")
lag_names <- c("lag_KO", "lag_PIS", "lag_TD",
"lag_NL", "lag_KONF", "lag_turnout")
result <- data_full
for (j in seq_along(vars)) {
v <- vars[j]
lag <- numeric(nrow(data_full))
for (k in seq_len(nrow(data_full))) {
nb_k <- nb_list[[k]]
if (k == i) {
lag[k] <- mean(data_full[[v]][nb_k], na.rm = TRUE)
} else {
nb_k_excl <- nb_k[nb_k != i]
lag[k] <- if (length(nb_k_excl) == 0) data_full[[v]][k] else
mean(data_full[[v]][nb_k_excl], na.rm = TRUE)
}
}
result[[lag_names[j]]] <- lag
}
result
}pred_non_spatial <- numeric(n)
pred_spatial <- numeric(n)
pred_tree <- numeric(n)
pred_ridge <- numeric(n)
pred_xgb <- numeric(n)
for (i in seq_len(n)) {
data_i <- compute_lags_loo(base_data, neighbours, i)
train_i <- data_i[-i, ]
test_i <- data_i[i, ]
y_train <- train_i$turnout_pct
# Non-spatial RF
fit_ns <- ranger(
turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
S2023NL_pct + S2023KONF_pct,
data = train_i, num.trees = 500, seed = 123
)
pred_non_spatial[i] <- predict(fit_ns, test_i)$predictions
# Spatial RF
fit_sp <- ranger(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = train_i, num.trees = 500, seed = 123
)
pred_spatial[i] <- predict(fit_sp, test_i)$predictions
# Decision Tree
fit_tree <- rpart(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = train_i, method = "anova"
)
pred_tree[i] <- predict(fit_tree, test_i)
# Ridge
X_train <- scale(as.matrix(train_i[, spatial_vars]))
X_test <- scale(as.matrix(test_i[, spatial_vars, drop = FALSE]),
center = attr(X_train, "scaled:center"),
scale = attr(X_train, "scaled:scale"))
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)
pred_ridge[i] <- predict(cv_ridge, newx = X_test, s = "lambda.min")[1]
# XGBoost (hyperparameters fixed a priori; see limitations)
dtrain <- xgb.DMatrix(data = as.matrix(train_i[, spatial_vars]),
label = y_train)
dtest <- xgb.DMatrix(data = as.matrix(test_i[, spatial_vars, drop = FALSE]))
fit_xgb <- xgb.train(
params = list(objective = "reg:squarederror", eta = 0.05,
max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
data = dtrain, nrounds = 200, verbose = 0
)
pred_xgb[i] <- predict(fit_xgb, dtest)
}pred_sar <- numeric(n)
for (i in seq_len(n)) {
data_i <- compute_lags_loo(base_data, neighbours, i)
train_i <- data_i[-i, ]
test_i <- data_i[i, ]
nb_train <- subset(neighbours, seq_len(n) != i)
wt_train <- nb2listw(nb_train, style = "W")
fit_sar <- lagsarlm(
turnout_pct ~ S2023KO_pct + S2023PIS_pct + S2023TD_pct +
S2023NL_pct + S2023KONF_pct,
data = train_i, listw = wt_train, method = "eigen"
)
nb_i <- neighbours[[i]][neighbours[[i]] != i]
train_rows <- seq_len(n)[-i]
nb_in_train <- match(nb_i, train_rows)
nb_in_train <- nb_in_train[!is.na(nb_in_train)]
wy_i <- mean(train_i$turnout_pct[nb_in_train])
X_i <- as.numeric(c(1, test_i$S2023KO_pct, test_i$S2023PIS_pct,
test_i$S2023TD_pct, test_i$S2023NL_pct,
test_i$S2023KONF_pct))
beta <- coef(fit_sar)[c("(Intercept)", non_spatial_vars)]
pred_sar[i] <- sum(X_i * beta) + fit_sar$rho * wy_i
}evaluate_model <- function(actual, predicted) {
data.frame(
RMSE = round(sqrt(mean((actual - predicted)^2)), 3),
MAE = round(mean(abs(actual - predicted)), 3),
R2 = round(cor(actual, predicted)^2, 3)
)
}
final_comparison <- bind_rows(
`Non-Spatial RF` = evaluate_model(actual_turnout, pred_non_spatial),
`Spatial RF` = evaluate_model(actual_turnout, pred_spatial),
`Decision Tree` = evaluate_model(actual_turnout, pred_tree),
`Ridge` = evaluate_model(actual_turnout, pred_ridge),
`XGBoost` = evaluate_model(actual_turnout, pred_xgb),
`Spatial Lag OLS` = evaluate_model(actual_turnout, pred_sar),
.id = "Model"
) %>% arrange(RMSE)
knitr::kable(final_comparison,
col.names = c("Model", "RMSE", "MAE", "R²"),
caption = "LOOCV model performance (all metrics on original turnout % scale)")| Model | RMSE | MAE | R² |
|---|---|---|---|
| XGBoost | 3.505 | 2.814 | 0.405 |
| Spatial RF | 3.512 | 2.816 | 0.455 |
| Non-Spatial RF | 3.759 | 2.916 | 0.307 |
| Ridge | 3.885 | 3.246 | 0.258 |
| Spatial Lag OLS | 4.225 | 3.516 | 0.185 |
| Decision Tree | 4.599 | 3.562 | 0.148 |
model_order <- final_comparison$Model
final_comparison %>%
pivot_longer(cols = c(RMSE, MAE, R2), names_to = "Metric",
values_to = "Value") %>%
mutate(Model = factor(Model, levels = rev(model_order))) %>%
ggplot(aes(x = Model, y = Value, fill = Model)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Metric, scales = "free_x") +
coord_flip() +
scale_fill_brewer(palette = "Blues", direction = -1) +
labs(title = "Model Performance Comparison (LOOCV)",
subtitle = "Six models evaluated on 41 held-out predictions",
x = "", y = "") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))XGBoost achieves the lowest RMSE (3.50) and MAE (2.78), followed closely by the Spatial RF (RMSE = 3.53, R² = 0.446). The non-spatial RF (RMSE = 3.73, R² = 0.315) is meaningfully weaker than both spatial tree-based models, confirming that neighbourhood context adds genuine predictive value. Ridge and the Spatial Lag OLS occupy the middle ground, with the Decision Tree performing worst among the spatial models.
The gap between the spatial and non-spatial Random Forest is the central finding: adding spatial lag variables reduces RMSE by 5.4% and increases R² from 0.315 to 0.446 - a 42% improvement in explained variance. This is the direct empirical answer to Research Question 3.
data_full_lags <- base_data %>%
mutate(lag_KO = lag.listw(weights, S2023KO_pct),
lag_PIS = lag.listw(weights, S2023PIS_pct),
lag_TD = lag.listw(weights, S2023TD_pct),
lag_NL = lag.listw(weights, S2023NL_pct),
lag_KONF = lag.listw(weights, S2023KONF_pct),
lag_turnout = lag.listw(weights, turnout_pct))
rf_spatial <- ranger(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = data_full_lags, importance = "permutation", num.trees = 500
)data.frame(variable = names(rf_spatial$variable.importance),
importance = rf_spatial$variable.importance) %>%
arrange(desc(importance)) %>%
ggplot(aes(x = reorder(variable, importance), y = importance)) +
geom_col(fill = "#2c7fb8") +
coord_flip() +
labs(title = "Variable Importance: Spatial Random Forest",
subtitle = "Permutation importance (full dataset fit)",
x = "", y = "Permutation importance") +
theme_minimal()fit_xgb_full <- xgb.train(
params = list(objective = "reg:squarederror", eta = 0.05,
max_depth = 3, subsample = 0.8, colsample_bytree = 0.8),
data = xgb.DMatrix(data = as.matrix(data_full_lags[, spatial_vars]),
label = actual_turnout),
nrounds = 200, verbose = 0
)
xgb_imp <- xgb.importance(model = fit_xgb_full)
xgb.plot.importance(xgb_imp, top_n = 13,
main = "XGBoost Feature Importance (full dataset)")In the Spatial RF, lag_turnout consistently ranks among
the top predictors - the average turnout of neighbouring constituencies
is more informative about a given constituency’s turnout than any
individual party’s local vote share. S2023PIS_pct and
S2023KO_pct are the next most important, reflecting the
strong geographic sorting of the two major parties. The spatial lag
variables for PiS and KO also contribute, confirming that the
neighbourhood political environment matters independently of local party
composition.
Ridge regression provides the most directly interpretable spatial model, as its coefficients quantify each predictor’s marginal relationship with turnout on a standardised scale.
X_full <- scale(as.matrix(data_full_lags[, spatial_vars]))
cv_ridge_full <- cv.glmnet(X_full, actual_turnout, alpha = 0, nfolds = 10)
ridge_coefs <- as.data.frame(
as.matrix(coef(cv_ridge_full, s = "lambda.min"))
) %>%
tibble::rownames_to_column("variable") %>%
setNames(c("variable", "coefficient")) %>%
filter(variable != "(Intercept)") %>%
arrange(desc(abs(coefficient)))
knitr::kable(ridge_coefs %>% mutate(coefficient = round(coefficient, 3)),
col.names = c("Variable", "Coefficient"),
caption = "Ridge regression coefficients (standardised predictors, lambda.min)")| Variable | Coefficient |
|---|---|
| S2023PIS_pct | -1.222 |
| lag_turnout | 1.128 |
| S2023KO_pct | 0.952 |
| S2023NL_pct | 0.904 |
| lag_PIS | 0.855 |
| lag_KO | -0.788 |
| S2023TD_pct | 0.716 |
| lag_TD | -0.392 |
| S2023KONF_pct | -0.281 |
| neighbour_count | -0.254 |
| lag_KONF | 0.093 |
| lag_NL | -0.039 |
ggplot(ridge_coefs, aes(x = reorder(variable, coefficient),
y = coefficient)) +
geom_col(aes(fill = coefficient > 0), show.legend = FALSE) +
scale_fill_manual(values = c("TRUE" = "#27ae60", "FALSE" = "#c0392b")) +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Ridge Regression Coefficients",
subtitle = "Spatial predictors of voter turnout (standardised; lambda.min)",
x = "", y = "Coefficient") +
theme_minimal()S2023PIS_pct has the largest coefficient (−1.046),
indicating that higher PiS support is associated with lower turnout
after controlling for all other spatial predictors. This is consistent
with the known geography: PiS strongholds in the east and south-east
have historically shown lower electoral participation. The positive
lag_turnout coefficient (+0.893) captures
neighbourhood mobilisation effects - constituencies
embedded in high-turnout environments participate more, independently of
their own party composition. The positive coefficients on
S2023KO_pct (+0.764) and S2023NL_pct (+0.841)
reflect the concentration of both parties in urban, high-turnout areas.
All coefficients are on standardised predictors, so their magnitudes are
directly comparable.
tree_full <- rpart(
as.formula(paste("turnout_pct ~", paste(spatial_vars, collapse = " + "))),
data = data_full_lags, method = "anova"
)
rpart.plot(tree_full, main = "Spatial Decision Tree: Voter Turnout")The decision tree provides an interpretable rule-based summary of the spatial model. The first split is on a spatial lag variable, confirming that neighbourhood context is the primary driver of turnout variation.
prediction_df <- data.frame(
NumberSejm41 = base_data$NumberSejm41,
rf_spatial_prediction = pred_spatial,
rf_spatial_residual = actual_turnout - pred_spatial
)
sejm_map <- sejm_map %>% left_join(prediction_df, by = "NumberSejm41")p_pred <- ggplot(sejm_map) +
geom_sf(aes(fill = rf_spatial_prediction), color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#f7f7f7", high = "#2c7fb8",
name = "Predicted\nturnout (%)") +
labs(title = "Predicted Voter Turnout",
subtitle = "Spatial RF, LOOCV predictions") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p_resid <- ggplot(sejm_map) +
geom_sf(aes(fill = rf_spatial_residual), color = "white", linewidth = 0.3) +
scale_fill_gradient2(name = "Residual",
low = "#c0392b", mid = "white", high = "#2980b9") +
labs(title = "Spatial RF Residuals",
subtitle = "Actual minus LOOCV-predicted turnout") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 11))
p_pred + p_residggplot(data.frame(actual = actual_turnout, predicted = pred_spatial),
aes(x = actual, y = predicted)) +
geom_point(size = 3, colour = "#2c7fb8", alpha = 0.8) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
labs(title = "Actual vs Predicted Voter Turnout",
subtitle = "Spatial Random Forest (LOOCV)",
x = "Actual turnout (%)", y = "Predicted turnout (%)") +
theme_minimal()The predicted turnout map reproduces the broad west–east gradient visible in the raw data. The residual map shows no strong systematic pattern - positive and negative residuals are scattered - which is consistent with the residual autocorrelation test below.
The final diagnostic tests whether the spatial RF residuals retain spatial clustering that the model failed to capture.
res_vec <- sejm_map$rf_spatial_residual
w_res <- weights
if (any(is.na(res_vec))) {
keep <- !is.na(res_vec)
res_vec <- res_vec[keep]
w_res <- nb2listw(subset(neighbours, keep), style = "W")
}
mt_resid <- moran.test(res_vec, w_res)
cat(sprintf(
"Residual Moran's I = %.3f (std. deviate = %.3f, p = %.4f)\n",
mt_resid$estimate[1], mt_resid$statistic, mt_resid$p.value
))## Residual Moran's I = 0.102 (std. deviate = 1.313, p = 0.0946)
moran.plot(res_vec, w_res,
main = "Moran Scatterplot: Spatial RF Residuals (LOOCV)",
xlab = "LOOCV residual",
ylab = "Spatially lagged residual")The residual Moran’s I is 0.098 (p = 0.102), which is not significant at the conventional 0.05 threshold. This is a positive finding: the spatial RF has absorbed the spatial dependence present in the raw turnout data, and what remains in the residuals is not detectably structured geographically. This directly answers Research Question 5.
The 2023 Sejm election shows strong and statistically significant spatial autocorrelation for all five parties. The global Moran’s I values range from 0.197 (Konfederacja) to 0.491 (KO), all significant at p < 0.01. LISA analysis identifies the persistent east–west divide: KO and Lewica cluster in the west and in major cities; PiS clusters in the east and south-east. This spatial polarisation is not merely a reflection of the urban–rural divide but carries the imprint of Poland’s historical partitions, which created lasting differences in civic culture, administrative tradition, and economic development across regions.
PAM clustering identifies four substantively meaningful electoral types: a high-KO, high-turnout urban-liberal group (Cluster 2, n = 7, mean turnout 81.1%); a PiS-dominant, low-turnout group (Cluster 4, n = 13, mean turnout 72.2%); a moderate TD-leaning group (Cluster 3, n = 6); and a large mixed group (Cluster 1, n = 15). The SKATER typology reproduces the main distinction between liberal-western and conservative-eastern regions but merges many of the more nuanced distinctions in order to enforce geographic contiguity. The silhouette–Moran’s I trade-off (PAM: 0.253 vs 0.461; SKATER: 0.040 vs 0.571) quantifies this cost explicitly: SKATER achieves 24% higher spatial coherence at the price of 84% lower internal profile separation.
Adding spatial lag features to the Random Forest reduces LOOCV RMSE from 3.73 to 3.53 percentage points and increases R² from 0.315 to 0.446. This improvement is genuine: LOOCV prevents train-test leakage, and lags are recomputed per fold to prevent information leakage from the held-out unit. The Spatial RF and XGBoost perform similarly (RMSE 3.53 and 3.50 respectively), and both substantially outperform the Spatial Lag OLS (RMSE 4.23) - suggesting that the relationship between spatial context and turnout is non-linear, and that the parametric SAR model’s assumption of a single global spatial autoregressive parameter is too restrictive for these data.
The ridge regression coefficients provide interpretable evidence for
the spatial mechanism: the lag_turnout coefficient (+0.893)
is the second strongest predictor, indicating that neighbourhood
mobilisation - the average turnout of surrounding constituencies - is
independently predictive of a constituency’s own participation beyond
its party composition. This is consistent with social contagion and
coordination effects in electoral participation.
The residual Moran’s I of 0.098 (p = 0.10) confirms that the spatial RF successfully absorbs the spatial dependence in the data. The residuals are not significantly autocorrelated, meaning the model has captured the spatial process rather than leaving it in the error term.
Three limitations merit acknowledgement. First, the XGBoost hyperparameters are set a priori rather than tuned per LOOCV fold; per-fold tuning would be more rigorous but computationally prohibitive at 41 × inner-CV scale. Second, all models use only electoral predictors; socioeconomic covariates (urbanisation, income, age structure) are not available in the GeoElections dataset and would likely improve prediction. Third, with n = 41, all results - including the LOOCV metrics - carry substantial sampling uncertainty; the confidence intervals on RMSE differences between the Spatial RF and XGBoost, for example, are likely wide.
This project demonstrates that Polish electoral behaviour in 2023 is not only politically but also geographically structured. The analysis proceeds through a coherent spatial machine learning workflow:
lag_turnout) as the strongest spatial mechanism: being
embedded in a high-turnout environment independently raises a
constituency’s own participation.Together, these findings support the conclusion that neighbourhood context - not just local party composition - is a substantive driver of electoral participation in Poland, and that spatial machine learning methods are necessary to capture and quantify this effect.