This project utilizes Exploratory Factor Analysis (EFA) and t-SNE to deconstruct the “Quality of Life” index across 380 Polish Poviats. Moving beyond the traditional “Poland A vs. B” model, the analysis identifies three latent drivers: Social Infrastructure, Ecological Buffer, and Economic Vitality. The results highlight that regional well-being is increasingly defined by urban-hub dynamics rather than simple geographic location, providing a data-driven basis for regional development policies.
The socio-economic landscape of Poland is frequently described through the lens of historical partitions, resulting in a perceived East-West developmental split. However, modern regional development is influenced by a complex interplay of labor markets, educational accessibility, and ecological quality.
This project reduces the dimensionality of these variables to uncover the latent structures behind regional well-being. Using EFA, supported by PCA and t-SNE, we identify the fundamental “pillars” of Polish quality of life.
Data are retrieved from the Statistics Poland Local Data Bank (BDL) for 2023 at the Poviat level (LAU-1). We download eight variables: seven quality-of-life indicators plus population, which serves only as a denominator. The selection focuses on metrics for which robust data are available.
We engineer one feature: hospital capacity is expressed as general-hospital beds per 1,000 residents (total beds divided by population in thousands), so that the metric is comparable across poviats of different sizes.
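As a minimal sketch of this normalization, the snippet below applies the same division to three rows copied from the raw-data preview shown later in this report. The data frame name `preview` is an assumption made for illustration; the column names mirror those used in the preprocessing step.

```r
# Three poviats copied from the raw-data preview in this report.
preview <- data.frame(
  name          = c("Powiat bocheński", "Powiat krakowski", "Powiat miechowski"),
  hospitals_raw = c(210, 308, 199),        # general hospital beds, total
  pop_k         = c(106.95, 302.35, 46.67) # population in thousands
)

# Beds per 1,000 residents: total beds divided by population in thousands.
preview$hospitals <- preview$hospitals_raw / preview$pop_k

print(round(preview$hospitals, 2))
# -> 1.96 1.02 4.26
```

Dividing by population in thousands (rather than raw population) keeps the result on a "per 1k" scale, consistent with the other per-capita indicators.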
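# Load the packages used throughout the analysis
library(bdl)         # get_data_by_variable()
library(dplyr)       # select(), mutate(), distinct(), joins
library(tidyr)       # drop_na()
library(tibble)      # column_to_rownames()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(psych)       # describe(), KMO(), cortest.bartlett(), fa()
library(corrplot)    # corrplot()
library(factoextra)  # fviz_pca_var()
library(Rtsne)       # Rtsne()
library(ggplot2)     # ggplot()
library(ggrepel)     # geom_text_repel()
library(tmap)        # tm_shape(), tm_polygons()
library(sf)          # spatial data classes for the map join
# Fetch data from BDL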
var_ids <- c(
"wages" = "64429", # Avg monthly gross wages (rel. to Poland = 100)
"entities" = "634123", # Entities per 1k population
"libraries" = "1725594", # Library users per 1k population
"hospitals" = "152354", # General hospitals - total beds
"preschool" = "1612315", # Children in preschool per 1k (aged 3-6)
"forest" = "194828", # Forest cover %
"waste" = "60585", # Share of waste recovered %
"population" = "1645341" # Population in thousands
)
df_raw <- get_data_by_variable(varId = var_ids, unitLevel = 5, year = 2023, lang = "en")
head(df_raw) %>%
kable(caption = "Preview of Raw Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| id | name | year | val_64429 | val_634123 | val_1725594 | val_152354 | val_1612315 | val_194828 | val_60585 | val_1645341 |
|---|---|---|---|---|---|---|---|---|---|---|
| 011212001000 | Powiat bocheński | 2023 | 81.4 | 107.3 | 148 | 210 | 954.8 | 28.6 | 4.5 | 106.95 |
| 011212006000 | Powiat krakowski | 2023 | 95.5 | 137.0 | 124 | 308 | 906.6 | 12.1 | 76.5 | 302.35 |
| 011212008000 | Powiat miechowski | 2023 | 80.7 | 111.0 | 117 | 199 | 932.8 | 12.0 | 0.0 | 46.67 |
| 011212009000 | Powiat myślenicki | 2023 | 81.9 | 124.7 | 126 | 240 | 937.2 | 35.6 | 0.0 | 129.84 |
| 011212014000 | Powiat proszowicki | 2023 | 81.3 | 108.6 | 94 | 196 | 926.7 | 1.5 | 0.0 | 42.01 |
| 011212019000 | Powiat wielicki | 2023 | 92.4 | 141.1 | 171 | 0 | 968.9 | 15.8 | 0.0 | 143.63 |

The remaining columns (measureUnitId_*, measureName_*, attrId_*, and attributeDescription_*, one set per variable) hold unit and attribute metadata and are omitted here for width. The recoverable units are % for val_64429, val_194828, and val_60585; person for val_1725594 and val_1612315; and thousand persons for val_1645341.

# Preprocessing and Feature Engineering
df_clean <- df_raw %>%
select(id, name,
wages = val_64429, entities = val_634123, libraries = val_1725594,
hospitals_raw = val_152354, pop_k = val_1645341, preschool = val_1612315,
forest = val_194828, waste = val_60585
) %>%
drop_na() %>%
mutate(hospitals = hospitals_raw / pop_k) %>% # Normalize beds per 1k pop
select(-hospitals_raw, -pop_k)
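# Drop rows that are identical across all seven indicators, because Rtsne()
# rejects duplicate rows by default; names stay aligned with the analysis matrix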
df_unique_rows <- df_clean %>%
distinct(wages, entities, libraries, hospitals, preschool, forest, waste, .keep_all = TRUE)
df_analysis <- df_unique_rows %>%
column_to_rownames("id") %>%
select(-name)
df_scaled <- scale(df_analysis) # z-score each indicator (mean 0, sd 1)

Before working with the standardized data, it is useful to inspect the baseline distributions of the raw indicators.
# Calculate summary stats for raw variables
desc_stats <- describe(df_analysis)[, c("mean", "median", "min", "max", "sd")]
kable(desc_stats, digits = 2, caption = "Descriptive Statistics of Raw Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | mean | median | min | max | sd |
|---|---|---|---|---|---|
| wages | 86.37 | 84.45 | 0.0 | 168.60 | 11.05 |
| entities | 115.50 | 110.25 | 69.2 | 296.90 | 31.38 |
| libraries | 124.69 | 116.00 | 48.0 | 291.00 | 38.10 |
| preschool | 931.96 | 914.05 | 623.5 | 1309.50 | 100.70 |
| forest | 25.90 | 24.35 | 0.0 | 70.30 | 13.40 |
| waste | 15.90 | 0.65 | 0.0 | 100.00 | 27.55 |
| hospitals | 3.73 | 3.08 | 0.0 | 14.44 | 2.68 |
The correlation matrix reveals a cluster of positively correlated urban metrics. Preschool enrollment correlates positively with hospitals (0.57) and entities (0.51), which suggests that social infrastructure tends to follow economic density.
res_cor <- cor(df_analysis)
corrplot(res_cor, method = "color", type = "upper", addCoef.col = "black", tl.col = "black", diag = FALSE)

We run two tests to verify that the data are suitable for factor analysis: the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity.
# KMO Test
kmo_res <- KMO(res_cor)
print(kmo_res)
# Bartlett's Test
bartlett_res <- cortest.bartlett(res_cor, n = nrow(df_analysis))
print(paste("Bartlett's Test p-value:", bartlett_res$p.value))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = res_cor)
## Overall MSA = 0.74
## MSA for each item =
## wages entities libraries preschool forest waste hospitals
## 0.80 0.76 0.84 0.70 0.64 0.69 0.73
## [1] "Bartlett's Test p-value: 3.71006558501937e-96"
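The overall KMO of 0.74 indicates acceptable sampling adequacy, and every item-level MSA is at least 0.64. The highly significant p-value from Bartlett’s test (p ≈ 3.7e-96) rejects the hypothesis that the correlation matrix is an identity matrix, so the variables are correlated enough for dimension reduction.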
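To make the two checks less of a black box, here is a base-R sketch of the statistics that `KMO()` and `cortest.bartlett()` report. The function names `bartlett_sphericity` and `kmo_overall` and the toy data are assumptions made for illustration; the report itself relies on the psych implementations.

```r
# Bartlett's test of sphericity: H0 is that R is an identity matrix.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Overall KMO: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal elements only).
kmo_overall <- function(R) {
  R_inv <- solve(R)
  # Partial correlations derived from the inverse correlation matrix
  partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  off <- row(R) != col(R)
  sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))
}

# Toy data (assumption): four indicators sharing one common component.
set.seed(1)
common <- rnorm(200)
toy <- sapply(1:4, function(i) common + rnorm(200))
R_toy <- cor(toy)

bt <- bartlett_sphericity(R_toy, n = nrow(toy))
kmo <- kmo_overall(R_toy)
print(bt$p.value)
print(kmo)
```

A KMO close to 1 means that pairwise correlations are large relative to partial correlations, which is the pattern expected when a few common factors drive the variables.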
To determine the optimal number of factors, we inspect the eigenvalues and the scree plot.
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
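The message above matches the format printed by `psych::fa.parallel`, although the call itself is not shown in this report. As a rough illustration of the idea, the sketch below implements a simplified, component-based variant of parallel analysis in base R: observed eigenvalues are retained only while they exceed those obtained from uncorrelated random data of the same shape. The helper `parallel_components` and the toy data are hypothetical, and the factor-based procedure in psych differs in detail.

```r
# Simplified parallel analysis on principal-component eigenvalues.
# `parallel_components` is a hypothetical helper, not a psych function.
parallel_components <- function(X, n_iter = 100, quant = 0.95) {
  n <- nrow(X)
  p <- ncol(X)
  observed <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  # Eigenvalues of correlation matrices computed from uncorrelated normal data
  simulated <- replicate(n_iter, {
    eigen(cor(matrix(rnorm(n * p), n, p)), symmetric = TRUE,
          only.values = TRUE)$values
  })
  threshold <- apply(simulated, 1, quantile, probs = quant)
  # Keep leading components until the first one that fails the threshold
  keep <- observed > threshold
  n_retain <- if (all(keep)) p else which(!keep)[1] - 1
  list(observed = observed, threshold = threshold, n_retain = n_retain)
}

# Toy data (assumption): six indicators driven by two independent factors.
set.seed(42)
f1 <- rnorm(300)
f2 <- rnorm(300)
toy <- cbind(sapply(1:3, function(i) f1 + rnorm(300, sd = 0.6)),
             sapply(1:3, function(i) f2 + rnorm(300, sd = 0.6)))

pa <- parallel_components(toy)
print(pa$n_retain)
```

Comparing against a random-data benchmark is stricter than the "eigenvalue greater than 1" rule, which tends to retain too many dimensions in small samples.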
We extract three factors, as suggested by the parallel analysis and the scree plot. The table below details the variance explained by each factor.
We apply an oblique rotation (Promax), which allows the factors to correlate, because we hypothesize that economic and social development are interlinked.
fa_res <- fa(df_scaled, nfactors = 3, rotate = "promax", fm = "minres")
# Display Variance Explained
kable(fa_res$Vaccounted, digits = 2, caption = "Variance Explained by Factors") %>%
kable_styling()
# Display Loadings
print(fa_res$loadings, cutoff = 0.3, sort = TRUE)

| | MR1 | MR2 | MR3 |
|---|---|---|---|
| SS loadings | 1.56 | 1.06 | 0.92 |
| Proportion Var | 0.22 | 0.15 | 0.13 |
| Cumulative Var | 0.22 | 0.37 | 0.51 |
| Proportion Explained | 0.44 | 0.30 | 0.26 |
| Cumulative Proportion | 0.44 | 0.74 | 1.00 |
##
## Loadings:
## MR1 MR2 MR3
## preschool 1.019
## hospitals 0.566
## forest 1.061
## wages 0.679
## entities 0.666
## libraries 0.316
## waste
##
## MR1 MR2 MR3
## SS loadings 1.490 1.170 1.049
## Proportion Var 0.213 0.167 0.150
## Cumulative Var 0.213 0.380 0.530
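The two variance summaries above do not agree (SS loadings of 1.56, 1.06, and 0.92 in the first table versus 1.490, 1.170, and 1.049 under the loadings printout). One plausible explanation, stated here as an assumption about how the two outputs are computed, is that the loadings printout sums squared pattern loadings per column, while the `Vaccounted` table additionally weights them by the factor correlation matrix. The sketch below shows how the two calculations diverge for a made-up two-factor solution; `L` and `Phi` are hypothetical values, not results from this analysis.

```r
# Hypothetical pattern loadings (4 variables, 2 factors) and factor correlations.
L <- matrix(c(0.8, 0.7, 0.1, 0.0,
              0.0, 0.1, 0.6, 0.9),
            ncol = 2,
            dimnames = list(paste0("v", 1:4), c("F1", "F2")))
Phi <- matrix(c(1.0, 0.5,
                0.5, 1.0), ncol = 2)

# Column sums of squared pattern loadings: ignores factor correlations.
ss_pattern <- colSums(L^2)

# Same quantity weighted by the factor correlation matrix.
ss_weighted <- diag(Phi %*% t(L) %*% L)

print(round(ss_pattern, 3))
print(round(ss_weighted, 3))
```

With uncorrelated factors (`Phi` equal to the identity matrix) both calculations coincide, so a gap between them is itself a sign that the factors overlap.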
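Based on the factor loadings, we identify the following latent dimensions. MR1, Social Infrastructure, is defined by preschool enrollment (1.019) and hospital beds (0.566). MR2, Ecological Buffer, is defined by forest cover (1.061). MR3, Economic Vitality, is defined by wages (0.679) and business entities (0.666). Libraries load only weakly (0.316), and waste recovery does not exceed the 0.3 cutoff on any factor. Pattern loadings slightly above 1 are possible under an oblique rotation, because they are regression weights rather than correlations.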
The assumption of oblique axes is justified by the inter-factor correlations.
# Factor Correlation Matrix
kable(fa_res$Phi, digits = 2, caption = "Correlation Matrix between Latent Factors") %>%
kable_styling(full_width = FALSE)

| | MR1 | MR2 | MR3 |
|---|---|---|---|
| MR1 | 1.00 | -0.35 | 0.66 |
| MR2 | -0.35 | 1.00 | -0.41 |
| MR3 | 0.66 | -0.41 | 1.00 |
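We observe a moderately strong positive correlation (0.66) between Social Infrastructure (MR1) and Economic Vitality (MR3), consistent with the view that wealthier regions tend to have better social services. Ecological Buffer (MR2) correlates negatively with both (-0.35 and -0.41).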
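A quick way to quantify this justification is to square the inter-factor correlations, which gives the share of variance each pair of factors has in common. The sketch below rebuilds the matrix from the table above and compares it with a cutoff of 0.32 (roughly 10% shared variance), a commonly cited rule of thumb for preferring an oblique rotation; treating that cutoff as the decision rule is an assumption, not something stated in the analysis.

```r
# Factor correlation matrix as reported in the table above.
factor_names <- c("MR1", "MR2", "MR3")
Phi <- matrix(c( 1.00, -0.35,  0.66,
                -0.35,  1.00, -0.41,
                 0.66, -0.41,  1.00),
              nrow = 3, byrow = TRUE,
              dimnames = list(factor_names, factor_names))

# Unique factor pairs: (MR1, MR2), (MR1, MR3), (MR2, MR3)
off_diag <- Phi[upper.tri(Phi)]
shared_variance <- off_diag^2

threshold <- 0.32  # rule-of-thumb cutoff (assumption)
exceeds <- abs(off_diag) > threshold

print(round(shared_variance, 4))
print(exceeds)
```

All three pairs exceed the cutoff, and MR1 and MR3 share about 44% of their variance, so forcing the factors to be orthogonal would discard a substantial overlap.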
The PCA variable plot illustrates the “pull” of each variable. “Waste” has a shorter vector, indicating that it is less well represented by the first two principal components than strong drivers such as wages or forest cover. This suggests that waste recovery may be driven by specific local regulations rather than systemic socio-economic forces.
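The "shorter vector" reading can be made concrete. For standardized data, a variable's coordinates on the plot are its correlations with the principal components, and its squared length on the PC1-PC2 plane is the share of its variance captured by those two components. The base-R sketch below uses toy data (an assumption for illustration) with two correlated pairs of variables and one unrelated variable, which plays the role that waste plays in the plot.

```r
# Toy data (assumption): two correlated pairs plus one unrelated variable.
set.seed(7)
z1 <- rnorm(250)
z2 <- rnorm(250)
toy <- data.frame(a1 = z1 + rnorm(250, sd = 0.5),
                  a2 = z1 + rnorm(250, sd = 0.5),
                  b1 = z2 + rnorm(250, sd = 0.5),
                  b2 = z2 + rnorm(250, sd = 0.5),
                  noise = rnorm(250))

pca <- prcomp(scale(toy))

# Variable coordinates: loadings scaled by component standard deviations,
# i.e. correlations between each variable and each component.
coords <- sweep(pca$rotation, 2, pca$sdev, "*")

# Squared length on the PC1-PC2 plane = variance share captured by that plane.
cos2_plane <- rowSums(coords[, 1:2]^2)
vector_length <- sqrt(cos2_plane)

print(round(vector_length, 2))
```

In this toy setup the unrelated variable ends up with a much shorter vector than the four structured ones, because its variance sits mostly on a later component.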
pca_res <- prcomp(df_scaled)
fviz_pca_var(pca_res, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

t-SNE identifies local clusters that linear methods might miss. The projection shows a dense “backbone” of Polish poviats with several “satellites” representing extreme archetypes.
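The t-SNE call that follows caps the perplexity relative to the sample size, since the perplexity (roughly, the effective number of neighbours per point) cannot be large compared with the number of rows. The helper below, `cap_perplexity`, is a hypothetical wrapper around the same formula used in this report, shown for a few sample sizes.

```r
# Hypothetical helper mirroring the perplexity formula used in this report:
# use the target value unless the sample is too small to support it.
cap_perplexity <- function(n_rows, target = 30) {
  min(target, floor((n_rows - 1) / 3))
}

sizes <- c(380, 50, 10)
caps <- sapply(sizes, cap_perplexity)
print(data.frame(n_rows = sizes, perplexity = caps))
```

With a few hundred poviats the cap is not binding and the default of 30 is used; the formula only matters if the row count drops sharply after cleaning.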
set.seed(42)
optimal_perplexity <- min(30, floor((nrow(df_scaled) - 1) / 3))
tsne_out <- Rtsne(df_scaled, perplexity = optimal_perplexity, check_duplicates = FALSE)
tsne_plot_data <- as.data.frame(tsne_out$Y) %>% mutate(name = df_unique_rows$name)
ggplot(tsne_plot_data, aes(V1, V2, label = name)) +
geom_point(color = "#2E9FDF", alpha = 0.6) +
geom_text_repel(size = 3, max.overlaps = 15) +
theme_minimal() +
labs(title = "t-SNE Projection: Quality of Life Archetypes")

Mapping the factor scores for Economic Vitality (MR3) shows that prosperity is concentrated in metropolitan “islands.” We see high vitality in western Poland and around Warsaw, but also in eastern hubs such as Rzeszów and Lublin, which contradicts the simple East-West narrative.
load(file.path(Sys.getenv("HOME"), "bdl.maps.2022.RData"))
# Assuming MR3 is Economic Vitality based on interpretation
scores_df <- as.data.frame(fa_res$scores) %>%
mutate(id = rownames(df_analysis)) %>%
rename(Economic_Vitality = MR3)
# left_join keeps poviats without factor scores, so they render as "Missing Data"
map_data <- left_join(bdl.maps.2022$level5, scores_df, by = "id")
tmap_mode("view")
tm_shape(map_data) +
tm_polygons("Economic_Vitality",
palette = "-RdYlBu",
title = "Vitality Factor (MR3)",
style = "quantile",
colorNA = "grey80",
textNA = "Missing Data"
) +
tm_layout(main.title = "Spatial Distribution of Latent Vitality")

The dimension reduction analysis shows that “Quality of Life” in Poland is not a single gradient but an interaction of three correlated latent pillars.