1 Abstract

This project uses Exploratory Factor Analysis (EFA) and t-SNE to deconstruct a “Quality of Life” index across Poland’s 380 poviats. Moving beyond the traditional “Poland A vs. B” model, the analysis identifies three latent drivers: Social Infrastructure, Ecological Buffer, and Economic Vitality. The results suggest that regional well-being is shaped more by urban-hub dynamics than by simple geographic location, providing a data-driven basis for regional development policy.

2 Introduction

2.1 Analysis Context

The socio-economic landscape of Poland is frequently described through the lens of historical partitions, resulting in a perceived East-West developmental split. However, modern regional development is influenced by a complex interplay of labor markets, educational accessibility, and ecological quality.

This project reduces the dimensionality of these variables to uncover the latent structures behind regional well-being. Using Exploratory Factor Analysis and t-SNE, we identify the fundamental “pillars” of Polish quality of life.

2.2 Research Questions

  1. Latent Drivers: What are the fundamental latent dimensions that explain the variance in regional well-being in Poland?
  2. Factor Interdependence: To what extent are economic success and social infrastructure correlated?
  3. Local vs. Global Structure: Can non-linear embeddings (t-SNE) identify “archetypes” of districts that share similar profiles despite being in different geographic clusters?

2.3 Hypotheses

  • H1: Quality of Life is a tri-fold structure comprising Urban-Economic Vitality, Social Accessibility, and Ecological Buffer.
  • H2: Economic vitality and social infrastructure are positively correlated, which would support prioritizing investment in social services in economically growing poviats.
  • H3: t-SNE will reveal that major metropolitan hubs in Eastern Poland cluster more closely with Western cities than with their immediate rural neighbors.

3 Methodology & Data

3.1 Data Acquisition

Data is retrieved from the Statistics Poland Local Data Bank (BDL) for 2023 at the poviat level (LAU-1). We fetch eight variables: seven indicators representing the multi-faceted nature of quality of life, plus population, which serves only as the denominator for hospital capacity. The selection focuses on metrics where robust data is available.

3.2 Variable Selection and Engineering

We engineer one feature: hospital capacity is normalized by local population (beds per 1,000 residents) so that the metric is comparable across regions.

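# Packages used throughout the report (assumed installed; the original setup chunk is not shown)
library(bdl)        # get_data_by_variable()
library(tidyverse)  # dplyr, tidyr, tibble, ggplot2
library(knitr)      # kable()
library(kableExtra) # kable_styling()
library(psych)      # describe(), KMO(), cortest.bartlett(), fa.parallel(), fa()
library(corrplot)   # corrplot()
library(factoextra) # fviz_pca_var()
library(Rtsne)      # Rtsne()
library(ggrepel)    # geom_text_repel()
library(tmap)       # tm_shape(), tm_polygons()

# Fetch data from BDL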
var_ids <- c(
    "wages"      = "64429", # Avg monthly gross wages (rel. to Poland = 100)
    "entities"   = "634123", # Entities per 1k population
    "libraries"  = "1725594", # Library users per 1k population
    "hospitals"  = "152354", # General hospitals - total beds
    "preschool"  = "1612315", # Children in preschool per 1k (aged 3-6)
    "forest"     = "194828", # Forest cover %
    "waste"      = "60585", # Share of waste recovered %
    "population" = "1645341" # Population in thousands
)

df_raw <- get_data_by_variable(varId = var_ids, unitLevel = 5, year = 2023, lang = "en")

head(df_raw) %>%
    kable(caption = "Preview of Raw Data") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Preview of Raw Data
id            name                year  val_64429  val_634123  val_1725594  val_152354  val_1612315  val_194828  val_60585  val_1645341
011212001000  Powiat bocheński    2023       81.4       107.3          148         210        954.8        28.6        4.5       106.95
011212006000  Powiat krakowski    2023       95.5       137.0          124         308        906.6        12.1       76.5       302.35
011212008000  Powiat miechowski   2023       80.7       111.0          117         199        932.8        12.0        0.0        46.67
011212009000  Powiat myślenicki   2023       81.9       124.7          126         240        937.2        35.6        0.0       129.84
011212014000  Powiat proszowicki  2023       81.3       108.6           94         196        926.7         1.5        0.0        42.01
011212019000  Powiat wielicki     2023       92.4       141.1          171           0        968.9        15.8        0.0       143.63
Measure names: % (val_64429, val_194828, val_60585), person (val_1725594, val_1612315), thousand persons (val_1645341); blank for val_634123 and val_152354.
The remaining metadata columns (measureUnitId_*, attrId_*, attributeDescription_*) are omitted here. attrId is 1 for every value shown except val_60585 in the last four rows and val_152354 for Powiat wielicki, where it is 0.
# Preprocessing and Feature Engineering
df_clean <- df_raw %>%
    select(id, name,
        wages = val_64429, entities = val_634123, libraries = val_1725594,
        hospitals_raw = val_152354, pop_k = val_1645341, preschool = val_1612315,
        forest = val_194828, waste = val_60585
    ) %>%
    drop_na() %>%
    mutate(hospitals = hospitals_raw / pop_k) %>% # Normalize beds per 1k pop
    select(-hospitals_raw, -pop_k)

# Synchronization for t-SNE
df_unique_rows <- df_clean %>%
    distinct(wages, entities, libraries, hospitals, preschool, forest, waste, .keep_all = TRUE)

df_analysis <- df_unique_rows %>%
    column_to_rownames("id") %>%
    select(-name)
df_scaled <- scale(df_analysis)
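
To make the normalization step concrete, here is a minimal sketch that recomputes beds per 1,000 residents for three poviats from the raw-data preview. The values are typed in by hand and the vector names are hypothetical; the pipeline above applies the same ratio to the full data set.

```r
# Beds and population (in thousands) copied by hand from the preview table
beds  <- c(bochenski = 210, krakowski = 308, miechowski = 199)
pop_k <- c(bochenski = 106.95, krakowski = 302.35, miechowski = 46.67)

# Population is already expressed in thousands,
# so the ratio is directly "beds per 1,000 residents"
beds_per_1k <- beds / pop_k
print(round(beds_per_1k, 2))
```

Powiat krakowski has the most beds of the three but the lowest capacity per resident, which is the kind of distortion the normalization is meant to remove.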

3.3 Descriptive Statistics

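Before standardization, we inspect the baseline distributions of the indicators on their original scales. Two values in the table stand out: wages has a minimum of 0.0 and waste has a median of only 0.65, which may indicate missing values recorded as zeros rather than true zeros.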

# Calculate summary stats for raw variables
desc_stats <- describe(df_analysis)[, c("mean", "median", "min", "max", "sd")]

kable(desc_stats, digits = 2, caption = "Descriptive Statistics of Raw Variables") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Descriptive Statistics of Raw Variables
variable     mean  median    min      max      sd
wages       86.37   84.45    0.0   168.60   11.05
entities   115.50  110.25   69.2   296.90   31.38
libraries  124.69  116.00   48.0   291.00   38.10
preschool  931.96  914.05  623.5  1309.50  100.70
forest      25.90   24.35    0.0    70.30   13.40
waste       15.90    0.65    0.0   100.00   27.55
hospitals    3.73    3.08    0.0    14.44    2.68
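
Because the indicators sit on very different scales (compare the standard deviations of preschool and hospitals), the analysis works on standardized values. The sketch below, on a small hand-made data frame that is illustrative only, shows that scale() is equivalent to subtracting the column mean and dividing by the sample standard deviation, and then applies the same formula to the wage mean and sd reported in the table.

```r
# Toy data frame on very different scales (values are illustrative only)
toy <- data.frame(
    wages     = c(81.4, 95.5, 80.7, 81.9),
    preschool = c(954.8, 906.6, 932.8, 937.2)
)

# scale() centers each column on its mean and divides by its sample sd
z_auto <- scale(toy)

# The same transformation written out by hand
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Both routes give the same numbers
print(all.equal(as.numeric(z_auto), as.numeric(z_manual)))

# z-score of a wage index of 95.5, using the mean and sd from the table
z_wage <- (95.5 - 86.37) / 11.05
print(round(z_wage, 2))
```

A wage index of 95.5 therefore sits a little under one standard deviation above the average poviat.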

4 Exploratory Analysis

4.1 Correlation Structure

The correlation matrix reveals a cluster of related urban metrics. Preschool enrollment is positively correlated with hospitals (0.57) and entities (0.51), a moderate-to-strong association consistent with social infrastructure being concentrated where economic density is high.

res_cor <- cor(df_analysis)
corrplot(res_cor, method = "color", type = "upper", addCoef.col = "black", tl.col = "black", diag = FALSE)

4.2 Statistical Adequacy Tests

We perform two tests to verify that the data is suitable for factor analysis.

  1. Kaiser-Meyer-Olkin (KMO) Test: Measures sampling adequacy. Scores > 0.6 are acceptable.
  2. Bartlett’s Test of Sphericity: Tests the null hypothesis that the correlation matrix is an identity matrix (i.e., variables are unrelated). A p-value < 0.05 indicates significant correlations.
# KMO Test
kmo_res <- KMO(res_cor)
print(kmo_res)

# Bartlett's Test
bartlett_res <- cortest.bartlett(res_cor, n = nrow(df_analysis))
print(paste("Bartlett's Test p-value:", bartlett_res$p.value))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = res_cor)
## Overall MSA =  0.74
## MSA for each item = 
##     wages  entities libraries preschool    forest     waste hospitals 
##      0.80      0.76      0.84      0.70      0.64      0.69      0.73 
## [1] "Bartlett's Test p-value: 3.71006558501937e-96"

An overall MSA of 0.74, with every item above 0.6, indicates adequate sampling, and the highly significant p-value from Bartlett’s test (p ≈ 3.7e-96) confirms that the variables are correlated enough for dimension reduction.
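
For readers who want to see what the two statistics measure, the following base-R sketch computes an overall KMO value and Bartlett’s statistic directly from a correlation matrix. The function names kmo_sketch and bartlett_sketch and the 3x3 matrix are hypothetical; they use the standard textbook formulas and are not a substitute for the psych functions used above.

```r
# Overall KMO from a correlation matrix R
kmo_sketch <- function(R) {
    R_inv <- solve(R)
    # Partial correlations of each pair given all other variables
    P <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
    off <- row(R) != col(R)
    # Share of squared correlations relative to correlations plus partials
    sum(R[off]^2) / (sum(R[off]^2) + sum(P[off]^2))
}

# Bartlett's test of sphericity for p variables and n observations
bartlett_sketch <- function(R, n) {
    p <- ncol(R)
    chi2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
    df <- p * (p - 1) / 2
    list(chisq = chi2, df = df, p.value = pchisq(chi2, df, lower.tail = FALSE))
}

# Hypothetical correlation matrix with every off-diagonal entry at 0.5
R_toy <- matrix(0.5, 3, 3)
diag(R_toy) <- 1

print(round(kmo_sketch(R_toy), 3))
print(bartlett_sketch(R_toy, n = 100))
```

KMO approaches 1 when the partial correlations are small relative to the raw correlations, that is, when the variables share common rather than pairwise-specific variance.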

5 Factor Analysis & Latent Discovery

5.1 Parallel Analysis & Variance Explained

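To determine the number of factors to extract, we run a parallel analysis, which compares the eigenvalues of the observed data with those obtained from random data, and inspect the accompanying scree plot.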

fa.parallel(df_scaled, fm = "minres", fa = "fa")

## Parallel analysis suggests that the number of factors =  3  and the number of components =  NA

We extract 3 factors, as suggested by the parallel analysis. The variance explained by each factor is reported with the rotated solution in Section 5.2.
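
The idea behind parallel analysis can be sketched in base R. The version below is a simplified, component-based variant on simulated data: it compares eigenvalues of the full correlation matrix with the 95th percentile of eigenvalues from random data, whereas fa.parallel with fa = "fa" works on factor eigenvalues. The data, the 200 iterations, and the percentile are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical data: six variables driven by two independent latent factors
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
    0.8 * f1 + 0.6 * rnorm(n), 0.8 * f1 + 0.6 * rnorm(n), 0.8 * f1 + 0.6 * rnorm(n),
    0.8 * f2 + 0.6 * rnorm(n), 0.8 * f2 + 0.6 * rnorm(n), 0.8 * f2 + 0.6 * rnorm(n)
)

# Eigenvalues of the observed correlation matrix
obs_eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from uncorrelated random data of the same shape
n_iter <- 200
rand_eig <- replicate(n_iter, {
    Z <- matrix(rnorm(n * ncol(X)), nrow = n)
    eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
})

# 95th percentile of the random eigenvalues at each position
threshold <- apply(rand_eig, 1, quantile, probs = 0.95)

# Keep the leading dimensions whose eigenvalue beats the random benchmark
n_keep <- sum(cumprod(obs_eig > threshold))
print(n_keep)
```

A dimension is retained only if it explains more variance than would be expected from noise alone, which is a stricter criterion than reading the elbow of a scree plot by eye.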

5.2 Oblique (Promax) Rotation

We apply Oblique Rotation (Promax) because we hypothesize that economic and social development are interlinked.

fa_res <- fa(df_scaled, nfactors = 3, rotate = "promax", fm = "minres")

# Display Variance Explained
kable(fa_res$Vaccounted, digits = 2, caption = "Variance Explained by Factors") %>%
    kable_styling()

# Display Loadings
print(fa_res$loadings, cutoff = 0.3, sort = TRUE)
Variance Explained by Factors
                        MR1   MR2   MR3
SS loadings            1.56  1.06  0.92
Proportion Var         0.22  0.15  0.13
Cumulative Var         0.22  0.37  0.51
Proportion Explained   0.44  0.30  0.26
Cumulative Proportion  0.44  0.74  1.00
## 
## Loadings:
##           MR1    MR2    MR3   
## preschool  1.019              
## hospitals  0.566              
## forest            1.061       
## wages                    0.679
## entities                 0.666
## libraries  0.316              
## waste                         
## 
##                  MR1   MR2   MR3
## SS loadings    1.490 1.170 1.049
## Proportion Var 0.213 0.167 0.150
## Cumulative Var 0.213 0.380 0.530
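
The two variance summaries above do not agree (for example 1.56 versus 1.490 for MR1). A likely explanation, offered here as an assumption about the print methods rather than something the output confirms, is that Vaccounted takes the factor correlations into account while the loadings printout simply sums squared pattern loadings by column. The sketch below reproduces that kind of gap on a small hypothetical pattern matrix L and factor correlation matrix Phi.

```r
# Hypothetical pattern loadings for four variables on two correlated factors
L <- rbind(
    c(0.8, 0.1),
    c(0.7, 0.0),
    c(0.2, 0.6),
    c(0.0, 0.9)
)

# Hypothetical factor correlation matrix
Phi <- matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)

# Column sums of squared loadings (ignores factor correlations)
ss_plain <- colSums(L^2)

# Variance attributed to each factor once factor correlations are included
ss_oblique <- diag(Phi %*% t(L) %*% L)

# Communalities under an oblique model
h2 <- diag(L %*% Phi %*% t(L))

print(ss_plain)
print(ss_oblique)
print(sum(h2))
```

Only the second version sums to the total communality, so under an oblique rotation it is the safer figure to quote as variance explained.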

5.3 Interpretation of Factors

Based on the factor loadings, we identify the following latent dimensions:

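  • MR1 (Social Infrastructure): Dominant loadings on preschool (1.02) and hospitals (0.57), with a weaker loading on libraries (0.32). This represents the “care” and “culture” capacity of a region.
  • MR2 (Ecological Buffer): Defined almost entirely by forest cover (1.06), which often shares an inverse relationship with dense urbanization. A factor resting on a single indicator should be interpreted with caution, and a pattern loading above 1, as for forest here and preschool on MR1, is possible under oblique rotation but can also signal an unstable solution.
  • MR3 (Urban-Economic Vitality): High loadings on wages (0.68) and entities (0.67). This factor represents the core economic power and entrepreneurial activity.
  • Waste has no loading above the 0.3 cutoff on any factor, so it contributes little to the three dimensions.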

5.4 Factor Interdependence

The assumption of oblique axes is justified by the inter-factor correlations.

# Factor Correlation Matrix
kable(fa_res$Phi, digits = 2, caption = "Correlation Matrix between Latent Factors") %>%
    kable_styling(full_width = FALSE)
Correlation Matrix between Latent Factors
       MR1    MR2    MR3
MR1   1.00  -0.35   0.66
MR2  -0.35   1.00  -0.41
MR3   0.66  -0.41   1.00

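We observe a fairly strong positive correlation (0.66) between Social Infrastructure (MR1) and Economic Vitality (MR3), consistent with wealthier regions tending to have better social services. Ecological Buffer (MR2) is negatively correlated with both (-0.35 and -0.41).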
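
To put these correlations on a more intuitive scale, the short sketch below squares the factor correlation matrix, which gives the share of variance each pair of factors has in common. The matrix is copied by hand from the table above.

```r
# Factor correlation matrix copied from the table above
Phi <- matrix(
    c( 1.00, -0.35,  0.66,
      -0.35,  1.00, -0.41,
       0.66, -0.41,  1.00),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("MR1", "MR2", "MR3"), c("MR1", "MR2", "MR3"))
)

# Squared correlations: proportion of variance shared by two factors
shared <- Phi^2
print(round(shared, 2))
```

Social Infrastructure and Economic Vitality share roughly 44% of their variance, so they are clearly related but far from interchangeable.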

6 Visualizing the Factor Space

6.1 Factor Map: Variable Contributions

The PCA variable plot illustrates the “pull” of each variable on the first two principal components. We notice that “Waste” has a shorter vector, indicating it is less well represented by these two dimensions than strong drivers like Wages or Forest cover. This suggests waste recovery might be driven by specific local regulations rather than systemic socio-economic forces.

pca_res <- prcomp(df_scaled)
fviz_pca_var(pca_res, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
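
The length of an arrow in such a plot can be derived from the PCA output itself. The base-R sketch below, run on simulated data with hypothetical variable names, computes variable coordinates as eigenvectors scaled by component standard deviations, and from them the squared cosines (cos2) that determine how well each variable is represented. It is meant to illustrate the quantity, not to reproduce factoextra’s internals exactly.

```r
set.seed(1)

# Hypothetical standardized data: two pairs of related variables
n <- 200
s1 <- rnorm(n)
s2 <- rnorm(n)
X <- scale(cbind(
    x1 = s1 + 0.3 * rnorm(n),
    x2 = s1 + 0.3 * rnorm(n),
    x3 = s2 + 0.5 * rnorm(n),
    x4 = s2 + 0.5 * rnorm(n)
))

pca <- prcomp(X)

# Variable coordinates: eigenvectors scaled by component standard deviations
coords <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: share of each variable's variance captured by each component
cos2 <- coords^2

# Arrow length on the PC1-PC2 plane (at most 1 for standardized data)
arrow_length <- sqrt(rowSums(cos2[, 1:2]))
print(round(arrow_length, 2))
```

Across all components the cos2 values of a standardized variable sum to 1, so a short arrow means most of that variable’s variance lies on components that the plot does not show.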

6.2 Non-linear Embedding: t-SNE Analysis

t-SNE identifies local clusters that linear models might miss. The projection shows a dense “backbone” of Polish poviats with several “satellites” representing extreme archetypes.

set.seed(42)
optimal_perplexity <- min(30, floor((nrow(df_scaled) - 1) / 3))
tsne_out <- Rtsne(df_scaled, perplexity = optimal_perplexity, check_duplicates = FALSE)

tsne_plot_data <- as.data.frame(tsne_out$Y) %>% mutate(name = df_unique_rows$name)

ggplot(tsne_plot_data, aes(V1, V2, label = name)) +
    geom_point(color = "#2E9FDF", alpha = 0.6) +
    geom_text_repel(size = 3, max.overlaps = 15) +
    theme_minimal() +
    labs(title = "t-SNE Projection: Quality of Life Archetypes")
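
The perplexity expression used above guards against small samples. To our understanding, Rtsne refuses to run when 3 * perplexity exceeds the number of rows minus one, so the value is capped accordingly. The helper below, with the hypothetical name safe_perplexity, isolates that rule so it can be checked without running t-SNE.

```r
# Largest perplexity satisfying 3 * perplexity <= n_rows - 1,
# capped at the conventional default of 30
safe_perplexity <- function(n_rows, cap = 30) {
    min(cap, floor((n_rows - 1) / 3))
}

print(safe_perplexity(380))
print(safe_perplexity(50))
```

With a few hundred poviats the cap of 30 applies, so the name optimal_perplexity in the chunk above is best read as “largest safe default” rather than a tuned value.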

7 Spatial Analysis

Mapping the factor scores for Urban-Economic Vitality demonstrates that prosperity is concentrated in metropolitan “islands.” We see significant vitality in western Poland and around Warsaw, but also in eastern hubs like Rzeszów and Lublin, contradicting the simple East-West narrative.

load(file.path(Sys.getenv("HOME"), "bdl.maps.2022.RData"))
# Assuming MR3 is Economic Vitality based on interpretation
scores_df <- as.data.frame(fa_res$scores) %>%
    mutate(id = rownames(df_analysis)) %>%
    rename(Economic_Vitality = MR3)

# left_join keeps every polygon, so poviats without a score are drawn as missing
map_data <- left_join(bdl.maps.2022$level5, scores_df, by = "id")

tmap_mode("view")
tm_shape(map_data) +
    tm_polygons("Economic_Vitality",
        palette = "-RdYlBu",
        title = "Vitality Factor (MR3)",
        style = "quantile",
        colorNA = "grey80",
        textNA = "Missing Data"
    ) +
    tm_layout(main.title = "Spatial Distribution of Latent Vitality")
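
Since the map layer dates from 2022 and the indicators from 2023, it is worth checking that identifiers line up before trusting the map. The sketch below shows one way to do this with set differences; the id vectors are hypothetical stand-ins for the id columns of the map layer and of scores_df.

```r
# Hypothetical ids standing in for the map layer and the factor-score table
map_ids   <- c("A01", "A02", "A03", "A04")
score_ids <- c("A01", "A02", "A04", "B99")

# Map units that have no factor score
no_score <- setdiff(map_ids, score_ids)

# Scored units that have no matching polygon
no_polygon <- setdiff(score_ids, map_ids)

print(no_score)
print(no_polygon)
```

If both vectors are empty the join is lossless; otherwise the listed units explain any gaps or missing poviats on the map.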

8 Conclusion

The dimension reduction analysis reveals that “Quality of Life” in Poland is not a single gradient but a complex interaction of three latent pillars.

  1. Correlation of Prosperity: The oblique rotation shows that economic vitality and social infrastructure are positively correlated (\(\phi \approx 0.66\)), consistent with social investment accompanying economic growth, although the analysis cannot establish the direction of the relationship.
  2. The Hub Effect: t-SNE and spatial mapping indicate that eastern Polish cities (Rzeszów, Lublin) share latent structures with western cities, suggesting that the urban-rural divide now matters more than the East-West divide.
  3. Policy Implications: Regional policy should focus on strengthening the “Social Infrastructure” in poviats that possess “Economic Vitality” but lag in service provision, ensuring that growth is translated into tangible well-being.