1 Abstract

This project utilizes Exploratory Factor Analysis (EFA) and t-SNE to deconstruct the “Quality of Life” index across 380 Polish Poviats. Moving beyond the traditional “Poland A vs. B” model, the analysis identifies three latent drivers: Social Infrastructure, Ecological Buffer, and Economic Vitality. The results highlight that regional well-being is increasingly defined by urban-hub dynamics rather than simple geographic location, providing a data-driven basis for regional development policies.

2 Introduction

2.1 Analysis Context

The socio-economic landscape of Poland is frequently described through the lens of historical partitions, resulting in a perceived East-West developmental split. However, modern regional development is influenced by a complex interplay of labor markets, educational accessibility, and ecological quality.

This project reduces the dimensionality of these variables to uncover the latent structures underlying regional well-being. Using Exploratory Factor Analysis and t-SNE, we identify the fundamental “pillars” of Polish quality of life.

2.2 Research Questions

Latent Drivers: What are the fundamental latent dimensions that explain the variance in regional well-being in Poland?
Factor Interdependence: To what extent are economic success and social infrastructure correlated?
Local vs. Global Structure: Can non-linear embeddings (t-SNE) identify “archetypes” of districts that share similar profiles despite being in different geographic clusters?

2.3 Hypotheses

H1: Quality of Life is a tri-fold structure comprising Urban-Economic Vitality, Social Accessibility, and Ecological Buffer.
H2: Economic factors and social infrastructure are significantly correlated, which would justify prioritizing investment in social services in growing economies.
H3: t-SNE will reveal that major metropolitan hubs in Eastern Poland cluster more closely with Western cities than with their immediate rural neighbors.

3 Methodology & Data

3.1 Data Acquisition

Data are retrieved from the Statistics Poland Local Data Bank (BDL) for 2023 at the Poviat level (LAU-1). We retrieve eight variables: seven indicators that represent the multi-faceted nature of quality of life, plus population, which serves only as the denominator for normalization. The selection focuses on metrics where robust data is available.

3.2 Variable Selection and Engineering

We perform feature engineering to normalize hospital capacity against local population (beds per 1,000 residents), so that the metric is comparable across regions of different size.

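# Packages used throughout the report
library(bdl)        # get_data_by_variable()
library(dplyr)      # select(), mutate(), distinct(), inner_join()
library(tidyr)      # drop_na()
library(tibble)     # column_to_rownames()
library(knitr)      # kable()
library(kableExtra) # kable_styling()
library(psych)      # describe(), KMO(), cortest.bartlett(), fa(), fa.parallel()
library(corrplot)
library(factoextra) # fviz_pca_var()
library(Rtsne)
library(ggplot2)
library(ggrepel)    # geom_text_repel()
library(tmap)

# Fetch data from BDL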
var_ids <- c(
    "wages"      = "64429", # Avg monthly gross wages (rel. to Poland = 100)
    "entities"   = "634123", # Entities per 1k population
    "libraries"  = "1725594", # Library users per 1k population
    "hospitals"  = "152354", # General hospitals - total beds
    "preschool"  = "1612315", # Children in preschool per 1k (aged 3-6)
    "forest"     = "194828", # Forest cover %
    "waste"      = "60585", # Share of waste recovered %
    "population" = "1645341" # Population in thousands
)

df_raw <- get_data_by_variable(varId = var_ids, unitLevel = 5, year = 2023, lang = "en")

head(df_raw) %>%
    kable(caption = "Preview of Raw Data") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Preview of Raw Data
id	name	year	val_64429	val_634123	val_1725594	val_152354	val_1612315	val_194828	val_60585	val_1645341	measureUnitId_64429	measureName_64429	measureUnitId_634123	measureUnitId_1725594	measureName_1725594	measureUnitId_152354	measureUnitId_1612315	measureName_1612315	measureUnitId_194828	measureName_194828	measureUnitId_60585	measureName_60585	measureUnitId_1645341	measureName_1645341	attrId_64429	attrId_634123	attrId_1725594	attrId_152354	attrId_1612315	attrId_194828	attrId_60585	attrId_1645341
011212001000	Powiat bocheński	2023	81.4	107.3	148	210	954.8	28.6	4.5	106.95	50	%	1	26	person	1	26	person	50	%	50	%	23	thousand persons	1	1	1	1	1	1	1	1
011212006000	Powiat krakowski	2023	95.5	137.0	124	308	906.6	12.1	76.5	302.35	50	%	1	26	person	1	26	person	50	%	50	%	23	thousand persons	1	1	1	1	1	1	1	1
011212008000	Powiat miechowski	2023	80.7	111.0	117	199	932.8	12.0	0.0	46.67	50	%	1	26	person	1	26	person	50	%	50	%	23	thousand persons	1	1	1	1	1	1	0	1
011212009000	Powiat myślenicki	2023	81.9	124.7	126	240	937.2	35.6	0.0	129.84	50	%	1	26	person	1	26	person	50	%	50	%	23	thousand persons	1	1	1	1	1	1	0	1
011212014000	Powiat proszowicki	2023	81.3	108.6	94	196	926.7	1.5	0.0	42.01	50	%	1	26	person	1	26	person	50	%	50	%	23	thousand persons	1	1	1	1	1	1	0	1
011212019000	Powiat wielicki	2023	92.4	141.1	171	0	968.9	15.8	0.0	143.63	50	%	1	26	person	1	26	person	50	%	50	%	23	thousand persons	1	1	1	0	1	1	0	1

# Preprocessing and Feature Engineering
df_clean <- df_raw %>%
    select(id, name,
        wages = val_64429, entities = val_634123, libraries = val_1725594,
        hospitals_raw = val_152354, pop_k = val_1645341, preschool = val_1612315,
        forest = val_194828, waste = val_60585
    ) %>%
    drop_na() %>%
    mutate(hospitals = hospitals_raw / pop_k) %>% # Normalize beds per 1k pop
    select(-hospitals_raw, -pop_k)

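# Drop rows with identical indicator profiles so that t-SNE receives unique points
# and the name labels stay aligned with the rows of the analysis matrix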
df_unique_rows <- df_clean %>%
    distinct(wages, entities, libraries, hospitals, preschool, forest, waste, .keep_all = TRUE)

df_analysis <- df_unique_rows %>%
    column_to_rownames("id") %>%
    select(-name)
df_scaled <- scale(df_analysis)
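
To make the standardization step concrete, the following is a minimal sketch of what `scale()` does with its default arguments. It uses a small toy data frame built from a few wages and forest values in the preview table; the object names `toy`, `z_manual`, and `z_scale` are illustrative and not part of the original pipeline.

```r
# Toy data: four wages/forest values borrowed from the raw-data preview
toy <- data.frame(
    wages  = c(81.4, 95.5, 80.7, 92.4),
    forest = c(28.6, 12.1, 12.0, 15.8)
)

# Manual z-score: subtract the column mean, divide by the sample SD
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() with default arguments performs the same centering and scaling
z_scale <- scale(toy)

# The two results agree up to floating-point error
same <- all.equal(unname(z_manual), unname(as.matrix(z_scale)),
                  check.attributes = FALSE)
print(same)
```

After this step every column has mean 0 and standard deviation 1, so indicators measured in different units contribute on an equal footing to the correlation-based methods that follow.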

3.3 Descriptive Statistics

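Before working with the standardized data, it is useful to inspect the baseline distributions of the raw indicators. Two features of the table below deserve attention: wages has a minimum of 0.0, which is implausible for a relative wage index and may indicate a missing value recorded as zero, and the median waste-recovery share of 0.65 shows that roughly half of the poviats report values at or near zero.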

# Calculate summary stats for raw variables
desc_stats <- describe(df_analysis)[, c("mean", "median", "min", "max", "sd")]

kable(desc_stats, digits = 2, caption = "Descriptive Statistics of Raw Variables") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Descriptive Statistics of Raw Variables
	mean	median	min	max	sd
wages	86.37	84.45	0.0	168.60	11.05
entities	115.50	110.25	69.2	296.90	31.38
libraries	124.69	116.00	48.0	291.00	38.10
preschool	931.96	914.05	623.5	1309.50	100.70
forest	25.90	24.35	0.0	70.30	13.40
waste	15.90	0.65	0.0	100.00	27.55
hospitals	3.73	3.08	0.0	14.44	2.68
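
Several indicators in the table have a minimum of zero. A quick screen for the share of zeros per column can help decide whether they are genuine values or placeholders. The sketch below uses the hospital-bed and waste values from the six preview rows; the name `zero_share` is illustrative, and whether a given zero is real (a poviat without a general hospital) or a coding artifact is something the data alone cannot settle.

```r
# Toy data: hospital beds and waste recovery from the six preview rows
toy <- data.frame(
    hospitals_raw = c(210, 308, 199, 240, 196, 0),
    waste         = c(4.5, 76.5, 0, 0, 0, 0)
)

# Share of exact zeros in each column
zero_share <- colMeans(toy == 0)
print(round(zero_share, 2))
```

A column with a large share of zeros, as waste shows here, will have a heavily skewed distribution, which is worth keeping in mind when interpreting its correlations and factor loadings.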

4 Exploratory Analysis

4.1 Correlation Structure

The correlation matrix reveals a cluster of positively related urban metrics. Preschool enrollment correlates moderately with hospitals (0.57) and entities (0.51), which is consistent with social infrastructure following economic density.

res_cor <- cor(df_analysis)
corrplot(res_cor, method = "color", type = "upper", addCoef.col = "black", tl.col = "black", diag = FALSE)

4.2 Statistical Adequacy Tests

We perform two tests to verify that the data is suitable for factor analysis.

Kaiser-Meyer-Olkin (KMO) Test: Measures sampling adequacy. Scores > 0.6 are acceptable.
Bartlett’s Test of Sphericity: Tests the null hypothesis that the correlation matrix is an identity matrix (i.e., variables are unrelated). A p-value < 0.05 indicates significant correlations.

# KMO Test
kmo_res <- KMO(res_cor)
print(kmo_res)

# Bartlett's Test
bartlett_res <- cortest.bartlett(res_cor, n = nrow(df_analysis))
print(paste("Bartlett's Test p-value:", bartlett_res$p.value))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = res_cor)
## Overall MSA =  0.74
## MSA for each item = 
##     wages  entities libraries preschool    forest     waste hospitals 
##      0.80      0.76      0.84      0.70      0.64      0.69      0.73 
## [1] "Bartlett's Test p-value: 3.71006558501937e-96"

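The overall KMO of 0.74 exceeds the 0.6 threshold, and every item-level MSA is at least 0.64. The highly significant p-value from Bartlett’s test (p < 0.001) confirms that the variables are correlated enough for dimension reduction.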
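
For readers who want to see what the two adequacy tests compute, here is a base-R sketch on simulated data. It is not the report's data: the matrix `X` and all object names are made up for illustration. Bartlett's statistic is derived from the determinant of the correlation matrix, and the overall KMO compares squared correlations with squared partial correlations.

```r
set.seed(1)
n <- 200

# Simulated data: three variables share a common component, one is pure noise
common <- rnorm(n)
X <- cbind(
    a = common + rnorm(n, sd = 0.7),
    b = common + rnorm(n, sd = 0.7),
    c = common + rnorm(n, sd = 0.7),
    d = rnorm(n)
)
R <- cor(X)
p <- ncol(R)

# Bartlett's test of sphericity: chi-square statistic from det(R)
chi_sq  <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df      <- p * (p - 1) / 2
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# KMO: partial correlations come from the inverse correlation matrix
R_inv   <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off     <- row(R) != col(R)  # off-diagonal mask
kmo_overall <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

print(c(chi_sq = chi_sq, df = df, p_value = p_value, KMO = kmo_overall))
```

A KMO near 1 means the partial correlations are small relative to the raw correlations, i.e. most of the shared variance is common to several variables rather than confined to pairs.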

5 Factor Analysis & Latent Discovery

5.1 Parallel Analysis & Variance Explained

To determine the number of factors, we run a parallel analysis, which compares the eigenvalues of the observed data with those obtained from random data of the same dimensions.

fa.parallel(df_scaled, fm = "minres", fa = "fa")

## Parallel analysis suggests that the number of factors =  3  and the number of components =  NA

We extract three factors, as suggested by the parallel analysis. The variance explained by each factor is reported after rotation in Section 5.2.
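
The logic of parallel analysis can be sketched in a few lines of base R. Note the simplifications: this version works on principal-component eigenvalues of simulated data, whereas the call above uses factor eigenvalues (`fa = "fa"`) on the real indicators. The 95th-percentile benchmark, the 200 iterations, and all object names are assumptions made for illustration.

```r
set.seed(42)
n <- 300
p <- 6

# Simulated data with two underlying factors (purely illustrative)
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
    f1 + rnorm(n, sd = 0.6), f1 + rnorm(n, sd = 0.6), f1 + rnorm(n, sd = 0.6),
    f2 + rnorm(n, sd = 0.6), f2 + rnorm(n, sd = 0.6), f2 + rnorm(n, sd = 0.6)
)

# Eigenvalues of the observed correlation matrix
obs_eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from uncorrelated random data of the same shape
n_iter <- 200
rand_eig <- replicate(n_iter, {
    Z <- matrix(rnorm(n * p), nrow = n)
    eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
})

# Benchmark: 95th percentile of the random eigenvalues at each position
threshold <- apply(rand_eig, 1, quantile, probs = 0.95)

# Retain the leading components whose eigenvalue beats the benchmark
exceeds  <- obs_eig > threshold
n_retain <- if (all(exceeds)) p else which(!exceeds)[1] - 1
print(n_retain)
```

The idea is that a dimension is only worth keeping if it explains more variance than would be expected from noise alone.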

5.2 Oblique (Promax) Rotation

We apply Oblique Rotation (Promax) because we hypothesize that economic and social development are interlinked.

fa_res <- fa(df_scaled, nfactors = 3, rotate = "promax", fm = "minres")

# Display Variance Explained
kable(fa_res$Vaccounted, digits = 2, caption = "Variance Explained by Factors") %>%
    kable_styling()

# Display Loadings
print(fa_res$loadings, cutoff = 0.3, sort = TRUE)

Variance Explained by Factors
	MR1	MR2	MR3
SS loadings	1.56	1.06	0.92
Proportion Var	0.22	0.15	0.13
Cumulative Var	0.22	0.37	0.51
Proportion Explained	0.44	0.30	0.26
Cumulative Proportion	0.44	0.74	1.00

## 
## Loadings:
##           MR1    MR2    MR3   
## preschool  1.019              
## hospitals  0.566              
## forest            1.061       
## wages                    0.679
## entities                 0.666
## libraries  0.316              
## waste                         
## 
##                  MR1   MR2   MR3
## SS loadings    1.490 1.170 1.049
## Proportion Var 0.213 0.167 0.150
## Cumulative Var 0.213 0.380 0.530

Note that the two “SS loadings” rows above differ. The printed loadings summary is based on the squared pattern loadings alone, whereas the Variance Explained table, to our understanding of the psych package, also takes the correlations between the oblique factors into account.

5.3 Interpretation of Factors

Based on the factor loadings, we identify the following latent dimensions:

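MR1 (Social Infrastructure): Dominated by preschool (1.02), with a moderate loading on hospitals (0.57) and a weak one on libraries (0.32). This represents the “care” and “culture” capacity of a region.
MR2 (Ecological Buffer): Defined almost entirely by forest cover (1.06), which often shares an inverse relationship with dense urbanization.
MR3 (Urban-Economic Vitality): High loadings on wages (0.68) and entities (0.67). This factor represents the core economic power and entrepreneurial activity of a region.

Waste does not load above the 0.3 cutoff on any factor. Pattern loadings slightly above 1 (preschool, forest) can occur under oblique rotation, but they warrant caution because they may signal an unstable solution.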

5.4 Factor Interdependence

The assumption of oblique axes is justified by the inter-factor correlations.

# Factor Correlation Matrix
kable(fa_res$Phi, digits = 2, caption = "Correlation Matrix between Latent Factors") %>%
    kable_styling(full_width = FALSE)

Correlation Matrix between Latent Factors
	MR1	MR2	MR3
MR1	1.00	-0.35	0.66
MR2	-0.35	1.00	-0.41
MR3	0.66	-0.41	1.00

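We observe a fairly strong positive correlation of 0.66 between Social Infrastructure (MR1) and Economic Vitality (MR3), consistent with wealthier regions tending to have better social services. Ecological Buffer (MR2) correlates negatively with both (-0.35 and -0.41).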
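
With an oblique rotation, the printed loadings are pattern coefficients, and the factor correlation matrix is needed to turn them into variable-factor correlations. The sketch below shows that relationship with entirely hypothetical numbers: `P` plays the role of the pattern matrix and `Phi` the role of `fa_res$Phi`. None of the values come from the fitted model.

```r
# Hypothetical pattern loadings: four variables on two correlated factors
P <- matrix(
    c(0.8, 0.0,
      0.7, 0.1,
      0.0, 0.9,
      0.1, 0.6),
    ncol = 2, byrow = TRUE,
    dimnames = list(c("v1", "v2", "v3", "v4"), c("F1", "F2"))
)

# Hypothetical factor correlation matrix
Phi <- matrix(c(1, 0.5, 0.5, 1), ncol = 2,
              dimnames = list(c("F1", "F2"), c("F1", "F2")))

# Structure matrix: correlations between variables and factors
S <- P %*% Phi

# Communalities: variance of each variable reproduced by the factors
h2 <- diag(P %*% Phi %*% t(P))

print(round(S, 2))
print(round(h2, 2))
```

In this toy case `v1` has a pattern loading of zero on `F2` but still correlates with it, purely because the two factors are correlated. This is why inter-factor correlations matter when reading an oblique solution.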

6 Visualizing the Factor Space

6.1 Factor Map: Variable Contributions

The PCA variable plot illustrates the “pull” of each variable on the first two principal components. We notice that “Waste” has a shorter vector, indicating it is less well-represented by the primary dimensions compared to strong drivers like Wages or Forest cover. This suggests waste recovery might be driven by specific local regulations rather than systemic socio-economic forces.

pca_res <- prcomp(df_scaled)
fviz_pca_var(pca_res, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
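
The arrow lengths and contribution values in a PCA variable plot can be reproduced from a `prcomp` object directly. The following base-R sketch uses simulated data and illustrative object names; it follows the usual definitions (coordinates as loadings scaled by component standard deviations, cos2 as squared coordinates), which we assume match what the plotting function displays.

```r
set.seed(7)
n <- 150
base <- rnorm(n)

# Simulated, standardized indicators (names reused only for readability)
X <- scale(cbind(
    wages    = base + rnorm(n, sd = 0.5),
    entities = base + rnorm(n, sd = 0.5),
    forest   = -base + rnorm(n, sd = 0.8),
    waste    = rnorm(n)
))
pca <- prcomp(X)

# Variable coordinates: loadings scaled by the component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: share of each variable's variance captured by each component
cos2 <- coord^2

# Arrow length on the PC1-PC2 plane (at most 1 for standardized data)
arrow_len <- sqrt(rowSums(cos2[, 1:2]))

# Contribution (%) of each variable to each component
contrib <- sweep(cos2, 2, colSums(cos2), `/`) * 100

print(round(arrow_len, 2))
print(round(contrib[, 1:2], 1))
```

A short arrow therefore means that most of a variable's variance lies in components beyond the first two, not that the variable is unimportant in general.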

6.2 Non-linear Embedding: t-SNE Analysis

t-SNE identifies local clusters that linear models might miss. The projection shows a dense “backbone” of Polish poviats with several “satellites” representing extreme archetypes.

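set.seed(42) # t-SNE is stochastic; fix the seed for reproducibility
# Use the common default of 30, capped so that 3 * perplexity does not exceed nrow - 1
optimal_perplexity <- min(30, floor((nrow(df_scaled) - 1) / 3))
# Duplicate rows were already removed with distinct(), so the check is skipped
tsne_out <- Rtsne(df_scaled, perplexity = optimal_perplexity, check_duplicates = FALSE)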

tsne_plot_data <- as.data.frame(tsne_out$Y) %>% mutate(name = df_unique_rows$name)

ggplot(tsne_plot_data, aes(V1, V2, label = name)) +
    geom_point(color = "#2E9FDF", alpha = 0.6) +
    geom_text_repel(size = 3, max.overlaps = 15) +
    theme_minimal() +
    labs(title = "t-SNE Projection: Quality of Life Archetypes")
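
Claims about which poviats sit close together in the embedding can be checked numerically rather than by eye. The sketch below lists the nearest neighbours of a unit in a 2-D embedding. The coordinates and unit names are made up, and `nearest()` is a hypothetical helper, not a function from any package used here. In the report's pipeline, the input would be `tsne_out$Y` with row names taken from `df_unique_rows$name`.

```r
# Made-up 2-D coordinates standing in for a t-SNE embedding
emb <- rbind(
    c(0.0, 0.0), c(0.5, 0.2), c(0.3, -0.4), # a tight group
    c(8.0, 8.0), c(8.4, 7.7)                # a distant group
)
rownames(emb) <- c("city_A", "city_B", "city_C", "rural_D", "rural_E")

# Pairwise Euclidean distances in the embedding
d <- as.matrix(dist(emb))

# Hypothetical helper: names of the k units closest to a given unit
nearest <- function(d, unit, k = 2) {
    others <- d[unit, colnames(d) != unit]
    names(sort(others))[seq_len(k)]
}

nn_city_A <- nearest(d, "city_A", k = 2)
print(nn_city_A)
```

Because t-SNE preserves local neighbourhoods far better than global distances, such a check is most informative for small `k`; distances between far-apart groups should not be over-interpreted.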

7 Spatial Analysis

Mapping the factor scores for Urban-Economic Vitality demonstrates that prosperity is concentrated in metropolitan “islands.” We see significant vitality in western Poland and around Warsaw, but also in eastern hubs like Rzeszów and Lublin, contradicting the simple East-West narrative.

load(file.path(Sys.getenv("HOME"), "bdl.maps.2022.RData"))
# Assuming MR3 is Economic Vitality based on interpretation
scores_df <- as.data.frame(fa_res$scores) %>%
    mutate(id = rownames(df_analysis)) %>%
    rename(Economic_Vitality = MR3)

map_data <- inner_join(bdl.maps.2022$level5, scores_df, by = "id")

tmap_mode("view")
tm_shape(map_data) +
    tm_polygons("Economic_Vitality",
        palette = "-RdYlBu",
        title = "Vitality Factor (MR3)",
        style = "quantile",
        colorNA = "grey80",
        textNA = "Missing Data"
    ) +
    tm_layout(main.title = "Spatial Distribution of Latent Vitality")
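
Because the map layer and the factor scores are combined with an inner join, any poviat whose identifier appears in only one of the two sources is dropped silently. A quick comparison of the identifier sets shows what would be lost. The sketch below uses a handful of ids from the preview table and made-up object names; in the report's pipeline, the inputs would be `bdl.maps.2022$level5$id` and `scores_df$id`.

```r
# Illustrative id vectors (taken from the raw-data preview)
map_ids   <- c("011212001000", "011212006000", "011212008000")
score_ids <- c("011212001000", "011212006000", "011212019000")

# Units with scores but no polygon, and polygons without scores
only_in_scores <- setdiff(score_ids, map_ids)
only_in_map    <- setdiff(map_ids, score_ids)

print(only_in_scores)
print(only_in_map)
```

This matters here because the boundaries come from a 2022 map file while the indicators refer to 2023, so a mismatch in identifiers cannot be ruled out without checking.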

8 Conclusion

The dimension reduction analysis reveals that “Quality of Life” in Poland is not a single gradient but a complex interaction of three latent pillars.

Correlation of Prosperity: The oblique rotation shows that economic vitality and social infrastructure are substantially correlated (\(\phi \approx 0.66\)), consistent with social investment accompanying economic growth, although the correlation alone does not establish the direction of the relationship.
The Hub Effect: t-SNE and spatial mapping indicate that eastern Polish cities (Rzeszów, Lublin) share latent structures with western cities, suggesting that the urban-rural divide has become more important than the East-West divide.
Policy Implications: Regional policy should focus on strengthening the “Social Infrastructure” in poviats that possess “Economic Vitality” but lag in service provision, ensuring that growth is translated into tangible well-being.

Unveiling Latent Structures of Regional Well-being: A Dimension Reduction Approach

Truong Giang Do - 488388

2026-01-31