This project challenges the historical “Poland A vs. Poland B” binary by using hierarchical clustering to analyze regional development. Based on four socio-economic indicators (wages, unemployment, entrepreneurship, and urbanization), the analysis reveals a three-tiered economic hierarchy. The findings suggest that development is driven more by “Metropolitan” dynamics and “Transitional” zones than by geographic longitude, which offers a more nuanced basis for targeted policy interventions.
The historical narrative of Poland’s development is often reduced to a binary “Poland A” (Western, industrialized) and “Poland B” (Eastern, agricultural). However, decades of EU integration, infrastructure investments, and the rise of service-oriented urban hubs suggest this split may be obsolete.
This analysis utilizes Hierarchical Clustering to investigate whether regional development in Poland follows a traditional geographic divide or a more complex, tiered hierarchy driven by urbanization and labor market dynamics.
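Data is sourced from the Statistics Poland Local Data Bank (BDL) for the year 2022. We focus on four key indicators at the Poviat (Level 5) level:
* Registered unemployment rate (%)
* Average monthly gross wages and salaries relative to the national average (Poland = 100)
* Registered entities per 1,000 population
* Population per 1 km2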
We retrieve the data directly via the BDL API, process it to handle missing values, and prepare it for clustering.
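# Packages used throughout the analysis (assumed: the original setup chunk is not shown)
library(bdl)         # get_data_by_variable()
library(dplyr)       # %>%, select(), mutate(), group_by(), inner_join()
library(tidyr)       # drop_na()
library(tibble)      # column_to_rownames()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(corrplot)    # corrplot()
library(factoextra)  # get_clust_tendency(), fviz_*(), hcut()
library(gridExtra)   # grid.arrange()
library(tmap)        # tm_shape(), tm_polygons()
# Fetch data from BDL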
var_ids <- c(
"unemployment" = "60270", # Registered unemployment rate
"wages" = "64429", # Average monthly gross wages and salary in relation to the average domestic (Poland = 100)
"entities" = "634123", # Entities by size classes per 1000 population total
"housing" = "60559" # Population per 1 km2
)
df_raw <- get_data_by_variable(
varId = var_ids,
unitLevel = 5,
year = 2022,
lang = "en"
)
head(df_raw) %>%
kable(caption = "Preview of Raw Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| id | name | year | val_60270 | val_64429 | val_634123 | val_60559 | measureUnitId_60270 | measureName_60270 | measureUnitId_64429 | measureName_64429 | measureUnitId_634123 | measureName_634123 | measureUnitId_60559 | measureName_60559 | attrId_60270 | attributeDescription_60270 | attrId_64429 | attributeDescription_64429 | attrId_634123 | attributeDescription_634123 | attrId_60559 | attributeDescription_60559 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 011212001000 | Powiat bocheński | 2022 | 3.0 | 84.2 | 103.2 | 164.7 | 50 | % | 50 | % | 1 | | 26 | person | 1 | | 1 | | 1 | | 1 | |
| 011212006000 | Powiat krakowski | 2022 | 4.6 | 95.2 | 131.5 | 244.0 | 50 | % | 50 | % | 1 | | 26 | person | 1 | | 1 | | 1 | | 1 | |
| 011212008000 | Powiat miechowski | 2022 | 5.0 | 80.8 | 108.3 | 69.5 | 50 | % | 50 | % | 1 | | 26 | person | 1 | | 1 | | 1 | | 1 | |
| 011212009000 | Powiat myślenicki | 2022 | 3.3 | 83.6 | 120.3 | 192.5 | 50 | % | 50 | % | 1 | | 26 | person | 1 | | 1 | | 1 | | 1 | |
| 011212014000 | Powiat proszowicki | 2022 | 6.1 | 83.2 | 104.1 | 101.9 | 50 | % | 50 | % | 1 | | 26 | person | 1 | | 1 | | 1 | | 1 | |
| 011212019000 | Powiat wielicki | 2022 | 3.5 | 92.5 | 135.8 | 346.0 | 50 | % | 50 | % | 1 | | 26 | person | 1 | | 1 | | 1 | | 1 | |
# Clean and Select Variables
df_clean <- df_raw %>%
select(
id,
name,
unemployment = val_60270,
wages = val_64429,
entities = val_634123,
housing = val_60559
) %>%
drop_na()
# Set row names for clustering
df_clustering <- df_clean %>%
select(-name) %>%
column_to_rownames("id")
# Scale the data
df_scaled <- scale(df_clustering)
head(df_scaled) %>%
kable(digits = 2, caption = "Preview of Scaled Data") %>%
kable_styling()

| | unemployment | wages | entities | housing |
|---|---|---|---|---|
| 011212001000 | -1.07 | -0.21 | -0.30 | -0.30 |
| 011212006000 | -0.67 | 0.78 | 0.64 | -0.18 |
| 011212008000 | -0.57 | -0.51 | -0.13 | -0.46 |
| 011212009000 | -0.99 | -0.26 | 0.27 | -0.26 |
| 011212014000 | -0.29 | -0.30 | -0.27 | -0.40 |
| 011212019000 | -0.94 | 0.53 | 0.78 | -0.01 |
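To make the scaling step concrete, here is a minimal sketch of what `scale()` does, using a small made-up data frame (the values and the name `toy` are hypothetical, not BDL data). It shows why standardization matters here: without it, population density, which spans a much wider range than the other indicators, would dominate the Euclidean distances.

```r
# Toy values (hypothetical, not BDL data): two indicators on very different scales
toy <- data.frame(
  wages   = c(80, 90, 100, 110),
  density = c(50, 100, 400, 2000)
)

# scale() centers each column on its mean and divides by its sample SD
toy_scaled <- scale(toy)

# The same transformation computed by hand for one column
manual_wages <- (toy$wages - mean(toy$wages)) / sd(toy$wages)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)
```

Each scaled value is a z-score, so a value of 0.78 in the wages column of the preview above means the poviat lies 0.78 standard deviations above the mean wage index of all poviats in the sample.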
Before applying clustering algorithms, we assess the relationships between variables and the feasibility of clustering.
res_cor <- cor(df_clustering)
corrplot(res_cor,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
title = "Variable Correlation Matrix",
mar = c(0, 0, 1, 0),
diag = FALSE
)

Interpretation: The correlation matrix reveals a strong positive relationship between urbanization (housing) and entrepreneurial activity (entities), which likely drives the formation of the “Metropolis” cluster. Unemployment shows a negative correlation with wages, as expected.
To verify if the dataset contains meaningful clusters (i.e., it is not uniformly distributed), we calculate the Hopkins Statistic. A value close to 1 indicates high clustering tendency.
res <- get_clust_tendency(df_scaled, n = nrow(df_scaled) - 1, graph = FALSE)
print(paste("Hopkins Statistic:", round(res$hopkins_stat, 3)))

## [1] "Hopkins Statistic: 0.904"
A Hopkins statistic well above 0.5 (ideally > 0.7) indicates that the data are not uniformly distributed and are therefore suitable for clustering. The value of 0.904 obtained here clears that threshold.
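For intuition about what the statistic measures, the following is a from-scratch sketch on made-up data. It is not the `factoextra` implementation; the function name `hopkins_sketch`, the sample size `m`, and the toy matrices are assumptions for illustration. It uses the convention adopted in the text, where values near 1 indicate clustered data and values near 0.5 indicate roughly uniform data.

```r
set.seed(42)

# Toy data (hypothetical): two tight, well-separated groups in 2D
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)

hopkins_sketch <- function(x, m = 20) {
  n <- nrow(x)
  d <- ncol(x)
  # Distance from point p to its nearest neighbour among the rows of pts
  nn_dist <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))

  # w: nearest-neighbour distances for m sampled real points (self excluded)
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))

  # u: distance to the nearest real point for m artificial points drawn
  # uniformly from the bounding box of the data
  rand <- sapply(seq_len(d), function(j) runif(m, min(x[, j]), max(x[, j])))
  u <- apply(rand, 1, nn_dist, pts = x)

  # Clustered data: real points sit close together, so w is small and H nears 1
  sum(u) / (sum(u) + sum(w))
}

h_clustered <- hopkins_sketch(x)
h_uniform <- hopkins_sketch(matrix(runif(200), ncol = 2))
```

In this sketch the clustered toy data should score far higher than the uniform toy data, which is the same contrast the 0.904 result draws for the poviat indicators.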
We use Ward’s Method (ward.D2) for hierarchical clustering to minimize within-cluster variance, using Euclidean distance as the dissimilarity measure.
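As a minimal illustration of this choice, the sketch below runs Ward clustering on a made-up data set with three known groups (the object names `toy`, `centers`, and `true_group` are hypothetical). In base R, `ward.D2` squares the dissimilarities internally, so it should be given plain Euclidean distances, as is done here.

```r
set.seed(1)

# Toy data (hypothetical): three groups of five points each
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
toy <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(5, centers[g, 1], 0.5), rnorm(5, centers[g, 2], 0.5))
}))
true_group <- rep(1:3, each = 5)

# ward.D2 expects plain (unsquared) Euclidean distances
d_toy <- dist(toy, method = "euclidean")
hc_toy <- hclust(d_toy, method = "ward.D2")

# Cut the tree into three groups and compare with the known grouping
toy_groups <- cutree(hc_toy, k = 3)
confusion <- table(true_group, toy_groups)
```

Because Ward's criterion merges the pair of clusters that least increases the within-cluster sum of squares, compact and well-separated groups like these end up in separate branches of the tree.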
We employ three validation methods to determine the optimal number of clusters (\(k\)): Elbow Method, Silhouette Method, and Gap Statistic.
p1 <- fviz_nbclust(df_scaled, hcut, method = "wss") +
labs(subtitle = "Elbow Method")
p2 <- fviz_nbclust(df_scaled, hcut, method = "silhouette") +
labs(subtitle = "Silhouette Method")
p3 <- fviz_nbclust(df_scaled, hcut, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap Statistic")
grid.arrange(p1, p2, p3, nrow = 1)

Decision: We select k = 3. The Silhouette Method shows a clear peak at this level, and the Gap Statistic supports a lower number of clusters. This aligns with our H1 hypothesis of a three-tiered economic hierarchy.
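The silhouette criterion can also be checked by hand. The sketch below is an assumption-laden illustration on made-up data (the names `toy`, `ks`, `avg_sil`, and `best_k` are hypothetical): it cuts one Ward tree at several values of k and compares the average silhouette width, which is the quantity plotted by the Silhouette Method panel.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

set.seed(7)

# Toy data (hypothetical): three well-separated groups of 20 points
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
toy <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(20, centers[g, 1], 0.5), rnorm(20, centers[g, 2], 0.5))
}))

d_toy <- dist(toy)
hc_toy <- hclust(d_toy, method = "ward.D2")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc_toy, k = k), d_toy)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

# The k with the highest average width is the silhouette-optimal choice
best_k <- ks[which.max(avg_sil)]
```

On real data the peak is usually less pronounced than on this toy example, which is why the text cross-checks it against the elbow and gap statistic plots.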
The dendrogram visualizes the hierarchical relationship between poviats. The structure clearly separates into three main branches, corresponding to our proposed Metropolitan, Industrial, and Periphery tiers.
dist_mat <- dist(df_scaled, method = "euclidean")
hc_ward <- hclust(dist_mat, method = "ward.D2")
fviz_dend(hc_ward,
k = 3,
cex = 0.5,
rect = TRUE,
k_colors = c("#2E9FDF", "#E7B800", "#FC4E07"),
main = "Clustering Hierarchy of Polish Poviats",
show_labels = FALSE
)

We assign each poviat to a cluster and analyze the mean characteristics of each group.
groups <- cutree(hc_ward, k = 3)
df_final <- df_clean %>%
mutate(Cluster = as.factor(groups))
profile_table <- df_final %>%
group_by(Cluster) %>%
summarise(
Count = n(),
`Avg Wages` = mean(wages),
`Avg Unemployment (%)` = mean(unemployment),
`Avg Entities (per 1k)` = mean(entities),
`Avg Density (pop/km2)` = mean(housing)
) %>%
arrange(desc(`Avg Wages`))
kable(profile_table, caption = "Socio-Economic Profiles of Clusters", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| Cluster | Count | Avg Wages | Avg Unemployment (%) | Avg Entities (per 1k) | Avg Density (pop/km2) |
|---|---|---|---|---|---|
| 2 | 59 | 95.00 | 4.47 | 144.48 | 1687.56 |
| 1 | 181 | 87.91 | 5.14 | 113.88 | 135.41 |
| 3 | 140 | 81.20 | 11.19 | 96.39 | 78.45 |
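The cluster numbers returned by `cutree()` are arbitrary, which is why the highest-wage group in the table above carries the id 2 rather than 1. One possible way to make the output easier to read, sketched here on made-up values, is to attach descriptive labels ordered by mean wage. The tier names follow the tiers discussed in the text, while the data frame `toy` and the helper objects are hypothetical.

```r
# Toy profile (hypothetical values): arbitrary cluster ids as from cutree()
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  wages   = c(88, 87, 96, 94, 80, 82)
)

# Mean wage per cluster, then cluster ids ranked from highest to lowest wage
mean_wage <- tapply(toy$wages, toy$cluster, mean)
ordered_ids <- names(sort(mean_wage, decreasing = TRUE))

# Descriptive labels (assumed names), assigned in wage order
tier_names <- c("Metropolitan", "Industrial", "Periphery")
toy$tier <- factor(
  tier_names[match(as.character(toy$cluster), ordered_ids)],
  levels = tier_names
)
```

Ranking on a single variable is a simplification. With the profiles above it would give the same ordering for wages, entities, and density, but that agreement should be checked rather than assumed.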
Cluster Interpretation:
We map the clusters to visualize the “Poland A vs B” vs. “Islands of Prosperity” hypotheses.
# Load map data
load(file.path(Sys.getenv("HOME"), "bdl.maps.2022.RData"))
map_level5 <- bdl.maps.2022$level5
# Join map data with clusters
map_with_clusters <- inner_join(map_level5, df_final, by = "id")
# Interactive Map
tmap_mode("view")
tm_shape(map_with_clusters) +
tm_polygons(
col = "Cluster",
palette = "Set1",
id = "name.x",
popup.vars = c("Wages" = "wages", "Unemployment" = "unemployment", "Entities" = "entities", "Density" = "housing"),
title = "Socio-Economic Clusters",
alpha = 0.7
) +
tm_view(set.view = c(19, 52, 6))

Islands of Prosperity: The map confirms H2. We observe Cluster 2 (blue) poviats in Eastern Poland (e.g., around Rzeszów and Lublin), confirming that high-performing urban centers exist independently of their regional neighbors.
The Mazowieckie Effect: A stark contrast is visible around Warsaw (Mazowieckie). The capital is a high-performing island surrounded immediately by Cluster 3 (Periphery) poviats, illustrating a “hollow center” effect where wealth does not easily spill over to the immediate neighborhood.
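We identify regions that defy the usual patterns: potential “hidden gems” or “problem areas”. The filter below uses z-scores computed across all poviats and flags high-wage units that also have elevated unemployment or below-median population density.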
df_outliers <- df_final %>%
mutate(
wage_z = as.vector(scale(wages)),
unemp_z = as.vector(scale(unemployment))
) %>%
filter(
(wage_z > 1 & unemp_z > 0.5) | # High wage, High unemp
(wage_z > 1 & housing < median(housing)) # High wage, Low density
) %>%
select(name, Cluster, wages, unemployment, housing)
kable(df_outliers, caption = "Anomalous Poviats: High Deviation from Cluster Patterns") %>%
kable_styling()

| name | Cluster | wages | unemployment | housing |
|---|---|---|---|---|
| Powiat polkowicki | 1 | 102.3 | 4.4 | 78.6 |
| Powiat wołowski | 1 | 117.9 | 11.2 | 67.8 |
| Powiat poddębicki | 1 | 110.0 | 6.8 | 45.0 |
| Powiat łęczyński | 1 | 127.4 | 5.7 | 87.8 |
| Powiat bielski | 1 | 102.0 | 4.2 | 36.9 |
| Powiat hajnowski | 1 | 101.0 | 8.1 | 24.3 |
| Powiat kozienicki | 1 | 108.5 | 10.3 | 62.8 |
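The filter above standardizes wages and unemployment across all poviats. If the goal is to measure deviation from each cluster's own norm, one alternative is to standardize within clusters. The sketch below shows that variant on made-up values. It is not the method used for the table above, and the data frame `toy`, the helper `z_within`, and the one-SD threshold are assumptions.

```r
# Toy data (hypothetical values, not BDL data)
toy <- data.frame(
  name    = c("A", "B", "C", "D", "E", "F"),
  Cluster = factor(c(1, 1, 1, 2, 2, 2)),
  wages   = c(85, 88, 118, 95, 97, 98)
)

# Standardize wages within each cluster rather than across all rows
z_within <- function(v) (v - mean(v)) / sd(v)
toy$wage_z_cluster <- ave(toy$wages, toy$Cluster, FUN = z_within)

# Flag units more than one SD above their own cluster's mean wage
flagged <- toy[toy$wage_z_cluster > 1, c("name", "Cluster", "wages")]
```

With within-cluster scores, a poviat is flagged only when it stands out among its peers, so a high-wage unit in a generally high-wage cluster would no longer appear as anomalous.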
Contextualizing the Outliers:
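Silhouette Width: As an internal validation metric, we compute the silhouette width, which measures how well each poviat fits its assigned cluster relative to the nearest alternative cluster. The average width of 0.278 reported below points to overlapping rather than sharply separated tiers, so the boundaries between clusters should be read as gradual.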
sil <- cluster::silhouette(groups, dist_mat)
fviz_silhouette(sil, print.summary = FALSE) + labs(title = "Silhouette Plot")

## [1] "Average Silhouette Width: 0.278"

Implications: The analysis supports H1. We see distinct tiers. The spatial distribution confirms H2, showing “islands of prosperity” around major cities in Eastern Poland (e.g., Rzeszów, Lublin) which cluster with Western cities, breaking the “Poland B” block.
This multi-dimensional analysis refutes the simplistic “Poland A vs. B” binary. Instead, Poland is economically stratified into:
* Urban Powerhouses: Driving national growth.
* Industrial Transition Zones: Solid wages but varying social challenges.
* Rural Periphery: Lagging in all metrics, regardless of longitude.
Policy References & Recommendations:
The binary “Poland A vs. B” model is obsolete. Policy must pivot to a standard based on Functional Regions: