1 Abstract

This project challenges the historical “Poland A vs. Poland B” binary by using hierarchical clustering to analyze regional development. Based on four socio-economic indicators (wages, unemployment, entrepreneurship, and urbanization), the analysis reveals a three-tiered economic hierarchy. The findings suggest that development is driven more by “Metropolitan” dynamics and “Transitional” zones than by geographic longitude, offering a more nuanced basis for targeted policy interventions.

2 Introduction

2.1 Analysis Context

The historical narrative of Poland’s development is often reduced to a binary “Poland A” (Western, industrialized) and “Poland B” (Eastern, agricultural). However, decades of EU integration, infrastructure investments, and the rise of service-oriented urban hubs suggest this split may be obsolete.

This analysis uses hierarchical clustering to investigate whether regional development in Poland follows the traditional geographic divide or a more complex, tiered hierarchy driven by urbanization and labor market dynamics.

2.2 Research Questions

Geography vs. Economy: Based on wages, unemployment, and entrepreneurial activity, do Polish Poviats still cluster along historical East-West borders?
The Middle-Income Trap: Which regions are currently caught in a transition phase, showing high wages but low population density or few business entities per capita?
Metropolitan Dominance: To what extent does the “Mazowieckie effect” (Warsaw) and other major cities isolate specific regions from their geographic neighbors?

2.3 Hypotheses

H1: Regional development is no longer a simple East-West split but a three-tiered hierarchy: Metropolises, Transition Zones, and Peripheries.
H2: “Islands of Prosperity” exist within traditionally poorer regions, driven by local entrepreneurship rather than geographic location.

3 Methodology & Data

Data is sourced from the Statistics Poland - Local Data Bank (BDL) for the year 2022. We focus on four key indicators at the Poviat (Level 5) level:

Unemployment: Registered unemployment rate (%).
Wages: Average monthly gross wages as an index relative to the national average (Poland = 100).
Entities: Business entities per 1000 population (proxy for entrepreneurship).
Housing/Density: Population per 1 km² (proxy for urbanization).

3.1 Data Acquisition and Preprocessing

We retrieve the data directly via the BDL API, process it to handle missing values, and prepare it for clustering.

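# Packages used throughout the report (assumed; the original setup chunk is not shown)
library(bdl)         # get_data_by_variable()
library(dplyr)       # %>%, select(), mutate(), group_by(), summarise(), inner_join()
library(tidyr)       # drop_na()
library(tibble)      # column_to_rownames()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(corrplot)    # corrplot()
library(factoextra)  # get_clust_tendency(), fviz_nbclust(), fviz_dend(), fviz_silhouette()
library(gridExtra)   # grid.arrange()
library(tmap)        # tm_shape(), tm_polygons(), tm_view()

# Fetch data from BDL
var_ids <- c(
  "unemployment" = "60270", # Registered unemployment rate
  "wages"        = "64429", # Average monthly gross wages and salary in relation to the average domestic (Poland = 100)
  "entities"     = "634123", # Entities by size classes per 1000 population total
  "housing"      = "60559" # Population per 1 km2
)

df_raw <- get_data_by_variable(
  varId = var_ids,
  unitLevel = 5,
  year = 2022,
  lang = "en"
)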

head(df_raw) %>%
  kable(caption = "Preview of Raw Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Preview of Raw Data
id	name	year	val_60270	val_64429	val_634123	val_60559	measureUnitId_60270	measureName_60270	measureUnitId_64429	measureName_64429	measureUnitId_634123	measureUnitId_60559	measureName_60559	attrId_60270	attrId_64429	attrId_634123	attrId_60559
011212001000	Powiat bocheński	2022	3.0	84.2	103.2	164.7	50	%	50	%	1	26	person	1	1	1	1
011212006000	Powiat krakowski	2022	4.6	95.2	131.5	244.0	50	%	50	%	1	26	person	1	1	1	1
011212008000	Powiat miechowski	2022	5.0	80.8	108.3	69.5	50	%	50	%	1	26	person	1	1	1	1
011212009000	Powiat myślenicki	2022	3.3	83.6	120.3	192.5	50	%	50	%	1	26	person	1	1	1	1
011212014000	Powiat proszowicki	2022	6.1	83.2	104.1	101.9	50	%	50	%	1	26	person	1	1	1	1
011212019000	Powiat wielicki	2022	3.5	92.5	135.8	346.0	50	%	50	%	1	26	person	1	1	1	1

# Clean and Select Variables
df_clean <- df_raw %>%
  select(
    id,
    name,
    unemployment = val_60270,
    wages = val_64429,
    entities = val_634123,
    housing = val_60559
  ) %>%
  drop_na()

# Set row names for clustering
df_clustering <- df_clean %>%
  select(-name) %>%
  column_to_rownames("id")

# Scale the data
df_scaled <- scale(df_clustering)

head(df_scaled) %>%
  kable(digits = 2, caption = "Preview of Scaled Data") %>%
  kable_styling()

Preview of Scaled Data
	unemployment	wages	entities	housing
011212001000	-1.07	-0.21	-0.30	-0.30
011212006000	-0.67	0.78	0.64	-0.18
011212008000	-0.57	-0.51	-0.13	-0.46
011212009000	-0.99	-0.26	0.27	-0.26
011212014000	-0.29	-0.30	-0.27	-0.40
011212019000	-0.94	0.53	0.78	-0.01
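Why scaling matters here: the four indicators live on very different scales (unemployment in single-digit percentages, density in hundreds or thousands of people per km²), so an unscaled Euclidean distance would be dominated by density. The sketch below uses made-up toy values (not BDL data) to show that `scale()` is simply the column-wise z-score, i.e. each value minus the column mean, divided by the sample standard deviation.

```r
# Toy indicator values (hypothetical, not taken from BDL)
toy <- data.frame(
  unemployment = c(3.0, 5.0, 11.0),
  density      = c(70, 200, 3000)
)

# Manual z-score: subtract the column mean, divide by the sample standard deviation
manual_z <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() applies the same centring and scaling column by column
scaled_z <- scale(toy)

# Both routes give the same numbers (attributes aside)
same_result <- isTRUE(all.equal(manual_z, scaled_z, check.attributes = FALSE))

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(scaled_z)
col_sds   <- apply(scaled_z, 2, sd)
```

After this transformation every indicator contributes on a comparable footing to the distance matrix used for clustering.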

4 Exploratory Data Analysis

Before applying clustering algorithms, we assess the relationships between variables and the feasibility of clustering.

4.1 Correlation Analysis

res_cor <- cor(df_clustering)
corrplot(res_cor,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.col = "black",
  title = "Variable Correlation Matrix",
  mar = c(0, 0, 1, 0),
  diag = FALSE
)

Interpretation: The correlation matrix reveals a strong positive relationship between urbanization (housing) and entrepreneurial activity (entities), which likely drives the formation of the “Metropolis” cluster. Unemployment shows a negative correlation with wages, as expected.

4.2 Clustering Tendency (Hopkins Statistic)

To verify if the dataset contains meaningful clusters (i.e., it is not uniformly distributed), we calculate the Hopkins Statistic. A value close to 1 indicates high clustering tendency.

res <- get_clust_tendency(df_scaled, n = nrow(df_scaled) - 1, graph = FALSE)
print(paste("Hopkins Statistic:", round(res$hopkins_stat, 3)))

## [1] "Hopkins Statistic: 0.904"

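The observed value of 0.904 is well above 0.5, the level expected for uniformly distributed data, and above the common 0.7 rule of thumb. This indicates that the data are suitable for clustering, although it says nothing about how many clusters there are.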
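To make the statistic less of a black box, here is a minimal hand-rolled sketch of one common formulation, \(H = \sum u_i / (\sum u_i + \sum w_i)\), where \(u_i\) are nearest-neighbour distances from uniformly random points to the data and \(w_i\) are nearest-neighbour distances between sampled real observations. The helper `hopkins_sketch` and the toy data are hypothetical; the implementation inside `get_clust_tendency()` may differ in details such as sampling or exponents, so the numbers are not expected to match it exactly.

```r
# Illustrative Hopkins statistic; hopkins_sketch is a hypothetical helper
hopkins_sketch <- function(x, m = 30) {
  x <- as.matrix(x)
  n <- nrow(x)

  # Euclidean distance from point p to its nearest row in mat
  nearest <- function(p, mat) min(sqrt(rowSums(sweep(mat, 2, p)^2)))

  # u: m uniform random points inside the bounding box -> nearest real observation
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  u <- replicate(m, nearest(runif(ncol(x), lo, hi), x))

  # w: m sampled real observations -> nearest *other* real observation
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))

  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
# Two tight, well-separated blobs versus structureless uniform noise
clustered <- rbind(
  matrix(rnorm(200, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 10, sd = 0.3), ncol = 2)
)
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered)
h_uniform   <- hopkins_sketch(uniform)
```

In this toy setting `h_clustered` comes out close to 1 while `h_uniform` stays near 0.5, which is the contrast the rule of thumb relies on.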

5 Hierarchical Clustering

We use Ward’s Method (ward.D2) for hierarchical clustering to minimize within-cluster variance, using Euclidean distance as the dissimilarity measure.
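As a small illustration of how `ward.D2` behaves, the sketch below runs it on hypothetical toy points (not the poviat data). It checks two properties: the very first merge joins the closest pair of points at a height equal to their Euclidean distance, and cutting the tree at two clusters recovers two well-separated groups.

```r
set.seed(42)
# Two hypothetical, well-separated groups of five points each
toy <- rbind(
  matrix(rnorm(10, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(10, mean = 5, sd = 0.5), ncol = 2)
)

d  <- dist(toy, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# The first merge joins the closest pair, at a height equal to their distance
first_height <- hc$height[1]
closest_pair <- min(d)

# Cutting the tree at k = 2 assigns one label per observation
labels <- cutree(hc, k = 2)
table(labels)
```

Later merges are chosen to keep the increase in within-cluster variance as small as possible, which is why Ward's method tends to produce compact, similarly sized clusters.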

5.1 Determining Optimal Clusters

We employ three validation methods to determine the optimal number of clusters (\(k\)): Elbow Method, Silhouette Method, and Gap Statistic.

p1 <- fviz_nbclust(df_scaled, hcut, method = "wss") +
  labs(subtitle = "Elbow Method")

p2 <- fviz_nbclust(df_scaled, hcut, method = "silhouette") +
  labs(subtitle = "Silhouette Method")

p3 <- fviz_nbclust(df_scaled, hcut, method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap Statistic")

grid.arrange(p1, p2, p3, nrow = 1)

Decision: We select k = 3. The Silhouette Method peaks at this value, and the Gap Statistic likewise favors a small number of clusters. The choice is also consistent with the three-tiered hierarchy proposed in H1.
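The silhouette criterion can also be computed directly, which makes the selection rule explicit: cut the same Ward tree at each candidate k and keep the k with the highest average silhouette width. The sketch below does this on hypothetical toy data with three obvious groups; it assumes the `cluster` package (shipped with standard R installations) and is not a re-run of the analysis above.

```r
library(cluster)  # silhouette()

set.seed(7)
# Three hypothetical, well-separated groups of 30 points each
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.5), rnorm(30, centers[i, 2], 0.5))
}))

d  <- dist(toy)
hc <- hclust(d, method = "ward.D2")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

# The candidate with the highest average width is the suggested k
best_k <- ks[which.max(avg_sil)]
```

On real data the curve is usually much flatter than in this toy case, so the peak should be read together with the other criteria rather than in isolation.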

5.2 Dendrogram Analysis

The dendrogram visualizes the hierarchical relationship between poviats. Cutting the tree at k = 3 yields three main branches, corresponding to our proposed Metropolitan, Transition (Industrial), and Periphery tiers.

dist_mat <- dist(df_scaled, method = "euclidean")
hc_ward <- hclust(dist_mat, method = "ward.D2")

fviz_dend(hc_ward,
  k = 3,
  cex = 0.5,
  rect = TRUE,
  k_colors = c("#2E9FDF", "#E7B800", "#FC4E07"),
  main = "Clustering Hierarchy of Polish Poviats",
  show_labels = FALSE
)

6 Results & Interpretation

6.1 Cluster Profiles

We assign each poviat to a cluster and analyze the mean characteristics of each group.

groups <- cutree(hc_ward, k = 3)
df_final <- df_clean %>%
  mutate(Cluster = as.factor(groups))

profile_table <- df_final %>%
  group_by(Cluster) %>%
  summarise(
    Count = n(),
    `Avg Wages` = mean(wages),
    `Avg Unemployment (%)` = mean(unemployment),
    `Avg Entities (per 1k)` = mean(entities),
    `Avg Density (pop/km2)` = mean(housing)
  ) %>%
  arrange(desc(`Avg Wages`))

kable(profile_table, caption = "Socio-Economic Profiles of Clusters", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Socio-Economic Profiles of Clusters
Cluster	Count	Avg Wages	Avg Unemployment (%)	Avg Entities (per 1k)	Avg Density (pop/km2)
2	59	95.00	4.47	144.48	1687.56
1	181	87.91	5.14	113.88	135.41
3	140	81.20	11.19	96.39	78.45

Cluster Interpretation:

Cluster 2 (The Powerhouses - Metropolitan): Wages are the highest of the three clusters, yet on average still about 5% below the national average (index ~95 against a benchmark of 100). The defining feature is “urban friction”: extreme population density (~1688 pop/km²) and high business density (~144 entities/1k pop). Such density is commonly associated with innovation and growth.
Cluster 1 (The Transition Zone - Industrial): Solid wages and moderate density. These regions act as the industrial backbone but lack the hyper-concentration of services found in metropolises.
Cluster 3 (The Critical Zone - Periphery): The deep structural divide is evident here. The unemployment rate (~11.2%) is more than double that of the other clusters, indicating persistent economic stagnation despite EU investments.
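Because the wage variable is an index with Poland = 100, the gap to the national average is simply the index minus 100. The short calculation below uses the cluster means copied from the profile table and also expresses the periphery's unemployment rate as a multiple of each cluster's rate.

```r
# Cluster means copied from the profile table
profile <- data.frame(
  cluster      = c(2, 1, 3),
  wage_index   = c(95.00, 87.91, 81.20),  # Poland = 100
  unemployment = c(4.47, 5.14, 11.19)     # percent
)

# Gap to the national average wage, in percent of that average
profile$wage_gap <- profile$wage_index - 100

# Periphery (Cluster 3) unemployment as a multiple of each cluster's rate
periphery_rate <- profile$unemployment[profile$cluster == 3]
profile$unemp_ratio <- periphery_rate / profile$unemployment

profile
```

All three cluster means sit below 100. This is possible because the table averages poviats without weighting them by employment, so a small number of large, high-wage labor markets can presumably pull the national figure above the typical poviat.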

6.2 Spatial Distribution: The “New Geography”

We map the clusters to visualize the “Poland A vs B” vs. “Islands of Prosperity” hypotheses.

# Load map data
load(file.path(Sys.getenv("HOME"), "bdl.maps.2022.RData"))
map_level5 <- bdl.maps.2022$level5

# Join map data with clusters
map_with_clusters <- inner_join(map_level5, df_final, by = "id")

# Interactive Map
tmap_mode("view")
tm_shape(map_with_clusters) +
  tm_polygons(
    col = "Cluster",
    palette = "Set1",
    id = "name.x",
    popup.vars = c("Wages" = "wages", "Unemployment" = "unemployment", "Entities" = "entities", "Density" = "housing"),
    title = "Socio-Economic Clusters",
    alpha = 0.7
  ) +
  tm_view(set.view = c(19, 52, 6))

Islands of Prosperity: The map is consistent with H2. We observe Cluster 2 (Metropolitan) poviats in Eastern Poland (e.g., Rzeszów and Lublin), indicating that high-performing urban centers exist independently of their regional neighbors.

The Mazowieckie Effect: A stark contrast is visible around Warsaw (Mazowieckie). The capital is a high-performing island surrounded immediately by Cluster 3 (Periphery) poviats, illustrating a “hollow center” effect where wealth does not easily spill over to the immediate neighborhood.

7 Outlier Analysis

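We identify poviats that defy the usual pattern: wages well above the national norm (z-score above 1) combined with either elevated unemployment or below-median population density. These are potential “hidden gems” or “problem areas”. Note that the z-scores are computed across all poviats rather than within each cluster.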

df_outliers <- df_final %>%
  mutate(
    wage_z = as.vector(scale(wages)),
    unemp_z = as.vector(scale(unemployment))
  ) %>%
  filter(
    (wage_z > 1 & unemp_z > 0.5) | # High wage, High unemp
      (wage_z > 1 & housing < median(housing)) # High wage, Low density
  ) %>%
  select(name, Cluster, wages, unemployment, housing)

kable(df_outliers, caption = "Anomalous Poviats: High Deviation from Cluster Patterns") %>%
  kable_styling()

Anomalous Poviats: High Deviation from Cluster Patterns
name	Cluster	wages	unemployment	housing
Powiat polkowicki	1	102.3	4.4	78.6
Powiat wołowski	1	117.9	11.2	67.8
Powiat poddębicki	1	110.0	6.8	45.0
Powiat łęczyński	1	127.4	5.7	87.8
Powiat bielski	1	102.0	4.2	36.9
Powiat hajnowski	1	101.0	8.1	24.3
Powiat kozienicki	1	108.5	10.3	62.8

Contextualizing the Outliers:

The Mining/Energy Effect: Poviats such as polkowicki and łęczyński appear as outliers because of high wages (indices > 100) tied to copper and coal mining industries. However, they retain the low housing density typical of rural areas, separating them from metropolitan service hubs.
The Hajnowski Anomaly: Regions like hajnowski combine relatively high unemployment (8.1%) with wages slightly above the national average (index 101.0), likely driven by specific local industries (e.g., forestry/tourism near Białowieża) or public sector dominance, creating a decoupling of labor market health and wage levels.
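The filter above standardizes wages and unemployment across all poviats. If the goal is to find poviats that are unusual relative to their own cluster, one alternative (a sketch, not the method used in this report) is to compute z-scores within each cluster. The example below uses a hypothetical mini-dataset standing in for `df_final` and base R's `ave()` to standardize by group.

```r
# Hypothetical mini-dataset standing in for df_final
toy <- data.frame(
  name    = c("A", "B", "C", "D", "E", "F"),
  Cluster = factor(c(1, 1, 1, 2, 2, 2)),
  wages   = c(80, 82, 120, 95, 97, 100)
)

# Global z-score: compares every poviat with the whole sample
toy$wage_z_global <- as.vector(scale(toy$wages))

# Within-cluster z-score: compares each poviat with its own cluster only
toy$wage_z_cluster <- ave(toy$wages, toy$Cluster,
                          FUN = function(x) (x - mean(x)) / sd(x))

# Poviats flagged under each definition
flag_global  <- toy$name[toy$wage_z_global > 1]
flag_cluster <- toy$name[toy$wage_z_cluster > 1]
```

In this toy case the global rule flags only "C", whereas the within-cluster rule also flags "F", whose wage is unremarkable overall but high for its own cluster. Which definition is appropriate depends on whether "anomalous" is meant nationally or relative to the cluster profile.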

8 Discussion & Methodology Check

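Silhouette Width: As an internal validation check, we compute the silhouette width of the three-cluster solution. The resulting average of 0.278 is low: by the usual rule of thumb, values between roughly 0.25 and 0.5 indicate weak structure, so the three tiers overlap and should be read as broad tendencies rather than sharply separated groups.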

sil <- cluster::silhouette(groups, dist_mat)
fviz_silhouette(sil, print.summary = FALSE) + labs(title = "Silhouette Plot")

print(paste("Average Silhouette Width:", round(mean(sil[, 3]), 3)))

## [1] "Average Silhouette Width: 0.278"

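Implications: The analysis is broadly consistent with H1: the three clusters have distinct average profiles, although the low average silhouette width (0.278) shows that the tiers overlap at their boundaries. The spatial distribution is consistent with H2, showing “islands of prosperity” around major cities in Eastern Poland (e.g., Rzeszów, Lublin) that cluster with Western cities, breaking up the “Poland B” block.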

9 Conclusion

This multi-dimensional analysis refutes the simplistic “Poland A vs. B” binary. Instead, Poland is economically stratified into:

Urban Powerhouses: Driving national growth.
Industrial Transition Zones: Solid wages but varying social challenges.
Rural Periphery: Lagging in all metrics, regardless of longitude.

Policy References & Recommendations:

The binary “Poland A vs. B” model is obsolete. Policy must pivot to a standard based on Functional Regions:

Targeting the Transition Zone: The priority should be preventing Cluster 1 (Transitional) regions from sliding into the Periphery. Improvements in transport infrastructure (rail/road) linking these areas to nearby Metropolises can help them capture spillover growth.
Periphery Integration: For Cluster 3, simple cash transfers are insufficient. Focus on digital inclusion and human capital investments to reduce the structural unemployment gap (>11%).

Beyond the ‘Poland A’ and ‘Poland B’ Myth: A Multi-Dimensional Hierarchical Clustering Analysis

Truong Giang Do - 488388

2026-01-31