knitr::opts_knit$set(root.dir = normalizePath(".."))

knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  fig.width = 9,
  fig.height = 6,
  results = "hold"
)

library(DiagrammeR)

1 Executive Summary & Introduction

1.1 Abstract

This project explores global development patterns using country-level World Development Indicators (WDI) averaged over 2017–2019. To capture the multidimensional nature of development beyond income alone, a broad set of economic, demographic, institutional, and infrastructure indicators is analyzed using unsupervised learning techniques.

After data preprocessing and dimensionality reduction via Principal Component Analysis (PCA), countries are grouped using hierarchical clustering in the reduced space. The analysis identifies a small number of stable and interpretable country typologies, reflecting systematic differences in development structures related to institutional quality, human capital, labor market characteristics, and demographic pressure.

1.2 Motivation

Economic development is inherently multidimensional. Traditional analyses often rely heavily on GDP per capita, yet this single metric fails to capture critical differences in institutional quality, human capital, labor market dynamics, infrastructure, and environmental sustainability. Many development indicators are highly correlated, suggesting underlying latent dimensions that unsupervised learning methods like PCA can uncover.

Unsupervised learning techniques are well suited to this setting, as no single ground-truth label exists for development types. By applying Principal Component Analysis and clustering to a curated set of World Development Indicators, this study aims to extract latent structure and derive data-driven country typologies without imposing predefined classifications such as income groups.

1.3 Methodological Overview

The project follows a rigorous unsupervised learning pipeline. After data cleaning (missingness analysis, hierarchical imputation) and preprocessing (log-transformation, standardization), Principal Component Analysis (PCA) was used to reduce dimensionality. Subsequently, Hierarchical Clustering (Ward.D2) was applied in the reduced orthogonal space. The final typology was validated using stability metrics (Bootstrap Jaccard Resampling - 200 iterations.) and internal indices (Silhouette, Calinski-Harabasz).

1.4 Research Questions

This analysis addresses two primary questions:

What are the main latent dimensions that explain variation across a broad set of development indicators?
Can countries be reliably grouped into stable, interpretable clusters based on their positions in the reduced PCA space, and what development profiles do these clusters represent?

1.5 The Research Environment

The project was performed as a part of Unsupervised Learning Course of Data Science and Business Analytics Master’s Curriculum at the University of Warsaw

1.5.1 The Workflow

Project Folder Structure is :

  
Project folder structure:

USL_Global_Development/
├── USL_Global_Development.Rproj
├── data/
│   ├── raw/
│   └── processed/
├── output/
├── report/
│   └── USL_Global_Development.Rmd
└── scripts/
    ├── 01_import_clean.R
    ├── 02_dimension_reduction.R
    └── 03_clustering.R

The analysis follows a three-script pipeline for this combined PCA and Clustering Project

1.5.2 Prerequisites

# Global Options
options(stringsAsFactors = FALSE)

# 1. Load Core Libraries

library(tidyverse)      # Data manipulation & plotting
library(janitor)        # Data cleaning
library(WDI)            # Data fetch
library(naniar)         # Missingness visualization
library(kableExtra)     # Table styling


# 2. Load Analysis Libraries

library(psych)          # Skewness & description
library(FactoMineR)     # PCA
library(factoextra)     # Visualization (PCA/Cluster)
library(cluster)        # Clustering algorithms
library(fpc)            # Clustering validation
library(hopkins)        # Clusterability
library(corrplot)       # Correlation matrix


# 3. Load Mapping Libraries

library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggsci)          # color palettes
library(pheatmap)       # Heatmaps


# Create Directory Structure
dir.create("output", recursive = TRUE, showWarnings = FALSE)
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)

2 Dataset & Preprocessing

This section explains the data-related decisions made for this project.

2.1 The Dataset : World Bank WDI

The dataset is constructed from the World Bank World Development Indicators (WDI) at the country level. To reduce volatility and obtain a stable cross-sectional representation, all indicators are aggregated to a three-year mean over 2017–2019. The unit of analysis is a country, described by a multidimensional set of economic, demographic, health, education, infrastructure, environmental, and governance indicators.

For reproducibility, intermediate datasets are stored as RDS objects. The descriptive imputed dataset is used for cluster profiling and interpretation, while the standardized dataset is used for PCA and clustering.

2.1.1 Indicator Selection

Indicators were selected and the dataset was aggregated to 2017–2019 means to obtain a stable cross-sectional matrix for PCA and clustering.

# Define Indicator Pools Main WDI indicators WDI api
indicators_wdi <- c(
  # Economy / structure
  gdp_per_cap_ppp     = "NY.GDP.PCAP.PP.KD",
  gdp_growth          = "NY.GDP.MKTP.KD.ZG",
  capital_formation   = "NE.GDI.FTOT.ZS",
  agri_value_added    = "NV.AGR.TOTL.ZS",
  service_value_added = "NV.SRV.TOTL.ZS",
  
  # Openness / capital
  exports_gdp         = "NE.EXP.GNFS.ZS",
  fdi_inflows         = "BX.KLT.DINV.WD.GD.ZS",
  
  # Health
  life_expectancy     = "SP.DYN.LE00.IN",
  sanitation_basic    = "SH.STA.BASS.ZS",
  health_exp_gdp      = "SH.XPD.CHEX.GD.ZS",
  measles_immun       = "SH.IMM.MEAS",
  
  # Education
  school_sec_gross    = "SE.SEC.ENRR",
  school_ter_gross    = "SE.TER.ENRR",
  gov_exp_edu         = "SE.XPD.TOTL.GD.ZS",
  
  # Demography
  pop_growth          = "SP.POP.GROW",
  dependency_ratio    = "SP.POP.DPND",
  urban_pop_rate      = "SP.URB.TOTL.IN.ZS",
  
  # Labor
  unemployment        = "SL.UEM.TOTL.ZS",
  labor_part_tot      = "SL.TLF.CACT.ZS",
  labor_part_fem      = "SL.TLF.CACT.FE.ZS",
  
  # Tech / infra
  access_elec         = "EG.ELC.ACCS.ZS",
  elec_consump        = "EG.USE.ELEC.KH.PC",
  internet_usage      = "IT.NET.USER.ZS",
  mobile_subs         = "IT.CEL.SETS.P2",
  
  # Environment
  gge_per_cap         = "EN.GHG.ALL.PC.CE.AR5",
  renew_energy        = "EG.FEC.RNEW.ZS",
  
  #Governance (WGI)
  corruption_ctrl = "CC.EST",
  gov_effect      = "GE.EST"
)

years <- c(2017:2019)

Indicators were selected based on three complementary criteria:

Relevance in development economics: Variables capture widely accepted dimensions of development, including economic performance (GDP per capita, growth), human capital (education, health), institutional quality (governance effectiveness, corruption control), infrastructure access, and demographic structure.
Coverage and data quality: Indicators with broad country coverage and consistent availability over 2017–2019 were prioritized to minimize missingness and unstable imputation.
Latent structure relevance: The selected indicators are known to be correlated yet conceptually distinct, making them well suited for latent dimension extraction via PCA.

2.1.2 Data Import

Data were retrieved via the WDI package API and cached locally for reproducibility & speed.

# Get data 
raw_path <- "data/raw/wdi_raw_2017_2019.rds"
if (!file.exists(raw_path)) {
  message("Raw WDI data not found. Downloading from API...")
  options(timeout = 360)    
  raw_wdi <- WDI(
    country   = "all",
    indicator = indicators_wdi,
    start     = min(years),
    end       = max(years),
    extra     = TRUE,
    cache     = NULL
  )
  
  dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
  saveRDS(raw_wdi, raw_path)
  
} else {
  message("Raw WDI data found. Loading from disk...")
  raw_wdi <- readRDS(raw_path)
}
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)

To ensure reproducibility and reduce API dependence, raw downloads and intermediate datasets are cached as RDS files. If cached files exist, the pipeline loads from disk; otherwise, it downloads and saves them.

2.2 Data Cleaning

The section displays the process of transforming the raw data so it is suitable for further analysis

2.2.1 Data Preprocesing

The Important Indicators were taken and Dataset was averaged for 3 years

# Filtering
wdi_clean <- raw_wdi %>%
  clean_names() %>%
  filter(region != "Aggregates") %>%      # drop World/regions aggregates
  filter(!is.na(iso3c)) %>%               # keep proper countries
  filter(nchar(iso3c) == 3) %>% 
  select(country, iso3c, region, income, year, all_of(names(indicators_wdi)))
# Aggregate to 3-year Average 
# cross-sectional matrix for clustering/PCA
wdi_avg <- wdi_clean %>%
  group_by(country, iso3c, region, income) %>%
  summarise(across(all_of(names(indicators_wdi)), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
wdi_avg <- wdi_avg %>%
  mutate(across(all_of(names(indicators_wdi)), ~ ifelse(is.nan(.x), NA, .x)))

Note: If a country has fewer than three observed years for an indicator, the 2017–2019 mean is computed using available years (via na.rm = TRUE), avoiding unnecessary country loss.

cat(
  "Final dataset dimensions:\n",
  "Countries:", nrow(wdi_avg), "\n",
  "Indicators:", length(indicators_wdi), "\n"
)

## Final dataset dimensions:
##  Countries: 215 
##  Indicators: 28

2.2.2 Missing data Analysis

We apply a strict 30/40 rule to ensure data quality:

The 30% variable threshold reduces unstable imputations, while the 40% country threshold avoids constructing country profiles dominated by imputed values.

1.Indicators: Variables with >30% missingness are dropped.

2.Countries: Observations with >40% missingness are dropped.

2.2.2.1 Variable-Level Analysis

For each WDI indicator missing share were assessed and threshold for dropping in was set to 0.3

# Missing Data Analysis
# Variable 
miss_var <- wdi_avg %>%
  summarise(across(all_of(names(indicators_wdi)), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_share") %>%
  arrange(desc(missing_share))

# visual
p1 <- ggplot(miss_var, aes(x = reorder(variable, missing_share), y = missing_share)) +
  geom_col() +
  coord_flip() +
  labs(title = "Initial Missing Share by Variable", x = NULL, y = "Missing share")

print(p1)

Indicators above 30% missingness were excluded to avoid unstable imputations and preserve interpretability.

cutoff <- 0.30

vars_drop <- miss_var %>%
  filter(missing_share > cutoff) %>%
  pull(variable)

vars_keep <- miss_var %>%
  filter(missing_share <= cutoff) %>%
  pull(variable)

2.2.2.2 Row-Wise Analysis

Next, the countries with high missing data share were assessed . To not include bias and represent different countries country_cutoff <- 0.40 was set

# Row-wise (Country) Missingness Check

country_miss <- wdi_avg %>%
  select(country, iso3c, region, income, all_of(vars_keep)) %>%
  mutate(miss_prop = rowMeans(is.na(select(., all_of(vars_keep))))) %>%
  arrange(desc(miss_prop))

# 40% cutoff: 
country_cutoff <- 0.40
countries_drop <- country_miss %>% filter(miss_prop > country_cutoff) %>% pull(iso3c) 

cat("Dropping", length(countries_drop), "countries due to excessive missingness.\n")


wdi_reduced <- wdi_avg %>%
  filter(!iso3c %in% countries_drop) %>%
  select(country, iso3c, region, income, all_of(vars_keep))

## Dropping 14 countries due to excessive missingness.

Near-Constant variables were assessed and dropped to ensure the smooth execution of algorithms down the pipeline

# Drop Near-Constant Variables 
# We also check variables that became all-NA after

stats_check <- wdi_reduced %>%
  summarise(across(all_of(vars_keep), list(
    na_count = ~ sum(!is.na(.)),
    sd       = ~ sd(., na.rm = TRUE)
  ))) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", "stat"),
    names_pattern = "^(.*)_(na_count|sd)$"
  ) %>%
  pivot_wider(names_from = stat, values_from = value)

vars_drop_stat <- stats_check %>%
  filter(na_count < 10 | sd < 1e-10) %>%
  pull(variable)
  
vars_model <- setdiff(vars_keep, vars_drop_stat)
wdi_reduced <- wdi_reduced %>% select(country, iso3c, region, income, all_of(vars_model))
cat("Variables dropped (low variance/obs):", length(vars_drop_stat), "\n")
cat("Final variable count:", length(vars_model), "\n")

## Variables dropped (low variance/obs): 0 
## Final variable count: 26

Tables below summarizes coverage after preprocessing.

Dataset snapshot after preprocessing (before imputation).
countries	indicators	regions	income_groups
201	26	7	5

summary_tbl <- data.frame(
  Step = c("Initial countries (after aggregates removed)",
           "Countries dropped (missingness > 40%)",
           "Final countries",
           "Initial indicators",
           "Indicators dropped (missingness > 30%)",
           "Indicators dropped (near-constant / too few obs)",
           "Final indicators"),
  Value = c(nrow(wdi_avg),
            length(countries_drop),
            nrow(wdi_reduced),
            length(indicators_wdi),
            length(vars_drop),
            length(vars_drop_stat),
            length(vars_model))
)
knitr::kable(summary_tbl, caption = "Preprocessing decisions and resulting dataset size.")

Preprocessing decisions and resulting dataset size.
Step	Value
Initial countries (after aggregates removed)	215
Countries dropped (missingness > 40%)	14
Final countries	201
Initial indicators	28
Indicators dropped (missingness > 30%)	2
Indicators dropped (near-constant / too few obs)	0
Final indicators	26

2.2.3 Data Imputation

We employ a hierarchical median imputation (Region Median \(\rightarrow\) Global Median) to preserve regional structural differences.

# Hierarchical Imputation
#Region then Global
wdi_imp <- wdi_reduced %>%
  group_by(region) %>%
  mutate(across(all_of(vars_model), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup() %>%
  mutate(across(all_of(vars_model), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) # Global fallback
# Fail if any NAs remain later.
na_count_post <- sum(is.na(wdi_imp[vars_model])) 
stopifnot(all(vars_model %in% names(wdi_imp)))
stopifnot(all(sapply(wdi_imp[vars_model], is.numeric)))
if (na_count_post > 0) {
  stop("CRITICAL ERROR: Imputation failed. NAs still exist in the dataset.")
} else {
  message("Imputation successful: 0 missing values remaining.")
}
stopifnot(!anyNA(wdi_imp[vars_model]))

Note: We did not include the Income based median imputation in our imputation rule set. Though at first glance it made sense , the income labels will force data leakage and disrupt natural cluster discovery.

Imputation audit. The table reports the share of values imputed (missing before imputation) for the most affected indicators.

Indicators with the largest share of imputed values (top 10).
variable	imputed_share
school_sec_gross	0.234
capital_formation	0.169
gov_exp_edu	0.129
exports_gdp	0.114
unemployment	0.090
labor_part_tot	0.090
labor_part_fem	0.090
health_exp_gdp	0.055
measles_immun	0.055
gdp_per_cap_ppp	0.040

2.2.4 Skewness & Log Transformation

Variables with high skewness (>2) are log-transformed (\(log_{10}(x+1)\)) to improve normality, excluding percentage-based metrics.

# Skewness & Log Transformation
# Calculate skewness table 
skew_tbl <- describe(wdi_imp[vars_model]) %>%
  as.data.frame() %>%
  select(skew) %>%
  rownames_to_column("variable") %>%   
  arrange(desc(abs(skew)))
#Identify highly skewed variables
high_skew_vars <- skew_tbl %>%
  filter(skew > 2) %>%
  pull(variable)
#Identify non-negative variables 
safe_to_log <- vars_model[sapply(wdi_imp[vars_model], function(x) min(x, na.rm = TRUE) >= 0)]
#skewed AND safe
vars_to_log <- intersect(high_skew_vars, safe_to_log)
#bounded percent/rate indicators
percent_like <- c(
  "capital_formation","agri_value_added","service_value_added",
  "exports_gdp","fdi_inflows","sanitation_basic","health_exp_gdp",
  "gov_exp_edu","urban_pop_rate","unemployment","labor_part_tot",
  "labor_part_fem","access_elec","internet_usage","renew_energy"
)
vars_to_log <- setdiff(vars_to_log, percent_like)
cat("Log-transforming", length(vars_to_log), "variables:\n")
print(vars_to_log)
# Apply log10(x+1)
wdi_transformed <- wdi_imp %>%
  mutate(across(all_of(vars_to_log), ~ log10(.x + 1)))
#no NaN/Inf introduced by transform
stopifnot(all(is.finite(as.matrix(wdi_transformed[vars_model]))))

## Log-transforming 1 variables:
## [1] "gge_per_cap"

cat("Checks after transformations:\n")
cat("Any NA in wdi_transformed vars_model? ", anyNA(wdi_transformed[vars_model]), "\n")
cat("Any infinite values? ", any(!is.finite(as.matrix(wdi_transformed[vars_model]))), "\n")

## Checks after transformations:
## Any NA in wdi_transformed vars_model?  FALSE 
## Any infinite values?  FALSE

2.2.5 Standardization

The standardization is needed to make variables comparable in analysis

# Standardization 
wdi_scaled <- wdi_transformed %>%
  mutate(across(all_of(vars_model), ~ as.numeric(scale(.x))))
stopifnot(all(sapply(wdi_scaled[vars_model], is.numeric)))
stopifnot(!anyNA(wdi_scaled[vars_model]))
# Check Boxplot of scaled variables (sample 12 vars) 
set.seed(47)
vars_sample <- sample(vars_model, min(12, length(vars_model)))
wdi_scaled %>%
  select(all_of(vars_sample)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = name, y = value)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Scaled Variable Distributions (sample)", x = NULL, y = "Z-score")

The post-scaling distributions confirm comparable scale across indicators, which is required for PCA and Euclidean-distance clustering.

2.2.6 Saving the Data

# Saving Outputs
saveRDS(wdi_imp,    "data/processed/wdi_imputed_descriptive.rds") # for Cluster Naming
saveRDS(wdi_scaled, "data/processed/wdi_scaled_pca_input.rds")    # for PCA/Clustering
cat("Success! Processed data saved. Ready for PCA.\n")

## Success! Processed data saved. Ready for PCA.

3 Dimension Reduction : PCA

Many World Development Indicators are strongly correlated, reflecting overlapping aspects of economic structure, human development, infrastructure, and governance. Applying clustering directly to the raw indicator space can violate distance assumptions due to multicollinearity and unequal variance. Principal Component Analysis (PCA) is therefore used to extract orthogonal latent dimensions that summarize the main sources of variation in the data. PCA reduces noise, improves distance validity, and facilitates interpretation of country positions in a lower-dimensional space.

# Load Processed Data
wdi_scaled <- readRDS("data/processed/wdi_scaled_pca_input.rds")
# Metadata columns not enter PCA
meta_vars <- c("iso3c", "region", "income", "year")
# PCA matrix
wdi_matrix <- wdi_scaled %>%
  column_to_rownames("country") %>%
  select(-any_of(meta_vars))
# checks 
stopifnot(all(map_lgl(wdi_matrix, is.numeric)))
stopifnot(!anyNA(wdi_matrix))
cat("PCA input:", nrow(wdi_matrix), "countries x", ncol(wdi_matrix), "indicators\n")

## PCA input: 201 countries x 26 indicators

This section analyzes the latent structure of the remaining indicators after preprocessing to reduce noise and multicollinearity.

##PCA Execution

# wdi_matrix is already z-scaled 
pca_res <- PCA(wdi_matrix, scale.unit = FALSE,ncp=10, graph = FALSE)

##Component Selection

We use the Scree Plot and Kaiser Criterion also, checking the total variance explained to determine the number of dimensions.

p_scree <- fviz_eig(pca_res, addlabels = TRUE) +
  labs(title = "Scree Plot: Explained Variance by PCA Dimensions")

print(p_scree)

eig <- get_eigenvalue(pca_res) %>% as.data.frame()
eig_tbl <- eig %>%
  mutate(Component = paste0("PC", row_number())) %>%
  select(Component, eigenvalue, variance.percent, cumulative.variance.percent)

knitr::kable(
  head(eig_tbl, 10),
  digits = 3,
  caption = "Eigenvalues and explained variance (first 10 components)"
)

Eigenvalues and explained variance (first 10 components)
	Component	eigenvalue	variance.percent	cumulative.variance.percent
Dim.1	PC1	10.693	41.333	41.333
Dim.2	PC2	2.705	10.454	51.787
Dim.3	PC3	1.798	6.950	58.737
Dim.4	PC4	1.539	5.949	64.686
Dim.5	PC5	1.321	5.106	69.792
Dim.6	PC6	0.959	3.706	73.498
Dim.7	PC7	0.831	3.213	76.712
Dim.8	PC8	0.771	2.981	79.693
Dim.9	PC9	0.655	2.533	82.226
Dim.10	PC10	0.624	2.414	84.640

kaiser_k <- sum(eig$eigenvalue > 1)
var70_k  <- which(eig$cumulative.variance.percent >= 70)[1]
k_pca <- min(kaiser_k, var70_k)

cat("Kaiser (eig > 1):", kaiser_k, "\n")
cat("Components for >=70% variance:", var70_k, "\n")
cat("Selected number of components (k_pca):", k_pca, "\n")

## Kaiser (eig > 1): 5 
## Components for >=70% variance: 6 
## Selected number of components (k_pca): 5

Component retention. We retain \(k\) components using a rule: minimum of (i) Kaiser criterion (eigenvalue > 1) and (ii) components needed for ≥70% cumulative explained variance.

PCA component retention rule and selected dimensionality.
Rule	Value
Kaiser (eig > 1)	5
≥ 70% cumulative variance	6
Selected k	5

Retaining five components balances information preservation with interpretability, while keeping the clustering distance metric valid by operating in an orthogonal space

3.1 Interpretation of Dimensions

contrib_tbl <- as.data.frame(pca_res$var$contrib[, 1:k_pca, drop = FALSE]) %>%
  rownames_to_column("variable") %>%
  pivot_longer(-variable, names_to = "component", values_to = "contrib") %>%
  group_by(component) %>%
  slice_max(contrib, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(component, desc(contrib))

knitr::kable(
  contrib_tbl,
  digits = 3,
  caption = "Top contributing variables (per retained component)"
)

Top contributing variables (per retained component)
variable	component	contrib
internet_usage	Dim.1	8.336
life_expectancy	Dim.1	7.534
sanitation_basic	Dim.1	7.463
access_elec	Dim.1	6.721
school_sec_gross	Dim.1	6.627
agri_value_added	Dim.1	6.617
dependency_ratio	Dim.1	6.588
gov_effect	Dim.1	6.555
gdp_per_cap_ppp	Dim.1	5.495
corruption_ctrl	Dim.1	5.074
labor_part_fem	Dim.2	26.391
labor_part_tot	Dim.2	26.168
unemployment	Dim.2	14.990
gdp_per_cap_ppp	Dim.2	4.191
gov_effect	Dim.2	3.961
corruption_ctrl	Dim.2	3.790
renew_energy	Dim.2	3.737
gdp_growth	Dim.2	2.876
exports_gdp	Dim.2	2.670
capital_formation	Dim.2	2.263
health_exp_gdp	Dim.3	29.332
gov_exp_edu	Dim.3	20.396
mobile_subs	Dim.3	7.730
corruption_ctrl	Dim.3	5.338
gdp_growth	Dim.3	4.548
service_value_added	Dim.3	4.536
capital_formation	Dim.3	4.460
labor_part_fem	Dim.3	3.843
pop_growth	Dim.3	3.757
exports_gdp	Dim.3	3.580
capital_formation	Dim.4	26.413
gdp_growth	Dim.4	22.439
measles_immun	Dim.4	9.786
gov_exp_edu	Dim.4	8.409
urban_pop_rate	Dim.4	7.051
exports_gdp	Dim.4	4.648
gdp_per_cap_ppp	Dim.4	4.625
access_elec	Dim.4	2.668
unemployment	Dim.4	2.526
gge_per_cap	Dim.4	2.257
fdi_inflows	Dim.5	33.128
gdp_growth	Dim.5	11.901
exports_gdp	Dim.5	11.267
labor_part_tot	Dim.5	7.848
pop_growth	Dim.5	7.374
service_value_added	Dim.5	6.522
gge_per_cap	Dim.5	3.349
unemployment	Dim.5	2.972
corruption_ctrl	Dim.5	2.776
labor_part_fem	Dim.5	2.379

3.1.1 Interpretation of Retained Components

Directional interpretation. The sign of loadings indicates which indicators increase in the positive direction of each component.

PC1: strongest positive loadings (top 5).
	loading	variable
internet_usage	0.944	internet_usage
life_expectancy	0.898	life_expectancy
sanitation_basic	0.893	sanitation_basic
access_elec	0.848	access_elec
school_sec_gross	0.842	school_sec_gross

PC1: strongest negative loadings (top 5).
	loading	variable
agri_value_added	-0.841	agri_value_added
dependency_ratio	-0.839	dependency_ratio
renew_energy	-0.721	renew_energy
pop_growth	-0.521	pop_growth
gdp_growth	-0.194	gdp_growth

PCA loading signs are arbitrary up to multiplication by -1; interpretations focus on relative contrasts within each component.

3.1.2 Strongest loadings table (top 8 by absolute loading)

loadings_tbl <- as.data.frame(pca_res$var$coord[, 1:k_pca, drop = FALSE]) %>%
  rownames_to_column("variable") %>%
  pivot_longer(-variable, names_to = "component", values_to = "loading") %>%
  group_by(component) %>%
  slice_max(abs(loading), n = 8, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(component, desc(abs(loading)))

knitr::kable(
  loadings_tbl,
  digits = 3,
  caption = "Largest absolute loadings (per retained component)"
)

Largest absolute loadings (per retained component)
variable	component	loading
internet_usage	Dim.1	0.944
life_expectancy	Dim.1	0.898
sanitation_basic	Dim.1	0.893
access_elec	Dim.1	0.848
school_sec_gross	Dim.1	0.842
agri_value_added	Dim.1	-0.841
dependency_ratio	Dim.1	-0.839
gov_effect	Dim.1	0.837
labor_part_fem	Dim.2	0.845
labor_part_tot	Dim.2	0.841
unemployment	Dim.2	-0.637
gdp_per_cap_ppp	Dim.2	0.337
gov_effect	Dim.2	0.327
corruption_ctrl	Dim.2	0.320
renew_energy	Dim.2	0.318
gdp_growth	Dim.2	0.279
health_exp_gdp	Dim.3	0.726
gov_exp_edu	Dim.3	0.606
mobile_subs	Dim.3	-0.373
corruption_ctrl	Dim.3	0.310
gdp_growth	Dim.3	-0.286
service_value_added	Dim.3	0.286
capital_formation	Dim.3	-0.283
labor_part_fem	Dim.3	0.263
capital_formation	Dim.4	0.638
gdp_growth	Dim.4	0.588
measles_immun	Dim.4	0.388
gov_exp_edu	Dim.4	0.360
urban_pop_rate	Dim.4	-0.329
exports_gdp	Dim.4	-0.267
gdp_per_cap_ppp	Dim.4	-0.267
access_elec	Dim.4	0.203
fdi_inflows	Dim.5	0.662
gdp_growth	Dim.5	0.396
exports_gdp	Dim.5	0.386
labor_part_tot	Dim.5	-0.322
pop_growth	Dim.5	0.312
service_value_added	Dim.5	0.294
gge_per_cap	Dim.5	-0.210
unemployment	Dim.5	0.198

Based on the highest contributions and loadings, the five retained components can be interpreted as follows:

PC1 (“Socio-Economic Development & Institutional Quality”): Captures the primary development continuum. Positive direction: high GDP per capita, strong governance, life expectancy, infrastructure access; negative direction: high agricultural share, demographic pressure.
PC2 (“Labor Market Participation”): Reflects overall and female labor force participation rates, with lower unemployment.
PC3 (“Public Human Capital Investment”): Emphasizes public spending on education and health relative to private capital formation.
PC4 (“Investment & Growth Dynamics”): Distinguishes investment-intensive, faster-growing economies from more mature ones.
PC5 (“External Economic Integration”): Separates trade/FDI-open economies from more domestic-oriented ones.

These dimensions are orthogonal by construction and collectively explain approximately 70% of total variance in the original indicators.

Visual map

fviz_pca_var(
  pca_res,
  col.var = "contrib",
  repel = TRUE,
  select.var = list(contrib = 20)
) + labs(title = "PCA Variable Map (Top 20 contributions, Dim 1 & 2)")

fviz_pca_ind(
  pca_res,
  axes = c(1,2),
  geom.ind = "point",
  col.ind = wdi_scaled$income,
  addEllipses = TRUE,
  legend.title = "Income"
) + labs(title = "Countries in PCA space (Dim 1 & 2), colored by income group")

fviz_pca_ind(
  pca_res,
  axes = c(1,2),
  geom.ind = "point",
  col.ind = wdi_scaled$region,
  addEllipses = TRUE,
  legend.title = "Region"
) + labs(title = "Countries in PCA space (Dim 1 & 2), colored by region")

Because PCA components are orthogonal (uncorrelated), the Euclidean distance used in the subsequent clustering step is valid and not biased by the redundant information (multicollinearity) present in the raw WDI indicators.

Output for clustering. Metadata (region, income) is kept only for interpretation and visualization.

4 Clustering in PCA Space

Clustering is performed on the retained PCA scores (PC1–PCk) to avoid multicollinearity and ensure Euclidean distance validity. Ward’s hierarchical clustering (Ward.D2) is applied, and the number of clusters is selected using a stability-first strategy combining internal indices and bootstrap Jaccard stability.

#Load data
pca_scores <- readRDS("data/processed/pca_scores_for_clustering.rds")
wdi_imp    <- readRDS("data/processed/wdi_imputed_descriptive.rds")
pc_labels <- readRDS("data/processed/pc_labels.rds") 
wdi_scaled <- readRDS("data/processed/wdi_scaled_pca_input.rds")

# clustering space
pc_cols_all <- names(pca_scores)[grepl("^PC\\d+$", names(pca_scores))]
n_pcs_available   <- length(pc_cols_all)
pcs_all <- pca_scores %>% 
  column_to_rownames("country") %>% 
  select(all_of(pc_cols_all))
stopifnot(all(map_lgl(pcs_all, is.numeric)))
cat("Clustering input:", nrow(pcs_all), "countries x", ncol(pcs_all), "PCs (Available)\n")

## Clustering input: 201 countries x 5 PCs (Available)

n_pcs <- min(5, ncol(pcs_all))
pc_cols <- paste0("PC", 1:n_pcs)
X_pca <- pcs_all %>% select(all_of(pc_cols))
stopifnot(!anyNA(X_pca))
cat("Clustering on:", paste(pc_cols, collapse = ", "), "\n")

## Clustering on: PC1, PC2, PC3, PC4, PC5

4.1 Clusterability Assessment

Before applying a partitioning algorithm, we must ensure the data contains meaningful structure rather than a random distribution.

set.seed(47)
H <- hopkins::hopkins(as.matrix(X_pca), m = floor(0.1 * nrow(X_pca)))
cat("Hopkins statistic H =", round(H, 4), "\n")

## Hopkins statistic H = 0.9988

d  <- dist(X_pca, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

H scores mean usually: scores near 0.5 random structure, closer to 1 clusterable structure ; closer to 0 uniform structure

The Hopkins statistic is far above 0.5, indicating strong departure from spatial randomness and supporting the presence of cluster structure in PCA space.

4.2 Hierarchical Strategy & Optimal k

4.2.1 k selection table

We iterate \(k\) from 2 to 6, selecting the solution based on Bootstrap Stability (Jaccard Index).

k_candidates <- 2:6
min_size_rule <- 7
B_boot <- 200

set.seed(47)
validation_df <- purrr::map_dfr(k_candidates, function(k) {
  grp <- cutree(hc, k = k)

  sil <- silhouette(grp, d)
  sil_avg <- mean(sil[, 3])

  ch <- fpc::calinhara(as.matrix(X_pca), grp)

  sizes <- table(grp)
  min_size <- min(sizes)

  set.seed(47 + k)
  boot <- fpc::clusterboot(
    as.matrix(X_pca),
    B = B_boot,
    bootmethod = "boot",
    clustermethod = fpc::hclustCBI,
    method = "ward.D2",
    k = k,
    count = FALSE,
    showplots = FALSE
  )

  pam_fit <- cluster::pam(as.matrix(X_pca), k = k)
  crand <- fpc::cluster.stats(d, grp, pam_fit$clustering)$corrected.rand

  tibble(
    k = k,
    Silhouette = sil_avg,
    Calinski_Harabasz = ch,
    Min_Cluster_Size = as.integer(min_size),
    Min_Jaccard = min(boot$bootmean),
    Mean_Jaccard = mean(boot$bootmean),
    PAM_Agreement_CRand = crand
  )
})

knitr::kable(
  validation_df,
  digits = 3,
  caption = "Model selection diagnostics for candidate numbers of clusters (Ward.D2 in PCA space)."
)

Model selection diagnostics for candidate numbers of clusters (Ward.D2 in PCA space).
k	Silhouette	Calinski_Harabasz	Min_Cluster_Size	Min_Jaccard	Mean_Jaccard	PAM_Agreement_CRand
2	0.411	104.211	30	0.679	0.788	0.514
3	0.270	115.331	30	0.440	0.667	0.623
4	0.234	97.253	30	0.501	0.635	0.545
5	0.252	88.759	13	0.458	0.635	0.396
6	0.242	82.717	13	0.306	0.571	0.693

The k selection logic

candidates_ok <- validation_df %>% filter(Min_Cluster_Size >= min_size_rule)
if (nrow(candidates_ok) == 0) stop("No k passed Min_Cluster_Size rule.")

stable_ok <- candidates_ok %>% filter(Min_Jaccard >= 0.60)

if (nrow(stable_ok) > 0) {
  k_clusters <- stable_ok %>%
    arrange(desc(Mean_Jaccard), desc(Silhouette)) %>%
    slice(1) %>% pull(k)
  reason <- "Highest Mean Jaccard among solutions with Min Jaccard >= 0.60 and adequate cluster sizes."
} else {
  k_clusters <- candidates_ok %>%
    arrange(desc(Min_Jaccard), desc(Mean_Jaccard), desc(Silhouette)) %>%
    slice(1) %>% pull(k)
  reason <- "Fallback: no solution reached Min Jaccard >= 0.60; selected best stability compromise."
}

knitr::kable(
  data.frame(Selected_k = k_clusters, Reason = reason),
  caption = "Selected number of clusters (stability-first decision)."
)

Selected number of clusters (stability-first decision).
Selected_k	Reason
2	Highest Mean Jaccard among solutions with Min Jaccard >= 0.60 and adequate cluster sizes.

4.2.2 Final clustering and key plots

groups <- cutree(hc, k = k_clusters)
stopifnot(identical(names(groups), rownames(X_pca)))

4.2.3 Final validation

Cluster sizes (number of countries).
cluster	n_countries
1	171
2	30

Validation summary for the final clustering solution.
Metric	Value
Avg silhouette	0.411
Calinski-Harabasz	104.211
PAM agreement (Corrected Rand)	0.514
Min Jaccard	0.666
Mean Jaccard	0.777

Selected solution. Based on the stability-first rule (minimum cluster size ≥ 7, and preference for Min Jaccard ≥ 0.60), the final choice is k = 2. The solution achieves an average silhouette of 0.411, Calinski–Harabasz of 104.2, and bootstrap stability (Jaccard) with min = 0.666 and mean = 0.777.

A Jaccard similarity above 0.60 suggests cluster stability, meaning that the cluster’s membership shows substantial overlap across bootstrap resamples. This indicates that the grouping is reasonably robust to sampling variation, although it should not be interpreted as definitive proof of an underlying ‘true’ structure.

4.2.4 Interpretation: names and profiles

## Note: these requested profile variables were not available and were skipped:  school_ter_gross

Cluster profiles (median values on original scale, selected indicators).
cluster_short	gdp_per_cap_ppp	life_expectancy	internet_usage	access_elec	sanitation_basic	school_sec_gross	gov_effect	corruption_ctrl	agri_value_added	dependency_ratio	pop_growth	unemployment	labor_part_fem
Advanced / Institutional	19847.45	75.00	66.59	99.93	93.67	96.39	0.08	-0.05	5.01	52.82	0.94	5.68	52.58
Labor-Intensive / Agrarian	2737.48	61.12	15.71	36.37	25.13	45.46	-1.05	-1.03	23.26	86.02	2.64	3.50	64.85

4.2.5 Heatmap

The heatmap below confirms the distinct separation of profiles:

Positive values indicate above-average (relative) indicator levels for that cluster; negative values indicate below-average levels.

4.2.6 World map

Although clustering was based on socio-economic indicators and not geographic information, the resulting groups display a clear spatial pattern. Cluster 2 is dominated by Sub-Saharan African countries, with some representation from South and Southeast Asia, while Cluster 1 covers most of the remaining regions.

This geographic concentration is an outcome of structural differences in institutions, human capital, infrastructure, and demographic pressure rather than a trivial regional split. Solutions with more clusters (e.g. four clusters) produced interpretable intermediate or fast-catch-up groups but showed lower stability under bootstrap validation. Therefore, the two-cluster solution was retained as the most robust and structurally meaningful representation of the data.

5 Discussion & Conclusion

This analysis successfully extracted latent development dimensions and identified two robust country typologies from 2017–2019 WDI data: - Advanced, institutionally strong economies: Characterized by high income, strong governance, advanced human capital, infrastructure, and digital access. - Labor-intensive, agrarian and demographically pressured economies: Higher agricultural dependence, demographic pressures, and necessity-driven labor participation.

Income groups align strongly with PC1 and cluster membership, but clustering is not identical to income classification because it also reflects governance, infrastructure, and demographic structure.

The PCA-clustering pipeline successfully extracts meaningful structure beyond traditional income classifications, with the two-cluster solution proving stable and interpretable.

The use of Hopkins statistic and Bootstrap Stability indicates that the clusters are not random artifacts.

Key insights: - Development is not unidimensional — institutional quality and human capital investment emerge as distinct from pure economic growth. - Geographic patterns emerge naturally (e.g., advanced cluster concentrated in Europe/North America/East Asia).

Method choice rationale -Ward.D2 is chosen for compact clusters in Euclidean space; clustering in PCA space reduces multicollinearity and makes distances more meaningful.

Limitations: - Cross-sectional (pre-COVID snapshot); temporal dynamics not captured. - Reliance on median imputation and theoretical indicator selection.

Future work: Incorporate time-series clustering or additional indicators (e.g., inequality, climate vulnerability).

The full reproducible pipeline demonstrates the power of unsupervised learning for exploratory development analysis.

6 Appendix

6.1 Sources

Data Source

World Bank. (2026). World Development Indicators (WDI). Retrieved January 2026 from https://databank.worldbank.org/source/world-development-indicators

Other Sources

Commission on the Measurement of Economic Performance and Social Progress. (2009). Report by the Commission on the Measurement of Economic Performance and Social Progress. Paris. Retrieved February 2026 from https://ec.europa.eu/eurostat/documents/8131721/8131772/Stiglitz-Sen-Fitoussi-Commission-report.pdf

Husson, F., Lê, S., & Pagès, J. (2017). Exploratory multivariate analysis by example using R (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b21874

United Nations Development Programme. (2019). Human Development Report 2019: Beyond income, beyond averages, beyond today. Retrieved February 2026 from http://report2019.archive.s3-website-us-east-1.amazonaws.com/

6.2 Reproducibility Info

cat("Report rendered on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n\n")

# Core R info
print(R.version)

cat("\nPlatform:", R.version$platform, "\n")
cat("Running on:", R.version$os, "\n\n")

# Packages + versions actually loaded in this session
sessionInfo()

## Report rendered on: 2026-02-01 17:32:36 CET 
## 
##                _                                
## platform       x86_64-w64-mingw32               
## arch           x86_64                           
## os             mingw32                          
## crt            ucrt                             
## system         x86_64, mingw32                  
## status                                          
## major          4                                
## minor          5.1                              
## year           2025                             
## month          06                               
## day            13                               
## svn rev        88306                            
## language       R                                
## version.string R version 4.5.1 (2025-06-13 ucrt)
## nickname       Great Square Root                
## 
## Platform: x86_64-w64-mingw32 
## Running on: mingw32 
## 
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] C
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.13         ggsci_4.2.0             rnaturalearthdata_1.0.0
##  [4] rnaturalearth_1.2.0     sf_1.0-24               corrplot_0.95          
##  [7] hopkins_1.1             fpc_2.2-14              cluster_2.1.8.1        
## [10] factoextra_1.0.7        FactoMineR_2.13         psych_2.5.6            
## [13] kableExtra_1.4.0        naniar_1.1.0            WDI_2.7.9              
## [16] janitor_2.2.1           lubridate_1.9.4         forcats_1.0.1          
## [19] stringr_1.6.0           dplyr_1.1.4             purrr_1.2.0            
## [22] readr_2.1.5             tidyr_1.3.1             tibble_3.3.0           
## [25] ggplot2_4.0.0           tidyverse_2.0.0         DiagrammeR_1.0.11      
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3            mnormt_2.1.1         gridExtra_2.3       
##  [4] rlang_1.1.6          magrittr_2.0.4       donut_1.0.3         
##  [7] snakecase_0.11.1     otel_0.2.0           e1071_1.7-17        
## [10] compiler_4.5.1       flexmix_2.3-20       systemfonts_1.3.1   
## [13] vctrs_0.6.5          pkgconfig_2.0.3      fastmap_1.2.0       
## [16] backports_1.5.0      labeling_0.4.3       rmarkdown_2.30      
## [19] tzdb_0.5.0           visdat_0.6.0         xfun_0.54           
## [22] modeltools_0.2-24    cachem_1.1.0         jsonlite_2.0.0      
## [25] flashClust_1.01-2    pdist_1.2.1          broom_1.0.10        
## [28] parallel_4.5.1       prabclus_2.3-4       R6_2.6.1            
## [31] bslib_0.9.0          stringi_1.8.7        RColorBrewer_1.1-3  
## [34] car_3.1-3            jquerylib_0.1.4      diptest_0.77-2      
## [37] estimability_1.5.1   Rcpp_1.1.0           knitr_1.51          
## [40] nnet_7.3-20          timechange_0.3.0     tidyselect_1.2.1    
## [43] viridis_0.6.5        abind_1.4-8          rstudioapi_0.17.1   
## [46] yaml_2.3.10          lattice_0.22-7       withr_3.0.2         
## [49] S7_0.2.0             evaluate_1.0.5       units_1.0-0         
## [52] proxy_0.4-29         xml2_1.4.1           mclust_6.1.2        
## [55] kernlab_0.9-33       ggpubr_0.6.2         pillar_1.11.1       
## [58] carData_3.0-5        KernSmooth_2.23-26   DT_0.34.0           
## [61] stats4_4.5.1         generics_0.1.4       hms_1.1.4           
## [64] scales_1.4.0         leaps_3.2            class_7.3-23        
## [67] glue_1.8.0           emmeans_2.0.0        scatterplot3d_0.3-44
## [70] tools_4.5.1          dendextend_1.19.1    robustbase_0.99-6   
## [73] ggsignif_0.6.4       RANN_2.6.2           visNetwork_2.1.4    
## [76] mvtnorm_1.3-3        grid_4.5.1           nlme_3.1-168        
## [79] Formula_1.2-5        cli_3.6.5            textshaping_1.0.4   
## [82] viridisLite_0.4.2    svglite_2.2.2        gtable_0.3.6        
## [85] DEoptimR_1.1-4       rstatix_0.7.3        sass_0.4.10         
## [88] digest_0.6.37        classInt_0.4-11      ggrepel_0.9.6       
## [91] htmlwidgets_1.6.4    farver_2.1.2         htmltools_0.5.9     
## [94] lifecycle_1.0.4      multcompView_0.1-10  MASS_7.3-65

Global Development Patterns: A PCA and Clustering Analysis of World Development Indicators

Latent dimensions and country typologies from WDI averages (2017–2019)

Andrei Klimko

2026-02-01