knitr::opts_knit$set(root.dir = normalizePath(".."))

knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  fig.width = 9,
  fig.height = 6,
  results = "hold"
)

library(DiagrammeR)

1 Executive Summary & Introduction

1.1 Abstract

This project explores global development patterns using country-level World Development Indicators (WDI) averaged over 2017–2019. To capture the multidimensional nature of development beyond income alone, a broad set of economic, demographic, institutional, and infrastructure indicators is analyzed using unsupervised learning techniques.

After data preprocessing and dimensionality reduction via Principal Component Analysis (PCA), countries are grouped using hierarchical clustering in the reduced space. The analysis identifies a small number of stable and interpretable country typologies, reflecting systematic differences in development structures related to institutional quality, human capital, labor market characteristics, and demographic pressure.

1.2 Motivation

Economic development is inherently multidimensional. Traditional analyses often rely heavily on GDP per capita, yet this single metric fails to capture critical differences in institutional quality, human capital, labor market dynamics, infrastructure, and environmental sustainability. Many development indicators are highly correlated, suggesting underlying latent dimensions that unsupervised learning methods like PCA can uncover.

Unsupervised learning techniques are well suited to this setting, as no single ground-truth label exists for development types. By applying Principal Component Analysis and clustering to a curated set of World Development Indicators, this study aims to extract latent structure and derive data-driven country typologies without imposing predefined classifications such as income groups.

1.3 Methodological Overview

The project follows a rigorous unsupervised learning pipeline. After data cleaning (missingness analysis, hierarchical imputation) and preprocessing (log-transformation, standardization), Principal Component Analysis (PCA) was used to reduce dimensionality. Subsequently, Hierarchical Clustering (Ward.D2) was applied in the reduced orthogonal space. The final typology was validated using stability metrics (Bootstrap Jaccard Resampling - 200 iterations.) and internal indices (Silhouette, Calinski-Harabasz).

1.4 Research Questions

This analysis addresses two primary questions:

  1. What are the main latent dimensions that explain variation across a broad set of development indicators?

  2. Can countries be reliably grouped into stable, interpretable clusters based on their positions in the reduced PCA space, and what development profiles do these clusters represent?

1.5 The Research Environment

The project was performed as a part of Unsupervised Learning Course of Data Science and Business Analytics Master’s Curriculum at the University of Warsaw

1.5.1 The Workflow

Project Folder Structure is :

  
Project folder structure:

USL_Global_Development/
├── USL_Global_Development.Rproj
├── data/
│   ├── raw/
│   └── processed/
├── output/
├── report/
│   └── USL_Global_Development.Rmd
└── scripts/
    ├── 01_import_clean.R
    ├── 02_dimension_reduction.R
    └── 03_clustering.R

The analysis follows a three-script pipeline for this combined PCA and Clustering Project

1.5.2 Prerequisites

# Global Options
options(stringsAsFactors = FALSE)

# 1. Load Core Libraries

library(tidyverse)      # Data manipulation & plotting
library(janitor)        # Data cleaning
library(WDI)            # Data fetch
library(naniar)         # Missingness visualization
library(kableExtra)     # Table styling


# 2. Load Analysis Libraries

library(psych)          # Skewness & description
library(FactoMineR)     # PCA
library(factoextra)     # Visualization (PCA/Cluster)
library(cluster)        # Clustering algorithms
library(fpc)            # Clustering validation
library(hopkins)        # Clusterability
library(corrplot)       # Correlation matrix


# 3. Load Mapping Libraries

library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggsci)          # color palettes
library(pheatmap)       # Heatmaps


# Create Directory Structure
dir.create("output", recursive = TRUE, showWarnings = FALSE)
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)

2 Dataset & Preprocessing

This section explains the data-related decisions made for this project.

2.1 The Dataset : World Bank WDI

The dataset is constructed from the World Bank World Development Indicators (WDI) at the country level. To reduce volatility and obtain a stable cross-sectional representation, all indicators are aggregated to a three-year mean over 2017–2019. The unit of analysis is a country, described by a multidimensional set of economic, demographic, health, education, infrastructure, environmental, and governance indicators.

For reproducibility, intermediate datasets are stored as RDS objects. The descriptive imputed dataset is used for cluster profiling and interpretation, while the standardized dataset is used for PCA and clustering.

2.1.1 Indicator Selection

Indicators were selected and the dataset was aggregated to 2017–2019 means to obtain a stable cross-sectional matrix for PCA and clustering.

# Define Indicator Pools Main WDI indicators WDI api
indicators_wdi <- c(
  # Economy / structure
  gdp_per_cap_ppp     = "NY.GDP.PCAP.PP.KD",
  gdp_growth          = "NY.GDP.MKTP.KD.ZG",
  capital_formation   = "NE.GDI.FTOT.ZS",
  agri_value_added    = "NV.AGR.TOTL.ZS",
  service_value_added = "NV.SRV.TOTL.ZS",
  
  # Openness / capital
  exports_gdp         = "NE.EXP.GNFS.ZS",
  fdi_inflows         = "BX.KLT.DINV.WD.GD.ZS",
  
  # Health
  life_expectancy     = "SP.DYN.LE00.IN",
  sanitation_basic    = "SH.STA.BASS.ZS",
  health_exp_gdp      = "SH.XPD.CHEX.GD.ZS",
  measles_immun       = "SH.IMM.MEAS",
  
  # Education
  school_sec_gross    = "SE.SEC.ENRR",
  school_ter_gross    = "SE.TER.ENRR",
  gov_exp_edu         = "SE.XPD.TOTL.GD.ZS",
  
  # Demography
  pop_growth          = "SP.POP.GROW",
  dependency_ratio    = "SP.POP.DPND",
  urban_pop_rate      = "SP.URB.TOTL.IN.ZS",
  
  # Labor
  unemployment        = "SL.UEM.TOTL.ZS",
  labor_part_tot      = "SL.TLF.CACT.ZS",
  labor_part_fem      = "SL.TLF.CACT.FE.ZS",
  
  # Tech / infra
  access_elec         = "EG.ELC.ACCS.ZS",
  elec_consump        = "EG.USE.ELEC.KH.PC",
  internet_usage      = "IT.NET.USER.ZS",
  mobile_subs         = "IT.CEL.SETS.P2",
  
  # Environment
  gge_per_cap         = "EN.GHG.ALL.PC.CE.AR5",
  renew_energy        = "EG.FEC.RNEW.ZS",
  
  #Governance (WGI)
  corruption_ctrl = "CC.EST",
  gov_effect      = "GE.EST"
)

years <- c(2017:2019)

Indicators were selected based on three complementary criteria:

  1. Relevance in development economics: Variables capture widely accepted dimensions of development, including economic performance (GDP per capita, growth), human capital (education, health), institutional quality (governance effectiveness, corruption control), infrastructure access, and demographic structure.

  2. Coverage and data quality: Indicators with broad country coverage and consistent availability over 2017–2019 were prioritized to minimize missingness and unstable imputation.

  3. Latent structure relevance: The selected indicators are known to be correlated yet conceptually distinct, making them well suited for latent dimension extraction via PCA.

2.1.2 Data Import

Data were retrieved via the WDI package API and cached locally for reproducibility & speed.

# Get data 
raw_path <- "data/raw/wdi_raw_2017_2019.rds"
if (!file.exists(raw_path)) {
  message("Raw WDI data not found. Downloading from API...")
  options(timeout = 360)    
  raw_wdi <- WDI(
    country   = "all",
    indicator = indicators_wdi,
    start     = min(years),
    end       = max(years),
    extra     = TRUE,
    cache     = NULL
  )
  
  dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
  saveRDS(raw_wdi, raw_path)
  
} else {
  message("Raw WDI data found. Loading from disk...")
  raw_wdi <- readRDS(raw_path)
}
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)

To ensure reproducibility and reduce API dependence, raw downloads and intermediate datasets are cached as RDS files. If cached files exist, the pipeline loads from disk; otherwise, it downloads and saves them.

2.2 Data Cleaning

The section displays the process of transforming the raw data so it is suitable for further analysis

2.2.1 Data Preprocesing

The Important Indicators were taken and Dataset was averaged for 3 years

# Filtering
wdi_clean <- raw_wdi %>%
  clean_names() %>%
  filter(region != "Aggregates") %>%      # drop World/regions aggregates
  filter(!is.na(iso3c)) %>%               # keep proper countries
  filter(nchar(iso3c) == 3) %>% 
  select(country, iso3c, region, income, year, all_of(names(indicators_wdi)))
# Aggregate to 3-year Average 
# cross-sectional matrix for clustering/PCA
wdi_avg <- wdi_clean %>%
  group_by(country, iso3c, region, income) %>%
  summarise(across(all_of(names(indicators_wdi)), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
wdi_avg <- wdi_avg %>%
  mutate(across(all_of(names(indicators_wdi)), ~ ifelse(is.nan(.x), NA, .x)))

Note: If a country has fewer than three observed years for an indicator, the 2017–2019 mean is computed using available years (via na.rm = TRUE), avoiding unnecessary country loss.

cat(
  "Final dataset dimensions:\n",
  "Countries:", nrow(wdi_avg), "\n",
  "Indicators:", length(indicators_wdi), "\n"
)
## Final dataset dimensions:
##  Countries: 215 
##  Indicators: 28

2.2.2 Missing data Analysis

We apply a strict 30/40 rule to ensure data quality:

The 30% variable threshold reduces unstable imputations, while the 40% country threshold avoids constructing country profiles dominated by imputed values.

1.Indicators: Variables with >30% missingness are dropped.

2.Countries: Observations with >40% missingness are dropped.

2.2.2.1 Variable-Level Analysis

For each WDI indicator missing share were assessed and threshold for dropping in was set to 0.3

# Missing Data Analysis
# Variable 
miss_var <- wdi_avg %>%
  summarise(across(all_of(names(indicators_wdi)), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_share") %>%
  arrange(desc(missing_share))

# visual
p1 <- ggplot(miss_var, aes(x = reorder(variable, missing_share), y = missing_share)) +
  geom_col() +
  coord_flip() +
  labs(title = "Initial Missing Share by Variable", x = NULL, y = "Missing share")

print(p1)

Indicators above 30% missingness were excluded to avoid unstable imputations and preserve interpretability.

cutoff <- 0.30

vars_drop <- miss_var %>%
  filter(missing_share > cutoff) %>%
  pull(variable)

vars_keep <- miss_var %>%
  filter(missing_share <= cutoff) %>%
  pull(variable)

2.2.2.2 Row-Wise Analysis

Next, the countries with high missing data share were assessed . To not include bias and represent different countries country_cutoff <- 0.40 was set

# Row-wise (Country) Missingness Check

country_miss <- wdi_avg %>%
  select(country, iso3c, region, income, all_of(vars_keep)) %>%
  mutate(miss_prop = rowMeans(is.na(select(., all_of(vars_keep))))) %>%
  arrange(desc(miss_prop))

# 40% cutoff: 
country_cutoff <- 0.40
countries_drop <- country_miss %>% filter(miss_prop > country_cutoff) %>% pull(iso3c) 

cat("Dropping", length(countries_drop), "countries due to excessive missingness.\n")


wdi_reduced <- wdi_avg %>%
  filter(!iso3c %in% countries_drop) %>%
  select(country, iso3c, region, income, all_of(vars_keep))
## Dropping 14 countries due to excessive missingness.

Near-Constant variables were assessed and dropped to ensure the smooth execution of algorithms down the pipeline

# Drop Near-Constant Variables 
# We also check variables that became all-NA after

stats_check <- wdi_reduced %>%
  summarise(across(all_of(vars_keep), list(
    na_count = ~ sum(!is.na(.)),
    sd       = ~ sd(., na.rm = TRUE)
  ))) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", "stat"),
    names_pattern = "^(.*)_(na_count|sd)$"
  ) %>%
  pivot_wider(names_from = stat, values_from = value)

vars_drop_stat <- stats_check %>%
  filter(na_count < 10 | sd < 1e-10) %>%
  pull(variable)
  
vars_model <- setdiff(vars_keep, vars_drop_stat)
wdi_reduced <- wdi_reduced %>% select(country, iso3c, region, income, all_of(vars_model))
cat("Variables dropped (low variance/obs):", length(vars_drop_stat), "\n")
cat("Final variable count:", length(vars_model), "\n")
## Variables dropped (low variance/obs): 0 
## Final variable count: 26

Tables below summarizes coverage after preprocessing.

Dataset snapshot after preprocessing (before imputation).
countries indicators regions income_groups
201 26 7 5
summary_tbl <- data.frame(
  Step = c("Initial countries (after aggregates removed)",
           "Countries dropped (missingness > 40%)",
           "Final countries",
           "Initial indicators",
           "Indicators dropped (missingness > 30%)",
           "Indicators dropped (near-constant / too few obs)",
           "Final indicators"),
  Value = c(nrow(wdi_avg),
            length(countries_drop),
            nrow(wdi_reduced),
            length(indicators_wdi),
            length(vars_drop),
            length(vars_drop_stat),
            length(vars_model))
)
knitr::kable(summary_tbl, caption = "Preprocessing decisions and resulting dataset size.")
Preprocessing decisions and resulting dataset size.
Step Value
Initial countries (after aggregates removed) 215
Countries dropped (missingness > 40%) 14
Final countries 201
Initial indicators 28
Indicators dropped (missingness > 30%) 2
Indicators dropped (near-constant / too few obs) 0
Final indicators 26

2.2.3 Data Imputation

We employ a hierarchical median imputation (Region Median \(\rightarrow\) Global Median) to preserve regional structural differences.

# Hierarchical Imputation
#Region then Global
wdi_imp <- wdi_reduced %>%
  group_by(region) %>%
  mutate(across(all_of(vars_model), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup() %>%
  mutate(across(all_of(vars_model), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) # Global fallback
# Fail if any NAs remain later.
na_count_post <- sum(is.na(wdi_imp[vars_model])) 
stopifnot(all(vars_model %in% names(wdi_imp)))
stopifnot(all(sapply(wdi_imp[vars_model], is.numeric)))
if (na_count_post > 0) {
  stop("CRITICAL ERROR: Imputation failed. NAs still exist in the dataset.")
} else {
  message("Imputation successful: 0 missing values remaining.")
}
stopifnot(!anyNA(wdi_imp[vars_model]))

Note: We did not include the Income based median imputation in our imputation rule set. Though at first glance it made sense , the income labels will force data leakage and disrupt natural cluster discovery.

Imputation audit. The table reports the share of values imputed (missing before imputation) for the most affected indicators.

Indicators with the largest share of imputed values (top 10).
variable imputed_share
school_sec_gross 0.234
capital_formation 0.169
gov_exp_edu 0.129
exports_gdp 0.114
unemployment 0.090
labor_part_tot 0.090
labor_part_fem 0.090
health_exp_gdp 0.055
measles_immun 0.055
gdp_per_cap_ppp 0.040

2.2.4 Skewness & Log Transformation

Variables with high skewness (>2) are log-transformed (\(log_{10}(x+1)\)) to improve normality, excluding percentage-based metrics.

# Skewness & Log Transformation
# Calculate skewness table 
skew_tbl <- describe(wdi_imp[vars_model]) %>%
  as.data.frame() %>%
  select(skew) %>%
  rownames_to_column("variable") %>%   
  arrange(desc(abs(skew)))
#Identify highly skewed variables
high_skew_vars <- skew_tbl %>%
  filter(skew > 2) %>%
  pull(variable)
#Identify non-negative variables 
safe_to_log <- vars_model[sapply(wdi_imp[vars_model], function(x) min(x, na.rm = TRUE) >= 0)]
#skewed AND safe
vars_to_log <- intersect(high_skew_vars, safe_to_log)
#bounded percent/rate indicators
percent_like <- c(
  "capital_formation","agri_value_added","service_value_added",
  "exports_gdp","fdi_inflows","sanitation_basic","health_exp_gdp",
  "gov_exp_edu","urban_pop_rate","unemployment","labor_part_tot",
  "labor_part_fem","access_elec","internet_usage","renew_energy"
)
vars_to_log <- setdiff(vars_to_log, percent_like)
cat("Log-transforming", length(vars_to_log), "variables:\n")
print(vars_to_log)
# Apply log10(x+1)
wdi_transformed <- wdi_imp %>%
  mutate(across(all_of(vars_to_log), ~ log10(.x + 1)))
#no NaN/Inf introduced by transform
stopifnot(all(is.finite(as.matrix(wdi_transformed[vars_model]))))
## Log-transforming 1 variables:
## [1] "gge_per_cap"
cat("Checks after transformations:\n")
cat("Any NA in wdi_transformed vars_model? ", anyNA(wdi_transformed[vars_model]), "\n")
cat("Any infinite values? ", any(!is.finite(as.matrix(wdi_transformed[vars_model]))), "\n")
## Checks after transformations:
## Any NA in wdi_transformed vars_model?  FALSE 
## Any infinite values?  FALSE

2.2.5 Standardization

The standardization is needed to make variables comparable in analysis

# Standardization 
wdi_scaled <- wdi_transformed %>%
  mutate(across(all_of(vars_model), ~ as.numeric(scale(.x))))
stopifnot(all(sapply(wdi_scaled[vars_model], is.numeric)))
stopifnot(!anyNA(wdi_scaled[vars_model]))
# Check Boxplot of scaled variables (sample 12 vars) 
set.seed(47)
vars_sample <- sample(vars_model, min(12, length(vars_model)))
wdi_scaled %>%
  select(all_of(vars_sample)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = name, y = value)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Scaled Variable Distributions (sample)", x = NULL, y = "Z-score")

The post-scaling distributions confirm comparable scale across indicators, which is required for PCA and Euclidean-distance clustering.

2.2.6 Saving the Data

# Saving Outputs
saveRDS(wdi_imp,    "data/processed/wdi_imputed_descriptive.rds") # for Cluster Naming
saveRDS(wdi_scaled, "data/processed/wdi_scaled_pca_input.rds")    # for PCA/Clustering
cat("Success! Processed data saved. Ready for PCA.\n")
## Success! Processed data saved. Ready for PCA.

3 Dimension Reduction : PCA

Many World Development Indicators are strongly correlated, reflecting overlapping aspects of economic structure, human development, infrastructure, and governance. Applying clustering directly to the raw indicator space can violate distance assumptions due to multicollinearity and unequal variance. Principal Component Analysis (PCA) is therefore used to extract orthogonal latent dimensions that summarize the main sources of variation in the data. PCA reduces noise, improves distance validity, and facilitates interpretation of country positions in a lower-dimensional space.

# Load Processed Data
wdi_scaled <- readRDS("data/processed/wdi_scaled_pca_input.rds")
# Metadata columns not enter PCA
meta_vars <- c("iso3c", "region", "income", "year")
# PCA matrix
wdi_matrix <- wdi_scaled %>%
  column_to_rownames("country") %>%
  select(-any_of(meta_vars))
# checks 
stopifnot(all(map_lgl(wdi_matrix, is.numeric)))
stopifnot(!anyNA(wdi_matrix))
cat("PCA input:", nrow(wdi_matrix), "countries x", ncol(wdi_matrix), "indicators\n")
## PCA input: 201 countries x 26 indicators

This section analyzes the latent structure of the remaining indicators after preprocessing to reduce noise and multicollinearity.

##PCA Execution

# wdi_matrix is already z-scaled 
pca_res <- PCA(wdi_matrix, scale.unit = FALSE,ncp=10, graph = FALSE)

##Component Selection

We use the Scree Plot and Kaiser Criterion also, checking the total variance explained to determine the number of dimensions.

p_scree <- fviz_eig(pca_res, addlabels = TRUE) +
  labs(title = "Scree Plot: Explained Variance by PCA Dimensions")

print(p_scree)

eig <- get_eigenvalue(pca_res) %>% as.data.frame()
eig_tbl <- eig %>%
  mutate(Component = paste0("PC", row_number())) %>%
  select(Component, eigenvalue, variance.percent, cumulative.variance.percent)

knitr::kable(
  head(eig_tbl, 10),
  digits = 3,
  caption = "Eigenvalues and explained variance (first 10 components)"
)
Eigenvalues and explained variance (first 10 components)
Component eigenvalue variance.percent cumulative.variance.percent
Dim.1 PC1 10.693 41.333 41.333
Dim.2 PC2 2.705 10.454 51.787
Dim.3 PC3 1.798 6.950 58.737
Dim.4 PC4 1.539 5.949 64.686
Dim.5 PC5 1.321 5.106 69.792
Dim.6 PC6 0.959 3.706 73.498
Dim.7 PC7 0.831 3.213 76.712
Dim.8 PC8 0.771 2.981 79.693
Dim.9 PC9 0.655 2.533 82.226
Dim.10 PC10 0.624 2.414 84.640
kaiser_k <- sum(eig$eigenvalue > 1)
var70_k  <- which(eig$cumulative.variance.percent >= 70)[1]
k_pca <- min(kaiser_k, var70_k)

cat("Kaiser (eig > 1):", kaiser_k, "\n")
cat("Components for >=70% variance:", var70_k, "\n")
cat("Selected number of components (k_pca):", k_pca, "\n")
## Kaiser (eig > 1): 5 
## Components for >=70% variance: 6 
## Selected number of components (k_pca): 5

Component retention. We retain \(k\) components using a rule: minimum of (i) Kaiser criterion (eigenvalue > 1) and (ii) components needed for ≥70% cumulative explained variance.

PCA component retention rule and selected dimensionality.
Rule Value
Kaiser (eig > 1) 5
≥ 70% cumulative variance 6
Selected k 5

Retaining five components balances information preservation with interpretability, while keeping the clustering distance metric valid by operating in an orthogonal space

3.1 Interpretation of Dimensions

contrib_tbl <- as.data.frame(pca_res$var$contrib[, 1:k_pca, drop = FALSE]) %>%
  rownames_to_column("variable") %>%
  pivot_longer(-variable, names_to = "component", values_to = "contrib") %>%
  group_by(component) %>%
  slice_max(contrib, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(component, desc(contrib))

knitr::kable(
  contrib_tbl,
  digits = 3,
  caption = "Top contributing variables (per retained component)"
)
Top contributing variables (per retained component)
variable component contrib
internet_usage Dim.1 8.336
life_expectancy Dim.1 7.534
sanitation_basic Dim.1 7.463
access_elec Dim.1 6.721
school_sec_gross Dim.1 6.627
agri_value_added Dim.1 6.617
dependency_ratio Dim.1 6.588
gov_effect Dim.1 6.555
gdp_per_cap_ppp Dim.1 5.495
corruption_ctrl Dim.1 5.074
labor_part_fem Dim.2 26.391
labor_part_tot Dim.2 26.168
unemployment Dim.2 14.990
gdp_per_cap_ppp Dim.2 4.191
gov_effect Dim.2 3.961
corruption_ctrl Dim.2 3.790
renew_energy Dim.2 3.737
gdp_growth Dim.2 2.876
exports_gdp Dim.2 2.670
capital_formation Dim.2 2.263
health_exp_gdp Dim.3 29.332
gov_exp_edu Dim.3 20.396
mobile_subs Dim.3 7.730
corruption_ctrl Dim.3 5.338
gdp_growth Dim.3 4.548
service_value_added Dim.3 4.536
capital_formation Dim.3 4.460
labor_part_fem Dim.3 3.843
pop_growth Dim.3 3.757
exports_gdp Dim.3 3.580
capital_formation Dim.4 26.413
gdp_growth Dim.4 22.439
measles_immun Dim.4 9.786
gov_exp_edu Dim.4 8.409
urban_pop_rate Dim.4 7.051
exports_gdp Dim.4 4.648
gdp_per_cap_ppp Dim.4 4.625
access_elec Dim.4 2.668
unemployment Dim.4 2.526
gge_per_cap Dim.4 2.257
fdi_inflows Dim.5 33.128
gdp_growth Dim.5 11.901
exports_gdp Dim.5 11.267
labor_part_tot Dim.5 7.848
pop_growth Dim.5 7.374
service_value_added Dim.5 6.522
gge_per_cap Dim.5 3.349
unemployment Dim.5 2.972
corruption_ctrl Dim.5 2.776
labor_part_fem Dim.5 2.379

3.1.1 Interpretation of Retained Components

Directional interpretation. The sign of loadings indicates which indicators increase in the positive direction of each component.

PC1: strongest positive loadings (top 5).
loading variable
internet_usage 0.944 internet_usage
life_expectancy 0.898 life_expectancy
sanitation_basic 0.893 sanitation_basic
access_elec 0.848 access_elec
school_sec_gross 0.842 school_sec_gross
PC1: strongest negative loadings (top 5).
loading variable
agri_value_added -0.841 agri_value_added
dependency_ratio -0.839 dependency_ratio
renew_energy -0.721 renew_energy
pop_growth -0.521 pop_growth
gdp_growth -0.194 gdp_growth

PCA loading signs are arbitrary up to multiplication by -1; interpretations focus on relative contrasts within each component.

3.1.2 Strongest loadings table (top 8 by absolute loading)

loadings_tbl <- as.data.frame(pca_res$var$coord[, 1:k_pca, drop = FALSE]) %>%
  rownames_to_column("variable") %>%
  pivot_longer(-variable, names_to = "component", values_to = "loading") %>%
  group_by(component) %>%
  slice_max(abs(loading), n = 8, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(component, desc(abs(loading)))

knitr::kable(
  loadings_tbl,
  digits = 3,
  caption = "Largest absolute loadings (per retained component)"
)
Largest absolute loadings (per retained component)
variable component loading
internet_usage Dim.1 0.944
life_expectancy Dim.1 0.898
sanitation_basic Dim.1 0.893
access_elec Dim.1 0.848
school_sec_gross Dim.1 0.842
agri_value_added Dim.1 -0.841
dependency_ratio Dim.1 -0.839
gov_effect Dim.1 0.837
labor_part_fem Dim.2 0.845
labor_part_tot Dim.2 0.841
unemployment Dim.2 -0.637
gdp_per_cap_ppp Dim.2 0.337
gov_effect Dim.2 0.327
corruption_ctrl Dim.2 0.320
renew_energy Dim.2 0.318
gdp_growth Dim.2 0.279
health_exp_gdp Dim.3 0.726
gov_exp_edu Dim.3 0.606
mobile_subs Dim.3 -0.373
corruption_ctrl Dim.3 0.310
gdp_growth Dim.3 -0.286
service_value_added Dim.3 0.286
capital_formation Dim.3 -0.283
labor_part_fem Dim.3 0.263
capital_formation Dim.4 0.638
gdp_growth Dim.4 0.588
measles_immun Dim.4 0.388
gov_exp_edu Dim.4 0.360
urban_pop_rate Dim.4 -0.329
exports_gdp Dim.4 -0.267
gdp_per_cap_ppp Dim.4 -0.267
access_elec Dim.4 0.203
fdi_inflows Dim.5 0.662
gdp_growth Dim.5 0.396
exports_gdp Dim.5 0.386
labor_part_tot Dim.5 -0.322
pop_growth Dim.5 0.312
service_value_added Dim.5 0.294
gge_per_cap Dim.5 -0.210
unemployment Dim.5 0.198

Based on the highest contributions and loadings, the five retained components can be interpreted as follows:

  • PC1 (“Socio-Economic Development & Institutional Quality”): Captures the primary development continuum. Positive direction: high GDP per capita, strong governance, life expectancy, infrastructure access; negative direction: high agricultural share, demographic pressure.

  • PC2 (“Labor Market Participation”): Reflects overall and female labor force participation rates, with lower unemployment.

  • PC3 (“Public Human Capital Investment”): Emphasizes public spending on education and health relative to private capital formation.

  • PC4 (“Investment & Growth Dynamics”): Distinguishes investment-intensive, faster-growing economies from more mature ones.

  • PC5 (“External Economic Integration”): Separates trade/FDI-open economies from more domestic-oriented ones.

These dimensions are orthogonal by construction and collectively explain approximately 70% of total variance in the original indicators.

Visual map

fviz_pca_var(
  pca_res,
  col.var = "contrib",
  repel = TRUE,
  select.var = list(contrib = 20)
) + labs(title = "PCA Variable Map (Top 20 contributions, Dim 1 & 2)")

fviz_pca_ind(
  pca_res,
  axes = c(1,2),
  geom.ind = "point",
  col.ind = wdi_scaled$income,
  addEllipses = TRUE,
  legend.title = "Income"
) + labs(title = "Countries in PCA space (Dim 1 & 2), colored by income group")

fviz_pca_ind(
  pca_res,
  axes = c(1,2),
  geom.ind = "point",
  col.ind = wdi_scaled$region,
  addEllipses = TRUE,
  legend.title = "Region"
) + labs(title = "Countries in PCA space (Dim 1 & 2), colored by region")

Because PCA components are orthogonal (uncorrelated), the Euclidean distance used in the subsequent clustering step is valid and not biased by the redundant information (multicollinearity) present in the raw WDI indicators.

Output for clustering. Metadata (region, income) is kept only for interpretation and visualization.

4 Clustering in PCA Space

Clustering is performed on the retained PCA scores (PC1–PCk) to avoid multicollinearity and ensure Euclidean distance validity. Ward’s hierarchical clustering (Ward.D2) is applied, and the number of clusters is selected using a stability-first strategy combining internal indices and bootstrap Jaccard stability.

#Load data
pca_scores <- readRDS("data/processed/pca_scores_for_clustering.rds")
wdi_imp    <- readRDS("data/processed/wdi_imputed_descriptive.rds")
pc_labels <- readRDS("data/processed/pc_labels.rds") 
wdi_scaled <- readRDS("data/processed/wdi_scaled_pca_input.rds")
# clustering space
pc_cols_all <- names(pca_scores)[grepl("^PC\\d+$", names(pca_scores))]
n_pcs_available   <- length(pc_cols_all)
pcs_all <- pca_scores %>% 
  column_to_rownames("country") %>% 
  select(all_of(pc_cols_all))
stopifnot(all(map_lgl(pcs_all, is.numeric)))
cat("Clustering input:", nrow(pcs_all), "countries x", ncol(pcs_all), "PCs (Available)\n")
## Clustering input: 201 countries x 5 PCs (Available)
n_pcs <- min(5, ncol(pcs_all))
pc_cols <- paste0("PC", 1:n_pcs)
X_pca <- pcs_all %>% select(all_of(pc_cols))
stopifnot(!anyNA(X_pca))
cat("Clustering on:", paste(pc_cols, collapse = ", "), "\n")
## Clustering on: PC1, PC2, PC3, PC4, PC5

4.1 Clusterability Assessment

Before applying a partitioning algorithm, we must ensure the data contains meaningful structure rather than a random distribution.

set.seed(47)
H <- hopkins::hopkins(as.matrix(X_pca), m = floor(0.1 * nrow(X_pca)))
cat("Hopkins statistic H =", round(H, 4), "\n")
## Hopkins statistic H = 0.9988
d  <- dist(X_pca, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

H scores mean usually: scores near 0.5 random structure, closer to 1 clusterable structure ; closer to 0 uniform structure

The Hopkins statistic is far above 0.5, indicating strong departure from spatial randomness and supporting the presence of cluster structure in PCA space.

4.2 Hierarchical Strategy & Optimal k

4.2.1 k selection table

We iterate \(k\) from 2 to 6, selecting the solution based on Bootstrap Stability (Jaccard Index).

k_candidates <- 2:6
min_size_rule <- 7
B_boot <- 200

set.seed(47)
validation_df <- purrr::map_dfr(k_candidates, function(k) {
  grp <- cutree(hc, k = k)

  sil <- silhouette(grp, d)
  sil_avg <- mean(sil[, 3])

  ch <- fpc::calinhara(as.matrix(X_pca), grp)

  sizes <- table(grp)
  min_size <- min(sizes)

  set.seed(47 + k)
  boot <- fpc::clusterboot(
    as.matrix(X_pca),
    B = B_boot,
    bootmethod = "boot",
    clustermethod = fpc::hclustCBI,
    method = "ward.D2",
    k = k,
    count = FALSE,
    showplots = FALSE
  )

  pam_fit <- cluster::pam(as.matrix(X_pca), k = k)
  crand <- fpc::cluster.stats(d, grp, pam_fit$clustering)$corrected.rand

  tibble(
    k = k,
    Silhouette = sil_avg,
    Calinski_Harabasz = ch,
    Min_Cluster_Size = as.integer(min_size),
    Min_Jaccard = min(boot$bootmean),
    Mean_Jaccard = mean(boot$bootmean),
    PAM_Agreement_CRand = crand
  )
})

knitr::kable(
  validation_df,
  digits = 3,
  caption = "Model selection diagnostics for candidate numbers of clusters (Ward.D2 in PCA space)."
)
Model selection diagnostics for candidate numbers of clusters (Ward.D2 in PCA space).
k Silhouette Calinski_Harabasz Min_Cluster_Size Min_Jaccard Mean_Jaccard PAM_Agreement_CRand
2 0.411 104.211 30 0.679 0.788 0.514
3 0.270 115.331 30 0.440 0.667 0.623
4 0.234 97.253 30 0.501 0.635 0.545
5 0.252 88.759 13 0.458 0.635 0.396
6 0.242 82.717 13 0.306 0.571 0.693

The k selection logic

candidates_ok <- validation_df %>% filter(Min_Cluster_Size >= min_size_rule)
if (nrow(candidates_ok) == 0) stop("No k passed Min_Cluster_Size rule.")

stable_ok <- candidates_ok %>% filter(Min_Jaccard >= 0.60)

if (nrow(stable_ok) > 0) {
  k_clusters <- stable_ok %>%
    arrange(desc(Mean_Jaccard), desc(Silhouette)) %>%
    slice(1) %>% pull(k)
  reason <- "Highest Mean Jaccard among solutions with Min Jaccard >= 0.60 and adequate cluster sizes."
} else {
  k_clusters <- candidates_ok %>%
    arrange(desc(Min_Jaccard), desc(Mean_Jaccard), desc(Silhouette)) %>%
    slice(1) %>% pull(k)
  reason <- "Fallback: no solution reached Min Jaccard >= 0.60; selected best stability compromise."
}

knitr::kable(
  data.frame(Selected_k = k_clusters, Reason = reason),
  caption = "Selected number of clusters (stability-first decision)."
)
Selected number of clusters (stability-first decision).
Selected_k Reason
2 Highest Mean Jaccard among solutions with Min Jaccard >= 0.60 and adequate cluster sizes.

4.2.2 Final clustering and key plots

groups <- cutree(hc, k = k_clusters)
stopifnot(identical(names(groups), rownames(X_pca)))

4.2.3 Final validation

Cluster sizes (number of countries).
cluster n_countries
1 171
2 30
Validation summary for the final clustering solution.
Metric Value
Avg silhouette 0.411
Calinski-Harabasz 104.211
PAM agreement (Corrected Rand) 0.514
Min Jaccard 0.666
Mean Jaccard 0.777
Selected solution. Based on the stability-first rule (minimum cluster size ≥ 7, and preference for Min Jaccard ≥ 0.60), the final choice is k = 2. The solution achieves an average silhouette of 0.411, Calinski–Harabasz of 104.2, and bootstrap stability (Jaccard) with min = 0.666 and mean = 0.777.

A Jaccard similarity above 0.60 suggests cluster stability, meaning that the cluster’s membership shows substantial overlap across bootstrap resamples. This indicates that the grouping is reasonably robust to sampling variation, although it should not be interpreted as definitive proof of an underlying ‘true’ structure.

4.2.4 Interpretation: names and profiles

## Note: these requested profile variables were not available and were skipped:  school_ter_gross
Cluster profiles (median values on original scale, selected indicators).
cluster_short gdp_per_cap_ppp life_expectancy internet_usage access_elec sanitation_basic school_sec_gross gov_effect corruption_ctrl agri_value_added dependency_ratio pop_growth unemployment labor_part_fem
Advanced / Institutional 19847.45 75.00 66.59 99.93 93.67 96.39 0.08 -0.05 5.01 52.82 0.94 5.68 52.58
Labor-Intensive / Agrarian 2737.48 61.12 15.71 36.37 25.13 45.46 -1.05 -1.03 23.26 86.02 2.64 3.50 64.85

4.2.5 Heatmap

The heatmap below confirms the distinct separation of profiles:

Positive values indicate above-average (relative) indicator levels for that cluster; negative values indicate below-average levels.

4.2.6 World map

Although clustering was based on socio-economic indicators and not geographic information, the resulting groups display a clear spatial pattern. Cluster 2 is dominated by Sub-Saharan African countries, with some representation from South and Southeast Asia, while Cluster 1 covers most of the remaining regions.

This geographic concentration is an outcome of structural differences in institutions, human capital, infrastructure, and demographic pressure rather than a trivial regional split. Solutions with more clusters (e.g. four clusters) produced interpretable intermediate or fast-catch-up groups but showed lower stability under bootstrap validation. Therefore, the two-cluster solution was retained as the most robust and structurally meaningful representation of the data.

5 Discussion & Conclusion

This analysis successfully extracted latent development dimensions and identified two robust country typologies from 2017–2019 WDI data: - Advanced, institutionally strong economies: Characterized by high income, strong governance, advanced human capital, infrastructure, and digital access. - Labor-intensive, agrarian and demographically pressured economies: Higher agricultural dependence, demographic pressures, and necessity-driven labor participation.

Income groups align strongly with PC1 and cluster membership, but clustering is not identical to income classification because it also reflects governance, infrastructure, and demographic structure.

The PCA-clustering pipeline successfully extracts meaningful structure beyond traditional income classifications, with the two-cluster solution proving stable and interpretable.

The use of Hopkins statistic and Bootstrap Stability indicates that the clusters are not random artifacts.

Key insights: - Development is not unidimensional — institutional quality and human capital investment emerge as distinct from pure economic growth. - Geographic patterns emerge naturally (e.g., advanced cluster concentrated in Europe/North America/East Asia).

Method choice rationale -Ward.D2 is chosen for compact clusters in Euclidean space; clustering in PCA space reduces multicollinearity and makes distances more meaningful.

Limitations: - Cross-sectional (pre-COVID snapshot); temporal dynamics not captured. - Reliance on median imputation and theoretical indicator selection.

Future work: Incorporate time-series clustering or additional indicators (e.g., inequality, climate vulnerability).

The full reproducible pipeline demonstrates the power of unsupervised learning for exploratory development analysis.

6 Appendix

6.1 Sources

Data Source

World Bank. (2026). World Development Indicators (WDI). Retrieved January 2026 from https://databank.worldbank.org/source/world-development-indicators

Other Sources

Commission on the Measurement of Economic Performance and Social Progress. (2009). Report by the Commission on the Measurement of Economic Performance and Social Progress. Paris. Retrieved February 2026 from https://ec.europa.eu/eurostat/documents/8131721/8131772/Stiglitz-Sen-Fitoussi-Commission-report.pdf

Husson, F., Lê, S., & Pagès, J. (2017). Exploratory multivariate analysis by example using R (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b21874

United Nations Development Programme. (2019). Human Development Report 2019: Beyond income, beyond averages, beyond today. Retrieved February 2026 from http://report2019.archive.s3-website-us-east-1.amazonaws.com/

6.2 Reproducibility Info

cat("Report rendered on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n\n")

# Core R info
print(R.version)

cat("\nPlatform:", R.version$platform, "\n")
cat("Running on:", R.version$os, "\n\n")

# Packages + versions actually loaded in this session
sessionInfo()
## Report rendered on: 2026-02-01 17:32:36 CET 
## 
##                _                                
## platform       x86_64-w64-mingw32               
## arch           x86_64                           
## os             mingw32                          
## crt            ucrt                             
## system         x86_64, mingw32                  
## status                                          
## major          4                                
## minor          5.1                              
## year           2025                             
## month          06                               
## day            13                               
## svn rev        88306                            
## language       R                                
## version.string R version 4.5.1 (2025-06-13 ucrt)
## nickname       Great Square Root                
## 
## Platform: x86_64-w64-mingw32 
## Running on: mingw32 
## 
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] C
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.13         ggsci_4.2.0             rnaturalearthdata_1.0.0
##  [4] rnaturalearth_1.2.0     sf_1.0-24               corrplot_0.95          
##  [7] hopkins_1.1             fpc_2.2-14              cluster_2.1.8.1        
## [10] factoextra_1.0.7        FactoMineR_2.13         psych_2.5.6            
## [13] kableExtra_1.4.0        naniar_1.1.0            WDI_2.7.9              
## [16] janitor_2.2.1           lubridate_1.9.4         forcats_1.0.1          
## [19] stringr_1.6.0           dplyr_1.1.4             purrr_1.2.0            
## [22] readr_2.1.5             tidyr_1.3.1             tibble_3.3.0           
## [25] ggplot2_4.0.0           tidyverse_2.0.0         DiagrammeR_1.0.11      
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3            mnormt_2.1.1         gridExtra_2.3       
##  [4] rlang_1.1.6          magrittr_2.0.4       donut_1.0.3         
##  [7] snakecase_0.11.1     otel_0.2.0           e1071_1.7-17        
## [10] compiler_4.5.1       flexmix_2.3-20       systemfonts_1.3.1   
## [13] vctrs_0.6.5          pkgconfig_2.0.3      fastmap_1.2.0       
## [16] backports_1.5.0      labeling_0.4.3       rmarkdown_2.30      
## [19] tzdb_0.5.0           visdat_0.6.0         xfun_0.54           
## [22] modeltools_0.2-24    cachem_1.1.0         jsonlite_2.0.0      
## [25] flashClust_1.01-2    pdist_1.2.1          broom_1.0.10        
## [28] parallel_4.5.1       prabclus_2.3-4       R6_2.6.1            
## [31] bslib_0.9.0          stringi_1.8.7        RColorBrewer_1.1-3  
## [34] car_3.1-3            jquerylib_0.1.4      diptest_0.77-2      
## [37] estimability_1.5.1   Rcpp_1.1.0           knitr_1.51          
## [40] nnet_7.3-20          timechange_0.3.0     tidyselect_1.2.1    
## [43] viridis_0.6.5        abind_1.4-8          rstudioapi_0.17.1   
## [46] yaml_2.3.10          lattice_0.22-7       withr_3.0.2         
## [49] S7_0.2.0             evaluate_1.0.5       units_1.0-0         
## [52] proxy_0.4-29         xml2_1.4.1           mclust_6.1.2        
## [55] kernlab_0.9-33       ggpubr_0.6.2         pillar_1.11.1       
## [58] carData_3.0-5        KernSmooth_2.23-26   DT_0.34.0           
## [61] stats4_4.5.1         generics_0.1.4       hms_1.1.4           
## [64] scales_1.4.0         leaps_3.2            class_7.3-23        
## [67] glue_1.8.0           emmeans_2.0.0        scatterplot3d_0.3-44
## [70] tools_4.5.1          dendextend_1.19.1    robustbase_0.99-6   
## [73] ggsignif_0.6.4       RANN_2.6.2           visNetwork_2.1.4    
## [76] mvtnorm_1.3-3        grid_4.5.1           nlme_3.1-168        
## [79] Formula_1.2-5        cli_3.6.5            textshaping_1.0.4   
## [82] viridisLite_0.4.2    svglite_2.2.2        gtable_0.3.6        
## [85] DEoptimR_1.1-4       rstatix_0.7.3        sass_0.4.10         
## [88] digest_0.6.37        classInt_0.4-11      ggrepel_0.9.6       
## [91] htmlwidgets_1.6.4    farver_2.1.2         htmltools_0.5.9     
## [94] lifecycle_1.0.4      multcompView_0.1-10  MASS_7.3-65