knitr::opts_knit$set(root.dir = normalizePath(".."))
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.align = "center",
fig.width = 9,
fig.height = 6,
results = "hold"
)
library(DiagrammeR)
This project explores global development patterns using country-level World Development Indicators (WDI) averaged over 2017–2019. To capture the multidimensional nature of development beyond income alone, a broad set of economic, demographic, institutional, and infrastructure indicators is analyzed using unsupervised learning techniques.
After data preprocessing and dimensionality reduction via Principal Component Analysis (PCA), countries are grouped using hierarchical clustering in the reduced space. The analysis identifies a small number of stable and interpretable country typologies, reflecting systematic differences in development structures related to institutional quality, human capital, labor market characteristics, and demographic pressure.
Economic development is inherently multidimensional. Traditional analyses often rely heavily on GDP per capita, yet this single metric fails to capture critical differences in institutional quality, human capital, labor market dynamics, infrastructure, and environmental sustainability. Many development indicators are highly correlated, suggesting underlying latent dimensions that unsupervised learning methods like PCA can uncover.
Unsupervised learning techniques are well suited to this setting, as no single ground-truth label exists for development types. By applying Principal Component Analysis and clustering to a curated set of World Development Indicators, this study aims to extract latent structure and derive data-driven country typologies without imposing predefined classifications such as income groups.
The project follows a rigorous unsupervised learning pipeline. After data cleaning (missingness analysis, hierarchical imputation) and preprocessing (log-transformation, standardization), Principal Component Analysis (PCA) was used to reduce dimensionality. Subsequently, Hierarchical Clustering (Ward.D2) was applied in the reduced orthogonal space. The final typology was validated using stability metrics (Bootstrap Jaccard Resampling - 200 iterations.) and internal indices (Silhouette, Calinski-Harabasz).
This analysis addresses two primary questions:
What are the main latent dimensions that explain variation across a broad set of development indicators?
Can countries be reliably grouped into stable, interpretable clusters based on their positions in the reduced PCA space, and what development profiles do these clusters represent?
The project was performed as a part of Unsupervised Learning Course of Data Science and Business Analytics Master’s Curriculum at the University of Warsaw
Project Folder Structure is :
Project folder structure:
USL_Global_Development/
├── USL_Global_Development.Rproj
├── data/
│ ├── raw/
│ └── processed/
├── output/
├── report/
│ └── USL_Global_Development.Rmd
└── scripts/
├── 01_import_clean.R
├── 02_dimension_reduction.R
└── 03_clustering.R
The analysis follows a three-script pipeline for this combined PCA and Clustering Project
# Global Options
options(stringsAsFactors = FALSE)
# 1. Load Core Libraries
library(tidyverse) # Data manipulation & plotting
library(janitor) # Data cleaning
library(WDI) # Data fetch
library(naniar) # Missingness visualization
library(kableExtra) # Table styling
# 2. Load Analysis Libraries
library(psych) # Skewness & description
library(FactoMineR) # PCA
library(factoextra) # Visualization (PCA/Cluster)
library(cluster) # Clustering algorithms
library(fpc) # Clustering validation
library(hopkins) # Clusterability
library(corrplot) # Correlation matrix
# 3. Load Mapping Libraries
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggsci) # color palettes
library(pheatmap) # Heatmaps
# Create Directory Structure
dir.create("output", recursive = TRUE, showWarnings = FALSE)
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
This section explains the data-related decisions made for this project.
The dataset is constructed from the World Bank World Development Indicators (WDI) at the country level. To reduce volatility and obtain a stable cross-sectional representation, all indicators are aggregated to a three-year mean over 2017–2019. The unit of analysis is a country, described by a multidimensional set of economic, demographic, health, education, infrastructure, environmental, and governance indicators.
For reproducibility, intermediate datasets are stored as RDS objects. The descriptive imputed dataset is used for cluster profiling and interpretation, while the standardized dataset is used for PCA and clustering.
Indicators were selected and the dataset was aggregated to 2017–2019 means to obtain a stable cross-sectional matrix for PCA and clustering.
# Define Indicator Pools Main WDI indicators WDI api
indicators_wdi <- c(
# Economy / structure
gdp_per_cap_ppp = "NY.GDP.PCAP.PP.KD",
gdp_growth = "NY.GDP.MKTP.KD.ZG",
capital_formation = "NE.GDI.FTOT.ZS",
agri_value_added = "NV.AGR.TOTL.ZS",
service_value_added = "NV.SRV.TOTL.ZS",
# Openness / capital
exports_gdp = "NE.EXP.GNFS.ZS",
fdi_inflows = "BX.KLT.DINV.WD.GD.ZS",
# Health
life_expectancy = "SP.DYN.LE00.IN",
sanitation_basic = "SH.STA.BASS.ZS",
health_exp_gdp = "SH.XPD.CHEX.GD.ZS",
measles_immun = "SH.IMM.MEAS",
# Education
school_sec_gross = "SE.SEC.ENRR",
school_ter_gross = "SE.TER.ENRR",
gov_exp_edu = "SE.XPD.TOTL.GD.ZS",
# Demography
pop_growth = "SP.POP.GROW",
dependency_ratio = "SP.POP.DPND",
urban_pop_rate = "SP.URB.TOTL.IN.ZS",
# Labor
unemployment = "SL.UEM.TOTL.ZS",
labor_part_tot = "SL.TLF.CACT.ZS",
labor_part_fem = "SL.TLF.CACT.FE.ZS",
# Tech / infra
access_elec = "EG.ELC.ACCS.ZS",
elec_consump = "EG.USE.ELEC.KH.PC",
internet_usage = "IT.NET.USER.ZS",
mobile_subs = "IT.CEL.SETS.P2",
# Environment
gge_per_cap = "EN.GHG.ALL.PC.CE.AR5",
renew_energy = "EG.FEC.RNEW.ZS",
#Governance (WGI)
corruption_ctrl = "CC.EST",
gov_effect = "GE.EST"
)
years <- c(2017:2019)
Indicators were selected based on three complementary criteria:
Relevance in development economics: Variables capture widely accepted dimensions of development, including economic performance (GDP per capita, growth), human capital (education, health), institutional quality (governance effectiveness, corruption control), infrastructure access, and demographic structure.
Coverage and data quality: Indicators with broad country coverage and consistent availability over 2017–2019 were prioritized to minimize missingness and unstable imputation.
Latent structure relevance: The selected indicators are known to be correlated yet conceptually distinct, making them well suited for latent dimension extraction via PCA.
Data were retrieved via the WDI package API and cached locally for reproducibility & speed.
# Get data
raw_path <- "data/raw/wdi_raw_2017_2019.rds"
if (!file.exists(raw_path)) {
message("Raw WDI data not found. Downloading from API...")
options(timeout = 360)
raw_wdi <- WDI(
country = "all",
indicator = indicators_wdi,
start = min(years),
end = max(years),
extra = TRUE,
cache = NULL
)
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
saveRDS(raw_wdi, raw_path)
} else {
message("Raw WDI data found. Loading from disk...")
raw_wdi <- readRDS(raw_path)
}
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
To ensure reproducibility and reduce API dependence, raw downloads and intermediate datasets are cached as RDS files. If cached files exist, the pipeline loads from disk; otherwise, it downloads and saves them.
The section displays the process of transforming the raw data so it is suitable for further analysis
The Important Indicators were taken and Dataset was averaged for 3 years
# Filtering
wdi_clean <- raw_wdi %>%
clean_names() %>%
filter(region != "Aggregates") %>% # drop World/regions aggregates
filter(!is.na(iso3c)) %>% # keep proper countries
filter(nchar(iso3c) == 3) %>%
select(country, iso3c, region, income, year, all_of(names(indicators_wdi)))
# Aggregate to 3-year Average
# cross-sectional matrix for clustering/PCA
wdi_avg <- wdi_clean %>%
group_by(country, iso3c, region, income) %>%
summarise(across(all_of(names(indicators_wdi)), ~ mean(.x, na.rm = TRUE)),
.groups = "drop")
wdi_avg <- wdi_avg %>%
mutate(across(all_of(names(indicators_wdi)), ~ ifelse(is.nan(.x), NA, .x)))
Note: If a country has fewer than three observed years for an indicator, the 2017–2019 mean is computed using available years (via na.rm = TRUE), avoiding unnecessary country loss.
cat(
"Final dataset dimensions:\n",
"Countries:", nrow(wdi_avg), "\n",
"Indicators:", length(indicators_wdi), "\n"
)
## Final dataset dimensions:
## Countries: 215
## Indicators: 28
We apply a strict 30/40 rule to ensure data quality:
The 30% variable threshold reduces unstable imputations, while the 40% country threshold avoids constructing country profiles dominated by imputed values.
1.Indicators: Variables with >30% missingness are dropped.
2.Countries: Observations with >40% missingness are dropped.
For each WDI indicator missing share were assessed and threshold for dropping in was set to 0.3
# Missing Data Analysis
# Variable
miss_var <- wdi_avg %>%
summarise(across(all_of(names(indicators_wdi)), ~ mean(is.na(.x)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "missing_share") %>%
arrange(desc(missing_share))
# visual
p1 <- ggplot(miss_var, aes(x = reorder(variable, missing_share), y = missing_share)) +
geom_col() +
coord_flip() +
labs(title = "Initial Missing Share by Variable", x = NULL, y = "Missing share")
print(p1)
Indicators above 30% missingness were excluded to avoid unstable imputations and preserve interpretability.
cutoff <- 0.30
vars_drop <- miss_var %>%
filter(missing_share > cutoff) %>%
pull(variable)
vars_keep <- miss_var %>%
filter(missing_share <= cutoff) %>%
pull(variable)
Next, the countries with high missing data share were assessed . To not include bias and represent different countries country_cutoff <- 0.40 was set
# Row-wise (Country) Missingness Check
country_miss <- wdi_avg %>%
select(country, iso3c, region, income, all_of(vars_keep)) %>%
mutate(miss_prop = rowMeans(is.na(select(., all_of(vars_keep))))) %>%
arrange(desc(miss_prop))
# 40% cutoff:
country_cutoff <- 0.40
countries_drop <- country_miss %>% filter(miss_prop > country_cutoff) %>% pull(iso3c)
cat("Dropping", length(countries_drop), "countries due to excessive missingness.\n")
wdi_reduced <- wdi_avg %>%
filter(!iso3c %in% countries_drop) %>%
select(country, iso3c, region, income, all_of(vars_keep))
## Dropping 14 countries due to excessive missingness.
Near-Constant variables were assessed and dropped to ensure the smooth execution of algorithms down the pipeline
# Drop Near-Constant Variables
# We also check variables that became all-NA after
stats_check <- wdi_reduced %>%
summarise(across(all_of(vars_keep), list(
na_count = ~ sum(!is.na(.)),
sd = ~ sd(., na.rm = TRUE)
))) %>%
pivot_longer(
cols = everything(),
names_to = c("variable", "stat"),
names_pattern = "^(.*)_(na_count|sd)$"
) %>%
pivot_wider(names_from = stat, values_from = value)
vars_drop_stat <- stats_check %>%
filter(na_count < 10 | sd < 1e-10) %>%
pull(variable)
vars_model <- setdiff(vars_keep, vars_drop_stat)
wdi_reduced <- wdi_reduced %>% select(country, iso3c, region, income, all_of(vars_model))
cat("Variables dropped (low variance/obs):", length(vars_drop_stat), "\n")
cat("Final variable count:", length(vars_model), "\n")
## Variables dropped (low variance/obs): 0
## Final variable count: 26
Tables below summarizes coverage after preprocessing.
| countries | indicators | regions | income_groups |
|---|---|---|---|
| 201 | 26 | 7 | 5 |
summary_tbl <- data.frame(
Step = c("Initial countries (after aggregates removed)",
"Countries dropped (missingness > 40%)",
"Final countries",
"Initial indicators",
"Indicators dropped (missingness > 30%)",
"Indicators dropped (near-constant / too few obs)",
"Final indicators"),
Value = c(nrow(wdi_avg),
length(countries_drop),
nrow(wdi_reduced),
length(indicators_wdi),
length(vars_drop),
length(vars_drop_stat),
length(vars_model))
)
knitr::kable(summary_tbl, caption = "Preprocessing decisions and resulting dataset size.")
| Step | Value |
|---|---|
| Initial countries (after aggregates removed) | 215 |
| Countries dropped (missingness > 40%) | 14 |
| Final countries | 201 |
| Initial indicators | 28 |
| Indicators dropped (missingness > 30%) | 2 |
| Indicators dropped (near-constant / too few obs) | 0 |
| Final indicators | 26 |
We employ a hierarchical median imputation (Region Median \(\rightarrow\) Global Median) to preserve regional structural differences.
# Hierarchical Imputation
#Region then Global
wdi_imp <- wdi_reduced %>%
group_by(region) %>%
mutate(across(all_of(vars_model), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
ungroup() %>%
mutate(across(all_of(vars_model), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) # Global fallback
# Fail if any NAs remain later.
na_count_post <- sum(is.na(wdi_imp[vars_model]))
stopifnot(all(vars_model %in% names(wdi_imp)))
stopifnot(all(sapply(wdi_imp[vars_model], is.numeric)))
if (na_count_post > 0) {
stop("CRITICAL ERROR: Imputation failed. NAs still exist in the dataset.")
} else {
message("Imputation successful: 0 missing values remaining.")
}
stopifnot(!anyNA(wdi_imp[vars_model]))
Note: We did not include the Income based median imputation in our imputation rule set. Though at first glance it made sense , the income labels will force data leakage and disrupt natural cluster discovery.
Imputation audit. The table reports the share of values imputed (missing before imputation) for the most affected indicators.
| variable | imputed_share |
|---|---|
| school_sec_gross | 0.234 |
| capital_formation | 0.169 |
| gov_exp_edu | 0.129 |
| exports_gdp | 0.114 |
| unemployment | 0.090 |
| labor_part_tot | 0.090 |
| labor_part_fem | 0.090 |
| health_exp_gdp | 0.055 |
| measles_immun | 0.055 |
| gdp_per_cap_ppp | 0.040 |
Variables with high skewness (>2) are log-transformed (\(log_{10}(x+1)\)) to improve normality, excluding percentage-based metrics.
# Skewness & Log Transformation
# Calculate skewness table
skew_tbl <- describe(wdi_imp[vars_model]) %>%
as.data.frame() %>%
select(skew) %>%
rownames_to_column("variable") %>%
arrange(desc(abs(skew)))
#Identify highly skewed variables
high_skew_vars <- skew_tbl %>%
filter(skew > 2) %>%
pull(variable)
#Identify non-negative variables
safe_to_log <- vars_model[sapply(wdi_imp[vars_model], function(x) min(x, na.rm = TRUE) >= 0)]
#skewed AND safe
vars_to_log <- intersect(high_skew_vars, safe_to_log)
#bounded percent/rate indicators
percent_like <- c(
"capital_formation","agri_value_added","service_value_added",
"exports_gdp","fdi_inflows","sanitation_basic","health_exp_gdp",
"gov_exp_edu","urban_pop_rate","unemployment","labor_part_tot",
"labor_part_fem","access_elec","internet_usage","renew_energy"
)
vars_to_log <- setdiff(vars_to_log, percent_like)
cat("Log-transforming", length(vars_to_log), "variables:\n")
print(vars_to_log)
# Apply log10(x+1)
wdi_transformed <- wdi_imp %>%
mutate(across(all_of(vars_to_log), ~ log10(.x + 1)))
#no NaN/Inf introduced by transform
stopifnot(all(is.finite(as.matrix(wdi_transformed[vars_model]))))
## Log-transforming 1 variables:
## [1] "gge_per_cap"
cat("Checks after transformations:\n")
cat("Any NA in wdi_transformed vars_model? ", anyNA(wdi_transformed[vars_model]), "\n")
cat("Any infinite values? ", any(!is.finite(as.matrix(wdi_transformed[vars_model]))), "\n")
## Checks after transformations:
## Any NA in wdi_transformed vars_model? FALSE
## Any infinite values? FALSE
The standardization is needed to make variables comparable in analysis
# Standardization
wdi_scaled <- wdi_transformed %>%
mutate(across(all_of(vars_model), ~ as.numeric(scale(.x))))
stopifnot(all(sapply(wdi_scaled[vars_model], is.numeric)))
stopifnot(!anyNA(wdi_scaled[vars_model]))
# Check Boxplot of scaled variables (sample 12 vars)
set.seed(47)
vars_sample <- sample(vars_model, min(12, length(vars_model)))
wdi_scaled %>%
select(all_of(vars_sample)) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = name, y = value)) +
geom_boxplot() +
coord_flip() +
labs(title = "Scaled Variable Distributions (sample)", x = NULL, y = "Z-score")
The post-scaling distributions confirm comparable scale across indicators, which is required for PCA and Euclidean-distance clustering.
# Saving Outputs
saveRDS(wdi_imp, "data/processed/wdi_imputed_descriptive.rds") # for Cluster Naming
saveRDS(wdi_scaled, "data/processed/wdi_scaled_pca_input.rds") # for PCA/Clustering
cat("Success! Processed data saved. Ready for PCA.\n")
## Success! Processed data saved. Ready for PCA.
Many World Development Indicators are strongly correlated, reflecting overlapping aspects of economic structure, human development, infrastructure, and governance. Applying clustering directly to the raw indicator space can violate distance assumptions due to multicollinearity and unequal variance. Principal Component Analysis (PCA) is therefore used to extract orthogonal latent dimensions that summarize the main sources of variation in the data. PCA reduces noise, improves distance validity, and facilitates interpretation of country positions in a lower-dimensional space.
# Load Processed Data
wdi_scaled <- readRDS("data/processed/wdi_scaled_pca_input.rds")
# Metadata columns not enter PCA
meta_vars <- c("iso3c", "region", "income", "year")
# PCA matrix
wdi_matrix <- wdi_scaled %>%
column_to_rownames("country") %>%
select(-any_of(meta_vars))
# checks
stopifnot(all(map_lgl(wdi_matrix, is.numeric)))
stopifnot(!anyNA(wdi_matrix))
cat("PCA input:", nrow(wdi_matrix), "countries x", ncol(wdi_matrix), "indicators\n")
## PCA input: 201 countries x 26 indicators
This section analyzes the latent structure of the remaining indicators after preprocessing to reduce noise and multicollinearity.
##PCA Execution
# wdi_matrix is already z-scaled
pca_res <- PCA(wdi_matrix, scale.unit = FALSE,ncp=10, graph = FALSE)
##Component Selection
We use the Scree Plot and Kaiser Criterion also, checking the total variance explained to determine the number of dimensions.
p_scree <- fviz_eig(pca_res, addlabels = TRUE) +
labs(title = "Scree Plot: Explained Variance by PCA Dimensions")
print(p_scree)
eig <- get_eigenvalue(pca_res) %>% as.data.frame()
eig_tbl <- eig %>%
mutate(Component = paste0("PC", row_number())) %>%
select(Component, eigenvalue, variance.percent, cumulative.variance.percent)
knitr::kable(
head(eig_tbl, 10),
digits = 3,
caption = "Eigenvalues and explained variance (first 10 components)"
)
| Component | eigenvalue | variance.percent | cumulative.variance.percent | |
|---|---|---|---|---|
| Dim.1 | PC1 | 10.693 | 41.333 | 41.333 |
| Dim.2 | PC2 | 2.705 | 10.454 | 51.787 |
| Dim.3 | PC3 | 1.798 | 6.950 | 58.737 |
| Dim.4 | PC4 | 1.539 | 5.949 | 64.686 |
| Dim.5 | PC5 | 1.321 | 5.106 | 69.792 |
| Dim.6 | PC6 | 0.959 | 3.706 | 73.498 |
| Dim.7 | PC7 | 0.831 | 3.213 | 76.712 |
| Dim.8 | PC8 | 0.771 | 2.981 | 79.693 |
| Dim.9 | PC9 | 0.655 | 2.533 | 82.226 |
| Dim.10 | PC10 | 0.624 | 2.414 | 84.640 |
kaiser_k <- sum(eig$eigenvalue > 1)
var70_k <- which(eig$cumulative.variance.percent >= 70)[1]
k_pca <- min(kaiser_k, var70_k)
cat("Kaiser (eig > 1):", kaiser_k, "\n")
cat("Components for >=70% variance:", var70_k, "\n")
cat("Selected number of components (k_pca):", k_pca, "\n")
## Kaiser (eig > 1): 5
## Components for >=70% variance: 6
## Selected number of components (k_pca): 5
Component retention. We retain \(k\) components using a rule: minimum of (i) Kaiser criterion (eigenvalue > 1) and (ii) components needed for ≥70% cumulative explained variance.
| Rule | Value |
|---|---|
| Kaiser (eig > 1) | 5 |
| ≥ 70% cumulative variance | 6 |
| Selected k | 5 |
Retaining five components balances information preservation with interpretability, while keeping the clustering distance metric valid by operating in an orthogonal space
contrib_tbl <- as.data.frame(pca_res$var$contrib[, 1:k_pca, drop = FALSE]) %>%
rownames_to_column("variable") %>%
pivot_longer(-variable, names_to = "component", values_to = "contrib") %>%
group_by(component) %>%
slice_max(contrib, n = 10, with_ties = FALSE) %>%
ungroup() %>%
arrange(component, desc(contrib))
knitr::kable(
contrib_tbl,
digits = 3,
caption = "Top contributing variables (per retained component)"
)
| variable | component | contrib |
|---|---|---|
| internet_usage | Dim.1 | 8.336 |
| life_expectancy | Dim.1 | 7.534 |
| sanitation_basic | Dim.1 | 7.463 |
| access_elec | Dim.1 | 6.721 |
| school_sec_gross | Dim.1 | 6.627 |
| agri_value_added | Dim.1 | 6.617 |
| dependency_ratio | Dim.1 | 6.588 |
| gov_effect | Dim.1 | 6.555 |
| gdp_per_cap_ppp | Dim.1 | 5.495 |
| corruption_ctrl | Dim.1 | 5.074 |
| labor_part_fem | Dim.2 | 26.391 |
| labor_part_tot | Dim.2 | 26.168 |
| unemployment | Dim.2 | 14.990 |
| gdp_per_cap_ppp | Dim.2 | 4.191 |
| gov_effect | Dim.2 | 3.961 |
| corruption_ctrl | Dim.2 | 3.790 |
| renew_energy | Dim.2 | 3.737 |
| gdp_growth | Dim.2 | 2.876 |
| exports_gdp | Dim.2 | 2.670 |
| capital_formation | Dim.2 | 2.263 |
| health_exp_gdp | Dim.3 | 29.332 |
| gov_exp_edu | Dim.3 | 20.396 |
| mobile_subs | Dim.3 | 7.730 |
| corruption_ctrl | Dim.3 | 5.338 |
| gdp_growth | Dim.3 | 4.548 |
| service_value_added | Dim.3 | 4.536 |
| capital_formation | Dim.3 | 4.460 |
| labor_part_fem | Dim.3 | 3.843 |
| pop_growth | Dim.3 | 3.757 |
| exports_gdp | Dim.3 | 3.580 |
| capital_formation | Dim.4 | 26.413 |
| gdp_growth | Dim.4 | 22.439 |
| measles_immun | Dim.4 | 9.786 |
| gov_exp_edu | Dim.4 | 8.409 |
| urban_pop_rate | Dim.4 | 7.051 |
| exports_gdp | Dim.4 | 4.648 |
| gdp_per_cap_ppp | Dim.4 | 4.625 |
| access_elec | Dim.4 | 2.668 |
| unemployment | Dim.4 | 2.526 |
| gge_per_cap | Dim.4 | 2.257 |
| fdi_inflows | Dim.5 | 33.128 |
| gdp_growth | Dim.5 | 11.901 |
| exports_gdp | Dim.5 | 11.267 |
| labor_part_tot | Dim.5 | 7.848 |
| pop_growth | Dim.5 | 7.374 |
| service_value_added | Dim.5 | 6.522 |
| gge_per_cap | Dim.5 | 3.349 |
| unemployment | Dim.5 | 2.972 |
| corruption_ctrl | Dim.5 | 2.776 |
| labor_part_fem | Dim.5 | 2.379 |
Directional interpretation. The sign of loadings indicates which indicators increase in the positive direction of each component.
| loading | variable | |
|---|---|---|
| internet_usage | 0.944 | internet_usage |
| life_expectancy | 0.898 | life_expectancy |
| sanitation_basic | 0.893 | sanitation_basic |
| access_elec | 0.848 | access_elec |
| school_sec_gross | 0.842 | school_sec_gross |
| loading | variable | |
|---|---|---|
| agri_value_added | -0.841 | agri_value_added |
| dependency_ratio | -0.839 | dependency_ratio |
| renew_energy | -0.721 | renew_energy |
| pop_growth | -0.521 | pop_growth |
| gdp_growth | -0.194 | gdp_growth |
PCA loading signs are arbitrary up to multiplication by -1; interpretations focus on relative contrasts within each component.
loadings_tbl <- as.data.frame(pca_res$var$coord[, 1:k_pca, drop = FALSE]) %>%
rownames_to_column("variable") %>%
pivot_longer(-variable, names_to = "component", values_to = "loading") %>%
group_by(component) %>%
slice_max(abs(loading), n = 8, with_ties = FALSE) %>%
ungroup() %>%
arrange(component, desc(abs(loading)))
knitr::kable(
loadings_tbl,
digits = 3,
caption = "Largest absolute loadings (per retained component)"
)
| variable | component | loading |
|---|---|---|
| internet_usage | Dim.1 | 0.944 |
| life_expectancy | Dim.1 | 0.898 |
| sanitation_basic | Dim.1 | 0.893 |
| access_elec | Dim.1 | 0.848 |
| school_sec_gross | Dim.1 | 0.842 |
| agri_value_added | Dim.1 | -0.841 |
| dependency_ratio | Dim.1 | -0.839 |
| gov_effect | Dim.1 | 0.837 |
| labor_part_fem | Dim.2 | 0.845 |
| labor_part_tot | Dim.2 | 0.841 |
| unemployment | Dim.2 | -0.637 |
| gdp_per_cap_ppp | Dim.2 | 0.337 |
| gov_effect | Dim.2 | 0.327 |
| corruption_ctrl | Dim.2 | 0.320 |
| renew_energy | Dim.2 | 0.318 |
| gdp_growth | Dim.2 | 0.279 |
| health_exp_gdp | Dim.3 | 0.726 |
| gov_exp_edu | Dim.3 | 0.606 |
| mobile_subs | Dim.3 | -0.373 |
| corruption_ctrl | Dim.3 | 0.310 |
| gdp_growth | Dim.3 | -0.286 |
| service_value_added | Dim.3 | 0.286 |
| capital_formation | Dim.3 | -0.283 |
| labor_part_fem | Dim.3 | 0.263 |
| capital_formation | Dim.4 | 0.638 |
| gdp_growth | Dim.4 | 0.588 |
| measles_immun | Dim.4 | 0.388 |
| gov_exp_edu | Dim.4 | 0.360 |
| urban_pop_rate | Dim.4 | -0.329 |
| exports_gdp | Dim.4 | -0.267 |
| gdp_per_cap_ppp | Dim.4 | -0.267 |
| access_elec | Dim.4 | 0.203 |
| fdi_inflows | Dim.5 | 0.662 |
| gdp_growth | Dim.5 | 0.396 |
| exports_gdp | Dim.5 | 0.386 |
| labor_part_tot | Dim.5 | -0.322 |
| pop_growth | Dim.5 | 0.312 |
| service_value_added | Dim.5 | 0.294 |
| gge_per_cap | Dim.5 | -0.210 |
| unemployment | Dim.5 | 0.198 |
Based on the highest contributions and loadings, the five retained components can be interpreted as follows:
PC1 (“Socio-Economic Development & Institutional Quality”): Captures the primary development continuum. Positive direction: high GDP per capita, strong governance, life expectancy, infrastructure access; negative direction: high agricultural share, demographic pressure.
PC2 (“Labor Market Participation”): Reflects overall and female labor force participation rates, with lower unemployment.
PC3 (“Public Human Capital Investment”): Emphasizes public spending on education and health relative to private capital formation.
PC4 (“Investment & Growth Dynamics”): Distinguishes investment-intensive, faster-growing economies from more mature ones.
PC5 (“External Economic Integration”): Separates trade/FDI-open economies from more domestic-oriented ones.
These dimensions are orthogonal by construction and collectively explain approximately 70% of total variance in the original indicators.
Visual map
fviz_pca_var(
pca_res,
col.var = "contrib",
repel = TRUE,
select.var = list(contrib = 20)
) + labs(title = "PCA Variable Map (Top 20 contributions, Dim 1 & 2)")
fviz_pca_ind(
pca_res,
axes = c(1,2),
geom.ind = "point",
col.ind = wdi_scaled$income,
addEllipses = TRUE,
legend.title = "Income"
) + labs(title = "Countries in PCA space (Dim 1 & 2), colored by income group")
fviz_pca_ind(
pca_res,
axes = c(1,2),
geom.ind = "point",
col.ind = wdi_scaled$region,
addEllipses = TRUE,
legend.title = "Region"
) + labs(title = "Countries in PCA space (Dim 1 & 2), colored by region")
Because PCA components are orthogonal (uncorrelated), the Euclidean distance used in the subsequent clustering step is valid and not biased by the redundant information (multicollinearity) present in the raw WDI indicators.
Output for clustering. Metadata (region, income) is kept only for interpretation and visualization.
Clustering is performed on the retained PCA scores (PC1–PCk) to avoid multicollinearity and ensure Euclidean distance validity. Ward’s hierarchical clustering (Ward.D2) is applied, and the number of clusters is selected using a stability-first strategy combining internal indices and bootstrap Jaccard stability.
#Load data
pca_scores <- readRDS("data/processed/pca_scores_for_clustering.rds")
wdi_imp <- readRDS("data/processed/wdi_imputed_descriptive.rds")
pc_labels <- readRDS("data/processed/pc_labels.rds")
wdi_scaled <- readRDS("data/processed/wdi_scaled_pca_input.rds")
# clustering space
pc_cols_all <- names(pca_scores)[grepl("^PC\\d+$", names(pca_scores))]
n_pcs_available <- length(pc_cols_all)
pcs_all <- pca_scores %>%
column_to_rownames("country") %>%
select(all_of(pc_cols_all))
stopifnot(all(map_lgl(pcs_all, is.numeric)))
cat("Clustering input:", nrow(pcs_all), "countries x", ncol(pcs_all), "PCs (Available)\n")
## Clustering input: 201 countries x 5 PCs (Available)
n_pcs <- min(5, ncol(pcs_all))
pc_cols <- paste0("PC", 1:n_pcs)
X_pca <- pcs_all %>% select(all_of(pc_cols))
stopifnot(!anyNA(X_pca))
cat("Clustering on:", paste(pc_cols, collapse = ", "), "\n")
## Clustering on: PC1, PC2, PC3, PC4, PC5
Before applying a partitioning algorithm, we must ensure the data contains meaningful structure rather than a random distribution.
set.seed(47)
H <- hopkins::hopkins(as.matrix(X_pca), m = floor(0.1 * nrow(X_pca)))
cat("Hopkins statistic H =", round(H, 4), "\n")
## Hopkins statistic H = 0.9988
d <- dist(X_pca, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
H scores mean usually: scores near 0.5 random structure, closer to 1 clusterable structure ; closer to 0 uniform structure
The Hopkins statistic is far above 0.5, indicating strong departure from spatial randomness and supporting the presence of cluster structure in PCA space.
We iterate \(k\) from 2 to 6, selecting the solution based on Bootstrap Stability (Jaccard Index).
k_candidates <- 2:6
min_size_rule <- 7
B_boot <- 200
set.seed(47)
validation_df <- purrr::map_dfr(k_candidates, function(k) {
grp <- cutree(hc, k = k)
sil <- silhouette(grp, d)
sil_avg <- mean(sil[, 3])
ch <- fpc::calinhara(as.matrix(X_pca), grp)
sizes <- table(grp)
min_size <- min(sizes)
set.seed(47 + k)
boot <- fpc::clusterboot(
as.matrix(X_pca),
B = B_boot,
bootmethod = "boot",
clustermethod = fpc::hclustCBI,
method = "ward.D2",
k = k,
count = FALSE,
showplots = FALSE
)
pam_fit <- cluster::pam(as.matrix(X_pca), k = k)
crand <- fpc::cluster.stats(d, grp, pam_fit$clustering)$corrected.rand
tibble(
k = k,
Silhouette = sil_avg,
Calinski_Harabasz = ch,
Min_Cluster_Size = as.integer(min_size),
Min_Jaccard = min(boot$bootmean),
Mean_Jaccard = mean(boot$bootmean),
PAM_Agreement_CRand = crand
)
})
knitr::kable(
validation_df,
digits = 3,
caption = "Model selection diagnostics for candidate numbers of clusters (Ward.D2 in PCA space)."
)
| k | Silhouette | Calinski_Harabasz | Min_Cluster_Size | Min_Jaccard | Mean_Jaccard | PAM_Agreement_CRand |
|---|---|---|---|---|---|---|
| 2 | 0.411 | 104.211 | 30 | 0.679 | 0.788 | 0.514 |
| 3 | 0.270 | 115.331 | 30 | 0.440 | 0.667 | 0.623 |
| 4 | 0.234 | 97.253 | 30 | 0.501 | 0.635 | 0.545 |
| 5 | 0.252 | 88.759 | 13 | 0.458 | 0.635 | 0.396 |
| 6 | 0.242 | 82.717 | 13 | 0.306 | 0.571 | 0.693 |
The k selection logic
candidates_ok <- validation_df %>% filter(Min_Cluster_Size >= min_size_rule)
if (nrow(candidates_ok) == 0) stop("No k passed Min_Cluster_Size rule.")
stable_ok <- candidates_ok %>% filter(Min_Jaccard >= 0.60)
if (nrow(stable_ok) > 0) {
k_clusters <- stable_ok %>%
arrange(desc(Mean_Jaccard), desc(Silhouette)) %>%
slice(1) %>% pull(k)
reason <- "Highest Mean Jaccard among solutions with Min Jaccard >= 0.60 and adequate cluster sizes."
} else {
k_clusters <- candidates_ok %>%
arrange(desc(Min_Jaccard), desc(Mean_Jaccard), desc(Silhouette)) %>%
slice(1) %>% pull(k)
reason <- "Fallback: no solution reached Min Jaccard >= 0.60; selected best stability compromise."
}
knitr::kable(
data.frame(Selected_k = k_clusters, Reason = reason),
caption = "Selected number of clusters (stability-first decision)."
)
| Selected_k | Reason |
|---|---|
| 2 | Highest Mean Jaccard among solutions with Min Jaccard >= 0.60 and adequate cluster sizes. |
groups <- cutree(hc, k = k_clusters)
stopifnot(identical(names(groups), rownames(X_pca)))
| cluster | n_countries |
|---|---|
| 1 | 171 |
| 2 | 30 |
| Metric | Value |
|---|---|
| Avg silhouette | 0.411 |
| Calinski-Harabasz | 104.211 |
| PAM agreement (Corrected Rand) | 0.514 |
| Min Jaccard | 0.666 |
| Mean Jaccard | 0.777 |
Selected solution. Based on the stability-first rule (minimum cluster size ≥ 7, and preference for Min Jaccard ≥ 0.60), the final choice is k = 2. The solution achieves an average silhouette of 0.411, Calinski–Harabasz of 104.2, and bootstrap stability (Jaccard) with min = 0.666 and mean = 0.777.
A Jaccard similarity above 0.60 suggests cluster stability, meaning that the cluster’s membership shows substantial overlap across bootstrap resamples. This indicates that the grouping is reasonably robust to sampling variation, although it should not be interpreted as definitive proof of an underlying ‘true’ structure.
## Note: these requested profile variables were not available and were skipped: school_ter_gross
| cluster_short | gdp_per_cap_ppp | life_expectancy | internet_usage | access_elec | sanitation_basic | school_sec_gross | gov_effect | corruption_ctrl | agri_value_added | dependency_ratio | pop_growth | unemployment | labor_part_fem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Advanced / Institutional | 19847.45 | 75.00 | 66.59 | 99.93 | 93.67 | 96.39 | 0.08 | -0.05 | 5.01 | 52.82 | 0.94 | 5.68 | 52.58 |
| Labor-Intensive / Agrarian | 2737.48 | 61.12 | 15.71 | 36.37 | 25.13 | 45.46 | -1.05 | -1.03 | 23.26 | 86.02 | 2.64 | 3.50 | 64.85 |
The heatmap below confirms the distinct separation of profiles:
Positive values indicate above-average (relative) indicator levels for that cluster; negative values indicate below-average levels.
Although clustering was based on socio-economic indicators and not geographic information, the resulting groups display a clear spatial pattern. Cluster 2 is dominated by Sub-Saharan African countries, with some representation from South and Southeast Asia, while Cluster 1 covers most of the remaining regions.
This geographic concentration is an outcome of structural differences in institutions, human capital, infrastructure, and demographic pressure rather than a trivial regional split. Solutions with more clusters (e.g. four clusters) produced interpretable intermediate or fast-catch-up groups but showed lower stability under bootstrap validation. Therefore, the two-cluster solution was retained as the most robust and structurally meaningful representation of the data.
This analysis successfully extracted latent development dimensions and identified two robust country typologies from 2017–2019 WDI data: - Advanced, institutionally strong economies: Characterized by high income, strong governance, advanced human capital, infrastructure, and digital access. - Labor-intensive, agrarian and demographically pressured economies: Higher agricultural dependence, demographic pressures, and necessity-driven labor participation.
Income groups align strongly with PC1 and cluster membership, but clustering is not identical to income classification because it also reflects governance, infrastructure, and demographic structure.
The PCA-clustering pipeline successfully extracts meaningful structure beyond traditional income classifications, with the two-cluster solution proving stable and interpretable.
The use of Hopkins statistic and Bootstrap Stability indicates that the clusters are not random artifacts.
Key insights: - Development is not unidimensional — institutional quality and human capital investment emerge as distinct from pure economic growth. - Geographic patterns emerge naturally (e.g., advanced cluster concentrated in Europe/North America/East Asia).
Method choice rationale -Ward.D2 is chosen for compact clusters in Euclidean space; clustering in PCA space reduces multicollinearity and makes distances more meaningful.
Limitations: - Cross-sectional (pre-COVID snapshot); temporal dynamics not captured. - Reliance on median imputation and theoretical indicator selection.
Future work: Incorporate time-series clustering or additional indicators (e.g., inequality, climate vulnerability).
The full reproducible pipeline demonstrates the power of unsupervised learning for exploratory development analysis.
Data Source
World Bank. (2026). World Development Indicators (WDI). Retrieved January 2026 from https://databank.worldbank.org/source/world-development-indicators
Other Sources
Commission on the Measurement of Economic Performance and Social Progress. (2009). Report by the Commission on the Measurement of Economic Performance and Social Progress. Paris. Retrieved February 2026 from https://ec.europa.eu/eurostat/documents/8131721/8131772/Stiglitz-Sen-Fitoussi-Commission-report.pdf
Husson, F., Lê, S., & Pagès, J. (2017). Exploratory multivariate analysis by example using R (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b21874
United Nations Development Programme. (2019). Human Development Report 2019: Beyond income, beyond averages, beyond today. Retrieved February 2026 from http://report2019.archive.s3-website-us-east-1.amazonaws.com/
cat("Report rendered on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n\n")
# Core R info
print(R.version)
cat("\nPlatform:", R.version$platform, "\n")
cat("Running on:", R.version$os, "\n\n")
# Packages + versions actually loaded in this session
sessionInfo()
## Report rendered on: 2026-02-01 17:32:36 CET
##
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## crt ucrt
## system x86_64, mingw32
## status
## major 4
## minor 5.1
## year 2025
## month 06
## day 13
## svn rev 88306
## language R
## version.string R version 4.5.1 (2025-06-13 ucrt)
## nickname Great Square Root
##
## Platform: x86_64-w64-mingw32
## Running on: mingw32
##
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] C
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pheatmap_1.0.13 ggsci_4.2.0 rnaturalearthdata_1.0.0
## [4] rnaturalearth_1.2.0 sf_1.0-24 corrplot_0.95
## [7] hopkins_1.1 fpc_2.2-14 cluster_2.1.8.1
## [10] factoextra_1.0.7 FactoMineR_2.13 psych_2.5.6
## [13] kableExtra_1.4.0 naniar_1.1.0 WDI_2.7.9
## [16] janitor_2.2.1 lubridate_1.9.4 forcats_1.0.1
## [19] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.0
## [22] readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
## [25] ggplot2_4.0.0 tidyverse_2.0.0 DiagrammeR_1.0.11
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 mnormt_2.1.1 gridExtra_2.3
## [4] rlang_1.1.6 magrittr_2.0.4 donut_1.0.3
## [7] snakecase_0.11.1 otel_0.2.0 e1071_1.7-17
## [10] compiler_4.5.1 flexmix_2.3-20 systemfonts_1.3.1
## [13] vctrs_0.6.5 pkgconfig_2.0.3 fastmap_1.2.0
## [16] backports_1.5.0 labeling_0.4.3 rmarkdown_2.30
## [19] tzdb_0.5.0 visdat_0.6.0 xfun_0.54
## [22] modeltools_0.2-24 cachem_1.1.0 jsonlite_2.0.0
## [25] flashClust_1.01-2 pdist_1.2.1 broom_1.0.10
## [28] parallel_4.5.1 prabclus_2.3-4 R6_2.6.1
## [31] bslib_0.9.0 stringi_1.8.7 RColorBrewer_1.1-3
## [34] car_3.1-3 jquerylib_0.1.4 diptest_0.77-2
## [37] estimability_1.5.1 Rcpp_1.1.0 knitr_1.51
## [40] nnet_7.3-20 timechange_0.3.0 tidyselect_1.2.1
## [43] viridis_0.6.5 abind_1.4-8 rstudioapi_0.17.1
## [46] yaml_2.3.10 lattice_0.22-7 withr_3.0.2
## [49] S7_0.2.0 evaluate_1.0.5 units_1.0-0
## [52] proxy_0.4-29 xml2_1.4.1 mclust_6.1.2
## [55] kernlab_0.9-33 ggpubr_0.6.2 pillar_1.11.1
## [58] carData_3.0-5 KernSmooth_2.23-26 DT_0.34.0
## [61] stats4_4.5.1 generics_0.1.4 hms_1.1.4
## [64] scales_1.4.0 leaps_3.2 class_7.3-23
## [67] glue_1.8.0 emmeans_2.0.0 scatterplot3d_0.3-44
## [70] tools_4.5.1 dendextend_1.19.1 robustbase_0.99-6
## [73] ggsignif_0.6.4 RANN_2.6.2 visNetwork_2.1.4
## [76] mvtnorm_1.3-3 grid_4.5.1 nlme_3.1-168
## [79] Formula_1.2-5 cli_3.6.5 textshaping_1.0.4
## [82] viridisLite_0.4.2 svglite_2.2.2 gtable_0.3.6
## [85] DEoptimR_1.1-4 rstatix_0.7.3 sass_0.4.10
## [88] digest_0.6.37 classInt_0.4-11 ggrepel_0.9.6
## [91] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.9
## [94] lifecycle_1.0.4 multcompView_0.1-10 MASS_7.3-65