Introduction

This project investigates latent dimensions of development and identifies patterns across countries using economic and social indicators. Global development is multidimensional. Looking at GDP alone is deceptive: a country might have high wealth but poor literacy or high child mortality. The project therefore moves beyond single-metric rankings by using unsupervised machine learning to find hidden patterns in how nations develop. It applies two key unsupervised methods:

Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that projects high-dimensional data onto a lower-dimensional space. The resulting features, known as Principal Components (PCs), are uncorrelated and ordered by the amount of variance (information) they capture from the original data, which makes the data easier to visualise, process, and model by removing redundancy and noise. Most development indicators are highly correlated (e.g., as literacy rate increases, life expectancy tends to increase as well). PCA transforms the 9 variables into uncorrelated PCs.
K-means clustering: K-means is an unsupervised machine learning approach that splits data points into k non-overlapping clusters with the goal of minimizing the squared distance between data points and their cluster’s centre.
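
The two methods described above can be sketched in a few lines of base R. This is a minimal illustration only: it uses R's built-in USArrests data rather than the WDI indicators, and the choice of k = 3 and of two components simply mirrors the settings used later in this report.

```r
# Toy illustration on R's built-in USArrests data (NOT the WDI data used in this project)
x <- scale(USArrests)              # standardize: mean 0, sd 1 per column

# PCA: rotate the data onto uncorrelated principal components
pca    <- prcomp(x)
scores <- pca$x[, 1:2]             # keep the first two components
pc_cor <- cor(pca$x)               # off-diagonal entries are numerically zero

# K-means: partition the observations into k = 3 non-overlapping clusters
set.seed(1)
km <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster)                  # cluster sizes
```

Inspecting pc_cor shows why the components are convenient inputs for clustering: each one carries information that the others do not.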

The data used for this project were sourced from the World Bank's World Development Indicators (WDI) database. Year used: 2019.

Loading Libraries and Data

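library(factoextra)  # fviz_nbclust(), fviz_eig(), fviz_cluster()
library(cluster)     # clustering utilities
library(WDI)         # download World Development Indicators
library(tidyverse)   # data manipulation and plotting
library(dplyr)
library(ggplot2)
library(reshape2)    # melt() for long-format data
library(corrplot)    # correlation matrix plot
library(ggrepel)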

indicators <- c( 
  GDP = "NY.GDP.MKTP.CD", 
  lifeExp = "SP.DYN.LE00.IN", 
  fertilityRt = "SP.DYN.TFRT.IN", 
  education = "SE.SEC.ENRR", 
  literacyRt = "SE.ADT.LITR.ZS", 
  populationGth = "SP.POP.GROW", 
  unemploymentRt = "SL.UEM.TOTL.ZS", 
  mortalityRt = "SH.DYN.MORT", 
  CO2Em = "EN.GHG.CO2.IC.MT.CE.AR5" ) 
global_development <- WDI( 
  country = "all", 
  indicator = indicators, 
  start = 2019, 
  end = 2019, extra = TRUE )

glimpse(global_development)

## Rows: 266
## Columns: 21
## $ country        <chr> "Afghanistan", "Africa Eastern and Southern", "Africa W…
## $ iso2c          <chr> "AF", "ZH", "ZI", "AL", "DZ", "AS", "AD", "AO", "AG", "…
## $ iso3c          <chr> "AFG", "AFE", "AFW", "ALB", "DZA", "ASM", "AND", "AGO",…
## $ year           <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ status         <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ lastupdated    <chr> "2026-01-28", "2026-01-28", "2026-01-28", "2026-01-28",…
## $ GDP            <dbl> 1.879944e+10, 1.018715e+12, 1.026996e+12, 1.558511e+10,…
## $ lifeExp        <dbl> 62.94100, 63.85726, 57.14985, 79.46700, 75.68200, 72.75…
## $ fertilityRt    <dbl> 5.238000, 4.471338, 4.829134, 1.395000, 2.997000, 2.404…
## $ education      <dbl> NA, 44.12873, 43.90627, 99.01515, NA, NA, 96.57433, NA,…
## $ literacyRt     <dbl> NA, 72.80108, 60.13639, NA, NA, NA, NA, NA, NA, 77.1497…
## $ populationGth  <dbl> 2.9843891, 2.7216805, 2.4400480, -1.5431371, 1.8404129,…
## $ unemploymentRt <dbl> 11.187000, 7.459106, 4.390764, 11.466000, 12.302000, NA…
## $ mortalityRt    <dbl> 63.30000, 60.65808, 101.44005, 9.40000, 23.30000, NA, 3…
## $ CO2Em          <dbl> 2.8111, 76.4884, 17.8348, 0.9101, 13.9341, NA, NA, 1.50…
## $ region         <chr> "South Asia", "Aggregates", "Aggregates", "Europe & Cen…
## $ capital        <chr> "Kabul", "", "", "Tirane", "Algiers", "Pago Pago", "And…
## $ longitude      <chr> "69.1761", "", "", "19.8172", "3.05097", "-170.691", "1…
## $ latitude       <chr> "34.5228", "", "", "41.3317", "36.7397", "-14.2846", "4…
## $ income         <chr> "Low income", "Aggregates", "Aggregates", "Upper middle…
## $ lending        <chr> "IDA", "Aggregates", "Aggregates", "IBRD", "IBRD", "Not…

Data Cleaning

missing_in_cols <- sapply(global_development, function(x) sum(is.na(x))/nrow(global_development))
head(missing_in_cols)

##     country       iso2c       iso3c        year      status lastupdated 
##           0           0           0           0           0           0
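
The data_clean object used in the next step is not defined in the code shown here. The sketch below shows one way it could be built, assuming that World Bank aggregates (rows whose region is "Aggregates") are dropped and that only countries with complete values for every indicator are kept. The small data frame is a made-up stand-in for global_development, and the filtering actually used in this project may differ.

```r
# Hypothetical stand-in for `global_development` (values are illustrative only)
toy <- data.frame(
  country = c("A", "Region X", "B", "C"),
  region  = c("South Asia", "Aggregates", "Europe & Central Asia", "Sub-Saharan Africa"),
  GDP     = c(1.9e10, 1.0e12, 1.6e10, NA),
  lifeExp = c(62.9, 63.9, 79.5, 61.0),
  stringsAsFactors = FALSE
)
indicator_cols <- c("GDP", "lifeExp")   # in the project: all nine indicator names

# Share of missing values per indicator column
missing_share <- colMeans(is.na(toy[indicator_cols]))

# Assumed cleaning: drop aggregate rows, then rows with any missing indicator
no_aggregates <- toy[toy$region != "Aggregates", ]
data_clean    <- no_aggregates[complete.cases(no_aggregates[indicator_cols]), ]
data_clean$country   # "A" "B"
```

Restricting the missing-value summary to the indicator columns is more informative than looking at the first six columns, which are identifiers and contain no missing values.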

numeric_data <- data_clean %>%
  select(
    GDP,
    lifeExp,
    fertilityRt,
    education,
    literacyRt,
    populationGth,
    unemploymentRt,
    mortalityRt,
    CO2Em
  )

Rows with missing values and non-numeric columns such as country are removed so that PCA and clustering operate on complete, numeric data. This is why the preprocessing above keeps only the nine indicator columns.

Data Visualization

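# Reshape to long format for faceting.
# Assumed step: numeric_long is not defined in the code shown, so it is built here with reshape2::melt().
numeric_long <- melt(numeric_data, variable.name = "indicator", value.name = "value")

ggplot(numeric_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  facet_wrap(~ indicator, scales = "free") +
  theme_minimal() +
  labs(
    title = "Distribution of Development Indicators",
    x = "Value",
    y = "Frequency"
  )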

ggplot(numeric_long, aes(x = indicator, y = value)) +
  geom_boxplot(fill = "blue") +
  theme_minimal() +
  coord_flip() +
  labs(
    title = "Boxplots of Development Indicators",
    x = "",
    y = "Value"
  )

Standardize

scaled_data <- scale(numeric_data)
correlation <- cor(scaled_data, method = "pearson")
corrplot(correlation)

The correlation matrix reveals:

Strong Negative Correlation: mortalityRt and lifeExp are negatively correlated.
Strong Positive Correlation: education and literacyRt. This suggests that for many nations, school enrollment and adult literacy closely track each other.

GDP and CO2Em: their positive correlation highlights the environmental cost of traditional, production-driven economic growth.

mortalityRt and fertilityRt: countries with high child mortality also tend to have high fertility. Together with the negative correlation between mortalityRt and lifeExp, this is consistent with high child mortality being a major driver of lower life expectancy.

lifeExp and literacyRt are nearly perfectly correlated. This is consistent with education supporting health resilience, although correlation alone does not establish causation.

unemploymentRt shows the weakest correlations with the other indicators, suggesting it captures a distinct economic dimension.
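
Standardization matters here because the indicators are on very different scales (GDP is in the order of 1e10 or more, while fertilityRt is a single-digit number), so an unscaled PCA would be dominated by GDP. The sketch below, on the built-in mtcars data rather than the WDI data, illustrates what scale() does and confirms that Pearson correlations are the same before and after scaling.

```r
# Toy check on the built-in mtcars data (not the project's WDI data)
x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

# scale() subtracts each column mean and divides by the column standard deviation
z_manual <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")
z_scale  <- scale(x)
same_scaling <- all.equal(z_manual, z_scale, check.attributes = FALSE)

# Pearson correlation does not change under standardization
same_cor <- all.equal(cor(x), cor(z_scale))
c(same_scaling = same_scaling, same_cor = same_cor)
```

The correlation matrix would therefore look the same if computed on numeric_data directly; scaling is needed for the PCA step, not for the correlation plot.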

pca_res <- prcomp(scaled_data)
summary(pca_res)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.202 1.3185 1.0002 0.80437 0.51289 0.48216 0.36700
## Proportion of Variance 0.539 0.1931 0.1111 0.07189 0.02923 0.02583 0.01497
## Cumulative Proportion  0.539 0.7321 0.8433 0.91519 0.94442 0.97025 0.98521
##                            PC8     PC9
## Standard deviation     0.29097 0.22001
## Proportion of Variance 0.00941 0.00538
## Cumulative Proportion  0.99462 1.00000
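
The Proportion of Variance row follows directly from the standard deviations: each component's variance is its squared standard deviation, and the proportions are these variances divided by their total. A short check using the rounded values printed above:

```r
# Standard deviations copied from the summary(pca_res) output above
sdev <- c(2.202, 1.3185, 1.0002, 0.80437, 0.51289,
          0.48216, 0.36700, 0.29097, 0.22001)

eigenvalues <- sdev^2                          # variance captured by each PC
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

round(prop_var[1:2], 3)   # 0.539 0.193
round(cum_var[2], 3)      # 0.732
```

The variances sum to approximately 9 because there are nine standardized variables. The first two components, which are the ones passed on to clustering, retain about 73% of the total variance.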

plot(pca_res)

pca_scores <- pca_res$x[, 1:2]
pca_scores

##               PC1         PC2
##  [1,]  1.00620496  0.78913422
##  [2,] -0.42992953  0.10540342
##  [3,]  1.62470994  0.90160257
##  [4,]  1.83480948 -1.86537289
##  [5,] -4.73761513 -0.17863919
##  [6,] -0.97129510  0.72003044
##  [7,] -6.74455217 -0.32250657
##  [8,]  1.18312482  0.27024849
##  [9,] -3.78330497  0.06664097
## [10,]  1.61259608  0.92257444
## [11,]  0.08557904  0.56745188
## [12,]  0.32523066  0.77817477
## [13,]  1.63795907  0.76028662
## [14,] -0.42208472  0.63908942
## [15,]  0.56977813 -6.53241048
## [16,]  2.98884872 -1.34010721
## [17,]  1.54122810  0.78616071
## [18,]  0.65412174  0.24274002
## [19,]  0.55119714  0.84065800
## [20,]  1.26743582 -0.85652371
## [21,]  0.06387895  0.64043787
## [22,] -0.48111336  0.31192658
## [23,] -2.82997355 -0.45666135
## [24,]  0.05634616  0.31305873
## [25,] -3.11686009  0.21073157
## [26,] -4.27187580 -0.09494040
## [27,]  2.04488682  0.58332431
## [28,]  0.57492058 -0.73937764
## [29,]  1.22733304  0.04018420
## [30,]  0.30818600  0.91622467
## [31,]  1.71557542 -0.62227603
## [32,]  1.90119696  0.10112336
## [33,]  2.02469004  0.82961257
## [34,]  0.24630326  0.64253657
## [35,]  0.74246350  0.02945906

wss_plot <- fviz_nbclust(pca_scores, kmeans, method = "wss")
wss_plot
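
The "wss" method plots the total within-cluster sum of squares against the number of clusters. As a rough sketch of the quantity being plotted, the same curve can be computed by hand; the example uses the built-in USArrests data, not the project's PCA scores, and the range of k is an arbitrary choice.

```r
# Toy illustration on the built-in USArrests data (not the project's PCA scores)
x <- scale(USArrests)

set.seed(123)
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which adding clusters yields little further reduction
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k = 1 the value equals the total sum of squares of the data, and it can only shrink as k grows, so the choice of k rests on where the curve flattens rather than on its minimum.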

set.seed(123)
kmeans_model <- kmeans(pca_scores, centers = 3, nstart = 25)

final_data <- data_clean %>%
  mutate(cluster = factor(kmeans_model$cluster))
final_data$PC1 <- pca_res$x[, 1]
final_data$PC2 <- pca_res$x[, 2]
str(kmeans_model)

## List of 9
##  $ cluster     : int [1:35] 1 1 1 1 3 1 3 1 3 1 ...
##  $ centers     : num [1:3, 1:2] 0.89 0.57 -4.247 0.261 -6.532 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  $ totss       : num 224
##  $ withinss    : num [1:3] 38.3 0 10.3
##  $ tot.withinss: num 48.6
##  $ betweenss   : num 175
##  $ size        : int [1:3] 28 1 6
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

eig_plot <- fviz_eig(pca_res, addlabels = TRUE)
eig_plot

I chose K-means rather than CLARA or PAM because the dataset is relatively small and the variables are continuous. After standardization and PCA transformation, Euclidean distance is appropriate, making K-means a natural, efficient, and interpretable clustering solution. While alternative algorithms such as PAM and CLARA are designed for robustness or scalability, their advantages are not critical in this context.
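
One way to check this choice is to fit both algorithms and compare their average silhouette widths. The sketch below is only an illustration on the built-in USArrests data, not the project's PCA scores. It relies on pam() and silhouette() from the cluster package, which is already loaded in this report.

```r
library(cluster)   # pam() and silhouette()

# Toy illustration on the built-in USArrests data (not the project's PCA scores)
x <- scale(USArrests)
d <- dist(x)       # Euclidean distances

set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25)
pm <- pam(x, k = 3)

# Average silhouette width: closer to 1 means better-separated clusters
sil_kmeans <- mean(silhouette(km$cluster, d)[, "sil_width"])
sil_pam    <- pm$silinfo$avg.width
c(kmeans = sil_kmeans, pam = sil_pam)
```

If the two values are close, the simpler K-means solution is easy to justify. A clearly higher value for PAM would suggest that outlying observations are affecting the K-means centres.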

The model uses centers = 3. Based on standard development patterns, such clusters usually break down as described below. Note that in the fitted model above the cluster sizes are 28, 1, and 6, and cluster 2 contains a single country (its withinss is 0), so these labels should be checked against the cluster centres before they are attached to specific cluster numbers.

Cluster 1 (Developed-Economy): High GDP, high education, low fertility, and low mortality. Associated with high exports and imports and good infrastructure and amenities.
Cluster 2 (Developing-Economy): Moderate growth and rising literacy, but still balancing high population growth or environmental impacts (CO2Em).
Cluster 3 (Underdeveloped-Economy): High mortality, high fertility rates, and lower literacy levels. Associated with low exports and imports and poor infrastructure and amenities.
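
Labels like these are easiest to justify by looking at the mean of each original indicator within each cluster. The sketch below shows the idea on the built-in USArrests data. With the objects in this report, the analogous call would presumably be aggregate(numeric_data, by = list(cluster = kmeans_model$cluster), FUN = mean), which is an assumption about how the objects line up rather than code taken from the project.

```r
# Toy illustration on the built-in USArrests data (not the project's WDI data)
x <- scale(USArrests)
set.seed(123)
km <- kmeans(prcomp(x)$x[, 1:2], centers = 3, nstart = 25)

# Mean of each ORIGINAL (unscaled) variable within each cluster
profile <- aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)
profile$n <- as.vector(table(km$cluster))   # cluster sizes, in the same cluster order
profile
```

Reading the table row by row shows which cluster has, for example, the highest or lowest average on each variable, and the n column flags very small clusters that may be outliers rather than genuine groups.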

fviz_cluster(
  kmeans_model, 
  data = pca_scores,
  geom = "point",
  ellipse.type = "convex",
  palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
  ggtheme = theme_minimal(),
  main = "Global Development: Cluster Distribution",
  xlab = "PC1",
  ylab = "PC2"
) +
theme(
  plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
  legend.position = "bottom"
)

ggplot(final_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(
    values = c("#2E9FDF", "#E7B800", "#FC4E07"),
    labels = c("Developed Economies", "Developing Economies", "Underdeveloped Economies"))+
  labs(
    title = "Nations Grouped by Socio-Economic Performance",
    subtitle = "Analysis based on 2019 World Bank Development Indicators",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Development Group"
  ) +
  theme_classic()

By clustering countries based on indicators rather than continents, we find surprising peers.

Key Findings

Countries that fall into the “High-Development” cluster almost universally have a literacyRt above 90%, suggesting its importance for economic transition.

The Environmental Paradox: High-income clusters contribute disproportionately to CO2Em, showing that “developed” status currently lacks environmental sustainability.

The Demographic Divide: High fertilityRt is the strongest predictor for membership in the “underdeveloped” cluster, as it often correlates with lower education access.

Final Conclusion

Using PCA and K-means, we reduced a 9-dimensional dataset to two principal components, which together capture about 73% of the variance, and grouped countries into three global profiles. The results suggest prioritizing literacy and mortality reduction over pure GDP growth.

Exploring Global Development Patterns Using Principal Component Analysis and Clustering

Rita Obiora