Introduction

This project investigates latent aspects of development and identifies patterns across countries using economic and social indicators. Global development is multidimensional. Looking at GDP alone is deceptive: a country might have high wealth but poor literacy or high child mortality. The project therefore moves beyond single-metric rankings by using unsupervised machine learning to find hidden patterns in how nations develop. It applies two key unsupervised methods: principal component analysis (PCA) to reduce dimensionality, and K-means clustering to group countries with similar profiles.

The data were sourced from the World Bank's World Development Indicators (WDI) database, using values for the year 2019.

Loading Libraries and Data

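library(factoextra)  # fviz_* helpers for PCA and clustering plots
library(cluster)     # PAM and CLARA
library(WDI)         # World Bank data download
library(tidyverse)   # attaches dplyr, tidyr and ggplot2
library(reshape2)
library(corrplot)    # correlation plot
library(ggrepel)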
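# Named vector: analysis name = World Bank indicator code
indicators <- c(
  GDP            = "NY.GDP.MKTP.CD",          # GDP (current US$)
  lifeExp        = "SP.DYN.LE00.IN",          # life expectancy at birth (years)
  fertilityRt    = "SP.DYN.TFRT.IN",          # total fertility rate (births per woman)
  education      = "SE.SEC.ENRR",             # secondary school enrollment (% gross)
  literacyRt     = "SE.ADT.LITR.ZS",          # adult literacy rate (%)
  populationGth  = "SP.POP.GROW",             # population growth (annual %)
  unemploymentRt = "SL.UEM.TOTL.ZS",          # unemployment (% of labor force)
  mortalityRt    = "SH.DYN.MORT",             # under-5 mortality (per 1,000 live births)
  CO2Em          = "EN.GHG.CO2.IC.MT.CE.AR5"  # CO2 emissions series (Mt CO2e)
)

# Download 2019 values for all countries and aggregates, plus metadata columns
global_development <- WDI(
  country   = "all",
  indicator = indicators,
  start     = 2019,
  end       = 2019,
  extra     = TRUE
)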
glimpse(global_development)
## Rows: 266
## Columns: 21
## $ country        <chr> "Afghanistan", "Africa Eastern and Southern", "Africa W…
## $ iso2c          <chr> "AF", "ZH", "ZI", "AL", "DZ", "AS", "AD", "AO", "AG", "…
## $ iso3c          <chr> "AFG", "AFE", "AFW", "ALB", "DZA", "ASM", "AND", "AGO",…
## $ year           <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ status         <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ lastupdated    <chr> "2026-01-28", "2026-01-28", "2026-01-28", "2026-01-28",…
## $ GDP            <dbl> 1.879944e+10, 1.018715e+12, 1.026996e+12, 1.558511e+10,…
## $ lifeExp        <dbl> 62.94100, 63.85726, 57.14985, 79.46700, 75.68200, 72.75…
## $ fertilityRt    <dbl> 5.238000, 4.471338, 4.829134, 1.395000, 2.997000, 2.404…
## $ education      <dbl> NA, 44.12873, 43.90627, 99.01515, NA, NA, 96.57433, NA,…
## $ literacyRt     <dbl> NA, 72.80108, 60.13639, NA, NA, NA, NA, NA, NA, 77.1497…
## $ populationGth  <dbl> 2.9843891, 2.7216805, 2.4400480, -1.5431371, 1.8404129,…
## $ unemploymentRt <dbl> 11.187000, 7.459106, 4.390764, 11.466000, 12.302000, NA…
## $ mortalityRt    <dbl> 63.30000, 60.65808, 101.44005, 9.40000, 23.30000, NA, 3…
## $ CO2Em          <dbl> 2.8111, 76.4884, 17.8348, 0.9101, 13.9341, NA, NA, 1.50…
## $ region         <chr> "South Asia", "Aggregates", "Aggregates", "Europe & Cen…
## $ capital        <chr> "Kabul", "", "", "Tirane", "Algiers", "Pago Pago", "And…
## $ longitude      <chr> "69.1761", "", "", "19.8172", "3.05097", "-170.691", "1…
## $ latitude       <chr> "34.5228", "", "", "41.3317", "36.7397", "-14.2846", "4…
## $ income         <chr> "Low income", "Aggregates", "Aggregates", "Upper middle…
## $ lending        <chr> "IDA", "Aggregates", "Aggregates", "IBRD", "IBRD", "Not…
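
The `region` column in the output above shows that `country = "all"` returns regional aggregates (for example "Africa Eastern and Southern") alongside individual countries. The report does not say whether these rows were removed. The following is a minimal sketch of separating them, assuming aggregates are marked by `region == "Aggregates"` as in the glimpse; the `toy` data frame and its values are illustrative, not WDI data.

```r
# Toy stand-in for the WDI result; values are illustrative only.
toy <- data.frame(
  country = c("Afghanistan", "Africa Eastern and Southern", "Albania"),
  region  = c("South Asia", "Aggregates", "Europe & Central Asia"),
  lifeExp = c(62.9, 63.9, 79.5),
  stringsAsFactors = FALSE
)

# Keep only rows that describe a single country.
countries_only <- toy[toy$region != "Aggregates", ]
print(countries_only$country)
```

Dropping aggregates before clustering avoids counting the same population twice, once as a country and once inside a regional total.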

Data Cleaning

missing_in_cols <- sapply(global_development, function(x) sum(is.na(x))/nrow(global_development))
head(missing_in_cols)
##     country       iso2c       iso3c        year      status lastupdated 
##           0           0           0           0           0           0
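# Assumption: `data_clean` is not defined in the original code. Here it is
# taken to be the rows with no missing value in any of the nine indicators.
data_clean <- global_development %>%
  drop_na(all_of(names(indicators)))

# Keep only the numeric indicator columns for PCA and clustering
numeric_data <- data_clean %>%
  select(
    GDP,
    lifeExp,
    fertilityRt,
    education,
    literacyRt,
    populationGth,
    unemploymentRt,
    mortalityRt,
    CO2Em
  )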

Rows with missing values are dropped, and categorical columns such as country are set aside, because PCA and K-means both require a complete, purely numeric matrix. This is why the preprocessing step separates the character data from the indicators.
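
The cleaning steps described above can be sketched on a small example. This is a base-R illustration under stated assumptions: the `toy` data frame is made up, and storing the country label in the row names is one possible way to keep observations traceable, not necessarily what the original code did.

```r
# Toy data with gaps; values are illustrative only.
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  lifeExp    = c(70.1, NA, 81.3, 64.0),
  literacyRt = c(95.0, 88.2, NA, 61.5),
  stringsAsFactors = FALSE
)

# Share of missing values per column.
missing_share <- colMeans(is.na(toy))

# Keep complete rows only, then move the country label into the row names
# so the matrix passed to PCA is purely numeric but still traceable.
complete <- toy[complete.cases(toy), ]
numeric_toy <- as.matrix(complete[, c("lifeExp", "literacyRt")])
rownames(numeric_toy) <- complete$country

print(missing_share)
print(numeric_toy)
```

Sparse indicators can remove most rows under complete-case filtering, so it is worth checking the per-column shares before deciding which indicators to keep.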

Data Visualization

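# Assumption: `numeric_long` is not defined in the original code. Here it is
# the long form of numeric_data, with one row per (indicator, value) pair.
numeric_long <- numeric_data %>%
  pivot_longer(everything(), names_to = "indicator", values_to = "value")

ggplot(numeric_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  facet_wrap(~ indicator, scales = "free") +
  theme_minimal() +
  labs(
    title = "Distribution of Development Indicators",
    x = "Value",
    y = "Frequency"
  )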

ggplot(numeric_long, aes(x = indicator, y = value)) +
  geom_boxplot(fill = "blue") +
  theme_minimal() +
  coord_flip() +
  labs(
    title = "Boxplots of Development Indicators",
    x = "",
    y = "Value"
  )

Standardization

scaled_data <- scale(numeric_data)
correlation<-cor(scaled_data, method="pearson") 
corrplot(correlation)
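
As a small illustration of what `scale()` does: it subtracts each column's mean and divides by its sample standard deviation. Pearson correlation is unchanged by this rescaling, so standardization matters for the PCA and K-means steps rather than for the correlation plot. The matrix `x` below is made up for the example.

```r
# Two toy indicators on very different scales; values are illustrative only.
x <- cbind(
  GDP     = c(2e10, 1e12, 5e11, 8e9, 3e11),
  lifeExp = c(63, 80, 76, 58, 72)
)

# z-score by hand: subtract the column mean, divide by the sample sd.
z_manual <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")
z_scale  <- scale(x)

# Both routes give the same standardized values (difference is ~0).
print(max(abs(z_manual - z_scale)))

# Pearson correlation does not change under standardization.
print(all.equal(cor(x), cor(z_scale)))
```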

The correlation matrix reveals:

GDP and CO2Em are positively correlated, which highlights the environmental cost of traditional, production-led economic growth.

mortalityRt and fertilityRt rise together, and both move against lifeExp. This acts as a sanity check on the data: high child mortality is the strongest driver of reduced life expectancy.

lifeExp and literacyRt are nearly perfectly correlated. This is consistent with education being a key precursor to health resilience, although a correlation alone does not establish the direction of the effect.

unemploymentRt shows the weakest correlations with the other indicators, suggesting it captures a distinct economic dimension.

pca_res <- prcomp(scaled_data)
summary(pca_res)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.202 1.3185 1.0002 0.80437 0.51289 0.48216 0.36700
## Proportion of Variance 0.539 0.1931 0.1111 0.07189 0.02923 0.02583 0.01497
## Cumulative Proportion  0.539 0.7321 0.8433 0.91519 0.94442 0.97025 0.98521
##                            PC8     PC9
## Standard deviation     0.29097 0.22001
## Proportion of Variance 0.00941 0.00538
## Cumulative Proportion  0.99462 1.00000
plot(pca_res)
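
The "Proportion of Variance" row in the summary can be reproduced from the standard deviations alone: each component's variance is its squared standard deviation, divided by the total. The sketch below copies the rounded values from the output above, so small rounding differences are expected.

```r
# Standard deviations copied from the summary(pca_res) output above.
sdev <- c(2.202, 1.3185, 1.0002, 0.80437, 0.51289,
          0.48216, 0.36700, 0.29097, 0.22001)

# Each component's variance is its squared standard deviation.
variance <- sdev^2
prop_var <- variance / sum(variance)
cum_var  <- cumsum(prop_var)

# With nine standardized variables the total variance is close to 9.
print(round(sum(variance), 2))
print(round(prop_var[1:2], 3))
print(round(cum_var[2], 3))
```

The first two components together carry roughly 73% of the variance, which is the share of information retained when only `PC1` and `PC2` are passed on to the clustering step.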

pca_scores <- pca_res$x[, 1:2]
pca_scores
##               PC1         PC2
##  [1,]  1.00620496  0.78913422
##  [2,] -0.42992953  0.10540342
##  [3,]  1.62470994  0.90160257
##  [4,]  1.83480948 -1.86537289
##  [5,] -4.73761513 -0.17863919
##  [6,] -0.97129510  0.72003044
##  [7,] -6.74455217 -0.32250657
##  [8,]  1.18312482  0.27024849
##  [9,] -3.78330497  0.06664097
## [10,]  1.61259608  0.92257444
## [11,]  0.08557904  0.56745188
## [12,]  0.32523066  0.77817477
## [13,]  1.63795907  0.76028662
## [14,] -0.42208472  0.63908942
## [15,]  0.56977813 -6.53241048
## [16,]  2.98884872 -1.34010721
## [17,]  1.54122810  0.78616071
## [18,]  0.65412174  0.24274002
## [19,]  0.55119714  0.84065800
## [20,]  1.26743582 -0.85652371
## [21,]  0.06387895  0.64043787
## [22,] -0.48111336  0.31192658
## [23,] -2.82997355 -0.45666135
## [24,]  0.05634616  0.31305873
## [25,] -3.11686009  0.21073157
## [26,] -4.27187580 -0.09494040
## [27,]  2.04488682  0.58332431
## [28,]  0.57492058 -0.73937764
## [29,]  1.22733304  0.04018420
## [30,]  0.30818600  0.91622467
## [31,]  1.71557542 -0.62227603
## [32,]  1.90119696  0.10112336
## [33,]  2.02469004  0.82961257
## [34,]  0.24630326  0.64253657
## [35,]  0.74246350  0.02945906
wss_plot <- fviz_nbclust(pca_scores, kmeans, method = "wss")
wss_plot
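
The quantity behind the `method = "wss"` plot is the total within-cluster sum of squares for each candidate number of clusters. A minimal base-R sketch of the same idea follows, using simulated scores (`toy_scores` is made up, with three groups built in) rather than the project data.

```r
set.seed(123)  # kmeans() uses random starts

# Toy 2-D scores with three well-separated groups; illustrative only.
toy_scores <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 4,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = -4, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy_scores, centers = k, nstart = 25)$tot.withinss
})

# The drop from one k to the next; the elbow is where it flattens out.
print(round(wss, 2))
print(round(-diff(wss), 2))
```

In this toy case the drop is large up to k = 3 and small afterwards, which is the pattern to look for in the elbow plot.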

set.seed(123)
kmeans_model <- kmeans(pca_scores, centers = 3, nstart = 25)

final_data <- data_clean %>%
  mutate(cluster = factor(kmeans_model$cluster))
final_data$PC1 <- pca_res$x[, 1]
final_data$PC2 <- pca_res$x[, 2]
str(kmeans_model)
## List of 9
##  $ cluster     : int [1:35] 1 1 1 1 3 1 3 1 3 1 ...
##  $ centers     : num [1:3, 1:2] 0.89 0.57 -4.247 0.261 -6.532 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  $ totss       : num 224
##  $ withinss    : num [1:3] 38.3 0 10.3
##  $ tot.withinss: num 48.6
##  $ betweenss   : num 175
##  $ size        : int [1:3] 28 1 6
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
eig_plot <- fviz_eig(pca_res, addlabels = TRUE)
eig_plot
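
The `str(kmeans_model)` output above contains enough information for a quick quality check of the partition. The sketch below copies the rounded values from that output and computes the share of variation in the PC1-PC2 scores that the three clusters explain.

```r
# Values copied from the str(kmeans_model) output above.
size     <- c(28, 1, 6)
withinss <- c(38.3, 0, 10.3)
totss    <- 224

# Share of total variation in the scores explained by the partition.
explained <- 1 - sum(withinss) / totss
print(round(explained, 2))

# A cluster with a single member has zero within-cluster variation by
# construction and usually points to an outlier rather than a group.
singletons <- which(size == 1)
print(singletons)
```

Cluster 2 has one member, and its centre (0.57, -6.532) matches row 15 of `pca_scores`, the observation with the extreme PC2 value. This is worth keeping in mind when the three clusters are given descriptive names.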

I chose K-means over PAM or CLARA because the dataset is relatively small and the variables are continuous. After standardization and PCA transformation, Euclidean distance is an appropriate measure of similarity, which makes K-means a natural, efficient, and interpretable clustering solution. PAM is designed for robustness to outliers and CLARA for scalability to large datasets, but neither advantage is critical in this context.
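
For comparison, the two approaches can be run side by side. This is a sketch on simulated data (`toy` is made up), using `pam()` from the `cluster` package that is already loaded above; it is not a re-run of the project's clustering.

```r
library(cluster)  # provides pam()

set.seed(123)

# Toy 2-D data with two well-separated groups; illustrative only.
toy <- rbind(
  matrix(rnorm(30, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(30, mean = 5, sd = 0.4), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 25)
pm <- pam(toy, k = 2)

# Cross-tabulate the two partitions: each K-means cluster should map to
# exactly one PAM cluster when the groups are this clearly separated.
agreement <- table(kmeans = km$cluster, pam = pm$clustering)
print(agreement)

# K-means centres are averages; PAM medoids are actual observations.
print(km$centers)
print(pm$medoids)
```

When the groups are clean the two methods agree. They tend to differ when outliers are present, because a mean is pulled toward an extreme point while a medoid is not.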

The model uses centers = 3. Based on standard development patterns, the three clusters are interpreted as developed, developing, and underdeveloped economies.

fviz_cluster(
  kmeans_model, 
  data = pca_scores,
  geom = "point",
  ellipse.type = "convex",
  palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
  ggtheme = theme_minimal(),
  main = "Global Development: Cluster Distribution",
  xlab = "PC1",
  ylab = "PC2"
) +
theme(
  plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
  legend.position = "bottom"
)

ggplot(final_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(
    values = c("#2E9FDF", "#E7B800", "#FC4E07"),
    labels = c("Developed Economies", "Developing Economies", "Underdeveloped Economies"))+
  labs(
    title = "Nations Grouped by Socio-Economic Performance",
    subtitle = "Analysis based on 2019 World Bank Development Indicators",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Development Group"
  ) +
  theme_classic()
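
One caveat with positional labels: `kmeans()` numbers its clusters arbitrarily, so cluster 1 is not guaranteed to be the most developed group. A hedged sketch of one way to attach labels by content rather than by number follows. Ordering clusters by mean life expectancy is an assumption made for this example, and the `toy` data frame is made up.

```r
# Toy cluster assignments and one indicator; values are illustrative only.
toy <- data.frame(
  cluster = c(2, 2, 3, 3, 1, 1),
  lifeExp = c(81, 79, 55, 58, 70, 72)
)

# Mean life expectancy per cluster, sorted from highest to lowest.
cluster_means <- tapply(toy$lifeExp, toy$cluster, mean)
ordered_ids   <- names(sort(cluster_means, decreasing = TRUE))

# Attach descriptive labels in that order rather than by cluster number.
group_labels <- c("Developed Economies", "Developing Economies",
                  "Underdeveloped Economies")
toy$group <- factor(toy$cluster, levels = ordered_ids, labels = group_labels)
print(toy)
```

Here cluster 2 has the highest mean and receives the "Developed Economies" label even though it is not numbered first.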

By clustering countries based on indicators rather than continents, we find surprising peers.

Key Findings

Countries that fall into the “High-Development” cluster almost universally have a literacyRt above 90%, suggesting its importance for economic transition.

The Environmental Paradox: High-income clusters contribute disproportionately to CO2Em, showing that “developed” status currently lacks environmental sustainability.

The Demographic Divide: High fertilityRt is the strongest predictor of membership in the “underdeveloped” cluster, as it often correlates with lower education access.
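
Findings of this kind can be checked by tabulating indicator means per cluster. The following is a minimal base-R sketch with made-up numbers (`toy` is illustrative and does not reproduce the project's results).

```r
# Toy indicator values with cluster labels; illustrative only.
toy <- data.frame(
  cluster     = factor(c(1, 1, 2, 2, 3, 3)),
  literacyRt  = c(98, 96, 85, 80, 55, 60),
  fertilityRt = c(1.6, 1.8, 2.5, 2.9, 5.1, 4.7)
)

# Mean of every indicator within each cluster.
profiles <- aggregate(cbind(literacyRt, fertilityRt) ~ cluster,
                      data = toy, FUN = mean)
print(profiles)
```

Replacing `mean` with `min` or `max` gives the range within each cluster, which is the relevant check for a threshold claim such as "above 90%".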

Final Conclusion

Using PCA and K-means, we reduced a complex nine-dimensional dataset to three distinct global profiles. The results suggest prioritizing literacy and mortality reduction over pure GDP growth.