This project aims to investigate latent aspects of development and identify global patterns using economic and social indicators. Global development is multidimensional. Looking at GDP alone is deceptive: a country might have high wealth but poor literacy or high child mortality. The project therefore moves beyond single-metric rankings by using unsupervised machine learning to find hidden patterns in how nations develop. It applies two key unsupervised methods: Principal Component Analysis (PCA) for dimensionality reduction and K-means clustering for grouping countries.
The data used for this project were sourced from the World Bank's World Development Indicators (WDI) database. Year used: 2019.
library(WDI)         # World Bank data download
library(tidyverse)   # also attaches dplyr and ggplot2
library(reshape2)
library(corrplot)    # correlation matrix plot
library(factoextra)  # fviz_* helpers for PCA and clustering
library(cluster)
library(ggrepel)
indicators <- c(
  GDP            = "NY.GDP.MKTP.CD",
  lifeExp        = "SP.DYN.LE00.IN",
  fertilityRt    = "SP.DYN.TFRT.IN",
  education      = "SE.SEC.ENRR",
  literacyRt     = "SE.ADT.LITR.ZS",
  populationGth  = "SP.POP.GROW",
  unemploymentRt = "SL.UEM.TOTL.ZS",
  mortalityRt    = "SH.DYN.MORT",
  CO2Em          = "EN.GHG.CO2.IC.MT.CE.AR5"
)
global_development <- WDI(
  country   = "all",
  indicator = indicators,
  start     = 2019,
  end       = 2019,
  extra     = TRUE
)
glimpse(global_development)
## Rows: 266
## Columns: 21
## $ country <chr> "Afghanistan", "Africa Eastern and Southern", "Africa W…
## $ iso2c <chr> "AF", "ZH", "ZI", "AL", "DZ", "AS", "AD", "AO", "AG", "…
## $ iso3c <chr> "AFG", "AFE", "AFW", "ALB", "DZA", "ASM", "AND", "AGO",…
## $ year <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ status <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ lastupdated <chr> "2026-01-28", "2026-01-28", "2026-01-28", "2026-01-28",…
## $ GDP <dbl> 1.879944e+10, 1.018715e+12, 1.026996e+12, 1.558511e+10,…
## $ lifeExp <dbl> 62.94100, 63.85726, 57.14985, 79.46700, 75.68200, 72.75…
## $ fertilityRt <dbl> 5.238000, 4.471338, 4.829134, 1.395000, 2.997000, 2.404…
## $ education <dbl> NA, 44.12873, 43.90627, 99.01515, NA, NA, 96.57433, NA,…
## $ literacyRt <dbl> NA, 72.80108, 60.13639, NA, NA, NA, NA, NA, NA, 77.1497…
## $ populationGth <dbl> 2.9843891, 2.7216805, 2.4400480, -1.5431371, 1.8404129,…
## $ unemploymentRt <dbl> 11.187000, 7.459106, 4.390764, 11.466000, 12.302000, NA…
## $ mortalityRt <dbl> 63.30000, 60.65808, 101.44005, 9.40000, 23.30000, NA, 3…
## $ CO2Em <dbl> 2.8111, 76.4884, 17.8348, 0.9101, 13.9341, NA, NA, 1.50…
## $ region <chr> "South Asia", "Aggregates", "Aggregates", "Europe & Cen…
## $ capital <chr> "Kabul", "", "", "Tirane", "Algiers", "Pago Pago", "And…
## $ longitude <chr> "69.1761", "", "", "19.8172", "3.05097", "-170.691", "1…
## $ latitude <chr> "34.5228", "", "", "41.3317", "36.7397", "-14.2846", "4…
## $ income <chr> "Low income", "Aggregates", "Aggregates", "Upper middle…
## $ lending <chr> "IDA", "Aggregates", "Aggregates", "IBRD", "IBRD", "Not…
missing_in_cols <- sapply(global_development, function(x) sum(is.na(x))/nrow(global_development))
head(missing_in_cols)
## country iso2c iso3c year status lastupdated
## 0 0 0 0 0 0
numeric_data <- data_clean %>%
  select(
    GDP,
    lifeExp,
    fertilityRt,
    education,
    literacyRt,
    populationGth,
    unemploymentRt,
    mortalityRt,
    CO2Em
  )
Rows with missing values and categorical columns such as country are removed so that PCA and clustering operate on complete, numeric data. This is why some preprocessing is needed: the character columns are set aside and only the nine numeric indicators are kept.
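The step that builds `data_clean` and the long-format `numeric_long` used in the plots is not shown in this report. The following is a minimal sketch of how such a step could look, written in base R on a made-up toy table. The toy values, the `indicator_cols` helper, and the choice to drop rows whose region is "Aggregates" are assumptions for illustration, not the exact code used here.

```r
# Toy stand-in for the WDI download (values are made up for illustration)
toy <- data.frame(
  country = c("A", "B", "Aggregate X", "C"),
  region  = c("South Asia", "Europe & Central Asia", "Aggregates",
              "Sub-Saharan Africa"),
  GDP     = c(1.9e10, 1.6e10, 1.0e12, NA),
  lifeExp = c(62.9, 79.5, 63.9, 61.0),
  stringsAsFactors = FALSE
)
indicator_cols <- c("GDP", "lifeExp")  # hypothetical helper vector

# Assumption: regional aggregates are not countries, so drop them first
no_aggregates <- toy[toy$region != "Aggregates", ]

# Keep only rows that are complete on every indicator
data_clean <- no_aggregates[complete.cases(no_aggregates[, indicator_cols]), ]

# Numeric-only table for PCA and clustering
numeric_data <- data_clean[, indicator_cols]

# Long format (one row per country-indicator pair) for faceted plots
numeric_long <- data.frame(
  indicator = rep(indicator_cols, each = nrow(numeric_data)),
  value     = unlist(numeric_data, use.names = FALSE)
)
```

Keeping `data_clean` and `numeric_data` row-aligned matters later, because cluster labels and PCA scores are attached back to `data_clean` by row position.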
ggplot(numeric_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", color = "white") +
facet_wrap(~ indicator, scales = "free") +
theme_minimal() +
labs(
title = "Distribution of Development Indicators",
x = "Value",
y = "Frequency"
)
ggplot(numeric_long, aes(x = indicator, y = value)) +
geom_boxplot(fill = "blue") +
theme_minimal() +
coord_flip() +
labs(
title = "Boxplots of Development Indicators",
x = "",
y = "Value"
)
scaled_data <- scale(numeric_data)
correlation<-cor(scaled_data, method="pearson")
corrplot(correlation)
The correlation matrix reveals:
Strong negative correlation: mortalityRt and lifeExp. Countries with higher child mortality have lower life expectancy.
Strong positive correlation: education and literacyRt. This suggests that, for many nations, school enrollment and adult literacy move closely together.
GDP and CO2Em are positively correlated, which highlights the environmental cost of traditional, production-driven economic growth.
mortalityRt and fertilityRt are positively correlated. This is a useful sanity check on the data: high child mortality goes together with high fertility and with lower life expectancy.
lifeExp and literacyRt are very strongly positively correlated. This is consistent with education supporting health resilience, although a correlation alone does not establish the direction of the effect.
unemploymentRt shows the weakest correlations with the other indicators, suggesting that it captures a distinct economic dimension.
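Statements like these are easier to verify from a ranked list than from a plot. The sketch below shows one way to turn a correlation matrix into a table of variable pairs sorted by absolute correlation. It runs on simulated toy columns (`a`, `b`, `c` are placeholders, not the project's indicators), so the numbers it produces say nothing about the WDI data.

```r
set.seed(1)
x <- rnorm(50)
toy <- data.frame(
  a = x,
  b = -x + rnorm(50, sd = 0.1),  # built to be strongly negatively related to a
  c = rnorm(50)                  # unrelated noise
)

cm <- cor(toy, method = "pearson")

# Each pair appears once in the upper triangle of the symmetric matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, 1]],
  var2 = colnames(cm)[idx[, 2]],
  r    = cm[idx]
)

# Strongest relationships first, regardless of sign
pairs <- pairs[order(-abs(pairs$r)), ]
pairs
```

Applied to the real matrix, the same pattern would replace `toy` with the scaled indicator data.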
pca_res <- prcomp(scaled_data)
summary(pca_res)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.202 1.3185 1.0002 0.80437 0.51289 0.48216 0.36700
## Proportion of Variance 0.539 0.1931 0.1111 0.07189 0.02923 0.02583 0.01497
## Cumulative Proportion 0.539 0.7321 0.8433 0.91519 0.94442 0.97025 0.98521
## PC8 PC9
## Standard deviation 0.29097 0.22001
## Proportion of Variance 0.00941 0.00538
## Cumulative Proportion 0.99462 1.00000
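The proportions in this table follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. As a worked check, the sketch below recomputes them from the rounded values printed above, so small rounding differences are expected.

```r
# Standard deviations copied from the summary output above
sdev <- c(2.202, 1.3185, 1.0002, 0.80437, 0.51289,
          0.48216, 0.36700, 0.29097, 0.22001)

eigenvalues <- sdev^2                        # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

sum(eigenvalues)        # close to 9, the number of standardized indicators
round(prop_var[1:2], 3) # about 0.539 and 0.193
round(cum_var[2], 3)    # about 0.732
```

The first two components together retain roughly 73% of the variance, which is the information kept when only PC1 and PC2 are passed on to clustering.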
plot(pca_res)
pca_scores <- pca_res$x[, 1:2]
pca_scores
## PC1 PC2
## [1,] 1.00620496 0.78913422
## [2,] -0.42992953 0.10540342
## [3,] 1.62470994 0.90160257
## [4,] 1.83480948 -1.86537289
## [5,] -4.73761513 -0.17863919
## [6,] -0.97129510 0.72003044
## [7,] -6.74455217 -0.32250657
## [8,] 1.18312482 0.27024849
## [9,] -3.78330497 0.06664097
## [10,] 1.61259608 0.92257444
## [11,] 0.08557904 0.56745188
## [12,] 0.32523066 0.77817477
## [13,] 1.63795907 0.76028662
## [14,] -0.42208472 0.63908942
## [15,] 0.56977813 -6.53241048
## [16,] 2.98884872 -1.34010721
## [17,] 1.54122810 0.78616071
## [18,] 0.65412174 0.24274002
## [19,] 0.55119714 0.84065800
## [20,] 1.26743582 -0.85652371
## [21,] 0.06387895 0.64043787
## [22,] -0.48111336 0.31192658
## [23,] -2.82997355 -0.45666135
## [24,] 0.05634616 0.31305873
## [25,] -3.11686009 0.21073157
## [26,] -4.27187580 -0.09494040
## [27,] 2.04488682 0.58332431
## [28,] 0.57492058 -0.73937764
## [29,] 1.22733304 0.04018420
## [30,] 0.30818600 0.91622467
## [31,] 1.71557542 -0.62227603
## [32,] 1.90119696 0.10112336
## [33,] 2.02469004 0.82961257
## [34,] 0.24630326 0.64253657
## [35,] 0.74246350 0.02945906
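The scores alone do not say which indicators drive each component. One common way to interpret PC1 and PC2 is to look at the loadings stored in the `rotation` element of a `prcomp` result. The sketch below uses R's built-in `USArrests` data purely as a stand-in, since the project data are not reproduced here; the object name `pca_demo` is hypothetical.

```r
# Built-in demo data as a stand-in for the scaled indicators
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings: one row per original variable, one column per component
loadings <- pca_demo$rotation[, 1:2]
round(loadings, 2)

# Variables ordered by how strongly they contribute to PC1
sort(abs(loadings[, "PC1"]), decreasing = TRUE)
```

The sign of a component is arbitrary, so only the relative signs and magnitudes of the loadings within a component carry meaning.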
wss_plot <- fviz_nbclust(pca_scores, kmeans, method = "wss")
wss_plot
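The "wss" method plots the total within-cluster sum of squares against the number of clusters, and the number of clusters is read off where the curve bends. A minimal base R sketch of the same calculation is shown below, on simulated points with three well-separated groups. The toy data are an assumption for illustration only.

```r
set.seed(123)
# Three simulated groups of 20 points each in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For data like this toy example, the curve drops sharply up to k = 3 and flattens afterwards.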
set.seed(123)
kmeans_model <- kmeans(pca_scores, centers = 3, nstart = 25)
final_data <- data_clean %>%
mutate(cluster = factor(kmeans_model$cluster))
final_data$PC1 <- pca_res$x[, 1]
final_data$PC2 <- pca_res$x[, 2]
str(kmeans_model)
## List of 9
## $ cluster : int [1:35] 1 1 1 1 3 1 3 1 3 1 ...
## $ centers : num [1:3, 1:2] 0.89 0.57 -4.247 0.261 -6.532 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:2] "PC1" "PC2"
## $ totss : num 224
## $ withinss : num [1:3] 38.3 0 10.3
## $ tot.withinss: num 48.6
## $ betweenss : num 175
## $ size : int [1:3] 28 1 6
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
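Two figures in this output can be checked by hand. The total sum of squares of the clustered scores equals (n - 1) times the combined variance of PC1 and PC2, and the ratio of between-cluster to total sum of squares indicates how much of that spread the three clusters account for. The sketch below redoes the arithmetic from the rounded values printed above, so results are approximate.

```r
n        <- 35                 # rows in pca_scores
sd_pc    <- c(2.202, 1.3185)   # standard deviations of PC1 and PC2
totss    <- 224                # as printed by str(kmeans_model)
betweens <- 175
sizes    <- c(28, 1, 6)

# Total sum of squares implied by the PCA output
(n - 1) * sum(sd_pc^2)   # close to 224

# Share of the spread in PC1/PC2 explained by the clustering
betweens / totss         # about 0.78

# Cluster sizes must add up to the number of countries
sum(sizes) == n
```

A cluster of size 1 has a within-cluster sum of squares of exactly 0, which matches the second entry of `withinss`.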
eig_plot <- fviz_eig(pca_res, addlabels = TRUE)
eig_plot
I decided to use K-means rather than CLARA or PAM because the dataset is relatively small and the variables are continuous. After standardization and PCA transformation, Euclidean distance is appropriate, making K-means a natural, efficient, and interpretable clustering solution. Alternative algorithms such as PAM and CLARA are designed for robustness to outliers and scalability to large datasets respectively, and these advantages are not critical in this context.
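The role of standardization can be illustrated with a small example. On raw values, Euclidean distance is dominated by whichever variable has the largest units (here GDP in dollars), whereas after scaling each variable contributes comparably. The four toy countries below are made up for illustration.

```r
toy <- data.frame(
  GDP     = c(1e10, 3e10, 1.1e10, 5e10),
  lifeExp = c(60, 61, 80, 75),
  row.names = c("A", "B", "C", "D")
)

d_raw    <- as.matrix(dist(toy))         # Euclidean distance on raw values
d_scaled <- as.matrix(dist(scale(toy)))  # after centering and scaling

# Nearest neighbour of country A under each distance (column 1 is A itself)
names(which.min(d_raw["A", -1]))     # "C": closest in GDP, lifeExp ignored
names(which.min(d_scaled["A", -1]))  # "B": similar lifeExp now counts
```

K-means minimizes squared Euclidean distances to cluster centres, so the same effect would carry over to the cluster assignments if the indicators were left unscaled.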
The model uses centers = 3. In the fitted model the clusters contain 28, 1, and 6 countries, so cluster 2 is a single outlying country. The labels below follow standard development patterns and should be read as a working interpretation rather than a verified profile of each cluster:
Cluster 1 (Developed Economy): high GDP, high education, low fertility, and low mortality. Typically associated with high exports and imports and with good infrastructure and amenities.
Cluster 2 (Developing Economy): moderate growth and rising literacy, but still balancing high population growth or environmental impacts (CO2Em).
Cluster 3 (Underdeveloped Economy): high mortality, high fertility rates, and lower literacy levels. Typically associated with low exports and imports and with limited infrastructure and amenities.
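Labels like these can be checked against the data by computing the mean of each indicator within each cluster. The sketch below shows the pattern with base R's `aggregate()` on a made-up table; the values and the two indicator columns are placeholders, not results from this analysis.

```r
# Toy table: cluster label plus two indicators (values are made up)
toy <- data.frame(
  cluster     = factor(c(1, 1, 2, 2, 3, 3)),
  lifeExp     = c(80, 82, 70, 72, 58, 60),
  fertilityRt = c(1.5, 1.7, 2.5, 2.7, 5.0, 5.2)
)

# Mean of each indicator per cluster
profile <- aggregate(cbind(lifeExp, fertilityRt) ~ cluster,
                     data = toy, FUN = mean)
profile

# Number of observations per cluster, to spot very small clusters
table(toy$cluster)
```

Running the same pattern on the clustered country data would show directly which cluster has the highest life expectancy or fertility, instead of relying on the assumed ordering.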
fviz_cluster(
kmeans_model,
data = pca_scores,
geom = "point",
ellipse.type = "convex",
palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
ggtheme = theme_minimal(),
main = "Global Development: Cluster Distribution",
xlab = "PC1",
ylab = "PC2"
) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
legend.position = "bottom"
)
ggplot(final_data, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7, size = 3) +
scale_color_manual(
values = c("#2E9FDF", "#E7B800", "#FC4E07"),
labels = c("Developed Economies", "Developing Economies", "Underdeveloped Economies"))+
labs(
title = "Nations Grouped by Socio-Economic Performance",
subtitle = "Analysis based on 2019 World Bank Development Indicators",
x = "Principal Component 1",
y = "Principal Component 2",
color = "Development Group"
) +
theme_classic()
Clustering countries by indicators rather than by continent groups together peers that are not geographic neighbours.
Countries that fall into the "High-Development" cluster almost universally have a literacyRt above 90%, suggesting its importance for economic transition.
The Environmental Paradox: high-income clusters contribute disproportionately to CO2Em, showing that "developed" status currently lacks environmental sustainability.
The Demographic Divide: high fertilityRt is a strong indicator of membership in the "underdeveloped" cluster, as it often correlates with lower access to education.
Using PCA and K-means, we reduced a complex 9-dimensional dataset to three global profiles. The results suggest prioritizing literacy and mortality reduction over pure GDP growth.