Socio-Economic Country Clustering for Aid Prioritization

Executive Summary

This report presents a rigorous, end-to-end unsupervised machine learning study on 167 countries, using socio-economic and health data to segment nations into meaningful groups and identify which require urgent development aid.

Unsupervised Learning PCA · Dimensionality Reduction K-Means Clustering Hierarchical Clustering Silhouette Validation Aid Prioritization R · tidyverse · factoextra

🔍 Central Research Question

How can countries be meaningfully grouped based on their socio-economic and health profiles, and which countries should be prioritized for humanitarian aid using a transparent, data-driven scoring framework?

The full workflow covers feature engineering → exploratory analysis → standardization → PCA → K-means and hierarchical clustering → silhouette validation → cluster profiling → need-score ranking.

1 · Project Objectives

This analysis targets four concrete questions that a practitioner or policy analyst would ask:

What natural groupings exist among countries based on health and economic indicators?
Can PCA reveal a low-dimensional structure in the data without significant information loss?
How do K-means and hierarchical clustering compare when applied to reduced PCA space?
Which specific countries show the strongest need for aid, and can that ranking be explained transparently?

2 · Methodology Pipeline

The project follows an industry-standard data science workflow. Each stage builds directly on the previous one, ensuring that every modeling decision is justified by the data.

Load & InspectUnderstand structure and variable types

Feature EngineeringConvert % GDP values to absolute figures

EDADistributions, outliers, correlations

StandardizeZ-score scaling for fair comparison

PCAReduce to 90%-variance components

ClusteringK-means + hierarchical, k = 4

ValidationSilhouette scores for both methods

ProfilingInterpret cluster identities

Need ScoreRank & recommend priority nations

3 · Load Required Libraries

Code & Rationale

Before the analysis begins, all required packages are loaded. A defensive loader function is used so that the report fails early with a helpful message if any dependency is missing — a best practice for reproducible research.

Package	Role in this analysis
`tidyverse`	Data wrangling, reshaping, and all `ggplot2` visualizations
`skimr`	Rich, formatted summary statistics
`ggcorrplot`	Correlation heatmap
`factoextra`	PCA scree plots, cluster visualizations, silhouette plots
`cluster`	Silhouette computation
`knitr`	Table rendering in the knitted report

load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE))
    stop(sprintf(
      "Package '%s' is required but not installed.\nRun: install.packages('%s')", pkg, pkg
    ))
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}

required_packages <- c(
  "tidyverse", "skimr", "ggcorrplot",
  "factoextra", "cluster", "knitr"
)

invisible(lapply(required_packages, load_package))

# Set a clean, consistent ggplot2 theme for the entire report
theme_report <- function() {
  theme_minimal(base_size = 12, base_family = "sans") +
  theme(
    plot.title    = element_text(face = "bold", color = "#0D1F2D", size = 14, margin = margin(b = 4)),
    plot.subtitle = element_text(color = "#64748B", size = 11, margin = margin(b = 12)),
    plot.caption  = element_text(color = "#94A3B8", size = 9, hjust = 0),
    axis.title    = element_text(color = "#2C4A5E", size = 11, face = "bold"),
    axis.text     = element_text(color = "#475569"),
    strip.text    = element_text(face = "bold", color = "#0D1F2D"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "#E2E8F0", linewidth = 0.5),
    legend.position  = "bottom",
    legend.text      = element_text(size = 10)
  )
}

theme_set(theme_report())

✓ Environment ready. If this chunk runs without errors all six packages are available. The custom theme_report() function guarantees visual consistency across every plot in the report.

4 · Load & Inspect the Dataset

The raw dataset is read from disk and immediately inspected. Knowing the number of rows, columns, and variable types before touching the data prevents silent errors later in the pipeline.

if (!file.exists(params$data_path))
  stop(paste0(
    "Dataset not found: '", params$data_path, "'.\n",
    "Place Country-data.csv in the same folder as this .Rmd file."
  ))

df_raw <- read.csv(params$data_path)

n_countries <- nrow(df_raw)
n_vars      <- ncol(df_raw)

cat("Rows (countries):", n_countries, "\nColumns (variables):", n_vars, "\n")

#> Rows (countries): 167 
#> Columns (variables): 10

head(df_raw, 8) |>
  knitr::kable(
    caption  = "**Table 1** — Raw dataset preview (first 8 rows)",
    digits   = 2,
    booktabs = TRUE
  )

**Table 1** — Raw dataset preview (first 8 rows)
country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	90.2	10.0	7.58	44.9	1610	9.44	56.2	5.82	553
Albania	16.6	28.0	6.55	48.6	9930	4.49	76.3	1.65	4090
Algeria	27.3	38.4	4.17	31.4	12900	16.10	76.5	2.89	4460
Angola	119.0	62.3	2.85	42.9	5900	22.40	60.1	6.16	3530
Antigua and Barbuda	10.3	45.5	6.03	58.9	19100	1.44	76.8	2.13	12200
Argentina	14.5	18.9	8.10	16.0	18700	20.90	75.8	2.37	10300
Armenia	18.1	20.8	4.40	45.3	6700	7.77	73.3	1.69	3220
Australia	4.8	19.8	8.73	20.9	41400	1.16	82.0	1.93	51900

str(df_raw)

#> 'data.frame':    167 obs. of  10 variables:
#>  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
#>  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
#>  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
#>  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
#>  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
#>  $ income    : int  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
#>  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
#>  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
#>  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
#>  $ gdpp      : int  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...

Key observations: The country column is a character identifier — it will be used as a label throughout but excluded from all numerical computations. Every remaining column is numeric. No immediate structural issues detected.

5 · Data Dictionary

Understanding what each variable measures is essential before any transformation or modelling decision.

tibble::tribble(
  ~Variable,    ~Type,       ~Unit,               ~Description,
  "country",    "Character", "—",                 "Country name (label, not used in modelling)",
  "child_mort", "Numeric",   "per 1,000 births",  "Under-5 child mortality rate",
  "exports",    "Numeric",   "% of GDP (raw)",    "Exports of goods and services",
  "health",     "Numeric",   "% of GDP (raw)",    "Total health expenditure",
  "imports",    "Numeric",   "% of GDP (raw)",    "Imports of goods and services",
  "income",     "Numeric",   "USD",               "Net income per person",
  "inflation",  "Numeric",   "% per year",        "Annual consumer price inflation",
  "life_expec", "Numeric",   "years",             "Life expectancy at birth",
  "total_fer",  "Numeric",   "children/woman",    "Total fertility rate",
  "gdpp",       "Numeric",   "USD",               "GDP per capita"
) |>
  knitr::kable(
    caption  = "**Table 2** — Data dictionary",
    booktabs = TRUE
  )

**Table 2** — Data dictionary
Variable	Type	Unit	Description
country	Character	—	Country name (label, not used in modelling)
child_mort	Numeric	per 1,000 births	Under-5 child mortality rate
exports	Numeric	% of GDP (raw)	Exports of goods and services
health	Numeric	% of GDP (raw)	Total health expenditure
imports	Numeric	% of GDP (raw)	Imports of goods and services
income	Numeric	USD	Net income per person
inflation	Numeric	% per year	Annual consumer price inflation
life_expec	Numeric	years	Life expectancy at birth
total_fer	Numeric	children/woman	Total fertility rate
gdpp	Numeric	USD	GDP per capita

⚠️ Important — Scale Mismatch

exports, health, and imports are recorded as percentages of GDP. Feeding raw percentages into clustering alongside absolute USD values would distort distances. The next section converts them to estimated per-capita monetary values.

6 · Feature Engineering

The three percentage-of-GDP variables are converted into estimated per-capita monetary values by multiplying by GDP per capita. This puts all economic indicators on a comparable absolute scale before any further analysis.

\[\text{exports}_{\text{abs}} = \frac{\text{exports}_{\%} \times \text{gdpp}}{100}\]

The same formula is applied to health and imports.

df <- df_raw |>
  mutate(
    exports = (exports * gdpp) / 100,
    health  = (health  * gdpp) / 100,
    imports = (imports * gdpp) / 100,
    gdpp    = as.numeric(gdpp),
    income  = as.numeric(income),
    country = factor(country)
  )

missing_count <- sum(is.na(df))
cat("Total missing values after engineering:", missing_count, "\n")

#> Total missing values after engineering: 0

head(df, 6) |>
  knitr::kable(
    caption  = "**Table 3** — Dataset after feature engineering",
    digits   = 1,
    booktabs = TRUE
  )

**Table 3** — Dataset after feature engineering
country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	90.2	55.3	41.9	248.3	1610	9.4	56.2	5.8	553
Albania	16.6	1145.2	267.9	1987.7	9930	4.5	76.3	1.6	4090
Algeria	27.3	1712.6	186.0	1400.4	12900	16.1	76.5	2.9	4460
Angola	119.0	2199.2	100.6	1514.4	5900	22.4	60.1	6.2	3530
Antigua and Barbuda	10.3	5551.0	735.7	7185.8	19100	1.4	76.8	2.1	12200
Argentina	14.5	1946.7	834.3	1648.0	18700	20.9	75.8	2.4	10300

Result: All three percentage variables now represent estimated USD-per-capita amounts, making them directly comparable with income and gdpp. No missing values were introduced by the transformation.

7 · Dataset Overview

A high-level numerical summary of the cleaned dataset confirms the scale differences that motivate standardization in Section 9.

X <- df |> select(-country)   # numeric matrix used in all subsequent steps

tibble(
  Metric  = c("Countries in dataset", "Numeric indicators", "Missing values"),
  Value   = c(nrow(df), ncol(X), missing_count)
) |>
  knitr::kable(caption = "**Table 4** — Cleaned dataset at a glance", booktabs = TRUE)

**Table 4** — Cleaned dataset at a glance
Metric	Value
Countries in dataset	167
Numeric indicators	9
Missing values	0

Countries

167

Global coverage

Indicators

Economic + health

Missing Values

Complete dataset

Clustering Target

k = 4

Groups to identify

skimr::skim(X)

Data summary
Name	X
Number of rows	167
Number of columns	9
_______________________
Column type frequency:
numeric	9
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
child_mort	1	38.27	40.33	2.60	8.25	19.30	62.10	208.00	▇▂▂▁▁
exports	1	7420.62	17973.89	1.08	447.14	1777.44	7278.00	183750.00	▇▁▁▁▁
health	1	1056.73	1801.41	12.82	78.54	321.89	976.94	8663.60	▇▁▁▁▁
imports	1	6588.35	14710.81	0.65	640.21	2045.58	7719.60	149100.00	▇▁▁▁▁
income	1	17144.69	19278.07	609.00	3355.00	9960.00	22800.00	125000.00	▇▂▁▁▁
inflation	1	7.78	10.57	-4.21	1.81	5.39	10.75	104.00	▇▁▁▁▁
life_expec	1	70.56	8.89	32.10	65.30	73.10	76.80	82.80	▁▁▃▅▇
total_fer	1	2.95	1.51	1.15	1.79	2.41	3.88	7.49	▇▃▂▂▁
gdpp	1	12964.16	18328.70	231.00	1330.00	4660.00	14050.00	105000.00	▇▁▁▁▁

Key takeaway: Variables span vastly different ranges — GDP per capita ranges from tens of dollars to tens of thousands, while fertility is bounded between ~1 and ~8. This scale disparity confirms that Z-score standardization is mandatory before distance-based methods like PCA and clustering.

8 · Exploratory Data Analysis

Exploratory analysis answers three questions before any modelling: (1) Are distributions roughly normal or heavily skewed? (2) How correlated are the variables? (3) Are there extreme outliers that may distort distance calculations?

8.1 Distributions

Each variable is plotted as a histogram. Variables with right-skewed distributions will benefit from the log-scale view in the next tab.

df |>
  select(-country) |>
  pivot_longer(everything(), names_to = "feature", values_to = "value") |>
  ggplot(aes(x = value, fill = feature)) +
  geom_histogram(bins = 28, color = "white", alpha = 0.90, show.legend = FALSE) +
  facet_wrap(~ feature, scales = "free", ncol = 3) +
  scale_fill_manual(values = rep(c("#4EA8DE","#2EC4B6","#E63946","#FF9F1C",
                                   "#6366F1","#8B5CF6","#14B8A6","#F59E0B","#10B981"), 2)) +
  labs(
    title    = "Distribution of All Socio-Economic & Health Indicators",
    subtitle = "Histograms on raw scales — note strong right-skew in economic variables",
    x = NULL, y = "Countries"
  )

Finding: GDP per capita, income, exports, imports, and health expenditure are all strongly right-skewed — a handful of wealthy nations pull the tails far to the right. Child mortality and fertility show a different shape, closer to log-normal. Inflation shows the most irregular distribution, likely driven by countries experiencing hyperinflation.

8.2 Log-Scale View

Log-transformed views reveal the true spread of the right-skewed economic variables.

df |>
  select(gdpp, income, child_mort, exports, health, imports) |>
  pivot_longer(everything(), names_to = "feature", values_to = "value") |>
  filter(value > 0) |>
  ggplot(aes(x = value, fill = feature)) +
  geom_histogram(bins = 28, color = "white", alpha = 0.9, show.legend = FALSE) +
  facet_wrap(~ feature, scales = "free", ncol = 3) +
  scale_x_log10(labels = scales::label_comma()) +
  scale_fill_manual(values = c("#4EA8DE","#2EC4B6","#E63946","#FF9F1C","#6366F1","#8B5CF6")) +
  labs(
    title    = "Selected Variables on a Log₁₀ Scale",
    subtitle = "Log scaling normalises the long right tails and reveals the full country spread",
    x = "Log₁₀ value", y = "Countries"
  )

Finding: On the log scale, distributions become far more symmetric, confirming the multiplicative nature of economic inequality across nations. Countries are spread over two to three orders of magnitude in GDP and income.

8.3 Outlier Boxplots

Boxplots highlight country-level outliers across every indicator simultaneously.

df |>
  select(-country) |>
  pivot_longer(everything(), names_to = "feature", values_to = "value") |>
  ggplot(aes(x = reorder(feature, value, FUN = median), y = value, fill = feature)) +
  geom_boxplot(
    outlier.colour = "#E63946",
    outlier.size   = 2.2,
    outlier.alpha  = 0.70,
    alpha          = 0.80,
    show.legend    = FALSE
  ) +
  coord_flip() +
  labs(
    title    = "Feature-Level Outlier Detection via Boxplots",
    subtitle = "Red points represent potential outliers — common in cross-country economic data",
    x = NULL, y = "Value"
  )

Finding: GDP, income, exports, and health show the most extreme outliers. These likely correspond to small, high-income economies (e.g. Singapore, Luxembourg) or highly trade-oriented nations. These observations are real data points, not errors — they will be handled by Z-score standardization rather than removal.

9 · Standardization & Multivariate Outliers

9.1 Z-Score Normalization

K-means minimizes squared Euclidean distances, and PCA is variance-weighted. Both are sensitive to scale. Z-score normalization transforms every variable to mean = 0 and standard deviation = 1, ensuring no single variable dominates purely because of its measurement units.

X_scaled <- scale(X)

tibble(
  Feature           = colnames(X_scaled),
  Mean_after_scale  = round(colMeans(X_scaled), 5),
  SD_after_scale    = round(apply(X_scaled, 2, sd), 5)
) |>
  knitr::kable(
    caption  = "**Table 5** — Standardization verification (mean ≈ 0, SD ≈ 1 for all features)",
    booktabs = TRUE
  )

**Table 5** — Standardization verification (mean ≈ 0, SD ≈ 1 for all features)
Feature	Mean_after_scale	SD_after_scale
child_mort	0	1
exports	0	1
health	0	1
imports	0	1
income	0	1
inflation	0	1
life_expec	0	1
total_fer	0	1
gdpp	0	1

Verification passed. Every feature now has mean essentially equal to 0 and standard deviation equal to 1. The data is ready for PCA and clustering.

9.2 Multivariate Outlier Ranking

Beyond individual-variable outliers, some countries may be extreme across multiple dimensions simultaneously. The multivariate outlier score is the Euclidean distance from the standardized data centroid (the origin in Z-score space).

\[\text{Outlier Score}_i = \sqrt{\sum_{j=1}^{p} z_{ij}^2}\]

outlier_rank <- tibble(
  country       = df$country,
  outlier_score = sqrt(rowSums(X_scaled^2))
) |>
  arrange(desc(outlier_score))

head(outlier_rank, 12) |>
  mutate(outlier_score = round(outlier_score, 3)) |>
  knitr::kable(
    caption  = "**Table 6** — Top 12 multivariate outliers (ranked by standardized distance from centroid)",
    booktabs = TRUE
  )

**Table 6** — Top 12 multivariate outliers (ranked by standardized distance from centroid)
country	outlier_score
Luxembourg	15.778
Nigeria	9.709
Singapore	8.044
Qatar	6.896
Switzerland	6.828
Norway	6.703
Haiti	6.188
United States	5.162
Ireland	4.946
Denmark	4.797
Netherlands	4.729
Central African Republic	4.320

outlier_rank |>
  slice_head(n = 15) |>
  ggplot(aes(x = reorder(country, outlier_score), y = outlier_score, fill = outlier_score)) +
  geom_col(width = 0.72, color = "white", linewidth = 0.3) +
  geom_text(aes(label = round(outlier_score, 2)), hjust = -0.15, size = 3, color = "#1E293B") +
  coord_flip() +
  scale_fill_gradient2(low = "#ADE8F4", mid = "#4EA8DE", high = "#E63946",
                       midpoint = median(outlier_rank$outlier_score[1:15])) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Top 15 Multivariate Outliers",
    subtitle = "Countries furthest from the dataset centroid in standardized 9-dimensional space",
    x = NULL, y = "Outlier Score (standardized Euclidean distance)",
    fill = "Score"
  ) +
  theme(legend.position = "none")

Insight: Top outliers tend to be high-income, high-trade economies (very high GDP, income, and exports) — or occasionally, countries in extreme humanitarian distress (very high child mortality, very low GDP). Both extremes are legitimate data and will form their own natural clusters.

10 · Correlation Analysis

Before applying PCA, it is worth checking whether the variables are correlated. High inter-variable correlation is the key condition that makes PCA effective — it means a smaller set of components can capture most of the variance.

corr_matrix <- cor(X, use = "pairwise.complete.obs")

ggcorrplot(
  corr_matrix,
  type     = "lower",
  lab      = TRUE,
  lab_size = 3.2,
  tl.cex   = 10,
  colors   = c("#E63946", "#FFFFFF", "#4EA8DE"),
  outline.color = "#E2E8F0"
) +
  labs(
    title    = "Correlation Heatmap of All Indicators",
    subtitle = "Blue = positive correlation · Red = negative correlation · Values = Pearson r"
  ) +
  theme(
    plot.title    = element_text(face = "bold", color = "#0D1F2D"),
    plot.subtitle = element_text(color = "#64748B")
  )

Notable correlation clusters:

Development axis (positive): gdpp, income, life_expec, health, exports, imports all correlate positively — wealthier countries spend more on health, trade more, and live longer.
Inverse relationships: child_mort and total_fer are strongly negatively correlated with gdpp and life_expec — high fertility and child mortality are hallmarks of lower-income countries.
PCA justification: The strong inter-correlations confirm that PCA will effectively summarize this dataset into a small number of orthogonal components.

11 · Dimensionality Reduction with PCA

Principal Component Analysis (PCA) rotates the data into a new coordinate system where the first axis (PC1) explains the maximum possible variance, PC2 explains the next most, and so on. This allows us to cluster in a lower-dimensional space that is less susceptible to noise.

pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)  # already scaled

explained_var  <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumulative_var <- cumsum(explained_var)

pca_tbl <- tibble(
  Component            = paste0("PC", seq_along(explained_var)),
  `Variance Explained` = paste0(round(explained_var * 100, 2), "%"),
  `Cumulative Variance`= paste0(round(cumulative_var * 100, 2), "%")
)

num_components <- which(cumulative_var >= 0.90)[1]

pca_tbl |>
  knitr::kable(
    caption  = "**Table 7** — PCA variance explained per component",
    booktabs = TRUE
  )

**Table 7** — PCA variance explained per component
Component	Variance Explained	Cumulative Variance
PC1	58.94%	58.94%
PC2	18.45%	77.38%
PC3	9.91%	87.29%
PC4	6.07%	93.37%
PC5	3.03%	96.4%
PC6	2.46%	98.86%
PC7	0.94%	99.79%
PC8	0.16%	99.95%
PC9	0.05%	100%

cat("Components needed to retain ≥90% variance:", num_components,
    "\nCumulative variance retained:", round(cumulative_var[num_components]*100, 2), "%\n")

#> Components needed to retain ≥90% variance: 4 
#> Cumulative variance retained: 93.37 %

11.1 Scree Plot

The scree plot shows the “elbow” where additional components contribute diminishing variance.

factoextra::fviz_eig(
  pca_result,
  addlabels  = TRUE,
  ylim       = c(0, 62),
  barfill    = "#4EA8DE",
  barcolor   = "#3D6B8C",
  linecolor  = "#E63946",
  ncp        = 9
) +
  labs(
    title    = "PCA Scree Plot — Variance Explained by Each Component",
    subtitle = "The first two components alone explain over 70% of total variance"
  ) +
  theme_report()

11.2 Cumulative Explained Variance

The cumulative curve shows how quickly components accumulate information, with the 90% threshold highlighted.

tibble(n = seq_along(cumulative_var), cumvar = cumulative_var) |>
  ggplot(aes(x = n, y = cumvar)) +
  geom_area(alpha = 0.12, fill = "#4EA8DE") +
  geom_line(color = "#4EA8DE", linewidth = 1.2) +
  geom_point(color = "#E63946", size = 3.5, shape = 21, fill = "white", stroke = 2) +
  geom_hline(yintercept = 0.90, linetype = "dashed", color = "#2EC4B6", linewidth = 1) +
  annotate("text", x = 1.2, y = 0.915, label = "90% threshold",
           color = "#0F766E", size = 3.5, fontface = "bold") +
  geom_vline(xintercept = num_components, linetype = "dotted", color = "#E63946", linewidth = 0.9) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_x_continuous(breaks = 1:9) +
  labs(
    title    = "Cumulative Explained Variance",
    subtitle = paste0("Retaining ", num_components, " components captures the 90% threshold"),
    x        = "Number of Components",
    y        = "Cumulative Variance Explained"
  )

✅ PCA Decision

4 principal components are retained, preserving 93.37% of the original variance. This reduced representation removes noise while keeping virtually all of the structured information needed for clustering.

11.3 PCA Biplot — Variable Loadings

The biplot overlays country positions with variable loading arrows, showing which original variables drive each principal component.

factoextra::fviz_pca_biplot(
  pca_result,
  repel           = TRUE,
  col.ind         = "#94A3B8",
  col.var         = "#E63946",
  alpha.ind       = 0.4,
  label           = "var",
  arrowsize       = 0.8,
  labelsize       = 4
) +
  labs(
    title    = "PCA Biplot — Countries & Variable Loadings",
    subtitle = "Arrows show direction and strength of each variable's contribution to PC1 and PC2"
  ) +
  theme_report()

Interpretation: Variables pointing in the same direction are positively correlated. PC1 separates wealthy, high-health-spending countries (right) from poor, high-mortality countries (left). PC2 captures trade-intensity and inflation patterns. This two-dimensional map already reveals a strong country structure that clustering will formalize.

12 · Optional — Autoencoder Comparison

An autoencoder is a neural-network-based dimensionality reduction method that can capture non-linear relationships. This section is parameterized: it runs only when run_autoencoder: true is set in the YAML header.

Why not use it by default? PCA is entirely sufficient here — the high linear correlations shown in Section 10 mean non-linear methods offer no meaningful benefit. PCA is also faster, deterministic, and fully interpretable, making it the more professional choice for this dataset.

# Requires keras3 to be installed and configured.
if (!requireNamespace("keras3", quietly = TRUE))
  stop("Install keras3 to run the autoencoder section.")

library(keras3)

input_dim  <- ncol(X_scaled)
latent_dim <- num_components

inputs      <- layer_input(shape = input_dim)
encoded     <- inputs |> layer_dense(128, activation = "relu") |>
               layer_dense(latent_dim, activation = "relu")
decoded     <- encoded |> layer_dense(128, activation = "relu") |>
               layer_dense(input_dim, activation = "linear")

autoencoder <- keras_model(inputs, decoded)
autoencoder |> compile(optimizer = "adam", loss = "mse")
autoencoder |> fit(X_scaled, X_scaled,
                   epochs = 60, batch_size = 32,
                   validation_split = 0.15, verbose = 0)

encoder <- keras_model(inputs, encoded)
enc_df  <- predict(encoder, X_scaled) |> as.data.frame() |>
           setNames(paste0("Latent_", seq_len(latent_dim)))

#> 6/6 - 0s - 14ms/step

ggplot(enc_df, aes(Latent_1, Latent_2)) +
  geom_point(alpha = 0.65, color = "#E63946", size = 2.5) +
  labs(title = "Autoencoder — 2D Latent Space Projection",
       x = "Latent Dimension 1", y = "Latent Dimension 2")

13 · Prepare PCA Components for Clustering

The selected principal component scores are extracted into a clean matrix. This is the input for both clustering algorithms.

pca_components <- as.data.frame(pca_result$x[, 1:num_components])
pca_components$country <- df$country

pca_matrix <- pca_components |> select(-country)

head(pca_components, 6) |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  knitr::kable(
    caption  = paste0("**Table 8** — First 6 countries in PCA space (",
                      num_components, " components)"),
    booktabs = TRUE
  )

**Table 8** — First 6 countries in PCA space (4 components)
PC1	PC2	PC3	PC4	country
-2.6277	-1.4679	-0.5478	-0.2416	Afghanistan
-0.0241	1.4256	-0.0141	0.4493	Albania
-0.4582	0.6735	0.9565	0.2178	Algeria
-2.7145	-2.1658	0.5984	-0.4327	Angola
0.6467	1.0204	-0.2567	0.2883	Antigua and Barbuda
0.0353	0.6832	1.4643	-0.0288	Argentina

Each row is now a country. Instead of 9 original variables, each country is described by 4 orthogonal component scores. These scores carry no measurement-unit differences and are ready for distance-based clustering.

14 · Hierarchical Clustering

Hierarchical clustering constructs a tree (dendrogram) by successively merging the most similar countries or groups. Ward’s method is used because it minimizes the total within-cluster variance at each merge step, producing compact and well-separated clusters.

14.1 Build & Visualize Dendrogram

dist_pca <- dist(pca_matrix, method = "euclidean")
hc_ward  <- hclust(dist_pca, method = "ward.D2")

par(
  mar   = c(4, 3, 3, 1),
  family = "sans",
  bg    = "white"
)
plot(
  hc_ward,
  labels  = FALSE,
  hang    = -1,
  main    = "Hierarchical Clustering Dendrogram — Ward's Method",
  sub     = "",
  xlab    = "Countries (labels hidden for clarity)",
  ylab    = "Ward's Linkage Height",
  col.main = "#0D1F2D",
  cex.main = 1.1
)
mtext(
  paste0("k = ", params$k_clusters, " clusters highlighted"),
  side = 3, line = 0.3, cex = 0.85, col = "#64748B"
)
rect.hclust(hc_ward, k = params$k_clusters, border = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C"))

Reading the dendrogram: The height of each merger represents the dissimilarity at that point. The four tall vertical gaps near the top of the tree correspond to the four natural clusters — cutting at that height yields the most distinct country groups.

14.2 Cluster Sizes & Silhouette

k_final      <- params$k_clusters
hc_clusters  <- cutree(hc_ward, k = k_final)
pca_components$cluster_hc <- factor(hc_clusters)

table(hc_clusters) |>
  as.data.frame() |>
  setNames(c("Cluster", "Countries")) |>
  knitr::kable(
    caption  = "**Table 9** — Hierarchical clustering: countries per cluster",
    booktabs = TRUE
  )

**Table 9** — Hierarchical clustering: countries per cluster
Cluster	Countries
1	48
2	82
3	35
4	2

sil_hc     <- silhouette(hc_clusters, dist_pca)
avg_sil_hc <- mean(sil_hc[, 3])

factoextra::fviz_silhouette(sil_hc, palette = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C")) +
  labs(
    title    = paste0("Silhouette Plot — Hierarchical Clustering (k = ", k_final, ")"),
    subtitle = paste0("Average silhouette width = ", round(avg_sil_hc, 3),
                      "  ·  Width > 0.5 indicates reasonable cluster structure")
  ) +
  theme_report()

#>   cluster size ave.sil.width
#> 1       1   48          0.41
#> 2       2   82          0.56
#> 3       3   35          0.35
#> 4       4    2          0.10

cat("Average silhouette width (hierarchical):", round(avg_sil_hc, 3), "\n")

#> Average silhouette width (hierarchical): 0.467

15 · K-Means Clustering

K-means partitions n countries into k groups by iteratively reassigning each country to the nearest centroid and recomputing centroids until convergence. Unlike hierarchical clustering, K-means requires specifying k in advance.

15.1 Elbow Method

The elbow method plots total within-cluster sum of squares (WCSS) for k = 1 to 10. The “elbow” — where the curve bends and returns diminish — guides the choice of k.

set.seed(42)
wss_values <- sapply(1:10, function(k) {
  kmeans(pca_matrix, centers = k, nstart = 30, iter.max = 100)$tot.withinss
})

tibble(k = 1:10, WCSS = wss_values) |>
  ggplot(aes(x = k, y = WCSS)) +
  geom_line(color = "#4EA8DE", linewidth = 1.2) +
  geom_point(color = "#E63946", size = 3.5, shape = 21, fill = "white", stroke = 2) +
  geom_vline(xintercept = k_final, linetype = "dashed",
             color = "#2EC4B6", linewidth = 1) +
  annotate("text", x = k_final + 0.2, y = max(wss_values) * 0.9,
           label = paste0("Selected k = ", k_final),
           hjust = 0, color = "#0F766E", size = 3.5, fontface = "bold") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title    = "Elbow Method for Optimal K",
    subtitle = "WCSS (Total Within-Cluster Sum of Squares) vs. number of clusters",
    x        = "Number of Clusters (k)",
    y        = "Total WCSS"
  )

15.2 Fit Model & Cluster Sizes

set.seed(42)
km_result   <- kmeans(pca_matrix, centers = k_final, nstart = 50, iter.max = 200)
km_clusters <- km_result$cluster

table(km_clusters) |>
  as.data.frame() |>
  setNames(c("Cluster", "Countries")) |>
  knitr::kable(
    caption  = "**Table 10** — K-means: countries per cluster",
    booktabs = TRUE
  )

**Table 10** — K-means: countries per cluster
Cluster	Countries
1	88
2	48
3	29
4	2

15.3 Cluster Visualization in PCA Space

# Manual ggplot2 cluster plot — avoids the fviz_cluster / ggpubr argument conflict
cluster_palette <- c("1" = "#4EA8DE", "2" = "#E63946", "3" = "#2EC4B6", "4" = "#FF9F1C")

tibble(
  PC1     = pca_result$x[, 1],
  PC2     = pca_result$x[, 2],
  Cluster = factor(km_clusters),
  Country = as.character(df$country)
) |>
  ggplot(aes(x = PC1, y = PC2, colour = Cluster, fill = Cluster)) +
  stat_ellipse(
    geom  = "polygon",
    type  = "norm",
    alpha = 0.10,
    level = 0.90,
    linewidth = 0.4
  ) +
  geom_point(size = 2.4, alpha = 0.85, shape = 21, colour = "white", stroke = 0.4,
             aes(fill = Cluster)) +
  scale_colour_manual(values = cluster_palette) +
  scale_fill_manual(values   = cluster_palette) +
  labs(
    title    = paste0("K-Means Clusters in PCA Space (k = ", k_final, ")"),
    subtitle = "Countries projected onto PC1 & PC2 · Ellipses show 90% normal confidence regions",
    x        = "Principal Component 1",
    y        = "Principal Component 2",
    colour   = "Cluster",
    fill     = "Cluster"
  )

15.4 Silhouette Analysis

sil_km     <- silhouette(km_clusters, dist(pca_matrix))
avg_sil_km <- mean(sil_km[, 3])

factoextra::fviz_silhouette(sil_km, palette = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C")) +
  labs(
    title    = paste0("Silhouette Plot — K-Means (k = ", k_final, ")"),
    subtitle = paste0("Average silhouette width = ", round(avg_sil_km, 3))
  ) +
  theme_report()

#>   cluster size ave.sil.width
#> 1       1   88          0.53
#> 2       2   48          0.43
#> 3       3   29          0.42
#> 4       4    2          0.09

cat("Average silhouette width (K-means):", round(avg_sil_km, 3), "\n")

#> Average silhouette width (K-means): 0.477

16 · Method Comparison

Both algorithms are evaluated side-by-side. The silhouette width is the primary internal validation metric — it measures how similar each country is to its own cluster relative to neighbouring clusters (range −1 to +1; higher is better).

tibble(
  Method             = c("Hierarchical (Ward's)", "K-Means"),
  `Linkage / Init`   = c("Ward's D2", "Random centroids, nstart = 50"),
  `Input Space`      = rep(paste0(num_components, " PCA components"), 2),
  `k`                = rep(k_final, 2),
  `Avg Silhouette`   = round(c(avg_sil_hc, avg_sil_km), 4),
  `Interpretation`   = c(
    "Nested tree structure, useful for understanding hierarchy",
    "Hard partitions, computationally efficient, easy to deploy"
  )
) |>
  knitr::kable(
    caption  = "**Table 11** — Clustering method comparison",
    booktabs = TRUE
  )

**Table 11** — Clustering method comparison
Method	Linkage / Init	Input Space	k	Avg Silhouette	Interpretation
Hierarchical (Ward’s)	Ward’s D2	4 PCA components	4	0.4670	Nested tree structure, useful for understanding hierarchy
K-Means	Random centroids, nstart = 50	4 PCA components	4	0.4768	Hard partitions, computationally efficient, easy to deploy

💡 Professional Interpretation

Neither method is universally superior. Hierarchical clustering excels at revealing nested similarity structures, while K-means produces clean, deployable hard partitions. Running both and comparing their silhouette scores strengthens the credibility of the final clustering — if both methods agree on which countries belong together, the groupings are more trustworthy.

17 · Cluster Profiling

Translating cluster IDs into meaningful economic categories is where data science becomes decision support. Each cluster is characterized by its average values on the five most interpretable indicators.

analysis_df <- df |>
  mutate(
    cluster_km = factor(km_clusters),
    cluster_hc = factor(hc_clusters)
  )

profile_fn <- function(data, clust_col) {
  data |>
    group_by({{ clust_col }}) |>
    summarise(
      N           = n(),
      Avg_GDP     = round(mean(gdpp,       na.rm = TRUE), 0),
      Avg_Income  = round(mean(income,     na.rm = TRUE), 0),
      Avg_Health  = round(mean(health,     na.rm = TRUE), 1),
      Avg_ChildMort = round(mean(child_mort, na.rm = TRUE), 1),
      Avg_LifeExp = round(mean(life_expec,  na.rm = TRUE), 1),
      .groups = "drop"
    )
}

km_profile <- profile_fn(analysis_df, cluster_km)
hc_profile <- profile_fn(analysis_df, cluster_hc)

km_profile |>
  knitr::kable(caption = "**Table 12** — K-Means cluster profiles", booktabs = TRUE)

**Table 12** — K-Means cluster profiles
cluster_km	N	Avg_GDP	Avg_Income	Avg_Health	Avg_ChildMort	Avg_LifeExp
1	88	7333	13456	482.9	20.9	73.2
2	48	1909	3897	114.8	91.6	59.2
3	29	44017	45800	4085.0	5.1	80.4
4	2	75800	81900	5001.9	2.8	82.0

hc_profile |>
  knitr::kable(caption = "**Table 13** — Hierarchical cluster profiles", booktabs = TRUE)

**Table 13** — Hierarchical cluster profiles
cluster_hc	N	Avg_GDP	Avg_Income	Avg_Health	Avg_ChildMort	Avg_LifeExp
1	48	1909	3897	114.8	91.6	59.2
2	82	6287	12305	393.1	22.0	72.9
3	35	40177	42951	3677.9	5.4	79.8
4	2	75800	81900	5001.9	2.8	82.0

analysis_df |>
  pivot_longer(
    cols     = c(child_mort, gdpp, income, health, life_expec),
    names_to = "indicator", values_to = "value"
  ) |>
  mutate(indicator = recode(indicator,
    child_mort = "Child Mortality",
    gdpp       = "GDP per Capita",
    income     = "Income",
    health     = "Health Expenditure",
    life_expec = "Life Expectancy"
  )) |>
  ggplot(aes(x = cluster_km, y = value, fill = cluster_km)) +
  geom_boxplot(alpha = 0.80, outlier.size = 1.5, outlier.alpha = 0.5) +
  facet_wrap(~ indicator, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("#4EA8DE", "#E63946", "#2EC4B6", "#FF9F1C")) +
  labs(
    title    = "Key Indicators by K-Means Cluster",
    subtitle = "Boxplots reveal how clusters differ across health and economic dimensions",
    x        = "K-Means Cluster",
    y        = "Value",
    fill     = "Cluster"
  ) +
  theme(legend.position = "top")

Cluster interpretation guide: The cluster with the highest child mortality and lowest GDP per capita represents the most vulnerable nations. The cluster with the highest GDP and income represents high-income economies. The remaining clusters capture middle-income development tiers. The next two sections identify and rank the highest-need cluster precisely.

18 · Identify the Highest-Need Cluster

The priority cluster is identified programmatically: the cluster with the highest average child mortality rate, broken by the lowest average GDP per capita and lowest average health expenditure.

priority_km <- km_profile |>
  arrange(desc(Avg_ChildMort), Avg_GDP, Avg_Health) |>
  slice(1)

priority_hc <- hc_profile |>
  arrange(desc(Avg_ChildMort), Avg_GDP, Avg_Health) |>
  slice(1)

priority_km |>
  knitr::kable(
    caption  = "**Table 14** — Highest-need cluster (K-Means)",
    booktabs = TRUE
  )

**Table 14** — Highest-need cluster (K-Means)
cluster_km	N	Avg_GDP	Avg_Income	Avg_Health	Avg_ChildMort	Avg_LifeExp
2	48	1909	3897	114.8	91.6	59.2

priority_hc |>
  knitr::kable(
    caption  = "**Table 15** — Highest-need cluster (Hierarchical)",
    booktabs = TRUE
  )

**Table 15** — Highest-need cluster (Hierarchical)
cluster_hc	N	Avg_GDP	Avg_Income	Avg_Health	Avg_ChildMort	Avg_LifeExp
1	48	1909	3897	114.8	91.6	59.2

19 · Need Score Construction & Country Ranking

The need score converts the cluster analysis into an actionable, country-level ranking. It is a composite index built from three Z-scored indicators:

\[\boxed{\text{Need Score}_i = z(\text{child\_mort}_i) - z(\text{gdpp}_i) - z(\text{health}_i)}\]

Higher values indicate greater unmet need. The formula deliberately rewards high child mortality (positive) and penalizes high GDP and health expenditure (negative), since those reflect capacity to self-fund development.

priority_countries <- analysis_df |>
  filter(
    cluster_km == priority_km$cluster_km |
    cluster_hc == priority_hc$cluster_hc
  ) |>
  mutate(
    z_child = as.numeric(scale(child_mort)),
    z_gdpp  = as.numeric(scale(gdpp)),
    z_health = as.numeric(scale(health)),
    need_score = z_child - z_gdpp - z_health
  ) |>
  arrange(desc(need_score))

priority_countries |>
  select(country, child_mort, gdpp, income, health, life_expec, need_score) |>
  mutate(
    need_score = round(need_score, 3),
    gdpp       = round(gdpp, 0),
    health     = round(health, 1)
  ) |>
  head(20) |>
  knitr::kable(
    caption  = "**Table 16** — Top 20 countries ranked by Need Score",
    booktabs = TRUE
  )

**Table 16** — Top 20 countries ranked by Need Score
country	child_mort	gdpp	income	health	life_expec	need_score
Haiti	208.0	662	1500	45.7	32.1	4.235
Sierra Leone	160.0	399	1220	52.3	55.0	2.887
Central African Republic	149.0	446	888	17.8	47.5	2.759
Chad	150.0	897	1930	40.6	56.5	2.495
Mali	137.0	708	1870	35.3	59.5	2.214
Niger	123.0	348	814	18.0	58.8	2.033
Congo, Dem. Rep.	116.0	334	609	26.4	57.5	1.783
Burkina Faso	116.0	575	1430	38.8	57.9	1.626
Guinea-Bissau	114.0	547	1390	46.5	55.6	1.531
Benin	111.0	758	1820	31.1	61.8	1.464
Guinea	109.0	648	1190	31.9	58.0	1.438
Mozambique	101.0	419	918	21.8	54.5	1.345
Burundi	93.6	231	764	26.8	57.7	1.163
Cote d’Ivoire	111.0	1220	2690	64.7	56.3	1.104
Malawi	90.5	459	1030	30.2	53.1	0.974
Cameroon	108.0	1310	2660	67.2	57.3	0.970
Nigeria	130.0	2330	5150	118.1	60.5	0.955
Liberia	89.3	327	700	38.6	60.8	0.934
Togo	90.3	488	1210	37.3	58.7	0.916
Pakistan	92.1	1040	4280	22.9	65.3	0.867

priority_countries |>
  slice_head(n = 20) |>
  ggplot(aes(
    x    = reorder(country, need_score),
    y    = need_score,
    fill = need_score
  )) +
  geom_col(width = 0.75, color = "white", linewidth = 0.25) +
  geom_text(
    aes(label = round(need_score, 2)),
    hjust = -0.12, size = 3, color = "#1E293B"
  ) +
  coord_flip() +
  scale_fill_gradient2(
    low      = "#ADE8F4",
    mid      = "#4EA8DE",
    high     = "#E63946",
    midpoint = median(priority_countries$need_score[1:20])
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.14))) +
  labs(
    title    = "Top 20 Countries Ranked by Composite Need Score",
    subtitle = "Score = z(child_mort) − z(gdpp) − z(health) · Higher = greater aid priority",
    x        = NULL,
    y        = "Need Score",
    fill     = "Score",
    caption  = "Source: Country-data.csv · Analysis by Eliya Christopher Nandi"
  ) +
  theme(legend.position = "none")

🔴 Aid Priority — Top Countries

The top-ranked nations combine the highest child mortality, lowest GDP per capita, and lowest health expenditure — a convergence of vulnerabilities that no single indicator alone would fully capture. This ranking is the core deliverable of the analysis.

20 · Country Membership by Cluster

For transparency and reproducibility, every country’s cluster assignment is listed explicitly.

analysis_df |>
  arrange(cluster_km, country) |>
  group_by(cluster_km) |>
  summarise(Countries = paste(as.character(country), collapse = ", "), .groups = "drop") |>
  knitr::kable(
    caption  = "**Table 17** — Country membership by K-Means cluster",
    booktabs = TRUE
  )

**Table 17** — Country membership by K-Means cluster
cluster_km	Countries
1	Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Slovak Republic, South Korea, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam
2	Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia
3	Australia, Austria, Belgium, Brunei, Canada, Cyprus, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Malta, Netherlands, New Zealand, Norway, Qatar, Slovenia, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States
4	Luxembourg, Singapore

analysis_df |>
  arrange(cluster_hc, country) |>
  group_by(cluster_hc) |>
  summarise(Countries = paste(as.character(country), collapse = ", "), .groups = "drop") |>
  knitr::kable(
    caption  = "**Table 18** — Country membership by Hierarchical cluster",
    booktabs = TRUE
  )

**Table 18** — Country membership by Hierarchical cluster
cluster_hc	Countries
1	Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia
2	Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam
3	Australia, Austria, Bahamas, Bahrain, Belgium, Brunei, Canada, Cyprus, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Malta, Netherlands, New Zealand, Norway, Portugal, Qatar, Slovak Republic, Slovenia, South Korea, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States
4	Luxembourg, Singapore

21 · Final Recommendations

✅ Data-Driven Recommendations

Based on the clustering analysis and need-score rankings, aid and development investment should be prioritized toward the countries at the top of Table 16. These nations face a compounded disadvantage: high child mortality + low economic capacity + low health system investment.

Specifically, the analysis recommends:

Immediate health interventions — reduce under-5 mortality through vaccination, nutrition, and maternal care programs.
Economic development — build GDP-generating capacity and income infrastructure in the lowest-tier cluster.
Health system investment — increase public health expenditure to break the feedback loop between poverty and poor health outcomes.
Multi-indicator monitoring — reassess country rankings annually as socio-economic conditions evolve.

⚠️ Important Limitations

This analysis is a data-driven decision-support tool, not a policy prescription. Real-world aid allocation must also account for: population size, political stability, humanitarian access, conflict risk, governance quality, and on-the-ground field assessments. Quantitative rankings should be treated as a starting point for expert deliberation, not a final answer.

22 · Conclusion

This project demonstrates how unsupervised machine learning methods — applied rigorously and transparently — can extract actionable insights from cross-country socio-economic data.

Key contributions of this analysis:

Feature engineering converted percentage-based economic variables into comparable per-capita monetary values, improving the quality of the distance calculations.
PCA effectively compressed nine correlated indicators into a lower-dimensional space retaining 93.4% of total variance.
K-means and hierarchical clustering both identified consistent country groupings; silhouette validation confirmed the reliability of the structure.
The need-score framework produced a transparent, reproducible country ranking that translates abstract cluster assignments into a concrete prioritization order.

The approach taken here follows best practices for reproducible, explainable, and production-quality data science portfolio work.

23 · Reproducibility Instructions

To reproduce this report on your machine:

1.  Place  Country-data.csv  in the same directory as this .Rmd file.
2.  Open the .Rmd file in RStudio (version ≥ 2022.07 recommended).
3.  Install missing packages if prompted:
    install.packages(c("tidyverse","skimr","ggcorrplot","factoextra","cluster","knitr"))
4.  Click  Knit → Knit to HTML  (or run rmarkdown::render("...Rmd")).
5.  Upload both the .Rmd and the knitted .html to GitHub for portfolio display.

To use a different dataset or cluster count, change data_path or k_clusters in the YAML header — all downstream results update automatically.

24 · Session Information

Full session information is recorded below for complete reproducibility. This captures the R version, platform, and exact package versions used.

sessionInfo()

#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.utf8 
#> [2] LC_CTYPE=English_United Kingdom.utf8   
#> [3] LC_MONETARY=English_United Kingdom.utf8
#> [4] LC_NUMERIC=C                           
#> [5] LC_TIME=English_United Kingdom.utf8    
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] keras3_1.5.0       knitr_1.51         cluster_2.1.8.2    factoextra_1.0.7  
#>  [5] ggcorrplot_0.1.4.1 skimr_2.2.2        lubridate_1.9.5    forcats_1.0.1     
#>  [9] stringr_1.6.0      dplyr_1.2.1        purrr_1.2.0        readr_2.2.0       
#> [13] tidyr_1.3.1        tibble_3.3.0       ggplot2_4.0.2      tidyverse_2.0.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       ggrepel_0.9.6     
#>  [5] rstatix_0.7.3      lattice_0.22-7     tzdb_0.5.0         tfruns_1.5.4      
#>  [9] vctrs_0.7.3        tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3   
#> [13] Matrix_1.7-4       RColorBrewer_1.1-3 S7_0.2.0           lifecycle_1.0.5   
#> [17] compiler_4.5.2     farver_2.1.2       repr_1.1.7         codetools_0.2-20  
#> [21] carData_3.0-6      htmltools_0.5.9    sass_0.4.10        yaml_2.3.12       
#> [25] Formula_1.2-5      whisker_0.4.1      pillar_1.11.1      car_3.1-5         
#> [29] ggpubr_0.6.2       jquerylib_0.1.4    cachem_1.1.0       abind_1.4-8       
#> [33] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7      reshape2_1.4.5    
#> [37] labeling_0.4.3     rprojroot_2.1.1    fastmap_1.2.0      grid_4.5.2        
#> [41] here_1.0.2         cli_3.6.5          magrittr_2.0.4     base64enc_0.1-6   
#> [45] dotty_0.1.0        broom_1.0.12       withr_3.0.2        scales_1.4.0      
#> [49] backports_1.5.0    timechange_0.4.0   rmarkdown_2.30     otel_0.2.0        
#> [53] reticulate_1.44.1  ggsignif_0.6.4     png_0.1-8          hms_1.1.4         
#> [57] evaluate_1.0.5     rlang_1.2.0        zeallot_0.2.0      Rcpp_1.1.0        
#> [61] glue_1.8.0         rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1          
#> [65] plyr_1.8.9         tensorflow_2.20.0

Socio-Economic Country Clustering for Aid Prioritization

An End-to-End Unsupervised Machine Learning Analysis — PCA · K-Means · Hierarchical Clustering · Need-Score Ranking

Eliya Christopher Nandi [eliyanandi07@gmail.com]

May 16, 2026

Executive Summary

1 · Project Objectives

2 · Methodology Pipeline

3 · Load Required Libraries

Code & Rationale

4 · Load & Inspect the Dataset

5 · Data Dictionary

6 · Feature Engineering

7 · Dataset Overview

8 · Exploratory Data Analysis

8.1 Distributions

8.2 Log-Scale View

8.3 Outlier Boxplots

9 · Standardization & Multivariate Outliers

9.1 Z-Score Normalization

9.2 Multivariate Outlier Ranking

10 · Correlation Analysis

11 · Dimensionality Reduction with PCA

11.1 Scree Plot

11.2 Cumulative Explained Variance

11.3 PCA Biplot — Variable Loadings

12 · Optional — Autoencoder Comparison

13 · Prepare PCA Components for Clustering

14 · Hierarchical Clustering

14.1 Build & Visualize Dendrogram

14.2 Cluster Sizes & Silhouette

15 · K-Means Clustering

15.1 Elbow Method

15.2 Fit Model & Cluster Sizes

15.3 Cluster Visualization in PCA Space

15.4 Silhouette Analysis

16 · Method Comparison

17 · Cluster Profiling

18 · Identify the Highest-Need Cluster

19 · Need Score Construction & Country Ranking

20 · Country Membership by Cluster

21 · Final Recommendations

22 · Conclusion

23 · Reproducibility Instructions

24 · Session Information