Executive Summary

This report presents a rigorous, end-to-end unsupervised machine learning study on 167 countries, using socio-economic and health data to segment nations into meaningful groups and identify which require urgent development aid.

Unsupervised Learning PCA · Dimensionality Reduction K-Means Clustering Hierarchical Clustering Silhouette Validation Aid Prioritization R · tidyverse · factoextra

🔍 Central Research Question

How can countries be meaningfully grouped based on their socio-economic and health profiles, and which countries should be prioritized for humanitarian aid using a transparent, data-driven scoring framework?

The full workflow covers feature engineering → exploratory analysis → standardization → PCA → K-means and hierarchical clustering → silhouette validation → cluster profiling → need-score ranking.


1 · Project Objectives

This analysis targets four concrete questions that a practitioner or policy analyst would ask:

  1. What natural groupings exist among countries based on health and economic indicators?
  2. Can PCA reveal a low-dimensional structure in the data without significant information loss?
  3. How do K-means and hierarchical clustering compare when applied to reduced PCA space?
  4. Which specific countries show the strongest need for aid, and can that ranking be explained transparently?

2 · Methodology Pipeline

The project follows an industry-standard data science workflow. Each stage builds directly on the previous one, ensuring that every modeling decision is justified by the data.

Load & InspectUnderstand structure and variable types
Feature EngineeringConvert % GDP values to absolute figures
EDADistributions, outliers, correlations
StandardizeZ-score scaling for fair comparison
PCAReduce to 90%-variance components
ClusteringK-means + hierarchical, k = 4
ValidationSilhouette scores for both methods
ProfilingInterpret cluster identities
Need ScoreRank & recommend priority nations

3 · Load Required Libraries

Code & Rationale

Before the analysis begins, all required packages are loaded. A defensive loader function is used so that the report fails early with a helpful message if any dependency is missing — a best practice for reproducible research.

Package Role in this analysis
tidyverse Data wrangling, reshaping, and all ggplot2 visualizations
skimr Rich, formatted summary statistics
ggcorrplot Correlation heatmap
factoextra PCA scree plots, cluster visualizations, silhouette plots
cluster Silhouette computation
knitr Table rendering in the knitted report
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE))
    stop(sprintf(
      "Package '%s' is required but not installed.\nRun: install.packages('%s')", pkg, pkg
    ))
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}

required_packages <- c(
  "tidyverse", "skimr", "ggcorrplot",
  "factoextra", "cluster", "knitr"
)

invisible(lapply(required_packages, load_package))

# Set a clean, consistent ggplot2 theme for the entire report
theme_report <- function() {
  theme_minimal(base_size = 12, base_family = "sans") +
  theme(
    plot.title    = element_text(face = "bold", color = "#0D1F2D", size = 14, margin = margin(b = 4)),
    plot.subtitle = element_text(color = "#64748B", size = 11, margin = margin(b = 12)),
    plot.caption  = element_text(color = "#94A3B8", size = 9, hjust = 0),
    axis.title    = element_text(color = "#2C4A5E", size = 11, face = "bold"),
    axis.text     = element_text(color = "#475569"),
    strip.text    = element_text(face = "bold", color = "#0D1F2D"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "#E2E8F0", linewidth = 0.5),
    legend.position  = "bottom",
    legend.text      = element_text(size = 10)
  )
}

theme_set(theme_report())
✓ Environment ready. If this chunk runs without errors all six packages are available. The custom theme_report() function guarantees visual consistency across every plot in the report.

4 · Load & Inspect the Dataset

The raw dataset is read from disk and immediately inspected. Knowing the number of rows, columns, and variable types before touching the data prevents silent errors later in the pipeline.

if (!file.exists(params$data_path))
  stop(paste0(
    "Dataset not found: '", params$data_path, "'.\n",
    "Place Country-data.csv in the same folder as this .Rmd file."
  ))

df_raw <- read.csv(params$data_path)

n_countries <- nrow(df_raw)
n_vars      <- ncol(df_raw)

cat("Rows (countries):", n_countries, "\nColumns (variables):", n_vars, "\n")
#> Rows (countries): 167 
#> Columns (variables): 10
head(df_raw, 8) |>
  knitr::kable(
    caption  = "**Table 1** — Raw dataset preview (first 8 rows)",
    digits   = 2,
    booktabs = TRUE
  )
Table 1 — Raw dataset preview (first 8 rows)
country child_mort exports health imports income inflation life_expec total_fer gdpp
Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200
Argentina 14.5 18.9 8.10 16.0 18700 20.90 75.8 2.37 10300
Armenia 18.1 20.8 4.40 45.3 6700 7.77 73.3 1.69 3220
Australia 4.8 19.8 8.73 20.9 41400 1.16 82.0 1.93 51900
str(df_raw)
#> 'data.frame':    167 obs. of  10 variables:
#>  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
#>  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
#>  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
#>  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
#>  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
#>  $ income    : int  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
#>  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
#>  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
#>  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
#>  $ gdpp      : int  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
Key observations: The country column is a character identifier — it will be used as a label throughout but excluded from all numerical computations. Every remaining column is numeric. No immediate structural issues detected.

5 · Data Dictionary

Understanding what each variable measures is essential before any transformation or modelling decision.

tibble::tribble(
  ~Variable,    ~Type,       ~Unit,               ~Description,
  "country",    "Character", "—",                 "Country name (label, not used in modelling)",
  "child_mort", "Numeric",   "per 1,000 births",  "Under-5 child mortality rate",
  "exports",    "Numeric",   "% of GDP (raw)",    "Exports of goods and services",
  "health",     "Numeric",   "% of GDP (raw)",    "Total health expenditure",
  "imports",    "Numeric",   "% of GDP (raw)",    "Imports of goods and services",
  "income",     "Numeric",   "USD",               "Net income per person",
  "inflation",  "Numeric",   "% per year",        "Annual consumer price inflation",
  "life_expec", "Numeric",   "years",             "Life expectancy at birth",
  "total_fer",  "Numeric",   "children/woman",    "Total fertility rate",
  "gdpp",       "Numeric",   "USD",               "GDP per capita"
) |>
  knitr::kable(
    caption  = "**Table 2** — Data dictionary",
    booktabs = TRUE
  )
Table 2 — Data dictionary
Variable Type Unit Description
country Character Country name (label, not used in modelling)
child_mort Numeric per 1,000 births Under-5 child mortality rate
exports Numeric % of GDP (raw) Exports of goods and services
health Numeric % of GDP (raw) Total health expenditure
imports Numeric % of GDP (raw) Imports of goods and services
income Numeric USD Net income per person
inflation Numeric % per year Annual consumer price inflation
life_expec Numeric years Life expectancy at birth
total_fer Numeric children/woman Total fertility rate
gdpp Numeric USD GDP per capita
⚠️ Important — Scale Mismatch

exports, health, and imports are recorded as percentages of GDP. Feeding raw percentages into clustering alongside absolute USD values would distort distances. The next section converts them to estimated per-capita monetary values.


6 · Feature Engineering

The three percentage-of-GDP variables are converted into estimated per-capita monetary values by multiplying by GDP per capita. This puts all economic indicators on a comparable absolute scale before any further analysis.

\[\text{exports}_{\text{abs}} = \frac{\text{exports}_{\%} \times \text{gdpp}}{100}\]

The same formula is applied to health and imports.

df <- df_raw |>
  mutate(
    exports = (exports * gdpp) / 100,
    health  = (health  * gdpp) / 100,
    imports = (imports * gdpp) / 100,
    gdpp    = as.numeric(gdpp),
    income  = as.numeric(income),
    country = factor(country)
  )

missing_count <- sum(is.na(df))
cat("Total missing values after engineering:", missing_count, "\n")
#> Total missing values after engineering: 0
head(df, 6) |>
  knitr::kable(
    caption  = "**Table 3** — Dataset after feature engineering",
    digits   = 1,
    booktabs = TRUE
  )
Table 3 — Dataset after feature engineering
country child_mort exports health imports income inflation life_expec total_fer gdpp
Afghanistan 90.2 55.3 41.9 248.3 1610 9.4 56.2 5.8 553
Albania 16.6 1145.2 267.9 1987.7 9930 4.5 76.3 1.6 4090
Algeria 27.3 1712.6 186.0 1400.4 12900 16.1 76.5 2.9 4460
Angola 119.0 2199.2 100.6 1514.4 5900 22.4 60.1 6.2 3530
Antigua and Barbuda 10.3 5551.0 735.7 7185.8 19100 1.4 76.8 2.1 12200
Argentina 14.5 1946.7 834.3 1648.0 18700 20.9 75.8 2.4 10300
Result: All three percentage variables now represent estimated USD-per-capita amounts, making them directly comparable with income and gdpp. No missing values were introduced by the transformation.

7 · Dataset Overview

A high-level numerical summary of the cleaned dataset confirms the scale differences that motivate standardization in Section 9.

X <- df |> select(-country)   # numeric matrix used in all subsequent steps

tibble(
  Metric  = c("Countries in dataset", "Numeric indicators", "Missing values"),
  Value   = c(nrow(df), ncol(X), missing_count)
) |>
  knitr::kable(caption = "**Table 4** — Cleaned dataset at a glance", booktabs = TRUE)
Table 4 — Cleaned dataset at a glance
Metric Value
Countries in dataset 167
Numeric indicators 9
Missing values 0
Countries
167
Global coverage
Indicators
9
Economic + health
Missing Values
0
Complete dataset
Clustering Target
k = 4
Groups to identify
skimr::skim(X)
Data summary
Name X
Number of rows 167
Number of columns 9
_______________________
Column type frequency:
numeric 9
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
child_mort 0 1 38.27 40.33 2.60 8.25 19.30 62.10 208.00 ▇▂▂▁▁
exports 0 1 7420.62 17973.89 1.08 447.14 1777.44 7278.00 183750.00 ▇▁▁▁▁
health 0 1 1056.73 1801.41 12.82 78.54 321.89 976.94 8663.60 ▇▁▁▁▁
imports 0 1 6588.35 14710.81 0.65 640.21 2045.58 7719.60 149100.00 ▇▁▁▁▁
income 0 1 17144.69 19278.07 609.00 3355.00 9960.00 22800.00 125000.00 ▇▂▁▁▁
inflation 0 1 7.78 10.57 -4.21 1.81 5.39 10.75 104.00 ▇▁▁▁▁
life_expec 0 1 70.56 8.89 32.10 65.30 73.10 76.80 82.80 ▁▁▃▅▇
total_fer 0 1 2.95 1.51 1.15 1.79 2.41 3.88 7.49 ▇▃▂▂▁
gdpp 0 1 12964.16 18328.70 231.00 1330.00 4660.00 14050.00 105000.00 ▇▁▁▁▁
Key takeaway: Variables span vastly different ranges — GDP per capita ranges from tens of dollars to tens of thousands, while fertility is bounded between ~1 and ~8. This scale disparity confirms that Z-score standardization is mandatory before distance-based methods like PCA and clustering.

8 · Exploratory Data Analysis

Exploratory analysis answers three questions before any modelling: (1) Are distributions roughly normal or heavily skewed? (2) How correlated are the variables? (3) Are there extreme outliers that may distort distance calculations?

8.1 Distributions

Each variable is plotted as a histogram. Variables with right-skewed distributions will benefit from the log-scale view in the next tab.

df |>
  select(-country) |>
  pivot_longer(everything(), names_to = "feature", values_to = "value") |>
  ggplot(aes(x = value, fill = feature)) +
  geom_histogram(bins = 28, color = "white", alpha = 0.90, show.legend = FALSE) +
  facet_wrap(~ feature, scales = "free", ncol = 3) +
  scale_fill_manual(values = rep(c("#4EA8DE","#2EC4B6","#E63946","#FF9F1C",
                                   "#6366F1","#8B5CF6","#14B8A6","#F59E0B","#10B981"), 2)) +
  labs(
    title    = "Distribution of All Socio-Economic & Health Indicators",
    subtitle = "Histograms on raw scales — note strong right-skew in economic variables",
    x = NULL, y = "Countries"
  )

Finding: GDP per capita, income, exports, imports, and health expenditure are all strongly right-skewed — a handful of wealthy nations pull the tails far to the right. Child mortality and fertility show a different shape, closer to log-normal. Inflation shows the most irregular distribution, likely driven by countries experiencing hyperinflation.

8.2 Log-Scale View

Log-transformed views reveal the true spread of the right-skewed economic variables.

df |>
  select(gdpp, income, child_mort, exports, health, imports) |>
  pivot_longer(everything(), names_to = "feature", values_to = "value") |>
  filter(value > 0) |>
  ggplot(aes(x = value, fill = feature)) +
  geom_histogram(bins = 28, color = "white", alpha = 0.9, show.legend = FALSE) +
  facet_wrap(~ feature, scales = "free", ncol = 3) +
  scale_x_log10(labels = scales::label_comma()) +
  scale_fill_manual(values = c("#4EA8DE","#2EC4B6","#E63946","#FF9F1C","#6366F1","#8B5CF6")) +
  labs(
    title    = "Selected Variables on a Log₁₀ Scale",
    subtitle = "Log scaling normalises the long right tails and reveals the full country spread",
    x = "Log₁₀ value", y = "Countries"
  )

Finding: On the log scale, distributions become far more symmetric, confirming the multiplicative nature of economic inequality across nations. Countries are spread over two to three orders of magnitude in GDP and income.

8.3 Outlier Boxplots

Boxplots highlight country-level outliers across every indicator simultaneously.

df |>
  select(-country) |>
  pivot_longer(everything(), names_to = "feature", values_to = "value") |>
  ggplot(aes(x = reorder(feature, value, FUN = median), y = value, fill = feature)) +
  geom_boxplot(
    outlier.colour = "#E63946",
    outlier.size   = 2.2,
    outlier.alpha  = 0.70,
    alpha          = 0.80,
    show.legend    = FALSE
  ) +
  coord_flip() +
  labs(
    title    = "Feature-Level Outlier Detection via Boxplots",
    subtitle = "Red points represent potential outliers — common in cross-country economic data",
    x = NULL, y = "Value"
  )

Finding: GDP, income, exports, and health show the most extreme outliers. These likely correspond to small, high-income economies (e.g. Singapore, Luxembourg) or highly trade-oriented nations. These observations are real data points, not errors — they will be handled by Z-score standardization rather than removal.

9 · Standardization & Multivariate Outliers

9.1 Z-Score Normalization

K-means minimizes squared Euclidean distances, and PCA is variance-weighted. Both are sensitive to scale. Z-score normalization transforms every variable to mean = 0 and standard deviation = 1, ensuring no single variable dominates purely because of its measurement units.

X_scaled <- scale(X)

tibble(
  Feature           = colnames(X_scaled),
  Mean_after_scale  = round(colMeans(X_scaled), 5),
  SD_after_scale    = round(apply(X_scaled, 2, sd), 5)
) |>
  knitr::kable(
    caption  = "**Table 5** — Standardization verification (mean ≈ 0, SD ≈ 1 for all features)",
    booktabs = TRUE
  )
Table 5 — Standardization verification (mean ≈ 0, SD ≈ 1 for all features)
Feature Mean_after_scale SD_after_scale
child_mort 0 1
exports 0 1
health 0 1
imports 0 1
income 0 1
inflation 0 1
life_expec 0 1
total_fer 0 1
gdpp 0 1
Verification passed. Every feature now has mean essentially equal to 0 and standard deviation equal to 1. The data is ready for PCA and clustering.

9.2 Multivariate Outlier Ranking

Beyond individual-variable outliers, some countries may be extreme across multiple dimensions simultaneously. The multivariate outlier score is the Euclidean distance from the standardized data centroid (the origin in Z-score space).

\[\text{Outlier Score}_i = \sqrt{\sum_{j=1}^{p} z_{ij}^2}\]

outlier_rank <- tibble(
  country       = df$country,
  outlier_score = sqrt(rowSums(X_scaled^2))
) |>
  arrange(desc(outlier_score))

head(outlier_rank, 12) |>
  mutate(outlier_score = round(outlier_score, 3)) |>
  knitr::kable(
    caption  = "**Table 6** — Top 12 multivariate outliers (ranked by standardized distance from centroid)",
    booktabs = TRUE
  )
Table 6 — Top 12 multivariate outliers (ranked by standardized distance from centroid)
country outlier_score
Luxembourg 15.778
Nigeria 9.709
Singapore 8.044
Qatar 6.896
Switzerland 6.828
Norway 6.703
Haiti 6.188
United States 5.162
Ireland 4.946
Denmark 4.797
Netherlands 4.729
Central African Republic 4.320
outlier_rank |>
  slice_head(n = 15) |>
  ggplot(aes(x = reorder(country, outlier_score), y = outlier_score, fill = outlier_score)) +
  geom_col(width = 0.72, color = "white", linewidth = 0.3) +
  geom_text(aes(label = round(outlier_score, 2)), hjust = -0.15, size = 3, color = "#1E293B") +
  coord_flip() +
  scale_fill_gradient2(low = "#ADE8F4", mid = "#4EA8DE", high = "#E63946",
                       midpoint = median(outlier_rank$outlier_score[1:15])) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Top 15 Multivariate Outliers",
    subtitle = "Countries furthest from the dataset centroid in standardized 9-dimensional space",
    x = NULL, y = "Outlier Score (standardized Euclidean distance)",
    fill = "Score"
  ) +
  theme(legend.position = "none")

Insight: Top outliers tend to be high-income, high-trade economies (very high GDP, income, and exports) — or occasionally, countries in extreme humanitarian distress (very high child mortality, very low GDP). Both extremes are legitimate data and will form their own natural clusters.

10 · Correlation Analysis

Before applying PCA, it is worth checking whether the variables are correlated. High inter-variable correlation is the key condition that makes PCA effective — it means a smaller set of components can capture most of the variance.

corr_matrix <- cor(X, use = "pairwise.complete.obs")

ggcorrplot(
  corr_matrix,
  type     = "lower",
  lab      = TRUE,
  lab_size = 3.2,
  tl.cex   = 10,
  colors   = c("#E63946", "#FFFFFF", "#4EA8DE"),
  outline.color = "#E2E8F0"
) +
  labs(
    title    = "Correlation Heatmap of All Indicators",
    subtitle = "Blue = positive correlation · Red = negative correlation · Values = Pearson r"
  ) +
  theme(
    plot.title    = element_text(face = "bold", color = "#0D1F2D"),
    plot.subtitle = element_text(color = "#64748B")
  )

Notable correlation clusters:

  • Development axis (positive): gdpp, income, life_expec, health, exports, imports all correlate positively — wealthier countries spend more on health, trade more, and live longer.
  • Inverse relationships: child_mort and total_fer are strongly negatively correlated with gdpp and life_expec — high fertility and child mortality are hallmarks of lower-income countries.
  • PCA justification: The strong inter-correlations confirm that PCA will effectively summarize this dataset into a small number of orthogonal components.

11 · Dimensionality Reduction with PCA

Principal Component Analysis (PCA) rotates the data into a new coordinate system where the first axis (PC1) explains the maximum possible variance, PC2 explains the next most, and so on. This allows us to cluster in a lower-dimensional space that is less susceptible to noise.

pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)  # already scaled

explained_var  <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumulative_var <- cumsum(explained_var)

pca_tbl <- tibble(
  Component            = paste0("PC", seq_along(explained_var)),
  `Variance Explained` = paste0(round(explained_var * 100, 2), "%"),
  `Cumulative Variance`= paste0(round(cumulative_var * 100, 2), "%")
)

num_components <- which(cumulative_var >= 0.90)[1]

pca_tbl |>
  knitr::kable(
    caption  = "**Table 7** — PCA variance explained per component",
    booktabs = TRUE
  )
Table 7 — PCA variance explained per component
Component Variance Explained Cumulative Variance
PC1 58.94% 58.94%
PC2 18.45% 77.38%
PC3 9.91% 87.29%
PC4 6.07% 93.37%
PC5 3.03% 96.4%
PC6 2.46% 98.86%
PC7 0.94% 99.79%
PC8 0.16% 99.95%
PC9 0.05% 100%
cat("Components needed to retain ≥90% variance:", num_components,
    "\nCumulative variance retained:", round(cumulative_var[num_components]*100, 2), "%\n")
#> Components needed to retain ≥90% variance: 4 
#> Cumulative variance retained: 93.37 %

11.1 Scree Plot

The scree plot shows the “elbow” where additional components contribute diminishing variance.

factoextra::fviz_eig(
  pca_result,
  addlabels  = TRUE,
  ylim       = c(0, 62),
  barfill    = "#4EA8DE",
  barcolor   = "#3D6B8C",
  linecolor  = "#E63946",
  ncp        = 9
) +
  labs(
    title    = "PCA Scree Plot — Variance Explained by Each Component",
    subtitle = "The first two components alone explain over 70% of total variance"
  ) +
  theme_report()

11.2 Cumulative Explained Variance

The cumulative curve shows how quickly components accumulate information, with the 90% threshold highlighted.

tibble(n = seq_along(cumulative_var), cumvar = cumulative_var) |>
  ggplot(aes(x = n, y = cumvar)) +
  geom_area(alpha = 0.12, fill = "#4EA8DE") +
  geom_line(color = "#4EA8DE", linewidth = 1.2) +
  geom_point(color = "#E63946", size = 3.5, shape = 21, fill = "white", stroke = 2) +
  geom_hline(yintercept = 0.90, linetype = "dashed", color = "#2EC4B6", linewidth = 1) +
  annotate("text", x = 1.2, y = 0.915, label = "90% threshold",
           color = "#0F766E", size = 3.5, fontface = "bold") +
  geom_vline(xintercept = num_components, linetype = "dotted", color = "#E63946", linewidth = 0.9) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_x_continuous(breaks = 1:9) +
  labs(
    title    = "Cumulative Explained Variance",
    subtitle = paste0("Retaining ", num_components, " components captures the 90% threshold"),
    x        = "Number of Components",
    y        = "Cumulative Variance Explained"
  )

✅ PCA Decision

4 principal components are retained, preserving 93.37% of the original variance. This reduced representation removes noise while keeping virtually all of the structured information needed for clustering.

11.3 PCA Biplot — Variable Loadings

The biplot overlays country positions with variable loading arrows, showing which original variables drive each principal component.

factoextra::fviz_pca_biplot(
  pca_result,
  repel           = TRUE,
  col.ind         = "#94A3B8",
  col.var         = "#E63946",
  alpha.ind       = 0.4,
  label           = "var",
  arrowsize       = 0.8,
  labelsize       = 4
) +
  labs(
    title    = "PCA Biplot — Countries & Variable Loadings",
    subtitle = "Arrows show direction and strength of each variable's contribution to PC1 and PC2"
  ) +
  theme_report()

Interpretation: Variables pointing in the same direction are positively correlated. PC1 separates wealthy, high-health-spending countries (right) from poor, high-mortality countries (left). PC2 captures trade-intensity and inflation patterns. This two-dimensional map already reveals a strong country structure that clustering will formalize.

12 · Optional — Autoencoder Comparison

An autoencoder is a neural-network-based dimensionality reduction method that can capture non-linear relationships. This section is parameterized: it runs only when run_autoencoder: true is set in the YAML header.

Why not use it by default? PCA is entirely sufficient here — the high linear correlations shown in Section 10 mean non-linear methods offer no meaningful benefit. PCA is also faster, deterministic, and fully interpretable, making it the more professional choice for this dataset.

# Requires keras3 to be installed and configured.
if (!requireNamespace("keras3", quietly = TRUE))
  stop("Install keras3 to run the autoencoder section.")

library(keras3)

input_dim  <- ncol(X_scaled)
latent_dim <- num_components

inputs      <- layer_input(shape = input_dim)
encoded     <- inputs |> layer_dense(128, activation = "relu") |>
               layer_dense(latent_dim, activation = "relu")
decoded     <- encoded |> layer_dense(128, activation = "relu") |>
               layer_dense(input_dim, activation = "linear")

autoencoder <- keras_model(inputs, decoded)
autoencoder |> compile(optimizer = "adam", loss = "mse")
autoencoder |> fit(X_scaled, X_scaled,
                   epochs = 60, batch_size = 32,
                   validation_split = 0.15, verbose = 0)

encoder <- keras_model(inputs, encoded)
enc_df  <- predict(encoder, X_scaled) |> as.data.frame() |>
           setNames(paste0("Latent_", seq_len(latent_dim)))
#> 6/6 - 0s - 14ms/step
ggplot(enc_df, aes(Latent_1, Latent_2)) +
  geom_point(alpha = 0.65, color = "#E63946", size = 2.5) +
  labs(title = "Autoencoder — 2D Latent Space Projection",
       x = "Latent Dimension 1", y = "Latent Dimension 2")


13 · Prepare PCA Components for Clustering

The selected principal component scores are extracted into a clean matrix. This is the input for both clustering algorithms.

pca_components <- as.data.frame(pca_result$x[, 1:num_components])
pca_components$country <- df$country

pca_matrix <- pca_components |> select(-country)

head(pca_components, 6) |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  knitr::kable(
    caption  = paste0("**Table 8** — First 6 countries in PCA space (",
                      num_components, " components)"),
    booktabs = TRUE
  )
Table 8 — First 6 countries in PCA space (4 components)
PC1 PC2 PC3 PC4 country
-2.6277 -1.4679 -0.5478 -0.2416 Afghanistan
-0.0241 1.4256 -0.0141 0.4493 Albania
-0.4582 0.6735 0.9565 0.2178 Algeria
-2.7145 -2.1658 0.5984 -0.4327 Angola
0.6467 1.0204 -0.2567 0.2883 Antigua and Barbuda
0.0353 0.6832 1.4643 -0.0288 Argentina
Each row is now a country. Instead of 9 original variables, each country is described by 4 orthogonal component scores. These scores carry no measurement-unit differences and are ready for distance-based clustering.

14 · Hierarchical Clustering

Hierarchical clustering constructs a tree (dendrogram) by successively merging the most similar countries or groups. Ward’s method is used because it minimizes the total within-cluster variance at each merge step, producing compact and well-separated clusters.

14.1 Build & Visualize Dendrogram

dist_pca <- dist(pca_matrix, method = "euclidean")
hc_ward  <- hclust(dist_pca, method = "ward.D2")

par(
  mar   = c(4, 3, 3, 1),
  family = "sans",
  bg    = "white"
)
plot(
  hc_ward,
  labels  = FALSE,
  hang    = -1,
  main    = "Hierarchical Clustering Dendrogram — Ward's Method",
  sub     = "",
  xlab    = "Countries (labels hidden for clarity)",
  ylab    = "Ward's Linkage Height",
  col.main = "#0D1F2D",
  cex.main = 1.1
)
mtext(
  paste0("k = ", params$k_clusters, " clusters highlighted"),
  side = 3, line = 0.3, cex = 0.85, col = "#64748B"
)
rect.hclust(hc_ward, k = params$k_clusters, border = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C"))

Reading the dendrogram: The height of each merger represents the dissimilarity at that point. The four tall vertical gaps near the top of the tree correspond to the four natural clusters — cutting at that height yields the most distinct country groups.

14.2 Cluster Sizes & Silhouette

k_final      <- params$k_clusters
hc_clusters  <- cutree(hc_ward, k = k_final)
pca_components$cluster_hc <- factor(hc_clusters)

table(hc_clusters) |>
  as.data.frame() |>
  setNames(c("Cluster", "Countries")) |>
  knitr::kable(
    caption  = "**Table 9** — Hierarchical clustering: countries per cluster",
    booktabs = TRUE
  )
Table 9 — Hierarchical clustering: countries per cluster
Cluster Countries
1 48
2 82
3 35
4 2
sil_hc     <- silhouette(hc_clusters, dist_pca)
avg_sil_hc <- mean(sil_hc[, 3])

factoextra::fviz_silhouette(sil_hc, palette = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C")) +
  labs(
    title    = paste0("Silhouette Plot — Hierarchical Clustering (k = ", k_final, ")"),
    subtitle = paste0("Average silhouette width = ", round(avg_sil_hc, 3),
                      "  ·  Width > 0.5 indicates reasonable cluster structure")
  ) +
  theme_report()
#>   cluster size ave.sil.width
#> 1       1   48          0.41
#> 2       2   82          0.56
#> 3       3   35          0.35
#> 4       4    2          0.10

cat("Average silhouette width (hierarchical):", round(avg_sil_hc, 3), "\n")
#> Average silhouette width (hierarchical): 0.467

15 · K-Means Clustering

K-means partitions n countries into k groups by iteratively reassigning each country to the nearest centroid and recomputing centroids until convergence. Unlike hierarchical clustering, K-means requires specifying k in advance.

15.1 Elbow Method

The elbow method plots total within-cluster sum of squares (WCSS) for k = 1 to 10. The “elbow” — where the curve bends and returns diminish — guides the choice of k.

set.seed(42)
wss_values <- sapply(1:10, function(k) {
  kmeans(pca_matrix, centers = k, nstart = 30, iter.max = 100)$tot.withinss
})

tibble(k = 1:10, WCSS = wss_values) |>
  ggplot(aes(x = k, y = WCSS)) +
  geom_line(color = "#4EA8DE", linewidth = 1.2) +
  geom_point(color = "#E63946", size = 3.5, shape = 21, fill = "white", stroke = 2) +
  geom_vline(xintercept = k_final, linetype = "dashed",
             color = "#2EC4B6", linewidth = 1) +
  annotate("text", x = k_final + 0.2, y = max(wss_values) * 0.9,
           label = paste0("Selected k = ", k_final),
           hjust = 0, color = "#0F766E", size = 3.5, fontface = "bold") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title    = "Elbow Method for Optimal K",
    subtitle = "WCSS (Total Within-Cluster Sum of Squares) vs. number of clusters",
    x        = "Number of Clusters (k)",
    y        = "Total WCSS"
  )

15.2 Fit Model & Cluster Sizes

set.seed(42)
km_result   <- kmeans(pca_matrix, centers = k_final, nstart = 50, iter.max = 200)
km_clusters <- km_result$cluster

table(km_clusters) |>
  as.data.frame() |>
  setNames(c("Cluster", "Countries")) |>
  knitr::kable(
    caption  = "**Table 10** — K-means: countries per cluster",
    booktabs = TRUE
  )
Table 10 — K-means: countries per cluster
Cluster Countries
1 88
2 48
3 29
4 2

15.3 Cluster Visualization in PCA Space

# Manual ggplot2 cluster plot — avoids the fviz_cluster / ggpubr argument conflict
cluster_palette <- c("1" = "#4EA8DE", "2" = "#E63946", "3" = "#2EC4B6", "4" = "#FF9F1C")

tibble(
  PC1     = pca_result$x[, 1],
  PC2     = pca_result$x[, 2],
  Cluster = factor(km_clusters),
  Country = as.character(df$country)
) |>
  ggplot(aes(x = PC1, y = PC2, colour = Cluster, fill = Cluster)) +
  stat_ellipse(
    geom  = "polygon",
    type  = "norm",
    alpha = 0.10,
    level = 0.90,
    linewidth = 0.4
  ) +
  geom_point(size = 2.4, alpha = 0.85, shape = 21, colour = "white", stroke = 0.4,
             aes(fill = Cluster)) +
  scale_colour_manual(values = cluster_palette) +
  scale_fill_manual(values   = cluster_palette) +
  labs(
    title    = paste0("K-Means Clusters in PCA Space (k = ", k_final, ")"),
    subtitle = "Countries projected onto PC1 & PC2 · Ellipses show 90% normal confidence regions",
    x        = "Principal Component 1",
    y        = "Principal Component 2",
    colour   = "Cluster",
    fill     = "Cluster"
  )

15.4 Silhouette Analysis

sil_km     <- silhouette(km_clusters, dist(pca_matrix))
avg_sil_km <- mean(sil_km[, 3])

factoextra::fviz_silhouette(sil_km, palette = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C")) +
  labs(
    title    = paste0("Silhouette Plot — K-Means (k = ", k_final, ")"),
    subtitle = paste0("Average silhouette width = ", round(avg_sil_km, 3))
  ) +
  theme_report()
#>   cluster size ave.sil.width
#> 1       1   88          0.53
#> 2       2   48          0.43
#> 3       3   29          0.42
#> 4       4    2          0.09

cat("Average silhouette width (K-means):", round(avg_sil_km, 3), "\n")
#> Average silhouette width (K-means): 0.477

16 · Method Comparison

Both algorithms are evaluated side-by-side. The silhouette width is the primary internal validation metric — it measures how similar each country is to its own cluster relative to neighbouring clusters (range −1 to +1; higher is better).

tibble(
  Method             = c("Hierarchical (Ward's)", "K-Means"),
  `Linkage / Init`   = c("Ward's D2", "Random centroids, nstart = 50"),
  `Input Space`      = rep(paste0(num_components, " PCA components"), 2),
  `k`                = rep(k_final, 2),
  `Avg Silhouette`   = round(c(avg_sil_hc, avg_sil_km), 4),
  `Interpretation`   = c(
    "Nested tree structure, useful for understanding hierarchy",
    "Hard partitions, computationally efficient, easy to deploy"
  )
) |>
  knitr::kable(
    caption  = "**Table 11** — Clustering method comparison",
    booktabs = TRUE
  )
Table 11 — Clustering method comparison
Method Linkage / Init Input Space k Avg Silhouette Interpretation
Hierarchical (Ward’s) Ward’s D2 4 PCA components 4 0.4670 Nested tree structure, useful for understanding hierarchy
K-Means Random centroids, nstart = 50 4 PCA components 4 0.4768 Hard partitions, computationally efficient, easy to deploy
💡 Professional Interpretation

Neither method is universally superior. Hierarchical clustering excels at revealing nested similarity structures, while K-means produces clean, deployable hard partitions. Running both and comparing their silhouette scores strengthens the credibility of the final clustering — if both methods agree on which countries belong together, the groupings are more trustworthy.


17 · Cluster Profiling

Translating cluster IDs into meaningful economic categories is where data science becomes decision support. Each cluster is characterized by its average values on the five most interpretable indicators.

analysis_df <- df |>
  mutate(
    cluster_km = factor(km_clusters),
    cluster_hc = factor(hc_clusters)
  )

profile_fn <- function(data, clust_col) {
  data |>
    group_by({{ clust_col }}) |>
    summarise(
      N           = n(),
      Avg_GDP     = round(mean(gdpp,       na.rm = TRUE), 0),
      Avg_Income  = round(mean(income,     na.rm = TRUE), 0),
      Avg_Health  = round(mean(health,     na.rm = TRUE), 1),
      Avg_ChildMort = round(mean(child_mort, na.rm = TRUE), 1),
      Avg_LifeExp = round(mean(life_expec,  na.rm = TRUE), 1),
      .groups = "drop"
    )
}

km_profile <- profile_fn(analysis_df, cluster_km)
hc_profile <- profile_fn(analysis_df, cluster_hc)

km_profile |>
  knitr::kable(caption = "**Table 12** — K-Means cluster profiles", booktabs = TRUE)
Table 12 — K-Means cluster profiles
cluster_km N Avg_GDP Avg_Income Avg_Health Avg_ChildMort Avg_LifeExp
1 88 7333 13456 482.9 20.9 73.2
2 48 1909 3897 114.8 91.6 59.2
3 29 44017 45800 4085.0 5.1 80.4
4 2 75800 81900 5001.9 2.8 82.0
hc_profile |>
  knitr::kable(caption = "**Table 13** — Hierarchical cluster profiles", booktabs = TRUE)
Table 13 — Hierarchical cluster profiles
cluster_hc N Avg_GDP Avg_Income Avg_Health Avg_ChildMort Avg_LifeExp
1 48 1909 3897 114.8 91.6 59.2
2 82 6287 12305 393.1 22.0 72.9
3 35 40177 42951 3677.9 5.4 79.8
4 2 75800 81900 5001.9 2.8 82.0
analysis_df |>
  pivot_longer(
    cols     = c(child_mort, gdpp, income, health, life_expec),
    names_to = "indicator", values_to = "value"
  ) |>
  mutate(indicator = recode(indicator,
    child_mort = "Child Mortality",
    gdpp       = "GDP per Capita",
    income     = "Income",
    health     = "Health Expenditure",
    life_expec = "Life Expectancy"
  )) |>
  ggplot(aes(x = cluster_km, y = value, fill = cluster_km)) +
  geom_boxplot(alpha = 0.80, outlier.size = 1.5, outlier.alpha = 0.5) +
  facet_wrap(~ indicator, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("#4EA8DE", "#E63946", "#2EC4B6", "#FF9F1C")) +
  labs(
    title    = "Key Indicators by K-Means Cluster",
    subtitle = "Boxplots reveal how clusters differ across health and economic dimensions",
    x        = "K-Means Cluster",
    y        = "Value",
    fill     = "Cluster"
  ) +
  theme(legend.position = "top")

Cluster interpretation guide: The cluster with the highest child mortality and lowest GDP per capita represents the most vulnerable nations. The cluster with the highest GDP and income represents high-income economies. The remaining clusters capture middle-income development tiers. The next two sections identify and rank the highest-need cluster precisely.

18 · Identify the Highest-Need Cluster

The priority cluster is identified programmatically: the cluster with the highest average child mortality rate, broken by the lowest average GDP per capita and lowest average health expenditure.

priority_km <- km_profile |>
  arrange(desc(Avg_ChildMort), Avg_GDP, Avg_Health) |>
  slice(1)

priority_hc <- hc_profile |>
  arrange(desc(Avg_ChildMort), Avg_GDP, Avg_Health) |>
  slice(1)

priority_km |>
  knitr::kable(
    caption  = "**Table 14** — Highest-need cluster (K-Means)",
    booktabs = TRUE
  )
Table 14 — Highest-need cluster (K-Means)
cluster_km N Avg_GDP Avg_Income Avg_Health Avg_ChildMort Avg_LifeExp
2 48 1909 3897 114.8 91.6 59.2
priority_hc |>
  knitr::kable(
    caption  = "**Table 15** — Highest-need cluster (Hierarchical)",
    booktabs = TRUE
  )
Table 15 — Highest-need cluster (Hierarchical)
cluster_hc N Avg_GDP Avg_Income Avg_Health Avg_ChildMort Avg_LifeExp
1 48 1909 3897 114.8 91.6 59.2

19 · Need Score Construction & Country Ranking

The need score converts the cluster analysis into an actionable, country-level ranking. It is a composite index built from three Z-scored indicators:

\[\boxed{\text{Need Score}_i = z(\text{child\_mort}_i) - z(\text{gdpp}_i) - z(\text{health}_i)}\]

Higher values indicate greater unmet need. The formula deliberately rewards high child mortality (positive) and penalizes high GDP and health expenditure (negative), since those reflect capacity to self-fund development.

priority_countries <- analysis_df |>
  filter(
    cluster_km == priority_km$cluster_km |
    cluster_hc == priority_hc$cluster_hc
  ) |>
  mutate(
    z_child = as.numeric(scale(child_mort)),
    z_gdpp  = as.numeric(scale(gdpp)),
    z_health = as.numeric(scale(health)),
    need_score = z_child - z_gdpp - z_health
  ) |>
  arrange(desc(need_score))

priority_countries |>
  select(country, child_mort, gdpp, income, health, life_expec, need_score) |>
  mutate(
    need_score = round(need_score, 3),
    gdpp       = round(gdpp, 0),
    health     = round(health, 1)
  ) |>
  head(20) |>
  knitr::kable(
    caption  = "**Table 16** — Top 20 countries ranked by Need Score",
    booktabs = TRUE
  )
Table 16 — Top 20 countries ranked by Need Score
country child_mort gdpp income health life_expec need_score
Haiti 208.0 662 1500 45.7 32.1 4.235
Sierra Leone 160.0 399 1220 52.3 55.0 2.887
Central African Republic 149.0 446 888 17.8 47.5 2.759
Chad 150.0 897 1930 40.6 56.5 2.495
Mali 137.0 708 1870 35.3 59.5 2.214
Niger 123.0 348 814 18.0 58.8 2.033
Congo, Dem. Rep. 116.0 334 609 26.4 57.5 1.783
Burkina Faso 116.0 575 1430 38.8 57.9 1.626
Guinea-Bissau 114.0 547 1390 46.5 55.6 1.531
Benin 111.0 758 1820 31.1 61.8 1.464
Guinea 109.0 648 1190 31.9 58.0 1.438
Mozambique 101.0 419 918 21.8 54.5 1.345
Burundi 93.6 231 764 26.8 57.7 1.163
Cote d’Ivoire 111.0 1220 2690 64.7 56.3 1.104
Malawi 90.5 459 1030 30.2 53.1 0.974
Cameroon 108.0 1310 2660 67.2 57.3 0.970
Nigeria 130.0 2330 5150 118.1 60.5 0.955
Liberia 89.3 327 700 38.6 60.8 0.934
Togo 90.3 488 1210 37.3 58.7 0.916
Pakistan 92.1 1040 4280 22.9 65.3 0.867
priority_countries |>
  slice_head(n = 20) |>
  ggplot(aes(
    x    = reorder(country, need_score),
    y    = need_score,
    fill = need_score
  )) +
  geom_col(width = 0.75, color = "white", linewidth = 0.25) +
  geom_text(
    aes(label = round(need_score, 2)),
    hjust = -0.12, size = 3, color = "#1E293B"
  ) +
  coord_flip() +
  scale_fill_gradient2(
    low      = "#ADE8F4",
    mid      = "#4EA8DE",
    high     = "#E63946",
    midpoint = median(priority_countries$need_score[1:20])
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.14))) +
  labs(
    title    = "Top 20 Countries Ranked by Composite Need Score",
    subtitle = "Score = z(child_mort) − z(gdpp) − z(health) · Higher = greater aid priority",
    x        = NULL,
    y        = "Need Score",
    fill     = "Score",
    caption  = "Source: Country-data.csv · Analysis by Eliya Christopher Nandi"
  ) +
  theme(legend.position = "none")

🔴 Aid Priority — Top Countries

The top-ranked nations combine the highest child mortality, lowest GDP per capita, and lowest health expenditure — a convergence of vulnerabilities that no single indicator alone would fully capture. This ranking is the core deliverable of the analysis.


20 · Country Membership by Cluster

For transparency and reproducibility, every country’s cluster assignment is listed explicitly.

analysis_df |>
  arrange(cluster_km, country) |>
  group_by(cluster_km) |>
  summarise(Countries = paste(as.character(country), collapse = ", "), .groups = "drop") |>
  knitr::kable(
    caption  = "**Table 17** — Country membership by K-Means cluster",
    booktabs = TRUE
  )
Table 17 — Country membership by K-Means cluster
cluster_km Countries
1 Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Slovak Republic, South Korea, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam
2 Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia
3 Australia, Austria, Belgium, Brunei, Canada, Cyprus, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Malta, Netherlands, New Zealand, Norway, Qatar, Slovenia, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States
4 Luxembourg, Singapore
analysis_df |>
  arrange(cluster_hc, country) |>
  group_by(cluster_hc) |>
  summarise(Countries = paste(as.character(country), collapse = ", "), .groups = "drop") |>
  knitr::kable(
    caption  = "**Table 18** — Country membership by Hierarchical cluster",
    booktabs = TRUE
  )
Table 18 — Country membership by Hierarchical cluster
cluster_hc Countries
1 Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia
2 Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam
3 Australia, Austria, Bahamas, Bahrain, Belgium, Brunei, Canada, Cyprus, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Malta, Netherlands, New Zealand, Norway, Portugal, Qatar, Slovak Republic, Slovenia, South Korea, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States
4 Luxembourg, Singapore

21 · Final Recommendations

✅ Data-Driven Recommendations

Based on the clustering analysis and need-score rankings, aid and development investment should be prioritized toward the countries at the top of Table 16. These nations face a compounded disadvantage: high child mortality + low economic capacity + low health system investment.

Specifically, the analysis recommends:

  1. Immediate health interventions — reduce under-5 mortality through vaccination, nutrition, and maternal care programs.
  2. Economic development — build GDP-generating capacity and income infrastructure in the lowest-tier cluster.
  3. Health system investment — increase public health expenditure to break the feedback loop between poverty and poor health outcomes.
  4. Multi-indicator monitoring — reassess country rankings annually as socio-economic conditions evolve.
⚠️ Important Limitations

This analysis is a data-driven decision-support tool, not a policy prescription. Real-world aid allocation must also account for: population size, political stability, humanitarian access, conflict risk, governance quality, and on-the-ground field assessments. Quantitative rankings should be treated as a starting point for expert deliberation, not a final answer.


22 · Conclusion

This project demonstrates how unsupervised machine learning methods — applied rigorously and transparently — can extract actionable insights from cross-country socio-economic data.

Key contributions of this analysis:

  • Feature engineering converted percentage-based economic variables into comparable per-capita monetary values, improving the quality of the distance calculations.
  • PCA effectively compressed nine correlated indicators into a lower-dimensional space retaining 93.4% of total variance.
  • K-means and hierarchical clustering both identified consistent country groupings; silhouette validation confirmed the reliability of the structure.
  • The need-score framework produced a transparent, reproducible country ranking that translates abstract cluster assignments into a concrete prioritization order.

The approach taken here follows best practices for reproducible, explainable, and production-quality data science portfolio work.


23 · Reproducibility Instructions

To reproduce this report on your machine:

1.  Place  Country-data.csv  in the same directory as this .Rmd file.
2.  Open the .Rmd file in RStudio (version ≥ 2022.07 recommended).
3.  Install missing packages if prompted:
    install.packages(c("tidyverse","skimr","ggcorrplot","factoextra","cluster","knitr"))
4.  Click  Knit → Knit to HTML  (or run rmarkdown::render("...Rmd")).
5.  Upload both the .Rmd and the knitted .html to GitHub for portfolio display.

To use a different dataset or cluster count, change data_path or k_clusters in the YAML header — all downstream results update automatically.


24 · Session Information

Full session information is recorded below for complete reproducibility. This captures the R version, platform, and exact package versions used.

sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.utf8 
#> [2] LC_CTYPE=English_United Kingdom.utf8   
#> [3] LC_MONETARY=English_United Kingdom.utf8
#> [4] LC_NUMERIC=C                           
#> [5] LC_TIME=English_United Kingdom.utf8    
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] keras3_1.5.0       knitr_1.51         cluster_2.1.8.2    factoextra_1.0.7  
#>  [5] ggcorrplot_0.1.4.1 skimr_2.2.2        lubridate_1.9.5    forcats_1.0.1     
#>  [9] stringr_1.6.0      dplyr_1.2.1        purrr_1.2.0        readr_2.2.0       
#> [13] tidyr_1.3.1        tibble_3.3.0       ggplot2_4.0.2      tidyverse_2.0.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       ggrepel_0.9.6     
#>  [5] rstatix_0.7.3      lattice_0.22-7     tzdb_0.5.0         tfruns_1.5.4      
#>  [9] vctrs_0.7.3        tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3   
#> [13] Matrix_1.7-4       RColorBrewer_1.1-3 S7_0.2.0           lifecycle_1.0.5   
#> [17] compiler_4.5.2     farver_2.1.2       repr_1.1.7         codetools_0.2-20  
#> [21] carData_3.0-6      htmltools_0.5.9    sass_0.4.10        yaml_2.3.12       
#> [25] Formula_1.2-5      whisker_0.4.1      pillar_1.11.1      car_3.1-5         
#> [29] ggpubr_0.6.2       jquerylib_0.1.4    cachem_1.1.0       abind_1.4-8       
#> [33] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7      reshape2_1.4.5    
#> [37] labeling_0.4.3     rprojroot_2.1.1    fastmap_1.2.0      grid_4.5.2        
#> [41] here_1.0.2         cli_3.6.5          magrittr_2.0.4     base64enc_0.1-6   
#> [45] dotty_0.1.0        broom_1.0.12       withr_3.0.2        scales_1.4.0      
#> [49] backports_1.5.0    timechange_0.4.0   rmarkdown_2.30     otel_0.2.0        
#> [53] reticulate_1.44.1  ggsignif_0.6.4     png_0.1-8          hms_1.1.4         
#> [57] evaluate_1.0.5     rlang_1.2.0        zeallot_0.2.0      Rcpp_1.1.0        
#> [61] glue_1.8.0         rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1          
#> [65] plyr_1.8.9         tensorflow_2.20.0