This report presents a rigorous, end-to-end unsupervised machine learning study on 167 countries, using socio-economic and health data to segment nations into meaningful groups and identify which require urgent development aid.
Unsupervised Learning PCA · Dimensionality Reduction K-Means Clustering Hierarchical Clustering Silhouette Validation Aid Prioritization R · tidyverse · factoextra
How can countries be meaningfully grouped based on their socio-economic and health profiles, and which countries should be prioritized for humanitarian aid using a transparent, data-driven scoring framework?
The full workflow covers feature engineering → exploratory analysis → standardization → PCA → K-means and hierarchical clustering → silhouette validation → cluster profiling → need-score ranking.
This analysis targets four concrete questions that a practitioner or policy analyst would ask:
The project follows an industry-standard data science workflow. Each stage builds directly on the previous one, ensuring that every modeling decision is justified by the data.
Before the analysis begins, all required packages are loaded. A defensive loader function is used so that the report fails early with a helpful message if any dependency is missing — a best practice for reproducible research.
| Package | Role in this analysis |
|---|---|
tidyverse |
Data wrangling, reshaping, and all ggplot2
visualizations |
skimr |
Rich, formatted summary statistics |
ggcorrplot |
Correlation heatmap |
factoextra |
PCA scree plots, cluster visualizations, silhouette plots |
cluster |
Silhouette computation |
knitr |
Table rendering in the knitted report |
load_package <- function(pkg) {
if (!requireNamespace(pkg, quietly = TRUE))
stop(sprintf(
"Package '%s' is required but not installed.\nRun: install.packages('%s')", pkg, pkg
))
suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
required_packages <- c(
"tidyverse", "skimr", "ggcorrplot",
"factoextra", "cluster", "knitr"
)
invisible(lapply(required_packages, load_package))
# Set a clean, consistent ggplot2 theme for the entire report
theme_report <- function() {
theme_minimal(base_size = 12, base_family = "sans") +
theme(
plot.title = element_text(face = "bold", color = "#0D1F2D", size = 14, margin = margin(b = 4)),
plot.subtitle = element_text(color = "#64748B", size = 11, margin = margin(b = 12)),
plot.caption = element_text(color = "#94A3B8", size = 9, hjust = 0),
axis.title = element_text(color = "#2C4A5E", size = 11, face = "bold"),
axis.text = element_text(color = "#475569"),
strip.text = element_text(face = "bold", color = "#0D1F2D"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "#E2E8F0", linewidth = 0.5),
legend.position = "bottom",
legend.text = element_text(size = 10)
)
}
theme_set(theme_report())theme_report()
function guarantees visual consistency across every plot in the report.
The raw dataset is read from disk and immediately inspected. Knowing the number of rows, columns, and variable types before touching the data prevents silent errors later in the pipeline.
if (!file.exists(params$data_path))
stop(paste0(
"Dataset not found: '", params$data_path, "'.\n",
"Place Country-data.csv in the same folder as this .Rmd file."
))
df_raw <- read.csv(params$data_path)
n_countries <- nrow(df_raw)
n_vars <- ncol(df_raw)
cat("Rows (countries):", n_countries, "\nColumns (variables):", n_vars, "\n")#> Rows (countries): 167
#> Columns (variables): 10
head(df_raw, 8) |>
knitr::kable(
caption = "**Table 1** — Raw dataset preview (first 8 rows)",
digits = 2,
booktabs = TRUE
)| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
| Armenia | 18.1 | 20.8 | 4.40 | 45.3 | 6700 | 7.77 | 73.3 | 1.69 | 3220 |
| Australia | 4.8 | 19.8 | 8.73 | 20.9 | 41400 | 1.16 | 82.0 | 1.93 | 51900 |
#> 'data.frame': 167 obs. of 10 variables:
#> $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
#> $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
#> $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
#> $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
#> $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
#> $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
#> $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
#> $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
#> $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
#> $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
country column is a
character identifier — it will be used as a label throughout but
excluded from all numerical computations. Every remaining
column is numeric. No immediate structural issues detected.
Understanding what each variable measures is essential before any transformation or modelling decision.
tibble::tribble(
~Variable, ~Type, ~Unit, ~Description,
"country", "Character", "—", "Country name (label, not used in modelling)",
"child_mort", "Numeric", "per 1,000 births", "Under-5 child mortality rate",
"exports", "Numeric", "% of GDP (raw)", "Exports of goods and services",
"health", "Numeric", "% of GDP (raw)", "Total health expenditure",
"imports", "Numeric", "% of GDP (raw)", "Imports of goods and services",
"income", "Numeric", "USD", "Net income per person",
"inflation", "Numeric", "% per year", "Annual consumer price inflation",
"life_expec", "Numeric", "years", "Life expectancy at birth",
"total_fer", "Numeric", "children/woman", "Total fertility rate",
"gdpp", "Numeric", "USD", "GDP per capita"
) |>
knitr::kable(
caption = "**Table 2** — Data dictionary",
booktabs = TRUE
)| Variable | Type | Unit | Description |
|---|---|---|---|
| country | Character | — | Country name (label, not used in modelling) |
| child_mort | Numeric | per 1,000 births | Under-5 child mortality rate |
| exports | Numeric | % of GDP (raw) | Exports of goods and services |
| health | Numeric | % of GDP (raw) | Total health expenditure |
| imports | Numeric | % of GDP (raw) | Imports of goods and services |
| income | Numeric | USD | Net income per person |
| inflation | Numeric | % per year | Annual consumer price inflation |
| life_expec | Numeric | years | Life expectancy at birth |
| total_fer | Numeric | children/woman | Total fertility rate |
| gdpp | Numeric | USD | GDP per capita |
exports, health, and imports
are recorded as percentages of GDP. Feeding raw
percentages into clustering alongside absolute USD values would distort
distances. The next section converts them to estimated per-capita
monetary values.
The three percentage-of-GDP variables are converted into estimated per-capita monetary values by multiplying by GDP per capita. This puts all economic indicators on a comparable absolute scale before any further analysis.
\[\text{exports}_{\text{abs}} = \frac{\text{exports}_{\%} \times \text{gdpp}}{100}\]
The same formula is applied to health and
imports.
df <- df_raw |>
mutate(
exports = (exports * gdpp) / 100,
health = (health * gdpp) / 100,
imports = (imports * gdpp) / 100,
gdpp = as.numeric(gdpp),
income = as.numeric(income),
country = factor(country)
)
missing_count <- sum(is.na(df))
cat("Total missing values after engineering:", missing_count, "\n")#> Total missing values after engineering: 0
head(df, 6) |>
knitr::kable(
caption = "**Table 3** — Dataset after feature engineering",
digits = 1,
booktabs = TRUE
)| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 55.3 | 41.9 | 248.3 | 1610 | 9.4 | 56.2 | 5.8 | 553 |
| Albania | 16.6 | 1145.2 | 267.9 | 1987.7 | 9930 | 4.5 | 76.3 | 1.6 | 4090 |
| Algeria | 27.3 | 1712.6 | 186.0 | 1400.4 | 12900 | 16.1 | 76.5 | 2.9 | 4460 |
| Angola | 119.0 | 2199.2 | 100.6 | 1514.4 | 5900 | 22.4 | 60.1 | 6.2 | 3530 |
| Antigua and Barbuda | 10.3 | 5551.0 | 735.7 | 7185.8 | 19100 | 1.4 | 76.8 | 2.1 | 12200 |
| Argentina | 14.5 | 1946.7 | 834.3 | 1648.0 | 18700 | 20.9 | 75.8 | 2.4 | 10300 |
income and gdpp. No missing
values were introduced by the transformation.
A high-level numerical summary of the cleaned dataset confirms the scale differences that motivate standardization in Section 9.
X <- df |> select(-country) # numeric matrix used in all subsequent steps
tibble(
Metric = c("Countries in dataset", "Numeric indicators", "Missing values"),
Value = c(nrow(df), ncol(X), missing_count)
) |>
knitr::kable(caption = "**Table 4** — Cleaned dataset at a glance", booktabs = TRUE)| Metric | Value |
|---|---|
| Countries in dataset | 167 |
| Numeric indicators | 9 |
| Missing values | 0 |
| Name | X |
| Number of rows | 167 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| child_mort | 0 | 1 | 38.27 | 40.33 | 2.60 | 8.25 | 19.30 | 62.10 | 208.00 | ▇▂▂▁▁ |
| exports | 0 | 1 | 7420.62 | 17973.89 | 1.08 | 447.14 | 1777.44 | 7278.00 | 183750.00 | ▇▁▁▁▁ |
| health | 0 | 1 | 1056.73 | 1801.41 | 12.82 | 78.54 | 321.89 | 976.94 | 8663.60 | ▇▁▁▁▁ |
| imports | 0 | 1 | 6588.35 | 14710.81 | 0.65 | 640.21 | 2045.58 | 7719.60 | 149100.00 | ▇▁▁▁▁ |
| income | 0 | 1 | 17144.69 | 19278.07 | 609.00 | 3355.00 | 9960.00 | 22800.00 | 125000.00 | ▇▂▁▁▁ |
| inflation | 0 | 1 | 7.78 | 10.57 | -4.21 | 1.81 | 5.39 | 10.75 | 104.00 | ▇▁▁▁▁ |
| life_expec | 0 | 1 | 70.56 | 8.89 | 32.10 | 65.30 | 73.10 | 76.80 | 82.80 | ▁▁▃▅▇ |
| total_fer | 0 | 1 | 2.95 | 1.51 | 1.15 | 1.79 | 2.41 | 3.88 | 7.49 | ▇▃▂▂▁ |
| gdpp | 0 | 1 | 12964.16 | 18328.70 | 231.00 | 1330.00 | 4660.00 | 14050.00 | 105000.00 | ▇▁▁▁▁ |
Exploratory analysis answers three questions before any modelling: (1) Are distributions roughly normal or heavily skewed? (2) How correlated are the variables? (3) Are there extreme outliers that may distort distance calculations?
Each variable is plotted as a histogram. Variables with right-skewed distributions will benefit from the log-scale view in the next tab.
df |>
select(-country) |>
pivot_longer(everything(), names_to = "feature", values_to = "value") |>
ggplot(aes(x = value, fill = feature)) +
geom_histogram(bins = 28, color = "white", alpha = 0.90, show.legend = FALSE) +
facet_wrap(~ feature, scales = "free", ncol = 3) +
scale_fill_manual(values = rep(c("#4EA8DE","#2EC4B6","#E63946","#FF9F1C",
"#6366F1","#8B5CF6","#14B8A6","#F59E0B","#10B981"), 2)) +
labs(
title = "Distribution of All Socio-Economic & Health Indicators",
subtitle = "Histograms on raw scales — note strong right-skew in economic variables",
x = NULL, y = "Countries"
)Log-transformed views reveal the true spread of the right-skewed economic variables.
df |>
select(gdpp, income, child_mort, exports, health, imports) |>
pivot_longer(everything(), names_to = "feature", values_to = "value") |>
filter(value > 0) |>
ggplot(aes(x = value, fill = feature)) +
geom_histogram(bins = 28, color = "white", alpha = 0.9, show.legend = FALSE) +
facet_wrap(~ feature, scales = "free", ncol = 3) +
scale_x_log10(labels = scales::label_comma()) +
scale_fill_manual(values = c("#4EA8DE","#2EC4B6","#E63946","#FF9F1C","#6366F1","#8B5CF6")) +
labs(
title = "Selected Variables on a Log₁₀ Scale",
subtitle = "Log scaling normalises the long right tails and reveals the full country spread",
x = "Log₁₀ value", y = "Countries"
)Boxplots highlight country-level outliers across every indicator simultaneously.
df |>
select(-country) |>
pivot_longer(everything(), names_to = "feature", values_to = "value") |>
ggplot(aes(x = reorder(feature, value, FUN = median), y = value, fill = feature)) +
geom_boxplot(
outlier.colour = "#E63946",
outlier.size = 2.2,
outlier.alpha = 0.70,
alpha = 0.80,
show.legend = FALSE
) +
coord_flip() +
labs(
title = "Feature-Level Outlier Detection via Boxplots",
subtitle = "Red points represent potential outliers — common in cross-country economic data",
x = NULL, y = "Value"
)K-means minimizes squared Euclidean distances, and PCA is variance-weighted. Both are sensitive to scale. Z-score normalization transforms every variable to mean = 0 and standard deviation = 1, ensuring no single variable dominates purely because of its measurement units.
X_scaled <- scale(X)
tibble(
Feature = colnames(X_scaled),
Mean_after_scale = round(colMeans(X_scaled), 5),
SD_after_scale = round(apply(X_scaled, 2, sd), 5)
) |>
knitr::kable(
caption = "**Table 5** — Standardization verification (mean ≈ 0, SD ≈ 1 for all features)",
booktabs = TRUE
)| Feature | Mean_after_scale | SD_after_scale |
|---|---|---|
| child_mort | 0 | 1 |
| exports | 0 | 1 |
| health | 0 | 1 |
| imports | 0 | 1 |
| income | 0 | 1 |
| inflation | 0 | 1 |
| life_expec | 0 | 1 |
| total_fer | 0 | 1 |
| gdpp | 0 | 1 |
Beyond individual-variable outliers, some countries may be extreme across multiple dimensions simultaneously. The multivariate outlier score is the Euclidean distance from the standardized data centroid (the origin in Z-score space).
\[\text{Outlier Score}_i = \sqrt{\sum_{j=1}^{p} z_{ij}^2}\]
outlier_rank <- tibble(
country = df$country,
outlier_score = sqrt(rowSums(X_scaled^2))
) |>
arrange(desc(outlier_score))
head(outlier_rank, 12) |>
mutate(outlier_score = round(outlier_score, 3)) |>
knitr::kable(
caption = "**Table 6** — Top 12 multivariate outliers (ranked by standardized distance from centroid)",
booktabs = TRUE
)| country | outlier_score |
|---|---|
| Luxembourg | 15.778 |
| Nigeria | 9.709 |
| Singapore | 8.044 |
| Qatar | 6.896 |
| Switzerland | 6.828 |
| Norway | 6.703 |
| Haiti | 6.188 |
| United States | 5.162 |
| Ireland | 4.946 |
| Denmark | 4.797 |
| Netherlands | 4.729 |
| Central African Republic | 4.320 |
outlier_rank |>
slice_head(n = 15) |>
ggplot(aes(x = reorder(country, outlier_score), y = outlier_score, fill = outlier_score)) +
geom_col(width = 0.72, color = "white", linewidth = 0.3) +
geom_text(aes(label = round(outlier_score, 2)), hjust = -0.15, size = 3, color = "#1E293B") +
coord_flip() +
scale_fill_gradient2(low = "#ADE8F4", mid = "#4EA8DE", high = "#E63946",
midpoint = median(outlier_rank$outlier_score[1:15])) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Top 15 Multivariate Outliers",
subtitle = "Countries furthest from the dataset centroid in standardized 9-dimensional space",
x = NULL, y = "Outlier Score (standardized Euclidean distance)",
fill = "Score"
) +
theme(legend.position = "none")Before applying PCA, it is worth checking whether the variables are correlated. High inter-variable correlation is the key condition that makes PCA effective — it means a smaller set of components can capture most of the variance.
corr_matrix <- cor(X, use = "pairwise.complete.obs")
ggcorrplot(
corr_matrix,
type = "lower",
lab = TRUE,
lab_size = 3.2,
tl.cex = 10,
colors = c("#E63946", "#FFFFFF", "#4EA8DE"),
outline.color = "#E2E8F0"
) +
labs(
title = "Correlation Heatmap of All Indicators",
subtitle = "Blue = positive correlation · Red = negative correlation · Values = Pearson r"
) +
theme(
plot.title = element_text(face = "bold", color = "#0D1F2D"),
plot.subtitle = element_text(color = "#64748B")
)Notable correlation clusters:
gdpp,
income, life_expec, health,
exports, imports all correlate positively —
wealthier countries spend more on health, trade more, and live
longer.child_mort and
total_fer are strongly negatively correlated with
gdpp and life_expec — high fertility and child
mortality are hallmarks of lower-income countries.Principal Component Analysis (PCA) rotates the data into a new coordinate system where the first axis (PC1) explains the maximum possible variance, PC2 explains the next most, and so on. This allows us to cluster in a lower-dimensional space that is less susceptible to noise.
pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE) # already scaled
explained_var <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumulative_var <- cumsum(explained_var)
pca_tbl <- tibble(
Component = paste0("PC", seq_along(explained_var)),
`Variance Explained` = paste0(round(explained_var * 100, 2), "%"),
`Cumulative Variance`= paste0(round(cumulative_var * 100, 2), "%")
)
num_components <- which(cumulative_var >= 0.90)[1]
pca_tbl |>
knitr::kable(
caption = "**Table 7** — PCA variance explained per component",
booktabs = TRUE
)| Component | Variance Explained | Cumulative Variance |
|---|---|---|
| PC1 | 58.94% | 58.94% |
| PC2 | 18.45% | 77.38% |
| PC3 | 9.91% | 87.29% |
| PC4 | 6.07% | 93.37% |
| PC5 | 3.03% | 96.4% |
| PC6 | 2.46% | 98.86% |
| PC7 | 0.94% | 99.79% |
| PC8 | 0.16% | 99.95% |
| PC9 | 0.05% | 100% |
cat("Components needed to retain ≥90% variance:", num_components,
"\nCumulative variance retained:", round(cumulative_var[num_components]*100, 2), "%\n")#> Components needed to retain ≥90% variance: 4
#> Cumulative variance retained: 93.37 %
The scree plot shows the “elbow” where additional components contribute diminishing variance.
factoextra::fviz_eig(
pca_result,
addlabels = TRUE,
ylim = c(0, 62),
barfill = "#4EA8DE",
barcolor = "#3D6B8C",
linecolor = "#E63946",
ncp = 9
) +
labs(
title = "PCA Scree Plot — Variance Explained by Each Component",
subtitle = "The first two components alone explain over 70% of total variance"
) +
theme_report()The cumulative curve shows how quickly components accumulate information, with the 90% threshold highlighted.
tibble(n = seq_along(cumulative_var), cumvar = cumulative_var) |>
ggplot(aes(x = n, y = cumvar)) +
geom_area(alpha = 0.12, fill = "#4EA8DE") +
geom_line(color = "#4EA8DE", linewidth = 1.2) +
geom_point(color = "#E63946", size = 3.5, shape = 21, fill = "white", stroke = 2) +
geom_hline(yintercept = 0.90, linetype = "dashed", color = "#2EC4B6", linewidth = 1) +
annotate("text", x = 1.2, y = 0.915, label = "90% threshold",
color = "#0F766E", size = 3.5, fontface = "bold") +
geom_vline(xintercept = num_components, linetype = "dotted", color = "#E63946", linewidth = 0.9) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
scale_x_continuous(breaks = 1:9) +
labs(
title = "Cumulative Explained Variance",
subtitle = paste0("Retaining ", num_components, " components captures the 90% threshold"),
x = "Number of Components",
y = "Cumulative Variance Explained"
)4 principal components are retained, preserving 93.37% of the original variance. This reduced representation removes noise while keeping virtually all of the structured information needed for clustering.
The biplot overlays country positions with variable loading arrows, showing which original variables drive each principal component.
factoextra::fviz_pca_biplot(
pca_result,
repel = TRUE,
col.ind = "#94A3B8",
col.var = "#E63946",
alpha.ind = 0.4,
label = "var",
arrowsize = 0.8,
labelsize = 4
) +
labs(
title = "PCA Biplot — Countries & Variable Loadings",
subtitle = "Arrows show direction and strength of each variable's contribution to PC1 and PC2"
) +
theme_report()An autoencoder is a neural-network-based dimensionality reduction
method that can capture non-linear relationships. This section is
parameterized: it runs only when run_autoencoder: true is
set in the YAML header.
Why not use it by default? PCA is entirely sufficient here — the high linear correlations shown in Section 10 mean non-linear methods offer no meaningful benefit. PCA is also faster, deterministic, and fully interpretable, making it the more professional choice for this dataset.
# Requires keras3 to be installed and configured.
if (!requireNamespace("keras3", quietly = TRUE))
stop("Install keras3 to run the autoencoder section.")
library(keras3)
input_dim <- ncol(X_scaled)
latent_dim <- num_components
inputs <- layer_input(shape = input_dim)
encoded <- inputs |> layer_dense(128, activation = "relu") |>
layer_dense(latent_dim, activation = "relu")
decoded <- encoded |> layer_dense(128, activation = "relu") |>
layer_dense(input_dim, activation = "linear")
autoencoder <- keras_model(inputs, decoded)
autoencoder |> compile(optimizer = "adam", loss = "mse")
autoencoder |> fit(X_scaled, X_scaled,
epochs = 60, batch_size = 32,
validation_split = 0.15, verbose = 0)
encoder <- keras_model(inputs, encoded)
enc_df <- predict(encoder, X_scaled) |> as.data.frame() |>
setNames(paste0("Latent_", seq_len(latent_dim)))#> 6/6 - 0s - 14ms/step
ggplot(enc_df, aes(Latent_1, Latent_2)) +
geom_point(alpha = 0.65, color = "#E63946", size = 2.5) +
labs(title = "Autoencoder — 2D Latent Space Projection",
x = "Latent Dimension 1", y = "Latent Dimension 2")The selected principal component scores are extracted into a clean matrix. This is the input for both clustering algorithms.
pca_components <- as.data.frame(pca_result$x[, 1:num_components])
pca_components$country <- df$country
pca_matrix <- pca_components |> select(-country)
head(pca_components, 6) |>
mutate(across(where(is.numeric), ~ round(.x, 4))) |>
knitr::kable(
caption = paste0("**Table 8** — First 6 countries in PCA space (",
num_components, " components)"),
booktabs = TRUE
)| PC1 | PC2 | PC3 | PC4 | country |
|---|---|---|---|---|
| -2.6277 | -1.4679 | -0.5478 | -0.2416 | Afghanistan |
| -0.0241 | 1.4256 | -0.0141 | 0.4493 | Albania |
| -0.4582 | 0.6735 | 0.9565 | 0.2178 | Algeria |
| -2.7145 | -2.1658 | 0.5984 | -0.4327 | Angola |
| 0.6467 | 1.0204 | -0.2567 | 0.2883 | Antigua and Barbuda |
| 0.0353 | 0.6832 | 1.4643 | -0.0288 | Argentina |
Hierarchical clustering constructs a tree (dendrogram) by successively merging the most similar countries or groups. Ward’s method is used because it minimizes the total within-cluster variance at each merge step, producing compact and well-separated clusters.
dist_pca <- dist(pca_matrix, method = "euclidean")
hc_ward <- hclust(dist_pca, method = "ward.D2")
par(
mar = c(4, 3, 3, 1),
family = "sans",
bg = "white"
)
plot(
hc_ward,
labels = FALSE,
hang = -1,
main = "Hierarchical Clustering Dendrogram — Ward's Method",
sub = "",
xlab = "Countries (labels hidden for clarity)",
ylab = "Ward's Linkage Height",
col.main = "#0D1F2D",
cex.main = 1.1
)
mtext(
paste0("k = ", params$k_clusters, " clusters highlighted"),
side = 3, line = 0.3, cex = 0.85, col = "#64748B"
)
rect.hclust(hc_ward, k = params$k_clusters, border = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C"))k_final <- params$k_clusters
hc_clusters <- cutree(hc_ward, k = k_final)
pca_components$cluster_hc <- factor(hc_clusters)
table(hc_clusters) |>
as.data.frame() |>
setNames(c("Cluster", "Countries")) |>
knitr::kable(
caption = "**Table 9** — Hierarchical clustering: countries per cluster",
booktabs = TRUE
)| Cluster | Countries |
|---|---|
| 1 | 48 |
| 2 | 82 |
| 3 | 35 |
| 4 | 2 |
sil_hc <- silhouette(hc_clusters, dist_pca)
avg_sil_hc <- mean(sil_hc[, 3])
factoextra::fviz_silhouette(sil_hc, palette = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C")) +
labs(
title = paste0("Silhouette Plot — Hierarchical Clustering (k = ", k_final, ")"),
subtitle = paste0("Average silhouette width = ", round(avg_sil_hc, 3),
" · Width > 0.5 indicates reasonable cluster structure")
) +
theme_report()#> cluster size ave.sil.width
#> 1 1 48 0.41
#> 2 2 82 0.56
#> 3 3 35 0.35
#> 4 4 2 0.10
#> Average silhouette width (hierarchical): 0.467
K-means partitions n countries into k
groups by iteratively reassigning each country to the nearest centroid
and recomputing centroids until convergence. Unlike hierarchical
clustering, K-means requires specifying k in advance.
The elbow method plots total within-cluster sum of squares (WCSS) for k = 1 to 10. The “elbow” — where the curve bends and returns diminish — guides the choice of k.
set.seed(42)
wss_values <- sapply(1:10, function(k) {
kmeans(pca_matrix, centers = k, nstart = 30, iter.max = 100)$tot.withinss
})
tibble(k = 1:10, WCSS = wss_values) |>
ggplot(aes(x = k, y = WCSS)) +
geom_line(color = "#4EA8DE", linewidth = 1.2) +
geom_point(color = "#E63946", size = 3.5, shape = 21, fill = "white", stroke = 2) +
geom_vline(xintercept = k_final, linetype = "dashed",
color = "#2EC4B6", linewidth = 1) +
annotate("text", x = k_final + 0.2, y = max(wss_values) * 0.9,
label = paste0("Selected k = ", k_final),
hjust = 0, color = "#0F766E", size = 3.5, fontface = "bold") +
scale_x_continuous(breaks = 1:10) +
scale_y_continuous(labels = scales::label_comma()) +
labs(
title = "Elbow Method for Optimal K",
subtitle = "WCSS (Total Within-Cluster Sum of Squares) vs. number of clusters",
x = "Number of Clusters (k)",
y = "Total WCSS"
)set.seed(42)
km_result <- kmeans(pca_matrix, centers = k_final, nstart = 50, iter.max = 200)
km_clusters <- km_result$cluster
table(km_clusters) |>
as.data.frame() |>
setNames(c("Cluster", "Countries")) |>
knitr::kable(
caption = "**Table 10** — K-means: countries per cluster",
booktabs = TRUE
)| Cluster | Countries |
|---|---|
| 1 | 88 |
| 2 | 48 |
| 3 | 29 |
| 4 | 2 |
# Manual ggplot2 cluster plot — avoids the fviz_cluster / ggpubr argument conflict
cluster_palette <- c("1" = "#4EA8DE", "2" = "#E63946", "3" = "#2EC4B6", "4" = "#FF9F1C")
tibble(
PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2],
Cluster = factor(km_clusters),
Country = as.character(df$country)
) |>
ggplot(aes(x = PC1, y = PC2, colour = Cluster, fill = Cluster)) +
stat_ellipse(
geom = "polygon",
type = "norm",
alpha = 0.10,
level = 0.90,
linewidth = 0.4
) +
geom_point(size = 2.4, alpha = 0.85, shape = 21, colour = "white", stroke = 0.4,
aes(fill = Cluster)) +
scale_colour_manual(values = cluster_palette) +
scale_fill_manual(values = cluster_palette) +
labs(
title = paste0("K-Means Clusters in PCA Space (k = ", k_final, ")"),
subtitle = "Countries projected onto PC1 & PC2 · Ellipses show 90% normal confidence regions",
x = "Principal Component 1",
y = "Principal Component 2",
colour = "Cluster",
fill = "Cluster"
)sil_km <- silhouette(km_clusters, dist(pca_matrix))
avg_sil_km <- mean(sil_km[, 3])
factoextra::fviz_silhouette(sil_km, palette = c("#4EA8DE","#E63946","#2EC4B6","#FF9F1C")) +
labs(
title = paste0("Silhouette Plot — K-Means (k = ", k_final, ")"),
subtitle = paste0("Average silhouette width = ", round(avg_sil_km, 3))
) +
theme_report()#> cluster size ave.sil.width
#> 1 1 88 0.53
#> 2 2 48 0.43
#> 3 3 29 0.42
#> 4 4 2 0.09
#> Average silhouette width (K-means): 0.477
Both algorithms are evaluated side-by-side. The silhouette width is the primary internal validation metric — it measures how similar each country is to its own cluster relative to neighbouring clusters (range −1 to +1; higher is better).
tibble(
Method = c("Hierarchical (Ward's)", "K-Means"),
`Linkage / Init` = c("Ward's D2", "Random centroids, nstart = 50"),
`Input Space` = rep(paste0(num_components, " PCA components"), 2),
`k` = rep(k_final, 2),
`Avg Silhouette` = round(c(avg_sil_hc, avg_sil_km), 4),
`Interpretation` = c(
"Nested tree structure, useful for understanding hierarchy",
"Hard partitions, computationally efficient, easy to deploy"
)
) |>
knitr::kable(
caption = "**Table 11** — Clustering method comparison",
booktabs = TRUE
)| Method | Linkage / Init | Input Space | k | Avg Silhouette | Interpretation |
|---|---|---|---|---|---|
| Hierarchical (Ward’s) | Ward’s D2 | 4 PCA components | 4 | 0.4670 | Nested tree structure, useful for understanding hierarchy |
| K-Means | Random centroids, nstart = 50 | 4 PCA components | 4 | 0.4768 | Hard partitions, computationally efficient, easy to deploy |
Neither method is universally superior. Hierarchical clustering excels at revealing nested similarity structures, while K-means produces clean, deployable hard partitions. Running both and comparing their silhouette scores strengthens the credibility of the final clustering — if both methods agree on which countries belong together, the groupings are more trustworthy.
Translating cluster IDs into meaningful economic categories is where data science becomes decision support. Each cluster is characterized by its average values on the five most interpretable indicators.
analysis_df <- df |>
mutate(
cluster_km = factor(km_clusters),
cluster_hc = factor(hc_clusters)
)
profile_fn <- function(data, clust_col) {
data |>
group_by({{ clust_col }}) |>
summarise(
N = n(),
Avg_GDP = round(mean(gdpp, na.rm = TRUE), 0),
Avg_Income = round(mean(income, na.rm = TRUE), 0),
Avg_Health = round(mean(health, na.rm = TRUE), 1),
Avg_ChildMort = round(mean(child_mort, na.rm = TRUE), 1),
Avg_LifeExp = round(mean(life_expec, na.rm = TRUE), 1),
.groups = "drop"
)
}
km_profile <- profile_fn(analysis_df, cluster_km)
hc_profile <- profile_fn(analysis_df, cluster_hc)
km_profile |>
knitr::kable(caption = "**Table 12** — K-Means cluster profiles", booktabs = TRUE)| cluster_km | N | Avg_GDP | Avg_Income | Avg_Health | Avg_ChildMort | Avg_LifeExp |
|---|---|---|---|---|---|---|
| 1 | 88 | 7333 | 13456 | 482.9 | 20.9 | 73.2 |
| 2 | 48 | 1909 | 3897 | 114.8 | 91.6 | 59.2 |
| 3 | 29 | 44017 | 45800 | 4085.0 | 5.1 | 80.4 |
| 4 | 2 | 75800 | 81900 | 5001.9 | 2.8 | 82.0 |
hc_profile |>
knitr::kable(caption = "**Table 13** — Hierarchical cluster profiles", booktabs = TRUE)| cluster_hc | N | Avg_GDP | Avg_Income | Avg_Health | Avg_ChildMort | Avg_LifeExp |
|---|---|---|---|---|---|---|
| 1 | 48 | 1909 | 3897 | 114.8 | 91.6 | 59.2 |
| 2 | 82 | 6287 | 12305 | 393.1 | 22.0 | 72.9 |
| 3 | 35 | 40177 | 42951 | 3677.9 | 5.4 | 79.8 |
| 4 | 2 | 75800 | 81900 | 5001.9 | 2.8 | 82.0 |
analysis_df |>
pivot_longer(
cols = c(child_mort, gdpp, income, health, life_expec),
names_to = "indicator", values_to = "value"
) |>
mutate(indicator = recode(indicator,
child_mort = "Child Mortality",
gdpp = "GDP per Capita",
income = "Income",
health = "Health Expenditure",
life_expec = "Life Expectancy"
)) |>
ggplot(aes(x = cluster_km, y = value, fill = cluster_km)) +
geom_boxplot(alpha = 0.80, outlier.size = 1.5, outlier.alpha = 0.5) +
facet_wrap(~ indicator, scales = "free_y", ncol = 3) +
scale_fill_manual(values = c("#4EA8DE", "#E63946", "#2EC4B6", "#FF9F1C")) +
labs(
title = "Key Indicators by K-Means Cluster",
subtitle = "Boxplots reveal how clusters differ across health and economic dimensions",
x = "K-Means Cluster",
y = "Value",
fill = "Cluster"
) +
theme(legend.position = "top")The priority cluster is identified programmatically: the cluster with the highest average child mortality rate, broken by the lowest average GDP per capita and lowest average health expenditure.
priority_km <- km_profile |>
arrange(desc(Avg_ChildMort), Avg_GDP, Avg_Health) |>
slice(1)
priority_hc <- hc_profile |>
arrange(desc(Avg_ChildMort), Avg_GDP, Avg_Health) |>
slice(1)
priority_km |>
knitr::kable(
caption = "**Table 14** — Highest-need cluster (K-Means)",
booktabs = TRUE
)| cluster_km | N | Avg_GDP | Avg_Income | Avg_Health | Avg_ChildMort | Avg_LifeExp |
|---|---|---|---|---|---|---|
| 2 | 48 | 1909 | 3897 | 114.8 | 91.6 | 59.2 |
priority_hc |>
knitr::kable(
caption = "**Table 15** — Highest-need cluster (Hierarchical)",
booktabs = TRUE
)| cluster_hc | N | Avg_GDP | Avg_Income | Avg_Health | Avg_ChildMort | Avg_LifeExp |
|---|---|---|---|---|---|---|
| 1 | 48 | 1909 | 3897 | 114.8 | 91.6 | 59.2 |
The need score converts the cluster analysis into an actionable, country-level ranking. It is a composite index built from three Z-scored indicators:
\[\boxed{\text{Need Score}_i = z(\text{child\_mort}_i) - z(\text{gdpp}_i) - z(\text{health}_i)}\]
Higher values indicate greater unmet need. The formula deliberately rewards high child mortality (positive) and penalizes high GDP and health expenditure (negative), since those reflect capacity to self-fund development.
priority_countries <- analysis_df |>
filter(
cluster_km == priority_km$cluster_km |
cluster_hc == priority_hc$cluster_hc
) |>
mutate(
z_child = as.numeric(scale(child_mort)),
z_gdpp = as.numeric(scale(gdpp)),
z_health = as.numeric(scale(health)),
need_score = z_child - z_gdpp - z_health
) |>
arrange(desc(need_score))
priority_countries |>
select(country, child_mort, gdpp, income, health, life_expec, need_score) |>
mutate(
need_score = round(need_score, 3),
gdpp = round(gdpp, 0),
health = round(health, 1)
) |>
head(20) |>
knitr::kable(
caption = "**Table 16** — Top 20 countries ranked by Need Score",
booktabs = TRUE
)| country | child_mort | gdpp | income | health | life_expec | need_score |
|---|---|---|---|---|---|---|
| Haiti | 208.0 | 662 | 1500 | 45.7 | 32.1 | 4.235 |
| Sierra Leone | 160.0 | 399 | 1220 | 52.3 | 55.0 | 2.887 |
| Central African Republic | 149.0 | 446 | 888 | 17.8 | 47.5 | 2.759 |
| Chad | 150.0 | 897 | 1930 | 40.6 | 56.5 | 2.495 |
| Mali | 137.0 | 708 | 1870 | 35.3 | 59.5 | 2.214 |
| Niger | 123.0 | 348 | 814 | 18.0 | 58.8 | 2.033 |
| Congo, Dem. Rep. | 116.0 | 334 | 609 | 26.4 | 57.5 | 1.783 |
| Burkina Faso | 116.0 | 575 | 1430 | 38.8 | 57.9 | 1.626 |
| Guinea-Bissau | 114.0 | 547 | 1390 | 46.5 | 55.6 | 1.531 |
| Benin | 111.0 | 758 | 1820 | 31.1 | 61.8 | 1.464 |
| Guinea | 109.0 | 648 | 1190 | 31.9 | 58.0 | 1.438 |
| Mozambique | 101.0 | 419 | 918 | 21.8 | 54.5 | 1.345 |
| Burundi | 93.6 | 231 | 764 | 26.8 | 57.7 | 1.163 |
| Cote d’Ivoire | 111.0 | 1220 | 2690 | 64.7 | 56.3 | 1.104 |
| Malawi | 90.5 | 459 | 1030 | 30.2 | 53.1 | 0.974 |
| Cameroon | 108.0 | 1310 | 2660 | 67.2 | 57.3 | 0.970 |
| Nigeria | 130.0 | 2330 | 5150 | 118.1 | 60.5 | 0.955 |
| Liberia | 89.3 | 327 | 700 | 38.6 | 60.8 | 0.934 |
| Togo | 90.3 | 488 | 1210 | 37.3 | 58.7 | 0.916 |
| Pakistan | 92.1 | 1040 | 4280 | 22.9 | 65.3 | 0.867 |
priority_countries |>
slice_head(n = 20) |>
ggplot(aes(
x = reorder(country, need_score),
y = need_score,
fill = need_score
)) +
geom_col(width = 0.75, color = "white", linewidth = 0.25) +
geom_text(
aes(label = round(need_score, 2)),
hjust = -0.12, size = 3, color = "#1E293B"
) +
coord_flip() +
scale_fill_gradient2(
low = "#ADE8F4",
mid = "#4EA8DE",
high = "#E63946",
midpoint = median(priority_countries$need_score[1:20])
) +
scale_y_continuous(expand = expansion(mult = c(0, 0.14))) +
labs(
title = "Top 20 Countries Ranked by Composite Need Score",
subtitle = "Score = z(child_mort) − z(gdpp) − z(health) · Higher = greater aid priority",
x = NULL,
y = "Need Score",
fill = "Score",
caption = "Source: Country-data.csv · Analysis by Eliya Christopher Nandi"
) +
theme(legend.position = "none")The top-ranked nations combine the highest child mortality, lowest GDP per capita, and lowest health expenditure — a convergence of vulnerabilities that no single indicator alone would fully capture. This ranking is the core deliverable of the analysis.
For transparency and reproducibility, every country’s cluster assignment is listed explicitly.
analysis_df |>
arrange(cluster_km, country) |>
group_by(cluster_km) |>
summarise(Countries = paste(as.character(country), collapse = ", "), .groups = "drop") |>
knitr::kable(
caption = "**Table 17** — Country membership by K-Means cluster",
booktabs = TRUE
)| cluster_km | Countries |
|---|---|
| 1 | Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Slovak Republic, South Korea, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam |
| 2 | Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia |
| 3 | Australia, Austria, Belgium, Brunei, Canada, Cyprus, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Malta, Netherlands, New Zealand, Norway, Qatar, Slovenia, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States |
| 4 | Luxembourg, Singapore |
analysis_df |>
arrange(cluster_hc, country) |>
group_by(cluster_hc) |>
summarise(Countries = paste(as.character(country), collapse = ", "), .groups = "drop") |>
knitr::kable(
caption = "**Table 18** — Country membership by Hierarchical cluster",
booktabs = TRUE
)| cluster_hc | Countries |
|---|---|
| 1 | Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia |
| 2 | Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam |
| 3 | Australia, Austria, Bahamas, Bahrain, Belgium, Brunei, Canada, Cyprus, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Malta, Netherlands, New Zealand, Norway, Portugal, Qatar, Slovak Republic, Slovenia, South Korea, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States |
| 4 | Luxembourg, Singapore |
Based on the clustering analysis and need-score rankings, aid and development investment should be prioritized toward the countries at the top of Table 16. These nations face a compounded disadvantage: high child mortality + low economic capacity + low health system investment.
Specifically, the analysis recommends:
This analysis is a data-driven decision-support tool, not a policy prescription. Real-world aid allocation must also account for: population size, political stability, humanitarian access, conflict risk, governance quality, and on-the-ground field assessments. Quantitative rankings should be treated as a starting point for expert deliberation, not a final answer.
This project demonstrates how unsupervised machine learning methods — applied rigorously and transparently — can extract actionable insights from cross-country socio-economic data.
Key contributions of this analysis:
The approach taken here follows best practices for reproducible, explainable, and production-quality data science portfolio work.
To reproduce this report on your machine:
1. Place Country-data.csv in the same directory as this .Rmd file.
2. Open the .Rmd file in RStudio (version ≥ 2022.07 recommended).
3. Install missing packages if prompted:
install.packages(c("tidyverse","skimr","ggcorrplot","factoextra","cluster","knitr"))
4. Click Knit → Knit to HTML (or run rmarkdown::render("...Rmd")).
5. Upload both the .Rmd and the knitted .html to GitHub for portfolio display.
To use a different dataset or cluster count, change
data_path or k_clusters in the YAML header —
all downstream results update automatically.
Full session information is recorded below for complete reproducibility. This captures the R version, platform, and exact package versions used.
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.utf8
#> [2] LC_CTYPE=English_United Kingdom.utf8
#> [3] LC_MONETARY=English_United Kingdom.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.utf8
#>
#> time zone: Europe/Berlin
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] keras3_1.5.0 knitr_1.51 cluster_2.1.8.2 factoextra_1.0.7
#> [5] ggcorrplot_0.1.4.1 skimr_2.2.2 lubridate_1.9.5 forcats_1.0.1
#> [9] stringr_1.6.0 dplyr_1.2.1 purrr_1.2.0 readr_2.2.0
#> [13] tidyr_1.3.1 tibble_3.3.0 ggplot2_4.0.2 tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 ggrepel_0.9.6
#> [5] rstatix_0.7.3 lattice_0.22-7 tzdb_0.5.0 tfruns_1.5.4
#> [9] vctrs_0.7.3 tools_4.5.2 generics_0.1.4 pkgconfig_2.0.3
#> [13] Matrix_1.7-4 RColorBrewer_1.1-3 S7_0.2.0 lifecycle_1.0.5
#> [17] compiler_4.5.2 farver_2.1.2 repr_1.1.7 codetools_0.2-20
#> [21] carData_3.0-6 htmltools_0.5.9 sass_0.4.10 yaml_2.3.12
#> [25] Formula_1.2-5 whisker_0.4.1 pillar_1.11.1 car_3.1-5
#> [29] ggpubr_0.6.2 jquerylib_0.1.4 cachem_1.1.0 abind_1.4-8
#> [33] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 reshape2_1.4.5
#> [37] labeling_0.4.3 rprojroot_2.1.1 fastmap_1.2.0 grid_4.5.2
#> [41] here_1.0.2 cli_3.6.5 magrittr_2.0.4 base64enc_0.1-6
#> [45] dotty_0.1.0 broom_1.0.12 withr_3.0.2 scales_1.4.0
#> [49] backports_1.5.0 timechange_0.4.0 rmarkdown_2.30 otel_0.2.0
#> [53] reticulate_1.44.1 ggsignif_0.6.4 png_0.1-8 hms_1.1.4
#> [57] evaluate_1.0.5 rlang_1.2.0 zeallot_0.2.0 Rcpp_1.1.0
#> [61] glue_1.8.0 rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1
#> [65] plyr_1.8.9 tensorflow_2.20.0