1. Introduction
International trade reflects deep structural characteristics of economies. Countries differ not only in how much they trade, but also in what they trade, how diversified their export baskets are and how trade interacts with development level, economic structure and openness. These differences are rarely captured by a single indicator. Instead, they emerge as multivariate patterns across trade and macroeconomic dimensions.
The objective of this project is to uncover latent structures in global trade and development profiles using unsupervised learning. Rather than imposing predefined categories (such as income groups or regions), I let the data reveal recurring patterns among country-year observations between 2019 and 2024.
Specifically this project pursues three goals:
Identify groups of countries with similar trade structures and macroeconomic characteristics using clustering.
Reveal the main latent economic dimensions that drive cross-country differences using dimension reduction.
Combine both approaches to interpret global trade–development regimes in an economically meaningful way.
1.1. Why Clustering and Dimension Reduction Together?
These two techniques are naturally complementary:
Clustering asks: Which countries are similar to each other?
Dimension reduction asks: Along which fundamental economic dimensions are they similar or different?
Clustering provides discrete groups, while PCA and MDS provide continuous economic axes.
Using both allows us to:
Validate cluster structures visually,
Interpret clusters economically,
Understand the underlying latent structure of global trade.
Given the complexity and continuity of global macroeconomic data, combining these methods particularly suits well for country level trade analysis.
1.2 Datasets and Data Sources
This project is based on a custom-constructed dataset that combines international trade flows with macroeconomic indicators from authoritative sources.
I use HS2-level international trade data for the years 2019–2024, covering reporter countries trading with the world as partner.
Each raw file contains:
Reporter country (ISO code)
Year
Trade flow (export or import)
HS2 product code
Trade value (USD)
These transaction-level data provide detailed information on export composition and trade scale.
Files used: data_raw/ comtrade_hs2_2019.csv comtrade_hs2_2020.csv comtrade_hs2_2021.csv comtrade_hs2_2022.csv comtrade_hs2_2023.csv comtrade_hs2_2024.csv
To place trade structures in economic context, I merge trade data with WDI indicators capturing:
GDP per capita (development level)
Population (scale)
Trade as % of GDP (openness)
Sectoral value added shares (agriculture, industry, services)
The combination of Comtrade and WDI makes the dataset suitable for both clustering (comparability across observations) and dimension reduction (rich multivariate structure).
2. Loading and Preparing the Datasets
Setup and Libraries
library(tidyverse)
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(cluster)
library(scales)
library(janitor)
library(stringr)
Loading and Combining UN Comtrade Files I first loaded multiple Comtrade CSV files and stacked them into a single dataset. At this stage, each row represents one country, one year, one trade flow, and one HS2 product category.
Because column names differ across Comtrade exports, I implemented a helper function to automatically detect the correct columns. This ensures robustness and reproducibility.
comtrade_files <- list.files(
path = "data_raw",
pattern = "comtrade_hs2_.*\\.csv$",
full.names = TRUE
)
trade_raw <- purrr::map_dfr(comtrade_files, ~ readr::read_csv(.x, show_col_types = FALSE))
trade_raw <- janitor::clean_names(trade_raw)
names(trade_raw)
## [1] "type_code" "freq_code"
## [3] "ref_period_id" "ref_year"
## [5] "ref_month" "period"
## [7] "reporter_code" "reporter_iso"
## [9] "reporter_desc" "flow_code"
## [11] "flow_desc" "partner_code"
## [13] "partner_iso" "partner_desc"
## [15] "partner2code" "partner2iso"
## [17] "partner2desc" "classification_code"
## [19] "classification_search_code" "is_original_classification"
## [21] "cmd_code" "cmd_desc"
## [23] "aggr_level" "is_leaf"
## [25] "customs_code" "customs_desc"
## [27] "mos_code" "mot_code"
## [29] "mot_desc" "qty_unit_code"
## [31] "qty_unit_abbr" "qty"
## [33] "is_qty_estimated" "alt_qty_unit_code"
## [35] "alt_qty_unit_abbr" "alt_qty"
## [37] "is_alt_qty_estimated" "net_wgt"
## [39] "is_net_wgt_estimated" "gross_wgt"
## [41] "is_gross_wgt_estimated" "cifvalue"
## [43] "fobvalue" "primary_value"
## [45] "legacy_estimation_flag" "is_reported"
## [47] "is_aggregate"
glimpse(trade_raw)
## Rows: 179,156
## Columns: 47
## $ type_code <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C"…
## $ freq_code <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A"…
## $ ref_period_id <dbl> 20190101, 20190101, 20190101, 20190101, 201…
## $ ref_year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ ref_month <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52,…
## $ period <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ reporter_code <dbl> 4, 4, 8, 8, 20, 20, 24, 24, 28, 28, 31, 31,…
## $ reporter_iso <chr> "AFG", "AFG", "ALB", "ALB", "AND", "AND", "…
## $ reporter_desc <chr> "Afghanistan", "Afghanistan", "Albania", "A…
## $ flow_code <chr> "M", "X", "M", "X", "M", "X", "M", "X", "M"…
## $ flow_desc <chr> "Import", "Export", "Import", "Export", "Im…
## $ partner_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ partner_iso <chr> "W00", "W00", "W00", "W00", "W00", "W00", "…
## $ partner_desc <chr> "World", "World", "World", "World", "World"…
## $ partner2code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ partner2iso <chr> "W00", "W00", "W00", "W00", "W00", "W00", "…
## $ partner2desc <chr> "World", "World", "World", "World", "World"…
## $ classification_code <chr> "H4", "H4", "H5", "H5", "H5", "H5", "H5", "…
## $ classification_search_code <chr> "HS", "HS", "HS", "HS", "HS", "HS", "HS", "…
## $ is_original_classification <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ cmd_code <chr> "01", "01", "01", "01", "01", "01", "01", "…
## $ cmd_desc <chr> "Animals; live", "Animals; live", "Animals;…
## $ aggr_level <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ is_leaf <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ customs_code <chr> "C00", "C00", "C00", "C00", "C00", "C00", "…
## $ customs_desc <chr> "TOTAL CPC", "TOTAL CPC", "TOTAL CPC", "TOT…
## $ mos_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mot_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mot_desc <chr> "TOTAL MOT", "TOTAL MOT", "TOTAL MOT", "TOT…
## $ qty_unit_code <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ qty_unit_abbr <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "…
## $ qty <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ is_qty_estimated <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ alt_qty_unit_code <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ alt_qty_unit_abbr <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "…
## $ alt_qty <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ is_alt_qty_estimated <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ net_wgt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, N…
## $ is_net_wgt_estimated <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ gross_wgt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ is_gross_wgt_estimated <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ cifvalue <dbl> 16791952.1, 0.0, 40884453.6, 0.0, 116848.9,…
## $ fobvalue <dbl> 0.000000e+00, 3.946623e+04, 0.000000e+00, 5…
## $ primary_value <dbl> 1.679195e+07, 3.946623e+04, 4.088445e+07, 5…
## $ legacy_estimation_flag <dbl> 4, 4, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 0, 0…
## $ is_reported <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ is_aggregate <chr> "true,", "true,", "true,", "true,", "true,"…
Loading World Bank WDI File
wdi_path <- "data_raw/worldbank_wdi_2019_2024.csv"
wdi_raw <- readr::read_csv(wdi_path, show_col_types = FALSE)
wdi_raw <- janitor::clean_names(wdi_raw)
names(wdi_raw)
## [1] "iso3c" "country" "year" "gdp_pc" "pop" "trade_gdp"
## [7] "ind_va" "srv_va" "agr_va"
glimpse(wdi_raw)
## Rows: 1,566
## Columns: 9
## $ iso3c <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFE", "AFE", "AFE…
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ year <dbl> 2019, 2020, 2021, 2022, 2023, 2024, 2019, 2020, 2021, 2022, …
## $ gdp_pc <dbl> 496.6025, 510.7871, 356.4962, 357.2612, 413.7579, NA, 1507.0…
## $ pop <dbl> 37856121, 39068979, 40000412, 40578842, 41454761, 42647492, …
## $ trade_gdp <dbl> NA, 46.70989, 51.41172, 72.88547, 67.58467, NA, 53.52813, 49…
## $ ind_va <dbl> 14.05811, 12.95260, 14.27366, 16.05037, 13.44982, NA, 26.033…
## $ srv_va <dbl> 55.47281, 52.57211, 47.16042, 45.04991, 46.37437, NA, 52.629…
## $ agr_va <dbl> 25.77397, 29.97558, 33.59762, 33.70143, 34.74325, NA, 12.953…
Cleaning and Preparing Data
Before aggregation, I cleaned the transaction-level data by:
Selecting only relevant variables (country, year, flow, HS2, value),
Restricting the sample to 2019–2024,
Removing missing and non-positive trade values,
Standardizing trade flow labels so exports are coded as “X” and imports as “M”.
This step was obligatory because all later indicators will be depending on consistent definitions of exports, imports, and product categories.
pick_col <- function(df, candidates) {
hit <- candidates[candidates %in% names(df)]
if (length(hit) == 0) stop("None of these columns were found: ", paste(candidates, collapse = ", "))
hit[1]
}
col_country <- pick_col(trade_raw, c(
"reporter_iso", "reporter_iso3", "reporter", "reporter_code",
"reporter_iso_code", "reporter_iso3_code"
))
col_year <- pick_col(trade_raw, c("year", "ref_year", "period"))
col_flow <- pick_col(trade_raw, c("flow", "flow_code", "flow_desc"))
col_hs2 <- pick_col(trade_raw, c("hs2", "cmd_code", "commodity_code", "cmdcode", "commodity"))
col_value <- pick_col(trade_raw, c("primary_value", "trade_value", "trade_value_usd", "fobvalue", "cifvalue", "value"))
trade_clean <- trade_raw %>%
transmute(
country = as.character(.data[[col_country]]),
year = as.integer(as.character(.data[[col_year]])),
flow = as.character(.data[[col_flow]]),
hs2 = as.character(.data[[col_hs2]]),
value = suppressWarnings(as.numeric(.data[[col_value]]))
) %>%
filter(year >= 2019 & year <= 2024, !is.na(value), value > 0) %>%
mutate(
flow = case_when(
flow %in% c("X", "Export", "Exports", "export", "exports") ~ "X",
flow %in% c("M", "Import", "Imports", "import", "imports") ~ "M",
TRUE ~ flow
)
)
table(trade_clean$flow)
##
## M X
## 93358 85798
glimpse(trade_clean)
## Rows: 179,156
## Columns: 5
## $ country <chr> "AFG", "AFG", "ALB", "ALB", "AND", "AND", "AGO", "AGO", "ATG",…
## $ year <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
## $ flow <chr> "M", "X", "M", "X", "M", "X", "M", "X", "M", "X", "M", "X", "M…
## $ hs2 <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
## $ value <dbl> 1.679195e+07, 3.946623e+04, 4.088445e+07, 5.379707e+05, 1.1684…
Building Trade Structure Indicators
Clustering requires observations to be comparable at the same level of aggregation. I therefore transformed transaction-level data into country–year trade indicators:
Total exports (x_total) and imports (m_total),
Trade balance (exports minus imports),
Export–import ratio as a scale-free orientation measure,
Number of HS2 products exported (n_products) as a diversification proxy,
Herfindahl–Hirschman Index (hhi) measuring export concentration.
These indicators jointly capture both the size and the structure of trade, which is central for distinguishing trade profiles.
trade_totals <- trade_clean %>%
group_by(country, year, flow) %>%
summarise(total_value = sum(value), .groups = "drop") %>%
pivot_wider(names_from = flow, values_from = total_value, values_fill = 0) %>%
rename(x_total = X, m_total = M) %>%
mutate(
trade_balance = x_total - m_total,
xm_ratio = x_total / (m_total + 1)
)
glimpse(trade_totals)
## Rows: 971
## Columns: 6
## $ country <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", …
## $ year <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 20…
## $ m_total <dbl> 1313806452, 939368008, 1169773934, 1473238003, 153651956…
## $ x_total <dbl> 78642939, 65929725, 88347008, 104796787, 111004914, 8704…
## $ trade_balance <dbl> -1235163513, -873438282, -1081426927, -1368441216, -1425…
## $ xm_ratio <dbl> 0.05985885, 0.07018519, 0.07552486, 0.07113364, 0.072244…
n_products <- trade_clean %>%
filter(flow == "X") %>%
group_by(country, year) %>%
summarise(n_products = n_distinct(hs2), .groups = "drop")
glimpse(n_products)
## Rows: 971
## Columns: 3
## $ country <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", "AG…
## $ year <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 2022,…
## $ n_products <int> 84, 77, 79, 81, 84, 69, 95, 92, 92, 94, 95, 95, 94, 96, 95,…
export_shares <- trade_clean %>%
filter(flow == "X") %>%
group_by(country, year, hs2) %>%
summarise(export_value = sum(value), .groups = "drop") %>%
group_by(country, year) %>%
mutate(
total_exports = sum(export_value),
share = export_value / total_exports
)
hhi <- export_shares %>%
summarise(hhi = sum(share^2), .groups = "drop")
glimpse(hhi)
## Rows: 971
## Columns: 3
## $ country <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", "AGO",…
## $ year <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 2022, 20…
## $ hhi <dbl> 0.13638179, 0.15118685, 0.27458806, 0.27619753, 0.30585118, 0.…
trade_features <- trade_totals %>%
left_join(n_products, by = c("country", "year")) %>%
left_join(hhi, by = c("country", "year"))
write.csv(trade_features, "data_processed/trade_features_2019_2024.csv", row.names = FALSE)
glimpse(trade_features)
## Rows: 971
## Columns: 8
## $ country <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", …
## $ year <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 20…
## $ m_total <dbl> 1313806452, 939368008, 1169773934, 1473238003, 153651956…
## $ x_total <dbl> 78642939, 65929725, 88347008, 104796787, 111004914, 8704…
## $ trade_balance <dbl> -1235163513, -873438282, -1081426927, -1368441216, -1425…
## $ xm_ratio <dbl> 0.05985885, 0.07018519, 0.07552486, 0.07113364, 0.072244…
## $ n_products <int> 84, 77, 79, 81, 84, 69, 95, 92, 92, 94, 95, 95, 94, 96, …
## $ hhi <dbl> 0.13638179, 0.15118685, 0.27458806, 0.27619753, 0.305851…
Pattern-Based Column Detection
Indicator names in WDI files can vary (e.g. gdp_per_capita, gdp_pc, NY.GDP.PCAP.CD). To handle this variability, I defined a helper function that selects columns using regular-expression pattern matching rather than relying on exact column names.
The function searches all columns except the year variable and returns the first column that matches the specified pattern. This approach allows the pipeline to remain stable even if the naming conventions in the WDI export change.
pick_by_pattern <- function(df, patterns, exclude = c("year")) {
cols <- setdiff(names(df), exclude)
hits <- cols[stringr::str_detect(cols, paste(patterns, collapse = "|"))]
if (length(hits) == 0) return(NA_character_)
hits[1]
}
Detecting WDI File Structure and Extracting Indicators
In this section, I determine whether the WDI dataset is in tidy or wide format and process it accordingly. First, I verify that a year column exists. If not, the file cannot be safely processed and execution is stopped. I then identify the country identifier column using flexible matching, as this can also vary across exports (ISO3 code, country name, economy, etc.). To distinguish between tidy and wide formats, I examine the number of numeric columns excluding identifiers. If several numeric indicator columns are present, I treat the file as tidy. Otherwise, I assume a wide structure and reshape it into long form before extracting the required indicators. For tidy files, I automatically detect the columns corresponding to GDP per capita, population, trade openness, and sectoral value-added shares using pattern matching. For wide files, I rely on official WDI indicator codes and pivot the data into a consistent country–year format.
year_cols <- names(wdi_raw)[stringr::str_detect(names(wdi_raw), "^x?20\\d\\d$")]
has_year_col <- "year" %in% names(wdi_raw)
col_country_any <- suppressWarnings({
pick_col(wdi_raw, c(
"iso3", "iso3_code", "iso3c", "iso3c_code",
"country_code", "countrycode", "country_iso3", "countryiso3code",
"iso_code", "code",
"country", "country_name", "economy", "location", "entity", "name"
))
})
if (!has_year_col) stop("WDI does not have a 'year' column, please inspect the file format")
id_candidates <- c("year", col_country_any)
numeric_cols <- names(wdi_raw)[sapply(wdi_raw, is.numeric)]
numeric_cols <- setdiff(numeric_cols, id_candidates)
is_tidy_indicator_table <- length(numeric_cols) >= 4
if (is_tidy_indicator_table) {
col_gdp_pc <- pick_by_pattern(wdi_raw, patterns = c("gdp.*pcap|pcap|gdp_pc|gdp_per_cap"))
col_pop <- pick_by_pattern(wdi_raw, patterns = c("^pop$|pop_|population|sp_pop"))
col_tradegdp <- pick_by_pattern(wdi_raw, patterns = c("trade.*gdp|trade_gdp|ne_trd"))
col_agr <- pick_by_pattern(wdi_raw, patterns = c("agr|agri|nv_agr"))
col_ind <- pick_by_pattern(wdi_raw, patterns = c("ind(?!ex)|industry|nv_ind"))
col_srv <- pick_by_pattern(wdi_raw, patterns = c("srv|service|nv_srv"))
if (any(is.na(c(col_gdp_pc, col_pop, col_tradegdp, col_agr, col_ind, col_srv)))) {
print(numeric_cols)
stop("Could not auto-detect at least one WDI indicator column")
}
wdi_clean <- wdi_raw %>%
transmute(
country = as.character(.data[[col_country_any]]),
year = as.integer(as.character(year)),
gdp_pc = suppressWarnings(as.numeric(.data[[col_gdp_pc]])),
population = suppressWarnings(as.numeric(.data[[col_pop]])),
trade_gdp = suppressWarnings(as.numeric(.data[[col_tradegdp]])),
agr_va = suppressWarnings(as.numeric(.data[[col_agr]])),
ind_va = suppressWarnings(as.numeric(.data[[col_ind]])),
srv_va = suppressWarnings(as.numeric(.data[[col_srv]]))
) %>%
filter(year >= 2019 & year <= 2024)
} else {
col_country_code <- pick_col(wdi_raw, c("country_code","countrycode","country_iso3","countryiso3code"))
col_indicator <- pick_col(wdi_raw, c("indicator_code","series_code","indicator","series","indicator_id"))
if (length(year_cols) == 0) stop("Could not detect WDI year columns")
wdi_long <- wdi_raw %>%
select(all_of(c(col_country_code, col_indicator)), all_of(year_cols)) %>%
pivot_longer(cols = all_of(year_cols), names_to = "year", values_to = "value") %>%
mutate(
year = as.integer(stringr::str_remove(year, "^x")),
value = suppressWarnings(as.numeric(value))
) %>%
filter(year >= 2019 & year <= 2024)
needed_codes <- c(
"NY.GDP.PCAP.CD",
"SP.POP.TOTL",
"NE.TRD.GNFS.ZS",
"NV.AGR.TOTL.ZS",
"NV.IND.TOTL.ZS",
"NV.SRV.TOTL.ZS"
)
wdi_clean <- wdi_long %>%
filter(.data[[col_indicator]] %in% needed_codes) %>%
mutate(ind = recode(.data[[col_indicator]],
"NY.GDP.PCAP.CD" = "gdp_pc",
"SP.POP.TOTL" = "population",
"NE.TRD.GNFS.ZS" = "trade_gdp",
"NV.AGR.TOTL.ZS" = "agr_va",
"NV.IND.TOTL.ZS" = "ind_va",
"NV.SRV.TOTL.ZS" = "srv_va"
)) %>%
select(country = all_of(col_country_code), year, ind, value) %>%
pivot_wider(names_from = ind, values_from = value)
}
write.csv(wdi_clean, "data_processed/wdi_clean_2019_2024.csv", row.names = FALSE)
glimpse(wdi_clean)
## Rows: 1,566
## Columns: 8
## $ country <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFE", "AFE", "AF…
## $ year <int> 2019, 2020, 2021, 2022, 2023, 2024, 2019, 2020, 2021, 2022,…
## $ gdp_pc <dbl> 496.6025, 510.7871, 356.4962, 357.2612, 413.7579, NA, 1507.…
## $ population <dbl> 37856121, 39068979, 40000412, 40578842, 41454761, 42647492,…
## $ trade_gdp <dbl> NA, 46.70989, 51.41172, 72.88547, 67.58467, NA, 53.52813, 4…
## $ agr_va <dbl> 25.77397, 29.97558, 33.59762, 33.70143, 34.74325, NA, 12.95…
## $ ind_va <dbl> 14.05811, 12.95260, 14.27366, 16.05037, 13.44982, NA, 26.03…
## $ srv_va <dbl> 55.47281, 52.57211, 47.16042, 45.04991, 46.37437, NA, 52.62…
Merging Trade Features with WDI Data and Handling Missing Values
I merged the trade-structure indicators with the cleaned WDI dataset at the country–year level. I then summarized missing values across all variables. Because distance-based methods such as k-means, PCA, and MDS are sensitive to missing data, I removed observations with missing values in the core macroeconomic indicators. This ensures that similarity measures are not distorted by incomplete information.
main_dataset <- trade_features %>%
left_join(wdi_clean, by = c("country", "year"))
missing_summary <- main_dataset %>%
summarise(across(everything(), ~ sum(is.na(.))))
missing_summary
## # A tibble: 1 × 14
## country year m_total x_total trade_balance xm_ratio n_products hhi gdp_pc
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 17
## # ℹ 5 more variables: population <int>, trade_gdp <int>, agr_va <int>,
## # ind_va <int>, srv_va <int>
main_dataset <- main_dataset %>%
drop_na(gdp_pc, population, trade_gdp, agr_va, ind_va, srv_va)
write.csv(main_dataset, "data_processed/main_country_year_dataset_2019_2024.csv", row.names = FALSE)
glimpse(main_dataset)
## Rows: 823
## Columns: 14
## $ country <chr> "ABW", "ABW", "ABW", "ABW", "AGO", "AGO", "AGO", "AGO", …
## $ year <int> 2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022, 2023, 20…
## $ m_total <dbl> 1313806452, 939368008, 1169773934, 1473238003, 139616404…
## $ x_total <dbl> 78642939, 65929725, 88347008, 104796787, 35432454859, 22…
## $ trade_balance <dbl> -1235163513, -873438282, -1081426927, -1368441216, 21470…
## $ xm_ratio <dbl> 0.05985885, 0.07018519, 0.07552486, 0.07113364, 2.537843…
## $ n_products <int> 84, 77, 79, 81, 95, 92, 92, 94, 95, 95, 94, 96, 95, 97, …
## $ hhi <dbl> 0.13638179, 0.15118685, 0.27458806, 0.27619753, 0.889236…
## $ gdp_pc <dbl> 30645.891, 22759.807, 26749.330, 30975.999, 2493.679, 17…
## $ population <dbl> 109203, 108587, 107700, 107310, 32375632, 33451132, 3453…
## $ trade_gdp <dbl> 137.96193, 122.72374, 139.91780, 160.63455, 60.71482, 54…
## $ agr_va <dbl> 0.01842432, 0.02170060, 0.02999855, 0.03421443, 12.39115…
## $ ind_va <dbl> 11.628810, 12.611441, 11.007534, 9.735917, 42.225105, 36…
## $ srv_va <dbl> 77.91933, 77.20070, 78.41041, 79.00360, 43.86272, 45.359…
Preparing the Feature Matrix
The final feature matrix includes 12 numeric variables. I applied two preprocessing steps: Log transformations for skewed size variables (exports, imports, GDP per capita, population) and including a signed log for trade balance. Scaling so all variables contribute equally to distance calculations.
Without these steps, large-unit variables would dominate both clustering and PCA results.
X <- main_dataset %>%
select(
x_total, m_total, trade_balance, xm_ratio, n_products, hhi,
gdp_pc, population, trade_gdp,
agr_va, ind_va, srv_va
)
X_transformed <- X %>%
mutate(
x_total = log1p(x_total),
m_total = log1p(m_total),
trade_balance = sign(trade_balance) * log1p(abs(trade_balance)),
gdp_pc = log1p(gdp_pc),
population = log1p(population)
)
X_scaled <- scale(X_transformed) %>% as.matrix()
dim(X_scaled)
## [1] 823 12
3. Clustering Analysis
3.1 Clusterability Check: Hopkins Statistic
Before clustering, I tested whether the dataset exhibits meaningful clustering tendency using the Hopkins statistic.
Values close to 0.5 indicate randomness, Values closer to 1 indicate non-random, clusterable structure.
I used a safe implementation to ensure numerical stability. The resulting value indicates that the data contain meaningful structure suitable for clustering.
hopkins_safe <- function(X, m = 100, seed = 123) {
set.seed(seed)
X <- as.data.frame(X)
names(X) <- make.names(names(X), unique = TRUE)
n <- nrow(X)
d <- ncol(X)
m <- min(m, n - 1)
mins <- apply(X, 2, min)
maxs <- apply(X, 2, max)
rand_points <- as.data.frame(matrix(runif(m * d), nrow = m, ncol = d))
for (j in 1:d) {
rand_points[[j]] <- mins[j] + rand_points[[j]] * (maxs[j] - mins[j])
}
names(rand_points) <- names(X)
u_dist <- sapply(1:m, function(i) {
min(as.matrix(dist(rbind(rand_points[i, , drop = FALSE], X)))[1, -1])
})
sample_idx <- sample(1:n, m)
w_dist <- sapply(sample_idx, function(i) {
Xi <- X[i, , drop = FALSE]
X_others <- X[-i, , drop = FALSE]
min(as.matrix(dist(rbind(Xi, X_others)))[1, -1])
})
sum(u_dist) / (sum(u_dist) + sum(w_dist))
}
H <- hopkins_safe(X_scaled, m = 100)
H
## [1] 0.9355139
3.2 Choosing the Number of Clusters
I evaluated the number of clusters using the elbow method (WSS) and the silhouette method.
In macroeconomic data, silhouette values are typically moderate due to overlapping country profiles. I therefore treated these diagnostics as guidance rather than strict rules.
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(X_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
xlab = "k", ylab = "Total within-cluster SS",
main = "Elbow Method (WSS)")
sil_width <- sapply(2:10, function(k) {
km_tmp <- kmeans(X_scaled, centers = k, nstart = 25)
ss <- cluster::silhouette(km_tmp$cluster, dist(X_scaled))
mean(ss[, 3])
})
plot(2:10, sil_width, type = "b", pch = 19,
xlab = "k", ylab = "Avg silhouette width",
main = "Silhouette Method")
sil_width
## [1] 0.2270123 0.2327230 0.2618397 0.2501236 0.2648038 0.2660268 0.2585856
## [8] 0.2361480 0.2452863
3.3 K-means Clustering and Cluster Profiles
Based on diagnostics and interpretability, I selected k = 4 and applied k-means clustering. Each country–year observation was assigned to a cluster.
To interpret the clusters, I computed cluster-level means for all features. These profiles represent trade–development archetypes rather than rigid categories.
set.seed(123)
k <- 4
km <- kmeans(X_scaled, centers = k, nstart = 50)
main_dataset$cluster <- factor(km$cluster)
table(main_dataset$cluster)
##
## 1 2 3 4
## 127 85 310 301
cluster_profiles <- main_dataset %>%
group_by(cluster) %>%
summarise(across(
c(x_total, m_total, trade_balance, xm_ratio, n_products, hhi,
gdp_pc, population, trade_gdp, agr_va, ind_va, srv_va),
mean, na.rm = TRUE
))
cluster_profiles
## # A tibble: 4 × 13
## cluster x_total m_total trade_balance xm_ratio n_products hhi gdp_pc
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1287004686. 3.83e 9 -2546939329. 0.243 71.4 0.280 20000.
## 2 2 55132698276. 3.10e10 24139254111. 1.77 90.1 0.590 18811.
## 3 3 329884816962. 3.23e11 7109436641. 0.988 96.6 0.113 34447.
## 4 4 20232344287. 3.19e10 -11667808747. 0.596 91.4 0.170 3596.
## # ℹ 5 more variables: population <dbl>, trade_gdp <dbl>, agr_va <dbl>,
## # ind_va <dbl>, srv_va <dbl>
3.4 Hierarchical Clustering as Validation
To validate the k-means results, I applied hierarchical clustering (Ward.D2) to the same distance matrix. Comparing hierarchical assignments with k-means clusters shows substantial consistency, increasing confidence that the identified structure is not algorithm-specific.
dist_matrix <- dist(X_scaled)
hc <- hclust(dist_matrix, method = "ward.D2")
plot(hc, labels = FALSE, hang = -1, main = "Hierarchical Clustering Dendrogram")
hc_clusters <- cutree(hc, k = k)
table(KMeans = km$cluster, Hierarchical = hc_clusters)
## Hierarchical
## KMeans 1 2 3 4
## 1 87 0 40 0
## 2 5 80 0 0
## 3 0 149 161 0
## 4 5 11 190 95
The results indicate meaningful but overlapping regimes. Clusters should be interpreted as broad patterns rather than sharply separated groups, which is realistic for global economic data.
4. Dimension Reduction
4.1 Motivation Clustering was performed in a 12-dimensional space, which is not directly interpretable. Dimension reduction serves two purposes: Identifying dominant axes of variation, Visualizing whether clusters align with these axes.
I applied PCA and classical MDS, which preserve different aspects of structure.
4.2 PCA
PCA reveals orthogonal components explaining the largest share of variance. The scree and cumulative variance plots show that the first two components explain a substantial proportion of total variation, making PC1–PC2 visualization meaningful.
Loadings indicate that:
PC1 is mainly driven by trade scale and development variables (exports, imports, GDP per capita, population), capturing overall economic size and intensity.
PC2 is driven by openness, sectoral composition, and diversification indicators, capturing structural differences across economies.
Plotting PC1 vs PC2 and coloring by cluster shows strong alignment between clustering and dominant variance structure.
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
pca_var <- (pca$sdev^2)
pca_var_ratio <- pca_var / sum(pca_var)
pca_var_df <- data.frame(
PC = paste0("PC", seq_along(pca_var_ratio)),
var_explained = pca_var_ratio,
cum_var = cumsum(pca_var_ratio)
)
plot(pca_var_ratio, type = "b", pch = 19,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
main = "PCA Scree Plot")
plot(pca_var_df$cum_var, type = "b", pch = 19,
xlab = "Principal Component",
ylab = "Cumulative Variance Explained",
main = "PCA Cumulative Explained Variance",
ylim = c(0, 1))
scores <- as.data.frame(pca$x)
scores$cluster <- main_dataset$cluster
scores$country <- main_dataset$country
scores$year <- main_dataset$year
cols <- as.numeric(scores$cluster)
plot(scores$PC1, scores$PC2,
col = cols,
pch = 19,
xlab = paste0("PC1 (", round(100 * pca_var_ratio[1], 1), "%)"),
ylab = paste0("PC2 (", round(100 * pca_var_ratio[2], 1), "%)"),
main = "PCA: PC1 vs PC2 (Colored by Cluster)")
legend("topright",
legend = levels(scores$cluster),
col = seq_along(levels(scores$cluster)),
pch = 19)
loadings <- as.data.frame(pca$rotation)
loadings$feature <- rownames(loadings)
top_pc1 <- loadings %>%
mutate(abs_PC1 = abs(PC1)) %>%
arrange(desc(abs_PC1)) %>%
select(feature, PC1) %>%
head(10)
top_pc2 <- loadings %>%
mutate(abs_PC2 = abs(PC2)) %>%
arrange(desc(abs_PC2)) %>%
select(feature, PC2) %>%
head(10)
top_pc1
## feature PC1
## x_total x_total 0.4529420
## m_total m_total 0.4304688
## n_products n_products 0.3497553
## xm_ratio xm_ratio 0.3237472
## gdp_pc gdp_pc 0.3139626
## agr_va agr_va -0.3085282
## trade_balance trade_balance 0.2731948
## population population 0.2226554
## ind_va ind_va 0.1828848
## trade_gdp trade_gdp 0.1119138
top_pc2
## feature PC2
## srv_va srv_va -0.50525509
## ind_va ind_va 0.40826625
## gdp_pc gdp_pc -0.32588364
## agr_va agr_va 0.31117655
## population population 0.29596267
## trade_gdp trade_gdp -0.29050620
## xm_ratio xm_ratio 0.27472843
## hhi hhi 0.27273906
## trade_balance trade_balance 0.22080955
## x_total x_total 0.06247722
4.3 MDS
While PCA preserves variance, MDS preserves distances. Using MDS on the same scaled features allows me to examine whether similarity relationships underlying clustering remain visible.
The MDS plot shows partial separation with overlaps, reinforcing the idea that global trade patterns form continuous gradients rather than isolated groups.
dist_matrix <- dist(X_scaled)
mds <- cmdscale(dist_matrix, k = 2, eig = TRUE)
mds_points <- as.data.frame(mds$points)
colnames(mds_points) <- c("Dim1", "Dim2")
mds_points$cluster <- main_dataset$cluster
cols <- as.numeric(mds_points$cluster)
plot(mds_points$Dim1, mds_points$Dim2,
col = cols,
pch = 19,
xlab = "MDS Dimension 1",
ylab = "MDS Dimension 2",
main = "Classical MDS (2D) Colored by Cluster")
legend("topright",
legend = levels(mds_points$cluster),
col = seq_along(levels(mds_points$cluster)),
pch = 19)
mds_var_ratio <- mds$eig / sum(mds$eig[mds$eig > 0])
sum(mds_var_ratio[1:2], na.rm = TRUE)
## [1] 0.6209684
PCA and MDS together indicate strong underlying structure combined with gradual transitions, which is expected in country-level macroeconomic data.
5. Final Conclusion
I combined clustering and dimension reduction because they address complementary aspects of the same research question. Clustering summarizes country–year observations into a small number of interpretable trade–development regimes. Dimension reduction explains why these regimes differ and validates whether clusters align with dominant variance and distance structure. The analysis shows that global trade profiles between 2019 and 2024 can be characterized by a limited number of broad regimes reflecting trade scale, openness, diversification, and economic structure. PCA reveals that a few latent dimensions explain most variation, while MDS confirms that similarity is continuous rather than sharply segmented.
Overall, the analysis shows that global trade and development patterns between 2019 and 2024 are structured around a small number of broad, recurring regimes. These regimes are shaped primarily by trade scale, openness, diversification, and sectoral composition, and they manifest as continuous gradients rather than sharply separated groups across countries.