1. Introduction

International trade reflects deep structural characteristics of economies. Countries differ not only in how much they trade, but also in what they trade, how diversified their export baskets are and how trade interacts with development level, economic structure and openness. These differences are rarely captured by a single indicator. Instead, they emerge as multivariate patterns across trade and macroeconomic dimensions.

The objective of this project is to uncover latent structures in global trade and development profiles using unsupervised learning. Rather than imposing predefined categories (such as income groups or regions), I let the data reveal recurring patterns among country-year observations between 2019 and 2024.

Specifically, this project pursues three goals:

Identify groups of countries with similar trade structures and macroeconomic characteristics using clustering.

Reveal the main latent economic dimensions that drive cross-country differences using dimension reduction.

Combine both approaches to interpret global trade–development regimes in an economically meaningful way.

1.1 Why Clustering and Dimension Reduction Together?

These two techniques are naturally complementary:

Clustering asks: Which countries are similar to each other?

Dimension reduction asks: Along which fundamental economic dimensions are they similar or different?

Clustering provides discrete groups, while PCA and MDS provide continuous economic axes.

Using both allows us to:

Validate cluster structures visually,

Interpret clusters economically,

Understand the underlying latent structure of global trade.

Given the complexity and continuity of global macroeconomic data, combining these methods is particularly well suited to country-level trade analysis.

1.2 Datasets and Data Sources

This project is based on a custom-constructed dataset that combines international trade flows with macroeconomic indicators from authoritative sources.

  1. UN Comtrade Database. Source: https://comtradeplus.un.org

I use HS2-level international trade data for the years 2019–2024, covering reporter countries trading with the world as partner.

Each raw file contains:

Reporter country (ISO code)

Year

Trade flow (export or import)

HS2 product code

Trade value (USD)

These product-level data provide detailed information on export composition and trade scale.

Files used (in data_raw/): comtrade_hs2_2019.csv, comtrade_hs2_2020.csv, comtrade_hs2_2021.csv, comtrade_hs2_2022.csv, comtrade_hs2_2023.csv, comtrade_hs2_2024.csv

  2. World Bank – World Development Indicators (WDI). Source: https://data.worldbank.org

To place trade structures in economic context, I merge trade data with WDI indicators capturing:

GDP per capita (development level)

Population (scale)

Trade as % of GDP (openness)

Sectoral value added shares (agriculture, industry, services)

The combination of Comtrade and WDI makes the dataset suitable for both clustering (comparability across observations) and dimension reduction (rich multivariate structure).

2. Loading and Preparing the Datasets

Setup and Libraries

library(tidyverse)
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(cluster)
library(scales)
library(janitor)
library(stringr)

Loading and Combining UN Comtrade Files

I first loaded the multiple Comtrade CSV files and stacked them into a single dataset. At this stage, each row represents one country, one year, one trade flow, and one HS2 product category.

Because column names can differ across Comtrade exports, I implemented a helper function (pick_col, defined in the cleaning step below) to automatically detect the correct columns. This keeps the pipeline robust and reproducible.

comtrade_files <- list.files(
  path = "data_raw",
  pattern = "comtrade_hs2_.*\\.csv$",
  full.names = TRUE
)

trade_raw <- purrr::map_dfr(comtrade_files, ~ readr::read_csv(.x, show_col_types = FALSE))
trade_raw <- janitor::clean_names(trade_raw)

names(trade_raw)
##  [1] "type_code"                  "freq_code"                 
##  [3] "ref_period_id"              "ref_year"                  
##  [5] "ref_month"                  "period"                    
##  [7] "reporter_code"              "reporter_iso"              
##  [9] "reporter_desc"              "flow_code"                 
## [11] "flow_desc"                  "partner_code"              
## [13] "partner_iso"                "partner_desc"              
## [15] "partner2code"               "partner2iso"               
## [17] "partner2desc"               "classification_code"       
## [19] "classification_search_code" "is_original_classification"
## [21] "cmd_code"                   "cmd_desc"                  
## [23] "aggr_level"                 "is_leaf"                   
## [25] "customs_code"               "customs_desc"              
## [27] "mos_code"                   "mot_code"                  
## [29] "mot_desc"                   "qty_unit_code"             
## [31] "qty_unit_abbr"              "qty"                       
## [33] "is_qty_estimated"           "alt_qty_unit_code"         
## [35] "alt_qty_unit_abbr"          "alt_qty"                   
## [37] "is_alt_qty_estimated"       "net_wgt"                   
## [39] "is_net_wgt_estimated"       "gross_wgt"                 
## [41] "is_gross_wgt_estimated"     "cifvalue"                  
## [43] "fobvalue"                   "primary_value"             
## [45] "legacy_estimation_flag"     "is_reported"               
## [47] "is_aggregate"
glimpse(trade_raw)
## Rows: 179,156
## Columns: 47
## $ type_code                  <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C"…
## $ freq_code                  <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A"…
## $ ref_period_id              <dbl> 20190101, 20190101, 20190101, 20190101, 201…
## $ ref_year                   <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ ref_month                  <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52,…
## $ period                     <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ reporter_code              <dbl> 4, 4, 8, 8, 20, 20, 24, 24, 28, 28, 31, 31,…
## $ reporter_iso               <chr> "AFG", "AFG", "ALB", "ALB", "AND", "AND", "…
## $ reporter_desc              <chr> "Afghanistan", "Afghanistan", "Albania", "A…
## $ flow_code                  <chr> "M", "X", "M", "X", "M", "X", "M", "X", "M"…
## $ flow_desc                  <chr> "Import", "Export", "Import", "Export", "Im…
## $ partner_code               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ partner_iso                <chr> "W00", "W00", "W00", "W00", "W00", "W00", "…
## $ partner_desc               <chr> "World", "World", "World", "World", "World"…
## $ partner2code               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ partner2iso                <chr> "W00", "W00", "W00", "W00", "W00", "W00", "…
## $ partner2desc               <chr> "World", "World", "World", "World", "World"…
## $ classification_code        <chr> "H4", "H4", "H5", "H5", "H5", "H5", "H5", "…
## $ classification_search_code <chr> "HS", "HS", "HS", "HS", "HS", "HS", "HS", "…
## $ is_original_classification <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ cmd_code                   <chr> "01", "01", "01", "01", "01", "01", "01", "…
## $ cmd_desc                   <chr> "Animals; live", "Animals; live", "Animals;…
## $ aggr_level                 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ is_leaf                    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ customs_code               <chr> "C00", "C00", "C00", "C00", "C00", "C00", "…
## $ customs_desc               <chr> "TOTAL CPC", "TOTAL CPC", "TOTAL CPC", "TOT…
## $ mos_code                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mot_code                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mot_desc                   <chr> "TOTAL MOT", "TOTAL MOT", "TOTAL MOT", "TOT…
## $ qty_unit_code              <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ qty_unit_abbr              <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "…
## $ qty                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ is_qty_estimated           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ alt_qty_unit_code          <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ alt_qty_unit_abbr          <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "…
## $ alt_qty                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ is_alt_qty_estimated       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ net_wgt                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, N…
## $ is_net_wgt_estimated       <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ gross_wgt                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ is_gross_wgt_estimated     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ cifvalue                   <dbl> 16791952.1, 0.0, 40884453.6, 0.0, 116848.9,…
## $ fobvalue                   <dbl> 0.000000e+00, 3.946623e+04, 0.000000e+00, 5…
## $ primary_value              <dbl> 1.679195e+07, 3.946623e+04, 4.088445e+07, 5…
## $ legacy_estimation_flag     <dbl> 4, 4, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 0, 0…
## $ is_reported                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ is_aggregate               <chr> "true,", "true,", "true,", "true,", "true,"…

Loading World Bank WDI File

wdi_path <- "data_raw/worldbank_wdi_2019_2024.csv"
wdi_raw <- readr::read_csv(wdi_path, show_col_types = FALSE)
wdi_raw <- janitor::clean_names(wdi_raw)

names(wdi_raw)
## [1] "iso3c"     "country"   "year"      "gdp_pc"    "pop"       "trade_gdp"
## [7] "ind_va"    "srv_va"    "agr_va"
glimpse(wdi_raw)
## Rows: 1,566
## Columns: 9
## $ iso3c     <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFE", "AFE", "AFE…
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ year      <dbl> 2019, 2020, 2021, 2022, 2023, 2024, 2019, 2020, 2021, 2022, …
## $ gdp_pc    <dbl> 496.6025, 510.7871, 356.4962, 357.2612, 413.7579, NA, 1507.0…
## $ pop       <dbl> 37856121, 39068979, 40000412, 40578842, 41454761, 42647492, …
## $ trade_gdp <dbl> NA, 46.70989, 51.41172, 72.88547, 67.58467, NA, 53.52813, 49…
## $ ind_va    <dbl> 14.05811, 12.95260, 14.27366, 16.05037, 13.44982, NA, 26.033…
## $ srv_va    <dbl> 55.47281, 52.57211, 47.16042, 45.04991, 46.37437, NA, 52.629…
## $ agr_va    <dbl> 25.77397, 29.97558, 33.59762, 33.70143, 34.74325, NA, 12.953…

Cleaning and Preparing Data

Before aggregation, I cleaned the product-level data by:

Selecting only relevant variables (country, year, flow, HS2, value),

Restricting the sample to 2019–2024,

Removing missing and non-positive trade values,

Standardizing trade flow labels so exports are coded as “X” and imports as “M”.

This step was necessary because all later indicators depend on consistent definitions of exports, imports, and product categories.

pick_col <- function(df, candidates) {
  hit <- candidates[candidates %in% names(df)]
  if (length(hit) == 0) stop("None of these columns were found: ", paste(candidates, collapse = ", "))
  hit[1]
}

col_country <- pick_col(trade_raw, c(
  "reporter_iso", "reporter_iso3", "reporter", "reporter_code",
  "reporter_iso_code", "reporter_iso3_code"
))
col_year  <- pick_col(trade_raw, c("year", "ref_year", "period"))
col_flow  <- pick_col(trade_raw, c("flow", "flow_code", "flow_desc"))
col_hs2   <- pick_col(trade_raw, c("hs2", "cmd_code", "commodity_code", "cmdcode", "commodity"))
col_value <- pick_col(trade_raw, c("primary_value", "trade_value", "trade_value_usd", "fobvalue", "cifvalue", "value"))

trade_clean <- trade_raw %>%
  transmute(
    country = as.character(.data[[col_country]]),
    year    = as.integer(as.character(.data[[col_year]])),
    flow    = as.character(.data[[col_flow]]),
    hs2     = as.character(.data[[col_hs2]]),
    value   = suppressWarnings(as.numeric(.data[[col_value]]))
  ) %>%
  filter(year >= 2019 & year <= 2024, !is.na(value), value > 0) %>%
  mutate(
    flow = case_when(
      flow %in% c("X", "Export", "Exports", "export", "exports") ~ "X",
      flow %in% c("M", "Import", "Imports", "import", "imports") ~ "M",
      TRUE ~ flow
    )
  )

table(trade_clean$flow)
## 
##     M     X 
## 93358 85798
glimpse(trade_clean)
## Rows: 179,156
## Columns: 5
## $ country <chr> "AFG", "AFG", "ALB", "ALB", "AND", "AND", "AGO", "AGO", "ATG",…
## $ year    <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
## $ flow    <chr> "M", "X", "M", "X", "M", "X", "M", "X", "M", "X", "M", "X", "M…
## $ hs2     <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
## $ value   <dbl> 1.679195e+07, 3.946623e+04, 4.088445e+07, 5.379707e+05, 1.1684…

Building Trade Structure Indicators

Clustering requires observations to be comparable at the same level of aggregation. I therefore aggregated the product-level data into country–year trade indicators:

Total exports (x_total) and imports (m_total),

Trade balance (exports minus imports),

Export–import ratio as a scale-free orientation measure,

Number of HS2 products exported (n_products) as a diversification proxy,

Herfindahl–Hirschman Index (hhi) measuring export concentration (illustrated with a toy example below).

These indicators jointly capture both the size and the structure of trade, which is central for distinguishing trade profiles.
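
As a quick, purely hypothetical illustration of the concentration measure (not part of the pipeline): the HHI is the sum of squared export shares, so it approaches 1 for a single-product exporter and falls toward 0 as exports spread evenly over many products.

# Toy illustration of the HHI definition used below (illustrative values only)
toy_shares <- c(0.5, 0.3, 0.2)  # hypothetical shares of three HS2 products
sum(toy_shares^2)               # 0.38: moderately concentrated exports
sum(rep(1 / 50, 50)^2)          # 0.02: exports spread evenly over 50 products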

trade_totals <- trade_clean %>%
  group_by(country, year, flow) %>%
  summarise(total_value = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = flow, values_from = total_value, values_fill = 0) %>%
  rename(x_total = X, m_total = M) %>%
  mutate(
    trade_balance = x_total - m_total,
    xm_ratio = x_total / (m_total + 1)
  )

glimpse(trade_totals)
## Rows: 971
## Columns: 6
## $ country       <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", …
## $ year          <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 20…
## $ m_total       <dbl> 1313806452, 939368008, 1169773934, 1473238003, 153651956…
## $ x_total       <dbl> 78642939, 65929725, 88347008, 104796787, 111004914, 8704…
## $ trade_balance <dbl> -1235163513, -873438282, -1081426927, -1368441216, -1425…
## $ xm_ratio      <dbl> 0.05985885, 0.07018519, 0.07552486, 0.07113364, 0.072244…
n_products <- trade_clean %>%
  filter(flow == "X") %>%
  group_by(country, year) %>%
  summarise(n_products = n_distinct(hs2), .groups = "drop")

glimpse(n_products)
## Rows: 971
## Columns: 3
## $ country    <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", "AG…
## $ year       <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 2022,…
## $ n_products <int> 84, 77, 79, 81, 84, 69, 95, 92, 92, 94, 95, 95, 94, 96, 95,…
export_shares <- trade_clean %>%
  filter(flow == "X") %>%
  group_by(country, year, hs2) %>%
  summarise(export_value = sum(value), .groups = "drop") %>%
  group_by(country, year) %>%
  mutate(
    total_exports = sum(export_value),
    share = export_value / total_exports
  )

hhi <- export_shares %>%
  summarise(hhi = sum(share^2), .groups = "drop")

glimpse(hhi)
## Rows: 971
## Columns: 3
## $ country <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", "AGO",…
## $ year    <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 2022, 20…
## $ hhi     <dbl> 0.13638179, 0.15118685, 0.27458806, 0.27619753, 0.30585118, 0.…
trade_features <- trade_totals %>%
  left_join(n_products, by = c("country", "year")) %>%
  left_join(hhi, by = c("country", "year"))

write.csv(trade_features, "data_processed/trade_features_2019_2024.csv", row.names = FALSE)

glimpse(trade_features)
## Rows: 971
## Columns: 8
## $ country       <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AGO", "AGO", …
## $ year          <int> 2019, 2020, 2021, 2022, 2023, 2019, 2019, 2020, 2021, 20…
## $ m_total       <dbl> 1313806452, 939368008, 1169773934, 1473238003, 153651956…
## $ x_total       <dbl> 78642939, 65929725, 88347008, 104796787, 111004914, 8704…
## $ trade_balance <dbl> -1235163513, -873438282, -1081426927, -1368441216, -1425…
## $ xm_ratio      <dbl> 0.05985885, 0.07018519, 0.07552486, 0.07113364, 0.072244…
## $ n_products    <int> 84, 77, 79, 81, 84, 69, 95, 92, 92, 94, 95, 95, 94, 96, …
## $ hhi           <dbl> 0.13638179, 0.15118685, 0.27458806, 0.27619753, 0.305851…

Pattern-Based Column Detection

Indicator names in WDI files can vary (e.g. gdp_per_capita, gdp_pc, NY.GDP.PCAP.CD). To handle this variability, I defined a helper function that selects columns using regular-expression pattern matching rather than relying on exact column names.

The function searches all columns except the year variable and returns the first column that matches the specified pattern. This approach allows the pipeline to remain stable even if the naming conventions in the WDI export change.

pick_by_pattern <- function(df, patterns, exclude = c("year")) {
  cols <- setdiff(names(df), exclude)
  hits <- cols[stringr::str_detect(cols, paste(patterns, collapse = "|"))]
  if (length(hits) == 0) return(NA_character_)
  hits[1]
}
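
As a quick illustrative check (not part of the original pipeline), applying the helper to the WDI file loaded above should resolve the GDP-per-capita pattern to the gdp_pc column:

# Illustrative check of the pattern-based helper on the loaded WDI file
pick_by_pattern(wdi_raw, patterns = c("gdp.*pcap|pcap|gdp_pc|gdp_per_cap"))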

Detecting WDI File Structure and Extracting Indicators

In this section, I determine whether the WDI dataset is in tidy or wide format and process it accordingly. First, I verify that a year column exists; if not, the file cannot be safely processed and execution is stopped. I then identify the country identifier column using flexible matching, as this can also vary across exports (ISO3 code, country name, economy, etc.).

To distinguish between tidy and wide formats, I examine the number of numeric columns excluding identifiers. If several numeric indicator columns are present, I treat the file as tidy; otherwise, I assume a wide structure and reshape it into long form before extracting the required indicators. For tidy files, I automatically detect the columns corresponding to GDP per capita, population, trade openness, and sectoral value-added shares using pattern matching. For wide files, I rely on official WDI indicator codes and pivot the data into a consistent country–year format.

year_cols <- names(wdi_raw)[stringr::str_detect(names(wdi_raw), "^x?20\\d\\d$")]
has_year_col <- "year" %in% names(wdi_raw)

col_country_any <- suppressWarnings({
  pick_col(wdi_raw, c(
    "iso3", "iso3_code", "iso3c", "iso3c_code",
    "country_code", "countrycode", "country_iso3", "countryiso3code",
    "iso_code", "code",
    "country", "country_name", "economy", "location", "entity", "name"
  ))
})

if (!has_year_col) stop("WDI does not have a 'year' column, please inspect the file format")

id_candidates <- c("year", col_country_any)
numeric_cols <- names(wdi_raw)[sapply(wdi_raw, is.numeric)]
numeric_cols <- setdiff(numeric_cols, id_candidates)
is_tidy_indicator_table <- length(numeric_cols) >= 4

if (is_tidy_indicator_table) {
  col_gdp_pc   <- pick_by_pattern(wdi_raw, patterns = c("gdp.*pcap|pcap|gdp_pc|gdp_per_cap"))
  col_pop      <- pick_by_pattern(wdi_raw, patterns = c("^pop$|pop_|population|sp_pop"))
  col_tradegdp <- pick_by_pattern(wdi_raw, patterns = c("trade.*gdp|trade_gdp|ne_trd"))
  col_agr      <- pick_by_pattern(wdi_raw, patterns = c("agr|agri|nv_agr"))
  col_ind      <- pick_by_pattern(wdi_raw, patterns = c("ind(?!ex)|industry|nv_ind"))
  col_srv      <- pick_by_pattern(wdi_raw, patterns = c("srv|service|nv_srv"))

  if (any(is.na(c(col_gdp_pc, col_pop, col_tradegdp, col_agr, col_ind, col_srv)))) {
    print(numeric_cols)
    stop("Could not auto-detect at least one WDI indicator column")
  }

  wdi_clean <- wdi_raw %>%
    transmute(
      country    = as.character(.data[[col_country_any]]),
      year       = as.integer(as.character(year)),
      gdp_pc     = suppressWarnings(as.numeric(.data[[col_gdp_pc]])),
      population = suppressWarnings(as.numeric(.data[[col_pop]])),
      trade_gdp  = suppressWarnings(as.numeric(.data[[col_tradegdp]])),
      agr_va     = suppressWarnings(as.numeric(.data[[col_agr]])),
      ind_va     = suppressWarnings(as.numeric(.data[[col_ind]])),
      srv_va     = suppressWarnings(as.numeric(.data[[col_srv]]))
    ) %>%
    filter(year >= 2019 & year <= 2024)

} else {
  col_country_code <- pick_col(wdi_raw, c("country_code","countrycode","country_iso3","countryiso3code"))
  col_indicator    <- pick_col(wdi_raw, c("indicator_code","series_code","indicator","series","indicator_id"))

  if (length(year_cols) == 0) stop("Could not detect WDI year columns")

  wdi_long <- wdi_raw %>%
    select(all_of(c(col_country_code, col_indicator)), all_of(year_cols)) %>%
    pivot_longer(cols = all_of(year_cols), names_to = "year", values_to = "value") %>%
    mutate(
      year = as.integer(stringr::str_remove(year, "^x")),
      value = suppressWarnings(as.numeric(value))
    ) %>%
    filter(year >= 2019 & year <= 2024)

  needed_codes <- c(
    "NY.GDP.PCAP.CD",
    "SP.POP.TOTL",
    "NE.TRD.GNFS.ZS",
    "NV.AGR.TOTL.ZS",
    "NV.IND.TOTL.ZS",
    "NV.SRV.TOTL.ZS"
  )

  wdi_clean <- wdi_long %>%
    filter(.data[[col_indicator]] %in% needed_codes) %>%
    mutate(ind = recode(.data[[col_indicator]],
      "NY.GDP.PCAP.CD" = "gdp_pc",
      "SP.POP.TOTL"    = "population",
      "NE.TRD.GNFS.ZS" = "trade_gdp",
      "NV.AGR.TOTL.ZS" = "agr_va",
      "NV.IND.TOTL.ZS" = "ind_va",
      "NV.SRV.TOTL.ZS" = "srv_va"
    )) %>%
    select(country = all_of(col_country_code), year, ind, value) %>%
    pivot_wider(names_from = ind, values_from = value)
}

write.csv(wdi_clean, "data_processed/wdi_clean_2019_2024.csv", row.names = FALSE)
glimpse(wdi_clean)
## Rows: 1,566
## Columns: 8
## $ country    <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFE", "AFE", "AF…
## $ year       <int> 2019, 2020, 2021, 2022, 2023, 2024, 2019, 2020, 2021, 2022,…
## $ gdp_pc     <dbl> 496.6025, 510.7871, 356.4962, 357.2612, 413.7579, NA, 1507.…
## $ population <dbl> 37856121, 39068979, 40000412, 40578842, 41454761, 42647492,…
## $ trade_gdp  <dbl> NA, 46.70989, 51.41172, 72.88547, 67.58467, NA, 53.52813, 4…
## $ agr_va     <dbl> 25.77397, 29.97558, 33.59762, 33.70143, 34.74325, NA, 12.95…
## $ ind_va     <dbl> 14.05811, 12.95260, 14.27366, 16.05037, 13.44982, NA, 26.03…
## $ srv_va     <dbl> 55.47281, 52.57211, 47.16042, 45.04991, 46.37437, NA, 52.62…

Merging Trade Features with WDI Data and Handling Missing Values

I merged the trade-structure indicators with the cleaned WDI dataset at the country–year level. I then summarized missing values across all variables. Because distance-based methods such as k-means, PCA, and MDS are sensitive to missing data, I removed observations with missing values in the core macroeconomic indicators. This ensures that similarity measures are not distorted by incomplete information.

main_dataset <- trade_features %>%
  left_join(wdi_clean, by = c("country", "year"))

missing_summary <- main_dataset %>%
  summarise(across(everything(), ~ sum(is.na(.))))
missing_summary
## # A tibble: 1 × 14
##   country  year m_total x_total trade_balance xm_ratio n_products   hhi gdp_pc
##     <int> <int>   <int>   <int>         <int>    <int>      <int> <int>  <int>
## 1       0     0       0       0             0        0          0     0     17
## # ℹ 5 more variables: population <int>, trade_gdp <int>, agr_va <int>,
## #   ind_va <int>, srv_va <int>
main_dataset <- main_dataset %>%
  drop_na(gdp_pc, population, trade_gdp, agr_va, ind_va, srv_va)

write.csv(main_dataset, "data_processed/main_country_year_dataset_2019_2024.csv", row.names = FALSE)
glimpse(main_dataset)
## Rows: 823
## Columns: 14
## $ country       <chr> "ABW", "ABW", "ABW", "ABW", "AGO", "AGO", "AGO", "AGO", …
## $ year          <int> 2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022, 2023, 20…
## $ m_total       <dbl> 1313806452, 939368008, 1169773934, 1473238003, 139616404…
## $ x_total       <dbl> 78642939, 65929725, 88347008, 104796787, 35432454859, 22…
## $ trade_balance <dbl> -1235163513, -873438282, -1081426927, -1368441216, 21470…
## $ xm_ratio      <dbl> 0.05985885, 0.07018519, 0.07552486, 0.07113364, 2.537843…
## $ n_products    <int> 84, 77, 79, 81, 95, 92, 92, 94, 95, 95, 94, 96, 95, 97, …
## $ hhi           <dbl> 0.13638179, 0.15118685, 0.27458806, 0.27619753, 0.889236…
## $ gdp_pc        <dbl> 30645.891, 22759.807, 26749.330, 30975.999, 2493.679, 17…
## $ population    <dbl> 109203, 108587, 107700, 107310, 32375632, 33451132, 3453…
## $ trade_gdp     <dbl> 137.96193, 122.72374, 139.91780, 160.63455, 60.71482, 54…
## $ agr_va        <dbl> 0.01842432, 0.02170060, 0.02999855, 0.03421443, 12.39115…
## $ ind_va        <dbl> 11.628810, 12.611441, 11.007534, 9.735917, 42.225105, 36…
## $ srv_va        <dbl> 77.91933, 77.20070, 78.41041, 79.00360, 43.86272, 45.359…

Preparing the Feature Matrix

The final feature matrix includes 12 numeric variables. I applied two preprocessing steps: log transformations for the skewed size variables (exports, imports, GDP per capita, population), with a signed log for trade balance; and scaling, so that all variables contribute equally to distance calculations.

Without these steps, large-unit variables would dominate both clustering and PCA results.

X <- main_dataset %>%
  select(
    x_total, m_total, trade_balance, xm_ratio, n_products, hhi,
    gdp_pc, population, trade_gdp,
    agr_va, ind_va, srv_va
  )

X_transformed <- X %>%
  mutate(
    x_total = log1p(x_total),
    m_total = log1p(m_total),
    trade_balance = sign(trade_balance) * log1p(abs(trade_balance)),
    gdp_pc = log1p(gdp_pc),
    population = log1p(population)
  )

X_scaled <- scale(X_transformed) %>% as.matrix()
dim(X_scaled)
## [1] 823  12
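
As an optional sanity check that is not part of the original pipeline, the scaled matrix should have column means of (numerically) zero and unit standard deviations:

# Optional check: scale() should leave every column centered at 0 with sd 1
round(colMeans(X_scaled), 12)
apply(X_scaled, 2, sd)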

3. Clustering Analysis

3.1 Clusterability Check: Hopkins Statistic

Before clustering, I tested whether the dataset exhibits meaningful clustering tendency using the Hopkins statistic.

Values close to 0.5 indicate randomly distributed data, while values closer to 1 indicate non-random, clusterable structure.

I used a safe implementation to ensure numerical stability. The resulting value indicates that the data contain meaningful structure suitable for clustering.

hopkins_safe <- function(X, m = 100, seed = 123) {
  set.seed(seed)
  X <- as.data.frame(X)
  names(X) <- make.names(names(X), unique = TRUE)

  n <- nrow(X)
  d <- ncol(X)
  m <- min(m, n - 1)

  mins <- apply(X, 2, min)
  maxs <- apply(X, 2, max)

  rand_points <- as.data.frame(matrix(runif(m * d), nrow = m, ncol = d))
  for (j in 1:d) {
    rand_points[[j]] <- mins[j] + rand_points[[j]] * (maxs[j] - mins[j])
  }
  names(rand_points) <- names(X)

  u_dist <- sapply(1:m, function(i) {
    min(as.matrix(dist(rbind(rand_points[i, , drop = FALSE], X)))[1, -1])
  })

  sample_idx <- sample(1:n, m)
  w_dist <- sapply(sample_idx, function(i) {
    Xi <- X[i, , drop = FALSE]
    X_others <- X[-i, , drop = FALSE]
    min(as.matrix(dist(rbind(Xi, X_others)))[1, -1])
  })

  sum(u_dist) / (sum(u_dist) + sum(w_dist))
}

H <- hopkins_safe(X_scaled, m = 100)
H
## [1] 0.9355139

3.2 Choosing the Number of Clusters

I evaluated the number of clusters using the elbow method (WSS) and the silhouette method.

In macroeconomic data, silhouette values are typically moderate due to overlapping country profiles. I therefore treated these diagnostics as guidance rather than strict rules.

set.seed(123)

wss <- sapply(1:10, function(k) kmeans(X_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "k", ylab = "Total within-cluster SS",
     main = "Elbow Method (WSS)")

sil_width <- sapply(2:10, function(k) {
  km_tmp <- kmeans(X_scaled, centers = k, nstart = 25)
  ss <- cluster::silhouette(km_tmp$cluster, dist(X_scaled))
  mean(ss[, 3])
})
plot(2:10, sil_width, type = "b", pch = 19,
     xlab = "k", ylab = "Avg silhouette width",
     main = "Silhouette Method")

sil_width
## [1] 0.2270123 0.2327230 0.2618397 0.2501236 0.2648038 0.2660268 0.2585856
## [8] 0.2361480 0.2452863
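
As an additional, optional diagnostic that was not run in the original analysis, the gap statistic from the already-loaded cluster package could be used to cross-check the elbow and silhouette plots; a sketch, with illustrative settings for B and K.max:

# Optional: gap statistic as a further guide for choosing k (can be slow)
set.seed(123)
gap_stat <- cluster::clusGap(X_scaled, FUNcluster = kmeans, nstart = 25,
                             K.max = 10, B = 50)
print(gap_stat, method = "firstmax")
plot(gap_stat, main = "Gap Statistic")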

3.3 K-means Clustering and Cluster Profiles

Based on diagnostics and interpretability, I selected k = 4 and applied k-means clustering. Each country–year observation was assigned to a cluster.

To interpret the clusters, I computed cluster-level means for all features. These profiles represent trade–development archetypes rather than rigid categories.

set.seed(123)
k <- 4
km <- kmeans(X_scaled, centers = k, nstart = 50)

main_dataset$cluster <- factor(km$cluster)
table(main_dataset$cluster)
## 
##   1   2   3   4 
## 127  85 310 301
cluster_profiles <- main_dataset %>%
  group_by(cluster) %>%
  summarise(across(
    c(x_total, m_total, trade_balance, xm_ratio, n_products, hhi,
      gdp_pc, population, trade_gdp, agr_va, ind_va, srv_va),
    ~ mean(.x, na.rm = TRUE)
  ))

cluster_profiles
## # A tibble: 4 × 13
##   cluster       x_total   m_total trade_balance xm_ratio n_products   hhi gdp_pc
##   <fct>           <dbl>     <dbl>         <dbl>    <dbl>      <dbl> <dbl>  <dbl>
## 1 1         1287004686.   3.83e 9  -2546939329.    0.243       71.4 0.280 20000.
## 2 2        55132698276.   3.10e10  24139254111.    1.77        90.1 0.590 18811.
## 3 3       329884816962.   3.23e11   7109436641.    0.988       96.6 0.113 34447.
## 4 4        20232344287.   3.19e10 -11667808747.    0.596       91.4 0.170  3596.
## # ℹ 5 more variables: population <dbl>, trade_gdp <dbl>, agr_va <dbl>,
## #   ind_va <dbl>, srv_va <dbl>

3.4 Hierarchical Clustering as Validation

To validate the k-means results, I applied hierarchical clustering (Ward.D2) to a Euclidean distance matrix computed from the same scaled features. Comparing the hierarchical assignments with the k-means clusters shows broad, though not perfect, agreement, increasing confidence that the identified structure is not algorithm-specific.

dist_matrix <- dist(X_scaled)
hc <- hclust(dist_matrix, method = "ward.D2")
plot(hc, labels = FALSE, hang = -1, main = "Hierarchical Clustering Dendrogram")

hc_clusters <- cutree(hc, k = k)
table(KMeans = km$cluster, Hierarchical = hc_clusters)
##       Hierarchical
## KMeans   1   2   3   4
##      1  87   0  40   0
##      2   5  80   0   0
##      3   0 149 161   0
##      4   5  11 190  95
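
To summarize the agreement in this cross-tabulation with a single number, one could compute the adjusted Rand index; a sketch, assuming the mclust package is installed (it is not loaded above):

# Optional: adjusted Rand index between k-means and hierarchical assignments
# (values near 1 indicate strong agreement; values near 0 indicate chance-level agreement)
mclust::adjustedRandIndex(km$cluster, hc_clusters)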

The results indicate meaningful but overlapping regimes. Clusters should be interpreted as broad patterns rather than sharply separated groups, which is realistic for global economic data.

4. Dimension Reduction

4.1 Motivation

Clustering was performed in a 12-dimensional space, which is not directly interpretable. Dimension reduction serves two purposes: identifying the dominant axes of variation, and visualizing whether clusters align with these axes.

I applied PCA and classical MDS, which preserve different aspects of structure.

4.2 PCA

PCA reveals orthogonal components explaining the largest share of variance. The scree and cumulative variance plots show that the first two components explain a substantial proportion of total variation, making PC1–PC2 visualization meaningful.

Loadings indicate that:

PC1 is mainly driven by trade scale and development variables (exports, imports, GDP per capita, population), capturing overall economic size and intensity.

PC2 is driven by openness, sectoral composition, and diversification indicators, capturing structural differences across economies.

Plotting PC1 vs PC2 and coloring by cluster shows strong alignment between clustering and dominant variance structure.

pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

pca_var <- (pca$sdev^2)
pca_var_ratio <- pca_var / sum(pca_var)
pca_var_df <- data.frame(
  PC = paste0("PC", seq_along(pca_var_ratio)),
  var_explained = pca_var_ratio,
  cum_var = cumsum(pca_var_ratio)
)

plot(pca_var_ratio, type = "b", pch = 19,
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     main = "PCA Scree Plot")

plot(pca_var_df$cum_var, type = "b", pch = 19,
     xlab = "Principal Component",
     ylab = "Cumulative Variance Explained",
     main = "PCA Cumulative Explained Variance",
     ylim = c(0, 1))
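
To report exact figures behind the scree and cumulative variance plots, the variance-share table built above can simply be printed (optional, not shown in the original run):

# Optional: exact variance shares of the leading principal components
head(pca_var_df, 3)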

scores <- as.data.frame(pca$x)
scores$cluster <- main_dataset$cluster
scores$country <- main_dataset$country
scores$year <- main_dataset$year

cols <- as.numeric(scores$cluster)

plot(scores$PC1, scores$PC2,
     col  = cols,
     pch  = 19,
     xlab = paste0("PC1 (", round(100 * pca_var_ratio[1], 1), "%)"),
     ylab = paste0("PC2 (", round(100 * pca_var_ratio[2], 1), "%)"),
     main = "PCA: PC1 vs PC2 (Colored by Cluster)")

legend("topright",
       legend = levels(scores$cluster),
       col    = seq_along(levels(scores$cluster)),
       pch    = 19)

loadings <- as.data.frame(pca$rotation)
loadings$feature <- rownames(loadings)

top_pc1 <- loadings %>%
  mutate(abs_PC1 = abs(PC1)) %>%
  arrange(desc(abs_PC1)) %>%
  select(feature, PC1) %>%
  head(10)

top_pc2 <- loadings %>%
  mutate(abs_PC2 = abs(PC2)) %>%
  arrange(desc(abs_PC2)) %>%
  select(feature, PC2) %>%
  head(10)

top_pc1
##                     feature        PC1
## x_total             x_total  0.4529420
## m_total             m_total  0.4304688
## n_products       n_products  0.3497553
## xm_ratio           xm_ratio  0.3237472
## gdp_pc               gdp_pc  0.3139626
## agr_va               agr_va -0.3085282
## trade_balance trade_balance  0.2731948
## population       population  0.2226554
## ind_va               ind_va  0.1828848
## trade_gdp         trade_gdp  0.1119138
top_pc2
##                     feature         PC2
## srv_va               srv_va -0.50525509
## ind_va               ind_va  0.40826625
## gdp_pc               gdp_pc -0.32588364
## agr_va               agr_va  0.31117655
## population       population  0.29596267
## trade_gdp         trade_gdp -0.29050620
## xm_ratio           xm_ratio  0.27472843
## hhi                     hhi  0.27273906
## trade_balance trade_balance  0.22080955
## x_total             x_total  0.06247722

4.3 MDS

While PCA preserves variance, MDS preserves distances. Using MDS on the same scaled features allows me to examine whether similarity relationships underlying clustering remain visible.

The MDS plot shows partial separation with overlaps, reinforcing the idea that global trade patterns form continuous gradients rather than isolated groups.

dist_matrix <- dist(X_scaled)
mds <- cmdscale(dist_matrix, k = 2, eig = TRUE)

mds_points <- as.data.frame(mds$points)
colnames(mds_points) <- c("Dim1", "Dim2")
mds_points$cluster <- main_dataset$cluster

cols <- as.numeric(mds_points$cluster)

plot(mds_points$Dim1, mds_points$Dim2,
     col  = cols,
     pch  = 19,
     xlab = "MDS Dimension 1",
     ylab = "MDS Dimension 2",
     main = "Classical MDS (2D) Colored by Cluster")

legend("topright",
       legend = levels(mds_points$cluster),
       col    = seq_along(levels(mds_points$cluster)),
       pch    = 19)

mds_var_ratio <- mds$eig / sum(mds$eig[mds$eig > 0])
sum(mds_var_ratio[1:2], na.rm = TRUE)
## [1] 0.6209684

PCA and MDS together indicate strong underlying structure combined with gradual transitions, which is expected in country-level macroeconomic data.
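
Because classical MDS was applied to Euclidean distances of the same scaled matrix, its coordinates should essentially reproduce the PCA scores up to sign; an optional consistency check, not run in the original analysis, is to correlate the two sets of coordinates (absolute values close to 1 confirm that both views describe the same structure):

# Optional consistency check between the PCA and MDS embeddings
abs(cor(scores$PC1, mds_points$Dim1))
abs(cor(scores$PC2, mds_points$Dim2))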

5. Final Conclusion

I combined clustering and dimension reduction because they address complementary aspects of the same research question. Clustering summarizes country–year observations into a small number of interpretable trade–development regimes. Dimension reduction explains why these regimes differ and validates whether clusters align with dominant variance and distance structure. The analysis shows that global trade profiles between 2019 and 2024 can be characterized by a limited number of broad regimes reflecting trade scale, openness, diversification, and economic structure. PCA reveals that a few latent dimensions explain most variation, while MDS confirms that similarity is continuous rather than sharply segmented.

Overall, the analysis shows that global trade and development patterns between 2019 and 2024 are structured around a small number of broad, recurring regimes. These regimes are shaped primarily by trade scale, openness, diversification, and sectoral composition, and they manifest as continuous gradients rather than sharply separated groups across countries.