Project Objective
The objective of this project is to identify and classify macroeconomic risk across countries using global macroeconomic indicators. By leveraging World Bank data and applying modern machine learning classification techniques, the study aims to:
1. Detect countries facing elevated macroeconomic risk
2. Identify which economic variables are most strongly associated with risk
3. Evaluate and compare the performance of multiple models, including:
   Logistic Regression (baseline)
   Random Forest
   XGBoost
4. Assess potential data leakage and model overfitting
5. Test model robustness after removing potentially problematic variables
The broader goal is to demonstrate how data-driven risk detection can complement traditional macroeconomic analysis and support early-warning systems for economic stress.
Summary
In this project, we built a predictive model to identify high-risk economies using global macroeconomic and financial data from the World Bank. By combining statistical analysis and machine learning, we identified the key drivers of economic risk and produced highly accurate country-level risk predictions.
Key Findings:
Economic Foundations of Risk:
ANOVA confirmed that GDP growth differs significantly across economy sizes, particularly between small and large economies.
Regression analysis highlighted GDP per capita growth, inflation, lending rates, and current account balances as the strongest predictors of economic risk, providing a clear macroeconomic rationale.
Machine Learning Insights:
Logistic Regression achieved 85.7% accuracy, offering interpretable coefficients that revealed how growth and financial indicators influence risk.
Random Forest captured non-linear relationships and interactions, achieving near-perfect accuracy (98%) and highlighting GDP growth, current account balance, and lending rate as the top predictors.
XGBoost provided a balance of predictive power (95% accuracy) and interpretability, confirming the importance of macroeconomic indicators while handling complex patterns efficiently.
Robustness and Feature Validation:
After removing potentially “leaky” variables (e.g., non-performing loans, private sector credit), both Random Forest and XGBoost maintained strong performance (91–96% accuracy), demonstrating robustness and real-world applicability.
Feature importance consistently pointed to GDP growth, current account balance, and lending rate as the main drivers of risk, aligning statistical findings with machine learning insights.
Data Source
The macroeconomic indicators were obtained from the World Bank Open Data platform. Annual country-level data were collected for inflation, GDP growth, interest rates, current account balances, unemployment, exchange rates, and financial risk indicators.
Data Preparation
Raw datasets were cleaned, harmonized, and merged to create a unified global macroeconomic dataset. This included:
Standardizing country and year identifiers
Handling missing values
Removing countries with insufficient observations
Creating economically meaningful risk indicators
For clarity and readability, the analysis below begins with the final cleaned dataset used for modeling. Full data preparation scripts are provided in the appendix / repository.
#load libraries
library(tidyverse)
library(rstatix)
library(lubridate)
library(fpp3)
library(fixest)
library(scales)
library(corrplot)
library(plotly)
library(patchwork)
library(broom)
library(car)
library(ggpubr)
library(multcomp)
library(cluster)
library(factoextra)
library(caret)
library(randomForest)
library(xgboost)
library(glmnet)
library(pROC)
library(MLmetrics)
library(DALEX)
library(vip)
library(recipes)
library(forecast)
library(tseries)
library(prophet)
library(xts)
library(gridExtra)
library(ggfortify)
library(ROSE)
library(smotefamily)
library(themis)
Global_Rates <- read_csv("Global_Rates.csv")
# inspect the structure of the dataset
glimpse(Global_Rates)
## Rows: 1,022
## Columns: 16
## $ Country <chr> "Angola", "Angola", "Angola", "Angola", "Angol…
## $ Country_Code <chr> "AGO", "AGO", "AGO", "AGO", "AGO", "AGO", "AGO…
## $ Years <dbl> 2015, 2016, 2017, 2018, 2019, 2019, 2020, 2021…
## $ GDP_current_us <dbl> 90496420507, 52761617226, 73690154991, 7945068…
## $ GDP_per_capita_growth <dbl> -2.630702, -6.002649, -3.620758, -4.665970, -4…
## $ Inflation <dbl> 9.355972, 30.694415, 29.844480, 19.628938, 17.…
## $ Lending_rate <dbl> 16.881862, 15.780504, 15.806100, 20.677004, 19…
## $ Deposit_rate <dbl> 3.3146202, 5.5443181, 6.3421932, 6.8766571, 6.…
## $ Current_account_balance <dbl> -3747517596, -10272841903, -3085195463, -63286…
## $ Exchange_rate <dbl> 98.30242, 120.06070, 163.65643, 165.91595, 252…
## $ Non_performing_loans <dbl> 10.61123, 11.28509, 25.83606, 23.22174, 23.145…
## $ Private_sector_credit <dbl> 25.24011, 21.09841, 17.00069, 14.93527, 15.340…
## $ Unemployment <dbl> 16.490, 16.575, 16.610, 16.594, 16.497, 16.417…
## $ Interest_rate_spread <dbl> 12.855307, 13.567242, 10.236186, 9.463907, 13.…
## $ Real_interest_rate <dbl> 21.14418153, -4.92200310, -5.55269798, -5.8440…
## $ Risk_premium <dbl> 12.1029654, 9.6780430, 0.6885965, -0.5418163, …
# remove rows that contain missing values
Global_Rates <- Global_Rates %>%
  na.omit()
# remove rows that are exact duplicates across all columns
Global_Rates <- Global_Rates %>%
  distinct()
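One check worth considering at this point: `distinct()` drops only rows that are identical in every column, so two rows for the same country and year with different values both survive (the `glimpse()` output above lists 2019 twice for Angola). The following is a minimal sketch in base R of how repeated Country-Year keys could be listed. The data frame `toy` and its values are made up for illustration and are not taken from `Global_Rates.csv`.

```r
# Hypothetical toy data: country "A" appears twice for 2019 with different values
toy <- data.frame(
  Country   = c("A", "A", "A", "B", "B"),
  Years     = c(2018, 2019, 2019, 2018, 2019),
  Inflation = c(2.1, 3.0, 3.4, 5.0, 5.2)
)

# Rows identical in every column (the only rows distinct() would drop)
n_fully_duplicated <- sum(duplicated(toy))

# Count rows per Country-Year key and keep the keys that occur more than once
key_counts <- aggregate(Inflation ~ Country + Years, data = toy, FUN = length)
names(key_counts)[3] <- "n_rows"
repeated_keys <- key_counts[key_counts$n_rows > 1, ]

print(n_fully_duplicated)
print(repeated_keys)
```

If such keys exist in the real data, per-country row counts are not the same thing as years covered, which matters when reading the summaries that follow.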
# Check the temporal and cross-sectional dimensions
summary_stats <- Global_Rates %>%
  summarise(
    n_countries = n_distinct(Country),
    n_years = n_distinct(Years),
    # average number of rows (observations) per country
    avg_years_per_country = n() / n_distinct(Country)
  )
print(summary_stats)
## # A tibble: 1 × 3
## n_countries n_years avg_years_per_country
## <int> <int> <dbl>
## 1 59 19 17.3
#summary statistics
summary(Global_Rates)
## Country Country_Code Years GDP_current_us
## Length:1022 Length:1022 Min. :2005 Min. :7.866e+08
## Class :character Class :character 1st Qu.:2012 1st Qu.:1.158e+10
## Mode :character Mode :character Median :2015 Median :5.748e+10
## Mean :2015 Mean :3.298e+11
## 3rd Qu.:2019 3rd Qu.:3.014e+11
## Max. :2023 Max. :6.272e+12
## GDP_per_capita_growth Inflation Lending_rate Deposit_rate
## Min. :-34.8312 Min. :-16.860 Min. : 0.994 Min. : 0.010
## 1st Qu.: 0.5587 1st Qu.: 2.272 1st Qu.: 6.580 1st Qu.: 1.459
## Median : 2.4983 Median : 4.170 Median : 9.412 Median : 3.705
## Mean : 2.1045 Mean : 4.954 Mean :12.278 Mean : 4.832
## 3rd Qu.: 4.3150 3rd Qu.: 6.582 3rd Qu.:14.276 3rd Qu.: 7.673
## Max. : 33.7686 Max. : 30.694 Max. :60.000 Max. :19.800
## Current_account_balance Exchange_rate Non_performing_loans
## Min. :-1.105e+11 Min. : 0.718 Min. : 0.4203
## 1st Qu.:-3.008e+09 1st Qu.: 4.080 1st Qu.: 2.6044
## Median :-6.860e+08 Median : 18.097 Median : 4.4908
## Mean : 4.505e+08 Mean : 584.567 Mean : 6.3027
## 3rd Qu.: 8.983e+08 3rd Qu.: 103.250 3rd Qu.: 8.6859
## Max. : 2.209e+11 Max. :21697.568 Max. :47.5958
## Private_sector_credit Unemployment Interest_rate_spread Real_interest_rate
## Min. : 7.052 Min. : 0.061 Min. : 0.340 Min. :-20.497
## 1st Qu.: 23.319 1st Qu.: 3.378 1st Qu.: 3.348 1st Qu.: 2.356
## Median : 37.720 Median : 4.967 Median : 5.455 Median : 5.064
## Mean : 54.936 Mean : 7.560 Mean : 7.531 Mean : 7.021
## 3rd Qu.: 67.962 3rd Qu.:10.220 3rd Qu.: 7.701 3rd Qu.: 8.958
## Max. :264.442 Max. :34.007 Max. :49.046 Max. : 54.678
## Risk_premium
## Min. :-10.838
## 1st Qu.: 2.927
## Median : 4.920
## Mean : 6.908
## 3rd Qu.: 7.666
## Max. : 52.310
Feature Engineering
Countries are grouped by their average GDP over the sample period, and that average is used to classify each country into a size category (Small, Medium, Large, or Very Large economy).
# Country group analysis
country_summary <- Global_Rates %>%
  group_by(Country) %>%
  summarize(
    avg_gdp = mean(GDP_current_us, na.rm = TRUE),
    max_GDP = max(GDP_current_us, na.rm = TRUE),
    .groups = 'drop'
  )
# Assign each country to a GDP size group
country_gdp_group <- country_summary %>%
  mutate(
    Economy_Size = case_when(
      avg_gdp < 50e9  ~ "Small_Economy",  # less than $50 billion
      avg_gdp < 500e9 ~ "Medium_Economy", # from $50 billion to $500 billion
      avg_gdp < 2e12  ~ "Large_Economy",  # from $500 billion to $2 trillion
      TRUE            ~ "Very_Large"      # $2 trillion or more
    )
  ) %>%
  dplyr::select(Country, Economy_Size)
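To make the boundary behaviour of the size thresholds explicit, here is a small sketch that reproduces the same cut-offs with base R's `cut()`. It is an alternative illustration rather than the code used in the analysis, and the `avg_gdp` values are hypothetical numbers chosen to sit on and around the thresholds.

```r
# Thresholds copied from the case_when() call (USD)
breaks <- c(-Inf, 50e9, 500e9, 2e12, Inf)
labels <- c("Small_Economy", "Medium_Economy", "Large_Economy", "Very_Large")

# Hypothetical average GDP values, on and around the cut-offs
avg_gdp <- c(10e9, 50e9, 499e9, 500e9, 3e12)

# right = FALSE closes each interval on the left, e.g. [50e9, 500e9) is Medium,
# which matches the strict "<" comparisons in case_when()
size <- cut(avg_gdp, breaks = breaks, labels = labels,
            right = FALSE, ordered_result = TRUE)

print(data.frame(avg_gdp, size))
```

A country whose average GDP is exactly $50 billion therefore falls in the medium group, and one at exactly $500 billion in the large group.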
# join the size groups to the Global_Rates dataset
Global_Rates_Economy_Size <- Global_Rates %>%
  left_join(country_gdp_group, by = "Country")
# clean the data types
Global_Rates_Economy_Size <- Global_Rates_Economy_Size %>%
  mutate(Country = as.factor(Country),
         Economy_Size = factor(Economy_Size,
                               levels = c("Small_Economy", "Medium_Economy",
                                          "Large_Economy", "Very_Large"),
                               ordered = TRUE),
         Years = as.integer(Years))
# express the current account balance as a percentage of GDP
Global_Rates_Economy_Size <- Global_Rates_Economy_Size %>%
  mutate(Current_Account_GDP_Perc = (Current_account_balance / GDP_current_us) * 100)
# create the binary risk flags
Global_Rates_Economy_Size <- Global_Rates_Economy_Size %>%
  mutate(High_Inflation = ifelse(Inflation > 15, 1, 0),
         GDP_Decline = ifelse(GDP_per_capita_growth < 0, 1, 0),
         High_NPL = ifelse(Non_performing_loans > 10, 1, 0),
         Negative_CA = ifelse(Current_account_balance < 0, 1, 0))
# build the risk score (number of flags raised) and the risk level
Global_Rates_Economy_Size <- Global_Rates_Economy_Size %>%
  mutate(Risk_Score = High_Inflation + GDP_Decline + High_NPL + Negative_CA,
         Risk_Level = case_when(
           Risk_Score >= 3 ~ "High",
           Risk_Score == 2 ~ "Medium",
           Risk_Score == 1 ~ "Low",
           Risk_Score == 0 ~ "Very Low"
         ))
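As a worked example of the scoring rule, the sketch below applies the four flags to the first two Angola rows shown in the `glimpse()` output (2015 and 2016) plus one made-up low-risk row. It uses base R so that it runs on its own. The third row and the `label` column are hypothetical and are not part of the dataset.

```r
obs <- data.frame(
  label                   = c("Angola 2015", "Angola 2016", "hypothetical"),
  Inflation               = c(9.355972, 30.694415, 2.0),
  GDP_per_capita_growth   = c(-2.630702, -6.002649, 1.5),
  Non_performing_loans    = c(10.61123, 11.28509, 3.0),
  Current_account_balance = c(-3747517596, -10272841903, 1e9)
)

# Each comparison yields TRUE/FALSE; adding them counts the flags raised (0 to 4)
obs$Risk_Score <- (obs$Inflation > 15) +
  (obs$GDP_per_capita_growth < 0) +
  (obs$Non_performing_loans > 10) +
  (obs$Current_account_balance < 0)

# Same mapping as the case_when() rule
obs$Risk_Level <- ifelse(obs$Risk_Score >= 3, "High",
                  ifelse(obs$Risk_Score == 2, "Medium",
                  ifelse(obs$Risk_Score == 1, "Low", "Very Low")))

print(obs[, c("label", "Risk_Score", "Risk_Level")])
```

Angola 2015 raises three flags (inflation of 9.4% is below the 15% cut-off) and Angola 2016 raises all four, so both are labelled High. Because the score is an unweighted count that includes `High_NPL`, `Non_performing_loans` is mechanically tied to `Risk_Level`, which is relevant to the leakage checks mentioned in the summary.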
# summary statistics by country (n_countries is always 1 because the grouping is by Country)
summary_by_size <- Global_Rates_Economy_Size %>%
  group_by(Country) %>%
  summarise(
    n_countries = n_distinct(Country),
    n_observations = n(),
    avg_gdp_growth = mean(GDP_per_capita_growth, na.rm = TRUE),
    avg_inflation = mean(Inflation, na.rm = TRUE),
    avg_lending_rate = mean(Lending_rate, na.rm = TRUE)
  )
print(summary_by_size, n = Inf)
## # A tibble: 59 × 6
## Country n_countries n_observations avg_gdp_growth avg_inflation
## <fct> <int> <int> <dbl> <dbl>
## 1 Albania 1 26 3.80 2.67
## 2 Algeria 1 22 0.530 5.12
## 3 Angola 1 9 -4.17 21.9
## 4 Antigua and Barbuda 1 1 6.17 1.21
## 5 Armenia 1 24 4.69 3.61
## 6 Australia 1 10 1.14 2.58
## 7 Bangladesh 1 19 5.48 7.22
## 8 Barbados 1 6 0.0434 3.21
## 9 Belize 1 9 0.548 1.43
## 10 Bolivia 1 17 3.36 4.70
## 11 Brazil 1 37 1.21 5.75
## 12 Bulgaria 1 16 2.72 2.80
## 13 Canada 1 8 1.20 2.18
## 14 Czechia 1 22 1.32 2.18
## 15 Eswatini 1 12 2.46 6.05
## 16 Fiji 1 16 1.15 2.79
## 17 Gambia, The 1 15 1.14 5.70
## 18 Georgia 1 25 5.16 4.33
## 19 Grenada 1 2 2.28 0.701
## 20 Hong Kong SAR, China 1 32 1.19 2.70
## 21 Hungary 1 32 2.14 4.70
## 22 Israel 1 18 2.07 0.723
## 23 Japan 1 16 1.58 0.400
## 24 Kenya 1 17 2.58 7.00
## 25 Kosovo 1 6 4.84 0.746
## 26 Kyrgyz Republic 1 28 2.11 6.94
## 27 Lebanon 1 7 -0.994 2.57
## 28 Lesotho 1 16 -0.0817 3.57
## 29 Madagascar 1 24 0.136 8.63
## 30 Malaysia 1 18 3.09 2.66
## 31 Maldives 1 15 3.23 1.90
## 32 Mauritius 1 30 2.80 3.77
## 33 Mexico 1 38 0.507 4.43
## 34 Moldova 1 28 4.32 7.92
## 35 Montenegro 1 26 2.76 3.04
## 36 Mozambique 1 12 1.32 6.97
## 37 Namibia 1 10 -2.20 4.72
## 38 Nigeria 1 24 1.12 12.3
## 39 Pakistan 1 15 2.75 6.34
## 40 Papua New Guinea 1 15 1.27 4.81
## 41 Philippines 1 18 4.67 2.93
## 42 Romania 1 30 3.35 4.19
## 43 Rwanda 1 19 4.07 5.82
## 44 Seychelles 1 4 8.01 1.74
## 45 Singapore 1 14 2.56 3.33
## 46 Solomon Islands 1 7 -1.05 1.20
## 47 South Africa 1 28 -0.132 5.47
## 48 Sri Lanka 1 18 4.56 5.05
## 49 St. Lucia 1 8 1.43 0.137
## 50 St. Vincent and the … 1 4 3.52 0.649
## 51 Tajikistan 1 8 4.35 6.69
## 52 Tanzania 1 15 2.60 7.67
## 53 Thailand 1 26 1.87 1.53
## 54 Trinidad and Tobago 1 30 -0.439 5.21
## 55 Uganda 1 13 1.92 7.60
## 56 Uruguay 1 18 0.853 8.05
## 57 Uzbekistan 1 7 3.30 12.4
## 58 Viet Nam 1 17 4.81 8.14
## 59 Zambia 1 15 -0.127 9.59
## # ℹ 1 more variable: avg_lending_rate <dbl>
Macro relationships visible in the table
Higher inflation tends to go with higher lending rates. High-inflation countries almost always have high lending rates, which is consistent with the Fisher effect plus a risk-premium channel.
High lending rates mostly coincide with weaker growth
Countries with lending rates above ~15% tend to grow slowly and to experience volatile or negative growth.
Low inflation + low rates ≠ high growth
Advanced economies such as Japan, Canada, and Australia combine low inflation and low rates with only modest growth. These economies are at the technological frontier, so growth is constrained by productivity rather than by capital costs.
ECONOMIC RESILIENCE ANALYSIS - EDA & VISUALIZATIONS
QUESTION 1: How do key indicators distribute across economy sizes?
#GDP Growth Distribution
ggplot(Global_Rates_Economy_Size, aes(x = Economy_Size, y = GDP_per_capita_growth, fill = Economy_Size)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
labs(title = "GDP Growth Distribution by Economy Size",
subtitle = "Larger economies show more stable growth",
x = "Economy Size",
y = "GDP per Capita Growth (%)") +
scale_fill_brewer(palette = "Set2") +
coord_flip()
From Small_Economy to Very_Large:
The spread (range and interquartile range) of growth rates narrows, extreme highs and lows (outliers) become rarer, and growth rates cluster closer to the median, which indicates greater stability.
Bigger economies don’t necessarily grow faster, but their growth is more predictable and less bumpy. Small economies often experience sharper booms and deeper busts.
Inflation Distribution
ggplot(Global_Rates_Economy_Size, aes(x = Economy_Size, y = Inflation, fill = Economy_Size)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
labs(title = "Inflation Distribution by Economy Size",
subtitle = "Smaller economies experience higher inflation volatility",
x = "Economy Size",
y = "Inflation Rate (%)") +
scale_fill_brewer(palette = "Set2") +
coord_flip() +
scale_y_log10() # Log scale for readability; observations with zero or negative inflation are dropped from the plot
From Small to Very Large
In small and medium economies, inflation ranges from very low to extremely high, with many outliers that indicate frequent inflation shocks.
Moving from small to very large economies, inflation distributions exhibit a declining median and a narrowing dispersion, indicating that larger economies experience more stable and predictable inflation dynamics than small and medium economies.
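One caveat about the log axis used in this plot: inflation in this dataset can be zero or negative (the summary above reports a minimum of -16.86), and a log10 scale has no finite value for such observations. The sketch below compares `log10()` with a signed pseudo-log transform based on `asinh()`. The helper `pseudo_log10()` is a hypothetical name introduced here, and the input vector mixes three values from the summary (-16.86, 4.17, 30.694) with made-up ones.

```r
# Min, median and max inflation from summary(), plus made-up values near zero
inflation <- c(-16.86, -0.5, 0, 0.4, 4.17, 30.694)

# log10 is undefined for negatives (NaN) and gives -Inf at zero
log_scaled <- suppressWarnings(log10(inflation))
n_lost <- sum(!is.finite(log_scaled))

# Signed pseudo-log: roughly linear near zero, roughly logarithmic for large |x|
pseudo_log10 <- function(x, sigma = 1) asinh(x / (2 * sigma)) / log(10)
pseudo_scaled <- pseudo_log10(inflation)

print(data.frame(inflation, log_scaled, pseudo_scaled))
print(n_lost)
```

For ggplot2 axes, the `scales` package provides `pseudo_log_trans()`, which could replace the log scale if deflation episodes should remain visible. That would be a change to the figure, not what the plot above does.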
What’s the relationship between inflation and growth?
ggplot(Global_Rates_Economy_Size, aes(x = Inflation, y = GDP_per_capita_growth))+
geom_point(aes(colour = Economy_Size, size = GDP_current_us), alpha = 0.6)+
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "dashed")+
scale_x_log10()+
scale_size_continuous(range = c(1, 10),
labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
labs(title = "Inflation vs. GDP Growth",
subtitle = "High inflation often correlates with lower growth",
x = "Inflation Rate (log scale, %)",
y = "GDP per Capita Growth (%)",
size = "GDP (USD)",
color = "Economy Size") +
facet_wrap(~Economy_Size, scales = "free") +
theme(legend.position = "right")
The fitted lines suggest a weak negative association between inflation and GDP per capita growth in some size groups, but not a uniform one. The correlation tests in the statistical analysis section show that the relationship is statistically significant only for medium economies.
How have key indicators evolved over time?
#calculate yearly averages
yearly_avg <- Global_Rates_Economy_Size %>%
group_by(Years, Economy_Size) %>%
summarise(
avg_gdp = mean(GDP_per_capita_growth, na.rm = TRUE),
avg_inflation = mean(Inflation, na.rm = TRUE),
avg_lending_rate = mean(Lending_rate, na.rm = TRUE),
n_countries = n(), # number of rows in each year-size cell (not distinct countries)
.groups = "drop"
)
print(yearly_avg, n = Inf)
## # A tibble: 65 × 6
## Years Economy_Size avg_gdp avg_inflation avg_lending_rate n_countries
## <int> <ord> <dbl> <dbl> <dbl> <int>
## 1 2005 Small_Economy 1.73 18.4 25.9 2
## 2 2005 Medium_Economy 2.97 2.98 5.95 2
## 3 2005 Large_Economy 1.69 4.36 23.2 6
## 4 2006 Small_Economy -0.502 6.41 29.0 2
## 5 2006 Medium_Economy 3.26 3.61 6.49 1
## 6 2006 Large_Economy 2.62 3.27 21.4 6
## 7 2007 Small_Economy 1.37 7.83 34.1 2
## 8 2007 Medium_Economy 3.85 3.71 11.7 2
## 9 2007 Large_Economy 2.24 3.25 19.1 6
## 10 2008 Small_Economy 3.19 9.45 24.0 4
## 11 2008 Medium_Economy 3.08 8.65 10.2 17
## 12 2008 Large_Economy 1.16 4.39 20.2 6
## 13 2009 Small_Economy -0.992 2.91 16.2 9
## 14 2009 Medium_Economy -2.37 4.10 9.78 20
## 15 2009 Large_Economy -4.34 5.09 25.9 4
## 16 2010 Small_Economy 3.03 5.88 17.7 20
## 17 2010 Medium_Economy 3.62 4.48 9.13 20
## 18 2010 Large_Economy 2.97 3.84 13.0 5
## 19 2010 Very_Large 4.08 -0.728 1.60 2
## 20 2011 Small_Economy 3.27 7.85 15.1 35
## 21 2011 Medium_Economy 3.32 6.67 9.34 30
## 22 2011 Large_Economy 2.04 4.45 18.8 6
## 23 2011 Very_Large 0.209 -0.272 1.50 2
## 24 2012 Small_Economy 2.37 5.88 16.8 35
## 25 2012 Medium_Economy 2.53 5.42 9.13 28
## 26 2012 Large_Economy 1.79 3.76 16.1 6
## 27 2012 Very_Large 1.54 -0.0441 1.41 2
## 28 2013 Small_Economy 3.38 4.52 14.7 35
## 29 2013 Medium_Economy 2.80 4.59 9.04 30
## 30 2013 Large_Economy 0.859 4.15 12.6 6
## 31 2013 Very_Large 2.15 0.335 1.30 2
## 32 2014 Small_Economy 3.13 3.67 13.6 38
## 33 2014 Medium_Economy 3.03 3.11 7.75 34
## 34 2014 Large_Economy 0.681 4.28 13.8 6
## 35 2014 Very_Large 0.429 2.76 1.22 2
## 36 2015 Small_Economy 2.28 3.82 14.5 40
## 37 2015 Medium_Economy 2.42 2.94 8.31 33
## 38 2015 Large_Economy -1.35 5.88 23.7 4
## 39 2015 Very_Large 1.67 0.795 1.14 2
## 40 2016 Small_Economy 2.44 3.48 13.3 44
## 41 2016 Medium_Economy 2.52 4.33 8.31 37
## 42 2016 Large_Economy -1.61 5.78 28.4 4
## 43 2016 Very_Large 0.805 -0.127 1.04 2
## 44 2017 Small_Economy 2.00 4.17 12.7 42
## 45 2017 Medium_Economy 3.07 5.41 8.08 30
## 46 2017 Large_Economy 0.756 4.74 27.1 4
## 47 2017 Very_Large 1.76 0.484 0.994 2
## 48 2018 Small_Economy 2.55 2.97 12.4 45
## 49 2018 Medium_Economy 2.65 4.57 7.93 27
## 50 2018 Large_Economy 1.05 4.28 23.6 4
## 51 2019 Small_Economy 2.31 2.67 12.1 38
## 52 2019 Medium_Economy 1.63 5.62 9.66 30
## 53 2019 Large_Economy -0.386 3.68 23.0 4
## 54 2020 Small_Economy -9.15 4.05 12.6 30
## 55 2020 Medium_Economy -4.45 5.64 8.95 23
## 56 2020 Large_Economy -6.47 3.30 17.7 4
## 57 2021 Small_Economy 6.60 4.29 11.4 27
## 58 2021 Medium_Economy 4.64 7.37 8.21 24
## 59 2021 Large_Economy 4.83 7.00 17.5 4
## 60 2022 Small_Economy 5.11 10.9 14.1 28
## 61 2022 Medium_Economy 3.05 8.68 8.71 20
## 62 2022 Large_Economy 2.79 8.59 23.8 4
## 63 2023 Small_Economy 4.39 7.20 13.0 17
## 64 2023 Medium_Economy 1.85 8.71 10.4 12
## 65 2023 Large_Economy 2.62 5.06 27.6 4
The results show substantial heterogeneity in the inflation–growth relationship across economy sizes. Small economies experience high inflation volatility and elevated lending rates, which are associated with unstable and often lower growth. Medium-sized economies display the most favorable macroeconomic trade-off, combining moderate inflation, manageable borrowing costs, and stable growth. Very large economies, which appear in the table only for 2010–2017, operate under a distinct low-inflation, low-interest-rate regime, where growth appears largely decoupled from inflation dynamics.
# time series plot
ggplot(yearly_avg, aes(x = Years, y = avg_gdp, color = Economy_Size)) +
  geom_line(linewidth = 1.5) + # linewidth replaces the deprecated size argument for lines
  geom_point(size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Average GDP per Capita Growth Over Time by Economy Size",
       subtitle = "Growth contracts in every group in 2009 and 2020",
       x = "Year",
       y = "Average GDP per Capita Growth (%)",
       color = "Economy Size") +
  scale_color_brewer(palette = "Set1")
Pre-crisis periods (2005–2007, 2011–2018)
Small and medium economies post positive average growth in almost every year (the exception is small economies in 2006, at -0.5%). Large economies grow more slowly and turn negative in 2015–2016. Very large economies, observed only for 2010–2017, stay positive throughout.
Global Financial Crisis (2008–2009)
Growth falls across all groups in 2009. Large economies record the sharpest contraction (-4.3%), followed by medium (-2.4%) and small (-1.0%) economies. All groups return to growth of about 3% or more in 2010.
COVID-19 shock (2020)
The largest contraction in the sample. Small economies suffer the deepest decline (-9.2%), followed by large (-6.5%) and medium (-4.5%) economies. Very large economies have no observations for 2020.
Post-crisis recovery (2021–2023)
Strong rebound across all observed groups. Small economies show the highest rebound growth (6.6% in 2021), partly a base effect. Medium and large economies return to moderate growth of roughly 2–5%.
Small economies, while sometimes growing faster in good times, were hit hardest by the 2020 shock. Medium-sized economies sit in the middle: less affected than small economies in 2020 and less affected than large economies in 2009. Average GDP growth declines sharply during global crises in every group, and the group averages do not show that larger economies are uniformly more resilient.
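The crisis-year rows of the yearly table printed above can be compared directly. The sketch below copies the 2009 and 2020 averages from that output into a small data frame and picks the group with the lowest average growth in each year. Very_Large is left out because the table has no rows for it in those years, and the object names `crisis` and `deepest` are introduced here for illustration.

```r
# Average GDP per capita growth copied from the yearly_avg output (2009 and 2020)
crisis <- data.frame(
  Years        = c(2009, 2009, 2009, 2020, 2020, 2020),
  Economy_Size = rep(c("Small_Economy", "Medium_Economy", "Large_Economy"), 2),
  avg_gdp      = c(-0.992, -2.37, -4.34, -9.15, -4.45, -6.47)
)

# For each year, keep the row with the most negative average growth
deepest <- do.call(
  rbind,
  lapply(split(crisis, crisis$Years), function(d) d[which.min(d$avg_gdp), ])
)

print(deepest)
```

On these averages the deepest contraction is in large economies in 2009 and in small economies in 2020, so the ranking of groups differs between the two crises.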
Inflation
ggplot(yearly_avg, aes(x = Years, y = avg_inflation, color = Economy_Size)) +
  geom_line(linewidth = 1.5) + # linewidth replaces the deprecated size argument for lines
  geom_point(size = 2) +
  labs(title = "Average Inflation Over Time by Economy Size",
       subtitle = "Inflation spikes visible in small/medium economies",
       x = "Year",
       y = "Average Inflation (%)",
       color = "Economy Size") +
  scale_color_brewer(palette = "Set1") +
  scale_y_log10() # year-group averages that are zero or negative are dropped on a log scale
Small economies
Highest inflation at the start of the sample (18.4% in 2005) and in 2022 (10.9%), with large year-to-year fluctuations. From 2016 to 2021 their average inflation is below that of medium economies.
Medium economies
Inflation is volatile and clearly sensitive to global cycles, peaking in 2008 (8.7%) and again in 2022–2023 (about 8.7%).
Large economies
Inflation remains within a relatively narrow band (roughly 3–6% before 2021), with gradual movements rather than sharp spikes and a temporary increase to 7–9% in 2021–2022.
Very large economies
Lowest inflation levels overall, close to zero in most years, with a single spike in 2014 (2.8%). This group is observed only for 2010–2017.
In this sample, economic size is closely linked to inflation stability, with larger economies showing steadier price dynamics.
What’s the correlation structure between indicators?
# Select numeric columns for correlation
numeric_cols <- Global_Rates_Economy_Size %>%
  dplyr::select(where(is.numeric)) %>%
  # Drop Years, Risk_Score and the High_* / GDP_Decline flags; Negative_CA is kept
  dplyr::select(-c(Years, contains("High_"), GDP_Decline, Risk_Score))
cor_matrix <- cor(numeric_cols, use = "complete.obs")
# correlation heatmap
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.col = "darkred",
tl.srt = 45,
addCoef.col = "orange",
number.cex = 0.7,
title = "Correlation Matrix of Economic Indicators",
mar = c(0, 0, 2, 0))
HIGH CORRELATION (|r| around 0.7 or above)
Interest Rate & Risk Cluster
(Lending rate, interest rate spread, risk premium, real interest rate) These variables move closely together and form the financial risk core of the economy. High lending rates tend to come with wide spreads, high risk premia, and high real interest rates, so a high value in one is a strong signal that the others are high as well.
Current Account Balance Cluster
Current_Account_GDP_Perc ↔︎ Negative_CA: ~ -0.69
Negative_CA flags a current account deficit, and a deficit in USD is also a deficit as a share of GDP, so the flag is negatively correlated with Current_Account_GDP_Perc largely by construction.
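To see why a flag built from the sign of a variable correlates with that variable, consider this toy illustration. The `ca_pct` values are made up and are not taken from the dataset. The flag is defined the same way as `Negative_CA`, applied here to a percentage-of-GDP series, which has the same sign as the USD balance because GDP is positive.

```r
# Hypothetical current account balances as a percentage of GDP
ca_pct <- c(-8, -5, -3, -1, 0.5, 2, 4, 9)

# Deficit flag: 1 when the balance is negative, 0 otherwise
negative_ca <- as.integer(ca_pct < 0)

# Pearson correlation between the balance and its own deficit flag
r <- cor(ca_pct, negative_ca)
print(r)
```

The correlation is strongly negative by construction, so a value near -0.7 in the heatmap says little about economic behaviour beyond how the flag was defined.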
Interest Rate Components Cluster
(Interest rate spread, risk premium, real interest rate) These measures reflect different expressions of the same underlying risk pricing. Markets embed perceived economic risk directly into real returns and spreads.
MODERATE CORRELATION GROUPS (0.3 < |r| < 0.7)
Inflation & Monetary Policy Transmission
(Inflation, lending rate, deposit rate) Higher inflation is associated with higher lending rates and, in turn, higher deposit rates. This is consistent with policy tightening being passed through to bank rates, and it points to a working but imperfect monetary transmission mechanism.
Credit & Economic Size
(Private sector credit, GDP size, current account balance) Larger economies tend to have deeper credit markets, and surplus countries often support more domestic lending.
External Sector & Capital Attraction
(Current account, negative CA, deposit rates) Countries with external deficits tend to offer higher deposit rates to attract capital and finance imbalances.
Inflation–Deposit Rate Link
(Inflation ↔︎ deposit rate) Banks partially compensate savers for inflation, but real returns are not fully protected.
LOW / NO CORRELATION GROUPS (|r| < 0.3)
Economic Growth Isolated
(GDP per capita growth) Growth shows little linear association with the monetary and financial variables here. It likely depends on structural and productivity factors not captured in this dataset.
Exchange Rate Independence
(Exchange rate vs most variables) Exchange rates are driven by external capital flows, expectations, and policy interventions, not domestic fundamentals alone.
Unemployment as a Structural Variable
(Unemployment vs macro-financial variables) Labor market outcomes reflect institutional and structural dynamics, not short-run financial conditions.
Overall, the correlation structure reveals that financial risk variables are tightly interconnected, monetary policy transmission is present but imperfect, and real economic outcomes such as growth and unemployment remain largely decoupled from short-run macro-financial indicators in cross-country data.
Which countries have the highest risk scores?
# Calculate average risk by country
country_risk <- Global_Rates_Economy_Size %>%
group_by(Country, Economy_Size) %>%
summarise(
avg_risk_score = mean(Risk_Score, na.rm = TRUE),
high_risk_years = sum(Risk_Score >= 3, na.rm = TRUE),
total_years = n(),
.groups = "drop"
) %>%
mutate(risk_pct = high_risk_years / total_years * 100) %>%
arrange(desc(avg_risk_score))
print(country_risk, n = 15)
## # A tibble: 59 × 6
## Country Economy_Size avg_risk_score high_risk_years total_years risk_pct
## <fct> <ord> <dbl> <int> <int> <dbl>
## 1 Angola Medium_Econ… 3.33 9 9 100
## 2 Belize Small_Econo… 2.44 5 9 55.6
## 3 Solomon Isl… Small_Econo… 2.14 2 7 28.6
## 4 Maldives Small_Econo… 2 2 15 13.3
## 5 St. Lucia Small_Econo… 1.88 1 8 12.5
## 6 Moldova Small_Econo… 1.86 4 28 14.3
## 7 Mozambique Small_Econo… 1.83 3 12 25
## 8 Gambia, The Small_Econo… 1.73 2 15 13.3
## 9 Zambia Small_Econo… 1.73 2 15 13.3
## 10 Lebanon Small_Econo… 1.71 0 7 0
## 11 Albania Small_Econo… 1.69 0 26 0
## 12 Barbados Small_Econo… 1.67 0 6 0
## 13 Pakistan Medium_Econ… 1.67 0 15 0
## 14 Algeria Medium_Econ… 1.64 5 22 22.7
## 15 Montenegro Small_Econo… 1.62 2 26 7.69
## # ℹ 44 more rows
# Top 15 risky countries
country_risk %>%
slice_head(n = 15) %>%
ggplot(aes(x = reorder(Country, avg_risk_score), y = avg_risk_score, fill = Economy_Size)) +
geom_col() +
coord_flip() +
labs(title = "Top 15 Countries by Average Risk Score",
subtitle = "Higher scores indicate more frequent risk indicators",
x = "Country",
y = "Average Risk Score (0-4)",
fill = "Economy Size") +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "right")
Small economies dominate the risk rankings
12 of the top 15 countries are in the Small_Economy group.
Given how Risk_Score is constructed, a high score means that an economy frequently shows several of the following at once:
Inflation above 15%, negative GDP per capita growth, non-performing loans above 10%, and a current account deficit
This reflects stress in the macro-financial system. An economy can grow and still be high risk, but it cannot be stable while remaining high risk for long.
Angola has the highest average risk score (3.33) and meets the high-risk threshold (Risk_Score ≥ 3) in 100% of its 9 observed years. In this sample Angola is never in a low-stress macro environment; its risk is structural and persistent, not occasional.
Macro-financial risk is highly uneven across countries, concentrated in small and externally exposed economies, and becomes most dangerous when it is persistent rather than episodic.
Almost all are developing or emerging economies, which is consistent with less developed economies carrying higher economic risk.
How do lending and deposit rates relate?
ggplot(Global_Rates_Economy_Size, aes(x = Deposit_rate, y = Lending_rate)) +
geom_point(aes(color = Economy_Size, size = GDP_current_us), alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Lending vs Deposit Rates",
subtitle = "Points above the red line indicate positive interest spreads",
x = "Deposit Rate (%)",
y = "Lending Rate (%)",
color = "Economy Size",
size = "GDP (USD)") +
scale_size_continuous(range = c(1, 10),
labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
facet_wrap(~Economy_Size, scales = "free")
The plot visualizes bank pricing across countries: how banks set deposit rates (cost of funds) relative to lending rates (price of credit), and how this varies by economy size.
Points above the line have a positive interest spread (lending > deposit), points on the line have zero spread (lending = deposit), and points below it have a negative spread (unusual in normal banking).
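The reading of the 45-degree line can be written out as a short calculation. The sketch below uses made-up rate pairs: the first mirrors the 30% versus 5% example discussed for small economies, and the other two are hypothetical. The column name `position` is introduced here for illustration.

```r
# Hypothetical lending/deposit rate pairs (percent)
rates <- data.frame(
  Lending_rate = c(30, 8, 4),
  Deposit_rate = c(5, 6, 4.5)
)

# Spread = lending rate minus deposit rate
rates$spread <- rates$Lending_rate - rates$Deposit_rate

# Position relative to the dashed line y = x in the scatter plot
rates$position <- ifelse(rates$spread > 0, "above line",
                  ifelse(rates$spread < 0, "below line", "on line"))

print(rates)
```

The vertical distance of a point from the dashed line equals its spread, so wide spreads show up as points far above the line.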
Small Economy:
High Lending, Low Deposit
Banks charge high lending rates because lending is risky, and deposit rates stay low because households have few safe alternatives. A gap such as 30% lending versus 5% deposit is economically realistic in this group.
Medium Economy:
High Lending, High Deposit
Medium economies have high lending and high deposit rates because inflation and risk affect both borrowers and savers, resulting in a high-rate but more balanced financial environment than in small economies.
Large Economy:
Large economies show two patterns: low lending with low deposit rates, and high lending with low deposit rates.
Some large economies behave like small ones (high lending, low deposits). They are large in GDP but still financially developing or volatile, and banks charge high rates for loans but pay little on deposits (wide spreads).
Other large economies behave like very large ones (low lending, low deposits). They are financially mature and stable, and both borrowing and saving are cheap.
Very Large Economies
Low Lending, Low Deposit
All have low rates across the board. Their financial systems are deep, stable, and globally integrated.
Small and medium economies cluster at high lending rates and wide spreads due to inflation, credit risk, and weak financial institutions. Being big doesn’t automatically mean a financial system is mature: some large economies still have "small economy" interest rate problems. The very large economies in this sample, by contrast, combine financial stability with low rates.
ECONOMIC RESILIENCE ANALYSIS - STATISTICAL ANALYSIS
STATISTICAL TEST 1: Do different economy sizes have different growth rates?
Null hypothesis (H₀): Mean GDP per capita growth is the same across all economy sizes.
Alternative hypothesis (H₁): At least one economy-size group has a different mean growth rate.
# ANOVA test
growth_anova <- aov(GDP_per_capita_growth ~ Economy_Size, data = Global_Rates_Economy_Size)
anova_summary <- summary(growth_anova)
cat("ANOVA: GDP Growth by Economy Size\n")
## ANOVA: GDP Growth by Economy Size
print(anova_summary)
## Df Sum Sq Mean Sq F value Pr(>F)
## Economy_Size 3 152 50.80 3.002 0.0297 *
## Residuals 1018 17227 16.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
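# Post-hoc test (Tukey HSD), run only when the overall ANOVA is significant
if (anova_summary[[1]][1, "Pr(>F)"] < 0.05) {
  tukey_test <- TukeyHSD(growth_anova)
  print(tukey_test)
  # Visualize the pairwise confidence intervals (plot() is called for its side effect)
  plot(tukey_test)
}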
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = GDP_per_capita_growth ~ Economy_Size, data = Global_Rates_Economy_Size)
##
## $Economy_Size
## diff lwr upr p adj
## Medium_Economy-Small_Economy -0.04001195 -0.7429325 0.66290863 0.9988794
## Large_Economy-Small_Economy -1.34068434 -2.5374327 -0.14393595 0.0209513
## Very_Large-Small_Economy -0.67357514 -3.3625975 2.01544724 0.9174445
## Large_Economy-Medium_Economy -1.30067238 -2.5138141 -0.08753066 0.0300461
## Very_Large-Medium_Economy -0.63356319 -3.3299214 2.06279501 0.9306098
## Very_Large-Large_Economy 0.66710920 -2.1979313 3.53214966 0.9323133
Since p < 0.05, we reject the null hypothesis. There is a statistically significant difference in average GDP per capita growth across economy sizes.
ANOVA results indicate statistically significant differences in GDP per capita growth across economy sizes (F(3, 1018) = 3.00, p = 0.03). Post-hoc Tukey tests show that large economies have significantly lower average growth than both small economies (difference of -1.34 percentage points, adjusted p = 0.021) and medium economies (-1.30, adjusted p = 0.030). No other pairwise difference is significant.
Small and medium economies grow faster on average than large economies. The ANOVA compares means only, so it does not test whether growth in large economies is also steadier.
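Statistical significance does not say how large the effect is. As a rough check, eta squared (the share of total variance explained by Economy_Size) can be computed from the sums of squares printed in the ANOVA table. The sketch below uses those printed, rounded values, so the recomputed F statistic differs slightly from the reported 3.002. The variable names are introduced here for illustration.

```r
# Values copied from the printed ANOVA table (rounded as displayed)
ss_between <- 152     # Sum Sq for Economy_Size
ss_within  <- 17227   # Sum Sq for Residuals
df_between <- 3
df_within  <- 1018

# Eta squared: between-group sum of squares over the total sum of squares
eta_sq <- ss_between / (ss_between + ss_within)

# Recompute F and its p-value from the same inputs as a consistency check
f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(c(eta_sq = eta_sq, F = f_stat, p = p_value), 4))
```

Eta squared comes out below 0.01, meaning that economy size accounts for less than 1% of the variation in GDP per capita growth. The group differences are statistically detectable with about a thousand observations but small in magnitude.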
STATISTICAL TEST 2: Correlation between inflation and growth
# Overall correlation (cor.test() has no `use` argument; the data contain no missing values after na.omit())
cor_test_overall <- cor.test(Global_Rates_Economy_Size$Inflation,
                             Global_Rates_Economy_Size$GDP_per_capita_growth)
cat("\nCorrelation Test: Inflation vs GDP Growth (Overall)\n")
##
## Correlation Test: Inflation vs GDP Growth (Overall)
print(cor_test_overall)
##
## Pearson's product-moment correlation
##
## data: Global_Rates_Economy_Size$Inflation and Global_Rates_Economy_Size$GDP_per_capita_growth
## t = -0.23225, df = 1020, p-value = 0.8164
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06856304 0.05407434
## sample estimates:
## cor
## -0.007271695
# By economy size
cor_by_size <- Global_Rates_Economy_Size %>%
group_by(Economy_Size) %>%
summarise(
correlation = cor(Inflation, GDP_per_capita_growth, use = "complete.obs"),
p_value = cor.test(Inflation, GDP_per_capita_growth)$p.value,
n = n()
)
print(cor_by_size)
## # A tibble: 4 × 4
## Economy_Size correlation p_value n
## <ord> <dbl> <dbl> <int>
## 1 Small_Economy 0.0674 0.135 493
## 2 Medium_Economy -0.134 0.00583 420
## 3 Large_Economy -0.00166 0.987 93
## 4 Very_Large -0.459 0.0738 16
p-value = 0.8164 > 0.05: there is no statistically significant linear correlation between inflation and GDP per capita growth when all economy sizes are pooled together (r = −0.007).
Correlation by economy size
Small economies: r = +0.067, p = 0.135 (not significant). No meaningful linear relationship; inflation is not the main growth determinant.
Medium economies: r = −0.134, p = 0.0058 (significant). A weak but statistically significant negative correlation: higher inflation is associated with lower growth.
Large economies: r ≈ 0, p = 0.987 (not significant). Growth appears decoupled from inflation fluctuations.
Very large economies: r = −0.459, p = 0.0738 (not significant). The point estimate is moderately negative, but with only 16 observations it does not reach significance at the 5% level.
While no significant correlation between inflation and GDP per capita growth is observed in the pooled sample, subgroup analysis reveals a statistically significant but weak negative relationship in medium-sized economies and a larger, though not statistically significant, negative association in very large economies, highlighting the importance of economic size and structural heterogeneity.
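To show how much the small sample limits the very-large-economy result, the confidence interval for a correlation can be approximated with Fisher's z-transform, which is also what cor.test reports for Pearson correlations. The sketch below uses only the r and n values printed above; the helper name fisher_ci is introduced here for illustration.

```r
# Approximate confidence interval for a Pearson correlation via Fisher's z-transform
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                       # Fisher z-transform of r
  se   <- 1 / sqrt(n - 3)                # approximate standard error of z
  crit <- qnorm(1 - (1 - level) / 2)     # two-sided normal critical value
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

# Values copied from the outputs above
ci_pooled     <- fisher_ci(r = -0.007271695, n = 1022)  # df = 1020, so n = 1022
ci_very_large <- fisher_ci(r = -0.459, n = 16)

print(round(ci_pooled, 4))
print(round(ci_very_large, 3))
```

The pooled interval reproduces the one in the cor.test output. The very-large-economy interval runs from roughly −0.78 to 0.05, so it includes zero but is also compatible with a strong negative relationship.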
STATISTICAL TEST 3: Are risk scores different across sizes?
H₀: Risk score distributions are the same across economy sizes
H₁: At least one economy size has a different risk score distribution
# Kruskal-Wallis test
risk_kruskal <- kruskal.test(Risk_Score ~ Economy_Size, data = Global_Rates_Economy_Size)
cat("\nKruskal-Wallis Test: Risk Scores by Economy Size\n")
##
## Kruskal-Wallis Test: Risk Scores by Economy Size
print(risk_kruskal)
##
## Kruskal-Wallis rank sum test
##
## data: Risk_Score by Economy_Size
## Kruskal-Wallis chi-squared = 92.664, df = 3, p-value < 2.2e-16
# Pairwise comparisons
if (risk_kruskal$p.value < 0.05) {
pairwise.wilcox.test(Global_Rates_Economy_Size$Risk_Score, Global_Rates_Economy_Size$Economy_Size,
p.adjust.method = "BH")
}
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: Global_Rates_Economy_Size$Risk_Score and Global_Rates_Economy_Size$Economy_Size
##
## Small_Economy Medium_Economy Large_Economy
## Medium_Economy 5.4e-13 - -
## Large_Economy 0.00091 0.07331 -
## Very_Large 4.3e-11 4.1e-06 4.2e-09
##
## P value adjustment method: BH
p-value < 2.2e-16 (highly statistically significant)
We reject H₀: risk scores differ significantly across economy sizes.
Pairwise Wilcoxon results (BH-adjusted p-values)
Small vs Medium: p = 5.4e-13 (significant). Risk profiles are very different.
Small vs Large: p = 0.00091 (significant). Risk continues to differ as size increases.
Small vs Very Large: p = 4.3e-11 (significant).
Medium vs Large: p = 0.073 (not significant). Borderline, which suggests a gradual rather than abrupt change in risk between these two groups.
Medium vs Very Large: p = 4.1e-06 (significant).
Large vs Very Large: p = 4.2e-09 (significant). Even among the larger economies, very large ones have a distinct risk profile.
A Kruskal–Wallis test reveals statistically significant differences in risk scores across economy sizes (χ²(3) = 92.66, p < 0.001). Pairwise Wilcoxon tests indicate that every pair of size groups except medium vs large differs significantly, consistent with risk declining as economic size increases, with small economies exhibiting the highest risk and very large economies the lowest.
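The reported p-value and a rough effect size can be recovered from the test statistic alone. The sketch below uses the H statistic printed above; the sample size n = 1022 is an assumption carried over from the correlation test, since the Kruskal-Wallis output does not print it.

```r
# Values copied from the Kruskal-Wallis output above
H <- 92.664   # Kruskal-Wallis chi-squared statistic
k <- 4        # number of economy-size groups (df = k - 1 = 3)
n <- 1022     # ASSUMPTION: same observation count as the correlation test

# p-value from the chi-squared reference distribution
p_value <- pchisq(H, df = k - 1, lower.tail = FALSE)

# Eta squared based on H: (H - k + 1) / (n - k)
eta_sq_h <- (H - k + 1) / (n - k)

cat("p-value:", format(p_value, digits = 3), "\n")
cat("Eta squared (H):", round(eta_sq_h, 3), "\n")
```

Under that assumption the effect size is about 0.09, which common rules of thumb would class as moderate.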
REGRESSION ANALYSIS: What predicts GDP growth?
#Remove extreme outliers for regression
regression <- Global_Rates_Economy_Size %>%
filter(
abs(GDP_per_capita_growth) < 30, # Remove extreme growth values
Inflation < 50 # Remove hyperinflation cases
)
# Multiple regression
model <- lm(GDP_per_capita_growth ~ Inflation + Lending_rate +
Unemployment + Real_interest_rate + factor(Economy_Size),
data = regression)
print(summary(model))
##
## Call:
## lm(formula = GDP_per_capita_growth ~ Inflation + Lending_rate +
## Unemployment + Real_interest_rate + factor(Economy_Size),
## data = regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7127 -1.2640 0.3625 2.1465 14.3773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.34777 0.33347 7.040 3.53e-12 ***
## Inflation -0.09066 0.03600 -2.518 0.01195 *
## Lending_rate 0.07909 0.02833 2.791 0.00535 **
## Unemployment -0.05329 0.01909 -2.791 0.00535 **
## Real_interest_rate -0.11542 0.02763 -4.177 3.21e-05 ***
## factor(Economy_Size).L -1.05062 0.66304 -1.585 0.11338
## factor(Economy_Size).Q 0.27895 0.53580 0.521 0.60275
## factor(Economy_Size).C 0.45600 0.38921 1.172 0.24164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.78 on 1012 degrees of freedom
## Multiple R-squared: 0.03668, Adjusted R-squared: 0.03001
## F-statistic: 5.504 on 7 and 1012 DF, p-value: 3.154e-06
Inflation (−0.091)
Negative and statistically significant (p = 0.012). A 1 percentage-point increase in inflation is associated with a 0.09 percentage-point reduction in GDP per capita growth, holding other factors constant.
Lending rate (+0.079)
Positive and significant (p = 0.005). Higher lending rates make borrowing more expensive and would not be expected to raise growth, yet they correlate positively here. One plausible reading is reverse causality: in growing economies, rates rise because growth is strong.
Unemployment (−0.053)
Negative and significant (p = 0.005). Higher unemployment is associated with lower output growth through underutilised labour, which is consistent with macro theory (Okun’s Law).
Real interest rate (−0.115)
Negative and highly significant (p < 0.001). Higher real borrowing costs are associated with lower growth, consistent with reduced investment and consumption. By t-statistic, this is the strongest predictor in the model.
Economy size
None of the three economy-size contrasts is statistically significant after controlling for the macro variables.
Regression results indicate that inflation, unemployment, and real interest rates have statistically significant negative associations with GDP per capita growth, while lending rates are positively associated with growth, reflecting cyclical credit conditions. The model explains only about 3.7% of the variation in growth (R² = 0.037), so these associations are modest in aggregate. Once macroeconomic fundamentals are controlled for, economy size no longer has a significant direct effect on growth, suggesting that size operates indirectly through macroeconomic stability channels.
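The t value and p-value columns in the regression table follow directly from each estimate and its standard error. As an illustrative check, the sketch below reproduces the Inflation row from the printed figures and adds an approximate 95% confidence interval, which the output does not show.

```r
# Estimate and standard error copied from the regression table above (Inflation row)
estimate  <- -0.09066
std_error <- 0.03600
df_resid  <- 1012      # residual degrees of freedom from the model summary

t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)   # two-sided p-value

# Approximate 95% confidence interval for the coefficient
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

cat("t =", round(t_value, 3), " p =", round(p_value, 4), "\n")
cat("95% CI:", round(ci, 3), "\n")
```

The interval excludes zero, which agrees with the significance star on the Inflation row.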
# VIF for multicollinearity
vif_values <- car::vif(model)
cat("\nVariance Inflation Factors (VIF):\n")
##
## Variance Inflation Factors (VIF):
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Inflation 1.749101 1 1.322536
## Lending_rate 5.606672 1 2.367841
## Unemployment 1.049062 1 1.024237
## Real_interest_rate 4.830297 1 2.197794
## factor(Economy_Size) 1.284241 3 1.042576
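Lending rates and real interest rates are related, reflecting the interest rate–risk cluster we saw earlier. Their GVIFs (5.6 and 4.8) are the highest in the model but remain below the commonly used threshold of 10, so the collinearity is moderate rather than extreme and our regression results for growth remain interpretable.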
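A VIF is easier to read once it is converted back into the R² from regressing each predictor on the others, using VIF = 1 / (1 − R²). The sketch below applies that identity to the single-degree-of-freedom terms in the output above, for which GVIF is the ordinary VIF.

```r
# GVIF values copied from the output above (1-df terms only, where GVIF = VIF)
gvif <- c(Inflation = 1.749101, Lending_rate = 5.606672,
          Unemployment = 1.049062, Real_interest_rate = 4.830297)

# Share of each predictor's variance explained by the other predictors
r2_from_others <- 1 - 1 / gvif

# Factor by which each coefficient's standard error is inflated
se_inflation_factor <- sqrt(gvif)

print(round(r2_from_others, 3))
print(round(se_inflation_factor, 2))
```

About 82% of the variation in Lending_rate and 79% in Real_interest_rate is explained by the other predictors, so their standard errors are a little more than twice what they would be without collinearity.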
TIME SERIES ANALYSIS: Are there global trends?
# Calculate global averages per year
global_avg <- Global_Rates_Economy_Size %>%
group_by(Years) %>%
summarise(
global_growth = mean(GDP_per_capita_growth, na.rm = TRUE),
global_inflation = mean(Inflation, na.rm = TRUE),
n_countries = n()
)
# Simple time series model
ts_model <- lm(global_growth ~ Years + I(Years^2), data = global_avg)
cat("\nGlobal Growth Trend Model:\n")
##
## Global Growth Trend Model:
print(summary(ts_model))
##
## Call:
## lm(formula = global_growth ~ Years + I(Years^2), data = global_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1845 0.0835 0.4729 0.9740 3.4321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.620e+04 9.722e+04 0.167 0.87
## Years -1.610e+01 9.654e+01 -0.167 0.87
## I(Years^2) 4.000e-03 2.397e-02 0.167 0.87
##
## Residual standard error: 2.792 on 16 degrees of freedom
## Multiple R-squared: 0.002427, Adjusted R-squared: -0.1223
## F-statistic: 0.01946 on 2 and 16 DF, p-value: 0.9808
Neither the linear (Years) nor the quadratic (Years²) term is significant (p = 0.87 for both; overall F-test p = 0.98).
Global average growth does not systematically increase or decrease over the sample period. There is no upward trend (we are not getting better at growing), no downward trend (we are not stagnating), and no U-shaped pattern. Growth is essentially flat, with random fluctuations from year to year.
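The very large intercept and standard errors in the trend model come largely from using raw calendar years: Years and Years² are almost perfectly collinear. Centering the year before squaring gives the same fitted curve with more readable coefficients. The sketch below illustrates this on simulated data; the year range 2005 to 2023 and the series itself are hypothetical, not the project's data.

```r
set.seed(1)
# Hypothetical flat series: 19 yearly averages with no trend (simulated)
sim <- data.frame(Years = 2005:2023)
sim$global_growth <- 2 + rnorm(nrow(sim), sd = 2.8)

# Quadratic trend on raw calendar years, as in the model above
raw_fit <- lm(global_growth ~ Years + I(Years^2), data = sim)

# Same model with the year centered on its mean
sim$Years_c <- sim$Years - mean(sim$Years)
centered_fit <- lm(global_growth ~ Years_c + I(Years_c^2), data = sim)

# Identical fitted curve, very different standard errors on the linear term
se_raw      <- summary(raw_fit)$coefficients[, "Std. Error"]
se_centered <- summary(centered_fit)$coefficients[, "Std. Error"]
print(se_raw)
print(se_centered)
```

Centering does not change the quadratic term or the overall F-test, so the conclusion of no trend would be unaffected; it only makes the intercept and linear slope interpretable as values at the middle of the sample.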
CLUSTER ANALYSIS: Do countries group by economic behavior?
# Prepare data for clustering
cluster_data <- Global_Rates_Economy_Size %>%
group_by(Country) %>%
summarise(
avg_growth = mean(GDP_per_capita_growth, na.rm = TRUE),
avg_inflation = mean(Inflation, na.rm = TRUE),
avg_lending = mean(Lending_rate, na.rm = TRUE),
avg_unemployment = mean(Unemployment, na.rm = TRUE),
volatility = sd(GDP_per_capita_growth, na.rm = TRUE)
) %>%
na.omit() %>%
column_to_rownames("Country")
# Scale data
cluster_scaled <- scale(cluster_data)
# Determine optimal number of clusters
fviz_nbclust(cluster_scaled, kmeans, method = "wss") +
labs(title = "Optimal Number of Clusters - Elbow Method")
# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(cluster_scaled, centers = 4, nstart = 25)
print(kmeans_result)
## K-means clustering with 4 clusters of sizes 2, 12, 31, 13
##
## Cluster means:
## avg_growth avg_inflation avg_lending avg_unemployment volatility
## 1 -0.09427025 -0.8709498 -0.2170550 -0.2102689 3.792600351
## 2 0.21574991 -0.3904647 -0.1706681 1.6415339 0.009498563
## 3 0.09917527 -0.3067118 -0.4896737 -0.5072472 -0.158173111
## 4 -0.42114551 1.2258108 1.3586162 -0.2733234 -0.215062849
##
## Clustering vector:
## Albania Algeria
## 2 3
## Angola Armenia
## 4 2
## Australia Bangladesh
## 3 3
## Barbados Belize
## 3 1
## Bolivia Brazil
## 3 4
## Bulgaria Canada
## 3 3
## Czechia Eswatini
## 3 2
## Fiji Gambia, The
## 3 4
## Georgia Grenada
## 2 3
## Hong Kong SAR, China Hungary
## 3 3
## Israel Japan
## 3 3
## Kenya Kosovo
## 3 2
## Kyrgyz Republic Lebanon
## 4 3
## Lesotho Madagascar
## 2 4
## Malaysia Maldives
## 3 1
## Mauritius Mexico
## 3 3
## Moldova Montenegro
## 3 2
## Mozambique Namibia
## 4 2
## Nigeria Pakistan
## 4 3
## Papua New Guinea Philippines
## 3 3
## Romania Rwanda
## 3 2
## Seychelles Singapore
## 3 3
## Solomon Islands South Africa
## 3 2
## Sri Lanka St. Lucia
## 3 2
## St. Vincent and the Grenadines Tajikistan
## 2 4
## Tanzania Thailand
## 4 3
## Trinidad and Tobago Uganda
## 3 4
## Uruguay Uzbekistan
## 4 4
## Viet Nam Zambia
## 3 4
##
## Within cluster sum of squares by cluster:
## [1] 3.495281 31.357413 61.319528 56.454033
## (between_SS / total_SS = 46.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Visualize clusters
fviz_cluster(kmeans_result, data = cluster_scaled,
palette = "Set2",
ggtheme = theme_minimal(),
main = "Country Clusters by Economic Performance")
Macro patterns define groups: countries with similar GDP growth, inflation, lending rates, unemployment, and volatility tend to cluster together. The cluster means are in standardized units (z-scores), so each value is relative to the cross-country average.
Cluster 1 (n=2): Belize, Maldives
Extremely volatile growth (about 3.8 standard deviations above average), with low inflation and near-average growth, lending rates, and unemployment. Likely small economies exposed to large shocks.
Cluster 2 (n=12): Albania, Eswatini, Montenegro, South Africa
High-unemployment economies (about 1.6 standard deviations above average) with slightly above-average growth, below-average inflation and lending rates, and average volatility.
Cluster 3 (n=31): Australia, Canada, Japan, Singapore, Philippines
Stable economies with near-average growth and below-average inflation, lending rates, unemployment, and volatility. Likely medium-to-large, mature economies.
Cluster 4 (n=13): Angola, Brazil, Nigeria, Tanzania, Uzbekistan
High-inflation, high-lending-rate economies with below-average growth. Volatility is slightly below average, so the risk in this group comes from price and credit conditions rather than growth swings. Possibly small-to-medium developing countries.
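Because the clustering runs on scaled data, the printed cluster means are z-scores. They can be converted back to original units with the centering and scaling attributes that scale() stores. The sketch below shows the idea on simulated data; the toy data frame and its two columns are hypothetical, not the project's data.

```r
set.seed(123)
# Hypothetical country-level averages (simulated)
toy <- data.frame(
  avg_growth    = rnorm(30, mean = 2, sd = 1.5),
  avg_inflation = rnorm(30, mean = 6, sd = 4)
)

toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Undo the scaling: multiply by each column's sd, then add back its mean
centers_original <- sweep(km$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2, attr(toy_scaled, "scaled:center"), "+")

print(round(centers_original, 2))
```

The same two sweep() calls applied to kmeans_result$centers and cluster_scaled would express the cluster profiles in percentage points, which makes statements such as "high inflation" easier to verify.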
MODELING FOR RISK CLASSIFICATION
DATA PREPARATION FOR MODELING
The full target is Risk_Level (High/Medium/Low/Very Low). For the binary models we simplify it to High_Risk, which equals 1 when Risk_Level is High or Medium and 0 otherwise.
global_model <- Global_Rates_Economy_Size %>%
mutate(
# Binary target
High_Risk = ifelse(Risk_Level %in% c("High", "Medium"), 1, 0),
High_Risk = as.factor(High_Risk),
# Multi-class target
Risk_Level = factor(Risk_Level,
levels = c("Very Low", "Low", "Medium", "High"),
ordered = FALSE)
) %>%
# Remove countries with very few observations
group_by(Country) %>%
filter(n() >= 5) %>%
ungroup()
# Select features for modeling
feature_cols <- c(
"GDP_per_capita_growth", "Inflation", "Lending_rate", "Deposit_rate",
"Current_Account_GDP_Perc", "Exchange_rate", "Non_performing_loans",
"Private_sector_credit", "Unemployment", "Interest_rate_spread",
"Real_interest_rate", "Risk_premium", "Economy_Size"
)
# Prepare final dataset
model_data <- global_model %>%
dplyr::select(all_of(feature_cols), High_Risk, Risk_Level, Years, Country) %>%
na.omit()
cat("Final modeling dataset dimensions:", dim(model_data), "\n")
## Final modeling dataset dimensions: 1011 17
cat("Class distribution (High Risk vs Not):\n")
## Class distribution (High Risk vs Not):
print(table(model_data$High_Risk))
##
## 0 1
## 697 314
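Handling Imbalanced Classes
# Use SMOTE to handle imbalanced classes
# Keep the numeric predictors plus the target (this drops the non-numeric Economy_Size, Risk_Level and Country)
model_data_numeric <- model_data %>%
dplyr::select(where(is.numeric), High_Risk)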
model_data_numeric$High_Risk <- factor(model_data$High_Risk,
levels = c(0, 1))
set.seed(123)
model_data_smote <- themis::smote(
model_data_numeric,
var = "High_Risk",
over_ratio = 1
)
table(model_data_smote$High_Risk)
##
## 0 1
## 697 697
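SPLIT DATA: Stratified random split (70% train, 30% test)
# Split the SMOTE-balanced data into training and testing sets (70/30), stratified by High_Risk.
# Note: this is a random split, not a time-based one, and SMOTE was applied before splitting,
# so synthetic rows in the test set may be interpolated from rows in the training set.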
set.seed(123)
train_index <- createDataPartition(model_data_smote$High_Risk,
p = 0.7,
list = FALSE)
train_data <- model_data_smote[train_index, ]
test_data <- model_data_smote[-train_index, ]
cat("\nTraining set size:", nrow(train_data), "\n")
##
## Training set size: 976
cat("Testing set size:", nrow(test_data), "\n")
## Testing set size: 418
cat("Training class distribution:\n")
## Training class distribution:
print(table(train_data$High_Risk))
##
## 0 1
## 488 488
cat("Testing class distribution:\n")
## Testing class distribution:
print(table(test_data$High_Risk))
##
## 0 1
## 209 209
# Prepare matrices for XGBoost
train_x <- as.matrix(train_data %>% dplyr::select(-High_Risk))
train_y <- as.numeric(as.character(train_data$High_Risk))
test_x <- as.matrix(test_data %>% dplyr::select(-High_Risk))
test_y <- as.numeric(as.character(test_data$High_Risk))
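Since one aim of the project is to assess leakage, it is worth sketching the alternative ordering: split first, then balance only the training rows, so the test set contains real observations only. The example below is a minimal base R sketch on simulated data. Plain random oversampling stands in for SMOTE here, and the toy data frame is hypothetical; themis::smote() applied to the training rows would slot into step 2 in the same way.

```r
set.seed(123)
# Hypothetical imbalanced data (simulated, not the project's data)
toy <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  High_Risk = factor(rep(c(0, 1), times = c(140, 60)), levels = c(0, 1))
)

# 1. Split FIRST, stratified by class, so the test set holds only real rows
test_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$High_Risk),
  function(idx) sample(idx, size = round(0.3 * length(idx)))
))
train_raw <- toy[-test_idx, ]
test_set  <- toy[test_idx, ]

# 2. Balance ONLY the training rows (random oversampling as a stand-in for SMOTE)
minority <- which(train_raw$High_Risk == "1")
n_extra  <- sum(train_raw$High_Risk == "0") - length(minority)
train_balanced <- rbind(train_raw,
                        train_raw[sample(minority, n_extra, replace = TRUE), ])

print(table(train_balanced$High_Risk))
print(table(test_set$High_Risk))
```

With this ordering the test set keeps the original class imbalance, so metrics such as precision reflect the prevalence the model would face in practice.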
HELPER FUNCTION FOR EVALUATION
evaluate_model <- function(predictions, probs, actual, model_name) {
# Ensure factors have same levels
predictions <- factor(predictions, levels = c("0", "1"))
actual <- factor(actual, levels = c("0", "1"))
# Confusion matrix
cm <- confusionMatrix(predictions, actual, positive = "1")
# ROC curve
roc_obj <- roc(as.numeric(actual == "1"), probs)
# Calculate metrics
metrics <- data.frame(
Model = model_name,
Accuracy = round(cm$overall["Accuracy"], 4),
Sensitivity = round(cm$byClass["Sensitivity"], 4), # Recall for class 1
Specificity = round(cm$byClass["Specificity"], 4),
Precision = round(cm$byClass["Precision"], 4),
F1_Score = round(cm$byClass["F1"], 4),
AUC = round(auc(roc_obj), 4),
Kappa = round(cm$overall["Kappa"], 4)
)
return(list(
metrics = metrics,
cm = cm,
roc = roc_obj,
predictions = predictions,
probabilities = probs
))
}
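The AUC reported by the helper has a direct interpretation: it is the probability that a randomly chosen high-risk case receives a higher predicted probability than a randomly chosen low-risk case. The sketch below computes it from ranks in base R; the function name auc_by_rank and the toy vectors are introduced here for illustration only.

```r
# AUC as the probability that a random positive is scored above a random negative
auc_by_rank <- function(actual, probs) {
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  r <- rank(probs)   # average ranks, so ties count as half a win
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical toy example (not the project's data)
toy_actual <- c(0, 0, 1, 1)
toy_probs  <- c(0.10, 0.40, 0.35, 0.80)
print(auc_by_rank(toy_actual, toy_probs))
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.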
MODEL 1: LOGISTIC REGRESSION (SMOTE-BALANCED DATA)
# Train logistic regression
logit_model <- glm(High_Risk ~ .,
data = train_data,
family = binomial())
# Summary
cat("\nModel Coefficients:\n")
##
## Model Coefficients:
coef_summary <- summary(logit_model)$coefficients
print(coef_summary[order(-abs(coef_summary[, "Estimate"])), ][1:10, ])
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 21.69754886 56.26643513 0.3856215 6.997770e-01
## GDP_per_capita_growth -0.57567490 0.04794094 -12.0080028 3.225408e-33
## Non_performing_loans 0.44729906 0.03620194 12.3556637 4.539937e-35
## Current_Account_GDP_Perc -0.14299663 0.01853687 -7.7141729 1.217693e-14
## Inflation 0.13143094 0.04074922 3.2253607 1.258140e-03
## Deposit_rate 0.06048924 0.07192393 0.8410169 4.003385e-01
## Interest_rate_spread 0.05716604 0.07284433 0.7847699 4.325885e-01
## Risk_premium -0.03882690 0.04323347 -0.8980751 3.691455e-01
## Lending_rate -0.02852118 0.06200685 -0.4599682 6.455390e-01
## Real_interest_rate -0.02087328 0.03139520 -0.6648558 5.061428e-01
# Predictions
logit_prob <- predict(logit_model, newdata = test_data, type = "response")
logit_pred <- ifelse(logit_prob > 0.5, "1", "0")
# Evaluate
logit_results <- evaluate_model(logit_pred, logit_prob,
test_data$High_Risk, "Logistic Regression")
cat("\nPerformance Metrics:\n")
##
## Performance Metrics:
print(logit_results$metrics)
## Model Accuracy Sensitivity Specificity Precision
## Accuracy Logistic Regression 0.8565 0.8373 0.8756 0.8706
## F1_Score AUC Kappa
## Accuracy 0.8537 0.9356 0.7129
GDP_per_capita_growth: -0.576 (p < 0.001)
Higher economic growth reduces the probability of high risk.
Economically, countries with stronger GDP growth are less likely to experience macroeconomic stress.
Non_performing_loans: 0.447 (p < 0.001)
Higher levels of non-performing loans increase the risk of macroeconomic stress.
Indicates banking sector vulnerabilities strongly signal economic risk.
Current_Account_GDP_Perc: -0.143 (p < 0.001)
A higher (less negative) current account balance reduces risk.
Persistent external deficits increase vulnerability to shocks.
Inflation: 0.131 (p = 0.0013)
Higher inflation slightly increases risk.
Reflects the destabilizing effect of rising prices on macroeconomic stability.
Variables with Low or Non-significant Impact
Deposit_rate, Interest_rate_spread, Risk_premium, Lending_rate, Real_interest_rate
These coefficients are not statistically significant (p > 0.05).
After balancing the dataset, these variables do not provide strong independent signals for high-risk classification.
They may still interact with other variables but are less predictive on their own.
Accuracy: 85.7%. Overall, the model correctly classifies ~86% of cases.
AUC: 0.936. Excellent discriminative ability between high-risk and low-risk cases.
Sensitivity: 83.7%. It correctly identifies ~84% of high-risk cases (true positives).
Specificity: 87.6%. It correctly identifies ~88% of low-risk cases (true negatives).
Precision: 87.1%. When the model predicts high risk, it is correct ~87% of the time.
The logistic regression model predicts whether a country is at high macroeconomic risk based on key economic indicators. The model achieves 85.7% accuracy and an AUC of 0.936, demonstrating strong ability to distinguish high-risk from low-risk countries.
Key drivers of risk include:
GDP per capita growth (higher growth reduces risk),
Non-performing loans (higher levels increase risk),
Current account balance as a percentage of GDP (larger deficits increase risk), and
Inflation (higher inflation slightly increases risk).
Other variables such as deposit rates, lending rates, and interest rate spreads were not statistically significant, indicating that core macroeconomic fundamentals are the strongest predictors of systemic risk.
Overall, the model provides both high predictive performance and economic interpretability, making it suitable for risk monitoring and policy decision-making.
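Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to communicate. The sketch below does this for the four significant coefficients, copied from the coefficient table above.

```r
# Significant coefficients copied from the logistic regression output above
coefs <- c(
  GDP_per_capita_growth    = -0.57567490,
  Non_performing_loans     =  0.44729906,
  Current_Account_GDP_Perc = -0.14299663,
  Inflation                =  0.13143094
)

# Odds ratio for a one-unit increase in each predictor, others held fixed
odds_ratios <- exp(coefs)

# Same information as a percentage change in the odds
pct_change_in_odds <- 100 * (odds_ratios - 1)

print(round(odds_ratios, 3))
print(round(pct_change_in_odds, 1))
```

For example, each additional percentage point of GDP per capita growth multiplies the odds of a high-risk classification by about 0.56 (a reduction of roughly 44%), while each additional point of non-performing loans raises the odds by roughly 56%.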
MODEL 2: RANDOM FOREST
# Train Random Forest
set.seed(123)
rf_model <- randomForest(
High_Risk ~ .,
data = train_data,
ntree = 500,
mtry = floor(sqrt(ncol(train_data) - 1)), # Square root of features
importance = TRUE,
do.trace = FALSE
)
# Check OOB error
cat("OOB Error Rate:", round(rf_model$err.rate[nrow(rf_model$err.rate), "OOB"], 4), "\n")
## OOB Error Rate: 0.0123
# Predictions
rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[, "1"]
rf_pred <- predict(rf_model, newdata = test_data, type = "class")
# Evaluate
rf_results <- evaluate_model(rf_pred, rf_prob,
test_data$High_Risk, "Random Forest")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("\nPerformance Metrics:\n")
##
## Performance Metrics:
print(rf_results$metrics)
## Model Accuracy Sensitivity Specificity Precision F1_Score
## Accuracy Random Forest 0.9928 0.9856 1 1 0.9928
## AUC Kappa
## Accuracy 0.9998 0.9856
# Variable Importance
rf_importance <- importance(rf_model)
rf_importance_df <- data.frame(
Variable = rownames(rf_importance),
Importance = rf_importance[, "MeanDecreaseGini"]
) %>% arrange(desc(Importance))
cat("\nTop 10 Important Variables:\n")
##
## Top 10 Important Variables:
print(rf_importance_df[1:10, ])
## Variable Importance
## Non_performing_loans Non_performing_loans 134.15726
## GDP_per_capita_growth GDP_per_capita_growth 130.36832
## Current_Account_GDP_Perc Current_Account_GDP_Perc 60.01939
## Inflation Inflation 25.22484
## Lending_rate Lending_rate 23.47179
## Private_sector_credit Private_sector_credit 19.26347
## Deposit_rate Deposit_rate 16.96200
## Exchange_rate Exchange_rate 15.29407
## Unemployment Unemployment 14.12186
## Interest_rate_spread Interest_rate_spread 14.02310
The model classifies the test set almost perfectly (99.3% accuracy, 98.6% sensitivity, 100% specificity). This is unusually high for real-world macro-financial data, suggesting that the predictors are very strongly associated with the outcome.
Near-perfect accuracy can be a red flag in real-world forecasting: the model might have memorized patterns instead of learning generalizable relationships.
Using LASSO for feature pruning
We run a LASSO regression to see which variables are shrunk to zero, remove a subset of variables, and refit the Random Forest to check whether accuracy stays close to 100%, which would be a red flag.
cv_lasso <- cv.glmnet(train_x, train_y, family = "binomial", alpha = 1)
coef(cv_lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.20779642
## GDP_per_capita_growth -0.36794816
## Inflation 0.06454987
## Lending_rate .
## Deposit_rate .
## Current_Account_GDP_Perc -0.08727535
## Exchange_rate .
## Non_performing_loans 0.29385562
## Private_sector_credit .
## Unemployment .
## Interest_rate_spread .
## Real_interest_rate .
## Risk_premium .
## Years .
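LASSO shrinks Lending_rate, Deposit_rate, Exchange_rate, Private_sector_credit, Unemployment, Interest_rate_spread, Real_interest_rate, Risk_premium, and Years to zero. The refit below excludes Deposit_rate, Private_sector_credit, Unemployment, and Years from that set, and additionally excludes Non_performing_loans, which LASSO retains but which is treated here as a potentially leaky predictor.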
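The LASSO selection and the removal list do not coincide exactly, which a short set comparison makes explicit. The sketch below hard-codes the coefficients from the printed output; with the fitted object available, a named vector of the same shape could be obtained from as.matrix(coef(cv_lasso))[, 1].

```r
# Coefficients copied from the coef(cv_lasso) output above (intercept omitted)
lasso_coefs <- c(
  GDP_per_capita_growth = -0.36794816, Inflation = 0.06454987,
  Lending_rate = 0, Deposit_rate = 0,
  Current_Account_GDP_Perc = -0.08727535, Exchange_rate = 0,
  Non_performing_loans = 0.29385562, Private_sector_credit = 0,
  Unemployment = 0, Interest_rate_spread = 0,
  Real_interest_rate = 0, Risk_premium = 0, Years = 0
)

zeroed   <- names(lasso_coefs)[lasso_coefs == 0]
retained <- names(lasso_coefs)[lasso_coefs != 0]

# Columns dropped before refitting the Random Forest
remove_cols <- c("Deposit_rate", "Private_sector_credit", "Unemployment",
                 "Non_performing_loans", "Years")

removed_but_retained_by_lasso <- setdiff(remove_cols, zeroed)
zeroed_but_still_used         <- setdiff(zeroed, remove_cols)

print(removed_but_retained_by_lasso)
print(zeroed_but_still_used)
```

The comparison shows that Non_performing_loans is removed even though LASSO keeps it, while five variables that LASSO zeroes out remain in the refit, so the refit is better read as a leakage check than as a strict LASSO-selected model.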
# Columns to remove
remove_cols <- c(
"Deposit_rate",
"Private_sector_credit",
"Unemployment",
"Non_performing_loans",
"Years"
)
# Safely remove columns that exist in the data
train_fixed <- train_data %>%
dplyr::select(-dplyr::any_of(remove_cols))
test_fixed <- test_data %>%
dplyr::select(-dplyr::any_of(remove_cols))
# High_Risk is not in remove_cols, so it is still present in both sets
levels(train_data$High_Risk)
## [1] "0" "1"
# caret needs factor levels that are valid R names when classProbs = TRUE,
# so relabel "0"/"1" as "X0"/"X1". The level order is fixed explicitly
# (X1 first, so high risk is treated as the event class) instead of being
# taken from unique(), which depends on row order.
train_fixed$High_Risk <- factor(train_data$High_Risk,
levels = c("1", "0"),
labels = c("X1", "X0"))
test_fixed$High_Risk <- factor(test_data$High_Risk,
levels = c("1", "0"),
labels = c("X1", "X0"))
ctrl <- trainControl(
method = "cv", # k-fold cross-validation
number = 5, # 5 folds
classProbs = TRUE, # needed for ROC metric
summaryFunction = twoClassSummary, # computes ROC, Sensitivity, Specificity
savePredictions = TRUE
)
# train
rf_fixed <- train(
High_Risk ~ .,
data = train_fixed,
method = "rf",
trControl = ctrl,
metric = "ROC",
tuneLength = 3
)
pred_fixed <- predict(rf_fixed, newdata = test_fixed)
cm_fixed <- confusionMatrix(pred_fixed, test_fixed$High_Risk)
print(cm_fixed)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X1 X0
## X1 205 4
## X0 4 205
##
## Accuracy : 0.9809
## 95% CI : (0.9626, 0.9917)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9617
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9809
## Specificity : 0.9809
## Pos Pred Value : 0.9809
## Neg Pred Value : 0.9809
## Prevalence : 0.5000
## Detection Rate : 0.4904
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9809
##
## 'Positive' Class : X1
##
cat("Accuracy WITHOUT other columns:", round(cm_fixed$overall["Accuracy"], 3), "\n")
## Accuracy WITHOUT other columns: 0.981
# Variable Importance
rf_importance <- varImp(rf_fixed, scale = FALSE)$importance
rf_importance_df <- rf_importance %>%
rownames_to_column("Variable") %>%
arrange(desc(Overall))
print(rf_importance_df)
## Variable Overall
## 1 GDP_per_capita_growth 137.88716
## 2 Current_Account_GDP_Perc 89.13168
## 3 Lending_rate 58.89313
## 4 Interest_rate_spread 45.89394
## 5 Inflation 44.12807
## 6 Exchange_rate 43.17681
## 7 Risk_premium 34.65580
## 8 Real_interest_rate 33.70932
Random Forest Model Without the Removed Variables
Performance Metrics:
Accuracy: 98.09%
Balanced Accuracy: 98.09%
Sensitivity (True Positive Rate): 98.09%
Specificity (True Negative Rate): 98.09%
Kappa: 0.96 (indicating excellent agreement beyond chance)
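These figures can be verified by hand from the four cells of the confusion matrix printed above. The sketch below recomputes accuracy, sensitivity, specificity, balanced accuracy, and Cohen's kappa, treating X1 (high risk) as the positive class as in the output.

```r
# Cells copied from the confusion matrix above (positive class = X1)
tp <- 205   # predicted X1, actual X1
fp <- 4     # predicted X1, actual X0
fn <- 4     # predicted X0, actual X1
tn <- 205   # predicted X0, actual X0
n  <- tp + fp + fn + tn

accuracy          <- (tp + tn) / n
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
p_expected <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kappa      <- (accuracy - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 4))
```

Because the test set is perfectly balanced and the errors are symmetric, accuracy, sensitivity, specificity, and balanced accuracy all coincide at 98.09%.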
GDP_per_capita_growth (137.9)
The most important predictor. The logistic regression above indicates the direction: countries with low or negative GDP growth are more likely to be high-risk.
Current_Account_GDP_Perc (89.1)
Strong predictor: countries with large deficits or imbalances face higher risk.
Lending_rate (58.9)
Higher lending rates signal potential financial stress, contributing to risk classification.
Interest_rate_spread (45.9)
Reflects banking-sector margins; wider spreads may correlate with high-risk lending environments.
Inflation (44.1)
High inflation increases economic uncertainty, affecting risk levels.
Exchange_rate (43.2)
Exchange rate movements or depreciation are associated with higher risk.
Risk_premium (34.7)
Market-required compensation for risk contributes moderately.
Real_interest_rate (33.7)
Higher real rates indicate tighter financial conditions, influencing risk.
The model still achieves very high accuracy without using potentially problematic predictors such as non-performing loans or private credit, so its performance does not depend on those variables.
GDP growth and current account balances remain the most critical macroeconomic indicators for systemic or country-level financial risk.
Other factors, including lending rates, inflation, interest spreads, and exchange rate movements, also play meaningful roles, reflecting the multidimensional nature of risk.
MODEL 3: XGBOOST
# Remove variables from train_x and test_x
vars_to_remove <- c("Deposit_rate", "Private_sector_credit",
"Unemployment", "Non_performing_loans", "Years")
# Remove from train_x
train_x <- train_x[, !colnames(train_x) %in% vars_to_remove]
# Remove from test_x
test_x <- test_x[, !colnames(test_x) %in% vars_to_remove]
# Set up cross-validation
xgb_cv <- xgb.cv(
data = train_x, # This now has the variables removed
label = train_y,
nrounds = 200,
nfold = 5,
objective = "binary:logistic",
eval_metric = "logloss",
max_depth = 6,
eta = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
early_stopping_rounds = 10,
verbose = 0
)
best_nrounds <- xgb_cv$best_iteration
cat("Optimal number of rounds:", best_nrounds, "\n")
## Optimal number of rounds: 200
# Train final XGBoost model
xgb_model <- xgboost(
data = train_x, # This now has the variables removed
label = train_y,
nrounds = best_nrounds,
objective = "binary:logistic",
eval_metric = "logloss",
max_depth = 6,
eta = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
verbose = 0
)
# Predictions
xgb_prob <- predict(xgb_model, newdata = test_x) # This now has the variables removed
xgb_pred <- ifelse(xgb_prob > 0.5, "1", "0")
# Evaluate
xgb_results <- evaluate_model(xgb_pred, xgb_prob,
test_data$High_Risk, "XGBoost")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("\nPerformance Metrics:\n")
##
## Performance Metrics:
print(xgb_results$metrics)
## Model Accuracy Sensitivity Specificity Precision F1_Score AUC
## Accuracy XGBoost 0.9569 0.9617 0.9522 0.9526 0.9571 0.9943
## Kappa
## Accuracy 0.9139
# Feature Importance - colnames(train_x) will automatically exclude removed variables
xgb_importance <- xgb.importance(
feature_names = colnames(train_x), # Will only include remaining variables
model = xgb_model
)
cat("\nTop 8 Important Variables:\n")
##
## Top 8 Important Variables:
# Only 8 features remain, so use head() rather than [1:10, ], which pads the table with NA rows
print(head(xgb_importance, 8))
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: GDP_per_capita_growth 0.31034841 0.20650509 0.14517218
## 2: Current_Account_GDP_Perc 0.22675633 0.19683701 0.16779203
## 3: Exchange_rate 0.09890131 0.11886928 0.16644159
## 4: Lending_rate 0.09758106 0.11294281 0.12018906
## 5: Inflation 0.08716333 0.12394475 0.12896691
## 6: Interest_rate_spread 0.08242607 0.10927052 0.09520594
## 7: Risk_premium 0.05087823 0.07424621 0.09723160
## 8: Real_interest_rate 0.04594525 0.05738434 0.07900068
Performance Metrics:
Accuracy: 95.7%. The model correctly classifies high-risk and low-risk cases in almost 96% of test observations.
Sensitivity (True Positive Rate): 96.2%. The model correctly identifies 96.2% of the actual high-risk cases.
Specificity (True Negative Rate): 95.2%. The model correctly identifies 95.2% of the low-risk cases.
Precision: 95.3%. Of all cases predicted as high-risk, 95.3% were truly high-risk.
F1 Score: 95.7%. Indicates excellent balance between precision and recall.
AUC: 0.994. Near-perfect discrimination between high-risk and low-risk classes.
GDP_per_capita_growth (Gain 0.310): The most influential predictor; countries with low or negative GDP growth are more likely to be high-risk.
Current_Account_GDP_Perc (Gain 0.227): A major driver; large current account deficits or imbalances increase risk.
Exchange_rate (Gain 0.099): Exchange rate movements contribute noticeably to country risk.
Lending_rate (Gain 0.098): Higher lending rates are associated with higher risk levels.
Inflation (Gain 0.087): High inflation increases financial instability.
Interest_rate_spread (Gain 0.082): Wider spreads in the banking sector indicate potential economic stress.
Risk_premium (Gain 0.051): Market perception of country risk moderately affects classification.
Real_interest_rate (Gain 0.046): Tighter real rates have a smaller but non-negligible effect on risk.
GDP growth and current account balances remain the strongest predictors of country risk, consistent with economic intuition.
Financial variables like exchange rates, lending rates, and inflation also play significant roles, showing XGBoost captures both macroeconomic and financial dimensions of risk.
The model is highly accurate, with near-perfect discrimination (AUC = 0.994), indicating it can reliably separate high-risk from low-risk cases in this test set.
Like Random Forest, XGBoost reports variable importance, but it breaks it down into Gain, Cover, and Frequency, which helps explain which features contribute most to risk assessment.
Conclusion:
This project demonstrates that macroeconomic and financial indicators, particularly GDP per capita growth, current account balance, and lending rates, are strong predictors of economic risk. Through a combination of statistical analyses (ANOVA and regression) and machine learning models (Logistic Regression, Random Forest, XGBoost), we achieved high predictive accuracy while maintaining interpretability. Random Forest captured complex interactions with 98% accuracy, XGBoost provided robust predictions at nearly 96%, and Logistic Regression (85.7% accuracy) offered clear insights into how individual factors affect risk. Importantly, even after removing potentially leaky variables, the models remained highly accurate, which supports their robustness. This framework provides a data-driven, actionable approach for early identification of high-risk economies, making it valuable for policymakers, investors, and financial institutions.