Introduction

Banking crises can shake entire economies, leading to widespread financial trouble and uncertainty. By understanding the economic warning signs, such as rising credit growth or inflation, we can better anticipate when a banking crisis might happen. In this project, we’ll explore a range of economic factors to see which ones are most linked to banking crises. Using modern machine learning models, we’ll aim to create a tool that can predict the likelihood of a crisis and give early warnings to help financial institutions and policymakers make informed decisions.

Objectives

Understand Key Economic Factors: We’ll start by diving into various economic indicators—like credit growth, GDP growth, and inflation—to see how they influence the risk of banking crises.

Build Predictive Models: Using machine learning techniques like Random Forest and XGBoost, we’ll create models that can predict the chances of a crisis based on these economic signals.

Improve Model Accuracy: We’ll fine-tune and test different models to make sure they’re as accurate as possible, comparing their performance using metrics like error rates and accuracy.

Handle Imbalanced Data: Because crises are rare compared with non-crisis years, our data are likely to be imbalanced. We’ll use methods like oversampling and undersampling so that our models can still make reliable predictions for the rare crisis class.

Draw Conclusions and Provide Insights from Machine Learning Models: Finally, we’ll highlight which economic factors are the strongest early predictors of banking crises and offer practical insights into how they can be used to give early warnings.
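
The resampling idea named in the objectives can be sketched in base R. This is only an illustration on a made-up data frame: `toy`, `credit_growth`, and `crisis` are hypothetical names, not columns of the project data. It randomly duplicates minority-class rows until both classes are the same size. caret’s `upSample()` and `downSample()` provide the same idea as ready-made helpers.

```r
set.seed(42)  # make the random resampling reproducible

# Hypothetical toy data: 95 non-crisis rows and 5 crisis rows
toy <- data.frame(
  credit_growth = rnorm(100),
  crisis = factor(c(rep("no", 95), rep("yes", 5)))
)

# Random oversampling: draw minority rows with replacement
# until the minority class is as large as the majority class
counts <- table(toy$crisis)
minority <- names(counts)[which.min(counts)]
minority_rows <- which(toy$crisis == minority)
n_extra <- max(counts) - min(counts)
extra_rows <- sample(minority_rows, size = n_extra, replace = TRUE)
balanced <- toy[c(seq_len(nrow(toy)), extra_rows), ]

table(balanced$crisis)
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows never end up in the data used for evaluation.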

library(tidyverse)     # data manipulation
library(caret)         # statistics and machine learning
library(randomForest)  # fitting random forest models
# import the dataset
The_Determinants_of_Systemic_Banking_Crisis <- read_csv("The_Determinants_of_Systemic_Banking_Crisis.csv")

# give the dataset a shorter name
banking_crisis <- The_Determinants_of_Systemic_Banking_Crisis
# preview the first rows, the last rows, and a random sample
head(banking_crisis, 8)
## # A tibble: 8 × 100
##      id country countrycode  year rr_full rr_ini rr_cris rr_mult lv_full lv_ini
##   <dbl> <chr>   <chr>       <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Algeria DZA          1982       0      0       0       0       0      0
## 2     1 Algeria DZA          1983       0      0       0       0       0      0
## 3     1 Algeria DZA          1984       0      0       0       0       0      0
## 4     1 Algeria DZA          1985       0      0       0       0       0      0
## 5     1 Algeria DZA          1986       0      0       0       0       0      0
## 6     1 Algeria DZA          1987       0      0       0       0       0      0
## 7     1 Algeria DZA          1988       0      0       0       0       0      0
## 8     1 Algeria DZA          1989       0      0       0       0       0      0
## # ℹ 90 more variables: lv_cris <dbl>, lv_mult <dbl>, ck_full <dbl>,
## #   ck_ini <dbl>, ck_cris <dbl>, ck_mult <dbl>, dd_full <dbl>, dd_ini <dbl>,
## #   dd_cris <dbl>, dd_mult <dbl>, rgdpgr <dbl>, wrgdpgr <dbl>, frgdpgr <dbl>,
## #   wfrgdpgr <dbl>, lngdppc <dbl>, wlngdppc <dbl>, flngdppc <dbl>,
## #   wflngdppc <dbl>, deprec <dbl>, wdeprec <dbl>, fdeprec <dbl>,
## #   wfdeprec <dbl>, m2res <dbl>, wm2res <dbl>, fm2res <dbl>, wfm2res <dbl>,
## #   wrir <dbl>, rir <dbl>, frir <dbl>, wfrir <dbl>, totch <dbl>, …
tail(banking_crisis, 8)
## # A tibble: 8 × 100
##      id country countrycode  year rr_full rr_ini rr_cris rr_mult lv_full lv_ini
##   <dbl> <chr>   <chr>       <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1    92 Zambia  ZMB          2003       0      0       0       0       0      0
## 2    92 Zambia  ZMB          2004       0      0       0       0       0      0
## 3    92 Zambia  ZMB          2005       0      0       0       0       0      0
## 4    92 Zambia  ZMB          2006       0      0       0       0       0      0
## 5    92 Zambia  ZMB          2007       0      0       0       0       0      0
## 6    92 Zambia  ZMB          2008       0      0       0       0       0      0
## 7    92 Zambia  ZMB          2009       0      0       0       0       0      0
## 8    92 Zambia  ZMB          2010       0      0       0       0       0      0
## # ℹ 90 more variables: lv_cris <dbl>, lv_mult <dbl>, ck_full <dbl>,
## #   ck_ini <dbl>, ck_cris <dbl>, ck_mult <dbl>, dd_full <dbl>, dd_ini <dbl>,
## #   dd_cris <dbl>, dd_mult <dbl>, rgdpgr <dbl>, wrgdpgr <dbl>, frgdpgr <dbl>,
## #   wfrgdpgr <dbl>, lngdppc <dbl>, wlngdppc <dbl>, flngdppc <dbl>,
## #   wflngdppc <dbl>, deprec <dbl>, wdeprec <dbl>, fdeprec <dbl>,
## #   wfdeprec <dbl>, m2res <dbl>, wm2res <dbl>, fm2res <dbl>, wfm2res <dbl>,
## #   wrir <dbl>, rir <dbl>, frir <dbl>, wfrir <dbl>, totch <dbl>, …
sample_n(banking_crisis, 8)
## # A tibble: 8 × 100
##      id country  countrycode  year rr_full rr_ini rr_cris rr_mult lv_full lv_ini
##   <dbl> <chr>    <chr>       <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1    45 India    IND          2005       0      0       0       0       0      0
## 2    41 Greece   GRC          1988       0      0       0       0       0      0
## 3    18 Cameroon CMR          1988       1      0      NA       2       1      0
## 4    77 Sri Lan… LKA          2008       0      0       0       0       0      0
## 5    32 Ecuador  ECU          1997       0      0       0       0       0      0
## 6    82 Thailand THA          1999       1      0      NA       2       1      0
## 7    26 Congo, … COG          2000       1      0      NA       2       0      0
## 8    79 Sweden   SWE          2005       0      0       0       0       0      0
## # ℹ 90 more variables: lv_cris <dbl>, lv_mult <dbl>, ck_full <dbl>,
## #   ck_ini <dbl>, ck_cris <dbl>, ck_mult <dbl>, dd_full <dbl>, dd_ini <dbl>,
## #   dd_cris <dbl>, dd_mult <dbl>, rgdpgr <dbl>, wrgdpgr <dbl>, frgdpgr <dbl>,
## #   wfrgdpgr <dbl>, lngdppc <dbl>, wlngdppc <dbl>, flngdppc <dbl>,
## #   wflngdppc <dbl>, deprec <dbl>, wdeprec <dbl>, fdeprec <dbl>,
## #   wfdeprec <dbl>, m2res <dbl>, wm2res <dbl>, fm2res <dbl>, wfm2res <dbl>,
## #   wrir <dbl>, rir <dbl>, frir <dbl>, wfrir <dbl>, totch <dbl>, …

SUMMARY OF THE DATASET

When predicting banking crises, several economic, financial, and structural factors are typically influential.

1. Macroeconomic Indicators

GDP Growth (rgdpgr, frgdpgr, wrgdpgr, etc.): A decline or slowdown in GDP growth can signal economic stress, increasing the likelihood of a banking crisis.

Inflation Rate (infl, finfl, winfl, etc.): High inflation erodes purchasing power and can undermine the stability of the financial system, contributing to crises.

Exchange Rate Depreciation (deprec, fdeprec, etc.): Large depreciations can affect a country’s ability to repay foreign debts, leading to banking instability.

Interest Rates (rir, frir, wrir, etc.): Rising interest rates can increase borrowing costs and the likelihood of defaults, putting pressure on the banking sector.

2. Financial Sector Variables

Credit Growth (credgr, fcredgr, etc.): Rapid credit growth, especially if unsustainable, can signal a buildup of financial imbalances, increasing the risk of a crisis.

Banking Leverage: High levels of leverage in the banking system increase vulnerability during financial shocks. Note that the lv_ columns (lv_full, lv_cris, etc.) range from 0 to 1 with a median of 0 in the summary statistics, so they appear to be crisis indicators rather than a leverage ratio.

Liquidity Levels (liq, fliq, etc.): Insufficient liquidity or a liquidity crunch can lead to bank failures or leave banks unable to meet withdrawal demands.

3. External Sector Vulnerabilities

Current Account Balance (cab, wcab, etc.): Persistent current account deficits can indicate external imbalances and dependence on foreign capital, which can trigger crises. No column with these names appears in the summary statistics output.

Net Foreign Assets (nfagdp, wnfagdp, etc.): Negative foreign asset positions can reflect external debt burdens, which make economies vulnerable to crises.

Terms of Trade (totch, wtotch, etc.): A significant deterioration in the terms of trade can hurt export earnings and the country’s balance of payments, leading to instability.

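4. Banking System Health

Non-performing Loans (NPLs) (npl, fnpl, etc.): An increase in NPLs indicates that borrowers are unable to repay loans, a direct precursor to banking crises. No column with these names appears in the summary statistics output.

Capital Adequacy: Lower capital adequacy ratios make banks more vulnerable to shocks, as they have less of a buffer to absorb losses. Note that the ck_ columns (ck_full, ck_cris, etc.) range from 0 to 1 with a median of 0 in the summary statistics, so they appear to be crisis indicators rather than capital ratios.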

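5. Global Factors

Global Economic Growth: A global slowdown or recession can affect domestic banking systems, especially in countries with significant external trade or financial links. The columns cited for it (wrgdpgr, wfrgdpgr) share their quartiles with rgdpgr in the summary statistics and differ only in their minimum and maximum, so they look like versions of the country-level series with the extremes capped rather than world aggregates.

Global Financial Conditions (fir, wliq): Global financial tightening, such as rising interest rates or reduced liquidity, can put pressure on banking systems, especially in emerging markets.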
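
As a quick sanity check on the variable names listed above, the following sketch compares candidate names against the columns of a data frame. The data frame `toy` is a hypothetical stand-in that borrows only a few column names from the printed output, and the grouping into themes is illustrative. With the real data, `names(banking_crisis)` would take the place of `names(toy)`.

```r
# Hypothetical stand-in with a handful of the column names shown in the output
toy <- data.frame(rgdpgr = 1, infl = 2, rir = 3, credgr = 4, liq = 5, nfagdp = 6)

# Candidate names grouped by theme (the grouping itself is illustrative)
candidates <- list(
  macro     = c("rgdpgr", "infl", "deprec", "rir"),
  financial = c("credgr", "liq"),
  external  = c("cab", "nfagdp", "totch")
)

# For each theme, split the candidates into present and missing columns
check <- lapply(candidates, function(vars) {
  list(present = intersect(vars, names(toy)),
       missing = setdiff(vars, names(toy)))
})

check$external$missing
```

Any name reported as missing either is spelled differently in the file or is not part of the dataset at all.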

CLEAN THE DATASET

# count the missing values across all columns
sum(is.na(banking_crisis))
## [1] 7281
# remove every row that has a missing value in any column
banking_crisis <- na.omit(banking_crisis)
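
Because `na.omit()` drops a row when any one of its columns is missing, it can discard many observations even if only a few columns are incomplete. A minimal sketch, on a hypothetical toy data frame (`toy`, `extra`, and `model_vars` are illustrative names), of inspecting missingness per column and then filtering only on the columns a model would use:

```r
# Hypothetical toy data: `extra` is mostly missing, the modelling columns are not
toy <- data.frame(
  cris   = c(0, 0, 1, 0, 1),
  credgr = c(2.1, NA, 5.3, 0.4, 7.9),
  extra  = c(NA, NA, NA, 1, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# na.omit() keeps only rows that are complete in every column
nrow(na.omit(toy))

# Keep rows that are complete in the columns of interest only
model_vars <- c("cris", "credgr")
toy_model <- toy[complete.cases(toy[, model_vars]), model_vars]
nrow(toy_model)
```

Whether restricting the check to a subset of columns is appropriate depends on which variables the final models actually use.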


# count distinct rows; this equals the row count when there are no duplicates
n_distinct(banking_crisis)
## [1] 1205
# remove duplicate rows
banking_crisis <- distinct(banking_crisis)
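
`n_distinct()` on a data frame returns the number of distinct rows, so duplicates show up only as a gap between that number and the total row count. A small base R sketch on hypothetical toy data counts the duplicates directly:

```r
# Hypothetical toy data with one repeated row (rows 2 and 3 are identical)
toy <- data.frame(id = c(1, 2, 2, 3), year = c(1990, 1991, 1991, 1992))

# duplicated() flags every repeat of an earlier row
n_duplicates <- sum(duplicated(toy))
n_duplicates

# unique() drops the repeats, much like dplyr::distinct()
toy_unique <- unique(toy)
nrow(toy_unique)
```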
# summary statistics
summary(banking_crisis)
##        id          country          countrycode             year     
##  Min.   : 1.00   Length:1205        Length:1205        Min.   :1982  
##  1st Qu.:26.00   Class :character   Class :character   1st Qu.:1987  
##  Median :49.00   Mode  :character   Mode  :character   Median :1994  
##  Mean   :48.96                                         Mean   :1994  
##  3rd Qu.:75.00                                         3rd Qu.:2001  
##  Max.   :92.00                                         Max.   :2005  
##     rr_full            rr_ini           rr_cris           rr_mult       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.06224   Mean   :0.06224   Mean   :0.06224   Mean   :0.06224  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##     lv_full            lv_ini           lv_cris           lv_mult       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.03651   Mean   :0.03651   Mean   :0.03651   Mean   :0.03651  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##     ck_full            ck_ini           ck_cris           ck_mult       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.03402   Mean   :0.03402   Mean   :0.03402   Mean   :0.03402  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##     dd_full            dd_ini           dd_cris           dd_mult       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.04564   Mean   :0.04564   Mean   :0.04564   Mean   :0.04564  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##      rgdpgr           wrgdpgr          frgdpgr           wfrgdpgr     
##  Min.   :-13.130   Min.   :-9.760   Min.   :-13.130   Min.   :-8.130  
##  1st Qu.:  1.620   1st Qu.: 1.620   1st Qu.:  1.620   1st Qu.: 1.620  
##  Median :  3.670   Median : 3.670   Median :  3.670   Median : 3.670  
##  Mean   :  3.686   Mean   : 3.656   Mean   :  3.686   Mean   : 3.642  
##  3rd Qu.:  5.550   3rd Qu.: 5.550   3rd Qu.:  5.550   3rd Qu.: 5.550  
##  Max.   : 33.630   Max.   :16.730   Max.   : 33.630   Max.   :14.000  
##     lngdppc         wlngdppc        flngdppc       wflngdppc    
##  Min.   : 4.86   Min.   : 4.86   Min.   : 4.86   Min.   : 4.88  
##  1st Qu.: 6.16   1st Qu.: 6.16   1st Qu.: 6.16   1st Qu.: 6.16  
##  Median : 7.86   Median : 7.86   Median : 7.86   Median : 7.86  
##  Mean   : 7.84   Mean   : 7.84   Mean   : 7.84   Mean   : 7.84  
##  3rd Qu.: 9.70   3rd Qu.: 9.70   3rd Qu.: 9.70   3rd Qu.: 9.70  
##  Max.   :10.59   Max.   :10.55   Max.   :10.59   Max.   :10.53  
##      deprec            wdeprec          fdeprec            wfdeprec     
##  Min.   :  -29.35   Min.   :-22.92   Min.   :  -29.35   Min.   :-21.56  
##  1st Qu.:   -0.15   1st Qu.: -0.15   1st Qu.:   -0.15   1st Qu.: -0.15  
##  Median :    4.17   Median :  4.17   Median :    4.17   Median :  4.17  
##  Mean   :   26.88   Mean   : 13.24   Mean   :   26.88   Mean   : 12.48  
##  3rd Qu.:   14.10   3rd Qu.: 14.10   3rd Qu.:   14.10   3rd Qu.: 14.10  
##  Max.   :13931.90   Max.   :302.05   Max.   :13931.90   Max.   :215.89  
##      m2res            wm2res          fm2res          wfm2res     
##  Min.   :  0.68   Min.   : 0.68   Min.   :  0.68   Min.   : 0.68  
##  1st Qu.:  2.78   1st Qu.: 2.78   1st Qu.:  2.78   1st Qu.: 2.78  
##  Median :  5.14   Median : 5.14   Median :  5.14   Median : 5.14  
##  Mean   : 10.65   Mean   :10.48   Mean   : 10.65   Mean   :10.34  
##  3rd Qu.: 11.00   3rd Qu.:11.00   3rd Qu.: 11.00   3rd Qu.:11.00  
##  Max.   :148.31   Max.   :99.05   Max.   :148.31   Max.   :82.85  
##       wrir              rir                frir              wfrir        
##  Min.   :-69.440   Min.   :-209.890   Min.   :-209.890   Min.   :-51.270  
##  1st Qu.: -0.010   1st Qu.:  -0.010   1st Qu.:  -0.010   1st Qu.: -0.010  
##  Median :  3.610   Median :   3.610   Median :   3.610   Median :  3.610  
##  Mean   :  2.589   Mean   :   6.127   Mean   :   6.127   Mean   :  2.805  
##  3rd Qu.:  7.440   3rd Qu.:   7.440   3rd Qu.:   7.440   3rd Qu.:  7.440  
##  Max.   : 47.060   Max.   :4635.860   Max.   :4635.860   Max.   : 40.180  
##      totch             wtotch            ftotch           wftotch      
##  Min.   :-73.450   Min.   :-51.940   Min.   :-73.450   Min.   :-43.01  
##  1st Qu.: -5.020   1st Qu.: -5.020   1st Qu.: -5.020   1st Qu.: -5.02  
##  Median :  0.000   Median :  0.000   Median :  0.000   Median :  0.00  
##  Mean   :  1.968   Mean   :  1.212   Mean   :  1.968   Mean   :  1.07  
##  3rd Qu.:  4.310   3rd Qu.:  4.310   3rd Qu.:  4.310   3rd Qu.:  4.31  
##  Max.   :570.360   Max.   : 98.020   Max.   :570.360   Max.   : 74.18  
##       infl              winfl            finfl              wfinfl      
##  Min.   :  -29.17   Min.   : -9.82   Min.   :  -29.17   Min.   : -7.73  
##  1st Qu.:    2.39   1st Qu.:  2.39   1st Qu.:    2.39   1st Qu.:  2.39  
##  Median :    5.59   Median :  5.59   Median :    5.59   Median :  5.59  
##  Mean   :   27.00   Mean   : 13.65   Mean   :   27.00   Mean   : 12.98  
##  3rd Qu.:   11.59   3rd Qu.: 11.59   3rd Qu.:   11.59   3rd Qu.: 11.59  
##  Max.   :12338.70   Max.   :204.10   Max.   :12338.70   Max.   :149.87  
##       liq              wliq             fliq            wfliq       
##  Min.   : 15.93   Min.   : 16.67   Min.   : 15.93   Min.   : 18.02  
##  1st Qu.: 75.54   1st Qu.: 75.54   1st Qu.: 75.54   1st Qu.: 75.54  
##  Median :100.16   Median :100.16   Median :100.16   Median :100.16  
##  Mean   :105.73   Mean   :105.18   Mean   :105.73   Mean   :105.12  
##  3rd Qu.:128.23   3rd Qu.:128.23   3rd Qu.:128.23   3rd Qu.:128.23  
##  Max.   :429.56   Max.   :305.28   Max.   :429.56   Max.   :298.94  
##      credgr           wcredgr           fcredgr           wfcredgr      
##  Min.   :-70.070   Min.   :-47.920   Min.   :-70.070   Min.   :-36.760  
##  1st Qu.: -2.680   1st Qu.: -2.680   1st Qu.: -2.680   1st Qu.: -2.680  
##  Median :  2.380   Median :  2.380   Median :  2.380   Median :  2.380  
##  Mean   :  3.019   Mean   :  2.813   Mean   :  3.019   Mean   :  2.747  
##  3rd Qu.:  7.400   3rd Qu.:  7.400   3rd Qu.:  7.400   3rd Qu.:  7.400  
##  Max.   :153.260   Max.   : 64.080   Max.   :153.260   Max.   : 49.710  
##      nfagdp            wnfagdp           fnfagdp            wfnfagdp      
##  Min.   :-128.220   Min.   :-52.660   Min.   :-128.220   Min.   :-43.450  
##  1st Qu.:  -2.800   1st Qu.: -2.800   1st Qu.:  -2.800   1st Qu.: -2.800  
##  Median :   4.440   Median :  4.440   Median :   4.440   Median :  4.440  
##  Mean   :   4.921   Mean   :  5.022   Mean   :   4.921   Mean   :  5.054  
##  3rd Qu.:  11.810   3rd Qu.: 11.810   3rd Qu.:  11.810   3rd Qu.: 11.810  
##  Max.   :  91.280   Max.   : 91.280   Max.   :  91.280   Max.   : 89.960  
##      var61           var62          var63             durfin      
##  Min.   : 1.00   Min.   :1982   Min.   :0.00000   Min.   : 1.000  
##  1st Qu.:26.00   1st Qu.:1987   1st Qu.:0.00000   1st Qu.: 3.000  
##  Median :49.00   Median :1994   Median :0.00000   Median : 4.000  
##  Mean   :48.96   Mean   :1994   Mean   :0.06224   Mean   : 5.349  
##  3rd Qu.:75.00   3rd Qu.:2001   3rd Qu.:0.00000   3rd Qu.: 8.000  
##  Max.   :92.00   Max.   :2005   Max.   :1.00000   Max.   :14.000  
##       sub             anni           _merge          duratamediaperpaese
##  Min.   :1.000   Min.   : 1.000   Length:1205        Min.   :0.03448    
##  1st Qu.:2.000   1st Qu.: 3.000   Class :character   1st Qu.:0.10345    
##  Median :3.000   Median : 4.000   Mode  :character   Median :0.13793    
##  Mean   :2.716   Mean   : 5.349                      Mean   :0.18443    
##  3rd Qu.:3.000   3rd Qu.: 8.000                      3rd Qu.:0.27586    
##  Max.   :4.000   Max.   :14.000                      Max.   :0.48276    
##    _est_spec2   _est_spec3      pra                 prb          
##  Min.   :1    Min.   :1    Min.   :0.0006858   Min.   :0.006388  
##  1st Qu.:1    1st Qu.:1    1st Qu.:0.8160086   1st Qu.:0.026949  
##  Median :1    Median :1    Median :0.8567500   Median :0.033813  
##  Mean   :1    Mean   :1    Mean   :0.8387779   Mean   :0.039087  
##  3rd Qu.:1    3rd Qu.:1    3rd Qu.:0.8857065   3rd Qu.:0.044282  
##  Max.   :1    Max.   :1    Max.   :0.9684955   Max.   :0.973115  
##       prc               dur         dur1             cris        
##  Min.   :0.02502   Min.   :0   Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:0.08612   1st Qu.:0   1st Qu.: 2.000   1st Qu.:0.00000  
##  Median :0.10934   Median :0   Median : 3.000   Median :0.00000  
##  Mean   :0.12213   Mean   :0   Mean   : 3.918   Mean   :0.06224  
##  3rd Qu.:0.14049   3rd Qu.:0   3rd Qu.: 6.000   3rd Qu.:0.00000  
##  Max.   :0.49858   Max.   :0   Max.   :13.000   Max.   :1.00000  
##      cris1           avedur            xxx          duration        _est_spec1
##  Min.   :1.000   Min.   : 0.500   Min.   : 697   Min.   : 1.000   Min.   :1   
##  1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:1149   1st Qu.: 3.000   1st Qu.:1   
##  Median :1.000   Median : 3.000   Median :1575   Median : 4.000   Median :1   
##  Mean   :1.431   Mean   : 4.009   Mean   :1603   Mean   : 5.359   Mean   :1   
##  3rd Qu.:2.000   3rd Qu.: 5.000   3rd Qu.:2064   3rd Qu.: 8.000   3rd Qu.:1   
##  Max.   :4.000   Max.   :14.000   Max.   :2663   Max.   :14.000   Max.   :1   
##     _est_all  _est_partial      pr0              pr1          
##  Min.   :1   Min.   :1     Min.   :0.1835   Min.   :0.003124  
##  1st Qu.:1   1st Qu.:1     1st Qu.:0.8554   1st Qu.:0.018314  
##  Median :1   Median :1     Median :0.8943   Median :0.025199  
##  Mean   :1   Mean   :1     Mean   :0.8714   Mean   :0.031051  
##  3rd Qu.:1   3rd Qu.:1     3rd Qu.:0.9178   3rd Qu.:0.033714  
##  Max.   :1   Max.   :1     Max.   :0.9854   Max.   :0.691349  
##       pr2             prlogit1          prlogit2            class      
##  Min.   :0.01098   Min.   :0.00439   Min.   :0.003248   Min.   :1.000  
##  1st Qu.:0.05757   1st Qu.:0.01952   1st Qu.:0.020200   1st Qu.:1.000  
##  Median :0.07999   Median :0.02554   Median :0.027911   Median :1.000  
##  Mean   :0.09751   Mean   :0.03168   Mean   :0.035487   Mean   :1.046  
##  3rd Qu.:0.11294   3rd Qu.:0.03572   3rd Qu.:0.038132   3rd Qu.:1.000  
##  Max.   :0.58120   Max.   :0.52721   Max.   :0.733841   Max.   :2.000  
##   binaryclass          prob1            prob2            prob3       
##  Min.   :0.00000   Min.   :0.4695   Min.   :0.4200   Min.   :0.4484  
##  1st Qu.:0.00000   1st Qu.:0.4943   1st Qu.:0.4422   1st Qu.:0.4590  
##  Median :0.00000   Median :0.4998   Median :0.4920   Median :0.4708  
##  Mean   :0.04564   Mean   :0.4999   Mean   :0.4927   Mean   :0.4960  
##  3rd Qu.:0.00000   3rd Qu.:0.5058   3rd Qu.:0.5383   3rd Qu.:0.5004  
##  Max.   :1.00000   Max.   :0.5344   Max.   :0.5727   Max.   :0.8107  
##      prob4            prob5            prob6            prob7        
##  Min.   :0.4258   Min.   :0.3661   Min.   :0.3060   Min.   :0.07767  
##  1st Qu.:0.4568   1st Qu.:0.4544   1st Qu.:0.4657   1st Qu.:0.43600  
##  Median :0.4666   Median :0.4935   Median :0.4910   Median :0.48791  
##  Mean   :0.4878   Mean   :0.5007   Mean   :0.4928   Mean   :0.48557  
##  3rd Qu.:0.4852   3rd Qu.:0.5381   3rd Qu.:0.5161   3rd Qu.:0.53918  
##  Max.   :0.8399   Max.   :0.7756   Max.   :0.7131   Max.   :0.78755  
##      prob8            prob9            prob10       
##  Min.   :0.2668   Min.   :0.4258   Min.   :0.08883  
##  1st Qu.:0.4785   1st Qu.:0.4568   1st Qu.:0.37477  
##  Median :0.4948   Median :0.4666   Median :0.45415  
##  Mean   :0.4916   Mean   :0.4878   Mean   :0.46274  
##  3rd Qu.:0.5121   3rd Qu.:0.4852   3rd Qu.:0.53513  
##  Max.   :0.6546   Max.   :0.8399   Max.   :0.98872
# check that no duplicate rows remain
unique(banking_crisis)
## # A tibble: 1,205 × 100
##       id country countrycode  year rr_full rr_ini rr_cris rr_mult lv_full lv_ini
##    <dbl> <chr>   <chr>       <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
##  1     1 Algeria DZA          1982       0      0       0       0       0      0
##  2     1 Algeria DZA          1983       0      0       0       0       0      0
##  3     1 Algeria DZA          1984       0      0       0       0       0      0
##  4     1 Algeria DZA          1985       0      0       0       0       0      0
##  5     1 Algeria DZA          1986       0      0       0       0       0      0
##  6     1 Algeria DZA          1987       0      0       0       0       0      0
##  7     1 Algeria DZA          1988       0      0       0       0       0      0
##  8     1 Algeria DZA          1989       0      0       0       0       0      0
##  9     1 Algeria DZA          1990       1      1       1       1       1      1
## 10     1 Algeria DZA          1995       0      0       0       0       0      0
## # ℹ 1,195 more rows
## # ℹ 90 more variables: lv_cris <dbl>, lv_mult <dbl>, ck_full <dbl>,
## #   ck_ini <dbl>, ck_cris <dbl>, ck_mult <dbl>, dd_full <dbl>, dd_ini <dbl>,
## #   dd_cris <dbl>, dd_mult <dbl>, rgdpgr <dbl>, wrgdpgr <dbl>, frgdpgr <dbl>,
## #   wfrgdpgr <dbl>, lngdppc <dbl>, wlngdppc <dbl>, flngdppc <dbl>,
## #   wflngdppc <dbl>, deprec <dbl>, wdeprec <dbl>, fdeprec <dbl>,
## #   wfdeprec <dbl>, m2res <dbl>, wm2res <dbl>, fm2res <dbl>, wfm2res <dbl>, …
# structure of the data
str(banking_crisis)
## tibble [1,205 × 100] (S3: tbl_df/tbl/data.frame)
##  $ id                 : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ country            : chr [1:1205] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ countrycode        : chr [1:1205] "DZA" "DZA" "DZA" "DZA" ...
##  $ year               : num [1:1205] 1982 1983 1984 1985 1986 ...
##  $ rr_full            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ rr_ini             : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ rr_cris            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ rr_mult            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ lv_full            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ lv_ini             : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ lv_cris            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ lv_mult            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ ck_full            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ ck_ini             : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ ck_cris            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ ck_mult            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ dd_full            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ dd_ini             : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ dd_cris            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ dd_mult            : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ rgdpgr             : num [1:1205] 3 6.4 5.4 5.6 3.7 0.4 -0.7 -1 4.4 -0.9 ...
##  $ wrgdpgr            : num [1:1205] 3 6.4 5.4 5.6 3.7 0.4 -0.7 -1 4.4 -0.9 ...
##  $ frgdpgr            : num [1:1205] 3 6.4 5.4 5.6 3.7 0.4 -0.7 -1 4.4 -0.9 ...
##  $ wfrgdpgr           : num [1:1205] 3 6.4 5.4 5.6 3.7 0.4 -0.7 -1 4.4 -0.9 ...
##  $ lngdppc            : num [1:1205] 7.53 7.56 7.58 7.61 7.61 7.59 7.55 7.51 7.53 7.4 ...
##  $ wlngdppc           : num [1:1205] 7.53 7.56 7.58 7.61 7.61 7.59 7.55 7.51 7.53 7.4 ...
##  $ flngdppc           : num [1:1205] 7.53 7.56 7.58 7.61 7.61 7.59 7.55 7.51 7.53 7.4 ...
##  $ wflngdppc          : num [1:1205] 7.53 7.56 7.58 7.61 7.61 7.59 7.55 7.51 7.53 7.4 ...
##  $ deprec             : num [1:1205] 12.47 6.4 4.28 4.06 0.89 ...
##  $ wdeprec            : num [1:1205] 12.47 6.4 4.28 4.06 0.89 ...
##  $ fdeprec            : num [1:1205] 12.47 6.4 4.28 4.06 0.89 ...
##  $ wfdeprec           : num [1:1205] 12.47 6.4 4.28 4.06 0.89 ...
##  $ m2res              : num [1:1205] 4.28 6.04 8.64 12.27 9.59 ...
##  $ wm2res             : num [1:1205] 4.28 6.04 8.64 12.27 9.59 ...
##  $ fm2res             : num [1:1205] 4.28 6.04 8.64 12.27 9.59 ...
##  $ wfm2res            : num [1:1205] 4.28 6.04 8.64 12.27 9.59 ...
##  $ wrir               : num [1:1205] -11.35 1.06 -3.8 -5.43 -1.97 ...
##  $ rir                : num [1:1205] -11.35 1.06 -3.8 -5.43 -1.97 ...
##  $ frir               : num [1:1205] -11.35 1.06 -3.8 -5.43 -1.97 ...
##  $ wfrir              : num [1:1205] -11.35 1.06 -3.8 -5.43 -1.97 ...
##  $ totch              : num [1:1205] -3.05 -3.85 -0.25 1.87 5.02 ...
##  $ wtotch             : num [1:1205] -3.05 -3.85 -0.25 1.87 5.02 ...
##  $ ftotch             : num [1:1205] -3.05 -3.85 -0.25 1.87 5.02 ...
##  $ wftotch            : num [1:1205] -3.05 -3.85 -0.25 1.87 5.02 ...
##  $ infl               : num [1:1205] 14.35 1.94 6.8 8.43 4.97 ...
##  $ winfl              : num [1:1205] 14.35 1.94 6.8 8.43 4.97 ...
##  $ finfl              : num [1:1205] 14.35 1.94 6.8 8.43 4.97 ...
##  $ wfinfl             : num [1:1205] 14.35 1.94 6.8 8.43 4.97 ...
##  $ liq                : num [1:1205] 158 142 139 135 132 ...
##  $ wliq               : num [1:1205] 158 142 139 135 132 ...
##  $ fliq               : num [1:1205] 158 142 139 135 132 ...
##  $ wfliq              : num [1:1205] 158 142 139 135 132 ...
##  $ credgr             : num [1:1205] 8.77 19.11 5.98 0.74 2.75 ...
##  $ wcredgr            : num [1:1205] 8.77 19.11 5.98 0.74 2.75 ...
##  $ fcredgr            : num [1:1205] 8.77 19.11 5.98 0.74 2.75 ...
##  $ wfcredgr           : num [1:1205] 8.77 19.11 5.98 0.74 2.75 ...
##  $ nfagdp             : num [1:1205] 4.91 2.72 0.86 -1.79 -2.77 ...
##  $ wnfagdp            : num [1:1205] 4.91 2.72 0.86 -1.79 -2.77 ...
##  $ fnfagdp            : num [1:1205] 4.91 2.72 0.86 -1.79 -2.77 ...
##  $ wfnfagdp           : num [1:1205] 4.91 2.72 0.86 -1.79 -2.77 ...
##  $ var61              : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ var62              : num [1:1205] 1982 1983 1984 1985 1986 ...
##  $ var63              : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ durfin             : num [1:1205] 3 3 3 3 3 3 3 3 3 3 ...
##  $ sub                : num [1:1205] 2 2 2 2 2 2 2 2 2 2 ...
##  $ anni               : num [1:1205] 3 3 3 3 3 3 3 3 3 3 ...
##  $ _merge             : chr [1:1205] "matched (3)" "matched (3)" "matched (3)" "matched (3)" ...
##  $ duratamediaperpaese: num [1:1205] 0.103 0.103 0.103 0.103 0.103 ...
##  $ _est_spec2         : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ _est_spec3         : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ pra                : num [1:1205] 0.818 0.832 0.817 0.803 0.81 ...
##  $ prb                : num [1:1205] 0.0446 0.0464 0.0441 0.045 0.0443 ...
##  $ prc                : num [1:1205] 0.138 0.121 0.139 0.152 0.145 ...
##  $ dur                : num [1:1205] 0 0 0 0 0 0 0 0 0 0 ...
##  $ dur1               : num [1:1205] 2 2 2 2 2 2 2 2 2 2 ...
##  $ cris               : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ cris1              : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ avedur             : num [1:1205] 3 3 3 3 3 3 3 3 3 3 ...
##  $ xxx                : num [1:1205] 1248 1249 1250 1251 1252 ...
##  $ duration           : num [1:1205] 3 3 3 3 3 3 3 3 3 3 ...
##  $ _est_spec1         : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ _est_all           : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ _est_partial       : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ pr0                : num [1:1205] 0.906 0.915 0.904 0.896 0.881 ...
##  $ pr1                : num [1:1205] 0.0279 0.0373 0.0306 0.0305 0.0289 ...
##  $ pr2                : num [1:1205] 0.066 0.0478 0.0656 0.0736 0.0897 ...
##  $ prlogit1           : num [1:1205] 0.0361 0.039 0.033 0.0324 0.0316 ...
##  $ prlogit2           : num [1:1205] 0.032 0.0393 0.0326 0.0321 0.0322 ...
##  $ class              : num [1:1205] 1 1 1 1 1 1 1 1 2 1 ...
##  $ binaryclass        : num [1:1205] 0 0 0 0 0 0 0 0 1 0 ...
##  $ prob1              : num [1:1205] 0.502 0.492 0.495 0.494 0.5 ...
##  $ prob2              : num [1:1205] 0.501 0.5 0.5 0.499 0.499 ...
##  $ prob3              : num [1:1205] 0.467 0.475 0.489 0.507 0.493 ...
##  $ prob4              : num [1:1205] 0.494 0.455 0.47 0.475 0.465 ...
##  $ prob5              : num [1:1205] 0.585 0.559 0.555 0.549 0.544 ...
##  $ prob6              : num [1:1205] 0.523 0.574 0.509 0.483 0.493 ...
##  $ prob7              : num [1:1205] 0.485 0.5 0.513 0.532 0.539 ...
##  $ prob8              : num [1:1205] 0.428 0.483 0.461 0.454 0.47 ...
##  $ prob9              : num [1:1205] 0.494 0.455 0.47 0.475 0.465 ...
##   [list output truncated]
##  - attr(*, "na.action")= 'omit' Named int [1:1463] 10 11 12 13 25 26 27 28 29 30 ...
##   ..- attr(*, "names")= chr [1:1463] "10" "11" "12" "13" ...
# dimensions of the data
dim(banking_crisis)
## [1] 1205  100
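
The summary output shows several columns whose minimum and maximum are identical (for example `_est_spec1` and `dur`), which means they carry no information for a model. A minimal sketch, using a hypothetical toy data frame in place of the real one, of finding and dropping such constant columns:

```r
# Hypothetical toy data: `flag` never varies
toy <- data.frame(
  credgr = c(8.77, 19.11, 5.98, 0.74),
  flag   = c(1, 1, 1, 1),
  cris   = c(0, 0, 1, 0)
)

# TRUE for columns with more than one distinct value
varies <- vapply(toy, function(col) length(unique(col)) > 1, logical(1))

constant_cols <- names(toy)[!varies]
constant_cols

# Keep only the informative columns
toy_reduced <- toy[, varies, drop = FALSE]
names(toy_reduced)
```

caret’s `nearZeroVar()` performs a related, more general check that also flags columns that are almost constant.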

Target Variables

cris1

This is a multi-class variable with four levels that indicates the severity of banking crises in a country:

Level 1: The least severe level, or no crisis

Level 2: A moderate crisis level

Level 3: More severe crises

Level 4: The most severe crises

cris

This is a binary variable that indicates whether a country experienced a banking crisis in a given year:

Level 0: No banking crisis

Level 1: Banking crisis
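
Both targets are stored as numeric columns, while classification functions in R generally expect a factor. A minimal sketch, on hypothetical toy values, of converting them and checking the class balance. The level labels are illustrative choices, not names taken from the dataset.

```r
# Hypothetical toy values standing in for the cris and cris1 columns
toy <- data.frame(
  cris  = c(0, 0, 0, 1, 0, 0, 1, 0),
  cris1 = c(1, 1, 2, 3, 1, 2, 4, 1)
)

# Binary target as a factor with readable labels
toy$cris <- factor(toy$cris, levels = c(0, 1), labels = c("no_crisis", "crisis"))

# Four-level target as an ordered factor (1 = least severe, 4 = most severe)
toy$cris1 <- factor(toy$cris1, levels = 1:4,
                    labels = c("level_1", "level_2", "level_3", "level_4"),
                    ordered = TRUE)

# Share of observations in each class of the binary target
class_share <- prop.table(table(toy$cris))
class_share
```

A strongly uneven share between the two classes is the imbalance that the resampling objective is meant to address.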

# Cross-tabulation of cris1 by country, to check which countries fall into
# the higher crisis levels (level 1, 2, etc.) in the dataset
cross_table <- table(banking_crisis$country, banking_crisis$cris1)

# rename the crisis levels
colnames(cross_table) <- c("low","moderate", "high", "intesive")
cross_table
##                           
##                            low moderate high intesive
##   Algeria                   20        0    0        0
##   Argentina                  0        0   18        0
##   Australia                 21        0    0        0
##   Austria                   19        0    0        0
##   Bangladesh                15        0    0        0
##   Belgium                   21        0    0        0
##   Benin                     20        0    0        0
##   Bolivia                    0        0   14        0
##   Brazil                     0        0   15        0
##   Burkina Faso              18        0    0        0
##   Burundi                   16        0    0        0
##   Cameroon                   0       15    0        0
##   Canada                    22        0    0        0
##   Central African Republic   6        0    0        0
##   Chad                      14        0    0        0
##   Chile                     18        0    0        0
##   China                     13        0    0        0
##   Colombia                   0       14    0        0
##   Congo, Rep.               14        0    0        0
##   Costa Rica                 0       16    0        0
##   Cote d'Ivoire             20        0    0        0
##   Denmark                    0       19    0        0
##   Ecuador                   12        0    0        0
##   Egypt, Arab Rep.          17        0    0        0
##   Finland                   20        0    0        0
##   France                     0       20    0        0
##   Germany                   23        0    0        0
##   Ghana                      0       12    0        0
##   Greece                     0       20    0        0
##   Guatemala                  0       24    0        0
##   Honduras                   0       22    0        0
##   India                     17        0    0        0
##   Indonesia                  0        0   18        0
##   Ireland                   24        0    0        0
##   Italy                     19        0    0        0
##   Japan                     14        0    0        0
##   Kenya                      0       17    0        0
##   Korea, Rep.                0        0   16        0
##   Malaysia                   0       17    0        0
##   Mali                      20        0    0        0
##   Mexico                     8        0    0        0
##   Morocco                   13        0    0        0
##   Nepal                     21        0    0        0
##   Netherlands               21        0    0        0
##   Niger                     11        0    0        0
##   Nigeria                    0       20    0        0
##   Norway                    18        0    0        0
##   Panama                    18        0    0        0
##   Philippines               13        0    0        0
##   Portugal                  21        0    0        0
##   Senegal                   16        0    0        0
##   Sierra Leone              15        0    0        0
##   Singapore                 23        0    0        0
##   South Africa              23        0    0        0
##   Sri Lanka                 20        0    0        0
##   Swaziland                 20        0    0        0
##   Sweden                    20        0    0        0
##   Switzerland               24        0    0        0
##   Thailand                  12        0    0        0
##   Togo                      22        0    0        0
##   Tunisia                   20        0    0        0
##   Turkey                     0        0    0       19
##   Uganda                    20        0    0        0
##   United Kingdom             0        0    0       24
##   United States              0       12    0        0
##   Uruguay                   16        0    0        0
##   Venezuela, RB             14        0    0        0
##   Zambia                    21        0    0        0

From the cross-tabulation, we can interpret the distribution of banking crisis levels (cris1) across different countries as follows. The figures in parentheses are the number of observations recorded for each country, and every country falls into exactly one level.

Countries with cris1 Level 1 (Low Systemic Banking Crisis Countries):

The majority of countries fall into this category (48 of 68), including Germany (23), Singapore (23), Switzerland (24), Ireland (24), South Africa (23), and many others. This could indicate that any banking stress present is not severe enough to destabilize the economy significantly. These might be instances where the banking system faces stress but can manage without major intervention.

Countries with cris1 Level 2 (Moderate Systemic Banking Crisis Countries):

Some countries, such as Denmark (19), France (20), Greece (20), Guatemala (24), Nigeria (20), and Honduras (22), fall in Level 2, which likely represents a moderate systemic banking crisis. Economies such as Denmark, France, and Greece have historically faced substantial banking sector issues that required interventions but did not lead to full-blown collapses.

Countries with cris1 Level 3 (High Systemic Banking Crisis Countries):

Fewer countries fall into this category: Argentina (18), Indonesia (18), Bolivia (14), Brazil (15), and Korea, Rep. (16). This level could indicate a high-severity banking crisis. Countries like Argentina and Indonesia, known for significant financial instability and severe banking issues in the past, are found here. This could suggest a crisis level that disrupts financial systems and requires external support or extensive government bailouts.

Countries with cris1 Level 4 (Intensive Systemic Banking Crisis Countries):

Only Turkey (19) and the United Kingdom (24) appear in the highest crisis level (Level 4). This is likely the most severe level, representing very high systemic banking crises. Both countries have historically faced crises with major impacts on their economies that required extensive interventions, including bank failures, large bailouts, or even IMF assistance.
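The per-level grouping above can also be read off a country-by-level table programmatically. The sketch below uses a small made-up data frame (the `toy` rows are hypothetical, not the banking data) to show the idea: `colSums(tab)` gives observations per level, `colSums(tab > 0)` gives countries per level, and `rowSums(tab > 0)` checks that each country sits in exactly one level.

```r
# Toy stand-in for banking_crisis (hypothetical rows, for illustration only)
toy <- data.frame(
  country = rep(c("A", "B", "C", "D"), times = c(3, 2, 4, 1)),
  cris1   = rep(c(1, 1, 2, 4), times = c(3, 2, 4, 1))
)

tab <- table(toy$country, toy$cris1)

obs_per_level       <- colSums(tab)      # observations in each level
countries_per_level <- colSums(tab > 0)  # countries with at least one row in each level
levels_per_country  <- rowSums(tab > 0)  # number of levels each country appears in

obs_per_level
countries_per_level
levels_per_country
```

Applied to the real `cross_table`, the same three calls would summarise the long printout in a few numbers.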

# summary statistics grouped by cris1
# Group by cris1 and summarize key statistics for numeric variables
banking_crisis %>%
  group_by(cris1) %>%
  summarise(
    mean_rgdpgr = mean(rgdpgr, na.rm = TRUE),
    mean_infl = mean(infl, na.rm = TRUE),
    mean_credgr = mean(credgr, na.rm = TRUE),
    mean_wrgdpgr = mean(wrgdpgr, na.rm = TRUE),
    mean_deprec = mean(deprec, na.rm = TRUE),
    mean_wcredgr = mean(wcredgr, na.rm = TRUE),
    mean_wliq = mean(wliq, na.rm = TRUE),
    mean_liq = mean(liq, na.rm = TRUE),
    mean_ck_cris = mean(ck_cris, na.rm = TRUE)
  )
## # A tibble: 4 × 10
##   cris1 mean_rgdpgr mean_infl mean_credgr mean_wrgdpgr mean_deprec mean_wcredgr
##   <dbl>       <dbl>     <dbl>       <dbl>        <dbl>       <dbl>        <dbl>
## 1     1        3.70      9.72        2.89         3.65        8.79         2.79
## 2     2        3.38     12.0         2.79         3.40       10.8          2.43
## 3     3        4.48    250.          4.24         4.48      264.           3.32
## 4     4        3.57     29.8         4.43         3.57       24.2          4.43
## # ℹ 3 more variables: mean_wliq <dbl>, mean_liq <dbl>, mean_ck_cris <dbl>

Highest Mean:

Inflation (mean_infl) in cris1 Level 3: 249.86%

Interpretation: This extremely high inflation rate suggests that during severe crises (cris1 Level 3), economies may experience hyperinflation or severe economic instability. It indicates a loss of purchasing power and significant pressure on prices, which often happens during severe financial distress.

Depreciation (mean_deprec) in cris1 Level 3: 263.99%

Interpretation: The extraordinarily high depreciation in Level 3 crises signals a collapse in currency value. It often reflects a crisis of confidence in the economy, where exchange rates fall dramatically, making foreign goods and services much more expensive, worsening the financial situation.

Lowest Mean:

Capital Adequacy in Crisis (mean_ck_cris) in cris1 Level 1: 0.026

Interpretation: A low capital adequacy ratio during a low-level crisis (Level 1) suggests that banks are not holding as much capital in reserve. This indicates that during less severe crises, banks may feel less pressure to maintain high levels of capital buffers. In contrast, during severe crises, they may need to increase their reserves to protect against financial instability.

Credit Growth (mean_credgr) in cris1 Level 2: 2.79%

Interpretation: The relatively low credit growth in Level 2 crises suggests that during moderate crises, there may be less lending activity or a slowdown in credit expansion. This could be due to stricter lending conditions, lower demand for loans, or concerns about financial stability.
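A group mean as large as the Level 3 inflation figure can be driven by a handful of extreme observations, so it can help to look at the median next to the mean. The sketch below uses made-up inflation values (not taken from the data set) purely to show how the two statistics diverge when one value is extreme.

```r
# Made-up inflation figures (percent) for illustration; not taken from the data set
toy <- data.frame(
  cris1 = c(1, 1, 1, 1, 3, 3, 3, 3, 3),
  infl  = c(4, 6, 8, 10, 10, 12, 15, 20, 2000)
)

# Mean and median inflation per crisis level
mean_by_level   <- tapply(toy$infl, toy$cris1, mean)
median_by_level <- tapply(toy$infl, toy$cris1, median)

rbind(mean = mean_by_level, median = median_by_level)
```

In this toy example the single value of 2000 pulls the Level 3 mean far above its median, while the Level 1 mean and median coincide.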

# visualise the multiclass variable

ggplot(banking_crisis, aes(x= factor(cris1))) +
  geom_bar(fill = "skyblue", color = "black") +
  ggtitle("Count of Multiclass variable (cris1)") +
  xlab("cris1 levels") +
  ylab("count") +
  theme_minimal()

Level 1 (Low Systemic Banking Crisis): Highest count (853)

Level 2 (Moderate Systemic Banking Crisis): 228

Level 3 (High Systemic Banking Crisis): 81

Level 4 (Intensive Systemic Banking Crisis): Lowest count (43)

This bar plot highlights the imbalanced nature of the dataset, with Level 1 being far more frequent.
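The degree of imbalance can be quantified from the class totals. The numbers below are the column sums of the cross-tabulation printed earlier, added up by hand from that output, so they serve as a check rather than new data.

```r
# Column totals of the country-by-level cross-tabulation shown earlier
class_counts <- c(low = 853, moderate = 228, high = 81, intensive = 43)

total     <- sum(class_counts)                       # should equal the 1205 rows of the data
class_pct <- round(100 * class_counts / total, 1)    # share of each level in percent
imbalance <- max(class_counts) / min(class_counts)   # largest class relative to smallest

total
class_pct
round(imbalance, 1)
```

Level 1 makes up roughly seven in ten observations and is about twenty times the size of Level 4, which is why some form of rebalancing is applied before modelling.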

# split the data into train and test datasets

set.seed(425)
training_sample <- banking_crisis$cris1 %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- banking_crisis[training_sample, ]
test.data <- banking_crisis[-training_sample, ]

Fixing the imbalanced classes of cris1 by hybrid sampling:

Random oversampling: randomly duplicates observations in the minority classes until they match the majority class.

Random undersampling: randomly deletes observations from the majority class until it matches the minority class.
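The two ideas can be combined in one step by resampling every class toward a common target size. The following is a minimal sketch on synthetic data; the helper name `resample_to` and the target of 100 rows are assumptions made for illustration and are not part of the analysis below.

```r
set.seed(42)
# Toy labels with an imbalance similar in shape to cris1 (synthetic, not the real data)
toy <- data.frame(
  cris1 = rep(c(1, 2, 3, 4), times = c(300, 80, 30, 15)),
  x     = rnorm(425)
)

# Hypothetical helper: resample every class to `target_n` rows.
# Classes smaller than target_n are oversampled (with replacement);
# classes at least as large as target_n are undersampled (without replacement).
resample_to <- function(data, class_col, target_n) {
  pieces <- lapply(split(data, data[[class_col]]), function(d) {
    d[sample(nrow(d), target_n, replace = nrow(d) < target_n), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

balanced <- resample_to(toy, "cris1", target_n = 100)
table(balanced$cris1)
```

Choosing a target between the smallest and largest class sizes trades off how many majority rows are discarded against how many minority rows are duplicated.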

# Split the data by class
class1 <- train.data[train.data$cris1 == "1", ]
class2 <- train.data[train.data$cris1 == "2", ]
class3 <- train.data[train.data$cris1 == "3", ]
class4 <- train.data[train.data$cris1 == "4", ]

# Compute the number of samples needed for oversampling
num_samples <- max(nrow(class1), nrow(class2), nrow(class3), nrow(class4))

# Oversample each class to the same number
class1_oversampled <- class1[sample(1:nrow(class1), num_samples, replace = TRUE), ]
class2_oversampled <- class2[sample(1:nrow(class2), num_samples, replace = TRUE), ]
class3_oversampled <- class3[sample(1:nrow(class3), num_samples, replace = TRUE), ]
class4_oversampled <- class4[sample(1:nrow(class4), num_samples, replace = TRUE), ]

# Combine oversampled data
over_sampled <- rbind(class1_oversampled, class2_oversampled, class3_oversampled, class4_oversampled)
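# Now perform undersampling on the combined oversampled dataset
# Set the desired number of samples per class (the smallest class size).
# After the oversampling step every class already has num_samples rows,
# so this step reshuffles rows and leaves the class sizes unchanged.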
majority_count <- min(table(over_sampled$cris1)) 

# Undersample each class to the desired number
class1_undersampled <- class1_oversampled[sample(1:nrow(class1_oversampled), majority_count), ]
class2_undersampled <- class2_oversampled[sample(1:nrow(class2_oversampled), majority_count), ]
class3_undersampled <- class3_oversampled[sample(1:nrow(class3_oversampled), majority_count), ]
class4_undersampled <- class4_oversampled[sample(1:nrow(class4_oversampled), majority_count), ]

# Combine undersampled data
final_balanced_data <- rbind(class1_undersampled, class2_undersampled, class3_undersampled, class4_undersampled)
# Split the final balanced dataset into training and testing sets
set.seed(183)  # For reproducibility
training.sample <- createDataPartition(final_balanced_data$cris1, p = 0.8, list = FALSE)
train.data <- final_balanced_data[training.sample, ]
test.data <- final_balanced_data[-training.sample, ]
# convert the target in the train and test sets to a factor
train.data$cris1 <- as.factor(train.data$cris1)
test.data$cris1 <- as.factor(test.data$cris1)
# train the Random Forest model on the balanced dataset
set.seed(545)
cris1_model_balanced <- train(
  cris1 ~ rgdpgr + infl + credgr + wrgdpgr + deprec + wcredgr + wliq + liq + ck_cris,
  data = train.data,
  method = "rf",
  preProcess = c("scale", "center"),
  trControl = trainControl(method = "cv", number = 10)
)

cris1_model_balanced
## Random Forest 
## 
## 2128 samples
##    9 predictor
##    4 classes: '1', '2', '3', '4' 
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1914, 1915, 1915, 1916, 1915, 1916, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9553413  0.9404524
##   5     0.9515854  0.9354460
##   9     0.9473534  0.9298036
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Random Forest Model Interpretation:

Accuracy: The model has a high cross-validated accuracy of 95.53%, which means that it correctly classifies the cris1 level in approximately 95.53% of the held-out fold cases on average.

Kappa: The Kappa score is also high at 0.9404, indicating that the model performs well beyond what would be expected by random chance.

Performance: The model is performing extremely well on the balanced dataset. The high accuracy and kappa scores indicate that it is making reliable predictions across all four classes (1, 2, 3, and 4).

Balanced Dataset: Since the dataset is balanced, the model is less likely to be biased toward the majority class. This is reflected in the high kappa score, which accounts for class imbalance and chance agreement.

Optimal Parameters: The best cross-validated performance was obtained with mtry = 2, meaning that 2 of the 9 predictors are randomly sampled as split candidates at each node. The accuracy differences across mtry = 2, 5, and 9 are small (95.5%, 95.2%, and 94.7%), so the model is not very sensitive to this setting.

#predictions 
balanced_predicted <- cris1_model_balanced %>% predict(test.data)
# Create confusion matrix based on test data predictions
balance_confusion <- confusionMatrix(balanced_predicted, test.data$cris1)
balance_confusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 118   2   0   0
##          2  12 131   0   0
##          3   2   0 133   0
##          4   1   0   0 133
## 
## Overall Statistics
##                                           
##                Accuracy : 0.968           
##                  95% CI : (0.9493, 0.9813)
##     No Information Rate : 0.25            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9574          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.8872   0.9850   1.0000   1.0000
## Specificity            0.9950   0.9699   0.9950   0.9975
## Pos Pred Value         0.9833   0.9161   0.9852   0.9925
## Neg Pred Value         0.9636   0.9949   1.0000   1.0000
## Prevalence             0.2500   0.2500   0.2500   0.2500
## Detection Rate         0.2218   0.2462   0.2500   0.2500
## Detection Prevalence   0.2256   0.2688   0.2538   0.2519
## Balanced Accuracy      0.9411   0.9774   0.9975   0.9987

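Accuracy: The model predicted correctly 96.8% of the time on the held-out portion of the balanced data. Because the train/test split was made after oversampling, duplicated rows can appear in both sets, so this figure is likely an optimistic estimate of performance on genuinely unseen data.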

Kappa Score: With a Kappa of 0.957, the model has a strong agreement between the predicted and actual values, which indicates the model is very reliable across different classes.

Class Sensitivity: For classes 3 and 4, the model correctly identified 100% of the cases, while class 1 had a lower sensitivity (around 88.7%). This shows that it’s excellent at predicting some classes but slightly weaker for others.

Misclassification: Most errors happened when the model confused class 1 and class 2 (12 instances of class 1 were predicted as class 2, and 2 instances of class 2 were predicted as class 1), but errors were minimal.

Balanced Accuracy: The model’s ability to correctly predict both positives and negatives is very strong across all classes, with balanced accuracy ranging from 94.11% to 99.87%. This means it’s not biased toward any particular class.

Specificity: The model is great at identifying the correct “non-target” classes too. For all classes, its ability to correctly identify negatives was above 96.9%, meaning very few false positives.
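The headline figures in this output can be recomputed by hand from the confusion matrix, which makes clear what each statistic measures. The sketch below copies the matrix from the output above and derives accuracy, Cohen's kappa, and the one-vs-rest sensitivity, specificity, and balanced accuracy; the variable names are illustrative.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = reference)
cm <- matrix(
  c(118,   2,   0,   0,
     12, 131,   0,   0,
      2,   0, 133,   0,
      1,   0,   0, 133),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = as.character(1:4), Reference = as.character(1:4))
)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: observed agreement corrected for agreement expected by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# One-vs-rest counts for each class
tp <- diag(cm)
fn <- colSums(cm) - tp
fp <- rowSums(cm) - tp
tn <- n - tp - fn - fp

sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
round(rbind(sensitivity, specificity, balanced_accuracy), 4)
```

With four equally sized reference classes the chance agreement is 0.25, which is why kappa sits only slightly below the raw accuracy here.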

# accuracy
mean( balanced_predicted == test.data$cris1)
## [1] 0.9680451

The model has an accuracy of approximately 96.80% on the test data.
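One way to keep the test rows fully separate from the resampled training rows is to split first and oversample only the training part. The sketch below illustrates that ordering on synthetic data; the `toy` data frame, its `id` column, and the 80/20 proportions are assumptions for illustration, and this is not the procedure that produced the figures above.

```r
set.seed(1)
# Synthetic stand-in for the real data (hypothetical values, not the banking data set)
toy <- data.frame(
  id    = 1:200,
  cris1 = factor(rep(1:4, times = c(140, 40, 12, 8))),
  infl  = rnorm(200)
)

# 1. Stratified 80/20 split on the original, un-duplicated rows
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$cris1), function(rows) {
  sample(rows, size = floor(0.8 * length(rows)))
}))
train_raw <- toy[train_idx, ]
holdout   <- toy[-train_idx, ]

# 2. Oversample the training rows only, up to the largest training class
target_n <- max(table(train_raw$cris1))
train_balanced <- do.call(rbind, lapply(split(train_raw, train_raw$cris1), function(d) {
  d[sample(nrow(d), target_n, replace = TRUE), ]
}))

table(train_balanced$cris1)                       # balanced training classes
table(holdout$cris1)                              # hold-out keeps the original imbalance
length(intersect(train_balanced$id, holdout$id))  # rows shared by both sets
```

Under this ordering the hold-out set keeps the natural class proportions, so per-class sensitivity becomes the more informative number to report alongside overall accuracy.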

balanced_importance <- varImp(cris1_model_balanced)
balanced_importance
## rf variable importance
## 
##         Overall
## infl     100.00
## deprec    80.20
## liq       71.26
## wliq      70.02
## rgdpgr    59.97
## wrgdpgr   58.41
## wcredgr   49.76
## credgr    49.75
## ck_cris    0.00

The Random Forest model is primarily driven by macroeconomic indicators, particularly inflation, depreciation, and liquidity. These variables highlight the economic pressures that can destabilize banking systems. The importance of these drivers reflects how changes in economic conditions directly impact the likelihood of systemic banking crises, influencing the financial health of countries across different crisis levels.

#train the gradient boost model

xgb_model_cris1 <- train(
  cris1 ~ rgdpgr + infl + credgr + wrgdpgr + deprec + wcredgr + wliq + liq + ck_cris,
  data = train.data,
  method = "xgbTree",
  preProcess = c("scale", "center"),
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = expand.grid(nrounds = 100, max_depth = 6, eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.8)
)

xgb_model_cris1
## eXtreme Gradient Boosting 
## 
## 2128 samples
##    9 predictor
##    4 classes: '1', '2', '3', '4' 
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1916, 1916, 1915, 1916, 1915, 1915, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9534856  0.9379819
## 
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
##  held constant at a value of 1
## Tuning parameter 'subsample' was held
##  constant at a value of 0.8

Performance Across Classes and Final Accuracy

The model was fit with a single fixed configuration (nrounds = 100, max_depth = 6, eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.8), so this run does not compare alternative settings. With this configuration the 10-fold cross-validated accuracy is 95.35% and the Kappa is 0.938.
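To compare several settings rather than a single one, a grid with more than one row can be supplied. The sketch below builds such a grid with base R's `expand.grid`; the candidate values are assumptions chosen for illustration, and the seven column names are the ones already used in the `tuneGrid` call above.

```r
# Illustrative search grid (values are assumptions, not settings behind any reported result).
# The seven columns match the tuning parameters used in the tuneGrid call above.
xgb_grid <- expand.grid(
  nrounds          = 100,
  max_depth        = c(3, 6),
  eta              = c(0.1, 0.3, 0.4),
  gamma            = 0,
  colsample_bytree = c(0.6, 0.8),
  min_child_weight = 1,
  subsample        = c(0.6, 0.8)
)

nrow(xgb_grid)     # number of candidate configurations
head(xgb_grid, 3)  # first few rows of the grid
```

Passing `tuneGrid = xgb_grid` to `train()` in place of the single-row grid would make caret report a cross-validated accuracy for each of the 24 rows, at a correspondingly higher training cost.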

#make predictions
xgb_predictions <- xgb_model_cris1 %>% predict(test.data)
#accuracy
mean(xgb_predictions == test.data$cris1)
## [1] 0.9774436

The test accuracy of the XGBoost model is 97.74%, or roughly 98%. This means that the model correctly predicted the class of the crisis variable (cris1) about 98% of the time when evaluated on the test dataset.

# confusion matrix
xgb_confusion <-  confusionMatrix(xgb_predictions, test.data$cris1)
xgb_confusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 124   3   0   0
##          2   7 130   0   0
##          3   1   0 133   0
##          4   1   0   0 133
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9774          
##                  95% CI : (0.9609, 0.9883)
##     No Information Rate : 0.25            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9699          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9323   0.9774   1.0000   1.0000
## Specificity            0.9925   0.9825   0.9975   0.9975
## Pos Pred Value         0.9764   0.9489   0.9925   0.9925
## Neg Pred Value         0.9778   0.9924   1.0000   1.0000
## Prevalence             0.2500   0.2500   0.2500   0.2500
## Detection Rate         0.2331   0.2444   0.2500   0.2500
## Detection Prevalence   0.2387   0.2575   0.2519   0.2519
## Balanced Accuracy      0.9624   0.9799   0.9987   0.9987

Accuracy: 97.74%. This means the model correctly predicted the class in about 98% of the cases. It’s a high accuracy, suggesting that the model performs well in distinguishing between the four classes. The 95% confidence interval (CI) of the accuracy is (0.9609, 0.9883), a narrow range around the point estimate.

Kappa: 0.9699. The Kappa score adjusts for chance agreement. A Kappa above 0.9 is generally considered outstanding, so the model’s agreement with the true classes goes well beyond what chance alone would produce.

Class Sensitivity (Recall)

Class 1: 0.9323, Class 2: 0.9774, Class 3: 1.0000, Class 4: 1.0000. Sensitivity, or recall, measures the proportion of true positives correctly identified. The model performs well across all classes, particularly in identifying Classes 3 and 4, with perfect recall (1.0000). Class 1 has a slightly lower recall (0.9323), meaning it missed some instances of Class 1.

Class Specificity

Class 1: 0.9925, Class 2: 0.9825, Class 3: 0.9975, Class 4: 0.9975. Specificity measures how well the model correctly identifies true negatives. For all classes, specificity is high, indicating the model rarely misclassifies other classes as the target class. Classes 3 and 4 have the highest specificity, meaning very few false positives.

Positive Predictive Value (Precision)

Class 1: 0.9764, Class 2: 0.9489, Class 3: 0.9925, Class 4: 0.9925. Precision indicates the proportion of predicted positives that are actually correct. The precision values show that the model is highly accurate in its predictions for all classes, especially Classes 3 and 4 (0.9925 each).

Balanced Accuracy

Class 1: 0.9624, Class 2: 0.9799, Class 3: 0.9987, Class 4: 0.9987. Balanced accuracy is the average of sensitivity and specificity, offering a more balanced view when class distributions are uneven. The high balanced accuracy values for all classes confirm that the model performs exceptionally well, handling the classification task accurately across all categories.

xgb_importance <- varImp(xgb_model_cris1)
xgb_importance
## xgbTree variable importance
## 
##         Overall
## wliq     100.00
## infl      99.58
## rgdpgr    74.46
## deprec    60.36
## credgr    40.38
## liq       33.43
## wcredgr   14.62
## wrgdpgr   11.95
## ck_cris    0.00

Conclusion: Comparing Random Forest and XGBoost Models in Predicting Systemic Banking Crises

This analysis aimed to predict systemic banking crises across different countries using Random Forest and XGBoost models, highlighting key economic indicators driving these predictions. Both models demonstrated high accuracy and identified critical variables, with slight variations in importance, helping us understand how early warning signs can manifest in different economic contexts.

Model Comparison: Random Forest vs. XGBoost

Model Accuracy:

Random Forest achieved an accuracy of 96.80%.

Key Drivers: Inflation (infl), Depreciation (deprec), Domestic Liquidity (liq), and Global Liquidity (wliq).

XGBoost achieved a slightly higher accuracy of 97.74%.

Key Drivers: Global Liquidity (wliq), Inflation (infl), GDP Growth (rgdpgr), and Depreciation (deprec).

Both models are highly reliable, but XGBoost outperformed Random Forest by 0.94 percentage points, suggesting it might capture complex patterns better in this economic context.
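The size of this gap can be put in context by recomputing both accuracies from the confusion-matrix diagonals and attaching an exact binomial interval to each. This is a rough sketch: comparing two intervals is only an informal check, and the variable names are illustrative.

```r
# Correct test-set predictions, taken from the diagonals of the two confusion matrices
n_test      <- 532
rf_correct  <- 118 + 131 + 133 + 133
xgb_correct <- 124 + 130 + 133 + 133

rf_acc  <- rf_correct / n_test
xgb_acc <- xgb_correct / n_test
gap_pp  <- 100 * (xgb_acc - rf_acc)   # gap in percentage points

# Exact binomial 95% intervals for each accuracy
rf_ci  <- binom.test(rf_correct, n_test)$conf.int
xgb_ci <- binom.test(xgb_correct, n_test)$conf.int

round(c(rf = rf_acc, xgb = xgb_acc, gap_pp = gap_pp), 4)
round(rbind(rf = rf_ci, xgb = xgb_ci), 4)

# Informal check: do the two intervals overlap?
intervals_overlap <- rf_ci[2] >= xgb_ci[1]
intervals_overlap
```

The gap corresponds to five additional correct predictions out of 532, and the two intervals overlap, so the ranking of the two models is best read as tentative.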

Top Variables Driving Model Predictions:

Both models rank Inflation (infl), Depreciation (deprec), and Global Liquidity (wliq) among their top four predictors of systemic banking crises, and Domestic Liquidity (liq) is also prominent in Random Forest. These variables align with real-world economic vulnerabilities, highlighting their critical role in early detection.

Comparison of Variable Importance

Inflation (infl):

Random Forest: Highest importance (100.00).

XGBoost: Second highest (99.58), just behind global liquidity.

Interpretation: Both models agree that inflation is among the most crucial variables for early crisis detection. High inflation often indicates economic instability, eroding consumer purchasing power and increasing the likelihood of systemic banking issues.

Depreciation (deprec):

Random Forest: Second most important (80.20).

XGBoost: Fourth most important (60.36).

Interpretation: Currency depreciation affects the external value of a country’s currency, making imports more expensive and potentially leading to broader economic distress. While both models highlight depreciation, Random Forest ranks it higher.

Global Liquidity (wliq):

Random Forest: Fourth in importance (70.02).

XGBoost: First in importance (100.00).

Interpretation: Global liquidity conditions, driven by international financial markets, heavily influence domestic banking stability. XGBoost ranks this factor highest, suggesting a stronger sensitivity to global financial conditions in its crisis predictions.

Domestic Liquidity (liq):

Random Forest: Third most important (71.26).

XGBoost: Sixth in importance (33.43).

Interpretation: Domestic liquidity impacts how easily assets can be converted to cash within the local banking system. Random Forest finds domestic liquidity more critical, whereas XGBoost emphasizes global over local conditions.

GDP Growth (rgdpgr):

Random Forest: Fifth in importance (59.97).

XGBoost: Third in importance (74.46).

Interpretation: Economic growth reflects overall economic health, with consistent growth reducing the risk of crises. Both models see it as important, though XGBoost ranks it higher.

Credit Growth (credgr and wcredgr):

Random Forest: Moderate importance (credgr = 49.75, wcredgr = 49.76).

XGBoost: Lower impact (credgr = 40.38, wcredgr = 14.62).

Interpretation: Rapid credit growth can signal increased risk in the banking sector, particularly if lending standards are lax. The Random Forest model finds these metrics more relevant than XGBoost does, especially global credit growth, which may suggest differences in how each algorithm handles credit data.

Capital Adequacy (ck_cris):

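Both Models: Lowest importance (0.00).

Interpretation: Capital adequacy ratios (which assess a bank’s capital against its risk-weighted assets) do not show up as significant in predicting early crises, possibly due to their limited variability or weak direct impact compared to other economic indicators. Because the reported scores appear to be rescaled to a 0 to 100 range, 0.00 marks the lowest-ranked predictor rather than proving a literally zero contribution.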
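The 0 to 100 scale of the importance tables can be illustrated with a min-max rescaling. The sketch below assumes that this is how the scores are normalised, which is consistent with the top predictor showing 100.00 and the bottom one 0.00 in both tables; the raw scores are made up for illustration.

```r
# Hypothetical raw importance scores (made up for illustration)
raw_importance <- c(infl = 48.2, deprec = 39.1, liq = 35.0, ck_cris = 2.3)

# Min-max rescaling to 0-100, assumed here to mirror the scaled importance output
scaled <- 100 * (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance))

round(scaled, 2)
```

In this toy example `ck_cris` has a small but non-zero raw score and still ends up at exactly 0 after rescaling, so a scaled score of 0 only says that a predictor ranks last.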

Impact on Countries by Crisis Level:

Crisis Level 1 (Low Crisis): Countries such as Germany, Singapore, Switzerland, Ireland, and South Africa are less sensitive to volatile economic indicators, but continuous monitoring of inflation and depreciation is necessary to maintain stability. These countries are characterized by stable macroeconomic environments. Low inflation, a stable currency, and ample liquidity contribute to lower systemic banking risks, as identified by the models. The key variables have minimal impact here due to robust economic policies.

Actionable Insight: Proactive economic management and early interventions in response to inflationary signals can preserve stability.

Crisis Level 2 (Moderate Crisis): Countries like Denmark, France, Greece, Guatemala, Nigeria, and Honduras are at moderate risk. Moderate impacts from credit growth and liquidity factors suggest manageable but rising risks. These countries show occasional spikes in inflation or mild currency depreciation, in line with the models’ insights, and these vulnerabilities could escalate without timely intervention.

Actionable Insight: Enhancing credit quality and ensuring adequate domestic liquidity are essential preventative measures.

Crisis Level 3 (High Crisis): Countries such as Argentina and Indonesia face significant systemic risks. Depreciation and GDP growth play significant roles, indicating that these countries are vulnerable to both economic contractions and currency pressures.

Actionable Insight: Policies aimed at fostering economic growth and maintaining stable exchange rates could help avert further crises.

Crisis Level 4 (Intensive Crisis): Turkey and the United Kingdom represent the highest crisis level, where economic variables are at critical levels. High inflation, severe depreciation, and liquidity pressures are prominent. These countries face acute economic instability, with both models flagging these indicators as early warning signs. For Turkey, rampant inflation and severe currency depreciation have crippled economic stability. In the UK, while traditionally stable, economic shocks have stressed the banking system, in line with the models’ predictions. These variables underscore the need for targeted interventions to avoid a full-blown crisis.

Actionable Insight: Monetary tightening, currency stabilization measures, and controlling inflationary pressures are critical for reducing systemic risk.

Final Insights

Both Random Forest and XGBoost models provide valuable insights into detecting early signs of systemic banking crises, with inflation, depreciation, and liquidity consistently identified as critical predictors. While Random Forest emphasizes domestic liquidity and credit factors slightly more, XGBoost highlights the broader impact of global conditions. This analysis suggests that countries experiencing high inflation, depreciation, and liquidity strains are at significant risk, emphasizing the need for vigilant economic policies to manage these variables effectively.