RISK SECTOR

Author

Izet Kelmendi

Published

March 26, 2024

Introduction

Understanding financial risk and credit behavior has become a cornerstone in managing the performance and stability of financial institutions. This project addresses the issue from three perspectives: macroeconomic time series modeling, individual-level credit default prediction, and data engineering through SQL.

The first part of the project focuses on Kosovo’s macroeconomic environment and its relationship with the banking sector’s Non-Performing Loans (NPL). Using time series data from 2008 to 2022, we analyze how indicators such as GDP growth, unemployment, and inflation affect credit risk. This analysis uses statistical modeling to identify key economic drivers of loan defaults.

The second part shifts to the individual borrower level. Using a dataset of 150,000 clients and 10 independent variables, the objective is to build a predictive model that identifies which clients are likely to default within the next two years (DlqIn2Years). We explore data preprocessing, feature engineering, logistic regression, and advanced classification techniques to optimize predictive performance.

The third part involves SQL-based data engineering. Here, we simulate a local database to analyze client data with custom queries. We extract useful business insights such as the number of young clients, average income based on default status, and currency conversion to EUR through a newly created exchange rates table.

Section 1: Time Series (NPL & macroeconomics)

The financial soundness of a country’s banking sector is closely tied to the performance of its broader economy. One of the key indicators reflecting the stability of the banking system is the Non-Performing Loan (NPL) ratio — the proportion of loans that borrowers fail to repay on time. A rise in NPLs not only weakens banks’ balance sheets but also signals underlying macroeconomic challenges.

This study explores the relationship between NPLs and key macroeconomic variables in Kosovo, covering the period from 2008 to 2022. Using data gathered from trusted sources such as the World Bank, Eurostat, and the Central Bank of Kosovo, the analysis focuses on three core macroeconomic indicators:

GDP Growth Rate

Unemployment Rate

Inflation Rate (HICP)

1.1 Time Series Preparation

npl_bank <- read_excel("D:/Izeti/NLB_DataScientist_NLBGroup/data/npl_bank.xlsx")
kosovo_macro <- read_excel("D:/Izeti/NLB_DataScientist_NLBGroup/data/kosovo_macro_real_yearly_2008_2022.xlsx")
glimpse(kosovo_macro)

Rows: 15
Columns: 4
$ Year                    <dbl> 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015…
$ `GDP_Growth (%)`        <dbl> 5.4, 3.5, 3.2, 4.4, 2.5, 3.4, 1.2, 4.1, 4.1, 4…
$ `Unemployment Rate (%)` <dbl> 47.5, 45.4, 45.3, 45.4, 44.9, 30.0, 35.3, 32.9…
$ `Inflation (HICP) (%)`  <dbl> 9.4, -2.4, 3.5, 7.3, 2.5, 1.8, 0.4, -0.5, 0.3,…

glimpse(npl_bank)

Rows: 60
Columns: 2
$ DATE <dttm> 2008-03-31, 2008-06-30, 2008-09-30, 2008-12-31, 2009-03-31, 2009…
$ NPL  <dbl> 0.052048906, 0.042427987, 0.051323089, 0.051384476, 0.058027842, …

npl_bank$Year <- format(as.Date(npl_bank$DATE), "%Y")
npl_yearly <- npl_bank %>%
  group_by(Year) %>%
  summarise(NPL = mean(NPL, na.rm = TRUE)) %>%
  mutate(Year = as.numeric(Year))

head(npl_yearly)

# A tibble: 6 × 2
   Year     NPL
  <dbl>   <dbl>
1  2008 0.0493 
2  2009 0.0539 
3  2010 0.0556 
4  2011 0.0563 
5  2012 0.00402
6  2013 0.0206

combined_data <- merge(kosovo_macro, npl_yearly, by = "Year")
head(combined_data)

  Year GDP_Growth (%) Unemployment Rate (%) Inflation (HICP) (%)         NPL
1 2008            5.4                  47.5                  9.4 0.049296115
2 2009            3.5                  45.4                 -2.4 0.053929375
3 2010            3.2                  45.3                  3.5 0.055647157
4 2011            4.4                  45.4                  7.3 0.056298524
5 2012            2.5                  44.9                  2.5 0.004017859
6 2013            3.4                  30.0                  1.8 0.020558980

At the beginning of this analysis, two datasets for Kosovo were used: one containing quarterly data on Non-Performing Loans (NPL), and another with annual data on key macroeconomic indicators (GDP growth, unemployment, and inflation). From the NPL dataset, the year column was extracted, and yearly averages were calculated in order to align the time structure with the macroeconomic data. Afterwards, the two datasets were merged by year, resulting in an integrated dataset that, for each year from 2008 to 2022, includes values for NPL as well as the three macroeconomic variables.

1.2 Basic Statistics

summary(combined_data)

      Year      GDP_Growth (%)   Unemployment Rate (%) Inflation (HICP) (%)
 Min.   :2008   Min.   :-5.300   Min.   :13.00         Min.   :-2.40       
 1st Qu.:2012   1st Qu.: 3.300   1st Qu.:26.70         1st Qu.: 0.35       
 Median :2015   Median : 3.800   Median :30.50         Median : 1.80       
 Mean   :2015   Mean   : 3.607   Mean   :33.29         Mean   : 3.26       
 3rd Qu.:2018   3rd Qu.: 4.600   3rd Qu.:45.10         3rd Qu.: 5.40       
 Max.   :2022   Max.   :10.700   Max.   :47.50         Max.   :11.60       
      NPL          
 Min.   :0.004018  
 1st Qu.:0.013190  
 Median :0.017674  
 Mean   :0.024979  
 3rd Qu.:0.037715  
 Max.   :0.056299

sapply(combined_data[ , -1], sd)

       GDP_Growth (%) Unemployment Rate (%)  Inflation (HICP) (%) 
           3.20835220           10.44758801            4.19043128 
                  NPL 
           0.01891042

The descriptive statistics for the period 2008–2022 in Kosovo show the following:

Economic growth (GDP Growth) has an average of 3.61%, ranging from -5.3% to 10.7%, indicating moderate variability in the country’s economic performance.

The unemployment rate has a high average of 33.29% and a standard deviation of 10.45%, reflecting significant instability in the labor market.

Inflation (HICP) recorded an average of 3.26%, with values ranging from -2.4% to 11.6%, highlighting the presence of both inflationary and deflationary periods.

The NPL (Non-Performing Loans) ratio averaged 2.5%, with values between 0.4% and 5.6%, signaling an increased credit risk in certain years..

1.3 Correlation matrix

cor_matrix <- cor(combined_data[ , -1])  # heqim Year
print(cor_matrix)

                      GDP_Growth (%) Unemployment Rate (%) Inflation (HICP) (%)
GDP_Growth (%)            1.00000000           -0.05455098         0.5418833778
Unemployment Rate (%)    -0.05455098            1.00000000        -0.1828693094
Inflation (HICP) (%)      0.54188338           -0.18286931         1.0000000000
NPL                       0.02774249            0.75718741         0.0009290186
                               NPL
GDP_Growth (%)        0.0277424910
Unemployment Rate (%) 0.7571874111
Inflation (HICP) (%)  0.0009290186
NPL                   1.0000000000

corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8)

The correlation results show:

NPL and Unemployment Rate have a strong positive correlation (r = 0.76), indicating that higher unemployment is strongly associated with higher levels of non-performing loans.

NPL and GDP Growth show a very weak correlation (r = 0.03), suggesting no meaningful relationship between economic growth and NPL levels.

NPL and Inflation exhibit virtually no correlation (r ≈ 0.001), implying that inflation does not significantly affect the level of NPLs.

Unemployment appears to be the most influential factor affecting NPLs in this analysis.

1.4 Visualization

This is an interactive dashboard created in Power BI, which visually and structurally summarizes the key macroeconomic indicators of Kosovo over a multi-year period. The overall layout includes four essential indicators: economic growth (GDP Growth), inflation (HICP), the unemployment rate, and the level of non-performing loans (NPL).

1.5 Linear regression model

class(combined_data)

[1] "data.frame"

model <- lm(NPL ~ `GDP_Growth (%)` + `Unemployment Rate (%)` + `Inflation (HICP) (%)`, data = combined_data)
summary(model)


Call:
lm(formula = NPL ~ `GDP_Growth (%)` + `Unemployment Rate (%)` + 
    `Inflation (HICP) (%)`, data = combined_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.036989 -0.002464  0.000109  0.004633  0.015599 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)   
(Intercept)             -2.425e-02  1.332e-02  -1.820  0.09604 . 
`GDP_Growth (%)`        -6.413e-05  1.350e-03  -0.047  0.96297   
`Unemployment Rate (%)`  1.419e-03  3.544e-04   4.004  0.00207 **
`Inflation (HICP) (%)`   6.778e-04  1.050e-03   0.646  0.53174   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0136 on 11 degrees of freedom
Multiple R-squared:  0.5935,    Adjusted R-squared:  0.4827 
F-statistic: 5.354 on 3 and 11 DF,  p-value: 0.01616

Based on the output, the model equation is:

NPL=−0.02425−0.00006413⋅GDP+0.001419⋅Unemployment+0.0006778⋅Inflation

where:

GDP is GDP_Growth (%)

Unemployment is Unemployment Rate (%)

Inflation is Inflation (HICP) (%)

Also:

R² = 0.5935: ~59% of the variance in NPL is explained by these three factors.

Adjusted R² = 0.4827: after adjusting for model complexity.

F-statistic p = 0.016: model is globally statistically significant.

1.6 Forecast of NPLs for the coming year

library(tibble)

new_data <- tibble(
  `GDP_Growth (%)` = 4.0,
  `Unemployment Rate (%)` = 18.0,
  `Inflation (HICP) (%)` = 6.0
)

predict(model, newdata = new_data)

          1 
0.005107102

Based on the values given for the hypothetical year:

GDP Growth = 4.0%

Unemployment Rate = 18.0%

Inflation = 6.0%

The model predicts that the percentage of non-performing loans (NPL) will be around 0.51%.

Section 2: Individual Loan Default Prediction

As a prominent online retail lender with a substantial amount of customer history, your organization is facing an alarming increase in default rates among your clients over the past year. The objective of this analysis is to identify the variables that have the most significant impact on the probability that a client will experience a serious default within two years (`SeriousDlqin2yrs’) and to build a reliable risk prediction model.

2.1 Data description

Data set of 150,000 customers, each with information through 11 variables – 1 dependent (`SeriousDlqin2yrs`) and 10 independent. The dependent variable indicates whether the customer has had a serious default within 2 years. The variables are related to loans, income, delinquencies on pages, number of incomes and debt-to-income ratios.

csdata <- read_excel("D:/Izeti/NLB_DataScientist_NLBGroup/data/csdata.xlsx")
head(csdata)

# A tibble: 6 × 13
  ClientId DlqIn2Years UtilizationOfUnsecLines AgeInYears NoOfTime35-65DaysPas…¹
     <dbl>       <dbl>                   <dbl>      <dbl>                  <dbl>
1        1           1                   0.766         45                      2
2        2           0                   0.957         40                      0
3        3           0                   0.658         38                      1
4        4           0                   0.234         30                      0
5        5           0                   0.907         49                      1
6        6           0                   0.213         74                      0
# ℹ abbreviated name: ¹`NoOfTime35-65DaysPastDueNotWorse`
# ℹ 8 more variables: DebtRtio <dbl>, MonthlyIncm <chr>, Currency <chr>,
#   NoOfOpenCreditLinesAndLoans <dbl>, NoOfTimes90DaysLate <dbl>,
#   NoRealEstateLoansOrLines <dbl>, `NoOfTime60-89DaysPastDueNotWorse` <dbl>,
#   NoOfDependents <chr>

summary(csdata)

    ClientId       DlqIn2Years      UtilizationOfUnsecLines   AgeInYears   
 Min.   :     1   Min.   :0.00000   Min.   :    0.00        Min.   :  0.0  
 1st Qu.: 37501   1st Qu.:0.00000   1st Qu.:    0.03        1st Qu.: 41.0  
 Median : 75001   Median :0.00000   Median :    0.15        Median : 52.0  
 Mean   : 75001   Mean   :0.06684   Mean   :    6.05        Mean   : 52.3  
 3rd Qu.:112500   3rd Qu.:0.00000   3rd Qu.:    0.56        3rd Qu.: 63.0  
 Max.   :150000   Max.   :1.00000   Max.   :50708.00        Max.   :109.0  
 NoOfTime35-65DaysPastDueNotWorse    DebtRtio        MonthlyIncm       
 Min.   : 0.000                   Min.   :     0.0   Length:150000     
 1st Qu.: 0.000                   1st Qu.:     0.2   Class :character  
 Median : 0.000                   Median :     0.4   Mode  :character  
 Mean   : 0.421                   Mean   :   353.0                     
 3rd Qu.: 0.000                   3rd Qu.:     0.9                     
 Max.   :98.000                   Max.   :329664.0                     
   Currency         NoOfOpenCreditLinesAndLoans NoOfTimes90DaysLate
 Length:150000      Min.   : 0.000              Min.   : 0.000     
 Class :character   1st Qu.: 5.000              1st Qu.: 0.000     
 Mode  :character   Median : 8.000              Median : 0.000     
                    Mean   : 8.453              Mean   : 0.266     
                    3rd Qu.:11.000              3rd Qu.: 0.000     
                    Max.   :58.000              Max.   :98.000     
 NoRealEstateLoansOrLines NoOfTime60-89DaysPastDueNotWorse NoOfDependents    
 Min.   : 0.000           Min.   : 0.0000                  Length:150000     
 1st Qu.: 0.000           1st Qu.: 0.0000                  Class :character  
 Median : 1.000           Median : 0.0000                  Mode  :character  
 Mean   : 1.018           Mean   : 0.2404                                    
 3rd Qu.: 2.000           3rd Qu.: 0.0000                                    
 Max.   :54.000           Max.   :98.0000

2.2 Data Preparation

csdata <- csdata %>%   mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))  
colSums(is.na(csdata))

                        ClientId                      DlqIn2Years 
                               0                                0 
         UtilizationOfUnsecLines                       AgeInYears 
                               0                                0 
NoOfTime35-65DaysPastDueNotWorse                         DebtRtio 
                               0                                0 
                     MonthlyIncm                         Currency 
                               0                                0 
     NoOfOpenCreditLinesAndLoans              NoOfTimes90DaysLate 
                               0                                0 
        NoRealEstateLoansOrLines NoOfTime60-89DaysPastDueNotWorse 
                               0                                0 
                  NoOfDependents 
                               0

csdata <- csdata %>%   mutate(MonthlyIncm = as.numeric(MonthlyIncm))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `MonthlyIncm = as.numeric(MonthlyIncm)`.
Caused by warning:
! NAs introduced by coercion

csdata <- csdata %>%   mutate(MonthlyIncm = as.numeric(MonthlyIncm)) %>%   mutate(MonthlyIncm = ifelse(is.na(MonthlyIncm), median(MonthlyIncm, na.rm = TRUE), MonthlyIncm))  
cap_outliers <- function(x) {   qnt <- quantile(x, probs = c(0.01, 0.99), na.rm = TRUE)   
x[x < qnt[1]] <- qnt[1]   
x[x > qnt[2]] <- qnt[2]  
return(x) }
csdata <- csdata %>% mutate(across(c(MonthlyIncm, AgeInYears, DebtRtio), cap_outliers))  
csdata$DlqIn2Years <- as.factor(csdata$DlqIn2Years) 
names(csdata) <- make.names(names(csdata))

This code prepares and cleans the csdata dataset for predictive analysis of loan defaults. First, all missing values (NA) in numeric columns are replaced with the corresponding median, ensuring that the analyses are not affected by missing values. Next, the MonthlyIncm column is converted from text to numeric values, and any NA values created during this process are also replaced with the median. To handle extreme values (outliers), a function is created that limits values outside the 1% and 99% percentiles for key columns such as monthly income, age, and debt ratio, preventing them from disproportionately affecting the model. Finally, the target variable DlqIn2Years is converted to a factor, a necessary step for classification models such as Random Forest or Logistic Regression.

2.4 Splitting into train/test

set.seed(123)  
split <- createDataPartition(csdata$DlqIn2Years, p = 0.7, list = FALSE)
train <- csdata[split, ] 
test <- csdata[-split, ]

This code performs a stratified split of the dataset csdata into training and testing subsets, which is a crucial step in building and evaluating a predictive model. The line set.seed(123) ensures that the random sampling process is reproducible, meaning the same split will be obtained each time the code is run. The function createDataPartition() selects 70% of the data (as specified by p = 0.7) while maintaining the original distribution of the target variable DlqIn2Years, ensuring balanced representation of both classes in the training set. The resulting indices are then used to split the dataset into train (70%) and test (30%) sets. This approach enables the model to be trained on one portion of the data and tested on an unseen portion, providing a more accurate evaluation of the model’s predictive performance.

2.5 Model: Random Forest

rf_model <- randomForest(DlqIn2Years ~ ., data = train, ntree = 100, importance = TRUE)   
varImpPlot(rf_model)

This plot illustrates the importance of each variable in the Random Forest model used to predict DlqIn2Years, which indicates whether a client will default within the next two years. The chart is split into two parts: the left side represents MeanDecreaseAccuracy, which measures how much the model’s accuracy decreases when a variable is removed, while the right side shows MeanDecreaseGini, reflecting the variable’s importance in splitting decisions within the decision trees. According to the Gini index, the most important predictors are: UtilizationOfUnsecLines (unsecured credit line utilization), DebtRtio (debt-to-income ratio), MonthlyIncm (monthly income), AgeInYears (age), and NoOfTimes90DaysLate (number of times the client was more than 90 days late).

2.5 Data Exploring

Exploratory Data Analysis (EDA) was conducted to better understand the characteristics and distribution of variables prior to model development. The analysis showed that the distribution of classes between defaulted and non-defaulted customers was well balanced in the training and testing data, ensuring a fair estimation process.

distribution in training set
0	1
0.93315302	0.3384698

distribution in test set
0	1
0.99317629	0.6682371

The distribution indicates that while both datasets preserve a reasonable balance between classes, the testing set shows a slightly higher proportion of delinquent clients.

The histogram illustrates the distribution of the Debt Ratio variable across default status categories (DlqIn2Years). The majority of clients have a very low debt ratio, clustered near zero. However, a small portion of the population exhibits extremely high debt ratios, with values exceeding 5000.

The boxplot displays the distribution of monthly income across default statuses (DlqIn2Years). Clients who did not default (DlqIn2Years = 0) exhibit a higher median monthly income compared to those who defaulted (DlqIn2Years = 1).

The plot shows the importance of variables in the Random Forest model for predicting loan default (DlqIn2Years).

The most influential variables are:

UtilizationOfUnsecLines (utilization of unsecured credit lines),

DebtRatio (debt-to-income ratio),

MonthlyIncm (monthly income), and

AgeInYears (age of the borrower).

2.5 Model Performance Evaluation

To evaluate the predictive performance of the model, a Random Forest classifier was trained on the training dataset and tested on the unseen test set. The model’s predictive power was assessed using accuracy, AUC (Area Under the Curve) from the ROC curve

An AUC of 0.841 means that the model has an 84.1% chance of correctly ranking a customer with a default higher than one without a default, when selected at random.

Section 3: SQL-Summarized Examples and Answers

3.1 How many clients have age lower than 34?

SQL Statement:

SELECT COUNT(*) AS Under_34_years  
FROM csdata_export  
WHERE  AgeInYears < 34;

Under_34_years

17085

Interpretation: The SQL statement shows that there are 17,085 clients who are younger than 34 years old.

3.2 What is the average income for clients based on the target variable?

SQL Statement:

SELECT     
DlqIn2Years AS target,     AVG(MonthlyIncm) AS average_monthly_income 
FROM  csdata_export 
GROUP BY DlqIn2Years;

target	average_monthly_income
0	5630.83
1	6747.84

Interpretation: The SQL statement shows that customers who have not had any late payments (0) have a higher average monthly income (6747.84 EUR), while those who have had late payments (1) have a lower income (5630.83 EUR). This suggests that income level may affect the risk of late payments.

3.3 How are many different currencies are there?

SQL Statement:

SELECT COUNT(DISTINCT currency) AS unique_currencies
FROM csdata_export;

unique_currencies
10

Interpretation: The SQL statement shows that there are 10 unique currencies used by the clients.

3.4 Try to convert all the currency to EUR? ▪ create a new table with foreign exchange rate?

SQL Statement:

CREATE TABLE exchange (
    currency VARCHAR(10) PRIMARY KEY,
    rate_to_euro DECIMAL(10,4)
);

INSERT INTO exchange (currency, rate_to_euro) VALUES
    ('EUR', 1.00),
    ('USD', 0.91),
    ('GBP', 1.17),
    ('CHF', 1.02),
    ('JPY', 0.0062),
    ('AUD', 0.60),
    ('CAD', 0.67),
    ('SEK', 0.087),
    ('NOK', 0.089),
    ('DKK', 0.13);
SELECT *
FROM exchange

SELECT
    c.ClientId,
    c.DlqIn2Years,
    c.MonthlyIncm,
    c.Currency,
    e.rate_to_euro,
    (c.MonthlyIncm * e.rate_to_euro) AS MonthlyIncome_EUR
FROM
    csdata_export c
        JOIN
    exchange e
    ON
        c.Currency = e.currency;


CREATE TABLE clients_in_euro AS
SELECT
    c.ClientId AS id,
    c.AgeInYears AS Age,
    c.currency ,
    c.DlqIn2Years AS Target,
    (c.MonthlyIncm * e.rate_to_euro) AS income_euro
FROM csdata_export c
         JOIN exchange e ON c.currency = e.currency;

Interpretation: The SQL statement shows how converts all currencies to EURO, first a new table named exchange was created containing each currency and its conversion rate to EURO. Then, by joining the csdata_export table with the exchange table on the currency field, the monthly income in EURO was calculated by multiplying the original income with the corresponding exchange rate. The final result shows the clients’ table income uniformly converted to EURO.

3.5 Present basic statistics of monthly income in EUR. What is average income in EUR?

SELECT
    COUNT(*) AS total_clients,
    MIN(income_euro) AS min_income_euro,
    MAX(income_euro) AS max_income_euro,
    AVG(income_euro) AS avg_income_euro
FROM clients_in_euro;

total_clients	min_income_euro	max_income_euro	avg_income_euro
148495	0	300875	5145.53

Interpretation:The SQL statement calculates basic statistics for the clients’ monthly income in EURO. It returns the total number of clients, the minimum and maximum income values, and the average monthly income after converting all currencies to EURO.