Understanding financial risk and credit behavior has become a cornerstone in managing the performance and stability of financial institutions. This project addresses the issue from three perspectives: macroeconomic time series modeling, individual-level credit default prediction, and data engineering through SQL.
The first part of the project focuses on Kosovo’s macroeconomic environment and its relationship with the banking sector’s Non-Performing Loans (NPL). Using time series data from 2008 to 2022, we analyze how indicators such as GDP growth, unemployment, and inflation affect credit risk. This analysis uses statistical modeling to identify key economic drivers of loan defaults.
The second part shifts to the individual borrower level. Using a dataset of 150,000 clients and 10 independent variables, the objective is to build a predictive model that identifies which clients are likely to default within the next two years (DlqIn2Years). We explore data preprocessing, feature engineering, logistic regression, and advanced classification techniques to optimize predictive performance.
The third part involves SQL-based data engineering. Here, we simulate a local database to analyze client data with custom queries. We extract useful business insights such as the number of young clients, average income based on default status, and currency conversion to EUR through a newly created exchange rates table.
Section 1: Time Series (NPL & macroeconomics)
The financial soundness of a country’s banking sector is closely tied to the performance of its broader economy. One of the key indicators reflecting the stability of the banking system is the Non-Performing Loan (NPL) ratio — the proportion of loans that borrowers fail to repay on time. A rise in NPLs not only weakens banks’ balance sheets but also signals underlying macroeconomic challenges.
This study explores the relationship between NPLs and key macroeconomic variables in Kosovo, covering the period from 2008 to 2022. Using data gathered from trusted sources such as the World Bank, Eurostat, and the Central Bank of Kosovo, the analysis focuses on three core macroeconomic indicators:
At the beginning of this analysis, two datasets for Kosovo were used: one containing quarterly data on Non-Performing Loans (NPL), and another with annual data on key macroeconomic indicators (GDP growth, unemployment, and inflation). From the NPL dataset, the year column was extracted, and yearly averages were calculated in order to align the time structure with the macroeconomic data. Afterwards, the two datasets were merged by year, resulting in an integrated dataset that, for each year from 2008 to 2022, includes values for NPL as well as the three macroeconomic variables.
1.2 Basic Statistics
summary(combined_data)
Year GDP_Growth (%) Unemployment Rate (%) Inflation (HICP) (%)
Min. :2008 Min. :-5.300 Min. :13.00 Min. :-2.40
1st Qu.:2012 1st Qu.: 3.300 1st Qu.:26.70 1st Qu.: 0.35
Median :2015 Median : 3.800 Median :30.50 Median : 1.80
Mean :2015 Mean : 3.607 Mean :33.29 Mean : 3.26
3rd Qu.:2018 3rd Qu.: 4.600 3rd Qu.:45.10 3rd Qu.: 5.40
Max. :2022 Max. :10.700 Max. :47.50 Max. :11.60
NPL
Min. :0.004018
1st Qu.:0.013190
Median :0.017674
Mean :0.024979
3rd Qu.:0.037715
Max. :0.056299
The descriptive statistics for the period 2008–2022 in Kosovo show the following:
Economic growth (GDP Growth) has an average of 3.61%, ranging from -5.3% to 10.7%, indicating moderate variability in the country’s economic performance.
The unemployment rate has a high average of 33.29% and a standard deviation of 10.45%, reflecting significant instability in the labor market.
Inflation (HICP) recorded an average of 3.26%, with values ranging from -2.4% to 11.6%, highlighting the presence of both inflationary and deflationary periods.
The NPL (Non-Performing Loans) ratio averaged 2.5%, with values between 0.4% and 5.6%, signaling an increased credit risk in certain years..
corrplot(cor_matrix, method ="circle", type ="upper", tl.cex =0.8)
The correlation results show:
NPL and Unemployment Rate have a strong positive correlation (r = 0.76), indicating that higher unemployment is strongly associated with higher levels of non-performing loans.
NPL and GDP Growth show a very weak correlation (r = 0.03), suggesting no meaningful relationship between economic growth and NPL levels.
NPL and Inflation exhibit virtually no correlation (r ≈ 0.001), implying that inflation does not significantly affect the level of NPLs.
Unemployment appears to be the most influential factor affecting NPLs in this analysis.
1.4 Visualization
This is an interactive dashboard created in Power BI, which visually and structurally summarizes the key macroeconomic indicators of Kosovo over a multi-year period. The overall layout includes four essential indicators: economic growth (GDP Growth), inflation (HICP), the unemployment rate, and the level of non-performing loans (NPL).
1.5 Linear regression model
class(combined_data)
[1] "data.frame"
model <-lm(NPL ~`GDP_Growth (%)`+`Unemployment Rate (%)`+`Inflation (HICP) (%)`, data = combined_data)summary(model)
Call:
lm(formula = NPL ~ `GDP_Growth (%)` + `Unemployment Rate (%)` +
`Inflation (HICP) (%)`, data = combined_data)
Residuals:
Min 1Q Median 3Q Max
-0.036989 -0.002464 0.000109 0.004633 0.015599
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.425e-02 1.332e-02 -1.820 0.09604 .
`GDP_Growth (%)` -6.413e-05 1.350e-03 -0.047 0.96297
`Unemployment Rate (%)` 1.419e-03 3.544e-04 4.004 0.00207 **
`Inflation (HICP) (%)` 6.778e-04 1.050e-03 0.646 0.53174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0136 on 11 degrees of freedom
Multiple R-squared: 0.5935, Adjusted R-squared: 0.4827
F-statistic: 5.354 on 3 and 11 DF, p-value: 0.01616
Based on the values given for the hypothetical year:
GDP Growth = 4.0%
Unemployment Rate = 18.0%
Inflation = 6.0%
The model predicts that the percentage of non-performing loans (NPL) will be around 0.51%.
Section 2: Individual Loan Default Prediction
As a prominent online retail lender with a substantial amount of customer history, your organization is facing an alarming increase in default rates among your clients over the past year. The objective of this analysis is to identify the variables that have the most significant impact on the probability that a client will experience a serious default within two years (`SeriousDlqin2yrs’) and to build a reliable risk prediction model.
2.1 Data description
Data set of 150,000 customers, each with information through 11 variables – 1 dependent (`SeriousDlqin2yrs`) and 10 independent. The dependent variable indicates whether the customer has had a serious default within 2 years. The variables are related to loans, income, delinquencies on pages, number of incomes and debt-to-income ratios.
This code prepares and cleans the csdata dataset for predictive analysis of loan defaults. First, all missing values (NA) in numeric columns are replaced with the corresponding median, ensuring that the analyses are not affected by missing values. Next, the MonthlyIncm column is converted from text to numeric values, and any NA values created during this process are also replaced with the median. To handle extreme values (outliers), a function is created that limits values outside the 1% and 99% percentiles for key columns such as monthly income, age, and debt ratio, preventing them from disproportionately affecting the model. Finally, the target variable DlqIn2Years is converted to a factor, a necessary step for classification models such as Random Forest or Logistic Regression.
2.4 Splitting into train/test
set.seed(123) split <-createDataPartition(csdata$DlqIn2Years, p =0.7, list =FALSE)train <- csdata[split, ] test <- csdata[-split, ]
This code performs a stratified split of the dataset csdata into training and testing subsets, which is a crucial step in building and evaluating a predictive model. The line set.seed(123) ensures that the random sampling process is reproducible, meaning the same split will be obtained each time the code is run. The function createDataPartition() selects 70% of the data (as specified by p = 0.7) while maintaining the original distribution of the target variable DlqIn2Years, ensuring balanced representation of both classes in the training set. The resulting indices are then used to split the dataset into train (70%) and test (30%) sets. This approach enables the model to be trained on one portion of the data and tested on an unseen portion, providing a more accurate evaluation of the model’s predictive performance.
This plot illustrates the importance of each variable in the Random Forest model used to predict DlqIn2Years, which indicates whether a client will default within the next two years. The chart is split into two parts: the left side represents MeanDecreaseAccuracy, which measures how much the model’s accuracy decreases when a variable is removed, while the right side shows MeanDecreaseGini, reflecting the variable’s importance in splitting decisions within the decision trees. According to the Gini index, the most important predictors are: UtilizationOfUnsecLines (unsecured credit line utilization), DebtRtio (debt-to-income ratio), MonthlyIncm (monthly income), AgeInYears (age), and NoOfTimes90DaysLate (number of times the client was more than 90 days late).
2.5 Data Exploring
Exploratory Data Analysis (EDA) was conducted to better understand the characteristics and distribution of variables prior to model development. The analysis showed that the distribution of classes between defaulted and non-defaulted customers was well balanced in the training and testing data, ensuring a fair estimation process.
distribution in training set
0
1
0.93315302
0.3384698
distribution in test set
0
1
0.99317629
0.6682371
The distribution indicates that while both datasets preserve a reasonable balance between classes, the testing set shows a slightly higher proportion of delinquent clients.
Debt Ration
The histogram illustrates the distribution of the Debt Ratio variable across default status categories (DlqIn2Years). The majority of clients have a very low debt ratio, clustered near zero. However, a small portion of the population exhibits extremely high debt ratios, with values exceeding 5000.
Monthly Income
The boxplot displays the distribution of monthly income across default statuses (DlqIn2Years). Clients who did not default (DlqIn2Years = 0) exhibit a higher median monthly income compared to those who defaulted (DlqIn2Years = 1).
Random Forest Model
The plot shows the importance of variables in the Random Forest model for predicting loan default (DlqIn2Years).
The most influential variables are:
UtilizationOfUnsecLines (utilization of unsecured credit lines),
DebtRatio (debt-to-income ratio),
MonthlyIncm (monthly income), and
AgeInYears (age of the borrower).
2.5 Model Performance Evaluation
To evaluate the predictive performance of the model, a Random Forest classifier was trained on the training dataset and tested on the unseen test set. The model’s predictive power was assessed using accuracy, AUC (Area Under the Curve) from the ROC curve
AUC
An AUC of 0.841 means that the model has an 84.1% chance of correctly ranking a customer with a default higher than one without a default, when selected at random.
Section 3: SQL-Summarized Examples and Answers
3.1 How many clients have age lower than 34?
SQL Statement:
SELECTCOUNT(*) AS Under_34_years FROM csdata_export WHERE AgeInYears <34;
Under_34_years
17085
Interpretation: The SQL statement shows that there are 17,085 clients who are younger than 34 years old.
3.2 What is the average income for clients based on the target variable?
SQL Statement:
SELECTDlqIn2Years AS target, AVG(MonthlyIncm) AS average_monthly_income FROM csdata_export GROUPBY DlqIn2Years;
target
average_monthly_income
0
5630.83
1
6747.84
Interpretation: The SQL statement shows that customers who have not had any late payments (0) have a higher average monthly income (6747.84 EUR), while those who have had late payments (1) have a lower income (5630.83 EUR). This suggests that income level may affect the risk of late payments.
3.3 How are many different currencies are there?
SQL Statement:
SELECTCOUNT(DISTINCT currency) AS unique_currenciesFROM csdata_export;
unique_currencies
10
Interpretation: The SQL statement shows that there are 10 unique currencies used by the clients.
3.4 Try to convert all the currency to EUR? ▪ create a new table with foreign exchange rate?
Interpretation: The SQL statement shows how converts all currencies to EURO, first a new table named exchange was created containing each currency and its conversion rate to EURO. Then, by joining the csdata_export table with the exchange table on the currency field, the monthly income in EURO was calculated by multiplying the original income with the corresponding exchange rate. The final result shows the clients’ table income uniformly converted to EURO.
3.5 Present basic statistics of monthly income in EUR. What is average income in EUR?
SELECTCOUNT(*) AS total_clients,MIN(income_euro) AS min_income_euro,MAX(income_euro) AS max_income_euro,AVG(income_euro) AS avg_income_euroFROM clients_in_euro;
total_clients
min_income_euro
max_income_euro
avg_income_euro
148495
0
300875
5145.53
Interpretation:The SQL statement calculates basic statistics for the clients’ monthly income in EURO. It returns the total number of clients, the minimum and maximum income values, and the average monthly income after converting all currencies to EURO.