Short description:

In this project we conducted a credit risk analysis of a dataset of clients provided on Kaggle. The goal was to predict the probability that a client will experience financial distress in the next two years. The data is available under this link:

If you would like to see the source code for this analysis, it is provided in a GitHub repository - the link is in the top right corner.



Data preparation

Data description

The data comes from a Kaggle competition and consists of two initial datasets: a training set with 150,000 observations, each with 1 default indicator and 10 numerical features, and a test set with a little over 100,000 observations but without the default indicator. The test set is used only for the Kaggle competition submission, so all modelling is done on the (accordingly split) training set.

We also had to change some variable names and remove the index variable (for clarity and because of smbinning function restrictions).


Data cleaning and split

First, we split the training dataset (150,000 observations) into train/test sets, which we will use for training and final evaluation. We split the data stratified by default, so that the percentage of defaults stays the same in both sets. This is essential because the dataset is very imbalanced - only ~7% of the observations are defaults. After the split we therefore have a 0.9/0.1 train/test split, both parts with around 7% of defaults.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06684 0.00000 1.00000
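
Such a stratified split can be obtained, for example, with the caret package (a minimal sketch; the object name full_data, the column name def and the seed are our assumptions, the repository uses its own code):

library(caret)

set.seed(123)  # assumed seed, for reproducibility

# createDataPartition samples within each class of the default flag,
# so the ~7% default rate is preserved in both parts
idx   <- createDataPartition(factor(full_data$def), p = 0.9, list = FALSE)
train <- full_data[idx, ]
test  <- full_data[-idx, ]

# both parts should keep roughly 7% of defaults
mean(train$def)
mean(test$def)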

NA imputation:

There are missing values in only two columns: 26716 NAs for MonthlyIncome and 3543 for NumberOfDependents (which stands for how many people, e.g. family members, depend financially on the analysed customer). We decided to impute the mean of MonthlyIncome in place of the NAs (the mean is computed on the training set and reused for the test set, so there is no data leakage) and to create a new binary variable indicating that MonthlyIncome is missing. We do the same with NumberOfDependents, but we use the mode instead of the mean.
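
A sketch of this imputation step (the indicator names ending in _NA and the helper name are ours; the repository's exact names may differ):

# mean / mode computed on the training set only, then reused on the test set
income_mean <- mean(train$MonthlyIncome, na.rm = TRUE)
dep_mode    <- as.numeric(names(which.max(table(train$NumberOfDependents))))

impute_na <- function(df, income_mean, dep_mode) {
  # missing-value indicators as factors
  df$MonthlyIncome_NA      <- factor(ifelse(is.na(df$MonthlyIncome), "yes", "no"))
  df$NumberOfDependents_NA <- factor(ifelse(is.na(df$NumberOfDependents), "yes", "no"))
  # fill NAs with the train-set mean / mode
  df$MonthlyIncome[is.na(df$MonthlyIncome)]           <- income_mean
  df$NumberOfDependents[is.na(df$NumberOfDependents)] <- dep_mode
  df
}

train <- impute_na(train, income_mean, dep_mode)
test  <- impute_na(test, income_mean, dep_mode)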

Therefore we end up with 10 numerical variables and 2 factors (the missing-value indicators).


Histograms

Then we have a look at how each variable is distributed using histograms:

We can see that some of the variables have outliers which, after further inspection, might bias our estimates, so we decided to replace them with different values.

We treat values lying more than 2 x IQR below the 5th percentile or above the 95th percentile as outliers and replace them with the 5th and 95th percentile respectively. Using this method we capped values in 3 variables (RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome), jointly ~8000 values.
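
A sketch of this capping rule for a single variable (the helper name cap_outliers is ours; the thresholds follow the rule described above):

# cap values further than 2 * IQR below the 5th / above the 95th percentile
cap_outliers <- function(x) {
  q05 <- quantile(x, 0.05, na.rm = TRUE)
  q95 <- quantile(x, 0.95, na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  x[x < q05 - 2 * iqr] <- q05
  x[x > q95 + 2 * iqr] <- q95
  x
}

train$MonthlyIncome <- cap_outliers(train$MonthlyIncome)
train$DebtRatio     <- cap_outliers(train$DebtRatio)
train$RevolvingUtilizationOfUnsecuredLines <-
  cap_outliers(train$RevolvingUtilizationOfUnsecuredLines)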


Creating WoE variables

After analyzing the histograms, and drawing on our own experience, we decided to bin 4 variables into WoE categories using the smbinning package:

  • NumberOfTime30_59DaysPastDueNotWorse
  • NumberOfOpenCreditLinesAndLoans
  • NumberRealEstateLoansOrLines
  • DebtRatio

These variables seem to be good candidates - they have low variance, but values above a certain threshold might be very indicative.
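
A sketch of the basic smbinning call for the first of these variables (the exact configuration, including the 10-interval limit, is set in the repository code; the target coding is our assumption):

library(smbinning)

# smbinning needs an integer 0/1 target and variable names without dots
# (hence the renaming mentioned in the data-preparation section);
# in the tables below "Good" means non-defaulting, so the 0/1 coding of def
# may need to be flipped to match the package's Good/Bad convention
train$def <- as.integer(train$def)

result <- smbinning(df = train, y = "def",
                    x = "NumberOfTime30_59DaysPastDueNotWorse", p = 0.05)
result$ivtable                        # WoE / IV table as shown below
smbinning.plot(result, option = "WoE")

# smbinning.gen() adds the binned variable as a factor column
train <- smbinning.gen(train, result,
                       chrname = "NumberOfTime30_59DaysPastDueNotWorse_bin")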

Below plots and tables for splits are shown. We set the smbinning package to split the variable into maximum of 10 intervals.

  Cutpoint CntRec CntGood CntBad GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0 113446  108898   4548   0.9599  0.0401 23.9442 3.1757  0.5394 0.1944
2   <= Inf  21554   17079   4475   0.7924  0.2076  3.8165 1.3393 -1.2970 0.4674
4    Total 135000  125977   9023   0.9332  0.0668 13.9618 2.6363  0.0000 0.6618

As we can see, the first variable is split into two intervals: equal to zero and above zero. That split is quite intuitive from our experience - people who have already been more than 30 days past due are more likely to default. The Information Value of this variable after splitting is 0.6618 and it visibly separates the population into Bads/Goods (hence the much higher BadRate in the second interval).

  Cutpoint CntRec CntGood CntBad GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 3  19865   17730   2135   0.8925  0.1075  8.3044 2.1168 -0.5195 0.0498
2    <= 13  95328   89823   5505   0.9423  0.0577 16.3166 2.7922  0.1559 0.0160
3   <= Inf  19807   18424   1383   0.9302  0.0698 13.3218 2.5894 -0.0469 0.0003
5    Total 135000  125977   9023   0.9332  0.0668 13.9618 2.6363  0.0000 0.0661

The second variable, which stands for the number of open credit lines and loans, was split into 3 intervals: at most 3, between 4 and 13, and above 13. That split is less intuitive - people with very few and people with more than 13 ongoing liabilities are the most likely to default (according to our data), while people with between 4 and 13 open loans/credit lines are rather good clients. The total IV is 0.0661, so the variable is much weaker than the previous one but still carries some predictive power.

  Cutpoint CntRec CntGood CntBad GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0  50574   46358   4216   0.9166  0.0834 10.9957 2.3975 -0.2388 0.0237
2     <= 1  47053   44603   2450   0.9479  0.0521 18.2053 2.9017  0.2654 0.0219
3   <= Inf  37373   35016   2357   0.9369  0.0631 14.8562 2.6984  0.0621 0.0010
5    Total 135000  125977   9023   0.9332  0.0668 13.9618 2.6363  0.0000 0.0466

Here we can see that clients with exactly one real estate loan or line (e.g. one mortgage) are good clients, while having none or more than one indicates that the client is more inclined to default. The IV of 0.0466 indicates non-negligible predictive ability.

   Cutpoint CntRec CntGood CntBad GoodRate BadRate    Odds LnOdds     WoE     IV
1 <= 0.4234  75711   71244   4467   0.9410  0.0590 15.9490 2.7694  0.1331 0.0094
2 <= 0.6447  18587   17075   1512   0.9187  0.0813 11.2930 2.4242 -0.2121 0.0068
3 <= 3.9724  13689   12123   1566   0.8856  0.1144  7.7414 2.0466 -0.5897 0.0456
4   <= 1267  13501   12687    814   0.9397  0.0603 15.5860 2.7464  0.1101 0.0012
5    <= Inf  13512   12848    664   0.9509  0.0491 19.3494 2.9627  0.3263 0.0093
7     Total 135000  125977   9023   0.9332  0.0668 13.9618 2.6363  0.0000 0.0723

Finally, the Debt Ratio - surprisingly, a very high DR indicates that the client is less likely to default according to the data. This is very counter-intuitive and may point to some data flaws. The worst clients are the ones with a DR between 0.65 and 3.97. The IV of 0.0723 is the highest among the last three WoE variables (the first one remains by far the strongest).


Variable correlation

As we can see, there are no significant correlations between any of the variables - therefore we keep all the variables for further analysis.
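
A correlation check along these lines can be produced with, for example, the corrplot package (a sketch; the report's own plot is generated in the repository code):

library(corrplot)

# correlation matrix of the numeric predictors only
num_vars <- sapply(train, is.numeric)
corr_mat <- cor(train[, num_vars], use = "pairwise.complete.obs")

corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.7)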


Train/Validation split

Now we need to further split the training set into the actual training dataset and a validation one - we will train the models on the training set and tune them based on their performance on the validation set.

We are once again splitting the dataset into 0.9/0.1 train/validation sets.


SMOTE

Since our dataset is very imbalanced (7% of observations are defaults), we try oversampling using SMOTE (Synthetic Minority Oversampling Technique). Using the SMOTE function from the DMwR package we double the number of default observations in our training set (no synthetic observations are added to the validation set).

# # temporary ID so that dplyr::distinct() keeps genuine duplicate rows from the original data
# train_woe$ID <- seq.int(nrow(train_woe))
# # generate synthetic minority-class ("yes") observations: 50% oversampling, 3 nearest neighbours
# temp_train <- DMwR::SMOTE(def ~ ., train_woe, perc.over = 50, k = 3)
# # keep only the defaults from the SMOTE output and append them to the original training set
# temp_train <- subset(temp_train, def == "yes")
# temp_train <- rbind(train_woe, temp_train)
# # drop rows duplicated between the SMOTE output and the original data
# train_woe_smote <- dplyr::distinct(temp_train)
# train_woe$ID <- NULL
# train_woe_smote$ID <- NULL
# rm(temp_train)
# 
# # compare class counts before and after oversampling
# summary(train_woe$def)
# summary(train_woe_smote$def)

Despite our best efforts, adding observations obtained by synthetic oversampling did not improve the models' predictive ability in further analysis, so we decided not to use it in this report.



Choosing the best logit model

Data loading

rm(list=ls(all=TRUE))

load("data/train_woe.Rdata")
load("data/validation_woe.Rdata")
load("models/logit_model.Rdata")

source("model_functions.R")

First, we estimate two baseline models (an intercept-only one and one with all variables) - these will be our benchmarks when assessing the final logit model.

baza <- glm(def ~ 1, data = train_woe, family = binomial("logit"))

max <- glm(def ~ ., data = train_woe, family = binomial("logit"))

Then, we use the stepwise (both directions) method to select only statistically significant variables. We show only the code, because the model was computed beforehand and loaded in a previous chunk.

# model_stepwise_both <- step(baza, scope = list(upper = max, lower = baza), direction = "both", trace = TRUE, steps = 30, k = 4)
# summary(model_stepwise_both)

Quality assessment

## [1] "Goodness Of Fit test p-value: 1"
## [1] "Goodness Of Fit test p-value: 0"
## [1] "Hosmer-Lemeshow test p-value: 0"

As we can see, the model fits the data reasonably well and the selected variables are statistically significant (based on the GoF, Hosmer-Lemeshow and LR test p-values). Note, however, that a Hosmer-Lemeshow p-value of 0 conventionally signals lack of fit; with over 100,000 observations this test rejects very easily, so it should be interpreted with caution.
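
The tests above come from the helper functions in model_functions.R; a Hosmer-Lemeshow test can also be obtained directly, for example with the ResourceSelection package (a sketch, assuming the loaded stepwise model object is called model_stepwise_both and that def is a yes/no factor):

library(ResourceSelection)

# Hosmer-Lemeshow test: observed 0/1 outcomes vs fitted probabilities, 10 groups
hl <- hoslem.test(as.numeric(train_woe$def == "yes"),
                  fitted(model_stepwise_both), g = 10)
hl$p.value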


ROC Tests

Now we compare the ROC curves of the base and max models with the ROC curve of the model chosen using the stepwise method.

## [1] "Roc test p-value for stepwise and base: 0"
## [1] "Roc test p-value for stepwise and max: 0.00382507331587263"

The p-values of both tests are lower than the significance level of 0.05, so we reject the null hypotheses and conclude that the ROC curves differ significantly.
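
Such a comparison can be done with the pROC package, for instance (a sketch; the report's own wrapper functions live in model_functions.R):

library(pROC)

roc_step <- roc(validation_woe$def,
                predict(model_stepwise_both, validation_woe, type = "response"))
roc_max  <- roc(validation_woe$def,
                predict(max, validation_woe, type = "response"))

# DeLong test for the difference between two paired ROC curves
roc.test(roc_step, roc_max)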


Gini comparison

## [1] "Gini coefficients on the validation set"
## [1] "Gini coefficient for stepwise model: 0.64236008320048"
## [1] "Gini coefficient for max model: 0.642370995431295"
## [1] "Gini coefficient for base model: 0"

As we can see, the model chosen using the stepwise method achieves practically the same Gini coefficient as the max model (the difference is statistically detectable but negligible in size) while using fewer variables - therefore we will use it in further analysis.
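
The Gini coefficient used here is the usual linear transformation of the AUC, Gini = 2 * AUC - 1, which can be computed for example as follows (a sketch using pROC):

library(pROC)

pred_step <- predict(model_stepwise_both, validation_woe, type = "response")
auc_step  <- auc(roc(validation_woe$def, pred_step))
gini_step <- 2 * as.numeric(auc_step) - 1   # Gini = 2 * AUC - 1

This is consistent with the comparison table in the next section, where for example the logit AUC of 0.8212 corresponds to a Gini of 0.6424.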



Model comparison

Quick description

In this section we will compare four models built using different methods:

  • Logistic Regression (from the previous section)
  • Random Forest
  • Adaptive Boosting
  • Extreme Gradient Boosting

Logit

Our logit model is the one obtained with the stepwise method; its scores are presented below:


Random Forest

Tuning led to the most efficient model having 100 trees, 5 features sampled at each split and a minimum node size of 1.
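
The report does not state which R implementation was used; with the ranger package such a model could be fit roughly as follows (a sketch under that assumption):

library(ranger)

rf_model <- ranger(def ~ ., data = train_woe,
                   num.trees = 100,       # 100 trees
                   mtry = 5,              # 5 features sampled at each split
                   min.node.size = 1,     # minimum node size of 1
                   probability = TRUE,    # return class probabilities
                   seed = 123)            # assumed seed

rf_pred <- predict(rf_model, data = validation_woe)$predictions[, "yes"]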


AdaBoost

It turned out that the optimal set of hyperparameters for the AdaBoost model is 80 trees (fewer than in the Random Forest model), a learning rate of 0.07 and a 0.7 subsample of the training set for each tree.
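
Again, the exact package is not stated in the text; one way to fit a boosted model with the AdaBoost loss and these hyperparameters is the gbm package (a sketch under that assumption; the interaction depth is our guess):

library(gbm)

# gbm expects a numeric 0/1 response for distribution = "adaboost"
train_gbm <- train_woe
train_gbm$def <- as.numeric(train_gbm$def == "yes")

ada_model <- gbm(def ~ ., data = train_gbm,
                 distribution = "adaboost",
                 n.trees = 80,           # 80 trees
                 shrinkage = 0.07,       # learning rate
                 bag.fraction = 0.7,     # 0.7 subsample per tree
                 interaction.depth = 3)  # assumed tree depth

# scores on the link scale, ranked the same way as probabilities
ada_pred <- predict(ada_model, validation_woe, n.trees = 80)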


XGBoost Model

The optimal set of hyperparameters for the XGBoost model is listed below (a code sketch using these values follows the list):

  • eta = 0.3
  • gamma = 0
  • max.depth = 4
  • max_delta_step = 0
  • subsample = 0.7
  • colsample_bytree = 0.85
  • lambda = 1
  • alpha = 0
  • scale_pos_weight = 4
  • nrounds = 20
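
A rough sketch of fitting such a model with the xgboost package (the conversion of the data to a numeric matrix and the binary logistic objective are our assumptions; the hyperparameters are the ones listed above):

library(xgboost)

# xgboost needs a numeric matrix and a 0/1 label
X_train <- model.matrix(def ~ . - 1, data = train_woe)
y_train <- as.numeric(train_woe$def == "yes")
dtrain  <- xgb.DMatrix(data = X_train, label = y_train)

params <- list(objective = "binary:logistic",
               eta = 0.3, gamma = 0, max_depth = 4, max_delta_step = 0,
               subsample = 0.7, colsample_bytree = 0.85,
               lambda = 1, alpha = 0, scale_pos_weight = 4)

xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 20)

X_valid  <- model.matrix(def ~ . - 1, data = validation_woe)
xgb_pred <- predict(xgb_model, xgb.DMatrix(X_valid))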


Comparison tables

All values are computed for validation set:

model_name       auc   gini k_s_stat    psi
Logit         0.8212 0.6424   0.4984 0.0013
Random_Forest 0.8142 0.6285   0.4931    Inf
Adaboost      0.8510 0.7019   0.5465    Inf
XGBoost       0.8538 0.7076   0.5477 0.0016

As you can see, XGBoost outperforms all the other methods. The Population Stability Index is lower than 0.1, so we can conclude that the model is stable (we suspect that for the Random Forest and AdaBoost models a division by zero occurred because many observations share the same score, so those PSI values are flawed). Looking at the Kolmogorov-Smirnov statistics, we can conclude that the score distributions of defaulting and non-defaulting clients are indeed different.
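
For reference, the PSI sums (actual% - expected%) * ln(actual% / expected%) over score buckets; an empty bucket in one of the samples makes the logarithm degenerate, which is a likely source of the Inf values above. A minimal sketch (equal-width buckets are our assumption):

# PSI between training-set scores (expected) and validation-set scores (actual)
psi <- function(expected, actual, n_buckets = 10) {
  breaks <- seq(min(expected), max(expected), length.out = n_buckets + 1)
  breaks[1] <- -Inf
  breaks[n_buckets + 1] <- Inf
  e <- as.numeric(table(cut(expected, breaks))) / length(expected)
  a <- as.numeric(table(cut(actual, breaks))) / length(actual)
  sum((a - e) * log(a / e))   # becomes Inf / NaN when a bucket share is zero
}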

Below we present Gini coefficients for the test set (never previously seen in the training process):

model_name      gini
Logit         0.6389
Random_Forest 0.6325
Adaboost      0.6862
XGBoost       0.6956

Once again XGBoost outperforms all the other methods, so we decided to build the scorecard on its scores.



Scorecard

Creating the cutoffs

Having decided that XGBoost is our go-to model, we use the whole dataset (train + validation + test sets) to divide the scores into risk intervals and thus create a scorecard.

Score buckets     No of clients  Goods  Bads Cumulative Goods Cumulative Bads Good rate Bad rate
0 - 473.05                19503  13406  6097            13406            6097    68.74%   31.26%
473.05 - 518.35           30355  27971  2384            41377            8481    92.15%    7.85%
518.35 - 565.69           29687  28804   883            70181            9364    97.03%    2.97%
565.69 - 589.53           29007  28600   407            98781            9771    98.60%    1.40%
589.53 - 609.47           21228  21050   178           119831            9949    99.16%    0.84%
609.47 and higher         20220  20143    77           139974           10026    99.62%    0.38%

Based on the score buckets, we would divide our clients into 6 groups, ideally with different pricing, for example different interest rates. We could also reject the riskiest clients from chosen buckets (presumably the first two) - all depending on the business approach and the desired exposure to risk.
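
A sketch of how such a bucket summary could be computed (the column names score and def, as well as the object name full_data, are our assumptions; the cut points are the ones shown in the table above):

library(dplyr)

# cut points taken from the scorecard table above
breaks <- c(-Inf, 473.05, 518.35, 565.69, 589.53, 609.47, Inf)

scorecard <- full_data %>%
  mutate(bucket = cut(score, breaks = breaks)) %>%
  group_by(bucket) %>%
  summarise(n     = n(),
            goods = sum(def == "no"),
            bads  = sum(def == "yes")) %>%
  mutate(cum_goods = cumsum(goods),
         cum_bads  = cumsum(bads),
         good_rate = goods / n,
         bad_rate  = bads / n)

scorecard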