In this project we conduct a credit risk analysis of a dataset of clients provided on Kaggle. The goal is to predict the probability that a client will experience financial distress within the next two years. The data is available under this link:
If you would like to see the source code for this analysis, it is provided in a GitHub repository - the link is in the top right corner.
The data comes from a Kaggle competition and consists of two initial datasets: a training set with 150,000 observations, each with a default indicator and 10 numerical features, and a test set with a little over 100,000 observations, but without the default indicator. The test set is used only for the Kaggle competition submission, so all modelling is done on the (appropriately split) training set.
We also had to change some variable names and remove the index variable (for clarity and because of naming restrictions imposed by the smbinning function).
First, we split the original training dataset (150,000 observations) into train/test sets used for training and final evaluation. The split is stratified by the default indicator, so that the percentage of defaults stays the same in both sets. This is essential because the dataset is very imbalanced - only about 7% of the observations are defaults. After the split we have a 0.9/0.1 train/test division, both parts with around 7% of defaults. The summary of the default indicator below confirms the imbalance (mean ≈ 0.067):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06684 0.00000 1.00000
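A minimal sketch of how such a stratified split can be done (the frame name `train_full`, the seed, and the use of caret::createDataPartition are assumptions for illustration, not necessarily the original code):

```r
library(caret)

set.seed(123)                                  # any fixed seed, for reproducibility
# createDataPartition samples within each level of the (factor) default flag,
# so the ~7% default rate is preserved in both parts
idx   <- createDataPartition(train_full$def, p = 0.9, list = FALSE)
train <- train_full[idx, ]
test  <- train_full[-idx, ]

prop.table(table(train$def))                   # ~0.93 / 0.07
prop.table(table(test$def))                    # ~0.93 / 0.07
```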
Missing values occur in only two columns: 26,716 NAs in MonthlyIncome and 3,543 in NumberOfDependents (the number of people, e.g. family members, who depend financially on the analysed customer). We decide to impute the training-set mean of MonthlyIncome in place of the NAs (the same mean is used for the test set, so there is no data leakage) and to create a new binary variable indicating that MonthlyIncome was missing. We do the same with NumberOfDependents, but using the mode instead of the mean.
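A sketch of that imputation step (the frame names `train`/`test` and the flag column names are assumptions; the original helper code may differ):

```r
# most frequent value of a vector (mode), used for NumberOfDependents
stat_mode <- function(x) as.numeric(names(which.max(table(x))))

# statistics are computed on the training set only and reused on the test set,
# so no information from the test set leaks into the model
income_mean <- mean(train$MonthlyIncome, na.rm = TRUE)
dep_mode    <- stat_mode(train$NumberOfDependents)

impute_missing <- function(df) {
  # factor flags marking observations where the value was originally missing
  df$MonthlyIncomeNA      <- factor(ifelse(is.na(df$MonthlyIncome), "yes", "no"))
  df$NumberOfDependentsNA <- factor(ifelse(is.na(df$NumberOfDependents), "yes", "no"))
  df$MonthlyIncome[is.na(df$MonthlyIncome)]           <- income_mean
  df$NumberOfDependents[is.na(df$NumberOfDependents)] <- dep_mode
  df
}

train <- impute_missing(train)
test  <- impute_missing(test)
```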
Therefore we end up with 10 numerical variables and 2 factors.
Then we have a look at how each variable is distributed using histograms:
We can see that some of the variables have outliers which, after further inspection, might bias our estimates, so we decided to replace them with different values.
We treat observations lying more than 2 × IQR below the 5th percentile or above the 95th percentile as outliers and impute them with the 5th and 95th percentile respectively. Using this method we changed values in 3 variables (RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome), roughly 8,000 values in total.
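A sketch of this capping rule (the thresholds follow the description above; the loop over the three affected variables is an assumption about how it was applied):

```r
# cap values lying more than 2 * IQR below the 5th / above the 95th percentile
cap_outliers <- function(x, p05, p95, iqr) {
  x[x < p05 - 2 * iqr] <- p05               # low outliers  -> 5th percentile
  x[x > p95 + 2 * iqr] <- p95               # high outliers -> 95th percentile
  x
}

outlier_vars <- c("RevolvingUtilizationOfUnsecuredLines", "DebtRatio", "MonthlyIncome")
for (v in outlier_vars) {
  p05 <- quantile(train[[v]], 0.05)
  p95 <- quantile(train[[v]], 0.95)
  iqr <- IQR(train[[v]])
  train[[v]] <- cap_outliers(train[[v]], p05, p95, iqr)
  test[[v]]  <- cap_outliers(test[[v]],  p05, p95, iqr)   # same thresholds as on train
}
```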
After analyzing the histograms and drawing on our own experience, we decided to bin 4 variables using the smbinning package:
These variables seem to be good candidates - they have low variance, but values above a certain threshold might be very indicative.
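As an illustration, a binning call for the first of these variables might look roughly as follows (the 0/1 target column `fgood`, the renamed predictor and the choice of `p = 0.1` - which forces every bin to hold at least 10% of records and therefore caps the number of bins at 10 - are assumptions):

```r
library(smbinning)

# smbinning needs the target as an integer 0/1 column where 1 = "good";
# here we assume non-default is coded as good, matching the tables below
train$fgood <- as.integer(train$def == 0)

bin_dpd <- smbinning(df = train, y = "fgood",
                     x = "NumberOfTimesPastDue30_59",   # hypothetical renamed variable
                     p = 0.1)

bin_dpd$ivtable                        # WoE / IV table of the kind shown below
smbinning.plot(bin_dpd, option = "WoE")
```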
Plots and tables for the splits are shown below. We set smbinning to split each variable into at most 10 intervals.
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 0 | 113446 | 108898 | 4548 | 0.9599 | 0.0401 | 23.9442 | 3.1757 | 0.5394 | 0.1944
2 | <= Inf | 21554 | 17079 | 4475 | 0.7924 | 0.2076 | 3.8165 | 1.3393 | -1.2970 | 0.4674
4 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.6618
As we can see, the first variable is split into two intervals - equal to zero and above zero. That split is quite intuitive from our experience - people who have already been 30+ days past due are more likely to default. The Information Value of this variable after binning is 0.6618 and it visibly separates the population into Goods and Bads (hence the much higher BadRate in the second interval).
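For reference, the WoE and IV columns in these tables follow the usual definitions:

$$
\mathrm{WoE}_i = \ln\!\left(\frac{\mathrm{Good}_i / \mathrm{Good}_{\mathrm{total}}}{\mathrm{Bad}_i / \mathrm{Bad}_{\mathrm{total}}}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{\mathrm{Good}_i}{\mathrm{Good}_{\mathrm{total}}} - \frac{\mathrm{Bad}_i}{\mathrm{Bad}_{\mathrm{total}}}\right)\mathrm{WoE}_i .
$$

For the first bin above, $\ln\big((108898/125977)/(4548/9023)\big) \approx 0.5394$, which matches the WoE column.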
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 3 | 19865 | 17730 | 2135 | 0.8925 | 0.1075 | 8.3044 | 2.1168 | -0.5195 | 0.0498
2 | <= 13 | 95328 | 89823 | 5505 | 0.9423 | 0.0577 | 16.3166 | 2.7922 | 0.1559 | 0.0160
3 | <= Inf | 19807 | 18424 | 1383 | 0.9302 | 0.0698 | 13.3218 | 2.5894 | -0.0469 | 0.0003
5 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.0661
The second variable, the number of open credit lines and loans, was split into 3 intervals - at most 3, between 4 and 13, and above 13. That split is less intuitive - according to our data, people with very few and people with more than 13 ongoing liabilities are the most likely to default, while people with between 4 and 13 open loans/credit lines are rather good clients. The IV is 0.0661, so the variable carries some, although much weaker, predictive power.
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 0 | 50574 | 46358 | 4216 | 0.9166 | 0.0834 | 10.9957 | 2.3975 | -0.2388 | 0.0237
2 | <= 1 | 47053 | 44603 | 2450 | 0.9479 | 0.0521 | 18.2053 | 2.9017 | 0.2654 | 0.0219
3 | <= Inf | 37373 | 35016 | 2357 | 0.9369 | 0.0631 | 14.8562 | 2.6984 | 0.0621 | 0.0010
5 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.0466
Here we can see that clients with exactly 1 mortgage are good clients, while having no mortgage or more than 1 mortgage indicates that the client is more inclined to default. The IV of 0.0466 indicates non-negligible, although weak, predictive ability.
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 0.4234 | 75711 | 71244 | 4467 | 0.9410 | 0.0590 | 15.9490 | 2.7694 | 0.1331 | 0.0094
2 | <= 0.6447 | 18587 | 17075 | 1512 | 0.9187 | 0.0813 | 11.2930 | 2.4242 | -0.2121 | 0.0068
3 | <= 3.9724 | 13689 | 12123 | 1566 | 0.8856 | 0.1144 | 7.7414 | 2.0466 | -0.5897 | 0.0456
4 | <= 1267 | 13501 | 12687 | 814 | 0.9397 | 0.0603 | 15.5860 | 2.7464 | 0.1101 | 0.0012
5 | <= Inf | 13512 | 12848 | 664 | 0.9509 | 0.0491 | 19.3494 | 2.9627 | 0.3263 | 0.0093
7 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.0723
Finally, the Debt Ratio - surprisingly, a very high DR indicates that the client is less likely to default according to the data. This is very counter-intuitive and may point to some data quality issues. The worst clients are the ones with a DR between roughly 0.64 and 3.97. The IV of 0.0723 is the second highest among the four binned variables - higher than for the open-credit-lines and mortgage variables, but well below the 0.6618 of the past-due variable.
As we can see, there are no strong correlations between the variables, so we keep all of them for the further analysis.
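A minimal sketch of such a correlation check (corrplot is one option; the original code may visualise it differently):

```r
library(corrplot)

# correlation matrix of the numeric predictors only
num_vars <- sapply(train, is.numeric)
corr_mat <- cor(train[, num_vars])

corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.7)
```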
Now we further split the training set into the actual training dataset and a validation set - we will train models on the training set and tune them against their performance on the validation set.
We once again use a 0.9/0.1 split, this time into train/validation sets.
Since our dataset is very imbalanced (about 7% of observations are defaults), we try oversampling with SMOTE (Synthetic Minority Oversampling Technique). Using the SMOTE function from the DMwR package we can increase the number of default observations in our training set with synthetic examples (no synthetic observations are added to the validation set).
# # kept commented out - the SMOTE-augmented set is not used in the final models
# library(dplyr)                                   # for distinct()
#
# train_woe$ID <- seq.int(nrow(train_woe))         # temporary ID so distinct() can tell real rows apart
# temp_train <- DMwR::SMOTE(def ~ ., train_woe, perc.over = 50, k = 3)
# temp_train <- subset(temp_train, def == "yes")   # keep only the minority (real + synthetic) rows
# temp_train <- rbind(train_woe, temp_train)       # append them to the full training set
# train_woe_smote <- distinct(temp_train)          # drop exact duplicates of the real defaults
# train_woe$ID <- NULL
# train_woe_smote$ID <- NULL
# rm(temp_train)
#
# summary(train_woe$def)                           # class counts before ...
# summary(train_woe_smote$def)                     # ... and after oversampling
Despite our best efforts, adding observations obtained by synthetic oversampling did not improve the models' predictive ability in the further analysis, so we decided not to use it in this report.
rm(list=ls(all=TRUE))
load("data/train_woe.Rdata")
load("data/validation_woe.Rdata")
load("models/logit_model.Rdata")
source("model_functions.R")
First, we estimate two basic models (one with the intercept only and one with all variables) - these will be our benchmarks when assessing the final logit model.
# intercept-only benchmark
baza <- glm(def ~ 1, data = train_woe, family = binomial("logit"))
# full model with all available predictors
max  <- glm(def ~ ., data = train_woe, family = binomial("logit"))
Then we use stepwise selection (in both directions) to choose only statistically significant variables. We show only the code here, because the model was computed earlier and loaded in the previous chunk.
# # k = 4 applies a stricter penalty per parameter than the default AIC (k = 2)
# model_stepwise_both <- step(baza, scope = list(upper = max, lower = baza),
#                             direction = "both", trace = TRUE, steps = 30, k = 4)
# summary(model_stepwise_both)
## [1] "Goodness Of Fit test p-value: 1"
## [1] "Goodness Of Fit test p-value: 0"
## [1] "Hosmer-Lemeshow test p-value: 0"
The first goodness-of-fit test does not reject the model (p-value of 1), while the second test and the Hosmer-Lemeshow test formally reject it (p-values of 0); with a sample of this size the Hosmer-Lemeshow test is known to reject even well-calibrated models, so we still consider the fit acceptable. The variables retained by the stepwise procedure are statistically significant.
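These p-values are printed by helper functions from model_functions.R; as a rough sketch, the Hosmer-Lemeshow statistic could be obtained with the ResourceSelection package (assuming the fitted stepwise model object is called model_stepwise_both):

```r
library(ResourceSelection)

# observed 0/1 outcomes vs. fitted probabilities, grouped into 10 bins
hl <- hoslem.test(model_stepwise_both$y, fitted(model_stepwise_both), g = 10)
hl$p.value
```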
Now we compare the ROC curves of the base and max models with the model chosen by the stepwise method.
## [1] "Roc test p-value for stepwise and base: 0"
## [1] "Roc test p-value for stepwise and max: 0.00382507331587263"
The p-values of both tests are lower than the significance level of 0.05, so we reject the null hypotheses in favour of the alternatives and conclude that the ROC curves (and hence the AUCs) differ significantly.
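A sketch of how such a comparison can be run with the pROC package (assuming the loaded validation_woe frame and the model objects defined above; the exact helper in model_functions.R may differ):

```r
library(pROC)

# predicted probabilities of default on the validation set
p_step <- predict(model_stepwise_both, newdata = validation_woe, type = "response")
p_max  <- predict(max,                 newdata = validation_woe, type = "response")

roc_step <- roc(validation_woe$def, p_step)
roc_max  <- roc(validation_woe$def, p_max)

# DeLong test for the difference between the two AUCs
roc.test(roc_step, roc_max)
```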
## [1] "Gini coefficients on the validation set"
## [1] "Gini coefficient for stepwise model: 0.64236008320048"
## [1] "Gini coefficient for max model: 0.642370995431295"
## [1] "Gini coefficient for base model: 0"
As we can see, the model chosen by the stepwise method achieves practically the same Gini as the full model while using fewer variables, and clearly outperforms the intercept-only benchmark - therefore we will use it in the further analysis.
In this section we compare four models built using different methods:
Our logit model is the one we obtained using the stepwise method; its scores are presented below:
For the Random Forest, tuning led to the best-performing model with 100 trees, 5 features sampled at each split, and a minimum node size of 1.
The optimal set of hyperparameters for the AdaBoost model turned out to be 80 trees (fewer than in the Random Forest model), a learning rate of 0.07, and subsampling of 0.7 of the training set for each tree.
The optimal set of hyperparameters for XGBoost is:
All values below are computed on the validation set:
Model | AUC | Gini | K-S statistic | PSI |
---|---|---|---|---|
Logit | 0.8212 | 0.6424 | 0.4984 | 0.0013 |
Random_Forest | 0.8142 | 0.6285 | 0.4931 | Inf |
Adaboost | 0.8510 | 0.7019 | 0.5465 | Inf |
XGBoost | 0.8538 | 0.7076 | 0.5477 | 0.0016 |
As we can see, XGBoost outperforms all the other methods. The Population Stability Index is lower than 0.1 for the Logit and XGBoost models, so we can conclude that these models are stable (in the case of the Random Forest and AdaBoost models we suspect a division by zero occurred because many observations share the same score, so the resulting PSI is not meaningful). The Kolmogorov-Smirnov statistics let us conclude that the score distributions of defaulters and non-defaulters are indeed different.
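A minimal sketch of a PSI computation (a hypothetical helper, not the one in model_functions.R; it also shows why bins that end up empty in one of the samples push the index to infinity):

```r
# Population Stability Index between a baseline (e.g. training) score vector
# and a comparison (e.g. validation) score vector, binned on baseline deciles
psi <- function(base, comp, bins = 10) {
  cuts <- unique(quantile(base, probs = seq(0, 1, length.out = bins + 1)))
  cuts[1] <- -Inf
  cuts[length(cuts)] <- Inf
  p <- as.numeric(table(cut(base, cuts))) / length(base)   # expected share per bin
  q <- as.numeric(table(cut(comp, cuts))) / length(comp)   # actual share per bin
  # a bin with zero share in either sample makes log(p / q) infinite, which is
  # what we suspect happened for the Random Forest and AdaBoost scores above
  sum((p - q) * log(p / q))
}
```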
Below we present Gini coefficients for the test set (never previously seen in the training process):
Model | Gini |
---|---|
Logit | 0.6389 |
Random_Forest | 0.6325 |
Adaboost | 0.6862 |
XGBoost | 0.6956 |
Once again XGBoost outperforms all the other methods, so we decide to build the scorecard on its scores.
Having decided that XGBoost is our model of choice, we use the whole dataset (train + validation + test sets) to divide the scores into risk intervals and thus create a scorecard.
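The bucket table below can be produced along these lines (a sketch only: the `scored` frame, the column names and the use of dplyr are assumptions; the cut points are the ones reported in the table):

```r
library(dplyr)

# `scored` is assumed to hold the final XGBoost score and the default flag
breaks <- c(-Inf, 473.05, 518.35, 565.69, 589.53, 609.47, Inf)

scorecard <- scored %>%
  mutate(bucket = cut(score, breaks = breaks, dig.lab = 6)) %>%
  group_by(bucket) %>%
  summarise(clients = n(),
            goods   = sum(def == "no"),
            bads    = sum(def == "yes")) %>%
  mutate(cum_goods = cumsum(goods),
         cum_bads  = cumsum(bads),
         good_rate = goods / clients,
         bad_rate  = bads / clients)
```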
Score buckets | No of clients | Goods | Bads | Cumulative Goods | Cumulative Bads | Good rate | Bad rate |
---|---|---|---|---|---|---|---|
0 - 473.05 | 19503 | 13406 | 6097 | 13406 | 6097 | 68.74% | 31.26% |
473.05 - 518.35 | 30355 | 27971 | 2384 | 41377 | 8481 | 92.15% | 7.85% |
518.35 - 565.69 | 29687 | 28804 | 883 | 70181 | 9364 | 97.03% | 2.97% |
565.69 - 589.53 | 29007 | 28600 | 407 | 98781 | 9771 | 98.60% | 1.40% |
589.53 - 609.47 | 21228 | 21050 | 178 | 119831 | 9949 | 99.16% | 0.84% |
609.47 and higher | 20220 | 20143 | 77 | 139974 | 10026 | 99.62% | 0.38% |
Based on the score buckets we would divide our clients into 6 groups, ideally with different pricing, for example different interest rates. We could also reject the riskiest clients from selected buckets (presumably the first two) - all depending on the business approach and the desired exposure to risk.