In this project we conduct a credit risk analysis of a dataset of clients provided on Kaggle. The goal is to predict the probability that a client will experience financial distress within the next two years. The data is available under this link:
If you would like to see the source code for this analysis, it is provided in a GitHub repository - the link is in the top right corner.
The data comes from a Kaggle competition and consists of two initial datasets: a training set with 150,000 observations, each with a default indicator and 10 numerical features, and a test set with a little over 100,000 observations, but without the default indicator. The test set is used only for the Kaggle competition submission, so all modelling is done on the (appropriately split) training set.
We also had to change some variable names and remove the index variable (for clarity and because of naming restrictions imposed by the smbinning function).
First, we split the original training dataset (150,000 observations) into train/test sets used for training and final evaluation. The split is stratified by the default indicator, so that the percentage of defaults stays the same in both sets. This is essential because the dataset is very imbalanced - only about 7% of the observations are defaults. After the split we have a 0.9/0.1 train/test division, both parts with around 7% of defaults. The summary of the default indicator below confirms the imbalance (mean ≈ 0.067):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06684 0.00000 1.00000
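A minimal sketch of how such a stratified split can be done (the frame name `train_full`, the seed, and the use of caret::createDataPartition are assumptions for illustration, not necessarily the original code):

```r
library(caret)

set.seed(123)                                  # any fixed seed, for reproducibility
# createDataPartition samples within each level of the (factor) default flag,
# so the ~7% default rate is preserved in both parts
idx   <- createDataPartition(train_full$def, p = 0.9, list = FALSE)
train <- train_full[idx, ]
test  <- train_full[-idx, ]

prop.table(table(train$def))                   # ~0.93 / 0.07
prop.table(table(test$def))                    # ~0.93 / 0.07
```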
Missing values occur in only two columns: 26,716 NAs in MonthlyIncome and 3,543 in NumberOfDependents (the number of people, e.g. family members, who depend financially on the analysed customer). We decide to impute the training-set mean of MonthlyIncome in place of the NAs (the same mean is used for the test set, so there is no data leakage) and to create a new binary variable indicating that MonthlyIncome was missing. We do the same with NumberOfDependents, but using the mode instead of the mean.
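A sketch of that imputation step (the frame names `train`/`test` and the flag column names are assumptions; the original helper code may differ):

```r
# most frequent value of a vector (mode), used for NumberOfDependents
stat_mode <- function(x) as.numeric(names(which.max(table(x))))

# statistics are computed on the training set only and reused on the test set,
# so no information from the test set leaks into the model
income_mean <- mean(train$MonthlyIncome, na.rm = TRUE)
dep_mode    <- stat_mode(train$NumberOfDependents)

impute_missing <- function(df) {
  # factor flags marking observations where the value was originally missing
  df$MonthlyIncomeNA      <- factor(ifelse(is.na(df$MonthlyIncome), "yes", "no"))
  df$NumberOfDependentsNA <- factor(ifelse(is.na(df$NumberOfDependents), "yes", "no"))
  df$MonthlyIncome[is.na(df$MonthlyIncome)]           <- income_mean
  df$NumberOfDependents[is.na(df$NumberOfDependents)] <- dep_mode
  df
}

train <- impute_missing(train)
test  <- impute_missing(test)
```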
Therefore we end up with 10 numerical variables and 2 factors.
Then we have a look at how each variable is distributed using histograms:
We can see that some of the variables have outliers which, after further inspection, might bias our estimates, so we decided to replace them with different values.
We treat observations lying more than 2 × IQR below the 5th percentile or above the 95th percentile as outliers and impute them with the 5th and 95th percentile respectively. Using this method we changed values in 3 variables (RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome), roughly 8,000 values in total.
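A sketch of this capping rule (the thresholds follow the description above; the loop over the three affected variables is an assumption about how it was applied):

```r
# cap values lying more than 2 * IQR below the 5th / above the 95th percentile
cap_outliers <- function(x, p05, p95, iqr) {
  x[x < p05 - 2 * iqr] <- p05               # low outliers  -> 5th percentile
  x[x > p95 + 2 * iqr] <- p95               # high outliers -> 95th percentile
  x
}

outlier_vars <- c("RevolvingUtilizationOfUnsecuredLines", "DebtRatio", "MonthlyIncome")
for (v in outlier_vars) {
  p05 <- quantile(train[[v]], 0.05)
  p95 <- quantile(train[[v]], 0.95)
  iqr <- IQR(train[[v]])
  train[[v]] <- cap_outliers(train[[v]], p05, p95, iqr)
  test[[v]]  <- cap_outliers(test[[v]],  p05, p95, iqr)   # same thresholds as on train
}
```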
After analyzing the histograms and drawing on our own experience, we decided to bin 4 variables using the smbinning package:
These variables seem to be good candidates - they have low variance, but values above a certain threshold might be very indicative.
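As an illustration, a binning call for the first of these variables might look roughly as follows (the 0/1 target column `fgood`, the renamed predictor and the choice of `p = 0.1` - which forces every bin to hold at least 10% of records and therefore caps the number of bins at 10 - are assumptions):

```r
library(smbinning)

# smbinning needs the target as an integer 0/1 column where 1 = "good";
# here we assume non-default is coded as good, matching the tables below
train$fgood <- as.integer(train$def == 0)

bin_dpd <- smbinning(df = train, y = "fgood",
                     x = "NumberOfTimesPastDue30_59",   # hypothetical renamed variable
                     p = 0.1)

bin_dpd$ivtable                        # WoE / IV table of the kind shown below
smbinning.plot(bin_dpd, option = "WoE")
```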
Plots and tables for the splits are shown below. We set smbinning to split each variable into at most 10 intervals.
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 0 | 113446 | 108898 | 4548 | 0.9599 | 0.0401 | 23.9442 | 3.1757 | 0.5394 | 0.1944
2 | <= Inf | 21554 | 17079 | 4475 | 0.7924 | 0.2076 | 3.8165 | 1.3393 | -1.2970 | 0.4674
4 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.6618
As we can see, the first variable is split into two intervals - equal to zero and above zero. That split is quite intuitive from our experience - people who have already been 30+ days past due are more likely to default. The Information Value of this variable after binning is 0.6618 and it visibly separates the population into Goods and Bads (hence the much higher BadRate in the second interval).
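For reference, the WoE and IV columns in these tables follow the usual definitions:

$$
\mathrm{WoE}_i = \ln\!\left(\frac{\mathrm{Good}_i / \mathrm{Good}_{\mathrm{total}}}{\mathrm{Bad}_i / \mathrm{Bad}_{\mathrm{total}}}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{\mathrm{Good}_i}{\mathrm{Good}_{\mathrm{total}}} - \frac{\mathrm{Bad}_i}{\mathrm{Bad}_{\mathrm{total}}}\right)\mathrm{WoE}_i .
$$

For the first bin above, $\ln\big((108898/125977)/(4548/9023)\big) \approx 0.5394$, which matches the WoE column.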
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 3 | 19865 | 17730 | 2135 | 0.8925 | 0.1075 | 8.3044 | 2.1168 | -0.5195 | 0.0498
2 | <= 13 | 95328 | 89823 | 5505 | 0.9423 | 0.0577 | 16.3166 | 2.7922 | 0.1559 | 0.0160
3 | <= Inf | 19807 | 18424 | 1383 | 0.9302 | 0.0698 | 13.3218 | 2.5894 | -0.0469 | 0.0003
5 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.0661
The second variable, the number of open credit lines and loans, was split into 3 intervals - at most 3, between 4 and 13, and above 13. That split is less intuitive - according to our data, people with very few and people with more than 13 ongoing liabilities are the most likely to default, while people with between 4 and 13 open loans/credit lines are rather good clients. The IV is 0.0661, so the variable carries some, although much weaker, predictive power.
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 0 | 50574 | 46358 | 4216 | 0.9166 | 0.0834 | 10.9957 | 2.3975 | -0.2388 | 0.0237
2 | <= 1 | 47053 | 44603 | 2450 | 0.9479 | 0.0521 | 18.2053 | 2.9017 | 0.2654 | 0.0219
3 | <= Inf | 37373 | 35016 | 2357 | 0.9369 | 0.0631 | 14.8562 | 2.6984 | 0.0621 | 0.0010
5 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.0466
Here we can see that clients with exactly 1 mortgage are good clients, while having no mortgage or more than 1 mortgage indicates that the client is more inclined to default. The IV of 0.0466 indicates non-negligible, although weak, predictive ability.
Bin | Cutpoint | CntRec | CntGood | CntBad | GoodRate | BadRate | Odds | LnOdds | WoE | IV
---|---|---|---|---|---|---|---|---|---|---
1 | <= 0.4234 | 75711 | 71244 | 4467 | 0.9410 | 0.0590 | 15.9490 | 2.7694 | 0.1331 | 0.0094
2 | <= 0.6447 | 18587 | 17075 | 1512 | 0.9187 | 0.0813 | 11.2930 | 2.4242 | -0.2121 | 0.0068
3 | <= 3.9724 | 13689 | 12123 | 1566 | 0.8856 | 0.1144 | 7.7414 | 2.0466 | -0.5897 | 0.0456
4 | <= 1267 | 13501 | 12687 | 814 | 0.9397 | 0.0603 | 15.5860 | 2.7464 | 0.1101 | 0.0012
5 | <= Inf | 13512 | 12848 | 664 | 0.9509 | 0.0491 | 19.3494 | 2.9627 | 0.3263 | 0.0093
7 | Total | 135000 | 125977 | 9023 | 0.9332 | 0.0668 | 13.9618 | 2.6363 | 0.0000 | 0.0723
Finally, the Debt Ratio - surprisingly, a very high DR indicates that the client is less likely to default according to the data. This is very counter-intuitive and may point to some data quality issues. The worst clients are the ones with a DR between roughly 0.64 and 3.97. The IV of 0.0723 is the second highest among the four binned variables - higher than for the open-credit-lines and mortgage variables, but well below the 0.6618 of the past-due variable.
As we can see, there are no strong correlations between the variables, so we keep all of them for the further analysis.
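A minimal sketch of such a correlation check (corrplot is one option; the original code may visualise it differently):

```r
library(corrplot)

# correlation matrix of the numeric predictors only
num_vars <- sapply(train, is.numeric)
corr_mat <- cor(train[, num_vars])

corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.7)
```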
Now we further split the training set into the actual training dataset and a validation set - we will train models on the training set and tune them against their performance on the validation set.
We once again use a 0.9/0.1 split, this time into train/validation sets.
Since our dataset is very imbalanced (about 7% of observations are defaults), we try oversampling with SMOTE (Synthetic Minority Oversampling Technique). Using the SMOTE function from the DMwR package we can increase the number of default observations in our training set with synthetic examples (no synthetic observations are added to the validation set).
# # kept commented out - the SMOTE-augmented set is not used in the final models
# library(dplyr)                                   # for distinct()
#
# train_woe$ID <- seq.int(nrow(train_woe))         # temporary ID so distinct() can tell real rows apart
# temp_train <- DMwR::SMOTE(def ~ ., train_woe, perc.over = 50, k = 3)
# temp_train <- subset(temp_train, def == "yes")   # keep only the minority (real + synthetic) rows
# temp_train <- rbind(train_woe, temp_train)       # append them to the full training set
# train_woe_smote <- distinct(temp_train)          # drop exact duplicates of the real defaults
# train_woe$ID <- NULL
# train_woe_smote$ID <- NULL
# rm(temp_train)
#
# summary(train_woe$def)                           # class counts before ...
# summary(train_woe_smote$def)                     # ... and after oversampling
Despite our best efforts, adding observations obtained by synthetic oversampling did not improve the models' predictive ability in the further analysis, so we decided not to use it in this report.
rm(list=ls(all=TRUE))
load("data/train_woe.Rdata")
load("data/validation_woe.Rdata")
load("models/logit_model.Rdata")
source("model_functions.R")
First, we estimate two basic models (one with the intercept only and one with all variables) - these will be our benchmarks when assessing the final logit model.
# intercept-only benchmark
baza <- glm(def ~ 1, data = train_woe, family = binomial("logit"))
# full model with all available predictors
max  <- glm(def ~ ., data = train_woe, family = binomial("logit"))
Then we use stepwise selection (in both directions) to choose only statistically significant variables. We show only the code here, because the model was computed earlier and loaded in the previous chunk.
# # k = 4 applies a stricter penalty per parameter than the default AIC (k = 2)
# model_stepwise_both <- step(baza, scope = list(upper = max, lower = baza),
#                             direction = "both", trace = TRUE, steps = 30, k = 4)
# summary(model_stepwise_both)
## [1] "Goodness Of Fit test p-value: 1"
## [1] "Goodness Of Fit test p-value: 0"
## [1] "Hosmer-Lemeshow test p-value: 0"
The first goodness-of-fit test does not reject the model (p-value of 1), while the second test and the Hosmer-Lemeshow test formally reject it (p-values of 0); with a sample of this size the Hosmer-Lemeshow test is known to reject even well-calibrated models, so we still consider the fit acceptable. The variables retained by the stepwise procedure are statistically significant.
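These p-values are printed by helper functions from model_functions.R; as a rough sketch, the Hosmer-Lemeshow statistic could be obtained with the ResourceSelection package (assuming the fitted stepwise model object is called model_stepwise_both):

```r
library(ResourceSelection)

# observed 0/1 outcomes vs. fitted probabilities, grouped into 10 bins
hl <- hoslem.test(model_stepwise_both$y, fitted(model_stepwise_both), g = 10)
hl$p.value
```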
Now we compare the ROC curves of the base and max models with the model chosen by the stepwise method.
## [1] "Roc test p-value for stepwise and base: 0"
## [1] "Roc test p-value for stepwise and max: 0.00382507331587263"
The p-values of both tests are lower than the significance level of 0.05, so we reject the null hypotheses in favour of the alternatives and conclude that the ROC curves (and hence the AUCs) differ significantly.
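A sketch of how such a comparison can be run with the pROC package (assuming the loaded validation_woe frame and the model objects defined above; the exact helper in model_functions.R may differ):

```r
library(pROC)

# predicted probabilities of default on the validation set
p_step <- predict(model_stepwise_both, newdata = validation_woe, type = "response")
p_max  <- predict(max,                 newdata = validation_woe, type = "response")

roc_step <- roc(validation_woe$def, p_step)
roc_max  <- roc(validation_woe$def, p_max)

# DeLong test for the difference between the two AUCs
roc.test(roc_step, roc_max)
```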
## [1] "Gini coefficients on the validation set"
## [1] "Gini coefficient for stepwise model: 0.64236008320048"
## [1] "Gini coefficient for max model: 0.642370995431295"
## [1] "Gini coefficient for base model: 0"
As we can see, the model chosen by the stepwise method achieves practically the same Gini as the full model while using fewer variables, and clearly outperforms the intercept-only benchmark - therefore we will use it in the further analysis.
In this section we compare four models built using different methods:
Our logit model is the one we obtained using the stepwise method; its scores are presented below:
For the Random Forest, tuning led to the best-performing model with 100 trees, 5 features sampled at each split, and a minimum node size of 1.
The optimal set of hyperparameters for the AdaBoost model turned out to be 80 trees (fewer than in the Random Forest model), a learning rate of 0.07, and subsampling of 0.7 of the training set for each tree.
The optimal set of hyperparameters for XGBoost is:
All values below are computed on the validation set:
Model | AUC | Gini | K-S statistic | PSI |
---|---|---|---|---|
Logit | 0.8212 | 0.6424 | 0.4984 | 0.0013 |
Random_Forest | 0.8142 | 0.6285 | 0.4931 | Inf |
Adaboost | 0.8510 | 0.7019 | 0.5465 | Inf |
XGBoost | 0.8538 | 0.7076 | 0.5477 | 0.0016 |
As we can see, XGBoost outperforms all the other methods. The Population Stability Index is lower than 0.1 for the Logit and XGBoost models, so we can conclude that these models are stable (in the case of the Random Forest and AdaBoost models we suspect a division by zero occurred because many observations share the same score, so the resulting PSI is not meaningful). The Kolmogorov-Smirnov statistics let us conclude that the score distributions of defaulters and non-defaulters are indeed different.
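A minimal sketch of a PSI computation (a hypothetical helper, not the one in model_functions.R; it also shows why bins that end up empty in one of the samples push the index to infinity):

```r
# Population Stability Index between a baseline (e.g. training) score vector
# and a comparison (e.g. validation) score vector, binned on baseline deciles
psi <- function(base, comp, bins = 10) {
  cuts <- unique(quantile(base, probs = seq(0, 1, length.out = bins + 1)))
  cuts[1] <- -Inf
  cuts[length(cuts)] <- Inf
  p <- as.numeric(table(cut(base, cuts))) / length(base)   # expected share per bin
  q <- as.numeric(table(cut(comp, cuts))) / length(comp)   # actual share per bin
  # a bin with zero share in either sample makes log(p / q) infinite, which is
  # what we suspect happened for the Random Forest and AdaBoost scores above
  sum((p - q) * log(p / q))
}
```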
Below we present Gini coefficients for the test set (never previously seen in the training process):
Model | Gini |
---|---|
Logit | 0.6389 |
Random_Forest | 0.6325 |
Adaboost | 0.6862 |
XGBoost | 0.6956 |
Once again XGBoost outperforms all the other methods, so we decide to build the scorecard on its scores.
Having decided that XGBoost is our model of choice, we use the whole dataset (train + validation + test sets) to divide the scores into risk intervals and thus create a scorecard.
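The bucket table below can be produced along these lines (a sketch only: the `scored` frame, the column names and the use of dplyr are assumptions; the cut points are the ones reported in the table):

```r
library(dplyr)

# `scored` is assumed to hold the final XGBoost score and the default flag
breaks <- c(-Inf, 473.05, 518.35, 565.69, 589.53, 609.47, Inf)

scorecard <- scored %>%
  mutate(bucket = cut(score, breaks = breaks, dig.lab = 6)) %>%
  group_by(bucket) %>%
  summarise(clients = n(),
            goods   = sum(def == "no"),
            bads    = sum(def == "yes")) %>%
  mutate(cum_goods = cumsum(goods),
         cum_bads  = cumsum(bads),
         good_rate = goods / clients,
         bad_rate  = bads / clients)
```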
Score buckets | No of clients | Goods | Bads | Cumulative Goods | Cumulative Bads | Good rate | Bad rate |
---|---|---|---|---|---|---|---|
0 - 473.05 | 19503 | 13406 | 6097 | 13406 | 6097 | 68.74% | 31.26% |
473.05 - 518.35 | 30355 | 27971 | 2384 | 41377 | 8481 | 92.15% | 7.85% |
518.35 - 565.69 | 29687 | 28804 | 883 | 70181 | 9364 | 97.03% | 2.97% |
565.69 - 589.53 | 29007 | 28600 | 407 | 98781 | 9771 | 98.60% | 1.40% |
589.53 - 609.47 | 21228 | 21050 | 178 | 119831 | 9949 | 99.16% | 0.84% |
609.47 and higher | 20220 | 20143 | 77 | 139974 | 10026 | 99.62% | 0.38% |
Based on the score buckets we would divide our clients into 6 groups, ideally with different pricing, for example different interest rates. We could also reject the riskiest clients from selected buckets (presumably the first two) - all depending on the business approach and the desired exposure to risk.