Abstract

The main aim of this research is to compare traditional logistic regression with random forest for the purpose of credit scorecard building. This research will be based on empirical results only. Firstly, we will estimate a logistic regression model in accordance with best practices. Then we will train a Random Forest classifier. Finally, we will compare the predictions of both approaches using various performance measures.

Dataset description

The dataset used in this research is the well-known credit-g dataset, available at OpenML (https://www.openml.org/d/31). The original dataset contains one target variable, class, which describes whether the customer is good or bad, and 20 attributes:

  1. checking_status - Status of existing checking account, in Deutsche Mark.
  2. duration - Duration in months
  3. credit_history - Credit history (credits taken, paid back duly, delays, critical accounts)
  4. purpose - Purpose of the credit (car, television, …)
  5. credit_amount - Credit amount
  6. savings_status - Status of savings account/bonds, in Deutsche Mark.
  7. employment - Present employment, in number of years.
  8. installment_commitment - Installment rate in percentage of disposable income
  9. personal_status - Personal status (married, single, …) and sex
  10. other_parties - Other debtors / guarantors
  11. residence_since - Present residence since X years
  12. property_magnitude - Property (e.g. real estate)
  13. age - Age in years
  14. other_payment_plans - Other installment plans (banks, stores)
  15. housing - Housing (rent, own, …)
  16. existing_credits - Number of existing credits at this bank
  17. job - Job
  18. num_dependents - Number of people being liable to provide maintenance for
  19. own_telephone - Telephone (yes,no)
  20. foreign_worker - Foreign worker (yes,no)

The dataset contains data for 1000 customers, of which 700 are good and 300 are bad. This means that the data is not strongly imbalanced and that there are enough bad clients to estimate/train a reliable model.

To build a scorecard model based on logistic regression, coarse classing of the data is needed. Firstly, continuous variables were binned into factor variables, with the initial division into bins made with respect to percentiles, as sketched below. For some variables that were originally factors, similar categories were merged.
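As an illustration, here is a minimal R sketch of the percentile-based initial binning (the data frame name `credit` and the helper function are assumptions, not the original code):

percentile_bin <- function(x, n_bins = 10) {
  # breaks at the empirical percentiles; unique() guards against ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

credit$duration_bin      <- percentile_bin(credit$duration)
credit$credit_amount_bin <- percentile_bin(credit$credit_amount)
credit$age_bin           <- percentile_bin(credit$age)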

In the end, we obtained 20 factor variables. We then checked their correlation using Kendall's correlation measure; this is important in order not to introduce collinearity into the model. The result is plotted below.
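A minimal sketch of how such a correlation matrix can be computed and visualised (the data frame `credit_binned` and the use of the corrplot package are assumptions):

library(corrplot)

# factors are coerced to their integer codes before computing Kendall's tau
cor_mat <- cor(data.matrix(credit_binned), method = "kendall")
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.7)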

It seems that there will not be any problems with collinearity in the model. We might only consider removing either credit_history or existing_credits, as their correlation is around 50%.

Then, we calculated variable quality statistics, such as the Gini coefficient and the Information Value (IV):

No.  Variable                    IV     Gini
 1   checking_status         0.6660   0.4155
 2   duration                0.2465   0.2748
 3   credit_history          0.2918   0.2531
 4   purpose                 0.1487   0.2052
 5   savings_status          0.1961   0.1983
 6   credit_amount           0.1136   0.1796
 7   age                     0.1007   0.1777
 8   property_magnitude      0.1127   0.1707
 9   employment              0.0866   0.1616
10   housing                 0.0833   0.1344
11   personal_status         0.0446   0.1048
12   other_payment_plans     0.0576   0.0964
13   installment_commitment  0.0262   0.0868
14   other_parties           0.0320   0.0513
15   existing_credits        0.0115   0.0503
16   job                     0.0088   0.0429
17   own_telephone           0.0064   0.0390
18   foreign_worker          0.0439   0.0338
19   residence_since         0.0035   0.0324
20   num_dependents          0.0000   0.0024
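For reference, the Information Value of a binned variable is the sum over bins of (share of goods - share of bads) * WoE, where WoE = ln(share of goods / share of bads). A minimal R sketch (function name and inputs are illustrative; bins with zero goods or bads would need smoothing in practice):

information_value <- function(var, target) {
  tab      <- table(var, target)
  pct_good <- tab[, "good"] / sum(tab[, "good"])
  pct_bad  <- tab[, "bad"]  / sum(tab[, "bad"])
  woe      <- log(pct_good / pct_bad)   # WoE per bin
  sum((pct_good - pct_bad) * woe)       # IV of the variable
}

information_value(credit_binned$checking_status, credit_binned$class)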

We decided to include in the final dataset only those variables for which the Gini value is greater than 10%. This left us with 11 explanatory variables.

To divide the dataset into train and test samples, we used stratified sampling, so that the ratio between the number of good and bad clients is the same in the train and test samples.
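A sketch of this split, assuming the caret package (the 70/30 proportion follows from the 700-observation training sample; the seed is illustrative):

library(caret)

set.seed(123)
idx   <- createDataPartition(credit_binned$class, p = 0.7, list = FALSE)
train <- credit_binned[idx, ]
test  <- credit_binned[-idx, ]

# stratification keeps the good/bad ratio equal across the two samples
prop.table(table(train$class))
prop.table(table(test$class))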

Then, using the train sample, we performed supervised discretisation. We used Optimal Binning to categorise numeric variables into bins for scoring modelling. Only for age were bins not assigned by the algorithm, so they were created manually, based on the WoE of the previously created percentile bins. Factor variables were automatically merged into groups using the woeBinning package.
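A sketch of the woeBinning step (the variable selection, the 0/1 recoding of the target and the resulting column names are assumptions; see the package documentation for details):

library(woeBinning)

# woe.binning expects a dichotomous target; here bad is coded as 1
train$gb <- ifelse(train$class == "bad", 1, 0)

binning <- woe.binning(train, target.var = "gb",
                       pred.var = c("checking_status", "savings_status",
                                    "credit_history", "purpose"))
# attach the WoE-transformed variables to the training sample
train_woe <- woe.binning.deploy(train, binning, add.woe.or.dum.var = "woe")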

Charts for all analysed variables:

  1. WoE for final bins,

  2. Stability analysis - percentage of customers in each bin across train and test samples,

  3. Distribution (percentage) of all customers in each bin,

  4. Bad rate for each bin,

can be found in the PDF file on GitHub.

Logistic regression

Model specification

Firstly, two benchmark models were estimated: a model with a constant only (1) and a model with all the variables (2). These serve as references against which to assess whether the chosen model form is correct. To choose the proper functional form, a stepwise selection algorithm was used: in each step, a variable is considered for addition to, or removal from, the current set of variables, based on a defined information criterion. We used the step() function from the stats package, which applies the AIC criterion with the penalty per degree of freedom set to k = 2 by default; we decided to use the more conservative k = 4.
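A sketch of the three estimations (the data frame `train_woe`, holding the WoE-transformed variables, is an assumed name). The resulting coefficients are summarised in the table below, where (1) is the constant-only model, (2) the model with all variables and (3) the stepwise-selected model:

base_model <- glm(class ~ 1, data = train_woe, family = binomial)  # (1)
max_model  <- glm(class ~ ., data = train_woe, family = binomial)  # (2)

# stats::step penalises each degree of freedom with k = 2 by default;
# we use the more conservative k = 4
final_model <- step(base_model,
                    scope     = list(lower = base_model, upper = max_model),
                    direction = "both",
                    k         = 4)                                 # (3)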

                          Dependent variable: class
                        ------------------------------------
                           (1)         (2)         (3)
------------------------------------------------------------
age_woe                             -0.345
checking_status_woe                  0.008***    0.008***
credit_amount_woe                   -0.801***   -0.818***
credit_history_woe                   0.006***    0.007***
duration_woe                        -0.790***   -0.810***
employment_woe                       0.007***    0.009***
housing_woe                          0.005
personal_status_woe                  0.007
property_magnitude_woe               0.005
purpose_woe                          0.009***    0.009***
savings_status_woe                   0.007***    0.007***
Constant                -0.847***   -0.840***   -0.841***
------------------------------------------------------------
Observations               700         700         700
Log Likelihood          -427.605    -325.711    -329.370
Akaike Inf. Crit.        857.210     675.421     674.740
------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

Because we used the WoE transformation, we would expect all coefficients to be negative, as the higher the WoE, the lower the probability of default. The positive coefficients could be a problem from the business side; however, in this research we focus mainly on forecasts, so we leave the model as it is.

We can also see that the Akaike Information Criterion is smaller for the chosen model than for the base and max models, which suggests that our model fits the data better.

Quality assessment

To assess whether the model is specified correctly, we performed several tests of model quality.

Goodness of fit

Firstly, we conducted the Pearson chi-squared and deviance goodness-of-fit tests. Their null hypothesis states that the model is well fitted to the data. We obtained a p-value of 0.8135, so there is no evidence to reject the null hypothesis.

Then we performed an LR test of the joint significance of the variables. Here the null hypothesis states that the variables are jointly insignificant. The p-value of this test is approximately 0, so we reject the null hypothesis.

Afterwards, we performed the Hosmer-Lemeshow test, whose null hypothesis is that the model is well fitted to the data. As the test is very sensitive to the number of buckets used, we repeated it for 7, 8, 9 and 10 buckets, obtaining p-values of 0.4994, 0.321, 0.4129 and 0.3869, respectively. In each case we cannot reject the null hypothesis, which suggests that our specification is correct.
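A sketch of this test, assuming the ResourceSelection package (the 0/1 recoding of the target is an assumption):

library(ResourceSelection)

y_num <- ifelse(train_woe$class == "bad", 1, 0)
p_hat <- fitted(final_model)

# p-values of the Hosmer-Lemeshow test for 7 to 10 buckets
sapply(7:10, function(g) hoslem.test(y_num, p_hat, g = g)$p.value)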

Discriminatory/predictive power

Next, we assessed the discriminatory power of the model by estimating confidence intervals for the Gini index. The rule of thumb says that for a behavioral model the Gini index should be at least 60%. In our case the training sample achieves this level, but the test sample does not, which suggests a slight problem with overfitting.

               Lower limit     Mean   Upper limit
Train sample        0.5740   0.6390        0.7039
Test sample         0.3808   0.4944        0.6081
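These intervals can be obtained from the relation Gini = 2 * AUC - 1; a sketch with the pROC package (DeLong confidence intervals; `test_woe` is an assumed name for the WoE-transformed test sample):

library(pROC)

roc_train <- roc(train_woe$class, fitted(final_model))
roc_test  <- roc(test_woe$class,
                 predict(final_model, newdata = test_woe, type = "response"))

as.numeric(ci.auc(roc_train)) * 2 - 1  # lower limit, mean, upper limit: train
as.numeric(ci.auc(roc_test))  * 2 - 1  # lower limit, mean, upper limit: test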

We also estimated the K-S statistic. For the train sample it equals 0.5293, while for the test sample it is 0.3825; again, the statistic is better for the training sample than for the test sample. The rule of thumb says that for a behavioral model we should obtain at least 0.5, and the test sample is not that far from this value.

Stability

Next, we evaluated the stability of the model using the Population Stability Index (PSI) and the Kolmogorov-Smirnov test. To do that, we first calculated the score of each customer in the train and test samples. Then we estimated the PSI to compare the distributions of the scoring variable in the two sets. The rule of thumb states that if the PSI is less than 0.1, no change in the scoring model is needed. Our PSI equals 0.0661, so the model is stable.
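A sketch of the PSI computation on the score variable (the construction of bins from training-score deciles is an assumption; `score_train` and `score_test` are assumed names for the scores of the two samples; empty bins would need special handling):

psi <- function(expected, actual, n_bins = 10) {
  breaks <- unique(quantile(expected, probs = seq(0, 1, length.out = n_bins + 1)))
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf
  e <- table(cut(expected, breaks)) / length(expected)  # expected shares
  a <- table(cut(actual,   breaks)) / length(actual)    # actual shares
  sum((a - e) * log(a / e))
}

psi(score_train, score_test)  # 0.0661 reported above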

The null hypothesis of the Kolmogorov-Smirnov test states that there is no difference between the two distributions. The p-value of the test is well above any conventional significance level, so we have no reason to reject the null hypothesis.
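A sketch of this two-sample test in base R:

ks <- ks.test(score_train, score_test)
round(ks$p.value, 4)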

Comparison to base and max models

Finally, we compared the areas under the ROC curves using a ROC test. The null hypothesis is that the difference between the areas is equal to 0, against the alternative that it is not. When we compare our model to the base model, which includes only the intercept, the p-value of the ROC test is approximately 0, so the areas under the curves are different. The p-value of the ROC test comparing our model with the max model, which includes all the variables, is 0.3209, so we have no reason to reject the null hypothesis.
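A sketch of the comparison with the max model, using pROC::roc.test (DeLong test by default); the comparison with the base model is performed analogously:

roc_final <- roc(test_woe$class,
                 predict(final_model, newdata = test_woe, type = "response"))
roc_max   <- roc(test_woe$class,
                 predict(max_model,   newdata = test_woe, type = "response"))

roc.test(roc_final, roc_max)  # H0: equal areas under the two ROC curves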

Summary

Taking into consideration all the tests, we decided to use the model obtained by stepwise regression as the final one.

Random Forests

Random Forest is a supervised learning method for classification. It is based on generating a large number of decision trees, each constructed using a different subset of the training set. Random forests correct for decision trees' habit of overfitting to the training set.

For the random forest, we decided to use the raw, unprocessed data. Firstly, we arbitrarily set the number of trees to ntree = 500 and the number of predictors sampled at each split to m = 4 (the integer closest to the square root of 20). Then we decided to decrease the number of trees to 400, as the OOB error appears stable beyond this point. Afterwards, we used cross-validation to choose the optimal value of m, and finally settled on a random forest with m = 12.
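A sketch of the tuning and the final fit (the cross-validation grid and the seed are assumptions; `model.formula` and `train_raw` are the names visible in the output below):

library(randomForest)
library(caret)

# 5-fold cross-validation over a grid of candidate values of m (mtry)
set.seed(123)
cv <- train(model.formula, data = train_raw, method = "rf",
            ntree = 400, tuneGrid = expand.grid(mtry = c(4, 8, 12, 16)),
            trControl = trainControl(method = "cv", number = 5))

# final model with the selected mtry = 12; printing it gives the summary below
rf <- randomForest(model.formula, data = train_raw,
                   ntree = 400, mtry = 12, importance = TRUE)
rf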

## 
## Call:
##  randomForest(formula = model.formula, data = train_raw, ntree = 400,      mtry = 12, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 24.43%
## Confusion matrix:
##      good bad class.error
## good  432  58   0.1183673
## bad   113  97   0.5380952

The Random Forest also produces variable importance plots: mean decrease in accuracy and mean decrease in Gini. The insights from these plots are similar to the earlier results from the logistic regression: the most important variables include checking_status, duration and credit_amount.
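These measures come directly from the randomForest package:

importance(rf)  # mean decrease in accuracy and in Gini per variable
varImpPlot(rf)  # the two importance plots discussed above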

Results

To compare the results, we first calculated the AUC for both the GLM and the RF (for the confusion matrices below, the cut-off point for the GLM was set at 50%). The difference is not large: the AUC for the GLM is 0.7472, while for the RF it is 0.788.
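A sketch of the AUC comparison on the test sample, using pROC (`test_raw` is an assumed name mirroring the unprocessed data used by the forest):

pred_glm <- predict(final_model, newdata = test_woe, type = "response")
pred_rf  <- predict(rf, newdata = test_raw, type = "prob")[, "bad"]

auc(roc(test_woe$class, pred_glm))  # 0.7472 reported above
auc(roc(test_raw$class, pred_rf))   # 0.788 reported above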

Next, we plotted the confusion matrix for both models.

Both the accuracy and the sensitivity are better for the Random Forest, so it seems that the Random Forest outperforms the logistic regression.

Summary

In this research we compared the predictive power of logistic regression and Random Forest in the context of credit scorecards. First we estimated a GLM model in accordance with best practices, and subsequently we trained a Random Forest, using cross-validation to optimise hyperparameters. The Random Forest seems to outperform the logistic regression, but the difference is not large. For credit scorecard building, GLM models are still preferable, as they provide a clear answer about how each trait of a client contributes to their credit score.