The main aim of this research is to compare traditional logistic regression with a random forest for the purpose of credit scorecard building. This research will be based on empirical results only. Firstly, we will estimate a logistic regression model, in accordance with best practices. Then we will train a Random Forest classifier. Finally, we will compare the predictions of both approaches, using various performance measures.
The dataset used in this research is the well-known credit-g dataset, available at OpenML (https://www.openml.org/d/31). The original dataset contains one target variable, class, which describes whether the customer is good or bad, and 20 attributes.
The dataset contains data for 1000 customers, of which 700 are good and 300 are bad. This means that the data is not severely imbalanced and there is a sufficient number of bad clients to estimate/train a reliable model.
To build a scorecard model based on logistic regression, coarse classing of the data is needed. Firstly, continuous variables were binned into factor variables; the initial division into bins was based on percentiles. For some variables that were originally factors, similar categories were merged.
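As an illustration, a minimal sketch of percentile-based initial binning of one continuous variable (the data frame and variable names are hypothetical; the actual cut points were later adjusted during coarse classing):

```r
# Hypothetical example: initial binning of a continuous variable into quintile-based bins
breaks <- unique(quantile(credit$duration, probs = seq(0, 1, by = 0.2), na.rm = TRUE))
credit$duration_bin <- cut(credit$duration, breaks = breaks, include.lowest = TRUE)

# Inspect the distribution of observations across the initial bins
table(credit$duration_bin)
```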
In the end, we obtained 20 factor variables. Then, we checked their correlation using Kendall's correlation measure. This is important in order not to introduce collinearity into our model. The result is plotted below:
It seems that there will not be any problems with collinearity in the model. We might only consider removing either credit_history or existing_credits, as their correlation is around 50%.
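For reference, a minimal sketch of how such a Kendall correlation matrix could be computed and visualised (assuming the binned factors are first converted to numeric codes and the corrplot package is available; `train_binned` is a hypothetical name):

```r
library(corrplot)

# Convert binned factor variables to integer codes (ordering follows factor levels)
num_data <- data.frame(lapply(train_binned, function(x) as.numeric(as.factor(x))))

# Kendall rank correlation between all pairs of variables
cor_mat <- cor(num_data, method = "kendall")

# Visualise the correlation matrix
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.7)
```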
Then, we calculated variable quality statistics, namely Gini and Information Value (a sketch of the computation is shown below the table).
| No. | Variable | IV | Gini |
|---|---|---|---|
| 1 | checking_status | 0.6660 | 0.4155 |
| 2 | duration | 0.2465 | 0.2748 |
| 3 | credit_history | 0.2918 | 0.2531 |
| 4 | purpose | 0.1487 | 0.2052 |
| 5 | savings_status | 0.1961 | 0.1983 |
| 6 | credit_amount | 0.1136 | 0.1796 |
| 7 | age | 0.1007 | 0.1777 |
| 8 | property_magnitude | 0.1127 | 0.1707 |
| 9 | employment | 0.0866 | 0.1616 |
| 10 | housing | 0.0833 | 0.1344 |
| 11 | personal_status | 0.0446 | 0.1048 |
| 12 | other_payment_plans | 0.0576 | 0.0964 |
| 13 | installment_commitment | 0.0262 | 0.0868 |
| 14 | other_parties | 0.0320 | 0.0513 |
| 15 | existing_credits | 0.0115 | 0.0503 |
| 16 | job | 0.0088 | 0.0429 |
| 17 | own_telephone | 0.0064 | 0.0390 |
| 18 | foreign_worker | 0.0439 | 0.0338 |
| 19 | residence_since | 0.0035 | 0.0324 |
| 20 | num_dependents | 0.0000 | 0.0024 |
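A minimal sketch of how the Information Value of one binned variable could be computed, with WoE defined as the log ratio of the good and bad distributions (function and object names are hypothetical):

```r
# Minimal sketch: Information Value of a single binned variable
iv_of <- function(bin, target, good_level = "good") {
  tab  <- table(bin, target)
  good <- tab[, good_level] / sum(tab[, good_level])                          # share of goods per bin
  bad  <- tab[, colnames(tab) != good_level] / sum(tab[, colnames(tab) != good_level])  # share of bads per bin
  woe  <- log(good / bad)                                                     # Weight of Evidence per bin
  sum((good - bad) * woe)                                                     # Information Value
}

iv_of(credit$checking_status, credit$class)
```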
We decided to include in our final dataset only those variables for which the Gini value is greater than 10%. This left us with 11 explanatory variables.
To divide the dataset into train and test samples, we used stratified sampling. This way the ratio between the number of good and bad clients is the same in the train and test samples.
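A minimal sketch of such a stratified split, assuming the caret package and a 70/30 split (the split proportion and object names are assumptions):

```r
library(caret)

set.seed(123)  # for reproducibility
# createDataPartition samples within each class, preserving the good/bad ratio
idx   <- createDataPartition(credit$class, p = 0.7, list = FALSE)
train <- credit[idx, ]
test  <- credit[-idx, ]

# Check that the good/bad proportions match in both samples
prop.table(table(train$class))
prop.table(table(test$class))
```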
Then, using the train sample, we performed supervised discretization. We used Optimal Binning to categorise numeric variables into bins for scoring modeling. Only for age were bins not assigned by the algorithm, so they were created manually, based on the WoE of the previously created percentiles. Factor variables were automatically merged into groups using the woeBinning package.
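A minimal sketch of how such WoE-based binning could be produced with the woeBinning package (the variable selection and object names are illustrative, not the exact call used here):

```r
library(woeBinning)

# Supervised binning of selected predictors with respect to the binary target "class"
binning <- woe.binning(df = train, target.var = "class",
                       pred.var = c("duration", "credit_amount", "purpose"))

# Inspect the WoE tables and apply the binning to the train and test samples
woe.binning.table(binning)
train_woe <- woe.binning.deploy(train, binning, add.woe.or.dum.var = "woe")
test_woe  <- woe.binning.deploy(test,  binning, add.woe.or.dum.var = "woe")
```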
Charts for all analysed variables:

- WoE for the final bins,
- stability analysis (percentage of customers in each bin across the train and test samples),
- distribution (percentage) of all customers in each bin,
- bad rate for each bin,

can be found in the PDF file on GitHub.
Firstly, two models were estimated: a model with a constant only (1) and a model with all the variables (2). These models serve as benchmarks to assess whether our chosen model form is correct. To choose the proper functional form, a stepwise selection algorithm was used. In stepwise regression, at each step a variable is considered for addition to or removal from the set of explanatory variables, based on a defined information criterion. We used the stats package, which relies on the AIC criterion with the penalty per degree of freedom set to k = 2 by default; we decided to use a more conservative k = 4.
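A minimal sketch of the estimation and stepwise selection with the stats package (data and object names are illustrative, assuming a data frame of WoE-transformed predictors):

```r
# Base model with a constant only and full model with all WoE variables
model_base <- glm(class ~ 1, data = train_woe, family = binomial)
model_full <- glm(class ~ ., data = train_woe, family = binomial)

# Stepwise selection in both directions with a stricter penalty (k = 4 instead of the default 2)
model_step <- step(model_base,
                   scope = list(lower = model_base, upper = model_full),
                   direction = "both", k = 4)
summary(model_step)
```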
| | Dependent variable: class | | |
|---|---|---|---|
| | (1) | (2) | (3) |
| age_woe | | -0.345 | |
| checking_status_woe | | 0.008*** | 0.008*** |
| credit_amount_woe | | -0.801*** | -0.818*** |
| credit_history_woe | | 0.006*** | 0.007*** |
| duration_woe | | -0.790*** | -0.810*** |
| employment_woe | | 0.007*** | 0.009*** |
| housing_woe | | 0.005 | |
| personal_status_woe | | 0.007 | |
| property_magnitude_woe | | 0.005 | |
| purpose_woe | | 0.009*** | 0.009*** |
| savings_status_woe | | 0.007*** | 0.007*** |
| Constant | -0.847*** | -0.840*** | -0.841*** |
| Observations | 700 | 700 | 700 |
| Log Likelihood | -427.605 | -325.711 | -329.370 |
| Akaike Inf. Crit. | 857.210 | 675.421 | 674.740 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
Because we used the WoE transformation, we would expect all coefficients to be negative, as the higher the WoE, the lower the probability of default. This could be a problem from the business side; however, in this research we focus mainly on forecasts, so we will leave it as it is.
We can also see that Akaike Information Criterion is smaller for the chosen model than for the base and max models. This suggests that our model is a better fit to the data.
To assess whether the model is specified correctly, we performed several model quality tests.
Firstly, we conducted a Pearson Chi-squared and Deviance test. Its null hypothesis states that the model is well fitted to the data. We obtained a p-value of 0.8135, so there is no evidence to reject the null hypothesis.
Then we performed an LR test of the joint significance of the variables. In this case the null hypothesis states that the variables are jointly insignificant. The p-value of this test is 0, so we reject the null hypothesis.
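Such an LR test could be carried out, for example, with the lmtest package, comparing the chosen model against the intercept-only model (a sketch; object names follow the earlier illustration):

```r
library(lmtest)

# Likelihood-ratio test of the joint significance of the selected variables
lrtest(model_base, model_step)
```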
Afterwards, we performed a Hosmer-Lemeshow test. This test's null hypothesis states that the model is well fitted to the data. As the test is very sensitive to the number of buckets used, we repeated it for 7, 8, 9 and 10 buckets. The p-values we obtained are 0.4994, 0.321, 0.4129 and 0.3869, respectively. In each case we cannot reject the null hypothesis, which suggests that our functional form is correct.
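A sketch of the Hosmer-Lemeshow test using the ResourceSelection package (assuming the target is recoded to 0/1 consistently with the model's success level; object names are illustrative):

```r
library(ResourceSelection)

# Recode the target to 0/1 (assumption: "bad" is the modelled event)
y <- as.numeric(train_woe$class == "bad")

# Hosmer-Lemeshow goodness-of-fit test for several numbers of buckets
for (g in 7:10) {
  print(hoslem.test(y, fitted(model_step), g = g))
}
```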
Next, we assessed the discriminatory power of the model. To do so, we estimated confidence intervals for the Gini index. The rule of thumb says that for a behavioral model, the Gini index should be at least 60%. We can see that in our case, while the training sample achieves this result, the testing sample does not. This might suggest a slight problem with overfitting.
| | Lower limit | Mean | Upper limit |
|---|---|---|---|
| Train sample | 0.5740 | 0.6390 | 0.7039 |
| Test sample | 0.3808 | 0.4944 | 0.6081 |
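A sketch of how the Gini index and its confidence interval could be obtained from the AUC with the pROC package, using Gini = 2·AUC − 1 (object names are illustrative):

```r
library(pROC)

# ROC curve and DeLong confidence interval for the AUC on the test sample
roc_test <- roc(test_woe$class, predict(model_step, newdata = test_woe, type = "response"))
auc_ci   <- ci.auc(roc_test)        # lower bound, AUC estimate, upper bound

# Convert the AUC bounds to Gini bounds
gini_ci <- 2 * as.numeric(auc_ci) - 1
gini_ci
```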
We also estimated the K-S statistic. For the train sample it equals 0.5293, while for the test sample it is 0.3825. Again we can see that the statistic is better for the training sample than for the test sample. The rule of thumb says that for a behavioral model we should obtain at least 0.5; the train sample meets this threshold, while the test sample falls somewhat short of it.
Next, we performed stability tests, namely the PSI and the Kolmogorov-Smirnov test. To do that, we first calculated a score for each customer in the train and test samples. Then, we estimated the Population Stability Index (PSI) to compare the distribution of the scoring variable in both sets. The rule of thumb states that if the PSI is less than 0.1, no change in the scoring model is needed. Our PSI equals 0.0661, so the model is stable.
The null hypothesis of the Kolmogorov-Smirnov test states that there is no difference between the two distributions. The p-value of the test equals `r round(ks, 4)`, so we have no reason to reject the null hypothesis.
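A minimal sketch of the PSI calculation and the two-sample Kolmogorov-Smirnov test on the score variable (the score vectors and the decile-based bucketing are assumptions):

```r
# Population Stability Index between the train and test score distributions
psi <- function(expected, actual, breaks = quantile(expected, probs = seq(0, 1, 0.1))) {
  breaks     <- unique(c(-Inf, breaks, Inf))
  p_expected <- prop.table(table(cut(expected, breaks)))   # share of train scores per bucket
  p_actual   <- prop.table(table(cut(actual, breaks)))     # share of test scores per bucket
  sum((p_actual - p_expected) * log(p_actual / p_expected))
}
psi(score_train, score_test)

# Two-sample Kolmogorov-Smirnov test on the same score vectors
ks <- ks.test(score_train, score_test)$p.value
```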
Finally, we compared the areas under the ROC curves with a ROC test. The null hypothesis of the test is that the difference between them is equal to 0, against the alternative that it is not. When we compare our model to the base model, which includes only the intercept, the p-value of the ROC test is 0, so the areas under the curves are different. The p-value of the ROC test for our model and the max model, which includes all the variables, is 0.3209, so we have no reason to reject the null hypothesis.
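A sketch of such a comparison with pROC's DeLong test for two correlated ROC curves (object names follow the earlier illustration; the comparison against the intercept-only model is analogous):

```r
library(pROC)

# ROC curves for the chosen model and the full model on the test sample
roc_step <- roc(test_woe$class, predict(model_step, newdata = test_woe, type = "response"))
roc_full <- roc(test_woe$class, predict(model_full, newdata = test_woe, type = "response"))

# Test whether the areas under the two curves differ
roc.test(roc_step, roc_full)
```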
Taking into consideration all the tests, we decided to use the model obtained by stepwise regression as the final one.
Random Forest is a supervised learning method for classification. It is based on generating a large number of decision trees, each constructed using a different subset of the training set. Random forests correct for decision trees' habit of overfitting to the training set.
For the random forest, we decided to use the raw, unprocessed data. Firstly, we arbitrarily set the number of trees to ntree = 500 and the number of predictors tried at each split to m = 4 (the integer closest to the square root of 20). Then we decided to decrease the number of trees to 400, as the OOB error seems to be stable beyond this point. Afterwards, we used cross-validation to choose the optimal value of m. Finally, we decided to use a random forest with m = 12.
```
## 
## Call:
##  randomForest(formula = model.formula, data = train_raw, ntree = 400, mtry = 12, importance = TRUE)
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 12
## 
##         OOB estimate of error rate: 24.43%
## Confusion matrix:
##      good bad class.error
## good  432  58   0.1183673
## bad   113  97   0.5380952
```
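A sketch of how the number of predictors per split (mtry) could be tuned by cross-validation with the caret package (the fold count and tuning grid are assumptions):

```r
library(caret)
library(randomForest)

set.seed(123)
# 5-fold cross-validation over a grid of candidate mtry values
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(2, 4, 8, 12, 16, 20))

rf_cv <- train(class ~ ., data = train_raw, method = "rf",
               ntree = 400, trControl = ctrl, tuneGrid = grid)
rf_cv$bestTune   # selected mtry (m = 12 in our case)

# Final model refitted with the selected mtry
rf_final <- randomForest(class ~ ., data = train_raw,
                         ntree = 400, mtry = 12, importance = TRUE)
```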
Random Forest also generates feature importance plots: mean decrease in Accuracy and mean decrease in Gini. We can see that the insights from these plots are similar to those from the Logistic Regression results - the most important variables include checking_status, duration and credit_amount.
To compare the results, we first calculated the AUC for both the GLM and the RF, assuming a cut-off point of 50% for the GLM. The difference is not large: the AUC for the GLM is 0.7472, while for the RF it is 0.788.
Next, we plotted the confusion matrix for both models.
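A sketch of how the confusion matrices on the test sample could be obtained, using a 50% cut-off for the GLM (object names are illustrative; the assumption is that the GLM predicts the probability of the "bad" class):

```r
library(caret)

# GLM: convert predicted probabilities to classes with a 50% cut-off
glm_prob  <- predict(model_step, newdata = test_woe, type = "response")
glm_class <- factor(ifelse(glm_prob > 0.5, "bad", "good"), levels = levels(test_woe$class))

# Random Forest: predicted classes directly
rf_class <- predict(rf_final, newdata = test_raw, type = "class")

confusionMatrix(glm_class, test_woe$class, positive = "bad")
confusionMatrix(rf_class,  test_raw$class, positive = "bad")
```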
Both the Accuracy and the Sensitivity are better for the Random Forest, so it seems that the Random Forest outperforms Logistic Regression.
In this research I compared the predictive power of Logistic Regression and Random Forest in the context of credit scorecards. Firstly, I estimated a GLM model in accordance with best practices, and subsequently I trained a Random Forest, using cross-validation to optimise hyperparameters. The Random Forest seems to outperform Logistic Regression; however, the difference is not immense. For credit scorecard building, GLM models are still preferable, as they provide a clear answer about how each characteristic of a client contributes to the credit score.