This report explains propensity models built with logistic regression and decision trees, and uses them to profile the clients who are most likely to acquire personal credit. It also covers the ROC curve and the Kolmogorov-Smirnov test. The examples use the following data set.
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Account.Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
## $ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : Factor w/ 4 levels "1","2","3","4": 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : Factor w/ 4 levels "1","2","3","4": 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : Factor w/ 3 levels "1","2","3": 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : Factor w/ 4 levels "1","2","3","4": 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
Logistic regression is widely used when the dependent variable is categorical, and in particular when it is binary. The variables involved in credit risk modelling are usually categorical, or can be treated as categorical, such as the purpose of the credit, the number of previous credits at the same bank, occupation and type of apartment. It is common to split the data set in two: the first part is used to fit the model and the second to test it. The split can be 50/50 or any other proportion; in this report the data set is randomly split 50/50. After the split, it is important to analyze the correlation between each candidate variable and the dependent variable. Here the dependent variable is Creditability, and the correlations are shown below.
## Correlation With Creditability
## Creditability 1.000000000
## Account.Balance 0.350847483
## Duration.of.Credit..month. -0.214926665
## Payment.Status.of.Previous.Credit 0.228784733
## Purpose -0.017978870
## Credit.Amount -0.154740146
## Value.Savings.Stocks 0.178942736
## Length.of.current.employment 0.116002036
## Instalment.per.cent -0.072403937
## Sex...Marital.Status 0.088184301
## Guarantors 0.025136768
## Duration.in.Current.address -0.002967159
## Most.valuable.available.asset -0.142611973
## Age..years. 0.091271949
## Concurrent.Credits 0.109844099
## Type.of.apartment 0.018118912
## No.of.Credits.at.this.Bank 0.045732489
## Occupation -0.032735001
## No.of.dependents 0.003014853
## Telephone 0.036466190
## Foreign.Worker 0.082079499
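The split and the correlation step described before the table can be sketched as follows. This is a minimal illustration in Python (the report's own output comes from R), and it makes several assumptions: the name `validation`, the seed, and the toy columns are hypothetical, and only the name `calibration` and the 1000-row size come from the report. The correlations in the table were presumably computed on the numeric codes of the factors; the sketch simply takes numeric lists.

```python
import random
from statistics import mean

def split_half(n_rows, seed=42):
    """Return two disjoint, sorted lists of row indices, each holding half the rows."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    half = n_rows // 2
    return sorted(indices[:half]), sorted(indices[half:])

def pearson(x, y):
    """Pearson correlation coefficient between two numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# 1000 observations, as in the data set summary; 500 rows go to each half.
calibration, validation = split_half(1000)
print(len(calibration), len(validation))

# Toy columns (hypothetical values, NOT rows from the credit data).
creditability = [1, 1, 0, 1, 0, 0, 1, 0]
duration = [6, 12, 48, 9, 36, 24, 12, 60]
print(round(pearson(creditability, duration), 3))
```

In the toy columns, longer durations go with `creditability = 0`, so the coefficient is negative, which is the same sign the table reports for Duration.of.Credit..month.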
First, we build a model with all the variables and test their significance. If one or more variables are not significant, we refit the logistic regression without the non-significant variable that has the smallest correlation coefficient. We repeat this until all the remaining variables are significant. The summary of the first model, with all variables, is shown below.
##
## Call:
## glm(formula = Creditability ~ Account.Balance + Duration.of.Credit..month. +
## Payment.Status.of.Previous.Credit + Purpose + Credit.Amount +
## Value.Savings.Stocks + Length.of.current.employment + Instalment.per.cent +
## Sex...Marital.Status + Guarantors + Duration.in.Current.address +
## Most.valuable.available.asset + Age..years. + Concurrent.Credits +
## Type.of.apartment + No.of.Credits.at.this.Bank + Occupation +
## No.of.dependents + Telephone + Foreign.Worker, family = binomial,
## data = credit[calibration, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6404 -0.4863 0.2600 0.6330 2.2657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.599e+01 7.875e+02 -0.020 0.98380
## Account.Balance2 4.759e-01 3.516e-01 1.354 0.17588
## Account.Balance3 -2.032e-02 5.536e-01 -0.037 0.97072
## Account.Balance4 1.930e+00 3.728e-01 5.177 2.25e-07
## Duration.of.Credit..month. -4.930e-02 1.640e-02 -3.005 0.00265
## Payment.Status.of.Previous.Credit1 -1.101e+00 8.942e-01 -1.231 0.21838
## Payment.Status.of.Previous.Credit2 5.013e-01 6.735e-01 0.744 0.45670
## Payment.Status.of.Previous.Credit3 8.869e-01 7.551e-01 1.174 0.24020
## Payment.Status.of.Previous.Credit4 1.983e+00 6.713e-01 2.954 0.00313
## Purpose1 1.880e+00 6.404e-01 2.936 0.00333
## Purpose2 1.031e+00 4.143e-01 2.488 0.01286
## Purpose3 1.170e+00 3.998e-01 2.926 0.00344
## Purpose4 2.564e-01 1.691e+00 0.152 0.87951
## Purpose5 -6.294e-01 7.619e-01 -0.826 0.40880
## Purpose6 5.489e-01 6.862e-01 0.800 0.42376
## Purpose8 1.893e+01 2.255e+03 0.008 0.99330
## Purpose9 6.227e-01 5.798e-01 1.074 0.28276
## Purpose10 -5.281e-01 1.277e+00 -0.413 0.67926
## Credit.Amount -9.925e-05 7.662e-05 -1.295 0.19519
## Value.Savings.Stocks2 5.021e-01 4.765e-01 1.054 0.29207
## Value.Savings.Stocks3 1.018e+00 7.390e-01 1.378 0.16825
## Value.Savings.Stocks4 1.969e+00 7.554e-01 2.607 0.00914
## Value.Savings.Stocks5 7.168e-01 4.217e-01 1.700 0.08918
## Length.of.current.employment2 2.712e-01 7.165e-01 0.378 0.70510
## Length.of.current.employment3 7.273e-01 7.082e-01 1.027 0.30443
## Length.of.current.employment4 1.650e+00 7.772e-01 2.122 0.03380
## Length.of.current.employment5 5.116e-01 6.952e-01 0.736 0.46180
## Instalment.per.cent2 -4.342e-01 4.924e-01 -0.882 0.37781
## Instalment.per.cent3 -6.071e-01 5.509e-01 -1.102 0.27048
## Instalment.per.cent4 -1.216e+00 4.864e-01 -2.500 0.01241
## Sex...Marital.Status2 -7.990e-02 5.799e-01 -0.138 0.89040
## Sex...Marital.Status3 6.018e-01 5.718e-01 1.053 0.29256
## Sex...Marital.Status4 -2.890e-01 6.624e-01 -0.436 0.66261
## Guarantors2 -6.691e-02 6.389e-01 -0.105 0.91659
## Guarantors3 1.522e+00 7.063e-01 2.154 0.03123
## Duration.in.Current.address2 -2.770e-01 4.622e-01 -0.599 0.54895
## Duration.in.Current.address3 -2.209e-01 5.133e-01 -0.430 0.66700
## Duration.in.Current.address4 2.635e-01 4.715e-01 0.559 0.57622
## Most.valuable.available.asset2 -5.692e-01 4.182e-01 -1.361 0.17348
## Most.valuable.available.asset3 -1.381e-02 3.906e-01 -0.035 0.97179
## Most.valuable.available.asset4 -1.338e+00 6.673e-01 -2.005 0.04501
## Age..years. 8.271e-03 1.642e-02 0.504 0.61440
## Concurrent.Credits2 -4.881e-01 7.813e-01 -0.625 0.53213
## Concurrent.Credits3 -3.716e-01 4.122e-01 -0.902 0.36730
## Type.of.apartment2 3.250e-01 3.916e-01 0.830 0.40661
## Type.of.apartment3 1.301e+00 8.191e-01 1.588 0.11230
## No.of.Credits.at.this.Bank2 -3.732e-01 4.411e-01 -0.846 0.39752
## No.of.Credits.at.this.Bank3 -1.334e+00 8.583e-01 -1.554 0.12023
## No.of.Credits.at.this.Bank4 1.377e+01 2.545e+03 0.005 0.99568
## Occupation2 -1.063e+00 1.037e+00 -1.025 0.30517
## Occupation3 -1.027e+00 9.928e-01 -1.034 0.30096
## Occupation4 -1.193e+00 1.004e+00 -1.189 0.23446
## No.of.dependents2 -3.804e-01 3.867e-01 -0.984 0.32526
## Telephone2 3.361e-01 3.315e-01 1.014 0.31066
## Foreign.Worker 1.662e+01 7.875e+02 0.021 0.98316
##
## (Intercept)
## Account.Balance2
## Account.Balance3
## Account.Balance4 ***
## Duration.of.Credit..month. **
## Payment.Status.of.Previous.Credit1
## Payment.Status.of.Previous.Credit2
## Payment.Status.of.Previous.Credit3
## Payment.Status.of.Previous.Credit4 **
## Purpose1 **
## Purpose2 *
## Purpose3 **
## Purpose4
## Purpose5
## Purpose6
## Purpose8
## Purpose9
## Purpose10
## Credit.Amount
## Value.Savings.Stocks2
## Value.Savings.Stocks3
## Value.Savings.Stocks4 **
## Value.Savings.Stocks5 .
## Length.of.current.employment2
## Length.of.current.employment3
## Length.of.current.employment4 *
## Length.of.current.employment5
## Instalment.per.cent2
## Instalment.per.cent3
## Instalment.per.cent4 *
## Sex...Marital.Status2
## Sex...Marital.Status3
## Sex...Marital.Status4
## Guarantors2
## Guarantors3 *
## Duration.in.Current.address2
## Duration.in.Current.address3
## Duration.in.Current.address4
## Most.valuable.available.asset2
## Most.valuable.available.asset3
## Most.valuable.available.asset4 *
## Age..years.
## Concurrent.Credits2
## Concurrent.Credits3
## Type.of.apartment2
## Type.of.apartment3
## No.of.Credits.at.this.Bank2
## No.of.Credits.at.this.Bank3
## No.of.Credits.at.this.Bank4
## Occupation2
## Occupation3
## Occupation4
## No.of.dependents2
## Telephone2
## Foreign.Worker
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 591.05 on 499 degrees of freedom
## Residual deviance: 381.16 on 445 degrees of freedom
## AIC: 491.16
##
## Number of Fisher Scoring iterations: 16
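The elimination procedure that this first summary feeds into can be sketched as a loop. The sketch below is in Python for illustration only and rests on assumptions that the report does not state: "smallest correlation" is read as smallest in absolute value, a factor counts as significant when at least one of its levels has a p-value below 0.05, and `fit_pvalues` is a hypothetical callback standing in for a real model refit. To keep the example runnable, the stand-in returns fixed p-values copied from the summary above for a subset of seven variables; a real run would refit the model after every removal, so the p-values would change from step to step.

```python
def backward_eliminate(variables, fit_pvalues, correlation, alpha=0.05):
    """Drop non-significant variables one at a time, weakest correlation first.

    fit_pvalues(kept) is expected to refit the model on `kept` and return one
    p-value per variable (assumption: the smallest p-value among a factor's levels).
    """
    kept = list(variables)
    while kept:
        pvalues = fit_pvalues(kept)
        weak = [v for v in kept if pvalues[v] > alpha]
        if not weak:
            break
        # Among the non-significant variables, remove the one least
        # correlated (in absolute value) with the response.
        kept.remove(min(weak, key=lambda v: abs(correlation[v])))
    return kept

# Correlations with Creditability, copied from the correlation table.
correlation = {
    "Account.Balance": 0.350847483,
    "Duration.of.Credit..month.": -0.214926665,
    "Guarantors": 0.025136768,
    "Credit.Amount": -0.154740146,
    "Age..years.": 0.091271949,
    "Telephone": 0.036466190,
    "Duration.in.Current.address": -0.002967159,
}

# Smallest per-level p-value of each variable in the full-model summary.
full_model_pvalues = {
    "Account.Balance": 2.25e-07,
    "Duration.of.Credit..month.": 0.00265,
    "Guarantors": 0.03123,
    "Credit.Amount": 0.19519,
    "Age..years.": 0.61440,
    "Telephone": 0.31066,
    "Duration.in.Current.address": 0.54895,
}

def fixed_pvalues(kept):
    """Hypothetical stand-in for a refit: returns the full-model p-values unchanged."""
    return {v: full_model_pvalues[v] for v in kept}

kept = backward_eliminate(list(correlation), fixed_pvalues, correlation)
print(kept)
```

With these fixed p-values the loop removes Duration.in.Current.address, Telephone, Age..years. and Credit.Amount, in that order, and keeps the other three variables.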
After removing the non-significant variables step by step, we arrive at the final set of variables shown in the summary below.
##
## Call:
## glm(formula = Creditability ~ Account.Balance + Duration.of.Credit..month. +
## Payment.Status.of.Previous.Credit + Purpose + Value.Savings.Stocks +
## Length.of.current.employment + Instalment.per.cent + Guarantors,
## family = binomial, data = credit[calibration, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6227 -0.5779 0.3386 0.6434 2.1957
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.25944 0.82018 -0.316 0.751758
## Account.Balance2 0.40386 0.31306 1.290 0.197038
## Account.Balance3 0.17993 0.52343 0.344 0.731034
## Account.Balance4 1.89636 0.34588 5.483 4.19e-08
## Duration.of.Credit..month. -0.05861 0.01223 -4.793 1.64e-06
## Payment.Status.of.Previous.Credit1 -0.52573 0.75413 -0.697 0.485718
## Payment.Status.of.Previous.Credit2 0.64362 0.52763 1.220 0.222532
## Payment.Status.of.Previous.Credit3 0.88385 0.66331 1.332 0.182700
## Payment.Status.of.Previous.Credit4 1.93177 0.58084 3.326 0.000882
## Purpose1 1.54136 0.54864 2.809 0.004963
## Purpose2 0.67800 0.36150 1.876 0.060723
## Purpose3 0.95104 0.35991 2.642 0.008232
## Purpose4 1.05533 1.78548 0.591 0.554479
## Purpose5 -0.59032 0.74328 -0.794 0.427074
## Purpose6 0.23165 0.62247 0.372 0.709782
## Purpose8 16.35981 832.97186 0.020 0.984330
## Purpose9 0.50645 0.53221 0.952 0.341307
## Purpose10 -0.56755 1.11059 -0.511 0.609327
## Value.Savings.Stocks2 0.36766 0.42556 0.864 0.387627
## Value.Savings.Stocks3 0.88415 0.68506 1.291 0.196838
## Value.Savings.Stocks4 1.75694 0.69284 2.536 0.011218
## Value.Savings.Stocks5 0.61891 0.37826 1.636 0.101803
## Length.of.current.employment2 0.10897 0.56102 0.194 0.845984
## Length.of.current.employment3 0.54417 0.52563 1.035 0.300543
## Length.of.current.employment4 1.40212 0.60031 2.336 0.019509
## Length.of.current.employment5 0.52669 0.55067 0.956 0.338845
## Instalment.per.cent2 -0.27470 0.45666 -0.602 0.547470
## Instalment.per.cent3 -0.47301 0.49401 -0.957 0.338316
## Instalment.per.cent4 -0.91850 0.42030 -2.185 0.028863
## Guarantors2 -0.01831 0.60180 -0.030 0.975732
## Guarantors3 1.45306 0.62561 2.323 0.020200
##
## (Intercept)
## Account.Balance2
## Account.Balance3
## Account.Balance4 ***
## Duration.of.Credit..month. ***
## Payment.Status.of.Previous.Credit1
## Payment.Status.of.Previous.Credit2
## Payment.Status.of.Previous.Credit3
## Payment.Status.of.Previous.Credit4 ***
## Purpose1 **
## Purpose2 .
## Purpose3 **
## Purpose4
## Purpose5
## Purpose6
## Purpose8
## Purpose9
## Purpose10
## Value.Savings.Stocks2
## Value.Savings.Stocks3
## Value.Savings.Stocks4 *
## Value.Savings.Stocks5
## Length.of.current.employment2
## Length.of.current.employment3
## Length.of.current.employment4 *
## Length.of.current.employment5
## Instalment.per.cent2
## Instalment.per.cent3
## Instalment.per.cent4 *
## Guarantors2
## Guarantors3 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 591.05 on 499 degrees of freedom
## Residual deviance: 414.90 on 469 degrees of freedom
## AIC: 476.9
##
## Number of Fisher Scoring iterations: 14
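The coefficients in the summary above are on the log-odds scale. As a hedged illustration (in Python, not the R used for the report), the sketch below turns three of them into a predicted probability for a hypothetical client whose other factors all sit at their reference level, which in R is the first level of each factor. The function names and the 12-month example are assumptions; the three coefficient values are copied from the summary.

```python
import math

# Coefficients copied from the final model summary (log-odds scale).
INTERCEPT = -0.25944
ACCOUNT_BALANCE_4 = 1.89636      # effect of Account.Balance level 4 vs level 1
DURATION_PER_MONTH = -0.05861    # effect of one extra month of credit duration

def logistic(z):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(duration_months, account_balance_4):
    """P(Creditability = 1) for a client at the reference level of every other factor."""
    z = INTERCEPT + DURATION_PER_MONTH * duration_months
    if account_balance_4:
        z += ACCOUNT_BALANCE_4
    return logistic(z)

p_level_4 = predicted_probability(12, True)
p_level_1 = predicted_probability(12, False)
odds_ratio = math.exp(ACCOUNT_BALANCE_4)

print(round(p_level_4, 3), round(p_level_1, 3), round(odds_ratio, 2))
```

Exponentiating a coefficient gives an odds ratio, so the last value is the factor by which the odds of Creditability = 1 are multiplied when moving from Account.Balance level 1 to level 4, with everything else held fixed.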
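Every variable left in the model has at least one level that is significant at the 5% level, and the AIC falls from 491.16 to 476.9, so we take this as a well-fitted model.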
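One way to compare the full and the reduced model is the AIC that both summaries report. For a binomial model with 0/1 responses the AIC equals the residual deviance plus twice the number of estimated coefficients, and the number of coefficients is the number of observations minus the residual degrees of freedom. The short Python check below (the function name is an assumption) reproduces both reported values from the deviances and degrees of freedom in the summaries.

```python
def aic_from_deviance(residual_deviance, n_obs, residual_df):
    """AIC = deviance + 2k for a binary-response GLM, with k = n_obs - residual_df."""
    n_params = n_obs - residual_df
    return residual_deviance + 2 * n_params

# 500 observations: the null deviance has 499 degrees of freedom.
aic_full = aic_from_deviance(381.16, 500, 445)      # 55 coefficients
aic_reduced = aic_from_deviance(414.90, 500, 469)   # 31 coefficients

print(round(aic_full, 2), round(aic_reduced, 2))
```

The reduced model has a higher deviance but far fewer coefficients, which is why its AIC is lower.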
In this section we analyze the ROC curve for our model. The ROC (Receiver Operating Characteristic) curve is used to determine the cutoff value. The y axis of the ROC curve is the sensitivity, also called the true positive rate, and the x axis is 1 - specificity, or the false positive rate. The idea is to find the cutoff point that gives the best tradeoff between sensitivity and specificity. For the data set and the fitted model of this report we have the following ROC curve and area under the curve (AUC).
## [1] 0.7489785
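To make the construction of the curve and of the AUC concrete, here is a minimal Python sketch (not the code behind the value above). It sweeps the cutoff over every distinct score, records the false and true positive rates, and integrates with the trapezoidal rule. The labels and scores are toy values chosen for illustration, not predictions from the model in this report.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by lowering the cutoff over every distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]   # cutoff above every score: nothing is classified positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data (hypothetical): 1 = good credit, scores = predicted probabilities.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
print(points[-1], auc(points))
```

The AUC also has a ranking interpretation: it is the share of (good, bad) pairs in which the good client receives the higher score. In the toy data that is 13 of 16 pairs.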
We want the point closest to a true positive rate of 1 and a false positive rate of 0. This is closely related to finding the point farthest from the diagonal line. The calculated value is a cutoff of around 0.7. To check that this is a good cutoff, we compare its sensitivity and specificity with those of a cutoff of 0.5.
## Sensitivity Specificity
## Cuttoff 0.5 0.8377581 0.4844720
## Cuttoff 0.7 0.6784661 0.6832298
The table shows that a cutoff of 0.5 gives more sensitivity and less specificity than a cutoff of 0.7. Although the sensitivity decreases when moving to 0.7 (from about 0.84 to 0.68), the specificity increases by more (from about 0.48 to 0.68). The advantage of the 0.7 cutoff is that sensitivity and specificity are nearly equal.
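The two ways of choosing a cutoff mentioned above, closeness to the ideal corner and distance from the diagonal, can be checked directly on the values in the table. The Python sketch below is only an illustration; the function names are assumptions, and the sensitivity and specificity values are copied from the table.

```python
import math

# Sensitivity and specificity copied from the table, keyed by cutoff.
reported = {
    0.5: (0.8377581, 0.4844720),
    0.7: (0.6784661, 0.6832298),
}

def distance_to_corner(sensitivity, specificity):
    """Distance from the ROC point (1 - specificity, sensitivity) to the ideal point (0, 1)."""
    return math.hypot(1.0 - specificity, 1.0 - sensitivity)

def youden_j(sensitivity, specificity):
    """Height of the ROC point above the diagonal: true positive rate minus false positive rate."""
    return sensitivity + specificity - 1.0

for cutoff, (sens, spec) in reported.items():
    print(cutoff, round(distance_to_corner(sens, spec), 3), round(youden_j(sens, spec), 3))
```

Both criteria prefer 0.7 over 0.5 for these values: the point is closer to the corner and higher above the diagonal. The second quantity, maximized over all cutoffs, is also the Kolmogorov-Smirnov statistic between the score distributions of the two classes.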
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. A decision tree has three main components. The root node is the topmost node and corresponds to the best predictor (independent variable). Decision (internal) nodes are the nodes in which predictors are tested; each branch leaving them represents an outcome of the test. Leaf (terminal) nodes hold a class label, Yes or No, which is the final classification outcome. An example of a decision tree for our data set is shown below.
The ROC curve and the area under the curve (AUC) for this decision tree are shown below.
## [1] 0.6547144
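The statement that the root node holds the best predictor can be illustrated with a small sketch. The report does not say which splitting criterion its tree used, so the Gini impurity below is an assumption, as is the simplification of splitting a factor on every distinct level at once (common tree packages use binary splits). The rows are toy values, not records from the credit data, and the code is Python for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after grouping the rows by each value of `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(group) / n * gini(group) for group in groups.values())

def best_root(rows, labels):
    """Feature whose split leaves the purest groups; it would sit at the root node."""
    return min(rows[0].keys(), key=lambda f: split_impurity(rows, labels, f))

# Toy rows (hypothetical values).
rows = [
    {"Account.Balance": "4", "Telephone": "1"},
    {"Account.Balance": "4", "Telephone": "2"},
    {"Account.Balance": "4", "Telephone": "1"},
    {"Account.Balance": "1", "Telephone": "2"},
    {"Account.Balance": "1", "Telephone": "1"},
    {"Account.Balance": "1", "Telephone": "2"},
]
labels = [1, 1, 1, 0, 0, 1]

print(best_root(rows, labels))
```

The same search is then repeated inside each group to build the internal nodes, and it stops at a leaf when a group is pure or too small to split further.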
Instead of building one decision tree, we can use the random forest method to create a metaphorical “forest” of decision trees. In this method, the end result is the mean of the predictions of the individual trees. The idea behind the random forest is that decision trees are prone to overfitting, so finding the “average” tree in the forest can help avoid this problem. The ROC curve and the AUC for the random forest are shown below.
## [1] 0.7788506
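The averaging idea can be sketched in a few lines. The Python example below is a deliberately reduced illustration and not the method that produced the AUC above: each "tree" is a single split on one numeric feature, and each is fitted to a bootstrap sample of the rows. A real random forest grows deeper trees and also samples a subset of the features at each split. All names, the seed, and the toy data are assumptions.

```python
import random

def fit_stump(xs, ys):
    """One-split tree on a numeric feature: returns (threshold, p_left, p_right)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        p_left, p_right = sum(left) / len(left), sum(right) / len(right)
        # Squared error when each side predicts its own share of good credits.
        error = (sum((y - p_left) ** 2 for y in left)
                 + sum((y - p_right) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, p_left, p_right)
    if best is None:   # the sample had a single distinct x value: no split possible
        p = sum(ys) / len(ys)
        return (float("inf"), p, p)
    return best[1:]

def predict_stump(stump, x):
    threshold, p_left, p_right = stump
    return p_left if x <= threshold else p_right

def fit_forest(xs, ys, n_trees=25, seed=1):
    """Fit each stump on a bootstrap sample (n rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return forest

def predict_forest(forest, x):
    """The forest's prediction is the mean of the individual tree predictions."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)

# Toy data (hypothetical): credit duration in months and a 0/1 outcome.
duration = [6, 9, 12, 12, 18, 24, 36, 48, 60]
creditability = [1, 1, 1, 1, 0, 1, 0, 0, 0]

forest = fit_forest(duration, creditability)
print(round(predict_forest(forest, 10), 2), round(predict_forest(forest, 50), 2))
```

Because every stump sees a slightly different sample, the thresholds vary from tree to tree, and the averaged prediction changes more smoothly with the feature than any single stump does.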
Because the random forest works with an average over decision trees, our analysis of it is limited to the AUC statistic. However, the results of many fitted models can themselves be analyzed as a data set. To compare random forests with logistic regression models, we fitted 200 logistic models and 200 average decision trees from random forests. The plot below shows the relationship between the AUCs from the random forests and the AUCs from the corresponding logistic models.
We can see that the AUCs previously calculated for our data set fall mostly in the middle of the distribution. This suggests that the models are broadly comparable. To make the areas easier to analyze, we use a density contour plot and a high-density contour plot.
At best, our models reach an AUC of about 82%, while most of the data lies around 77.5%. Strictly speaking, the AUC is the probability that a model ranks a randomly chosen good credit above a randomly chosen bad one, so the figures that follow are a loose reading rather than a repayment forecast. If we read the AUC as the chance of lending to a good credit risk, then for every $1 million in loans we might expect, at best, to be repaid $820,000, and on average we would expect to recover around $775,000 in principal. In other words, under this reading there is between a 75% and 80% chance that we will recapture our $1 million loan, depending on the modeling method we use. For a group of loans, we would like the results to cluster in the darkest area of the plot.