When a company receives a loan application, it has to decide whether or not to approve the loan, based on the applicant's profile.
Credit risk is the risk of default on a debt that may arise from a borrower failing to make the required payments.
There are two types of risks associated with the bank’s decision:
If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank
It may be assumed that the second risk is the greater one, as the loss includes the principal amount as well.
So it is crucial for a company to evaluate the risks associated with lending money to a customer.
Compare the performance of various classification models in predicting the risk of loans for 1000 individuals.
Compare the asymmetric cost on the test set for the different classification models.
The following packages are used:
library(ROCR)        # ROC curves and AUC
library(glmnet)      # LASSO and other penalised regression models
library(kableExtra)  # formatted tables
library(DT)          # interactive data tables
library(ggplot2)     # plotting
library(ggpubr)      # arranging and annotating ggplots
library(rpart)       # classification and regression trees
library(rpart.plot)  # plotting rpart trees
The data set has information about 1000 individuals, on the basis of which they have been classified as risky or not. There are 8 quantitative variables and 13 qualitative variables.
Whether a particular loan is good or bad is indicated by ‘response’. We recode it from ‘1’ and ‘2’ to ‘0’ and ‘1’, as required for logistic regression.
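A minimal sketch of this step, assuming the data are read from a local CSV file (the file name is hypothetical) and that the raw coding is 1 = good, 2 = bad:

# Read the raw data (file name is hypothetical; adjust to the actual source)
german_credit <- read.csv("german_credit.csv", stringsAsFactors = TRUE)

# Recode the response: 1/2 becomes 0/1, so that 1 marks a bad (defaulting) loan
german_credit$response <- german_credit$response - 1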
The description of each variable is given below:
Variable | Description |
---|---|
chk_acct | Status of existing checking account |
duration | Duration in months |
credit_his | Credit history |
purpose | Purpose (car, furniture, education, etc.) |
amount | Credit amount |
saving_acct | Savings account/bonds |
present_emp | Present employment since |
installment_rate | Installment rate as a percentage of disposable income |
sex | Personal status and sex |
other_debtor | Other debtors / guarantors |
present_resid | Present residence since |
property | Property (real estate, life insurance, etc.) |
age | Age in years |
other_install | Other installment plans (bank, stores, none) |
housing | Housing (rent, own, free) |
n_credits | Number of existing credits at this bank |
job | Job |
n_people | Number of people liable to provide maintenance for |
telephone | Telephone |
foreign | Foreign worker |
response | Good/bad credit (recoded from 1/2 to 0/1) |
Let us have a look at the top 100 records of the dataset.
Let us have a look at the structure of the data:
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_his : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ saving_acct : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ present_emp : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installment_rate: int 4 2 2 2 3 2 3 2 2 4 ...
## $ sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtor : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ present_resid : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_install : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ n_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ n_people : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ response : num 0 1 0 0 1 0 0 0 0 1 ...
Analysis of Continuous Variables
Observations:
The variables duration, amount, age, n_credits and n_people have outliers.
Amount has the largest number of outliers.
There are no outliers in installment_rate and present_resid.
Now, let us look at the distributions of duration, amount and age, split by the binary response variable (response = 1 means the person is a defaulter and 0 means they are not). A sketch of one such plot follows the note below.
Note: defaulters are shown in red and non-defaulters in grey.
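A minimal ggplot2 sketch of the duration plot, using the red/grey colour coding described above (the data-frame and column names follow the structure shown earlier):

# Box plot of loan duration split by response (0 = non-defaulter, 1 = defaulter)
library(ggplot2)

ggplot(german_credit, aes(x = factor(response), y = duration,
                          fill = factor(response))) +
  geom_boxplot() +
  scale_fill_manual(values = c("0" = "grey", "1" = "red"), name = "response") +
  labs(x = "response", y = "duration (months)")

The same template can be reused for amount and age by swapping the y aesthetic.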
Observations:
The median duration is higher for defaulters than for non-defaulters, which means defaulters tend to take loans of longer duration.
There is no marked difference between the loan amounts taken by defaulters and non-defaulters.
The median age is lower for defaulters than for non-defaulters, which suggests that younger applicants tend to default more often.
The above plot suggests that there is a significant difference between the counts of defaulters and non-defaulters among applicants with no checking account: the number of good credit applicants is much higher than the number of defaulters. A probable reason could be that people who don't have a checking account may have a savings account, with enough savings to repay the loan on time.
Here:
A30: no credits taken / all credits paid back duly
A31: all credits at this bank paid back duly
A32: existing credits paid back duly till now
A33: delay in paying off in the past
A34: critical account / other credits existing (not at this bank)
It can be seen that there is a significant difference in the counts for A32: among the people whose existing credits have been duly paid back so far, there is a high number of non-defaulters.
It is surprising to see that the number of non-defaulters is high for the critical accounts as well.
The above plot suggests that when the reason for taking the loan is radio/television (A43), the majority of people pay the loan back, keeping the count of non-defaulters very high.
Similarly, we can see the trends based on sex, whether the person is a foreign worker or not, present employment, etc.
Before we build the model, we randomly select 70% of the credit data as the training set to train the logistic model. The remaining 30% is used for validation of this fit.
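A minimal sketch of this split (the seed value and the name of the test object are assumptions; the training object name matches the rpart call shown later):

# Reproducible 70/30 split of the 1000 records
set.seed(123)                                   # seed value is an assumption
train_index <- sample(nrow(german_credit), nrow(german_credit) * 0.7)
german_credit_train <- german_credit[train_index, ]
german_credit_test  <- german_credit[-train_index, ]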
As mentioned previously, the second risk is the greater one, as the loss includes the principal amount as well. Hence, using the training sample, we find the optimal cut-off by a grid search over an asymmetric cost.
Approach:
Fit a null model on the training data where the response is regressed on the intercept
Fit a full model on the training data where the response is regressed on all the predictor variables
Use the stepwise AIC forward-selection algorithm to fit our final model (a sketch of this procedure is given below)
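A minimal sketch of the forward selection, assuming the object names used in the split above:

# Null model: response regressed on the intercept only
null_model <- glm(response ~ 1, family = binomial, data = german_credit_train)

# Full model: response regressed on all predictor variables
full_model <- glm(response ~ ., family = binomial, data = german_credit_train)

# Forward selection by AIC (default penalty k = 2), starting from the null model
step_aic_model <- step(null_model,
                       scope = list(lower = formula(null_model),
                                    upper = formula(full_model)),
                       direction = "forward")
summary(step_aic_model)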
The final model has the following predictor variables:
## Estimate Pr(>|z|)
## (Intercept) 0.9922340787 2.108212e-01
## chk_acctA12 -0.5755083547 2.342864e-02
## chk_acctA13 -1.4700974482 1.377762e-03
## chk_acctA14 -2.0399014900 3.513269e-12
## credit_hisA31 0.2590174737 6.842753e-01
## credit_hisA32 -0.3866250466 4.507211e-01
## credit_hisA33 -0.8932764594 1.286990e-01
## credit_hisA34 -1.1642357154 3.015917e-02
## saving_acctA62 -0.1036014437 7.600248e-01
## saving_acctA63 -0.8265475424 1.296496e-01
## saving_acctA64 -1.4483771337 2.090331e-02
## saving_acctA65 -1.1249840962 4.064682e-04
## duration 0.0201762204 6.671238e-02
## purposeA41 -1.7731123563 3.968719e-05
## purposeA410 -1.6017696869 6.050567e-02
## purposeA42 -1.0663167985 6.957862e-04
## purposeA43 -0.8659557933 2.448656e-03
## purposeA44 -1.6936017565 1.872997e-01
## purposeA45 0.5248827103 4.282168e-01
## purposeA46 -0.4394483086 3.491969e-01
## purposeA48 -1.8941306914 1.402732e-01
## purposeA49 -1.0202434352 1.362715e-02
## other_debtorA102 0.7387310033 1.396637e-01
## other_debtorA103 -1.0733474597 2.483626e-02
## other_installA142 -0.2217782567 6.454375e-01
## other_installA143 -0.9488384309 1.176864e-03
## foreignA202 -1.7838129270 3.469892e-02
## present_empA72 0.4649719454 3.081107e-01
## present_empA73 -0.0967302672 8.222627e-01
## present_empA74 -0.6077952725 2.010794e-01
## present_empA75 -0.1754880569 6.949012e-01
## amount 0.0001135857 2.295298e-02
## installment_rate 0.1876313904 7.225762e-02
Observations:
The above model predicts the probability of a person defaulting. In order to classify a person as a defaulter, we need a cut-off probability.
Based on the cost function, we will determine the cut-off probability by selecting the cut-off that corresponds to the minimum cost.
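A sketch of this grid search. The 5:1 weighting of false negatives to false positives is an inference: it is consistent with the cost values reported later and with the loss matrix used for the classification tree below.

# Asymmetric cost: a false negative (missed defaulter) is weighted 5, a false positive 1
asym_cost <- function(obs, pred_prob, cutoff) {
  pred <- as.numeric(pred_prob > cutoff)
  weight_fn <- 5   # observed 1, predicted 0
  weight_fp <- 1   # observed 0, predicted 1
  mean(weight_fn * (obs == 1 & pred == 0) + weight_fp * (obs == 0 & pred == 1))
}

# Grid search over candidate cut-off probabilities on the training data
cutoffs    <- seq(0.01, 0.99, by = 0.01)
train_prob <- predict(step_aic_model, type = "response")
costs      <- sapply(cutoffs,
                     function(p) asym_cost(german_credit_train$response, train_prob, p))

plot(cutoffs, costs, type = "l", xlab = "cut-off probability", ylab = "asymmetric cost")
cutoffs[which.min(costs)]   # optimal cut-off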
Based on the above plot, a cut-off probability of 0.14 gives the minimum cost, so we will use 0.14 as the cut-off.
ROC Curve:
The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of the classification.
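A sketch of the in-sample ROC curve with the ROCR package (object names follow the sketches above):

# ROC curve for the training data using ROCR
library(ROCR)

train_prob <- predict(step_aic_model, type = "response")
pred_obj   <- prediction(train_prob, german_credit_train$response)
perf_obj   <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(perf_obj, colorize = TRUE)

# Area under the curve
unlist(slot(performance(pred_obj, "auc"), "y.values"))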
The misclassification table is given below:
## prediction
## observation 0 1
## 0 268 235
## 1 15 182
Observation:
Out of 700 observations,
450 observations have been classified correctly
There are 235 False positives
There are 15 False negatives
Model Evaluation (Based on training data):
Misclassification Rate | False Positive Rate | False Negative Rate |
---|---|---|
0.357 | 0.467 | 0.076 |
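For reference, a sketch of how these rates follow from the misclassification table above (the counts are the in-sample ones shown):

# Rates derived from the 2x2 misclassification table (rows = observation, columns = prediction)
conf <- matrix(c(268, 15, 235, 182), nrow = 2,
               dimnames = list(observation = c("0", "1"), prediction = c("0", "1")))

misclassification_rate <- (conf["0", "1"] + conf["1", "0"]) / sum(conf)   # 0.357
false_positive_rate    <- conf["0", "1"] / sum(conf["0", ])               # 0.467
false_negative_rate    <- conf["1", "0"] / sum(conf["1", ])               # 0.076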
Using the above model, we predict the response for the test data to see which applicants are predicted to default, as sketched below.
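A minimal sketch of this out-of-sample prediction, applying the 0.14 cut-off found above (object names follow the earlier sketches):

# Predicted default probabilities for the 30% test set
test_prob <- predict(step_aic_model, newdata = german_credit_test, type = "response")

# Classify using the asymmetric-cost cut-off of 0.14
test_pred <- as.numeric(test_prob > 0.14)
table(observation = german_credit_test$response, prediction = test_pred)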
ROC Curve:
Let us have a look at the out-of-sample ROC curve:
The misclassification table is given below:
## prediction
## observation 0 1
## 0 110 87
## 1 19 84
Observation:
Out of 300 observations,
194 observations have been classified correctly
There are 87 False positives
There are 19 False negatives
Model Evaluation (Based on test data):
Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|
0.353 | 0.442 | 0.184 | 0.607 |
We will now fit a model using stepwise BIC, in a similar way to the stepwise AIC model.
Approach:
Fit a null model on the training data where the response is regressed on the intercept
Fit a full model on the training data where the response is regressed on all the predictor variables
Use the stepwise BIC forward-selection algorithm to fit our final model (see the sketch below)
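A minimal sketch, reusing the null and full models from the AIC section; the only change is the BIC penalty k = log(n):

# Forward selection by BIC: same call as for AIC, but with penalty k = log(n)
n <- nrow(german_credit_train)
step_bic_model <- step(null_model,
                       scope = list(lower = formula(null_model),
                                    upper = formula(full_model)),
                       direction = "forward", k = log(n))
summary(step_bic_model)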
The final model has the following predictor variables:
## Estimate Pr(>|z|)
## (Intercept) -0.69729990 1.006801e-03
## chk_acctA12 -0.56042234 7.541335e-03
## chk_acctA13 -1.31721558 1.687172e-03
## chk_acctA14 -2.17026570 2.982995e-17
## duration 0.02847478 9.953440e-05
Observations:
Based on the stepwise BIC logistic regression, the important factors that determine whether a person will default are the status of the existing checking account (chk_acct) and the loan duration.
Again, we will determine the optimal cut-off probability based on the cost function.
Based on the above plot, a cut-off probability of 0.16 gives the minimum cost, so we will use 0.16 as the cut-off.
ROC Curve:
The ROC curve below gives an overall measure of the goodness of the classification.
The misclassification table is given below:
## prediction
## observation 0 1
## 0 242 261
## 1 25 172
Observation:
Out of 700 observations,
414 observations have been classified correctly
There are 261 False positives
There are 25 False negatives
Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|
0.409 | 0.519 | 0.127 | 0.551 |
ROC Curve:
Let us have a look at the out-of-sample ROC curve:
## prediction
## observation 0 1
## 0 116 81
## 1 22 81
Observation:
Out of 300 observations,
197 observations have been classified correctly
There are 81 False positives
There are 22 False negatives
Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|
0.343 | 0.411 | 0.214 | 0.637 |
We fit a LASSO model to our data. From the plot below, we see that as the value of lambda increases, the coefficients of the variables shrink towards 0.
Using cross-validation, we will find the optimal value of lambda for the LASSO fit, as sketched below.
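A sketch of the LASSO fit with glmnet. The coefficient output below shows a single coefficient per variable, which suggests that the factors were coded as numeric columns; that coding (via data.matrix) is an assumption here:

library(glmnet)

# Design matrix with factors coded numerically (assumption based on the output below)
x_train <- data.matrix(german_credit_train[, setdiff(names(german_credit_train), "response")])
y_train <- german_credit_train$response

# LASSO path: coefficients shrink towards 0 as lambda increases
lasso_fit <- glmnet(x_train, y_train, family = "binomial", alpha = 1)
plot(lasso_fit, xvar = "lambda")

# 10-fold cross-validation to choose lambda
cv_lasso <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
plot(cv_lasso)

# Coefficients at the lambda within one standard error of the minimum CV error
coef(cv_lasso, s = "lambda.1se")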
Based on cross-validation, selecting the lambda whose error is within one standard error of the minimum error, the LASSO fit gives the following variables:
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.36240301
## chk_acct -0.50553227
## duration 0.15280083
## credit_his -0.22675199
## purpose .
## amount 0.05975605
## saving_acct -0.15294536
## present_emp -0.07337619
## installment_rate 0.02685850
## sex .
## other_debtor -0.01679098
## present_resid .
## property 0.03403184
## age .
## other_install -0.17492876
## housing .
## n_credits .
## job .
## n_people .
## telephone .
## foreign -0.38044129
We now fit a logistic regression model based on the LASSO variable selection.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0064450065 3.545768e-01 -2.838440 4.533457e-03
## chk_acctA12 -0.6073777434 2.129442e-01 -2.852286 4.340604e-03
## chk_acctA13 -1.2705447162 4.224797e-01 -3.007351 2.635356e-03
## chk_acctA14 -2.1987011282 2.582536e-01 -8.513728 1.684286e-17
## amount 0.0001164014 3.183539e-05 3.656352 2.558298e-04
## installment_rate 0.1983866238 8.880422e-02 2.233977 2.548455e-02
## foreignA202 -1.7039659957 7.675981e-01 -2.219867 2.642776e-02
We can see that, based on the model built on the LASSO-selected variables, the important factors that determine whether a person will default are the status of the existing checking account (chk_acct), the credit amount, the installment rate and whether the applicant is a foreign worker.
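The refit behind the coefficient summary above could look like the following sketch (the formula terms are taken from that summary; the object name is an assumption):

# Refit a logistic regression on the LASSO-selected variables
lasso_glm <- glm(response ~ chk_acct + amount + installment_rate + foreign,
                 family = binomial, data = german_credit_train)
summary(lasso_glm)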
In-sample prediction (less important):
ROC Curve:
Let us have a look at the in-sample ROC curve:
Below is the misclassification table:
## prediction
## observation 0 1
## 0 245 258
## 1 23 174
Observation:
Out of 700 observations,
419 observations have been classified correctly
There are 258 False positives
There are 23 False negatives
Model Evaluation (Based on train data):
Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|
0.401 | 0.513 | 0.117 | 0.533 |
ROC Curve:
Let us have a look at the out of sample ROC curve:
Below is the misclassification table:
## prediction
## observation 0 1
## 0 116 81
## 1 22 81
Observation:
Out of 300 observations,
197 observations have been classified correctly
There are 81 False positives
There are 22 False negatives
Model Evaluation (Based on test data):
Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|
0.32 | 0.396 | 0.175 | 0.56 |
Based on the data at hand, the decision tree looks as follows:
Plotting the complexity parameter for all possible numbers of splits:
Printing the complexity parameters for all possible numbers of splits:
##
## Classification tree:
## rpart(formula = response ~ ., data = german_credit_train, method = "class",
## parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
##
## Variables actually used in tree construction:
## [1] age chk_acct duration n_credits other_install
## [6] present_emp property purpose saving_acct sex
##
## Root node error: 503/700 = 0.71857
##
## n= 700
##
## CP nsplit rel error xerror xstd
## 1 0.212724 0 1.00000 5.0000 0.11827
## 2 0.087475 1 0.78728 2.3917 0.12280
## 3 0.026839 2 0.69980 2.6938 0.12655
## 4 0.017893 4 0.64612 2.7376 0.12649
## 5 0.013917 5 0.62823 2.6700 0.12552
## 6 0.012922 10 0.55467 2.5586 0.12385
## 7 0.011928 12 0.52883 2.6044 0.12445
## 8 0.010934 15 0.49304 2.5686 0.12395
## 9 0.010000 17 0.47117 2.4115 0.12209
Pruning the tree using the optimal complexity parameter and then plotting the pruned tree, as sketched below:
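A sketch of the fit-and-prune step (the rpart call and loss matrix match the output shown above; the object names and the rule of picking the CP with the lowest cross-validated error are assumptions):

library(rpart)
library(rpart.plot)

# Tree grown with the asymmetric loss matrix (false negative = 5, false positive = 1)
credit_rpart <- rpart(response ~ ., data = german_credit_train, method = "class",
                      parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))

# Pick the complexity parameter with the lowest cross-validated error and prune
best_cp <- credit_rpart$cptable[which.min(credit_rpart$cptable[, "xerror"]), "CP"]
credit_rpart_pruned <- prune(credit_rpart, cp = best_cp)

# Plot the pruned tree
rpart.plot(credit_rpart_pruned)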
ROC Curve:
The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of the classification.
The AUC of the above ROC curve is 0.80
## [1] 0.8020759
The misclassification table is as follows:
## Predicted
## Truth 0 1
## 0 292 211
## 1 11 186
Observation:
Out of 700 observations,
478 observations have been classified correctly
There are 211 False positives
There are 11 False negatives
Model Evaluation (Based on training data):
Misclassification Rate | False Positive Rate | False Negative Rate |
---|---|---|
0.317 | 0.419 | 0.056 |
The AUC of the above ROC curve is 0.72
The misclassification table is as follows:
## Predicted
## Truth 0 1
## 0 122 75
## 1 29 74
Observation:
Out of 300 observations,
196 observations have been classified correctly
There are 75 False positives
There are 29 False negatives
Model Evaluation (Based on test data):
Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|
0.347 | 0.381 | 0.282 | 0.733 |
The table below compares the misclassification rate, False Positive Rate, False Negative Rate and Cost on the test set for the 4 different classification models.
Model | Misclassification Rate | False Positive Rate | False Negative Rate | Cost |
---|---|---|---|---|
Step AIC | 0.353 | 0.442 | 0.184 | 0.607 |
Step BIC | 0.343 | 0.411 | 0.214 | 0.637 |
Model based on LASSO variable selection | 0.320 | 0.396 | 0.175 | 0.560 |
Decision Tree | 0.347 | 0.381 | 0.282 | 0.733 |
Major Findings: In this case, based on the asymmetric cost (lower is better), the predictive power of the models ranks as: Model based on LASSO variable selection > Step AIC > Step BIC > Classification Tree.