Introduction

Problem Statement

When a company receives a loan application, it must decide whether or not to approve the loan. The decision is based on the applicant’s profile.

A credit risk is the risk of default on a debt that may arise from a borrower failing to make required payments.

There are two types of risks associated with the bank’s decision:

  • If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank

  • If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank

The second risk may be assumed to be the greater one, because the loss then includes the principal amount as well.

It is therefore crucial for a company to evaluate the risks associated with lending money to a customer.

Goal

To compare the performance of various classification models on predicting the risk of the loans for 1000 individuals.

Approach

Compare the asymmetric cost for the test set for different classification models.

Required Packages

The following packages are used:

  • ROCR : For evaluating and visualizing classifier performance
  • glmnet : For fitting generalized linear models with lasso or elastic-net regularization
  • kableExtra : For adding features to a kable() output
  • DT : Interface to the JavaScript library DataTables
  • ggplot2 : For data visualizations
  • ggpubr : Extension of ggplot2; provides some easy-to-use functions for creating and customizing ‘ggplot2’-based plots
  • rpart : To generate classification tree
  • rpart.plot : To plot the classification tree
library(ROCR)
library(glmnet)
library(kableExtra)
library(DT)
library(ggplot2)
library(ggpubr)
library(rpart)
library(rpart.plot)

Data

The data set has information about 1000 individuals, each of whom has been classified as risky or not. Besides the binary response, there are 20 predictors: 7 quantitative and 13 qualitative.

Whether a particular loan is good or bad is indicated by ‘response’. We change it from ‘1’ and ‘2’ to ‘0’ and ‘1’, which is required for logistic regression.
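A minimal sketch of this recoding is given below. It assumes the raw response is stored as the numbers 1 (good) and 2 (bad); the data frame toy_credit is made up for illustration and is not the report's data.

```r
# Hypothetical toy data: raw coding assumed to be 1 = good loan, 2 = bad loan.
toy_credit <- data.frame(response = c(1, 2, 1, 1, 2))

# Shift the coding so that 0 = good loan and 1 = bad loan (defaulter).
toy_credit$response <- toy_credit$response - 1

# Count how many loans fall in each class after recoding.
table(toy_credit$response)
```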

Data Dictionary

The description of each variable is given below:

Variable           Description
chk_acct           Status of existing checking account
duration           Duration in months
credit_his         Credit history
purpose            Purpose (car, furniture, education)
amount             Credit amount
saving_acct        Savings account/bonds
present_emp        Present employment since
installment_rate   Installment rate in percentage of disposable income
sex                Personal status and sex
other_debtor       Other debtors / guarantors
present_resid      Present residence since
property           Property (real estate, life insurance)
age                Age in years
other_install      Other installment plans (bank, stores, none)
housing            Housing (rent, own, free)
n_credits          Number of existing credits at this bank
job                Job
n_people           Number of people being liable to provide maintenance for
telephone          Telephone
foreign            Foreign worker
response           yes/no

Final data

Let us have a look at the top 100 records of the dataset.

Exploratory Data Analysis

Let us have a look at the structure of the data:

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose         : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response        : num  0 1 0 0 1 0 0 0 0 1 ...

Observations:

  • There are 1000 observations and 21 variables
  • Out of the 21 variables, 13 are categorical (factors), 7 are integers, and the response is numeric (0/1).

Analysis of Continuous Variables

Observations:

  • The variables duration, amount, age, n_credits and n_people have outliers.

  • Amount has the maximum number of outliers.

  • There are no outliers in installment_rate and present_resid.

Now, let us have a look at the distribution of duration, amount and age, split by the binary variable response (response = 1 means that the person is a defaulter and 0 means that the person is not).

Note : The defaulters are shown in red and the non-defaulters are shown in grey.

Observation

  • The median duration is higher for defaulters than for non-defaulters. This means that loans that end in default tend to have longer terms.

  • There is not much difference between the loan amounts taken by defaulters and non-defaulters.

  • The median age is lower for defaulters than for non-defaulters. This means that the younger crowd tends to default.

The above plot suggests that, among people with no checking account, there is a large difference between the count of defaulters and non-defaulters: the number of good credit people is much higher than the number of defaulters. A probable reason could be that people who don't have a checking account may have a savings account with enough savings to repay the loan on time.

Here, the credit history codes are:

  • A30 : no credits taken / all credits paid back duly
  • A31 : all credits at this bank paid back duly
  • A32 : existing credits paid back duly till now
  • A33 : delay in paying off in the past
  • A34 : critical account / other credits existing (not at this bank)

It can be seen that there is a large difference in the counts for A32: amongst the people whose existing credits have been duly paid back till now, the number of non-defaulters is high.

It is surprising to see that the number of non-defaulters is high for the critical accounts as well.

The above plot suggests that when the purpose of the loan is radio/television (A43), the majority of people pay the loan back, which keeps the count of non-defaulters very high.

Similarly, we can see the trends based on sex, whether the person is a foreign worker or not, present employment, etc.

Modeling

Before we build the model, we will randomly select 70% of the credit data as the training data to train the logistic model. The remaining 30% data is used for validation of this fit.
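A minimal sketch of such a 70/30 split is shown below. The seed, the object names and the stand-in data frame are assumptions made for illustration; the report does not state which seed or code it used.

```r
set.seed(2024)  # arbitrary seed, used only to make this sketch reproducible

n <- 1000
toy_credit <- data.frame(id = seq_len(n))  # stand-in for the credit data

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(n, size = round(0.70 * n))
toy_train <- toy_credit[train_idx, , drop = FALSE]
toy_test  <- toy_credit[-train_idx, , drop = FALSE]

c(train = nrow(toy_train), test = nrow(toy_test))
```

Because the split is random, the fitted models and the reported rates change slightly with a different seed.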

As mentioned previously, the second risk is the greater risk, as the loss includes the principal amount as well. Hence, using the training sample, we find the optimal cut-off probability by a grid search with an asymmetric cost.
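The grid search can be sketched as follows. The 5:1 weighting of false negatives to false positives is inferred from the loss matrix in the rpart call printed later in this report, and it reproduces the test-set costs reported there. The function name asym_cost, the strict ">" comparison, the grid step and the toy vectors are assumptions for illustration only.

```r
# Asymmetric cost: a missed defaulter (false negative) is weighted 5 times
# as heavily as a wrongly rejected good applicant (false positive).
asym_cost <- function(obs, pred_prob, cutoff, w_fn = 5, w_fp = 1) {
  pred <- as.numeric(pred_prob > cutoff)
  fn <- sum(obs == 1 & pred == 0)
  fp <- sum(obs == 0 & pred == 1)
  (w_fn * fn + w_fp * fp) / length(obs)
}

# Hypothetical observed classes and predicted default probabilities.
obs       <- c(0, 0, 0, 0, 1, 1, 0, 1, 0, 0)
pred_prob <- c(0.05, 0.10, 0.30, 0.20, 0.60, 0.15, 0.08, 0.45, 0.12, 0.25)

# Evaluate the cost on a grid of cut-offs and keep the cheapest one.
cutoffs     <- seq(0.01, 0.99, by = 0.01)
costs       <- sapply(cutoffs, function(p) asym_cost(obs, pred_prob, p))
best_cutoff <- cutoffs[which.min(costs)]
best_cutoff
```

Because false negatives are penalised more heavily, the selected cut-off tends to sit well below 0.5, which is in line with the low cut-offs chosen in the sections that follow.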

Logistic Regression

Step AIC

Approach:

  • Fit a null model on the training data where the response is regressed on the intercept

  • Fit a full model on the training data where the response is regressed on all the predictor variables

  • Use step AIC forward selection algorithm to fit our final model
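The three steps above can be sketched with base R's step() function. This is an illustration on simulated data, not the report's code: the data frame toy and its columns x1 to x3 are made up, and the report may have used a different stepwise routine.

```r
set.seed(1)  # arbitrary seed for the simulated data
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$response <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1))

# Null model (intercept only) and full model (all predictors).
null_model <- glm(response ~ 1, family = binomial, data = toy)
full_model <- glm(response ~ ., family = binomial, data = toy)

# Forward selection: start from the null model and search up to the full model.
# k = 2 is the AIC penalty; k = log(n) would give the BIC penalty instead.
step_aic <- step(null_model,
                 scope = list(lower = formula(null_model),
                              upper = formula(full_model)),
                 direction = "forward", k = 2, trace = 0)

formula(step_aic)
AIC(step_aic)
```

Forward selection adds a term only when doing so lowers the criterion, so the selected model never has a higher AIC than the null model.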

The final model has the following predictor variables:

##                        Estimate     Pr(>|z|)
## (Intercept)        0.9922340787 2.108212e-01
## chk_acctA12       -0.5755083547 2.342864e-02
## chk_acctA13       -1.4700974482 1.377762e-03
## chk_acctA14       -2.0399014900 3.513269e-12
## credit_hisA31      0.2590174737 6.842753e-01
## credit_hisA32     -0.3866250466 4.507211e-01
## credit_hisA33     -0.8932764594 1.286990e-01
## credit_hisA34     -1.1642357154 3.015917e-02
## saving_acctA62    -0.1036014437 7.600248e-01
## saving_acctA63    -0.8265475424 1.296496e-01
## saving_acctA64    -1.4483771337 2.090331e-02
## saving_acctA65    -1.1249840962 4.064682e-04
## duration           0.0201762204 6.671238e-02
## purposeA41        -1.7731123563 3.968719e-05
## purposeA410       -1.6017696869 6.050567e-02
## purposeA42        -1.0663167985 6.957862e-04
## purposeA43        -0.8659557933 2.448656e-03
## purposeA44        -1.6936017565 1.872997e-01
## purposeA45         0.5248827103 4.282168e-01
## purposeA46        -0.4394483086 3.491969e-01
## purposeA48        -1.8941306914 1.402732e-01
## purposeA49        -1.0202434352 1.362715e-02
## other_debtorA102   0.7387310033 1.396637e-01
## other_debtorA103  -1.0733474597 2.483626e-02
## other_installA142 -0.2217782567 6.454375e-01
## other_installA143 -0.9488384309 1.176864e-03
## foreignA202       -1.7838129270 3.469892e-02
## present_empA72     0.4649719454 3.081107e-01
## present_empA73    -0.0967302672 8.222627e-01
## present_empA74    -0.6077952725 2.010794e-01
## present_empA75    -0.1754880569 6.949012e-01
## amount             0.0001135857 2.295298e-02
## installment_rate   0.1876313904 7.225762e-02

Observations:

  • The AIC based model has 11 predictor variables.
  • The AIC of the model is 673.0247.
  • Several coefficients are not significant (p-value > 0.05), but they are still retained in the model.

The above model predicts the probability of a person defaulting. In order to classify a person as a defaulter, we need a cut-off probability.

Based on the cost function, we will determine the cut-off probability and select the cut-off that corresponds to the minimum cost.

Based on the above plot, a cut-off probability of 0.14 gives the minimum cost, so we will use 0.14 as the cut-off probability.

In-sample prediction (less important)

ROC Curve :

The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of classification.

  • The Area under the curve is 0.84
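To make the meaning of the area under the curve concrete, the sketch below computes it with the rank-based (Mann-Whitney) formula in base R. This is a stand-in for the ROCR calculation used in this report, not the report's code; the function name auc_rank and the four toy observations are assumptions.

```r
# AUC as the probability that a randomly chosen defaulter receives a higher
# predicted probability than a randomly chosen non-defaulter.
auc_rank <- function(obs, pred_prob) {
  r  <- rank(pred_prob)          # average ranks, so ties are handled
  n1 <- sum(obs == 1)            # number of defaulters
  n0 <- sum(obs == 0)            # number of non-defaulters
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical observed classes and predicted default probabilities.
obs  <- c(0, 0, 1, 1)
pred <- c(0.10, 0.40, 0.35, 0.80)

auc_rank(obs, pred)  # 3 of the 4 defaulter/non-defaulter pairs are ordered correctly: 0.75
```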

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 268 235
##           1  15 182

Observation:

Out of 700 observations,

  • 450 observations have been classified correctly

  • There are 235 False positives

  • There are 15 False negatives

Model Evaluation (Based on training data):

Misclassification Rate   False Positive Rate   False Negative Rate
0.357                    0.467                 0.076

Out-of-sample prediction (more important)

Using the above model, we predict the response of the test data to see who have defaulted.

ROC Curve :

Let us have a look at the out of sample ROC curve:

  • The Area under the curve is 0.77

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 110  87
##           1  19  84

Observation:

Out of 300 observations,

  • 194 observations have been classified correctly

  • There are 87 False positives

  • There are 19 False negatives
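The evaluation measures for this model follow directly from the four counts in the test-set table above. The sketch below reproduces the arithmetic; the 5:1 cost weighting is inferred from the loss matrix in the rpart call printed later in this report, and the object names are assumptions.

```r
# Test-set misclassification table for the step AIC model (rows = observed).
conf <- matrix(c(110, 19, 87, 84), nrow = 2,
               dimnames = list(observation = c("0", "1"),
                               prediction  = c("0", "1")))

tn <- conf["0", "0"]; fp <- conf["0", "1"]
fn <- conf["1", "0"]; tp <- conf["1", "1"]
n  <- sum(conf)

misclass_rate <- (fp + fn) / n      # share of all cases classified wrongly
fpr  <- fp / (fp + tn)              # non-defaulters flagged as defaulters
fnr  <- fn / (fn + tp)              # defaulters that were missed
cost <- (5 * fn + 1 * fp) / n       # asymmetric cost per observation

round(c(misclass = misclass_rate, fpr = fpr, fnr = fnr, cost = cost), 3)
# misclass = 0.353, fpr = 0.442, fnr = 0.184, cost = 0.607
```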

Model Evaluation (Based on test data):

Misclassification Rate   False Positive Rate   False Negative Rate   Cost
0.353                    0.442                 0.184                 0.607

Step BIC

We will fit a model using step BIC, in a similar way to the step AIC model.

Approach:

  • Fit a null model on the training data where the response is regressed on the intercept

  • Fit a full model on the training data where the response is regressed on all the predictor variables

  • Use step BIC forward selection algorithm to fit our final model

The final model has the following predictor variables:

##                Estimate     Pr(>|z|)
## (Intercept) -0.69729990 1.006801e-03
## chk_acctA12 -0.56042234 7.541335e-03
## chk_acctA13 -1.31721558 1.687172e-03
## chk_acctA14 -2.17026570 2.982995e-17
## duration     0.02847478 9.953440e-05

Observations:

  • The BIC based model has only 2 variables.
  • The AIC of the model is 731.2582, higher than that of the step AIC model.
  • The BIC of the model is 754.0136.
  • All the variables are significant (p-value < 0.05).
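As a quick consistency check, the BIC can be recovered from the AIC, since the two criteria differ only in the penalty per estimated coefficient (2 for AIC, log(n) for BIC). The sketch below uses the 5 coefficients shown in the table above and the 700 training observations; the variable names are assumptions.

```r
aic <- 731.2582   # AIC reported for the step BIC model
k   <- 5          # coefficients: intercept, 3 chk_acct dummies, duration
n   <- 700        # number of training observations

# BIC = AIC + k * (log(n) - 2): swap the AIC penalty for the BIC penalty.
bic <- aic + k * (log(n) - 2)
round(bic, 4)     # 754.0136
```

Since log(700) is about 6.55, each extra coefficient costs more than three times as much under BIC, which is why the BIC model is much smaller than the AIC model.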

Based on Step BIC Logistic Regression, the important factors that determine whether a person will default or not are:

  • Checking Account
  • Duration

Again, we will determine the optimal cut-off probability based on the cost function.

Based on the above plot, a cut-off probability of 0.16 gives the minimum cost, so we will use 0.16 as the cut-off probability.

In-sample prediction (less important)

ROC Curve :

The ROC curve below gives an overall measure of the goodness of classification.

  • The Area under the curve is 0.75, lower than that of the step AIC model.

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 242 261
##           1  25 172

Observation:

Out of 700 observations,

  • 414 observations have been classified correctly

  • There are 261 False positives

  • There are 25 False negatives

Misclassification Rate   False Positive Rate   False Negative Rate   Cost
0.409                    0.519                 0.127                 0.551

Out-of-sample prediction (more important)

ROC Curve :

Let us have a look at the out of sample ROC curve:

  • The Area under the curve is 0.7569, less than that of the step AIC model.

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 116  81
##           1  22  81

Observation:

Out of 300 observations,

  • 197 observations have been classified correctly

  • There are 81 False positives

  • There are 22 False negatives

Misclassification Rate   False Positive Rate   False Negative Rate   Cost
0.343                    0.411                 0.214                 0.637

Variable selection - LASSO

We fit the LASSO model to our data. From the plot below, we see that as the value of lambda keeps on increasing, the coefficients for the variables tend to 0.

Using cross validation, we will find the optimal value of lambda for LASSO fit.

Based on cross validation, selecting lambda corresponding to the Error within 1 standard deviation of the minimum error, the LASSO fit gives the following variables:
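The cross-validated choice of lambda can be sketched with cv.glmnet() from the glmnet package. The simulated matrix x, the response y, the seed and the fold count are assumptions for illustration and do not reproduce the coefficients printed below.

```r
library(glmnet)

set.seed(1)  # arbitrary seed for the simulated data and the CV folds
n <- 300
x <- matrix(rnorm(n * 6), ncol = 6,
            dimnames = list(NULL, paste0("x", 1:6)))
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x[, "x1"] - 1.0 * x[, "x2"]))

# alpha = 1 selects the LASSO penalty; 10-fold CV over the lambda path.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

# lambda.1se: the largest lambda whose CV error is within one standard
# error of the minimum, which gives a sparser model than lambda.min.
coef_1se <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(coef_1se)[coef_1se[, 1] != 0], "(Intercept)")
selected
```

Predictors whose coefficient is shrunk exactly to zero are shown as "." in the glmnet output and are dropped from the selected set.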

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept)       2.36240301
## chk_acct         -0.50553227
## duration          0.15280083
## credit_his       -0.22675199
## purpose           .         
## amount            0.05975605
## saving_acct      -0.15294536
## present_emp      -0.07337619
## installment_rate  0.02685850
## sex               .         
## other_debtor     -0.01679098
## present_resid     .         
## property          0.03403184
## age               .         
## other_install    -0.17492876
## housing           .         
## n_credits         .         
## job               .         
## n_people          .         
## telephone         .         
## foreign          -0.38044129

We will fit a model based on LASSO:

##                       Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept)      -1.0064450065 3.545768e-01 -2.838440 4.533457e-03
## chk_acctA12      -0.6073777434 2.129442e-01 -2.852286 4.340604e-03
## chk_acctA13      -1.2705447162 4.224797e-01 -3.007351 2.635356e-03
## chk_acctA14      -2.1987011282 2.582536e-01 -8.513728 1.684286e-17
## amount            0.0001164014 3.183539e-05  3.656352 2.558298e-04
## installment_rate  0.1983866238 8.880422e-02  2.233977 2.548455e-02
## foreignA202      -1.7039659957 7.675981e-01 -2.219867 2.642776e-02

We can see that, based on the non-zero LASSO coefficients, the important factors that determine whether a person will default or not are:

  • Checking Account Status
  • Duration
  • Credit History
  • Amount
  • Savings Account
  • Present employment since
  • Installment Rate
  • Other debtors / guarantors
  • Property
  • Other installment
  • Foreign worker

In-sample prediction (less important)

ROC Curve :

Let us have a look at the in-sample ROC curve:

  • The Area under the curve is 0.75.

Below is the misclassification table:

##            prediction
## observation   0   1
##           0 245 258
##           1  23 174

Observation:

Out of 700 observations,

  • 419 observations have been classified correctly

  • There are 258 False positives

  • There are 23 False negatives

Model Evaluation (Based on train data):

Misclassification Rate   False Positive Rate   False Negative Rate   Cost
0.401                    0.513                 0.117                 0.533

Out-of-sample prediction (more important)

ROC Curve :

Let us have a look at the out of sample ROC curve:

  • The Area under the curve is 0.767.

Below is the misclassification table:

##            prediction
## observation   0   1
##           0 119  78
##           1  18  85

Observation:

Out of 300 observations,

  • 204 observations have been classified correctly

  • There are 78 False positives

  • There are 18 False negatives

Model Evaluation (Based on test data):

Misclassification Rate   False Positive Rate   False Negative Rate   Cost
0.320                    0.396                 0.175                 0.560

Decision Tree

Based on the data at hand, the decision tree looks like :

Plotting the complexity parameters for all possible number of splits

Printing the complexity parameters for all possible number of splits

## 
## Classification tree:
## rpart(formula = response ~ ., data = german_credit_train, method = "class", 
##     parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
## 
## Variables actually used in tree construction:
##  [1] age           chk_acct      duration      n_credits     other_install
##  [6] present_emp   property      purpose       saving_acct   sex          
## 
## Root node error: 503/700 = 0.71857
## 
## n= 700 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.212724      0   1.00000 5.0000 0.11827
## 2 0.087475      1   0.78728 2.3917 0.12280
## 3 0.026839      2   0.69980 2.6938 0.12655
## 4 0.017893      4   0.64612 2.7376 0.12649
## 5 0.013917      5   0.62823 2.6700 0.12552
## 6 0.012922     10   0.55467 2.5586 0.12385
## 7 0.011928     12   0.52883 2.6044 0.12445
## 8 0.010934     15   0.49304 2.5686 0.12395
## 9 0.010000     17   0.47117 2.4115 0.12209

Pruning the tree using optimal complexity parameter and then plotting the optimal tree
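A minimal sketch of fitting and pruning a cost-sensitive tree with rpart is shown below. The loss matrix is the one from the call printed above: matrix(c(0, 5, 1, 0), nrow = 2) is filled column by column, so the entry 5 sits in row 2, column 1, which is the cost of predicting class 0 for a true class 1 (a missed defaulter). The simulated data, the cp value and the rule of picking the complexity parameter with the smallest cross-validated error are assumptions; the report does not state which rule it used.

```r
library(rpart)

set.seed(1)  # arbitrary seed for the simulated data and cross-validation
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = runif(n))
toy$response <- factor(rbinom(n, 1, plogis(-1 + 1.5 * toy$x1)))

# Grow a deliberately large tree with a 5:1 loss for missed defaulters.
fit <- rpart(response ~ ., data = toy, method = "class",
             parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)),
             cp = 0.001)

# Pick the complexity parameter with the smallest cross-validated error.
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to the subtree that corresponds to that complexity parameter.
pruned <- prune(fit, cp = best_cp)
nrow(pruned$frame)  # number of nodes left after pruning
```

Applied to the complexity table printed above, this rule would pick the row with xerror 2.3917, that is, the tree with a single split.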

In-sample prediction (less important)

ROC Curve :

The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of classification.

The AUC of the above ROC curve is 0.80

## [1] 0.8020759

The misclassification table is as follows:

##      Predicted
## Truth   0   1
##     0 292 211
##     1  11 186

Observation:

Out of 700 observations,

  • 478 observations have been classified correctly

  • There are 211 False positives

  • There are 11 False negatives

Model Evaluation (Based on training data):

Misclassification Rate   False Positive Rate   False Negative Rate
0.317                    0.419                 0.056

Out-of-sample prediction (more important)

The AUC of the above ROC curve is 0.72

The misclassification table is as follows:

##      Predicted
## Truth   0   1
##     0 122  75
##     1  29  74

Observation:

Out of 300 observations,

  • 196 observations have been classified correctly

  • There are 75 False positives

  • There are 29 False negatives

Model Evaluation (Based on test data):

Misclassification Rate   False Positive Rate   False Negative Rate   Cost
0.347                    0.381                 0.282                 0.733

Conclusion

The table below gives a comparison of the Misclassification Rate, False Positive Rate, False Negative Rate and Cost on the test set for the 4 different classification models.

Model                                     Misclassification Rate   False Positive Rate   False Negative Rate   Cost
Step AIC                                  0.353                    0.442                 0.184                 0.607
Step BIC                                  0.343                    0.411                 0.214                 0.637
Model based on LASSO variable selection   0.320                    0.396                 0.175                 0.560
Decision Tree                             0.347                    0.381                 0.282                 0.733

Major Findings: In this case, based on the asymmetric cost, the predictive power of the models ranks as: Model based on LASSO Variable Selection > Step AIC > Step BIC > Classification Tree.