German Credit Risk Analysis

Introduction

Problem Statement

When a company receives a loan application, it has to decide whether or not to approve the loan. This decision is based on the applicant’s profile.

A credit risk is the risk of default on a debt that may arise from a borrower failing to make required payments.

There are two types of risks associated with the bank’s decision:

If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank

It may be assumed that the second risk is the greater one, as the loss includes the principal amount as well.

It is therefore crucial for a company to evaluate the risks associated with lending money to a customer.

Goal

To compare the performance of various classification models in predicting the risk of loans for 1000 individuals.

Approach

Compare the asymmetric cost for the test set for different classification models.

Required Packages

The following packages are used:

ROCR : For evaluating and visualizing classifier performance
glmnet : For fitting a generalized linear model
kableExtra : For adding features to a kable() output
DT : Interface to the JavaScript library DataTables
ggplot2 : For data visualizations
ggpubr : Extension of ggplot2; provides some easy-to-use functions for creating and customizing ‘ggplot2’ plots
rpart : To generate classification tree
rpart.plot : To plot the classification tree

library(ROCR)
library(glmnet)
library(kableExtra)
library(DT)
library(ggplot2)
library(ggpubr)
library(rpart)
library(rpart.plot)

Data

The data set has information about 1000 individuals, on the basis of which they have been classified as risky or not. There are 8 quantitative variables (including the response) and 13 qualitative variables.

Whether a particular loan is good or bad is indicated by ‘response’. We recode it from ‘1’ and ‘2’ to ‘0’ and ‘1’, so that it is a 0/1 outcome suitable for logistic regression; after recoding, 1 marks a bad loan.
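The recoding step can be sketched as follows. This is a minimal illustration on a made-up vector, assuming the raw coding is 1 = good and 2 = bad and that a simple subtraction is used; the name `response_raw` is hypothetical.

```r
# Hypothetical raw coding: 1 = good credit, 2 = bad credit
response_raw <- c(1, 2, 1, 1, 2)

# Shift to 0/1 so that 1 marks a bad loan (defaulter)
response <- response_raw - 1

# Only the values 0 and 1 should remain after recoding
print(table(response))
```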

Data Dictionary

Each variable is described below:

Variable	Description
chk_acct	Status of existing checking account
duration	Duration in months
credit_his	Credit history
purpose	Purpose (car, furniture, education)
amount	Credit amount
saving_acct	Savings account/bonds
present_emp	Present employment since
installment_rate	Installment rate in percentage of disposable income
sex	Personal status and sex
other_debtor	Other debtors / guarantors
present_resid	Present residence since
property	Property (real estate, life insurance)
age	Age in years
other_install	Other installment plans (bank, stores, none)
housing	Housing (rent, own, free)
n_credits	Number of existing credits at this bank
job	Job
n_people	Number of people being liable to provide maintenance for
telephone	Telephone
foreign	Foreign worker
response	Credit risk (0 = good, 1 = bad, after recoding)

Final data

Let us have a look at the top 100 records of the dataset

Exploratory Data Analysis

Let us have a look at the structure of the data:

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose         : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response        : num  0 1 0 0 1 0 0 0 0 1 ...

Observations:

There are 1000 observations and 21 variables
Out of the 21 variables, 13 are categorical factors, 7 are integers and the response is numeric (0/1).

Analysis of Continuous Variables

Observations:

The variables duration, amount, age, n_credits and n_people have outliers.
Amount has the maximum number of outliers.
There are no outliers in installment_rate and present_resid.

Now, let us have a look at the distribution of duration, amount and age, split by the binary variable response (response = 1 means that the person is a defaulter and 0 means that the person is not).

Note: The defaulters are shown in red and the non-defaulters in grey.

Observation

The median duration is higher for defaulters than for non-defaulters. This means that defaulters tend to have loans with longer repayment periods.
There is not much difference between the amount of loan taken by defaulters and non-defaulters.
The median age is lower for defaulters than for non-defaulters. This means that younger applicants tend to default more often.

The above plot suggests that there is a significant difference between the counts of defaulters and non-defaulters among applicants with no checking account: the number of good-credit applicants is much higher than the number of defaulters. A probable reason is that people without a checking account may instead hold a savings account with enough savings to repay the loan on time.

Here, the credit history codes are:

A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)

There is a significant difference in the counts for A32: among people whose existing credits have been paid back duly till now, the number of non-defaulters is high.

It is surprising to see that the number of non-defaulters is high for the critical accounts as well.

The above plot suggests that when the purpose of the loan is radio/television (A43), the majority of people pay the loan back, keeping the count of non-defaulters very high.

Similarly, we can see the trends based on sex, whether the person is a foreign worker, present employment, etc.

Modeling

Before we build the model, we will randomly select 70% of the credit data as the training data to train the logistic model. The remaining 30% data is used for validation of this fit.
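A 70/30 random split of this kind can be sketched in base R as follows. The data frame here is a synthetic stand-in, the seed is arbitrary, and `german_credit` is an assumed name for the full data set; only `german_credit_train` appears in the output shown later.

```r
set.seed(2020)  # arbitrary seed, used only to make the split reproducible

# Stand-in data frame with 1000 rows (the real data has 21 columns)
german_credit <- data.frame(id = 1:1000,
                            response = rbinom(1000, 1, 0.3))

# Draw 70% of the row indices without replacement for training
n_train <- round(0.7 * nrow(german_credit))
train_idx <- sample(nrow(german_credit), size = n_train)

german_credit_train <- german_credit[train_idx, ]
german_credit_test  <- german_credit[-train_idx, ]

dim(german_credit_train)
dim(german_credit_test)
```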

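As mentioned previously, the second risk is the greater one, as the loss includes the principal amount as well. Hence, using the training sample, we find the optimal cut-off probability by grid search with an asymmetric cost, in which approving a bad loan (a false negative) is penalized more heavily than rejecting a good one (a false positive).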
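The grid search can be sketched as below. The 5:1 weighting is an assumption inferred from the loss matrix in the rpart call and from the reported cost values; the function name `asym_cost`, the synthetic labels and the synthetic scores are all hypothetical.

```r
# Asymmetric cost: a missed defaulter (false negative) is weighted 5,
# a rejected good applicant (false positive) is weighted 1 (assumed weights).
asym_cost <- function(obs, pred_prob, cutoff, w_fn = 5, w_fp = 1) {
  pred <- as.integer(pred_prob > cutoff)
  fn <- (obs == 1) & (pred == 0)
  fp <- (obs == 0) & (pred == 1)
  mean(w_fn * fn + w_fp * fp)
}

set.seed(1)
# Synthetic labels and predicted probabilities, for illustration only
obs <- rbinom(500, 1, 0.3)
pred_prob <- plogis(rnorm(500, mean = ifelse(obs == 1, 0, -1.5)))

# Evaluate the cost on a grid of cut-offs and keep the cheapest one
cutoffs <- seq(0.01, 0.99, by = 0.01)
costs <- sapply(cutoffs, function(p) asym_cost(obs, pred_prob, p))
best_cutoff <- cutoffs[which.min(costs)]
best_cutoff
```

Because false negatives are weighted more heavily, the cost-minimizing cut-off tends to sit well below 0.5.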

Logistic Regression

Step AIC

Approach:

Fit a null model on the training data where the response is regressed on the intercept
Fit a full model on the training data where the response is regressed on all the predictor variables
Use step AIC forward selection algorithm to fit our final model
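The three steps above can be sketched with `glm()` and `step()` from base R. The data below are synthetic and the variable names `x1`, `x2`, `x3` and `y` are placeholders, so this only illustrates the mechanics, not the fitted model reported here.

```r
set.seed(42)
n <- 300
# Synthetic predictors; only x1 and x2 drive the outcome
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1 - 0.8 * d$x2))

null_model <- glm(y ~ 1, family = binomial, data = d)  # intercept only
full_model <- glm(y ~ ., family = binomial, data = d)  # all predictors

# Forward selection from the null model toward the full model.
# k = 2 gives the AIC penalty; k = log(n) would give the BIC penalty.
step_aic <- step(null_model,
                 scope = list(lower = formula(null_model),
                              upper = formula(full_model)),
                 direction = "forward", k = 2, trace = 0)

formula(step_aic)
AIC(step_aic)
```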

The final model has the following predictor variables:

##                        Estimate     Pr(>|z|)
## (Intercept)        0.9922340787 2.108212e-01
## chk_acctA12       -0.5755083547 2.342864e-02
## chk_acctA13       -1.4700974482 1.377762e-03
## chk_acctA14       -2.0399014900 3.513269e-12
## credit_hisA31      0.2590174737 6.842753e-01
## credit_hisA32     -0.3866250466 4.507211e-01
## credit_hisA33     -0.8932764594 1.286990e-01
## credit_hisA34     -1.1642357154 3.015917e-02
## saving_acctA62    -0.1036014437 7.600248e-01
## saving_acctA63    -0.8265475424 1.296496e-01
## saving_acctA64    -1.4483771337 2.090331e-02
## saving_acctA65    -1.1249840962 4.064682e-04
## duration           0.0201762204 6.671238e-02
## purposeA41        -1.7731123563 3.968719e-05
## purposeA410       -1.6017696869 6.050567e-02
## purposeA42        -1.0663167985 6.957862e-04
## purposeA43        -0.8659557933 2.448656e-03
## purposeA44        -1.6936017565 1.872997e-01
## purposeA45         0.5248827103 4.282168e-01
## purposeA46        -0.4394483086 3.491969e-01
## purposeA48        -1.8941306914 1.402732e-01
## purposeA49        -1.0202434352 1.362715e-02
## other_debtorA102   0.7387310033 1.396637e-01
## other_debtorA103  -1.0733474597 2.483626e-02
## other_installA142 -0.2217782567 6.454375e-01
## other_installA143 -0.9488384309 1.176864e-03
## foreignA202       -1.7838129270 3.469892e-02
## present_empA72     0.4649719454 3.081107e-01
## present_empA73    -0.0967302672 8.222627e-01
## present_empA74    -0.6077952725 2.010794e-01
## present_empA75    -0.1754880569 6.949012e-01
## amount             0.0001135857 2.295298e-02
## installment_rate   0.1876313904 7.225762e-02

Observations :

The AIC based model has 11 variables.
The AIC of the model is 673.0247
There are many variables that are not significant (p-value > 0.05), but they are still retained in the model.

The above model predicts the probability of a person defaulting. In order to classify a person as a defaulter, we need a cut-off probability.

Based on the cost function we will determine the cut-off probability and select the cut-off that corresponds to minimum cost.

Based on the above plot, the cutoff probability of 0.14 gives the minimum cost. So we will consider the cutoff probability as 0.14.

In-sample prediction (less important)

ROC Curve :

The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of classification.

The Area under the curve is 0.84
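The AUC values in this report are presumably computed with ROCR. As a package-free illustration of what the number means, the sketch below uses the rank-sum identity on a tiny made-up example; the function name `auc_rank` and the data are hypothetical.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen defaulter gets a higher score than a random non-defaulter
auc_rank <- function(obs, score) {
  r  <- rank(score)          # average ranks handle ties
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and scores: 8 of the 9 (defaulter, non-defaulter) pairs
# are ordered correctly
obs   <- c(0, 0, 1, 0, 1, 1)
score <- c(0.10, 0.35, 0.40, 0.50, 0.80, 0.90)
auc <- auc_rank(obs, score)
auc
```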

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 268 235
##           1  15 182

Observation:

Out of 700 observations,

450 observations have been classified correctly
There are 235 False positives
There are 15 False negatives

Model Evaluation (Based on training data):

Misclassification Rate	False Positive Rate	False Negative Rate
0.357	0.467	0.076

Out-of-sample prediction (more important)

Using the above model, we predict the response of the test data to see who have defaulted.

ROC Curve :

Let us have a look at the out of sample ROC curve:

The Area under the curve is 0.77

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 110  87
##           1  19  84

Observation:

Out of 300 observations,

194 observations have been classified correctly
There are 87 False positives
There are 19 False negatives

Model Evaluation (Based on test data):

Misclassification Rate	False Positive Rate	False Negative Rate	Cost
0.353	0.442	0.184	0.607
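The rates in the table above follow directly from the misclassification table. The sketch below recomputes them from the out-of-sample counts, assuming the cost is the average of 5 per false negative and 1 per false positive (a weighting inferred from the reported values, not stated explicitly).

```r
# Out-of-sample misclassification table for the step AIC model
# (rows = observed class, columns = predicted class)
cm <- matrix(c(110, 87,
                19, 84),
             nrow = 2, byrow = TRUE,
             dimnames = list(observation = c("0", "1"),
                             prediction  = c("0", "1")))

n  <- sum(cm)
fp <- cm["0", "1"]   # good applicants predicted as defaulters
fn <- cm["1", "0"]   # defaulters predicted as good

metrics <- c(
  mis_rate = (fp + fn) / n,
  fpr      = fp / sum(cm["0", ]),
  fnr      = fn / sum(cm["1", ]),
  cost     = (1 * fp + 5 * fn) / n   # assumed 5:1 weighting
)
round(metrics, 3)
```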

Step BIC

We will fit a model using step BIC, in the same way as the step AIC model.

Approach:

Fit a null model on the training data where the response is regressed on the intercept
Fit a full model on the training data where the response is regressed on all the predictor variables
Use step BIC forward selection algorithm to fit our final model

The final model has the following predictor variables:

##                Estimate     Pr(>|z|)
## (Intercept) -0.69729990 1.006801e-03
## chk_acctA12 -0.56042234 7.541335e-03
## chk_acctA13 -1.31721558 1.687172e-03
## chk_acctA14 -2.17026570 2.982995e-17
## duration     0.02847478 9.953440e-05

Observations :

The BIC based model has only 2 variables.
The AIC of the model is 731.2582, higher than that of the step AIC model
The BIC of the model is 754.0136
All the variables are significant (p-value < 0.05).
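As a quick consistency check, the BIC can be recovered from the AIC, since both share the same log-likelihood term and differ only in the penalty per parameter (2 for AIC, log(n) for BIC). The sketch assumes 5 estimated coefficients, as in the table above, and 700 training observations.

```r
aic <- 731.2582   # reported AIC of the step BIC model
k   <- 5          # intercept + 3 chk_acct dummies + duration
n   <- 700        # training observations

neg2_loglik <- aic - 2 * k            # -2 * log-likelihood
bic <- neg2_loglik + k * log(n)       # swap in the BIC penalty
round(bic, 4)
```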

Based on Step BIC Logistic Regression, the important factors that determine whether a person will default or not are:

Checking Account
Duration

Again, we will determine the optimal cut-off probability based on the cost function.

Based on the above plot, the cutoff probability of 0.16 gives the minimum cost. So we will consider the cutoff probability as 0.16.

In-sample prediction (less important)

ROC Curve :

Below ROC Curve gives an overall measure of goodness of classification.

The Area under the curve is 0.75, lower than that of the step AIC model.

The misclassification table is given below:

##            prediction
## observation   0   1
##           0 242 261
##           1  25 172

Observation:

Out of 700 observations,

414 observations have been classified correctly
There are 261 False positives
There are 25 False negatives

Misclassification Rate	False Positive Rate	False Negative Rate	Cost
0.409	0.519	0.127	0.551

Out-of-sample prediction (more important)

ROC Curve :

Let us have a look at the out of sample ROC curve:

The Area under the curve is 0.7569, lower than that of the step AIC model.

##            prediction
## observation   0   1
##           0 116  81
##           1  22  81

Observation:

Out of 300 observations,

197 observations have been classified correctly
There are 81 False positives
There are 22 False negatives

Misclassification Rate	False Positive Rate	False Negative Rate	Cost
0.343	0.411	0.214	0.637

Variable selection - LASSO

We fit the LASSO model to our data. From the plot below, we see that as the value of lambda keeps on increasing, the coefficients for the variables tend to 0.

Using cross validation, we will find the optimal value of lambda for LASSO fit.

Based on cross validation, we select the largest lambda whose error is within 1 standard error of the minimum error. At this lambda, the LASSO fit gives the following coefficients:

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept)       2.36240301
## chk_acct         -0.50553227
## duration          0.15280083
## credit_his       -0.22675199
## purpose           .         
## amount            0.05975605
## saving_acct      -0.15294536
## present_emp      -0.07337619
## installment_rate  0.02685850
## sex               .         
## other_debtor     -0.01679098
## present_resid     .         
## property          0.03403184
## age               .         
## other_install    -0.17492876
## housing           .         
## n_credits         .         
## job               .         
## n_people          .         
## telephone         .         
## foreign          -0.38044129
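The selection rule for lambda can be illustrated without fitting anything. The sketch below applies the one-standard-error rule to a made-up cross-validation summary; the numbers are hypothetical, and the column names `cvm` and `cvsd` simply mirror the fields of a `cv.glmnet` result, where the two choices are exposed as `lambda.min` and `lambda.1se`.

```r
# Hypothetical cross-validation summary: one row per candidate lambda
cv <- data.frame(
  lambda = c(0.100, 0.050, 0.020, 0.010, 0.005),
  cvm    = c(1.20, 1.08, 1.02, 1.00, 1.01),   # mean CV error
  cvsd   = c(0.04, 0.04, 0.03, 0.03, 0.03)    # its standard error
)

# Lambda with the smallest mean CV error
i_min <- which.min(cv$cvm)
lambda_min <- cv$lambda[i_min]

# One-standard-error rule: the largest lambda whose error stays within
# one standard error of the minimum error
threshold  <- cv$cvm[i_min] + cv$cvsd[i_min]
lambda_1se <- max(cv$lambda[cv$cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

The larger lambda shrinks more coefficients to zero, which is why this rule yields a sparser model than the minimum-error choice.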

We will fit a model based on LASSO

##                       Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept)      -1.0064450065 3.545768e-01 -2.838440 4.533457e-03
## chk_acctA12      -0.6073777434 2.129442e-01 -2.852286 4.340604e-03
## chk_acctA13      -1.2705447162 4.224797e-01 -3.007351 2.635356e-03
## chk_acctA14      -2.1987011282 2.582536e-01 -8.513728 1.684286e-17
## amount            0.0001164014 3.183539e-05  3.656352 2.558298e-04
## installment_rate  0.1983866238 8.880422e-02  2.233977 2.548455e-02
## foreignA202      -1.7039659957 7.675981e-01 -2.219867 2.642776e-02

We can see that based on the LASSO fit, the important factors (those with nonzero coefficients) that determine whether a person will default or not are:

Checking Account Status
Duration
Credit History
Amount
Savings Account
Present employment since
Installment Rate
Other debtors / guarantors
Property
Other installment
Foreign worker

In-sample prediction (less important)

ROC Curve :

Let us have a look at the in-sample ROC curve:

The Area under the curve is 0.75.

Below is the misclassification table:

##            prediction
## observation   0   1
##           0 245 258
##           1  23 174

Observation:

Out of 700 observations,

419 observations have been classified correctly
There are 258 False positives
There are 23 False negatives

Model Evaluation (Based on training data):

Misclassification Rate	False Positive Rate	False Negative Rate	Cost
0.401	0.513	0.117	0.533

Out-of-sample prediction (more important)

ROC Curve :

Let us have a look at the out of sample ROC curve:

The Area under the curve is 0.767.

Below is the misclassification table:

##            prediction
## observation   0   1
##           0 119  78
##           1  18  85

Observation:

Out of 300 observations,

204 observations have been classified correctly
There are 78 False positives
There are 18 False negatives

Model Evaluation (Based on test data):

Misclassification Rate	False Positive Rate	False Negative Rate	Cost
0.32	0.396	0.175	0.56

Decision Tree

Based on the training data, the fitted decision tree is shown below:

Plotting the complexity parameters for all possible number of splits

Printing the complexity parameters for all possible number of splits

## 
## Classification tree:
## rpart(formula = response ~ ., data = german_credit_train, method = "class", 
##     parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
## 
## Variables actually used in tree construction:
##  [1] age           chk_acct      duration      n_credits     other_install
##  [6] present_emp   property      purpose       saving_acct   sex          
## 
## Root node error: 503/700 = 0.71857
## 
## n= 700 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.212724      0   1.00000 5.0000 0.11827
## 2 0.087475      1   0.78728 2.3917 0.12280
## 3 0.026839      2   0.69980 2.6938 0.12655
## 4 0.017893      4   0.64612 2.7376 0.12649
## 5 0.013917      5   0.62823 2.6700 0.12552
## 6 0.012922     10   0.55467 2.5586 0.12385
## 7 0.011928     12   0.52883 2.6044 0.12445
## 8 0.010934     15   0.49304 2.5686 0.12395
## 9 0.010000     17   0.47117 2.4115 0.12209

Pruning the tree using optimal complexity parameter and then plotting the optimal tree
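Two details of the output above can be made concrete in base R: how the loss matrix passed to rpart is laid out, and how a complexity parameter might be picked from the printed table. Choosing the CP with the lowest cross-validated error is one common rule and is an assumption here, since the rule actually used is not stated; `fit` in the final comment is a hypothetical fitted tree.

```r
# Loss matrix from the rpart call: rows = true class, columns = predicted
loss <- matrix(c(0, 5, 1, 0), nrow = 2,
               dimnames = list(truth = c("0", "1"),
                               predicted = c("0", "1")))
loss["1", "0"]   # cost of predicting 0 for a true defaulter
loss["0", "1"]   # cost of predicting 1 for a true non-defaulter

# Complexity table copied from the printed output
cptable <- data.frame(
  CP     = c(0.212724, 0.087475, 0.026839, 0.017893, 0.013917,
             0.012922, 0.011928, 0.010934, 0.010000),
  nsplit = c(0, 1, 2, 4, 5, 10, 12, 15, 17),
  xerror = c(5.0000, 2.3917, 2.6938, 2.7376, 2.6700,
             2.5586, 2.6044, 2.5686, 2.4115)
)

# Assumed rule: take the CP with the lowest cross-validated error
best <- cptable[which.min(cptable$xerror), ]
best
# With a fitted tree `fit`, pruning would then be: prune(fit, cp = best$CP)
```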

In-sample prediction (less important)

ROC Curve :

The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of classification.

The AUC of the above ROC curve is 0.80

## [1] 0.8020759

The misclassification table is as follows:

##      Predicted
## Truth   0   1
##     0 292 211
##     1  11 186

Observation:

Out of 700 observations,

478 observations have been classified correctly
There are 211 False positives
There are 11 False negatives

Model Evaluation (Based on training data):

Misclassification Rate	False Positive Rate	False Negative Rate
0.317	0.419	0.056

Out-of-sample prediction (more important)

The AUC of the above ROC curve is 0.72

The misclassification table is as follows:

##      Predicted
## Truth   0   1
##     0 122  75
##     1  29  74

Observation:

Out of 300 observations,

196 observations have been classified correctly
There are 75 False positives
There are 29 False negatives

Model Evaluation (Based on test data):

Misclassification Rate	False Positive Rate	False Negative Rate	Cost
0.347	0.381	0.282	0.733

Conclusion

The table below gives a comparison of the misclassification rate, false positive rate, false negative rate and cost on the test set for the 4 classification models.

Model	Misclassification Rate	False Positive Rate	False Negative Rate	Cost
Step AIC	0.353	0.442	0.184	0.607
Step BIC	0.343	0.411	0.214	0.637
Model based on LASSO variable selection	0.320	0.396	0.175	0.560
Decision Tree	0.347	0.381	0.282	0.733
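The ranking by cost can be reproduced from the table with a short sort. The data frame below simply restates the test-set costs from the table; the object name `results` is hypothetical.

```r
# Test-set asymmetric cost per model, copied from the comparison table
results <- data.frame(
  model = c("Step AIC", "Step BIC",
            "LASSO variable selection", "Decision Tree"),
  cost  = c(0.607, 0.637, 0.560, 0.733)
)

# Lower cost is better, so sort in ascending order
ranked <- results[order(results$cost), ]
ranked
```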

Major Findings: In this case, based on the asymmetric cost, the predictive power of the models ranks as: Model based on LASSO Variable Selection > Step AIC > Step BIC > Decision Tree

Surabhi Kamath

2/14/2020
