Goals

The goal of this paper is to demonstrate basic credit scoring computations in the R programming language.

Approach to Model Building

It is suggested that credit scoring practitioners adopt a systems approach to model development and maintenance. From this point of view one can use the SOAR methodology developed by Don Brown at UVA (Brown, 2005). The SOAR process consists of understanding the goal of the system being developed and specifying it in clear terms, along with a clear understanding and specification of the data; observing the data; analyzing the data; and making recommendations (2005). For references on the traditional credit scoring development process, such as Lewis, Siddiqi, or Anderson, please see Ross Gayler's Credit Scoring references page (http://r.gayler.googlepages.com/creditscoringresources).

Practical Suggestions

Building high-performing models requires skill, the ability to conceptualize and understand data relationships, and some theory. It is helpful to be versed in the appropriate literature, brainstorm relationships that should exist in the data, and test them out. This is an ad hoc process I have used and found to be effective. For formal methods like Geschka's brainwriting and Zwicky's morphological box, see Gibson's guide to systems analysis (Gibson et al., 2004). For the advantages of R and introductory tutorials, see http://cran.r-project.org/other-docs.html.

About The Data

In this project we will be using historical bank data. It has 2004 bad loans and 26423 good loans, and it is a better data set than other open credit data sets because it is performance based rather than modeling the decision to grant a loan. The bad loans did not pay as intended. It is common in credit scoring to classify bad accounts as those which have ever had a 60-day delinquency or worse (for mortgage loans, 90 days past due is often used). The data can be downloaded here.

library(readr)
# load the training and test data sets
train <- read.csv("~/Credit_train.csv")
test <- read.csv("~/Credit_test.csv")
# inspect the structure of the training data
str(train)
## 'data.frame':    28427 obs. of  6 variables:
##  $ BUSAGE     : int  183 271 51 208 148 82 12 41 65 111 ...
##  $ BUSTYPE    : Factor w/ 6 levels "A","B","C","D",..: 2 5 1 1 1 4 1 1 2 1 ...
##  $ MAXLINEUTIL: num  0 0 0 0 0 0 0 0 0 NA ...
##  $ DAYSDELQ   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTACBAL   : num  0.24 1.37 1.52 1.64 1.78 1.88 2.21 3.11 3.25 3.44 ...
##  $ DEFAULT    : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

These are the names of the columns/variables

BUSAGE - Business age (age of the business measured in months)

BUSTYPE - Business type (Type of business classified from A-F)

MAXLINEUTIL - Maximum line of credit utilized by the customer

DAYSDELQ - Days delinquent (the number of days the customer has been delinquent on payments)

TOTACBAL - Total Account Balance (Total account balance for the customer)

DEFAULT - Default (Yes or No, indicating whether the client defaulted on the loan)

head(train)
##   BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ TOTACBAL DEFAULT
## 1    183       B           0        0     0.24       N
## 2    271       E           0        0     1.37       N
## 3     51       A           0        0     1.52       N
## 4    208       A           0        0     1.64       N
## 5    148       A           0        0     1.78       N
## 6     82       D           0        0     1.88       N

This is what the data looks like in tabular form.
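As a quick sanity check of the class balance and missing values described earlier, one might run something like the following (a sketch; output not shown):

# counts and proportions of good (N) vs. bad (Y) loans in the training data
table(train$DEFAULT)
prop.table(table(train$DEFAULT))

# missing values per column (the str() output above already hints at NAs in MAXLINEUTIL)
colSums(is.na(train))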

Dependent variable - ‘DEFAULT’: a binary variable indicating that the loan was not paid back in full, i.e., the borrower either defaulted or the loan was “charged off” (the borrower was deemed unlikely to ever pay it back).

Independent variables - ‘TOTACBAL’, ‘DAYSDELQ’, ‘MAXLINEUTIL’, ‘BUSTYPE’, ‘BUSAGE’

The next step is to build the logistic regression model and check the model summary.

# fit a logistic regression of DEFAULT on all other variables in the training data
modLog = glm(DEFAULT ~., data=train, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(modLog)
## 
## Call:
## glm(formula = DEFAULT ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.1264  -0.4709  -0.3230  -0.2474   3.2988  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.115e+00  7.038e-02 -44.264  < 2e-16 ***
## BUSAGE      -3.393e-03  3.599e-04  -9.427  < 2e-16 ***
## BUSTYPEB    -8.075e-03  6.368e-02  -0.127    0.899    
## BUSTYPEC    -1.715e-01  1.279e-01  -1.341    0.180    
## BUSTYPED    -1.315e-01  1.054e-01  -1.247    0.212    
## BUSTYPEE    -1.772e-01  1.042e+00  -0.170    0.865    
## BUSTYPEF     3.100e-01  4.647e-01   0.667    0.505    
## MAXLINEUTIL  2.045e+00  8.560e-02  23.894  < 2e-16 ***
## DAYSDELQ     7.395e-02  6.563e-03  11.268  < 2e-16 ***
## TOTACBAL    -5.560e-06  1.363e-06  -4.079 4.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11152.1  on 19008  degrees of freedom
## Residual deviance:  9914.9  on 18999  degrees of freedom
##   (9418 observations deleted due to missingness)
## AIC: 9934.9
## 
## Number of Fisher Scoring iterations: 6

Significance Level of the Variables

Variables with at least one star in the coefficients table are significant. A positive coefficient means that the higher the value of that variable, the higher the risk of default, and vice versa.

The significant variables are BUSAGE, TOTACBAL, DAYSDELQ, and MAXLINEUTIL.

Since some of the variables are not significant, we will rebuild the logistic regression using only the significant variables.
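If desired, the significant terms can also be pulled out of the coefficient table programmatically rather than read off by eye. A minimal sketch using the modLog fit above:

# coefficient table: estimates, standard errors, z values and p-values
coefs <- coef(summary(modLog))

# keep rows with p-value below 0.05, excluding the intercept
sig <- coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]
rownames(sig)[rownames(sig) != "(Intercept)"]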

Revised Logistic Regression Model

modLog2 = glm(DEFAULT ~ MAXLINEUTIL + DAYSDELQ + TOTACBAL + BUSAGE, data=train, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(modLog2)
## 
## Call:
## glm(formula = DEFAULT ~ MAXLINEUTIL + DAYSDELQ + TOTACBAL + BUSAGE, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.1202  -0.4706  -0.3231  -0.2478   3.2583  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.134e+00  6.730e-02 -46.558  < 2e-16 ***
## MAXLINEUTIL  2.043e+00  8.531e-02  23.954  < 2e-16 ***
## DAYSDELQ     7.411e-02  6.574e-03  11.273  < 2e-16 ***
## TOTACBAL    -5.501e-06  1.353e-06  -4.065  4.8e-05 ***
## BUSAGE      -3.424e-03  3.580e-04  -9.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11152.1  on 19008  degrees of freedom
## Residual deviance:  9918.6  on 19004  degrees of freedom
##   (9418 observations deleted due to missingness)
## AIC: 9928.6
## 
## Number of Fisher Scoring iterations: 6

All the variables here are significant.
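As an optional check (not part of the original analysis), a likelihood-ratio test can confirm that dropping BUSTYPE does not significantly worsen the fit; both fits dropped the same 9418 incomplete rows, so the nested models are directly comparable:

# compare the reduced model against the full model
anova(modLog2, modLog, test = "Chisq")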

Using the Revised Model for Making Predictions on the Test Data Set

We will store the prediction values in a vector named ‘predicted.risk’ and add it to the test data set.

test$predicted.risk = predict(modLog2, newdata=test, type="response")
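Note that test rows with missing predictor values receive an NA prediction from predict(); a quick check (output not shown):

# number of test observations that could not be scored because of missing predictors
sum(is.na(test$predicted.risk))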

Measuring the accuracy of the model

table(test$DEFAULT, as.numeric(test$predicted.risk >= 0.5))
##    
##        0    1
##   N 4289    8
##   Y  391   18

Computing Accuracy of the Model

(4289 + 18)/ nrow(test)
## [1] 0.6090215

Overall Accuracy = (4289 + 18)/ nrow(test) = 0.61

18 /409
## [1] 0.04400978

Sensitivity = 18 /409 = 0.04

4289 / 4297
## [1] 0.9981382

Specificity = 4289 / 4297 = 0.99
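The same quantities can also be computed directly from the confusion matrix instead of typing the counts by hand. A minimal sketch, treating default (Y) as the positive class:

# confusion matrix at the 0.5 cutoff (rows = actual, columns = predicted)
cm <- table(test$DEFAULT, as.numeric(test$predicted.risk >= 0.5))

cm["Y", "1"] / sum(cm["Y", ])   # sensitivity: defaults correctly flagged
cm["N", "0"] / sum(cm["N", ])   # specificity: non-defaults correctly passed
sum(diag(cm)) / sum(cm)         # accuracy among the scored (non-NA) rows;
                                # differs from the 0.61 above, which divides by nrow(test)
                                # and so also counts unscored rows as errors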

Inference

We see that the model does far better on specificity than on sensitivity. Next, let us compare our model's accuracy with the baseline accuracy.

Comparing Baseline Accuracy

table(test$DEFAULT)
## 
##    N    Y 
## 6583  489
6583/(6583+ 489) 
## [1] 0.9308541

The baseline accuracy is 0.93, which is higher than the revised model's overall accuracy of 0.61. Part of this gap arises because test rows with missing predictor values receive NA predictions and are counted as errors in the accuracy calculation above. In general, the more (and more complete) training data we have, the better the model should perform.

Test set Area Under the Curve (AUC)

library(ROCR)
pred = prediction(test$predicted.risk, test$DEFAULT)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.7373365

Area Under the Curve (AUC) comes out to be 0.74.

Although the model's accuracy and AUC look reasonable, the model does not perform especially well, because its sensitivity is low. For a bank or financial institution, the misclassification cost of failing to predict a loan default is much higher than the cost of misclassifying a non-default, so it is important that the model have high sensitivity.

This is controlled by the threshold value.

If we increase the threshold value from 0.5 to 0.7, sensitivity will decrease further, even though overall accuracy may increase. That is not acceptable from the bank's point of view, since the bank is more interested in correctly identifying defaults.

On the other hand, if we reduce the threshold level from 0.5 to 0.1, our sensitivity increases drastically.
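To see this concretely, the test-set confusion matrix can be recomputed at different cutoffs (a sketch; output not shown):

# stricter cutoff: fewer accounts flagged as likely defaults
table(test$DEFAULT, as.numeric(test$predicted.risk >= 0.7))

# looser cutoff: many more accounts flagged, so sensitivity for defaults rises
table(test$DEFAULT, as.numeric(test$predicted.risk >= 0.1))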

So, what is the ideal trade-off between sensitivity and specificity? Luckily, R's ROCR package can help us examine the threshold cutoff.

This can be done using the following syntax

# load library
library(ROCR)
# score the test data set (same probabilities as test$predicted.risk computed earlier)
test$score <- predict(modLog2, newdata = test, type = 'response')
pred <- prediction(test$score, test$DEFAULT)
perf <- performance(pred, "tpr", "fpr")
plot(perf)

# Add colors
plot(perf, colorize=TRUE)

# Add threshold labels 
plot(perf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
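Reading the cutoff off the plot is approximate. If desired, a threshold can also be chosen numerically from the same ROCR performance object, for example by maximizing Youden's J statistic (tpr - fpr); this step is not part of the original analysis:

# the performance object stores cutoffs (alpha), fpr (x) and tpr (y) as parallel vectors
cutoffs <- perf@alpha.values[[1]]
fpr <- perf@x.values[[1]]
tpr <- perf@y.values[[1]]

# cutoff maximizing Youden's J = tpr - fpr
cutoffs[which.max(tpr - fpr)]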

We see that a cutoff value of around 0.2 probably gives a better trade-off between sensitivity and specificity. Let's use this threshold and examine the resulting confusion matrix; note that predict() is called without newdata below, so the matrix is computed on the training observations used to fit the model.

threshold = 0.2
predicted_values<-ifelse(predict(modLog2,type="response")>threshold,1,0)
actual_values<- modLog2$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
##                 actual_values
## predicted_values     0     1
##                0 16957  1342
##                1   416   294

We can easily change the threshold value here and tune it. The outcome is coded 0 for non-default and 1 for default.

library(caret)
# note: caret's table methods treat the first level ("0" = non-default) as the positive class,
# so these values are sensitivity and specificity with respect to non-default at this cutoff
sensitivity(conf_matrix)
## [1] 0.9760548
specificity(conf_matrix)
## [1] 0.1797066

Calculating the Top 3 Variables Affecting the Credit Score in R

In credit scoring, regulations require lenders to provide the top three reasons impacting the credit decision when a loan application fails to pass the credit score (Velez, 2008).

# get the contribution of each term in the regression for the test data
g <- predict(modLog2, newdata = test, type = 'terms')
# function to pick the top 3 reasons:
# sorts each loan's term contributions in decreasing order
# and selects the top 3 variable names
ftopk <- function(x, top = 3) {
  res <- names(x)[order(x, decreasing = TRUE)][1:top]
  paste(res, collapse = ";", sep = "")
}
# apply the function to each row (loan) in the scored sample
topk <- apply(g, 1, ftopk, top = 3)
# add the reason list to the scored test sample
test <- cbind(test, topk)
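For context, predict() with type = 'terms' returns each variable's (centered) contribution to the linear predictor, so sorting in decreasing order picks the variables pushing a given account most strongly toward default. The resulting reason codes can be inspected alongside the outcome (output not shown):

# first few accounts with their assigned top-3 reason codes
head(test[, c("DEFAULT", "topk")])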

Conclusion

Banks and financial institutions can use a model like this to build a credit scoring system for loan applications and to reduce the bad-loan rate in their portfolios.

For more machine learning use cases, visit us at Cartwheel Technologies.