Modeling credit risk for both personal and company loans is of major importance for banks and financial institutions. The probability that a debtor will default is a key component in getting to a measure for credit risk.
In this blog, we will be building one of the widely used machine learning techniques for solving classification problem, i.e, logistic regression.
The original data used in this exercise comes from publicly available data from LendingClub.com (https://www.lendingclub.com/info/download-data.action), a website that connects borrowers and investors over the Internet.
a binary variable indicating that the loan was not paid back in full, i.e, (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).
1)credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
2)purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
3)int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
4)installment: The monthly installments ($) owed by the borrower if the loan is funded.
5)annualincome: the self-reported annual income of the borrower.
6)log.annual.inc: The natural log of the self-reported annual income of the borrower.
7)dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
8)fico: The FICO credit score of the borrower.
9)days.with.cr.line: The number of days the borrower has had a credit line.
10)revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
11)revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
12)inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
13)delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
14)pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
str(loans) #There are 5000 observations and 15 variables
## 'data.frame': 5000 obs. of 15 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 1 2 1 7 5 5 3 7 7 ...
## $ int.rate : num 0.15 0.111 0.134 0.106 0.15 ...
## $ installment : num 194 131.2 678.1 32.5 225.4 ...
## $ log.annual.inc : num 10.7 11 11.9 10.4 12.3 ...
## $ dti : num 4 11.08 10.15 14.47 6.45 ...
## $ fico : int 667 722 682 687 677 772 682 712 737 672 ...
## $ days.with.cr.line: num 3180 5116 4210 1110 6240 ...
## $ revol.bal : int 3839 24220 41674 4485 56411 269 3408 6513 10296 2867 ...
## $ revol.util : num 76.8 68.6 74.1 36.9 75.3 3.8 35.1 34.3 42.5 55.1 ...
## $ inq.last.6mths : int 0 0 0 1 0 3 1 3 0 3 ...
## $ delinq.2yrs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pub.rec : int 1 0 0 0 0 0 0 1 0 0 ...
## $ not.fully.paid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ annualincome : int 45000 60000 145000 33990 213000 75000 32000 40000 96000 15000 ...
summary(loans) # no missing values
## credit.policy purpose int.rate
## Min. :0.0000 all_other :1227 Min. :0.0600
## 1st Qu.:1.0000 credit_card : 664 1st Qu.:0.1008
## Median :1.0000 debt_consolidation:1957 Median :0.1218
## Mean :0.8962 educational : 198 Mean :0.1208
## 3rd Qu.:1.0000 home_improvement : 354 3rd Qu.:0.1379
## Max. :1.0000 major_purchase : 186 Max. :0.2164
## small_business : 414
## installment log.annual.inc dti fico
## Min. : 15.69 Min. : 7.601 Min. : 0.000 Min. :617.0
## 1st Qu.:163.55 1st Qu.:10.545 1st Qu.: 7.067 1st Qu.:682.0
## Median :260.64 Median :10.915 Median :12.300 Median :707.0
## Mean :308.33 Mean :10.912 Mean :12.309 Mean :710.9
## 3rd Qu.:407.51 3rd Qu.:11.277 3rd Qu.:17.652 3rd Qu.:737.0
## Max. :926.83 Max. :14.528 Max. :29.960 Max. :827.0
##
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 180 Min. : 0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 2790 1st Qu.: 3328 1st Qu.: 22.3 1st Qu.: 0.000
## Median : 4080 Median : 8605 Median : 45.7 Median : 1.000
## Mean : 4511 Mean : 15872 Mean : 46.4 Mean : 1.407
## 3rd Qu.: 5640 3rd Qu.: 18155 3rd Qu.: 70.5 3rd Qu.: 2.000
## Max. :16259 Max. :1207359 Max. :106.5 Max. :33.000
##
## delinq.2yrs pub.rec not.fully.paid annualincome
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 2000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 38000
## Median :0.0000 Median :0.0000 Median :0.0000 Median : 55000
## Mean :0.1614 Mean :0.0668 Mean :0.3066 Mean : 66260
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.: 79000
## Max. :6.0000 Max. :3.0000 Max. :1.0000 Max. :2039784
##
loans$credit.policy = as.factor(loans$credit.policy)
str(loans)
## 'data.frame': 5000 obs. of 15 variables:
## $ credit.policy : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 1 2 1 7 5 5 3 7 7 ...
## $ int.rate : num 0.15 0.111 0.134 0.106 0.15 ...
## $ installment : num 194 131.2 678.1 32.5 225.4 ...
## $ log.annual.inc : num 10.7 11 11.9 10.4 12.3 ...
## $ dti : num 4 11.08 10.15 14.47 6.45 ...
## $ fico : int 667 722 682 687 677 772 682 712 737 672 ...
## $ days.with.cr.line: num 3180 5116 4210 1110 6240 ...
## $ revol.bal : int 3839 24220 41674 4485 56411 269 3408 6513 10296 2867 ...
## $ revol.util : num 76.8 68.6 74.1 36.9 75.3 3.8 35.1 34.3 42.5 55.1 ...
## $ inq.last.6mths : int 0 0 0 1 0 3 1 3 0 3 ...
## $ delinq.2yrs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pub.rec : int 1 0 0 0 0 0 0 1 0 0 ...
## $ not.fully.paid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ annualincome : int 45000 60000 145000 33990 213000 75000 32000 40000 96000 15000 ...
library(caTools)
## Warning: package 'caTools' was built under R version 3.2.4
set.seed(100)
spl = sample.split(loans$not.fully.paid, 0.7)
train = subset(loans, spl == TRUE)
test = subset(loans, spl == FALSE)
We have left out annualincome as it is already included in the variable ‘log.annual.inc’
modLog = glm(not.fully.paid ~. -annualincome, data=train, family="binomial")
summary(modLog)
##
## Call:
## glm(formula = not.fully.paid ~ . - annualincome, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.69976 -0.72158 -0.54723 0.00013 2.35322
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.064e+01 3.226e+02 0.064 0.94899
## credit.policy1 -1.902e+01 3.226e+02 -0.059 0.95300
## purposecredit_card -4.983e-01 1.686e-01 -2.955 0.00313 **
## purposedebt_consolidation -2.668e-01 1.209e-01 -2.206 0.02737 *
## purposeeducational -1.936e-02 2.348e-01 -0.082 0.93428
## purposehome_improvement -6.254e-02 2.031e-01 -0.308 0.75807
## purposemajor_purchase -1.666e-02 2.461e-01 -0.068 0.94602
## purposesmall_business 1.161e-01 1.797e-01 0.646 0.51817
## int.rate 1.777e+01 3.092e+00 5.746 9.13e-09 ***
## installment 1.291e-03 2.829e-04 4.564 5.03e-06 ***
## log.annual.inc -5.419e-01 1.001e-01 -5.413 6.20e-08 ***
## dti 3.105e-03 7.445e-03 0.417 0.67662
## fico 5.627e-05 2.194e-03 0.026 0.97954
## days.with.cr.line 3.618e-05 2.055e-05 1.760 0.07834 .
## revol.bal 6.464e-06 3.370e-06 1.918 0.05508 .
## revol.util 2.730e-03 2.080e-03 1.313 0.18934
## inq.last.6mths 1.328e-01 3.559e-02 3.731 0.00019 ***
## delinq.2yrs -8.311e-02 9.695e-02 -0.857 0.39130
## pub.rec 3.673e-01 1.656e-01 2.218 0.02657 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4314.3 on 3499 degrees of freedom
## Residual deviance: 3127.3 on 3481 degrees of freedom
## AIC: 3165.3
##
## Number of Fisher Scoring iterations: 17
Those variables which have atleast one star in the coefficients table are sigificant. Positive coefficient means higher the value of that variable, an increased risk of default, and vice versa.
The significant variales are Credit Card Purpose, Interest Rate, ‘inq.last.6mths’, and ‘pub.rec’.
Since some of the variables are not significant, we will rebuild the logistic regression with only the significant variables.
modLog2 = glm(not.fully.paid ~ purpose + int.rate + installment + log.annual.inc + inq.last.6mths + pub.rec, data=train, family="binomial")
summary(modLog2)
##
## Call:
## glm(formula = not.fully.paid ~ purpose + int.rate + installment +
## log.annual.inc + inq.last.6mths + pub.rec, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1109 -0.8035 -0.5614 0.8903 2.4357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3552949 0.8528953 -0.417 0.676989
## purposecredit_card -0.3342296 0.1459913 -2.289 0.022057 *
## purposedebt_consolidation -0.2615709 0.1068706 -2.448 0.014383 *
## purposeeducational -0.0968354 0.2150149 -0.450 0.652447
## purposehome_improvement -0.1267309 0.1825735 -0.694 0.487596
## purposemajor_purchase -0.2953788 0.2368597 -1.247 0.212375
## purposesmall_business -0.0037807 0.1647683 -0.023 0.981694
## int.rate 24.9879111 1.8469198 13.530 < 2e-16 ***
## installment 0.0009194 0.0002437 3.773 0.000161 ***
## log.annual.inc -0.3909672 0.0783604 -4.989 6.06e-07 ***
## inq.last.6mths 0.3682961 0.0258440 14.251 < 2e-16 ***
## pub.rec 0.3337165 0.1467921 2.273 0.023002 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4314.3 on 3499 degrees of freedom
## Residual deviance: 3657.3 on 3488 degrees of freedom
## AIC: 3681.3
##
## Number of Fisher Scoring iterations: 4
All the variables are significant.
We will store the prediction values in a vector named ‘predicted.risk’ and add it in the test data set.
test$predicted.risk = predict(modLog2, newdata=test, type="response")
table(test$not.fully.paid, as.numeric(test$predicted.risk >= 0.5))
##
## 0 1
## 0 970 70
## 1 306 154
Overall Accuracy = (970 + 154)/ nrow(test) = 74.933 Sensitivity = 154/460 = 33.5 Specificity = 970 / 1040 = 93.3
We see that the model is doing far better in sensitivity as compared to specificity. We will come back to these concepts later, but before that, let us compare our model with the baseline accuracy.
table(test$not.fully.paid)
##
## 0 1
## 1040 460
1040/(1040+460)
## [1] 0.6933333
The baseline accuracy is 0.693. Hence, we see that the Revised Model beats the baseline model comfortably.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred = prediction(test$predicted.risk, test$not.fully.paid)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.7318395
Area Under the Curve (AUC) comes out to be 0.73.
Although the model beats the Baseline Model, it does not do extremely good. The reason being that the model is low on Sensitivity. For a bank or financial institution, the misclassification cost of not predicting loan defaults correctly is much higher than misclassiffication cost of not predicting non-defaults correctly, hence it is important that the model should have a higher sensitivity.
This is controlled by the threshold value.
If we increase the threshold value from 0.5 to 0.7, the sensitivity will decrease from 33% to 17%, even though overall acuuracy will increase to 74% from 73% earlier. However, that is not aceptable from the bank’s point of view which is more interested in classifying defaults.
On the other hand, if we reduce the threshold level from 0.5 to 0.3, our sensitivity incraeses drastically from 33% to 60%. But the overall model acuracy comes down to 68%, which is less than the baseline model accuracy.
So, what is the ideal trade-off between sensitivity and specificity? Luckily, r has a package to help us understand the threshold cutoff.
This can be done using the following syntax
library(ROCR)
# Make predictions on training set
predictTrain = predict(modLog2, type="response")
# Prediction function
ROCRpred = prediction(predictTrain, train$not.fully.paid)
# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Plot ROC curve
plot(ROCRperf)
# Add colors
plot(ROCRperf, colorize=TRUE)
# Add threshold labels
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
We see that probably the cutoff value of 0.35 is giving us a better tradeoff between sensitivity and specificity. Let’s use this threshold to see how our model performs in the test data set.
t1 = table(test$not.fully.paid, as.numeric(test$predicted.risk >= 0.35))
t1
##
## 0 1
## 0 828 212
## 1 219 241
#Overall Accuracy = (828 + 241) / nrow(test) = 0.71
#Sensitivity = 241 / 460 = 0.52
#Specificity = 828/ 1040 = 0.80
At a cutoff of 0.35, the sensitivity sees a drastic improvement from 33% in the original model to 52%. And the Overall Accuracy of the model is also not compromised much as it is still considerably better at 71% than the baseline accuracy of 69%.
Banks and Financial Institutions can use this model to create a Loan Acceptance Strategy for every Loan Applications and minimise the Bad Loan Error Rate from their portfolio.
In future blogs, I will be covering Decision Trees and Random Forests for carrying out the same exercise. Please note that this is just an illustartion of one technique, there are other methods as well, which may perform better than this model.
However, logistic regression remains one of the most widely used classification technique in Credit Risk Modeling.