Predicting loan repayment
In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.
To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not_fully_paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).
To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower. days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
setwd("C:/Users/dell/Downloads")
The working directory was changed to C:/Users/dell/Downloads inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.
loans = read.csv(file = "loans.csv")
str(loans)
'data.frame': 9578 obs. of 14 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
$ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
$ installment : num 829 228 367 162 103 ...
$ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
$ dti : num 19.5 14.3 11.6 8.1 15 ...
$ fico : int 737 707 682 712 667 727 667 722 682 707 ...
$ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
$ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
$ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
$ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
$ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
$ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
$ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
Thus we have around 9578 rows in this dataset where each row signifies a loan given to an individual. Now, lets see the proportion of loans not paid
table(loans$not.fully.paid)
0 1
8045 1533
1533/9578
[1] 0.1600543
Thus, approximately 16% of loans in the dataset remain unpaid. Now, let’s see if there are any missing values in the dataset and further impute them using the technique Multivariate Imputation by Chained Equations.
summary(loans)
credit.policy purpose int.rate
Min. :0.000 all_other :2331 Min. :0.0600
1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
Median :1.000 debt_consolidation:3957 Median :0.1221
Mean :0.805 educational : 343 Mean :0.1226
3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
Max. :1.000 major_purchase : 437 Max. :0.2164
small_business : 619
installment log.annual.inc dti fico
Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
Median :268.95 Median :10.928 Median :12.665 Median :707.0
Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
3rd Qu.:432.76 3rd Qu.:11.290 3rd Qu.:17.950 3rd Qu.:737.0
Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
NA's :4
days.with.cr.line revol.bal revol.util inq.last.6mths
Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.70 1st Qu.: 0.000
Median : 4140 Median : 8596 Median : 46.40 Median : 1.000
Mean : 4562 Mean : 16914 Mean : 46.87 Mean : 1.572
3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 71.00 3rd Qu.: 2.000
Max. :17640 Max. :1207359 Max. :119.00 Max. :33.000
NA's :29 NA's :62 NA's :29
delinq.2yrs pub.rec not.fully.paid
Min. : 0.0000 Min. :0.0000 Min. :0.0000
1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median : 0.0000 Median :0.0000 Median :0.0000
Mean : 0.1638 Mean :0.0621 Mean :0.1601
3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :13.0000 Max. :5.0000 Max. :1.0000
NA's :29 NA's :29
Variables with missing values are:
log.annual.inc
days.with.cr.line
revol.util
inq.last.6mths
delinq.2yrs
pub.rec
Very less obseravtions are missing, hence removing these observations shall not lead to overfitting. But we shall still impute the missing values so that we are able to predict risk for all borrowers, instead of just the ones with all data reported
library(mice)
package <U+393C><U+3E31>mice<U+393C><U+3E32> was built under R version 3.3.2
set.seed(144)
vars = setdiff(names(loans), "not.fully.paid")
imputed = complete(mice(loans[vars]))
iter imp variable
1 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
1 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
1 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
1 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
1 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
2 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
2 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
2 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
2 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
2 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
3 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
3 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
3 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
3 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
3 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
4 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
4 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
4 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
4 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
4 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
5 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
5 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
5 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
5 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
5 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
loans[vars] = imputed
We are imputing the missing values only using those columns that are independent variables, thus we ignore not.fully.paid since it is our dependent variable for logistic regression.
Let us split our data into training and test such that 70% of the data is used to train the model and rest to test our model.
library(caTools)
sample = sample.split(loans, SplitRatio = 0.7)
loans_train = subset(loans,sample == TRUE)
loans_test = subset(loans,sample == FALSE)
model1 = glm(not.fully.paid ~ . , loans_train, family = "binomial")
summary(model1)
Call:
glm(formula = not.fully.paid ~ ., family = "binomial", data = loans_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8746 -0.6154 -0.4960 -0.3710 2.6032
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.988e+00 1.611e+00 5.578 2.44e-08 ***
credit.policy -4.516e-01 1.039e-01 -4.346 1.39e-05 ***
purpose2 -4.125e-01 1.333e-01 -3.095 0.001970 **
purpose3 -3.406e-01 9.723e-02 -3.503 0.000459 ***
purpose4 2.315e-01 1.900e-01 1.218 0.223065
purpose5 8.768e-02 1.552e-01 0.565 0.572213
purpose6 -2.274e-01 1.951e-01 -1.166 0.243765
purpose7 5.888e-01 1.468e-01 4.010 6.08e-05 ***
int.rate -3.166e-02 2.163e+00 -0.015 0.988326
installment 1.360e-03 2.180e-04 6.240 4.36e-10 ***
log.annual.inc -4.206e-01 7.403e-02 -5.681 1.34e-08 ***
dti -6.468e-04 5.727e-03 -0.113 0.910076
fico -9.099e-03 1.768e-03 -5.145 2.67e-07 ***
days.with.cr.line 1.861e-05 1.635e-05 1.138 0.255076
revol.bal 1.752e-06 1.025e-06 1.709 0.087487 .
revol.util 3.037e-03 1.586e-03 1.914 0.055569 .
inq.last.6mths 8.078e-02 1.706e-02 4.736 2.18e-06 ***
delinq.2yrs -6.087e-02 7.035e-02 -0.865 0.386903
pub.rec 2.194e-01 1.201e-01 1.827 0.067683 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5440.5 on 6157 degrees of freedom
Residual deviance: 5067.1 on 6139 degrees of freedom
AIC: 5105.1
Number of Fisher Scoring iterations: 5
Thus we created a logistic regression model with all the columns except not.fully.paid as independent variables. Most of our variables are significant as seen by the * next to them in coefficients table in the summary output.
Now let us check the accuracy of prediction of our model against the test data set.
loans_test$predicted_risk = predict(model1, newdata = loans_test)
contrasts dropped from factor purpose
table(loans_test$not.fully.paid, loans_test$predicted_risk > 0.5)
FALSE TRUE
0 2871 5
1 542 2
accuracy = 2873/(2873+547)
accuracy
[1] 0.8400585
We have considered a threshold value of 0.5 i.e assumed loans not.fully.paid to be predicted as 1 when our probability as seen in predict_risk column is greater than 0.5.
Accuracy of this model = 84%