In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.
To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not.fully.paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).
To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:
Load the dataset loans.csv into a data frame called loans, and explore it using the str() and summary() functions.
# Load the data
loans = read.csv("loans.csv")# Tabulate not fully paid
z = table(loans$not.fully.paid)
kable(z)| Var1 | Freq |
|---|---|
| 0 | 8045 |
| 1 | 1533 |
# Compute proportion
z[2]/sum(z)
## 1
## 0.1600543Proportion = 0.1600543
# Output summary
z = summary(loans)
kable(z)| credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.000 | all_other :2331 | Min. :0.0600 | Min. : 15.67 | Min. : 7.548 | Min. : 0.000 | Min. :612.0 | Min. : 179 | Min. : 0 | Min. : 0.00 | Min. : 0.000 | Min. : 0.0000 | Min. :0.0000 | Min. :0.0000 | |
| 1st Qu.:1.000 | credit_card :1262 | 1st Qu.:0.1039 | 1st Qu.:163.77 | 1st Qu.:10.558 | 1st Qu.: 7.213 | 1st Qu.:682.0 | 1st Qu.: 2820 | 1st Qu.: 3187 | 1st Qu.: 22.70 | 1st Qu.: 0.000 | 1st Qu.: 0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | |
| Median :1.000 | debt_consolidation:3957 | Median :0.1221 | Median :268.95 | Median :10.928 | Median :12.665 | Median :707.0 | Median : 4140 | Median : 8596 | Median : 46.40 | Median : 1.000 | Median : 0.0000 | Median :0.0000 | Median :0.0000 | |
| Mean :0.805 | educational : 343 | Mean :0.1226 | Mean :319.09 | Mean :10.932 | Mean :12.607 | Mean :710.8 | Mean : 4562 | Mean : 16914 | Mean : 46.87 | Mean : 1.572 | Mean : 0.1638 | Mean :0.0621 | Mean :0.1601 | |
| 3rd Qu.:1.000 | home_improvement : 629 | 3rd Qu.:0.1407 | 3rd Qu.:432.76 | 3rd Qu.:11.290 | 3rd Qu.:17.950 | 3rd Qu.:737.0 | 3rd Qu.: 5730 | 3rd Qu.: 18250 | 3rd Qu.: 71.00 | 3rd Qu.: 2.000 | 3rd Qu.: 0.0000 | 3rd Qu.:0.0000 | 3rd Qu.:0.0000 | |
| Max. :1.000 | major_purchase : 437 | Max. :0.2164 | Max. :940.14 | Max. :14.528 | Max. :29.960 | Max. :827.0 | Max. :17640 | Max. :1207359 | Max. :119.00 | Max. :33.000 | Max. :13.0000 | Max. :5.0000 | Max. :1.0000 | |
| NA | small_business : 619 | NA | NA | NA’s :4 | NA | NA | NA’s :29 | NA | NA’s :62 | NA’s :29 | NA’s :29 | NA’s :29 | NA |
log.annual.inc, days.with.cr.line, revol.util, inq.last.6mths, delinq.2yrs and pub.rec are missing values.
We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.
# Split the data
library(mice)
set.seed(144)
vars.for.imputation = setdiff(names(loans), "not.fully.paid")
imputed = complete(mice(loans[vars.for.imputation]))
##
## iter imp variable
## 1 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
loans[vars.for.imputation] = imputedWe predicted missing variable values using the available independent variables for each observation.
# Split the data
library(caTools)
set.seed(144)
spl = sample.split(loans$not.fully.paid, 0.7)
train = subset(loans, spl == TRUE)
test = subset(loans, spl == FALSE)
# Logistic Regression
mod = glm(not.fully.paid~., data=train, family="binomial")# Output summary
summary(mod)
##
## Call:
## glm(formula = not.fully.paid ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1867 -0.6206 -0.4949 -0.3610 2.6395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.158007646 1.555140670 5.889 0.00000000389 ***
## credit.policy -0.349246579 0.100825981 -3.464 0.000532 ***
## purposecredit_card -0.614397768 0.134403507 -4.571 0.00000484725 ***
## purposedebt_consolidation -0.321655756 0.091828446 -3.503 0.000460 ***
## purposeeducational 0.135780792 0.175274026 0.775 0.438530
## purposehome_improvement 0.174353490 0.147918419 1.179 0.238512
## purposemajor_purchase -0.481418544 0.200793056 -2.398 0.016504 *
## purposesmall_business 0.413433572 0.141832368 2.915 0.003558 **
## int.rate 0.622113793 2.084902246 0.298 0.765406
## installment 0.001272966 0.000209181 6.085 0.00000000116 ***
## log.annual.inc -0.431293681 0.071452618 -6.036 0.00000000158 ***
## dti 0.004627352 0.005499671 0.841 0.400131
## fico -0.009293809 0.001708306 -5.440 0.00000005317 ***
## days.with.cr.line 0.000002187 0.000015878 0.138 0.890435
## revol.bal 0.000003035 0.000001166 2.602 0.009279 **
## revol.util 0.001916012 0.001532931 1.250 0.211336
## inq.last.6mths 0.080739865 0.015868489 5.088 0.00000036174 ***
## delinq.2yrs -0.083383810 0.065542094 -1.272 0.203296
## pub.rec 0.331043003 0.113758106 2.910 0.003614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5487.4 on 6686 degrees of freedom
## AIC: 5525.4
##
## Number of Fisher Scoring iterations: 5credit.policy, purpose2 (credit card), purpose3 (debt consolidation), purpose 6 (major purchase), purpose 7 (small business), installment, log.annual.inc, fico, revol.bal, inq.last.6mths, pub.rec
-0.009317 * -10
## [1] 0.09317Because Application A is identical to Application B other than having a FICO score 10 lower, its predicted log odds differ by -0.009317 * -10 = 0.09317 from the predicted log odds of Application B.
exp(0.09317)
## [1] 1.097648The predicted odds of loan A not being paid back in full are exp(0.09317) = 1.0976 times larger than the predicted odds for loan B. Intuitively, it makes sense that loan A should have higher odds of non-payment than loan B, since the borrower has a worse credit score.
# Make predictions
test$predicted.risk = predict(mod, newdata=test, type="response")
# Tabulate not fully with threshold
z = table(test$not.fully.paid, test$predicted.risk > 0.5)
kable(z)| FALSE | TRUE | |
|---|---|---|
| 0 | 2400 | 13 |
| 1 | 457 | 3 |
sum(diag(z))/sum(z)
## [1] 0.8364079Accuracy = 0.8364
z = table(test$not.fully.paid)
kable(z)| Var1 | Freq |
|---|---|
| 0 | 2413 |
| 1 | 460 |
z[1]/sum(z)
## 0
## 0.8398886Accuracy = 0.8399
# Calculate AUC
library(ROCR)
pred = prediction(test$predicted.risk, test$not.fully.paid)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.6718878AUC = 0.672
Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.
# Logistic Regression
bivariate = glm(not.fully.paid~int.rate, data=train, family="binomial")# Output summary
summary(bivariate)
##
## Call:
## glm(formula = not.fully.paid ~ int.rate, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0547 -0.6271 -0.5442 -0.4361 2.2914
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6726 0.1688 -21.76 <2e-16 ***
## int.rate 15.9214 1.2702 12.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5734.8 on 6703 degrees of freedom
## AIC: 5738.8
##
## Number of Fisher Scoring iterations: 4Decreased significance between a bivariate and multivariate model is typically due to correlation. From cor(train\(int.rate, train\)fico), we can see that the interest rate is moderately well correlated with a borrower’s credit score.
Training/testing set split rarely has a large effect on the significance of variables (this can be verified in this case by trying out a few other training/testing splits), and the models were trained on the same observations.
# Make predictions
pred.bivariate = predict(bivariate, newdata=test, type="response")
# Max Probability
summary(pred.bivariate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06196 0.11549 0.15077 0.15963 0.18928 0.42662Highest Predicted Probability = 0.4266
The maximum predicted probability of the loan not being paid back is 0.4266, which means no loans would be flagged at a logistic regression cutoff of 0.5.
# Calculate AUC
prediction.bivariate = prediction(pred.bivariate, test$not.fully.paid)
as.numeric(performance(prediction.bivariate, "auc")@y.values)
## [1] 0.6239081AUC = 0.624
While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.
To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c x exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.
c = 10
r = 0.06
t = 3
c*exp(r*t)
## [1] 11.97217$11.97
c exp(rt) - c
In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).
In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign to the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations. Enter the following commands in your R console to create this new variable:
# Create a new variable
test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1# Maximum profit
summary(test$profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.2858 0.4111 0.2094 0.4980 0.8895Maximum profit = 8.895
A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.
To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.
First, use the subset() function to build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.
# Subset the data
highInterest = subset(test, int.rate >= 0.15)
# Find the average
mean(highInterest$profit)
## [1] 0.2251015# Tabulate high interest loans not fully paid
z = table(highInterest$not.fully.paid)
kable(z)| Var1 | Freq |
|---|---|
| 0 | 327 |
| 1 | 110 |
# Compute proportion
z[2]/sum(z)
## 1
## 0.2517162Proportion = 0.2517
Find the highest predicted risk that we will include by typing the following command into your R console:
# Implement cutoff
cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]Use the subset() function to build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.
# Subset the data
selectedLoans = subset(highInterest, predicted.risk <= cutoff) # Calculate the profit
sum(selectedLoans$profit)
## [1] 31.27825# Tabulate how many selected loans were not paid back in full
z = table(selectedLoans$not.fully.paid)
kable(z)| Var1 | Freq |
|---|---|
| 0 | 81 |
| 1 | 19 |
19 loans.