Predicting Loan Repayment

(assignment for MITx Analytics Edge)

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not.fully.paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

loans <- read.csv('loans.csv')
str(loans)

## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...

summary(loans)

##  credit.policy                 purpose        int.rate     
##  Min.   :0.000   all_other         :2331   Min.   :0.0600  
##  1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039  
##  Median :1.000   debt_consolidation:3957   Median :0.1221  
##  Mean   :0.805   educational       : 343   Mean   :0.1226  
##  3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407  
##  Max.   :1.000   major_purchase    : 437   Max.   :0.2164  
##                  small_business    : 619                   
##   installment     log.annual.inc        dti              fico      
##  Min.   : 15.67   Min.   : 7.548   Min.   : 0.000   Min.   :612.0  
##  1st Qu.:163.77   1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0  
##  Median :268.95   Median :10.928   Median :12.665   Median :707.0  
##  Mean   :319.09   Mean   :10.932   Mean   :12.607   Mean   :710.8  
##  3rd Qu.:432.76   3rd Qu.:11.290   3rd Qu.:17.950   3rd Qu.:737.0  
##  Max.   :940.14   Max.   :14.528   Max.   :29.960   Max.   :827.0  
##                   NA's   :4                                        
##  days.with.cr.line   revol.bal         revol.util     inq.last.6mths  
##  Min.   :  179     Min.   :      0   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 2820     1st Qu.:   3187   1st Qu.: 22.70   1st Qu.: 0.000  
##  Median : 4140     Median :   8596   Median : 46.40   Median : 1.000  
##  Mean   : 4562     Mean   :  16914   Mean   : 46.87   Mean   : 1.572  
##  3rd Qu.: 5730     3rd Qu.:  18250   3rd Qu.: 71.00   3rd Qu.: 2.000  
##  Max.   :17640     Max.   :1207359   Max.   :119.00   Max.   :33.000  
##  NA's   :29                          NA's   :62       NA's   :29      
##   delinq.2yrs         pub.rec       not.fully.paid  
##  Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.0000   Median :0.0000  
##  Mean   : 0.1638   Mean   :0.0621   Mean   :0.1601  
##  3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :13.0000   Max.   :5.0000   Max.   :1.0000  
##  NA's   :29        NA's   :29

Loans not paid in full

paste('Loans not paid in full :', signif(sum(loans$not.fully.paid)/length(loans$not.fully.paid)))

## [1] "Loans not paid in full : 0.160054"

Missing values

no_missing_values <- !is.na(loans$log.annual.inc) & !is.na(loans$days.with.cr.line) & !is.na(loans$revol.util) & !is.na(loans$inq.last.6mths) & !is.na(loans$delinq.2yrs) & !is.na(loans$pub.rec)

loans_no_missing_values <- subset(loans, no_missing_values)
loans_missing_values <- subset(loans, no_missing_values == FALSE)

paste('Number of loans with no missing values :', nrow(loans_no_missing_values))

## [1] "Number of loans with no missing values : 9516"

paste('Number of loans with a missing values :', nrow(loans_missing_values))

## [1] "Number of loans with a missing values : 62"

paste('Loans not paid in full if no missing value :', signif(sum(loans_no_missing_values$not.fully.paid)/length(loans_no_missing_values$not.fully.paid)))

## [1] "Loans not paid in full if no missing value : 0.159836"

paste('Loans not paid in full if a missing value :', signif(sum(loans_missing_values$not.fully.paid)/length(loans_missing_values$not.fully.paid)))

## [1] "Loans not paid in full if a missing value : 0.193548"

Imputing missing values

library(mice)

set.seed(144)

vars.for.imputation = setdiff(names(loans), "not.fully.paid")

imputed = complete(mice(loans[vars.for.imputation]))

## 
##  iter imp variable
##   1   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec

# predict missing variable values using the available independent variables for each observation

loans[vars.for.imputation] = imputed

loans_imputed <- read.csv('loans_imputed.csv')
# (imputed using another computer with potentially different settings)

Create training and test sets

set.seed(144) # fix the random number generator

library(caTools) # provides splitting tools

split = sample.split(loans_imputed$not.fully.paid, SplitRatio = 0.7) # ensure balance of violators in sets
train = subset(loans, split == TRUE)
test = subset(loans, split == FALSE)

train_imputed <- subset(loans_imputed, split == TRUE)
test_imputed <- subset(loans_imputed, split == FALSE)

Create model

model_x <- glm(not.fully.paid ~ ., data = train, family = 'binomial')
summary(model_x)

## 
## Call:
## glm(formula = not.fully.paid ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2097  -0.6214  -0.4950  -0.3601   2.6414  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        9.260e+00  1.554e+00   5.958 2.55e-09 ***
## credit.policy     -3.327e-01  1.011e-01  -3.292 0.000995 ***
## purpose2          -6.100e-01  1.344e-01  -4.538 5.67e-06 ***
## purpose3          -3.181e-01  9.185e-02  -3.463 0.000534 ***
## purpose4           1.386e-01  1.753e-01   0.791 0.429074    
## purpose5           1.774e-01  1.479e-01   1.199 0.230496    
## purpose6          -4.783e-01  2.009e-01  -2.381 0.017260 *  
## purpose7           4.159e-01  1.419e-01   2.932 0.003373 ** 
## int.rate           6.202e-01  2.085e+00   0.297 0.766111    
## installment        1.279e-03  2.093e-04   6.110 9.96e-10 ***
## log.annual.inc    -4.357e-01  7.151e-02  -6.093 1.11e-09 ***
## dti                4.733e-03  5.501e-03   0.861 0.389508    
## fico              -9.406e-03  1.707e-03  -5.510 3.60e-08 ***
## days.with.cr.line  3.174e-06  1.587e-05   0.200 0.841463    
## revol.bal          3.103e-06  1.169e-06   2.653 0.007966 ** 
## revol.util         1.796e-03  1.532e-03   1.172 0.241022    
## inq.last.6mths     8.386e-02  1.577e-02   5.317 1.06e-07 ***
## delinq.2yrs       -7.794e-02  6.532e-02  -1.193 0.232814    
## pub.rec            3.191e-01  1.134e-01   2.814 0.004899 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5896.6  on 6704  degrees of freedom
## Residual deviance: 5485.7  on 6686  degrees of freedom
## AIC: 5523.7
## 
## Number of Fisher Scoring iterations: 5

model <- glm(not.fully.paid ~ ., data = train_imputed, family = 'binomial')
summary(model)

## 
## Call:
## glm(formula = not.fully.paid ~ ., family = "binomial", data = train_imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2049  -0.6205  -0.4951  -0.3606   2.6397  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                9.187e+00  1.554e+00   5.910 3.42e-09 ***
## credit.policy             -3.368e-01  1.011e-01  -3.332 0.000861 ***
## purposecredit_card        -6.141e-01  1.344e-01  -4.568 4.93e-06 ***
## purposedebt_consolidation -3.212e-01  9.183e-02  -3.498 0.000469 ***
## purposeeducational         1.347e-01  1.753e-01   0.768 0.442201    
## purposehome_improvement    1.727e-01  1.480e-01   1.167 0.243135    
## purposemajor_purchase     -4.830e-01  2.009e-01  -2.404 0.016203 *  
## purposesmall_business      4.120e-01  1.419e-01   2.905 0.003678 ** 
## int.rate                   6.110e-01  2.085e+00   0.293 0.769446    
## installment                1.275e-03  2.092e-04   6.093 1.11e-09 ***
## log.annual.inc            -4.337e-01  7.148e-02  -6.067 1.30e-09 ***
## dti                        4.638e-03  5.502e-03   0.843 0.399288    
## fico                      -9.317e-03  1.710e-03  -5.448 5.08e-08 ***
## days.with.cr.line          2.371e-06  1.588e-05   0.149 0.881343    
## revol.bal                  3.085e-06  1.168e-06   2.641 0.008273 ** 
## revol.util                 1.839e-03  1.535e-03   1.199 0.230722    
## inq.last.6mths             8.437e-02  1.600e-02   5.275 1.33e-07 ***
## delinq.2yrs               -8.320e-02  6.561e-02  -1.268 0.204762    
## pub.rec                    3.300e-01  1.139e-01   2.898 0.003756 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5896.6  on 6704  degrees of freedom
## Residual deviance: 5485.2  on 6686  degrees of freedom
## AIC: 5523.2
## 
## Number of Fisher Scoring iterations: 5

Prediction models - Logit examples

Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.

Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) - Logit(B)?

logit <- model$coefficients['fico'][[1]]*(700-710)

paste("Value of Logit(A)-Logit(B) =", signif(logit))

## [1] "Value of Logit(A)-Logit(B) = 0.0931679"

paste("Odds ratio of A not being paid back, compared to B =", signif(exp(logit)))

## [1] "Odds ratio of A not being paid back, compared to B = 1.09765"

Prediction of test set

Where threshold = 0.5

predicted.risk <- predict(model, newdata = test_imputed, type = 'response')
test_imputed$predicted.risk <- predicted.risk

accuracy.table <- table(test_imputed$not.fully.paid, predicted.risk >= 0.5)

accuracy.table

##    
##     FALSE TRUE
##   0  2400   13
##   1   457    3

paste('Accuracy of logistic regression model :',
      signif((accuracy.table['0','FALSE'] + accuracy.table['1','TRUE']) / sum(accuracy.table[,])))

## [1] "Accuracy of logistic regression model : 0.836408"

paste('Accuracy of baseline model :', 
      signif(max(sum(test$not.fully.paid == TRUE), sum(test$not.fully.paid == FALSE))/nrow(test)))

## [1] "Accuracy of baseline model : 0.839889"

Receiver Operator Characteristic

library(ROCR)

pred_ROCR <- prediction(predicted.risk, test$not.fully.paid)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

paste('Area under Curve :', signif(auc_ROCR@y.values[[1]]))

## [1] "Area under Curve : 0.672099"

‘Smart Baseline’

We built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.

However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan’s risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan’s interest rate as a “smart baseline” to order the loans according to risk.

A bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.

baseline.model <- glm(not.fully.paid ~ int.rate, data = train_imputed, family = 'binomial')
summary(baseline.model)

## 
## Call:
## glm(formula = not.fully.paid ~ int.rate, family = "binomial", 
##     data = train_imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0547  -0.6271  -0.5442  -0.4361   2.2914  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6726     0.1688  -21.76   <2e-16 ***
## int.rate     15.9214     1.2702   12.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5896.6  on 6704  degrees of freedom
## Residual deviance: 5734.8  on 6703  degrees of freedom
## AIC: 5738.8
## 
## Number of Fisher Scoring iterations: 4

The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. The most likely explanation for this difference is that int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.

Smart baseline predictions

baseline.model.test.predictions <- predict(baseline.model, newdata = test_imputed, type = 'response')

paste('Highest predicted probability of loan not being paid in full on testing set', signif(max(baseline.model.test.predictions)))

## [1] "Highest predicted probability of loan not being paid in full on testing set 0.426624"

paste('With logistic regression cutoff of 0.5, the number of loans not being paid in full on the testing set :', sum(baseline.model.test.predictions >= 0.5))

## [1] "With logistic regression cutoff of 0.5, the number of loans not being paid in full on the testing set : 0"

Receiver operator characteristics of ‘smart baseline’ model

pred_ROCR <- prediction(baseline.model.test.predictions, test_imputed$not.fully.paid)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

paste('Area under Curve :', signif(auc_ROCR@y.values[[1]]))

## [1] "Area under Curve : 0.623908"

Estimating profit

Assume a $1 investment (aka c=1). To create the variable, we first assign to the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations.

test_imputed$profit <- exp(test_imputed$int.rate*3)-1
test_imputed$profit[test_imputed$not.fully.paid == 1] = -1

paste('Maximum profit of a $10 investment in any loan in the testing set:',
      signif(max(test_imputed$profit)*10))

## [1] "Maximum profit of a $10 investment in any loan in the testing set: 8.89477"

Investment strategy based on risk

Investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.

We will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.

highinterest <- subset(test_imputed, test_imputed$int.rate >= 0.15)

paste('Average profit of a $1 investment in one of these high-interest loans :',
      signif(mean(highinterest$profit)))

## [1] "Average profit of a $1 investment in one of these high-interest loans : 0.225102"

paste('Proportion of high-interest loans not paid back in full:',
      signif(mean(highinterest$not.fully.paid)))

## [1] "Proportion of high-interest loans not paid back in full: 0.251716"

We will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list.

We then build a data frame called selectedloans consisting of the 100 lowest risk high-interest loans.

cutoff <- sort(highinterest$predicted.risk, decreasing=FALSE)[100]
selectedloans <- subset(highinterest, highinterest$predicted.risk <= cutoff)

paste("Profit of the investor who invested $1 each of these 100 'low-risk' high-interest loans :",
      signif(sum(selectedloans$profit)))

## [1] "Profit of the investor who invested $1 each of these 100 'low-risk' high-interest loans : 31.2782"

paste("Number of these 100 'low-risk' high-interest loans not paid back in full:",
      sum(selectedloans$not.fully.paid))

## [1] "Number of these 100 'low-risk' high-interest loans not paid back in full: 19"

We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor’s $100 investment. Although the logistic regression models developed in this problem did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.

We conclude with a note of warning. Throughout this analysis we assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.