library(lubridate)
library(ggplot2)
library(dplyr)
library(stringr)
library(caret)
library(rpart)
library(rattle)
library(ROSE)
library(ROCR)
library(MASS)
library(ipred)
library(plyr)
library(rpart.plot)

Introduction

Lending Club, a San Francisco-based fintech company, facilitates peer-to-peer loans through its online lending platform. Founded in 2007, the platform allows individuals to publicly post loan applications, which other users can then browse and choose to fund. Company estimates place aggregate loan totals at over $15.98 billion through December 2015, making Lending Club the largest online loan platform in the world.

Although currently mired in scandal (CEO Renaud Laplanche stepped down just this week amidst dropping loan volumes and well-vetted accusations of shady accounting practices), Lending Club still provides an excellent machine learning case study. Luckily, the company makes its historical records publicly available, leading to an interesting question: can we determine an accurate way of using loan characteristics to predict which loans will be paid back in full and which will default?

Data Explanation

Here is a link to the Lending Club Data used throughout this project - https://www.lendingclub.com/info/download-data.action

I downloaded the Loan Data from 2007 to 2011, a file containing over 40,000 records. Each record is a distinct loan made on the Lending Club platform. For each loan, over 100 characteristics are recorded in the table. These characteristics can largely be divided into two groups - features of the loan, and features of the borrower.

The loan features are the basic stats one might expect, such as the loan amount, the interest rate, and the term of the loan.

The borrower features, which comprise by far the majority of the dataset, are the characteristics of the borrowers that Lending Club has deemed important to collect. Such features include employment length, credit history, and public default histories.

Preparation

The raw data from Lending Club is quite extensive, spanning 111 features across 42,540 records.

load("/Users/tedorourke/Desktop/Lending Club Model/loans.RData")
dim(loans)
## [1] 42540   111

To begin preparing the data, I read through the documentation published in the "Data Dictionary" on the Lending Club website. After removing irrelevant fields (loan id, url, etc.), poorly documented fields (average current balance), and less important features (payment plan, home state), I was left with a manageable dataset of just 18 features.

Variable Description
loan_status The final status of the loan
loan_amnt The total principal of the loan
term The term length of the loan. Either 36 months (standard) or 60 months
int_rate Interest rate
installment Dollar amount of monthly payment installments
grade The quality of the loan (as determined by Lending Club) - A, B, C, etc.
sub_grade A further partitioning of loan quality - A1, A2, etc.
emp_length Borrower's length of employment
home_ownership Borrower's home status (i.e. own, mortgage, rent)
annual_inc Borrower's reported annual income
verification_status Status of Lending Club's verification of the borrower
issue_d Loan issue date
dti Monthly debt payments divided by reported monthly income
earliest_cr_line Date of borrower's earliest credit line
open_acc Number of open credit lines on the borrower's credit file
revol_bal Total revolving credit balance
revol_util Revolving line utilization rate
total_acc Total number of historical credit lines on the borrower's file
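
A minimal sketch of this selection step (the exact call is an assumption; dplyr::select is written out explicitly because the MASS package loaded above also exports a select function):

loans <- loans %>%
  dplyr::select(loan_status, loan_amnt, term, int_rate, installment,
                grade, sub_grade, emp_length, home_ownership, annual_inc,
                verification_status, issue_d, dti, earliest_cr_line,
                open_acc, revol_bal, revol_util, total_acc)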

Now, having selected a more manageable number of features, I began investigating the various types of loan outcomes included in the data set.

levels(loans$loan_status)
##  [1] ""                                                   
##  [2] "Charged Off"                                        
##  [3] "Current"                                            
##  [4] "Default"                                            
##  [5] "Does not meet the credit policy. Status:Charged Off"
##  [6] "Does not meet the credit policy. Status:Fully Paid" 
##  [7] "Fully Paid"                                         
##  [8] "In Grace Period"                                    
##  [9] "Late (16-30 days)"                                  
## [10] "Late (31-120 days)"

To consolidate these labels, I took several steps. First, I removed all the records that did not meet credit thresholds - these loans were not endorsed by Lending Club, and so are less important for the sake of this model. I also decided to remove all the late loans, as these fall into a certain grey area with ambiguous, undetermined final statuses. I next combined the “Default” loans with the “Charged Off” loans (Lending Club defines “Charged Off” as loans that the lender has no reasonable hope of recovering money from), leaving me with two final labels for my classification attempts.
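
A sketch of that consolidation, assuming the label strings listed above (the exact filtering code is an assumption; the table below comes from the original run):

# Keep only loans with a definitive outcome, dropping the credit-policy
# loans and any loan whose status is still in flux.
loans <- loans %>%
  filter(loan_status %in% c("Charged Off", "Default", "Fully Paid"))

# Fold the small "Default" group into "Charged Off" and drop empty levels.
loans$loan_status <- as.character(loans$loan_status)
loans$loan_status[loans$loan_status == "Default"] <- "Charged Off"
loans$loan_status <- factor(loans$loan_status)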

The resulting data was much more cleanly labeled:

table(loans$loan_status)
## 
## Charged Off  Fully Paid 
##        5629       32950

Further cleaning of the data involved converting interest rates and revolving utilization rates from factors to numerics, removing empty factor levels, and removing annual income outliers.
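
A hedged sketch of those cleaning steps (the percent-stripping approach and the income cutoff are illustrative assumptions, not the exact code from the original run):

# Interest rate and revolving utilization arrive as factors like "10.65%";
# strip the percent sign and convert to numeric.
loans$int_rate <- as.numeric(str_replace(as.character(loans$int_rate), "%", ""))
loans$revol_util <- as.numeric(str_replace(as.character(loans$revol_util), "%", ""))

# Drop factor levels emptied by the earlier filtering.
loans <- droplevels(loans)

# Trim extreme annual incomes (the 99th-percentile cutoff is illustrative).
loans <- loans %>% filter(annual_inc < quantile(annual_inc, 0.99, na.rm = TRUE))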

Feature Engineering

Several features of the data set were inherently related, lending themselves naturally to feature engineering. For example, I used lubridate alongside the loan issue date and the date of the borrower’s first credit line to calculate the length of each borrower’s credit history.

loans$issue_d <- as.character(loans$issue_d)
loans$issue_d <- paste(loans$issue_d, "-01", sep = "")
loans$issue_d <- parse_date_time(loans$issue_d, "myd")

loans$earliest_cr_line <- as.character(loans$earliest_cr_line)
loans$earliest_cr_line <- paste(loans$earliest_cr_line, "-01", sep = "")
loans$earliest_cr_line <- parse_date_time(loans$earliest_cr_line, "myd")

loans$time_since_first_credit <- loans$issue_d - loans$earliest_cr_line
loans$time_since_first_credit <- as.numeric(loans$time_since_first_credit)

loans <- loans %>% filter(time_since_first_credit > 0)
head(loans$time_since_first_credit)
## [1] 9830 4627 3682 5782 2586 2344

A second engineered feature, current_account_ratio, was calculated by dividing open_acc (the number of open credit lines the borrower had at the time of the loan) by total_acc (the total number of credit lines the borrower has ever had).

The final step in cleaning the data was then removing all rows with NAs, a process that only removed about 100 records.
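
A minimal sketch of these last two steps:

# Ratio of currently open credit lines to all historical credit lines.
loans$current_account_ratio <- loans$open_acc / loans$total_acc

# Drop the roughly 100 rows that still contain missing values.
loans <- na.omit(loans)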

Exploratory Analysis

To start examining the data, I began by investigating the distributions of each numeric feature via histograms, segmented out by loan outcome.

The interest rate histogram was particularly revealing. Fully paid loans are clearly clumped at lower interest rates, while charged off loans are spread more evenly and skew towards higher rates. This result makes intuitive sense, as higher interest rates are assigned to riskier loans.

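The histogram itself can be reproduced with a call along these lines (a sketch; the exact aesthetics of the original figure are not shown):

ggplot(loans, aes(x = int_rate)) + geom_histogram() + facet_wrap(~loan_status, ncol = 1)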

Moving on to categorical analysis, I looked at the distributions of the factor variables across loan statuses, starting with loan grade. As we would expect, a much larger portion of lower grade loans failed to be repaid.

table(loans$loan_status, loans$grade)
##              
##                   A     B     C     D     E     F     G
##   Charged Off   597  1413  1337  1109   708   315    99
##   Fully Paid   9383 10205  6448  3935  1933   652   198
ggplot(loans, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)

Continuing to explore, I found it interesting that a much larger proportion of 60-month loans (as opposed to the standard 36-month loans) were charged off.

table(loans$loan_status, loans$term)
##              
##                36 months  60 months
##   Charged Off       3189       2389
##   Fully Paid       25690       7064

I next moved into more multivariate explorations, utilizing scatter plots in an attempt to identify trends between seemingly related numeric variables. To make the scatter plots more visually relevant (and not just a block of 40,000 points), I chose to plot just 10% of the points.

The plot suggests that 60-month loans tend to carry higher interest rates and larger principal amounts (the top right corner is dominated by blue points).

index = createDataPartition(y = loans$loan_status, p = 0.90)[[1]]
loans.sample <- loans[-index,]
ggplot(loans.sample, aes(x = loan_amnt, y = int_rate)) + geom_point(aes(color = term))

Having identified several of these weak surface trends, it became time to dive further into the data, begin modeling, and look for more robust insights.

Decision Trees

After segmenting the data into training and testing sets, I began my modeling by building a simple decision tree.

index = createDataPartition(y = loans$loan_status, p = 0.8)[[1]]
loans.test <- loans[-index,]
loans.train <- loans[index,]
loans.rpart.0 <- rpart(loan_status ~ ., data = loans.train)

Any attempt to plot this model, however, immediately ran into the error "fit is not a tree, just a root". It turns out the default rpart settings (minsplit = 20, minbucket = 7, cp = 0.01) are too restrictive for this large, imbalanced dataset, and the algorithm never makes a single split.

After altering these settings, we get our first true model:

loans.rpart.1 <- rpart(loan_status ~ . , data = loans.train, 
                      control=rpart.control(minsplit=10, minbucket = 3, cp=0.0006))
fancyRpartPlot(loans.rpart.1)

Let’s take a look at the model’s performance.

predictions.1 <- (predict(loans.rpart.1, loans.test, type = "class"))
confusionMatrix(predictions.1, loans.test$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off          60         75
##   Fully Paid         1055       6475
##                                           
##                Accuracy : 0.8526          
##                  95% CI : (0.8444, 0.8604)
##     No Information Rate : 0.8545          
##     P-Value [Acc > NIR] : 0.6932          
##                                           
##                   Kappa : 0.0667          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.053812        
##             Specificity : 0.988550        
##          Pos Pred Value : 0.444444        
##          Neg Pred Value : 0.859894        
##              Prevalence : 0.145466        
##          Detection Rate : 0.007828        
##    Detection Prevalence : 0.017613        
##       Balanced Accuracy : 0.521181        
##                                           
##        'Positive' Class : Charged Off     
## 

Look at that accuracy! 85%!! We’re done, right??

Not so fast. Look at that terrible sensitivity value. Of the 1115 Charged Off loans in the test set, this model correctly classified only 60 of them. If this model is to help lenders avoid bad loans, the true positive rate must be much more robust.

It turns out the main problem is that this data is highly imbalanced (85.44% of the 38332 records are classified as "Fully Paid"). For datasets like this, accuracy is a poor measure of model quality. The minority class has so little impact on overall accuracy that the metric becomes almost meaningless - just look at how high this model's accuracy was, and how bad it was at predicting Charged Off loans.
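
The imbalance quoted above is easy to confirm directly from the cleaned data:

# Proportion of each loan outcome - roughly 85% of records are Fully Paid.
round(prop.table(table(loans$loan_status)), 4)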

AUC turns out to be a much better metric for evaluating models built on imbalanced data. AUC measures the area under the ROC curve, which plots the true positive rate against the false positive rate. A random classifier produces a baseline AUC of 0.5, while a perfect model covers the entire space, producing an AUC of 1.0.

Let’s take a look at the AUC value for our first model:

roc.curve(loans.test$loan_status, predict(loans.rpart.1, loans.test, type = "prob")[,1], plot = TRUE)

## Area under the curve (AUC): 0.622

Clearly, some room for improvement.

There are four common techniques for handling imbalanced data: oversampling, undersampling, and synthetic data generation (via either the ROSE or SMOTE approach). Each technique evens out the class distribution, either by upsampling the minority class (oversampling), taking a subset of the majority class (undersampling), or artificially generating new records from the minority class.

My next step was to try each of these techniques, generating decision trees on each new data set and evaluating those trees on the basis of their sensitivity, specificity, and AUC.

Here’s a run of the oversampling decision tree generation:

loans.oversampled <- ovun.sample(loan_status ~ ., data = loans.train, method = "over", N = 52410, seed = 13)$data
table(loans.oversampled$loan_status)
## 
##  Fully Paid Charged Off 
##       26204       26206
tune <- data.frame(0.001)
colnames(tune) <- "cp"
tr_control <- trainControl(method = "cv",number = 10, verboseIter = TRUE)
loans.rpart.oversampled <- train(loan_status ~., data = loans.oversampled, method = "rpart", trControl = tr_control, tuneGrid = tune, control=rpart.control(minsplit=10, minbucket = 3))
fancyRpartPlot(loans.rpart.oversampled$finalModel)

Let’s look at the model’s performance, via the confusion matrix and the AUC value:

confusionMatrix(predict(loans.rpart.oversampled, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off         691       2284
##   Fully Paid          424       4266
##                                           
##                Accuracy : 0.6467          
##                  95% CI : (0.6359, 0.6574)
##     No Information Rate : 0.8545          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1602          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.61973         
##             Specificity : 0.65130         
##          Pos Pred Value : 0.23227         
##          Neg Pred Value : 0.90959         
##              Prevalence : 0.14547         
##          Detection Rate : 0.09015         
##    Detection Prevalence : 0.38813         
##       Balanced Accuracy : 0.63551         
##                                           
##        'Positive' Class : Charged Off     
## 
loans.rpart.over.predictions <- prediction(predict(loans.rpart.oversampled, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.rpart.over.predictions, measure = "auc")@y.values
## [[1]]
## [1] 0.6807121

Clearly, this model is a huge improvement. Not only has the AUC value increased, but our confusion matrix demonstrates a far more effective model, especially when it comes to predicting loans that are charged off.

I ran this same code on an undersampled dataset, with slightly worse results (AUC = 0.652).
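
For reference, the undersampled run followed the same pattern as the oversampled one. A sketch, assuming N is set to twice the size of the minority class (the AUC quoted above comes from the original run):

loans.undersampled <- ovun.sample(loan_status ~ ., data = loans.train, method = "under",
                                  N = 2 * sum(loans.train$loan_status == "Charged Off"), seed = 13)$data
loans.rpart.undersampled <- train(loan_status ~ ., data = loans.undersampled, method = "rpart",
                                  trControl = tr_control, tuneGrid = tune,
                                  control = rpart.control(minsplit = 10, minbucket = 3))
loans.rpart.under.predictions <- prediction(predict(loans.rpart.undersampled, newdata = loans.test,
                                                    type = "prob")[, "Fully Paid"], loans.test$loan_status)
performance(loans.rpart.under.predictions, measure = "auc")@y.values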

I next turned to synthetic data generation, using the ROSE and SMOTE techniques. Here’s the run and results from the ROSE method, the better of the two:

loans.synthetic.rose <- ROSE(loan_status ~ ., data = loans.train, seed = 13)$data
table(loans.synthetic.rose$loan_status)
## 
##  Fully Paid Charged Off 
##       15364       15303
tune <- data.frame(0.001)
colnames(tune) <- "cp"
tr_control <- trainControl(method = "cv",number = 10, verboseIter = TRUE)
loans.rpart.rose.train <- train(loan_status ~., data = loans.synthetic.rose, method = "rpart", trControl = tr_control, tuneGrid = tune, control=rpart.control(minsplit=10, minbucket = 3))
confusionMatrix(predict(loans.rpart.rose.train, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off         662       2123
##   Fully Paid          453       4427
##                                           
##                Accuracy : 0.6639          
##                  95% CI : (0.6532, 0.6745)
##     No Information Rate : 0.8545          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1663          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59372         
##             Specificity : 0.67588         
##          Pos Pred Value : 0.23770         
##          Neg Pred Value : 0.90717         
##              Prevalence : 0.14547         
##          Detection Rate : 0.08637         
##    Detection Prevalence : 0.36334         
##       Balanced Accuracy : 0.63480         
##                                           
##        'Positive' Class : Charged Off     
## 
loans.rpart.rose.train.predictions <- prediction(predict(loans.rpart.rose.train, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.rpart.rose.train.predictions, measure = "auc")@y.values
## [[1]]
## [1] 0.6793782
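
The SMOTE dataset (whose results came in slightly behind ROSE's) could be generated along these lines - a sketch assuming the DMwR package, which is not among the libraries loaded at the top of this script, and illustrative perc.over/perc.under values:

library(DMwR)  # assumed dependency providing the SMOTE() function
loans.synthetic.smote <- SMOTE(loan_status ~ ., data = loans.train,
                               perc.over = 200, perc.under = 200)
table(loans.synthetic.smote$loan_status)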

Logistic Regression

Having identified oversampling and ROSE synthetic data generation as effective remedies for the data imbalance problem, I decided to explore logistic models. Logistic regression is well suited to binary classification problems, and the models can be refined efficiently via stepwise selection with the stepAIC function from the MASS library and cross-validation with the caret train function.

As such, I began by fitting a glm on the oversampled data and using stepAIC to arrive at a more parsimonious model.

loans.glm.over <- glm(loan_status ~ ., data = loans.oversampled, family = "binomial")
loans.glm.over.1 <- stepAIC(loans.glm.over, direction = "both")
## Start:  AIC=66593.95
## loan_status ~ loan_amnt + term + int_rate + installment + grade + 
##     sub_grade + emp_length + home_ownership + annual_inc + verification_status + 
##     dti + open_acc + revol_bal + revol_util + total_acc + time_since_first_credit + 
##     current_account_ratio
## 
## 
## Step:  AIC=66593.95
## loan_status ~ loan_amnt + term + int_rate + installment + sub_grade + 
##     emp_length + home_ownership + annual_inc + verification_status + 
##     dti + open_acc + revol_bal + revol_util + total_acc + time_since_first_credit + 
##     current_account_ratio
## 
##                           Df Deviance   AIC
## - dti                      1    66471 66593
## <none>                          66470 66594
## - home_ownership           2    66476 66596
## - verification_status      2    66476 66596
## - revol_bal                1    66476 66598
## - time_since_first_credit  1    66476 66598
## - loan_amnt                1    66490 66612
## - installment              1    66497 66619
## - revol_util               1    66497 66619
## - total_acc                1    66501 66623
## - open_acc                 1    66502 66624
## - current_account_ratio    1    66516 66638
## - int_rate                 1    66578 66700
## - emp_length              11    66645 66747
## - term                     1    66640 66762
## - sub_grade               34    66725 66781
## - annual_inc               1    66822 66944
## 
## Step:  AIC=66592.68
## loan_status ~ loan_amnt + term + int_rate + installment + sub_grade + 
##     emp_length + home_ownership + annual_inc + verification_status + 
##     open_acc + revol_bal + revol_util + total_acc + time_since_first_credit + 
##     current_account_ratio
## 
##                           Df Deviance   AIC
## <none>                          66471 66593
## + dti                      1    66470 66594
## - home_ownership           2    66477 66595
## - verification_status      2    66477 66595
## - time_since_first_credit  1    66477 66597
## - revol_bal                1    66477 66597
## - loan_amnt                1    66490 66610
## - installment              1    66497 66617
## - total_acc                1    66502 66622
## - revol_util               1    66502 66622
## - open_acc                 1    66502 66622
## - current_account_ratio    1    66516 66636
## - int_rate                 1    66578 66698
## - emp_length              11    66646 66746
## - term                     1    66642 66762
## - sub_grade               34    66727 66781
## - annual_inc               1    66861 66981

I next refit the selected formula with caret's train function and calculated the evaluation metrics:

loans.glm.over.2 <- train(loan_status ~ loan_amnt + term + int_rate + installment + sub_grade + 
                            emp_length + home_ownership + annual_inc + verification_status + 
                            dti + open_acc + revol_bal + revol_util + total_acc + time_since_first_credit + 
                            current_account_ratio, data = loans.oversampled, method = "glm")

confusionMatrix(predict(loans.glm.over.2, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off         728       2385
##   Fully Paid          387       4165
##                                           
##                Accuracy : 0.6384          
##                  95% CI : (0.6275, 0.6491)
##     No Information Rate : 0.8545          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1656          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.65291         
##             Specificity : 0.63588         
##          Pos Pred Value : 0.23386         
##          Neg Pred Value : 0.91498         
##              Prevalence : 0.14547         
##          Detection Rate : 0.09498         
##    Detection Prevalence : 0.40613         
##       Balanced Accuracy : 0.64440         
##                                           
##        'Positive' Class : Charged Off     
## 
loans.glm.over.2.prediction <- prediction(predict(loans.glm.over.2, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.glm.over.2.prediction, measure = "auc")@y.values
## [[1]]
## [1] 0.6976908

This logistic model shows our best results yet. To look for further improvement, I repeated the process with our other promising dataset, the ROSE data.

loans.glm.rose <- glm(loan_status ~ ., data = loans.synthetic.rose, family = "binomial")
loans.glm.rose.1 <- stepAIC(loans.glm.rose, direction = "both")

loans.glm.rose.2 <- train(loan_status ~ term + int_rate + installment + sub_grade + emp_length + 
                            annual_inc + dti + open_acc + revol_util + total_acc + time_since_first_credit + 
                            current_account_ratio, data = loans.synthetic.rose, method = "glm")
confusionMatrix(predict(loans.glm.rose.2, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off         729       2416
##   Fully Paid          386       4134
##                                           
##                Accuracy : 0.6344          
##                  95% CI : (0.6235, 0.6452)
##     No Information Rate : 0.8545          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1623          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.65381         
##             Specificity : 0.63115         
##          Pos Pred Value : 0.23180         
##          Neg Pred Value : 0.91460         
##              Prevalence : 0.14547         
##          Detection Rate : 0.09511         
##    Detection Prevalence : 0.41031         
##       Balanced Accuracy : 0.64248         
##                                           
##        'Positive' Class : Charged Off     
## 
loans.glm.rose.2.prediction <- prediction(predict(loans.glm.rose.2, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.glm.rose.2.prediction, measure = "auc")@y.values
## [[1]]
## [1] 0.6953732

Ensemble Model - Bagging

For both decision trees and logistic modeling, the oversampled dataset has consistently produced the highest AUC values (if only marginally). As such, it makes sense to generate an ensemble model on the oversampled data to create a more robust result.

Using the ipred library, it is possible to efficiently generate a bagged (bootstrap aggregation) decision tree model.

loans.oversampled.glm.bagging <- bagging(loan_status ~ ., data = loans.oversampled, coob = TRUE, nbagg = 25, control=rpart.control(minsplit=10, minbucket = 3, cp = 0.001))

test.bagprob <- predict(loans.oversampled.glm.bagging, type = "prob", newdata = loans.test)
bagpred <- prediction(test.bagprob[,1], loans.test$loan_status)
bagperf <- performance(bagpred, "auc")
testbag = predict(loans.oversampled.glm.bagging, newdata = loans.test)
table(testbag, loans.test$loan_status)
##              
## testbag       Charged Off Fully Paid
##   Charged Off         708       2337
##   Fully Paid          407       4213
bagperf@y.values
## [[1]]
## [1] 0.6959452

This bagged decision tree model clocks in with an AUC roughly even with the ROSE and oversampled logistic models. All are sharp improvements over our initial model, and they demonstrate the power of resampling to adjust for imbalanced data.

Let’s take a look at the final importance of each of our factors in this bagged model:

varImp(loans.oversampled.glm.bagging)
##                            Overall
## annual_inc              1037.05863
## current_account_ratio    284.55674
## dti                      324.09717
## emp_length               500.57362
## grade                   2155.88022
## home_ownership            50.46849
## installment              480.24595
## int_rate                2541.45801
## loan_amnt                349.03318
## open_acc                 131.01534
## revol_bal                380.09777
## revol_util               897.45100
## sub_grade               2636.02575
## term                    1988.82275
## time_since_first_credit  346.84110
## total_acc                303.27759
## verification_status       31.81818

Taking a look at this output, it appears that the most important factors in predicting loan outcomes are the fundamental characteristics of the loans themselves: grade, sub-grade, interest rate, and term.

Among the borrower characteristics, annual income and revolving utilization turned out to be the most important features. Seemingly relevant statistics, such as dti, surprisingly had relatively minimal impact.

Our two engineered variables, the current account ratio and the time since first credit, unfortunately did not end up significantly helping our model make its predictions.

Conclusion

This type of modeling is incredibly important. Improving our ability to gain insight into likely lending outcomes at the time a loan is made makes us all smarter investors. At its best, a model of this kind can complement the basic credit score judgments many lenders are currently stuck with when deciding where to place their money.

There will always be value in a model that can help discern future investment outcomes. Lending Club members who might use this model to make financial decisions are by no means guaranteed results, but such a model will hopefully help them sift through the 100+ features they might be considering so they can arrive at the truly important ones.