library(lubridate)
library(ggplot2)
library(dplyr)
library(stringr)
library(caret)
library(rpart)
library(rattle)
library(ROSE)
library(ROCR)
library(MASS)
library(ipred)
library(plyr)
library(rpart.plot)
Lending Club, a San Francisco-based fintech company, works to facilitate peer-to-peer loans through their online lending platform. Started in 2007, their website allows individuals to publicly post loan applications, which other users can then browse and choose to fund. Company estimates place aggregate loan totals at over $15.98 billion through December 2015, making Lending Club the largest online loan platform in the world.
Although currently mired in scandal (CEO Renaud Laplanche stepped down just this week amidst dropping loan volumes and well-vetted accusations of shady accounting practices), Lending Club still provides an excellent machine learning case study. Luckily, their website makes its historical records publicly available, leading to the interesting question: can we determine an accurate way of using loan characteristics to predict which loans will be paid back in full and which will default?
Here is a link to the Lending Club Data used throughout this project - https://www.lendingclub.com/info/download-data.action
I downloaded the Loan Data from 2007 to 2011, a file containing over 40,000 records. Each record is a distinct loan made on the Lending Club platform. For each loan, over 100 characteristics are recorded in the table. These characteristics can largely be divided into two groups - features of the loan, and features of the borrower.
The loan features are the basic stats one might expect, such as the loan amount, the interest rate, and the term of the loan.
The borrower features, which comprise by far the majority of the dataset, are the characteristics of the borrowers that Lending Club has deemed important to collect. Such features include employment length, credit history, and public default histories.
The raw data from Lending Club is quite extensive, spanning 111 features on 40,000 records.
load("/Users/tedorourke/Desktop/Lending Club Model/loans.RData")
dim(loans)
## [1] 42540 111
To begin preparing the data, I read through the documentation published in the “Data Dictionary”on the Lending Club website. By selecting out irrelevant data (loan id, url, etc), poorly documented data (average current balance), and less important features (payment plan, home state), we were left with a manageable dataset of just 18 features.
Variable | Description |
---|---|
loan_status | The final status of the loan |
loan_amnt | The total principal of the loan |
term | The term length of the loan. Either 36-months (standard) or 60-months |
int_rate | Interest Rate |
installment | Dollar amount of montly payment isntallments |
grade | The quality of the loan (as determined by Lending Club) - A, B, C, etc. |
sub-grade | A further partitioning of loan quality - A1, A2, etc. |
emp_length | Borrower’s length of employment. |
home_ownership | Borrwer’s home status (ie own, mortgage, rent) |
annual_inc | Borrower’s reported annual income |
verification_status | Status of Lending Club’s verification of the borrower |
issue_d | Loan issue date |
dti | Monthly Debt Payments divided by reported Monthly Income |
earliest.cr.line | Date of borrower’s earliest credit line |
open_acc | Number of open credit lines on the borrower’s credit file. |
revol_bal | Total credit revolving balance. |
revol_util | Revolving Line Utilization Rate |
total_acc | Total number of historical credit lines on the borrower’s file |
Now, having selected a more manageable number of features, I began investigating the various types of loan outcomes included in the data set.
levels(loans$loan_status)
## [1] ""
## [2] "Charged Off"
## [3] "Current"
## [4] "Default"
## [5] "Does not meet the credit policy. Status:Charged Off"
## [6] "Does not meet the credit policy. Status:Fully Paid"
## [7] "Fully Paid"
## [8] "In Grace Period"
## [9] "Late (16-30 days)"
## [10] "Late (31-120 days)"
To consolidate these labels, I took several steps. First, I removed all the records that did not meet credit thresholds - these loans were not endorsed by Lending Club, and so are less important for the sake of this model. I also decided to remove all the late loans, as these fall into a certain grey area with ambiguous, undetermined final statuses. I next combined the “Default” loans with the “Charged Off” loans (Lending Club defines “Charged Off” as loans that the lender has no reasonable hope of recovering money from), leaving me with two final labels for my classification attempts.
The resulting data was much more cleanly labled:
table(loans$loan_status)
##
## Charged Off Fully Paid
## 5629 32950
Further cleaning of the data involved conerting interest rates and revolving utility rates from factors to numerics, removing empty factor levels, and removing annual income outliers.
Several features of the data set were inherently related, lending themselves naturally to feature engineering. For example, I used lubridate alongside the loan issue date and the date of the borrower’s first credit line to calculate the length of each borrower’s credit history.
loans$issue_d <- as.character(loans$issue_d)
loans$issue_d <- paste(loans$issue_d, "-01", sep = "")
loans$issue_d <- parse_date_time(loans$issue_d, "myd")
loans$earliest_cr_line <- as.character(loans$earliest_cr_line)
loans$earliest_cr_line <- paste(loans$earliest_cr_line, "-01", sep = "")
loans$earliest_cr_line <- parse_date_time(loans$earliest_cr_line, "myd")
loans$time_since_first_credit <- loans$issue_d - loans$earliest_cr_line
loans$time_since_first_credit <- as.numeric(loans$time_since_first_credit)
loans <- loans %>% filter(time_since_first_credit > 0)
head(loans$time_since_first_credit)
## [1] 9830 4627 3682 5782 2586 2344
A second engineered feature, current.account.ratio, was calculated by dividing open.acc (the number of open credit lines the borrower had at the time of the loan) by total.acc (the total number of credit lines the borrower has had).
The final step in cleaning the data was then removing all rows with NAs, a process that only removed about 100 records.
To start examining the data, I began by investigating the distributions of each numeric feature via histograms, segmented out by loan outcome.
The interest rate analysis was of particular interest. Examining the histogram, fully paid loans are clearly clumped at lower interest rates, while charged off loans have a more even distribution, tending towards higher interest rates more frequently than the fully paid ones do. This result makes intuitive sense, as higher interest rates are assigned to riskier investments.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Moving into more categorical analysis, I began looking at the distributions of factor variables across loan statuses, starting with loan grades. As we would expect, a much larger portion of lower grade loans failed to be repaid.
table(loans$loan_status, loans$grade)
##
## A B C D E F G
## Charged Off 597 1413 1337 1109 708 315 99
## Fully Paid 9383 10205 6448 3935 1933 652 198
ggplot(loans, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)
Continuing to explore, it was interesting to see that a much larger portion of 60-month loans (as opposed to the standard 36-month loans) were charged off.
table(loans$loan_status, loans$term)
##
## 36 months 60 months
## Charged Off 3189 2389
## Fully Paid 25690 7064
I next moved into more multivariate explorations, utilizing scatter plots in an attempt to identify trends between seemingly related numeric variables. To make the scatter plots more visually relevant (and not just a block of 40,000 points), I chose to plot just 10% of the points.
This one plot suggests that the 60-month loans tend to have larger interest rates and be for larger loan amounts (the top right corner is dominated by blue points).
index = createDataPartition(y = loans$loan_status, p = 0.90)[[1]]
loans.sample <- loans[-index,]
ggplot(loans.sample, aes(x = loan_amnt, y = int_rate)) + geom_point(aes(color = term))
Having identified several of these weak surface trends, it became time to dive further into the data, begin modeling, and look for more robust insights.
After segmenting the data into training and testing sets, I began my modeling by building a simple decision tree.
index = createDataPartition(y = loans$loan_status, p = 0.8)[[1]]
loans.test <- loans[-index,]
loans.train <- loans[index,]
loans.rpart.0 <- rpart(loan_status ~ ., data = loans.train)
Any attempt to plot this model, however, immediately ran into the error “Fit is Not a Tree, Just a Root”. Turns out the default rpart settings (minsplit = 20, minbucket = 6) are too general for my large dataset, and the algorithm fails to create a single split.
Altering the settings and we get our first true model:
loans.rpart.1 <- rpart(loan_status ~ . , data = loans.train,
control=rpart.control(minsplit=10, minbucket = 3, cp=0.0006))
fancyRpartPlot(loans.rpart.1)
Let’s take a look at the model’s performance.
predictions.1 <- (predict(loans.rpart.1, loans.test, type = "class"))
confusionMatrix(predictions.1, loans.test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 60 75
## Fully Paid 1055 6475
##
## Accuracy : 0.8526
## 95% CI : (0.8444, 0.8604)
## No Information Rate : 0.8545
## P-Value [Acc > NIR] : 0.6932
##
## Kappa : 0.0667
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.053812
## Specificity : 0.988550
## Pos Pred Value : 0.444444
## Neg Pred Value : 0.859894
## Prevalence : 0.145466
## Detection Rate : 0.007828
## Detection Prevalence : 0.017613
## Balanced Accuracy : 0.521181
##
## 'Positive' Class : Charged Off
##
Look at that accuracy! 85%!! We’re done, right??
Not so fast. Look at that terrible specificty value. Of the 1115 Charged Off loans in the test set, this model only correclty classifed 60 of them. If this model is to help lenders avoid bad loans, the true positive rate must be much more robust.
Turns out the main problem is that this data is incredibly imbalanced (85.44% of the 38332 records are classified as “Fully Paid”). For these kinds of datasets, accuracy is a poor measure of model quality. Minority classes have such a small impact on overall accuracy that the metric turns out to be almost meaningless - just look at how good this model’s accuracy was, but at how bad it was at predicting Charged Off loans.
AUC turns out to be a much better metric for evaluating models based on imbalanced data. AUC measures the Area Under the Curve when True Positive Rate is mapped against False positive rates. Any model starts at a baseline of AUC = 0.5, and a perfect model will map the entire space, producing an AUC value of 1.0.
Let’s take a look at the AUC value for our first model:
roc.curve(loans.test$loan_status, predict(loans.rpart.1, loans.test, type = "prob")[,1], plot = TRUE)
## Area under the curve (AUC): 0.622
Clearly, some room for improvement.
Turns out there are four methods for fixing up imbalanced data: oversampling, undersampling, and synthetic data generation (either via ROSE or SMOTE approaches). Each approach involves evening out the data, either by upsampling the minority class (oversampling), taking a subset of the majority class (undersampling), or artifically generating new records from the minority class.
My next step was to try each of these techniques, generating decision trees on each new data set and evaluating those trees on the basis of their sensitivity, specificty, and AUC.
Here’s a run of the oversampling decision tree generation:
loans.oversampled <- ovun.sample(loan_status ~ ., data = loans.train, method = "over", N = 52410, seed = 13)$data
table(loans.oversampled$loan_status)
##
## Fully Paid Charged Off
## 26204 26206
tune <- data.frame(0.001)
colnames(tune) <- "cp"
tr_control <- trainControl(method = "cv",number = 10, verboseIter = TRUE)
loans.rpart.oversampled <- train(loan_status ~., data = loans.oversampled, method = "rpart", trControl = tr_control, tuneGrid = tune, control=rpart.control(minsplit=10, minbucket = 3))
fancyRpartPlot(loans.rpart.oversampled$finalModel)
Let’s look at the model’s performance, via the confusion matrix and the AUC value:
confusionMatrix(predict(loans.rpart.oversampled, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 691 2284
## Fully Paid 424 4266
##
## Accuracy : 0.6467
## 95% CI : (0.6359, 0.6574)
## No Information Rate : 0.8545
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1602
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.61973
## Specificity : 0.65130
## Pos Pred Value : 0.23227
## Neg Pred Value : 0.90959
## Prevalence : 0.14547
## Detection Rate : 0.09015
## Detection Prevalence : 0.38813
## Balanced Accuracy : 0.63551
##
## 'Positive' Class : Charged Off
##
loans.rpart.over.predictions <- prediction(predict(loans.rpart.oversampled, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.rpart.over.predictions, measure = "auc")@y.values
## [[1]]
## [1] 0.6807121
Clearly, this model is a huge improvement. Not only has the AUC value increased, but our confusion matrix demonstrates a far more effective model, especially when it comes to predicting loans that are charged off.
I ran this same code on an undersampled dataset, with slightly worse results (AUC = 0.652).
I next turned to synthetic data generation, using the ROSE and SMOTE techniques. Here’s the run and results from the ROSE method, the better of the two:
loans.synthetic.rose <- ROSE(loan_status ~ ., data = loans.train, seed = 13)$data
table(loans.synthetic.rose$loan_status)
##
## Fully Paid Charged Off
## 15364 15303
tune <- data.frame(0.001)
colnames(tune) <- "cp"
tr_control <- trainControl(method = "cv",number = 10, verboseIter = TRUE)
loans.rpart.rose.train <- train(loan_status ~., data = loans.synthetic.rose, method = "rpart", trControl = tr_control, tuneGrid = tune, control=rpart.control(minsplit=10, minbucket = 3))
confusionMatrix(predict(loans.rpart.rose.train, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 662 2123
## Fully Paid 453 4427
##
## Accuracy : 0.6639
## 95% CI : (0.6532, 0.6745)
## No Information Rate : 0.8545
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1663
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.59372
## Specificity : 0.67588
## Pos Pred Value : 0.23770
## Neg Pred Value : 0.90717
## Prevalence : 0.14547
## Detection Rate : 0.08637
## Detection Prevalence : 0.36334
## Balanced Accuracy : 0.63480
##
## 'Positive' Class : Charged Off
##
loans.rpart.rose.train.predictions <- prediction(predict(loans.rpart.rose.train, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.rpart.rose.train.predictions, measure = "auc")@y.values
## [[1]]
## [1] 0.6793782
Having identified oversampling and ROSE synthetic data generation as effective methods of remedying the data imbalance problem, I decided to go about exploring logistic models. Logistic models are very effective at binary classification problems, and they can be efficiently optimized via the stepAIC function in the MASS library and typical cross validation in the caret train function.
As such, I began by creating a glm model for my oversampled data and using stepAIC to create a more parsimonious function.
loans.glm.over <- glm(loan_status ~ ., data = loans.oversampled, family = "binomial")
loans.glm.over.1 <- stepAIC(loans.glm.over, direction = "both")
## Start: AIC=66593.95
## loan_status ~ loan_amnt + term + int_rate + installment + grade +
## sub_grade + emp_length + home_ownership + annual_inc + verification_status +
## dti + open_acc + revol_bal + revol_util + total_acc + time_since_first_credit +
## current_account_ratio
##
##
## Step: AIC=66593.95
## loan_status ~ loan_amnt + term + int_rate + installment + sub_grade +
## emp_length + home_ownership + annual_inc + verification_status +
## dti + open_acc + revol_bal + revol_util + total_acc + time_since_first_credit +
## current_account_ratio
##
## Df Deviance AIC
## - dti 1 66471 66593
## <none> 66470 66594
## - home_ownership 2 66476 66596
## - verification_status 2 66476 66596
## - revol_bal 1 66476 66598
## - time_since_first_credit 1 66476 66598
## - loan_amnt 1 66490 66612
## - installment 1 66497 66619
## - revol_util 1 66497 66619
## - total_acc 1 66501 66623
## - open_acc 1 66502 66624
## - current_account_ratio 1 66516 66638
## - int_rate 1 66578 66700
## - emp_length 11 66645 66747
## - term 1 66640 66762
## - sub_grade 34 66725 66781
## - annual_inc 1 66822 66944
##
## Step: AIC=66592.68
## loan_status ~ loan_amnt + term + int_rate + installment + sub_grade +
## emp_length + home_ownership + annual_inc + verification_status +
## open_acc + revol_bal + revol_util + total_acc + time_since_first_credit +
## current_account_ratio
##
## Df Deviance AIC
## <none> 66471 66593
## + dti 1 66470 66594
## - home_ownership 2 66477 66595
## - verification_status 2 66477 66595
## - time_since_first_credit 1 66477 66597
## - revol_bal 1 66477 66597
## - loan_amnt 1 66490 66610
## - installment 1 66497 66617
## - total_acc 1 66502 66622
## - revol_util 1 66502 66622
## - open_acc 1 66502 66622
## - current_account_ratio 1 66516 66636
## - int_rate 1 66578 66698
## - emp_length 11 66646 66746
## - term 1 66642 66762
## - sub_grade 34 66727 66781
## - annual_inc 1 66861 66981
I next converted the function to caret type with the train function and calculated our evaluation metrics:
loans.glm.over.2 <- train(loan_status ~ loan_amnt + term + int_rate + installment + sub_grade +
emp_length + home_ownership + annual_inc + verification_status +
dti + open_acc + revol_bal + revol_util + total_acc + time_since_first_credit +
current_account_ratio, data = loans.oversampled, method = "glm")
confusionMatrix(predict(loans.glm.over.2, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 728 2385
## Fully Paid 387 4165
##
## Accuracy : 0.6384
## 95% CI : (0.6275, 0.6491)
## No Information Rate : 0.8545
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1656
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.65291
## Specificity : 0.63588
## Pos Pred Value : 0.23386
## Neg Pred Value : 0.91498
## Prevalence : 0.14547
## Detection Rate : 0.09498
## Detection Prevalence : 0.40613
## Balanced Accuracy : 0.64440
##
## 'Positive' Class : Charged Off
##
loans.glm.over.2.prediction <- prediction(predict(loans.glm.over.2, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.glm.over.2.prediction, measure = "auc")@y.values
## [[1]]
## [1] 0.6976908
This logistic model shows our best results yet. To look for further improvement, I repeated the process with our other promising dataset, the ROSE data.
loans.glm.rose <- glm(loan_status ~ ., data = loans.synthetic.rose, family = "binomial")
loans.glm.rose.1 <- stepAIC(loans.glm.rose, direction = "both")
loans.glm.rose.2 <- train(loan_status ~ term + int_rate + installment + sub_grade + emp_length +
annual_inc + dti + open_acc + revol_util + total_acc + time_since_first_credit +
current_account_ratio, data = loans.synthetic.rose, method = "glm")
confusionMatrix(predict(loans.glm.rose.2, loans.test), loans.test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 729 2416
## Fully Paid 386 4134
##
## Accuracy : 0.6344
## 95% CI : (0.6235, 0.6452)
## No Information Rate : 0.8545
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1623
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.65381
## Specificity : 0.63115
## Pos Pred Value : 0.23180
## Neg Pred Value : 0.91460
## Prevalence : 0.14547
## Detection Rate : 0.09511
## Detection Prevalence : 0.41031
## Balanced Accuracy : 0.64248
##
## 'Positive' Class : Charged Off
##
loans.glm.rose.2.prediction <- prediction(predict(loans.glm.rose.2, newdata = loans.test, type = "prob")[,"Fully Paid"], loans.test$loan_status)
performance(loans.glm.rose.2.prediction, measure = "auc")@y.values
## [[1]]
## [1] 0.6953732
For both decision trees and logistic modeling, the oversampled dataset has consistently produced the highest AUC values (if only marginally). As such, it makes sense to generate an ensemble model on the oversampled data to create a more robust result.
Using the ipred library, it is possible to efficiently generate a bagged (boostrap aggregation) decision tree model.
loans.oversampled.glm.bagging <- bagging(loan_status ~ ., data = loans.oversampled, coob = TRUE, nbagg = 25, control=rpart.control(minsplit=10, minbucket = 3, cp = 0.001))
test.bagprob <- predict(loans.oversampled.glm.bagging, type = "prob", newdata = loans.test)
bagpred <- prediction(test.bagprob[,1], loans.test$loan_status)
bagperf <- performance(bagpred, "auc")
testbag = predict(loans.oversampled.glm.bagging, newdata = loans.test)
table(testbag, loans.test$loan_status)
##
## testbag Charged Off Fully Paid
## Charged Off 708 2337
## Fully Paid 407 4213
bagperf@y.values
## [[1]]
## [1] 0.6959452
This bagged decision tree model clocks in with an AUC approximately even with the ROSE and Oversampled Logistic models. All are sharp improvements over our initial model, and demonstrate the power of upsampling to adjust for imbalanced data.
Let’s take a look at the final importance of each of our factors in this bagged model:
varImp(loans.oversampled.glm.bagging)
## Overall
## annual_inc 1037.05863
## current_account_ratio 284.55674
## dti 324.09717
## emp_length 500.57362
## grade 2155.88022
## home_ownership 50.46849
## installment 480.24595
## int_rate 2541.45801
## loan_amnt 349.03318
## open_acc 131.01534
## revol_bal 380.09777
## revol_util 897.45100
## sub_grade 2636.02575
## term 1988.82275
## time_since_first_credit 346.84110
## total_acc 303.27759
## verification_status 31.81818
Taking a look at this output, it appears that the most important factors in predicting loan outcomes are the fundamental characteristics of the loans themselves: grade, subgrade, interst rate, and term.
Among the borrower characteristics, annual income and revolving utility turned out to be the most important features. Seemingly relevant statistics, such as dti, surprisingly had relatively minimal impact.
Our two engineered variables, the current account ratio and the time since first credit, unfortunately did not end up significantly helping our model make its predictions.
This type of modelling is incredibly important. Improving our ability to gain insight into potential lending outcomes at the time of a loan’s origin makes us all smarter investors. At its best, a model of this kind will be able to complement the basic credit score judgements many lenders are currently stuck with when deciding where to place their money.
There will always be value in a model that can help discern future investment outcomes. Lending Club members who might use this model to make financial descisions are by no means guaranteed results, but such a model will hopefully help them sift through the 100+ features they might be considering so they can arrive at the truly important ones.