1 Introduction

The purpose of this project is to explore data from Lending Club (LC), a peer-to-peer lending site. LC has made data from funded loans publicly available and accessible in CSV formatted files. This data is incredibly rich, and personal financial data is rare. That makes this exploration of the data interesting by itself. Additionally, a model predicting defaults could help to instruct the direction of a lending portfolio on the site.

2 Setting Up

LC provides data on funded loans from 2007 thru (at this time) Q1 2017. LC’s data can be downloaded at https://www.lendingclub.com/info/download-data.action. The data are split over several files. I already merged all of the data and combined it into a single R data file. Each data file from LC has the same columns, so merging is uncomplicated.

I start by loading the merged data and necessary R packages. I also remove a small number of missing rows that were missing loan amounts and other critical data points.

load(file = "lending club.RData")
df <- df[!is.na(df$funded_amnt),] # removes missing lines

library(zoo)
library(ggplot2)
library(scales)
library(extrafont)
library(fiftystater)
library(mapproj)
library(randomForest)

3 Exploring the Data

3.1 Dimensions

There are 128 variables and over 1.4 million rows in the data.

dim(df)
## [1] 1418626     128

3.2 Funded Amount

Loans amounts range from $500 to $40,000 with the median loan fund amount of $14,747.

summary(df$funded_amnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     500    8000   12800   14747   20000   40000

3.3 Term Length

Loans are either 36 months or 60 months. The vast majority are 36 months.

table(df$term)
## 
##  36 months  60 months 
##    1017074     401552

3.4 Interest Rate

Interest rates range between 5.32% and 30.99%. The median interest rate is 12.79%.

summary(as.numeric(gsub("%","",df$int_rate)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.32    9.75   12.79   13.19   15.88   30.99

3.5 FICO Score

A FICO score is an indication of a loanee’s credit quality. LC’s loanees’ scores range from 614 to 850, with a median score of 694. According to Experian, scores in the 670-739 range are “good.” Scores below 670 are considered “subprime.” Given a 1st quartile score of 674, this means that a significant number of loans on LC are subprime.

summary(df$fico_range_high)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   614.0   674.0   694.0   699.2   714.0   850.0
df$subprime <- ifelse(df$fico_range_high < 670, "Yes", "No")

ggplot(df, aes(x = factor(subprime))) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") + 
  scale_y_continuous(labels = percent) +
  labs(x = "Subprime", y = "") +
  ggtitle("Subprime") +
  theme_classic(base_family = "Gill Sans MT")

3.6 Employment Status

Loanees’ employment length history is given up to 10+ years. I assume that those with an “n/a” are unemployed. I condense this variable into buckets for the prediction model later. Most loanees have been employed at least five years.

table(df$emp_length)
## 
##  < 1 year    1 year 10+ years   2 years   3 years   4 years   5 years 
##    109477     92672    474931    127526    112942     84077     88226 
##   6 years   7 years   8 years   9 years       n/a 
##     65856     62024     65930     55124     79841
df$employ <- NA
df$employ[df$emp_length == "n/a"] <- "Unemployed"
df$employ[df$emp_length == "< 1 year"] <- "Less than 2 years"
df$employ[df$emp_length == "1 year"] <- "Less than 2 years"
df$employ[df$emp_length == "2 years"] <- "2-4 years"
df$employ[df$emp_length == "3 years"] <- "2-4 years"
df$employ[df$emp_length == "4 years"] <- "2-4 years"
df$employ[df$emp_length == "5 years"] <- "5-9 years"
df$employ[df$emp_length == "6 years"] <- "5-9 years"
df$employ[df$emp_length == "7 years"] <- "5-9 years"
df$employ[df$emp_length == "8 years"] <- "5-9 years"
df$employ[df$emp_length == "9 years"] <- "5-9 years"
df$employ[df$emp_length == "10+ years"] <- "10+ years"
df$employ <- factor(df$employ,
                      levels = c("Unemployed",
                                 "Less than 2 years",
                                 "2-4 years",
                                 "5-9 years",
                                 "10+ years"))

ggplot(df, aes(x = factor(employ))) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") + 
  scale_y_continuous(labels = percent) +
  labs(x = "Employment Status", y = "") +
  ggtitle("Employment Status") +
  theme_classic(base_family = "Gill Sans MT")

3.7 Annual Income

The median annual income is $65,000.

summary(df$annual_inc)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0    46000    65000    76923    91000 61000000        4

3.8 Debt-to-Income Ratio

The median debt-to-income ratio is 17.84%. There are some observations miscoded or coded as missing, so I create a new dti variable (dti2) to remove these outliers.

summary(df$dti)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   12.05   17.84   19.09   24.24 9999.00
df$dti2 <- df$dti
df$dti2[df$dti < 0 | df$dti == 9999] <- NA

summary(df$dti2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.05   17.84   18.40   24.24 8581.25     100

3.9 Loan Grade

Loan grades are assigned by LC based on applicants’ background, including credit history. These loan grades determine the loan’s interest rate.

ggplot(df, aes(x = factor(sub_grade))) +
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
  scale_y_continuous(labels = percent) +
  labs(x = "Loan Grade", y = "") +
  ggtitle("Loan Grade") +
  theme_classic(base_family = "Gill Sans MT")

3.10 Loan Status

Just over half of LC’s loans in the data are current, which means they are in the process of being paid back. When looking at just matured loans, around 20% of loans have defaulted.

table(df$loan_status)
## 
##                                         Charged Off 
##                                              129458 
##                                             Current 
##                                              741937 
##                                             Default 
##                                                  24 
## Does not meet the credit policy. Status:Charged Off 
##                                                 761 
##  Does not meet the credit policy. Status:Fully Paid 
##                                                1988 
##                                          Fully Paid 
##                                              507837 
##                                     In Grace Period 
##                                               12762 
##                                   Late (16-30 days) 
##                                                4121 
##                                  Late (31-120 days) 
##                                               19738
df$status <- gsub("Does not meet the credit policy. Status:", "", df$loan_status)

df$status[df$status == "Charged Off"] <- "Default"
df$status[df$status == "In Grace Period"] <- "Late"
df$status[df$status == "Late (16-30 days)"] <- "Late"
df$status[df$status == "Late (31-120 days)"] <- "Late"

ggplot(df, aes(x = factor(status))) +
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
  scale_y_continuous(labels = percent) +
  labs(x = "Loan Status", y = "") +
  ggtitle("Loan Status") +
  theme_classic(base_family = "Gill Sans MT")

ggplot(aes(x = factor(status)), data = subset(df, status != "Current" & status != "Late")) +
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
  scale_y_continuous(limits = c(0,1),
                     labels = percent) +
  labs(x = "Loan Status", y = "") +
  ggtitle("Loan Status (Only Matured Loans)") +
  theme_classic(base_family = "Gill Sans MT")

3.11 Loans by State

States with the most loan volume are also the most populous states. California has the most loan volume (over 200,000). Interestingly, Iowa has originated only 14 funded loans through LC since 2007. By comparison, less populous Wyoming has originated over 3000 loans. Iowa’s total seems like mistake in the data, but there’s no apparent way to double-check.

# Create data set with number of loans per state

lc.states <- data.frame(table(df$addr_state))
names(lc.states) <- c("state","count")
lc.states$id <- tolower(state.name[match(lc.states$state,state.abb)])
lc.states$id[lc.states$state == "DC"] <- "district of columbia"

lc.states[c("state","count")]
##    state  count
## 1     AK   3451
## 2     AL  17656
## 3     AR  10682
## 4     AZ  33069
## 5     CA 200780
## 6     CO  29723
## 7     CT  21979
## 8     DC   3608
## 9     DE   4035
## 10    FL  99320
## 11    GA  46478
## 12    HI   7112
## 13    IA     14
## 14    ID   1516
## 15    IL  57001
## 16    IN  23061
## 17    KS  12283
## 18    KY  13648
## 19    LA  16658
## 20    MA  32637
## 21    MD  33521
## 22    ME   2240
## 23    MI  37075
## 24    MN  25417
## 25    MO  22736
## 26    MS   7306
## 27    MT   4013
## 28    NC  39661
## 29    ND   1746
## 30    NE   3750
## 31    NH   6927
## 32    NJ  52656
## 33    NM   7690
## 34    NV  19965
## 35    NY 117632
## 36    OH  47763
## 37    OK  13023
## 38    OR  16810
## 39    PA  49190
## 40    RI   6244
## 41    SC  17304
## 42    SD   2879
## 43    TN  21728
## 44    TX 116506
## 45    UT   9707
## 46    VA  40620
## 47    VT   2978
## 48    WA  29893
## 49    WI  18631
## 50    WV   5156
## 51    WY   3148
# Put the counts into buckets

lc.states$bucket <- NA
lc.states$bucket[lc.states$count < 5000] <- "Under 5k"
lc.states$bucket[lc.states$count >= 5000 & lc.states$count < 20000] <- "5k-19k"
lc.states$bucket[lc.states$count >= 20000 & lc.states$count < 50000] <- "20k-49k"
lc.states$bucket[lc.states$count >= 50000 & lc.states$count < 100000] <- "50k-99k"
lc.states$bucket[lc.states$count >= 100000 & lc.states$count < 200000] <- "100k-199k"
lc.states$bucket[lc.states$count >= 200000] <- "200k+"
lc.states$bucket <- factor(lc.states$bucket,
                           levels = c("Under 5k",
                                      "5k-19k",
                                      "20k-49k",
                                      "50k-99k",
                                      "100k-199k",
                                      "200k+"))

# Create data frame with longitude and latitude from "fiftystater" package

us <- fifty_states

# Merge map and count data

us <- merge(us, lc.states, by = "id")

# Create a map

ggplot(us) + 
  geom_map(aes(map_id = id, fill = bucket), col = "grey20", size = .2, map = us) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_fill_brewer("", palette = "PuRd") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  ggtitle("Loans by State") +
  theme(legend.text = element_text(family = "Gill Sans MT"),
        legend.position = "bottom",
        panel.background = element_blank(),
        plot.title = element_text(family = "Gill Sans MT"))

3.12 Loan Volume over Time

Using the zoo package, I create a quarterly date field of issues loans from the “issue_d” variable.

df$issued.qtr <- as.yearqtr(as.character(df$issue_d), "%b-%Y") # as quarter

Loan funding grew steadily through Q1 2016, maxing out at just over $2 billion. But then funding dropped by around half a billion dollars in Q2 2016. Funding has stayed around that level through Q1 2017.

ggplot(df,aes(x = issued.qtr, y = funded_amnt/1000000)) +
  stat_summary(fun.y = sum, geom = "line", col = "dodgerblue3") +
  scale_y_continuous("Loans Funded ($mil)", labels = dollar) +
  scale_x_yearqtr("Quarter",
                  limits = c(min(df$issued.qtr),
                             max(df$issued.qtr)),
                  breaks = c(2007.25,
                             seq(2008,2017,1)),
                  format = "0%q-%y") +
  ggtitle("Loans Funded by Quarter") +
  theme_classic(base_family = "Gill Sans MT")

4 Predicting Defaults

Using a random forest model, I will predict defaults using loanee characteristics.

4.1 Recoding

I recode a few variables to insertion into the model.

df$term36 <- ifelse(df$term == " 36 months",1,0)
df$own <- ifelse(df$home_ownership == "MORTGAGE" |
                     df$home_ownership == "OWN",
                   1,0)
df$rate <- as.numeric(gsub("%","",df$int_rate))

4.2 Create Matured Loans Data Set

I subset the data to use only matured loans. Current & late loans have not been paid off or defaulted yet, so it makes no sense to include them in the model.

df.m <- subset(df, status == "Fully Paid" | status == "Default")
df.m$default <- ifelse(df.m$status == "Default", 1, 0)

4.3 Create Training and Testing Data

There are over 600k rows in the matured data set. We do not need all of those rows for the model. Additionally, we want to set aside some of the data so we can test the accuracy of the model. I randomly select 200k rows from the matured data to use as my training data set. This is the data used in the model. The test data will be used to assess the model’s accuracy.

set.seed(313)
df.m$rndm <- sample(1:nrow(df.m))
train <- subset(df.m, rndm <= 200000)
test <- subset(df.m, rndm > 200000)

4.4 Random Forest Model

I use a random forest model to predict defaults. The funded loan amount, FICO score, annual income, loan term, home ownership, interest rate, employment status, and debt-to-income ratio are used as predictors.

set.seed(313)
fit <- randomForest(as.factor(default) ~ 
                      funded_amnt +
                      fico_range_high +
                      annual_inc +
                      term36 +
                      own +
                      rate +
                      employ +
                      dti2,
                    data = train,
                    na.action = na.omit,
                    importance = TRUE,
                    ntree = 250)
varImpPlot(fit)

4.5 Model Success Summary

The model’s success is determined by the improvement in prediction it gives us over the modal category percentage. In this case, the modal category is “fully paid,” as 79.65% of loans have been paid off. If that’s all we knew, then we would predict that each loan would be paid off and we would be correct around 79.65% of the time. For the model to be a success, the percentage of cases predicted correctly will need to be higher than 79.65%.

In the code below, I apply the model results to the test data set and identify cases that are predicted correctly and incorrectly.

set.seed(313)
test$y_pred <- predict(fit, test)
test <- test[!is.na(test$y_pred),]
table(test$y_pred, test$default)
##    
##          0      1
##   0 345865  83702
##   1   4693   5800
test$correct <- NA
test$correct[test$y_pred == 0 & test$default == 0] <- "Correct"
test$correct[test$y_pred == 1 & test$default == 1] <- "Correct"
test$correct[test$y_pred == 0 & test$default == 1] <- "Wrong"
test$correct[test$y_pred == 1 & test$default == 0] <- "Wrong"
tbl <- prop.table(table(test$correct))
correct <- as.numeric(tbl)[1]
tbl2 <- prop.table(table(test$default))
real <- as.numeric(tbl2)[1]
diff <- correct - real
correct <- round(correct * 100, 2)
improve <- round((diff/real)*100, 2)

The model correctly predicts 79.91% of cases, which is an improvement of 0.32%. Given that the modal category is 79.65%, the model improves prediction only slightly.

This is a surprisingly poor result. I thought that factors like debt-to-income, annual income, and interest rate would be prime determinants of default. This tells me that Lending Club actually does a pretty good job of screening applicants. If these factors did strongly predict default, then that would mean that LC is knowingly putting investor capital at risk.

Of course, that’s not to say that default is completely random. In the future, I will dig a bit deeper to see if I can improve the model. I have two ideas on how to do this:

  1. Additional Variables - That data include some messier fields like job title and loan purpose. I will experiment with including some of these variables.

  2. Different Model - A random forest might not be the best model for this. I will try using a gradient boosted trees model.

print(paste("Correctly Predicted: ",correct,"%",sep=""))
## [1] "Correctly Predicted: 79.91%"
print(paste("Improvement: ",improve,"%",sep=""))
## [1] "Improvement: 0.32%"