The purpose of this project is to explore data from Lending Club (LC), a peer-to-peer lending site. LC has made data from funded loans publicly available in CSV files. The data are incredibly rich, and public personal financial data are rare, which makes the exploration interesting in its own right. Additionally, a model predicting defaults could help guide the direction of a lending portfolio on the site.
LC provides data on funded loans from 2007 through (at the time of writing) Q1 2017. The data can be downloaded at https://www.lendingclub.com/info/download-data.action and are split over several files. Each file has the same columns, so merging is uncomplicated; I have already combined everything into a single R data file.
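For reference, the merge can be done in a few lines of base R. This is a minimal sketch rather than my exact script: the directory name is hypothetical, and I assume each CSV carries LC's one-line preamble above the column header (hence skip = 1).
# Read every LC CSV in a (hypothetical) folder and stack them;
# rbind works because each file shares the same columns.
files <- list.files("lc-data", pattern = "\\.csv$", full.names = TRUE)
df <- do.call(rbind, lapply(files, read.csv, skip = 1, stringsAsFactors = FALSE))
save(df, file = "lending club.RData")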
I start by loading the merged data and the necessary R packages. I also remove a small number of rows that are missing loan amounts and other critical data points.
load(file = "lending club.RData")
df <- df[!is.na(df$funded_amnt),] # remove rows missing the funded amount
library(zoo)
library(ggplot2)
library(scales)
library(extrafont)
library(fiftystater)
library(mapproj)
library(randomForest)

There are 128 variables and over 1.4 million rows in the data.
dim(df)
## [1] 1418626     128
Loan amounts range from $500 to $40,000, with a median funded amount of $12,800 (the mean is $14,747).
summary(df$funded_amnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     500    8000   12800   14747   20000   40000
Loan terms are either 36 or 60 months; the vast majority are 36 months.
table(df$term)
##
##  36 months  60 months
##    1017074     401552
Interest rates range between 5.32% and 30.99%. The median interest rate is 12.79%.
summary(as.numeric(gsub("%","",df$int_rate)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.32    9.75   12.79   13.19   15.88   30.99
A FICO score is an indication of a loanee's credit quality. LC's loanees' scores range from 614 to 850, with a median score of 694. According to Experian, scores in the 670-739 range are "good," while scores below 670 are considered subprime. Given a first-quartile score of 674, somewhat less than a quarter of LC's loans are subprime, which is still a significant share.
summary(df$fico_range_high)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   614.0   674.0   694.0   699.2   714.0   850.0
df$subprime <- ifelse(df$fico_range_high < 670, "Yes", "No")
ggplot(df, aes(x = factor(subprime))) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
scale_y_continuous(labels = percent) +
labs(x = "Subprime", y = "") +
ggtitle("Subprime") +
theme_classic(base_family = "Gill Sans MT")

Loanees' employment length is reported in yearly increments up to "10+ years." I assume that those with an "n/a" are unemployed. I condense this variable into buckets for the prediction model later. Most loanees have been employed for at least five years.
table(df$emp_length)
##
##  < 1 year    1 year 10+ years   2 years   3 years   4 years   5 years
##    109477     92672    474931    127526    112942     84077     88226
##   6 years   7 years   8 years   9 years       n/a
##     65856     62024     65930     55124     79841
df$employ <- NA
df$employ[df$emp_length == "n/a"] <- "Unemployed"
df$employ[df$emp_length == "< 1 year"] <- "Less than 2 years"
df$employ[df$emp_length == "1 year"] <- "Less than 2 years"
df$employ[df$emp_length == "2 years"] <- "2-4 years"
df$employ[df$emp_length == "3 years"] <- "2-4 years"
df$employ[df$emp_length == "4 years"] <- "2-4 years"
df$employ[df$emp_length == "5 years"] <- "5-9 years"
df$employ[df$emp_length == "6 years"] <- "5-9 years"
df$employ[df$emp_length == "7 years"] <- "5-9 years"
df$employ[df$emp_length == "8 years"] <- "5-9 years"
df$employ[df$emp_length == "9 years"] <- "5-9 years"
df$employ[df$emp_length == "10+ years"] <- "10+ years"
df$employ <- factor(df$employ,
levels = c("Unemployed",
"Less than 2 years",
"2-4 years",
"5-9 years",
"10+ years"))
ggplot(df, aes(x = factor(employ))) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
scale_y_continuous(labels = percent) +
labs(x = "Employment Status", y = "") +
ggtitle("Employment Status") +
theme_classic(base_family = "Gill Sans MT")

The median annual income is $65,000.
summary(df$annual_inc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.     Max.    NA's
##       0   46000   65000   76923   91000 61000000       4
The median debt-to-income ratio is 17.84%. Some observations are miscoded as negative or carry the 9999 placeholder for missing, so I create a new dti variable (dti2) that recodes these values as NA.
summary(df$dti)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -1.00   12.05   17.84   19.09   24.24 9999.00
df$dti2 <- df$dti
df$dti2[df$dti < 0 | df$dti == 9999] <- NA
summary(df$dti2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00   12.05   17.84   18.40   24.24 8581.25     100
Loan grades are assigned by LC based on applicants’ background, including credit history. These loan grades determine the loan’s interest rate.
ggplot(df, aes(x = factor(sub_grade))) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
scale_y_continuous(labels = percent) +
labs(x = "Loan Grade", y = "") +
ggtitle("Loan Grade") +
theme_classic(base_family = "Gill Sans MT")

Just over half of LC's loans in the data are current, meaning they are still in the process of being paid back. Looking at matured loans only, around 20% have defaulted.
table(df$loan_status)
##
## Charged Off
## 129458
## Current
## 741937
## Default
## 24
## Does not meet the credit policy. Status:Charged Off
## 761
## Does not meet the credit policy. Status:Fully Paid
## 1988
## Fully Paid
## 507837
## In Grace Period
## 12762
## Late (16-30 days)
## 4121
## Late (31-120 days)
## 19738
df$status <- gsub("Does not meet the credit policy. Status:", "", df$loan_status)
df$status[df$status == "Charged Off"] <- "Default"
df$status[df$status == "In Grace Period"] <- "Late"
df$status[df$status == "Late (16-30 days)"] <- "Late"
df$status[df$status == "Late (31-120 days)"] <- "Late"
ggplot(df, aes(x = factor(status))) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
scale_y_continuous(labels = percent) +
labs(x = "Loan Status", y = "") +
ggtitle("Loan Status") +
theme_classic(base_family = "Gill Sans MT")

ggplot(aes(x = factor(status)), data = subset(df, status != "Current" & status != "Late")) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue3") +
scale_y_continuous(limits = c(0,1),
labels = percent) +
labs(x = "Loan Status", y = "") +
ggtitle("Loan Status (Only Matured Loans)") +
theme_classic(base_family = "Gill Sans MT")

States with the most loan volume are also the most populous states. California has the most (over 200,000 loans). Interestingly, Iowa has originated only 14 funded loans through LC since 2007; by comparison, the less populous Wyoming has originated over 3,000. Iowa's total looks like a mistake in the data, but there is no apparent way to double-check.
# Create data set with number of loans per state
lc.states <- data.frame(table(df$addr_state))
names(lc.states) <- c("state","count")
lc.states$id <- tolower(state.name[match(lc.states$state,state.abb)])
lc.states$id[lc.states$state == "DC"] <- "district of columbia"
lc.states[c("state","count")]## state count
## 1 AK 3451
## 2 AL 17656
## 3 AR 10682
## 4 AZ 33069
## 5 CA 200780
## 6 CO 29723
## 7 CT 21979
## 8 DC 3608
## 9 DE 4035
## 10 FL 99320
## 11 GA 46478
## 12 HI 7112
## 13 IA 14
## 14 ID 1516
## 15 IL 57001
## 16 IN 23061
## 17 KS 12283
## 18 KY 13648
## 19 LA 16658
## 20 MA 32637
## 21 MD 33521
## 22 ME 2240
## 23 MI 37075
## 24 MN 25417
## 25 MO 22736
## 26 MS 7306
## 27 MT 4013
## 28 NC 39661
## 29 ND 1746
## 30 NE 3750
## 31 NH 6927
## 32 NJ 52656
## 33 NM 7690
## 34 NV 19965
## 35 NY 117632
## 36 OH 47763
## 37 OK 13023
## 38 OR 16810
## 39 PA 49190
## 40 RI 6244
## 41 SC 17304
## 42 SD 2879
## 43 TN 21728
## 44 TX 116506
## 45 UT 9707
## 46 VA 40620
## 47 VT 2978
## 48 WA 29893
## 49 WI 18631
## 50 WV 5156
## 51 WY 3148
# Put the counts into buckets
lc.states$bucket <- NA
lc.states$bucket[lc.states$count < 5000] <- "Under 5k"
lc.states$bucket[lc.states$count >= 5000 & lc.states$count < 20000] <- "5k-19k"
lc.states$bucket[lc.states$count >= 20000 & lc.states$count < 50000] <- "20k-49k"
lc.states$bucket[lc.states$count >= 50000 & lc.states$count < 100000] <- "50k-99k"
lc.states$bucket[lc.states$count >= 100000 & lc.states$count < 200000] <- "100k-199k"
lc.states$bucket[lc.states$count >= 200000] <- "200k+"
lc.states$bucket <- factor(lc.states$bucket,
levels = c("Under 5k",
"5k-19k",
"20k-49k",
"50k-99k",
"100k-199k",
"200k+"))
# Create data frame with longitude and latitude from "fiftystater" package
us <- fifty_states
# Merge map and count data
us <- merge(us, lc.states, by = "id")
# Create a map
ggplot(us) +
geom_map(aes(map_id = id, fill = bucket), col = "grey20", size = .2, map = us) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map() +
scale_fill_brewer("", palette = "PuRd") +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
ggtitle("Loans by State") +
theme(legend.text = element_text(family = "Gill Sans MT"),
legend.position = "bottom",
panel.background = element_blank(),
plot.title = element_text(family = "Gill Sans MT"))

Using the zoo package, I create a quarterly field of loan issue dates from the "issue_d" variable.
df$issued.qtr <- as.yearqtr(as.character(df$issue_d), "%b-%Y") # convert to quarter

Loan funding grew steadily through Q1 2016, peaking at just over $2 billion. Funding then dropped by around half a billion dollars in Q2 2016 and has stayed near that level through Q1 2017.
ggplot(df,aes(x = issued.qtr, y = funded_amnt/1000000)) +
stat_summary(fun.y = sum, geom = "line", col = "dodgerblue3") +
scale_y_continuous("Loans Funded ($mil)", labels = dollar) +
scale_x_yearqtr("Quarter",
limits = c(min(df$issued.qtr),
max(df$issued.qtr)),
breaks = c(2007.25,
seq(2008,2017,1)),
format = "0%q-%y") +
ggtitle("Loans Funded by Quarter") +
theme_classic(base_family = "Gill Sans MT")

Using a random forest model, I will predict defaults from loanee characteristics.
I recode a few variables for insertion into the model.
df$term36 <- ifelse(df$term == " 36 months",1,0)
df$own <- ifelse(df$home_ownership == "MORTGAGE" |
df$home_ownership == "OWN",
1,0)
df$rate <- as.numeric(gsub("%","",df$int_rate))

I subset the data to use only matured loans. Current and late loans have neither been paid off nor defaulted yet, so it makes no sense to include them in the model.
df.m <- subset(df, status == "Fully Paid" | status == "Default")
df.m$default <- ifelse(df.m$status == "Default", 1, 0)

There are over 600k rows in the matured data set, more than we need for the model, and we also want to set aside some data to test the model's accuracy. I randomly select 200k rows from the matured data as the training set, which is the data used to fit the model; the remaining rows form the test set used to assess accuracy.
set.seed(313)
df.m$rndm <- sample(1:nrow(df.m))
train <- subset(df.m, rndm <= 200000)
test <- subset(df.m, rndm > 200000)

I use a random forest model to predict defaults. The funded loan amount, FICO score, annual income, loan term, home ownership, interest rate, employment status, and debt-to-income ratio are used as predictors.
set.seed(313)
fit <- randomForest(as.factor(default) ~
funded_amnt +
fico_range_high +
annual_inc +
term36 +
own +
rate +
employ +
dti2,
data = train,
na.action = na.omit,
importance = TRUE,
ntree = 250)
varImpPlot(fit)

The model's success is judged by the improvement in prediction it gives us over the modal category percentage. In this case the modal category is "fully paid," as 79.65% of matured loans have been paid off. If that were all we knew, we would predict that every loan gets paid off and be correct about 79.65% of the time. For the model to be a success, it needs to predict correctly more often than that.
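As a quick sanity check, that baseline can be computed directly from the matured data; this sketch should reproduce the 79.65% figure quoted above.
# Share of matured loans in each status; "Fully Paid" is the modal category
round(prop.table(table(df.m$status)) * 100, 2)
##    Default Fully Paid
##      20.35      79.65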
In the code below, I apply the model results to the test data set and identify cases that are predicted correctly and incorrectly.
set.seed(313)
test$y_pred <- predict(fit, test)
test <- test[!is.na(test$y_pred),]
table(test$y_pred, test$default)
##
##          0      1
##   0 345865  83702
##   1   4693   5800
test$correct <- NA
test$correct[test$y_pred == 0 & test$default == 0] <- "Correct"
test$correct[test$y_pred == 1 & test$default == 1] <- "Correct"
test$correct[test$y_pred == 0 & test$default == 1] <- "Wrong"
test$correct[test$y_pred == 1 & test$default == 0] <- "Wrong"
tbl <- prop.table(table(test$correct))
correct <- as.numeric(tbl)[1]
tbl2 <- prop.table(table(test$default))
real <- as.numeric(tbl2)[1]
diff <- correct - real
correct <- round(correct * 100, 2)
improve <- round((diff/real)*100, 2)

The model correctly predicts 79.91% of cases, a relative improvement of 0.32% over the baseline. Given that the modal category alone covers 79.65% of cases, the model improves prediction only slightly.
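The confusion matrix above shows why the gain is so small: the model almost always predicts "fully paid." Using the counts from that table:
# Of the 83702 + 5800 actual defaults in the test set, only 5800 are flagged
recall.default <- 5800 / (83702 + 5800)
round(recall.default * 100, 2)
## [1] 6.48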
This is a surprisingly poor result. I thought that factors like debt-to-income, annual income, and interest rate would be prime determinants of default. This tells me that Lending Club actually does a pretty good job of screening applicants. If these factors did strongly predict default, then that would mean that LC is knowingly putting investor capital at risk.
Of course, that’s not to say that default is completely random. In the future, I will dig a bit deeper to see if I can improve the model. I have two ideas on how to do this:
Additional Variables - The data include some messier fields like job title and loan purpose. I will experiment with including some of these variables.
Different Model - A random forest might not be the best model for this problem. I will try a gradient boosted trees model (a rough sketch appears at the end of this post).
print(paste("Correctly Predicted: ",correct,"%",sep=""))## [1] "Correctly Predicted: 79.91%"
print(paste("Improvement: ",improve,"%",sep=""))## [1] "Improvement: 0.32%"