Lending Club (LC) is a P2P lending company that allows investors to pick who they want to lend to. Since 2007 LC has posted their loan and rejected applicaiton information online. It shows data around funding, rates, merchant information, etc. LC posts loan data on it’s site at: https://www.lendingclub.com/info/download-data.action
With the data provide by LC we can see how certain loans are performing by looking at their loan status. Once funded, loans can be in one of six statuses: Fully Paid (the loan has been repaid in full), Current (the loan is still being paid), Default (the loan has not been paid in 121 days or more), Charged Off (the loan has been in Default for 30 days or more and no more future payments are expected), Late 16-30 days and Late 31 -120 days (the loan has not been current for 16-30 days or 31-120 days). LC also includes grades of their loans for funding from A to G, with A being best and G being worst.
This study will look at LC loans from 2014 to see what factors go into a loan grade and how loans perform based on grade. It will also look to see if there is any correlation between any comments made by the borrower (the desc field) and how their loan performs.
This study will use the following libraries:
ggplot2: For data visualization
dplyr: For tidying data
tm: For text mining
wordCloud: For text data visualization
library(ggplot2)
library(dplyr)
library(tm)
library(knitr)
library(RColorBrewer)
library(wordcloud)
library(kableExtra)
library(DT)The Lending Club loans from 2014 can be downloaded from here: https://www.lendingclub.com/info/download-data.action The file downloads as a CSV and can be read in R using read.csv. However looking at the orginal CSV we see that there are a lot of columns with no data. We can make the file smaller by getting rid of those blank columns using is.na.
loans2014 <- read.csv("loans2014.csv",header=TRUE, sep=",", stringsAsFactors=FALSE)
loans2014 <- as.data.frame(loans2014)
loans2014 <- loans2014[,colSums(is.na(loans2014))<nrow(loans2014)]
kable(head(loans2014,2))| id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | url | desc | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | acc_now_delinq | tot_coll_amt | tot_cur_bal | total_rev_hi_lim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38098114 | 40860827 | 15000 | 15000 | 15000 | 60 months | 12.39 | 336.64 | C | C1 | MANAGEMENT | 10+ years | RENT | 78000 | Source Verified | Dec-14 | Current | n | https://www.lendingclub.com/browse/loanDetail.action?loan_id=38098114 | debt_consolidation | Debt consolidation | 235xx | VA | 12.03 | 0 | Aug-94 | 0 | NA | NA | 6 | 0 | 138008 | 29.0 | 17 | w | 12690.60 | 12690.60 | 4028.00 | 4028.00 | 2309.40 | 1718.60 | 0 | 0 | 0 | Jan-16 | 336.64 | Feb-16 | Jan-16 | 0 | NA | 1 | INDIVIDUAL | 0 | 0 | 149140 | 184500 | |
| 36805548 | 39558264 | 10400 | 10400 | 10400 | 36 months | 6.99 | 321.08 | A | A3 | Truck Driver Delivery Personel | 8 years | MORTGAGE | 58000 | Not Verified | Dec-14 | Current | n | https://www.lendingclub.com/browse/loanDetail.action?loan_id=36805548 | credit_card | Credit card refinancing | 937xx | CA | 14.92 | 0 | Sep-89 | 2 | 42 | NA | 17 | 0 | 6133 | 31.6 | 36 | w | 7171.88 | 7171.88 | 3842.86 | 3842.86 | 3228.12 | 614.74 | 0 | 0 | 0 | Jan-16 | 321.08 | Jan-16 | Jan-16 | 0 | 59 | 1 | INDIVIDUAL | 0 | 0 | 162110 | 19400 |
Looking at the count we can see that there were 235,628 loans funded in 2014. When we bring the status into the picture we notice that the Current status has the most loans with 158,558 (63% of all 2014 loans). This makes sense as the terms of the LC loans are either 36 or 60 months. Bringing in term we see that 106,941 (45% of all 2014 loans) have a status of Current and have a 36 month term.
datatable(count(loans2014, loan_status))datatable(count(loans2014, loan_status, term))Knowing that most loans are in the Current status we now want to see the breakdown of loan grades for each status. Using group by and summarise we are able to get the counts by grade and state. We can see that there are most loans with a C grade and a Current state - 44,736 loans.
Using a boxplot we can also look at how the statuses vary for each grade. While we already know that the Current state will be the majority in each grade it is interesting to see that delinquent statuses like Charged Off and Late increase between grades C and E.
loans_grp <- loans2014 %>%
group_by(grade,loan_status) %>%
summarise(loans = n())
datatable(loans_grp)ggplot() + geom_bar(aes(y = loans, x = grade, fill = loan_status), data = loans_grp,stat="identity") LC has also provided us with detailed financial information about each loan. These details include the loan funded amount, the interest rate, the monthly payment, and borrowers’ annual income. We will use these factors to gather some insight on loan performance and grade.
Some loans have the description field filled out where borrowers can add comments for investors to see why they want money. We’ll make a seperate table for those loans to do some analysis on them later.
#Replace blank desc with NAs
loans2014$desc[loans2014$desc==""] <- NA
#Get rows with no NA in desc
loans2014desc <- subset <- loans2014[complete.cases(loans2014$desc), ] We will split out the loans by the statuses they are in. These will be all loans that have a Fully Paid status.
fully_paid <- data.frame(subset(loans2014, loan_status=='Fully Paid'))
fully_paid_desc <- data.frame(subset(loans2014desc, loan_status=='Fully Paid'))These will be the tables for all the loans that have Current status.
current <- data.frame(subset(loans2014, loan_status=="Current"))
current_desc <- data.frame(subset(loans2014desc, loan_status=="Current"))THese will be the tables for all loans that are either Charged Off, Default, or Late.
bad_status <- data.frame(subset(loans2014, loan_status !="Current" & loan_status != "Fully Paid"))
bad_status_desc <- data.frame(subset(loans2014desc, loan_status !="Current" & loan_status != "Fully Paid"))With each status broken out into their own tables we can see how the grades relate to the financial information and statuses. Overall we know that lower the grade of a loan the higher the interest rate and monthly payments will be. Other insights may vary by grade.
For the Fully Paid loans we see that most loans have a B grade. However it is interesting to see that even though the borrowers have the second highest average annual income of $80K, G loans are the least to be fully paid and have the highest funded amount. This shows us that G loans are least likely to be paid off in full but when they are they do give back more in return to their investors.
full_grade <- fully_paid %>%
group_by(grade)
full_grade <- summarise(
full_grade,
loans = n(),
mean_funded = round(mean(funded_amnt),2),
mean_interest = round(mean(int_rate),2),
mean_payment = round(mean(installment),2),
mean_inc = round(mean(annual_inc),2)
)
datatable(full_grade)ggplot() +
geom_bar(aes(y = loans, x = grade, fill = grade), data = full_grade,stat="identity") +
labs(title="Fully Paid Grades",x="Grade",y="Loans")As stated before most loans have a C grade and a Current status. Focusing on grade E loans we see that they place 5th out of 7th in number of Current loans but they have the 2nd highest funded amount. With an average interest of 20% and these loans still being paid, their investors should be making a good return on their investment.
current_grade <- current %>%
group_by(grade)
current_grade <- summarise(
current_grade,
loans = n(),
mean_funded = round(mean(funded_amnt),2),
mean_interest = round(mean(int_rate),2),
mean_payment = round(mean(installment),2),
mean_inc = round(mean(annual_inc),2)
)
datatable(current_grade)ggplot() +
geom_bar(aes(y = loans, x = grade, fill = grade), data = current_grade,stat="identity") +
labs(title="Current Grades",x="Grade",y="Loans") Of the 3 groups the delinquent group has the least amount of loans. It’s interesting to see that the most delinquent loans are also C graded. However comparing the C loans in the delinquent group to the ones in the Current group we see that delinquent ones have lower annual income and higher interest rates.
bad_grade <- bad_status %>%
group_by(grade)
bad_grade <- summarise(
bad_grade,
loans = n(),
mean_funded = round(mean(funded_amnt),2),
mean_interest = round(mean(int_rate),2),
mean_payment = round(mean(installment),2),
mean_inc = round(mean(annual_inc),2)
)
datatable(bad_grade)ggplot() +
geom_bar(aes(y = loans, x = grade, fill = grade), data = bad_grade,stat="identity") +
labs(title="Deliquent Grades",x="Grade",y="Loans") Using the desc tables we made earlier we’ll make word clouds for each status to see if there any trends.
stopwords <- c("Borrower","and", "the","have","for","all","like","will","this", "that", "would","want",
"with", "added", "on","/","<","br",">")
#Fully Paid
full_descriptions.corpus <- Corpus(DataframeSource(data.frame(fully_paid_desc[,20])))
full_descriptions.corpus <- tm_map(full_descriptions.corpus,removeWords, stopwords)
full_descriptions.corpus <- tm_map(full_descriptions.corpus,removeNumbers)
full_descriptions.corpus <- tm_map(full_descriptions.corpus, removePunctuation)
full_descriptions.corpus <- tm_map(full_descriptions.corpus, content_transformer(tolower))
wordcloud(full_descriptions.corpus,
max.words = 100,
random.order=FALSE,
rot.per=0.30,
use.r.layout=FALSE,
colors=brewer.pal(8, "Paired"))#Current
current_desc_corpus <- Corpus(DataframeSource(data.frame(current_desc[,20])))
current_desc_corpus <- tm_map(current_desc_corpus,removeWords, stopwords)
current_desc_corpus <- tm_map(current_desc_corpus,removeNumbers)
current_desc_corpus <- tm_map(current_desc_corpus, removePunctuation)
current_desc_corpus <- tm_map(current_desc_corpus, content_transformer(tolower))
wordcloud(current_desc_corpus,
max.words = 100,
random.order=FALSE,
rot.per=0.30,
use.r.layout=FALSE,
colors=brewer.pal(8, "Paired"))bad_desc_corpus <- Corpus(DataframeSource(data.frame(bad_status_desc[,20])))
bad_desc_corpus <- tm_map(bad_desc_corpus,removeWords, stopwords)
bad_desc_corpus <- tm_map(bad_desc_corpus,removeNumbers)
bad_desc_corpus <- tm_map(bad_desc_corpus, removePunctuation)
bad_desc_corpus <- tm_map(bad_desc_corpus, content_transformer(tolower))
wordcloud(bad_desc_corpus,
max.words = 100,
random.order=FALSE,
rot.per=0.30,
use.r.layout=FALSE,
colors=brewer.pal(8, "Paired"))Looking at all 3 clouds we see some common words like credit, dept, pay, and cards. Seeing as how borrowers are asking for funding it makes sense that these show up in the requests. However there a few words like need and back that show up in the delinquent loan group that do not show up in the other 2 groups. But this is a small percentage of a small group so it doesn’t mean that every borrower that says something like “I need this money” or “I promise to pay you back” will go delinquent.
Investors have a lot to think about when picking who they lend to on Lending Club. As we saw in the study picking a C grade loan could be a safe bet as most are still currently paying but also C grade loans have the highest number of delinquent loans. G grade loans do promise a higher return on your investment but so few get to be fully paid. Investors may also want to look at the unique reasons why someone may want a loan, but they must also take a borrower’s financial information into consideration.
While Lending Club does provide a wealth of information for analysis I do have a question of how often this data is updated. I chose 2014 because it gave enough information in the desc field to see how borrowers interacted with it. In later years that field is completely blank. If I was to do this project again I would like to have some more information so that I could predict the kind of customer that may default and charge off a loan.