Introduciton

Lending Club (LC) is a P2P lending company that allows investors to pick who they want to lend to. Since 2007 LC has posted their loan and rejected applicaiton information online. It shows data around funding, rates, merchant information, etc. LC posts loan data on it’s site at: https://www.lendingclub.com/info/download-data.action

Study

With the data provide by LC we can see how certain loans are performing by looking at their loan status. Once funded, loans can be in one of six statuses: Fully Paid (the loan has been repaid in full), Current (the loan is still being paid), Default (the loan has not been paid in 121 days or more), Charged Off (the loan has been in Default for 30 days or more and no more future payments are expected), Late 16-30 days and Late 31 -120 days (the loan has not been current for 16-30 days or 31-120 days). LC also includes grades of their loans for funding from A to G, with A being best and G being worst.

This study will look at LC loans from 2014 to see what factors go into a loan grade and how loans perform based on grade. It will also look to see if there is any correlation between any comments made by the borrower (the desc field) and how their loan performs.

Load Libraries

This study will use the following libraries:

ggplot2: For data visualization
dplyr: For tidying data
tm: For text mining
wordCloud: For text data visualization

library(ggplot2)
library(dplyr)
library(tm)
library(knitr)
library(RColorBrewer)
library(wordcloud)
library(kableExtra)
library(DT)

Load LC Loans from 2014

The Lending Club loans from 2014 can be downloaded from here: https://www.lendingclub.com/info/download-data.action The file downloads as a CSV and can be read in R using read.csv. However looking at the orginal CSV we see that there are a lot of columns with no data. We can make the file smaller by getting rid of those blank columns using is.na.

loans2014 <- read.csv("loans2014.csv",header=TRUE, sep=",", stringsAsFactors=FALSE)

loans2014 <- as.data.frame(loans2014) 

loans2014 <- loans2014[,colSums(is.na(loans2014))<nrow(loans2014)]

kable(head(loans2014,2))
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type acc_now_delinq tot_coll_amt tot_cur_bal total_rev_hi_lim
38098114 40860827 15000 15000 15000 60 months 12.39 336.64 C C1 MANAGEMENT 10+ years RENT 78000 Source Verified Dec-14 Current n https://www.lendingclub.com/browse/loanDetail.action?loan_id=38098114 debt_consolidation Debt consolidation 235xx VA 12.03 0 Aug-94 0 NA NA 6 0 138008 29.0 17 w 12690.60 12690.60 4028.00 4028.00 2309.40 1718.60 0 0 0 Jan-16 336.64 Feb-16 Jan-16 0 NA 1 INDIVIDUAL 0 0 149140 184500
36805548 39558264 10400 10400 10400 36 months 6.99 321.08 A A3 Truck Driver Delivery Personel 8 years MORTGAGE 58000 Not Verified Dec-14 Current n https://www.lendingclub.com/browse/loanDetail.action?loan_id=36805548 credit_card Credit card refinancing 937xx CA 14.92 0 Sep-89 2 42 NA 17 0 6133 31.6 36 w 7171.88 7171.88 3842.86 3842.86 3228.12 614.74 0 0 0 Jan-16 321.08 Jan-16 Jan-16 0 59 1 INDIVIDUAL 0 0 162110 19400

Loans by Status and Grade

Status Insights

Looking at the count we can see that there were 235,628 loans funded in 2014. When we bring the status into the picture we notice that the Current status has the most loans with 158,558 (63% of all 2014 loans). This makes sense as the terms of the LC loans are either 36 or 60 months. Bringing in term we see that 106,941 (45% of all 2014 loans) have a status of Current and have a 36 month term.

datatable(count(loans2014, loan_status))
datatable(count(loans2014, loan_status, term))

Grade Insights

Knowing that most loans are in the Current status we now want to see the breakdown of loan grades for each status. Using group by and summarise we are able to get the counts by grade and state. We can see that there are most loans with a C grade and a Current state - 44,736 loans.

Using a boxplot we can also look at how the statuses vary for each grade. While we already know that the Current state will be the majority in each grade it is interesting to see that delinquent statuses like Charged Off and Late increase between grades C and E.

loans_grp <- loans2014 %>% 
  group_by(grade,loan_status) %>% 
  summarise(loans = n())

datatable(loans_grp)
ggplot() + geom_bar(aes(y = loans, x = grade, fill = loan_status), data = loans_grp,stat="identity") 

Loan Financials

LC has also provided us with detailed financial information about each loan. These details include the loan funded amount, the interest rate, the monthly payment, and borrowers’ annual income. We will use these factors to gather some insight on loan performance and grade.

Make Description Table

Some loans have the description field filled out where borrowers can add comments for investors to see why they want money. We’ll make a seperate table for those loans to do some analysis on them later.

#Replace blank desc with NAs
loans2014$desc[loans2014$desc==""] <- NA

#Get rows with no NA in desc
loans2014desc <- subset <-  loans2014[complete.cases(loans2014$desc), ] 

Make Fully Paid Tables

We will split out the loans by the statuses they are in. These will be all loans that have a Fully Paid status.

fully_paid <- data.frame(subset(loans2014, loan_status=='Fully Paid'))
fully_paid_desc <- data.frame(subset(loans2014desc, loan_status=='Fully Paid'))

Make Current Tables

These will be the tables for all the loans that have Current status.

current <- data.frame(subset(loans2014, loan_status=="Current"))
current_desc <- data.frame(subset(loans2014desc, loan_status=="Current"))

Make Delinquent Tables

THese will be the tables for all loans that are either Charged Off, Default, or Late.

bad_status <- data.frame(subset(loans2014, loan_status !="Current" & loan_status != "Fully Paid"))
bad_status_desc <- data.frame(subset(loans2014desc, loan_status !="Current" & loan_status != "Fully Paid"))

Financial Insights

With each status broken out into their own tables we can see how the grades relate to the financial information and statuses. Overall we know that lower the grade of a loan the higher the interest rate and monthly payments will be. Other insights may vary by grade.

Fully Paid Loans

For the Fully Paid loans we see that most loans have a B grade. However it is interesting to see that even though the borrowers have the second highest average annual income of $80K, G loans are the least to be fully paid and have the highest funded amount. This shows us that G loans are least likely to be paid off in full but when they are they do give back more in return to their investors.

full_grade <- fully_paid %>%
  group_by(grade)

full_grade <- summarise(
  full_grade,
  loans = n(),
  mean_funded = round(mean(funded_amnt),2), 
  mean_interest = round(mean(int_rate),2),
  mean_payment = round(mean(installment),2),
  mean_inc = round(mean(annual_inc),2)
)


datatable(full_grade)
ggplot() +
  geom_bar(aes(y = loans, x = grade, fill = grade), data = full_grade,stat="identity") +
  labs(title="Fully Paid Grades",x="Grade",y="Loans")

Current Loans

As stated before most loans have a C grade and a Current status. Focusing on grade E loans we see that they place 5th out of 7th in number of Current loans but they have the 2nd highest funded amount. With an average interest of 20% and these loans still being paid, their investors should be making a good return on their investment.

current_grade <- current %>%
  group_by(grade)

current_grade <- summarise(
  current_grade,
  loans = n(),
  mean_funded = round(mean(funded_amnt),2), 
  mean_interest = round(mean(int_rate),2),
  mean_payment = round(mean(installment),2),
  mean_inc = round(mean(annual_inc),2)
)

datatable(current_grade)
ggplot() +
  geom_bar(aes(y = loans, x = grade, fill = grade), data = current_grade,stat="identity") +
  labs(title="Current Grades",x="Grade",y="Loans") 

Delinquent Loans

Of the 3 groups the delinquent group has the least amount of loans. It’s interesting to see that the most delinquent loans are also C graded. However comparing the C loans in the delinquent group to the ones in the Current group we see that delinquent ones have lower annual income and higher interest rates.

bad_grade <- bad_status %>%
  group_by(grade)

bad_grade <- summarise(
  bad_grade,
  loans = n(),
  mean_funded = round(mean(funded_amnt),2), 
  mean_interest = round(mean(int_rate),2),
  mean_payment = round(mean(installment),2),
  mean_inc = round(mean(annual_inc),2)
)

datatable(bad_grade)
ggplot() +
  geom_bar(aes(y = loans, x = grade, fill = grade), data = bad_grade,stat="identity") +
  labs(title="Deliquent Grades",x="Grade",y="Loans") 

Text Analsis

Using the desc tables we made earlier we’ll make word clouds for each status to see if there any trends.

Fully Paid Loans World Cloud

stopwords <- c("Borrower","and", "the","have","for","all","like","will","this", "that", "would","want",
               "with", "added", "on","/","<","br",">")

#Fully Paid
full_descriptions.corpus <- Corpus(DataframeSource(data.frame(fully_paid_desc[,20])))
full_descriptions.corpus <- tm_map(full_descriptions.corpus,removeWords, stopwords)
full_descriptions.corpus <- tm_map(full_descriptions.corpus,removeNumbers)
full_descriptions.corpus <- tm_map(full_descriptions.corpus, removePunctuation)
full_descriptions.corpus <- tm_map(full_descriptions.corpus, content_transformer(tolower))

wordcloud(full_descriptions.corpus,
          max.words = 100,
          random.order=FALSE, 
          rot.per=0.30, 
          use.r.layout=FALSE, 
          colors=brewer.pal(8, "Paired"))

Current Loans World Cloud

#Current
current_desc_corpus <- Corpus(DataframeSource(data.frame(current_desc[,20])))
current_desc_corpus <- tm_map(current_desc_corpus,removeWords, stopwords)
current_desc_corpus <- tm_map(current_desc_corpus,removeNumbers)
current_desc_corpus <- tm_map(current_desc_corpus, removePunctuation)
current_desc_corpus <- tm_map(current_desc_corpus, content_transformer(tolower))

wordcloud(current_desc_corpus,
          max.words = 100,
          random.order=FALSE, 
          rot.per=0.30, 
          use.r.layout=FALSE, 
          colors=brewer.pal(8, "Paired"))

Delinquent Loans Word Cloud

bad_desc_corpus <- Corpus(DataframeSource(data.frame(bad_status_desc[,20])))
bad_desc_corpus <- tm_map(bad_desc_corpus,removeWords, stopwords)
bad_desc_corpus <- tm_map(bad_desc_corpus,removeNumbers)
bad_desc_corpus <- tm_map(bad_desc_corpus, removePunctuation)
bad_desc_corpus <- tm_map(bad_desc_corpus, content_transformer(tolower))

wordcloud(bad_desc_corpus,
          max.words = 100,
          random.order=FALSE, 
          rot.per=0.30, 
          use.r.layout=FALSE, 
          colors=brewer.pal(8, "Paired"))

Looking at all 3 clouds we see some common words like credit, dept, pay, and cards. Seeing as how borrowers are asking for funding it makes sense that these show up in the requests. However there a few words like need and back that show up in the delinquent loan group that do not show up in the other 2 groups. But this is a small percentage of a small group so it doesn’t mean that every borrower that says something like “I need this money” or “I promise to pay you back” will go delinquent.

Conclusion

Investors have a lot to think about when picking who they lend to on Lending Club. As we saw in the study picking a C grade loan could be a safe bet as most are still currently paying but also C grade loans have the highest number of delinquent loans. G grade loans do promise a higher return on your investment but so few get to be fully paid. Investors may also want to look at the unique reasons why someone may want a loan, but they must also take a borrower’s financial information into consideration.

While Lending Club does provide a wealth of information for analysis I do have a question of how often this data is updated. I chose 2014 because it gave enough information in the desc field to see how borrowers interacted with it. In later years that field is completely blank. If I was to do this project again I would like to have some more information so that I could predict the kind of customer that may default and charge off a loan.