Debabrata Kabiraj
May 13, 2019
Some months back, I applied for a Lending Club loan and was offered a different interest rate than my colleague. That made me think more about how lending rates are decided and which factors affect one’s ability to get the best loan rate.
Lending Club (LC), a San Francisco-based fintech company, works to facilitate peer-to-peer loans through their online lending platform. Started in 2007, their website allows individuals to publicly post loan applications, which other users can then browse and choose to fund. Company estimates place aggregate loan totals at over $15.98 billion through December 2015, making Lending Club the largest online loan platform in the world.
Simply put, LC is a US peer-to-peer lending company in which investors provide funding and borrowers repay it. Lending Club selects and approves borrowers using many parameters. It is a sort of eBay for loans.
Although mired in scandal (CEO Renaud Laplanche stepped down amidst dropping loan volumes and well-vetted accusations of shady accounting practices), Lending Club still provides an excellent machine learning case study. Luckily, their website makes its historical records publicly available, leading to the interesting question:
This project aims to predict the interest rate from various predictor variables. The analysis addresses the following questions.
Which parameters affect my interest rate? For example, can the loan interest rate be predicted from the FICO credit score alone?
Is the funded loan amount the same across different loan purposes? The answer helps a borrower judge what to expect for a particular loan type.
It is often said that the borrower’s state of residence plays an important role in the interest rate. This hypothesis will be tested.
There is a common belief that home ownership affects FICO scores. It will be tested with this dataset.
Did Lending Club receive an equal number of loans in each month?
Registered Lending Club users get access to borrower data through the Lending Club website.
This dataset contains borrower details (personal information is removed), including the funded amount, interest rate, FICO credit score, and about 150 variables in total. The Q1 2019 file has around 115K rows.
For the current analysis, I took a simple random sample of 10,000 rows. The data were then transformed and cleaned for analysis.
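A minimal sketch of how such a sample can be drawn in base R, assuming the full quarter is already in a data frame. The data frame `loans_full` and its columns are a small hypothetical stand-in, not the real Lending Club file; the seed is the one set in the loading code in this post.

```r
# Hypothetical stand-in for the full quarterly file (the real one has ~115K rows).
loans_full <- data.frame(
  id       = 1:115000,
  int_rate = rep(c(7.5, 11.2, 15.9, 19.4, 23.1), length.out = 115000)
)

set.seed(7340)      # fix the seed so the same rows are drawn on every run
n_sample <- 10000   # sample size used in this analysis

# Simple random sample without replacement: every row is equally likely.
sample_rows  <- sample(nrow(loans_full), size = n_sample, replace = FALSE)
loans_sample <- loans_full[sample_rows, ]

nrow(loans_sample)  # 10000
```

Sampling row indices rather than values keeps all columns of a row together, and `replace = FALSE` guarantees that no loan appears twice.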
Describe the method of data collection.
Anyone who registers as a Lending Club borrower or investor gets access to the borrower data through the Lending Club website.
LC also provides a loan rejection dataset. For the current analysis, however, only the borrower dataset is used, since the rejection dataset does not contain enough relevant data.
Download the Loan Statistics from the Lending Club web page via LendingClub’s API.
The data can also be accessed from R with the LendingClub package (LendingClub: A Lending Club API Wrapper).
source_url <- "https://api.lendingclub.com/api/investor/v1/loans/listing"
# request the loan listing; the Authorization header carries the account's API key
raw_data <- httr::GET(source_url, httr::add_headers(Authorization = "Vkqakl1lAygRyXRwlKCOyHWG4DE"))
# extract the response body as UTF-8 text and save it to disk
raw_data_content <- httr::content(raw_data, as = "text", encoding = "UTF-8")
write.csv(raw_data_content, "LoanStats_securev1_2019Q1.csv")
# load data of 2019 Q1
library(dplyr)      # select, mutate, filter, %>%
library(stringr)    # str_trim, str_replace
library(lubridate)  # dmy
set.seed(7340)
loans_summary_full <- read.csv("LoanStats_securev1_2019Q1.csv",
    header = TRUE, skip = 1, sep = ",", stringsAsFactors = FALSE, skipNul = TRUE) %>%
  # keep only the columns used in the analysis
  select(pub_rec_bankruptcies, loan_amnt, funded_amnt, funded_amnt_inv, term, int_rate, emp_length, home_ownership, annual_inc, dti, addr_state,
         fico_range_low, fico_range_high, installment, issue_d, loan_status, purpose, grade) %>%
  # FICO score as the midpoint of the reported range
  mutate(fico_score = (fico_range_low + fico_range_high) / 2) %>%
  # strip "months" from term and convert it to a number
  mutate(term1 = as.numeric(str_trim(str_replace(term, "months", "")))) %>%
  select(-fico_range_low, -fico_range_high) %>%
  mutate(emp_length = str_replace(emp_length, "n/a", "NA")) %>%
  # strip the "%" sign from the interest rate and convert it to a number
  mutate(int_rate = as.numeric(str_replace(as.character(int_rate), "%", ""))) %>%
  mutate(loan_amnt = as.integer(str_replace(as.character(loan_amnt), "%", ""))) %>%
  # issue_d holds month and year; prepend a day so dmy() can parse it
  mutate(issue_d = dmy(paste("01-", issue_d, sep = ""))) %>%
  # simple-interest estimate of the total repaid over the loan term
  mutate(total_pymnt = funded_amnt + (funded_amnt * int_rate * (term1 / 12)) / 100) %>%
  filter(!is.na(int_rate))
DT::datatable(loans_summary_full, options = list(pageLength = 5))
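To make the derived columns concrete, here is a small worked example in base R for a single loan. The values and their raw text formats are hypothetical, chosen only for illustration, and `sub()` and `trimws()` stand in for the stringr calls used in the pipeline.

```r
# One hypothetical loan, with term and rate stored as text (assumed formats).
term            <- " 36 months"
int_rate        <- "13.56%"
funded_amnt     <- 10000
fico_range_low  <- 700
fico_range_high <- 704

# Strip the text parts and convert to numbers.
term1    <- as.numeric(trimws(sub("months", "", term, fixed = TRUE)))  # 36
rate_pct <- as.numeric(sub("%", "", int_rate, fixed = TRUE))           # 13.56

# Midpoint of the reported FICO range.
fico_score <- (fico_range_low + fico_range_high) / 2                   # 702

# Simple-interest estimate: principal plus rate * years on the full principal.
total_pymnt <- funded_amnt + (funded_amnt * rate_pct * (term1 / 12)) / 100
total_pymnt                                                            # 14068
```

Note that `total_pymnt` charges interest on the full principal for the whole term. It is a simple-interest estimate, not an amortization schedule, so it will generally differ from the sum of the actual monthly installments.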