Sreejaya, Vuthy and Suman
12/11/2016
How safe is it to invest in these loans?
LendingClub is an online lending platform for loans. Borrowers apply for a loan online and, if accepted, the loan is listed in the marketplace. As an investor, you can browse the loans and choose to invest in individual loans at your discretion. In this project, we attempted to analyze the loan data and predict the risk of loans.
Loan Data, 2012-13 (https://resources.lendingclub.com/LoanStats3b.csv.zip)
Observations: 188,183
Variables: 111
First, select a high-level list of features based on domain understanding (for example: loan amount, interest rate, debt-to-income ratio, grade, emp_length, etc.).
Review matured loans (issue date + term months < today).
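The maturity rule above can be sketched as follows. This is a minimal illustration, not the project's own code: the helper names `add_months` and `is_matured`, the sample loans, and the choice to treat issue dates as the first day of the issue month are all assumptions made for the example.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that is `months` after d's month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def is_matured(issue_date: date, term_months: int, today: date) -> bool:
    """A loan counts as matured when issue date + term months < today."""
    return add_months(issue_date, term_months) < today


# Hypothetical sample loans (not taken from the LendingClub file).
loans = [
    {"id": "A", "issue_date": date(2012, 1, 1), "term_months": 36},
    {"id": "B", "issue_date": date(2013, 12, 1), "term_months": 60},
]

as_of = date(2016, 12, 11)  # the date on the title slide
matured = [
    loan["id"]
    for loan in loans
    if is_matured(loan["issue_date"], loan["term_months"], as_of)
]
print(matured)
```

Under this rule a 60-month loan issued in late 2013 is still open at the analysis date, so its final outcome is not yet known and it is left out.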
How do we label a loan as bad or good?
Further, we plan to eliminate numeric features whose distributions do not differ significantly between the bad and good populations.
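One way to put a number on "how different are the two distributions" is a standardized mean difference. This is a swapped-in illustrative statistic, not necessarily what was used in the project (which looked at the distributions themselves); the function name and the sample values below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_mean_difference(good, bad):
    """Difference of group means divided by a pooled standard deviation."""
    pooled_sd = sqrt((stdev(good) ** 2 + stdev(bad) ** 2) / 2)
    return (mean(bad) - mean(good)) / pooled_sd


# Hypothetical feature values for good and bad loans.
dti_good = [10, 12, 14, 16, 18]
dti_bad = [20, 22, 24, 26, 28]
amount_good = [5000, 8000, 10000, 12000, 15000]
amount_bad = [5000, 8000, 10000, 12000, 15000]

smd_dti = standardized_mean_difference(dti_good, dti_bad)
smd_amount = standardized_mean_difference(amount_good, amount_bad)
print(round(smd_dti, 2), round(smd_amount, 2))
```

A feature whose value is close to zero (like the made-up loan amounts here) separates the two populations poorly and would be a candidate for removal.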
We noticed variables with a high VIF (variance inflation factor) here:
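For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors; equivalently, it is the j-th diagonal entry of the inverse correlation matrix. The sketch below uses that second form with only the standard library. The column names and values are made up for illustration, and the threshold of 10 is a common rule of thumb, not a figure from this project.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Hypothetical predictors: x2 is almost exactly 2 * x1, x3 is unrelated.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "x3": [3, 7, 1, 8, 2, 6, 4, 5],
}
scores = vif(data)
high_vif = [name for name, value in scores.items() if value > 10]
print(high_vif)
```

The two nearly collinear columns get a very large VIF while the unrelated one stays small, which is the pattern that motivates dropping one variable from each collinear group.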
With 500 trees, the random forest reports this Gini index, a measure of how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.
It shows that dti and annual income are high-importance variables; interestingly, the issue date also appears as an important variable.
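To make the Gini measure concrete: a node's Gini impurity is 1 minus the sum of squared class proportions, and a split is credited with the impurity it removes. A forest's Gini-based importance for a variable aggregates these decreases over the splits that use it. The sketch below shows a single split only; the labels and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity removed by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical node with 4 bad and 4 good loans, split on some variable.
parent = ["bad"] * 4 + ["good"] * 4
left = ["bad"] * 3 + ["good"] * 1
right = ["bad"] * 1 + ["good"] * 3

decrease = gini_decrease(parent, left, right)
print(decrease)
```

A variable that often produces splits with large decreases, so that child nodes are purer than their parent, ends up with a high importance score.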
We then rebuilt the logistic model using the important variables from the random forest.
Except for emp_length, all predictors are significant here.
AUC for both logistic models is around 0.55.
AUC for the RF with selected variables is 0.91, but with all variables included it is 0.99.
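As a reminder of what these AUC figures measure: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 is chance level and 1.0 is perfect ranking. The sketch below computes it directly from that pairwise definition; the scores are invented and the function name is hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of the positive class.
scores_for_positives = [0.9, 0.8, 0.4]
scores_for_negatives = [0.7, 0.3, 0.2]

value = auc(scores_for_positives, scores_for_negatives)
print(round(value, 3))
```

Read this way, an AUC near 0.55 means the logistic models rank loans only slightly better than chance.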
Random Forest Model 3, with all variables included, performed best here. Its AUC [area under the curve of P(predicted TRUE | actual TRUE) vs. P(predicted FALSE | actual FALSE)] and its accuracy [P(TRUE|TRUE)·P(actual TRUE) + P(FALSE|FALSE)·P(actual FALSE)] are both higher than those of the other models.
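The accuracy expression above can be checked numerically: sensitivity weighted by the share of actual TRUE cases plus specificity weighted by the share of actual FALSE cases equals the plain fraction of correct predictions. The labels below are invented for the check and the function name is hypothetical.

```python
def rates(actual, predicted):
    """Return sensitivity, specificity, prevalence and direct accuracy."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    sensitivity = tp / (tp + fn)        # P(predicted TRUE | actual TRUE)
    specificity = tn / (tn + fp)        # P(predicted FALSE | actual FALSE)
    prevalence = (tp + fn) / len(pairs)  # P(actual TRUE)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, prevalence, accuracy


# Hypothetical labels: 4 actual TRUE cases followed by 6 actual FALSE cases.
actual = [True] * 4 + [False] * 6
predicted = [True, True, True, False, False, False, False, False, True, True]

sens, spec, prev, acc = rates(actual, predicted)
decomposed = sens * prev + spec * (1 - prev)
print(round(acc, 3), round(decomposed, 3))
```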