There are a total of 100000 rows and 144 variables in our data and about 13.79% of loans are charged off and 86.22% of loans are fully paid.
Lending Club categorizes borrowers into seven different loan grades: A through G. Within each loan grade there are five sub-grades from A1 to G5. A borrower is graded depending on many factors like credit report, borrowing amount, debt-income ratio…etc. The chance of loan getting approved will be decreasing from A1 thru G5.
##
## A B C D E F G
## Charged Off 1187 3723 4738 2858 1010 239 30
## Fully Paid 21401 30184 21907 9635 2569 469 50
We observe that not many people have grade A, Since credit score is one of the factors effecting the grade and many people didnt have a good credit score back in 2013-2015 it could be one of the reasons for higher people having grades B and C (as per FICO data https://www.fico.com/blogs/average-us-fico-score-ticks-706)
There is a relationship between interest rate and grades which is expected - lower the grade higher the interest rate.
Now we will calculate the annual returns of investors based on the actual term. For a fully paid loan actual term is obtained by subtracting the loan issue date and the last payment date. But in case of a charged off loan last payment date is not available for a few observations in the data. Since all the loans have either a 36- or 60-month term, with fixed interest rates and equal payments. We set 3 years as the actual term for charged off loans.
There are few positive annual returns for charged off loans. This could be because they charged off after paying few installments. For example, a borrower has installment of $188.66 at int_rate of 12.99 with 36 months term. He paid $6114.48 and then charged off on the interest the needs to be paid. So, there are few returns in case of this borrower.
We can see that the annual returns rate % is almost same for grade A,B and slightly decreases as the grade lowers to G. Also the mean return % is having a positive relationship with interest rate. If I were the investor I would probably look for grade B,C loans to invest since they have low default rate and high returns.
The above table shows the number of loans, defaults, average Interest, average loan amount, and average payment for all the purposes.
##
## A B C D E F G
## car 253 306 238 92 27 8 4
## credit_card 8349 9809 5008 1518 266 37 2
## debt_consolidation 11573 19745 16497 7534 1954 292 27
## home_improvement 1457 1777 1496 673 215 33 3
## house 37 74 83 74 48 27 11
## major_purchase 441 553 479 252 70 26 2
## medical 84 270 382 251 97 34 1
## moving 10 96 207 234 108 32 4
## other 324 1036 1702 1321 551 139 18
## renewable_energy 3 5 22 18 8 2 0
## small_business 15 100 249 300 159 62 8
## vacation 42 127 257 180 59 13 0
## wedding 0 9 25 46 17 3 0
We can see that the average loan amount is high for a credit card, debt consolidations, house, small business loans and low for vacation, moving and medical loans and also the number of loans for debt consolidation irrespective of loan grade and as stated earlier overall number of loans are high for grades B and C.
## [1] 0
## chr [1:100000] "9 years" "6 years" "3 years" "n/a" "1 year" "10+ years" ...
The employment length column has missing information but running is.na() function doesn’t show any missing values because the missing information is entered as “n/a”. Lets also convert the employement length feature into factor variable
##
## Charged Off Fully Paid
## n/a 1296 4852
## < 1 year 1204 6900
## 1 year 960 5689
## 2 years 1206 7781
## 3 years 1088 6958
## 4 years 775 5117
## 5 years 841 5205
## 6 years 632 4080
## 7 years 712 4412
## 8 years 698 4292
## 9 years 522 3386
## 10+ years 3851 27543
The average returns, average loan amount are high in case of cusotmer with more than 10 years of employement length. Also default rate seems to be decreasing with increase in employement length
As expected, the average annual income of fully paid loans is higher than that of defaulted loans. Also the average income level keeps increasing as we move from higher grades to lower which can be one of the reasons for high default rates for lower grades.
Also, the average annual income is high for cusotmer applying loan for home improvements, small businesses
Let’s look at some more columns and see if we can derive some new variables out of them. We have a few columns which gives us details about a peron’s backgroud info like the date the borrower’s earliest reported credit line was opened, the number of open credit lines and the total number of credit lines in the borrower’s credit file, the number of satisfactory bankcard accounts, and the total number of bankcard accounts.
Borrower’s History is calculated by subtracting the earliest credit line date and the loan issue date.We can see that for grade A the borrower’s history is high and it keeps decreasing as we move to lower grades.
Open account ratio is calculated by taking the ratio of the number of open credit lines to the total number of credit lines in the borrower’s credit file. We can see that as the grade moves from A to G the open account ratio keeps increasing.
Satisfactory Bankcard accounts ratio is calculated by dividing the number of satisfactory bankcard accounts by the total number of bankcard accounts. We can see that the Satisfactory Bankcard accounts ratio is increasing as the grade moves from A to G
Let’s look at the distributions of missing value columns. Most of the columns are right skewed distributions. We impute all those columns with median of its respective column.
## [1] "mths_since_last_delinq" "revol_util" "avg_cur_bal"
## [4] "bc_open_to_buy" "bc_util" "mo_sin_old_il_acct"
## [7] "mths_since_recent_bc" "mths_since_recent_inq" "num_rev_accts"
## [10] "num_tl_120dpd_2m" "pct_tl_nvr_dlq" "percent_bc_gt_75"
There are many columns with all data points as NAs so first let’s drop such columns. Later remove columns which were having more than 60% missing values.
Let’s also remove variables that can be a potential for data leakage. Though these variables were present at the time of creating and training the data model, when a new borrower enrolls into the lending club these values will not be present. Hence, creating a model based on these variables will result in inaccurate predictions.
The variables that have been excluded from the model are funded_amnt,funded_amnt_inv, term, installment, grade, emp_title, pymnt_plan, title, zip_code, addr_state, out_prncp, out_prncp_inv, total_pymnt_inv, total_rec_prncp, total_rec_int,total_rec_late_fee,recoveries, collection_recovery_fee, last_credit_pull_d, policy_code, disbursement_method, debt_settlement_flag, hardship_flag,chargeoff_within_12_mths, collections_12_mths_ex_med, application_type,last_pymnt_d,last_pymnt_amnt,total_pymnt,issue_d
Let’s measure the AUC score of each variable on loan status using auc() function. Below is the final table with sorted scores in descending order. Below is the rankings table. (actualReturn, total_pymnt, actualTerm columns will be omitted later when building the model)
We have a total of 100K observations in our data set. Let’s use 70% of the data as a training set and remaining as a test set.
Multiple classification models were built - Decision Tree using both rpart and C50 libraries, Random Forests, Gradient Boosting and are compared. Sensitivity score is used as a evaluation metric to compare models, as shown below
Let’s look at the ROC curves and lift charts for above classification models
Now lets try building regression models for predicting actual returns as well and later combine this model with that of classification model to see if we can improve accuracy
Annualized Returns is calculated by summing all loan payments received net of principal repayment, credit losses, and servicing costs. LendingClub charges an investor service fee of 1% of the number of borrower payments received by the payment due date or during applicable grace periods.
A random forest, a linear model and an xgboost model are developed to predict the best returns.Below table shows the RMSE scores for the regression models developed.
Let’s combine best models that we built previously - a regression model that predicts actualreturns, classification model to predict default rate
If we had used just the classification model, we would have had more emphasis on capital preservation but have a low returns. However, by incorporating the probability scores from the classification model into our regression model for predicting return, we are able to invest in the loans most likely to be paid with the highest predicted return.
Approach 1 -> Consider regression model based loans with high predicted returns, and sort these based on high classification model scores and then select the top (4%) loans
Approach 2 -> Calculating the expected returns by (Predicted returns from a loan) * (prob. of being FullyPaid) and select top fraction of loans based on this product of scores and sort these based on high those calculated scores and then select the top (4%) loans
The above table shows the predicted returns,by xgb model, sorted by the expected returns (calculated in approach 2). We observe that there are almost no grade A loans which give us the best returns and with approach 2 we have loans with higher returns - low defaults compared to the first approach.
Compared to single models (and approach 1 results) the combined model through approach 2 performs better having good returns and low defaulted loans.
Let’s focus on low grade loans now, because higher grade loans are less likely to default but also carry lower interest rates; many lower grad loans are fully paid, and these can yield higher returns. One approach may be to focus on lower grade loans (C and below), and try to identify those which are likely to be paid off.
Now we develop models from the data on lower grade loans, and check if this can provide an effective investment approach.
Considering all the approaches, we conclude that combined models are better than single models and also we observed that we can effectively invest in the riskier loan grades (C and below) and can get better returns at higher interest rates compared to that of investing in grade A,B loans with low interest rates, low returns and low risk.
We observed that when all the loan grades are considered most of the loans in the top 1% are from high risk grades D,E,F,G. So we had high defaults. However, when we consider only the lower grades (C and below) the top 1% loans are from grade C which are having low risk rate compared to grades D,E,F,G. Also we observe that we get better returns when considering lower grade loans due to the high interest rates.