Data Exploration:

There are a total of 100000 rows and 144 variables in our data and about 13.79% of loans are charged off and 86.22% of loans are fully paid.

Lending Club categorizes borrowers into seven different loan grades: A through G. Within each loan grade there are five sub-grades from A1 to G5. A borrower is graded depending on many factors like credit report, borrowing amount, debt-income ratio…etc. The chance of loan getting approved will be decreasing from A1 thru G5.

##              
##                   A     B     C     D     E     F     G
##   Charged Off  1187  3723  4738  2858  1010   239    30
##   Fully Paid  21401 30184 21907  9635  2569   469    50

We observe that not many people have grade A, Since credit score is one of the factors effecting the grade and many people didnt have a good credit score back in 2013-2015 it could be one of the reasons for higher people having grades B and C (as per FICO data https://www.fico.com/blogs/average-us-fico-score-ticks-706)

There is a relationship between interest rate and grades which is expected - lower the grade higher the interest rate.

Now we will calculate the annual returns of investors based on the actual term. For a fully paid loan actual term is obtained by subtracting the loan issue date and the last payment date. But in case of a charged off loan last payment date is not available for a few observations in the data. Since all the loans have either a 36- or 60-month term, with fixed interest rates and equal payments. We set 3 years as the actual term for charged off loans.

There are few positive annual returns for charged off loans. This could be because they charged off after paying few installments. For example, a borrower has installment of $188.66 at int_rate of 12.99 with 36 months term. He paid $6114.48 and then charged off on the interest the needs to be paid. So, there are few returns in case of this borrower.

We can see that the annual returns rate % is almost same for grade A,B and slightly decreases as the grade lowers to G. Also the mean return % is having a positive relationship with interest rate. If I were the investor I would probably look for grade B,C loans to invest since they have low default rate and high returns.

Let’s explore few columns

Purpose Column:

The above table shows the number of loans, defaults, average Interest, average loan amount, and average payment for all the purposes.

##                     
##                          A     B     C     D     E     F     G
##   car                  253   306   238    92    27     8     4
##   credit_card         8349  9809  5008  1518   266    37     2
##   debt_consolidation 11573 19745 16497  7534  1954   292    27
##   home_improvement    1457  1777  1496   673   215    33     3
##   house                 37    74    83    74    48    27    11
##   major_purchase       441   553   479   252    70    26     2
##   medical               84   270   382   251    97    34     1
##   moving                10    96   207   234   108    32     4
##   other                324  1036  1702  1321   551   139    18
##   renewable_energy       3     5    22    18     8     2     0
##   small_business        15   100   249   300   159    62     8
##   vacation              42   127   257   180    59    13     0
##   wedding                0     9    25    46    17     3     0

We can see that the average loan amount is high for a credit card, debt consolidations, house, small business loans and low for vacation, moving and medical loans and also the number of loans for debt consolidation irrespective of loan grade and as stated earlier overall number of loans are high for grades B and C.

## [1] 0
##  chr [1:100000] "9 years" "6 years" "3 years" "n/a" "1 year" "10+ years" ...

Employement Length Column:

The employment length column has missing information but running is.na() function doesn’t show any missing values because the missing information is entered as “n/a”. Lets also convert the employement length feature into factor variable

##            
##             Charged Off Fully Paid
##   n/a              1296       4852
##   < 1 year         1204       6900
##   1 year            960       5689
##   2 years          1206       7781
##   3 years          1088       6958
##   4 years           775       5117
##   5 years           841       5205
##   6 years           632       4080
##   7 years           712       4412
##   8 years           698       4292
##   9 years           522       3386
##   10+ years        3851      27543

The average returns, average loan amount are high in case of cusotmer with more than 10 years of employement length. Also default rate seems to be decreasing with increase in employement length

Annual Income Column:

As expected, the average annual income of fully paid loans is higher than that of defaulted loans. Also the average income level keeps increasing as we move from higher grades to lower which can be one of the reasons for high default rates for lower grades.

Also, the average annual income is high for cusotmer applying loan for home improvements, small businesses

Let’s look at some more columns and see if we can derive some new variables out of them. We have a few columns which gives us details about a peron’s backgroud info like the date the borrower’s earliest reported credit line was opened, the number of open credit lines and the total number of credit lines in the borrower’s credit file, the number of satisfactory bankcard accounts, and the total number of bankcard accounts.

Borrower’s History is calculated by subtracting the earliest credit line date and the loan issue date.We can see that for grade A the borrower’s history is high and it keeps decreasing as we move to lower grades.

Open account ratio is calculated by taking the ratio of the number of open credit lines to the total number of credit lines in the borrower’s credit file. We can see that as the grade moves from A to G the open account ratio keeps increasing.

Satisfactory Bankcard accounts ratio is calculated by dividing the number of satisfactory bankcard accounts by the total number of bankcard accounts. We can see that the Satisfactory Bankcard accounts ratio is increasing as the grade moves from A to G

Let’s look at the distributions of missing value columns. Most of the columns are right skewed distributions. We impute all those columns with median of its respective column.

##  [1] "mths_since_last_delinq" "revol_util"             "avg_cur_bal"           
##  [4] "bc_open_to_buy"         "bc_util"                "mo_sin_old_il_acct"    
##  [7] "mths_since_recent_bc"   "mths_since_recent_inq"  "num_rev_accts"         
## [10] "num_tl_120dpd_2m"       "pct_tl_nvr_dlq"         "percent_bc_gt_75"

There are many columns with all data points as NAs so first let’s drop such columns. Later remove columns which were having more than 60% missing values.

Let’s also remove variables that can be a potential for data leakage. Though these variables were present at the time of creating and training the data model, when a new borrower enrolls into the lending club these values will not be present. Hence, creating a model based on these variables will result in inaccurate predictions.

The variables that have been excluded from the model are funded_amnt,funded_amnt_inv, term, installment, grade, emp_title, pymnt_plan, title, zip_code, addr_state, out_prncp, out_prncp_inv, total_pymnt_inv, total_rec_prncp, total_rec_int,total_rec_late_fee,recoveries, collection_recovery_fee, last_credit_pull_d, policy_code, disbursement_method, debt_settlement_flag, hardship_flag,chargeoff_within_12_mths, collections_12_mths_ex_med, application_type,last_pymnt_d,last_pymnt_amnt,total_pymnt,issue_d

Let’s measure the AUC score of each variable on loan status using auc() function. Below is the final table with sorted scores in descending order. Below is the rankings table. (actualReturn, total_pymnt, actualTerm columns will be omitted later when building the model)

Let’s build few models

We have a total of 100K observations in our data set. Let’s use 70% of the data as a training set and remaining as a test set.

Multiple classification models were built - Decision Tree using both rpart and C50 libraries, Random Forests, Gradient Boosting and are compared. Sensitivity score is used as a evaluation metric to compare models, as shown below

Let’s look at the ROC curves and lift charts for above classification models

Now lets try building regression models for predicting actual returns as well and later combine this model with that of classification model to see if we can improve accuracy

Annualized Returns is calculated by summing all loan payments received net of principal repayment, credit losses, and servicing costs. LendingClub charges an investor service fee of 1% of the number of borrower payments received by the payment due date or during applicable grace periods.

A random forest, a linear model and an xgboost model are developed to predict the best returns.Below table shows the RMSE scores for the regression models developed.

Let’s combine best models that we built previously - a regression model that predicts actualreturns, classification model to predict default rate

If we had used just the classification model, we would have had more emphasis on capital preservation but have a low returns. However, by incorporating the probability scores from the classification model into our regression model for predicting return, we are able to invest in the loans most likely to be paid with the highest predicted return.

Approach 1 -> Consider regression model based loans with high predicted returns, and sort these based on high classification model scores and then select the top (4%) loans

Approach 2 -> Calculating the expected returns by (Predicted returns from a loan) * (prob. of being FullyPaid) and select top fraction of loans based on this product of scores and sort these based on high those calculated scores and then select the top (4%) loans

The above table shows the predicted returns,by xgb model, sorted by the expected returns (calculated in approach 2). We observe that there are almost no grade A loans which give us the best returns and with approach 2 we have loans with higher returns - low defaults compared to the first approach.

Compared to single models (and approach 1 results) the combined model through approach 2 performs better having good returns and low defaulted loans.

Let’s focus on low grade loans now, because higher grade loans are less likely to default but also carry lower interest rates; many lower grad loans are fully paid, and these can yield higher returns. One approach may be to focus on lower grade loans (C and below), and try to identify those which are likely to be paid off.

Now we develop models from the data on lower grade loans, and check if this can provide an effective investment approach.

Conslusion:

Considering all the approaches, we conclude that combined models are better than single models and also we observed that we can effectively invest in the riskier loan grades (C and below) and can get better returns at higher interest rates compared to that of investing in grade A,B loans with low interest rates, low returns and low risk.

We observed that when all the loan grades are considered most of the loans in the top 1% are from high risk grades D,E,F,G. So we had high defaults. However, when we consider only the lower grades (C and below) the top 1% loans are from grade C which are having low risk rate compared to grades D,E,F,G. Also we observe that we get better returns when considering lower grade loans due to the high interest rates.