Abstract

Between 2007 and 2011, LendingClub issued over 42,000 personal loans to borrowers. To better understand loan performance in terms of total paid by a borrower, we conducted a multiple regression analysis of the loan data. Our results illustrate the difficulty in modeling loan performance with multiple regression. In conclusion, we learned that LendingClub’s loan data was too inconsistent to be used in a business context. For future analysis, we may consider modeling with methods more appropriate for the dataset and adding additional sources of information.

Introduction

In researching this loan data set, our intent is to better understand, and eventually model, the outcome of a loan. More specifically, we wish to develop an understanding of the factors that affect the overall performance or outcome of a LendingClub loan. We expect a number of interesting business questions can be explored through our analysis, both broad and narrow in scope. We will explore loan outcomes as a component of estimating the total value of a loan portfolio. More narrowly, we also wish to understand those factors that may have the most influence on the overall performance of a given loan.

Overview of the Dataset

The raw dataset is comprised of information describing all loans issued by LendingClub between 2007 and 2011. The dataset can be found at https://www.lendingclub.com/foliofn/rateDetail.action . From the raw data, a test set was created by first removing all columns that were entirely composed of missing values or contained the same value for each record. Additionally, columns containing free text - title, description, etc., were removed along with redundant features that were a subset or component of another feature (sub-grade, funded amount,etc) Finally, a target variable of ‘total paid’ was created by combining total principal and total interest, excluding any collection amounts or fees received. Features relating to total amounts collected or recovered were also excluded from the test data set.

Description of Variables
  • annual_inc (numeric): The self-reported annual income provided by the borrower during registration
  • emp_length (factor): Employment length in years
  • grade (factor): Lending Club’s assigned loan grade
  • home_ownership (factor): The home ownership status provided by the borrower during registration
  • inq_last_6mths (numeric): The number of inquiries in past 6 months excluding auto and mortgage inquiries
  • installment (numeric): The monthly payment owed by the borrower if the loan originates
  • int_rate (numeric): Interest Rate on the loan
  • issue_d (date): The month which the loan was funded
  • last_pymnt_amnt (numeric): Last total payment amount received
  • last_pymnt_d (date): Last month payment was received
  • loan_amnt (numeric): The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value
  • loan_status (factor): Current status of the loan
  • mths_since_last_delinq (numeric): The number of months since the borrower’s last delinquency
  • mths_since_last_record (numeric): The number of months since the last public record
  • open_acc (numeric): The number of open credit lines in the borrower’s credit file
  • pub_rec (numeric): The number of derogatory public records
  • purpose (factor): A category provided by the borrower for the loan request
  • recoveries (numeric): post charge off gross recovery
  • revol_bal (numeric): Total credit revolving balance
  • revol_util (numeric): Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
  • term (factor): The number of payments on the loan
  • tot_paid (numeric): The sum of total principal and total interest, excluding any collection amounts or fees received
  • verification_status (factor): Indicates if the borrowers’ income was verified by LC, not verified, or if the income source was verified

Exploratory analysis

Variable Analysis

Most of our distributions are right or positively skewed. Respective to a majority of variables in our dataset, it makes sense that these observations have a greater mean than median. For example, from annnual income standpoint most borrowers typically make less than 100,000 dollars with some borrowers making quite a bit more ranging from 100,000 to 200,000 dollars.

Categorical Variable Analysis

Looking at our boxplots, it is clear to see that many extreme outliers are present in each one of our graphs. This reflects the point above about how many of our distributions are positively skewed. As we further investgate, light will be shed on the types of problems this many extreme outliers can cause from a constant variance and normailty perspective.

Correlation Analysis

In our correlation anlaysis, there is potential multicollinearity found between two independent variables, installment and loan amount. If multicollinearity is indeed detected as we conduct our models, then we will need to remove one of these variables.

Model Identification and Evaluation

Our final OLS model was developed over 8 iterations, which we began by regressing on all available independent variables remaining after our intitial data cleanse. Through five model iterations, we exlcuded the following variables that did not significantly contribute to our model (p > 0.7), and improved our model fit once removed (adjusted R^2): verification_status, zip_code, earliest_cr_line, total_acc, dti, last_credit_pull_d,grade, and delinq_2yrs. Furthermore, as proven true from our correlation analysis, the independent variable loan_amnt was dropped due to its multicolinearity with the installment independent variable. Lastly, we removed the state independent variable as its influence on the fit of the model was minimal, and the regression model could be made more manageable by removing a 49-level factor variable. Our 6th model iteration produced a set of residuals that failed both Non-constant variance and Shapiro-Wilk normailty tests:

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 577.4706, Df = 1, p = < 2.22e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  testmodel6$residuals
## W = 0.91184, p-value < 2.2e-16

A Box-Cox test of the output, total_paid, revealed a transformation approximately between log(Y) and sqrt(Y) may help normalize residual variance:

A cube-root transformation of our output produced residuals that passed the NCV test, but still failed the test of normality:

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.1740437, Df = 1, p = 0.67654

A simiar process was followed for the independent variables where Box-Cox suggested a transformation would be appropriate. Though normalizing a number of independent variables improved the fit of our model to an R^2 of 94.2, we were unable to produce a model that could pass both residiual tests.

Summary of the final Model:

Observations 42331 (203 missing obs. deleted)
Dependent variable tot_paid
Type OLS linear regression
F(32,42298) 21435.18
0.94
Adj. R² 0.94
Est. S.E. t val. p
(Intercept) -14.67 0.32 -45.36 0.00
term 0.07 0.00 72.62 0.00
int_rate -0.67 0.18 -3.82 0.00
installment 5.82 0.01 495.75 0.00
home_ownershipNONE -2.50 0.70 -3.59 0.00
home_ownershipOTHER 0.05 0.12 0.39 0.69
home_ownershipOWN -0.08 0.03 -3.02 0.00
home_ownershipRENT -0.05 0.02 -3.06 0.00
annual_inc 0.00 0.00 4.49 0.00
issue_d -0.00 0.00 -112.70 0.00
loan_statusDoes not meet the credit policy. Status:Charged Off -0.30 0.06 -5.04 0.00
loan_statusDoes not meet the credit policy. Status:Fully Paid 4.21 0.04 94.40 0.00
loan_statusFully Paid 4.17 0.03 159.34 0.00
purposecredit_card 0.12 0.04 3.08 0.00
purposedebt_consolidation 0.11 0.04 3.05 0.00
purposeeducational 0.11 0.08 1.38 0.17
purposehome_improvement 0.05 0.04 1.04 0.30
purposehouse 0.22 0.08 2.89 0.00
purposemajor_purchase 0.04 0.05 0.90 0.37
purposemedical -0.04 0.06 -0.70 0.49
purposemoving 0.05 0.07 0.83 0.41
purposeother 0.04 0.04 1.02 0.31
purposerenewable_energy 0.03 0.14 0.18 0.85
purposesmall_business -0.14 0.05 -2.88 0.00
purposevacation 0.11 0.08 1.37 0.17
purposewedding 0.04 0.06 0.64 0.52
inq_last_6mths -0.02 0.01 -4.01 0.00
open_acc -0.02 0.03 -0.76 0.45
revol_bal 0.00 0.00 0.65 0.52
revol_util -0.19 0.05 -4.28 0.00
last_pymnt_d 0.00 0.00 157.68 0.00
last_pymnt_amnt 0.00 0.00 89.97 0.00
recoveries -0.00 0.00 -24.51 0.00
Standard errors: OLS

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2872.179, Df = 1, p = < 2.22e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(testmodel10$residuals, 5000)
## W = 0.84597, p-value < 2.2e-16

Model Application

Applying the model to the four holdout cases produces the following estimates:

Total Paid Predicted Lower PI Upper PI % Error
1000 14401.722 14305.042 10003.1857 19691.294 0.671306
13 1826.386 3447.899 1896.8298 5673.781 88.782566
156 1350.850 1916.693 910.7396 3477.566 41.887908
228 4182.605 3256.604 1768.7551 5406.743 22.139328

For these observations, the model does not appear to make approporiate predictions for total amounts paid. Total paid actual for the second observation, ‘13’, falls outside of the 95% prediction interval. Three of the observations had errors in excess of 20%.

Applying the model to the training data and analyzing prediction errors produces an RMSE of :

## [1] 1887.424

Given the potential business consequences of utilizing such erroneous estimates, it would not be approporiate to apply this model to new or similar loan data.

Assumptions and Limitations

Because the data was gathered and prepared by LendingClub, we were limited in our ability understand certain data points and, more importantly, handle missing information in the data set. There are a number of borrower credit-related variables that were entirely blank that we feel could have helped build a better model of the total amount paid on a loan. We also feel that it could be beneficial to bring in outside economic data, as we have hypothesized that economic factors may contribute to why a loan is taken out, its attributes, and how it eventually performs. In the future, it would also be worth considering how LendingClub’s growing popularity as a lending platform affects not only their lending frequency, but also the nature and performance of the loans themselves.

The first, and most broad, assumption we made was that the eventual outcome of a loan in terms of total paid could be described by a simple, multiple linear regression model, which does not appear to be the case. The data may be more suitable for tree-type or classification models, or for a binary outcome such as whether or not a loan was fully paid off.