Data Exploration, Cleaning, and Transformation

The first step in any modeling project is acquiring and cleaning the data; otherwise the model suffers from garbage in, garbage out. The most important steps were converting percentages stored as strings to numeric values, converting FICO scores from integers to named ranges (Good, Very Good, etc.), handling NA values, and removing factor variables with too many levels (since our data contains only ~500 delinquent loans, factors with 50+ levels are unlikely to be useful). I also removed some variables that were highly correlated with each other (correlation > 0.9).
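As a rough illustration of three of these cleaning steps, here is a minimal Python sketch. It is not the code used for this report. The function names are hypothetical, and the FICO cut points are the commonly published band boundaries, assumed here because the report does not state which cut points were used.

```python
def percent_to_float(text):
    """Convert a string such as ' 13.5%' to the number 13.5."""
    return float(text.strip().rstrip("%"))


# Assumed cut points (not stated in the report): lower bound of each named band.
FICO_BANDS = [(800, "Exceptional"), (740, "Very Good"), (670, "Good"),
              (580, "Fair"), (0, "Poor")]


def fico_band(score):
    """Map an integer FICO score to the first band whose lower bound it meets."""
    for lower_bound, name in FICO_BANDS:
        if score >= lower_bound:
            return name
    raise ValueError(f"negative score: {score}")


def high_cardinality_columns(rows, max_levels=50):
    """Return names of string-valued columns with max_levels or more distinct values."""
    levels = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str):
                levels.setdefault(column, set()).add(value)
    return sorted(c for c, seen in levels.items() if len(seen) >= max_levels)
```

Columns returned by `high_cardinality_columns` would be candidates for removal before modeling; the threshold of 50 mirrors the "50+ levels" rule of thumb mentioned above.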

##        Charged Off            Current            Default 
##                359               4264                  6 
##         Fully Paid    In Grace Period  Late (16-30 days) 
##                759                 70                 35 
## Late (31-120 days) 
##                148

Looking at the class counts, we see that each of the ‘Bad’ classes contains very few loans, so we will combine our classes into ‘Current’ and ‘Delinquent’ for our analysis. ‘In Grace Period’ loans are removed from our data since they are neither good nor bad. Full details of all data processing can be found in the comments of the included code.
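The collapse into two classes can be sketched as follows, using the counts from the status table. The grouping itself is an assumption: the report does not list which statuses go into each class, so this sketch treats ‘Current’ and ‘Fully Paid’ as ‘Current’ and every remaining status except ‘In Grace Period’ as ‘Delinquent’.

```python
from collections import Counter

# Counts copied from the status table above.
status_counts = {
    "Charged Off": 359, "Current": 4264, "Default": 6, "Fully Paid": 759,
    "In Grace Period": 70, "Late (16-30 days)": 35, "Late (31-120 days)": 148,
}

# Assumed grouping: the report does not spell out the mapping.
GOOD = {"Current", "Fully Paid"}
DROPPED = {"In Grace Period"}


def collapse(status):
    """Return 'Current', 'Delinquent', or None for statuses that are removed."""
    if status in DROPPED:
        return None
    return "Current" if status in GOOD else "Delinquent"


binary_counts = Counter()
for status, n in status_counts.items():
    label = collapse(status)
    if label is not None:
        binary_counts[label] += n

delinquent_share = binary_counts["Delinquent"] / sum(binary_counts.values())
```

Under this assumed grouping there are 5,023 ‘Current’ and 548 ‘Delinquent’ loans, a delinquent share of about 9.8%, which is in line with the ~500 delinquent loans mentioned earlier.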

We can also look at some plots of our data to get a sense of how some variables relate to the loan status.

From these we can see that lower FICOs and higher loan amounts might have higher delinquency rates, as we might expect.

Model Building

The two options I am considering are a logistic regression model and a random forest model. Both are well suited to classification, which is the task here. I will first try a logistic regression model, as the results are generally easier to interpret. Since we are interested in what factors might affect delinquency, and not just predictive accuracy, I believe ease of interpretation is important.

We begin by splitting our data into a training and a test set using an 80/20 split, then we will fit our model to the training dataset. We have 35 possible predictor variables, so we will use Lasso regression to try to limit the number that the model uses. The data must be standardized for the Lasso, and we will also use 10-fold cross validation to find the value of the parameter lambda that has the smallest prediction error.
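The mechanics of the split, the standardization, and the fold assignment can be sketched in plain Python as below. This is an illustration only, not the report's code: the function names and the seed are hypothetical, and the Lasso fit itself is left out.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Return (mean, sd) estimated from the training values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, sd):
    """Center and scale values with previously fitted statistics (sd must be > 0)."""
    return [(v - mean) / sd for v in values]


def fold_ids(n_rows, k=10, seed=0):
    """Assign each row to one of k cross-validation folds of near-equal size."""
    ids = [i % k for i in range(n_rows)]
    random.Random(seed).shuffle(ids)
    return ids
```

Fitting the scaler on the training rows and reusing the same mean and standard deviation for the test rows keeps information from the test set out of the model.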

Error With the Full Dataset

The graph shows the misclassification error for the values of lambda tried during cross validation. An error rate of around 0.1 might sound good, but further investigation reveals a problem. The blue line is the error rate we would get by labeling every loan ‘Current’, and this is essentially what the model is doing, as we can see when we predict on our test set.

##             testy
## preds        Current Delinquent
##   Current       1003        108
##   Delinquent       1          1

This problem arises because approximately 90% of our data are ‘Current’ loans. With classes this imbalanced, a model tuned for misclassification error can score well by predicting the majority class almost every time. The best approach would probably be to get more data for delinquent loans, but if that is not possible there is another approach we can try: use a random sample of our ‘Current’ loans that matches the number of ‘Delinquent’ loans in our dataset. This gives a 50/50 split, but the approach has its risks. Even if the sampling is truly random, we could be discarding valuable information from the ‘Current’ portion of the data. We will try this approach and see how the results compare.
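The downsampling step described here can be sketched as follows. This is a minimal illustration with hypothetical names (`downsample_majority`, `label_of`), not the code used for the report.

```python
import random


def downsample_majority(rows, label_of, majority="Current", seed=0):
    """Keep all minority rows plus an equal-sized random sample of majority rows."""
    majority_rows = [r for r in rows if label_of(r) == majority]
    minority_rows = [r for r in rows if label_of(r) != majority]
    # Sample without replacement so no majority row appears twice.
    sampled = random.Random(seed).sample(majority_rows, len(minority_rows))
    balanced = sampled + minority_rows
    random.Random(seed).shuffle(balanced)
    return balanced
```

One design choice worth stating: the sketch is meant to be applied to the training rows only, so that the test set keeps the original class proportions and the reported test error reflects the population the model would actually face.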

Error With the Downsampled Data

The error rate in this case is lower than our base error rate of 0.5 for balanced classes, although ~0.33 is still fairly high. Predicting on the test set shows that our prediction error is similar for both Current and Delinquent loans, at around 0.33. Although this prediction error is on the high side, this model should give a more accurate picture of what actually affects loan delinquency.

##             testy
##              Current Delinquent
##   Current        652         34
##   Delinquent     352         75
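The per-class error rates can be recomputed directly from the two test-set confusion matrices in this section. The sketch below uses a hypothetical helper, `error_rates`, and assumes that the rows of both matrices are predictions and the columns are the true labels (the second matrix does not label its rows).

```python
def error_rates(matrix):
    """Compute overall and per-class error rates from {(predicted, actual): count}."""
    total = sum(matrix.values())
    wrong = sum(n for (pred, actual), n in matrix.items() if pred != actual)
    rates = {"overall": wrong / total}
    for cls in {actual for _, actual in matrix}:
        n_cls = sum(n for (_, actual), n in matrix.items() if actual == cls)
        missed = sum(n for (pred, actual), n in matrix.items()
                     if actual == cls and pred != cls)
        rates[cls] = missed / n_cls
    return rates


# Model fitted to the full dataset.
full = {("Current", "Current"): 1003, ("Current", "Delinquent"): 108,
        ("Delinquent", "Current"): 1, ("Delinquent", "Delinquent"): 1}
# Model fitted to the downsampled data.
down = {("Current", "Current"): 652, ("Current", "Delinquent"): 34,
        ("Delinquent", "Current"): 352, ("Delinquent", "Delinquent"): 75}

full_rates = error_rates(full)
down_rates = error_rates(down)
```

With these counts, the full-data model misses 108 of the 109 delinquent loans in the test set, while the downsampled model misclassifies about 35% of Current loans (352 of 1004) and about 31% of Delinquent loans (34 of 109).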

The Lasso has selected 19 variables as the most effective predictors of loan delinquency. The coefficients of the variables for this model are included in the appendix. Coefficients with a positive sign increase the estimated chance of delinquency, while negative coefficients decrease it. Only the non-zero coefficients (as chosen by the cross-validated Lasso) are shown. These selected attributes give us areas to focus on when trying to determine whether or not a loan will become delinquent, and also allow us to generate an estimated probability of delinquency.
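To show how coefficients turn into an estimated probability, here is a small sketch of the logistic function applied to a linear score. It only illustrates the mechanics: it uses the intercept and three of the appendix coefficients, omits the remaining terms, and the borrower values are hypothetical. It also assumes the coefficients apply to the variables in their original units, which the report does not confirm.

```python
import math


def delinquency_probability(intercept, coefficients, features):
    """Apply the logistic function to intercept + sum(coef * feature value)."""
    score = intercept + sum(coefficients[name] * features[name]
                            for name in coefficients)
    return 1.0 / (1.0 + math.exp(-score))


# Intercept and three coefficients from the appendix; other terms are omitted.
intercept = -2.446757
coefficients = {"int_rate": 0.1515133, "inq_last_6mths": 0.1514916,
                "dti": 0.008650684}

# Hypothetical borrowers, not taken from the data.
low_rate = {"int_rate": 7.0, "inq_last_6mths": 0, "dti": 15.0}
high_rate = {"int_rate": 20.0, "inq_last_6mths": 3, "dti": 15.0}

p_low = delinquency_probability(intercept, coefficients, low_rate)
p_high = delinquency_probability(intercept, coefficients, high_rate)
```

Because the model was fitted to data balanced 50/50, probabilities computed this way are relative to that balanced sample and would overstate the delinquency rate in the full loan population; they are more useful for ranking loans than as absolute risk estimates.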

Conclusions and Future Considerations

There are clearly some issues with trying to use the entire given dataset. The best approach would be to gather more data on delinquent loans so that the proportions are closer to 50/50. Downsampling is far from a perfect strategy, so collecting more data is the most reliable way to improve the results. I found more loan data from previous years on the Lending Club website that could be used. They also have data on declined loans - it would be interesting to compare their attributes to those of loans that ended up becoming delinquent.

As I mentioned in the model building section, I also considered using a random forest model. I briefly tried one, and it produced very similar results on both the full and downsampled datasets. The two models generated the same predictions on 85% of the test samples, so they disagree on the remaining 15%, and it is possible that an ensemble method combining them could produce better results.
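The agreement figure and one simple way of combining two models can be sketched as follows. Averaging predicted probabilities is only one possible ensemble, chosen here for illustration; the function names and the 0.5 threshold are assumptions, not something that was tried in this report.

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of samples on which two models predict the same class."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


def average_ensemble(probs_a, probs_b, threshold=0.5):
    """Label a loan 'Delinquent' when the mean of two probabilities exceeds threshold."""
    return ["Delinquent" if (pa + pb) / 2 > threshold else "Current"
            for pa, pb in zip(probs_a, probs_b)]
```

Here `probs_a` and `probs_b` would be the delinquency probabilities from the logistic regression and the random forest for the same test loans.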

Other future work could include examining the residuals from the logistic regression to determine whether any variable transformations would be appropriate. We could also try ridge or elastic net regression by adjusting the alpha parameter in glmnet, but I feel that gathering more data on delinquent loans would be the most productive thing to do.
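For reference, the role of alpha can be made concrete with the elastic net penalty that glmnet adds to the loss, lambda * ((1 - alpha) / 2 * sum of squared coefficients + alpha * sum of absolute coefficients). The sketch below just evaluates that penalty; the function name is hypothetical and the intercept is assumed to be excluded from `coefs`, since it is not penalized.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Evaluate lam * ((1 - alpha) / 2 * ||coefs||_2^2 + alpha * ||coefs||_1)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)
```

Setting alpha = 1 gives the Lasso penalty used in this report, alpha = 0 gives ridge, and values in between give a mix that tends to keep groups of correlated predictors together rather than picking one of them.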

Appendix

Coefficients from the Logistic Regression model fitted to the downsampled data.

##                          names          coef
## 1                  (Intercept) -2.446757e+00
## 3                            X -1.285744e-04
## 4                    loan_amnt  3.239964e-05
## 5               term 60 months -4.275072e-01
## 6                     int_rate  1.515133e-01
## 7                   emp_length -1.478907e-02
## 10                  annual_inc -6.067829e-07
## 12 verification_statusVerified  1.175811e-01
## 13                         dti  8.650684e-03
## 14                 delinq_2yrs  1.191275e-02
## 16          fico_range_lowGood  4.734737e-02
## 17     fico_range_lowVery Good -3.683005e-01
## 19              inq_last_6mths  1.514916e-01
## 23                  revol_util -4.844642e-04
## 25             last_pymnt_amnt -2.673647e-04
## 28        acc_open_past_24mths -6.493869e-02
## 31    chargeoff_within_12_mths -4.328192e-01
## 34        mths_since_recent_bc -3.813994e-03
## 37                   num_il_tl  1.262962e-02
## 38            num_tl_120dpd_2m -4.723544e-01
## 43        pub_rec_bankruptcies  4.300000e+01