Summary Report

The data covers personal loans, and the challenge is to find a model that can help distinguish customers who pay off their loans from those whose loans must be charged off. To me, the real question is therefore: who is more likely to be charged off in the future, and can we help them beforehand so that the charge off is avoided and the loan is paid off? However, because only 2% of the data consists of charge offs, no machine learning method will be able to predict both classes accurately. I therefore view the challenge as "how likely is a charge off?", and I will show that a customer's likely charge off can be flagged with a positive predictive value of 83%. To build a better predictive model and get a better representation of our outcome, I recommend that information on rejected loan applicants also be included with the data in the future, so that a possible relationship between rejection and charge off can be explored. Another indicator worth adding is delinquency in the last 6 months; delinquency in the last 2 years simply does not have enough predictive power.

Getting the data

Challenge 1: Programmatically download the data

# first, the code checks your directory structure for a data folder and if not there, creates one
if (!dir.exists("data")) {
  dir.create("data") 
}

# the code checks the content of the data folder and, if it is empty, downloads and unzips the
# data into that folder
if(length(dir("data")) == 0 ) {
  dwnld <- tempfile()
  download.file("https://s3.us-east-2.amazonaws.com/example.data/lacare_ml_challenge.zip",dwnld)
  unzip(dwnld, exdir = "data")
  unlink(dwnld)
}

Loading data and initial data review

The data set is heavily imbalanced and requires careful attention before modeling

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   loan_status = col_character(),
##   purpose = col_character(),
##   home_ownership = col_character(),
##   grade = col_character(),
##   emp_length = col_character(),
##   term = col_character(),
##   addr_state = col_character(),
##   verification_status = col_character(),
##   application_type = col_character()
## )
## See spec(...) for full column specifications.
## Observations: 1,062,952
## Variables: 22
## $ dti                        <dbl> 30.46, 50.53, 18.92, 4.64, 12.37, 22.…
## $ annual_inc                 <dbl> 100000, 45000, 100000, 38500, 450000,…
## $ delinq_2yrs                <dbl> NA, NA, NA, 0, NA, 0, NA, NA, 0, NA, …
## $ inq_last_6mths             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ open_acc                   <dbl> 11, 18, 9, 12, 21, 10, 6, 3, 11, 21, …
## $ pub_rec                    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ revol_bal                  <dbl> 15603, 34971, 25416, 4472, 36812, 333…
## $ revol_util                 <dbl> 37.0, 64.5, 29.9, 15.3, 65.7, 45.9, 5…
## $ total_acc                  <dbl> 19, 37, 19, 25, 37, 16, 13, 6, 18, 28…
## $ out_prncp                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ out_prncp_inv              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_rec_late_fee         <dbl> 0, 0, NA, NA, 0, 0, NA, 0, NA, 0, 0, …
## $ collections_12_mths_ex_med <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_status                <chr> "Fully Paid", "Fully Paid", "Fully Pa…
## $ purpose                    <chr> "debt_consolidation", "credit_card", …
## $ home_ownership             <chr> "MORTGAGE", "MORTGAGE", "MORTGAGE", "…
## $ grade                      <chr> "D", "C", "A", "B", "E", "D", "B", "C…
## $ emp_length                 <chr> "5 years", "< 1 year", "10+ years", "…
## $ term                       <chr> "36 months", "60 months", "36 months"…
## $ addr_state                 <chr> "CA", "OH", "WA", "TX", "MA", "WA", "…
## $ verification_status        <chr> "Source Verified", "Verified", "Not V…
## $ application_type           <chr> "Joint App", "Joint App", "Joint App"…

From the output, we observe that loan_status has values in a non-standard format, which has to be addressed.

The distribution of charged off to fully paid is disproportionately skewed toward fully paid: only 2% of the data is represented by charge offs, so we have to think about a strategy for dealing with this imbalance. The category names are also in a non-standard format, which has to be addressed.

Exploratory analysis

I'll be using the DataExplorer package to look at the overall data structure.
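A minimal sketch of the DataExplorer calls behind this overview, assuming the loaded data frame is named loan_data:

library(DataExplorer)

plot_intro(loan_data)     # row/column counts, column types, overall missingness
plot_missing(loan_data)   # per-variable missing-value percentages
plot_histogram(loan_data) # distributions of the continuous variables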

Several variables have missing data: total_rec_late_fee and delinq_2yrs are each about 33% missing, dti has 0.02% missing observations, revol_util has 0.06% missing observations, and inq_last_6mths has a single missing observation.

open_acc and total_acc are skewed and could benefit from a transformation to deal with the issue.

Before pre-processing the data, let's find out which variables we can drop due to zero variance or near-zero variance (zv/nzv). ZV and NZV variables have little or no variability in their distribution, which implies they contribute little in model scenarios.
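A sketch of how the zv/nzv screen below can be produced with caret (loan_data is my assumed name for the raw data frame):

library(caret)

# freqRatio, percentUnique and the zeroVar/nzv flags for every column
nearZeroVar(loan_data, saveMetrics = TRUE)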

##                             freqRatio percentUnique zeroVar   nzv
## dti                          1.002591  5.997449e-01   FALSE FALSE
## annual_inc                   1.134768  5.121868e+00   FALSE FALSE
## delinq_2yrs                  6.391808  2.634174e-03   FALSE FALSE
## inq_last_6mths               2.175604  8.466986e-04   FALSE FALSE
## open_acc                     1.031607  7.808443e-03   FALSE FALSE
## pub_rec                      6.068576  3.104562e-03   FALSE FALSE
## revol_bal                   59.704545  7.431662e+00   FALSE  TRUE
## revol_util                   2.811106  1.247469e-01   FALSE FALSE
## total_acc                    1.003499  1.307679e-02   FALSE FALSE
## out_prncp                    0.000000  9.407763e-05    TRUE  TRUE
## out_prncp_inv                0.000000  9.407763e-05    TRUE  TRUE
## total_rec_late_fee         145.948761  4.336038e-01   FALSE  TRUE
## collections_12_mths_ex_med  73.194047  1.411164e-03   FALSE  TRUE
## loan_status                 49.616762  1.881553e-04   FALSE  TRUE
## purpose                      2.530201  1.317087e-03   FALSE FALSE
## home_ownership               1.334914  5.644658e-04   FALSE FALSE
## grade                        1.135453  6.585434e-04   FALSE FALSE
## emp_length                   3.678981  1.128932e-03   FALSE FALSE
## term                         3.820710  1.881553e-04   FALSE FALSE
## addr_state                   1.793739  4.797959e-03   FALSE FALSE
## verification_status          1.196600  2.822329e-04   FALSE FALSE
## application_type            58.181115  1.881553e-04   FALSE  TRUE

How does dropping delinq_2yrs affect charge off in loan_status?
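The count below can be reproduced with a pipeline along these lines (a sketch; loan_data assumed):

library(dplyr)
library(tidyr)

loan_data %>%
  drop_na(delinq_2yrs) %>%   # keep only rows where delinq_2yrs is observed
  group_by(loan_status) %>%
  summarise(n())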

## # A tibble: 2 x 2
##   loan_status  `n()`
##   <chr>        <int>
## 1 Charged Off  14036
## 2 Fully Paid  698142

We see that dropping the missing observations of delinq_2yrs reduces the charged-off count by about a third, so we have to perform imputation.

Let's look at the distribution of the delinq_2yrs variable and see if we can find a pattern.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## # A tibble: 6 x 3
## # Groups:   loan_status [2]
##   loan_status delinq_2yrs  `n()`
##   <chr>             <dbl>  <int>
## 1 Charged Off           0  11040
## 2 Charged Off           1   1954
## 3 Charged Off           2    576
## 4 Fully Paid            0 566217
## 5 Fully Paid            1  88358
## 6 Fully Paid            2  25412

We see that most values are 0, and that of the charge off accounts 81% are 0. Because no single imputation method will be able to give us an accurate estimate within the 0.5% to 1.5% of all the data that the charge off accounts represent, it is best in this case to just assign 0 (the majority value) to all missing data.
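A sketch of that imputation using tidyr:

library(dplyr)
library(tidyr)

# assign the majority value (0) to every missing delinq_2yrs observation
loan_data <- loan_data %>%
  mutate(delinq_2yrs = replace_na(delinq_2yrs, 0))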

Let's take a closer look at them.

## # A tibble: 10 x 3
## # Groups:   loan_status [2]
##    loan_status delinq_2yrs  `n()`
##    <chr>             <dbl>  <int>
##  1 Charged Off           0  11040
##  2 Charged Off           1   1954
##  3 Charged Off           2    576
##  4 Charged Off           3    246
##  5 Charged Off           4    108
##  6 Fully Paid            0 566217
##  7 Fully Paid            1  88358
##  8 Fully Paid            2  25412
##  9 Fully Paid            3   9137
## 10 Fully Paid            4   4024

We can see that over 50% of the values are 0, followed by 1 with about 10%; the rest are spread across the remaining values of delinq_2yrs.

Looking at categorical variables and their structure

## 1 columns ignored with more than 50 categories.
## addr_state: 51 categories

From the plot, we can see that purpose and home_ownership contain several small categories, so creating an alternate category that encompasses the smaller ones should help the overall prediction.

Looking at the "purpose" categories for decision making

## # A tibble: 28 x 3
## # Groups:   purpose [14]
##    purpose            loan_status  `n()`
##    <chr>              <chr>        <int>
##  1 car                Charged Off    183
##  2 car                Fully Paid   12053
##  3 credit_card        Charged Off   3841
##  4 credit_card        Fully Paid  237058
##  5 debt_consolidation Charged Off  12971
##  6 debt_consolidation Fully Paid  596552
##  7 educational        Charged Off      4
##  8 educational        Fully Paid     270
##  9 home_improvement   Charged Off   1254
## 10 home_improvement   Fully Paid   69410
## # … with 18 more rows

We can see that most of the smaller categories contain very few charged-off observations, so the best strategy is to create a new variable with only four categories that rolls the smaller ones into an "other" category, as sketched below.
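A sketch of the collapse with forcats; the three retained levels are my assumption based on the counts above:

library(dplyr)
library(forcats)

# keep the three largest purposes and roll everything else into "other"
loan_data <- loan_data %>%
  mutate(purpose_grp = fct_other(
    purpose,
    keep = c("debt_consolidation", "credit_card", "home_improvement"),
    other_level = "other"
  ))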

Looking at home_ownership for decision making

## # A tibble: 11 x 3
## # Groups:   home_ownership [6]
##    home_ownership loan_status  `n()`
##    <chr>          <chr>        <int>
##  1 ANY            Charged Off     12
##  2 ANY            Fully Paid     215
##  3 MORTGAGE       Charged Off   8974
##  4 MORTGAGE       Fully Paid  533834
##  5 NONE           Fully Paid      41
##  6 OTHER          Charged Off      4
##  7 OTHER          Fully Paid     117
##  8 OWN            Charged Off   2298
##  9 OWN            Fully Paid  110833
## 10 RENT           Charged Off   9712
## 11 RENT           Fully Paid  396912

We can see from the result that dropping the observations in the ANY, NONE, and OTHER categories will have minimal effect on our data.
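A sketch of dropping those rows:

library(dplyr)

# drop the tiny home_ownership categories flagged above
loan_data <- loan_data %>%
  filter(!home_ownership %in% c("ANY", "NONE", "OTHER"))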

Data preparation

Using the information from the EDA in the previous step, I will prep the data for modeling.

One final review of the dataset: the following function creates a full EDA report of the final dataset.
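If useful, DataExplorer can generate that full EDA in a single call (a sketch; the output file name is illustrative):

library(DataExplorer)

# writes a standalone HTML report profiling the final dataset
create_report(loan_data_full_temp, output_file = "loan_data_eda.html")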

Splitting the data into training and test sets, 3/4 train to 1/4 test, using a custom seed for reproducibility.

  • I'll be using the tidymodels packages for developing my model; they allow a simple recipe to be created, giving a formula specification that can be updated and passed to any model without extra steps
set.seed(3465) 
data_split <- initial_split(loan_data_full_temp, strata = "loan_status")
loan_train <- training(data_split)
loan_test  <- testing(data_split)

loan_rec <- loan_train %>%
  recipe(loan_status ~ .) %>%
  step_YeoJohnson(open_acc, total_acc) %>% # extension of the Box-Cox transformation, to deal with skewness of the data
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()

# using the above recipe to finalize the test and training set
loan_test_rec <- loan_rec %>%
  bake(loan_test)

loan_train_rec <- juice(loan_rec)

The model for prediction

The problem with this data is that no matter what method we choose, the imbalance of the data will not allow for great model accuracy. This prediction problem has to be approached from a different perspective: how much risk are we willing to take on customers heading for charge-off? The model will therefore still produce a prediction, but we will use the probabilities resulting from that prediction to build a categorical variable that assigns each customer, based on their probability, to one of: no risk, mild risk, minor risk, major risk. I have chosen to run generalized boosted regression modeling (GBM). It is a method that is able to reduce both bias and variance and executes more efficiently than similar methods. I am also going to use h2o for my modeling because of its distributed functionality.

Using the caret package for modeling, we declare a control object specifying how we would like to train the model, e.g. using cross-validation; to deal with our imbalanced dependent variable, I will use down-sampling of the data in order to achieve better results.
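A minimal sketch of that control object; the fold count and the savePredictions setting are my assumptions:

library(caret)

ctrl <- trainControl(
  method = "cv",                     # cross-validation
  number = 5,                        # assumption: 5 folds
  classProbs = TRUE,                 # keep class probabilities for ROC analysis
  summaryFunction = twoClassSummary, # report ROC / Sens / Spec
  sampling = "down",                 # down-sample the majority class within each fold
  savePredictions = "final"          # keep hold-out predictions for the ROC work below
)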

Looking for the optimal hyperparameters (our method here will choose them for us; however, if we want to improve on the model in the future, this would be one of the ways to explore model improvement).
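A sketch of the training call; the object names, tuneLength, and the formula interface are assumptions on my part, not the exact call used:

library(caret)
library(h2o)

h2o.init() # start the local h2o cluster that caret delegates to

# assumes loan_status levels were recoded earlier to charged_off / fully_paid
gbm_fit <- train(
  loan_status ~ .,
  data = loan_train_rec,
  method = "gbm_h2o",   # caret's wrapper around h2o's gradient boosting
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 3        # let caret propose a small hyperparameter grid
)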

The top ten variables as determined by our model

Model performance analysis

##    TrainROC TrainSens TrainSpec  method
## 1 0.7037325 0.3586677 0.8608503 gbm_h2o

Because our data is extremely skewed, accuracy cannot be the metric of interest; instead we can use the ROC curve to determine an optimal cutoff point for finding our target customers.
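A sketch with pROC of how an optimal cutoff could be read off the hold-out predictions (assumes the ctrl object above saved them and that charged_off is the positive-class probability column):

library(pROC)

cv_preds <- gbm_fit$pred # hold-out predictions saved by caret

roc_obj <- roc(response = cv_preds$obs, predictor = cv_preds$charged_off)

# threshold that best balances sensitivity and specificity
coords(roc_obj, x = "best", ret = c("threshold", "sensitivity", "specificity"))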

Performance of model as viewed on ROC curve

## Setting direction: controls < cases

Predicting test set with probabilities
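A sketch of the prediction step (object names assumed from the earlier sketches):

# predicted class probabilities for the baked test set
test_probs <- predict(gbm_fit, newdata = loan_test_rec, type = "prob")
head(test_probs)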


ROC curve of prediction

Looking at the ROC curve of the test results reveals that we are predicting a lot more paid-off customers than we have charged-off customers, which is as expected.

Finding out PPV (positive predictive value where charged off is the positive class)

##              prediction
## loan_status   charged_off fully_paid
##   charged_off        4492        901
##   fully_paid       146853     113394

I use a 0.4 cutoff instead of 0.5, which produces a higher PPV at the cost of a lower NPV; but if the goal is to identify the bad apples, the exact value of the cutoff should be determined by a content expert.
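A sketch of how the table above could be produced at that 0.4 cutoff (the probability column name is assumed):

# classify using the 0.4 cutoff on the charge-off probability
pred_class <- ifelse(test_probs$charged_off > 0.4, "charged_off", "fully_paid")

# rows: actual loan_status, columns: predicted class
table(loan_status = loan_test_rec$loan_status, prediction = pred_class)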

I will use some arbitrary cutoffs on the predictions to assign a risk rank to each customer.
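A sketch of turning the predicted probabilities into the four risk ranks; these particular cutoffs are purely illustrative:

# illustrative cutoffs only; a content expert should set these
risk_rank <- cut(
  test_probs$charged_off,
  breaks = c(0, 0.25, 0.40, 0.60, 1),
  labels = c("no risk", "mild risk", "minor risk", "major risk"),
  include.lowest = TRUE
)
table(risk_rank)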


End-user layer to be generated if desired

Extra content to help with explaining results: the lime package helps identify the variables that act on an individual observation, given the model and dataset.
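A sketch of the lime workflow (the observations chosen and the number of features shown are assumptions):

library(dplyr)
library(lime)

# build an explainer from the training predictors and the fitted caret model
explainer <- lime(loan_train_rec %>% select(-loan_status), gbm_fit)

# explain a handful of test observations: which variables push each one
# toward charged_off versus fully_paid
explanation <- lime::explain(
  loan_test_rec[1:4, ] %>% select(-loan_status),
  explainer,
  n_labels = 1,
  n_features = 5
)
plot_features(explanation)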

## Warning: delinq_2yrs does not contain enough variance to use quantile
## binning. Using standard binning instead.
## Warning: pub_rec does not contain enough variance to use quantile binning.
## Using standard binning instead.