The data is about personal loans, and the challenge is to build a model that can distinguish between customers who pay off their loans and those who end up charged off. To me, the real question is: who is more likely to be charged off in the future, and can we intervene beforehand so that they pay off their loan instead of being charged off? However, because only 2% of the data consists of charge-offs, no machine learning method will be able to predict both classes accurately. I therefore frame the challenge as "how likely is a charge-off?" and I will show that a customer can be flagged as a likely charge-off with a positive predictive value of 83%. To build a better predictive model and get a better representation of the outcome, I would recommend that loan-rejection customer information also be included in future data, so that we can look for a possible relationship between rejection and charge-off. Another indicator worth adding is delinquency in the last 6 months; delinquency in the last 2 years simply does not have enough predictive power.
Challenge 1: Programmatically download the data
# first, the code checks your directory structure for a data folder and, if it is not there, creates one
if (!dir.exists("data")) {
  dir.create("data")
}
# next, the code checks the contents of the data folder and, if empty, downloads and unzips the
# data into that folder
if (length(dir("data")) == 0) {
  dwnld <- tempfile()
  download.file("https://s3.us-east-2.amazonaws.com/example.data/lacare_ml_challenge.zip", dwnld)
  unzip(dwnld, exdir = "data")
  unlink(dwnld)
}
The data set is heavily skewed and requires careful attention.
## Parsed with column specification:
## cols(
## .default = col_double(),
## loan_status = col_character(),
## purpose = col_character(),
## home_ownership = col_character(),
## grade = col_character(),
## emp_length = col_character(),
## term = col_character(),
## addr_state = col_character(),
## verification_status = col_character(),
## application_type = col_character()
## )
## See spec(...) for full column specifications.
## Observations: 1,062,952
## Variables: 22
## $ dti <dbl> 30.46, 50.53, 18.92, 4.64, 12.37, 22.…
## $ annual_inc <dbl> 100000, 45000, 100000, 38500, 450000,…
## $ delinq_2yrs <dbl> NA, NA, NA, 0, NA, 0, NA, NA, 0, NA, …
## $ inq_last_6mths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ open_acc <dbl> 11, 18, 9, 12, 21, 10, 6, 3, 11, 21, …
## $ pub_rec <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ revol_bal <dbl> 15603, 34971, 25416, 4472, 36812, 333…
## $ revol_util <dbl> 37.0, 64.5, 29.9, 15.3, 65.7, 45.9, 5…
## $ total_acc <dbl> 19, 37, 19, 25, 37, 16, 13, 6, 18, 28…
## $ out_prncp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ out_prncp_inv <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_rec_late_fee <dbl> 0, 0, NA, NA, 0, 0, NA, 0, NA, 0, 0, …
## $ collections_12_mths_ex_med <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_status <chr> "Fully Paid", "Fully Paid", "Fully Pa…
## $ purpose <chr> "debt_consolidation", "credit_card", …
## $ home_ownership <chr> "MORTGAGE", "MORTGAGE", "MORTGAGE", "…
## $ grade <chr> "D", "C", "A", "B", "E", "D", "B", "C…
## $ emp_length <chr> "5 years", "< 1 year", "10+ years", "…
## $ term <chr> "36 months", "60 months", "36 months"…
## $ addr_state <chr> "CA", "OH", "WA", "TX", "MA", "WA", "…
## $ verification_status <chr> "Source Verified", "Verified", "Not V…
## $ application_type <chr> "Joint App", "Joint App", "Joint App"…
From the output, we observe that loan_status has values in a non-standard format (e.g. "Fully Paid", "Charged Off") that has to be addressed. The distribution of charged off to fully paid is also disproportionately skewed towards fully paid, with only about 2% of the data represented by charge-offs, so we have to think about a strategy to deal with this imbalance.
I’ll be using the DataExplorer package to look at the overall data structure.
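A minimal sketch of the DataExplorer calls behind the structure and missing-data figures; `loan_data` is a placeholder name for the loaded data frame, not necessarily the name used in the original chunk.
library(DataExplorer)

# high-level structure: row/column counts, share of missing values, memory use
plot_intro(loan_data)

# per-variable missing-data profile (the basis for the percentages quoted below)
plot_missing(loan_data)

# distributions of the numeric variables, to spot skewness (e.g. open_acc, total_acc)
plot_histogram(loan_data)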
Five variables have missing data: total_rec_late_fee and delinq_2yrs are each about 33% missing, dti has 0.02% missing observations, revol_util has 0.06% missing observations, and inq_last_6mths has a single missing observation.
open_acc and total_acc are skewed and could use a transformation to address this.
Before pre-processing the data, let's find out which variables we can drop due to zero variance or near-zero variance (ZV/NZV). These are variables with little or no variability in their distribution, which suggests they contribute little in a modeling scenario.
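The metrics below likely come from a call along these lines (a sketch; `loan_data` is again a placeholder name for the data frame):
library(caret)

# freqRatio, percentUnique, zeroVar and nzv flags for every column
nzv_metrics <- nearZeroVar(loan_data, saveMetrics = TRUE)
nzv_metrics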
## freqRatio percentUnique zeroVar nzv
## dti 1.002591 5.997449e-01 FALSE FALSE
## annual_inc 1.134768 5.121868e+00 FALSE FALSE
## delinq_2yrs 6.391808 2.634174e-03 FALSE FALSE
## inq_last_6mths 2.175604 8.466986e-04 FALSE FALSE
## open_acc 1.031607 7.808443e-03 FALSE FALSE
## pub_rec 6.068576 3.104562e-03 FALSE FALSE
## revol_bal 59.704545 7.431662e+00 FALSE TRUE
## revol_util 2.811106 1.247469e-01 FALSE FALSE
## total_acc 1.003499 1.307679e-02 FALSE FALSE
## out_prncp 0.000000 9.407763e-05 TRUE TRUE
## out_prncp_inv 0.000000 9.407763e-05 TRUE TRUE
## total_rec_late_fee 145.948761 4.336038e-01 FALSE TRUE
## collections_12_mths_ex_med 73.194047 1.411164e-03 FALSE TRUE
## loan_status 49.616762 1.881553e-04 FALSE TRUE
## purpose 2.530201 1.317087e-03 FALSE FALSE
## home_ownership 1.334914 5.644658e-04 FALSE FALSE
## grade 1.135453 6.585434e-04 FALSE FALSE
## emp_length 3.678981 1.128932e-03 FALSE FALSE
## term 3.820710 1.881553e-04 FALSE FALSE
## addr_state 1.793739 4.797959e-03 FALSE FALSE
## verification_status 1.196600 2.822329e-04 FALSE FALSE
## application_type 58.181115 1.881553e-04 FALSE TRUE
How does dropping the missing delinq_2yrs observations affect the charged off counts in loan_status?
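A sketch of the count that produces the table below (again assuming the placeholder name `loan_data`):
library(dplyr)

# count loan_status after dropping rows where delinq_2yrs is missing
loan_data %>%
  filter(!is.na(delinq_2yrs)) %>%
  group_by(loan_status) %>%
  summarise(n())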
## # A tibble: 2 x 2
## loan_status `n()`
## <chr> <int>
## 1 Charged Off 14036
## 2 Fully Paid 698142
We see that dropping the missing observations of delinq_2yrs would reduce the charged off count by about a third, so we have to perform imputation instead.
Let's look at the distribution of the delinq_2yrs variable and see if we can find a pattern.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 6 x 3
## # Groups: loan_status [2]
## loan_status delinq_2yrs `n()`
## <chr> <dbl> <int>
## 1 Charged Off 0 11040
## 2 Charged Off 1 1954
## 3 Charged Off 2 576
## 4 Fully Paid 0 566217
## 5 Fully Paid 1 88358
## 6 Fully Paid 2 25412
We see that most values are 0, and that 81% of the charge-off accounts have a value of 0. Since the charge-off accounts represent only between 0.5% and 1.5% of all the data, no single imputation method will give us an accurate estimate for them, so it is best in this case to simply assign 0 (the majority value) to all missing observations.
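A sketch of that imputation step, under the same `loan_data` naming assumption:
library(dplyr)
library(tidyr)

# assign the majority value (0) to every missing delinq_2yrs observation
loan_data <- loan_data %>%
  mutate(delinq_2yrs = replace_na(delinq_2yrs, 0))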
Let's take a closer look at them.
## # A tibble: 10 x 3
## # Groups: loan_status [2]
## loan_status delinq_2yrs `n()`
## <chr> <dbl> <int>
## 1 Charged Off 0 11040
## 2 Charged Off 1 1954
## 3 Charged Off 2 576
## 4 Charged Off 3 246
## 5 Charged Off 4 108
## 6 Fully Paid 0 566217
## 7 Fully Paid 1 88358
## 8 Fully Paid 2 25412
## 9 Fully Paid 3 9137
## 10 Fully Paid 4 4024
We can see that over 50% of the values are 0, followed by 1 with about 10%; the remaining observations are spread across the other values of delinq_2yrs.
Looking at the categorical variables and their structure
## 1 columns ignored with more than 50 categories.
## addr_state: 51 categories
From the plot, we can see that purpose and home_ownership have several small categories, so creating an alternate category that encompasses the smaller ones should help overall prediction.
Looking at the “purpose” categories for decision making
## # A tibble: 28 x 3
## # Groups: purpose [14]
## purpose loan_status `n()`
## <chr> <chr> <int>
## 1 car Charged Off 183
## 2 car Fully Paid 12053
## 3 credit_card Charged Off 3841
## 4 credit_card Fully Paid 237058
## 5 debt_consolidation Charged Off 12971
## 6 debt_consolidation Fully Paid 596552
## 7 educational Charged Off 4
## 8 educational Fully Paid 270
## 9 home_improvement Charged Off 1254
## 10 home_improvement Fully Paid 69410
## # … with 18 more rows
We can see that most of the smaller categories have very few charged off observations, so the best strategy is to create a new variable with only four categories that rolls the smaller ones into an "other" category.
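One way to do that roll-up with forcats is sketched below; the three categories kept (plus "other") are my assumption about which four end up in the final variable, not a confirmed choice.
library(dplyr)
library(forcats)

# keep the largest purposes and roll everything else into "other"
# (assumed choice of categories; adjust as needed)
loan_data <- loan_data %>%
  mutate(purpose_grp = fct_other(
    factor(purpose),
    keep = c("debt_consolidation", "credit_card", "home_improvement"),
    other_level = "other"
  ))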
Looking at home_ownership for decision making
## # A tibble: 11 x 3
## # Groups: home_ownership [6]
## home_ownership loan_status `n()`
## <chr> <chr> <int>
## 1 ANY Charged Off 12
## 2 ANY Fully Paid 215
## 3 MORTGAGE Charged Off 8974
## 4 MORTGAGE Fully Paid 533834
## 5 NONE Fully Paid 41
## 6 OTHER Charged Off 4
## 7 OTHER Fully Paid 117
## 8 OWN Charged Off 2298
## 9 OWN Fully Paid 110833
## 10 RENT Charged Off 9712
## 11 RENT Fully Paid 396912
We can see from the result that dropping the observations in the ANY, NONE, and OTHER categories will have minimal effect on our data.
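A sketch of that filter, again using the placeholder name `loan_data`:
library(dplyr)

# drop the sparsely populated home_ownership categories
loan_data <- loan_data %>%
  filter(!home_ownership %in% c("ANY", "NONE", "OTHER"))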
Using the information from the previous EDA steps, I will now prep the data for modeling.
One final review of the dataset: the following function creates a full EDA report of the final dataset.
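That full report is likely produced by something like the DataExplorer call below (the output file name is illustrative):
library(DataExplorer)

# one-call EDA report: structure, missingness, distributions, correlations
create_report(loan_data_full_temp, output_file = "final_eda_report.html")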
Splitting the data into training and test sets, 3/4 train to 1/4 test, using a custom seed for reproducibility.
set.seed(3465)
data_split <- initial_split(loan_data_full_temp, strata = "loan_status")
loan_train <- training(data_split)
loan_test <- testing(data_split)
loan_rec <- loan_train %>%
  recipe(loan_status ~ .) %>%
  step_YeoJohnson(open_acc, total_acc) %>% # extension of the Box-Cox transformation, to deal with skewness of the data
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()
# using the above recipe to finalize the test and training sets
loan_test_rec <- loan_rec %>%
  bake(loan_test)
loan_train_rec <- juice(loan_rec)
The problem with this data is that no matter what method we choose, the class imbalance will not allow for great overall accuracy. The prediction problem has to be approached from a different perspective: how much risk are we willing to take on customers who may need a charge-off? The model will still produce predictions, but we will use the resulting probabilities to build a categorical variable that assigns each customer a risk rank: no risk, mild risk, minor risk, or major risk. I have chosen to fit a generalized boosted regression model (GBM), a method that can reduce both bias and variance and executes more efficiently than comparable methods. I am also going to use h2o for the modeling because of its distributed functionality.
Using the caret package for modeling, we declare a control object that specifies how we want the model trained, e.g. with cross-validation; to deal with our imbalanced dependent variable, I use down-sampling of the data in order to achieve better results.
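A sketch of that setup, assuming 5-fold cross-validation (the fold count is my assumption) and the object names from the recipe step above; loan_status is assumed to be a factor with levels charged_off / fully_paid.
library(caret)
library(h2o)

h2o.init()  # start a local h2o cluster for the gbm_h2o backend

# cross-validation with down-sampling of the majority class inside each
# resample; class probabilities and an ROC-based summary suit the imbalance
ctrl <- trainControl(
  method = "cv",
  number = 5,
  sampling = "down",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

set.seed(3465)
gbm_fit <- train(
  loan_status ~ .,
  data = loan_train_rec,
  method = "gbm_h2o",  # h2o-backed gradient boosting machine
  metric = "ROC",
  trControl = ctrl
)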
Looking for optimal hyperparameters (our method will choose these for us; however, if we want to improve on the model in the future, tuning them would be one way to explore model improvement).
The top ten variables as determined by our model
## TrainROC TrainSens TrainSpec method
## 1 0.7037325 0.3586677 0.8608503 gbm_h2o
Because our data is extremely skewed, accuracy cannot be the metric of importance; instead, we can use the ROC curve to determine an optimal cutoff point for finding our target customers.
Performance of model as viewed on ROC curve
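The curve and the "Setting direction" message below come from pROC; a sketch of the calls, reusing `gbm_fit` from the sketch above and assuming the probability column is named after the charged_off level:
library(pROC)

# predicted probability of charge-off on the training data
train_probs <- predict(gbm_fit, newdata = loan_train_rec, type = "prob")

roc_train <- roc(loan_train_rec$loan_status, train_probs$charged_off)
plot(roc_train, print.auc = TRUE)

# a candidate cutoff that balances sensitivity and specificity
coords(roc_train, x = "best", best.method = "youden")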
## Setting direction: controls < cases
Predicting test set with probabilities
ROC curve of prediction
Looking at the ROC curve of the test results reveals that the model predicts far more fully paid customers than charged off customers, which is as expected given the imbalance.
Finding the PPV (positive predictive value, with charged off as the positive class)
## prediction
## loan_status charged_off fully_paid
## charged_off 4492 901
## fully_paid 146853 113394
I use a 0.4 cutoff instead of 0.5, which produces a higher PPV at the cost of a lower NPV; if the goal is to identify the bad apples, the exact cutoff value should be determined by a content expert.
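A sketch of how the 0.4 cutoff turns probabilities into classes and a PPV figure; `gbm_fit` is the model name from the sketch above, and the column name assumes a charged_off factor level.
library(caret)

# predicted probability of charge-off on the test set
test_probs <- predict(gbm_fit, newdata = loan_test_rec, type = "prob")

# classify as charged_off whenever the probability exceeds the 0.4 cutoff
pred_class <- factor(
  ifelse(test_probs$charged_off >= 0.4, "charged_off", "fully_paid"),
  levels = levels(loan_test_rec$loan_status)
)

# PPV is reported as "Pos Pred Value" with charged_off as the positive class
confusionMatrix(pred_class, loan_test_rec$loan_status, positive = "charged_off")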
I will use some arbitrary cutoffs on the predicted probabilities to assign a risk rank to each customer.
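One way to bin the probabilities into the four risk ranks is sketched below; the breakpoints are illustrative placeholders, not tuned thresholds, and `test_probs` is reused from the sketch above.
# bin the predicted charge-off probability into the four risk ranks
# (the breakpoints 0.25 / 0.40 / 0.60 are illustrative only)
risk_rank <- cut(
  test_probs$charged_off,
  breaks = c(-Inf, 0.25, 0.40, 0.60, Inf),
  labels = c("no risk", "mild risk", "minor risk", "major risk")
)
table(risk_rank)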
As extra content to help explain the results, the lime package identifies the variables that act on an individual observation's prediction, given the model and the dataset.
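A sketch of the lime calls that produce the binning warnings and plots below; the number of observations and features explained is illustrative, and `gbm_fit` is again the model name from the earlier sketch.
library(dplyr)
library(lime)

# build an explainer from the training features and the fitted model
explainer <- lime(select(loan_train_rec, -loan_status), gbm_fit)

# explain a few test observations: which variables push each prediction
# towards charged_off or fully_paid
explanation <- explain(
  head(select(loan_test_rec, -loan_status), 4),
  explainer,
  n_labels = 1,
  n_features = 5
)
plot_features(explanation)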
## Warning: delinq_2yrs does not contain enough variance to use quantile
## binning. Using standard binning instead.
## Warning: pub_rec does not contain enough variance to use quantile binning.
## Using standard binning instead.