Abstract

This project has two parts. In the first part, I implemented five different machine learning algorithms to classify the loan status of approved LendingClub loans issued from 2012 to 2014. For each algorithm, I tuned the parameters to achieve the best performance. Comparing the results, I determined that XGBoost was the best model for this data set. In the second part, I used the XGBoost model to predict the loan status of loans issued in 2015 and evaluated its performance.

Part 1. Finding the Best Model

Introduction

Before implementing the machine learning algorithms, I cleaned the data set. The response variable loan_status had seven levels, and I was only interested in predicting whether the observations were fully paid or not. Therefore, I kept the ‘Fully Paid’ level and defined all the other levels as ‘not fully paid’. I dropped the ID variables, categorical variables with large numbers of levels, and variables with high rates of missing values. I also removed the rows containing any missing values. This produced a tidy data set with 416590 observations and 54 variables. I randomly selected a sample of 10000 observations and split it into training and testing data sets using a 75-25 split. I then developed five models: KNN, Random Forest, C5.0, Logistic Regression using regularization, and XGBoost. To improve the performance of each model, I tuned its parameters to achieve the best accuracy. Finally, I compared and ranked the classification accuracy of the models.

Collect Data

library(pacman)
p_load(data.table, tidymodels, tidyverse, lubridate, visdat,
       janitor, vip, knitr)
df <- fread('data/accepted_2007_to_2018Q4.csv')

df_2012_to_2014 <- df %>%
  mutate(year = year(mdy(issue_d))) %>%
  filter(year %in% c(2012, 2013, 2014))
fwrite(df_2012_to_2014, 'data/accepted_2012_to_2014.csv')
df_2012_to_2014 <- fread('data/accepted_2012_to_2014.csv')

Use the clean_names() function to convert all variable names to lowercase with underscores.

df_2012_to_2014 <- df_2012_to_2014 %>% clean_names()

Table of the variable names.

col <- colnames(df_2012_to_2014)
variable_names <- tibble(col[1:38], col[39:76], col[77:114], col[115:152])
kable(variable_names)
col[1:38] col[39:76] col[77:114] col[115:152]
id out_prncp total_cu_tl total_il_high_credit_limit
member_id out_prncp_inv inq_last_12m revol_bal_joint
loan_amnt total_pymnt acc_open_past_24mths sec_app_fico_range_low
funded_amnt total_pymnt_inv avg_cur_bal sec_app_fico_range_high
funded_amnt_inv total_rec_prncp bc_open_to_buy sec_app_earliest_cr_line
term total_rec_int bc_util sec_app_inq_last_6mths
int_rate total_rec_late_fee chargeoff_within_12_mths sec_app_mort_acc
installment recoveries delinq_amnt sec_app_open_acc
grade collection_recovery_fee mo_sin_old_il_acct sec_app_revol_util
sub_grade last_pymnt_d mo_sin_old_rev_tl_op sec_app_open_act_il
emp_title last_pymnt_amnt mo_sin_rcnt_rev_tl_op sec_app_num_rev_accts
emp_length next_pymnt_d mo_sin_rcnt_tl sec_app_chargeoff_within_12_mths
home_ownership last_credit_pull_d mort_acc sec_app_collections_12_mths_ex_med
annual_inc last_fico_range_high mths_since_recent_bc sec_app_mths_since_last_major_derog
verification_status last_fico_range_low mths_since_recent_bc_dlq hardship_flag
issue_d collections_12_mths_ex_med mths_since_recent_inq hardship_type
loan_status mths_since_last_major_derog mths_since_recent_revol_delinq hardship_reason
pymnt_plan policy_code num_accts_ever_120_pd hardship_status
url application_type num_actv_bc_tl deferral_term
desc annual_inc_joint num_actv_rev_tl hardship_amount
purpose dti_joint num_bc_sats hardship_start_date
title verification_status_joint num_bc_tl hardship_end_date
zip_code acc_now_delinq num_il_tl payment_plan_start_date
addr_state tot_coll_amt num_op_rev_tl hardship_length
dti tot_cur_bal num_rev_accts hardship_dpd
delinq_2yrs open_acc_6m num_rev_tl_bal_gt_0 hardship_loan_status
earliest_cr_line open_act_il num_sats orig_projected_additional_accrued_interest
fico_range_low open_il_12m num_tl_120dpd_2m hardship_payoff_balance_amount
fico_range_high open_il_24m num_tl_30dpd hardship_last_payment_amount
inq_last_6mths mths_since_rcnt_il num_tl_90g_dpd_24m disbursement_method
mths_since_last_delinq total_bal_il num_tl_op_past_12m debt_settlement_flag
mths_since_last_record il_util pct_tl_nvr_dlq debt_settlement_flag_date
open_acc open_rv_12m percent_bc_gt_75 settlement_status
pub_rec open_rv_24m pub_rec_bankruptcies settlement_date
revol_bal max_bal_bc tax_liens settlement_amount
revol_util all_util tot_hi_cred_lim settlement_percentage
total_acc total_rev_hi_lim total_bal_ex_mort settlement_term
initial_list_status inq_fi total_bc_limit year

Table of the number of rows and columns in the data frame.

n_row <- nrow(df_2012_to_2014)
n_col <- ncol(df_2012_to_2014)
value <- c(n_row, n_col)
name <- c('n_rows', 'n_columns')
data_str <- data.frame(name, value)
kable(data_str)
name value
n_rows 423810
n_columns 152

Exploring and Preparing the Data

Examine and drop variables with high rates of missing values. Using the skim() function, I found that some variables contain large numbers of blank values, which are not recognized by the is.na() function.

skimr::skim(df_2012_to_2014)
lots_empty <- df_2012_to_2014 %>% 
  select(c(emp_length, desc, next_pymnt_d, hardship_type:hardship_loan_status, 
           debt_settlement_flag_date:settlement_date )) %>% 
  names()

lots_miss <- sapply(df_2012_to_2014, function(x){sum(is.na(x))}) > 20000
lots_miss_col <- names(lots_miss)[lots_miss]

df_2012_to_2014 <- df_2012_to_2014 %>% select(-all_of(lots_miss_col), -all_of(lots_empty))
ncol(df_2012_to_2014)
## [1] 69
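
As an aside, these blank strings could also be converted to NA when reading the file, so that the is.na() check above would catch them directly (a sketch, assuming '' is the only blank marker in the file):

df_2012_to_2014 <- fread('data/accepted_2012_to_2014.csv',
                         na.strings = c('', 'NA'))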

Examine and drop id variables and categorical variables with large numbers of levels.

id_feature <- sapply(df_2012_to_2014, function(x){length(unique(x))}) == 1 
id_feature_col <- names(id_feature)[id_feature]
lots_level_cat <- c('emp_title', 'title', 'url') 
unique <- c('hardship_flag', 'pymnt_plan')
meaning_less <- c('zip_code', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d')

df_2012_to_2014 <- df_2012_to_2014 %>% 
  select(-c(all_of(id_feature_col), all_of(lots_level_cat), 
            all_of(unique), all_of(meaning_less),
            id, funded_amnt, funded_amnt_inv)) %>%
  mutate_if(is.character, factor)

ncol(df_2012_to_2014)
## [1] 54

Summarize four numeric features: loan_amnt, annual_inc, installment, and int_rate.

df_2012_to_2014 %>% 
  select(c(loan_amnt, annual_inc, installment, int_rate)) %>%
  summary()
##    loan_amnt       annual_inc       installment         int_rate    
##  Min.   : 1000   Min.   :   3000   Min.   :   4.93   Min.   : 6.00  
##  1st Qu.: 8000   1st Qu.:  45000   1st Qu.: 267.60   1st Qu.:10.99  
##  Median :12800   Median :  63500   Median : 390.74   Median :13.98  
##  Mean   :14641   Mean   :  73690   Mean   : 443.02   Mean   :14.00  
##  3rd Qu.:20000   3rd Qu.:  90000   3rd Qu.: 578.68   3rd Qu.:16.99  
##  Max.   :35000   Max.   :7500000   Max.   :1409.99   Max.   :26.06

Summarize four categorical features: grade, emp_length, home_ownership, and purpose. (Note that emp_length was already dropped with the mostly-empty columns, so its table below is empty.)

table(df_2012_to_2014$grade)
## 
##      A      B      C      D      E      F      G 
##  64688 124558 116574  70884  32365  11931   2810
table(df_2012_to_2014$emp_length)
## < table of extent 0 >
table(df_2012_to_2014$home_ownership)
## 
##      ANY MORTGAGE     NONE    OTHER      OWN     RENT 
##        1   216949       42       46    38461   168311
table(df_2012_to_2014$purpose)
## 
##                car        credit_card debt_consolidation   home_improvement 
##               3783              98692             254457              23342 
##              house     major_purchase            medical             moving 
##               1843               7517               3850               2366 
##              other   renewable_energy     small_business           vacation 
##              19267                245               5022               2087 
##            wedding 
##               1339

The response variable loan_status has seven levels, and I am only interested in predicting whether the observations are fully paid or not. Therefore, I keep the ‘Fully Paid’ level and define all the other levels as ‘not fully paid’.

table(df_2012_to_2014$loan_status)
## 
##        Charged Off            Current            Default         Fully Paid 
##              70829              11925                  1             340444 
##    In Grace Period  Late (16-30 days) Late (31-120 days) 
##                201                 73                337
df_2012_to_2014 <- df_2012_to_2014 %>%
  mutate(loan_status = ifelse(loan_status == 'Fully Paid', 
                              'fully_paid', 'not_full_paid')) 
table(df_2012_to_2014$loan_status)
## 
##    fully_paid not_full_paid 
##        340444         83366

Remove the rows containing missing values, then randomly take a sample of 10000 observations on which to implement the machine learning algorithms. There are 54 variables in this sample data set.

set.seed(999)
df_2012_to_2014 <- df_2012_to_2014 %>% na.omit()
n <- nrow(df_2012_to_2014)
index <- sample(n, 10000)
dfs <- df_2012_to_2014[index] 
nrow(dfs)
## [1] 10000
ncol(dfs)
## [1] 54

Summarize the y-variable. The table shows that 79.94% of the observations in the sample data set have a loan_status of ‘fully_paid’.

dfs %>% count(loan_status) %>%
  mutate(freq = n / sum(n))

Make the initial split, with 75% of the data going to the training data set.

set.seed(999)
dfs_split <- initial_split(dfs, prop = 0.75)

train <- dfs_split %>% training() 
test <- dfs_split %>% testing() 

Prepare the data and create the recipe for the ML models. Use step_nzv() to remove any columns that have very low variability.

set.seed(999)
train_recipe <- train %>% 
  recipe(loan_status ~ .) %>%
  step_nzv(all_predictors()) %>%
  prep()

train_prep <- juice(train_recipe)
test_prep <- train_recipe %>% bake(test)
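
To see which columns step_nzv() actually removed, the prepped recipe can be inspected (output not shown):

tidy(train_recipe, number = 1)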

Training a Model on the Data

Model 0: null

The accuracy of the null model is 0.8036.

mod_null <- null_model() %>%
  set_mode('classification') %>%
  fit(loan_status ~ ., data = train_prep) 
mod_null_r <- mod_null %>% predict(test_prep) %>%
  bind_cols(test_prep) 
  
mod_null_r %>% metrics(truth = loan_status, 
                       estimate = .pred_class)
mod_null_r %>% conf_mat(truth = loan_status, 
                       estimate = .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          2009           491
##   not_full_paid          0             0
mod_null %>% predict(test_prep, type = 'prob') %>%
  bind_cols(test_prep) %>% 
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()
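
For reference, the null model always predicts the majority class, so its accuracy simply equals the proportion of fully_paid observations in the test set (a quick check; output not shown):

test_prep %>% count(loan_status) %>%
  mutate(freq = n / sum(n))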

Model 1: KNN

Prepare the data: select all the numeric variables and normalize them.

t <- train %>% select(-loan_status)
numeric <- unlist(lapply(t, is.numeric))
categorical_var <- names(numeric[!numeric])

recipe_knn <- train %>%
  recipe(loan_status ~ .) %>%
  step_rm(all_of(categorical_var)) %>%
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()
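
As an aside, the same selection could be written inside the recipe itself instead of building categorical_var by hand; the following sketch should be equivalent to the recipe above:

recipe_knn_alt <- train %>%
  recipe(loan_status ~ .) %>%
  step_rm(all_nominal(), -all_outcomes()) %>%  # drop the factor predictors
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()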

Set up the model using a workflow.

mod_knn <- nearest_neighbor() %>% 
  set_engine("kknn") %>%
  set_mode("classification") 

wf_knn <- workflow() %>%
  add_recipe(recipe_knn) %>%
  add_model(mod_knn)

Fit and evaluate the model using vfold_cv cross validation.

set.seed(999)
doParallel::registerDoParallel()

folds <- vfold_cv(train , v = 10)

fit_knn_rs <- wf_knn %>% 
  fit_resamples(folds, 
                control = control_resamples(save_pred = TRUE))
collect_metrics(fit_knn_rs)

Evaluating Model Performance

Estimate performance on the testing data. The accuracy is 0.8820, which is an improvement over the null model (0.8036).

test_knn <- wf_knn %>%
  last_fit(dfs_split)
acc_knn <- collect_metrics(test_knn)
acc_knn
collect_predictions(test_knn) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          1896           182
##   not_full_paid        113           309
collect_predictions(test_knn) %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Improving Model Performance

Tuning and finding the best parameter K.

tune_knn <- 
  nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>% 
  set_mode("classification")
knn_wflow <-
  workflow() %>%
  add_recipe(recipe_knn) %>%
  add_model(tune_knn)
set.seed(999)
doParallel::registerDoParallel()

knn_fit_rs <- 
  knn_wflow %>% 
  tune_grid(
    resamples = folds,
    grid = 20
    )
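
Here grid = 20 lets tune_grid() choose 20 candidate values of neighbors automatically. An explicit grid could be supplied instead (a sketch; the range of odd k values is illustrative):

knn_grid <- tibble(neighbors = seq(1, 39, by = 2))
# knn_fit_rs <- knn_wflow %>%
#   tune_grid(resamples = folds, grid = knn_grid)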

The plot shows the best k for the model is 11.

autoplot(knn_fit_rs)

The table shows the same result as the plot above: k = 11 is the best value for the KNN algorithm on this data set.

knn_fit_rs %>% show_best("accuracy")
best_knn <- knn_fit_rs %>%
  select_best("accuracy")
best_knn 

The accuracy on the test data set is 0.906. The model performance does increase after tuning with the best parameter k = 11, compared to the untuned model (0.8820).

final_knn <- 
  knn_wflow %>% 
  finalize_workflow(best_knn) %>%
  last_fit(dfs_split) 

acc_knn <- final_knn %>% 
    collect_metrics()
acc_knn
collect_predictions(final_knn) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          1978           204
##   not_full_paid         31           287
collect_predictions(final_knn) %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Model 2: Random Forest

Set up the model using a workflow.

mod_rf <- 
  rand_forest(trees = 100) %>%
  set_mode('classification') %>%
  set_engine('ranger')

workflow_rf <-
  workflow() %>%
  add_recipe(train_recipe) %>%
  add_model(mod_rf)

Fit and evaluate the model using vfold_cv cross validation.

set.seed(999)
folds <- vfold_cv(train , v = 10)

doParallel::registerDoParallel()
fit_rf_rs <- workflow_rf %>%
  fit_resamples(folds, 
                control = control_resamples(save_pred = TRUE))
collect_metrics(fit_rf_rs)

Evaluating Model Performance

Estimate performance on the testing data. The accuracy is 0.9812 and the roc_auc is 0.9982, a large improvement over the null model (0.8036).

test_rf <- workflow_rf %>%
  last_fit(dfs_split)
acc_rf <- collect_metrics(test_rf)
acc_rf
collect_predictions(test_rf) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          1997            35
##   not_full_paid         12           456
collect_predictions(test_rf) %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Improving Model Performance

Tuning and finding the best parameters mtry and min_n.

set.seed(999)
folds <- vfold_cv(train , v = 10)

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 100) %>%
  set_mode('classification') %>%
  set_engine('ranger')

rf_workflow <-
  workflow() %>%
  add_recipe(train_recipe) %>%
  add_model(rf_spec)

set.seed(999)
doParallel::registerDoParallel()
rf_fit_rs <- 
  tune_grid(rf_workflow,
            resamples = folds,
            grid = 10)
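
The grid = 10 argument asks tune_grid() to generate 10 candidate (mtry, min_n) pairs on its own. A regular grid could be used instead (a sketch; the ranges are illustrative):

rf_grid <- grid_regular(mtry(range = c(5, 25)),
                        min_n(range = c(2, 40)),
                        levels = 5)
# rf_fit_rs <- tune_grid(rf_workflow, resamples = folds, grid = rf_grid)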

The results show the best value of mtry is 13, meaning 13 predictors are randomly sampled at each split, and the best min_n is 14, meaning a node must contain at least 14 data points before it can be split further.

autoplot(rf_fit_rs)

rf_fit_rs %>% show_best("accuracy")
best_rf <- rf_fit_rs %>%
  select_best("accuracy")
best_rf 

The accuracy on the test data set is 0.9844. The model performance does increase after tuning with the best parameters mtry = 13 and min_n = 14, compared to the untuned model (0.9812).

final_rf <- 
  rf_workflow %>% 
  finalize_workflow(best_rf) %>%
  last_fit(dfs_split) 

acc_rf <- final_rf %>% 
    collect_metrics()
acc_rf
collect_predictions(final_rf) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          1992            21
##   not_full_paid         17           470
mod_rf_aoc <- collect_predictions(final_rf)
mod_rf_aoc %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Explore the feature importance.

mod_rf_importance <- rand_forest(trees = 100) %>%
  set_engine('ranger', importance = 'permutation') %>%
  set_mode('classification') %>%
  fit(loan_status ~ ., data = train_prep) %>%
  vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))

mod_rf_importance

Model 3: Logistic Regression using regularization

Set up and fit the model.

mod_glm <- logistic_reg(penalty = 0.001, mixture = 0.5) %>%
  set_engine('glmnet') %>%
  set_mode('classification') %>%
  fit(loan_status ~ ., data = train_prep)

Evaluating Model Performance

Estimate performance on the testing data. The accuracy is 0.9776 and the roc_auc is 0.9269, a large improvement over the null model (0.8036).

mod_glm_r <- mod_glm %>%
  predict(test_prep) %>%
  bind_cols(test_prep)

acc_glm <- mod_glm_r %>% 
  metrics(truth = loan_status, estimate = .pred_class)
acc_glm
mod_glm_r %>% conf_mat(truth = loan_status, estimate = .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          2000            47
##   not_full_paid          9           444
mod_glm_aoc <- mod_glm %>% predict(test_prep, type = 'prob') %>%
  bind_cols(test_prep) 
mod_glm_aoc %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Explore the feature importance.

mod_glm_importance <- logistic_reg(penalty = 0.001, mixture = 0.5) %>%
  set_engine('glmnet') %>%
  set_mode('classification') %>%
  fit(loan_status ~ ., data = train_prep) %>%
  vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))

mod_glm_importance
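
Unlike the other models, the penalty and mixture values here were fixed rather than tuned. If desired, they could be tuned with the same resampling setup (a sketch only, not run here; the recipe adds dummy variables because glmnet needs numeric predictors):

glm_recipe <- train %>%
  recipe(loan_status ~ .) %>%
  step_nzv(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_predictors())

glm_tune <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine('glmnet') %>%
  set_mode('classification')

glm_workflow <- workflow() %>%
  add_recipe(glm_recipe) %>%
  add_model(glm_tune)
# glm_fit_rs <- tune_grid(glm_workflow, resamples = folds, grid = 10)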

Model 4: C5.0

Set up the model using a workflow.

set.seed(999)
mod_c50 <- boost_tree(trees = 40) %>%
  set_engine('C5.0') %>%
  set_mode('classification')

workflow_c50 <-
  workflow() %>%
  add_recipe(train_recipe) %>%
  add_model(mod_c50)

Fit and evaluate the model using vfold_cv cross validation.

set.seed(999)
folds <- vfold_cv(train , v = 10)

doParallel::registerDoParallel()
fit_c50_rs <- workflow_c50 %>%
  fit_resamples(folds, 
                control = control_resamples(save_pred = TRUE))
collect_metrics(fit_c50_rs)

Evaluating Model Performance

Estimate performance on the testing data. The accuracy is 0.9892 and the roc_auc is 0.9982, a large improvement over the null model.

test_c50 <- workflow_c50 %>%
  last_fit(dfs_split)
acc_c50 <- collect_metrics(test_c50)
acc_c50
collect_predictions(test_c50) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          2001            19
##   not_full_paid          8           472
collect_predictions(test_c50) %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Improving Model Performance

Tuning and finding the best parameter min_n.

set.seed(999)
folds <- vfold_cv(train , v = 10)

c50_tune <- 
  boost_tree(min_n = tune(), trees = 40) %>%
  set_mode('classification') %>%
  set_engine('C5.0')

c50_workflow <-
  workflow() %>%
  add_recipe(train_recipe) %>%
  add_model(c50_tune)
set.seed(999)
doParallel::registerDoParallel()
c50_fit_rs <- 
  tune_grid(c50_workflow,
            resamples = folds,
            grid = 10)

The results show the best min_n is 31, meaning a node must contain at least 31 data points before it can be split further.

autoplot(c50_fit_rs)

c50_fit_rs %>% show_best("accuracy")
best_c50 <- c50_fit_rs %>%
  select_best("accuracy")
best_c50

The accuracy on the test data set is 0.9900 and the roc_auc is 0.9991. The model performance does increase after tuning with the best parameter min_n = 31, compared to the untuned model (0.9892, 0.9982).

final_c50 <- 
  c50_workflow %>% 
  finalize_workflow(best_c50) %>%
  last_fit(dfs_split) 

acc_c50 <- final_c50 %>% 
    collect_metrics()
acc_c50
collect_predictions(final_c50) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          2002            18
##   not_full_paid          7           473
mod_c50_aoc <- collect_predictions(final_c50)
mod_c50_aoc %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Explore the feature importance.

mod_c50_importance <- boost_tree(trees = 40) %>%
  set_engine('C5.0', importance = 'permutation') %>%
  set_mode('classification') %>%
  fit(loan_status ~ ., data = train_prep) %>%
  vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))

mod_c50_importance

Model 5: XGBoost

Set up the model using a workflow.

set.seed(999)

mod_xgb <- 
  boost_tree(trees = 30) %>%
  set_mode('classification') %>%
  set_engine('xgboost')

workflow_xgb <-
  workflow() %>%
  add_recipe(recipe_knn) %>%
  add_model(mod_xgb)

Fit and evaluate the model using vfold_cv cross validation.

set.seed(999)
folds <- vfold_cv(train , v = 10)

doParallel::registerDoParallel()
fit_xgb_rs <- workflow_xgb %>%
  fit_resamples(resamples = folds,
                control = control_resamples(save_pred = TRUE))
collect_metrics(fit_xgb_rs)

Evaluating Model Performance

Estimate performance on the testing data. The accuracy is 0.9900 and the roc_auc is 0.9992, a large improvement over the null model (0.8036).

test_xgb <- workflow_xgb %>%
  last_fit(dfs_split)
acc_xgb <- collect_metrics(test_xgb)
acc_xgb
collect_predictions(test_xgb) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          1998            14
##   not_full_paid         11           477
collect_predictions(test_xgb) %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Improving Model Performance

Tuning and finding the best parameters tree_depth, min_n, loss_reduction, sample_size, and mtry.

set.seed(999)
xgb_mod <- 
  boost_tree(trees = 30, tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
             sample_size = tune(), mtry = tune()) %>%
  set_mode('classification') %>%
  set_engine('xgboost')

xgb_workflow <-
  workflow() %>%
  add_recipe(recipe_knn) %>%
  add_model(xgb_mod)
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  size = 10
)
set.seed(999)
folds <- vfold_cv(train , v = 10)

doParallel::registerDoParallel()
xgb_fit_rs <- xgb_workflow %>%
  tune_grid(resamples = folds,
            grid = xgb_grid)
autoplot(xgb_fit_rs)

xgb_fit_rs %>% show_best("accuracy")

The best combination of parameters is shown below.

best_xgb <- xgb_fit_rs %>%
  select_best("accuracy")
best_xgb

The accuracy on the test data set is 0.986. The model performance does not increase after tuning and finding the best parameters; the accuracy remains very close to that of the untuned model.

final_xgb <- 
  xgb_workflow %>% 
  finalize_workflow(best_xgb) %>%
  last_fit(dfs_split) 

acc_xgb <- final_xgb %>% 
    collect_metrics()
acc_xgb
collect_predictions(final_xgb) %>%
  conf_mat(loan_status, .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid          1998            14
##   not_full_paid         11           477
mod_xgb_aoc <- collect_predictions(final_xgb)
mod_xgb_aoc %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

Explore the feature importance.

mod_xgb_importance <- boost_tree(trees = 30) %>%
  set_mode('classification') %>%
  set_engine('xgboost', importance = 'permutation')  %>%
  fit(loan_status ~ ., data = train_prep) %>%
  vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))
## [23:11:59] WARNING: amalgamation/../src/learner.cc:541: 
## Parameters: { importance } might not be used.
## 
##   This may not be accurate due to some parameters are only used in language bindings but
##   passed down to XGBoost core.  Or some parameters are not used but slip through this
##   verification. Please open an issue if you find above cases.
## 
## 
## [23:11:59] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
mod_xgb_importance

Compare performance among the five models.

collect_predictions(final_knn) %>%
  mutate(model = 'kknn') %>%
  bind_rows(mod_c50_aoc %>% mutate(model = 'c50')) %>%
  bind_rows(mod_rf_aoc %>% mutate(model = 'random forest')) %>%
  bind_rows(mod_glm_aoc %>% mutate(model = 'glmnet')) %>%
  bind_rows(mod_xgb_aoc %>% mutate(model = 'XGBoost')) %>%
  group_by(model) %>%
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()

acc_knn %>% rename(knn = .estimate, metrics = .metric) %>%
  bind_cols(acc_c50 %>% rename(c50 = .estimate)) %>%
  bind_cols(acc_rf %>% rename(random_forest = .estimate)) %>%
  bind_cols(acc_glm %>% rename(glmnet = .estimate)) %>%
  bind_cols(acc_xgb %>% rename(XGBoost = .estimate)) %>%
  select(metrics, XGBoost, c50, random_forest, glmnet, knn)

For most models, the accuracy does increase after tuning the parameters: from 0.882 to 0.906 for KNN, from 0.981 to 0.985 for Random Forest, and from 0.989 to 0.990 for C5.0. Comparing the models, C5.0, XGBoost, and Random Forest all perform well, but XGBoost is the best model, with an accuracy of 0.990 and an area under the ROC curve of 0.999. Therefore, I will choose the XGBoost model to predict new data.
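
Since XGBoost is the chosen model, its finalized workflow could be refit and saved so that it can be reused later without repeating the tuning (a minimal sketch; the file path is illustrative and this step is not part of the analysis above):

final_xgb_fit <- xgb_workflow %>%
  finalize_workflow(best_xgb) %>%
  fit(data = train)   # refit on the training data
saveRDS(final_xgb_fit, 'models/final_xgb_workflow.rds')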

Part 2. Predicting the 2015 Loan Status

I filtered the data from 2012 to 2015 and applied the same data cleaning as in Part 1, which produced a tidy data set with 820120 observations and 58 variables. To predict the observations from 2015, I used the same training data set as in the five models above, and I randomly selected 10% of the 2015 data to make predictions on. The training data therefore has 7500 observations and the testing data has 41659. I used these data to fit the XGBoost model and evaluate its performance.

Collect Data

df_2012_to_2015 <- df %>%
  mutate(year = year(mdy(issue_d))) %>%
  filter(year %in% c(2012, 2013, 2014, 2015))
fwrite(df_2012_to_2015, 'data/accepted_2012_to_2015.csv')
df_2012_to_2015 <- fread('data/accepted_2012_to_2015.csv')

Preparing the Data

The same cleaning process as in Part 1.

df_2012_to_2015 <- df_2012_to_2015 %>% clean_names()

lots_empty <- df_2012_to_2015 %>% 
  select(c(emp_length, desc, next_pymnt_d, verification_status_joint,
           hardship_type:hardship_loan_status, 
           debt_settlement_flag_date:settlement_date )) %>% 
  names()

lots_miss <- sapply(df_2012_to_2015, function(x){sum(is.na(x))}) > 20000
lots_miss_col <- names(lots_miss)[lots_miss]

df_2012_to_2015 <- df_2012_to_2015 %>% select(-all_of(lots_miss_col), -all_of(lots_empty))

id_feature <- sapply(df_2012_to_2015, function(x){length(unique(x))}) == 1 
id_feature_col <- names(id_feature)[id_feature]
lots_level_cat <- c('emp_title', 'title', 'url') 
unique <- c('hardship_flag', 'pymnt_plan')
meaning_less <- c('zip_code', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d')

df_2012_to_2015 <- df_2012_to_2015%>% 
  select(-c(all_of(id_feature_col), all_of(lots_level_cat), 
            all_of(unique), all_of(meaning_less),
            id, funded_amnt, funded_amnt_inv)) %>%
  mutate_if(is.character, factor)

The response variable has seven levels, and I am only interested in predicting whether the observations are fully paid or not. Therefore, I keep the ‘Fully Paid’ level and define all the other levels as ‘not fully paid’.

df_2012_to_2015 <- df_2012_to_2015 %>%
  mutate(loan_status = ifelse(loan_status == 'Fully Paid', 
                              'fully_paid', 'not_full_paid')) %>%
  na.omit()

nrow(df_2012_to_2015)
## [1] 820120
table(df_2012_to_2015$loan_status)
## 
##    fully_paid not_full_paid 
##        620163        199957

Take the same training data set used for the five models above.

Randomly take 10% of the 2015 data for prediction and evaluation.

df_2012_to_2014 <- df_2012_to_2015 %>% 
  filter(year %in% c(2012, 2013, 2014))
n <- nrow(df_2012_to_2014)

set.seed(999)
index <- sample(n, 10000)
dfs <- df_2012_to_2014[index] 

set.seed(999)
dfs_split <- initial_split(dfs, prop = 0.75)
train <- dfs_split %>% training() 
df_2015 <- df_2012_to_2015 %>% 
  filter(year == 2015)
n <- nrow(df_2015)

set.seed(999)
index <- sample(n, 0.1*n)
test <- df_2015[index] 

Prepare the data and create the recipe for the ML model. Use step_nzv() to remove any columns that have very low variability.

set.seed(999)
train_recipe <- train %>% 
  recipe(loan_status ~ .) %>%
  step_nzv(all_predictors()) %>%
  prep()

train_prep <- juice(train_recipe)
test_prep <- train_recipe %>% bake(test)
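
Because the recipe was prepped on the 2012-2014 training data, it is worth checking that baking the 2015 data did not introduce missing values from factor levels unseen in training (a quick check; output not shown):

sum(is.na(test_prep))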

Training the XGBoost Model

Set up the model using the parameters from the XGBoost model tuned earlier.

set.seed(999)
mod_xgb_f <- boost_tree(trees = 40, mtry = 48, min_n = 6, tree_depth = 11) %>%
  set_engine('xgboost') %>%
  set_mode('classification') %>%
  fit(loan_status ~ ., data = train_prep)
## [23:12:13] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Evaluating Model Performance

Estimate performance on the testing data. The accuracy is 0.983, which is very good performance on new data.

mod_xgb_m <- mod_xgb_f %>%
  predict(test_prep) %>%
  bind_cols(test_prep) 

acc_xgb_f <- mod_xgb_m %>% 
  metrics(truth = loan_status, estimate = .pred_class)
acc_xgb_f
mod_xgb_m %>% conf_mat(truth = loan_status, 
                        estimate = .pred_class)
##                Truth
## Prediction      fully_paid not_full_paid
##   fully_paid         29618           681
##   not_full_paid         37         11323
mod_xgb_aoc <- mod_xgb_f %>% predict(test_prep, type = 'prob') %>%
  bind_cols(test_prep) 

mod_xgb_aoc %>% 
  roc_curve(loan_status, .pred_fully_paid) %>%
  autoplot()
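
Besides accuracy, the area under the ROC curve can be computed directly from the same probability predictions (output not shown):

mod_xgb_aoc %>% roc_auc(loan_status, .pred_fully_paid)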

Conclusions

I used the data from 2012 to 2014 to train the XGBoost model and predicted the loan status of the 2015 observations. The results show very good performance, with an accuracy of 0.983.