There are two parts in this project. In first part, I implemented five different Machine Learning Algorithms to classify the Loan Status from 2012 to 2014 of the approved LendingClub loans data. For each algorithm, I tuned the parameters to achieve the best performance. And comparing the results, I determined the XGBoost model to be the best model for building a classifier in this data set. In the second part, I used the XGBoost model to predict the Loan Status in 2015 and make the evaluation also.
Before implementing the Machine Learning Algorithm, I cleaned the data set. The response variable loan_status had seven levels, and I was only interested in predicting whether the observations were fully paid or not. Therefore, I kept the ‘fully paid’ level and defined all the other levels as ‘not fully paid’. I dropped the ID variables, categorical variables with large numbers of levels, and variables with high rates of missing values. I also removed the rows containing any missing values. Then I get a tidy data set having 416590 observations and 54 variables. I randomly selected a sample data set with 10000 observations, and I split the sample data set into a training and a testing data set using a 75-25 split. Then I developed the five models: KNN, Random Forest, C5.0, Logistic Regression using regularization and XGBoost. To improve the performance for each model, I tuned the parameters to achieve their best accuracy. At last, I compared and ranked the classification accuracy of each model.
library(pacman)
p_load(data.table, tidymodels, tidyverse, lubridate, visdat,
janitor, vip, knitr)
df <- fread('data/accepted_2007_to_2018Q4.csv')
df_2012_to_2014 <- df %>%
mutate(year = year(mdy(issue_d))) %>%
filter(year %in% c(2012, 2013, 2014))
fwrite(df_2012_to_2014, 'data/accepted_2012_to_2014.csv')
df_2012_to_2014 <- fread('data/accepted_2012_to_2014.csv')
df_2012_to_2014 <- df_2012_to_2014 %>% clean_names()
col <- colnames(df_2012_to_2014)
variable_names <- data_frame(col[1:38], col[39:76], col[77:114], col[115:152])
kable(variable_names)
| col[1:38] | col[39:76] | col[77:114] | col[115:152] |
|---|---|---|---|
| id | out_prncp | total_cu_tl | total_il_high_credit_limit |
| member_id | out_prncp_inv | inq_last_12m | revol_bal_joint |
| loan_amnt | total_pymnt | acc_open_past_24mths | sec_app_fico_range_low |
| funded_amnt | total_pymnt_inv | avg_cur_bal | sec_app_fico_range_high |
| funded_amnt_inv | total_rec_prncp | bc_open_to_buy | sec_app_earliest_cr_line |
| term | total_rec_int | bc_util | sec_app_inq_last_6mths |
| int_rate | total_rec_late_fee | chargeoff_within_12_mths | sec_app_mort_acc |
| installment | recoveries | delinq_amnt | sec_app_open_acc |
| grade | collection_recovery_fee | mo_sin_old_il_acct | sec_app_revol_util |
| sub_grade | last_pymnt_d | mo_sin_old_rev_tl_op | sec_app_open_act_il |
| emp_title | last_pymnt_amnt | mo_sin_rcnt_rev_tl_op | sec_app_num_rev_accts |
| emp_length | next_pymnt_d | mo_sin_rcnt_tl | sec_app_chargeoff_within_12_mths |
| home_ownership | last_credit_pull_d | mort_acc | sec_app_collections_12_mths_ex_med |
| annual_inc | last_fico_range_high | mths_since_recent_bc | sec_app_mths_since_last_major_derog |
| verification_status | last_fico_range_low | mths_since_recent_bc_dlq | hardship_flag |
| issue_d | collections_12_mths_ex_med | mths_since_recent_inq | hardship_type |
| loan_status | mths_since_last_major_derog | mths_since_recent_revol_delinq | hardship_reason |
| pymnt_plan | policy_code | num_accts_ever_120_pd | hardship_status |
| url | application_type | num_actv_bc_tl | deferral_term |
| desc | annual_inc_joint | num_actv_rev_tl | hardship_amount |
| purpose | dti_joint | num_bc_sats | hardship_start_date |
| title | verification_status_joint | num_bc_tl | hardship_end_date |
| zip_code | acc_now_delinq | num_il_tl | payment_plan_start_date |
| addr_state | tot_coll_amt | num_op_rev_tl | hardship_length |
| dti | tot_cur_bal | num_rev_accts | hardship_dpd |
| delinq_2yrs | open_acc_6m | num_rev_tl_bal_gt_0 | hardship_loan_status |
| earliest_cr_line | open_act_il | num_sats | orig_projected_additional_accrued_interest |
| fico_range_low | open_il_12m | num_tl_120dpd_2m | hardship_payoff_balance_amount |
| fico_range_high | open_il_24m | num_tl_30dpd | hardship_last_payment_amount |
| inq_last_6mths | mths_since_rcnt_il | num_tl_90g_dpd_24m | disbursement_method |
| mths_since_last_delinq | total_bal_il | num_tl_op_past_12m | debt_settlement_flag |
| mths_since_last_record | il_util | pct_tl_nvr_dlq | debt_settlement_flag_date |
| open_acc | open_rv_12m | percent_bc_gt_75 | settlement_status |
| pub_rec | open_rv_24m | pub_rec_bankruptcies | settlement_date |
| revol_bal | max_bal_bc | tax_liens | settlement_amount |
| revol_util | all_util | tot_hi_cred_lim | settlement_percentage |
| total_acc | total_rev_hi_lim | total_bal_ex_mort | settlement_term |
| initial_list_status | inq_fi | total_bc_limit | year |
n_row <- nrow(df_2012_to_2014)
n_col <- ncol(df_2012_to_2014)
value <- c(n_row, n_col)
name <- c('n_rows', 'n_columns')
data_str <- data.frame(name, value)
kable(data_str)
| name | value |
|---|---|
| n_rows | 423810 |
| n_columns | 152 |
skimr::skim(df_2012_to_2014)
lots_empty <- df_2012_to_2014 %>%
select(c(emp_length, desc, next_pymnt_d, hardship_type:hardship_loan_status,
debt_settlement_flag_date:settlement_date )) %>%
names()
lots_miss <- sapply(df_2012_to_2014, function(x){sum(is.na(x))}) > 20000
lots_miss_col <- names(lots_miss)[lots_miss]
df_2012_to_2014 <- df_2012_to_2014 %>% select(-all_of(lots_miss_col), -all_of(lots_empty))
ncol(df_2012_to_2014)
## [1] 69
id_feature <- sapply(df_2012_to_2014, function(x){length(unique(x))}) == 1
id_feature_col <- names(id_feature)[id_feature]
lots_level_cat <- c('emp_title', 'title', 'url')
unique <- c('hardship_flag', 'pymnt_plan')
meaning_less <- c('zip_code', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d')
df_2012_to_2014 <- df_2012_to_2014 %>%
select(-c(all_of(id_feature_col), all_of(lots_level_cat),
all_of(unique), all_of(meaning_less),
id, funded_amnt, funded_amnt_inv)) %>%
mutate_if(is.character, factor)
ncol(df_2012_to_2014)
## [1] 54
df_2012_to_2014 %>%
select(c(loan_amnt, annual_inc, installment, int_rate)) %>%
summary()
## loan_amnt annual_inc installment int_rate
## Min. : 1000 Min. : 3000 Min. : 4.93 Min. : 6.00
## 1st Qu.: 8000 1st Qu.: 45000 1st Qu.: 267.60 1st Qu.:10.99
## Median :12800 Median : 63500 Median : 390.74 Median :13.98
## Mean :14641 Mean : 73690 Mean : 443.02 Mean :14.00
## 3rd Qu.:20000 3rd Qu.: 90000 3rd Qu.: 578.68 3rd Qu.:16.99
## Max. :35000 Max. :7500000 Max. :1409.99 Max. :26.06
table(df_2012_to_2014$grade)
##
## A B C D E F G
## 64688 124558 116574 70884 32365 11931 2810
table(df_2012_to_2014$emp_length)
## < table of extent 0 >
table(df_2012_to_2014$home_ownership)
##
## ANY MORTGAGE NONE OTHER OWN RENT
## 1 216949 42 46 38461 168311
table(df_2012_to_2014$purpose)
##
## car credit_card debt_consolidation home_improvement
## 3783 98692 254457 23342
## house major_purchase medical moving
## 1843 7517 3850 2366
## other renewable_energy small_business vacation
## 19267 245 5022 2087
## wedding
## 1339
The response variable loan_status has seven levels, and I am only interested in predicting whether the observations are fully paid or not. Therefore, I keep the ‘fully paid’ level and defined all the other levels as ‘not fully paid’.
table(df_2012_to_2014$loan_status)
##
## Charged Off Current Default Fully Paid
## 70829 11925 1 340444
## In Grace Period Late (16-30 days) Late (31-120 days)
## 201 73 337
df_2012_to_2014 <- df_2012_to_2014 %>%
mutate(loan_status = ifelse(loan_status == 'Fully Paid',
'fully_paid', 'not_full_paid'))
table(df_2012_to_2014$loan_status)
##
## fully_paid not_full_paid
## 340444 83366
set.seed(999)
df_2012_to_2014 <- df_2012_to_2014 %>% na.omit()
n <- nrow(df_2012_to_2014)
index <- sample(n, 10000)
dfs <- df_2012_to_2014[index]
nrow(dfs)
## [1] 10000
ncol(dfs)
## [1] 54
dfs %>% count(loan_status) %>%
mutate(freq = n / sum(n))
set.seed(999)
dfs_split <- initial_split(dfs, prop = 0.75)
train <- dfs_split %>% training()
test <- dfs_split %>% testing()
set.seed(999)
train_recipe <- train %>%
recipe(loan_status ~ .) %>%
step_nzv(all_predictors()) %>%
prep()
train_prep <- juice(train_recipe)
test_prep <- train_recipe %>% bake(test)
mod_null <- null_model() %>%
set_mode('classification') %>%
fit(loan_status ~ ., data = train_prep)
mod_null_r <- mod_null %>% predict(test_prep) %>%
bind_cols(test_prep)
mod_null_r %>% metrics(truth = loan_status,
estimate = .pred_class)
mod_null_r %>% conf_mat(truth = loan_status,
estimate = .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 2009 491
## not_full_paid 0 0
mod_null %>% predict(test_prep, type = 'prob') %>%
bind_cols(test_prep) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
t <- train %>% select(-loan_status)
numeric <- unlist(lapply(t, is.numeric))
categorical_var <- names(numeric[!numeric])
recipe_knn <- train %>%
recipe(loan_status ~ .) %>%
step_rm(all_of(categorical_var)) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
prep()
mod_knn <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("classification")
wf_knn <- workflow() %>%
add_recipe(recipe_knn) %>%
add_model(mod_knn)
set.seed(999)
doParallel::registerDoParallel()
folds <- vfold_cv(train , v = 10)
fit_knn_rs <- wf_knn %>%
fit_resamples(folds,
control = control_resamples(save_pred = TRUE))
collect_metrics(fit_knn_rs)
test_knn <- wf_knn %>%
last_fit(dfs_split)
acc_knn <- collect_metrics(test_knn)
acc_knn
collect_predictions(test_knn) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 1896 182
## not_full_paid 113 309
collect_predictions(test_knn) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
tune_knn <-
nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_wflow <-
workflow() %>%
add_recipe(recipe_knn) %>%
add_model(tune_knn)
set.seed(999)
doParallel::registerDoParallel()
knn_fit_rs <-
knn_wflow %>%
tune_grid(
resamples = folds,
grid = 20
)
autoplot(knn_fit_rs)
knn_fit_rs %>% show_best("accuracy")
best_knn <- knn_fit_rs %>%
select_best("accuracy")
best_knn
final_knn <-
knn_wflow %>%
finalize_workflow(best_knn) %>%
last_fit(dfs_split)
acc_knn <- final_knn %>%
collect_metrics()
acc_knn
collect_predictions(final_knn) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 1978 204
## not_full_paid 31 287
collect_predictions(final_knn) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
mod_rf <-
rand_forest(trees = 100) %>%
set_mode('classification') %>%
set_engine('ranger')
workflow_rf <-
workflow() %>%
add_recipe(train_recipe) %>%
add_model(mod_rf)
set.seed(999)
folds <- vfold_cv(train , v = 10)
doParallel::registerDoParallel()
fit_rf_rs <- workflow_rf %>%
fit_resamples(folds,
control = control_resamples(save_pred = TRUE))
collect_metrics(fit_rf_rs)
test_rf <- workflow_rf %>%
last_fit(dfs_split)
acc_rf <- collect_metrics(test_rf)
acc_rf
collect_predictions(test_rf) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 1997 35
## not_full_paid 12 456
collect_predictions(test_rf) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
set.seed(999)
folds <- vfold_cv(train , v = 10)
rf_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = 100) %>%
set_mode('classification') %>%
set_engine('ranger')
rf_workflow <-
workflow() %>%
add_recipe(train_recipe) %>%
add_model(rf_spec)
set.seed(999)
doParallel::registerDoParallel()
rf_fit_rs <-
tune_grid(rf_workflow,
resamples = folds,
grid = 10)
autoplot(rf_fit_rs)
rf_fit_rs %>% show_best("accuracy")
best_rf <- rf_fit_rs %>%
select_best("accuracy")
best_rf
final_rf <-
rf_workflow %>%
finalize_workflow(best_rf) %>%
last_fit(dfs_split)
acc_rf <- final_rf %>%
collect_metrics()
acc_rf
collect_predictions(final_rf) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 1992 21
## not_full_paid 17 470
mod_rf_aoc <- collect_predictions(final_rf)
mod_rf_aoc %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
mod_rf_importance <- rand_forest(trees = 100) %>%
set_engine('ranger', importance = 'permutation') %>%
set_mode('classification') %>%
fit(loan_status ~ ., data = train_prep) %>%
vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))
mod_rf_importance
mod_glm <- logistic_reg(penalty = 0.001, mixture = 0.5) %>%
set_engine('glmnet') %>%
set_mode('classification') %>%
fit(loan_status ~ ., data = train_prep)
mod_glm_r <- mod_glm %>%
predict(test_prep) %>%
bind_cols(test_prep)
acc_glm <- mod_glm_r %>%
metrics(truth = loan_status, estimate = .pred_class)
acc_glm
mod_glm_r %>% conf_mat(truth = loan_status, estimate = .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 2000 47
## not_full_paid 9 444
mod_glm_aoc <- mod_glm %>% predict(test_prep, type = 'prob') %>%
bind_cols(test_prep)
mod_glm_aoc %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
mod_glm_importance <- logistic_reg(penalty = 0.001, mixture = 0.5) %>%
set_engine('glmnet') %>%
set_mode('classification') %>%
fit(loan_status ~ ., data = train_prep) %>%
vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))
mod_glm_importance
set.seed(999)
mod_c50 <- boost_tree(trees = 40) %>%
set_engine('C5.0') %>%
set_mode('classification')
workflow_c50 <-
workflow() %>%
add_recipe(train_recipe) %>%
add_model(mod_c50)
set.seed(999)
folds <- vfold_cv(train , v = 10)
doParallel::registerDoParallel()
fit_c50_rs <- workflow_c50 %>%
fit_resamples(folds,
control = control_resamples(save_pred = TRUE))
collect_metrics(fit_c50_rs)
test_c50 <- workflow_c50 %>%
last_fit(dfs_split)
acc_c50 <- collect_metrics(test_c50)
acc_c50
collect_predictions(test_c50) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 2001 19
## not_full_paid 8 472
collect_predictions(test_c50) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
set.seed(999)
folds <- vfold_cv(train , v = 10)
c50_tune <-
boost_tree(min_n = tune(), trees = 40) %>%
set_mode('classification') %>%
set_engine('C5.0')
c50_workflow <-
workflow() %>%
add_recipe(train_recipe) %>%
add_model(c50_tune)
set.seed(999)
doParallel::registerDoParallel()
c50_fit_rs <-
tune_grid(c50_workflow,
resamples = folds,
grid = 10)
autoplot(c50_fit_rs)
c50_fit_rs %>% show_best("accuracy")
best_c50 <- c50_fit_rs %>%
select_best("accuracy")
best_c50
final_c50 <-
c50_workflow %>%
finalize_workflow(best_c50) %>%
last_fit(dfs_split)
acc_c50 <- final_c50 %>%
collect_metrics()
acc_c50
collect_predictions(final_c50) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 2002 18
## not_full_paid 7 473
mod_c50_aoc <- collect_predictions(final_c50)
mod_c50_aoc %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
mod_c50_importance <- boost_tree(trees = 40) %>%
set_engine('C5.0', importance = 'permutation') %>%
set_mode('classification') %>%
fit(loan_status ~ ., data = train_prep) %>%
vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))
mod_c50_importance
set.seed(999)
mod_xgb <-
boost_tree(trees = 30) %>%
set_mode('classification') %>%
set_engine('xgboost')
workflow_xgb <-
workflow() %>%
add_recipe(recipe_knn) %>%
add_model(mod_xgb)
set.seed(999)
folds <- vfold_cv(train , v = 10)
doParallel::registerDoParallel()
fit_xgb_rs <- workflow_xgb %>%
fit_resamples(resamples = folds,
control = control_resamples(save_pred = TRUE))
collect_metrics(fit_xgb_rs)
Estimate performance of the testing data. The accuracy is 0.9900 and the roc_auc is 0.9992, which increased a lot from the null model (0.8036).
test_xgb <- workflow_xgb %>%
last_fit(dfs_split)
acc_xgb <- collect_metrics(test_xgb)
acc_xgb
collect_predictions(test_xgb) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 1998 14
## not_full_paid 11 477
collect_predictions(test_xgb) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
set.seed(999)
xgb_mod <-
boost_tree(trees = 30, tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
sample_size = tune(), mtry = tune()) %>%
set_mode('classification') %>%
set_engine('xgboost')
xgb_workflow <-
workflow() %>%
add_recipe(recipe_knn) %>%
add_model(xgb_mod)
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train),
size = 10
)
set.seed(999)
folds <- vfold_cv(train , v = 10)
doParallel::registerDoParallel()
xgb_fit_rs <- xgb_workflow %>%
tune_grid(xgb_workflow,
resamples = folds,
grid = xgb_grid)
autoplot(xgb_fit_rs)
xgb_fit_rs %>% show_best("accuracy")
The best combination of parameters are shown below.
best_xgb <- xgb_fit_rs %>%
select_best("accuracy")
best_xgb
The accuracy for the test data set is 0.986. The model performance does not increase after tuning and finding the best parameters. The accuracy remains very close to the one without tuning.
final_xgb <-
workflow_xgb %>%
finalize_workflow(best_xgb) %>%
last_fit(dfs_split)
acc_xgb <- final_xgb %>%
collect_metrics()
acc_xgb
collect_predictions(final_xgb) %>%
conf_mat(loan_status, .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 1998 14
## not_full_paid 11 477
mod_xgb_aoc <- collect_predictions(final_xgb)
mod_xgb_aoc %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
mod_xgb_importance <- boost_tree(trees = 30) %>%
set_mode('classification') %>%
set_engine('xgboost', importance = 'permutation') %>%
fit(loan_status ~ ., data = train_prep) %>%
vip(aesthetics = list(alpha = 0.8, fill = 'midnightblue'))
## [23:11:59] WARNING: amalgamation/../src/learner.cc:541:
## Parameters: { importance } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##
##
## [23:11:59] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
mod_xgb_importance
collect_predictions(final_knn) %>%
mutate(model = 'kknn') %>%
bind_rows(mod_c50_aoc %>% mutate(model = 'c50')) %>%
bind_rows(mod_rf_aoc %>% mutate(model = 'random forest')) %>%
bind_rows(mod_glm_aoc %>% mutate(model = 'glmnet')) %>%
bind_rows(mod_xgb_aoc %>% mutate(model = 'XGBoost')) %>%
group_by(model) %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
acc_knn %>% rename(knn = .estimate, metrics = .metric) %>%
bind_cols(acc_c50 %>% rename(c50 = .estimate)) %>%
bind_cols(acc_rf %>% rename(random_forest = .estimate)) %>%
bind_cols(acc_glm %>% rename(glmnet = .estimate)) %>%
bind_cols(acc_xgb %>% rename(XGBoost = .estimate)) %>%
select(metrics, XGBoost, c50, random_forest, glmnet, knn)
For most models, the accuracy does increase after tuning the parameters. It increased from 0.882 to 0.906 for KNN, from 0.981 to 0.985 for Random Forest and from 0.989 to 0.990 for C5.0. And comparing the performance for each model, all C50, XGBoost, and Random Forest models are doing well. But the XGBoost should be the best model with an accuracy of 0.990 and receiver operator curve is 0.999. Therefore, I will choose the XBGoost model to predict any new data.
I filtered the data from 2012 to 2015 and made a similar data cleaning from the project. Then I got a tidy data set with 820120 observations and 58 variables. To predict the observations in 2015, I used the same training data set that I used in the project’s five models. And I randomly selected 10% of the data in 2015 to make the predictions. Therefore, the data size of training data is 7500 and testing data is 41659. I would use these data to implement the XGBoost model and see how well the performance.
df_2012_to_2015 <- df %>%
mutate(year = year(mdy(issue_d))) %>%
filter(year %in% c(2012, 2013, 2014, 2015))
fwrite(df_2012_to_2015, 'data/accepted_2012_to_2015.csv')
df_2012_to_2015 <- fread('data/accepted_2012_to_2015.csv')
df_2012_to_2015 <- df_2012_to_2015 %>% clean_names()
lots_empty <- df_2012_to_2015 %>%
select(c(emp_length, desc, next_pymnt_d, verification_status_joint,
hardship_type:hardship_loan_status,
debt_settlement_flag_date:settlement_date )) %>%
names()
lots_miss <- sapply(df_2012_to_2015, function(x){sum(is.na(x))}) > 20000
lots_miss_col <- names(lots_miss)[lots_miss]
df_2012_to_2015 <- df_2012_to_2015 %>% select(-all_of(lots_miss_col), -all_of(lots_empty))
id_feature <- sapply(df_2012_to_2015, function(x){length(unique(x))}) == 1
id_feature_col <- names(id_feature)[id_feature]
lots_level_cat <- c('emp_title', 'title', 'url')
unique <- c('hardship_flag', 'pymnt_plan')
meaning_less <- c('zip_code', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d')
df_2012_to_2015 <- df_2012_to_2015%>%
select(-c(all_of(id_feature_col), all_of(lots_level_cat),
all_of(unique), all_of(meaning_less),
id, funded_amnt, funded_amnt_inv)) %>%
mutate_if(is.character, factor)
df_2012_to_2015 <- df_2012_to_2015 %>%
mutate(loan_status = ifelse(loan_status == 'Fully Paid',
'fully_paid', 'not_full_paid')) %>%
na.omit()
nrow(df_2012_to_2015)
## [1] 820120
table(df_2012_to_2015$loan_status)
##
## fully_paid not_full_paid
## 620163 199957
df_2012_to_2014 <- df_2012_to_2015 %>%
filter(year %in% c(2012, 2013, 2014))
n <- nrow(df_2012_to_2014)
set.seed(999)
index <- sample(n, 10000)
dfs <- df_2012_to_2014[index]
set.seed(999)
dfs_split <- initial_split(dfs, prop = 0.75)
train <- dfs_split %>% training()
df_2015 <- df_2012_to_2015 %>%
filter(year == 2015)
n <- nrow(df_2015)
set.seed(999)
index <- sample(n, 0.1*n)
test <- df_2015[index]
set.seed(999)
train_recipe <- train %>%
recipe(loan_status ~ .) %>%
step_nzv(all_predictors()) %>%
prep()
train_prep <- juice(train_recipe)
test_prep <- train_recipe %>% bake(test)
set.seed(999)
mod_xgb_f <- boost_tree(trees = 40, mtry = 48, min_n = 6, tree_depth = 11) %>%
set_engine('xgboost') %>%
set_mode('classification') %>%
fit(loan_status ~ ., data = train_prep)
## [23:12:13] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
mod_xgb_m <- mod_xgb_f %>%
predict(test_prep) %>%
bind_cols(test_prep)
acc_xgb_f <- mod_xgb_m %>%
metrics(truth = loan_status, estimate = .pred_class)
acc_xgb_f
mod_xgb_m %>% conf_mat(truth = loan_status,
estimate = .pred_class)
## Truth
## Prediction fully_paid not_full_paid
## fully_paid 29618 681
## not_full_paid 37 11323
mod_xgb_aoc <- mod_xgb_f %>% predict(test_prep, type = 'prob') %>%
bind_cols(test_prep)
mod_xgb_aoc %>%
roc_curve(loan_status, .pred_fully_paid) %>%
autoplot()
Use the data from 2012 to 2014 to train the XGBoost model and predict the observations in 2015. The results show it has a very good performance with accuracy of 0.983.