1. Introduction

In this section, I will use ramdom forest model to build a classification model that classify good or bad customer. For more details about the data set, visit: https://www.kaggle.com/rikdifos/credit-card-approval-prediction

Following are context that quoted from the link above:

Context

Credit score cards are a common risk control method in the financial industry. It uses personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank is able to decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.


Generally speaking, credit score cards are based on historical data. Once encountering large economic fluctuations. Past models may lose their original predictive power. Logistic model is a common method for credit scoring. Because Logistic is suitable for binary classification tasks and can calculate the coefficients of each feature. In order to facilitate understanding and operation, the score card will multiply the logistic regression coefficient by a certain value (such as 100) and round it.

At present, with the development of machine learning algorithms. More predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often do not have good transparency. It may be difficult to provide customers and regulators with a reason for rejection or acceptance.

First, we need to load some necessary libraries and data. You guys can download data set from the linked above.

library(tidyverse)
library(tidymodels) 
library(readr)
tidymodels_prefer()

credit_record <- janitor::clean_names(read_csv("credit_record.csv"))
application_record <- janitor::clean_names(read_csv("application_record.csv"))

Credit_record dataframe contains the information of credit payment status. We will retain only current month status of customnes’ credit payment status. The status data is coded, therefore, we need to lables those coded data.

credit_record <- credit_record %>% filter(months_balance==0) %>% select(-months_balance) %>% 
  mutate(status = factor(status,
                         levels = c("0","1","2","3", "4", "5", "X", "C"),
                         labels = c("1m past due", "2m past due", "3m past due",  
                                    "4m past due", "5m past due", 
                                    "bad debts", "paid off that month", 
                                    "no loan for the month")))

Let’s have a glance on what we have in the dataframe. We have 18 variables, of which, there are 8 factors variables and 10 numeric variables. Most of them are non-missing vaiables except occupation_type. Besides, days_births and days_employed are seemingly stored in a wrong class. We need to fix missing value and wrong class before putting data into our models. This will be done in preprocessing steps.

skimr::skim(application_record)
Data summary
Name application_record
Number of rows 438557
Number of columns 18
_______________________
Column type frequency:
character 8
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
code_gender 0 1.00 1 1 0 2 0
flag_own_car 0 1.00 1 1 0 2 0
flag_own_realty 0 1.00 1 1 0 2 0
name_income_type 0 1.00 7 20 0 5 0
name_education_type 0 1.00 15 29 0 5 0
name_family_status 0 1.00 5 20 0 5 0
name_housing_type 0 1.00 12 19 0 6 0
occupation_type 134203 0.69 7 21 0 18 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 6022176.27 571637.02 5008804 5609375 6047745.0 6456971 7999952 ▆▇▇▁▁
cnt_children 0 1 0.43 0.72 0 0 0.0 1 19 ▇▁▁▁▁
amt_income_total 0 1 187524.29 110086.85 26100 121500 160780.5 225000 6750000 ▇▁▁▁▁
days_birth 0 1 -15997.90 4185.03 -25201 -19483 -15630.0 -12514 -7489 ▃▆▇▇▅
days_employed 0 1 60563.68 138767.80 -17531 -3103 -1467.0 -371 365243 ▇▁▁▁▂
flag_mobil 0 1 1.00 0.00 1 1 1.0 1 1 ▁▁▇▁▁
flag_work_phone 0 1 0.21 0.40 0 0 0.0 0 1 ▇▁▁▁▂
flag_phone 0 1 0.29 0.45 0 0 0.0 1 1 ▇▁▁▁▃
flag_email 0 1 0.11 0.31 0 0 0.0 0 1 ▇▁▁▁▁
cnt_fam_members 0 1 2.19 0.90 1 2 2.0 3 20 ▇▁▁▁▁

In this step, we will convert days_birth and days_employed into date class using lubridate package. Then, we need to merge 2 dataframe into a unique one. We see that the credit_record contains only 33856 but the application_record contains 438557. This means that, some of them are matched.

application_record <- application_record %>% mutate(days_birth = lubridate::as_date(days_birth),
                              days_employed = lubridate::as_date(days_employed))

credit_data <- inner_join(credit_record, application_record)
## Joining, by = "id"
head(credit_data)
## # A tibble: 6 x 19
##        id status           code_gender flag_own_car flag_own_realty cnt_children
##     <dbl> <fct>            <chr>       <chr>        <chr>                  <dbl>
## 1 5008804 no loan for the… M           Y            Y                          0
## 2 5008805 no loan for the… M           Y            Y                          0
## 3 5008806 no loan for the… M           Y            Y                          0
## 4 5008808 1m past due      F           N            Y                          0
## 5 5008810 no loan for the… F           N            Y                          0
## 6 5008811 no loan for the… F           N            Y                          0
## # … with 13 more variables: amt_income_total <dbl>, name_income_type <chr>,
## #   name_education_type <chr>, name_family_status <chr>,
## #   name_housing_type <chr>, days_birth <date>, days_employed <date>,
## #   flag_mobil <dbl>, flag_work_phone <dbl>, flag_phone <dbl>,
## #   flag_email <dbl>, occupation_type <chr>, cnt_fam_members <dbl>
rm(application_record, credit_record)

The table below shows that more than 12,9 thousand customers do not have a loan for current month, 4.4 thousand coustomers have paid off for the current month and other over 7 thousand customers has past due.

credit_data %>% count(status, sort = T) %>%
  ggplot(aes(n, fct_reorder(status, n), fill = status)) +
  geom_col() + 
  theme_light()

Now, I will define that customer who has paid off or do not have loan for the month as good customer, the others tho have past the due date are defined as bad customer.

final_dat <- credit_data %>% mutate(classification = case_when(status == "no loan for the month" ~ "good",
                                                  status == "paid off that month" ~ "good",
                                                  TRUE ~ "bad")) %>% 
  mutate(occupation_type = case_when(is.na(occupation_type)=="TRUE"~"unknown",
                                     TRUE ~ occupation_type)) %>% 
  mutate(occupation_type = fct_lump(occupation_type, n = 7),
         days_birth = 2020 - lubridate::year(days_birth),
         days_employed = 2020 - lubridate::year(days_employed)) %>% 
  mutate_if(is.character, as.factor)

skimr::skim(final_dat)
Data summary
Name final_dat
Number of rows 24672
Number of columns 20
_______________________
Column type frequency:
factor 10
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
status 0 1 FALSE 8 no : 12974, 1m : 6886, pai: 4487, 2m : 236
code_gender 0 1 FALSE 2 F: 16406, M: 8266
flag_own_car 0 1 FALSE 2 N: 15301, Y: 9371
flag_own_realty 0 1 FALSE 2 Y: 16293, N: 8379
name_income_type 0 1 FALSE 5 Wor: 12711, Com: 5849, Pen: 4150, Sta: 1953
name_education_type 0 1 FALSE 5 Sec: 16782, Hig: 6651, Inc: 966, Low: 250
name_family_status 0 1 FALSE 5 Mar: 16942, Sin: 3319, Civ: 1965, Sep: 1434
name_housing_type 0 1 FALSE 6 Hou: 22020, Wit: 1206, Mun: 785, Ren: 374
occupation_type 0 1 FALSE 8 unk: 7629, Lab: 4293, Oth: 3579, Cor: 2381
classification 0 1 FALSE 2 goo: 17461, bad: 7211

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 5078862.63 41535.33 5008804 5045511 5069454 5115437 5150487 ▇▇▇▇▆
cnt_children 0 1 0.42 0.74 0 0 0 1 19 ▇▁▁▁▁
amt_income_total 0 1 187074.99 102094.83 27000 121500 157500 225000 1575000 ▇▁▁▁▁
days_birth 0 1 94.36 11.45 71 85 93 104 119 ▃▇▇▆▃
days_employed 0 1 -111.04 376.40 -950 52 55 59 94 ▂▁▁▁▇
flag_mobil 0 1 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
flag_work_phone 0 1 0.23 0.42 0 0 0 0 1 ▇▁▁▁▂
flag_phone 0 1 0.29 0.46 0 0 0 1 1 ▇▁▁▁▃
flag_email 0 1 0.09 0.29 0 0 0 0 1 ▇▁▁▁▁
cnt_fam_members 0 1 2.18 0.91 1 2 2 3 20 ▇▁▁▁▁

Split training and testing data. The training data is used to train our model. Meanwhile, testing data is used to evaluate our model performance.

set.seed(123)
split <- initial_split(final_dat, prop = 0.8)
train <- training(split)
testing <- testing(split)

This step is to create resampling data for evaluating trained model.

set.seed(234)
bs_resampling <- bootstraps(train)
bs_resampling
## # Bootstrap sampling 
## # A tibble: 25 x 2
##    splits               id         
##    <list>               <chr>      
##  1 <split [19737/7227]> Bootstrap01
##  2 <split [19737/7220]> Bootstrap02
##  3 <split [19737/7317]> Bootstrap03
##  4 <split [19737/7255]> Bootstrap04
##  5 <split [19737/7267]> Bootstrap05
##  6 <split [19737/7240]> Bootstrap06
##  7 <split [19737/7238]> Bootstrap07
##  8 <split [19737/7273]> Bootstrap08
##  9 <split [19737/7210]> Bootstrap09
## 10 <split [19737/7279]> Bootstrap10
## # … with 15 more rows

We are building a recipe to preprocess our data. Here, I am going to remove flag_mobil since it is zero variance variable. Id and status collum will not used in this model, so I will set them as an id.

rec <- recipe(classification~., data = train) %>% 
  add_role(id, status, new_role = "id") %>% 
  step_rm(flag_mobil)

Let’s build an engine using random forest model

engine <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("randomForest")

I will the preprocess recipe and engine into our workflows

wf <- workflow() %>%
  add_recipe(rec) %>% 
  add_model(engine)

Let’s fit the model using bootstrap resample

fit_resample <- wf %>% fit_resamples(resamples = bs_resampling,
                    metric = metrics(roc_auc, accuracy),
                    control = control_resamples(save_pred = T))
## Warning: The `...` are not used in this function but one or more objects were
## passed: 'metric'

The metric show that this model perform super good with accuracy equal to 1, roc_auc is 1 also.

fit_resample %>% collect_metrics()
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n   std_err .config             
##   <chr>    <chr>      <dbl> <int>     <dbl> <chr>               
## 1 accuracy binary      1.00    25 0.0000401 Preprocessor1_Model1
## 2 roc_auc  binary      1.00    25 0.0000258 Preprocessor1_Model1
fit_resample %>% conf_mat_resampled()
## # A tibble: 4 x 3
##   Prediction Truth    Freq
##   <fct>      <fct>   <dbl>
## 1 bad        bad   2123.  
## 2 bad        good     0   
## 3 good       bad      1.24
## 4 good       good  5129.

Ok, now, let’s see our how does our model perform over testing data set. We can see that our performs very well on testing data set. Amazing!

last_fit <- last_fit(wf, split)
last_fit %>% collect_metrics() %>% knitr::kable()
.metric .estimator .estimate .config
accuracy binary 1 Preprocessor1_Model1
roc_auc binary 1 Preprocessor1_Model1
last_fit %>% conf_mat_resampled()%>% knitr::kable()
Prediction Truth Freq
bad bad 1408
bad good 0
good bad 0
good good 3527
last_fit %>% collect_predictions()
## # A tibble: 4,935 x 7
##    id         .pred_bad .pred_good  .row .pred_class classification .config     
##    <chr>          <dbl>      <dbl> <int> <fct>       <fct>          <chr>       
##  1 train/tes…     0.026      0.974     1 good        good           Preprocesso…
##  2 train/tes…     0.036      0.964     6 good        good           Preprocesso…
##  3 train/tes…     0.972      0.028     8 bad         bad            Preprocesso…
##  4 train/tes…     0.05       0.95     12 good        good           Preprocesso…
##  5 train/tes…     0.998      0.002    24 bad         bad            Preprocesso…
##  6 train/tes…     0.054      0.946    29 good        good           Preprocesso…
##  7 train/tes…     0.996      0.004    34 bad         bad            Preprocesso…
##  8 train/tes…     0.968      0.032    35 bad         bad            Preprocesso…
##  9 train/tes…     0          1        48 good        good           Preprocesso…
## 10 train/tes…     0          1        59 good        good           Preprocesso…
## # … with 4,925 more rows

Conclusion

Our model performs very well on detecting good or bad credit user. If we have new data of credit user, we can use this model specification to predict whether the customer might be good or bad