1. Introduction

In this section, I will use ramdom forest model to build a classification model that classify good or bad customer. For more details about the data set, visit: https://www.kaggle.com/rikdifos/credit-card-approval-prediction

Following are context that quoted from the link above:

Context

Credit score cards are a common risk control method in the financial industry. It uses personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank is able to decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.

Generally speaking, credit score cards are based on historical data. Once encountering large economic fluctuations. Past models may lose their original predictive power. Logistic model is a common method for credit scoring. Because Logistic is suitable for binary classification tasks and can calculate the coefficients of each feature. In order to facilitate understanding and operation, the score card will multiply the logistic regression coefficient by a certain value (such as 100) and round it.

At present, with the development of machine learning algorithms. More predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often do not have good transparency. It may be difficult to provide customers and regulators with a reason for rejection or acceptance.

First, we need to load some necessary libraries and data. You guys can download data set from the linked above.

library(tidyverse)
library(tidymodels) 
library(readr)
tidymodels_prefer()

credit_record <- janitor::clean_names(read_csv("credit_record.csv"))
application_record <- janitor::clean_names(read_csv("application_record.csv"))

Credit_record dataframe contains the information of credit payment status. We will retain only current month status of customnes’ credit payment status. The status data is coded, therefore, we need to lables those coded data.

credit_record <- credit_record %>% filter(months_balance==0) %>% select(-months_balance) %>% 
  mutate(status = factor(status,
                         levels = c("0","1","2","3", "4", "5", "X", "C"),
                         labels = c("1m past due", "2m past due", "3m past due",  
                                    "4m past due", "5m past due", 
                                    "bad debts", "paid off that month", 
                                    "no loan for the month")))

Let’s have a glance on what we have in the dataframe. We have 18 variables, of which, there are 8 factors variables and 10 numeric variables. Most of them are non-missing vaiables except occupation_type. Besides, days_births and days_employed are seemingly stored in a wrong class. We need to fix missing value and wrong class before putting data into our models. This will be done in preprocessing steps.

skimr::skim(application_record)

Data summary
Name	application_record
Number of rows	438557
Number of columns	18
_______________________
Column type frequency:
character	8
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
code_gender	0	1.00	1	1	2
flag_own_car	0	1.00	1	1	2
flag_own_realty	0	1.00	1	1	2
name_income_type	0	1.00	7	20	5
name_education_type	0	1.00	15	29	5
name_family_status	0	1.00	5	20	5
name_housing_type	0	1.00	12	19	6
occupation_type	134203	0.69	7	21	18

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	6022176.27	571637.02	5008804	5609375	6047745.0	6456971	7999952	▆▇▇▁▁
cnt_children	1	0.43	0.72	0	0	0.0	1	19	▇▁▁▁▁
amt_income_total	1	187524.29	110086.85	26100	121500	160780.5	225000	6750000	▇▁▁▁▁
days_birth	1	-15997.90	4185.03	-25201	-19483	-15630.0	-12514	-7489	▃▆▇▇▅
days_employed	1	60563.68	138767.80	-17531	-3103	-1467.0	-371	365243	▇▁▁▁▂
flag_mobil	1	1.00	0.00	1	1	1.0	1	1	▁▁▇▁▁
flag_work_phone	1	0.21	0.40	0	0	0.0	0	1	▇▁▁▁▂
flag_phone	1	0.29	0.45	0	0	0.0	1	1	▇▁▁▁▃
flag_email	1	0.11	0.31	0	0	0.0	0	1	▇▁▁▁▁
cnt_fam_members	1	2.19	0.90	1	2	2.0	3	20	▇▁▁▁▁

In this step, we will convert days_birth and days_employed into date class using lubridate package. Then, we need to merge 2 dataframe into a unique one. We see that the credit_record contains only 33856 but the application_record contains 438557. This means that, some of them are matched.

application_record <- application_record %>% mutate(days_birth = lubridate::as_date(days_birth),
                              days_employed = lubridate::as_date(days_employed))

credit_data <- inner_join(credit_record, application_record)

## Joining, by = "id"

head(credit_data)

## # A tibble: 6 x 19
##        id status           code_gender flag_own_car flag_own_realty cnt_children
##     <dbl> <fct>            <chr>       <chr>        <chr>                  <dbl>
## 1 5008804 no loan for the… M           Y            Y                          0
## 2 5008805 no loan for the… M           Y            Y                          0
## 3 5008806 no loan for the… M           Y            Y                          0
## 4 5008808 1m past due      F           N            Y                          0
## 5 5008810 no loan for the… F           N            Y                          0
## 6 5008811 no loan for the… F           N            Y                          0
## # … with 13 more variables: amt_income_total <dbl>, name_income_type <chr>,
## #   name_education_type <chr>, name_family_status <chr>,
## #   name_housing_type <chr>, days_birth <date>, days_employed <date>,
## #   flag_mobil <dbl>, flag_work_phone <dbl>, flag_phone <dbl>,
## #   flag_email <dbl>, occupation_type <chr>, cnt_fam_members <dbl>

rm(application_record, credit_record)

The table below shows that more than 12,9 thousand customers do not have a loan for current month, 4.4 thousand coustomers have paid off for the current month and other over 7 thousand customers has past due.

credit_data %>% count(status, sort = T) %>%
  ggplot(aes(n, fct_reorder(status, n), fill = status)) +
  geom_col() + 
  theme_light()

Now, I will define that customer who has paid off or do not have loan for the month as good customer, the others tho have past the due date are defined as bad customer.

final_dat <- credit_data %>% mutate(classification = case_when(status == "no loan for the month" ~ "good",
                                                  status == "paid off that month" ~ "good",
                                                  TRUE ~ "bad")) %>% 
  mutate(occupation_type = case_when(is.na(occupation_type)=="TRUE"~"unknown",
                                     TRUE ~ occupation_type)) %>% 
  mutate(occupation_type = fct_lump(occupation_type, n = 7),
         days_birth = 2020 - lubridate::year(days_birth),
         days_employed = 2020 - lubridate::year(days_employed)) %>% 
  mutate_if(is.character, as.factor)

skimr::skim(final_dat)

Data summary
Name	final_dat
Number of rows	24672
Number of columns	20
_______________________
Column type frequency:
factor	10
numeric	10
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
status	1	FALSE	8	no : 12974, 1m : 6886, pai: 4487, 2m : 236
code_gender	1	FALSE	2	F: 16406, M: 8266
flag_own_car	1	FALSE	2	N: 15301, Y: 9371
flag_own_realty	1	FALSE	2	Y: 16293, N: 8379
name_income_type	1	FALSE	5	Wor: 12711, Com: 5849, Pen: 4150, Sta: 1953
name_education_type	1	FALSE	5	Sec: 16782, Hig: 6651, Inc: 966, Low: 250
name_family_status	1	FALSE	5	Mar: 16942, Sin: 3319, Civ: 1965, Sep: 1434
name_housing_type	1	FALSE	6	Hou: 22020, Wit: 1206, Mun: 785, Ren: 374
occupation_type	1	FALSE	8	unk: 7629, Lab: 4293, Oth: 3579, Cor: 2381
classification	1	FALSE	2	goo: 17461, bad: 7211

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	5078862.63	41535.33	5008804	5045511	5069454	5115437	5150487	▇▇▇▇▆
cnt_children	1	0.42	0.74	0	0	0	1	19	▇▁▁▁▁
amt_income_total	1	187074.99	102094.83	27000	121500	157500	225000	1575000	▇▁▁▁▁
days_birth	1	94.36	11.45	71	85	93	104	119	▃▇▇▆▃
days_employed	1	-111.04	376.40	-950	52	55	59	94	▂▁▁▁▇
flag_mobil	1	1.00	0.00	1	1	1	1	1	▁▁▇▁▁
flag_work_phone	1	0.23	0.42	0	0	0	0	1	▇▁▁▁▂
flag_phone	1	0.29	0.46	0	0	0	1	1	▇▁▁▁▃
flag_email	1	0.09	0.29	0	0	0	0	1	▇▁▁▁▁
cnt_fam_members	1	2.18	0.91	1	2	2	3	20	▇▁▁▁▁

Split training and testing data. The training data is used to train our model. Meanwhile, testing data is used to evaluate our model performance.

set.seed(123)
split <- initial_split(final_dat, prop = 0.8)
train <- training(split)
testing <- testing(split)

This step is to create resampling data for evaluating trained model.

set.seed(234)
bs_resampling <- bootstraps(train)
bs_resampling

## # Bootstrap sampling 
## # A tibble: 25 x 2
##    splits               id         
##    <list>               <chr>      
##  1 <split [19737/7227]> Bootstrap01
##  2 <split [19737/7220]> Bootstrap02
##  3 <split [19737/7317]> Bootstrap03
##  4 <split [19737/7255]> Bootstrap04
##  5 <split [19737/7267]> Bootstrap05
##  6 <split [19737/7240]> Bootstrap06
##  7 <split [19737/7238]> Bootstrap07
##  8 <split [19737/7273]> Bootstrap08
##  9 <split [19737/7210]> Bootstrap09
## 10 <split [19737/7279]> Bootstrap10
## # … with 15 more rows

We are building a recipe to preprocess our data. Here, I am going to remove flag_mobil since it is zero variance variable. Id and status collum will not used in this model, so I will set them as an id.

rec <- recipe(classification~., data = train) %>% 
  add_role(id, status, new_role = "id") %>% 
  step_rm(flag_mobil)

Let’s build an engine using random forest model

engine <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("randomForest")

I will the preprocess recipe and engine into our workflows

wf <- workflow() %>%
  add_recipe(rec) %>% 
  add_model(engine)

Let’s fit the model using bootstrap resample

fit_resample <- wf %>% fit_resamples(resamples = bs_resampling,
                    metric = metrics(roc_auc, accuracy),
                    control = control_resamples(save_pred = T))

## Warning: The `...` are not used in this function but one or more objects were
## passed: 'metric'

The metric show that this model perform super good with accuracy equal to 1, roc_auc is 1 also.

fit_resample %>% collect_metrics()

## # A tibble: 2 x 6
##   .metric  .estimator  mean     n   std_err .config             
##   <chr>    <chr>      <dbl> <int>     <dbl> <chr>               
## 1 accuracy binary      1.00    25 0.0000401 Preprocessor1_Model1
## 2 roc_auc  binary      1.00    25 0.0000258 Preprocessor1_Model1

fit_resample %>% conf_mat_resampled()

## # A tibble: 4 x 3
##   Prediction Truth    Freq
##   <fct>      <fct>   <dbl>
## 1 bad        bad   2123.  
## 2 bad        good     0   
## 3 good       bad      1.24
## 4 good       good  5129.

Ok, now, let’s see our how does our model perform over testing data set. We can see that our performs very well on testing data set. Amazing!

last_fit <- last_fit(wf, split)
last_fit %>% collect_metrics() %>% knitr::kable()

.metric	.estimator	.estimate	.config
accuracy	binary	1	Preprocessor1_Model1
roc_auc	binary	1	Preprocessor1_Model1

last_fit %>% conf_mat_resampled()%>% knitr::kable()

Prediction	Truth	Freq
bad	bad	1408
bad	good	0
good	bad	0
good	good	3527

last_fit %>% collect_predictions()

## # A tibble: 4,935 x 7
##    id         .pred_bad .pred_good  .row .pred_class classification .config     
##    <chr>          <dbl>      <dbl> <int> <fct>       <fct>          <chr>       
##  1 train/tes…     0.026      0.974     1 good        good           Preprocesso…
##  2 train/tes…     0.036      0.964     6 good        good           Preprocesso…
##  3 train/tes…     0.972      0.028     8 bad         bad            Preprocesso…
##  4 train/tes…     0.05       0.95     12 good        good           Preprocesso…
##  5 train/tes…     0.998      0.002    24 bad         bad            Preprocesso…
##  6 train/tes…     0.054      0.946    29 good        good           Preprocesso…
##  7 train/tes…     0.996      0.004    34 bad         bad            Preprocesso…
##  8 train/tes…     0.968      0.032    35 bad         bad            Preprocesso…
##  9 train/tes…     0          1        48 good        good           Preprocesso…
## 10 train/tes…     0          1        59 good        good           Preprocesso…
## # … with 4,925 more rows

PREDICT GOOD OR BAD CREDIT CUSOMER

HUNG.NGUYEN

5/24/2021

1. Introduction

Context

Conclusion