In this section, I will use a random forest model to build a classifier that labels customers as good or bad. For more details about the data set, visit: https://www.kaggle.com/rikdifos/credit-card-approval-prediction
The following context is quoted from the link above:

> Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so the bank can decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.
> Generally speaking, credit score cards are based on historical data, so when large economic fluctuations occur, past models may lose their original predictive power. The logistic model is a common method for credit scoring because it is suitable for binary classification tasks and can calculate the coefficients of each feature. To facilitate understanding and operation, the score card multiplies the logistic regression coefficients by a certain value (such as 100) and rounds them.
> At present, with the development of machine learning algorithms, more predictive methods such as boosting, random forests, and support vector machines have been introduced into credit card scoring. However, these methods often lack transparency, and it may be difficult to provide customers and regulators with a reason for rejection or acceptance.
First, we need to load the necessary libraries and data. You can download the data set from the link above.
library(tidyverse)
library(tidymodels)
library(readr)
tidymodels_prefer()

# read the two csv files and standardize column names to snake_case
credit_record <- janitor::clean_names(read_csv("credit_record.csv"))
application_record <- janitor::clean_names(read_csv("application_record.csv"))
The credit_record data frame contains credit payment status information. We will retain only the current month of each customer's payment status. The status values are coded, so we relabel them following the data dictionary on Kaggle (0 through 5 indicate increasingly overdue payments, C means paid off that month, and X means no loan for the month).
credit_record <- credit_record %>%
  filter(months_balance == 0) %>%   # keep only the current month
  select(-months_balance) %>%
  mutate(status = factor(status,
                         levels = c("0", "1", "2", "3", "4", "5", "C", "X"),
                         labels = c("1m past due", "2m past due", "3m past due",
                                    "4m past due", "5m past due", "bad debts",
                                    "paid off that month", "no loan for the month")))
Let’s have a glance at what we have in the data frame. There are 18 variables: 8 character variables and 10 numeric variables. None of them have missing values except occupation_type. Besides, days_birth and days_employed are stored as (negative) day counts relative to the application date rather than as calendar dates. We need to handle the missing values and these day counts before feeding the data into our model; this will be done in the preprocessing steps.
skimr::skim(application_record)
| | Value |
|---|---|
| Name | application_record |
| Number of rows | 438557 |
| Number of columns | 18 |
| Column type frequency: | |
| character | 8 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| code_gender | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| flag_own_car | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| flag_own_realty | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| name_income_type | 0 | 1.00 | 7 | 20 | 0 | 5 | 0 |
| name_education_type | 0 | 1.00 | 15 | 29 | 0 | 5 | 0 |
| name_family_status | 0 | 1.00 | 5 | 20 | 0 | 5 | 0 |
| name_housing_type | 0 | 1.00 | 12 | 19 | 0 | 6 | 0 |
| occupation_type | 134203 | 0.69 | 7 | 21 | 0 | 18 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6022176.27 | 571637.02 | 5008804 | 5609375 | 6047745.0 | 6456971 | 7999952 | ▆▇▇▁▁ |
| cnt_children | 0 | 1 | 0.43 | 0.72 | 0 | 0 | 0.0 | 1 | 19 | ▇▁▁▁▁ |
| amt_income_total | 0 | 1 | 187524.29 | 110086.85 | 26100 | 121500 | 160780.5 | 225000 | 6750000 | ▇▁▁▁▁ |
| days_birth | 0 | 1 | -15997.90 | 4185.03 | -25201 | -19483 | -15630.0 | -12514 | -7489 | ▃▆▇▇▅ |
| days_employed | 0 | 1 | 60563.68 | 138767.80 | -17531 | -3103 | -1467.0 | -371 | 365243 | ▇▁▁▁▂ |
| flag_mobil | 0 | 1 | 1.00 | 0.00 | 1 | 1 | 1.0 | 1 | 1 | ▁▁▇▁▁ |
| flag_work_phone | 0 | 1 | 0.21 | 0.40 | 0 | 0 | 0.0 | 0 | 1 | ▇▁▁▁▂ |
| flag_phone | 0 | 1 | 0.29 | 0.45 | 0 | 0 | 0.0 | 1 | 1 | ▇▁▁▁▃ |
| flag_email | 0 | 1 | 0.11 | 0.31 | 0 | 0 | 0.0 | 0 | 1 | ▇▁▁▁▁ |
| cnt_fam_members | 0 | 1 | 2.19 | 0.90 | 1 | 2 | 2.0 | 3 | 20 | ▇▁▁▁▁ |
In this step, we will convert days_birth and days_employed into Date class using the lubridate package. (Note that lubridate::as_date() counts these values from its default origin of 1970-01-01, so the ages derived later carry a constant 50-year shift; the ordering of the values is preserved, which is what a tree-based model uses.) Then we merge the two data frames into one. The credit_record table contains only 33,856 customers while application_record contains 438,557, so only a subset of applicants can be matched; inner_join() keeps just the matched ids.
application_record <- application_record %>% mutate(days_birth = lubridate::as_date(days_birth),
days_employed = lubridate::as_date(days_employed))
credit_data <- inner_join(credit_record, application_record)
## Joining, by = "id"
head(credit_data)
## # A tibble: 6 x 19
## id status code_gender flag_own_car flag_own_realty cnt_children
## <dbl> <fct> <chr> <chr> <chr> <dbl>
## 1 5008804 paid off that m… M Y Y 0
## 2 5008805 paid off that m… M Y Y 0
## 3 5008806 paid off that m… M Y Y 0
## 4 5008808 1m past due F N Y 0
## 5 5008810 paid off that m… F N Y 0
## 6 5008811 paid off that m… F N Y 0
## # … with 13 more variables: amt_income_total <dbl>, name_income_type <chr>,
## # name_education_type <chr>, name_family_status <chr>,
## # name_housing_type <chr>, days_birth <date>, days_employed <date>,
## # flag_mobil <dbl>, flag_work_phone <dbl>, flag_phone <dbl>,
## # flag_email <dbl>, occupation_type <chr>, cnt_fam_members <dbl>
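Before dropping the raw tables, a quick check (a sketch) of how many applicant ids actually match the current-month credit records:

# count the application ids that appear in credit_record
sum(unique(application_record$id) %in% credit_record$id)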
rm(application_record, credit_record)
The chart below shows that nearly 13 thousand customers paid off their balance in the current month, about 4.4 thousand have no loan for the month, and over 7 thousand customers are past due.
credit_data %>% count(status, sort = T) %>%
ggplot(aes(n, fct_reorder(status, n), fill = status)) +
geom_col() +
theme_light()
Now, I will define customers who have paid off their balance or have no loan for the month as good customers; the others, who are past due, are defined as bad customers. I also recode missing occupation_type values as "unknown" and lump the rarest occupation levels together.
final_dat <- credit_data %>%
  # good = paid off or no loan this month; everything past due is bad
  mutate(classification = case_when(status == "no loan for the month" ~ "good",
                                    status == "paid off that month" ~ "good",
                                    TRUE ~ "bad")) %>%
  # recode missing occupations as "unknown"
  mutate(occupation_type = case_when(is.na(occupation_type) ~ "unknown",
                                     TRUE ~ occupation_type)) %>%
  # keep the 7 most common occupations and lump the rest into "Other";
  # the derived "years" carry a constant 50-year shift from the 1970-01-01
  # origin, but the ordering is preserved, which is what a tree model uses
  mutate(occupation_type = fct_lump(occupation_type, n = 7),
         days_birth = 2020 - lubridate::year(days_birth),
         days_employed = 2020 - lubridate::year(days_employed)) %>%
  mutate_if(is.character, as.factor)
skimr::skim(final_dat)
| | Value |
|---|---|
| Name | final_dat |
| Number of rows | 24672 |
| Number of columns | 20 |
| Column type frequency: | |
| factor | 10 |
| numeric | 10 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| status | 0 | 1 | FALSE | 8 | pai: 12974, 1m : 6886, no : 4487, 2m : 236 |
| code_gender | 0 | 1 | FALSE | 2 | F: 16406, M: 8266 |
| flag_own_car | 0 | 1 | FALSE | 2 | N: 15301, Y: 9371 |
| flag_own_realty | 0 | 1 | FALSE | 2 | Y: 16293, N: 8379 |
| name_income_type | 0 | 1 | FALSE | 5 | Wor: 12711, Com: 5849, Pen: 4150, Sta: 1953 |
| name_education_type | 0 | 1 | FALSE | 5 | Sec: 16782, Hig: 6651, Inc: 966, Low: 250 |
| name_family_status | 0 | 1 | FALSE | 5 | Mar: 16942, Sin: 3319, Civ: 1965, Sep: 1434 |
| name_housing_type | 0 | 1 | FALSE | 6 | Hou: 22020, Wit: 1206, Mun: 785, Ren: 374 |
| occupation_type | 0 | 1 | FALSE | 8 | unk: 7629, Lab: 4293, Oth: 3579, Cor: 2381 |
| classification | 0 | 1 | FALSE | 2 | goo: 17461, bad: 7211 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 5078862.63 | 41535.33 | 5008804 | 5045511 | 5069454 | 5115437 | 5150487 | ▇▇▇▇▆ |
| cnt_children | 0 | 1 | 0.42 | 0.74 | 0 | 0 | 0 | 1 | 19 | ▇▁▁▁▁ |
| amt_income_total | 0 | 1 | 187074.99 | 102094.83 | 27000 | 121500 | 157500 | 225000 | 1575000 | ▇▁▁▁▁ |
| days_birth | 0 | 1 | 94.36 | 11.45 | 71 | 85 | 93 | 104 | 119 | ▃▇▇▆▃ |
| days_employed | 0 | 1 | -111.04 | 376.40 | -950 | 52 | 55 | 59 | 94 | ▂▁▁▁▇ |
| flag_mobil | 0 | 1 | 1.00 | 0.00 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
| flag_work_phone | 0 | 1 | 0.23 | 0.42 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
| flag_phone | 0 | 1 | 0.29 | 0.46 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
| flag_email | 0 | 1 | 0.09 | 0.29 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| cnt_fam_members | 0 | 1 | 2.18 | 0.91 | 1 | 2 | 2 | 3 | 20 | ▇▁▁▁▁ |
Next, we split the data into training and testing sets. The training data is used to train the model, while the testing data is held out to evaluate its performance.
set.seed(123)
split <- initial_split(final_dat, prop = 0.8)
train <- training(split)
test <- testing(split)   # named test to avoid masking rsample::testing()
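A quick sanity check (a sketch) on the split sizes and class balance:

# row counts and class balance per split
bind_rows(train = count(train, classification),
          test = count(test, classification),
          .id = "set")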
This step creates bootstrap resamples of the training data for evaluating the trained model. By default, bootstraps() draws 25 resamples; each analysis set is the same size as the training data (drawn with replacement), and the out-of-bag rows form the assessment set.
set.seed(234)
bs_resampling <- bootstraps(train)
bs_resampling
## # Bootstrap sampling
## # A tibble: 25 x 2
## splits id
## <list> <chr>
## 1 <split [19737/7227]> Bootstrap01
## 2 <split [19737/7220]> Bootstrap02
## 3 <split [19737/7317]> Bootstrap03
## 4 <split [19737/7255]> Bootstrap04
## 5 <split [19737/7267]> Bootstrap05
## 6 <split [19737/7240]> Bootstrap06
## 7 <split [19737/7238]> Bootstrap07
## 8 <split [19737/7273]> Bootstrap08
## 9 <split [19737/7210]> Bootstrap09
## 10 <split [19737/7279]> Bootstrap10
## # … with 15 more rows
We now build a recipe to preprocess the data. Here, I remove flag_mobil since it is a zero-variance variable (every value is 1). The id and status columns should not act as predictors, so I tag them with an "id" role.
rec <- recipe(classification~., data = train) %>%
add_role(id, status, new_role = "id") %>%
step_rm(flag_mobil)
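To see exactly what the recipe will hand to the model, we can prep it on the training data and bake it (a quick sketch):

# prep() estimates the recipe on train; bake(new_data = NULL) returns the
# processed training set
rec %>% prep() %>% bake(new_data = NULL) %>% glimpse()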
Let’s define a random forest model specification using the randomForest engine.
engine <- rand_forest() %>%
set_mode("classification") %>%
set_engine("randomForest")
I will bundle the preprocessing recipe and the model specification into a workflow.
wf <- workflow() %>%
add_recipe(rec) %>%
add_model(engine)
Let’s fit the model on the bootstrap resamples.
fit_resample <- wf %>%
  fit_resamples(resamples = bs_resampling,
                metrics = metric_set(accuracy, roc_auc),
                control = control_resamples(save_pred = TRUE))
The metrics show accuracy and ROC AUC both essentially equal to 1. Scores this perfect should raise suspicion rather than celebration; we will see why at the end of this section.
fit_resample %>% collect_metrics()
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 1.00 25 0.0000401 Preprocessor1_Model1
## 2 roc_auc binary 1.00 25 0.0000258 Preprocessor1_Model1
fit_resample %>% conf_mat_resampled()
## # A tibble: 4 x 3
## Prediction Truth Freq
## <fct> <fct> <dbl>
## 1 bad bad 2123.
## 2 bad good 0
## 3 good bad 1.24
## 4 good good 5129.
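Since we saved the predictions (save_pred = TRUE), we can also draw one ROC curve per bootstrap resample (a sketch):

# "bad" is the first factor level, so .pred_bad is the event probability
fit_resample %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(classification, .pred_bad) %>%
  autoplot()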
Now let’s see how the model performs on the testing data set. Once again the scores are perfect, which is further evidence that the model is reading the answer from somewhere rather than genuinely predicting it.
last_fit <- last_fit(wf, split)
last_fit %>% collect_metrics() %>% knitr::kable()
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 1 | Preprocessor1_Model1 |
| roc_auc | binary | 1 | Preprocessor1_Model1 |
last_fit %>% conf_mat_resampled()%>% knitr::kable()
| Prediction | Truth | Freq |
|---|---|---|
| bad | bad | 1408 |
| bad | good | 0 |
| good | bad | 0 |
| good | good | 3527 |
last_fit %>% collect_predictions()
## # A tibble: 4,935 x 7
## id .pred_bad .pred_good .row .pred_class classification .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/tes… 0.026 0.974 1 good good Preprocesso…
## 2 train/tes… 0.036 0.964 6 good good Preprocesso…
## 3 train/tes… 0.972 0.028 8 bad bad Preprocesso…
## 4 train/tes… 0.05 0.95 12 good good Preprocesso…
## 5 train/tes… 0.998 0.002 24 bad bad Preprocesso…
## 6 train/tes… 0.054 0.946 29 good good Preprocesso…
## 7 train/tes… 0.996 0.004 34 bad bad Preprocesso…
## 8 train/tes… 0.968 0.032 35 bad bad Preprocesso…
## 9 train/tes… 0 1 48 good good Preprocesso…
## 10 train/tes… 0 1 59 good good Preprocesso…
## # … with 4,925 more rows
At first glance the model separates good and bad customers perfectly, but that is exactly the problem. classification is derived directly from status, and add_role() only adds the "id" role without removing the original "predictor" role, so status was still handed to the random forest as a predictor: the model simply read the label off the status column. To measure real predictive power on new credit users, the leaking column must be genuinely excluded, for example with update_role() or step_rm(), and the workflow refit.
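A minimal sketch of the leakage-free version (same model, corrected recipe); after refitting, expect far more modest metrics than the perfect scores above:

# update_role() *replaces* the existing predictor role, unlike add_role(),
# so id and status are genuinely excluded from the predictors
rec_fixed <- recipe(classification ~ ., data = train) %>%
  update_role(id, status, new_role = "id") %>%
  step_rm(flag_mobil)

wf_fixed <- workflow() %>%
  add_recipe(rec_fixed) %>%
  add_model(engine)

# refit on the full training set
final_model <- fit(wf_fixed, data = train)

# scoring new applicants: `new_customers` is a hypothetical data frame with
# the same columns as the application data
# predict(final_model, new_data = new_customers, type = "prob")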