Today, we are going to expolore a credit card service. following is the description of the task that we need to solve. This comes from a competition in Kaggle website.
*A manager at the bank is disturbed with more and more customers leaving their credit card services. They would really appreciate if one could predict for them who is gonna get churned so they can proactively go to the customer to provide them better services and turn customers’ decisions in the opposite direction
I got this dataset from a website with the URL as https://leaps.analyttica.com/home. I have been using this for a while to get datasets and accordingly work on them to produce fruitful results. The site explains how to solve a particular business problem.
Now, this dataset consists of 10,000 customers mentioning their age, salary, marital_status, credit card limit, credit card category, etc. There are nearly 18 features.
We have only 16.07% of customers who have churned. Thus, it’s a bit difficult to train our model to predict churning customers.*
First, we need to load necessary libraries to facilitate out datamalipulation process and to build our forecasting model.
You can find the data from this site: https://www.kaggle.com/sakshigoyal7/credit-card-customers
bank_churner <- read_csv("BankChurners.csv")
res <- bank_churner %>% select(Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2)
bank_churner_dat <- bank_churner %>% select(CLIENTNUM:
Avg_Utilization_Ratio) %>% clean_names()
skim(bank_churner_dat)
| Name | bank_churner_dat |
| Number of rows | 10127 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| attrition_flag | 0 | 1 | 17 | 17 | 0 | 2 | 0 |
| gender | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| education_level | 0 | 1 | 7 | 13 | 0 | 7 | 0 |
| marital_status | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
| income_category | 0 | 1 | 7 | 14 | 0 | 6 | 0 |
| card_category | 0 | 1 | 4 | 8 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| clientnum | 0 | 1 | 739177606.33 | 36903783.45 | 708082083.0 | 713036770.50 | 717926358.00 | 773143533.00 | 828343083.00 | ▇▁▂▂▁ |
| customer_age | 0 | 1 | 46.33 | 8.02 | 26.0 | 41.00 | 46.00 | 52.00 | 73.00 | ▂▆▇▃▁ |
| dependent_count | 0 | 1 | 2.35 | 1.30 | 0.0 | 1.00 | 2.00 | 3.00 | 5.00 | ▇▇▇▅▁ |
| months_on_book | 0 | 1 | 35.93 | 7.99 | 13.0 | 31.00 | 36.00 | 40.00 | 56.00 | ▁▃▇▃▂ |
| total_relationship_count | 0 | 1 | 3.81 | 1.55 | 1.0 | 3.00 | 4.00 | 5.00 | 6.00 | ▇▇▆▆▆ |
| months_inactive_12_mon | 0 | 1 | 2.34 | 1.01 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 | ▅▇▇▁▁ |
| contacts_count_12_mon | 0 | 1 | 2.46 | 1.11 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 | ▅▇▇▃▁ |
| credit_limit | 0 | 1 | 8631.95 | 9088.78 | 1438.3 | 2555.00 | 4549.00 | 11067.50 | 34516.00 | ▇▂▁▁▁ |
| total_revolving_bal | 0 | 1 | 1162.81 | 814.99 | 0.0 | 359.00 | 1276.00 | 1784.00 | 2517.00 | ▇▅▇▇▅ |
| avg_open_to_buy | 0 | 1 | 7469.14 | 9090.69 | 3.0 | 1324.50 | 3474.00 | 9859.00 | 34516.00 | ▇▂▁▁▁ |
| total_amt_chng_q4_q1 | 0 | 1 | 0.76 | 0.22 | 0.0 | 0.63 | 0.74 | 0.86 | 3.40 | ▅▇▁▁▁ |
| total_trans_amt | 0 | 1 | 4404.09 | 3397.13 | 510.0 | 2155.50 | 3899.00 | 4741.00 | 18484.00 | ▇▅▁▁▁ |
| total_trans_ct | 0 | 1 | 64.86 | 23.47 | 10.0 | 45.00 | 67.00 | 81.00 | 139.00 | ▂▅▇▂▁ |
| total_ct_chng_q4_q1 | 0 | 1 | 0.71 | 0.24 | 0.0 | 0.58 | 0.70 | 0.82 | 3.71 | ▇▆▁▁▁ |
| avg_utilization_ratio | 0 | 1 | 0.27 | 0.28 | 0.0 | 0.02 | 0.18 | 0.50 | 1.00 | ▇▂▂▂▁ |
We have 6 categorical variables and 15 numberic variables in our dataset. However, we do not use CLIENTNUM variable in our models since it is just ID of the client. There is not any missing in our dataset, both categorical and numeric variable. For more information about the detail of the dataset, please visit the link listed above.
We need to split our data into training data and testing data. The traning data is used to train our model. Models’ parameters that we gained from training data is used to evaluate model performance on testing data. The less error model made on testing data, the better model are.
Now, we split our data with the proportion of training data of 80%, the remaining are testing data.
# build model
set.seed(111)
ini_split <- initial_split(bank_churner_dat,
prop = 0.8,
strata = attrition_flag)
bank_churners_test <- testing(ini_split)
bank_churners_traning <- training(ini_split)
Here, we create resampling method which is used for model evaluation.
set.seed(222)
bootstrap_resampling <- bootstraps(bank_churners_traning, strata = attrition_flag)
bootstrap_resampling
## # Bootstrap sampling using stratification
## # A tibble: 25 x 2
## splits id
## <list> <chr>
## 1 <split [8101/2991]> Bootstrap01
## 2 <split [8101/2949]> Bootstrap02
## 3 <split [8101/2969]> Bootstrap03
## 4 <split [8101/3005]> Bootstrap04
## 5 <split [8101/2932]> Bootstrap05
## 6 <split [8101/3015]> Bootstrap06
## 7 <split [8101/3004]> Bootstrap07
## 8 <split [8101/2976]> Bootstrap08
## 9 <split [8101/2962]> Bootstrap09
## 10 <split [8101/2929]> Bootstrap10
## # ... with 15 more rows
Before puting data into model, we need to preprocess the data.This dataset consists of 10,000 customers. But we have only 16.07% of customers who have churned. Therefore, we need to downsample our sample which contain the equal proportion of churned and not churned customes. And also, we convert string variables to factor variables.
library(themis)
bank_churner_recipe <- recipe(attrition_flag~., bank_churners_traning) %>%
step_downsample(attrition_flag) %>%
update_role(clientnum, new_role = "id") %>%
step_string2factor(attrition_flag, gender,
education_level, marital_status,
income_category, card_category)
bank_churner_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## id 1
## outcome 1
## predictor 19
##
## Operations:
##
## Down-sampling based on attrition_flag
## Factor variables from attrition_flag, gender, ...
Here, we build a random forest model and tuning its parameter including mtry, trees and min_n. We choose the ranger engine and classification mode.
library(ranger)
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_model
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Computational engine: ranger
set.seed(4545)
rf_grid <- grid_regular(
mtry (range = c(1, 10)),
trees (range = c(100, 1000)),
min_n (range = c(1, 10)),
levels = 3)
rf_grid
## # A tibble: 27 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 1 100 1
## 2 5 100 1
## 3 10 100 1
## 4 1 550 1
## 5 5 550 1
## 6 10 550 1
## 7 1 1000 1
## 8 5 1000 1
## 9 10 1000 1
## 10 1 100 5
## # ... with 17 more rows
This step just create a workflows, we add the recipe for data preprocessing, add the model. And finally, tune the model.
rf_wf <- workflow() %>%
add_recipe(bank_churner_recipe) %>%
add_model(rf_model)
doParallel::registerDoParallel(cores = 6)
fit <- rf_wf %>% tune_grid(resamples = bootstrap_resampling,
grid = rf_grid,
metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
control = control_resamples(save_pred = T))
We just have tunned our model with different set of parameters in rf_grid. Following a metrics results by different set of parameters. There are metrics including accuracy, roc_auc, sensitivity, specificity.
fit %>% collect_metrics()
## # A tibble: 108 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 100 1 accuracy binary 0.916 25 0.000955 Preprocessor1_Mod~
## 2 1 100 1 roc_auc binary 0.962 25 0.000782 Preprocessor1_Mod~
## 3 1 100 1 sens binary 0.868 25 0.00369 Preprocessor1_Mod~
## 4 1 100 1 spec binary 0.926 25 0.00119 Preprocessor1_Mod~
## 5 5 100 1 accuracy binary 0.943 25 0.000858 Preprocessor1_Mod~
## 6 5 100 1 roc_auc binary 0.984 25 0.000313 Preprocessor1_Mod~
## 7 5 100 1 sens binary 0.938 25 0.00245 Preprocessor1_Mod~
## 8 5 100 1 spec binary 0.944 25 0.00114 Preprocessor1_Mod~
## 9 10 100 1 accuracy binary 0.942 25 0.000877 Preprocessor1_Mod~
## 10 10 100 1 roc_auc binary 0.985 25 0.000267 Preprocessor1_Mod~
## # ... with 98 more rows
We can use any of these metric to evaluate the performance of the model. However, in this case, we aim to detect churned customers, therefore, we choose best sensitivity metric which better for detecting churned customers.
best_parameter <- fit %>% collect_metrics() %>% filter(`.metric`=="sens") %>% filter(mean== max(mean)) %>% select(mtry, trees, min_n, .config )
We finalize the model by putting best parameters into our random forest model. Then, we rebuild our model workflow by adding finalize_model
#finalize model based on best accuracy
finalize_model <- finalize_model(rf_model, best_parameter)
finalize_model
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 5
## trees = 1000
## min_n = 10
##
## Computational engine: ranger
# rebuild the model using the best_tuned_specs
tuned_wf <- workflow() %>%
add_recipe(bank_churner_recipe) %>%
add_model(finalize_model)
tuned_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_downsample()
## * step_string2factor()
##
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 5
## trees = 1000
## min_n = 10
##
## Computational engine: ranger
After finalizing our model and workflow, now, we can fit out final model
last_fitted_model <- tuned_wf %>% last_fit(ini_split,
metrics = metric_set(accuracy, roc_auc, sensitivity, specificity))
Our metric looks good.
last_fitted_model %>% collect_metrics() %>% knitr::kable(digits = 3)
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.932 | Preprocessor1_Model1 |
| sens | binary | 0.966 | Preprocessor1_Model1 |
| spec | binary | 0.925 | Preprocessor1_Model1 |
| roc_auc | binary | 0.989 | Preprocessor1_Model1 |
The table shows that we detect 93% correctly in total. For Attrited Customer, we can detect 316 out of 326 cases, on the other word we have detected 96,7% number of case of Attrited Customer. This is a very good results.
last_fitted_model %>% conf_mat_resampled() %>%
mutate(proportion = 100*Freq/sum(Freq)) %>% knitr::kable(digits = 2)
| Prediction | Truth | Freq | proportion |
|---|---|---|---|
| Attrited Customer | Attrited Customer | 315 | 15.55 |
| Attrited Customer | Existing Customer | 127 | 6.27 |
| Existing Customer | Attrited Customer | 11 | 0.54 |
| Existing Customer | Existing Customer | 1573 | 77.64 |
Our random forest model can detect Attrited Customer very well. For Attrited Customer, we can detect 316 out of 326 cases, on the other word we have detected 96,7% number of case of Attrited Customer. This is a very good results.