1. Description

Today, we are going to expolore a credit card service. following is the description of the task that we need to solve. This comes from a competition in Kaggle website.

*A manager at the bank is disturbed with more and more customers leaving their credit card services. They would really appreciate if one could predict for them who is gonna get churned so they can proactively go to the customer to provide them better services and turn customers’ decisions in the opposite direction

I got this dataset from a website with the URL as https://leaps.analyttica.com/home. I have been using this for a while to get datasets and accordingly work on them to produce fruitful results. The site explains how to solve a particular business problem.

Now, this dataset consists of 10,000 customers mentioning their age, salary, marital_status, credit card limit, credit card category, etc. There are nearly 18 features.

We have only 16.07% of customers who have churned. Thus, it’s a bit difficult to train our model to predict churning customers.*

2. Load necessary libraries

First, we need to load necessary libraries to facilitate out datamalipulation process and to build our forecasting model.

3. Loading and cleaning data

You can find the data from this site: https://www.kaggle.com/sakshigoyal7/credit-card-customers

bank_churner <- read_csv("BankChurners.csv")

res <- bank_churner %>% select(Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2)

bank_churner_dat <- bank_churner %>% select(CLIENTNUM:
                        Avg_Utilization_Ratio) %>% clean_names()

skim(bank_churner_dat)
Data summary
Name bank_churner_dat
Number of rows 10127
Number of columns 21
_______________________
Column type frequency:
character 6
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
attrition_flag 0 1 17 17 0 2 0
gender 0 1 1 1 0 2 0
education_level 0 1 7 13 0 7 0
marital_status 0 1 6 8 0 4 0
income_category 0 1 7 14 0 6 0
card_category 0 1 4 8 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
clientnum 0 1 739177606.33 36903783.45 708082083.0 713036770.50 717926358.00 773143533.00 828343083.00 ▇▁▂▂▁
customer_age 0 1 46.33 8.02 26.0 41.00 46.00 52.00 73.00 ▂▆▇▃▁
dependent_count 0 1 2.35 1.30 0.0 1.00 2.00 3.00 5.00 ▇▇▇▅▁
months_on_book 0 1 35.93 7.99 13.0 31.00 36.00 40.00 56.00 ▁▃▇▃▂
total_relationship_count 0 1 3.81 1.55 1.0 3.00 4.00 5.00 6.00 ▇▇▆▆▆
months_inactive_12_mon 0 1 2.34 1.01 0.0 2.00 2.00 3.00 6.00 ▅▇▇▁▁
contacts_count_12_mon 0 1 2.46 1.11 0.0 2.00 2.00 3.00 6.00 ▅▇▇▃▁
credit_limit 0 1 8631.95 9088.78 1438.3 2555.00 4549.00 11067.50 34516.00 ▇▂▁▁▁
total_revolving_bal 0 1 1162.81 814.99 0.0 359.00 1276.00 1784.00 2517.00 ▇▅▇▇▅
avg_open_to_buy 0 1 7469.14 9090.69 3.0 1324.50 3474.00 9859.00 34516.00 ▇▂▁▁▁
total_amt_chng_q4_q1 0 1 0.76 0.22 0.0 0.63 0.74 0.86 3.40 ▅▇▁▁▁
total_trans_amt 0 1 4404.09 3397.13 510.0 2155.50 3899.00 4741.00 18484.00 ▇▅▁▁▁
total_trans_ct 0 1 64.86 23.47 10.0 45.00 67.00 81.00 139.00 ▂▅▇▂▁
total_ct_chng_q4_q1 0 1 0.71 0.24 0.0 0.58 0.70 0.82 3.71 ▇▆▁▁▁
avg_utilization_ratio 0 1 0.27 0.28 0.0 0.02 0.18 0.50 1.00 ▇▂▂▂▁

We have 6 categorical variables and 15 numberic variables in our dataset. However, we do not use CLIENTNUM variable in our models since it is just ID of the client. There is not any missing in our dataset, both categorical and numeric variable. For more information about the detail of the dataset, please visit the link listed above.

4. Build the model

4.1. Creating split

We need to split our data into training data and testing data. The traning data is used to train our model. Models’ parameters that we gained from training data is used to evaluate model performance on testing data. The less error model made on testing data, the better model are.

Now, we split our data with the proportion of training data of 80%, the remaining are testing data.

# build model 
set.seed(111)
ini_split <- initial_split(bank_churner_dat, 
                           prop = 0.8, 
                           strata = attrition_flag)

bank_churners_test <- testing(ini_split)    
bank_churners_traning <- training(ini_split)

4.2. Creating sampling

Here, we create resampling method which is used for model evaluation.

set.seed(222)
bootstrap_resampling <- bootstraps(bank_churners_traning, strata = attrition_flag)
bootstrap_resampling
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits              id         
##    <list>              <chr>      
##  1 <split [8101/2991]> Bootstrap01
##  2 <split [8101/2949]> Bootstrap02
##  3 <split [8101/2969]> Bootstrap03
##  4 <split [8101/3005]> Bootstrap04
##  5 <split [8101/2932]> Bootstrap05
##  6 <split [8101/3015]> Bootstrap06
##  7 <split [8101/3004]> Bootstrap07
##  8 <split [8101/2976]> Bootstrap08
##  9 <split [8101/2962]> Bootstrap09
## 10 <split [8101/2929]> Bootstrap10
## # ... with 15 more rows

4.3. Create pre_processing recipe

Before puting data into model, we need to preprocess the data.This dataset consists of 10,000 customers. But we have only 16.07% of customers who have churned. Therefore, we need to downsample our sample which contain the equal proportion of churned and not churned customes. And also, we convert string variables to factor variables.

library(themis)
bank_churner_recipe <- recipe(attrition_flag~., bank_churners_traning) %>%
  step_downsample(attrition_flag) %>%
  update_role(clientnum, new_role = "id") %>%
  step_string2factor(attrition_flag, gender, 
                     education_level, marital_status, 
                     income_category, card_category)
bank_churner_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor         19
## 
## Operations:
## 
## Down-sampling based on attrition_flag
## Factor variables from attrition_flag, gender, ...

4.4. Create model engines

Here, we build a random forest model and tuning its parameter including mtry, trees and min_n. We choose the ranger engine and classification mode.

library(ranger)
rf_model <- rand_forest(mtry = tune(), 
                        trees = tune(),
                        min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")
rf_model
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger
set.seed(4545)
rf_grid <- grid_regular(
  mtry (range = c(1, 10)),
  trees (range = c(100, 1000)),
  min_n (range = c(1, 10)),
  levels = 3)
rf_grid
## # A tibble: 27 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     1   100     1
##  2     5   100     1
##  3    10   100     1
##  4     1   550     1
##  5     5   550     1
##  6    10   550     1
##  7     1  1000     1
##  8     5  1000     1
##  9    10  1000     1
## 10     1   100     5
## # ... with 17 more rows

4.5. Create workflows

This step just create a workflows, we add the recipe for data preprocessing, add the model. And finally, tune the model.

rf_wf <- workflow() %>%
  add_recipe(bank_churner_recipe) %>%
  add_model(rf_model)

doParallel::registerDoParallel(cores = 6)
fit <- rf_wf %>% tune_grid(resamples = bootstrap_resampling,
                           grid = rf_grid,
                           metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
                           control = control_resamples(save_pred = T))

4.6. Examine tunned model

We just have tunned our model with different set of parameters in rf_grid. Following a metrics results by different set of parameters. There are metrics including accuracy, roc_auc, sensitivity, specificity.

fit %>% collect_metrics()
## # A tibble: 108 x 9
##     mtry trees min_n .metric  .estimator  mean     n  std_err .config           
##    <int> <int> <int> <chr>    <chr>      <dbl> <int>    <dbl> <chr>             
##  1     1   100     1 accuracy binary     0.916    25 0.000955 Preprocessor1_Mod~
##  2     1   100     1 roc_auc  binary     0.962    25 0.000782 Preprocessor1_Mod~
##  3     1   100     1 sens     binary     0.868    25 0.00369  Preprocessor1_Mod~
##  4     1   100     1 spec     binary     0.926    25 0.00119  Preprocessor1_Mod~
##  5     5   100     1 accuracy binary     0.943    25 0.000858 Preprocessor1_Mod~
##  6     5   100     1 roc_auc  binary     0.984    25 0.000313 Preprocessor1_Mod~
##  7     5   100     1 sens     binary     0.938    25 0.00245  Preprocessor1_Mod~
##  8     5   100     1 spec     binary     0.944    25 0.00114  Preprocessor1_Mod~
##  9    10   100     1 accuracy binary     0.942    25 0.000877 Preprocessor1_Mod~
## 10    10   100     1 roc_auc  binary     0.985    25 0.000267 Preprocessor1_Mod~
## # ... with 98 more rows

We can use any of these metric to evaluate the performance of the model. However, in this case, we aim to detect churned customers, therefore, we choose best sensitivity metric which better for detecting churned customers.

best_parameter <- fit %>% collect_metrics() %>% filter(`.metric`=="sens") %>% filter(mean== max(mean)) %>% select(mtry, trees, min_n, .config )

4.7. Finalize the model based on best accuracy and rebuild the model woth best parameter integrated

We finalize the model by putting best parameters into our random forest model. Then, we rebuild our model workflow by adding finalize_model

#finalize model based on best accuracy
finalize_model <- finalize_model(rf_model, best_parameter)

finalize_model
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 5
##   trees = 1000
##   min_n = 10
## 
## Computational engine: ranger
# rebuild the model using the best_tuned_specs
tuned_wf <- workflow() %>%
  add_recipe(bank_churner_recipe) %>%
  add_model(finalize_model)

tuned_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_downsample()
## * step_string2factor()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 5
##   trees = 1000
##   min_n = 10
## 
## Computational engine: ranger

4.8. Last fit the tuned model

After finalizing our model and workflow, now, we can fit out final model

last_fitted_model <- tuned_wf %>% last_fit(ini_split,
                                            metrics = metric_set(accuracy, roc_auc, sensitivity, specificity))

5. Assess our model results

Our metric looks good.

last_fitted_model %>% collect_metrics() %>% knitr::kable(digits = 3)
.metric .estimator .estimate .config
accuracy binary 0.932 Preprocessor1_Model1
sens binary 0.966 Preprocessor1_Model1
spec binary 0.925 Preprocessor1_Model1
roc_auc binary 0.989 Preprocessor1_Model1

The table shows that we detect 93% correctly in total. For Attrited Customer, we can detect 316 out of 326 cases, on the other word we have detected 96,7% number of case of Attrited Customer. This is a very good results.

last_fitted_model %>% conf_mat_resampled() %>% 
  mutate(proportion = 100*Freq/sum(Freq)) %>% knitr::kable(digits = 2)
Prediction Truth Freq proportion
Attrited Customer Attrited Customer 315 15.55
Attrited Customer Existing Customer 127 6.27
Existing Customer Attrited Customer 11 0.54
Existing Customer Existing Customer 1573 77.64

Conclusion

Our random forest model can detect Attrited Customer very well. For Attrited Customer, we can detect 316 out of 326 cases, on the other word we have detected 96,7% number of case of Attrited Customer. This is a very good results.