1. Description

Today, we are going to expolore a credit card service. following is the description of the task that we need to solve. This comes from a competition in Kaggle website.

*A manager at the bank is disturbed with more and more customers leaving their credit card services. They would really appreciate if one could predict for them who is gonna get churned so they can proactively go to the customer to provide them better services and turn customers’ decisions in the opposite direction

I got this dataset from a website with the URL as https://leaps.analyttica.com/home. I have been using this for a while to get datasets and accordingly work on them to produce fruitful results. The site explains how to solve a particular business problem.

Now, this dataset consists of 10,000 customers mentioning their age, salary, marital_status, credit card limit, credit card category, etc. There are nearly 18 features.

We have only 16.07% of customers who have churned. Thus, it’s a bit difficult to train our model to predict churning customers.*

2. Load necessary libraries

First, we need to load necessary libraries to facilitate out datamalipulation process and to build our forecasting model.

3. Loading and cleaning data

You can find the data from this site: https://www.kaggle.com/sakshigoyal7/credit-card-customers

bank_churner <- read_csv("BankChurners.csv")

res <- bank_churner %>% select(Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2)

bank_churner_dat <- bank_churner %>% select(CLIENTNUM:
                        Avg_Utilization_Ratio) %>% clean_names()

skim(bank_churner_dat)

Data summary
Name	bank_churner_dat
Number of rows	10127
Number of columns	21
_______________________
Column type frequency:
character	6
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
attrition_flag	1	17	17	2
gender	1	1	1	2
education_level	1	7	13	7
marital_status	1	6	8	4
income_category	1	7	14	6
card_category	1	4	8	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
clientnum	1	739177606.33	36903783.45	708082083.0	713036770.50	717926358.00	773143533.00	828343083.00	▇▁▂▂▁
customer_age	1	46.33	8.02	26.0	41.00	46.00	52.00	73.00	▂▆▇▃▁
dependent_count	1	2.35	1.30	0.0	1.00	2.00	3.00	5.00	▇▇▇▅▁
months_on_book	1	35.93	7.99	13.0	31.00	36.00	40.00	56.00	▁▃▇▃▂
total_relationship_count	1	3.81	1.55	1.0	3.00	4.00	5.00	6.00	▇▇▆▆▆
months_inactive_12_mon	1	2.34	1.01	0.0	2.00	2.00	3.00	6.00	▅▇▇▁▁
contacts_count_12_mon	1	2.46	1.11	0.0	2.00	2.00	3.00	6.00	▅▇▇▃▁
credit_limit	1	8631.95	9088.78	1438.3	2555.00	4549.00	11067.50	34516.00	▇▂▁▁▁
total_revolving_bal	1	1162.81	814.99	0.0	359.00	1276.00	1784.00	2517.00	▇▅▇▇▅
avg_open_to_buy	1	7469.14	9090.69	3.0	1324.50	3474.00	9859.00	34516.00	▇▂▁▁▁
total_amt_chng_q4_q1	1	0.76	0.22	0.0	0.63	0.74	0.86	3.40	▅▇▁▁▁
total_trans_amt	1	4404.09	3397.13	510.0	2155.50	3899.00	4741.00	18484.00	▇▅▁▁▁
total_trans_ct	1	64.86	23.47	10.0	45.00	67.00	81.00	139.00	▂▅▇▂▁
total_ct_chng_q4_q1	1	0.71	0.24	0.0	0.58	0.70	0.82	3.71	▇▆▁▁▁
avg_utilization_ratio	1	0.27	0.28	0.0	0.02	0.18	0.50	1.00	▇▂▂▂▁

We have 6 categorical variables and 15 numberic variables in our dataset. However, we do not use CLIENTNUM variable in our models since it is just ID of the client. There is not any missing in our dataset, both categorical and numeric variable. For more information about the detail of the dataset, please visit the link listed above.

4. Build the model

4.1. Creating split

We need to split our data into training data and testing data. The traning data is used to train our model. Models’ parameters that we gained from training data is used to evaluate model performance on testing data. The less error model made on testing data, the better model are.

Now, we split our data with the proportion of training data of 80%, the remaining are testing data.

# build model 
set.seed(111)
ini_split <- initial_split(bank_churner_dat, 
                           prop = 0.8, 
                           strata = attrition_flag)

bank_churners_test <- testing(ini_split)    
bank_churners_traning <- training(ini_split)

4.2. Creating sampling

Here, we create resampling method which is used for model evaluation.

set.seed(222)
bootstrap_resampling <- bootstraps(bank_churners_traning, strata = attrition_flag)
bootstrap_resampling

## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits              id         
##    <list>              <chr>      
##  1 <split [8101/2991]> Bootstrap01
##  2 <split [8101/2949]> Bootstrap02
##  3 <split [8101/2969]> Bootstrap03
##  4 <split [8101/3005]> Bootstrap04
##  5 <split [8101/2932]> Bootstrap05
##  6 <split [8101/3015]> Bootstrap06
##  7 <split [8101/3004]> Bootstrap07
##  8 <split [8101/2976]> Bootstrap08
##  9 <split [8101/2962]> Bootstrap09
## 10 <split [8101/2929]> Bootstrap10
## # ... with 15 more rows

4.3. Create pre_processing recipe

Before puting data into model, we need to preprocess the data.This dataset consists of 10,000 customers. But we have only 16.07% of customers who have churned. Therefore, we need to downsample our sample which contain the equal proportion of churned and not churned customes. And also, we convert string variables to factor variables.

library(themis)
bank_churner_recipe <- recipe(attrition_flag~., bank_churners_traning) %>%
  step_downsample(attrition_flag) %>%
  update_role(clientnum, new_role = "id") %>%
  step_string2factor(attrition_flag, gender, 
                     education_level, marital_status, 
                     income_category, card_category)
bank_churner_recipe

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor         19
## 
## Operations:
## 
## Down-sampling based on attrition_flag
## Factor variables from attrition_flag, gender, ...

4.4. Create model engines

Here, we build a random forest model and tuning its parameter including mtry, trees and min_n. We choose the ranger engine and classification mode.

library(ranger)
rf_model <- rand_forest(mtry = tune(), 
                        trees = tune(),
                        min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")
rf_model

## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger

set.seed(4545)
rf_grid <- grid_regular(
  mtry (range = c(1, 10)),
  trees (range = c(100, 1000)),
  min_n (range = c(1, 10)),
  levels = 3)
rf_grid

## # A tibble: 27 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     1   100     1
##  2     5   100     1
##  3    10   100     1
##  4     1   550     1
##  5     5   550     1
##  6    10   550     1
##  7     1  1000     1
##  8     5  1000     1
##  9    10  1000     1
## 10     1   100     5
## # ... with 17 more rows

4.5. Create workflows

This step just create a workflows, we add the recipe for data preprocessing, add the model. And finally, tune the model.

rf_wf <- workflow() %>%
  add_recipe(bank_churner_recipe) %>%
  add_model(rf_model)

doParallel::registerDoParallel(cores = 6)
fit <- rf_wf %>% tune_grid(resamples = bootstrap_resampling,
                           grid = rf_grid,
                           metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
                           control = control_resamples(save_pred = T))

4.6. Examine tunned model

We just have tunned our model with different set of parameters in rf_grid. Following a metrics results by different set of parameters. There are metrics including accuracy, roc_auc, sensitivity, specificity.

fit %>% collect_metrics()

## # A tibble: 108 x 9
##     mtry trees min_n .metric  .estimator  mean     n  std_err .config           
##    <int> <int> <int> <chr>    <chr>      <dbl> <int>    <dbl> <chr>             
##  1     1   100     1 accuracy binary     0.916    25 0.000955 Preprocessor1_Mod~
##  2     1   100     1 roc_auc  binary     0.962    25 0.000782 Preprocessor1_Mod~
##  3     1   100     1 sens     binary     0.868    25 0.00369  Preprocessor1_Mod~
##  4     1   100     1 spec     binary     0.926    25 0.00119  Preprocessor1_Mod~
##  5     5   100     1 accuracy binary     0.943    25 0.000858 Preprocessor1_Mod~
##  6     5   100     1 roc_auc  binary     0.984    25 0.000313 Preprocessor1_Mod~
##  7     5   100     1 sens     binary     0.938    25 0.00245  Preprocessor1_Mod~
##  8     5   100     1 spec     binary     0.944    25 0.00114  Preprocessor1_Mod~
##  9    10   100     1 accuracy binary     0.942    25 0.000877 Preprocessor1_Mod~
## 10    10   100     1 roc_auc  binary     0.985    25 0.000267 Preprocessor1_Mod~
## # ... with 98 more rows

We can use any of these metric to evaluate the performance of the model. However, in this case, we aim to detect churned customers, therefore, we choose best sensitivity metric which better for detecting churned customers.

best_parameter <- fit %>% collect_metrics() %>% filter(`.metric`=="sens") %>% filter(mean== max(mean)) %>% select(mtry, trees, min_n, .config )

4.7. Finalize the model based on best accuracy and rebuild the model woth best parameter integrated

We finalize the model by putting best parameters into our random forest model. Then, we rebuild our model workflow by adding finalize_model

#finalize model based on best accuracy
finalize_model <- finalize_model(rf_model, best_parameter)

finalize_model

## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 5
##   trees = 1000
##   min_n = 10
## 
## Computational engine: ranger

# rebuild the model using the best_tuned_specs
tuned_wf <- workflow() %>%
  add_recipe(bank_churner_recipe) %>%
  add_model(finalize_model)

tuned_wf

## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_downsample()
## * step_string2factor()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 5
##   trees = 1000
##   min_n = 10
## 
## Computational engine: ranger

4.8. Last fit the tuned model

After finalizing our model and workflow, now, we can fit out final model

last_fitted_model <- tuned_wf %>% last_fit(ini_split,
                                            metrics = metric_set(accuracy, roc_auc, sensitivity, specificity))

5. Assess our model results

Our metric looks good.

last_fitted_model %>% collect_metrics() %>% knitr::kable(digits = 3)

.metric	.estimator	.estimate	.config
accuracy	binary	0.932	Preprocessor1_Model1
sens	binary	0.966	Preprocessor1_Model1
spec	binary	0.925	Preprocessor1_Model1
roc_auc	binary	0.989	Preprocessor1_Model1

The table shows that we detect 93% correctly in total. For Attrited Customer, we can detect 316 out of 326 cases, on the other word we have detected 96,7% number of case of Attrited Customer. This is a very good results.

last_fitted_model %>% conf_mat_resampled() %>% 
  mutate(proportion = 100*Freq/sum(Freq)) %>% knitr::kable(digits = 2)

Prediction	Truth	Freq	proportion
Attrited Customer	Attrited Customer	315	15.55
Attrited Customer	Existing Customer	127	6.27
Existing Customer	Attrited Customer	11	0.54
Existing Customer	Existing Customer	1573	77.64

Conclusion

Our random forest model can detect Attrited Customer very well. For Attrited Customer, we can detect 316 out of 326 cases, on the other word we have detected 96,7% number of case of Attrited Customer. This is a very good results.

PREDICT CHURNING CUSTOMER

HUNG.NGUYEN

4/18/2020