Objective

I am going to build a bank churn prediction model. The dataset contains information on a bank's customers (located in France, Germany, and Spain) and a label indicating whether each customer left the bank or not, along with features such as Credit Score, Age, Gender, and more.

Preparation

First, load the libraries we need

library(dplyr)
library(tidyverse)
library(gmodels)
library(gtools)
library(class)

Library to handle imbalanced data

library(caret)

Data import

I use a dataset from Kaggle called “Bank Customer Churn Prediction”. You can also download it from here.

churn <- read.csv('Churn_Modelling.csv')
glimpse(churn)
#> Rows: 10,000
#> Columns: 14
#> $ RowNumber       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ CustomerId      <int> 15634602, 15647311, 15619304, 15701354, 15737888, 1557…
#> $ Surname         <chr> "Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu",…
#> $ CreditScore     <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,…
#> $ Geography       <chr> "France", "Spain", "France", "France", "Spain", "Spain…
#> $ Gender          <chr> "Female", "Female", "Female", "Female", "Female", "Mal…
#> $ Age             <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25…
#> $ Tenure          <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,…
#> $ Balance         <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,…
#> $ NumOfProducts   <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, …
#> $ HasCrCard       <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
#> $ IsActiveMember  <int> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
#> $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1…
#> $ Exited          <int> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

Important columns in the data :

  1. RowNumber : row number of the dataset
  2. CustomerId : ID given to the customer
  3. Surname : surname of the customer
  4. CreditScore : credit score of the customer
  5. Geography : location (country) of the customer
  6. Gender : gender of the customer, male or female
  7. Age : age of the customer
  8. Tenure : how many years the customer has been with the bank
  9. Balance : average balance of the customer
  10. NumOfProducts : number of bank products the customer is using
  11. HasCrCard : whether the customer has a credit card or not
  12. IsActiveMember : whether the customer is an active member or not
  13. EstimatedSalary : the customer’s estimated salary
  14. Exited : whether the customer left the bank or not (the target variable)

Check the data first

head(churn)

Data Cleansing

As we can see, the data is not clean yet: some data types are not right, and columns such as RowNumber, CustomerId, and Surname are not needed.

churn_clean <- churn %>% 
  select(-c(RowNumber, CustomerId, Surname)) %>%
  mutate_if(is.character, as.factor) %>% 
  mutate(
    HasCrCard = as.factor(HasCrCard),
    IsActiveMember = as.factor(IsActiveMember),
    Exited = factor(Exited, levels = c(0, 1),
                    labels = c("Churn", "Not Churn"))
  )

glimpse(churn_clean)
#> Rows: 10,000
#> Columns: 11
#> $ CreditScore     <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,…
#> $ Geography       <fct> France, Spain, France, France, Spain, Spain, France, G…
#> $ Gender          <fct> Female, Female, Female, Female, Female, Male, Male, Fe…
#> $ Age             <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25…
#> $ Tenure          <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,…
#> $ Balance         <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,…
#> $ NumOfProducts   <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, …
#> $ HasCrCard       <fct> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
#> $ IsActiveMember  <fct> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
#> $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1…
#> $ Exited          <fct> Not Churn, Churn, Not Churn, Churn, Churn, Not Churn, …

Check for missing values

churn_clean %>% is.na() %>% colSums()
#>     CreditScore       Geography          Gender             Age          Tenure 
#>               0               0               0               0               0 
#>         Balance   NumOfProducts       HasCrCard  IsActiveMember EstimatedSalary 
#>               0               0               0               0               0 
#>          Exited 
#>               0

The dataset has no missing values, so it is okay to continue.
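As a side note, base R also has a one-line overall check; a minimal sketch:

# TRUE if any value in the data frame is missing
anyNA(churn_clean)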

Pre-Processing Data

Check the class proportions of the target variable. The target (dependent) variable is Exited.

churn_clean$Exited %>% table() %>% prop.table()
#> .
#>     Churn Not Churn 
#>    0.7963    0.2037

It’s imbalanced. We will handle this later.
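For a quick visual check of the imbalance, here is a minimal bar-chart sketch using ggplot2 (already loaded via tidyverse); the title and axis labels are only illustrative.

ggplot(churn_clean, aes(x = Exited)) +
  geom_bar() +
  labs(title = "Class distribution of the target variable",
       x = "Exited", y = "Count")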

Cross Validation

We split the dataset into a train set and a test set. The train set is used to train the model, and the test set is used to validate the model and see how well it performs on unseen data.

# Split the dataset into training and test data

index <- sample(nrow(churn_clean),
                nrow(churn_clean)*0.8)
churn_train <- churn_clean[index,]
churn_test <- churn_clean[-index,]
# check the class proportions in the training data
prop.table(table(churn_train$Exited))
#> 
#>     Churn Not Churn 
#>   0.79425   0.20575
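Note that the split above is not seeded, so the exact rows drawn can vary between runs. As a sketch, a reproducible and stratified alternative could use rsample; churn_train2 and churn_test2 are illustrative names, not the objects used below.

library(rsample)

set.seed(70)
churn_split  <- initial_split(churn_clean, prop = 0.8, strata = Exited)
churn_train2 <- training(churn_split)
churn_test2  <- testing(churn_split)

# stratified sampling keeps the class proportions similar in both sets
prop.table(table(churn_train2$Exited))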

Our data is imbalanced, so we downsample the training data.

RNGkind(sample.kind = "Rounding")
set.seed(70)
# downSample() comes from caret, which is already loaded

churn_train_down <- downSample(
  x = churn_train %>% select(-Exited), 
  y = churn_train$Exited,  
  yname = "Exited"
)

Check the proportions of the training data after downsampling

churn_train_down$Exited %>% table() %>% prop.table()
#> .
#>     Churn Not Churn 
#>       0.5       0.5
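Downsampling throws away rows from the majority class. If we preferred to keep every observation instead, caret also provides upSample(), which duplicates minority-class rows; a sketch, with churn_train_up as an illustrative name:

churn_train_up <- upSample(
  x = churn_train %>% select(-Exited),
  y = churn_train$Exited,
  yname = "Exited"
)

# should also give a 50/50 split, but with more rows than the downsampled data
churn_train_up$Exited %>% table() %>% prop.table()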


Naive Bayes

Positive class : Churn
Negative class : Not Churn

library(e1071)
model_NB_1 <- naiveBayes(
  formula = Exited ~ . , 
  data = churn_train_down
)

model_NB_1
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>     Churn Not Churn 
#>       0.5       0.5 
#> 
#> Conditional probabilities:
#>            CreditScore
#> Y               [,1]      [,2]
#>   Churn     650.0869  94.46189
#>   Not Churn 644.0304 101.10760
#> 
#>            Geography
#> Y              France   Germany     Spain
#>   Churn     0.5340219 0.2059538 0.2600243
#>   Not Churn 0.4015796 0.3863913 0.2120292
#> 
#>            Gender
#> Y              Female      Male
#>   Churn     0.4216282 0.5783718
#>   Not Churn 0.5571081 0.4428919
#> 
#>            Age
#> Y               [,1]     [,2]
#>   Churn     37.19988 10.17289
#>   Not Churn 44.93135  9.84584
#> 
#>            Tenure
#> Y               [,1]     [,2]
#>   Churn     5.072904 2.891477
#>   Not Churn 4.924666 2.903583
#> 
#>            Balance
#> Y               [,1]     [,2]
#>   Churn     73181.37 63348.09
#>   Not Churn 90771.97 58613.47
#> 
#>            NumOfProducts
#> Y               [,1]      [,2]
#>   Churn     1.549210 0.5121704
#>   Not Churn 1.473269 0.7975356
#> 
#>            HasCrCard
#> Y                   0         1
#>   Churn     0.3068044 0.6931956
#>   Not Churn 0.3031592 0.6968408
#> 
#>            IsActiveMember
#> Y                   0         1
#>   Churn     0.4264885 0.5735115
#>   Not Churn 0.6342649 0.3657351
#> 
#>            EstimatedSalary
#> Y               [,1]     [,2]
#>   Churn     100005.0 57453.85
#>   Not Churn 101394.2 57848.29
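One optional refinement: naiveBayes() also accepts a laplace argument, which applies Laplace smoothing to the categorical predictors and avoids zero conditional probabilities when a level is absent from one class. A minimal sketch (model_NB_2 and laplace = 1 are illustrative, not tuned values); we continue below with the unsmoothed model_NB_1.

model_NB_2 <- naiveBayes(
  formula = Exited ~ .,
  data = churn_train_down,
  laplace = 1  # illustrative smoothing value
)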
pred.churn.nb <- predict(
  object = model_NB_1,
  newdata = churn_test,
  type = 'class'
)
confusionMatrix(
  data = pred.churn.nb,
  reference = churn_test$Exited,
  positive = 'Churn'
)
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Churn Not Churn
#>   Churn      1206       100
#>   Not Churn   403       291
#>                                              
#>                Accuracy : 0.7485             
#>                  95% CI : (0.7289, 0.7674)   
#>     No Information Rate : 0.8045             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3818             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7495             
#>             Specificity : 0.7442             
#>          Pos Pred Value : 0.9234             
#>          Neg Pred Value : 0.4193             
#>              Prevalence : 0.8045             
#>          Detection Rate : 0.6030             
#>    Detection Prevalence : 0.6530             
#>       Balanced Accuracy : 0.7469             
#>                                              
#>        'Positive' Class : Churn              
#> 

We cross-check the Confusion Matrix and Statistics manually so they are easier to interpret (this step can be skipped if you are already familiar with the terms)

# True Positive
TP = 1206
# False Negative
FN = 403
# False Positive
FP = 100
# True Negative
TN = 291

# Sensitivity
Recall_NB = TP / (TP+FN)
Recall_NB
#> [1] 0.7495339
# Specificity
Specificity_NB = TN / (TN+FP)
Specificity_NB
#> [1] 0.7442455
# Positive Predictive Value
Precision_NB = TP / (TP+FP)
Precision_NB
#> [1] 0.9234303
# Negative Predictive Value
Neg_Pred_Value_NB = TN / (TN+FN)
Neg_Pred_Value_NB
#> [1] 0.4193084
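From the same four counts we can also derive accuracy and the F1 score; a short sketch (Accuracy_NB and F1_NB are illustrative names):

# Accuracy : proportion of all predictions that are correct
Accuracy_NB = (TP + TN) / (TP + TN + FP + FN)
Accuracy_NB

# F1 : harmonic mean of Precision and Recall
F1_NB = 2 * Precision_NB * Recall_NB / (Precision_NB + Recall_NB)
F1_NB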

Summary

Based on the Confusion Matrix and Statistics, we get a Sensitivity of 0.7495. This is reasonably good: the model identifies the positive class fairly well.



Decision Tree

# Load the package

library(partykit)
# Build the tree model
# We pre-prune the tree via ctree_control so it does not grow too complex

dtree_model1 <- ctree(formula = Exited ~ .,
                      data = churn_train,
                      control = ctree_control(mincriterion = 0.97,
                                              minsplit = 70,
                                              minbucket = 30))
# plot(dtree_model1, type = "simple")
knitr::include_graphics("Decision Tree.JPG")

# Call the model
dtree_model1
#> 
#> Model formula:
#> Exited ~ CreditScore + Geography + Gender + Age + Tenure + Balance + 
#>     NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary
#> 
#> Fitted party:
#> [1] root
#> |   [2] Age <= 42
#> |   |   [3] Age <= 38
#> |   |   |   [4] Geography in France, Spain
#> |   |   |   |   [5] Balance <= 181110.13
#> |   |   |   |   |   [6] IsActiveMember in 0: Churn (n = 1656, err = 9.2%)
#> |   |   |   |   |   [7] IsActiveMember in 1: Churn (n = 1718, err = 5.8%)
#> |   |   |   |   [8] Balance > 181110.13: Churn (n = 49, err = 36.7%)
#> |   |   |   [9] Geography in Germany
#> |   |   |   |   [10] IsActiveMember in 0: Churn (n = 485, err = 19.2%)
#> |   |   |   |   [11] IsActiveMember in 1
#> |   |   |   |   |   [12] Gender in Female: Churn (n = 225, err = 15.1%)
#> |   |   |   |   |   [13] Gender in Male: Churn (n = 288, err = 6.9%)
#> |   |   [14] Age > 38
#> |   |   |   [15] Geography in France, Spain
#> |   |   |   |   [16] Gender in Female: Churn (n = 428, err = 21.3%)
#> |   |   |   |   [17] Gender in Male
#> |   |   |   |   |   [18] Balance <= 150507.21: Churn (n = 444, err = 9.0%)
#> |   |   |   |   |   [19] Balance > 150507.21: Churn (n = 36, err = 36.1%)
#> |   |   |   [20] Geography in Germany: Churn (n = 323, err = 30.0%)
#> |   [21] Age > 42
#> |   |   [22] IsActiveMember in 0
#> |   |   |   [23] Age <= 50
#> |   |   |   |   [24] Geography in France, Spain
#> |   |   |   |   |   [25] Age <= 46: Churn (n = 317, err = 34.1%)
#> |   |   |   |   |   [26] Age > 46: Not Churn (n = 172, err = 48.3%)
#> |   |   |   |   [27] Geography in Germany
#> |   |   |   |   |   [28] NumOfProducts <= 1: Not Churn (n = 148, err = 22.3%)
#> |   |   |   |   |   [29] NumOfProducts > 1: Churn (n = 73, err = 37.0%)
#> |   |   |   [30] Age > 50
#> |   |   |   |   [31] Gender in Female: Not Churn (n = 189, err = 9.0%)
#> |   |   |   |   [32] Gender in Male: Not Churn (n = 164, err = 20.1%)
#> |   |   [33] IsActiveMember in 1
#> |   |   |   [34] Geography in France, Spain
#> |   |   |   |   [35] Age <= 60: Churn (n = 716, err = 27.0%)
#> |   |   |   |   [36] Age > 60: Churn (n = 240, err = 7.9%)
#> |   |   |   [37] Geography in Germany
#> |   |   |   |   [38] Age <= 67: Churn (n = 297, err = 44.1%)
#> |   |   |   |   [39] Age > 67: Churn (n = 32, err = 6.2%)
#> 
#> Number of inner nodes:    19
#> Number of terminal nodes: 20
# Predict on the test data
pred_churn_t <- predict(dtree_model1, churn_test, type="response")
# Confusion matrix on the test data
confusionMatrix(pred_churn_t, churn_test$Exited, positive = "Churn")
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Churn Not Churn
#>   Churn      1565       293
#>   Not Churn    44        98
#>                                                
#>                Accuracy : 0.8315               
#>                  95% CI : (0.8144, 0.8477)     
#>     No Information Rate : 0.8045               
#>     P-Value [Acc > NIR] : 0.001084             
#>                                                
#>                   Kappa : 0.2942               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9727               
#>             Specificity : 0.2506               
#>          Pos Pred Value : 0.8423               
#>          Neg Pred Value : 0.6901               
#>              Prevalence : 0.8045               
#>          Detection Rate : 0.7825               
#>    Detection Prevalence : 0.9290               
#>       Balanced Accuracy : 0.6116               
#>                                                
#>        'Positive' Class : Churn                
#> 
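As a sanity check against overfitting, we could also look at how the tree performs on the training data and compare it with the test metrics above; a minimal sketch (pred_churn_train is an illustrative name):

# Predict on the training data
pred_churn_train <- predict(dtree_model1, churn_train, type = "response")
# If these numbers are far better than the test metrics, the tree is overfitting
confusionMatrix(pred_churn_train, churn_train$Exited, positive = "Churn")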

Summary

Based on the Confusion Matrix and Statistics, the model performs well on the positive class: Sensitivity is 0.9727, which is close to 1. Specificity is much lower at 0.2506, but that is acceptable here because our main concern is the positive class (Churn).

Conclusion

For this particular churn prediction dataset, we can use the Decision Tree, because the Decision Tree model performs better than the Naive Bayes model on accuracy and on sensitivity for the positive class.
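To make that comparison concrete, a small sketch that collects the headline metrics reported in the two confusion matrices above:

model_comparison <- data.frame(
  Model       = c("Naive Bayes", "Decision Tree"),
  Accuracy    = c(0.7485, 0.8315),
  Sensitivity = c(0.7495, 0.9727),
  Specificity = c(0.7442, 0.2506)
)
model_comparison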