I am going to build a bank churn prediction model. The dataset describes customers of a U.S. bank and records whether a particular customer left the bank or not. It includes features such as Credit Score, Age, Gender, and many more.
First, load the libraries we need.
library(dplyr)
library(tidyverse)
library(gmodels)
library(gtools)
library(class)
# library to handle imbalanced data
library(caret)
I use a dataset from Kaggle called “Bank Customer Churn Prediction”. Or you can download it from here.
churn <- read.csv('Churn_Modelling.csv')
glimpse(churn)
#> Rows: 10,000
#> Columns: 14
#> $ RowNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ CustomerId <int> 15634602, 15647311, 15619304, 15701354, 15737888, 1557…
#> $ Surname <chr> "Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu",…
#> $ CreditScore <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,…
#> $ Geography <chr> "France", "Spain", "France", "France", "Spain", "Spain…
#> $ Gender <chr> "Female", "Female", "Female", "Female", "Female", "Mal…
#> $ Age <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25…
#> $ Tenure <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,…
#> $ Balance <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,…
#> $ NumOfProducts <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, …
#> $ HasCrCard <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
#> $ IsActiveMember <int> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
#> $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1…
#> $ Exited <int> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
Important variables in the data:
RowNumber : row number of the dataset
CustomerId : given customer ID
Surname : surname of the customer
CreditScore : credit score of the customer
Geography : location of the customer
Gender : whether the customer is male or female
Age : age of the customer
Tenure : how many years the customer has been with the bank
Balance : average balance of the customer
NumOfProducts : number of bank products the customer is using
HasCrCard : whether the customer has a credit card or not
IsActiveMember : whether the customer is an active member or not
EstimatedSalary : customer’s estimated salary
Exited : whether the customer exited or not
Check the data first.
head(churn)
As we can see, the data is not clean yet: some data types are not quite right, and some columns are not needed, such as RowNumber, CustomerId, and Surname.
churn_clean <- churn %>%
  select(-c(RowNumber, CustomerId, Surname)) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(
    HasCrCard = as.factor(HasCrCard),
    IsActiveMember = as.factor(IsActiveMember),
    Exited = factor(Exited, levels = c(0, 1),
                    labels = c("Churn", "Not Churn"))
  )
glimpse(churn_clean)
#> Rows: 10,000
#> Columns: 11
#> $ CreditScore <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,…
#> $ Geography <fct> France, Spain, France, France, Spain, Spain, France, G…
#> $ Gender <fct> Female, Female, Female, Female, Female, Male, Male, Fe…
#> $ Age <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25…
#> $ Tenure <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,…
#> $ Balance <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,…
#> $ NumOfProducts <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, …
#> $ HasCrCard <fct> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
#> $ IsActiveMember <fct> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
#> $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1…
#> $ Exited <fct> Not Churn, Churn, Not Churn, Churn, Churn, Not Churn, …
Check for missing values.
churn_clean %>% is.na() %>% colSums()
#> CreditScore Geography Gender Age Tenure
#> 0 0 0 0 0
#> Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
#> 0 0 0 0 0
#> Exited
#> 0
The dataset has no missing values. It is okay to continue.
Check the class proportions of the target variable. The target (dependent) variable is Exited.
churn_clean$Exited %>% table() %>% prop.table()
#> .
#> Churn Not Churn
#> 0.7963 0.2037
It’s imbalanced. We will handle this later.
We split the dataset into a train set and a test set. The train set is used to train the model; the test set is used to validate the model and see how well it performs on unseen data.
# Split the dataset into train data and test data
index <- sample(nrow(churn_clean),
                nrow(churn_clean) * 0.8)
churn_train <- churn_clean[index, ]
churn_test  <- churn_clean[-index, ]
# check the class proportions of the train data
prop.table(table(churn_train$Exited))
#>
#> Churn Not Churn
#> 0.79425 0.20575
Our training data is imbalanced, so we apply downsampling to the train set.
RNGkind(sample.kind = "Rounding")
set.seed(70)
# downSample() comes from caret, which is already loaded
churn_train_down <- downSample(
  x = churn_train %>% select(-Exited),
  y = churn_train$Exited,
  yname = "Exited"
)
Check the class proportions of the train data after downsampling.
churn_train_down$Exited %>% table() %>% prop.table()
#> .
#> Churn Not Churn
#> 0.5 0.5
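downSample() balances the classes by discarding rows from the majority class. If we would rather keep every row, caret also provides upSample() with the same interface; a minimal sketch (not used in the rest of this analysis):
# sketch only: oversample the minority class instead of downsampling
churn_train_up <- upSample(
  x = churn_train %>% select(-Exited),
  y = churn_train$Exited,
  yname = "Exited"
)
prop.table(table(churn_train_up$Exited))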
Positive class : Churn
Negative class : Not Churn
We train a Naive Bayes model using the e1071 package.
library(e1071)
model_NB_1 <- naiveBayes(
formula = Exited ~ . ,
data = churn_train_down
)
model_NB_1
#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#>
#> A-priori probabilities:
#> Y
#> Churn Not Churn
#> 0.5 0.5
#>
#> Conditional probabilities:
#> CreditScore
#> Y [,1] [,2]
#> Churn 650.0869 94.46189
#> Not Churn 644.0304 101.10760
#>
#> Geography
#> Y France Germany Spain
#> Churn 0.5340219 0.2059538 0.2600243
#> Not Churn 0.4015796 0.3863913 0.2120292
#>
#> Gender
#> Y Female Male
#> Churn 0.4216282 0.5783718
#> Not Churn 0.5571081 0.4428919
#>
#> Age
#> Y [,1] [,2]
#> Churn 37.19988 10.17289
#> Not Churn 44.93135 9.84584
#>
#> Tenure
#> Y [,1] [,2]
#> Churn 5.072904 2.891477
#> Not Churn 4.924666 2.903583
#>
#> Balance
#> Y [,1] [,2]
#> Churn 73181.37 63348.09
#> Not Churn 90771.97 58613.47
#>
#> NumOfProducts
#> Y [,1] [,2]
#> Churn 1.549210 0.5121704
#> Not Churn 1.473269 0.7975356
#>
#> HasCrCard
#> Y 0 1
#> Churn 0.3068044 0.6931956
#> Not Churn 0.3031592 0.6968408
#>
#> IsActiveMember
#> Y 0 1
#> Churn 0.4264885 0.5735115
#> Not Churn 0.6342649 0.3657351
#>
#> EstimatedSalary
#> Y [,1] [,2]
#> Churn 100005.0 57453.85
#> Not Churn 101394.2 57848.29
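The table above is how Naive Bayes scores a new customer: for each class it multiplies the a-priori probability by the conditional probability (or, for numeric predictors, the Gaussian density) of every observed feature value, then picks the class with the larger product. A rough sketch of the idea for a single categorical predictor, using the Geography values printed above:
# sketch only: contribution of Geography = "Germany" to each class score
prior     <- c(Churn = 0.5, `Not Churn` = 0.5)
p_germany <- c(Churn = 0.2059538, `Not Churn` = 0.3863913)
prior * p_germany  # Naive Bayes multiplies in every other feature the same way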
pred.churn.nb <- predict(
  object = model_NB_1,
  newdata = churn_test,
  type = 'class'
)
confusionMatrix(
  data = pred.churn.nb,
  reference = churn_test$Exited,
  positive = 'Churn'
)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Churn Not Churn
#> Churn 1206 100
#> Not Churn 403 291
#>
#> Accuracy : 0.7485
#> 95% CI : (0.7289, 0.7674)
#> No Information Rate : 0.8045
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3818
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7495
#> Specificity : 0.7442
#> Pos Pred Value : 0.9234
#> Neg Pred Value : 0.4193
#> Prevalence : 0.8045
#> Detection Rate : 0.6030
#> Detection Prevalence : 0.6530
#> Balanced Accuracy : 0.7469
#>
#> 'Positive' Class : Churn
#>
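Besides hard class labels, predict() for a naiveBayes model can also return posterior probabilities with type = "raw". This is useful if we later want to tune the classification threshold instead of using the default cut-off; a minimal sketch:
# sketch only: posterior probabilities for each class
prob.churn.nb <- predict(
  object = model_NB_1,
  newdata = churn_test,
  type = 'raw'
)
head(prob.churn.nb)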
We cross-check the Confusion Matrix and Statistics manually so the terms are easier to interpret (this step can be skipped if you are already familiar with them).
# True Positive
TP = 1206
# False Negative
FN = 403
# False Positive
FP = 100
# True Negative
TN = 291
# Sensitivity
Recall_NB = TP / (TP + FN)
Recall_NB
#> [1] 0.7495339
# Specificity
Specificity_NB = TN / (TN + FP)
Specificity_NB
#> [1] 0.7442455
# Positive Predictive Value
Precision_NB = TP / (TP + FP)
Precision_NB
#> [1] 0.9234303
# Negative Predictive Value
Neg_Pred_Value_NB = TN / (TN + FN)
Neg_Pred_Value_NB
#> [1] 0.4193084
Based on the Confusion Matrix and Statistics, we get a Sensitivity of 0.7495. This is fairly good: the model identifies the positive class reasonably well.
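Since we care about the positive class, we can also combine Recall and Precision into a single F1 score (their harmonic mean); a quick sketch based on the values above:
# sketch only: F1 combines Precision and Recall into one number
F1_NB = 2 * Precision_NB * Recall_NB / (Precision_NB + Recall_NB)
F1_NB
With the values above this works out to roughly 0.83.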
# Load the package
library(partykit)
# Build the tree model
# We pre-prune the tree (mincriterion, minsplit, minbucket) so it does not get too complicated
dtree_model1 <- ctree(formula = Exited ~ .,
                      data = churn_train,
                      control = ctree_control(mincriterion = 0.97,
                                              minsplit = 70,
                                              minbucket = 30))
# plot(dtree_model1, type = "simple")
knitr::include_graphics("Decision Tree.JPG")
# Call the model
dtree_model1
#>
#> Model formula:
#> Exited ~ CreditScore + Geography + Gender + Age + Tenure + Balance +
#> NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary
#>
#> Fitted party:
#> [1] root
#> | [2] Age <= 42
#> | | [3] Age <= 38
#> | | | [4] Geography in France, Spain
#> | | | | [5] Balance <= 181110.13
#> | | | | | [6] IsActiveMember in 0: Churn (n = 1656, err = 9.2%)
#> | | | | | [7] IsActiveMember in 1: Churn (n = 1718, err = 5.8%)
#> | | | | [8] Balance > 181110.13: Churn (n = 49, err = 36.7%)
#> | | | [9] Geography in Germany
#> | | | | [10] IsActiveMember in 0: Churn (n = 485, err = 19.2%)
#> | | | | [11] IsActiveMember in 1
#> | | | | | [12] Gender in Female: Churn (n = 225, err = 15.1%)
#> | | | | | [13] Gender in Male: Churn (n = 288, err = 6.9%)
#> | | [14] Age > 38
#> | | | [15] Geography in France, Spain
#> | | | | [16] Gender in Female: Churn (n = 428, err = 21.3%)
#> | | | | [17] Gender in Male
#> | | | | | [18] Balance <= 150507.21: Churn (n = 444, err = 9.0%)
#> | | | | | [19] Balance > 150507.21: Churn (n = 36, err = 36.1%)
#> | | | [20] Geography in Germany: Churn (n = 323, err = 30.0%)
#> | [21] Age > 42
#> | | [22] IsActiveMember in 0
#> | | | [23] Age <= 50
#> | | | | [24] Geography in France, Spain
#> | | | | | [25] Age <= 46: Churn (n = 317, err = 34.1%)
#> | | | | | [26] Age > 46: Not Churn (n = 172, err = 48.3%)
#> | | | | [27] Geography in Germany
#> | | | | | [28] NumOfProducts <= 1: Not Churn (n = 148, err = 22.3%)
#> | | | | | [29] NumOfProducts > 1: Churn (n = 73, err = 37.0%)
#> | | | [30] Age > 50
#> | | | | [31] Gender in Female: Not Churn (n = 189, err = 9.0%)
#> | | | | [32] Gender in Male: Not Churn (n = 164, err = 20.1%)
#> | | [33] IsActiveMember in 1
#> | | | [34] Geography in France, Spain
#> | | | | [35] Age <= 60: Churn (n = 716, err = 27.0%)
#> | | | | [36] Age > 60: Churn (n = 240, err = 7.9%)
#> | | | [37] Geography in Germany
#> | | | | [38] Age <= 67: Churn (n = 297, err = 44.1%)
#> | | | | [39] Age > 67: Churn (n = 32, err = 6.2%)
#>
#> Number of inner nodes: 19
#> Number of terminal nodes: 20
# Prediction with the test data
pred_churn_t <- predict(dtree_model1, churn_test, type = "response")
# confusion matrix with the test data
confusionMatrix(pred_churn_t, churn_test$Exited, positive = "Churn")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Churn Not Churn
#> Churn 1565 293
#> Not Churn 44 98
#>
#> Accuracy : 0.8315
#> 95% CI : (0.8144, 0.8477)
#> No Information Rate : 0.8045
#> P-Value [Acc > NIR] : 0.001084
#>
#> Kappa : 0.2942
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9727
#> Specificity : 0.2506
#> Pos Pred Value : 0.8423
#> Neg Pred Value : 0.6901
#> Prevalence : 0.8045
#> Detection Rate : 0.7825
#> Detection Prevalence : 0.9290
#> Balanced Accuracy : 0.6116
#>
#> 'Positive' Class : Churn
#>
Based on the Confusion Matrix and Statistics, the model performs well: it has a Sensitivity of 0.9727, which is close to 1. Its Specificity is only 0.2506, but that is acceptable here because our main concern is the positive class (Churn).
For this particular churn prediction dataset, we can use the Decision Tree, because the Decision Tree model performs better than the Naive Bayes model on the metric we care about most (Sensitivity: 0.9727 versus 0.7495).
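To make the comparison explicit, we could collect the headline metrics of both models (taken from the confusionMatrix outputs above) into one small table; a minimal sketch:
# sketch only: side-by-side metrics from the confusionMatrix outputs above
data.frame(
  Model       = c("Naive Bayes", "Decision Tree"),
  Accuracy    = c(0.7485, 0.8315),
  Sensitivity = c(0.7495, 0.9727),
  Specificity = c(0.7442, 0.2506)
)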