library(tidyverse)

Objective

Predicting the bank’s customer behavior using Naive Bayes and Decision Tree, which customer that will make a purchase activity when receiving calls from the bank

This data came from UCI Machine Learning, and is a real data from Portugese banking institution. The variable y (dependent/output) is the y variable (information on each variable below):

Data

Data Read

# read data using separator
bank <- read.csv("bank-full.csv", sep = ";")
str(bank)

'data.frame':   45211 obs. of  17 variables:
 $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
 $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
 $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
 $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
 $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
 $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Input variables:

1 - age (numeric)
2 - job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’) 3 - marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
4 - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
5 - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
6 - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
7 - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

**elated with the last contact of the current campaign*:**
8 - contact: contact communication type (categorical: ‘cellular’,‘telephone’)
9 - month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 - day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

EDA

Cross Validation

Using initial_split() to make proportions of y (dependent variable) the same even in train and test data:

library(rsample)

set.seed(100)
split <- initial_split(bank, prop = 0.8, strata = "y")

bank.train <- training(split)
bank.test <- testing(split)

# Check proportion on the original dataset
table(bank$y) %>% 
  prop.table()


       no       yes 
0.8830152 0.1169848

# Check proportion on the training dataset
table(bank.train$y) %>% 
  prop.table()


       no       yes 
0.8824684 0.1175316

# Check proportion on the test dataset
table(bank.train$y) %>% 
  prop.table()


       no       yes 
0.8824684 0.1175316

Naive Bayes

Model Fitting

library(e1071)
# model
bank.model <- naiveBayes(y ~ ., bank.train)

# predict
bank.pred <- predict(bank.model, bank.test)

Model Evaluation

library(caret)

confusionMatrix(
  bank.test$y,
  bank.pred
)

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7387  617
       yes  470  568
                                          
               Accuracy : 0.8798          
                 95% CI : (0.8729, 0.8864)
    No Information Rate : 0.8689          
    P-Value [Acc > NIR] : 0.001066        
                                          
                  Kappa : 0.4428          
                                          
 Mcnemar's Test P-Value : 9.497e-06       
                                          
            Sensitivity : 0.9402          
            Specificity : 0.4793          
         Pos Pred Value : 0.9229          
         Neg Pred Value : 0.5472          
             Prevalence : 0.8689          
         Detection Rate : 0.8170          
   Detection Prevalence : 0.8852          
      Balanced Accuracy : 0.7098          
                                          
       'Positive' Class : no

Accuracy: the ability to correctly predict both classes from the total observation.
Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
Specificity: the ability to correctly predict the negative class from the total actual-negative class.

Based on this result, everything is acceptable except specificity (the ability to correctly predict the negative class from the total actual-negative class), which is very low at 55%. This might mean that this Naive Bayes’ model can cause the bank to miss a huge chunk of potential customer that might say “yes” to deposit to the bank.

By Heatmap (Optional)

Or by creating a heatmap:

library(yardstick)
# model fitting
naive.prob <- predict(bank.model, bank.test, type = "raw")
naive.pred <- predict(bank.model, bank.test, type = "class")

# result
naive.table <- select(bank.test, y) %>% 
  bind_cols(y_pred = naive.pred) %>% 
  bind_cols(y_eprob = round(naive.prob[,1],4)) %>% 
  bind_cols(y_pprob = round(naive.prob[,2],4))

# performance evaluation - confusion matrix
#naive.table %>% 
 # conf_mat(y, y_pred) %>% 
  #autoplot(type = "heatmap")

naive.table %>%
  summarise(
    accuracy = accuracy_vec(y, y_pred),
    sensitivity = sens_vec(y, y_pred),
    specificity = spec_vec(y, y_pred),
    precision = precision_vec(y, y_pred)
  )

   accuracy sensitivity specificity precision
1 0.8797832   0.9229135   0.5472062 0.9401807

ROC Curve

Initial steps to create ROC curve:

# Probability of Prediction
bank.pred.raw <- predict(bank.model, bank.test, type = "raw") # copy & paste the chunk above
head(bank.pred.raw)

            no          yes
[1,] 0.9963142 0.0036857912
[2,] 0.9991702 0.0008298437
[3,] 0.9969154 0.0030846295
[4,] 0.9980056 0.0019943986
[5,] 0.9957616 0.0042383707
[6,] 0.9909533 0.0090467468

Then assign class to that prediction

yes.pred <- data.frame("prediction" = bank.pred.raw[,2],
                       "classification" = as.numeric(bank.test == "yes"))
tail(yes.pred)

       prediction classification
153709 0.96107329              1
153710 0.79282306              0
153711 0.11953660              0
153712 0.88410636              1
153713 0.04048232              1
153714 0.03568349              1

library(ROCR)

yes.roc <- prediction(yes.pred$prediction,
                      yes.pred$classification)
# tpr (true positive rate) and fpr (false positive rate)
performance <- performance(yes.roc, "tpr", "fpr")

# roc plot
plot(performance, colorize = T)

ROC (receiver operator characteristic) curve is graphical plot used to show the diagnostic ability of binary classifiers. It is constructed by plotting True Positive Rate (TPR) with False Positive Rate (FPR) - TPR: The true positive rate is the proportion of observations that were correctly predicted to be positive out of all positive observations (TP/(TP + FN)).
- FPR: the proportion of observations that are incorrectly predicted to be positive out of all negative observations (FP/(TN + FP))

The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Generally, if the curve is closer to the top-left corner, then the classifiers meant that is has better performance (because “True positive is high while false negative is low”), which is not the case with the graphic above. ROC itself does not depend on the class distribution.

AUC

AUC (Area Under the Curve), is derived from calculating….the area under the ROC curve, that signifies ‘the summary of the performance of each classifier into a single measure’.

# AUC
auc <- performance(yes.roc, "auc")
unlist(slot(auc, "y.values"))

[1] 0.4670649

AUC sources

Decision Tree

Model & Plot

library(partykit)

# Create a decision tree model using train data
dtmodel <- ctree(formula = y ~., data = bank.train)
plot(dtmodel, type = "simple")

Welp, too much of a tree.

Model Evaluation

# predict on test data
dt_pred <- predict(dtmodel, bank.test)
table(dt_pred)

dt_pred
  no  yes 
8064  978

cbind(bank.test, dt_pred)

   age          job marital education default balance housing loan contact day
2   44   technician  single secondary      no      29     yes   no unknown   5
3   33 entrepreneur married secondary      no       2     yes  yes unknown   5
15  57     services married secondary      no     162     yes   no unknown   5
18  57  blue-collar married   primary      no      52     yes   no unknown   5
   month duration campaign pdays previous poutcome  y dt_pred
2    may      151        1    -1        0  unknown no      no
3    may       76        1    -1        0  unknown no      no
15   may      174        1    -1        0  unknown no      no
18   may       38        1    -1        0  unknown no      no
 [ reached 'max' / getOption("max.print") -- omitted 9038 rows ]

dt_pred

  2   3  15  18  24  27  29  31  34  41  42  44  47  49  58  66  67  68  69  70 
 no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no 
 78  79 103 104 106 109 134 144 149 156 157 161 162 164 165 171 173 189 190 191 
 no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no 
193 194 207 209 212 213 215 222 223 229 234 249 256 259 263 265 269 271 272 274 
 no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no yes  no 
275 277 282 287 294 298 303 306 307 316 328 330 337 344 362 
 no  no  no  no  no  no  no  no  no  no  no  no  no  no  no 
 [ reached getOption("max.print") -- omitted 8967 entries ]
Levels: no yes

# create probability prediction
dt_prob_pred <- predict(dtmodel, bank.test, type = "prob")

# evaluating decision tree model
library(caret)
dtcf <- confusionMatrix(bank.test$y,
                dt_pred, positive = "yes")
dtcf

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7609  395
       yes  455  583
                                          
               Accuracy : 0.906           
                 95% CI : (0.8998, 0.9119)
    No Information Rate : 0.8918          
    P-Value [Acc > NIR] : 5.411e-06       
                                          
                  Kappa : 0.5255          
                                          
 Mcnemar's Test P-Value : 0.043           
                                          
            Sensitivity : 0.59611         
            Specificity : 0.94358         
         Pos Pred Value : 0.56166         
         Neg Pred Value : 0.95065         
             Prevalence : 0.10816         
         Detection Rate : 0.06448         
   Detection Prevalence : 0.11480         
      Balanced Accuracy : 0.76985         
                                          
       'Positive' Class : yes

On previous Naive Bayes model, there is only 1 parameter that has low %, which is “Specificity” (by 54%). But in this decision tree model, there are 2 parameters that have lower figure, which is Sensitivity and Precision.

ROC Curve

library(ROCR)
# di copy paste dari atas
dt_prob_pred <- predict(dtmodel, bank.test, type = "prob")
pred <- prediction(dt_prob_pred[, 2],
                   bank.test$y)
perf <- performance(pred, "tpr", "fpr")

# ROC Curve
plot(perf, colorize = T)

The ROC curve of this decision tree model is better than the ones from Naive Bayes model.

AUC

# auc
auc <- performance(pred, "auc")
auc@y.values[[1]]

[1] 0.9185209

With this result, it means that the classifiers used for decision tree model are considered ‘excellent’ (or at least better than the AUC from Naive Bayes’ model).

Conclusion

Comparing the models’ evaluation from Naive Bayes and decision tree (DT) method, we can conclude that for this case, predicting the bank customer’s deposit probability is better by using the decision tree model. While the evaluation does not have ideal result on every aspect, but overall the dt model has higher accuracy, sensitivity, precision, and even has better result in ROC curve and AUC value.

Reiteration

I will try to add the pruning step to the DT model. Pruning can help to simplify the branch/nodes of the decision tree, to avoid overfitting. The result of this pruning method may later be evaluated (by Confusion Matrix) so we can compare the result with the previous Naive Bayes and DT model.

Predicting Deposit Probability using Naive Bayes and Decision Tree

Predicting Deposit Probability using Naive Bayes and Decision Tree

Objective

Data

Data Read

EDA

Cross Validation

Naive Bayes

Model Fitting

Model Evaluation

By Heatmap (Optional)

ROC Curve

AUC

Decision Tree

Model & Plot

Model Evaluation

ROC Curve

AUC

Conclusion

Reiteration