Data

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

Load Libraries

library(dplyr)
library(caret)
library(gridExtra)
library(GGally)
library(rsample)
library(e1071)
library(ROCR)

Read Data

bank <- read.csv("data/bank-additional.csv", sep=";")

head(bank)
##   age         job marital         education default housing    loan   contact
## 1  30 blue-collar married          basic.9y      no     yes      no  cellular
## 2  39    services  single       high.school      no      no      no telephone
## 3  25    services married       high.school      no     yes      no telephone
## 4  38    services married          basic.9y      no unknown unknown telephone
## 5  47      admin. married university.degree      no     yes      no  cellular
## 6  32    services  single university.degree      no      no      no  cellular
##   month day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1   may         fri      487        2   999        0 nonexistent         -1.8
## 2   may         fri      346        4   999        0 nonexistent          1.1
## 3   jun         wed      227        1   999        0 nonexistent          1.4
## 4   jun         fri       17        3   999        0 nonexistent          1.4
## 5   nov         mon       58        1   999        0 nonexistent         -0.1
## 6   sep         thu      128        3   999        2     failure         -1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1         92.893         -46.2     1.313      5099.1 no
## 2         93.994         -36.4     4.855      5191.0 no
## 3         94.465         -41.8     4.962      5228.1 no
## 4         94.465         -41.8     4.959      5228.1 no
## 5         93.200         -42.0     4.191      5195.8 no
## 6         94.199         -37.5     0.884      4963.6 no

Description of every column:

  1. age
  2. job : type of job (categorical)
  3. marital : marital status (categorical)
  4. education (categorical)
  5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
  8. contact: contact communication type (categorical)
  9. month: last contact month of year (categorical)
  10. day_of_week: last contact day of the week (categorical)
  11. duration: last contact duration, in seconds (numeric)
  12. campaign: number of contacts performed during this campaign and for this client (numeric)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical)
  16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17. cons.price.idx: consumer price index - monthly indicator (numeric)
  18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  19. euribor3m: euribor 3 month rate - daily indicator (numeric)
  20. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target): 21. y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Data Wrangling

glimpse(bank)
## Rows: 4,119
## Columns: 21
## $ age            <int> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,~
## $ job            <chr> "blue-collar", "services", "services", "services", "adm~
## $ marital        <chr> "married", "single", "married", "married", "married", "~
## $ education      <chr> "basic.9y", "high.school", "high.school", "basic.9y", "~
## $ default        <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n~
## $ housing        <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes~
## $ loan           <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n~
## $ contact        <chr> "cellular", "telephone", "telephone", "telephone", "cel~
## $ month          <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",~
## $ day_of_week    <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",~
## $ duration       <int> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,~
## $ campaign       <int> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2~
## $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, ~
## $ previous       <int> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste~
## $ emp.var.rate   <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,~
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,~
## $ cons.conf.idx  <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,~
## $ euribor3m      <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,~
## $ nr.employed    <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,~
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "~

Change data type from character to factor as well as removing column duration because the duration is not known before a call is performed, however it is known after the call is performed.

bank <- bank %>% 
  mutate_if(is.character, as.factor) %>% 
  select(-duration)

Check if there are any missing value

colSums(is.na(bank))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       campaign          pdays       previous       poutcome   emp.var.rate 
##              0              0              0              0              0 
## cons.price.idx  cons.conf.idx      euribor3m    nr.employed              y 
##              0              0              0              0              0

Exploratory Data Analysis

Since there is no missing value in every variables, we can check the correlation between each variables.

ggcorr(bank, hjust = 1, layout.exp = 2, label = T, label_size = 3)

According from the correlation above, variable emp.car.rate, eruibor3m, and nr.employed have a strong correlations. Those variables might not suitable for NaiveBayes Model.

Cross Validation

Splitting data bank to training and testing data with 75% of data will be used as data training.

RNGkind(sample.kind = "Rounding")
set.seed(106)

index <- sample(nrow(bank), nrow(bank)*0.8)
data_train <- bank[index, ]
data_test <- bank[-index, ]

Check proportion target

prop.table(table(data_train$y))
## 
##        no       yes 
## 0.8898331 0.1101669

Because the proportion target is not balanced, down sampling is needed to balanced the data proportion

bank_ds <- downSample(x = data_train %>% select(-y),
                     y = data_train$y,
                     yname = "y")

prop.table(table(bank_ds$y))
## 
##  no yes 
## 0.5 0.5

Modelling

Naive Bayes

model_naive <- naiveBayes(x = data_train %>% select(-y),
                          y = data_train$y)

Predicting

levels(data_train$y)
## [1] "no"  "yes"

Predicting model

bank_pred <- predict(model_naive, newdata = data_test, type = "class")

Model Evaluation

eval <- confusionMatrix(bank_pred, data_test$y, positive = "yes")
eval
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  661  44
##        yes  75  44
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.8297, 0.8789)
##     No Information Rate : 0.8932          
##     P-Value [Acc > NIR] : 0.999669        
##                                           
##                   Kappa : 0.3447          
##                                           
##  Mcnemar's Test P-Value : 0.005958        
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.8981          
##          Pos Pred Value : 0.3697          
##          Neg Pred Value : 0.9376          
##              Prevalence : 0.1068          
##          Detection Rate : 0.0534          
##    Detection Prevalence : 0.1444          
##       Balanced Accuracy : 0.6990          
##                                           
##        'Positive' Class : yes             
## 
# Accuracy
eval$overall[1]
##  Accuracy 
## 0.8555825
# Recall
eval$byClass[1]
## Sensitivity 
##         0.5
# Specificity
eval$byClass[2]
## Specificity 
##   0.8980978
# Precision
eval$byClass[3]
## Pos Pred Value 
##      0.3697479

Based on model evaluation, we can conclude that: 1. The accuracy of the model is 85.5% 2. The actual data of class yes is 50% who are subscribed to bank term deposit. 3. There are 89.8% who are not subscribed to bank term deposit. 4. The model predict 36.9% of class yes from prediction.

Using ROC to know how much the model can differentiate between class.

bank_prob <- predict(model_naive, newdata = data_test, type = "raw")#for the probability
round(head(bank_prob),4)
##          no    yes
## [1,] 0.9589 0.0411
## [2,] 0.9999 0.0001
## [3,] 0.9996 0.0004
## [4,] 0.9999 0.0001
## [5,] 0.0213 0.9787
## [6,] 1.0000 0.0000
roc <- prediction(predictions = bank_prob[,2], # = kelas positif
                       labels = as.numeric(data_test$y == "yes")) # label kelas positif
        
#Objec performance dari prediction
perf <- performance(prediction.obj = roc,
                    measure = "tpr",
                    x.measure = "fpr")
#plot
plot(perf, main = "ROC")
abline(0,1, lty =2)

auc <- performance(prediction.obj = roc,
                   measure = "auc")
auc@y.values[[1]]
## [1] 0.7624676

The model can differentiate 76.2% class based on the AUC score.

Decision Tree

Model fitting

In the fitting model we will use: 1. mincriterion = 0.99, where the p-value must be below 0.05 for a node to create branches. 2. minsplit = 2000 or the minimum number of observations after splitting. 3. minbucket = 1500 as the minimum number of observations in the terminal node.

library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
model_dtree <- ctree(formula = y ~ ., data = data_train,
                     control = ctree_control(mincriterion=0.99, 
                                            minsplit=2000,
                                            minbucket=1500))

plot(model_dtree, type = "simple")

Predicting

dtree_pred <- predict(model_dtree, data_test)
dtree_prob <- predict(model_dtree, data_test, type = "prob")

Evaluating

cm_dtree <- confusionMatrix(dtree_pred, data_test$y, positive = "yes")
cm_dtree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  736  88
##        yes   0   0
##                                           
##                Accuracy : 0.8932          
##                  95% CI : (0.8701, 0.9135)
##     No Information Rate : 0.8932          
##     P-Value [Acc > NIR] : 0.5284          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8932          
##              Prevalence : 0.1068          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : yes             
## 
#Objek Prediction
dt_roc <- prediction(predictions = dtree_prob[,2], # = kelas positif
                       labels = as.numeric(data_test$y == "yes")) # label kelas positif
        
#Objec performance dari prediction
dt_perf <- performance(prediction.obj = dt_roc,
                    measure = "tpr",
                    x.measure = "fpr")
#plot
plot(dt_perf, main = "ROC")
abline(0,1, lty =2)

dt_auc <- performance(prediction.obj = dt_roc,
                   measure = "auc")
dt_auc@y.values
## [[1]]
## [1] 0.6603878

The AUC score from the model decision tree is 66%, which is not a good score to differentiate between class.

Conclusion

Based on both model NaiveBayes and Decision Tree, NaiveBayes model has a bigger accuracy in predicting the model rather than Decision Tree.

Based on the results of the three models, the data model from NaiveBayes has better results, in addition to high accuracy and also the value of AUC and sensitivity or recall, because in this case the target class is the customer buying the product yes of course if recalled. This high model proves that the model has a small number in False Positive or a prediction of no but actually ‘yes’.