The objective of this analysis is to predict which clients are willing to subscribe to a term deposit product offered by a Portuguese banking institution through its telemarketing channel during 2008-2010. The models built here use the Naive Bayes, Decision Tree, and Random Forest methods.
Source of data: the Bank Marketing Data Set, which can be accessed at https://archive.ics.uci.edu/ml/datasets/bank+marketing
Here are the libraries used in this analysis:
library(dplyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(e1071)
library(ROCR)
library(partykit)
library(randomForest)
library(rattle)
library(GGally)
library(car)
library(ROSE)
library(rpart)
Here is the data used.
bank <- read.csv("bank-full.csv", sep=";")
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, ~
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "~
## $ marital <chr> "married", "single", "married", "married", "single", "marrie~
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", ~
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",~
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71~
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"~
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"~
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may~
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,~
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ~
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
Column Description: a full description of each column is available at the data source linked above.
Target Variable: y, whether the client has subscribed to a term deposit ("yes"/"no").
The following adjustments will be made to the dataset:
bank <- bank %>% mutate(job = as.factor(job),
marital = as.factor(marital),
education = as.factor(education),
default = as.factor(default),
loan = as.factor(loan),
housing = as.factor(housing),
contact = as.factor(contact),
month = as.factor(month),
poutcome = as.factor(poutcome),
y = as.factor(y))
head(bank)
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
The check above shows that the dataset contains no missing values.
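Several factor columns also use "unknown" as a level, which effectively acts as a missing-value marker. As a supplementary check (a sketch, not part of the original workflow), the line below counts those values per factor column:
# Count "unknown" entries in each factor column (e.g. education, contact, poutcome).
sapply(Filter(is.factor, bank), function(x) sum(x == "unknown"))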
To build a Naive Bayes model, it is important to check the correlation among the predictors, since the method assumes that each predictor is independent of the others.
ggcorr(bank, label_size = 3, hjust = 1, layout.exp = 2, label = T)
Based on the plot above, the predictors in the dataset are largely uncorrelated, with the main exception being a moderate correlation between pdays and previous.
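As a cross-check (an addition, not in the original code), the same information can be read from a plain correlation matrix; note that ggcorr() silently drops the factor columns:
# Correlation matrix of the numeric predictors only.
bank %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)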
Before fitting a model, the dataset needs to be split into "train" and "test" sets, with 80% of the data assigned to "train" and 20% to "test". Here is the splitting process.
RNGkind(sample.kind = "Rounding")
set.seed(192)
index <- sample(nrow(bank), size = nrow(bank)*0.8)
bank.train <- bank[index, ]
bank.test <- bank[-index, ]
Before building a model, it is necessary to check the proportion of the target variable.
prop.table(table(bank.train$y))
##
## no yes
## 0.8828246 0.1171754
The proportion check reveals a class imbalance of roughly 88:12. To address it, the majority class will be downsampled (undersampled).
set.seed(192)
train.down <- ovun.sample(y~.,
data = bank.train,
method = "under")$dataprop.table(table(train.down$y))##
## no yes
## 0.5012944 0.4987056
After downsampling, the updated proportion is balanced at roughly 0.501 : 0.499, so the modelling process can continue.
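For reference, caret (already loaded) offers an equivalent route; the sketch below is an assumption about an alternative, not part of the original workflow, and train.down2 is a hypothetical name:
# Undersample the majority class to a 50:50 split; column 17 is the target y.
set.seed(192)
train.down2 <- downSample(x = bank.train[, -17], y = bank.train$y, yname = "y")
prop.table(table(train.down2$y))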
Here is the Naive Bayes modelling process. Laplace smoothing with laplace = 1 is applied so that no conditional probability becomes zero for a category level that happens to be unobserved within a class.
model.nb <- naiveBayes(y~., train.down, laplace = 1)
model.nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## no yes
## 0.5012944 0.4987056
##
## Conditional probabilities:
## age
## Y [,1] [,2]
## no 40.90423 10.28181
## yes 41.68240 13.54187
##
## job
## Y admin. blue-collar entrepreneur housemaid management retired
## no 0.108380150 0.224719101 0.032069288 0.031132959 0.200374532 0.046114232
## yes 0.120470588 0.136705882 0.021647059 0.019764706 0.240941176 0.098117647
## job
## Y self-employed services student technician unemployed unknown
## no 0.039559925 0.095739700 0.018960674 0.166198502 0.027153558 0.009597378
## yes 0.036235294 0.071529412 0.051529412 0.156705882 0.039058824 0.007294118
##
## marital
## Y divorced married single
## no 0.1168191 0.6040347 0.2791461
## yes 0.1211978 0.5147371 0.3640651
##
## education
## Y primary secondary tertiary unknown
## no 0.15173546 0.52884615 0.27954972 0.03986867
## yes 0.10843942 0.46605375 0.37670910 0.04879774
##
## default
## Y no yes
## no 0.98122947 0.01877053
## yes 0.98962264 0.01037736
##
## balance
## Y [,1] [,2]
## no 1307.341 3283.647
## yes 1768.836 3333.038
##
## housing
## Y no yes
## no 0.4246832 0.5753168
## yes 0.6379717 0.3620283
##
## loan
## Y no yes
## no 0.83317691 0.16682309
## yes 0.90872642 0.09127358
##
## contact
## Y cellular telephone unknown
## no 0.62631949 0.06873094 0.30494957
## yes 0.82739920 0.07120962 0.10139118
##
## day
## Y [,1] [,2]
## no 15.93685 8.252132
## yes 15.14370 8.473055
##
## month
## Y apr aug dec feb jan jul
## no 0.066713483 0.143492509 0.002574906 0.056179775 0.031132959 0.154026217
## yes 0.104941176 0.131764706 0.021176471 0.084235294 0.026823529 0.120470588
## month
## Y jun mar may nov oct sep
## no 0.109784644 0.006320225 0.325842697 0.085205993 0.011001873 0.007724719
## yes 0.103764706 0.047058824 0.172705882 0.073882353 0.060705882 0.052470588
##
## duration
## Y [,1] [,2]
## no 222.4960 209.0754
## yes 540.0144 392.5904
##
## campaign
## Y [,1] [,2]
## no 2.791549 2.995923
## yes 2.120104 1.799999
##
## pdays
## Y [,1] [,2]
## no 35.60141 95.41756
## yes 68.09863 119.41539
##
## previous
## Y [,1] [,2]
## no 0.4467136 1.496878
## yes 1.1772062 2.643383
##
## poutcome
## Y failure other success unknown
## no 0.10647280 0.03799250 0.01383677 0.84169794
## yes 0.11621876 0.05964168 0.18057520 0.64356436
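To see what laplace = 1 does, here is a toy illustration (an aside, not part of the model above). For a categorical predictor, the smoothed estimate is (count + laplace) / (n_class + laplace * n_levels), so a level with a zero count still receives a small nonzero probability:
# Hypothetical counts for one predictor within one class; "other" was never observed.
counts <- c(failure = 10, other = 0, success = 5, unknown = 85)
(counts + 1) / (sum(counts) + 1 * length(counts))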
Here is the prediction on the test data using the Naive Bayes model.
pred.nb <- predict(object = model.nb,
newdata = bank.test,
type = "class")To evaluate the model performance will be used confusion matrix, as it is proceeded below.
confusionMatrix(data = pred.nb,
reference = bank.test$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6561 292
## yes 1431 759
##
## Accuracy : 0.8095
## 95% CI : (0.8012, 0.8175)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3693
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72217
## Specificity : 0.82095
## Pos Pred Value : 0.34658
## Neg Pred Value : 0.95739
## Prevalence : 0.11622
## Detection Rate : 0.08393
## Detection Prevalence : 0.24218
## Balanced Accuracy : 0.77156
##
## 'Positive' Class : yes
##
The confusionMatrix above shows an accuracy of 80.95%, a sensitivity (recall) of 72.22%, and a positive predictive value (precision) of 34.66%.
Since the objective of this analysis is to identify customers willing to subscribe to a deposit ("yes"), recall (sensitivity) is used as the main evaluation metric.
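Since recall is the target metric, one possible follow-up (an assumption, not part of the original workflow) is to lower the decision threshold on the raw posterior probabilities below the default 0.5, trading precision for recall; pred.nb.040 is a hypothetical name:
# Classify as "yes" whenever the posterior probability for "yes" exceeds 0.4.
pred.nb.raw <- predict(model.nb, newdata = bank.test, type = "raw")
pred.nb.040 <- factor(ifelse(pred.nb.raw[, "yes"] > 0.4, "yes", "no"),
                      levels = levels(bank.test$y))
confusionMatrix(data = pred.nb.040, reference = bank.test$y, positive = "yes")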
Besides the confusion matrix, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are used to measure how well the model separates the positive and negative classes, as shown below.
pred.nb.prb <- predict(model.nb, newdata = bank.test, type = "raw")
roc.bank.nb <- data.frame(pred.prop = pred.nb.prb [,2],
label = as.numeric(bank.test$y == "yes"))
pred.roc.nb <- prediction(predictions = roc.bank.nb$pred.prop,
labels = roc.bank.nb$label)
plot(performance(pred.roc.nb, "tpr", "fpr"))
auc.bank.nb <- performance(pred.roc.nb, "auc")
auc.bank.nb@y.values
## [[1]]
## [1] 0.8504087
The AUC of 0.85 (out of 1.00) indicates that the model separates the positive and negative classes fairly well.
Here is the decision tree process.
model.dt <- ctree(y~., data = train.down)
Here are the predictions on the test data and the training data.
#data test
pred.dt.test <- predict(object = model.dt,
newdata = bank.test)
#data train
pred.dt.train <- predict(object = model.dt,
newdata = bank.train)
To evaluate the model's performance, confusion matrices for both sets are used, as shown below.
#data test
confusionMatrix(data = pred.dt.test,
reference = bank.test$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6577 151
## yes 1415 900
##
## Accuracy : 0.8268
## 95% CI : (0.8189, 0.8346)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4462
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.85633
## Specificity : 0.82295
## Pos Pred Value : 0.38877
## Neg Pred Value : 0.97756
## Prevalence : 0.11622
## Detection Rate : 0.09952
## Detection Prevalence : 0.25600
## Balanced Accuracy : 0.83964
##
## 'Positive' Class : yes
##
The confusionMatrix on the test data above shows an accuracy of 82.68% and a sensitivity of 85.63%. Here is the same check on the training data.
#data train
confusionMatrix(data = pred.dt.train,
reference = bank.train$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 26243 508
## yes 5687 3730
##
## Accuracy : 0.8287
## 95% CI : (0.8248, 0.8326)
## No Information Rate : 0.8828
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4589
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8801
## Specificity : 0.8219
## Pos Pred Value : 0.3961
## Neg Pred Value : 0.9810
## Prevalence : 0.1172
## Detection Rate : 0.1031
## Detection Prevalence : 0.2604
## Balanced Accuracy : 0.8510
##
## 'Positive' Class : yes
##
The confusionMatrix on the training data above shows an accuracy of 82.87% and a sensitivity of 88.01%.
Since the objective is to identify customers willing to subscribe ("yes"), recall (sensitivity) remains the main evaluation metric.
With test and train accuracies of 82.68% and 82.87%, and sensitivities of 85.63% and 88.01% respectively, the two sets perform similarly, so the model does not appear to be over-fitted.
As before, the ROC curve and AUC are used to measure how well the model separates the positive and negative classes.
pred.dt.prb <- predict(model.dt, newdata = bank.test, type = "prob")
roc.bank.dt <- data.frame(pred.prop = pred.dt.prb [,2],
label = as.numeric(bank.test$y == "yes"))
pred.roc.dt <- prediction(predictions = roc.bank.dt$pred.prop,
labels = roc.bank.dt$label)
plot(performance(pred.roc.dt, "tpr", "fpr"))
auc.bank.dt <- performance(pred.roc.dt, "auc")
auc.bank.dt@y.values
## [[1]]
## [1] 0.9057526
The AUC of 0.91 (out of 1.00) indicates that the model separates the positive and negative classes well.
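The fitted tree itself can also be inspected visually through partykit's plot method (a sketch; output not shown, and on this data the full tree is large):
# Draw the conditional inference tree with compact node labels.
plot(model.dt, type = "simple")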
Here is the random forest process.
To reduce processing time, predictors whose variance is close to zero (and which therefore carry little information) are removed first.
nz.var.train <- nearZeroVar(train.down)
nz.var.test <- nearZeroVar(bank.test)
no.var.train <- train.down[, -nz.var.train]
no.var.test <- bank.test[, -nz.var.test]
glimpse(no.var.train)
## Rows: 8,498
## Columns: 15
## $ age <int> 56, 48, 55, 29, 45, 60, 60, 56, 37, 34, 35, 51, 63, 33, 33, ~
## $ job <fct> technician, management, retired, blue-collar, technician, en~
## $ marital <fct> divorced, divorced, married, married, married, married, divo~
## $ education <fct> unknown, tertiary, secondary, primary, secondary, primary, s~
## $ balance <int> 56, 2939, 1126, 50, 268, 2408, 162, 9, 1624, 416, 451, 303, ~
## $ housing <fct> yes, no, yes, no, no, no, no, no, yes, no, yes, yes, no, yes~
## $ loan <fct> no, no, no, no, no, no, no, yes, no, no, yes, no, no, yes, n~
## $ contact <fct> unknown, cellular, telephone, cellular, cellular, telephone,~
## $ day <int> 5, 22, 28, 8, 7, 20, 27, 20, 20, 14, 12, 3, 16, 9, 2, 30, 11~
## $ month <fct> may, aug, jul, jul, jul, apr, may, aug, nov, aug, may, jun, ~
## $ duration <int> 439, 91, 388, 185, 96, 102, 74, 285, 559, 1217, 247, 87, 57,~
## $ campaign <int> 1, 4, 2, 2, 14, 4, 1, 2, 2, 4, 1, 7, 4, 3, 1, 3, 3, 1, 22, 4~
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, ~
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow~
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, ~
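One caveat (an observation on the code above, not part of the original workflow): running nearZeroVar() separately on the training and test sets can flag different columns, leaving the two sets misaligned. A safer variant reuses the training-set indices for both:
# Drop the same near-zero-variance columns from both sets, based on training data only.
no.var.test <- bank.test[, -nz.var.train]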
Here is the model building, using trainControl to tune the number of randomly selected predictors (mtry) with 5-fold cross-validation repeated twice.
control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
model.rf <- train(y~., data = no.var.train, method = "rf", trControl = control)
saveRDS(model.rf, "bank_RF.RDS")
bank.model.rf <- readRDS("bank_RF.RDS")
glimpse(bank.model.rf)
## List of 25
## $ method : chr "rf"
## $ modelInfo :List of 15
## ..$ label : chr "Random Forest"
## ..$ library : chr "randomForest"
## ..$ loop : NULL
## ..$ type : chr [1:2] "Classification" "Regression"
## ..$ parameters:'data.frame': 1 obs. of 3 variables:
## .. ..$ parameter: chr "mtry"
## .. ..$ class : chr "numeric"
## .. ..$ label : chr "#Randomly Selected Predictors"
## ..$ grid :function (x, y, len = NULL, search = "grid")
## ..$ fit :function (x, y, wts, param, lev, last, classProbs, ...)
## ..$ predict :function (modelFit, newdata, submodels = NULL)
## ..$ prob :function (modelFit, newdata, submodels = NULL)
## ..$ predictors:function (x, ...)
## ..$ varImp :function (object, ...)
## ..$ levels :function (x)
## ..$ tags : chr [1:4] "Random Forest" "Ensemble Model" "Bagging" "Implicit Feature Selection"
## ..$ sort :function (x)
## ..$ oob :function (x)
## $ modelType : chr "Classification"
## $ results :'data.frame': 3 obs. of 5 variables:
## ..$ mtry : num [1:3] 2 21 40
## ..$ Accuracy : num [1:3] 0.803 0.852 0.851
## ..$ Kappa : num [1:3] 0.605 0.705 0.701
## ..$ AccuracySD: num [1:3] 0.01232 0.00818 0.00804
## ..$ KappaSD : num [1:3] 0.0247 0.0163 0.0161
## $ pred : NULL
## $ bestTune :'data.frame': 1 obs. of 1 variable:
## ..$ mtry: num 21
## $ call : language train.formula(form = y ~ ., data = no.var.train, method = "rf", trControl = control)
## $ dots : list()
## $ metric : chr "Accuracy"
## $ control :List of 27
## ..$ method : chr "repeatedcv"
## ..$ number : num 5
## ..$ repeats : num 2
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ skip : num 0
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : chr "none"
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 6
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp : num 3
## .. ..$ k : num 5
## .. ..$ freqCut : num 19
## .. ..$ uniqueCut: num 10
## .. ..$ cutoff : num 0.9
## ..$ sampling : NULL
## ..$ index :List of 10
## .. ..$ Fold1.Rep1: int [1:6799] 1 2 3 5 6 7 8 9 10 11 ...
## .. ..$ Fold2.Rep1: int [1:6798] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold3.Rep1: int [1:6799] 2 4 5 6 8 10 11 12 13 15 ...
## .. ..$ Fold4.Rep1: int [1:6798] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold5.Rep1: int [1:6798] 1 3 4 7 9 12 13 14 15 16 ...
## .. ..$ Fold1.Rep2: int [1:6798] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold2.Rep2: int [1:6798] 1 5 6 7 8 9 10 11 13 14 ...
## .. ..$ Fold3.Rep2: int [1:6799] 2 3 4 6 7 8 10 11 12 13 ...
## .. ..$ Fold4.Rep2: int [1:6798] 1 2 3 4 5 7 8 9 12 13 ...
## .. ..$ Fold5.Rep2: int [1:6799] 1 2 3 4 5 6 9 10 11 12 ...
## ..$ indexOut :List of 10
## .. ..$ Resample01: int [1:1699] 4 15 19 21 26 27 33 34 45 49 ...
## .. ..$ Resample02: int [1:1700] 20 22 25 32 35 41 46 47 63 66 ...
## .. ..$ Resample03: int [1:1699] 1 3 7 9 14 17 29 31 36 38 ...
## .. ..$ Resample04: int [1:1700] 12 13 16 18 23 24 28 39 43 53 ...
## .. ..$ Resample05: int [1:1700] 2 5 6 8 10 11 30 37 40 42 ...
## .. ..$ Resample06: int [1:1700] 13 14 16 22 33 36 37 39 41 42 ...
## .. ..$ Resample07: int [1:1700] 2 3 4 12 18 19 25 40 53 58 ...
## .. ..$ Resample08: int [1:1699] 1 5 9 20 27 43 46 49 51 55 ...
## .. ..$ Resample09: int [1:1700] 6 10 11 17 21 26 30 32 34 44 ...
## .. ..$ Resample10: int [1:1699] 7 8 15 23 24 28 29 31 35 38 ...
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds :List of 11
## .. ..$ : int [1:3] 945873 423799 898085
## .. ..$ : int [1:3] 250379 364676 54093
## .. ..$ : int [1:3] 170850 365471 474005
## .. ..$ : int [1:3] 665620 765413 363035
## .. ..$ : int [1:3] 375028 602099 625898
## .. ..$ : int [1:3] 475448 545467 355686
## .. ..$ : int [1:3] 267900 211683 496212
## .. ..$ : int [1:3] 559824 414995 506401
## .. ..$ : int [1:3] 392221 911793 800844
## .. ..$ : int [1:3] 78949 693462 636605
## .. ..$ : int 315343
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## $ finalModel :List of 23
## ..$ call : language randomForest(x = x, y = y, mtry = param$mtry)
## ..$ type : chr "classification"
## ..$ predicted : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 2 2 ...
## .. ..- attr(*, "names")= chr [1:8498] "X1" "X2" "X3" "X4" ...
## ..$ err.rate : num [1:500, 1:3] 0.223 0.232 0.224 0.221 0.218 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ confusion : num [1:2, 1:3] 3489 453 771 3785 0.181 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ votes : 'matrix' num [1:8498, 1:2] 0.726 0.961 0.452 0.915 0.949 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ oob.times : num [1:8498] 186 180 186 188 175 169 180 182 184 186 ...
## ..$ classes : chr [1:2] "no" "yes"
## ..$ importance : num [1:40, 1] 309.46 35.95 9.14 9.7 29.87 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ importanceSD : NULL
## ..$ localImportance: NULL
## ..$ proximity : NULL
## ..$ ntree : num 500
## ..$ mtry : num 21
## ..$ forest :List of 14
## .. ..$ ndbigtree : int [1:500] 1915 1945 1863 1857 1885 1785 1903 1967 1909 1897 ...
## .. ..$ nodestatus: int [1:2017, 1:500] 1 1 1 1 1 1 1 1 1 -1 ...
## .. ..$ bestvar : int [1:2017, 1:500] 35 30 35 35 1 39 22 1 40 0 ...
## .. ..$ treemap : int [1:2017, 1:2, 1:500] 2 4 6 8 10 12 14 16 18 0 ...
## .. ..$ nodepred : int [1:2017, 1:500] 0 0 0 0 0 0 0 0 0 1 ...
## .. ..$ xbestsplit: num [1:2017, 1:500] 207.5 0.5 448.5 124.5 26 ...
## .. ..$ pid : num [1:2] 1 1
## .. ..$ cutoff : num [1:2] 0.5 0.5
## .. ..$ ncat : Named num [1:40] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. ..- attr(*, "names")= chr [1:40] "age" "jobblue-collar" "jobentrepreneur" "jobhousemaid" ...
## .. ..$ maxcat : num 1
## .. ..$ nrnodes : int 2017
## .. ..$ ntree : num 500
## .. ..$ nclass : int 2
## .. ..$ xlevels :List of 40
## ..$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:8498] "1" "2" "3" "4" ...
## ..$ test : NULL
## ..$ inbag : NULL
## ..$ xNames : chr [1:40] "age" "jobblue-collar" "jobentrepreneur" "jobhousemaid" ...
## ..$ problemType : chr "Classification"
## ..$ tuneValue :'data.frame': 1 obs. of 1 variable:
## .. ..$ mtry: num 21
## ..$ obsLevels : chr [1:2] "no" "yes"
## .. ..- attr(*, "ordered")= logi FALSE
## ..$ param : list()
## ..- attr(*, "class")= chr "randomForest"
## $ preProcess : NULL
## $ trainingData:'data.frame': 8498 obs. of 15 variables:
## ..$ .outcome : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ age : int [1:8498] 56 48 55 29 45 60 60 56 37 34 ...
## ..$ job : Factor w/ 12 levels "admin.","blue-collar",..: 10 5 6 2 10 3 1 8 7 5 ...
## ..$ marital : Factor w/ 3 levels "divorced","married",..: 1 1 2 2 2 2 1 2 2 3 ...
## ..$ education: Factor w/ 4 levels "primary","secondary",..: 4 3 2 1 2 1 2 2 3 3 ...
## ..$ balance : int [1:8498] 56 2939 1126 50 268 2408 162 9 1624 416 ...
## ..$ housing : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 2 1 ...
## ..$ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## ..$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 1 2 1 1 2 3 1 1 1 ...
## ..$ day : int [1:8498] 5 22 28 8 7 20 27 20 20 14 ...
## ..$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 2 6 6 6 1 9 2 10 2 ...
## ..$ duration : int [1:8498] 439 91 388 185 96 102 74 285 559 1217 ...
## ..$ campaign : int [1:8498] 1 4 2 2 14 4 1 2 2 4 ...
## ..$ previous : int [1:8498] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ ptype :'data.frame': 0 obs. of 14 variables:
## ..$ age : int(0)
## ..$ job : Factor w/ 12 levels "admin.","blue-collar",..:
## ..$ marital : Factor w/ 3 levels "divorced","married",..:
## ..$ education: Factor w/ 4 levels "primary","secondary",..:
## ..$ balance : int(0)
## ..$ housing : Factor w/ 2 levels "no","yes":
## ..$ loan : Factor w/ 2 levels "no","yes":
## ..$ contact : Factor w/ 3 levels "cellular","telephone",..:
## ..$ day : int(0)
## ..$ month : Factor w/ 12 levels "apr","aug","dec",..:
## ..$ duration : int(0)
## ..$ campaign : int(0)
## ..$ previous : int(0)
## ..$ poutcome : Factor w/ 4 levels "failure","other",..:
## $ resample :'data.frame': 10 obs. of 3 variables:
## ..$ Accuracy: num [1:10] 0.861 0.83 0.852 0.857 0.854 ...
## ..$ Kappa : num [1:10] 0.721 0.661 0.705 0.714 0.708 ...
## ..$ Resample: chr [1:10] "Fold2.Rep1" "Fold1.Rep1" "Fold4.Rep1" "Fold3.Rep1" ...
## $ resampledCM :'data.frame': 30 obs. of 6 variables:
## ..$ cell1 : num [1:30] 691 669 672 713 688 686 729 695 698 719 ...
## ..$ cell2 : num [1:30] 161 183 180 139 164 166 123 157 154 133 ...
## ..$ cell3 : num [1:30] 209 105 109 194 73 77 186 86 91 210 ...
## ..$ cell4 : num [1:30] 638 742 738 654 775 771 661 761 756 638 ...
## ..$ mtry : num [1:30] 2 21 40 2 21 40 2 21 40 2 ...
## ..$ Resample: chr [1:30] "Fold1.Rep1" "Fold1.Rep1" "Fold1.Rep1" "Fold2.Rep1" ...
## $ perfNames : chr [1:2] "Accuracy" "Kappa"
## $ maximize : logi TRUE
## $ yLimits : NULL
## $ times :List of 3
## ..$ everything: 'proc_time' Named num [1:5] 451.9 12.5 475.2 NA NA
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ final : 'proc_time' Named num [1:5] 20.93 0.35 21.93 NA NA
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ prediction: logi [1:3] NA NA NA
## $ levels : chr [1:2] "no" "yes"
## ..- attr(*, "ordered")= logi FALSE
## $ terms :Classes 'terms', 'formula' language y ~ age + job + marital + education + balance + housing + loan + contact + day + month + duration + campaign| __truncated__
## .. ..- attr(*, "variables")= language list(y, age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign, previous, poutcome)
## .. ..- attr(*, "factors")= int [1:15, 1:14] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. ..- attr(*, "term.labels")= chr [1:14] "age" "job" "marital" "education" ...
## .. ..- attr(*, "order")= int [1:14] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign, previous, poutcome)
## .. ..- attr(*, "dataClasses")= Named chr [1:15] "factor" "numeric" "factor" "factor" ...
## .. .. ..- attr(*, "names")= chr [1:15] "y" "age" "job" "marital" ...
## $ coefnames : chr [1:40] "age" "jobblue-collar" "jobentrepreneur" "jobhousemaid" ...
## $ contrasts :List of 8
## ..$ job : chr "contr.treatment"
## ..$ marital : chr "contr.treatment"
## ..$ education: chr "contr.treatment"
## ..$ housing : chr "contr.treatment"
## ..$ loan : chr "contr.treatment"
## ..$ contact : chr "contr.treatment"
## ..$ month : chr "contr.treatment"
## ..$ poutcome : chr "contr.treatment"
## $ xlevels :List of 8
## ..$ job : chr [1:12] "admin." "blue-collar" "entrepreneur" "housemaid" ...
## ..$ marital : chr [1:3] "divorced" "married" "single"
## ..$ education: chr [1:4] "primary" "secondary" "tertiary" "unknown"
## ..$ housing : chr [1:2] "no" "yes"
## ..$ loan : chr [1:2] "no" "yes"
## ..$ contact : chr [1:3] "cellular" "telephone" "unknown"
## ..$ month : chr [1:12] "apr" "aug" "dec" "feb" ...
## ..$ poutcome : chr [1:4] "failure" "other" "success" "unknown"
## - attr(*, "class")= chr [1:2] "train" "train.formula"
#data test
pred.rf.test <- predict(object = bank.model.rf,
newdata = bank.test)
To evaluate the model's performance, the out-of-bag error and a confusion matrix are used, as shown below.
Here is the out-of-bag error check.
bank.model.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 14.4%
## Confusion matrix:
## no yes class.error
## no 3489 771 0.1809859
## yes 453 3785 0.1068900
The out-of-bag estimate of the error rate is 14.4%, which corresponds to roughly 85.6% accuracy on the out-of-bag samples.
Here is the confusion matrix check.
#data test
confusionMatrix(data = pred.rf.test,
reference = bank.test$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6580 112
## yes 1412 939
##
## Accuracy : 0.8315
## 95% CI : (0.8236, 0.8391)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4663
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8934
## Specificity : 0.8233
## Pos Pred Value : 0.3994
## Neg Pred Value : 0.9833
## Prevalence : 0.1162
## Detection Rate : 0.1038
## Detection Prevalence : 0.2600
## Balanced Accuracy : 0.8584
##
## 'Positive' Class : yes
##
The confusionMatrix on the test data above shows an accuracy of 83.15% and a sensitivity of 89.34%.
Since the objective is to identify customers willing to subscribe ("yes"), recall (sensitivity) remains the main evaluation metric.
The out-of-bag accuracy of roughly 85.6% and the test accuracy of 83.15% are close, so the model does not appear to be over-fitted.
As before, the ROC curve and AUC are used to measure how well the model separates the positive and negative classes.
pred.rf.prb <- predict(bank.model.rf, newdata = bank.test, type = "prob")
roc.bank.rf <- data.frame(pred.prop = pred.rf.prb [,2],
label = as.numeric(bank.test$y == "yes"))
pred.roc.rf <- prediction(predictions = roc.bank.rf$pred.prop,
labels = roc.bank.rf$label)
plot(performance(pred.roc.rf, "tpr", "fpr"))
auc.bank.rf <- performance(pred.roc.rf, "auc")
auc.bank.rf@y.values
## [[1]]
## [1] 0.9254217
The AUC of roughly 0.93 (out of 1.00) indicates that the model separates the positive and negative classes very well.
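As a supplementary check (not part of the original workflow), caret's varImp() shows which predictors drive the random forest (output not shown):
# Rank predictors by importance in the final random forest model.
plot(varImp(bank.model.rf))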
Of the three models above, the Random Forest model delivers the best overall performance compared to the Decision Tree and Naive Bayes models, with an accuracy of 83.15%, a sensitivity of 89.34%, and an AUC of 92.54%.
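For a side-by-side recap, the sketch below collects the figures reported earlier into one table; all values are copied from the confusion matrices and AUC outputs above, not recomputed:
# Summary of test-set performance for the three models.
data.frame(model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
           accuracy    = c(0.8095, 0.8268, 0.8315),
           sensitivity = c(0.7222, 0.8563, 0.8934),
           auc         = c(0.8504, 0.9058, 0.9254))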