Hello everyone! On this page, I want to share an analysis of bank marketing data from Portugal. The goal is to predict whether a customer will buy the product after receiving a call from a bank officer. I am going to use two classification methods, the Naive Bayes and Decision Tree algorithms, and at the end I will compare which method produces the best results.
Let’s get started!
Loading the packages
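The package-loading chunk was not rendered; here is a minimal sketch, inferred from the functions used later in this post (caret is assumed for upSample() and confusionMatrix(), e1071 for naiveBayes(), rattle for fancyRpartPlot(), ROCR for the ROC curve):

library(dplyr)    # data wrangling: mutate(), select(), glimpse()
library(ggplot2)  # plotting
library(plotly)   # interactive charts via ggplotly()
library(glue)     # tooltip strings
library(caret)    # upSample(), confusionMatrix()
library(e1071)    # naiveBayes()
library(rpart)    # decision tree
library(rattle)   # fancyRpartPlot()
library(ROCR)     # prediction(), performance()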
Importing the data
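The import chunk is not shown either; a sketch assuming the UCI "bank.csv" file, which is semicolon-separated (the file name and path are assumptions):

# stringsAsFactors = TRUE reproduces the <fct> columns seen in the overview below
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)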
Overview of the data
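The overview below matches the output of dplyr's glimpse(); a sketch of the call:

glimpse(bank)  # lists each column with its type and first few values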
## Observations: 4,521
## Variables: 17
## $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 3...
## $ job <fct> unemployed, services, management, management, blue-collar...
## $ marital <fct> married, married, single, married, married, single, marri...
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertia...
## $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374,...
## $ housing <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes,...
## $ loan <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no...
## $ contact <fct> cellular, cellular, cellular, unknown, unknown, cellular,...
## $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, ...
## $ month <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, ap...
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113,...
## $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, ...
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, ...
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, ...
## $ poutcome <fct> unknown, failure, failure, unknown, unknown, failure, oth...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, ...
Here is the attribute information for the data frame. Input variables:
age (numeric)
job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
education (categorical: ‘primary’,‘secondary’,‘tertiary’,‘unknown’)
default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
contact: contact communication type (categorical: ‘cellular’,‘telephone’,‘unknown’)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
day: last contact day of the month (numeric)
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: ‘unknown’,‘other’,‘failure’,‘success’)
y: has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)
Since all of the data classes are already in the correct form, I will now check for missing values.
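A sketch of the missing-value check that produces the per-column counts below (the exact call is an assumption, but colSums() over is.na() yields this output format):

colSums(is.na(bank))  # number of NA values in each column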
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
Great! There are no missing values in the data.
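A quick look at the first rows (the output below matches head()):

head(bank)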
## age job marital education default balance housing loan contact day
## 1 30 unemployed married primary no 1787 no no cellular 19
## 2 33 services married secondary no 4789 yes yes cellular 11
## 3 35 management single tertiary no 1350 yes no cellular 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-collar married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 1 oct 79 1 -1 0 unknown no
## 2 may 220 1 339 4 failure no
## 3 apr 185 1 330 1 failure no
## 4 jun 199 4 -1 0 unknown no
## 5 may 226 1 -1 0 unknown no
## 6 feb 141 2 176 3 failure no
For this machine learning process, I will use the Naive Bayes and Decision Tree algorithms to predict the y variable.
Since I am going to use the Naive Bayes method, I have to make sure that all the predictors are of factor type. I will convert the numeric variables to factors (some variables have to be grouped) and eliminate unused variables.
In order to group certain variables, I look at the summary of each variable. I choose the 1st quartile, the mean, and the 3rd quartile as the group boundaries.
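A sketch of the summary calls; the three outputs below correspond to balance, duration, and campaign, in that order:

summary(bank$balance)
summary(bank$duration)
summary(bank$campaign)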
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3313 69 444 1423 1480 71188
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 104 185 264 329 3025
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.794 3.000 50.000
bank_bayes <- bank %>%
  mutate(age = as.factor(ifelse(age < 30, "<30",
                         ifelse(age >= 30 & age <= 50, "30-50", ">50")))) %>%
  mutate(balance = as.factor(ifelse(balance < 69, "Low",
                             ifelse(balance >= 69 & balance <= 1480, "Medium", "High")))) %>%
  mutate(duration = as.factor(ifelse(duration < 104, "Short",
                              ifelse(duration >= 104 & duration <= 264, "Medium", "Long")))) %>%
  mutate(campaign = as.factor(ifelse(campaign <= 1, "Rarely",
                              ifelse(campaign > 1 & campaign <= 2, "Medium", "Often")))) %>%
  select(-c(day, month, pdays))

Great! Now the data is prepared for Naive Bayes.
I want to show you the proportion of the y target variable using the following plot:
p1 <- bank_bayes %>%
  group_by(y) %>%
  summarise(freq = n()) %>%
  ggplot(mapping = aes(x = y, y = freq)) +
  geom_col(position = "stack",
           aes(fill = y, text = glue("Yes : {y}
Freq : {freq}"))) +
  theme_minimal()
ggplotly(p1, tooltip = "text")

From the chart above, it seems that the class proportion is imbalanced. To make a better prediction, it is necessary to balance the data. I am going to use upsampling for this process.
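The upsampling chunk is not shown; here is a sketch using caret's upSample(), which resamples the minority class with replacement until both classes are equal in size (the object name bayes_up comes from the next chunk; the seed value is an assumption):

set.seed(123)  # assumed seed, for reproducibility of the resampling
bayes_up <- upSample(x = bank_bayes %>% select(-y),
                     y = bank_bayes$y,
                     yname = "y")  # keep the target column named "y"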
Check the proportion once again:
p2 <- bayes_up %>%
  group_by(y) %>%
  summarise(freq = n()) %>%
  ggplot(mapping = aes(x = y, y = freq)) +
  geom_col(position = "stack",
           aes(fill = y, text = glue("Yes : {y}
Freq : {freq}"))) +
  theme_minimal()
ggplotly(p2, tooltip = "text")

Okay, the proportion has been balanced.
Now, I have to split the data into train and test sets for cross-validation.
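A sketch of the 80/20 split, mirroring the index-based split used for the decision tree later in this post (the seed and object names are assumptions):

set.seed(123)
index_bayes <- sample(nrow(bayes_up), nrow(bayes_up) * 0.8)
train_bayes <- bayes_up[index_bayes, ]
test_bayes  <- bayes_up[-index_bayes, ]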
Build the model using the naiveBayes() function (from the e1071 package).
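A sketch of the model call, assuming e1071's naiveBayes() (consistent with the type = "raw" prediction used later):

# learn P(y | predictors) from the training data under the naive independence assumption
model_bayes <- naiveBayes(y ~ ., data = train_bayes)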
Done, model_bayes has been built
I will evaluate the model using the predict() function, which generates predictions on the test_bayes data. After that, I can use the confusionMatrix() function to see the accuracy.
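A sketch of the prediction and evaluation step (pred_bayes is a hypothetical name; positive = "yes" matches the 'Positive' class reported below):

pred_bayes <- predict(model_bayes, newdata = test_bayes)  # class labels at the default 0.5 threshold
confusionMatrix(data = pred_bayes, reference = test_bayes$y, positive = "yes")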
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 643 223
## yes 158 576
##
## Accuracy : 0.7619
## 95% CI : (0.7402, 0.7826)
## No Information Rate : 0.5006
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5237
##
## Mcnemar's Test P-Value : 0.001042
##
## Sensitivity : 0.7209
## Specificity : 0.8027
## Pos Pred Value : 0.7847
## Neg Pred Value : 0.7425
## Prevalence : 0.4994
## Detection Rate : 0.3600
## Detection Prevalence : 0.4587
## Balanced Accuracy : 0.7618
##
## 'Positive' Class : yes
##
From the confusion matrix above, the accuracy of the model is quite good (~76%). However, recall (sensitivity) is what matters in this case, so I have to reduce the number of false negatives in order to increase the recall. To do that, I simply lower the classification threshold.
pred_imp <- predict(object = model_bayes, newdata = test_bayes, type = "raw")
predict_bayes_imp <- pred_imp %>%
  as.data.frame() %>%
  mutate(tuning_pred = as.factor(ifelse(yes >= 0.4, "yes", "no")))

Build another confusion matrix:
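A sketch of the second evaluation, comparing the tuned predictions against the same test labels:

confusionMatrix(data = predict_bayes_imp$tuning_pred,
                reference = test_bayes$y,
                positive = "yes")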
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 598 164
## yes 203 635
##
## Accuracy : 0.7706
## 95% CI : (0.7492, 0.791)
## No Information Rate : 0.5006
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5413
##
## Mcnemar's Test P-Value : 0.0473
##
## Sensitivity : 0.7947
## Specificity : 0.7466
## Pos Pred Value : 0.7578
## Neg Pred Value : 0.7848
## Prevalence : 0.4994
## Detection Rate : 0.3969
## Detection Prevalence : 0.5238
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : yes
##
After lowering the threshold to 0.4, the number of false negatives falls from 223 to 164. This means that fewer customers who are going to buy the product are predicted not to do so.
Looking at the accuracy, there is also an improvement, from 0.76 to 0.77. The model is getting better.
Using a decision tree is quite easy: I only have to use the rpart() function to create the model and then plot it with the fancyRpartPlot() function.
Since I have already prepared the bank data, I can now build the tree model.
First, upsample the data.
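The upsampling chunk for the tree is not shown; a sketch assuming the same caret::upSample() approach as before, applied here to the original bank data (whether the tree used the raw or the binned variables is not shown; the object name bank_tree_up comes from the next chunk):

set.seed(123)  # assumed seed
bank_tree_up <- upSample(x = bank %>% select(-y),
                         y = bank$y,
                         yname = "y")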
index_tree <- sample(nrow(bank_tree_up), nrow(bank_tree_up) * 0.8)
train_tree <- bank_tree_up[index_tree, ]
test_tree  <- bank_tree_up[-index_tree, ]
set.seed(123)
dtree <- rpart(formula = y ~ ., data = train_tree, method = "class")
fancyRpartPlot(dtree, sub = NULL)

A decision tree has several nodes that explain the probability of the predictors we use.
Each node shows the predicted class (“no” or “yes”), the predicted probability of each class, and the percentage of observations that fall into that node.
Same as before, I will make a prediction and a confusion matrix to evaluate the model.
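A sketch of the prediction and evaluation (pred_tree is a hypothetical name):

pred_tree <- predict(dtree, newdata = test_tree, type = "class")
confusionMatrix(data = pred_tree, reference = test_tree$y, positive = "yes")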
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 617 155
## yes 169 659
##
## Accuracy : 0.7975
## 95% CI : (0.777, 0.8169)
## No Information Rate : 0.5088
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5948
##
## Mcnemar's Test P-Value : 0.4702
##
## Sensitivity : 0.8096
## Specificity : 0.7850
## Pos Pred Value : 0.7959
## Neg Pred Value : 0.7992
## Prevalence : 0.5088
## Detection Rate : 0.4119
## Detection Prevalence : 0.5175
## Balanced Accuracy : 0.7973
##
## 'Positive' Class : yes
##
From the matrix summary above, we can see that the decision tree model works well on the data. The false-negative count is low, hence recall (sensitivity) is higher than in the previous models. The tree also gives higher accuracy (~80%), which makes this model quite effective for prediction.
The ROC curve may give us a view of whether we should tune the model or not, by evaluating the true positive rate against the false positive rate.
# pred_roc is assumed to be the class-probability matrix from predict() on the test set
pred_roc <- predict(dtree, newdata = test_tree, type = "prob")
pred_prob <- pred_roc[, 2]  # probability of the positive class ("yes")
bank_roc <- prediction(pred_prob, test_tree$y)
bank_performance <- performance(bank_roc, "tpr", "fpr")
plot(bank_performance)

After analyzing the ROC curve, it seems that I don't have to prune the decision tree, since it is already a reasonably sized tree.
data.frame(Model = c("Naive Bayes", "Naive Bayes Tuned", "Decision Tree"),
           Accuracy = c(0.76, 0.77, 0.80),
           Sensitivity = c(0.72, 0.79, 0.81))

##               Model Accuracy Sensitivity
## 1       Naive Bayes     0.76        0.72
## 2 Naive Bayes Tuned     0.77        0.79
## 3     Decision Tree     0.80        0.81
Based on the table above, it can be concluded that the decision tree is the best model among the three. It has the highest accuracy and sensitivity, which play an important role in decision making, and it is also highly interpretable. However, Naive Bayes can also be called a good model due to its speed in processing data and its ease of use. To raise the sensitivity and accuracy, all we need to do is take control of the threshold.