The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
library(dplyr)
library(caret)
library(gridExtra)
library(GGally)
library(rsample)
library(e1071)
library(ROCR)
bank <- read.csv("data/bank-additional.csv", sep=";")
head(bank)
## age job marital education default housing loan contact
## 1 30 blue-collar married basic.9y no yes no cellular
## 2 39 services single high.school no no no telephone
## 3 25 services married high.school no yes no telephone
## 4 38 services married basic.9y no unknown unknown telephone
## 5 47 admin. married university.degree no yes no cellular
## 6 32 services single university.degree no no no cellular
## month day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 may fri 487 2 999 0 nonexistent -1.8
## 2 may fri 346 4 999 0 nonexistent 1.1
## 3 jun wed 227 1 999 0 nonexistent 1.4
## 4 jun fri 17 3 999 0 nonexistent 1.4
## 5 nov mon 58 1 999 0 nonexistent -0.1
## 6 sep thu 128 3 999 2 failure -1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 92.893 -46.2 1.313 5099.1 no
## 2 93.994 -36.4 4.855 5191.0 no
## 3 94.465 -41.8 4.962 5228.1 no
## 4 94.465 -41.8 4.959 5228.1 no
## 5 93.200 -42.0 4.191 5195.8 no
## 6 94.199 -37.5 0.884 4963.6 no
Description of every column:
Output variable (desired target): 21. y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
glimpse(bank)
## Rows: 4,119
## Columns: 21
## $ age <int> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,~
## $ job <chr> "blue-collar", "services", "services", "services", "adm~
## $ marital <chr> "married", "single", "married", "married", "married", "~
## $ education <chr> "basic.9y", "high.school", "high.school", "basic.9y", "~
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n~
## $ housing <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes~
## $ loan <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n~
## $ contact <chr> "cellular", "telephone", "telephone", "telephone", "cel~
## $ month <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",~
## $ day_of_week <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",~
## $ duration <int> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,~
## $ campaign <int> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2~
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, ~
## $ previous <int> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste~
## $ emp.var.rate <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,~
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,~
## $ cons.conf.idx <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,~
## $ euribor3m <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,~
## $ nr.employed <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,~
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "~
Change data type from character to factor as well as removing column duration because the duration is not known before a call is performed, however it is known after the call is performed.
bank <- bank %>%
mutate_if(is.character, as.factor) %>%
select(-duration)
Check if there are any missing value
colSums(is.na(bank))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## campaign pdays previous poutcome emp.var.rate
## 0 0 0 0 0
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 0 0 0 0 0
Since there is no missing value in every variables, we can check the correlation between each variables.
ggcorr(bank, hjust = 1, layout.exp = 2, label = T, label_size = 3)
According from the correlation above, variable
emp.car.rate, eruibor3m, and
nr.employed have a strong correlations. Those variables
might not suitable for NaiveBayes Model.
Splitting data bank to training and testing data with
75% of data will be used as data training.
RNGkind(sample.kind = "Rounding")
set.seed(106)
index <- sample(nrow(bank), nrow(bank)*0.8)
data_train <- bank[index, ]
data_test <- bank[-index, ]
Check proportion target
prop.table(table(data_train$y))
##
## no yes
## 0.8898331 0.1101669
Because the proportion target is not balanced, down sampling is needed to balanced the data proportion
bank_ds <- downSample(x = data_train %>% select(-y),
y = data_train$y,
yname = "y")
prop.table(table(bank_ds$y))
##
## no yes
## 0.5 0.5
model_naive <- naiveBayes(x = data_train %>% select(-y),
y = data_train$y)
levels(data_train$y)
## [1] "no" "yes"
Predicting model
bank_pred <- predict(model_naive, newdata = data_test, type = "class")
eval <- confusionMatrix(bank_pred, data_test$y, positive = "yes")
eval
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 661 44
## yes 75 44
##
## Accuracy : 0.8556
## 95% CI : (0.8297, 0.8789)
## No Information Rate : 0.8932
## P-Value [Acc > NIR] : 0.999669
##
## Kappa : 0.3447
##
## Mcnemar's Test P-Value : 0.005958
##
## Sensitivity : 0.5000
## Specificity : 0.8981
## Pos Pred Value : 0.3697
## Neg Pred Value : 0.9376
## Prevalence : 0.1068
## Detection Rate : 0.0534
## Detection Prevalence : 0.1444
## Balanced Accuracy : 0.6990
##
## 'Positive' Class : yes
##
# Accuracy
eval$overall[1]
## Accuracy
## 0.8555825
# Recall
eval$byClass[1]
## Sensitivity
## 0.5
# Specificity
eval$byClass[2]
## Specificity
## 0.8980978
# Precision
eval$byClass[3]
## Pos Pred Value
## 0.3697479
Based on model evaluation, we can conclude that: 1. The accuracy of
the model is 85.5% 2. The actual data of class yes is 50%
who are subscribed to bank term deposit. 3. There are 89.8% who are not
subscribed to bank term deposit. 4. The model predict 36.9% of class
yes from prediction.
Using ROC to know how much the model can differentiate between class.
bank_prob <- predict(model_naive, newdata = data_test, type = "raw")#for the probability
round(head(bank_prob),4)
## no yes
## [1,] 0.9589 0.0411
## [2,] 0.9999 0.0001
## [3,] 0.9996 0.0004
## [4,] 0.9999 0.0001
## [5,] 0.0213 0.9787
## [6,] 1.0000 0.0000
roc <- prediction(predictions = bank_prob[,2], # = kelas positif
labels = as.numeric(data_test$y == "yes")) # label kelas positif
#Objec performance dari prediction
perf <- performance(prediction.obj = roc,
measure = "tpr",
x.measure = "fpr")
#plot
plot(perf, main = "ROC")
abline(0,1, lty =2)
auc <- performance(prediction.obj = roc,
measure = "auc")
auc@y.values[[1]]
## [1] 0.7624676
The model can differentiate 76.2% class based on the AUC score.
In the fitting model we will use: 1. mincriterion = 0.99, where the p-value must be below 0.05 for a node to create branches. 2. minsplit = 2000 or the minimum number of observations after splitting. 3. minbucket = 1500 as the minimum number of observations in the terminal node.
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
model_dtree <- ctree(formula = y ~ ., data = data_train,
control = ctree_control(mincriterion=0.99,
minsplit=2000,
minbucket=1500))
plot(model_dtree, type = "simple")
dtree_pred <- predict(model_dtree, data_test)
dtree_prob <- predict(model_dtree, data_test, type = "prob")
cm_dtree <- confusionMatrix(dtree_pred, data_test$y, positive = "yes")
cm_dtree
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 736 88
## yes 0 0
##
## Accuracy : 0.8932
## 95% CI : (0.8701, 0.9135)
## No Information Rate : 0.8932
## P-Value [Acc > NIR] : 0.5284
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8932
## Prevalence : 0.1068
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : yes
##
#Objek Prediction
dt_roc <- prediction(predictions = dtree_prob[,2], # = kelas positif
labels = as.numeric(data_test$y == "yes")) # label kelas positif
#Objec performance dari prediction
dt_perf <- performance(prediction.obj = dt_roc,
measure = "tpr",
x.measure = "fpr")
#plot
plot(dt_perf, main = "ROC")
abline(0,1, lty =2)
dt_auc <- performance(prediction.obj = dt_roc,
measure = "auc")
dt_auc@y.values
## [[1]]
## [1] 0.6603878
The AUC score from the model decision tree is 66%, which is not a good score to differentiate between class.
Based on both model NaiveBayes and Decision Tree, NaiveBayes model has a bigger accuracy in predicting the model rather than Decision Tree.
Based on the results of the three models, the data model from NaiveBayes has better results, in addition to high accuracy and also the value of AUC and sensitivity or recall, because in this case the target class is the customer buying the product yes of course if recalled. This high model proves that the model has a small number in False Positive or a prediction of no but actually ‘yes’.