Predicting Deposit Probability using Naive Bayes and Decision Tree
Objective
Predicting the bank’s customer behavior using Naive Bayes and Decision Tree, which customer that will make a purchase activity when receiving calls from the bank
This data came from UCI Machine Learning, and is a real data from Portugese banking institution. The variable y (dependent/output) is the y variable (information on each variable below):
Data
Data Read
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Input variables:
1 - age (numeric)
2 - job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’) 3 - marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
4 - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
5 - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
6 - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
7 - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
**elated with the last contact of the current campaign*:**
8 - contact: contact communication type (categorical: ‘cellular’,‘telephone’)
9 - month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 - day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
EDA
Cross Validation
Using initial_split() to make proportions of y (dependent variable) the same even in train and test data:
library(rsample)
set.seed(100)
split <- initial_split(bank, prop = 0.8, strata = "y")
bank.train <- training(split)
bank.test <- testing(split)
no yes
0.8830152 0.1169848
no yes
0.8824684 0.1175316
no yes
0.8824684 0.1175316
Naive Bayes
Model Fitting
Model Evaluation
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7387 617
yes 470 568
Accuracy : 0.8798
95% CI : (0.8729, 0.8864)
No Information Rate : 0.8689
P-Value [Acc > NIR] : 0.001066
Kappa : 0.4428
Mcnemar's Test P-Value : 9.497e-06
Sensitivity : 0.9402
Specificity : 0.4793
Pos Pred Value : 0.9229
Neg Pred Value : 0.5472
Prevalence : 0.8689
Detection Rate : 0.8170
Detection Prevalence : 0.8852
Balanced Accuracy : 0.7098
'Positive' Class : no
- Accuracy: the ability to correctly predict both classes from the total observation.
- Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
- Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
- Specificity: the ability to correctly predict the negative class from the total actual-negative class.
Based on this result, everything is acceptable except specificity (the ability to correctly predict the negative class from the total actual-negative class), which is very low at 55%. This might mean that this Naive Bayes’ model can cause the bank to miss a huge chunk of potential customer that might say “yes” to deposit to the bank.
By Heatmap (Optional)
Or by creating a heatmap:
library(yardstick)
# model fitting
naive.prob <- predict(bank.model, bank.test, type = "raw")
naive.pred <- predict(bank.model, bank.test, type = "class")
# result
naive.table <- select(bank.test, y) %>%
bind_cols(y_pred = naive.pred) %>%
bind_cols(y_eprob = round(naive.prob[,1],4)) %>%
bind_cols(y_pprob = round(naive.prob[,2],4))# performance evaluation - confusion matrix
#naive.table %>%
# conf_mat(y, y_pred) %>%
#autoplot(type = "heatmap")naive.table %>%
summarise(
accuracy = accuracy_vec(y, y_pred),
sensitivity = sens_vec(y, y_pred),
specificity = spec_vec(y, y_pred),
precision = precision_vec(y, y_pred)
) accuracy sensitivity specificity precision
1 0.8797832 0.9229135 0.5472062 0.9401807
ROC Curve
Initial steps to create ROC curve:
# Probability of Prediction
bank.pred.raw <- predict(bank.model, bank.test, type = "raw") # copy & paste the chunk above
head(bank.pred.raw) no yes
[1,] 0.9963142 0.0036857912
[2,] 0.9991702 0.0008298437
[3,] 0.9969154 0.0030846295
[4,] 0.9980056 0.0019943986
[5,] 0.9957616 0.0042383707
[6,] 0.9909533 0.0090467468
Then assign class to that prediction
yes.pred <- data.frame("prediction" = bank.pred.raw[,2],
"classification" = as.numeric(bank.test == "yes"))
tail(yes.pred) prediction classification
153709 0.96107329 1
153710 0.79282306 0
153711 0.11953660 0
153712 0.88410636 1
153713 0.04048232 1
153714 0.03568349 1
library(ROCR)
yes.roc <- prediction(yes.pred$prediction,
yes.pred$classification)
# tpr (true positive rate) and fpr (false positive rate)
performance <- performance(yes.roc, "tpr", "fpr")
# roc plot
plot(performance, colorize = T)ROC (receiver operator characteristic) curve is graphical plot used to show the diagnostic ability of binary classifiers. It is constructed by plotting True Positive Rate (TPR) with False Positive Rate (FPR) - TPR: The true positive rate is the proportion of observations that were correctly predicted to be positive out of all positive observations (TP/(TP + FN)).
- FPR: the proportion of observations that are incorrectly predicted to be positive out of all negative observations (FP/(TN + FP))
The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Generally, if the curve is closer to the top-left corner, then the classifiers meant that is has better performance (because “True positive is high while false negative is low”), which is not the case with the graphic above. ROC itself does not depend on the class distribution.
AUC
AUC (Area Under the Curve), is derived from calculating….the area under the ROC curve, that signifies ‘the summary of the performance of each classifier into a single measure’.
[1] 0.4670649
Decision Tree
Model & Plot
# Create a decision tree model using train data
dtmodel <- ctree(formula = y ~., data = bank.train)
plot(dtmodel, type = "simple") Welp, too much of a tree.
Model Evaluation
dt_pred
no yes
8064 978
age job marital education default balance housing loan contact day
2 44 technician single secondary no 29 yes no unknown 5
3 33 entrepreneur married secondary no 2 yes yes unknown 5
15 57 services married secondary no 162 yes no unknown 5
18 57 blue-collar married primary no 52 yes no unknown 5
month duration campaign pdays previous poutcome y dt_pred
2 may 151 1 -1 0 unknown no no
3 may 76 1 -1 0 unknown no no
15 may 174 1 -1 0 unknown no no
18 may 38 1 -1 0 unknown no no
[ reached 'max' / getOption("max.print") -- omitted 9038 rows ]
2 3 15 18 24 27 29 31 34 41 42 44 47 49 58 66 67 68 69 70
no no no no no no no no no no no no no no no no no no no no
78 79 103 104 106 109 134 144 149 156 157 161 162 164 165 171 173 189 190 191
no no no no no no no no no no no no no no no no no no no no
193 194 207 209 212 213 215 222 223 229 234 249 256 259 263 265 269 271 272 274
no no no no no no no no no no no no no no no no no no yes no
275 277 282 287 294 298 303 306 307 316 328 330 337 344 362
no no no no no no no no no no no no no no no
[ reached getOption("max.print") -- omitted 8967 entries ]
Levels: no yes
# create probability prediction
dt_prob_pred <- predict(dtmodel, bank.test, type = "prob")
# evaluating decision tree model
library(caret)
dtcf <- confusionMatrix(bank.test$y,
dt_pred, positive = "yes")
dtcfConfusion Matrix and Statistics
Reference
Prediction no yes
no 7609 395
yes 455 583
Accuracy : 0.906
95% CI : (0.8998, 0.9119)
No Information Rate : 0.8918
P-Value [Acc > NIR] : 5.411e-06
Kappa : 0.5255
Mcnemar's Test P-Value : 0.043
Sensitivity : 0.59611
Specificity : 0.94358
Pos Pred Value : 0.56166
Neg Pred Value : 0.95065
Prevalence : 0.10816
Detection Rate : 0.06448
Detection Prevalence : 0.11480
Balanced Accuracy : 0.76985
'Positive' Class : yes
On previous Naive Bayes model, there is only 1 parameter that has low %, which is “Specificity” (by 54%). But in this decision tree model, there are 2 parameters that have lower figure, which is Sensitivity and Precision.
ROC Curve
library(ROCR)
# di copy paste dari atas
dt_prob_pred <- predict(dtmodel, bank.test, type = "prob")
pred <- prediction(dt_prob_pred[, 2],
bank.test$y)
perf <- performance(pred, "tpr", "fpr")
# ROC Curve
plot(perf, colorize = T) The ROC curve of this decision tree model is better than the ones from Naive Bayes model.
AUC
[1] 0.9185209
With this result, it means that the classifiers used for decision tree model are considered ‘excellent’ (or at least better than the AUC from Naive Bayes’ model).
Conclusion
Comparing the models’ evaluation from Naive Bayes and decision tree (DT) method, we can conclude that for this case, predicting the bank customer’s deposit probability is better by using the decision tree model. While the evaluation does not have ideal result on every aspect, but overall the dt model has higher accuracy, sensitivity, precision, and even has better result in ROC curve and AUC value.
Reiteration
I will try to add the pruning step to the DT model. Pruning can help to simplify the branch/nodes of the decision tree, to avoid overfitting. The result of this pruning method may later be evaluated (by Confusion Matrix) so we can compare the result with the previous Naive Bayes and DT model.