library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graphs
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(tidyverse)
library(GGally) # to visualize data
library(e1071) # for Naive Bayes
library(plotly)
library(glue)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
##   age        job marital education default balance housing loan contact day
## 1  59     admin. married secondary      no    2343     yes   no unknown   5
## 2  56     admin. married secondary      no      45      no   no unknown   5
## 3  41 technician married secondary      no    1270     yes   no unknown   5
## 4  55   services married secondary      no    2476     yes   no unknown   5
## 5  54     admin. married  tertiary      no     184      no   no unknown   5
## 6  42 management  single  tertiary      no       0     yes  yes unknown   5
##   month duration campaign pdays previous poutcome deposit
## 1   may     1042        1    -1        0  unknown     yes
## 2   may     1467        1    -1        0  unknown     yes
## 3   may     1389        1    -1        0  unknown     yes
## 4   may      579        1    -1        0  unknown     yes
## 5   may      673        2    -1        0  unknown     yes
## 6   may      562        2    -1        0  unknown     yes
## 'data.frame': 11162 obs. of 17 variables:
## $ age : int 59 56 41 55 54 42 56 60 37 28 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 2343 45 1270 2476 184 0 830 545 1 5090 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 6 6 6 6 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 1042 1467 1389 579 673 562 1201 1030 608 1297 ...
## $ campaign : int 1 1 1 1 2 2 1 1 1 3 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ deposit : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
Good news: there are no missing values in our data.
##       age       job   marital education   default   balance   housing      loan
##         0         0         0         0         0         0         0         0
##   contact       day     month  duration  campaign     pdays  previous  poutcome
##         0         0         0         0         0         0         0         0
##   deposit
##         0
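The counts above are the kind of result a per-column missing-value count produces. The call that generated them is not shown, so the following is only a minimal sketch of one common way to do it; the data frame `toy` and its values are made up for illustration and are not the bank data.

```r
# Toy stand-in for the real data (hypothetical values)
toy <- data.frame(
  age     = c(59, NA, 41),
  job     = c("admin.", "services", NA),
  deposit = c("yes", "no", "yes")
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE only when no column contains a missing value
all(na_counts == 0)
```

On the real data the same two calls would be applied to the full data frame, and a vector of zeros like the one above means nothing needs to be imputed or dropped.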
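The Naive Bayes model here is built with factor-type predictors, so the numeric variables age, balance, duration and campaign are first binned into categories, and day, month and pdays are dropped.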
it_final <- it_csv %>%
  mutate(age = as.factor(ifelse(age < 30, "<30",
                         ifelse(age >= 30 & age <= 50, "30-50", ">50")))) %>%
  mutate(balance = as.factor(ifelse(balance < 69, "Low",
                             ifelse(balance >= 69 & balance <= 1480, "Medium", "High")))) %>%
  mutate(duration = as.factor(ifelse(duration < 104, "Short",
                              ifelse(duration >= 104 & duration <= 264, "Medium", "Long")))) %>%
  mutate(campaign = as.factor(ifelse(campaign <= 1, "Rarely",
                              ifelse(campaign > 1 & campaign <= 2, "Medium", "Often")))) %>%
  select(-c(day, month, pdays))
head(it_final)
##     age        job marital education default balance housing loan contact
## 1   >50     admin. married secondary      no    High     yes   no unknown
## 2   >50     admin. married secondary      no     Low      no   no unknown
## 3 30-50 technician married secondary      no  Medium     yes   no unknown
## 4   >50   services married secondary      no    High     yes   no unknown
## 5   >50     admin. married  tertiary      no  Medium      no   no unknown
## 6 30-50 management  single  tertiary      no     Low     yes  yes unknown
##   duration campaign previous poutcome deposit
## 1     Long   Rarely        0  unknown     yes
## 2     Long   Rarely        0  unknown     yes
## 3     Long   Rarely        0  unknown     yes
## 4     Long   Rarely        0  unknown     yes
## 5     Long   Medium        0  unknown     yes
## 6     Long   Medium        0  unknown     yes
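The binning step above uses nested `ifelse()` calls. As an alternative sketch (not the method used in this analysis), base R's `cut()` can express the same age bins in one call. The ages below are hypothetical values placed on the bin edges, and the equivalence relies on the assumption that age is stored as an integer.

```r
# Hypothetical ages chosen to sit on the bin edges
age <- c(18L, 29L, 30L, 50L, 51L, 95L)

# Nested ifelse(), same logic as the age bins in the analysis
age_ifelse <- ifelse(age < 30, "<30",
                     ifelse(age <= 50, "30-50", ">50"))

# cut() with right-closed intervals: (-Inf, 29], (29, 50], (50, Inf)
# This matches the ifelse() version only because age is an integer
age_cut <- cut(age,
               breaks = c(-Inf, 29, 50, Inf),
               labels = c("<30", "30-50", ">50"))

# TRUE when both approaches assign the same label to every age
all(as.character(age_cut) == age_ifelse)
```

One practical difference: `cut()` keeps the factor levels in the order of the breaks, whereas `as.factor()` orders the levels by sorting the labels.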
Since the difference between the proportions of yes and no in the target is small, no further action such as downsampling or upsampling is needed.
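The check behind this statement is not shown. A minimal sketch of how class balance is commonly inspected is given below; the vector `deposit` is a hypothetical stand-in for the real target column, which would be `it_final$deposit`.

```r
# Hypothetical target vector; the real one would be it_final$deposit
deposit <- factor(c("yes", "no", "no", "yes", "no", "yes", "no", "no"))

# Absolute counts per class
class_counts <- table(deposit)

# Share of each class; values near 0.5 indicate a balanced target
class_share <- prop.table(class_counts)
round(class_share, 3)
```

For reference, the test set reported later has a prevalence of 0.4653 for yes, which is close enough to 0.5 that resampling is usually not considered necessary.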
This step creates the Naive Bayes model.
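The model-building code is not shown, so what follows is only a sketch of how the fitting step and the prediction step described next might look with `naiveBayes()` from the e1071 package loaded earlier. The names `train_bayes` and `test_bayes` appear later in this analysis; everything else is an assumption made for illustration: the simulated data frame `toy`, the object names `model_bayes` and `pred_bayes`, the seed, the 80/20 split done with `sample()`, and `laplace = 1`.

```r
library(e1071)

set.seed(100)  # arbitrary seed, only to make the toy example reproducible

# Simulated stand-in for it_final (hypothetical values, not the bank data)
n <- 200
toy <- data.frame(
  housing  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  duration = factor(sample(c("Short", "Medium", "Long"), n, replace = TRUE)),
  deposit  = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# 80/20 split; the original split method is not shown, so this is an assumption
idx <- sample(nrow(toy), size = 0.8 * nrow(toy))
train_bayes <- toy[idx, ]
test_bayes  <- toy[-idx, ]

# Fit Naive Bayes; laplace = 1 adds one pseudo-count per category so that
# combinations unseen in training do not get zero probability
model_bayes <- naiveBayes(deposit ~ ., data = train_bayes, laplace = 1)

# Predicted class labels for the held-out rows
pred_bayes <- predict(model_bayes, newdata = test_bayes, type = "class")
table(Prediction = pred_bayes, Reference = test_bayes$deposit)
```

Because the toy target is random, the resulting table carries no meaning; the sketch only shows the shape of the workflow.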
I will evaluate the model using the predict() function, which predicts the outcome for the test data, test_bayes.
After that, I can use the confusionMatrix() function to see the accuracy and sensitivity. As the confusion matrix shows, the accuracy (0.77) is acceptable but not strong, and the sensitivity (0.75) is not satisfying either. Since we want to target customers who are predicted to open a term deposit (the positive class, yes), we would like both numbers to be higher.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  no yes
##        no  938 256
##        yes 256 783
##
## Accuracy : 0.7707
## 95% CI : (0.7527, 0.788)
## No Information Rate : 0.5347
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5392
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7536
## Specificity : 0.7856
## Pos Pred Value : 0.7536
## Neg Pred Value : 0.7856
## Prevalence : 0.4653
## Detection Rate : 0.3506
## Detection Prevalence : 0.4653
## Balanced Accuracy : 0.7696
##
## 'Positive' Class : yes
##
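To make the statistics above easier to read, the main ones can be recomputed by hand from the four cells of the confusion matrix. This is a small worked example using only the counts printed above; the variable names are chosen here for illustration.

```r
# Counts copied from the Naive Bayes confusion matrix above
# (rows = prediction, columns = reference, positive class = "yes")
tp <- 783  # predicted yes, actually yes
fn <- 256  # predicted no,  actually yes
fp <- 256  # predicted yes, actually no
tn <- 938  # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
sensitivity <- tp / (tp + fn)  # share of actual "yes" that were caught
specificity <- tn / (tn + fp)  # share of actual "no" correctly rejected
precision   <- tp / (tp + fp)  # reported as Pos Pred Value by caret

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity and Pos Pred Value lines in the output. Sensitivity is the key figure here because a missed "yes" is a customer who would have opened a deposit but was never contacted.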
The decision tree consists of several nodes that show how the predictors split the data and the class probabilities that result.
Each node shows the predicted class (yes/no), the probability of the yes or no class, and the percentage of observations that fall in the node. The root and internal nodes also show the rule (a variable with a threshold or value) used to partition the observations.
dtree <- rpart(formula = deposit ~ ., data = train_bayes, method = "class")
fancyRpartPlot(dtree, sub = NULL)
The accuracy and sensitivity of this model are higher than those of the Naive Bayes model.
pred_tree <- predict(object = dtree, newdata = test_bayes, type = "class")
# Confusion matrix
confusionMatrix(data = pred_tree, reference = test_bayes$deposit, positive = "yes")
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  no yes
##        no  911 118
##        yes 283 921
##
## Accuracy : 0.8204
## 95% CI : (0.8039, 0.8361)
## No Information Rate : 0.5347
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6428
##
## Mcnemar's Test P-Value : 2.617e-16
##
## Sensitivity : 0.8864
## Specificity : 0.7630
## Pos Pred Value : 0.7650
## Neg Pred Value : 0.8853
## Prevalence : 0.4653
## Detection Rate : 0.4124
## Detection Prevalence : 0.5392
## Balanced Accuracy : 0.8247
##
## 'Positive' Class : yes
##
Based on the table below, the decision tree is the better of the two models. It has the higher accuracy and the higher sensitivity, which matter most when deciding which customers to target, and it is also easy to interpret. On all of these aspects the decision tree wins.
data.frame(Model = c("Naive Bayes", "Decision Tree"),
           Accuracy = c(0.77, 0.82),
           Sensitivity = c(0.75, 0.89))
##           Model Accuracy Sensitivity
## 1   Naive Bayes     0.77        0.75
## 2 Decision Tree     0.82        0.89