Pre-Start

LBB_6 Requirement Tab

Library

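library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graphs
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
library(caret) # to pre-process data and evaluate models (confusionMatrix)
library(tidyverse) # collection of data science packages
library(GGally) # to visualize data
library(e1071) # for Naive Bayes
library(plotly) # for interactive plots
library(glue) # for string interpolation
library(rpart) # for decision trees
library(rpart.plot) # to plot decision trees
library(RColorBrewer) # for color palettes
library(rattle) # for fancyRpartPlot()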

Data Preparation

Importing Dataset

it_csv <- read.csv("data/bank.csv", stringsAsFactors = TRUE) # read text columns as factors
head(it_csv)

##   age        job marital education default balance housing loan contact day
## 1  59     admin. married secondary      no    2343     yes   no unknown   5
## 2  56     admin. married secondary      no      45      no   no unknown   5
## 3  41 technician married secondary      no    1270     yes   no unknown   5
## 4  55   services married secondary      no    2476     yes   no unknown   5
## 5  54     admin. married  tertiary      no     184      no   no unknown   5
## 6  42 management  single  tertiary      no       0     yes  yes unknown   5
##   month duration campaign pdays previous poutcome deposit
## 1   may     1042        1    -1        0  unknown     yes
## 2   may     1467        1    -1        0  unknown     yes
## 3   may     1389        1    -1        0  unknown     yes
## 4   may      579        1    -1        0  unknown     yes
## 5   may      673        2    -1        0  unknown     yes
## 6   may      562        2    -1        0  unknown     yes

Overview Data

str(it_csv)

## 'data.frame':    11162 obs. of  17 variables:
##  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 6 6 6 6 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
##  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ deposit  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...

Check Missing Value

Good news: there are no missing values in our data.

colSums(is.na(it_csv))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##   deposit 
##         0

Feature Selection

The Naive Bayes model is intended to use factor-type predictors, so the numeric variables age, balance, duration, and campaign are binned into categories, and day, month, and pdays are dropped.

# Bin the numeric predictors into categories and drop day, month, and pdays
it_final <- it_csv %>% 
  mutate(age = as.factor(ifelse(age < 30, "<30", 
                         ifelse(age >= 30 & age <= 50, "30-50", ">50")))) %>% 
  mutate(balance = as.factor(ifelse(balance < 69, "Low", 
                             ifelse(balance >= 69 & balance <= 1480, "Medium", "High")))) %>% 
  mutate(duration = as.factor(ifelse(duration < 104, "Short", 
                              ifelse(duration >= 104 & duration <= 264, "Medium", "Long")))) %>% 
  mutate(campaign = as.factor(ifelse(campaign <= 1, "Rarely", 
                              ifelse(campaign > 1 & campaign <= 2, "Medium", "Often")))) %>% 
  select(-c(day, month, pdays))

head(it_final)

##     age        job marital education default balance housing loan contact
## 1   >50     admin. married secondary      no    High     yes   no unknown
## 2   >50     admin. married secondary      no     Low      no   no unknown
## 3 30-50 technician married secondary      no  Medium     yes   no unknown
## 4   >50   services married secondary      no    High     yes   no unknown
## 5   >50     admin. married  tertiary      no  Medium      no   no unknown
## 6 30-50 management  single  tertiary      no     Low     yes  yes unknown
##   duration campaign previous poutcome deposit
## 1     Long   Rarely        0  unknown     yes
## 2     Long   Rarely        0  unknown     yes
## 3     Long   Rarely        0  unknown     yes
## 4     Long   Rarely        0  unknown     yes
## 5     Long   Medium        0  unknown     yes
## 6     Long   Medium        0  unknown     yes
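
The nested ifelse() calls above get harder to read as the number of bins grows. As a minimal sketch, base R's cut() can express the same age bins in one call. The age vector here is made up for illustration, and the break points assume integer ages, so that "< 30" is the same as "<= 29". Unlike as.factor(), cut() keeps the levels in the order given rather than sorting them alphabetically.

```r
# Toy ages, invented for illustration (not taken from bank.csv)
age <- c(25, 30, 45, 50, 51, 67)

# Right-closed intervals: (-Inf, 29], (29, 50], (50, Inf)
# For integer ages this matches: < 30, 30 to 50, > 50
age_group <- cut(age,
                 breaks = c(-Inf, 29, 50, Inf),
                 labels = c("<30", "30-50", ">50"))

table(age_group)
```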

Exploratory Data Analysis

Since the difference between the number of yes and no observations is small, we do not need any further action such as downsampling or upsampling.

# Count the observations per class, then plot the counts as bars
EDA <- it_final %>% 
  group_by(deposit) %>% 
  summarise(freq = n()) %>% 
  ggplot(mapping = aes(x = deposit, y = freq)) +
  # the text aesthetic is only used by plotly to build the tooltip
  geom_col(aes(fill = deposit, text = glue("Deposit : {deposit}\nFreq : {freq}"))) +
  theme_minimal()

ggplotly(EDA, tooltip = "text")
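
The class balance can also be checked numerically instead of reading it off the bar chart. This is a minimal sketch with a made-up deposit vector standing in for it_final$deposit, so the shares it prints are not the real ones.

```r
# Made-up labels standing in for it_final$deposit
deposit <- factor(c("yes", "no", "no", "yes", "no", "yes", "no", "yes"))

# Share of each class; values close to 0.5 suggest no resampling is needed
class_share <- prop.table(table(deposit))
round(class_share, 2)
```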

Naive Bayes

Cross Validation

set.seed(123)
index <- sample (nrow(it_csv), nrow(it_csv)*0.8)
train_bayes <- it_csv[index, ]
test_bayes <- it_csv[-index, ]
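
Note that, as written, this step is a single 80/20 holdout split taken from it_csv, so the models below are trained on the original columns rather than on the binned ones in it_final. If the intent is to train on the binned data, the same pattern can be applied to it_final instead. The sketch below shows the pattern on a small made-up data frame (toy is a hypothetical stand-in for it_final).

```r
set.seed(123)

# Made-up stand-in for it_final: one factor predictor plus the target
toy <- data.frame(
  housing = factor(sample(c("no", "yes"), 50, replace = TRUE)),
  deposit = factor(sample(c("no", "yes"), 50, replace = TRUE))
)

# Draw 80% of the row numbers for training; the remaining rows form the test set
index <- sample(nrow(toy), floor(nrow(toy) * 0.8))
train_toy <- toy[index, ]
test_toy  <- toy[-index, ]

c(train = nrow(train_toy), test = nrow(test_toy))
```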

Model Fitting

This step creates the Naive Bayes model.

model_bayes <- naiveBayes(x = train_bayes %>%  select(-deposit),
                          y = train_bayes$deposit)
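
To make clear what naiveBayes() estimates for factor predictors, here is a minimal base R sketch that computes the same kind of quantities by hand: the class prior, the per-class conditional probabilities, and the posterior for one new customer. The eight rows are made up for illustration, and no Laplace smoothing is applied.

```r
# Made-up training data with two categorical predictors
toy <- data.frame(
  housing = c("yes", "yes", "no", "no", "yes", "no", "yes", "no"),
  loan    = c("no", "yes", "no", "no", "yes", "no", "no", "yes"),
  deposit = c("no", "no", "yes", "yes", "no", "yes", "yes", "no")
)

# Prior P(deposit) and conditionals P(predictor value | deposit)
prior     <- prop.table(table(toy$deposit))
p_housing <- prop.table(table(toy$deposit, toy$housing), margin = 1)
p_loan    <- prop.table(table(toy$deposit, toy$loan), margin = 1)

# New customer with housing = "no" and loan = "no":
# multiply prior and conditionals per class, then normalise to sum to 1
score     <- as.numeric(prior) * p_housing[, "no"] * p_loan[, "no"]
posterior <- score / sum(score)
round(posterior, 3)   # "yes" comes out at 12/13, about 0.923
```

The "naive" part is the multiplication of the conditionals, which treats the predictors as independent given the class.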

To evaluate the model, I first use the predict() function to predict the class of each observation in the test data, test_bayes.

predict_bayes <- predict(object = model_bayes, newdata = test_bayes, type = "class")

Model Evaluation

After that, I can use the confusionMatrix() function to see the accuracy and sensitivity. As the confusion matrix shows, the accuracy (0.7707) is not bad, but it is not very strong either. The sensitivity (0.7536) is also not fully satisfying. Since we are targeting customers who are predicted to open a term deposit (the positive class, yes), we would like both numbers to be higher.

confusionMatrix(data = predict_bayes, reference = test_bayes$deposit, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  938 256
##        yes 256 783
##                                          
##                Accuracy : 0.7707         
##                  95% CI : (0.7527, 0.788)
##     No Information Rate : 0.5347         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.5392         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.7536         
##             Specificity : 0.7856         
##          Pos Pred Value : 0.7536         
##          Neg Pred Value : 0.7856         
##              Prevalence : 0.4653         
##          Detection Rate : 0.3506         
##    Detection Prevalence : 0.4653         
##       Balanced Accuracy : 0.7696         
##                                          
##        'Positive' Class : yes            
##
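
The headline statistics can be reproduced directly from the four counts in the matrix, which makes their definitions explicit. This is a small sketch that copies the counts from the output above. Rows are predictions and columns are the reference labels, with yes as the positive class.

```r
# Counts copied from the Naive Bayes confusion matrix above
cm <- matrix(c(938, 256,
               256, 783),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

tp <- cm["yes", "yes"]  # predicted yes, actually yes
fn <- cm["no", "yes"]   # predicted no, actually yes
fp <- cm["yes", "no"]   # predicted yes, actually no
tn <- cm["no", "no"]    # predicted no, actually no

accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual yes cases that are found
specificity <- tn / (tn + fp)        # share of actual no cases that are found
nir         <- max(colSums(cm)) / sum(cm)  # accuracy of always guessing the larger class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_information_rate = nir), 4)
```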

Decision Tree

Model Fitting

The fitted tree consists of several nodes that split the observations on the predictors we use.

Each node shows the predicted class (yes/no), the probability of the yes or no class, and the percentage of observations in the node. The root and internal nodes also show the rule (a variable with its threshold or value) used to partition the observations.

dtree <- rpart(formula = deposit ~., data = train_bayes, method = "class")

fancyRpartPlot(dtree, sub = NULL)
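
For classification trees, rpart picks each rule by how much it reduces node impurity, using the Gini index by default. The sketch below works through that calculation for one candidate split on a made-up node of six customers. The gini() helper and the labels are hypothetical and only illustrate the idea, not rpart's internal code.

```r
# Gini impurity of a set of class labels: 1 minus the sum of squared class shares
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Made-up node with six customers and one candidate split
parent <- c("yes", "yes", "yes", "no", "no", "no")
left   <- c("yes", "yes", "yes", "no")   # e.g. customers with a long call
right  <- c("no", "no")                  # e.g. customers with a short call

# Weighted impurity of the two children, then the reduction achieved by the split
child_impurity <- (length(left) * gini(left) + length(right) * gini(right)) / length(parent)
gain <- gini(parent) - child_impurity
gain   # 0.5 - 0.25 = 0.25
```

Among all candidate rules, the one with the largest reduction is preferred for that node.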

Model Evaluation

This model scores higher than Naive Bayes on both accuracy (0.8204 vs 0.7707) and sensitivity (0.8864 vs 0.7536).

pred_tree <- predict(object = dtree, newdata = test_bayes, type = "class")

## Confusion Matrix
confusionMatrix(data = pred_tree, test_bayes$deposit, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  911 118
##        yes 283 921
##                                           
##                Accuracy : 0.8204          
##                  95% CI : (0.8039, 0.8361)
##     No Information Rate : 0.5347          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6428          
##                                           
##  Mcnemar's Test P-Value : 2.617e-16       
##                                           
##             Sensitivity : 0.8864          
##             Specificity : 0.7630          
##          Pos Pred Value : 0.7650          
##          Neg Pred Value : 0.8853          
##              Prevalence : 0.4653          
##          Detection Rate : 0.4124          
##    Detection Prevalence : 0.5392          
##       Balanced Accuracy : 0.8247          
##                                           
##        'Positive' Class : yes             
##

Conclusion

Based on the table below, the decision tree is the better of the two models. It has the higher accuracy and sensitivity, both of which matter when deciding which customers to target, and it is also easy to interpret. On all of these aspects, the decision tree comes out ahead.

data.frame(Model = c("Naive Bayes" ,"Decision Tree"), 
           Accuracy = c(0.77, 0.82),
           Sensitivity = c(0.75, 0.89))

##           Model Accuracy Sensitivity
## 1   Naive Bayes     0.77        0.75
## 2 Decision Tree     0.82        0.89
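
Instead of typing the summary numbers by hand, the comparison table can be derived from the confusion matrix counts reported earlier, which avoids copy and rounding slips. This is a sketch under the assumption that yes is the positive class. The helper name metrics_from_counts is hypothetical.

```r
# Hypothetical helper: accuracy and sensitivity from confusion matrix counts
metrics_from_counts <- function(tp, fn, fp, tn) {
  c(Accuracy    = (tp + tn) / (tp + fn + fp + tn),
    Sensitivity = tp / (tp + fn))
}

# Counts copied from the two confusion matrices in this report
nb   <- metrics_from_counts(tp = 783, fn = 256, fp = 256, tn = 938)
tree <- metrics_from_counts(tp = 921, fn = 118, fp = 283, tn = 911)

# Comparison table, rounded to two decimals
comparison <- data.frame(
  Model       = c("Naive Bayes", "Decision Tree"),
  Accuracy    = round(c(nb[["Accuracy"]], tree[["Accuracy"]]), 2),
  Sensitivity = round(c(nb[["Sensitivity"]], tree[["Sensitivity"]]), 2)
)
comparison
```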

lbb_6

Lucky Putranto

4/12/2020
