Introduction

In this post, we will predict whether potential customers of a bank in Portugal subscribe to a term deposit ("yes") or not ("no"). To make this prediction, we will build and compare three models: Naive Bayes, Decision Tree, and Random Forest.

Objectives

  • Build Naive Bayes, Decision Tree, and Random Forest models.
  • Compare the performance of these three models.

Data Preprocessing

Library

The libraries we will use for this analysis are:

library(tidyverse)    # data wrangling and visualization (includes dplyr)
library(dplyr)        # loaded explicitly, although tidyverse already attaches it
library(partykit)     # conditional inference trees (ctree)
library(randomForest) # random forest backend used by caret's method = "rf"
library(caret)        # train/test utilities, downSample, confusionMatrix
library(e1071)        # naiveBayes

Read Data

# the file uses ";" rather than "," as its field separator
bank <- read.csv("bank-full.csv", sep = ";")
head(bank)

Check and Adjust Data Types

str(bank)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
#>  $ marital  : chr  "married" "single" "married" "married" ...
#>  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
#>  $ default  : chr  "no" "no" "no" "no" ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : chr  "yes" "yes" "yes" "yes" ...
#>  $ loan     : chr  "no" "no" "yes" "no" ...
#>  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : chr  "may" "may" "may" "may" ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ y        : chr  "no" "no" "no" "no" ...

Data Description

The data consists of 45,211 rows and 17 columns:

  • age : Age of the client
  • job : Type of job
  • marital : Marital status
  • education : Level of education
  • default : Whether the client has credit in default ("yes", "no")
  • balance : Average yearly balance in euros
  • housing : Whether the client has a housing loan ("yes", "no")
  • loan : Whether the client has a personal loan ("yes", "no")
  • contact : Contact communication type
  • day : Last contact day of the month
  • month : Last contact month
  • duration : Last contact duration in seconds
  • campaign : Number of contacts performed during this campaign
  • pdays : Number of days since the client was last contacted (-1 means never contacted)
  • previous : Number of contacts performed before this campaign
  • poutcome : Outcome of the previous marketing campaign
  • y : Whether the client subscribed to a term deposit ("yes", "no")

Since the character columns are all categorical, we convert them to factors so the modeling functions treat them correctly:

bank <- bank %>% 
  mutate_if(is.character, as.factor)
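
mutate_if() still works but is superseded in current dplyr; an equivalent conversion with the newer across() idiom (interchangeable with the call above) is:

# same conversion using the current dplyr idiom
bank <- bank %>% 
  mutate(across(where(is.character), as.factor))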

str(bank)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Check for Missing Values (NA)

bank %>% 
  is.na() %>% 
  colSums()
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0

The output above shows that there are no missing values, so we can proceed to model building.
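
For reference, base R's anyNA() performs the same check in a single line:

anyNA(bank)  # TRUE if any value in the data frame is missing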

Train-Test Split

Before building the model, we first split the data into 80% training data and 20% testing data.

# reproduce the sampling behaviour of R versions before 3.6
RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(bank)

# draw 80% of the row indices for the training set
index <- sample(row_data, row_data * 0.8)

data_train <- bank[index, ]
data_test  <- bank[-index, ]
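
A quick sanity check that the split sizes are as intended:

# should be roughly 80% and 20% of the 45,211 rows
nrow(data_train)
nrow(data_test)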

Target Class Proportion

prop.table(table(bank$y))
#> 
#>        no       yes 
#> 0.8830152 0.1169848

The target classes are clearly imbalanced: roughly 88% "no" versus 12% "yes". To keep the models from simply favoring the majority class, we downsample the training data:

set.seed(123)
data_train_downsample <- downSample(x = data_train %>% select(-y),
                                    y = data_train$y,
                                    list = FALSE,
                                    yname = "y")

Downsampling reduces the majority class until its count matches the minority class, producing a balanced class proportion. In our training data, the "no" class is randomly sampled down until it equals the number of "yes" observations.
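
To make the mechanics concrete, here is a rough dplyr equivalent of caret::downSample() (a sketch for illustration only; we keep using data_train_downsample below):

# sample every class down to the size of the smallest class
set.seed(123)
min_n <- min(table(data_train$y))
manual_down <- data_train %>% 
  group_by(y) %>% 
  slice_sample(n = min_n) %>% 
  ungroup()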

prop.table(table(data_train_downsample$y))
#> 
#>  no yes 
#> 0.5 0.5

The target variable class proportion is now balanced. Therefore, we proceed to the next step: model building.

Model Building

Naive Bayes

# laplace = 1 applies Laplace smoothing, so a category that never appears
# with a class in the training data does not zero out the entire posterior
model_nb <- naiveBayes(x = data_train_downsample %>% select(-y),
                       y = data_train_downsample$y, 
                       laplace = 1)

Prediction

prediction_nb <- predict(model_nb, newdata = data_test)
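
By default predict() returns hard class labels. If we instead need the posterior probabilities, for example to tune the decision threshold later, e1071 supports type = "raw":

# class posteriors instead of hard labels
prob_nb <- predict(model_nb, newdata = data_test, type = "raw")
head(prob_nb)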

Evaluation

conf_nb <- confusionMatrix(data = prediction_nb, reference = data_test$y, positive = "yes")
conf_nb
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6395  284
#>        yes 1602  762
#>                                              
#>                Accuracy : 0.7914             
#>                  95% CI : (0.7829, 0.7998)   
#>     No Information Rate : 0.8843             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3413             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.72849            
#>             Specificity : 0.79967            
#>          Pos Pred Value : 0.32234            
#>          Neg Pred Value : 0.95748            
#>              Prevalence : 0.11567            
#>          Detection Rate : 0.08426            
#>    Detection Prevalence : 0.26142            
#>       Balanced Accuracy : 0.76408            
#>                                              
#>        'Positive' Class : yes                
#> 

Decision Tree

model_dtree <- ctree(formula = y ~ ., data = data_train_downsample)

plot(model_dtree, type = "simple")
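
The default tree grown on 16 predictors can be deep and hard to read. If a more compact, more interpretable tree is preferred, partykit's ctree_control() can cap its depth; a minimal sketch with illustrative parameter values:

# shallower tree: at most 3 levels, and a split needs 1 - p-value above 0.99
model_dtree_small <- ctree(y ~ ., 
                           data = data_train_downsample,
                           control = ctree_control(maxdepth = 3, mincriterion = 0.99))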

Prediction

prediction_dtree <- predict(model_dtree, newdata = data_test)

Evaluation

conf_dt <- confusionMatrix(data = prediction_dtree, reference = data_test$y, positive = "yes")
conf_dt
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6472  141
#>        yes 1525  905
#>                                              
#>                Accuracy : 0.8158             
#>                  95% CI : (0.8076, 0.8237)   
#>     No Information Rate : 0.8843             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.4282             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.8652             
#>             Specificity : 0.8093             
#>          Pos Pred Value : 0.3724             
#>          Neg Pred Value : 0.9787             
#>              Prevalence : 0.1157             
#>          Detection Rate : 0.1001             
#>    Detection Prevalence : 0.2687             
#>       Balanced Accuracy : 0.8373             
#>                                              
#>        'Positive' Class : yes                
#> 

Random Forest

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,  # k-fold
                     repeats = 3) # repetitions
model_rf <- train(y ~ ., data = data_train_downsample, method = "rf", trControl = ctrl)
saveRDS(model_rf, "model_rf.rds")
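
Training with 5-fold cross-validation repeated 3 times is slow, which is why the fitted model is stored with saveRDS() and reloaded later instead of being retrained. Once trained, caret's varImp() gives a quick, scaled (0-100) view of which predictors drive the forest:

# which predictors matter most to the random forest?
varImp(model_rf)
plot(varImp(model_rf), top = 10)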

Prediction

# load model
model_rf <- readRDS("model_rf.rds")
prediction_rf <- predict(model_rf, newdata = data_test)

Evaluation

conf_rf <- confusionMatrix(data = prediction_rf, reference = data_test$y, positive = "yes")
conf_rf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6676  118
#>        yes 1321  928
#>                                              
#>                Accuracy : 0.8409             
#>                  95% CI : (0.8332, 0.8484)   
#>     No Information Rate : 0.8843             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.4814             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.8872             
#>             Specificity : 0.8348             
#>          Pos Pred Value : 0.4126             
#>          Neg Pred Value : 0.9826             
#>              Prevalence : 0.1157             
#>          Detection Rate : 0.1026             
#>    Detection Prevalence : 0.2487             
#>       Balanced Accuracy : 0.8610             
#>                                              
#>        'Positive' Class : yes                
#> 

Model Performance

Naive Bayes

eval_nb <- tibble(Accuracy = conf_nb$overall[1],
                  Recall = conf_nb$byClass[1],
                  Specificity = conf_nb$byClass[2],
                  Precision = conf_nb$byClass[3])
eval_nb

Decision Tree

eval_dt <- tibble(Accuracy = conf_dt$overall[1],
                  Recall = conf_dt$byClass[1],
                  Specificity = conf_dt$byClass[2],
                  Precision = conf_dt$byClass[3])
eval_dt

Random Forest

eval_rf <- tibble(Accuracy = conf_rf$overall[1],
                  Recall = conf_rf$byClass[1],
                  Specificity = conf_rf$byClass[2],
                  Precision = conf_rf$byClass[3])
eval_rf
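
To compare the three models side by side, the three summary rows can be stacked into a single table (a small convenience step; the .id column labels each model):

model_comparison <- bind_rows("Naive Bayes"   = eval_nb,
                              "Decision Tree" = eval_dt,
                              "Random Forest" = eval_rf,
                              .id = "Model")
model_comparison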

Conclusion

  • Of the three models (Naive Bayes, Decision Tree, and Random Forest), Random Forest performs best on this data: it achieves the highest Accuracy, Recall, Specificity, and Precision.

  • The trade-off is that Random Forest takes considerably longer to train and is harder to interpret than simpler models such as Decision Tree and Naive Bayes.