A bank is a financial institution licensed to receive deposits and make loans. It may also provide financial services such as currency exchange and term deposits. As a financial institution, it is important for a bank to have insight into which factors affect its customers' decisions to use its financial services. Furthermore, banks need a way to classify customers, i.e. to know which customers will be interested in a given product. So, using the bank marketing dataset, I will build models with the Naive Bayes, Decision Tree, and Random Forest classifiers. The results will be compared to see which classifier is the most suitable for predicting whether a customer will subscribe to a term deposit or not.
The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, then load them with the library() function. A brief explanation of the required packages:
tidyverse: data manipulation
e1071: naive bayes classifier
caret: upsampling/downsampling and confusion matrix
ROCR: evaluate ROC curve and AUC
partykit: decision tree classifier
rsample: cross validation
randomForest: random forest classifier
library(tidyverse)
library(e1071)
library(caret)
library(ROCR)
library(partykit)
library(rsample)
library(randomForest)
bank <- read.csv("data_input/bank-full.csv",
sep = ";",
stringsAsFactors = T)
head(bank)
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, ~
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, ~
## $ marital <fct> married, single, married, married, single, married, single, ~
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, ~
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,~
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71~
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y~
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no~
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow~
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, ~
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,~
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ~
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow~
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, ~
We can see that there are 45,211 instances and 17 features in this dataset. Most of the features are categorical. Using glimpse(), we can see that all the data types are correct. However, there are some "unknown" values in the data, which I will treat as missing values. Let's first check whether the data has any missing or duplicated values.
Replacing “unknowns” with NA, then checking whether the data has any missing values.
bank <- bank %>%
mutate(across(.cols = everything(),
.fns = ~replace(., . == "unknown", NA)))
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 13020 0 0 0 0 0 0 36959
## y
## 0
contact and poutcome have a lot of missing values; in fact, most of the values in poutcome are unknown, so I will drop these variables. I am also concerned about the day and month features: there is no information about whether the data was collected within a single year. If it was not, these features are meaningless, because the exact dates are not provided. So I will drop day and month as well. Additionally, job and education also have some missing values, but they amount to less than 5% of the data, so I will simply drop the observations with missing values in those two features.
bank <- bank %>%
select(-c(contact, poutcome, day, month)) %>%
drop_na()
We have cleaned the missing values; now let's check whether the data has any duplicated rows. In a dataset like this, I do not think duplicates should exist: since the data contains features such as the account balance and the number of days since the last campaign, every observation should be unique.
sum(duplicated(bank))
## [1] 1
The data has only one duplicated row. I will simply remove it.
bank <- bank[!duplicated(bank), ]
We will also check whether any feature has near-zero variance. Such features should be removed, since they are uninformative and will contribute little during model construction. This gives the final cleaned data.
bank <- bank[, -nearZeroVar(bank)]
bank
We want our target variable to have a balanced class proportion so that the model can classify all classes well instead of only the majority class. Let's first check the class proportions.
prop.table(table(bank$y))
##
## no yes
## 0.8837516 0.1162484
table(bank$y)
##
## no yes
## 38171 5021
We have a moderate class imbalance. This is a problem, since training on the data as-is might produce a model that predicts well on only one class. Let's first split the data into an 80% train set and a 20% test set: the train set for constructing the model, and the test set for measuring out-of-sample performance.
set.seed(100)
bank_split <- initial_split(bank, prop = 0.8, strata = "y")
train <- training(bank_split)
test <- testing(bank_split)
Rechecking the class proportion in the train set.
prop.table(table(train$y))
##
## no yes
## 0.8837694 0.1162306
table(train$y)
##
## no yes
## 30536 4016
Before proceeding to the modeling part, we need to handle the class imbalance. There are a few techniques we can use: upsampling, downsampling, SMOTE, etc. Since there are quite a lot of observations in the data, I will use downsampling. This method removes some observations from the majority class, leaving 8,032 observations (4,016 per class) for training the model.
train_down <- downSample(x = train[, -13], # predictors only (column 13 is the target y)
y = train$y,
yname = "y")
table(train_down$y)
##
## no yes
## 4016 4016
We are done with pre-processing the data. Now we can use it to construct and evaluate classifiers.
Naive Bayes is a simple technique for constructing classifiers by applying Bayes' theorem to the data features. It has one strong assumption: all features are equally important and conditionally independent. In real-world data this assumption is very likely violated, yet despite the violation, the Naive Bayes classifier can still give outstanding results. I will construct a Naive Bayes classifier with Laplace smoothing, in case any predictor level never occurs within a class.
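To see what Laplace smoothing does, here is a toy sketch with hypothetical counts (my own illustration, not from the bank data). Without smoothing, a feature level that never occurs within a class gets an estimated probability of zero, which zeroes out the entire product in Bayes' theorem; adding 1 to every count avoids this.
# Toy illustration of Laplace smoothing (hypothetical counts)
counts <- c(a = 10, b = 5, c = 0) # level "c" was never seen in this class
prop.table(counts)      # without smoothing, "c" gets probability 0
prop.table(counts + 1)  # with laplace = 1, "c" gets a small nonzero probability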
model_nb <- naiveBayes(y ~ ., data = train_down, laplace = 1)
conf_nb <- confusionMatrix(data = predict(model_nb, newdata = test),
reference = test$y,
positive = "yes")
conf_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6354 328
## yes 1281 677
##
## Accuracy : 0.8138
## 95% CI : (0.8054, 0.8219)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3583
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.67363
## Specificity : 0.83222
## Pos Pred Value : 0.34576
## Neg Pred Value : 0.95091
## Prevalence : 0.11632
## Detection Rate : 0.07836
## Detection Prevalence : 0.22662
## Balanced Accuracy : 0.75293
##
## 'Positive' Class : yes
##
We can see that the Naive Bayes classifier has 81.38% accuracy. For this data, the false negative and false positive cases would be:
False negative: the model predicts that a customer will not subscribe to a term deposit when they actually would, so the bank misses a potential subscriber.
False positive: the model predicts that a customer will subscribe when they actually would not, so the bank wastes marketing effort on an uninterested customer.
Deciding which case is more severe for the bank is beyond my expertise. Feel free to decide which case matters more, and thus which metric in the confusion matrix is the most important. For now, I will use the ROC curve and AUC to see how good the model is at distinguishing the binary classes.
nb_rocr <- prediction(predictions = predict(model_nb, newdata = test, type = "raw")[, 2],
labels = test$y,
label.ordering = c("no", "yes"))
plot(performance(nb_rocr,
measure = "tpr",
x.measure = "fpr"))
# AUC
performance(nb_rocr, "auc")@y.values
## [[1]]
## [1] 0.8194124
A limitation of the Naive Bayes classifier is that we can only tune its cutoff. Doing so might increase one metric in the confusion matrix at the cost of decreasing another, while the AUC (which measures how good the model is at distinguishing classes) will not change. Let's see the accuracy, recall, specificity, and precision for different cutoffs.
pred_prob_nb <- predict(model_nb, newdata = test, type = "raw")
metrics <- function(cutoff){
prediction <- as.factor(ifelse(pred_prob_nb[, 2] > cutoff, "yes", "no"))
conf <- confusionMatrix(prediction,
reference = test$y,
positive = "yes")
res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
return(res)
}
cutoffs <- seq(0.01, 0.99, length = 99)
result <- matrix(nrow = 99, ncol = 4)
for(i in 1:99){
result[i, ] <- metrics(cutoffs[i])
}
result <- as.data.frame(result) %>%
rename(Accuracy = V1,
Recall = V2,
Specificity = V3,
Precision = V4) %>%
mutate(Cutoff = cutoffs)
result %>%
gather(key = "metrics", value = "value", -Cutoff) %>%
ggplot(mapping = aes(x = Cutoff,
y = value,
col = metrics)) +
geom_line(lwd = 1) +
labs(title = "Metrics for Different Cutoffs",
y = "Value") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5))
Again, feel free to decide which cutoff is optimal. I personally think that focusing on precision is not worthwhile here, since the peak precision is below 50%. So I would focus on the other three metrics, or simply use another classifier.
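If a single cutoff is needed, one common heuristic (my own suggestion, not part of the original analysis) is to pick the cutoff that maximizes Youden's J statistic, i.e. Recall + Specificity - 1, computed from the result data frame above:
# One possible heuristic: the cutoff maximizing Youden's J (Recall + Specificity - 1)
result %>%
  mutate(J = Recall + Specificity - 1) %>%
  slice_max(J, n = 1)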
The decision tree is probably one of the most popular machine learning methods for classification tasks. Unlike Naive Bayes, a decision tree produces its output as a set of rules, which means it can easily be understood by humans and applied to many scenarios. Let's first build a decision tree classifier with the default options.
model_dt <- ctree(y ~ ., data = train_down)
model_dt
##
## Model formula:
## y ~ age + job + marital + education + balance + housing + loan +
## duration + campaign + previous
##
## Fitted party:
## [1] root
## | [2] duration <= 251
## | | [3] duration <= 125
## | | | [4] duration <= 77
## | | | | [5] education in primary, secondary: no (n = 537, err = 1.9%)
## | | | | [6] education in tertiary: no (n = 243, err = 6.2%)
## | | | [7] duration > 77
## | | | | [8] housing in no
## | | | | | [9] campaign <= 1: no (n = 183, err = 39.3%)
## | | | | | [10] campaign > 1
## | | | | | | [11] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unemployed
## | | | | | | | [12] previous <= 0: no (n = 256, err = 11.7%)
## | | | | | | | [13] previous > 0: no (n = 33, err = 45.5%)
## | | | | | | [14] job in student: yes (n = 10, err = 20.0%)
## | | | | [15] housing in yes
## | | | | | [16] education in primary, secondary
## | | | | | | [17] previous <= 1: no (n = 254, err = 2.8%)
## | | | | | | [18] previous > 1: no (n = 54, err = 18.5%)
## | | | | | [19] education in tertiary: no (n = 117, err = 17.9%)
## | | [20] duration > 125
## | | | [21] housing in no
## | | | | [22] previous <= 0
## | | | | | [23] loan in no
## | | | | | | [24] campaign <= 1
## | | | | | | | [25] job in admin., management, retired, services, student: yes (n = 202, err = 33.7%)
## | | | | | | | [26] job in blue-collar, entrepreneur, housemaid, self-employed, technician, unemployed: no (n = 121, err = 38.8%)
## | | | | | | [27] campaign > 1
## | | | | | | | [28] duration <= 205
## | | | | | | | | [29] campaign <= 5
## | | | | | | | | | [30] balance <= 154: no (n = 69, err = 18.8%)
## | | | | | | | | | [31] balance > 154: no (n = 163, err = 41.1%)
## | | | | | | | | [32] campaign > 5: no (n = 46, err = 6.5%)
## | | | | | | | [33] duration > 205: no (n = 129, err = 48.8%)
## | | | | | [34] loan in yes
## | | | | | | [35] marital in divorced, married: no (n = 90, err = 6.7%)
## | | | | | | [36] marital in single: no (n = 30, err = 30.0%)
## | | | | [37] previous > 0
## | | | | | [38] duration <= 166: yes (n = 123, err = 31.7%)
## | | | | | [39] duration > 166: yes (n = 257, err = 9.3%)
## | | | [40] housing in yes
## | | | | [41] previous <= 0
## | | | | | [42] job in admin., housemaid, management, retired
## | | | | | | [43] balance <= 2679: no (n = 231, err = 16.5%)
## | | | | | | [44] balance > 2679
## | | | | | | | [45] duration <= 167: no (n = 20, err = 20.0%)
## | | | | | | | [46] duration > 167: yes (n = 15, err = 20.0%)
## | | | | | [47] job in blue-collar, entrepreneur, self-employed, services, student, technician, unemployed: no (n = 472, err = 6.8%)
## | | | | [48] previous > 0
## | | | | | [49] education in primary, secondary: no (n = 196, err = 34.7%)
## | | | | | [50] education in tertiary
## | | | | | | [51] duration <= 169: no (n = 33, err = 39.4%)
## | | | | | | [52] duration > 169: yes (n = 53, err = 24.5%)
## | [53] duration > 251
## | | [54] duration <= 490
## | | | [55] housing in no
## | | | | [56] loan in no
## | | | | | [57] previous <= 0
## | | | | | | [58] duration <= 441
## | | | | | | | [59] marital in divorced, married
## | | | | | | | | [60] age <= 59
## | | | | | | | | | [61] education in primary: no (n = 29, err = 24.1%)
## | | | | | | | | | [62] education in secondary, tertiary: yes (n = 279, err = 46.2%)
## | | | | | | | | [63] age > 59: yes (n = 82, err = 9.8%)
## | | | | | | | [64] marital in single
## | | | | | | | | [65] age <= 30: yes (n = 98, err = 13.3%)
## | | | | | | | | [66] age > 30: yes (n = 88, err = 40.9%)
## | | | | | | [67] duration > 441: yes (n = 96, err = 15.6%)
## | | | | | [68] previous > 0: yes (n = 415, err = 10.6%)
## | | | | [69] loan in yes
## | | | | | [70] previous <= 0: no (n = 77, err = 22.1%)
## | | | | | [71] previous > 0: yes (n = 9, err = 11.1%)
## | | | [72] housing in yes
## | | | | [73] previous <= 0
## | | | | | [74] duration <= 376
## | | | | | | [75] education in primary, secondary: no (n = 246, err = 15.4%)
## | | | | | | [76] education in tertiary: no (n = 99, err = 33.3%)
## | | | | | [77] duration > 376: no (n = 220, err = 41.4%)
## | | | | [78] previous > 0
## | | | | | [79] loan in no: yes (n = 224, err = 30.8%)
## | | | | | [80] loan in yes: no (n = 34, err = 44.1%)
## | | [81] duration > 490
## | | | [82] duration <= 706
## | | | | [83] housing in no
## | | | | | [84] loan in no: yes (n = 411, err = 14.4%)
## | | | | | [85] loan in yes: yes (n = 43, err = 34.9%)
## | | | | [86] housing in yes
## | | | | | [87] marital in divorced, married: yes (n = 302, err = 33.1%)
## | | | | | [88] marital in single: yes (n = 145, err = 17.2%)
## | | | [89] duration > 706: yes (n = 1198, err = 9.6%)
##
## Number of inner nodes: 44
## Number of terminal nodes: 45
plot(model_dt, type = "simple")
It seems that the decision tree model is too complex. This is understandable, since our data has many features. Let's see the accuracy on both the training and test sets.
# Confusion matrix for training set
confusionMatrix(data = predict(model_dt, newdata = train_down, type = "response"),
reference = train_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3238 744
## yes 778 3272
##
## Accuracy : 0.8105
## 95% CI : (0.8018, 0.819)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.621
##
## Mcnemar's Test P-Value : 0.3976
##
## Sensitivity : 0.8147
## Specificity : 0.8063
## Pos Pred Value : 0.8079
## Neg Pred Value : 0.8132
## Prevalence : 0.5000
## Detection Rate : 0.4074
## Detection Prevalence : 0.5042
## Balanced Accuracy : 0.8105
##
## 'Positive' Class : yes
##
# Confusion matrix for test set
conf_dt <- confusionMatrix(data = predict(model_dt, newdata = test, type = "response"),
reference = test$y,
positive = "yes")
conf_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6033 204
## yes 1602 801
##
## Accuracy : 0.791
## 95% CI : (0.7822, 0.7995)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3661
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.79701
## Specificity : 0.79018
## Pos Pred Value : 0.33333
## Neg Pred Value : 0.96729
## Prevalence : 0.11632
## Detection Rate : 0.09271
## Detection Prevalence : 0.27813
## Balanced Accuracy : 0.79360
##
## 'Positive' Class : yes
##
Looking at the difference between the training and test sets, I do not think the model is overfitting. Comparing the confusion matrices of the decision tree and Naive Bayes models, we can see that the decision tree, despite slightly lower accuracy, may still be better than Naive Bayes thanks to its higher recall together with decent specificity. If we tuned the Naive Bayes cutoff to reach a similar accuracy and specificity, its sensitivity would still be noticeably lower than the decision tree's.
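That claim can be sanity-checked against the cutoff table from the Naive Bayes section; a quick sketch that inspects Naive Bayes recall at cutoffs whose specificity is close to the decision tree's (~0.79):
# Naive Bayes recall at cutoffs whose specificity is near the decision tree's ~0.79
result %>%
  filter(abs(Specificity - 0.79) < 0.02) %>%
  select(Cutoff, Accuracy, Recall, Specificity)
Now, let's see the ROC curve and AUC of the decision tree model.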
dt_rocr <- prediction(predictions = predict(model_dt, newdata = test, type = "prob")[, 2],
labels = test$y,
label.ordering = c("no", "yes"))
plot(performance(dt_rocr,
measure = "tpr",
x.measure = "fpr"))
# AUC
performance(dt_rocr, "auc")@y.values
## [[1]]
## [1] 0.8629822
As we can see, the decision tree model has an AUC of ~0.863, which means it is better than Naive Bayes at distinguishing the two classes. The decision tree could be tuned by setting mincriterion, minsplit, and minbucket in its control parameters. However, since searching for the best combination of those three would take many iterations, I will skip it for the time being; besides, the model shows no sign of overfitting and has a decent AUC. For now, let's proceed to the last classifier, the random forest.
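For reference, a minimal sketch of what that tuning could look like (my own illustration: the grid values are arbitrary, and selecting on the test set leaks information, so a separate validation set or cross-validation would be more rigorous):
# Hypothetical tuning sketch: evaluate a small grid of ctree_control()
# settings and keep the combination with the highest AUC
grid <- expand.grid(mincriterion = c(0.90, 0.95, 0.99),
                    minsplit = c(20, 50, 100),
                    minbucket = c(7, 20, 50))
aucs <- apply(grid, 1, function(p) {
  m <- ctree(y ~ ., data = train_down,
             control = ctree_control(mincriterion = p["mincriterion"],
                                     minsplit = as.integer(p["minsplit"]),
                                     minbucket = as.integer(p["minbucket"])))
  pr <- prediction(predict(m, newdata = test, type = "prob")[, 2],
                   labels = test$y, label.ordering = c("no", "yes"))
  performance(pr, "auc")@y.values[[1]]
})
grid[which.max(aucs), ]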
Random forest is an ensemble learning method known for its versatility and performance. A random forest consists of many decision trees, each constructed from a different bootstrap sample of the observations (and a random subset of features at each split), so the trees have different characteristics. The random forest prediction aggregates the predictions of all these trees; for classification, it takes a majority vote. First, let's set the training control to use 5-fold cross-validation repeated 3 times.
# The model was made using this code
#set.seed(100)
#ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
#model_rf <- train(y ~., data = train_down, method = "rf", trControl = ctrl)
model_rf <- readRDS("model_rf.rds")
model_rf
## Random Forest
##
## 8032 samples
## 10 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 6426, 6426, 6425, 6426, 6425, 6425, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7991356 0.5982723
## 12 0.8056515 0.6113023
## 23 0.8028712 0.6057415
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
Checking the final random forest model.
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 19.4%
## Confusion matrix:
## no yes class.error
## no 3162 854 0.2126494
## yes 704 3312 0.1752988
plot(model_rf)
The random forest has an out-of-bag (OOB) error rate of 19.4%. Since OOB error is computed on the observations left out of each tree's bootstrap sample, it serves as an estimate of out-of-sample error. Let's look at the confusion matrices for both the training and test sets.
confusionMatrix(data = predict(model_rf, newdata = train_down),
reference = train_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4016 0
## yes 0 4016
##
## Accuracy : 1
## 95% CI : (0.9995, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : yes
##
conf_rf <- confusionMatrix(data = predict(model_rf, newdata = test),
reference = test$y,
positive = "yes")
conf_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6000 175
## yes 1635 830
##
## Accuracy : 0.7905
## 95% CI : (0.7818, 0.799)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3751
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.82587
## Specificity : 0.78585
## Pos Pred Value : 0.33671
## Neg Pred Value : 0.97166
## Prevalence : 0.11632
## Detection Rate : 0.09606
## Detection Prevalence : 0.28530
## Balanced Accuracy : 0.80586
##
## 'Positive' Class : yes
##
It turns out that the random forest classifier is overfitting: the metrics on the training set are perfect, but on the test set the accuracy is only about 79%, with decent recall and specificity but poor precision. Let's see the ROC and AUC.
rf_rocr <- prediction(predictions = predict(model_rf, newdata = test, type = "prob")$yes,
labels = test$y,
label.ordering = c("no", "yes"))
plot(performance(rf_rocr,
measure = "tpr",
x.measure = "fpr"))
# AUC
performance(rf_rocr, "auc")@y.values
## [[1]]
## [1] 0.8737758
The random forest model's ability to distinguish classes is slightly higher than the decision tree's, with an AUC value of 0.8738.
varImp(model_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 23)
##
## Overall
## duration 100.0000
## balance 32.9411
## age 27.6447
## previous 16.6702
## housingyes 11.2165
## campaign 10.2486
## loanyes 3.9804
## educationtertiary 3.2654
## maritalmarried 3.0600
## jobblue-collar 2.8713
## maritalsingle 2.7803
## jobmanagement 2.6442
## educationsecondary 2.6436
## jobtechnician 2.6404
## jobservices 1.6528
## jobunemployed 1.1723
## jobself-employed 1.0584
## jobentrepreneur 1.0052
## jobstudent 0.9884
## jobretired 0.9608
Using the variable importance, we can also see that the most influential variable is duration (last contact duration, in seconds), followed by balance and age. This seems reasonable, since a longer contact duration might imply that the client is interested in the offer.
Finally, here is a data frame comparing the performance of the Naive Bayes, Decision Tree, and Random Forest models.
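All the needed numbers live in the objects created above; here is a sketch of how such a comparison could be assembled from conf_nb, conf_dt, conf_rf, and the ROCR prediction objects:
# Assemble a comparison table from the confusion matrices and AUC values above
model_comparison <- data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy = c(conf_nb$overall["Accuracy"],
               conf_dt$overall["Accuracy"],
               conf_rf$overall["Accuracy"]),
  Recall = c(conf_nb$byClass["Sensitivity"],
             conf_dt$byClass["Sensitivity"],
             conf_rf$byClass["Sensitivity"]),
  Specificity = c(conf_nb$byClass["Specificity"],
                  conf_dt$byClass["Specificity"],
                  conf_rf$byClass["Specificity"]),
  Precision = c(conf_nb$byClass["Pos Pred Value"],
                conf_dt$byClass["Pos Pred Value"],
                conf_rf$byClass["Pos Pred Value"]),
  AUC = c(performance(nb_rocr, "auc")@y.values[[1]],
          performance(dt_rocr, "auc")@y.values[[1]],
          performance(rf_rocr, "auc")@y.values[[1]])
)
model_comparison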
Looking at the data frame, we can conclude that the Naive Bayes classifier, despite having the highest accuracy, is the worst at distinguishing the classes (although the difference is not large). The random forest's performance is similar to the decision tree's; it distinguishes whether a bank client will subscribe to a term deposit slightly better than the decision tree does. However, considering how interpretable and adaptable the decision tree is, it may be the better choice over the random forest classifier. From the variable importance of the random forest model, the most influential feature for predicting whether a customer subscribes to a term deposit is the contact duration: a longer contact duration might imply that the client is interested in the bank's offers.