Banking Deposit Investment Classification Using Naive Bayes, Decision Tree and Random Forest

Introduction

A Portuguese bank has experienced a revenue decline and would like to know what actions to take. After investigating, they found that the root cause was that their customers are not investing enough in long-term deposits. The bank therefore wants to identify existing customers with a higher chance of subscribing to a long-term deposit and focus its marketing efforts on those customers.

Previously, we answered the bank’s problem using Logistic Regression and K-Nearest Neighbor in a separate report.

Now, in this report, we are going to answer the bank’s problem using Naive Bayes, Decision Tree, and Random Forest. The process covers Data Preparation, Exploratory Data Analysis, Data Pre-Processing, Model Building & Prediction, Model Evaluation, and Conclusion.

Data Preparation

# Library Input
library(dplyr)
library(ggplot2)
library(GGally)
library(car)
library(caret)
library(partykit)
library(rsample)
library(e1071)
library(randomForest)
library(ROCR)
library(rpart)
library(rpart.plot)
# Data Input
bank <- read.csv("data/bank_investments.csv", stringsAsFactors = T)

head(bank)
##   age          job  marital         education default housing loan   contact
## 1  49  blue-collar  married          basic.9y unknown      no   no  cellular
## 2  37 entrepreneur  married university.degree      no      no   no telephone
## 3  78      retired  married          basic.4y      no      no   no  cellular
## 4  36       admin.  married university.degree      no     yes   no telephone
## 5  59      retired divorced university.degree      no      no   no  cellular
## 6  29       admin.   single university.degree      no      no   no  cellular
##   month day_of_week duration campaign pdays previous    poutcome   y
## 1   nov         wed      227        4   999        0 nonexistent  no
## 2   nov         wed      202        2   999        1     failure  no
## 3   jul         mon     1148        1   999        0 nonexistent yes
## 4   may         mon      120        2   999        0 nonexistent  no
## 5   jun         tue      368        2   999        0 nonexistent  no
## 6   aug         wed      256        2   999        0 nonexistent  no

The data used in this report contains 32,950 observations and 16 variables (including the target), ordered by date (from May 2008 to November 2010). The variables are as follows:

  • age : the client’s age
  • job : type of job
  • marital : marital status
  • education : the client’s last education level
  • default : whether or not the client has credit in default
  • housing : does the client have a housing loan?
  • loan : does the client have a personal loan?
  • contact : contact communication type
  • month : last contact month of the year
  • day_of_week : last contact day of the week
  • duration : last contact duration, in seconds
  • campaign : number of contacts performed during this campaign for this client (includes the last contact)
  • pdays : number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted)
  • previous : number of contacts performed before this campaign for this client
  • poutcome : outcome of the previous marketing campaign

Target Variable:

  • y : has the client subscribed to a term deposit?
# Checking Missing Values
colSums(is.na(bank))
##         age         job     marital   education     default     housing 
##           0           0           0           0           0           0 
##        loan     contact       month day_of_week    duration    campaign 
##           0           0           0           0           0           0 
##       pdays    previous    poutcome           y 
##           0           0           0           0

All columns already have the desired data types (character columns were read as factors via stringsAsFactors = T) and there are no missing values.
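
As a quick sanity check, str() can confirm that every character column was indeed imported as a factor (output omitted here for brevity):

# Optional sanity check: data types after import
str(bank)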

Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlations between them.

# Data Summary
summary(bank)
##       age                 job           marital                    education   
##  Min.   :17.00   admin.     :8314   divorced: 3675   university.degree  :9736  
##  1st Qu.:32.00   blue-collar:7441   married :19953   high.school        :7596  
##  Median :38.00   technician :5400   single  : 9257   basic.9y           :4826  
##  Mean   :40.01   services   :3196   unknown :   65   professional.course:4192  
##  3rd Qu.:47.00   management :2345                    basic.4y           :3322  
##  Max.   :98.00   retired    :1366                    basic.6y           :1865  
##                  (Other)    :4888                    (Other)            :1413  
##     default         housing           loan            contact     
##  no     :26007   no     :14900   no     :27131   cellular :20908  
##  unknown: 6940   unknown:  796   unknown:  796   telephone:12042  
##  yes    :    3   yes    :17254   yes    : 5023                    
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##      month       day_of_week    duration         campaign          pdays      
##  may    :11011   fri:6322    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
##  jul    : 5763   mon:6812    1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.:999.0  
##  aug    : 4948   thu:6857    Median : 180.0   Median : 2.000   Median :999.0  
##  jun    : 4247   tue:6444    Mean   : 258.1   Mean   : 2.561   Mean   :962.1  
##  nov    : 3266   wed:6515    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0  
##  apr    : 2085               Max.   :4918.0   Max.   :56.000   Max.   :999.0  
##  (Other): 1630                                                                
##     previous             poutcome       y        
##  Min.   :0.0000   failure    : 3429   no :29238  
##  1st Qu.:0.0000   nonexistent:28416   yes: 3712  
##  Median :0.0000   success    : 1105              
##  Mean   :0.1747                                  
##  3rd Qu.:0.0000                                  
##  Max.   :7.0000                                  
## 

Note that the numeric variables are on very different scales; depending on the algorithm used, further scaling might be needed.

# Correlation of Numeric Variables
ggcorr(bank, hjust = 1, layout.exp = 2, label = T, label_size = 4,
       low = "#d53e4f", mid = "white", high = "#fddbc7")

Based on the correlation plot, there is no strong correlation between the numeric predictors, except for age and previous, which show a negative correlation. Since Naive Bayes assumes the predictors are independent of each other, this near-absence of correlation is worth noting.

# Target Variable Proportion
investment_y <- bank %>% 
  select(y) %>% 
  group_by(y) %>% 
  summarise(count = length(y))

ggplot(data = investment_y, mapping = aes(x = reorder(y, -count), y = count)) +
  geom_col(mapping = aes(fill = y), 
           position = "dodge") +
  labs(title = "Customer Investment Proportion",
       fill = "Investment",
       x = NULL,
       y = "Number of Customers") +
  scale_fill_manual(values = c("#d53e4f", "#fddbc7")) +
  theme_minimal()

As can be seen above, the target variable is clearly imbalanced. Resampling will be needed.

Data Pre-Processing

In this process, we split the data into a train set and a test set. The train set will be used to build the models, while the test set will be used to evaluate predictions of the target variable. We will take 80% of the data as the train set, and the rest will be used as the test set.

# Splitting the data set
RNGkind(sample.kind = "Rounding")
set.seed(123) 

index_bank <- sample(x = nrow(bank) , size = nrow(bank)*0.8) 
bank_train <- bank[index_bank, ]
bank_test <- bank[-index_bank, ]

Next, we check the class proportions to see whether the data is imbalanced.

# Checking data proportion
prop.table(table(bank$y))
## 
##        no       yes 
## 0.8873445 0.1126555
# Checking data proportion of train data set
prop.table(table(bank_train$y))
## 
##        no       yes 
## 0.8876328 0.1123672
# Checking data proportion of test data set
prop.table(table(bank_test$y))
## 
##        no       yes 
## 0.8861912 0.1138088

Since the class distribution of the target variable is around 89:11, indicating an imbalanced dataset, we need to resample it.

# Downsampling
RNGkind(sample.kind = "Rounding")
set.seed(123)

bank_train_down <- downSample(x = bank_train %>% select(-y),
                              y = bank_train$y,
                              yname = "y")
prop.table(table(bank_train_down$y))
## 
##  no yes 
## 0.5 0.5

A balanced class proportion is important for the train set because the models will learn from it; with the original 89:11 split, they would be biased toward the majority class.
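
For reference, caret also provides the mirror of downSample. A minimal sketch of upsampling the minority class instead (not used in this report; the object name is illustrative):

# Alternative: upsample the minority class instead of
# discarding majority-class rows (not used in this report)
bank_train_up <- upSample(x = bank_train %>% select(-y),
                          y = bank_train$y,
                          yname = "y")
prop.table(table(bank_train_up$y))   # should also be 0.5 / 0.5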

Model Building & Prediction

Naive Bayes

# Model Building (all variables)
model_nb <- naiveBayes(formula = y ~ . , data = bank_train_down, laplace = 1)

Naive Bayes has the Skewness Due To Scarcity characteristic: when a predictor level never occurs for one of the classes (its frequency is 0), the model assigns a probability of 0 to that whole combination, regardless of the values of the other predictors. This biases the model and makes it less accurate in making predictions. To avoid this, we apply Laplace Smoothing, which adds a constant (usually 1) to the frequency of every predictor level, so that no conditional probability is ever 0.
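
To illustrate what laplace = 1 does, here is a minimal sketch of the smoothed conditional probabilities for a single categorical predictor (poutcome is used purely as an example; e1071 performs this for every categorical predictor internally):

# Manual Laplace smoothing for one predictor: add 1 to every
# class/level count so no conditional probability is ever zero
tab <- table(bank_train_down$y, bank_train_down$poutcome)
smoothed <- (tab + 1) / (rowSums(tab) + 1 * ncol(tab))
round(smoothed, 5)   # matches the poutcome table in the model output below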

# Model Output
model_nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  no yes 
## 0.5 0.5 
## 
## Conditional probabilities:
##      age
## Y         [,1]      [,2]
##   no  39.93991  9.807662
##   yes 40.95307 13.877065
## 
##      job
## Y          admin. blue-collar entrepreneur   housemaid  management     retired
##   no  0.248823134 0.228312038  0.041694687 0.024882313 0.073974445 0.033624748
##   yes 0.291526564 0.133826496  0.028917283 0.022528581 0.070948218 0.096503026
##      job
## Y     self-employed    services     student  technician  unemployed     unknown
##   no    0.029253531 0.106590451 0.017148621 0.164761264 0.023537323 0.007397445
##   yes   0.034969738 0.069939475 0.057162071 0.156018830 0.029589778 0.008069939
## 
##      marital
## Y        divorced     married      single     unknown
##   no  0.118341200 0.611260958 0.269049225 0.001348618
##   yes 0.097774781 0.544504383 0.354012138 0.003708699
## 
##      education
## Y         basic.4y     basic.6y     basic.9y  high.school   illiterate
##   no  0.0966329966 0.0579124579 0.1481481481 0.2414141414 0.0003367003
##   yes 0.0966329966 0.0397306397 0.0983164983 0.2205387205 0.0013468013
##      education
## Y     professional.course university.degree      unknown
##   no         0.1276094276      0.2851851852 0.0427609428
##   yes        0.1262626263      0.3643097643 0.0528619529
## 
##      default
## Y               no      unknown          yes
##   no  0.7801011804 0.2192242833 0.0006745363
##   yes 0.9045531197 0.0951096121 0.0003372681
## 
##      housing
## Y             no    unknown        yes
##   no  0.44485666 0.02293423 0.53220911
##   yes 0.43912310 0.02327150 0.53760540
## 
##      loan
## Y             no    unknown        yes
##   no  0.82563238 0.02293423 0.15143339
##   yes 0.82462057 0.02327150 0.15210793
## 
##      contact
## Y      cellular telephone
##   no  0.5843455 0.4156545
##   yes 0.8275978 0.1724022
## 
##      month
## Y             apr         aug         dec         jul         jun         mar
##   no  0.051480485 0.157469717 0.001682369 0.163526245 0.146029610 0.010094213
##   yes 0.108344549 0.141655451 0.020188425 0.148048452 0.121130552 0.059555855
##      month
## Y             may         nov         oct         sep
##   no  0.347913863 0.095558546 0.014131898 0.012113055
##   yes 0.187079408 0.088829071 0.070995962 0.054172275
## 
##      day_of_week
## Y           fri       mon       thu       tue       wed
##   no  0.1985170 0.2200876 0.1981800 0.1921132 0.1911021
##   yes 0.1840243 0.1877317 0.2214358 0.2025615 0.2042467
## 
##      duration
## Y         [,1]     [,2]
##   no  219.3518 197.3430
##   yes 551.2154 402.9288
## 
##      campaign
## Y         [,1]     [,2]
##   no  2.555706 2.793423
##   yes 2.039838 1.657146
## 
##      pdays
## Y         [,1]     [,2]
##   no  980.2346 135.2032
##   yes 781.7677 410.5785
## 
##      previous
## Y          [,1]      [,2]
##   no  0.1390952 0.4257392
##   yes 0.5104659 0.8790741
## 
##      poutcome
## Y        failure nonexistent    success
##   no  0.10118044  0.88229342 0.01652614
##   yes 0.13086003  0.66779089 0.20134907
# Predict Data Test
pred_nb <- predict(object = model_nb, newdata = bank_test, type = "class")
table(predict = pred_nb, 
      actual = bank_test$y)
##        actual
## predict   no  yes
##     no  5336  317
##     yes  504  433

Decision Tree

# Model Building (all variables)
model_dt <- rpart(formula = y ~ ., data = bank_train_down, method = "class")
model_dt
## n= 5924 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 5924 2962 no (0.50000000 0.50000000)  
##    2) duration< 206.5 2343  495 no (0.78873239 0.21126761)  
##      4) pdays>=512.5 2149  335 no (0.84411354 0.15588646)  
##        8) month=aug,jul,jun,may,nov 1814  156 no (0.91400221 0.08599779) *
##        9) month=apr,dec,mar,oct,sep 335  156 yes (0.46567164 0.53432836)  
##         18) duration< 95.5 90   13 no (0.85555556 0.14444444) *
##         19) duration>=95.5 245   79 yes (0.32244898 0.67755102) *
##      5) pdays< 512.5 194   34 yes (0.17525773 0.82474227) *
##    3) duration>=206.5 3581 1114 yes (0.31108629 0.68891371)  
##      6) duration< 457.5 1847  845 yes (0.45749865 0.54250135)  
##       12) pdays>=513 1498  669 no (0.55340454 0.44659546)  
##         24) month=aug,jul,jun,may,nov 1151  383 no (0.66724587 0.33275413) *
##         25) month=apr,dec,mar,oct,sep 347   61 yes (0.17579251 0.82420749) *
##       13) pdays< 513 349   16 yes (0.04584527 0.95415473) *
##      7) duration>=457.5 1734  269 yes (0.15513264 0.84486736) *
# Decision Tree Plot
rpart.plot(model_dt,  type = 2, nn = TRUE, extra = "auto",
           box.palette = c("#d53e4f", "#fddbc7"), shadow.col = "grey")

Based on the decision tree plot above, we can see the number of divisions/leaves (width) and the number of layers/levels (depth). A textual rule view of the same tree follows the list below.

  • At the top of the diagram, [1] is called the Root Node. It is the first branch in determining the target value, commonly referred to as the main predictor. In our case, it shows the proportion of customers that accepted the investment offer: 50% of customers accepted (the downsampled train set is balanced).
  • [2], [3], [4], [6], [9], [12] are called Internal Nodes. Internal nodes (branches) have an arrow pointing at them and an arrow pointing away from them. For example, node [2] asks whether the call duration is less than 207 seconds. If yes, you go down to the root’s left child node (depth 2): 40% of the calls have a duration below 207 seconds, with a 21% probability of an accepted investment offer; and so on.
  • [8], [18], [19], [5], [24], [25], [13], and [7] are Leaf Nodes. Leaf nodes have arrows pointing at them, but no arrows pointing away from them.
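
If a textual view is easier to scan than the plot, the rpart.plot package can also print the fitted tree as a set of decision rules (the cover argument adds the share of observations each rule covers):

# Print the fitted tree as readable decision rules
rpart.rules(model_dt, cover = TRUE)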
# Predict Data Test
pred_dt <- predict(object = model_dt, newdata = bank_test, type = "class")
table(predict = pred_dt, 
      actual = bank_test$y)
##        actual
## predict   no  yes
##     no  4871  155
##     yes  969  595

Normally, predictions only need to be run on the test set, but we will also predict on the train set to check whether our model is overfitted.

# Predict Data Train
pred_dt_train <- predict(object = model_dt, newdata = bank_train_down, type = "class")
table(predict = pred_dt_train, 
      actual = bank_train_down$y)
##        actual
## predict   no  yes
##     no  2503  552
##     yes  459 2410

Random Forest

Before building the Random Forest model, we remove columns whose variance is close to zero. A disadvantage of Random Forest is its heavy computational load, which can be reduced by limiting the number of predictors. Near-zero-variance columns carry little information, so they are safe candidates for removal.

# Feature Selection using nearZeroVar
# (note: this assumes nearZeroVar flags the same columns in both sets,
# which holds here; otherwise train and test would diverge)
no_var_train <- nearZeroVar(bank_train_down)
no_var_test <- nearZeroVar(bank_test)
bank_train_novar <- bank_train_down[, -no_var_train]
bank_test_novar <- bank_test[, -no_var_test]

head(bank_train_novar)
##   age         job marital           education default housing loan   contact
## 1  47     retired  single            basic.6y unknown      no   no  cellular
## 2  60      admin.  single         high.school      no     yes   no telephone
## 3  39 blue-collar married         high.school unknown      no   no telephone
## 4  61  technician married professional.course      no      no   no  cellular
## 5  31 blue-collar married            basic.6y unknown     yes   no telephone
## 6  41  technician married   university.degree      no     yes   no telephone
##   month day_of_week duration campaign previous    poutcome  y
## 1   jul         mon      157        5        0 nonexistent no
## 2   may         thu      609        1        0 nonexistent no
## 3   may         wed      278        1        0 nonexistent no
## 4   oct         thu       76        1        0 nonexistent no
## 5   may         wed      298        1        0 nonexistent no
## 6   may         fri       67        3        0 nonexistent no

In this model, we are going to use a model evaluation technique called K-fold Cross-Validation. This technique splits the data into k equal-sized groups (folds), uses one fold as validation data while the remaining folds become the training data, and repeats the process k times so that each fold serves as validation data once.

Here, we build a random forest model with 5-fold cross-validation, repeating the whole procedure 3 times:

# Cross Validation
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Model Building
model_rf <- train(y ~ ., data = bank_train_novar, method = "rf", trControl = ctrl)

# Saving Model
saveRDS(model_rf, file = "model_rf.rds")
# Model Summary
rf_model <- readRDS("model_rf.rds")
rf_model
## Random Forest 
## 
## 5924 samples
##   14 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4740, 4740, 4738, 4739, 4739, 4740, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7910180  0.5820361
##   24    0.8529709  0.7059414
##   47    0.8490881  0.6981762
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.

Based on the model summary, the optimal value was mtry = 24, with an accuracy of around 85%.
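
If we wanted to tune further, caret accepts an explicit search grid. A sketch probing mtry values around the winner (the values and object names are illustrative, and this was not run for this report):

# Hypothetical finer grid search around mtry = 24
set.seed(123)
grid_mtry <- expand.grid(mtry = c(16, 20, 24, 28))
model_rf_tuned <- train(y ~ ., data = bank_train_novar, method = "rf",
                        trControl = ctrl, tuneGrid = grid_mtry)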

# Predict Data Test
pred_rf <- predict(rf_model, newdata = bank_test)
table(predict = pred_rf, 
      actual = bank_test$y)
##        actual
## predict   no  yes
##     no  4810   86
##     yes 1030  664

Model Evaluation

Naive Bayes

# Confusion Matrix
confusionMatrix(data = pred_nb, reference = bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5336  317
##        yes  504  433
##                                           
##                Accuracy : 0.8754          
##                  95% CI : (0.8672, 0.8833)
##     No Information Rate : 0.8862          
##     P-Value [Acc > NIR] : 0.9969          
##                                           
##                   Kappa : 0.4429          
##                                           
##  Mcnemar's Test P-Value : 8.502e-11       
##                                           
##             Sensitivity : 0.57733         
##             Specificity : 0.91370         
##          Pos Pred Value : 0.46211         
##          Neg Pred Value : 0.94392         
##              Prevalence : 0.11381         
##          Detection Rate : 0.06571         
##    Detection Prevalence : 0.14219         
##       Balanced Accuracy : 0.74552         
##                                           
##        'Positive' Class : yes             
## 

Based on the confusion matrix, this model looks quite good, with an accuracy of around 87.5%. However, in our case we are offering customers the investment, and the positive class is “the customer decides to invest” (yes), so we rely on Recall/Sensitivity (we want to approach/offer to as many likely subscribers as we can), which is unfortunately quite low at around 58%.

And since our data is imbalanced, and we want to see how well our model distinguishes between the positive and negative classes, we will use the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics.

# ROC
pred_nbProb <- predict(model_nb, newdata = bank_test, type = "raw")

df_roc_bank <- data.frame(prob = pred_nbProb[,2], 
                           label = as.numeric(bank_test$y == "yes"))

prediction_roc_bank <- prediction(predictions = df_roc_bank$prob, 
                                  labels = df_roc_bank$label) 

plot(performance(prediction.obj = prediction_roc_bank, 
                 measure = "tpr", x.measure = "fpr")) 

The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 − Specificity). The closer the curve gets to the upper-left corner of the plot (high true positive rate, low false positive rate), the better the model, so we can conclude that our model separates the classes well.

# AUC
auc_bank <- performance(prediction.obj = prediction_roc_bank, measure = "auc")
auc_bank@y.values
## [[1]]
## [1] 0.8683594

AUC shows the area under the ROC curve. The closer it is to 1, the better the model’s performance. Our model’s AUC value is 0.87, which means it’s a good model and has a good measure of separability.

Decision Tree

In this section, we compare the confusion matrices of the train prediction and the test prediction. The objective is to check whether the model is overfitted (a known drawback of Decision Trees is their tendency to overfit). If it is, further tree pruning might be needed; a pruning sketch is shown later in this subsection.

# Confusion Matrix of Data Train Prediction
confusionMatrix(data = pred_dt_train, reference = bank_train_down$y, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2503  552
##        yes  459 2410
##                                           
##                Accuracy : 0.8293          
##                  95% CI : (0.8195, 0.8388)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6587          
##                                           
##  Mcnemar's Test P-Value : 0.003811        
##                                           
##             Sensitivity : 0.8136          
##             Specificity : 0.8450          
##          Pos Pred Value : 0.8400          
##          Neg Pred Value : 0.8193          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4068          
##    Detection Prevalence : 0.4843          
##       Balanced Accuracy : 0.8293          
##                                           
##        'Positive' Class : yes             
## 
# Confusion Matrix of Data Test Prediction
confusionMatrix(data = pred_dt, reference = bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  4871  155
##        yes  969  595
##                                           
##                Accuracy : 0.8294          
##                  95% CI : (0.8201, 0.8384)
##     No Information Rate : 0.8862          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4259          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.79333         
##             Specificity : 0.83408         
##          Pos Pred Value : 0.38043         
##          Neg Pred Value : 0.96916         
##              Prevalence : 0.11381         
##          Detection Rate : 0.09029         
##    Detection Prevalence : 0.23733         
##       Balanced Accuracy : 0.81370         
##                                           
##        'Positive' Class : yes             
## 

Based on the confusion matrices of both data sets, we can conclude that this model is not overfitted: the accuracy of the train prediction and the test prediction are both around 83%.

As explained before, we are offering customers the investment and the positive class is “the customer decides to invest” (yes), so we rely on Recall/Sensitivity (we want to approach/offer to as many likely subscribers as we can).

From the confusion matrices, the Sensitivity of the train prediction is around 81% and the Sensitivity of the test prediction is around 79%, which means the model is quite good and tree pruning might not be necessary.
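
For completeness, if the tree ever did overfit, a minimal pruning sketch with rpart would look like this: pick the complexity parameter (cp) with the lowest cross-validated error and prune back to it (object names are illustrative):

# Prune to the cp with the lowest cross-validated error (xerror)
printcp(model_dt)
best_cp <- model_dt$cptable[which.min(model_dt$cptable[, "xerror"]), "CP"]
model_dt_pruned <- prune(model_dt, cp = best_cp)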

But because our data is imbalanced, and we want to see how well our model distinguishes between the positive and negative classes, we will use the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics.

# ROC
pred_dtProb <- predict(object = model_dt, newdata = bank_test, type = "prob")

df_roc_bank_dt <- data.frame(prob = pred_dtProb[,2], 
                           label = as.numeric(bank_test$y == "yes"))

pred_roc_bank_dt <- prediction(predictions = df_roc_bank_dt$prob, 
                                  labels = df_roc_bank_dt$label) 

plot(performance(prediction.obj = pred_roc_bank_dt, 
                 measure = "tpr", x.measure = "fpr"))

The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 − Specificity). The closer the curve gets to the upper-left corner of the plot (high true positive rate, low false positive rate), the better the model, so we can conclude that our model separates the classes well.

# AUC
auc_bank_dt <- performance(prediction.obj = pred_roc_bank_dt, measure = "auc")
auc_bank_dt@y.values
## [[1]]
## [1] 0.8640068

AUC shows the area under the ROC curve. The closer it is to 1, the better the model’s performance. Our model’s AUC value is 0.86, which means it’s a good model and has a good measure of separability.

Random Forest

When creating a Random Forest model, we are not strictly required to split the data into train and test sets up front. Because each tree is built on a bootstrap sample, some observations are left out of every tree; these out-of-bag (OOB) observations act as built-in test data. The model makes predictions on them and calculates the resulting error, referred to as the out-of-bag error.

#Out of Bag Error
rf_model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 24
## 
##         OOB estimate of  error rate: 14.65%
## Confusion matrix:
##       no  yes class.error
## no  2447  515   0.1738690
## yes  353 2609   0.1191762

In this model, the out-of-bag error is 14.65%. In other words, the accuracy of the model on the out-of-bag (test-like) data is 100% − 14.65% = 85.35%.

Even though Random Forest is often labeled a non-interpretable model, we can at least see which predictors are used the most (i.e., are the most important) in the model:

# Variable Importance
plot(varImp(rf_model), col = "#d53e4f")

From the result, we can see that duration is the most important variable, followed by age, poutcomesuccess, and contacttelephone (the dummy-encoded levels poutcome = success and contact = telephone). Next, we check the confusion matrix of the prediction we made earlier.

# Confusion Matrix of Data Test Prediction
confusionMatrix(data = pred_rf, reference = bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  4810   86
##        yes 1030  664
##                                           
##                Accuracy : 0.8307          
##                  95% CI : (0.8214, 0.8396)
##     No Information Rate : 0.8862          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4578          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8853          
##             Specificity : 0.8236          
##          Pos Pred Value : 0.3920          
##          Neg Pred Value : 0.9824          
##              Prevalence : 0.1138          
##          Detection Rate : 0.1008          
##    Detection Prevalence : 0.2571          
##       Balanced Accuracy : 0.8545          
##                                           
##        'Positive' Class : yes             
## 

Based on the confusion matrix above, the accuracy of the test prediction is around 83%.

As explained before, we are offering customers the investment and the positive class is “the customer decides to invest” (yes), so we rely on Recall/Sensitivity (we want to approach/offer to as many likely subscribers as we can).

From the confusion matrix, the Sensitivity of the test prediction is around 88.5%, which means the model is quite good.

But because our data is imbalanced, and we want to see how well our model distinguishes between the positive and negative classes, we will use the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics.

# ROC
pred_rfProb <- predict(object = rf_model, newdata = bank_test, type = "prob")

df_roc_bank_rf <- data.frame(prob = pred_rfProb[,2], 
                           label = as.numeric(bank_test$y == "yes"))

pred_roc_bank_rf <- prediction(predictions = df_roc_bank_rf$prob, 
                                  labels = df_roc_bank_rf$label) 

plot(performance(prediction.obj = pred_roc_bank_rf, 
                 measure = "tpr", x.measure = "fpr"))

The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 − Specificity). The closer the curve gets to the upper-left corner of the plot (high true positive rate, low false positive rate), the better the model, so we can conclude that our model separates the classes well.

# AUC
auc_bank_rf <- performance(prediction.obj = pred_roc_bank_rf, measure = "auc")
auc_bank_rf@y.values
## [[1]]
## [1] 0.9182189

AUC shows the area under the ROC curve. The closer it is to 1, the better the model’s performance. Our model’s AUC value is around 0.92, which means it’s a great model and has a great measure of separability.

Conclusion

After predicting with the three models (Naive Bayes, Decision Tree, and Random Forest), there is no dramatic difference in terms of accuracy. The Naive Bayes model reaches 87.5% accuracy, while the Decision Tree and Random Forest models both reach around 83%. So based on accuracy alone, the Naive Bayes model looks the best.

However, accuracy is only the most appropriate metric when the data is balanced, and we know that our data set is NOT. To see how well each model distinguishes between the positive and negative classes, we turn to the ROC and AUC metrics: the Naive Bayes model has an AUC of 0.87, the Decision Tree model 0.86, and the Random Forest model 0.92. The Random Forest model therefore distinguishes between the positive and negative classes best.

Another factor to consider is that, in this case, we want to approach/offer the investment to as many likely subscribers as we can, so we also rely on Sensitivity. From that point of view, the Random Forest model again wins with around 88.5% Sensitivity, compared to 79% for the Decision Tree model and only 58% for the Naive Bayes model.
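
To make the comparison easier to scan, the reported test-set metrics can be collected into a single table (values copied from the confusion matrices and AUC outputs above):

# Recap of the reported test-set metrics
model_comparison <- data.frame(
  Model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy    = c(0.8754, 0.8294, 0.8307),
  Sensitivity = c(0.5773, 0.7933, 0.8853),
  AUC         = c(0.8684, 0.8640, 0.9182)
)
model_comparison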

Based on these results, we can conclude that the wisest choice for predicting this data set is the Random Forest model, since it has the highest AUC and Sensitivity of the three. Its variable importance also shows that the most significant predictors of a customer’s investment decision are duration, age, poutcomesuccess, and contacttelephone.