1. Background

Introduction

Marketing to potential clients has always been a crucial challenge for banking institutions. It is no surprise that banks deploy channels such as social media, customer service, digital media, and strategic partnerships to reach customers. But how can a bank target a specific location, demographic, or segment with greater accuracy? Machine learning has transformed this kind of outreach: data and analytics can tell a bank which customers are most likely to subscribe to a financial product. In this project on bank marketing with machine learning, I will show how a particular Portuguese bank can use predictive analytics to prioritize customers who are likely to subscribe to a term deposit.

The data set is based on the direct marketing campaigns of a Portuguese banking institution. These campaigns were conducted by phone, and often more than one contact with the same client was required to determine whether the client would subscribe to the product (a bank term deposit). The classification goal is to predict whether a client will subscribe to the bank’s term deposit (yes or no).

The dataset contains 17 columns, including the output (y). I will set the output column aside and use the remaining columns to find the independent variables (x) that best predict whether a customer will subscribe to a term deposit.

2. Data Wrangling

2.1 Data Inspection
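
Before anything else, load the packages used throughout. This setup block is my addition; the package-to-function mapping is inferred from the calls that appear later in the write-up.

library(dplyr)        # glimpse(), mutate(), select(), rename()
library(tidyr)        # drop_na(), gather()
library(ggplot2)      # metric plots
library(caret)        # nearZeroVar(), downSample(), confusionMatrix(), train()
library(rsample)      # initial_split(), training(), testing()
library(e1071)        # naiveBayes()
library(partykit)     # ctree()
library(randomForest) # backend for caret's method = "rf"
library(ROCR)         # prediction(), performance()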

# Read the semicolon-separated data, treating strings as factors
bank <- read.csv("bank-full.csv", 
                 sep = ";",
                 stringsAsFactors = TRUE)
head(bank)
# Check data types
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital   <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …

Here is some information about the features:

  • age: client’s age
  • job: type of job
  • marital: marital status
  • education: client’s last education level
  • default: does the client have credit in default?
  • balance: average yearly balance, in euros
  • housing: does the client have a housing loan?
  • loan: does the client have a personal loan?
  • contact: contact communication type
  • day: last contact day of the month
  • month: last contact month of the year
  • duration: last contact duration, in seconds
  • campaign: number of contacts performed during this campaign for this client
  • pdays: number of days since the client was last contacted in a previous campaign (-1 means the client was not previously contacted)
  • previous: number of contacts performed before this campaign for this client
  • poutcome: outcome of the previous marketing campaign
  • y: has the client subscribed to a term deposit?

We can see that there are 45,211 instances and 17 features in this dataset. Most of the features are categorical, and glimpse() shows that all the data types are read correctly. However, some values are recorded as “unknown”; I will treat these as missing values. Let’s first check whether the data has missing or duplicated values.

2.2 Missing and Duplicated Values

Replacing “unknowns” with NA, then checking whether the data has any missing values.

# Convert every "unknown" entry to NA across all columns
bank <- bank %>% 
  mutate(across(.cols = everything(),
                .fns = ~replace(., . == "unknown", NA)))

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##     13020         0         0         0         0         0         0     36959 
##         y 
##         0

The “contact” and “poutcome” variables have a significant number of missing values. In fact, the majority of values in “poutcome” are unknown. Therefore, I will remove these variables from the analysis. I am also concerned about the “day” and “month” features because there is no information indicating whether the data was collected in the same year or not. If the data was collected in different years, these features would be meaningless since the exact dates are not provided. Hence, I will also exclude the “day” and “month” variables. Additionally, the “job” and “education” variables have some missing values, but they account for less than 5% of the total data. Therefore, I will only drop the observations that contain missing values in these two features.
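
As a quick sanity check on the under-5% claim, the share of rows affected can be computed directly (my addition, not part of the original flow):

# Share of rows that would be dropped due to NA in job or education
mean(is.na(bank$job) | is.na(bank$education))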

bank <- bank %>% 
  select(-c(contact, poutcome, day, month)) %>% 
  drop_na()

We have cleaned the missing values; let’s now check whether the data has any duplicated rows. In a dataset like this, I do not think duplicates should exist: given features measuring the balance and the number of days since the last campaign, every observation should plausibly be unique.

sum(duplicated(bank))
## [1] 1

The data has only one duplicated row. I will simply remove it.

bank <- bank[!duplicated(bank), ]

2.3 Near Zero Variance

We will also check whether any features have almost no variance. Such features should be removed, as they are uninformative and will contribute little during model construction. The result below is the final cleaned data.
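
Before dropping anything, it may help to look at the diagnostics caret uses; saveMetrics = TRUE returns the per-column frequency ratio and percentage of unique values (an illustrative check, my addition):

# Inspect near-zero-variance diagnostics for every column
nearZeroVar(bank, saveMetrics = TRUE)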

# Drop near-zero-variance columns (guard against an empty index vector)
nzv <- nearZeroVar(bank)
if (length(nzv) > 0) bank <- bank[, -nzv]
bank

3. Class Balance and Cross Validation

We want the target variable to have a balanced class proportion so that the model classifies all classes well instead of only the majority class. Let’s first check the class proportions.

prop.table(table(bank$y))
## 
##        no       yes 
## 0.8837516 0.1162484
table(bank$y)
## 
##    no   yes 
## 38171  5021

We have a moderate class imbalance; if left as-is, it may yield a model that performs well on only one class. First, let’s split the data into an 80% train set and a 20% test set, stratified on y. The train set will be used for constructing the model, while the test set will be used to measure out-of-sample performance.

set.seed(100)
bank_split <- initial_split(bank, prop = 0.8, strata = "y")
train <- training(bank_split)
test <- testing(bank_split)

Rechecking the class proportion in the train set.

prop.table(table(train$y))
## 
##        no       yes 
## 0.8837694 0.1162306
table(train$y)
## 
##    no   yes 
## 30536  4016

Before proceeding to the modeling part, we need to handle the class imbalance. There are a few techniques we can use: upsampling, downsampling, SMOTE, etc. Since the data has plenty of observations, I will use downsampling. The method removes observations from the majority class, leaving 8,032 observations (4,016 per class) for training the model; an upsampling alternative is sketched after the class table below.

# Downsample the majority class; select(-y) excludes the target from the predictors
train_down <- downSample(x = train %>% select(-y), 
                         y = train$y,
                         yname = "y")

table(train_down$y)
## 
##   no  yes 
## 4016 4016
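
For comparison, the opposite approach would be upsampling the minority class; a minimal sketch (not used in the rest of this analysis):

# Alternative (illustrative): replicate minority-class rows instead of dropping majority rows
train_up <- upSample(x = train %>% select(-y),
                     y = train$y,
                     yname = "y")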

We are done with pre-processing the data. Now we can use it to construct and evaluate classifiers.

4. Naive Bayes Classifier

Naive Bayes is a simple technique for constructing classifiers by applying Bayes’ theorem to the data’s features. It relies on the strong assumption that all features are conditionally independent given the class (and implicitly treats them as equally important). In real-world data this assumption is usually violated, yet the Naive Bayes classifier can still produce excellent results. I will fit a Naive Bayes classifier with Laplace smoothing, as a precaution against predictor levels that never occur with a class in the training data.

model_nb <- naiveBayes(y ~ ., data = train_down, laplace = 1)
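
Under the hood, the fitted object simply stores the class priors and the per-class conditional tables; they can be inspected directly (an illustrative peek, my addition):

# Class counts behind the prior, and P(housing | y) with Laplace smoothing applied
model_nb$apriori
model_nb$tables$housing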

conf_nb <- confusionMatrix(data = predict(model_nb, newdata = test),
                           reference = test$y,
                           positive = "yes")
conf_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6354  328
##        yes 1281  677
##                                           
##                Accuracy : 0.8138          
##                  95% CI : (0.8054, 0.8219)
##     No Information Rate : 0.8837          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3583          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.67363         
##             Specificity : 0.83222         
##          Pos Pred Value : 0.34576         
##          Neg Pred Value : 0.95091         
##              Prevalence : 0.11632         
##          Detection Rate : 0.07836         
##    Detection Prevalence : 0.22662         
##       Balanced Accuracy : 0.75293         
##                                           
##        'Positive' Class : yes             
## 

We can see that the Naive Bayes classifier has 81.38% accuracy. For this data, the false negative and false positive cases would be:

  • False negative: clients who will subscribe to a term deposit are predicted as not willing to subscribe.
  • False positive: clients who will not subscribe to a term deposit are predicted as willing to subscribe.

Deciding which case is more costly to the bank is beyond my expertise. Feel free to decide which case is more severe, and thus which confusion-matrix metric matters most. For now, I will use the ROC curve and AUC to see how well the model distinguishes the two classes.

nb_rocr <- prediction(predictions = predict(model_nb, newdata = test, type = "raw")[, 2],
                      labels = test$y,
                      label.ordering = c("no", "yes"))

plot(performance(nb_rocr,
                 measure = "tpr",
                 x.measure = "fpr"))

# AUC
performance(nb_rocr, "auc")@y.values
## [[1]]
## [1] 0.8194124

The limitation of the Naive Bayes classifier is that the only thing left to tune is the classification cutoff. Changing it can raise one confusion-matrix metric at the cost of lowering another, while the AUC (which measures how well the model separates the classes) stays the same. Let’s see the accuracy, recall, specificity, and precision for different cutoffs.

pred_prob_nb <- predict(model_nb, newdata = test, type = "raw")

# Compute accuracy, recall, specificity, and precision for a given cutoff.
# Setting the factor levels explicitly keeps confusionMatrix() working even
# when an extreme cutoff produces predictions of only one class.
metrics <- function(cutoff){
  prediction <- factor(ifelse(pred_prob_nb[, 2] > cutoff, "yes", "no"),
                       levels = levels(test$y))
  conf <- confusionMatrix(prediction, 
                          reference = test$y,
                          positive = "yes")
  res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
  return(res)
}
cutoffs <- seq(0.01, 0.99, length = 99)
result <- matrix(nrow = 99, ncol = 4)

for(i in 1:99){
  result[i, ] <- metrics(cutoffs[i])
}

result <- as.data.frame(result) %>% 
  rename(Accuracy = V1,
         Recall = V2,
         Specificity = V3,
         Precision = V4) %>% 
  mutate(Cutoff = cutoffs)
result %>% 
  gather(key = "metrics", value = "value", -Cutoff) %>% 
  ggplot(mapping = aes(x = Cutoff,
                       y = value,
                       col = metrics)) +
  geom_line(lwd = 1) +
  labs(title = "Metrics for Different Cutoffs",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5))

Once again, feel free to decide on the optimal cutoff. Personally, I don’t believe focusing on precision is worthwhile, as peak precision stays below 50%. I would therefore suggest focusing on the other three metrics or considering another classifier.
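
If one wanted an automatic choice, here is a minimal sketch using the result table above (my addition; BalancedAccuracy is a helper column, not a caret output):

# Cutoff maximizing the mean of recall and specificity
result %>% 
  mutate(BalancedAccuracy = (Recall + Specificity) / 2) %>% 
  slice_max(BalancedAccuracy, n = 1)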

5. Decision Tree Classifier

The decision tree is probably one of the most popular machine learning methods for classification tasks. Unlike Naive Bayes, a decision tree produces its output as a set of rules, making it easy for humans to understand and apply in various scenarios. Let’s start by building a conditional inference tree (ctree from partykit) with the default options.

model_dt <- ctree(y ~ ., data = train_down)
model_dt
## 
## Model formula:
## y ~ age + job + marital + education + balance + housing + loan + 
##     duration + campaign + previous
## 
## Fitted party:
## [1] root
## |   [2] duration <= 251
## |   |   [3] duration <= 125
## |   |   |   [4] duration <= 77
## |   |   |   |   [5] education in primary, secondary: no (n = 537, err = 1.9%)
## |   |   |   |   [6] education in tertiary: no (n = 243, err = 6.2%)
## |   |   |   [7] duration > 77
## |   |   |   |   [8] housing in no
## |   |   |   |   |   [9] campaign <= 1: no (n = 183, err = 39.3%)
## |   |   |   |   |   [10] campaign > 1
## |   |   |   |   |   |   [11] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unemployed
## |   |   |   |   |   |   |   [12] previous <= 0: no (n = 256, err = 11.7%)
## |   |   |   |   |   |   |   [13] previous > 0: no (n = 33, err = 45.5%)
## |   |   |   |   |   |   [14] job in student: yes (n = 10, err = 20.0%)
## |   |   |   |   [15] housing in yes
## |   |   |   |   |   [16] education in primary, secondary
## |   |   |   |   |   |   [17] previous <= 1: no (n = 254, err = 2.8%)
## |   |   |   |   |   |   [18] previous > 1: no (n = 54, err = 18.5%)
## |   |   |   |   |   [19] education in tertiary: no (n = 117, err = 17.9%)
## |   |   [20] duration > 125
## |   |   |   [21] housing in no
## |   |   |   |   [22] previous <= 0
## |   |   |   |   |   [23] loan in no
## |   |   |   |   |   |   [24] campaign <= 1
## |   |   |   |   |   |   |   [25] job in admin., management, retired, services, student: yes (n = 202, err = 33.7%)
## |   |   |   |   |   |   |   [26] job in blue-collar, entrepreneur, housemaid, self-employed, technician, unemployed: no (n = 121, err = 38.8%)
## |   |   |   |   |   |   [27] campaign > 1
## |   |   |   |   |   |   |   [28] duration <= 205
## |   |   |   |   |   |   |   |   [29] campaign <= 5
## |   |   |   |   |   |   |   |   |   [30] balance <= 154: no (n = 69, err = 18.8%)
## |   |   |   |   |   |   |   |   |   [31] balance > 154: no (n = 163, err = 41.1%)
## |   |   |   |   |   |   |   |   [32] campaign > 5: no (n = 46, err = 6.5%)
## |   |   |   |   |   |   |   [33] duration > 205: no (n = 129, err = 48.8%)
## |   |   |   |   |   [34] loan in yes
## |   |   |   |   |   |   [35] marital in divorced, married: no (n = 90, err = 6.7%)
## |   |   |   |   |   |   [36] marital in single: no (n = 30, err = 30.0%)
## |   |   |   |   [37] previous > 0
## |   |   |   |   |   [38] duration <= 166: yes (n = 123, err = 31.7%)
## |   |   |   |   |   [39] duration > 166: yes (n = 257, err = 9.3%)
## |   |   |   [40] housing in yes
## |   |   |   |   [41] previous <= 0
## |   |   |   |   |   [42] job in admin., housemaid, management, retired
## |   |   |   |   |   |   [43] balance <= 2679: no (n = 231, err = 16.5%)
## |   |   |   |   |   |   [44] balance > 2679
## |   |   |   |   |   |   |   [45] duration <= 167: no (n = 20, err = 20.0%)
## |   |   |   |   |   |   |   [46] duration > 167: yes (n = 15, err = 20.0%)
## |   |   |   |   |   [47] job in blue-collar, entrepreneur, self-employed, services, student, technician, unemployed: no (n = 472, err = 6.8%)
## |   |   |   |   [48] previous > 0
## |   |   |   |   |   [49] education in primary, secondary: no (n = 196, err = 34.7%)
## |   |   |   |   |   [50] education in tertiary
## |   |   |   |   |   |   [51] duration <= 169: no (n = 33, err = 39.4%)
## |   |   |   |   |   |   [52] duration > 169: yes (n = 53, err = 24.5%)
## |   [53] duration > 251
## |   |   [54] duration <= 490
## |   |   |   [55] housing in no
## |   |   |   |   [56] loan in no
## |   |   |   |   |   [57] previous <= 0
## |   |   |   |   |   |   [58] duration <= 441
## |   |   |   |   |   |   |   [59] marital in divorced, married
## |   |   |   |   |   |   |   |   [60] age <= 59
## |   |   |   |   |   |   |   |   |   [61] education in primary: no (n = 29, err = 24.1%)
## |   |   |   |   |   |   |   |   |   [62] education in secondary, tertiary: yes (n = 279, err = 46.2%)
## |   |   |   |   |   |   |   |   [63] age > 59: yes (n = 82, err = 9.8%)
## |   |   |   |   |   |   |   [64] marital in single
## |   |   |   |   |   |   |   |   [65] age <= 30: yes (n = 98, err = 13.3%)
## |   |   |   |   |   |   |   |   [66] age > 30: yes (n = 88, err = 40.9%)
## |   |   |   |   |   |   [67] duration > 441: yes (n = 96, err = 15.6%)
## |   |   |   |   |   [68] previous > 0: yes (n = 415, err = 10.6%)
## |   |   |   |   [69] loan in yes
## |   |   |   |   |   [70] previous <= 0: no (n = 77, err = 22.1%)
## |   |   |   |   |   [71] previous > 0: yes (n = 9, err = 11.1%)
## |   |   |   [72] housing in yes
## |   |   |   |   [73] previous <= 0
## |   |   |   |   |   [74] duration <= 376
## |   |   |   |   |   |   [75] education in primary, secondary: no (n = 246, err = 15.4%)
## |   |   |   |   |   |   [76] education in tertiary: no (n = 99, err = 33.3%)
## |   |   |   |   |   [77] duration > 376: no (n = 220, err = 41.4%)
## |   |   |   |   [78] previous > 0
## |   |   |   |   |   [79] loan in no: yes (n = 224, err = 30.8%)
## |   |   |   |   |   [80] loan in yes: no (n = 34, err = 44.1%)
## |   |   [81] duration > 490
## |   |   |   [82] duration <= 706
## |   |   |   |   [83] housing in no
## |   |   |   |   |   [84] loan in no: yes (n = 411, err = 14.4%)
## |   |   |   |   |   [85] loan in yes: yes (n = 43, err = 34.9%)
## |   |   |   |   [86] housing in yes
## |   |   |   |   |   [87] marital in divorced, married: yes (n = 302, err = 33.1%)
## |   |   |   |   |   [88] marital in single: yes (n = 145, err = 17.2%)
## |   |   |   [89] duration > 706: yes (n = 1198, err = 9.6%)
## 
## Number of inner nodes:    44
## Number of terminal nodes: 45
plot(model_dt, type = "simple")

The decision tree model looks quite complex, which is understandable given the number of features in the data. Let’s see the accuracy on both the training and test sets.

# Confusion matrix for training set
confusionMatrix(data = predict(model_dt, newdata = train_down, type = "response"),
                reference = train_down$y,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3238  744
##        yes  778 3272
##                                          
##                Accuracy : 0.8105         
##                  95% CI : (0.8018, 0.819)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.621          
##                                          
##  Mcnemar's Test P-Value : 0.3976         
##                                          
##             Sensitivity : 0.8147         
##             Specificity : 0.8063         
##          Pos Pred Value : 0.8079         
##          Neg Pred Value : 0.8132         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4074         
##    Detection Prevalence : 0.5042         
##       Balanced Accuracy : 0.8105         
##                                          
##        'Positive' Class : yes            
## 
# Confusion matrix for test set
conf_dt <- confusionMatrix(data = predict(model_dt, newdata = test, type = "response"),
                           reference = test$y,
                           positive = "yes")
conf_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6033  204
##        yes 1602  801
##                                           
##                Accuracy : 0.791           
##                  95% CI : (0.7822, 0.7995)
##     No Information Rate : 0.8837          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3661          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.79701         
##             Specificity : 0.79018         
##          Pos Pred Value : 0.33333         
##          Neg Pred Value : 0.96729         
##              Prevalence : 0.11632         
##          Detection Rate : 0.09271         
##    Detection Prevalence : 0.27813         
##       Balanced Accuracy : 0.79360         
##                                           
##        'Positive' Class : yes             
## 
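
Before weighing the two models against each other, it can help to put their test-set metrics side by side; a small convenience sketch using the confusionMatrix objects stored above (my addition):

# Recall, specificity, and precision: Naive Bayes vs. decision tree (test set)
rbind(naive_bayes   = conf_nb$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")],
      decision_tree = conf_dt$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")])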

Upon examining the differences between the training and test sets, I don’t believe the model is overfitting. Comparing the confusion matrices of the decision tree and Naive Bayes, the decision tree, despite slightly lower accuracy and specificity, may still be preferable because of its considerably higher recall. Even if we adjusted the cutoff of Naive Bayes to achieve similar accuracy and specificity, its sensitivity would still be well below the decision tree’s. Let’s proceed to analyze the ROC curve and AUC of the decision tree model.

dt_rocr <- prediction(predictions = predict(model_dt, newdata = test, type = "prob")[, 2],
                      labels = test$y,
                      label.ordering = c("no", "yes"))

plot(performance(dt_rocr,
                 measure = "tpr",
                 x.measure = "fpr"))

# AUC
performance(dt_rocr, "auc")@y.values
## [[1]]
## [1] 0.8629822

As we can observe, the decision tree model has an AUC of approximately 0.863, which means it outperforms Naive Bayes at distinguishing the two classes. The tree can be fine-tuned through parameters such as mincriterion, minsplit, and minbucket in its control argument, but finding the best combination would require numerous iterations, so I will skip a full search; the model shows no signs of overfitting and already has a reasonably good AUC.
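
For reference, a tuning run could look like the minimal sketch below; the ctree_control values are illustrative placeholders, not tuned settings.

# Illustrative only: a stricter, shallower tree via ctree_control
model_dt_tuned <- ctree(y ~ ., data = train_down,
                        control = ctree_control(mincriterion = 0.99, # stricter split criterion
                                                minsplit = 100,      # min obs to attempt a split
                                                minbucket = 50))     # min obs per terminal node

With that noted, let’s move on to the last classifier, the random forest.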

6. Random Forest Classifier

Random forest is an ensemble learning method known for its versatility and performance. A random forest consists of many decision trees, each grown on a different bootstrap sample of the observations (and a random subset of features at each split), so the trees have different characteristics. The forest’s prediction aggregates the predictions of these trees; for classification, this is a majority vote. First, let’s set the training control to use 5-fold cross-validation repeated 3 times.

# Training takes a while; the model was built with this code
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_rf <- train(y ~ ., data = train_down, method = "rf", trControl = ctrl)

# Alternatively, read a previously saved model from an RDS file:
# model_rf <- readRDS("model/fb_forest.RDS")
# model_rf

Checking the final random forest model.

model_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 19.4%
## Confusion matrix:
##       no  yes class.error
## no  3162  854   0.2126494
## yes  704 3312   0.1752988
plot(model_rf)

The random forest has an out-of-bag (OOB) error rate of 19.4%. Let’s look at the confusion matrices for both the training and test sets.

confusionMatrix(data = predict(model_rf, newdata = train_down),
                reference = train_down$y,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  4016    0
##        yes    0 4016
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9995, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : yes        
## 
conf_rf <- confusionMatrix(data = predict(model_rf, newdata = test),
                           reference = test$y,
                           positive = "yes")
conf_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5999  175
##        yes 1636  830
##                                           
##                Accuracy : 0.7904          
##                  95% CI : (0.7817, 0.7989)
##     No Information Rate : 0.8837          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3749          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.82587         
##             Specificity : 0.78572         
##          Pos Pred Value : 0.33658         
##          Neg Pred Value : 0.97166         
##              Prevalence : 0.11632         
##          Detection Rate : 0.09606         
##    Detection Prevalence : 0.28542         
##       Balanced Accuracy : 0.80580         
##                                           
##        'Positive' Class : yes             
## 

It turns out that the random forest classifier is overfitting: the metrics on the training set are perfect, but on the test set the accuracy is only about 79%, with decent recall and specificity but poor precision. (A perfect fit on the training data is common for random forests; the OOB error of 19.4% is a better guide and is close to the test error.) Let’s see the ROC and AUC.

rf_rocr <- prediction(predictions = predict(model_rf, newdata = test, type = "prob")$yes,
                      labels = test$y,
                      label.ordering = c("no", "yes"))

plot(performance(rf_rocr, 
                 measure = "tpr",
                 x.measure = "fpr"))

# AUC
performance(rf_rocr, "auc")@y.values
## [[1]]
## [1] 0.8737758

The random forest’s ability to distinguish the classes is slightly higher than the decision tree’s, with an AUC of 0.874.
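
Collecting the three AUCs in one place makes the comparison explicit (a small convenience sketch, my addition):

# Side-by-side AUC comparison of the three classifiers
data.frame(model = c("Naive Bayes", "Decision Tree", "Random Forest"),
           AUC = c(performance(nb_rocr, "auc")@y.values[[1]],
                   performance(dt_rocr, "auc")@y.values[[1]],
                   performance(rf_rocr, "auc")@y.values[[1]]))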

varImp(model_rf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 23)
## 
##                     Overall
## duration           100.0000
## balance             32.9411
## age                 27.6447
## previous            16.6702
## housingyes          11.2165
## campaign            10.2486
## loanyes              3.9804
## educationtertiary    3.2654
## maritalmarried       3.0600
## jobblue-collar       2.8713
## maritalsingle        2.7803
## jobmanagement        2.6442
## educationsecondary   2.6436
## jobtechnician        2.6404
## jobservices          1.6528
## jobunemployed        1.1723
## jobself-employed     1.0584
## jobentrepreneur      1.0052
## jobstudent           0.9884
## jobretired           0.9608

Using variable importance, we can also see that the most influential variable is duration (last contact duration, in seconds), followed by balance and age. This seems reasonable: a longer call may imply that the client is interested in the offer. Note, however, that the call duration is only known after the call has taken place, so it would not be available when deciding whom to contact.
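
The same ranking can be visualised; caret provides a plot method for varImp objects (illustrative):

# Plot the ten most important variables
plot(varImp(model_rf), top = 10)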

7. Conclusion

We can conclude that the Naive Bayes classifier, despite having the highest accuracy, is the worst at distinguishing the classes (although the difference is not dramatic). The random forest performs similarly to the decision tree and separates subscribers from non-subscribers slightly better, but considering how interpretable and adaptable the decision tree is, the decision tree is the better choice of the two. Looking at the variable importance of the random forest model, the most significant feature for predicting whether a customer subscribes to a term deposit is the contact duration; a longer contact duration may imply that the client is interested in the bank’s offer.