Introduction

Background

The data comes from the UCI Machine Learning Repository (https://archive-beta.ics.uci.edu/ml/datasets/bank+marketing) and describes a direct marketing campaign (phone calls) of a Portuguese bank. The bank's customer service center calls customers to promote its term deposit product and records each customer's basic information together with the outcome of the call: whether the customer subscribed to a term deposit (yes or no). The classification goal is to predict that outcome.

Library

library(readr) 
library(dplyr) 
library(e1071)
library(partykit)
library(caret)
library(rpart.plot)
library(randomForest)

Data Source

Always take a look at the data before doing any analysis. The dataset contains 4,521 rows and 17 columns, of which 7 are numeric and 10 are categorical.

bank <- read.csv("bank+marketing 2/bank/bank.csv", sep = ";", stringsAsFactors = TRUE)
rmarkdown::paged_table(bank)
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job       <fct> unemployed, services, management, management, blue-collar, m…
## $ marital   <fct> married, married, single, married, married, single, married,…
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertiary,…
## $ default   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing   <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, no…
## $ loan      <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no, y…
## $ contact   <fct> cellular, cellular, cellular, unknown, unknown, cellular, ce…
## $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month     <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, apr, …
## $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome  <fct> unknown, failure, failure, unknown, unknown, failure, other,…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no,…

Data Preparation

Missing Value

anyNA(bank)
## [1] FALSE
colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

The dataset contains no NA values, so no imputation or row removal is needed. Note, however, that several categorical columns (for example contact and poutcome) use the level “unknown”, which anyNA() does not count as missing.
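The glimpse() output shows “unknown” as a level in columns such as contact and poutcome. A minimal sketch of how such entries could be counted per factor column is shown here; the data frame `toy` and its values are made up for illustration and stand in for `bank`.

```r
# Toy data frame standing in for `bank` (hypothetical values).
toy <- data.frame(
  job     = factor(c("services", "unknown", "management", "unknown")),
  contact = factor(c("cellular", "unknown", "cellular", "cellular")),
  balance = c(1787L, 4789L, 0L, -88L)
)

# Identify the factor columns; numeric columns are skipped.
is_fct <- vapply(toy, is.factor, logical(1))

# Count the entries equal to "unknown" in each factor column.
unknown_counts <- vapply(toy[is_fct],
                         function(col) sum(col == "unknown"),
                         integer(1))
unknown_counts
```

Applied to the real data, the same two calls would show how much information is hidden behind the “unknown” level before deciding whether to keep it as its own category.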

Variance

dim(bank)
## [1] 4521   17
# nearZeroVar() returns the positions of the near-zero-variance columns
zero_var <- nearZeroVar(bank)
# all_of() avoids the tidyselect deprecation warning for external vectors
bank_clean <- bank %>% select(-all_of(zero_var))
dim(bank_clean)
## [1] 4521   15

Before applying the near-zero variance filter, the dataset had 4,521 rows and 17 columns. nearZeroVar() flags columns that are constant or almost constant, and removing the two flagged columns leaves 4,521 rows and 15 columns. This simplifies the dataset by dropping features that contribute very little variability.
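As a rough illustration of what the filter looks at: with its default settings, as far as the caret documentation describes them, nearZeroVar() flags a column when the ratio of the most common value to the second most common value exceeds 95/5 and the percentage of distinct values is below 10. The helper `nzv_metrics` below is a hypothetical base R re-implementation of those two statistics, and the toy column is made up.

```r
# Hypothetical helper: the two statistics behind the near-zero-variance rule.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the second most frequent value.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of rows.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Toy column: 98 "no" and 2 "yes" (made-up values).
x <- factor(c(rep("no", 98), rep("yes", 2)))
m <- nzv_metrics(x)

# Assumed default thresholds: freqCut = 95/5 and uniqueCut = 10.
flagged <- m[["freq_ratio"]] > 95 / 5 && m[["pct_unique"]] < 10
flagged
```

On the real data, nearZeroVar(bank, saveMetrics = TRUE) reports these statistics for every column, which makes it easy to see which two columns were dropped and why.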

Cross Validation

The models are validated on a single hold-out split: the dataset is partitioned into a training set and a testing set. The training set comprises 75% of the rows (3,390), while the testing set holds the remaining 25% (1,131).

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

# splitting train test

bank_df <- sample(nrow(bank_clean), nrow(bank_clean)*0.75)
bank_train <- bank_clean[bank_df,]
bank_test <- bank_clean[-bank_df,]

label_train <- bank_clean[bank_df,15]
label_test <- bank_clean[-bank_df,15]
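Because sample() does not stratify by class, it can be worth checking that both partitions contain a similar share of “yes” cases. The sketch below does this on a made-up data frame `toy`; the column names and class counts are assumptions for illustration only.

```r
set.seed(100)

# Toy imbalanced data standing in for `bank_clean` (hypothetical values).
toy <- data.frame(
  x = rnorm(200),
  y = factor(rep(c("no", "yes"), times = c(176, 24)))
)

# Same 75/25 split pattern as above.
idx   <- sample(nrow(toy), floor(nrow(toy) * 0.75))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Class shares in each partition; large differences suggest a stratified split.
round(prop.table(table(train$y)), 3)
round(prop.table(table(test$y)), 3)
```

If the shares differ noticeably, caret's createDataPartition() is one option for drawing a split that preserves the class proportions.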

Model Building

Naive Bayes

With the training data in hand, we use the naiveBayes function from the e1071 package to fit the model. The algorithm assumes that the features are conditionally independent given the class, which lets it estimate each feature's distribution separately.

model_naive <- naiveBayes(y ~ ., data = bank_train)  
preds_naive <- predict(model_naive, newdata = bank_test)
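The independence assumption means the posterior is proportional to the class prior multiplied by one conditional probability per feature. The following sketch works this through by hand on a tiny made-up data frame; it illustrates the calculation only and ignores details such as Laplace smoothing and numeric features.

```r
# Toy training data (hypothetical values).
toy <- data.frame(
  housing = c("yes", "yes", "no", "no", "no", "yes", "no", "yes"),
  loan    = c("no", "yes", "no", "no", "yes", "no", "no", "no"),
  y       = c("no", "no", "no", "yes", "yes", "yes", "yes", "no")
)

# Class priors P(y).
prior <- table(toy$y) / nrow(toy)

# Likelihood of the new case (housing = "no", loan = "no") under each class,
# multiplying the per-feature conditional probabilities.
lik <- sapply(names(prior), function(cls) {
  rows <- toy[toy$y == cls, ]
  mean(rows$housing == "no") * mean(rows$loan == "no")
})

# Posterior: prior times likelihood, normalised to sum to 1.
unnorm    <- as.numeric(prior) * lik
posterior <- unnorm / sum(unnorm)
posterior
```

For this toy case the posterior favours “yes”, driven entirely by the housing column, since the loan column has the same conditional probability in both classes.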

Decision Tree

The decision tree is fitted with the rpart() function from the rpart package, using the same formula interface as before. Once the tree is constructed, we can print and visualize it to understand the branching logic that leads to predictions. We can then apply the model to the testing data to assess its predictive performance.

model_dt <- rpart(y ~ ., data = bank_train)
model_dt
## n= 3390 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 3390 391 no (0.88466077 0.11533923)  
##    2) duration< 473.5 2929 206 no (0.92966883 0.07033117)  
##      4) poutcome=failure,other,unknown 2851 157 no (0.94493160 0.05506840)  
##        8) month=apr,aug,feb,jan,jul,jun,may,nov 2739 120 no (0.95618839 0.04381161) *
##        9) month=dec,mar,oct,sep 112  37 no (0.66964286 0.33035714)  
##         18) duration< 185.5 66  10 no (0.84848485 0.15151515) *
##         19) duration>=185.5 46  19 yes (0.41304348 0.58695652)  
##           38) day< 16.5 27  10 no (0.62962963 0.37037037)  
##             76) job=management,retired,self-employed,student,technician 19   4 no (0.78947368 0.21052632) *
##             77) job=admin.,blue-collar,entrepreneur,unemployed,unknown 8   2 yes (0.25000000 0.75000000) *
##           39) day>=16.5 19   2 yes (0.10526316 0.89473684) *
##      5) poutcome=success 78  29 yes (0.37179487 0.62820513)  
##       10) duration< 180.5 25   8 no (0.68000000 0.32000000)  
##         20) education=primary,secondary,unknown 17   1 no (0.94117647 0.05882353) *
##         21) education=tertiary 8   1 yes (0.12500000 0.87500000) *
##       11) duration>=180.5 53  12 yes (0.22641509 0.77358491) *
##    3) duration>=473.5 461 185 no (0.59869848 0.40130152)  
##      6) duration< 766 297  96 no (0.67676768 0.32323232)  
##       12) month=apr,dec,jan,jul,jun,may,nov 237  63 no (0.73417722 0.26582278)  
##         24) poutcome=failure,other,unknown 223  53 no (0.76233184 0.23766816) *
##         25) poutcome=success 14   4 yes (0.28571429 0.71428571) *
##       13) month=aug,feb,oct,sep 60  27 yes (0.45000000 0.55000000)  
##         26) education=secondary,unknown 24   8 no (0.66666667 0.33333333) *
##         27) education=primary,tertiary 36  11 yes (0.30555556 0.69444444) *
##      7) duration>=766 164  75 yes (0.45731707 0.54268293)  
##       14) month=apr,feb,jun,may,nov 100  44 no (0.56000000 0.44000000)  
##         28) duration>=815.5 87  33 no (0.62068966 0.37931034)  
##           56) duration< 1539.5 78  25 no (0.67948718 0.32051282) *
##           57) duration>=1539.5 9   1 yes (0.11111111 0.88888889) *
##         29) duration< 815.5 13   2 yes (0.15384615 0.84615385) *
##       15) month=aug,dec,jan,jul,oct 64  19 yes (0.29687500 0.70312500) *

After applying a decision tree model, it is often imperative to visualize the resulting tree structure to gain insights into the decision-making process. Visualizing the decision tree provides a clear and interpretable representation of how the model classifies and segments the data based on various feature attributes. This visual representation can be crucial for understanding the hierarchy of variables and the conditions that lead to specific predictions.

rpart.plot(model_dt)

pred_dt <- predict(object = model_dt, newdata = bank_test, type = "class")
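To see why the root split on duration is useful, the Gini impurity before and after the split can be recomputed from the node counts printed above (root: 3390 cases with 391 “yes”; left child: 2929 with 206; right child: 461 with 185). This is a sketch assuming the Gini index as the splitting criterion, which is the documented rpart default for classification; the helper `gini` is introduced here for illustration.

```r
# Gini impurity for a two-class node, given the share of "yes" cases.
gini <- function(p_yes) 2 * p_yes * (1 - p_yes)

# Node counts taken from the printed tree.
n_root  <- 3390; yes_root  <- 391
n_left  <- 2929; yes_left  <- 206   # duration <  473.5
n_right <- 461;  yes_right <- 185   # duration >= 473.5

g_root <- gini(yes_root / n_root)

# Impurity of the children, weighted by their share of the cases.
g_children <- (n_left / n_root) * gini(yes_left / n_left) +
  (n_right / n_root) * gini(yes_right / n_right)

# Impurity decrease achieved by the split.
gain <- g_root - g_children
gain
```

A positive value means the two child nodes are, on average, purer than the root.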

Random Forest

The Random Forest model is fitted through caret's train() function with method = "rf", which wraps the randomForest package. The tuning parameter mtry is selected by 5-fold cross-validation repeated 3 times on the training set.

ctrl <- trainControl(method = "repeatedcv",
                    number = 5, # number of folds (k)
                    repeats = 3) # number of repeats
 
bank_rf <- train(y ~ .,
                 data = bank_train,
                 method = "rf",
                 trControl = ctrl)
prediksi_rf <- predict(object = bank_rf,
                       newdata = bank_test)
bank_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 21
## 
##         OOB estimate of  error rate: 10.38%
## Confusion matrix:
##       no yes class.error
## no  2888 111  0.03701234
## yes  241 150  0.61636829

The final model is a forest of 500 decision trees; averaging over many trees mitigates the overfitting that a single tree is prone to. At each split the model tries 21 randomly selected candidate variables (the mtry parameter), a value chosen by train() during resampling rather than an average. This is larger than the 14 predictors in bank_train because the formula interface of train() expands each factor into dummy columns.
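The number of variables tried at each split can exceed the number of predictor columns because the formula interface of train() works on a dummy-encoded design matrix. The sketch below shows the effect with model.matrix() on a made-up data frame; the exact encoding caret applies internally is assumed to be equivalent.

```r
# Toy data: one numeric predictor and one three-level factor (made-up values).
toy <- data.frame(
  age     = c(30, 33, 35, 59),
  marital = factor(c("married", "single", "divorced", "married")),
  y       = factor(c("no", "no", "yes", "no"))
)

# Expand the predictors into a numeric design matrix and drop the intercept.
x <- model.matrix(y ~ ., data = toy)[, -1]

ncol(x)      # more columns than the 2 original predictors
colnames(x)  # one dummy column per non-reference factor level
```

With factors such as job and month, which have many levels, the design matrix for the bank data is considerably wider than the original table.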

The OOB (out-of-bag) estimate of the error rate, 10.38%, offers insight into the model’s predictive accuracy on unseen data. Each tree is evaluated on the training rows left out of its bootstrap sample, so the estimate indicates how well the model generalizes without needing a separate validation set.

The confusion matrix breaks down the out-of-bag classification performance for the two classes, no and yes. The model identifies the “no” class accurately, with a class error of about 3.70%. The “yes” class is classified far less reliably: 241 of the 391 actual subscribers are predicted as “no”, a class error of about 61.64%.
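The OOB error rate and the per-class errors follow directly from the four counts in the printed confusion matrix. The arithmetic can be reproduced as below; the object names are chosen for illustration.

```r
# OOB confusion matrix from the output above: rows are actual classes,
# columns are predicted classes.
oob <- matrix(c(2888, 241, 111, 150), nrow = 2,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

# Overall OOB error: share of off-diagonal cases.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: misclassified share within each actual class.
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)
round(100 * class_error, 2)
```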

varImp(bank_rf) %>% plot()

The variable importance plot above shows three variables standing out as highly important: duration, balance, and age. Their implications in the context of bank marketing are as follows.

  • Duration is the strongest predictor, which agrees with the decision tree, where it is the first split. Longer conversations may indicate higher engagement and interest in the product being promoted.

  • Balance is another influential factor, suggesting that a customer’s account balance plays a role in campaign outcomes. It is reasonable to assume that customers with healthier balances are more likely to respond positively to an offer such as a term deposit.

  • Age is the third important variable. Different age groups may have different financial needs and preferences, so tailoring the campaign to age segments may improve its effectiveness.

Evaluation

After applying three different classification algorithms and evaluating their performance, we have obtained valuable insights into the models’ abilities to predict and classify data effectively. This evaluation process plays a critical role in assessing the models’ strengths and weaknesses, as well as their suitability for specific tasks. By comparing metrics such as accuracy, precision, recall, and the confusion matrix, we can gauge the models’ overall accuracy and their ability to correctly classify observations.

Naive Bayes

confusionMatrix(data = preds_naive,
                reference = label_test,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  926  70
##        yes  75  60
##                                           
##                Accuracy : 0.8718          
##                  95% CI : (0.8509, 0.8907)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 0.9241          
##                                           
##                   Kappa : 0.3803          
##                                           
##  Mcnemar's Test P-Value : 0.7398          
##                                           
##             Sensitivity : 0.46154         
##             Specificity : 0.92507         
##          Pos Pred Value : 0.44444         
##          Neg Pred Value : 0.92972         
##              Prevalence : 0.11494         
##          Detection Rate : 0.05305         
##    Detection Prevalence : 0.11936         
##       Balanced Accuracy : 0.69331         
##                                           
##        'Positive' Class : yes             
## 
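The “No Information Rate” in the output is the accuracy obtained by always predicting the majority class. A short check with the counts from the Naive Bayes confusion matrix above shows how the model compares with that baseline.

```r
# Reference class totals from the confusion matrix (column sums).
n_no  <- 926 + 75
n_yes <- 70 + 60
n     <- n_no + n_yes

# No-information rate: share of the majority class.
nir <- max(n_no, n_yes) / n

# Accuracy: correctly predicted cases on the diagonal.
accuracy <- (926 + 60) / n

round(c(nir = nir, accuracy = accuracy), 4)
accuracy > nir  # does the model beat the majority-class baseline?
```

This is why accuracy alone is a weak measure on an imbalanced dataset, and why recall and precision for the “yes” class are reported as well.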

Decision Tree

confusionMatrix(data = pred_dt,
                reference = label_test,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  956  93
##        yes  45  37
##                                           
##                Accuracy : 0.878           
##                  95% CI : (0.8575, 0.8965)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 0.7873          
##                                           
##                   Kappa : 0.2855          
##                                           
##  Mcnemar's Test P-Value : 6.31e-05        
##                                           
##             Sensitivity : 0.28462         
##             Specificity : 0.95504         
##          Pos Pred Value : 0.45122         
##          Neg Pred Value : 0.91134         
##              Prevalence : 0.11494         
##          Detection Rate : 0.03271         
##    Detection Prevalence : 0.07250         
##       Balanced Accuracy : 0.61983         
##                                           
##        'Positive' Class : yes             
## 
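The Kappa value in the output measures agreement between predictions and reference beyond what the class frequencies alone would produce. It can be recomputed from the Decision Tree confusion matrix above; this sketch uses the standard Cohen's kappa formula.

```r
# Decision Tree confusion matrix: rows are predictions, columns are reference.
cm <- matrix(c(956, 45, 93, 37), nrow = 2,
             dimnames = list(prediction = c("no", "yes"),
                             reference = c("no", "yes")))
n <- sum(cm)

# Observed agreement (accuracy).
observed <- sum(diag(cm)) / n

# Agreement expected by chance from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa <- (observed - expected) / (1 - expected)
round(kappa, 4)
```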

Random Forest

confusionMatrix(data = prediksi_rf,
                reference = bank_test$y,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  962  84
##        yes  39  46
##                                           
##                Accuracy : 0.8912          
##                  95% CI : (0.8716, 0.9088)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 0.2748          
##                                           
##                   Kappa : 0.3707          
##                                           
##  Mcnemar's Test P-Value : 7.268e-05       
##                                           
##             Sensitivity : 0.35385         
##             Specificity : 0.96104         
##          Pos Pred Value : 0.54118         
##          Neg Pred Value : 0.91969         
##              Prevalence : 0.11494         
##          Detection Rate : 0.04067         
##    Detection Prevalence : 0.07515         
##       Balanced Accuracy : 0.65744         
##                                           
##        'Positive' Class : yes             
## 
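The three test-set confusion matrices above can be reduced to accuracy, recall, and precision for the “yes” class with a small helper. The function `metrics` is introduced here for illustration; the counts are copied from the matrices.

```r
# Accuracy, recall, and precision from the four confusion-matrix cells,
# with "yes" as the positive class.
metrics <- function(tp, fp, fn, tn) {
  c(accuracy  = (tp + tn) / (tp + fp + fn + tn),
    recall    = tp / (tp + fn),
    precision = tp / (tp + fp))
}

results <- rbind(
  naive_bayes   = metrics(tp = 60, fp = 75, fn = 70, tn = 926),
  decision_tree = metrics(tp = 37, fp = 45, fn = 93, tn = 956),
  random_forest = metrics(tp = 46, fp = 39, fn = 84, tn = 962)
)

# Percentages, one row per model.
round(100 * results, 2)
```

Using the same positive class for all three models keeps the recall and precision columns directly comparable.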

Comparison

Metric      Naive.Bayes   Decision.Tree   Random.Forest
Accuracy    87.18%        87.80%          89.12%
Recall      46.15%        28.46%          35.38%
Precision   44.44%        45.12%          54.12%

The three models, Naive Bayes, Decision Tree, and Random Forest, are compared on Accuracy, Recall, and Precision. All three metrics treat “yes” (the customer subscribes) as the positive class and are computed from the test-set confusion matrices above; for example, Random Forest Recall is 46 / (46 + 84) = 35.38%.

  • Accuracy: Random Forest has the highest accuracy at 89.12%, followed by Decision Tree at 87.80% and Naive Bayes at 87.18%. The gaps are small, and all three values are close to the no-information rate of 88.51%, the accuracy obtained by always predicting “no”. Only Random Forest exceeds that baseline, and not significantly (p = 0.2748), so accuracy alone says little about model quality on this imbalanced dataset.

  • Recall: Naive Bayes finds the largest share of actual subscribers at 46.15%, ahead of Random Forest at 35.38% and Decision Tree at 28.46%. High Recall matters when minimizing false negatives is a priority, yet even the best model here misses more than half of the customers who subscribe.

  • Precision: Random Forest has the highest Precision at 54.12%, followed by Decision Tree at 45.12% and Naive Bayes at 44.44%. High Precision matters when minimizing false positives is a priority; with Random Forest, roughly one in two customers predicted to subscribe actually does.

In summary, the choice of classification algorithm should be tailored to the needs of the task at hand. Random Forest offers the best accuracy and Precision, Naive Bayes offers the best Recall, and Decision Tree leads on none of the three metrics but produces a model that is easy to inspect. The selection should take into account the objectives of the campaign and the trade-off between missing potential subscribers and calling customers who will not subscribe.

Conclusion

The evaluation of three classification algorithms, Naive Bayes, Decision Tree, and Random Forest, on the 1,131-row test set shows distinct trade-offs. Random Forest achieves the highest accuracy (89.12%) and the highest Precision (54.12%), making it the better choice when false positives are costly. Naive Bayes achieves the highest Recall (46.15%), making it preferable when missing potential subscribers is the larger concern. Decision Tree is competitive on accuracy (87.80%) and Precision (45.12%) but has the lowest Recall (28.46%). Because about 88.5% of customers do not subscribe, accuracy is a weak discriminator between the models; the choice of algorithm should be guided by the specific requirements of the classification task, with consideration for the trade-off between Recall and Precision.