Case Information

The data is associated with direct marketing campaigns conducted by a Portuguese banking institution. These campaigns primarily relied on phone calls, often requiring multiple contacts with the same client to determine whether they would subscribe to the bank’s term deposit product (categorized as ‘yes’) or not (categorized as ‘no’).

Importing Libraries

library(dplyr)
library(GGally)
library(ggplot2)
library(ggcorrplot)
library(corrplot)
library(reshape2)
library(gmodels)
library(class)
library(tidyr)
library(treemapify)
library(viridis)
library(caret)
library(readxl)

Data Exploration

Reading Dataset

bank <- read_excel("data_input/bank.xlsx")
head(bank)
glimpse(bank)
#> Rows: 4,521
#> Columns: 17
#> $ age       <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
#> $ job       <chr> "unemployed", "services", "management", "management", "blue-…
#> $ marital   <chr> "married", "married", "single", "married", "married", "singl…
#> $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
#> $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
#> $ balance   <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
#> $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
#> $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
#> $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
#> $ day       <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
#> $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
#> $ duration  <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
#> $ campaign  <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
#> $ pdays     <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
#> $ previous  <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
#> $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
#> $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Variables

Client data:

  • age: client age (numeric)

  • job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)

  • marital: marital status (categorical: divorced, married, single; note: 'divorced' means divorced or widowed)

  • education: education level (categorical: primary, secondary, tertiary, unknown)

  • default: has credit in default? (categorical: no, yes)

  • balance: average yearly balance, in euros (numeric)

  • housing: has housing loan? (categorical: no, yes)

  • loan: has personal loan? (categorical: no, yes)

  • contact: contact communication type (categorical: cellular, telephone, unknown)

  • day: last contact day of the month (numeric)

  • month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)

  • duration: last contact duration, in seconds (numeric).

          Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

  • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

  • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)

  • previous: number of contacts performed before this campaign and for this client (numeric)

  • poutcome: outcome of the previous marketing campaign (categorical: failure, other, success, unknown)

  • y: has the client subscribed to a term deposit? (binary: yes, no)

Checking Missing Values (NA)

We inspect the NA values in each column, so that we can understand the data we have and determine what actions need to be taken.

colSums(is.na(bank))
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0

After the check, no NA values were found in this dataset. This indicates that the data is complete, without any missing values, and can be further processed for analysis and modeling with more confidence.
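The NA check only catches values that R itself treats as missing. The glimpse() output above shows the string "unknown" in several categorical columns, which plays a similar role. A minimal sketch of how one might count such placeholders, using a small made-up data frame (toy_bank and its values are illustrative assumptions, not the real data):

```r
# Toy stand-in for the bank data; the values are invented for illustration only
toy_bank <- data.frame(
  job      = c("services", "unknown", "management", "student"),
  contact  = c("cellular", "unknown", "unknown", "telephone"),
  poutcome = c("unknown", "failure", "unknown", "unknown"),
  balance  = c(1787, 4789, 1350, 1476),
  stringsAsFactors = FALSE
)

# Pick the character columns, then count the "unknown" entries in each of them
is_chr <- vapply(toy_bank, is.character, logical(1))
unknown_counts <- colSums(toy_bank[, is_chr, drop = FALSE] == "unknown")
unknown_counts
```

Applied to the real data, the same idea would show how much of each categorical column is effectively missing even though colSums(is.na(bank)) reports zero.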

Data Wrangling

bank <- bank %>%
  mutate(job = as.factor(job),
         marital = as.factor(marital),
         education = as.factor(education),
         default = as.factor(default),
         housing = as.factor(housing),
         loan = as.factor(loan),
         contact = as.factor(contact),
         month = as.factor(month),
         poutcome = as.factor(poutcome),
         y = as.factor(y))

head(bank)

Cross Validation

Cross-validation is a common evaluation technique in data analysis and statistical modeling. It measures the performance of a model by dividing the data into several subsets, or folds; the model is trained on some folds and tested on the remaining ones. This helps detect overfitting and gives an estimate of how well the model generalizes to unseen data, which in turn helps us choose the most suitable model for further analysis. In this section we use the simplest variant, a single hold-out split: the data is divided once into a training set and a test set.
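To make the idea of folds concrete, here is a minimal base R sketch of k-fold index assignment. The sizes (n = 20, k = 5) are arbitrary illustration values and not taken from the bank data.

```r
set.seed(100)
n <- 20   # number of rows in a hypothetical data set
k <- 5    # number of folds

# Assign each row to one of the k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  test_idx  <- which(fold_id == i)   # rows held out in this round
  train_idx <- which(fold_id != i)   # rows used to fit the model
  cat("Fold", i, ": train on", length(train_idx),
      "rows, test on", length(test_idx), "rows\n")
}
```

Each row is held out exactly once, so every observation contributes to both training and testing across the k rounds.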

# Set the seed of the random number generator
set.seed(100)

# Randomly draw the row indices for the training data
train_indices <- sample(nrow(bank), 0.8 * nrow(bank))

# Create the training data
data_train <- bank[train_indices, ]

# Create the test data, using setdiff() to take the rows that are not in the training data
data_test <- bank[setdiff(seq_len(nrow(bank)), train_indices), ]

The data is divided into training data (80% of the total data) and test data (20% of the total data) using the “random sampling” technique with a seed of 100. This process is conducted to evaluate the model’s performance by training the model on the training data and testing it on the test data that has not been seen before.
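As a quick sanity check on the split sizes, the sketch below repeats the index logic on row numbers only, so it runs without the Excel file. The row count 4521 comes from the glimpse() output; floor() is written out explicitly because 0.8 * 4521 is not a whole number.

```r
n <- 4521                    # rows in bank, from the glimpse() output
n_train <- floor(0.8 * n)    # fractional sizes are truncated, made explicit here
n_test <- n - n_train

set.seed(100)
train_idx <- sample(n, n_train)
test_idx <- setdiff(seq_len(n), train_idx)

# Training and test rows must not overlap and must cover all rows
c(train = length(train_idx), test = length(test_idx))
# train = 3616, test = 905
```

The 905 test rows agree with the totals of the test-set confusion matrices shown later (for example 627 + 26 + 174 + 78 = 905).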

Downsampling

This technique is effective in addressing the problem of imbalance in the dataset, especially when there is a significant imbalance between the majority and minority classes. By reducing the number of examples in the majority class, downsampling helps create a more balanced dataset between the two classes, allowing the model to learn better from the minority class and improve its performance in recognizing and predicting the minority class.
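The mechanics can be sketched in base R, independent of caret: keep every minority row and draw an equally large random sample from the majority class. The data frame toy and its 88/12 class split are made up for illustration.

```r
set.seed(100)

# Hypothetical imbalanced data: 88 "no" rows and 12 "yes" rows
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("no", "yes"), times = c(88, 12)))
)

# Size of the smallest class
n_min <- min(table(toy$y))

# For each class, keep n_min randomly chosen row indices
rows_by_class <- split(seq_len(nrow(toy)), toy$y)
keep <- unlist(lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]))

toy_down <- toy[keep, ]
table(toy_down$y)
```

The trade-off is that the discarded majority rows no longer inform the model, which is worth keeping in mind when the original data set is small.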

library(caret)
set.seed(100)

# Perform downsampling
data_train_down <- downSample(x = data_train[, -which(names(data_train) == "y")],
                              y = data_train$y,
                              yname = "y")
prop.table(table(data_train_down$y))
#> 
#>  no yes 
#> 0.5 0.5

After downsampling, the two target classes are balanced: the “no” class (majority) and the “yes” class (minority) each make up 50% of the training data.

Modeling

Naive Bayes

The Naive Bayes method is a classification algorithm that can be applied to the Portuguese bank marketing data described above. It uses features such as age, job, marital status, and education as predictors of whether a client will subscribe to the bank’s term deposit product (‘yes’) or not (‘no’). The method assumes that the features are conditionally independent given the class. Although this assumption is simplistic, it often works well in practice.

# Load the required library
library(e1071)

# Fit the Naive Bayes model
model_nb <- naiveBayes(y ~ ., data = data_train_down)

# Print the model summary
model_nb
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>  no yes 
#> 0.5 0.5 
#> 
#> Conditional probabilities:
#>      age
#> Y         [,1]      [,2]
#>   no  40.66667  9.942704
#>   yes 42.73381 13.307699
#> 
#>      job
#> Y          admin. blue-collar entrepreneur   housemaid  management     retired
#>   no  0.091127098 0.251798561  0.045563549 0.033573141 0.223021583 0.028776978
#>   yes 0.103117506 0.146282974  0.033573141 0.026378897 0.237410072 0.115107914
#>      job
#> Y     self-employed    services     student  technician  unemployed     unknown
#>   no    0.033573141 0.071942446 0.014388489 0.165467626 0.035971223 0.004796163
#>   yes   0.033573141 0.074340528 0.038369305 0.153477218 0.021582734 0.016786571
#> 
#>      marital
#> Y      divorced   married    single
#>   no  0.0911271 0.6258993 0.2829736
#>   yes 0.1486811 0.5443645 0.3069544
#> 
#>      education
#> Y        primary  secondary   tertiary    unknown
#>   no  0.13908873 0.52517986 0.29256595 0.04316547
#>   yes 0.13189448 0.46522782 0.35971223 0.04316547
#> 
#>      default
#> Y             no        yes
#>   no  0.98321343 0.01678657
#>   yes 0.98081535 0.01918465
#> 
#>      balance
#> Y         [,1]     [,2]
#>   no  1449.803 2976.179
#>   yes 1642.890 2572.182
#> 
#>      housing
#> Y            no       yes
#>   no  0.3980815 0.6019185
#>   yes 0.5875300 0.4124700
#> 
#>      loan
#> Y             no        yes
#>   no  0.84412470 0.15587530
#>   yes 0.92086331 0.07913669
#> 
#>      contact
#> Y       cellular  telephone    unknown
#>   no  0.65227818 0.03357314 0.31414868
#>   yes 0.80095923 0.08153477 0.11750600
#> 
#>      day
#> Y         [,1]     [,2]
#>   no  15.50839 8.155081
#>   yes 15.34532 8.177460
#> 
#>      month
#> Y             apr         aug         dec         feb         jan         jul
#>   no  0.062350120 0.129496403 0.009592326 0.050359712 0.021582734 0.158273381
#>   yes 0.115107914 0.158273381 0.016786571 0.067146283 0.023980815 0.117505995
#>      month
#> Y             jun         mar         may         nov         oct         sep
#>   no  0.103117506 0.011990408 0.328537170 0.105515588 0.009592326 0.009592326
#>   yes 0.105515588 0.038369305 0.179856115 0.076738609 0.064748201 0.035971223
#> 
#>      duration
#> Y         [,1]     [,2]
#>   no  238.5947 232.3099
#>   yes 548.2518 387.6554
#> 
#>      campaign
#> Y         [,1]     [,2]
#>   no  2.904077 3.205375
#>   yes 2.292566 2.108993
#> 
#>      pdays
#> Y         [,1]      [,2]
#>   no  34.34772  90.32884
#>   yes 65.31894 116.72375
#> 
#>      previous
#> Y          [,1]     [,2]
#>   no  0.4460432 1.456943
#>   yes 1.0791367 2.079194
#> 
#>      poutcome
#> Y         failure       other     success     unknown
#>   no  0.119904077 0.028776978 0.004796163 0.846522782
#>   yes 0.110311751 0.071942446 0.163069544 0.654676259

After creating the Naive Bayes model using the training data and obtaining the conditional probabilities of each predictor with respect to the target classes (‘yes’ or ‘no’), the model is ready to be used for predicting whether a client will subscribe to the bank’s term deposit product (‘yes’) or not (‘no’). By leveraging the simple assumption that each feature is independent, the Naive Bayes model calculates the posterior probabilities for each class based on the features provided in the test data. Subsequently, based on these posterior probabilities, the model will make predictions about the most likely class for each data point in the test data. Thus, the Naive Bayes model will aid in classifying bank clients based on their attributes and estimating whether they will subscribe to the term deposit product or not.
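To illustrate how the posterior is assembled, the sketch below combines just two of the tables printed above: the categorical housing table and the duration table, whose two columns are the per-class mean and standard deviation used for a Gaussian density. The example client (housing = "no", duration = 500 seconds) is hypothetical, and restricting the calculation to two predictors is a simplification; the fitted model uses all sixteen.

```r
# Values copied from the model output above
prior        <- c(no = 0.5, yes = 0.5)
p_housing_no <- c(no = 0.3980815, yes = 0.5875300)   # P(housing = "no" | y)
dur_mean     <- c(no = 238.5947, yes = 548.2518)     # mean of duration given y
dur_sd       <- c(no = 232.3099, yes = 387.6554)     # sd of duration given y

# Hypothetical client: no housing loan, last call lasted 500 seconds
duration <- 500

# Naive assumption: per-feature likelihoods multiply within each class
likelihood <- p_housing_no * dnorm(duration, mean = dur_mean, sd = dur_sd)

# Bayes' rule: normalise prior * likelihood over both classes
posterior <- prior * likelihood / sum(prior * likelihood)
posterior
```

For this client the ‘yes’ class ends up with the larger posterior, mainly because long calls are far more typical of subscribers in the training data.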

Prediction

After the Naive Bayes model has been successfully created and trained using the training data, we can use the model to make predictions on the test data. By inputting the attributes of each client in the test data into the model, we can obtain predictions about whether the client will subscribe to the bank’s term deposit product (‘yes’) or not (‘no’). These prediction results will help the bank in identifying potential clients who are likely to subscribe to the term deposit product, as well as provide valuable insights to improve the effectiveness of marketing campaigns and make better decisions in business strategies.

pred_naive <- predict(model_nb, newdata = data_test, type = "class")

Confusion Matrix

The confusion matrix is used to evaluate the performance of a classification model like Naive Bayes on the test data. It contains four values: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), representing the number of data points that were correctly and incorrectly predicted by the model.

By using the confusion matrix, we can calculate the accuracy of the model, which is the proportion of correctly predicted data points to the total test data. Additionally, we can compute the precision and recall for each target class (‘yes’ and ‘no’). Precision measures the proportion of positive predictions that are actually positive, while recall measures the proportion of positive data points that are correctly identified by the model.

The confusion matrix provides a comprehensive overview of the model’s performance in classifying the test data and helps identify areas where the model can be improved to enhance accuracy and prediction effectiveness.

library(caret)

# Use "yes" (the client subscribed) as the positive class
positive_class <- "yes"

# Build the confusion matrix
cm_naive <- confusionMatrix(data = pred_naive, reference = data_test$y, positive = positive_class)

# Display the confusion matrix
cm_naive
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  627  26
#>        yes 174  78
#>                                              
#>                Accuracy : 0.779              
#>                  95% CI : (0.7505, 0.8056)   
#>     No Information Rate : 0.8851             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.329              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.75000            
#>             Specificity : 0.78277            
#>          Pos Pred Value : 0.30952            
#>          Neg Pred Value : 0.96018            
#>              Prevalence : 0.11492            
#>          Detection Rate : 0.08619            
#>    Detection Prevalence : 0.27845            
#>       Balanced Accuracy : 0.76639            
#>                                              
#>        'Positive' Class : yes                
#> 

The confusion matrix and statistics provide a comprehensive evaluation of the performance of the classification model on the test data. Here are the key metrics:

Confusion Matrix:

  1. True Positives (TP): 78
  2. True Negatives (TN): 627
  3. False Positives (FP): 174
  4. False Negatives (FN): 26

Statistics:

  • Accuracy: 0.779. Accuracy measures the proportion of correctly classified data points among all data points.

  • 95% Confidence Interval (CI): (0.7505, 0.8056). The confidence interval indicates the range within which the true accuracy of the model is likely to lie.

  • No Information Rate (NIR): 0.8851. The No Information Rate is the accuracy achieved by simply predicting the majority class (‘no’ in this case). It serves as a baseline for model comparison.

  • Kappa: 0.329. Kappa measures the agreement between the model’s predictions and the actual outcomes, accounting for the agreement that could occur by chance.

  • McNemar’s Test P-Value: <0.0000000000000002. McNemar’s test checks whether the two kinds of error, false positives and false negatives, occur equally often. The very low p-value indicates that they do not: the model makes far more false positive errors (174) than false negative errors (26).

  • Sensitivity (True Positive Rate): 0.75000. Sensitivity measures the proportion of actual positive data points that are correctly identified by the model.

  • Specificity (True Negative Rate): 0.78277. Specificity measures the proportion of actual negative data points that are correctly identified by the model.

  • Positive Predictive Value (Precision): 0.30952. Positive Predictive Value measures the proportion of data points predicted as positive that are actually positive.

  • Negative Predictive Value: 0.96018. Negative Predictive Value measures the proportion of data points predicted as negative that are actually negative.

  • Prevalence: 0.11492. Prevalence is the proportion of positive data points in the test data.

  • Detection Rate: 0.08619. Detection Rate is the proportion of all data points that are true positives (TP divided by the total number of data points).

  • Detection Prevalence: 0.27845. Detection Prevalence is the proportion of all data points that the model predicts as positive.

  • Balanced Accuracy: 0.76639. Balanced Accuracy is the average of sensitivity and specificity, providing a balanced view of the model’s performance on both classes.

  • ‘Positive’ Class: ‘yes’

The positive class in this model is ‘yes’, representing the clients who subscribed to the bank’s term deposit product. Overall, the model shows moderate performance, with an accuracy of 0.779 and a Kappa value of 0.329. The accuracy is below the No Information Rate of 0.8851, which is consistent with the P-Value [Acc > NIR] of 1. The sensitivity (recall) for the positive class is 0.75, indicating that the model identifies three quarters of the clients who subscribed to the term deposit product. However, the positive predictive value is low at 0.30952: only about 31% of the clients predicted to subscribe actually did, so positive predictions should be interpreted with caution. The balanced accuracy is 0.76639, indicating a reasonable balance between sensitivity and specificity. Further analysis and model improvement may be necessary to enhance the overall performance.
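The headline statistics can be reproduced by hand from the four cell counts in the caret output above (78, 627, 174, 26). The sketch below uses only base R; the variable names are illustrative.

```r
# Cell counts read from the Naive Bayes confusion matrix (positive class: "yes")
TP <- 78    # predicted yes, actually yes
TN <- 627   # predicted no,  actually no
FP <- 174   # predicted yes, actually no
FN <- 26    # predicted no,  actually yes
n  <- TP + TN + FP + FN

accuracy    <- (TP + TN) / n
sensitivity <- TP / (TP + FN)   # recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)   # positive predictive value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance <- ((TN + FN) * (TN + FP) + (TP + FP) * (TP + FN)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        balanced = balanced, kappa = kappa), 5)
```

Working through the formulas this way makes it clear why precision is low even though recall is high: the 174 false positives dominate the 252 positive predictions.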

D-Tree

Decision Tree (D-Tree) is one of the classification methods in data analysis. This method works by constructing a decision tree based on the features of the data and performing splits at each node based on rules that optimize the data classification. At each node, decisions are made based on the most influential features in separating the data into the target classes (‘yes’ or ‘no’).

library(partykit)
set.seed(100)

# Fit the Decision Tree model with mincriterion = 0.90
model_dt <- ctree(y ~ ., data = data_train_down, mincriterion = 0.90)

# Print the model summary
model_dt
#> 
#> Model formula:
#> y ~ age + job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + duration + campaign + pdays + 
#>     previous + poutcome
#> 
#> Fitted party:
#> [1] root
#> |   [2] duration <= 212
#> |   |   [3] poutcome in failure, other, unknown
#> |   |   |   [4] month in apr, feb, mar, oct, sep: no (n = 71, err = 40.8%)
#> |   |   |   [5] month in aug, dec, jan, jul, jun, may, nov
#> |   |   |   |   [6] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unemployed
#> |   |   |   |   |   [7] poutcome in failure, other
#> |   |   |   |   |   |   [8] marital in divorced, married: no (n = 19, err = 21.1%)
#> |   |   |   |   |   |   [9] marital in single: no (n = 11, err = 9.1%)
#> |   |   |   |   |   [10] poutcome in unknown: no (n = 195, err = 3.6%)
#> |   |   |   |   [11] job in student, unknown: yes (n = 8, err = 37.5%)
#> |   |   [12] poutcome in success: yes (n = 12, err = 0.0%)
#> |   [13] duration > 212
#> |   |   [14] contact in cellular, telephone
#> |   |   |   [15] poutcome in failure, unknown
#> |   |   |   |   [16] duration <= 585
#> |   |   |   |   |   [17] month in apr, aug, dec, feb, jan, jul, may, nov: yes (n = 164, err = 46.3%)
#> |   |   |   |   |   [18] month in jun, mar, oct, sep
#> |   |   |   |   |   |   [19] previous <= 1: yes (n = 33, err = 3.0%)
#> |   |   |   |   |   |   [20] previous > 1: yes (n = 7, err = 28.6%)
#> |   |   |   |   [21] duration > 585: yes (n = 120, err = 11.7%)
#> |   |   |   [22] poutcome in other, success: yes (n = 87, err = 5.7%)
#> |   |   [23] contact in unknown
#> |   |   |   [24] duration <= 422: no (n = 46, err = 4.3%)
#> |   |   |   [25] duration > 422: yes (n = 61, err = 27.9%)
#> 
#> Number of inner nodes:    12
#> Number of terminal nodes: 13
# Plot the Decision Tree model
plot(model_dt, type = "simple")

The D-Tree model consists of 25 nodes: 12 inner nodes, where the data is split on the most influential features, and 13 terminal nodes.

The first split is based on the duration of the phone call (duration). If the call duration is less than or equal to 212 seconds, the model next looks at the outcome of the previous campaign (poutcome): a previous ‘success’ leads directly to a ‘yes’ prediction, while for the other outcomes the model considers the month of the call (month). Calls in apr, feb, mar, oct, or sep are predicted as ‘no’; for the remaining months (aug, dec, jan, jul, jun, may, nov) the prediction depends on the client’s job (job), with further splits on poutcome and marital status (marital).

If the call duration is greater than 212 seconds, the next split is based on the type of contact (contact). For cellular and telephone contacts, the model then splits on poutcome, on a second duration threshold (585 seconds), on month, and on the number of previous contacts (previous); every terminal node in this branch predicts ‘yes’.

For clients whose contact type is unknown, the prediction is ‘no’ when the call lasts at most 422 seconds and ‘yes’ when it lasts longer.
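Because several sibling leaves in the printed tree share the same predicted class, its class predictions can be condensed into a few nested rules. The function below is a hand-written sketch that mirrors the printed splits (thresholds 212 and 422, the month and job groupings); predict_rule is a hypothetical helper, not part of partykit, and it returns only the class label, not the node-level error rates.

```r
# Hand-coded version of the class predictions of the printed tree
predict_rule <- function(duration, poutcome, month, job, contact) {
  if (duration <= 212) {
    # Short calls: "yes" only after a previous success or for student/unknown jobs
    if (poutcome == "success") return("yes")
    if (month %in% c("apr", "feb", "mar", "oct", "sep")) return("no")
    if (job %in% c("student", "unknown")) return("yes")
    return("no")
  }
  # Longer calls: "no" only for unknown contact type and at most 422 seconds
  if (contact == "unknown" && duration <= 422) return("no")
  "yes"
}

# Two hypothetical clients
predict_rule(duration = 100, poutcome = "unknown", month = "may",
             job = "services", contact = "cellular")   # "no"
predict_rule(duration = 600, poutcome = "failure", month = "may",
             job = "services", contact = "cellular")   # "yes"
```

Writing the tree out like this shows how strongly the predictions lean on duration, which the variable notes flag as unavailable before a call is made.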

By using this D-Tree model, we can make predictions on test data by inputting relevant feature values and following the rules formed in the decision tree.

Prediction

pred_dt <- predict(model_dt, newdata = data_test, type = "response")

Confusion Matrix

library(caret)

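# Set "yes" as the positive class (caret would otherwise use the first factor level, "no")
positive_class <- "yes"

# Create the confusion matrix for the decision tree on the test data
# (the object keeps the name naive_conf, although it holds decision tree results)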
naive_conf <- confusionMatrix(data = pred_dt, 
                           reference = data_test$y, 
                           positive = positive_class)
naive_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  549  14
#>        yes 252  90
#>                                              
#>                Accuracy : 0.7061             
#>                  95% CI : (0.6752, 0.7356)   
#>     No Information Rate : 0.8851             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.276              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.86538            
#>             Specificity : 0.68539            
#>          Pos Pred Value : 0.26316            
#>          Neg Pred Value : 0.97513            
#>              Prevalence : 0.11492            
#>          Detection Rate : 0.09945            
#>    Detection Prevalence : 0.37790            
#>       Balanced Accuracy : 0.77539            
#>                                              
#>        'Positive' Class : yes                
#> 
# Predict classes on the training data
pred_train <- predict(object = model_dt,
                           newdata = data_train_down,
                           type = "response")


# Confusion matrix on the training data
dtree_conf <- confusionMatrix(data = pred_train, # predictions from the model
                reference = data_train_down$y, # actual labels in the training data
                positive = positive_class) # positive class

dtree_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  299  43
#>        yes 118 374
#>                                                
#>                Accuracy : 0.807                
#>                  95% CI : (0.7785, 0.8332)     
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6139               
#>                                                
#>  Mcnemar's Test P-Value : 0.000000005476       
#>                                                
#>             Sensitivity : 0.8969               
#>             Specificity : 0.7170               
#>          Pos Pred Value : 0.7602               
#>          Neg Pred Value : 0.8743               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.4484               
#>    Detection Prevalence : 0.5899               
#>       Balanced Accuracy : 0.8070               
#>                                                
#>        'Positive' Class : yes                  
#> 
  • Recall on the training data: 0.8969
  • Recall on the test data: 0.86538
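Both recall figures follow directly from the ‘yes’ columns of the two confusion matrices above. A small base R check, with the counts copied from the caret output:

```r
# Counts from the "yes" reference column of each confusion matrix
train_tp <- 374; train_fn <- 43   # training data (downsampled)
test_tp  <- 90;  test_fn  <- 14   # test data

recall_train <- train_tp / (train_tp + train_fn)
recall_test  <- test_tp / (test_tp + test_fn)

round(c(train = recall_train, test = recall_test,
        gap = recall_train - recall_test), 4)
```

A gap of roughly three percentage points between training and test recall suggests that the tree is not badly overfitted with respect to recall, although precision on the test data is much lower than on the balanced training data.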

Random Forest

Dimension Reduction

The reduced data is used as the training data for the Random Forest model.

library(caret)
# Feature selection: drop predictors with near-zero variance
zero_var <- nearZeroVar(bank)
bank_clean <- bank[,-zero_var] # do not run this line twice

dim(bank_clean)
#> [1] 4521   15

After reduction, the dataset consists of 15 columns with a total of 4521 rows.
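The idea behind a near-zero-variance filter can be sketched in base R. A column is flagged when its most common value is far more frequent than the runner-up and it has few distinct values; the cut-offs below (frequency ratio above 95/5, fewer than 10% unique values) correspond to the defaults of caret's nearZeroVar. The helper functions and the toy data frame are illustrative assumptions, not caret's implementation.

```r
# Ratio of the most frequent value to the second most frequent value
freq_ratio <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  if (length(tab) < 2) return(Inf)   # a constant column has no runner-up
  as.numeric(tab[1] / tab[2])
}

# Percentage of distinct values relative to the number of rows
pct_unique <- function(x) 100 * length(unique(x)) / length(x)

# Made-up columns: one almost constant, one evenly split
toy <- data.frame(
  default = c(rep("no", 98), rep("yes", 2)),
  housing = rep(c("no", "yes"), 50)
)

flagged <- sapply(toy, function(x) freq_ratio(x) > 95 / 5 && pct_unique(x) < 10)
flagged
```

Here only the almost-constant column is flagged, which mirrors the kind of predictor that the reduction step removes.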

Data Test

The test data is used to evaluate the model’s performance on new data, and here we have prepared new data for this purpose.

bank_data_test <- read_excel("data_input/bank-full.xlsx")
glimpse(bank_data_test)
#> Rows: 45,211
#> Columns: 17
#> $ age       <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
#> $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
#> $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
#> $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
#> $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
#> $ balance   <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
#> $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
#> $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
#> $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
#> $ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
#> $ duration  <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
#> $ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
#> $ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
#> $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
bank_data_test <- bank_data_test %>%
  mutate(job = as.factor(job),
         marital = as.factor(marital),
         education = as.factor(education),
         default = as.factor(default),
         housing = as.factor(housing),
         loan = as.factor(loan),
         contact = as.factor(contact),
         month = as.factor(month),
         poutcome = as.factor(poutcome),
         y = as.factor(y))

dim(bank_data_test)
#> [1] 45211    17

The data consists of 45,211 rows and 17 columns.

#set.seed(417)
#control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Fit the random forest model
#model_rf <- train(y ~ ., data = bank_clean, method = "rf", trControl = control)

# Save the model
# saveRDS(model_rf, "bank_forest_all.RDS")

Reading Model

# read model
model_bank_forest <- readRDS("model/bank_forest_all.RDS")
model_bank_forest
#> Random Forest 
#> 
#> 4521 samples
#>   14 predictor
#>    2 classes: 'no', 'yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 3617, 3617, 3616, 3617, 3617, 3617, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa     
#>    2    0.8867512  0.04749236
#>   21    0.8990651  0.41313525
#>   40    0.8995817  0.42836685
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 40.
library(randomForest)
model_bank_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 40
#> 
#>         OOB estimate of  error rate: 10.53%
#> Confusion matrix:
#>       no yes class.error
#> no  3840 160   0.0400000
#> yes  316 205   0.6065259

Prediction

bank_pred <- predict(object = model_bank_forest, 
                   newdata = bank_data_test)
head(bank_pred)
#> [1] no no no no no no
#> Levels: no yes
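# For a caret train object, confusionMatrix() returns the resampled
# (cross-validated) confusion matrix, so no reference or positive class is supplied
rf_conf <- confusionMatrix(data = model_bank_forest)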
rf_conf
#> Cross-Validated (5 fold, repeated 3 times) Confusion Matrix 
#> 
#> (entries are percentual average cell counts across resamples)
#>  
#>           Reference
#> Prediction   no  yes
#>        no  85.3  6.8
#>        yes  3.2  4.7
#>                             
#>  Accuracy (average) : 0.8996

The evaluation method used is “Cross-Validated (5 fold, repeated 3 times)”. This indicates that the model was evaluated using cross-validation with a 5-fold scheme, repeated 3 times. In 5-fold cross-validation, the data is divided into 5 equal-sized subsets (folds), and the model is trained and tested using different combinations of these folds to obtain a more stable performance estimation.

Accuracy (average): “0.8996” indicates that the average accuracy of the model across all resampling is approximately 0.8996 or around 89.96%. Accuracy is an evaluation metric that measures how well the model can make correct predictions, which means how many true positives and true negatives it correctly predicts compared to the total evaluated data.
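Since the cross-validated table reports average cell percentages, accuracy and per-class recall can be recomputed from it directly. The sketch below copies the four rounded percentages from the output above, so the results are accurate only to the precision of that display.

```r
# Average cell percentages from the cross-validated confusion matrix
cv_pct <- matrix(
  c(85.3, 3.2,    # Reference = no:  predicted no, predicted yes
    6.8, 4.7),    # Reference = yes: predicted no, predicted yes
  nrow = 2,
  dimnames = list(Prediction = c("no", "yes"), Reference = c("no", "yes"))
)

# Accuracy: share of the diagonal cells
accuracy <- sum(diag(cv_pct)) / sum(cv_pct)

# Recall for the "yes" class: correctly predicted yes among all actual yes
recall_yes <- cv_pct["yes", "yes"] / sum(cv_pct[, "yes"])

round(c(accuracy = accuracy, recall_yes = recall_yes), 3)
```

The recall for ‘yes’ comes out at roughly 0.41, so the high average accuracy is driven mostly by the majority class; this matches the class error of about 0.61 for ‘yes’ in the out-of-bag confusion matrix.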

library(ROCR)

prob_test <- predict(model_bank_forest, newdata = bank_data_test, type = "prob")
pred_roc <- prediction(prob_test[,"yes"], bank_data_test$y)
perf <-performance(prediction.obj = pred_roc, measure = "tpr", x.measure = "fpr")
plot(perf)

# Calculate AUC
auc <- performance(pred_roc, measure = "auc")

# Print the AUC value
str(auc)
#> Formal class 'performance' [package "ROCR"] with 6 slots
#>   ..@ x.name      : chr "None"
#>   ..@ y.name      : chr "Area under the ROC curve"
#>   ..@ alpha.name  : chr "none"
#>   ..@ x.values    : list()
#>   ..@ y.values    :List of 1
#>   .. ..$ : num 0.916
#>   ..@ alpha.values: list()

The obtained AUC value from the model evaluation is 0.916. This value indicates that the model has good performance in distinguishing between positive and negative classes, as it is close to 1.
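The AUC has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it with the rank-based (Mann-Whitney) formula on six made-up scores; auc_rank is a hypothetical helper and is independent of ROCR. With the ROCR object above, the single number shown by str(auc) sits in the y.values slot.

```r
# Rank-based AUC: label is TRUE for the positive class
auc_rank <- function(score, label) {
  r <- rank(score)        # average ranks handle tied scores
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one positive case (0.30) is scored below one negative case (0.35)
score <- c(0.90, 0.80, 0.35, 0.30, 0.20, 0.10)
label <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_rank(score, label)
# 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9
```

An AUC of 0.916 therefore means that in about 92% of such pairs the model ranks the subscriber above the non-subscriber.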

Conclusion

# Extract values from the confusion matrix for Naive Bayes (test data)
# Rows are predictions and columns are references; level 1 is "no", level 2 is "yes"
TP_naive <- cm_naive$table[2, 2]
TN_naive <- cm_naive$table[1, 1]
FP_naive <- cm_naive$table[2, 1]
FN_naive <- cm_naive$table[1, 2]

# Calculate accuracy for Naive Bayes
Accuracy_naive <- (TP_naive + TN_naive) / (TP_naive + TN_naive + FP_naive + FN_naive)

# Extract values from the confusion matrix for Decision Tree (test data)
# Note: naive_conf was created in the D-Tree section and holds the decision tree results
TP_dtree <- naive_conf$table[2, 2]
TN_dtree <- naive_conf$table[1, 1]
FP_dtree <- naive_conf$table[2, 1]
FN_dtree <- naive_conf$table[1, 2]

# Calculate accuracy for Decision Tree
Accuracy_dtree <- (TP_dtree + TN_dtree) / (TP_dtree + TN_dtree + FP_dtree + FN_dtree)

# Extract values from the cross-validated confusion matrix for Random Forest
# (the cells are average percentages across resamples, not counts)
TP_rf <- rf_conf$table[2, 2]
TN_rf <- rf_conf$table[1, 1]
FP_rf <- rf_conf$table[2, 1]
FN_rf <- rf_conf$table[1, 2]

# Calculate accuracy for Random Forest
Accuracy_rf <- (TP_rf + TN_rf) / (TP_rf + TN_rf + FP_rf + FN_rf)

comparison <- cbind.data.frame(Accuracy_naive, Accuracy_dtree, Accuracy_rf)

comparison

Based on the confusion matrices shown above, the accuracy values for each model are as follows:

  • Naive Bayes (test data): 0.7790
  • Decision Tree (test data): 0.7061
  • Random Forest (cross-validated): 0.8996

The accuracy metric represents the proportion of correctly predicted instances out of the total instances. Higher accuracy values indicate better model performance in making correct predictions on the given dataset.

From these values, the Random Forest model has the highest accuracy (0.8996), followed by the Naive Bayes model (0.7790) and then the Decision Tree model (0.7061). The comparison is indicative rather than exact: the Naive Bayes and Decision Tree figures come from the 905-row test set after training on downsampled data, which trades accuracy for recall, while the Random Forest figure is a cross-validated average on the full training data. With that caveat, the Random Forest model performs best among the three in terms of overall prediction accuracy.

In this case, the Random Forest model outperforms the others.