Introduction

Determining whether a customer will subscribe to a time deposit is the problem we want to solve. We will use three machine learning methods, with predictors that are either numeric or categorical. The prediction target is the column y, which contains yes or no: whether the customer subscribes to a time deposit (yes) or not (no). We will compare the three methods and conclude which one is best for this prediction.

Library

library(dplyr)
library(caret)
library(e1071)
library(ROCR)
library(partykit)
library(rsample)
library(randomForest)

Read Data & Data Understanding

Import Data

Import the data that we have prepared, namely bank.csv. Since the file is a CSV, we use the read.csv() command.

bank <- read.csv("bank.csv")

Data Inspection

Let’s take a quick look at the data with the head() command.

head(bank)

We check the data type with the glimpse() command.

glimpse(bank)
#> Rows: 4,521
#> Columns: 17
#> $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
#> $ job       <chr> "unemployed", "services", "management", "management", "blue-…
#> $ marital   <chr> "married", "married", "single", "married", "married", "singl…
#> $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
#> $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
#> $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
#> $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
#> $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
#> $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
#> $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
#> $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
#> $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
#> $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
#> $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
#> $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
#> $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
#> $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

From the glimpse() output above, we can see that the data has 4,521 rows and 17 columns. Here is an explanation of the variables:

  • age :age of the client (numeric)
  • job :type of job
  • marital :marital status
  • education :categorical: “unknown”,“secondary”,“primary”,“tertiary”
  • default :has credit in default? (binary: “yes”,“no”)
  • balance :average yearly balance, in euros (numeric)
  • housing :has housing loan? (binary: “yes”,“no”)
  • loan :has personal loan? (binary: “yes”,“no”)
  • contact :contact communication type (categorical: “cellular”,“telephone”,“unknown”)
  • day :last contact day of the month (numeric)
  • month :last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  • duration :last contact duration, in seconds (numeric)
  • campaign :number of contacts performed during this campaign and for this client
  • pdays :number of days that passed after the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted)
  • previous :number of contacts performed before this campaign and for this client (numeric)
  • poutcome :outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
  • y :has the client subscribed to a term deposit? (binary: “yes”,“no”)

Data Manipulation

We will convert the columns to the types they should have: the categorical columns become factors.

bank_clean <- bank %>% 
  mutate_at(vars(job, marital, education, default, housing, loan, contact, month, poutcome, y), as.factor)
glimpse(bank_clean)
#> Rows: 4,521
#> Columns: 17
#> $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
#> $ job       <fct> unemployed, services, management, management, blue-collar, m…
#> $ marital   <fct> married, married, single, married, married, single, married,…
#> $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertiary,…
#> $ default   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
#> $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
#> $ housing   <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, no…
#> $ loan      <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no, y…
#> $ contact   <fct> cellular, cellular, cellular, unknown, unknown, cellular, ce…
#> $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
#> $ month     <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, apr, …
#> $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
#> $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
#> $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
#> $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
#> $ poutcome  <fct> unknown, failure, failure, unknown, unknown, failure, other,…
#> $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no,…

Check Missing Values

colSums(is.na(bank_clean))
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0

Exploratory Data Analysis

Check the distribution/pattern of the data

summary(bank_clean)
#>       age                 job          marital         education    default   
#>  Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
#>  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
#>  Median :39.00   technician :768   single  :1196   tertiary :1350             
#>  Mean   :41.17   admin.     :478                   unknown  : 187             
#>  3rd Qu.:49.00   services   :417                                              
#>  Max.   :87.00   retired    :230                                              
#>                  (Other)    :713                                              
#>     balance      housing     loan           contact          day       
#>  Min.   :-3313   no :1962   no :3830   cellular :2896   Min.   : 1.00  
#>  1st Qu.:   69   yes:2559   yes: 691   telephone: 301   1st Qu.: 9.00  
#>  Median :  444                         unknown  :1324   Median :16.00  
#>  Mean   : 1423                                          Mean   :15.92  
#>  3rd Qu.: 1480                                          3rd Qu.:21.00  
#>  Max.   :71188                                          Max.   :31.00  
#>                                                                        
#>      month         duration       campaign          pdays       
#>  may    :1398   Min.   :   4   Min.   : 1.000   Min.   : -1.00  
#>  jul    : 706   1st Qu.: 104   1st Qu.: 1.000   1st Qu.: -1.00  
#>  aug    : 633   Median : 185   Median : 2.000   Median : -1.00  
#>  jun    : 531   Mean   : 264   Mean   : 2.794   Mean   : 39.77  
#>  nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000   3rd Qu.: -1.00  
#>  apr    : 293   Max.   :3025   Max.   :50.000   Max.   :871.00  
#>  (Other): 571                                                   
#>     previous          poutcome      y       
#>  Min.   : 0.0000   failure: 490   no :4000  
#>  1st Qu.: 0.0000   other  : 197   yes: 521  
#>  Median : 0.0000   success: 129             
#>  Mean   : 0.5426   unknown:3705             
#>  3rd Qu.: 0.0000                            
#>  Max.   :25.0000                            
#> 

Insight:

  • age: min 19, max 87
  • job: management is the most common (969); among the named categories, retired is the least common (230)
  • marital: married is the most common (2797); divorced is the least common (528)
  • education: secondary is the most common (2306); unknown is the least common (187)
  • default: no (4445) far outweighs yes (76)
  • balance: min -3313, max 71188
  • housing: yes (2559) outnumbers no (1962)
  • loan: no (3830) outnumbers yes (691)
  • contact: cellular is the most common (2896); telephone is the least common (301)
  • day: min 1, max 31
  • month: may is the most frequent (1398)
  • duration: min 4, max 3025
  • campaign: min 1, max 50
  • pdays: min -1, max 871
  • previous: min 0, max 25
  • poutcome: unknown is the most frequent (3705); success is the least frequent (129)
  • y: no (4000) far outnumbers yes (521), so the target is imbalanced

Cross Validation

We will split the data into a train set and a test set.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index_bank <- sample(nrow(bank_clean), nrow(bank_clean)*0.80)

bank_train <- bank_clean[index_bank,] # for training
bank_test <- bank_clean[-index_bank,] # for testing/prediction
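
As a quick sanity check, we can confirm the 80/20 split sizes:

# sanity check on the split sizes
nrow(bank_train) # should be 3616 (80%)
nrow(bank_test)  # should be 905 (20%)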

Check the proportion of the target classes in bank_train:

prop.table(table(bank_train$y))
#> 
#>        no       yes 
#> 0.8821903 0.1178097

The target proportion is not balanced.

Handling Imbalanced Data

# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

bank_train_up <- upSample(x = bank_train %>% select(-y),
                       y = bank_train$y,
                       yname = "y")

Check the target proportion

prop.table(table(bank_train_up$y))
#> 
#>  no yes 
#> 0.5 0.5

The target is now balanced.

Naive Bayes

Naive Bayes is a classification method that uses Bayes’ theorem. Bayes’ theorem states that the probability of an event changes when new information arrives.
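
As a toy illustration (the numbers below are invented, not taken from the bank data): suppose 10% of customers subscribe, and some piece of evidence is seen in 60% of subscribers but only 20% of non-subscribers. Bayes’ theorem updates the 10% prior to a 25% posterior:

# toy example of Bayes' theorem: P(yes | evidence)
p_yes   <- 0.10 # prior P(yes)
p_e_yes <- 0.60 # likelihood P(evidence | yes)
p_e_no  <- 0.20 # likelihood P(evidence | no)
p_e     <- p_e_yes * p_yes + p_e_no * (1 - p_yes) # total probability P(evidence)
p_e_yes * p_yes / p_e # posterior P(yes | evidence) = 0.25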

Modeling

# train
model_nb_bank <- naiveBayes(y~., bank_train_up, laplace = 1)

Predict the class of the test data with the predict() function:

# predict class
bank_test$pred_label <- predict(object = model_nb_bank,
                                 newdata=bank_test,
                                 type="class") 

Model Evaluation

Model evaluation with confusion matrix:

con_bank_naive <- confusionMatrix(data = bank_test$pred_label, reference=bank_test$y, positive = "yes")
con_bank_naive
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  625  25
#>        yes 185  70
#>                                              
#>                Accuracy : 0.768              
#>                  95% CI : (0.7391, 0.7951)   
#>     No Information Rate : 0.895              
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.2917             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.73684            
#>             Specificity : 0.77160            
#>          Pos Pred Value : 0.27451            
#>          Neg Pred Value : 0.96154            
#>              Prevalence : 0.10497            
#>          Detection Rate : 0.07735            
#>    Detection Prevalence : 0.28177            
#>       Balanced Accuracy : 0.75422            
#>                                              
#>        'Positive' Class : yes                
#> 

ROC and AUC

ROC is a curve that describes the relationship between True Positive Rate and False Positive Rate at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate. AUC shows the area under the ROC curve. The closer to 1, the better the model’s performance in separating positive and negative classes.
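
To make these rates concrete, we can recompute them by hand from the Naive Bayes confusion matrix above (TP = 70, FN = 25, FP = 185, TN = 625):

# TPR (sensitivity) = TP / (TP + FN)
70 / (70 + 25) # 0.7368, matching the Sensitivity above
# FPR = FP / (FP + TN) = 1 - Specificity
185 / (185 + 625) # 0.2284, i.e. 1 - 0.7716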

We construct the ROC curve of the model model_nb_bank. First we make predictions in the form of probabilities.

# take the test-set predictions as probabilities
bank_test$pred <- predict(model_nb_bank, bank_test, type="raw")

Prepare the data for ROC (optional, but it makes things easier). We take the positive class to be yes.

# encode the actual labels as 1 & 0
bank_test$actual <- ifelse(bank_test$y == "yes", yes = 1, no = 0)

Set up prediction() object, calculate TPR & FPR with performance() function, then create ROC curve with plot().

# prediction object
bank_roc_pred <- prediction(predictions = bank_test$pred[,2], # predicted probability of "yes" (second column)
                       labels = bank_test$actual) # actual labels as 1 & 0

# ROC curve
plot(performance(prediction.obj = bank_roc_pred, "tpr", "fpr"))
abline(0,1, lty=2)

# AUC value
auc_pred <- performance(prediction.obj = bank_roc_pred, "auc")

auc_pred@y.values # the @ sign accesses a slot of the auc_pred object
#> [[1]]
#> [1] 0.8103704

AUC = 0.8103704, so we can conclude that the model is fairly good at separating the yes and no classes. Note that the second column of the probability matrix belongs to the positive class yes; if we accidentally used the first column (the no class), the ranking would be reversed and the AUC would come out as 1 - 0.8103704 = 0.1896296.

Decision Tree

Modeling

Decision trees are a type of supervised machine learning algorithm used for both classification and regression tasks. They work by creating a tree-like structure where each node represents a decision, each branch represents a possible outcome, and each leaf node represents a final prediction. To create a decision tree model, we can use the ctree() function from the partykit library.

bank_tree <- ctree(formula = y ~ ., data = bank_train_up,
                   control = ctree_control(mincriterion = 0.95, 
                                           minsplit = 100,
                                           minbucket = 80))
plot(bank_tree, type='simple')

Model Evaluation

Decision trees are prone to overfitting, so we require the difference between the train and test evaluations to be at most 15%.

# predict classes on the train data
pred_train <- predict(bank_tree, bank_train_up, type="response")

# confusion matrix for the train data
tree_con_train <- confusionMatrix(pred_train, bank_train_up$y, positive = "yes")
tree_con_train
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  2467  282
#>        yes  723 2908
#>                                                
#>                Accuracy : 0.8425               
#>                  95% CI : (0.8333, 0.8513)     
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.685                
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9116               
#>             Specificity : 0.7734               
#>          Pos Pred Value : 0.8009               
#>          Neg Pred Value : 0.8974               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.4558               
#>    Detection Prevalence : 0.5691               
#>       Balanced Accuracy : 0.8425               
#>                                                
#>        'Positive' Class : yes                  
#> 
# predict classes on the test data
pred_test <- predict(bank_tree, bank_test, type="response")

# confusion matrix for the test data
tree_con_test <- confusionMatrix(pred_test, bank_test$y, positive = "yes")
tree_con_test
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  615  12
#>        yes 195  83
#>                                              
#>                Accuracy : 0.7713             
#>                  95% CI : (0.7425, 0.7983)   
#>     No Information Rate : 0.895              
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3421             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.87368            
#>             Specificity : 0.75926            
#>          Pos Pred Value : 0.29856            
#>          Neg Pred Value : 0.98086            
#>              Prevalence : 0.10497            
#>          Detection Rate : 0.09171            
#>    Detection Prevalence : 0.30718            
#>       Balanced Accuracy : 0.81647            
#>                                              
#>        'Positive' Class : yes                
#> 

The Sensitivity (recall) values differ by less than 15% between train (91.2%) and test (87.4%), so the model is not overfitting.
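
The gap can also be computed directly (Sensitivity is stored as the first element of byClass):

# recall gap between train and test: ~0.038, well below the 0.15 limit
tree_con_train$byClass[1] - tree_con_test$byClass[1]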

Random Forest

Modeling

Random Forest is one of the most popular and powerful machine learning algorithms. It falls under the category of ensemble learning, which means it combines several simpler models (in this case, decision trees) to produce a more accurate and stable model. We create a Random Forest model using bank_train_up with 5-fold cross-validation, repeated 3 times.

set.seed(417)

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold
                     repeats = 3) # repetitions

bank_forest <- train(y ~ .,
                   data = bank_train_up,
                   method = "rf", # random forest
                   trControl = ctrl)

We will save the model as an RDS file.

saveRDS(bank_forest, file = "bank_forest.RDS")

We load our saved model back:

bank_forest_f <- readRDS("bank_forest.RDS")
bank_forest_f
#> Random Forest 
#> 
#> 6380 samples
#>   16 predictor
#>    2 classes: 'no', 'yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 5104, 5104, 5104, 5104, 5104, 5104, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8874608  0.7749216
#>   22    0.9680773  0.9361546
#>   42    0.9641066  0.9282132
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 22.

Out-of-Bag (OOB)

In the bootstrap sampling stage, some observations are not used in modeling; these are referred to as Out-of-Bag (OOB) data. The Random Forest model uses the OOB data for evaluation by calculating the error on it (similar to test data). This error is called the OOB error. For classification, the OOB error is the percentage of OOB data that is misclassified.

bank_forest_f$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 22
#> 
#>         OOB estimate of  error rate: 2.62%
#> Confusion matrix:
#>       no  yes class.error
#> no  3023  167   0.0523511
#> yes    0 3190   0.0000000

The OOB Error value for the bank_forest_f model is 2.62%. In other words, the model accuracy on OOB data is 97.38%.
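
We can verify this number from the OOB confusion matrix above: 167 misclassified cases out of 6,380 training rows.

# OOB error = misclassified OOB predictions / total training rows
(167 + 0) / (3023 + 167 + 0 + 3190) # = 0.0262, matching the 2.62% above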

Interpretation

Although random forest is often labeled an uninterpretable model, we can at least see which predictors are most used (most important) in building the forest:

varImp(bank_forest_f) %>% plot()

From the plot above, we can conclude that the duration predictor has the greatest influence.

bank_pred_rf <- predict(bank_forest_f, bank_test) # predict classes on the test data
plot(bank_pred_rf) # bar chart of the predicted classes

(conf_matrix_bank_rfor <- table(bank_pred_rf, bank_test$y))
#>             
#> bank_pred_rf  no yes
#>          no  766  51
#>          yes  44  44
con_bank_rf <- confusionMatrix(conf_matrix_bank_rfor, positive = "yes")
con_bank_rf
#> Confusion Matrix and Statistics
#> 
#>             
#> bank_pred_rf  no yes
#>          no  766  51
#>          yes  44  44
#>                                           
#>                Accuracy : 0.895           
#>                  95% CI : (0.8732, 0.9142)
#>     No Information Rate : 0.895           
#>     P-Value [Acc > NIR] : 0.5273          
#>                                           
#>                   Kappa : 0.4226          
#>                                           
#>  Mcnemar's Test P-Value : 0.5382          
#>                                           
#>             Sensitivity : 0.46316         
#>             Specificity : 0.94568         
#>          Pos Pred Value : 0.50000         
#>          Neg Pred Value : 0.93758         
#>              Prevalence : 0.10497         
#>          Detection Rate : 0.04862         
#>    Detection Prevalence : 0.09724         
#>       Balanced Accuracy : 0.70442         
#>                                           
#>        'Positive' Class : yes             
#> 

Model Evaluation: Naive Bayes, Decision Tree, and Random Forest

eval_bank_naiv <- tibble(Accuracy = con_bank_naive$overall[1],
                         Recall = con_bank_naive$byClass[1],
                         Specificity = con_bank_naive$byClass[2],
                         Precision = con_bank_naive$byClass[3])

eval_bank_tree <- tibble(Accuracy = tree_con_test$overall[1],
                         Recall = tree_con_test$byClass[1],
                         Specificity = tree_con_test$byClass[2],
                         Precision = tree_con_test$byClass[3])

eval_bank_rf <- tibble(Accuracy = con_bank_rf$overall[1],
                       Recall = con_bank_rf$byClass[1],
                       Specificity = con_bank_rf$byClass[2],
                       Precision = con_bank_rf$byClass[3])
eval_bank_naiv
eval_bank_tree
eval_bank_rf
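
For an easier side-by-side comparison, we can stack the three tables into one with dplyr’s bind_rows():

# stack the three evaluation tables into a single comparison table
bind_rows("Naive Bayes"   = eval_bank_naiv,
          "Decision Tree" = eval_bank_tree,
          "Random Forest" = eval_bank_rf,
          .id = "Model")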

Each of the three methods above has its own strengths, depending on which metric we look at.

Conclusion

Our positive class is yes, meaning the customer subscribed to a time deposit, while the negative class is no, meaning the customer did not. FP: predicting that a customer subscribes to a time deposit (yes) when in fact they do not; the bank’s risk is a financial loss. FN: predicting that a customer does not subscribe (no) when in fact they would; the bank risks losing profit. From the bank’s side, the more concerning risk is FN, so the metric we use is Recall. Of the three machine learning methods above, judged by this metric, we would choose the decision tree, which has a Recall of 87.4%.