Bank Marketing Model Experimentation

Author

Darwhin Gomez

Published

May 4, 2025

Code
# Loading the data and viewing its structure
bank_data <- readRDS("data/bank_full.rds")
str(bank_data)
'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : num  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : num  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Experimentation and Modeling

This report builds upon the previous analysis of the bank-additional-full dataset explored in Project One. The objective of this project is to experiment with classification models, specifically Decision Trees, Random Forest, and AdaBoost.

To achieve this, I will first establish baseline runs for each model and then conduct experiments to understand how changes in model parameters, feature selection, and cross-validation impact performance. The models will be evaluated using precision, recall, F1-score, and AUC-ROC to compare their effectiveness. Based on these experiments, I will recommend the most suitable model for classification.
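
Each experiment below follows the same evaluation pattern: predict classes and probabilities on the held-out test set, build a caret confusion matrix, compute an AUC with pROC, and append a row to a shared results data frame. The chunks write this pattern out inline each time; a hedged sketch of a helper that captures it (the function name and arguments are mine, not part of the original workflow) is shown below.

Code
# Hypothetical wrapper for the evaluation pattern repeated in the modeling chunks.
# Assumes the caret and pROC packages are loaded and that `probs` holds P(y = "yes").
evaluate_model <- function(name, experiment, preds, probs, actual) {
  conf <- caret::confusionMatrix(preds, actual)
  data.frame(Model      = name,
             Experiment = experiment,
             Accuracy   = conf$overall["Accuracy"],
             Precision  = conf$byClass["Precision"],
             Recall     = conf$byClass["Recall"],
             F1         = conf$byClass["F1"],
             AUC        = as.numeric(pROC::auc(actual, probs)))
}

A call such as results <- rbind(results, evaluate_model("Decision Tree", 0, tree_base_pred, tree_base_prob, testData$y)) would reproduce the rows that the chunks below build manually.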

Applying Recommendations from Project One:

  • Transforming pdays into a binary factor indicating whether the client was previously contacted.

  • Removing irrelevant features: emp.var.rate, previous, and duration.

Code
# Recode pdays as a binary factor: 0 = not previously contacted, 1 = previously contacted
bank_data$pdays <- factor(ifelse(bank_data$pdays == 999, 0, 1))
# Remove unwanted features (previous contact count, employment variation rate, and duration)
bank_data <- bank_data |> 
  select(-previous, -emp.var.rate, -duration)

# Display structure of the modified dataset
str(bank_data)
'data.frame':   41188 obs. of  18 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ campaign      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
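The modeling chunks that follow reference a train/test split (trainData and testData), a SMOTE-balanced copy of the training set (trainData_smote), and an empty results data frame, all created earlier in the workflow and not shown here. A minimal sketch of that setup, assuming caret for a stratified 75/25 split (consistent with the 30,891 training rows reported later) and smotefamily for SMOTE, with an illustrative seed and SMOTE parameters rather than the exact values used:

Code
# Packages assumed by the modeling chunks in this report
library(dplyr)
library(caret)          # createDataPartition, confusionMatrix
library(pROC)           # auc
library(rpart)
library(rpart.plot)
library(randomForest)
library(adabag)         # boosting
library(smotefamily)    # SMOTE (assumed implementation)

set.seed(123)           # illustrative seed

# Stratified 75/25 train/test split
train_idx <- createDataPartition(bank_data$y, p = 0.75, list = FALSE)
trainData <- bank_data[train_idx, ]
testData  <- bank_data[-train_idx, ]

# One-hot encode the training predictors, then oversample the minority class with SMOTE
x_train_encoded <- as.data.frame(model.matrix(~ . - 1, data = trainData[, names(trainData) != "y"]))
smote_out       <- SMOTE(x_train_encoded, trainData$y, K = 5)   # K is illustrative
trainData_smote <- smote_out$data
names(trainData_smote)[names(trainData_smote) == "class"] <- "y"
trainData_smote$y <- factor(trainData_smote$y, levels = levels(bank_data$y))

# Empty frame that accumulates the evaluation metrics for every experiment
results <- data.frame(Model = character(), Experiment = integer(),
                      Accuracy = numeric(), Precision = numeric(),
                      Recall = numeric(), F1 = numeric(), AUC = numeric())
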
Code
# Get names of encoded predictors from SMOTE training data
train_features <- setdiff(names(trainData_smote), "y")

# One-hot encode the test data
x_test_encoded <- model.matrix(~ . -1, data = testData[, setdiff(names(testData), "y")])
x_test_encoded <- as.data.frame(x_test_encoded)

# Add any missing columns that exist in training but not test
missing_cols <- setdiff(train_features, names(x_test_encoded))
for (col in missing_cols) {
  x_test_encoded[[col]] <- 0  # Add as zeros
}

# Make sure columns are in the same order as training
x_test_encoded <- x_test_encoded[, train_features]

# Add back test labels if needed
x_test_encoded$y <- testData$y

Baseline Models

Baseline models are fit on the original (unbalanced) training set to provide a point of comparison for the experiments.

Code
# Baseline Model 1: Decision Tree with default rpart settings
tree_base <- rpart(y ~ ., data = trainData, method = "class")

# Make predictions
tree_base_pred <- predict(tree_base, testData, type = "class")
tree_base_prob <- predict(tree_base, testData, type = "prob")[,2] 
tree_base_conf <- confusionMatrix(tree_base_pred, testData$y)
tree_base_auc <- auc(testData$y, tree_base_prob)

# Store results
results <- rbind(results, data.frame(Model = "Decision Tree", Experiment = 0,
                                     Accuracy = tree_base_conf$overall["Accuracy"],
                                     Precision = tree_base_conf$byClass["Precision"],
                                     Recall = tree_base_conf$byClass["Recall"],
                                     F1 = tree_base_conf$byClass["F1"],
                                     AUC = tree_base_auc))
#plot the tree
rpart.plot(tree_base, main = "Decision Tree", cex = .65)

Code
plotcp(tree_base)

Code
summary(tree_base)
Call:
rpart(formula = y ~ ., data = trainData, method = "class")
  n= 30891 

          CP nsplit rel error    xerror       xstd
1 0.05043103      0 1.0000000 1.0000000 0.01596823
2 0.01000000      2 0.8991379 0.9034483 0.01527053

Variable importance
   nr.employed      euribor3m  cons.conf.idx cons.price.idx          pdays 
            28             24             15             12             10 
         month       poutcome 
             7              4 

Node number 1: 30891 observations,    complexity param=0.05043103
  predicted class=no   expected loss=0.1126542  P(node) =1
    class counts: 27411  3480
   probabilities: 0.887 0.113 
  left son=2 (27177 obs) right son=3 (3714 obs)
  Primary splits:
      nr.employed   < 5087.65 to the right, improve=911.9415, (0 missing)
      euribor3m     < 1.2395  to the right, improve=834.9867, (0 missing)
      pdays         splits as  LR,          improve=629.0750, (0 missing)
      poutcome      splits as  LLR,         improve=589.4370, (0 missing)
      cons.conf.idx < -35.45  to the left,  improve=412.6504, (0 missing)
  Surrogate splits:
      euribor3m      < 1.2395  to the right, agree=0.984, adj=0.870, (0 split)
      cons.conf.idx  < -35.45  to the left,  agree=0.944, adj=0.537, (0 split)
      cons.price.idx < 92.7345 to the right, agree=0.934, adj=0.451, (0 split)
      month          splits as  LLRLLLLLRR,  agree=0.912, adj=0.264, (0 split)
      pdays          splits as  LR,          agree=0.903, adj=0.193, (0 split)

Node number 2: 27177 observations
  predicted class=no   expected loss=0.0677411  P(node) =0.8797708
    class counts: 25336  1841
   probabilities: 0.932 0.068 

Node number 3: 3714 observations,    complexity param=0.05043103
  predicted class=no   expected loss=0.4413032  P(node) =0.1202292
    class counts:  2075  1639
   probabilities: 0.559 0.441 
  left son=6 (2787 obs) right son=7 (927 obs)
  Primary splits:
      pdays          splits as  LR,          improve=151.97720, (0 missing)
      poutcome       splits as  LLR,         improve=149.17390, (0 missing)
      contact        splits as  RL,          improve= 31.80034, (0 missing)
      nr.employed    < 5049.85 to the right, improve= 31.70430, (0 missing)
      cons.price.idx < 93.166  to the left,  improve= 29.14529, (0 missing)
  Surrogate splits:
      poutcome splits as  LLR, agree=0.974, adj=0.894, (0 split)

Node number 6: 2787 observations
  predicted class=no   expected loss=0.3588088  P(node) =0.09022045
    class counts:  1787  1000
   probabilities: 0.641 0.359 

Node number 7: 927 observations
  predicted class=yes  expected loss=0.3106796  P(node) =0.03000874
    class counts:   288   639
   probabilities: 0.311 0.689 
Code
tree_base_conf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9053  926
       yes   84  234
                                         
               Accuracy : 0.9019         
                 95% CI : (0.896, 0.9076)
    No Information Rate : 0.8873         
    P-Value [Acc > NIR] : 1.026e-06      
                                         
                  Kappa : 0.2818         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9908         
            Specificity : 0.2017         
         Pos Pred Value : 0.9072         
         Neg Pred Value : 0.7358         
             Prevalence : 0.8873         
         Detection Rate : 0.8792         
   Detection Prevalence : 0.9691         
      Balanced Accuracy : 0.5963         
                                         
       'Positive' Class : no             
                                         
Code
# Baseline Model 2: Random Forest with 50 trees
rf_base_model <- randomForest(y ~ ., data = trainData, ntree = 50)
rf_base_pred <- predict(rf_base_model, testData)
rf_base_prob <- predict(rf_base_model, testData, type = "prob")[,2]  # Probability for ROC

# Evaluate Random Forest
rf_base_conf <- confusionMatrix(rf_base_pred, testData$y)
rf_base_auc<- auc(testData$y, rf_base_prob)


# Store results
results <- rbind(results, data.frame(Model = "Random Forest", Experiment = 0, 
                           Accuracy = rf_base_conf$overall["Accuracy"],
                               Precision = rf_base_conf$byClass["Precision"],
                                     Recall = rf_base_conf$byClass["Recall"],
                                     F1 = rf_base_conf$byClass["F1"],
                                     AUC = rf_base_auc))
varImpPlot(rf_base_model, sort = TRUE,n.var = 10, main = "Ten most important variables in base RF Model")

Code
summary(rf_base_model)
                Length Class  Mode     
call                4  -none- call     
type                1  -none- character
predicted       30891  factor numeric  
err.rate          150  -none- numeric  
confusion           6  -none- numeric  
votes           61782  matrix numeric  
oob.times       30891  -none- numeric  
classes             2  -none- character
importance         17  -none- numeric  
importanceSD        0  -none- NULL     
localImportance     0  -none- NULL     
proximity           0  -none- NULL     
ntree               1  -none- numeric  
mtry                1  -none- numeric  
forest             14  -none- list     
y               30891  factor numeric  
test                0  -none- NULL     
inbag               0  -none- NULL     
terms               3  terms  call     
Code
rf_base_conf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  8897  824
       yes  240  336
                                          
               Accuracy : 0.8967          
                 95% CI : (0.8906, 0.9025)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 0.001308        
                                          
                  Kappa : 0.3376          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9737          
            Specificity : 0.2897          
         Pos Pred Value : 0.9152          
         Neg Pred Value : 0.5833          
             Prevalence : 0.8873          
         Detection Rate : 0.8640          
   Detection Prevalence : 0.9441          
      Balanced Accuracy : 0.6317          
                                          
       'Positive' Class : no              
                                          
Code
# Baseline Model 3: AdaBoost (default), 50 weak learners
ab_base_model <- boosting(y ~ ., data = trainData, boos = TRUE, mfinal = 50)

# Make predictions
ab_base_pred <- predict(ab_base_model, testData)
ab_base_prob <- ab_base_pred$prob[,2]  # Probability for ROC

# Ensure predicted class levels match actual class levels
ab_base_pred$class <- factor(ab_base_pred$class, levels = levels(testData$y))

# Evaluate Adaboost model
ab_base_conf <- confusionMatrix(ab_base_pred$class, testData$y)
ab_base_auc <- auc(testData$y, ab_base_prob)
# Print the confusion matrix
print(ab_base_conf)
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  8948  837
       yes  189  323
                                          
               Accuracy : 0.9004          
                 95% CI : (0.8944, 0.9061)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 1.167e-05       
                                          
                  Kappa : 0.3409          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9793          
            Specificity : 0.2784          
         Pos Pred Value : 0.9145          
         Neg Pred Value : 0.6309          
             Prevalence : 0.8873          
         Detection Rate : 0.8690          
   Detection Prevalence : 0.9503          
      Balanced Accuracy : 0.6289          
                                          
       'Positive' Class : no              
                                          
Code
# Store results
results <- rbind(results, data.frame(Model = "Adaboost", Experiment = 0, 
                              Accuracy =  ab_base_conf$overall["Accuracy"],
                              Precision = ab_base_conf$byClass["Precision"],
                              Recall =    ab_base_conf$byClass["Recall"],
                              F1 = ab_base_conf$byClass["F1"],
                              AUC = ab_base_auc))
Code
rownames(results) <- NULL
print(results)
          Model Experiment  Accuracy Precision    Recall        F1       AUC
1 Decision Tree          0 0.9019132 0.9072051 0.9908066 0.9471647 0.7137260
2 Random Forest          0 0.8966689 0.9152351 0.9737332 0.9435783 0.7715372
3      Adaboost          0 0.9003593 0.9144609 0.9793149 0.9457774 0.7986919

Evaluation Metrics:

Accuracy: The percentage of total predictions that were correct, both “yes” (subscribed) and “no” (not subscribed).

Precision: Out of all the times the model predicted someone would subscribe to a term deposit (“yes”), how many actually did.

Recall: Out of all the people who actually subscribed, how many the model correctly predicted as “yes”.

F1 Score: A balanced score that combines precision and recall; useful when you want to weigh both equally.

AUC: Stands for “Area Under the Curve.” It measures how well the model can distinguish between those who subscribed and those who didn’t. The closer it is to 1, the better the model is at separating the two groups.

ROC: Stands for “Receiver Operating Characteristic” curve, a graph that shows how the model’s true positive rate and false positive rate change at different classification thresholds.
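
One caveat when reading the results table: caret’s confusionMatrix output above treats “no” as the positive class, so the stored Precision and Recall values are computed from the “no” side. For intuition, the definitions can be applied by hand to the baseline decision tree’s confusion matrix, this time treating “yes” (subscribed) as the positive class:

Code
# Worked example using the baseline decision tree's confusion matrix (counts from above),
# with "yes" (a subscription) taken as the positive class for illustration.
TP <- 234    # predicted yes, actually yes
FP <- 84     # predicted yes, actually no
FN <- 926    # predicted no,  actually yes
TN <- 9053   # predicted no,  actually no

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                 # ~0.902
precision <- TP / (TP + FP)                                  # ~0.736: share of predicted "yes" that were right
recall    <- TP / (TP + FN)                                  # ~0.202: share of actual "yes" that were found
f1        <- 2 * precision * recall / (precision + recall)   # ~0.317
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)

# The corresponding ROC curve can be drawn with, e.g., plot(pROC::roc(testData$y, tree_base_prob))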

Business Case

Cases: 41,188 total, of which only ~11% subscribed

Since the dataset is highly imbalanced—about 88% of the cases are “no” and only 11% are “yes”—accuracy alone isn’t a reliable measure. A model could predict “no” for everyone and still look 89% accurate. From a business standpoint, it’s more important to avoid falsely labeling someone as a subscriber when they’re not, since that would waste valuable marketing resources on people who aren’t likely to subscribe. Because of that, I decided to prioritize precision over recall. Precision helps ensure that when the model does predict “yes,” it’s more likely to be correct—so the marketing effort is better targeted and more efficient.
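
As a quick check of that claim, the majority-class baseline can be read straight off the target variable:

Code
# Class distribution of the target: roughly 88.7% "no" and 11.3% "yes"
prop.table(table(bank_data$y))

# Accuracy of a model that always predicts "no" (the no-information rate)
mean(bank_data$y == "no")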

Experiments

All experiments are compared against the baseline (Experiment 0) values in the results data frame shown above.

Decision Tree (CART)

H1: Increasing the tree depth (maxdepth = 10) and allowing fine splits (minsplit = 2, minbucket = 100) will improve precision by enabling the model to capture more specific patterns associated with positive cases.

H2: Limiting the tree depth (maxdepth = 5) while keeping minsplit = 2 and minbucket = 100 will increase recall by generalizing to capture more “yes” cases, but it may reduce precision due to an increase in false positives.

H3: Applying SMOTE to balance the training set will improve the model’s ability to learn from the minority class. This is expected to lead to higher accuracy and precision, particularly by reducing false positives and making the model more selective when predicting “yes.”

Random Forest

H4: Increasing the number of trees (ntree = 1000) will improve precision by stabilizing predictions and reducing variance.

H5: Reducing the number of predictors considered at each split (mtry = √p) will reduce overfitting and improve precision by limiting the impact of noisy or irrelevant features.

H6: Training the model on a SMOTE-balanced dataset is expected to improve precision by exposing the model to a more balanced distribution, allowing it to better distinguish true positives.

AdaBoost

  • H7: Increasing the number of boosting iterations (mfinal = 150) will improve recall by allowing the model to focus more effectively on hard-to-classify “yes” cases.

  • H8: Limiting the depth of the base learners (e.g., shallow trees or stumps) will improve precision by preventing overfitting, especially on minority class examples (a minimal stump configuration is sketched after this list).

  • H9: Training AdaBoost on a SMOTE-balanced, one-hot encoded dataset is expected to improve both precision and accuracy by helping the model better learn patterns in the minority class and reduce false positives.
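
For H8, a true stump configuration would restrict each base learner to a single split. A minimal sketch of that setup (not run here; the experiment reported below uses maxdepth = 5 instead):

Code
# Hedged sketch of AdaBoost with decision stumps (maxdepth = 1 base learners), per H8
stump_control <- rpart.control(maxdepth = 1)
ab_stumps <- boosting(y ~ ., data = trainData, boos = TRUE, mfinal = 50,
                      control = stump_control)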

Models

Code
# Experiment 1: Decision Tree with maxdepth = 10, minsplit = 2, minbucket = 50
# Hypothesis (H1): this will increase precision over the baseline

control_params <- rpart.control(minsplit = 2,
                                minbucket = 50,
                                cp = 0,
                                maxdepth = 10)

tree_e1 <- rpart(y ~ ., data = trainData, method = "class", control = control_params)

# Make predictions
tree_e1_pred <- predict(tree_e1, testData, type = "class")
tree_e1_prob <- predict(tree_e1, testData, type = "prob")[,2] 
tree_e1_conf <- confusionMatrix(tree_e1_pred, testData$y)
tree_e1_auc <- auc(testData$y, tree_e1_prob)

# Store results
results <- rbind(results, data.frame(Model = "Decision Tree MaxDepth 10", Experiment = 1,
                                     Accuracy = tree_e1_conf$overall["Accuracy"],
                                     Precision = tree_e1_conf$byClass["Precision"],
                                     Recall = tree_e1_conf$byClass["Recall"],
                                     F1 = tree_e1_conf$byClass["F1"],
                                     AUC = tree_e1_auc))
#plot the tree
#rpart.plot(tree_e1, main = "Decision Tree experiment 1", cex = .55)
plotcp(tree_e1)

Code
##summary(tree_e1)
print(tree_e1_conf)
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  8947  839
       yes  190  321
                                          
               Accuracy : 0.9001          
                 95% CI : (0.8941, 0.9058)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 1.786e-05       
                                          
                  Kappa : 0.3386          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9792          
            Specificity : 0.2767          
         Pos Pred Value : 0.9143          
         Neg Pred Value : 0.6282          
             Prevalence : 0.8873          
         Detection Rate : 0.8689          
   Detection Prevalence : 0.9504          
      Balanced Accuracy : 0.6280          
                                          
       'Positive' Class : no              
                                          
Code
## Experiment 2: Decision Tree with maxdepth = 5, minsplit = 2, minbucket = 100
# Hypothesis (H2): limiting depth will increase recall but may reduce precision

control_params2 <- rpart.control(minsplit = 2,
                                 minbucket = 100,
                                 cp = 0,
                                 maxdepth = 5)

tree_e2 <- rpart(y ~ ., data = trainData, method = "class", control = control_params2)

# Make predictions
tree_e2_pred <- predict(tree_e2, testData, type = "class")
tree_e2_prob <- predict(tree_e2, testData, type = "prob")[,2] 
tree_e2_conf <- confusionMatrix(tree_e2_pred, testData$y)
tree_e2_auc <- auc(testData$y,  tree_e2_prob)

# Store results
results <- rbind(results, data.frame(Model = "Decision Tree MaxDepth 5", Experiment = 2,
                                     Accuracy = tree_e2_conf$overall["Accuracy"],
                                     Precision = tree_e2_conf$byClass["Precision"],
                                     Recall = tree_e2_conf$byClass["Recall"],
                                     F1 = tree_e2_conf$byClass["F1"],
                                     AUC = tree_e2_auc))
#plot the tree
rpart.plot(tree_e2, main = "Decision Tree Experiment 2", cex = .55)

Code
plotcp(tree_e2)

Code
summary(tree_e2)
Call:
rpart(formula = y ~ ., data = trainData, method = "class", control = control_params2)
  n= 30891 

           CP nsplit rel error    xerror       xstd
1 0.050431034      0 1.0000000 1.0000000 0.01596823
2 0.001915709      2 0.8991379 0.8991379 0.01523817
3 0.000000000      5 0.8933908 0.8954023 0.01521005

Variable importance
   nr.employed      euribor3m  cons.conf.idx cons.price.idx          month 
            25             23             17             12             10 
         pdays       poutcome 
             9              4 

Node number 1: 30891 observations,    complexity param=0.05043103
  predicted class=no   expected loss=0.1126542  P(node) =1
    class counts: 27411  3480
   probabilities: 0.887 0.113 
  left son=2 (27177 obs) right son=3 (3714 obs)
  Primary splits:
      nr.employed   < 5087.65 to the right, improve=911.9415, (0 missing)
      euribor3m     < 1.2395  to the right, improve=834.9867, (0 missing)
      pdays         splits as  LR,          improve=629.0750, (0 missing)
      poutcome      splits as  LLR,         improve=589.4370, (0 missing)
      cons.conf.idx < -35.45  to the left,  improve=412.6504, (0 missing)
  Surrogate splits:
      euribor3m      < 1.2395  to the right, agree=0.984, adj=0.870, (0 split)
      cons.conf.idx  < -35.45  to the left,  agree=0.944, adj=0.537, (0 split)
      cons.price.idx < 92.7345 to the right, agree=0.934, adj=0.451, (0 split)
      month          splits as  LLRLLLLLRR,  agree=0.912, adj=0.264, (0 split)
      pdays          splits as  LR,          agree=0.903, adj=0.193, (0 split)

Node number 2: 27177 observations,    complexity param=0.001915709
  predicted class=no   expected loss=0.0677411  P(node) =0.8797708
    class counts: 25336  1841
   probabilities: 0.932 0.068 
  left son=4 (25066 obs) right son=5 (2111 obs)
  Primary splits:
      month          splits as  RLLLLRLLR-,  improve=103.22210, (0 missing)
      cons.conf.idx  < -46.65  to the right, improve= 88.09483, (0 missing)
      euribor3m      < 3.1675  to the right, improve= 64.99114, (0 missing)
      cons.price.idx < 93.1375 to the right, improve= 64.78781, (0 missing)
      nr.employed    < 5183.65 to the right, improve= 64.78781, (0 missing)
  Surrogate splits:
      cons.conf.idx  < -46.65  to the right, agree=0.998, adj=0.975, (0 split)
      cons.price.idx < 92.868  to the right, agree=0.930, adj=0.096, (0 split)
      age            < 62.5    to the left,  agree=0.925, adj=0.036, (0 split)
      euribor3m      < 4.985   to the left,  agree=0.923, adj=0.006, (0 split)

Node number 3: 3714 observations,    complexity param=0.05043103
  predicted class=no   expected loss=0.4413032  P(node) =0.1202292
    class counts:  2075  1639
   probabilities: 0.559 0.441 
  left son=6 (2787 obs) right son=7 (927 obs)
  Primary splits:
      pdays          splits as  LR,          improve=151.97720, (0 missing)
      poutcome       splits as  LLR,         improve=149.17390, (0 missing)
      contact        splits as  RL,          improve= 31.80034, (0 missing)
      nr.employed    < 5049.85 to the right, improve= 31.70430, (0 missing)
      cons.price.idx < 93.166  to the left,  improve= 29.14529, (0 missing)
  Surrogate splits:
      poutcome splits as  LLR, agree=0.974, adj=0.894, (0 split)

Node number 4: 25066 observations
  predicted class=no   expected loss=0.05509455  P(node) =0.8114338
    class counts: 23685  1381
   probabilities: 0.945 0.055 

Node number 5: 2111 observations,    complexity param=0.001915709
  predicted class=no   expected loss=0.2179062  P(node) =0.06833706
    class counts:  1651   460
   probabilities: 0.782 0.218 
  left son=10 (1848 obs) right son=11 (263 obs)
  Primary splits:
      month          splits as  L----R--R-,   improve=39.80321, (0 missing)
      euribor3m      < 1.504   to the left,   improve=39.80321, (0 missing)
      day_of_week    splits as  LLRRR,        improve=38.57510, (0 missing)
      job            splits as  RLLRLRRLRRRR, improve=28.82942, (0 missing)
      cons.price.idx < 92.959  to the right,  improve=24.28616, (0 missing)
  Surrogate splits:
      euribor3m      < 1.504   to the left,  agree=1.000, adj=1.000, (0 split)
      cons.price.idx < 92.959  to the right, agree=0.975, adj=0.802, (0 split)
      cons.conf.idx  < -48.55  to the right, agree=0.975, adj=0.802, (0 split)
      nr.employed    < 5147.45 to the left,  agree=0.900, adj=0.198, (0 split)
      age            < 86.5    to the left,  agree=0.881, adj=0.042, (0 split)

Node number 6: 2787 observations
  predicted class=no   expected loss=0.3588088  P(node) =0.09022045
    class counts:  1787  1000
   probabilities: 0.641 0.359 

Node number 7: 927 observations
  predicted class=yes  expected loss=0.3106796  P(node) =0.03000874
    class counts:   288   639
   probabilities: 0.311 0.689 

Node number 10: 1848 observations
  predicted class=no   expected loss=0.1812771  P(node) =0.05982325
    class counts:  1513   335
   probabilities: 0.819 0.181 

Node number 11: 263 observations,    complexity param=0.001915709
  predicted class=no   expected loss=0.4752852  P(node) =0.008513807
    class counts:   138   125
   probabilities: 0.525 0.475 
  left son=22 (113 obs) right son=23 (150 obs)
  Primary splits:
      day_of_week splits as  RLLRR,        improve=5.8306250, (0 missing)
      job         splits as  RLRLLLLRRLLR, improve=2.1341050, (0 missing)
      education   splits as  LLLR-LRL,     improve=1.3445820, (0 missing)
      campaign    < 1.5     to the right,  improve=0.4601346, (0 missing)
      euribor3m   < 1.7145  to the left,   improve=0.4415975, (0 missing)
  Surrogate splits:
      euribor3m < 1.7415  to the right,  agree=0.631, adj=0.142, (0 split)
      age       < 26.5    to the left,   agree=0.608, adj=0.088, (0 split)
      job       splits as  RRRRLRRLLLRR, agree=0.605, adj=0.080, (0 split)
      campaign  < 2.5     to the right,  agree=0.605, adj=0.080, (0 split)
      education splits as  RLRR-RRR,     agree=0.586, adj=0.035, (0 split)

Node number 22: 113 observations
  predicted class=no   expected loss=0.3539823  P(node) =0.003658023
    class counts:    73    40
   probabilities: 0.646 0.354 

Node number 23: 150 observations
  predicted class=yes  expected loss=0.4333333  P(node) =0.004855783
    class counts:    65    85
   probabilities: 0.433 0.567 
Code
## Experiment 3: Decision Tree with maxdepth = 5, minsplit = 2, minbucket = 100 (SMOTE)
# Hypothesis (H3): Using SMOTE to balance the training set will improve precision and accuracy by giving the model better representation of the minority class.

control_params3 <- rpart.control(minsplit = 2,   
                                 minbucket = 100,  
                                 cp = 0,     
                                 maxdepth = 5)

# Train decision tree on SMOTE-balanced training data
tree_e3 <- rpart(y ~ ., data = trainData_smote, method = "class", control = control_params3)

# Make predictions on the one-hot encoded test set
tree_e3_pred <- predict(tree_e3, newdata = x_test_encoded, type = "class")
tree_e3_prob <- predict(tree_e3, newdata = x_test_encoded, type = "prob")[,2]

tree_e3_conf <- confusionMatrix(tree_e3_pred, testData$y)
tree_e3_auc <- auc(testData$y, tree_e3_prob)

# Store results
results <- rbind(results, data.frame(Model = "Decision Tree SMOTE MaxDepth 5", Experiment = 3, 
                      Accuracy = tree_e3_conf$overall["Accuracy"],
                      Precision = tree_e3_conf$byClass["Precision"],
                      Recall = tree_e3_conf$byClass["Recall"],
                      F1 = tree_e3_conf$byClass["F1"],
                      AUC = tree_e3_auc))

# Plot the tree
rpart.plot(tree_e3, main = "Decision Tree - Experiment 3 (SMOTE)", cex = .55)

Code
plotcp(tree_e3)

Code
summary(tree_e3)
Call:
rpart(formula = y ~ ., data = trainData_smote, method = "class", 
    control = control_params3)
  n= 51771 

            CP nsplit rel error    xerror        xstd
1 0.3857963875      0 1.0000000 1.0000000 0.004662089
2 0.0743021346      1 0.6142036 0.6142036 0.004234004
3 0.0363300493      2 0.5399015 0.5399425 0.004066181
4 0.0064039409      4 0.4672414 0.4673235 0.003868545
5 0.0055008210      5 0.4608374 0.4565271 0.003836027
6 0.0042898194      8 0.4409688 0.4430624 0.003794250
7 0.0008347017     10 0.4323892 0.4328407 0.003761603
8 0.0006157635     13 0.4298851 0.4330460 0.003762267
9 0.0000000000     14 0.4292693 0.4322250 0.003759611

Variable importance
        nr.employed           euribor3m       cons.conf.idx      cons.price.idx 
                 20                  20                  16                  11 
             pdays1      defaultunknown     poutcomesuccess            monthmar 
                  9                   8                   8                   1 
     jobblue.collar poutcomenonexistent    contacttelephone            monthmay 
                  1                   1                   1                   1 

Node number 1: 51771 observations,    complexity param=0.3857964
  predicted class=no   expected loss=0.4705337  P(node) =1
    class counts: 27411 24360
   probabilities: 0.529 0.471 
  left son=2 (38223 obs) right son=3 (13548 obs)
  Primary splits:
      nr.employed         < 5087.65      to the right, improve=5196.991, (0 missing)
      euribor3m           < 3.1675       to the right, improve=4781.745, (0 missing)
      pdays1              < 4.740583e-05 to the left,  improve=3565.832, (0 missing)
      poutcomesuccess     < 4.740583e-05 to the left,  improve=3320.322, (0 missing)
      poutcomenonexistent < 0.9998086    to the right, improve=3015.181, (0 missing)
  Surrogate splits:
      euribor3m       < 1.243815     to the right, agree=0.970, adj=0.887, (0 split)
      cons.conf.idx   < -36.09156    to the left,  agree=0.875, adj=0.522, (0 split)
      cons.price.idx  < 92.7345      to the right, agree=0.848, adj=0.420, (0 split)
      pdays1          < 4.740583e-05 to the left,  agree=0.843, adj=0.402, (0 split)
      poutcomesuccess < 4.740583e-05 to the left,  agree=0.836, adj=0.372, (0 split)

Node number 2: 38223 observations,    complexity param=0.07430213
  predicted class=no   expected loss=0.337153  P(node) =0.7383091
    class counts: 25336 12887
   probabilities: 0.663 0.337 
  left son=4 (33153 obs) right son=5 (5070 obs)
  Primary splits:
      cons.conf.idx  < -46.20037    to the right, improve=1362.1800, (0 missing)
      nr.employed    < 5189.733     to the right, improve= 982.5063, (0 missing)
      euribor3m      < 3.1675       to the right, improve= 981.5473, (0 missing)
      cons.price.idx < 93.1843      to the right, improve= 979.4715, (0 missing)
      defaultunknown < 0.9997251    to the right, improve= 796.2174, (0 missing)
  Surrogate splits:
      monthmar            < 0.005552286  to the left,  agree=0.888, adj=0.157, (0 split)
      cons.price.idx      < 92.89293     to the right, agree=0.886, adj=0.138, (0 split)
      age                 < 60.01999     to the left,  agree=0.876, adj=0.063, (0 split)
      educationilliterate < 0.1124237    to the left,  agree=0.867, adj=0.001, (0 split)

Node number 3: 13548 observations,    complexity param=0.006403941
  predicted class=yes  expected loss=0.1531591  P(node) =0.2616909
    class counts:  2075 11473
   probabilities: 0.153 0.847 
  left son=6 (556 obs) right son=7 (12992 obs)
  Primary splits:
      contacttelephone    < 0.9994507    to the right, improve=275.1638, (0 missing)
      pdays1              < 4.740583e-05 to the left,  improve=262.8417, (0 missing)
      poutcomesuccess     < 4.740583e-05 to the left,  improve=247.5177, (0 missing)
      day_of_weekmon      < 0.9996842    to the right, improve=186.0594, (0 missing)
      poutcomenonexistent < 0.9998086    to the right, improve=158.9864, (0 missing)

Node number 4: 33153 observations,    complexity param=0.03633005
  predicted class=no   expected loss=0.2849516  P(node) =0.6403778
    class counts: 23706  9447
   probabilities: 0.715 0.285 
  left son=8 (6706 obs) right son=9 (26447 obs)
  Primary splits:
      defaultunknown   < 0.9997251    to the right, improve=443.2790, (0 missing)
      contacttelephone < 0.9989294    to the right, improve=335.0333, (0 missing)
      housingyes       < 0.9996808    to the right, improve=309.6382, (0 missing)
      day_of_weekthu   < 0.9994674    to the right, improve=281.4864, (0 missing)
      jobadmin.        < 0.0003392689 to the left,  improve=261.4274, (0 missing)
  Surrogate splits:
      jobunknown < 0.9904955    to the right, agree=0.798, adj=0.003, (0 split)
      campaign   < 42.5         to the right, agree=0.798, adj=0.000, (0 split)

Node number 5: 5070 observations,    complexity param=0.005500821
  predicted class=yes  expected loss=0.321499  P(node) =0.09793127
    class counts:  1630  3440
   probabilities: 0.321 0.679 
  left son=10 (3453 obs) right son=11 (1617 obs)
  Primary splits:
      cons.price.idx < 93.07496     to the right, improve=294.7461, (0 missing)
      euribor3m      < 1.404961     to the right, improve=290.9597, (0 missing)
      day_of_weekmon < 0.9991537    to the right, improve=229.4181, (0 missing)
      defaultunknown < 0.9972832    to the right, improve=204.2945, (0 missing)
      monthmay       < 0.0002164142 to the left,  improve=201.9250, (0 missing)
  Surrogate splits:
      monthmay      < 0.0002164142 to the left,  agree=0.843, adj=0.506, (0 split)
      cons.conf.idx < -47.09981    to the left,  agree=0.843, adj=0.506, (0 split)
      monthmar      < 0.005552286  to the left,  agree=0.838, adj=0.494, (0 split)
      euribor3m     < 1.49805      to the left,  agree=0.828, adj=0.460, (0 split)
      age           < 79.00307     to the left,  agree=0.691, adj=0.030, (0 split)

Node number 6: 556 observations,    complexity param=0.0006157635
  predicted class=no   expected loss=0.3597122  P(node) =0.0107396
    class counts:   356   200
   probabilities: 0.640 0.360 
  left son=12 (367 obs) right son=13 (189 obs)
  Primary splits:
      poutcomenonexistent < 0.9881113    to the right, improve=18.548220, (0 missing)
      euribor3m           < 0.7155       to the right, improve=15.650110, (0 missing)
      cons.conf.idx       < -36.15       to the left,  improve= 8.270934, (0 missing)
      nr.employed         < 5000.15      to the left,  improve= 7.568426, (0 missing)
      cons.price.idx      < 93.9515      to the right, improve= 7.568426, (0 missing)
  Surrogate splits:
      pdays1          < 0.04194645   to the left,  agree=0.824, adj=0.481, (0 split)
      poutcomesuccess < 0.04194645   to the left,  agree=0.809, adj=0.439, (0 split)
      euribor3m       < 0.6547767    to the right, agree=0.682, adj=0.063, (0 split)
      monthmar        < 0.9905503    to the left,  agree=0.667, adj=0.021, (0 split)
      age             < 20.5         to the right, agree=0.664, adj=0.011, (0 split)

Node number 7: 12992 observations,    complexity param=0.0008347017
  predicted class=yes  expected loss=0.1323122  P(node) =0.2509513
    class counts:  1719 11273
   probabilities: 0.132 0.868 
  left son=14 (6861 obs) right son=15 (6131 obs)
  Primary splits:
      pdays1              < 4.740583e-05 to the left,  improve=184.96480, (0 missing)
      poutcomesuccess     < 4.740583e-05 to the left,  improve=175.65810, (0 missing)
      day_of_weekmon      < 0.9996842    to the right, improve=143.03820, (0 missing)
      poutcomenonexistent < 0.9998086    to the right, improve= 96.98513, (0 missing)
      loanyes             < 0.9996842    to the right, improve= 69.48569, (0 missing)
  Surrogate splits:
      poutcomesuccess     < 4.740583e-05 to the left,  agree=0.966, adj=0.928, (0 split)
      poutcomenonexistent < 0.9990546    to the right, agree=0.814, adj=0.606, (0 split)
      cons.price.idx      < 93.166       to the left,  agree=0.651, adj=0.261, (0 split)
      nr.employed         < 5013.1       to the right, agree=0.651, adj=0.261, (0 split)
      euribor3m           < 0.7178845    to the right, agree=0.574, adj=0.098, (0 split)

Node number 8: 6706 observations,    complexity param=0.004289819
  predicted class=no   expected loss=0.1225768  P(node) =0.129532
    class counts:  5884   822
   probabilities: 0.877 0.123 
  left son=16 (4038 obs) right son=17 (2668 obs)
  Primary splits:
      jobblue.collar < 0.0009973011 to the left,  improve=26.16184, (0 missing)
      day_of_weekwed < 0.9941173    to the right, improve=17.90126, (0 missing)
      cons.conf.idx  < -39.1        to the right, improve=16.68515, (0 missing)
      euribor3m      < 4.866449     to the left,  improve=15.93546, (0 missing)
      housingyes     < 0.9984455    to the right, improve=13.78131, (0 missing)
  Surrogate splits:
      educationbasic.9y < 0.001023148  to the left,  agree=0.677, adj=0.189, (0 split)
      educationbasic.6y < 0.003896569  to the left,  agree=0.649, adj=0.117, (0 split)
      euribor3m         < 1.349        to the right, agree=0.615, adj=0.032, (0 split)
      cons.price.idx    < 93.0465      to the right, agree=0.615, adj=0.031, (0 split)
      cons.conf.idx     < -44.45       to the right, agree=0.615, adj=0.031, (0 split)

Node number 9: 26447 observations,    complexity param=0.03633005
  predicted class=no   expected loss=0.3261239  P(node) =0.5108458
    class counts: 17822  8625
   probabilities: 0.674 0.326 
  left son=18 (24677 obs) right son=19 (1770 obs)
  Primary splits:
      defaultunknown   < 0.0007670226 to the left,  improve=1722.8500, (0 missing)
      housingyes       < 0.9996808    to the right, improve= 305.5411, (0 missing)
      day_of_weekthu   < 0.9994674    to the right, improve= 293.4515, (0 missing)
      contacttelephone < 0.9989294    to the right, improve= 266.4293, (0 missing)
      loanyes          < 0.9989104    to the right, improve= 240.3539, (0 missing)

Node number 10: 3453 observations,    complexity param=0.005500821
  predicted class=yes  expected loss=0.4381697  P(node) =0.06669757
    class counts:  1513  1940
   probabilities: 0.438 0.562 
  left son=20 (2436 obs) right son=21 (1017 obs)
  Primary splits:
      euribor3m                  < 1.404961     to the right, improve=225.8161, (0 missing)
      day_of_weekthu             < 0.002713774  to the left,  improve=192.8775, (0 missing)
      day_of_weekmon             < 0.9980373    to the right, improve=178.9132, (0 missing)
      educationuniversity.degree < 0.003769935  to the left,  improve=147.3356, (0 missing)
      defaultunknown             < 0.9943578    to the right, improve=137.9734, (0 missing)
  Surrogate splits:
      day_of_weekthu  < 0.1971073    to the left,  agree=0.799, adj=0.319, (0 split)
      pdays1          < 0.0005789162 to the left,  agree=0.738, adj=0.112, (0 split)
      poutcomesuccess < 0.0005789162 to the left,  agree=0.734, adj=0.098, (0 split)
      age             < 78.07475     to the left,  agree=0.707, adj=0.004, (0 split)
      jobunknown      < 0.02106808   to the left,  agree=0.706, adj=0.003, (0 split)

Node number 11: 1617 observations
  predicted class=yes  expected loss=0.07235622  P(node) =0.0312337
    class counts:   117  1500
   probabilities: 0.072 0.928 

Node number 12: 367 observations
  predicted class=no   expected loss=0.26703  P(node) =0.007088911
    class counts:   269    98
   probabilities: 0.733 0.267 

Node number 13: 189 observations
  predicted class=yes  expected loss=0.4603175  P(node) =0.003650692
    class counts:    87   102
   probabilities: 0.460 0.540 

Node number 14: 6861 observations,    complexity param=0.0008347017
  predicted class=yes  expected loss=0.2120682  P(node) =0.1325259
    class counts:  1455  5406
   probabilities: 0.212 0.788 
  left son=28 (1051 obs) right son=29 (5810 obs)
  Primary splits:
      poutcomenonexistent < 0.0001533917 to the left,  improve=160.33910, (0 missing)
      day_of_weekmon      < 0.9996842    to the right, improve=116.82290, (0 missing)
      loanyes             < 0.9996842    to the right, improve= 77.15317, (0 missing)
      housingyes          < 0.9996842    to the right, improve= 76.46780, (0 missing)
      contacttelephone    < 0.0002215477 to the left,  improve= 72.01725, (0 missing)
  Surrogate splits:
      age       < 17.17132     to the left,  agree=0.847, adj=0.002, (0 split)
      euribor3m < 0.6350494    to the left,  agree=0.847, adj=0.002, (0 split)

Node number 15: 6131 observations
  predicted class=yes  expected loss=0.04305986  P(node) =0.1184254
    class counts:   264  5867
   probabilities: 0.043 0.957 

Node number 16: 4038 observations
  predicted class=no   expected loss=0.08667657  P(node) =0.07799733
    class counts:  3688   350
   probabilities: 0.913 0.087 

Node number 17: 2668 observations,    complexity param=0.004289819
  predicted class=no   expected loss=0.1769115  P(node) =0.05153464
    class counts:  2196   472
   probabilities: 0.823 0.177 
  left son=34 (2459 obs) right son=35 (209 obs)
  Primary splits:
      jobblue.collar       < 0.9957468    to the right, improve=307.25330, (0 missing)
      educationhigh.school < 0.01398942   to the left,  improve= 49.65009, (0 missing)
      day_of_weekwed       < 0.9941173    to the right, improve= 18.55342, (0 missing)
      cons.conf.idx        < -39.1        to the right, improve= 14.78333, (0 missing)
      educationbasic.9y    < 0.9990027    to the right, improve= 14.53333, (0 missing)
  Surrogate splits:
      jobadmin.     < 0.01807752   to the left,  agree=0.940, adj=0.230, (0 split)
      jobtechnician < 0.07014193   to the left,  agree=0.937, adj=0.191, (0 split)
      jobservices   < 0.01398942   to the left,  agree=0.933, adj=0.148, (0 split)
      jobmanagement < 0.02856058   to the left,  agree=0.930, adj=0.100, (0 split)
      jobretired    < 0.004253234  to the left,  agree=0.929, adj=0.096, (0 split)

Node number 18: 24677 observations
  predicted class=no   expected loss=0.277789  P(node) =0.4766568
    class counts: 17822  6855
   probabilities: 0.722 0.278 

Node number 19: 1770 observations
  predicted class=yes  expected loss=0  P(node) =0.03418902
    class counts:     0  1770
   probabilities: 0.000 1.000 

Node number 20: 2436 observations,    complexity param=0.005500821
  predicted class=no   expected loss=0.4449918  P(node) =0.04705337
    class counts:  1352  1084
   probabilities: 0.555 0.445 
  left son=40 (1096 obs) right son=41 (1340 obs)
  Primary splits:
      euribor3m                  < 1.405039     to the left,  improve=109.53540, (0 missing)
      day_of_weekwed             < 0.001711558  to the left,  improve=106.53840, (0 missing)
      educationuniversity.degree < 0.005160756  to the left,  improve=105.03320, (0 missing)
      day_of_weekmon             < 0.9980373    to the right, improve= 95.87677, (0 missing)
      age                        < 30.98329     to the right, improve= 92.25378, (0 missing)
  Surrogate splits:
      day_of_weekmon < 0.7135894    to the right, agree=0.733, adj=0.407, (0 split)
      day_of_weekthu < 0.002713774  to the left,  agree=0.700, adj=0.333, (0 split)
      day_of_weekwed < 0.001711558  to the left,  agree=0.600, adj=0.111, (0 split)
      campaign       < 2.99712      to the right, agree=0.593, adj=0.096, (0 split)
      defaultunknown < 0.9878138    to the right, agree=0.575, adj=0.055, (0 split)

Node number 21: 1017 observations
  predicted class=yes  expected loss=0.1583088  P(node) =0.0196442
    class counts:   161   856
   probabilities: 0.158 0.842 

Node number 28: 1051 observations,    complexity param=0.0008347017
  predicted class=yes  expected loss=0.4662226  P(node) =0.02030094
    class counts:   490   561
   probabilities: 0.466 0.534 
  left son=56 (101 obs) right son=57 (950 obs)
  Primary splits:
      loanyes                    < 0.9985629    to the right, improve=25.19313, (0 missing)
      day_of_weekmon             < 0.9958733    to the right, improve=15.72747, (0 missing)
      maritalmarried             < 0.008247433  to the left,  improve=15.44133, (0 missing)
      educationuniversity.degree < 0.9939607    to the right, improve=13.35216, (0 missing)
      euribor3m                  < 0.8829841    to the right, improve=13.20157, (0 missing)

Node number 29: 5810 observations
  predicted class=yes  expected loss=0.1660929  P(node) =0.112225
    class counts:   965  4845
   probabilities: 0.166 0.834 

Node number 34: 2459 observations
  predicted class=no   expected loss=0.106954  P(node) =0.04749763
    class counts:  2196   263
   probabilities: 0.893 0.107 

Node number 35: 209 observations
  predicted class=yes  expected loss=0  P(node) =0.004037009
    class counts:     0   209
   probabilities: 0.000 1.000 

Node number 40: 1096 observations
  predicted class=no   expected loss=0.2791971  P(node) =0.02117015
    class counts:   790   306
   probabilities: 0.721 0.279 

Node number 41: 1340 observations
  predicted class=yes  expected loss=0.419403  P(node) =0.02588322
    class counts:   562   778
   probabilities: 0.419 0.581 

Node number 56: 101 observations
  predicted class=no   expected loss=0.1980198  P(node) =0.001950899
    class counts:    81    20
   probabilities: 0.802 0.198 

Node number 57: 950 observations
  predicted class=yes  expected loss=0.4305263  P(node) =0.01835004
    class counts:   409   541
   probabilities: 0.431 0.569 
Code
tree_e3_conf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  8293  531
       yes  844  629
                                         
               Accuracy : 0.8665         
                 95% CI : (0.8597, 0.873)
    No Information Rate : 0.8873         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.4025         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.9076         
            Specificity : 0.5422         
         Pos Pred Value : 0.9398         
         Neg Pred Value : 0.4270         
             Prevalence : 0.8873         
         Detection Rate : 0.8054         
   Detection Prevalence : 0.8569         
      Balanced Accuracy : 0.7249         
                                         
       'Positive' Class : no             
                                         
Code
# Random Forest Experiment 1 (H4): increase ntree to 1000
rf_1000 <- randomForest(y ~ ., data = trainData, ntree = 1000)
rf_1000_predict <- predict(rf_1000, testData)
rf_1000_prob <- predict(rf_1000, testData, type = "prob")[,2]  

# Evaluate Random Forest
rf_1000_conf <- confusionMatrix(rf_1000_predict, testData$y)
rf_1000_auc<- auc(testData$y, rf_1000_prob)


# Store results
results <- rbind(results, data.frame(Model = "Random Forest N 1000", Experiment = 1, 
                           Accuracy = rf_1000_conf$overall["Accuracy"],
                               Precision = rf_1000_conf$byClass["Precision"],
                                     Recall = rf_1000_conf$byClass["Recall"],
                                     F1 = rf_1000_conf$byClass["F1"],
                                     AUC = rf_1000_auc))
varImpPlot(rf_1000, sort = TRUE,n.var = 10, main = "Ten most important variables in  RF Model 1000")

Code
summary(rf_1000)
                Length Class  Mode     
call                4  -none- call     
type                1  -none- character
predicted       30891  factor numeric  
err.rate         3000  -none- numeric  
confusion           6  -none- numeric  
votes           61782  matrix numeric  
oob.times       30891  -none- numeric  
classes             2  -none- character
importance         17  -none- numeric  
importanceSD        0  -none- NULL     
localImportance     0  -none- NULL     
proximity           0  -none- NULL     
ntree               1  -none- numeric  
mtry                1  -none- numeric  
forest             14  -none- list     
y               30891  factor numeric  
test                0  -none- NULL     
inbag               0  -none- NULL     
terms               3  terms  call     
Code
# Calculate sqrt of total features

p <- ncol(trainData) - 1  # Excluding target variable
mtry_val <- floor(sqrt(p))  # Taking floor to ensure an integer

# Random Forest Experiment 2 (H5): reducing mtry to sqrt(p)

rf_sqrt_mtry <- randomForest(y ~ ., data = trainData, ntree = 1000, mtry = mtry_val)
rf_sqrt_mtry_predict <- predict(rf_sqrt_mtry, testData)
rf_sqrt_mtry_prob <- predict(rf_sqrt_mtry, testData, type = "prob")[,2]  

# Evaluate Random Forest
rf_sqrt_mtry_conf <- confusionMatrix(rf_sqrt_mtry_predict, testData$y)
rf_sqrt_mtry_auc <- auc(testData$y, rf_sqrt_mtry_prob)

# Store results
results <- rbind(results, data.frame(Model = "Random Forest mtry=sqrt(p)) N 1000", Experiment = 2, 
                           Accuracy = rf_sqrt_mtry_conf$overall["Accuracy"],
                            Precision = rf_sqrt_mtry_conf$byClass["Precision"],
                              Recall = rf_sqrt_mtry_conf$byClass["Recall"],
                                     F1 = rf_sqrt_mtry_conf$byClass["F1"],
                                     AUC = rf_sqrt_mtry_auc))

# Feature Importance Plot
varImpPlot(rf_sqrt_mtry, sort = TRUE, n.var = 10, main = "Top 10 Features (RF with mtry=sqrt(p))")

Code
# Summary of the Random Forest Model
summary(rf_sqrt_mtry)
                Length Class  Mode     
call                5  -none- call     
type                1  -none- character
predicted       30891  factor numeric  
err.rate         3000  -none- numeric  
confusion           6  -none- numeric  
votes           61782  matrix numeric  
oob.times       30891  -none- numeric  
classes             2  -none- character
importance         17  -none- numeric  
importanceSD        0  -none- NULL     
localImportance     0  -none- NULL     
proximity           0  -none- NULL     
ntree               1  -none- numeric  
mtry                1  -none- numeric  
forest             14  -none- list     
y               30891  factor numeric  
test                0  -none- NULL     
inbag               0  -none- NULL     
terms               3  terms  call     
Code
rf_sqrt_mtry

Call:
 randomForest(formula = y ~ ., data = trainData, ntree = 1000,      mtry = mtry_val) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of  error rate: 10.37%
Confusion matrix:
       no  yes class.error
no  26682  729  0.02659516
yes  2473 1007  0.71063218
Code
# Random Forest Experiment 3 (H6): trained on the SMOTE-balanced dataset
rf_smote_h6 <- randomForest(y ~ ., data = trainData_smote, ntree = 1000, mtry = mtry_val)

# Predict on original test set
rf_smote_h6_pred <- predict(rf_smote_h6, x_test_encoded)
rf_smote_h6_prob <- predict(rf_smote_h6, x_test_encoded, type = "prob")[, 2]


# Evaluate performance
rf_smote_h6_conf <- confusionMatrix(rf_smote_h6_pred, testData$y)
rf_smote_h6_auc <- auc(testData$y, rf_smote_h6_prob)

# Store results
results <- rbind(results, data.frame(Model = "Random Forest SMOTE, mtry=sqrt(p) N 1000", Experiment = 3, 
                                Accuracy = rf_smote_h6_conf$overall["Accuracy"],
                              Precision = rf_smote_h6_conf$byClass["Precision"],
                                     Recall = rf_smote_h6_conf$byClass["Recall"],
                                     F1 = rf_smote_h6_conf$byClass["F1"],
                                     AUC = rf_smote_h6_auc))


rf_smote_h6

Call:
 randomForest(formula = y ~ ., data = trainData_smote, ntree = 1000,      mtry = mtry_val) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of  error rate: 6.71%
Confusion matrix:
       no   yes class.error
no  26360  1051  0.03834227
yes  2423 21937  0.09946634
Code
# AdaBoost Experiment 1 (H7): increase the number of boosting iterations to mfinal = 150
ab_exp_h5 <- boosting(y ~ .,
                      data = trainData,
                      boos = TRUE,
                      mfinal = 150)

# Make predictions
ab_exp_h5_pred <- predict(ab_exp_h5, testData)
ab_exp_h5_prob <- ab_exp_h5_pred$prob[,2]

# Ensure factor levels match
ab_exp_h5_pred$class <- factor(ab_exp_h5_pred$class, levels = levels(testData$y))

# Evaluate performance
ab_exp_h5_conf <- confusionMatrix(ab_exp_h5_pred$class, testData$y)
ab_exp_h5_auc <- auc(testData$y, ab_exp_h5_prob)

# Store results
results <- rbind(results, data.frame(Model = "AdaBoost  Mfinal 150 ", Experiment = 1,
                                  Accuracy = ab_exp_h5_conf$overall["Accuracy"],
                                Precision = ab_exp_h5_conf$byClass["Precision"],
                                     Recall = ab_exp_h5_conf$byClass["Recall"],
                                     F1 = ab_exp_h5_conf$byClass["F1"],
                                     AUC = ab_exp_h5_auc))

ab_exp_h5_conf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9008  859
       yes  129  301
                                          
               Accuracy : 0.904           
                 95% CI : (0.8982, 0.9097)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 2.322e-08       
                                          
                  Kappa : 0.3383          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9859          
            Specificity : 0.2595          
         Pos Pred Value : 0.9129          
         Neg Pred Value : 0.7000          
             Prevalence : 0.8873          
         Detection Rate : 0.8748          
   Detection Prevalence : 0.9582          
      Balanced Accuracy : 0.6227          
                                          
       'Positive' Class : no              
                                          
Code
# AdaBoost Experiment 2 (H8): shallow base learners (maxdepth = 5)
ab_control <- rpart.control(maxdepth = 5)

ab_exp_h6 <- boosting(y ~ ., 
                      data = trainData, 
                      boos = TRUE, 
                      mfinal = 50,
                      control = ab_control)

# Make predictions
ab_exp_h6_pred <- predict(ab_exp_h6, testData)
ab_exp_h6_prob <- ab_exp_h6_pred$prob[,2]

# Ensure factor levels match
ab_exp_h6_pred$class <- factor(ab_exp_h6_pred$class, levels = levels(testData$y))

# Evaluate performance
ab_exp_h6_conf <- confusionMatrix(ab_exp_h6_pred$class, testData$y)
ab_exp_h6_auc <- auc(testData$y, ab_exp_h6_prob)

# Store results
results <- rbind(results, data.frame(Model = "AdaBoost (maxdepth = 5), Mfinal 50", Experiment = 2,
                                  Accuracy = ab_exp_h6_conf$overall["Accuracy"],
                                 Precision = ab_exp_h6_conf$byClass["Precision"],
                                     Recall = ab_exp_h6_conf$byClass["Recall"],
                                     F1 = ab_exp_h6_conf$byClass["F1"],
                                     AUC = ab_exp_h6_auc))


ab_exp_h6_conf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9011  867
       yes  126  293
                                          
               Accuracy : 0.9036          
                 95% CI : (0.8977, 0.9092)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 5.75e-08        
                                          
                  Kappa : 0.3311          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9862          
            Specificity : 0.2526          
         Pos Pred Value : 0.9122          
         Neg Pred Value : 0.6993          
             Prevalence : 0.8873          
         Detection Rate : 0.8751          
   Detection Prevalence : 0.9593          
      Balanced Accuracy : 0.6194          
                                          
       'Positive' Class : no              
                                          
Code
# AdaBoost Experiment 3: SMOTE-balanced, one-hot encoded data
# H9: this setup is expected to improve precision


ab_control_h9 <- rpart.control(maxdepth = 5)

# Train AdaBoost on SMOTE + encoded training data
ab_exp_h9 <- boosting(y ~ ., 
                      data = trainData_smote, 
                      boos = TRUE, 
                      mfinal = 50,
                      control = ab_control_h9)

# Predict on encoded test set
ab_exp_h9_pred <- predict(ab_exp_h9, x_test_encoded)
ab_exp_h9_prob <- ab_exp_h9_pred$prob[, 2]

# Ensure levels match
ab_exp_h9_pred$class <- factor(ab_exp_h9_pred$class, levels = levels(testData$y))

# Evaluate performance
ab_exp_h9_conf <- confusionMatrix(ab_exp_h9_pred$class, testData$y)
ab_exp_h9_auc <- auc(testData$y, ab_exp_h9_prob)

# Store results
results <- rbind(results, data.frame(Model = "AdaBoost SMOTE Mfinal 50 MAxDepth 5", Experiment = 3,
                                  Accuracy = ab_exp_h9_conf$overall["Accuracy"],
                                Precision = ab_exp_h9_conf$byClass["Precision"],
                                     Recall = ab_exp_h9_conf$byClass["Recall"],
                                     F1 = ab_exp_h9_conf$byClass["F1"],
                                     AUC = ab_exp_h9_auc))



summary(ab_exp_h9)
           Length Class   Mode     
formula         3 formula call     
trees          50 -none-  list     
weights        50 -none-  numeric  
votes      103542 -none-  numeric  
prob       103542 -none-  numeric  
class       51771 -none-  character
importance     51 -none-  numeric  
terms           3 terms   call     
call            6 -none-  call     
Code
ab_exp_h9_conf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  8794  749
       yes  343  411
                                          
               Accuracy : 0.8939          
                 95% CI : (0.8878, 0.8998)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 0.01708         
                                          
                  Kappa : 0.3739          
                                          
 Mcnemar's Test P-Value : < 2e-16         
                                          
            Sensitivity : 0.9625          
            Specificity : 0.3543          
         Pos Pred Value : 0.9215          
         Neg Pred Value : 0.5451          
             Prevalence : 0.8873          
         Detection Rate : 0.8540          
   Detection Prevalence : 0.9268          
      Balanced Accuracy : 0.6584          
                                          
       'Positive' Class : no              
                                          

Results

Code
# Keep a copy of the raw results, reset row names, and rank models by precision
results_cp <- results
rownames(results) <- NULL
prec_result <- results |>
  arrange(desc(Precision))
prec_result
                                      Model Experiment  Accuracy Precision
1            Decision Tree SMOTE MaxDepth 5          3 0.8664660 0.9398232
2  Random Forest SMOTE, mtry=sqrt(p) N 1000          3 0.8972516 0.9232059
3       AdaBoost SMOTE Mfinal 50 MAxDepth 5          3 0.8939497 0.9215132
4        Random Forest mtry=sqrt(p)) N 1000          2 0.8986112 0.9155797
5                      Random Forest N 1000          1 0.8981257 0.9155364
6                             Random Forest          0 0.8966689 0.9152351
7                                  Adaboost          0 0.9003593 0.9144609
8                     AdaBoost  Mfinal 150           1 0.9040497 0.9129421
9        AdaBoost (maxdepth = 5), Mfinal 50          2 0.9035641 0.9122292
10                 Decision Tree MaxDepth 5          2 0.9018161 0.9094115
11                            Decision Tree          0 0.9019132 0.9072051
      Recall        F1       AUC
1  0.9076283 0.9234452 0.7347030
2  0.9644303 0.9433679 0.7886479
3  0.9624603 0.9415418 0.7867911
4  0.9757032 0.9446858 0.7823131
5  0.9751560 0.9444062 0.7809573
6  0.9737332 0.9435783 0.7715372
7  0.9793149 0.9457774 0.7986919
8  0.9858816 0.9480109 0.7957804
9  0.9862099 0.9477781 0.7942832
10 0.9877421 0.9469598 0.7598044
11 0.9908066 0.9471647 0.7137260
Code
# Ensure AUC is stored as numeric for plotting
results$AUC <- as.numeric(results$AUC)

# Reshape just Precision and AUC to long format
results_long <- results %>%
  select(Model, Precision, AUC) %>%
  pivot_longer(cols = c("Precision", "AUC"), 
               names_to = "Metric", values_to = "Value")

# Plot Precision vs AUC for each model
ggplot(results_long, aes(x = Value, y = reorder(Model, Value), color = Metric)) +
  geom_point(size = 4) +
  labs(title = "Precision vs AUC by Model",
       x = "Score", y = "Model", color = "Metric") +
  theme_minimal()

Code
# Feature importance plot for random forest smote
varImpPlot(rf_smote_h6, sort = TRUE, n.var = 10, main = "Top 10 Features (RF SMOTE)")

Code
# Feature importance plot for AdaBoost experiment 2 (ab_exp_h6: maxdepth = 5, mfinal = 50)
importance_ab6 <- as.data.frame(ab_exp_h6$importance)
importance_ab6$Variable <- rownames(importance_ab6)
colnames(importance_ab6)[1] <- "Importance"

# Sort and plot
importance_ab6 %>%
  arrange(desc(Importance)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = "grey") +
  coord_flip() +
  labs(title = "Top 10 Features (AdaBoost H6)",
       x = "Variable",
       y = "Importance") +
  theme_minimal()

Summary of results

The objective of all experiments was to increase precision relative to each model’s baseline. The highest precision was achieved by the Decision Tree with max depth 5 trained on a SMOTE-balanced dataset, as proposed in Hypothesis 3 (H3). That model reached a precision of 0.9398, but with trade-offs in recall (0.9076), F1 score (0.9234), and AUC (0.7347), raising concerns about its balance and overall generalization.

In support of Hypothesis 9 (H9), the AdaBoost model trained on a SMOTE-balanced, one-hot encoded dataset showed a clear improvement in precision and solid overall performance. However, the Random Forest model trained on SMOTE data with tuned parameters proved to be the most well-rounded: it delivered high precision (0.9232), a high F1 score (0.9434), and a strong AUC (0.7886), aligning well with the business objective of improving precision while maintaining strong overall performance. The most important features in this model were the social and economic context indicators, specifically euribor3m and nr.employed.

It should also be noted that the AdaBoost runs introduced considerable compute time, especially as maxdepth was decreased and mfinal was increased. These computational costs should be factored in when selecting a model for business deployment, particularly in environments where scalability and efficiency are critical.
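
If compute cost becomes a deciding factor, it can be quantified directly by timing the boosting calls. The sketch below is a minimal example using system.time(), reusing trainData and the adabag::boosting() setup from the experiments above; the timing object names (time_m50, time_m150) are illustrative and were not part of the original analysis.

Code
# Rough timing comparison for two boosting configurations (illustrative sketch).
# Reuses trainData from the experiments above; expect long runtimes on the full data.
time_m50 <- system.time(
  boosting(y ~ ., data = trainData, boos = TRUE, mfinal = 50,
           control = rpart.control(maxdepth = 5))
)
time_m150 <- system.time(
  boosting(y ~ ., data = trainData, boos = TRUE, mfinal = 150,
           control = rpart.control(maxdepth = 5))
)
# Elapsed seconds side by side
rbind(mfinal_50 = time_m50, mfinal_150 = time_m150)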

Conclusion

Final Recommendation:

Ultimately, I would recommend integrating a class imbalance solution, such as SMOTE, and adopting an ensemble-based modeling approach to address this business case effectively. The Random Forest model, when combined with SMOTE, demonstrated strong performance and is particularly promising. It is worth further exploration and tuning to achieve even higher predictive power while maintaining the precision necessary to minimize false positives in a business context.
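
One natural follow-up would be a more systematic tuning of the SMOTE random forest. The sketch below shows one possible cross-validated grid search over mtry with caret on trainData_smote; rf_ctrl and rf_tune are illustrative names and the grid values are only a starting point, not settings used in the experiments above.

Code
# Illustrative sketch: cross-validated tuning of mtry for the SMOTE random forest.
# Assumes trainData_smote from the experiments above; caret and randomForest are loaded.
rf_ctrl <- trainControl(method = "cv", number = 3,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE)

rf_tune <- train(y ~ ., data = trainData_smote,
                 method = "rf",
                 metric = "ROC",
                 ntree = 500,
                 tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                 trControl = rf_ctrl)

# Best mtry by cross-validated ROC
rf_tune$bestTune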

Finally, it is evident that macroeconomic factors such as euribor3m and nr.employed play a significant role in predicting whether a client is likely to subscribe to a long-term deposit. These insights could inform more targeted and data-driven marketing strategies.
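
To probe these effects further, partial dependence plots from the fitted SMOTE random forest could be examined. The sketch below uses randomForest::partialPlot on rf_smote_h6 and assumes euribor3m and nr.employed are still present as numeric columns in trainData_smote; it was not part of the original experiments.

Code
# Illustrative sketch: partial dependence of the predicted "yes" class on the
# two macroeconomic indicators, using the SMOTE random forest fitted above.
par(mfrow = c(1, 2))
partialPlot(rf_smote_h6, pred.data = trainData_smote,
            x.var = "euribor3m", which.class = "yes",
            main = "Partial dependence: euribor3m")
partialPlot(rf_smote_h6, pred.data = trainData_smote,
            x.var = "nr.employed", which.class = "yes",
            main = "Partial dependence: nr.employed")
par(mfrow = c(1, 1))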

Extension: Support Vector Machines

Code
# Define training controls
ctrl <- trainControl(
  method = "cv",
  number = 3, 
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = TRUE
)
Code
svm_linear <- train(
  y ~ .,
  data = trainData,
  method = "svmLinear",
  trControl = ctrl,
  metric = "ROC",
  preProcess = c("center", "scale"),
  tuneLength = 2
)
Code
svm_rbf <- train(
  y ~ .,
  data = trainData,
  method = "svmRadial",
  trControl = ctrl,
  metric = "ROC",
  preProcess = c("center", "scale"),
  tuneLength = 2
)
line search fails -1.955118 0.3550244 9.845612e-06 1.002723e-05 -5.648176e-08 -5.442391e-08 -1.101819e-12
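
(The "line search fails" message above appears to be a convergence warning from the underlying kernlab optimizer used by svmRadial; training still completes, but it suggests the fit was numerically difficult for at least one resample or hyperparameter combination.)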
Code
# Predictions
svm_preds_linear <- predict(svm_linear, testData)

svm_preds_rbf <- predict(svm_rbf, testData)


# Confusion Matrices
cm_linear <- confusionMatrix(svm_preds_linear, testData$y, positive = "yes")

cm_rbf <- confusionMatrix(svm_preds_rbf, testData$y, positive = "yes")


# Probabilities for AUC
svm_probs_linear <- predict(svm_linear, testData, type = "prob")

svm_probs_rbf <- predict(svm_rbf, testData, type = "prob")


# AUCs
roc_linear <- roc(testData$y, svm_probs_linear$yes)

roc_rbf <- roc(testData$y, svm_probs_rbf$yes)
Code
results <- rbind(
  results,
  data.frame(Model = "SVM Linear", Experiment = 4,
             Accuracy = cm_linear$overall["Accuracy"],
             Precision = cm_linear$byClass["Precision"],
             Recall = cm_linear$byClass["Recall"],
             F1 = cm_linear$byClass["F1"],
             AUC = auc(roc_linear)),

  data.frame(Model = "SVM RBF", Experiment = 5,
             Accuracy = cm_rbf$overall["Accuracy"],
             Precision = cm_rbf$byClass["Precision"],
             Recall = cm_rbf$byClass["Recall"],
             F1 = cm_rbf$byClass["F1"],
             AUC = auc(roc_rbf))
)
Code
cm_linear
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9011  908
       yes  126  252
                                          
               Accuracy : 0.8996          
                 95% CI : (0.8936, 0.9053)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 3.559e-05       
                                          
                  Kappa : 0.2883          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.21724         
            Specificity : 0.98621         
         Pos Pred Value : 0.66667         
         Neg Pred Value : 0.90846         
             Prevalence : 0.11265         
         Detection Rate : 0.02447         
   Detection Prevalence : 0.03671         
      Balanced Accuracy : 0.60173         
                                          
       'Positive' Class : yes             
                                          
Code
cm_rbf
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9065  939
       yes   72  221
                                          
               Accuracy : 0.9018          
                 95% CI : (0.8959, 0.9075)
    No Information Rate : 0.8873          
    P-Value [Acc > NIR] : 1.203e-06       
                                          
                  Kappa : 0.2711          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.19052         
            Specificity : 0.99212         
         Pos Pred Value : 0.75427         
         Neg Pred Value : 0.90614         
             Prevalence : 0.11265         
         Detection Rate : 0.02146         
   Detection Prevalence : 0.02845         
      Balanced Accuracy : 0.59132         
                                          
       'Positive' Class : yes             
                                          
Code
svm_linear$results
  C       ROC     Sens      Spec      ROCSD      SensSD     SpecSD
1 1 0.6061883 0.985079 0.1985632 0.02103078 0.001477847 0.01053466
Code
svm_rbf$results
       sigma    C       ROC      Sens      Spec       ROCSD       SensSD
1 0.01393698 0.25 0.7133550 0.9887272 0.1719828 0.005240896 0.0003095575
2 0.01393698 0.50 0.7129381 0.9898216 0.1721264 0.003870521 0.0005015405
       SpecSD
1 0.009143622
2 0.008847589
Code
# Best tuned values
svm_linear$bestTune
  C
1 1
Code
svm_rbf$bestTune
       sigma    C
1 0.01393698 0.25
Code
kable(results)
Model Experiment Accuracy Precision Recall F1 AUC
1 Decision Tree 0 0.9019132 0.9072051 0.9908066 0.9471647 0.7137260
2 Random Forest 0 0.8966689 0.9152351 0.9737332 0.9435783 0.7715372
3 Adaboost 0 0.9003593 0.9144609 0.9793149 0.9457774 0.7986919
4 Decision Tree MaxDepth 5 2 0.9018161 0.9094115 0.9877421 0.9469598 0.7598044
5 Decision Tree SMOTE MaxDepth 5 3 0.8664660 0.9398232 0.9076283 0.9234452 0.7347030
6 Random Forest N 1000 1 0.8981257 0.9155364 0.9751560 0.9444062 0.7809573
7 Random Forest mtry=sqrt(p)) N 1000 2 0.8986112 0.9155797 0.9757032 0.9446858 0.7823131
8 Random Forest SMOTE, mtry=sqrt(p) N 1000 3 0.8972516 0.9232059 0.9644303 0.9433679 0.7886479
9 AdaBoost Mfinal 150 1 0.9040497 0.9129421 0.9858816 0.9480109 0.7957804
10 AdaBoost (maxdepth = 5), Mfinal 50 2 0.9035641 0.9122292 0.9862099 0.9477781 0.7942832
11 AdaBoost SMOTE Mfinal 50 MAxDepth 5 3 0.8939497 0.9215132 0.9624603 0.9415418 0.7867911
12 SVM Linear 4 0.8995824 0.6666667 0.2172414 0.3276983 0.6189648
13 SVM RBF 5 0.9018161 0.7542662 0.1905172 0.3041982 0.7158063
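
Both SVMs reach high accuracy but very low recall on the minority "yes" class. To counteract the imbalance, the next experiments refit radial SVMs from e1071 with class weights (yes weighted 7 to 1 relative to no), first at the default cost and then with cost = 10.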
Code
svm_weighted <- svm(y ~ ., data = trainData,
                    kernel = "radial",
                    class.weights = c("no" = 1, "yes" = 7),
                    probability = TRUE,
                    scale = TRUE)
Code
svm_weighted_c10 <- svm(y ~ ., data = trainData,
                    kernel = "radial",
                    class.weights = c("no" = 1, "yes" = 7),
                    probability = TRUE,
                    cost = 10,
                    scale = TRUE)
Code
pred_weighted <- predict(svm_weighted, newdata = testData)
cm_weighted <- confusionMatrix(pred_weighted, testData$y, positive = "yes")

# ROC AUC computed from the SVM decision values (extracted via the "decision.values" attribute)
prob_weighted <- attr(predict(svm_weighted, testData, decision.values = TRUE), "decision.values")
roc_weighted <- roc(response = testData$y, predictor = as.numeric(prob_weighted), levels = c("no", "yes"))

# Add to results
results <- rbind(results,
  data.frame(Model = "SVM Radial Weighted", Experiment = 6,
             Accuracy = cm_weighted$overall["Accuracy"],
             Precision = cm_weighted$byClass["Precision"],
             Recall = cm_weighted$byClass["Recall"],
             F1 = cm_weighted$byClass["F1"],
             AUC = auc(roc_weighted))
)
rownames(results) <- NULL
# View updated results
print(results)
                                      Model Experiment  Accuracy Precision
1                             Decision Tree          0 0.9019132 0.9072051
2                             Random Forest          0 0.8966689 0.9152351
3                                  Adaboost          0 0.9003593 0.9144609
4                  Decision Tree MaxDepth 5          2 0.9018161 0.9094115
5            Decision Tree SMOTE MaxDepth 5          3 0.8664660 0.9398232
6                      Random Forest N 1000          1 0.8981257 0.9155364
7        Random Forest mtry=sqrt(p)) N 1000          2 0.8986112 0.9155797
8  Random Forest SMOTE, mtry=sqrt(p) N 1000          3 0.8972516 0.9232059
9                     AdaBoost  Mfinal 150           1 0.9040497 0.9129421
10       AdaBoost (maxdepth = 5), Mfinal 50          2 0.9035641 0.9122292
11      AdaBoost SMOTE Mfinal 50 MAxDepth 5          3 0.8939497 0.9215132
12                               SVM Linear          4 0.8995824 0.6666667
13                                  SVM RBF          5 0.9018161 0.7542662
14                      SVM Radial Weighted          6 0.8363601 0.3672231
      Recall        F1       AUC
1  0.9908066 0.9471647 0.7137260
2  0.9737332 0.9435783 0.7715372
3  0.9793149 0.9457774 0.7986919
4  0.9877421 0.9469598 0.7598044
5  0.9076283 0.9234452 0.7347030
6  0.9751560 0.9444062 0.7809573
7  0.9757032 0.9446858 0.7823131
8  0.9644303 0.9433679 0.7886479
9  0.9858816 0.9480109 0.7957804
10 0.9862099 0.9477781 0.7942832
11 0.9624603 0.9415418 0.7867911
12 0.2172414 0.3276983 0.6189648
13 0.1905172 0.3041982 0.7158063
14 0.6258621 0.4628626 0.7854107
Code
pred_weighted10 <- predict(svm_weighted_c10, newdata = testData)
cm_weighted10 <- confusionMatrix(pred_weighted10, testData$y, positive = "yes")

# ROC AUC computed from the SVM decision values (extracted via the "decision.values" attribute)
prob_weighted10 <- attr(predict(svm_weighted_c10, testData, decision.values = TRUE), "decision.values")
roc_weighted10 <- roc(response = testData$y, predictor = as.numeric(prob_weighted10), levels = c("no", "yes"))

# Add to results
results <- rbind(results,
  data.frame(Model = "SVM Radial Weighted (C=10)", Experiment = 7,
             Accuracy = cm_weighted10$overall["Accuracy"],
             Precision = cm_weighted10$byClass["Precision"],
             Recall = cm_weighted10$byClass["Recall"],
             F1 = cm_weighted10$byClass["F1"],
             AUC = auc(roc_weighted10))
)
rownames(results) <- NULL

# View updated results
print(results)
                                      Model Experiment  Accuracy Precision
1                             Decision Tree          0 0.9019132 0.9072051
2                             Random Forest          0 0.8966689 0.9152351
3                                  Adaboost          0 0.9003593 0.9144609
4                  Decision Tree MaxDepth 5          2 0.9018161 0.9094115
5            Decision Tree SMOTE MaxDepth 5          3 0.8664660 0.9398232
6                      Random Forest N 1000          1 0.8981257 0.9155364
7        Random Forest mtry=sqrt(p)) N 1000          2 0.8986112 0.9155797
8  Random Forest SMOTE, mtry=sqrt(p) N 1000          3 0.8972516 0.9232059
9                     AdaBoost  Mfinal 150           1 0.9040497 0.9129421
10       AdaBoost (maxdepth = 5), Mfinal 50          2 0.9035641 0.9122292
11      AdaBoost SMOTE Mfinal 50 MAxDepth 5          3 0.8939497 0.9215132
12                               SVM Linear          4 0.8995824 0.6666667
13                                  SVM RBF          5 0.9018161 0.7542662
14                      SVM Radial Weighted          6 0.8363601 0.3672231
15               SVM Radial Weighted (C=10)          7 0.8567544 0.4092219
      Recall        F1       AUC
1  0.9908066 0.9471647 0.7137260
2  0.9737332 0.9435783 0.7715372
3  0.9793149 0.9457774 0.7986919
4  0.9877421 0.9469598 0.7598044
5  0.9076283 0.9234452 0.7347030
6  0.9751560 0.9444062 0.7809573
7  0.9757032 0.9446858 0.7823131
8  0.9644303 0.9433679 0.7886479
9  0.9858816 0.9480109 0.7957804
10 0.9862099 0.9477781 0.7942832
11 0.9624603 0.9415418 0.7867911
12 0.2172414 0.3276983 0.6189648
13 0.1905172 0.3041982 0.7158063
14 0.6258621 0.4628626 0.7854107
15 0.6120690 0.4905009 0.7846742
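
Class weighting trades accuracy for minority-class recall: the weighted radial SVMs lift recall to roughly 0.61 to 0.63 and AUC to about 0.785, but precision falls to 0.37 to 0.41, well below any of the tree-based models. The SVM extension therefore does not change the earlier recommendation of the SMOTE-balanced Random Forest.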