Introduction
Load packages
The data
Dataset Overview
Decision Tree Experiment 1
Decision Tree Experiment 2
Random Forest Experiment 1
Random Forest Experiment 2
AdaBoost Experiment 1
AdaBoost Experiment 2
ROC Curve Comparison
Model Performance Summary
Model Comparison & Discussion
Conclusion & Recommendations
In this project, I utilized machine learning methods to analyze a real-world dataset from a Portuguese bank’s marketing campaign. The main goal was to develop predictive models that determine whether a client will subscribe to a term deposit based on their demographic, socio-economic, and macroeconomic characteristics.
To accomplish this, I implemented and compared three supervised classification algorithms — Decision Tree, Random Forest, and AdaBoost — under both default and tuned hyperparameter settings. The performance of each model was evaluated using five key metrics: Accuracy, Precision, Recall, F1 Score, and AUC.
Because the dataset is imbalanced, particular focus was placed on Recall, which indicates how effectively the model identifies true subscribers. The project concludes with a recommendation of the most effective model that aligns with the bank’s business objective of increasing customer subscription rates.
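For reference, writing TP, FP, FN, and TN for the confusion-matrix counts with "yes" as the positive class: Recall = TP / (TP + FN), Precision = TP / (TP + FP), and F1 = 2 * Precision * Recall / (Precision + Recall). These are exactly the quantities computed by the helper function in the Model Performance Summary section.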
Load the packages.
library(tidyverse)
library(openintro)
library(infer)
library(dplyr)
library(knitr)
library(corrplot)
library(ggthemes)
library(randomForest)
library(ggcorrplot)
library(rpart)
library(rpart.plot)
library(caret)
library(ROSE)
library(ada)
library(pROC)
# Load the Bank dataset from CSV file
bank <- read.csv("D:\\Cuny_sps\\Data_622\\Assignment-1\\bank.csv", sep = ";")
# Show the first 10 rows of the dataset in a table
kable(head(bank, 10), caption = "Display first 10 rows of the Bank Dataset")
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
| 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
| 36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
| 39 | technician | married | secondary | no | 147 | yes | no | cellular | 6 | may | 151 | 2 | -1 | 0 | unknown | no |
| 41 | entrepreneur | married | tertiary | no | 221 | yes | no | unknown | 14 | may | 57 | 2 | -1 | 0 | unknown | no |
| 43 | services | married | primary | no | -88 | yes | yes | cellular | 17 | apr | 313 | 1 | 147 | 2 | failure | no |
# Display the structure of the dataset
str(bank)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 33 35 30 59 35 36 39 41 43 ...
$ job : chr "unemployed" "services" "management" "management" ...
$ marital : chr "married" "married" "single" "married" ...
$ education: chr "primary" "secondary" "tertiary" "tertiary" ...
$ default : chr "no" "no" "no" "no" ...
$ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
$ housing : chr "no" "yes" "yes" "yes" ...
$ loan : chr "no" "yes" "no" "yes" ...
$ contact : chr "cellular" "cellular" "cellular" "unknown" ...
$ day : int 19 11 16 3 5 23 14 6 14 17 ...
$ month : chr "oct" "may" "apr" "jun" ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
$ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
$ previous : int 0 4 1 0 0 3 2 0 0 2 ...
$ poutcome : chr "unknown" "failure" "failure" "unknown" ...
$ y : chr "no" "no" "no" "no" ...
bank <- bank %>%
  mutate(y = factor(y, levels = c("no", "yes")),
         job = as.factor(job),
         marital = as.factor(marital),
         education = as.factor(education),
         default = as.factor(default),
         housing = as.factor(housing),
         loan = as.factor(loan),
         contact = as.factor(contact),
         month = as.factor(month),
         poutcome = as.factor(poutcome))
# Summary statistics
summary(bank)
      age                 job          marital        education    default
Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
Median :39.00 technician :768 single :1196 tertiary :1350
Mean :41.17 admin. :478 unknown : 187
3rd Qu.:49.00 services :417
Max. :87.00 retired :230
(Other) :713
balance housing loan contact day
Min. :-3313 no :1962 no :3830 cellular :2896 Min. : 1.00
1st Qu.: 69 yes:2559 yes: 691 telephone: 301 1st Qu.: 9.00
Median : 444 unknown :1324 Median :16.00
Mean : 1423 Mean :15.92
3rd Qu.: 1480 3rd Qu.:21.00
Max. :71188 Max. :31.00
month duration campaign pdays
may :1398 Min. : 4 Min. : 1.000 Min. : -1.00
jul : 706 1st Qu.: 104 1st Qu.: 1.000 1st Qu.: -1.00
aug : 633 Median : 185 Median : 2.000 Median : -1.00
jun : 531 Mean : 264 Mean : 2.794 Mean : 39.77
nov : 389 3rd Qu.: 329 3rd Qu.: 3.000 3rd Qu.: -1.00
apr : 293 Max. :3025 Max. :50.000 Max. :871.00
(Other): 571
previous poutcome y
Min. : 0.0000 failure: 490 no :4000
1st Qu.: 0.0000 other : 197 yes: 521
Median : 0.0000 success: 129
Mean : 0.5426 unknown:3705
3rd Qu.: 0.0000
Max. :25.0000
Create a new data frame called n_bank containing only the numeric variables (age, balance, duration, etc.), plus a companion frame c_bank holding the categorical columns; the target y is excluded from both and re-attached later.
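The construction itself is not shown in the rendered output; a minimal sketch consistent with how n_bank and c_bank are used later in this report:
# Numeric-only columns (summarized below)
n_bank <- bank %>% select(where(is.numeric))
# Categorical columns, excluding the target y (summarized further below)
c_bank <- bank %>% select(where(is.factor)) %>% select(-y)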
summary(n_bank)
      age            balance           day           duration
Min. :19.00 Min. :-3313 Min. : 1.00 Min. : 4
1st Qu.:33.00 1st Qu.: 69 1st Qu.: 9.00 1st Qu.: 104
Median :39.00 Median : 444 Median :16.00 Median : 185
Mean :41.17 Mean : 1423 Mean :15.92 Mean : 264
3rd Qu.:49.00 3rd Qu.: 1480 3rd Qu.:21.00 3rd Qu.: 329
Max. :87.00 Max. :71188 Max. :31.00 Max. :3025
campaign pdays previous
Min. : 1.000 Min. : -1.00 Min. : 0.0000
1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000
Median : 2.000 Median : -1.00 Median : 0.0000
Mean : 2.794 Mean : 39.77 Mean : 0.5426
3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
Max. :50.000 Max. :871.00 Max. :25.0000
Inspect the distribution of categorical variables, which is useful for understanding class imbalance, dominant categories, or data encoding needs before modeling.
summary(c_bank)
          job           marital        education     default    housing
management :969 divorced: 528 primary : 678 no :4445 no :1962
blue-collar:946 married :2797 secondary:2306 yes: 76 yes:2559
technician :768 single :1196 tertiary :1350
admin. :478 unknown : 187
services :417
retired :230
(Other) :713
loan contact month poutcome y
no :3830 cellular :2896 may :1398 failure: 490 no :4000
yes: 691 telephone: 301 jul : 706 other : 197 yes: 521
unknown :1324 aug : 633 success: 129
jun : 531 unknown:3705
nov : 389
apr : 293
(Other): 571
Check for missing values; every count is zero, so no imputation is needed:
colSums(is.na(bank))
      age       job   marital education   default   balance   housing      loan
0 0 0 0 0 0 0 0
contact day month duration campaign pdays previous poutcome
0 0 0 0 0 0 0 0
y
0
Perform exploratory data analysis (EDA) on the bank marketing dataset to understand variable distributions, relationships, and correlations — essential for feature selection and model building later.
n_bank %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
facet_wrap(~variable, scales = "free") +
theme_minimal() +
  labs(title = "Distribution of Numeric Variables")
c_bank %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_bar(fill = "blue", alpha = 0.7) +
facet_wrap(~variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Categorical Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
n_bank <- n_bank %>%
mutate(y = bank$y)
n_bank <- n_bank %>%
mutate(y = as.factor(y))
n_bank_long <- n_bank %>%
pivot_longer(cols = -y, names_to = "variable", values_to = "value")
ggplot(n_bank_long, aes(x = y, y = value, fill = y)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
labs(title = "Distribution of Numeric Variables by Yes/No in Y",
x = "Y (Outcome)", y = "Value") +
  theme_minimal()
c_bank <- c_bank %>%
mutate(y = bank$y) %>%
mutate(y = as.factor(y))
c_bank_long <- c_bank %>%
pivot_longer(cols = -y, names_to = "variable", values_to = "value") %>%
count(variable, value, y)
ggplot(c_bank_long, aes(x = value, y = n, fill = y)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
facet_wrap(~variable, scales = "free") +
labs(title = "Relationship Between Categorical Variables and Y (Yes/No)",
x = "Category", y = "Count") +
scale_fill_manual(values = c("yes" = "blue", "no" = "red")) +
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
n_bank <- n_bank %>%
mutate(y = as.numeric(as.factor(y)) - 1)
cor_matrix <- cor(n_bank, use = "complete.obs")
ggcorrplot(cor_matrix,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("blue", "white", "red"),
title = "Correlation Heatmap",
           ggtheme = theme_minimal())
# Experiments
## Decision Tree Models
### Partition
80% of the data will be used for training and 20% for testing.
set.seed(1234)
sample <- sample(nrow(bank), round(nrow(bank)*.8),
replace = FALSE)
bank_train <- bank[sample,]
bank_test <- bank[-sample,]
# Class proportions in the full data, the training set, and the test set
round(prop.table(table(select(bank, y))), 2)
y
  no  yes
0.88 0.12
round(prop.table(table(select(bank_train, y))), 2)
y
  no  yes
0.88 0.12
round(prop.table(table(select(bank_test, y))), 2)
y
  no  yes
0.89 0.11
The target is heavily imbalanced (roughly 12% "yes"), and the split preserves that proportion in both partitions.
Objective / Hypothesis: Test whether a baseline decision tree model using all features can accurately predict if a client will subscribe to a term deposit.
Change: No resampling or class balancing — model trained directly on original data (80/20 split).
Metric(s): Accuracy, Recall, Precision, F1, AUC.
Result / Finding: Accuracy = 0.886, but Sensitivity / Recall = 0.27 (very low). The model correctly classifies “no” cases but struggles to detect “yes” (subscribers). Indicates strong imbalance bias.
Hyperparameter: We used default rpart settings for the baseline Decision Tree. No depth or split restrictions were applied to observe the tree’s natural growth on unbalanced data.
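The baseline fitting call itself does not appear in the rendered output; a minimal sketch consistent with the defaults described above, assuming the object name bank_mod that the prediction code below references:
# Baseline decision tree, default rpart settings, trained on imbalanced data
bank_mod <- rpart(y ~ ., data = bank_train, method = "class")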
bank_pred <- predict(bank_mod, bank_test, type = "class")
bank_pred_table <- table(bank_test$y, bank_pred)
bank_pred_table
     bank_pred
       no yes
  no  774  29
  yes  74  27
# Overall accuracy (call reconstructed; only its output appeared above)
sum(diag(bank_pred_table)) / nrow(bank_test)
[1] 0.8860619
confusionMatrix(bank_pred, bank_test$y, positive = "yes")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 774 74
yes 29 27
Accuracy : 0.8861
95% CI : (0.8635, 0.906)
No Information Rate : 0.8883
P-Value [Acc > NIR] : 0.609
Kappa : 0.2871
Mcnemar's Test P-Value : 1.455e-05
Sensitivity : 0.26733
Specificity : 0.96389
Pos Pred Value : 0.48214
Neg Pred Value : 0.91274
Prevalence : 0.11173
Detection Rate : 0.02987
Detection Prevalence : 0.06195
Balanced Accuracy : 0.61561
'Positive' Class : yes
Recall is very low: only 27% of actual subscribers were detected. And when the model does predict "yes", it is correct less than half the time (precision = 0.48).
Objective / Hypothesis: Balancing the target variable will improve sensitivity (detecting “yes”) and overall model fairness.
Change: Applied ROSE oversampling to balance the “yes” and “no” classes before training.
Metric(s): Accuracy, Recall, Precision, F1, AUC.
Result / Finding: Sensitivity / Recall improved from 0.27 → 0.81 . The balanced model detects subscribers much better with only a minor trade-off in accuracy.
Hyperparameter : Decision Tree parameters remain default to isolate the effect of ROSE-balanced data. No maxdepth or minsplit changes were applied, focusing only on class balancing impact.
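The balancing call itself is not shown in the rendered output; a minimal sketch using ROSE (the seed and exact arguments are assumptions), which produces the class counts shown below:
# Generate a synthetic, roughly balanced sample of "yes" and "no" cases
data_balanced <- ROSE(y ~ ., data = bank, seed = 123)$data
table(data_balanced$y)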
no yes
2313 2208
set.seed(124)
trainIndex <- createDataPartition(data_balanced$y, p = 0.8, list = FALSE)
bank_train2 <- data_balanced[trainIndex, ]
bank_test2 <- data_balanced[-trainIndex, ]
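The tree fit on the balanced data is likewise omitted from the output; a minimal sketch, again assuming default rpart settings and the object name bank_mod2 that the ROC comparison later references:
# Decision tree on the ROSE-balanced training data, evaluated on its test split
bank_mod2 <- rpart(y ~ ., data = bank_train2, method = "class")
confusionMatrix(predict(bank_mod2, bank_test2, type = "class"),
                bank_test2$y, positive = "yes")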
Confusion Matrix and Statistics
          Reference
Prediction no yes
no 359 82
yes 103 359
Accuracy : 0.7951
95% CI : (0.7673, 0.821)
No Information Rate : 0.5116
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5905
Mcnemar's Test P-Value : 0.1414
Sensitivity : 0.8141
Specificity : 0.7771
Pos Pred Value : 0.7771
Neg Pred Value : 0.8141
Prevalence : 0.4884
Detection Rate : 0.3976
Detection Prevalence : 0.5116
Balanced Accuracy : 0.7956
'Positive' Class : yes
After balancing the data, the model's sensitivity/recall improves to roughly four times that of the previous model (0.81 vs. 0.27).
## Random Forest
Objective / Hypothesis: Random Forest with selected significant predictors will reduce overfitting and improve performance consistency compared to Decision Tree.
Change: Used balanced data but limited predictors to key variables (duration, poutcome, job, month).
Metric(s): Accuracy, Recall, Precision, F1, AUC.
Result / Finding: Sensitivity = 0.802. The model outperforms Decision Tree in both accuracy and balance, confirming ensemble learning reduces bias and variance.
Hyperparameters: We used ntree = 500 to stabilize predictions across trees and mtry = sqrt(p) (where p is the number of predictors), as recommended for classification tasks. Only selected key predictors are used, to test whether a smaller feature set can reduce overfitting while maintaining performance. (Note: because mtry is computed from the full column count of bank_train2 it evaluates to 4, so with only four predictors in the formula each split effectively considers all of them.)
bank_train2$y <- factor(bank_train2$y, levels = c("yes", "no"))
bank_test2$y <- factor(bank_test2$y, levels = c("yes", "no"))
set.seed(126)
rf_model <- randomForest(y ~ duration + poutcome + job + month , data = bank_train2, ntree = 500, mtry = sqrt(ncol(bank_train2) - 1), importance = TRUE)
print(rf_model)
Call:
randomForest(formula = y ~ duration + poutcome + job + month, data = bank_train2, ntree = 500, mtry = sqrt(ncol(bank_train2) - 1), importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 19.65%
Confusion matrix:
yes no class.error
yes 1420 347 0.1963780
no 364 1487 0.1966505
# Variable importance (call reconstructed; only its output appeared)
importance(rf_model)
                yes        no MeanDecreaseAccuracy MeanDecreaseGini
duration 174.23918 150.86938 221.1754 1037.9661
poutcome 95.83095 97.54879 133.3147 181.0547
job 71.06409 20.08557 69.0880 223.5180
month 105.64045 81.29794 131.0640 364.9532
predictions <- predict(rf_model, bank_test2, type = "class")
confusionMatrix(table(predictions, bank_test2$y))
Confusion Matrix and Statistics
predictions yes no
yes 361 99
no 80 363
Accuracy : 0.8018
95% CI : (0.7742, 0.8273)
No Information Rate : 0.5116
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6037
Mcnemar's Test P-Value : 0.1785
Sensitivity : 0.8186
Specificity : 0.7857
Pos Pred Value : 0.7848
Neg Pred Value : 0.8194
Prevalence : 0.4884
Detection Rate : 0.3998
Detection Prevalence : 0.5094
Balanced Accuracy : 0.8022
'Positive' Class : yes
For the next model we return to using all of the variables, to understand which are important for predicting a successful subscription to a term deposit.
Objective / Hypothesis: Including all features in Random Forest may further improve model accuracy and reliability.
Change: Same balanced data, but now used all variables.
Metric(s): Accuracy, Recall, Precision, F1, AUC.
Result / Finding: Sensitivity = 0.88, showing that including all variables improves performance. Duration and month remain the most influential predictors.
Hyperparameters: All predictors are included to assess full-model performance. ntree = 500 ensures stability, and mtry = sqrt(p) follows standard practice for classification. We compare results to Experiment 1 to understand the effect of feature selection on model accuracy and sensitivity.
set.seed(127)
rf_model2 <- randomForest(y ~ . , data = bank_train2, ntree = 500, mtry = sqrt(ncol(bank_train2) - 1), importance = TRUE)
print(rf_model2)
Call:
randomForest(formula = y ~ ., data = bank_train2, ntree = 500, mtry = sqrt(ncol(bank_train2) - 1), importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 12.6%
Confusion matrix:
yes no class.error
yes 1565 202 0.1143181
no 254 1597 0.1372231
# Variable importance (call reconstructed; only its output appeared)
importance(rf_model2)
                yes        no MeanDecreaseAccuracy MeanDecreaseGini
age 19.820523 21.572753 29.700444 103.723779
job 49.632793 17.263917 49.445207 126.675076
marital 24.374131 10.114840 25.286746 28.699770
education 18.927340 9.791511 21.902100 31.341975
default 6.901647 2.935399 6.525886 2.891009
balance 20.221959 8.092355 19.457560 102.214957
housing 21.785807 11.726264 22.117810 20.989088
loan 19.174621 8.614734 19.445217 15.450295
contact 25.387859 32.874001 36.256648 44.063995
day 18.308964 19.305632 26.024022 103.066523
month 66.603151 61.593286 83.421355 221.760095
duration 132.482029 133.421898 158.444418 514.816303
campaign 25.589388 5.854974 23.598160 101.715509
pdays 5.182782 34.678797 36.324847 129.501015
previous 18.917044 42.631203 44.014716 164.431657
poutcome 24.525720 40.771142 50.914476 94.491496
predictions <- predict(rf_model2, bank_test2, type = "class")
confusionMatrix(table(predictions, bank_test2$y))
Confusion Matrix and Statistics
predictions yes no
yes 388 71
no 53 391
Accuracy : 0.8627
95% CI : (0.8385, 0.8845)
No Information Rate : 0.5116
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7255
Mcnemar's Test P-Value : 0.1268
Sensitivity : 0.8798
Specificity : 0.8463
Pos Pred Value : 0.8453
Neg Pred Value : 0.8806
Prevalence : 0.4884
Detection Rate : 0.4297
Detection Prevalence : 0.5083
Balanced Accuracy : 0.8631
'Positive' Class : yes
Objective / Hypothesis: Using AdaBoost with 5-fold cross-validation and 500 trees will reduce overfitting and handle non-linear relationships better than Random Forest.
Change: New algorithm (AdaBoost) with 5-fold CV, limited predictors (duration, month, previous, poutcome).
Metric(s): Accuracy, Recall, Precision, F1, AUC.
Result / Finding: Performance dropped significantly (Accuracy = 0.277). Model likely underfit due to parameter settings or feature limitations.
Hyperparameters: 5-fold cross-validation was used to reduce overfitting and obtain a stable estimate of performance. Although 500 boosting iterations were intended, caret's default grid for method = "ada" only searches iter in {50, 100, 150}, and the selected model used iter = 50 (see the tuning output below). Only selected key predictors (duration, month, previous, poutcome) are used, to test whether a smaller feature set can capture the main patterns without adding noise. The learning rate (nu) was left at its default, balancing training speed with stability.
set.seed(128)
ada_model <- train(y ~ duration + month + previous + poutcome, data = bank_train2, method = "ada", trControl = trainControl(method = "cv", number = 5))
Boosted Classification Trees
3618 samples
4 predictor
2 classes: 'yes', 'no'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 2894, 2894, 2895, 2895, 2894
Resampling results across tuning parameters:
maxdepth iter Accuracy Kappa
1 50 0.2636857 -0.4647131
1 100 0.2426797 -0.5084060
1 150 0.2363227 -0.5214948
2 50 0.2260945 -0.5425801
2 100 0.2222259 -0.5507751
2 150 0.2175256 -0.5606495
3 50 0.2147609 -0.5659306
3 100 0.2059165 -0.5846610
3 150 0.2017694 -0.5935630
Tuning parameter 'nu' was held constant at a value of 0.1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were iter = 50, maxdepth = 1 and nu = 0.1.
Month showed no importance
pred <- predict(ada_model, newdata = bank_test2, type = "raw")
confusionMatrix(table(pred, bank_test2$y))
Confusion Matrix and Statistics
pred yes no
yes 182 394
no 259 68
Accuracy : 0.2769
95% CI : (0.2479, 0.3073)
No Information Rate : 0.5116
P-Value [Acc > NIR] : 1
Kappa : -0.4371
Mcnemar's Test P-Value : 1.573e-07
Sensitivity : 0.4127
Specificity : 0.1472
Pos Pred Value : 0.3160
Neg Pred Value : 0.2080
Prevalence : 0.4884
Detection Rate : 0.2016
Detection Prevalence : 0.6379
Balanced Accuracy : 0.2799
'Positive' Class : yes
Accuracy = 0.277, Sensitivity = 0.413.
These results are worse than random guessing. A worse-than-random classifier usually points to a label-coding or orientation problem (note that the outcome factor was re-leveled to c("yes", "no") before this experiment) rather than to a purely statistical failure.
### Model 6: increase k-folds to reduce variance and get a better performance estimate
Objective / Hypothesis: Increasing the number of folds (20-fold CV) and including all features (except job) will improve AdaBoost generalization.
Change: 20-fold cross-validation, expanded feature set.
Metric(s): Accuracy, Recall, Precision, F1, AUC.
Result / Finding: Results similar to Experiment 1 — still poor (Accuracy ≈ 0.27). Suggests AdaBoost is not well-suited for this dataset’s imbalance or structure.
Hyperparameters: 20-fold cross-validation was used to obtain a more robust performance estimate and reduce variance in model evaluation. All features except job were included to explore whether more information improves generalization. As in Experiment 1, the number of boosting iterations is chosen by caret's default grid rather than fixed at 500, and the learning rate (nu) is kept at its default.
Poor Performance: The low accuracy and recall are largely due to the dataset’s class imbalance and categorical-heavy predictors. AdaBoost struggles to learn minority class patterns without additional balancing or cost-sensitive adjustments.
set.seed(129)
ada_model2 <- train(y ~ . - job,
data = bank_train2,
method = "ada",
trControl = trainControl(method = "cv", number = 20))
pred <- predict(ada_model2, newdata = bank_test2, type = "raw")
confusionMatrix(table(pred, bank_test2$y))
Confusion Matrix and Statistics
pred yes no
yes 171 387
no 270 75
Accuracy : 0.2724
95% CI : (0.2436, 0.3027)
No Information Rate : 0.5116
P-Value [Acc > NIR] : 1
Kappa : -0.4472
Mcnemar's Test P-Value : 6.023e-06
Sensitivity : 0.3878
Specificity : 0.1623
Pos Pred Value : 0.3065
Neg Pred Value : 0.2174
Prevalence : 0.4884
Detection Rate : 0.1894
Detection Prevalence : 0.6179
Balanced Accuracy : 0.2750
'Positive' Class : yes
# Load ROC packages
library(pROC)
# Compute probabilities (for ROC)
dt1_probs <- predict(bank_mod, bank_test, type = "prob")[, "yes"]
dt2_probs <- predict(bank_mod2, bank_test2, type = "prob")[, "yes"]
rf1_probs <- predict(rf_model, bank_test2, type = "prob")[, "yes"]
rf2_probs <- predict(rf_model2, bank_test2, type = "prob")[, "yes"]
ada1_probs <- predict(ada_model, bank_test2, type = "prob")[, "yes"]
ada2_probs <- predict(ada_model2, bank_test2, type = "prob")[, "yes"]
# Generate ROC objects
roc_dt1 <- roc(bank_test$y, dt1_probs)
roc_dt2 <- roc(bank_test2$y, dt2_probs)
roc_rf1 <- roc(bank_test2$y, rf1_probs)
roc_rf2 <- roc(bank_test2$y, rf2_probs)
roc_ada1 <- roc(bank_test2$y, ada1_probs)
roc_ada2 <- roc(bank_test2$y, ada2_probs)
# Plot all ROC curves together
plot(roc_dt1, col = "gray40", lwd = 2, main = "ROC Curve Comparison Across Models")
lines(roc_dt2, col = "orange", lwd = 2)
lines(roc_rf1, col = "blue", lwd = 2)
lines(roc_rf2, col = "darkgreen", lwd = 2)
lines(roc_ada1, col = "red", lwd = 2)
lines(roc_ada2, col = "purple", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "black")
legend("bottomright",
legend = c("Decision Tree (Raw)", "Decision Tree (Balanced)",
"Random Forest (Selected)", "Random Forest (All)",
"AdaBoost (5-fold)", "AdaBoost (20-fold)"),
col = c("gray40", "orange", "blue", "darkgreen", "red", "purple"),
       lwd = 2, cex = 0.8)
# Decision Tree predictions
pred_dt1 <- predict(bank_mod, bank_test, type = "class") # unbalanced
pred_dt2 <- predict(bank_mod2, bank_test2, type = "class") # balanced
# Random Forest predictions (all use bank_test2, balanced)
pred_rf1 <- predict(rf_model, bank_test2, type = "class")
pred_rf2 <- predict(rf_model2, bank_test2, type = "class")
# AdaBoost predictions
pred_ada1 <- predict(ada_model, bank_test2, type = "raw")
pred_ada2 <- predict(ada_model2, bank_test2, type = "raw")
# Actuals
actual_dt1 <- bank_test$y # matches pred_dt1
actual_dt2 <- bank_test2$y # matches pred_dt2, rf, ada
# --- Predicted probabilities for AUC ---
# (only needed for ROC/AUC calculation)
pred_prob_dt1 <- predict(bank_mod, bank_test, type = "prob")[, "yes"]
pred_prob_dt2 <- predict(bank_mod2, bank_test2, type = "prob")[, "yes"]
pred_prob_rf1 <- predict(rf_model, bank_test2, type = "prob")[, "yes"]
pred_prob_rf2 <- predict(rf_model2, bank_test2, type = "prob")[, "yes"]
pred_prob_ada1 <- predict(ada_model, bank_test2, type = "prob")[, "yes"]
pred_prob_ada2 <- predict(ada_model2, bank_test2, type = "prob")[, "yes"]
# Actuals
actual_dt1 <- bank_test$y
actual_dt2 <- bank_test2$y
# Function to compute metrics
compute_metrics <- function(actual, predicted, probs) {
cm <- confusionMatrix(predicted, actual, positive = "yes")
prec <- cm$byClass["Precision"]
recall <- cm$byClass["Sensitivity"]
f1 <- 2 * (prec * recall) / (prec + recall)
auc_val <- auc(roc(actual, as.numeric(probs)))
return(c(Accuracy = cm$overall["Accuracy"],
Sensitivity = recall,
Specificity = cm$byClass["Specificity"],
Precision = prec,
F1 = f1,
AUC = auc_val))
}
# Create an empty data frame to store results
model_results <- data.frame(
Model = c("Decision Tree (Unbalanced)", "Decision Tree (Balanced)",
"Random Forest (Selected)", "Random Forest (All)",
"AdaBoost (5-fold)", "AdaBoost (20-fold)"),
Accuracy = NA,
Sensitivity = NA,
Specificity = NA,
Precision = NA,
F1 = NA,
AUC = NA
)
# Fill in metrics for each model, using the predictions and probabilities above
model_results[1, 2:7] <- compute_metrics(actual_dt1, pred_dt1, pred_prob_dt1)
model_results[2, 2:7] <- compute_metrics(actual_dt2, pred_dt2, pred_prob_dt2)
model_results[3, 2:7] <- compute_metrics(actual_dt2, pred_rf1, pred_prob_rf1)
model_results[4, 2:7] <- compute_metrics(actual_dt2, pred_rf2, pred_prob_rf2)
model_results[5, 2:7] <- compute_metrics(actual_dt2, pred_ada1, pred_prob_ada1)
model_results[6, 2:7] <- compute_metrics(actual_dt2, pred_ada2, pred_prob_ada2)
#Rename Sensitivity to Recall
colnames(model_results)[colnames(model_results) == "Sensitivity"] <- "Sensitivity / Recall"
# Nicely formatted output table
kable(model_results, caption = "Model Performance Summary (with Precision & F1)", digits = 3)
| Model | Accuracy | Sensitivity / Recall | Specificity | Precision | F1 | AUC |
|---|---|---|---|---|---|---|
| Decision Tree (Unbalanced) | 0.886 | 0.267 | 0.964 | 0.482 | 0.344 | 0.735 |
| Decision Tree (Balanced) | 0.795 | 0.814 | 0.777 | 0.777 | 0.795 | 0.827 |
| Random Forest (Selected) | 0.802 | 0.819 | 0.786 | 0.785 | 0.801 | 0.876 |
| Random Forest (All) | 0.863 | 0.880 | 0.846 | 0.845 | 0.862 | 0.937 |
| AdaBoost (5-fold) | 0.277 | 0.413 | 0.147 | 0.316 | 0.358 | 0.814 |
| AdaBoost (20-fold) | 0.272 | 0.388 | 0.162 | 0.306 | 0.342 | 0.822 |
Recall = Sensitivity.
As a follow-up tuning experiment, the AdaBoost model was re-trained with an explicit grid over maxdepth, iter, and nu, selecting the best combination by ROC rather than accuracy:
ctrl <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary, classProbs = TRUE, savePredictions = "final")
grid <- expand.grid(maxdepth = c(1,2,3), iter = c(50,100,150), nu = c(0.05, 0.1))
set.seed(128)
ada_model <- train(y ~ duration + month + previous + poutcome,
data = bank_train2,
method = "ada",
metric = "ROC",
trControl = ctrl,
tuneGrid = grid)
# Display model summary
ada_model
Boosted Classification Trees
3618 samples
4 predictor
2 classes: 'yes', 'no'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 2894, 2894, 2895, 2895, 2894
Resampling results across tuning parameters:
nu maxdepth iter ROC Sens Spec
0.05 1 50 0.8012766 0.4006658 0.1550477
0.05 1 100 0.8190276 0.3859589 0.1534305
0.05 1 150 0.8257163 0.3333349 0.1561303
0.05 2 50 0.8500026 0.3061683 0.1609995
0.05 2 100 0.8583474 0.2954178 0.1604517
0.05 2 150 0.8620120 0.2937213 0.1582910
0.05 3 50 0.8638215 0.2920232 0.1604488
0.05 3 100 0.8701753 0.2829772 0.1512741
0.05 3 150 0.8722650 0.2643043 0.1566752
0.10 1 50 0.8176723 0.3820025 0.1566825
0.10 1 100 0.8314747 0.3333397 0.1566781
0.10 1 150 0.8412265 0.3106896 0.1588373
0.10 2 50 0.8568814 0.2999376 0.1593706
0.10 2 100 0.8640976 0.2852243 0.1626109
0.10 2 150 0.8679547 0.2756038 0.1631558
0.10 3 50 0.8693934 0.2705302 0.1577490
0.10 3 100 0.8744853 0.2529793 0.1620660
0.10 3 150 0.8773464 0.2303388 0.1701741
ROC was used to select the optimal model using the largest value.
The final values used for the model were iter = 150, maxdepth = 3 and nu = 0.1.
ada_model$bestTune
   iter maxdepth  nu
18  150        3 0.1
# Full resampling results, ordered by iteration count (call reconstructed)
ada_model$results %>% arrange(iter, maxdepth, nu)
     nu maxdepth iter       ROC      Sens      Spec       ROCSD     SensSD
1 0.05 1 50 0.8012766 0.4006658 0.1550477 0.018571428 0.04397188
10 0.10 1 50 0.8176723 0.3820025 0.1566825 0.015152920 0.03979091
4 0.05 2 50 0.8500026 0.3061683 0.1609995 0.006182022 0.03168737
13 0.10 2 50 0.8568814 0.2999376 0.1593706 0.011622817 0.04082031
7 0.05 3 50 0.8638215 0.2920232 0.1604488 0.009230597 0.03337594
16 0.10 3 50 0.8693934 0.2705302 0.1577490 0.009834789 0.04143704
2 0.05 1 100 0.8190276 0.3859589 0.1534305 0.013341481 0.04234205
11 0.10 1 100 0.8314747 0.3333397 0.1566781 0.011008361 0.02682130
5 0.05 2 100 0.8583474 0.2954178 0.1604517 0.006955493 0.03338367
14 0.10 2 100 0.8640976 0.2852243 0.1626109 0.010828081 0.04177662
8 0.05 3 100 0.8701753 0.2829772 0.1512741 0.010273044 0.03637675
17 0.10 3 100 0.8744853 0.2529793 0.1620660 0.011067804 0.04094407
3 0.05 1 150 0.8257163 0.3333349 0.1561303 0.012208141 0.03213443
12 0.10 1 150 0.8412265 0.3106896 0.1588373 0.012462594 0.02949385
6 0.05 2 150 0.8620120 0.2937213 0.1582910 0.007861763 0.03697073
15 0.10 2 150 0.8679547 0.2756038 0.1631558 0.011160869 0.04409062
9 0.05 3 150 0.8722650 0.2643043 0.1566752 0.010573137 0.04318602
18 0.10 3 150 0.8773464 0.2303388 0.1701741 0.011144841 0.04120877
SpecSD
1 0.017006635
10 0.022120084
4 0.008994867
13 0.017586473
7 0.010112626
16 0.016123986
2 0.015452597
11 0.016615653
5 0.010175189
14 0.012375909
8 0.013437181
17 0.018719536
3 0.012851242
12 0.015268686
6 0.010181703
15 0.011885003
9 0.013420082
18 0.021236345
The comparison across models highlights clear performance distinctions driven by algorithm design and data imbalance handling.
1. Decision Tree (Balanced Data)
The decision tree, when trained on ROSE-balanced data, achieved a Sensitivity of 0.814 — a significant improvement over the unbalanced baseline.
However, an AUC of about 0.82 indicates only moderate predictive power.
Single trees have high variance and are prone to overfitting; while balancing improved recall, the model still struggled with generalization due to its depth and data fragmentation.
2. Random Forest (All Predictors)
The Random Forest model consistently delivered the best overall performance, with Accuracy = 0.86, Sensitivity = 0.88, and AUC ≈ 0.937.
This improvement stems from bagging (bootstrap aggregation) — combining multiple trees reduces variance and stabilizes predictions.
The ensemble effect mitigates the decision tree’s overfitting and captures nonlinear interactions between predictors.
Random Forest also handles categorical variables and noisy data effectively, making it well-suited for the bank marketing dataset.
3. AdaBoost (All Predictors)
Despite theoretical advantages in correcting bias, AdaBoost performed poorly with Accuracy = 0.27, Sensitivity = 0.40, and AUC ≈ 0.81.
Boosting algorithms can amplify noise and misclassified minority samples, which likely caused instability here.
This outcome suggests that AdaBoost is less effective for categorical-heavy or imbalanced datasets unless extensive tuning or cost-sensitive weighting is applied.
Note: The extremely low performance is primarily due to the dataset’s class imbalance and the categorical-heavy nature of the predictors. AdaBoost struggles to learn minority classes effectively without additional balancing or cost-sensitive adjustments.
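One concrete cost-sensitive adjustment is a loss matrix that penalizes a missed subscriber more heavily than a false alarm. A minimal sketch using rpart's parms argument (the 4:1 penalty is illustrative, not tuned; the same idea extends to the base learners inside a boosted ensemble):
# Rows = true class (no, yes); columns = predicted class (no, yes).
# Predicting "no" for a true "yes" costs 4x as much as a false positive.
loss_mat <- matrix(c(0, 4,
                     1, 0), nrow = 2)
cost_tree <- rpart(y ~ ., data = bank_train, method = "class",
                   parms = list(loss = loss_mat))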
# Re-level the outcome: reference level = "no", positive class = "yes"
bank_train2$y <- factor(bank_train2$y, levels = c("no","yes"))
bank_test2$y <- factor(bank_test2$y, levels = c("no","yes"))
# ----------------------------
# Decision Tree Experiment 1
# Objective: Baseline DT to measure accuracy & sensitivity on imbalanced data
# Variation : No resampling (original distribution)
# Metrics   : Accuracy, Sensitivity / Recall, Specificity, Precision, F1, AUC
# ----------------------------
Among all models, Random Forest (all features) achieved the best overall performance, balancing sensitivity (0.88) and accuracy (0.86). Decision Tree performance improved substantially after balancing, but still lagged behind Random Forest in consistency. AdaBoost underperformed due to its sensitivity to noise and the dataset's categorical complexity.
Random Forest performed best, with 0.86 accuracy and 0.88 recall, owing to ensemble stability and its ability to capture feature interactions.
The Random Forest model achieved the strongest overall performance across all evaluation metrics, with an accuracy of 0.86, sensitivity of 0.88, and AUC ≈ 0.937. This superiority is attributed to its ensemble architecture, which aggregates multiple decorrelated trees to reduce variance and capture nonlinear relationships between predictors.
In contrast, the Decision Tree model, though interpretable and improved substantially after class balancing, exhibited lower stability and generalization power due to its single-tree structure. The AdaBoost model underperformed significantly, likely because of its sensitivity to noise and the dataset’s categorical imbalance.
From a business perspective, maximizing recall (sensitivity) is most important, as missing potential subscribers carries higher cost than false positives. Therefore, Random Forest with all predictors is recommended for deployment in the bank’s term-deposit marketing strategy.
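As a final illustration of that recommendation, the chosen model can rank prospective clients by predicted subscription probability so the campaign contacts the most promising ones first. A minimal sketch (new_clients is a hypothetical data frame with the same predictor columns as the training data):
# Score prospective clients with the recommended Random Forest model
scores <- predict(rf_model2, newdata = new_clients, type = "prob")[, "yes"]
# Rank clients from most to least likely to subscribe
target_list <- new_clients[order(scores, decreasing = TRUE), ]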