1. Introduction

This project is a continuation of a previous exploratory data analysis (EDA) on the Bank Marketing Dataset, which explored customer responses to term deposit offers. The EDA phase (viewable here) uncovered two important characteristics that influence model selection and evaluation:

  • A significant class imbalance, with relatively few clients subscribing to a term deposit
  • Strong correlations among numerical features, particularly among economic indicators

Building on those insights, this phase shifts focus from exploration to systematic experimentation. The goal is to test different modeling strategies and preprocessing decisions to determine what improves predictive performance, especially for the underrepresented class.

This assignment involves conducting at least six (6) total experiments, spanning three different classification algorithms:

  • Decision Trees
  • Random Forest
  • AdaBoost

Each algorithm is tested under at least two different experimental conditions. Before running each experiment, an objective is clearly defined. After execution, results are evaluated and compared to draw conclusions and guide recommendations.

Through this iterative and methodical process, the project aims to surface insights on what modeling strategies are most effective for predicting term deposit subscriptions in the Bank Marketing Dataset, which reflects the complexities of real-world customer behavior.

2. Experiments

Before running experiments, we applied essential preprocessing to prepare the dataset for modeling. Although multicollinearity is not a major concern for tree-based models, removing highly correlated features helps reduce redundancy and ensures a cleaner, more interpretable dataset across experiments. Based on correlation analysis, variables such as housing and month were removed due to their strong pairwise relationships. Additionally, variables with near-zero variance, such as pdays, were excluded, as they contribute little to model performance. The default column was removed in the same step.

Other features were dropped based on earlier analysis. For example, duration was excluded because it is directly tied to the target variable and only becomes available after the outcome of the call is known. Including it would introduce data leakage, as it provides information from the future that would not be available during prediction.

This cleaned version of the dataset served as the foundation for the experiments that followed. Each experiment tested a different model configuration to evaluate performance, with a focus on handling class imbalance and assessing the relevance of available features.

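# Packages used in this section
library(caret); library(dplyr); library(purrr)
library(pROC); library(knitr); library(kableExtra)

bank_marketing <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)
# Pairwise correlations among the numeric columns
corr_matrix <- bank_marketing |>
  keep(is.numeric) |> cor()
highCorrelation <- findCorrelation(corr_matrix, cutoff = 0.75)

# findCorrelation() returns positions within corr_matrix, so look the names up there
hcol <- colnames(corr_matrix)[highCorrelation]
# Columns with near-zero variance
noVar <- nearZeroVar(bank_marketing)
columns_to <- names(bank_marketing)[noVar]
bank_marketing_cl <- bank_marketing |> select(-all_of(c("pdays", "duration", "housing", "month", "default")))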

2.1 Decision Trees

2.1.1 Experiment One: Effect of Tree Depth on Decision Tree Classifier

Objective
This experiment evaluates how limiting the depth of a decision tree affects its ability to predict term deposit subscriptions. The focus is on improving both precision and recall, since these metrics are most relevant when identifying customers likely to subscribe. Precision tells us how many predicted subscribers are actually correct, while recall shows how many actual subscribers the model successfully finds.
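As a small illustration of the two metrics defined above, the following sketch computes precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and are not taken from any experiment in this report.

```r
# Hypothetical confusion-matrix counts (illustration only)
tp <- 30   # predicted "yes", actually "yes"
fp <- 10   # predicted "yes", actually "no"
fn <- 70   # predicted "no",  actually "yes"

precision <- tp / (tp + fp)   # share of predicted subscribers that are correct
recall    <- tp / (tp + fn)   # share of actual subscribers that were found
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

In this toy case the model is right three times out of four when it predicts a subscriber, yet it finds only 30% of the actual subscribers, which is the pattern F1 is meant to penalize.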

Approach
Two models were trained with different maximum depths:
- A shallow tree with maxdepth = 3, designed to keep the model simple
- A deeper tree with maxdepth = 10, allowing for more detailed splits

To control tree complexity and prevent overfitting, we also tuned the complexity parameter (cp), which sets the minimum relative improvement in fit that a split must achieve to be kept. We used a grid of 20 values starting at 0.001 and increasing in steps of 0.005 (the largest value is 0.096). Lower cp values allow the tree to grow deeper, while higher values prune the tree earlier.
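The cp grid can be enumerated directly in base R. This short check shows how many candidate values are generated and where the grid actually stops, since the step size does not divide the interval evenly.

```r
# Candidate cp values, built the same way as the tuning grid
cp_values <- seq(0.001, 0.1, by = 0.005)

length(cp_values)   # number of candidates
range(cp_values)    # smallest and largest candidate
```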

Both models used 10-fold cross-validation and were evaluated on a separate test set. The features removed during preprocessing (pdays, duration, housing, month, and default) were excluded from both models; of these, only duration was removed for leaking information about the outcome.

set.seed(123)

dt_train_index <- createDataPartition(bank_marketing_cl$y, p = 0.7, list = FALSE)
dt_train <- bank_marketing_cl[dt_train_index, ]
dt_test  <- bank_marketing_cl[-dt_train_index, ]
set.seed(123)

cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.005))


dt_ctrl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)


dt_depth3_model <- train(
  y ~ .,
  data = dt_train,
  method = "rpart",
  metric = "ROC",
  trControl = dt_ctrl,
  tuneGrid = cp_grid,
  control = rpart.control(maxdepth = 3)
)
dt_depth10_model <- train(
  y ~ .,
  data = dt_train,
  method = "rpart",
  metric = "ROC",
  trControl = dt_ctrl,
  tuneGrid = cp_grid,
  control = rpart.control(maxdepth = 10)
)
get_metrics <- function(preds, probs, truth, model_name, dataset_name) {
  cm <- confusionMatrix(preds, truth, positive = "yes")
  auc_val <- auc(roc(truth, probs, levels = c("no", "yes"), direction = "<"))
  
  data.frame(
    Model = model_name,
    Dataset = dataset_name,
    Accuracy = unname(cm$overall["Accuracy"]),
    Precision = unname(cm$byClass["Precision"]),
    Recall = unname(cm$byClass["Recall"]),
    F1 = unname(cm$byClass["F1"]),
    AUC = unname(auc_val)
  )
}

# Get predicted probabilities
dt_depth3_train_probs <- predict(dt_depth3_model, dt_train, type = "prob")[, "yes"]
dt_depth10_train_probs <- predict(dt_depth10_model, dt_train, type = "prob")[, "yes"]
dt_depth3_test_probs  <- predict(dt_depth3_model, dt_test, type = "prob")[, "yes"]
dt_depth10_test_probs <- predict(dt_depth10_model, dt_test, type = "prob")[, "yes"]

cutoff <- 0.5
# Class predictions using the cutoff defined above
dt_depth3_train_preds <- factor(ifelse(dt_depth3_train_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
dt_depth10_train_preds <- factor(ifelse(dt_depth10_train_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
dt_depth3_test_preds <- factor(ifelse(dt_depth3_test_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
dt_depth10_test_preds <- factor(ifelse(dt_depth10_test_probs > cutoff, "yes", "no"), levels = c("no", "yes"))

# Combine metrics
dt_metrics <- rbind(
  get_metrics(dt_depth3_train_preds, dt_depth3_train_probs, dt_train$y, "DT Depth 3", "Train"),
  get_metrics(dt_depth10_train_preds, dt_depth10_train_probs, dt_train$y, "DT Depth 10", "Train"),
  get_metrics(dt_depth3_test_preds, dt_depth3_test_probs, dt_test$y, "DT Depth 3", "Test"),
  get_metrics(dt_depth10_test_preds, dt_depth10_test_probs, dt_test$y, "DT Depth 10", "Test")
)

dt_metrics |>  kable(caption = "Decision Trees Experiment 1 Metrics") |> 
  kable_styling() |> 
  kable_classic()
Decision Trees Experiment 1 Metrics

Model        Dataset  Accuracy   Precision  Recall     F1         AUC
DT Depth 3   Train    0.8995214  0.7191011  0.1773399  0.2845147  0.7036327
DT Depth 10  Train    0.9064928  0.7243902  0.2743227  0.3979455  0.7524979
DT Depth 3   Test     0.8986727  0.7215190  0.1637931  0.2669789  0.7076750
DT Depth 10  Test     0.8989155  0.6421471  0.2320402  0.3408971  0.7528885

Results Summary
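- On the test set, the depth 3 model had higher precision (0.72 vs. 0.64) but very low recall (0.16), meaning it missed most actual subscribers.
- The depth 10 model traded some precision for better recall (0.23) and achieved higher F1 (0.34 vs. 0.27) and AUC (0.75 vs. 0.71), although it still missed more than three quarters of subscribers.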

Conclusion
Because both precision and recall are important, the depth 10 model performed better overall. It captured more true positives and achieved a better balance between finding actual subscribers and avoiding false positives. With a properly tuned cp value, this model shows stronger potential for identifying customers who are likely to subscribe.

2.1.2 Experiment Two: Manual Class Weighting in Decision Tree

Objective
This experiment evaluates whether assigning class weights based on inverse class frequency improves the Decision Tree’s ability to identify subscribers. The goal is to increase recall and F1-score, which are the most important metrics for this classification task.

Approach
Instead of sampling the data, this model modifies the learning process by applying manual class weights. Weights were calculated based on the inverse frequency of each class in the training data, giving the minority class (y = "yes") more influence during training.
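To make the weighting scheme concrete, here is a minimal base-R sketch on toy labels (hypothetical data, with a 90/10 imbalance chosen only for illustration). It shows that inverse-frequency weights give both classes the same total weight.

```r
# Toy labels with a 90/10 class imbalance (hypothetical data)
y <- factor(c(rep("no", 90), rep("yes", 10)), levels = c("no", "yes"))

class_freq    <- table(y)
class_weights <- 1 / class_freq                       # no = 1/90, yes = 1/10
obs_weights   <- as.numeric(class_weights[as.character(y)])

# Total weight carried by each class after weighting
class_totals <- tapply(obs_weights, y, sum)
class_totals
```

Each minority observation counts nine times as much as a majority observation here, so a split that misclassifies subscribers is penalized as heavily as one that misclassifies the same share of non-subscribers.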

The model was trained using 10-fold cross-validation and tuned over the same cp grid as in Experiment One (0.001 upward in steps of 0.005). A classification threshold of 0.5 was used to convert predicted probabilities into labels. The maxdepth limit from Experiment One was not applied, so tree depth was governed by rpart's default; no other changes were made to the data or features.

# Inverse-frequency class weights: the rarer class receives the larger weight
class_freq <- table(dt_train$y)
class_weights <- 1 / class_freq
dt_train$weights <- ifelse(dt_train$y == "yes", class_weights["yes"], class_weights["no"])


set.seed(123)
dt_weighted_model <- train(
  y ~ .,
  data = dt_train |> select(-weights),
  method = "rpart",
  metric = "ROC",
  weights = dt_train$weights,
  trControl = dt_ctrl,
  tuneGrid = cp_grid
  # no rpart.control(maxdepth = ...) is passed, so rpart's default depth limit applies
)
custom_cutoff <- 0.5


dt_weighted_train_probs <- predict(dt_weighted_model, dt_train, type = "prob")

dt_weighted_train_class <- ifelse(dt_weighted_train_probs$yes >= custom_cutoff, "yes", "no")
dt_weighted_train_class <- factor(dt_weighted_train_class, levels = c("no", "yes"))
dt_train_y <- factor(dt_train$y, levels = c("no", "yes"))

# Confusion Matrix and AUC
cm_dt_weighted_train <- confusionMatrix(dt_weighted_train_class, dt_train_y, positive = "yes")
auc_dt_weighted_train <- auc(roc(dt_train_y, dt_weighted_train_probs$yes))

# ---------------------------------------
# TEST SET - Apply same cutoff to predicted probs
# ---------------------------------------

dt_weighted_test_probs <- predict(dt_weighted_model, dt_test, type = "prob")

dt_weighted_test_class <- ifelse(dt_weighted_test_probs$yes >= custom_cutoff, "yes", "no")
dt_weighted_test_class <- factor(dt_weighted_test_class, levels = c("no", "yes"))
dt_test_y <- factor(dt_test$y, levels = c("no", "yes"))

# Confusion Matrix and AUC
cm_dt_weighted_test <- confusionMatrix(dt_weighted_test_class, dt_test_y, positive = "yes")
auc_dt_weighted_test <- auc(roc(dt_test_y, dt_weighted_test_probs$yes))

# ---------------------------------------
# Combined Metrics Table
# ---------------------------------------
dt_weighted_metrics <- data.frame(
  Model = "DT Weighted",
  Dataset = c("Train", "Test"),
  Accuracy = c(cm_dt_weighted_train$overall["Accuracy"], cm_dt_weighted_test$overall["Accuracy"]),
  Precision = c(cm_dt_weighted_train$byClass["Precision"], cm_dt_weighted_test$byClass["Precision"]),
  Recall = c(cm_dt_weighted_train$byClass["Recall"], cm_dt_weighted_test$byClass["Recall"]),
  F1 = c(cm_dt_weighted_train$byClass["F1"], cm_dt_weighted_test$byClass["F1"]),
  AUC = c(auc_dt_weighted_train, auc_dt_weighted_test)
  # Cutoff = custom_cutoff
)

dt_weighted_metrics |>  kable(caption = "Decision Trees Experiment 2 Metrics") |> 
  kable_styling() |> 
  kable_classic()
Decision Trees Experiment 2 Metrics

Model        Dataset  Accuracy   Precision  Recall     F1         AUC
DT Weighted  Train    0.8271712  0.3561598  0.6613300  0.4629809  0.7874679
DT Weighted  Test     0.8276950  0.3545203  0.6451149  0.4575796  0.7861752

Results Summary
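Compared with the depth 10 tree from Experiment One, the class-weighted model achieved much higher recall (0.65 vs. 0.23 on the test set) and a higher F1-score (0.46 vs. 0.34), with nearly identical results on the training and test sets. Precision fell from 0.64 to 0.35 and accuracy from 0.90 to 0.83, which is the expected cost of adjusting the model to flag more positive cases.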

Conclusion
Applying class weights improved the model’s ability to detect true subscribers. Although precision dropped to about 0.35, the large gain in recall and the higher F1-score indicate better performance for the minority class. This makes class weighting a useful strategy for handling imbalance in Decision Tree models when the goal is to identify more potential bank term deposit subscribers.

2.2 Random Forests

2.2.1 Random Forest – Experiment One: Upsampling the Minority Class

Objective
This experiment examines how applying upsampling to the training data affects the performance of a Random Forest classifier. The goal is to improve the detection of customers likely to subscribe to a term deposit, with a focus on recall and F1-score for the minority class (y = "yes").

Approach
The model was trained using a standard Random Forest implementation with 1,000 trees. To address class imbalance, upsampling was applied inside cross-validation (sampling = "up" in trainControl), so the minority class is resampled with replacement in the training portion of each fold while the held-out portion keeps its original class distribution. The mtry parameter, which controls the number of variables randomly selected at each split, was tuned over the values 2, 3, and 4.

Performance was evaluated using 5-fold cross-validation on the training set and validated on a separate test set.
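The idea behind upsampling can be sketched in a few lines of base R. This is a conceptual illustration on a hypothetical training fold, not caret's internal implementation, which may differ in detail.

```r
set.seed(1)
# Hypothetical training fold: 8 "no" rows and 2 "yes" rows
fold <- data.frame(
  x = 1:10,
  y = factor(c(rep("no", 8), rep("yes", 2)), levels = c("no", "yes"))
)

n_major   <- max(table(fold$y))                 # size of the majority class
minority  <- which(fold$y == "yes")             # row positions of the minority class
extra_idx <- sample(minority, n_major - length(minority), replace = TRUE)

# Original rows plus resampled copies of minority rows
fold_up <- fold[c(seq_len(nrow(fold)), extra_idx), ]
table(fold_up$y)
```

Only existing minority rows are duplicated, so no new information is created; the model simply sees subscribers as often as non-subscribers during training.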

set.seed(1265653)

train_index <- createDataPartition(bank_marketing_cl$y, p = 0.7, list = FALSE)
dt_train <- bank_marketing_cl[train_index, ]
dt_test  <- bank_marketing_cl[-train_index, ]
ctrl_rf <- trainControl(
  method = "cv",
  number = 5,
  sampling = "up",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
set.seed(1265653)

rf_model <- train(
  y ~ ., 
  data = dt_train,
  method = "rf",
  trControl = ctrl_rf,
  metric = "ROC",
  tuneGrid = expand.grid(mtry = seq(2,4,by=1)),
  ntree = 1000
)
best_mtry <- rf_model$bestTune$mtry

# Filter CV predictions for best mtry
cv_preds <- rf_model$pred |>
  filter(mtry == best_mtry)

# Confusion matrix and AUC for training
cm_train <- confusionMatrix(cv_preds$pred, cv_preds$obs, positive = "yes")
auc_train <- roc(cv_preds$obs, cv_preds$yes)$auc

# --- TEST METRICS (From holdout set) ---
rf_test_probs <- predict(rf_model, dt_test, type = "prob")[, "yes"]
rf_test_preds <- predict(rf_model, dt_test)
cm_test <- confusionMatrix(rf_test_preds, dt_test$y, positive = "yes")
auc_test <- roc(dt_test$y, rf_test_probs)$auc

# --- Final Metrics Table ---
rf_metrics <- data.frame(
  Model = "Random Forest",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train$overall["Accuracy"], cm_test$overall["Accuracy"]),
  Precision = c(cm_train$byClass["Precision"], cm_test$byClass["Precision"]),
  Recall = c(cm_train$byClass["Recall"], cm_test$byClass["Recall"]),
  F1 = c(cm_train$byClass["F1"], cm_test$byClass["F1"]),
  AUC = c(auc_train, auc_test)
)

rf_metrics |>  kable(caption = "Random Forest Experiment 1 Metrics") |> 
  kable_styling() |> 
  kable_classic()
Random Forest Experiment 1 Metrics

Model          Dataset     Accuracy   Precision  Recall     F1         AUC
Random Forest  Train (CV)  0.8614387  0.4176042  0.5828202  0.4865698  0.7915460
Random Forest  Test        0.8556976  0.4017094  0.5739943  0.4726412  0.7836298

Results Summary

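The model performed consistently across the cross-validated training estimates and the test set (F1 of 0.49 and 0.47). No Random Forest without upsampling was trained, so the effect of upsampling cannot be isolated here; compared with the unweighted decision trees of Section 2.1.1, however, test recall was far higher (0.57 vs. 0.23). Precision remained modest (0.40), but the test F1-score of 0.47 is the highest among the experiments up to this point, indicating a better balance between capturing true positives and avoiding too many false positives.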

Conclusion
Training with upsampling produced a Random Forest that identifies a majority of the customers likely to subscribe to a term deposit (test recall 0.57). Although precision was relatively low, the combination of recall and F1-score makes this approach valuable for scenarios where detecting potential subscribers is a priority.

2.2.2 Random Forest – Experiment Two: Feature Engineering and Preprocessing

Objective
This experiment investigates how specific feature engineering and preprocessing steps affect the performance of a Random Forest model. The focus is on transforming the age and pdays variables into more interpretable categories and removing less informative or redundant features.

Variation
Two main feature transformations were applied:
- Age was bucketed into three categories:
  - Adults (≤ 39)
  - Middle-aged Adults (40–59)
  - Senior Adults (60+)
- Pdays was converted from numeric into categorical intervals:
  - Never Contacted, One Week, Two Week, Three Week, Three Week +

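Additionally, the variables duration, housing, month, and default were removed, and the code also drops day_of_week, marital, education, job, contact, and loan. The model was trained on an 80% training split using 5-fold cross-validation with the same upsampling control as in Experiment One. mtry was tuned over values from 2 to 5, and 1,000 trees were used.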
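The age bucketing described above can also be expressed with base R's cut(). The sketch below uses hypothetical ages chosen to sit on the bucket boundaries; the labels are illustrative and need not match those used in the experiment code.

```r
age <- c(25, 39, 40, 59, 60, 88)   # hypothetical ages on and around the boundaries

age_group <- cut(
  age,
  breaks = c(-Inf, 39, 59, Inf),
  labels = c("Adult", "Middle-aged Adult", "Senior Adult"),
  right  = TRUE                    # intervals are (lower, upper]
)

data.frame(age, age_group)
```

Because the intervals are closed on the right, 39 falls in the first group and 59 in the second, matching the "≤ 39" and "40–59" definitions.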

Why This Variation Matters
Transforming continuous variables into interpretable categories can enhance the model’s ability to capture meaningful patterns, especially when domain knowledge supports the grouping. Removing less relevant features may also reduce noise and improve generalization.

Evaluation Metrics
The same metrics as in previous experiments were used:
- Primary metric: AUC
- Secondary metrics: F1-score, Precision, Recall, Accuracy
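Since AUC is the primary metric, it may help to see what it measures. The sketch below computes AUC from ranks on hypothetical scores; it equals the share of (subscriber, non-subscriber) pairs in which the subscriber receives the higher score. The experiments themselves use pROC, so this is only an illustration of the quantity.

```r
# Hypothetical scores and labels (1 = subscriber, 0 = non-subscriber)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1,   1,   0,   1,   0,   1,   0,   0)

r     <- rank(score)               # average ranks would handle ties
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)

# Rank-sum (Mann-Whitney) form of the AUC
auc_value <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_value
```

Unlike precision and recall, this value does not depend on the 0.5 classification cutoff, which is why it is used to compare models before a threshold is chosen.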

# Start from the raw data; pdays is kept so it can be binned below
bank_marketing_cl2 <- bank_marketing |>
  select(-all_of(c("duration", "housing", "month", "default")))

# Bucket age into three groups
bank_marketing_cl2 <- bank_marketing_cl2 |>
  mutate(age = case_when(
    age <= 39 ~ "Adult",
    age >= 40 & age <= 59 ~ "Middle-aged Adults",
    age >= 60 ~ "Senior Adult"
  ))

# Convert pdays (999 = never contacted) into recency intervals
bank_marketing_cl2 <- bank_marketing_cl2 |>
  mutate(pdays = case_when(
    pdays == 999 ~ "Never Contacted",
    pdays >= 0 & pdays <= 7 ~ "One Week",
    pdays >= 8 & pdays <= 14 ~ "Two Week",
    pdays >= 15 & pdays <= 21 ~ "Three Week",
    pdays >= 22 ~ "Three Week +"
  ))
set.seed(1265653)

train_index_rf2 <- createDataPartition(bank_marketing_cl2$y, p = 0.8, list = FALSE)
dt_train2 <- bank_marketing_cl2[train_index_rf2, ]
dt_test2  <- bank_marketing_cl2[-train_index_rf2, ]

# Drop additional categorical features from the engineered split created above
ex <- c("day_of_week", "marital", "education", "job", "contact", "loan")
dt_train2 <- dt_train2 |> select(-all_of(ex))
dt_test2  <- dt_test2 |> select(-all_of(ex))
rf_model2 <- train(
  y ~ ., 
  data = dt_train2 ,
  method = "rf",
  trControl = ctrl_rf,
  metric = "ROC",
  tuneGrid = expand.grid(mtry = seq(2,5,by=1)),
    # tuneGrid = expand.grid(mtry = 2),

  ntree = 1000
)
best_mtry2 <- rf_model2$bestTune$mtry

# Filter CV predictions for best mtry
cv_preds2 <- rf_model2$pred |>
  filter(mtry == best_mtry2)

# Confusion matrix and AUC for training
cm_train2 <- confusionMatrix(cv_preds2$pred, cv_preds2$obs, positive = "yes")
auc_train2 <- roc(cv_preds2$obs, cv_preds2$yes)$auc


rf_test_probs2 <- predict(rf_model2, dt_test2, type = "prob")[, "yes"]
rf_test_preds2 <- predict(rf_model2, dt_test2)
cm_test2 <- confusionMatrix(rf_test_preds2, dt_test2$y, positive = "yes")
auc_test2 <- roc(dt_test2$y, rf_test_probs2)$auc

# --- Final Metrics Table ---
rf_metrics2 <- data.frame(
  Model = "Random Forest",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train2$overall["Accuracy"], cm_test2$overall["Accuracy"]),
  Precision = c(cm_train2$byClass["Precision"], cm_test2$byClass["Precision"]),
  Recall = c(cm_train2$byClass["Recall"], cm_test2$byClass["Recall"]),
  F1 = c(cm_train2$byClass["F1"], cm_test2$byClass["F1"]),
  AUC = c(auc_train2, auc_test2)
)

rf_metrics2 |>  kable(caption = "Random Forest Experiment 2 Metrics") |> 
  kable_styling() |> 
  kable_classic()
Random Forest Experiment 2 Metrics

Model          Dataset     Accuracy   Precision  Recall     F1         AUC
Random Forest  Train (CV)  0.8427442  0.3765835  0.6040640  0.4639395  0.7846408
Random Forest  Test        0.8390256  0.3682119  0.5991379  0.4561116  0.7804622

Results and Conclusion
Relative to Random Forest Experiment One, the engineered features raised test recall slightly (0.60 vs. 0.57), but precision (0.37 vs. 0.40), F1-score (0.46 vs. 0.47), and AUC (0.780 vs. 0.784) were marginally lower. The model therefore detects a few more positive cases at the cost of more false positives, while class discrimination is essentially unchanged. Training and test metrics are close, which suggests the model generalizes consistently.

Recommendation
Domain-informed feature engineering (e.g., age brackets, pdays intervals) makes the inputs easier to interpret and may modestly raise minority-class recall in Random Forest models, but in this experiment it did not improve F1-score or AUC. It is best treated as an option to validate against a baseline rather than as a default preprocessing step.

2.3 AdaBoost

2.3.1 Experiment One: Shift in Strategy with Feature Engineering and Boosting

Objective
After earlier experiments with Decision Trees and Random Forests showed limited improvement in detecting the minority class (y = "yes"), this experiment introduces a new strategy. The goal is to evaluate whether combining AdaBoost, structured feature engineering, and preprocessing of numerical features can lead to more meaningful improvements in classification performance.

Variation
This experiment builds a redesigned dataset using domain-driven feature engineering and scaled transformations of numeric variables. Specifically:

  • Numerical features (campaign, euribor3m, nr.employed, and age) were centered, scaled, and transformed using the Yeo-Johnson method to address skewness and improve consistency across predictors.
  • Additional categorical features were engineered to capture contact recency, contact timing, education grouping, and financial risk indicators.
  • Upsampling was used within cross-validation to correct the imbalance between subscription (yes) and non-subscription (no) classes.
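For readers unfamiliar with the Yeo-Johnson method mentioned above, the following base-R sketch applies the transform for a fixed lambda. The helper yeo_johnson() is hypothetical and written only for illustration; caret's preProcess() estimates lambda from the data, whereas here it is set by hand.

```r
# Yeo-Johnson transform for a fixed lambda (hypothetical helper, base R only)
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative values
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }
  # Negative values
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-x[!pos]) + 1)^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

x <- c(-2, 0, 1, 3, 40)      # hypothetical right-skewed values
yeo_johnson(x, lambda = 1)   # lambda = 1 leaves the data unchanged
yeo_johnson(x, lambda = 0)   # lambda = 0 compresses large positive values
```

Unlike the Box-Cox transform, this family is defined for zero and negative inputs, which is convenient for indicators that can take negative values.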

The AdaBoost model was tuned across the following hyperparameters:
- mfinal: Tested values 50, 100, and 150 to evaluate the effect of increasing the number of boosting iterations. Each additional iteration concentrates on observations that earlier learners misclassified, but too many iterations can lead to overfitting.
- maxdepth: Set to 3 and 5 to limit the depth of the individual weak learners. These are shallow trees rather than true decision stumps (depth 1); keeping them shallow reduces variance and enforces simplicity in each boosting round.
- coeflearn: Fixed to "Breiman", which weights each learner by α = ½ ln((1 − err) / err), where err is the learner's weighted error. This option provides a balance between aggressive error correction and model stability.
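A single boosting round can be worked through by hand to show how the learner weight and the observation weights interact. This is a sketch of the standard AdaBoost update on hypothetical predictions, assuming the Breiman-style coefficient α = ½ ln((1 − err) / err); the internals of the adabag implementation may differ in detail.

```r
# One boosting round on toy data (hypothetical values)
truth <- c("yes", "no", "no", "yes", "no")
pred  <- c("yes", "no", "yes", "no", "no")   # weak learner's predictions
w     <- rep(1 / 5, 5)                       # initial observation weights

miss  <- truth != pred                       # TRUE where the learner was wrong
err   <- sum(w[miss])                        # weighted error of this learner
alpha <- 0.5 * log((1 - err) / err)          # learner weight

# Increase the weight of misclassified observations, then renormalise
w_new <- w * exp(alpha * miss)
w_new <- w_new / sum(w_new)
round(w_new, 3)
```

The two misclassified observations end up with more weight than the three correct ones, so the next tree is pushed toward the cases the current one got wrong.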

Why This Variation Matters
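Unlike Random Forests, which average many decorrelated trees grown independently, AdaBoost builds trees sequentially, with each tree concentrating on the cases its predecessors misclassified. This makes it more sensitive to noisy data, so the quality of the input features matters. Tree-based learners are largely unaffected by monotone rescaling of individual predictors, so the numeric transformations are included mainly to test empirically whether they help. In this setup, boosting is expected to better exploit informative patterns that earlier models may have missed.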

Evaluation Metrics
Models were assessed using:
- Primary metric: AUC
- Secondary metrics: F1-score, Recall, Precision, Accuracy

All metrics were calculated using 5-fold cross-validation on the training set and evaluated on a held-out test set.

bank_marketing_cl3 <- bank_marketing |>
  
  # 1. Contact recency (from pdays), stored under its own name so it survives the drop of raw pdays below
  mutate(contact_recency = case_when(
    pdays == 999 ~ "Never Contacted",
    pdays >= 0 & pdays <= 7 ~ "One Week",
    pdays >= 8 & pdays <= 14 ~ "Two Week",
    pdays >= 15 & pdays <= 21 ~ "Three Week",
    pdays >= 22 ~ "Three Week +"
  )) |>
  
  # 2. Has Any Debt (loan + housing + default)
  mutate(has_any_debt = ifelse(loan == "yes" | housing == "yes" | default == "yes", 1, 0)) |>
  
  # 3. Contact Quarter (from month)
  mutate(contact_quarter = case_when(
    month %in% c("mar", "apr", "may") ~ "Q2",
    month %in% c("jun", "jul", "aug") ~ "Q3",
    month %in% c("sep", "oct", "nov", "dec") ~ "Q4",
    TRUE ~ "Q1"
  )) |>
  
  # 4. Job Risk Level
  mutate(job_risk = case_when(
    job %in% c("blue-collar", "services", "unemployed") ~ "high_risk",
    job %in% c("admin.", "technician", "management", "student") ~ "low_risk",
    TRUE ~ "unknown"
  )) |>
  
  # 5. Preferred Contact Day (thu/fri)
  mutate(preferred_contact_day = ifelse(day_of_week %in% c("thu", "fri"), 1, 0)) |>
  
  # 6. Education Level Simplified
  mutate(edu_level = case_when(
    grepl("basic", education) ~ "basic",
    education == "university.degree" ~ "higher_ed",
    education == "professional.course" ~ "vocational",
    TRUE ~ "other"
  )) |>
  
  # 7. Weekend Contact Flag
  mutate(weekend_contact = ifelse(day_of_week %in% c("sat", "sun"), 1, 0)) |>
  
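  select(-c(pdays, loan, housing, default, month, job, day_of_week, education, duration))

# Fit a caret preProcess() model on the selected fields and apply it to the data.
# Returns the transformed data and the fitted model so it can be reused on the test set.
transformData <- function(data, fields, methods) {
  transformation_model <- data |> select(all_of(fields)) |> preProcess(method = methods)
  transformed <- predict(transformation_model, data)

  return(list(transformed, transformation_model))
}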
set.seed(555)

train_index_ada3 <- createDataPartition(bank_marketing_cl3$y, p = 0.70, list = FALSE)
ada_train3 <- bank_marketing_cl3[train_index_ada3, ]
ada_test3  <- bank_marketing_cl3[-train_index_ada3, ]

transform_1 <- transformData(ada_train3,c("campaign","euribor3m","nr.employed","age"),c("center", "scale", "YeoJohnson"))

ada_train3 <- transform_1[[1]]
transform_1_lambda <- transform_1[[2]]
ada_test3 <- predict(transform_1_lambda, ada_test3)

tune <- expand.grid(
  mfinal = c(50,100,150),
  maxdepth = c(3, 5),
  coeflearn = "Breiman"
)


ctrl_ada <- trainControl(
  method = "cv",
  number = 5,
  sampling = "up",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)


ada_model3 <- train(
  y ~ .,
  data = ada_train3,
  method = "AdaBoost.M1",
  trControl = ctrl_ada,
  metric = "ROC",
  tuneGrid = tune
)
# 1. Get best tuning parameters
best_params3 <- ada_model3$bestTune

# 2. Filter cross-validated predictions for best model
cv_preds3 <- ada_model3$pred |>
  filter(
    mfinal == best_params3$mfinal,
    maxdepth == best_params3$maxdepth,
    coeflearn == best_params3$coeflearn
  )

# 3. Confusion matrix and AUC on cross-validation predictions (Train set)
cm_train3 <- confusionMatrix(cv_preds3$pred, cv_preds3$obs, positive = "yes")
auc_train3 <- roc(cv_preds3$obs, cv_preds3$yes)$auc

# 4. Predictions on test set
ada_test_probs3 <- predict(ada_model3, ada_test3, type = "prob")[, "yes"]
ada_test_preds3 <- predict(ada_model3, ada_test3)

# 5. Confusion matrix and AUC on test set
cm_test3 <- confusionMatrix(ada_test_preds3, ada_test3$y, positive = "yes")
auc_test3 <- roc(ada_test3$y, ada_test_probs3)$auc

# 6. Final comparison table
ada_metrics3 <- data.frame(
  Model = "AdaBoost",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train3$overall["Accuracy"], cm_test3$overall["Accuracy"]),
  Precision = c(cm_train3$byClass["Precision"], cm_test3$byClass["Precision"]),
  Recall = c(cm_train3$byClass["Recall"], cm_test3$byClass["Recall"]),
  F1 = c(cm_train3$byClass["F1"], cm_test3$byClass["F1"]),
  AUC = c(auc_train3, auc_test3)
)

ada_metrics3 |>  kable(caption = "AdaBoost Experiment 1 Metrics") |> 
  kable_styling() |> 
  kable_classic()
AdaBoost Experiment 1 Metrics

Model     Dataset     Accuracy   Precision  Recall     F1         AUC
AdaBoost  Train (CV)  0.8306396  0.3587839  0.6394704  0.4596658  0.8000380
AdaBoost  Test        0.8340887  0.3651639  0.6400862  0.4650313  0.7996845

Results and Conclusion

This model achieved the highest AUC observed so far (0.80 on the test set). Its test recall (0.64) is on par with the class-weighted decision tree (0.65), and its F1-score (0.47) is in line with the Random Forest experiments (0.46 to 0.47). The consistency between training and test performance also suggests that the model generalizes well.

However, one notable limitation is the low precision, which remained below 0.37. This means that a relatively high number of predicted positive cases were false positives. In practical terms, while the model is more successful at finding actual subscribers, it also classifies many non-subscribers as subscribers.

This trade-off reflects a recall-oriented model, which may be acceptable in contexts where missing potential subscribers is more costly than reaching out to uninterested ones. However, in applications where precision matters (e.g., avoiding wasted marketing costs), further tuning or threshold adjustment may be necessary.
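The threshold adjustment mentioned above can be illustrated with a small base-R sketch. The probabilities and labels are hypothetical, and pr_at() is an illustrative helper rather than part of the experiment code; the point is only to show how moving the cutoff shifts the balance between precision and recall.

```r
# Hypothetical predicted probabilities and true labels
prob  <- c(0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.30, 0.20, 0.10, 0.05)
truth <- c("yes", "yes", "no", "yes", "no", "yes", "no", "no", "no", "no")

# Precision and recall at a given cutoff (hypothetical helper)
pr_at <- function(cutoff) {
  pred <- ifelse(prob >= cutoff, "yes", "no")
  tp <- sum(pred == "yes" & truth == "yes")
  fp <- sum(pred == "yes" & truth == "no")
  fn <- sum(pred == "no"  & truth == "yes")
  c(cutoff = cutoff, precision = tp / (tp + fp), recall = tp / (tp + fn))
}

res <- t(sapply(c(0.3, 0.5, 0.7), pr_at))
res
```

On this toy data, raising the cutoff increases precision and lowers recall at each step, so the choice of threshold should follow from the relative cost of a wasted call versus a missed subscriber.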

2.3.2 Experiment Two: Simplified Features with SMOTE and Downsampling

Objective
This experiment evaluates whether combining SMOTE and downsampling with a simplified, cleaner feature set can improve AdaBoost’s classification performance on the Bank Marketing dataset. After previous experiments involving numeric transformations showed limited impact, this setup intentionally excludes such preprocessing. The focus shifts to engineered categorical and binary features, paired with hybrid sampling to handle class imbalance.

Variation
Several original variables were removed based on prior analysis, and a compact set of engineered features was introduced:

  • Risk Score: Sum of binary indicators for housing, loan, and default
  • Multi-loan Flag: Indicates whether a client has both a personal and housing loan
  • Was Previously Contacted: Derived from the pdays field
  • Contact Intensity: Total number of campaign and prior contacts

No transformations (e.g., centering, scaling, Yeo-Johnson) were applied to numerical features, based on evidence from earlier experiments that such steps had little to no impact on performance.

Categorical variables were encoded as integers using a recipe. Class imbalance was addressed through a combination of SMOTE to generate synthetic examples for the minority class and downsampling to reduce the dominant class, allowing the model to train on a more balanced distribution.
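The core idea of SMOTE can be sketched in base R: a synthetic minority example is placed at a random position on the line segment between a minority point and one of its minority-class neighbours. The helper smote_one() below is hypothetical and simplified (it always uses the single nearest neighbour, whereas SMOTE implementations typically choose among several); it is not the code used by the recipe step.

```r
set.seed(42)
# Hypothetical minority-class points described by two numeric features
minority <- matrix(c(1.0, 2.0,
                     1.5, 2.5,
                     3.0, 1.0), ncol = 2, byrow = TRUE)

# Generate one synthetic point (simplified: nearest neighbour only)
smote_one <- function(X) {
  i <- sample(nrow(X), 1)                               # pick a minority point
  centre <- matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE)
  d <- sqrt(rowSums((X - centre)^2))                    # distances to all points
  d[i] <- Inf                                           # exclude the point itself
  j <- which.min(d)                                     # nearest minority neighbour
  gap <- runif(1)                                       # random position on the segment
  X[i, ] + gap * (X[j, ] - X[i, ])
}

synthetic <- t(replicate(5, smote_one(minority)))
synthetic
```

This interpolation is only meaningful for numeric inputs, which is consistent with encoding the categorical variables as integers before the SMOTE step.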

The AdaBoost model was tuned across the following parameters:
- mfinal: 50, 100, 150
- maxdepth: 3, 5, 10
- coeflearn: "Breiman"

Why This Variation Matters
This setup tests whether removing unnecessary complexity and applying focused resampling techniques can help AdaBoost better capture the minority class signal. Unlike earlier approaches that emphasized heavy transformations or broad feature expansions, this strategy prioritizes clarity and balance, both in the dataset and the learning process.

Evaluation Metrics
Model performance was assessed using:
- Primary metric: AUC
- Secondary metrics: F1-score, Recall, Precision, Accuracy

All metrics were calculated using 5-fold cross-validation on the training data and tested on a held-out set.

Results and Conclusion

set.seed(6666)
ex = c(
    "day_of_week",
       "marital",
       "education"
       ,"job"
       ,"contact"
       ,"loan"
       ,"previous"
       # ,"poutcome"
    ,"duration"
    ,"housing",
    "pdays",
    "emp.var.rate",
    "nr.employed"
    ,"cons.price.idx",
    "cons.conf.idx"
       )

bank_marketing_cl4 = bank_marketing
# Create engineered features on the full dataset
bank_marketing_cl4$risk_score <- with(bank_marketing_cl4, 
  as.integer(housing == "yes") + 
  as.integer(loan == "yes") + 
  as.integer(default == "yes")
)

bank_marketing_cl4$multi_loan_flag <- with(bank_marketing_cl4,
  as.integer(housing == "yes" & loan == "yes")
)

bank_marketing_cl4$was_previously_contacted <- with(bank_marketing_cl4,
  as.integer(pdays != 999)
)

bank_marketing_cl4$contact_intensity <-  with(bank_marketing_cl4,
  campaign + previous
)

bank_marketing_cl4$multi_loan_flag <- as.factor(bank_marketing_cl4$multi_loan_flag)
bank_marketing_cl4$was_previously_contacted <- as.factor(bank_marketing_cl4$was_previously_contacted)
bank_marketing_cl4$contact_intensity <- as.factor(bank_marketing_cl4$contact_intensity)
# Drop the raw columns that the engineered features replace
bank_marketing_cl4 <- bank_marketing_cl4 |> select(-all_of(ex))

train_index_ada_exp2 <- createDataPartition(bank_marketing_cl4$y, p = 0.70, list = FALSE)
ada_exp2_train <- bank_marketing_cl4[train_index_ada_exp2, ]
ada_exp2_test  <- bank_marketing_cl4[-train_index_ada_exp2, ]
set.seed(986)

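registerDoSEQ()  # run sequentially (no parallel backend)

# recipes provides recipe() and step_integer(); themis provides step_smote() and step_downsample()
library(recipes)
library(themis)

rec_exp2 <- recipe(y ~ ., data = ada_exp2_train) |>
  step_integer(all_nominal_predictors()) |>
  step_smote(y) |>
  step_downsample(y)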

# TrainControl
ctrl_exp2 <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)

# Tuning grid
tune_exp2 <- expand.grid(
  mfinal = c(50,100,150),
  maxdepth = c(3, 5,10),
  coeflearn = "Breiman"
)

# Train the AdaBoost model
ada_model_exp2 <- train(
  rec_exp2,
  data = ada_exp2_train,
  method = "AdaBoost.M1",
  metric = "ROC",
  trControl = ctrl_exp2,
  tuneGrid = tune_exp2
)
# 1. Get best tuning parameters for Experiment 2
best_params_exp2 <- ada_model_exp2$bestTune

# 2. Filter cross-validated predictions for best model
cv_preds_exp2 <- ada_model_exp2$pred |>
  filter(
    mfinal == best_params_exp2$mfinal,
    maxdepth == best_params_exp2$maxdepth,
    coeflearn == best_params_exp2$coeflearn
  )


# 3. Confusion matrix and AUC on cross-validation predictions (Train set)
cm_train_exp2 <- confusionMatrix(cv_preds_exp2$pred, cv_preds_exp2$obs, positive = "yes")

auc_train_exp2 <- roc(cv_preds_exp2$obs, cv_preds_exp2$yes)$auc

# 4. Predictions on test set
ada_test_probs_exp2 <- predict(ada_model_exp2, ada_exp2_test, type = "prob")[, "yes"]
ada_test_preds_exp2 <- predict(ada_model_exp2, ada_exp2_test)

# 5. Confusion matrix and AUC on test set
cm_test_exp2 <- confusionMatrix(ada_test_preds_exp2, ada_exp2_test$y, positive = "yes")
auc_test_exp2 <- roc(ada_exp2_test$y, ada_test_probs_exp2)$auc

# 6. Final comparison table
ada_metrics_exp2 <- data.frame(
  Model = "AdaBoost Exp 2",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train_exp2$overall["Accuracy"], cm_test_exp2$overall["Accuracy"]),
  Precision = c(cm_train_exp2$byClass["Precision"], cm_test_exp2$byClass["Precision"]),
  Recall = c(cm_train_exp2$byClass["Recall"], cm_test_exp2$byClass["Recall"]),
  F1 = c(cm_train_exp2$byClass["F1"], cm_test_exp2$byClass["F1"]),
  AUC = c(auc_train_exp2, auc_test_exp2)
)

ada_metrics_exp2 |>  kable(caption = "AdaBoost Experiment 2 Metrics") |> 
  kable_styling() |> 
  kable_classic()
AdaBoost Experiment 2 Metrics

Model            Dataset      Accuracy    Precision   Recall      F1          AUC
AdaBoost Exp 2   Train (CV)   0.8673349   0.4280010   0.5280172   0.4727774   0.7841837
AdaBoost Exp 2   Test         0.8650858   0.4214734   0.5301724   0.4696150   0.7932252
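
The F1 column in the table above is the harmonic mean of precision and recall. As a minimal sketch of that arithmetic, the snippet below recomputes the test-set F1 from the reported precision and recall. The helper name `f1_score` is an assumption introduced for illustration. The Experiment One inputs are the rounded figures quoted in the comparison with Experiment One, so that result is only approximate.

```r
# Harmonic mean of precision and recall (hypothetical helper, for illustration)
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Experiment 2, test set: values copied from the metrics table
f1_exp2 <- f1_score(precision = 0.4214734, recall = 0.5301724)

# Experiment One, test set: rounded values quoted in the text, so approximate
f1_exp1 <- f1_score(precision = 0.365, recall = 0.640)

round(c(exp1 = f1_exp1, exp2 = f1_exp2), 3)
```

The Experiment 2 value reproduces the F1 shown in the table, and the gap between the two experiments is only a few thousandths, which is why the improvement is described as slight.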

Compared to Experiment One, this model achieved:

  • Higher precision (from 0.365 to 0.421 on the test set), meaning it made fewer false positive predictions.
  • Lower recall (from 0.640 to 0.530), meaning it missed more actual subscribers.
  • A slightly improved F1-score, indicating better overall balance between precision and recall.
  • A small decrease in AUC, from 0.800 to 0.793, which still reflects good model discrimination.

These results reflect a clear shift in the model’s behavior: the hybrid resampling and simpler features led to more selective, precise predictions, but at the cost of missing some true positives. The modest improvement in F1-score suggests this approach offers a more balanced model than the one used in Experiment One.
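
One way to sanity-check the hybrid resampling is to prep the recipe and count the classes in the processed training data. The sketch below does this on a small synthetic data set (`toy`, `x1`, and `x2` are made up for illustration) and assumes the themis defaults of `over_ratio = 1` for `step_smote()` and `under_ratio = 1` for `step_downsample()`. Under those defaults SMOTE already raises the minority class to the majority count, so the downsampling step may have nothing left to remove. Running the same check on `rec_exp2` would show whether that holds for the actual experiment.

```r
library(recipes)
library(themis)

# Synthetic, imbalanced toy data (assumption: 180 "no" vs 20 "yes")
set.seed(1)
toy <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  y  = factor(c(rep("no", 180), rep("yes", 20)), levels = c("no", "yes"))
)

# Same resampling steps as the experiment recipe, with default ratios
rec_toy <- recipe(y ~ ., data = toy) |>
  step_smote(y) |>
  step_downsample(y)

# bake(new_data = NULL) returns the processed training set,
# including steps that are skipped when predicting on new data
balanced <- prep(rec_toy) |> bake(new_data = NULL)

table(toy$y)       # class counts before resampling
table(balanced$y)  # class counts after SMOTE + downsampling
```

If both steps are meant to contribute, setting the ratios explicitly (for example, a lower `over_ratio` in `step_smote()`) is one option to consider.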

3. Summary of Experimental Results

The set of experiments explored different strategies for improving classification performance on the Bank Marketing Dataset, with the primary challenge being a highly imbalanced target variable. Each model incorporated a distinct method for addressing this issue, including deeper trees, class weighting, sampling techniques, and feature engineering. While these adjustments produced variations in performance, the overall patterns reveal a limited capacity to achieve both high recall and high precision simultaneously.

The Decision Tree models provided useful baselines but highlighted the trade-off between model complexity and generalization. Deeper trees increased recall but at the cost of precision, while class weighting modestly improved recall without significantly affecting precision. These models struggled to generalize, often overfitting to the training data.

Random Forests improved overall stability and delivered better balance across metrics, particularly when combined with upsampling. Still, neither Random Forest configuration overcame the precision-recall trade-off. Feature engineering offered minor improvements in recall but did not produce meaningful gains in AUC.

AdaBoost models produced the most polarized results. The version with numeric transformations and engineered features achieved the highest recall and AUC, indicating a stronger bias toward identifying positive cases. In contrast, the simplified AdaBoost model with SMOTE and downsampling achieved the highest precision, favoring conservative predictions. These results suggest that boosting emphasizes recall at the cost of false positives unless constrained by simplified inputs and balanced training data.
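
The tension between precision and recall described above can be illustrated by sweeping the classification cutoff over predicted probabilities. The sketch below uses synthetic scores. The class sizes, the `rbeta` parameters, and the helper `pr_at` are all assumptions chosen for illustration, not values from the experiments. With a fitted model, the same sweep could be applied to the `yes` probabilities returned by `predict(..., type = "prob")`.

```r
# Synthetic scores, purely illustrative: positives tend to score higher
set.seed(42)
n_pos <- 100
n_neg <- 900
obs   <- factor(c(rep("yes", n_pos), rep("no", n_neg)), levels = c("no", "yes"))
score <- c(rbeta(n_pos, 4, 3), rbeta(n_neg, 2, 5))

# Precision and recall for the "yes" class at a given cutoff (hypothetical helper)
pr_at <- function(threshold) {
  pred_yes <- score >= threshold
  tp <- sum(pred_yes & obs == "yes")
  fp <- sum(pred_yes & obs == "no")
  fn <- sum(!pred_yes & obs == "yes")
  c(
    threshold = threshold,
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    recall    = tp / (tp + fn)
  )
}

# One row per cutoff
thresholds <- seq(0.2, 0.8, by = 0.1)
pr_table <- as.data.frame(t(sapply(thresholds, pr_at)))
print(round(pr_table, 3))
```

Recall can only fall as the cutoff rises, while precision typically increases, which mirrors the pattern seen when comparing the two AdaBoost experiments.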

In conclusion, while each modeling choice influenced the bias-variance trade-off differently, none of the approaches consistently delivered strong enough performance to justify deployment. The performance limitations suggest that future work should prioritize testing more advanced classifiers and exploring ways to enhance data quality and feature richness, particularly for the underrepresented class. Without these improvements, further optimization of current models is unlikely to yield substantial gains.