This project is a continuation of a previous exploratory data analysis (EDA) on the Bank Marketing Dataset, which explored customer responses to term deposit offers. The EDA phase (viewable here) uncovered two important characteristics that influence model selection and evaluation:
Building on those insights, this phase shifts focus from exploration to systematic experimentation. The goal is to test different modeling strategies and preprocessing decisions to determine what improves predictive performance, especially for the underrepresented class.
This assignment involves conducting at least six (6) total experiments, spanning three different classification algorithms: Decision Trees, Random Forest, and AdaBoost.
Each algorithm is tested under at least two different experimental conditions. Before running each experiment, an objective is clearly defined. After execution, results are evaluated and compared to draw conclusions and guide recommendations.
Through this iterative and methodical process, the project aims to surface insights on what modeling strategies are most effective for predicting term deposit subscriptions in the Bank Marketing Dataset, which reflects the complexities of real-world customer behavior.
Before running experiments, we applied essential preprocessing to prepare the dataset for modeling. Although multicollinearity is not a major concern for tree-based models, removing highly correlated features helps reduce redundancy and ensures a cleaner, more interpretable dataset across experiments. Based on correlation analysis, variables such as housing and month were removed due to their strong pairwise relationships. Additionally, variables with near-zero variance, such as pdays, were excluded, as they contribute little to model performance.
Other features were dropped based on earlier analysis. For example, duration was excluded because it is directly tied to the target variable and only becomes available after the outcome of the call is known. Including it would introduce data leakage, as it provides information from the future that would not be available during prediction.
This cleaned version of the dataset served as the foundation for the experiments that followed. Each experiment tested a different model configuration to evaluate performance, with a focus on handling class imbalance and assessing the relevance of available features.
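One detail of correlation-based filtering is easy to get wrong: the indices it returns refer to the columns of the correlation matrix (numeric columns only), not to the full data frame. The sketch below uses toy data and base R only; the column names `group`, `flag`, `x1`, `x2`, and `x3` are hypothetical, and the pairwise rule is a simplified stand-in for a correlation filter rather than the exact procedure used in this report.

```r
# Toy data: two factor columns first, then three numeric columns.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  group = factor(sample(c("a", "b"), n, replace = TRUE)),
  flag  = factor(sample(c("yes", "no"), n, replace = TRUE)),
  x1    = x1,
  x2    = x1 + rnorm(n, sd = 0.1),  # nearly collinear with x1
  x3    = rnorm(n)                  # independent of x1 and x2
)

# Correlation matrix over the numeric columns only
num_cols <- toy[sapply(toy, is.numeric)]
corr     <- cor(num_cols)

# Flag the second member of every pair whose |r| exceeds the cutoff
high        <- which(abs(corr) > 0.75 & upper.tri(corr), arr.ind = TRUE)
flagged_idx <- unique(high[, "col"])

# Correct: resolve the indices against the correlation matrix
flagged_names <- colnames(corr)[flagged_idx]
# Incorrect: the same indices applied to the full data frame
wrong_names   <- names(toy)[flagged_idx]

print(flagged_names)
print(wrong_names)
```

Because the factor columns shift the positions, the same index points at a different variable in the full data frame, so the lookup has to use the correlation matrix's own column names.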
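library(tidyverse)   # dplyr verbs and purrr::keep
library(caret)       # findCorrelation, nearZeroVar, train
library(pROC)        # roc, auc
library(knitr)       # kable
library(kableExtra)  # kable_styling, kable_classic
bank_marketing <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)
corr_matrix <- bank_marketing |>
  keep(is.numeric) |>
  cor()
highCorrelation <- findCorrelation(corr_matrix, cutoff = 0.75)
# findCorrelation() indexes the columns of corr_matrix (numeric columns only)
hcol <- colnames(corr_matrix)[highCorrelation]
noVar <- nearZeroVar(bank_marketing)
columns_to <- names(bank_marketing)[noVar]
bank_marketing_cl <- bank_marketing |>
  select(-all_of(c("pdays", "duration", "housing", "month", "default")))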
Objective
This experiment evaluates how limiting the depth of a decision tree
affects its ability to predict term deposit subscriptions. The focus is
on improving both precision and recall, since these metrics are most
relevant when identifying customers likely to subscribe. Precision tells
us how many predicted subscribers are actually correct, while recall
shows how many actual subscribers the model successfully finds.
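As a small worked example of these definitions, the sketch below computes precision, recall, F1, and accuracy by hand on ten made-up customers. The labels are toy values chosen for illustration, not data from this project.

```r
# Toy labels: 10 customers, 3 of whom actually subscribed ("yes")
truth <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no", "no"),
                levels = c("no", "yes"))
pred  <- factor(c("yes", "no", "no", "yes", "no", "no", "no", "no", "no", "no"),
                levels = c("no", "yes"))

tp <- sum(pred == "yes" & truth == "yes")  # correctly predicted subscribers
fp <- sum(pred == "yes" & truth == "no")   # predicted subscribers who did not subscribe
fn <- sum(pred == "no"  & truth == "yes")  # subscribers the model missed

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- mean(pred == truth)

print(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy))
```

Accuracy is 0.7 here even though only one of three subscribers is found, which is why precision and recall are the focus on an imbalanced target.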
Approach
Two models were trained with different maximum depths:
- A shallow tree with maxdepth = 3, designed to keep the
model simple
- A deeper tree with maxdepth = 10, allowing for more
detailed splits
To control tree complexity and prevent overfitting, we also tuned the
complexity parameter (cp), which
determines the minimum improvement required for a split to be
kept. The grid was generated with seq(0.001, 0.1, by = 0.005),
which yields 20 values from 0.001 to 0.096. Lower cp values allow the
tree to grow deeper, while higher values prune the tree earlier.
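To make the grid concrete, the short check below lists the values that this step size produces. It uses base R only and mirrors the seq() call in the experiment code.

```r
# Candidate cp values: start at 0.001 and step by 0.005 without exceeding 0.1
cp_values <- seq(0.001, 0.1, by = 0.005)

print(length(cp_values))  # number of candidates
print(range(cp_values))   # smallest and largest candidate

# expand.grid() wraps the values in a one-column data frame named "cp"
cp_grid_demo <- expand.grid(cp = cp_values)
print(head(cp_grid_demo, 3))
```

Because the start value is 0.001 rather than 0, the sequence stops at 0.096 and never reaches the nominal upper bound of 0.1.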
Both models used 10-fold cross-validation and were evaluated on a
separate test set. The features pdays, duration, housing, month, and
default were excluded during preprocessing; of these, duration is the
one that leaks information about the outcome.
set.seed(123)
dt_train_index <- createDataPartition(bank_marketing_cl$y, p = 0.7, list = FALSE)
dt_train <- bank_marketing_cl[dt_train_index, ]
dt_test <- bank_marketing_cl[-dt_train_index, ]
set.seed(123)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.005))
dt_ctrl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
dt_depth3_model <- train(
  y ~ .,
  data = dt_train,
  method = "rpart",
  metric = "ROC",
  trControl = dt_ctrl,
  tuneGrid = cp_grid,  # cp_grid is already a one-column data frame
  control = rpart::rpart.control(maxdepth = 3)
)
dt_depth10_model <- train(
  y ~ .,
  data = dt_train,
  method = "rpart",
  metric = "ROC",
  trControl = dt_ctrl,
  tuneGrid = cp_grid,
  control = rpart::rpart.control(maxdepth = 10)
)
# Summarise one model/dataset pair as a single row of metrics
get_metrics <- function(preds, probs, truth, model_name, dataset_name) {
  cm <- confusionMatrix(preds, truth, positive = "yes")
  auc_val <- auc(roc(truth, probs, levels = c("no", "yes"), direction = "<"))
  data.frame(
    Model = model_name,
    Dataset = dataset_name,
    Accuracy = unname(cm$overall["Accuracy"]),
    Precision = unname(cm$byClass["Precision"]),
    Recall = unname(cm$byClass["Recall"]),
    F1 = unname(cm$byClass["F1"]),
    AUC = unname(auc_val)
  )
}
# Get predicted probabilities
dt_depth3_train_probs <- predict(dt_depth3_model, dt_train, type = "prob")[, "yes"]
dt_depth10_train_probs <- predict(dt_depth10_model, dt_train, type = "prob")[, "yes"]
dt_depth3_test_probs <- predict(dt_depth3_model, dt_test, type = "prob")[, "yes"]
dt_depth10_test_probs <- predict(dt_depth10_model, dt_test, type = "prob")[, "yes"]
cutoff <- 0.5
# Class predictions: label "yes" when the predicted probability exceeds the cutoff
dt_depth3_train_preds <- factor(ifelse(dt_depth3_train_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
dt_depth10_train_preds <- factor(ifelse(dt_depth10_train_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
dt_depth3_test_preds <- factor(ifelse(dt_depth3_test_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
dt_depth10_test_preds <- factor(ifelse(dt_depth10_test_probs > cutoff, "yes", "no"), levels = c("no", "yes"))
# Combine metrics
dt_metrics <- rbind(
  get_metrics(dt_depth3_train_preds, dt_depth3_train_probs, dt_train$y, "DT Depth 3", "Train"),
  get_metrics(dt_depth10_train_preds, dt_depth10_train_probs, dt_train$y, "DT Depth 10", "Train"),
  get_metrics(dt_depth3_test_preds, dt_depth3_test_probs, dt_test$y, "DT Depth 3", "Test"),
  get_metrics(dt_depth10_test_preds, dt_depth10_test_probs, dt_test$y, "DT Depth 10", "Test")
)
dt_metrics |>
  kable(caption = "Decision Trees Experiment 1 Metrics") |>
  kable_styling() |>
  kable_classic()
| Model | Dataset | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| DT Depth 3 | Train | 0.8995214 | 0.7191011 | 0.1773399 | 0.2845147 | 0.7036327 |
| DT Depth 10 | Train | 0.9064928 | 0.7243902 | 0.2743227 | 0.3979455 | 0.7524979 |
| DT Depth 3 | Test | 0.8986727 | 0.7215190 | 0.1637931 | 0.2669789 | 0.7076750 |
| DT Depth 10 | Test | 0.8989155 | 0.6421471 | 0.2320402 | 0.3408971 | 0.7528885 |
Results Summary
- On the test set, the depth 3 model had higher precision (0.72 vs.
0.64) but very low recall (0.16), meaning it missed most actual
subscribers.
- The depth 10 model gave up some precision but improved recall (0.23),
F1 (0.34 vs. 0.27), and AUC (0.75 vs. 0.71).
Conclusion
Because both precision and recall are important, the depth 10 model
performed better overall. It captured more true positives and achieved a
better balance between finding actual subscribers and avoiding false
positives. With a properly tuned cp value, this model shows
stronger potential for identifying customers who are likely to
subscribe.
Objective
This experiment evaluates whether assigning class weights based on
inverse class frequency improves the Decision Tree’s ability to identify
subscribers. The goal is to increase recall and F1-score, which are the
most important metrics for this classification task.
Approach
Instead of sampling the data, this model modifies the learning process
by applying manual class weights. Weights were
calculated based on the inverse frequency of each class in the training
data, giving the minority class (y = "yes") more influence
during training.
The model was trained using 10-fold cross-validation
and tuned over the same cp grid as in Experiment One
(0.001 to 0.096 in steps of 0.005). A classification threshold
of 0.5 was used to convert predicted probabilities into labels. Tree
depth was not capped in this run (no rpart.control was passed); no other
changes were made to the data or features.
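A minimal sketch of inverse-frequency weighting on a toy target is shown below. The 9:1 class split is made up for illustration; the point is that each class ends up with the same total weight.

```r
# Toy target: 9 non-subscribers and 1 subscriber
y <- factor(c(rep("no", 9), rep("yes", 1)), levels = c("no", "yes"))

# Inverse class frequency: the rarer the class, the larger the weight
class_freq    <- table(y)
class_weights <- setNames(1 / as.vector(class_freq), names(class_freq))

# One weight per observation, looked up by that observation's class
obs_weights <- unname(class_weights[as.character(y)])

# Total weight carried by each class
weight_by_class <- tapply(obs_weights, y, sum)
print(class_weights)
print(weight_by_class)
```

With these weights a misclassified subscriber costs the tree nine times as much as a misclassified non-subscriber, which is what pushes the splits toward higher recall.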
class_freq <- table(dt_train$y)
class_weights <- 1 / class_freq
# Per-row case weights: each observation gets the inverse frequency of its class
dt_train$weights <- ifelse(dt_train$y == "yes", class_weights["yes"], class_weights["no"])
set.seed(123)
dt_weighted_model <- train(
  y ~ .,
  data = dt_train |> select(-weights),
  method = "rpart",
  metric = "ROC",
  weights = dt_train$weights,
  trControl = dt_ctrl,
  tuneGrid = cp_grid
  # control = rpart.control(maxdepth = 10)
)
custom_cutoff <- .5
dt_weighted_train_probs <- predict(dt_weighted_model, dt_train, type = "prob")
dt_weighted_train_class <- ifelse(dt_weighted_train_probs$yes >= custom_cutoff, "yes", "no")
dt_weighted_train_class <- factor(dt_weighted_train_class, levels = c("no", "yes"))
dt_train_y <- factor(dt_train$y, levels = c("no", "yes"))
# Confusion Matrix and AUC
cm_dt_weighted_train <- confusionMatrix(dt_weighted_train_class, dt_train_y, positive = "yes")
auc_dt_weighted_train <- auc(roc(dt_train_y, dt_weighted_train_probs$yes))
# ---------------------------------------
# TEST SET - Apply same cutoff to predicted probs
# ---------------------------------------
dt_weighted_test_probs <- predict(dt_weighted_model, dt_test, type = "prob")
dt_weighted_test_class <- ifelse(dt_weighted_test_probs$yes >= custom_cutoff, "yes", "no")
dt_weighted_test_class <- factor(dt_weighted_test_class, levels = c("no", "yes"))
dt_test_y <- factor(dt_test$y, levels = c("no", "yes"))
# Confusion Matrix and AUC
cm_dt_weighted_test <- confusionMatrix(dt_weighted_test_class, dt_test_y, positive = "yes")
auc_dt_weighted_test <- auc(roc(dt_test_y, dt_weighted_test_probs$yes))
# ---------------------------------------
# Combined Metrics Table
# ---------------------------------------
dt_weighted_metrics <- data.frame(
  Model = "DT Weighted",
  Dataset = c("Train", "Test"),
  Accuracy = c(cm_dt_weighted_train$overall["Accuracy"], cm_dt_weighted_test$overall["Accuracy"]),
  Precision = c(cm_dt_weighted_train$byClass["Precision"], cm_dt_weighted_test$byClass["Precision"]),
  Recall = c(cm_dt_weighted_train$byClass["Recall"], cm_dt_weighted_test$byClass["Recall"]),
  F1 = c(cm_dt_weighted_train$byClass["F1"], cm_dt_weighted_test$byClass["F1"]),
  AUC = c(auc_dt_weighted_train, auc_dt_weighted_test)
)
dt_weighted_metrics |>
  kable(caption = "Decision Trees Experiment 2 Metrics") |>
  kable_styling() |>
  kable_classic()
| Model | Dataset | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| DT Weighted | Train | 0.8271712 | 0.3561598 | 0.6613300 | 0.4629809 | 0.7874679 |
| DT Weighted | Test | 0.8276950 | 0.3545203 | 0.6451149 | 0.4575796 | 0.7861752 |
Results Summary
Compared to the depth 10 baseline tree, the class-weighted model achieved
much higher recall (0.65 vs. 0.23 on the test set) and a higher F1-score
(0.46 vs. 0.34). Precision fell from 0.64 to 0.35, which is expected
when the model is adjusted to prioritize identifying more positive cases.
Conclusion
Applying class weights improved the model’s ability to detect true
subscribers. Although precision dropped to about 0.35, the large gain in
recall and F1-score indicates better performance for the minority class.
This makes class weighting a useful strategy for handling imbalance in
Decision Tree models when the goal is to identify more potential bank
term deposit subscribers.
Objective
This experiment examines how applying upsampling to the training data
affects the performance of a Random Forest classifier. The goal is to
improve the detection of customers likely to subscribe to a term
deposit, with a focus on recall and F1-score for the minority class
(y = "yes").
Approach
The model was trained using a standard Random Forest implementation with
1,000 trees. To address class imbalance,
upsampling was applied to the training portion of each
cross-validation fold, so every fit saw equal numbers of both
classes while the held-out fold kept its original distribution. The mtry
parameter, which controls
the number of variables randomly selected at each split, was tuned over
the range 2 to 4.
Performance was evaluated using 5-fold cross-validation on the training set and validated on a separate test set.
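The idea behind upsampling can be sketched in a few lines of base R. The helper name `upsample_minority` and the toy data frame are hypothetical; this illustrates sampling the smaller class with replacement and is not the routine that caret runs internally.

```r
# Hypothetical helper: resample every class (with replacement) up to the
# size of the largest class.
upsample_minority <- function(df, target) {
  counts <- table(df[[target]])
  n_max  <- max(counts)
  parts <- lapply(names(counts), function(cls) {
    rows <- which(df[[target]] == cls)
    extra <- if (length(rows) < n_max) {
      # sample.int() on the length avoids sample()'s single-value pitfall
      rows[sample.int(length(rows), n_max - length(rows), replace = TRUE)]
    } else {
      integer(0)
    }
    df[c(rows, extra), , drop = FALSE]
  })
  do.call(rbind, parts)
}

# Toy training fold: 8 non-subscribers, 2 subscribers
toy <- data.frame(
  x = 1:10,
  y = factor(c(rep("no", 8), rep("yes", 2)), levels = c("no", "yes"))
)

set.seed(42)
balanced <- upsample_minority(toy, "y")
print(table(balanced$y))
```

The added rows are copies of existing subscribers, so upsampling changes the class balance the model sees without adding new information. That is one reason it tends to raise recall more than precision.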
set.seed(1265653)
train_index <- createDataPartition(bank_marketing_cl$y, p = 0.7, list = FALSE)
dt_train <- bank_marketing_cl[train_index, ]
dt_test <- bank_marketing_cl[-train_index, ]
ctrl_rf <- trainControl(
  method = "cv",
  number = 5,
  sampling = "up",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
set.seed(1265653)
rf_model <- train(
  y ~ .,
  data = dt_train,
  method = "rf",
  trControl = ctrl_rf,
  metric = "ROC",
  tuneGrid = expand.grid(mtry = seq(2, 4, by = 1)),
  ntree = 1000
)
best_mtry <- rf_model$bestTune$mtry
# Filter CV predictions for best mtry
cv_preds <- rf_model$pred |>
  filter(mtry == best_mtry)
# Confusion matrix and AUC for training
cm_train <- confusionMatrix(cv_preds$pred, cv_preds$obs, positive = "yes")
auc_train <- roc(cv_preds$obs, cv_preds$yes)$auc
# --- TEST METRICS (From holdout set) ---
rf_test_probs <- predict(rf_model, dt_test, type = "prob")[, "yes"]
rf_test_preds <- predict(rf_model, dt_test)
cm_test <- confusionMatrix(rf_test_preds, dt_test$y, positive = "yes")
auc_test <- roc(dt_test$y, rf_test_probs)$auc
# --- Final Metrics Table ---
rf_metrics <- data.frame(
  Model = "Random Forest",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train$overall["Accuracy"], cm_test$overall["Accuracy"]),
  Precision = c(cm_train$byClass["Precision"], cm_test$byClass["Precision"]),
  Recall = c(cm_train$byClass["Recall"], cm_test$byClass["Recall"]),
  F1 = c(cm_train$byClass["F1"], cm_test$byClass["F1"]),
  AUC = c(auc_train, auc_test)
)
rf_metrics |>
  kable(caption = "Random Forest Experiment 1 Metrics") |>
  kable_styling() |>
  kable_classic()
| Model | Dataset | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| Random Forest | Train (CV) | 0.8614387 | 0.4176042 | 0.5828202 | 0.4865698 | 0.7915460 |
| Random Forest | Test | 0.8556976 | 0.4017094 | 0.5739943 | 0.4726412 | 0.7836298 |
Results Summary
The model performed consistently across both training and test sets. Upsampling improved recall and helped the model identify more subscribers. While precision remained modest, the increase in F1-score shows that the model achieved a better balance between capturing true positives and avoiding too many false positives.
Conclusion
Applying upsampling during training helped the Random Forest model
better identify customers likely to subscribe to a term deposit.
Although precision was relatively low, the improvement in recall and
overall F1-score makes this approach valuable for scenarios where
detecting potential subscribers is a priority.
Objective
This experiment investigates how specific feature engineering and
preprocessing steps affect the performance of a Random Forest model. The
focus is on transforming the age and pdays
variables into more interpretable categories and removing less
informative or redundant features.
Variation
Two main feature transformations were applied:
- Age was bucketed into three categories:
  - Adults (≤ 39)
  - Middle-aged Adults (40–59)
  - Senior Adults (60+)
- Pdays was converted from numeric into categorical intervals: Never
  Contacted, One Week, Two Week, Three Week, Three Week +
Additionally, the variables duration,
housing, month, and default were
removed, along with day_of_week, marital, education, job, contact, and
loan. The model was trained on an 80% training split using 5-fold
cross-validation. mtry was tuned over values from 2
to 5, and 1,000 trees were used.
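The bucketing rules above can also be expressed with base R's cut(), which returns a factor with a fixed level order. The sketch below is an alternative formulation on made-up values, not the code used in the experiment; the break points assume integer ages and pdays, with 999 meaning the customer was never contacted.

```r
# Age brackets: <= 39, 40-59, 60+ (right-closed intervals)
age <- c(18, 39, 40, 59, 60, 85)
age_group <- cut(
  age,
  breaks = c(-Inf, 39, 59, Inf),
  labels = c("Adult", "Middle-aged Adult", "Senior Adult")
)

# pdays intervals: 0-7, 8-14, 15-21, 22-998, and the 999 sentinel
pdays <- c(0, 7, 8, 21, 22, 999)
pdays_group <- cut(
  pdays,
  breaks = c(-Inf, 7, 14, 21, 998, Inf),
  labels = c("One Week", "Two Week", "Three Week", "Three Week +", "Never Contacted")
)

print(data.frame(age, age_group))
print(data.frame(pdays, pdays_group))
```

Returning a factor rather than a character vector keeps the level set identical across the training and test splits, even if a bracket happens to be empty in one of them.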
Why This Variation Matters
Transforming continuous variables into interpretable categories can
enhance the model’s ability to capture meaningful patterns, especially
when domain knowledge supports the grouping. Removing less relevant
features may also reduce noise and improve generalization.
Evaluation Metrics
The same metrics as in previous experiments were used:
- Primary metric: AUC
- Secondary metrics: F1-score, Precision, Recall, Accuracy
# pdays is kept here because it is bucketed below
bank_marketing_cl2 <- bank_marketing |>
  select(-all_of(c("duration", "housing", "month", "default")))
bank_marketing_cl2 <- bank_marketing_cl2 |>
  mutate(age = case_when(
    age <= 39 ~ "Adult",
    age >= 40 & age <= 59 ~ "Middle-aged Adults",
    age >= 60 ~ "Senior Adult"
  ))
bank_marketing_cl2 <- bank_marketing_cl2 |>
  mutate(pdays = case_when(
    pdays == 999 ~ "Never Contacted",
    pdays >= 0 & pdays <= 7 ~ "One Week",
    pdays >= 8 & pdays <= 14 ~ "Two Week",
    pdays >= 15 & pdays <= 21 ~ "Three Week",
    pdays >= 22 ~ "Three Week +"
  ))
set.seed(1265653)
train_index_rf2 <- createDataPartition(bank_marketing_cl2$y, p = 0.8, list = FALSE)
dt_train2 <- bank_marketing_cl2[train_index_rf2, ]
dt_test2 <- bank_marketing_cl2[-train_index_rf2, ]
ex <- c("day_of_week", "marital", "education", "job", "contact", "loan")
# Drop the extra columns from the engineered split (dt_train2/dt_test2),
# not from the Experiment One split (dt_train/dt_test)
dt_train2 <- dt_train2 |> select(-all_of(ex))
dt_test2 <- dt_test2 |> select(-all_of(ex))
rf_model2 <- train(
  y ~ .,
  data = dt_train2,
  method = "rf",
  trControl = ctrl_rf,
  metric = "ROC",
  tuneGrid = expand.grid(mtry = seq(2, 5, by = 1)),
  ntree = 1000
)
best_mtry2 <- rf_model2$bestTune$mtry
# Filter CV predictions for best mtry
cv_preds2 <- rf_model2$pred |>
  filter(mtry == best_mtry2)
# Confusion matrix and AUC for training
cm_train2 <- confusionMatrix(cv_preds2$pred, cv_preds2$obs, positive = "yes")
auc_train2 <- roc(cv_preds2$obs, cv_preds2$yes)$auc
rf_test_probs2 <- predict(rf_model2, dt_test2, type = "prob")[, "yes"]
rf_test_preds2 <- predict(rf_model2, dt_test2)
cm_test2 <- confusionMatrix(rf_test_preds2, dt_test2$y, positive = "yes")
auc_test2 <- roc(dt_test2$y, rf_test_probs2)$auc
# --- Final Metrics Table ---
rf_metrics2 <- data.frame(
  Model = "Random Forest",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train2$overall["Accuracy"], cm_test2$overall["Accuracy"]),
  Precision = c(cm_train2$byClass["Precision"], cm_test2$byClass["Precision"]),
  Recall = c(cm_train2$byClass["Recall"], cm_test2$byClass["Recall"]),
  F1 = c(cm_train2$byClass["F1"], cm_test2$byClass["F1"]),
  AUC = c(auc_train2, auc_test2)
)
rf_metrics2 |>
  kable(caption = "Random Forest Experiment 2 Metrics") |>
  kable_styling() |>
  kable_classic()
| Model | Dataset | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| Random Forest | Train (CV) | 0.8427442 | 0.3765835 | 0.6040640 | 0.4639395 | 0.7846408 |
| Random Forest | Test | 0.8390256 | 0.3682119 | 0.5991379 | 0.4561116 | 0.7804622 |
Results and Conclusion
Relative to Random Forest Experiment One, this configuration raised test
recall slightly (0.574 to 0.599), but precision (0.402 to 0.368) and
F1-score (0.473 to 0.456) both fell. AUC was essentially unchanged
(0.784 vs. 0.780), indicating that class discrimination was not
compromised. Overall, the changes shifted the model toward recall rather
than improving it across all metrics.
Recommendation
Incorporating domain-informed feature engineering (e.g., age brackets,
pdays intervals) can improve minority class detection in Random Forest
models. This approach is recommended for preprocessing when working with
structured data in real-world applications.
Objective
After earlier experiments with Decision Trees and Random Forests showed
limited improvement in detecting the minority class
(y = "yes"), this experiment introduces a new strategy. The
goal is to evaluate whether combining AdaBoost,
structured feature engineering, and
preprocessing of numerical features can lead to more
meaningful improvements in classification performance.
Variation
This experiment builds a redesigned dataset using domain-driven feature
engineering and scaled transformations of numeric variables.
Specifically:
- New categorical and binary features were derived from existing fields
  (has_any_debt, contact_quarter, job_risk, preferred_contact_day,
  edu_level, weekend_contact).
- Numeric features (campaign, euribor3m,
  nr.employed, and age) were centered, scaled,
  and transformed using the Yeo-Johnson method to address
  skewness and improve consistency across predictors.
- Upsampling was applied during cross-validation to balance the
  subscription (yes) and non-subscription (no) classes.
The AdaBoost model was tuned across the following hyperparameters:
- mfinal: Tested values 50, 100, and 150 to
  evaluate the effect of increasing the number of boosting iterations.
  More iterations let the model concentrate on previously misclassified
  observations, but too many can lead to overfitting.
- maxdepth: Set to 3 and 5 to limit the depth of the individual weak
  learners (shallow trees rather than single-split stumps). Shallow
  trees are used to reduce variance and enforce simplicity in each
  boosting round.
- coeflearn: Fixed to "Breiman", which sets each tree's weight from its
  weighted classification error. This option provides a balance between
  aggressive error correction and model stability.
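To illustrate how a learner's error turns into a boosting weight, the sketch below assumes the Breiman coefficient has the form alpha = 0.5 * ln((1 - err) / err), which is how the adabag documentation describes it, and applies one AdaBoost.M1-style reweighting round to five toy observations. The helper name `breiman_alpha` is hypothetical, and the update is a simplified illustration rather than adabag's internal code.

```r
# Assumed form of the Breiman learning coefficient
breiman_alpha <- function(err) 0.5 * log((1 - err) / err)

# Five observations with uniform starting weights; the third is misclassified
w <- rep(1 / 5, 5)
misclassified <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

err   <- sum(w[misclassified])  # weighted error of this round's tree
alpha <- breiman_alpha(err)     # vote weight of this round's tree

# Up-weight only the misclassified observation, then renormalise
w_new <- w * exp(alpha * misclassified)
w_new <- w_new / sum(w_new)

print(c(err = err, alpha = alpha))
print(round(w_new, 4))
```

A tree with 50% weighted error receives alpha = 0 and contributes nothing, while lower error yields a larger vote. After the update, the misclassified observation carries twice the weight of each of the others, so the next tree concentrates on it.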
Why This Variation Matters
Unlike Random Forests, which average many decorrelated trees,
AdaBoost builds trees sequentially and is more sensitive to noisy data.
Tree splits are unaffected by monotonic rescaling of a predictor, so the
engineered features are the more likely source of any gain from this
preprocessing. In this
setup, boosting is expected to better exploit informative patterns that
earlier models may have missed.
Evaluation Metrics
Models were assessed using:
- Primary metric: AUC
- Secondary metrics: F1-score, Recall, Precision, Accuracy
All metrics were calculated using 5-fold cross-validation on the training set and evaluated on a held-out test set.
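The Yeo-Johnson step listed under Variation can be written out directly. The sketch below implements the standard piecewise formula for a fixed lambda and then standardises with training-set statistics only. The helper name `yeo_johnson`, the toy values, and lambda = 0.5 are assumptions for illustration; in the experiment the transformation parameters are estimated by preProcess.

```r
# Yeo-Johnson transform for a given lambda (handles negative values too)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

# Toy right-skewed predictor, split into training and test values
train_x <- c(1, 2, 4, 8, 16)
test_x  <- c(3, 32)
lambda  <- 0.5  # assumed value, not estimated from data

# Fit on the training values only: transform, then record mean and sd
train_t <- yeo_johnson(train_x, lambda)
mu      <- mean(train_t)
sigma   <- sd(train_t)

# Apply the same parameters to both splits
train_z <- (train_t - mu) / sigma
test_z  <- (yeo_johnson(test_x, lambda) - mu) / sigma

print(round(train_z, 3))
print(round(test_z, 3))
```

Estimating the parameters on the training split and reusing them on the test split, as the experiment code does with the fitted preprocessing object, keeps information from the test set out of the model.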
bank_marketing_cl3 <- bank_marketing |>
  # 1. Contacted Recently (from pdays)
  mutate(pdays = case_when(
    pdays == 999 ~ "Never Contacted",
    pdays >= 0 & pdays <= 7 ~ "One Week",
    pdays >= 8 & pdays <= 14 ~ "Two Week",
    pdays >= 15 & pdays <= 21 ~ "Three Week",
    pdays >= 22 ~ "Three Week +"
  )) |>
  # 2. Has Any Debt (loan + housing + default)
  mutate(has_any_debt = ifelse(loan == "yes" | housing == "yes" | default == "yes", 1, 0)) |>
  # 3. Contact Quarter (from month)
  mutate(contact_quarter = case_when(
    month %in% c("mar", "apr", "may") ~ "Q2",
    month %in% c("jun", "jul", "aug") ~ "Q3",
    month %in% c("sep", "oct", "nov", "dec") ~ "Q4",
    TRUE ~ "Q1"
  )) |>
  # 4. Job Risk Level
  mutate(job_risk = case_when(
    job %in% c("blue-collar", "services", "unemployed") ~ "high_risk",
    job %in% c("admin.", "technician", "management", "student") ~ "low_risk",
    TRUE ~ "unknown"
  )) |>
  # 5. Preferred Contact Day (thu/fri)
  mutate(preferred_contact_day = ifelse(day_of_week %in% c("thu", "fri"), 1, 0)) |>
  # 6. Education Level Simplified
  mutate(edu_level = case_when(
    grepl("basic", education) ~ "basic",
    education == "university.degree" ~ "higher_ed",
    education == "professional.course" ~ "vocational",
    TRUE ~ "other"
  )) |>
  # 7. Weekend Contact Flag
  mutate(weekend_contact = ifelse(day_of_week %in% c("sat", "sun"), 1, 0)) |>
  # Drop the source columns (the bucketed pdays is dropped here as well)
  select(-c(pdays, loan, housing, default, month, job, day_of_week, education, duration))
transformData <- function(data, fields, methods) {
  # Fit the preprocessing on the selected fields, then apply it to the full data frame
  transformation_model <- data |> select(all_of(fields)) |> preProcess(method = methods)
  transformed <- predict(transformation_model, data)
  return(list(transformed, transformation_model))
}
set.seed(555)
train_index_ada3 <- createDataPartition(bank_marketing_cl3$y, p = 0.70, list = FALSE)
ada_train3 <- bank_marketing_cl3[train_index_ada3, ]
ada_test3 <- bank_marketing_cl3[-train_index_ada3, ]
transform_1 <- transformData(ada_train3,c("campaign","euribor3m","nr.employed","age"),c("center", "scale", "YeoJohnson"))
ada_train3 <- transform_1[[1]]
transform_1_lambda <- transform_1[[2]]
ada_test3 <- predict(transform_1_lambda, ada_test3)
tune <- expand.grid(
  mfinal = c(50, 100, 150),
  maxdepth = c(3, 5),
  coeflearn = "Breiman"
)
ctrl_ada <- trainControl(
  method = "cv",
  number = 5,
  sampling = "up",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
ada_model3 <- train(
  y ~ .,
  data = ada_train3,
  method = "AdaBoost.M1",
  trControl = ctrl_ada,
  metric = "ROC",
  tuneGrid = tune
)
# 1. Get best tuning parameters
best_params3 <- ada_model3$bestTune
# 2. Filter cross-validated predictions for best model
cv_preds3 <- ada_model3$pred |>
  filter(
    mfinal == best_params3$mfinal,
    maxdepth == best_params3$maxdepth,
    coeflearn == best_params3$coeflearn
  )
# 3. Confusion matrix and AUC on cross-validation predictions (Train set)
cm_train3 <- confusionMatrix(cv_preds3$pred, cv_preds3$obs, positive = "yes")
auc_train3 <- roc(cv_preds3$obs, cv_preds3$yes)$auc
# 4. Predictions on test set
ada_test_probs3 <- predict(ada_model3, ada_test3, type = "prob")[, "yes"]
ada_test_preds3 <- predict(ada_model3, ada_test3)
# 5. Confusion matrix and AUC on test set
cm_test3 <- confusionMatrix(ada_test_preds3, ada_test3$y, positive = "yes")
auc_test3 <- roc(ada_test3$y, ada_test_probs3)$auc
# 6. Final comparison table
ada_metrics3 <- data.frame(
  Model = "AdaBoost",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train3$overall["Accuracy"], cm_test3$overall["Accuracy"]),
  Precision = c(cm_train3$byClass["Precision"], cm_test3$byClass["Precision"]),
  Recall = c(cm_train3$byClass["Recall"], cm_test3$byClass["Recall"]),
  F1 = c(cm_train3$byClass["F1"], cm_test3$byClass["F1"]),
  AUC = c(auc_train3, auc_test3)
)
ada_metrics3 |>
  kable(caption = "AdaBoost Experiment 1 Metrics") |>
  kable_styling() |>
  kable_classic()
| Model | Dataset | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| AdaBoost | Train (CV) | 0.8306396 | 0.3587839 | 0.6394704 | 0.4596658 | 0.8000380 |
| AdaBoost | Test | 0.8340887 | 0.3651639 | 0.6400862 | 0.4650313 | 0.7996845 |
Results and Conclusion
These results give the highest AUC observed so far (0.800 on the test set). Recall (0.640) and F1-score (0.465) are close to, but slightly below, the best values from earlier experiments (recall of 0.645 for the weighted Decision Tree and F1 of 0.473 for the first Random Forest). The consistency between training and test performance also suggests that the model generalizes well.
However, one notable limitation is the low precision, which remained below 0.37. This means that a relatively high number of predicted positive cases were false positives. In practical terms, while the model is more successful at finding actual subscribers, it also classifies many non-subscribers as subscribers.
This trade-off reflects a recall-oriented model, which may be acceptable in contexts where missing potential subscribers is more costly than reaching out to uninterested ones. However, in applications where precision matters (e.g., avoiding wasted marketing costs), further tuning or threshold adjustment may be necessary.
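Threshold adjustment can be explored without refitting the model. The sketch below sweeps three cutoffs over made-up predicted probabilities; the values and the helper name `metrics_at` are hypothetical and are not output from the AdaBoost model.

```r
# Toy ground truth and predicted probabilities of "yes"
truth    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
                   levels = c("no", "yes"))
prob_yes <- c(0.90, 0.70, 0.40, 0.55, 0.45, 0.30, 0.20, 0.10)

# Precision and recall when "yes" is predicted at or above a given cutoff
metrics_at <- function(cutoff) {
  pred <- ifelse(prob_yes >= cutoff, "yes", "no")
  tp <- sum(pred == "yes" & truth == "yes")
  fp <- sum(pred == "yes" & truth == "no")
  fn <- sum(pred == "no"  & truth == "yes")
  c(cutoff = cutoff, precision = tp / (tp + fp), recall = tp / (tp + fn))
}

sweep <- t(sapply(c(0.35, 0.50, 0.65), metrics_at))
print(round(sweep, 3))
```

In this toy case, raising the cutoff from 0.35 to 0.65 lifts precision from 0.6 to 1.0 while recall falls from 1.0 to about 0.67. The same sweep on held-out predictions would show where a given campaign budget sits on that curve.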
Objective
This experiment evaluates whether combining SMOTE and
downsampling with a simplified, cleaner feature set can improve
AdaBoost’s classification performance on the Bank Marketing dataset.
After previous experiments involving numeric transformations showed
limited impact, this setup intentionally excludes such preprocessing.
The focus shifts to engineered categorical and binary features, paired
with hybrid sampling to handle class imbalance.
Variation
Several original variables were removed based on prior analysis, and a
compact set of engineered features was introduced:
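- risk_score: the number of "yes" values across housing, loan, and default
- multi_loan_flag: whether the customer holds both a housing loan and a personal loan
- was_previously_contacted: a binary flag derived from the pdays field
- contact_intensity: the sum of campaign and previous
No transformations (e.g., centering, scaling, Yeo-Johnson) were applied to numerical features, based on evidence from earlier experiments that such steps had little to no impact on performance.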
Categorical variables were encoded as integers using a recipe. Class imbalance was addressed through a combination of SMOTE to generate synthetic examples for the minority class and downsampling to reduce the dominant class, allowing the model to train on a more balanced distribution.
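The core of SMOTE is interpolation between a minority-class point and one of its nearest minority-class neighbours. The base R sketch below shows only that step on toy numeric data. The helper name `smote_sketch`, the column names, and k = 2 are assumptions for illustration; it is a simplified stand-in and not the implementation behind step_smote.

```r
# Simplified SMOTE: create n_new synthetic minority rows by interpolating
# between a random minority row and one of its k nearest minority neighbours.
smote_sketch <- function(X_min, n_new, k = 2) {
  X_min <- as.matrix(X_min)
  d <- as.matrix(dist(X_min))  # pairwise Euclidean distances
  diag(d) <- Inf               # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(X_min), 1)
    nbrs <- order(d[base, ])[seq_len(k)]
    nb   <- nbrs[sample.int(k, 1)]
    u    <- runif(1)           # position along the segment base -> neighbour
    synth[i, ] <- X_min[base, ] + u * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

# Toy minority class with two numeric predictors
minority <- cbind(age = c(25, 30, 35, 40), campaign = c(1, 2, 2, 3))

set.seed(7)
synthetic <- smote_sketch(minority, n_new = 6, k = 2)
print(round(synthetic, 2))
```

Every synthetic row lies on a line segment between two real minority rows, which is why predictors need to be numeric before this step, hence the integer encoding. Downsampling then removes majority rows rather than creating new ones.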
The AdaBoost model was tuned across the following parameters:
- mfinal: 50, 100, 150
- maxdepth: 3, 5, 10
- coeflearn: "Breiman"
Why This Variation Matters
This setup tests whether removing unnecessary complexity and applying
focused resampling techniques can help AdaBoost better capture the
minority class signal. Unlike earlier approaches that emphasized heavy
transformations or broad feature expansions, this strategy prioritizes
clarity and balance, both in the dataset and the learning process.
Evaluation Metrics
Model performance was assessed using:
- Primary metric: AUC
- Secondary metrics: F1-score, Recall, Precision, Accuracy
All metrics were calculated using 5-fold cross-validation on the training data and tested on a held-out set.
Results and Conclusion
set.seed(6666)
ex <- c(
  "day_of_week", "marital", "education", "job", "contact", "loan",
  "previous",
  # "poutcome",
  "duration", "housing", "pdays",
  "emp.var.rate", "nr.employed", "cons.price.idx", "cons.conf.idx"
)
bank_marketing_cl4 <- bank_marketing
# Create engineered features on the full dataset
bank_marketing_cl4$risk_score <- with(
  bank_marketing_cl4,
  as.integer(housing == "yes") + as.integer(loan == "yes") + as.integer(default == "yes")
)
bank_marketing_cl4$multi_loan_flag <- with(
  bank_marketing_cl4,
  as.integer(housing == "yes" & loan == "yes")
)
bank_marketing_cl4$was_previously_contacted <- with(
  bank_marketing_cl4,
  as.integer(pdays != 999)
)
bank_marketing_cl4$contact_intensity <- with(
  bank_marketing_cl4,
  campaign + previous
)
bank_marketing_cl4$multi_loan_flag <- as.factor(bank_marketing_cl4$multi_loan_flag)
bank_marketing_cl4$was_previously_contacted <- as.factor(bank_marketing_cl4$was_previously_contacted)
bank_marketing_cl4$contact_intensity <- as.factor(bank_marketing_cl4$contact_intensity)
# Drop the source columns once the engineered features exist
bank_marketing_cl4 <- bank_marketing_cl4 |> select(-all_of(ex))
train_index_ada_exp2 <- createDataPartition(bank_marketing_cl4$y, p = 0.70, list = FALSE)
ada_exp2_train <- bank_marketing_cl4[train_index_ada_exp2, ]
ada_exp2_test <- bank_marketing_cl4[-train_index_ada_exp2, ]
set.seed(986)
registerDoSEQ() # disables parallel
rec_exp2 <- recipe(y ~ ., data = ada_exp2_train) |>
  step_integer(all_nominal_predictors()) |>
  step_smote(y) |>
  step_downsample(y)
# TrainControl
ctrl_exp2 <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
# Tuning grid
tune_exp2 <- expand.grid(
  mfinal = c(50, 100, 150),
  maxdepth = c(3, 5, 10),
  coeflearn = "Breiman"
)
# Train the AdaBoost model
ada_model_exp2 <- train(
  rec_exp2,
  data = ada_exp2_train,
  method = "AdaBoost.M1",
  metric = "ROC",
  trControl = ctrl_exp2,
  tuneGrid = tune_exp2
)
# 1. Get best tuning parameters for Experiment 2
best_params_exp2 <- ada_model_exp2$bestTune
# 2. Filter cross-validated predictions for best model
cv_preds_exp2 <- ada_model_exp2$pred |>
  filter(
    mfinal == best_params_exp2$mfinal,
    maxdepth == best_params_exp2$maxdepth,
    coeflearn == best_params_exp2$coeflearn
  )
# 3. Confusion matrix and AUC on cross-validation predictions (Train set)
cm_train_exp2 <- confusionMatrix(cv_preds_exp2$pred, cv_preds_exp2$obs, positive = "yes")
auc_train_exp2 <- roc(cv_preds_exp2$obs, cv_preds_exp2$yes)$auc
# 4. Predictions on test set
ada_test_probs_exp2 <- predict(ada_model_exp2, ada_exp2_test, type = "prob")[, "yes"]
ada_test_preds_exp2 <- predict(ada_model_exp2, ada_exp2_test)
# 5. Confusion matrix and AUC on test set
cm_test_exp2 <- confusionMatrix(ada_test_preds_exp2, ada_exp2_test$y, positive = "yes")
auc_test_exp2 <- roc(ada_exp2_test$y, ada_test_probs_exp2)$auc
# 6. Final comparison table
ada_metrics_exp2 <- data.frame(
  Model = "AdaBoost Exp 2",
  Dataset = c("Train (CV)", "Test"),
  Accuracy = c(cm_train_exp2$overall["Accuracy"], cm_test_exp2$overall["Accuracy"]),
  Precision = c(cm_train_exp2$byClass["Precision"], cm_test_exp2$byClass["Precision"]),
  Recall = c(cm_train_exp2$byClass["Recall"], cm_test_exp2$byClass["Recall"]),
  F1 = c(cm_train_exp2$byClass["F1"], cm_test_exp2$byClass["F1"]),
  AUC = c(auc_train_exp2, auc_test_exp2)
)
ada_metrics_exp2 |>
  kable(caption = "AdaBoost Experiment 2 Metrics") |>
  kable_styling() |>
  kable_classic()
| Model | Dataset | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| AdaBoost Exp 2 | Train (CV) | 0.8673349 | 0.4280010 | 0.5280172 | 0.4727774 | 0.7841837 |
| AdaBoost Exp 2 | Test | 0.8650858 | 0.4214734 | 0.5301724 | 0.4696150 | 0.7932252 |
Compared to Experiment One, this model achieved:
- Higher precision (from 0.365 to 0.421 on the test set), meaning it made fewer false positive predictions.
- Lower recall (from 0.640 to 0.530), meaning it missed more actual subscribers.
- A slightly improved F1-score (from 0.465 to 0.470), indicating a marginally better balance between precision and recall.
- A small decrease in AUC, from 0.800 to 0.793, which still reflects good model discrimination.
These results reflect a clear shift in the model’s behavior: the hybrid resampling and simpler features led to more selective, precise predictions, but at the cost of missing some true positives. The modest improvement in F1-score suggests this approach offers a more balanced model than the one used in Experiment One.
The set of experiments explored different strategies for improving classification performance on the Bank Marketing Dataset, with the primary challenge being a highly imbalanced target variable. Each model incorporated a distinct method for addressing this issue, including deeper trees, class weighting, sampling techniques, and feature engineering. While these adjustments produced variations in performance, the overall patterns reveal a limited capacity to achieve both high recall and high precision simultaneously.
The Decision Tree models provided useful baselines and highlighted the trade-off between precision and recall. The deeper tree raised test recall modestly (0.16 to 0.23) while lowering precision (0.72 to 0.64); class weighting raised recall sharply (to 0.65) but roughly halved precision (to 0.35). Training and test metrics were close for the depth 3 and weighted trees, with a modest gap for the depth 10 tree, so the main limitation was weak minority-class detection rather than severe overfitting.
Random Forests, both trained with upsampling, improved overall stability and delivered a better balance across metrics than the unweighted trees. Still, neither Random Forest configuration overcame the precision-recall trade-off. Feature engineering offered a minor improvement in recall but did not produce meaningful gains in AUC or F1-score.
AdaBoost models produced the most polarized results. The version with numeric transformations and engineered features achieved the highest AUC of any experiment (0.800) together with recall of 0.640, indicating a stronger bias toward identifying positive cases. In contrast, the simplified AdaBoost model with SMOTE and downsampling achieved the highest precision among the imbalance-adjusted models (0.421), favoring conservative predictions. These results suggest that, with upsampling, boosting emphasizes recall at the cost of false positives, whereas simplified inputs and hybrid resampling shift it toward precision.
In conclusion, while each modeling choice influenced the bias-variance trade-off differently, none of the approaches consistently delivered strong enough performance to justify deployment. The performance limitations suggest that future work should prioritize testing more advanced classifiers and exploring ways to enhance data quality and feature richness, particularly for the underrepresented class. Without these improvements, further optimization of current models is unlikely to yield substantial gains.
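To make the cross-experiment comparison easier to check, the test-set rows from the metric tables above can be collected into one data frame. The values are copied from those tables; the short labels "RF Exp 1", "RF Exp 2", and "AdaBoost Exp 1" are shorthand introduced here.

```r
# Test-set metrics copied from the experiment tables in this report
test_metrics <- data.frame(
  Model = c("DT Depth 3", "DT Depth 10", "DT Weighted",
            "RF Exp 1", "RF Exp 2", "AdaBoost Exp 1", "AdaBoost Exp 2"),
  Precision = c(0.7215190, 0.6421471, 0.3545203, 0.4017094, 0.3682119, 0.3651639, 0.4214734),
  Recall    = c(0.1637931, 0.2320402, 0.6451149, 0.5739943, 0.5991379, 0.6400862, 0.5301724),
  F1        = c(0.2669789, 0.3408971, 0.4575796, 0.4726412, 0.4561116, 0.4650313, 0.4696150),
  AUC       = c(0.7076750, 0.7528885, 0.7861752, 0.7836298, 0.7804622, 0.7996845, 0.7932252)
)

# Which model leads on each metric?
best <- sapply(test_metrics[-1], function(col) test_metrics$Model[which.max(col)])

# Rank the models by F1, the balance between precision and recall
print(test_metrics[order(-test_metrics$F1), ], row.names = FALSE)
print(best)
```

No single configuration leads on more than one metric, which is consistent with the conclusion that these models trade precision against recall rather than improving both.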