Experimentation & Model Training

Introduction

In Machine Learning, experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing: it involves systematically changing parameters, evaluating the results with metrics, and comparing different approaches to find the best solution. Essentially, it is the practice of testing and refining machine learning models through controlled experiments to improve their performance.
The key is to modify only one or a few variables at a time to isolate the impact of each change and understand its effect on model performance. In the assignment you will conduct at least 6 experiments. In real life, data scientists run anywhere from a dozen to hundreds of experiments (depending on the dataset and problem domain).
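Tracking each run in a small table makes it obvious what changed between experiments and keeps the comparison honest. Below is a minimal, illustrative sketch of such an experiment log in R (the column names and entries are placeholders, not something required by the assignment):

# Illustrative experiment log; assumes the tibble package is available.
library(tibble)
experiment_log <- tribble(
  ~experiment,       ~what_changed,       ~metric, ~result,
  "Decision Tree 1", "baseline settings", "ROC",   NA,
  "Decision Tree 2", "tuned cp grid",     "ROC",   NA
)
experiment_log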
Assignment

This assignment consists of conducting at least two (2) experiments for each of three different algorithms: Decision Trees, Random Forest, and AdaBoost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end review how the experiment went. These experiments will allow you to compare algorithms and choose the optimal model.
Using the dataset and EDA from the previous assignment, perform the following:
Algorithm Selection

You will perform experiments using the following algorithms:
- Decision Trees
- Random Forest
- AdaBoost

Experiment
# Load required packages
library(tidyverse)  # read_delim, dplyr verbs, pipes
library(caret)      # createDataPartition, trainControl, train, confusionMatrix
library(pROC)       # roc, auc
# Load data
bank_data <- read_delim("/Users/zigcah/Downloads/bank+marketing/bank-additional/bank-additional-full.csv", delim = ";")
## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert character to factor
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor))
# Drop 'duration': it is only known after the call, so keeping it would leak the outcome
bank_data <- bank_data %>% select(-duration)
# Feature engineering: flag clients contacted in a previous campaign (pdays == 999 means never previously contacted)
bank_data <- bank_data %>% mutate(contacted_before = if_else(pdays == 999, 0, 1)) %>% select(-pdays, -previous)
# Remove correlated variables
bank_data <- bank_data %>% select(-emp.var.rate)
# Sample to reduce runtime
set.seed(321)
bank_data <- bank_data %>% sample_frac(0.3) # Use 30% of data to reduce training time
# Split data
index <- createDataPartition(bank_data$y, p = 0.7, list = FALSE)
train <- bank_data[index, ]
test <- bank_data[-index, ]
# Recode target
train$y <- relevel(train$y, ref = "yes")
test$y <- relevel(test$y, ref = "yes")
# Training control: 3-fold CV with class probabilities and twoClassSummary, so ROC, sensitivity, and specificity are reported
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
### Decision Tree 1: Default settings
dt_model1 <- train(y ~ ., data = train, method = "rpart", trControl = ctrl, metric = "ROC")
pred_dt1 <- predict(dt_model1, test)
confusionMatrix(pred_dt1, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 72 25
## no 338 3271
##
## Accuracy : 0.9021
## 95% CI : (0.892, 0.9114)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.006745
##
## Kappa : 0.2524
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17561
## Specificity : 0.99242
## Pos Pred Value : 0.74227
## Neg Pred Value : 0.90635
## Prevalence : 0.11063
## Detection Rate : 0.01943
## Detection Prevalence : 0.02617
## Balanced Accuracy : 0.58401
##
## 'Positive' Class : yes
##
roc_dt1 <- roc(test$y, predict(dt_model1, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_dt1, main = "ROC - Decision Tree 1")
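Note that after the releveling above the probability columns are ordered ("yes", "no"), so `[,2]` selects the probability of "no"; the resulting AUC is unchanged, but referencing the "yes" column by name and fixing the levels explicitly is less ambiguous. A minimal alternative sketch (illustrative only, not a change to the results above):

# Alternative (sketch): use the "yes" probability and set levels/direction explicitly.
roc_dt1_alt <- roc(test$y,
                   predict(dt_model1, test, type = "prob")[, "yes"],
                   levels = c("no", "yes"), direction = "<")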
### Decision Tree 2: Tuned cp grid (caret's rpart method tunes only cp)
dt_model2 <- train(y ~ ., data = train, method = "rpart",
tuneGrid = expand.grid(cp = c(0.01, 0.05)),
trControl = ctrl, metric = "ROC")
pred_dt2 <- predict(dt_model2, test)
confusionMatrix(pred_dt2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 65 19
## no 345 3277
##
## Accuracy : 0.9018
## 95% CI : (0.8917, 0.9112)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.007838
##
## Kappa : 0.2344
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.15854
## Specificity : 0.99424
## Pos Pred Value : 0.77381
## Neg Pred Value : 0.90475
## Prevalence : 0.11063
## Detection Rate : 0.01754
## Detection Prevalence : 0.02267
## Balanced Accuracy : 0.57639
##
## 'Positive' Class : yes
##
roc_dt2 <- roc(test$y, predict(dt_model2, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_dt2, main = "ROC - Decision Tree 2")
### Random Forest 1: Baseline settings (ntree = 50 to limit runtime)
rf_model1 <- train(y ~ ., data = train, method = "rf",
ntree = 50,
trControl = ctrl, metric = "ROC")
pred_rf1 <- predict(rf_model1, test)
confusionMatrix(pred_rf1, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 115 102
## no 295 3194
##
## Accuracy : 0.8929
## 95% CI : (0.8825, 0.9027)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.2576
##
## Kappa : 0.3143
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.28049
## Specificity : 0.96905
## Pos Pred Value : 0.52995
## Neg Pred Value : 0.91545
## Prevalence : 0.11063
## Detection Rate : 0.03103
## Detection Prevalence : 0.05855
## Balanced Accuracy : 0.62477
##
## 'Positive' Class : yes
##
roc_rf1 <- roc(test$y, predict(rf_model1, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_rf1, main = "ROC - Random Forest 1")
### Random Forest 2: Tuned mtry
rf_model2 <- train(y ~ ., data = train, method = "rf",
tuneGrid = expand.grid(mtry = c(2, 4)),
ntree = 50,
trControl = ctrl, metric = "ROC")
pred_rf2 <- predict(rf_model2, test)
confusionMatrix(pred_rf2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 90 49
## no 320 3247
##
## Accuracy : 0.9004
## 95% CI : (0.8903, 0.9099)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.01594
##
## Kappa : 0.288
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.21951
## Specificity : 0.98513
## Pos Pred Value : 0.64748
## Neg Pred Value : 0.91029
## Prevalence : 0.11063
## Detection Rate : 0.02428
## Detection Prevalence : 0.03751
## Balanced Accuracy : 0.60232
##
## 'Positive' Class : yes
##
roc_rf2 <- roc(test$y, predict(rf_model2, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_rf2, main = "ROC - Random Forest 2")
### Boosting Model 1: xgbTree baseline (used in place of AdaBoost), small fixed grid on a 10% training sample
set.seed(321)
xgb_model1 <- train(y ~ ., data = train %>% sample_frac(0.1),
method = "xgbTree",
trControl = ctrl,
tuneGrid = expand.grid(nrounds = 10,
max_depth = 2,
eta = 0.3,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8),
metric = "ROC")
pred_xgb1 <- predict(xgb_model1, test)
confusionMatrix(pred_xgb1, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 60 28
## no 350 3268
##
## Accuracy : 0.898
## 95% CI : (0.8878, 0.9076)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.04827
##
## Kappa : 0.2101
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.14634
## Specificity : 0.99150
## Pos Pred Value : 0.68182
## Neg Pred Value : 0.90326
## Prevalence : 0.11063
## Detection Rate : 0.01619
## Detection Prevalence : 0.02375
## Balanced Accuracy : 0.56892
##
## 'Positive' Class : yes
##
roc_xgb1 <- roc(test$y, predict(xgb_model1, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_xgb1, main = "ROC - Boosting Model 1")
### Boosting Model 2: xgbTree with more rounds and a lower learning rate, on a 15% training sample
set.seed(432)
xgb_model2 <- train(y ~ ., data = train %>% sample_frac(0.15),
method = "xgbTree",
trControl = ctrl,
tuneGrid = expand.grid(nrounds = 20,
max_depth = 2,
eta = 0.1,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8),
metric = "ROC")
pred_xgb2 <- predict(xgb_model2, test)
confusionMatrix(pred_xgb2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 69 23
## no 341 3273
##
## Accuracy : 0.9018
## 95% CI : (0.8917, 0.9112)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.007838
##
## Kappa : 0.2443
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.16829
## Specificity : 0.99302
## Pos Pred Value : 0.75000
## Neg Pred Value : 0.90564
## Prevalence : 0.11063
## Detection Rate : 0.01862
## Detection Prevalence : 0.02482
## Balanced Accuracy : 0.58066
##
## 'Positive' Class : yes
##
roc_xgb2 <- roc(test$y, predict(xgb_model2, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_xgb2, main = "ROC - Boosting Model 2")
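To compare the six experiments on one chart, the individual ROC objects can be overlaid and their AUCs tabulated. A minimal sketch using the pROC objects created above (colors and labels are arbitrary):

# Sketch: overlay all ROC curves and list their AUCs for comparison.
plot(roc_dt1, col = "black", main = "ROC comparison across experiments")
lines(roc_dt2, col = "grey50")
lines(roc_rf1, col = "blue")
lines(roc_rf2, col = "steelblue")
lines(roc_xgb1, col = "red")
lines(roc_xgb2, col = "darkred")
legend("bottomright",
       legend = c("DT1", "DT2", "RF1", "RF2", "XGB1", "XGB2"),
       col = c("black", "grey50", "blue", "steelblue", "red", "darkred"),
       lwd = 2)
sapply(list(DT1 = roc_dt1, DT2 = roc_dt2, RF1 = roc_rf1,
            RF2 = roc_rf2, XGB1 = roc_xgb1, XGB2 = roc_xgb2), auc)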
In this assignment, I conducted six structured experiments using three machine learning algorithms (Decision Trees, Random Forest, and gradient boosting via xgbTree, standing in for AdaBoost) to model the likelihood that a client subscribes to a term deposit product. The dataset comes from a Portuguese bank marketing campaign, and the EDA from Assignment 1 revealed significant class imbalance and feature skewness.
The experiments were designed to explore ways to reduce false negatives and increase model sensitivity on the minority ("yes") class. The factors varied across runs were hyperparameter settings (cp for the trees, mtry for the forests, and the number of rounds and learning rate for boosting) and the amount of training data used. Models were tuned on ROC/AUC with 3-fold cross-validation and then evaluated on the held-out test set using accuracy, sensitivity, specificity, and balanced accuracy to compare predictive quality while managing bias and variance.
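One convenient way to build the required results table is to pull the same test-set metrics from each fitted model programmatically. The sketch below assumes the six model objects and the test set defined above; the helper name summarise_model is illustrative:

# Sketch: assemble a comparison table of test-set metrics across the six models.
summarise_model <- function(model, label) {
  cm <- confusionMatrix(predict(model, test), test$y)
  tibble(model = label,
         accuracy = unname(cm$overall["Accuracy"]),
         sensitivity = unname(cm$byClass["Sensitivity"]),
         specificity = unname(cm$byClass["Specificity"]),
         balanced_accuracy = unname(cm$byClass["Balanced Accuracy"]))
}
results <- bind_rows(
  summarise_model(dt_model1, "Decision Tree 1"),
  summarise_model(dt_model2, "Decision Tree 2"),
  summarise_model(rf_model1, "Random Forest 1"),
  summarise_model(rf_model2, "Random Forest 2"),
  summarise_model(xgb_model1, "Boosting 1"),
  summarise_model(xgb_model2, "Boosting 2")
)
results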
For each of the algorithms above, perform at least two (2) experiments. In a typical experiment you should:
- Define the objective of the experiment (hypothesis)
- Decide what will change, and what will stay the same
- Select the evaluation metric (what you want to measure)
- Perform the experiment
- Document the experiment so you can compare results (track progress)

Variations

There are many things you can vary between experiments; here are some examples (a sketch of one such variation follows this list):
- Data sampling (feature selection)
- Data augmentation, e.g., regularization, normalization, scaling
- Hyperparameter optimization (you decide: random search, grid search, etc.)
- Decision tree breadth & depth (this is an example of a hyperparameter)
- Evaluation metrics, e.g., accuracy, precision, recall, F1-score, AUC-ROC
- Cross-validation strategy, e.g., holdout, k-fold, leave-one-out
- Number of trees (for ensemble models)
- Train-test split: using different data splits to assess model generalization ability
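As one concrete example of the variations above, an experiment could swap the 3-fold CV used here for repeated 5-fold CV and widen the cp grid for the decision tree. A hedged sketch, reusing the train data and caret setup from above (the grid values are arbitrary):

# Sketch of a variation: repeated 5-fold CV and a wider cp grid for rpart.
ctrl_rep <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                         classProbs = TRUE, summaryFunction = twoClassSummary)
dt_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
dt_model_var <- train(y ~ ., data = train, method = "rpart",
                      tuneGrid = dt_grid, trControl = ctrl_rep, metric = "ROC")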
Deliverable

Essay (minimum 500 words). Format: PDF.

Write a short essay summarizing your findings. Your essay should include:
- Why you chose the experiments you did
- A discussion of bias & variance across the experiments, e.g., between the Decision Tree experiments, and with Random Forest & AdaBoost
- A table with experiments & results
- The optimal model you found, and why
- The conclusion you came to, and what you recommend

Code

This should include your code as well as the outputs of your code, e.g., a correlation chart. Format: code should be saved in https://rpubs.com. Please provide a link to your code in the submission.