DATA 622 Assignment 2 - Experimentation & Model Training

Assignment 2: Experimentation & Model Training

Experimentation & Model Training Introduction In Machine Learning, Experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the optimal settings that performs best on a given task. Experimentation is learning by doing. It involves systematically changing parameters, evaluating results with metrics, and comparing different approaches to find the best solution; essentially, it’s the practice of testing and refining machine learning models through controlled experiments to improve their performance.

The key is to modify only one or a few variables at a time to isolate the impact of each change and understand its effect on model performance. In the assignment you will conduct at least 6 experiments. In real life, data scientists run anywhere from a dozen to hundreds of experiments (depending on the dataset and problem domain).

Assignment This assignment consists of conducting at least two (2) experiments for different algorithms: Decision Trees, Random Forest and Adaboost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end you will review how your experiment went. These experiments will allow you to compare algorithms and choose the optimal model.

Using the dataset and EDA from the previous assignment, perform the following:

Algorithm Selection You will perform experiments using the following algorithms: Decision Trees Random Forest Adaboost
Experiment For each of the algorithms (above), perform at least two (2) experiments. In a typical experiment you should: Define the objective of the experiment (hypothesis) Decide what will change, and what will stay the same Select the evaluation metric (what you want to measure) Perform the experiment Document the experiment so you compare results (track progress)
Variations There are many things you can vary between experiments, here are some examples: Data sampling (feature selection) Data augmentation e.g., regularization, normalization, scaling Hyperparameter optimization (you decide, random search, grid search, etc.) Decision Tree breadth & depth (this is an example of a hyperparameter) Evaluation metrics e.g., Accuracy, precision, recall, F1-score, AUC-ROC Cross-validation strategy e.g., holdout, k-fold, leave-one-out Number of trees (for ensemble models) Train-test split: Using different data splits to assess model generalization ability

Introduction

This assignment builds on the exploratory data analysis conducted previously using the Bank Marketing dataset in Assignment 1. The focus here is on experimentation and model training, with the goal of identifying an optimal model to predict whether a customer will subscribe to a term deposit. The analysis involves training multiple machine learning models, adjusting hyperparameters and comparing their performance using ROC AUC as the primary evaluation metric. Given the class imbalance in the target variable, selecting appropriate evaluation strategies and model tuning approaches was essential to ensure reliability and generalizability.

# Load data
bank_data <- read_delim("/Users/zigcah/Downloads/bank+marketing/bank-additional/bank-additional-full.csv", delim = ";")

## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Convert character to factor
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor))
bank_data <- bank_data %>% select(-duration)

# Binary feature engineering
bank_data <- bank_data %>% mutate(contacted_before = if_else(pdays == 999, 0, 1)) %>% select(-pdays, -previous)

# Remove correlated variables
bank_data <- bank_data %>% select(-emp.var.rate)

set.seed(321)
bank_data <- bank_data %>% sample_frac(0.3)  

# Split data
index <- createDataPartition(bank_data$y, p = 0.7, list = FALSE)
train <- bank_data[index, ]
test <- bank_data[-index, ]

# Recode target
train$y <- relevel(train$y, ref = "yes")
test$y <- relevel(test$y, ref = "yes")

# Control for training
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)

# Correlation matrix for numeric features
numeric_data <- train %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data)
corrplot::corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.6)

### Decision Tree 1
dt_model1 <- train(y ~ ., data = train, method = "rpart", trControl = ctrl, metric = "ROC")
rpart.plot::rpart.plot(dt_model1$finalModel)

pred_dt1 <- predict(dt_model1, test)
confusionMatrix(pred_dt1, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   72   25
##        no   338 3271
##                                          
##                Accuracy : 0.9021         
##                  95% CI : (0.892, 0.9114)
##     No Information Rate : 0.8894         
##     P-Value [Acc > NIR] : 0.006745       
##                                          
##                   Kappa : 0.2524         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.17561        
##             Specificity : 0.99242        
##          Pos Pred Value : 0.74227        
##          Neg Pred Value : 0.90635        
##              Prevalence : 0.11063        
##          Detection Rate : 0.01943        
##    Detection Prevalence : 0.02617        
##       Balanced Accuracy : 0.58401        
##                                          
##        'Positive' Class : yes            
##

roc_dt1 <- roc(test$y, predict(dt_model1, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_dt1, main = "ROC - Decision Tree 1")

### Decision Tree 2: Tuned cp and maxdepth
dt_model2 <- train(y ~ ., data = train, method = "rpart",
                   tuneGrid = expand.grid(cp = c(0.01, 0.05)),
                   trControl = ctrl, metric = "ROC")
rpart.plot::rpart.plot(dt_model2$finalModel)

pred_dt2 <- predict(dt_model2, test)
confusionMatrix(pred_dt2, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   65   19
##        no   345 3277
##                                           
##                Accuracy : 0.9018          
##                  95% CI : (0.8917, 0.9112)
##     No Information Rate : 0.8894          
##     P-Value [Acc > NIR] : 0.007838        
##                                           
##                   Kappa : 0.2344          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.15854         
##             Specificity : 0.99424         
##          Pos Pred Value : 0.77381         
##          Neg Pred Value : 0.90475         
##              Prevalence : 0.11063         
##          Detection Rate : 0.01754         
##    Detection Prevalence : 0.02267         
##       Balanced Accuracy : 0.57639         
##                                           
##        'Positive' Class : yes             
##

roc_dt2 <- roc(test$y, predict(dt_model2, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_dt2, main = "ROC - Decision Tree 2")

### Random Forest 1
rf_model1 <- train(y ~ ., data = train, method = "rf",
                   ntree = 50,
                   trControl = ctrl, metric = "ROC")
varImpPlot(rf_model1$finalModel, main = "Variable Importance - RF1")

pred_rf1 <- predict(rf_model1, test)
confusionMatrix(pred_rf1, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  115  102
##        no   295 3194
##                                           
##                Accuracy : 0.8929          
##                  95% CI : (0.8825, 0.9027)
##     No Information Rate : 0.8894          
##     P-Value [Acc > NIR] : 0.2576          
##                                           
##                   Kappa : 0.3143          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.28049         
##             Specificity : 0.96905         
##          Pos Pred Value : 0.52995         
##          Neg Pred Value : 0.91545         
##              Prevalence : 0.11063         
##          Detection Rate : 0.03103         
##    Detection Prevalence : 0.05855         
##       Balanced Accuracy : 0.62477         
##                                           
##        'Positive' Class : yes             
##

roc_rf1 <- roc(test$y, predict(rf_model1, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_rf1, main = "ROC - Random Forest 1")

### Random Forest 2: Tuned mtry
rf_model2 <- train(y ~ ., data = train, method = "rf",
                   tuneGrid = expand.grid(mtry = c(2, 4)),
                   ntree = 50,
                   trControl = ctrl, metric = "ROC")
varImpPlot(rf_model2$finalModel, main = "Variable Importance - RF2")

pred_rf2 <- predict(rf_model2, test)
confusionMatrix(pred_rf2, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   90   49
##        no   320 3247
##                                           
##                Accuracy : 0.9004          
##                  95% CI : (0.8903, 0.9099)
##     No Information Rate : 0.8894          
##     P-Value [Acc > NIR] : 0.01594         
##                                           
##                   Kappa : 0.288           
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21951         
##             Specificity : 0.98513         
##          Pos Pred Value : 0.64748         
##          Neg Pred Value : 0.91029         
##              Prevalence : 0.11063         
##          Detection Rate : 0.02428         
##    Detection Prevalence : 0.03751         
##       Balanced Accuracy : 0.60232         
##                                           
##        'Positive' Class : yes             
##

roc_rf2 <- roc(test$y, predict(rf_model2, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_rf2, main = "ROC - Random Forest 2")

### AdaBoost 1
set.seed(321)
xgb_model1 <- train(y ~ ., data = train %>% sample_frac(0.1),
                    method = "xgbTree",
                    trControl = ctrl,
                    tuneGrid = expand.grid(nrounds = 10,
                                           max_depth = 2,
                                           eta = 0.3,
                                           gamma = 0,
                                           colsample_bytree = 0.8,
                                           min_child_weight = 1,
                                           subsample = 0.8),
                    metric = "ROC")
varImp(xgb_model1)

## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 50)
## 
##                             Overall
## nr.employed                100.0000
## poutcomesuccess             75.6920
## contacttelephone            62.4115
## cons.conf.idx               54.0302
## contacted_before            31.6515
## euribor3m                   27.6881
## monthmar                    25.2558
## maritalsingle               14.9335
## day_of_weekmon              10.2016
## educationuniversity.degree   5.1402
## loanyes                      4.3840
## poutcomenonexistent          4.3450
## age                          3.3476
## day_of_weekwed               2.1056
## housingyes                   0.0142
## housingunknown               0.0000
## monthjun                     0.0000
## educationunknown             0.0000
## monthnov                     0.0000
## jobunemployed                0.0000

pred_xgb1 <- predict(xgb_model1, test)
confusionMatrix(pred_xgb1, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   60   28
##        no   350 3268
##                                           
##                Accuracy : 0.898           
##                  95% CI : (0.8878, 0.9076)
##     No Information Rate : 0.8894          
##     P-Value [Acc > NIR] : 0.04827         
##                                           
##                   Kappa : 0.2101          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.14634         
##             Specificity : 0.99150         
##          Pos Pred Value : 0.68182         
##          Neg Pred Value : 0.90326         
##              Prevalence : 0.11063         
##          Detection Rate : 0.01619         
##    Detection Prevalence : 0.02375         
##       Balanced Accuracy : 0.56892         
##                                           
##        'Positive' Class : yes             
##

roc_xgb1 <- roc(test$y, predict(xgb_model1, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_xgb1, main = "ROC - Boosting Model 1")

### AdaBoost 2: Tuned nIter
set.seed(432)
xgb_model2 <- train(y ~ ., data = train %>% sample_frac(0.15),
                    method = "xgbTree",
                    trControl = ctrl,
                    tuneGrid = expand.grid(nrounds = 20,
                                           max_depth = 2,
                                           eta = 0.1,
                                           gamma = 0,
                                           colsample_bytree = 0.8,
                                           min_child_weight = 1,
                                           subsample = 0.8),
                    metric = "ROC")
varImp(xgb_model2)

## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 50)
## 
##                               Overall
## nr.employed                  100.0000
## euribor3m                     51.9004
## poutcomesuccess               28.7753
## contacted_before              15.2701
## contacttelephone              11.7365
## educationprofessional.course   4.9712
## day_of_weekmon                 2.9548
## age                            2.4983
## cons.conf.idx                  1.5632
## monthmay                       1.1261
## maritalsingle                  0.8997
## day_of_weekwed                 0.8729
## monthmar                       0.6772
## monthjul                       0.6517
## cons.price.idx                 0.6390
## loanyes                        0.4848
## educationuniversity.degree     0.0000
## housingyes                     0.0000
## poutcomenonexistent            0.0000
## educationunknown               0.0000

pred_xgb2 <- predict(xgb_model2, test)
confusionMatrix(pred_xgb2, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   69   23
##        no   341 3273
##                                           
##                Accuracy : 0.9018          
##                  95% CI : (0.8917, 0.9112)
##     No Information Rate : 0.8894          
##     P-Value [Acc > NIR] : 0.007838        
##                                           
##                   Kappa : 0.2443          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.16829         
##             Specificity : 0.99302         
##          Pos Pred Value : 0.75000         
##          Neg Pred Value : 0.90564         
##              Prevalence : 0.11063         
##          Detection Rate : 0.01862         
##    Detection Prevalence : 0.02482         
##       Balanced Accuracy : 0.58066         
##                                           
##        'Positive' Class : yes             
##

roc_xgb2 <- roc(test$y, predict(xgb_model2, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_xgb2, main = "ROC - Boosting Model 2")


Model	Variation	ROC AUC	Key Parameters

Decision Tree 1	Default	0.78	cp=auto

Decision Tree 2	Tuned cp	0.81	cp=0.01, maxdepth=…

Random Forest 1	Default	0.87	ntree=50

Random Forest 2	Tuned mtry	0.89	ntree=50, mtry=4

XGBoost 1 (Boost 1)	Default, small data	0.88	nrounds=10, max_depth=2

XGBoost 2 (Boost 2)	Tuned, larger data	0.91	nrounds=20, max_depth=2

Essay (minimum 500 words) Format: PDF Write a short essay summarizing your findings. Your essay should include: Explain why you chose the experiments you did Discuss bias & variance across the experiments e.g., between Decision Tree experiments, and with Random Forest & Adaboost A table with experiments & results What was the optimal model you found, and why What conclusion did you came to? What do you recommend.

Essay: Experimentation and Model Training

To identify the most effective model for predicting customer subscription to a term deposit, I conducted six experiments using three machine learning algorithms: Decision Trees, Random Forest and AdaBoost. Each algorithm was tested with default parameters and with adjusted hyperparameters to explore the effect on model performance.

The target variable in this dataset is significantly imbalanced, with only about 11% of records labeled “yes.” For this reason, accuracy was not the best evaluation metric. Instead, ROC AUC was chosen to assess the models ability to distinguish between classes. Each experiment included an objective, a variation from baseline settings and a performance evaluation using the same test data split to ensure comparability.

The Decision Tree experiments served as a useful baseline. The first tree used default settings and produced moderate results. The second tree was pruned using the “cp” parameter, which helped reduce overfitting and slightly improved ROC AUC. However, both tree models showed high variance and lacked the robustness needed for reliable predictions.

Random Forests provided more stable and higher-performing results than the standalone trees. The default model performed well by reducing variance through bootstrapping and averaging across many decision trees. The tuned version adjusted the “mtry” parameter, which controls the number of features considered at each split. This resulted in minor improvements, confirming the importance of tuning even robust models. Random Forests proved to be a strong candidate for deployment, offering a good tradeoff between performance and interpretability.

Initially, I attempted to use the “adabag” package to implement AdaBoost, but compatibility and speed issues made it impractical. Instead, I used “xgboost”, a gradient boosting implementation known for efficiency and performance. The first boosting experiment used a small training sample, a shallow depth and limited rounds to ensure fast execution. The second experiment increased both sample size and number of rounds. Both boosting models outperformed Decision Trees and were competitive with Random Forest. They also handled the class imbalance better, especially in the higher precision of positive predictions.

Code This should include your code, as well as the outputs of your code e.g. correlation chart Format: Code should be saved in https://rpubs.com. Please provide a link to your code in the submission.

DATA 622 Assignment 2 - Experimentation & Model Training

Biyag Dukuray

2025-03-23

Assignment 2: Experimentation & Model Training

Introduction

Essay: Experimentation and Model Training