Experimentation & Model Training

Introduction

In Machine Learning, experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing: it involves systematically changing parameters, evaluating the results with metrics, and comparing different approaches to find the best solution. Essentially, it is the practice of testing and refining machine learning models through controlled experiments to improve their performance.
The key is to modify only one or a few variables at a time to isolate the impact of each change and understand its effect on model performance. In the assignment you will conduct at least 6 experiments. In real life, data scientists run anywhere from a dozen to hundreds of experiments (depending on the dataset and problem domain).
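Tracking each run in a small table makes it obvious what changed between experiments and keeps the comparison honest. Below is a minimal, illustrative sketch of such an experiment log in R (the column names and entries are placeholders, not something required by the assignment):

# Illustrative experiment log; assumes the tibble package is available.
library(tibble)
experiment_log <- tribble(
  ~experiment,       ~what_changed,       ~metric, ~result,
  "Decision Tree 1", "baseline settings", "ROC",   NA,
  "Decision Tree 2", "tuned cp grid",     "ROC",   NA
)
experiment_log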
Assignment

This assignment consists of conducting at least two (2) experiments for each of three different algorithms: Decision Trees, Random Forest, and AdaBoost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end review how the experiment went. These experiments will allow you to compare algorithms and choose the optimal model.
Using the dataset and EDA from the previous assignment, perform the following:
Algorithm Selection

You will perform experiments using the following algorithms:
- Decision Trees
- Random Forest
- AdaBoost

Experiment
# Load required packages
library(tidyverse)  # read_delim, dplyr verbs, pipes
library(caret)      # createDataPartition, trainControl, train, confusionMatrix
library(pROC)       # roc, auc
# Load data
bank_data <- read_delim("/Users/zigcah/Downloads/bank+marketing/bank-additional/bank-additional-full.csv", delim = ";")
## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert character to factor
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor))
# Drop 'duration': it is only known after the call, so keeping it would leak the outcome
bank_data <- bank_data %>% select(-duration)
# Feature engineering: flag clients contacted in a previous campaign (pdays == 999 means never previously contacted)
bank_data <- bank_data %>% mutate(contacted_before = if_else(pdays == 999, 0, 1)) %>% select(-pdays, -previous)
# Remove correlated variables
bank_data <- bank_data %>% select(-emp.var.rate)
# Sample to reduce runtime
set.seed(321)
bank_data <- bank_data %>% sample_frac(0.3) # Use 30% of data to reduce training time
# Split data
index <- createDataPartition(bank_data$y, p = 0.7, list = FALSE)
train <- bank_data[index, ]
test <- bank_data[-index, ]
# Recode target
train$y <- relevel(train$y, ref = "yes")
test$y <- relevel(test$y, ref = "yes")
# Training control: 3-fold CV with class probabilities and twoClassSummary, so ROC, sensitivity, and specificity are reported
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
### Decision Tree 1: Default settings
dt_model1 <- train(y ~ ., data = train, method = "rpart", trControl = ctrl, metric = "ROC")
pred_dt1 <- predict(dt_model1, test)
confusionMatrix(pred_dt1, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 72 25
## no 338 3271
##
## Accuracy : 0.9021
## 95% CI : (0.892, 0.9114)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.006745
##
## Kappa : 0.2524
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17561
## Specificity : 0.99242
## Pos Pred Value : 0.74227
## Neg Pred Value : 0.90635
## Prevalence : 0.11063
## Detection Rate : 0.01943
## Detection Prevalence : 0.02617
## Balanced Accuracy : 0.58401
##
## 'Positive' Class : yes
##
roc_dt1 <- roc(test$y, predict(dt_model1, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_dt1, main = "ROC - Decision Tree 1")
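Note that after the releveling above the probability columns are ordered ("yes", "no"), so `[,2]` selects the probability of "no"; the resulting AUC is unchanged, but referencing the "yes" column by name and fixing the levels explicitly is less ambiguous. A minimal alternative sketch (illustrative only, not a change to the results above):

# Alternative (sketch): use the "yes" probability and set levels/direction explicitly.
roc_dt1_alt <- roc(test$y,
                   predict(dt_model1, test, type = "prob")[, "yes"],
                   levels = c("no", "yes"), direction = "<")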
### Decision Tree 2: Tuned cp grid (caret's rpart method tunes only cp)
dt_model2 <- train(y ~ ., data = train, method = "rpart",
tuneGrid = expand.grid(cp = c(0.01, 0.05)),
trControl = ctrl, metric = "ROC")
pred_dt2 <- predict(dt_model2, test)
confusionMatrix(pred_dt2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 65 19
## no 345 3277
##
## Accuracy : 0.9018
## 95% CI : (0.8917, 0.9112)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.007838
##
## Kappa : 0.2344
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.15854
## Specificity : 0.99424
## Pos Pred Value : 0.77381
## Neg Pred Value : 0.90475
## Prevalence : 0.11063
## Detection Rate : 0.01754
## Detection Prevalence : 0.02267
## Balanced Accuracy : 0.57639
##
## 'Positive' Class : yes
##
roc_dt2 <- roc(test$y, predict(dt_model2, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_dt2, main = "ROC - Decision Tree 2")
### Random Forest 1: Baseline settings (ntree = 50 to limit runtime)
rf_model1 <- train(y ~ ., data = train, method = "rf",
ntree = 50,
trControl = ctrl, metric = "ROC")
pred_rf1 <- predict(rf_model1, test)
confusionMatrix(pred_rf1, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 115 102
## no 295 3194
##
## Accuracy : 0.8929
## 95% CI : (0.8825, 0.9027)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.2576
##
## Kappa : 0.3143
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.28049
## Specificity : 0.96905
## Pos Pred Value : 0.52995
## Neg Pred Value : 0.91545
## Prevalence : 0.11063
## Detection Rate : 0.03103
## Detection Prevalence : 0.05855
## Balanced Accuracy : 0.62477
##
## 'Positive' Class : yes
##
roc_rf1 <- roc(test$y, predict(rf_model1, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_rf1, main = "ROC - Random Forest 1")
### Random Forest 2: Tuned mtry
rf_model2 <- train(y ~ ., data = train, method = "rf",
tuneGrid = expand.grid(mtry = c(2, 4)),
ntree = 50,
trControl = ctrl, metric = "ROC")
pred_rf2 <- predict(rf_model2, test)
confusionMatrix(pred_rf2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 90 49
## no 320 3247
##
## Accuracy : 0.9004
## 95% CI : (0.8903, 0.9099)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.01594
##
## Kappa : 0.288
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.21951
## Specificity : 0.98513
## Pos Pred Value : 0.64748
## Neg Pred Value : 0.91029
## Prevalence : 0.11063
## Detection Rate : 0.02428
## Detection Prevalence : 0.03751
## Balanced Accuracy : 0.60232
##
## 'Positive' Class : yes
##
roc_rf2 <- roc(test$y, predict(rf_model2, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_rf2, main = "ROC - Random Forest 2")
### Boosting Model 1: xgbTree baseline (used in place of AdaBoost), small fixed grid on a 10% training sample
set.seed(321)
xgb_model1 <- train(y ~ ., data = train %>% sample_frac(0.1),
method = "xgbTree",
trControl = ctrl,
tuneGrid = expand.grid(nrounds = 10,
max_depth = 2,
eta = 0.3,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8),
metric = "ROC")
pred_xgb1 <- predict(xgb_model1, test)
confusionMatrix(pred_xgb1, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 60 28
## no 350 3268
##
## Accuracy : 0.898
## 95% CI : (0.8878, 0.9076)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.04827
##
## Kappa : 0.2101
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.14634
## Specificity : 0.99150
## Pos Pred Value : 0.68182
## Neg Pred Value : 0.90326
## Prevalence : 0.11063
## Detection Rate : 0.01619
## Detection Prevalence : 0.02375
## Balanced Accuracy : 0.56892
##
## 'Positive' Class : yes
##
roc_xgb1 <- roc(test$y, predict(xgb_model1, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_xgb1, main = "ROC - Boosting Model 1")
### Boosting Model 2: xgbTree with more rounds and a lower learning rate, on a 15% training sample
set.seed(432)
xgb_model2 <- train(y ~ ., data = train %>% sample_frac(0.15),
method = "xgbTree",
trControl = ctrl,
tuneGrid = expand.grid(nrounds = 20,
max_depth = 2,
eta = 0.1,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8),
metric = "ROC")
pred_xgb2 <- predict(xgb_model2, test)
confusionMatrix(pred_xgb2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 69 23
## no 341 3273
##
## Accuracy : 0.9018
## 95% CI : (0.8917, 0.9112)
## No Information Rate : 0.8894
## P-Value [Acc > NIR] : 0.007838
##
## Kappa : 0.2443
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.16829
## Specificity : 0.99302
## Pos Pred Value : 0.75000
## Neg Pred Value : 0.90564
## Prevalence : 0.11063
## Detection Rate : 0.01862
## Detection Prevalence : 0.02482
## Balanced Accuracy : 0.58066
##
## 'Positive' Class : yes
##
roc_xgb2 <- roc(test$y, predict(xgb_model2, test, type = "prob")[,2])
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
plot(roc_xgb2, main = "ROC - Boosting Model 2")
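To compare the six experiments on one chart, the individual ROC objects can be overlaid and their AUCs tabulated. A minimal sketch using the pROC objects created above (colors and labels are arbitrary):

# Sketch: overlay all ROC curves and list their AUCs for comparison.
plot(roc_dt1, col = "black", main = "ROC comparison across experiments")
lines(roc_dt2, col = "grey50")
lines(roc_rf1, col = "blue")
lines(roc_rf2, col = "steelblue")
lines(roc_xgb1, col = "red")
lines(roc_xgb2, col = "darkred")
legend("bottomright",
       legend = c("DT1", "DT2", "RF1", "RF2", "XGB1", "XGB2"),
       col = c("black", "grey50", "blue", "steelblue", "red", "darkred"),
       lwd = 2)
sapply(list(DT1 = roc_dt1, DT2 = roc_dt2, RF1 = roc_rf1,
            RF2 = roc_rf2, XGB1 = roc_xgb1, XGB2 = roc_xgb2), auc)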
In this assignment, I conducted six structured experiments using three machine learning algorithms (Decision Trees, Random Forest, and gradient boosting via xgbTree, standing in for AdaBoost) to model the likelihood that a client subscribes to a term deposit product. The dataset comes from a Portuguese bank marketing campaign, and the EDA from Assignment 1 revealed significant class imbalance and feature skewness.
The experiments were designed to explore ways to reduce false negatives and increase model sensitivity on the minority ("yes") class. The factors varied across runs were hyperparameter settings (cp for the trees, mtry for the forests, and the number of rounds and learning rate for boosting) and the amount of training data used. Models were tuned on ROC/AUC with 3-fold cross-validation and then evaluated on the held-out test set using accuracy, sensitivity, specificity, and balanced accuracy to compare predictive quality while managing bias and variance.
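One convenient way to build the required results table is to pull the same test-set metrics from each fitted model programmatically. The sketch below assumes the six model objects and the test set defined above; the helper name summarise_model is illustrative:

# Sketch: assemble a comparison table of test-set metrics across the six models.
summarise_model <- function(model, label) {
  cm <- confusionMatrix(predict(model, test), test$y)
  tibble(model = label,
         accuracy = unname(cm$overall["Accuracy"]),
         sensitivity = unname(cm$byClass["Sensitivity"]),
         specificity = unname(cm$byClass["Specificity"]),
         balanced_accuracy = unname(cm$byClass["Balanced Accuracy"]))
}
results <- bind_rows(
  summarise_model(dt_model1, "Decision Tree 1"),
  summarise_model(dt_model2, "Decision Tree 2"),
  summarise_model(rf_model1, "Random Forest 1"),
  summarise_model(rf_model2, "Random Forest 2"),
  summarise_model(xgb_model1, "Boosting 1"),
  summarise_model(xgb_model2, "Boosting 2")
)
results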
For each of the algorithms above, perform at least two (2) experiments. In a typical experiment you should:
- Define the objective of the experiment (hypothesis)
- Decide what will change, and what will stay the same
- Select the evaluation metric (what you want to measure)
- Perform the experiment
- Document the experiment so you can compare results (track progress)

Variations

There are many things you can vary between experiments; here are some examples (a sketch of one such variation follows this list):
- Data sampling (feature selection)
- Data augmentation, e.g., regularization, normalization, scaling
- Hyperparameter optimization (you decide: random search, grid search, etc.)
- Decision tree breadth & depth (this is an example of a hyperparameter)
- Evaluation metrics, e.g., accuracy, precision, recall, F1-score, AUC-ROC
- Cross-validation strategy, e.g., holdout, k-fold, leave-one-out
- Number of trees (for ensemble models)
- Train-test split: using different data splits to assess model generalization ability
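As one concrete example of the variations above, an experiment could swap the 3-fold CV used here for repeated 5-fold CV and widen the cp grid for the decision tree. A hedged sketch, reusing the train data and caret setup from above (the grid values are arbitrary):

# Sketch of a variation: repeated 5-fold CV and a wider cp grid for rpart.
ctrl_rep <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                         classProbs = TRUE, summaryFunction = twoClassSummary)
dt_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
dt_model_var <- train(y ~ ., data = train, method = "rpart",
                      tuneGrid = dt_grid, trControl = ctrl_rep, metric = "ROC")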
Deliverable

Essay (minimum 500 words). Format: PDF.

Write a short essay summarizing your findings. Your essay should include:
- Why you chose the experiments you did
- A discussion of bias & variance across the experiments, e.g., between the Decision Tree experiments, and with Random Forest & AdaBoost
- A table with experiments & results
- The optimal model you found, and why
- The conclusion you came to, and what you recommend

Code

This should include your code as well as the outputs of your code, e.g., a correlation chart. Format: code should be saved in https://rpubs.com. Please provide a link to your code in the submission.