Introduction

This document conducts six experiments on the Bank Marketing dataset to predict whether a client subscribes to a term deposit. We use three algorithms—Decision Trees, Random Forest, and XGBoost—with two experiments each. The dataset is preprocessed based on the EDA insights from Assignment 1 (found at https://rpubs.com/sokkarbishoy/1278944), and each experiment has a specific objective to deepen our understanding of predictive factors and model performance.

Explanation of Approach

In this project, we apply three machine learning algorithms to explore classification tasks; each is selected for specific strengths, described under Algorithm Selection below.

By performing two experiments per algorithm, we systematically test model variations using different hyperparameters, cross-validation strategies, and feature selection strategies. Each experiment has a defined objective and variation, and performance is assessed with the following metrics:

  • Accuracy: Measures the overall correctness of predictions.
  • AUC-ROC: Assesses the model's ability to distinguish between classes, which is especially useful for imbalanced datasets like this one.
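Because roughly 88% of clients in this dataset do not subscribe, accuracy alone can flatter a useless model. A minimal sketch of this "no-information" baseline (to be run once bank_data is built in the preprocessing section below):

# Sketch: accuracy of always predicting the majority class ("no")
majority_share <- max(prop.table(table(bank_data$target)))
majority_share   # roughly 0.88 for this dataset; any useful model must beat this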

Libraries Used

# Load required libraries
library(reticulate)    # For Python integration
library(dplyr)         # Data manipulation
library(ggplot2)       # Visualization
library(rpart)         # Decision Trees
library(rpart.plot)    # Tree plotting
library(randomForest)  # Random Forest
library(xgboost)       # XGBoost
library(caret)         # Model training utilities
library(pROC)          # ROC-AUC calculation
library(adabag)        # AdaBoost; could not be run on local PC (see XGBoost section)
# Set seed for reproducibility
set.seed(456789)

The UCI dataset from Assignment 1

In this section, we use the ucimlrepo package to fetch the Bank Marketing dataset from the UCI repository. This dataset contains information about direct marketing campaigns conducted by a Portuguese bank. The target variable indicates whether a client subscribed to a term deposit.

# Install Python package if not already installed
py_install("ucimlrepo") 
# Fetch the dataset from the UCI repository
py_run_string("
from ucimlrepo import fetch_ucirepo
bank_marketing = fetch_ucirepo(id=222)
features = bank_marketing.data.features
target = bank_marketing.data.targets
metadata = bank_marketing.metadata
variables = bank_marketing.variables
")

# Assign more meaningful names
bank_features <- py$features   # Independent variables
bank_target <- py$target       # Dependent variable (Target)
metadata <- py$metadata        # Metadata about the dataset
variables <- py$variables      # Variable descriptions


# Combine features and target into a single dataset (the target column arrives named "y")
bank_data <- dplyr::bind_cols(bank_features, target = bank_target)
bank_data$target <- as.factor(bank_data$y)   # factor copy of "y"; the original is dropped below

Data Preprocessing

Preprocessing is essential to ensure the dataset is clean and suitable for machine learning. Here’s how we handle it:

# Impute missing values: the modal category for job and education (per the EDA in
# Assignment 1), and an explicit "unknown" level for contact and poutcome
bank_data$job[is.na(bank_data$job)] <- "blue-collar"
bank_data$education[is.na(bank_data$education)] <- "secondary"
bank_data$contact[is.na(bank_data$contact)] <- "unknown"
bank_data$poutcome[is.na(bank_data$poutcome)] <- "unknown"

# Convert categorical variables to factors for DT and RF
bank_data <- bank_data %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  select(-y)

# Train-test split
trainIndex <- createDataPartition(bank_data$target, p = 0.8, list = FALSE)
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]

# For XGBoost: One-hot encode categorical variables and convert to matrix
train_data_xgb <- model.matrix(~ . - 1, data = train_data %>% select(-target))
test_data_xgb <- model.matrix(~ . - 1, data = test_data %>% select(-target))

# Convert target to factor with valid R variable names ("No" and "Yes")
train_labels <- factor(ifelse(as.numeric(train_data$target) == 1, "No", "Yes"), levels = c("No", "Yes"))
test_labels <- factor(ifelse(as.numeric(test_data$target) == 1, "No", "Yes"), levels = c("No", "Yes"))

# For Experiment 5 (basic XGBoost), we need numeric labels (0/1) for xgb.DMatrix
train_labels_numeric <- as.numeric(train_data$target) - 1  # 0 for "No", 1 for "Yes"
test_labels_numeric <- as.numeric(test_data$target) - 1
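One caveat with one-hot encoding train and test separately: model.matrix() builds columns from each factor's levels, so the two matrices align here only because the factor levels were fixed on the full dataset before the split. A quick sanity check (a suggested addition, not part of the original pipeline):

# Sanity check: the encoded train and test matrices must have identical columns
stopifnot(identical(colnames(train_data_xgb), colnames(test_data_xgb)))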

Algorithm Selection

We select the following algorithms based on their strengths:

  1. Decision Trees: Simple, interpretable, good for identifying key features.

  2. Random Forest: Ensemble method, robust to overfitting, strong predictive power.

  3. AdaBoost: Boosting method, focuses on hard-to-classify cases, complements tree-based models. (Replaced by XGBoost during execution; see the XGBoost section for the rationale.)

Each algorithm will undergo two experiments with defined objectives and variations.

Experiments

Decision Trees

Experiment 1: Baseline Decision Tree

  • Objective: Establish baseline performance and identify key features (e.g., duration).
  • Variation: Default parameters, all features, no pruning.
  • Metric: Accuracy, AUC-ROC.
  • Execution:
dt_model1 <- rpart(target ~ ., data = train_data, method = "class")
rpart.plot(dt_model1, main = "Decision Tree - Baseline")

dt_pred1 <- predict(dt_model1, test_data, type = "class")
dt_cm1 <- confusionMatrix(dt_pred1, test_data$target)
dt_prob1 <- predict(dt_model1, test_data, type = "prob")[, 2]
dt_roc1 <- roc(test_data$target, dt_prob1)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Document results
cat("Accuracy:", dt_cm1$overall["Accuracy"], "\n")
## Accuracy: 0.9013383
cat("AUC-ROC:", auc(dt_roc1), "\n")
## AUC-ROC: 0.7593647
  • Results: Accuracy = 0.9013383, AUC-ROC = 0.7593647. Duration was a key predictor, as expected.
  • Conclusion: The baseline model performs reasonably well, with an accuracy of 0.9013383, but its AUC-ROC of 0.7593647 suggests moderate discriminative ability, likely due to the imbalanced dataset.

Results: The decision tree uses duration as the most significant splitting feature, indicating its strong predictive influence on whether a customer will subscribe to a term deposit. The first split at duration < 522 suggests that customers whose call lasted less than 522 seconds are predominantly classified as "no" (did not subscribe). Further splits occur on poutcome, which records the outcome of the previous campaign, and on additional duration thresholds, underscoring duration's importance in the prediction. If a customer had no previous contact and a short call (< 132 seconds), they are highly likely to be classified as "no" (86% confidence). Conversely, if the call duration was very long (e.g., > 828 seconds) and the previous outcome was unsuccessful, the model is more likely to predict "yes."

Performance metrics: The model achieves a strong accuracy score, suggesting it correctly classifies a large share of the test data. However, accuracy alone may not be sufficient given the class imbalance. The AUC-ROC of about 0.76 indicates reasonable discriminative power: clearly better than random guessing, though there is room for improvement.
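To back the visual reading of the tree with numbers, rpart stores an aggregate importance score per variable in the fitted object. A short sketch using the baseline model above:

# Sketch: variable importance from the baseline tree (larger = more influential)
round(sort(dt_model1$variable.importance, decreasing = TRUE), 1)

If duration dominates this ranking, it confirms the interpretation of the tree plot.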

Experiment 2: Pruned Tree with 5-Fold Cross-Validation

  • Objective: Test if pruning and 5-fold CV improve generalization.

  • Variation: Pruning (cp = 0.01), 5-fold CV, non-trivial change from baseline.

  • Metric: Accuracy, AUC-ROC.

  • Execution:

dt_control <- trainControl(method = "cv", number = 5)
dt_model2 <- train(target ~ ., data = train_data, method = "rpart", 
                   trControl = dt_control, tuneGrid = data.frame(cp = 0.01))
dt_pred2 <- predict(dt_model2, test_data)
dt_cm2 <- confusionMatrix(dt_pred2, test_data$target)
dt_prob2 <- predict(dt_model2, test_data, type = "prob")[, 2]
dt_roc2 <- roc(test_data$target, dt_prob2)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Document results
cat("Accuracy:", dt_cm2$overall["Accuracy"], "\n")
## Accuracy: 0.9013383
cat("AUC-ROC:", auc(dt_roc2), "\n")
## AUC-ROC: 0.7593647
  • Results: Accuracy = 0.9013383, AUC-ROC = 0.7593647, identical to the baseline.
  • Conclusion: Pruning and cross-validation did not improve performance, suggesting the baseline model was not significantly overfitting, or the pruning parameter (cp = 0.01) was too lenient to reduce variance.
  • Recommendation: Experiment with more aggressive pruning (e.g., higher cp values) or constrain tree depth to better control model complexity.

The identical performance indicates that the baseline Decision Tree was already performing well on this dataset, and the pruning and cross-validation did not yield a measurable improvement in generalization. This suggests that the model’s complexity in the baseline was appropriate for the data, and overfitting may not be a significant issue here. However, this also highlights a limitation in the experiment design: the variation (pruning with cp = 0.01) might have been too subtle to impact performance. A more aggressive pruning strategy or a different hyperparameter (e.g., max depth) might have produced a more noticeable difference.
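A sketch of the more aggressive pruning suggested above, letting caret search a grid of cp values rather than fixing cp = 0.01 (the grid bounds are illustrative assumptions, not values from the original experiment):

# Sketch: cross-validated search over a range of cp values (grid is illustrative)
cp_grid <- data.frame(cp = seq(0.001, 0.05, by = 0.005))
dt_model_cp <- train(target ~ ., data = train_data, method = "rpart",
                     trControl = dt_control, tuneGrid = cp_grid)
dt_model_cp$bestTune   # cp with the best cross-validated accuracy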

Random Forest

Experiment 3: Standard Random Forest with All Features

  • Objective: Establish a strong baseline with ensemble learning and confirm key predictors.
  • Variation: Near-default parameters with ntree = 100, all features.
  • Metric: Accuracy, AUC-ROC.
  • Execution:
rf_model1 <- randomForest(target ~ ., data = train_data, ntree = 100, importance = TRUE)
varImpPlot(rf_model1, main = "Random Forest - Feature Importance")

rf_pred1 <- predict(rf_model1, test_data)
rf_cm1 <- confusionMatrix(rf_pred1, test_data$target)
rf_prob1 <- predict(rf_model1, test_data, type = "prob")[, 2]
rf_roc1 <- roc(test_data$target, rf_prob1)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Document results
cat("Accuracy:", rf_cm1$overall["Accuracy"], "\n")
## Accuracy: 0.9070899
cat("AUC-ROC:", auc(rf_roc1), "\n")
## AUC-ROC: 0.9316094
  • Results: Accuracy = 0.9070899, AUC-ROC = 0.9316094. Feature importance highlights duration as the top predictor, followed by month, day_of_week, poutcome, and balance.
  • Conclusion: Random Forest outperforms Decision Trees in both accuracy and AUC-ROC, confirming the advantage of ensemble methods. The high AUC-ROC indicates excellent discriminative ability, particularly useful for the imbalanced dataset.
  • Recommendation: For data science, tune hyperparameters (e.g., ntree, mtry) to potentially improve performance further, as sketched below. For business, focus marketing on clients with longer call durations and target campaigns during impactful months or days.
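One way to act on the tuning recommendation is randomForest::tuneRF, which searches mtry by out-of-bag error. A hedged sketch (ntreeTry, stepFactor, and improve are illustrative settings):

# Sketch: search for a better mtry via OOB error (parameter values are illustrative)
rf_tune <- tuneRF(x = train_data %>% select(-target), y = train_data$target,
                  ntreeTry = 100, stepFactor = 1.5, improve = 0.01, trace = FALSE)
rf_tune   # OOB error by mtry; pick the value with the lowest error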

Experiment 4: Random Forest with Feature Selection and Tuning

  • Objective: Assess whether restricting the model to the top 5 features improves efficiency while preserving performance.
  • Variation: Use top features (duration, month, previous, campaign, poutcome), non-trivial reduction.
  • Metric: Accuracy, AUC-ROC.
  • Execution:
top_features <- c("duration", "month", "previous", "campaign", "poutcome", "target")
train_subset <- train_data %>% select(all_of(top_features))
test_subset <- test_data %>% select(all_of(top_features))
rf_model2 <- randomForest(target ~ ., data = train_subset, ntree = 100, 
                          mtry = 2, importance = TRUE)  # mtry = sqrt(5) ~ 2
rf_pred2 <- predict(rf_model2, test_subset)
rf_cm2 <- confusionMatrix(rf_pred2, test_subset$target)
rf_prob2 <- predict(rf_model2, test_subset, type = "prob")[, 2]
rf_roc2 <- roc(test_subset$target, rf_prob2)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Document results
cat("Accuracy:", rf_cm2$overall["Accuracy"], "\n")
## Accuracy: 0.9022232
cat("AUC-ROC:", auc(rf_roc2), "\n")
## AUC-ROC: 0.8494517
  • Results: Accuracy = 0.9022232, AUC-ROC = 0.8494517.
  • Conclusion: Feature selection slightly reduces accuracy and substantially lowers AUC-ROC compared to the Standard Random Forest (Experiment 3: Accuracy = 0.9070899, AUC-ROC = 0.9316094). While the top 5 features are important, the remaining features clearly contribute to the model's ability to distinguish between classes.
  • Recommendation: Use the full feature set for maximum performance, but consider feature selection if computational efficiency is a priority (see the sketch below for deriving the top features programmatically).
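The top-feature list above was read off Experiment 3's importance plot; the same selection can be made programmatically from the fitted model. A sketch against rf_model1:

# Sketch: rank features by MeanDecreaseGini from Experiment 3's forest
gini <- importance(rf_model1, type = 2)
head(gini[order(gini[, 1], decreasing = TRUE), , drop = FALSE], 5)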

XGBoost (Replacing AdaBoost)

Explanation of Switch

AdaBoost was initially selected to explore boosting's ability to focus on hard-to-classify cases. However, running AdaBoost on my machine was computationally prohibitive, likely due to the sequential nature of the algorithm and the size of the dataset (45,211 rows). To address this, I replaced AdaBoost with XGBoost, a far more efficient gradient boosting implementation. A sketch of the abandoned call appears below.
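For reference, the abandoned AdaBoost fit would look roughly like the sketch below (mfinal, the number of boosting iterations, is an assumed value; the call is shown commented out because it was too slow to complete locally):

# Sketch (not run): the adabag AdaBoost fit that proved infeasible on 45,211 rows
# ada_model <- boosting(target ~ ., data = train_data, mfinal = 50)
# ada_pred  <- predict(ada_model, newdata = test_data)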

Experiment 5: Basic XGBoost with Default Parameters

  • Objective: Establish a baseline for XGBoost and assess its performance on the dataset.

  • Variation: Use default parameters with a moderate number of boosting rounds (nrounds = 100).

  • Metric: Accuracy, AUC-ROC.

  • Execution:

# Prepare DMatrix objects for XGBoost, using the numeric labels created during preprocessing
dtrain <- xgb.DMatrix(data = train_data_xgb, label = train_labels_numeric)
dtest <- xgb.DMatrix(data = test_data_xgb, label = test_labels_numeric)

# Train basic XGBoost model
xgb_model1 <- xgboost(data = dtrain, nrounds = 100, objective = "binary:logistic", verbose = 0)
xgb_pred1 <- predict(xgb_model1, dtest)
xgb_pred1_class <- ifelse(xgb_pred1 > 0.5, 1, 0)
xgb_cm1 <- confusionMatrix(as.factor(xgb_pred1_class), as.factor(test_labels_numeric))
xgb_roc1 <- roc(test_labels_numeric, xgb_pred1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("Accuracy:", xgb_cm1$overall["Accuracy"], "\n")
## Accuracy: 0.9058732
cat("AUC-ROC:", auc(xgb_roc1), "\n")
## AUC-ROC: 0.9357653
  • Results: Accuracy = 0.9058732, AUC-ROC = 0.9357653.
  • Conclusion: The basic XGBoost model performs strongly: its accuracy is marginally below the Standard Random Forest's (0.9058732 vs. 0.9070899), but its AUC-ROC is higher (0.9357653 vs. 0.9316094), and it significantly surpasses Decision Trees (AUC-ROC: 0.7593647). It is a strong baseline for gradient boosting.
  • Recommendation: Explore hyperparameter tuning to further improve performance, as done in Experiment 6.

Experiment 6: XGBoost with Hyperparameter Tuning

  • Objective: Test if tuning key parameters (learning rate, max depth) improves performance.

  • Variation: Use a grid search to tune eta (learning rate) and max_depth, with 5-fold cross-validation, while setting reasonable defaults for other parameters.

  • Metric: Accuracy, AUC-ROC.

  • Execution:

# Define parameter grid with all required parameters
param_grid <- expand.grid(
  nrounds = 100,              # Number of boosting rounds (fixed)
  eta = c(0.01, 0.1, 0.3),   # Learning rate (tuned)
  max_depth = c(3, 6, 9),     # Maximum tree depth (tuned)
  gamma = 0,                  # Minimum loss reduction (fixed, default)
  colsample_bytree = 0.8,     # Fraction of features to sample per tree (fixed)
  min_child_weight = 1,       # Minimum sum of instance weight in a child (fixed, default)
  subsample = 0.8             # Fraction of data to sample per round (fixed)
)

# Cross-validation setup
xgb_control <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = FALSE,
  summaryFunction = twoClassSummary,  # For AUC-ROC
  classProbs = TRUE                   # Enable probability predictions
)

# Train XGBoost with tuning
xgb_model2 <- train(
  x = train_data_xgb,
  y = train_labels,  # Use factor labels with valid R variable names
  method = "xgbTree",
  trControl = xgb_control,
  tuneGrid = param_grid,
  metric = "ROC",  # Optimize for AUC-ROC
  verbosity = 0
)

# Predict and evaluate
xgb_pred2 <- predict(xgb_model2, test_data_xgb, type = "raw")
xgb_cm2 <- confusionMatrix(as.factor(xgb_pred2), as.factor(test_labels))
xgb_prob2 <- predict(xgb_model2, test_data_xgb, type = "prob")[, "Yes"]
xgb_roc2 <- roc(as.numeric(test_labels) - 1, xgb_prob2)  # Convert back to 0/1 for ROC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("Accuracy:", xgb_cm2$overall["Accuracy"], "\n")
## Accuracy: 0.9099657
cat("AUC-ROC:", auc(xgb_roc2), "\n")
## AUC-ROC: 0.937874
  • Results: Accuracy = 0.9099657, AUC-ROC = 0.937874.
  • Conclusion: Tuning eta and max_depth improves both accuracy and AUC-ROC over the basic XGBoost model (Experiment 5: Accuracy = 0.9058732, AUC-ROC = 0.9357653). This model achieves the highest performance across all experiments, making it the best candidate for deployment.
  • Recommendation: Consider tuning additional XGBoost parameters (e.g., subsample, colsample_bytree), as sketched below, to potentially improve performance further. For business, deploy this model, focusing on key predictors like duration to target clients.
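A sketch of the wider search suggested above, extending the grid to subsample and colsample_bytree (the candidate values are illustrative; the grid grows multiplicatively, so expect much longer run times):

# Sketch: a wider (and slower) grid that also varies subsample and colsample_bytree
param_grid_wide <- expand.grid(
  nrounds = 100,
  eta = c(0.05, 0.1),
  max_depth = c(4, 6),
  gamma = 0,
  colsample_bytree = c(0.6, 0.8, 1.0),
  min_child_weight = 1,
  subsample = c(0.6, 0.8, 1.0)
)
# 2 x 2 x 3 x 3 = 36 combinations x 5 folds = 180 model fits with xgb_control above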

Essay

Experiment Selection Rationale

I conducted six experiments to systematically explore model complexity, feature importance, and ensemble techniques using the Bank Marketing dataset. The Decision Tree experiments (Experiments 1 and 2) tested interpretability versus generalization by comparing a baseline model to one with pruning and 5-fold cross-validation. Random Forest experiments (Experiments 3 and 4) assessed the impact of ensemble learning and feature selection, using all features versus a subset of the top 5 features (duration, month, previous, campaign, poutcome). Initially, AdaBoost was selected to explore boosting, but due to its computational intensity on my machine (likely due to the large dataset and sequential nature of AdaBoost), I replaced it with XGBoost. XGBoost experiments (Experiments 5 and 6) evaluated gradient boosting with a focus on efficiency and performance, comparing a basic model to one with tuned hyperparameters (eta and max_depth). These variations were chosen to understand how model complexity, feature selection, and boosting strategies affect performance on an imbalanced dataset.

Bias & Variance Comparison

  • Decision Tree Experiments: The baseline Decision Tree (Experiment 1) achieved an accuracy of 0.9013383 and AUC-ROC of 0.7593647, indicating reasonable performance but potentially high variance due to the lack of pruning. The pruned tree with 5-fold cross-validation (Experiment 2) yielded identical metrics (Accuracy: 0.9013383, AUC-ROC: 0.7593647), suggesting that the baseline model was not significantly overfitting, or the pruning (cp = 0.01) was too lenient to reduce variance. Both models exhibit similar bias, as their predictive power (AUC-ROC) is the same, but the expected reduction in variance from pruning did not occur.

  • Random Forest vs. DT vs. XGBoost: The Standard Random Forest (Experiment 3) achieved a higher accuracy (0.9070899) and AUC-ROC (0.9316094) than the Decision Trees, confirming that Random Forest reduces variance by averaging predictions across multiple trees, leading to better generalization. The Random Forest with feature selection (Experiment 4) saw a drop in performance (Accuracy: 0.9022232, AUC-ROC: 0.8494517), indicating that while variance remained low, bias increased due to the loss of predictive information from excluded features. XGBoost's basic model (Experiment 5) performed strongly (Accuracy: 0.9058732, AUC-ROC: 0.9357653), reducing bias compared to Decision Trees by focusing on errors through gradient boosting, though its sequential nature may introduce higher variance than Random Forest. The tuned XGBoost model (Experiment 6) achieved the best performance (Accuracy: 0.9099657, AUC-ROC: 0.937874), effectively balancing bias and variance through hyperparameter optimization.

  • Across Experiments: Decision Trees exhibited moderate variance and higher bias (lower AUC-ROC), while Random Forest significantly reduced variance, improving AUC-ROC. XGBoost further reduced bias, especially after tuning, achieving the best balance of bias and variance among all models. The feature selection in Experiment 4 increased bias by limiting the model’s ability to capture complex patterns, while XGBoost’s tuning in Experiment 6 minimized bias while keeping variance in check.

Results Table

Experiment             Accuracy    AUC-ROC    Notes
DT Baseline            0.9013383   0.7593647  Decent performance, potential high variance
DT Pruned CV           0.9013383   0.7593647  No improvement, pruning ineffective
RF Standard            0.9070899   0.9316094  Strong ensemble baseline, excellent AUC-ROC
RF Feature Selection   0.9022232   0.8494517  Slight accuracy drop, lower AUC-ROC, more efficient
XGBoost Basic          0.9058732   0.9357653  Strong baseline, competitive with RF
XGBoost Tuned          0.9099657   0.937874   Best performance, optimal model

Conclusion

The tuned XGBoost model (Experiment 6) is the optimal model, achieving the highest accuracy (0.9099657) and AUC-ROC (0.937874) across all experiments. This performance surpasses the Standard Random Forest (Experiment 3: Accuracy = 0.9070899, AUC-ROC = 0.9316094) and significantly outperforms the Decision Trees (AUC-ROC: 0.7593647), justifying its selection as the best model. The high AUC-ROC is particularly important for the imbalanced Bank Marketing dataset, where distinguishing between classes (yes and no subscriptions) is critical. The improvement from the basic XGBoost model (Experiment 5: AUC-ROC = 0.9357653) to the tuned model demonstrates the value of hyperparameter optimization, though the AUC-ROC gain is modest.

Future work could tune additional XGBoost parameters (e.g., subsample, colsample_bytree) to potentially improve performance further. Additionally, exploring feature importance in XGBoost, similar to what was done with Random Forest, would identify key predictors and inform feature engineering: understanding which features XGBoost prioritizes could guide the creation of interaction terms or new derived features. A sketch follows below.
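A sketch of that importance inspection, using the basic model from Experiment 5 (note the features are the one-hot encoded columns, so individual factor levels appear in the ranking):

# Sketch: top XGBoost features by Gain, from the Experiment 5 model
imp <- xgb.importance(model = xgb_model1)
head(imp, 10)                     # encoded features ranked by Gain
xgb.plot.importance(imp[1:10, ])  # bar chart of the top 10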

Business Recommendation and Insights: Deploy the tuned XGBoost model for predicting term deposit subscriptions. Focus marketing efforts on clients with longer call durations (duration was the top predictor in Random Forest), schedule campaigns during impactful months or days (based on month and day_of_week), and target clients with a successful previous outcome (poutcome) to maximize subscription rates. This strategy leverages the model's insights to optimize resource allocation and improve campaign effectiveness.