This assignment aims to perform machine learning experiments using UCI bank marketing datasets to identify the optimal algorithm and hyperparameters to predict whether a client will subscribe to a term deposit. Exploratory data analysis was performed previously (see Assignment 1).

0. Data input and cleaning

For this assignment, I focus on the bank_additional_full dataset.

url <- 'https://raw.githubusercontent.com/alexandersimon1/Data622/refs/heads/main/Assignment1/bank-additional-full.csv'
bank_additional_full_df <- read_delim(url, delim = ';', show_col_types = FALSE)

# Rename columns to make them more descriptive
bank_additional_full_df <- bank_additional_full_df %>%
  select(age,
         job_type = job, 
         marital_status = marital, 
         highest_education = education,
         credit_is_defaulted = default,
         has_housing_loan = housing,
         has_personal_loan = loan,
         communication_type = contact,
         last_contact_dow = day_of_week,
         last_contact_month = month,
         last_contact_duration_sec = duration,
         campaign_contacts = campaign,
         days_since_last_contact = pdays,
         previous_contacts = previous,
         previous_contact_outcome = poutcome,
         employee_variation_rate = emp.var.rate,
         consumer_price_index = cons.price.idx,
         consumer_confidence_index = cons.conf.idx,
         euribor_rate_3m = euribor3m,
         n_employees = nr.employed,
         has_term_deposit = y)

# Coerce categorical variables to factors
categorical_variables <- c('job_type', 'marital_status', 'highest_education',
                           'communication_type', 'previous_contact_outcome',
                           'last_contact_dow', 'last_contact_month',
                           'credit_is_defaulted', 'has_housing_loan',
                           'has_personal_loan', 'has_term_deposit')

bank_additional_full_df <- bank_additional_full_df %>%
  mutate_at(categorical_variables, factor)

1. Pre-processing

1.1. Remove duplicates

In Assignment 1, I showed that the bank_additional_full dataset had duplicate rows. Here, I remove them.

bank_additional_full_df <- bank_additional_full_df %>%
  distinct()

1.2. Data manipulation

1.2.1. Highly correlated variables

In Assignment 1, I showed that the 3-month Euribor rate is nearly perfectly linearly correlated with employment variation rate (r=0.97) and number of employees (r=0.95), and the latter two variables are also strongly correlated with each other (r=0.91). Because of this redundancy, I dropped employment variation rate and number of employees from the dataset and kept the 3-month Euribor rate.

Incidentally, removal of number of employees also circumvents a separate issue with the variable having some non-integer values.

bank_additional_full_df2 <- bank_additional_full_df %>%
  select(-c(employee_variation_rate, n_employees))

1.2.2. Target leakage

To avoid target leakage, I also excluded last contact duration from the dataset.

bank_additional_full_df2 <- bank_additional_full_df2 %>%
  select(-c(last_contact_duration_sec))

1.2.3. Days since last contact

A placeholder value is used for an unknown number of days since last contact. I wasn’t sure whether any special handling was needed for this; however, it didn’t make sense to delete these rows since other valuable client data would be omitted from the model, so I left the values as is.
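One option, sketched here but not applied in this analysis, is to keep the column unchanged and add an indicator for the placeholder so that a model can separate "no previous contact" from real day counts. The placeholder value 999 is an assumption taken from the UCI data dictionary, and `toy_df` and `was_previously_contacted` are hypothetical names used only for illustration.

```r
# Toy data; 999 is assumed to be the placeholder for "not previously contacted"
placeholder <- 999
toy_df <- data.frame(days_since_last_contact = c(3, 999, 6, 999, 12))

# Flag rows that contain a real day count rather than the placeholder
toy_df$was_previously_contacted <- as.integer(
  toy_df$days_since_last_contact != placeholder
)
toy_df
```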

2. Performance evaluation

2.1. Datasets

2.1.1. Class imbalance

In Assignment 1, I showed that the target feature (has_term_deposit) is imbalanced (approximately 88% of clients do not have a term deposit). To reduce this class imbalance in the training dataset, I used Synthetic Minority Oversampling Technique (SMOTE). However, because SMOTE does not handle categorical variables, I first one-hot encoded these variables.

# One-hot encode all variables except target variable (has_term_deposit)
bank_additional_full_dgCMatrix <- sparse.model.matrix(~ . -has_term_deposit,
                                           data = bank_additional_full_df2)

# Remove first column of sparse matrix, which is the intercept column added by the model formula
bank_additional_full_dgCMatrix <- bank_additional_full_dgCMatrix[ , -1]

# Convert sparse matrix to dataframe
bank_additional_full_df3 <- as.data.frame(as.matrix(bank_additional_full_dgCMatrix))

# Rename variables to syntactically valid names to avoid issues with subsequent functions
colnames(bank_additional_full_df3) <- make.names(names(bank_additional_full_df3))

# Replace target variable and encode as numeric values
bank_additional_full_df3 <-
  cbind(bank_additional_full_df3,
        has_term_deposit = bank_additional_full_df2$has_term_deposit) %>%
  mutate(
    has_term_deposit = ifelse(has_term_deposit == 'yes', 1, 0)
  )
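To illustrate what the encoding step produces, the toy example below uses base R's model.matrix(), which builds the same kind of design matrix in dense form. The data frame `toy_df` is made up for illustration. With the default treatment contrasts, a factor with k levels becomes k - 1 indicator columns, and the first column of the matrix is the intercept.

```r
toy_df <- data.frame(
  age = c(30, 45, 52),
  marital_status = factor(c('married', 'single', 'divorced'))
)

# Dense design matrix: intercept + numeric column + (k - 1) indicator columns
toy_matrix <- model.matrix(~ ., data = toy_df)
colnames(toy_matrix)

# Drop the intercept column, as done for the sparse matrix above
toy_encoded <- toy_matrix[, -1]
colnames(toy_encoded)
```

The reference level ('divorced', the first level alphabetically) has no column of its own; it is represented by zeros in the other indicator columns.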

Then I created training and test datasets using a 70:30 split. I used the caret::createDataPartition() function, which uses stratified random sampling to minimize the possibility that the minority class is omitted from a dataset.¹

train_idx <- createDataPartition(bank_additional_full_df3$has_term_deposit, p = 0.7, list = FALSE)
train_imbalanced <- bank_additional_full_df3[train_idx, ]
test <- bank_additional_full_df3[-train_idx, ]
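The idea behind stratified sampling can be sketched in base R with made-up data: the same fraction of rows is sampled within each class, so the class proportions carry over to both subsets. This illustrates the concept only and is not a reproduction of how caret implements the function; `toy_y` and `stratified_idx` are hypothetical names.

```r
set.seed(622)

# Toy binary target with roughly the imbalance described above (88% / 12%)
toy_y <- c(rep(0, 880), rep(1, 120))

# Sample 70% of the row indices within each class
stratified_idx <- unlist(lapply(split(seq_along(toy_y), toy_y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

# Class proportions in the two subsets
prop.table(table(toy_y[stratified_idx]))
prop.table(table(toy_y[-stratified_idx]))
```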

Finally, I applied SMOTE to the imbalanced training dataset.

# I used the default value of K=5 for the number of nearest neighbors during sampling
# Select the predictors by name (negating the 0/1 target values would be
# interpreted as column indices rather than excluding the target column)
predictor_cols <- setdiff(names(train_imbalanced), 'has_term_deposit')
train_smote <- SMOTE(train_imbalanced[, predictor_cols],
                     train_imbalanced$has_term_deposit)
train_smote <- train_smote$data  # extract the balanced dataset

# Rebuild the target as a categorical variable from the 'class' column
# that SMOTE() appends, then remove 'class'
train_smote <- train_smote %>%
  mutate(
    has_term_deposit = as.factor(ifelse(class == 1, 'yes', 'no'))
  ) %>%
  select(-c(class))

The classes of has_term_deposit in the SMOTE training dataset are now approximately balanced (52.2% no, 47.8% yes).

fct_proportions(train_smote$has_term_deposit)

##   variable  Freq
## 1       no 52.16
## 2      yes 47.84
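The core interpolation step of SMOTE can be sketched as follows: a synthetic point is placed at a random position on the line segment between a minority-class observation and one of its nearest minority-class neighbors. This is a simplified illustration with made-up data, not the implementation behind the SMOTE() call above; `minority`, `synthesize_one`, and the use of a single nearest neighbor are assumptions made for the example.

```r
set.seed(622)

# Toy minority-class observations (rows) with two numeric features
minority <- rbind(c(1.0, 2.0),
                  c(1.5, 2.5),
                  c(4.0, 0.5))

# Create one synthetic point from row i and its nearest minority neighbor
synthesize_one <- function(x, i) {
  dists <- sqrt(rowSums((x - matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE))^2))
  dists[i] <- Inf                     # exclude the point itself
  neighbor <- x[which.min(dists), ]
  gap <- runif(1)                     # random position along the segment
  x[i, ] + gap * (neighbor - x[i, ])
}

synthetic_point <- synthesize_one(minority, 1)
synthetic_point
```

Because the new point is interpolated, one-hot encoded columns of synthetic rows can take fractional values between 0 and 1 whenever the two endpoints differ in that column.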

2.1.2. Cross-validation folds

To enable statistical comparison of model performance with cross-validation, I created 10 folds from the balanced training dataset.

folds <- createFolds(train_smote$has_term_deposit, k = 10)

These folds, which currently only contain row numbers of the training dataframe that comprise each fold, will be used during model development (section 3. Experiments).

2.2. Performance metrics

For the classification problem (ie, whether a client will subscribe to a term deposit), a positive prediction means that the bank should commit marketing resources to the client, and a negative prediction means that no action is needed. False positives (ie, marketing to clients who will not subscribe) waste time and money and reduce the return on investment (ROI) of a marketing campaign. On the other hand, false negatives (ie, failure to market to clients who would have subscribed) are missed opportunities to profit from term deposits.

No data are available in the assignment to quantitatively assess whether false positives or false negatives are more costly; however, for clients with large/long term deposits, false negatives would be expected to be more costly. This suggests that recall is a key metric for this classification problem. More specifically, the best model is one that maximizes recall to minimize false negatives.

For completeness, I also evaluate accuracy, precision, and model training time. The threshold for statistical significance was \(\alpha = 0.05\).

2.3. Model selection

Where possible, I will calculate the mean, standard deviation, and 95% confidence interval of each performance metric using 10-fold cross-validation. Comparison of these statistics will help determine whether there is a statistically significant difference in performance between models and which model is the best for predicting term deposit subscriptions.

3. Experiments

3.1. Decision trees

3.1.0. Experiment DT0

Objective: Establish baseline performance for a decision tree model of bank term deposits

Variation: None (ie, use all features, no pruning, and no cross-validation)

Evaluation metrics: Accuracy, recall, precision

Perform the experiment:

Algorithm: rpart()

Parameters:

Decision tree control: complexity parameter cp = 0 to prevent pruning (ie, grow the full tree)

# Train the model
# Reference: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

dt0_runtime <- system.time(
  dt0_model <- rpart(has_term_deposit ~ ., data = train_smote, method = 'class',
                     control = rpart.control(cp = 0))
)
dt0_runtime <- dt0_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', dt0_runtime)

## [1] "Runtime: 4.5 seconds"

This is a large tree with 913 splits, and its complexity parameter table has 98 rows (reported as levels in the output below). Due to this complexity, I did not plot the tree.

dt0_levels <- nrow(dt0_model$cptable)
dt0_splits <- dt0_model[['cptable']][dt0_levels, 'nsplit']
sprintf('Decision tree 1 has %d levels and %d splits', dt0_levels, dt0_splits)

## [1] "Decision tree 1 has 98 levels and 913 splits"

The variable importance plot shows that the three most important variables in the model are economic indicators, namely, euribor_rate_3m, consumer_price_index, and consumer_confidence_index. These are followed by contact details (eg, previous_contacts, communication_type = telephone). The least important variables are month of last contact and variables with unknowns.

# Extract variable importance from decision tree model
# Rpart uses Gini index to split nodes
# Reference: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
#
dt0_vars_df <- as.data.frame(dt0_model$variable.importance)
colnames(dt0_vars_df) <- c('Gini_index')

# Calculate scaled scores (easier to interpret)
total_scores <- sum(dt0_vars_df$Gini_index)
dt0_vars_df <- dt0_vars_df %>%
  mutate(
    scaled_score = round(100 * (Gini_index / total_scores), 2)
  )

# Visualize the relative importance
ggplot(dt0_vars_df, aes(x = reorder(row.names(dt0_vars_df), scaled_score), y = scaled_score)) +
  geom_bar(stat = 'identity', fill = 'steelblue') + 
  coord_flip() +
  labs(x = 'Variable', y = 'Importance (scaled Gini index)',
       title = 'Variable Importance in Full Decision Tree') +
  theme_classic() +
  theme(
    axis.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold")
  )

Use the model to make class predictions

dt0_predictions <- predict(dt0_model, newdata = test, type = 'class')

Confusion matrix

(dt0_confusion_matrix <- table(Actual = test$has_term_deposit, Predicted = dt0_predictions))

##       Predicted
## Actual    no   yes
##      0 10163   889
##      1   790   510

The accuracy, recall, and precision of the model are shown below.

dt0_performance <- unlist(performance_metrics(dt0_confusion_matrix))
dt0_accuracy <- dt0_performance[1]
dt0_recall <- dt0_performance[2]
dt0_precision <- dt0_performance[3]
sprintf('Accuracy: %.3f, Recall: %.3f, Precision: %.3f', dt0_accuracy, dt0_recall, dt0_precision)

## [1] "Accuracy: 0.864, Recall: 0.365, Precision: 0.392"

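Conclusion: The baseline decision tree model has moderately high accuracy but low recall and precision. With actual classes in the rows of the confusion matrix, recall is 510/1300 = 0.392 and precision is 510/1399 = 0.365, so the two labels in the printed output appear to be transposed; if performance_metrics() reads the table by position, the recall and precision values in the cross-validation experiments would be transposed in the same way. However, these metrics may not be reliable estimates of its performance since cross-validation was not performed. In addition, the size and complexity of the decision tree make it unwieldy to explain and visualize.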
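The three metrics can be recomputed directly from the DT0 confusion matrix, which has actual classes in rows and predicted classes in columns. The helper `metrics_from_cm` below is a hypothetical stand-in written for this check; it is not the performance_metrics() function used in this report, whose definition is not shown in this section.

```r
# Confusion matrix from experiment DT0 (rows = actual, columns = predicted)
cm <- matrix(c(10163, 889,
                 790, 510),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c('0', '1'), Predicted = c('no', 'yes')))

metrics_from_cm <- function(cm) {
  tp <- cm[2, 2]; fn <- cm[2, 1]
  fp <- cm[1, 2]; tn <- cm[1, 1]
  c(accuracy  = (tp + tn) / sum(cm),
    recall    = tp / (tp + fn),   # share of actual subscribers that were found
    precision = tp / (tp + fp))   # share of predicted subscribers that were correct
}

round(metrics_from_cm(cm), 3)
# accuracy 0.864, recall 0.392, precision 0.365
```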

Document results:

experiment_tracker_df <- data.frame(
  model = 'Decision tree',
  variation = 'No pruning, no CV',
  runtime_sec = round(dt0_runtime, 1),
  accuracy_95CI = round(dt0_accuracy, 3),  
  recall_95CI = round(dt0_recall, 3), 
  precision_95CI = round(dt0_precision, 3),
  notes = 'Baseline. Fast. Large tree.'
)

3.1.1. Experiment DT1

Objective: Estimate baseline performance for a decision tree model using cross-validation

Variation: 10-fold cross-validation

Evaluation metrics: Accuracy, recall, precision

Perform the experiment:

Algorithm: rpart()

Parameters:

Decision tree control: complexity parameter cp = 0 to prevent pruning (ie, grow the full tree)

dt1_runtime <- system.time(
  experiment1_results <- lapply(folds, function(x) {
    # Training and test datasets
    exp_train <- train_smote[-x, ]
    exp_test <- train_smote[x, ]
    # Use training data to create decision tree classification model
    exp_model <- rpart(has_term_deposit ~ ., data = exp_train, method = 'class',
                       control = rpart.control(cp = 0))
    # Make predictions on test data
    exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
    # Actual (true) data for comparison
    exp_actual <- exp_test$has_term_deposit
    # Create confusion matrix of actual vs predicted classes
    exp_confusion_matrix <- table(exp_actual, exp_predictions)
    # Calculate performance metrics from confusion matrix
    exp_performance <- unlist(performance_metrics(exp_confusion_matrix))
    exp_accuracy <-  exp_performance[1]
    exp_recall <- exp_performance[2]
    exp_precision <- exp_performance[3]
    return(list(exp_accuracy, exp_recall, exp_precision))
  }))

dt1_runtime <- dt1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', dt1_runtime)

## [1] "Runtime: 40.5 seconds"

# Convert results (list of lists) to dataframe
experiment1_results_df <- as.data.frame(t(data.frame(lapply(experiment1_results, unlist))))
colnames(experiment1_results_df) <- c('Accuracy', 'Recall', 'Precision')
experiment1_results_df

##         Accuracy    Recall Precision
## Fold01 0.8876151 0.9041591 0.8557980
## Fold02 0.8892755 0.9012511 0.8630723
## Fold03 0.8901166 0.8993348 0.8674080
## Fold04 0.8958461 0.9069871 0.8716852
## Fold05 0.8829069 0.8917000 0.8596491
## Fold06 0.8859775 0.9063927 0.8493795
## Fold07 0.8911175 0.9024064 0.8661249
## Fold08 0.8956201 0.9058196 0.8724861
## Fold09 0.8894575 0.8988016 0.8664955
## Fold10 0.8892755 0.9041404 0.8596491

The 95% confidence intervals for accuracy, recall, and precision are shown below. Recall, in particular, is high (~90%).

# Accuracy
dt1_accuracy_CI <- unlist(calc_ci95(mean(experiment1_results_df$Accuracy), 
                                      sd(experiment1_results_df$Accuracy), 
                                      nrow(experiment1_results_df)))
sprintf('Accuracy 95%% CI: (%.3f, %.3f)', dt1_accuracy_CI[1], dt1_accuracy_CI[2])

## [1] "Accuracy 95% CI: (0.887, 0.892)"

# Recall
dt1_recall_CI <- unlist(calc_ci95(mean(experiment1_results_df$Recall), 
                                      sd(experiment1_results_df$Recall), 
                                      nrow(experiment1_results_df)))
sprintf('Recall 95%% CI: (%.3f, %.3f)', dt1_recall_CI[1], dt1_recall_CI[2])

## [1] "Recall 95% CI: (0.899, 0.905)"

# Precision
dt1_precision_CI <- unlist(calc_ci95(mean(experiment1_results_df$Precision), 
                                      sd(experiment1_results_df$Precision), 
                                      nrow(experiment1_results_df)))
sprintf('Precision 95%% CI: (%.3f, %.3f)', dt1_precision_CI[1], dt1_precision_CI[2])

## [1] "Precision 95% CI: (0.859, 0.868)"
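The calc_ci95() helper is defined outside this section. As a minimal sketch of what such a helper might do, the function below computes a normal-approximation interval, mean ± 1.96 × sd / √n. The name `ci95_normal` is hypothetical, and the use of the normal rather than the t distribution is an assumption. Applied to the ten DT1 accuracy values above, it gives the same interval after rounding.

```r
# Fold accuracies from experiment DT1 (copied from the results table above)
dt1_fold_accuracy <- c(0.8876151, 0.8892755, 0.8901166, 0.8958461, 0.8829069,
                       0.8859775, 0.8911175, 0.8956201, 0.8894575, 0.8892755)

# Hypothetical helper: normal-approximation 95% confidence interval
ci95_normal <- function(m, s, n) {
  margin <- qnorm(0.975) * s / sqrt(n)
  c(lower = m - margin, upper = m + margin)
}

dt1_accuracy_check <- ci95_normal(mean(dt1_fold_accuracy),
                                  sd(dt1_fold_accuracy),
                                  length(dt1_fold_accuracy))
round(dt1_accuracy_check, 3)
# lower 0.887, upper 0.892
```

With only 10 folds, a t-based interval (qt(0.975, df = n - 1) in place of qnorm(0.975)) would be slightly wider.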

Conclusion: Using cross-validation shows that the baseline (unpruned) decision tree performs well (accuracy ~89%) and has balanced recall and precision (~90% and ~86%, respectively).

Document results:

experiment1_documentation <- c('Decision tree', '10-fold CV', 
                               runtime_sec = round(dt1_runtime, 1),
                               accuracy_95CI = paste0('(', round(dt1_accuracy_CI[1], 3), ', ', 
                                                           round(dt1_accuracy_CI[2], 3), ')'),  
                               recall_95CI = paste0('(', round(dt1_recall_CI[1], 3), ', ', 
                                                         round(dt1_recall_CI[2], 3), ')'),  
                               precision_95CI = paste0('(', round(dt1_precision_CI[1], 3), ', ', 
                                                            round(dt1_precision_CI[2], 3), ')'), 
                               'Good performance, high recall.')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment1_documentation)

3.1.2. Experiment DT2

Objective: Determine whether a simpler decision tree model increases recall by \(\ge 5\%\).

Variation: Reduced tree complexity (ie, tree breadth and depth), implemented via cp parameter

This is expected to increase bias in the model but reduce variance²

Evaluation metrics: Accuracy, recall, precision

Perform the experiment:

Algorithm: rpart()

Parameters:

Decision tree control: optimized complexity parameter (explained below)

I determined the optimal tree complexity by plotting the relative error versus the complexity parameter (CP) of the full decision tree. The plot shows that the first split provides the greatest information gain (ie, reduction in relative error). In addition, very little information is gained for \(CP < 0.0066\) (ie, relative_error decreases only slightly once it falls below 0.5).

dt0_cp_df <- as.data.frame(dt0_model$cptable) %>%
  rename(relative_error = `rel error`)

ggplot(filter(dt0_cp_df, CP > 0), aes(x = CP, y = relative_error)) +
  geom_point() +
  geom_hline(yintercept = 0.5, linetype = 'dashed', color = 'steelblue') +
  annotate('segment', x = 0.4, y = 0.99, xend = 0.04, yend = 0.6,
           arrow = arrow(ends = 'last'), color = 'steelblue') + 
  annotate('text', x = 0.25, y = 0.75, label = 'First split', color = 'steelblue') +  
  xlim(0, 0.5) + ylim(0, 1) +
  guides(x = guide_axis(cap = "both"), y = guide_axis(cap = "both")) +
  labs(x = 'Complexity parameter (CP)', y = 'Relative error') +
  theme_classic() +
  theme(
    axis.title = element_text(face = 'bold')
  )

# Minimum value of complexity parameter for which relative_error > 0.5
(cp_optimal <- dt0_cp_df[which.min(dt0_cp_df$relative_error > 0.5) - 1, 'CP'])

## [1] 0.006610191

I used this threshold for cp to generate the pruned decision tree.

dt2_model <- rpart(has_term_deposit ~ ., data = train_smote, method = 'class',
                   control = rpart.control(cp = cp_optimal))

The resulting tree is much smaller than the full tree (3.1.0. Experiment DT0) and has 6 levels and 5 splits.

dt2_levels <- nrow(dt2_model$cptable)
dt2_splits <- dt2_model[['cptable']][dt2_levels, 'nsplit']
sprintf('Decision tree 1 has %d levels and %d splits', dt2_levels, dt2_splits)

## [1] "Decision tree 1 has 6 levels and 5 splits"
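For comparison, cost-complexity pruning can also be applied to an already fitted full tree with rpart's prune() function, with cp chosen from the cross-validated error (xerror) column of the complexity table. The sketch below uses the kyphosis data that ships with rpart. The selection rule (minimum xerror) is a common convention and an assumption here, not the rule used in this experiment.

```r
library(rpart)
set.seed(622)

# Grow the full tree (cp = 0) on a small built-in dataset
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = 'class', control = rpart.control(cp = 0))

# Choose the cp with the lowest cross-validated error in the complexity table
cp_table <- as.data.frame(full_tree$cptable)
cp_min_xerror <- cp_table$CP[which.min(cp_table$xerror)]

# Prune the fitted tree instead of refitting it
pruned_tree <- prune(full_tree, cp = cp_min_xerror)

# Compare tree sizes (number of nodes)
c(full = nrow(full_tree$frame), pruned = nrow(pruned_tree$frame))
```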

The decision tree looks like this:

fancyRpartPlot(dt2_model, main = 'Decision Tree 2', sub = '')

10-fold cross-validation

dt2_runtime <- system.time(
  experiment2_results <- lapply(folds, function(x) {
    # Training and test datasets
    exp_train <- train_smote[-x, ]
    exp_test <- train_smote[x, ]
    # Use training data to create decision tree classification model
    exp_model <- rpart(has_term_deposit ~ ., data = exp_train, method = 'class',
                       control = rpart.control(cp = cp_optimal))  # pruned tree
    # Make predictions on test data
    exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
    # Actual (true) data for comparison
    exp_actual <- exp_test$has_term_deposit
    # Create confusion matrix of actual vs predicted classes
    exp_confusion_matrix <- table(exp_actual, exp_predictions)
    # Calculate performance metrics from confusion matrix
    exp_performance <- unlist(performance_metrics(exp_confusion_matrix))  
    exp_accuracy <-  exp_performance[1]
    exp_recall <- exp_performance[2]
    exp_precision <- exp_performance[3]
    return(list(exp_accuracy, exp_recall, exp_precision))
  }))

dt2_runtime <- dt2_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', dt2_runtime)

## [1] "Runtime: 22.2 seconds"

# Convert results list of lists to dataframe
experiment2_results_df <- as.data.frame(t(data.frame(lapply(experiment2_results, unlist))))
colnames(experiment2_results_df) <- c('Accuracy', 'Recall', 'Precision')
experiment2_results_df

##         Accuracy    Recall Precision
## Fold01 0.7729785 0.8174767 0.6765083
## Fold02 0.7736390 0.8171046 0.6786478
## Fold03 0.7853489 0.8279898 0.6958939
## Fold04 0.7495396 0.8101336 0.6223268
## Fold05 0.7553736 0.8103261 0.6379974
## Fold06 0.7453429 0.8096317 0.6114677
## Fold07 0.7691363 0.8131470 0.6719418
## Fold08 0.7505117 0.8054645 0.6307231
## Fold09 0.7795292 0.8247423 0.6846384
## Fold10 0.7478510 0.8267297 0.5982028

The 95% confidence intervals for accuracy, recall, and precision are shown below. Recall, in particular, decreased to ~82%.

# Accuracy
dt2_accuracy_CI <- unlist(calc_ci95(mean(experiment2_results_df$Accuracy), 
                                      sd(experiment2_results_df$Accuracy), 
                                      nrow(experiment2_results_df)))
sprintf('Accuracy 95%% CI: (%.3f, %.3f)', dt2_accuracy_CI[1], dt2_accuracy_CI[2])

## [1] "Accuracy 95% CI: (0.754, 0.772)"

# Recall
dt2_recall_CI <- unlist(calc_ci95(mean(experiment2_results_df$Recall), 
                                      sd(experiment2_results_df$Recall), 
                                      nrow(experiment2_results_df)))
sprintf('Recall 95%% CI: (%.3f, %.3f)', dt2_recall_CI[1], dt2_recall_CI[2])

## [1] "Recall 95% CI: (0.811, 0.821)"

# Precision
dt2_precision_CI <- unlist(calc_ci95(mean(experiment2_results_df$Precision), 
                                      sd(experiment2_results_df$Precision), 
                                      nrow(experiment2_results_df)))
sprintf('Precision 95%% CI: (%.3f, %.3f)', dt2_precision_CI[1], dt2_precision_CI[2])

## [1] "Precision 95% CI: (0.629, 0.672)"

Conclusion: The objective was not met. All performance metrics of the pruned decision tree model were less than those of the full decision tree. Furthermore, since the confidence intervals do not overlap with those from 3.1.1. Experiment DT1, we can conclude that the pruned decision tree model performs significantly worse than the full decision tree model. However, the pruned decision tree model is easier to interpret and visualize.
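The non-overlap criterion can be checked programmatically from the reported intervals. Because both experiments used the same folds, a paired t-test on the per-fold values is another way to compare them; it is included here as an optional sketch and was not part of the original analysis. `intervals_overlap` is a hypothetical helper.

```r
# Reported 95% CIs for recall
dt1_recall_ci <- c(0.899, 0.905)  # full tree (DT1)
dt2_recall_ci <- c(0.811, 0.821)  # pruned tree (DT2)

# Two closed intervals overlap if each starts before the other ends
intervals_overlap <- function(a, b) a[1] <= b[2] && b[1] <= a[2]
intervals_overlap(dt1_recall_ci, dt2_recall_ci)  # FALSE

# Per-fold recall values copied from the two results tables
dt1_fold_recall <- c(0.9041591, 0.9012511, 0.8993348, 0.9069871, 0.8917000,
                     0.9063927, 0.9024064, 0.9058196, 0.8988016, 0.9041404)
dt2_fold_recall <- c(0.8174767, 0.8171046, 0.8279898, 0.8101336, 0.8103261,
                     0.8096317, 0.8131470, 0.8054645, 0.8247423, 0.8267297)

# Paired comparison across the same 10 folds
recall_p_value <- t.test(dt1_fold_recall, dt2_fold_recall, paired = TRUE)$p.value
recall_p_value < 0.05
```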

Document results:

experiment2_documentation <- c('Decision tree', 'cp=0.0066',
                               runtime_sec = round(dt2_runtime, 1),                               
                               accuracy_95CI = paste0('(', round(dt2_accuracy_CI[1], 3), ', ', 
                                                           round(dt2_accuracy_CI[2], 3), ')'),  
                               recall_95CI = paste0('(', round(dt2_recall_CI[1], 3), ', ', 
                                                         round(dt2_recall_CI[2], 3), ')'),  
                               precision_95CI = paste0('(', round(dt2_precision_CI[1], 3), ', ', 
                                                            round(dt2_precision_CI[2], 3), ')'), 
                               'Pruning decreased runtime but also performance')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment2_documentation)

3.2. Random forest

3.2.0. Experiment RF0

Objective: Establish baseline performance for a random forest model of bank term deposits

Variation: None (ie, use all features, default parameters)

Evaluation metric(s): Accuracy, recall, precision

Perform the experiment:

Algorithm: randomForest()

Parameters:

ntree (number of trees) = 500 (default value). This is close to the rule of thumb to start with 10 times the number of features (ie, 49 features in training dataset \(\times\) 10 = 490 trees).³
mtry (number of variables that are randomly sampled at each split) = 7 (default value). This is the square root of the number of features (ie, \(\sqrt{49} = 7\)).

rf0_runtime <- system.time(
  rf0_model <- randomForest(has_term_deposit ~ ., data = train_smote))

rf0_runtime <- rf0_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf0_runtime)

## [1] "Runtime: 70.2 seconds"

The variable importance plot shows three groups of variables ranked by importance. By far the most important feature in the model is euribor_rate_3m. Moderately important variables include has_housing_loanyes, consumer_confidence_index, communication_typetelephone, and campaign_contacts. Variables that are less important include highest level of education and job type.

The important variables in the random forest model with all features are similar to those in the full decision tree model. Of note, euribor_rate_3m is the most important feature in both models. However, some variables have greater importance in the random forest model than the decision tree model. For example, has_housing_loanyes is the 2nd most important in the random forest model but is the 7th most important in the decision tree model. This most likely reflects the ensemble nature of the random forest algorithm (ie, it evaluates many different trees with different features).

varImpPlot(rf0_model, main = 'Variable Importance in RF Model with All Features')

Use the model to make class predictions

rf0_predictions <- predict(rf0_model, newdata = test, type = 'class')

Confusion matrix

(rf0_confusion_matrix <- table(Actual = test$has_term_deposit, Predicted = rf0_predictions))

##       Predicted
## Actual    no   yes
##      0 10547   505
##      1   812   488

The accuracy, recall, and precision of the model are shown below.

rf0_performance <- unlist(performance_metrics(rf0_confusion_matrix))
rf0_accuracy <- rf0_performance[1]
rf0_recall <- rf0_performance[2]
rf0_precision <- rf0_performance[3]
sprintf('Accuracy: %.3f, Recall: %.3f, Precision: %.3f', rf0_accuracy, rf0_recall, rf0_precision)

## [1] "Accuracy: 0.893, Recall: 0.491, Precision: 0.375"

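Conclusion: The baseline random forest model has moderately high accuracy (better than the baseline decision tree model, which was 0.864) but low recall and precision. With actual classes in the rows of the confusion matrix, recall is 488/1300 = 0.375 and precision is 488/993 = 0.491, so the two labels in the printed output appear to be transposed. However, these metrics may not be reliable estimates of its performance since cross-validation was not performed.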

Document results:

experiment3_documentation <- c('Random forest', 'ntree = 500, mtry = 7, no CV',
                               runtime_sec = round(rf0_runtime, 1),
                               accuracy_95CI = round(rf0_accuracy, 3),  
                               recall_95CI = round(rf0_recall, 3), 
                               precision_95CI = round(rf0_precision, 3), 
                               'Baseline. Slower than DT.')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment3_documentation)

3.2.1. Experiment RF1

Objective: Estimate baseline performance for a random forest model using cross-validation

Variation: 10-fold cross-validation

Evaluation metrics: Accuracy, recall, precision

Perform the experiment:

Algorithm: randomForest()

Parameters:

ntree = 500 (default value)
mtry = 7 (default value)

10-fold cross-validation

rf1_runtime <- system.time(
  experiment3_results <- lapply(folds, function(x) {
    # Training and test datasets
    exp_train <- train_smote[-x, ]
    exp_test <- train_smote[x, ]
    # Use training data to create random forest classification model
    exp_model <- randomForest(has_term_deposit ~ ., data = exp_train)
    # Make predictions on test data
    exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
    # Actual (true) data for comparison
    exp_actual <- exp_test$has_term_deposit
    # Create confusion matrix of actual vs predicted classes
    exp_confusion_matrix <- table(exp_actual, exp_predictions)
    # Calculate performance metrics from confusion matrix
    exp_performance <- unlist(performance_metrics(exp_confusion_matrix))
    exp_accuracy <-  exp_performance[1]
    exp_recall <- exp_performance[2]
    exp_precision <- exp_performance[3]
    return(list(exp_accuracy, exp_recall, exp_precision))
  }))

rf1_runtime <- rf1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf1_runtime)

## [1] "Runtime: 629.7 seconds"

# Convert results list of lists to dataframe
experiment3_results_df <- as.data.frame(t(data.frame(lapply(experiment3_results, unlist))))
colnames(experiment3_results_df) <- c('Accuracy', 'Recall', 'Precision')
experiment3_results_df

##         Accuracy    Recall Precision
## Fold01 0.9332651 0.9535408 0.9045785
## Fold02 0.9361441 0.9506008 0.9139923
## Fold03 0.9392265 0.9493615 0.9221557
## Fold04 0.9402496 0.9498681 0.9238666
## Fold05 0.9371546 0.9507105 0.9161318
## Fold06 0.9342886 0.9464128 0.9144202
## Fold07 0.9357348 0.9426947 0.9217280
## Fold08 0.9347114 0.9480462 0.9135644
## Fold09 0.9365404 0.9470666 0.9186992
## Fold10 0.9328694 0.9506505 0.9067180

The 95% confidence intervals for accuracy, recall, and precision are shown below. Recall, in particular, is quite high (~95%).

# Accuracy
rf1_accuracy_CI <- unlist(calc_ci95(mean(experiment3_results_df$Accuracy), 
                                      sd(experiment3_results_df$Accuracy), 
                                      nrow(experiment3_results_df)))
sprintf('Accuracy 95%% CI: (%.3f, %.3f)', rf1_accuracy_CI[1], rf1_accuracy_CI[2])

## [1] "Accuracy 95% CI: (0.935, 0.938)"

# Recall
rf1_recall_CI <- unlist(calc_ci95(mean(experiment3_results_df$Recall), 
                                      sd(experiment3_results_df$Recall), 
                                      nrow(experiment3_results_df)))
sprintf('Recall 95%% CI: (%.3f, %.3f)', rf1_recall_CI[1], rf1_recall_CI[2])

## [1] "Recall 95% CI: (0.947, 0.951)"

# Precision
rf1_precision_CI <- unlist(calc_ci95(mean(experiment3_results_df$Precision), 
                                      sd(experiment3_results_df$Precision), 
                                      nrow(experiment3_results_df)))
sprintf('Precision 95%% CI: (%.3f, %.3f)', rf1_precision_CI[1], rf1_precision_CI[2])

## [1] "Precision 95% CI: (0.912, 0.920)"

Conclusion: Using cross-validation shows that the baseline random forest model performs well (accuracy ~94%) and has balanced recall and precision (~95% and ~91%, respectively). Furthermore, since the confidence intervals for these metrics are higher than and do not overlap those from 3.1.1. Experiment DT1, we can conclude that the baseline random forest model performs significantly better than the baseline decision tree model.

Document results:

experiment3_documentation <- c('Random forest', '10-fold CV',
                               runtime_sec = round(rf1_runtime, 1),
                               accuracy_95CI = paste0('(', round(rf1_accuracy_CI[1], 3), ', ', 
                                                           round(rf1_accuracy_CI[2], 3), ')'),  
                               recall_95CI = paste0('(', round(rf1_recall_CI[1], 3), ', ', 
                                                         round(rf1_recall_CI[2], 3), ')'),  
                               precision_95CI = paste0('(', round(rf1_precision_CI[1], 3), ', ', 
                                                            round(rf1_precision_CI[2], 3), ')'), 
                               'Performance > decision tree. Long runtime.')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment3_documentation)

3.2.2. Experiment RF2

Objective: Determine whether considering more features at each split in the random forest model increases recall by \(\ge 5\%\).

Variations: Increase the value of mtry from 7 (default) to 14

I hypothesized that a higher value of mtry may improve model performance because it would increase the probability that the set of candidate variables at each split contains the most important variable from 3.2.0. Experiment RF0 (euribor_rate_3m).⁴ This variation would be expected to reduce bias in the model but increase variance.⁵
Note: I also considered setting importance to TRUE (default = FALSE) to assess the importance of predictors; however, this greatly increased computation time and was not feasible to implement.
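The reasoning behind this hypothesis can be made concrete with a short calculation. If mtry candidate variables are drawn without replacement from p features at each split, the chance that one particular variable is among them is mtry / p. The value p = 49 is taken from the feature count stated in experiment RF0; `prob_candidate` is a hypothetical helper.

```r
p <- 49  # number of features, as stated for experiment RF0

# Probability that a specific variable is among the candidates at a split
prob_candidate <- function(mtry, p) mtry / p

round(c(mtry_7 = prob_candidate(7, p), mtry_14 = prob_candidate(14, p)), 3)
# mtry_7 0.143, mtry_14 0.286

# Same result from the hypergeometric distribution: one "success" among p items,
# 14 draws without replacement, probability of drawing the success
dhyper(1, m = 1, n = p - 1, k = 14)
```

Doubling mtry therefore doubles the per-split chance of considering euribor_rate_3m, from about 14% to about 29%.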

Evaluation metric(s): Accuracy, recall, precision

Perform the experiment:

Algorithm: randomForest()

Parameters:

ntree = 500 (default value)
mtry = 14

10-fold cross-validation

rf2_runtime <- system.time(
  experiment4_results <- lapply(folds, function(x) {
    # Training and test datasets
    exp_train <- train_smote[-x, ]
    exp_test <- train_smote[x, ]
    # Use training data to create random forest classification model
    exp_model <- randomForest(has_term_deposit ~ ., data = exp_train, mtry = 14)
    # Make predictions on test data
    exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
    # Actual (true) data for comparison
    exp_actual <- exp_test$has_term_deposit
    # Create confusion matrix of actual vs predicted classes
    exp_confusion_matrix <- table(exp_actual, exp_predictions)
    # Calculate performance metrics from confusion matrix
    exp_performance <- unlist(performance_metrics(exp_confusion_matrix))
    exp_accuracy <-  exp_performance[1]
    exp_recall <- exp_performance[2]
    exp_precision <- exp_performance[3]
    return(list(exp_accuracy, exp_recall, exp_precision))
  }))

rf2_runtime <- rf2_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf2_runtime)

## [1] "Runtime: 676.2 seconds"

experiment4_results_df <- as.data.frame(t(data.frame(lapply(experiment4_results, unlist))))
colnames(experiment4_results_df) <- c('Accuracy', 'Recall', 'Precision')
experiment4_results_df

##         Accuracy    Recall Precision
## Fold01 0.9344933 0.9488206 0.9122807
## Fold02 0.9394187 0.9485714 0.9234061
## Fold03 0.9402496 0.9486842 0.9251497
## Fold04 0.9418866 0.9504386 0.9268606
## Fold05 0.9371546 0.9495128 0.9174155
## Fold06 0.9381781 0.9520213 0.9169876
## Fold07 0.9373721 0.9440559 0.9238666
## Fold08 0.9320508 0.9406593 0.9157039
## Fold09 0.9385875 0.9488762 0.9212666
## Fold10 0.9349161 0.9476718 0.9144202

The 95% confidence intervals for accuracy, recall, and precision are shown below. Recall, in particular, is high (~95%).

# Accuracy
rf2_accuracy_CI <- unlist(calc_ci95(mean(experiment4_results_df$Accuracy), 
                                      sd(experiment4_results_df$Accuracy), 
                                      nrow(experiment4_results_df)))
sprintf('Accuracy 95%% CI: (%.3f, %.3f)', rf2_accuracy_CI[1], rf2_accuracy_CI[2])

## [1] "Accuracy 95% CI: (0.936, 0.939)"

# Recall
rf2_recall_CI <- unlist(calc_ci95(mean(experiment4_results_df$Recall), 
                                      sd(experiment4_results_df$Recall), 
                                      nrow(experiment4_results_df)))
sprintf('Recall 95%% CI: (%.3f, %.3f)', rf2_recall_CI[1], rf2_recall_CI[2])

## [1] "Recall 95% CI: (0.946, 0.950)"

# Precision
rf2_precision_CI <- unlist(calc_ci95(mean(experiment4_results_df$Precision), 
                                      sd(experiment4_results_df$Precision), 
                                      nrow(experiment4_results_df)))
sprintf('Precision 95%% CI: (%.3f, %.3f)', rf2_precision_CI[1], rf2_precision_CI[2])

## [1] "Precision 95% CI: (0.917, 0.923)"
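
The calc_ci95() helper is defined earlier in the document. As a rough illustration of this kind of calculation, the sketch below computes a t-based 95% confidence interval from the fold-level recall values printed above. The function name ci95_t and the choice of the t distribution are assumptions; calc_ci95() may use a different formula (eg, a normal approximation).

```r
# Fold-level recall values copied from the experiment4_results_df output above
recall_by_fold <- c(0.9488206, 0.9485714, 0.9486842, 0.9504386, 0.9495128,
                    0.9520213, 0.9440559, 0.9406593, 0.9488762, 0.9476718)

# Hypothetical helper (not the author's calc_ci95): t-based 95% CI of the mean
ci95_t <- function(x) {
  n <- length(x)
  std_error <- sd(x) / sqrt(n)
  margin <- qt(0.975, df = n - 1) * std_error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

recall_ci <- ci95_t(recall_by_fold)
sprintf('Recall 95%% CI: (%.3f, %.3f)', recall_ci[['lower']], recall_ci[['upper']])
```

With only 10 folds, a t-based interval is slightly wider than one based on the normal approximation, so it is the more cautious of the two choices.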

Conclusion: The objective was not met. Increasing mtry from 7 to 14 did not meaningfully change recall (95% CI 0.946 to 0.950 vs 0.947 to 0.951). The confidence intervals for all three performance metrics overlap between experiments 3 (RF1) and 4 (RF2), indicating that the performance of the two random forest models is not significantly different.
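
The overlap comparison used in this conclusion can be written as a small check. The sketch below uses the recall intervals reported for the two random forest experiments; intervals_overlap() is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper: TRUE if two closed intervals c(lower, upper) share any point
intervals_overlap <- function(ci_a, ci_b) {
  ci_a[1] <= ci_b[2] && ci_b[1] <= ci_a[2]
}

# Recall 95% CIs reported for the two random forest experiments
rf1_recall <- c(0.947, 0.951)   # mtry = 7 (experiment 3)
rf2_recall <- c(0.946, 0.950)   # mtry = 14 (experiment 4)

intervals_overlap(rf1_recall, rf2_recall)
```

Overlapping intervals are a conservative heuristic rather than a formal test. Because both models were evaluated on the same folds, a paired comparison of the fold-level metrics (eg, a paired t-test) would be a more direct check.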

Document results:

experiment4_documentation <- c('Random forest', 'mtry = 14',
                               runtime_sec = round(rf2_runtime, 1),
                               accuracy_95CI = paste0('(', round(rf2_accuracy_CI[1], 3), ', ', 
                                                           round(rf2_accuracy_CI[2], 3), ')'),  
                               recall_95CI = paste0('(', round(rf2_recall_CI[1], 3), ', ', 
                                                         round(rf2_recall_CI[2], 3), ')'),  
                               precision_95CI = paste0('(', round(rf2_precision_CI[1], 3), ', ', 
                                                            round(rf2_precision_CI[2], 3), ')'),  
                               'Similar performance. Long runtime.')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment4_documentation)

3.3. AdaBoost

3.3.1. Experiment AB1

Objective: Establish baseline performance for an AdaBoost model of bank term deposits

Variation: None

Evaluation metric(s): Accuracy, recall, precision

Perform the experiment:

Algorithm: adabag::boosting.cv()

Parameters:

mfinal (number of weak learners in the final ensemble model): Due to excessively long computation time with the default value of 100, I reduced it to 50.
v (v-fold cross-validation): 10 (default value)
Decision tree control: maxdepth = 1 (ie, decision tree stumps, because AdaBoost builds its ensemble from weak learners)

ab1_runtime <- system.time(
  ab1_obj <- boosting.cv(has_term_deposit ~ ., data = train_smote, mfinal = 50,
                         control = rpart.control(maxdepth = 1)))

## i:  1 Sun Oct 19 23:02:24 2025 
## i:  2 Sun Oct 19 23:02:54 2025 
## i:  3 Sun Oct 19 23:03:28 2025 
## i:  4 Sun Oct 19 23:04:13 2025 
## i:  5 Sun Oct 19 23:04:57 2025 
## i:  6 Sun Oct 19 23:05:37 2025 
## i:  7 Sun Oct 19 23:06:12 2025 
## i:  8 Sun Oct 19 23:06:59 2025 
## i:  9 Sun Oct 19 23:07:38 2025 
## i:  10 Sun Oct 19 23:08:12 2025

ab1_runtime <- ab1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', ab1_runtime)

## [1] "Runtime: 744.2 seconds"

Confusion matrix

(ab1_confusion_matrix <- ab1_obj$confusion)

##                Observed Class
## Predicted Class    no   yes
##             no  21856  6150
##             yes  3629 17223

Average error (proportion of observations misclassified across the cross-validation folds; here, (6150 + 3629) / 48858)

(ab1_error <- ab1_obj$error)

## [1] 0.2001515

Accuracy, recall, and precision are shown below as point estimates calculated from the cross-validated confusion matrix; confidence intervals were not calculated for this model. Recall, in particular, is moderate (~83%).

# I transposed the confusion matrix from the boosting.cv() output object to make it compatible with 
# my performance_metrics() function, which assumes that actual (observed) classes are rows and 
# predicted classes are columns
ab1_performance <- unlist(performance_metrics(t(ab1_confusion_matrix)))
ab1_accuracy <- ab1_performance[1]
ab1_recall <- ab1_performance[2]
ab1_precision <- ab1_performance[3]
sprintf('Accuracy: %.3f, Recall: %.3f, Precision: %.3f', ab1_accuracy, ab1_recall, ab1_precision)

## [1] "Accuracy: 0.800, Recall: 0.826, Precision: 0.737"
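
As a cross-check, the metrics can also be computed directly from the counts in the confusion matrix printed above. The sketch below assumes that 'yes' is the positive class and uses the standard definitions recall = TP / (TP + FN) and precision = TP / (TP + FP). It does not call performance_metrics(), which is defined earlier in the document.

```r
# Counts copied from ab1_obj$confusion (rows = predicted class, columns = observed class)
ab1_counts <- matrix(c(21856, 3629, 6150, 17223), nrow = 2,
                     dimnames = list(predicted = c('no', 'yes'),
                                     observed = c('no', 'yes')))

# Assumption: 'yes' (client subscribed) is the positive class
tp <- ab1_counts['yes', 'yes']   # predicted yes, observed yes
fp <- ab1_counts['yes', 'no']    # predicted yes, observed no
fn <- ab1_counts['no', 'yes']    # predicted no, observed yes
tn <- ab1_counts['no', 'no']     # predicted no, observed no

check_accuracy <- (tp + tn) / sum(ab1_counts)
check_recall <- tp / (tp + fn)
check_precision <- tp / (tp + fp)
sprintf('Accuracy: %.3f, Recall: %.3f, Precision: %.3f',
        check_accuracy, check_recall, check_precision)
```

Under these assumptions, TP / (TP + FN) is 0.737 and TP / (TP + FP) is 0.826, which is the reverse of the recall and precision labels in the output above. It may therefore be worth confirming which orientation performance_metrics() expects and whether the transpose is needed.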

Conclusion: The baseline AdaBoost model has moderate performance. Statistical comparison with the decision tree and random forest algorithms is not possible due to differences in how the adabag::boosting.cv() function performs cross-validation (ie, the folds are most likely different). However, the performance metrics suggest that the baseline AdaBoost model is inferior to the decision tree and random forest models.

Note: I was not able to adapt the adabag::boosting() function without cross-validation to the lapply() function that I used to perform cross-validation with decision tree and random forest models. The computation time for even a single boosting() iteration was very long (longer than boosting.cv), which made multiple iterations impossible. This is most likely due to the sequential nature of the AdaBoost algorithm.
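
For reference, one possible way to evaluate AdaBoost on a fixed set of folds is to call adabag::boosting() inside the same kind of loop used for the other models. The sketch below is illustrative only: it uses a small simulated dataset (toy_df), a small mfinal, and folds built with base R, all of which are assumptions made to keep the runtime short. As noted above, this approach was too slow to be practical on the full training data.

```r
library(adabag)   # boosting() and its predict() method
library(rpart)    # rpart.control()

set.seed(622)
# Simulated two-class data (hypothetical; stands in for train_smote)
n <- 200
toy_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy_df$y <- factor(ifelse(toy_df$x1 + toy_df$x2 + rnorm(n, sd = 0.5) > 0, 'yes', 'no'))

# Assign each row to one of 5 folds; the same list could be reused for every algorithm
fold_id <- sample(rep(1:5, length.out = n))
toy_folds <- split(seq_len(n), fold_id)

fold_accuracy <- sapply(toy_folds, function(test_idx) {
  # Fit boosted stumps on the training portion of this fold
  fit <- boosting(y ~ ., data = toy_df[-test_idx, ], mfinal = 5,
                  control = rpart.control(maxdepth = 1))
  # Predict the held-out rows and compare with the observed classes
  pred <- predict(fit, newdata = toy_df[test_idx, ])
  mean(pred$class == toy_df$y[test_idx])
})
round(fold_accuracy, 3)
```

Because the fold list is created once and passed to every model, fold-level metrics from different algorithms would be directly comparable.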

Document results:

experiment5_documentation <- c('Adaboost', 'mfinal = 50, maxdepth = 1, 10-fold CV',
                               runtime_sec = round(ab1_runtime, 1),
                               accuracy_95CI = round(ab1_accuracy, 3),  
                               recall_95CI = round(ab1_recall, 3), 
                               precision_95CI = round(ab1_precision, 3), 
                               'Baseline. Long runtime.')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment5_documentation)

3.3.2. Experiment AB2

Objective: Determine whether increasing tree depth in the AdaBoost ensemble model increases recall by \(\ge 5\%\).

Variation: Increase maxdepth from 1 (tree stump) to 2

I hypothesized that increasing tree depth creates stronger learners (ie, better fit); however, I limited the increase to keep the learners relatively weak (which is the aim of AdaBoost). This variation is expected to reduce bias but increase variance.

Evaluation metric(s): Accuracy, recall, precision

Perform the experiment:

Algorithm: adabag::boosting.cv()

Parameters:

mfinal = 50
v = 10 (default value)
Decision tree control: maxdepth = 2

ab2_runtime <- system.time(
  ab2_obj <- boosting.cv(has_term_deposit ~ ., data = train_smote, mfinal = 50,
                         control = rpart.control(maxdepth = 2)))

## i:  1 Sun Oct 19 23:08:50 2025 
## i:  2 Sun Oct 19 23:09:28 2025 
## i:  3 Sun Oct 19 23:10:07 2025 
## i:  4 Sun Oct 19 23:10:44 2025 
## i:  5 Sun Oct 19 23:11:23 2025 
## i:  6 Sun Oct 19 23:12:00 2025 
## i:  7 Sun Oct 19 23:12:39 2025 
## i:  8 Sun Oct 19 23:13:18 2025 
## i:  9 Sun Oct 19 23:13:56 2025 
## i:  10 Sun Oct 19 23:14:34 2025

ab2_runtime <- ab2_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', ab2_runtime)

## [1] "Runtime: 382.2 seconds"

Confusion matrix

(ab2_confusion_matrix <- ab2_obj$confusion)

##                Observed Class
## Predicted Class    no   yes
##             no  22935  5628
##             yes  2550 17745

Average error

(ab2_error <- ab2_obj$error)

## [1] 0.167383

Accuracy, recall, and precision are shown below as point estimates; confidence intervals were not calculated for this model. Recall, in particular, increased 5.9% (relative change) compared with the previous experiment.

ab2_performance <- unlist(performance_metrics(t(ab2_confusion_matrix)))
ab2_accuracy <- ab2_performance[1]
ab2_recall <- ab2_performance[2]
ab2_precision <- ab2_performance[3]
sprintf('Accuracy: %.3f, Recall: %.3f, Precision: %.3f', ab2_accuracy, ab2_recall, ab2_precision)

## [1] "Accuracy: 0.833, Recall: 0.874, Precision: 0.759"

sprintf('Recall increased %.1f%%', 100 * (ab2_recall - ab1_recall) / ab1_recall)

## [1] "Recall increased 5.9%"
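
The percentage above is a relative increase. To make the distinction explicit, the sketch below recomputes both the relative and the absolute (percentage-point) change, using the ratios of confusion matrix counts that reproduce the recall values reported for AB1 (0.826) and AB2 (0.874). The variable names are introduced here for illustration.

```r
# Ratios that reproduce the recall values reported for AB1 and AB2,
# built from the counts in the two confusion matrices above
ab1_reported_recall <- 17223 / (17223 + 3629)
ab2_reported_recall <- 17745 / (17745 + 2550)

absolute_change <- ab2_reported_recall - ab1_reported_recall
relative_change <- absolute_change / ab1_reported_recall

sprintf('Absolute change: %.1f percentage points; relative change: %.1f%%',
        100 * absolute_change, 100 * relative_change)
```

The change is 4.8 percentage points in absolute terms and 5.9% in relative terms, so the \(\ge 5\%\) objective is met only under the relative interpretation.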

Conclusion: The objective was met. Increasing the tree depth from 1 to 2 in the AdaBoost model increased recall by >5% in relative terms (from 0.826 to 0.874). As noted in the previous experiment (3.3.1. Experiment AB1), statistical comparison with other models is not possible. However, the performance metrics suggest that, even with this improvement, the AdaBoost model still lags behind the unpruned decision tree and random forest models.

Document results:

experiment6_documentation <- c('Adaboost', 'maxdepth = 2',
                               runtime_sec = round(ab2_runtime, 1),
                               accuracy_95CI = round(ab2_accuracy, 3),  
                               recall_95CI = round(ab2_recall, 3), 
                               precision_95CI = round(ab2_precision, 3), 
                               'Increased tree depth improved performance')
experiment_tracker_df <- rbind(experiment_tracker_df, experiment6_documentation)

4. Summary of findings and recommendations

The outcomes of the machine learning experiments are summarized below:

experiment_tracker_df %>%
  gt() %>%
  # Define column widths
  cols_width(
    ends_with('_95CI') ~ pct(15)
  ) %>%
  # Highlight cells of interest
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = recall_95CI,
      rows = (model == 'Decision tree') & (variation == 'cp=0.0066') 
    )
  ) %>% 
  tab_style(
    style = cell_fill(color = 'slategray1'),
    locations = cells_body(
      columns = runtime_sec,
      rows = (model == 'Decision tree') & (variation == 'cp=0.0066') 
    )
  ) %>%
  tab_style(
    style = list(cell_text(weight = 'bold')),
    locations = cells_body(
      columns = model,
      rows = (model == 'Decision tree') & (variation == 'cp=0.0066') 
    )
  ) %>%
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = recall_95CI,
      rows = (model == 'Random forest') & (variation == '10-fold CV') 
    )
  ) %>%
  tab_style(
    style = cell_fill(color = 'slategray1'),
    locations = cells_body(
      columns = runtime_sec,
      rows = (model == 'Random forest') & (variation == '10-fold CV')
    )
  ) %>%
  tab_style(
    style = list(cell_text(weight = 'bold')),
    locations = cells_body(
      columns = model,
      rows = (model == 'Random forest') & (variation == '10-fold CV')
    )
  ) %>%
  # Boldface column labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_column_labels()
  )

model	variation	runtime_sec	accuracy_95CI	recall_95CI	precision_95CI	notes
Decision tree	No pruning, no CV	4.5	0.864	0.365	0.392	Baseline. Fast. Large tree.
Decision tree	10-fold CV	40.5	(0.887, 0.892)	(0.899, 0.905)	(0.859, 0.868)	Good performance, high recall.
Decision tree	cp=0.0066	22.2	(0.754, 0.772)	(0.811, 0.821)	(0.629, 0.672)	Pruning decreased runtime but also performance
Random forest	ntree = 500, mtry = 7, no CV	70.2	0.893	0.491	0.375	Baseline. Slower than DT.
Random forest	10-fold CV	629.7	(0.935, 0.938)	(0.947, 0.951)	(0.912, 0.92)	Performance > decision tree. Long runtime.
Random forest	mtry = 14	676.2	(0.936, 0.939)	(0.946, 0.95)	(0.917, 0.923)	Similar performance. Long runtime.
Adaboost	mfinal = 50, maxdepth = 1, 10-fold CV	744.2	0.8	0.826	0.737	Baseline. Long runtime.
Adaboost	maxdepth = 2	382.2	0.833	0.874	0.759	Increased tree depth improved performance

Key findings:

Overall, the random forest model with default parameters had the best classification performance, with ~95% recall; however, it required a long computation time (~630 seconds with 10-fold cross-validation).
- Increasing the number of features considered at each split (mtry) from 7 to 14 did not significantly affect classification performance.
The decision tree model without pruning had the second-best classification performance, with ~90% recall.
- Although its performance was slightly lower than that of the random forest model, it required much less computation time (~40 vs ~630 seconds).
- Pruning significantly decreased recall to ~82% but improved interpretability and facilitated visualization.
In both the random forest and decision tree models, the most important variable was the 3-month Euro interbank offered rate (euribor_rate_3m).
The AdaBoost model with maxdepth = 1 (ie, tree stumps) had the lowest classification performance of the three cross-validated baseline models, with ~83% recall.
- Increasing tree depth to 2 improved classification performance (recall ~87%). This falls between the recall of the pruned (~82%) and unpruned (~90%) decision tree models; however, no statistical comparison could be made.
- AdaBoost required about an order of magnitude more computation time than decision trees. Its runtime was similar to that of random forest, but its classification performance was inferior.
Cross-validation increased computation time but provides more reliable performance metrics than single point estimates.

This analysis has two main limitations:

k-fold cross-validation could not be implemented consistently across algorithms. Because the folds generated for the decision tree and random forest models differed from those generated internally for the AdaBoost models, statistical comparison of performance among all three algorithms was not possible.
The number of model variations was very small. In practice, hyperparameter tuning would need to be performed more systematically (eg, grid search) to determine the optimal value of hyperparameters.
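
As a rough illustration of how both limitations might be addressed, the sketch below runs a small grid search over two rpart hyperparameters while reusing one set of folds for every candidate, so that all candidates are evaluated on identical splits. It uses the built-in iris data reduced to a two-class problem; the dataset, grid values, and helper names are assumptions for illustration and are not part of the experiments above.

```r
library(rpart)

set.seed(622)
# Toy two-class problem (hypothetical stand-in for the bank data)
toy_df <- iris[iris$Species != 'setosa', ]
toy_df$Species <- droplevels(toy_df$Species)

# One shared set of 5 folds, reused for every hyperparameter combination
fold_id <- sample(rep(1:5, length.out = nrow(toy_df)))
shared_folds <- split(seq_len(nrow(toy_df)), fold_id)

# Candidate hyperparameter values (3 x 3 = 9 combinations)
param_grid <- expand.grid(cp = c(0.001, 0.01, 0.1), maxdepth = c(1, 2, 4))

# Mean cross-validated accuracy for one combination of cp and maxdepth
cv_accuracy <- function(cp, maxdepth) {
  fold_acc <- sapply(shared_folds, function(test_idx) {
    fit <- rpart(Species ~ ., data = toy_df[-test_idx, ], method = 'class',
                 control = rpart.control(cp = cp, maxdepth = maxdepth))
    pred <- predict(fit, newdata = toy_df[test_idx, ], type = 'class')
    mean(pred == toy_df$Species[test_idx])
  })
  mean(fold_acc)
}

param_grid$accuracy <- mapply(cv_accuracy, param_grid$cp, param_grid$maxdepth)

# Rank combinations from best to worst mean accuracy
param_grid[order(-param_grid$accuracy), ]
```

The same pattern extends to other algorithms by swapping the model-fitting call inside cv_accuracy() while keeping shared_folds fixed.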

In conclusion, the random forest model with default parameters achieved the best recall (~95%). From a purely classification performance perspective (ie, for data science), this is the best model for the bank term deposit classification problem. However, if interpretability and/or computation time are important (ie, business considerations), a pruned decision tree model with ~82% recall (perhaps higher with additional tuning) may be an acceptable alternative.

1. Lantz B. (2023). Machine Learning with R, 4th ed., page 420.
2. https://medium.com/@tkadeethum/the-bias-variance-trade-off-explained-insights-for-ml-interviews-d944bdc05f87
3. Boehmke B and Greenwell B. (2020). Hands-On Machine Learning with R. Section 11.4.1. https://bradleyboehmke.github.io/HOML/random-forest.html
4. Probst P, Wright MN, and Boulesteix A-L. Hyperparameters and tuning strategies for random forest. Data Mining Knowl Discov. 2019;9:e1301. DOI: 10.1002/widm.1301
5. https://codemia.io/knowledge-hub/path/setting_values_for_ntree_and_mtry_for_random_forest_regression_model

DATA622 Assignment 2

Alexander Simon

2025-10-19
