There are numerous functions defined in the setup block which are used extensively in the rest of the document. They are included here for reference.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(caret)
library(tidyverse)
library(doParallel)
library(rpart)
library(MLmetrics)
library(Rborist)
library(xgboost)
library(kableExtra)
set.seed(8675309)
# Stop cluster if it exists, RStudio will keep spinning new ones up otherwise
if (foreach::getDoParWorkers() > 1 && exists("cl") && inherits(cl, "cluster")) {
tryCatch({
stopCluster(cl)
message("Cluster stopped successfully.")
}, error = function(e) {
message("Error stopping cluster: ", e$message)
})
} else {
message("No active cluster to stop or cluster variable not defined.")
}
cl <- makeCluster(detectCores() - 2)
registerDoParallel(cl)
# Function to get Accuracy and F1 metrics for a fitted model on our test set
fetch_metrics <- function(model, testing) {
# Accuracy
class <- predict(model, newdata = testing)
post_sample <- postResample(pred = class, obs = testing$y)
# F1 Statistics
prob <- predict(model, newdata = testing, type = "prob")
# Get stats for F score
class_levels <- levels(testing$y)
eval_data <- data.frame(obs = testing$y,
pred = class)
# Add the probability columns (required for prSummary)
for (level in class_levels) {
eval_data[[level]] <- prob[,level]
}
# Calculate precision- and recall-based metrics with prSummary
metrics <- prSummary(eval_data, lev = class_levels)
# Return the combined named vector of precision/recall/F and accuracy metrics
c(metrics, post_sample)
}
generate_metric_table <- function(metrics) {
metrics |>
round(3) |>
mutate(
Accuracy = cell_spec(
Accuracy,
background = ifelse(Accuracy == max(Accuracy, na.rm = TRUE), "#FFFF99", "white")
),
`F` = cell_spec(
`F`,
background = ifelse(`F` == max(`F`, na.rm = TRUE), "#FFFF99", "white")
)
) |>
kbl(escape = FALSE) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
}
generate_test_table <- function(metrics, title) {
metrics |>
mutate(across(where(is.numeric), \(x) round(x, 3))) |>
t() |>
as.data.frame() |>
rownames_to_column(var = "Model") |>
select(Model, Accuracy, `F`, Precision, Recall) |>
kbl(caption = title, escape = FALSE) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
}
generate_sampling_table <- function(metrics) {
metrics |>
round(3) |>
mutate(
`Up Sampling F1` = cell_spec(
`Up Sampling F1`,
background = ifelse(`Up Sampling F1` == max(`Up Sampling F1`, na.rm = TRUE), "#FFFF99", "white")
),
`Base F1` = cell_spec(
`Base F1`,
background = ifelse(`Base F1` == max(`Base F1`, na.rm = TRUE), "#FFFF99", "white")
)
) |>
kbl(escape = FALSE) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
}
Upon reading in the data we immediately drop the duration column, as its value cannot be known before a call is made. Additionally, we relevel the y column so that the “yes” option is treated as the positive class when producing metrics, which makes the resulting metrics more intuitive.
# Based on information in the "bank-additional-names.txt" file
column_types <- list(
age = "n", job = "f", marital = "f", education = "f", default = "f",
housing = "f", loan = "f", contact = "f", month = "f", day_of_week = "f",
duration = "n", campaign = "n", pdays = "n", previous = "n", poutcome = "f",
emp.var.rate = "d", cons.price.idx = "d", cons.conf.idx = "d",
euribor3m = "d", nr.employed = "d", y = "f"
)
bank_add_full <- read_delim("bank-additional/bank-additional-full.csv", delim = ";", col_types = column_types) |>
select(!duration)
# Releveling to make yes the positive class by default
bank_add_full$y <- factor(bank_add_full$y, levels = c("yes", "no"))
# Using small for tuning
bank_add <- read_delim("bank-additional/bank-additional.csv", delim = ";", col_types = column_types) |>
select(!duration)
bank_add$y <- factor(bank_add$y, levels = c("yes", "no"))
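As a quick check (not part of the original code), the new level ordering can be confirmed directly; because of the releveling above, “yes” is guaranteed to be the first level and therefore the positive class for caret's metrics.
# Confirm "yes" is the first (positive) level after releveling
levels(bank_add_full$y)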
We will be using the same 80/20 train/test split for all trials.
set.seed(8675309)
# Partition the full data set for training and testing
inTraining <- createDataPartition(bank_add_full$y, p = .8, list = FALSE)
training <- bank_add_full[ inTraining,]
testing <- bank_add_full[-inTraining,]
Due to the imbalanced nature of the data we will run two experiments on each of our three model types. The first experiment will compare accuracy vs. F1 score as the selection metric when tuning models: we will select the best performing model under each metric and evaluate both on the test set using accuracy, F1 score, precision, and recall. Because the models are selected on accuracy and F1, the additional metrics need to be considered to gain a better understanding of model performance. The second experiment will determine whether up sampling the minority class alters model selection and improves model performance. Up sampling will only be performed on the training data; the test set will remain unaltered in all cases. The evaluation criterion here will be F1 score, since, as we will see, it varies more between models than accuracy does. When considering accuracy it should be kept in mind that a naive model which labels all cases “no” would be expected to have an accuracy of approximately 0.89.
set.seed(8675309)
bank_add |>
select(y) |>
table() |>
prop.table()
## y
## yes no
## 0.1094926 0.8905074
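As a sanity check on that roughly 0.89 figure (not part of the original analysis), the accuracy of a naive all-“no” classifier can be computed directly on the held-out test set defined above.
# Naive baseline: predict "no" for every case and measure accuracy on the test set
naive_pred <- factor(rep("no", nrow(testing)), levels = levels(testing$y))
mean(naive_pred == testing$y)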
This experiment is intended to test the effect different metrics have on model selection and subsequent performance on the test set. We will use accuracy and F1 score as the selection metrics of interest, while including precision and recall in the results to provide context. In both cases multiple models will be built using complexity parameter values of 0.00001, 0.0001, 0.001, and 0.005. An optimal model will be selected based on accuracy, another based on F1 score, and the performance of both models will be assessed on the test set. As the choice of selection metric can be subjective when there is no clearly optimal choice, the additional metrics will also be considered when deciding which model is optimal.
set.seed(8675309)
cart_grid <- expand.grid(cp = c(0.00001, 0.0001, 0.001, 0.005))
acc_ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
allowParallel = TRUE
)
cart_fit_acc <- train(y ~ ., data = training,
method = "rpart",
trControl = acc_ctrl,
tuneGrid = cart_grid,
metric = "Accuracy"
)
f_ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
allowParallel = TRUE,
classProbs = TRUE,
summaryFunction = prSummary,
verboseIter = TRUE,
savePredictions = "final"
)
cart_fit_fmeas <- train(y ~ ., data = training,
method = "rpart",
trControl = f_ctrl,
tuneGrid = cart_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 1e-05 on full training set
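The tuning parameter each run settles on can also be read straight off the fitted objects; a quick check (not shown in the original output, since the accuracy run above was not verbose):
# cp selected when optimizing for Accuracy vs. for F1
cart_fit_acc$bestTune
cart_fit_fmeas$bestTune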
We can see from the table below that the selection criterion does affect which model is chosen as optimal. In the case of accuracy, a complexity parameter of 0.001 performs best, while a value of 0.00001 produces the best F1 score. The gap in both accuracy and F1 score between the two selected models is not particularly large on the training set.
# Compares how the models performed on the training set
cart_fit_acc$results |>
left_join(cart_fit_fmeas$results, by = join_by(cp)) |>
mutate(`Complexity Parameter` = format(cp, scientific = FALSE)) |>
select(`Complexity Parameter`, Accuracy, `F`, Precision, Recall) |>
mutate(across(where(is.numeric), \(x) round(x, 3))) |>
mutate(
Accuracy = cell_spec(
Accuracy,
background = ifelse(Accuracy == max(Accuracy, na.rm = TRUE), "#FFFF99", "white")
),
`F` = cell_spec(
`F`,
background = ifelse(`F` == max(`F`, na.rm = TRUE), "#FFFF99", "white")
)
) |>
kbl(escape = FALSE) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Complexity Parameter | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| 0.00001 | 0.89 | 0.382 | 0.518 | 0.303 |
| 0.00010 | 0.891 | 0.381 | 0.528 | 0.299 |
| 0.00100 | 0.9 | 0.365 | 0.640 | 0.256 |
| 0.00500 | 0.899 | 0.286 | 0.707 | 0.180 |
When evaluated on the test set the gap actually shrinks, with neither of the primary metrics showing a very meaningful difference between models. In this case it is worth considering the precision and recall values. Which model is truly “better” depends heavily on the cost per call vs. the expected return on a secured subscription. The accuracy model will have fewer false positives, meaning fewer calls will be “wasted”, whereas the F1 model will capture more of the population that will subscribe but at a higher rate of wasted calls. Without figures on cost per call vs. return on subscription I would recommend choosing the F1 model, as I would value higher recall over precision for this use case.
# Gather accuracy and f1 score for accuracy selected model on the test set
set.seed(8675309)
cart_acc_metrics <- fetch_metrics(cart_fit_acc, testing)
cart_f_metrics <- fetch_metrics(cart_fit_fmeas, testing)
cart_metrics <- data.frame(cart_acc_metrics, cart_f_metrics) |>
rename(`Optimal Accuracy` = cart_acc_metrics, `Optimal F1 Score` = cart_f_metrics)
generate_test_table(cart_metrics, "CART Metrics Performance on Test Set")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| Optimal Accuracy | 0.902 | 0.393 | 0.651 | 0.281 |
| Optimal F1 Score | 0.893 | 0.406 | 0.544 | 0.323 |
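To make the cost-per-call argument concrete, here is a rough sketch of the calculation that would settle the choice. The cost_per_call and value_per_sub figures are purely hypothetical placeholders rather than values from the data, and expected_return is a helper invented for this illustration.
# Hypothetical economics: each predicted "yes" triggers a call, and each true
# positive yields a subscription worth value_per_sub
cost_per_call <- 5    # assumed placeholder
value_per_sub <- 100  # assumed placeholder
expected_return <- function(model) {
  pred  <- predict(model, newdata = testing)
  calls <- sum(pred == "yes")                       # calls placed
  subs  <- sum(pred == "yes" & testing$y == "yes")  # subscriptions secured
  subs * value_per_sub - calls * cost_per_call
}
expected_return(cart_fit_acc)
expected_return(cart_fit_fmeas)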
The imbalanced nature of the classes of the dependent variable can affect model performance, so we will experiment with up sampling the minority class in the training set to see if it improves results. The minority class in this case is the “yes” class, and the up sampling method will be the default method of the caret package. We will use both accuracy and F1 score to select the optimal models to gain additional insight into the impact of selection criteria, but that is not the focus of the experiment. The base models will be evaluated against the up sampled models on the test set with F1 score as the determining metric. The same range of complexity parameters as in the previous experiment will be used.
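For illustration only (the original analysis lets trainControl handle this internally), caret's upSample function shows what the default up sampling does: minority-class rows are resampled with replacement until the classes are balanced. Inside train() this happens within each resample rather than on the full training set as shown here.
# Illustrative only: balance the classes on a copy of the training data
set.seed(8675309)
balanced <- upSample(x = select(training, -y), y = training$y, yname = "y")
prop.table(table(balanced$y))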
set.seed(8675309)
# Sampling can be set in the controls, need one for each model though
acc_nu_ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
allowParallel = TRUE
)
acc_up_ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
allowParallel = TRUE,
sampling = "up"
)
f_nu_ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
allowParallel = TRUE,
classProbs = TRUE,
summaryFunction = prSummary,
verboseIter = TRUE,
savePredictions = "final"
)
f_up_ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
allowParallel = TRUE,
classProbs = TRUE,
summaryFunction = prSummary,
verboseIter = TRUE,
savePredictions = "final",
sampling = "up"
)
set.seed(8675309)
acc_nu <- train(y ~ ., data = training,
method = "rpart",
trControl = acc_nu_ctrl,
tuneGrid = cart_grid,
metric = "Accuracy"
)
acc_up <- train(y ~ ., data = training,
method = "rpart",
trControl = acc_up_ctrl,
tuneGrid = cart_grid,
metric = "Accuracy"
)
f_nu <- train(y ~ ., data = training,
method = "rpart",
trControl = f_nu_ctrl,
tuneGrid = cart_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 1e-04 on full training set
f_up <- train(y ~ ., data = training,
method = "rpart",
trControl = f_up_ctrl,
tuneGrid = cart_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.001 on full training set
an_renamed <- acc_nu$results |>
select(cp, Accuracy) |>
rename(`No Up Sampling Accuracy` = Accuracy)
au_renamed <- acc_up$results |>
select(cp, Accuracy) |>
rename(`Up Sampling Accuracy` = Accuracy)
fn_renamed <- f_nu$results |>
select(cp, `F`) |>
rename(`No Up Sampling F1` = `F`)
fu_renamed <- f_up$results |>
select(cp, `F`) |>
rename(`Up Sampling F1` = `F`)
cart_consolidated <- an_renamed |>
left_join(au_renamed, by = join_by(cp)) |>
left_join(fn_renamed, by = join_by(cp)) |>
left_join(fu_renamed, by = join_by(cp))
We can see that while up sampling did have an impact on both accuracy and F1 score, it did not change the choice of model when accuracy was used as the selection metric. Additionally, the only cases in which up sampling improved performance were the F1 scores of the higher complexity parameter values; in all other cases up sampling resulted in worse performance on the training set. Since we are using F1 score as our reference, the model with complexity parameter 0.0001 will be referred to as the “base sample model” and the model with 0.001 as the “up sampling model”.
# Messy, but this table differs from the other cases so the helper functions can't be reused
cart_consolidated |>
mutate(`Complexity Parameter` = format(cp, scientific = FALSE)) |>
select(`Complexity Parameter`, `No Up Sampling Accuracy`, `Up Sampling Accuracy`, `No Up Sampling F1`, `Up Sampling F1`) |>
mutate(across(where(is.numeric), \(x) round(x, 3))) |>
mutate(
`No Up Sampling Accuracy` = cell_spec(
`No Up Sampling Accuracy`,
background = ifelse(`No Up Sampling Accuracy` == max(`No Up Sampling Accuracy`, na.rm = TRUE), "#FFFF99", "white")
),
`Up Sampling Accuracy` = cell_spec(
`Up Sampling Accuracy`,
background = ifelse(`Up Sampling Accuracy` == max(`Up Sampling Accuracy`, na.rm = TRUE), "#FFFF99", "white")
),
`No Up Sampling F1` = cell_spec(
`No Up Sampling F1`,
background = ifelse(`No Up Sampling F1` == max(`No Up Sampling F1`, na.rm = TRUE), "#FFFF99", "white")
),
`Up Sampling F1` = cell_spec(
`Up Sampling F1`,
background = ifelse(`Up Sampling F1` == max(`Up Sampling F1`, na.rm = TRUE), "#FFFF99", "white")
)
) |>
kbl(escape = FALSE) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Complexity Parameter | No Up Sampling Accuracy | Up Sampling Accuracy | No Up Sampling F1 | Up Sampling F1 |
|---|---|---|---|---|
| 0.00001 | 0.89 | 0.78 | 0.38 | 0.356 |
| 0.00010 | 0.891 | 0.781 | 0.381 | 0.378 |
| 0.00100 | 0.9 | 0.838 | 0.367 | 0.465 |
| 0.00500 | 0.899 | 0.834 | 0.286 | 0.45 |
When evaluated on the test set we see some interesting results: the F1 scores show relatively little difference, whereas the gap in accuracy is more significant. The up sampling model technically outperforms the base model on our stated evaluation criterion of F1 score; however, given the difference in accuracy between the two models, the base model should likely be chosen. We will select the base sample model as our optimal CART model for these reasons.
set.seed(8675309)
cart_up_metrics <- fetch_metrics(f_up, testing)
cart_nu_metrics <- fetch_metrics(f_nu, testing)
cart_sampling <- data.frame(cart_up_metrics, cart_nu_metrics) |>
rename(`Base Sample` = cart_nu_metrics, `Up Sampling` = cart_up_metrics)
generate_test_table(cart_sampling, "CART Sampling Performance on Test Set")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| Up Sampling | 0.833 | 0.470 | 0.366 | 0.656 |
| Base Sample | 0.895 | 0.413 | 0.561 | 0.327 |
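For a fuller view of the trade-off than the summary table provides, the underlying confusion matrices can be inspected directly; a small sketch using caret's confusionMatrix (not part of the original analysis):
# Precision/recall-oriented confusion matrices for both CART models on the test set
confusionMatrix(predict(f_up, newdata = testing), testing$y, mode = "prec_recall")
confusionMatrix(predict(f_nu, newdata = testing), testing$y, mode = "prec_recall")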
We will be using the Rborist implementation of the Random Forest algorithm, as it exposes both the number of predictors sampled at each split (predFixed) and the minimal node size (minNode) as tuning parameters.
This experiment, as well as the one for the XGBoost model, is largely the same as the CART experiment; the same criteria and methodology are used and will not be repeated here for the sake of brevity. We will train models with 2, 4, 6, and 8 predictors sampled at each split and minimal node sizes of 1, 5, and 10. An optimal model will be selected based on accuracy, another based on F1 score, and the performance of both models will be assessed on the test set. The models will be evaluated on accuracy, F1 score, precision, and recall to determine which is better overall.
set.seed(8675309)
p <- ncol(training) - 1 # total number of predictors
rf_grid <- expand.grid(
predFixed = c(2, 4, 6, 8),
minNode = c(1, 5, 10)
)
rf_f_ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = prSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = "final",
allowParallel = TRUE
)
rf_acc_ctrl <- trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
savePredictions = "final",
allowParallel = TRUE
)
rf_acc_model <- train(y ~ ., data = training,
method = "Rborist",
trControl = rf_acc_ctrl,
tuneGrid = rf_grid,
metric = "Accuracy"
)
## Aggregating results
## Selecting tuning parameters
## Fitting predFixed = 4, minNode = 10 on full training set
rf_f_model <- train(y ~ ., data = training,
method = "Rborist",
trControl = rf_f_ctrl,
tuneGrid = rf_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting predFixed = 8, minNode = 1 on full training set
We can see that the choice of metric results in very different models being selected: predFixed = 4 with minNode = 10 is optimal based on accuracy, while predFixed = 8 with minNode = 1 has the best F1 score. The accuracy gap between the models is not large; however, the F1 gap is substantial. Both metrics are slightly lower than those for the CART model, but not by a significant margin.
rf_metrics <- rf_acc_model$results |>
left_join(rf_f_model$results, by = join_by(predFixed, minNode)) |>
select(predFixed, minNode, Accuracy, `F`, Precision, Recall)
generate_metric_table(rf_metrics)
| predFixed | minNode | Accuracy | F | Precision | Recall |
|---|---|---|---|---|---|
| 2 | 1 | 0.893 | 0.135 | 0.811 | 0.074 |
| 2 | 5 | 0.894 | 0.111 | 0.820 | 0.060 |
| 2 | 10 | 0.893 | 0.141 | 0.791 | 0.078 |
| 4 | 1 | 0.899 | 0.291 | 0.706 | 0.183 |
| 4 | 5 | 0.9 | 0.289 | 0.706 | 0.182 |
| 4 | 10 | 0.9 | 0.286 | 0.717 | 0.179 |
| 6 | 1 | 0.898 | 0.342 | 0.615 | 0.237 |
| 6 | 5 | 0.899 | 0.336 | 0.637 | 0.228 |
| 6 | 10 | 0.899 | 0.332 | 0.643 | 0.224 |
| 8 | 1 | 0.896 | 0.373 | 0.592 | 0.273 |
| 8 | 5 | 0.898 | 0.364 | 0.606 | 0.260 |
| 8 | 10 | 0.898 | 0.363 | 0.619 | 0.256 |
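The divergence between the two metrics across the tuning grid is also easy to see with caret's built-in plotting method; a minimal sketch (no figure was included in the original):
# Cross-validated F1 across the predFixed/minNode grid
ggplot(rf_f_model)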
When evaluated on the test set we see a meaningful difference in F1 score between the models with virtually no difference in accuracy. The model chosen based on F1 score is clearly the superior model.
set.seed(8675309)
rf_full_acc <- fetch_metrics(rf_acc_model, testing)
rf_full_f <- fetch_metrics(rf_f_model, testing)
rf_metric_comparison <- data.frame(rf_full_acc, rf_full_f) |>
rename(`Optimal F1 Score` = rf_full_f, `Optimal Accuracy` = rf_full_acc)
generate_test_table(rf_metric_comparison, "Random Forest Metrics Performance on Test Set")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| Optimal Accuracy | 0.899 | 0.303 | 0.683 | 0.195 |
| Optimal F1 Score | 0.897 | 0.361 | 0.594 | 0.260 |
As we saw similar accuracy results with a larger difference in F1 score, we will use F1 score as our preferred selection metric for the sampling experiment. We will train one additional model, this time using up sampling, and compare its performance to that of the optimal F1 model from the prior experiment. That previous model was trained on the base training set and acts as the base case for our up sampling comparison. The evaluation criterion will be F1 score, with accuracy, precision, and recall used if the models have similar F1 scores.
set.seed(8675309)
# The rf_f_model is the base sampling so only one additional model needs to be trained for this test
rf_up_ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = prSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = "final",
allowParallel = TRUE,
sampling = "up"
)
rf_up_model <- train(y ~ ., data = training,
method = "Rborist",
trControl = rf_up_ctrl,
tuneGrid = rf_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting predFixed = 4, minNode = 10 on full training set
rf_up_selection <- rf_up_model$results |>
select(predFixed, minNode, `F`) |>
rename(`Up Sampling F1` = `F`)
rf_f_selection <- rf_f_model$results |>
select(predFixed, minNode, `F`) |>
rename(`Base F1` = `F`)
rf_consolidated <- rf_up_selection |>
left_join(rf_f_selection, by = join_by(predFixed, minNode))
For Random Forest, up sampling has a major influence on model selection, resulting in models with smaller trees and fewer predictors sampled per split. The optimal model with up sampling has predFixed = 4 and minNode = 10 and a much higher F1 score than the base model.
generate_sampling_table(rf_consolidated)
| predFixed | minNode | Up Sampling F1 | Base F1 |
|---|---|---|---|
| 2 | 1 | 0.486 | 0.135 |
| 2 | 5 | 0.48 | 0.111 |
| 2 | 10 | 0.477 | 0.141 |
| 4 | 1 | 0.486 | 0.291 |
| 4 | 5 | 0.488 | 0.289 |
| 4 | 10 | 0.493 | 0.286 |
| 6 | 1 | 0.482 | 0.342 |
| 6 | 5 | 0.486 | 0.336 |
| 6 | 10 | 0.488 | 0.332 |
| 8 | 1 | 0.479 | 0.373 |
| 8 | 5 | 0.482 | 0.364 |
| 8 | 10 | 0.486 | 0.363 |
When evaluating the models on the test set we can see that up sampling significantly improves the F1 score compared to the base sample, albeit at a more noticeable cost to accuracy. The uplift in F1 score is large enough that the up sampled model is clearly the better performer.
set.seed(8675309)
rf_full_up <- fetch_metrics(rf_up_model, testing)
rf_up_comparison <- data.frame(rf_full_up, rf_full_f) |>
rename(`Base Sample` = rf_full_f, `Up Sampling` = rf_full_up)
generate_test_table(rf_up_comparison, "Random Forest Sampling on Test Set")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| Up Sampling | 0.868 | 0.503 | 0.437 | 0.592 |
| Base Sample | 0.897 | 0.361 | 0.594 | 0.260 |
For our XGBoost models we will only tune two hyperparameters due to constraints on compute: the maximum tree depth and the shrinkage factor (eta). The values tested will be 3, 4, and 5 for tree depth and 0.05, 0.1, 0.2, and 0.3 for the shrinkage factor. The number of boosting iterations will also be lower than usual at only 100 rounds.
The methodology of this experiment is the same as for the prior models.
set.seed(8675309)
xgb_grid <- expand.grid(
nrounds = 100,
max_depth = c(3, 4, 5),
eta = c(0.05, 0.1, 0.2, 0.3),
# Remaining hyperparameters are held fixed to keep the search space small
gamma = 0.5,
colsample_bytree = 0.75,
min_child_weight = 5,
subsample = 1
)
f_ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = prSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = "final"
)
acc_ctrl <- trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
savePredictions = "final"
)
xg_F_model <- train(
y ~ .,
data = training,
method = "xgbTree",
trControl = f_ctrl,
tuneGrid = xgb_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 5, eta = 0.3, gamma = 0.5, colsample_bytree = 0.75, min_child_weight = 5, subsample = 1 on full training set
xg_acc_model <- train(
y ~ .,
data = training,
method = "xgbTree",
trControl = acc_ctrl,
tuneGrid = xgb_grid,
metric = "Accuracy"
)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 4, eta = 0.1, gamma = 0.5, colsample_bytree = 0.75, min_child_weight = 5, subsample = 1 on full training set
We can see that accuracy is not a particularly valuable metric for XGBoost, with all models within roughly 0.002 of each other. The F1 score shows much more variation between configurations and is clearly the more meaningful metric. The model with max_depth = 4 and eta = 0.1 is chosen as the optimal model based on accuracy, while max_depth = 5 and eta = 0.3 is optimal based on F1 score.
xg_metrics <- xg_F_model$results |>
left_join(xg_acc_model$results, by = join_by(nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample)) |>
select(max_depth, eta, Accuracy, `F`, Precision, Recall)
generate_metric_table(xg_metrics)
| max_depth | eta | Accuracy | F | Precision | Recall |
|---|---|---|---|---|---|
| 3 | 0.05 | 0.9 | 0.298 | 0.698 | 0.189 |
| 3 | 0.10 | 0.901 | 0.334 | 0.681 | 0.221 |
| 3 | 0.20 | 0.901 | 0.356 | 0.666 | 0.244 |
| 3 | 0.30 | 0.9 | 0.36 | 0.638 | 0.251 |
| 4 | 0.05 | 0.9 | 0.332 | 0.684 | 0.220 |
| 4 | 0.10 | 0.901 | 0.348 | 0.664 | 0.237 |
| 4 | 0.20 | 0.901 | 0.372 | 0.641 | 0.262 |
| 4 | 0.30 | 0.9 | 0.372 | 0.630 | 0.265 |
| 5 | 0.05 | 0.901 | 0.349 | 0.672 | 0.237 |
| 5 | 0.10 | 0.9 | 0.367 | 0.655 | 0.255 |
| 5 | 0.20 | 0.899 | 0.375 | 0.625 | 0.268 |
| 5 | 0.30 | 0.899 | 0.38 | 0.609 | 0.276 |
When evaluated against the test set the models perform very similarly. The gap in accuracy is still negligible, while the difference in F1 score is not large enough to be significant. The optimal model choice should depend on the relative importance of precision and recall, and we will remain consistent with the previous cases by choosing the F1 model.
set.seed(8675309)
xg_f_test <- fetch_metrics(xg_F_model, testing)
xg_acc_test <- fetch_metrics(xg_acc_model, testing)
xg_metric_comparison <- data.frame(xg_f_test, xg_acc_test) |>
rename(`F1 Model` = xg_f_test, `Accuracy Model` = xg_acc_test)
generate_test_table(xg_metric_comparison, "XGBoost Metrics Performance on Test Set")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| F1 Model | 0.899 | 0.384 | 0.615 | 0.279 |
| Accuracy Model | 0.903 | 0.373 | 0.679 | 0.258 |
The methodology for this experiment is identical to that of the Random Forest model so will not be repeated. We are once again sticking with F1 score as our model selection metric.
set.seed(8675309)
up_ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = prSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = "final",
sampling = "up"
)
xg_up_model <- train(
y ~ .,
data = training,
method = "xgbTree",
trControl = up_ctrl,
tuneGrid = xgb_grid,
metric = "F"
)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 5, eta = 0.1, gamma = 0.5, colsample_bytree = 0.75, min_child_weight = 5, subsample = 1 on full training set
Up sampling appears to provide a large boost to model performance in this case. The optimal model when up sampling has max_depth = 5, eta = 0.1.
xg_up_selection <- xg_up_model$results |>
select(max_depth, eta, `F`) |>
rename(`Up Sampling F1` = `F`)
xg_f_selection <- xg_F_model$results |>
select(max_depth, eta, `F`) |>
rename(`Base F1` = `F`)
xg_consolidated <- xg_up_selection |>
left_join(xg_f_selection, by = join_by(max_depth, eta))
generate_sampling_table(xg_consolidated)
| max_depth | eta | Up Sampling F1 | Base F1 |
|---|---|---|---|
| 3 | 0.05 | 0.457 | 0.298 |
| 3 | 0.10 | 0.464 | 0.334 |
| 3 | 0.20 | 0.464 | 0.356 |
| 3 | 0.30 | 0.464 | 0.36 |
| 4 | 0.05 | 0.463 | 0.332 |
| 4 | 0.10 | 0.468 | 0.348 |
| 4 | 0.20 | 0.465 | 0.372 |
| 4 | 0.30 | 0.459 | 0.372 |
| 5 | 0.05 | 0.466 | 0.349 |
| 5 | 0.10 | 0.472 | 0.367 |
| 5 | 0.20 | 0.466 | 0.375 |
| 5 | 0.30 | 0.453 | 0.38 |
When evaluated against the test set we get a very interesting result: the uplift in F1 score over the base sample is significant, but there is a marked decline in accuracy. While noteworthy, I would argue this trade off is worth it for the better recall of the up sampling model.
set.seed(8675309)
xg_up_test <- fetch_metrics(xg_up_model, testing)
xg_sampling_comparison <- data.frame(xg_f_test, xg_up_test) |>
rename(`Base` = xg_f_test, `Up Sampling` = xg_up_test)
generate_test_table(xg_sampling_comparison, "XGBoost Up Sampling Performance on Test Set")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| Base | 0.899 | 0.384 | 0.615 | 0.279 |
| Up Sampling | 0.846 | 0.483 | 0.389 | 0.638 |
For the Random Forest and XGBoost models we concluded that the optimal model was the one selected based on F1 score and trained with up sampling of the minority class; for CART the base sample model was preferred, though its up sampled counterpart is included below for comparison. These experiments highlighted the value of including multiple metrics in the decision making process and the importance of understanding your data. We end with a comparison of the best performing models from each experiment, with the Random Forest model having the best F1 score as well as accuracy due to a better balance between precision and recall than the other models.
set.seed(8675309)
final_comparison <- data.frame(cart_up_metrics, rf_full_up, xg_up_test) |>
rename(CART = cart_up_metrics, `Random Forest` = rf_full_up, XGBoost = xg_up_test)
generate_test_table(final_comparison, "Optimal Model Comparison")
| Model | Accuracy | F | Precision | Recall |
|---|---|---|---|---|
| CART | 0.833 | 0.470 | 0.366 | 0.656 |
| Random Forest | 0.868 | 0.503 | 0.437 | 0.592 |
| XGBoost | 0.846 | 0.483 | 0.389 | 0.638 |
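One housekeeping step not shown in the original: once the final models have been trained, the parallel cluster registered in the setup block should be shut down. A minimal teardown sketch:
# Release the workers and return foreach to sequential execution
stopCluster(cl)
registerDoSEQ()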