Data Management

Data Acquisition

The bank data from Assignment 1 is used for this assignment.

# Import required libraries and data
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'dials' was built under R version 4.3.3
## Warning: package 'modeldata' was built under R version 4.3.3
## Warning: package 'parsnip' was built under R version 4.3.3
## Warning: package 'recipes' was built under R version 4.3.3
## Warning: package 'tune' was built under R version 4.3.3
## Warning: package 'workflows' was built under R version 4.3.3
## Warning: package 'yardstick' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(party)
## Warning: package 'party' was built under R version 4.3.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.3.3
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## 
## The following object is masked from 'package:workflows':
## 
##     fit
## 
## The following object is masked from 'package:tune':
## 
##     parameters
## 
## The following object is masked from 'package:parsnip':
## 
##     fit
## 
## The following object is masked from 'package:infer':
## 
##     fit
## 
## The following object is masked from 'package:dials':
## 
##     parameters
## 
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.3.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.3.3
## 
## Attaching package: 'strucchange'
## 
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## 
## Attaching package: 'party'
## 
## The following object is masked from 'package:dplyr':
## 
##     where
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.3
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/bank-full.csv'
data_import <- read.csv(url, sep = ';')
bank_data <- data_import
print(head(bank_data))
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

Data Preparation

To prepare the data, the following transformations were performed:

  • The y column (the target variable) was converted from a character column to a factor column.
  • The categorical predictors (job, marital, contact, and poutcome) were converted from character columns to factor columns.
  • The variables (default, housing, and loan) that only had values of “no” and “yes” were converted to 0s and 1s, respectively.
  • The education variable was converted from a character column to an ordered factor with levels “primary” < “secondary” < “tertiary”. Observations with “unknown” recorded for the education variable were removed.
  • The month variable was converted from a character column to a numeric column.
# Convert y column to factor
bank_data$y <- factor(bank_data$y)

# Convert categorical columns to factor
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$contact <- factor(bank_data$contact)
bank_data$poutcome <- factor(bank_data$poutcome)

# Convert binary columns to 0s for "no" and 1s for "yes"
bank_data <- bank_data |>
  mutate(default = case_when(default == 'no' ~ 0, default == 'yes' ~ 1)) |>
  mutate(housing = case_when(housing == 'no' ~ 0, housing == 'yes' ~ 1)) |>
  mutate(loan = case_when(loan == 'no' ~ 0, loan == 'yes' ~ 1))

# Convert education column to ordinal
bank_data <- bank_data |> filter(education != 'unknown')
bank_data$education <- factor(bank_data$education, levels = c('primary', 'secondary', 'tertiary'), ordered = TRUE)

# Convert months to numeric values
bank_data <- bank_data |> 
  mutate(month = case_when(month == 'jan' ~ 1,
                           month == 'feb' ~ 2,
                           month == 'mar' ~ 3,
                           month == 'apr' ~ 4,
                           month == 'may' ~ 5,
                           month == 'jun' ~ 6,
                           month == 'jul' ~ 7,
                           month == 'aug' ~ 8,
                           month == 'sep' ~ 9,
                           month == 'oct' ~ 10,
                           month == 'nov' ~ 11,
                           month == 'dec' ~ 12))

# Set random seed for reproducibility
set.seed(31415)

# Create test/train split
bank_split <- initial_split(bank_data, prop = 0.8)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)

No Unknowns

My first experiment with each model was to remove the unknown values. I chose this approach because, if removing unknown values improves model performance, it might benefit the bank to invest in more thorough data collection in the future so that the data is complete; if unknown values do not hurt model performance, that expense may not be necessary.

Removing unknown values required filtering the training and testing sets to remove the observations with “unknown” recorded for the job and contact variables. In Assignment 1, I determined that “unknown” values were recorded for the job variable in 0.64% of observations (288 cases) and for the contact variable in 28.8% of observations (13,020 cases). Additionally, the poutcome variable had “unknown” recorded for 81.75% of observations, so that variable was removed entirely. After filtering the training and testing sets, I verified that the test set was still approximately 20% of the observations.

# Filter training set to remove unknown values
bank_train_exp1 <- bank_train |>
  filter(job != 'unknown') |>
  filter(contact != 'unknown') |>
  select(-poutcome)

# Filter testing set to remove unknown values
bank_test_exp1 <- bank_test |>
  filter(job != 'unknown') |>
  filter(contact != 'unknown') |>
  select(-poutcome)

# Confirm that the testing set is still approximately 20% of the total data
print(nrow(bank_test_exp1) / (nrow(bank_train_exp1) + nrow(bank_test_exp1)))
## [1] 0.2008283

Evaluation Metrics

To evaluate the results of my experiments, I created a helper function to calculate the precision, recall, and F1-score of each model.

# Define function to calculate performance metrics for models
metrics_function <- function(preds, actual) {
  
  # Create confusion matrix
  results <- as.data.frame(table(preds, actual)) |>
    arrange(preds, actual) |>
    pivot_wider(id_cols = 'preds', names_from = 'actual', values_from = 'Freq')
  
  # Extract number of true positives, false positives, true negatives, and false negatives from the confusion matrix
  tp <- as.numeric(results[2,3])
  fp <- as.numeric(results[2,2])
  tn <- as.numeric(results[1,2])
  fn <- as.numeric(results[1,3])
  
  # Calculate performance metrics and return them as a named vector
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- (2 * precision * recall) / (precision + recall)
  metrics_dict <- c('precision' = precision, 'recall' = recall, 'f1' = f1)
  return(metrics_dict)
  
}

Depending on the bank’s objectives and priorities, the best metric for evaluating models might change. If the bank is constrained on resources and does not want to risk “wasting” them by contacting customers who are unlikely to subscribe, then it needs to reduce false positives and should focus on maximizing precision. If, on the other hand, the bank wishes to maximize the number of subscriptions and does not want to risk missing interested customers, then it needs to reduce false negatives and should focus on maximizing recall. If it wants to strike a balance between these risks, the F1-score is a useful metric, particularly because the classes are imbalanced (from Assignment 1, “no” responses account for approximately 88% of observations).
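
To make the imbalance concrete, the sketch below (not part of the experiments; it reuses the train/test split and the metrics_function defined above) checks the class balance in the training split and scores a naive model that predicts “yes” for every customer. Such a model achieves perfect recall but very low precision, which is why precision and the F1-score matter here.

# Check the class balance in the training split
bank_train |> count(y) |> mutate(prop = n / sum(n))

# Naive baseline: predict "yes" for every customer in the test set
naive_predictions <- factor(rep('yes', nrow(bank_test)), levels = levels(bank_test$y))
metrics_function(naive_predictions, bank_test$y)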

Decision Tree Models

Initial Model

The first Decision Tree model was trained on the prepared training set (unknown values retained) with a maximum depth of 5. Subsequent Decision Tree models were compared against this baseline to evaluate the experiments.

# Train model
tree_init_model <- ctree(y ~ ., data = bank_train, control = ctree_control(maxdepth = 5))

# Use model to predict test set values
tree_init_predictions <- predict(tree_init_model, bank_test)

# Report results
table(tree_init_predictions, bank_test$y)
##                      
## tree_init_predictions   no  yes
##                   no  7407  663
##                   yes  219  382
dt_init_metrics <- metrics_function(tree_init_predictions, bank_test$y)
print(dt_init_metrics)
## precision    recall        f1 
## 0.6356073 0.3655502 0.4641555

No Unknowns

For the first Decision Tree experiment, a new model was trained on the data set with no unknown values.

# Train model
tree_exp1_model <- ctree(y ~ ., data = bank_train_exp1, control = ctree_control(maxdepth = 5))

# Use model to predict test set values
tree_exp1_predictions <- predict(tree_exp1_model, bank_test_exp1)

# Report results
table(tree_exp1_predictions, bank_test_exp1$y)
##                      
## tree_exp1_predictions   no  yes
##                   no  4935  561
##                   yes  326  385
tree_exp1_metrics <- metrics_function(tree_exp1_predictions, bank_test_exp1$y)
print(tree_exp1_metrics)
## precision    recall        f1 
## 0.5414909 0.4069767 0.4646952

This experiment reduced precision but increased recall; in other words, the new model is somewhat more aggressive about predicting “yes”, producing fewer false negatives but more false positives. These changes largely offset each other, leaving the F1-score nearly identical to the original model’s. Whether this experiment improved the model therefore depends on the bank’s goals, specifically whether reducing false positives or false negatives matters more.
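
For a direct side-by-side view, the two sets of metrics computed above can be stacked into one small table (a quick sketch):

# Compare the initial and no-unknowns Decision Tree models side by side
round(rbind(initial = dt_init_metrics, no_unknowns = tree_exp1_metrics), 4)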

Hyper-Parameter Tuning

For my second experiment with Decision Tree models, I tried different tree depths to determine whether there is a risk of over-fitting once the model reaches a certain level of complexity.

# Create empty lists to store metrics for each iteration
tree_exp2_p_list <- c()
tree_exp2_r_list <- c()
tree_exp2_f_list <- c()

# Train a model on the no-unknowns training set, predict the full test set, and record metrics for tree depths from 2 to 100
for (i in 2:100) {
  tree_exp2_model <- ctree(y ~ ., data = bank_train_exp1, control = ctree_control(maxdepth = i))
  
  tree_exp2_predictions <- predict(tree_exp2_model, newdata = bank_test |> select(-y))
  tree_exp2_metrics <- metrics_function(tree_exp2_predictions, bank_test$y)
  tree_exp2_p_list <- append(tree_exp2_p_list, tree_exp2_metrics['precision'])
  tree_exp2_r_list <- append(tree_exp2_r_list, tree_exp2_metrics['recall'])
  tree_exp2_f_list <- append(tree_exp2_f_list, tree_exp2_metrics['f1'])
}

# Visualize results
tree_exp2_results <- data.frame(tree_depth = 2:100,
                                precision = tree_exp2_p_list,
                                recall = tree_exp2_r_list,
                                f1score = tree_exp2_f_list)

tree_exp2_results_long <- tree_exp2_results |>
  pivot_longer(!tree_depth, names_to = 'metric', values_to = 'value')

ggplot(tree_exp2_results_long, aes(x = tree_depth, y = value, color = metric)) + geom_line()

print(tree_exp2_results |> filter(tree_depth <= 10))
##   tree_depth precision    recall   f1score
## 1          2 0.4968652 0.3033493 0.3767083
## 2          3 0.4968652 0.3033493 0.3767083
## 3          4 0.5044118 0.3282297 0.3976812
## 4          5 0.5117318 0.4382775 0.4721649
## 5          6 0.5169683 0.4373206 0.4738206
## 6          7 0.5237515 0.4114833 0.4608789
## 7          8 0.5111111 0.4622010 0.4854271
## 8          9 0.5111111 0.4622010 0.4854271
## 9         10 0.5111111 0.4622010 0.4854271

This experiment showed that beyond a depth of 8, none of the performance metrics change. Recall and F1-score are maximized at a depth of 8, while precision is maximized at a depth of 7. Once again, the optimal choice depends on the target performance metric, which in turn depends on the organization’s goals.
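
The depth that maximizes each metric can also be read off programmatically from the tree_exp2_results data frame built above (a sketch; ties resolve to the smallest depth because the rows are ordered by depth):

# Find the tree depth that maximizes each metric
tree_exp2_results |>
  pivot_longer(!tree_depth, names_to = 'metric', values_to = 'value') |>
  group_by(metric) |>
  slice_max(value, n = 1, with_ties = FALSE) |>
  ungroup()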

Random Forest Models

Initial Model

The first Random Forest model was trained on the prepared training set (unknown values retained) with 500 trees and mtry = 8. Subsequent Random Forest models were compared against this baseline to evaluate the experiments.

# Train model
rf_init_model <- randomForest(x = bank_train |> select(-y),
                              y = bank_train$y,
                              ntree = 500,
                              mtry = 8)

# Use model to predict test set values
rf_init_predictions <- predict(rf_init_model, newdata = bank_test |> select(-y))

# Report results
table(rf_init_predictions, bank_test$y)
##                    
## rf_init_predictions   no  yes
##                 no  7347  532
##                 yes  279  513
rf_init_metrics <- metrics_function(rf_init_predictions, bank_test$y)
print(rf_init_metrics)
## precision    recall        f1 
## 0.6477273 0.4909091 0.5585193

No Unknowns

For the first Random Forest model experiment, a new model was trained on the data set with no unknown values.

# Train model
rf_exp1_model <- randomForest(x = bank_train_exp1 |> select(-y),
                              y = bank_train_exp1$y,
                              ntree = 500,
                              mtry = 8)

# Use model to predict test set values
rf_exp1_predictions <- predict(rf_exp1_model, newdata = bank_test_exp1 |> select(-y))

# Report results
table(rf_exp1_predictions, bank_test_exp1$y)
##                    
## rf_exp1_predictions   no  yes
##                 no  4994  481
##                 yes  267  465
rf_exp1_metrics <- metrics_function(rf_exp1_predictions, bank_test_exp1$y)
print(rf_exp1_metrics)
## precision    recall        f1 
## 0.6352459 0.4915433 0.5542312

As with the Decision Tree models, this experiment reduced precision and increased recall. Unlike with the Decision Tree models, however, the effects were extremely small. This suggests that the Random Forest models handle unknown values better than the Decision Tree models do, since removing those observations did not meaningfully change performance.

Hyper-Parameter Tuning

For my second experiment with Random Forest models, I tried 4 different combinations of the hyper-parameters ntree (the number of trees) and mtry (the number of variables randomly sampled as split candidates at each node). The initial model used 500 trees and half of the 16 predictors (mtry = 8). For the experiments I selected a lower and a higher value for each hyper-parameter: ntree = 250 and ntree = 750 for the number of trees, and mtry = 4 and mtry = 12 for the number of candidate variables.

# Define low and high values for both hyper-parameters
ntree_list <- c(500, 250, 250, 750, 750)
mtry_list <- c(8, 4, 12, 4, 12)

# Create lists to store metrics for each combination of hyper-parameters
rf_exp2_p_list <- c(rf_init_metrics['precision'])
rf_exp2_r_list <- c(rf_init_metrics['recall'])
rf_exp2_f_list <- c(rf_init_metrics['f1'])

# Train model, predict test set values, and report results for each combination of hyper-parameters
for (i in 2:5) {
  rf_exp2_model <- randomForest(x = bank_train |> select(-y),
                                y = bank_train$y,
                                ntree = ntree_list[i],
                                mtry = mtry_list[i])
  
  rf_exp2_predictions <- predict(rf_exp2_model, newdata = bank_test |> select(-y))
  rf_exp2_metrics <- metrics_function(rf_exp2_predictions, bank_test$y)
  rf_exp2_p_list <- append(rf_exp2_p_list, rf_exp2_metrics['precision'])
  rf_exp2_r_list <- append(rf_exp2_r_list, rf_exp2_metrics['recall'])
  rf_exp2_f_list <- append(rf_exp2_f_list, rf_exp2_metrics['f1'])
}

# Summarize results
rf_exp2_results <- data.frame(ntree = ntree_list,
                              mtry = mtry_list,
                              precision = rf_exp2_p_list,
                              recall = rf_exp2_r_list,
                              f1score = rf_exp2_f_list)

print(rf_exp2_results)
##   ntree mtry precision    recall   f1score
## 1   500    8 0.6477273 0.4909091 0.5585193
## 2   250    4 0.6752137 0.4535885 0.5426445
## 3   250   12 0.6342365 0.4928230 0.5546581
## 4   750    4 0.6652174 0.4392344 0.5291066
## 5   750   12 0.6360294 0.4966507 0.5577646

In this experiment, precision was maximized when both hyper-parameters were low (ntree = 250, mtry = 4), while recall was maximized when both were high (ntree = 750, mtry = 12). F1-score was maximized with the original middle values of ntree = 500 and mtry = 8.
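
If a finer search were desired, the same training loop could be run over a full grid of the two hyper-parameters rather than just the four corner combinations; a minimal sketch of how such a grid could be built (the values are illustrative assumptions, and the grid is not run here):

# Build a full grid of ntree and mtry values to feed into the training loop
rf_grid <- expand.grid(ntree = c(250, 500, 750), mtry = c(4, 8, 12))
print(rf_grid)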

XGBoost Models

Initial Model

The first XGBoost model was trained on the prepared training set (unknown values retained) using caret’s default tuning settings. Subsequent XGBoost models were compared against this baseline to evaluate the experiments.

# Train model
xgb_init_model <- train(y ~ ., data = bank_train, method = "xgbTree", verbosity = 0)

# Use model to predict test set values
xgb_init_predictions <- predict(xgb_init_model, newdata = bank_test |> select(-y))

# Report results
table(xgb_init_predictions, bank_test$y)
##                     
## xgb_init_predictions   no  yes
##                  no  7398  562
##                  yes  228  483
xgb_init_metrics <- metrics_function(xgb_init_predictions, bank_test$y)
print(xgb_init_metrics)
## precision    recall        f1 
## 0.6793249 0.4622010 0.5501139

No Unknowns

For the first XGBoost model experiment, a new model was trained on the data set with no unknown values.

# Train model
xgb_exp1_model <- train(y ~ ., data = bank_train_exp1, method = "xgbTree", verbosity = 0)

# Use model to predict test set values
xgb_exp1_predictions <- predict(xgb_exp1_model, newdata = bank_test_exp1 |> select(-y))

# Report results
table(xgb_exp1_predictions, bank_test_exp1$y)
##                     
## xgb_exp1_predictions   no  yes
##                  no  5019  504
##                  yes  242  442
xgb_exp1_metrics <- metrics_function(xgb_exp1_predictions, bank_test_exp1$y)
print(xgb_exp1_metrics)
## precision    recall        f1 
## 0.6461988 0.4672304 0.5423313

As with the Decision Tree and Random Forest models, this experiment reduced precision and increased recall. However, as with the Random Forest models, the impact of removing observations with unknown values was very small, which suggests that the XGBoost models also handle unknown values well.

Cross-Validation

For my second experiment with XGBoost models, I trained a model using 5-fold cross-validation. I predicted that this would improve the model by making hyper-parameter selection less sensitive to any single split of the training data.

# Train model
xgb_exp2_model <- train(y ~ .,
                        data = bank_train,
                        method = "xgbTree",
                        verbosity = 0,
                        trControl = trainControl(method = 'cv', number = 5))


# Use model to predict test set values
xgb_exp2_predictions <- predict(xgb_exp2_model, newdata = bank_test |> select(-y))

# Report results
table(xgb_exp2_predictions, bank_test$y)
##                     
## xgb_exp2_predictions   no  yes
##                  no  7386  551
##                  yes  240  494
xgb_exp2_metrics <- metrics_function(xgb_exp2_predictions, bank_test$y)
print(xgb_exp2_metrics)
## precision    recall        f1 
## 0.6730245 0.4727273 0.5553682

Like experiment one, this experiment decreased precision relative to the initial model while increasing recall. Unlike experiment one, the trade-off here was favorable, with the F1-score also increasing. Because the F1-score balances precision and recall, and because recall seems the more useful metric in this scenario (the opportunity cost of a false negative likely outweighs the cost of a false positive), I would call this the best of the three XGBoost models.
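
As a closing summary, the three XGBoost models can be laid out side by side using the metric vectors computed above (a sketch; note that the no-unknowns model is scored on the filtered test set, so its metrics are not computed on exactly the same observations as the other two):

# Combine the metrics of the three XGBoost models into one table
xgb_comparison <- data.frame(
  model = c('initial', 'no unknowns', '5-fold cross-validation'),
  rbind(xgb_init_metrics, xgb_exp1_metrics, xgb_exp2_metrics),
  row.names = NULL
)
print(xgb_comparison)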