The bank data from Assignment 1 is used for this assignment.
# Import required libraries and data
library(tidyverse)
library(tidymodels)
library(party)
library(randomForest)
library(xgboost)
library(caret)
url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/bank-full.csv'
data_import <- read.csv(url, sep = ';')
bank_data <- data_import
print(head(bank_data))
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
To prepare the data, the following transformations were performed:
- The y column (the target variable) was converted from a character column to a factor column.
- The categorical columns (job, marital, contact, and poutcome) were converted from character columns to factor columns.
- The binary columns (default, housing, and loan), which only had values of “no” and “yes”, were converted to 0s and 1s, respectively.
- The education variable was converted from a character column to an ordinal variable with the levels “primary”, “secondary”, and “tertiary”. Observations with “unknown” recorded for the education variable were removed.
- The month variable was converted from a character column to a numeric column.
# Convert y column to factor
bank_data$y <- factor(bank_data$y)
# Convert categorical columns to factor
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$contact <- factor(bank_data$contact)
bank_data$poutcome <- factor(bank_data$poutcome)
# Convert binary columns to 0s for "no" and 1s for "yes"
bank_data <- bank_data |>
  mutate(default = case_when(default == 'no' ~ 0, default == 'yes' ~ 1)) |>
  mutate(housing = case_when(housing == 'no' ~ 0, housing == 'yes' ~ 1)) |>
  mutate(loan = case_when(loan == 'no' ~ 0, loan == 'yes' ~ 1))
# Convert education column to ordinal
bank_data <- bank_data |> filter(education != 'unknown')
bank_data$education <- factor(bank_data$education, levels = c('primary', 'secondary', 'tertiary'), ordered = TRUE)
# Convert months to numeric values
bank_data <- bank_data |>
  mutate(month = case_when(month == 'jan' ~ 1,
                           month == 'feb' ~ 2,
                           month == 'mar' ~ 3,
                           month == 'apr' ~ 4,
                           month == 'may' ~ 5,
                           month == 'jun' ~ 6,
                           month == 'jul' ~ 7,
                           month == 'aug' ~ 8,
                           month == 'sep' ~ 9,
                           month == 'oct' ~ 10,
                           month == 'nov' ~ 11,
                           month == 'dec' ~ 12))
# Set random seed for reproducibility
set.seed(31415)
# Create test/train split
bank_split <- initial_split(bank_data, prop = 0.8)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
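One optional refinement, not used in the rest of this analysis, would be to stratify the split on the target variable so that the roughly 88/12 split between “no” and “yes” responses is preserved in both sets; rsample’s initial_split() accepts a strata argument for this. A minimal sketch:
# Optional stratified split (not used below): preserve the y class balance in both sets
bank_split_strat <- initial_split(bank_data, prop = 0.8, strata = y)
bank_train_strat <- training(bank_split_strat)
bank_test_strat <- testing(bank_split_strat)
# Compare the proportion of 'yes' responses in each set
mean(bank_train_strat$y == 'yes')
mean(bank_test_strat$y == 'yes')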
My first experiment with each model was to remove the unknown values. This approach was selected because if removing unknown values improves model performance, then it might benefit the bank to invest resources into more sophisticated data collection in the future to ensure that data is complete. However, if unknown values do not hurt model performance, then this expense may not be necessary.
Removing unknown values required filtering the training and testing sets to remove the observations with “unknown” recorded for the job and contact variables. In Assignment 1, I determined that “unknown” values were recorded for the job variable in 0.64% of observations (288 cases) and for the contact variable in 28.8% of observations (13,020 cases). Additionally, the poutcome variable had “unknown” values recorded for 81.75% of observations, so that variable was removed entirely. After filtering the training and testing sets, I verified that the test set was still approximately 20% of the observations.
# Filter training set to remove unknown values
bank_train_exp1 <- bank_train |>
  filter(job != 'unknown') |>
  filter(contact != 'unknown') |>
  select(-poutcome)
# Filter testing set to remove unknown values
bank_test_exp1 <- bank_test |>
  filter(job != 'unknown') |>
  filter(contact != 'unknown') |>
  select(-poutcome)
# Confirm that the testing set is still approximately 20% of the total data
print(nrow(bank_test_exp1) / (nrow(bank_train_exp1) + nrow(bank_test_exp1)))
## [1] 0.2008283
To evaluate the results of my experiments, I created a helper function to calculate the precision, recall, and F1-score of each model.
# Define function to calculate performance metrics for models
metrics_function <- function(preds, actual) {
  # Create confusion matrix
  results <- as.data.frame(table(preds, actual)) |>
    arrange(preds, actual) |>
    pivot_wider(id_cols = 'preds', names_from = 'actual', values_from = 'Freq')
  # Extract number of true positives, false positives, true negatives, and false negatives from the confusion matrix
  tp <- as.numeric(results[2,3])
  fp <- as.numeric(results[2,2])
  tn <- as.numeric(results[1,2])
  fn <- as.numeric(results[1,3])
  # Calculate performance metrics and return them as a named vector
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- (2 * precision * recall) / (precision + recall)
  metrics_dict <- c('precision' = precision, 'recall' = recall, 'f1' = f1)
  return(metrics_dict)
}
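As an optional cross-check on metrics_function(), the same three metrics can be computed with yardstick's vector functions (loaded with tidymodels). This sketch assumes “yes” is the positive class; because yardstick treats the first factor level as the event by default, event_level = 'second' is required here.
# Optional cross-check using yardstick; 'yes' is the second level of the y factor
check_metrics <- function(preds, actual) {
  c('precision' = precision_vec(actual, preds, event_level = 'second'),
    'recall' = recall_vec(actual, preds, event_level = 'second'),
    'f1' = f_meas_vec(actual, preds, event_level = 'second'))
}
# check_metrics(preds, actual) should return the same values as metrics_function(preds, actual)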
Depending on the objectives and priorities of the bank, the best metric for evaluating models might change. If the bank is constrained on resources and does not want to risk “wasting” them by contacting customers who are unlikely to be interested, then it needs to reduce false positives and should focus on maximizing precision. On the other hand, if the bank wishes to maximize the number of subscriptions and does not want to risk missing interested customers, then it needs to reduce false negatives and should focus on maximizing recall. If it wishes to strike a balance between these risks, the F1-score is a useful summary metric: because the classes are imbalanced (from Assignment 1, “no” responses account for approximately 88% of observations), overall accuracy would be misleading, while the F1-score balances precision and recall on the minority “yes” class.
The first Decision Tree model was trained on the baseline data set (all prepared observations, unknown values retained) with a maximum depth of 5. Subsequent models were compared to this one to determine the outcome of the experiments.
# Train model
tree_init_model <- ctree(y ~ ., data = bank_train, control = ctree_control(maxdepth = 5))
# Use model to predict test set values
tree_init_predictions <- predict(tree_init_model, bank_test)
# Report results
table(tree_init_predictions, bank_test$y)
##
## tree_init_predictions no yes
## no 7407 663
## yes 219 382
dt_init_metrics <- metrics_function(tree_init_predictions, bank_test$y)
print(dt_init_metrics)
## precision recall f1
## 0.6356073 0.3655502 0.4641555
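As a quick sanity check, the same values can be reproduced by hand from the confusion matrix above, treating “yes” as the positive class:
# By-hand check using the counts from the confusion matrix above
tp <- 382; fp <- 219; fn <- 663
precision_check <- tp / (tp + fp)   # approximately 0.636
recall_check <- tp / (tp + fn)      # approximately 0.366
f1_check <- 2 * precision_check * recall_check / (precision_check + recall_check)   # approximately 0.464
c(precision_check, recall_check, f1_check)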
For the first Decision Tree experiment, a new model was trained on the data set with no unknown values.
# Train model
tree_exp1_model <- ctree(y ~ ., data = bank_train_exp1, control = ctree_control(maxdepth = 5))
# Use model to predict test set values
tree_exp1_predictions <- predict(tree_exp1_model, bank_test_exp1)
# Report results
table(tree_exp1_predictions, bank_test_exp1$y)
##
## tree_exp1_predictions no yes
## no 4935 561
## yes 326 385
tree_exp1_metrics <- metrics_function(tree_exp1_predictions, bank_test_exp1$y)
print(tree_exp1_metrics)
## precision recall f1
## 0.5414909 0.4069767 0.4646952
This experiment resulted in reduced precision but increased recall (in other words, the new model is a little more aggressive about predicting “yes” responses, producing fewer false negatives but more false positives). These changes largely balanced each other out, with the F1-score staying almost identical to that of the original model. Whether this experiment improved the model therefore depends on the bank’s goals, and on whether it is more interested in reducing false positives or false negatives.
For my second experiment with Decision Tree models, I tried different tree depths to determine whether there is a risk of over-fitting once the model reaches a certain level of complexity.
# Create empty lists to store metrics for each iteration
tree_exp2_p_list <- c()
tree_exp2_r_list <- c()
tree_exp2_f_list <- c()
# Train model, predict test set values, and report results for tree depths from 2 to 100
for (i in 2:100) {
  tree_exp2_model <- ctree(y ~ ., data = bank_train_exp1, control = ctree_control(maxdepth = i))
  tree_exp2_predictions <- predict(tree_exp2_model, newdata = bank_test |> select(-y))
  tree_exp2_metrics <- metrics_function(tree_exp2_predictions, bank_test$y)
  tree_exp2_p_list <- append(tree_exp2_p_list, tree_exp2_metrics['precision'])
  tree_exp2_r_list <- append(tree_exp2_r_list, tree_exp2_metrics['recall'])
  tree_exp2_f_list <- append(tree_exp2_f_list, tree_exp2_metrics['f1'])
}
# Visualize results
tree_exp2_results <- data.frame(tree_depth = 2:100,
                                precision = tree_exp2_p_list,
                                recall = tree_exp2_r_list,
                                f1score = tree_exp2_f_list)
tree_exp2_results_long <- tree_exp2_results |>
  pivot_longer(!tree_depth, names_to = 'metric', values_to = 'value')
ggplot(tree_exp2_results_long, aes(x = tree_depth, y = value, color = metric)) + geom_line()
print(tree_exp2_results |> filter(tree_depth <= 10))
## tree_depth precision recall f1score
## 1 2 0.4968652 0.3033493 0.3767083
## 2 3 0.4968652 0.3033493 0.3767083
## 3 4 0.5044118 0.3282297 0.3976812
## 4 5 0.5117318 0.4382775 0.4721649
## 5 6 0.5169683 0.4373206 0.4738206
## 6 7 0.5237515 0.4114833 0.4608789
## 7 8 0.5111111 0.4622010 0.4854271
## 8 9 0.5111111 0.4622010 0.4854271
## 9 10 0.5111111 0.4622010 0.4854271
This experiment showed that beyond a depth of 8, none of the performance metrics change, suggesting that the depth limit stops being the binding constraint on tree growth. Recall and F1-score are maximized at a depth of 8, while precision is maximized at a depth of 7. Once again, the optimal choice depends on the target performance metric, which will be based on the organizational goals.
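For convenience, the depth that maximizes each metric can also be pulled directly out of tree_exp2_results; this is just a reporting helper, not an additional experiment:
# Depth that maximizes each metric (with_ties = FALSE keeps a single row per metric)
tree_exp2_results |> slice_max(precision, n = 1, with_ties = FALSE)
tree_exp2_results |> slice_max(recall, n = 1, with_ties = FALSE)
tree_exp2_results |> slice_max(f1score, n = 1, with_ties = FALSE)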
The first Random Forest model was trained on the baseline data set. Subsequent models were compared to this one to determine the outcome of the experiments.
# Train model
rf_init_model <- randomForest(x = bank_train |> select(-y),
                              y = bank_train$y,
                              ntree = 500,
                              mtry = 8)
# Use model to predict test set values
rf_init_predictions <- predict(rf_init_model, newdata = bank_test |> select(-y))
# Report results
table(rf_init_predictions, bank_test$y)
##
## rf_init_predictions no yes
## no 7347 532
## yes 279 513
rf_init_metrics <- metrics_function(rf_init_predictions, bank_test$y)
print(rf_init_metrics)
## precision recall f1
## 0.6477273 0.4909091 0.5585193
For the first Random Forest model experiment, a new model was trained on the data set with no unknown values.
# Train model
rf_exp1_model <- randomForest(x = bank_train_exp1 |> select(-y),
                              y = bank_train_exp1$y,
                              ntree = 500,
                              mtry = 8)
# Use model to predict test set values
rf_exp1_predictions <- predict(rf_exp1_model, newdata = bank_test_exp1 |> select(-y))
# Report results
table(rf_exp1_predictions, bank_test_exp1$y)
##
## rf_exp1_predictions no yes
## no 4994 481
## yes 267 465
rf_exp1_metrics <- metrics_function(rf_exp1_predictions, bank_test_exp1$y)
print(rf_exp1_metrics)
## precision recall f1
## 0.6352459 0.4915433 0.5542312
As with the Decision Tree models, this experiment resulted in reduced precision and increased recall. However, unlike with the Decision Tree models, the effects were extremely small. This suggests that the Random Forest models handle unknown values better than the Decision Tree models do, since removing the observations with unknown values did not meaningfully change performance.
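One way to probe this further, though it was not part of the original experiments, would be to inspect the variable importance of the initial Random Forest: if contact and poutcome contribute relatively little to the splits, removing their “unknown” values would be expected to have only a small effect. A minimal sketch using randomForest's built-in importance measures:
# Mean decrease in Gini impurity for each predictor of the initial Random Forest model
importance(rf_init_model)
# The same information as a plot
varImpPlot(rf_init_model)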
For my second experiment with Random Forest models, I tried 4 different combinations of the hyper-parameters ntree (the number of trees) and mtry (the number of variables randomly sampled as split candidates at each node). The initial model used half of the variables (mtry = 8) and 500 trees. For my experiments I selected a lower and a higher value for each hyper-parameter: ntree = 250 and ntree = 750 for the number of trees, and mtry = 4 and mtry = 12 for the number of variables sampled at each split.
# Define low and high values for both hyper-parameters
ntree_list <- c(500, 250, 250, 750, 750)
mtry_list <- c(8, 4, 12, 4, 12)
# Create lists to store metrics for each combination of hyper-parameters
rf_exp2_p_list <- c(rf_init_metrics['precision'])
rf_exp2_r_list <- c(rf_init_metrics['recall'])
rf_exp2_f_list <- c(rf_init_metrics['f1'])
# Train model, predict test set values, and report results for each combination of hyper-parameters
for (i in 2:5) {
  rf_exp2_model <- randomForest(x = bank_train |> select(-y),
                                y = bank_train$y,
                                ntree = ntree_list[i],
                                mtry = mtry_list[i])
  rf_exp2_predictions <- predict(rf_exp2_model, newdata = bank_test |> select(-y))
  rf_exp2_metrics <- metrics_function(rf_exp2_predictions, bank_test$y)
  rf_exp2_p_list <- append(rf_exp2_p_list, rf_exp2_metrics['precision'])
  rf_exp2_r_list <- append(rf_exp2_r_list, rf_exp2_metrics['recall'])
  rf_exp2_f_list <- append(rf_exp2_f_list, rf_exp2_metrics['f1'])
}
# Summarize results
rf_exp2_results <- data.frame(ntree = ntree_list,
                              mtry = mtry_list,
                              precision = rf_exp2_p_list,
                              recall = rf_exp2_r_list,
                              f1score = rf_exp2_f_list)
print(rf_exp2_results)
## ntree mtry precision recall f1score
## 1 500 8 0.6477273 0.4909091 0.5585193
## 2 250 4 0.6752137 0.4535885 0.5426445
## 3 250 12 0.6342365 0.4928230 0.5546581
## 4 750 4 0.6652174 0.4392344 0.5291066
## 5 750 12 0.6360294 0.4966507 0.5577646
This experiment resulted in precision being maximized when both hyper-parameters were low (ntree = 250, mtry = 4), while recall was maximized when both hyper-parameters were high (ntree = 750, mtry = 12). F1-score, appropriately, was maximized with the original middle values of ntree = 500 and mtry = 8.
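If the bank wanted to map this trade-off more systematically, the same pattern could be extended from the four corner combinations to a full grid of candidate values. The sketch below follows the loop structure used above; the candidate values are illustrative and the run time grows with the size of the grid:
# Sketch: evaluate a full grid of candidate ntree and mtry values (values are illustrative)
rf_grid <- expand.grid(ntree = c(250, 500, 750), mtry = c(4, 8, 12))
rf_grid_results <- data.frame()
for (i in 1:nrow(rf_grid)) {
  grid_model <- randomForest(x = bank_train |> select(-y),
                             y = bank_train$y,
                             ntree = rf_grid$ntree[i],
                             mtry = rf_grid$mtry[i])
  grid_preds <- predict(grid_model, newdata = bank_test |> select(-y))
  grid_metrics <- metrics_function(grid_preds, bank_test$y)
  rf_grid_results <- rbind(rf_grid_results,
                           data.frame(ntree = rf_grid$ntree[i],
                                      mtry = rf_grid$mtry[i],
                                      precision = grid_metrics['precision'],
                                      recall = grid_metrics['recall'],
                                      f1score = grid_metrics['f1']))
}
print(rf_grid_results |> arrange(desc(f1score)))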
The first XGBoost model was trained on the baseline data set. Subsequent models were compared to this one to determine the outcome of the experiments.
# Train model
xgb_init_model <- train(y ~ ., data = bank_train, method = "xgbTree", verbosity = 0)
# Use model to predict test set values
xgb_init_predictions <- predict(xgb_init_model, newdata = bank_test |> select(-y))
# Report results
table(xgb_init_predictions, bank_test$y)
##
## xgb_init_predictions no yes
## no 7398 562
## yes 228 483
xgb_init_metrics <- metrics_function(xgb_init_predictions, bank_test$y)
print(xgb_init_metrics)
## precision recall f1
## 0.6793249 0.4622010 0.5501139
For the first XGBoost model experiment, a new model was trained on the data set with no unknown values.
# Train model
xgb_exp1_model <- train(y ~ ., data = bank_train_exp1, method = "xgbTree", verbosity = 0)
# Use model to predict test set values
xgb_exp1_predictions <- predict(xgb_exp1_model, newdata = bank_test_exp1 |> select(-y))
# Report results
table(xgb_exp1_predictions, bank_test_exp1$y)
##
## xgb_exp1_predictions no yes
## no 5019 504
## yes 242 442
xgb_exp1_metrics <- metrics_function(xgb_exp1_predictions, bank_test_exp1$y)
print(xgb_exp1_metrics)
## precision recall f1
## 0.6461988 0.4672304 0.5423313
With both the Decision Tree models and the Random Forest models, this experiment resulted in reduced precision and increased recall, and that trend continued with the XGBoost model. However, as with the Random Forest models, the impact of removing observations with unknown values was very small, which suggests that the XGBoost models also handle these values well.
For my second experiment with XGBoost models, I tried training a model using 5-fold cross-validation for the hyper-parameter tuning. I predicted that this would improve the model by making the tuning less sensitive to any single resample of the training data (for example, one in which outliers are over-represented).
# Train model
xgb_exp2_model <- train(y ~ .,
                        data = bank_train,
                        method = "xgbTree",
                        verbosity = 0,
                        trControl = trainControl(method = 'cv', number = 5))
# Use model to predict test set values
xgb_exp2_predictions <- predict(xgb_exp2_model, newdata = bank_test |> select(-y))
# Report results
table(xgb_exp2_predictions, bank_test$y)
##
## xgb_exp2_predictions no yes
## no 7386 551
## yes 240 494
xgb_exp2_metrics <- metrics_function(xgb_exp2_predictions, bank_test$y)
print(xgb_exp2_metrics)
## precision recall f1
## 0.6730245 0.4727273 0.5553682
Like experiment one, this experiment decreased precision relative to the initial model, while increasing recall. Unlike experiment one, the balance in this case was favorable, with F1-Score also increasing. Since F1-Score is useful balance of Precision and Recall, and since recall seems more useful as a metric in this scenario (since the opportunity cost of a false negative outweighs the actual cost of a false positive), I would say this is the best of the three XGBoost models.