The bank data from Assignments 1 and 2 is used for this assignment.
# Import required libraries and data
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.7 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.0 ✔ yardstick 1.3.2
## ✔ recipes 1.1.1
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'dials' was built under R version 4.3.3
## Warning: package 'modeldata' was built under R version 4.3.3
## Warning: package 'parsnip' was built under R version 4.3.3
## Warning: package 'recipes' was built under R version 4.3.3
## Warning: package 'tune' was built under R version 4.3.3
## Warning: package 'workflows' was built under R version 4.3.3
## Warning: package 'yardstick' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
## The following object is masked from 'package:base':
##
## Recall
url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/bank-full.csv'
data_import <- read.csv(url, sep = ';')
bank_data <- data_import
print(head(bank_data))
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
The same data transformations are used here as were used in Assignment 2. The following transformations were performed:

- The y column (the target variable) was converted from a character column to a factor column.
- The categorical columns (job, marital, contact, and poutcome) were converted from character columns to factor columns.
- The binary columns (default, housing, and loan) that only had values of “no” and “yes” were converted to 0s and 1s, respectively.
- The education variable was converted from a character column to an ordinal variable with the levels being “primary”, “secondary”, and “tertiary”. Observations with “unknown” recorded for the education variable were removed.
- The month variable was converted from a character column to a numeric column.

# Convert y column to factor
bank_data$y <- factor(bank_data$y)
# Convert categorical columns to factor
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$contact <- factor(bank_data$contact)
bank_data$poutcome <- factor(bank_data$poutcome)
# Convert binary columns to 0s for "no" and 1s for "yes"
bank_data <- bank_data |>
  mutate(default = case_when(default == 'no' ~ 0, default == 'yes' ~ 1)) |>
  mutate(housing = case_when(housing == 'no' ~ 0, housing == 'yes' ~ 1)) |>
  mutate(loan = case_when(loan == 'no' ~ 0, loan == 'yes' ~ 1))
# Convert education column to ordinal
bank_data <- bank_data |> filter(education != 'unknown')
bank_data$education <- factor(bank_data$education, levels = c('primary', 'secondary', 'tertiary'), ordered = TRUE)
# Convert months to numeric values
bank_data <- bank_data |>
  mutate(month = case_when(month == 'jan' ~ 1,
                           month == 'feb' ~ 2,
                           month == 'mar' ~ 3,
                           month == 'apr' ~ 4,
                           month == 'may' ~ 5,
                           month == 'jun' ~ 6,
                           month == 'jul' ~ 7,
                           month == 'aug' ~ 8,
                           month == 'sep' ~ 9,
                           month == 'oct' ~ 10,
                           month == 'nov' ~ 11,
                           month == 'dec' ~ 12))
# Set random seed for reproducibility
set.seed(31415)
# Create test/train split
bank_split <- initial_split(bank_data, prop = 0.8)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
The helper function to generate evaluation metrics for the models is re-used from Assignment 2.
# Define function to calculate performance metrics for models
metrics_function <- function(preds, actual) {
  # Create confusion matrix
  results <- as.data.frame(table(preds, actual)) |>
    arrange(preds, actual) |>
    pivot_wider(id_cols = 'preds', names_from = 'actual', values_from = 'Freq')
  # Extract the number of true positives, false positives, true negatives, and false negatives from the confusion matrix
  tp <- as.numeric(results[2,3])
  fp <- as.numeric(results[2,2])
  tn <- as.numeric(results[1,2])
  fn <- as.numeric(results[1,3])
  # Calculate performance metrics and return them as a named vector
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- (2 * precision * recall) / (precision + recall)
  metrics_dict <- c('precision' = precision, 'recall' = recall, 'f1' = f1)
  return(metrics_dict)
}
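Before re-using this helper, a quick sanity check on made-up vectors (an addition for illustration, not assignment data) confirms it returns the expected values; the vectors below are hypothetical and chosen so the answers are easy to verify by hand.
# Sanity check on hypothetical vectors: expect precision = 2/3, recall = 1, F1 = 0.8
check_preds <- factor(c('no', 'yes', 'yes', 'no', 'yes'))
check_actual <- factor(c('no', 'yes', 'no', 'no', 'yes'))
metrics_function(check_preds, check_actual)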
I will be using a tune grid for these models and want to prioritize the F1-Score as the target metric, so I need a summary function that computes the F1-Score specifically, which the train function can use to select the best tune.
# Custom summary function that returns the F1-Score so caret can optimize for it
f1_optimizer <- function(results, lev = NULL, model = NULL) {
  return(c(F1 = F1_Score(y_pred = results$pred,
                         y_true = results$obs,
                         positive = lev[2])))
}
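As a quick check (an addition for illustration, not model output), the summary function can be applied to a small mock data frame with the obs and pred columns that caret supplies to a summaryFunction; the values below are made up.
# Mock hold-out results (hypothetical values) in the shape caret passes to a summaryFunction
mock_results <- data.frame(obs = factor(c('no', 'yes', 'yes', 'no')),
                           pred = factor(c('no', 'yes', 'no', 'no')))
# With positive class 'yes': precision = 1, recall = 0.5, so F1 should be 2/3
f1_optimizer(mock_results, lev = levels(mock_results$obs))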
I trained two different SVM models to compare to the results from Assignment 2, one with a linear kernel and one with an RBF kernel. For each model, I tested C values of 0.1, 1, and 10. For the radial SVM model, another tuning option was the sigma hyper-parameter; for this one I tried 0.001, 0.01, and 0.1. Each model was trained using 5-fold cross-validation.
# Establish hyper-parameter tuning options
svm1_grid <- expand.grid(C = c(0.1, 1, 10))
# Train model
svm_1 <- train(y ~ .,
               data = bank_train,
               method = 'svmLinear',
               tuneGrid = svm1_grid,
               trControl = trainControl(summaryFunction = f1_optimizer,
                                        method = 'cv',
                                        number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
# Use model to predict test set values
svm1_predictions <- predict(svm_1, newdata = bank_test |> select(-y))
# Report results
table(svm1_predictions, bank_test$y)
##
## svm1_predictions no yes
## no 7530 842
## yes 96 203
svm1_metrics <- metrics_function(svm1_predictions, bank_test$y)
print(svm1_metrics)
## precision recall f1
## 0.6789298 0.1942584 0.3020833
This model was very conservative when predicting positive outcomes, resulting in a very small number of false positives (and therefore a high precision score) but a very large number of false negatives (and therefore a very low recall score). These balanced out to a relatively low F1-Score by the standards of the models trained for Assignment 2. The best tune came from C = 0.1.
print(svm_1$bestTune)
## C
## 1 0.1
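As a cross-check of the helper function's arithmetic (an addition, not part of the original output), the same metrics can be recomputed directly from the counts in the confusion matrix above.
# Recompute the linear SVM metrics from the printed confusion matrix counts
tp_check <- 203
fp_check <- 96
fn_check <- 842
precision_check <- tp_check / (tp_check + fp_check) # 203 / 299, approximately 0.679
recall_check <- tp_check / (tp_check + fn_check) # 203 / 1045, approximately 0.194
f1_check <- (2 * precision_check * recall_check) / (precision_check + recall_check) # approximately 0.302
c(precision = precision_check, recall = recall_check, f1 = f1_check)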
# Establish hyper-parameter tuning options
svm2_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.001, 0.01, 0.1))
# Train model
svm_2 <- train(y ~ .,
               data = bank_train,
               method = 'svmRadial',
               tuneGrid = svm2_grid,
               trControl = trainControl(summaryFunction = f1_optimizer,
                                        method = 'cv',
                                        number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
# Use model to predict test set values
svm2_predictions <- predict(svm_2, newdata = bank_test |> select(-y))
# Report results
table(svm2_predictions, bank_test$y)
##
## svm2_predictions no yes
## no 7296 657
## yes 330 388
svm2_metrics <- metrics_function(svm2_predictions, bank_test$y)
print(svm2_metrics)
## precision recall f1
## 0.5403900 0.3712919 0.4401588
This model was less conservative than the one with a linear kernel, resulting in significantly improved recall at the cost of precision. The trade-off is favorable, however, with a significant improvement in F1-Score. The best tune for this model had the hyper-parameters sigma = 0.1 and C = 10.
print(svm_2$bestTune)
## sigma C
## 9 0.1 10
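To make the comparison between the two kernels explicit, the metric vectors computed above can be stacked side by side; this snippet is an addition for readability and only reuses objects already created.
# Combine the test-set metrics of both SVM models for a side-by-side comparison
svm_comparison <- rbind(linear = svm1_metrics, radial = svm2_metrics)
print(svm_comparison)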