The bank data from Assignments 1 and 2 is used for this assignment.
# Import required libraries and data
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.7 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.0 ✔ yardstick 1.3.2
## ✔ recipes 1.1.1
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'dials' was built under R version 4.3.3
## Warning: package 'modeldata' was built under R version 4.3.3
## Warning: package 'parsnip' was built under R version 4.3.3
## Warning: package 'recipes' was built under R version 4.3.3
## Warning: package 'tune' was built under R version 4.3.3
## Warning: package 'workflows' was built under R version 4.3.3
## Warning: package 'yardstick' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
## The following object is masked from 'package:base':
##
## Recall
url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/bank-full.csv'
data_import <- read.csv(url, sep = ';')
bank_data <- data_import
print(head(bank_data))
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
The same data transformations used in Assignment 2 are applied here. The following transformations were performed:
- The y column (the target variable) was converted from a character column to a factor column.
- The categorical variables (job, marital, contact, and poutcome) were converted from character columns to factor columns.
- The binary variables (default, housing, and loan), which only had values of “no” and “yes”, were converted to 0s and 1s, respectively.
- The education variable was converted from a character column to an ordinal variable with the levels “primary”, “secondary”, and “tertiary”. Observations with “unknown” recorded for the education variable were removed.
- The month variable was converted from a character column to a numeric column.
# Convert y column to factor
bank_data$y <- factor(bank_data$y)
# Convert categorical columns to factor
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$contact <- factor(bank_data$contact)
bank_data$poutcome <- factor(bank_data$poutcome)
# Convert binary columns to 0s for "no" and 1s for "yes"
bank_data <- bank_data |>
mutate(default = case_when(default == 'no' ~ 0, default == 'yes' ~ 1)) |>
mutate(housing = case_when(housing == 'no' ~ 0, housing == 'yes' ~ 1)) |>
mutate(loan = case_when(loan == 'no' ~ 0, loan == 'yes' ~ 1))
# Convert education column to ordinal
bank_data <- bank_data |> filter(education != 'unknown')
bank_data$education <- factor(bank_data$education, levels = c('primary', 'secondary', 'tertiary'), ordered = TRUE)
# Convert months to numeric values
bank_data <- bank_data |>
mutate(month = case_when(month == 'jan' ~ 1,
month == 'feb' ~ 2,
month == 'mar' ~ 3,
month == 'apr' ~ 4,
month == 'may' ~ 5,
month == 'jun' ~ 6,
month == 'jul' ~ 7,
month == 'aug' ~ 8,
month == 'sep' ~ 9,
month == 'oct' ~ 10,
month == 'nov' ~ 11,
month == 'dec' ~ 12))
# Set random seed for reproducibility
set.seed(31415)
# Create test/train split
bank_split <- initial_split(bank_data, prop = 0.8)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
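The target classes are heavily imbalanced (far more “no” than “yes” responses), which drives the precision/recall trade-offs discussed below. As an optional quick check (not part of the Assignment 2 workflow), the class proportions in the training split can be inspected:
# Check the class balance of the target variable in the training split
prop.table(table(bank_train$y))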
The helper function to generate evaluation metrics for the models is re-used from Assignment 2.
# Define function to calculate performance metrics for models
metrics_function <- function(preds, actual) {
# Create confusion matrix
results <- as.data.frame(table(preds, actual)) |>
arrange(preds, actual) |>
pivot_wider(id_cols = 'preds', names_from = 'actual', values_from = 'Freq')
# Extract number of true positives, false positives, true negatives, and false negatives from the confusion matrix
tp <- as.numeric(results[2,3])
fp <- as.numeric(results[2,2])
tn <- as.numeric(results[1,2])
fn <- as.numeric(results[1,3])
# Calculate performance metrics and return them as a named vector
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- (2 * precision * recall) / (precision + recall)
metrics_dict <- c('precision' = precision, 'recall' = recall, 'f1' = f1)
return(metrics_dict)
}
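As a quick sanity check, the helper can be exercised on small hypothetical vectors (toy values chosen only to illustrate the shape of the output, not taken from the bank data):
# Toy example with hypothetical predictions to illustrate the output of metrics_function
toy_actual <- factor(c('no', 'no', 'no', 'yes', 'yes'))
toy_preds <- factor(c('no', 'yes', 'no', 'no', 'yes'))
print(metrics_function(toy_preds, toy_actual))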
I will be using a tuning grid for these models, and I want to prioritize F1-Score as the target metric, so I need a summary function that computes F1-Score specifically, which the train function can then use to select the best tune.
f1_optimizer <- function(results, lev = NULL, model = NULL) {
return(c(F1 = F1_Score(y_pred = results$pred,
y_true = results$obs,
positive = lev[2])))
}
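When used as a summaryFunction, caret passes this function a data frame of held-out observations (obs) and predictions (pred) along with the vector of factor levels. A hypothetical call, just to illustrate the expected inputs:
# Hypothetical data frame mimicking what caret passes to a summaryFunction
toy_resamples <- data.frame(obs = factor(c('no', 'no', 'yes', 'yes')),
                            pred = factor(c('no', 'yes', 'no', 'yes')))
f1_optimizer(toy_resamples, lev = c('no', 'yes'))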
I trained two different SVM models to compare against the results from Assignment 2: one with a linear kernel and one with an RBF kernel. For each model, I tested C values of 0.1, 1, and 10. The radial SVM model has an additional tuning option, the sigma hyper-parameter, for which I tried values of 0.001, 0.01, and 0.1. Each model was trained using 5-fold cross-validation.
# Establish hyper-parameter tuning options
svm1_grid <- expand.grid(C = c(0.1, 1, 10))
# Train model
svm_1 <- train(y ~ .,
data = bank_train,
method = 'svmLinear',
tuneGrid = svm1_grid,
trControl = trainControl(summaryFunction = f1_optimizer,
method = 'cv',
number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
# Use model to predict test set values
svm1_predictions <- predict(svm_1, newdata = bank_test |> select(-y))
# Report results
table(svm1_predictions, bank_test$y)
##
## svm1_predictions no yes
## no 7530 842
## yes 96 203
svm1_metrics <- metrics_function(svm1_predictions, bank_test$y)
print(svm1_metrics)
## precision recall f1
## 0.6789298 0.1942584 0.3020833
This model was very conservative in predicting positive outcomes, producing very few false positives (and therefore a high precision score) but a very large number of false negatives (and therefore a very low recall score). These balanced out to a relatively low F1-Score by the standards of the models trained for Assignment 2. The best tune used C = 0.1.
print(svm_1$bestTune)
## C
## 1 0.1
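To see the cross-validated F1-Score for every candidate value of C rather than just the winning tune, the full resampling summary stored on the fitted model can also be inspected:
# Cross-validated F1-Score for each value of C in the tuning grid
print(svm_1$results)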
# Establish hyper-parameter tuning options
svm2_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.001, 0.01, 0.1))
# Train model
svm_2 <- train(y ~ .,
data = bank_train,
method = 'svmRadial',
tuneGrid = svm2_grid,
trControl = trainControl(summaryFunction = f1_optimizer,
method = 'cv',
number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
# Use model to predict test set values
svm2_predictions <- predict(svm_2, newdata = bank_test |> select(-y))
# Report results
table(svm2_predictions, bank_test$y)
##
## svm2_predictions no yes
## no 7296 657
## yes 330 388
svm2_metrics <- metrics_function(svm2_predictions, bank_test$y)
print(svm2_metrics)
## precision recall f1
## 0.5403900 0.3712919 0.4401588
This model was less conservative than the one with a linear kernel, trading some precision for a substantial improvement in recall. The trade-off is favorable, however, yielding a clear improvement in F1-Score. The best tune for this model used the hyper-parameters sigma = 0.1 and C = 10.
print(svm_2$bestTune)
## sigma C
## 9 0.1 10
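Finally, the test-set metrics computed above for the two kernels can be placed side by side for a direct comparison:
# Compare test-set metrics for the linear and RBF kernel SVMs
print(rbind(linear_svm = svm1_metrics, rbf_svm = svm2_metrics))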