Data Management

Data Acquisition

The bank data from Assignments 1 and 2 is used for this assignment.

# Import required libraries and data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:base':
## 
##     Recall
url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/bank-full.csv'
data_import <- read.csv(url, sep = ';')
bank_data <- data_import
print(head(bank_data))
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

Data Preparation

The same data transformations from Assignment 2 are used here. The following transformations were performed:

  • The y column (the target variable) was converted from a character column to a factor column.
  • The categorical predictors (job, marital, contact, and poutcome) were converted from character columns to factor columns.
  • The binary variables (default, housing, and loan), which only took the values “no” and “yes”, were converted to 0s and 1s, respectively.
  • The education variable was converted from a character column to an ordered factor with levels “primary” < “secondary” < “tertiary”. Observations with an “unknown” education value were removed.
  • The month variable was converted from a character column to a numeric column.
# Convert y column to factor
bank_data$y <- factor(bank_data$y)

# Convert categorical columns to factor
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$contact <- factor(bank_data$contact)
bank_data$poutcome <- factor(bank_data$poutcome)

# Convert binary columns to 0s for "no" and 1s for "yes"
bank_data <- bank_data |>
  mutate(default = case_when(default == 'no' ~ 0, default == 'yes' ~ 1)) |>
  mutate(housing = case_when(housing == 'no' ~ 0, housing == 'yes' ~ 1)) |>
  mutate(loan = case_when(loan == 'no' ~ 0, loan == 'yes' ~ 1))

# Convert education column to ordinal
bank_data <- bank_data |> filter(education != 'unknown')
bank_data$education <- factor(bank_data$education, levels = c('primary', 'secondary', 'tertiary'), ordered = TRUE)

# Convert months to numeric values
bank_data <- bank_data |> 
  mutate(month = case_when(month == 'jan' ~ 1,
                           month == 'feb' ~ 2,
                           month == 'mar' ~ 3,
                           month == 'apr' ~ 4,
                           month == 'may' ~ 5,
                           month == 'jun' ~ 6,
                           month == 'jul' ~ 7,
                           month == 'aug' ~ 8,
                           month == 'sep' ~ 9,
                           month == 'oct' ~ 10,
                           month == 'nov' ~ 11,
                           month == 'dec' ~ 12))

# Set random seed for reproducibility
set.seed(31415)

# Create test/train split
bank_split <- initial_split(bank_data, prop = 0.8)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
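
As an aside, the twelve-branch case_when used for the month conversion could be written more compactly with base R’s month.abb constant. This is an equivalent sketch, not the code actually run above:

# Equivalent alternative (sketch only): match lowercase month abbreviations
# against base R's month.abb constant to get the month number
bank_data <- bank_data |>
  mutate(month = match(month, tolower(month.abb)))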

Evaluation Metrics

The helper function to generate evaluation metrics for the models is re-used from Assignment 2.

# Define function to calculate performance metrics for models
metrics_function <- function(preds, actual) {
  
  # Create confusion matrix
  results <- as.data.frame(table(preds, actual)) |>
    arrange(preds, actual) |>
    pivot_wider(id_cols = 'preds', names_from = 'actual', values_from = 'Freq')
  
  # Extract number of true positives, false positives, true negatives, and false negatives from the confusion matrix
  tp <- as.numeric(results[2,3]) # predicted yes, actual yes
  fp <- as.numeric(results[2,2]) # predicted yes, actual no
  tn <- as.numeric(results[1,2]) # predicted no, actual no
  fn <- as.numeric(results[1,3]) # predicted no, actual yes
  
  # Calculate performance metrics and return them as a named vector
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- (2 * precision * recall) / (precision + recall)
  metrics_dict <- c('precision' = precision, 'recall' = recall, 'f1' = f1)
  return(metrics_dict)
  
}
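
As a quick sanity check, the function can be exercised on hypothetical toy vectors (the expected values below are computed by hand, not model output):

# Toy example: 2 true positives, 1 false positive, 1 true negative, 1 false negative
toy_preds <- factor(c('no', 'no', 'yes', 'yes', 'yes'), levels = c('no', 'yes'))
toy_actual <- factor(c('no', 'yes', 'no', 'yes', 'yes'), levels = c('no', 'yes'))
metrics_function(toy_preds, toy_actual)
# By hand: precision = 2/3, recall = 2/3, f1 = 2/3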

F1-Score Function

I am using a tuning grid for these models and want to prioritize the F1-Score as the target metric, so I need a summary function that computes the F1-Score specifically, allowing the train function to use it to select the best tune.

# Define a custom summary function so caret's train can select the best tune by F1-Score
f1_optimizer <- function(results, lev = NULL, model = NULL) {
  return(c(F1 = F1_Score(y_pred = results$pred,
                         y_true = results$obs,
                         positive = lev[2])))
}
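
The summary function can also be checked directly on a small data frame shaped the way caret presents resample results (hypothetical values; the expected F1 is computed by hand):

# Toy resample results: 1 true positive, 1 false positive, 0 false negatives
toy_results <- data.frame(pred = factor(c('no', 'yes', 'yes'), levels = c('no', 'yes')),
                          obs = factor(c('no', 'no', 'yes'), levels = c('no', 'yes')))
f1_optimizer(toy_results, lev = levels(toy_results$obs))
# By hand: precision = 1/2, recall = 1, so F1 = 2/3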

Model Construction

I trained two different SVM models to compare to the results from Assignment 2: one with a linear kernel and one with an RBF kernel. For each model, I tested C values of 0.1, 1, and 10. The radial SVM model has a second hyperparameter, sigma, for which I tried values of 0.001, 0.01, and 0.1. Each model was trained using 5-fold cross-validation.
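
For reference, svmRadial wraps kernlab’s ksvm, which parameterizes the RBF kernel as K(x, x′) = exp(−σ‖x − x′‖²); larger sigma values therefore give a narrower, more local kernel, while C controls the penalty on margin violations for both kernels.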

Linear SVM Kernel

# Establish hyper-parameter tuning options
svm1_grid <- expand.grid(C = c(0.1, 1, 10))

# Train model
svm_1 <- train(y ~ .,
               data = bank_train,
               method = 'svmLinear',
               tuneGrid = svm1_grid,
               trControl = trainControl(summaryFunction = f1_optimizer,
                                        method = 'cv',
                                        number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
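# Note: the warning above is expected. Because trainControl supplies
# f1_optimizer, the default metric ("Accuracy") is not in the result set and
# caret falls back to the F1 column. Passing metric = 'F1' to train() would
# select it explicitly and avoid the warning (an optional tweak, not applied here).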
# Use model to predict test set values
svm1_predictions <- predict(svm_1, newdata = bank_test |> select(-y))

# Report results
table(svm1_predictions, bank_test$y)
##                 
## svm1_predictions   no  yes
##              no  7530  842
##              yes   96  203
svm1_metrics <- metrics_function(svm1_predictions, bank_test$y)
print(svm1_metrics)
## precision    recall        f1 
## 0.6789298 0.1942584 0.3020833
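
As a quick hand check, the reported metrics follow directly from the confusion matrix above:

203 / (203 + 96) # precision: TP / (TP + FP) ≈ 0.679
203 / (203 + 842) # recall: TP / (TP + FN) ≈ 0.194
2 * (0.679 * 0.194) / (0.679 + 0.194) # F1 ≈ 0.302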

This model was very conservative when predicting positive outcomes, resulting in very few false positives (and therefore a high precision score) but a very large number of false negatives (and therefore a very low recall score). These balanced out to a relatively low F1-Score by the standards of the models trained for Assignment 2. The best tune came from C = 0.1.

print(svm_1$bestTune)
##     C
## 1 0.1

Radial SVM Kernel

# Establish hyper-parameter tuning options
svm2_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.001, 0.01, 0.1))

# Train model
svm_2 <- train(y ~ .,
               data = bank_train,
               method = 'svmRadial',
               tuneGrid = svm2_grid,
               trControl = trainControl(summaryFunction = f1_optimizer,
                                        method = 'cv',
                                        number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
# Use model to predict test set values
svm2_predictions <- predict(svm_2, newdata = bank_test |> select(-y))

# Report results
table(svm2_predictions, bank_test$y)
##                 
## svm2_predictions   no  yes
##              no  7296  657
##              yes  330  388
svm2_metrics <- metrics_function(svm2_predictions, bank_test$y)
print(svm2_metrics)
## precision    recall        f1 
## 0.5403900 0.3712919 0.4401588

This model was less conservative than the one with the linear kernel, trading some precision for substantially improved recall. The trade-off is favorable, however, yielding a clear improvement in F1-Score. The best tune for this model had the hyperparameters sigma = 0.1 and C = 10.

print(svm_2$bestTune)
##   sigma  C
## 9   0.1 10