Data Management

Data Acquisition

The bank data from Assignments 1 and 2 is used for this assignment.

# Import required libraries and data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:base':
## 
##     Recall
url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/bank-full.csv'
data_import <- read.csv(url, sep = ';')
bank_data <- data_import
print(head(bank_data))
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

Data Preparation

The same data transformations from Assignment 2 are used here. The following transformations were performed:

  • The y column (the target variable) was converted from a character column to a factor column.
  • The categorical predictors (job, marital, contact, and poutcome) were converted from character columns to factor columns.
  • The binary variables (default, housing, and loan), which only took the values “no” and “yes”, were converted to 0s and 1s, respectively.
  • The education variable was converted from a character column to an ordered factor with levels “primary” < “secondary” < “tertiary”. Observations with an “unknown” education value were removed.
  • The month variable was converted from a character column to a numeric column.
# Convert y column to factor
bank_data$y <- factor(bank_data$y)

# Convert categorical columns to factor
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$contact <- factor(bank_data$contact)
bank_data$poutcome <- factor(bank_data$poutcome)

# Convert binary columns to 0s for "no" and 1s for "yes"
bank_data <- bank_data |>
  mutate(default = case_when(default == 'no' ~ 0, default == 'yes' ~ 1)) |>
  mutate(housing = case_when(housing == 'no' ~ 0, housing == 'yes' ~ 1)) |>
  mutate(loan = case_when(loan == 'no' ~ 0, loan == 'yes' ~ 1))

# Convert education column to ordinal
bank_data <- bank_data |> filter(education != 'unknown')
bank_data$education <- factor(bank_data$education, levels = c('primary', 'secondary', 'tertiary'), ordered = TRUE)

# Convert months to numeric values
bank_data <- bank_data |> 
  mutate(month = case_when(month == 'jan' ~ 1,
                           month == 'feb' ~ 2,
                           month == 'mar' ~ 3,
                           month == 'apr' ~ 4,
                           month == 'may' ~ 5,
                           month == 'jun' ~ 6,
                           month == 'jul' ~ 7,
                           month == 'aug' ~ 8,
                           month == 'sep' ~ 9,
                           month == 'oct' ~ 10,
                           month == 'nov' ~ 11,
                           month == 'dec' ~ 12))

# Set random seed for reproducibility
set.seed(31415)

# Create test/train split
bank_split <- initial_split(bank_data, prop = 0.8)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
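
As an aside, the twelve-branch case_when used for the month conversion could be written more compactly with base R’s month.abb constant. This is an equivalent sketch, not the code actually run above:

# Equivalent alternative (sketch only): match lowercase month abbreviations
# against base R's month.abb constant to get the month number
bank_data <- bank_data |>
  mutate(month = match(month, tolower(month.abb)))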

Evaluation Metrics

The helper function to generate evaluation metrics for the models is re-used from Assignment 2.

# Define function to calculate performance metrics for models
metrics_function <- function(preds, actual) {
  
  # Create confusion matrix
  results <- as.data.frame(table(preds, actual)) |>
    arrange(preds, actual) |>
    pivot_wider(id_cols = 'preds', names_from = 'actual', values_from = 'Freq')
  
  # Extract number of true positives, false positives, true negatives, and false negatives from the confusion matrix
  tp <- as.numeric(results[2,3]) # predicted yes, actual yes
  fp <- as.numeric(results[2,2]) # predicted yes, actual no
  tn <- as.numeric(results[1,2]) # predicted no, actual no
  fn <- as.numeric(results[1,3]) # predicted no, actual yes
  
  # Calculate performance metrics and return them as a named vector
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- (2 * precision * recall) / (precision + recall)
  metrics_dict <- c('precision' = precision, 'recall' = recall, 'f1' = f1)
  return(metrics_dict)
  
}
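
As a quick sanity check, the function can be exercised on hypothetical toy vectors (the expected values below are computed by hand, not model output):

# Toy example: 2 true positives, 1 false positive, 1 true negative, 1 false negative
toy_preds <- factor(c('no', 'no', 'yes', 'yes', 'yes'), levels = c('no', 'yes'))
toy_actual <- factor(c('no', 'yes', 'no', 'yes', 'yes'), levels = c('no', 'yes'))
metrics_function(toy_preds, toy_actual)
# By hand: precision = 2/3, recall = 2/3, f1 = 2/3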

F1-Score Function

I am using a tuning grid for these models and want to prioritize the F1-Score as the target metric, so I need a summary function that computes the F1-Score specifically, allowing the train function to use it to select the best tune.

# Define a custom summary function so caret's train can select the best tune by F1-Score
f1_optimizer <- function(results, lev = NULL, model = NULL) {
  return(c(F1 = F1_Score(y_pred = results$pred,
                         y_true = results$obs,
                         positive = lev[2])))
}
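
The summary function can also be checked directly on a small data frame shaped the way caret presents resample results (hypothetical values; the expected F1 is computed by hand):

# Toy resample results: 1 true positive, 1 false positive, 0 false negatives
toy_results <- data.frame(pred = factor(c('no', 'yes', 'yes'), levels = c('no', 'yes')),
                          obs = factor(c('no', 'no', 'yes'), levels = c('no', 'yes')))
f1_optimizer(toy_results, lev = levels(toy_results$obs))
# By hand: precision = 1/2, recall = 1, so F1 = 2/3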

Model Construction

I trained two different SVM models to compare to the results from Assignment 2: one with a linear kernel and one with an RBF kernel. For each model, I tested C values of 0.1, 1, and 10. The radial SVM model has a second hyperparameter, sigma, for which I tried values of 0.001, 0.01, and 0.1. Each model was trained using 5-fold cross-validation.
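
For reference, svmRadial wraps kernlab’s ksvm, which parameterizes the RBF kernel as K(x, x′) = exp(−σ‖x − x′‖²); larger sigma values therefore give a narrower, more local kernel, while C controls the penalty on margin violations for both kernels.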

Linear SVM Kernel

# Establish hyper-parameter tuning options
svm1_grid <- expand.grid(C = c(0.1, 1, 10))

# Train model
svm_1 <- train(y ~ .,
               data = bank_train,
               method = 'svmLinear',
               tuneGrid = svm1_grid,
               trControl = trainControl(summaryFunction = f1_optimizer,
                                        method = 'cv',
                                        number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
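# Note: the warning above is expected. Because trainControl supplies
# f1_optimizer, the default metric ("Accuracy") is not in the result set and
# caret falls back to the F1 column. Passing metric = 'F1' to train() would
# select it explicitly and avoid the warning (an optional tweak, not applied here).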
# Use model to predict test set values
svm1_predictions <- predict(svm_1, newdata = bank_test |> select(-y))

# Report results
table(svm1_predictions, bank_test$y)
##                 
## svm1_predictions   no  yes
##              no  7530  842
##              yes   96  203
svm1_metrics <- metrics_function(svm1_predictions, bank_test$y)
print(svm1_metrics)
## precision    recall        f1 
## 0.6789298 0.1942584 0.3020833
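
As a quick hand check, the reported metrics follow directly from the confusion matrix above:

203 / (203 + 96) # precision: TP / (TP + FP) ≈ 0.679
203 / (203 + 842) # recall: TP / (TP + FN) ≈ 0.194
2 * (0.679 * 0.194) / (0.679 + 0.194) # F1 ≈ 0.302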

This model was very conservative when predicting positive outcomes, resulting in very few false positives (and therefore a high precision score) but a very large number of false negatives (and therefore a very low recall score). These balanced out to a relatively low F1-Score by the standards of the models trained for Assignment 2. The best tune came from C = 0.1.

print(svm_1$bestTune)
##     C
## 1 0.1

Radial SVM Kernel

# Establish hyper-parameter tuning options
svm2_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.001, 0.01, 0.1))

# Train model
svm_2 <- train(y ~ .,
               data = bank_train,
               method = 'svmRadial',
               tuneGrid = svm2_grid,
               trControl = trainControl(summaryFunction = f1_optimizer,
                                        method = 'cv',
                                        number = 5))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. F1 will be used instead.
# Use model to predict test set values
svm2_predictions <- predict(svm_2, newdata = bank_test |> select(-y))

# Report results
table(svm2_predictions, bank_test$y)
##                 
## svm2_predictions   no  yes
##              no  7296  657
##              yes  330  388
svm2_metrics <- metrics_function(svm2_predictions, bank_test$y)
print(svm2_metrics)
## precision    recall        f1 
## 0.5403900 0.3712919 0.4401588

This model was less conservative than the one with the linear kernel, trading some precision for substantially improved recall. The trade-off is favorable, however, yielding a clear improvement in F1-Score. The best tune for this model had the hyperparameters sigma = 0.1 and C = 10.

print(svm_2$bestTune)
##   sigma  C
## 9   0.1 10