A Portuguese bank conducted a marketing campaign (phone calls) to persuade clients to subscribe to a term deposit, and the records of those calls are available as a dataset. The objective here is to apply machine learning techniques to the dataset to predict whether a client will subscribe and to identify the most effective tactics for persuading more customers to subscribe to the bank’s term deposit in the next campaign.
Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
NOTE: The experiments section is in part 6
For this assignment, I chose the full dataset,
bank-additional-full.csv which has all observations and
features. I would prefer to start out with all the data, then narrow it
down after performing my EDA and algorithm selection. I wouldn’t want to
disregard some features or observations before looking at them and
knowing if they’re important or not.
Input variables:

Bank client data:
1 - age (numeric)
2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown", "secondary", "primary", "tertiary")
5 - default: has credit in default? (binary: "yes", "no")
6 - housing: has housing loan? (binary: "yes", "no")
7 - loan: has personal loan? (binary: "yes", "no")

Related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: "unknown", "telephone", "cellular")
9 - day: last contact day of the month (numeric)
10 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
11 - duration: last contact duration, in seconds (numeric)

Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y: has the client subscribed to a term deposit? (binary: "yes", "no")

(Note: this description comes from the original bank dataset; the bank-additional-full.csv file used here replaces day with day_of_week, uses more detailed education levels (e.g., basic.4y, high.school, university.degree), codes poutcome as "nonexistent" rather than "unknown" when there was no previous campaign, and uses 999 rather than -1 in pdays for clients who were not previously contacted, as the str() output below shows.)
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(dplyr)
library(caret)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
library(gmodels)
library(correlationfunnel)
## ══ Using correlationfunnel? ════════════════════════════════════════════════════
## You might also be interested in applied data science training for business.
## </> Learn more at - www.business-science.io </>
library(DataExplorer)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(readr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ rsample::calibration() masks caret::calibration()
## ✖ scales::discard() masks purrr::discard()
## ✖ e1071::element() masks ggplot2::element()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ car::recode() masks dplyr::recode()
## ✖ MASS::select() masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ car::some() masks purrr::some()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
library(themis)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(vcd)
## Loading required package: grid
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
options(rgl.useNULL=TRUE)
library(adabag)
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following object is masked from 'package:gmodels':
##
## ci
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
#library(fastAdaboost)
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:dials':
##
## buffer
##
## The following object is masked from 'package:scales':
##
## alpha
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(doParallel)
library(FactoMineR) # PCA
library(factoextra) # PCA visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Full dataset
data_raw <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/bank-additional-full.csv", sep = ";")
# Rename columns
names(data_raw)[names(data_raw) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw)[names(data_raw) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw)[names(data_raw) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw)[names(data_raw) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw <- unique(data_raw)
str(data_raw)
## 'data.frame': 41176 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp_var_rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons_price_idx: num 94 94 94 94 94 ...
## $ cons_conf_idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr_employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
EDA was done in assignment 1 so I won’t include it here.
Since over 96% of pdays values are the "not previously contacted" placeholder, the column carries almost no usable information, so let's treat it as missing data and remove it to simplify the model. Keeping near-constant or inapplicable features adds computation without adding signal. Additionally, as the dataset documentation notes, duration is not known before a call is made, so it should be discarded if the intention is to have a realistic predictive model; let's remove it as well.
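As a quick check of that 96% figure (a minimal sketch; it assumes 999 is the "not previously contacted" code in this version of the dataset):

# Proportion of clients never contacted in a previous campaign (pdays placeholder)
mean(data_raw$pdays == 999)   # ~0.96, consistent with the figure above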
# Remove variable pdays (96% inapplicable) and duration
data_raw_clean2 <- data_raw |>
dplyr::select(-c(pdays, duration))
Address Multicollinearity:
Highly correlated predictors are problematic for models such as logistic regression and add redundancy here, so let's remove the variables identified in the EDA with absolute correlations above 0.4: euribor3m, emp_var_rate, and nr_employed. This also helps reduce the amount of data in the model.
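As a quick check (a sketch reproducing the correlation screening from the assignment 1 EDA; the 0.4 threshold is taken from the note above):

# Pairwise correlations among the remaining numeric predictors
num_cols <- sapply(data_raw_clean2, is.numeric)
cor_mat <- cor(data_raw_clean2[, num_cols])
round(cor_mat, 2)

# Flag predictor pairs with |correlation| > 0.4
which(abs(cor_mat) > 0.4 & upper.tri(cor_mat), arr.ind = TRUE)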
# Remove `euribor3m`, `emp_var_rate`, and `nr_employed` (absolute value correlations > 0.4)
data_raw_clean3 <- data_raw_clean2 |>
dplyr::select(-c(euribor3m, emp_var_rate, nr_employed))
Since SVM training takes a very long time, let's also prepare the combined (feature-engineered) dataset I created in Assignment 1.
# Combine Variables (Feature Engineering)
# Rename columns (no-op at this point: these columns were already renamed or dropped above)
data_raw_combined <- data_raw_clean3
names(data_raw_combined)[names(data_raw_combined) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw_combined)[names(data_raw_combined) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw_combined)[names(data_raw_combined) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw_combined)[names(data_raw_combined) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw_combined <- unique(data_raw_combined)
# Convert target into factor
data_raw_combined$y <- as.factor(data_raw_combined$y)
# Combine education levels -> education_combined
data_raw_combined <- data_raw_combined %>%
  mutate(education_combined = case_when(
    education %in% c('basic.4y', 'basic.6y', 'basic.9y') ~ "basic",
    education == "high.school" ~ "high.school",
    education == "illiterate" ~ "illiterate",
    education == "professional.course" ~ "professional.course",
    education == "university.degree" ~ "university.degree",
    education == "unknown" ~ "unknown"
  ))
# Combine job -> employed_status
data_raw_combined <- data_raw_combined %>%
  mutate(employed_status = case_when(
    job %in% c('admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
               'services', 'technician', 'self-employed') ~ "employed",
    job == "student" ~ "student",
    job == "retired" ~ "retired",
    job == "unemployed" ~ "unemployed",
    job == "unknown" ~ "unknown"
  ))
# Combine month -> season
data_raw_combined <- data_raw_combined %>%
  mutate(season = case_when(
    month %in% c('sep', 'oct', 'nov') ~ "fall",
    month %in% c('dec', 'jan', 'feb') ~ "winter",
    month %in% c('mar', 'apr', 'may') ~ "spring",
    month %in% c('jun', 'jul', 'aug') ~ "summer"
  ))
# Combine marital -> is_married
data_raw_combined <- data_raw_combined %>%
mutate(is_married = ifelse(marital == "married", 1, 0)
)
data_raw_combined <- data_raw_combined %>%
dplyr::select(-c(education, job, month, marital))
head(data_raw_combined)
## age default housing loan contact day_of_week campaign previous poutcome
## 1 56 no no no telephone mon 1 0 nonexistent
## 2 57 unknown no no telephone mon 1 0 nonexistent
## 3 37 no yes no telephone mon 1 0 nonexistent
## 4 40 no no no telephone mon 1 0 nonexistent
## 5 56 no no yes telephone mon 1 0 nonexistent
## 6 45 unknown no no telephone mon 1 0 nonexistent
## cons_price_idx cons_conf_idx y education_combined employed_status season
## 1 93.994 -36.4 no basic employed spring
## 2 93.994 -36.4 no high.school employed spring
## 3 93.994 -36.4 no high.school employed spring
## 4 93.994 -36.4 no basic employed spring
## 5 93.994 -36.4 no high.school employed spring
## 6 93.994 -36.4 no basic employed spring
## is_married
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
Convert to numerical features:
SVM requires every feature to be numeric because the kernel works with dot products and distances between observations. So let's create dummy variables for all the categorical features:
# Combined dataset
data_raw_combined3_svm <- data_raw_combined
categorical_features <- c("employed_status", "education_combined", "contact", "season", "day_of_week", "poutcome", "default", "housing", "loan")
dummy_formula <- as.formula(paste("~ . -y"))
# categorical dataset
cat_data_raw_combined3_svm <- data_raw_combined3_svm[, categorical_features]
# use dummyVars to create dummy variables
dummies <- dummyVars(~ ., data = cat_data_raw_combined3_svm, fullRank = FALSE)
combined_encoded <- data.frame(predict(dummies, newdata = cat_data_raw_combined3_svm))
# numeric columns
numeric_cols <- setdiff(names(data_raw_combined3_svm), c(categorical_features, "y"))
# combine back the columns
combined_final_data_for_svm <- cbind(data_raw_combined3_svm[, numeric_cols], combined_encoded, data_raw_combined3_svm$y)
# Rename target column
colnames(combined_final_data_for_svm)[ncol(combined_final_data_for_svm)] <- "y"
head(combined_final_data_for_svm)
## age campaign previous cons_price_idx cons_conf_idx is_married
## 1 56 1 0 93.994 -36.4 1
## 2 57 1 0 93.994 -36.4 1
## 3 37 1 0 93.994 -36.4 1
## 4 40 1 0 93.994 -36.4 1
## 5 56 1 0 93.994 -36.4 1
## 6 45 1 0 93.994 -36.4 1
## employed_statusemployed employed_statusretired employed_statusstudent
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## employed_statusunemployed employed_statusunknown education_combinedbasic
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 0 1
## 5 0 0 0
## 6 0 0 1
## education_combinedhigh.school education_combinedilliterate
## 1 0 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 1 0
## 6 0 0
## education_combinedprofessional.course education_combineduniversity.degree
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## education_combinedunknown contactcellular contacttelephone seasonfall
## 1 0 0 1 0
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 0 1 0
## 5 0 0 1 0
## 6 0 0 1 0
## seasonspring seasonsummer seasonwinter day_of_weekfri day_of_weekmon
## 1 1 0 0 0 1
## 2 1 0 0 0 1
## 3 1 0 0 0 1
## 4 1 0 0 0 1
## 5 1 0 0 0 1
## 6 1 0 0 0 1
## day_of_weekthu day_of_weektue day_of_weekwed poutcomefailure
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## poutcomenonexistent poutcomesuccess defaultno defaultunknown defaultyes
## 1 1 0 1 0 0
## 2 1 0 0 1 0
## 3 1 0 1 0 0
## 4 1 0 1 0 0
## 5 1 0 1 0 0
## 6 1 0 0 1 0
## housingno housingunknown housingyes loanno loanunknown loanyes y
## 1 1 0 0 1 0 0 no
## 2 1 0 0 1 0 0 no
## 3 0 0 1 1 0 0 no
## 4 1 0 0 1 0 0 no
## 5 1 0 0 0 0 1 no
## 6 1 0 0 1 0 0 no
Let’s do the same SVM data preparation for the regular dataset:
# Regular dataset
data_raw_svm <- data_raw_clean3
# Convert target to factor
data_raw_svm$y <- as.factor(data_raw_svm$y)
categorical_features <- c("job", "education", "contact", "month", "day_of_week", "poutcome", "default", "housing", "loan", "marital")
dummy_formula <- as.formula(paste("~ . -y")) # Adjust as needed
# categorical dataset
cat_data_raw_svm <- data_raw_svm[, categorical_features]
# use dummyVars to create dummy variables
dummies <- dummyVars(~ ., data = cat_data_raw_svm, fullRank = FALSE)
my_data_encoded <- data.frame(predict(dummies, newdata = cat_data_raw_svm))
# numeric columns
numeric_cols <- setdiff(names(data_raw_svm), c(categorical_features, "y"))
# combine back the columns
final_data_for_svm <- cbind(data_raw_svm[, numeric_cols], my_data_encoded, data_raw_svm$y)
# Rename target column
colnames(final_data_for_svm)[ncol(final_data_for_svm)] <- "y"
head(final_data_for_svm)
## age campaign previous cons_price_idx cons_conf_idx jobadmin. jobblue.collar
## 1 56 1 0 93.994 -36.4 0 0
## 2 57 1 0 93.994 -36.4 0 0
## 3 37 1 0 93.994 -36.4 0 0
## 4 40 1 0 93.994 -36.4 1 0
## 5 56 1 0 93.994 -36.4 0 0
## 6 45 1 0 93.994 -36.4 0 0
## jobentrepreneur jobhousemaid jobmanagement jobretired jobself.employed
## 1 0 1 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## jobservices jobstudent jobtechnician jobunemployed jobunknown
## 1 0 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 0 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## educationbasic.4y educationbasic.6y educationbasic.9y educationhigh.school
## 1 1 0 0 0
## 2 0 0 0 1
## 3 0 0 0 1
## 4 0 1 0 0
## 5 0 0 0 1
## 6 0 0 1 0
## educationilliterate educationprofessional.course educationuniversity.degree
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## educationunknown contactcellular contacttelephone monthapr monthaug monthdec
## 1 0 0 1 0 0 0
## 2 0 0 1 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 1 0 0 0
## 5 0 0 1 0 0 0
## 6 0 0 1 0 0 0
## monthjul monthjun monthmar monthmay monthnov monthoct monthsep day_of_weekfri
## 1 0 0 0 1 0 0 0 0
## 2 0 0 0 1 0 0 0 0
## 3 0 0 0 1 0 0 0 0
## 4 0 0 0 1 0 0 0 0
## 5 0 0 0 1 0 0 0 0
## 6 0 0 0 1 0 0 0 0
## day_of_weekmon day_of_weekthu day_of_weektue day_of_weekwed poutcomefailure
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## poutcomenonexistent poutcomesuccess defaultno defaultunknown defaultyes
## 1 1 0 1 0 0
## 2 1 0 0 1 0
## 3 1 0 1 0 0
## 4 1 0 1 0 0
## 5 1 0 1 0 0
## 6 1 0 0 1 0
## housingno housingunknown housingyes loanno loanunknown loanyes
## 1 1 0 0 1 0 0
## 2 1 0 0 1 0 0
## 3 0 0 1 1 0 0
## 4 1 0 0 1 0 0
## 5 1 0 0 0 0 1
## 6 1 0 0 1 0 0
## maritaldivorced maritalmarried maritalsingle maritalunknown y
## 1 0 1 0 0 no
## 2 0 1 0 0 no
## 3 0 1 0 0 no
## 4 0 1 0 0 no
## 5 0 1 0 0 no
## 6 0 1 0 0 no
To ensure objective model evaluation and guard against overfitting, let's split each dataset into training (80%) and testing (20%) sets. After trying to run an SVM on the full dataset, I found that using only a portion of the data is necessary for the model to finish running in a reasonable amount of time.
For the combined dataset, let’s sample 60% of the full dataset, and for the regular dataset, let’s take 40% of the full dataset.
# Split data into train and test
set.seed(123)
# Randomly sample 60% of the rows for the combined dataset
# Combined dataset
sampled_combined_data <- combined_final_data_for_svm %>%
slice_sample(prop = 0.60)
sampled_combined_split <- initial_split(sampled_combined_data, prop = 0.8, strata = y)
sampled_combined_train <- sampled_combined_split |>
training()
sampled_combined_test <- sampled_combined_split |>
testing()
# Randomly sample 40% of the rows for the regular dataset
# Regular dataset
sampled_regular_data <- final_data_for_svm %>%
slice_sample(prop = 0.40)
sampled_regular_split <- initial_split(sampled_regular_data, prop = 0.8, strata = y)
sampled_regular_train <- sampled_regular_split |>
training()
sampled_regular_test <- sampled_regular_split |>
testing()
As the EDA showed, this dataset is heavily imbalanced, which can bias the model toward the majority class. To address this, let's oversample the minority class in the training sets with SMOTE (Synthetic Minority Over-sampling Technique):
set.seed(123)
# Only apply SMOTE to training set
sampled_combined_smoted_train <- smotenc(sampled_combined_train, var = "y", over_ratio = 1)
sampled_regular_smoted_train <- smotenc(sampled_regular_train, var = "y", over_ratio = 1)
# Distribution of target variable - combined
sampled_combined_smoted_train |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
# Distribution of target variable - regular
sampled_regular_smoted_train |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
We can now see that the distribution is even for our target variable,
y.
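As a quick numeric check of the class balance (a small sketch using the training sets created above):

# Class counts before and after SMOTE (regular dataset)
table(sampled_regular_train$y)
table(sampled_regular_smoted_train$y)

# Proportions after SMOTE should be roughly 50/50
prop.table(table(sampled_regular_smoted_train$y))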
For all the models, accuracy will be the main evaluation metric.
ksvm() parameters (from the official kernlab documentation):
x: a symbolic description of the model to be fit. When
not using a formula x can be a matrix or vector containing the training
data or a kernel matrix of class kernelMatrix of the training data or a
list of character vectors (for use with the string kernel). Note, that
the intercept is always excluded, whether given in the formula or
not.
data: an optional data frame containing the training
data, when using a formula. By default the data is taken from the
environment which `ksvm’ is called from.
y: a response vector with one label for each
row/component of x. Can be either a factor (for classification tasks) or
a numeric vector (for regression).
scaled: A logical vector indicating the variables to be
scaled. If scaled is of length 1, the value is recycled as many times as
needed and all non-binary variables are scaled. Per default, data are
scaled internally (both x and y variables) to zero mean and unit
variance. The center and scale values are returned and used for later
predictions.
type: ksvm can be used for classification , for
regression, or for novelty detection. Depending on whether y is a factor
or not, the default setting for type is C-svc or eps-svr, respectively,
but can be overwritten by setting an explicit value. Valid options
are:
- C-svc: C classification
- nu-svc: nu classification
- C-bsvc: bound-constraint SVM classification
- spoc-svc: Crammer, Singer native multi-class
- kbb-svc: Weston, Watkins native multi-class
- one-svc: novelty detection
- eps-svr: epsilon regression
- nu-svr: nu regression
- eps-bsvr: bound-constraint SVM regression
kernel: the kernel function used in training and
predicting. This parameter can be set to any function, of class kernel,
which computes the inner product in feature space between two vector
arguments (see kernels). kernlab provides the most popular kernel
functions which can be used by setting the kernel parameter to the
following strings:
- rbfdot: Radial Basis kernel ("Gaussian")
- polydot: Polynomial kernel
- vanilladot: Linear kernel
- tanhdot: Hyperbolic tangent kernel
- laplacedot: Laplacian kernel
- besseldot: Bessel kernel
- anovadot: ANOVA RBF kernel
- splinedot: Spline kernel
- stringdot: String kernel
Setting the kernel parameter to “matrix” treats x as a kernel matrix calling the kernelMatrix interface.
The kernel parameter can also be set to a user defined function of class kernel by passing the function name as an argument.
kpar: the list of hyper-parameters (kernel parameters).
This is a list which contains the parameters to be used with the kernel
function. Valid parameters for existing kernels are:
- sigma: inverse kernel width for the Radial Basis kernel function "rbfdot" and the Laplacian kernel "laplacedot"
- degree, scale, offset: for the Polynomial kernel "polydot"
- scale, offset: for the Hyperbolic tangent kernel function "tanhdot"
- sigma, order, degree: for the Bessel kernel "besseldot"
- sigma, degree: for the ANOVA kernel "anovadot"
- length, lambda, normalized: for the "stringdot" kernel, where length is the length of the strings considered, lambda the decay factor, and normalized a logical parameter determining if the kernel evaluations should be normalized
Hyper-parameters for user defined kernels can be passed through the kpar parameter as well. In the case of a Radial Basis kernel function (Gaussian) kpar can also be set to the string “automatic” which uses the heuristics in sigest to calculate a good sigma value for the Gaussian RBF or Laplace kernel, from the data. (default = “automatic”).
C: cost of constraints violation (default: 1) this is
the `C’-constant of the regularization term in the Lagrange
formulation.
nu: parameter needed for nu-svc, one-svc, and nu-svr.
The nu parameter sets the upper bound on the training error and the
lower bound on the fraction of data points to become Support Vectors
(default: 0.2).
epsilon: epsilon in the insensitive-loss function used
for eps-svr, nu-svr and eps-bsvm (default: 0.1)
prob.model: if set to TRUE builds a model for
calculating class probabilities or in case of regression, calculates the
scaling parameter of the Laplacian distribution fitted on the residuals.
Fitting is done on output data created by performing a 3-fold
cross-validation on the training data. For details see references.
(default: FALSE)
class.weights: a named vector of weights for the
different classes, used for asymmetric class sizes. Not all factor
levels have to be supplied (default weight: 1). All components have to
be named.
cache: cache memory in MB (default 40)
tol: tolerance of termination criterion (default:
0.001)
shrinking: option whether to use the
shrinking-heuristics (default: TRUE)
cross: if an integer value k > 0 is specified, a k-fold
cross validation on the training data is performed to assess the quality
of the model: the accuracy rate for classification and the Mean Squared
Error for regression
fit: indicates whether the fitted values should be
computed and included in the model or not (default: TRUE)
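To make the mapping from these parameters to an actual call concrete, here is a small illustrative sketch. It fits a throwaway model on a random subsample of the SMOTE'd training data; the parameter values are arbitrary placeholders, not the tuned models built below.

# Illustrative only: how the ksvm() arguments described above fit together
set.seed(1)
demo_train <- dplyr::slice_sample(sampled_regular_smoted_train, n = 2000)

demo_fit <- ksvm(y ~ .,                      # formula interface
                 data = demo_train,          # training data frame
                 type = "C-svc",             # classification
                 kernel = "rbfdot",          # Gaussian RBF kernel
                 kpar = list(sigma = 0.05),  # kernel hyper-parameter goes through kpar
                 C = 1,                      # cost of constraint violation
                 cross = 3,                  # 3-fold cross-validation error estimate
                 prob.model = TRUE)          # needed for predict(type = "probabilities")
demo_fit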
Let's start with a simple linear SVM classifier on the regular dataset. SVM should be able to find patterns in a dataset with many features, but the drawback is a long training time. This will serve as a simple baseline model.
# NOTE: no need to standardize the dataset; ksvm scales the data internally by default (scaled = TRUE)
# Parallel processing (to try to speed up runtime)
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build simple linear model
set.seed(125)
start <- Sys.time()
svm_model1 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "vanilladot",
prob.model=TRUE) # to use for AUC/ROC
## Setting default kernel parameters
print(svm_model1)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 16499
##
## Objective Function Value : -14705.61
## Training error : 0.285696
## Probability model included.
end <- Sys.time()
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
Model Evaluation:
# Predict for the linear SVM model
model1_predictions <- predict(svm_model1, sampled_regular_test)
# Confusion matrix
cf_matrix_model1 <- confusionMatrix(model1_predictions, sampled_regular_test$y)
# Evaluation metrics
model1_probs <- predict(svm_model1, newdata = sampled_regular_test, type = "probabilities")[,2]
model1_roc <- roc(sampled_regular_test$y, model1_probs, quiet = TRUE)
model1_auc <- auc(model1_roc)
model1_accuracy <- cf_matrix_model1$overall['Accuracy']
model1_kappa <- cf_matrix_model1$overall['Kappa']
model1_sensitivity <- cf_matrix_model1$byClass['Sensitivity']
model1_specificity <- cf_matrix_model1$byClass['Specificity']
model1_f1 <- cf_matrix_model1$byClass["F1"]
model1_precision <- cf_matrix_model1$byClass["Precision"]
model1_duration <- end - start
output1 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model1)), collapse = "\n"), "\n",
"Accuracy:", round(model1_accuracy, 4), "| Precision:", round(model1_precision, 4),
"| Sensitivity:", round(model1_sensitivity, 4), "| Specificity:", round(model1_specificity, 4), "\n",
"F1 Score:", round(model1_f1, 4), "| Kappa:", round(model1_kappa, 4),
"| Model Duration:", round(model1_duration, 4), "\n\n", sep = " ")
cat(output1)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2546 163
## yes 377 209
##
## Accuracy : 0.8361
## 95% CI : (0.823, 0.8486)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.346
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8710
## Specificity : 0.5618
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.3567
## Prevalence : 0.8871
## Detection Rate : 0.7727
## Detection Prevalence : 0.8222
## Balanced Accuracy : 0.7164
##
## 'Positive' Class : no
##
## Accuracy: 0.8361 | Precision: 0.9398 | Sensitivity: 0.871 | Specificity: 0.5618
## F1 Score: 0.9041 | Kappa: 0.346 | Model Duration: 7.946
Results:
The linear kernel achieved an accuracy of 83.61%, which is reasonable without any hyperparameter tuning. Aside from specificity, its metrics are solid across the board.
The purpose of this experiment is to use the combined dataset (more observations and fewer predictors). The motivation is SVM's long runtime and the high dimensionality of the dummy-encoded data.
# NOTE: no need to standardize the dataset; ksvm scales the data internally by default (scaled = TRUE)
# Parallel processing (to try to speed up runtime)
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build simple linear model w/ combined dataset
set.seed(125)
start <- Sys.time()
svm_model1b <- ksvm(y ~ ., data = sampled_combined_smoted_train,
kernel = "vanilladot",
prob.model=TRUE) # to use for AUC/ROC
## Setting default kernel parameters
print(svm_model1b)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 23791
##
## Objective Function Value : -23774.34
## Training error : 0.333455
## Probability model included.
end <- Sys.time()
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
Model Evaluation:
# Predict
model1b_predictions <- predict(svm_model1b, sampled_combined_test)
# Confusion matrix
cf_matrix_model1b <- confusionMatrix(model1b_predictions, sampled_combined_test$y)
# Evaluation metrics
# this causes $ operator is invalid for atomic vectors for some reason
# model1b_probs <- predict(svm_model1b, newdata = sampled_combined_test, type = "probabilities")[,2]
# model1b_roc <- roc(sampled_regular_test$y, model1b_probs, quiet = TRUE)
# model1b_auc <- auc(model1b_roc)
model1b_auc <- NA
model1b_accuracy <- cf_matrix_model1b$overall['Accuracy']
model1b_kappa <- cf_matrix_model1b$overall['Kappa']
model1b_sensitivity <- cf_matrix_model1b$byClass['Sensitivity']
model1b_specificity <- cf_matrix_model1b$byClass['Specificity']
model1b_f1 <- cf_matrix_model1b$byClass["F1"]
model1b_precision <- cf_matrix_model1b$byClass["Precision"]
model1b_duration <- end - start
output1b <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1B Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model1b)), collapse = "\n"), "\n",
"Accuracy:", round(model1b_accuracy, 4), "| Precision:", round(model1b_precision, 4),
"| Sensitivity:", round(model1b_sensitivity, 4), "| Specificity:", round(model1b_specificity, 4), "\n",
"F1 Score:", round(model1b_f1, 4), "| Kappa:", round(model1b_kappa, 4),
"| Model Duration:", round(model1b_duration, 4), "\n\n", sep = " ")
cat(output1b)
##
## === Model Selection and Evaluation ===
##
## === Model 1B Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2298 147
## yes 1804 411
##
## Accuracy : 0.5813
## 95% CI : (0.567, 0.5955)
## No Information Rate : 0.8803
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.13
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5602
## Specificity : 0.7366
## Pos Pred Value : 0.9399
## Neg Pred Value : 0.1856
## Prevalence : 0.8803
## Detection Rate : 0.4931
## Detection Prevalence : 0.5247
## Balanced Accuracy : 0.6484
##
## 'Positive' Class : no
##
## Accuracy: 0.5813 | Precision: 0.9399 | Sensitivity: 0.5602 | Specificity: 0.7366
## F1 Score: 0.702 | Kappa: 0.13 | Model Duration: 4.8193
Results:
The combined-dataset model trained faster than model 1, but its accuracy dropped to a poor 58.13%. Precision is still high, though; since caret treats "no" as the positive class here, this means that when the model predicts "no" it is usually correct. The confusion matrix shows the real problem: the model heavily overpredicts "yes", the minority class.
Note: the AUC code threw an error for this dataset, but since the accuracy was already so poor, I did not spend time debugging it.
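One possible fix (an untested sketch, not part of the run above): the commented-out code computes probabilities on sampled_combined_test but passes sampled_regular_test$y to roc(), and it selects the probability column by position. Using the matching test set and selecting the "yes" column by name may resolve the error:

# Sketch (not run): AUC for the combined-dataset model
model1b_prob_mat <- predict(svm_model1b, newdata = sampled_combined_test,
                            type = "probabilities")
model1b_probs <- model1b_prob_mat[, "yes"]          # probability of subscribing
model1b_roc <- roc(sampled_combined_test$y, model1b_probs, quiet = TRUE)
auc(model1b_roc)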
The motivation for this model is that the RBF (Gaussian) kernel has been shown to perform well on many kinds of data and can capture non-linear patterns. Let's use the default parameters for now with the regular dataset.
# Build Gaussian RBF model
set.seed(125)
start <- Sys.time()
svm_model2 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot",
prob.model=TRUE)
print(svm_model2)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0109304259985459
##
## Number of Support Vectors : 11903
##
## Objective Function Value : -9214.919
## Training error : 0.113868
## Probability model included.
end <- Sys.time()
Model Evaluation:
# Predict - Gaussian RBF
model2_predictions <- predict(svm_model2, sampled_regular_test)
# Confusion matrix
cf_matrix_model2 <- confusionMatrix(model2_predictions, sampled_regular_test$y)
# Evaluation metrics
model2_probs <- predict(svm_model2, newdata = sampled_regular_test, type = "probabilities")[, 2]
model2_roc <- roc(sampled_regular_test$y, model2_probs, quiet = TRUE)
model2_auc <- auc(model2_roc)
model2_accuracy <- cf_matrix_model2$overall['Accuracy']
model2_kappa <- cf_matrix_model2$overall['Kappa']
model2_sensitivity <- cf_matrix_model2$byClass['Sensitivity']
model2_specificity <- cf_matrix_model2$byClass['Specificity']
model2_f1 <- cf_matrix_model2$byClass["F1"]
model2_precision <- cf_matrix_model2$byClass["Precision"]
model2_duration <- end - start
output2 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model2)), collapse = "\n"), "\n",
"Accuracy:", round(model2_accuracy, 4), "| Precision:", round(model2_precision, 4),
"| Sensitivity:", round(model2_sensitivity, 4), "| Specificity:", round(model2_specificity, 4), "\n",
"F1 Score:", round(model2_f1, 4), "| Kappa:", round(model2_kappa, 4), "| AUC:", round(model2_auc, 4),
"| Model Duration:", round(model2_duration, 4), "\n\n", sep = " ")
cat(output2)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2697 203
## yes 226 169
##
## Accuracy : 0.8698
## 95% CI : (0.8578, 0.8811)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 0.9990
##
## Kappa : 0.3671
##
## Mcnemar's Test P-Value : 0.2882
##
## Sensitivity : 0.9227
## Specificity : 0.4543
## Pos Pred Value : 0.9300
## Neg Pred Value : 0.4278
## Prevalence : 0.8871
## Detection Rate : 0.8185
## Detection Prevalence : 0.8801
## Balanced Accuracy : 0.6885
##
## 'Positive' Class : no
##
## Accuracy: 0.8698 | Precision: 0.93 | Sensitivity: 0.9227 | Specificity: 0.4543
## F1 Score: 0.9263 | Kappa: 0.3671 | AUC: 0.7533 | Model Duration: 5.9304
Results:
The RBF kernel increased the accuracy to 86.98%, which makes sense since this kernel can capture more complex patterns in the data. Excluding specificity, the performance metrics look very good for this model.
This experiment applies different values of C to find the optimal value for our data. A high cost can lead to overfitting, while a low cost can lead to underfitting because misclassifications are penalized too little for the model to capture the patterns in the data.
Note: I did try to do hyperparameter tuning with cross-validation, but the code would never complete. That is why I have broken out the tuning into separate models which isn’t ideal. I have left the code commented out for now.
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build RBF models with varying cost parameter
# This code is inspired from the textbook
# Takes a while to run
cost_values <- c(0.25, 1, 5, 10)
accuracy_values <- sapply(cost_values, function(x) {
set.seed(12345)
m <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot", C = x)
pred <- predict(m, sampled_regular_test)
agree <- ifelse(pred == sampled_regular_test$y, 1, 0)
accuracy <- sum(agree) / nrow(sampled_regular_test)
return (accuracy)
})
# Plot accuracy vs cost
plot(cost_values, accuracy_values, type = "b")
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
print(accuracy_values)
## [1] 0.8543247 0.8698027 0.8755690 0.8719272
The best value turned out to be C = 5, producing a pretty good accuracy score of 87.56%. Let's now create this model using C = 5:
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build best RBF model with C=5
set.seed(125)
start <- Sys.time()
svm_model3 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot",
C = 5,
prob.model=TRUE)
print(svm_model3)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 5
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0109304259985459
##
## Number of Support Vectors : 8896
##
## Objective Function Value : -29990
## Training error : 0.066729
## Probability model included.
end <- Sys.time()
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
Model Evaluation:
# Predict RBF model
model3_predictions <- predict(svm_model3, sampled_regular_test)
# Confusion matrix
cf_matrix_model3 <- confusionMatrix(model3_predictions, sampled_regular_test$y)
# Evaluation metrics
model3_probs <- predict(svm_model3, newdata = sampled_regular_test, type = "probabilities")[, 2]
model3_roc <- roc(sampled_regular_test$y, model3_probs, quiet = TRUE)
model3_auc <- auc(model3_roc)
model3_accuracy <- cf_matrix_model3$overall['Accuracy']
model3_kappa <- cf_matrix_model3$overall['Kappa']
model3_sensitivity <- cf_matrix_model3$byClass['Sensitivity']
model3_specificity <- cf_matrix_model3$byClass['Specificity']
model3_f1 <- cf_matrix_model3$byClass["F1"]
model3_precision <- cf_matrix_model3$byClass["Precision"]
model3_duration <- end - start
output3 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 3 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model3)), collapse = "\n"), "\n",
"Accuracy:", round(model3_accuracy, 4), "| Precision:", round(model3_precision, 4),
"| Sensitivity:", round(model3_sensitivity, 4), "| Specificity:", round(model3_specificity, 4), "\n",
"F1 Score:", round(model3_f1, 4), "| Kappa:", round(model3_kappa, 4), "| AUC:", round(model3_auc, 4),
"| Model Duration:", round(model3_duration, 4), "\n\n", sep = " ")
cat(output3)
##
## === Model Selection and Evaluation ===
##
## === Model 3 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2739 225
## yes 184 147
##
## Accuracy : 0.8759
## 95% CI : (0.8641, 0.8869)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 0.97944
##
## Kappa : 0.349
##
## Mcnemar's Test P-Value : 0.04794
##
## Sensitivity : 0.9371
## Specificity : 0.3952
## Pos Pred Value : 0.9241
## Neg Pred Value : 0.4441
## Prevalence : 0.8871
## Detection Rate : 0.8313
## Detection Prevalence : 0.8995
## Balanced Accuracy : 0.6661
##
## 'Positive' Class : no
##
## Accuracy: 0.8759 | Precision: 0.9241 | Sensitivity: 0.9371 | Specificity: 0.3952
## F1 Score: 0.9305 | Kappa: 0.349 | AUC: 0.7344 | Model Duration: 1.3186
Results:
As expected, this produced an 87.59% accuracy score and fairly good metrics across the board (excluding specificity).
# Unfortunately this code was never able to complete (took too long)
# # Parallel processing
# num_cores <- parallel::detectCores() - 1 # leave one core free
# num_cores
# cl <- makePSOCKcluster(num_cores) # number of cores to use
# registerDoParallel(cl)
#
# set.seed(125)
#
# # Cross validation
# ctrl <- trainControl(method = "repeatedcv", repeats = 3)
#
# # Params to test
# rbf_grid <- expand.grid(
# # degree = c(2, 3),
# sigma = c(0.01, 0.1),
# C = c(1, 5)
# )
#
# # Build model
# svm_poly_cv <- train(y ~ ., data = sampled_regular_smoted_train, method = "svmRadial",
# tuneGrid = rbf_grid, trControl = ctrl, metric = "Accuracy")
#
# # View the best parameters
# print(svm_poly_cv)
#
# plot(svm_poly_cv)
#
# # Stop cluster
# stopCluster(cl)
# registerDoSEQ()
# closeAllConnections()
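A possibly more tractable variant (a sketch only, not run here; it assumes that tuning on a 25% subsample with 3-fold cross-validation keeps the runtime manageable):

# Sketch (not run): joint C/sigma tuning on a smaller subsample with 3-fold CV
set.seed(125)
tune_sample <- dplyr::slice_sample(sampled_regular_smoted_train, prop = 0.25)

ctrl <- trainControl(method = "cv", number = 3)
rbf_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(1, 5, 10))

svm_rbf_cv <- caret::train(y ~ ., data = tune_sample, method = "svmRadial",
                           tuneGrid = rbf_grid, trControl = ctrl, metric = "Accuracy")
svm_rbf_cv$bestTune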
Since I wasn't able to tune C and sigma together, this model tunes sigma on its own. In kernlab's parameterization, sigma is the inverse kernel width: a larger sigma means each training point influences only a small neighborhood, giving a more flexible boundary that is very sensitive to the data, while a smaller sigma means a wider influence and a smoother, less sensitive boundary.
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build RBF models to find the best sigma
# This code is inspired from the textbook
# NOTE: Takes a while to run
# Use best C value found from previous experiment
# NOTE: sigma is not a ksvm() argument; kernel parameters should go through
# kpar = list(sigma = x), so as written these runs likely fall back to the automatic sigma
sigma_values <- c(0.001, 0.01, 0.1, 100)
accuracy4_values <- sapply(sigma_values, function(x) {
set.seed(12345)
m4 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot", sigma = x, C=5)
pred4 <- predict(m4, sampled_regular_test)
agree4 <- ifelse(pred4 == sampled_regular_test$y, 1, 0)
accuracy4 <- sum(agree4) / nrow(sampled_regular_test)
return (accuracy4)
})
# Plot accuracy vs sigma
plot(sigma_values, accuracy4_values, type = "b")
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
print(accuracy4_values)
## [1] 0.875569 0.875569 0.875569 0.875569
Results:
The accuracy was identical for every sigma value and matches model 3 exactly. The most likely explanation is that sigma is not a ksvm() argument, so the values above were silently ignored and every run fell back to the automatically estimated sigma; kernel parameters need to be supplied via kpar (see the sketch below). Either way, model 4 ends up identical to model 3, so I won't rerun the evaluation code.
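For reference, here is how the sweep could be written so that the sigma values are actually applied (a sketch, not run here, so the resulting accuracies are unknown):

# Sketch (not run): pass sigma through kpar so it reaches the RBF kernel
accuracy4b_values <- sapply(sigma_values, function(x) {
  set.seed(12345)
  m4b <- ksvm(y ~ ., data = sampled_regular_smoted_train,
              kernel = "rbfdot", kpar = list(sigma = x), C = 5)
  pred4b <- predict(m4b, sampled_regular_test)
  mean(pred4b == sampled_regular_test$y)  # test-set accuracy
})
accuracy4b_values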
This experiment tests another kernel type, the polynomial kernel, to check whether polynomial transformations of the features improve on the linear kernel. For this experiment, let's tune the degree parameter.
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build polynomial models with varying degree
set.seed(125)
start <- Sys.time()
# svm_model4 <- ksvm(y ~ ., data = smoted_train,
# kernel = "polydot",
# prob.model=TRUE)
# Try 2nd and 3rd degree polynomial
# NOTE: degree is not a ksvm() argument; kernel parameters should go through
# kpar = list(degree = x), so as written these runs likely use the default degree = 1
degree_values <- c(2, 3)
accuracy5_values <- sapply(degree_values, function(x) {
set.seed(12345)
m5 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "polydot", degree = x)
pred5 <- predict(m5, sampled_regular_test)
agree5 <- ifelse(pred5 == sampled_regular_test$y, 1, 0)
accuracy5 <- sum(agree5) / nrow(sampled_regular_test)
return (accuracy5)
})
## Setting default kernel parameters
## Setting default kernel parameters
# Accuracy vs Cost
plot(degree_values, accuracy5_values, type = "b")
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
print(accuracy5_values)
## [1] 0.8361153 0.8361153
Results:
These accuracy values are identical to the linear model's. This is not evidence that the data are linearly separable: as the printed model below shows (degree = 1), the degree argument was not picked up, because ksvm() expects kernel parameters via kpar = list(degree = x), so both fits are effectively linear. Let's build the model anyway:
# Build model
set.seed(125)
start <- Sys.time()
svm_model5 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "polydot",
degree = 2,
prob.model=TRUE)
## Setting default kernel parameters
print(svm_model5)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 16523
##
## Objective Function Value : -14705.61
## Training error : 0.285696
## Probability model included.
end <- Sys.time()
Model Evaluation:
# Predict Polynomial SVM
model5_predictions <- predict(svm_model5, sampled_regular_test)
# Confusion Matrix
cf_matrix_model5 <- confusionMatrix(model5_predictions, sampled_regular_test$y)
# Evaluation metrics
model5_probs <- predict(svm_model5, newdata = sampled_regular_test, type = "probabilities")[, 2]
model5_roc <- roc(sampled_regular_test$y, model5_probs, quiet = TRUE)
model5_auc <- auc(model5_roc)
model5_accuracy <- cf_matrix_model5$overall['Accuracy']
model5_kappa <- cf_matrix_model5$overall['Kappa']
model5_sensitivity <- cf_matrix_model5$byClass['Sensitivity']
model5_specificity <- cf_matrix_model5$byClass['Specificity']
model5_f1 <- cf_matrix_model5$byClass["F1"]
model5_precision <- cf_matrix_model5$byClass["Precision"]
model5_duration <- end - start
output5 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 5 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model5)), collapse = "\n"), "\n",
"Accuracy:", round(model5_accuracy, 4), "| Precision:", round(model5_precision, 4),
"| Sensitivity:", round(model5_sensitivity, 4), "| Specificity:", round(model5_specificity, 4), "\n",
"F1 Score:", round(model5_f1, 4), "| Kappa:", round(model5_kappa, 4), "| AUC:", round(model5_auc, 4),
"| Model Duration:", round(model5_duration, 4), "\n\n", sep = " ")
cat(output5)
##
## === Model Selection and Evaluation ===
##
## === Model 5 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2546 163
## yes 377 209
##
## Accuracy : 0.8361
## 95% CI : (0.823, 0.8486)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.346
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8710
## Specificity : 0.5618
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.3567
## Prevalence : 0.8871
## Detection Rate : 0.7727
## Detection Prevalence : 0.8222
## Balanced Accuracy : 0.7164
##
## 'Positive' Class : no
##
## Accuracy: 0.8361 | Precision: 0.9398 | Sensitivity: 0.871 | Specificity: 0.5618
## F1 Score: 0.9041 | Kappa: 0.346 | AUC: 0.7632 | Model Duration: 8.157
Results:
Regarding accuracy, this model's results mirror the linear model's, which is consistent with the hyperparameters printed above (degree = 1): the quadratic kernel was never actually applied.
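For reference, a genuine second-degree polynomial kernel could be specified by passing the degree through kpar (a sketch, not run here, so its performance is unknown):

# Sketch (not run): polynomial kernel with the degree actually set to 2
svm_model5b <- ksvm(y ~ ., data = sampled_regular_smoted_train,
                    kernel = "polydot",
                    kpar = list(degree = 2, scale = 1, offset = 1),
                    prob.model = TRUE)
print(svm_model5b)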
# Unfortunately this took way too long to run
# # Parallel processing
# num_cores <- parallel::detectCores() - 1 # leave one core free
# num_cores
# cl <- makePSOCKcluster(num_cores) # number of cores to use
# registerDoParallel(cl)
#
# set.seed(125)
#
# # Cross validation
# ctrl <- trainControl(method = "repeatedcv", repeats = 3)
#
# # Params to test
# poly_grid <- expand.grid(
# degree = c(2, 3),
# scale = c(0.01, 0.1),
# C = c(1)
# )
#
# # Build model
# svm_poly_cv <- train(y ~ ., data = smoted_train, method = "svmPoly",
# tuneGrid = poly_grid, trControl = ctrl, metric = "Accuracy")
#
# # View the best parameters
# print(svm_poly_cv)
#
# plot(svm_poly_cv)
#
# # Stop cluster
# stopCluster(cl)
# registerDoSEQ()
# closeAllConnections()
Final results table:
# Create final SVM results table
# NOTE: model 3 is the same as model 4
all_experiments <- data.frame(
Model = c("Model 1", "Model 1B", "Model 2", "Model 3", "Model 4", "Model 5"),
Accuracy = c(model1_accuracy, model1b_accuracy, model2_accuracy, model3_accuracy, model3_accuracy, model5_accuracy),
Error_Rate = c(1-model1_accuracy, 1-model1b_accuracy, 1-model2_accuracy, 1-model3_accuracy, 1-model3_accuracy, 1-model5_accuracy),
Precision = c(model1_precision, model1b_precision, model2_precision, model3_precision, model3_precision, model5_precision),
Sensitivity = c(model1_sensitivity, model1b_sensitivity, model2_sensitivity, model3_sensitivity, model3_sensitivity, model5_sensitivity),
Specificity = c(model1_specificity, model1b_specificity, model2_specificity, model3_specificity, model3_specificity, model5_specificity),
F1_Score = c(model1_f1, model1b_f1, model2_f1, model3_f1, model3_f1, model5_f1),
Kappa = c(model1_kappa, model1b_kappa, model2_kappa, model3_kappa, model3_kappa, model5_kappa),
AUC = c(model1_auc, model1b_auc, model2_auc, model3_auc, model3_auc, model5_auc),
Duration = c(model1_duration, model1b_duration, model2_duration, model3_duration, model3_duration, model5_duration)
)
all_experiments %>%
kbl() %>%
kable_styling(full_width = TRUE)
| Model | Accuracy | Error_Rate | Precision | Sensitivity | Specificity | F1_Score | Kappa | AUC | Duration |
|---|---|---|---|---|---|---|---|---|---|
| Model 1 | 0.8361153 | 0.1638847 | 0.9398302 | 0.8710229 | 0.5618280 | 0.9041193 | 0.3459961 | 0.7530753 | 7.945955 mins |
| Model 1B | 0.5813305 | 0.4186695 | 0.9398773 | 0.5602145 | 0.7365591 | 0.7020009 | 0.1300047 | NA | 4.819284 mins |
| Model 2 | 0.8698027 | 0.1301973 | 0.9300000 | 0.9226822 | 0.4543011 | 0.9263266 | 0.3670796 | 0.7532906 | 5.930396 mins |
| Model 3 | 0.8758725 | 0.1241275 | 0.9240891 | 0.9370510 | 0.3951613 | 0.9305249 | 0.3489970 | 0.7343703 | 1.318632 mins |
| Model 4 | 0.8758725 | 0.1241275 | 0.9240891 | 0.9370510 | 0.3951613 | 0.9305249 | 0.3489970 | 0.7343703 | 1.318632 mins |
| Model 5 | 0.8361153 | 0.1638847 | 0.9398302 | 0.8710229 | 0.5618280 | 0.9041193 | 0.3459961 | 0.7631944 | 8.157024 mins |
# Plot ROC curves
# Note: 1B not included
plot(model1_roc, col = "blue", main = "ROC Curves Comparison")
plot(model2_roc, col = "red", add = TRUE)
plot(model3_roc, col = "green", add = TRUE)
plot(model3_roc, col = "orange", add = TRUE)
plot(model5_roc, col = "yellow", add = TRUE)
legend("bottomright",
legend = paste0(c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"), " (AUC=",
round(c(model1_auc, model2_auc, model3_auc, model3_auc, model5_auc), 3), ")"),
col = c("blue", "red", "green", "orange", "yellow"),
lwd = 2)
Overall, all of the models produced fairly good accuracy scores. The model with the highest accuracy is Model 3 (and the identical Model 4), which uses the RBF kernel with C = 5. Model 3 produced good performance metrics across the board except for specificity: it had the lowest specificity of the SVM models, meaning it misclassifies many subscribing ("yes") clients as "no" (false positives for the "no" class). It also had the shortest runtime of the SVM models (about 1.3 minutes). Using the combined dataset (Model 1B) did reduce training time relative to Model 1, but it hurt accuracy; the combined features appear to mask informative patterns in the data. That said, Model 1B had the highest specificity of the models, so it produced the fewest false positives. Among the models with an AUC value, Model 5 (effectively the same fit as Model 1) distinguished between the two classes best with an AUC of 0.7632. Given that accuracy has been the metric of choice, the SVM model to pick is Model 3.
# Bar plot of the target variable for predictions for selected model
model3_predictions_df <- as.data.frame(table(model3_predictions))
colnames(model3_predictions_df) <- c("Target", "Count")
ggplot(model3_predictions_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
Decision trees overall performed better than the SVM models. This is not what I expected, since SVMs are versatile and can handle complex patterns in the data. The AdaBoost model 2 had the highest AUC (0.7839), meaning it distinguished between the two classes best. The random forest model 2 also had higher AUC and accuracy values than the SVMs (it was the model selected in the previous assignment). Another advantage of the tree-based models is that they took much less time to train, which can matter in the real world.
Compared to the tree models, the SVM models appear to overfit: after SMOTE they learn the synthetic minority-class ("yes") patterns too closely and overpredict that class on the test set. Noise in the data may also be hurting the SVMs. Random forests also tend to handle large, noisy datasets well, which may be another reason they performed slightly better here.
The downside of SVMs is the amount of hyperparameter tuning required; cross-validation over combinations of parameters might have produced better SVM performance. They also involve more data preparation, and dummy-encoding the categorical variables inflates the number of features, which makes the runtime very long and can be an issue.
The final model for selection would have to be the model from the previous assignment, random forest model 2.