A Portuguese bank conducted a marketing campaign (phone calls) to persuade clients to subscribe to a term deposit, and the records of those calls are available as a dataset. The objective here is to apply machine learning techniques to the dataset to predict whether a client will subscribe and to identify the most effective tactics for persuading more customers to subscribe to the bank’s term deposit in the next campaign.
Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
NOTE: The experiments section is in part 6
For this assignment, I chose the full dataset,
bank-additional-full.csv which has all observations and
features. I would prefer to start out with all the data, then narrow it
down after performing my EDA and algorithm selection. I wouldn’t want to
disregard some features or observations before looking at them and
knowing if they’re important or not.
Input variables:

Bank client data:
1 - age (numeric)
2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown", "secondary", "primary", "tertiary")
5 - default: has credit in default? (binary: "yes", "no")
6 - housing: has housing loan? (binary: "yes", "no")
7 - loan: has personal loan? (binary: "yes", "no")

Related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: "unknown", "telephone", "cellular")
9 - day: last contact day of the month (numeric)
10 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
11 - duration: last contact duration, in seconds (numeric)

Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y: has the client subscribed to a term deposit? (binary: "yes", "no")

(Note: this description comes from the original bank dataset; the bank-additional-full.csv file used here replaces day with day_of_week, uses more detailed education levels (e.g., basic.4y, high.school, university.degree), codes poutcome as "nonexistent" rather than "unknown" when there was no previous campaign, and uses 999 rather than -1 in pdays for clients who were not previously contacted, as the str() output below shows.)
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(dplyr)
library(caret)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
library(gmodels)
library(correlationfunnel)
## ══ Using correlationfunnel? ════════════════════════════════════════════════════
## You might also be interested in applied data science training for business.
## </> Learn more at - www.business-science.io </>
library(DataExplorer)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(readr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ rsample::calibration() masks caret::calibration()
## ✖ scales::discard() masks purrr::discard()
## ✖ e1071::element() masks ggplot2::element()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ car::recode() masks dplyr::recode()
## ✖ MASS::select() masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ car::some() masks purrr::some()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
library(themis)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(vcd)
## Loading required package: grid
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
options(rgl.useNULL=TRUE)
library(adabag)
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following object is masked from 'package:gmodels':
##
## ci
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
#library(fastAdaboost)
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:dials':
##
## buffer
##
## The following object is masked from 'package:scales':
##
## alpha
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(doParallel)
library(FactoMineR) # PCA
library(factoextra) # PCA visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Full dataset
data_raw <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/bank-additional-full.csv", sep = ";")
# Rename columns
names(data_raw)[names(data_raw) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw)[names(data_raw) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw)[names(data_raw) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw)[names(data_raw) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw <- unique(data_raw)
str(data_raw)
## 'data.frame': 41176 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp_var_rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons_price_idx: num 94 94 94 94 94 ...
## $ cons_conf_idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr_employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
EDA was done in assignment 1 so I won’t include it here.
Since over 96% of pdays values are the "not previously contacted" placeholder, the column carries almost no usable information, so let's treat it as missing data and remove it to simplify the model. Keeping near-constant or inapplicable features adds computation without adding signal. Additionally, as the dataset documentation notes, duration is not known before a call is made, so it should be discarded if the intention is to have a realistic predictive model; let's remove it as well.
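As a quick check of that 96% figure (a minimal sketch; it assumes 999 is the "not previously contacted" code in this version of the dataset):

# Proportion of clients never contacted in a previous campaign (pdays placeholder)
mean(data_raw$pdays == 999)   # ~0.96, consistent with the figure above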
# Remove variable pdays (96% inapplicable) and duration
data_raw_clean2 <- data_raw |>
dplyr::select(-c(pdays, duration))
Address Multicollinearity:
Highly correlated predictors are problematic for models such as logistic regression and add redundancy here, so let's remove the variables identified in the EDA with absolute correlations above 0.4: euribor3m, emp_var_rate, and nr_employed. This also helps reduce the amount of data in the model.
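As a quick check (a sketch reproducing the correlation screening from the assignment 1 EDA; the 0.4 threshold is taken from the note above):

# Pairwise correlations among the remaining numeric predictors
num_cols <- sapply(data_raw_clean2, is.numeric)
cor_mat <- cor(data_raw_clean2[, num_cols])
round(cor_mat, 2)

# Flag predictor pairs with |correlation| > 0.4
which(abs(cor_mat) > 0.4 & upper.tri(cor_mat), arr.ind = TRUE)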
# Remove `euribor3m`, `emp_var_rate`, and `nr_employed` (absolute value correlations > 0.4)
data_raw_clean3 <- data_raw_clean2 |>
dplyr::select(-c(euribor3m, emp_var_rate, nr_employed))
Since SVM training takes a very long time, let's also prepare the combined (feature-engineered) dataset I created in Assignment 1.
# Combine Variables (Feature Engineering)
# Rename columns (no-op at this point: these columns were already renamed or dropped above)
data_raw_combined <- data_raw_clean3
names(data_raw_combined)[names(data_raw_combined) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw_combined)[names(data_raw_combined) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw_combined)[names(data_raw_combined) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw_combined)[names(data_raw_combined) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw_combined <- unique(data_raw_combined)
# Convert target into factor
data_raw_combined$y <- as.factor(data_raw_combined$y)
# Combine education levels -> education_combined
data_raw_combined <- data_raw_combined %>%
  mutate(education_combined = case_when(
    education %in% c('basic.4y', 'basic.6y', 'basic.9y') ~ "basic",
    education == "high.school" ~ "high.school",
    education == "illiterate" ~ "illiterate",
    education == "professional.course" ~ "professional.course",
    education == "university.degree" ~ "university.degree",
    education == "unknown" ~ "unknown"
  ))
# Combine job -> employed_status
data_raw_combined <- data_raw_combined %>%
  mutate(employed_status = case_when(
    job %in% c('admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
               'services', 'technician', 'self-employed') ~ "employed",
    job == "student" ~ "student",
    job == "retired" ~ "retired",
    job == "unemployed" ~ "unemployed",
    job == "unknown" ~ "unknown"
  ))
# Combine month -> season
data_raw_combined <- data_raw_combined %>%
  mutate(season = case_when(
    month %in% c('sep', 'oct', 'nov') ~ "fall",
    month %in% c('dec', 'jan', 'feb') ~ "winter",
    month %in% c('mar', 'apr', 'may') ~ "spring",
    month %in% c('jun', 'jul', 'aug') ~ "summer"
  ))
# Combine marital -> is_married
data_raw_combined <- data_raw_combined %>%
mutate(is_married = ifelse(marital == "married", 1, 0)
)
data_raw_combined <- data_raw_combined %>%
dplyr::select(-c(education, job, month, marital))
head(data_raw_combined)
## age default housing loan contact day_of_week campaign previous poutcome
## 1 56 no no no telephone mon 1 0 nonexistent
## 2 57 unknown no no telephone mon 1 0 nonexistent
## 3 37 no yes no telephone mon 1 0 nonexistent
## 4 40 no no no telephone mon 1 0 nonexistent
## 5 56 no no yes telephone mon 1 0 nonexistent
## 6 45 unknown no no telephone mon 1 0 nonexistent
## cons_price_idx cons_conf_idx y education_combined employed_status season
## 1 93.994 -36.4 no basic employed spring
## 2 93.994 -36.4 no high.school employed spring
## 3 93.994 -36.4 no high.school employed spring
## 4 93.994 -36.4 no basic employed spring
## 5 93.994 -36.4 no high.school employed spring
## 6 93.994 -36.4 no basic employed spring
## is_married
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
Convert to numerical features:
SVM requires every feature to be numeric because the kernel works with dot products and distances between observations. So let's create dummy variables for all the categorical features:
# Combined dataset
data_raw_combined3_svm <- data_raw_combined
categorical_features <- c("employed_status", "education_combined", "contact", "season", "day_of_week", "poutcome", "default", "housing", "loan")
dummy_formula <- as.formula(paste("~ . -y"))
# categorical dataset
cat_data_raw_combined3_svm <- data_raw_combined3_svm[, categorical_features]
# use dummyVars to create dummy variables
dummies <- dummyVars(~ ., data = cat_data_raw_combined3_svm, fullRank = FALSE)
combined_encoded <- data.frame(predict(dummies, newdata = cat_data_raw_combined3_svm))
# numeric columns
numeric_cols <- setdiff(names(data_raw_combined3_svm), c(categorical_features, "y"))
# combine back the columns
combined_final_data_for_svm <- cbind(data_raw_combined3_svm[, numeric_cols], combined_encoded, data_raw_combined3_svm$y)
# Rename target column
colnames(combined_final_data_for_svm)[ncol(combined_final_data_for_svm)] <- "y"
head(combined_final_data_for_svm)
## age campaign previous cons_price_idx cons_conf_idx is_married
## 1 56 1 0 93.994 -36.4 1
## 2 57 1 0 93.994 -36.4 1
## 3 37 1 0 93.994 -36.4 1
## 4 40 1 0 93.994 -36.4 1
## 5 56 1 0 93.994 -36.4 1
## 6 45 1 0 93.994 -36.4 1
## employed_statusemployed employed_statusretired employed_statusstudent
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## employed_statusunemployed employed_statusunknown education_combinedbasic
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 0 1
## 5 0 0 0
## 6 0 0 1
## education_combinedhigh.school education_combinedilliterate
## 1 0 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 1 0
## 6 0 0
## education_combinedprofessional.course education_combineduniversity.degree
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## education_combinedunknown contactcellular contacttelephone seasonfall
## 1 0 0 1 0
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 0 1 0
## 5 0 0 1 0
## 6 0 0 1 0
## seasonspring seasonsummer seasonwinter day_of_weekfri day_of_weekmon
## 1 1 0 0 0 1
## 2 1 0 0 0 1
## 3 1 0 0 0 1
## 4 1 0 0 0 1
## 5 1 0 0 0 1
## 6 1 0 0 0 1
## day_of_weekthu day_of_weektue day_of_weekwed poutcomefailure
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## poutcomenonexistent poutcomesuccess defaultno defaultunknown defaultyes
## 1 1 0 1 0 0
## 2 1 0 0 1 0
## 3 1 0 1 0 0
## 4 1 0 1 0 0
## 5 1 0 1 0 0
## 6 1 0 0 1 0
## housingno housingunknown housingyes loanno loanunknown loanyes y
## 1 1 0 0 1 0 0 no
## 2 1 0 0 1 0 0 no
## 3 0 0 1 1 0 0 no
## 4 1 0 0 1 0 0 no
## 5 1 0 0 0 0 1 no
## 6 1 0 0 1 0 0 no
Let’s do the same SVM data preparation for the regular dataset:
# Regular dataset
data_raw_svm <- data_raw_clean3
# Convert target to factor
data_raw_svm$y <- as.factor(data_raw_svm$y)
categorical_features <- c("job", "education", "contact", "month", "day_of_week", "poutcome", "default", "housing", "loan", "marital")
dummy_formula <- as.formula(paste("~ . -y")) # Adjust as needed
# categorical dataset
cat_data_raw_svm <- data_raw_svm[, categorical_features]
# use dummyVars to create dummy variables
dummies <- dummyVars(~ ., data = cat_data_raw_svm, fullRank = FALSE)
my_data_encoded <- data.frame(predict(dummies, newdata = cat_data_raw_svm))
# numeric columns
numeric_cols <- setdiff(names(data_raw_svm), c(categorical_features, "y"))
# combine back the columns
final_data_for_svm <- cbind(data_raw_svm[, numeric_cols], my_data_encoded, data_raw_svm$y)
# Rename target column
colnames(final_data_for_svm)[ncol(final_data_for_svm)] <- "y"
head(final_data_for_svm)
## age campaign previous cons_price_idx cons_conf_idx jobadmin. jobblue.collar
## 1 56 1 0 93.994 -36.4 0 0
## 2 57 1 0 93.994 -36.4 0 0
## 3 37 1 0 93.994 -36.4 0 0
## 4 40 1 0 93.994 -36.4 1 0
## 5 56 1 0 93.994 -36.4 0 0
## 6 45 1 0 93.994 -36.4 0 0
## jobentrepreneur jobhousemaid jobmanagement jobretired jobself.employed
## 1 0 1 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## jobservices jobstudent jobtechnician jobunemployed jobunknown
## 1 0 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 0 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## educationbasic.4y educationbasic.6y educationbasic.9y educationhigh.school
## 1 1 0 0 0
## 2 0 0 0 1
## 3 0 0 0 1
## 4 0 1 0 0
## 5 0 0 0 1
## 6 0 0 1 0
## educationilliterate educationprofessional.course educationuniversity.degree
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## educationunknown contactcellular contacttelephone monthapr monthaug monthdec
## 1 0 0 1 0 0 0
## 2 0 0 1 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 1 0 0 0
## 5 0 0 1 0 0 0
## 6 0 0 1 0 0 0
## monthjul monthjun monthmar monthmay monthnov monthoct monthsep day_of_weekfri
## 1 0 0 0 1 0 0 0 0
## 2 0 0 0 1 0 0 0 0
## 3 0 0 0 1 0 0 0 0
## 4 0 0 0 1 0 0 0 0
## 5 0 0 0 1 0 0 0 0
## 6 0 0 0 1 0 0 0 0
## day_of_weekmon day_of_weekthu day_of_weektue day_of_weekwed poutcomefailure
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## poutcomenonexistent poutcomesuccess defaultno defaultunknown defaultyes
## 1 1 0 1 0 0
## 2 1 0 0 1 0
## 3 1 0 1 0 0
## 4 1 0 1 0 0
## 5 1 0 1 0 0
## 6 1 0 0 1 0
## housingno housingunknown housingyes loanno loanunknown loanyes
## 1 1 0 0 1 0 0
## 2 1 0 0 1 0 0
## 3 0 0 1 1 0 0
## 4 1 0 0 1 0 0
## 5 1 0 0 0 0 1
## 6 1 0 0 1 0 0
## maritaldivorced maritalmarried maritalsingle maritalunknown y
## 1 0 1 0 0 no
## 2 0 1 0 0 no
## 3 0 1 0 0 no
## 4 0 1 0 0 no
## 5 0 1 0 0 no
## 6 0 1 0 0 no
To ensure objective model evaluation and guard against overfitting, let's split each dataset into training (80%) and testing (20%) sets. After trying to run an SVM on the full dataset, I found that using only a portion of the data is necessary for the model to finish running in a reasonable amount of time.
For the combined dataset, let’s sample 60% of the full dataset, and for the regular dataset, let’s take 40% of the full dataset.
# Split data into train and test
set.seed(123)
# Randomly sample 60% of the rows for the combined dataset
# Combined dataset
sampled_combined_data <- combined_final_data_for_svm %>%
slice_sample(prop = 0.60)
sampled_combined_split <- initial_split(sampled_combined_data, prop = 0.8, strata = y)
sampled_combined_train <- sampled_combined_split |>
training()
sampled_combined_test <- sampled_combined_split |>
testing()
# Randomly sample 40% of the rows for the regular dataset
# Regular dataset
sampled_regular_data <- final_data_for_svm %>%
slice_sample(prop = 0.40)
sampled_regular_split <- initial_split(sampled_regular_data, prop = 0.8, strata = y)
sampled_regular_train <- sampled_regular_split |>
training()
sampled_regular_test <- sampled_regular_split |>
testing()
As the EDA showed, this dataset is heavily imbalanced, which can bias the model toward the majority class. To address this, let's oversample the minority class in the training sets with SMOTE (Synthetic Minority Over-sampling Technique):
set.seed(123)
# Only apply SMOTE to training set
sampled_combined_smoted_train <- smotenc(sampled_combined_train, var = "y", over_ratio = 1)
sampled_regular_smoted_train <- smotenc(sampled_regular_train, var = "y", over_ratio = 1)
# Distribution of target variable - combined
sampled_combined_smoted_train |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
# Distribution of target variable - regular
sampled_regular_smoted_train |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
We can now see that the distribution is even for our target variable,
y.
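As a quick numeric check of the class balance (a small sketch using the training sets created above):

# Class counts before and after SMOTE (regular dataset)
table(sampled_regular_train$y)
table(sampled_regular_smoted_train$y)

# Proportions after SMOTE should be roughly 50/50
prop.table(table(sampled_regular_smoted_train$y))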
For all the models, accuracy will be the main evaluation metric.
ksvm() parameters (from the official kernlab documentation):
x: a symbolic description of the model to be fit. When
not using a formula x can be a matrix or vector containing the training
data or a kernel matrix of class kernelMatrix of the training data or a
list of character vectors (for use with the string kernel). Note, that
the intercept is always excluded, whether given in the formula or
not.
data: an optional data frame containing the training
data, when using a formula. By default the data is taken from the
environment which `ksvm’ is called from.
y: a response vector with one label for each
row/component of x. Can be either a factor (for classification tasks) or
a numeric vector (for regression).
scaled: A logical vector indicating the variables to be
scaled. If scaled is of length 1, the value is recycled as many times as
needed and all non-binary variables are scaled. Per default, data are
scaled internally (both x and y variables) to zero mean and unit
variance. The center and scale values are returned and used for later
predictions.
type: ksvm can be used for classification , for
regression, or for novelty detection. Depending on whether y is a factor
or not, the default setting for type is C-svc or eps-svr, respectively,
but can be overwritten by setting an explicit value. Valid options
are:
- C-svc: C classification
- nu-svc: nu classification
- C-bsvc: bound-constraint SVM classification
- spoc-svc: Crammer, Singer native multi-class
- kbb-svc: Weston, Watkins native multi-class
- one-svc: novelty detection
- eps-svr: epsilon regression
- nu-svr: nu regression
- eps-bsvr: bound-constraint SVM regression
kernel: the kernel function used in training and
predicting. This parameter can be set to any function, of class kernel,
which computes the inner product in feature space between two vector
arguments (see kernels). kernlab provides the most popular kernel
functions which can be used by setting the kernel parameter to the
following strings:
- rbfdot: Radial Basis kernel ("Gaussian")
- polydot: Polynomial kernel
- vanilladot: Linear kernel
- tanhdot: Hyperbolic tangent kernel
- laplacedot: Laplacian kernel
- besseldot: Bessel kernel
- anovadot: ANOVA RBF kernel
- splinedot: Spline kernel
- stringdot: String kernel
Setting the kernel parameter to “matrix” treats x as a kernel matrix calling the kernelMatrix interface.
The kernel parameter can also be set to a user defined function of class kernel by passing the function name as an argument.
kpar: the list of hyper-parameters (kernel parameters).
This is a list which contains the parameters to be used with the kernel
function. Valid parameters for existing kernels are:
- sigma: inverse kernel width for the Radial Basis kernel function "rbfdot" and the Laplacian kernel "laplacedot"
- degree, scale, offset: for the Polynomial kernel "polydot"
- scale, offset: for the Hyperbolic tangent kernel function "tanhdot"
- sigma, order, degree: for the Bessel kernel "besseldot"
- sigma, degree: for the ANOVA kernel "anovadot"
- length, lambda, normalized: for the "stringdot" kernel, where length is the length of the strings considered, lambda the decay factor, and normalized a logical parameter determining if the kernel evaluations should be normalized
Hyper-parameters for user defined kernels can be passed through the kpar parameter as well. In the case of a Radial Basis kernel function (Gaussian) kpar can also be set to the string “automatic” which uses the heuristics in sigest to calculate a good sigma value for the Gaussian RBF or Laplace kernel, from the data. (default = “automatic”).
C: cost of constraints violation (default: 1) this is
the `C’-constant of the regularization term in the Lagrange
formulation.
nu: parameter needed for nu-svc, one-svc, and nu-svr.
The nu parameter sets the upper bound on the training error and the
lower bound on the fraction of data points to become Support Vectors
(default: 0.2).
epsilon: epsilon in the insensitive-loss function used
for eps-svr, nu-svr and eps-bsvm (default: 0.1)
prob.model: if set to TRUE builds a model for
calculating class probabilities or in case of regression, calculates the
scaling parameter of the Laplacian distribution fitted on the residuals.
Fitting is done on output data created by performing a 3-fold
cross-validation on the training data. For details see references.
(default: FALSE)
class.weights: a named vector of weights for the
different classes, used for asymmetric class sizes. Not all factor
levels have to be supplied (default weight: 1). All components have to
be named.
cache: cache memory in MB (default 40)
tol: tolerance of termination criterion (default:
0.001)
shrinking: option whether to use the
shrinking-heuristics (default: TRUE)
cross: if an integer value k > 0 is specified, a k-fold
cross validation on the training data is performed to assess the quality
of the model: the accuracy rate for classification and the Mean Squared
Error for regression
fit: indicates whether the fitted values should be
computed and included in the model or not (default: TRUE)
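To make the mapping from these parameters to an actual call concrete, here is a small illustrative sketch. It fits a throwaway model on a random subsample of the SMOTE'd training data; the parameter values are arbitrary placeholders, not the tuned models built below.

# Illustrative only: how the ksvm() arguments described above fit together
set.seed(1)
demo_train <- dplyr::slice_sample(sampled_regular_smoted_train, n = 2000)

demo_fit <- ksvm(y ~ .,                      # formula interface
                 data = demo_train,          # training data frame
                 type = "C-svc",             # classification
                 kernel = "rbfdot",          # Gaussian RBF kernel
                 kpar = list(sigma = 0.05),  # kernel hyper-parameter goes through kpar
                 C = 1,                      # cost of constraint violation
                 cross = 3,                  # 3-fold cross-validation error estimate
                 prob.model = TRUE)          # needed for predict(type = "probabilities")
demo_fit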
Let's start with a simple linear SVM classifier on the regular dataset. SVM should be able to find patterns in a dataset with many features, but the drawback is a long training time. This will serve as a simple baseline model.
# NOTE: no need to standardize the dataset; ksvm scales the data internally by default (scaled = TRUE)
# Parallel processing (to try to speed up runtime)
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build simple linear model
set.seed(125)
start <- Sys.time()
svm_model1 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "vanilladot",
prob.model=TRUE) # to use for AUC/ROC
## Setting default kernel parameters
print(svm_model1)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 16499
##
## Objective Function Value : -14705.61
## Training error : 0.285696
## Probability model included.
end <- Sys.time()
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
Model Evaluation:
# Predict for the linear SVM model
model1_predictions <- predict(svm_model1, sampled_regular_test)
# Confusion matrix
cf_matrix_model1 <- confusionMatrix(model1_predictions, sampled_regular_test$y)
# Evaluation metrics
model1_probs <- predict(svm_model1, newdata = sampled_regular_test, type = "probabilities")[,2]
model1_roc <- roc(sampled_regular_test$y, model1_probs, quiet = TRUE)
model1_auc <- auc(model1_roc)
model1_accuracy <- cf_matrix_model1$overall['Accuracy']
model1_kappa <- cf_matrix_model1$overall['Kappa']
model1_sensitivity <- cf_matrix_model1$byClass['Sensitivity']
model1_specificity <- cf_matrix_model1$byClass['Specificity']
model1_f1 <- cf_matrix_model1$byClass["F1"]
model1_precision <- cf_matrix_model1$byClass["Precision"]
model1_duration <- end - start
output1 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model1)), collapse = "\n"), "\n",
"Accuracy:", round(model1_accuracy, 4), "| Precision:", round(model1_precision, 4),
"| Sensitivity:", round(model1_sensitivity, 4), "| Specificity:", round(model1_specificity, 4), "\n",
"F1 Score:", round(model1_f1, 4), "| Kappa:", round(model1_kappa, 4),
"| Model Duration:", round(model1_duration, 4), "\n\n", sep = " ")
cat(output1)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2546 163
## yes 377 209
##
## Accuracy : 0.8361
## 95% CI : (0.823, 0.8486)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.346
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8710
## Specificity : 0.5618
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.3567
## Prevalence : 0.8871
## Detection Rate : 0.7727
## Detection Prevalence : 0.8222
## Balanced Accuracy : 0.7164
##
## 'Positive' Class : no
##
## Accuracy: 0.8361 | Precision: 0.9398 | Sensitivity: 0.871 | Specificity: 0.5618
## F1 Score: 0.9041 | Kappa: 0.346 | Model Duration: 7.946
Results:
The linear kernel achieved an accuracy of 83.61%, which is reasonable without any hyperparameter tuning. Aside from specificity, its metrics are solid across the board.
The purpose of this experiment is to use the combined dataset (more observations and fewer predictors). The motivation is SVM's long runtime and the high dimensionality of the dummy-encoded data.
# NOTE: no need to standardize the dataset; ksvm scales the data internally by default (scaled = TRUE)
# Parallel processing (to try to speed up runtime)
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build simple linear model w/ combined dataset
set.seed(125)
start <- Sys.time()
svm_model1b <- ksvm(y ~ ., data = sampled_combined_smoted_train,
kernel = "vanilladot",
prob.model=TRUE) # to use for AUC/ROC
## Setting default kernel parameters
print(svm_model1b)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 23791
##
## Objective Function Value : -23774.34
## Training error : 0.333455
## Probability model included.
end <- Sys.time()
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
Model Evaluation:
# Predict
model1b_predictions <- predict(svm_model1b, sampled_combined_test)
# Confusion matrix
cf_matrix_model1b <- confusionMatrix(model1b_predictions, sampled_combined_test$y)
# Evaluation metrics
# this causes $ operator is invalid for atomic vectors for some reason
# model1b_probs <- predict(svm_model1b, newdata = sampled_combined_test, type = "probabilities")[,2]
# model1b_roc <- roc(sampled_regular_test$y, model1b_probs, quiet = TRUE)
# model1b_auc <- auc(model1b_roc)
model1b_auc <- NA
model1b_accuracy <- cf_matrix_model1b$overall['Accuracy']
model1b_kappa <- cf_matrix_model1b$overall['Kappa']
model1b_sensitivity <- cf_matrix_model1b$byClass['Sensitivity']
model1b_specificity <- cf_matrix_model1b$byClass['Specificity']
model1b_f1 <- cf_matrix_model1b$byClass["F1"]
model1b_precision <- cf_matrix_model1b$byClass["Precision"]
model1b_duration <- end - start
output1b <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1B Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model1b)), collapse = "\n"), "\n",
"Accuracy:", round(model1b_accuracy, 4), "| Precision:", round(model1b_precision, 4),
"| Sensitivity:", round(model1b_sensitivity, 4), "| Specificity:", round(model1b_specificity, 4), "\n",
"F1 Score:", round(model1b_f1, 4), "| Kappa:", round(model1b_kappa, 4),
"| Model Duration:", round(model1b_duration, 4), "\n\n", sep = " ")
cat(output1b)
##
## === Model Selection and Evaluation ===
##
## === Model 1B Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2298 147
## yes 1804 411
##
## Accuracy : 0.5813
## 95% CI : (0.567, 0.5955)
## No Information Rate : 0.8803
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.13
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5602
## Specificity : 0.7366
## Pos Pred Value : 0.9399
## Neg Pred Value : 0.1856
## Prevalence : 0.8803
## Detection Rate : 0.4931
## Detection Prevalence : 0.5247
## Balanced Accuracy : 0.6484
##
## 'Positive' Class : no
##
## Accuracy: 0.5813 | Precision: 0.9399 | Sensitivity: 0.5602 | Specificity: 0.7366
## F1 Score: 0.702 | Kappa: 0.13 | Model Duration: 4.8193
Results:
The combined-dataset model trained faster than model 1, but its accuracy dropped to a poor 58.13%. Precision is still high, though; since caret treats "no" as the positive class here, this means that when the model predicts "no" it is usually correct. The confusion matrix shows the real problem: the model heavily overpredicts "yes", the minority class.
Note: the AUC code threw an error for this dataset, but since the accuracy was already so poor, I did not spend time debugging it.
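One possible fix (an untested sketch, not part of the run above): the commented-out code computes probabilities on sampled_combined_test but passes sampled_regular_test$y to roc(), and it selects the probability column by position. Using the matching test set and selecting the "yes" column by name may resolve the error:

# Sketch (not run): AUC for the combined-dataset model
model1b_prob_mat <- predict(svm_model1b, newdata = sampled_combined_test,
                            type = "probabilities")
model1b_probs <- model1b_prob_mat[, "yes"]          # probability of subscribing
model1b_roc <- roc(sampled_combined_test$y, model1b_probs, quiet = TRUE)
auc(model1b_roc)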
The motivation for this model is that the RBF (Gaussian) kernel has been shown to perform well on many kinds of data and can capture non-linear patterns. Let's use the default parameters for now with the regular dataset.
# Build Gaussian RBF model
set.seed(125)
start <- Sys.time()
svm_model2 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot",
prob.model=TRUE)
print(svm_model2)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0109304259985459
##
## Number of Support Vectors : 11903
##
## Objective Function Value : -9214.919
## Training error : 0.113868
## Probability model included.
end <- Sys.time()
Model Evaluation:
# Predict - Gaussian RBF
model2_predictions <- predict(svm_model2, sampled_regular_test)
# Confusion matrix
cf_matrix_model2 <- confusionMatrix(model2_predictions, sampled_regular_test$y)
# Evaluation metrics
model2_probs <- predict(svm_model2, newdata = sampled_regular_test, type = "probabilities")[, 2]
model2_roc <- roc(sampled_regular_test$y, model2_probs, quiet = TRUE)
model2_auc <- auc(model2_roc)
model2_accuracy <- cf_matrix_model2$overall['Accuracy']
model2_kappa <- cf_matrix_model2$overall['Kappa']
model2_sensitivity <- cf_matrix_model2$byClass['Sensitivity']
model2_specificity <- cf_matrix_model2$byClass['Specificity']
model2_f1 <- cf_matrix_model2$byClass["F1"]
model2_precision <- cf_matrix_model2$byClass["Precision"]
model2_duration <- end - start
output2 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model2)), collapse = "\n"), "\n",
"Accuracy:", round(model2_accuracy, 4), "| Precision:", round(model2_precision, 4),
"| Sensitivity:", round(model2_sensitivity, 4), "| Specificity:", round(model2_specificity, 4), "\n",
"F1 Score:", round(model2_f1, 4), "| Kappa:", round(model2_kappa, 4), "| AUC:", round(model2_auc, 4),
"| Model Duration:", round(model2_duration, 4), "\n\n", sep = " ")
cat(output2)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2697 203
## yes 226 169
##
## Accuracy : 0.8698
## 95% CI : (0.8578, 0.8811)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 0.9990
##
## Kappa : 0.3671
##
## Mcnemar's Test P-Value : 0.2882
##
## Sensitivity : 0.9227
## Specificity : 0.4543
## Pos Pred Value : 0.9300
## Neg Pred Value : 0.4278
## Prevalence : 0.8871
## Detection Rate : 0.8185
## Detection Prevalence : 0.8801
## Balanced Accuracy : 0.6885
##
## 'Positive' Class : no
##
## Accuracy: 0.8698 | Precision: 0.93 | Sensitivity: 0.9227 | Specificity: 0.4543
## F1 Score: 0.9263 | Kappa: 0.3671 | AUC: 0.7533 | Model Duration: 5.9304
Results:
The RBF kernel increased the accuracy to 86.98%, which makes sense since this kernel can capture more complex patterns in the data. Excluding specificity, the performance metrics look very good for this model.
This experiment applies different values of C to find the optimal value for our data. A high cost can lead to overfitting, while a low cost can lead to underfitting because misclassifications are penalized too little for the model to capture the patterns in the data.
Note: I did try to do hyperparameter tuning with cross-validation, but the code would never complete. That is why I have broken out the tuning into separate models which isn’t ideal. I have left the code commented out for now.
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build RBF models with varying cost parameter
# This code is inspired from the textbook
# Takes a while to run
cost_values <- c(0.25, 1, 5, 10)
accuracy_values <- sapply(cost_values, function(x) {
set.seed(12345)
m <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot", C = x)
pred <- predict(m, sampled_regular_test)
agree <- ifelse(pred == sampled_regular_test$y, 1, 0)
accuracy <- sum(agree) / nrow(sampled_regular_test)
return (accuracy)
})
# Plot accuracy vs cost
plot(cost_values, accuracy_values, type = "b")
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
print(accuracy_values)
## [1] 0.8543247 0.8698027 0.8755690 0.8719272
The best value turned out to be C = 5, producing a pretty good accuracy score of 87.56%. Let's now create this model using C = 5:
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build best RBF model with C=5
set.seed(125)
start <- Sys.time()
svm_model3 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot",
C = 5,
prob.model=TRUE)
print(svm_model3)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 5
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0109304259985459
##
## Number of Support Vectors : 8896
##
## Objective Function Value : -29990
## Training error : 0.066729
## Probability model included.
end <- Sys.time()
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
Model Evaluation:
# Predict RBF model
model3_predictions <- predict(svm_model3, sampled_regular_test)
# Confusion matrix
cf_matrix_model3 <- confusionMatrix(model3_predictions, sampled_regular_test$y)
# Evaluation metrics
model3_probs <- predict(svm_model3, newdata = sampled_regular_test, type = "probabilities")[, 2]
model3_roc <- roc(sampled_regular_test$y, model3_probs, quiet = TRUE)
model3_auc <- auc(model3_roc)
model3_accuracy <- cf_matrix_model3$overall['Accuracy']
model3_kappa <- cf_matrix_model3$overall['Kappa']
model3_sensitivity <- cf_matrix_model3$byClass['Sensitivity']
model3_specificity <- cf_matrix_model3$byClass['Specificity']
model3_f1 <- cf_matrix_model3$byClass["F1"]
model3_precision <- cf_matrix_model3$byClass["Precision"]
model3_duration <- end - start
output3 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 3 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model3)), collapse = "\n"), "\n",
"Accuracy:", round(model3_accuracy, 4), "| Precision:", round(model3_precision, 4),
"| Sensitivity:", round(model3_sensitivity, 4), "| Specificity:", round(model3_specificity, 4), "\n",
"F1 Score:", round(model3_f1, 4), "| Kappa:", round(model3_kappa, 4), "| AUC:", round(model3_auc, 4),
"| Model Duration:", round(model3_duration, 4), "\n\n", sep = " ")
cat(output3)
##
## === Model Selection and Evaluation ===
##
## === Model 3 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2739 225
## yes 184 147
##
## Accuracy : 0.8759
## 95% CI : (0.8641, 0.8869)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 0.97944
##
## Kappa : 0.349
##
## Mcnemar's Test P-Value : 0.04794
##
## Sensitivity : 0.9371
## Specificity : 0.3952
## Pos Pred Value : 0.9241
## Neg Pred Value : 0.4441
## Prevalence : 0.8871
## Detection Rate : 0.8313
## Detection Prevalence : 0.8995
## Balanced Accuracy : 0.6661
##
## 'Positive' Class : no
##
## Accuracy: 0.8759 | Precision: 0.9241 | Sensitivity: 0.9371 | Specificity: 0.3952
## F1 Score: 0.9305 | Kappa: 0.349 | AUC: 0.7344 | Model Duration: 1.3186
Results:
As expected, this produced an 87.59% accuracy score and fairly good metrics across the board (excluding specificity).
# Unfortunately this code was never able to complete (took too long)
# # Parallel processing
# num_cores <- parallel::detectCores() - 1 # leave one core free
# num_cores
# cl <- makePSOCKcluster(num_cores) # number of cores to use
# registerDoParallel(cl)
#
# set.seed(125)
#
# # Cross validation
# ctrl <- trainControl(method = "repeatedcv", repeats = 3)
#
# # Params to test
# rbf_grid <- expand.grid(
# # degree = c(2, 3),
# sigma = c(0.01, 0.1),
# C = c(1, 5)
# )
#
# # Build model
# svm_poly_cv <- train(y ~ ., data = sampled_regular_smoted_train, method = "svmRadial",
# tuneGrid = rbf_grid, trControl = ctrl, metric = "Accuracy")
#
# # View the best parameters
# print(svm_poly_cv)
#
# plot(svm_poly_cv)
#
# # Stop cluster
# stopCluster(cl)
# registerDoSEQ()
# closeAllConnections()
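A possibly more tractable variant (a sketch only, not run here; it assumes that tuning on a 25% subsample with 3-fold cross-validation keeps the runtime manageable):

# Sketch (not run): joint C/sigma tuning on a smaller subsample with 3-fold CV
set.seed(125)
tune_sample <- dplyr::slice_sample(sampled_regular_smoted_train, prop = 0.25)

ctrl <- trainControl(method = "cv", number = 3)
rbf_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(1, 5, 10))

svm_rbf_cv <- caret::train(y ~ ., data = tune_sample, method = "svmRadial",
                           tuneGrid = rbf_grid, trControl = ctrl, metric = "Accuracy")
svm_rbf_cv$bestTune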
Since I wasn't able to tune C and sigma together, this model tunes sigma on its own. In kernlab's parameterization, sigma is the inverse kernel width: a larger sigma means each training point influences only a small neighborhood, giving a more flexible boundary that is very sensitive to the data, while a smaller sigma means a wider influence and a smoother, less sensitive boundary.
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build RBF models to find the best sigma
# This code is inspired from the textbook
# NOTE: Takes a while to run
# Use best C value found from previous experiment
# NOTE: sigma is not a ksvm() argument; kernel parameters should go through
# kpar = list(sigma = x), so as written these runs likely fall back to the automatic sigma
sigma_values <- c(0.001, 0.01, 0.1, 100)
accuracy4_values <- sapply(sigma_values, function(x) {
set.seed(12345)
m4 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "rbfdot", sigma = x, C=5)
pred4 <- predict(m4, sampled_regular_test)
agree4 <- ifelse(pred4 == sampled_regular_test$y, 1, 0)
accuracy4 <- sum(agree4) / nrow(sampled_regular_test)
return (accuracy4)
})
# Plot accuracy vs sigma
plot(sigma_values, accuracy4_values, type = "b")
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
print(accuracy4_values)
## [1] 0.875569 0.875569 0.875569 0.875569
Results:
The accuracy was identical for every sigma value and matches model 3 exactly. The most likely explanation is that sigma is not a ksvm() argument, so the values above were silently ignored and every run fell back to the automatically estimated sigma; kernel parameters need to be supplied via kpar (see the sketch below). Either way, model 4 ends up identical to model 3, so I won't rerun the evaluation code.
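For reference, here is how the sweep could be written so that the sigma values are actually applied (a sketch, not run here, so the resulting accuracies are unknown):

# Sketch (not run): pass sigma through kpar so it reaches the RBF kernel
accuracy4b_values <- sapply(sigma_values, function(x) {
  set.seed(12345)
  m4b <- ksvm(y ~ ., data = sampled_regular_smoted_train,
              kernel = "rbfdot", kpar = list(sigma = x), C = 5)
  pred4b <- predict(m4b, sampled_regular_test)
  mean(pred4b == sampled_regular_test$y)  # test-set accuracy
})
accuracy4b_values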
This experiment tests another kernel type, the polynomial kernel, to check whether polynomial transformations of the features improve on the linear kernel. For this experiment, let's tune the degree parameter.
# Parallel processing
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 7
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
# Build polynomial models with varying degree
set.seed(125)
start <- Sys.time()
# svm_model4 <- ksvm(y ~ ., data = smoted_train,
# kernel = "polydot",
# prob.model=TRUE)
# Try 2nd and 3rd degree polynomial
# NOTE: degree is not a ksvm() argument; kernel parameters should go through
# kpar = list(degree = x), so as written these runs likely use the default degree = 1
degree_values <- c(2, 3)
accuracy5_values <- sapply(degree_values, function(x) {
set.seed(12345)
m5 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "polydot", degree = x)
pred5 <- predict(m5, sampled_regular_test)
agree5 <- ifelse(pred5 == sampled_regular_test$y, 1, 0)
accuracy5 <- sum(agree5) / nrow(sampled_regular_test)
return (accuracy5)
})
## Setting default kernel parameters
## Setting default kernel parameters
# Accuracy vs Cost
plot(degree_values, accuracy5_values, type = "b")
# Stop cluster
stopCluster(cl)
registerDoSEQ()
closeAllConnections()
print(accuracy5_values)
## [1] 0.8361153 0.8361153
Results:
These accuracy values are identical to the linear model's. This is not evidence that the data are linearly separable: as the printed model below shows (degree = 1), the degree argument was not picked up, because ksvm() expects kernel parameters via kpar = list(degree = x), so both fits are effectively linear. Let's build the model anyway:
# Build model
set.seed(125)
start <- Sys.time()
svm_model5 <- ksvm(y ~ ., data = sampled_regular_smoted_train,
kernel = "polydot",
degree = 2,
prob.model=TRUE)
## Setting default kernel parameters
print(svm_model5)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 16523
##
## Objective Function Value : -14705.61
## Training error : 0.285696
## Probability model included.
end <- Sys.time()
Model Evaluation:
# Predict Polynomial SVM
model5_predictions <- predict(svm_model5, sampled_regular_test)
# Confusion Matrix
cf_matrix_model5 <- confusionMatrix(model5_predictions, sampled_regular_test$y)
# Evaluation metrics
model5_probs <- predict(svm_model5, newdata = sampled_regular_test, type = "probabilities")[, 2]
model5_roc <- roc(sampled_regular_test$y, model5_probs, quiet = TRUE)
model5_auc <- auc(model5_roc)
model5_accuracy <- cf_matrix_model5$overall['Accuracy']
model5_kappa <- cf_matrix_model5$overall['Kappa']
model5_sensitivity <- cf_matrix_model5$byClass['Sensitivity']
model5_specificity <- cf_matrix_model5$byClass['Specificity']
model5_f1 <- cf_matrix_model5$byClass["F1"]
model5_precision <- cf_matrix_model5$byClass["Precision"]
model5_duration <- end - start
output5 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 5 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_model5)), collapse = "\n"), "\n",
"Accuracy:", round(model5_accuracy, 4), "| Precision:", round(model5_precision, 4),
"| Sensitivity:", round(model5_sensitivity, 4), "| Specificity:", round(model5_specificity, 4), "\n",
"F1 Score:", round(model5_f1, 4), "| Kappa:", round(model5_kappa, 4), "| AUC:", round(model5_auc, 4),
"| Model Duration:", round(model5_duration, 4), "\n\n", sep = " ")
cat(output5)
##
## === Model Selection and Evaluation ===
##
## === Model 5 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2546 163
## yes 377 209
##
## Accuracy : 0.8361
## 95% CI : (0.823, 0.8486)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.346
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8710
## Specificity : 0.5618
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.3567
## Prevalence : 0.8871
## Detection Rate : 0.7727
## Detection Prevalence : 0.8222
## Balanced Accuracy : 0.7164
##
## 'Positive' Class : no
##
## Accuracy: 0.8361 | Precision: 0.9398 | Sensitivity: 0.871 | Specificity: 0.5618
## F1 Score: 0.9041 | Kappa: 0.346 | AUC: 0.7632 | Model Duration: 8.157
Results:
Regarding accuracy, this model's results mirror the linear model's, which is consistent with the hyperparameters printed above (degree = 1): the quadratic kernel was never actually applied.
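For reference, a genuine second-degree polynomial kernel could be specified by passing the degree through kpar (a sketch, not run here, so its performance is unknown):

# Sketch (not run): polynomial kernel with the degree actually set to 2
svm_model5b <- ksvm(y ~ ., data = sampled_regular_smoted_train,
                    kernel = "polydot",
                    kpar = list(degree = 2, scale = 1, offset = 1),
                    prob.model = TRUE)
print(svm_model5b)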
# Unfortunately this took way too long to run
# # Parallel processing
# num_cores <- parallel::detectCores() - 1 # leave one core free
# num_cores
# cl <- makePSOCKcluster(num_cores) # number of cores to use
# registerDoParallel(cl)
#
# set.seed(125)
#
# # Cross validation
# ctrl <- trainControl(method = "repeatedcv", repeats = 3)
#
# # Params to test
# poly_grid <- expand.grid(
# degree = c(2, 3),
# scale = c(0.01, 0.1),
# C = c(1)
# )
#
# # Build model
# svm_poly_cv <- train(y ~ ., data = smoted_train, method = "svmPoly",
# tuneGrid = poly_grid, trControl = ctrl, metric = "Accuracy")
#
# # View the best parameters
# print(svm_poly_cv)
#
# plot(svm_poly_cv)
#
# # Stop cluster
# stopCluster(cl)
# registerDoSEQ()
# closeAllConnections()
Final results table:
# Create final SVM results table
# NOTE: model 3 is the same as model 4
all_experiments <- data.frame(
Model = c("Model 1", "Model 1B", "Model 2", "Model 3", "Model 4", "Model 5"),
Accuracy = c(model1_accuracy, model1b_accuracy, model2_accuracy, model3_accuracy, model3_accuracy, model5_accuracy),
Error_Rate = c(1-model1_accuracy, 1-model1b_accuracy, 1-model2_accuracy, 1-model3_accuracy, 1-model3_accuracy, 1-model5_accuracy),
Precision = c(model1_precision, model1b_precision, model2_precision, model3_precision, model3_precision, model5_precision),
Sensitivity = c(model1_sensitivity, model1b_sensitivity, model2_sensitivity, model3_sensitivity, model3_sensitivity, model5_sensitivity),
Specificity = c(model1_specificity, model1b_specificity, model2_specificity, model3_specificity, model3_specificity, model5_specificity),
F1_Score = c(model1_f1, model1b_f1, model2_f1, model3_f1, model3_f1, model5_f1),
Kappa = c(model1_kappa, model1b_kappa, model2_kappa, model3_kappa, model3_kappa, model5_kappa),
AUC = c(model1_auc, model1b_auc, model2_auc, model3_auc, model3_auc, model5_auc),
Duration = c(model1_duration, model1b_duration, model2_duration, model3_duration, model3_duration, model5_duration)
)
all_experiments %>%
kbl() %>%
kable_styling(full_width = TRUE)
| Model | Accuracy | Error_Rate | Precision | Sensitivity | Specificity | F1_Score | Kappa | AUC | Duration |
|---|---|---|---|---|---|---|---|---|---|
| Model 1 | 0.8361153 | 0.1638847 | 0.9398302 | 0.8710229 | 0.5618280 | 0.9041193 | 0.3459961 | 0.7530753 | 7.945955 mins |
| Model 1B | 0.5813305 | 0.4186695 | 0.9398773 | 0.5602145 | 0.7365591 | 0.7020009 | 0.1300047 | NA | 4.819284 mins |
| Model 2 | 0.8698027 | 0.1301973 | 0.9300000 | 0.9226822 | 0.4543011 | 0.9263266 | 0.3670796 | 0.7532906 | 5.930396 mins |
| Model 3 | 0.8758725 | 0.1241275 | 0.9240891 | 0.9370510 | 0.3951613 | 0.9305249 | 0.3489970 | 0.7343703 | 1.318632 mins |
| Model 4 | 0.8758725 | 0.1241275 | 0.9240891 | 0.9370510 | 0.3951613 | 0.9305249 | 0.3489970 | 0.7343703 | 1.318632 mins |
| Model 5 | 0.8361153 | 0.1638847 | 0.9398302 | 0.8710229 | 0.5618280 | 0.9041193 | 0.3459961 | 0.7631944 | 8.157024 mins |
# Plot ROC curves
# Note: 1B not included
plot(model1_roc, col = "blue", main = "ROC Curves Comparison")
plot(model2_roc, col = "red", add = TRUE)
plot(model3_roc, col = "green", add = TRUE)
plot(model3_roc, col = "orange", add = TRUE)
plot(model5_roc, col = "yellow", add = TRUE)
legend("bottomright",
legend = paste0(c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"), " (AUC=",
round(c(model1_auc, model2_auc, model3_auc, model3_auc, model5_auc), 3), ")"),
col = c("blue", "red", "green", "orange", "yellow"),
lwd = 2)
Overall, all of the models produced fairly good accuracy scores. The model with the highest accuracy is Model 3 (and the identical Model 4), which uses the RBF kernel with C = 5. Model 3 produced good performance metrics across the board except for specificity: it had the lowest specificity of the SVM models, meaning it misclassifies many subscribing ("yes") clients as "no" (false positives for the "no" class). It also had the shortest runtime of the SVM models (about 1.3 minutes). Using the combined dataset (Model 1B) did reduce training time relative to Model 1, but it hurt accuracy; the combined features appear to mask informative patterns in the data. That said, Model 1B had the highest specificity of the models, so it produced the fewest false positives. Among the models with an AUC value, Model 5 (effectively the same fit as Model 1) distinguished between the two classes best with an AUC of 0.7632. Given that accuracy has been the metric of choice, the SVM model to pick is Model 3.
# Bar plot of the target variable for predictions for selected model
model3_predictions_df <- as.data.frame(table(model3_predictions))
colnames(model3_predictions_df) <- c("Target", "Count")
ggplot(model3_predictions_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
Decision trees overall performed better than the SVM models. This is not what I expected, since SVMs are versatile and can handle complex patterns in the data. The AdaBoost model 2 had the highest AUC (0.7839), meaning it distinguished between the two classes best. The random forest model 2 also had higher AUC and accuracy values than the SVMs (it was the model selected in the previous assignment). Another advantage of the tree-based models is that they took much less time to train, which can matter in the real world.
Compared to the tree models, the SVM models appear to overfit: after SMOTE they learn the synthetic minority-class ("yes") patterns too closely and overpredict that class on the test set. Noise in the data may also be hurting the SVMs. Random forests also tend to handle large, noisy datasets well, which may be another reason they performed slightly better here.
The downside of SVMs is the amount of hyperparameter tuning required; cross-validation over combinations of parameters might have produced better SVM performance. They also involve more data preparation, and dummy-encoding the categorical variables inflates the number of features, which makes the runtime very long and can be an issue.
The final model for selection would have to be the model from the previous assignment, random forest model 2.