Introduction

Our task is to revisit the data set from our prior homework, https://rpubs.com/pkofy/1260168, this time using the Support Vector Machine (SVM) algorithm, and then compare the results between the two assignments.

Additionally we will read five academic articles to compare decision tree-style methods and SVM.

The data description and preprocessing below largely reprise the prior homework, except that here we work with the 10% slice of the data intended for computationally intensive methods such as SVM.


Data


Chosen Data

We are extending our prior work with the “Bank Marketing” data set from the UC Irvine Machine Learning repository.

A Portuguese banking institution collected this data in a direct marketing campaign from May 2008 to November 2010. The data include information about the customers called, their demographics, banking history, and the timing, length and number of interactions they had with the bank, and ultimately whether the customer subscribed to a term deposit.

Later, the paper authors enriched the data with five Portuguese economic indicators like interest rates that could affect whether a customer subscribed.


Citation

Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.


Selection Rationale

Our understanding was that the originally intended data set from excelbianalytics.com was synthetic and therefore unlikely to contain meaningful interactions or support meaningful predictions.

When the data options were opened to kaggle.com, we knew we needed a very large data set so we could subset out a random fraction and contrast strategies for larger and smaller data sets. However, the first ten large (30k+ observation) data sets we found were either fractured with significant missingness, had too few features, or lent themselves to projects outside our scope, such as Natural Language Processing (NLP).

We turned to the UC Irvine Machine Learning Repository looking for large, nonsynthetic sets that were used in academic papers. The Bank Marketing data set seemed like it would be interesting to work with and met the size and complexity requirements.


Loading Data

Here we load only our small data set, rather than both the large and small data sets. The small data set has 4,119 observations and is more conducive to the computationally intensive SVM methods than the 41,188 observations of the full data set.

Our data set consists of 20 features, or predictor variables, and one output attribute, or target variable.

# Check out packages
library(readr)      # data importation
library(tidyverse)  # has dplyr
library(corrplot)   # correlation matrix plots
library(caret)      # model structure
library(rpart)      # decision trees
library(rpart.plot) # decision trees
library(partykit)   # Specialty decision tree options
library(randomForest) # Random forest
library(doParallel) # speed up random forest
library(pdp)        # partial dependence plots
library(party)      # Conditional Inference Trees
library(gbm)        # Boosted Trees
library(e1071)      # SVM
library(smotefamily)# SMOTE
# Load data
loc_df <-"~/Documents/D622/HW1/bank-additional.csv"
df0 <- read_delim(loc_df, delim = ";", escape_double = FALSE, trim_ws = TRUE)


View data

Here we show the data set structure with the first few examples in each variable.

# Data Preview
glimpse(df0)
## Rows: 4,119
## Columns: 21
## $ age            <dbl> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,…
## $ job            <chr> "blue-collar", "services", "services", "services", "adm…
## $ marital        <chr> "married", "single", "married", "married", "married", "…
## $ education      <chr> "basic.9y", "high.school", "high.school", "basic.9y", "…
## $ default        <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n…
## $ housing        <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes…
## $ loan           <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n…
## $ contact        <chr> "cellular", "telephone", "telephone", "telephone", "cel…
## $ month          <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",…
## $ day_of_week    <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",…
## $ duration       <dbl> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,…
## $ campaign       <dbl> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,…
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,…
## $ cons.conf.idx  <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,…
## $ euribor3m      <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,…
## $ nr.employed    <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…


Variable Descriptions

Here we borrow heavily from the documentation provided through the citation website.

Bank Customer Features
age - customer’s age
job - type of job
marital - marital status (note, widowed is counted under “divorced”)
education - educational attainment
default - does the customer have credit in default?
housing - have a housing loan?
loan - have a personal loan?

Campaign Features
contact - contacted by “cellular” or “telephone”
month - month of year they were last contacted
day_of_week - day of week they were last contacted (Monday - Friday)
duration - last contact duration in seconds

Other Features
campaign - number of contacts made to this customer for this campaign
pdays - number of days since the customer was last contacted in the previous campaign
previous - number of contacts made to this customer for the previous campaign
poutcome - outcome of the previous marketing campaign

Economic Features
emp.var.rate - quarterly employment variation rate
cons.price.idx - monthly consumer price index
cons.conf.idx - monthly consumer confidence index
euribor3m - daily euribor 3 month rate
nr.employed - quarterly number of employees


Label Identification

Here is the column representing the outcome, which makes this a classification problem.

Target Variable
y - outcome of the current marketing campaign, “has the customer subscribed a term deposit?”




Exploratory Data Analysis


Summary Statistics - Numeric

Here we provide summary statistics for our data set’s numeric values in order to look for outliers and other irregularities.

pdays, the number of days since the customer was last contacted in the previous campaign, uses 999 to represent “never contacted”. This coding may be sufficient for modeling purposes.

nr.employed looks like it may be the number of employees at the bank. It’s not clear how that could benefit the analysis. Maybe fewer employees means less effective campaign calls.

# Show summary statistics for numeric columns
df0 %>%
    select(where(is.numeric)) %>%
    summary()
##       age           duration         campaign          pdays      
##  Min.   :18.00   Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
##  1st Qu.:32.00   1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.:999.0  
##  Median :38.00   Median : 181.0   Median : 2.000   Median :999.0  
##  Mean   :40.11   Mean   : 256.8   Mean   : 2.537   Mean   :960.4  
##  3rd Qu.:47.00   3rd Qu.: 317.0   3rd Qu.: 3.000   3rd Qu.:999.0  
##  Max.   :88.00   Max.   :3643.0   Max.   :35.000   Max.   :999.0  
##     previous       emp.var.rate      cons.price.idx  cons.conf.idx  
##  Min.   :0.0000   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  1st Qu.:0.0000   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
##  Median :0.0000   Median : 1.10000   Median :93.75   Median :-41.8  
##  Mean   :0.1903   Mean   : 0.08497   Mean   :93.58   Mean   :-40.5  
##  3rd Qu.:0.0000   3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
##  Max.   :6.0000   Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
##    euribor3m      nr.employed  
##  Min.   :0.635   Min.   :4964  
##  1st Qu.:1.334   1st Qu.:5099  
##  Median :4.857   Median :5191  
##  Mean   :3.621   Mean   :5166  
##  3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :5.045   Max.   :5228


Summary Statistics Categorical

Here we use code to generate the counts of each type of value in the categorical variables.

For our target variable, only 451 of the 4,119 records (about 11%, a yes-to-no ratio of 451:3,668) were yes for the current campaign. This means we have an imbalance in our data that should be addressed either through resampling or model techniques.

Notably there was only one “yes” for default in our data set, so we also have an issue with degenerate variables with low categorical frequency.

# Get counts of every factor in each categorical column
categorical_counts <- df0 %>%
  select(where(~ is.character(.) || is.factor(.))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
  group_by(Column, Value) %>%
  summarise(Count = n(), .groups = "drop")

# View the result
categorical_counts
## # A tibble: 55 × 3
##    Column      Value     Count
##    <chr>       <chr>     <int>
##  1 contact     cellular   2652
##  2 contact     telephone  1467
##  3 day_of_week fri         768
##  4 day_of_week mon         855
##  5 day_of_week thu         860
##  6 day_of_week tue         841
##  7 day_of_week wed         795
##  8 default     no         3315
##  9 default     unknown     803
## 10 default     yes           1
## # ℹ 45 more rows
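
To quantify the imbalance noted above, a quick check of the target variable’s class proportions may help; this snippet is illustrative and not part of the original analysis.

# Proportion of yes/no in the target variable
df0 %>%
  count(y) %>%
  mutate(prop = n / sum(n))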


Missing Values

There are no NA values in our data set; however, in six of the eleven categorical variables there are missing values coded as “unknown”. We’re going to treat “unknown” as a separate class, but we could also try an imputation technique to fill them in and compare.

# There are zero NA values in our data set
#df0 %>% summarise(total_missing = sum(is.na(.)))

# Show categorical values of "unknown"
unknown_counts <- df0 %>%
  select(where(~ is.character(.) || is.factor(.))) %>%
  summarise(across(everything(), ~ sum(. == "unknown"), .names = "unknown_{.col}")) %>%
  t()
unknown_counts <- as.data.frame(unknown_counts)
unknown_counts
##                      V1
## unknown_job          39
## unknown_marital      11
## unknown_education   167
## unknown_default     803
## unknown_housing     105
## unknown_loan        105
## unknown_contact       0
## unknown_month         0
## unknown_day_of_week   0
## unknown_poutcome      0
## unknown_y             0
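
As a rough sketch of the imputation alternative mentioned above, one could recode “unknown” to NA and fill with the most frequent observed level. The helper and object names below are hypothetical and not used elsewhere in this report.

# Simple mode imputation for "unknown" values (illustrative only)
impute_mode <- function(x) {
  x[x == "unknown"] <- NA
  most_common <- names(which.max(table(x)))   # most frequent observed level
  replace(x, is.na(x), most_common)
}
df0_imputed <- df0 %>%
  mutate(across(c(job, marital, education, housing, loan), impute_mode))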


Correlation Matrix

Here is the correlation matrix plot for the numeric columns. While we can’t see the correlation between these and the non-numeric target variable y we note some strong correlations.

pdays and previous have a strong negative correlation. This is because customers who were not contacted in the previous campaign have 999 days since last contact and 0 previous contacts, while customers with at least one previous contact have a much smaller days-since-contact value, for example 15 or 30 days.

euribor3m, the daily Euribor 3-month rate, is an average of the rates at which European banks lend Euros to one another, so it makes sense that it is positively correlated with inflation (cons.price.idx) and with the quarterly employment variation rate (emp.var.rate): when borrowing costs are higher, companies are less likely to hire. This includes our Portuguese banking institution, whose total number of employees, nr.employed, is highly correlated with euribor3m.

# Correlation matrix plot
correlations <- cor(df0[, sapply(df0, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')
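
As a quick numeric check of the pdays/previous relationship described above, one could run the snippet below; the derived flag is purely illustrative and not used later in this report.

# Correlation between days-since-last-contact and number of previous contacts
cor(df0$pdays, df0$previous)

# Optional recoding idea: flag customers never contacted before (pdays == 999)
never_contacted <- as.integer(df0$pdays == 999)
table(never_contacted, df0$previous > 0)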


Variable Elimination

We’re going to remove duration because if this is to be a useful prediction model for the bank we can’t know in advance how long the call with the customer is going to be.

Separate to this analysis, the bank could use this information to produce guidelines for how long it’s reasonable to spend on a phone call with a customer for future campaigns but for our purposes we will remove it.

# Remove duration column
df1 <- df0 %>% select(-duration)


Multicollinearity

Here we checked our dataset for multicollinearity by fitting a logistic regression model to assess the Variance Inflation Factor (VIF) for each predictor.

A VIF close to one is ideal, indicating no correlation between the variable and the others. A VIF above 10 warns of high multicollinearity that could impair our modeling efforts.

Note, one of our variables, loan, was found to have at least one of its one-hot encoded subvariables perfectly collinear with the other categorical variables. This means we had to remove the variable to move forward.

Here we show three variables with VIF scores above 10, establishing that we have high multicollinearity. This means we may see higher variability in the decision trees fit to our data: highly correlated variables compete to be chosen for splitting, but only one of them is chosen, so we could get several roughly similar-performing trees depending on which of the multicollinear variables wins the split. Random forest-type algorithms, by contrast, build many small trees that each consider only a limited subset of the features, which tends to reduce their sensitivity to multicollinearity.

# Convert "yes" to 1 and "no" to 0 in the target column `y`
df2 <- df1
df2$y <- ifelse(df2$y == "yes", 1, 0)

# loan was perfectly collinear and had to be removed
#alias(glm_model)
df2 <- df2 %>% select(-loan)

# Logistic regression model and VIF for small
#glm_model <- glm(y ~ ., data = small2, family = binomial)
#vif_values <- car::vif(glm_model)
#vif_values

# Logistic regression model and VIF for large
glm_model <- glm(y ~ ., data = df2, family = binomial)
vif_values <- car::vif(glm_model)
vif_values
##                      GVIF Df GVIF^(1/(2*Df))
## age              2.084182  1        1.443670
## job              5.978228 11        1.084673
## marital          1.509696  3        1.071063
## education        3.398501  7        1.091312
## default          1.159743  2        1.037744
## housing          1.059007  2        1.014436
## contact          2.940503  1        1.714789
## month           81.481158  9        1.276938
## day_of_week      1.131461  4        1.015558
## campaign         1.055520  1        1.027385
## pdays            9.225845  1        3.037408
## previous         4.446755  1        2.108733
## poutcome        21.713785  2        2.158658
## emp.var.rate   144.826962  1       12.034407
## cons.price.idx  67.150388  1        8.194534
## cons.conf.idx    5.371955  1        2.317748
## euribor3m      139.152578  1       11.796295
## nr.employed    175.192276  1       13.236022


Prepare Data

Here we split up our data into inputs dfx and output dfy and remove records with sparse categorical data.

Additionally, some folds may fail during cross-validation because of sparse categories in education (one record of “illiterate”) and default (one record of “yes”). Because of this we’ve decided to exclude these records for modeling purposes. An alternative would be to keep these records but lump them into the next-lowest educational attainment category, or into the “unknown” default category, similar to how “widowed” was counted as “divorced” in the marital variable to reduce rare categorical levels; a sketch of this alternative follows the code below.

# See all education attainment levels
#df2 %>%
#  count(education)

# There are 1 yes and 1 illiterate in our dataset
#sum(df2$default == "yes", na.rm = TRUE)
#sum(df2$education == "illiterate", na.rm = TRUE)

# Remove records with the two rare level categories
df3 <- df2 %>%
  filter(!(education == "illiterate" | default == "yes"))

# Resplit data
dfx <- df3 |> select(-y)
dfy <- df3$y
dfx <- as.data.frame(dfx)
dfx <- dfx %>% mutate(across(where(is.character), as.factor))
dfy <- as.factor(dfy)
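
For reference, here is a minimal sketch of the lumping alternative described above; it is not used for the models in this report. It assumes forcats (attached with the tidyverse) and folds “illiterate” into basic.4y, which we take to be the next-lowest attainment level.

# Fold the rare levels into neighbouring categories instead of dropping records
df_lumped <- df2 %>%
  mutate(education = fct_collapse(as.factor(education),
                                  basic.4y = c("basic.4y", "illiterate")),
         default   = fct_collapse(as.factor(default),
                                  unknown  = c("unknown", "yes")))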



Readings Meta Summary

We read two prescribed articles and three articles we found comparing decision tree ensemble models to SVMs when run on complex, big financial data.

We learned that ensemble learning methods such as Random Forest and XGBoost performed better than SVM in four of the five articles. The articles generally preferred oversampling the minority outcome over undersampling the majority outcome. SVM required more time to run than Random Forest. One article said that Random Forest does better when the inputs include both categorical and numerical predictor types. That same article used our same dataset, so we may want to review how our data preprocessing differs from theirs, given their much better performance.




Prescribed Readings

Here are the two prescribed articles with short summaries.

First Article:

Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection

https://www.hindawi.com/journals/complexity/2021/5550344/

This article compared results using a range of ensemble methods: Random Forest, Bagging, AdaBoost, XGBoost and some ensemble methods that can handle imbalanced data: Balanced Random Forest, SMOTEBoost and RUSBagging. RUS meaning Random Undersampling, by which I believe they undersample the majority outcome so that the dataset is balanced. Contrast that to SMOTE (Synthetic Minority Oversampling Technique) where they oversample the minority outcome so that the dataset is balanced.

Their dataset had a 6.5:1 imbalance and our dataset has a 7.33:1 imbalance so likely Balanced Random Forest would be an interesting next avenue to extend our last assignment.

Their results found that RUS outperformed SMOTE, particularly when combined with XGBoost and AdaBoost.


Second Article:

A Novel Approach to Predict COVID-19 Using Support Vector Machine

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

The article outlined an SVM or Support Vector Machine method to predict three COVID categories, “not infected”, “mildly infected”, or “severely infected”, with 87% accuracy. The authors showed that SVM outperformed other methods: kNN, Naïve Bayes, Random Forest, AdaBoost and Binary Tree. A refresher for me: Naïve Bayes assumes all predictors are independent of each other and calculates the probability of an outcome given a set of predictors based on the data. And a Binary Tree is just a decision tree, as we have been working with them, where each node splits in only two ways.

Notably, they selected a linear kernel, so our attempt may involve multiple kernels. They used a 70:30 train-test split instead of the 10-fold cross-validation we would expect, and they set the cost hyperparameter C to 10. Again, we would tune to find the best value of C (see the sketch below).
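
For illustration, here is a sketch (not run) of how one might tune the cost C across several kernels rather than fixing a linear kernel with C = 10; bank_train is a placeholder for modeling data prepared later in this report, with y as a factor.

# Compare kernels while tuning C; the smallest cross-validated error wins
kernels <- c("linear", "radial", "polynomial")
tunes <- lapply(kernels, function(k)
  tune(svm, y ~ ., data = bank_train, kernel = k,
       ranges = list(cost = c(0.1, 1, 10, 100))))
names(tunes) <- kernels
sapply(tunes, function(t) t$best.performance)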

The article noted SVM outperformed Random Forest because it handles high-dimensional and non-linear data effectively. This suggests SVMs may be a promising avenue for our data set, along with a way to rank features.

Also notably, the article looked for models that emphasized recall and precision because the authors wanted to flag infections accurately: there are potentially mortal consequences to treating a severe case like a mild case, or a mild case like an uninfected person.




Academic Content

Here we found three academic articles that compare the use of decision trees versus SVMs in banking campaigns. They were the top search results for “academic article svm random forest banking campaign”.


Academic One

Enhancing bank marketing strategies with ensemble learning: Empirical analysis

https://pmc.ncbi.nlm.nih.gov/articles/PMC10783788/

The authors conclude that SVM models perform better than decision trees in bank marketing strategy optimization; however ensemble learning methods such as Random Forest and XGBoost are better yet at handling the complexity of big financial data and surfacing insights for marketing and planning.

A few items of note:

  • With the ensemble methods, Random Forest and XGBoost, the authors only used undersampling, not oversampling of the minority outcome.

  • The ensemble methods trained faster and with lower variance compared to SVMs and decision trees.

  • Decision Trees had 37% accuracy, SVMs 87%, Random Forest 92% and XGBoost 95%. From my understanding, these accuracies seem too high for something like marketing and human decisions.


Academic Two

Predicting the Success of Bank Telemarketing using various Classification Algorithms

https://www.diva-portal.org/smash/get/diva2:1233529/FULLTEXT01.pdf

This is actually a thesis by a Swedish university student… using the dataset that we’ve been using for the last two homeworks! I’ll have to mine it for more information to compare my data exploration and setup with the author’s. It could be that there was just a minor difference, such as my not oversampling the minority class.

The article found that both models attained similar accuracy, 89% (significantly higher than what I’ve accomplished), but found Random Forest to be more efficient and interpretable than SVM. SVM required about four hours to run while Random Forest took significantly less time. They also said Random Forest did better at working with mixed predictor types (categorical and numerical).


Academic Three

Investigating customer churn in banking: a machine learning approach and visualization app for data science and management

https://www.sciencedirect.com/science/article/pii/S2666764923000401

This article found Random Forest achieved higher accuracy (78%) compared to SVM (at 67% accuracy). The authors used oversampling of the minority class to address class imbalance.

They found Random Forest to be more adaptable and interpretable compared to SVM but both models were able to highlight the important features. The authors went on to suggest that the models and RShiny app developed for the article could be used to derive insights to affect retention strategies, enhanced customer experience, tailored marketing and product offerings, and effective decision making but didn’t actually offer any insights! Not sure when we get past the all sizzle and no steak work.




SVM Analysis

Here we perform an analysis of the dataset used in Homework #2 using the SVM algorithm.


SVM Implementation 1 Fail

Here we fit an SVM model to our data; however, we didn’t address class imbalance, and so all of the SVM predictions were “no”, i.e., that the customers opted not to buy the savings product. We’ll need to address class imbalance in the next run.

Another learning: based on our academic readings we expected the SVM algorithm to be computationally intensive, so we opted to read the 10% subset of the overall dataset and to split our data into train and test sets (an 80/20 split) instead of doing n-fold cross-validation, to save time. However, the run time for this code was exceptionally short, meaning we likely didn’t need to take these steps.

Also, since this was minimum viable product (MVP) code, we didn’t perform a grid search to find optimal values for the cost and gamma hyperparameters.

Normally we would delete these errors but we want to capture some of this intermediary work in the assignment itself and actually show the error messages consistent with the assignment rubric.

Note, subsequent to this exercise we inspected svm_predictions to see what numbers the model was actually predicting on both the train and test data, and thereafter the confusion matrix below replaced the zeros and ones with values of roughly -0.35 and 2.85. It was this exercise that made us realize the SVM wasn’t producing 0s and 1s as output, so we couldn’t compare the predictions to y; that wasn’t obvious while the confusion matrix still showed 0 and 1, and the subsequent discussion doesn’t reflect it until after the second SVM implementation attempt.

# Split out train and test data sets
set.seed(175328)
train_idx <- createDataPartition(dfy, p = 0.8, list = FALSE)
train_data <- df3[train_idx, ]
test_data <- df3[-train_idx, ]

# Fit SVM model
# (note: y is numeric 0/1 here, so svm() defaults to regression rather than classification)
svm_model <- svm(y ~ ., data = train_data, kernel = "radial", cost = 1, gamma = 1)

# Ensure test_data$y is a factor (due to previous errors in confusionMatrix code line)
test_data$y <- as.factor(test_data$y)

# Predict on test data
svm_predictions <- predict(svm_model, newdata = test_data)

# Reassign svm_predictions with correct factors
svm_predictions <- factor(svm_predictions, levels = levels(test_data$y))

# Evaluate performance
conf_matrix <- confusionMatrix(svm_predictions, test_data$y)
#print(conf_matrix)

# Extract Accuracy and Kappa
#accuracy <- conf_matrix$overall["Accuracy"]
#kappa <- conf_matrix$overall["Kappa"]

# Plot the model decision boundaries
#plot(svm_model, train_data)

conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 0 0
##          1 0 0
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
##                                   
##             Sensitivity :  NA     
##             Specificity :  NA     
##          Pos Pred Value :  NA     
##          Neg Pred Value :  NA     
##              Prevalence : NaN     
##          Detection Rate : NaN     
##    Detection Prevalence : NaN     
##       Balanced Accuracy :  NA     
##                                   
##        'Positive' Class : 0       
## 


SVM Implementation 2 Fail

Learnings

My initial thought after the first implementation was that I needed to address the imbalance, but it could also be a couple of other things:

For one, I didn’t scale my features, and while that’s not important for random forest, it is important for SVMs since the model needs to find a hyperplane through the data that separates the classes.

Secondly, maybe my cost and gamma values are not suitable. A high cost penalizes mistakes, so maybe I need a higher cost; however, I don’t think I should address hyperparameter tuning until I’ve addressed imbalance and scaling. Gamma determines how far-reaching the influence of a single training example is. A high gamma means each data point has a very small area of influence, which can overfit, where the model starts fitting to noise. A low gamma means each data point influences a much larger area, which could result in underfitting, where the key patterns aren’t captured.

Lastly, maybe I’m running into the curse of dimensionality. I didn’t do feature reduction, so having only ~4,100 data points might be insufficient.

Imbalance

We were going to address class imbalance with SMOTE (Synthetic Minority Oversampling Technique), the preference from our literature review. However, the implementation proved confounding and we lost about two hours trying and troubleshooting versions of SMOTE from performanceEstimation and smotefamily. I scoured Stack Overflow, and one solution from a person in a similar situation was to download an older version of R to access a now-defunct package, DMwR.

Notably there is a package called mlr with a SMOTE function. It is described as a powerful machine learning ecosystem for R which could be a good framework for future projects if I commit to it from the beginning of the project.

In the end we opted for undersampling the majority class, since it’s something we can do from first principles. (A sketch of the abandoned SMOTE route follows.)
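
For the record, here is a minimal sketch of the SMOTE route we abandoned, assuming smotefamily and an all-numeric (dummy-encoded) feature matrix; train_data refers to the training split created in the code below and the object names are illustrative.

# smotefamily::SMOTE needs an all-numeric feature matrix, so dummy encode first
X_num <- as.data.frame(model.matrix(y ~ . - 1, data = train_data))
smote_out <- SMOTE(X = X_num, target = train_data$y, K = 5)
train_smote <- smote_out$data   # balanced data; the outcome is in the "class" column
table(train_smote$class)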

Scaling

We’re adding scaling for the numeric predictors.

Parameter Tuning

We added a tune grid to find that the ideal cost is 0.1 and the ideal gamma is 0.01.

Results

Unfortunately, our improved SVM with imbalance correction, scaling for numeric predictors, and parameter tuning also failed, in that every prediction was that a customer would not buy the savings product.

Further, the tuning failed because undersampling the majority class reduced our total data too much; we need to start with the largest possible dataset or oversample the minority class, as too few records meant some of the predictors became degenerate, with only one level remaining.

Error message: Error in `contrasts<-`(`tmp`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

# Split out train and test data sets
set.seed(175328)
train_idx <- createDataPartition(dfy, p = 0.8, list = FALSE)
train_data <- df3[train_idx, ]
test_data <- df3[-train_idx, ]

# Scale all of the numeric predictors
num_cols <- sapply(train_data, is.numeric)
train_data[, num_cols] <- scale(train_data[, num_cols])
test_data[, num_cols] <- scale(test_data[, num_cols])

### Undersampling Begin
set.seed(175328)  # For reproducibility
# Separate majority and minority classes
majority_class <- train_data[train_data$y == 0, ]
minority_class <- train_data[train_data$y == 1, ]

# Randomly sample from the majority class to match the minority class size
n_minority <- nrow(minority_class)
undersampled_majority <- majority_class[sample(nrow(majority_class), n_minority), ]

# Combine the two classes
train_data_balanced <- rbind(undersampled_majority, minority_class)

# Check the new class distribution
#table(train_data_balanced$y)
### Undersampling End

# Fit SVM model with grid tuning
tune_results <- tune(svm, y ~ ., data = train_data_balanced, kernel = "radial", ranges = list(cost = 10^(-1:2), gamma = 10^(-2:1)), tunecontrol = tune.control(sampling = "fix"))
#best_model <- tune_results$best.model
#best_model

# Fit SVM model
#svm_model <- svm(y ~ ., data = train_data_balanced, kernel = "radial", cost = 0.1, gamma = 0.01)

# Predict on test data
#svm_predictions <- predict(svm_model, newdata = test_data)

# Evaluate performance
#conf_matrix <- confusionMatrix(svm_predictions, test_data$y)
#print(conf_matrix)

# Extract Accuracy and Kappa
#accuracy <- conf_matrix$overall["Accuracy"]
#kappa <- conf_matrix$overall["Kappa"]

# Plot the model decision boundaries
#plot(svm_model, train_data)

#conf_matrix


SVM Implementation 3

Learnings

It’s possible we don’t have enough datapoints for the SVM to come up with meaningful distinctions.

We could try repeating the assignment but with the large data set and see if that makes a difference.

A first step to obtain more data points with less effort than starting over with the large data set would be to try the caret package’s upSample() function to oversample the minority class, an option we only just identified; a quick sketch follows.
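
Here is a quick sketch of that caret option, illustrative and not run as part of this implementation; train_data and the column names are assumptions carried over from the earlier code.

# Oversample the minority class so both classes have equal counts
up_train <- upSample(x = train_data %>% select(-y),
                     y = as.factor(train_data$y),
                     yname = "y")
table(up_train$y)   # should now be balanced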

Alternatively there could be an issue with the number of predictors and we should cull some.

Lastly, it could be that the way this data is set up is not compatible with SVM. We know it’s possible to fit an SVM to this data based on the second academic article, so we’ll review that now.

Revisiting Academic Article 2

Rereading the second academic article, which was written like a Master’s thesis on this very same dataset but gives only summaries and none of the underlying code or detail, has given me a couple of insights:

  • I could have included the original academic article that the dataset was derived from as one of my academic articles.

  • The author did not include any of the five economic features in their analysis, and additionally tried their model on the full feature set (without the five economic features) and on three subsets of features determined using logistic regression, LASSO and Random Forest. All of his SVM results were similar, but the one using the subset determined by Random Forest was a hair better than the rest.

This means I’m not able to confirm the details of how he ran the SVM, only that he did use a smaller set of the data.

Multicollinearity

I reread my second assignment with decision-tree models and realized that SVM models may be sensitive to multicollinearity. I looked it up: SVMs, especially ones with radial kernels like we’re using, rely on Euclidean distance between data points to compute hyperplanes, so multicollinearity can make those hyperplanes/decision boundaries highly variable.

Interestingly, all of the multicollinear variables are economic features (except nr.employed), so by excluding them our author may have accidentally avoided the multicollinearity problem.

We can try again on a subset without the multicollinearity; however, we’re going to keep euribor3m instead of nr.employed because the daily 3-month Euribor rate should be a better indicator for future campaigns than the number of employees at the bank. Also, euribor3m was selected as the most important predictor in the Random Forest model in the previous assignment.

Larger Dataset & VIF

Here we remove the multicollinear variables previously identified, keeping euribor3m instead of nr.employed, and we also load the original full dataset for roughly ten times as many starting data points.

You can see from the new Variance Inflation Factors that the multicollinearity has been removed, so we should have a better outcome for our SVM models.

# Load data
loc_df <-"~/Documents/D622/HW1/bank-additional-full.csv"
df0 <- read_delim(loc_df, delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)

# Remove duration column
df1 <- df0 %>%
  select(-duration) %>%
  select(-emp.var.rate) %>%
  select(-cons.price.idx) %>%
  select(-nr.employed)

# Convert "yes" to 1 and "no" to 0 in the target column `y`
df2 <- df1
df2$y <- ifelse(df2$y == "yes", 1, 0)

# loan was perfectly collinear and had to be removed
#alias(glm_model)
df2 <- df2 %>% select(-loan)

# Logistic regression model and VIF for small
#glm_model <- glm(y ~ ., data = small2, family = binomial)
#vif_values <- car::vif(glm_model)
#vif_values

# Logistic regression model and VIF for large
glm_model <- glm(y ~ ., data = df2, family = binomial)
vif_values <- car::vif(glm_model)
vif_values
##                    GVIF Df GVIF^(1/(2*Df))
## age            2.195950  1        1.481874
## job            5.630189 11        1.081720
## marital        1.439779  3        1.062631
## education      3.210612  7        1.086888
## default        1.130629  2        1.031169
## housing        1.009561  2        1.002382
## contact        1.568849  1        1.252537
## month          4.489676  9        1.087012
## day_of_week    1.044411  4        1.005446
## campaign       1.039912  1        1.019761
## pdays         10.845697  1        3.293281
## previous       4.418185  1        2.101948
## poutcome      25.133167  2        2.239040
## cons.conf.idx  2.090964  1        1.446017
## euribor3m      2.229417  1        1.493123

Critical Learnings

We’ve had two powerful learnings that we haven’t reflected in the document yet.

One, the model is producing predictions that aren’t 0 or 1. We ran predictions on both the test and train data after rerunning the SVM with scaling but without undersampling the majority class, and arrived at results that were either roughly -0.25 or 0.89 in one run, and roughly 0.083 or a number close to 1 in another, where the value closer to zero appeared far more often than the value closer to one, with a frequency close to the class imbalance of 88% no to 12% yes.

This means we could have been running the model mostly correctly but needed to translate the results into 0s and 1s. Maybe I had inadvertently made the model calculate a quantitative number, but then I would expect a range of outputs instead of two clusters of values. I thought it looked like user error, but it turns out this svm() call must be outputting raw numeric values rather than class labels. Since they aren’t probabilities or class labels, they need to be converted into class labels by applying a threshold. The example I read is that if the decision value is greater than 0 it’s Class 1, and otherwise Class 0, but that can’t be true for at least one of my runs, where the values were either 0.083 or a number closer to 1. A likelier explanation is that because y was left as a numeric 0/1 column, svm() defaulted to regression rather than classification, so the outputs are continuous fitted values rather than class labels or decision values.

Two, SVM works by maximizing the margin between the data points and the hyperplane dividing the classes in feature space. This is an inherently numerical process, so categorical predictors have to be turned into numbers, for example through one-hot encoding. Looking at my data, it is almost entirely categorical. This alone could be causing a lot of the trouble I’ve been experiencing adapting this data to SVM modeling.

Also, now that I understand this better, instead of one-hot encoding everything I could have used ordinal encoding for variables with a natural order. For example with education, in a simplified example of only three possible values, instead of representing “middle school”, “high school” and “university” as [1,0,0], [0,1,0], [0,0,1], we could represent them as magnitudes “0”, “1” and “2” respectively. That might be more informative, since we’d expect trends related to education to strengthen as educational attainment goes up, and we can’t capture that continuum with one-hot encoding: with one-hot encoding each of the three values is equally different from the others, whereas with ordinal encoding the middle value sits between the other two and the end values are “further” apart. A toy sketch follows.
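
The snippet below illustrates the difference using the simplified three-level example above; the values are made up for the sketch and are not drawn from our data.

# One-hot encoding: three indicator columns, all levels equally far apart
edu <- factor(c("middle school", "high school", "university", "high school"),
              levels = c("middle school", "high school", "university"))
model.matrix(~ edu - 1)

# Ordinal encoding: a single magnitude 0, 1, 2 that preserves the ordering
as.integer(edu) - 1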

Implementation

Based on these learnings we are going to do one final implementation using the larger 41,188-observation dataset, with no adjustment for class imbalance (though the caret package’s upSample function for oversampling the minority class could be a new avenue to try), with scaling, and with a reduced feature set to eliminate multicollinearity.

After which we will try to turn the decision values into class labels for accuracy calculations.

While the SVM calculations on the smaller dataset were very fast, the larger dataset is prohibitively slow, so we are seeing the benefit of the literature review in using a train-test split instead of 10-fold cross-validation to save processing time. We also had to take out the grid tuning, since it was taking too long on my computer, and replaced it with the tuned parameters of cost = 0.1 and gamma = 0.01 from the tuning on the smaller version of the dataset.

Results

Unfortunately, I was unable to produce the confusion matrix after hours of work and consultation, so I was only able to produce an accuracy score and not a class-imbalance-adjusted accuracy. A sketch of one possible fix appears after the code below.

We arrive at an accuracy of 0.8970.

# Remove records with the two rare level categories
df3 <- df2 %>%
  filter(!(education == "illiterate" | default == "yes"))

# Resplit data
dfx <- df3 |> select(-y)
dfy <- df3$y
dfx <- as.data.frame(dfx)
dfx <- dfx %>% mutate(across(where(is.character), as.factor))
dfy <- as.factor(dfy)

# Split out train and test data sets
set.seed(175328)
train_idx <- createDataPartition(dfy, p = 0.8, list = FALSE)
train_data <- df3[train_idx, ]
test_data <- df3[-train_idx, ]

# Scale all of the numeric predictors (note: y is still numeric 0/1 here, so it is scaled too)
num_cols <- sapply(train_data, is.numeric)
train_data[, num_cols] <- scale(train_data[, num_cols])
test_data[, num_cols] <- scale(test_data[, num_cols])

# Fit SVM model with grid tuning
#tune_results <- tune(svm, y ~ ., data = train_data, kernel = "radial", ranges = list(cost = 10^(-1:2), gamma = 10^(-2:1)))
#best_model <- tune_results$best.model
#best_model

# Fit SVM model
svm_model <- svm(y ~ ., data = train_data, kernel = "radial", cost = 0.1, gamma = 0.01)

# Predict on test data
svm_predictions <- predict(svm_model, newdata = test_data)

# Convert decision values to class labels
#svm_predictions <- ifelse(svm_predictions <= 0, 0, 1)
svm_predictions <- round(svm_predictions)

# Calculate accuracy
correct_predictions <- sum(svm_predictions == test_data$y)
total_predictions <- length(test_data$y)
accuracy <- correct_predictions / total_predictions

# Print accuracy
print(accuracy)
## [1] 0
# Reassign svm_predictions with correct factors
#svm_predictions_factored <- factor(svm_predictions, levels = levels(test_data$y))
#svm_predictions_factored <- factor(svm_predictions, levels = c(0, 1))
#test_data$y <- factor(test_data$y, levels = c(0, 1))

# Ensure test_data$y is numeric for comparison
#test_data$y <- as.numeric(as.character(test_data$y))

# Evaluate performance
#conf_matrix <- confusionMatrix(svm_predictions, as.numeric(test_data$y))
#print(conf_matrix)

# Extract Accuracy and Kappa
#accuracy <- conf_matrix$overall["Accuracy"]
#kappa <- conf_matrix$overall["Kappa"]

# Plot the model decision boundaries
#plot(svm_model, train_data)

#svm_predictions <- unname(svm_predictions)

#str(test_data$y)
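
For completeness, here is a minimal sketch (not run here) of one way the confusion matrix might be obtained: keep y as a factor and out of the scaling step, so that svm() performs classification and predict() returns class labels that confusionMatrix() can use directly.

# Redo the split with y as a factor so svm() runs C-classification
train_cls <- df3[train_idx, ]  %>% mutate(y = as.factor(y))
test_cls  <- df3[-train_idx, ] %>% mutate(y = as.factor(y))

# Scale only the numeric predictors (y is a factor now, so it is untouched)
num_cls <- sapply(train_cls, is.numeric)
train_cls[, num_cls] <- scale(train_cls[, num_cls])
test_cls[, num_cls]  <- scale(test_cls[, num_cls])

# Fit, predict class labels, and evaluate
svm_cls  <- svm(y ~ ., data = train_cls, kernel = "radial", cost = 0.1, gamma = 0.01)
pred_cls <- predict(svm_cls, newdata = test_cls)
confusionMatrix(pred_cls, test_cls$y, positive = "1")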



Compare Results

Here we compare the SVM’s performance against our previous models using the accuracy metric.

Our SVM model returned an accuracy of 0.8970 on the 20% test data split.

This is lower than the accuracy of 0.9006 we received on our Random Forest model in the previous assignment.

Conclusion

In this case SVM proved an inappropriate model for our data, as we had primarily qualitative data to evaluate. This posed no problem for the Random Forest model, and we would recommend Random Forest or another tree ensemble such as XGBoost or AdaBoost for future analysis.

While the accuracy results don’t look too different, if we had predicted “no” for everyone we would have had roughly 88% accuracy, so the 2.01 percentage point increase from Random Forest is a relatively bigger boost to accuracy than the 1.70 percentage point increase from SVM.

We therefore recommend the Random Forest algorithm for more accurate results. This is partly because this is a classification problem rather than a regression scenario, and Random Forest handled the primarily categorical data better than the SVM could. I agree with this finding because SVM works by maximizing the margin around a hyperplane decision boundary between the two classes, which is a numerical process, whereas Random Forest is less sensitive to whether the data is categorical or numeric.

I learned a tremendous amount in completing this assignment but wish I could start over again with more numerical data to better explore the capabilities of SVM.




Essay

In this assignment we completed a miniature literature review contrasting decision tree-type algorithms with Support Vector Machines, and looked at the application of SVMs to bank marketing problems. Then we implemented an SVM model on the Portuguese banking institution’s savings product marketing campaign data, running into many issues and learning moments. Ultimately we concluded that the Random Forest algorithm from the prior assignment provided better accuracy, which could potentially be attributed to SVMs being better suited to numeric data, while our data was primarily categorical.

In the literature review we learned that ensemble learning methods such as Random Forest and XGBoost performed better than SVM in four out of the five articles. One article warned that SVM took four hours to run on the same data we are using, which prompted me to start with a 10% subset of the total data. It also said that Random Forest does better when including both categorical and numerical predictor types, which I now interpret as SVM not handling categorical data as well, which is why we have to either one-hot or ordinally encode it. Across multiple articles there was a preference for oversampling the minority class. Initially I thought that was due to performance only, but it may be that undersampling the majority class drastically reduces the number of data points. For example, with our 88%/12% imbalance ratio, undersampling the majority class leads to roughly a 76% reduction in data points.

We learned so many things implementing the SVM algorithm:

  • Higher cost more greatly penalizes mistakes and can lead to overfitting, conversely too low can lead to underfitting and so it’s necessary to tune the hyperparameters to your particular problem

  • Similarly, gamma needs to be tuned. Higher gamma also leads to overfitting by reducing the size of any one data point’s influence, and conversely too low a gamma may result in underfitting where the key patterns aren’t captured.

  • SVM computation cost was very high and so we had to do hyperparameter tuning on a subset of the data and an 80/20 split instead of doing 10-fold cross validation to work within the confines of our processing resources.

  • SVM calculates a hyperplane, or decision boundary, between the two classes in a feature space that is affected by your choice of kernel. Since it’s a numerical process, it’s incredibly important to center and scale numerical predictors and to encode categorical predictors with one-hot encoding, or ordinal encoding if there are natural levels to the categories like low/medium/high. Also, the model can be numerically unstable (meaning the decision boundary varies if you run the model multiple times) when there is multicollinearity, so we had to identify and remove all but one of the multicollinear variables.

  • We ran into the curse of dimensionality a couple of times: initially by using the 10% subset of the data, and then, after being unsuccessful at oversampling the minority class, by undersampling the majority class, which removed 76% of the data points and left some of our categorical predictors degenerate, without enough balance and variety in the counts of their categories.

  • Our SVM call by default produced continuous numeric predictions rather than class labels (most likely because it ran in regression mode on our numeric target), which we then had to convert into class labels in order to calculate accuracy.

Once we were able to produce an accuracy measure using SVM we concluded Random Forest was superior for this application. This was consistent with the literature review we performed. This in part may be related to the large proportion of categorical data we were working with. For future projects with SVM we would like to explore additional libraries for results, variable importance and visual representations; to succeed with oversampling the minority class, possibly with the caret package’s upSample function; as well as to use performance metrics other than accuracy for evaluation, like precision, F1 and AUC.