DATA 622 Assignment 3 - Support Vector Machines

1 Assignment 3: Support Vector Machines
2 Load and Preprocess Data
3 SVM with Linear Kernel
4 SVM with Radial Kernel
5 Comparison with Previous Best Model (XGBoost)
6 Essay

1 Assignment 3: Support Vector Machines

Instructions

Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Homework #3

1. Read the following articles:
   https://www.hindawi.com/journals/complexity/2021/5550344/
   https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
2. Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.
3. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm.
4. Compare the results with the results from previous homework.
5. Answer questions, such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

2 Load and Preprocess Data

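library(tidyverse)  # read_delim(), mutate(), select(), sample_frac()
library(caret)      # createDataPartition(), trainControl(), train(), confusionMatrix()
library(pROC)       # roc()

bank_data <- read_delim("/Users/zigcah/Downloads/bank+marketing/bank-additional/bank-additional-full.csv", delim = ";")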

## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

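# Convert character columns to factors; drop `duration`, which is only known after the call ends
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor)) %>% select(-duration)
# Recode pdays (999 = client not previously contacted) as a binary flag, then drop pdays and previous
bank_data <- bank_data %>% mutate(contacted_before = if_else(pdays == 999, 0, 1)) %>% select(-pdays, -previous)
# Drop emp.var.rate
bank_data <- bank_data %>% select(-emp.var.rate)
set.seed(321)
# Work with a 30% random sample of the rows
bank_data <- bank_data %>% sample_frac(0.3)
# Stratified 70/30 train/test split on the outcome
index <- createDataPartition(bank_data$y, p = 0.7, list = FALSE)
train <- bank_data[index, ]
test <- bank_data[-index, ]
# Make "yes" the first factor level so that caret treats it as the positive class
train$y <- relevel(train$y, ref = "yes")
test$y <- relevel(test$y, ref = "yes")
# 3-fold cross-validation with class probabilities, so that ROC can be used as the tuning metric
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)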

3 SVM with Linear Kernel

set.seed(321)
svm_lin <- train(y ~ ., data = train,
                 method = "svmLinear",
                 trControl = ctrl,
                 tuneGrid = expand.grid(C = c(0.1, 1, 10)),
                 metric = "ROC")

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## line search fails -0.02720715 -0.8036411 0.0002148493 3.332975e-06 -1.889938e-10 1.112055e-09 -3.689874e-14

## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
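
The repeated "Cannot scale data" warnings are consistent with a dummy-coded predictor being constant inside a cross-validation fold, for example when a rare factor level does not appear in that fold. This cause is an assumption; the output above does not confirm it. A minimal base-R sketch of the effect, using a hypothetical toy data frame rather than the bank data:

```r
# Hypothetical toy data: the level "yes" of `default` occurs only once
toy <- data.frame(
  age = c(30, 45, 52, 38, 41, 29),
  default = factor(c("no", "no", "unknown", "no", "unknown", "yes"),
                   levels = c("no", "unknown", "yes"))
)

# A resample that happens to exclude the single "yes" row
fold <- toy[1:5, ]

# Dummy-code the predictors the way a formula interface would (intercept dropped)
X <- model.matrix(~ ., data = fold)[, -1, drop = FALSE]

# Columns with a single unique value cannot be centered and scaled
constant_cols <- colnames(X)[apply(X, 2, function(col) length(unique(col)) == 1)]
print(constant_cols)
```

If this is what happens here, passing preProcess = "zv" to train() is one option for dropping zero-variance columns within each resample. Whether that removes the warnings for this dataset has not been tested.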

pred_lin <- predict(svm_lin, test)
confusionMatrix(pred_lin, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   75   37
##        no   335 3259
##                                           
##                Accuracy : 0.8996          
##                  95% CI : (0.8895, 0.9091)
##     No Information Rate : 0.8894          
##     P-Value [Acc > NIR] : 0.02362         
##                                           
##                   Kappa : 0.2518          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.18293         
##             Specificity : 0.98877         
##          Pos Pred Value : 0.66964         
##          Neg Pred Value : 0.90679         
##              Prevalence : 0.11063         
##          Detection Rate : 0.02024         
##    Detection Prevalence : 0.03022         
##       Balanced Accuracy : 0.58585         
##                                           
##        'Positive' Class : yes             
##
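
The accuracy, No Information Rate and Kappa reported above can be reproduced by hand from the four cell counts. The sketch below is a worked check in base R; the variable names are illustrative and only the counts come from the output above.

```r
# Cell counts from the linear-kernel confusion matrix (positive class = "yes")
tp <- 75    # predicted yes, actually yes
fp <- 37    # predicted yes, actually no
fn <- 335   # predicted no,  actually yes
tn <- 3259  # predicted no,  actually no
n  <- tp + fp + fn + tn

# Observed agreement (accuracy)
accuracy <- (tp + tn) / n

# No Information Rate: accuracy of always predicting the majority class ("no")
nir <- max(tp + fn, fp + tn) / n

# Agreement expected by chance, from the row and column totals
expected <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible
kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, nir = nir, kappa = kappa), 4))
```

The accuracy is only about one percentage point above the No Information Rate, which is why Kappa stays low even though the accuracy looks high.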

roc_lin <- roc(test$y, predict(svm_lin, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_lin, main = "ROC - SVM Linear")
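
The roc() call stores the AUC but the code above does not print it; pROC's auc(roc_lin) would report the value. The call also uses the second probability column with "no" as the case class, which (assuming the probability columns follow the factor level order yes, no) gives the same AUC as scoring "yes" throughout. A minimal sketch of both points in base R, using synthetic scores rather than the bank data:

```r
# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen case
# receives a higher score than a randomly chosen control
auc_rank <- function(score, is_case) {
  r  <- rank(score)          # average ranks handle ties
  n1 <- sum(is_case)
  n0 <- sum(!is_case)
  (sum(r[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Synthetic, imbalanced outcome and scores (illustrative only)
set.seed(1)
y <- factor(sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.11, 0.89)),
            levels = c("yes", "no"))
p_yes <- plogis(rnorm(200, mean = ifelse(y == "yes", 1, -1)))
p_no  <- 1 - p_yes

auc_yes <- auc_rank(p_yes, y == "yes")  # "yes" as case, scored by P(yes)
auc_no  <- auc_rank(p_no,  y == "no")   # "no" as case, scored by P(no)
print(c(auc_yes = auc_yes, auc_no = auc_no))
```

Both calls return the same number, so the choice of column does not change the AUC as long as the score and the case class refer to the same level.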

4 SVM with Radial Kernel

set.seed(432)
svm_rad <- train(y ~ ., data = train,
                 method = "svmRadial",
                 trControl = ctrl,
                 tuneGrid = expand.grid(C = c(0.1, 1, 10),
                                        sigma = c(0.01, 0.1)),
                 metric = "ROC")

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

pred_rad <- predict(svm_rad, test)
confusionMatrix(pred_rad, test$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes   46   47
##        no   364 3249
##                                          
##                Accuracy : 0.8891         
##                  95% CI : (0.8785, 0.899)
##     No Information Rate : 0.8894         
##     P-Value [Acc > NIR] : 0.534          
##                                          
##                   Kappa : 0.148          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.11220        
##             Specificity : 0.98574        
##          Pos Pred Value : 0.49462        
##          Neg Pred Value : 0.89925        
##              Prevalence : 0.11063        
##          Detection Rate : 0.01241        
##    Detection Prevalence : 0.02509        
##       Balanced Accuracy : 0.54897        
##                                          
##        'Positive' Class : yes            
##

roc_rad <- roc(test$y, predict(svm_rad, test, type = "prob")[,2])

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

plot(roc_rad, main = "ROC - SVM Radial")

5 Comparison with Previous Best Model (XGBoost)

In Assignment 2, the best-performing model was XGBoost, with a ROC AUC of approximately 0.94, reflecting a strong ability to identify clients likely to subscribe to a term deposit. The SVM models developed in this assignment were less effective: the linear SVM reached a ROC AUC of about 0.89 and the radial SVM about 0.91. At the default classification threshold, both SVMs also identified few of the actual subscribers. On the test set, the linear SVM had an accuracy of 0.8996 (against a no-information rate of 0.8894) with a sensitivity of 0.183 and a specificity of 0.989, while the radial SVM had an accuracy of 0.8891 with a sensitivity of 0.112 and a specificity of 0.986. The SVMs also required more tuning and longer training times. XGBoost handled the class imbalance more effectively and maintained strong results across accuracy, precision and recall. Given these outcomes, XGBoost remains the more reliable option for this classification task.
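
The test-set results of the two SVMs can be set side by side by recomputing the main statistics from the confusion matrix counts printed in sections 3 and 4. This is a small base-R sketch; the helper name metrics() is illustrative and not part of caret.

```r
# Recompute headline statistics from confusion matrix counts
# (positive class = "yes"; tp/fn/fp/tn are the four cells)
metrics <- function(tp, fn, fp, tn) {
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  c(accuracy = (tp + tn) / (tp + fn + fp + tn),
    sensitivity = sens,
    specificity = spec,
    balanced_accuracy = (sens + spec) / 2)
}

# Counts copied from the two confusionMatrix() outputs
comparison <- rbind(
  svm_linear = metrics(tp = 75, fn = 335, fp = 37, tn = 3259),
  svm_radial = metrics(tp = 46, fn = 364, fp = 47, tn = 3249)
)
print(round(comparison, 4))
```

Both kernels are highly specific but detect fewer than one in five actual subscribers at the default threshold, so the accuracy figures mostly reflect the majority class.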

6 Essay

This document contains the full SVM analysis with linear and radial kernels, performance evaluation and ROC comparison with the previously trained model from Assignment 2.

The following essay includes interpretations and an article-based discussion. These are the three additional articles I sourced that compare decision trees and SVMs in my current area of expertise, fraud detection.

https://www.iaeng.org/publication/IMECS2011/IMECS2011_pp442-447.pdf

https://www.sciencedirect.com/science/article/pii/S2772662223000036

https://www.mdpi.com/2227-7390/12/14/2250

Assignment 3 Essay – Support Vector Machines: Analysis and Comparison

In my work on fraud detection, the primary goal is to accurately classify suspicious transactions while minimizing false alerts. This domain shares many challenges with the Bank Marketing dataset: imbalanced classes, the cost of misclassifying rare positive events and the need for interpretable, reliable predictions. Support Vector Machines (SVMs) are often favored in fraud scenarios because of their strength in high-dimensional spaces and the flexibility that kernel choice and penalty tuning offer when handling imbalanced data.
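
One common form of penalty tuning for imbalanced data is to weight misclassification costs by inverse class frequency. The sketch below computes such weights in base R from the test-set class counts reported in the confusion matrices (410 "yes", 3296 "no"); in practice the counts would come from the training set, and this weighting was not used in the models above.

```r
# Class counts taken from the test-set confusion matrices (illustration only)
class_counts <- c(yes = 410, no = 3296)

n <- sum(class_counts)     # total observations
k <- length(class_counts)  # number of classes

# Inverse-frequency weights: each class contributes equally in total
class_weights <- n / (k * class_counts)
print(round(class_weights, 3))
```

A named vector like this could be supplied to kernlab's ksvm() through its class.weights argument, so that errors on the rare "yes" class are penalized more heavily. How much this would raise sensitivity on this dataset is untested.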

We were provided with two articles and asked to analyze and summarize them.

“Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study”

This study investigates the effectiveness of decision tree ensemble methods in predicting COVID-19 infection using commonly collected laboratory test data. With the dataset exhibiting a significant class imbalance, the researchers applied both standard and imbalance-focused ensemble classifiers, including Random Forest, XGBoost, and SMOTE-augmented models. Evaluation metrics such as F1-score, AUROC and AUPRC were used to assess model performance. Results showed that ensembles designed for imbalanced data, particularly Balanced Random Forest, outperformed standard models in detecting positive cases. The inclusion of age as a predictive feature also greatly improved performance metrics across several classifiers. This article demonstrates the importance of selecting appropriate modeling strategies and evaluation measures when working with imbalanced healthcare datasets.

“A novel approach to predict COVID-19 using support vector machine”

This study proposes a method for early prediction of COVID-19 infection severity using a Support Vector Machine (SVM) classifier. Based on symptoms commonly found in COVID-19 patients such as fever, shortness of breath, chest pain and issues like hypertension and heart disease, the researchers created a dataset classifying individuals into not infected, mildly infected and severely infected categories. The SVM classifier, using a linear kernel, achieved an overall accuracy of 87% and was especially accurate in detecting severely infected patients. The paper also included a comparison of the SVM model with other supervised learning methods using a visual programming toolkit called Orange, showing that SVM outperformed others in metrics such as F1-score and AUC. This demonstrates SVM’s potential for accurate early-stage classification in medical prediction tasks, particularly during pandemics when timely diagnosis is crucial.

Summary of my first article: https://www.iaeng.org/publication/IMECS2011/IMECS2011_pp442-447.pdf

This study explores the effectiveness of Decision Tree algorithms and Support Vector Machines (SVMs) in detecting credit card fraud, using real-world data from a national bank in Turkey. The authors highlight the limitations of fraud prevention systems, particularly for online transactions, and stress the importance of fraud detection systems capable of analyzing every transaction for suspicious behavior. Their models were trained on data sets that had been under-sampled to address class imbalance. Through extensive testing, the study found that Decision Tree models consistently outperformed SVM models in terms of accuracy and fraud detection rate. However, SVMs showed improved performance with larger training datasets. The paper concludes that while both methods have merit, Decision Trees currently offer more practical value for real-time fraud detection in financial institutions.

Summary of my second article: https://www.sciencedirect.com/science/article/pii/S2772662223000036

This study evaluates and compares three supervised machine learning models (logistic regression, decision tree and random forest) on a credit card fraud detection dataset covering January to December 2020. The dataset consists of 555,719 transactions with a fraud rate of 0.4% and was sourced from the western United States. It was cleaned, scaled and under-sampled to manage class imbalance. Models were evaluated using accuracy, precision, recall, specificity, F1-score, ROC and AUC metrics. Among the models, random forest performed best, with 96% accuracy and 98.9% AUC. Decision tree and logistic regression achieved roughly 92% accuracy but had lower precision, recall and AUC values. The study also found that most fraudulent transactions occurred between 10:00 p.m. and 4:00 a.m., and that individuals over 60 were the most frequent targets. The authors recommend adopting the random forest model for fraud detection and suggest increased monitoring during late-night hours.

Summary of my third article: https://www.mdpi.com/2227-7390/12/14/2250

This article, titled “Efficient Credit Card Fraud Detection Using Meta-Heuristic Techniques and Machine Learning Algorithms”, aims to enhance the accuracy of fraud detection systems by integrating meta-heuristic optimization (MHO) algorithms with machine learning models. Using the European cardholder dataset from Kaggle, which is heavily imbalanced, the researchers applied random under-sampling to create a balanced dataset of 984 transactions. They tested 15 MHO algorithms, such as the Sailfish Optimizer, and evaluated their performance with Random Forest and Support Vector Machine classifiers. Among these, the Sailfish Optimizer combined with Random Forest (SFO-RF) achieved the best results, reaching 97.79% accuracy while removing over 90% of the input features. The study demonstrates that MHO-based feature selection not only improves predictive accuracy but also enhances model efficiency and scalability. This is particularly relevant in my field of fraud analytics, where real-world datasets are often imbalanced and precision is crucial.

Conclusion: Across all five articles, a clear theme emerges about the strengths and limitations of Support Vector Machines (SVMs) and decision tree-based methods in fraud detection and similar domains. The two assigned articles apply each family to COVID-19 prediction: a linear-kernel SVM reached 87% overall accuracy on symptom data, while decision tree ensembles designed for imbalanced data, particularly Balanced Random Forest, were best at detecting positive cases. The articles I selected extend these findings to real-world fraud detection. One study found that Decision Trees consistently outperformed SVMs on bank credit card data, although the SVMs improved with larger training sets. Another reports the strongest performance from a Random Forest model on a large fraud dataset and notes that fraud was concentrated in late-night hours and among individuals over 60. The third introduces meta-heuristic optimization as a way to improve model accuracy and efficiency through feature selection on imbalanced data. Taken together, these articles support the idea that while SVMs can perform well in many situations, Decision Trees and ensemble methods often provide a better balance between accuracy, interpretability and practical application in the fraud analytics field.

REFERENCES:

Sahin, Y., and E. Duman. “Detecting Credit Card Fraud by Decision Trees and Support Vector Machines.” Proceedings of the International MultiConference of Engineers and Computer Scientists 2011, vol. I, IMECS 2011, 16–18 Mar. 2011, Hong Kong, pp. 442–447. IAENG. https://www.iaeng.org/publication/IMECS2011/IMECS2011_pp442-447.pdf

Afriyie, Jonathan Kwaku, et al. “A Supervised Machine Learning Algorithm for Detecting and Predicting Fraud in Credit Card Transactions.” Decision Analytics Journal, vol. 6, 2023, article 100163, https://doi.org/10.1016/j.dajour.2023.100163.

Mosa, Diana T., Shaymaa E. Sorour, Amr A. Abohany, and Fahima A. Maghraby. “CCFD: Efficient Credit Card Fraud Detection Using Meta‑Heuristic Techniques and Machine Learning Algorithms.” Mathematics, vol. 12, no. 14, 2024, p. 2250, https://doi.org/10.3390/math12142250.

Format

Essay (minimum 500 word document): Write a short essay explaining your selection of algorithms and how they relate to the data and what you are trying to do.

Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document): Include analysis R (or Python) code.

Rubric

Activity (Points): Requirements

SVM - Trained model (20)
1. SVM model trained & validated (5)
2. Hyper-parameter tuning done (5)
3. Implementation and comparison using more than 1 kernel (5)
4. Outputs included with code (5)

Comparison with previous homework (20)
1. Comparison was done (10)
2. Comparison was backed up with facts & figures from results obtained (10)

Review of articles provided (20)
1. Student demonstrates two (2) articles provided were read by, for example, drawing insights, summarizing articles or via comparison (5)
2. Three (3) articles provided with URL links (5)
3. Discussion of a) two articles provided, and b) three articles the students found (5)
4. Comparison and insight drawn for a) two articles provided, and b) three articles the students found (5)

Area of expertise/interest (20)
1. Explanation of area of expertise/interest (10)
2. Results related to area of expertise/interest (10)

Essay (20)
1. Essay included at least 500 words (5)
2. Summary/conclusions included in essay (5)
3. Essay was backed up with facts and figures from assignment work and from the articles (5)
4. Comparison of SVM with previous approaches (5)

Total (100)