Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with those from the previous homework.
## education marital housing contact duration month age balance campaign pdays
## 1 tertiary married yes unknown 261 may 58 2143 1 -1
## 2 secondary single yes unknown 151 may 44 29 1 -1
## 3 secondary married yes unknown 76 may 33 2 1 -1
## 4 unknown married yes unknown 92 may 47 1506 1 -1
## 5 unknown single no unknown 198 may 33 1 1 -1
## 6 tertiary married yes unknown 139 may 35 231 1 -1
## 7 tertiary single yes unknown 217 may 28 447 1 -1
## 8 tertiary divorced yes unknown 380 may 42 2 1 -1
## 9 primary married yes unknown 50 may 58 121 1 -1
## 10 secondary single yes unknown 55 may 43 593 1 -1
## previously_contacted poutcome y
## 1 no unknown no
## 2 no unknown no
## 3 no unknown no
## 4 no unknown no
## 5 no unknown no
## 6 no unknown no
## 7 no unknown no
## 8 no unknown no
## 9 no unknown no
## 10 no unknown no
We will use a basic SVM algorithm with a linear kernel for our first attempt on our rebalanced dataset.
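Although the code chunk itself is not echoed here, the call pattern looks roughly like the following minimal sketch, where `train_bal` (the rebalanced training split) and `test_set` (the hold-out split) are assumed object names and `y` is the `no`/`yes` target:

```r
# Hypothetical sketch -- object names (train_bal, test_set) are assumptions,
# not the actual variables used in this report.
library(e1071)   # svm()
library(caret)   # confusionMatrix()
library(pROC)    # roc(), auc()

set.seed(622)
svm_fit <- svm(y ~ ., data = train_bal,
               kernel = "linear", probability = TRUE)

pred_class <- predict(svm_fit, newdata = test_set)
pred_prob  <- attr(predict(svm_fit, newdata = test_set, probability = TRUE),
                   "probabilities")[, "yes"]

# Confusion matrix with "yes" (accepted the term deposit) as the positive class
confusionMatrix(pred_class, test_set$y, positive = "yes")

# AUC; pROC prints the "Setting direction: controls < cases" message seen above
auc(roc(test_set$y, pred_prob))
```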
## Setting direction: controls < cases
## [1] "SVM Experiment 1 Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9529 216
## yes 2447 1370
##
## Accuracy : 0.8036
## 95% CI : (0.7969, 0.8103)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4096
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8638
## Specificity : 0.7957
## Pos Pred Value : 0.3589
## Neg Pred Value : 0.9778
## Prevalence : 0.1169
## Detection Rate : 0.1010
## Detection Prevalence : 0.2814
## Balanced Accuracy : 0.8297
##
## 'Positive' Class : yes
##
## [1] "SVM AUC: 0.9026"
In this application, we will use an SVM model with parameter tuning to improve model performance. We will also use a radial kernel instead of a linear one because of its stronger performance on non-linear classification problems.
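A hedged sketch of what the tuning step might look like with `e1071::tune.svm()`, again using the assumed object names from above; the cost/gamma grid is illustrative rather than the exact grid searched here:

```r
# Hypothetical sketch -- the cost/gamma grid is illustrative, not necessarily
# the grid searched in this report.
set.seed(622)
tune_out <- tune.svm(y ~ ., data = train_bal,
                     kernel = "radial",
                     cost   = c(0.1, 1, 10),
                     gamma  = c(0.01, 0.1, 1),
                     probability = TRUE)

tune_out$best.parameters          # cross-validated choice of cost and gamma
svm_tuned <- tune_out$best.model  # refit on the full training data

pred_class <- predict(svm_tuned, newdata = test_set)
confusionMatrix(pred_class, test_set$y, positive = "yes")
```

By default, `tune.svm()` scores each grid point with 10-fold cross-validation, which is where most of the run time discussed later comes from.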
## Setting direction: controls < cases
## [1] "\nSVM Experiment 2 (Tuned) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10012 258
## yes 1964 1328
##
## Accuracy : 0.8362
## 95% CI : (0.8298, 0.8424)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4591
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.83733
## Specificity : 0.83601
## Pos Pred Value : 0.40340
## Neg Pred Value : 0.97488
## Prevalence : 0.11694
## Detection Rate : 0.09792
## Detection Prevalence : 0.24274
## Balanced Accuracy : 0.83667
##
## 'Positive' Class : yes
##
## [1] "Tuned SVM AUC: 0.9065"
Let’s compare the results of the basic SVM algorithm with those of the various experiments conducted in Homework Assignment #2. It should be noted that we needed to address the class-imbalance issue in the data and used a 60% minority-class re-balancing of the dataset.
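For context, here is a minimal sketch of one way such a rebalanced training split could be produced, reading "60% minority" as the minority class making up 60% of the rebalanced training data; the actual HW #2 procedure may have differed (e.g., down-sampling the majority class or using a library routine), and `train_set` is an assumed name:

```r
# Hypothetical sketch -- train_set is an assumed name; reading "60% minority"
# as: the "yes" class makes up 60% of the rebalanced training data.
set.seed(622)
majority <- train_set[train_set$y == "no",  ]
minority <- train_set[train_set$y == "yes", ]

# Solve n_min / (n_min + n_maj) = 0.60  =>  n_min = 1.5 * n_maj
n_min <- round(1.5 * nrow(majority))
minority_up <- minority[sample(nrow(minority), n_min, replace = TRUE), ]

train_bal <- rbind(majority, minority_up)
prop.table(table(train_bal$y))   # ~ no: 0.40, yes: 0.60
```

Note that only the training split is rebalanced; the hold-out set keeps its natural class mix, which is why the confusion matrices above report a prevalence of about 11.7%.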
## [1] "Complete Model Comparison:"
## Metric DecisionTree_50_50 DecisionTree_40_60 DecisionTree_60_40
## 1 Accuracy 0.7977 0.8394 0.7930
## 2 Sensitivity 0.8323 0.7686 0.8569
## 3 Specificity 0.7931 0.8488 0.7845
## 4 Pos Pred Value 0.3476 0.4023 0.3449
## 5 AUC 0.8390 0.8270 0.8610
## RandomForest_1 RandomForest_2 AdaBoost_Original AdaBoost_Real SVM SVM2
## 1 0.8294 0.8452 0.8166 0.8276 0.8140 0.8362
## 2 0.8758 0.8430 0.8689 0.8581 0.8808 0.8373
## 3 0.8232 0.8454 0.8097 0.8236 0.8052 0.8360
## 4 0.3962 0.4194 0.3768 0.3918 0.3745 0.4034
## 5 0.9170 0.9170 NA NA 0.9131 0.9065
Our assignment compared several methods for predicting bank term deposit acceptance: Decision Trees, Random Forest, AdaBoost, and Support Vector Machine (SVM). Each model was tested multiple times with different settings to find the best approach for identifying potential customers.
Looking at the numbers, Random Forest performed best overall, with 84.52% accuracy and a positive predictive value of 41.94% (i.e., when it flagged a customer as likely to accept, it was right about 42% of the time). This is particularly good considering only about 12% of customers in our data actually accepted term deposits. Random Forest worked well because it handles mixed data types (numeric and categorical) and is relatively robust to noisy or incomplete data.
The tuned SVM model also performed well, achieving 83.62% accuracy with very balanced results: it was almost equally good at identifying customers who would accept (sensitivity 83.73%) and reject (specificity 83.60%) term deposits. However, getting these results required significant adjustments to our original approach, because the initial SVM analysis was too slow and computationally expensive to tune exhaustively.
Decision Trees showed more variation in performance, with accuracy ranging from 79.30% to 83.94% depending on how we balanced the data. While simpler to implement and understand, they weren’t as reliable as Random Forest or SVM. AdaBoost fell in the middle, reaching 82.76% accuracy but requiring significant processing time.
The results suggest:

Random Forest is best for:

- overall accuracy and precision, particularly when the predictors mix numeric and categorical variables
- settings where computing resources are limited

SVM is strong for:

- balanced performance across both classes (similar sensitivity and specificity)
- problems where the class boundary is non-linear
However, both methods have drawbacks. Random Forest requires significant memory for large data sets, and SVM can be very slow when trying different parameter settings. These practical limitations affected our ability to fully optimize the models, particularly SVM.
Looking ahead, Random Forest appears to be the better choice for similar prediction tasks, especially when working with limited computing resources. While SVM showed promise, its computational demands make it less practical for regular use unless significant computing power is available.
For future work, having more computing power would allow:

- a more exhaustive search over SVM parameter settings (cost, gamma, and alternative kernels)
- more extensive cross-validation during tuning
- fully optimizing all of the models rather than the scaled-back versions used here
These findings help guide future predictive modeling work in banking, showing that Random Forest offers the best balance of performance and practical usability, while SVM might be worth the extra computational cost in situations where balanced prediction accuracy is crucial.
Both studies aimed to help identify COVID-19 cases, but they went about it in different ways. The first study used decision trees to analyze laboratory test results from 600 patients, simply trying to determine if each case was positive or negative for COVID-19. The second study used SVM to look at symptoms from 200 patients, attempting to classify cases into three categories: not infected, mildly infected, or severely infected. Despite their different approaches, both methods were similarly accurate, achieving around 87% success in their predictions.
The two methods work better in different situations. Decision trees handle large amounts of numerical data, such as lab results, and make it easier to understand how decisions were reached, which makes them well suited to clinical laboratory settings. The SVM approach is better at finding complex patterns in symptom data, making it more useful for initial patient screening. Rather than being competing approaches, the two methods could work together at different stages of diagnosis.
While both studies helped identify COVID-19 cases, they served different purposes in the diagnostic process. The decision tree method works best when analyzing laboratory data, while the SVM approach is more suited for evaluating patient symptoms before lab testing. Together, they show how different computer analysis methods can help healthcare workers make decisions at various stages of patient care, with each method being best suited for specific types of medical data.
https://github.com/Aconrard/DATA622/blob/main/Prehospital%20Predicting%20Factors%20Using%20Decision%20Tree.pdf

https://github.com/Aconrard/DATA622/blob/main/Novel%20Enhanced%20Prediction%20of%20Possibility%20of%20Cardaiac%20Arrest.pdf

https://github.com/Aconrard/DATA622/blob/main/Decision%20Tree%20Model%20for%20Pedicting%20Outcomes%20after%20OHCA%20in%20ED.pdf
My field of employment is prehospital emergency care. While decision trees, random forests, and Support Vector Machines (SVM) are not new methods, their application to cardiac arrest appears very limited, with only about thirteen (13) journal articles identified in a Google Scholar search. Of those, only one used SVM.
In a comprehensive study of out-of-hospital cardiac arrest (OHCA) outcomes (Goto, 2013), researchers employed recursive partitioning analysis to develop a practical decision-tree prediction model. Using a substantial Japanese dataset of 390,226 patients split between development (2005-2008) and validation (2009) cohorts, the study prioritized clinical utility while maintaining statistical rigor. The chosen methodology balanced simplicity with accuracy, creating a model based on four key factors: shockable initial rhythm, age, witnessed arrest, and arrest witnessed by EMS personnel. While alternative methods like logistic regression, neural networks, score-based systems, and multivariate analysis could have offered different advantages, they were ultimately rejected due to their complexity, lack of transparency, or reduced practicality in emergency settings.
Researchers analyzed data from 86,495 patients who had witnessed cardiac arrests with shockable heart rhythms, splitting the data into two groups for development (77,845 patients) and testing (8,650 patients) (Tateishi, 2023). They created a prediction model that focused on three main factors: whether the patient’s heart restarted before reaching the hospital, whether adrenaline was used, and the patient’s age. Unlike previous studies that looked at all cardiac arrest patients, this study focused only on patients whose hearts could potentially be shocked back to normal rhythm and whose arrest was witnessed by someone, making the findings more specific for this type of emergency.
The two Japanese studies analyzed cardiac arrest survival using similar methods but different approaches. While both were similarly accurate in their predictions (about 85%), the targeted approach of the second study found much higher survival rates (up to 70.8%, versus 23.2% in the first study). The 2023 study represented an evolution in the field by focusing on patients with better survival chances, including treatment factors like adrenaline use, achieving higher success rates, and providing more detailed patient groupings. This shows that while broad prediction models are useful, focusing on specific types of cardiac arrest patients can provide better guidance and identify those with the best chance of survival.
The final article was the only one I found that used SVM (Reddy, 2024). However, the title is somewhat misleading: the authors are not actually investigating cardiac arrest, but rather a disease process that can lead to out-of-hospital cardiac arrest (cardiomyopathy). The study compared the effectiveness of Support Vector Machine (SVM) and Decision Tree (DT) algorithms in predicting cardiac arrest risk in cardiovascular disease patients. Using a dataset of patient characteristics (age, gender, BMI, smoking status, glucose levels, etc.), the researchers ran multiple iterations of both models. The results consistently showed SVM outperforming DT, with an average accuracy of 86.46% compared to DT's 72.40%, and an independent-samples t-test confirmed the statistical significance of this difference (p < 0.001). The study therefore concludes that SVM is the better predictor of cardiomyopathy risk in this context. It should be noted that the discussion section states that the Decision Tree performs better than the SVM in terms of accuracy, which contradicts everything else in the article and appears to be a simple typographical error.