Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with those from the previous homework.
## education marital housing contact duration month age balance campaign pdays
## 1 tertiary married yes unknown 261 may 58 2143 1 -1
## 2 secondary single yes unknown 151 may 44 29 1 -1
## 3 secondary married yes unknown 76 may 33 2 1 -1
## 4 unknown married yes unknown 92 may 47 1506 1 -1
## 5 unknown single no unknown 198 may 33 1 1 -1
## 6 tertiary married yes unknown 139 may 35 231 1 -1
## 7 tertiary single yes unknown 217 may 28 447 1 -1
## 8 tertiary divorced yes unknown 380 may 42 2 1 -1
## 9 primary married yes unknown 50 may 58 121 1 -1
## 10 secondary single yes unknown 55 may 43 593 1 -1
## previously_contacted poutcome y
## 1 no unknown no
## 2 no unknown no
## 3 no unknown no
## 4 no unknown no
## 5 no unknown no
## 6 no unknown no
## 7 no unknown no
## 8 no unknown no
## 9 no unknown no
## 10 no unknown no
We will use a basic SVM algorithm with a linear kernel for our first attempt on our rebalanced dataset.
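Although the code chunk itself is not echoed here, the call pattern looks roughly like the following minimal sketch, where `train_bal` (the rebalanced training split) and `test_set` (the hold-out split) are assumed object names and `y` is the `no`/`yes` target:

```r
# Hypothetical sketch -- object names (train_bal, test_set) are assumptions,
# not the actual variables used in this report.
library(e1071)   # svm()
library(caret)   # confusionMatrix()
library(pROC)    # roc(), auc()

set.seed(622)
svm_fit <- svm(y ~ ., data = train_bal,
               kernel = "linear", probability = TRUE)

pred_class <- predict(svm_fit, newdata = test_set)
pred_prob  <- attr(predict(svm_fit, newdata = test_set, probability = TRUE),
                   "probabilities")[, "yes"]

# Confusion matrix with "yes" (accepted the term deposit) as the positive class
confusionMatrix(pred_class, test_set$y, positive = "yes")

# AUC; pROC prints the "Setting direction: controls < cases" message seen above
auc(roc(test_set$y, pred_prob))
```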
## Setting direction: controls < cases
## [1] "SVM Experiment 1 Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9529 216
## yes 2447 1370
##
## Accuracy : 0.8036
## 95% CI : (0.7969, 0.8103)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4096
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8638
## Specificity : 0.7957
## Pos Pred Value : 0.3589
## Neg Pred Value : 0.9778
## Prevalence : 0.1169
## Detection Rate : 0.1010
## Detection Prevalence : 0.2814
## Balanced Accuracy : 0.8297
##
## 'Positive' Class : yes
##
## [1] "SVM AUC: 0.9026"
In this application, we will use an SVM model with parameter tuning to improve model performance. We will also use a radial kernel instead of a linear one because of its stronger performance on non-linear classification problems.
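A hedged sketch of what the tuning step might look like with `e1071::tune.svm()`, again using the assumed object names from above; the cost/gamma grid is illustrative rather than the exact grid searched here:

```r
# Hypothetical sketch -- the cost/gamma grid is illustrative, not necessarily
# the grid searched in this report.
set.seed(622)
tune_out <- tune.svm(y ~ ., data = train_bal,
                     kernel = "radial",
                     cost   = c(0.1, 1, 10),
                     gamma  = c(0.01, 0.1, 1),
                     probability = TRUE)

tune_out$best.parameters          # cross-validated choice of cost and gamma
svm_tuned <- tune_out$best.model  # refit on the full training data

pred_class <- predict(svm_tuned, newdata = test_set)
confusionMatrix(pred_class, test_set$y, positive = "yes")
```

By default, `tune.svm()` scores each grid point with 10-fold cross-validation, which is where most of the run time discussed later comes from.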
## Setting direction: controls < cases
## [1] "\nSVM Experiment 2 (Tuned) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10012 258
## yes 1964 1328
##
## Accuracy : 0.8362
## 95% CI : (0.8298, 0.8424)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4591
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.83733
## Specificity : 0.83601
## Pos Pred Value : 0.40340
## Neg Pred Value : 0.97488
## Prevalence : 0.11694
## Detection Rate : 0.09792
## Detection Prevalence : 0.24274
## Balanced Accuracy : 0.83667
##
## 'Positive' Class : yes
##
## [1] "Tuned SVM AUC: 0.9065"
Let’s compare the results of the basic SVM algorithm with those of the various experiments conducted in Homework Assignment #2. It should be noted that we needed to address the class-imbalance issue in the data and used a 60% minority-class re-balancing of the dataset.
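For context, here is a minimal sketch of one way such a rebalanced training split could be produced, reading "60% minority" as the minority class making up 60% of the rebalanced training data; the actual HW #2 procedure may have differed (e.g., down-sampling the majority class or using a library routine), and `train_set` is an assumed name:

```r
# Hypothetical sketch -- train_set is an assumed name; reading "60% minority"
# as: the "yes" class makes up 60% of the rebalanced training data.
set.seed(622)
majority <- train_set[train_set$y == "no",  ]
minority <- train_set[train_set$y == "yes", ]

# Solve n_min / (n_min + n_maj) = 0.60  =>  n_min = 1.5 * n_maj
n_min <- round(1.5 * nrow(majority))
minority_up <- minority[sample(nrow(minority), n_min, replace = TRUE), ]

train_bal <- rbind(majority, minority_up)
prop.table(table(train_bal$y))   # ~ no: 0.40, yes: 0.60
```

Note that only the training split is rebalanced; the hold-out set keeps its natural class mix, which is why the confusion matrices above report a prevalence of about 11.7%.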
## [1] "Complete Model Comparison:"
## Metric DecisionTree_50_50 DecisionTree_40_60 DecisionTree_60_40
## 1 Accuracy 0.7977 0.8394 0.7930
## 2 Sensitivity 0.8323 0.7686 0.8569
## 3 Specificity 0.7931 0.8488 0.7845
## 4 Pos Pred Value 0.3476 0.4023 0.3449
## 5 AUC 0.8390 0.8270 0.8610
## RandomForest_1 RandomForest_2 AdaBoost_Original AdaBoost_Real SVM SVM2
## 1 0.8294 0.8452 0.8166 0.8276 0.8140 0.8362
## 2 0.8758 0.8430 0.8689 0.8581 0.8808 0.8373
## 3 0.8232 0.8454 0.8097 0.8236 0.8052 0.8360
## 4 0.3962 0.4194 0.3768 0.3918 0.3745 0.4034
## 5 0.9170 0.9170 NA NA 0.9131 0.9065
Our assignment compared several methods for predicting bank term deposit acceptance: Decision Trees, Random Forest, AdaBoost, and Support Vector Machine (SVM). Each model was tested multiple times with different settings to find the best approach for identifying potential customers.
Looking at the numbers, Random Forest performed best overall, with 84.52% accuracy and a positive predictive value of 41.94% (i.e., when it flagged a customer as likely to accept, it was right about 42% of the time). This is particularly good considering only about 12% of customers in our data actually accepted term deposits. Random Forest worked well because it handles mixed data types (numeric and categorical) and is relatively robust to noisy or incomplete data.
The tuned SVM model also performed well, achieving 83.62% accuracy with very balanced results: it was almost equally good at identifying customers who would accept (sensitivity 83.73%) and reject (specificity 83.60%) term deposits. However, getting these results required significant adjustments to our original approach, because the initial SVM analysis was too slow and computationally expensive to tune exhaustively.
Decision Trees showed more variation in performance, with accuracy ranging from 79.30% to 83.94% depending on how we balanced the data. While simpler to implement and understand, they weren’t as reliable as Random Forest or SVM. AdaBoost fell in the middle, reaching 82.76% accuracy but requiring significant processing time.
The results suggest:

Random Forest is best for:

- overall accuracy and precision, particularly when the predictors mix numeric and categorical variables
- settings where computing resources are limited

SVM is strong for:

- balanced performance across both classes (similar sensitivity and specificity)
- problems where the class boundary is non-linear
However, both methods have drawbacks. Random Forest requires significant memory for large data sets, and SVM can be very slow when trying different parameter settings. These practical limitations affected our ability to fully optimize the models, particularly SVM.
Looking ahead, Random Forest appears to be the better choice for similar prediction tasks, especially when working with limited computing resources. While SVM showed promise, its computational demands make it less practical for regular use unless significant computing power is available.
For future work, having more computing power would allow:

- a more exhaustive search over SVM parameter settings (cost, gamma, and alternative kernels)
- more extensive cross-validation during tuning
- fully optimizing all of the models rather than the scaled-back versions used here
These findings help guide future predictive modeling work in banking, showing that Random Forest offers the best balance of performance and practical usability, while SVM might be worth the extra computational cost in situations where balanced prediction accuracy is crucial.
Both studies aimed to help identify COVID-19 cases, but they went about it in different ways. The first study used decision trees to analyze laboratory test results from 600 patients, simply trying to determine if each case was positive or negative for COVID-19. The second study used SVM to look at symptoms from 200 patients, attempting to classify cases into three categories: not infected, mildly infected, or severely infected. Despite their different approaches, both methods were similarly accurate, achieving around 87% success in their predictions.
The two methods work better in different situations. Decision trees handle large amounts of numerical data, such as lab results, and make it easier to understand how decisions were reached, which makes them well suited to clinical laboratory settings. The SVM approach is better at finding complex patterns in symptom data, making it more useful for initial patient screening. Rather than being competing approaches, the two methods could work together at different stages of diagnosis.
While both studies helped identify COVID-19 cases, they served different purposes in the diagnostic process. The decision tree method works best when analyzing laboratory data, while the SVM approach is more suited for evaluating patient symptoms before lab testing. Together, they show how different computer analysis methods can help healthcare workers make decisions at various stages of patient care, with each method being best suited for specific types of medical data.
https://github.com/Aconrard/DATA622/blob/main/Prehospital%20Predicting%20Factors%20Using%20Decision%20Tree.pdf

https://github.com/Aconrard/DATA622/blob/main/Novel%20Enhanced%20Prediction%20of%20Possibility%20of%20Cardaiac%20Arrest.pdf

https://github.com/Aconrard/DATA622/blob/main/Decision%20Tree%20Model%20for%20Pedicting%20Outcomes%20after%20OHCA%20in%20ED.pdf
My field of employment is prehospital emergency care. While decision trees, random forests, and Support Vector Machines (SVM) are not new methods, their application to cardiac arrest appears very limited, with only about thirteen (13) journal articles identified in a Google Scholar search. Of those, only one used SVM.
In a comprehensive study of out-of-hospital cardiac arrest (OHCA) outcomes (Goto, 2013), researchers employed recursive partitioning analysis to develop a practical decision-tree prediction model. Using a substantial Japanese dataset of 390,226 patients split between development (2005-2008) and validation (2009) cohorts, the study prioritized clinical utility while maintaining statistical rigor. The chosen methodology balanced simplicity with accuracy, creating a model based on four key factors: shockable initial rhythm, age, witnessed arrest, and arrest witnessed by EMS personnel. While alternative methods like logistic regression, neural networks, score-based systems, and multivariate analysis could have offered different advantages, they were ultimately rejected due to their complexity, lack of transparency, or reduced practicality in emergency settings.
Researchers analyzed data from 86,495 patients who had witnessed cardiac arrests with shockable heart rhythms, splitting the data into two groups for development (77,845 patients) and testing (8,650 patients) (Tateishi, 2023). They created a prediction model that focused on three main factors: whether the patient’s heart restarted before reaching the hospital, whether adrenaline was used, and the patient’s age. Unlike previous studies that looked at all cardiac arrest patients, this study focused only on patients whose hearts could potentially be shocked back to normal rhythm and whose arrest was witnessed by someone, making the findings more specific for this type of emergency.
The two Japanese studies analyzed cardiac arrest survival using similar methods but different approaches. While both were similarly accurate in their predictions (about 85%), the targeted approach of the second study found much higher survival rates (up to 70.8%, versus 23.2% in the first study). The 2023 study represented an evolution in the field by focusing on patients with better survival chances, including treatment factors like adrenaline use, achieving higher success rates, and providing more detailed patient groupings. This shows that while broad prediction models are useful, focusing on specific types of cardiac arrest patients can provide better guidance and identify those with the best chance of survival.
The final article was the only one I found that used SVM (Reddy, 2024). However, the title is somewhat misleading: the authors are not actually investigating cardiac arrest, but rather a disease process that can lead to out-of-hospital cardiac arrest (cardiomyopathy). The study compared the effectiveness of Support Vector Machine (SVM) and Decision Tree (DT) algorithms in predicting cardiac arrest risk in cardiovascular disease patients. Using a dataset of patient characteristics (age, gender, BMI, smoking status, glucose levels, etc.), the researchers ran multiple iterations of both models. The results consistently showed SVM outperforming DT, with an average accuracy of 86.46% compared to DT's 72.40%, and an independent-samples t-test confirmed the statistical significance of this difference (p < 0.001). The study therefore concludes that SVM is the better predictor of cardiomyopathy risk in this context. It should be noted that the discussion section states that the Decision Tree performs better than the SVM in terms of accuracy, which contradicts everything else in the article and appears to be a simple typographical error.