Goal

  1. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework.

  2. Based on the following articles:

https://www.hindawi.com/journals/complexity/2021/5550344/ 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/ 

Search for academic content (at least three articles) comparing the use of decision trees vs. SVMs in your current area of expertise.

     Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

1. SVM model

For the previous homework I decided to use the heart failure dataset available on Kaggle at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download

This dataset contains 918 observations and 12 variables. I chose HeartDisease as my target variable. It takes the values 0 and 1, corresponding to not having heart disease and having the disease, respectively.

Seven of the 12 variables are numeric, including my target variable. For the purposes of this homework I am converting the target variable to a factor.

# Packages used below: skimr (skim), dplyr (select), e1071 (svm), caret (confusionMatrix)
library(skimr); library(dplyr); library(e1071); library(caret)
df <- read.csv("heart.csv", colClasses = c("numeric", "factor", "factor", "numeric", "numeric", "numeric", "factor", "numeric", "factor", "numeric", "factor", "factor"))
head(df)
##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0

Below is the skim() view of the dataset; it shows there are no missing values.

skim(df)
Data summary

Name                       df
Number of rows             918
Number of columns          12
_______________________
Column type frequency:
  factor                   6
  numeric                  6
________________________
Group variables            None

Variable type: factor

skim_variable    n_missing  complete_rate  ordered  n_unique  top_counts
Sex                      0              1  FALSE           2  M: 725, F: 193
ChestPainType            0              1  FALSE           4  ASY: 496, NAP: 203, ATA: 173, TA: 46
RestingECG               0              1  FALSE           3  Nor: 552, LVH: 188, ST: 178
ExerciseAngina           0              1  FALSE           2  N: 547, Y: 371
ST_Slope                 0              1  FALSE           3  Fla: 460, Up: 395, Dow: 63
HeartDisease             0              1  FALSE           2  1: 508, 0: 410

Variable type: numeric

skim_variable    n_missing  complete_rate    mean      sd    p0     p25    p50    p75   p100  hist
Age                      0              1   53.51    9.43  28.0   47.00   54.0   60.0   77.0  ▁▅▇▆▁
RestingBP                0              1  132.40   18.51   0.0  120.00  130.0  140.0  200.0  ▁▁▃▇▁
Cholesterol              0              1  198.80  109.38   0.0  173.25  223.0  267.0  603.0  ▃▇▇▁▁
FastingBS                0              1    0.23    0.42   0.0    0.00    0.0    0.0    1.0   ▇▁▁▁▂
MaxHR                    0              1  136.81   25.46  60.0  120.00  138.0  156.0  202.0  ▁▃▇▆▂
Oldpeak                  0              1    0.89    1.07  -2.6    0.00    0.6    1.5    6.2   ▁▇▆▁▁
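
One caveat on "no missing values": RestingBP and Cholesterol both have a minimum (p0) of 0, which is not physiologically plausible, so those zeros most likely encode missing measurements even though skim() reports no NA values.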

To move forward, I split the dataset into training and test sets in a 75:25 ratio.

set.seed(111)
# Sample 75% of the row indices for training; the remaining rows form the test set
df.sample <- sample(nrow(df), round(nrow(df) * 0.75), replace = FALSE)
df.train <- df[df.sample, ]
df.test <- df[-df.sample, ]

After splitting, I check that both target classes are represented in each set:

round(prop.table(table(select(df.train, HeartDisease), exclude = NULL)), 4) * 100
## 
##     0     1 
## 45.49 54.51
round(prop.table(table(select(df.test, HeartDisease), exclude = NULL)), 4) * 100
## 
##     0     1 
## 42.17 57.83
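
The class proportions differ a bit between the two sets (45.5/54.5 vs. 42.2/57.8) because the split is a simple random sample. As an aside, caret's createDataPartition() (caret is already loaded above) can produce a stratified split that preserves the target's class balance; a minimal sketch, not used for the results below:

# Sketch only: stratified 75:25 split on the target with caret.
set.seed(111)
train.idx <- createDataPartition(df$HeartDisease, p = 0.75, list = FALSE)
round(prop.table(table(df[train.idx, "HeartDisease"])), 4) * 100   # train proportions
round(prop.table(table(df[-train.idx, "HeartDisease"])), 4) * 100  # test proportions

Stratification matters most with small datasets or rare classes; here the random split is acceptable, so I keep it.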
# Fit a polynomial-kernel SVM on a subset of predictors (scale = FALSE keeps original scales)
svm.fit <- svm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
               data = df.train, kernel = "polynomial", scale = FALSE)
svm.fit
## 
## Call:
## svm(formula = HeartDisease ~ Age + Sex + ChestPainType + RestingBP + 
##     RestingECG + MaxHR, data = df.train, kernel = "polynomial", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  194
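
The model keeps 194 support vectors out of 688 training observations (roughly 28%), which suggests the classes are far from cleanly separable with these predictors.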
pred <- predict(svm.fit, newdata = df.test)
confusionMatrix(pred, df.test$HeartDisease)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  59  21
##          1  38 112
##                                           
##                Accuracy : 0.7435          
##                  95% CI : (0.6819, 0.7986)
##     No Information Rate : 0.5783          
##     P-Value [Acc > NIR] : 1.329e-07       
##                                           
##                   Kappa : 0.4613          
##                                           
##  Mcnemar's Test P-Value : 0.03725         
##                                           
##             Sensitivity : 0.6082          
##             Specificity : 0.8421          
##          Pos Pred Value : 0.7375          
##          Neg Pred Value : 0.7467          
##              Prevalence : 0.4217          
##          Detection Rate : 0.2565          
##    Detection Prevalence : 0.3478          
##       Balanced Accuracy : 0.7252          
##                                           
##        'Positive' Class : 0               
## 
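
As a quick sanity check, the reported accuracy can be recomputed from the confusion matrix above, since the correct predictions sit on the diagonal:

(59 + 112) / (59 + 21 + 38 + 112)  # correct / total = 171/230 ≈ 0.7435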

2. Compare decision trees and SVMs

The study below states that "the classification accuracy of SVM algorithm was better than DT algorithm".

https://scialert.net/fulltext/?doi=itj.2009.64.70 

The article below discusses how these algorithms can be used in both classification and regression problems, with an example of a regression problem.

https://towardsdatascience.com/a-complete-view-of-decision-trees-and-svm-in-machine-learning-f9f3d19a337b 

The article below explains the basics of both algorithms with pictorial representations.

https://www.numpyninja.com/post/a-simple-introduction-to-decision-tree-and-support-vector-machines-svm
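
To make the comparison concrete on this dataset, a decision tree can be fit on the same predictors and split used for the SVM. This is only an illustrative sketch, assuming the rpart package (which is not part of the homework code above):

library(rpart)

# Sketch: classification tree on the same predictors and training split,
# to compare test-set accuracy against the SVM above.
tree.fit <- rpart(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
                  data = df.train, method = "class")
tree.pred <- predict(tree.fit, newdata = df.test, type = "class")
mean(tree.pred == df.test$HeartDisease)  # test accuracy of the tree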

3. Conclusion

For this dataset, the SVM model accuracy was around 63% with the default (radial) kernel; switching the kernel parameter to polynomial increased the accuracy to 74%. However, the random forest model from the previous homework, at 87% accuracy, still performs better, so in this case random forest seems to be a better model than SVM.
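
If I revisit this model, a natural next step would be a proper hyperparameter search rather than a single kernel swap. e1071's tune.svm() runs a cross-validated grid search; the sketch below uses illustrative grids (the cost and degree values are assumptions, not values from this homework):

# Sketch only: cross-validated grid search over cost and degree for the
# polynomial kernel; the grids below are illustrative, not tuned values.
set.seed(111)
tuned <- tune.svm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
                  data = df.train, kernel = "polynomial",
                  cost = c(0.1, 1, 10), degree = 2:4)
summary(tuned)
best.pred <- predict(tuned$best.model, newdata = df.test)
mean(best.pred == df.test$HeartDisease)  # test accuracy of the tuned model

tuned$best.model is refit on the full training set with the best parameter combination, so it can be used directly for prediction on the test set.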