Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework.
Based on the following articles:
https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compares the use of decision trees vs. SVMs in your current area of expertise.
For the previous homework I used the heart failure dataset available on Kaggle: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download.
The dataset contains 918 observations and 12 variables. I chose HeartDisease as my target variable; it takes the values 0 and 1, corresponding to not having heart disease and having the disease, respectively.
7 of the 12 variables are numeric, including the target variable. For the purpose of this homework I convert the target variable (along with the other categorical columns) to a factor while reading the data.
library(e1071)  # svm()
library(caret)  # confusionMatrix()
library(skimr)  # skim()
library(dplyr)  # select()

# Read the data, declaring the categorical columns (including the target) as factors
df <- read.csv("heart.csv",
               colClasses = c("numeric", "factor", "factor", "numeric",
                              "numeric", "numeric", "factor", "numeric",
                              "factor", "numeric", "factor", "factor"))
head(df)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0
Below is the skim summary of the dataset; it shows that there are no missing values.
skim(df)
| | |
|---|---|
| Name | df |
| Number of rows | 918 |
| Number of columns | 12 |
| Column type frequency: | |
| factor | 6 |
| numeric | 6 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Sex | 0 | 1 | FALSE | 2 | M: 725, F: 193 |
| ChestPainType | 0 | 1 | FALSE | 4 | ASY: 496, NAP: 203, ATA: 173, TA: 46 |
| RestingECG | 0 | 1 | FALSE | 3 | Nor: 552, LVH: 188, ST: 178 |
| ExerciseAngina | 0 | 1 | FALSE | 2 | N: 547, Y: 371 |
| ST_Slope | 0 | 1 | FALSE | 3 | Fla: 460, Up: 395, Dow: 63 |
| HeartDisease | 0 | 1 | FALSE | 2 | 1: 508, 0: 410 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 53.51 | 9.43 | 28.0 | 47.00 | 54.0 | 60.0 | 77.0 | ▁▅▇▆▁ |
| RestingBP | 0 | 1 | 132.40 | 18.51 | 0.0 | 120.00 | 130.0 | 140.0 | 200.0 | ▁▁▃▇▁ |
| Cholesterol | 0 | 1 | 198.80 | 109.38 | 0.0 | 173.25 | 223.0 | 267.0 | 603.0 | ▃▇▇▁▁ |
| FastingBS | 0 | 1 | 0.23 | 0.42 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
| MaxHR | 0 | 1 | 136.81 | 25.46 | 60.0 | 120.00 | 138.0 | 156.0 | 202.0 | ▁▃▇▆▂ |
| Oldpeak | 0 | 1 | 0.89 | 1.07 | -2.6 | 0.00 | 0.6 | 1.5 | 6.2 | ▁▇▆▁▁ |
Next, I split the dataset into training and test sets in a 75:25 ratio.
set.seed(111)  # for reproducibility
# Sample 75% of the row indices without replacement
df.sample <- sample(nrow(df), round(nrow(df) * 0.75), replace = FALSE)
df.train <- df[df.sample, ]
df.test <- df[-df.sample, ]
After splitting, I check that both classes of the target variable are represented in each set.
round(prop.table(table(select(df.train, HeartDisease), exclude = NULL)), 4) * 100
##
## 0 1
## 45.49 54.51
round(prop.table(table(select(df.test, HeartDisease), exclude = NULL)), 4) * 100
##
## 0 1
## 42.17 57.83
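The class proportions differ slightly between the two sets (45.49/54.51 vs. 42.17/57.83). If we wanted the split to preserve the overall class balance exactly, a stratified split could be used instead. Below is a minimal sketch using caret::createDataPartition; this is an alternative to the simple random split above, not what was used for the results that follow.
# Stratified 75:25 split that preserves the HeartDisease class proportions
set.seed(111)
train.idx <- createDataPartition(df$HeartDisease, p = 0.75, list = FALSE)
df.train.strat <- df[train.idx, ]
df.test.strat <- df[-train.idx, ]
round(prop.table(table(df.train.strat$HeartDisease)), 4) * 100
round(prop.table(table(df.test.strat$HeartDisease)), 4) * 100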
# Fit an SVM classifier with a polynomial kernel on a subset of the predictors.
# Named svm.fit rather than svm so that e1071::svm() is not masked.
svm.fit <- svm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
               data = df.train, kernel = "polynomial", scale = FALSE)
svm.fit
##
## Call:
## svm(formula = HeartDisease ~ Age + Sex + ChestPainType + RestingBP +
## RestingECG + MaxHR, data = df.train, kernel = "polynomial", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 194
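Note that scale = FALSE turns off e1071's default standardization of the numeric predictors, and SVM decision boundaries are sensitive to feature ranges. A minimal variant worth checking (my own aside, not part of the graded run) is the same fit with scaling left on:
# Same model, but with e1071's default feature scaling enabled
svm.scaled <- svm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
                  data = df.train, kernel = "polynomial", scale = TRUE)
mean(predict(svm.scaled, newdata = df.test) == df.test$HeartDisease)  # test-set accuracy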
pred <- predict(svm.fit, newdata = df.test)
confusionMatrix(pred, df.test$HeartDisease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59 21
## 1 38 112
##
## Accuracy : 0.7435
## 95% CI : (0.6819, 0.7986)
## No Information Rate : 0.5783
## P-Value [Acc > NIR] : 1.329e-07
##
## Kappa : 0.4613
##
## Mcnemar's Test P-Value : 0.03725
##
## Sensitivity : 0.6082
## Specificity : 0.8421
## Pos Pred Value : 0.7375
## Neg Pred Value : 0.7467
## Prevalence : 0.4217
## Detection Rate : 0.2565
## Detection Prevalence : 0.3478
## Balanced Accuracy : 0.7252
##
## 'Positive' Class : 0
##
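Note that caret treats the first factor level, "0", as the positive class, so the sensitivity of 0.6082 reported above is for detecting the absence of disease. Since detecting disease is usually the direction of interest, the same statistics can be recomputed with "1" as the positive class:
# Recompute the statistics treating "1" (disease present) as the positive class
confusionMatrix(pred, df.test$HeartDisease, positive = "1")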
The study below states that “the classification accuracy of SVM algorithm was better than DT algorithm”.
https://scialert.net/fulltext/?doi=itj.2009.64.70
The article below discusses how these algorithms can be used in both classification and regression problems, with a worked regression example.
https://towardsdatascience.com/a-complete-view-of-decision-trees-and-svm-in-machine-learning-f9f3d19a337b
The article below explains the basics of both algorithms with pictorial representations.
https://www.numpyninja.com/post/a-simple-introduction-to-decision-tree-and-support-vector-machines-svm
For this dataset, the SVM model's accuracy was around 63% with the default settings; switching the kernel parameter to polynomial increased it to about 74%. However, the random forest model from the previous homework, at 87% accuracy, still performs better, so in this case random forest seems to be a better model than SVM.
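Since the kernel choice alone moved accuracy from roughly 63% to 74%, a more systematic search over the SVM hyperparameters might close some of the gap. Here is a minimal sketch using e1071::tune(); it is illustrative only, the parameter ranges are my assumptions, and the results above were produced without tuning.
# Grid-search cost and degree for the polynomial-kernel SVM (10-fold CV by default)
set.seed(111)
tuned <- tune(svm,
              HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
              data = df.train, kernel = "polynomial",
              ranges = list(cost = c(0.1, 1, 10), degree = 2:4))
summary(tuned)  # cross-validated error for each parameter combination
confusionMatrix(predict(tuned$best.model, newdata = df.test), df.test$HeartDisease)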