# Homework #3

Read the following articles:

- https://www.hindawi.com/journals/complexity/2021/5550344/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least 3 articles) comparing the use of decision trees vs. SVMs in your current area of expertise. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm, and compare the results with the results from the previous homework. Answer questions such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

## Data
### Data From Kaggle (HW2)

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
Author of the Data: fedesoriano
The dataset selected [https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset] is a Kaggle dataset of clinical stroke-prediction data and will allow us to build a model that predicts stroke occurrence based on certain variables.
data <- readr::read_csv("https://raw.githubusercontent.com/jtul333/Data622/main/healthcare-dataset-stroke-data.csv")

## Rows: 5110 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, ever_married, work_type, Residence_type, smoking_status
## dbl (7): id, age, hypertension, heart_disease, avg_glucose_level, bmi, stroke
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
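As the message suggests, the column types can be declared up front to keep the read quiet and deterministic. A minimal sketch, with the types taken from the specification printed above:

library(readr)

# Declare column types explicitly so read_csv() emits no message
data <- read_csv(
  "https://raw.githubusercontent.com/jtul333/Data622/main/healthcare-dataset-stroke-data.csv",
  col_types = cols(
    id = col_double(), gender = col_character(), age = col_double(),
    hypertension = col_double(), heart_disease = col_double(),
    ever_married = col_character(), work_type = col_character(),
    Residence_type = col_character(), avg_glucose_level = col_double(),
    bmi = col_double(), smoking_status = col_character(),
    stroke = col_double()
  )
)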
The data contain the variables gender, ever_married, work_type, Residence_type, bmi, smoking_status, id, age, hypertension, heart_disease, avg_glucose_level, and stroke.
head(data)

## # A tibble: 6 × 12
## id gender age hypertension heart_disease ever_married work_type
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## # ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <dbl>,
## # smoking_status <chr>, stroke <dbl>
sapply(data,class)

## id gender age hypertension
## "numeric" "character" "numeric" "numeric"
## heart_disease ever_married work_type Residence_type
## "numeric" "character" "character" "character"
## avg_glucose_level bmi smoking_status stroke
## "numeric" "numeric" "character" "numeric"
The plan is to predict stroke based on these variables. The skim function gives us a quick and detailed view of the dataset.

Important notation about the data:

- gender - Male, Female, Other
- age - age of patient
- hypertension - 0 = No, 1 = Yes
- heart_disease - 0 = No, 1 = Yes
- ever_married - Yes, No
- work_type - Private, Govt_job, Self-employed, children, Never_worked
- Residence_type - Urban, Rural
- avg_glucose_level - double
- bmi - double
- smoking_status - formerly smoked, never smoked, smokes, Unknown
- stroke - 0 = No, 1 = Yes
skim(data)

| Name | data |
|---|---|
| Number of rows | 5110 |
| Number of columns | 12 |
| Column type frequency: | |
| character | 5 |
| numeric | 7 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| ever_married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| work_type | 0 | 1 | 7 | 13 | 0 | 5 | 0 |
| Residence_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| smoking_status | 0 | 1 | 6 | 15 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| bmi | 0 | 1 | 29.17 | 7.82 | 10.30 | 23.80 | 28.40 | 34.00 | 97.60 | ▇▇▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
plot_missing(data)

The missingness plot shows there is no missing data.
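A quick numeric check of the same thing, as a sketch in base R:

# Count the NA values per column; all zeros confirms no missing data
colSums(is.na(data))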
We create a separate data set in order to preserve the original data and make all the necessary transformations there. We'll convert the variables that were of type character into factors.
data_prepared <- data
#Value substitutions cleanup: only gender is actually recoded;
#the other category labels are already in the form we want
data_prepared$gender[data_prepared$gender == "Female"] <- "F"
data_prepared$gender[data_prepared$gender == "Male"] <- "M"
#Data type change for columns of interest
#(note: as.integer truncates the decimals in bmi and avg_glucose_level)
data_prepared$work_type <- as.factor(data_prepared$work_type)
data_prepared$gender <- as.factor(data_prepared$gender)
data_prepared$Residence_type <- as.factor(data_prepared$Residence_type)
data_prepared$ever_married <- as.factor(data_prepared$ever_married)
data_prepared$smoking_status <- as.factor(data_prepared$smoking_status)
data_prepared$stroke <- as.integer(data_prepared$stroke)
data_prepared$heart_disease <- as.integer(data_prepared$heart_disease)
data_prepared$bmi <- as.integer(data_prepared$bmi)
data_prepared$avg_glucose_level <- as.integer(data_prepared$avg_glucose_level)
data_prepared$hypertension <- as.integer(data_prepared$hypertension)
data_prepared$age <- as.integer(data_prepared$age)

str(data_prepared)

## spc_tbl_ [5,110 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:5110] 9046 51676 31112 60182 1665 ...
## $ gender : Factor w/ 3 levels "F","M","Other": 2 1 2 1 1 2 2 1 1 1 ...
## $ age : int [1:5110] 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int [1:5110] 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int [1:5110] 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ work_type : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
## $ Residence_type : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
## $ avg_glucose_level: int [1:5110] 228 202 105 171 174 186 70 94 76 58 ...
## $ bmi : int [1:5110] 36 36 32 34 24 29 27 22 36 24 ...
## $ smoking_status : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
## $ stroke : int [1:5110] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. gender = col_character(),
## .. age = col_double(),
## .. hypertension = col_double(),
## .. heart_disease = col_double(),
## .. ever_married = col_character(),
## .. work_type = col_character(),
## .. Residence_type = col_character(),
## .. avg_glucose_level = col_double(),
## .. bmi = col_double(),
## .. smoking_status = col_character(),
## .. stroke = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data_prepared)

## id gender age hypertension
## Min. : 67 F :2994 Min. : 0.00 Min. :0.00000
## 1st Qu.:17741 M :2115 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Other: 1 Median :45.00 Median :0.00000
## Mean :36518 Mean :43.22 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 No :1757 children : 687 Rural:2514
## 1st Qu.:0.00000 Yes:3353 Govt_job : 657 Urban:2596
## Median :0.00000 Never_worked : 22
## Mean :0.05401 Private :2925
## 3rd Qu.:0.00000 Self-employed: 819
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.0 Min. :10.00 formerly smoked: 885 Min. :0.00000
## 1st Qu.: 77.0 1st Qu.:23.00 never smoked :1892 1st Qu.:0.00000
## Median : 91.0 Median :28.00 smokes : 789 Median :0.00000
## Mean :105.7 Mean :28.75 Unknown :1544 Mean :0.04873
## 3rd Qu.:114.0 3rd Qu.:34.00 3rd Qu.:0.00000
## Max. :271.0 Max. :97.00 Max. :1.00000
skim(data_prepared)

| Name | data_prepared |
|---|---|
| Number of rows | 5110 |
| Number of columns | 12 |
| Column type frequency: | |
| factor | 5 |
| numeric | 7 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 3 | F: 2994, M: 2115, Oth: 1 |
| ever_married | 0 | 1 | FALSE | 2 | Yes: 3353, No: 1757 |
| work_type | 0 | 1 | FALSE | 5 | Pri: 2925, Sel: 819, chi: 687, Gov: 657 |
| Residence_type | 0 | 1 | FALSE | 2 | Urb: 2596, Rur: 2514 |
| smoking_status | 0 | 1 | FALSE | 4 | nev: 1892, Unk: 1544, for: 885, smo: 789 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67 | 17741.25 | 36932 | 54682 | 72940 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.22 | 22.63 | 0 | 25.00 | 45 | 61 | 82 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0 | 0.00 | 0 | 0 | 1 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0 | 0.00 | 0 | 0 | 1 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 105.66 | 45.28 | 55 | 77.00 | 91 | 114 | 271 | ▇▃▁▁▁ |
| bmi | 0 | 1 | 28.75 | 7.83 | 10 | 23.00 | 28 | 34 | 97 | ▇▇▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0 | 0.00 | 0 | 0 | 1 | ▇▁▁▁▁ |
Since stroke is our target variable, let's examine its proportions:
hist(data_prepared$stroke)

The histogram shows a strong class imbalance: the vast majority of patients did not have a stroke. We also impute any missing values with the median and truncate decimals as a safeguard.

data_prepared$stroke[is.na(data_prepared$stroke)] <- median(data_prepared$stroke, na.rm=TRUE)

# We use this line of code to get rid of decimals
# (stroke is already an integer here, so this is only a safeguard).
data_prepared$stroke <- trunc(data_prepared$stroke)

summary(data_prepared)

## id gender age hypertension
## Min. : 67 F :2994 Min. : 0.00 Min. :0.00000
## 1st Qu.:17741 M :2115 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Other: 1 Median :45.00 Median :0.00000
## Mean :36518 Mean :43.22 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 No :1757 children : 687 Rural:2514
## 1st Qu.:0.00000 Yes:3353 Govt_job : 657 Urban:2596
## Median :0.00000 Never_worked : 22
## Mean :0.05401 Private :2925
## 3rd Qu.:0.00000 Self-employed: 819
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.0 Min. :10.00 formerly smoked: 885 Min. :0.00000
## 1st Qu.: 77.0 1st Qu.:23.00 never smoked :1892 1st Qu.:0.00000
## Median : 91.0 Median :28.00 smokes : 789 Median :0.00000
## Mean :105.7 Mean :28.75 Unknown :1544 Mean :0.04873
## 3rd Qu.:114.0 3rd Qu.:34.00 3rd Qu.:0.00000
## Max. :271.0 Max. :97.00 Max. :1.00000
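The summary shows a mean of about 0.049 for stroke, i.e. roughly 5% positive cases. A direct look at the class proportions, as a sketch:

# Proportion of each stroke class; confirms the ~95/5 imbalance
prop.table(table(data_prepared$stroke))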
We now want to predict the “stroke” variable and find out whether the variables “hypertension”, “heart_disease”, “gender”, “avg_glucose_level”, and “bmi” have any relationship with our dependent variable.

We first take a look at the “gender” variable to see how “heart_disease” is distributed across it. At first glance, the counts of females and males in the data are roughly comparable, though heart disease appears somewhat more frequent among males (163/2115 vs. 113/2994).
xtabs(~gender + `heart_disease`, data = data_prepared)

## heart_disease
## gender 0 1
## F 2881 113
## M 1952 163
## Other 1 0
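Whether that difference is statistically meaningful can be checked with a chi-squared test of independence; a sketch (the single “Other” row may trigger a small-expected-counts warning):

# Test whether gender and heart_disease are independent
chisq.test(xtabs(~ gender + heart_disease, data = data_prepared))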
Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.
# Partition data
set.seed(1)
# create an index selecting 80% of the rows for training
validation_index <- createDataPartition(data_prepared$stroke, p=0.8, list=FALSE)
# use the 80% for training and hold out the remaining 20% for testing
data_train <- data_prepared[validation_index,]
data_test <- data_prepared[-validation_index,]

Create a random forest classifier and analyze the results.
rf_model <- randomForest(as.factor(stroke) ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi,
data = data_train)
rf_pred <- predict(rf_model, data_test)
c_matrix <- confusionMatrix(rf_pred, as.factor(data_test$stroke))
c_matrix

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 975 45
## 1 1 1
##
## Accuracy : 0.955
## 95% CI : (0.9404, 0.9669)
## No Information Rate : 0.955
## P-Value [Acc > NIR] : 0.5391
##
## Kappa : 0.0381
##
## Mcnemar's Test P-Value : 2.298e-10
##
## Sensitivity : 0.99898
## Specificity : 0.02174
## Pos Pred Value : 0.95588
## Neg Pred Value : 0.50000
## Prevalence : 0.95499
## Detection Rate : 0.95401
## Detection Prevalence : 0.99804
## Balanced Accuracy : 0.51036
##
## 'Positive' Class : 0
##
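The random forest's 95.5% accuracy exactly matches the no-information rate: it predicts “no stroke” for almost everyone and catches only 1 of the 46 actual strokes in the test set (specificity 0.022, kappa 0.038). One common remedy is to grow each tree on a balanced, stratified sample so the minority class is not ignored. A minimal sketch; the sample size of 150 per class is an illustrative assumption, not a tuned value:

# Balanced random forest: draw 150 cases from each stroke class for every tree
# (sampling is stratified by the response; order follows the levels "0", "1")
set.seed(1)
rf_balanced <- randomForest(as.factor(stroke) ~ smoking_status + avg_glucose_level +
                              heart_disease + age + gender + bmi,
                            data = data_train,
                            sampsize = c(150, 150))
confusionMatrix(predict(rf_balanced, data_test), as.factor(data_test$stroke))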
The following code constructs the SVM model. It predicts stroke from smoking_status, avg_glucose_level, heart_disease, age, gender, and bmi. Note that because stroke was left numeric, svm() defaults to eps-regression rather than classification, as the summary below confirms.
svmfit<-svm(stroke ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi, data=data_train, kernel="linear", cost=10,scale=TRUE)
summary(svmfit)

##
## Call:
## svm(formula = stroke ~ smoking_status + avg_glucose_level + heart_disease +
## age + gender + bmi, data = data_train, kernel = "linear", cost = 10,
## scale = TRUE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 10
## gamma: 0.1
## epsilon: 0.1
##
##
## Number of Support Vectors: 708
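If a classifier is intended instead of a regression, converting the target to a factor makes svm() default to C-classification. A minimal sketch under that assumption, with the same predictors (e1071 is already loaded for svm() above); the tuning below keeps the original regression fit:

# Refit as a classifier: a factor response switches svm() to C-classification
data_train_cls <- data_train
data_train_cls$stroke <- as.factor(data_train_cls$stroke)
svm_cls <- svm(stroke ~ smoking_status + avg_glucose_level + heart_disease +
                 age + gender + bmi,
               data = data_train_cls, kernel = "linear", cost = 10, scale = TRUE)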
We used a cost parameter of 10. Let's tune it with 10-fold cross-validation over a range of cost values.
set.seed(1)
tune.out<-tune(svm,stroke ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi, data=data_train, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1, 1,5,10,100)))
summary(tune.out)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.0479754
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.04797665 0.009181715
## 2 1e-02 0.04797673 0.009180454
## 3 1e-01 0.04797540 0.009179188
## 4 1e+00 0.04797638 0.009180673
## 5 5e+00 0.04797708 0.009181209
## 6 1e+01 0.04797683 0.009180111
## 7 1e+02 0.04797805 0.009181584
The cross-validated error is essentially flat across the cost grid (≈0.048 for every value), so the choice of cost makes little practical difference here. We extract the best model anyway:

bestmod<-tune.out$best.model

summary(bestmod)

##
## Call:
## best.tune(METHOD = svm, train.x = stroke ~ smoking_status + avg_glucose_level +
## heart_disease + age + gender + bmi, data = data_train, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 0.1
## gamma: 0.1
## epsilon: 0.1
##
##
## Number of Support Vectors: 410
Using the best model, we make predictions on the test set:
ypred<-predict(bestmod,data_test)
pred_tablesvm<-table(predict=ypred, truth=data_test$stroke)
sum(diag(pred_tablesvm/nrow(data_test)*100))

## [1] 0.09784736
Because stroke was left numeric, the SVM's predictions are continuous, so the table above is not a true confusion matrix and its diagonal sum (≈0.098) cannot be read as an accuracy percentage. To report classification accuracy for the SVM, the continuous predictions must first be converted to class labels, as in the sketch below. Both models should also be judged against the 95.5% no-information rate of this heavily imbalanced stroke-prediction dataset: a model that always predicts “no stroke” is about 95% accurate, which is why the random forest's 95.5% accuracy comes with a balanced accuracy of only 0.51 and a specificity of 0.02. Sensitivity for the stroke class, balanced accuracy, or AUC are more informative metrics here.
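A minimal sketch of converting the regression output to class labels and computing accuracy; the 0.5 cutoff is an assumption, not a tuned threshold:

# Threshold the continuous eps-regression predictions into 0/1 labels
ypred_class <- ifelse(ypred > 0.5, 1, 0)
# Proper confusion matrix and overall accuracy
table(predict = ypred_class, truth = data_test$stroke)
mean(ypred_class == data_test$stroke)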
In machine learning, tree-based techniques and Support Vector Machines (SVMs) are popular tools for building prediction models. Given their underlying theory, decision trees and SVMs are most intuitively understood as classifiers that separate different groups (labels), but they can certainly also be powerful tools for solving regression problems.
2) https://bmcpharmacoltoxicol.biomedcentral.com/articles/10.1186/s40360-022-00588-0
Methods This study is a retrospective cohort study of National Poison Data System (NPDS) data, the largest data repository of poisoning cases in the United States. The SVM and DT algorithms were developed using training and test datasets. We also used precision-recall and ROC curves and Area Under the Curve value (AUC) for model evaluation.
Results Our model showed that acidosis, hypoglycemia, electrolyte abnormality, hypotension, elevated anion gap, elevated creatinine, tachycardia, and renal failure are the most important determinants in predicting the outcome of metformin poisoning. The average negative predictive values for the decision tree and SVM models were 92.30 and 93.30, respectively. The AUC of the ROC curve of the decision tree for major, minor, and moderate outcomes was 0.92, 0.92, and 0.89, respectively, while the corresponding figures for the SVM model were 0.98, 0.90, and 0.82.
Conclusions In order to predict the prognosis of metformin poisoning, machine learning algorithms might help clinicians in the management and follow-up of metformin poisoning cases.
3) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6925840/

Methods In this study, extensive research efforts were made to identify studies that applied more than one supervised machine learning algorithm to single-disease prediction. Two databases (i.e., Scopus and PubMed) were searched using different types of search items. We selected 48 articles in total for the comparison among variants of supervised machine learning algorithms for disease prediction.
Results We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed comparatively superior accuracy: of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%. This was followed by SVM, which performed best in 41% of the studies in which it was considered.
Conclusion This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This important information of relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.
Overall, it is the quality and characteristics of the dataset that should drive the selection of the algorithm. Decision trees handle categorical data more naturally. Computational resources should also be taken into account, with SVMs generally consuming more of them. Both decision trees and SVMs can be used for classification and regression; whether we are dealing with a classification or a regression problem depends on whether the target variable is categorical or continuous. In a decision tree regression, all observations that fall into the same leaf receive the same predicted value. The SVM counterpart is Support Vector Regression (SVR), which is similar to SVM with some modifications. Again, running several different models and comparing an error metric such as RMSE is common practice for determining the best model.
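For completeness, a sketch of support vector regression on a continuous target; bmi and the predictor set here are illustrative choices, not part of the assignment:

# Support vector regression on a continuous target, evaluated by RMSE
svr_fit <- svm(bmi ~ age + avg_glucose_level + hypertension + heart_disease,
               data = data_train, type = "eps-regression", kernel = "radial")
svr_pred <- predict(svr_fit, data_test)
sqrt(mean((svr_pred - data_test$bmi)^2))  # RMSE on the test set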