Assignment 3

Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework.

Homework #3

Read the following articles:

https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least 3 articles) that compares the use of decision trees vs. SVMs in your current area of expertise. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework. Answer questions such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

## Data

### Data From Kaggle (HW2)

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

Author of the Data: fedesoriano

Data Review:

The selected dataset [https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset] is a Kaggle dataset of clinical stroke-prediction data and will allow us to build a model that predicts stroke occurrence from the available variables.

data <- readr::read_csv("https://raw.githubusercontent.com/jtul333/Data622/main/healthcare-dataset-stroke-data.csv")
## Rows: 5110 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, ever_married, work_type, Residence_type, smoking_status
## dbl (7): id, age, hypertension, heart_disease, avg_glucose_level, bmi, stroke
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
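The parsing message itself suggests two documented ways to quiet it; a minimal sketch of either option:

# Option 1: silence the column-type message.
data <- readr::read_csv(
  "https://raw.githubusercontent.com/jtul333/Data622/main/healthcare-dataset-stroke-data.csv",
  show_col_types = FALSE)

# Option 2: retrieve the guessed specification so it can be passed back
# explicitly via col_types =, which catches future parsing changes early.
readr::spec(data)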

The data contains the following variables: gender, ever_married, work_type, Residence_type, bmi, smoking_status, id, age, hypertension, heart_disease, avg_glucose_level, and stroke.

head(data)
## # A tibble: 6 × 12
##      id gender   age hypertension heart_disease ever_married work_type    
##   <dbl> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
## 1  9046 Male      67            0             1 Yes          Private      
## 2 51676 Female    61            0             0 Yes          Self-employed
## 3 31112 Male      80            0             1 Yes          Private      
## 4 60182 Female    49            0             0 Yes          Private      
## 5  1665 Female    79            1             0 Yes          Self-employed
## 6 56669 Male      81            0             0 Yes          Private      
## # ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <dbl>,
## #   smoking_status <chr>, stroke <dbl>

Types of Variables

sapply(data,class)
##                id            gender               age      hypertension 
##         "numeric"       "character"         "numeric"         "numeric" 
##     heart_disease      ever_married         work_type    Residence_type 
##         "numeric"       "character"       "character"       "character" 
## avg_glucose_level               bmi    smoking_status            stroke 
##         "numeric"         "numeric"       "character"         "numeric"

Statistical Summary

The plan is to predict stroke based on these variables. The skim function gives us a quick and detailed view of the dataset.

Important notation about the data:

gender - Male, Female, Other
age - age of the patient
hypertension - 0 = No, 1 = Yes
heart_disease - 0 = No, 1 = Yes
ever_married - Yes, No
work_type - Private, Self-employed, Govt_job, children, Never_worked
Residence_type - Urban, Rural
avg_glucose_level - double
bmi - double
smoking_status - formerly smoked, never smoked, smokes, Unknown
stroke - 0 = No, 1 = Yes

skim(data)
Data summary
Name data
Number of rows 5110
Number of columns 12
_______________________
Column type frequency:
character 5
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 3 0
ever_married 0 1 2 3 0 2 0
work_type 0 1 7 13 0 5 0
Residence_type 0 1 5 5 0 2 0
smoking_status 0 1 6 15 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 36517.83 21161.72 67.00 17741.25 36932.00 54682.00 72940.00 ▇▇▇▇▇
age 0 1 43.23 22.61 0.08 25.00 45.00 61.00 82.00 ▅▆▇▇▆
hypertension 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
heart_disease 0 1 0.05 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
avg_glucose_level 0 1 106.15 45.28 55.12 77.24 91.88 114.09 271.74 ▇▃▁▁▁
bmi 0 1 29.17 7.82 10.30 23.80 28.40 34.00 97.60 ▇▇▁▁▁
stroke 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁

Check for Missing Data

plot_missing(data)

There is no missing data.
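As a quick numeric cross-check of the plot, we can count the NA values per column:

# Count missing values in each column; all zeros confirms the plot.
colSums(is.na(data))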

Data Preparation

We create a separate dataset in order to preserve the original data and make all the necessary transformations there. We'll convert the variables that were of type character into factors.

data_prepared <- data

#Value substitutions cleanup
# Only gender actually needs recoding; the other character columns already
# hold the desired labels, so the original self-assignments
# (e.g., "smokes" <- "smokes") were no-ops and have been dropped.
data_prepared$gender[data_prepared$gender == "Female"] <- "F"
data_prepared$gender[data_prepared$gender == "Male"] <- "M"


#Data type change for columns of interest
data_prepared$work_type <- as.factor(data_prepared$work_type)
data_prepared$gender <- as.factor(data_prepared$gender)
data_prepared$Residence_type <- as.factor(data_prepared$Residence_type)
data_prepared$ever_married <- as.factor(data_prepared$ever_married)
data_prepared$smoking_status <- as.factor(data_prepared$smoking_status)
data_prepared$stroke <- as.integer(data_prepared$stroke)
data_prepared$heart_disease <- as.integer(data_prepared$heart_disease)
# Note: as.integer() truncates the decimal parts of bmi and
# avg_glucose_level, so some precision is deliberately discarded here.
data_prepared$bmi <- as.integer(data_prepared$bmi)
data_prepared$avg_glucose_level <- as.integer(data_prepared$avg_glucose_level)
data_prepared$hypertension <- as.integer(data_prepared$hypertension)
data_prepared$age <- as.integer(data_prepared$age)
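As an aside, the character-to-factor conversions can be written more compactly; a sketch, assuming the dplyr package is attached:

library(dplyr)
# Convert every remaining character column to a factor in one step.
data_prepared <- data_prepared %>%
  mutate(across(where(is.character), as.factor))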

Structure of the Variables

str(data_prepared)
## spc_tbl_ [5,110 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id               : num [1:5110] 9046 51676 31112 60182 1665 ...
##  $ gender           : Factor w/ 3 levels "F","M","Other": 2 1 2 1 1 2 2 1 1 1 ...
##  $ age              : int [1:5110] 67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int [1:5110] 0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int [1:5110] 1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ work_type        : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
##  $ Residence_type   : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
##  $ avg_glucose_level: int [1:5110] 228 202 105 171 174 186 70 94 76 58 ...
##  $ bmi              : int [1:5110] 36 36 32 34 24 29 27 22 36 24 ...
##  $ smoking_status   : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
##  $ stroke           : int [1:5110] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   hypertension = col_double(),
##   ..   heart_disease = col_double(),
##   ..   ever_married = col_character(),
##   ..   work_type = col_character(),
##   ..   Residence_type = col_character(),
##   ..   avg_glucose_level = col_double(),
##   ..   bmi = col_double(),
##   ..   smoking_status = col_character(),
##   ..   stroke = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data_prepared)
##        id          gender          age         hypertension    
##  Min.   :   67   F    :2994   Min.   : 0.00   Min.   :0.00000  
##  1st Qu.:17741   M    :2115   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Other:   1   Median :45.00   Median :0.00000  
##  Mean   :36518                Mean   :43.22   Mean   :0.09746  
##  3rd Qu.:54682                3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married         work_type    Residence_type
##  Min.   :0.00000   No :1757     children     : 687   Rural:2514    
##  1st Qu.:0.00000   Yes:3353     Govt_job     : 657   Urban:2596    
##  Median :0.00000                Never_worked :  22                 
##  Mean   :0.05401                Private      :2925                 
##  3rd Qu.:0.00000                Self-employed: 819                 
##  Max.   :1.00000                                                   
##  avg_glucose_level      bmi                smoking_status     stroke       
##  Min.   : 55.0     Min.   :10.00   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.: 77.0     1st Qu.:23.00   never smoked   :1892   1st Qu.:0.00000  
##  Median : 91.0     Median :28.00   smokes         : 789   Median :0.00000  
##  Mean   :105.7     Mean   :28.75   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:114.0     3rd Qu.:34.00                          3rd Qu.:0.00000  
##  Max.   :271.0     Max.   :97.00                          Max.   :1.00000
skim(data_prepared)
Data summary
Name data_prepared
Number of rows 5110
Number of columns 12
_______________________
Column type frequency:
factor 5
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 3 F: 2994, M: 2115, Oth: 1
ever_married 0 1 FALSE 2 Yes: 3353, No: 1757
work_type 0 1 FALSE 5 Pri: 2925, Sel: 819, chi: 687, Gov: 657
Residence_type 0 1 FALSE 2 Urb: 2596, Rur: 2514
smoking_status 0 1 FALSE 4 nev: 1892, Unk: 1544, for: 885, smo: 789

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 36517.83 21161.72 67 17741.25 36932 54682 72940 ▇▇▇▇▇
age 0 1 43.22 22.63 0 25.00 45 61 82 ▅▆▇▇▆
hypertension 0 1 0.10 0.30 0 0.00 0 0 1 ▇▁▁▁▁
heart_disease 0 1 0.05 0.23 0 0.00 0 0 1 ▇▁▁▁▁
avg_glucose_level 0 1 105.66 45.28 55 77.00 91 114 271 ▇▃▁▁▁
bmi 0 1 28.75 7.83 10 23.00 28 34 97 ▇▇▁▁▁
stroke 0 1 0.05 0.22 0 0.00 0 0 1 ▇▁▁▁▁

Since stroke is our target variable, let's examine its proportions:

hist(data_prepared$stroke)
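A frequency table makes the imbalance explicit as well; per the summary above, only about 5% of patients had a stroke:

# Raw counts and proportions of the target classes.
table(data_prepared$stroke)
prop.table(table(data_prepared$stroke))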

# stroke has no missing values (see the skim output above), so this median
# imputation is purely defensive.
data_prepared$stroke[is.na(data_prepared$stroke)] <- median(data_prepared$stroke, na.rm=TRUE)

# stroke is already an integer 0/1 variable; truncation is likewise a
# defensive no-op that guards against stray decimals.
data_prepared$stroke <- trunc(data_prepared$stroke)
summary(data_prepared)
##        id          gender          age         hypertension    
##  Min.   :   67   F    :2994   Min.   : 0.00   Min.   :0.00000  
##  1st Qu.:17741   M    :2115   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Other:   1   Median :45.00   Median :0.00000  
##  Mean   :36518                Mean   :43.22   Mean   :0.09746  
##  3rd Qu.:54682                3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married         work_type    Residence_type
##  Min.   :0.00000   No :1757     children     : 687   Rural:2514    
##  1st Qu.:0.00000   Yes:3353     Govt_job     : 657   Urban:2596    
##  Median :0.00000                Never_worked :  22                 
##  Mean   :0.05401                Private      :2925                 
##  3rd Qu.:0.00000                Self-employed: 819                 
##  Max.   :1.00000                                                   
##  avg_glucose_level      bmi                smoking_status     stroke       
##  Min.   : 55.0     Min.   :10.00   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.: 77.0     1st Qu.:23.00   never smoked   :1892   1st Qu.:0.00000  
##  Median : 91.0     Median :28.00   smokes         : 789   Median :0.00000  
##  Mean   :105.7     Mean   :28.75   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:114.0     3rd Qu.:34.00                          3rd Qu.:0.00000  
##  Max.   :271.0     Max.   :97.00                          Max.   :1.00000

Build Models

We want to predict the "stroke" variable and find out whether the variables "hypertension", "heart_disease", "gender", "avg_glucose_level", and "bmi" have any relationship with our dependent variable.

We first take a look at the "gender" variable to see how it is distributed across our "heart_disease" variable. At first glance, the numbers of females and males in the data appear reasonably balanced.

xtabs(~gender + `heart_disease`, data = data_prepared)
##        heart_disease
## gender     0    1
##   F     2881  113
##   M     1952  163
##   Other    1    0

Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.

# Partition data 
set.seed(1)
# create an index for 80% of the rows in the original dataset to use for training
validation_index <- createDataPartition(data_prepared$stroke, p=0.8, list=FALSE)
# the remaining 20% of the data is held out for testing the models
data_train <- data_prepared[validation_index,]

data_test <- data_prepared[-validation_index,]
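A quick check that the split preserved the class proportions; both values should sit near the overall 5% stroke rate:

# Compare the stroke rate in the training and test partitions.
prop.table(table(data_train$stroke))
prop.table(table(data_test$stroke))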

Random Forest

Create a random forest classifier (stroke is cast to a factor in the formula) and analyze the results.

rf_model <- randomForest(as.factor(stroke) ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi,
                                    data = data_train)
rf_pred <- predict(rf_model, data_test)
c_matrix <- confusionMatrix(rf_pred, as.factor(data_test$stroke))
c_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 975  45
##          1   1   1
##                                           
##                Accuracy : 0.955           
##                  95% CI : (0.9404, 0.9669)
##     No Information Rate : 0.955           
##     P-Value [Acc > NIR] : 0.5391          
##                                           
##                   Kappa : 0.0381          
##                                           
##  Mcnemar's Test P-Value : 2.298e-10       
##                                           
##             Sensitivity : 0.99898         
##             Specificity : 0.02174         
##          Pos Pred Value : 0.95588         
##          Neg Pred Value : 0.50000         
##              Prevalence : 0.95499         
##          Detection Rate : 0.95401         
##    Detection Prevalence : 0.99804         
##       Balanced Accuracy : 0.51036         
##                                           
##        'Positive' Class : 0               
## 
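Note that the accuracy (95.5%) exactly matches the no-information rate and specificity is only 0.02: the forest is essentially predicting the majority class. One common remedy is to let each tree see balanced classes; a minimal sketch using randomForest's strata/sampsize arguments, under the assumption that down-sampling the majority class is acceptable:

# Down-sample the majority class so each tree trains on balanced data.
n_minority <- sum(data_train$stroke == 1)
rf_balanced <- randomForest(
  as.factor(stroke) ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi,
  data = data_train,
  strata = as.factor(data_train$stroke),
  sampsize = c(n_minority, n_minority))
confusionMatrix(predict(rf_balanced, data_test), as.factor(data_test$stroke))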

SVM

The following code constructs the SVM model. It predicts stroke from smoking_status, avg_glucose_level, heart_disease, age, gender, and bmi. Note that because stroke was left numeric, svm() will default to regression rather than classification, as the summary below confirms.

svmfit<-svm(stroke ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi, data=data_train, kernel="linear", cost=10,scale=TRUE)

summary(svmfit)
## 
## Call:
## svm(formula = stroke ~ smoking_status + avg_glucose_level + heart_disease + 
##     age + gender + bmi, data = data_train, kernel = "linear", cost = 10, 
##     scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  708
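The summary reports SVM-Type: eps-regression, confirming that svm() treated the numeric stroke column as a regression target. A sketch of the classification variant, should class labels be wanted directly:

# Passing the target as a factor makes svm() default to C-classification.
svmfit_class <- svm(as.factor(stroke) ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi,
                    data = data_train, kernel = "linear", cost = 10, scale = TRUE)
summary(svmfit_class)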

Tuning the Model

We used a cost parameter of 10. Let's tune the model with 10-fold cross-validation over a range of cost values.

set.seed(1)
tune.out<-tune(svm,stroke ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi, data=data_train, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1, 1,5,10,100)))


summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.0479754 
## 
## - Detailed performance results:
##    cost      error  dispersion
## 1 1e-03 0.04797665 0.009181715
## 2 1e-02 0.04797673 0.009180454
## 3 1e-01 0.04797540 0.009179188
## 4 1e+00 0.04797638 0.009180673
## 5 5e+00 0.04797708 0.009181209
## 6 1e+01 0.04797683 0.009180111
## 7 1e+02 0.04797805 0.009181584
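The winning parameter can also be pulled out programmatically before refitting:

# Best cost value found by the 10-fold CV grid search.
tune.out$best.parameters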

Best Model

bestmod<-tune.out$best.model
summary(bestmod)
## 
## Call:
## best.tune(METHOD = svm, train.x = stroke ~ smoking_status + avg_glucose_level + 
##     heart_disease + age + gender + bmi, data = data_train, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  410

Prediction on the Test Set

Using the best model, we make predictions on the test set:

ypred<-predict(bestmod,data_test)

# Because bestmod is an eps-regression fit, ypred holds continuous values:
# this table has one row per distinct prediction rather than being a 2x2
# confusion matrix, so the diagonal sum below is not a true accuracy.
pred_tablesvm<-table(predict=ypred, truth=data_test$stroke)

sum(diag(pred_tablesvm/nrow(data_test)*100))
## [1] 0.09784736

Prediction Results

The diagonal sum above (≈0.098) is not a meaningful accuracy figure: the tuned SVM is a regression model whose continuous predictions almost never exactly match the 0/1 truth labels. To report a class-label accuracy comparable to the random forest, the predictions (or the target itself) must first be converted to classes, as sketched below.

Once that is done, both the random forest (95.5% accuracy) and the SVM can be expected to score in the mid-90s on this dataset. That figure should be read with care, though: only about 5% of patients had a stroke, so the no-information rate is itself roughly 95%, and the random forest's confusion matrix shows it identifies almost none of the actual stroke cases (specificity 0.02).
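A minimal sketch of recovering class labels from the regression fit, assuming a 0.5 cutoff (refitting with stroke as a factor, as shown in the SVM section above, is the cleaner alternative):

# Threshold the continuous predictions at 0.5 to obtain 0/1 labels,
# then tabulate against the truth; fixing the factor levels guarantees
# a 2x2 table even if one class is never predicted.
ypred_class <- factor(as.integer(ypred >= 0.5), levels = c(0, 1))
pred_table <- table(predict = ypred_class, truth = data_test$stroke)
pred_table
sum(diag(pred_table)) / nrow(data_test) * 100   # accuracy in percent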

HW 3 Articles

1)https://towardsdatascience.com/a-complete-view-of-decision-trees-and-svm-in-machine-learning-f9f3d19a337b

In machine learning, tree-based techniques and Support Vector Machines (SVMs) are popular tools for building prediction models. Given their underlying theory, decision trees and SVMs are most intuitively understood as classifiers that separate groups (labels), but both can also be powerful tools for regression problems.

2)https://bmcpharmacoltoxicol.biomedcentral.com/articles/10.1186/s40360-022-00588-0

Methods: This study is a retrospective cohort study of National Poison Data System (NPDS) data, the largest repository of poisoning cases in the United States. The SVM and DT algorithms were developed using training and test datasets. Precision-recall curves, ROC curves, and the Area Under the Curve (AUC) were used for model evaluation.

Results: The model showed that acidosis, hypoglycemia, electrolyte abnormality, hypotension, elevated anion gap, elevated creatinine, tachycardia, and renal failure are the most important determinants in predicting the outcome of metformin poisoning. The average negative predictive values for the decision tree and SVM models were 92.30% and 93.30%, respectively. The AUC of the decision tree's ROC curve for major, minor, and moderate outcomes was 0.92, 0.92, and 0.89, respectively; the corresponding figures for the SVM model were 0.98, 0.90, and 0.82.

Conclusions: Machine learning algorithms might help clinicians predict the prognosis of metformin poisoning and manage and follow up poisoning cases.

3)https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6925840/

Methods: In this study, extensive research efforts were made to identify studies that applied more than one supervised machine learning algorithm to the prediction of a single disease. Two databases (Scopus and PubMed) were searched with several types of search terms, and 48 articles were selected in total for the comparison among variant supervised machine learning algorithms for disease prediction.

Results: We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed comparatively superior accuracy: of the 17 studies in which it was applied, RF showed the highest accuracy in 9 (53%). It was followed by SVM, which came out on top in 41% of the studies that considered it.

Conclusion: This study provides a broad overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This information on relative performance can help researchers select an appropriate supervised machine learning algorithm for their studies.

Conclusion

Overall, the quality and characteristics of the dataset should drive the selection of the algorithm. Decision trees handle categorical data more naturally. Computational resources should also be taken into account, with SVMs generally consuming more of them. Both decision trees and SVMs can be used for classification and regression; which kind of problem we face depends on whether the target variable is categorical or continuous. In decision tree regression, all observations in a leaf receive the same predicted value. The SVM counterpart is Support Vector Regression (SVR), which is similar to SVM with some modifications. Again, running several different models and comparing their RMSE is the usual practice for determining the best model, as in the helper sketched below.
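A minimal RMSE helper for that kind of comparison, assuming numeric predictions such as those from the eps-regression fit above:

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(data_test$stroke, predict(bestmod, data_test))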