Assignment 3

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework.

Introduction

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.

From Kaggle

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Citation Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020). https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5

Data review

The dataset selected [https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data] is a Kaggle dataset of heart failure clinical records and will allow us to build a model to predict heart failure based on certain variables.

# load the heart failure dataset (path resolved relative to the project root via the here package)
data <- read.csv(here('homework2', 'data', 'heart_failure_clinical_records_dataset.csv'))
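The call above assumes the packages used throughout this report were loaded in an earlier setup chunk; a minimal sketch of that chunk, with the package list inferred from the functions called later, would be:

# packages assumed to be loaded in a setup chunk (list inferred, not shown in the original)
library(here)      # here()
library(dplyr)     # select(), mutate()
library(skimr)     # skim()
library(caret)     # createDataPartition(), confusionMatrix()
library(e1071)     # svm()
library(kernlab)   # ksvm()
# draw_confusion_matrix() is assumed to be a custom helper defined elsewhere in the project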

Let’s evaluate the Data

The dataset consists of \(299\) records (observations) with \(13\) variables (features).

dim(data)
## [1] 299  13

Types of Attributes

All of the attributes are numeric: either integer or double.

# list types for each attribute
sapply(data, class)
##                      age                  anaemia creatinine_phosphokinase 
##                "numeric"                "integer"                "integer" 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                "integer"                "integer"                "integer" 
##                platelets         serum_creatinine             serum_sodium 
##                "numeric"                "numeric"                "integer" 
##                      sex                  smoking                     time 
##                "integer"                "integer"                "integer" 
##              DEATH_EVENT 
##                "integer"

It is also always a good idea to actually eyeball your data.

# take a peek at the first 5 rows of the data
head(data)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

Statistical Summary

The plan is to predict mortality by heart failure (DEATH_EVENT) based on these variables. The skim function from the skimr package gives us a quick but detailed view of the dataset.

Important Notation about the Data
Sex - Gender of patient: Male = 1, Female = 0
Age - Age of patient in years
Diabetes - 0 = No, 1 = Yes
Anaemia - 0 = No, 1 = Yes
High_blood_pressure - 0 = No, 1 = Yes
Smoking - 0 = No, 1 = Yes
DEATH_EVENT - 0 = No, 1 = Yes
Time - Follow-up period in days
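For exploration or plotting, the 0/1 indicator columns could be recoded into labelled factors. A minimal sketch (illustrative only, not applied to the modelling data used below):

# illustrative recoding of two indicator columns into labelled factors (for display only)
data_labelled <- data %>%
  mutate(sex         = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
         DEATH_EVENT = factor(DEATH_EVENT, levels = c(0, 1), labels = c("No", "Yes")))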

skim(data)
Data summary
Name data
Number of rows 299
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 60.83 11.89 40.0 51.0 60.0 70.0 95.0 ▆▇▇▂▁
anaemia 0 1 0.43 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▆
creatinine_phosphokinase 0 1 581.84 970.29 23.0 116.5 250.0 582.0 7861.0 ▇▁▁▁▁
diabetes 0 1 0.42 0.49 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▆
ejection_fraction 0 1 38.08 11.83 14.0 30.0 38.0 45.0 80.0 ▃▇▂▂▁
high_blood_pressure 0 1 0.35 0.48 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▅
platelets 0 1 263358.03 97804.24 25100.0 212500.0 262000.0 303500.0 850000.0 ▂▇▂▁▁
serum_creatinine 0 1 1.39 1.03 0.5 0.9 1.1 1.4 9.4 ▇▁▁▁▁
serum_sodium 0 1 136.63 4.41 113.0 134.0 137.0 140.0 148.0 ▁▁▃▇▁
sex 0 1 0.65 0.48 0.0 0.0 1.0 1.0 1.0 ▅▁▁▁▇
smoking 0 1 0.32 0.47 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
time 0 1 130.26 77.61 4.0 73.0 115.0 203.0 285.0 ▆▇▃▆▃
DEATH_EVENT 0 1 0.32 0.47 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▃

Since DEATH_EVENT is our target variable, let’s examine its proportions:

prop <- round(prop.table(table(select(data, DEATH_EVENT), exclude = NULL))*100, 1)
x <- paste(prop, "%", sep="")
mat <- matrix(x, nrow = 2, ncol = 1)
rownames(mat) <- c("0", "1")
colnames(mat) <- c("Death Pct")
print(mat, quote = FALSE)
##   Death Pct
## 0 67.9%    
## 1 32.1%
set.seed(7)

# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data$DEATH_EVENT, p=0.80, list=FALSE)
# select 20% of the data for validation
data_test <- data[-validation_index,]
# use the remaining 80% of the data to train the models
data_train <- data[validation_index,]
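As a quick sanity check (not part of the original output), the split should roughly preserve the outcome proportions in both subsets:

# compare the DEATH_EVENT proportions in the training and test sets
round(prop.table(table(data_train$DEATH_EVENT)), 2)
round(prop.table(table(data_test$DEATH_EVENT)), 2)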

SVM Models

I tried two different approaches:

e1071::svm function:

svm_model <- svm(DEATH_EVENT ~ .,
                 data = data_train,
                 type = 'C-classification',
                 kernel = "linear")

print(svm_model)
## 
## Call:
## svm(formula = DEATH_EVENT ~ ., data = data_train, type = "C-classification", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  100
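The printed cost of 1 is simply the e1071 default. A hypothetical extension (not run here) would grid-search the cost parameter with e1071’s tune wrapper:

# hypothetical: cross-validated grid search over the cost parameter
svm_tune <- tune(svm, DEATH_EVENT ~ ., data = data_train,
                 type = "C-classification", kernel = "linear",
                 ranges = list(cost = 2^(-2:4)))
summary(svm_tune)
best_svm <- svm_tune$best.model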

kernlab::ksvm function

model.ksvm <- ksvm(DEATH_EVENT ~ .,
                   data = data_train,
                   type = "C-svc")
print(model.ksvm)
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0688453906564724 
## 
## Number of Support Vectors : 142 
## 
## Objective Function Value : -91.1652 
## Training error : 0.1125
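The sigma value printed above is kernlab’s automatic estimate of the RBF kernel width (via sigest). If one wanted to fix the kernel width and cost explicitly instead, a sketch would be:

# hypothetical: set the RBF kernel width and cost instead of relying on the automatic estimate
model.ksvm.fixed <- ksvm(DEATH_EVENT ~ ., data = data_train,
                         type = "C-svc", kernel = "rbfdot",
                         kpar = list(sigma = 0.05), C = 1)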

Predictions

e1071

test_pred <- predict(svm_model, newdata = data_test)

c_matrix <- confusionMatrix(table(test_pred, data_test$DEATH_EVENT))

The resulting model gives us an accuracy of \(83\%\), considerably better than what we got with the decision tree in Homework 2.

draw_confusion_matrix(c_matrix)
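The accuracy quoted above can be read directly from the caret confusionMatrix object, for example:

# extract the accuracy and its 95% confidence interval from the confusionMatrix object
round(c_matrix$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")], 3)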

ksvm

test_pred_ksvm <- predict(model.ksvm, newdata = data_test)

c_matrix2 <- confusionMatrix(table(test_pred_ksvm, data_test$DEATH_EVENT))

draw_confusion_matrix(c_matrix2)

With kernlab’s ksvm, the accuracy drops to 78%, which is still quite good.
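For a quick side-by-side view of the two SVMs (an illustrative snippet, not part of the original output):

# compare the test-set accuracy of the two SVM fits
data.frame(model    = c("e1071 (linear kernel)", "kernlab ksvm (RBF kernel)"),
           accuracy = c(round(c_matrix$overall["Accuracy"], 3),
                        round(c_matrix2$overall["Accuracy"], 3)))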

Conclusion

Based on articles

https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.

  1. https://www.researchgate.net/publication/223672291_An_extended_support_vector_machine_forecasting_framework_for_customer_churn_in_e-commerce This article describes a machine learning approach to predicting customer churn in e-commerce. Customer churn predictions are very important in e-commerce: to remain competitive, B2C enterprises should make full use of machine learning in customer relationship management to predict the potential loss of customers and devise new marketing strategies and customer retention measures according to the prediction results. This helps e-commerce enterprises establish efficient and accurate churn prediction.

  2. https://eprint.iacr.org/2016/736.pdf Efficient and Private Scoring of Decision Trees, Support Vector Machines and Logistic Regression Models based on Pre-Computation. In this study, the authors propose a novel protocol for privacy-preserving classification with decision trees and improve the performance of previously proposed protocols for general hyperplane-based classifiers, including the two specific cases of support vector machines and logistic regression. Rather than comparing the algorithms against each other, they propose methods to improve them while keeping security in mind.

  3. https://www.linkedin.com/pulse/machine-learning-predicting-supply-chain-risks-part-3-tuan-nguyen-/ In this LinkedIn article, the author compares different machine learning algorithms for predicting supply-chain risks. One of his conclusions is that SVM is known to be a highly performant learner; in his case study it achieves a prediction accuracy of over 82%. However, one of the disadvantages of SVM is its computational time, which exceeds 3 minutes for the 8-feature set.

Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios?

SVM appears to be the consensus winner in terms of more accurate results, although most articles do cite its computational requirements as a disadvantage.

In this particular assignment, SVM’s performance was on par with the best model obtained in Homework 2: a Random Forest using all features to predict the outcome.

Do you agree with the recommendations? Why?

I believe the most important lesson from this project is that a single machine learning algorithm alone is probably not going to provide all of the answers, and that it should be used in conjunction with other ML algorithms to correlate or validate the results. Some ML algorithms are better suited to particular problems; for example, decision trees appear better for categorical data and handle collinearity better than SVMs. In all, I’d try more than one ML algorithm for a prediction task and settle on the combination of performance and accuracy needed to get the job done.