Assignment 2

Introduction

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.

From Kaggle

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Citation: Davide Chicco, Giuseppe Jurman. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020). https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5

Data review

The dataset selected [https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data] is a Kaggle dataset of heart failure clinical records. It will allow us to build a model that predicts death from heart failure (DEATH_EVENT) from the other variables.

library(here)

data <- read.csv(here('homework2','data','heart_failure_clinical_records_dataset.csv'))

Let’s evaluate the Data

The dataset consists of \(299\) records (observations) and \(13\) variables: 12 features plus the target, DEATH_EVENT.

dim(data)
## [1] 299  13

Types of Attributes

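All of the attributes are numeric, stored as either integer or numeric (double). The binary variables, including the target DEATH_EVENT, are coded as 0/1 integers.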

# list types for each attribute
sapply(data, class)
##                      age                  anaemia creatinine_phosphokinase 
##                "numeric"                "integer"                "integer" 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                "integer"                "integer"                "integer" 
##                platelets         serum_creatinine             serum_sodium 
##                "numeric"                "numeric"                "integer" 
##                      sex                  smoking                     time 
##                "integer"                "integer"                "integer" 
##              DEATH_EVENT 
##                "integer"

It is also always a good idea to actually eyeball your data.

# take a peek at the first 6 rows of the data (the default for head)
head(data)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

Statistical Summary

The plan is to predict death from heart failure (DEATH_EVENT) based on these variables. The skim function gives us a quick and detailed view of the dataset.

Important Notation about the Data
Sex - Gender of patient: 0 = Female, 1 = Male
Age - Age of patient
Diabetes - 0 = No, 1 = Yes
Anaemia - 0 = No, 1 = Yes
High_blood_pressure - 0 = No, 1 = Yes
Smoking - 0 = No, 1 = Yes
DEATH_EVENT - 0 = No, 1 = Yes
Time - Follow-up period in days
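
The 0/1 coding listed above can be made explicit by recoding the binary columns as labelled factors, which makes tables and plots easier to read. The following is only a minimal sketch: the `toy` data frame is made up to mimic the coding (it is not patient data), and `label_binary` is a hypothetical helper, not something used elsewhere in this assignment.

```r
# Toy rows that only mimic the 0/1 coding described above (not real patient data).
toy <- data.frame(
  sex         = c(1L, 0L, 1L),
  smoking     = c(0L, 0L, 1L),
  DEATH_EVENT = c(1L, 0L, 0L)
)

# Hypothetical helper: turn a 0/1 column into a labelled factor.
label_binary <- function(x, labels = c("No", "Yes")) {
  factor(x, levels = c(0, 1), labels = labels)
}

toy$sex         <- label_binary(toy$sex, labels = c("Female", "Male"))
toy$smoking     <- label_binary(toy$smoking)
toy$DEATH_EVENT <- label_binary(toy$DEATH_EVENT)

str(toy)
```

The models later in this report keep the original 0/1 integers, so this recoding is optional and purely for readability.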

library(skimr)
skim(data)
Data summary
Name data
Number of rows 299
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 60.83 11.89 40.0 51.0 60.0 70.0 95.0 ▆▇▇▂▁
anaemia 0 1 0.43 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▆
creatinine_phosphokinase 0 1 581.84 970.29 23.0 116.5 250.0 582.0 7861.0 ▇▁▁▁▁
diabetes 0 1 0.42 0.49 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▆
ejection_fraction 0 1 38.08 11.83 14.0 30.0 38.0 45.0 80.0 ▃▇▂▂▁
high_blood_pressure 0 1 0.35 0.48 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▅
platelets 0 1 263358.03 97804.24 25100.0 212500.0 262000.0 303500.0 850000.0 ▂▇▂▁▁
serum_creatinine 0 1 1.39 1.03 0.5 0.9 1.1 1.4 9.4 ▇▁▁▁▁
serum_sodium 0 1 136.63 4.41 113.0 134.0 137.0 140.0 148.0 ▁▁▃▇▁
sex 0 1 0.65 0.48 0.0 0.0 1.0 1.0 1.0 ▅▁▁▁▇
smoking 0 1 0.32 0.47 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
time 0 1 130.26 77.61 4.0 73.0 115.0 203.0 285.0 ▆▇▃▆▃
DEATH_EVENT 0 1 0.32 0.47 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▃

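Since DEATH_EVENT is our target variable, let’s examine its proportions:

library(scales)
library(dplyr)  # provides select()

# percentage of records in each DEATH_EVENT class, including any missing values
prop <- round(prop.table(table(select(data, DEATH_EVENT), exclude = NULL))*100, 1)
# format the percentages as a one-column matrix for printing
x <- paste(prop, "%", sep="")
mat <- matrix(x, nrow = 2, ncol = 1)
rownames(mat) <- c("0", "1")
colnames(mat) <- c("Death Pct")
print(mat, quote = FALSE)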
##   Death Pct
## 0 67.9%    
## 1 32.1%
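
These proportions also give a baseline to compare any model against: a classifier that always predicts the majority class (0, survival) is already right about two thirds of the time. A small sketch of that calculation follows. The counts 203 and 96 are an assumption reconstructed from the 299 rows and the percentages printed above, not read from the file.

```r
# Counts reconstructed from the output above: 299 rows, 32.1% deaths -> 96 deaths.
death_event <- rep(c(0L, 1L), times = c(203L, 96L))

# Share of each class and the accuracy of always predicting the larger one.
class_share       <- prop.table(table(death_event))
majority_class    <- names(which.max(class_share))
baseline_accuracy <- max(class_share)

round(100 * class_share, 1)
baseline_accuracy
```

Any model whose accuracy does not clearly exceed this baseline has not learned much beyond the class balance.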

Next, split the data into a training set (80%) and a test set (20%).

library(caret)

set.seed(7)

# row indices for the 80% of the original dataset used for training
validation_index <- createDataPartition(data$DEATH_EVENT, p=0.80, list=FALSE)
# hold out the remaining 20% of the data for testing
data_test <- data[-validation_index,]
# use the selected 80% of the data to train the models
data_train <- data[validation_index,]
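
With an imbalanced target it can help to check that the training and test sets keep roughly the same share of deaths. As far as I understand caret's documentation, createDataPartition samples within each class when the outcome is a factor, but groups a numeric outcome by percentiles, so passing the integer DEATH_EVENT may not balance the classes exactly. The sketch below shows the idea of a class-stratified split in base R. The `outcome` vector is made up to match the class balance reported earlier, and `stratified_index` is a hypothetical helper, not the function used in this report.

```r
set.seed(7)

# Made-up outcome vector with the class balance reported above (203 vs 96).
outcome <- factor(rep(c(0, 1), times = c(203, 96)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.80) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(outcome, p = 0.80)

# Class shares in the training and test portions should be nearly identical.
prop.table(table(outcome[train_idx]))
prop.table(table(outcome[-train_idx]))
```

Switching the split used in this report to a stratified one would change the partition, and therefore the results shown in the following sections.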

Decision Tree

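library(rpart)
library(rpart.plot)

# classification tree using only age, sex and the two comorbidity flags
model1 <- rpart(DEATH_EVENT ~ age + sex + diabetes + high_blood_pressure,
                method = "class",
                data = data_train)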

rpart.plot(model1)

Switch variables to generate 2 decision trees and compare the results. In this case, the second tree uses the remaining variables in the data set.

# classification tree using every variable that model1 left out
model2 <- rpart(DEATH_EVENT ~ . - age - sex - diabetes - high_blood_pressure,
                method = "class",
                data = data_train)

rpart.plot(model2)

In the first model, the biggest predictor appears to be age, followed by high_blood_pressure and sex. In the second model, the biggest predictor is a little counter-intuitive: time, the follow-up period in days. I say it’s counter-intuitive since it appears to indicate that the shorter the follow-up period, the higher the chance of a bad outcome.

I’ll go out on a limb and say that the reason for this is that patients in bad health will follow up with the doctor closely and, in that sense, time might be a red herring.

In model2, the next most important predictors are serum_creatinine and platelets (the platelet count).
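
Reading predictor rankings off the plot can be cross-checked against the scores that rpart stores in the fitted object. The sketch below shows one way to do that. It uses the kyphosis data set that ships with rpart purely as a stand-in, so the variables and numbers are not those of the heart failure models; the same two lookups could be applied to model1 and model2.

```r
library(rpart)

# Stand-in data: the kyphosis data set that ships with rpart (not the heart failure data).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Importance scores stored on the fitted tree, largest first, as percentages.
importance <- sort(fit$variable.importance, decreasing = TRUE)
round(100 * importance / sum(importance), 1)

# Variables that actually appear in a split (terminal nodes are labelled "<leaf>").
split_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
split_vars
```

A variable can have a non-zero importance without appearing in any split, because rpart also credits surrogate splits, so the two views do not always agree.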

Random Forest

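Create a random forest for regression and analyze the results. For the random forest, I will use the predictors identified by the decision trees. Because DEATH_EVENT is converted to a factor in the call below, randomForest fits a classification forest rather than a regression forest.

library(randomForest)

# classification forest on the predictors suggested by the two trees
rf_model <- randomForest(as.factor(DEATH_EVENT) ~ age + high_blood_pressure + sex + serum_creatinine + platelets,
                         data = data_train)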
rf_pred <- predict(rf_model, data_test)
c_matrix <- confusionMatrix(rf_pred, as.factor(data_test$DEATH_EVENT))
c_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26 13
##          1 12  8
##                                           
##                Accuracy : 0.5763          
##                  95% CI : (0.4407, 0.7039)
##     No Information Rate : 0.6441          
##     P-Value [Acc > NIR] : 0.8885          
##                                           
##                   Kappa : 0.0659          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6842          
##             Specificity : 0.3810          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.4000          
##              Prevalence : 0.6441          
##          Detection Rate : 0.4407          
##    Detection Prevalence : 0.6610          
##       Balanced Accuracy : 0.5326          
##                                           
##        'Positive' Class : 0               
## 

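The resulting model gives us an accuracy of about \(58\%\), which doesn’t sound so good: it is below the no-information rate of \(64\%\) (the accuracy of always predicting survival), and the kappa of \(0.07\) indicates almost no agreement beyond chance. Note that caret treats 0 (survival) as the 'positive' class here, so the specificity of \(0.38\) means that only 8 of the 21 deaths in the test set were predicted correctly.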
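
The statistics in the confusion matrix output above can be reproduced by hand from the four cell counts, which makes it clearer what each one measures. This is a small worked sketch in base R using the counts copied from that output; the variable names are my own.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(26, 12, 13, 8), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)

accuracy    <- sum(diag(cm)) / n              # correct predictions / all predictions
nir         <- max(colSums(cm)) / n           # no-information rate: share of the largest class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # positive class is "0" (survived)
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of deaths predicted as deaths

# Cohen's kappa: agreement beyond what the row and column totals would give by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa), 4)
```

The rounded values match the Accuracy, No Information Rate, Sensitivity, Specificity and Kappa lines reported by confusionMatrix.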

source(here("functions","draw_confusion_matrix.R"), local = knitr::knit_global())
draw_confusion_matrix(c_matrix)

A Better Random Forest Model

When we use all the variables, the accuracy increases to 83%, indicating that we missed some predictors with our decision trees.

# classification forest on all 12 predictors
rf_model2 <- randomForest(as.factor(DEATH_EVENT) ~ .,
                          data = data_train)
rf_pred2 <- predict(rf_model2, data_test)
c_matrix2 <- confusionMatrix(rf_pred2, as.factor(data_test$DEATH_EVENT))
draw_confusion_matrix(c_matrix2)
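
To see which predictors a forest relies on, the randomForest package provides importance(), and the fitted object carries an out-of-bag error estimate that does not need the test set. The sketch below illustrates both. It uses the built-in iris data as a stand-in so that it runs on its own, which means the variables and numbers are not those of rf_model2; the same calls could be applied to the forest fitted above (with importance = TRUE added to the fit if the accuracy-based measure is wanted).

```r
library(randomForest)

set.seed(7)

# Stand-in data: the built-in iris data set (not the heart failure data).
rf_demo <- randomForest(Species ~ ., data = iris, importance = TRUE)

# One row per predictor, ordered by the Gini-based importance measure.
imp <- importance(rf_demo)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

# Out-of-bag error rate after all trees have been grown.
oob_error <- rf_demo$err.rate[rf_demo$ntree, "OOB"]
oob_error
```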

Conclusion

Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

I feel the decision trees worked as intended. They did fail to identify ejection_fraction as a significant predictor of death from heart failure, but they did point me in the direction of serum_creatinine as an indicator/predictor.

I believe the most important learning from this particular project is that one Machine Learning algorithm alone is probably not going to be able to provide all of the answers, and that it should be used in conjunction with other ML algorithms to corroborate or validate the results.