Cardiovascular diseases (CVDs) are the number 1 cause of death and kill approximately 17 million people globally every year.

CVD is a common cause of heart failure, and this dataset contains 12 characteristics that can be used to predict mortality from heart failure.

By observing and eliminating morbidity risks, such as smoking, unhealthy diet, obesity, physical inactivity, harmful alcohol consumption to a large extent etc. most cardiovascular diseases can be prevented.

People diagnosed with CVD or those at high cardiovascular risk (due to the presence of one or more risk factors in their lives, such as hypertension, diabetes, hyperlipidemia) require early detection and treatment, and the machine learning models may be very helpful in this matter.

1. Introduction

The main objective of this project is to predict patient’s survival from heart failure and identify most important risk factors by applying classification methods on UCI dataset with medical records of heart failure patients.

Classification is a process in machine supervised learning of categorizing a given set of data into classes, in our case it will be patient’s survival (0 - survived, 1 - not survived). Several methods have been used for this purpose like:

Logistic Regression - is a predictive analysis used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Decision Tree - is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Random Forest - learning method for classification which consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
SVM (Support Vector Machine) - supervised learning method which trains on a set of labeled data. SVM studies the labeled training data and then classifies any new input data depending on what it learned in the training phase.
KNN (K – Nearest Neighbor) - is a supervised machine learning algorithm that classifies a new data point into the target class, depending on the features of its adjacent data points.

2. Exploratory data analysis

2.1. Dataset

The dataset used in this analysis contains the medical records of 299 heart failure patients collected at the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad (Punjab, Pakistan), during April–December 2015. The patients consisted of 105 women and 194 men, and their ages range between 40 and 95 years old. All 299 patients had left ventricular systolic dysfunction and had previous heart failures that put them in classes III or IV of the New York Heart Association (NYHA) classification of the stages of heart failure. Dataset can be found on UCI.

Loading all the necessary packages and dataset.

library(dplyr)
library(corrplot)
library(caret)
library(glm2)
library(e1071)
library(randomForest)
library(purrr)
library(tidyr)
library(vtable)
library(ROSE)
library(lmtest)
library(skimr)
library(DataExplorer)
library(forcats)
library(rpart.plot)
library(ranger)
library(mlr)
library(class)
library(gmodels)
library(binaryLogic)
library(rcompanion)
library(olsrr)
library(gridExtra)
library(reshape)
library(Boruta)
data_heart<- read.csv("heart_failure_clinical_records_dataset.csv")
skim(data_heart)
Data summary
Name data_heart
Number of rows 299
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 60.83 11.89 40.0 51.0 60.0 70.0 95.0 ▆▇▇▂▁
anaemia 0 1 0.43 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▆
creatinine_phosphokinase 0 1 581.84 970.29 23.0 116.5 250.0 582.0 7861.0 ▇▁▁▁▁
diabetes 0 1 0.42 0.49 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▆
ejection_fraction 0 1 38.08 11.83 14.0 30.0 38.0 45.0 80.0 ▃▇▂▂▁
high_blood_pressure 0 1 0.35 0.48 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▅
platelets 0 1 263358.03 97804.24 25100.0 212500.0 262000.0 303500.0 850000.0 ▂▇▂▁▁
serum_creatinine 0 1 1.39 1.03 0.5 0.9 1.1 1.4 9.4 ▇▁▁▁▁
serum_sodium 0 1 136.63 4.41 113.0 134.0 137.0 140.0 148.0 ▁▁▃▇▁
sex 0 1 0.65 0.48 0.0 0.0 1.0 1.0 1.0 ▅▁▁▁▇
smoking 0 1 0.32 0.47 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
time 0 1 130.26 77.61 4.0 73.0 115.0 203.0 285.0 ▆▇▃▆▃
DEATH_EVENT 0 1 0.32 0.47 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▃

2.2. Features

Variables in this dataset:

Age: numeric - Age of the patient in years
Anaemia: integer - Decrease of red blood cells or hemoglobin (0:if absent or 1:if present)
creatinine_phosphokinase: integer - Level of the CPK enzyme in the blood (mcg/L)
Diabetes: integer - If the patient has diabetes (0:No or 1:Yes)
ejection_fraction: integer - Percentage of blood leaving the heart at each contraction (percentage)
high_blood_pressure: integer - If the patient has hypertension (0:No or 1:Yes)
platelets: numeric - Platelets in the blood (kiloplatelets/mL)
serum_creatinine: integer - Level of serum creatinine in the blood (mg/dL)
serum_sodium: integer - Level of serum sodium in the blood (mEq/L)
sex: integer - Biological sex of the patient (0:Female or 1:Male)
smoking: integer - If the patient is a smoker (0:No or 1: Yes)
time: integer - Follow-up period in days
death_event: integer - If the patient deceased during the follow-up period. (0:No or 1:Yes )

For the purpose of this research it is important to initially inspect target variable.

Below we can find a barplot of death_event variable showing its values:

counts <- table(data_heart$DEATH_EVENT)
barplot(counts,
   xlab="Death events",
   ylab="Number of instances")

Above graph shows that the levels are imbalanced. There are twice as many patients who survived, therefore we should take this further into account, perhaps by using some kind of resampling at the later stages of modeling.

2.2.1. Numeric variables

Dataset contains 13 numeric variables which density plots are shown below.

sapply(data_heart, is.numeric) %>% 
  which() %>% 
  names()
##  [1] "age"                      "anaemia"                 
##  [3] "creatinine_phosphokinase" "diabetes"                
##  [5] "ejection_fraction"        "high_blood_pressure"     
##  [7] "platelets"                "serum_creatinine"        
##  [9] "serum_sodium"             "sex"                     
## [11] "smoking"                  "time"                    
## [13] "DEATH_EVENT"
data_heart %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Initially we will discard each patient’s follow-up time as it does not have an impact on survival rate. Variable time captures the time of the event - the time at which the patient died or were censored, therefore no end user will be able to provide the value of time, since they do not know at what time in the future the patient will die/get censored.

data_heart<-subset(data_heart,select= -c(time))

Also, we can see that some of the numeric variables are binary with two levels. Thus they will be separated from the group of numeric variables and then transformed to factors.

numeric_vars<-c("age","creatinine_phosphokinase","ejection_fraction","platelets","serum_creatinine","serum_sodium")
numeric_vars
## [1] "age"                      "creatinine_phosphokinase"
## [3] "ejection_fraction"        "platelets"               
## [5] "serum_creatinine"         "serum_sodium"

2.2.2. Categorical variables

Transforming binary variables into factors.

data_heart_numeric<-data_heart
categorical_vars = c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking", "DEATH_EVENT")

data_heart[categorical_vars] <- lapply(data_heart[categorical_vars],factor)
lapply(data_heart[categorical_vars],levels)
## $anaemia
## [1] "0" "1"
## 
## $diabetes
## [1] "0" "1"
## 
## $high_blood_pressure
## [1] "0" "1"
## 
## $sex
## [1] "0" "1"
## 
## $smoking
## [1] "0" "1"
## 
## $DEATH_EVENT
## [1] "0" "1"

2.3. Visual insights

Majority of the patients are in their 50s and 60s. As the age of a patient increases, the probability of death event increases.

boxplot(data_heart$age ~ data_heart$DEATH_EVENT,
        main="Fate by Age",
         ylab="Age",xlab="Death event")

Higher proportion of patients who died had anemia, diabetes and high blood pressure.

3. Cleaning the data

One of the most important steps before applying any machine learning algorithm is to clean the dataset we are working on. It can improve the results but also shorten the computation time. In some cases we won’t be able to apply ML algorithms without data preprocessing.

3.1. Missing values

Data columns with too many missing values won’t be of much value, therefore we have to check if there is any missing value in our dataset - as shown below there is no missing values.

plot_missing(data_heart,)

sum(is.na(data_heart))
## [1] 0

3.2. Outliers

meltData <- melt(data_heart)
## Using anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT as id variables
p <- ggplot(meltData, aes(factor(variable), value)) 
p + geom_boxplot() + facet_wrap(~variable, scale="free")

As we can see variables creatinine_phospkokinase, platelets and serum_creatinine have outliers. We will remove the instances with those values in the next stage.

creatinine_phospkokinase

outlier_cp<- boxplot(data_heart$creatinine_phosphokinase,data = data_heart)$out

platelets

outlier_p<- boxplot(data_heart$platelets,data = data_heart)$out

serum_creatinine

outlier_sc<- boxplot(data_heart$serum_creatinine,data = data_heart)$out

Removing the outliers

data_heart <- data_heart[!((data_heart$creatinine_phosphokinase %in% outlier_cp) |(data_heart$platelets %in% outlier_p) |(data_heart$serum_creatinine %in% outlier_sc)),]

3.3. Scalling the data

There are many numerical variables of varying scale, which can be a problem for algorithms based on Euclidean distance. We will eliminate this problem with the help of preProcess from Caret package.

heart_preProces <-
  preProcess(data_heart, method = c("range"))
data_heart <- predict(heart_preProces,
               data_heart)

4. Data partitioning

Before applying any transformations to chosen variables, dataset should be divided into two subsets: train and test. Training sample is used to train the model, while test sample is used for making the predictions and verify performance of the model.

Using the createDataPartition from caret package dataset will be split into 70% training and 30% test samples.

set.seed(16)
heart_training <-
  createDataPartition(data_heart$DEATH_EVENT,
                      p = 0.7, 
                      list = FALSE)
heart_train <- data_heart[heart_training,]
heart_test <- data_heart[-heart_training,]
head(heart_train)
##         age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 0.6363636       0               0.46740051        0        0.09090909
## 3 0.4545455       0               0.09822185        0        0.09090909
## 4 0.1818182       1               0.06858594        0        0.09090909
## 6 0.9090909       1               0.01439458        0        0.39393939
## 7 0.6363636       1               0.18289585        0        0.01515152
## 9 0.4545455       0               0.10753599        0        0.77272727
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking
## 1                   1 0.5235294        0.8666667    0.4857143   1       0
## 3                   0 0.2205882        0.4666667    0.4571429   1       1
## 4                   0 0.3617647        0.8666667    0.6857143   1       0
## 6                   1 0.3441176        1.0000000    0.5428571   1       1
## 7                   0 0.1176471        0.4000000    0.6857143   1       0
## 9                   0 0.5187001        0.6000000    0.7142857   0       0
##   DEATH_EVENT
## 1           1
## 3           1
## 4           1
## 6           1
## 7           1
## 9           1

5. Up-sampling

Imbalanced class happens when the dependent variable has one class with a higher frequency compared to the lower class. As it was mentioned before our dataset is imbalanced that’s why we have to apply some technique to reduce the negative impact of this problem by subsampling the data. As our dataset is quite small we’ll use up-sampling that increases the size of the minority class by sampling with replacement so that the classes will have the same size.

#upsampling the training data
set.seed(1922)
up_heart_train <- upSample(x = heart_train,
                     y = heart_train$DEATH_EVENT)   

# check the balance
table(up_heart_train$DEATH_EVENT)
## 
##   0   1 
## 116 116

6. Features selection

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable. To achieve this we will use several techniques listed below:

6.1. Numeric variables

6.1.1 Variables with near zero variance

It is advisable to remove variables with a single level (zero variance). As there are no variables with 1 level, we don’t need to eliminate any variable.

sapply(heart_train[, numeric_vars], 
        function(x) 
          unique(x) %>% 
          length()) %>% 
  sort()
##        ejection_fraction         serum_creatinine             serum_sodium 
##                       14                       18                       23 
##                      age                platelets creatinine_phosphokinase 
##                       40                      109                      118

There are also no variables with near-zero variance.

heart_nzv <- nearZeroVar(heart_train[, numeric_vars],
                              saveMetrics = TRUE)
saveRDS(heart_nzv, "heart_nzv.rds")

heart_nzv
##                          freqRatio percentUnique zeroVar   nzv
## age                       1.200000     24.844720   FALSE FALSE
## creatinine_phosphokinase  6.000000     73.291925   FALSE FALSE
## ejection_fraction         1.045455      8.695652   FALSE FALSE
## platelets                 3.500000     67.701863   FALSE FALSE
## serum_creatinine          1.450000     11.180124   FALSE FALSE
## serum_sodium              1.090909     14.285714   FALSE FALSE

6.1.2. Correlations

Correlation expresses the extent to which two variables are related to each other.Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. Therefore, when two features have high correlation, we can drop one of them. In this case explanatory variables that can be said to be more correlated than others are:serum_creatinine and serum_sodium. However correlation equals to -0.34, hence there is no need to omit any of them.

correlations <- cor(heart_train[,numeric_vars],
    use = "pairwise.complete.obs")

corrplot.mixed(correlations,
               upper = "number",
               lower = "circle",
               tl.col = "black",
               tl.pos = "lt")

6.1.3. Colinearity

Using function findLinearCombos we checked whether linear regression is in our dataset:

data_linear_combinations <- findLinearCombos(heart_train[, numeric_vars])
data_linear_combinations
## $linearCombos
## list()
## 
## $remove
## NULL

No linear regression in our data (no sets of variables correlated significantly).

6.2. Categorical variables

6.2.1. Random Forests

Feature ranking and selection based on random forests algorithm.

boruta_output <- Boruta(DEATH_EVENT ~ ., data=na.omit(heart_train), doTrace=0)  
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
print(boruta_signif) 
## [1] "age"               "ejection_fraction" "serum_creatinine"
# Do a tentative rough fix
roughFixMod <- TentativeRoughFix(boruta_output)
boruta_signif <- getSelectedAttributes(roughFixMod)
print(boruta_signif)
## [1] "age"               "ejection_fraction" "serum_creatinine"
# Variable Importance Scores
imps <- attStats(roughFixMod)
imps2 = imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ])  # descending sort
##                     meanImp  decision
## ejection_fraction 16.188221 Confirmed
## serum_creatinine   6.789898 Confirmed
## age                5.817175 Confirmed
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")  

6.2.2. Stepwise Regression

BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that the model is considered to be more likely to be the true model.

null_model <- lm(DEATH_EVENT ~ 1 , data = heart_train_numeric)
full_model <- lm(DEATH_EVENT ~ . , data = heart_train_numeric)
BIC_model  <- step( null_model, scope = list( lower = null_model, upper = full_model), k = log(nrow(heart_train_numeric)), direction = "forward")
## Start:  AIC=-306.25
## DEATH_EVENT ~ 1
## 
##                            Df Sum of Sq    RSS     AIC
## + serum_creatinine          1    5.5765 42.047 -327.05
## + ejection_fraction         1    3.8326 43.791 -318.52
## + age                       1    2.9741 44.650 -314.44
## + serum_sodium              1    2.3047 45.319 -311.32
## <none>                                  47.624 -306.25
## + creatinine_phosphokinase  1    0.4169 47.207 -302.74
## + platelets                 1    0.2140 47.410 -301.84
## + high_blood_pressure       1    0.1524 47.471 -301.57
## + sex                       1    0.0340 47.590 -301.05
## + anaemia                   1    0.0220 47.602 -301.00
## + smoking                   1    0.0079 47.616 -300.93
## + diabetes                  1    0.0049 47.619 -300.92
## 
## Step:  AIC=-327.05
## DEATH_EVENT ~ serum_creatinine
## 
##                            Df Sum of Sq    RSS     AIC
## + ejection_fraction         1    4.2499 37.797 -344.08
## + age                       1    1.8659 40.181 -331.24
## + serum_sodium              1    1.4092 40.638 -328.86
## <none>                                  42.047 -327.05
## + creatinine_phosphokinase  1    0.3851 41.662 -323.64
## + high_blood_pressure       1    0.1478 41.900 -322.44
## + platelets                 1    0.0948 41.952 -322.18
## + smoking                   1    0.0181 42.029 -321.79
## + diabetes                  1    0.0167 42.031 -321.79
## + anaemia                   1    0.0053 42.042 -321.73
## + sex                       1    0.0022 42.045 -321.72
## 
## Step:  AIC=-344.08
## DEATH_EVENT ~ serum_creatinine + ejection_fraction
## 
##                            Df Sum of Sq    RSS     AIC
## + age                       1   2.04689 35.751 -350.43
## <none>                                  37.797 -344.08
## + serum_sodium              1   0.75040 37.047 -342.94
## + creatinine_phosphokinase  1   0.33991 37.458 -340.63
## + high_blood_pressure       1   0.17328 37.624 -339.70
## + platelets                 1   0.02203 37.775 -338.86
## + sex                       1   0.02072 37.777 -338.85
## + anaemia                   1   0.01120 37.786 -338.80
## + diabetes                  1   0.00342 37.794 -338.75
## + smoking                   1   0.00238 37.795 -338.75
## 
## Step:  AIC=-350.43
## DEATH_EVENT ~ serum_creatinine + ejection_fraction + age
## 
##                            Df Sum of Sq    RSS     AIC
## <none>                                  35.751 -350.43
## + serum_sodium              1   0.62197 35.129 -348.76
## + creatinine_phosphokinase  1   0.39209 35.358 -347.39
## + sex                       1   0.13322 35.617 -345.86
## + high_blood_pressure       1   0.11863 35.632 -345.78
## + diabetes                  1   0.04629 35.704 -345.35
## + platelets                 1   0.00292 35.748 -345.10
## + anaemia                   1   0.00241 35.748 -345.09
## + smoking                   1   0.00086 35.750 -345.08
summary(BIC_model)
## 
## Call:
## lm(formula = DEATH_EVENT ~ serum_creatinine + ejection_fraction + 
##     age, data = heart_train_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1200 -0.3126 -0.1288  0.3691  1.0061 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.085233   0.169379   0.503 0.615354    
## serum_creatinine   0.160043   0.030421   5.261 3.58e-07 ***
## ejection_fraction -0.011953   0.002366  -5.053 9.58e-07 ***
## age                0.008205   0.002389   3.434 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4166 on 206 degrees of freedom
## Multiple R-squared:  0.2493, Adjusted R-squared:  0.2384 
## F-statistic: 22.81 on 3 and 206 DF,  p-value: 8.653e-13

6.3. Final selection

Above analyzes indicate three variables: age, serum_sodium and ejection_fraction to be significant, therefore we’ll consider only them in the further analysis.

selectedvars<-c("age", "serum_creatinine","ejection_fraction", "DEATH_EVENT")

7. Classification methods

In this part the results obtained after applying five classification algorithms to the previously processed data (with and without applied up-sampling) will be presented. Presentation of each model will be divided into three parts:
1.Train the data - training the model, application of the algorithm to the training samples heart_train and up_heart_train.
2.Predictions on test data- prediction on the test sample, application of the algorithm to the test sample.
3. Results- comparison of the obtained results, check the performance measures.

7.1. Logit model

First, using glm function we’ll apply Logit model used to predict and analyze the probability of DEATH_EVENT == 1 occurrence. It is possible to calculate the change in odds (the ratio of the probability that the event occurred to the probability that the event will not occur) when the values of the explanatory variables change by estimating the parameters (intercept and regression coefficients) using the maximum likelihood method.

7.1.1. Train the data

The tabs below present the results we obtained after applying the model. Based on AIC value (Akaike Information Criterion which is a measure of the balance between the simplicity of the model and the goodness of fit of the model) we can conclude that a model where up-sampling was not applied is better. The lower the value of AIC the better model.

Logit

model_logit <- glm(DEATH_EVENT ~ ., 
                    family =  binomial(link = "logit"),
                    data = heart_train[selectedvars])
summary(model_logit)
## 
## Call:
## glm(formula = DEATH_EVENT ~ ., family = binomial(link = "logit"), 
##     data = heart_train[selectedvars])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8145  -0.7263  -0.4668   0.7228   2.3871  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.7296     0.6986  -2.476 0.013291 *  
## age                 3.4544     1.0196   3.388 0.000704 ***
## serum_creatinine    2.0131     0.9263   2.173 0.029757 *  
## ejection_fraction  -4.2856     1.2721  -3.369 0.000754 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.78  on 160  degrees of freedom
## Residual deviance: 154.34  on 157  degrees of freedom
## AIC: 162.34
## 
## Number of Fisher Scoring iterations: 5

To check whether above model is equally good as a model with just a constant or in other words whether variables are jointly significant we have to perform Likelihood ratio test. In our case we would reject the null hypothesis (taking into account significance level of 0.05 ) and conclude that we should use the more complex model as it increases the accuracy of our model.

lrtest(model_logit)
## Likelihood ratio test
## 
## Model 1: DEATH_EVENT ~ age + serum_creatinine + ejection_fraction
## Model 2: DEATH_EVENT ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   4 -77.168                         
## 2   1 -95.390 -3 36.443  6.036e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logit (up-sampling)

model_logit_upsampling <- glm(DEATH_EVENT ~ ., 
                    family =  binomial(link = "logit"),
                    data = up_heart_train[selectedvars])
summary(model_logit_upsampling)
## 
## Call:
## glm(formula = DEATH_EVENT ~ ., family = binomial(link = "logit"), 
##     data = up_heart_train[selectedvars])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.35011  -0.95310   0.00817   0.96562   1.81053  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.3375     0.5169  -2.588 0.009664 ** 
## age                 3.5722     0.8198   4.357 1.32e-05 ***
## serum_creatinine    2.4390     0.7618   3.202 0.001367 ** 
## ejection_fraction  -3.1531     0.8230  -3.831 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 321.62  on 231  degrees of freedom
## Residual deviance: 264.95  on 228  degrees of freedom
## AIC: 272.95
## 
## Number of Fisher Scoring iterations: 4

Same as for test sample without up-sampling we should reject the null hypothesis and conclude that we have to use the more complex model as it increases the accuracy of our model.

lrtest(model_logit_upsampling)
## Likelihood ratio test
## 
## Model 1: DEATH_EVENT ~ age + serum_creatinine + ejection_fraction
## Model 2: DEATH_EVENT ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   4 -132.48                         
## 2   1 -160.81 -3 56.669  3.024e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.1.2. Predictions on test data

We will use above models to predict the test data - heart_test.

Logit

predict_logit <- as.factor(predict(model_logit, newdata=heart_test, type="response") >= 0.5) %>%
  fct_recode("0" = "FALSE", "1" = "TRUE")

Logit (up-sampling)

predict_logit_up <- as.factor(predict(model_logit_upsampling, newdata=heart_test, type="response") >= 0.5) %>%
  fct_recode("0" = "FALSE", "1" = "TRUE")

7.1.3. Results

The prediction using test data showed that accuracy for Logit Model was 83.58%, while for Logit Model with up-samplig was lower and equals to 76.12%.

Logit

confusionMatrix_logit<-confusionMatrix(predict_logit, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_logit
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 10
##          1  1  8
##                                           
##                Accuracy : 0.8358          
##                  95% CI : (0.7252, 0.9151)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.03204         
##                                           
##                   Kappa : 0.5037          
##                                           
##  Mcnemar's Test P-Value : 0.01586         
##                                           
##             Sensitivity : 0.4444          
##             Specificity : 0.9796          
##          Pos Pred Value : 0.8889          
##          Neg Pred Value : 0.8276          
##              Prevalence : 0.2687          
##          Detection Rate : 0.1194          
##    Detection Prevalence : 0.1343          
##       Balanced Accuracy : 0.7120          
##                                           
##        'Positive' Class : 1               
## 
stats_logit <- round(c(confusionMatrix_logit$overall[1],
                 confusionMatrix_logit$byClass[c(1:4, 7, 11)]),5)

Logit (up-sampling)

confusionMatrix_logit_up<-confusionMatrix(predict_logit_up, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_logit_up
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40  7
##          1  9 11
##                                           
##                Accuracy : 0.7612          
##                  95% CI : (0.6414, 0.8569)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.3464          
##                                           
##                   Kappa : 0.4129          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.6111          
##             Specificity : 0.8163          
##          Pos Pred Value : 0.5500          
##          Neg Pred Value : 0.8511          
##              Prevalence : 0.2687          
##          Detection Rate : 0.1642          
##    Detection Prevalence : 0.2985          
##       Balanced Accuracy : 0.7137          
##                                           
##        'Positive' Class : 1               
## 
stats_logit_up <- round(c(confusionMatrix_logit_up$overall[1],
                 confusionMatrix_logit_up$byClass[c(1:4, 7, 11)]),5)

7.2. Decision Tree

Decision Tree will be the second algorithm applied to our datasets. We will use CART (Classification and Regression Trees).

7.2.1. Train the data

Below figures show that the tree model treats ejection_fraction variable as the biggest factor in categorizing the data. All patients whose ejection_fraction is lower than 33% are considered to be dead DEATH_EVENT == 1.

Decision Tree

set.seed(0)
cart <- rpart(DEATH_EVENT ~ .,
              data = heart_train, method = "class",
              control=rpart.control(minsplit=10,
                                    minbucket=5,
                                    maxdepth=10,
                                    cp=0.03))
rpart.plot(cart)

Decision Tree (up-sampling)

set.seed(0)
cart_up <- rpart(DEATH_EVENT ~ .,
              data = up_heart_train, method = "class",
              control=rpart.control(minsplit=10,
                                    minbucket=5,
                                    maxdepth=10,
                                    cp=0.03))
rpart.plot(cart_up)

7.2.2. Predictions on test data

Decision Tree

pred_dt <- as.factor(predict(cart, newdata=heart_test)[, 2] >= 0.5) %>%
  fct_recode("0" = "FALSE", "1" = "TRUE")

Decision Tree (up-sampling)

pred_dt_up <- as.factor(predict(cart_up, newdata=heart_test)[, 2] >= 0.5) %>%
  fct_recode("0" = "FALSE", "1" = "TRUE")

7.2.3. Results

The prediction using test data showed that accuracy for Decision Tree was 77.61%, while for Decision Tree with up-samplig was lower and equals to 58.21%.

Decision Tree

confusionMatrix_dt<-confusionMatrix(pred_dt, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43  9
##          1  6  9
##                                           
##                Accuracy : 0.7761          
##                  95% CI : (0.6578, 0.8689)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.2491          
##                                           
##                   Kappa : 0.3986          
##                                           
##  Mcnemar's Test P-Value : 0.6056          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.8776          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.8269          
##              Prevalence : 0.2687          
##          Detection Rate : 0.1343          
##    Detection Prevalence : 0.2239          
##       Balanced Accuracy : 0.6888          
##                                           
##        'Positive' Class : 1               
## 
stats_dt <- round(c(confusionMatrix_dt$overall[1],
                  confusionMatrix_dt$byClass[c(1:4, 7, 11)]),5)

Decision Tree (up-sampling)

confusionMatrix_dt_up<-confusionMatrix(pred_dt_up, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_dt_up
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 32 11
##          1 17  7
##                                           
##                Accuracy : 0.5821          
##                  95% CI : (0.4552, 0.7015)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.9973          
##                                           
##                   Kappa : 0.0379          
##                                           
##  Mcnemar's Test P-Value : 0.3447          
##                                           
##             Sensitivity : 0.3889          
##             Specificity : 0.6531          
##          Pos Pred Value : 0.2917          
##          Neg Pred Value : 0.7442          
##              Prevalence : 0.2687          
##          Detection Rate : 0.1045          
##    Detection Prevalence : 0.3582          
##       Balanced Accuracy : 0.5210          
##                                           
##        'Positive' Class : 1               
## 
stats_dt_up <- round(c(confusionMatrix_dt_up$overall[1],
                  confusionMatrix_dt_up$byClass[c(1:4, 7, 11)]),5)

7.3. Random Forest

As a third algorithm we’ll apply Random Forest. Model samples and extracts test data, creates decision trees in parallel for each sample and makes predictions based on the average or majority vote of these decision trees.

7.3.1. Train the data

Random Forest

set.seed(0)
model_rf <- ranger(DEATH_EVENT ~.,
             data = heart_train,
             mtry = 2, num.trees = 500, write.forest=TRUE, importance = "permutation")
data.frame(variables = names(importance(model_rf, method = "janitza")),
           feature_importance = importance(model_rf, method = "janitza")) %>%
  ggplot(aes(x = feature_importance,
             y = reorder(variables, X = feature_importance))) +
    geom_bar(stat = "identity",
             alpha=0.9) +
    labs(y = "features", title = "Feature importance") +
    theme_minimal(base_size = 10)

Random Forest (up-sampling)

set.seed(0)
model_rf_upsampling <- ranger(DEATH_EVENT ~.,
             data = up_heart_train,
             mtry = 2, num.trees = 500, write.forest=TRUE, importance = "permutation")
data.frame(variables = names(importance(model_rf_upsampling, method = "janitza")),
           feature_importance = importance(model_rf_upsampling, method = "janitza")) %>%
  ggplot(aes(x = feature_importance,
             y = reorder(variables, X = feature_importance))) +
    geom_bar(stat = "identity",
             alpha=0.9) +
    labs(y = "features", title = "Feature importance") +
    theme_minimal(base_size = 10)

7.3.2. Predictions on test data

Random Forest

pred_rf <- predict(model_rf, data=heart_test)$predictions

Random Forest (up-sampling)

pred_rf_upsampling <- predict(model_rf_upsampling, data=heart_test)$predictions

7.3.3. Results

The prediction using test data showed that accuracy for Random Forest was 82.95%, while for Decision TreeRandom Forest` with up-samplig was lower and equals to 77.61%.

Random Forest

confusionMatrix_rf<-confusionMatrix(pred_rf, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 12
##          1  0  6
##                                          
##                Accuracy : 0.8209         
##                  95% CI : (0.708, 0.9039)
##     No Information Rate : 0.7313         
##     P-Value [Acc > NIR] : 0.060553       
##                                          
##                   Kappa : 0.4224         
##                                          
##  Mcnemar's Test P-Value : 0.001496       
##                                          
##             Sensitivity : 0.33333        
##             Specificity : 1.00000        
##          Pos Pred Value : 1.00000        
##          Neg Pred Value : 0.80328        
##              Prevalence : 0.26866        
##          Detection Rate : 0.08955        
##    Detection Prevalence : 0.08955        
##       Balanced Accuracy : 0.66667        
##                                          
##        'Positive' Class : 1              
## 
stats_rf <- round(c(confusionMatrix_rf$overall[1],
                   confusionMatrix_rf$byClass[c(1:4, 7, 11)]),5)

Random Forest (up-sampling)

confusionMatrix_rf_up<-confusionMatrix(pred_rf_upsampling, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_rf_up
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46 12
##          1  3  6
##                                           
##                Accuracy : 0.7761          
##                  95% CI : (0.6578, 0.8689)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.24912         
##                                           
##                   Kappa : 0.3232          
##                                           
##  Mcnemar's Test P-Value : 0.03887         
##                                           
##             Sensitivity : 0.33333         
##             Specificity : 0.93878         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.79310         
##              Prevalence : 0.26866         
##          Detection Rate : 0.08955         
##    Detection Prevalence : 0.13433         
##       Balanced Accuracy : 0.63605         
##                                           
##        'Positive' Class : 1               
## 
stats_rf_up <- round(c(confusionMatrix_rf_up$overall[1],
                   confusionMatrix_rf_up$byClass[c(1:4, 7, 11)]),5)

7.4. SVM (Support Vector Machine)

SVM is a model that draws a boundary (or separation hyperplane) that clearly separates two or more classes for classification purposes. In fact, we will train the support vector machine using training data and visualize the learning algorithm with the plotLearnerPrediction function.

7.4.1. Train the data

SVM

task_train <- makeClassifTask(data=heart_train, target="DEATH_EVENT")
task_test <- makeClassifTask(data=heart_test, target="DEATH_EVENT")
task_df <- makeClassifTask(data=data_heart, target="DEATH_EVENT")

set.seed(0)
lrn_svm <- makeLearner("classif.svm", predict.type = "prob",
                       par.vals = list(kernel="linear",
                                       cost=0.1))
svm <- train(lrn_svm, task_train)
plotLearnerPrediction(lrn_svm, task_df, features = c("serum_creatinine", "ejection_fraction")) +
  theme_minimal(base_size = 10)

SVM (up-sampling)

task_train_up <- makeClassifTask(data=up_heart_train, target="DEATH_EVENT")
task_test_up <- makeClassifTask(data=heart_test, target="DEATH_EVENT")
task_df_up <- makeClassifTask(data=data_heart, target="DEATH_EVENT")

set.seed(0)
lrn_svm_up <- makeLearner("classif.svm", predict.type = "prob",
                       par.vals = list(kernel="linear",
                                       cost=0.1))
svm_up <- train(lrn_svm_up, task_train_up)
plotLearnerPrediction(lrn_svm_up, task_df_up, features = c("serum_creatinine", "ejection_fraction")) +
  theme_minimal(base_size = 10)

7.4.2. Predictions on test data

SVM

pred_svm <- getPredictionResponse(predict(svm, task_test))

SVM (up-sampling)

pred_svm_up <- getPredictionResponse(predict(svm_up, task_test_up))

7.4.3. Results

The prediction using test data showed accuracy of 80.06% for SVM and 71.64% for SVM with up-sampling.

SVM

confusionMatrix_svm <- confusionMatrix(pred_svm, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_svm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 12
##          1  1  6
##                                           
##                Accuracy : 0.806           
##                  95% CI : (0.6911, 0.8924)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.104860        
##                                           
##                   Kappa : 0.3879          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.33333         
##             Specificity : 0.97959         
##          Pos Pred Value : 0.85714         
##          Neg Pred Value : 0.80000         
##              Prevalence : 0.26866         
##          Detection Rate : 0.08955         
##    Detection Prevalence : 0.10448         
##       Balanced Accuracy : 0.65646         
##                                           
##        'Positive' Class : 1               
## 
stats_svm <- round(c(confusionMatrix_svm$overall[1],
                   confusionMatrix_svm$byClass[c(1:4, 7, 11)]),5)

SVM (up-sampling)

confusionMatrix_svm_up <- confusionMatrix(pred_svm_up, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_svm_up
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39  9
##          1 10  9
##                                           
##                Accuracy : 0.7164          
##                  95% CI : (0.5931, 0.8199)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 0.6666          
##                                           
##                   Kappa : 0.2908          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.7959          
##          Pos Pred Value : 0.4737          
##          Neg Pred Value : 0.8125          
##              Prevalence : 0.2687          
##          Detection Rate : 0.1343          
##    Detection Prevalence : 0.2836          
##       Balanced Accuracy : 0.6480          
##                                           
##        'Positive' Class : 1               
## 
stats_svm_up <- round(c(confusionMatrix_svm_up$overall[1],
                   confusionMatrix_svm_up$byClass[c(1:4, 7, 11)]),5)

7.5. KNN (K – Nearest Neighbor)

K-Nearest Neighbors Classification algorithm classifies observations by allowing the K observations in the training set that are nearest to the new observation to “vote” on the class of the new observation. KNN uses euclidean distance in order to measure the distance between neighbors, that’s why it is important to use scaled data. In our case the data was scaled before in the range [0, 1] with the help of preProcess from Caret package.

heart_train_labels <- heart_train[1:161, 12]    #Extracting training labels
heart_test_labels <- heart_test[1:67, 12]  #Extracting test labels

Since our training set includes 161 instances, we can try k = 13, an odd number approximately equal to the square root of 161. Using an odd number will reduce the probability of ending with a tie vote.

7.5.1. KNN

Appliaction

knn13 <- knn(train = heart_train, test = heart_test,
cl = heart_train_labels, k=13)
CrossTable(x = heart_test_labels, y = knn13,
prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  67 
## 
##  
##                   | knn13 
## heart_test_labels |         0 |         1 | Row Total | 
## ------------------|-----------|-----------|-----------|
##                 0 |        49 |         0 |        49 | 
##                   |     1.000 |     0.000 |     0.731 | 
##                   |     0.980 |     0.000 |           | 
##                   |     0.731 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|
##                 1 |         1 |        17 |        18 | 
##                   |     0.056 |     0.944 |     0.269 | 
##                   |     0.020 |     1.000 |           | 
##                   |     0.015 |     0.254 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |        50 |        17 |        67 | 
##                   |     0.746 |     0.254 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 

Results

confusionMatrix_knn13 <- confusionMatrix(knn13, heart_test$DEATH_EVENT, positive = "1")
confusionMatrix_knn13
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49  1
##          1  0 17
##                                           
##                Accuracy : 0.9851          
##                  95% CI : (0.9196, 0.9996)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : 2.016e-08       
##                                           
##                   Kappa : 0.9613          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9800          
##              Prevalence : 0.2687          
##          Detection Rate : 0.2537          
##    Detection Prevalence : 0.2537          
##       Balanced Accuracy : 0.9722          
##                                           
##        'Positive' Class : 1               
## 
stats_knn13 <- round(c(confusionMatrix_knn13$overall[1],
                   confusionMatrix_knn13$byClass[c(1:4, 7, 11)]),5)

8. Evaluation of model accuracy

8.1. Confusion Matrix

metrics <- rbind(stats_logit,stats_logit_up,stats_dt,stats_dt_up,stats_rf,stats_rf_up, stats_svm, stats_svm_up,stats_knn13)
row.names(metrics) <- c("Logit","Logit (up-sampling)","Decision Tree","Decision Tree (up-sampling)", "Random Forest", "Random Forest (up-sampling)", "SVM", "SVM (up-sampling", "KNN13")
metrics
##                             Accuracy Sensitivity Specificity Pos Pred Value
## Logit                        0.83582     0.44444     0.97959        0.88889
## Logit (up-sampling)          0.76119     0.61111     0.81633        0.55000
## Decision Tree                0.77612     0.50000     0.87755        0.60000
## Decision Tree (up-sampling)  0.58209     0.38889     0.65306        0.29167
## Random Forest                0.82090     0.33333     1.00000        1.00000
## Random Forest (up-sampling)  0.77612     0.33333     0.93878        0.66667
## SVM                          0.80597     0.33333     0.97959        0.85714
## SVM (up-sampling             0.71642     0.50000     0.79592        0.47368
## KNN13                        0.98507     0.94444     1.00000        1.00000
##                             Neg Pred Value      F1 Balanced Accuracy
## Logit                              0.82759 0.59259           0.71202
## Logit (up-sampling)                0.85106 0.57895           0.71372
## Decision Tree                      0.82692 0.54545           0.68878
## Decision Tree (up-sampling)        0.74419 0.33333           0.52098
## Random Forest                      0.80328 0.50000           0.66667
## Random Forest (up-sampling)        0.79310 0.44444           0.63605
## SVM                                0.80000 0.48000           0.65646
## SVM (up-sampling                   0.81250 0.48649           0.64796
## KNN13                              0.98000 0.97143           0.97222

8.2. ROC curves

ROC curve (receiver operating characteristic curve) is a graph showing the performance of each classification model at all classification thresholds. This curve plots two parameters: true positive and false positive rates.

Attached below each plot is the information AUC (area under the ROC Curve which measures the entire two-dimensional area underneath the entire ROC curve. The larger the AUC, the better the classification model.

Logit

heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,predict_logit)

## Area under the curve (AUC): 0.712
heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,predict_logit_up)

## Area under the curve (AUC): 0.714

Decision Tree

heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,pred_dt)

## Area under the curve (AUC): 0.689
heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,pred_dt_up)

## Area under the curve (AUC): 0.521

Random Forest

heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,pred_rf)

## Area under the curve (AUC): 0.667
heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,pred_rf_upsampling)

## Area under the curve (AUC): 0.636

SVM

heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,pred_svm)

## Area under the curve (AUC): 0.656
heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,pred_svm_up)

## Area under the curve (AUC): 0.648

KNN

heart_test$DEATH_EVENT <- ifelse(heart_test$DEATH_EVENT=="0",0,1)
roc.curve(heart_test$DEATH_EVENT,knn13)

## Area under the curve (AUC): 0.972

9. Conclusions

  • Based on the statistics describing the performance of our models, both those included in the Confusion Matrix and the AUC, we can conclude that the best classification model for heart failure dataset is KNN (K nearest neighbors) and logit.
  • Based on Confusion Matrix results we obtained 98,5% accuracy using KNN on unbalanced heart failure dataset.
  • Classification models performed much worse with datasets balanced by up-sampling.
  • Decision Tree model does not work very well with balanced heart failure dataset. Using this model we obtained only 58,21% of accuracy which is equal to the probability of guessing two classes and assigning them to the observations.
  • Our results show that serum creatinine, ejection fraction and age are the most significant and sufficient factors to predict survival of heart failure patients than using all features included in the original dataset.
  • Heart failure dataset contains independent variables that do predict the dependent variable, therefore model can be constructed based on them.
  • We find out that Machine Learning algorithms can help in determining which variables (risk factors) significantly affect the dependent variable and has the potential to impact on clinical practice to predict if a heart failure patient will survive or not.

10. References

  1. Davide Chicco, Giuseppe Jurman: “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone”. BMC Medical Informatics and Decision Making 20, 16 (2020). Web Link
  2. Class materials prepared and provided by Piotr Wójcik PhD at the course “Machine Learning 1”, University of Warsaw, 2021
  3. Photo source: https://www.mountelizabeth.com.sg/specialties/medical-specialties/heart-vascular/heart-failure