Data: This data set is called Heart Failure Prediction and available on Kaggle.com. It has 299 observation of 13 variables.
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worlwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.
Objective:
Create a model for predicting mortality caused by Heart Failure.
Data Preparation
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # confusion matrix
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(bootStepAIC)
#PREPARING WORK SPAcE
# Clear the workspace:
rm(list = ls())
# Load data
df <- read.csv("HeartFailure.csv", header = TRUE)
head(df)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## DEATH_EVENT
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
nrow(df)
## [1] 299
We can obtain summary statistics through summary function.
#Summary Statistics
summary(df)
## age anaemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :262000 Median :1.100
## Mean :38.08 Mean :0.3512 Mean :263358 Mean :1.394
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :9.400
## serum_sodium sex smoking time
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 4.0
## 1st Qu.:134.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 73.0
## Median :137.0 Median :1.0000 Median :0.0000 Median :115.0
## Mean :136.6 Mean :0.6488 Mean :0.3211 Mean :130.3
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0
## DEATH_EVENT
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3211
## 3rd Qu.:1.0000
## Max. :1.0000
#Data Structure
str(df)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
Most of the variables on this data set should be converted to categorical variable including sex, diabetes, high blood pressure, smoking, and the response variable, Death Event.
#Convert to Categorical Variable
newdf <- df %>%
mutate_at(vars(diabetes,high_blood_pressure,sex,smoking,DEATH_EVENT), funs(factor))
str(newdf)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
## $ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#Adjust and Scaling variables
newdf <- newdf %>%
mutate(platelets =platelets /10000) %>%
mutate_at(vars(creatinine_phosphokinase),funs(scale))
#Two-way table
tab1 <- xtabs(~DEATH_EVENT + high_blood_pressure, data=newdf)
tab1
## high_blood_pressure
## DEATH_EVENT 0 1
## 0 137 66
## 1 57 39
We can test one of the variables if it is depended on the response variable. I will use Chi-Square independence test.
#Chi-Square test of independence
chi1 <- chisq.test(tab1)
chi1
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab1
## X-squared = 1.5435, df = 1, p-value = 0.2141
We use Chi-square test for two reasons: 1.Test for independency 2.Test goodness of fit. Since we p-value is not significant (p_value=0.214>.05), we do not reject the Null Hypothesis, these 2 events are independent.It means Death event is not depend on High Blood pressure. So I can say that I will be not using High Blood Pressure on my final model.
SATURATED MODEL
mod1 <- glm(DEATH_EVENT ~., data=newdf, family="binomial")
summary(mod1)
##
## Call:
## glm(formula = DEATH_EVENT ~ ., family = "binomial", data = newdf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1848 -0.5706 -0.2401 0.4466 2.6668
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.314231 5.660479 1.822 0.068433 .
## age 0.047419 0.015801 3.001 0.002690 **
## anaemia -0.007470 0.360489 -0.021 0.983467
## creatinine_phosphokinase 0.215626 0.172646 1.249 0.211684
## diabetes1 0.145150 0.351189 0.413 0.679380
## ejection_fraction -0.076663 0.016329 -4.695 2.67e-06 ***
## high_blood_pressure1 -0.102679 0.358707 -0.286 0.774688
## platelets -0.011996 0.018891 -0.635 0.525404
## serum_creatinine 0.666093 0.181493 3.670 0.000242 ***
## serum_sodium -0.066981 0.039735 -1.686 0.091855 .
## sex1 -0.533658 0.413918 -1.289 0.197299
## smoking1 -0.013492 0.412618 -0.033 0.973915
## time -0.021045 0.003014 -6.981 2.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 219.55 on 286 degrees of freedom
## AIC: 245.55
##
## Number of Fisher Scoring iterations: 6
We want to include as many predictors as needed, but we donโt want to add the ones who is not correlated with the response variable.
FEATURE SELECTION (BACKWARD - STEPWISE)
back_step<-stepAIC(mod1, direction="backward", trace=FALSE)
Backward stepwise models eliminate each predictor in each iteration based on the AIC.
back_step
##
## Call: glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## serum_sodium + time, family = "binomial", data = newdf)
##
## Coefficients:
## (Intercept) age ejection_fraction serum_creatinine
## 9.49303 0.04247 -0.07343 0.68599
## serum_sodium time
## -0.06456 -0.02089
##
## Degrees of Freedom: 298 Total (i.e. Null); 293 Residual
## Null Deviance: 375.3
## Residual Deviance: 223.5 AIC: 235.5
Backward stepwise model recommend us to select these variables: age,ejection_fraction, serum_creatinine, serum_sodium, time. You may notice that these variables are also categorized as significant predictors from our model(check mod1 summary).
Note that, stepwise regression may not be the best selection due to inflated R-sqaure. Therefore, it may not lead to the best combination of the predictors.It may lead to overfitting and results may not be consistent. If we use this model with new sample we may have different result, or if we reapply stepwise model, we may end up choosing different predictors.
BOOTSTRAP METHOD
mod_boot <- boot.stepAIC(mod1, data=newdf, B=100)
Bootstrap is a resampling method from the sample, and it is a method to asses consistency predictors by using stepwise model.
mod_boot
##
## Summary of Bootstrapping the 'stepAIC()' procedure for
##
## Call:
## glm(formula = DEATH_EVENT ~ ., family = "binomial", data = newdf)
##
## Bootstrap samples: 100
## Direction: backward
## Penalty: 2 * df
##
## Covariates selected
## (%)
## ejection_fraction 100
## time 100
## age 95
## serum_creatinine 95
## serum_sodium 63
## sex 47
## creatinine_phosphokinase 46
## anaemia 25
## diabetes 25
## platelets 22
## high_blood_pressure 19
## smoking 16
##
## Coefficients Sign
## + (%) - (%)
## age 100.00 0.00
## creatinine_phosphokinase 100.00 0.00
## serum_creatinine 100.00 0.00
## diabetes1 92.00 8.00
## anaemia 40.00 60.00
## smoking1 37.50 62.50
## high_blood_pressure1 31.58 68.42
## platelets 18.18 81.82
## ejection_fraction 0.00 100.00
## serum_sodium 0.00 100.00
## sex1 0.00 100.00
## time 0.00 100.00
##
## Stat Significance
## (%)
## ejection_fraction 100.00
## time 100.00
## serum_creatinine 93.68
## age 89.47
## sex1 63.83
## serum_sodium 58.73
## diabetes1 44.00
## smoking1 37.50
## high_blood_pressure1 36.84
## platelets 36.36
## anaemia 32.00
## creatinine_phosphokinase 26.09
##
##
## The stepAIC() for the original data-set gave
##
## Call: glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## serum_sodium + time, family = "binomial", data = newdf)
##
## Coefficients:
## (Intercept) age ejection_fraction serum_creatinine
## 9.49303 0.04247 -0.07343 0.68599
## serum_sodium time
## -0.06456 -0.02089
##
## Degrees of Freedom: 298 Total (i.e. Null); 293 Residual
## Null Deviance: 375.3
## Residual Deviance: 223.5 AIC: 235.5
##
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + sex + smoking + time
##
## Final Model:
## DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium +
## time
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 286 219.5541 245.5541
## 2 - anaemia 1 0.0004295093 287 219.5546 243.5546
## 3 - smoking 1 0.0009146290 288 219.5555 241.5555
## 4 - high_blood_pressure 1 0.0818174724 289 219.6373 239.6373
## 5 - diabetes 1 0.1748403394 290 219.8121 237.8121
## 6 - platelets 1 0.4041321347 291 220.2163 236.2163
## 7 - creatinine_phosphokinase 1 1.8280859686 292 222.0444 236.0444
## 8 - sex 1 1.4419134397 293 223.4863 235.4863
So, p-values form the summary table, Backward stepwise model, and boot strap model recommend us to the same model.
Final Model:
DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + time
Final Logistic Model
mod2 <-glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + time, data=newdf, family="binomial")
summary(mod2)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## serum_sodium + time, family = "binomial", data = newdf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1590 -0.5888 -0.2281 0.5144 2.7959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.493034 5.405768 1.756 0.07907 .
## age 0.042466 0.015030 2.825 0.00472 **
## ejection_fraction -0.073430 0.015785 -4.652 3.29e-06 ***
## serum_creatinine 0.685990 0.174044 3.941 8.10e-05 ***
## serum_sodium -0.064557 0.038377 -1.682 0.09254 .
## time -0.020895 0.002916 -7.166 7.74e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 223.49 on 293 degrees of freedom
## AIC: 235.49
##
## Number of Fisher Scoring iterations: 6
#Return exponentiated coefficients to gets ODD ratio
OR <- data.frame(exp(mod2$coefficients))
OR
## exp.mod2.coefficients.
## (Intercept) 1.326699e+04
## age 1.043381e+00
## ejection_fraction 9.292012e-01
## serum_creatinine 1.985737e+00
## serum_sodium 9.374825e-01
## time 9.793219e-01
Interpretation
OR_percent <- (OR-1)*1000
OR_percent
## exp.mod2.coefficients.
## (Intercept) 1.326599e+07
## age 4.338089e+01
## ejection_fraction -7.079878e+01
## serum_creatinine 9.857373e+02
## serum_sodium -6.251755e+01
## time -2.067808e+01
AGE: For every additional year, the odds of death increases by 4.3% while controlling other predictors.
EJECTION_FRACTION: For every percent increases of the ejection fraction, the odds of death decreases by 7.07% while controlling other predictors.
We can give similar interpretation to the other features.
MultiCollinearity Multi-collinearity happens when two more variable are highly correlated. In order to get a better interpretation, one of these predictors needs to be removed from the model.
Independent variables with VIF values of 5 or higher would suggest high correlation.
#Variance Inflation Factor(VIF)
vif(mod2)
## age ejection_fraction serum_creatinine serum_sodium
## 1.053111 1.133484 1.079122 1.028355
## time
## 1.096862
Our predictors looks good, there is no proof that there is a multi-collinearity, so we can make predcition over new data set.
Prediction using a new data
data1 <- data.frame(age=65, ejection_fraction= 43,serum_creatinine=1.2, serum_sodium=118,time=140)
#Determine the probability of death
predict(mod2, data1, type="response")
## 1
## 0.3488741
The probability of the death event of this person is 34%