Logistic Regression 3

Introduction

Data: This data set is called Heart Failure Prediction and available on Kaggle.com. It has 299 observation of 13 variables.

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worlwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Objective:

Create a model for predicting mortality caused by Heart Failure.

Data Preparation

#Loading necessary libraries
library(tidyverse) # For data manipulation

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr) 
library(caret)  # confusion matrix

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(e1071)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggplot2)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(bootStepAIC)

#PREPARING WORK SPAcE
# Clear the workspace: 
rm(list = ls())

# Load data
df <- read.csv("HeartFailure.csv", header = TRUE)

head(df)

##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

nrow(df)

## [1] 299

We can obtain summary statistics through summary function.

#Summary Statistics
summary(df)

##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000  
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000  
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000  
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181  
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500   
##  1st Qu.:30.00     1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900   
##  Median :38.00     Median :0.0000      Median :262000   Median :1.100   
##  Mean   :38.08     Mean   :0.3512      Mean   :263358   Mean   :1.394   
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400   
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :9.400   
##   serum_sodium        sex            smoking            time      
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.0  
##  1st Qu.:134.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 73.0  
##  Median :137.0   Median :1.0000   Median :0.0000   Median :115.0  
##  Mean   :136.6   Mean   :0.6488   Mean   :0.3211   Mean   :130.3  
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:203.0  
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0  
##   DEATH_EVENT    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3211  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

#Data Structure
str(df)

## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

Most of the variables on this data set should be converted to categorical variable including sex, diabetes, high blood pressure, smoking, and the response variable, Death Event.

#Convert to Categorical Variable
newdf <- df %>% 
  mutate_at(vars(diabetes,high_blood_pressure,sex,smoking,DEATH_EVENT), funs(factor))

str(newdf)

## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
##  $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

#Adjust and Scaling variables
newdf <- newdf %>% 
  mutate(platelets =platelets /10000) %>%
  mutate_at(vars(creatinine_phosphokinase),funs(scale))

#Two-way table
tab1 <- xtabs(~DEATH_EVENT + high_blood_pressure, data=newdf)

tab1

##            high_blood_pressure
## DEATH_EVENT   0   1
##           0 137  66
##           1  57  39

We can test one of the variables if it is depended on the response variable. I will use Chi-Square independence test.

#Chi-Square test of independence
chi1 <- chisq.test(tab1)

chi1

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab1
## X-squared = 1.5435, df = 1, p-value = 0.2141

We use Chi-square test for two reasons: 1.Test for independency 2.Test goodness of fit. Since we p-value is not significant (p_value=0.214>.05), we do not reject the Null Hypothesis, these 2 events are independent.It means Death event is not depend on High Blood pressure. So I can say that I will be not using High Blood Pressure on my final model.

SATURATED MODEL

mod1 <- glm(DEATH_EVENT ~., data=newdf, family="binomial")

summary(mod1)

## 
## Call:
## glm(formula = DEATH_EVENT ~ ., family = "binomial", data = newdf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1848  -0.5706  -0.2401   0.4466   2.6668  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              10.314231   5.660479   1.822 0.068433 .  
## age                       0.047419   0.015801   3.001 0.002690 ** 
## anaemia                  -0.007470   0.360489  -0.021 0.983467    
## creatinine_phosphokinase  0.215626   0.172646   1.249 0.211684    
## diabetes1                 0.145150   0.351189   0.413 0.679380    
## ejection_fraction        -0.076663   0.016329  -4.695 2.67e-06 ***
## high_blood_pressure1     -0.102679   0.358707  -0.286 0.774688    
## platelets                -0.011996   0.018891  -0.635 0.525404    
## serum_creatinine          0.666093   0.181493   3.670 0.000242 ***
## serum_sodium             -0.066981   0.039735  -1.686 0.091855 .  
## sex1                     -0.533658   0.413918  -1.289 0.197299    
## smoking1                 -0.013492   0.412618  -0.033 0.973915    
## time                     -0.021045   0.003014  -6.981 2.92e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 219.55  on 286  degrees of freedom
## AIC: 245.55
## 
## Number of Fisher Scoring iterations: 6

We want to include as many predictors as needed, but we don’t want to add the ones who is not correlated with the response variable.

FEATURE SELECTION (BACKWARD - STEPWISE)

back_step<-stepAIC(mod1, direction="backward", trace=FALSE)

Backward stepwise models eliminate each predictor in each iteration based on the AIC.

back_step

## 
## Call:  glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     serum_sodium + time, family = "binomial", data = newdf)
## 
## Coefficients:
##       (Intercept)                age  ejection_fraction   serum_creatinine  
##           9.49303            0.04247           -0.07343            0.68599  
##      serum_sodium               time  
##          -0.06456           -0.02089  
## 
## Degrees of Freedom: 298 Total (i.e. Null);  293 Residual
## Null Deviance:       375.3 
## Residual Deviance: 223.5     AIC: 235.5

Backward stepwise model recommend us to select these variables: age,ejection_fraction, serum_creatinine, serum_sodium, time. You may notice that these variables are also categorized as significant predictors from our model(check mod1 summary).

Note that, stepwise regression may not be the best selection due to inflated R-sqaure. Therefore, it may not lead to the best combination of the predictors.It may lead to overfitting and results may not be consistent. If we use this model with new sample we may have different result, or if we reapply stepwise model, we may end up choosing different predictors.

BOOTSTRAP METHOD

mod_boot <- boot.stepAIC(mod1, data=newdf, B=100)

Bootstrap is a resampling method from the sample, and it is a method to asses consistency predictors by using stepwise model.

mod_boot

## 
## Summary of Bootstrapping the 'stepAIC()' procedure for
## 
## Call:
## glm(formula = DEATH_EVENT ~ ., family = "binomial", data = newdf)
## 
## Bootstrap samples: 100 
## Direction: backward 
## Penalty: 2 * df
## 
## Covariates selected
##                          (%)
## ejection_fraction        100
## time                     100
## age                       95
## serum_creatinine          95
## serum_sodium              63
## sex                       47
## creatinine_phosphokinase  46
## anaemia                   25
## diabetes                  25
## platelets                 22
## high_blood_pressure       19
## smoking                   16
## 
## Coefficients Sign
##                           + (%)  - (%)
## age                      100.00   0.00
## creatinine_phosphokinase 100.00   0.00
## serum_creatinine         100.00   0.00
## diabetes1                 92.00   8.00
## anaemia                   40.00  60.00
## smoking1                  37.50  62.50
## high_blood_pressure1      31.58  68.42
## platelets                 18.18  81.82
## ejection_fraction          0.00 100.00
## serum_sodium               0.00 100.00
## sex1                       0.00 100.00
## time                       0.00 100.00
## 
## Stat Significance
##                             (%)
## ejection_fraction        100.00
## time                     100.00
## serum_creatinine          93.68
## age                       89.47
## sex1                      63.83
## serum_sodium              58.73
## diabetes1                 44.00
## smoking1                  37.50
## high_blood_pressure1      36.84
## platelets                 36.36
## anaemia                   32.00
## creatinine_phosphokinase  26.09
## 
## 
## The stepAIC() for the original data-set gave
## 
## Call:  glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     serum_sodium + time, family = "binomial", data = newdf)
## 
## Coefficients:
##       (Intercept)                age  ejection_fraction   serum_creatinine  
##           9.49303            0.04247           -0.07343            0.68599  
##      serum_sodium               time  
##          -0.06456           -0.02089  
## 
## Degrees of Freedom: 298 Total (i.e. Null);  293 Residual
## Null Deviance:       375.3 
## Residual Deviance: 223.5     AIC: 235.5
## 
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + sex + smoking + time
## 
## Final Model:
## DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + 
##     time
## 
## 
##                         Step Df     Deviance Resid. Df Resid. Dev      AIC
## 1                                                  286   219.5541 245.5541
## 2                  - anaemia  1 0.0004295093       287   219.5546 243.5546
## 3                  - smoking  1 0.0009146290       288   219.5555 241.5555
## 4      - high_blood_pressure  1 0.0818174724       289   219.6373 239.6373
## 5                 - diabetes  1 0.1748403394       290   219.8121 237.8121
## 6                - platelets  1 0.4041321347       291   220.2163 236.2163
## 7 - creatinine_phosphokinase  1 1.8280859686       292   222.0444 236.0444
## 8                      - sex  1 1.4419134397       293   223.4863 235.4863

So, p-values form the summary table, Backward stepwise model, and boot strap model recommend us to the same model.

Final Model:

DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + time

Final Logistic Model

mod2 <-glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium +  time, data=newdf, family="binomial")

summary(mod2)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     serum_sodium + time, family = "binomial", data = newdf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1590  -0.5888  -0.2281   0.5144   2.7959  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        9.493034   5.405768   1.756  0.07907 .  
## age                0.042466   0.015030   2.825  0.00472 ** 
## ejection_fraction -0.073430   0.015785  -4.652 3.29e-06 ***
## serum_creatinine   0.685990   0.174044   3.941 8.10e-05 ***
## serum_sodium      -0.064557   0.038377  -1.682  0.09254 .  
## time              -0.020895   0.002916  -7.166 7.74e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 223.49  on 293  degrees of freedom
## AIC: 235.49
## 
## Number of Fisher Scoring iterations: 6

#Return exponentiated coefficients to gets ODD ratio
OR <- data.frame(exp(mod2$coefficients))

OR

##                   exp.mod2.coefficients.
## (Intercept)                 1.326699e+04
## age                         1.043381e+00
## ejection_fraction           9.292012e-01
## serum_creatinine            1.985737e+00
## serum_sodium                9.374825e-01
## time                        9.793219e-01

Interpretation

OR_percent <- (OR-1)*1000

OR_percent

##                   exp.mod2.coefficients.
## (Intercept)                 1.326599e+07
## age                         4.338089e+01
## ejection_fraction          -7.079878e+01
## serum_creatinine            9.857373e+02
## serum_sodium               -6.251755e+01
## time                       -2.067808e+01

AGE: For every additional year, the odds of death increases by 4.3% while controlling other predictors.

EJECTION_FRACTION: For every percent increases of the ejection fraction, the odds of death decreases by 7.07% while controlling other predictors.

We can give similar interpretation to the other features.

MultiCollinearity Multi-collinearity happens when two more variable are highly correlated. In order to get a better interpretation, one of these predictors needs to be removed from the model.

Independent variables with VIF values of 5 or higher would suggest high correlation.

#Variance Inflation Factor(VIF)
vif(mod2)

##               age ejection_fraction  serum_creatinine      serum_sodium 
##          1.053111          1.133484          1.079122          1.028355 
##              time 
##          1.096862

Our predictors looks good, there is no proof that there is a multi-collinearity, so we can make predcition over new data set.

Prediction using a new data

data1 <- data.frame(age=65, ejection_fraction= 43,serum_creatinine=1.2, serum_sodium=118,time=140)

#Determine the probability of death
predict(mod2, data1, type="response")

##         1 
## 0.3488741

The probability of the death event of this person is 34%

**************

Logistic Regression 3

Mustafa Arslan

9/13/2021