Cardiovascular disease is one of the leading causes of death among men and women of all racial and ethnic groups around the world. Heart failure occurs when the heart cannot pump enough blood to meet the body’s needs. According to the Centers for Disease Control and Prevention (United States), one person dies every 36 seconds in the United States from cardiovascular disease. In Australia, over one million people are living with heart disease, stroke or vascular conditions, according to the Australian Government Department of Health.
In the first part of the analysis, the aim was to examine the relationships between the variables considered for predicting which people are at high risk of cardiovascular disease. We analyse the records of 299 patients with heart failure from the dataset ‘heart_failure.csv’. The dataset was sourced from GitHub: vaksakalli/datasets (2020), retrieved 27 September 2020 from https://github.com/vaksakalli/datasets/blob/master/heart_failure.csv. These medical records contain the patients’ physical characteristics and certain laboratory test results. A logistic regression model can be fitted to predict whether a person suffering from heart failure will survive.
In this part of the analysis, the aim is to estimate the probability of a patient’s survival status from his or her medical records. For this analysis, a binomial logistic regression model was fitted. The response variable is the binary variable ‘Death Event of a Patient’ (DEATH_EVENT), which has two levels as coded below: 0 (‘No’, the patient survived) and 1 (‘Yes’, the patient died). The model is used to predict the survival status of a patient with heart failure. The categorical variables ‘Does Patient Suffer from Anaemia’, ‘Does Patient Suffer from Diabetes’, ‘Does Patient Suffer from High Blood Pressure’, ‘Gender of the Patient’ and ‘Does the Patient Smoke’ are also used in the analysis. The significant predictors identified by the binomial logistic regression model are then analysed to compare survival status across the different levels of each predictor.
library(dplyr)
library(tidyr)
library(car)
library(knitr)
library(readr)
library(ResourceSelection)
library(ggplot2)
library(oddsratio)
** Loading the file into RStudio
risk <- read_csv("~/R/ProjectGroup74_Data.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## age = col_double(),
## anaemia = col_double(),
## creatinine_phosphokinase = col_double(),
## diabetes = col_double(),
## ejection_fraction = col_double(),
## high_blood_pressure = col_double(),
## platelets = col_double(),
## serum_creatinine = col_double(),
## serum_sodium = col_double(),
## sex = col_double(),
## smoking = col_double(),
## time = col_double(),
## DEATH_EVENT = col_double()
## )
head(risk)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 13
## age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75 0 582 0 20 1
## 2 55 0 7861 0 38 0
## 3 65 0 146 0 20 0
## 4 50 1 111 0 20 0
## 5 65 1 160 1 20 0
## 6 90 1 47 0 40 1
## # ... with 7 more variables: platelets <dbl>, serum_creatinine <dbl>,
## # serum_sodium <dbl>, sex <dbl>, smoking <dbl>, time <dbl>, DEATH_EVENT <dbl>
** Checking structure of Dataset
str(risk)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : num 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: num 582 7861 146 111 160 ...
## $ diabetes : num 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : num 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : num 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : num 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : num 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : num 0 0 1 0 0 1 0 1 0 1 ...
## $ time : num 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : num 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. anaemia = col_double(),
## .. creatinine_phosphokinase = col_double(),
## .. diabetes = col_double(),
## .. ejection_fraction = col_double(),
## .. high_blood_pressure = col_double(),
## .. platelets = col_double(),
## .. serum_creatinine = col_double(),
## .. serum_sodium = col_double(),
## .. sex = col_double(),
## .. smoking = col_double(),
## .. time = col_double(),
## .. DEATH_EVENT = col_double()
## .. )
** Converting variables with binary response to Factor
risk$anaemia <- factor(risk$anaemia, levels = c(0,1), labels = c("No","Yes"))
risk$diabetes <- factor(risk$diabetes, levels = c(0,1), labels = c("No","Yes"))
risk$high_blood_pressure <- factor(risk$high_blood_pressure, levels = c(0,1), labels = c("No","Yes"))
risk$sex <- factor(risk$sex, levels = c(0,1), labels = c("Female","Male"))
risk$smoking <- factor(risk$smoking, levels = c(0,1), labels = c("No","Yes"))
risk$DEATH_EVENT <- factor(risk$DEATH_EVENT,levels = c(0,1), labels = c("No","Yes"))
min(risk$age,na.rm = FALSE)
## [1] 40
max(risk$age,na.rm = FALSE)
## [1] 95
# Bin age into ten-year groups (observed ages run from 40 to 95).
age_levels <- c('35 to 44 yrs','45 to 54 yrs','55 to 64 yrs','65 to 74 yrs','75 to 84 yrs','85 and above')
risk <- risk %>% mutate(Age_Group =
  case_when(age>=35 & age<=44 ~ '35 to 44 yrs',
            age>=45 & age<=54 ~ '45 to 54 yrs',
            age>=55 & age<=64 ~ '55 to 64 yrs',
            age>=65 & age<=74 ~ '65 to 74 yrs',
            age>=75 & age<=84 ~ '75 to 84 yrs',
            age>=85 & age<=95 ~ '85 and above'))
# Fix the level order so the groups sort by age rather than alphabetically.
risk$Age_Group <- factor(risk$Age_Group, levels = age_levels)
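The age bins above are written as closed integer ranges, so a fractional age such as 44.5 would match no condition and come back as NA. As a hedged alternative sketch (not the original authors' method), base R's `cut()` can build the same groups from a single vector of break points. The ages below are hypothetical values chosen to sit on the bin edges, and the names `age_labels` and `age_group` are illustrative.

```r
# Hypothetical ages chosen to hit bin edges; not taken from the dataset.
age <- c(40, 44, 44.5, 45, 74, 75, 95)

age_labels <- c("35 to 44 yrs", "45 to 54 yrs", "55 to 64 yrs",
                "65 to 74 yrs", "75 to 84 yrs", "85 and above")

# right = FALSE makes each bin closed on the left: [35,45), [45,55), ...
# The last break is Inf so that every age of 85 or more lands in the top group.
age_group <- cut(age,
                 breaks = c(35, 45, 55, 65, 75, 85, Inf),
                 labels = age_labels,
                 right  = FALSE)

print(data.frame(age, age_group))
```

Because `cut()` returns a factor with the levels in break order, no separate `factor()` call is needed in this variant.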
head(risk)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 14
## age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 75 No 582 No 20 Yes
## 2 55 No 7861 No 38 No
## 3 65 No 146 No 20 No
## 4 50 Yes 111 No 20 No
## 5 65 Yes 160 Yes 20 No
## 6 90 Yes 47 No 40 Yes
## # ... with 8 more variables: platelets <dbl>, serum_creatinine <dbl>,
## # serum_sodium <dbl>, sex <fct>, smoking <fct>, time <dbl>,
## # DEATH_EVENT <fct>, Age_Group <fct>
** Summary of Data
mlr::summarizeColumns(risk) %>% knitr::kable(caption = "Summary of Dataset")
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| age | numeric | 0 | 60.83389 | 1.189481e+01 | 60.0 | 14.82600 | 40.0 | 95.0 | 0 |
| anaemia | factor | 0 | NA | 4.314381e-01 | NA | NA | 129.0 | 170.0 | 2 |
| creatinine_phosphokinase | numeric | 0 | 581.83946 | 9.702879e+02 | 250.0 | 269.83320 | 23.0 | 7861.0 | 0 |
| diabetes | factor | 0 | NA | 4.180602e-01 | NA | NA | 125.0 | 174.0 | 2 |
| ejection_fraction | numeric | 0 | 38.08361 | 1.183484e+01 | 38.0 | 11.86080 | 14.0 | 80.0 | 0 |
| high_blood_pressure | factor | 0 | NA | 3.511706e-01 | NA | NA | 105.0 | 194.0 | 2 |
| platelets | numeric | 0 | 263358.02926 | 9.780424e+04 | 262000.0 | 65234.40000 | 25100.0 | 850000.0 | 0 |
| serum_creatinine | numeric | 0 | 1.39388 | 1.034510e+00 | 1.1 | 0.29652 | 0.5 | 9.4 | 0 |
| serum_sodium | numeric | 0 | 136.62542 | 4.412477e+00 | 137.0 | 4.44780 | 113.0 | 148.0 | 0 |
| sex | factor | 0 | NA | 3.511706e-01 | NA | NA | 105.0 | 194.0 | 2 |
| smoking | factor | 0 | NA | 3.210702e-01 | NA | NA | 96.0 | 203.0 | 2 |
| time | numeric | 0 | 130.26087 | 7.761421e+01 | 115.0 | 105.26460 | 4.0 | 285.0 | 0 |
| DEATH_EVENT | factor | 0 | NA | 3.210702e-01 | NA | NA | 96.0 | 203.0 | 2 |
| Age_Group | factor | 0 | NA | 7.023411e-01 | NA | NA | 14.0 | 89.0 | 6 |
** Checking levels of Factor Variables
levels(risk$anaemia)
## [1] "No" "Yes"
levels(risk$diabetes)
## [1] "No" "Yes"
levels(risk$high_blood_pressure)
## [1] "No" "Yes"
levels(risk$sex)
## [1] "Female" "Male"
levels(risk$smoking)
## [1] "No" "Yes"
levels(risk$DEATH_EVENT)
## [1] "No" "Yes"
levels(risk$Age_Group)
## [1] "35 to 44 yrs" "45 to 54 yrs" "55 to 64 yrs" "65 to 74 yrs" "75 to 84 yrs"
## [6] "85 and above"
model_one<-glm(DEATH_EVENT~1,data=risk,family='binomial')
model_two<-glm(DEATH_EVENT~.,data=risk,family='binomial')
final_model <- step(model_one,
scope = list(lower = model_one,
upper = model_two),
direction = "forward")
## Start: AIC=377.35
## DEATH_EVENT ~ 1
##
## Df Deviance AIC
## + time 1 279.07 283.07
## + serum_creatinine 1 347.25 351.25
## + ejection_fraction 1 351.97 355.97
## + age 1 355.99 359.99
## + Age_Group 5 350.23 362.23
## + serum_sodium 1 364.02 368.02
## <none> 375.35 377.35
## + high_blood_pressure 1 373.49 377.49
## + anaemia 1 374.04 378.04
## + creatinine_phosphokinase 1 374.23 378.23
## + platelets 1 374.61 378.61
## + smoking 1 375.30 379.30
## + sex 1 375.34 379.34
## + diabetes 1 375.35 379.35
##
## Step: AIC=283.07
## DEATH_EVENT ~ time
##
## Df Deviance AIC
## + ejection_fraction 1 256.08 262.08
## + serum_creatinine 1 259.64 265.64
## + serum_sodium 1 269.83 275.83
## + age 1 271.46 277.46
## + Age_Group 5 267.46 281.46
## <none> 279.07 283.07
## + creatinine_phosphokinase 1 277.90 283.90
## + platelets 1 277.92 283.92
## + smoking 1 278.81 284.81
## + high_blood_pressure 1 278.96 284.96
## + sex 1 279.02 285.02
## + diabetes 1 279.06 285.06
## + anaemia 1 279.07 285.07
##
## Step: AIC=262.08
## DEATH_EVENT ~ time + ejection_fraction
##
## Df Deviance AIC
## + serum_creatinine 1 235.41 243.41
## + age 1 244.51 252.51
## + Age_Group 5 240.59 256.59
## + serum_sodium 1 249.73 257.73
## <none> 256.08 262.08
## + sex 1 254.98 262.98
## + smoking 1 255.20 263.20
## + platelets 1 255.22 263.22
## + creatinine_phosphokinase 1 255.33 263.33
## + high_blood_pressure 1 255.93 263.93
## + diabetes 1 256.05 264.05
## + anaemia 1 256.08 264.08
##
## Step: AIC=243.41
## DEATH_EVENT ~ time + ejection_fraction + serum_creatinine
##
## Df Deviance AIC
## + age 1 226.30 236.30
## + Age_Group 5 221.91 239.91
## + serum_sodium 1 232.02 242.02
## <none> 235.41 243.41
## + creatinine_phosphokinase 1 234.63 244.63
## + sex 1 234.69 244.69
## + platelets 1 234.90 244.90
## + smoking 1 235.20 245.20
## + diabetes 1 235.33 245.33
## + high_blood_pressure 1 235.41 245.41
## + anaemia 1 235.41 245.41
##
## Step: AIC=236.3
## DEATH_EVENT ~ time + ejection_fraction + serum_creatinine + age
##
## Df Deviance AIC
## + serum_sodium 1 223.49 235.49
## <none> 226.30 236.30
## + sex 1 225.08 237.08
## + creatinine_phosphokinase 1 225.12 237.12
## + diabetes 1 225.87 237.87
## + platelets 1 225.93 237.93
## + smoking 1 225.95 237.95
## + high_blood_pressure 1 226.27 238.27
## + anaemia 1 226.28 238.28
## + Age_Group 5 221.89 241.89
##
## Step: AIC=235.49
## DEATH_EVENT ~ time + ejection_fraction + serum_creatinine + age +
## serum_sodium
##
## Df Deviance AIC
## <none> 223.49 235.49
## + sex 1 222.04 236.04
## + creatinine_phosphokinase 1 222.18 236.18
## + smoking 1 223.09 237.09
## + diabetes 1 223.25 237.25
## + platelets 1 223.26 237.26
## + high_blood_pressure 1 223.46 237.46
## + anaemia 1 223.48 237.48
## + Age_Group 5 219.72 241.72
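The AIC column in the trace above can be reproduced by hand. For an ungrouped 0/1 response the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients (intercept included). The sketch below checks two rows of the trace; the helper `aic_from_deviance` is a hypothetical name introduced only for illustration.

```r
# Deviances copied from the step() trace and model output above.
dev_time_only <- 279.07     # DEATH_EVENT ~ time
dev_final     <- 223.4863   # final five-predictor model

# k counts estimated coefficients, including the intercept.
aic_from_deviance <- function(deviance, k) deviance + 2 * k

aic_time_only <- aic_from_deviance(dev_time_only, k = 2)
aic_final     <- aic_from_deviance(dev_final, k = 6)

print(c(time_only = aic_time_only, final = aic_final))
```

This also shows why `Age_Group` is penalised more heavily than `age`: it adds five coefficients instead of one, so it needs a much larger drop in deviance to lower the AIC.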
summary(final_model)
##
## Call:
## glm(formula = DEATH_EVENT ~ time + ejection_fraction + serum_creatinine +
## age + serum_sodium, family = "binomial", data = risk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1590 -0.5888 -0.2281 0.5144 2.7959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.493034 5.405768 1.756 0.07907 .
## time -0.020895 0.002916 -7.166 7.74e-13 ***
## ejection_fraction -0.073430 0.015785 -4.652 3.29e-06 ***
## serum_creatinine 0.685990 0.174044 3.941 8.10e-05 ***
## age 0.042466 0.015030 2.825 0.00472 **
## serum_sodium -0.064557 0.038377 -1.682 0.09254 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 223.49 on 293 degrees of freedom
## AIC: 235.49
##
## Number of Fisher Scoring iterations: 6
final_model$coefficients
## (Intercept) time ejection_fraction serum_creatinine
## 9.49303414 -0.02089486 -0.07342996 0.68599031
## age serum_sodium
## 0.04246630 -0.06455724
final_model$residuals
## 1 2 3 4 5 6 7
## 1.017654 1.411569 1.040665 1.085390 1.006866 1.043732 1.035209
## 8 9 10 11 12 13 14
## 2.318037 2.837611 1.000344 1.018955 1.190020 1.414152 1.602653
## 15 16 17 18 19 20 21
## -3.453061 1.318497 1.185609 1.087730 1.140229 1.511043 -9.420945
## 22 23 24 25 26 27 28
## 1.142225 1.378260 -1.275774 1.074332 1.196605 1.154780 1.481879
## 29 30 31 32 33 34 35
## 1.019244 1.079600 1.064890 1.063948 1.413165 -3.058110 2.456635
## 36 37 38 39 40 41 42
## 1.050965 1.349247 1.997885 -7.991555 1.219881 1.052129 1.454654
## 43 44 45 46 47 48 49
## 1.725040 -2.252196 5.335578 1.505178 1.285358 2.085918 1.006756
## 50 51 52 53 54 55 56
## 1.618099 1.240844 1.255739 1.160534 2.094040 1.273070 1.043713
## 57 58 59 60 61 62 63
## -6.532160 -1.593443 1.394210 1.140688 1.973254 1.579124 -1.807407
## 64 65 66 67 68 69 70
## 4.050872 -1.021892 1.048522 1.394946 1.366166 1.403764 1.210604
## 71 72 73 74 75 76 77
## -1.199265 -1.635356 1.303424 -1.416804 1.201533 1.574747 -1.138683
## 78 79 80 81 82 83 84
## -1.241322 -2.344529 -1.236334 -2.415893 -1.384387 1.166997 -2.215607
## 85 86 87 88 89 90 91
## 1.890104 -1.108479 -1.617661 -1.101027 -1.179137 -1.891827 -1.476516
## 92 93 94 95 96 97 98
## -1.433854 -1.056279 1.389520 -1.284943 -1.074231 -3.315221 -1.175278
## 99 100 101 102 103 104 105
## -2.503876 -1.427770 -2.286287 -1.619360 -3.003874 -1.334153 -1.332124
## 106 107 108 109 110 111 112
## 1.464755 -1.281684 -1.273546 -1.746620 -1.278004 3.415807 -1.598131
## 113 114 115 116 117 118 119
## -2.243416 6.889758 -1.897051 -1.310361 -1.072187 -2.019207 -1.079088
## 120 121 122 123 124 125 126
## 1.377390 -1.130545 -1.563642 -1.284262 -1.410530 1.373290 -1.110509
## 127 128 129 130 131 132 133
## 1.213489 -1.054024 -1.553177 -1.952472 -1.036677 -10.285516 -1.171815
## 134 135 136 137 138 139 140
## -1.052752 -2.096509 -1.588713 -1.073156 -4.805649 -1.451787 -1.300201
## 141 142 143 144 145 146 147
## 2.076399 -1.161746 -1.463822 -1.194498 1.555295 -1.212428 -1.319361
## 148 149 150 151 152 153 154
## -1.066245 1.487395 -1.334761 2.169881 -1.045931 -1.112267 -1.228765
## 155 156 157 158 159 160 161
## -1.348164 -1.840338 -1.302538 -1.440177 -1.424862 -1.128356 -1.368930
## 162 163 164 165 166 167 168
## -1.120155 -1.208919 9.467110 6.546604 2.552355 -1.028216 1.494184
## 169 170 171 172 173 174 175
## -1.160086 -1.309917 -1.184190 -1.066724 -1.027406 -1.265122 -1.203049
## 176 177 178 179 180 181 182
## -1.026296 -1.198936 -1.029564 -1.021353 -1.064707 -1.103721 3.901498
## 183 184 185 186 187 188 189
## 3.131468 2.541868 5.275709 5.294704 49.820074 4.470562 -1.077648
## 190 191 192 193 194 195 196
## -1.015468 -1.591255 -1.029692 -1.055593 -1.177540 5.234647 12.341878
## 197 198 199 200 201 202 203
## -1.037375 -1.111235 -1.182068 -1.621470 -1.033826 -1.006860 -1.016276
## 204 205 206 207 208 209 210
## -1.922290 -1.090421 -1.024994 -1.014896 -1.172492 -1.039951 -1.064110
## 211 212 213 214 215 216 217
## -1.447239 -1.004351 -1.059225 8.382024 -1.084400 -1.163596 -1.022677
## 218 219 220 221 222 223 224
## 2.127165 -1.110618 -1.032861 2.503832 -1.014946 -1.024425 -1.070754
## 225 226 227 228 229 230 231
## -1.091995 -1.044135 -1.166479 -1.033449 -4.095248 -1.237435 5.570897
## 232 233 234 235 236 237 238
## -1.095695 -1.018829 -1.036973 -1.018000 -1.034550 -1.015601 -1.161458
## 239 240 241 242 243 244 245
## -1.042190 -1.019854 -1.073113 -1.103595 -1.021639 -1.037632 -1.056664
## 246 247 248 249 250 251 252
## -1.047767 14.913772 -1.311853 -1.013154 -1.039392 -1.025009 -1.019908
## 253 254 255 256 257 258 259
## -1.016899 -1.100748 -1.004331 -1.030317 -1.052806 -1.019692 -1.033241
## 260 261 262 263 264 265 266
## -1.006809 -1.014997 -1.022693 7.424996 -1.004848 -1.019481 -1.012412
## 267 268 269 270 271 272 273
## 8.931869 -1.020458 -1.007407 -1.013403 -1.039378 -1.012155 -1.035642
## 274 275 276 277 278 279 280
## -1.004807 -1.032081 -1.008221 -1.026354 -1.023560 -1.017841 -1.018504
## 281 282 283 284 285 286 287
## -1.022575 -1.072596 -1.164218 -1.019940 -1.006107 -1.014636 -1.026660
## 288 289 290 291 292 293 294
## -1.003366 -1.016936 -1.030099 -1.001418 -1.019585 -1.007971 -1.014911
## 295 296 297 298 299
## -1.008370 -1.008443 -1.000769 -1.004920 -1.004867
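The values printed by `final_model$residuals` are working residuals, not the deviance residuals summarised by `summary()`, which is why some of them are very large (for example 49.82 for patient 187). The sketch below recomputes both kinds for three patients from their fitted probabilities. The outcomes in `y` are an assumption inferred from the sign of each printed residual, and the variable names are illustrative.

```r
# Fitted probabilities copied from final_model$fitted.values.
mu <- c("1" = 0.9826518339, "15" = 0.7104018673, "187" = 0.0200722304)

# Outcomes (1 = died, 0 = survived), inferred from the residual signs.
y <- c("1" = 1, "15" = 0, "187" = 1)

# Working residual for the logit link: (y - mu) / (mu * (1 - mu)).
working <- (y - mu) / (mu * (1 - mu))

# Deviance residual for a single 0/1 observation.
deviance_res <- sign(y - mu) *
  sqrt(-2 * (y * log(mu) + (1 - y) * log(1 - mu)))

print(round(working, 6))
print(round(deviance_res, 4))
```

A patient who died despite a fitted probability of about 0.02 gets a working residual of roughly 1/0.02, while the corresponding deviance residual is about 2.80, in line with the maximum reported in the model summary.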
final_model$fitted.values
## 1 2 3 4 5 6
## 0.9826518339 0.7084312920 0.9609238764 0.9213277502 0.9931804525 0.9581000445
## 7 8 9 10 11 12
## 0.9659885517 0.4313994974 0.3524091133 0.9996556600 0.9813975072 0.8403218170
## 13 14 15 16 17 18
## 0.7071376317 0.6239655388 0.7104018673 0.7584392886 0.8434483918 0.9193458897
## 19 20 21 22 23 24
## 0.8770169306 0.6617944261 0.8938535364 0.8754844882 0.7255526185 0.2161623776
## 25 26 27 28 29 30
## 0.9308112200 0.8356973657 0.8659655578 0.6748190824 0.9811191910 0.9262687613
## 31 32 33 34 35 36
## 0.9390639307 0.9398952837 0.7076313370 0.6730006900 0.4070608714 0.9515061862
## 37 38 39 40 41 42
## 0.7411543604 0.5005292468 0.8748679068 0.8197523148 0.9504534707 0.6874486126
## 43 44 45 46 47 48
## 0.5796966257 0.5559889743 0.1874211070 0.6643732333 0.7779934925 0.4794051349
## 49 50 51 52 53 54
## 0.9932891332 0.6180091519 0.8059029943 0.7963440941 0.8616726198 0.4775458579
## 55 56 57 58 59 60
## 0.7855028311 0.9581180184 0.8469112806 0.3724281338 0.7172518820 0.8766642836
## 61 62 63 64 65 66
## 0.5067772232 0.6332623261 0.4467212731 0.2468604445 0.0214231088 0.9537236907
## 67 68 69 70 71 72
## 0.7168735346 0.7319755764 0.7123705434 0.8260336160 0.1661561334 0.3885123707
## 73 74 75 76 77 78
## 0.7672102471 0.2941860454 0.8322701660 0.6350224518 0.1217925094 0.1944069575
## 79 80 81 82 83 84
## 0.5734750855 0.1911569538 0.5860744175 0.2776587159 0.8569001169 0.5486564820
## 85 86 87 88 89 90
## 0.5290713505 0.0978630377 0.3818236385 0.0917568623 0.1519220935 0.4714103576
## 91 92 93 94 95 96
## 0.3227298082 0.3025787513 0.0532805606 0.7196730094 0.2217550825 0.0691015807
## 97 98 99 100 101 102
## 0.6983610286 0.1491375468 0.6006192505 0.2996069650 0.5626095891 0.3824721554
## 103 104 105 106 107 108
## 0.6670965989 0.2504607070 0.2493191330 0.6827080691 0.2197763297 0.2147905691
## 109 110 111 112 113 114
## 0.4274655336 0.2175298218 0.2927566009 0.3742691933 0.5542511027 0.1451429717
## 115 116 117 118 119 120
## 0.4728659430 0.2368517756 0.0673269354 0.5047559534 0.0732915452 0.7260110019
## 121 122 123 124 125 126
## 0.1154706083 0.3604676071 0.2213425870 0.2910463805 0.7281784898 0.0995123264
## 127 128 129 130 131 132
## 0.8240700981 0.0512546648 0.3561582448 0.4878288477 0.0353791907 0.9027758988
## 133 134 135 136 137 138
## 0.1466227941 0.0501088551 0.5230166795 0.3705598598 0.0681694002 0.7919115502
## 139 140 141 142 143 144
## 0.3111938817 0.2308883275 0.4816029144 0.1392267366 0.3168568352 0.1628280473
## 145 146 147 148 149 150
## 0.6429648430 0.1752087478 0.2420572807 0.0621294572 0.6723162987 0.2508020275
## 151 152 153 154 155 156
## 0.4608548535 0.0439138922 0.1009355915 0.1861744706 0.2582505598 0.4566216285
## 157 158 159 160 161 162
## 0.2322679856 0.3056409757 0.2981778465 0.1137548286 0.2695023317 0.1072663028
## 163 164 165 166 167 168
## 0.1728150402 0.1056288521 0.1527509622 0.3917950881 0.0274413007 0.6692614106
## 169 170 171 172 173 174
## 0.1379948692 0.2365930632 0.1555408767 0.0625504636 0.0266752066 0.2095625209
## 175 176 177 178 179 180
## 0.1687789520 0.0256220836 0.1659271591 0.0287154653 0.0209069876 0.0607746139
## 181 182 183 184 185 186
## 0.0939743279 0.2563117914 0.3193390678 0.3934115276 0.1895479773 0.1888679616
## 187 188 189 190 191 192
## 0.0200722304 0.2236855277 0.0720530224 0.0152322516 0.3715653028 0.0288360569
## 193 194 195 196 197 198
## 0.0526651557 0.1507718701 0.1910348345 0.0810249444 0.0360287550 0.1001006388
## 199 200 201 202 203 204
## 0.1540250239 0.3832758589 0.0327190766 0.0068134553 0.0160151526 0.4797872447
## 205 206 207 208 209 210
## 0.0829228258 0.0243842096 0.0146778294 0.1471160255 0.0384160513 0.0602475616
## 211 212 213 214 215 216
## 0.3090289287 0.0043321759 0.0559131158 0.1193029329 0.0778311491 0.1405951121
## 217 218 219 220 221 222
## 0.0221743335 0.4701093138 0.0996001546 0.0318152664 0.3993878821 0.0147255265
## 223 224 225 226 227 228
## 0.0238422585 0.0660787809 0.0842445492 0.0422692499 0.1427193845 0.0323663787
## 229 230 231 232 233 234
## 0.7558145424 0.1918769560 0.1795042923 0.0873375086 0.0184813355 0.0356545587
## 235 236 237 238 239 240
## 0.0176818008 0.0333964753 0.0153616498 0.1390129312 0.0404821162 0.0194673491
## 241 242 243 244 245 246
## 0.0681313942 0.0938707028 0.0211804444 0.0362668586 0.0536252867 0.0455892744
## 247 248 249 250 251 252
## 0.0670521196 0.2377194775 0.0129835800 0.0378986548 0.0243989930 0.0195189623
## 253 254 255 256 257 258
## 0.0166178968 0.0915265434 0.0043125179 0.0294249569 0.0501570358 0.0193116951
## 259 260 261 262 263 264
## 0.0321711470 0.0067630330 0.0147750511 0.0221899321 0.1346802078 0.0048247437
## 265 266 267 268 269 270
## 0.0191085604 0.0122598592 0.1119586474 0.0200481117 0.0073528299 0.0132257621
## 271 272 273 274 275 276
## 0.0378860103 0.0120091009 0.0344154340 0.0047841953 0.0310841423 0.0081535718
## 277 278 279 280 281 282
## 0.0256768309 0.0230172683 0.0175279065 0.0181674064 0.0220767464 0.0676822223
## 283 284 285 286 287 288
## 0.1410544395 0.0195503001 0.0060697844 0.0144248516 0.0259679124 0.0033551046
## 289 290 291 292 293 294
## 0.0166536995 0.0292192205 0.0014161604 0.0192089831 0.0079078897 0.0146920998
## 295 296 297 298 299
## 0.0083005108 0.0083721761 0.0007682252 0.0048956042 0.0048436298
final_model$linear.predictors
## 1 2 3 4 5 6
## 4.036768072 0.887777395 3.202383561 2.460525351 4.981119257 3.129667439
## 7 8 9 10 11 12
## 3.346454800 -0.276143514 -0.608466395 7.973536532 3.965681990 1.660624500
## 13 14 15 16 17 18
## 0.881522551 0.506414349 0.897336608 1.144141920 1.684112997 2.433492659
## 19 20 21 22 23 24
## 1.964479598 0.671301056 2.130722060 1.950347133 0.972174043 -1.288172011
## 25 26 27 28 29 30
## 2.599217774 1.626556486 1.865748338 0.730062932 3.950547945 2.530737859
## 31 32 33 34 35 36
## 2.735058290 2.749680156 0.883907697 0.721788294 -0.376129008 2.976609949
## 37 38 39 40 41 42
## 1.051977019 0.002116988 1.944702986 1.514670305 2.954026994 0.788218186
## 43 44 45 46 47 48
## 0.321528137 0.224899069 -1.466855014 0.682844361 1.254011465 -0.082426096
## 49 50 51 52 53 54
## 4.997293668 0.481106616 1.423605317 1.363599538 1.829252211 -0.089877021
## 55 56 57 58 59 60
## 1.298027522 3.130115261 1.710578325 -0.521814102 0.930870620 1.961214079
## 61 62 63 64 65 66
## 0.027110553 0.546237946 -0.213927051 -1.115427368 -3.821629174 3.025743846
## 67 68 69 70 71 72
## 0.929005772 1.004669039 0.906925159 1.557773385 -1.613118265 -0.453569713
## 73 74 75 76 77 78
## 1.192625182 -0.875139308 1.601802553 0.553824515 -1.975564034 -1.421625024
## 79 80 81 82 83 84
## 0.296043712 -1.442510050 0.347760569 -0.956105001 1.789778492 0.195243799
## 85 86 87 88 89 90
## 0.116416705 -2.221197425 -0.481814969 -2.292369838 -1.719604656 -0.114483445
## 91 92 93 94 95 96
## -0.741254829 -0.835048024 -2.877431239 0.942840210 -1.255467738 -2.600572555
## 97 98 99 100 101 102
## 0.839505359 -1.741381473 0.408045986 -0.849170157 0.251759755 -0.479068306
## 103 104 105 106 107 108
## 0.695082500 -1.096156692 -1.102246883 0.766245076 -1.266970289 -1.296287017
## 109 110 111 112 113 114
## -0.292199275 -1.280119858 -0.882033327 -0.513944951 0.217862055 -1.773214968
## 115 116 117 118 119 120
## -0.108642964 -1.170017751 -2.628494341 0.019024387 -2.537193754 0.974477216
## 121 122 123 124 125 126
## -2.036039719 -0.573335176 -1.257859503 -0.890307470 0.985400558 -2.202654956
## 127 128 129 130 131 132
## 1.544169966 -2.918333778 -0.592077834 -0.048694229 -3.305611265 2.228455713
## 133 134 135 136 137 138
## -1.761338399 -2.942149652 0.092131833 -0.529815761 -2.615155252 1.336486478
## 139 140 141 142 143 144
## -0.794543704 -1.203302014 -0.073621578 -1.821727326 -0.768254402 -1.637334769
## 145 146 147 148 149 150
## 0.588255790 -1.549152219 -1.141433420 -2.714391698 0.718680095 -1.094339372
## 151 152 153 154 155 156
## -0.156901681 -3.080617258 -2.186872071 -1.475061758 -1.055081229 -0.173950792
## 157 158 159 160 161 162
## -1.195548917 -0.820578022 -0.855989921 -2.052948123 -0.997148997 -2.118973771
## 163 164 165 166 167 168
## -1.565806431 -2.136189289 -1.713185777 -0.439772882 -3.567881228 0.704846429
## 169 170 171 172 173 174
## -1.832044718 -1.171449595 -1.691787762 -2.707189281 -3.596983290 -1.327564439
## 175 176 177 178 179 180
## -1.594305880 -3.638344613 -1.614771846 -3.521183620 -3.846543207 -2.737883311
## 181 182 183 184 185 186
## -2.266046004 -1.065227234 -0.756810838 -0.432994383 -1.452949974 -1.457382695
## 187 188 189 190 191 192
## -3.888141578 -1.244316515 -2.555572324 -4.168990829 -0.525507490 -3.516868713
## 193 194 195 196 197 198
## -2.889698560 -1.728559954 -1.443300066 -2.428501916 -3.286744096 -2.196106869
## 199 200 201 202 203 204
## -1.703374698 -0.475666839 -3.386530674 -4.982019123 -4.118075189 -0.080895107
## 205 206 207 208 209 210
## -2.403281263 -3.689133077 -4.206630514 -1.757401953 -3.220106496 -2.747154377
## 211 212 213 214 215 216
## -0.804663065 -5.437343760 -2.826419219 -1.999047803 -2.472186617 -1.810355944
## 217 218 219 220 221 222
## -3.786395929 -0.119705482 -2.201675221 -3.415476660 -0.408016251 -4.203337773
## 223 224 225 226 227 228
## -3.712164621 -2.648544406 -2.386025485 -3.120506816 -1.792884948 -3.397733336
## 229 230 231 232 233 234
## 1.129868023 -1.437860017 -1.519709559 -2.346586120 -3.972339701 -3.297572561
## 235 236 237 238 239 240
## -4.017379382 -3.365338040 -4.160400285 -1.823512527 -3.165570654 -3.919357288
## 241 242 243 244 245 246
## -2.615753715 -2.267263675 -3.833268989 -3.279910095 -2.870617871 -3.041421630
## 247 248 249 250 251 252
## -2.632879115 -1.165223315 -4.331001190 -3.234204174 -3.688511836 -3.916656888
## 253 254 255 256 257 258
## -4.080517521 -2.295136649 -5.441911493 -3.496045536 -2.941137871 -3.927543801
## 259 260 261 262 263 264
## -3.403985273 -4.989497818 -4.199929965 -3.785676768 -1.860196003 -5.329161254
## 265 266 267 268 269 270
## -3.938325369 -4.389089202 -2.070888726 -3.889368507 -4.905290027 -4.312274676
## 271 272 273 274 275 276
## -3.234551012 -4.410008718 -3.334228561 -5.337641768 -3.439479979 -4.801112199
## 277 278 279 280 281 282
## -3.636153981 -3.748224246 -4.026277672 -3.989791690 -3.790906340 -2.622850169
## 283 284 285 286 287 288
## -1.806559633 -3.915020706 -5.098343911 -4.224272855 -3.624582610 -5.693911592
## 289 290 291 292 293 294
## -4.078328964 -3.503273945 -6.558388862 -3.932981366 -4.831954995 -4.205644260
## 295 296 297 298 299
## -4.783103066 -4.774434020 -7.170659072 -5.314509946 -5.325235478
final_model$deviance
## [1] 223.4863
final_model$aic
## [1] 235.4863
final_model$null.deviance
## [1] 375.3488
final_model$iter
## [1] 6
final_model$df.residual
## [1] 293
final_model$df.null
## [1] 298
final_model
##
## Call: glm(formula = DEATH_EVENT ~ time + ejection_fraction + serum_creatinine +
## age + serum_sodium, family = "binomial", data = risk)
##
## Coefficients:
## (Intercept) time ejection_fraction serum_creatinine
## 9.49303 -0.02089 -0.07343 0.68599
## age serum_sodium
## 0.04247 -0.06456
##
## Degrees of Freedom: 298 Total (i.e. Null); 293 Residual
## Null Deviance: 375.3
## Residual Deviance: 223.5 AIC: 235.5
** Final Model
Logit(Probability of Death by Cardiovascular Disease) = 9.49303 - 0.02089 x time - 0.07343 x ejection_fraction + 0.68599 x serum_creatinine + 0.04247 x age - 0.06456 x serum_sodium
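To make the fitted equation concrete, the sketch below plugs the first patient's values (time 4, ejection fraction 20, serum creatinine 1.9, age 75, serum sodium 130, as printed by `head(risk)` and `str(risk)`) into the coefficients from `final_model$coefficients` and converts the logit to a probability. The names `coefs`, `patient`, `eta` and `p_death` are illustrative and not part of the original analysis.

```r
# Coefficients copied from final_model$coefficients.
coefs <- c(intercept         =  9.49303414,
           time              = -0.02089486,
           ejection_fraction = -0.07342996,
           serum_creatinine  =  0.68599031,
           age               =  0.04246630,
           serum_sodium      = -0.06455724)

# Covariate values for the first row of the data (intercept term is 1).
patient <- c(intercept = 1, time = 4, ejection_fraction = 20,
             serum_creatinine = 1.9, age = 75, serum_sodium = 130)

# Linear predictor (logit scale), then inverse logit to get a probability.
eta     <- sum(coefs * patient[names(coefs)])
p_death <- plogis(eta)

print(c(logit = eta, probability_of_death = p_death))
```

Up to rounding, the result should agree with the first entries of `final_model$linear.predictors` (about 4.0368) and `final_model$fitted.values` (about 0.9827).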
summary(final_model)
##
## Call:
## glm(formula = DEATH_EVENT ~ time + ejection_fraction + serum_creatinine +
## age + serum_sodium, family = "binomial", data = risk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1590 -0.5888 -0.2281 0.5144 2.7959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.493034 5.405768 1.756 0.07907 .
## time -0.020895 0.002916 -7.166 7.74e-13 ***
## ejection_fraction -0.073430 0.015785 -4.652 3.29e-06 ***
## serum_creatinine 0.685990 0.174044 3.941 8.10e-05 ***
## age 0.042466 0.015030 2.825 0.00472 **
## serum_sodium -0.064557 0.038377 -1.682 0.09254 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 223.49 on 293 degrees of freedom
## AIC: 235.49
##
## Number of Fisher Scoring iterations: 6
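The coefficients in the summary are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate each estimate into an odds ratio per one-unit increase in the predictor. The sketch below does this from the printed estimates and standard errors, with Wald-type 95% intervals; the Wald interval is an assumption made here for simplicity and is not necessarily what the authors intended.

```r
# Estimates and standard errors copied from summary(final_model).
est <- c(time              = -0.020895,
         ejection_fraction = -0.073430,
         serum_creatinine  =  0.685990,
         age               =  0.042466,
         serum_sodium      = -0.064557)
se  <- c(0.002916, 0.015785, 0.174044, 0.015030, 0.038377)

z <- qnorm(0.975)  # two-sided 95% normal quantile

# Odds ratio and Wald interval per one-unit increase in each predictor.
or_table <- data.frame(odds_ratio = exp(est),
                       lower_95   = exp(est - z * se),
                       upper_95   = exp(est + z * se))

print(round(or_table, 3))
```

An odds ratio above 1 (serum creatinine, age) means higher odds of death as the predictor increases, and one below 1 (time, ejection fraction) means lower odds. The interval for serum sodium spans 1, consistent with its p-value of 0.09 in the summary.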
par(mfrow = c(2,2))
plot(final_model)
plot(density(resid(final_model, type='response')))
lines(density(resid(final_model, type='response')), col='red')
plot(density(resid(final_model, type='pearson')))
lines(density(resid(final_model, type='pearson')), col='red')
plot(density(rstandard(final_model, type='pearson')))
lines(density(rstandard(final_model, type='pearson')), col='red')
In the residuals vs fitted plot, the predicted values lie mostly between -4 and 1, with a few outliers beyond 1, and the points show a roughly linear pattern. In the normal Q-Q plot, the values range from about -3 to 3 and follow an approximately linear trend. In the scale-location plot, the predicted values lie between -4 and 2, with some outliers beyond 2; the square root of the standardised deviance residuals is close to 1 on the y-axis where the predicted value is 0 on the x-axis. In the residuals vs leverage plot, most leverage values lie between 0.00 and 0.05, with a few points beyond 0.05.
# The bar colour is set on the geom; a colour argument passed to ggplot() itself has no effect.
survival_fracejec <- ggplot(data = risk, mapping = aes(x = DEATH_EVENT, y = ejection_fraction)) +
geom_bar(stat = "identity", fill = "blue")+
theme_light()+
labs(x = "Survival Status")
survival_fracejec
survival_creatinine <- ggplot(data = risk, mapping = aes(x = DEATH_EVENT, y = serum_creatinine)) +
geom_bar(stat = "identity", fill = "blue")+
theme_light()+
labs(x = "Survival Status")
survival_creatinine
survival_sodium <- ggplot(data = risk, mapping = aes(x = DEATH_EVENT, y = serum_sodium)) +
geom_bar(stat = "identity")+
theme_light()+
labs(x = "Survival Status")
survival_sodium
modelone_res <- model_one$deviance
modeltwo_res <- model_two$deviance
finalmodel_res <- final_model$deviance
modelone_dfres <- model_one$df.residual
modeltwo_dfres <- model_two$df.residual
finalmodel_dfres <- final_model$df.residual
modelone_resdf <- modelone_res/modelone_dfres
modeltwo_resdf <- modeltwo_res/modeltwo_dfres
finalmodel_resdf <- finalmodel_res/finalmodel_dfres
res <- c(modelone_res,modelone_dfres,modelone_resdf)
dfres <- c(modeltwo_res,modeltwo_dfres,modeltwo_resdf)
resdf <- c(finalmodel_res,finalmodel_dfres,finalmodel_resdf)
res
## [1] 375.34878 298.00000 1.25956
dfres
## [1] 215.4878046 281.0000000 0.7668605
resdf
## [1] 223.4862746 293.0000000 0.7627518
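The deviance figures printed above can also be turned into a formal test. The following is a minimal sketch that uses only those printed numbers. It assumes that the first vector (375.35 on 298 degrees of freedom, which matches the null deviance reported by `summary(final_model)`) belongs to an intercept-only fit; the variable names are illustrative and not part of the original analysis.

```r
# Deviances and residual degrees of freedom copied from the output above.
null_dev  <- 375.34878   # assumed to be the intercept-only fit
null_df   <- 298
final_dev <- 223.4862746 # final_model
final_df  <- 293

# Likelihood-ratio statistic: the drop in deviance when the predictors are added.
lr_stat <- null_dev - final_dev
lr_df   <- null_df - final_df

# Upper-tail chi-squared p-value for that drop in deviance.
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Residual deviance per degree of freedom, the ratio computed above.
dev_ratio <- final_dev / final_df

c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p, dev_ratio = dev_ratio)
```

For ungrouped binary responses the deviance-to-df ratio is only a rough guide, so the likelihood-ratio comparison against the null model is usually the more informative of the two figures.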
gender.table <- with(risk, table(DEATH_EVENT, sex))
gender.table
## sex
## DEATH_EVENT Female Male
## No 71 132
## Yes 34 62
chisq.test(gender.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gender.table
## X-squared = 0, df = 1, p-value = 1
smoke.table <- with(risk, table(DEATH_EVENT, smoking))
smoke.table
## smoking
## DEATH_EVENT No Yes
## No 137 66
## Yes 66 30
chisq.test(smoke.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: smoke.table
## X-squared = 0.0073315, df = 1, p-value = 0.9318
anaemia.table <- with(risk, table(DEATH_EVENT, anaemia))
anaemia.table
## anaemia
## DEATH_EVENT No Yes
## No 120 83
## Yes 50 46
chisq.test(anaemia.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: anaemia.table
## X-squared = 1.0422, df = 1, p-value = 0.3073
diabetes.table <- with(risk, table(DEATH_EVENT, diabetes))
diabetes.table
## diabetes
## DEATH_EVENT No Yes
## No 118 85
## Yes 56 40
chisq.test(diabetes.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: diabetes.table
## X-squared = 2.1617e-30, df = 1, p-value = 1
highbp.table <- with(risk, table(DEATH_EVENT, high_blood_pressure))
highbp.table
## high_blood_pressure
## DEATH_EVENT No Yes
## No 137 66
## Yes 57 39
chisq.test(highbp.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: highbp.table
## X-squared = 1.5435, df = 1, p-value = 0.2141
highcreatinine.table <- with(risk, table(DEATH_EVENT, serum_creatinine))
highcreatinine.table
## serum_creatinine
## DEATH_EVENT 0.5 0.6 0.7 0.75 0.8 0.9 1 1.1 1.18 1.2 1.3 1.4 1.5 1.6 1.7 1.8
## No 1 2 18 1 23 27 35 23 11 15 13 7 3 3 5 3
## Yes 0 2 1 0 1 5 15 9 0 9 7 2 2 3 4 1
## serum_creatinine
## DEATH_EVENT 1.83 1.9 2 2.1 2.2 2.3 2.4 2.5 2.7 2.9 3 3.2 3.4 3.5 3.7 3.8 4
## No 0 0 0 2 0 2 1 0 2 0 0 1 1 1 0 1 0
## Yes 8 5 1 3 1 1 1 3 1 1 2 0 0 1 1 0 1
## serum_creatinine
## DEATH_EVENT 4.4 5 5.8 6.1 6.8 9 9.4
## No 0 1 0 1 0 0 0
## Yes 1 0 1 0 1 1 1
chisq.test(highcreatinine.table)
## Warning in chisq.test(highcreatinine.table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: highcreatinine.table
## X-squared = 92.428, df = 39, p-value = 3.145e-06
highejecfrac.table <- with(risk, table(DEATH_EVENT, ejection_fraction))
highejecfrac.table
## ejection_fraction
## DEATH_EVENT 14 15 17 20 25 30 35 38 40 45 50 55 60 62 65 70 80
## No 0 0 1 2 18 21 42 25 33 15 15 2 27 1 0 0 1
## Yes 1 2 1 16 18 13 7 15 4 5 6 1 4 1 1 1 0
chisq.test(highejecfrac.table)
## Warning in chisq.test(highejecfrac.table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: highejecfrac.table
## X-squared = 65.332, df = 16, p-value = 6.459e-08
highsodium.table <- with(risk, table(DEATH_EVENT, serum_sodium))
highsodium.table
## serum_sodium
## DEATH_EVENT 113 116 121 124 125 126 127 128 129 130 131 132 133 134 135 136 137
## No 1 0 0 0 1 1 0 1 0 6 2 6 8 15 10 29 31
## Yes 0 1 1 1 0 0 3 1 2 3 3 8 2 17 6 11 7
## serum_sodium
## DEATH_EVENT 138 139 140 141 142 143 144 145 146 148
## No 17 16 28 11 7 3 3 6 0 1
## Yes 6 6 7 1 4 0 2 3 1 0
chisq.test(highsodium.table)
## Warning in chisq.test(highsodium.table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: highsodium.table
## X-squared = 45.801, df = 26, p-value = 0.009601
CInt <- exp(confint(final_model))
## Waiting for profiling to be done...
CI <- exp(confint.default(final_model))
CInt
## 2.5 % 97.5 %
## (Intercept) 0.3340162 6.934776e+08
## time 0.9733671 9.846074e-01
## ejection_fraction 0.8994926 9.571500e-01
## serum_creatinine 1.4214664 2.874270e+00
## age 1.0138088 1.075593e+00
## serum_sodium 0.8681426 1.011048e+00
CI
## 2.5 % 97.5 %
## (Intercept) 0.3321806 5.298714e+08
## time 0.9737409 9.849349e-01
## ejection_fraction 0.9008933 9.583986e-01
## serum_creatinine 1.4118068 2.792983e+00
## age 1.0130935 1.074574e+00
## serum_sodium 0.8695534 1.010718e+00
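# New patient profile to score; the response column is not needed in newdata.
final.predict <- data.frame(ejection_fraction = 60, time = 10, serum_creatinine = 1.90, age = 45, serum_sodium = 130)
alpha <- 0.05
# Fitted probability and its standard error on the response (probability) scale.
model.predict <- predict(object = final_model, newdata = final.predict, type = "response", se.fit = TRUE)
# Wald-type 95% interval: fit +/- z * se.
Interval <- model.predict$fit + qnorm(p = c(alpha/2, 1 - alpha/2)) * model.predict$se.fit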
Interval
## [1] 0.1294244 0.7217940
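The interval above is built on the probability scale, where a Wald interval can stray outside [0, 1] for patients with very low or very high predicted risk. A common alternative is to build the interval on the link (log-odds) scale and back-transform it with the inverse logit. The sketch below illustrates this on simulated data, because the `risk` data frame is not reproduced here. The data frame `sim`, the model `fit`, the coefficients used to simulate the outcome (loosely based on the estimates printed earlier) and the object `new_patient` are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for the `risk` data; these are not the real patient records.
n <- 300
sim <- data.frame(
  time              = runif(n, 4, 285),
  ejection_fraction = runif(n, 14, 80),
  serum_creatinine  = runif(n, 0.5, 4),
  age               = runif(n, 40, 95),
  serum_sodium      = runif(n, 113, 148)
)
# Log-odds of death, using coefficients loosely based on the fitted model.
eta <- 9.49 - 0.021 * sim$time - 0.073 * sim$ejection_fraction +
  0.69 * sim$serum_creatinine + 0.042 * sim$age - 0.065 * sim$serum_sodium
sim$DEATH_EVENT <- rbinom(n, 1, plogis(eta))

fit <- glm(DEATH_EVENT ~ time + ejection_fraction + serum_creatinine +
             age + serum_sodium, family = binomial, data = sim)

# Same patient profile as in the prediction above.
new_patient <- data.frame(ejection_fraction = 60, time = 10,
                          serum_creatinine = 1.90, age = 45, serum_sodium = 130)

alpha <- 0.05
# Prediction and standard error on the link (log-odds) scale.
link_pred <- predict(fit, newdata = new_patient, type = "link", se.fit = TRUE)
link_ci   <- link_pred$fit + qnorm(c(alpha / 2, 1 - alpha / 2)) * link_pred$se.fit

# Back-transform to probabilities; the result always lies within [0, 1].
prob_hat <- plogis(link_pred$fit)
prob_ci  <- plogis(link_ci)
prob_hat
prob_ci
```

With the real data, replacing `fit` by `final_model` and `new_patient` by `final.predict` should give the corresponding link-scale interval for the patient scored above.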
t.test(risk$age~risk$DEATH_EVENT)
##
## Welch Two Sample t-test
##
## data: risk$age by risk$DEATH_EVENT
## t = -4.1862, df = 155.29, p-value = 4.735e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.498546 -3.408204
## sample estimates:
## mean in group No mean in group Yes
## 58.76191 65.21528
t.test(risk$ejection_fraction~risk$DEATH_EVENT)
##
## Welch Two Sample t-test
##
## data: risk$ejection_fraction by risk$DEATH_EVENT
## t = 4.567, df = 164.76, p-value = 9.647e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.858566 9.735953
## sample estimates:
## mean in group No mean in group Yes
## 40.26601 33.46875
t.test(risk$serum_creatinine~risk$DEATH_EVENT)
##
## Welch Two Sample t-test
##
## data: risk$serum_creatinine by risk$DEATH_EVENT
## t = -4.1526, df = 113.19, p-value = 6.399e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9615153 -0.3403977
## sample estimates:
## mean in group No mean in group Yes
## 1.184877 1.835833
summary(anova(final_model))
## Df Deviance Resid. Df Resid. Dev
## Min. :1 Min. : 2.815 Min. :293.0 Min. :223.5
## 1st Qu.:1 1st Qu.: 9.113 1st Qu.:294.2 1st Qu.:228.6
## Median :1 Median :20.670 Median :295.5 Median :245.7
## Mean :1 Mean :30.372 Mean :295.5 Mean :266.0
## 3rd Qu.:1 3rd Qu.:22.990 3rd Qu.:296.8 3rd Qu.:273.3
## Max. :1 Max. :96.275 Max. :298.0 Max. :375.3
## NA's :1 NA's :1
Anova(final_model , test = "LR")
## Analysis of Deviance Table (Type II tests)
##
## Response: DEATH_EVENT
## LR Chisq Df Pr(>Chisq)
## time 79.603 1 < 2.2e-16 ***
## ejection_fraction 26.341 1 2.862e-07 ***
## serum_creatinine 16.077 1 6.081e-05 ***
## age 8.530 1 0.003493 **
## serum_sodium 2.815 1 0.093381 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(cbind(Odds_and_OR=coef(final_model), confint(final_model)))
## Waiting for profiling to be done...
## Odds_and_OR 2.5 % 97.5 %
## (Intercept) 1.326699e+04 0.3340162 6.934776e+08
## time 9.793219e-01 0.9733671 9.846074e-01
## ejection_fraction 9.292012e-01 0.8994926 9.571500e-01
## serum_creatinine 1.985737e+00 1.4214664 2.874270e+00
## age 1.043381e+00 1.0138088 1.075593e+00
## serum_sodium 9.374825e-01 0.8681426 1.011048e+00
exp(coef(final_model))
## (Intercept) time ejection_fraction serum_creatinine
## 1.326699e+04 9.793219e-01 9.292012e-01 1.985737e+00
## age serum_sodium
## 1.043381e+00 9.374825e-01
exp(final_model$coefficients[2])
## time
## 0.9793219
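The odds ratios above describe a one-unit change in each predictor, which can be hard to read for variables such as follow-up time or age. The sketch below rescales them using only the coefficients printed by `summary(final_model)`. The step sizes in `delta` are illustrative choices and are not taken from the original analysis.

```r
# Coefficients (log-odds scale) copied from the summary(final_model) output.
beta <- c(time = -0.020895, ejection_fraction = -0.073430,
          serum_creatinine = 0.685990, age = 0.042466, serum_sodium = -0.064557)

# Odds ratio for a one-unit increase in each predictor.
or_one_unit <- exp(beta)

# Odds ratio for an increase of `delta` units is exp(delta * beta).
# The step sizes below are illustrative assumptions.
delta <- c(time = 30, ejection_fraction = 10, serum_creatinine = 0.5,
           age = 10, serum_sodium = 5)
or_delta <- exp(delta * beta[names(delta)])

# Percentage change in the odds of death for those steps.
pct_change <- 100 * (or_delta - 1)

round(cbind(or_one_unit, delta, or_delta, pct_change), 3)
```

An odds ratio below 1 means the odds of death fall as the predictor rises (as for time, ejection fraction and serum sodium), while a value above 1 means they rise (as for serum creatinine and age).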
In this phase, the risk of cardiovascular disease was found to depend on the features selected in the model, i.e. serum creatinine, serum sodium, age and ejection fraction of the patient. An important point to recognise is that a low predicted risk, i.e. less than 5 or 10%, does not mean that there is no risk to the patient at all. A common critique is that prediction models always predict on the basis of the population provided for the analysis.
One limitation of this analysis is that the dataset is quite small (299 patient medical records). Using a larger dataset might help in predicting the outcome more accurately. In this analysis, the most important features turned out to be serum creatinine, serum sodium, ejection fraction percentage and age of the patient. However, these features might change with a larger dataset: some of them might no longer be meaningful, and other features might turn out to be useful alongside them for predicting the outcome. Hence, the results would be more reliable if a larger dataset were used. In addition, recording more information about each patient would also help the prediction. For example, the height and weight of the patient, body mass index, employment history, a stress survey and other medical history (such as any medication the patient is taking) would help in predicting the outcome more accurately.
Firstly, we analysed the age distribution of the patients using a histogram. It shows that most patients who suffer from cardiovascular disease are in the 50 to 60 age group. Using histograms, we also examined the distributions of creatinine phosphokinase (CPK) level, platelet count, creatinine level and sodium level in the patients' blood. The results showed that the CPK level in blood was between 500 and 1250 mcg/L. The higher the CPK level, the greater the risk of heart failure. The distribution of platelet count showed that the largest number of patients had a platelet count between 125000 and 275000 kiloplatelets/mL. A platelet count lower than 150000 suggests that a patient suffers from thrombocytopenia, and a platelet count higher than 450000 suggests thrombocytosis. Creatinine levels in blood were between 0 and 2.5 mg/dL (ideal range for creatinine is 0.8 to 1.2) and sodium levels were between 130 and 150 mEq/L (ideal range for sodium is 135 to 145 mEq/L). Increased creatinine levels are associated with damage to the kidneys.
Later, we analysed the relationships between variables such as the age group of the patients and their CPK level, platelet count, creatinine level, sodium level and ejection fraction percentage. The ejection fraction percentage is the proportion of blood that the left ventricle pumps out during each contraction. An ejection fraction of 55% or above is considered normal. An ejection fraction lower than 55% indicates that the patient is at risk of suffering a stroke or heart failure. In addition, smoking status, hypertension, diabetes and anaemia were each compared against platelet count, sodium level, creatinine level and ejection fraction percentage.
In this analysis, the patient's follow-up period was compared with survival status, and the patient's ejection fraction percentage was compared with both survival status and the follow-up period at the clinical facility. No linear correlation was found between survival status and ejection fraction percentage, or between survival status and the follow-up period.
In Phase 2, we performed a stepwise selection of a binomial logistic regression model for predicting the survival of a patient at risk of a cardiovascular attack. Forward stepwise selection was used, and the model with the lowest Akaike Information Criterion (AIC) was selected. In this model, the age of the patient, the ejection fraction percentage (the proportion of blood that the left ventricle pumps out during each contraction), serum creatinine and serum sodium levels were the most important features. The least important features were anaemia (whether the patient suffers from anaemia) and diabetes (whether the patient suffers from diabetes). After selecting the model, we carried out a number of steps: response analysis, residual analysis, a goodness of fit test, confidence intervals, hypothesis testing and finally an odds ratio analysis on the binomial regression model to find whether males or females are more prone to cardiovascular disease. In the residual analysis, plots of the standardised Pearson residuals and deviance plots were shown. In the next step, a response analysis was performed, in which scatter plots were shown of the dependent variable (age of the patient) against the independent variables (serum creatinine level, ejection fraction percentage and serum sodium level) included in the binomial regression model. A goodness of fit test was conducted to test whether the features serum creatinine, serum sodium, ejection fraction and age of the patient are significant in predicting the survival status of a patient at risk of a cardiovascular attack. Similar to the goodness of fit test, hypothesis testing was also performed.
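The forward selection by AIC described above can be sketched with the base R function `step()`. The example runs on simulated data, because the `risk` data frame is not reproduced here. The data frame `sim`, the coefficients used to simulate the outcome and the reduced set of candidate predictors are assumptions made for illustration, and the call may differ from the one used in the original analysis.

```r
set.seed(42)

# Simulated stand-in for the patient records; these are not the real data.
n <- 300
sim <- data.frame(
  age               = rnorm(n, 60, 12),
  ejection_fraction = rnorm(n, 38, 12),
  serum_creatinine  = rlnorm(n, 0.2, 0.5),
  anaemia           = rbinom(n, 1, 0.4),
  diabetes          = rbinom(n, 1, 0.4)
)
# The outcome depends on the first three predictors only (an assumption).
eta <- -1 + 0.04 * (sim$age - 60) - 0.07 * (sim$ejection_fraction - 38) +
  0.7 * (sim$serum_creatinine - 1.3)
sim$DEATH_EVENT <- rbinom(n, 1, plogis(eta))

# Start from the intercept-only model.
null_fit <- glm(DEATH_EVENT ~ 1, family = binomial, data = sim)

# Forward selection: at each step add the term that lowers AIC the most,
# and stop when no remaining term lowers it further.
forward_fit <- step(
  null_fit,
  scope = list(lower = ~ 1,
               upper = ~ age + ejection_fraction + serum_creatinine +
                 anaemia + diabetes),
  direction = "forward",
  trace = 0
)

formula(forward_fit)
AIC(null_fit, forward_fit)
```

Setting `trace = 1` prints the AIC of every candidate term at each step, which makes it easier to see why a given predictor was or was not added.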
Later, we found confidence intervals for ejection fraction and serum creatinine levels, and carried out an odds ratio analysis to check whether males or females are more prone to the risk of a cardiovascular attack depending on their medical history.
An analysis predicting the survival of a patient at risk of a cardiovascular attack was performed in Phase 1 and Phase 2 of the project. The analysis suggests that it is possible to predict the risk of a cardiovascular attack, and the patient's survival status after such an attack, from the patient's electronic medical records alone. From these medical records, the most important features were serum creatinine, serum sodium, ejection fraction percentage and age of the patient. From analysing these features and from the binomial regression above, it was found that these features would help medical institutions predict the risk of a cardiovascular attack. In Phase 1 of the analysis, features such as smoking status, diabetes status, anaemia status and high blood pressure status also contributed to the study. However, in Phase 2, none of these features proved to be important in predicting survival status. In the stepwise selection, all of them were removed at early stages because their p-values were well above 0.05. Hence, machine learning methods such as logistic regression can plausibly be used to predict such outcomes. Logistic regression models can be used to classify medical records in order to predict the risk of such a disease. The approach is not limited to this analysis; such models can also be used to predict survival status or risk for patients with other diseases, such as cancers and tumours.
Dataset from GitHub by Dr. Vural Aksakalli. Reference: vaksakalli/datasets. (2020). Retrieved 27 September 2020, from https://github.com/vaksakalli/datasets/blob/master/heart_failure.csv