Including all libraries used in R

library(WVPlots)
## Warning: package 'WVPlots' was built under R version 4.0.5
## Loading required package: wrapr
## Warning: package 'wrapr' was built under R version 4.0.5
library(leaps)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:wrapr':
## 
##     coalesce
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(pwr)
library(readr)
## Warning: package 'readr' was built under R version 4.0.3
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.0.3
library(aod)
## Warning: package 'aod' was built under R version 4.0.5
library(broom)
## Warning: package 'broom' was built under R version 4.0.4
library(infer)
## Warning: package 'infer' was built under R version 4.0.5
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.5
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(sigr)
## Warning: package 'sigr' was built under R version 4.0.5
library(WVPlots)
library(ranger)
## Warning: package 'ranger' was built under R version 4.0.5
library(boot)

Objective

To analyze various contributing factors related to heart failure like age, red blood cells, high blood pressure, level of the creatinine phosphokinase in the blood, if the patient has diabetes or not, percentage of blood leaving the heart, number of platelets, sex, level of serum creatinine, serum sodium and if the patient is smoking. The goal of this research is to use regression models to predict the odds of death event during follow up period, time of follow up period based on the various factors analyzed.

Load dataset

heartFailData = read.csv("heartFailData.csv")

Data source: University of California Link: https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records
File Path: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv Size of data set: 299 observations  Years – 2020 Total Variables – 13 Total Observations – 299 Dependent Variable – Death event(binary), time(num) Independent Variable considered – Age(num), anaemia(binary), high blood pressure(binary), creatinine phosphokinase(num), Diabetes(categorical), Ejection Fraction(num), Platelets(num), sex(binary), serum creatinine (num), serum sodium (num), smoking (binary)

Data Cleaning - Outliers

#leverage
modeltime = lm(time~.,data=heartFailData)
summary(modeltime)
## 
## Call:
## lm(formula = time ~ ., data = heartFailData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.115  -49.439   -8.572   54.556  157.501 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.903e+02  1.245e+02   1.529   0.1274    
## age                      -4.074e-01  3.370e-01  -1.209   0.2277    
## anaemia                  -1.634e+01  7.841e+00  -2.084   0.0380 *  
## creatinine_phosphokinase -1.093e-03  4.016e-03  -0.272   0.7856    
## diabetes                  2.406e+00  7.850e+00   0.307   0.7594    
## ejection_fraction        -6.398e-01  3.420e-01  -1.871   0.0624 .  
## high_blood_pressure      -2.421e+01  8.000e+00  -3.027   0.0027 ** 
## platelets                -1.049e-05  3.931e-05  -0.267   0.7898    
## serum_creatinine          1.502e+00  3.882e+00   0.387   0.6992    
## serum_sodium              2.810e-01  8.977e-01   0.313   0.7545    
## sex                      -5.829e+00  9.111e+00  -0.640   0.5229    
## smoking                  -5.842e+00  9.105e+00  -0.642   0.5216    
## DEATH_EVENT              -8.658e+01  9.196e+00  -9.415   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.89 on 286 degrees of freedom
## Multiple R-squared:  0.3292, Adjusted R-squared:  0.301 
## F-statistic:  11.7 on 12 and 286 DF,  p-value: < 2.2e-16
k=12
leverage = hatvalues(modeltime)
cutleverage = (2*k+2) / nrow(heartFailData)
badleverage = as.numeric(leverage > cutleverage)
table(badleverage)
## badleverage
##   0   1 
## 284  15
#cooks distance 
cooks = cooks.distance(modeltime)
cutcooks = 4 / (nrow(heartFailData) - k - 1)
badcooks = as.numeric(cooks > cutcooks)
table(badcooks)
## badcooks
##   0   1 
## 288  11
#mahalanobis 

datanew <- heartFailData[,c(1,3,5,7,8,9,12)]
mahal = mahalanobis(datanew, 
                    colMeans(datanew), 
                    cov(datanew))
cutmahal = qchisq(1-.001, ncol(datanew))
badmahal = as.numeric(mahal > cutmahal) ##note the direction of the > 
table(badmahal)
## badmahal
##   0   1 
## 287  12
#total outlers 

totalout = badmahal + badleverage + badcooks
table(totalout)
## totalout
##   0   1   2   3 
## 278   9   7   5
heartFailureData = subset(heartFailData, totalout < 3)
#total outliers = 5 (which satisfies all the three conditions)

Outliers were removed from Leverage distance(0.08), Mahalanobis distance(24.32), Cooks’ distance(0.01) All the assumptionsLinearity, Additivity, Multicollinearity, Homogeneity, Homoscedasticity were met except normality, for which we used boot strapping.

Data Screening

heartFailureData$anaemia = as.factor(heartFailureData$anaemia)
heartFailureData$high_blood_pressure = as.factor(heartFailureData$high_blood_pressure)
heartFailureData$diabetes = as.factor(heartFailureData$diabetes)
heartFailureData$sex = as.factor(heartFailureData$sex)
heartFailureData$smoking = as.factor(heartFailureData$smoking)
heartFailureData$DEATH_EVENT = as.factor(heartFailureData$DEATH_EVENT)
summary(heartFailureData)
##       age        anaemia creatinine_phosphokinase diabetes ejection_fraction
##  Min.   :40.00   0:166   Min.   :  23.0           0:171    Min.   :14.00    
##  1st Qu.:51.00   1:128   1st Qu.: 115.0           1:123    1st Qu.:30.00    
##  Median :60.00           Median : 247.0                    Median :38.00    
##  Mean   :61.05           Mean   : 537.8                    Mean   :37.94    
##  3rd Qu.:70.00           3rd Qu.: 582.0                    3rd Qu.:45.00    
##  Max.   :95.00           Max.   :7702.0                    Max.   :80.00    
##  high_blood_pressure   platelets      serum_creatinine  serum_sodium   sex    
##  0:190               Min.   : 25100   Min.   :0.500    Min.   :113.0   0:103  
##  1:104               1st Qu.:212250   1st Qu.:0.900    1st Qu.:134.0   1:191  
##                      Median :262000   Median :1.100    Median :137.0          
##                      Mean   :260244   Mean   :1.373    Mean   :136.6          
##                      3rd Qu.:302750   3rd Qu.:1.400    3rd Qu.:140.0          
##                      Max.   :621000   Max.   :9.400    Max.   :148.0          
##  smoking      time       DEATH_EVENT
##  0:200   Min.   :  4.0   0:200      
##  1: 94   1st Qu.: 73.0   1: 94      
##          Median :115.0              
##          Mean   :130.2              
##          3rd Qu.:204.0              
##          Max.   :285.0

Exploratory Data analysis

## `geom_smooth()` using formula 'y ~ x'

death odds seem to increase with cpk levels.

## `geom_smooth()` using formula 'y ~ x'

death event odds are less when serum creatinine levels are low

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

There seems to be a difference in median time of the follow up period between people with high BP and people without high blood pressure. the difference of time is seen in when the patient was alive , patient passed away.

Inferencial Analysis

##       Alive Passed Away 
##         200          94
## [1] 0.07064777

## # A tibble: 1 x 1
##    pval
##   <dbl>
## 1  1.77

The graph suggests there is no difference between two groups one with high blood pressure and one with not high blood pressure in the occurrence of death

heartFailDataInf %>%
  ggplot(aes(x=time)) +
  geom_density(adjust=2)

aov(time ~ high_blood_pressure, heartFailDataInf) %>%
  tidy()
## # A tibble: 2 x 6
##   term                   df    sumsq meansq statistic   p.value
##   <chr>               <dbl>    <dbl>  <dbl>     <dbl>     <dbl>
## 1 high_blood_pressure     1   74443. 74443.      13.0  0.000371
## 2 Residuals             292 1675452.  5738.      NA   NA
74442.97 / (74442.97 + 1675451.91)
## [1] 0.0425414

high blood pressure seem to be statistically significant with p < 0.05 and explains 4% of variance

By plotting observed statistic against null distribution, we learned that observed difference of 0.07 is not as greater then sort of difference we would see if there was no association between these two variables. This suggests there is no difference between two groups , one with high blood pressure and one with not high blood pressure in the occurence of death during the follow up period.

Train and Test

smp_size <- floor(0.75 * nrow(heartFailureData))

set.seed(299)
train_ind <- sample(seq_len(nrow(heartFailureData)), size = smp_size)
train <- heartFailureData[train_ind, ]
test <- heartFailureData[-train_ind, ]

Binary Logistic regression

m=glm(DEATH_EVENT ~ time + platelets + high_blood_pressure , data = train, family="binomial")
prob = predict(m, test, type="response")
mean(as.numeric(as.character(train$DEATH_EVENT)))
## [1] 0.3136364
test$pred = ifelse(prob > 0.31, 1, 0)
mean(test$DEATH_EVENT == test$pred)
## [1] 0.6351351
ROC = roc(test$DEATH_EVENT, prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col="blue")

auc(ROC)
## Area under the curve: 0.7633

#Stepwise Binary Logistic regression

null_model = glm(DEATH_EVENT ~ 1 , data = train, family="binomial")

full_model = glm(DEATH_EVENT ~., data = train, family="binomial")

step_model = step(null_model, scope=list(lower=null_model, upper = full_model), directoin="forward")
## Start:  AIC=275.67
## DEATH_EVENT ~ 1
## 
##                            Df Deviance    AIC
## + time                      1   186.32 190.32
## + ejection_fraction         1   254.33 258.33
## + serum_creatinine          1   256.48 260.48
## + age                       1   256.64 260.64
## + serum_sodium              1   265.21 269.21
## <none>                          273.67 275.67
## + high_blood_pressure       1   272.55 276.55
## + anaemia                   1   272.82 276.82
## + diabetes                  1   273.12 277.12
## + smoking                   1   273.38 277.38
## + creatinine_phosphokinase  1   273.40 277.40
## + sex                       1   273.67 277.67
## + platelets                 1   273.67 277.67
## 
## Step:  AIC=190.32
## DEATH_EVENT ~ time
## 
##                            Df Deviance    AIC
## + ejection_fraction         1   166.62 172.62
## + age                       1   176.40 182.40
## + serum_creatinine          1   177.50 183.50
## + serum_sodium              1   178.07 184.07
## <none>                          186.32 190.32
## + diabetes                  1   185.32 191.32
## + creatinine_phosphokinase  1   185.61 191.61
## + platelets                 1   185.90 191.90
## + high_blood_pressure       1   186.00 192.00
## + smoking                   1   186.27 192.27
## + anaemia                   1   186.30 192.30
## + sex                       1   186.32 192.32
## - time                      1   273.67 275.67
## 
## Step:  AIC=172.62
## DEATH_EVENT ~ time + ejection_fraction
## 
##                            Df Deviance    AIC
## + age                       1   153.89 161.89
## + serum_creatinine          1   159.04 167.04
## + serum_sodium              1   161.23 169.23
## <none>                          166.62 172.62
## + diabetes                  1   165.54 173.54
## + creatinine_phosphokinase  1   166.03 174.03
## + sex                       1   166.10 174.10
## + platelets                 1   166.26 174.26
## + high_blood_pressure       1   166.28 174.28
## + smoking                   1   166.37 174.37
## + anaemia                   1   166.52 174.52
## - ejection_fraction         1   186.32 190.32
## - time                      1   254.33 258.33
## 
## Step:  AIC=161.89
## DEATH_EVENT ~ time + ejection_fraction + age
## 
##                            Df Deviance    AIC
## + serum_creatinine          1   148.07 158.07
## + serum_sodium              1   150.21 160.21
## + diabetes                  1   150.89 160.89
## <none>                          153.89 161.89
## + smoking                   1   152.77 162.77
## + high_blood_pressure       1   152.97 162.97
## + creatinine_phosphokinase  1   153.14 163.14
## + sex                       1   153.16 163.16
## + platelets                 1   153.26 163.26
## + anaemia                   1   153.86 163.86
## - age                       1   166.62 172.62
## - ejection_fraction         1   176.40 182.40
## - time                      1   233.29 239.29
## 
## Step:  AIC=158.07
## DEATH_EVENT ~ time + ejection_fraction + age + serum_creatinine
## 
##                            Df Deviance    AIC
## + serum_sodium              1   145.35 157.35
## + diabetes                  1   145.52 157.52
## <none>                          148.07 158.07
## + sex                       1   147.44 159.44
## + creatinine_phosphokinase  1   147.66 159.66
## + smoking                   1   147.67 159.67
## + platelets                 1   147.76 159.76
## + high_blood_pressure       1   147.84 159.84
## + anaemia                   1   148.03 160.03
## - serum_creatinine          1   153.89 161.89
## - age                       1   159.04 167.04
## - ejection_fraction         1   169.84 177.84
## - time                      1   222.15 230.15
## 
## Step:  AIC=157.35
## DEATH_EVENT ~ time + ejection_fraction + age + serum_creatinine + 
##     serum_sodium
## 
##                            Df Deviance    AIC
## + diabetes                  1   142.86 156.86
## <none>                          145.35 157.35
## - serum_sodium              1   148.07 158.07
## + sex                       1   144.47 158.47
## + smoking                   1   144.77 158.77
## + creatinine_phosphokinase  1   144.82 158.82
## + platelets                 1   145.13 159.13
## + high_blood_pressure       1   145.25 159.25
## + anaemia                   1   145.32 159.32
## - serum_creatinine          1   150.21 160.21
## - age                       1   155.26 165.26
## - ejection_fraction         1   165.38 175.38
## - time                      1   220.88 230.88
## 
## Step:  AIC=156.86
## DEATH_EVENT ~ time + ejection_fraction + age + serum_creatinine + 
##     serum_sodium + diabetes
## 
##                            Df Deviance    AIC
## <none>                          142.86 156.86
## - diabetes                  1   145.35 157.35
## - serum_sodium              1   145.52 157.52
## + sex                       1   142.02 158.02
## + smoking                   1   142.26 158.26
## + creatinine_phosphokinase  1   142.49 158.49
## + platelets                 1   142.59 158.59
## + high_blood_pressure       1   142.80 158.80
## + anaemia                   1   142.85 158.85
## - serum_creatinine          1   147.39 159.39
## - age                       1   154.35 166.35
## - ejection_fraction         1   163.84 175.84
## - time                      1   219.09 231.09
step_prob = predict(step_model, test, type= "response")
ROC3 = roc(test$DEATH_EVENT, step_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC3, col="red")

auc(ROC3)
## Area under the curve: 0.7984

Evaluation of AUC:

From the graph we can see that area under the curve is greater area – Our curve is not close to the 45-degree diagonal of the ROC space, which suggest the model has higher accuracy. Which suggest we have good rate for positive test where people are correctly identified with the heart failure at the time when they visit the hospital.

Assumption for logistic regression

Additivity

correl = cor(heartFailData[,c(1,3,5,7,8,9,12)])
symnum(correl)
##                          a c e p srm_c srm_s t
## age                      1                    
## creatinine_phosphokinase   1                  
## ejection_fraction            1                
## platelets                      1              
## serum_creatinine                 1            
## serum_sodium                           1      
## time                                         1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
#it does met the assumption of additivity as correlation values are less than 0.7 and there is no multicollinearity

#Additivity - we already meet the assumption of additivity (verified for logistic regression)

#new model after removing outliers(to check assumptions)
modeltime = lm(time~.,heartFailureData)
summary(modeltime)
## 
## Call:
## lm(formula = time ~ ., data = heartFailureData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -155.780  -46.734   -9.286   52.874  154.889 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.029e+02  1.229e+02   1.651 0.099903 .  
## age                      -3.158e-01  3.355e-01  -0.941 0.347353    
## anaemia1                 -1.665e+01  7.706e+00  -2.160 0.031588 *  
## creatinine_phosphokinase  2.490e-03  4.585e-03   0.543 0.587492    
## diabetes1                 2.281e+00  7.743e+00   0.295 0.768543    
## ejection_fraction        -8.627e-01  3.426e-01  -2.518 0.012362 *  
## high_blood_pressure1     -2.643e+01  7.907e+00  -3.343 0.000942 ***
## platelets                -1.806e-06  4.314e-05  -0.042 0.966630    
## serum_creatinine         -3.389e+00  4.228e+00  -0.801 0.423594    
## serum_sodium              2.225e-01  8.850e-01   0.251 0.801686    
## sex1                     -3.327e+00  8.977e+00  -0.371 0.711173    
## smoking1                 -4.864e+00  8.982e+00  -0.542 0.588554    
## DEATH_EVENT1             -8.725e+01  9.054e+00  -9.636  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.62 on 281 degrees of freedom
## Multiple R-squared:   0.35,  Adjusted R-squared:  0.3222 
## F-statistic: 12.61 on 12 and 281 DF,  p-value: < 2.2e-16

Linearity

#linearity 
standardized = rstudent(modeltime)
fitted = scale(modeltime$fitted.values)
qqnorm(standardized)
abline(0,1)

Yes , it met the assumption of linearity as the dots are close to line

Normality

hist(standardized)

library(moments)
skewness(datanew)
##                      age creatinine_phosphokinase        ejection_fraction 
##                0.4209366                4.4406886                0.5525927 
##                platelets         serum_creatinine             serum_sodium 
##                1.4549745                4.4336102               -1.0428705 
##                     time 
##                0.1271606
kurtosis(datanew)
##                      age creatinine_phosphokinase        ejection_fraction 
##                 2.798207                27.710458                 3.020720 
##                platelets         serum_creatinine             serum_sodium 
##                 9.085906                28.378346                 7.031142 
##                     time 
##                 1.788126
shapiro.test(modeltime$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  modeltime$residuals
## W = 0.98331, p-value = 0.00168

creatinine_phosphokinase and serum_creatinine has very high skew value does not met the assumption of normality as the p value is less 0.05

#Heterocadasticity

plot(fitted, standardized)
abline(0,0)
abline(v = 0)

library(car)
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:wrapr':
## 
##     bc
vif(modeltime)
##                      age                  anaemia creatinine_phosphokinase 
##                 1.145810                 1.060374                 1.055992 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                 1.059589                 1.164590                 1.038030 
##                platelets         serum_creatinine             serum_sodium 
##                 1.039234                 1.149905                 1.114320 
##                      sex                  smoking              DEATH_EVENT 
##                 1.332182                 1.274569                 1.294973

It meet the assumption of Homogeneity and Homoscedasticity as the variance is spread evenly across y axis from -2 to +2,also VIF is less than 5

Linear regression

We are having a null model with no predictors and full model using all of the predictors

myTimeModel1 = lm(time~ejection_fraction+high_blood_pressure+DEATH_EVENT, data = train)
wrapFTest(myTimeModel1)
## [1] "F Test summary: (R2=0.3689, F(3,216)=42.08, p<1e-05)."
test$timePred1 = predict(myTimeModel1, newdata = test)
test$residuals = test$time - test$timePred1
summary(test$time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   85.25  116.00  132.70  204.00  258.00
summary(test$timePred1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.37   74.13  148.21  127.35  169.12  181.88
test %>%
  ggplot(aes(x=timePred1, y=time)) +
  geom_point() +
  geom_abline(color = "blue")

test %>%
  ggplot(aes(x=timePred1, y=residuals)) +
  geom_pointrange(aes(ymin=0, ymax=residuals)) +
  geom_hline(yintercept = 0, linetype = 3) +
  ggtitle("linear model prediction for time vs ejection_fraction, high bp, death event")

GainCurvePlot(test, "timePred1", "time", "time model")

rmse = sqrt(mean(test$residuals ^ 2))
print("rmse")
## [1] "rmse"
rmse
## [1] 67.71416
print("standard deviation")
## [1] "standard deviation"
sd(test$time)
## [1] 76.04968
print("r square")
## [1] "r square"
glance(myTimeModel1)$r.squared
## [1] 0.3688733

diagnol line shows the if the time is random green is the curve where perfect model would trace out blue is the curve our model traces out

rmse of 67.71 is better than standard deviation of 76.04

Random Forest

myTimeModel2 = ranger(time~ejection_fraction+high_blood_pressure+DEATH_EVENT, train, num.trees = 500, respect.unordered.factors = "order", seed = 299)

test$timePred2 = predict(myTimeModel2, test)$predictions
test$residuals2 = test$time - test$timePred2
summary(test$time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   85.25  116.00  132.70  204.00  258.00
summary(test$timePred2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    64.5    86.2   140.0   126.5   158.6   169.4
test %>%
  ggplot(aes(x=timePred2, y=time)) +
  geom_point() +
  geom_abline(color = "blue")

test %>%
  ggplot(aes(x=timePred2, y=residuals2)) +
  geom_pointrange(aes(ymin=0, ymax=residuals2)) +
  geom_hline(yintercept = 0, linetype = 3) +
  ggtitle("linear model prediction for time vs ejection_fraction, high bp, death event")

GainCurvePlot(test, "timePred2", "time", "time model")

print("rmse")
## [1] "rmse"
rmse = sqrt(mean(test$residuals2 ^ 2))
rmse
## [1] 66.12833
print("standard deviation")
## [1] "standard deviation"
sd(test$time)
## [1] 76.04968
print("r square")
## [1] "r square"
glance(myTimeModel1)$r.squared
## [1] 0.3688733

rmse improved from 67.71 to 66.12 utilizing random forest algorithm for test data.

anova(myTimeModel1, myTimeModel2)
## Warning in anova.lmlist(object, ...): models with response '"NULL"' removed
## because response differs from model 1
## Analysis of Variance Table
## 
## Response: time
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## ejection_fraction     1    765     765   0.1972 0.6574383    
## high_blood_pressure   1  50255   50255  12.9601 0.0003947 ***
## DEATH_EVENT           1 438512  438512 113.0877 < 2.2e-16 ***
## Residuals           216 837568    3878                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To evaluate the linear regression model We are using train and test data – with Sample size of 0.75 Train data is for running the models and test data is for the predictions

We are having a null model with no predictors and full model using all of the predictors We also perfomed anova test on the models These are the results obtained

#Prediction for Linear regression

mymodel3 = lm( time ~ ejection_fraction + high_blood_pressure + DEATH_EVENT , data = train)
summary(mymodel3)
## 
## Call:
## lm(formula = time ~ ejection_fraction + high_blood_pressure + 
##     DEATH_EVENT, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -148.156  -44.763   -4.902   48.354  153.783 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           205.5054    16.0842  12.777  < 2e-16 ***
## ejection_fraction      -0.9450     0.3724  -2.538  0.01186 *  
## high_blood_pressure1  -24.2224     8.8881  -2.725  0.00695 ** 
## DEATH_EVENT1         -100.6633     9.4659 -10.634  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.27 on 216 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3601 
## F-statistic: 42.08 on 3 and 216 DF,  p-value: < 2.2e-16
predicted_model = predict(mymodel3, newdata = test)
test$predictedTime = predicted_model
#head(test)

 
# time vs predicted time
ggplot(test, aes(predictedTime, time)) + 
  geom_point() + 
  geom_abline() +
  ggtitle("Prediction vs. Real values")

residuals_model = test$time - predicted_model
test$residuals =  residuals_model

 
# Predicted values vs residuals
ggplot(test, aes(predictedTime, residuals)) +
  geom_pointrange(aes(ymin = 0, ymax = residuals)) +
  geom_hline(yintercept = 0) +
  labs(x="Predicted Charges", y="Residuals") 

# Histogram of Residuals
ggplot(test, aes(residuals)) + 
  geom_histogram(bins = 10) +
  labs(x="Residuals", y="Frequency") 

# The gain curve plot measures how well the model score sorts the data compared to the true outcome value.
GainCurvePlot(test, "predictedTime", "time", "Model")

act_pred <- data.frame(cbind(actuals=test$time, predicteds=test$predictedTime)) # actuals_predicteds 
cor(act_pred) # correlation_accuracy
##              actuals predicteds
## actuals    1.0000000  0.4759494
## predicteds 0.4759494  1.0000000
min_max <- mean(apply(act_pred, 1, min) / apply(act_pred, 1, max))  
print(min_max) # show the result
## [1] 0.6089327

Bootstrapping

Boot Co-ef

bootcoef = function(formula, data, indices){
  d = data[indices, ] 
  model = lm(formula, data = d)
  return(coef(model))
}
mymodel4 = lm( time ~ . , data = heartFailureData)
summary(mymodel4)
## 
## Call:
## lm(formula = time ~ ., data = heartFailureData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -155.780  -46.734   -9.286   52.874  154.889 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.029e+02  1.229e+02   1.651 0.099903 .  
## age                      -3.158e-01  3.355e-01  -0.941 0.347353    
## anaemia1                 -1.665e+01  7.706e+00  -2.160 0.031588 *  
## creatinine_phosphokinase  2.490e-03  4.585e-03   0.543 0.587492    
## diabetes1                 2.281e+00  7.743e+00   0.295 0.768543    
## ejection_fraction        -8.627e-01  3.426e-01  -2.518 0.012362 *  
## high_blood_pressure1     -2.643e+01  7.907e+00  -3.343 0.000942 ***
## platelets                -1.806e-06  4.314e-05  -0.042 0.966630    
## serum_creatinine         -3.389e+00  4.228e+00  -0.801 0.423594    
## serum_sodium              2.225e-01  8.850e-01   0.251 0.801686    
## sex1                     -3.327e+00  8.977e+00  -0.371 0.711173    
## smoking1                 -4.864e+00  8.982e+00  -0.542 0.588554    
## DEATH_EVENT1             -8.725e+01  9.054e+00  -9.636  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.62 on 281 degrees of freedom
## Multiple R-squared:   0.35,  Adjusted R-squared:  0.3222 
## F-statistic: 12.61 on 12 and 281 DF,  p-value: < 2.2e-16
AIC(mymodel4)
## [1] 3290.985
plot(mymodel4)

mymodel4.boot = boot(formula = time ~ .,,
                  data = heartFailureData,
                  statistic = bootcoef,
                  R = 1000)
mymodel4.boot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = heartFailureData, statistic = bootcoef, R = 1000, 
##     formula = time ~ .)
## 
## 
## Bootstrap Statistics :
##           original        bias     std. error
## t1*   2.029021e+02 -2.223650e+00 1.195450e+02
## t2*  -3.157813e-01  1.039215e-02 3.213348e-01
## t3*  -1.664894e+01  6.318908e-01 7.577235e+00
## t4*   2.490330e-03  6.402995e-04 4.867673e-03
## t5*   2.280794e+00  1.907160e-01 8.135637e+00
## t6*  -8.626637e-01  8.318194e-03 3.316777e-01
## t7*  -2.643082e+01 -9.408112e-02 7.347773e+00
## t8*  -1.806352e-06  4.198509e-07 4.282838e-05
## t9*  -3.388512e+00  3.874594e-02 3.248523e+00
## t10*  2.224909e-01  3.082664e-03 8.534531e-01
## t11* -3.327380e+00  1.984901e-01 9.100107e+00
## t12* -4.864411e+00 -2.000522e-01 9.150950e+00
## t13* -8.724534e+01  2.251181e-01 8.513011e+00
plot(mymodel4.boot)

mymodel4.boot2 = boot(formula = time ~ .,
                  data = heartFailureData[sample(1:nrow(heartFailureData), 50, replace=FALSE),],
                  statistic = bootcoef,
                  R = 1000)
mymodel4.boot2
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = heartFailureData[sample(1:nrow(heartFailureData), 
##     50, replace = FALSE), ], statistic = bootcoef, R = 1000, 
##     formula = time ~ .)
## 
## 
## Bootstrap Statistics :
##           original        bias     std. error
## t1*  -7.583364e+01 -4.510757e+01 3.190611e+02
## t2*   3.523779e-01 -7.629962e-03 1.205633e+00
## t3*   6.017260e+00 -1.661428e+00 2.355244e+01
## t4*   2.573725e-02  3.355185e-03 2.684879e-02
## t5*   9.227490e+00  3.861918e+00 2.366643e+01
## t6*  -7.234338e-01  8.596197e-02 9.887926e-01
## t7*  -1.887019e+01 -2.697714e+00 2.198386e+01
## t8*   5.388161e-05 -2.250349e-05 1.526385e-04
## t9*  -1.060822e+01  8.533374e-01 1.019078e+01
## t10*  1.597580e+00  3.431184e-01 2.299446e+00
## t11*  2.301577e+01 -4.035560e-01 2.887668e+01
## t12*  2.201726e+01 -1.856380e+00 2.420984e+01
## t13* -1.101771e+02 -2.762760e-01 2.616902e+01
plot(mymodel4.boot2)

From Bootstrapping:- Standard error are much higher for the boot2 - in comparision with original model Co-effient values are much higher for the subset of data - in comparision with original model

length(heartFailureData$ejection_fraction)
## [1] 294
mean(heartFailureData$ejection_fraction)
## [1] 37.93878
sd(heartFailureData$ejection_fraction)
## [1] 11.70731
error <- qt(0.95,df=length(heartFailureData$ejection_fraction)-1)*sd(heartFailureData$ejection_fraction)/sqrt(length(heartFailureData$ejection_fraction))

left <- mean(heartFailureData$ejection_fraction)-error
right <- mean(heartFailureData$ejection_fraction)+error
left
## [1] 36.81213
right
## [1] 39.06542

CONFIDENCE INTERVAL for ejection_fraction is between 36 and 39

boot.ci(mymodel4.boot2, index = 5)
## Warning in boot.ci(mymodel4.boot2, index = 5): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = mymodel4.boot2, index = 5)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-41.020,  51.751 )   (-42.507,  48.554 )  
## 
## Level     Percentile            BCa          
## 95%   (-30.099,  60.962 )   (-35.944,  54.685 )  
## Calculations and Intervals on Original Scale

CONFIDENCE INTERVAL for variable 5 that is in the range which suggest that our data is consistent

Regression Subset

# Additional evaluations
subsets = leaps::regsubsets(time~.,data=train, nvmax=12)
summary(subsets)
## Subset selection object
## Call: regsubsets.formula(time ~ ., data = train, nvmax = 12)
## 12 Variables  (and intercept)
##                          Forced in Forced out
## age                          FALSE      FALSE
## anaemia1                     FALSE      FALSE
## creatinine_phosphokinase     FALSE      FALSE
## diabetes1                    FALSE      FALSE
## ejection_fraction            FALSE      FALSE
## high_blood_pressure1         FALSE      FALSE
## platelets                    FALSE      FALSE
## serum_creatinine             FALSE      FALSE
## serum_sodium                 FALSE      FALSE
## sex1                         FALSE      FALSE
## smoking1                     FALSE      FALSE
## DEATH_EVENT1                 FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           age anaemia1 creatinine_phosphokinase diabetes1 ejection_fraction
## 1  ( 1 )  " " " "      " "                      " "       " "              
## 2  ( 1 )  " " " "      " "                      " "       " "              
## 3  ( 1 )  " " " "      " "                      " "       "*"              
## 4  ( 1 )  " " "*"      " "                      " "       "*"              
## 5  ( 1 )  " " "*"      " "                      " "       "*"              
## 6  ( 1 )  " " "*"      " "                      "*"       "*"              
## 7  ( 1 )  " " "*"      " "                      "*"       "*"              
## 8  ( 1 )  "*" "*"      " "                      "*"       "*"              
## 9  ( 1 )  "*" "*"      " "                      "*"       "*"              
## 10  ( 1 ) "*" "*"      " "                      "*"       "*"              
## 11  ( 1 ) "*" "*"      "*"                      "*"       "*"              
## 12  ( 1 ) "*" "*"      "*"                      "*"       "*"              
##           high_blood_pressure1 platelets serum_creatinine serum_sodium sex1
## 1  ( 1 )  " "                  " "       " "              " "          " " 
## 2  ( 1 )  "*"                  " "       " "              " "          " " 
## 3  ( 1 )  "*"                  " "       " "              " "          " " 
## 4  ( 1 )  "*"                  " "       " "              " "          " " 
## 5  ( 1 )  "*"                  " "       " "              " "          " " 
## 6  ( 1 )  "*"                  " "       " "              " "          " " 
## 7  ( 1 )  "*"                  " "       "*"              " "          " " 
## 8  ( 1 )  "*"                  " "       "*"              " "          " " 
## 9  ( 1 )  "*"                  "*"       "*"              " "          " " 
## 10  ( 1 ) "*"                  "*"       "*"              " "          "*" 
## 11  ( 1 ) "*"                  "*"       "*"              " "          "*" 
## 12  ( 1 ) "*"                  "*"       "*"              "*"          "*" 
##           smoking1 DEATH_EVENT1
## 1  ( 1 )  " "      "*"         
## 2  ( 1 )  " "      "*"         
## 3  ( 1 )  " "      "*"         
## 4  ( 1 )  " "      "*"         
## 5  ( 1 )  "*"      "*"         
## 6  ( 1 )  "*"      "*"         
## 7  ( 1 )  "*"      "*"         
## 8  ( 1 )  "*"      "*"         
## 9  ( 1 )  "*"      "*"         
## 10  ( 1 ) "*"      "*"         
## 11  ( 1 ) "*"      "*"         
## 12  ( 1 ) "*"      "*"
names(summary(subsets))
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
subsets_measures = data.frame(model=1:length(summary(subsets)$cp),
                              rsq=summary(subsets)$rsq,
                              bic=summary(subsets)$bic,
                              adjr2=summary(subsets)$adjr2)
subsets_measures
##    model       rsq       bic     adjr2
## 1      1 0.3265290 -76.18102 0.3234397
## 2      2 0.3500546 -78.60982 0.3440643
## 3      3 0.3688733 -79.68018 0.3601076
## 4      4 0.3766637 -77.01907 0.3650668
## 5      5 0.3802765 -72.90423 0.3657969
## 6      6 0.3820527 -68.14208 0.3646458
## 7      7 0.3837096 -63.33912 0.3633604
## 8      8 0.3840819 -58.07845 0.3607296
## 9      9 0.3842257 -52.73616 0.3578353
## 10    10 0.3843450 -47.38518 0.3548878
## 11    11 0.3844386 -42.02499 0.3518848
## 12    12 0.3844973 -36.65237 0.3488160
which.min(summary(subsets)$bic)
## [1] 3
coef(subsets,which.min(summary(subsets)$bic))
##          (Intercept)    ejection_fraction high_blood_pressure1 
##          205.5053896           -0.9449922          -24.2224278 
##         DEATH_EVENT1 
##         -100.6633229
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:wrapr':
## 
##     pack, unpack
subsets_measures %>%
  gather(key = type, value=value, 2:4)%>%
  ggplot(aes(x=model,y=value))+
  geom_line()+
  geom_point()+
  facet_grid(type~.,scales='free_y')

Evaluation We are creating the – regression subset of the predictors from all the variables Where predictors are changing for each model to get the best one. which suggested Best Model is Model3 with predictors: were selected also these are the values obtained for R2 • R2 0.368 - Also some of the studies suggest that for human behavior and for physical process we generally have R2 value less than 50%. Which is true in our case. • BIC -0.79 • SD - 76.05 • RMSE - 66.12 - rmse improved from 67.71 to 66.12 utilizing random forest algorithm for test data. We are using the same in random forest

CONCLUSION

Follow Up Time:

There seem to be a difference in median time of the follow up period between people with high BP and people without high BP. The difference of time is seen in when the patient was alive, patient passed away.

We fitted a random forest model to predict the follow up time with an RMSE of 67.71 is better than standard deviation of 76.04 with ejection fraction, high blood pressure, death event as statistically significant.

Death Event

We used forward step wise algorithm we found the best model with AIC of 156, with time, ejection_fraction, age, serum_creatininge, serum_sodium , diabetes being statistically significant.

The Area under curve was 0.7984 showing 79% accuracy when we predicted the values against test data.