Including all libraries used in R
library(WVPlots)
## Warning: package 'WVPlots' was built under R version 4.0.5
## Loading required package: wrapr
## Warning: package 'wrapr' was built under R version 4.0.5
library(leaps)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:wrapr':
##
## coalesce
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(pwr)
library(readr)
## Warning: package 'readr' was built under R version 4.0.3
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.0.3
library(aod)
## Warning: package 'aod' was built under R version 4.0.5
library(broom)
## Warning: package 'broom' was built under R version 4.0.4
library(infer)
## Warning: package 'infer' was built under R version 4.0.5
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.5
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(sigr)
## Warning: package 'sigr' was built under R version 4.0.5
library(WVPlots)
library(ranger)
## Warning: package 'ranger' was built under R version 4.0.5
library(boot)
To analyze various contributing factors related to heart failure like age, red blood cells, high blood pressure, level of the creatinine phosphokinase in the blood, if the patient has diabetes or not, percentage of blood leaving the heart, number of platelets, sex, level of serum creatinine, serum sodium and if the patient is smoking. The goal of this research is to use regression models to predict the odds of death event during follow up period, time of follow up period based on the various factors analyzed.
heartFailData = read.csv("heartFailData.csv")
Data source: University of California Link: https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records
File Path: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv Size of data set: 299 observations Years – 2020 Total Variables – 13 Total Observations – 299 Dependent Variable – Death event(binary), time(num) Independent Variable considered – Age(num), anaemia(binary), high blood pressure(binary), creatinine phosphokinase(num), Diabetes(categorical), Ejection Fraction(num), Platelets(num), sex(binary), serum creatinine (num), serum sodium (num), smoking (binary)
#leverage
modeltime = lm(time~.,data=heartFailData)
summary(modeltime)
##
## Call:
## lm(formula = time ~ ., data = heartFailData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.115 -49.439 -8.572 54.556 157.501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.903e+02 1.245e+02 1.529 0.1274
## age -4.074e-01 3.370e-01 -1.209 0.2277
## anaemia -1.634e+01 7.841e+00 -2.084 0.0380 *
## creatinine_phosphokinase -1.093e-03 4.016e-03 -0.272 0.7856
## diabetes 2.406e+00 7.850e+00 0.307 0.7594
## ejection_fraction -6.398e-01 3.420e-01 -1.871 0.0624 .
## high_blood_pressure -2.421e+01 8.000e+00 -3.027 0.0027 **
## platelets -1.049e-05 3.931e-05 -0.267 0.7898
## serum_creatinine 1.502e+00 3.882e+00 0.387 0.6992
## serum_sodium 2.810e-01 8.977e-01 0.313 0.7545
## sex -5.829e+00 9.111e+00 -0.640 0.5229
## smoking -5.842e+00 9.105e+00 -0.642 0.5216
## DEATH_EVENT -8.658e+01 9.196e+00 -9.415 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64.89 on 286 degrees of freedom
## Multiple R-squared: 0.3292, Adjusted R-squared: 0.301
## F-statistic: 11.7 on 12 and 286 DF, p-value: < 2.2e-16
k=12
leverage = hatvalues(modeltime)
cutleverage = (2*k+2) / nrow(heartFailData)
badleverage = as.numeric(leverage > cutleverage)
table(badleverage)
## badleverage
## 0 1
## 284 15
#cooks distance
cooks = cooks.distance(modeltime)
cutcooks = 4 / (nrow(heartFailData) - k - 1)
badcooks = as.numeric(cooks > cutcooks)
table(badcooks)
## badcooks
## 0 1
## 288 11
#mahalanobis
datanew <- heartFailData[,c(1,3,5,7,8,9,12)]
mahal = mahalanobis(datanew,
colMeans(datanew),
cov(datanew))
cutmahal = qchisq(1-.001, ncol(datanew))
badmahal = as.numeric(mahal > cutmahal) ##note the direction of the >
table(badmahal)
## badmahal
## 0 1
## 287 12
#total outlers
totalout = badmahal + badleverage + badcooks
table(totalout)
## totalout
## 0 1 2 3
## 278 9 7 5
heartFailureData = subset(heartFailData, totalout < 3)
#total outliers = 5 (which satisfies all the three conditions)
Outliers were removed from Leverage distance(0.08), Mahalanobis distance(24.32), Cooks’ distance(0.01) All the assumptionsLinearity, Additivity, Multicollinearity, Homogeneity, Homoscedasticity were met except normality, for which we used boot strapping.
heartFailureData$anaemia = as.factor(heartFailureData$anaemia)
heartFailureData$high_blood_pressure = as.factor(heartFailureData$high_blood_pressure)
heartFailureData$diabetes = as.factor(heartFailureData$diabetes)
heartFailureData$sex = as.factor(heartFailureData$sex)
heartFailureData$smoking = as.factor(heartFailureData$smoking)
heartFailureData$DEATH_EVENT = as.factor(heartFailureData$DEATH_EVENT)
summary(heartFailureData)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 0:166 Min. : 23.0 0:171 Min. :14.00
## 1st Qu.:51.00 1:128 1st Qu.: 115.0 1:123 1st Qu.:30.00
## Median :60.00 Median : 247.0 Median :38.00
## Mean :61.05 Mean : 537.8 Mean :37.94
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:45.00
## Max. :95.00 Max. :7702.0 Max. :80.00
## high_blood_pressure platelets serum_creatinine serum_sodium sex
## 0:190 Min. : 25100 Min. :0.500 Min. :113.0 0:103
## 1:104 1st Qu.:212250 1st Qu.:0.900 1st Qu.:134.0 1:191
## Median :262000 Median :1.100 Median :137.0
## Mean :260244 Mean :1.373 Mean :136.6
## 3rd Qu.:302750 3rd Qu.:1.400 3rd Qu.:140.0
## Max. :621000 Max. :9.400 Max. :148.0
## smoking time DEATH_EVENT
## 0:200 Min. : 4.0 0:200
## 1: 94 1st Qu.: 73.0 1: 94
## Median :115.0
## Mean :130.2
## 3rd Qu.:204.0
## Max. :285.0
## `geom_smooth()` using formula 'y ~ x'
death odds seem to increase with cpk levels.
## `geom_smooth()` using formula 'y ~ x'
death event odds are less when serum creatinine levels are low
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
There seems to be a difference in median time of the follow up period between people with high BP and people without high blood pressure. the difference of time is seen in when the patient was alive , patient passed away.
## Alive Passed Away
## 200 94
## [1] 0.07064777
## # A tibble: 1 x 1
## pval
## <dbl>
## 1 1.77
The graph suggests there is no difference between two groups one with high blood pressure and one with not high blood pressure in the occurrence of death
heartFailDataInf %>%
ggplot(aes(x=time)) +
geom_density(adjust=2)
aov(time ~ high_blood_pressure, heartFailDataInf) %>%
tidy()
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 high_blood_pressure 1 74443. 74443. 13.0 0.000371
## 2 Residuals 292 1675452. 5738. NA NA
74442.97 / (74442.97 + 1675451.91)
## [1] 0.0425414
high blood pressure seem to be statistically significant with p < 0.05 and explains 4% of variance
By plotting observed statistic against null distribution, we learned that observed difference of 0.07 is not as greater then sort of difference we would see if there was no association between these two variables. This suggests there is no difference between two groups , one with high blood pressure and one with not high blood pressure in the occurence of death during the follow up period.
smp_size <- floor(0.75 * nrow(heartFailureData))
set.seed(299)
train_ind <- sample(seq_len(nrow(heartFailureData)), size = smp_size)
train <- heartFailureData[train_ind, ]
test <- heartFailureData[-train_ind, ]
m=glm(DEATH_EVENT ~ time + platelets + high_blood_pressure , data = train, family="binomial")
prob = predict(m, test, type="response")
mean(as.numeric(as.character(train$DEATH_EVENT)))
## [1] 0.3136364
test$pred = ifelse(prob > 0.31, 1, 0)
mean(test$DEATH_EVENT == test$pred)
## [1] 0.6351351
ROC = roc(test$DEATH_EVENT, prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col="blue")
auc(ROC)
## Area under the curve: 0.7633
#Stepwise Binary Logistic regression
null_model = glm(DEATH_EVENT ~ 1 , data = train, family="binomial")
full_model = glm(DEATH_EVENT ~., data = train, family="binomial")
step_model = step(null_model, scope=list(lower=null_model, upper = full_model), directoin="forward")
## Start: AIC=275.67
## DEATH_EVENT ~ 1
##
## Df Deviance AIC
## + time 1 186.32 190.32
## + ejection_fraction 1 254.33 258.33
## + serum_creatinine 1 256.48 260.48
## + age 1 256.64 260.64
## + serum_sodium 1 265.21 269.21
## <none> 273.67 275.67
## + high_blood_pressure 1 272.55 276.55
## + anaemia 1 272.82 276.82
## + diabetes 1 273.12 277.12
## + smoking 1 273.38 277.38
## + creatinine_phosphokinase 1 273.40 277.40
## + sex 1 273.67 277.67
## + platelets 1 273.67 277.67
##
## Step: AIC=190.32
## DEATH_EVENT ~ time
##
## Df Deviance AIC
## + ejection_fraction 1 166.62 172.62
## + age 1 176.40 182.40
## + serum_creatinine 1 177.50 183.50
## + serum_sodium 1 178.07 184.07
## <none> 186.32 190.32
## + diabetes 1 185.32 191.32
## + creatinine_phosphokinase 1 185.61 191.61
## + platelets 1 185.90 191.90
## + high_blood_pressure 1 186.00 192.00
## + smoking 1 186.27 192.27
## + anaemia 1 186.30 192.30
## + sex 1 186.32 192.32
## - time 1 273.67 275.67
##
## Step: AIC=172.62
## DEATH_EVENT ~ time + ejection_fraction
##
## Df Deviance AIC
## + age 1 153.89 161.89
## + serum_creatinine 1 159.04 167.04
## + serum_sodium 1 161.23 169.23
## <none> 166.62 172.62
## + diabetes 1 165.54 173.54
## + creatinine_phosphokinase 1 166.03 174.03
## + sex 1 166.10 174.10
## + platelets 1 166.26 174.26
## + high_blood_pressure 1 166.28 174.28
## + smoking 1 166.37 174.37
## + anaemia 1 166.52 174.52
## - ejection_fraction 1 186.32 190.32
## - time 1 254.33 258.33
##
## Step: AIC=161.89
## DEATH_EVENT ~ time + ejection_fraction + age
##
## Df Deviance AIC
## + serum_creatinine 1 148.07 158.07
## + serum_sodium 1 150.21 160.21
## + diabetes 1 150.89 160.89
## <none> 153.89 161.89
## + smoking 1 152.77 162.77
## + high_blood_pressure 1 152.97 162.97
## + creatinine_phosphokinase 1 153.14 163.14
## + sex 1 153.16 163.16
## + platelets 1 153.26 163.26
## + anaemia 1 153.86 163.86
## - age 1 166.62 172.62
## - ejection_fraction 1 176.40 182.40
## - time 1 233.29 239.29
##
## Step: AIC=158.07
## DEATH_EVENT ~ time + ejection_fraction + age + serum_creatinine
##
## Df Deviance AIC
## + serum_sodium 1 145.35 157.35
## + diabetes 1 145.52 157.52
## <none> 148.07 158.07
## + sex 1 147.44 159.44
## + creatinine_phosphokinase 1 147.66 159.66
## + smoking 1 147.67 159.67
## + platelets 1 147.76 159.76
## + high_blood_pressure 1 147.84 159.84
## + anaemia 1 148.03 160.03
## - serum_creatinine 1 153.89 161.89
## - age 1 159.04 167.04
## - ejection_fraction 1 169.84 177.84
## - time 1 222.15 230.15
##
## Step: AIC=157.35
## DEATH_EVENT ~ time + ejection_fraction + age + serum_creatinine +
## serum_sodium
##
## Df Deviance AIC
## + diabetes 1 142.86 156.86
## <none> 145.35 157.35
## - serum_sodium 1 148.07 158.07
## + sex 1 144.47 158.47
## + smoking 1 144.77 158.77
## + creatinine_phosphokinase 1 144.82 158.82
## + platelets 1 145.13 159.13
## + high_blood_pressure 1 145.25 159.25
## + anaemia 1 145.32 159.32
## - serum_creatinine 1 150.21 160.21
## - age 1 155.26 165.26
## - ejection_fraction 1 165.38 175.38
## - time 1 220.88 230.88
##
## Step: AIC=156.86
## DEATH_EVENT ~ time + ejection_fraction + age + serum_creatinine +
## serum_sodium + diabetes
##
## Df Deviance AIC
## <none> 142.86 156.86
## - diabetes 1 145.35 157.35
## - serum_sodium 1 145.52 157.52
## + sex 1 142.02 158.02
## + smoking 1 142.26 158.26
## + creatinine_phosphokinase 1 142.49 158.49
## + platelets 1 142.59 158.59
## + high_blood_pressure 1 142.80 158.80
## + anaemia 1 142.85 158.85
## - serum_creatinine 1 147.39 159.39
## - age 1 154.35 166.35
## - ejection_fraction 1 163.84 175.84
## - time 1 219.09 231.09
step_prob = predict(step_model, test, type= "response")
ROC3 = roc(test$DEATH_EVENT, step_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC3, col="red")
auc(ROC3)
## Area under the curve: 0.7984
Evaluation of AUC:
From the graph we can see that area under the curve is greater area – Our curve is not close to the 45-degree diagonal of the ROC space, which suggest the model has higher accuracy. Which suggest we have good rate for positive test where people are correctly identified with the heart failure at the time when they visit the hospital.
correl = cor(heartFailData[,c(1,3,5,7,8,9,12)])
symnum(correl)
## a c e p srm_c srm_s t
## age 1
## creatinine_phosphokinase 1
## ejection_fraction 1
## platelets 1
## serum_creatinine 1
## serum_sodium 1
## time 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
#it does met the assumption of additivity as correlation values are less than 0.7 and there is no multicollinearity
#Additivity - we already meet the assumption of additivity (verified for logistic regression)
#new model after removing outliers(to check assumptions)
modeltime = lm(time~.,heartFailureData)
summary(modeltime)
##
## Call:
## lm(formula = time ~ ., data = heartFailureData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.780 -46.734 -9.286 52.874 154.889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.029e+02 1.229e+02 1.651 0.099903 .
## age -3.158e-01 3.355e-01 -0.941 0.347353
## anaemia1 -1.665e+01 7.706e+00 -2.160 0.031588 *
## creatinine_phosphokinase 2.490e-03 4.585e-03 0.543 0.587492
## diabetes1 2.281e+00 7.743e+00 0.295 0.768543
## ejection_fraction -8.627e-01 3.426e-01 -2.518 0.012362 *
## high_blood_pressure1 -2.643e+01 7.907e+00 -3.343 0.000942 ***
## platelets -1.806e-06 4.314e-05 -0.042 0.966630
## serum_creatinine -3.389e+00 4.228e+00 -0.801 0.423594
## serum_sodium 2.225e-01 8.850e-01 0.251 0.801686
## sex1 -3.327e+00 8.977e+00 -0.371 0.711173
## smoking1 -4.864e+00 8.982e+00 -0.542 0.588554
## DEATH_EVENT1 -8.725e+01 9.054e+00 -9.636 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.62 on 281 degrees of freedom
## Multiple R-squared: 0.35, Adjusted R-squared: 0.3222
## F-statistic: 12.61 on 12 and 281 DF, p-value: < 2.2e-16
#linearity
standardized = rstudent(modeltime)
fitted = scale(modeltime$fitted.values)
qqnorm(standardized)
abline(0,1)
Yes , it met the assumption of linearity as the dots are close to line
hist(standardized)
library(moments)
skewness(datanew)
## age creatinine_phosphokinase ejection_fraction
## 0.4209366 4.4406886 0.5525927
## platelets serum_creatinine serum_sodium
## 1.4549745 4.4336102 -1.0428705
## time
## 0.1271606
kurtosis(datanew)
## age creatinine_phosphokinase ejection_fraction
## 2.798207 27.710458 3.020720
## platelets serum_creatinine serum_sodium
## 9.085906 28.378346 7.031142
## time
## 1.788126
shapiro.test(modeltime$residuals)
##
## Shapiro-Wilk normality test
##
## data: modeltime$residuals
## W = 0.98331, p-value = 0.00168
creatinine_phosphokinase and serum_creatinine has very high skew value does not met the assumption of normality as the p value is less 0.05
#Heterocadasticity
plot(fitted, standardized)
abline(0,0)
abline(v = 0)
library(car)
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:wrapr':
##
## bc
vif(modeltime)
## age anaemia creatinine_phosphokinase
## 1.145810 1.060374 1.055992
## diabetes ejection_fraction high_blood_pressure
## 1.059589 1.164590 1.038030
## platelets serum_creatinine serum_sodium
## 1.039234 1.149905 1.114320
## sex smoking DEATH_EVENT
## 1.332182 1.274569 1.294973
It meet the assumption of Homogeneity and Homoscedasticity as the variance is spread evenly across y axis from -2 to +2,also VIF is less than 5
We are having a null model with no predictors and full model using all of the predictors
myTimeModel1 = lm(time~ejection_fraction+high_blood_pressure+DEATH_EVENT, data = train)
wrapFTest(myTimeModel1)
## [1] "F Test summary: (R2=0.3689, F(3,216)=42.08, p<1e-05)."
test$timePred1 = predict(myTimeModel1, newdata = test)
test$residuals = test$time - test$timePred1
summary(test$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 85.25 116.00 132.70 204.00 258.00
summary(test$timePred1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.37 74.13 148.21 127.35 169.12 181.88
test %>%
ggplot(aes(x=timePred1, y=time)) +
geom_point() +
geom_abline(color = "blue")
test %>%
ggplot(aes(x=timePred1, y=residuals)) +
geom_pointrange(aes(ymin=0, ymax=residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("linear model prediction for time vs ejection_fraction, high bp, death event")
GainCurvePlot(test, "timePred1", "time", "time model")
rmse = sqrt(mean(test$residuals ^ 2))
print("rmse")
## [1] "rmse"
rmse
## [1] 67.71416
print("standard deviation")
## [1] "standard deviation"
sd(test$time)
## [1] 76.04968
print("r square")
## [1] "r square"
glance(myTimeModel1)$r.squared
## [1] 0.3688733
diagnol line shows the if the time is random green is the curve where perfect model would trace out blue is the curve our model traces out
rmse of 67.71 is better than standard deviation of 76.04
myTimeModel2 = ranger(time~ejection_fraction+high_blood_pressure+DEATH_EVENT, train, num.trees = 500, respect.unordered.factors = "order", seed = 299)
test$timePred2 = predict(myTimeModel2, test)$predictions
test$residuals2 = test$time - test$timePred2
summary(test$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 85.25 116.00 132.70 204.00 258.00
summary(test$timePred2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 64.5 86.2 140.0 126.5 158.6 169.4
test %>%
ggplot(aes(x=timePred2, y=time)) +
geom_point() +
geom_abline(color = "blue")
test %>%
ggplot(aes(x=timePred2, y=residuals2)) +
geom_pointrange(aes(ymin=0, ymax=residuals2)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("linear model prediction for time vs ejection_fraction, high bp, death event")
GainCurvePlot(test, "timePred2", "time", "time model")
print("rmse")
## [1] "rmse"
rmse = sqrt(mean(test$residuals2 ^ 2))
rmse
## [1] 66.12833
print("standard deviation")
## [1] "standard deviation"
sd(test$time)
## [1] 76.04968
print("r square")
## [1] "r square"
glance(myTimeModel1)$r.squared
## [1] 0.3688733
rmse improved from 67.71 to 66.12 utilizing random forest algorithm for test data.
anova(myTimeModel1, myTimeModel2)
## Warning in anova.lmlist(object, ...): models with response '"NULL"' removed
## because response differs from model 1
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## ejection_fraction 1 765 765 0.1972 0.6574383
## high_blood_pressure 1 50255 50255 12.9601 0.0003947 ***
## DEATH_EVENT 1 438512 438512 113.0877 < 2.2e-16 ***
## Residuals 216 837568 3878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To evaluate the linear regression model We are using train and test data – with Sample size of 0.75 Train data is for running the models and test data is for the predictions
We are having a null model with no predictors and full model using all of the predictors We also perfomed anova test on the models These are the results obtained
#Prediction for Linear regression
mymodel3 = lm( time ~ ejection_fraction + high_blood_pressure + DEATH_EVENT , data = train)
summary(mymodel3)
##
## Call:
## lm(formula = time ~ ejection_fraction + high_blood_pressure +
## DEATH_EVENT, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148.156 -44.763 -4.902 48.354 153.783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.5054 16.0842 12.777 < 2e-16 ***
## ejection_fraction -0.9450 0.3724 -2.538 0.01186 *
## high_blood_pressure1 -24.2224 8.8881 -2.725 0.00695 **
## DEATH_EVENT1 -100.6633 9.4659 -10.634 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.27 on 216 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3601
## F-statistic: 42.08 on 3 and 216 DF, p-value: < 2.2e-16
predicted_model = predict(mymodel3, newdata = test)
test$predictedTime = predicted_model
#head(test)
# time vs predicted time
ggplot(test, aes(predictedTime, time)) +
geom_point() +
geom_abline() +
ggtitle("Prediction vs. Real values")
residuals_model = test$time - predicted_model
test$residuals = residuals_model
# Predicted values vs residuals
ggplot(test, aes(predictedTime, residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0) +
labs(x="Predicted Charges", y="Residuals")
# Histogram of Residuals
ggplot(test, aes(residuals)) +
geom_histogram(bins = 10) +
labs(x="Residuals", y="Frequency")
# The gain curve plot measures how well the model score sorts the data compared to the true outcome value.
GainCurvePlot(test, "predictedTime", "time", "Model")
act_pred <- data.frame(cbind(actuals=test$time, predicteds=test$predictedTime)) # actuals_predicteds
cor(act_pred) # correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.4759494
## predicteds 0.4759494 1.0000000
min_max <- mean(apply(act_pred, 1, min) / apply(act_pred, 1, max))
print(min_max) # show the result
## [1] 0.6089327
Boot Co-ef
bootcoef = function(formula, data, indices){
d = data[indices, ]
model = lm(formula, data = d)
return(coef(model))
}
mymodel4 = lm( time ~ . , data = heartFailureData)
summary(mymodel4)
##
## Call:
## lm(formula = time ~ ., data = heartFailureData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.780 -46.734 -9.286 52.874 154.889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.029e+02 1.229e+02 1.651 0.099903 .
## age -3.158e-01 3.355e-01 -0.941 0.347353
## anaemia1 -1.665e+01 7.706e+00 -2.160 0.031588 *
## creatinine_phosphokinase 2.490e-03 4.585e-03 0.543 0.587492
## diabetes1 2.281e+00 7.743e+00 0.295 0.768543
## ejection_fraction -8.627e-01 3.426e-01 -2.518 0.012362 *
## high_blood_pressure1 -2.643e+01 7.907e+00 -3.343 0.000942 ***
## platelets -1.806e-06 4.314e-05 -0.042 0.966630
## serum_creatinine -3.389e+00 4.228e+00 -0.801 0.423594
## serum_sodium 2.225e-01 8.850e-01 0.251 0.801686
## sex1 -3.327e+00 8.977e+00 -0.371 0.711173
## smoking1 -4.864e+00 8.982e+00 -0.542 0.588554
## DEATH_EVENT1 -8.725e+01 9.054e+00 -9.636 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.62 on 281 degrees of freedom
## Multiple R-squared: 0.35, Adjusted R-squared: 0.3222
## F-statistic: 12.61 on 12 and 281 DF, p-value: < 2.2e-16
AIC(mymodel4)
## [1] 3290.985
plot(mymodel4)
mymodel4.boot = boot(formula = time ~ .,,
data = heartFailureData,
statistic = bootcoef,
R = 1000)
mymodel4.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = heartFailureData, statistic = bootcoef, R = 1000,
## formula = time ~ .)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.029021e+02 -2.223650e+00 1.195450e+02
## t2* -3.157813e-01 1.039215e-02 3.213348e-01
## t3* -1.664894e+01 6.318908e-01 7.577235e+00
## t4* 2.490330e-03 6.402995e-04 4.867673e-03
## t5* 2.280794e+00 1.907160e-01 8.135637e+00
## t6* -8.626637e-01 8.318194e-03 3.316777e-01
## t7* -2.643082e+01 -9.408112e-02 7.347773e+00
## t8* -1.806352e-06 4.198509e-07 4.282838e-05
## t9* -3.388512e+00 3.874594e-02 3.248523e+00
## t10* 2.224909e-01 3.082664e-03 8.534531e-01
## t11* -3.327380e+00 1.984901e-01 9.100107e+00
## t12* -4.864411e+00 -2.000522e-01 9.150950e+00
## t13* -8.724534e+01 2.251181e-01 8.513011e+00
plot(mymodel4.boot)
mymodel4.boot2 = boot(formula = time ~ .,
data = heartFailureData[sample(1:nrow(heartFailureData), 50, replace=FALSE),],
statistic = bootcoef,
R = 1000)
mymodel4.boot2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = heartFailureData[sample(1:nrow(heartFailureData),
## 50, replace = FALSE), ], statistic = bootcoef, R = 1000,
## formula = time ~ .)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -7.583364e+01 -4.510757e+01 3.190611e+02
## t2* 3.523779e-01 -7.629962e-03 1.205633e+00
## t3* 6.017260e+00 -1.661428e+00 2.355244e+01
## t4* 2.573725e-02 3.355185e-03 2.684879e-02
## t5* 9.227490e+00 3.861918e+00 2.366643e+01
## t6* -7.234338e-01 8.596197e-02 9.887926e-01
## t7* -1.887019e+01 -2.697714e+00 2.198386e+01
## t8* 5.388161e-05 -2.250349e-05 1.526385e-04
## t9* -1.060822e+01 8.533374e-01 1.019078e+01
## t10* 1.597580e+00 3.431184e-01 2.299446e+00
## t11* 2.301577e+01 -4.035560e-01 2.887668e+01
## t12* 2.201726e+01 -1.856380e+00 2.420984e+01
## t13* -1.101771e+02 -2.762760e-01 2.616902e+01
plot(mymodel4.boot2)
From Bootstrapping:- Standard error are much higher for the boot2 - in comparision with original model Co-effient values are much higher for the subset of data - in comparision with original model
length(heartFailureData$ejection_fraction)
## [1] 294
mean(heartFailureData$ejection_fraction)
## [1] 37.93878
sd(heartFailureData$ejection_fraction)
## [1] 11.70731
error <- qt(0.95,df=length(heartFailureData$ejection_fraction)-1)*sd(heartFailureData$ejection_fraction)/sqrt(length(heartFailureData$ejection_fraction))
left <- mean(heartFailureData$ejection_fraction)-error
right <- mean(heartFailureData$ejection_fraction)+error
left
## [1] 36.81213
right
## [1] 39.06542
CONFIDENCE INTERVAL for ejection_fraction is between 36 and 39
boot.ci(mymodel4.boot2, index = 5)
## Warning in boot.ci(mymodel4.boot2, index = 5): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = mymodel4.boot2, index = 5)
##
## Intervals :
## Level Normal Basic
## 95% (-41.020, 51.751 ) (-42.507, 48.554 )
##
## Level Percentile BCa
## 95% (-30.099, 60.962 ) (-35.944, 54.685 )
## Calculations and Intervals on Original Scale
CONFIDENCE INTERVAL for variable 5 that is in the range which suggest that our data is consistent
# Additional evaluations
subsets = leaps::regsubsets(time~.,data=train, nvmax=12)
summary(subsets)
## Subset selection object
## Call: regsubsets.formula(time ~ ., data = train, nvmax = 12)
## 12 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## anaemia1 FALSE FALSE
## creatinine_phosphokinase FALSE FALSE
## diabetes1 FALSE FALSE
## ejection_fraction FALSE FALSE
## high_blood_pressure1 FALSE FALSE
## platelets FALSE FALSE
## serum_creatinine FALSE FALSE
## serum_sodium FALSE FALSE
## sex1 FALSE FALSE
## smoking1 FALSE FALSE
## DEATH_EVENT1 FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## age anaemia1 creatinine_phosphokinase diabetes1 ejection_fraction
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " "*"
## 5 ( 1 ) " " "*" " " " " "*"
## 6 ( 1 ) " " "*" " " "*" "*"
## 7 ( 1 ) " " "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" "*"
## 9 ( 1 ) "*" "*" " " "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*"
## high_blood_pressure1 platelets serum_creatinine serum_sodium sex1
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" " " "*" " " " "
## 8 ( 1 ) "*" " " "*" " " " "
## 9 ( 1 ) "*" "*" "*" " " " "
## 10 ( 1 ) "*" "*" "*" " " "*"
## 11 ( 1 ) "*" "*" "*" " " "*"
## 12 ( 1 ) "*" "*" "*" "*" "*"
## smoking1 DEATH_EVENT1
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
names(summary(subsets))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
subsets_measures = data.frame(model=1:length(summary(subsets)$cp),
rsq=summary(subsets)$rsq,
bic=summary(subsets)$bic,
adjr2=summary(subsets)$adjr2)
subsets_measures
## model rsq bic adjr2
## 1 1 0.3265290 -76.18102 0.3234397
## 2 2 0.3500546 -78.60982 0.3440643
## 3 3 0.3688733 -79.68018 0.3601076
## 4 4 0.3766637 -77.01907 0.3650668
## 5 5 0.3802765 -72.90423 0.3657969
## 6 6 0.3820527 -68.14208 0.3646458
## 7 7 0.3837096 -63.33912 0.3633604
## 8 8 0.3840819 -58.07845 0.3607296
## 9 9 0.3842257 -52.73616 0.3578353
## 10 10 0.3843450 -47.38518 0.3548878
## 11 11 0.3844386 -42.02499 0.3518848
## 12 12 0.3844973 -36.65237 0.3488160
which.min(summary(subsets)$bic)
## [1] 3
coef(subsets,which.min(summary(subsets)$bic))
## (Intercept) ejection_fraction high_blood_pressure1
## 205.5053896 -0.9449922 -24.2224278
## DEATH_EVENT1
## -100.6633229
library(tidyr)
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:wrapr':
##
## pack, unpack
subsets_measures %>%
gather(key = type, value=value, 2:4)%>%
ggplot(aes(x=model,y=value))+
geom_line()+
geom_point()+
facet_grid(type~.,scales='free_y')
Evaluation We are creating the – regression subset of the predictors from all the variables Where predictors are changing for each model to get the best one. which suggested Best Model is Model3 with predictors: were selected also these are the values obtained for R2 • R2 0.368 - Also some of the studies suggest that for human behavior and for physical process we generally have R2 value less than 50%. Which is true in our case. • BIC -0.79 • SD - 76.05 • RMSE - 66.12 - rmse improved from 67.71 to 66.12 utilizing random forest algorithm for test data. We are using the same in random forest
Follow Up Time:
There seem to be a difference in median time of the follow up period between people with high BP and people without high BP. The difference of time is seen in when the patient was alive, patient passed away.
We fitted a random forest model to predict the follow up time with an RMSE of 67.71 is better than standard deviation of 76.04 with ejection fraction, high blood pressure, death event as statistically significant.
Death Event
We used forward step wise algorithm we found the best model with AIC of 156, with time, ejection_fraction, age, serum_creatininge, serum_sodium , diabetes being statistically significant.
The Area under curve was 0.7984 showing 79% accuracy when we predicted the values against test data.