To predict the presence of an adverse event among patients who had an unplanned transfer to a higher level of care, using the data obtained in the patient record review study.
This prediction could be used for future adverse event detection and classification, since record review is a very labor-intensive procedure. A comparison of the different methods applied for this prediction is made.
The data come from a large patient record review study (Marquet et al., 2015) of hospitalized patients who needed an unplanned transfer to a higher level of care, which included transfer to intensive care or an in-hospital medical emergency team intervention.
It involved the review of 830 records of patients who had an unplanned transfer to a higher level of care, of which 465 (56%) were found to have an adverse event and 365 (44%) had no adverse event.
Several prediction and classification approaches are applied and compared.
The code and output below show the different approaches, which mainly involve selecting the variables that best predict the presence or absence of an adverse event.
We start with univariate logistic regression to look for variables showing an association with the outcome, using a cutpoint of p-value <= 0.25 to pick those showing at least some association. The univariate analyses were done on the training data. Admission diagnosis is not yet included, since its grouping is to be collapsed a bit further.
set.seed(37)
train<-sample(1:nrow(review),(0.6*nrow(review)))
test<- -train
train.data<-review[train,]
test.data<-review[test,]
varlist<-c("sex","agecat","ASA","pre.hosp","polypharmacy","hom.situatn",
"adm.type","med.v.surg","intervention","correct.ward","disciplines",
"daily.liv.actv","cog.impair","surg.durin.hosp","inv.trt.durin.hosp",
"DNR","transf.time.days","adm.hr","polyfarm.5","weeknd.week","adm.day")
#"si","adm.diag","transfer.type","last.ward" ,"rom"
library(aod)
# Loop over all candidate variables, fitting a univariate logistic regression for each
# (substitute() builds the formula AE ~ x for each variable name x)
models <- lapply(varlist, function(x) {
glm(substitute(AE ~ i, list(i = as.name(x))), data = train.data, family = "binomial")
})
# Function to extract the variable name
my.var<-function(x){
row.names(anova(x,test="Chisq"))[2]
}
# Function to extract the deviance
my.Dev<-function(x){
(anova(x,test="Chisq"))$Deviance[2]
}
# Function to extract the overall p-value
my.pval<-function(x){
(anova(x,test="Chisq"))$`Pr(>Chi)`[2]
}
# Objects to collect some information
variable<-NULL; Deviance<-NULL;pvalue<-NULL
# Loop over all the univariate models, collecting the summaries
for (i in 1 :length(models)){
variable[i]<- lapply(models,my.var)[i]
Deviance[i]<- lapply(models,my.Dev)[i]
pvalue[i]<- lapply(models,my.pval)[i]
}
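Equivalently, the three summaries can be collected in one pass each, without re-running lapply inside the loop; a compact sketch giving the same results (as atomic vectors rather than lists):
# One-pass alternative: apply each extractor once across all fitted models
variable<-sapply(models,my.var)
Deviance<-sapply(models,my.Dev)
pvalue<-sapply(models,my.pval)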
A table showing the variables and their deviances and p-values from the univariate analysis:
uni.model<-cbind(variable,Deviance=round(as.numeric(Deviance),4),pvalue=round(as.numeric(pvalue),5))
knitr::kable(uni.model<-as.data.frame(uni.model))
| variable | Deviance | pvalue |
|---|---|---|
| sex | 1.9615 | 0.16135 |
| agecat | 2.0397 | 0.5642 |
| ASA | 1.1147 | 0.77353 |
| pre.hosp | 6.9705 | 0.00829 |
| polypharmacy | 1.9658 | 0.16089 |
| hom.situatn | 6.6164 | 0.08518 |
| adm.type | 30.9013 | 0 |
| med.v.surg | 68.4148 | 0 |
| intervention | 14.6901 | 0.00065 |
| correct.ward | 2.0458 | 0.15263 |
| disciplines | 9.6381 | 0.00191 |
| daily.liv.actv | 2e-04 | 0.98846 |
| cog.impair | 1.1329 | 0.28715 |
| surg.durin.hosp | 42.3218 | 0 |
| inv.trt.durin.hosp | 0.1141 | 0.73555 |
| DNR | 2.082 | 0.3531 |
| transf.time.days | 5.6452 | 0.0175 |
| adm.hr | 9.542 | 0.00847 |
| polyfarm.5 | 7e-04 | 0.97923 |
| weeknd.week | 2.1387 | 0.14363 |
| adm.day | 6.0356 | 0.41921 |
Using a cutpoint of p-value <= 0.25, the following variables show at least some association with the outcome.
## [1] "sex" "pre.hosp" "polypharmacy"
## [4] "hom.situatn" "adm.type" "med.v.surg"
## [7] "intervention" "correct.ward" "disciplines"
## [10] "surg.durin.hosp" "transf.time.days" "adm.hr"
## [13] "weeknd.week"
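For reference, this list can be extracted from the table above; a small sketch assuming uni.model as constructed earlier:
# Variables whose univariate p-value is at most 0.25
keep<-as.numeric(as.character(uni.model$pvalue))<=0.25
as.character(uni.model$variable[keep])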
The variables selected by the univariate analysis are used to fit a multiple logistic regression; agecat and ASA are included as well, since they were considered clinically important predictors (see the discussion below).
summary(M1<- glm(AE ~ agecat + ASA + pre.hosp + polypharmacy + hom.situatn + adm.type +
med.v.surg + intervention + disciplines + surg.durin.hosp
+ transf.time.days + adm.hr,
data = train.data, family = "binomial")) #+ transfer.type
##
## Call:
## glm(formula = AE ~ agecat + ASA + pre.hosp + polypharmacy + hom.situatn +
## adm.type + med.v.surg + intervention + disciplines + surg.durin.hosp +
## transf.time.days + adm.hr, family = "binomial", data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3670 -0.9833 0.4426 0.9876 1.9074
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.598233 1.601318 0.374 0.70871
## agecat41-65 0.883716 0.518636 1.704 0.08840
## agecat66-79 0.959869 0.540740 1.775 0.07588
## agecat80+ 1.105565 0.550730 2.007 0.04470
## ASAASA 2 -0.087274 0.497962 -0.175 0.86087
## ASAASA 3 0.171203 0.535680 0.320 0.74927
## ASAASA 4 0.233170 0.558528 0.417 0.67633
## pre.hospyes 0.490066 0.218346 2.244 0.02480
## polypharmacy -0.035434 0.030646 -1.156 0.24758
## hom.situatnhome living alone -0.459272 0.785281 -0.585 0.55865
## hom.situatnhome cohabiting 0.079041 0.757287 0.104 0.91687
## hom.situatnstay in institution -0.376478 0.868601 -0.433 0.66470
## adm.typeelective -0.338629 0.683662 -0.495 0.62038
## adm.typeemergency 0.225411 0.462990 0.487 0.62636
## adm.typetransfer from other hospital 0.583163 0.737252 0.791 0.42895
## med.v.surgmedical - planned -0.326071 0.700638 -0.465 0.64165
## med.v.surgmedical -unplanned -1.200215 0.388698 -3.088 0.00202
## med.v.surgsurgical - planned 1.241078 0.754676 1.645 0.10007
## interventionmedical -1.530645 1.163589 -1.315 0.18836
## interventionsurgical -1.010269 1.189964 -0.849 0.39589
## disciplines 0.277318 0.232868 1.191 0.23370
## surg.durin.hospyes 0.288278 0.305042 0.945 0.34464
## transf.time.days 0.007366 0.009131 0.807 0.41989
## adm.hrday(8-15:59) 0.161816 0.245101 0.660 0.50913
## adm.hrovernight(00-07:59) 0.403018 0.324143 1.243 0.21374
##
## (Intercept)
## agecat41-65 .
## agecat66-79 .
## agecat80+ *
## ASAASA 2
## ASAASA 3
## ASAASA 4
## pre.hospyes *
## polypharmacy
## hom.situatnhome living alone
## hom.situatnhome cohabiting
## hom.situatnstay in institution
## adm.typeelective
## adm.typeemergency
## adm.typetransfer from other hospital
## med.v.surgmedical - planned
## med.v.surgmedical -unplanned **
## med.v.surgsurgical - planned
## interventionmedical
## interventionsurgical
## disciplines
## surg.durin.hospyes
## transf.time.days
## adm.hrday(8-15:59)
## adm.hrovernight(00-07:59)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 679.34 on 497 degrees of freedom
## Residual deviance: 570.72 on 473 degrees of freedom
## AIC: 620.72
##
## Number of Fisher Scoring iterations: 4
# Test statistic for model fit
# H0: all coefficients = 0
# H1: at least one coefficient != 0
# A small p-value indicates the current model fits better than the null model
with(M1, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
## [1] 9.806367e-13
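For reference, the quantity computed above is the likelihood-ratio (deviance) statistic
\[
G = D_{\text{null}} - D_{\text{residual}} \sim \chi^2_{\,df_{\text{null}} - df_{\text{residual}}} \quad \text{under } H_0;
\]
here \(G = 679.34 - 570.72 = 108.62\) on \(497 - 473 = 24\) degrees of freedom, whose upper-tail probability is the p-value shown.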
Stepwise model selection is used, in both the forward and backward directions.
search <- step(M1,~.)
## summary of the search
search$anova
The final main-effects model chosen is as follows. ASA and agecat were not significant but were kept in the model after an earlier discussion in which we decided they are important predictors.
summary(M2<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
disciplines, data = train.data, family = "binomial"))
##
## Call:
## glm(formula = AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
## disciplines, family = "binomial", data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2965 -1.0709 0.4753 1.0076 1.5661
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8590 1.2772 0.673 0.501191
## ASAASA 2 -0.2551 0.4889 -0.522 0.601779
## ASAASA 3 -0.1372 0.5060 -0.271 0.786255
## ASAASA 4 -0.1761 0.5013 -0.351 0.725434
## agecat41-65 0.8116 0.5099 1.592 0.111475
## agecat66-79 0.8918 0.5292 1.685 0.091958 .
## agecat80+ 0.9224 0.5399 1.708 0.087552 .
## pre.hospyes 0.5099 0.2094 2.435 0.014891 *
## med.v.surgmedical - planned -0.6344 0.4507 -1.408 0.159225
## med.v.surgmedical -unplanned -1.2171 0.3599 -3.382 0.000719 ***
## med.v.surgsurgical - planned 0.9297 0.4581 2.029 0.042408 *
## interventionmedical -1.5525 1.1325 -1.371 0.170419
## interventionsurgical -0.9353 1.1560 -0.809 0.418487
## disciplines 0.5157 0.1715 3.007 0.002638 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 679.34 on 497 degrees of freedom
## Residual deviance: 582.40 on 484 degrees of freedom
## AIC: 610.4
##
## Number of Fisher Scoring iterations: 4
# Test statistic for model fit
with(M2, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
## [1] 6.476071e-15
## Checking for all possible interactions
# Some combinations lead to 0 cell counts; that is the reason for the warnings below
search <- step(M2, ~.^2)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (warning repeated 18 times)
# summary of the search
search$anova
The final model chosen after dropping the non-significant interactions is as follows.
# The model chosen after dropping non-significant interactions
summary(M3<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + disciplines +
med.v.surg:disciplines,data = train.data, family = "binomial"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
## disciplines + med.v.surg:disciplines, family = "binomial",
## data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2457 -0.9913 0.4371 1.0418 1.6109
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.1044 1.4595 0.757
## ASAASA 2 -0.2111 0.4965 -0.425
## ASAASA 3 -0.1326 0.5135 -0.258
## ASAASA 4 -0.2024 0.5123 -0.395
## agecat41-65 0.7778 0.5124 1.518
## agecat66-79 0.9195 0.5316 1.730
## agecat80+ 0.9307 0.5429 1.714
## pre.hospyes 0.5274 0.2125 2.482
## med.v.surgmedical - planned -16.0775 770.3255 -0.021
## med.v.surgmedical -unplanned -1.1235 0.8375 -1.341
## med.v.surgsurgical - planned 1.0718 1.1875 0.903
## interventionmedical -1.7372 1.1430 -1.520
## interventionsurgical -1.1092 1.1660 -0.951
## disciplines 0.4583 0.5282 0.868
## med.v.surgmedical - planned:disciplines 15.2004 770.3251 0.020
## med.v.surgmedical -unplanned:disciplines -0.0693 0.5620 -0.123
## med.v.surgsurgical - planned:disciplines -0.1216 0.8526 -0.143
## Pr(>|z|)
## (Intercept) 0.4493
## ASAASA 2 0.6707
## ASAASA 3 0.7962
## ASAASA 4 0.6928
## agecat41-65 0.1290
## agecat66-79 0.0837 .
## agecat80+ 0.0864 .
## pre.hospyes 0.0131 *
## med.v.surgmedical - planned 0.9833
## med.v.surgmedical -unplanned 0.1798
## med.v.surgsurgical - planned 0.3667
## interventionmedical 0.1286
## interventionsurgical 0.3415
## disciplines 0.3856
## med.v.surgmedical - planned:disciplines 0.9843
## med.v.surgmedical -unplanned:disciplines 0.9019
## med.v.surgsurgical - planned:disciplines 0.8866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 679.34 on 497 degrees of freedom
## Residual deviance: 575.80 on 481 degrees of freedom
## AIC: 609.8
##
## Number of Fisher Scoring iterations: 16
# Test statistic for model fit: current fit is better than Null
with(M3, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
## [1] 7.487665e-15
The optimal cutpoint, specificity, and sensitivity from the main-effects model and the model with the interaction are shown below.
# ROC curve
# install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 3.4.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Optimal cutpoint, sensitivity & specificity
# Variables to collect estimates from the main-effects model
threshold<-NULL;specificity<-NULL;sensitivity<-NULL
# estimates from model with interaction
threshold.M3<-NULL;specificity.M3<-NULL;sensitivity.M3<-NULL
# Drawing 100 different samples
set.seed(37)
for (i in 1:100){
train<-sample(1:nrow(review),(0.6*nrow(review)))
test<- -train
train.data<-review[train,]
test.data<-review[test,]
# AE grouping for the test data set
tst.AE<-AE[test]
# Main effects model
M2.2<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
disciplines, data = train.data, family = "binomial")
#Model with interaction
M3.2<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
disciplines +med.v.surg:disciplines
,data = train.data, family = "binomial")
# predicted probs
# Main effects model
prd.M2.2.test<-predict(M2.2,test.data,type="response")
# Model with interactions
prd.M3.2.test<-predict(M3.2,test.data,type="response")
# ROC curves for both models (the data frame holds observed class and predicted probabilities)
roc<-data.frame(t=tst.AE,pd=prd.M2.2.test,pd2=prd.M3.2.test)
roc1 <- roc(roc$t,roc$pd, percent=TRUE)
roc2 <- roc(roc$t,roc$pd2, percent=TRUE)
# From Main effects model
r<-coords(roc1, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold[i]<-r[1]
specificity[i]<-r[2]
sensitivity[i]<-r[3]
# Interaction Model
r1<-coords(roc2, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold.M3[i]<-r1[1]
specificity.M3[i]<-r1[2]
sensitivity.M3[i]<-r1[3]
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (warning repeated 37 times)
## Loading required package: lattice
## Loading required package: plyr
From the main-effects model, the mean [95% CI] specificity is 77.075 [75.656, 78.494], meaning that on average 77.1% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 57.299 [55.792, 58.807], meaning that 57.3% of the patients with an adverse event will be correctly classified as having one. This classification corresponds to a mean cutpoint of 0.565 [0.552, 0.577].
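The summaries below use multi.fun, a small helper defined further down in this document; for readability, the version used here is reproduced, under the assumption that CI() comes from the Rmisc package (which would explain the lattice/plyr loading messages above):
library(Rmisc) # assumed source of CI(); loads lattice and plyr
# Summary helper: mean, median, range, and 95% CI of a vector
multi.fun<-function(x){
c(mean=mean(x),median=median(x),range=range(x),CI=CI(x,ci=0.95))
}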
# Optimal Threshold. Main effects model
round(multi.fun(threshold),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.565 0.563 0.426 0.717 0.577 0.565 0.552
# Specificity. Main effects model
round(multi.fun(specificity),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 77.075 77.894 57.692 91.216 78.494 77.075 75.656
# Sensitivity. Main effects model
round(multi.fun(sensitivity),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 57.299 56.919 41.304 74.468 58.807 57.299 55.792
From the model with the interaction, the mean [95% CI] specificity is 79.363 [78.007, 80.719], meaning that on average 79.4% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 55.162 [53.618, 56.705], meaning that 55.2% of the patients with an adverse event will be correctly classified as having one. This classification corresponds to a mean cutpoint of 0.576 [0.562, 0.590].
# Interaction model
# Optimal Threshold. Interaction model
round(multi.fun(threshold.M3),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.576 0.571 0.446 0.829 0.590 0.576 0.562
# Specificity. Interaction model
round(multi.fun(specificity.M3),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 79.363 80.203 61.842 93.333 80.719 79.363 78.007
# Sensitivity. Interaction model
round(multi.fun(sensitivity.M3),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 55.162 54.222 30.769 72.581 56.705 55.162 53.618
To assess the variability in the prediction performance of the models on the test data, the models were refit 100 times on differently sampled training data, and their predictions on the test data were assessed. To compare performance, I fitted both the main-effects model and the model with the interaction separately.
set.seed(37)
# Variables to collect the outputs
# Classification rate. Main effects model
cls.tst<-NULL;cls.trn<-NULL
# Classification rate. Interaction model
M3.cls.tst<-NULL;M3.cls.trn<-NULL
for (i in 1:100){
train<-sample(1:nrow(review),(0.6*nrow(review)))
test<- -train
train.data<-review[train,]
test.data<-review[test,]
# AE (Adverse Event) grouping for the data sets
tst.AE<-AE[test]; trn.AE<-AE[train]
# fit using training data set
# Main effects model.
M2.1<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
disciplines,data = train.data, family = "binomial")
# Model with an Interaction.
M3.1<- glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + disciplines
+ med.v.surg:disciplines,data = train.data, family = "binomial")
# prediction using testing data set
# Main-effects model (M2.1) and interaction model (M3.1)
prd.M2.test<-predict(M2.1,test.data,type="response")
prd.M3.test<-predict(M3.1,test.data,type="response")
# prediction using Training data set
# Main-effects model and interaction model
prd.M2.train<-predict(M2.1,train.data,type="response")
prd.M3.train<-predict(M3.1,train.data,type="response")
# Converting the predicted probabilities into classes at a cutpoint of 0.5
# Main effects model M2.1
## test data
test.cl<-rep("No AE", dim(test.data)[1])
test.cl[prd.M2.test>0.5]<-"AE Detected"
## train data
train.cl<-rep("No AE", dim(train.data)[1])
train.cl[prd.M2.train>0.5]<-"AE Detected"
# Interaction model M3.1
## test data
M3.test.cl<-rep("No AE", dim(test.data)[1])
M3.test.cl[prd.M3.test>0.5]<-"AE Detected"
## train data
M3.train.cl<-rep("No AE", dim(train.data)[1])
M3.train.cl[prd.M3.train>0.5]<-"AE Detected"
# Classification rate. Main effects model
cls.tst[i]<-mean(test.cl == tst.AE)
cls.trn[i]<-mean(train.cl == trn.AE)
# Classification rate. Interaction model
M3.cls.tst[i]<-mean(M3.test.cl == tst.AE)
M3.cls.trn[i]<-mean(M3.train.cl == trn.AE)
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (warning repeated 37 times)
# Histograms of the correct-classification rates
par(mfrow=c(2,2))
hist(cls.tst,prob=T,main="Test Data.Main effects",xlab="Correct Classification")
hist(cls.trn,prob=T,main="Train Data.Main effects",xlab="Correct Classification")
hist(M3.cls.tst,prob=T,main="Test Data. Int' Model",xlab="Correct Classification")
hist(M3.cls.trn,prob=T,main="Train Data.Int' Model",xlab="Correct Classification")
# Summary helper: mean, median, sd, range, and 95% CI
multi.fun=function(x){
c(mean=mean(x),median=median(x),sd=sd(x),range=range(x),CI=CI(x, ci = 0.95)
)}
# Main Effects model only
# Predicted Correct classification on the Test Data
round(multi.fun(cls.tst),3)
## mean median sd range1 range2 CI.upper CI.mean CI.lower
## 0.647 0.646 0.021 0.599 0.693 0.651 0.647 0.642
# Predicted Correct classification on the Train Data
round(multi.fun(cls.trn),3)
## mean median sd range1 range2 CI.upper CI.mean CI.lower
## 0.668 0.671 0.015 0.637 0.701 0.671 0.668 0.666
# Model with Interaction
# Predicted. Correct classification on the Test Data
round(multi.fun(M3.cls.tst),3)
## mean median sd range1 range2 CI.upper CI.mean CI.lower
## 0.647 0.648 0.023 0.587 0.699 0.651 0.647 0.642
# Predicted Correct classification on the Train Data
round(multi.fun(M3.cls.trn),3)
## mean median sd range1 range2 CI.upper CI.mean CI.lower
## 0.671 0.671 0.016 0.639 0.709 0.675 0.671 0.668
From the summary statistics above, the main-effects model shows a mean correct classification rate on the test data of 0.647 [0.642, 0.651], similar to that of the model with the interaction.
We can fit a model containing all \(p\) predictors using a technique that constrains, or shrinks, the coefficient estimates towards zero. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso. In the case of the lasso, the penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter \(\lambda\) is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection (Tibshirani et al., 2014).
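Concretely, for a binary outcome \(y_i\) with predictors \(x_i\), the lasso-penalized logistic regression solves
\[
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i\,(\beta_0 + x_i^{T}\beta) - \log\big(1 + e^{\beta_0 + x_i^{T}\beta}\big)\Big] \;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert,
\]
so that larger values of \(\lambda\) force more of the \(\beta_j\) to be exactly zero.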
This was done using the glmnet package, which fits a generalized linear model via penalized maximum likelihood. The fitted lasso-penalized logistic regression is shown below; the main aim was to look for an alternative way to obtain the predictors that best minimize the misclassification error.
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.4
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-16
##
## Attaching package: 'glmnet'
## The following object is masked from 'package:pROC':
##
## auc
# list of some of the variables
varlist<-c("AE","sex","agecat","ASA","pre.hosp","polypharmacy","hom.situatn",
"adm.type","med.v.surg","intervention","correct.ward","disciplines",
"daily.liv.actv","cog.impair","surg.durin.hosp","inv.trt.durin.hosp","DNR",
"transf.time.days","polyfarm.5","weeknd.week","adm.day","adm.hr")
#"si","adm.diag",
# Preparing data for the glmnet function
r<-review[,varlist]
x=model.matrix(AE~.,r)[,-1] # creates a design matrix of the predictors
y=r$AE # response
# Splitting into training and test sets
set.seed(37)
train=sample(1:nrow(x), (nrow(x)*0.6))
test=(- train )
y.test=y[test]
Using the cv.glmnet function, 10-fold cross-validation was used to find the optimal tuning parameter \(\lambda\) (which controls the overall strength of the penalty) for the lasso, namely the \(\lambda\) value that leads to the smallest misclassification error.
The cross-validation error is minimal at the vertical broken line in the plot below.
set.seed (37)
cv.out<-cv.glmnet(x[train ,],y[train],alpha=1,nfold=10,
type.measure="class", family="binomial")
# "class"- gives misclassification error.
A plot of misclassification error against \(\lambda\): the vertical dotted lines mark the region where the misclassification error is smallest (one line is at the minimum and the other at one standard error above the minimum, which does almost as well); the minimum is taken as the optimal \(\lambda\) value. The top axis shows, as a function of \(\lambda\), how many nonzero variables are in the model; it indicates that 3-7 variables are required to achieve the minimum misclassification error.
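The plot described above can be reproduced with the standard plot method for cv.glmnet objects:
# CV misclassification error vs log(lambda), with the two vertical reference lines
plot(cv.out)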
(bestlam=cv.out$lambda.min) # optimal Lambda value to use
## [1] 0.01718078
Using this optimal value, we fit the penalized logistic regression on the training data and obtain predictions and classifications on the test data.
# Fit on the Training Data
lasso.mod<-glmnet(x[train ,],y[train],alpha=1, family="binomial")
# 1) Predicted probabilities on the Test data
lasso.pred<-predict(lasso.mod,s=bestlam ,newx=x[test ,],type="response")
# 2) Predicted class labels on the test data
class<-predict(lasso.mod,s=bestlam ,newx=x[test ,],type="class")
These are the variables that were considered important by the lasso variable selection; the coefficients of all the rest were shrunk to 0. They include:
## (Intercept) sexF
## 0.1743037568 -0.0980815686
## pre.hospyes hom.situatnhome cohabiting
## 0.3240711197 0.2466525128
## med.v.surgmedical - planned med.v.surgmedical -unplanned
## -0.0150397498 -0.7612816045
## med.v.surgsurgical - planned interventionmedical
## 0.7527433766 -0.3426807469
## disciplines daily.liv.actvno limitations
## 0.2707035673 -0.1145123744
## surg.durin.hospyes transf.time.days
## 0.3540763681 0.0009114912
## polyfarm.5yes adm.dayvrijdag
## 0.0174283736 0.1136587658
## adm.daywoensdag
## -0.0807448230
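The listing above can be obtained along these lines; a sketch using lasso.mod and bestlam from the fit above:
# Nonzero coefficients at the optimal lambda (all others are shrunk to exactly 0)
cf<-as.matrix(predict(lasso.mod,type="coefficients",s=bestlam))
cf[cf[,1]!=0,,drop=FALSE]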
100 different samples were drawn, with prediction and classification done on each, to obtain the distribution and variability of the prediction and classification performance.
# 100 samples
r<-review[,varlist]
x=model.matrix(AE~.,r)[,-1]
y=r$AE
# Variables to collect some outputs
clss.test<-double(100);clss.train<-double(100)
import.var.l<-NULL;allvar.l<-NULL
threshold.las<-NULL;specificity.las<-NULL;sensitivity.las<-NULL
set.seed(37)
for (i in 1:100){
train=sample(1:nrow(x), (nrow(x)*0.6))
test=(- train )
y.test=y[test]
lasso.mod=glmnet(x[train ,],y[train],alpha=1, family="binomial")
# Choosing the tuning parameter (lambda) using 10-fold cross-validation
cv.out<-cv.glmnet(x[train ,],y[train],nfolds=10, type.measure="class"
,family="binomial")
# predicted class on train data
class.train.l<-predict(cv.out,x[train,],s=cv.out$lambda.min,type="class")
clss.train[i]<-mean(class.train.l==y[train])
# predicted class on test data
class.test.l<-predict(cv.out,x[test,],s=cv.out$lambda.min,type="class")
clss.test[i]<-mean(class.test.l==y[test])
# Coefficient values of the variables
cof.l<-predict(cv.out,x[test,],s=cv.out$lambda.min,type="coefficients")[1:dim(x)[2],]
assign(paste("n",i, sep = "."),cof.l)
allvar.l<-cbind(allvar.l,cof.l)
colnames(allvar.l)[i]<-(paste("iter",i, sep = "."))
# Predicted probability on test data
lasso.pred.1=predict(cv.out ,x[test,],s=cv.out$lambda.min,type="response") #
roc.l<-data.frame(t=y.test, pd =(lasso.pred.1))
roc.las<- roc(roc.l[,1],roc.l[,2], percent=TRUE)
# getting the sensitivity, specificity
las<-coords(roc.las, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold.las[i]<-las[1]
specificity.las[i]<-las[2]
sensitivity.las[i]<-las[3]
}
# Distribution of the correct-classification rates
#____
par(mfrow=c(1,2))
hist(clss.test,prob=T,main="Test Data",xlab="Correct Classification")
hist(clss.train,prob=T,main="Train Data",xlab="Correct Classification")
# Summary helper (as before, without sd)
multi.fun=function(x){
c(mean=mean(x),median=median(x),range=range(x), CI=CI(x, ci = 0.95)
)}
# Predicted correct classification on the Testing Data
round(multi.fun(clss.test),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.633 0.633 0.587 0.684 0.637 0.633 0.630
# Predicted correct classification on the Training Data
round(multi.fun(clss.train),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.671 0.666 0.627 0.723 0.675 0.671 0.666
From the summary statistics above, the mean correct classification rate on the test data is 0.633 [0.630, 0.637], slightly smaller than that from ordinary logistic regression.
From the lasso model, the mean [95% CI] specificity is 76.736 [74.885, 78.587], meaning that on average 76.7% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 54.524 [52.577, 56.471], meaning that 54.5% of the patients with an adverse event will be correctly classified as having one. This classification corresponds to a mean cutpoint of 0.579 [0.566, 0.592].
# Threshold
round(multi.fun(threshold.las),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.579 0.582 0.426 0.735 0.592 0.579 0.566
# Specificity
round(multi.fun(specificity.las),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 76.736 77.998 54.305 92.593 78.587 76.736 74.885
# Sensitivity
round(multi.fun(sensitivity.las),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 54.524 53.119 37.436 75.253 56.471 54.524 52.577
The group lasso ensures that dummy variables derived from the same categorical variable are treated as a group, unlike what is done by the glmnet package. The variability of the predictions over 100 different random samples is also assessed.
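With the coefficients partitioned into groups \(\beta_1,\dots,\beta_G\) (one group per original variable), the group lasso replaces the lasso penalty \(\lambda\sum_j \lvert\beta_j\rvert\) with
\[
\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2 ,
\]
where \(p_g\) is the number of dummy columns in group \(g\) (the \(\sqrt{p_g}\) weights being the usual default); the group-wise Euclidean norm sets whole groups of coefficients to zero together.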
# install.packages("grpreg")
library(grpreg)
## Warning: package 'grpreg' was built under R version 3.4.4
varlist<-c("AE","sex","agecat","ASA","pre.hosp","polypharmacy","hom.situatn",
"adm.type","med.v.surg","intervention","correct.ward","disciplines",
"daily.liv.actv","cog.impair","surg.durin.hosp","inv.trt.durin.hosp",
"DNR","transf.time.days","polyfarm.5","weeknd.week","adm.day","adm.hr")
#,"adm.diag", "si","transfer.type",
r1<-review[,varlist]
x1=model.matrix(AE~.,r1)[,-1]
y1=r1$AE
# Defining the groups, i.e. dummies from the same variable form one group
group<-NULL
r2<-review[,varlist[-1]]
for(i in 1:ncol(r2)){
if (nlevels(r2[,i])>1){
# a factor with k levels contributes k-1 dummy columns
tmp<-rep(i,nlevels(r2[,i])-1)
} else {
# a numeric (or single-level) column contributes one column
tmp<-i
}
group<-c(group,tmp)
}
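An equivalent, more compact construction; a sketch that assumes, as the loop does, that every non-factor column contributes exactly one column to the design matrix:
# One group index per dummy column: nlevels-1 entries per factor, 1 otherwise
group2<-rep(seq_along(r2),vapply(r2,function(v) max(nlevels(v)-1L,1L),integer(1)))
stopifnot(identical(as.integer(group),group2)) # sanity check against the loop above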
#_____________________
# Sampling and fitting the model 100 times to get the variability
# of the predicted classification
# defining the collecting variables
correct.class.train<-double(100);correct.class.test<-double(100)
import.var<-NULL;allvar<-NULL
threshold.glas<-NULL;specificity.glas<-NULL;sensitivity.glas<-NULL
set.seed(37)
# looping 100 times
for (i in 1:100){
train=sample(1:nrow(x1), (nrow(x1)*0.6))
test=(- train )
y.1<-as.numeric(y1)-1
fit <- grpreg(x1[train ,],y.1[train],group,penalty="grLasso", family="binomial")
# Choosing the tuning parameter (lambda) using 10-fold cross-validation
fit.cv<-cv.grpreg(x1[train ,],y.1[train],group=group, nfolds=10, trace=FALSE,family="binomial")
# predicted class on train data
class.train<-predict(fit.cv,x1[train,],lambda=fit.cv$lambda.min,type="class")
correct.class.train[i]<-mean(class.train==y.1[train])
# predicted class on test data
class.test<-predict(fit.cv,x1[test,],lambda=fit.cv$lambda.min,type="class")
correct.class.test[i]<-mean(class.test==y.1[test])
# Coefficients of picked variables on the test data
cof.gp<-predict(fit.cv,x1[test,],lambda=fit.cv$lambda.min,type="coefficients")[1:dim(x1)[2],]
#var<-cof.gp[cof.gp!=0]
#assign(paste("n",i, sep = "."),var)
assign(paste("n",i, sep = "_"),cof.gp)
#import.var=cbind(import.var,var)
allvar<-cbind(allvar,cof.gp)
#colnames(import.var)[i]<-(paste("iter",i, sep = "_"))
colnames(allvar)[i]<-(paste("iter",i, sep = "_"))
# Predicted probability on test data
gp.pred=predict(fit.cv ,x1[test,],lambda=fit.cv$lambda.min,type="response") #
roc.gl<-data.frame(t=y.1[test], pd =(gp.pred))
roc.glas<- roc(roc.gl[,1],roc.gl[,2], percent=TRUE)
#
glas<-coords(roc.glas, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold.glas[i]<-glas[1]
specificity.glas[i]<-glas[2]
sensitivity.glas[i]<-glas[3]
}
ft.cv<-cv.grpreg(x1[train ,],y.1[train],group=group, nfolds=10,trace=FALSE,
family="binomial")
# Lambda value that minimizes the prediction error
# Plot showing the chosen lambda and the number of important variables
plot(ft.cv,type="pred")
# Distribution of the correct-classification rates
#_____
par(mfrow=c(1,2))
hist(correct.class.test,prob=T,main="Test Data",xlab="Correct Classification")
hist(correct.class.train,prob=T,main="Train Data",xlab="Correct Classification")
# Summary helper (as before)
multi.fun=function(x){
c(mean=mean(x),median=median(x),range=range(x), CI=CI(x, ci = 0.95)
)}
# The predicted correct classification on the Testing Data
round(multi.fun(correct.class.test),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.641 0.639 0.596 0.696 0.645 0.641 0.636
# Predicted correct classification on the Training Data
round(multi.fun(correct.class.train),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.670 0.672 0.631 0.713 0.674 0.670 0.667
From the summary statistics above, the mean correct classification rate on the test data is 64.1% [63.6, 64.5].
From the group lasso model, the mean [95% CI] specificity is 75.469 [73.732, 77.207], meaning that on average 75.5% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 57.078 [55.270, 58.886], meaning that 57.1% of the patients with an adverse event will be correctly classified as having one. This classification corresponds to a mean cutpoint of 0.559 [0.547, 0.571].
# Threshold
round(multi.fun(threshold.glas),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.559 0.560 0.423 0.736 0.571 0.559 0.547
# Specificity
round(multi.fun(specificity.glas),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 75.469 73.957 52.318 93.284 77.207 75.469 73.732
# Sensitivity
round(multi.fun(sensitivity.glas),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 57.078 59.091 32.828 76.243 58.886 57.078 55.270
Decision trees are broadly classified into classification trees (for categorical outcomes) and regression trees (for continuous outcomes). This methodology can efficiently segment populations into meaningful subgroups: it identifies mutually exclusive and exhaustive subgroups of a population whose members share common characteristics that influence the dependent variable of interest. It produces a visual output, a multilevel structure that resembles the branches of a tree.
The basic strategy for growing decision trees is to split the input (covariate) space into regions within which the outcome is relatively homogeneous. This is done in a hierarchical manner using a sequence of binary splitting rules for the inputs. The algorithm efficiently looks at all possible binary splits over all the inputs and chooses the split that makes the corresponding values of the output variable as distinct as possible.
# install.packages("tree")
library(tree)
## Warning: package 'tree' was built under R version 3.4.4
# Splitting the data into training and test sets
set.seed(37)
train<-sample(1:nrow(review),(0.6*nrow(review)))
test<- -train
train.data<-review[train,]
test.data<-review[test,]
AE.test<-AE[test]
# Fit the tree on the training data, including all the variables
t1<-tree(AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention +
disciplines + ASA + surg.durin.hosp + transf.time.days+ hom.situatn
+ daily.liv.actv+ cog.impair+ adm.type + correct.ward +
inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day
+ adm.hr,data = train.data) # si + + adm.diag
summary(t1)
##
## Classification tree:
## tree(formula = AE ~ sex + agecat + pre.hosp + polypharmacy +
## med.v.surg + intervention + disciplines + ASA + surg.durin.hosp +
## transf.time.days + hom.situatn + daily.liv.actv + cog.impair +
## adm.type + correct.ward + inv.trt.durin.hosp + DNR + polyfarm.5 +
## weeknd.week + adm.day + adm.hr, data = train.data)
## Variables actually used in tree construction:
## [1] "med.v.surg" "transf.time.days" "polypharmacy"
## [4] "pre.hosp" "hom.situatn"
## Number of terminal nodes: 6
## Residual mean deviance: 1.165 = 573 / 492
## Misclassification error rate: 0.3133 = 156 / 498
plot(t1)
text(t1) # pretty=0
A bushy tree may be hard to interpret and may also have too much variance, so pruning is needed. A 10-fold cross-validation is used to decide where to prune, on the basis of misclassification error.
set.seed(37)
cv.tr<-cv.tree(t1, FUN=prune.misclass)
# Plot of misclassification vs tree size
# smallest misclassification is between sizes 5 and 6
plot(cv.tr)
# Pruning the tree to size 5
prune.tr<-prune.misclass(t1,best=5)
summary(prune.tr)
##
## Classification tree:
## snip.tree(tree = t1, nodes = 3L)
## Variables actually used in tree construction:
## [1] "med.v.surg" "transf.time.days" "polypharmacy"
## [4] "pre.hosp"
## Number of terminal nodes: 5
## Residual mean deviance: 1.177 = 580.3 / 493
## Misclassification error rate: 0.3133 = 156 / 498
plot(prune.tr)
text(prune.tr)
Evaluating the predictions of the pruned tree on the test data.
# Evaluating the pruned tree on the test data
t.pred.prun<-predict(prune.tr,test.data, type="class")
# Misclassification table
with(test.data,table(Predicted=t.pred.prun,Observed=AE))
## Observed
## Predicted No AE AE Detected
## No AE 66 42
## AE Detected 87 137
From the classification tree, 61.1% of the patients in the test data were correctly classified.
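The quoted rate can be recomputed directly from the predictions and the observed classes:
# Correct classification rate of the pruned tree on the test data
mean(t.pred.prun==AE.test) # (66+137)/332 = 0.611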
Random forests build a large number of bushy (unpruned) trees and then average them to reduce variance, with the aim of improving prediction accuracy.
# install.packages("randomForest")
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(37)
# random forest- builds lots of bushy trees, then averages them to reduce variance
rf<-randomForest(AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention +
disciplines + ASA + surg.durin.hosp + transf.time.days+ hom.situatn
+ daily.liv.actv+ cog.impair+ adm.type + correct.ward +
inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day
+ adm.hr, ntree=1000, data = train.data,importance=T) # + si + adm.diag
rf
##
## Call:
## randomForest(formula = AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention + disciplines + ASA + surg.durin.hosp + transf.time.days + hom.situatn + daily.liv.actv + cog.impair + adm.type + correct.ward + inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day + adm.hr, data = train.data, ntree = 1000, importance = T)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 31.33%
## Confusion matrix:
## No AE AE Detected class.error
## No AE 124 88 0.4150943
## AE Detected 68 218 0.2377622
# A Dot plot of the variable importance measured by the Random Forest
varImpPlot(rf,main="Variable Importance")
mtry is the number of variables considered at each split of each tree; the random forest picks them at random, with the aim of decorrelating the trees. Here a series of forests was fit with mtry ranging from 1 to 21, the number of variables considered.
# Vectors to collect the errors and correct-classification rates
oob.err=double(21)
test.err=double(21)
corect.cl=double(21)
for (mtry in 1 :21){
fit=randomForest(AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention +
disciplines + ASA + surg.durin.hosp + transf.time.days+ hom.situatn
+ daily.liv.actv+ cog.impair+ adm.type + correct.ward +
inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day
+ adm.hr
,data = train.data,mtry=mtry,ntree=1000) #si + adm.diag +
oob.err[mtry]=mean((fit$err.rate)[,1]) #mean OOB
pred=predict(fit,test.data)
test.err[mtry]=with(test.data,mean(pred!=AE))
corect.cl[mtry]=with(test.data,mean(pred==AE))
cat(mtry," ")
}
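As a quick visual check, the test-data correct classification rates collected above can be plotted against mtry; a small sketch:
# Correct classification on the test data as a function of mtry
plot(1:21,corect.cl,type="b",xlab="mtry",ylab="Correct classification (test data)")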
# Misclassification on the Test Data
#round(multi.fun(test.err),3)
# Correct Classification
round(multi.fun(corect.cl),3)
## mean median range1 range2 CI.upper CI.mean CI.lower
## 0.622 0.620 0.605 0.645 0.627 0.622 0.618
From the random forest predictions, 62.2% [61.8, 62.7] of the patients in the test data were correctly classified.
I eventually propose to use the variables highlighted as important by the various methods employed here and to build a logistic regression model as the final model for prediction and classification.