Objective

To predict the presence of an adverse event among patients who had an unplanned transfer to a higher level of care, using the data obtained in the patient record review study.

Such a prediction could be used for future adverse event detection and classification, since record review is a very labor-intensive procedure. A comparison of the different methods applied for this prediction is also made.

Data

The data come from a large patient record review study (Marquet et al., 2015) done among hospitalized patients who needed an unplanned transfer to a higher level of care, defined as a transfer to intensive care or an in-hospital medical emergency team intervention.

It involved review of 830 records of patients who had an unplanned transfer to a higher level of care, of which 465 (56%) were found to have an adverse event and 365 (44%) had no adverse event.

Approach

Several prediction and classification approaches are compared; among the possible options, the following are used: ordinary logistic regression (with stepwise variable selection), lasso-penalized logistic regression, the group lasso, classification trees, and random forests.

The code and output below show the different approaches, which mainly involve choosing the variables that best predict the presence or absence of adverse events.

Ordinary Logistic regression.

Univariate Analysis.

Starting with univariate logistic regressions to look for variables showing an association with the outcome, using a cut-point of p-value <= 0.25 to pick those variables showing at least some association. The univariate analysis was done on the training data. I am yet to include admission diagnosis since we still have to collapse its grouping a bit further.

set.seed(37)
train<-sample(1:nrow(review),(0.6*nrow(review)))
test<- -train
train.data<-review[train,]
test.data<-review[test,]

varlist<-c("sex","agecat","ASA","pre.hosp","polypharmacy","hom.situatn",
           "adm.type","med.v.surg","intervention","correct.ward","disciplines",
           "daily.liv.actv","cog.impair","surg.durin.hosp","inv.trt.durin.hosp",
           "DNR","transf.time.days","adm.hr","polyfarm.5","weeknd.week","adm.day")
            #"si","adm.diag","transfer.type","last.ward" ,"rom"
library(aod)
# Loop over the candidate variables, fitting a univariate logistic regression for each
models <- lapply(varlist, function(x) {
    glm(substitute(AE ~ i, list(i = as.name(x))), data = train.data, family = "binomial")
})
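# Note: an equivalent, perhaps clearer, way to build each formula is base R's
# reformulate(), e.g.
#   glm(reformulate(x, response = "AE"), data = train.data, family = "binomial")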

# Function to extract the variable name
my.var<-function(x){
    row.names(anova(x,test="Chisq"))[2]
}
# Function to extract the deviance
my.Dev<-function(x){
    (anova(x,test="Chisq"))$Deviance[2]
}
# Function to extract the overall p-value
my.pval<-function(x){
    (anova(x,test="Chisq"))$`Pr(>Chi)`[2]
}
    
# Collect the variable names, deviances and p-values from all the univariate models
variable<-sapply(models,my.var)
Deviance<-sapply(models,my.Dev)
pvalue<-sapply(models,my.pval)

A table showing the variables and their p-values from the univariate analysis.

uni.model<-cbind(variable,Deviance=round(as.numeric(Deviance),4),pvalue=round(as.numeric(pvalue),5))
knitr::kable(uni.model<-as.data.frame(uni.model))
|variable           | Deviance|  pvalue|
|:------------------|--------:|-------:|
|sex                |   1.9615| 0.16135|
|agecat             |   2.0397| 0.56420|
|ASA                |   1.1147| 0.77353|
|pre.hosp           |   6.9705| 0.00829|
|polypharmacy       |   1.9658| 0.16089|
|hom.situatn        |   6.6164| 0.08518|
|adm.type           |  30.9013| 0.00000|
|med.v.surg         |  68.4148| 0.00000|
|intervention       |  14.6901| 0.00065|
|correct.ward       |   2.0458| 0.15263|
|disciplines        |   9.6381| 0.00191|
|daily.liv.actv     |   0.0002| 0.98846|
|cog.impair         |   1.1329| 0.28715|
|surg.durin.hosp    |  42.3218| 0.00000|
|inv.trt.durin.hosp |   0.1141| 0.73555|
|DNR                |   2.0820| 0.35310|
|transf.time.days   |   5.6452| 0.01750|
|adm.hr             |   9.5420| 0.00847|
|polyfarm.5         |   0.0007| 0.97923|
|weeknd.week        |   2.1387| 0.14363|
|adm.day            |   6.0356| 0.41921|

Using a cut-point of p-value <= 0.25, the following variables show at least some association with the outcome.
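The chunk that produced this list was not echoed; a minimal sketch of the filtering step, assuming the uni.model data frame built above (its columns are stored as factors, hence the conversion):

# Sketch: keep the variables whose univariate p-value is <= 0.25
sel<-with(uni.model, as.character(variable[as.numeric(as.character(pvalue)) <= 0.25]))
sel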

##  [1] "sex"              "pre.hosp"         "polypharmacy"    
##  [4] "hom.situatn"      "adm.type"         "med.v.surg"      
##  [7] "intervention"     "correct.ward"     "disciplines"     
## [10] "surg.durin.hosp"  "transf.time.days" "adm.hr"          
## [13] "weeknd.week"

Multiple Logistic regression.

Using variables selected by the univariate analysis, together with agecat and ASA (retained as important predictors, see below), to fit a multiple logistic regression.

summary(M1<- glm(AE ~ agecat + ASA + pre.hosp + polypharmacy + hom.situatn + adm.type + 
                      med.v.surg + intervention + disciplines + surg.durin.hosp 
                    + transf.time.days + adm.hr,
                    data = train.data, family = "binomial")) #+ transfer.type  
## 
## Call:
## glm(formula = AE ~ agecat + ASA + pre.hosp + polypharmacy + hom.situatn + 
##     adm.type + med.v.surg + intervention + disciplines + surg.durin.hosp + 
##     transf.time.days + adm.hr, family = "binomial", data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3670  -0.9833   0.4426   0.9876   1.9074  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                           0.598233   1.601318   0.374  0.70871
## agecat41-65                           0.883716   0.518636   1.704  0.08840
## agecat66-79                           0.959869   0.540740   1.775  0.07588
## agecat80+                             1.105565   0.550730   2.007  0.04470
## ASAASA 2                             -0.087274   0.497962  -0.175  0.86087
## ASAASA 3                              0.171203   0.535680   0.320  0.74927
## ASAASA 4                              0.233170   0.558528   0.417  0.67633
## pre.hospyes                           0.490066   0.218346   2.244  0.02480
## polypharmacy                         -0.035434   0.030646  -1.156  0.24758
## hom.situatnhome living alone         -0.459272   0.785281  -0.585  0.55865
## hom.situatnhome cohabiting            0.079041   0.757287   0.104  0.91687
## hom.situatnstay in institution       -0.376478   0.868601  -0.433  0.66470
## adm.typeelective                     -0.338629   0.683662  -0.495  0.62038
## adm.typeemergency                     0.225411   0.462990   0.487  0.62636
## adm.typetransfer from other hospital  0.583163   0.737252   0.791  0.42895
## med.v.surgmedical - planned          -0.326071   0.700638  -0.465  0.64165
## med.v.surgmedical -unplanned         -1.200215   0.388698  -3.088  0.00202
## med.v.surgsurgical - planned          1.241078   0.754676   1.645  0.10007
## interventionmedical                  -1.530645   1.163589  -1.315  0.18836
## interventionsurgical                 -1.010269   1.189964  -0.849  0.39589
## disciplines                           0.277318   0.232868   1.191  0.23370
## surg.durin.hospyes                    0.288278   0.305042   0.945  0.34464
## transf.time.days                      0.007366   0.009131   0.807  0.41989
## adm.hrday(8-15:59)                    0.161816   0.245101   0.660  0.50913
## adm.hrovernight(00-07:59)             0.403018   0.324143   1.243  0.21374
##                                        
## (Intercept)                            
## agecat41-65                          . 
## agecat66-79                          . 
## agecat80+                            * 
## ASAASA 2                               
## ASAASA 3                               
## ASAASA 4                               
## pre.hospyes                          * 
## polypharmacy                           
## hom.situatnhome living alone           
## hom.situatnhome cohabiting             
## hom.situatnstay in institution         
## adm.typeelective                       
## adm.typeemergency                      
## adm.typetransfer from other hospital   
## med.v.surgmedical - planned            
## med.v.surgmedical -unplanned         **
## med.v.surgsurgical - planned           
## interventionmedical                    
## interventionsurgical                   
## disciplines                            
## surg.durin.hospyes                     
## transf.time.days                       
## adm.hrday(8-15:59)                     
## adm.hrovernight(00-07:59)              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 679.34  on 497  degrees of freedom
## Residual deviance: 570.72  on 473  degrees of freedom
## AIC: 620.72
## 
## Number of Fisher Scoring iterations: 4
# Test statistic for model fit (likelihood ratio test against the null model)
    # Ho : all coefficients = 0
    # H1 : at least one coefficient != 0
# A small p-value shows the current model fits better than the null model
with(M1, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
## [1] 9.806367e-13
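For reference, the quantity tested above is the likelihood-ratio (deviance difference) statistic, \[ G^2 = D_{null} - D_{residual} \sim \chi^2_{df_{null} - df_{residual}}, \] which here equals \(679.34 - 570.72 = 108.62\) on \(497 - 473 = 24\) degrees of freedom, giving the p-value shown.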

Variable selection Methods.

Using stepwise model selection, with both forward and backward selection.

search <- step(M1,~.) 
## summary of the search
search$anova

The final main-effects model chosen is as follows. ASA and agecat were not significant but were kept in the model after an earlier discussion in which we decided they are important predictors.

summary(M2<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + 
                    disciplines, data = train.data, family = "binomial"))
## 
## Call:
## glm(formula = AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + 
##     disciplines, family = "binomial", data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2965  -1.0709   0.4753   1.0076   1.5661  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    0.8590     1.2772   0.673 0.501191    
## ASAASA 2                      -0.2551     0.4889  -0.522 0.601779    
## ASAASA 3                      -0.1372     0.5060  -0.271 0.786255    
## ASAASA 4                      -0.1761     0.5013  -0.351 0.725434    
## agecat41-65                    0.8116     0.5099   1.592 0.111475    
## agecat66-79                    0.8918     0.5292   1.685 0.091958 .  
## agecat80+                      0.9224     0.5399   1.708 0.087552 .  
## pre.hospyes                    0.5099     0.2094   2.435 0.014891 *  
## med.v.surgmedical - planned   -0.6344     0.4507  -1.408 0.159225    
## med.v.surgmedical -unplanned  -1.2171     0.3599  -3.382 0.000719 ***
## med.v.surgsurgical - planned   0.9297     0.4581   2.029 0.042408 *  
## interventionmedical           -1.5525     1.1325  -1.371 0.170419    
## interventionsurgical          -0.9353     1.1560  -0.809 0.418487    
## disciplines                    0.5157     0.1715   3.007 0.002638 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 679.34  on 497  degrees of freedom
## Residual deviance: 582.40  on 484  degrees of freedom
## AIC: 610.4
## 
## Number of Fisher Scoring iterations: 4
# Test statistic for model fit
with(M2, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
## [1] 6.476071e-15
## Checking for all possible two-way interactions
# Some combinations lead to 0 cell counts; that is the reason for the 'warning' printouts
search <- step(M2, ~.^2) 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the warning above was repeated 18 times in the original output)
# summary of the search
search$anova

The final model chosen after dropping the non-significant interactions is as follows.

# The model chosen after dropping non-significant interactions
summary(M3<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + disciplines + 
                     med.v.surg:disciplines,data = train.data, family = "binomial"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + 
##     disciplines + med.v.surg:disciplines, family = "binomial", 
##     data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2457  -0.9913   0.4371   1.0418   1.6109  
## 
## Coefficients:
##                                          Estimate Std. Error z value
## (Intercept)                                1.1044     1.4595   0.757
## ASAASA 2                                  -0.2111     0.4965  -0.425
## ASAASA 3                                  -0.1326     0.5135  -0.258
## ASAASA 4                                  -0.2024     0.5123  -0.395
## agecat41-65                                0.7778     0.5124   1.518
## agecat66-79                                0.9195     0.5316   1.730
## agecat80+                                  0.9307     0.5429   1.714
## pre.hospyes                                0.5274     0.2125   2.482
## med.v.surgmedical - planned              -16.0775   770.3255  -0.021
## med.v.surgmedical -unplanned              -1.1235     0.8375  -1.341
## med.v.surgsurgical - planned               1.0718     1.1875   0.903
## interventionmedical                       -1.7372     1.1430  -1.520
## interventionsurgical                      -1.1092     1.1660  -0.951
## disciplines                                0.4583     0.5282   0.868
## med.v.surgmedical - planned:disciplines   15.2004   770.3251   0.020
## med.v.surgmedical -unplanned:disciplines  -0.0693     0.5620  -0.123
## med.v.surgsurgical - planned:disciplines  -0.1216     0.8526  -0.143
##                                          Pr(>|z|)  
## (Intercept)                                0.4493  
## ASAASA 2                                   0.6707  
## ASAASA 3                                   0.7962  
## ASAASA 4                                   0.6928  
## agecat41-65                                0.1290  
## agecat66-79                                0.0837 .
## agecat80+                                  0.0864 .
## pre.hospyes                                0.0131 *
## med.v.surgmedical - planned                0.9833  
## med.v.surgmedical -unplanned               0.1798  
## med.v.surgsurgical - planned               0.3667  
## interventionmedical                        0.1286  
## interventionsurgical                       0.3415  
## disciplines                                0.3856  
## med.v.surgmedical - planned:disciplines    0.9843  
## med.v.surgmedical -unplanned:disciplines   0.9019  
## med.v.surgsurgical - planned:disciplines   0.8866  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 679.34  on 497  degrees of freedom
## Residual deviance: 575.80  on 481  degrees of freedom
## AIC: 609.8
## 
## Number of Fisher Scoring iterations: 16
# Test statistic for model fit: current fit is better than Null
with(M3, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
## [1] 7.487665e-15

Specificity, Sensitivity, and optimal cutpoint.

The optimal cutpoint, specificity and sensitivity from the main-effects and interaction models are shown below.

# ROC curve
# install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 3.4.4
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Optimal cutpoint, sensitivity & specificity
# Variables to collect estimates from the main effects model
threshold<-NULL;specificity<-NULL;sensitivity<-NULL
# Estimates from the model with interaction
threshold.M3<-NULL;specificity.M3<-NULL;sensitivity.M3<-NULL

# Drawing 100 different samples
set.seed(37)
for (i in 1:100){
    train<-sample(1:nrow(review),(0.6*nrow(review)))
    test<- -train
    train.data<-review[train,]
    test.data<-review[test,]
    
    # AE grouping for the testing dataset
    tst.AE<-AE[test]
# Main effects model    
M2.2<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
              disciplines, data = train.data, family = "binomial")
#Model with interaction
M3.2<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
              disciplines +med.v.surg:disciplines
             ,data = train.data, family = "binomial")
# predicted probs
# Main effects model
prd.M2.2.test<-predict(M2.2,test.data,type="response")
# Model with interactions
prd.M3.2.test<-predict(M3.2,test.data,type="response")

# ROC curve objects (pROC::roc takes the response first, then the predictor);
# the data frame is named roc.df so it does not shadow the roc() function
roc.df<-data.frame(t=tst.AE,pd=prd.M2.2.test,pd2=prd.M3.2.test)
roc1 <- roc(roc.df$t,roc.df$pd, percent=TRUE)
roc2 <- roc(roc.df$t,roc.df$pd2, percent=TRUE)
# From Main effects model
r<-coords(roc1, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold[i]<-r[1]
specificity[i]<-r[2]
sensitivity[i]<-r[3]
# Interaction Model
r1<-coords(roc2, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold.M3[i]<-r1[1]
specificity.M3[i]<-r1[2]
sensitivity.M3[i]<-r1[3]
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the warning above was repeated 37 times in the original output)

The mean [95% CI] specificity is 77.075 [75.656, 78.494]. This implies that on average 77.075% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 57.299 [55.792, 58.807], which implies that 57.299% of the patients with an adverse event will be correctly classified as having one. This classification was obtained when a mean cutpoint of 0.565 [0.552, 0.577] is used.

# Summary helper used throughout; CI() comes from the Rmisc package
# (loading Rmisc also brings in lattice and plyr)
library(Rmisc)
multi.fun<-function(x){
    c(mean=mean(x),median=median(x),range=range(x),CI=CI(x, ci = 0.95))
}

# Optimal Threshold. Main effects model
round(multi.fun(threshold),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.565    0.563    0.426    0.717    0.577    0.565    0.552
# Specificity. Main effects model
round(multi.fun(specificity),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   77.075   77.894   57.692   91.216   78.494   77.075   75.656
# Sensitivity. Main effects model
round(multi.fun(sensitivity),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   57.299   56.919   41.304   74.468   58.807   57.299   55.792

From the model with interaction, the mean [95% CI] specificity is 79.363 [78.007, 80.719]. This implies that on average 79.363% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 55.162 [53.618, 56.705], which implies that 55.162% of the patients with an adverse event will be correctly classified as having one. This classification was obtained when a mean cutpoint of 0.576 [0.562, 0.590] is used.

# Interaction model
# Optimal Threshold. Interaction model
round(multi.fun(threshold.M3),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.576    0.571    0.446    0.829    0.590    0.576    0.562
# Specificity. Interaction model
round(multi.fun(specificity.M3),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   79.363   80.203   61.842   93.333   80.719   79.363   78.007
# Sensitivity. Interaction model
round(multi.fun(sensitivity.M3),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   55.162   54.222   30.769   72.581   56.705   55.162   53.618

Variability of the predictions.

To assess the variability in the prediction performance of the models on the test data, the models were refit 100 times on different sampled training data and the predictions on the corresponding test data assessed. To get a look at the performance, I fitted both the main-effects model and the model with interaction separately.

set.seed(37)
# Variables to collect the outputs
#  Classification rate. Main effects model
cls.tst<-NULL;cls.trn<-NULL
#  Classification rate. Interaction model
M3.cls.tst<-NULL;M3.cls.trn<-NULL

for (i in 1:100){
    train<-sample(1:nrow(review),(0.6*nrow(review)))
    test<- -train
    train.data<-review[train,]
    test.data<-review[test,]
# AE (Adverse Events) grouping for the datasets
    tst.AE<-AE[test];    trn.AE<-AE[train]
# fit using training data set
    # Main effects model.
    M2.1<-glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention +
                   disciplines,data = train.data, family = "binomial")
    # Model with an Interaction. 
    M3.1<- glm(AE ~ ASA + agecat + pre.hosp + med.v.surg + intervention + disciplines
                    + med.v.surg:disciplines,data = train.data, family = "binomial")
# Prediction using the testing data set
# Main effects model (M2.1) and interaction model (M3.1)
prd.M2.test<-predict(M2.1,test.data,type="response")
prd.M3.test<-predict(M3.1,test.data,type="response")

# Prediction using the training data set
# Main effects model and interaction model
prd.M2.train<-predict(M2.1,train.data,type="response")
prd.M3.train<-predict(M3.1,train.data,type="response")

# Changing the predicted probabilities into classes at a cutpoint of 0.5
    # Main effects model M2.1
## test data
test.cl<-rep("No AE", dim(test.data)[1])
test.cl[prd.M2.test>0.5]<-"AE Detected" 

## train data
train.cl<-rep("No AE", dim(train.data)[1]) 
train.cl[prd.M2.train>0.5]<-"AE Detected" 

    # interaction model M3.1
# # test data
M3.test.cl<-rep("No AE", dim(test.data)[1])
M3.test.cl[prd.M3.test>0.5]<-"AE Detected"

# # train data
M3.train.cl<-rep("No AE", dim(train.data)[1])
M3.train.cl[prd.M3.train>0.5]<-"AE Detected"

#  Classification rate. Main effects model
cls.tst[i]<-mean(test.cl == tst.AE)
cls.trn[i]<-mean(train.cl == trn.AE)

#  Classification rate. Interaction model
 M3.cls.tst[i]<-mean(M3.test.cl == tst.AE)
 M3.cls.trn[i]<-mean(M3.train.cl == trn.AE)

}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the warning above was repeated 37 times in the original output)
# Histograms showing the distribution of the correct classification rates
par(mfrow=c(2,2))
hist(cls.tst,prob=T,main="Test Data.Main effects",xlab="Correct Classification")
hist(cls.trn,prob=T,main="Train Data.Main effects",xlab="Correct Classification")
hist(M3.cls.tst,prob=T,main="Test Data. Int' Model",xlab="Correct Classification")
hist(M3.cls.trn,prob=T,main="Train Data.Int' Model",xlab="Correct Classification")

# .....fx....
multi.fun=function(x){
    c(mean=mean(x),median=median(x),sd=sd(x),range=range(x),CI=CI(x, ci = 0.95)
    )}
    # Main Effects model only
# Predicted Correct classification on the Test Data
round(multi.fun(cls.tst),3)
##     mean   median       sd   range1   range2 CI.upper  CI.mean CI.lower 
##    0.647    0.646    0.021    0.599    0.693    0.651    0.647    0.642
# Predicted Correct classification on the Train Data
round(multi.fun(cls.trn),3)
##     mean   median       sd   range1   range2 CI.upper  CI.mean CI.lower 
##    0.668    0.671    0.015    0.637    0.701    0.671    0.668    0.666
    # Model with Interaction
# Predicted. Correct classification on the Test Data
round(multi.fun(M3.cls.tst),3)
##     mean   median       sd   range1   range2 CI.upper  CI.mean CI.lower 
##    0.647    0.648    0.023    0.587    0.699    0.651    0.647    0.642
# Predicted Correct classification on the Train Data
round(multi.fun(M3.cls.trn),3)
##     mean   median       sd   range1   range2 CI.upper  CI.mean CI.lower 
##    0.671    0.671    0.016    0.639    0.709    0.675    0.671    0.668

From the summary statistics above, the main-effects model shows a mean correct prediction on the test data of 0.647 [0.642, 0.651], which is similar to that of the prediction by the model with interaction.

Logistic Regression using lasso

We can fit a model containing all p predictors using a technique that constrains or shrinks the coefficient estimates towards zero. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso. In the case of the lasso, the penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter \(\lambda\) is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection (Tibshirani et al., 2014).
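Concretely, for logistic regression the lasso coefficients minimize the penalized negative log-likelihood \[ \hat{\beta} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \] where \(\ell(\beta)\) is the binomial log-likelihood; the larger \(\lambda\), the more coefficients are shrunk exactly to zero.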

This was done using the glmnet package, which fits a generalized linear model via penalized maximum likelihood. The fitted lasso-regularized (penalized) logistic regression is shown below; the main aim was to look for an alternative way to get the predictors that best minimize the misclassification error.

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.4
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-16
## 
## Attaching package: 'glmnet'
## The following object is masked from 'package:pROC':
## 
##     auc
# list of some of the variables
varlist<-c("AE","sex","agecat","ASA","pre.hosp","polypharmacy","hom.situatn",
           "adm.type","med.v.surg","intervention","correct.ward","disciplines",
           "daily.liv.actv","cog.impair","surg.durin.hosp","inv.trt.durin.hosp","DNR",
           "transf.time.days","polyfarm.5","weeknd.week","adm.day","adm.hr")
#"si","adm.diag",

# Preparing data for the glmnet function
r<-review[,varlist]
x=model.matrix(AE~.,r)[,-1] # creates a design matrix of the predictors
y=r$AE # response
# Spliting 
set.seed(37)
train=sample(1:nrow(x), (nrow(x)*0.6))
test=(- train )
y.test=y[test]

Selecting the Tuning parameter.

Using the cv.glmnet function, 10-fold cross-validation was used to find the optimal tuning parameter \(\lambda\) (which controls the overall strength of the penalty) for the LASSO. This is done by taking the \(\lambda\) value that leads to the smallest misclassification error; the cross-validation error is minimal at the vertical broken line in the plot below.

set.seed(37)
cv.out<-cv.glmnet(x[train ,],y[train],alpha=1,nfolds=10, 
                  type.measure="class", family="binomial")
# type.measure="class" gives the misclassification error
plot(cv.out) # CV misclassification error as a function of log(lambda)

A plot of misclassification error against \(\lambda\). The vertical dotted lines show the region where the misclassification error is smallest (one line is at the minimum and the other at one standard error above the minimum, which does almost as well as the minimum); this is taken as the optimal lambda value. The top of the graph shows, as a function of \(\lambda\), how many non-zero variables are in the model. It shows that roughly 3 to 7 variables are needed to achieve the minimum misclassification error.

(bestlam=cv.out$lambda.min) # optimal Lambda value to use
## [1] 0.01718078
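A more conservative choice that is sometimes preferred is the '1-SE rule' value, the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum:

cv.out$lambda.1se # gives a sparser model at little cost in error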

Using this optimal value, we fit the penalized logistic regression on the training data set and get the predictions and classifications on the test data.

# Fit on the Training Data
lasso.mod<-glmnet(x[train ,],y[train],alpha=1, family="binomial")

# 1) Predicted probabilities on the Test data
lasso.pred<-predict(lasso.mod,s=bestlam ,newx=x[test ,],type="response") 

# 2) Class labels corresponding to the maximum probability
class<-predict(lasso.mod,s=bestlam ,newx=x[test ,],type="class") 

These are the variables that were considered important by the LASSO variable selection; the coefficients of all the others were shrunk to 0. They include:
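The extraction chunk was not echoed; a minimal sketch, assuming the lasso.mod fit and bestlam value from above:

# Sketch: non-zero lasso coefficients at the optimal lambda
cf<-predict(lasso.mod, s=bestlam, type="coefficients")
cf[as.vector(cf != 0), ]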

##                  (Intercept)                         sexF 
##                 0.1743037568                -0.0980815686 
##                  pre.hospyes   hom.situatnhome cohabiting 
##                 0.3240711197                 0.2466525128 
##  med.v.surgmedical - planned med.v.surgmedical -unplanned 
##                -0.0150397498                -0.7612816045 
## med.v.surgsurgical - planned          interventionmedical 
##                 0.7527433766                -0.3426807469 
##                  disciplines daily.liv.actvno limitations 
##                 0.2707035673                -0.1145123744 
##           surg.durin.hospyes             transf.time.days 
##                 0.3540763681                 0.0009114912 
##                polyfarm.5yes               adm.dayvrijdag 
##                 0.0174283736                 0.1136587658 
##              adm.daywoensdag 
##                -0.0807448230

Variability of the predictions.

100 different samples were drawn and each time prediction and classification were done. The aim was to get the distribution and the variability of the prediction and classification performance.

# 100 samples
r<-review[,varlist] #  18
x=model.matrix(AE~.,r)[,-1]
y=r$AE

# Variables to collect some outputs
clss.test<-double(100);clss.train<-double(100)
import.var.l<-NULL;allvar.l<-NULL
threshold.las<-NULL;specificity.las<-NULL;sensitivity.las<-NULL

set.seed(37)

for (i in 1:100){
    train=sample(1:nrow(x), (nrow(x)*0.6))
    test=(- train )
    y.test=y[test]
    
    lasso.mod=glmnet(x[train ,],y[train],alpha=1, family="binomial")
    # Choosing the tuning parameter (lambda) using 10-fold cross-validation
    cv.out<-cv.glmnet(x[train ,],y[train],nfolds=10, type.measure="class"
                      ,family="binomial")
    
    # predicted class on train data
    class.train.l<-predict(cv.out,x[train,],s=cv.out$lambda.min,type="class")
    clss.train[i]<-mean(class.train.l==y[train])
    
    # predicted class on test data
    class.test.l<-predict(cv.out,x[test,],s=cv.out$lambda.min,type="class")
    clss.test[i]<-mean(class.test.l==y[test])
    
    # Coefficient values of the variables (+1 row for the intercept)
    cof.l<-predict(cv.out,x[test,],s=cv.out$lambda.min,type="coefficients")[1:(dim(x)[2]+1),]
    assign(paste("n",i, sep = "."),cof.l)
    allvar.l<-cbind(allvar.l,cof.l)
    colnames(allvar.l)[i]<-(paste("iter",i, sep = "."))
    
 #  Predicted probability on test data 
    lasso.pred.1=predict(cv.out ,x[test,],s=cv.out$lambda.min,type="response") #
    roc.l<-data.frame(t=y.test, pd =(lasso.pred.1))
    roc.las<- roc(roc.l[,1],roc.l[,2], percent=TRUE)
 # getting the sensitivity, specificity   
    las<-coords(roc.las, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
    threshold.las[i]<-las[1]
    specificity.las[i]<-las[2]
    sensitivity.las[i]<-las[3]
 }

# Distribution of the correct classification rates
#____
par(mfrow=c(1,2))
hist(clss.test,prob=T,main="Test Data",xlab="Correct Classification")
hist(clss.train,prob=T,main="Train Data",xlab="Correct Classification")

# .....fx....
multi.fun=function(x){
    c(mean=mean(x),median=median(x),range=range(x), CI=CI(x, ci = 0.95)
    )}

# Predicted correct classification on the Testing Data
round(multi.fun(clss.test),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.633    0.633    0.587    0.684    0.637    0.633    0.630
# Predicted correct classification on the Training Data
round(multi.fun(clss.train),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.671    0.666    0.627    0.723    0.675    0.671    0.666

From the summary statistics above, the mean correct prediction on the test data is 0.633 [0.630, 0.637], which is a bit smaller than that from ordinary logistic regression.

Specificity, sensitivity and optimum cutpoint.

From the lasso model, the mean [95% CI] specificity is 76.736 [74.885, 78.587]. This implies that on average 76.736% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 54.524 [52.577, 56.471], which implies that 54.524% of the patients with an adverse event will be correctly classified as having one. This classification was obtained when a mean cutpoint of 0.579 [0.566, 0.592] is used.

# Threshold
round(multi.fun(threshold.las),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.579    0.582    0.426    0.735    0.592    0.579    0.566
# Specificity
round(multi.fun(specificity.las),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   76.736   77.998   54.305   92.593   78.587   76.736   74.885
# Sensitivity
round(multi.fun(sensitivity.las),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   54.524   53.119   37.436   75.253   56.471   54.524   52.577

Group LASSO

This ensures that dummy variables derived from the same variable are considered as a group, unlike what is done by the glmnet package. The variability of the predictions over 100 different random samples is also assessed.
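The group lasso replaces the lasso's \(\ell_1\) penalty with a sum of (unsquared) \(\ell_2\) norms over the predefined groups, \[ \hat{\beta} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_g \rVert_2 \right\}, \] where \(\beta_g\) collects the \(p_g\) coefficients (dummy columns) of group \(g\) and \(\sqrt{p_g}\) is grpreg's default group weight. This makes each group of dummies enter or leave the model together.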

# install.packages("grpreg")
library(grpreg)
## Warning: package 'grpreg' was built under R version 3.4.4
varlist<-c("AE","sex","agecat","ASA","pre.hosp","polypharmacy","hom.situatn",
           "adm.type","med.v.surg","intervention","correct.ward","disciplines",
           "daily.liv.actv","cog.impair","surg.durin.hosp","inv.trt.durin.hosp",
           "DNR","transf.time.days","polyfarm.5","weeknd.week","adm.day","adm.hr") 
#,"adm.diag", "si","transfer.type",
r1<-review[,varlist] 
x1=model.matrix(AE~.,r1)[,-1]
y1=r1$AE

# Defining the groups, i.e. dummies derived from the same variable form one group
group<-NULL
r2<-review[,varlist[-1]]
for(i in 1:ncol(r2)){
    if (nlevels(r2[,i])>1){
        tmp<-rep(i,nlevels(r2[,i])-1)  # one group index per dummy column
    } else {
        tmp<-i                         # numeric predictor forms its own group
    }
    group<-c(group,tmp)
}
#_____________________

# Sampling and fitting the model 100 times to get the variablity 
# of the predicted classification 
  # defining the collecting variables
correct.class.train<-double(100);correct.class.test<-double(100)
import.var<-NULL;allvar<-NULL
threshold.glas<-NULL;specificity.glas<-NULL;sensitivity.glas<-NULL

set.seed(37)
# looping 100 times
for (i in 1:100){
    train=sample(1:nrow(x1), (nrow(x1)*0.6))
    test=(- train )
    y.1<-as.numeric(y1)-1
    
fit <- grpreg(x1[train ,],y.1[train],group,penalty="grLasso", family="binomial")
# Choosing the tuning parameter (lambda) using 10-fold cross-validation
fit.cv<-cv.grpreg(x1[train ,],y.1[train],group=group, nfolds=10, trace=FALSE,family="binomial")
# predicted class on train data
class.train<-predict(fit.cv,x1[train,],lambda=fit.cv$lambda.min,type="class")
correct.class.train[i]<-mean(class.train==y.1[train])
# predicted class on test data
class.test<-predict(fit.cv,x1[test,],lambda=fit.cv$lambda.min,type="class")
correct.class.test[i]<-mean(class.test==y.1[test])

# Coefficients of the picked variables on the test data (+1 row for the intercept)
cof.gp<-predict(fit.cv,x1[test,],lambda=fit.cv$lambda.min,type="coefficients")[1:(dim(x1)[2]+1),]
assign(paste("n",i, sep = "_"),cof.gp)
allvar<-cbind(allvar,cof.gp)
colnames(allvar)[i]<-(paste("iter",i, sep = "_"))

#   Predicted probability on test data 
gp.pred=predict(fit.cv ,x1[test,],lambda=fit.cv$lambda.min,type="response") #
roc.gl<-data.frame(t=y.1[test], pd =(gp.pred))
roc.glas<- roc(roc.gl[,1],roc.gl[,2], percent=TRUE)
#
glas<-coords(roc.glas, "best", ret=c("threshold", "specificity", "sensitivity","npv"))
threshold.glas[i]<-glas[1]
specificity.glas[i]<-glas[2]
sensitivity.glas[i]<-glas[3]
}

ft.cv<-cv.grpreg(x1[train ,],y.1[train],group=group, nfolds=10,trace=FALSE,
                  family="binomial")
# Lambda value that minimizes the prediction error
# Plot showing the chosen value of lambda plus the number of important variables
plot(ft.cv,type="pred")

# Distribution of the correct classification rates
#_____
par(mfrow=c(1,2))
hist(correct.class.test,prob=T,main="Test Data",xlab="Correct Classification")
hist(correct.class.train,prob=T,main="Train Data",xlab="Correct Classification")

# .....fx....
multi.fun=function(x){
    c(mean=mean(x),median=median(x),range=range(x), CI=CI(x, ci = 0.95)
    )}
# The predicted correct classification on the Testing Data
round(multi.fun(correct.class.test),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.641    0.639    0.596    0.696    0.645    0.641    0.636
# Predicted correct classification on the Training Data
round(multi.fun(correct.class.train),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.670    0.672    0.631    0.713    0.674    0.670    0.667

From the summary statistics above, the mean correct prediction on the test data is 64.1% [63.6, 64.5].

Specificity, sensitivity and optimum cutpoint.

From the group lasso model, the mean [95% CI] specificity is 75.469 [73.732, 77.207]. This implies that on average 75.469% of patients without an adverse event will be correctly identified as not having one. In our case these can be viewed as the cases that do not need to be reviewed, saving resources that would otherwise have been spent on them. The mean [95% CI] sensitivity was 57.078 [55.270, 58.886], which implies that 57.078% of the patients with an adverse event will be correctly classified as having one. This classification was obtained when a mean cutpoint of 0.559 [0.547, 0.571] is used.

# Threshold
round(multi.fun(threshold.glas),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.559    0.560    0.423    0.736    0.571    0.559    0.547
# Specificity
round(multi.fun(specificity.glas),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   75.469   73.957   52.318   93.284   77.207   75.469   73.732
# Sensitivity
round(multi.fun(sensitivity.glas),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##   57.078   59.091   32.828   76.243   58.886   57.078   55.270

CLASSIFICATION TREE

Decision trees are broadly classified into classification trees (for categorical outcomes) and regression trees (for continuous outcomes). This methodology can efficiently segment populations into meaningful subgroups: it identifies mutually exclusive and exhaustive subgroups of a population whose members share common characteristics that influence the dependent variable of interest. It produces a visual output, a multilevel structure that resembles the branches of a tree.

The basic strategy for growing decision trees is to split the input (covariate) space into regions within which the outcome is relatively homogeneous. This is done in a hierarchical manner using a sequence of binary splitting rules on the inputs. The algorithm efficiently looks at all possible binary splits over all the inputs and chooses the split that makes the corresponding values of the output variable as distinct as possible.
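For classification, the tree package scores candidate splits by the decrease in node deviance: for node \(m\) with \(n_{mk}\) observations of class \(k\) and class proportions \(\hat{p}_{mk}\), \[ D_m = -2 \sum_{k} n_{mk} \log \hat{p}_{mk}, \] and the chosen split is the one that minimizes the summed deviance of the two resulting child nodes.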

# install.packages("tree")
library(tree)
## Warning: package 'tree' was built under R version 3.4.4
# Spliting the Data into Learning and Testing
set.seed(37)
train<-sample(1:nrow(review),(0.6*nrow(review)))
test<- -train
train.data<-review[train,]
test.data<-review[test,] 
AE.test<-AE[test]


# Fit the tree using the train data, including all the variables
t1<-tree(AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention + 
             disciplines + ASA + surg.durin.hosp + transf.time.days+ hom.situatn
             + daily.liv.actv+ cog.impair+ adm.type + correct.ward  +
             inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day
             + adm.hr,data = train.data) #  si + + adm.diag

summary(t1)
## 
## Classification tree:
## tree(formula = AE ~ sex + agecat + pre.hosp + polypharmacy + 
##     med.v.surg + intervention + disciplines + ASA + surg.durin.hosp + 
##     transf.time.days + hom.situatn + daily.liv.actv + cog.impair + 
##     adm.type + correct.ward + inv.trt.durin.hosp + DNR + polyfarm.5 + 
##     weeknd.week + adm.day + adm.hr, data = train.data)
## Variables actually used in tree construction:
## [1] "med.v.surg"       "transf.time.days" "polypharmacy"    
## [4] "pre.hosp"         "hom.situatn"     
## Number of terminal nodes:  6 
## Residual mean deviance:  1.165 = 573 / 492 
## Misclassification error rate: 0.3133 = 156 / 498
plot(t1)
text(t1) #pretty=0)

A bushy tree might be harder to interpret and may also have too much variance, so pruning the tree is needed. 10-fold cross-validation is used to decide where to prune the tree on the basis of misclassification.

set.seed(37)
cv.tr<-cv.tree(t1, FUN=prune.misclass)

# Plot of misclassification vs tree size
# the smallest misclassification is at a size between 5 and 6
plot(cv.tr)

# Pruning the tree to a size of 5
prune.tr<-prune.misclass(t1,best=5)
summary(prune.tr)
## 
## Classification tree:
## snip.tree(tree = t1, nodes = 3L)
## Variables actually used in tree construction:
## [1] "med.v.surg"       "transf.time.days" "polypharmacy"    
## [4] "pre.hosp"        
## Number of terminal nodes:  5 
## Residual mean deviance:  1.177 = 580.3 / 493 
## Misclassification error rate: 0.3133 = 156 / 498
plot(prune.tr)
text(prune.tr)

Evaluating the prediction of the pruned tree on the test data.

# Evaluating the pruned tree on the test data
t.pred.prun<-predict(prune.tr,test.data, type="class")
# Misclassification table
with(test.data,table(Predicted=t.pred.prun,Observed=AE))
##              Observed
## Predicted     No AE AE Detected
##   No AE          66          42
##   AE Detected    87         137

From the classification tree, 61.1% of the patients in the test data were correctly classified.
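This rate follows directly from the misclassification table above:

(66 + 137)/(66 + 42 + 87 + 137) # correct / total = 203/332
## [1] 0.6114458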

Random Forest.

This involves building a lot of bushy (unpruned) trees and then averaging them to reduce variance. The aim is to improve the prediction accuracy of the trees through the increased precision.
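The variance reduction can be made precise: if each of the \(B\) trees has variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their average is \[ \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2, \] so growing many trees removes the second term, while the random choice of mtry candidate variables at each split lowers \(\rho\) and hence the first.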

# install.packages("randomForest")
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(37)
# random forest- builds lots of bushy trees, then averages them to reduce variance
rf<-randomForest(AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention + 
                     disciplines + ASA + surg.durin.hosp + transf.time.days+ hom.situatn
                    + daily.liv.actv+ cog.impair+ adm.type + correct.ward  +
                     inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day
                    + adm.hr, ntree=1000, data = train.data,importance=T) # + si + adm.diag
rf
## 
## Call:
##  randomForest(formula = AE ~ sex + agecat + pre.hosp + polypharmacy +      med.v.surg + intervention + disciplines + ASA + surg.durin.hosp +      transf.time.days + hom.situatn + daily.liv.actv + cog.impair +      adm.type + correct.ward + inv.trt.durin.hosp + DNR + polyfarm.5 +      weeknd.week + adm.day + adm.hr, data = train.data, ntree = 1000,      importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 31.33%
## Confusion matrix:
##             No AE AE Detected class.error
## No AE         124          88   0.4150943
## AE Detected    68         218   0.2377622
# A Dot plot of the variable importance measured by the Random Forest
varImpPlot(rf,main="Variable Importance")

Tuning parameter (mtry)

This is the number of variables chosen at random as split candidates at each node of each tree. The random forest picks a random subset with the aim of decorrelating the trees; the default for classification is \(\lfloor\sqrt{p}\rfloor\), here \(\lfloor\sqrt{21}\rfloor = 4\), as reported in the output above. Here a series of forests was fit with the mtry value ranging from 1 to 21, the number of variables considered.

# To collect the errors and correct classification rates
oob.err=double(21)
test.err=double(21)
correct.cl=double(21)

for (mtry in 1 :21){
    fit=randomForest(AE ~ sex + agecat + pre.hosp + polypharmacy + med.v.surg + intervention + 
                         disciplines + ASA + surg.durin.hosp + transf.time.days+ hom.situatn
                        + daily.liv.actv+ cog.impair+ adm.type + correct.ward  +
                         inv.trt.durin.hosp + DNR + polyfarm.5 + weeknd.week + adm.day
                        + adm.hr
                         ,data = train.data,mtry=mtry,ntree=1000) #si + adm.diag +
    oob.err[mtry]=mean((fit$err.rate)[,1]) #mean OOB
    pred=predict(fit,test.data)
    test.err[mtry]=with(test.data,mean(pred!=AE))
    correct.cl[mtry]=with(test.data,mean(pred==AE))
    cat(mtry," ")
}
# Misclassification on the Test Data
#round(multi.fun(test.err),3)
# Correct classification on the Test Data
round(multi.fun(correct.cl),3)
##     mean   median   range1   range2 CI.upper  CI.mean CI.lower 
##    0.622    0.620    0.605    0.645    0.627    0.622    0.618
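The errors computed above are summarized but not plotted; a minimal sketch to visualize them against mtry (labels and colours are my own choices):

# Sketch: OOB and test misclassification error as a function of mtry
matplot(1:21, cbind(oob.err, test.err), type="b", pch=19, col=c("red","blue"),
        xlab="mtry", ylab="Misclassification error")
legend("topright", legend=c("OOB","Test"), pch=19, col=c("red","blue"))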

From the random forest predictions, 62.2% [61.8, 62.7] of the patients in the test data were correctly classified.

Important Predictors.

Proposal

I eventually propose to use the variables highlighted as important by the various methods employed here to build a logistic model as my final model for prediction and classification.