Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, a leading open-source tool for practicing data science.
Background
Falling out of bed is a persistent problem in all hospitals. Elderly patients are more susceptible to falling out of bed, but such accidents could also result from medical malpractice. Our study question was whether the odds of falling out of bed are associated with global healthcare quality, ageing, or both.
The 4th case study (Chapter 4.2, Machine Learning in Medicine - Cookbook 1 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) introduces the logistic regression algorithm as an alternative to the chi-square test.
Our experiment will extend the logistic model with Bootstrap resampling. Resampling can be used for (1) manual feature selection and (2) validating the model's performance.
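As a reminder of the principle: bootstrap resampling refits the model on samples drawn with replacement from the original data, and reads confidence intervals off the distribution of the refitted estimates. A minimal base-R sketch (dat, y, x1 and x2 are hypothetical placeholders; the actual analysis below uses the mlr package):
set.seed(123)
boot_or = replicate(200, {
  idx = sample(nrow(dat), replace = TRUE)   # draw rows with replacement
  fit = glm(y ~ x1 + x2, family = binomial, data = dat[idx, ])
  exp(coef(fit)[2])                         # odds ratio of the first predictor
})
quantile(boot_or, c(.025, .5, .975))        # 95% percentile interval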
Materials and methods
The original dataset was provided in Chapter 4.2, Machine Learning in Medicine - Cookbook 1 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (logoutcomeprediction.sav) from the publisher's website: extras.springer.com.
Data analysis was performed in the R statistical programming language. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our machine learning experiment.
First, the following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis); we will also customize the aesthetics for ggplot2:
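If any of these packages are missing, they can be installed first (a one-off step):
install.packages(c("foreign", "tidyverse", "magrittr", "vcd", "mlr", "gridExtra", "ggfortify"))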
library(foreign)    # read SPSS .sav files
library(tidyverse)
library(magrittr)   # provides the %<>% compound-assignment pipe used below
library(vcd)        # mosaic plots
library(mlr)        # machine learning framework
library(gridExtra)  # arrange multiple ggplots
library(ggfortify)
my_theme <- function(base_size = 8, base_family = "sans") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 8),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 8),
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#faf2ff"),
      strip.background = element_rect(fill = "#470170", color = "#470170", size = 0.5),
      strip.text = element_text(face = "bold", size = 8, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey5", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
myfillcolors=c("#8100cc","#95c601","#00cc51","#0164c6","#242d7a","#c4b300")
mycolors=c("#5f0096","#607f01","#016b3b","#01486d","#000851","#827700")
Results
Step 0: Data loading and reshaping
require(foreign)
df=read.spss("logoutcomeprediction.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()
df$fallingoutofbed%<>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")
df$gender%<>%as.factor()%>%recode_factor(.,`0` = "Female", `1` = "Male")
df$letterofcomplaint%<>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")
df$departmenttype%<>%as.factor()%>%recode_factor(.,`0` = "TypeA", `1` = "TypeB")
df
## # A tibble: 55 × 5
## departmenttype fallingoutofbed age gender letterofcomplaint
## <fctr> <fctr> <dbl> <fctr> <fctr>
## 1 TypeA Yes 50 Female Yes
## 2 TypeA Yes 76 Female Yes
## 3 TypeA Yes 57 Male Yes
## 4 TypeA Yes 65 Female Yes
## 5 TypeA Yes 46 Male Yes
## 6 TypeA Yes 36 Male Yes
## 7 TypeA Yes 98 Female No
## 8 TypeA Yes 56 Male No
## 9 TypeA Yes 44 Female No
## 10 TypeA Yes 76 Male Yes
## # ... with 45 more rows
Once the above commands are executed, we obtain a clean data frame (n = 55) that will be used throughout our analysis. The target outcome is a binary variable: fall event (0 = No, 1 = Yes). There are four features: letter of complaint (0 = No, 1 = Yes), department type (coded as Type A and Type B), gender, and age (years).
Step 1: Data exploration
Figure 1: Descriptive analysis
g = ggplot(df, aes(x = fallingoutofbed, fill = gender, color = gender)) +
  geom_bar(position = "fill") + coord_flip() +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
l = df %>% ggplot(aes(x = fallingoutofbed, fill = letterofcomplaint, color = letterofcomplaint)) +
  geom_bar(position = "fill") + coord_flip() +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
d = df %>% ggplot(aes(x = fallingoutofbed, fill = departmenttype, color = departmenttype)) +
  geom_bar(position = "fill") + coord_flip() +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
a = df %>% ggplot(aes(x = age, fill = fallingoutofbed, color = fallingoutofbed)) +
  geom_density(alpha = 0.8) +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
grid.arrange(g, l, d, a, ncol = 2)
An alternative solution using the vcd package:
df[, -3] %>%
  gather(departmenttype, letterofcomplaint, gender, key = "Factor", value = "Value") %>%
  mosaic(data = ., fallingoutofbed ~ Value, highlighting_fill = c("#8100cc", "#95c601"))
df[, -3] %>% mosaic(data = ., fallingoutofbed ~ ., highlighting_fill = c("#8100cc", "#95c601"))
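For comparison, the chi-square test that logistic regression replaces in the source cookbook can be run in one line per factor; a quick sketch on the same data:
chisq.test(table(df$letterofcomplaint, df$fallingoutofbed))  # complaint vs fall events
chisq.test(table(df$departmenttype, df$fallingoutofbed))     # department vs fall events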
Step 2: Supervised machine learning with logistic regression
Our binary classification rule might include 1, 2, 3, or 4 features. The following step allows us to determine the best feature composition for our model:
First, we create a supervised classification Task in the mlr package that contains all four features; then we make a learner based on logistic regression.
library(mlr)
task= makeClassifTask(id = "4", data=df,target = "fallingoutofbed",positive="Yes")
lrn=makeLearner("classif.logreg",id="LOG",predict.type = "prob",fix.factors.prediction = TRUE)
Figure 2: Feature selection using four criteria
pmi = generateFilterValuesData(task, method = "permutation.importance", imp.learner = lrn) %>% .$data %>%
  ggplot(aes(x = reorder(name, permutation.importance), y = permutation.importance,
             fill = reorder(name, permutation.importance))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "permut.Importance") +
  scale_x_discrete("Features") + coord_flip()
uni = generateFilterValuesData(task, method = "univariate.model.score", learner = lrn) %>% .$data %>%
  ggplot(aes(x = reorder(name, univariate.model.score), y = univariate.model.score,
             fill = reorder(name, univariate.model.score))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "Univar Score") +
  scale_x_discrete("Features") + coord_flip()
gt = generateFilterValuesData(task, method = "gain.ratio") %>% .$data %>%
  ggplot(aes(x = reorder(name, gain.ratio), y = gain.ratio, fill = reorder(name, gain.ratio))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "Gain ratio") +
  scale_x_discrete("Features") + coord_flip()
ig = generateFilterValuesData(task, method = "information.gain") %>% .$data %>%
  ggplot(aes(x = reorder(name, information.gain), y = information.gain, fill = reorder(name, information.gain))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "information gain") +
  scale_x_discrete("Features") + coord_flip()
grid.arrange(pmi, uni, gt, ig, ncol = 2)
According to these criteria, letter of complaint and department type are the two most important features for predicting the odds of falling out of bed. Neither gender nor age contributed to the fall events.
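The filter scores plotted above can also be inspected numerically; a quick sketch (generateFilterValuesData accepts a vector of method names):
fv = generateFilterValuesData(task, method = c("information.gain", "gain.ratio"))
fv$data   # one row per feature, one column per criterion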
To confirm this hypothesis, we will perform a benchmark study that evaluates the performance of four model versions with 1, 2, 3, and 4 features, added in descending order of importance. Each model will be tested over 200 bootstrap iterations.
2A) Manual feature selection
task1=makeClassifTask(id="1", data=df[,c(2,5)],target = "fallingoutofbed",positive="Yes")
task2=makeClassifTask(id="2", data=df[,c(2,5,1)],target = "fallingoutofbed",positive="Yes")
task3=makeClassifTask(id="3", data=df[,c(2,5,1,4)],target = "fallingoutofbed",positive="Yes")
tasks=list(task1,task2,task3,task)
mes=list(auc,bac,acc,ber,mmce,logloss)
resmp= makeResampleDesc("Bootstrap",iters=200L)
set.seed(123)
bmrk=benchmark(lrn,tasks,resmp,mes)
getBMRPerformances(bmrk, as.df = TRUE) %>% as_tibble() %>%
  gather(auc:logloss, key = "Metric", value = "Value") %>%
  ggplot(aes(x = task.id, y = Value, fill = Metric, color = Metric)) +
  stat_summary(fun.y = "median", geom = "line", size = 1, group = 1) +
  stat_summary(fun.y = "median", shape = 21, size = 3.5, geom = "point") +
  facet_wrap(~Metric, scales = "free", ncol = 2) +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors) +
  scale_x_discrete("Number of features")
The results of the benchmark study indicate that the most accurate logistic model includes only two features, as suggested by the gain.ratio and information.gain analyses.
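The aggregated benchmark metrics behind this figure can also be printed directly (a sketch using mlr's aggregation helper):
getBMRAggrPerformances(bmrk, as.df = TRUE)   # mean test performance per task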
2B) The optimized logistic model
fit=train(lrn,task2)
fit$learner.model%>%summary()
##
## Call:
## stats::glm(formula = f, family = "binomial", data = getTaskData(.task,
## .subset), weights = .weights, model = FALSE)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2222 -0.7891 0.4206 0.7809 1.6239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0071 0.4483 -2.247 0.02466 *
## letterofcomplaintYes 2.0387 0.6866 2.969 0.00299 **
## departmenttypeTypeB 1.3491 0.6805 1.982 0.04743 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 75.791 on 54 degrees of freedom
## Residual deviance: 59.914 on 52 degrees of freedom
## AIC: 65.914
##
## Number of Fisher Scoring iterations: 4
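For later comparison with the bootstrap intervals, the parametric odds ratios and their profile-likelihood 95% confidence intervals can be read directly off this single fitted model (a sketch; fit$learner.model is the underlying glm object):
m = fit$learner.model
exp(cbind(OR = coef(m), confint(m)))   # odds ratios with profile 95% CIs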
2C) Model validation by Bootstrap resampling
A bootstrap resampling process allows us to evaluate the robustness of our model:
r = resample(lrn, task2, measures = list(auc, bac, ber, logloss), resmp, models = TRUE)  # Bootstrapping
r$aggr
## auc.test.mean bac.test.mean ber.test.mean logloss.test.mean
## 0.7700732 0.7194984 0.2805016 0.7781684
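Beyond these mean aggregates, the per-iteration values stored in r$measures.test yield percentile bootstrap intervals for any metric; for instance (a sketch):
quantile(r$measures.test$auc, c(.025, .5, .975))   # 95% percentile interval of the AUC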
Now we will extract the coefficient values and odds ratios of the two factors from each bootstrap model.
Table 1: Bootstrap confidence intervals of coefficients and odds ratios
resdf = r$measures.test %>% as_tibble() %>%
  mutate(OR_Complaint = 0, OR_Departement = 0, Departement = 0, Comp = 0, BIC = 0, AIC = 0)
for (i in 1:nrow(resdf)) {
  model = r$models[[i]]$learner.model   # glm refitted on the i-th bootstrap sample
  coefs = model$coefficients
  resdf$Comp[i] = coefs[2]              # letter-of-complaint coefficient
  resdf$Departement[i] = coefs[3]       # department-type coefficient
  resdf$OR_Complaint[i] = exp(coefs[2])
  resdf$OR_Departement[i] = exp(coefs[3])
  resdf$AIC[i] = model$aic
  resdf$BIC[i] = BIC(model)
}
OR_Complaint=quantile(resdf$OR_Complaint,c(.025, .5, .975))
OR_Departement=quantile(resdf$OR_Departement,c(.025, .5, .975))
Coef_Complaint=quantile(resdf$Comp,c(.025, .5, .975))
Coef_Department=quantile(resdf$Departement,c(.025, .5, .975))
rbind(Coef_Complaint,OR_Complaint,Coef_Department,OR_Departement)%>%knitr::kable()
| | 2.5% | 50% | 97.5% |
|---|---|---|---|
| Coef_Complaint | 0.9154525 | 2.062101 | 1.956684e+01 |
| OR_Complaint | 2.4979052 | 7.862485 | 3.146188e+08 |
| Coef_Department | -0.1174257 | 1.321796 | 3.406617e+00 |
| OR_Departement | 0.8892073 | 3.750240 | 3.016336e+01 |
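For reference, the extraction loop above can also be written more compactly with purrr (installed with the tidyverse); a sketch reproducing the coefficient and odds-ratio columns:
boot_coefs = purrr::map_dfr(r$models, function(m) {
  g = getLearnerModel(m)   # underlying glm of one bootstrap iteration
  tibble(Comp = coef(g)[2], Departement = coef(g)[3],
         OR_Complaint = exp(coef(g)[2]), OR_Departement = exp(coef(g)[3]))
})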
Figure 3: Evolution of the odds ratios over 200 bootstrap iterations
effect = resdf[, c(1, 6, 7)] %>% gather(OR_Complaint, OR_Departement, key = "Factors", value = "Odds_ratio")
effect %>% subset(Odds_ratio < 50) %>%
  ggplot(aes(x = Odds_ratio, fill = Factors)) +
  geom_density(alpha = 0.6) +
  facet_wrap(~Factors, ncol = 1, scales = "free") +
  scale_fill_manual(values = myfillcolors, name = "Coefficients") +
  geom_vline(xintercept = 2, color = "blue4", linetype = "dashed", size = 1)
Figure 4: Marginal effects of the two factors
pred = predict(fit, task2) %>% as_tibble() %>%
  mutate(Department = df$departmenttype, Complaint = df$letterofcomplaint)
f1 = pred %>% ggplot(aes(x = Department, y = prob.Yes, fill = Department, color = Department)) +
  stat_summary(fun.ymin = min, fun.ymax = max, fun.y = "median", shape = 21, size = 1) +
  coord_flip() + scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
f2 = pred %>% ggplot(aes(x = Complaint, y = prob.Yes, fill = Complaint, color = Complaint)) +
  stat_summary(fun.ymin = min, fun.ymax = max, fun.y = "median", shape = 21, size = 1) +
  coord_flip() + scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
grid.arrange(f1, f2, ncol = 1)
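The same marginal effects can be read as plain numbers by predicting over the four combinations of the two factors (a sketch using the underlying glm):
newdat = expand.grid(letterofcomplaint = c("No", "Yes"), departmenttype = c("TypeA", "TypeB"))
newdat$prob = predict(fit$learner.model, newdata = newdat, type = "response")
newdat   # predicted fall probability for each combination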
Figure 5: Agreement between predicted and observed values - Confusion matrix analysis
pred %>% ggplot(aes(fill = truth:response)) +
  geom_density(aes(x = prob.Yes), alpha = 0.7) +
  scale_fill_manual(values = myfillcolors, name = "Agreement Truth:Predicted") +
  scale_x_continuous("Predicted probability", limits = c(0, 1)) +
  facet_grid(truth ~ .)
pred[, c(2, 5)] %>% mosaic(data = ., response ~ ., highlighting_fill = c("#8100cc", "#95c601"))
caret::confusionMatrix(data = pred$response, reference = pred$truth, positive = "Yes", mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 17 6
## Yes 8 24
##
## Accuracy : 0.7455
## 95% CI : (0.61, 0.8533)
## No Information Rate : 0.5455
## P-Value [Acc > NIR] : 0.001816
##
## Kappa : 0.4832
## Mcnemar's Test P-Value : 0.789268
##
## Sensitivity : 0.8000
## Specificity : 0.6800
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7391
## Precision : 0.7500
## Recall : 0.8000
## F1 : 0.7742
## Prevalence : 0.5455
## Detection Rate : 0.4364
## Detection Prevalence : 0.5818
## Balanced Accuracy : 0.7400
##
## 'Positive' Class : Yes
##
Figure 6: Bootstrap confidence intervals of the performance metrics
per = resdf[, -c(6:9)] %>% gather(auc:AIC, key = "Metric", value = "Value")
per %>% ggplot(aes(x = iter, y = Value, color = Metric)) +
  geom_path(alpha = 0.8, size = 0.8) +
  facet_grid(Metric ~ ., scales = "free") +
  scale_color_manual(values = myfillcolors)
per %>% ggplot(aes(x = Value, fill = Metric, color = Metric)) +
  geom_histogram(alpha = 0.7) +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors) +
  facet_wrap(~Metric, ncol = 2, scales = "free")
Based on those values, the odds of falling out of bed at the hospital are significantly associated with the environment's condition. Patients hospitalized in a Type B department would have 3.75 times higher odds of falling (bootstrap median odds ratio), whilst a previous letter of complaint would increase the odds of falling out of bed 7.86-fold, compared with other cases.
Conclusion: Our experiment has successfully extended the basic logistic regression analysis with Bootstrap resampling.
Resampling can be applied at different levels of a machine learning study, including feature exploration and evaluation of model performance, for both interpretive and predictive purposes.