Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, a leading open-source tool for practicing data science.
Background
Falling out of bed is a persistent problem in all hospitals. Elderly patients are more susceptible to falling out of bed, but such accidents could also result from medical malpractice. Our study question was whether the odds of falling out of bed are associated with global healthcare quality, ageing, or both.
The 4th case study (Chapter 4.2, Machine Learning in Medicine - Cookbook 1 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) introduces the logistic regression algorithm as an alternative to the chi-square test.
Our experiment will extend the logistic model with Bootstrap resampling. Resampling can be used for (1) manual feature selection and (2) validating the model's performance.
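As a reminder of the principle: bootstrap resampling refits the model on samples drawn with replacement from the original data, and reads confidence intervals off the distribution of the refitted estimates. A minimal base-R sketch (dat, y, x1 and x2 are hypothetical placeholders; the actual analysis below uses the mlr package):
set.seed(123)
boot_or = replicate(200, {
  idx = sample(nrow(dat), replace = TRUE)   # draw rows with replacement
  fit = glm(y ~ x1 + x2, family = binomial, data = dat[idx, ])
  exp(coef(fit)[2])                         # odds ratio of the first predictor
})
quantile(boot_or, c(.025, .5, .975))        # 95% percentile interval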
Materials and methods
The original dataset was provided in Chapter 4.2, Machine Learning in Medicine - Cookbook 1 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (logoutcomeprediction.sav) from the publisher's website: extras.springer.com.
Data analysis was performed in the R statistical programming language. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our machine learning experiment.
First, the following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis); we will also customize the aesthetics for ggplot2:
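If any of these packages are missing, they can be installed first (a one-off step):
install.packages(c("foreign", "tidyverse", "magrittr", "vcd", "mlr", "gridExtra", "ggfortify"))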
library(foreign)    # read SPSS .sav files
library(tidyverse)
library(magrittr)   # provides the %<>% compound-assignment pipe used below
library(vcd)        # mosaic plots
library(mlr)        # machine learning framework
library(gridExtra)  # arrange multiple ggplots
library(ggfortify)
my_theme <- function(base_size = 8, base_family = "sans") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 8),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 8),
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#faf2ff"),
      strip.background = element_rect(fill = "#470170", color = "#470170", size = 0.5),
      strip.text = element_text(face = "bold", size = 8, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey5", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
myfillcolors=c("#8100cc","#95c601","#00cc51","#0164c6","#242d7a","#c4b300")
mycolors=c("#5f0096","#607f01","#016b3b","#01486d","#000851","#827700")
Results
Step 0: Data loading and reshaping
require(foreign)
df=read.spss("logoutcomeprediction.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()
df$fallingoutofbed%<>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")
df$gender%<>%as.factor()%>%recode_factor(.,`0` = "Female", `1` = "Male")
df$letterofcomplaint%<>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")
df$departmenttype%<>%as.factor()%>%recode_factor(.,`0` = "TypeA", `1` = "TypeB")
df
## # A tibble: 55 × 5
## departmenttype fallingoutofbed age gender letterofcomplaint
## <fctr> <fctr> <dbl> <fctr> <fctr>
## 1 TypeA Yes 50 Female Yes
## 2 TypeA Yes 76 Female Yes
## 3 TypeA Yes 57 Male Yes
## 4 TypeA Yes 65 Female Yes
## 5 TypeA Yes 46 Male Yes
## 6 TypeA Yes 36 Male Yes
## 7 TypeA Yes 98 Female No
## 8 TypeA Yes 56 Male No
## 9 TypeA Yes 44 Female No
## 10 TypeA Yes 76 Male Yes
## # ... with 45 more rows
Once the above commands are executed, we obtain a clean data frame (n = 55) that will be used throughout our analysis. The target outcome is a binary variable: fall event (0 = No, 1 = Yes). There are four features: letter of complaint (0 = No, 1 = Yes), department type (coded as Type A and Type B), gender, and age (years).
Step 1: Data exploration
Figure 1: Descriptive analysis
g = ggplot(df, aes(x = fallingoutofbed, fill = gender, color = gender)) +
  geom_bar(position = "fill") + coord_flip() +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
l = df %>% ggplot(aes(x = fallingoutofbed, fill = letterofcomplaint, color = letterofcomplaint)) +
  geom_bar(position = "fill") + coord_flip() +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
d = df %>% ggplot(aes(x = fallingoutofbed, fill = departmenttype, color = departmenttype)) +
  geom_bar(position = "fill") + coord_flip() +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
a = df %>% ggplot(aes(x = age, fill = fallingoutofbed, color = fallingoutofbed)) +
  geom_density(alpha = 0.8) +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
grid.arrange(g, l, d, a, ncol = 2)
An alternative solution using the vcd package:
df[, -3] %>%
  gather(departmenttype, letterofcomplaint, gender, key = "Factor", value = "Value") %>%
  mosaic(data = ., fallingoutofbed ~ Value, highlighting_fill = c("#8100cc", "#95c601"))
df[, -3] %>% mosaic(data = ., fallingoutofbed ~ ., highlighting_fill = c("#8100cc", "#95c601"))
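For comparison, the chi-square test that logistic regression replaces in the source cookbook can be run in one line per factor; a quick sketch on the same data:
chisq.test(table(df$letterofcomplaint, df$fallingoutofbed))  # complaint vs fall events
chisq.test(table(df$departmenttype, df$fallingoutofbed))     # department vs fall events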
Step 2: Supervised machine learning with logistic regression
Our binary classification rule might include 1, 2, 3, or 4 features. The following step allows us to determine the best feature composition for our model:
First, we create a supervised classification Task in the mlr package that contains all four features; then we make a learner based on logistic regression.
library(mlr)
task= makeClassifTask(id = "4", data=df,target = "fallingoutofbed",positive="Yes")
lrn=makeLearner("classif.logreg",id="LOG",predict.type = "prob",fix.factors.prediction = TRUE)
Figure 2: Feature selection using four criteria
pmi = generateFilterValuesData(task, method = "permutation.importance", imp.learner = lrn) %>% .$data %>%
  ggplot(aes(x = reorder(name, permutation.importance), y = permutation.importance,
             fill = reorder(name, permutation.importance))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "permut.Importance") +
  scale_x_discrete("Features") + coord_flip()
uni = generateFilterValuesData(task, method = "univariate.model.score", learner = lrn) %>% .$data %>%
  ggplot(aes(x = reorder(name, univariate.model.score), y = univariate.model.score,
             fill = reorder(name, univariate.model.score))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "Univar Score") +
  scale_x_discrete("Features") + coord_flip()
gt = generateFilterValuesData(task, method = "gain.ratio") %>% .$data %>%
  ggplot(aes(x = reorder(name, gain.ratio), y = gain.ratio, fill = reorder(name, gain.ratio))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "Gain ratio") +
  scale_x_discrete("Features") + coord_flip()
ig = generateFilterValuesData(task, method = "information.gain") %>% .$data %>%
  ggplot(aes(x = reorder(name, information.gain), y = information.gain, fill = reorder(name, information.gain))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "information gain") +
  scale_x_discrete("Features") + coord_flip()
grid.arrange(pmi, uni, gt, ig, ncol = 2)
According to these criteria, letter of complaint and department type are the two most important features for predicting the odds of falling out of bed. Neither gender nor age contributed to the fall events.
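The filter scores plotted above can also be inspected numerically; a quick sketch (generateFilterValuesData accepts a vector of method names):
fv = generateFilterValuesData(task, method = c("information.gain", "gain.ratio"))
fv$data   # one row per feature, one column per criterion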
To confirm this hypothesis, we will perform a benchmark study that evaluates the performance of four model versions with 1, 2, 3, and 4 features, added in descending order of importance. Each model will be tested over 200 bootstrap iterations.
2A) Manual feature selection
task1=makeClassifTask(id="1", data=df[,c(2,5)],target = "fallingoutofbed",positive="Yes")
task2=makeClassifTask(id="2", data=df[,c(2,5,1)],target = "fallingoutofbed",positive="Yes")
task3=makeClassifTask(id="3", data=df[,c(2,5,1,4)],target = "fallingoutofbed",positive="Yes")
tasks=list(task1,task2,task3,task)
mes=list(auc,bac,acc,ber,mmce,logloss)
resmp= makeResampleDesc("Bootstrap",iters=200L)
set.seed(123)
bmrk=benchmark(lrn,tasks,resmp,mes)
getBMRPerformances(bmrk, as.df = TRUE) %>% as_tibble() %>%
  gather(auc:logloss, key = "Metric", value = "Value") %>%
  ggplot(aes(x = task.id, y = Value, fill = Metric, color = Metric)) +
  stat_summary(fun.y = "median", geom = "line", size = 1, group = 1) +
  stat_summary(fun.y = "median", shape = 21, size = 3.5, geom = "point") +
  facet_wrap(~Metric, scales = "free", ncol = 2) +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors) +
  scale_x_discrete("Number of features")
The results of the benchmark study indicate that the most accurate logistic model includes only two features, as suggested by the gain.ratio and information.gain analyses.
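The aggregated benchmark metrics behind this figure can also be printed directly (a sketch using mlr's aggregation helper):
getBMRAggrPerformances(bmrk, as.df = TRUE)   # mean test performance per task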
2B) The optimized logistic model
fit=train(lrn,task2)
fit$learner.model%>%summary()
##
## Call:
## stats::glm(formula = f, family = "binomial", data = getTaskData(.task,
## .subset), weights = .weights, model = FALSE)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2222 -0.7891 0.4206 0.7809 1.6239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0071 0.4483 -2.247 0.02466 *
## letterofcomplaintYes 2.0387 0.6866 2.969 0.00299 **
## departmenttypeTypeB 1.3491 0.6805 1.982 0.04743 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 75.791 on 54 degrees of freedom
## Residual deviance: 59.914 on 52 degrees of freedom
## AIC: 65.914
##
## Number of Fisher Scoring iterations: 4
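For later comparison with the bootstrap intervals, the parametric odds ratios and their profile-likelihood 95% confidence intervals can be read directly off this single fitted model (a sketch; fit$learner.model is the underlying glm object):
m = fit$learner.model
exp(cbind(OR = coef(m), confint(m)))   # odds ratios with profile 95% CIs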
2C) Model validation by Bootstrap resampling
A bootstrap resampling process allows us to evaluate the robustness of our model:
r = resample(lrn, task2, measures = list(auc, bac, ber, logloss), resmp, models = TRUE)  # Bootstrapping
r$aggr
## auc.test.mean bac.test.mean ber.test.mean logloss.test.mean
## 0.7700732 0.7194984 0.2805016 0.7781684
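Beyond these mean aggregates, the per-iteration values stored in r$measures.test yield percentile bootstrap intervals for any metric; for instance (a sketch):
quantile(r$measures.test$auc, c(.025, .5, .975))   # 95% percentile interval of the AUC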
Now we will extract the coefficient values and odds ratios of the two factors from each bootstrap model.
Table 1: Bootstrap confidence intervals of coefficients and odds ratios
resdf = r$measures.test %>% as_tibble() %>%
  mutate(OR_Complaint = 0, OR_Departement = 0, Departement = 0, Comp = 0, BIC = 0, AIC = 0)
for (i in 1:nrow(resdf)) {
  model = r$models[[i]]$learner.model   # glm refitted on the i-th bootstrap sample
  coefs = model$coefficients
  resdf$Comp[i] = coefs[2]              # letter-of-complaint coefficient
  resdf$Departement[i] = coefs[3]       # department-type coefficient
  resdf$OR_Complaint[i] = exp(coefs[2])
  resdf$OR_Departement[i] = exp(coefs[3])
  resdf$AIC[i] = model$aic
  resdf$BIC[i] = BIC(model)
}
OR_Complaint=quantile(resdf$OR_Complaint,c(.025, .5, .975))
OR_Departement=quantile(resdf$OR_Departement,c(.025, .5, .975))
Coef_Complaint=quantile(resdf$Comp,c(.025, .5, .975))
Coef_Department=quantile(resdf$Departement,c(.025, .5, .975))
rbind(Coef_Complaint,OR_Complaint,Coef_Department,OR_Departement)%>%knitr::kable()
| | 2.5% | 50% | 97.5% |
|---|---|---|---|
| Coef_Complaint | 0.9154525 | 2.062101 | 1.956684e+01 |
| OR_Complaint | 2.4979052 | 7.862485 | 3.146188e+08 |
| Coef_Department | -0.1174257 | 1.321796 | 3.406617e+00 |
| OR_Departement | 0.8892073 | 3.750240 | 3.016336e+01 |
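For reference, the extraction loop above can also be written more compactly with purrr (installed with the tidyverse); a sketch reproducing the coefficient and odds-ratio columns:
boot_coefs = purrr::map_dfr(r$models, function(m) {
  g = getLearnerModel(m)   # underlying glm of one bootstrap iteration
  tibble(Comp = coef(g)[2], Departement = coef(g)[3],
         OR_Complaint = exp(coef(g)[2]), OR_Departement = exp(coef(g)[3]))
})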
Figure 3: Evolution of the odds ratios over 200 bootstrap iterations
effect = resdf[, c(1, 6, 7)] %>% gather(OR_Complaint, OR_Departement, key = "Factors", value = "Odds_ratio")
effect %>% subset(Odds_ratio < 50) %>%
  ggplot(aes(x = Odds_ratio, fill = Factors)) +
  geom_density(alpha = 0.6) +
  facet_wrap(~Factors, ncol = 1, scales = "free") +
  scale_fill_manual(values = myfillcolors, name = "Coefficients") +
  geom_vline(xintercept = 2, color = "blue4", linetype = "dashed", size = 1)
Figure 4: Marginal effects of the two factors
pred = predict(fit, task2) %>% as_tibble() %>%
  mutate(Department = df$departmenttype, Complaint = df$letterofcomplaint)
f1 = pred %>% ggplot(aes(x = Department, y = prob.Yes, fill = Department, color = Department)) +
  stat_summary(fun.ymin = min, fun.ymax = max, fun.y = "median", shape = 21, size = 1) +
  coord_flip() + scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
f2 = pred %>% ggplot(aes(x = Complaint, y = prob.Yes, fill = Complaint, color = Complaint)) +
  stat_summary(fun.ymin = min, fun.ymax = max, fun.y = "median", shape = 21, size = 1) +
  coord_flip() + scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
grid.arrange(f1, f2, ncol = 1)
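The same marginal effects can be read as plain numbers by predicting over the four combinations of the two factors (a sketch using the underlying glm):
newdat = expand.grid(letterofcomplaint = c("No", "Yes"), departmenttype = c("TypeA", "TypeB"))
newdat$prob = predict(fit$learner.model, newdata = newdat, type = "response")
newdat   # predicted fall probability for each combination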
Figure 5: Agreement between predicted and observed values - Confusion matrix analysis
pred %>% ggplot(aes(fill = truth:response)) +
  geom_density(aes(x = prob.Yes), alpha = 0.7) +
  scale_fill_manual(values = myfillcolors, name = "Agreement Truth:Predicted") +
  scale_x_continuous("Predicted probability", limits = c(0, 1)) +
  facet_grid(truth ~ .)
pred[, c(2, 5)] %>% mosaic(data = ., response ~ ., highlighting_fill = c("#8100cc", "#95c601"))
caret::confusionMatrix(data = pred$response, reference = pred$truth, positive = "Yes", mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 17 6
## Yes 8 24
##
## Accuracy : 0.7455
## 95% CI : (0.61, 0.8533)
## No Information Rate : 0.5455
## P-Value [Acc > NIR] : 0.001816
##
## Kappa : 0.4832
## Mcnemar's Test P-Value : 0.789268
##
## Sensitivity : 0.8000
## Specificity : 0.6800
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7391
## Precision : 0.7500
## Recall : 0.8000
## F1 : 0.7742
## Prevalence : 0.5455
## Detection Rate : 0.4364
## Detection Prevalence : 0.5818
## Balanced Accuracy : 0.7400
##
## 'Positive' Class : Yes
##
Figure 6: Bootstrap confidence intervals of the performance metrics
per = resdf[, -c(6:9)] %>% gather(auc:AIC, key = "Metric", value = "Value")
per %>% ggplot(aes(x = iter, y = Value, color = Metric)) +
  geom_path(alpha = 0.8, size = 0.8) +
  facet_grid(Metric ~ ., scales = "free") +
  scale_color_manual(values = myfillcolors)
per %>% ggplot(aes(x = Value, fill = Metric, color = Metric)) +
  geom_histogram(alpha = 0.7) +
  scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors) +
  facet_wrap(~Metric, ncol = 2, scales = "free")
Based on those values, the odds of falling out of bed at the hospital are significantly associated with the environment's condition. Patients hospitalized in a Type B department would have 3.75 times higher odds of falling (bootstrap median odds ratio), whilst a previous letter of complaint would increase the odds of falling out of bed 7.86-fold, compared with other cases.
Conclusion: Our experiment has successfully extended the basic logistic regression analysis with Bootstrap resampling.
Resampling can be applied at different levels of a machine learning study, including feature exploration and evaluation of model performance, for both interpretive and predictive purposes.