Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of machine learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, a leading open-source tool for practicing data science.
Background
The third case study (Chapter 4, Machine Learning in Medicine - Cookbook 1 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) explores the linear regression module in IBM SPSS, which provides both step-wise model selection and a bootstrap feature. The study question is to predict a patient's sleep time as a function of gender, age, treatment method (placebo or sleeping pill), and comorbidity.
Our experiment adopts an alternative way to select the most relevant predictors for the model. We also introduce bootstrap resampling and partial dependence analysis in the mlr package.
Materials and methods
The original dataset was provided in Chapter 4 of Machine Learning in Medicine - Cookbook 1 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). The original data in SPSS SAV format (linoutcomeprediction.sav) can be downloaded from the publisher's website: extras.springer.com.
Data analysis was performed in the R statistical programming language. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) is used throughout our machine learning experiment, including for feature selection and bootstrap resampling.
Our ML experiment consists of:
1) Using several criteria to evaluate the contribution of each predictor to the model's outcome
2) Wrapping a GLM model with an exhaustive feature-selection function; this algorithm automatically detects the most relevant predictors by itself
3) Resampling the optimized linear regression algorithm by bootstrap, then extracting the model's performance metrics and coefficient data
4) Using the partial dependence function to extract the marginal effect of each factor and covariate
First, the following packages must be loaded in RStudio (some other packages might also be required during the analysis), and we will personalize the aesthetic settings for ggplot2.
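If any of these packages are missing, they can be installed first; a minimal one-off sketch (it also lists the packages called later via ::, and availability on CRAN may vary over time):
# Install the packages used in this tutorial
install.packages(c("foreign", "tidyverse", "mlr", "GGally", "gridExtra",
                   "ggfortify", "Hmisc", "psych", "car", "lmSupport", "effects"))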
library(foreign)
library(tidyverse)
library(mlr)
library(GGally)
library(gridExtra)
library(ggfortify)
# Custom ggplot2 theme used throughout the analysis
my_theme <- function(base_size = 8, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 8),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 8),
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#f7fdff"),
      strip.background = element_rect(fill = "#001d60", color = "#00113a", size = 0.5),
      strip.text = element_text(face = "bold", size = 8, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey5", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
# Custom color palettes for fills and lines
myfillcolors = c("#ff003f", "#0094ff", "#ae00ff", "#94ff00", "#ffc700", "#fc1814")
mycolors = c("#db0229", "#026bdb", "#48039e", "#0d7502", "#c97c02", "#c40c09")
Results
Step 0) Data loading and reshaping
require(foreign)
require(magrittr)  # provides the %<>% assignment pipe used below
df = read.spss("linoutcomeprediction.sav", use.value.labels = TRUE, to.data.frame = TRUE) %>% as_tibble()
# Recode the numeric dummy variables into labelled factors
df$treatment %<>% as.factor() %>% recode_factor(., `0` = "Placebo", `1` = "Sleepingpill")
df$gender %<>% as.factor() %>% recode_factor(., `0` = "Female", `1` = "Male")
df$comorbidity %<>% as.factor() %>% recode_factor(., `0` = "No", `1` = "Yes")
df
## # A tibble: 20 × 5
## treatment hoursofsleep age gender comorbidity
## <fctr> <dbl> <dbl> <fctr> <fctr>
## 1 Placebo 6.0 65 Female Yes
## 2 Placebo 7.1 75 Female Yes
## 3 Placebo 8.1 86 Female No
## 4 Placebo 7.5 74 Female No
## 5 Placebo 6.4 64 Female Yes
## 6 Placebo 7.9 75 Male Yes
## 7 Placebo 6.8 65 Male Yes
## 8 Placebo 6.6 64 Male No
## 9 Placebo 7.3 75 Male No
## 10 Placebo 5.6 56 Female No
## 11 Sleepingpill 5.1 55 Male No
## 12 Sleepingpill 8.0 85 Female Yes
## 13 Sleepingpill 3.8 36 Male No
## 14 Sleepingpill 4.4 47 Female Yes
## 15 Sleepingpill 5.2 58 Male No
## 16 Sleepingpill 5.4 56 Female Yes
## 17 Sleepingpill 4.3 46 Male Yes
## 18 Sleepingpill 6.0 64 Male No
## 19 Sleepingpill 3.7 33 Male No
## 20 Sleepingpill 6.2 65 Female Yes
Once the above commands are executed, we get a clean data frame that will be used throughout our analysis.
Step 1) Data exploration
Table 1: Descriptive analysis
Hmisc::describe(df)
## df
##
## 5 Variables 20 Observations
## ---------------------------------------------------------------------------
## treatment
## n missing distinct
## 20 0 2
##
## Value Placebo Sleepingpill
## Frequency 10 10
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## hoursofsleep
## n missing distinct Info Mean Gmd .05 .10
## 20 0 19 0.999 6.07 1.615 3.795 4.250
## .25 .50 .75 .90 .95
## 5.175 6.100 7.150 7.910 8.005
##
## Value 3.7 3.8 4.3 4.4 5.1 5.2 5.4 5.6 6.0 6.2 6.4 6.6
## Frequency 1 1 1 1 1 1 1 1 2 1 1 1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.05 0.05
##
## Value 6.8 7.1 7.3 7.5 7.9 8.0 8.1
## Frequency 1 1 1 1 1 1 1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 20 0 13 0.99 62.2 16.43 35.85 45.00
## .25 .50 .75 .90 .95
## 55.75 64.00 74.25 76.00 85.05
##
## Value 33 36 46 47 55 56 58 64 65 74 75 85
## Frequency 1 1 1 1 1 2 1 3 3 1 3 1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.15 0.15 0.05 0.15 0.05
##
## Value 86
## Frequency 1
## Proportion 0.05
## ---------------------------------------------------------------------------
## gender
## n missing distinct
## 20 0 2
##
## Value Female Male
## Frequency 10 10
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## comorbidity
## n missing distinct
## 20 0 2
##
## Value No Yes
## Frequency 10 10
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
Figure 1: Distribution of the outcome
df %>% gather(treatment, gender, comorbidity, key = "Factors", value = "Class") %>%
  ggplot(aes(x = Class, y = hoursofsleep, fill = Factors, color = Factors)) +
  geom_boxplot(alpha = 0.6, width = 0.8) +
  coord_flip() +
  facet_wrap(~Factors, scales = "free", ncol = 1) +
  scale_fill_manual(values = myfillcolors) +
  scale_color_manual(values = mycolors)
The main outcome (hours of sleep) is normally distributed, so no data transformation is required. The median sleep time appears higher in the placebo group, while the contrast between genders and between comorbidity statuses is less clear. At this point we still do not know whether gender should be included in our model.
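The normality claim can also be checked formally; a quick sketch using the Shapiro-Wilk test (not part of the original workflow):
# Shapiro-Wilk test: a non-significant p-value is consistent with normality
shapiro.test(df$hoursofsleep)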
Figure 2: Linear relationship between patient’s Age and Sleep time
df %>% gather(treatment, gender, comorbidity, key = "Factors", value = "Class") %>%
  ggplot(aes(x = age, y = hoursofsleep, fill = Class, color = Class)) +
  geom_jitter(shape = 21, alpha = 0.5, size = 3) +
  geom_smooth(method = "lm", alpha = 0.4) +
  facet_wrap(~Factors, scales = "free", ncol = 2) +
  scale_fill_manual(values = myfillcolors) +
  scale_color_manual(values = mycolors)
Figure 2 indicates a strong linear relationship between age and sleep time in all groups; age appears to be a relevant predictor for our linear model.
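This visual impression can be quantified; for instance, a Pearson correlation sketch:
# Correlation between age and hours of sleep, with a 95% confidence interval
cor.test(df$age, df$hoursofsleep)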
Step 2) Machine learning experiments
library(mlr)
task = makeRegrTask(id = "Sleep", data = df, target = "hoursofsleep")
lrn = makeLearner("regr.glm")
We have just created a supervised regression task with hours of sleep as the target and four potential features, then assigned a regression learner (the GLM algorithm).
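Before training, both objects can be inspected; a quick sketch:
# Summarize the task: number of observations and the feature types
print(task)
getTaskFeatureNames(task)  # the four candidate predictors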
2A) Evaluating the role of each feature in the dataset
Our first experiment evaluates the role of each feature in the model. Five methods will be applied: a) permutation importance, b) univariate model scoring, c) gain ratio, d) information gain, and e) the chi-squared test.
Figure 3: Contribution of each feature to the outcome prediction
# Permutation importance
pmi = generateFilterValuesData(task, method = "permutation.importance", imp.learner = lrn) %>%
  .$data %>%
  ggplot(aes(x = reorder(name, permutation.importance), y = permutation.importance,
             fill = reorder(name, permutation.importance))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "permut.Importance") +
  scale_x_discrete("Features")
# Univariate model score
uni = generateFilterValuesData(task, method = "univariate.model.score", learner = lrn) %>%
  .$data %>%
  ggplot(aes(x = reorder(name, univariate.model.score), y = univariate.model.score,
             fill = reorder(name, univariate.model.score))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "Univar Score") +
  scale_x_discrete("Features")
# Gain ratio
gt = generateFilterValuesData(task, method = "gain.ratio") %>%
  .$data %>%
  ggplot(aes(x = reorder(name, gain.ratio), y = gain.ratio, fill = reorder(name, gain.ratio))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "Gain ratio") +
  scale_x_discrete("Features")
# Information gain
ig = generateFilterValuesData(task, method = "information.gain") %>%
  .$data %>%
  ggplot(aes(x = reorder(name, information.gain), y = information.gain,
             fill = reorder(name, information.gain))) +
  geom_bar(stat = "identity", color = "black", show.legend = F) +
  scale_fill_manual(values = myfillcolors, name = "information gain") +
  scale_x_discrete("Features")
# Chi-squared
csq = generateFilterValuesData(task, method = "chi.squared") %>%
  .$data %>%
  ggplot(aes(x = reorder(name, chi.squared), y = chi.squared, fill = reorder(name, chi.squared))) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = myfillcolors, name = "features") +
  scale_x_discrete("Features")
# Combine the five panels
grid.arrange(pmi, uni, gt, ig, csq, ncol = 2)
As shown in Figure 3, four of the five indices suggest that age is the most important feature, with treatment ranked second. However, the roles of gender and comorbidity, as evaluated by the gain ratio, chi-squared test, and information gain, are inconclusive. We only know that they contribute less to the prediction; but should we keep them in the model?
2B) Exhaustive feature selection
We will apply an exhaustive search: a brute-force algorithm that fits and tests a model for every possible combination of our features, then votes for the most accurate one (the model with the lowest mean squared error). With four candidate features, this amounts to 2^4 - 1 = 15 non-empty feature subsets.
rdesc = makeResampleDesc("Bootstrap", iters = 100L)  # 100 bootstrap iterations per candidate model
ctrlExh = makeFeatSelControlExhaustive(maxit = 20L, max.features = 4)  # exhaustive search control
wrap = makeFeatSelWrapper(lrn, rdesc, control = ctrlExh)  # GLM learner wrapped with feature selection
modwrap = train(wrap, task)
modwrap$learner.model$opt.result
## FeatSel result:
## Features (2): treatment, age
## mse.test.mean=0.0992
The result indicates that the most accurate model includes two features: age and treatment group. This model has the lowest mean MSE (0.0992), based on 100 bootstrap iterations.
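For comparison with the step-wise module mentioned in the Background, base R's step() performs AIC-based backward selection from the full model; a sketch (not part of the mlr workflow):
# Backward step-wise selection by AIC, starting from all four predictors
full = lm(hoursofsleep ~ treatment + age + gender + comorbidity, data = df)
step(full, direction = "backward")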
summary(modwrap$learner.model$next.model$learner.model)
##
## Call:
## stats::glm(formula = f, family = family, data = d, control = ctrl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.51352 -0.17628 -0.02299 0.17801 0.53652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.988729 0.365971 2.702 0.0151 *
## treatmentSleepingpill -0.411051 0.142815 -2.878 0.0104 *
## age 0.084997 0.005095 16.684 5.66e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.07120382)
##
## Null deviance: 35.8220 on 19 degrees of freedom
## Residual deviance: 1.2105 on 17 degrees of freedom
## AIC: 8.663
##
## Number of Fisher Scoring iterations: 2
fit=lm(data=df,hoursofsleep~age+treatment)
car::Anova(fit)
## Anova Table (Type II tests)
##
## Response: hoursofsleep
## Sum Sq Df F value Pr(>F)
## age 19.8195 1 278.349 5.66e-12 ***
## treatment 0.5899 1 8.284 0.01043 *
## Residuals 1.2105 17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmSupport::modelEffectSizes(fit)
## lm(formula = hoursofsleep ~ age + treatment, data = df)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 0.5197 1 0.3004 NA
## age 19.8195 1 0.9424 0.5533
## treatment 0.5899 1 0.3276 0.0165
##
## Sum of squared errors (SSE): 1.2
## Sum of squared total (SST): 35.8
# Marginal effect plots from the effects package
fit %>% effects::allEffects() %>% plot()
require(ggfortify)
# Regression diagnostics for the selected model, colored by treatment group
modwrap$learner.model$next.model %>%
  .$learner.model %>%
  autoplot(., which = 1:6, label.size = 1, data = df, colour = 'treatment', ncol = 3) +
  scale_color_manual(values = mycolors)
As shown above, we can extract the content of this model, then perform an ANOVA and an effect size analysis on it via the lm() function.
2C) Bootstrapping
In the next step, we will perform a resampling process that aims to: 1) evaluate the plausibility of our model on randomized samples and 2) derive bootstrap confidence intervals for its performance metrics and coefficients.
The resampling process consists of 100 bootstrap iterations. The model's performance is evaluated using the following metrics: explained variance (expvar), R-squared (rsq), RMSE, and MSE.
# Restrict the task to the two selected predictors (columns 1-3: treatment, hoursofsleep, age)
task2 = makeRegrTask(id = "Sleep", data = df[, c(1, 2, 3)], target = "hoursofsleep")
r = resample(lrn, task2, measures = list(expvar, rsq, rmse, mse), rdesc, models = TRUE)  # bootstrapping
r$aggr
## expvar.test.mean rsq.test.mean rmse.test.rmse mse.test.mean
## 0.94333137 0.93340869 0.30666037 0.09404058
Now we will extract these metrics, as well as the coefficients of the two predictors, from each of the 100 fitted model versions.
# Add empty columns to hold the coefficients and information criteria of each iteration
resdf = r$measures.test %>%
  as_tibble() %>%
  mutate(Intercept = rep(0, nrow(.)), Sleepingpill = rep(0, nrow(.)),
         Age = rep(0, nrow(.)), BIC = rep(0, nrow(.)), AIC = rep(0, nrow(.)))
# Fill them from the fitted glm stored in each resampling iteration
for(i in 1:nrow(resdf)){
  coef = r %>% .$models %>% .[[i]] %>% .$learner.model %>% .$coefficients
  aic = r %>% .$models %>% .[[i]] %>% .$learner.model %>% .$aic
  bic = r %>% .$models %>% .[[i]] %>% .$learner.model %>% BIC(.)
  resdf$Intercept[i] = coef[1]
  resdf$Sleepingpill[i] = coef[2]
  resdf$Age[i] = coef[3]
  resdf$AIC[i] = aic
  resdf$BIC[i] = bic
}
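For reference, the coefficient extraction could be written more compactly; a sketch using base R's sapply over the stored models:
# Each element of r$models holds the fitted glm in $learner.model
coefmat = t(sapply(r$models, function(m) coef(m$learner.model)))
head(coefmat)  # columns: (Intercept), treatmentSleepingpill, age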
psych::describe(resdf)
## vars n mean sd median trimmed mad min max range
## iter 1 100 50.50 29.01 50.50 50.50 37.06 1.00 100.00 99.00
## expvar 2 100 0.94 0.24 0.92 0.93 0.23 0.45 1.77 1.32
## rsq 3 100 0.93 0.04 0.94 0.94 0.03 0.70 0.99 0.28
## rmse 4 100 0.30 0.07 0.30 0.30 0.06 0.12 0.55 0.43
## mse 5 100 0.09 0.05 0.09 0.09 0.04 0.01 0.30 0.29
## Intercept 6 100 1.05 0.40 1.04 1.04 0.33 -0.02 2.14 2.16
## Sleepingpill 7 100 -0.44 0.13 -0.42 -0.43 0.11 -0.87 -0.20 0.67
## Age 8 100 0.08 0.01 0.08 0.08 0.00 0.07 0.10 0.03
## BIC 9 100 6.96 7.19 8.21 7.35 7.63 -14.62 20.38 35.00
## AIC 10 100 2.98 7.19 4.23 3.37 7.63 -18.60 16.40 35.00
## skew kurtosis se
## iter 0.00 -1.24 2.90
## expvar 0.55 0.51 0.02
## rsq -2.69 10.28 0.00
## rmse 0.48 1.11 0.01
## mse 1.51 4.28 0.00
## Intercept 0.21 0.39 0.04
## Sleepingpill -0.66 0.48 0.01
## Age 0.15 1.04 0.00
## BIC -0.55 -0.09 0.72
## AIC -0.55 -0.09 0.72
# 95% percentile bootstrap intervals (2.5th, 50th, and 97.5th percentiles)
Intercept = quantile(resdf$Intercept, c(.025, .5, .975))
Age = quantile(resdf$Age, c(.025, .5, .975))
Sleepingpill = quantile(resdf$Sleepingpill, c(.025, .5, .975))
Table: Bootstrap confidence interval of coefficients
rbind(Intercept,Age,Sleepingpill)%>%knitr::kable()
| | 2.5% | 50% | 97.5% |
|---|---|---|---|
| Intercept | 0.2959638 | 1.0448223 | 1.8946730 |
| Age | 0.0729909 | 0.0839635 | 0.0957700 |
| Sleepingpill | -0.7404481 | -0.4195826 | -0.2086069 |
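The same type of percentile interval can be cross-checked with the boot package; a sketch in which the statistic (the hypothetical helper coef_fun) refits the selected model on each resample:
library(boot)
# Statistic: refit hoursofsleep ~ age + treatment on a bootstrap resample
coef_fun = function(data, idx) coef(lm(hoursofsleep ~ age + treatment, data = data[idx, ]))
b = boot(df, coef_fun, R = 100)
boot.ci(b, type = "perc", index = 2)  # percentile CI for the age coefficient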
Figure 4A: Evolution of coefficients through 100 iterations
effect = resdf[, c(1, 6, 7, 8)] %>% gather(Intercept, Age, Sleepingpill, key = "Para", value = "Value")
effect %>% ggplot(aes(x = iter, y = Value)) +
  geom_path(aes(color = Para), size = 0.8) +
  facet_wrap(~Para, ncol = 1, scales = "free") +
  scale_color_manual(values = mycolors, name = "Coefficient") +
  scale_y_continuous("Coefficients")
Figure 4B: Bootstrap confidence interval of coefficients
effect %>% ggplot(aes(x = Value, fill = Para, color = Para)) +
  geom_histogram(alpha = 0.7) +
  facet_wrap(~Para, ncol = 2, scales = "free") +
  scale_fill_manual(values = myfillcolors, name = "Coefficients") +
  scale_color_manual(values = mycolors) +
  geom_vline(xintercept = 0, color = "blue4", linetype = "dashed", size = 1)
Figure 5A: Evolution of model’s performance metrics through 100 iterations
resdf[, c(1:5, 9, 10)] %>% gather(expvar:AIC, key = "Metric", value = "Value") %>%
  ggplot(aes(x = iter, y = Value)) +
  geom_path(aes(color = Metric), size = 0.8) +
  facet_wrap(~Metric, ncol = 2, scales = "free") +
  scale_color_manual(values = mycolors)
Figure 5B: Bootstrap confidence interval of performance metrics
resdf[, c(1:5, 9, 10)] %>% gather(expvar:AIC, key = "Metric", value = "Value") %>%
  ggplot(aes(x = Value, fill = Metric, color = Metric)) +
  geom_histogram(alpha = 0.7) +
  facet_wrap(~Metric, ncol = 2, scales = "free") +
  scale_fill_manual(values = myfillcolors) +
  scale_color_manual(values = mycolors)
Based on the bootstrap confidence intervals, neither of which crosses zero, we can be confident that treatment and age have genuine effects on sleep time.
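This can also be read directly from the resampled coefficients; a sketch computing the fraction of iterations in which each effect keeps its sign:
# Proportion of the 100 bootstrap fits with a negative sleeping-pill effect
mean(resdf$Sleepingpill < 0)
# Proportion with a positive age effect
mean(resdf$Age > 0)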
2D) Marginal effects
The partial dependence function in the mlr package allows us to present a marginalized plot of the effect of age and treatment on the predicted sleep time.
# The object is named pd to avoid masking the base pdf() graphics device
pd <- modwrap$learner.model$next.model %>%
  generatePartialDependenceData(., task, features = c("age", "treatment"), interaction = TRUE, gridsize = 20L)
pd %>% .$data %>%
  ggplot() +
  geom_smooth(aes(x = age, y = hoursofsleep, color = treatment), se = T, show.legend = T, alpha = 0.2) +
  scale_colour_manual(values = mycolors) +
  scale_fill_manual(values = myfillcolors)
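Alternatively, mlr ships its own plotting helper for the same object; a one-line sketch:
# Default mlr visualization of the partial dependence data
plotPartialDependence(pd)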
Conclusion: The present experiment has introduced three machine learning techniques that could eventually improve your GLM-based interpretive studies:
1) Manual feature selection using information gain and other scores;
2) A wrapped model with an integrated brute-force algorithm for automated, active feature selection: such a "smart" model selects the most relevant predictors by itself and returns the most accurate model;
3) Bootstrap resampling for validating a linear model and extracting the confidence intervals of its coefficients.