Case study 3: Wrapping up a Linear Regression model

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was initiated in 2016 and aims to:

1. Encourage the use of Machine Learning techniques in medical research in Vietnam

2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Background

The 3rd case study (Chapter 4 - Machine Learning in Medicine-Cookbook 1 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) explores the linear regression module in IBM SPSS (which provides both step-wise model selection and a bootstrap feature). The study question is to predict a patient’s sleep time as a function of their gender, age, treatment method (placebo or sleeping pill) and comorbidity.

Our experiment will adopt an alternative way to select the most relevant predictors for the model. We will also introduce bootstrap resampling and partial dependence analysis in the mlr package.

Materials and method

The original dataset was provided in Chapter 4 of Machine Learning in Medicine-Cookbook 1 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (linoutcomeprediction.sav) from their website: extras.springer.com.

Data analysis was performed in the R statistical programming language. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our machine learning experiment, including feature selection and resampling by bootstrap.

Our ML experiment consists of:

  1. Using several criteria to evaluate the contribution of each predictor to the model’s outcome

  2. Wrapping a GLM model with an exhaustive feature-filtering function. This algorithm can automatically detect the most relevant predictors for itself.

  3. Resampling the optimized linear regression algorithm by bootstrap, then extracting the model’s performance metrics and coefficient data.

  4. Using the partial dependence function to extract the marginal effect of each factor and covariate

First, the following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis), and we will customize the aesthetics for ggplot2:

library(foreign)
library(tidyverse)
library(magrittr)  # provides the %<>% assignment pipe used in Step 0
library(mlr)
library(GGally)
library(gridExtra)
library(ggfortify)


my_theme <- function(base_size = 8, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 8),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 8),
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#f7fdff"),
      strip.background = element_rect(fill = "#001d60", color = "#00113a", size = 0.5),
      strip.text = element_text(face = "bold", size = 8, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey5", fill = NA, size = 0.5)
    )
}

theme_set(my_theme())

myfillcolors=c("#ff003f","#0094ff", "#ae00ff" , "#94ff00", "#ffc700","#fc1814")
mycolors=c("#db0229","#026bdb","#48039e","#0d7502","#c97c02","#c40c09")

Results

Step 0) Data loading and reshaping

require(foreign)

df=read.spss("linoutcomeprediction.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()
df$treatment%<>%as.factor()%>%recode_factor(.,`0` = "Placebo", `1` = "Sleepingpill")
df$gender%<>%as.factor()%>%recode_factor(.,`0` = "Female", `1` = "Male")
df$comorbidity%<>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")

df
## # A tibble: 20 × 5
##       treatment hoursofsleep   age gender comorbidity
##          <fctr>        <dbl> <dbl> <fctr>      <fctr>
## 1       Placebo          6.0    65 Female         Yes
## 2       Placebo          7.1    75 Female         Yes
## 3       Placebo          8.1    86 Female          No
## 4       Placebo          7.5    74 Female          No
## 5       Placebo          6.4    64 Female         Yes
## 6       Placebo          7.9    75   Male         Yes
## 7       Placebo          6.8    65   Male         Yes
## 8       Placebo          6.6    64   Male          No
## 9       Placebo          7.3    75   Male          No
## 10      Placebo          5.6    56 Female          No
## 11 Sleepingpill          5.1    55   Male          No
## 12 Sleepingpill          8.0    85 Female         Yes
## 13 Sleepingpill          3.8    36   Male          No
## 14 Sleepingpill          4.4    47 Female         Yes
## 15 Sleepingpill          5.2    58   Male          No
## 16 Sleepingpill          5.4    56 Female         Yes
## 17 Sleepingpill          4.3    46   Male         Yes
## 18 Sleepingpill          6.0    64   Male          No
## 19 Sleepingpill          3.7    33   Male          No
## 20 Sleepingpill          6.2    65 Female         Yes

Once the above commands are executed, we get a clean data frame that will be used throughout our analysis.

Step 1) Data exploration

Table 1: Descriptive analysis

Hmisc::describe(df)
## df 
## 
##  5  Variables      20  Observations
## ---------------------------------------------------------------------------
## treatment 
##        n  missing distinct 
##       20        0        2 
##                                     
## Value           Placebo Sleepingpill
## Frequency            10           10
## Proportion          0.5          0.5
## ---------------------------------------------------------------------------
## hoursofsleep 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       19    0.999     6.07    1.615    3.795    4.250 
##      .25      .50      .75      .90      .95 
##    5.175    6.100    7.150    7.910    8.005 
##                                                                       
## Value       3.7  3.8  4.3  4.4  5.1  5.2  5.4  5.6  6.0  6.2  6.4  6.6
## Frequency     1    1    1    1    1    1    1    1    2    1    1    1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.05 0.05
##                                              
## Value       6.8  7.1  7.3  7.5  7.9  8.0  8.1
## Frequency     1    1    1    1    1    1    1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05
## ---------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       13     0.99     62.2    16.43    35.85    45.00 
##      .25      .50      .75      .90      .95 
##    55.75    64.00    74.25    76.00    85.05 
##                                                                       
## Value        33   36   46   47   55   56   58   64   65   74   75   85
## Frequency     1    1    1    1    1    2    1    3    3    1    3    1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.15 0.15 0.05 0.15 0.05
##                
## Value        86
## Frequency     1
## Proportion 0.05
## ---------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##       20        0        2 
##                         
## Value      Female   Male
## Frequency      10     10
## Proportion    0.5    0.5
## ---------------------------------------------------------------------------
## comorbidity 
##        n  missing distinct 
##       20        0        2 
##                   
## Value       No Yes
## Frequency   10  10
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------

Figure 1: Distribution of the outcome

df%>%gather(treatment,gender,comorbidity,key="Factors",value="Class")%>%ggplot(aes(x=Class,y=hoursofsleep,fill=Factors,color=Factors))+geom_boxplot(alpha=0.6,width=0.8)+coord_flip()+facet_wrap(~Factors,scales="free",ncol=1)+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

The main outcome (hours of sleep) is normally distributed, thus no data transformation is required. The median sleep time seems to be higher in the Placebo group; however, the contrast is not clear between genders or between comorbidity statuses. We still don’t know whether gender should be included in our model.
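As a quick, optional check of this normality claim (our addition, not part of the original case study), a Shapiro-Wilk test can be run on the outcome:

# Optional sanity check: a non-significant p-value is consistent with normality
shapiro.test(df$hoursofsleep)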

Figure 2: Linear relationship between patient’s Age and Sleep time

df%>%gather(treatment,gender,comorbidity,key="Factors",value="Class")%>%ggplot(aes(x=age,y=hoursofsleep,fill=Class,color=Class))+geom_jitter(shape=21,alpha=0.5,size=3)+geom_smooth(method="lm",alpha=0.4)+facet_wrap(~Factors,scales="free",ncol=2)+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

Figure 2 indicates a strong linear relationship between age and sleep time in all groups. Age seems to be a relevant predictor for our linear model.
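To quantify this visual impression (a small sketch of ours, not part of the original workflow), a Pearson correlation test can be run between age and sleep time:

# Pearson correlation between age and hours of sleep, with a 95% confidence interval
cor.test(df$age, df$hoursofsleep)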

Step 2) Machine learning experiments

library(mlr)

task= makeRegrTask(id = "Sleep", data=df,target = "hoursofsleep")

lrn=makeLearner("regr.glm")

We just created a supervised regression task with hours of sleep as the target and 4 potential features, then assigned a regression learner (the GLM algorithm).
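To verify what was just created (an optional sketch using standard mlr accessors), the task object can be inspected directly:

# Optional inspection of the task: a summary plus the feature names
print(task)
getTaskFeatureNames(task)  # should list treatment, age, gender and comorbidity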

2A) Evaluating the role of each feature in dataset

Our first experiment consists of evaluating the role of each feature in the model. Five methods will be applied: a) permutation importance, b) univariate model scoring, c) gain ratio, d) information gain and e) the chi-squared test.

Figure 3: Contribution of each feature to the outcome prediction

pmi=generateFilterValuesData(task,method="permutation.importance",imp.learner=lrn)%>%.$data%>%ggplot(aes(x=reorder(name,permutation.importance),y=permutation.importance,fill=reorder(name,permutation.importance)))+geom_bar(stat="identity",color="black",show.legend=F)+scale_fill_manual(values=myfillcolors,name="permut.Importance")+scale_x_discrete("Features")

uni=generateFilterValuesData(task,method="univariate.model.score",learner=lrn)%>%.$data%>%ggplot(aes(x=reorder(name,univariate.model.score),y=univariate.model.score,fill=reorder(name,univariate.model.score)))+geom_bar(stat="identity",color="black",show.legend=F)+scale_fill_manual(values=myfillcolors,name="Univar Score")+scale_x_discrete("Features")

gt=generateFilterValuesData(task,method="gain.ratio")%>%.$data%>%ggplot(aes(x=reorder(name,gain.ratio),y=gain.ratio,fill=reorder(name,gain.ratio)))+geom_bar(stat="identity",color="black",show.legend=F)+scale_fill_manual(values=myfillcolors,name="Gain ratio")+scale_x_discrete("Features")

ig=generateFilterValuesData(task,method="information.gain")%>%.$data%>%ggplot(aes(x=reorder(name,information.gain),y=information.gain,fill=reorder(name,information.gain)))+geom_bar(stat="identity",color="black",show.legend=F)+scale_fill_manual(values=myfillcolors,name="information gain")+scale_x_discrete("Features")

csq=generateFilterValuesData(task,method="chi.squared")%>%.$data%>%ggplot(aes(x=reorder(name,chi.squared),y=chi.squared,fill=reorder(name,chi.squared)))+geom_bar(stat="identity",color="black")+scale_fill_manual(values=myfillcolors,name="features")+scale_x_discrete("Features")

grid.arrange(pmi,uni,gt,ig,csq,ncol=2)

As shown in Figure 3, 4 of the 5 indices suggest that age is the most important feature, and treatment is considered the second most important. However, the role of gender and comorbidity as evaluated by gain ratio, the chi-squared test and information gain is inconclusive. We only know that they would contribute less to the prediction, but should we keep them in the model?
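One quick, conventional way to probe this question (our sketch, not part of the original experiment) is to compare the AIC of models fitted with and without the two doubtful features:

# Compare a reduced model against the full model; the lower AIC is preferred
AIC(lm(hoursofsleep ~ age + treatment, data = df),
    lm(hoursofsleep ~ age + treatment + gender + comorbidity, data = df))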

2B) The Exhaustive feature selection

We will apply an exhaustive search: a brute-force algorithm that fits and tests a model for every possible combination of our features, then votes for the most accurate model (the one with the lowest mean squared error).

rdesc=makeResampleDesc("Bootstrap",iters=100L)
ctrlExh= makeFeatSelControlExhaustive(maxit=20L,max.features=4)

wrap=makeFeatSelWrapper(lrn,rdesc,control=ctrlExh)

modwrap=train(wrap,task)

modwrap$learner.model$opt.result
## FeatSel result:
## Features (2): treatment, age
## mse.test.mean=0.0992

The result indicates that the most accurate model includes 2 features: age and treatment group. This model has the lowest averaged MSE of 0.0992, based on 100 bootstrap iterations.
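The same information can also be pulled out with mlr’s dedicated accessor (a minimal sketch; getFeatSelResult() is the standard helper for feature-selection wrappers):

# Extract the feature-selection result from the wrapped model
sfeats = getFeatSelResult(modwrap)
sfeats$x  # the selected features: treatment and age
sfeats$y  # the resampled performance of the winning subset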

summary(modwrap$learner.model$next.model$learner.model)
## 
## Call:
## stats::glm(formula = f, family = family, data = d, control = ctrl)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.51352  -0.17628  -0.02299   0.17801   0.53652  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.988729   0.365971   2.702   0.0151 *  
## treatmentSleepingpill -0.411051   0.142815  -2.878   0.0104 *  
## age                    0.084997   0.005095  16.684 5.66e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.07120382)
## 
##     Null deviance: 35.8220  on 19  degrees of freedom
## Residual deviance:  1.2105  on 17  degrees of freedom
## AIC: 8.663
## 
## Number of Fisher Scoring iterations: 2

fit=lm(data=df,hoursofsleep~age+treatment)

car::Anova(fit)
## Anova Table (Type II tests)
## 
## Response: hoursofsleep
##            Sum Sq Df F value   Pr(>F)    
## age       19.8195  1 278.349 5.66e-12 ***
## treatment  0.5899  1   8.284  0.01043 *  
## Residuals  1.2105 17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lmSupport::modelEffectSizes(fit)
## lm(formula = hoursofsleep ~ age + treatment, data = df)
## 
## Coefficients
##                 SSR df pEta-sqr dR-sqr
## (Intercept)  0.5197  1   0.3004     NA
## age         19.8195  1   0.9424 0.5533
## treatment    0.5899  1   0.3276 0.0165
## 
## Sum of squared errors (SSE): 1.2
## Sum of squared total  (SST): 35.8

fit%>%effects::all.effects(.)%>%plot()

require(ggfortify)
modwrap$learner.model$next.model%>%.$learner.model%>%autoplot(., which = 1:6, label.size = 1,data = df,colour = 'treatment',ncol=3)+scale_color_manual(values=mycolors)

As shown above, we extracted the content of this model, refitted it with the lm() function, and then performed an ANOVA and effect-size analysis on it; ggfortify’s autoplot() provides the regression diagnostic plots.

2C) Bootstrapping

In the next step, we will perform a resampling process that aims to:

  1. Evaluate the plausibility of our model through randomized samples, and

  2. Generate “true confidence intervals” for all of the model’s coefficients, which could extend the model’s information as well as our interpretive analysis.

The resampling process consists of 100 bootstrap iterations. The model’s performance is evaluated using the following metrics:

  1. expvar: explained variance; similar to rsq (R-squared), defined as explained sum of squares / total sum of squares
  2. rsq: R-squared
  3. mse: mean squared error
  4. rmse: root mean squared error
  5. BIC and AIC

task2=makeRegrTask(id = "Sleep", data=df[,c(1,2,3)],target = "hoursofsleep")

r=resample(lrn,task2,measures=list(expvar,rsq,rmse,mse),rdesc,models=TRUE) # Bootstrapping

r$aggr
## expvar.test.mean    rsq.test.mean   rmse.test.rmse    mse.test.mean 
##       0.94333137       0.93340869       0.30666037       0.09404058
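The aggregates above hide the iteration-to-iteration variability; the per-iteration values sit in r$measures.test, the data frame we build on in the next step:

# First rows of the per-iteration performance table
head(r$measures.test)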

Now we will extract these metrics, as well as the intercept and the coefficients of the 2 predictors, from each of the 100 model versions.

resdf=r$measures.test%>%as_tibble()%>%mutate(Intercept=rep(0,nrow(.)),Sleepingpill=rep(0,nrow(.)),Age=rep(0,nrow(.)),BIC=rep(0,nrow(.)),AIC=rep(0,nrow(.)))

for(i in 1:nrow(resdf)){
  coef=r%>%.$models%>%.[[i]]%>%.$learner.model%>%.$coefficients
  aic=r%>%.$models%>%.[[i]]%>%.$learner.model%>%.$aic
  bic=r%>%.$models%>%.[[i]]%>%.$learner.model%>%BIC(.)
  resdf$Intercept[i]=coef[1]
  resdf$Sleepingpill[i]=coef[2]
  resdf$Age[i]=coef[3]
  resdf$AIC[i]=aic
  resdf$BIC[i]=bic
}

psych::describe(resdf)
##              vars   n  mean    sd median trimmed   mad    min    max range
## iter            1 100 50.50 29.01  50.50   50.50 37.06   1.00 100.00 99.00
## expvar          2 100  0.94  0.24   0.92    0.93  0.23   0.45   1.77  1.32
## rsq             3 100  0.93  0.04   0.94    0.94  0.03   0.70   0.99  0.28
## rmse            4 100  0.30  0.07   0.30    0.30  0.06   0.12   0.55  0.43
## mse             5 100  0.09  0.05   0.09    0.09  0.04   0.01   0.30  0.29
## Intercept       6 100  1.05  0.40   1.04    1.04  0.33  -0.02   2.14  2.16
## Sleepingpill    7 100 -0.44  0.13  -0.42   -0.43  0.11  -0.87  -0.20  0.67
## Age             8 100  0.08  0.01   0.08    0.08  0.00   0.07   0.10  0.03
## BIC             9 100  6.96  7.19   8.21    7.35  7.63 -14.62  20.38 35.00
## AIC            10 100  2.98  7.19   4.23    3.37  7.63 -18.60  16.40 35.00
##               skew kurtosis   se
## iter          0.00    -1.24 2.90
## expvar        0.55     0.51 0.02
## rsq          -2.69    10.28 0.00
## rmse          0.48     1.11 0.01
## mse           1.51     4.28 0.00
## Intercept     0.21     0.39 0.04
## Sleepingpill -0.66     0.48 0.01
## Age           0.15     1.04 0.00
## BIC          -0.55    -0.09 0.72
## AIC          -0.55    -0.09 0.72

Intercept=quantile(resdf$Intercept,c(.025, .5, .975))
Age=quantile(resdf$Age,c(.025, .5, .975))
Sleepingpill=quantile(resdf$Sleepingpill,c(.025, .5, .975))

Table: Bootstrap confidence interval of coefficients

rbind(Intercept,Age,Sleepingpill)%>%knitr::kable()

|             |       2.5%|        50%|      97.5%|
|:------------|----------:|----------:|----------:|
|Intercept    |  0.2959638|  1.0448223|  1.8946730|
|Age          |  0.0729909|  0.0839635|  0.0957700|
|Sleepingpill | -0.7404481| -0.4195826| -0.2086069|

Figure 4A: Evolution of coefficients through 100 iterations

effect=resdf[,c(1,6,7,8)]%>%gather(Intercept,Age,Sleepingpill,key="Para",value="Value")
effect%>%ggplot(aes(x=iter,y=Value))+geom_path(aes(color=Para),size=0.8)+facet_wrap(~Para,ncol=1,scales="free")+scale_color_manual(values=mycolors,name="Coefficient")+scale_y_continuous("Coefficients")

Figure 4B: Bootstrap confidence interval of coefficients

effect%>%ggplot(aes(x=Value,fill=Para,color=Para))+geom_histogram(alpha=0.7)+facet_wrap(~Para,ncol=2,scales="free")+scale_fill_manual(values=myfillcolors,name="Coefficients")+scale_color_manual(values=mycolors)+geom_vline(xintercept=0,color="blue4",linetype="dashed",size=1)

Figure 5A: Evolution of model’s performance metrics through 100 iterations

resdf[,c(1:5,9,10)]%>%gather(expvar:AIC,key="Metric",value="Value")%>%ggplot(aes(x=iter,y=Value))+geom_path(aes(color=Metric),size=0.8)+facet_wrap(~Metric,ncol=2,scales="free")+scale_color_manual(values=mycolors)

Figure 5B: Bootstrap confidence interval of performance metrics

resdf[,c(1:5,9,10)]%>%gather(expvar:AIC,key="Metric",value="Value")%>%ggplot(aes(x=Value,fill=Metric,color=Metric))+geom_histogram(alpha=0.7)+facet_wrap(~Metric,ncol=2,scales="free")+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

Based on the bootstrap confidence intervals, we can be confident that treatment and age have significant effects on sleep time: neither interval crosses zero.
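As a cross-check (our addition, not in the original analysis), these percentile intervals can be compared with the parametric, normal-theory intervals from the refitted lm() model; the two should roughly agree:

# Parametric 95% confidence intervals, for comparison with the bootstrap quantiles
confint(fit)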

2D) Marginal effects

The partial dependence function in the mlr package allows us to present a marginalized plot of the effects of age and treatment on the predicted sleep time.

pdf<-modwrap$learner.model$next.model%>%generatePartialDependenceData(.,task,features = c("age","treatment"),interaction=TRUE,gridsize=20L)

pdf%>%.$data%>%ggplot()+geom_smooth(aes(x=age,y=hoursofsleep,color=treatment),se=T,show.legend = T,alpha=0.2)+scale_colour_manual(values=mycolors)+scale_fill_manual(values=myfillcolors)
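Note that mlr also ships a ready-made plotting helper for the same object, which could be used in place of the manual ggplot call above:

# Built-in mlr plot of the partial dependence data
plotPartialDependence(pdf)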

Conclusion: The present experiment has introduced 3 machine learning techniques that could eventually improve your GLM-based interpretive studies:

  1. Manual feature selection using information gain and other scores,

  2. Wrapping a model around a brute-force algorithm for automated, active feature selection - such a “smart” model can select the most relevant predictors for itself and return the most accurate model,

  3. Using the bootstrap to validate a linear model and to extract confidence intervals for its coefficients.