Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Background
This parallel-group study aims to compare the long-term effectiveness of a new cholesterol-reducing compound with another drug. Twenty patients were included and equally randomized to the Control and Treatment subgroups. The main outcome is HDL cholesterol level, measured weekly for 5 weeks.
The conventional approach for this study design would be repeated-measures MANOVA or a generalized linear mixed model (GLMM). Many linear mixed model tools are available in R, including nlme, gamlss, lme4 and brms. In this study, however, we will introduce an alternative approach that combines the linear mixed model (lme4 package) with Machine Learning techniques such as k-fold cross-validation and bootstrap resampling.
Materials and methods
The original dataset was provided in Chapter 9 of Machine Learning in Medicine - Cookbook Two (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS format (chap9fixedandrandomeffects.sav) from their website: extras.springer.com.
Data analysis was performed in the R statistical programming language. Our task is not straightforward, since the linear mixed model is not supported by any machine learning framework in R, so everything you read in this document is the result of manual coding.
The procedure consists of four steps.
First, the following packages must be loaded in RStudio:
library(foreign)   # read the SPSS .sav file
library(tidyverse) # data manipulation and plotting
library(compiler)  # byte-code compilation
library(parallel)  # parallel computing for the bootstrap
library(boot)      # bootstrap utilities
library(lme4)      # linear mixed models
Results
Step 0: Data loading and reshaping
# Load the SPSS file into a data frame and drop the 5th column
df0=read.spss("chap9fixedandrandomeffects.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%.[,-5]
# Keep the 1st and 4th columns (patient_id and treatment) as factors, then attach the outcome
df<-df0[,c(1,4)]%>%map(~as.factor(.))%>%as_tibble()%>%mutate(Hdlchol=df0$outcome)
df$week=as.integer(df0$week)
df$treatment<-recode_factor(df0$treatment,`0`="Control",`1`="Newdrug")
rm(df0)
Once the above commands have been executed, we obtain a clean data frame "df" that will be used throughout the four steps.
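A quick check of the result (our addition; the expected structure follows from the steps above):
str(df)  # expect 100 obs. of 4 variables: patient_id, treatment, Hdlchol, week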
Step 1: Data exploration
Table 1: Descriptive statistics
df%>%split(.$treatment)%>%map(~psych::describeBy(.$Hdlchol,group=as.factor(.$week)))
## $Control
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2 0.22 1.99 1.99 0.11 1.66 2.36 0.7 0.06 -1.07
## se
## X1 0.07
##
## $`2`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.01 0.23 2.01 2.02 0.11 1.62 2.35 0.73 -0.13 -1.02
## se
## X1 0.07
##
## $`3`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 1.91 0.22 1.9 1.9 0.11 1.57 2.26 0.69 0.04 -1.08
## se
## X1 0.07
##
## $`4`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 1.86 0.22 1.86 1.86 0.11 1.52 2.23 0.71 0.09 -1.06
## se
## X1 0.07
##
## $`5`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 1.86 0.22 1.86 1.86 0.11 1.5 2.2 0.7 -0.06 -1.07 0.07
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
##
## $Newdrug
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 2.01 0.37 1.94 1.96 0.28 1.57 2.8 1.23 0.79 -0.33 0.12
##
## $`2`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.16 0.22 2.16 2.16 0.11 1.82 2.51 0.69 0.04 -1.08
## se
## X1 0.07
##
## $`3`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.21 0.24 2.21 2.21 0.12 1.83 2.57 0.74 0.03 -1.08
## se
## X1 0.08
##
## $`4`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.19 0.23 2.2 2.2 0.11 1.83 2.55 0.72 -0.01 -1.06
## se
## X1 0.07
##
## $`5`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.16 0.23 2.16 2.16 0.12 1.82 2.52 0.7 0.08 -1.1
## se
## X1 0.07
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
Figure 1A: Variation of HDL cholesterol level (mmol/l) in each patient
df%>%unite(idx,treatment,patient_id,sep="_")%>%
ggplot(aes(x=week,y=Hdlchol,group=1))+geom_point(shape=21,size=4,color="black",aes(fill=idx),position="identity",show.legend =F)+geom_path(aes(color=idx),size=1,show.legend =F)+facet_wrap(~idx,ncol=5,scales="free")+theme_bw()
These figures indicate that, except for the 15th patient, Hdlchol increased progressively from the 1st to the 3rd week in response to treatment with the new drug, then decreased slightly from the 4th to the 5th week. Most patients in the Control group also experienced a temporary increase in Hdlchol, but this effect was attenuated from the 2nd week until the end of follow-up.
Figure 1B: Variation of HDL cholesterol level (mmol/l) over 5 weeks
df%>%ggplot(aes(x=week,y=Hdlchol,group=treatment))+geom_point(aes(color=treatment),size=3,alpha=0.5)+stat_summary(fun.y = median, fun.ymin = min, fun.ymax = max,aes(color=treatment),size=1)+stat_summary(aes(fill=treatment),fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,color="black",size=5,geom ="point")+stat_summary(fun.y="median",geom="line",aes(color=treatment),size=1)+facet_wrap(~treatment,ncol=1,scales="free")+theme_bw()
Figure 1B presents the median, minimum and maximum values of Hdlchol at the 5 time points under the 2 treatment conditions. The new drug's effect appears consistently higher than that of the control group.
Step 2: Conventional linear mixed model (using lme4 package)
Now we will fit a linear mixed model with a random effect of patient_id (a random intercept) and an interaction between the treatment condition (treatment) and follow-up duration (week).
Then we will perform an ANOVA on this model.
m1=lmer(Hdlchol~treatment*week+(1|patient_id),data=df)
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Hdlchol ~ treatment * week + (1 | patient_id)
## Data: df
##
## REML criterion at convergence: -95
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1590 -0.2976 -0.0596 0.4269 5.8972
##
## Random effects:
## Groups Name Variance Std.Dev.
## patient_id (Intercept) 0.050711 0.2252
## Residual 0.009841 0.0992
## Number of obs: 100, groups: patient_id, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.05360 0.07845 26.179
## treatmentNewdrug -0.01300 0.11094 -0.117
## week -0.04220 0.00992 -4.254
## treatmentNewdrug:week 0.07700 0.01403 5.488
##
## Correlation of Fixed Effects:
## (Intr) trtmnN week
## trtmntNwdrg -0.707
## week -0.379 0.268
## trtmntNwdr: 0.268 -0.379 -0.707
se=sqrt(diag(vcov(m1)))
(tab=cbind(Est = fixef(m1), LL = fixef(m1)-1.96*se,UL=fixef(m1)+1.96*se))
## Est LL UL
## (Intercept) 2.0536 1.89984809 2.20735191
## treatmentNewdrug -0.0130 -0.23043804 0.20443804
## week -0.0422 -0.06164373 -0.02275627
## treatmentNewdrug:week 0.0770 0.04950241 0.10449759
car::Anova(m1)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: Hdlchol
## Chisq Df Pr(>Chisq)
## treatment 4.5107 1 0.03368 *
## week 0.2782 1 0.59787
## treatment:week 30.1234 1 4.054e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: although the lme4 package is more powerful than the nlme package, it does not provide any p-values. However, we can easily interpret the coefficients by looking at their confidence intervals.
The result suggests that there was a significant interaction between treatment type and duration, and that the new drug significantly increased Hdlchol over time.
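As a cross-check we added (not part of the original analysis), lme4's confint() can produce the same Wald intervals directly, or slower profile-likelihood intervals:
confint(m1, method = "Wald")     # normal-approximation intervals, matching the manual calculation above
# confint(m1, method = "profile")  # profile-likelihood intervals (slower, often preferred)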
Step 3: Applying k-fold cross-validation to the GLMM
In Step 3, our objective is to perform a resampling process on the linear mixed model, in order to verify its accuracy on randomized subsets. The first approach is k-fold cross-validation.
First, we will write a function called "fold_cv" that creates K randomized blocks (folds) of data.
fold_cv=function(data,k){
  # Randomly partition the row indices of 'data' into k folds
  folds=cvTools::cvFolds(nrow(data),K=k)
  invisible(folds)
}
We apply the fold_cv function to our dataset, setting k = 10 for a 10-fold CV.
set.seed(123)
fold<-df%>%fold_cv(.,k=10)
str(fold)
## List of 5
## $ n : num 100
## $ K : num 10
## $ R : num 1
## $ subsets: int [1:100, 1] 29 79 41 86 91 5 50 83 51 42 ...
## $ which : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "class")= chr "cvFolds"
As you can see, the 'subsets' element of fold contains the permuted row indices (row names) to be used for subsetting, while the 'which' element assigns each row to a fold, numbered from 1 to K.
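For instance (a quick illustration we added), the row indices held out in the first fold can be listed as follows:
fold$subsets[fold$which == 1]  # rows of df assigned to fold 1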
Then we will create a for loop. As you already know, a for loop requires an object to receive the results generated at each iteration. Here is ours:
# Add empty columns that the loop will fill in (mutate recycles the scalar 0)
temp <- df %>% mutate(Fold = 0, holdoutpred = 0, MSE = 0, RMSE = 0, MAE = 0, R2 = 0, AIC = 0, BIC = 0)
And here is our loop. It looks a little scary, but its job is simply to generate the predicted values and several performance metrics at each iteration:
for (i in 1:10) {
  # Split: train on all folds except i, validate on fold i
  train = temp[fold$subsets[fold$which != i], ]
  validation = temp[fold$subsets[fold$which == i], ]
  # Refit the mixed model on the training folds
  newlm = lme4::lmer(formula = Hdlchol ~ treatment * week + (1 | patient_id), data = train)
  # Predict the held-out fold and compute the error metrics
  newpred = predict(newlm, newdata = validation)
  true = validation$Hdlchol
  error = true - newpred
  mse = mean(error^2)
  rmse = sqrt(mse)
  mae = mean(abs(error))
  R2 = 1 - sum(error^2) / sum((true - mean(true))^2)
  # Write the predictions and metrics back into the held-out rows
  temp[fold$subsets[fold$which == i], ]$holdoutpred <- newpred
  temp[fold$subsets[fold$which == i], ]$RMSE = rmse
  temp[fold$subsets[fold$which == i], ]$MSE = mse
  temp[fold$subsets[fold$which == i], ]$MAE = mae
  temp[fold$subsets[fold$which == i], ]$R2 = R2
  temp[fold$subsets[fold$which == i], ]$AIC = AIC(newlm)
  temp[fold$subsets[fold$which == i], ]$BIC = BIC(newlm)
  temp[fold$subsets[fold$which == i], ]$Fold = i
}
Note: MAE: mean absolute error; MSE: mean squared error; RMSE: root mean squared error; R2: coefficient of determination, i.e. 1 - residual sum of squares / total sum of squares; AIC: Akaike information criterion; BIC: Bayesian information criterion; holdoutpred: predicted values for the held-out fold.
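For reference, the same metrics can be wrapped in a small helper function (a hypothetical convenience we added, not used in the loop above):
metrics <- function(true, pred) {
  error <- true - pred
  c(MSE  = mean(error^2),
    RMSE = sqrt(mean(error^2)),
    MAE  = mean(abs(error)),
    R2   = 1 - sum(error^2) / sum((true - mean(true))^2))
}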
If you print the temp data frame, you can see that all the metrics are correctly attributed to each block. We can do a lot with these data:
temp
## # A tibble: 100 × 12
## patient_id treatment Hdlchol week Fold holdoutpred MSE
## <fctr> <fctr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Control 1.66 1 4 1.665838 0.058904191
## 2 1 Control 1.62 2 5 1.631066 0.016436569
## 3 1 Control 1.57 3 1 1.592532 0.005255703
## 4 1 Control 1.52 4 8 1.553076 0.002642570
## 5 1 Control 1.50 5 6 1.506683 0.005121781
## 6 2 Control 1.69 1 4 1.717281 0.058904191
## 7 2 Control 1.71 2 6 1.671443 0.005121781
## 8 2 Control 1.60 3 6 1.630112 0.005121781
## 9 2 Control 1.55 4 5 1.599584 0.016436569
## 10 2 Control 1.56 5 6 1.547450 0.005121781
## # ... with 90 more rows, and 5 more variables: RMSE <dbl>, MAE <dbl>,
## # R2 <dbl>, AIC <dbl>, BIC <dbl>
Figure 2: Distribution of the model's performance metrics, as evaluated through 10-fold cross-validation
temp%>%gather(.,MSE:BIC,key ="Metric",value = "Value")%>%ggplot(aes(x=Metric,y=Value,fill=Metric))+geom_boxplot()+coord_flip()+facet_wrap(~Metric,ncol=1,scales="free")+theme_bw()
temp%>%select(.,7:12)%>%map_dbl(median,na.rm=T)
## MSE RMSE MAE R2 AIC
## 0.005432448 0.073695389 0.058359962 0.925044423 -62.146297402
## BIC
## -47.147439380
The cross-validation results strengthen the plausibility of our mixed model: the median R2 was remarkably high (0.925), while the median RMSE and MSE were 0.074 and 0.005 respectively, indicating very low prediction error.
Step 4: And now, a bootstrap resampling for our mixed model
The bootstrap will be handled by a function called "sampler", which unfortunately we must write manually:
sampler <- function(dat, clustervar, replace = TRUE, reps = 1) {
  # Resample whole clusters (here: patients) with replacement, 'reps' times
  cid <- unique(dat[, clustervar[1]])
  ncid <- length(cid)
  recid <- sample(cid, size = ncid * reps, replace = TRUE)
  if (replace) {
    # Also resample rows within each sampled cluster
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = sample(which(dat[, clustervar] == recid[i]),
        size = length(which(dat[, clustervar] == recid[i])), replace = TRUE))
    })
  } else {
    # Keep every row of each sampled cluster
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = which(dat[, clustervar] == recid[i]))
    })
  }
  dat <- as.data.frame(do.call(rbind, rid))
  # Label which bootstrap replicate each pseudo-cluster belongs to
  dat$Replicate <- factor(cut(dat$NewID, breaks = c(1, ncid * 1:reps),
                              include.lowest = TRUE, labels = FALSE))
  dat$NewID <- factor(dat$NewID)
  invisible(dat)
}
To summarise, this function resamples whole patients (clusters) with replacement and repeats the process 'reps' times.
set.seed(123)
bigdata<-df%>%as.data.frame()%>%sampler("patient_id",reps =1000)%>%cbind(df[.$RowID,])
Applying the sampler function to our data frame df, we get a "bigdata" object that contains 100 × 1000 = 100,000 cases:
str(bigdata)
## 'data.frame': 100000 obs. of 7 variables:
## $ NewID : Factor w/ 20000 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ RowID : int 30 27 28 26 30 77 77 78 78 78 ...
## $ Replicate : Factor w/ 1000 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ patient_id: Factor w/ 20 levels "1 ","10","11",..: 17 17 17 17 17 8 8 8 8 8 ...
## $ treatment : Factor w/ 2 levels "Control","Newdrug": 1 1 1 1 1 2 2 2 2 2 ...
## $ Hdlchol : num 1.88 2.03 1.92 2.01 1.88 2.17 2.17 2.22 2.22 2.22 ...
## $ week : int 5 2 3 1 5 2 2 3 3 3 ...
Apart from the bookkeeping columns NewID, RowID and Replicate, the structure of bigdata is identical to that of the original df.
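A quick sanity check we added: each resampled row is a literal copy of the df row indexed by RowID.
all(names(df) %in% names(bigdata))               # the four original columns are carried over
all.equal(bigdata$Hdlchol, df$Hdlchol[bigdata$RowID])  # values match row by row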
f <- fixef(m1)           # fixed-effect estimates from the original model
r <- getME(m1, "theta")  # random-effect parameter (random-intercept SD relative to the residual SD)
cl = makeCluster(10)     # 10 parallel workers; reduce this if your machine has fewer cores
clusterExport(cl, c("bigdata", "f", "r"))
clusterEvalQ(cl, require(lme4))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
# Refit the model on the i-th bootstrap replicate, starting from the original
# estimates; note that lmer() only accepts 'theta' as starting values
# (starting fixed effects and nAGQ apply to glmer(), so we omit them here)
myboot <- function(i) {
  object <- try(lmer(Hdlchol ~ treatment * week + (1 | patient_id),
                     data = bigdata, subset = Replicate == i,
                     start = list(theta = r)), silent = TRUE)
  if (inherits(object, "try-error"))
    return(object)
  c(fixef(object), getME(object, "theta"))
}
start <- proc.time()  # timestamp before the bootstrap
res <- parLapplyLB(cl, X = levels(bigdata$Replicate), fun = myboot)  # one model fit per replicate
end <- proc.time()    # timestamp after the bootstrap
stopCluster(cl)
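For the record (a line we added; the original computes the timestamps but never prints them), the elapsed time of the 1,000 parallel fits is:
end - start  # elapsed wall-clock time, in seconds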
success <- sapply(res, is.numeric)  # TRUE where the model fit succeeded
mean(success)                       # proportion of successful replicates
## [1] 1
bigres <- do.call(cbind, res[success])
(ci <- t(apply(bigres, 1, quantile, probs = c(0.025, 0.975))))
## 2.5% 97.5%
## (Intercept) 1.95546640 2.18383246
## treatmentNewdrug -0.23452747 0.23766303
## week -0.04763396 -0.03744964
## treatmentNewdrug:week 0.01770038 0.10827006
## patient_id.(Intercept) 1.49027708 5.76272654
Here is the summarized result of the bootstrapped models:
finaltable <- cbind(Est = c(f, r), SE = c(se, NA), BootMean = rowMeans(bigres),ci)
round(finaltable,5)
## Est SE BootMean 2.5% 97.5%
## (Intercept) 2.0536 0.07844 2.05972 1.95547 2.18383
## treatmentNewdrug -0.0130 0.11094 -0.01723 -0.23453 0.23766
## week -0.0422 0.00992 -0.04234 -0.04763 -0.03745
## treatmentNewdrug:week 0.0770 0.01403 0.07623 0.01770 0.10827
## patient_id.(Intercept) 2.2700 NA 3.17370 1.49028 5.76273
This table presents the point estimates and bootstrap confidence intervals of each coefficient in our model.
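As an optional extension (our assumption, not in the original chapter), approximate two-sided bootstrap p-values for the fixed effects can be obtained from the share of resampled coefficients falling on either side of zero:
# two-sided percentile p-value: twice the smaller tail proportion
boot_p <- apply(bigres[1:4, ], 1,
                function(x) 2 * min(mean(x < 0), mean(x > 0)))
round(boot_p, 4)  # a value of 0 means p < 1/1000 at this resolution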
Figure 3A: Evolution of the coefficients through the 1000 iterations
bootdf<-bigres%>%as_tibble()%>%mutate(Feature=row.names(bigres))%>%gather(V1:V1000,key="Iteration",value="Coefficient")%>%separate(Iteration, into = c("X","Iteration"),sep="V")%>%select(.,c(1,3,4))
bootdf%>%ggplot(aes(x=as.numeric(Iteration),y=Coefficient,group=1))+geom_path(aes(color=Feature))+theme_bw()+facet_wrap(~Feature,ncol=1,scales="free")
Figure 3B: Distribution of the coefficients based on 1,000 bootstrap iterations
bootdf%>%ggplot(aes(x=Coefficient))+geom_histogram(aes(fill=Feature),color="black",alpha=0.6)+theme_bw()+facet_wrap(~Feature,ncol=2,scales="free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These distributions correspond to what we saw in the summary table: the confidence intervals of the model's coefficients.
BONUS: this graph shows the marginal effects of our mixed model:
preddf<-df%>%mutate(Predicted=predict(m1,.))
preddf%>%ggplot(aes(x=week,y=Predicted,group=treatment))+geom_point(aes(color=treatment),size=3,alpha=0.5)+stat_summary(fun.y = median, fun.ymin = min, fun.ymax = max,aes(color=treatment),size=1)+stat_summary(aes(fill=treatment),fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,color="black",size=5,geom ="point")+stat_summary(fun.y="median",geom="line",aes(color=treatment),size=1)+facet_wrap(~treatment,ncol=1,scales="free")+theme_bw()
Discussion: In the original chapter, the authors only performed what we would call "automated model construction" via IBM SPSS Modeler; they did not apply any Machine Learning technique (neither resampling nor testing). Our study successfully extended the basic model by applying both bootstrap and k-fold cross-validation to confirm its prediction accuracy.
Conclusion: Generalized linear mixed models can handle both fixed and random effects and are therefore appropriate for analysing data with repeated measures.
END