Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Background
This parallel-group study aims to compare the long-term effectiveness of a new cholesterol-reducing compound with another drug. Twenty patients were included and equally randomized to the Control and Treatment subgroups. The main outcome is HDL cholesterol level, measured weekly for 5 weeks.
The conventional approach for this study design would be repeated-measures MANOVA or a generalized linear mixed model (GLMM). Many linear mixed model tools are available in R, including nlme, gamlss, lme4 and brms. In this study, however, we will introduce an alternative approach that combines the linear mixed model (lme4 package) with Machine Learning techniques such as k-fold cross-validation and bootstrap resampling.
Materials and methods
The original dataset was provided in Chapter 9 of Machine Learning in Medicine - Cookbook Two (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS format (chap9fixedandrandomeffects.sav) from their website: extras.springer.com.
Data analysis was performed in the R statistical programming language. Our task is not straightforward, since the linear mixed model is not supported by any machine learning framework in R, so everything you read in this document is the result of manual coding.
The procedure consists of four steps.
First, the following packages must be loaded in RStudio:
library(foreign)   # read the SPSS .sav file
library(tidyverse) # data manipulation and plotting
library(compiler)  # byte-code compilation
library(parallel)  # parallel computing for the bootstrap
library(boot)      # bootstrap utilities
library(lme4)      # linear mixed models
Results
Step 0: Data loading and reshaping
# Load the SPSS file into a data frame and drop the 5th column
df0=read.spss("chap9fixedandrandomeffects.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%.[,-5]
# Keep the 1st and 4th columns (patient_id and treatment) as factors, then attach the outcome
df<-df0[,c(1,4)]%>%map(~as.factor(.))%>%as_tibble()%>%mutate(Hdlchol=df0$outcome)
df$week=as.integer(df0$week)
df$treatment<-recode_factor(df0$treatment,`0`="Control",`1`="Newdrug")
rm(df0)
Once the above commands have been executed, we obtain a clean data frame "df" that will be used throughout the four steps.
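A quick check of the result (our addition; the expected structure follows from the steps above):
str(df)  # expect 100 obs. of 4 variables: patient_id, treatment, Hdlchol, week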
Step 1: Data exploration
Table 1: Descriptive statistics
df%>%split(.$treatment)%>%map(~psych::describeBy(.$Hdlchol,group=as.factor(.$week)))
## $Control
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2 0.22 1.99 1.99 0.11 1.66 2.36 0.7 0.06 -1.07
## se
## X1 0.07
##
## $`2`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.01 0.23 2.01 2.02 0.11 1.62 2.35 0.73 -0.13 -1.02
## se
## X1 0.07
##
## $`3`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 1.91 0.22 1.9 1.9 0.11 1.57 2.26 0.69 0.04 -1.08
## se
## X1 0.07
##
## $`4`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 1.86 0.22 1.86 1.86 0.11 1.52 2.23 0.71 0.09 -1.06
## se
## X1 0.07
##
## $`5`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 1.86 0.22 1.86 1.86 0.11 1.5 2.2 0.7 -0.06 -1.07 0.07
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
##
## $Newdrug
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 2.01 0.37 1.94 1.96 0.28 1.57 2.8 1.23 0.79 -0.33 0.12
##
## $`2`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.16 0.22 2.16 2.16 0.11 1.82 2.51 0.69 0.04 -1.08
## se
## X1 0.07
##
## $`3`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.21 0.24 2.21 2.21 0.12 1.83 2.57 0.74 0.03 -1.08
## se
## X1 0.08
##
## $`4`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.19 0.23 2.2 2.2 0.11 1.83 2.55 0.72 -0.01 -1.06
## se
## X1 0.07
##
## $`5`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 2.16 0.23 2.16 2.16 0.12 1.82 2.52 0.7 0.08 -1.1
## se
## X1 0.07
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
Figure 1A: Variation of HDL cholesterol level (mmol/l) in each patient
df%>%unite(idx,treatment,patient_id,sep="_")%>%
ggplot(aes(x=week,y=Hdlchol,group=1))+geom_point(shape=21,size=4,color="black",aes(fill=idx),position="identity",show.legend =F)+geom_path(aes(color=idx),size=1,show.legend =F)+facet_wrap(~idx,ncol=5,scales="free")+theme_bw()
These figures indicate that, except for the 15th patient, Hdlchol increased progressively from the 1st to the 3rd week in response to treatment with the new drug, then decreased slightly from the 4th to the 5th week. Most patients in the Control group also experienced a temporary increase in Hdlchol, but this effect was attenuated from the 2nd week until the end of follow-up.
Figure 1B: Variation of HDL cholesterol level (mmol/l) over 5 weeks
df%>%ggplot(aes(x=week,y=Hdlchol,group=treatment))+geom_point(aes(color=treatment),size=3,alpha=0.5)+stat_summary(fun.y = median, fun.ymin = min, fun.ymax = max,aes(color=treatment),size=1)+stat_summary(aes(fill=treatment),fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,color="black",size=5,geom ="point")+stat_summary(fun.y="median",geom="line",aes(color=treatment),size=1)+facet_wrap(~treatment,ncol=1,scales="free")+theme_bw()
Figure 1B presents the median, minimum and maximum values of Hdlchol at the 5 time points under the 2 treatment conditions. The new drug's effect appears consistently higher than that of the control group.
Step 2: Conventional linear mixed model (using lme4 package)
Now we will fit a linear mixed model with a random effect of patient_id (a random intercept) and an interaction between the treatment condition (treatment) and follow-up duration (week).
Then we will perform an ANOVA on this model.
m1=lmer(Hdlchol~treatment*week+(1|patient_id),data=df)
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Hdlchol ~ treatment * week + (1 | patient_id)
## Data: df
##
## REML criterion at convergence: -95
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1590 -0.2976 -0.0596 0.4269 5.8972
##
## Random effects:
## Groups Name Variance Std.Dev.
## patient_id (Intercept) 0.050711 0.2252
## Residual 0.009841 0.0992
## Number of obs: 100, groups: patient_id, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.05360 0.07845 26.179
## treatmentNewdrug -0.01300 0.11094 -0.117
## week -0.04220 0.00992 -4.254
## treatmentNewdrug:week 0.07700 0.01403 5.488
##
## Correlation of Fixed Effects:
## (Intr) trtmnN week
## trtmntNwdrg -0.707
## week -0.379 0.268
## trtmntNwdr: 0.268 -0.379 -0.707
se=sqrt(diag(vcov(m1)))
(tab=cbind(Est = fixef(m1), LL = fixef(m1)-1.96*se,UL=fixef(m1)+1.96*se))
## Est LL UL
## (Intercept) 2.0536 1.89984809 2.20735191
## treatmentNewdrug -0.0130 -0.23043804 0.20443804
## week -0.0422 -0.06164373 -0.02275627
## treatmentNewdrug:week 0.0770 0.04950241 0.10449759
car::Anova(m1)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: Hdlchol
## Chisq Df Pr(>Chisq)
## treatment 4.5107 1 0.03368 *
## week 0.2782 1 0.59787
## treatment:week 30.1234 1 4.054e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: although the lme4 package is more powerful than the nlme package, it does not provide any p-values. However, we can easily interpret the coefficients by looking at their confidence intervals.
The result suggests that there was a significant interaction between treatment type and duration, and that the new drug significantly increased Hdlchol over time.
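As a cross-check we added (not part of the original analysis), lme4's confint() can produce the same Wald intervals directly, or slower profile-likelihood intervals:
confint(m1, method = "Wald")     # normal-approximation intervals, matching the manual calculation above
# confint(m1, method = "profile")  # profile-likelihood intervals (slower, often preferred)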
Step 3: Applying k-fold cross-validation to the GLMM
In Step 3, our objective is to perform a resampling process on the linear mixed model, in order to verify its accuracy on randomized subsets. The first approach is k-fold cross-validation.
First, we will write a function called "fold_cv" that creates K randomized blocks (folds) of data.
fold_cv=function(data,k){
  # Randomly partition the row indices of 'data' into k folds
  folds=cvTools::cvFolds(nrow(data),K=k)
  invisible(folds)
}
We apply the fold_cv function to our dataset, setting k = 10 for a 10-fold CV.
set.seed(123)
fold<-df%>%fold_cv(.,k=10)
str(fold)
## List of 5
## $ n : num 100
## $ K : num 10
## $ R : num 1
## $ subsets: int [1:100, 1] 29 79 41 86 91 5 50 83 51 42 ...
## $ which : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "class")= chr "cvFolds"
As you can see, the 'subsets' element of fold contains the permuted row indices (row names) to be used for subsetting, while the 'which' element assigns each row to a fold, numbered from 1 to K.
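For instance (a quick illustration we added), the row indices held out in the first fold can be listed as follows:
fold$subsets[fold$which == 1]  # rows of df assigned to fold 1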
Then we will create a for loop. As you already know, a for loop requires an object to receive the results generated at each iteration. Here is ours:
# Add empty columns that the loop will fill in (mutate recycles the scalar 0)
temp <- df %>% mutate(Fold = 0, holdoutpred = 0, MSE = 0, RMSE = 0, MAE = 0, R2 = 0, AIC = 0, BIC = 0)
And here is our loop. It looks a little scary, but its job is simply to generate the predicted values and several performance metrics at each iteration:
for (i in 1:10) {
  # Split: train on all folds except i, validate on fold i
  train = temp[fold$subsets[fold$which != i], ]
  validation = temp[fold$subsets[fold$which == i], ]
  # Refit the mixed model on the training folds
  newlm = lme4::lmer(formula = Hdlchol ~ treatment * week + (1 | patient_id), data = train)
  # Predict the held-out fold and compute the error metrics
  newpred = predict(newlm, newdata = validation)
  true = validation$Hdlchol
  error = true - newpred
  mse = mean(error^2)
  rmse = sqrt(mse)
  mae = mean(abs(error))
  R2 = 1 - sum(error^2) / sum((true - mean(true))^2)
  # Write the predictions and metrics back into the held-out rows
  temp[fold$subsets[fold$which == i], ]$holdoutpred <- newpred
  temp[fold$subsets[fold$which == i], ]$RMSE = rmse
  temp[fold$subsets[fold$which == i], ]$MSE = mse
  temp[fold$subsets[fold$which == i], ]$MAE = mae
  temp[fold$subsets[fold$which == i], ]$R2 = R2
  temp[fold$subsets[fold$which == i], ]$AIC = AIC(newlm)
  temp[fold$subsets[fold$which == i], ]$BIC = BIC(newlm)
  temp[fold$subsets[fold$which == i], ]$Fold = i
}
Note: MAE: mean absolute error; MSE: mean squared error; RMSE: root mean squared error; R2: coefficient of determination, i.e. 1 - residual sum of squares / total sum of squares; AIC: Akaike information criterion; BIC: Bayesian information criterion; holdoutpred: predicted values for the held-out fold.
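For reference, the same metrics can be wrapped in a small helper function (a hypothetical convenience we added, not used in the loop above):
metrics <- function(true, pred) {
  error <- true - pred
  c(MSE  = mean(error^2),
    RMSE = sqrt(mean(error^2)),
    MAE  = mean(abs(error)),
    R2   = 1 - sum(error^2) / sum((true - mean(true))^2))
}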
If you print the temp data frame, you can see that all the metrics are correctly attributed to each block. We can do a lot with these data:
temp
## # A tibble: 100 × 12
## patient_id treatment Hdlchol week Fold holdoutpred MSE
## <fctr> <fctr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Control 1.66 1 4 1.665838 0.058904191
## 2 1 Control 1.62 2 5 1.631066 0.016436569
## 3 1 Control 1.57 3 1 1.592532 0.005255703
## 4 1 Control 1.52 4 8 1.553076 0.002642570
## 5 1 Control 1.50 5 6 1.506683 0.005121781
## 6 2 Control 1.69 1 4 1.717281 0.058904191
## 7 2 Control 1.71 2 6 1.671443 0.005121781
## 8 2 Control 1.60 3 6 1.630112 0.005121781
## 9 2 Control 1.55 4 5 1.599584 0.016436569
## 10 2 Control 1.56 5 6 1.547450 0.005121781
## # ... with 90 more rows, and 5 more variables: RMSE <dbl>, MAE <dbl>,
## # R2 <dbl>, AIC <dbl>, BIC <dbl>
Figure 2: Distribution of the model's performance metrics, as evaluated through 10-fold cross-validation
temp%>%gather(.,MSE:BIC,key ="Metric",value = "Value")%>%ggplot(aes(x=Metric,y=Value,fill=Metric))+geom_boxplot()+coord_flip()+facet_wrap(~Metric,ncol=1,scales="free")+theme_bw()
temp%>%select(.,7:12)%>%map_dbl(median,na.rm=T)
## MSE RMSE MAE R2 AIC
## 0.005432448 0.073695389 0.058359962 0.925044423 -62.146297402
## BIC
## -47.147439380
The cross-validation results strengthen the plausibility of our mixed model: the median R2 was remarkably high (0.925), while the median RMSE and MSE were 0.074 and 0.005 respectively, indicating very low prediction error.
Step 4: And now, a bootstrap resampling for our mixed model
The bootstrap will be handled by a function called "sampler", which unfortunately we must write manually:
sampler <- function(dat, clustervar, replace = TRUE, reps = 1) {
  # Resample whole clusters (here: patients) with replacement, 'reps' times
  cid <- unique(dat[, clustervar[1]])
  ncid <- length(cid)
  recid <- sample(cid, size = ncid * reps, replace = TRUE)
  if (replace) {
    # Also resample rows within each sampled cluster
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = sample(which(dat[, clustervar] == recid[i]),
        size = length(which(dat[, clustervar] == recid[i])), replace = TRUE))
    })
  } else {
    # Keep every row of each sampled cluster
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = which(dat[, clustervar] == recid[i]))
    })
  }
  dat <- as.data.frame(do.call(rbind, rid))
  # Label which bootstrap replicate each pseudo-cluster belongs to
  dat$Replicate <- factor(cut(dat$NewID, breaks = c(1, ncid * 1:reps),
                              include.lowest = TRUE, labels = FALSE))
  dat$NewID <- factor(dat$NewID)
  invisible(dat)
}
To summarise, this function resamples whole patients (clusters) with replacement and repeats the process 'reps' times.
set.seed(123)
bigdata<-df%>%as.data.frame()%>%sampler("patient_id",reps =1000)%>%cbind(df[.$RowID,])
Applying the sampler function to our data frame df, we get a "bigdata" object that contains 100 × 1000 = 100,000 cases:
str(bigdata)
## 'data.frame': 100000 obs. of 7 variables:
## $ NewID : Factor w/ 20000 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ RowID : int 30 27 28 26 30 77 77 78 78 78 ...
## $ Replicate : Factor w/ 1000 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ patient_id: Factor w/ 20 levels "1 ","10","11",..: 17 17 17 17 17 8 8 8 8 8 ...
## $ treatment : Factor w/ 2 levels "Control","Newdrug": 1 1 1 1 1 2 2 2 2 2 ...
## $ Hdlchol : num 1.88 2.03 1.92 2.01 1.88 2.17 2.17 2.22 2.22 2.22 ...
## $ week : int 5 2 3 1 5 2 2 3 3 3 ...
Apart from the bookkeeping columns NewID, RowID and Replicate, the structure of bigdata is identical to that of the original df.
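A quick sanity check we added: each resampled row is a literal copy of the df row indexed by RowID.
all(names(df) %in% names(bigdata))               # the four original columns are carried over
all.equal(bigdata$Hdlchol, df$Hdlchol[bigdata$RowID])  # values match row by row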
f <- fixef(m1)           # fixed-effect estimates from the original model
r <- getME(m1, "theta")  # random-effect parameter (random-intercept SD relative to the residual SD)
cl = makeCluster(10)     # 10 parallel workers; reduce this if your machine has fewer cores
clusterExport(cl, c("bigdata", "f", "r"))
clusterEvalQ(cl, require(lme4))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
# Refit the model on the i-th bootstrap replicate, starting from the original
# estimates; note that lmer() only accepts 'theta' as starting values
# (starting fixed effects and nAGQ apply to glmer(), so we omit them here)
myboot <- function(i) {
  object <- try(lmer(Hdlchol ~ treatment * week + (1 | patient_id),
                     data = bigdata, subset = Replicate == i,
                     start = list(theta = r)), silent = TRUE)
  if (inherits(object, "try-error"))
    return(object)
  c(fixef(object), getME(object, "theta"))
}
start <- proc.time()  # timestamp before the bootstrap
res <- parLapplyLB(cl, X = levels(bigdata$Replicate), fun = myboot)  # one model fit per replicate
end <- proc.time()    # timestamp after the bootstrap
stopCluster(cl)
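For the record (a line we added; the original computes the timestamps but never prints them), the elapsed time of the 1,000 parallel fits is:
end - start  # elapsed wall-clock time, in seconds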
success <- sapply(res, is.numeric)  # TRUE where the model fit succeeded
mean(success)                       # proportion of successful replicates
## [1] 1
bigres <- do.call(cbind, res[success])
(ci <- t(apply(bigres, 1, quantile, probs = c(0.025, 0.975))))
## 2.5% 97.5%
## (Intercept) 1.95546640 2.18383246
## treatmentNewdrug -0.23452747 0.23766303
## week -0.04763396 -0.03744964
## treatmentNewdrug:week 0.01770038 0.10827006
## patient_id.(Intercept) 1.49027708 5.76272654
Here is the summarized result of the bootstrapped models:
finaltable <- cbind(Est = c(f, r), SE = c(se, NA), BootMean = rowMeans(bigres),ci)
round(finaltable,5)
## Est SE BootMean 2.5% 97.5%
## (Intercept) 2.0536 0.07844 2.05972 1.95547 2.18383
## treatmentNewdrug -0.0130 0.11094 -0.01723 -0.23453 0.23766
## week -0.0422 0.00992 -0.04234 -0.04763 -0.03745
## treatmentNewdrug:week 0.0770 0.01403 0.07623 0.01770 0.10827
## patient_id.(Intercept) 2.2700 NA 3.17370 1.49028 5.76273
This table presents the point estimates and bootstrap confidence intervals of each coefficient in our model.
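As an optional extension (our assumption, not in the original chapter), approximate two-sided bootstrap p-values for the fixed effects can be obtained from the share of resampled coefficients falling on either side of zero:
# two-sided percentile p-value: twice the smaller tail proportion
boot_p <- apply(bigres[1:4, ], 1,
                function(x) 2 * min(mean(x < 0), mean(x > 0)))
round(boot_p, 4)  # a value of 0 means p < 1/1000 at this resolution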
Figure 3A: Evolution of the coefficients through the 1000 iterations
bootdf<-bigres%>%as_tibble()%>%mutate(Feature=row.names(bigres))%>%gather(V1:V1000,key="Iteration",value="Coefficient")%>%separate(Iteration, into = c("X","Iteration"),sep="V")%>%select(.,c(1,3,4))
bootdf%>%ggplot(aes(x=as.numeric(Iteration),y=Coefficient,group=1))+geom_path(aes(color=Feature))+theme_bw()+facet_wrap(~Feature,ncol=1,scales="free")
Figure 3B: Distribution of the coefficients based on 1,000 bootstrap iterations
bootdf%>%ggplot(aes(x=Coefficient))+geom_histogram(aes(fill=Feature),color="black",alpha=0.6)+theme_bw()+facet_wrap(~Feature,ncol=2,scales="free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These distributions correspond to what we saw in the summary table: the confidence intervals of the model's coefficients.
BONUS: this graph shows the marginal effects of our mixed model:
preddf<-df%>%mutate(Predicted=predict(m1,.))
preddf%>%ggplot(aes(x=week,y=Predicted,group=treatment))+geom_point(aes(color=treatment),size=3,alpha=0.5)+stat_summary(fun.y = median, fun.ymin = min, fun.ymax = max,aes(color=treatment),size=1)+stat_summary(aes(fill=treatment),fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,color="black",size=5,geom ="point")+stat_summary(fun.y="median",geom="line",aes(color=treatment),size=1)+facet_wrap(~treatment,ncol=1,scales="free")+theme_bw()
Discussion: In the original chapter, the authors only performed what we would call "automated model construction" via IBM SPSS Modeler; they did not apply any Machine Learning technique (neither resampling nor testing). Our study successfully extended the basic model by applying both bootstrap and k-fold cross-validation to confirm its prediction accuracy.
Conclusion: Generalized linear mixed models can handle both fixed and random effects and are therefore appropriate for analysing data with repeated measures.
END