Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam.
2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Background
The 7th case study (Chapter 6 - Machine Learning in Medicine-Cookbook 1 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) explores the weighted Poisson model. In this longitudinal follow-up study, 50 patients were followed for the number of episodes of paroxysmal atrial fibrillation (PAF) while on two parallel treatment modalities. The main study question is how treatment modality and the Psychological and Social scores affect the PAF occurrence rate. As this is an interpretive question, simple and conventional methods such as generalized linear regression would be adopted.
In the present document, we would like to show the utility of Machine learning techniques (feature selection and validation by bootstrapping) for extending the information provided by simple linear models.
Materials and method
The original dataset was provided in Chapter 6, Machine Learning in Medicine-Cookbook 1 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). The original data in SPSS SAV format (generalizedlmeventrates.sav) can be downloaded from the publisher's website: extras.springer.com.
Data analysis was performed in the R statistical programming language. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our Machine learning experiment, including feature selection and resampling by bootstrap.
First, we performed a weighted Poisson regression analysis using the basic glm() function in R. In the second step, a Machine learning approach was applied to the same regression task. Our analysis consists of: 1) using permutation importance and univariate model scores to evaluate the contribution of each predictor to the model's outcome, and 2) resampling the weighted Poisson regression algorithm by bootstrap, then extracting the model's performance metrics and coefficients. These data will extend our interpretation of the predictors' effects.
First, the following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis), and we will customize the aesthetics for ggplot2:
library(plyr)
library(tidyverse)
library(foreign)
library(ggfortify)
library(mlr)
library(gridExtra)
my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 11),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#fffece"),
      strip.background = element_rect(fill = "#ffb20c", color = "black", size = 0.5),
      strip.text = element_text(face = "bold", size = 10, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
myfillcolors=c("#aaff00","#ffbb00","#ff300c","#ff0c4c","#ee0cff")
mycolors=c("#28b201","#ff8300","#db2202","#d1063b","#9e00aa")
Results
Step 0: Data loading and reshaping
require(foreign)
# Load the SPSS file and convert it to a tibble
df = read.spss("generalizedlmeventrates.sav", use.value.labels = TRUE, to.data.frame = TRUE) %>% as_tibble()
# Recode the binary treatment indicator as a labelled factor
df$treatment <- df$treat %>% as.factor() %>% recode_factor(., `0` = "Mode1", `1` = "Mode2")
# Rename the variables for readability and drop the original treat column
df <- df %>% dplyr::rename(., psyScore = psych, socScore = soc, duration = days, nPAF = paf) %>% .[, -1]
df
## # A tibble: 50 × 5
## psyScore socScore duration nPAF treatment
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 56.988831 42.45086 73 4 Mode2
## 2 37.094158 46.82059 73 4 Mode2
## 3 32.275455 43.56657 76 2 Mode1
## 4 29.056717 43.56657 74 3 Mode1
## 5 6.748048 27.24847 73 3 Mode1
## 6 61.654282 48.41482 62 13 Mode1
## 7 56.988831 40.73543 66 11 Mode1
## 8 10.390487 15.35938 72 7 Mode2
## 9 50.527950 52.11514 63 10 Mode2
## 10 49.472050 42.45086 68 9 Mode2
## # ... with 40 more rows
Once the above commands are executed, we get a clean data frame that will be used throughout our analysis.
Step 1: Data exploration
Table 1: Patients' characteristics in the 2 treatment subgroups
psych::describeBy(df[,-5],group=df$treatment)
## $Mode1
## vars n mean sd median trimmed mad min max range skew
## psyScore 1 21 41.89 14.54 40.15 41.76 10.58 6.75 68.49 61.74 0.04
## socScore 2 21 44.82 10.21 43.57 44.34 8.76 27.25 65.56 38.31 0.44
## duration 3 21 68.10 9.38 72.00 69.59 4.45 46.00 76.00 30.00 -1.07
## nPAF 4 21 8.43 8.89 4.00 7.18 4.45 0.00 28.00 28.00 0.97
## kurtosis se
## psyScore -0.01 3.17
## socScore -0.77 2.23
## duration -0.30 2.05
## nPAF -0.49 1.94
##
## $Mode2
## vars n mean sd median trimmed mad min max range skew
## psyScore 1 29 42.31 20.72 44.66 43.13 18.37 1.01 75.83 74.82 -0.43
## socScore 2 29 40.22 19.86 46.29 41.11 13.42 1.01 71.83 70.82 -0.68
## duration 3 29 70.62 8.97 73.00 72.28 4.45 32.00 77.00 45.00 -2.86
## nPAF 4 29 4.45 5.05 3.00 3.80 4.45 0.00 24.00 24.00 1.96
## kurtosis se
## psyScore -0.86 3.85
## socScore -0.72 3.69
## duration 9.28 1.66
## nPAF 5.05 0.94
##
## attr(,"call")
## by.data.frame(data = x, INDICES = group, FUN = describe, type = type)
Figure 1: Distributions of the outcome and the predictors
df%>%gather(psyScore:nPAF,key="Para",value="value")%>%ggplot(aes(x=value,fill=Para,color=Para))+geom_histogram(alpha=0.7)+facet_wrap(~Para,scales="free")+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)
As the main outcome (the number of PAF episodes, nPAF) is count data, the Poisson distribution family is appropriate for our model. Both psyScore and socScore are approximately normally distributed, so no data transformation is required. The graphs also suggest that nPAF is inversely related to the follow-up duration.
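A quick numeric cross-check of these visual impressions is sketched below; the Shapiro-Wilk test is only one of several possible normality checks, and its p-values should be read with the small sample size in mind:
shapiro.test(df$psyScore) # a large p-value is consistent with approximate normality
shapiro.test(df$socScore)
table(df$nPAF) # nPAF takes non-negative integer values, i.e. count data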
Figure 2: Linear relationship between the outcome and 4 predictors
g1=df%>%ggplot(aes(x=nPAF,fill=treatment))+geom_density(alpha=0.5,color="black")+scale_fill_manual(values=c("purple","red"))
g2=df%>%ggplot(aes(x=psyScore,y=nPAF))+geom_point(shape=21,size=3,color="black",fill="purple",alpha=0.5)+geom_smooth(fill="violet",color="#630091",alpha=0.4,method="lm")
g3=df%>%ggplot(aes(x=socScore,y=nPAF))+geom_point(shape=21,size=3,color="black",fill="red",alpha=0.5)+geom_smooth(fill="red",color="#91001f",alpha=0.4,method="lm")
g4=df%>%ggplot(aes(x=1/duration,y=nPAF))+geom_point(shape=21,size=5,color="black",fill="#ff7f00",alpha=0.5)+geom_smooth(fill="gold",color="orangered",alpha=0.4,method="lm")
library(gridExtra)
grid.arrange(g1,g2,g3,g4,ncol=2)
Figure 2 indicates that the distribution of nPAF differs between the two treatment modalities, and that nPAF varies approximately linearly with psyScore, socScore and 1/duration. Consequently, the appropriate algorithm for our model is a Poisson GLM with a log link function, weighted by duration.
Step 2: Basic regression analysis using the glm() function
# Poisson GLM with follow-up duration as observation weights
mod = glm(data = df, formula = nPAF ~ treatment + psyScore + socScore, family = "poisson", weights = duration)
summary(mod)
##
## Call:
## glm(formula = nPAF ~ treatment + psyScore + socScore, family = "poisson",
## data = df, weights = duration)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -33.465 -18.941 -5.561 7.328 43.936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5357136 0.0224891 112.75 <2e-16 ***
## treatmentMode2 -0.6674532 0.0153228 -43.56 <2e-16 ***
## psyScore 0.0064416 0.0005857 11.00 <2e-16 ***
## socScore -0.0187113 0.0006494 -28.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22412 on 49 degrees of freedom
## Residual deviance: 19955 on 46 degrees of freedom
## AIC: 29477
##
## Number of Fisher Scoring iterations: 6
library(ggfortify)
autoplot(mod, which = 1:6, label.size = 1,data = df,colour = 'treatment',ncol=3)+scale_color_manual(values=c("#ffbb00","#ff0c4c"))
Step 3: Machine learning experiment
Our interpretive analysis could end here, as the model indicates significant effects of all 3 predictors. However, we will extend it with a Machine learning experiment, as follows:
library(mlr)
# duration (column 3) is removed from the features and used as observation weights
task = makeRegrTask(id = "PAF", data = df[, -3], target = "nPAF", weights = df$duration)
task
## Supervised task: PAF
## Type: regr
## Target: nPAF
## Observations: 50
## Features:
## numerics factors ordered
## 2 1 0
## Missings: FALSE
## Has weights: TRUE
## Has blocking: FALSE
lsw=makeLearner("regr.glm",family="poisson")
lsw
## Learner regr.glm from package stats
## Type: regr
## Name: Generalized Linear Regression; Short name: glm
## Class: regr.glm
## Properties: numerics,factors,se,weights
## Predict-Type: response
## Hyperparameters: family=poisson
We just created a supervised regression task in which each observation is weighted by its follow-up duration, and assigned a regression learner (a Poisson GLM) to it.
Our first experiment consists of evaluating the role of each feature in the model. Two methods will be applied: a) permutation importance and b) univariate model scoring.
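The sketch below illustrates the idea behind permutation importance by hand, using the glm fit mod from Step 2; we assume MSE as the error measure and a single permutation per feature, and mlr's implementation may differ in such details:
set.seed(2016)
base_mse <- mean((df$nPAF - predict(mod, newdata = df, type = "response"))^2)
perm_imp <- sapply(c("treatment", "psyScore", "socScore"), function(v) {
  df_perm <- df
  df_perm[[v]] <- sample(df_perm[[v]]) # shuffle one feature, keep the others intact
  perm_mse <- mean((df$nPAF - predict(mod, newdata = df_perm, type = "response"))^2)
  perm_mse - base_mse # increase in error = importance of that feature
})
sort(perm_imp, decreasing = TRUE)
A feature whose permutation barely increases the error contributes little to the prediction.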
Figure 3: Contribution of Treatment modality, PsyScore and SocScore to the outcome prediction
pmi=generateFilterValuesData(task,method="permutation.importance",imp.learner=lsw)%>%.$data%>%ggplot(aes(x=reorder(name,permutation.importance),y=permutation.importance,fill=reorder(name,permutation.importance)))+geom_bar(stat="identity")+scale_fill_manual(values=c("#ff7700","#ff3b00","#d30034"),name="permut.Importance")+scale_x_discrete("Features")
uni=generateFilterValuesData(task,method="univariate.model.score",learner=lsw)%>%.$data%>%ggplot(aes(x=reorder(name,univariate.model.score),y=univariate.model.score,fill=reorder(name,univariate.model.score)))+geom_bar(stat="identity")+scale_fill_manual(values=c("#ff7700","#ff3b00","#d30034"),name="Univar Score")+scale_x_discrete("Features")
grid.arrange(pmi,uni,ncol=1)
These results indicate that treatment modality contributes the most to the nPAF prediction, while both psyScore and socScore are also important for our model.
In the next step, we will perform a resampling process that aims to: 1) evaluate the plausibility of our weighted Poisson model across randomized samples, and 2) extract the model's performance metrics and coefficients from every resampled fit.
The resampling process consists of 500 bootstrap iterations. The model's performance is evaluated using the following metrics (a hand-computed sketch follows the list):
expvar: explained variance; similar to rsq (R-squared), defined as explained sum of squares / total sum of squares
mae: mean absolute error
mse: mean squared error
rmse: root mean squared error
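As a reference, the four metrics could be computed by hand as follows; this is a sketch on the training data using the glm fit mod from Step 2, with expvar following the definition given above (mlr computes the same quantities on the bootstrap test sets instead):
y <- df$nPAF
p <- predict(mod, type = "response")
expvar <- sum((p - mean(y))^2) / sum((y - mean(y))^2) # explained SS / total SS
mae <- mean(abs(y - p)) # mean absolute error
mse <- mean((y - p)^2) # mean squared error
rmse <- sqrt(mse) # root mean squared error
c(expvar = expvar, mae = mae, mse = mse, rmse = rmse)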
rdesc = makeResampleDesc("Bootstrap", iters = 500L) # Set up the resampling scheme: 500 bootstrap iterations
r = resample(lsw, task, measures = list(expvar, mae, rmse, mse), rdesc, models = TRUE) # Bootstrapping; keep every fitted model
r$aggr
## expvar.test.mean mae.test.mean rmse.test.rmse mse.test.mean
## 0.3573301 5.4609594 7.5827400 57.4979452
Now we will extract these metrics, together with the coefficients of the 3 predictors, from each of the 500 fitted models. To facilitate interpretation, we transform the raw coefficients into an "incidence rate ratio change" = exp(coef) - 1, which indicates by how many percent the incidence rate changes when the predictor increases by 1 unit, relative to the reference level.
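As a quick illustration of this transformation, we can apply it to the single glm fit mod from Step 2 (a sanity check, not part of the bootstrap loop):
exp(coef(mod)) - 1 # e.g. treatmentMode2: exp(-0.667) - 1 = -0.49, i.e. a ~49% lower PAF rate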
resdf = r$measures.test %>% as_tibble() %>%
  mutate(Intercept = rep(0, nrow(.)), psyScore = rep(0, nrow(.)), socScore = rep(0, nrow(.)), Treatment = rep(0, nrow(.)), AIC = rep(0, nrow(.)))
# For each of the 500 bootstrap models, extract the coefficients and the AIC;
# each coefficient is converted into an incidence rate ratio change: exp(coef) - 1
for (i in 1:nrow(resdf)) {
  coef = r %>% .$models %>% .[[i]] %>% .$learner.model %>% .$coefficients
  aic = r %>% .$models %>% .[[i]] %>% .$learner.model %>% .$aic
  resdf$Intercept[i] = exp(coef[1]) - 1
  resdf$psyScore[i] = exp(coef[2]) - 1
  resdf$socScore[i] = exp(coef[3]) - 1
  resdf$Treatment[i] = exp(coef[4]) - 1
  resdf$AIC[i] = aic
}
resdf
## # A tibble: 500 × 10
## iter expvar mae rmse mse Intercept psyScore
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.18251901 5.047602 6.059870 36.72202 11.969352 0.0008358621
## 2 2 0.10365133 5.456580 8.216749 67.51497 14.414219 0.0085397792
## 3 3 0.18840238 7.623195 10.857895 117.89388 7.256179 0.0062565183
## 4 4 0.04105526 3.950630 5.908107 34.90573 9.485573 0.0042537831
## 5 5 0.43105717 6.570771 9.480783 89.88524 24.529661 0.0019362872
## 6 6 0.07730674 5.807934 8.641543 74.67626 7.756539 0.0175299458
## 7 7 0.05491187 4.761583 6.817447 46.47758 8.880637 0.0074016623
## 8 8 0.78856033 5.067366 5.887127 34.65826 12.087455 0.0043400020
## 9 9 0.14960087 5.255882 7.030975 49.43460 17.241295 -0.0013190950
## 10 10 0.20898559 6.088403 9.648933 93.10191 8.531787 0.0029869625
## # ... with 490 more rows, and 3 more variables: socScore <dbl>,
## # Treatment <dbl>, AIC <dbl>
# 95% bootstrap percentile intervals (2.5th, 50th and 97.5th percentiles)
PsyScore = quantile(resdf$psyScore, c(.025, .5, .975))
SocScore = quantile(resdf$socScore, c(.025, .5, .975))
Treatment = quantile(resdf$Treatment, c(.025, .5, .975))
Table 2: Bootstrap confidence intervals of the incidence rate ratio changes
rbind(PsyScore,SocScore,Treatment)%>%knitr::kable()
|           |       2.5% |        50% |      97.5% |
|-----------|-----------:|-----------:|-----------:|
| PsyScore  | -0.0186795 |  0.0063896 |  0.0313169 |
| SocScore  | -0.0437772 | -0.0185022 |  0.0079325 |
| Treatment | -0.7471590 | -0.4913018 | -0.1094755 |
IRRchange=resdf[,c(1,7:9)]%>%gather(psyScore:Treatment,key="IRR",value="Change")
Perfmdf=resdf[,c(1:5,10)]%>%gather(expvar:AIC,key="Metric",value="Value")
Figure 4A: Evolution of IRR changes through 500 iterations
IRRchange%>%ggplot(aes(x=iter,y=Change))+geom_path(aes(color=IRR),size=0.8)+facet_wrap(~IRR,ncol=1,scales="free")+scale_color_manual(values=c("#ff7700","#d30034","#6b0096"))+geom_hline(yintercept=0,color="blue",linetype="dashed",size=1)
Figure 4B: Bootstrap confidence interval of incidence rate ratio changes
IRRchange%>%ggplot(aes(x=Change,fill=IRR,color=IRR))+geom_histogram(alpha=0.7)+facet_wrap(~IRR,ncol=2,scales="free")+scale_fill_manual(values=c("#ff7700","#d30034","#6b0096"))+scale_color_manual(values=c("#ff7700","#d30034","#6b0096"))+geom_vline(xintercept=0,color="blue4",linetype="dashed",size=1)
Figure 5A: Evolution of model’s performance metrics through 500 iterations
Perfmdf%>%ggplot(aes(x=iter,y=Value))+geom_path(aes(color=Metric),size=0.8)+facet_wrap(~Metric,ncol=2,scales="free")+scale_color_manual(values=mycolors)
Figure 5B: Bootstrap confidence interval of performance metrics
Perfmdf%>%ggplot(aes(x=Value,fill=Metric,color=Metric))+geom_histogram(alpha=0.7)+facet_wrap(~Metric,ncol=2,scales="free")+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)
The bootstrap results indicate that:
Treatment mode 2 would decrease the incidence rate of PAF by about 49% (95% CI: -74.7% to -10.9%) compared with treatment mode 1.
Each 1-unit increase in the Psychological score would increase the PAF occurrence rate by about 0.6%; however, this effect is not statistically significant, as the bootstrap CI (-1.9% to +3.1%) contains zero.
Each 1-unit increase in the Social score would decrease the PAF occurrence rate by about 1.9% (95% CI: -4.4% to +0.8%); this interval also crosses zero.
Although the model's accuracy was modest (on average, the predictors explained only about 35.7% of the total outcome variance and the model's errors were large), it might still be helpful for understanding the partial effect of each predictor on the PAF occurrence rate.
Conclusion: Our experiment suggests that even for an interpretive study, Machine learning techniques such as resampling and feature selection can be helpful. These techniques provide valuable information such as the model's average accuracy and confidence intervals for feature effects. We strongly recommend this approach in routine data analysis.
END