Case study N7: Machine learning applied to Weighted Poisson Regression

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was initiated in 2016 and aims to:

1. Encourage the use of Machine Learning techniques in medical research in Vietnam;

2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Background

The 7th case study (Chapter 6 - Machine Learning in Medicine-Cookbook 1 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) aims to explore the weighted Poisson model. In this longitudinal follow-up study, 50 patients on two parallel treatment modalities were followed for the number of episodes of paroxysmal atrial fibrillation (PAF). The main study question is how treatment modality and the psychological and social scores affect the PAF occurrence rate. As this is an interpretive question, simple and conventional methods such as generalized linear regression would be adopted.

In the present document, we would like to show the utility of Machine learning techniques (feature selection and validation by bootstrapping) for extending the information provided by simple linear models.

Materials and method

The original dataset was provided in Chapter 6, Machine Learning in Medicine-Cookbook 1 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in SPSS SAV format (generalizedlmeventrates.sav) from their website: extras.springer.com.

Data analysis was performed in the R statistical programming language. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our Machine learning experiment, including feature selection and resampling by bootstrap.

First, we performed a weighted Poisson regression analysis using the basic glm() function in R. In the second step, a Machine learning approach was applied to the same regression task. Our analysis consists of: 1) using permutation importance and univariate model scores to evaluate the contribution of each predictor to the model's outcome, and 2) resampling the weighted Poisson regression algorithm by bootstrap, then extracting the models' performance metrics and coefficients. These data will extend our interpretation of the predictors' effects.

First, the following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis), and we will customize the aesthetic settings for ggplot2.
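If any of these packages are missing, they can first be installed from CRAN (a convenience step we add here; psych and knitr are also used later for Table 1 and Table 2):

install.packages(c("plyr","tidyverse","foreign","ggfortify","mlr","gridExtra","psych","knitr"))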

library(plyr)
library(tidyverse)
library(foreign)
library(ggfortify)
library(mlr)
library(gridExtra)

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 11),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#fffece"),
      strip.background = element_rect(fill = "#ffb20c", color = "black", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

myfillcolors=c("#aaff00","#ffbb00","#ff300c","#ff0c4c","#ee0cff")
mycolors=c("#28b201","#ff8300","#db2202","#d1063b","#9e00aa")

Results

Step 0) Data loading and reshaping

require(foreign)

df=read.spss("generalizedlmeventrates.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()

df$treatment<-df$treat%>%as.factor()%>%recode_factor(.,`0` = "Mode1", `1` = "Mode2")

df<-df%>%dplyr::rename(.,psyScore=psych,socScore=soc,duration=days,nPAF=paf)%>%.[,-1]

df
## # A tibble: 50 × 5
##     psyScore socScore duration  nPAF treatment
##        <dbl>    <dbl>    <dbl> <dbl>    <fctr>
## 1  56.988831 42.45086       73     4     Mode2
## 2  37.094158 46.82059       73     4     Mode2
## 3  32.275455 43.56657       76     2     Mode1
## 4  29.056717 43.56657       74     3     Mode1
## 5   6.748048 27.24847       73     3     Mode1
## 6  61.654282 48.41482       62    13     Mode1
## 7  56.988831 40.73543       66    11     Mode1
## 8  10.390487 15.35938       72     7     Mode2
## 9  50.527950 52.11514       63    10     Mode2
## 10 49.472050 42.45086       68     9     Mode2
## # ... with 40 more rows

Once the above commands are executed, we get a clean data frame that will be used throughout our analysis.

Step 1) Data exploration

Table 1: Patients' characteristics in the 2 treatment subgroups

psych::describeBy(df[,-5],group=df$treatment)
## $Mode1
##          vars  n  mean    sd median trimmed   mad   min   max range  skew
## psyScore    1 21 41.89 14.54  40.15   41.76 10.58  6.75 68.49 61.74  0.04
## socScore    2 21 44.82 10.21  43.57   44.34  8.76 27.25 65.56 38.31  0.44
## duration    3 21 68.10  9.38  72.00   69.59  4.45 46.00 76.00 30.00 -1.07
## nPAF        4 21  8.43  8.89   4.00    7.18  4.45  0.00 28.00 28.00  0.97
##          kurtosis   se
## psyScore    -0.01 3.17
## socScore    -0.77 2.23
## duration    -0.30 2.05
## nPAF        -0.49 1.94
## 
## $Mode2
##          vars  n  mean    sd median trimmed   mad   min   max range  skew
## psyScore    1 29 42.31 20.72  44.66   43.13 18.37  1.01 75.83 74.82 -0.43
## socScore    2 29 40.22 19.86  46.29   41.11 13.42  1.01 71.83 70.82 -0.68
## duration    3 29 70.62  8.97  73.00   72.28  4.45 32.00 77.00 45.00 -2.86
## nPAF        4 29  4.45  5.05   3.00    3.80  4.45  0.00 24.00 24.00  1.96
##          kurtosis   se
## psyScore    -0.86 3.85
## socScore    -0.72 3.69
## duration     9.28 1.66
## nPAF         5.05 0.94
## 
## attr(,"call")
## by.data.frame(data = x, INDICES = group, FUN = describe, type = type)

Figure 1: Distribution of the outcome and the 3 numeric predictors

df%>%gather(psyScore:nPAF,key="Para",value="value")%>%ggplot(aes(x=value,fill=Para,color=Para))+geom_histogram(alpha=0.7)+facet_wrap(~Para,scales="free")+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

As the main outcome (the PAF occurrence rate, nPAF) is count data, the Poisson distribution family would be more appropriate for our model. Both psyScore and socScore are normally distributed, thus no data transformation is required. The graph also suggests that nPAF is inversely proportional to the follow-up duration.
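As a quick numeric complement to the visual normality check (our addition, not part of the original case study), a Shapiro-Wilk test can be run on both scores:

# Shapiro-Wilk test: a p-value > 0.05 is consistent with normality
shapiro.test(df$psyScore)
shapiro.test(df$socScore)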

Figure 2: Relationships between the outcome and the 4 predictors

g1=df%>%ggplot(aes(x=nPAF,fill=treatment))+geom_density(alpha=0.5,color="black")+scale_fill_manual(values=c("purple","red"))
g2=df%>%ggplot(aes(x=psyScore,y=nPAF))+geom_point(shape=21,size=3,color="black",fill="purple",alpha=0.5)+geom_smooth(fill="violet",color="#630091",alpha=0.4,method="lm")
g3=df%>%ggplot(aes(x=socScore,y=nPAF))+geom_point(shape=21,size=3,color="black",fill="red",alpha=0.5)+geom_smooth(fill="red",color="#91001f",alpha=0.4,method="lm")
g4=df%>%ggplot(aes(x=1/duration,y=nPAF))+geom_point(shape=21,size=5,color="black",fill="#ff7f00",alpha=0.5)+geom_smooth(fill="gold",color="orangered",alpha=0.4,method="lm")

library(gridExtra)
grid.arrange(g1,g2,g3,g4,ncol=2)

Figure 2 indicates that:

  1. nPAF could be described by a Poisson linear regression,
  2. there is a marked contrast between the two treatment modalities,
  3. the raw nPAF rate shows only a weak linear relationship with psyScore or socScore,
  4. raw nPAF is inversely proportional to duration; therefore, the model's outcome should be weighted by duration (a longer follow-up duration results in a higher risk of PAF occurrence).

Consequently, the appropriate algorithm for our model would be a Poisson GLM weighted by duration, with a logarithmic link function.

Step 2) Basic regression analysis using the glm() function

mod=glm(data=df,formula=nPAF~treatment+psyScore+socScore,family="poisson",weights =duration)
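For clarity (a note we add here): the logarithmic link is the default for the Poisson family in R's glm(), so an equivalent, fully explicit call would be:

# Equivalent explicit specification of the link function
# (mod_explicit is a hypothetical name; it fits the same model as mod above)
mod_explicit=glm(data=df,formula=nPAF~treatment+psyScore+socScore,family=poisson(link="log"),weights=duration)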

summary(mod)
## 
## Call:
## glm(formula = nPAF ~ treatment + psyScore + socScore, family = "poisson", 
##     data = df, weights = duration)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -33.465  -18.941   -5.561    7.328   43.936  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.5357136  0.0224891  112.75   <2e-16 ***
## treatmentMode2 -0.6674532  0.0153228  -43.56   <2e-16 ***
## psyScore        0.0064416  0.0005857   11.00   <2e-16 ***
## socScore       -0.0187113  0.0006494  -28.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22412  on 49  degrees of freedom
## Residual deviance: 19955  on 46  degrees of freedom
## AIC: 29477
## 
## Number of Fisher Scoring iterations: 6
library(ggfortify)

autoplot(mod, which = 1:6, label.size = 1,data = df,colour = 'treatment',ncol=3)+scale_color_manual(values=c("#ffbb00","#ff0c4c"))

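Because the model uses a log link, exponentiated coefficients can be read as incidence rate ratios. As an optional check (our addition), profile-likelihood confidence intervals from this single fit can be exponentiated for later comparison with the bootstrap intervals:

# Incidence rate ratios with 95% profile-likelihood confidence intervals
exp(cbind(IRR=coef(mod),confint(mod)))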

Step 3) Machine learning experiment

Our interpretive analysis would end here, as the model indicates significant effects of all 3 predictors. However, we will extend it with a Machine learning experiment, as follows:

library(mlr)

task= makeRegrTask(id = "PAF", data=df[,-3],target = "nPAF",weights = df$duration)

task
## Supervised task: PAF
## Type: regr
## Target: nPAF
## Observations: 50
## Features:
## numerics  factors  ordered 
##        2        1        0 
## Missings: FALSE
## Has weights: TRUE
## Has blocking: FALSE
lsw=makeLearner("regr.glm",family="poisson")

lsw
## Learner regr.glm from package stats
## Type: regr
## Name: Generalized Linear Regression; Short name: glm
## Class: regr.glm
## Properties: numerics,factors,se,weights
## Predict-Type: response
## Hyperparameters: family=poisson

We have just created a supervised regression task in which each observation is weighted by its follow-up duration, and assigned a regression learner (Poisson GLM) to it.
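As an optional sanity check before the feature evaluation (our addition, using standard mlr functions), the learner can be trained once on the full task and its apparent performance inspected:

# Train on the whole task and compute the resubstitution error
model=train(lsw,task)
pred=predict(model,task=task)
performance(pred,measures=list(mae,rmse))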

Our first experiment consists of evaluating the role of each feature in the model. Two methods will be applied: a) permutation importance and b) univariate model scoring.

Figure 3: Contribution of Treatment modality, PsyScore and SocScore to the outcome prediction

pmi=generateFilterValuesData(task,method="permutation.importance",imp.learner=lsw)%>%.$data%>%ggplot(aes(x=reorder(name,permutation.importance),y=permutation.importance,fill=reorder(name,permutation.importance)))+geom_bar(stat="identity")+scale_fill_manual(values=c("#ff7700","#ff3b00","#d30034"),name="permut.Importance")+scale_x_discrete("Features")

uni=generateFilterValuesData(task,method="univariate.model.score",learner=lsw)%>%.$data%>%ggplot(aes(x=reorder(name,univariate.model.score),y=univariate.model.score,fill=reorder(name,univariate.model.score)))+geom_bar(stat="identity")+scale_fill_manual(values=c("#ff7700","#ff3b00","#d30034"),name="Univar Score")+scale_x_discrete("Features")

grid.arrange(pmi,uni,ncol=1)

These results indicate that treatment modality contributed the most to the nPAF prediction, while both psyScore and socScore are also important for our model.
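For intuition about what the permutation importance filter measures, here is a minimal hand-rolled sketch (our illustration of the general idea, not mlr's exact internals): permute one predictor, re-predict with the glm fit from Step 2, and record the resulting increase in MSE.

# Hand-rolled permutation importance: a larger MSE increase = a more important feature
perm_imp=function(feature,fit=mod,data=df,n=50){
  base=mean((data$nPAF-predict(fit,newdata=data,type="response"))^2)
  mean(replicate(n,{
    shuffled=data
    shuffled[[feature]]=sample(shuffled[[feature]]) # permute one feature only
    mean((data$nPAF-predict(fit,newdata=shuffled,type="response"))^2)-base
  }))
}
sapply(c("treatment","psyScore","socScore"),perm_imp)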

In the next step, we will perform a resampling process that aims to: 1) evaluate the plausibility of our weighted Poisson model on randomized samples, and 2) generate "true confidence intervals" for all of the model's coefficients, which can extend the model's information as well as our interpretive analysis.

The resampling process consists of 500 bootstrap iterations. The model's performance is evaluated using the following metrics (a hand-written sketch of their definitions follows the list):

  1. expvar: explained variance; similar to rsq (R-squared), defined as explained sum of squares / total sum of squares

  2. mae: mean absolute error

  3. mse: mean squared error

  4. rmse: root mean squared error
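For reference, the sketch below writes these metrics out by hand, assuming y is the vector of observed outcomes and yhat the model predictions (hypothetical names, for illustration only):

# Hand-written definitions of the four performance metrics
# mae    = mean(abs(y-yhat))                          # mean absolute error
# mse    = mean((y-yhat)^2)                           # mean squared error
# rmse   = sqrt(mean((y-yhat)^2))                     # root of mse
# expvar = sum((yhat-mean(y))^2)/sum((y-mean(y))^2)   # explained SS / total SS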

rdesc = makeResampleDesc("Bootstrap",iters=500L) # Setting the resampling mode

r=resample(lsw,task,measures=list(expvar,mae,rmse,mse),rdesc,models=TRUE) # Bootstrapping

r$aggr
## expvar.test.mean    mae.test.mean   rmse.test.rmse    mse.test.mean 
##        0.3573301        5.4609594        7.5827400       57.4979452

Now, we will extract these metrics, as well as the coefficients of the 3 predictors, from each of the 500 model versions. To facilitate interpretation, we transform each raw coefficient into an "incidence rate ratio change" = exp(coef) - 1, which indicates by how many percent the incidence rate changes when the predictor increases by 1 unit (or relative to the reference level, for factors).
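As a worked example using the coefficient already shown in the Step 2 summary:

# treatmentMode2 coefficient from the Step 2 summary
exp(-0.6674532)-1 # = -0.487, i.e. about a 49% lower incidence rate under Mode2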

# Append empty columns that will hold the coefficients (as IRR changes) and the AIC
resdf=r$measures.test%>%as_tibble()%>%mutate(Intercept=rep(0,nrow(.)),psyScore=rep(0,nrow(.)),socScore=rep(0,nrow(.)),Treatment=rep(0,nrow(.)),AIC=rep(0,nrow(.)))

# Extract the coefficients and AIC from each of the 500 fitted models,
# converting every coefficient to an incidence rate ratio change: exp(coef)-1
for(i in 1:nrow(resdf)){
  coef=r%>%.$models%>%.[[i]]%>%.$learner.model%>%.$coefficients
  aic=r%>%.$models%>%.[[i]]%>%.$learner.model%>%.$aic
  resdf$Intercept[i]=exp(coef[1])-1
  resdf$psyScore[i]=exp(coef[2])-1
  resdf$socScore[i]=exp(coef[3])-1
  resdf$Treatment[i]=exp(coef[4])-1
  resdf$AIC[i]=aic
}

resdf
## # A tibble: 500 × 10
##     iter     expvar      mae      rmse       mse Intercept      psyScore
##    <int>      <dbl>    <dbl>     <dbl>     <dbl>     <dbl>         <dbl>
## 1      1 0.18251901 5.047602  6.059870  36.72202 11.969352  0.0008358621
## 2      2 0.10365133 5.456580  8.216749  67.51497 14.414219  0.0085397792
## 3      3 0.18840238 7.623195 10.857895 117.89388  7.256179  0.0062565183
## 4      4 0.04105526 3.950630  5.908107  34.90573  9.485573  0.0042537831
## 5      5 0.43105717 6.570771  9.480783  89.88524 24.529661  0.0019362872
## 6      6 0.07730674 5.807934  8.641543  74.67626  7.756539  0.0175299458
## 7      7 0.05491187 4.761583  6.817447  46.47758  8.880637  0.0074016623
## 8      8 0.78856033 5.067366  5.887127  34.65826 12.087455  0.0043400020
## 9      9 0.14960087 5.255882  7.030975  49.43460 17.241295 -0.0013190950
## 10    10 0.20898559 6.088403  9.648933  93.10191  8.531787  0.0029869625
## # ... with 490 more rows, and 3 more variables: socScore <dbl>,
## #   Treatment <dbl>, AIC <dbl>
PsyScore=quantile(resdf$psyScore,c(.025, .5, .975))
SocScore=quantile(resdf$socScore,c(.025, .5, .975))
Treatment=quantile(resdf$Treatment,c(.025, .5, .975))
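The same percentile intervals can be obtained in a single call (equivalent, shown for reference):

# Equivalent one-liner for the three percentile intervals
sapply(resdf[,c("psyScore","socScore","Treatment")],quantile,probs=c(.025,.5,.975))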

Table 2: Bootstrap confidence intervals of the incidence rate ratio changes

rbind(PsyScore,SocScore,Treatment)%>%knitr::kable()
|          |       2.5%|        50%|      97.5%|
|:---------|----------:|----------:|----------:|
|PsyScore  | -0.0186795|  0.0063896|  0.0313169|
|SocScore  | -0.0437772| -0.0185022|  0.0079325|
|Treatment | -0.7471590| -0.4913018| -0.1094755|
IRRchange=resdf[,c(1,7:9)]%>%gather(psyScore:Treatment,key="IRR",value="Change")

Perfmdf=resdf[,c(1:5,10)]%>%gather(expvar:AIC,key="Metric",value="Value")

Figure 4A: Evolution of IRR changes through 500 iterations

IRRchange%>%ggplot(aes(x=iter,y=Change))+geom_path(aes(color=IRR),size=0.8)+facet_wrap(~IRR,ncol=1,scales="free")+scale_color_manual(values=c("#ff7700","#d30034","#6b0096"))+geom_hline(yintercept=0,color="blue",linetype="dashed",size=1)

Figure 4B: Bootstrap confidence intervals of the incidence rate ratio changes

IRRchange%>%ggplot(aes(x=Change,fill=IRR,color=IRR))+geom_histogram(alpha=0.7)+facet_wrap(~IRR,ncol=2,scales="free")+scale_fill_manual(values=c("#ff7700","#d30034","#6b0096"))+scale_color_manual(values=c("#ff7700","#d30034","#6b0096"))+geom_vline(xintercept=0,color="blue4",linetype="dashed",size=1)

Figure 5A: Evolution of model’s performance metrics through 500 iterations

Perfmdf%>%ggplot(aes(x=iter,y=Value))+geom_path(aes(color=Metric),size=0.8)+facet_wrap(~Metric,ncol=2,scales="free")+scale_color_manual(values=mycolors)

Figure 5B: Bootstrap confidence intervals of the performance metrics

Perfmdf%>%ggplot(aes(x=Value,fill=Metric,color=Metric))+geom_histogram(alpha=0.7)+facet_wrap(~Metric,ncol=2,scales="free")+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

The bootstrap results indicate that:

  1. Treatment mode 2 would decrease the incidence rate of PAF by about 49% (95% CI: -74.7% to -10.9%, from Table 2) compared to treatment mode 1.

  2. The occurrence rate of PAF would increase by about 0.64% for each one-unit increase of the Psychological score. However, this effect is not statistically significant, as the bootstrap CI (-1.87% to +3.13%) contains zero.

  3. A one-unit increase of the Social score would decrease the PAF occurrence rate by about 1.85% (95% CI: -4.38% to +0.79%); note that this interval also contains zero.

  4. Although the model's accuracy was poor (the predictors could explain only about 35.7% of the total outcome variance, and the model's error was high), it might still be helpful for understanding the partial effect of each predictor on the PAF occurrence rate.

Conclusion: Our experiment suggests that even for an interpretive study, Machine learning techniques such as resampling and feature selection can be helpful. These techniques provide valuable information, such as the model's average accuracy and confidence intervals for each feature's effect. We strongly recommend this approach in routine data analysis.

END