Libraries used for this analysis

library(plyr)             ####Data cleaning
library(dplyr)            ####Data cleaning
library(caret)            ####Model Training
library(doParallel)       ####Parallel Computing
library(pROC)             ####ROC Curves
library(DescTools)        ####Brier Score
library(ModelMetrics)     ####LogLoss
library(plotrix)          ####Standard errors (std.error)
library(ggpubr)           ####Arranging plots
library(DALEX)            ####Comparing models
library(ingredients)      ####Comparing models
library(plotly)           ####Advanced plotting
library(knitr)            ####Tables in Markdown

Dataset Preparation

First, a dataset is prepared. In this example I have selected a set of variables to predict an outcome (pulmonary complications). For simplicity I keep only the variables that will enter the model. I then check the recorded structure of each variable to ensure it has been coded appropriately, paying particular attention to factors and ordered/ordinal factors (e.g. ASA, pT), which must be handled differently.
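
A minimal sketch of that selection step, assuming the full dataset sits in a CSV (the file name and any additional columns are hypothetical):


###Illustrative only: read the raw data (hypothetical file name) and keep
###just the modelling variables listed below
pulmonarycomp <- read.csv('oesophagectomy_raw.csv')
pulmonarycomp <- dplyr::select(pulmonarycomp,
                               Age, PS, ASA, cT.Stage, cN.Stage,
                               Pre.Operative.Treatment, MI,
                               Chronic.Pulmonary.Disease,
                               Connective.Tissue.Disease, Anypulmonary)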

colnames(pulmonarycomp)
##  [1] "Age"                       "PS"                       
##  [3] "ASA"                       "cT.Stage"                 
##  [5] "cN.Stage"                  "Pre.Operative.Treatment"  
##  [7] "MI"                        "Chronic.Pulmonary.Disease"
##  [9] "Connective.Tissue.Disease" "Anypulmonary"
str(pulmonarycomp)
## 'data.frame':    406 obs. of  10 variables:
##  $ Age                      : num  61.3 71.3 70.8 54.4 53.3 ...
##  $ PS                       : num  1 1 1 2 1 1 1 2 0 0 ...
##  $ ASA                      : num  2 2 2 2 2 2 2 2 3 2 ...
##  $ cT.Stage                 : Factor w/ 5 levels "0","1","2","3",..: 4 4 4 5 4 4 3 4 4 4 ...
##  $ cN.Stage                 : Factor w/ 4 levels "0","1","2","3": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Pre.Operative.Treatment  : Factor w/ 3 levels "Chemo","CRT",..: 1 1 1 1 1 3 1 3 1 1 ...
##  $ MI                       : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
##  $ Chronic.Pulmonary.Disease: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Connective.Tissue.Disease: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Anypulmonary             : Factor w/ 2 levels "NO","YES": 2 1 2 2 2 1 2 1 2 2 ...
summary(pulmonarycomp$Anypulmonary)
##  NO YES 
## 257 149



Here PS, ASA, cT.Stage and cN.Stage should be recorded as ordinal factors, so they are converted. The factor levels are also checked, and the outcome factor, ‘Anypulmonary’, should be coded as a Yes/No variable, so its levels are relabelled.


pulmonarycomp$cT.Stage<-as.ordered(pulmonarycomp$cT.Stage)
pulmonarycomp$cN.Stage<-as.ordered(pulmonarycomp$cN.Stage)
pulmonarycomp$ASA<-as.ordered(pulmonarycomp$ASA)
pulmonarycomp$PS<-as.ordered(pulmonarycomp$PS)
levels(pulmonarycomp$Anypulmonary)<-c('No','Yes')
str(pulmonarycomp)
## 'data.frame':    406 obs. of  10 variables:
##  $ Age                      : num  61.3 71.3 70.8 54.4 53.3 ...
##  $ PS                       : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 2 2 2 3 2 2 2 3 1 1 ...
##  $ ASA                      : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 2 2 2 2 2 2 3 2 ...
##  $ cT.Stage                 : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 4 4 4 5 4 4 3 4 4 4 ...
##  $ cN.Stage                 : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Pre.Operative.Treatment  : Factor w/ 3 levels "Chemo","CRT",..: 1 1 1 1 1 3 1 3 1 1 ...
##  $ MI                       : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
##  $ Chronic.Pulmonary.Disease: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Connective.Tissue.Disease: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Anypulmonary             : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 1 2 1 2 2 ...



Next I prepare for modelling. A formula is specified, with the outcome variable on the left-hand side; the “.” on the right-hand side includes all other variables in the dataframe as predictors.

I then construct a training control object, which sets options for the model training process. Below, resampling is conducted using 10-fold cross-validation with 5 repeats. ‘classProbs’ makes the model return outcome probabilities (as opposed to acting as a hard yes/no classifier). ‘summaryFunction’ determines which performance metrics, including the C-index, are calculated.

The grid for hyperparameter tuning is constructed in ‘ELGrid’. In this example an elastic net is trained: alpha is the mixing parameter between the two penalties (alpha = 1 gives the LASSO or L1 penalty, alpha = 0 the ridge or L2 penalty) and lambda controls the overall strength of regularisation. Alpha is varied between 0 and 1 at 0.1 intervals and lambda between 0 and 0.1 at 0.005 intervals, with every combination evaluated. A similar approach is taken for the random forest and XGBoost grids; note that the XGBoost grid is large (several thousand combinations), which dominates training time.


fm <- as.formula( paste( 'Anypulmonary', ".", sep=" ~ ")) 

cctrl <- trainControl(method="repeatedcv", number=10,repeats=5,returnResamp = "all", classProbs = TRUE,
                      summaryFunction = multiClassSummary, savePredictions = TRUE,verboseIter = TRUE,allowParallel = TRUE)

ELGrid = expand.grid(alpha = seq(0,1,by=0.1),lambda = seq(0,0.1,by = 0.005))
RFGrid = expand.grid(
  .mtry=2:10,
  .splitrule=c("gini","extratrees"),
  .min.node.size=c(1,3,5)
)
XGBGrid <- expand.grid(
  nrounds = seq(from = 0, to = 5000, by = 500),
  eta = c(0.05, 0.1, 0.3),
  max_depth = c(2, 4, 6),
  gamma = c(0.1,0.5,0.7,1.0),
  colsample_bytree = c(0.4,0.6,0.8),
  min_child_weight = c(1,2,3),
  subsample = c(0.5,0.75)
)

Model Training

The models are then trained using the above settings. To speed up what can otherwise be a lengthy computation, I make use of parallel computing, using the number of physical cores minus one so that one core remains free for other work. Note that makeForkCluster relies on forking and is therefore only available on Unix-like systems; on Windows, makePSOCKcluster can be used instead.

In this example an elastic net model is trained first, as specified by method=‘glmnet’. Further training settings include the optimisation metric, here ‘AUC’ (i.e. the C-index), meaning the grid search selects the hyperparameter values that maximise this metric. A random forest and an XGBoost model are then trained, followed by a logistic regression model for comparison purposes.

Missing values are handled by single imputation using a k-nearest-neighbours method (‘knnImpute’). This has the benefit of being conducted within each resample (the imputation is estimated on the training fold and then applied to the corresponding test fold), which should reduce bias.

After completing the analysis, the parallel computing session is closed.


cl <- makeForkCluster(detectCores(logical=FALSE)-1)   ###Fork cluster: Unix only; use makePSOCKcluster() on Windows
registerDoParallel(cl)

ELModel <-train(fm, data=pulmonarycomp, method="glmnet", trControl=cctrl,tuneGrid =ELGrid,
                metric="AUC", na.action=na.pass, preProcess='knnImpute')

RFModel<-train(fm, data=pulmonarycomp, method="ranger", num.trees=1000, na.action=na.pass,
               preProcess='knnImpute',trControl= cctrl, tuneGrid=RFGrid,
               metric="AUC",importance='permutation')

XGBModel <-train(fm,data=pulmonarycomp,method="xgbTree",trControl=cctrl,tuneGrid=XGBGrid,
                 na.action=na.pass, preProcess='knnImpute',metric="AUC")

LRMModel <-train(fm, data=pulmonarycomp, method="glm", trControl=cctrl, metric="AUC",
                 na.action=na.pass, preProcess='knnImpute')

stopCluster(cl)
registerDoSEQ()
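
Before moving on it is worth sanity-checking what the grid search selected. A brief sketch using standard caret accessors (shown for the elastic net; the ordering step is simply one illustrative way to view the results):


ELModel$bestTune                                     ###Winning alpha and lambda
head(ELModel$results[order(-ELModel$results$AUC),])  ###Grid results, best first
plot(ELModel)                                        ###Performance across the tuning grid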

Validation

Next, model performance is assessed for each model in turn. The code below generates apparent performance (performance on the training set), classical cross-validation performance (metrics and plots calculated in each fold and then averaged; ‘Method 1’ in the thesis text) and cross-validation performance using averaged predictions (‘Method 2’ in the thesis text). For each, the C-index, Brier score and log loss are calculated, and the receiver operating characteristic (ROC) curve and calibration curve are plotted; everything is saved into the object ‘validationsummary’.
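
To make the distinction concrete before the full code, here is a minimal sketch of the two methods for a single model and metric (the merge() against bestTune and the ModelMetrics::auc() calls are my illustrative choices; the loop below implements the same ideas more generally):


###Sketch only: C-index by Method 1 and Method 2 for the elastic net
bp <- merge(ELModel$pred, ELModel$bestTune)  ###Predictions at the winning tune
###Method 1: calculate the metric within each of the 50 resamples, then average
m1 <- mean(sapply(split(bp, bp$Resample),
                  function(f) ModelMetrics::auc(as.numeric(f$obs)-1, f$Yes)))
###Method 2: average each patient's 5 repeated predictions, then calculate once
avg <- aggregate(Yes~rowIndex, bp, mean)
obs <- as.numeric(ELModel$trainingData$.outcome=='Yes')[avg$rowIndex]
m2 <- ModelMetrics::auc(obs, avg$Yes)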


modellist<-list(ELModel,RFModel,XGBModel,LRMModel)

system.time({
###Apparent performance: the placeholder objects below are filled per model
###(metric vectors and plot lists, indexed by model m)
    appROC<-1
    appCAB<-1
    appCID<-1
    appLL<-1
    appBR<-1
    {
      for (m in 1:length(modellist)){
      ###Calibration
      {
        model<-modellist[[m]]
        LRMbp<-cbind(predict(model,na.omit(model$trainingData),type='prob')[,'Yes'],
                     na.omit(model$trainingData)$.outcome)
        LRMbp<-as.data.frame(LRMbp)
        colnames(LRMbp)<-c('Yes','obs')
        LRMbp$obs<-as.numeric(LRMbp$obs)-1
        appLL[m]<-round(logLoss(LRMbp[,2],LRMbp[,1]),digits=3)
        appBR[m]<-round(BrierScore(LRMbp[,2],LRMbp[,1]),digits=3)
        appCAB[m]<-list(ggplot(data = LRMbp) + xlab("Predicted Probability") + ylab('Observed Average')+
          geom_smooth(data = LRMbp, aes(Yes, obs),
                      color = "#F8766D",se=TRUE,level=0.95,span=2/3)+
            xlim(c(0,1))+ylim(c(0,1))+
          geom_abline(intercept = 0, slope = 1,linetype=2))
      }
      ###Discrimination
      {
        model<-modellist[[m]]
        LRMbp<-cbind(predict(model,na.omit(model$trainingData),type='prob')[,'Yes'],
                     na.omit(model$trainingData)$.outcome)
        LRMbp<-as.data.frame(LRMbp)
        appCID[m]<-paste(round(cvAUC::AUC(LRMbp[,1],LRMbp[,2]),3),'(',
                         round(ci.auc(LRMbp[,2],LRMbp[,1])[1],digits=3),
                         '-',round(ci.auc(LRMbp[,2],LRMbp[,1])[3],digits=3),')')
        colnames(LRMbp)<-c('Yes','obs')
        rocobj<-roc(LRMbp$obs,LRMbp$Yes)
        rocdata2<-as.data.frame(cbind(rocobj$sensitivities,rocobj$specificities))
        ciobj<-ci.se(rocobj,specificities=seq(0,1,l=25))
        dat.ci <- data.frame(x = as.numeric(rownames(ciobj)),
                             lower = ciobj[, 1],
                             upper = ciobj[, 3])
        appROC[m]<-list(ggplot(data = rocdata2) +
                          xlab("Specificity") +
                          ylab('Sensitivity')+
                              geom_line(data = rocdata2, aes(x=V2,y=V1),
                                        color = "#F8766D")+xlim(c(1,0))+ylim(c(0,1))+
                              geom_abline(intercept = 1, slope = 1,linetype=2)+
                              geom_ribbon(data=dat.ci,aes(x=x,ymin=lower,ymax=upper), 
                                          fill="grey", alpha=0.25))
        
        
      }
        names(appROC)[m]<-modellist[[m]]$method
        names(appCAB)[m]<-modellist[[m]]$method
    }
    }
  
###Cross Validation
    ###Calibration
    avepredCAB<-1
    aveplotCAB<-1
    avepredLL<-1
    avefoldLL<-1
    avefoldBR<-1
    avepredBR<-1
    for (m in 1:length(modellist)){
      model<-modellist[[m]]
      ###Restrict saved CV predictions to the winning hyperparameters
      ###(glm has a single placeholder tuning column, so all rows are kept)
      if (length(model$bestTune)>1){
      v<-names(model$bestTune)
      xa<-model$pred
      for (z in 1:length(model$bestTune)){
        xa<-subset(xa,xa[,v[z]]==model$bestTune[z][[1]])
      }
      }else
      xa<-model$pred
 
      #Method 1
      {
        cab<-1
        cabdata<-1
        brs<-1
        LLs<-1
        for (z in 1:length(levels(as.factor(xa$Resample)))){
          x1<-subset(xa,xa$Resample==levels(as.factor(xa$Resample))[z])
          brs[z]<-BrierScore(as.numeric(x1$obs)-1,x1$Yes)
          LLs[z]<-logLoss(as.numeric(x1$obs)-1,x1$Yes)
          LRMbp<-select(x1,obs,Yes)
          LRMbp$obs<-as.numeric(LRMbp$obs)-1
          xx<-ggplot_build(ggplot(data = LRMbp) + xlab("Predicted Probability") + 
                             ylab('Observed Average')+
                             geom_smooth(data = LRMbp, aes(Yes, obs),
                                         color = "#F8766D",se=TRUE,level=0.95,span=2/3)+
                             xlim(c(0,1))+
                             ylim(c(0,1))+
                             geom_abline(intercept = 0, slope = 1,linetype=2))
          
          for (i in 1:(length(xx$data[[1]]$y)-1)){
            xx$data[[1]]$y[i+1]<-ifelse(xx$data[[1]]$y[i]>0.9&is.na(xx$data[[1]]$y[i+1]),
                                        1.0,
                                  ifelse(xx$data[[1]]$y[i]<0.1&is.na(xx$data[[1]]$y[i+1]),
                                         0,
                                         xx$data[[1]]$y[i+1]))
          }
          cabdata[z]<-list(select(xx$data[[1]],x,y))
        }
        avefoldLL[m]<-paste(round(mean(LLs),digits=3),
                            '(',round((mean(LLs)-1.96*std.error(LLs)),digits=3),
                            '-',round((mean(LLs)+1.96*std.error(LLs)),digits=3),')')
        avefoldBR[m]<-paste(round(mean(brs),digits=3),
                            '(',round((mean(brs)-1.96*std.error(brs)),digits=3),
                            '-',round((mean(brs)+1.96*std.error(brs)),digits=3),')')
        xC<-as.data.frame(cabdata[[1]]$x)
        yC<-as.data.frame(cabdata[[1]]$y)
        for (i in 1:length(cabdata)){
          xC[,i]<-cabdata[[i]]$x
          yC[,i]<-cabdata[[i]]$y
        }
        yC[is.na(yC)]<-0
        
        cal_plot_data2<-cabdata[[1]]
        cal_plot_data2$x<-rowMeans(xC)
        cal_plot_data2$y<-rowMeans(yC)
        cal_plot_data2$ymin<-rowMeans(yC)-1.96*std.error(t(yC))
        cal_plot_data2$ymax<-rowMeans(yC)+1.96*std.error(t(yC))
        calplot1<-ggplot(data = cal_plot_data2) +
          xlab("Predicted Probability") +
          ylab('Observed Average')+
          geom_smooth(data = cal_plot_data2, aes(x, y),
                      color = "#F8766D",se=FALSE,level=0.95,span=2/3)+
          xlim(c(0,1))+ylim(c(0,1))+
          geom_abline(intercept = 0, slope = 1,linetype=2)+
          geom_ribbon(data=cal_plot_data2,aes(x=x,ymin=ymin,ymax=ymax), 
                      fill="grey", alpha=0.25)
      }
      #Method 2
      {
          model<-modellist[[m]]
          if (length(model$bestTune)>1){
            v<-names(model$bestTune)
            xa<-model$pred
            for (z in 1:length(model$bestTune)){
              xa<-subset(xa,xa[,v[z]]==model$bestTune[z][[1]])
            }
          }else
            xa<-model$pred
        LRMbp<-cbind2((as.numeric((model$trainingData)$.outcome=='Yes')),
                      aggregate(Yes~rowIndex,xa,mean)[,'Yes'])
        LRMbp<-as.data.frame(LRMbp)
        colnames(LRMbp)<-c('obs','Yes')
        avepredLL[m]<-round(logLoss(LRMbp[,1],LRMbp[,2]),digits=3)
        avepredBR[m]<-round(BrierScore(as.numeric(LRMbp[,1]),LRMbp[,2]),digits=3)
        calplot2<-ggplot(data = LRMbp) +
          xlab("Predicted Probability") +
          ylab('Observed Average')+
          geom_smooth(data = LRMbp, aes(Yes, obs),
                      color = "#F8766D",se=TRUE,level=0.95,span=2/3)+
          xlim(c(0,1))+
          ylim(c(0,1))+
          geom_abline(intercept = 0, slope = 1,linetype=2)
      }
      avepredCAB[m]<-list(calplot2)
      aveplotCAB[m]<-list(calplot1)
      names(avepredCAB)[m]<-modellist[[m]]$method
      names(aveplotCAB)[m]<-modellist[[m]]$method
    }

    ###Discrimination
    avepredROC<-1
    aveplotROC<-1
    avefoldCID<-1
    avepredCID<-1
    {

      for (m in 1:length(modellist)){
        ###Method 1
        {
        model<-modellist[[m]]
        if (length(model$bestTune)>1){
          v<-names(model$bestTune)
          xa<-model$pred
          for (z in 1:length(model$bestTune)){
            xa<-subset(xa,xa[,v[z]]==model$bestTune[z][[1]])
          }
        }else
          xa<-model$pred
        x1<-subset(xa,xa$Resample==levels(as.factor(xa$Resample))[1])
        rocobj<-roc(x1$obs,x1$Yes)
      roc<-1
      lgt<-1
      cids<-1

      for (i in 1:length(levels(as.factor(xa$Resample)))){
        x1<-subset(xa,xa$Resample==levels(as.factor(xa$Resample))[i])
        cids[i]<-auc(as.numeric(x1$obs)-1,x1$Yes)
        lgt[i]<-nrow(x1)
      }
      avefoldCID[m]<-paste(round(mean(cids),digits=3),
                           '(',round(mean(cids)-1.96*std.error(cids),digits=3),
                           '-',round(mean(cids)+1.96*std.error(cids),digits=3),')')
      
      for (i in 1:length(levels(as.factor(xa$Resample)))){
        x1<-subset(xa,xa$Resample==levels(as.factor(xa$Resample))[i])
        x1<-sample_n(x1,min(lgt))
        rocobj<-roc(x1$obs,x1$Yes)
        roc[i] <- list(ggplot_build(ggroc(rocobj,alpha=0.5,colour='red',
                                          linetype=6,size=1))$data)
      }
      
      lgt2<-1
      for (i in 1:length(roc)){
        lgt2[i]<-length(roc[[i]][[1]]$x)
      }
      
      
      
      xR<-as.data.frame(roc[[1]][[1]]$x[(1:min(lgt2))])
      yR<-as.data.frame(roc[[1]][[1]]$y[(1:min(lgt2))])
      for (i in 1:length(roc)){
        seqs<-ceiling(seq(from=1,to=length(roc[[i]][[1]]$x),length.out=min(lgt2)))
        xR[,i]<-roc[[i]][[1]]$x[seqs]
        yR[,i]<-roc[[i]][[1]]$y[seqs]
      }

      
      rocdat<-as.data.frame(cbind(-rowMeans(xR),
                                  rowMeans(yR),
                                  rowMeans(yR)+1.96*std.error(t(yR)),
                                  rowMeans(yR)-1.96*std.error(t(yR))))
      
      colnames(rocdat)<-c('x','y','ymax','ymin')
      
      aveplotROC[m]<-list(ggplot(data = rocdat) + 
                            xlab("Specificity") +
                            ylab('Sensitivity')+
                            geom_line(data = rocdat, aes(y=y, x=x),
                                      color = "#F8766D")+xlim(c(1,0))+ylim(c(0,1))+
                            geom_abline(intercept = 1, slope = 1,linetype=2)+
                            geom_ribbon(data=rocdat,aes(x=x,ymin=ymin,ymax=ymax), 
                                        fill="grey", alpha=0.25))
}
      ###Method 2
        {
        model<-modellist[[m]]
          if (length(model$bestTune)>1){
            v<-names(model$bestTune)
            xa<-model$pred
            for (z in 1:length(model$bestTune)){
              xa<-subset(xa,xa[,v[z]]==model$bestTune[z][[1]])
            }
          }else
            xa<-model$pred
        LRMbp<-cbind2((as.numeric((model$trainingData)$.outcome=='Yes')),
                      aggregate(Yes~rowIndex,xa,mean)[,'Yes'])
        LRMbp<-as.data.frame(LRMbp)
        colnames(LRMbp)<-c('obs','Yes')
        avepredCID[m]<-paste(round(cvAUC::AUC(LRMbp[,2],LRMbp[,1]),3),
                             '(',round(ci.auc(LRMbp[,1],LRMbp[,2])[1],digits=3),
                             '-',round(ci.auc(LRMbp[,1],LRMbp[,2])[3],digits=3),')')
        names(avepredCID)[m]<-modellist[[m]]$method
        LRMbp$Yes<-1-LRMbp$Yes
        
        LRMbp$obs<-as.factor(LRMbp$obs)
        rocobj<-roc(LRMbp$obs,LRMbp$Yes)
        rocdata2<-as.data.frame(cbind(rocobj$sensitivities,rocobj$specificities))
        ciobj<-ci.se(rocobj,specificities=seq(0,1,l=25))
        dat.ci <- data.frame(x = as.numeric(rownames(ciobj)),
                             lower = ciobj[, 1],
                             upper = ciobj[, 3])
        avepredROC[m]<-list(ggplot(data = rocdata2) +
                              xlab("Specificity") +
                              ylab('Sensitivity')+
                              geom_line(data = rocdata2, aes(x=V2,y=V1),
                                        color = "#F8766D")+xlim(c(1,0))+ylim(c(0,1))+
                              geom_abline(intercept = 1, slope = 1,linetype=2)+
                              geom_ribbon(data=dat.ci,aes(x=x,ymin=lower,ymax=upper),
                                          fill="grey", alpha=0.25))
        
        
      }
        names(avepredROC)[m]<-modellist[[m]]$method
        names(aveplotROC)[m]<-modellist[[m]]$method
      }
    }
    
metr<-as.data.frame(1:length(modellist))
metr$ApparentCID<-appCID
metr$ApparentLL<-appLL
metr$ApparentBR<-appBR
metr$FoldAveCID<-avefoldCID
metr$PredAveCID<-avepredCID
metr$FoldAveLL<-avefoldLL
metr$PredAveLL<-avepredLL
metr$FoldAveBR<-avefoldBR
metr$PredAveBR<-avepredBR
metr<-select(metr,-1)
for (m in 1:length(modellist)){
  rownames(metr)[m]<- modellist[[m]]$method
}


validationsummary<-list(metr,appROC,avepredROC,aveplotROC,appCAB,avepredCAB,aveplotCAB)
names(validationsummary)<-c('Metrics','Apparent ROC','Average-Prediction ROC',
                            'Average-Plot ROC','Apparent Calibration',
                            'Average-Prediction Calibration','Average-Plot Calibration')

rm(cids,LLs,appCAB,appROC,aveplotCAB,aveplotROC,avepredCAB,avepredROC,cabdata,
   cal_plot_data2,calplot1,calplot2,ciobj,dat.ci,LRMbp,metr,model,roc,rocdat,
   rocdata2,rocobj,x1,xa,xC,xR,xx,yC,yR,appBR,appCID,appLL,avefoldBR,avefoldCID,
   avefoldLL,avepredBR,avepredCID,avepredLL,brs,cab,i,lgt,lgt2,m,seqs,z)

})



Performance metrics (C-index, log loss, Brier score) can then be retrieved from the summary object and are tabulated below.


kable(validationsummary$Metrics[,1:3],caption='Apparent Performance')
Table: Apparent Performance

|                     | ApparentCID           | ApparentLL | ApparentBR |
|:--------------------|:----------------------|-----------:|-----------:|
| Logistic Regression | 0.654 (0.599 - 0.708) |      0.619 |      0.215 |
| Elastic Net         | 0.644 (0.589 - 0.7)   |      0.628 |      0.219 |
| Random Forest       | 0.932 (0.908 - 0.956) |      0.475 |      0.150 |
| XGBoost             | 0.88 (0.847 - 0.913)  |      0.455 |      0.146 |
kable(validationsummary$Metrics[,c(4,6,8)],caption='Method 1 CV Performance')
Table: Method 1 CV Performance

|                     | FoldAveCID            | FoldAveLL             | FoldAveBR             |
|:--------------------|:----------------------|:----------------------|:----------------------|
| Logistic Regression | 0.558 (0.531 - 0.585) | 0.678 (0.664 - 0.692) | 0.24 (0.234 - 0.246)  |
| Elastic Net         | 0.593 (0.567 - 0.618) | 0.649 (0.642 - 0.655) | 0.228 (0.225 - 0.231) |
| Random Forest       | 0.589 (0.565 - 0.614) | 0.651 (0.641 - 0.662) | 0.23 (0.226 - 0.235)  |
| XGBoost             | 0.594 (0.568 - 0.62)  | 0.69 (0.669 - 0.71)   | 0.244 (0.236 - 0.253) |
kable(validationsummary$Metrics[,c(5,7,9)],caption='Method 2 CV Performance')
Table: Method 2 CV Performance

|                     | PredAveCID            | PredAveLL | PredAveBR |
|:--------------------|:----------------------|----------:|----------:|
| Logistic Regression | 0.556 (0.499 - 0.614) |     0.674 |     0.239 |
| Elastic Net         | 0.587 (0.53 - 0.644)  |     0.648 |     0.228 |
| Random Forest       | 0.586 (0.53 - 0.643)  |     0.649 |     0.229 |
| XGBoost             | 0.594 (0.538 - 0.651) |     0.680 |     0.241 |



The ROC curves are then visualised (CV Method 1).


ggarrange(validationsummary[[4]][[1]],
          validationsummary[[4]][[2]],
          validationsummary[[4]][[3]],
          validationsummary[[4]][[4]],ncol=2,nrow=2,labels=c('A','B','C','D'))
Figure 1: Cross-Validation Receiver Operating Characteristic curves. (A) Logistic Regression, (B) Elastic Net, (C) Random Forest, (D) XGBoost. The 45-degree dashed line represents random chance; the shaded area represents the 95% confidence interval.




Calibration plots are informative in interpreting predictions (CV Method 2).


ggarrange(validationsummary[[7]][[1]],
          validationsummary[[7]][[2]],
          validationsummary[[7]][[3]],
          validationsummary[[7]][[4]],ncol=2,nrow=2,labels=c('A','B','C','D'))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Figure 2: Cross-Validation Calibration charts. (A) Logistic Regression, (B) Elastic Net, (C) Random Forest, (D) XGBoost. The 45-degree dashed line represents perfect agreement between predicted and observed probabilities; the shaded area represents the 95% confidence interval.


Variable Importance

Variable importance is calculated for each model and scaled to a percentage of the most important variable. For the random forest this is permutation importance (set via importance='permutation'); for the other models caret's varImp uses its model-specific measures.


VariableImportance<-varImp(modellist[[1]])$importance
VariableImportance<-cbind(VariableImportance,varImp(modellist[[2]])$importance)
VariableImportance<-cbind(VariableImportance,varImp(modellist[[3]])$importance)
VariableImportance<-cbind(VariableImportance,varImp(modellist[[4]])$importance)
colnames(VariableImportance)<-c('Logistic Regression','Elastic Net','Random Forest','XGBoost')
VariableImportance<-round(VariableImportance,digits=2)
knitr::kable(VariableImportance, caption='Scaled variable importance')
Table: Scaled variable importance

|                             | Logistic Regression | Elastic Net | Random Forest | XGBoost |
|:----------------------------|--------------------:|------------:|--------------:|--------:|
| Age                         |                0.00 |       13.95 |        100.00 |    4.53 |
| PS.L                        |               61.48 |       84.55 |          9.97 |    0.06 |
| PS.Q                        |               34.81 |      100.00 |          9.27 |    0.00 |
| PS.C                        |               55.63 |       71.81 |          8.31 |    0.07 |
| ASA.L                       |              100.00 |       12.97 |          8.26 |   82.24 |
| ASA.Q                       |                3.91 |       16.28 |          7.34 |    6.48 |
| cT.Stage.L                  |               58.69 |        4.79 |          6.81 |   95.94 |
| cT.Stage.Q                  |               56.11 |        0.00 |          6.69 |   58.28 |
| cT.Stage.C                  |                0.00 |        4.42 |          6.55 |   38.88 |
| cT.Stage^4                  |                0.00 |        5.86 |          6.41 |   41.40 |
| cN.Stage.L                  |                0.00 |        2.27 |          5.99 |    0.01 |
| cN.Stage.Q                  |                0.00 |        4.85 |          5.34 |    0.02 |
| cN.Stage.C                  |                0.00 |       10.87 |          5.27 |    0.02 |
| Pre.Operative.TreatmentCRT  |               43.23 |       40.02 |          4.22 |   64.42 |
| Pre.Operative.TreatmentNone |                0.59 |        1.14 |          3.83 |   70.58 |
| MI1                         |               60.03 |        7.12 |          3.01 |   75.75 |
| Chronic.Pulmonary.Disease1  |               84.98 |        6.83 |          1.13 |  100.00 |
| Connective.Tissue.Disease1  |               70.86 |       10.10 |          0.00 |   79.23 |



The partial dependence function is then calculated using the DALEX package. “Explainer” wrappers are created with reference to the training data, and these can then be applied to different variables.


majc<-na.omit(pulmonarycomp)
expLRM<-explain(modellist[[1]],data=majc,y=as.numeric(majc$Anypulmonary)-1,label='LRM')
## Preparation of a new explainer is initiated
##   -> model label       :  LRM 
##   -> data              :  400  rows  10  cols 
##   -> target variable   :  400  values 
##   -> model_info        :  package caret , ver. 6.0.86 , task Classification (  default  ) 
##   -> predict function  :  yhat.train  will be used (  default  )
##   -> predicted values  :  numerical, min =  0.18526 , mean =  0.3669532 , max =  0.6937088  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.6937088 , mean =  -0.004453163 , max =  0.7664297  
##   A new explainer has been created! 
expELR<-explain(modellist[[2]],data=majc,y=as.numeric(majc$Anypulmonary)-1,label='ELR')
## Preparation of a new explainer is initiated
##   -> model label       :  ELR 
##   -> data              :  400  rows  10  cols 
##   -> target variable   :  400  values 
##   -> model_info        :  package caret , ver. 6.0.86 , task Classification (  default  ) 
##   -> predict function  :  yhat.train  will be used (  default  )
##   -> predicted values  :  numerical, min =  0.06760209 , mean =  0.3636369 , max =  0.7923831  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.4822332 , mean =  -0.001136858 , max =  0.7699354  
##   A new explainer has been created! 
expRF<-explain(modellist[[3]],data=majc,y=as.numeric(majc$Anypulmonary)-1,label='RF')
## Preparation of a new explainer is initiated
##   -> model label       :  RF 
##   -> data              :  400  rows  10  cols 
##   -> target variable   :  400  values 
##   -> model_info        :  package caret , ver. 6.0.86 , task Classification (  default  ) 
##   -> predict function  :  yhat.train  will be used (  default  )
##   -> predicted values  :  numerical, min =  0.01604223 , mean =  0.3657628 , max =  0.8980607  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.6439712 , mean =  -0.003262802 , max =  0.85493  
##   A new explainer has been created! 
expXGB<-explain(modellist[[4]],data=majc,y=as.numeric(majc$Anypulmonary)-1,label='XGBoost')
## Preparation of a new explainer is initiated
##   -> model label       :  XGBoost 
##   -> data              :  400  rows  10  cols 
##   -> target variable   :  400  values 
##   -> model_info        :  package caret , ver. 6.0.86 , task Classification (  default  ) 
##   -> predict function  :  yhat.train  will be used (  default  )
##   -> predicted values  :  numerical, min =  4.721032e-07 , mean =  0.3653354 , max =  0.8183273  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8183273 , mean =  -0.002835395 , max =  0.8433529  
##   A new explainer has been created! 
x<-(ingredients::partial_dependency(expXGB,'Age'))
x1<-(ingredients::partial_dependency(expRF,'Age'))
x2<-(ingredients::partial_dependency(expELR,'Age'))
x3<-(ingredients::partial_dependency(expLRM,'Age'))
fintab<-rbind(x,x1,x2,x3)
ft<-rbind(x,x1,x2,x3)   ###Unrenamed copy; PDPs for further variables are appended to ft below
colnames(fintab)<-c('Variable.Name','Model','x','y','ids')
agepdp<-ggplotly(ggplot(fintab,aes(x,y))+
  geom_line(data=fintab,mapping=aes(colour=Model))+theme_bw()+
  ylim(0,0.6)+
  ylab('Predicted Probability')+
  xlab('Age (years)')+ggtitle('Age at diagnosis'))
agepdp

Marginal predicted probability for age, averaged for all other variables




Categorical variables can be handled similarly.


x<-(ingredients::partial_dependency(expXGB,'PS'))
## 'variable_type' changed to 'categorical' due to lack of numerical variables.
x1<-(ingredients::partial_dependency(expRF,'PS'))
## 'variable_type' changed to 'categorical' due to lack of numerical variables.
x2<-(ingredients::partial_dependency(expELR,'PS'))
## 'variable_type' changed to 'categorical' due to lack of numerical variables.
x3<-(ingredients::partial_dependency(expLRM,'PS'))
## 'variable_type' changed to 'categorical' due to lack of numerical variables.
fintab<-rbind(x,x1,x2,x3)
ft<-rbind(ft,fintab)
colnames(fintab)<-c('Variable.Name','Model','x','y','ids')
colnames(fintab)[3]<-'Performance_Status'
pspdp<-ggplotly(ggplot(fintab,aes(x=Performance_Status,y=y,fill=Performance_Status))+
  geom_bar(position='dodge',stat='identity')+theme_bw()+
  ylab('Marginal Predicted Probability')+
  xlab('')+
  facet_wrap(~Model)+ggtitle('Performance Status'))
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
pspdp

Marginal predicted probability for WHO Performance Status, averaged for all other variables