FEP Predictors using Boosting (Version 2)

1. Setup

rm( list = ls() )

# install.packages("drat", repos="https://cran.rstudio.com")
# drat:::addRepo("dmlc")
# install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")

require( ggplot2 )
require( gridExtra )
require( xgboost )
require( reshape2 )
require( kableExtra )
require( RColorBrewer )
require( caret )

# load an tidy data
d <- read.csv( "Data_prediction_FEP.csv")
N <- nrow( d )
# remove the ID column
d <- d[ , -1 ]

str( d )

## 'data.frame':    77 obs. of  19 variables:
##  $ age         : int  22 24 20 23 28 25 19 37 21 20 ...
##  $ gender      : int  2 2 1 2 2 1 1 1 2 1 ...
##  $ iq          : int  116 93 98 90 102 90 89 79 102 133 ...
##  $ PANSS_pos   : int  17 20 14 11 13 18 23 15 13 9 ...
##  $ PANSS_neg   : int  20 26 9 16 18 22 23 20 8 8 ...
##  $ PANSS_gen   : int  36 35 24 29 30 40 41 37 29 22 ...
##  $ PANSS_total : int  73 81 47 56 61 80 87 72 50 39 ...
##  $ Glu.Cre     : num  1.31 1.13 1.07 1.44 1.18 ...
##  $ MI.Cre      : num  0.743 0.635 0.698 0.708 0.713 0.62 0.731 0.752 0.627 0.711 ...
##  $ NAA.Cre     : num  1.22 1.06 1.14 1.49 1.33 ...
##  $ GPC.Pch_Cre : num  0.228 0.182 0.231 0.252 0.248 0.266 0.213 0.303 0.26 0.252 ...
##  $ NAA.NAAG_Cre: num  1.28 1.1 1.25 1.49 1.33 ...
##  $ BA8         : num  -6.07 -5.85 -66.47 19.48 33.93 ...
##  $ Caudaate    : num  -42.15 39.86 40.09 5.29 -23.81 ...
##  $ PAC         : num  -87.46 5.25 -77.49 -59.63 -62.94 ...
##  $ lgi_parietal: num  3.21 3.33 3.62 3.74 3.29 ...
##  $ lgi_occ     : num  4.35 3.92 4.5 4.76 4.08 ...
##  $ lgi_dlpfc   : num  2.54 2.45 2.68 2.83 2.66 ...
##  $ group       : int  1 1 0 0 0 0 1 0 0 0 ...

Gender is coded numerically, as 1 and 2. This means it needs dummy (or factor) coding, but as only two levels, we can hack it to be 0/1 and keep it as a numerical predictor.

d$gender <- d$gender - 1

To begin with, we’ll use a random half split of the data:

# independent (predictor) variables
x.vars <- names( d )[1:(length(names(d))-1)]
# dependent (outcome, response) variables
y.var  <- names( d )[length(names(d))]

folds <- createFolds(d$group, k = 2)

# divide data into matrix of X and vector of Y for convenience
# and convert to matrices for xgboost
trainSet.X <- as.matrix( d[ folds$Fold1, x.vars  ] )
trainSet.Y <- as.matrix( d[ folds$Fold1, y.var  ] )

validSet.X <- as.matrix( d[ folds$Fold2, x.vars ] )
validSet.Y <- as.matrix( d[ folds$Fold2, y.var ] )

Look at proportions of each split on group just to check

table( trainSet.Y )

## trainSet.Y
##  0  1 
## 33  5

table( validSet.Y )

## validSet.Y
##  0  1 
## 24 15

… and they are reasonably well balanced

Next, fit the xgboost, using trees and the rest of the parameters as the defaults.

fit <- xgboost(data = trainSet.X, 
               label = trainSet.Y, 
               max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic",
               verbose = 2)

Prediction using a decision rule – as xgboost is a regression method, the underlying model reports real numbers, and we need to implement a decision rule. Assume the output of the fitted boosting model is \(\hat{y}\) : \[ y_{PRED} = \left\{ \begin{array} \\ 1 & \text{ iff } \hat{y} > 0.5 \\ 0 & \text{ otherwise } \end{array} \right. \]

Predict the classifications of the validation set, using model fitted on the training set:

pred <- predict(fit, validSet.X)

# implement y_PRED
decRule <- function( x ) { return( ifelse( x > 0.5, 1, 0 )) }
y_pred  <- decRule( pred )

Compare with ground truth (the validation set actual y)

(as.numeric(validSet.Y))

##  [1] 1 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0
## [36] 0 0 0 1

(y_pred)

##  [1] 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0
## [36] 0 0 0 0

And compute a confusion matrix:

cm <- confusionMatrix( y_pred, validSet.Y )

And we see that :

(cm$table)

##           Reference
## Prediction  0  1
##          0 23  8
##          1  1  7

True positives (reference group 1, correctly predicted as group 1) = 7
True negatives (reference = group 0, correctly predicted as group 0) = 23
False positives (for group 0, incorrectly predicted as group 1) = 1
False negatives (for group 1, incorrectly predicted as group 0) = 8

This corresponds to sensitivity and specificity:

Sensitivity = 0.96
Specificity = 0.47

2. Transparency of the model

importance_matrix <- xgb.importance(model = fit)
print(importance_matrix)

##        Feature      Gain     Cover Frequency
## 1:   PANSS_pos 0.4597544 0.3321218      0.25
## 2:     lgi_occ 0.2339035 0.2132960      0.25
## 3:    Caudaate 0.1756688 0.1661606      0.25
## 4: GPC.Pch_Cre 0.1306733 0.2884216      0.25

xgb.plot.importance(importance_matrix = importance_matrix)

require(DiagrammeR)

## Loading required package: DiagrammeR

xgb.plot.tree(model = fit)

And we see that PANSS_pos is doing most of the heavy lifting in the model.

3. Cross-Validation and Reliability

A “one off” using the default parameters for xgboost does really well on a split data set - trained on trainSet and then tested for predictive ability on the validation set validSet. We may have gotten lucky, in that the division of train/validation sets was fortuitous. So, we can run this multiple times, re-splitting the data and average the results – essentially, using multiple-runs of the algorithms on different splits of the data to eliminate “chance” splits which give unrealistically good classification performance. We’ll also store the top-three important predictor variables for each run (so we can infer which predictors contribute most to classification performance, making our model transparent) alongside the sensitivity and specificity of the classifier on each run. Finally, we’ll store the overall misclassification error – the raw number of mistakes.

Note, there’s much debate about \(k\)-folds cross-validation in validating a classifier algorithm – with the much-quoted \(k=10\) providing a good experimental compromise. However, a half-split (\(k=2\)) repeated many times will be reasonable if we run enough replications as we don’t have that much data - \(N =\) 77.

First, a function to setup a half-split of the data and return the training and test sets.

setupSplit <- function( d ) {
  
  x.vars <- names( d )[1:(length(names(d))-1)]
  # dependent (outcome, response) variables
  y.var  <- names( d )[length(names(d))]
  
  folds <- createFolds(d$group, k = 2)
  
  # divide data into matrix of X and vector of Y for convenience
  # and convert to matrices for xgboost
  trainSet.X <- as.matrix( d[ folds$Fold1, x.vars  ] )
  trainSet.Y <- as.matrix( d[ folds$Fold1, y.var  ] )
  
  validSet.X <- as.matrix( d[ folds$Fold2, x.vars ] )
  validSet.Y <- as.matrix( d[ folds$Fold2, y.var ] )

  return( list( trainSet.X = trainSet.X, trainSet.Y = trainSet.Y, 
                validSet.X = validSet.X, validSet.Y = validSet.Y) )
}

Setup 10000 replications of the “split, train, validate” routine and their results:

# storage for results
N.repl <- 10000

resultsTab <- data.frame( sens = rep(0,N.repl),
                          spec = rep(0,N.repl),
                          error = rep(0,N.repl),   ## raw error
                          top1 = rep("-",N.repl),  ## store top 3 variables 
                          top2 = rep("-",N.repl),
                          top3 = rep("-",N.repl), stringsAsFactors = FALSE )

Define a function to execute a single run:

# function to run a single split, train and test/validate
ReplicateTrainValidate <- function( thisSplit ) {

  # a row to store and return results for a bigger table
  resultRow <- data.frame( sens = 0,
                           spec = 0,
                           error = 0,
                           top1 = "-",
                           top2 = "-",
                           top3 = "-", stringsAsFactors = FALSE )
  
  # 1. fit the training data
  fit <- xgboost(data = thisSplit$trainSet.X, 
                 label = thisSplit$trainSet.Y, 
                 max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic",
                 verbose = 0)
  
  # 2. test on the validation set
  y_pred <- decRule( predict(fit, thisSplit$validSet.X) )
  
  # 3. pull out the sens and spec values
  suppressWarnings(
   cm <- confusionMatrix( y_pred, thisSplit$validSet.Y, positive = as.character(1) )
  )
  
  # Build a "row" of results to return ...
  # 4. the "raw" misclassification rate (fraction of incorrect)
  resultRow$error <- length( which( as.numeric( thisSplit$validSet.Y ) != y_pred ) ) /
                                     length( thisSplit$validSet.Y )
  
  # 5. sensitivity and specificity
  resultRow$sens <- cm$byClass["Sensitivity"]
  resultRow$spec <- cm$byClass["Specificity"]
  
  # 6. fetch the top 3 predictor variables
  importance_matrix <- xgb.importance(model = fit)
  # ... and store away for later
  resultRow$top1 <- as.character( importance_matrix[1,1] )
  resultRow$top2 <- as.character( importance_matrix[2,1] )
  resultRow$top3 <- as.character( importance_matrix[3,1] )
    
  return( resultRow )
}

And this runs N.repl = 10000 replications:

for ( i in 1:N.repl ) {
  resultsTab[i,] <- ReplicateTrainValidate( setupSplit( d ) )
}

Having trained 10000 classifiers (to rule out chance performance on a “lucky” split in Section 2) we examine the average performance to see if there’s anything unusual; part of this assessment will be to see which predictor variables are being used to drive classification in a majority of the 10000 replications.

Start by looking at the average behaviour of the sensitivity and specificity (where we use the median and IQR as measures of central tendency and dispersion):

SensSpecPlots <- function( resultsTab ) {
  
  median.sens <- median( resultsTab$sens )
  iqr.sens    <- IQR( resultsTab$sens )
  
  p.avSens <- ggplot( resultsTab, aes( x = sens ) ) +
    geom_histogram( bins = 15, fill = "#99CCFF" ) +
    xlab("\nSensitivity") +
    ylab("Frequency\n") +
    xlim( 0, 1.2 ) +
    geom_vline( xintercept = median.sens, size = 1, color = "red") +
    annotate("segment", x = median.sens - iqr.sens, xend = median.sens + iqr.sens, 
                 y = 1000, yend = 1000, size = 1, color = "red") +
    theme_classic()
  
  median.spec <- median( resultsTab$spec )
  iqr.spec    <- IQR( resultsTab$spec )
  
  p.avSpec <- ggplot( resultsTab, aes( x = spec ) ) +
    geom_histogram( bins = 15, fill = "#FF66CC" ) +
    xlab("\nSpecificity") +
    ylab("Frequency\n") +
    xlim( 0, 1.2 ) +
    geom_vline( xintercept = median.spec, size = 1, color = "red") +
    annotate("segment", x = median.spec - iqr.spec, xend = median.spec + iqr.spec, 
                 y = 1000, yend = 1000, size = 1, color = "red") +
    theme_classic()
  
  median.err <- median( resultsTab$error )
  iqr.err  <- IQR( resultsTab$error )
  
  p.avError <- ggplot( resultsTab, aes( x = error ) ) +
    geom_histogram( bins = 10, fill = "#669933" ) +
    xlab("\nMisclassification Error") +
    ylab("Frequency\n") +
    #xlim( 0, 6 ) +
    geom_vline( xintercept = median.err, size = 1, color = "red") +
    annotate("segment", x = median.err - iqr.err, xend = median.err + iqr.err, 
                 y = 1000, yend = 1000, size = 1, color = "red") +
    theme_classic()
  
  grid.arrange( p.avSens, p.avSpec, p.avError, ncol = 3)
  
  # return the medians for later use
  return( c( "median.sens" = median.sens, "iqr.sens" = iqr.sens,
             "median.spec" = median.spec, "iqr.spec" = iqr.spec,
             "median.err"  = median.err,  "iqr.err"  = iqr.err) )
  
}

# Plot the sensitivity, specificity and misclassification errors
sensSpec <- SensSpecPlots( resultsTab )

So on average, we are obtaining out of sample performances (that is, on the validation sets) of:

median sensitivity = 0.8 with +/- IQR = 0.222
median specificity = 0.931 with +/- IQR = 0.068
median misclassification error = 0.105 with +/- IQR = 0.029

Let’s look at – on average – which predictor variables are doing the work. Recall for Section 2, that the “importance” of a predictor (relative to the others) can be extracted from a classifier. We have trained 10000 classifiers on different (pseudo-randomised) training-validation splits of the data.

So we can interrogate each of these classifiers to find out how frequently each predictor is the 1st, 2nd, 3rd (and so on) most important predictor in these 10000 replications. This gives us some idea of the stability of the features (predictors) in classifier performance when the training and validation data varies (a bit like a sensitivity analysis).

plotPredictors <- function( predsList, thisTitle ){

  # tabulate the top predicting perfor
  thisTop <- table( predsList )
  thisTop <- rev( thisTop[ order( thisTop )] )
  
  # build a dataframe to plot
  df <- data.frame( predictor = factor( names( thisTop ), levels = names( thisTop ), ordered = TRUE ),
                     freq      = as.numeric( thisTop ) ) 

  p.top <- ggplot( df, aes( x = predictor, y = log10(freq) ) ) +
          geom_bar( stat = "identity", fill = "#CC6600" ) +
          xlab( "\nPredictor" ) +
          ylab( "Log10 Frequency\n" ) +
          ylim( 0, log10(10000) ) +
          theme_classic() +
          theme( axis.text.x = element_text(angle = 90, hjust = 1 ) ) +
          ggtitle( thisTitle )
  return( p.top )

}

p.top1 <- plotPredictors( resultsTab$top1, "1st")
p.top2 <- plotPredictors( resultsTab$top2, "2nd")
p.top3 <- plotPredictors( resultsTab$top3, "3rd")
  
grid.arrange(p.top1, p.top2, p.top3, ncol = 3)

Examining these plots, in the majority of replications, PANSS_pos is selected as the most important predictor (left panel), followed by MI.Cre (middle panel) and then iq (right panel). Note how the distributions of the 2nd and 3rd most important predictors are not nearly so consistent across replications, suggesting that they are not nearly as reliable as the 1st most important predictor.

To put some “real” figures to this:

PANSS_pos was the most important predictor in 9918 of the 10000 replications – around 99.18 percent of the time, it’s PANSS_pos that contributes most to classification performance
After PANSS_pos, the second most important predictor – MI.Cre – occurred in only 1467 of the 10000 replications
The third most important predictor – iq – occurred in only 932 of the 10000 replications

Notice how these patterns – over 10000 replications – mirror what we obtained using a single run of classifier-building earlier in Section 2; suggesting that, although subject to non-determinism because of the randomisation implicit in splitting the data, our first attempt was fairly representative.

5. Classifier Performance Compared to a Null Distribution

5.1 - Toy Classifier Example

Examine the following data:

X.1	X.2	y
1	2	A
1	3	A
1	4	A
2	2	B
2	3	B
2	4	B

Now, some notation:

The entire data set is \(D\) which consists of pairs \(({X}_i,y_i)\) where \({X}_i\) is a row in the table of data for subject/participant/observation \(i\) and \(y_i\) is the class assignment
In the table above, \(i\) indexes the rows, running from 1 through 6, and e.g. participant 3 is the row \(X_3 = \{1,4\}\) with \(y_3 = A\).
The elements (“columns” of \(D\)) are the predictor variables (or features, independent variables)
In the example above, the variable X.1 and X.2 are denoted \(X_{i}^{1}\) and \(X_{i}^{2}\) – so superscripts identify the columns of \(D\) and the subscripts identify the rows.

Some subsets that we’ll need later – recall that rows of \(D\) are participants/samples/observations, and columns are features, predictor or independent variables.

Let \(D_T\) be a subset of the rows of \(D\) and this will be the training set
Similarly, a validation subset \(D_V\) is a set of rows of \(D\) different from \(D_T\)
Formally \(D_T \subset D\), \(D_V \subset D\) and \(D_T \cap D_V = \emptyset\).
- If a row of \(D\) is in the training set \(D_T\) it cannot be in the validation set \(D_V\)

Returning to the table above, we can see that:

When \(X^1 = 1\) (variable X.1 in the table), \(y_i\) is always \(A\) and when \(X^1 = 2\), \(y_i\) is always \(B\)
- So, if we only know the value of \(X^1\), we know \(y_i\) with certainty – \(y_i\) is dependent on \(X^1\).
If we know only the value of \(X^2\) (variable X.2), it adds no information whatsoever about which class each person belongs to because – for example – when \(X^2 = 2\), \(y_i\) can be either \(A\) of \(B\)

This means that \(X^1\) is the best predictor of \(y_i\), and \(X^2\) is redundant. If we plot the data with \(X_1\) on the x-axis, and \(X_2\) on the y-axis, we find:

colPal          <- brewer.pal(3,"Set1")
names( colPal ) <- levels( df$Y )

ggplot( df, aes( x = X.1, y = X.2, colour = y ) ) +
  geom_point( size = 4 ) + 
  xlim( 0.5, 2.5 ) + ylim( 1.0, 4.5 ) +
  scale_colour_manual(name = "y", values = colPal) +
  geom_vline( xintercept = 1.5, size = 2, colour = "black") +
  theme_minimal()

We can see that a straight line at \(X^1 = 1.5\) perfectly divides the space of \((X^1, X^2)\) into classes \(y_i = A\) (the ‘left’ half of the space) and \(y_i = B\) (the ‘right’ half of the space). This is because \(X^1\) (irrespective of \(X^2\)) is the best predictor.

In this toy example, we can then manually build a classifier that exploits the strong dependency of \(y_i\) on \(X^1\). The classifier can discount \(X^2\) because it adds no information to the classification. Call this classifier \(F_{real}\) such that:

\[ F_{real}(X_i) = \left\{ \begin{array} \\ A & \text{ if } X^1 > 1.5 \\ B & \text{ otherwise } \end{array} \right. \]

Next, define an error function which measures how many misclassifications a classifier (\(F\)) makes on some data \(D\) as \(\text{Err}(F,D)\). Notice that using whole data set as above (6 participants) and \(F_{real}\) then: \[ \text{Err} \left( F_{real}, D \right) = 0 \] This is because we know that when the classifier has access to all data in \(D\), it can exploit the fact that \(X^1\) and \(y_i\) are dependent and we can always correctly classify any \(X_i\) using \(F_{real}\) as defined above. Note that this is equivalent to the misclassification error plot (the rightmost panel, in green) shown earlier, but the error was shown as a proportion of the total size of the validation set.

However, recall from earlier that we train the classifier on \(D_T\) and then test or validate it on \(D_V\) measuring the performance error only on the validation set \(D_V\) which is \(\text{Err}(F,D_V)\). This is to test if the dependencies in \((X_i,y_i)\) (learned from \(D_T\)) are robust enough to correctly make predictions individuals in the unseen data \(D_V\). In general, then, the error of a trained classifier assessed on the validation set (the out-of-sample performance) won’t be zero \(\text{Err}(F,D_V) \neq 0\) and this is because it is rare that data is as simple as that given in the toy example.

5.2 - Permutations for Evaluating Classifier Performance

A problem with assessing classifier performance (even using out-of-sample error) is:

How can we know that a trained classifier, with out-of-sample error \(\text{Err}(F,D_V)\) is really using structure in the data \((X_i, y_i)\) and not just performing at chance ?

One solution is to use asymptotic statistics on the confusion matrix, but these depend on assumptions that may not hold (this can be seen easily by looking at the graphs for sensitivity/specificity from earlier – they do not look like the obey a clear parametric form).

An alternative is to show that when the dependencies in \((X_i,y_i)\) are destroyed, that the trained classifier’s performance worsens. Essentially, we devise a non-parametric null hypothesis test which proceeds as follows:

Take a training subset \(D_T\) and \(D_V\) from \(D\)
Train a classifier \(F\) on \(D_T\) – we hope the classifier will exploit any dependence structure \((X_i,y_i)\)
Evaluate the trained classifier’s error on the validation set : \(\text{Err}(F,D_V)\) – we test the classifiers ability to exploit the dependence structure in \((X_i,y_i)\)
Take the validation set \(D_V\) and “break” the dependencies we believe that \(F\) has learned
- Take the labels \(y_i\) and permute them, so the classifications for each partipant are assigned randomly
- Call this permuted validation set as \(D_V^{*}\)
Test the error of the trained classifier on the permuted version : \(\text{Err}(F,D_V^{*})\)
If there is genuine structure being used by \(F\) then : \(\text{Err}(F,D_V^{*}) \geq \text{Err}(F,D_V)\)

A real example – start with the table of toy data from above and assume we are going to use the whole of the data set for validation testing:

X.1	X.2	y
1	2	A
1	3	A
1	4	A
2	2	B
2	3	B
2	4	B

With \(F_{real}\) we know that the error is \(\text{Err}(F_{real},D) = 0\).

Now reassign the labels (permute them) pseudo-randomly:

set.seed(123)  ## for reproducibility
s <- sample( nrow(df), 6, replace = TRUE )
df.rand <- data.frame( ID  = df$ID,
                       X.1 = df$X.1, 
                       X.2 = df$X.2,
                       y   = df$y[s] )

And the permuted data \(D_{V}^{*}\) looks like:

X.1	X.2	y
1	2	A
1	3	B
1	4	A
2	2	B
2	3	B
2	4	A

And graphically :

ggplot( df.rand, aes( x = X.1, y = X.2, colour = y ) ) +
  geom_point( size = 4 ) +
  xlim( 0.5, 2.5 ) + ylim( 1.0, 4.5 ) +
  scale_colour_manual(name = "y", values = colPal) +
  geom_vline( xintercept = 1.5, size = 2, colour = "black") +
  theme_minimal()

And now evaluate \(F_{real}\) on the permuted data:

For participant 1 \(F_{real}(X_1) = A\), and in the permuted data \(y_1 = A\) – so correct, and the error on \(X_1\) is 0
For participant 2 \(F_{real}(X_2) = A\), but in the permuted data \(y_2 = B\) – so \(F_{real}\) is incorrect and we score the error as 1
For participant 3 \(F_{real}(X_3) = A\), and in the permuted data \(y_3 = A\) – so correct, and the error on \(X_3\) is 0
… and so on for the remainder of the permuted data

The trained classifier \(F_{real}\)’s performance on the permuted validation set is incorrect on participants \(i = \{2,6\}\) and each error has a cost of “1” resulting in:

\[ \begin{array}\\ \text{Err}(F_{real}, D) &= 0 \\ \text{Err}(F_{real}, D_{V}^{*}) & = 2 \end{array} \]

The error is larger on the permuted data than on the validation set and this is what one would predict – if we train the classifier on one set of statistical dependencies in some data, then test the classifier on essentially random data, we expect that the classifier will no longer perform well.

By permuting the classification labels \(y_i\) multiple times and testing the trained classifier’s performance on the permuted validation set, we can derive a test if the classifier has really discovered and is using dependency in the data, or just performing at “randomly”. This can be expressed as a \(p\)-value for the classifier \(F_{real}\)’s performance with an upper bound on the standard deviation of \(p\) being \(\frac{1}{2 \sqrt{k}}\) where \(k\) is the number of permutations of the original data set. For example, if we require a significance level \(\alpha = 0.05\), we need at least \(k=100\) permutations/randomised data sets.

5.3 - Permutation Testing on the Real Data

We repeat the analysis from Section 3: use cross-validation and storing the results but in addition, we will test the classifier’s performance on randomised, permuted data to establish a \(p\)-value for the classifier using robust dependencies (and not “random” noise).

# K = number of randomised data sets to use in estimating p-value
K <- 10000

# storage for results
nullResultsTab <- data.frame( sens = rep(0,K),
                              spec = rep(0,K),
                              error = rep(0,K),
                              top1 = rep("-",K),  ## store top 3 variables 
                              top2 = rep("-",K),
                              top3 = rep("-",K), stringsAsFactors = FALSE )

for ( i in 1:K ) {
  # randomise the class labels
  d.rand <- d
  d.rand$group <- sample( d.rand$group )
  nullResultsTab[i,] <- ReplicateTrainValidate( setupSplit( d.rand ) )
}

Now we have an 10000 estimates for the error (and feature/predictors structure of each classifier) under the null hypothesis that the classifier is learning nothing from the data (because it is randomised).

For each of the original classifier’s errors, we count the number of times the collection of random classifiers’ error is less than or equal to one of the “real” classifiers (i.e. the error on the random data is better than on the real classifiers) – and this is the empirical \(p\)-value of the classifier performing differently to random: \[ \frac{\# \left( \text{Err}(F,D_{V}^{*}) \leq \text{Err}(F,D_{V}) \right)+1}{k+1} \] We implement this, and then compute the average of all 10000 p-values to arrive at a final hypothesis test. The misclassification error is a coarse test of performance, because it doesn’t account for false positives / negatives, so in a similar way, we’ll compute the empirical p-values using the sensitivity and specificity:

pTab   <- rep(NA,K)
pSens  <- rep(NA,K)
pSpec  <- rep(NA,K)
for ( i in 1:nrow(resultsTab) ) {
  pTab[i]  <- (length( which( nullResultsTab$error <= resultsTab$error[i] ) ) + 1 ) / ( K + 1 )
  pSens[i] <- (length( which( nullResultsTab$sens  >= resultsTab$sens[i] ) ) + 1 ) / ( K + 1 )
  pSpec[i] <- (length( which( nullResultsTab$spec  >= resultsTab$spec[i] ) ) + 1 ) / ( K + 1 )
}

final.pValueErr  <- mean( pTab )
final.pValueSens <- mean( pSens )
final.pValueSpec <- mean( pSpec )

The results:

On misclassification error, we arrive at an average \(p\)-value of 0.0067

But, for sensitivity and specificity:

Average \(p\)-value for sensitivity = 0.0159
Average \(p\)-value for specificity = 0.1981

So we can conclude that the real classification algorithm is performing better than random (e.g. the null distribution of random classifiers) for misclassification (at \(p < 0.05\)). But on the more familiar sensitivity and specificity, we can only conclude that the real classifiers are more sensitive (but not more specific) than the random classifiers at \(p < 0.05\). To see why, we can plot the distributions of sensitivity and specificity for the real classifiers and the random classifiers:

nullResultsTab$lbl <- "Null"
resultsTab$lbl  <- "Actual" 

allResults <- rbind( resultsTab, nullResultsTab )

allResults$lbl <- factor( allResults$lbl )

save( allResults, nullResultsTab, resultsTab, file = "distributionClassificationPerformance.RData")

sens.plot <- ggplot( allResults, aes( x = sens, fill = lbl ) ) +
  geom_histogram( position = "identity", bins = 20, alpha = 0.5 ) +
  xlab("\nSensitivity") +
  ylab("Frequency\n") +
  theme_classic() +
  theme(legend.position="none")


spec.plot <- ggplot( allResults, aes( x = spec, fill = lbl ) ) +
  geom_histogram( position = "identity", bins = 20, alpha = 0.3 ) +
  xlab("\nSpecificity") +
  ylab("Frequency\n") +
  theme_classic() +
  theme(legend.position=c(0,1))

grid.arrange( sens.plot, spec.plot, ncol = 2 )

It is informative to look at the random classifier’s use of the predictors, to really understand if the structure shown for the real classifiers (where PANSS_pos was the most imporant predictor) is different from the random classifiers:

p.top1 <- plotPredictors( nullResultsTab$top1, "1st")
p.top2 <- plotPredictors( nullResultsTab$top2, "2nd")
p.top3 <- plotPredictors( nullResultsTab$top3, "3rd")
  
grid.arrange(p.top1, p.top2, p.top3, ncol = 3)

These distributions of the 1st, 2nd and 3rd most important predictors/features suggests that classifiers are using them almost uniformly - as we would predict as there should be no concrete dependencies to exploit.

6. Using Only Biomarkers

One remaining question is:

What if remove the clinical predictors (the PANSS scores) and force the classifier to use only the tentative biomarker data ?

To do this, we repeat our first analysis but first, remove the PANSS data:

dropCols <- which( names(d) %in% c("PANSS_pos","PANSS_neg","PANSS_gen","PANSS_total") )

# new copy of data with containing no clinical scores (just the biomarkers)
d.bio    <- d[ , -dropCols ]

And re-run the analysis:

bio.resultsTab <- data.frame( sens = rep(0,N.repl),
                          spec = rep(0,N.repl),
                          error = rep(0,N.repl),   ## raw error
                          top1 = rep("-",N.repl),  ## store top 3 variables 
                          top2 = rep("-",N.repl),
                          top3 = rep("-",N.repl), stringsAsFactors = FALSE )


for ( i in 1:K ) {
  bio.resultsTab[i,] <- ReplicateTrainValidate( setupSplit( d.bio ) )
}

Repeat visualisation of the sensitivity/specificity:

bio.sensSpec <- SensSpecPlots( bio.resultsTab )

## Warning: Removed 1 rows containing missing values (geom_segment).

And again, compare with the null distribution of classifiers trained on permuted (randomised) labels:

# storage for results
bio.nullResultsTab <- data.frame( sens = rep(0,K),
                                  spec = rep(0,K),
                                  error = rep(0,K),
                                  top1 = rep("-",K),  ## store top 3 variables 
                                  top2 = rep("-",K),
                                  top3 = rep("-",K), stringsAsFactors = FALSE )

for ( i in 1:K ) {
  
  ## randomise the class labels
  d.bio.rand <- d.bio
  d.bio.rand$group <- sample( d.bio.rand$group )
  bio.nullResultsTab[i,] <- ReplicateTrainValidate( setupSplit( d.bio.rand ) )
}

Compute the empirical \(p\)-values:

bio.pTab   <- rep(NA,K)
bio.pSens  <- rep(NA,K)
bio.pSpec  <- rep(NA,K)
for ( i in 1:nrow(bio.resultsTab) ) {
  bio.pTab[i]  <- (length( which( bio.nullResultsTab$error <= bio.resultsTab$error[i] ) ) + 1 ) / ( K + 1 )
  bio.pSens[i] <- (length( which( bio.nullResultsTab$sens  >= bio.resultsTab$sens[i] ) ) + 1 ) / ( K + 1 )
  bio.pSpec[i] <- (length( which( bio.nullResultsTab$spec  >= bio.resultsTab$spec[i] ) ) + 1 ) / ( K + 1 )
}

final.bio.pValueErr  <- mean( bio.pTab )
final.bio.pValueSens <- mean( bio.pSens )
final.bio.pValueSpec <- mean( bio.pSpec )

The results:

On misclassification error, we arrive at an average \(p\)-value of 0.5604

And, for the sensitivity and specificity:

Average \(p\)-value for sensitivity = 0.5202
Average \(p\)-value for specificity = 0.5375

With the biomarker data alone, we cannot robustly make predictions about the group membership with any confidence. A look at the plots tells us why:

bio.nullResultsTab$lbl <- "Null"
bio.resultsTab$lbl  <- "Actual" 

bio.allResults <- rbind( bio.resultsTab, bio.nullResultsTab )

bio.allResults$lbl <- factor( bio.allResults$lbl )

sens.plot <- ggplot( bio.allResults, aes( x = sens, fill = lbl ) ) +
  geom_histogram( position = "identity", bins = 20, alpha = 0.5 ) +
  xlab("\nSensitivity") +
  ylab("Frequency\n") +
  theme_classic() +
  theme(legend.position="none")


spec.plot <- ggplot( bio.allResults, aes( x = spec, fill = lbl ) ) +
  geom_histogram( position = "identity", bins = 20, alpha = 0.3 ) +
  xlab("\nSpecificity") +
  ylab("Frequency\n") +
  theme_classic() +
  theme(legend.position=c(0,1))

grid.arrange( sens.plot, spec.plot, ncol = 2 )

6. Conclusions

The tree boosting algorithm makes for a good classifier on 2-fold (split-sample) classification, and in an isolated dry-run, seems to perform well but tells us that PANSS_pos is the variable most important for classifiying
Replicated runs of the same classifier algorithm gave some confidence that we could achieve classification accurately with reasonably good sensitivity and specificity and again, the PANSS_pos variable was most important.
Comparing the classifier to the classifiers trained on the null-distribution of class label-permuted (randomised) data showed that in fact, while PANSS_pos remained the most important predictor, we could only achive statistically significant classification peformance for sensitivity, but not specificity. This suggests with the current data, we can achieve devise a statistically significant sensitive (but not specific) classifier.
Removing the clinical data, and forcing the classifier to rely only on the tentative biomarkers shows we cannot achieve better than chance performance for both sensitivity and specificity.

7. Outstanding Questions

We have only used the default xgboost settings – there may be merit in repeating the classification task using more fine-tuning of the parameters for the algorithm. However, speaking against this is the fact that only the clinical predictor (PANSS_pos) is ever capable of producing a sensitive (but not specific) classifier. It seems unlikely that a more elaborate tree-based algorithm will perform better. Also, recall that a GLM based approach failed completely.
Are the set of predictors included in this data set really the best predictors ? We hand over the task of selecting important predictors to the algorithm and then use cross-validation to ensure the algorithm doesn’t “get lucky” on a single half-sample split. It would be informative to hand over all the variables available and see if there are other variables or combinations of variables that might add predictive capacbility.
The data are imbalanced for group – group = 1 has 20 examples, versus 57 for group = 0. This suggests we might try techniques for balancing groups by artificial sample generation (so called SMOTE – Synthetic Minority Over-sampling Technique)