rm( list = ls() )
# install.packages("drat", repos="https://cran.rstudio.com")
# drat:::addRepo("dmlc")
# install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")
require( ggplot2 )
require( gridExtra )
require( xgboost )
require( reshape2 )
require( kableExtra )
require( RColorBrewer )
require( caret )
# load and tidy the data
d <- read.csv( "Data_prediction_FEP.csv")
N <- nrow( d )
# remove the ID column
d <- d[ , -1 ]
str( d )
## 'data.frame': 77 obs. of 19 variables:
## $ age : int 22 24 20 23 28 25 19 37 21 20 ...
## $ gender : int 2 2 1 2 2 1 1 1 2 1 ...
## $ iq : int 116 93 98 90 102 90 89 79 102 133 ...
## $ PANSS_pos : int 17 20 14 11 13 18 23 15 13 9 ...
## $ PANSS_neg : int 20 26 9 16 18 22 23 20 8 8 ...
## $ PANSS_gen : int 36 35 24 29 30 40 41 37 29 22 ...
## $ PANSS_total : int 73 81 47 56 61 80 87 72 50 39 ...
## $ Glu.Cre : num 1.31 1.13 1.07 1.44 1.18 ...
## $ MI.Cre : num 0.743 0.635 0.698 0.708 0.713 0.62 0.731 0.752 0.627 0.711 ...
## $ NAA.Cre : num 1.22 1.06 1.14 1.49 1.33 ...
## $ GPC.Pch_Cre : num 0.228 0.182 0.231 0.252 0.248 0.266 0.213 0.303 0.26 0.252 ...
## $ NAA.NAAG_Cre: num 1.28 1.1 1.25 1.49 1.33 ...
## $ BA8 : num -6.07 -5.85 -66.47 19.48 33.93 ...
## $ Caudaate : num -42.15 39.86 40.09 5.29 -23.81 ...
## $ PAC : num -87.46 5.25 -77.49 -59.63 -62.94 ...
## $ lgi_parietal: num 3.21 3.33 3.62 3.74 3.29 ...
## $ lgi_occ : num 4.35 3.92 4.5 4.76 4.08 ...
## $ lgi_dlpfc : num 2.54 2.45 2.68 2.83 2.66 ...
## $ group : int 1 1 0 0 0 0 1 0 0 0 ...
Gender is coded numerically, as 1 and 2. This means it would normally need dummy (or factor) coding, but as it has only two levels, we can simply recode it to 0/1 and keep it as a numerical predictor.
d$gender <- d$gender - 1
To begin with, we’ll use a random half split of the data:
# independent (predictor) variables
x.vars <- names( d )[1:(length(names(d))-1)]
# dependent (outcome, response) variables
y.var <- names( d )[length(names(d))]
folds <- createFolds(d$group, k = 2)
# divide data into matrix of X and vector of Y for convenience
# and convert to matrices for xgboost
trainSet.X <- as.matrix( d[ folds$Fold1, x.vars ] )
trainSet.Y <- as.matrix( d[ folds$Fold1, y.var ] )
validSet.X <- as.matrix( d[ folds$Fold2, x.vars ] )
validSet.Y <- as.matrix( d[ folds$Fold2, y.var ] )
Look at the proportions of each group in the two splits, just to check:
table( trainSet.Y )
## trainSet.Y
## 0 1
## 33 5
table( validSet.Y )
## validSet.Y
## 0 1
## 24 15
… and each fold contains examples of both groups, although the balance between folds is only approximate.
Next, fit the xgboost model, using trees and leaving the remaining parameters at their defaults.
fit <- xgboost(data = trainSet.X,
label = trainSet.Y,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic",
verbose = 2)
Prediction uses a decision rule – as xgboost is a regression method, the underlying model reports real numbers, and we need to implement a decision rule. Assume the output of the fitted boosting model is \(\hat{y}\):
\[
y_{PRED} = \left\{
\begin{array}{ll}
1 & \text{ iff } \hat{y} > 0.5 \\
0 & \text{ otherwise }
\end{array}
\right.
\]
Predict the classifications of the validation set, using the model fitted on the training set:
pred <- predict(fit, validSet.X)
# implement y_PRED
decRule <- function( x ) { return( ifelse( x > 0.5, 1, 0 )) }
y_pred <- decRule( pred )
Compare with the ground truth (the actual y values of the validation set):
(as.numeric(validSet.Y))
## [1] 1 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0
## [36] 0 0 0 1
(y_pred)
## [1] 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0
## [36] 0 0 0 0
And compute a confusion matrix:
cm <- confusionMatrix( y_pred, validSet.Y )
And we see that:
(cm$table)
## Reference
## Prediction 0 1
## 0 23 8
## 1 1 7
This corresponds to sensitivity and specificity:
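These can be read directly from the caret confusion matrix object – for example:
cm$byClass[ c("Sensitivity", "Specificity") ]
Next, look at which predictors the fitted model relies on – the variable importance: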
importance_matrix <- xgb.importance(model = fit)
print(importance_matrix)
## Feature Gain Cover Frequency
## 1: PANSS_pos 0.4597544 0.3321218 0.25
## 2: lgi_occ 0.2339035 0.2132960 0.25
## 3: Caudaate 0.1756688 0.1661606 0.25
## 4: GPC.Pch_Cre 0.1306733 0.2884216 0.25
xgb.plot.importance(importance_matrix = importance_matrix)
require(DiagrammeR)
## Loading required package: DiagrammeR
xgb.plot.tree(model = fit)
And we see that PANSS_pos is doing most of the heavy lifting in the model.
A “one off” using the default parameters for xgboost does really well on a split data set – trained on trainSet and then tested for predictive ability on the validation set validSet. We may have gotten lucky, in that the division into training and validation sets was fortuitous. So we can run this multiple times, re-splitting the data, and average the results – essentially, using multiple runs of the algorithm on different splits of the data to eliminate “chance” splits which give unrealistically good classification performance. We’ll also store the top three most important predictor variables for each run (so we can infer which predictors contribute most to classification performance, making our model transparent), alongside the sensitivity and specificity of the classifier on each run. Finally, we’ll store the overall misclassification error – the proportion of validation cases that are misclassified.
Note, there’s much debate about \(k\)-fold cross-validation for validating a classifier algorithm – with the much-quoted \(k=10\) providing a good experimental compromise. However, as we don’t have that much data (\(N = 77\)), a half-split (\(k=2\)) repeated many times is reasonable, provided we run enough replications.
First, a function to setup a half-split of the data and return the training and test sets.
setupSplit <- function( d ) {
x.vars <- names( d )[1:(length(names(d))-1)]
# dependent (outcome, response) variables
y.var <- names( d )[length(names(d))]
folds <- createFolds(d$group, k = 2)
# divide data into matrix of X and vector of Y for convenience
# and convert to matrices for xgboost
trainSet.X <- as.matrix( d[ folds$Fold1, x.vars ] )
trainSet.Y <- as.matrix( d[ folds$Fold1, y.var ] )
validSet.X <- as.matrix( d[ folds$Fold2, x.vars ] )
validSet.Y <- as.matrix( d[ folds$Fold2, y.var ] )
return( list( trainSet.X = trainSet.X, trainSet.Y = trainSet.Y,
validSet.X = validSet.X, validSet.Y = validSet.Y) )
}
Set up 10000 replications of the “split, train, validate” routine and storage for their results:
# storage for results
N.repl <- 10000
resultsTab <- data.frame( sens = rep(0,N.repl),
spec = rep(0,N.repl),
error = rep(0,N.repl), ## raw error
top1 = rep("-",N.repl), ## store top 3 variables
top2 = rep("-",N.repl),
top3 = rep("-",N.repl), stringsAsFactors = FALSE )
Define a function to execute a single run:
# function to run a single split, train and test/validate
ReplicateTrainValidate <- function( thisSplit ) {
# a row to store and return results for a bigger table
resultRow <- data.frame( sens = 0,
spec = 0,
error = 0,
top1 = "-",
top2 = "-",
top3 = "-", stringsAsFactors = FALSE )
# 1. fit the training data
fit <- xgboost(data = thisSplit$trainSet.X,
label = thisSplit$trainSet.Y,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic",
verbose = 0)
# 2. test on the validation set
y_pred <- decRule( predict(fit, thisSplit$validSet.X) )
# 3. pull out the sens and spec values
suppressWarnings(
cm <- confusionMatrix( y_pred, thisSplit$validSet.Y, positive = as.character(1) )
)
# Build a "row" of results to return ...
# 4. the "raw" misclassification rate (fraction of incorrect)
resultRow$error <- length( which( as.numeric( thisSplit$validSet.Y ) != y_pred ) ) /
length( thisSplit$validSet.Y )
# 5. sensitivity and specificity
resultRow$sens <- cm$byClass["Sensitivity"]
resultRow$spec <- cm$byClass["Specificity"]
# 6. fetch the top 3 predictor variables
importance_matrix <- xgb.importance(model = fit)
# ... and store away for later
resultRow$top1 <- as.character( importance_matrix[1,1] )
resultRow$top2 <- as.character( importance_matrix[2,1] )
resultRow$top3 <- as.character( importance_matrix[3,1] )
return( resultRow )
}
And this runs N.repl = 10000 replications:
for ( i in 1:N.repl ) {
resultsTab[i,] <- ReplicateTrainValidate( setupSplit( d ) )
}
Having trained 10000 classifiers (to rule out chance performance on a “lucky” split in Section 2) we examine the average performance to see if there’s anything unusual; part of this assessment will be to see which predictor variables are being used to drive classification in a majority of the 10000 replications.
Start by looking at the average behaviour of the sensitivity and specificity (where we use the median and IQR as measures of central tendency and dispersion):
SensSpecPlots <- function( resultsTab ) {
median.sens <- median( resultsTab$sens )
iqr.sens <- IQR( resultsTab$sens )
p.avSens <- ggplot( resultsTab, aes( x = sens ) ) +
geom_histogram( bins = 15, fill = "#99CCFF" ) +
xlab("\nSensitivity") +
ylab("Frequency\n") +
xlim( 0, 1.2 ) +
geom_vline( xintercept = median.sens, size = 1, color = "red") +
annotate("segment", x = median.sens - iqr.sens, xend = median.sens + iqr.sens,
y = 1000, yend = 1000, size = 1, color = "red") +
theme_classic()
median.spec <- median( resultsTab$spec )
iqr.spec <- IQR( resultsTab$spec )
p.avSpec <- ggplot( resultsTab, aes( x = spec ) ) +
geom_histogram( bins = 15, fill = "#FF66CC" ) +
xlab("\nSpecificity") +
ylab("Frequency\n") +
xlim( 0, 1.2 ) +
geom_vline( xintercept = median.spec, size = 1, color = "red") +
annotate("segment", x = median.spec - iqr.spec, xend = median.spec + iqr.spec,
y = 1000, yend = 1000, size = 1, color = "red") +
theme_classic()
median.err <- median( resultsTab$error )
iqr.err <- IQR( resultsTab$error )
p.avError <- ggplot( resultsTab, aes( x = error ) ) +
geom_histogram( bins = 10, fill = "#669933" ) +
xlab("\nMisclassification Error") +
ylab("Frequency\n") +
#xlim( 0, 6 ) +
geom_vline( xintercept = median.err, size = 1, color = "red") +
annotate("segment", x = median.err - iqr.err, xend = median.err + iqr.err,
y = 1000, yend = 1000, size = 1, color = "red") +
theme_classic()
grid.arrange( p.avSens, p.avSpec, p.avError, ncol = 3)
# return the medians for later use
return( c( "median.sens" = median.sens, "iqr.sens" = iqr.sens,
"median.spec" = median.spec, "iqr.spec" = iqr.spec,
"median.err" = median.err, "iqr.err" = iqr.err) )
}
# Plot the sensitivity, specificity and misclassification errors
sensSpec <- SensSpecPlots( resultsTab )
So, on average, we are obtaining the following out-of-sample performance (that is, performance on the validation sets):
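The median and IQR values are returned by SensSpecPlots() and stored in sensSpec, so they can be printed directly – for example:
round( sensSpec, 3 )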
Let’s look at – on average – which predictor variables are doing the work. Recall from Section 2 that the “importance” of a predictor (relative to the others) can be extracted from a classifier. We have trained 10000 classifiers on different (pseudo-randomised) training-validation splits of the data.
So we can interrogate each of these classifiers to find out how frequently each predictor is the 1st, 2nd, 3rd (and so on) most important predictor in these 10000 replications. This gives us some idea of the stability of the features (predictors) in classifier performance when the training and validation data varies (a bit like a sensitivity analysis).
plotPredictors <- function( predsList, thisTitle ){
# tabulate how often each predictor appears at this rank
thisTop <- table( predsList )
thisTop <- rev( thisTop[ order( thisTop )] )
# build a dataframe to plot
df <- data.frame( predictor = factor( names( thisTop ), levels = names( thisTop ), ordered = TRUE ),
freq = as.numeric( thisTop ) )
p.top <- ggplot( df, aes( x = predictor, y = log10(freq) ) ) +
geom_bar( stat = "identity", fill = "#CC6600" ) +
xlab( "\nPredictor" ) +
ylab( "Log10 Frequency\n" ) +
ylim( 0, log10(10000) ) +
theme_classic() +
theme( axis.text.x = element_text(angle = 90, hjust = 1 ) ) +
ggtitle( thisTitle )
return( p.top )
}
p.top1 <- plotPredictors( resultsTab$top1, "1st")
p.top2 <- plotPredictors( resultsTab$top2, "2nd")
p.top3 <- plotPredictors( resultsTab$top3, "3rd")
grid.arrange(p.top1, p.top2, p.top3, ncol = 3)
Examining these plots, in the majority of replications PANSS_pos is selected as the most important predictor (left panel), followed by MI.Cre (middle panel) and then iq (right panel). Note how the distributions of the 2nd and 3rd most important predictors are not nearly so consistent across replications, suggesting that they are far less reliable than the 1st most important predictor.
To put some “real” figures to this:
- PANSS_pos was the most important predictor in 9918 of the 10000 replications – around 99.18 percent of the time, it’s PANSS_pos that contributes most to classification performance.
- After PANSS_pos, the second most important predictor – MI.Cre – occurred in only 1467 of the 10000 replications.
- The third most important – iq – occurred in only 932 of the 10000 replications.

Notice how these patterns – over 10000 replications – mirror what we obtained using a single run of classifier-building earlier in Section 2, suggesting that, although subject to non-determinism because of the randomisation implicit in splitting the data, our first attempt was fairly representative.
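Counts like these can be computed directly from the stored replication results in resultsTab – for example:
# count how often PANSS_pos is ranked 1st, MI.Cre 2nd and iq 3rd across the replications
sum( resultsTab$top1 == "PANSS_pos" )
sum( resultsTab$top2 == "MI.Cre" )
sum( resultsTab$top3 == "iq" )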
Consider a toy data set \(D\) of six participants, with two predictor variables (X.1 and X.2) and a class label y:

| X.1 | X.2 | y |
|---|---|---|
| 1 | 2 | A |
| 1 | 3 | A |
| 1 | 4 | A |
| 2 | 2 | B |
| 2 | 3 | B |
| 2 | 4 | B |
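The toy data frame df used in the code below isn’t shown being constructed; a minimal sketch that matches the table above (the ID column is included because it is referenced later) might be:
# toy data set D as a data frame, matching the table above
df <- data.frame( ID  = 1:6,
                  X.1 = c( 1, 1, 1, 2, 2, 2 ),
                  X.2 = c( 2, 3, 4, 2, 3, 4 ),
                  y   = factor( c( "A", "A", "A", "B", "B", "B" ) ) )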
Now, some notation:
- X.1 and X.2 are denoted \(X_{i}^{1}\) and \(X_{i}^{2}\) – so superscripts identify the columns of \(D\) and the subscripts identify the rows.
- We will also need some subsets of \(D\) later: a training set \(D_T\) and a validation set \(D_V\). Recall that rows of \(D\) are participants/samples/observations, and columns are features (predictor or independent variables).
Returning to the table above, we can see that:
- When \(X^1 = 1\) (the column X.1 in the table), \(y_i\) is always \(A\), and when \(X^1 = 2\), \(y_i\) is always \(B\).
- Whatever the value of \(X^2\) (the column X.2), it adds no information whatsoever about which class each person belongs to because – for example – when \(X^2 = 2\), \(y_i\) can be either \(A\) or \(B\).

This means that \(X^1\) is the best predictor of \(y_i\), and \(X^2\) is redundant. If we plot the data with \(X^1\) on the x-axis and \(X^2\) on the y-axis, we find:
colPal <- brewer.pal(3,"Set1")
names( colPal ) <- levels( df$y )
ggplot( df, aes( x = X.1, y = X.2, colour = y ) ) +
geom_point( size = 4 ) +
xlim( 0.5, 2.5 ) + ylim( 1.0, 4.5 ) +
scale_colour_manual(name = "y", values = colPal) +
geom_vline( xintercept = 1.5, size = 2, colour = "black") +
theme_minimal()
We can see that a straight line at \(X^1 = 1.5\) perfectly divides the space of \((X^1, X^2)\) into classes \(y_i = A\) (the ‘left’ half of the space) and \(y_i = B\) (the ‘right’ half of the space). This is because \(X^1\) (irrespective of \(X^2\)) is the best predictor.
In this toy example, we can then manually build a classifier that exploits the strong dependency of \(y_i\) on \(X^1\). The classifier can discount \(X^2\) because it adds no information to the classification. Call this classifier \(F_{real}\) such that:
\[ F_{real}(X_i) = \left\{ \begin{array}{ll} A & \text{ if } X^1 < 1.5 \\ B & \text{ otherwise } \end{array} \right. \]
Next, define an error function which measures how many misclassifications a classifier (\(F\)) makes on some data \(D\), written \(\text{Err}(F,D)\). Notice that using the whole data set as above (6 participants) and \(F_{real}\): \[ \text{Err} \left( F_{real}, D \right) = 0 \] This is because when the classifier has access to all the data in \(D\), it can exploit the fact that \(X^1\) and \(y_i\) are dependent, and we can always correctly classify any \(X_i\) using \(F_{real}\) as defined above. Note that this is equivalent to the misclassification error plot (the rightmost panel, in green) shown earlier, except that there the error was shown as a proportion of the total size of the validation set.
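As a concrete (if trivial) illustration – a minimal sketch using the toy df constructed above, with F_real and Err as hypothetical helper functions:
# the toy decision rule: class A to the 'left' of X^1 = 1.5, class B otherwise
F_real <- function( D ) { ifelse( D$X.1 < 1.5, "A", "B" ) }
# Err counts the misclassifications a classifier F makes on data D
Err <- function( F, D ) { sum( F( D ) != D$y ) }
Err( F_real, df )   # 0 -- F_real classifies the full toy data set perfectly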
However, recall from earlier that we train the classifier on \(D_T\) and then test or validate it on \(D_V\), measuring the performance error only on the validation set \(D_V\), i.e. \(\text{Err}(F,D_V)\). This tests whether the dependencies in \((X_i,y_i)\) (learned from \(D_T\)) are robust enough to correctly make predictions about individuals in the unseen data \(D_V\). In general, the error of a trained classifier assessed on the validation set (the out-of-sample performance) won’t be zero, \(\text{Err}(F,D_V) \neq 0\), because it is rare that data is as simple as the toy example.
A problem with assessing classifier performance (even using out-of-sample error) is knowing whether the classifier is really performing any better than chance.
One solution is to use asymptotic statistics on the confusion matrix, but these depend on assumptions that may not hold (this can be seen easily by looking at the graphs for sensitivity/specificity from earlier – they do not look like they obey a clear parametric form).
An alternative is to show that, when the dependencies in \((X_i,y_i)\) are destroyed, the trained classifier’s performance worsens. Essentially, we devise a non-parametric null hypothesis (permutation) test, which we illustrate with the toy data again:
| X.1 | X.2 | y |
|---|---|---|
| 1 | 2 | A |
| 1 | 3 | A |
| 1 | 4 | A |
| 2 | 2 | B |
| 2 | 3 | B |
| 2 | 4 | B |
With \(F_{real}\) we know that the error is \(\text{Err}(F_{real},D) = 0\).
Now reassign the labels (permute them) pseudo-randomly:
set.seed(123) ## for reproducibility
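## note: sampling with replace = TRUE resamples the labels rather than strictly
## permuting them; the main analysis below permutes the real labels using
## sample() without replacement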
s <- sample( nrow(df), 6, replace = TRUE )
df.rand <- data.frame( ID = df$ID,
X.1 = df$X.1,
X.2 = df$X.2,
y = df$y[s] )
And the permuted data \(D_{V}^{*}\) looks like:
| X.1 | X.2 | y |
|---|---|---|
| 1 | 2 | A |
| 1 | 3 | B |
| 1 | 4 | A |
| 2 | 2 | B |
| 2 | 3 | B |
| 2 | 4 | A |
And graphically :
ggplot( df.rand, aes( x = X.1, y = X.2, colour = y ) ) +
geom_point( size = 4 ) +
xlim( 0.5, 2.5 ) + ylim( 1.0, 4.5 ) +
scale_colour_manual(name = "y", values = colPal) +
geom_vline( xintercept = 1.5, size = 2, colour = "black") +
theme_minimal()
And now evaluate \(F_{real}\) on the permuted data:
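Using the sketch helpers from above (F_real and Err, the hypothetical names introduced earlier):
# error of the toy classifier on the permuted labels
Err( F_real, df.rand )   # 2 for the permutation shown above (participants 2 and 6)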
The trained classifier \(F_{real}\)’s performance on the permuted validation set is incorrect on participants \(i = \{2,6\}\) and each error has a cost of “1” resulting in:
\[ \begin{array}{ll} \text{Err}(F_{real}, D) & = 0 \\ \text{Err}(F_{real}, D_{V}^{*}) & = 2 \end{array} \]
The error is larger on the permuted data than on the original (un-permuted) data, and this is what one would predict – if we train a classifier on one set of statistical dependencies in some data, and then test the classifier on essentially random data, we expect that the classifier will no longer perform well.
By permuting the classification labels \(y_i\) multiple times and testing the trained classifier’s performance on each permuted validation set, we can derive a test of whether the classifier has really discovered (and is using) dependency in the data, or is just performing at random. This can be expressed as a \(p\)-value for the classifier \(F_{real}\)’s performance, with an upper bound on the standard deviation of \(p\) of \(\frac{1}{2 \sqrt{k}}\), where \(k\) is the number of permutations of the original data set. For example, if we require a significance level \(\alpha = 0.05\), we need at least \(k=100\) permutations/randomised data sets (since \(\frac{1}{2\sqrt{100}} = 0.05\)).
We repeat the analysis from Section 3 – using cross-validation and storing the results – but in addition we test the classifier’s performance on randomised (permuted) data, to establish a \(p\)-value for whether the classifier is using robust dependencies (and not “random” noise).
# K = number of randomised data sets to use in estimating p-value
K <- 10000
# storage for results
nullResultsTab <- data.frame( sens = rep(0,K),
spec = rep(0,K),
error = rep(0,K),
top1 = rep("-",K), ## store top 3 variables
top2 = rep("-",K),
top3 = rep("-",K), stringsAsFactors = FALSE )
for ( i in 1:K ) {
# randomise the class labels
d.rand <- d
d.rand$group <- sample( d.rand$group )
nullResultsTab[i,] <- ReplicateTrainValidate( setupSplit( d.rand ) )
}
Now we have 10000 estimates of the error (and the feature/predictor structure of each classifier) under the null hypothesis that the classifier learns nothing from the data (because the labels are randomised).
For each of the original classifiers’ errors, we count the number of times a random classifier’s error is less than or equal to the “real” classifier’s error (i.e. the error on the random data is as good as, or better than, on the real data) – and this gives the empirical \(p\)-value for the classifier performing differently from random:

\[ \frac{\# \left( \text{Err}(F,D_{V}^{*}) \leq \text{Err}(F,D_{V}) \right)+1}{k+1} \]

We implement this, and then compute the average of all 10000 p-values to arrive at a final hypothesis test. The misclassification error is a coarse measure of performance, because it doesn’t account for false positives/negatives, so in the same way we also compute empirical p-values using the sensitivity and specificity:
pTab <- rep(NA,K)
pSens <- rep(NA,K)
pSpec <- rep(NA,K)
for ( i in 1:nrow(resultsTab) ) {
pTab[i] <- (length( which( nullResultsTab$error <= resultsTab$error[i] ) ) + 1 ) / ( K + 1 )
pSens[i] <- (length( which( nullResultsTab$sens >= resultsTab$sens[i] ) ) + 1 ) / ( K + 1 )
pSpec[i] <- (length( which( nullResultsTab$spec >= resultsTab$spec[i] ) ) + 1 ) / ( K + 1 )
}
final.pValueErr <- mean( pTab )
final.pValueSens <- mean( pSens )
final.pValueSpec <- mean( pSpec )
The results:
But, for sensitivity and specificity:
So we can conclude that the real classification algorithm is performing better than random (i.e. better than the null distribution of random classifiers) for misclassification error (at \(p < 0.05\)). But on the more familiar sensitivity and specificity, we can only conclude that the real classifiers are more sensitive (but not more specific) than the random classifiers at \(p < 0.05\). To see why, we can plot the distributions of sensitivity and specificity for the real classifiers and the random classifiers:
nullResultsTab$lbl <- "Null"
resultsTab$lbl <- "Actual"
allResults <- rbind( resultsTab, nullResultsTab )
allResults$lbl <- factor( allResults$lbl )
save( allResults, nullResultsTab, resultsTab, file = "distributionClassificationPerformance.RData")
sens.plot <- ggplot( allResults, aes( x = sens, fill = lbl ) ) +
geom_histogram( position = "identity", bins = 20, alpha = 0.5 ) +
xlab("\nSensitivity") +
ylab("Frequency\n") +
theme_classic() +
theme(legend.position="none")
spec.plot <- ggplot( allResults, aes( x = spec, fill = lbl ) ) +
geom_histogram( position = "identity", bins = 20, alpha = 0.3 ) +
xlab("\nSpecificity") +
ylab("Frequency\n") +
theme_classic() +
theme(legend.position=c(0,1))
grid.arrange( sens.plot, spec.plot, ncol = 2 )
It is informative to look at the random classifiers’ use of the predictors, to really understand whether the structure shown for the real classifiers (where PANSS_pos was the most important predictor) is different from the random classifiers:
p.top1 <- plotPredictors( nullResultsTab$top1, "1st")
p.top2 <- plotPredictors( nullResultsTab$top2, "2nd")
p.top3 <- plotPredictors( nullResultsTab$top3, "3rd")
grid.arrange(p.top1, p.top2, p.top3, ncol = 3)
These distributions of the 1st, 2nd and 3rd most important predictors/features suggest that the random classifiers are using them almost uniformly – as we would predict, since there should be no concrete dependencies to exploit.
One remaining question is: can the (tentative) biomarkers alone – without the clinical PANSS scores – predict group membership?
To answer this, we repeat our first analysis, but first remove the PANSS data:
dropCols <- which( names(d) %in% c("PANSS_pos","PANSS_neg","PANSS_gen","PANSS_total") )
# new copy of the data containing no clinical scores (just the biomarkers)
d.bio <- d[ , -dropCols ]
And re-run the analysis:
bio.resultsTab <- data.frame( sens = rep(0,N.repl),
spec = rep(0,N.repl),
error = rep(0,N.repl), ## raw error
top1 = rep("-",N.repl), ## store top 3 variables
top2 = rep("-",N.repl),
top3 = rep("-",N.repl), stringsAsFactors = FALSE )
for ( i in 1:N.repl ) {
bio.resultsTab[i,] <- ReplicateTrainValidate( setupSplit( d.bio ) )
}
Repeat visualisation of the sensitivity/specificity:
bio.sensSpec <- SensSpecPlots( bio.resultsTab )
## Warning: Removed 1 rows containing missing values (geom_segment).
And again, compare with the null distribution of classifiers trained on permuted (randomised) labels:
# storage for results
bio.nullResultsTab <- data.frame( sens = rep(0,K),
spec = rep(0,K),
error = rep(0,K),
top1 = rep("-",K), ## store top 3 variables
top2 = rep("-",K),
top3 = rep("-",K), stringsAsFactors = FALSE )
for ( i in 1:K ) {
## randomise the class labels
d.bio.rand <- d.bio
d.bio.rand$group <- sample( d.bio.rand$group )
bio.nullResultsTab[i,] <- ReplicateTrainValidate( setupSplit( d.bio.rand ) )
}
Compute the empirical \(p\)-values:
bio.pTab <- rep(NA,K)
bio.pSens <- rep(NA,K)
bio.pSpec <- rep(NA,K)
for ( i in 1:nrow(bio.resultsTab) ) {
bio.pTab[i] <- (length( which( bio.nullResultsTab$error <= bio.resultsTab$error[i] ) ) + 1 ) / ( K + 1 )
bio.pSens[i] <- (length( which( bio.nullResultsTab$sens >= bio.resultsTab$sens[i] ) ) + 1 ) / ( K + 1 )
bio.pSpec[i] <- (length( which( bio.nullResultsTab$spec >= bio.resultsTab$spec[i] ) ) + 1 ) / ( K + 1 )
}
final.bio.pValueErr <- mean( bio.pTab )
final.bio.pValueSens <- mean( bio.pSens )
final.bio.pValueSpec <- mean( bio.pSpec )
The results:
And, for the sensitivity and specificity:
With the biomarker data alone, we cannot robustly make predictions about the group membership with any confidence. A look at the plots tells us why:
bio.nullResultsTab$lbl <- "Null"
bio.resultsTab$lbl <- "Actual"
bio.allResults <- rbind( bio.resultsTab, bio.nullResultsTab )
bio.allResults$lbl <- factor( bio.allResults$lbl )
sens.plot <- ggplot( bio.allResults, aes( x = sens, fill = lbl ) ) +
geom_histogram( position = "identity", bins = 20, alpha = 0.5 ) +
xlab("\nSensitivity") +
ylab("Frequency\n") +
theme_classic() +
theme(legend.position="none")
spec.plot <- ggplot( bio.allResults, aes( x = spec, fill = lbl ) ) +
geom_histogram( position = "identity", bins = 20, alpha = 0.3 ) +
xlab("\nSpecificity") +
ylab("Frequency\n") +
theme_classic() +
theme(legend.position=c(0,1))
grid.arrange( sens.plot, spec.plot, ncol = 2 )
The tree boosting algorithm makes for a good classifier on a 2-fold (split-sample) task: in an isolated dry-run it seems to perform well, and tells us that PANSS_pos is the variable most important for classifying.
Replicated runs of the same classifier algorithm gave some confidence that we could classify accurately, with reasonably good sensitivity and specificity, and again the PANSS_pos variable was the most important.
Comparing the classifiers to those trained on the null distribution of class-label-permuted (randomised) data showed that, in fact, while PANSS_pos remained the most important predictor, we could only achieve statistically significant classification performance for sensitivity, but not specificity. This suggests that, with the current data, we can devise a statistically significant, sensitive (but not specific) classifier.
Removing the clinical data and forcing the classifier to rely only on the tentative biomarkers shows that we cannot achieve better-than-chance performance for both sensitivity and specificity.
We have only used the default xgboost settings – there may be merit in repeating the classification task with finer tuning of the algorithm’s parameters. Speaking against this, however, is the fact that only the clinical predictor (PANSS_pos) ever produces a sensitive (but not specific) classifier; it seems unlikely that a more elaborate tree-based algorithm will perform better. Also, recall that a GLM-based approach failed completely.
Is the set of predictors included in this data set really the best available? We hand over the task of selecting important predictors to the algorithm and then use cross-validation to ensure the algorithm doesn’t “get lucky” on a single half-sample split. It would be informative to hand over all the variables available and see if there are other variables, or combinations of variables, that might add predictive capability.
The data are imbalanced for group – group = 1 has 20 examples, versus 57 for group = 0. This suggests we might try techniques for balancing the groups by artificial sample generation (so-called SMOTE – the Synthetic Minority Over-sampling Technique).