1 Demo of Cross-Validation and Binary Benchmarking

1.1 Data loading and preparation

In this demo, we will use the colon data set from the “rda” package.

# Load FRESA.CAD, which provides randomCV() and BinaryBenchmark()
library("FRESA.CAD")
# Load the colon data from the rda package
data(colon, package = "rda")
# FRESA.CAD requires a data frame; one of the columns must be the class
Colon <- as.data.frame(cbind(Class = colon.y, colon.x))
# The class should be 0 for controls and 1 for cases
Colon$Class <- Colon$Class - 1
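
The preparation step can be checked on a toy example before running it on the real data. The sketch below uses made-up stand-ins (toy.x and toy.y are hypothetical, not part of the rda package) and assumes, as the subtraction above implies, that the labels arrive coded as 1 and 2.

```r
# Toy stand-ins for colon.x / colon.y (hypothetical values, not the real data)
set.seed(7)
toy.x <- matrix(rnorm(6 * 3), nrow = 6,
                dimnames = list(NULL, c("g1", "g2", "g3")))
toy.y <- c(1, 2, 2, 1, 2, 1)   # assumed coding: 1 and 2

# Same recipe as above: class column first, then the features
Toy <- as.data.frame(cbind(Class = toy.y, toy.x))
Toy$Class <- Toy$Class - 1      # recode to 0 / 1

# Sanity checks: one row per sample, class in the first column, coded 0/1
stopifnot(nrow(Toy) == length(toy.y),
          names(Toy)[1] == "Class",
          all(Toy$Class %in% c(0, 1)))
table(Toy$Class)
```

Running the same stopifnot() checks on Colon is a cheap way to catch a mislabeled or missing class column before the cross-validation starts.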

1.2 Cross-Validation of classifiers

The colon cancer data set has 62 observations and 2000 features. We will cross-validate (CV) a quadratic discriminant analysis (QDA) classifier that predicts the presence of cancer. Before the QDA parameters are estimated, a univariate filter based on the Wilcoxon test selects the top-ranked features, keeping only features with a Pearson correlation lower than 0.95. The number of features kept is a fraction of the training samples: with limit = 0.10, as set in the code below, that is roughly 5 features (10% of 80% of the 62 samples).
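
To make the filtering idea concrete, here is a minimal base-R sketch of a Wilcoxon-based univariate filter with a correlation cutoff. It is an illustration under stated assumptions, not the implementation inside FRESA.CAD: the function wilcoxon_filter() and the simulated data are hypothetical, and the greedy "keep a feature only if it is not too correlated with those already kept" rule is one plausible reading of the thr parameter.

```r
set.seed(1)
n <- 50; p <- 40
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("V", seq_len(p))))
y <- rep(0:1, each = n / 2)
x[y == 1, 1:3] <- x[y == 1, 1:3] + 2      # three informative features
x[, 4] <- x[, 1] + rnorm(n, sd = 0.01)    # near duplicate of V1

# Hypothetical helper: rank by Wilcoxon p-value, then greedily keep
# features whose absolute correlation with the kept ones is below thr
wilcoxon_filter <- function(x, y, limit = 0.10, thr = 0.95) {
  pvals <- apply(x, 2, function(f) wilcox.test(f[y == 1], f[y == 0])$p.value)
  ranked <- names(sort(pvals))
  max_keep <- max(1, floor(limit * nrow(x)))  # fraction of the samples
  keep <- character(0)
  for (f in ranked) {
    if (length(keep) >= max_keep) break
    if (length(keep) == 0 ||
        all(abs(cor(x[, f], x[, keep, drop = FALSE])) < thr)) {
      keep <- c(keep, f)
    }
  }
  keep
}

selected <- wilcoxon_filter(x, y, limit = 0.10, thr = 0.95)
selected
```

In this simulation V1 and V4 are almost identical, so at most one of them survives the correlation cutoff. Inside a cross-validation the filter has to be rerun on each training split, which is what passing featureSelectionFunction to randomCV() is for.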

The CV will select 80% of the samples randomly for training, and the other 20% will be a holdout for validation.

The CV will be repeated 75 times; hence, on average, each sample will be held out and predicted about 15 times (75 × 0.2).
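
The arithmetic behind that figure can be checked with a short simulation. This is a sketch that assumes plain random sampling without stratification and a rounded training size; how randomCV() actually draws its splits is not shown here and may differ.

```r
set.seed(123)
n_samples      <- 62
train_fraction <- 0.8
repetitions    <- 75
n_train <- round(train_fraction * n_samples)   # 50 training samples

# Count how often each sample lands in the holdout set
test_counts <- integer(n_samples)
for (r in seq_len(repetitions)) {
  train_idx <- sample(n_samples, n_train)
  test_idx  <- setdiff(seq_len(n_samples), train_idx)
  test_counts[test_idx] <- test_counts[test_idx] + 1L
}

expected <- repetitions * (1 - train_fraction)  # nominal value: 75 * 0.2
mean(test_counts)                               # 75 * 12 / 62, about 14.5
range(test_counts)                              # spread across samples
```

The nominal value is 15; with the split rounded to 50 training and 12 holdout samples the exact average is slightly lower, and individual samples vary around it.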

# Cross validate a QDA classifier using only the top ranked features
QDAcv <- randomCV(Colon,"Class",
                  MASS::qda,trainFraction = 0.8,
                  repetitions = 75,
                  featureSelectionFunction = univariate_Wilcoxon,
                  featureSelection.control = list(limit = 0.10,thr = 0.95))


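# AdaBoost on the same training sets and with the same feature filter as QDA
ADAcv <- randomCV(Colon,"Class",
                  adaboost,
                  trainSampleSets = QDAcv$trainSamplesSets,
                  featureSelectionFunction = univariate_Wilcoxon,
                  featureSelection.control = list(limit = 0.10,thr = 0.95),
                  asFactor = TRUE,nIter = 10)

# BeSS, reusing the QDA training sets
BESScv <- randomCV(Colon,"Class",BESS,trainSampleSets = QDAcv$trainSamplesSets)

# BSWiMS, reusing the QDA training sets
BSWiMScv <- randomCV(Colon,"Class",BSWiMS.model,trainSampleSets = QDAcv$trainSamplesSets)

# GMVE:BSWiMS, reusing the QDA training sets
GMVEBSWiMSCV <- randomCV(Colon,"Class",GMVEBSWiMS,trainSampleSets = QDAcv$trainSamplesSets)
# Test performance computed from the median test predictions
bs <- predictionStats_binary(GMVEBSWiMSCV$medianTest,"GMVE:BSWiMS")

# How often each feature was selected across the CV repetitions
GMVEBSWiMSCV$featureFrequency

# BOOST_BSWiMS, reusing the QDA training sets
BOOST_BSWiMSCV <- randomCV(Colon,"Class",BOOST_BSWiMS,trainSampleSets = QDAcv$trainSamplesSets)
# Test performance computed from the median test predictions
bs <- predictionStats_binary(BOOST_BSWiMSCV$medianTest,"BOOST_BSWiMS")

# How often each feature was selected across the CV repetitions
BOOST_BSWiMSCV$featureFrequency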

1.3 Report the cross-validation performance

After CV, we can visualize the ROC curve and extract the test performance.

The QDA, BeSS, AdaBoost, BSWiMS, GMVE:BSWiMS, and BOOST_BSWiMS CV results will be compared to other common classifiers using the FRESA.CAD::BinaryBenchmark() function.

The same training and test sets will be used in all classifiers.
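
Sharing the splits is what makes the comparison paired: every model is scored on exactly the same holdout samples. The base-R sketch below illustrates the idea with simulated data and two logistic regression models; the data, the helper holdout_accuracy(), and the models are hypothetical and unrelated to the classifiers benchmarked in this demo.

```r
set.seed(42)
n <- 60
d <- data.frame(Class = rep(0:1, each = n / 2))
d$x1 <- rnorm(n, mean = d$Class)   # informative feature
d$x2 <- rnorm(n)                   # pure noise

# Draw the training index sets once...
train_sets <- replicate(20, sample(n, size = round(0.8 * n)), simplify = FALSE)

# ...and reuse them for every model (hypothetical helper)
holdout_accuracy <- function(formula, train_idx) {
  fit  <- glm(formula, data = d[train_idx, ], family = binomial)
  test <- d[-train_idx, ]
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$Class)
}

acc_x1   <- sapply(train_sets, function(idx) holdout_accuracy(Class ~ x1, idx))
acc_both <- sapply(train_sets, function(idx) holdout_accuracy(Class ~ x1 + x2, idx))

# Paired differences are meaningful because the splits are identical
summary(acc_both - acc_x1)
```

Because both accuracy vectors come from the same 20 splits, differences between them reflect the models rather than the luck of the draw.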


# Compare the cross-validation results to standard classifiers
par(mfrow = c(2,2),cex = 0.45)
ClassBenchmark <- BinaryBenchmark(referenceCV = list(QDA = QDAcv,
                                                     BeSS = BESScv,
                                                     ADABOOST = ADAcv,
                                                     BSWiMS = BSWiMScv,
                                                     GMVEBSWiMSCV = GMVEBSWiMSCV,
                                                     BOOST_BSWiM = BOOST_BSWiMSCV))

par(mfrow = c(1,1),cex = 1.0)

1.4 Reporting the results of the Benchmark procedure

Once done, we can compare CV test results using the plot() function. The plot function also generates summary tables of the CV results.

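# Plot the results
# Save only the writable graphical parameters so par(op) restores them without warnings
op <- par(no.readonly = TRUE)
prBenchmark <- plot(ClassBenchmark)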


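pander::pander(prBenchmark$metrics,caption = "Classifier Performance",round = 3)

Table: Classifier Performance

|      | BSWiMS | LASSO | ENS   | BOOST_BSWiM | KNN   | ADABOOST | RF    | SVM   | GMVEBSWiMSCV | BeSS  | QDA   | RPART |
|------|--------|-------|-------|-------------|-------|----------|-------|-------|--------------|-------|-------|-------|
| BER  | 0.114  | 0.116 | 0.117 | 0.128       | 0.128 | 0.149    | 0.152 | 0.152 | 0.162        | 0.162 | 0.174 | 0.257 |
| ACC  | 0.887  | 0.887 | 0.887 | 0.871       | 0.871 | 0.855    | 0.855 | 0.855 | 0.855        | 0.855 | 0.839 | 0.774 |
| AUC  | 0.889  | 0.857 | 0.877 | 0.877       | 0.853 | 0.879    | 0.866 | 0.847 | 0.881        | 0.874 | 0.88  | 0.773 |
| SEN  | 0.9    | 0.9   | 0.9   | 0.875       | 0.875 | 0.875    | 0.875 | 0.875 | 0.9          | 0.9   | 0.875 | 0.85  |
| SPE  | 0.864  | 0.864 | 0.864 | 0.864       | 0.864 | 0.818    | 0.818 | 0.818 | 0.773        | 0.773 | 0.773 | 0.636 |
| CIDX | 0.875  | 0.844 | 0.88  | 0.865       | 0.839 | 0.854    | 0.844 | 0.844 | 0.854        | 0.74  | 0.865 | 0.708 |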
#pander::pander(prBenchmark$metrics_filter,caption = "Average Filter Performance",round = 3)
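
For readers unfamiliar with the row labels, the sketch below computes four of the metrics (BER, ACC, SEN, SPE) from a confusion matrix using their standard definitions. The helper binary_metrics() and the counts are hypothetical illustrations; the package may estimate its reported values differently (for example, by resampling), so these numbers are not meant to reproduce the table. AUC and CIDX need the continuous predictions and are not covered here.

```r
# Standard definitions from confusion-matrix counts (hypothetical helper)
binary_metrics <- function(tp, fn, tn, fp) {
  sen <- tp / (tp + fn)                    # sensitivity: cases detected
  spe <- tn / (tn + fp)                    # specificity: controls recognized
  acc <- (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
  ber <- 1 - (sen + spe) / 2               # balanced error rate
  c(BER = ber, ACC = acc, SEN = sen, SPE = spe)
}

# Hypothetical counts: 40 cases (36 detected), 22 controls (19 recognized)
m <- binary_metrics(tp = 36, fn = 4, tn = 19, fp = 3)
round(m, 3)
```

BER weights both classes equally, which is why it can rank classifiers differently from ACC when the classes are unbalanced.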

par(op);
# Heatmap of the selection frequency of the first 50 features, per method
gplots::heatmap.2(t(as.matrix(ClassBenchmark$featureSelectionFrequency[1:50,])),
                  trace = "none",
                  margins = c(10,10),
                  main = "Features",
                  cexRow = 0.5,
                  cexCol = 0.5)