1 Simple Cross-Validation of Common ML Methods

This tutorial shows how to use FRESA.CAD to evaluate the performance of binary classifiers on the Sonar data set.

1.1 The required libraries


library("FRESA.CAD")
library("mlbench")
library("fastAdaboost")
library("gbm")

1.2 SONAR Data Set

I will use the Sonar data set from the mlbench package.

data(Sonar, package = "mlbench")

We have to condition the data set.
FRESA.CAD cross-validation requires a data frame with complete cases, and the outcome has to be numeric:
* 0 for Controls, and
* 1 for Cases

op <- par(no.readonly = TRUE)

Sonar$Class <- 1*(Sonar$Class == "M")
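
The Sonar data is already complete, but when adapting this tutorial to another data set it is worth verifying the complete-cases requirement before running the cross-validation. The following is a minimal sketch of such a check (the filtering line only changes the data if missing values are present):


# Verify the FRESA.CAD requirements: complete cases and a numeric 0/1 outcome
sum(!complete.cases(Sonar))            # should be 0 for the Sonar data
Sonar <- Sonar[complete.cases(Sonar),] # drop incomplete rows, if any
table(Sonar$Class)                     # 97 controls (0) and 111 cases (1)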

1.3 Gradient boosting from the gbm package

The cross-validation with FRESA.CAD can be done on any R function that fits binary outcomes.
The requirement is that the model fit has to be done as:

>model <- fit(formula, data),

and the prediction must be called as:

>pred <- predict(model, testdata)

If the fitting function does not conform to these requirements, you can always create a wrapper.
Here we will show how to create a wrapper for the gradient boosting method of the gbm package.

The following code shows the gbm wrapper function:


GBM_fit <- function(formula = formula, data = NULL, distribution = "bernoulli", n.trees = 1000,
                    shrinkage = 0.01, interaction.depth = 4, ...)
{
  # Fit the gradient boosting machine with the user-supplied hyperparameters
  fit <- gbm(formula = formula, data = data, distribution = distribution, n.trees = n.trees,
             shrinkage = shrinkage, interaction.depth = interaction.depth, ...);
  # summary.gbm ranks the variables by their relative influence (in percent)
  selectedfeatures <- summary(fit, plotit = FALSE);
  sum <- 0;
  sfeat <- 0;
  # Keep the top-ranked features whose relative influence adds up to 90%
  while (sum < 90)
  {
    sfeat <- sfeat + 1;
    sum <- sum + selectedfeatures[sfeat, 2];
  }
  result <- list(fit = fit, n.trees = n.trees, selectedfeatures = rownames(selectedfeatures[1:sfeat, ]))
  class(result) <- "GBM_FIT";
  return(result)
}

We also need a proper predict function for the boosting algorithm:


predict.GBM_FIT <- function(object, ...)
{
  parameters <- list(...);
  testData <- parameters[[1]];
  # Evaluate the model on a grid of tree counts, from 10% to 100% of n.trees
  n.trees <- seq(from = 0.1*object$n.trees, to = object$n.trees, by = object$n.trees/25);
  # predict.gbm returns the log-odds for each tree count; average them and
  # apply the logistic function to get the probability of being a case
  pLS <- predict(object$fit, testData, n.trees = n.trees);
  pLS <- 1.0/(1.0 + exp(-apply(pLS, 1, mean)));
  return(pLS);
}

Let me check that the fitting and prediction functions are working:


gfit <- GBM_fit(formula = Class ~ .,Sonar)
pr <- predict(gfit,Sonar)
pander::pander(table(pr > 0.5,Sonar$Class),caption="Training: Gradient Boost Confusion Matrix")
Training: Gradient Boost Confusion Matrix

pr > 0.5     0     1
FALSE       97     0
TRUE         0   111

Now I can check the cross-validated test performance of the gradient boosting method.
The following code runs the FRESA.CAD::randomCV() repeated holdout cross-validation: 200 repetitions, each one training on 75% of the data. The same training partitions will be reused below to cross-validate five alternative methods.

op <- par(no.readonly = TRUE)

GradBOOSTcv <- randomCV(Sonar,
                        "Class",
                        GBM_fit,
                        trainFraction = 0.75,
                        repetitions = 200
                        )
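
Before comparing it to other methods, we can report the cross-validated test performance of the gradient boosting model. This is a minimal sketch using FRESA.CAD::predictionStats_binary() on the ensemble of test predictions stored in the $medianTest element of the randomCV result; the same call is used for the other methods below:


# Cross-validated test performance of the gradient boosting wrapper
bs <- predictionStats_binary(GradBOOSTcv$medianTest, "CV Test: Gradient Boost")

The returned list contains the test performance statistics (accuracy, balanced error rate, AUC, sensitivity, and specificity) reported later in the benchmark tables.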

1.4 Cross-Validation of Common ML Methods

Now I will compare the performance to other R methods that already provide handy fit and predict functions.


ADABOOSTcv <- randomCV(fittingFunction = fastAdaboost::adaboost,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  asFactor = TRUE,
                  nIter=10)

QDAcv <- randomCV(fittingFunction = MASS::qda,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  method = "mve",
                  featureSelectionFunction = univariate_Logit,
                  featureSelection.control = list(uniTest = "zIDI",limit = 0.1,thr = 0.975))


LDAcv <- randomCV(fittingFunction = MASS::lda,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  featureSelectionFunction = univariate_Logit,
                  featureSelection.control = list(uniTest = "zIDI",limit = 0.1,thr = 0.975))



logisticCV <- randomCV(fittingFunction = glm,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  family="binomial",
                  featureSelectionFunction = univariate_Logit,
                  featureSelection.control = list(uniTest = "zIDI",limit = 0.1,thr = 0.975))

HCLUSBSWIMScv <- randomCV(fittingFunction=HLCM_EM,
                   trainSampleSets=GradBOOSTcv$trainSamplesSets,
                   hysteresis = 0.1
                   )
bs <- predictionStats_binary(HCLUSBSWIMScv$medianTest,"HCLAS BSWiMS KNN")
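
The remaining cross-validation objects can be summarized the same way. Here is a minimal sketch, assuming each randomCV result exposes its ensemble test predictions in $medianTest, as used above:


# Test performance of the remaining cross-validated methods
# (gradient boosting and the latent-class model were reported above)
cvResults <- list(ADABOOST = ADABOOSTcv,
                  QDA = QDAcv,
                  LDA = LDAcv,
                  Logistic = logisticCV)
for (methodName in names(cvResults))
{
  bs <- predictionStats_binary(cvResults[[methodName]]$medianTest, methodName)
}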



# HCLAS_GLMScv <- randomCV(fittingFunction = HLCM_EM,
#                     trainSampleSets = GradBOOSTcv$trainSamplesSets,
#                     hysteresis = 0.1,
#                     method = filteredFit,
#                     fitmethod = glm,
#                     family = "binomial",
#                     filtermethod.control = list(pvalue = 0.1,limit = 0.10))
# bs <- predictionStats_binary(HCLAS_GLMScv$medianTest,"HCLAS Logit KNN")
# 
# HCLAS_LASSOcv <- randomCV(fittingFunction = HLCM_EM,
#                     trainSampleSets = GradBOOSTcv$trainSamplesSets,
#                     hysteresis = 0.1,
#                     method = LASSO_1SE,
#                     family = "binomial")
# bs <- predictionStats_binary(HCLAS_LASSOcv$medianTest,"HCLAS LASSO KNN")
# 
# lf <- LASSO_1SE(Class~.,Sonar)
# p <- as.numeric(predict(lf,Sonar))

1.5 FRESA.CAD::BinaryBenchmark and Comparing Methods

Once all the cross-validations have been completed, we can compare their performance to five common ML methods:

  • KNN,
  • Random Forest,
  • RPART,
  • SVM, and
  • LASSO

These methods are fitted using their default parameters inside the BinaryBenchmark function:


par(op);

par(mfrow = c(2,2),cex = 0.6);

cp <- BinaryBenchmark(referenceCV = list(Gradient_Boost = GradBOOSTcv,
                                         LDA = LDAcv,
                                         QDA = QDAcv,
                                         ADABOOST = ADABOOSTcv,
                                         Logistic = logisticCV,
                                         Latent_BSWiMS= HCLUSBSWIMScv))


par(mfrow = c(1,1),cex = 1.0);

1.6 Reporting the results of the Benchmark procedure

Once done, we can compare the CV test results using the plot() function.
The plot function creates bar plots that compare the balanced error rate, the accuracy, the sensitivity, the specificity, and the area under the ROC curve, as well as the concordance index of the individual cross-validation runs.

The final two plots provide heatmaps that test whether the methods have a similar classification performance and whether each method has a larger AUC than the other tested methods.


par(mfrow = c(1,1),cex = 1.0,xpd = T,pty = 'm', mar = c(3,3,3,10)) # Making space for the legend
prBenchmark <- plot(cp)

The plot function also generates summary tables of the CV results.

pander::pander(prBenchmark$metrics,caption = "Classifier Performance",round = 3)
Classifier Performance (continued below)

       ADABOOST  Gradient_Boost     RF  Latent_BSWiMS    ENS    KNN
BER       0.121           0.13   0.141          0.167  0.205  0.224
ACC       0.88            0.87   0.861          0.837  0.798  0.784
AUC       0.949           0.933  0.94           0.912  0.904  0.909
SEN       0.91            0.892  0.892          0.874  0.847  0.883
SPE       0.845           0.845  0.825          0.794  0.742  0.67
CIDX      0.89            0.925  0.931          0.852  0.905  0.887

        RPART    SVM  Logistic    LDA  LASSO    QDA
BER     0.224  0.229     0.23   0.235  0.245  0.255
ACC     0.779  0.774     0.769  0.764  0.755  0.755
AUC     0.852  0.87      0.838  0.839  0.837  0.793
SEN     0.829  0.811     0.766  0.757  0.748  0.892
SPE     0.722  0.732     0.773  0.773  0.763  0.598
CIDX    0.758  0.849     0.813  0.817  0.821  0.733