1 Simple Cross-Validation of Common ML Methods

This tutorial will guide users on how to use FRESA.CAD to evaluate the performance of binary classifiers.

1.1 The required libraries


library("FRESA.CAD")
library("mlbench")
library("fastAdaboost")
library("gbm")

1.2 PimaIndiansDiabetes Data Set

I will use the PimaIndiansDiabetes2 dataset from the mlbench package.

data("PimaIndiansDiabetes2", package = "mlbench")

We have to condition the data set.
FRESA.CAD cross-validation requires a data frame with complete cases. Also, the outcome has to be numeric.
* 0 for Controls, and
* 1 for Cases


PimaIndiansDiabetes  <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
PimaIndiansDiabetes$diabetes <- 1*(PimaIndiansDiabetes$diabetes == "pos")
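A quick sanity check of the conditioned data (an optional sketch) confirms that no missing values remain and shows the class balance:

sum(!complete.cases(PimaIndiansDiabetes))   # should be 0
table(PimaIndiansDiabetes$diabetes)         # counts of controls (0) and cases (1)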

1.3 Gradient boosting from the gbm package

Cross-validation with FRESA.CAD can be done on any R function that fits binary outcomes.
The only requirement is that the model fit has to be done as:

>model <- fit(formula,data),

and the predict method must be called as:

>pre <- predict(model,testdata)

If the fitting function does not conform to the requirements, you can always create a wrapper.
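In general, such a wrapper is just a pair of functions: a fit function that returns an object with a class attribute, and a matching S3 predict method that returns a numeric score for each test-set row. The following is a minimal sketch of that pattern, with hypothetical names (myMethod_fit, MY_FIT) and glm as a stand-in learner:

myMethod_fit <- function(formula, data, ...)
{
  fit <- glm(formula, data = data, family = binomial)   # any underlying learner
  result <- list(fit = fit)
  class(result) <- "MY_FIT"                             # class used for S3 predict dispatch
  return(result)
}

predict.MY_FIT <- function(object, ...)
{
  testData <- list(...)[[1]]                            # randomCV calls predict(model, testdata)
  return(predict(object$fit, testData, type = "response"))  # numeric score per test row
}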
Here we will show how to create a wrapper for the gradient boosting method of the gbm package.

The following code shows the gbm wrapper function:


GBM_fit <- function(formula = formula, data = NULL, distribution = "bernoulli", n.trees = 1000,
                    shrinkage = 0.01, interaction.depth = 4, ...)
{
  fit <- gbm(formula = formula, data = data, distribution = distribution, n.trees = n.trees,
             shrinkage = shrinkage, interaction.depth = interaction.depth, ...);
  # Rank the variables by relative influence (summary.gbm returns them already sorted)
  selectedfeatures <- summary(fit, plotit = FALSE);
  sum <- 0;
  sfeat <- 1;
  # Keep the top-ranked variables whose relative influences add up to 90%
  while (sum < 90) {sum <- sum + selectedfeatures[sfeat, 2]; sfeat <- sfeat + 1;}

  result <- list(fit = fit, n.trees = n.trees, selectedfeatures = rownames(selectedfeatures[1:sfeat, ]))
  class(result) <- "GBM_FIT";
  return(result)
}
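The while loop keeps the top-ranked variables whose relative influences add up to about 90%. The same idea can be sketched more compactly with cumsum (illustrative only; note that the wrapper above keeps one extra feature because sfeat is incremented once more after the 90% mark is reached):

# selectedfeatures is the data frame returned by summary(fit, plotit = FALSE);
# its second column is the relative influence, which sums to 100 across variables.
sfeat <- which(cumsum(selectedfeatures[,2]) >= 90)[1]
topFeatures <- rownames(selectedfeatures)[1:sfeat]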

We also need a proper predict function for the boosting algorithm:


predict.GBM_FIT <- function(object, ...)
{
  parameters <- list(...);
  testData <- parameters[[1]];
  # Predict at a sequence of tree counts, from 10% to 100% of the fitted trees
  n.trees <- seq(from = 0.1*object$n.trees, to = object$n.trees, by = object$n.trees/25);
  pLS <- predict(object$fit, testData, n.trees = n.trees);
  # Average the ensemble of predictions and map it to a probability with the logistic function
  pLS <- 1.0/(1.0 + exp(-apply(pLS, 1, mean)));
  return(pLS);
}

Let me check that the fitting and prediction functions are working:


gfit <- GBM_fit(formula = diabetes ~ .,PimaIndiansDiabetes)
pr <- predict(gfit,PimaIndiansDiabetes)
pander::pander(table(pr > 0.5,PimaIndiansDiabetes$diabetes),caption="Training: Gradient Boost Confusion Matrix")
Training: Gradient Boost Confusion Matrix

          0     1
  FALSE   250   29
  TRUE    12    101

Now I can check the ensemble test performance of the gradient boosting method.
The following code shows five alternative sampling strategies for the cross-validation.

op <- par(no.readonly = TRUE)

GradBOOSTcv <- randomCV(PimaIndiansDiabetes,
                        "diabetes",
                        GBM_fit,
                        trainFraction = 0.5,
                        repetitions = 100
                        )

GradBOOST_NoAugmentedcv <- randomCV(PimaIndiansDiabetes,
                                    "diabetes",
                                    GBM_fit,
                                    trainFraction = 0.5,
                                    repetitions = 100,
                                    classSamplingType = "NoAugmented"
                                    )

GradBOOST_Proportionaldcv <- randomCV(PimaIndiansDiabetes,
                                      "diabetes",
                                      GBM_fit,
                                      trainFraction = 0.5,
                                      repetitions = 100,
                                      classSamplingType = "Proportional"
                                      )

GradBOOST_ProportionalBootstrapcv <- randomCV(PimaIndiansDiabetes,
                                              "diabetes",
                                              GBM_fit,
                                              trainFraction = "Bootstrap",
                                              repetitions = 100,
                                              classSamplingType = "Proportional"
                                              )

GradBOOST_NoAugmentedBootstrapcv <- randomCV(PimaIndiansDiabetes,
                                            "diabetes",
                                            GBM_fit,
                                            trainFraction = "Bootstrap",
                                            repetitions = 100,
                                            classSamplingType = "NoAugmented"
                                            )
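Each randomCV object stores the per-repetition results. The two fields used later in this tutorial are $medianTest, which pairs the observed outcome with the median test prediction of every subject (the same two-column layout as the cbind(outcome, prediction) matrix passed to predictionStats_binary below), and $trainSamplesSets, which records the sampled training partitions. A quick look at the first one:

head(GradBOOSTcv$medianTest)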

Once cross-validated, the performance results can be analyzed and plotted using the predictionStats_binary() function.


par(mfrow = c(1,2),cex = 0.5);
bs1 <- predictionStats_binary(cbind(PimaIndiansDiabetes$diabetes,pr),"Raw Training",cex = 0.8)
bs2 <- predictionStats_binary(GradBOOSTcv$medianTest,"Balanced Augmented CV",cex = 0.8)

bs3 <- predictionStats_binary(GradBOOST_NoAugmentedcv$medianTest,"Balanced No Augmented CV",cex = 0.8)
bs4 <- predictionStats_binary(GradBOOST_Proportionaldcv$medianTest,"Proportional CV",cex = 0.8)

bs5 <- predictionStats_binary(GradBOOST_ProportionalBootstrapcv$medianTest,"Bootstrapping Proportional CV",cex = 0.8)
bs6 <- predictionStats_binary(GradBOOST_NoAugmentedBootstrapcv$medianTest,"Bootstrapping Balanced CV",cex = 0.8)

par(mfrow = c(1,1),cex = 1.0);

The output of the predictionStats_binary() function provides the key performance metrics with their corresponding 95% confidence intervals.

pander::pander(bs2$accc,caption = "Accuracy")

Accuracy
  est      lower    upper
  0.7806   0.7363   0.8206

pander::pander(bs2$berror,caption = "Balanced Error")

Balanced Error
  50%      2.5%     97.5%
  0.214    0.1727   0.2558

pander::pander(bs2$aucs,caption = "AUC")

AUC
  est      lower    upper
  0.8477   0.8076   0.8879

pander::pander(bs2$sensitivity,caption = "Sensitivity")

Sensitivity
  est      lower    upper
  0.8      0.7208   0.865

pander::pander(bs2$specificity,caption = "Specificity")

Specificity
  est      lower    upper
  0.771    0.7153   0.8205
pander::pander(bs2$ClassMetrics,caption = "All Metrics")
  • accci:

    50% 2.5% 97.5%
    0.7806 0.7398 0.8214
  • senci:

    50% 2.5% 97.5%
    0.786 0.7442 0.8273
  • aucci:

    50% 2.5% 97.5%
    0.786 0.7442 0.8273
  • berci:

    50% 2.5% 97.5%
    0.214 0.1727 0.2558
  • preci:

    50% 2.5% 97.5%
    0.7601 0.7186 0.8001
  • F1ci:

    50% 2.5% 97.5%
    0.7659 0.7224 0.8071

1.4 Cross-Validation of Common ML Methods

Now I will compare this performance to that of other R methods that already come with handy fit and predict methods.


ADABOOSTcv <- randomCV(fittingFunction = fastAdaboost::adaboost,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  asFactor = TRUE,
                  nIter=10)

QDAcv <- randomCV(fittingFunction = MASS::qda,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  method = "mve")


LDAcv <- randomCV(fittingFunction = MASS::lda,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets)



logisticCV <- randomCV(fittingFunction = glm,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  family="binomial")

1.5 FRESA.CAD::BinaryBenchmark and Comparing Methods

Once all the cross-validations have been completed, we can compare their performance to five common ML methods:

  • KNN,
  • Random Forest,
  • RPART,
  • SVM, and
  • LASSO

These methods are fitted using their default parameters inside the BinaryBenchmark function:


par(op);

par(mfrow = c(2,2),cex = 0.6);

cp <- BinaryBenchmark(referenceCV = list(Gradient_Boost = GradBOOSTcv,
                                         LDA = LDAcv,
                                         QDA = QDAcv,
                                         ADABOOST = ADABOOSTcv,
                                         Logistic = logisticCV))


par(mfrow = c(1,1),cex = 1.0);

1.6 Reporting the results of the Benchmark procedure

Once done, we can compare the CV test results using the plot() function.
The plot function creates bar plots that compare the balanced error rate, the accuracy, the sensitivity, the specificity, and the area under the ROC curve, as well as the concordance index of the individual cross-validation runs.

The final two plots are heatmaps of the pairwise tests of whether the methods have similar classification performance and whether each method has a larger AUC than the other tested methods.


par(mfrow = c(1,1),cex = 1.0,xpd = T,pty = 'm', mar = c(3,3,3,10)) # Making space for the legend
prBenchmark <- plot(cp)

The plot function also generates summary tables of the CV results.

pander::pander(prBenchmark$metrics,caption = "Classifier Performance",round = 3)

Classifier Performance
       RPART   ENS     Gradient_Boost  ADABOOST  LASSO   RF      SVM     Logistic  QDA     LDA     KNN
BER    0.208   0.212   0.214           0.22      0.22    0.223   0.234   0.235     0.243   0.245   0.263
ACC    0.791   0.778   0.781           0.773     0.783   0.773   0.758   0.768     0.727   0.768   0.735
AUC    0.827   0.852   0.848           0.85      0.848   0.847   0.836   0.848     0.809   0.847   0.819
SEN    0.792   0.815   0.8             0.8       0.769   0.785   0.785   0.754     0.846   0.715   0.738
SPE    0.79    0.76    0.771           0.76      0.79    0.767   0.744   0.775     0.668   0.794   0.733
CIDX   0.777   0.852   0.837           0.81      0.845   0.844   0.831   0.843     0.796   0.844   0.81
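The metrics table can also be handled directly in R, for instance to rank the classifiers from best to worst balanced error rate (a small sketch; unlist() makes the indexing work whether the table is stored as a matrix or a data frame):

sort(unlist(prBenchmark$metrics["BER", ]))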