1 Simple Cross-Validation of Common ML Methods

This tutorial will guide users on how to use FRESA.CAD to evaluate the performance of binary classifiers.

1.1 The required libraries


library("FRESA.CAD")
library("mlbench")
library("fastAdaboost")
library("gbm")

1.2 PimaIndiansDiabetes Data Set

I will use the PimaIndiansDiabetes2 dataset from the mlbench package.

data("PimaIndiansDiabetes2", package = "mlbench")

We have to condition the data set.
FRESA.CAD cross-validation requires a data frame with complete cases. Also, the outcome has to be numeric.
* 0 for Controls, and
* 1 for Cases


PimaIndiansDiabetes  <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
PimaIndiansDiabetes$diabetes <- 1*(PimaIndiansDiabetes$diabetes == "pos")
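A quick sanity check of the conditioned data (an optional sketch) confirms that no missing values remain and shows the class balance:

sum(!complete.cases(PimaIndiansDiabetes))   # should be 0
table(PimaIndiansDiabetes$diabetes)         # counts of controls (0) and cases (1)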

1.3 Gradient boosting from the gbm package

Cross-validation with FRESA.CAD can be done on any R function that fits binary outcomes.
The only requirement is that the model fit has to be done as:

>model <- fit(formula,data),

and the predict method must be called as:

>pre <- predict(model,testdata)

If the fitting function does not conform to the requirements, you can always create a wrapper.
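In general, such a wrapper is just a pair of functions: a fit function that returns an object with a class attribute, and a matching S3 predict method that returns a numeric score for each test-set row. The following is a minimal sketch of that pattern, with hypothetical names (myMethod_fit, MY_FIT) and glm as a stand-in learner:

myMethod_fit <- function(formula, data, ...)
{
  fit <- glm(formula, data = data, family = binomial)   # any underlying learner
  result <- list(fit = fit)
  class(result) <- "MY_FIT"                             # class used for S3 predict dispatch
  return(result)
}

predict.MY_FIT <- function(object, ...)
{
  testData <- list(...)[[1]]                            # randomCV calls predict(model, testdata)
  return(predict(object$fit, testData, type = "response"))  # numeric score per test row
}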
Here we will show how to create a wrapper for the gradient boosting method of the gbm package.

The following code shows the gbm wrapper function:


GBM_fit <- function(formula = formula, data = NULL, distribution = "bernoulli", n.trees = 1000,
                    shrinkage = 0.01, interaction.depth = 4, ...)
{
  fit <- gbm(formula = formula, data = data, distribution = distribution, n.trees = n.trees,
             shrinkage = shrinkage, interaction.depth = interaction.depth, ...);
  # Rank the variables by relative influence (summary.gbm returns them already sorted)
  selectedfeatures <- summary(fit, plotit = FALSE);
  sum <- 0;
  sfeat <- 1;
  # Keep the top-ranked variables whose relative influences add up to 90%
  while (sum < 90) {sum <- sum + selectedfeatures[sfeat, 2]; sfeat <- sfeat + 1;}

  result <- list(fit = fit, n.trees = n.trees, selectedfeatures = rownames(selectedfeatures[1:sfeat, ]))
  class(result) <- "GBM_FIT";
  return(result)
}
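The while loop keeps the top-ranked variables whose relative influences add up to about 90%. The same idea can be sketched more compactly with cumsum (illustrative only; note that the wrapper above keeps one extra feature because sfeat is incremented once more after the 90% mark is reached):

# selectedfeatures is the data frame returned by summary(fit, plotit = FALSE);
# its second column is the relative influence, which sums to 100 across variables.
sfeat <- which(cumsum(selectedfeatures[,2]) >= 90)[1]
topFeatures <- rownames(selectedfeatures)[1:sfeat]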

We also need a proper predict function for the boosting algorithm:


predict.GBM_FIT <- function(object, ...)
{
  parameters <- list(...);
  testData <- parameters[[1]];
  # Predict at a sequence of tree counts, from 10% to 100% of the fitted trees
  n.trees <- seq(from = 0.1*object$n.trees, to = object$n.trees, by = object$n.trees/25);
  pLS <- predict(object$fit, testData, n.trees = n.trees);
  # Average the ensemble of predictions and map it to a probability with the logistic function
  pLS <- 1.0/(1.0 + exp(-apply(pLS, 1, mean)));
  return(pLS);
}

Let me check that the fitting and prediction functions are working:


gfit <- GBM_fit(formula = diabetes ~ .,PimaIndiansDiabetes)
pr <- predict(gfit,PimaIndiansDiabetes)
pander::pander(table(pr > 0.5,PimaIndiansDiabetes$diabetes),caption="Training: Gradient Boost Confusion Matrix")
Training: Gradient Boost Confusion Matrix

          0     1
  FALSE   250   29
  TRUE    12    101

Now I can check the ensemble test performance of the gradient boosting method.
The following code shows five alternative sampling strategies for the cross-validation.

op <- par(no.readonly = TRUE)

GradBOOSTcv <- randomCV(PimaIndiansDiabetes,
                        "diabetes",
                        GBM_fit,
                        trainFraction = 0.5,
                        repetitions = 100
                        )

GradBOOST_NoAugmentedcv <- randomCV(PimaIndiansDiabetes,
                                    "diabetes",
                                    GBM_fit,
                                    trainFraction = 0.5,
                                    repetitions = 100,
                                    classSamplingType = "NoAugmented"
                                    )

GradBOOST_Proportionaldcv <- randomCV(PimaIndiansDiabetes,
                                      "diabetes",
                                      GBM_fit,
                                      trainFraction = 0.5,
                                      repetitions = 100,
                                      classSamplingType = "Proportional"
                                      )

GradBOOST_ProportionalBootstrapcv <- randomCV(PimaIndiansDiabetes,
                                              "diabetes",
                                              GBM_fit,
                                              trainFraction = "Bootstrap",
                                              repetitions = 100,
                                              classSamplingType = "Proportional"
                                              )

GradBOOST_NoAugmentedBootstrapcv <- randomCV(PimaIndiansDiabetes,
                                            "diabetes",
                                            GBM_fit,
                                            trainFraction = "Bootstrap",
                                            repetitions = 100,
                                            classSamplingType = "NoAugmented"
                                            )
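Each randomCV object stores the per-repetition results. The two fields used later in this tutorial are $medianTest, which pairs the observed outcome with the median test prediction of every subject (the same two-column layout as the cbind(outcome, prediction) matrix passed to predictionStats_binary below), and $trainSamplesSets, which records the sampled training partitions. A quick look at the first one:

head(GradBOOSTcv$medianTest)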

Once cross-validated, the performance results can be analyzed and plotted using the predictionStats_binary() function.


par(mfrow = c(1,2),cex = 0.5);
bs1 <- predictionStats_binary(cbind(PimaIndiansDiabetes$diabetes,pr),"Raw Training",cex = 0.8)
bs2 <- predictionStats_binary(GradBOOSTcv$medianTest,"Balanced Augmented CV",cex = 0.8)

bs3 <- predictionStats_binary(GradBOOST_NoAugmentedcv$medianTest,"Balanced No Augmented CV",cex = 0.8)
bs4 <- predictionStats_binary(GradBOOST_Proportionaldcv$medianTest,"Proportional CV",cex = 0.8)

bs5 <- predictionStats_binary(GradBOOST_ProportionalBootstrapcv$medianTest,"Bootstrapping Proportional CV",cex = 0.8)
bs6 <- predictionStats_binary(GradBOOST_NoAugmentedBootstrapcv$medianTest,"Bootstrapping Balanced CV",cex = 0.8)

par(mfrow = c(1,1),cex = 1.0);

The output of the predictionStats_binary() function provides the key performance metrics with their corresponding 95% confidence intervals.

pander::pander(bs2$accc,caption = "Accuracy")

Accuracy
  est      lower    upper
  0.7806   0.7363   0.8206

pander::pander(bs2$berror,caption = "Balanced Error")

Balanced Error
  50%      2.5%     97.5%
  0.214    0.1727   0.2558

pander::pander(bs2$aucs,caption = "AUC")

AUC
  est      lower    upper
  0.8477   0.8076   0.8879

pander::pander(bs2$sensitivity,caption = "Sensitivity")

Sensitivity
  est      lower    upper
  0.8      0.7208   0.865

pander::pander(bs2$specificity,caption = "Specificity")

Specificity
  est      lower    upper
  0.771    0.7153   0.8205
pander::pander(bs2$ClassMetrics,caption = "All Metrics")
  • accci:

    50% 2.5% 97.5%
    0.7806 0.7398 0.8214
  • senci:

    50% 2.5% 97.5%
    0.786 0.7442 0.8273
  • aucci:

    50% 2.5% 97.5%
    0.786 0.7442 0.8273
  • berci:

    50% 2.5% 97.5%
    0.214 0.1727 0.2558
  • preci:

    50% 2.5% 97.5%
    0.7601 0.7186 0.8001
  • F1ci:

    50% 2.5% 97.5%
    0.7659 0.7224 0.8071

1.4 Cross-Validation of Common ML Methods

Now I will compare this performance to that of other R methods that already come with handy fit and predict methods.


ADABOOSTcv <- randomCV(fittingFunction = fastAdaboost::adaboost,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  asFactor = TRUE,
                  nIter=10)

QDAcv <- randomCV(fittingFunction = MASS::qda,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  method = "mve")


LDAcv <- randomCV(fittingFunction = MASS::lda,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets)



logisticCV <- randomCV(fittingFunction = glm,
                  trainSampleSets = GradBOOSTcv$trainSamplesSets,
                  family="binomial")

1.5 FRESA.CAD::BinaryBenchmark and Comparing Methods

Once all the cross-validations have been completed, we can compare their performance to five common ML methods:

  • KNN,
  • Random Forest,
  • RPART,
  • SVM, and
  • LASSO

These methods are fitted using their default parameters inside the BinaryBenchmark function:


par(op);

par(mfrow = c(2,2),cex = 0.6);

cp <- BinaryBenchmark(referenceCV = list(Gradient_Boost = GradBOOSTcv,
                                         LDA = LDAcv,
                                         QDA = QDAcv,
                                         ADABOOST = ADABOOSTcv,
                                         Logistic = logisticCV))


par(mfrow = c(1,1),cex = 1.0);

1.6 Reporting the results of the Benchmark procedure

Once done, we can compare the CV test results using the plot() function.
The plot function creates bar plots that compare the balanced error rate, the accuracy, the sensitivity, the specificity, and the area under the ROC curve, as well as the concordance index of the individual cross-validation runs.

The final two plots are heatmaps of the pairwise tests of whether the methods have similar classification performance and whether each method has a larger AUC than the other tested methods.


par(mfrow = c(1,1),cex = 1.0,xpd = T,pty = 'm', mar = c(3,3,3,10)) # Making space for the legend
prBenchmark <- plot(cp)

The plot function also generates summary tables of the CV results.

pander::pander(prBenchmark$metrics,caption = "Classifier Performance",round = 3)

Classifier Performance
       RPART   ENS     Gradient_Boost  ADABOOST  LASSO   RF      SVM     Logistic  QDA     LDA     KNN
BER    0.208   0.212   0.214           0.22      0.22    0.223   0.234   0.235     0.243   0.245   0.263
ACC    0.791   0.778   0.781           0.773     0.783   0.773   0.758   0.768     0.727   0.768   0.735
AUC    0.827   0.852   0.848           0.85      0.848   0.847   0.836   0.848     0.809   0.847   0.819
SEN    0.792   0.815   0.8             0.8       0.769   0.785   0.785   0.754     0.846   0.715   0.738
SPE    0.79    0.76    0.771           0.76      0.79    0.767   0.744   0.775     0.668   0.794   0.733
CIDX   0.777   0.852   0.837           0.81      0.845   0.844   0.831   0.843     0.796   0.844   0.81
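The metrics table can also be handled directly in R, for instance to rank the classifiers from best to worst balanced error rate (a small sketch; unlist() makes the indexing work whether the table is stored as a matrix or a data frame):

sort(unlist(prBenchmark$metrics["BER", ]))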