This tutorial shows how to use FRESA.CAD to evaluate the performance of binary classifiers.
library("FRESA.CAD")
library("mlbench")
library("fastAdaboost")
library("gbm")
We will use the PimaIndiansDiabetes2 dataset from the mlbench package.
data("PimaIndiansDiabetes2", package = "mlbench")
First, we have to condition the data set. FRESA.CAD cross-validation requires a data frame with complete cases, and the outcome has to be numeric:
* 0 for Controls, and
* 1 for Cases
PimaIndiansDiabetes <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
PimaIndiansDiabetes$diabetes <- 1*(PimaIndiansDiabetes$diabetes == "pos")
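As an optional sanity check (a minimal sketch using base R only), we can verify that no missing values remain and that the outcome is coded as 0/1:
sum(!complete.cases(PimaIndiansDiabetes))  # should be 0 after filtering
table(PimaIndiansDiabetes$diabetes)        # counts of controls (0) and cases (1)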
The cross-validation with FRESA.CAD can be done on any R function that fits binary outcomes.
The only requirement is that the model fit is called as:
> model <- fit(formula, data)
and that prediction is called as:
> pre <- predict(model, testdata)
If the fitting function does not conform to the requirements, you can always create a wrapper.
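For instance, a minimal sketch of the required interface could wrap glm(); the names logit_fit and LOGIT_FIT are hypothetical and not part of FRESA.CAD:
logit_fit <- function(formula, data, ...)
{
  # Fit a logistic regression and wrap it in a classed list
  result <- list(fit = glm(formula, data = data, family = binomial(), ...));
  class(result) <- "LOGIT_FIT";
  return(result)
}
predict.LOGIT_FIT <- function(object, ...)
{
  testData <- list(...)[[1]];
  # Return predicted probabilities for the test data
  return(predict(object$fit, testData, type = "response"))
}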
Here we will show how to create a wrapper for the gradient boosting method of the gbm package.
The following code defines the gbm fitting wrapper:
GBM_fit <- function(formula = formula, data = NULL, distribution = "bernoulli", n.trees = 1000,
                    shrinkage = 0.01, interaction.depth = 4, ...)
{
  # Fit the gradient boosting model
  fit <- gbm(formula = formula, data = data, distribution = distribution, n.trees = n.trees,
             shrinkage = shrinkage, interaction.depth = interaction.depth, ...);
  # Rank the features by relative influence
  selectedfeatures <- summary(fit, plotit = FALSE);
  # Keep the top-ranked features whose relative influence adds up to 90%
  sum <- 0;
  sfeat <- 0;
  while (sum < 90)
  {
    sfeat <- sfeat + 1;
    sum <- sum + selectedfeatures[sfeat, 2];
  }
  result <- list(fit = fit, n.trees = n.trees,
                 selectedfeatures = rownames(selectedfeatures)[1:sfeat]);
  class(result) <- "GBM_FIT";
  return(result)
}
We also need a proper predict function for the boosting algorithm:
predict.GBM_FIT <- function(object, ...)
{
  parameters <- list(...);
  testData <- parameters[[1]];
  # Grid of tree counts, from 10% to 100% of the fitted trees
  n.trees <- seq(from = 0.1 * object$n.trees, to = object$n.trees, by = object$n.trees / 25);
  pLS <- predict(object$fit, testData, n.trees = n.trees);
  # Average the link predictions over the grid and map them to probabilities
  pLS <- 1.0 / (1.0 + exp(-apply(pLS, 1, mean)));
  return(pLS);
}
Let us check that the fitting and prediction functions work on the training data:
gfit <- GBM_fit(formula = diabetes ~ .,PimaIndiansDiabetes)
pr <- predict(gfit,PimaIndiansDiabetes)
pander::pander(table(pr > 0.5,PimaIndiansDiabetes$diabetes),caption="Training: Gradient Boost Confusion Matrix")
| | 0 | 1 |
|---|---|---|
| FALSE | 250 | 29 |
| TRUE | 12 | 101 |
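From the confusion matrix above we can, for example, recover the apparent (training) accuracy; note that this resubstitution estimate is optimistic compared with the cross-validated results below:
cm <- table(pr > 0.5, PimaIndiansDiabetes$diabetes)
sum(diag(cm)) / sum(cm)   # apparent accuracy, about 0.90 for the table above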
Now we can estimate the cross-validated test performance of the gradient boosting method.
The following code shows five alternative sampling schemes for the cross-validation:
op <- par(no.readonly = TRUE)
GradBOOSTcv <- randomCV(PimaIndiansDiabetes,
"diabetes",
GBM_fit,
trainFraction = 0.5,
repetitions = 100
)
GradBOOST_NoAugmentedcv <- randomCV(PimaIndiansDiabetes,
"diabetes",
GBM_fit,
trainFraction = 0.5,
repetitions = 100,
classSamplingType = "NoAugmented"
)
GradBOOST_Proportionaldcv <- randomCV(PimaIndiansDiabetes,
"diabetes",
GBM_fit,
trainFraction = 0.5,
repetitions = 100,
classSamplingType = "Proportional"
)
GradBOOST_ProportionalBootstrapcv <- randomCV(PimaIndiansDiabetes,
"diabetes",
GBM_fit,
trainFraction = "Bootstrap",
repetitions = 100,
classSamplingType = "Proportional"
)
GradBOOST_NoAugmentedBootstrapcv <- randomCV(PimaIndiansDiabetes,
"diabetes",
GBM_fit,
trainFraction = "Bootstrap",
repetitions = 100,
classSamplingType = "NoAugmented"
)
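Each randomCV object stores, among other elements, the per-subject median test predictions (medianTest, used below) and the train/test partitions (trainSamplesSets, reused later); its structure mirrors the cbind(outcome, prediction) matrix we built for the training data above. A quick look at the gradient boosting result:
head(GradBOOSTcv$medianTest)   # observed outcome and median test prediction per subject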
Once cross-validated, the performance results can be analyzed and plotted using the predictionStats_binary() function.
par(mfrow = c(1,2),cex = 0.5);
bs1 <- predictionStats_binary(cbind(PimaIndiansDiabetes$diabetes,pr),"Raw Training",cex = 0.8)
bs2 <- predictionStats_binary(GradBOOSTcv$medianTest,"Balanced Augmented CV",cex = 0.8)
bs3 <- predictionStats_binary(GradBOOST_NoAugmentedcv$medianTest,"Balanced No Augmented CV",cex = 0.8)
bs4 <- predictionStats_binary(GradBOOST_Proportionaldcv$medianTest,"Proportional CV",cex = 0.8)
bs5 <- predictionStats_binary(GradBOOST_ProportionalBootstrapcv$medianTest,"Bootstraping Proportional CV",cex = 0.8)
bs6 <- predictionStats_binary(GradBOOST_NoAugmentedBootstrapcv$medianTest,"Bootstraping Balanced CV",cex = 0.8)
par(mfrow = c(1,1),cex = 1.0);
The output of the predictionStats_binary() function provides key performance metrics with their corresponding 95% confidence intervals:
pander::pander(bs2$accc,caption = "Accuracy")
| est | lower | upper |
|---|---|---|
| 0.7806 | 0.7363 | 0.8206 |
pander::pander(bs2$berror,caption = "Balanced Error")
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.214 | 0.1727 | 0.2558 |
pander::pander(bs2$aucs,caption = "AUC")
| est | lower | upper |
|---|---|---|
| 0.8477 | 0.8076 | 0.8879 |
pander::pander(bs2$sensitivity,caption = "Sensitivity")
| est | lower | upper |
|---|---|---|
| 0.8 | 0.7208 | 0.865 |
pander::pander(bs2$specificity,caption = "Specificity")
| est | lower | upper |
|---|---|---|
| 0.771 | 0.7153 | 0.8205 |
pander::pander(bs2$ClassMetrics,caption = "All Metrics")
accci (accuracy):
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.7806 | 0.7398 | 0.8214 |
senci (sensitivity):
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.786 | 0.7442 | 0.8273 |
aucci (ROC AUC):
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.786 | 0.7442 | 0.8273 |
berci (balanced error rate):
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.214 | 0.1727 | 0.2558 |
preci (precision):
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.7601 | 0.7186 | 0.8001 |
F1ci (F1 score):
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.7659 | 0.7224 | 0.8071 |
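The same summaries can be extracted for the other sampling schemes; for example, to compare the accuracy of the balanced augmented CV with the proportional and bootstrap runs:
pander::pander(bs4$accc, caption = "Accuracy: Proportional CV")
pander::pander(bs5$accc, caption = "Accuracy: Bootstrapping Proportional CV")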
Now we compare the performance to other R methods that already provide suitable fit and predict functions. To make the comparison paired, we reuse the train/test partitions stored in GradBOOSTcv$trainSamplesSets.
ADABOOSTcv <- randomCV(fittingFunction = fastAdaboost::adaboost,
trainSampleSets = GradBOOSTcv$trainSamplesSets,
asFactor = TRUE,
nIter=10)
QDAcv <- randomCV(fittingFunction = MASS::qda,
trainSampleSets = GradBOOSTcv$trainSamplesSets,
method = "mve")
LDAcv <- randomCV(fittingFunction = MASS::lda,
trainSampleSets = GradBOOSTcv$trainSamplesSets)
logisticCV <- randomCV(fittingFunction = glm,
trainSampleSets = GradBOOSTcv$trainSamplesSets,
family="binomial")
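Before running the full benchmark, any of these cross-validations can be inspected on its own with predictionStats_binary(); for example, the LDA run:
ldaStats <- predictionStats_binary(LDAcv$medianTest, "LDA CV", cex = 0.8)
pander::pander(ldaStats$aucs, caption = "LDA AUC")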
Once all the cross-validations have been completed, we can compare their performance to five additional common ML methods.
These methods are fitted with their default parameters inside the BinaryBenchmark() function:
par(op);
par(mfrow = c(2,2),cex = 0.6);
cp <- BinaryBenchmark(referenceCV = list(Gradient_Boost = GradBOOSTcv,
LDA = LDAcv,
QDA = QDAcv,
ADABOOST = ADABOOSTcv,
Logistic = logisticCV))
par(mfrow = c(1,1),cex = 1.0);
Once done, we can compare the CV test results using the plot() function.
The plot function creates bar plots that compare the balanced error rate, the accuracy, the sensitivity, the specificity, and the area under the ROC curve (AUC), as well as the concordance index of the individual cross-validation runs.
The final two plots provide heatmaps that test whether the methods have similar classification performance and whether each method has a significantly larger AUC than the other tested methods.
par(mfrow = c(1,1),cex = 1.0,xpd = T,pty = 'm', mar = c(3,3,3,10)) # Making space for the legend
prBenchmark <- plot(cp)
The plot function also generates summary tables of the CV results.
pander::pander(prBenchmark$metrics,caption = "Classifier Performance",round = 3)
| | RPART | ENS | Gradient_Boost | ADABOOST | LASSO | RF | SVM |
|---|---|---|---|---|---|---|---|
| BER | 0.208 | 0.212 | 0.214 | 0.22 | 0.22 | 0.223 | 0.234 |
| ACC | 0.791 | 0.778 | 0.781 | 0.773 | 0.783 | 0.773 | 0.758 |
| AUC | 0.827 | 0.852 | 0.848 | 0.85 | 0.848 | 0.847 | 0.836 |
| SEN | 0.792 | 0.815 | 0.8 | 0.8 | 0.769 | 0.785 | 0.785 |
| SPE | 0.79 | 0.76 | 0.771 | 0.76 | 0.79 | 0.767 | 0.744 |
| CIDX | 0.777 | 0.852 | 0.837 | 0.81 | 0.845 | 0.844 | 0.831 |
| | Logistic | QDA | LDA | KNN |
|---|---|---|---|---|
| BER | 0.235 | 0.243 | 0.245 | 0.263 |
| ACC | 0.768 | 0.727 | 0.768 | 0.735 |
| AUC | 0.848 | 0.809 | 0.847 | 0.819 |
| SEN | 0.754 | 0.846 | 0.715 | 0.738 |
| SPE | 0.775 | 0.668 | 0.794 | 0.733 |
| CIDX | 0.843 | 0.796 | 0.844 | 0.81 |
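Since prBenchmark$metrics holds one column per classifier, we can, for instance, rank the methods by their balanced error rate (a minimal sketch; unlist() guards against the table being stored as a data frame rather than a matrix):
sort(unlist(prBenchmark$metrics["BER", ]))   # lowest balanced error first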