This report was created as the final assignment for the Machine Learning class of the Masters in Information Engineering at the University of Porto Engineering School. It intends to demonstrate the application of machine learning concepts and techniques learnt in class to a real dataset.

Because I am a Medical Doctor doing a Family Medicine Residency, I chose a dataset on Diabetic Retinopathy, which is a complication of Diabetes, a disease that I deal with every day in my medical practice.

I present the results of my work and compare them with a paper published on the same dataset: Balint Antal, Andras Hajdu: An ensemble-based system for automatic screening of diabetic retinopathy, Knowledge-Based Systems 60 (April 2014), 20-27.

Diabetic Retinopathy

Diabetic retinopathy is the leading cause of new blindness in persons aged 25-74 years in the United States. The exact mechanism by which diabetes causes retinopathy remains unclear, but several theories have been postulated to explain the typical course and history of the disease.

The disease can be diagnosed by fluorescein angiography and fundus examination, looking for characteristic features such as microaneurysms and exudates.

Adapted from Medscape

Retinal Fundoscopy

The aforementioned features can be seen in fundoscopy images such as the one shown below.

Retinal Fundus

From Arleo Eye Associates

Debrecen Diabetic Retinopathy Dataset

This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not. All features represent either a detected lesion, a descriptive feature of an anatomical part, or an image-level descriptor.

The underlying image analysis and feature extraction methods, as well as the original classification technique, are described in Balint Antal, Andras Hajdu: An ensemble-based system for automatic screening of diabetic retinopathy, Knowledge-Based Systems 60 (April 2014), 20-27.

The dataset was obtained from the UCI Machine Learning Repository.

Data Set Characteristics

Number of Instances: 1151
Area: Life
Attribute Characteristics: Integer, Real
Number of Attributes: 20
Date Donated: 2014-11-03
Associated Tasks: Classification
Missing Values: N/A
Number of Web Hits: 16095

Feature indexes

q - The binary result of quality assessment. 0 = bad quality 1 = sufficient quality.

ps - The binary result of pre-screening, where 1 indicates severe retinal abnormality and 0 its lack.

nma.a - nma.f - The results of microaneurysm detection. Each feature value stands for the number of microaneurysms found at the confidence levels alpha = 0.5, ..., 1, respectively.

nex.a - nex.h - These contain the same information as nma.a - nma.f for exudates. However, as exudates are represented by a set of points rather than the number of pixels constructing the lesions, these features are normalized by dividing the number of lesions by the diameter of the ROI to compensate for different image sizes.

dd - The Euclidean distance between the center of the macula and the center of the optic disc, which provides important information regarding the patient's condition. This feature is also normalized by the diameter of the ROI.

dm - The diameter of the optic disc.

amfm - The binary result of the AM/FM-based classification.

class - Class label. 1 = contains signs of Diabetic Retinopathy, 0 = no signs of Diabetic Retinopathy.

Problem statement

Classify the patient's retina as showing signs of diabetic retinopathy or not, taking into consideration the available features.

Reported performance

The original Debrecen paper reports an AUC of 0.989. It was obtained by training an ensemble on all the features, using different model selection techniques for the ensemble and 10-fold cross-validation.

The models considered for the ensemble were:

  • Alternating Decision Tree

  • kNN

  • AdaBoost

  • Multilayer Perceptron

  • Naive Bayes

  • Random Forest

  • SVM

  • Pattern classifier

Overall, the authors report very good results (AUC > 0.9 in many cases), both for the best-performing single model and for the different ensemble strategies.

To my knowledge, there are no other papers published on this dataset.

Data exploration

First we load the data:

library(foreign)       # read.arff
library(reshape2)      # melt
library(ggplot2)       # plotting
library(GGally)        # ggcorr
library(kohonen)       # som, somgrid
library(caret)         # createFolds
library(ROCR)          # prediction, performance
library(e1071)         # svm, naiveBayes
library(randomForest)  # randomForest
library(class)         # knn
library(adabag)        # boosting, predict.boosting
library(nnet)          # nnet
library(MASS)          # lda
library(knitr)         # kable

loadData = function(path){
    
    df = read.arff(path)
    colnames(df) <- c(
        "q",      #  0 The binary result of quality assessment. 0 = bad quality 1 = sufficient quality
        "ps",     #  1 The binary result of pre-screening, 1 indicates severe retinal abnormality and 0 its lack
        "nma.a",  #  2 Number of MAs found at the confidence levels alpha = 0.5
        "nma.b",  #  3 Number of MAs found at the confidence levels alpha = 0.6
        "nma.c",  #  4 Number of MAs found at the confidence levels alpha = 0.7
        "nma.d",  #  5 Number of MAs found at the confidence levels alpha = 0.8
        "nma.e",  #  6 Number of MAs found at the confidence levels alpha = 0.9
        "nma.f",  #  7 Number of MAs found at the confidence levels alpha = 1.0
        "nex.a",  #  8 Number of Exudates found at the confidence levels alpha = 0.5
        "nex.b",  #  9 Number of Exudates found at the confidence levels alpha = 0.6
        "nex.c",  # 10 Number of Exudates found at the confidence levels alpha = 0.7
        "nex.d",  # 11 Number of Exudates found at the confidence levels alpha = 0.8
        "nex.e",  # 12 Number of Exudates found at the confidence levels alpha = 0.9
        "nex.g",  # 13 Number of Exudates found at the confidence levels alpha = 1.0
        "nex.f",  # 14 Number of Exudates found at the confidence levels alpha = 1.0
        "nex.h",  # 15 Number of Exudates found at the confidence levels alpha = 1.0
        "dd",     # 16 The euclidean distance of the center of the macula and the center of the optic disc
        "dm",     # 17 The diameter of the optic disc
        "amfm",   # 18 The binary result of the AM/FM-based classification
        "class"   # 19 Class label. 1 = contains signs of DR, 0 = no signs of DR
    )
    odf = df
    
    colnames(df)
    numericFeats = c(3:16)
    eyeFeats = c(17,18)
    df[, c(numericFeats, eyeFeats)] = scale(df[, c(numericFeats, eyeFeats)])
    
    df$class = as.factor(df$class)
    return(df)
}
df = loadData("./dataset.arff")
long = melt(df[, 1:(ncol(df)-1)])
## No id variables; using all as measure variables
ggplot(long) + 
    geom_boxplot(aes(variable, value)) + 
    coord_flip() +
    labs(title="Unimodal feature distribution", x='Feature', y='Scaled value')

We can see that the nma.* and nex.* features have one-sided, heavy-tailed distributions. Now let's look at the correlations between the features:

ggcorr(df) + 
    labs(title="Feature correlation matrix")

We see that there are strong positive correlations among the nma.* features and among the nex.* features, with the latter set showing stronger relationships between adjacent features. Now let's look at some of the feature pairs:

ggplot(df) + geom_point(aes(nma.a, nma.b, color=class)) + facet_wrap(~amfm)

ggplot(df) + geom_point(aes(nma.a, nma.f, color=class)) + facet_wrap(~amfm)

ggplot(df) + geom_point(aes(nex.a, nex.b, color=class)) + facet_wrap(~amfm)

ggplot(df) + geom_point(aes(nex.a, nex.h, color=class)) + facet_wrap(~amfm)

ggplot(df) + geom_point(aes(nma.a, nex.h, color=class)) + facet_wrap(~amfm)

We see that in these projections a large fraction of the points is separable conditional on the class. However, a smaller but non-negligible number of points does not appear to be linearly separable in these projections.

Finally, let's look at a reduced 2-D projection of the data and the class variable on a self-organizing map (SOM):

cols = colnames(df)[1:(ncol(df)-1)]
base = df[,cols]
som_grid = somgrid(xdim=15, ydim=15, topo="hexagonal")
som_model = som(as.matrix(base), grid=som_grid, rlen=200)

plot(som_model, type="property", property = df$nma.a)
title("SOM colored for nma.a feature")

plot(som_model, type="property", property = df$nma.e)
title("SOM colored for nma.e feature")

plot(som_model, type="property", property = df$nex.a)
title("SOM colored for nex.a feature")

plot(som_model, type="property", property = df$nex.h)
title("SOM colored for nex.h feature")

plot(som_model, type="property", property = as.integer(df$class)-1)
title("SOM colored for class")

We can see that there is no clear region where the class labels lie preferentially on the SOM, nor does there seem to be a preferential arrangement of the features across the map.
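As a side note, the property plots in the kohonen package expect one value per map unit rather than one value per observation; below is a minimal sketch, under that assumption, of aggregating a feature to unit-level means before plotting (variable names are illustrative only):

# Mean of nma.a over the observations mapped to each SOM unit; units with no
# mapped observations are left as NA.
unitMeans = aggregate(base, by=list(unit=som_model$unit.classif), FUN=mean)
unitProperty = rep(NA, nrow(som_model$grid$pts))
unitProperty[unitMeans$unit] = unitMeans$nma.a
plot(som_model, type="property", property=unitProperty)
title("SOM unit means for nma.a")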

Building the classifier

We will try to build a classifier to predict the class based on the feature set. We will begin by building simple models, then calibrate them, and finally try to combine them into an ensemble.

Learners

The following are the learners we will consider. These are simple wrapper functions that expose a common API over the base algorithms.

helpers.result = function(pred, real){
    return(data.frame(pred=pred, real=real, ok=(pred==real)))
}

fitpredict.svm = function(formula, trainingSet, validationSet, kernel="polynomial", degree=2, coef0=1){
    model = svm(formula, trainingSet, kernel=kernel, degree=degree, coef0=coef0)
    pred = as.integer(predict(model, validationSet))
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}

fitpredict.forest = function(formula, trainingSet, validationSet, ntree=100){
    model = randomForest(formula, trainingSet, ntree=ntree)
    pred = as.integer(predict(model, validationSet))
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}

fitpredict.knn = function(formula, trainingSet, validationSet, k=5){
    cols = all.vars(formula)[2:length(all.vars(formula))]
    pred = as.integer(knn(trainingSet[,cols], validationSet[,cols], trainingSet$class, k=k))
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}

fitpredict.adaboost = function(formula, trainingSet, validationSet, boos=TRUE, mfinal=10, coeflearn='Breiman'){
    model = boosting(formula, trainingSet, boos=boos, mfinal=mfinal, coeflearn=coeflearn)
    pred = as.integer(predict.boosting(model, validationSet)$class)+1
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}

fitpredict.nnet = function(formula, trainingSet, validationSet, size=10){
    model = nnet(formula, trainingSet, size=size, trace=FALSE)
    pred = as.integer(round(predict(model, validationSet)))+1
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}

fitpredict.naiveBayes = function(formula, trainingSet, validationSet){
    model = naiveBayes(formula, trainingSet)
    pred = as.integer(predict(model, validationSet))
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}

fitpredict.lda = function(formula, trainingSet, validationSet){
    model = lda(formula, trainingSet)
    pred = as.integer(data.frame(predict(model, validationSet))$class)
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(pred, real))
}
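As a quick illustration of this common API (a hypothetical example, not part of the analysis itself), any of the wrappers can be applied directly to a train/validation split:

set.seed(1)
idx = sample(nrow(df), round(0.7 * nrow(df)))
res = fitpredict.svm(class ~ nma.a + nex.a + dd, df[idx, ], df[-idx, ])
head(res)      # predicted vs. real labels (coded 1/2) and a correctness flag
mean(res$ok)   # raw accuracy on the held-out split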

Model performance measurements

Because we are dealing with a binary classification task, we will consider the AUC as the main performance metric. This metric gives us combined information regarding the sensitivity and specificity of the models.

perf.auc = function(yy){
    
    # ROCR's prediction() takes the predictions first and the true labels second.
    return(performance(prediction(yy$pred, yy$real), "auc")@y.values[[1]])
}
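Note that, because the wrappers return hard class labels (coded 1/2) rather than probability scores, the AUC computed this way reduces to the balanced accuracy, (sensitivity + specificity) / 2. A small sketch with toy, hypothetical values illustrates the equivalence:

library(ROCR)
pred = c(2, 2, 1, 1, 2, 1)   # hard predictions (toy values)
real = c(2, 1, 1, 1, 2, 2)   # true labels (toy values)
auc  = performance(prediction(pred, real), "auc")@y.values[[1]]
sens = sum(pred == 2 & real == 2) / sum(real == 2)
spec = sum(pred == 1 & real == 1) / sum(real == 1)
c(auc = auc, balanced = (sens + spec) / 2)   # both equal 2/3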

K Fold validation

The models will be evaluated using k-fold cross-validation, implemented in the following block:

kfoldValidate = function(formula, data, learner, performance, ...){
    
    k=10
    ks = c(1:k)
    folds = createFolds(data$class, k, list=FALSE)
    result = rep(0, k)
    
    for (i in ks){
        
        trainingKs = ks[ks!=i]
        validationKs = ks[ks==i]
        trainingSet = data[which(folds %in% trainingKs), ]
        validationSet = data[which(folds %in% validationKs), ]
        result[i] = performance(learner(formula, trainingSet, validationSet, ...))
    }
    
    return(list(mean=mean(result), sd=sd(result), results=c(result)))
}

Let's test each model using the full list of features, evaluated by AUC:

f1 = as.formula(class ~ ps + 
                    nma.a + nma.b + nma.c + nma.d + nma.e + nma.f + 
                    nex.a + nex.b + nex.c + nex.d + nex.e + nex.g + nex.f + nex.h + 
                    dd + dm + amfm)

resultsF1v = c(
    svm        = kfoldValidate(f1, data=df, learner=fitpredict.svm,        performance=perf.auc)$mean,
    forest     = kfoldValidate(f1, data=df, learner=fitpredict.forest,     performance=perf.auc)$mean,
    knn        = kfoldValidate(f1, data=df, learner=fitpredict.knn,        performance=perf.auc)$mean,
    adaboost   = kfoldValidate(f1, data=df, learner=fitpredict.adaboost,   performance=perf.auc)$mean,
    nnet       = kfoldValidate(f1, data=df, learner=fitpredict.nnet,       performance=perf.auc)$mean,
    naiveBayes = kfoldValidate(f1, data=df, learner=fitpredict.naiveBayes, performance=perf.auc)$mean,
    lda        = kfoldValidate(f1, data=df, learner=fitpredict.lda,        performance=perf.auc)$mean
)
resultsF1 = data.frame(f1_AUC=resultsF1v)
kable(resultsF1, caption="Model performance for F1")
Model performance for F1
f1_AUC
svm 0.7334677
forest 0.6979175
knn 0.6197395
adaboost 0.6756170
nnet 0.7002981
naiveBayes 0.6769425
lda 0.7387913

We see that without any parameter tuning the results are not very good. We achieve the highest result with Linear Discriminant Analysis (LDA).

Now let's try performing PCA on the dataset:

cols = all.vars(f1)[2:length(all.vars(f1))]
fit = prcomp(df[,cols], center=T, scale=T)

df = cbind(df, fit$x)
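Before building models on the components, it is worth checking how much variance each one explains; a quick sketch (the report keeps the first twelve components):

# Proportion of variance explained by each principal component and the
# cumulative proportion, useful for deciding how many components to keep.
explained = fit$sdev^2 / sum(fit$sdev^2)
round(rbind(proportion = explained, cumulative = cumsum(explained)), 3)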

Let's see if we can achieve better results considering only the principal components:

f2 = as.formula(class ~ PC1 + PC2 +  PC3 +  PC4 +  PC5 +  PC6 +  PC7 + PC8 + PC9 + PC10 + PC11 + PC12 )

resultsF2v = c(
    svm        = kfoldValidate(f2, data=df, learner=fitpredict.svm,        performance=perf.auc)$mean,
    forest     = kfoldValidate(f2, data=df, learner=fitpredict.forest,     performance=perf.auc)$mean,
    knn        = kfoldValidate(f2, data=df, learner=fitpredict.knn,        performance=perf.auc)$mean,
    adaboost   = kfoldValidate(f2, data=df, learner=fitpredict.adaboost,   performance=perf.auc)$mean,
    nnet       = kfoldValidate(f2, data=df, learner=fitpredict.nnet,       performance=perf.auc)$mean,
    naiveBayes = kfoldValidate(f2, data=df, learner=fitpredict.naiveBayes, performance=perf.auc)$mean,
    lda        = kfoldValidate(f2, data=df, learner=fitpredict.lda,        performance=perf.auc)$mean
)

resultsF2 = data.frame(f2_AUC=resultsF2v)
results = cbind(resultsF2, resultsF1)
results$improvement=results$f2_AUC - results$f1_AUC
kable(results, caption="Model performance for F1 vs F2")
Model performance for F1 vs F2
f2_AUC f1_AUC improvement
svm 0.7008355 0.7334677 -0.0326321
forest 0.6987468 0.6979175 0.0008294
knn 0.6114863 0.6197395 -0.0082533
adaboost 0.6802317 0.6756170 0.0046147
nnet 0.6458392 0.7002981 -0.0544589
naiveBayes 0.6743069 0.6769425 -0.0026356
lda 0.6744071 0.7387913 -0.0643842

We see a marginal decrease in performance for most of the models. Finally, we will consider the combined feature set (original features plus principal components).

f3 = as.formula(class ~ ps + 
                    nma.a + nma.b + nma.c + nma.d + nma.e + nma.f + 
                    nex.a + nex.b + nex.c + nex.d + nex.e + nex.g + nex.f + nex.h + 
                    dd + dm + amfm + 
                    PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + PC12)


resultsF3v = c(
    svm        = kfoldValidate(f3, data=df, learner=fitpredict.svm,        performance=perf.auc)$mean,
    forest     = kfoldValidate(f3, data=df, learner=fitpredict.forest,     performance=perf.auc)$mean,
    knn        = kfoldValidate(f3, data=df, learner=fitpredict.knn,        performance=perf.auc)$mean,
    adaboost   = kfoldValidate(f3, data=df, learner=fitpredict.adaboost,   performance=perf.auc)$mean,
    nnet       = kfoldValidate(f3, data=df, learner=fitpredict.nnet,       performance=perf.auc)$mean,
    naiveBayes = kfoldValidate(f3, data=df, learner=fitpredict.naiveBayes, performance=perf.auc)$mean,
    lda        = kfoldValidate(f3, data=df, learner=fitpredict.lda,        performance=perf.auc)$mean
)

resultsF3 = data.frame(f3_AUC=resultsF3v)
results = cbind(resultsF3, resultsF2)
results$improvement=results$f3_AUC - results$f2_AUC
kable(results, caption="Model performance for F2 vs F3")
Model performance for F2 vs F3
f3_AUC f2_AUC improvement
svm 0.7270303 0.7008355 0.0261948
forest 0.6966650 0.6987468 -0.0020819
knn 0.6173061 0.6114863 0.0058199
adaboost 0.6682730 0.6802317 -0.0119587
nnet 0.7060387 0.6458392 0.0601995
naiveBayes 0.6865975 0.6743069 0.0122906
lda 0.7338626 0.6744071 0.0594555

We see that there is an increase in performance for most models when using this combined feature set.

Model and feature selection

Let's start by preparing the data for the more advanced modelling techniques.

ind <- sample(2, nrow(df), replace=TRUE, prob=c(0.5, 0.5))

trainingDf   = df[ind==1,]
validationDf = df[ind==2,]

Selecting model parameters

Model parameters can be tuned to achieve better results. Our strategy is to build a grid of parameter combinations of interest, train a model for each combination, and keep the one with the best cross-validated performance.

Code for model selection

selectModel = function(formula, data, learner, performance, hyperParameters){
    
    if(is.null(hyperParameters)){
    
        return(list(model=NULL, data=NULL))
    }
    
    results = cbind(hyperParameters, performance=rep(0, nrow(hyperParameters)))
    
    for (i in 1:nrow(results)){
        hyper = c(hyperParameters[i, ])
        arguments = c(list(formula=formula, data=data, learner=learner, performance=performance), hyper)
        results[i, 'performance'] = do.call(kfoldValidate, arguments)$mean
    }
    
    selectedModel = results[which.max(results$performance), colnames(hyperParameters)]
    names(selectedModel) = colnames(hyperParameters)
    
    return(list(model=as.list(selectedModel), data=results))
}
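The helper buildHyperDf, used below to build the parameter grids, is not shown in this report; a minimal sketch, assuming it simply expands its named arguments into a data frame with one row per parameter combination (consistent with the grids displayed in the tables below):

buildHyperDf = function(...){
    # Expand every combination of the supplied parameter vectors into a data
    # frame, one row per combination; keep strings (e.g. the SVM kernel name)
    # as characters rather than factors.
    return(expand.grid(..., stringsAsFactors=FALSE))
}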

Let's try this with our models:

candidateParameters = list(
    svm        = buildHyperDf(kernel=c("polynomial"), degree=c(1,2), coef0=seq(1, 3, by=1)),
    forest     = buildHyperDf(ntree=seq(10, 20, by=10)),
    knn        = buildHyperDf(k=seq(2, 4, by=1)),
    nnet       = buildHyperDf(size=seq(1, 3, by=1))
)

params = list(
    svm    = selectModel(f3, trainingDf, fitpredict.svm,    perf.auc, candidateParameters$svm),
    forest = selectModel(f3, trainingDf, fitpredict.forest, perf.auc, candidateParameters$forest),
    knn    = selectModel(f3, trainingDf, fitpredict.knn,    perf.auc, candidateParameters$knn),
    nnet   = selectModel(f3, trainingDf, fitpredict.nnet,   perf.auc, candidateParameters$nnet)
)

kable(params$svm$data[order(-params$svm$data$performance),], 
    caption="SVM performance for different variable combinations"
)
SVM performance for different variable combinations
kernel degree coef0 performance
6 polynomial 2 3 0.7414142
2 polynomial 2 1 0.7261357
4 polynomial 2 2 0.7234909
3 polynomial 1 2 0.6922994
1 polynomial 1 1 0.6922664
5 polynomial 1 3 0.6878704
svmResult = kfoldValidate(f3, validationDf, learner=fitpredict.svm, performance=perf.auc, 
    kernel=params$svm$model$kernel,
    degree=params$svm$model$degree,
    coef0=params$svm$model$coef0
)$mean


kable(params$forest$data[order(-params$forest$data$performance),], 
    caption="Random Forest performance for different variable combinations"
)
Random Forest performance for different variable combinations
ntree performance
2 20 0.6926634
1 10 0.6702572
forestResult = kfoldValidate(f3, validationDf, learner=fitpredict.forest, performance=perf.auc, 
    ntree=params$forest$model$ntree
)$mean


kable(params$knn$data[order(-params$knn$data$performance),], 
    caption="k-NN performance for different variable combinations"
)
k-NN performance for different variable combinations
k performance
2 3 0.5988584
3 4 0.5678796
1 2 0.5627380
knnResult = kfoldValidate(f3, validationDf, learner=fitpredict.knn, performance=perf.auc, 
    k=params$knn$model$k
)$mean

kable(params$nnet$data[order(-params$nnet$data$performance),], 
    caption="ANN performance for different variable combinations"
)
ANN performance for different variable combinations
size performance
3 3 0.7431458
1 1 0.7345651
2 2 0.7210666
nnetResult = kfoldValidate(f3, validationDf, learner=fitpredict.nnet, performance=perf.auc, 
    size=params$nnet$model$size
)$mean

kable(data.frame(
    model=c('SVM', 'Random Forest', 'k-NN', 'ANN'), 
    auc=c(svmResult, forestResult, knnResult, nnetResult)),
    caption="Model performance for the best model paramenters for each model class"
)
Model performance for the best model parameters for each model class
model auc
SVM 0.7057353
Random Forest 0.6628799
k-NN 0.6284149
ANN 0.7062144

Selecting formulas

We can also select formulas to check whether we can arrive at better results. The strategy is to repeatedly fit models leaving one feature out of the set at a time, drop the feature whose removal gives the best AUC, and keep iterating while the AUC improves over the previous round.

Code for formula selection

selectModelFormula = function(formula, data, learner, performance, hyperParameters=NULL){
    
    class = all.vars(formula)[1]
    features = all.vars(formula)[2:length(all.vars(formula))]
    
    results = data.frame(left=c('full', features), result=rep(NA, length(features)+1))
    
    if(!is.null(hyperParameters)){
        
        selectedParameters = selectModel(formula, data, learner, performance, hyperParameters)$model
        
        arguments = c(
            list(formula=formula, data=data, learner=learner, performance=performance),
            selectedParameters    
        )
        
    } else {
        
        selectedParameters = NULL
        arguments = list(formula=formula, data=data, learner=learner, performance=performance)
    }
    
    bestRoundPerformance = tryCatch(do.call(kfoldValidate, arguments)$mean, error=function(cond){ return(0) })
    featureLeftOut = NULL
    bestRoundParameters = selectedParameters
    
    results[results$left=='full', 'result'] = bestRoundPerformance
    
    for(i in 1:length(features)){
        
        try({
            
            roundFormula = as.formula(paste(class, "~", paste(features[-c(i)], collapse="+")))
            
            if(!is.null(hyperParameters)){
                
                arguments = c(
                    list(formula=roundFormula, data=data, learner=learner, performance=performance),
                    bestRoundParameters    
                )
                
            } else {
                
                bestRoundParameters = NULL
                # Use the round formula (with one feature left out) here as well.
                arguments = list(formula=roundFormula, data=data, learner=learner, performance=performance)
            }
            
            roundPerformance = do.call(kfoldValidate, arguments)$mean
            results[results$left==features[i], 'result'] = roundPerformance
        
            if(roundPerformance > bestRoundPerformance){
                
                bestRoundPerformance = roundPerformance
                featureLeftOut = i
            }
        })
    }

    if(!is.null(featureLeftOut) && length(features) > 2){
        
        selectedFormula = as.formula(paste(class, "~", paste(features[-c(featureLeftOut)], collapse="+")))    
        return(selectModelFormula(selectedFormula, data, learner, performance, hyperParameters))
        
    } else {
        
        return(list(
            formula=formula,
            params=bestRoundParameters
        ))
    }
}

Let's see this in action using, for example, the k-NN model:

params = selectModelFormula(f3, trainingDf, fitpredict.knn, perf.auc, candidateParameters$knn)

kable(data.frame(features=all.vars(params$formula)), caption="Selected features using leave-one-out algorithm")
Selected features using leave-one-out algorithm
features
class
ps
nma.a
nma.b
nma.c
nma.d
nma.e
nma.f
nex.a
nex.b
nex.c
nex.d
nex.e
nex.g
nex.f
nex.h
dd
dm
amfm
PC1
PC2
PC3
PC4
PC5
PC6
PC7
PC8
PC9
PC10
PC11
PC12
kable(data.frame(params$params), caption="Selected model parameters")
Selected model parameters
k
3
result = kfoldValidate(params$formula, validationDf, learner=fitpredict.knn, performance=perf.auc, 
    k=params$params
)$mean

kable(data.frame(auc=result), caption="k-NN performance for the optimized formula and model")
k-NN performance for the optimized formula and model
auc
0.6230301

Selecting models for the ensemble

We can use a similar technique to select candidate models for the ensemble. We initially create an ensemble consisting of all the models and measure the AUC of every combination that leaves one model out. We pick the best one, and keep removing models from the reduced ensemble while the AUC improves.

Code for the ensemble learner

fitpredict.votingEnsemble = function(formula, trainingSet, validationSet, learners, params){
    
    ensembleResults = data.frame(majority=rep(NA, nrow(validationSet)))
    
    for (name in names(learners)){
        
        if(!is.null(params)){
            
            arguments = c(
                list(params[name][[1]]$formula, trainingSet=trainingSet, validationSet=validationSet),
                params[name][[1]]$params
            )
            
        } else {
            
            arguments = list(params[name][[1]]$formula, trainingSet=trainingSet, validationSet=validationSet)
        }
        
        modelResult = list(predicted=do.call(learners[name][[1]], arguments)$pred)
        names(modelResult) = c(name)
        ensembleResults = cbind(ensembleResults, modelResult)
    }
    
    ensembleResults$majority = apply(ensembleResults[, names(learners)], 1, FUN=function(x){ 
        as(names(which.max(table(x))), mode(x))  
    })
    
    real = as.integer(validationSet[,all.vars(formula)[1]])
    return(helpers.result(ensembleResults$majority, real))
}

Code for the ensemble selection

selectVotingEnsemble = function(formula, data, performance, learners, learnersParams){
    
    learnerNames = names(learners)
    results = data.frame(left=c('full', learnerNames), result=rep(NA, length(learners)+1))
    
    bestRoundPerformance = kfoldValidate(formula, data, fitpredict.votingEnsemble, performance=perf.auc,
        learners=learners, 
        params=learnersParams
    )$mean    
    
    modelLeftOut = NULL
    
    results[results$left=='full', 'result'] = bestRoundPerformance
    
    for(i in 1:length(learnerNames)){
        
        try({
                
            roundPerformance = kfoldValidate(formula, data, fitpredict.votingEnsemble, perf.auc,
              learners=learners[-c(i)], 
              params=learnersParams[-c(i)]
            )$mean
            
            results[results$left==learnerNames[i], 'result'] = roundPerformance
            
            if(roundPerformance > bestRoundPerformance){
                
                bestRoundPerformance = roundPerformance
                modelLeftOut = i
            }
        })
    }
    
    if(!is.null(modelLeftOut) && length(learners) > 2){
        
        return(selectVotingEnsemble(formula, data, performance,
            learners[-c(modelLeftOut)], 
            learnersParams[-c(modelLeftOut)]
        ))
        
    } else {
        
        return(learners)
    }
}

Let’s check this using our models with default parameters and formula

learners = list(
    svm        = fitpredict.svm,
    forest     = fitpredict.forest,
    knn        = fitpredict.knn,
    nnet       = fitpredict.nnet,
    naiveBayes = fitpredict.naiveBayes,
    lda        = fitpredict.lda
)

candidateParameters = list(
    svm        = list(formula=f3, params=NULL),
    forest     = list(formula=f3, params=NULL),
    knn        = list(formula=f3, params=NULL),
    nnet       = list(formula=f3, params=NULL),
    naiveBayes = list(formula=f3, params=NULL),
    lda        = list(formula=f3, params=NULL)
)

finalEnsemble = selectVotingEnsemble(f3, trainingDf, perf.auc, learners, candidateParameters)

kable(data.frame(models=names(finalEnsemble)), caption = "Models considered by the naive ensemble")
Models considered by the naive ensemble
models
forest
nnet
naiveBayes
lda
result = kfoldValidate(f3, validationDf, fitpredict.votingEnsemble, perf.auc, finalEnsemble,
    candidateParameters[names(finalEnsemble)]
)$mean

kable(data.frame(auc=result), caption="Naive Ensemble Performance")
Naive Ensemble Performance
auc
0.7303

Complete ensemble with feature and model selection steps

Finally, we will put all of this together to find a reduced ensemble of models with optimized parameters and formulas, trying to achieve the best possible result on the classification task. The algorithm is as follows:

- Split the data into 50% for training and model selection and 50% for training and test

- Use 10-fold cross validation on each set of steps

- Select the model with the best parameters for a given set of features
    
  - Train and validate that model leaving one feature out at a time
  
  - Choose the model with the highest AUC
  
  - If the reduced model AUC is greater than the complete model AUC, repeat the steps above using the reduced model

- Use these model parameters and specific features to create an ensemble

- Leaving one model out of the ensemble at a time, select the ensemble with the highest AUC
    
    - Repeat while the AUC increases

- Test the final model AUC

candidateParameters = list(
    svm        = buildHyperDf(kernel=c("polynomial", "radial"), degree=c(1,2), coef0=c(1,10,50,100)),
    forest     = buildHyperDf(ntree=seq(100, 500, by=100)),
    knn        = buildHyperDf(k=seq(2, 30, by=2)),
    nnet       = buildHyperDf(size=seq(5, 20, by=5))
)

svmParams        = selectModelFormula(f3, trainingDf, fitpredict.svm,        perf.auc, candidateParameters$svm)
forestParams     = selectModelFormula(f3, trainingDf, fitpredict.forest,     perf.auc, candidateParameters$forest)
nnetParams       = selectModelFormula(f3, trainingDf, fitpredict.nnet,       perf.auc, candidateParameters$nnet)
knnParams        = selectModelFormula(f3, trainingDf, fitpredict.knn,        perf.auc, candidateParameters$knn)
naiveBayesParams = selectModelFormula(f3, trainingDf, fitpredict.naiveBayes, perf.auc)
ldaParams        = selectModelFormula(f3, trainingDf, fitpredict.lda,        perf.auc)
adaboostParams   = list(
    formula=f3,
    params=NULL
)

learners = list(
    svm        = fitpredict.svm,
    forest     = fitpredict.forest,
    knn        = fitpredict.knn,
    nnet       = fitpredict.nnet,
    adaboost   = fitpredict.adaboost,
    naiveBayes = fitpredict.naiveBayes,
    lda        = fitpredict.lda
)

candidateModels = list(
    svm        = svmParams,
    forest     = forestParams,
    knn        = knnParams,
    nnet       = nnetParams,
    adaboost   = adaboostParams,
    naiveBayes = naiveBayesParams,
    lda        = ldaParams
)

finalLearners = selectVotingEnsemble(f3, trainingDf, perf.auc, learners, candidateModels)

kable(data.frame(models=names(finalLearners)), caption = "Models considered by the final ensemble")
Models considered by the final ensemble
models
forest
nnet
naiveBayes
lda
resultTraining = kfoldValidate(f3, trainingDf, fitpredict.votingEnsemble, perf.auc, finalLearners,
    candidateModels[names(finalLearners)]
)$mean

resultValidation = kfoldValidate(f3, validationDf, fitpredict.votingEnsemble, perf.auc, finalLearners,
    candidateModels[names(finalLearners)]
)$mean

resultComplete = kfoldValidate(f3, df, fitpredict.votingEnsemble, perf.auc, finalLearners,
    candidateModels[names(finalLearners)]
)$mean

kable(data.frame(
    set=c('Training', 'Validation', 'All'), 
    auc=c(resultTraining, resultValidation, resultComplete)
), caption="Final ensemble performance")
Final ensemble performance
set auc
Training 0.7846805
Validation 0.7565049
All 0.7648807

We see that the ensemble does not significantly improve performance over the individual models.

Discussion

We have tried to construct an ensemble to predict whether a patient has diabetic retinopathy using features extracted from retinal fundus images. While the authors of the original paper report an AUC of 0.989 for this task, we were unable to get close to those results.

Our analysis took into consideration feature scaling and feature rotation using PCA. We then performed feature optimization using a leave-one-out technique to find reduced sets of features that would help the models better discriminate the data. Furthermore, we trained different models with several parameter combinations and combined the optimized models into an ensemble, which was also pruned using leave-one-out to find a reduced ensemble with the best results.

We divided the dataset into two subsets, one used for model training and selection / validation, and another for training / testing. On each set we used 10-fold cross-validation in order to make the most of the available data. One might think that splitting the dataset in two prior to both tasks would limit model performance, although it maximizes validity since the final data was never used for model training or validation; however, we checked how the model behaved when using the complete dataset for both tasks, and the results were very similar.

In addition, we used simple techniques for feature selection and scaling; we could possibly arrive at better results by introducing more sophisticated techniques for selecting and generating features. We also searched only small subspaces of model parameters, and other regions of the parameter space might yield better-performing models. However, larger searches were not feasible within a reasonable computation time on a single machine. This could be overcome by splitting the training task into batch jobs that run in parallel on a multi-machine cluster - for example, all leave-one-out candidates could be trained at the same time on different machines.

Finally, even our single models were unable to reach a performance close to the one reported in the paper. This may indicate that the algorithms used for modelling were not robust enough, that they required more computation time to arrive at better results, or that the successful model parameters - which were not specified in the paper - lie in a region of the parameter space not covered by our search.

Conclusion

In conclusion, despite the shortcomings in reaching good performance results, this work provided a means to use and test multiple machine learning algorithms and to try to arrive at ensemble models that outperform individual learners. It also allowed exploring feature selection, feature generation, parameter selection and ensemble selection problems, and experiencing the computation-time constraints of searching for candidate models in large combinatorial spaces, even for a dataset as small as the one used.