This report was created as the final assignment for the Machine Learning class of the Master's in Information Engineering at the University of Porto Engineering School. It intends to demonstrate the application of machine learning concepts and techniques learnt in class on a real dataset.
Because I am a Medical Doctor doing a Family Medicine Residency, I chose a dataset on Diabetic Retinopathy, which is a complication of Diabetes, a disease that I deal with every day in my medical practice.
I present the results of my work and compare them with a paper published on the same dataset - Balint Antal, Andras Hajdu: An ensemble-based system for automatic screening of diabetic retinopathy, Knowledge-Based Systems 60 (April 2014), 20-27.
Diabetic retinopathy is the leading cause of new blindness in persons aged 25-74 years in the United States. The exact mechanism by which diabetes causes retinopathy remains unclear, but several theories have been postulated to explain the typical course and history of the disease.
The disease can be diagnosed with fluorescein angiography by looking for the following features:
Microaneurysms: The earliest clinical sign of diabetic retinopathy; these occur secondary to capillary wall outpouching due to pericyte loss; they appear as small, red dots in the superficial retinal layers.
Dot and blot hemorrhages: Appear similar to microaneurysms if they are small; they occur as microaneurysms rupture in the deeper layers of the retina, such as the inner nuclear and outer plexiform layers.
Flame-shaped hemorrhages: Splinter hemorrhages that occur in the more superficial nerve fiber layer.
Retinal edema and hard exudates: Caused by the breakdown of the blood-retina barrier, allowing leakage of serum proteins, lipids, and protein from the vessels.
Cotton-wool spots: Nerve fiber layer infarctions from occlusion of precapillary arterioles; they are frequently bordered by microaneurysms and vascular hyperpermeability.
Venous loops and venous beading: Frequently occur adjacent to areas of nonperfusion; they reflect increasing retinal ischemia, and their occurrence is the most significant predictor of progression to proliferative diabetic retinopathy (PDR).
Intraretinal microvascular abnormalities: Remodeled capillary beds without proliferative changes; can usually be found on the borders of the nonperfused retina.
Macular exudates: Leading cause of visual impairment in patients with diabetes.
Adapted from Medscape
The aforementioned features show up in fundoscopy images as shown below.
Retinal Fundus
From Arleo Eye Associates
This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not. All features represent either a detected lesion, a descriptive feature of an anatomical part, or an image-level descriptor.
The underlying image analysis and feature extraction methods, as well as the original classification technique, are described in Balint Antal, Andras Hajdu: An ensemble-based system for automatic screening of diabetic retinopathy, Knowledge-Based Systems 60 (April 2014), 20-27.
The dataset was obtained from the UCI Machine Learning Repository
| Data Set Characteristics | |
|---|---|
| Number of Instances | 1151 |
| Area | Life |
| Attribute Characteristics | Integer, Real |
| Number of Attributes | 20 |
| Date Donated | 2014-11-03 |
| Associated Tasks | Classification |
| Missing Values? | N/A |
q - The binary result of quality assessment. 0 = bad quality 1 = sufficient quality.
ps - The binary result of pre-screening, where 1 indicates severe retinal abnormality and 0 its lack.
nma.a - nma.f - The results of microaneurysm detection. Each feature value stands for the number of microaneurysms found at the confidence levels alpha = 0.5, ..., 1, respectively.
nex.a - nex.h - Contain the same information as nma.a - nma.f for exudates. However, as exudates are represented by a set of points rather than the number of pixels constituting the lesions, these features are normalized by dividing the number of lesions by the diameter of the ROI to compensate for different image sizes.
dd - The Euclidean distance between the center of the macula and the center of the optic disc, which provides important information regarding the patient's condition. This feature is also normalized by the diameter of the ROI.
dm - The diameter of the optic disc.
amfm - The binary result of the AM/FM-based classification.
class - Class label. 1 = contains signs of Diabetic Retinopathy, 0 = no signs of Diabetic Retinopathy.
The task is to classify the patient's retina as showing signs of diabetic retinopathy or not, taking into consideration the available features.
The original Debrecen paper reports an AUC of 0.989, obtained by training an ensemble on all the features, using different model selection techniques for the ensemble and 10-fold cross-validation.
The models considered for the ensemble were:
Alternating Decision Tree
kNN
AdaBoost
Multilayer Perceptron
Naive Bayes
Random Forest
SVM
Pattern classifier
Overall, the authors report very good results (AUC > 0.9 in many cases), both for the best performing single model and for the different ensemble strategies.
To my knowledge, there are no other papers published on this dataset.
First we load the data:
library(foreign)
loadData = function(path){
df = read.arff(path)
colnames(df) <- c(
"q", # 0 The binary result of quality assessment. 0 = bad quality 1 = sufficient quality
"ps", # 1 The binary result of pre-screening, 1 indicates severe retinal abnormality and 0 its lack
"nma.a", # 2 Number of MAs found at the confidence levels alpha = 0.5
"nma.b", # 3 Number of MAs found at the confidence levels alpha = 0.6
"nma.c", # 4 Number of MAs found at the confidence levels alpha = 0.7
"nma.d", # 5 Number of MAs found at the confidence levels alpha = 0.8
"nma.e", # 6 Number of MAs found at the confidence levels alpha = 0.9
"nma.f", # 7 Number of MAs found at the confidence levels alpha = 1.0
"nex.a", # 8 Number of Exudates found at the confidence levels alpha = 0.5
"nex.b", # 9 Number of Exudates found at the confidence levels alpha = 0.6
"nex.c", # 10 Number of Exudates found at the confidence levels alpha = 0.7
"nex.d", # 11 Number of Exudates found at the confidence levels alpha = 0.8
"nex.e", # 12 Number of Exudates found at the confidence levels alpha = 0.9
"nex.g", # 13 Number of Exudates found at the confidence levels alpha = 1.0
"nex.f", # 14 Number of Exudates found at the confidence levels alpha = 1.0
"nex.h", # 15 Number of Exudates found at the confidence levels alpha = 1.0
"dd", # 16 The euclidean distance of the center of the macula and the center of the optic disc
"dm", # 17 The diameter of the optic disc
"amfm", # 18 The binary result of the AM/FM-based classification
"class" # 19 Class label. 1 = contains signs of DR, 0 = no signs of DR
)
numericFeats = c(3:16)  # nma.* and nex.* lesion counts
eyeFeats = c(17,18)     # dd and dm
df[, c(numericFeats, eyeFeats)] = scale(df[, c(numericFeats, eyeFeats)])  # center and scale the continuous features
df$class = as.factor(df$class)
return(df)
}
df = loadData("./dataset.arff")
library(reshape2)  # melt
library(ggplot2)
long = melt(df[, 1:(ncol(df)-1)])  # all columns except the class label
## No id variables; using all as measure variables
ggplot(long) +
geom_boxplot(aes(variable, value)) +
coord_flip() +
labs(title="Unimodal feature distribution", x='Feature', y='Scaled value')
We can see that the nma.* and nex.* features have one-sided, heavy-tailed distributions. Now let's look at the correlation between the features:
library(GGally)  # ggcorr
ggcorr(df) +
labs(title="Feature correlation matrix")
We see that there are strong positive correlations among all the nma.* features and among the nex.* features, with the latter set showing stronger relationships between adjacent features. Now let's look at some of the pairs:
ggplot(df) + geom_point(aes(nma.a, nma.b, color=class)) + facet_wrap(~amfm)
ggplot(df) + geom_point(aes(nma.a, nma.f, color=class)) + facet_wrap(~amfm)
ggplot(df) + geom_point(aes(nex.a, nex.b, color=class)) + facet_wrap(~amfm)
ggplot(df) + geom_point(aes(nex.a, nex.h, color=class)) + facet_wrap(~amfm)
ggplot(df) + geom_point(aes(nma.a, nex.h, color=class)) + facet_wrap(~amfm)
We see that in these projections a large fraction of the points is separable conditional on the class. However, a smaller but non-negligible number of points does not seem to be linearly separable in these projections.
Finally, let's look at a reduced 2-D projection of the data, and of the class variable, on a self-organizing map (SOM):
library(kohonen)  # som, somgrid
cols = colnames(df)[1:(ncol(df)-1)]
base = df[,cols]
som_grid = somgrid(xdim=15, ydim=15, topo="hexagonal")
som_model = som(as.matrix(base), grid=som_grid, rlen=200)
# plot type "property" expects one value per map unit, so each feature is
# averaged over the observations mapped to that unit
unitMean = function(v){
m = rep(NA, nrow(som_model$grid$pts))
agg = tapply(v, som_model$unit.classif, mean)
m[as.integer(names(agg))] = agg
return(m)
}
plot(som_model, type="property", property = unitMean(df$nma.a))
title("SOM colored for nma.a feature")
plot(som_model, type="property", property = unitMean(df$nma.e))
title("SOM colored for nma.e feature")
plot(som_model, type="property", property = unitMean(df$nex.a))
title("SOM colored for nex.a feature")
plot(som_model, type="property", property = unitMean(df$nex.h))
title("SOM colored for nex.h feature")
plot(som_model, type="property", property = unitMean(as.integer(df$class)-1))
title("SOM colored for class")
We can see that there is no clear region of the map where the class labels preferentially lie, nor does there seem to be a preferential arrangement of the features across the map.
We will try to build a classifier to predict the class based on the feature set. We will begin by building simple models, calibrate them, and finally try to combine them into an ensemble.
The following are the learners that we will consider. They are simple wrapper functions that expose a common API over the underlying algorithms.
# libraries providing the base learners and the result tables
library(e1071)        # svm, naiveBayes
library(randomForest) # randomForest
library(class)        # knn
library(adabag)       # boosting, predict.boosting
library(nnet)         # nnet
library(MASS)         # lda
library(knitr)        # kable
helpers.result = function(pred, real){
return(data.frame(pred=pred, real=real, ok=(pred==real)))
}
fitpredict.svm = function(formula, trainingSet, validationSet, kernel="polynomial", degree=2, coef0=1){
model = svm(formula, trainingSet, kernel=kernel, degree=degree, coef0=coef0)
pred = as.integer(predict(model, validationSet))
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
fitpredict.forest = function(formula, trainingSet, validationSet, ntree=100){
model = randomForest(formula, trainingSet, ntree=ntree)
pred = as.integer(predict(model, validationSet))
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
fitpredict.knn = function(formula, trainingSet, validationSet, k=5){
cols = all.vars(formula)[2:length(all.vars(formula))]
pred = as.integer(knn(trainingSet[,cols], validationSet[,cols], trainingSet$class, k=k))
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
fitpredict.adaboost = function(formula, trainingSet, validationSet, boos=TRUE, mfinal=10, coeflearn='Breiman'){
model = boosting(formula, trainingSet, boos=boos, mfinal=mfinal, coeflearn=coeflearn)
pred = as.integer(predict.boosting(model, validationSet)$class)+1
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
fitpredict.nnet = function(formula, trainingSet, validationSet, size=10){
model = nnet(formula, trainingSet, size=size, trace=FALSE)
pred = as.integer(round(predict(model, validationSet)))+1
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
fitpredict.naiveBayes = function(formula, trainingSet, validationSet){
model = naiveBayes(formula, trainingSet)
pred = as.integer(predict(model, validationSet))
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
fitpredict.lda = function(formula, trainingSet, validationSet){
model = lda(formula, trainingSet)
pred = as.integer(data.frame(predict(model, validationSet))$class)
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(pred, real))
}
Because we are dealing with a binary classification task, we will consider the AUC as the main performance metric. This metric gives us combined information regarding the sensitivity and specificity of the models.
library(ROCR)  # prediction, performance
perf.auc = function(yy){
# ROCR expects the predicted values first and the true labels second
return(performance(prediction(yy$pred, yy$real), "auc")@y.values[[1]])
}
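As a quick, purely illustrative check (toy vectors that are not part of the analysis), a perfect prediction should yield an AUC of 1 and an uninformative one about 0.5:
# toy predictions (1 = no DR, 2 = DR), only to illustrate the metric wrapper
perf.auc(helpers.result(pred=c(1,1,2,2), real=c(1,1,2,2)))  # 1.0
perf.auc(helpers.result(pred=c(1,2,1,2), real=c(1,1,2,2)))  # 0.5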
The models will be evaluated using 10-fold cross-validation, which is implemented in the following block:
library(caret)  # createFolds
kfoldValidate = function(formula, data, learner, performance, ...){
k=10
ks = c(1:k)
folds = createFolds(data$class, k, list=FALSE)  # stratified assignment of each observation to a fold
result = rep(0, k)
for (i in ks){
trainingKs = ks[ks!=i]
validationKs = ks[ks==i]
trainingSet = data[which(folds %in% trainingKs), ]
validationSet = data[which(folds %in% validationKs), ]
result[i] = performance(learner(formula, trainingSet, validationSet, ...))
}
return(list(mean=mean(result), sd=sd(result), results=c(result)))
}
Let's consider the full list of features (excluding the image quality flag q) and test each model using the AUC:
f1 = as.formula(class ~ ps +
nma.a + nma.b + nma.c + nma.d + nma.e + nma.f +
nex.a + nex.b + nex.c + nex.d + nex.e + nex.g + nex.f + nex.h +
dd + dm + amfm)
resultsF1v = c(
svm = kfoldValidate(f1, data=df, learner=fitpredict.svm, performance=perf.auc)$mean,
forest = kfoldValidate(f1, data=df, learner=fitpredict.forest, performance=perf.auc)$mean,
knn = kfoldValidate(f1, data=df, learner=fitpredict.knn, performance=perf.auc)$mean,
adaboost = kfoldValidate(f1, data=df, learner=fitpredict.adaboost, performance=perf.auc)$mean,
nnet = kfoldValidate(f1, data=df, learner=fitpredict.nnet, performance=perf.auc)$mean,
naiveBayes = kfoldValidate(f1, data=df, learner=fitpredict.naiveBayes, performance=perf.auc)$mean,
lda = kfoldValidate(f1, data=df, learner=fitpredict.lda, performance=perf.auc)$mean
)
resultsF1 = data.frame(f1_AUC=resultsF1v)
kable(resultsF1, caption="Model performance for F1")
|  | f1_AUC |
|---|---|
| svm | 0.7334677 |
| forest | 0.6979175 |
| knn | 0.6197395 |
| adaboost | 0.6756170 |
| nnet | 0.7002981 |
| naiveBayes | 0.6769425 |
| lda | 0.7387913 |
We see that without any model parameterization the results are not very good. We achieve the highest result with Linear Discriminant Analysis.
Now let’s try performing PCA on the dataset
cols = all.vars(f1)[2:length(all.vars(f1))]
fit = prcomp(df[,cols], center=T, scale=T)
df = cbind(df, fit$x)  # append the principal component scores as new columns
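The cutoff of twelve components used below is not justified in the report; one way to judge it (not part of the original analysis) is to inspect the cumulative proportion of variance explained by the components:
# cumulative proportion of variance explained by each principal component
summary(fit)$importance["Cumulative Proportion", ]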
Let's see if we can achieve better results considering only the first 12 principal components:
f2 = as.formula(class ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + PC12 )
resultsF2v = c(
svm = kfoldValidate(f2, data=df, learner=fitpredict.svm, performance=perf.auc)$mean,
forest = kfoldValidate(f2, data=df, learner=fitpredict.forest, performance=perf.auc)$mean,
knn = kfoldValidate(f2, data=df, learner=fitpredict.knn, performance=perf.auc)$mean,
adaboost = kfoldValidate(f2, data=df, learner=fitpredict.adaboost, performance=perf.auc)$mean,
nnet = kfoldValidate(f2, data=df, learner=fitpredict.nnet, performance=perf.auc)$mean,
naiveBayes = kfoldValidate(f2, data=df, learner=fitpredict.naiveBayes, performance=perf.auc)$mean,
lda = kfoldValidate(f2, data=df, learner=fitpredict.lda, performance=perf.auc)$mean
)
resultsF2 = data.frame(f2_AUC=resultsF2v)
results = cbind(resultsF2, resultsF1)
results$improvement=results$f2_AUC - results$f1_AUC
kable(results, caption="Model performance for F1 vs F2")
|  | f2_AUC | f1_AUC | improvement |
|---|---|---|---|
| svm | 0.7008355 | 0.7334677 | -0.0326321 |
| forest | 0.6987468 | 0.6979175 | 0.0008294 |
| knn | 0.6114863 | 0.6197395 | -0.0082533 |
| adaboost | 0.6802317 | 0.6756170 | 0.0046147 |
| nnet | 0.6458392 | 0.7002981 | -0.0544589 |
| naiveBayes | 0.6743069 | 0.6769425 | -0.0026356 |
| lda | 0.6744071 | 0.7387913 | -0.0643842 |
We see that there is a marginal decrease for most of the models. Finally, we will consider the combined feature set: the original features together with the principal components.
f3 = as.formula(class ~ ps +
nma.a + nma.b + nma.c + nma.d + nma.e + nma.f +
nex.a + nex.b + nex.c + nex.d + nex.e + nex.g + nex.f + nex.h +
dd + dm + amfm +
PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + PC12)
resultsF3v = c(
svm = kfoldValidate(f3, data=df, learner=fitpredict.svm, performance=perf.auc)$mean,
forest = kfoldValidate(f3, data=df, learner=fitpredict.forest, performance=perf.auc)$mean,
knn = kfoldValidate(f3, data=df, learner=fitpredict.knn, performance=perf.auc)$mean,
adaboost = kfoldValidate(f3, data=df, learner=fitpredict.adaboost, performance=perf.auc)$mean,
nnet = kfoldValidate(f3, data=df, learner=fitpredict.nnet, performance=perf.auc)$mean,
naiveBayes = kfoldValidate(f3, data=df, learner=fitpredict.naiveBayes, performance=perf.auc)$mean,
lda = kfoldValidate(f3, data=df, learner=fitpredict.lda, performance=perf.auc)$mean
)
resultsF3 = data.frame(f3_AUC=resultsF3v)
results = cbind(resultsF3, resultsF2)
results$improvement=results$f3_AUC - results$f2_AUC
kable(results, caption="Model performance for F2 vs F3")
|  | f3_AUC | f2_AUC | improvement |
|---|---|---|---|
| svm | 0.7270303 | 0.7008355 | 0.0261948 |
| forest | 0.6966650 | 0.6987468 | -0.0020819 |
| knn | 0.6173061 | 0.6114863 | 0.0058199 |
| adaboost | 0.6682730 | 0.6802317 | -0.0119587 |
| nnet | 0.7060387 | 0.6458392 | 0.0601995 |
| naiveBayes | 0.6865975 | 0.6743069 | 0.0122906 |
| lda | 0.7338626 | 0.6744071 | 0.0594555 |
We see that there is an increase in performance for most of the models when considering this latter feature set.
Let's start by preparing the data for the more advanced modelling techniques.
# random 50/50 split into a training/model-selection set and a validation set
# (no seed is set, so the exact split changes between runs)
ind <- sample(2, nrow(df), replace=TRUE, prob=c(0.5, 0.5))
trainingDf = df[ind==1,]
validationDf = df[ind==2,]
Model hyperparameters can be tuned in order to achieve better results. We will build a grid of parameter combinations of interest, train a model for each combination, and keep the one with the best cross-validated performance.
selectModel = function(formula, data, learner, performance, hyperParameters){
if(is.null(hyperParameters)){
return(list(model=NULL, data=NULL))
}
# evaluate every row of the hyperparameter grid with k-fold cross-validation
results = cbind(hyperParameters, performance=rep(0, nrow(hyperParameters)))
for (i in 1:nrow(results)){
hyper = c(hyperParameters[i, ])
arguments = c(list(formula=formula, data=data, learner=learner, performance=performance), hyper)
results[i, 'performance'] = do.call(kfoldValidate, arguments)$mean
}
# keep the parameter combination with the highest mean AUC
selectedModel = results[which.max(results$performance), colnames(hyperParameters)]
names(selectedModel) = colnames(hyperParameters)
return(list(model=as.list(selectedModel), data=results))
}
Let's try this with our models:
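The helper buildHyperDf used below is not defined anywhere in this report; a minimal sketch, assuming it simply wraps expand.grid to produce one row per parameter combination, could look like this:
# hypothetical reconstruction of buildHyperDf (not shown in the original report):
# one row per combination of the supplied parameter vectors, strings kept as characters
buildHyperDf = function(...){
return(expand.grid(..., stringsAsFactors=FALSE))
}
With this definition, buildHyperDf(kernel="polynomial", degree=c(1,2), coef0=1:3) yields the six-row grid whose cross-validated performance is tabulated below.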
candidateParameters = list(
svm = buildHyperDf(kernel=c("polynomial"), degree=c(1,2), coef0=seq(1, 3, by=1)),
forest = buildHyperDf(ntree=seq(10, 20, by=10)),
knn = buildHyperDf(k=seq(2, 4, by=1)),
nnet = buildHyperDf(size=seq(1, 3, by=1))
)
params = list(
svm = selectModel(f3, trainingDf, fitpredict.svm, perf.auc, candidateParameters$svm),
forest = selectModel(f3, trainingDf, fitpredict.forest, perf.auc, candidateParameters$forest),
knn = selectModel(f3, trainingDf, fitpredict.knn, perf.auc, candidateParameters$knn),
nnet = selectModel(f3, trainingDf, fitpredict.nnet, perf.auc, candidateParameters$nnet)
)
kable(params$svm$data[order(-params$svm$data$performance),],
caption="SVM performance for different parameter combinations"
)
|  | kernel | degree | coef0 | performance |
|---|---|---|---|---|
| 6 | polynomial | 2 | 3 | 0.7414142 |
| 2 | polynomial | 2 | 1 | 0.7261357 |
| 4 | polynomial | 2 | 2 | 0.7234909 |
| 3 | polynomial | 1 | 2 | 0.6922994 |
| 1 | polynomial | 1 | 1 | 0.6922664 |
| 5 | polynomial | 1 | 3 | 0.6878704 |
svmResult = kfoldValidate(f3, validationDf, learner=fitpredict.svm, performance=perf.auc,
kernel=params$svm$model$kernel,
degree=params$svm$model$degree,
coef0=params$svm$model$coef0
)$mean
kable(params$forest$data[order(-params$forest$data$performance),],
caption="Random Forest performance for different parameter combinations"
)
|  | ntree | performance |
|---|---|---|
| 2 | 20 | 0.6926634 |
| 1 | 10 | 0.6702572 |
forestResult = kfoldValidate(f3, validationDf, learner=fitpredict.forest, performance=perf.auc,
ntree=params$forest$model$ntree
)$mean
kable(params$knn$data[order(-params$knn$data$performance),],
caption="k-NN performance for different parameter combinations"
)
|  | k | performance |
|---|---|---|
| 2 | 3 | 0.5988584 |
| 3 | 4 | 0.5678796 |
| 1 | 2 | 0.5627380 |
knnResult = kfoldValidate(f3, validationDf, learner=fitpredict.knn, performance=perf.auc,
k=params$knn$model$k
)$mean
kable(params$nnet$data[order(-params$nnet$data$performance),],
caption="ANN performance for different parameter combinations"
)
|  | size | performance |
|---|---|---|
| 3 | 3 | 0.7431458 |
| 1 | 1 | 0.7345651 |
| 2 | 2 | 0.7210666 |
nnetResult = kfoldValidate(f3, validationDf, learner=fitpredict.nnet, performance=perf.auc,
size=params$nnet$model$size
)$mean
kable(data.frame(
model=c('SVM', 'Random Forest', 'k-NN', 'ANN'),
auc=c(svmResult, forestResult, knnResult, nnetResult)),
caption="Model performance with the best parameters for each model class"
)
| model | auc |
|---|---|
| SVM | 0.7057353 |
| Random Forest | 0.6628799 |
| k-NN | 0.6284149 |
| ANN | 0.7062144 |
We can also select the formula to check whether we can arrive at better results. The strategy is backward elimination: we repeatedly build models leaving one feature out of the set, and keep removing features while the AUC improves over the previous round.
# backward elimination: drop the single feature whose removal most improves the
# cross-validated AUC, and recurse while the AUC keeps improving
selectModelFormula = function(formula, data, learner, performance, hyperParameters=NULL){
class = all.vars(formula)[1]
features = all.vars(formula)[2:length(all.vars(formula))]
results = data.frame(left=c('full', features), result=rep(NA, length(features)+1))
if(!is.null(hyperParameters)){
selectedParameters = selectModel(formula, data, learner, performance, hyperParameters)$model
arguments = c(
list(formula=formula, data=data, learner=learner, performance=performance),
selectedParameters
)
} else {
selectedParameters = NULL
arguments = list(formula=formula, data=data, learner=learner, performance=performance)
}
bestRoundPerformance = tryCatch(do.call(kfoldValidate, arguments)$mean, error=function(cond){ return(0) })
featureLeftOut = NULL
bestRoundParameters = selectedParameters
results[results$left=='full', 'result'] = bestRoundPerformance
for(i in 1:length(features)){
try({
roundFormula = as.formula(paste(class, "~", paste(features[-c(i)], collapse="+")))
if(!is.null(hyperParameters)){
arguments = c(
list(formula=roundFormula, data=data, learner=learner, performance=performance),
bestRoundParameters
)
} else {
bestRoundParameters = NULL
# evaluate the round formula (with feature i left out) when there are no hyperparameters
arguments = list(formula=roundFormula, data=data, learner=learner, performance=performance)
}
roundPerformance = do.call(kfoldValidate, arguments)$mean
results[results$left==features[i], 'result'] = roundPerformance
if(roundPerformance > bestRoundPerformance){
bestRoundPerformance = roundPerformance
featureLeftOut = i
}
})
}
if(!is.null(featureLeftOut) && length(features) > 2){
selectedFormula = as.formula(paste(class, "~", paste(features[-c(featureLeftOut)], collapse="+")))
return(selectModelFormula(selectedFormula, data, learner, performance, hyperParameters))
} else {
return(list(
formula=formula,
params=bestRoundParameters
))
}
}
Let's see this in action using, for example, the k-NN model:
params = selectModelFormula(f3, trainingDf, fitpredict.knn, perf.auc, candidateParameters$knn)
kable(data.frame(features=all.vars(params$formula)), caption="Selected features using leave-one-out algorithm")
| features |
|---|
| class |
| ps |
| nma.a |
| nma.b |
| nma.c |
| nma.d |
| nma.e |
| nma.f |
| nex.a |
| nex.b |
| nex.c |
| nex.d |
| nex.e |
| nex.g |
| nex.f |
| nex.h |
| dd |
| dm |
| amfm |
| PC1 |
| PC2 |
| PC3 |
| PC4 |
| PC5 |
| PC6 |
| PC7 |
| PC8 |
| PC9 |
| PC10 |
| PC11 |
| PC12 |
kable(data.frame(params$params), caption="Selected model parameters")
| k |
|---|
| 3 |
result = kfoldValidate(params$formula, validationDf, learner=fitpredict.knn, performance=perf.auc,
k=params$params$k
)$mean
kable(data.frame(auc=result), caption="k-NN performance for the optimized formula and model")
| auc |
|---|
| 0.6230301 |
We can use a similar technique to select the candidate models to include in the ensemble. We initially create an ensemble consisting of all the models, measure the AUC of every combination that leaves one model out, pick the best one, and keep removing models from the reduced ensemble while the AUC improves.
fitpredict.votingEnsemble = function(formula, trainingSet, validationSet, learners, params){
# collect the predictions of every base learner on the validation set
ensembleResults = data.frame(majority=rep(NA, nrow(validationSet)))
for (name in names(learners)){
if(!is.null(params)){
arguments = c(
list(params[name][[1]]$formula, trainingSet=trainingSet, validationSet=validationSet),
params[name][[1]]$params
)
} else {
arguments = list(params[name][[1]]$formula, trainingSet=trainingSet, validationSet=validationSet)
}
modelResult = list(predicted=do.call(learners[name][[1]], arguments)$pred)
names(modelResult) = c(name)
ensembleResults = cbind(ensembleResults, modelResult)
}
# per-observation majority vote across the individual model predictions
ensembleResults$majority = apply(ensembleResults[, names(learners)], 1, FUN=function(x){
as(names(which.max(table(x))), mode(x))
})
real = as.integer(validationSet[,all.vars(formula)[1]])
return(helpers.result(ensembleResults$majority, real))
}
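To make the majority-vote line concrete, here is a small standalone illustration (toy votes, not part of the analysis) of how the per-observation prediction is resolved; ties are broken in favour of the value that table lists first:
# five hypothetical model votes for a single observation
votes = c(2, 2, 1, 1, 2)
as(names(which.max(table(votes))), mode(votes))  # returns 2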
# ensemble selection by backward elimination: repeatedly drop the model whose
# removal most improves the cross-validated AUC of the majority vote
selectVotingEnsemble = function(formula, data, performance, learners, learnersParams){
learnerNames = names(learners)
results = data.frame(left=c('full', learnerNames), result=rep(NA, length(learners)+1))
bestRoundPerformance = kfoldValidate(formula, data, fitpredict.votingEnsemble, performance=performance,
learners=learners,
params=learnersParams
)$mean
modelLeftOut = NULL
results[results$left=='full', 'result'] = bestRoundPerformance
for(i in 1:length(learnerNames)){
try({
roundPerformance = kfoldValidate(formula, data, fitpredict.votingEnsemble, performance=performance,
learners=learners[-c(i)],
params=learnersParams[-c(i)]
)$mean
results[results$left==learnerNames[i], 'result'] = roundPerformance
if(roundPerformance > bestRoundPerformance){
bestRoundPerformance = roundPerformance
modelLeftOut = i
}
})
}
if(!is.null(modelLeftOut) && length(learners) > 2){
return(selectVotingEnsemble(formula, data, performance,
learners[-c(modelLeftOut)],
learnersParams[-c(modelLeftOut)]
))
} else {
return(learners)
}
}
Let’s check this using our models with default parameters and formula
learners = list(
svm = fitpredict.svm,
forest = fitpredict.forest,
knn = fitpredict.knn,
nnet = fitpredict.nnet,
naiveBayes = fitpredict.naiveBayes,
lda = fitpredict.lda
)
candidateParameters = list(
svm = list(formula=f3, params=NULL),
forest = list(formula=f3, params=NULL),
knn = list(formula=f3, params=NULL),
nnet = list(formula=f3, params=NULL),
naiveBayes = list(formula=f3, params=NULL),
lda = list(formula=f3, params=NULL)
)
finalEnsemble = selectVotingEnsemble(f3, trainingDf, perf.auc, learners, candidateParameters)
kable(data.frame(models=names(finalEnsemble)), caption = "Models considered by the naive ensemble")
| models |
|---|
| forest |
| nnet |
| naiveBayes |
| lda |
result = kfoldValidate(f3, validationDf, fitpredict.votingEnsemble, perf.auc, finalEnsemble,
candidateParameters[names(finalEnsemble)]
)$mean
kable(data.frame(auc=result), caption="Naive Ensemble Performance")
| auc |
|---|
| 0.7303 |
Finally, we will put all of this together to find a reduced ensemble of models with optimized parameters and formulas, and try to achieve the best result possible on the classification task. The algorithm is as follows:
- Split the data into 50% for training and model selection and 50% for training and test
- Use 10-fold cross validation on each set of steps
- Select the model with the best parameters for a given set of features
- Train and validate that model leaving one feature out at a time
- Choose the model with the highest AUC
- If the reduced model's AUC is greater than that of the complete model, repeat the steps above using the reduced model
- Use these model parameters and specific features to create an ensemble
- Leaving out one model from the ensemble select the ensemble with the highest AUC
- Repeat while the AUC increases
- Test the final model AUC
candidateParameters = list(
svm = buildHyperDf(kernel=c("polynomial", "radial"), degree=c(1,2), coef0=c(1,10,50,100)),
forest = buildHyperDf(ntree=seq(100, 500, by=100)),
knn = buildHyperDf(k=seq(2, 30, by=2)),
nnet = buildHyperDf(size=seq(5, 20, by=5))
)
svmParams = selectModelFormula(f3, trainingDf, fitpredict.svm, perf.auc, candidateParameters$svm)
forestParams = selectModelFormula(f3, trainingDf, fitpredict.forest, perf.auc, candidateParameters$forest)
nnetParams = selectModelFormula(f3, trainingDf, fitpredict.nnet, perf.auc, candidateParameters$nnet)
knnParams = selectModelFormula(f3, trainingDf, fitpredict.knn, perf.auc, candidateParameters$knn)
naiveBayesParams = selectModelFormula(f3, trainingDf, fitpredict.naiveBayes, perf.auc)
ldaParams = selectModelFormula(f3, trainingDf, fitpredict.lda, perf.auc)
adaboostParams = list(
formula=f3,
params=NULL
)
learners = list(
svm = fitpredict.svm,
forest = fitpredict.forest,
knn = fitpredict.knn,
nnet = fitpredict.nnet,
adaboost = fitpredict.adaboost,
naiveBayes = fitpredict.naiveBayes,
lda = fitpredict.lda
)
candidateModels = list(
svm = svmParams,
forest = forestParams,
knn = knnParams,
nnet = nnetParams,
adaboost = adaboostParams,
naiveBayes = naiveBayesParams,
lda = ldaParams
)
finalLearners = selectVotingEnsemble(f3, trainingDf, perf.auc, learners, candidateModels)
kable(data.frame(models=names(finalLearners)), caption = "Models considered by the final ensemble")
| models |
|---|
| forest |
| nnet |
| naiveBayes |
| lda |
resultTraining = kfoldValidate(f3, trainingDf, fitpredict.votingEnsemble, perf.auc, finalLearners,
candidateModels[names(finalLearners)]
)$mean
resultValidation = kfoldValidate(f3, validationDf, fitpredict.votingEnsemble, perf.auc, finalLearners,
candidateModels[names(finalLearners)]
)$mean
resultComplete = kfoldValidate(f3, df, fitpredict.votingEnsemble, perf.auc, finalLearners,
candidateModels[names(finalLearners)]
)$mean
kable(data.frame(
set=c('Training', 'Validation', 'All'),
auc=c(resultTraining, resultValidation, resultComplete)
), caption="Final ensemble performance")
| set | auc |
|---|---|
| Training | 0.7846805 |
| Validation | 0.7565049 |
| All | 0.7648807 |
We see that the ensemble does not significantly increase performance over the individual models.
We tried to construct an ensemble to predict whether a patient has diabetic retinopathy using features extracted from retinal photographs. While the reference paper reports an AUC of 0.989 for this task, we were unable to get close to those results.
Our analysis took into account feature scaling and feature rotation using PCA. We then performed feature selection using a leave-one-out technique to find reduced sets of features that would help the models better discriminate the data. Furthermore, we trained different models with several different parameters and combined the optimized models into an ensemble, which was also pruned using leave-one-out in an attempt to find a reduced ensemble with the best results.
We divided the dataset into two subsets, one used for model training and selection/validation, and another for training/testing. On each subset we used 10-fold cross-validation in order to make the most of the available data. One could argue that splitting the dataset in two before both tasks limits model performance, although it maximizes validity, since the final data is never used for model training/validation; however, we checked how the model behaved when the complete dataset was used for both tasks, and the results were very similar.
In addition, we used simple techniques for feature selection and scaling, and we could possibly arrive at better results by introducing more complex techniques for selecting and generating features. We also explored only small subspaces of the model parameters; there may be other parameter regions that would yield better performing models. However, for a feasible computation time on a single machine, larger searches were not possible. This could be overcome by splitting the training tasks into batched jobs that run in parallel on a multi-machine cluster - for example, we could evaluate all the leave-one-out possibilities at the same time on multiple machines, as sketched below.
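As a rough illustration of that idea on a single multi-core machine (a sketch, not part of the analysis above; a cluster version would distribute the same calls across machines), the leave-one-out rounds for one learner could be evaluated concurrently with the parallel package:
library(parallel)
# evaluate every leave-one-out formula for one learner concurrently
# (mc.cores > 1 requires a Unix-like system)
features = all.vars(f3)[-1]
roundAUCs = mclapply(seq_along(features), function(i){
roundFormula = as.formula(paste("class ~", paste(features[-i], collapse="+")))
kfoldValidate(roundFormula, trainingDf, fitpredict.lda, perf.auc)$mean
}, mc.cores=4)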
Finally, even our single models were unable to reach a performance close to the one reported in the paper. This may indicate that the algorithms we used were not robust enough, that they required more computation time to arrive at better results, or that the successful model parameters - which are not specified in the paper - lie in a region of the parameter space that our search did not cover.
In conclusion, despite the shortcomings in reaching good performance, this work provided a means to use and test multiple machine learning algorithms and to try to arrive at ensemble models that outperform individual learners. It also allowed exploring feature selection, feature generation, parameter selection and ensemble selection problems, and experiencing the computation time constraints involved in searching for candidate models in highly combinatorial spaces, even for a dataset as small as the one used.