This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the % of missing values [misValperc], the min [Kcol_min] and max [Kcol_max] % for the FSR (Feature Sampling Range), and the min [Nrow_min] and max [Nrow_max] % for the SSR (Subject Sampling Range).
This document reports the final results, by experiment. In this case the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project, and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].
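To make the two sampling ranges concrete, here is a minimal sketch of how a single job might draw its random subset of subjects and features. It is an illustration only: the helper `sample_job`, the uniform draw within each range, and the dataset dimensions (750 subjects, 60 features) are assumptions, not code or values taken from the CBDA repository.

```r
# Hypothetical helper: draw one job's random subset of subjects and features.
# Argument names mirror the experiment specification; the uniform draw within
# each range is an assumption made for illustration.
sample_job <- function(n_rows, n_cols, Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  # Pick a sampling percentage within each range
  k_perc <- runif(1, Kcol_min, Kcol_max)
  n_perc <- runif(1, Nrow_min, Nrow_max)
  # Convert percentages to counts (at least one row and one column)
  k <- max(1, round(n_cols * k_perc / 100))
  n <- max(1, round(n_rows * n_perc / 100))
  list(
    rows = sort(sample.int(n_rows, n)),  # subject indices for this job
    cols = sort(sample.int(n_cols, k))   # feature indices for this job
  )
}

set.seed(1)
# Placeholder dimensions; ranges match Experiment 1 (FSR 5-15%, SSR 60-80%)
job <- sample_job(n_rows = 750, n_cols = 60,
                  Kcol_min = 5, Kcol_max = 15,
                  Nrow_min = 60, Nrow_max = 80)
length(job$cols)  # number of features sampled for this job
length(job$rows)  # number of subjects sampled for this job
```

With these placeholder dimensions, each job would see between 3 and 9 features and between 450 and 600 subjects.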
# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we have no information on the “true” features. I list the top features selected, set to 20 here.
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##     M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
##  9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   34        11 3.448276        7 2.885030
##   37        10 3.134796        4 2.764949
##   23         9 2.821317        1 2.752633
##   53         9 2.821317        8 2.731079
##    5         7 2.194357        6 2.728000
##   16         7 2.194357        2 2.724921
##   41         7 2.194357        5 2.687973
##   52         7 2.194357        3 2.654104
##    4         6 1.880878       57 2.620235
##    6         6 1.880878       58 2.564813
##    7         6 1.880878       37 2.346204
##    8         6 1.880878       49 2.059856
##   10         6 1.880878       13 1.915143
##   13         6 1.880878       53 1.819693
##   27         6 1.880878       38 1.795061
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 1 2 3 4 5 6 7 8 57 58 37 49
## [1] "subjectSex" "MMSCORE"
## [3] "GDTOTAL" "CDGLOBAL"
## [5] "NPISCORE" "FAQTOTAL"
## [7] "subjectAge" "weightKg"
## [9] "L_caudate" "R_caudate"
## [11] "L_inferior_occipital_gyrus" "L_lingual_gyrus"
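The ranking and intersection step can be sketched in a few lines of base R. The selection vectors below are toy data, and `rank_features` is a hypothetical helper, not the function used to produce this report. The Density column is taken to be the frequency as a percentage of all selections, which is consistent with the CBDA values in the table (for example, 100 × 11 / 319 ≈ 3.448276).

```r
# Toy stand-ins for the feature indices selected across jobs (illustrative only)
cbda_selected     <- c(34, 34, 34, 37, 37, 5, 5, 4, 23)
knockoff_selected <- c(7, 7, 4, 4, 4, 1, 37, 37, 5)

# Hypothetical helper: rank features by how often they were selected.
# Density is the selection count expressed as a percentage of all selections.
rank_features <- function(selected) {
  freq <- sort(table(selected), decreasing = TRUE)
  data.frame(
    Feature   = as.integer(names(freq)),
    Frequency = as.integer(freq),
    Density   = 100 * as.integer(freq) / sum(freq)
  )
}

top <- 3  # small cutoff for the toy data; the report uses a larger one
cbda_top     <- head(rank_features(cbda_selected), top)$Feature
knockoff_top <- head(rank_features(knockoff_selected), top)$Feature

# Features that appear in both top lists
shared <- intersect(knockoff_top, cbda_top)
shared
```

In the toy data only feature 37 is in both top-3 lists. Indexing the vector of column names with `shared` would then give the feature names, as in the listing above.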
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. These are the ONLY features used in that analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and Knockoff Filter algorithms are combined in a first stage to select the top features. The second stage then uses the top common features to run a final predictive modeling step that can be tested for accuracy, sensitivity, specificity, and the other statistics reported below.
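As a minimal sketch of the last step, predictions and true labels can be turned into a confusion matrix in base R. The labels below are toy data, not the ADNI hold-out set. Giving both factors the same explicit level order keeps the table square even when a class is never predicted or never observed, and avoids relying on the function to reorder levels.

```r
# Class labels in one fixed order, shared by predictions and truth
classes <- c("AD", "LMCI", "MCI", "Normal")

# Toy held-out labels and predictions (illustrative only)
truth <- c("AD", "MCI", "MCI", "Normal", "Normal", "AD")
pred  <- c("AD", "MCI", "AD",  "Normal", "MCI",    "AD")

# Same levels on both factors, so empty classes (here LMCI) are kept
truth_f <- factor(truth, levels = classes)
pred_f  <- factor(pred,  levels = classes)

# Rows are predictions, columns are the reference labels
cm <- table(Prediction = pred_f, Reference = truth_f)
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```

If the caret package is installed, `caret::confusionMatrix(pred_f, truth_f)` should produce a full report of the kind shown below from the same two factors.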
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(pred_temp, truth_temp): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD      96    0  35      1
##     LMCI     0    0   0      0
##     MCI     26    0 355     14
##     Normal   0    0  14    209
##
## Overall Statistics
##
##                Accuracy : 0.88
##                  95% CI : (0.8546, 0.9024)
##     No Information Rate : 0.5387
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.7996
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.7869          NA     0.8787        0.9330
## Specificity             0.9427           1     0.8844        0.9734
## Pos Pred Value          0.7273          NA     0.8987        0.9372
## Neg Pred Value          0.9579          NA     0.8620        0.9715
## Prevalence              0.1627           0     0.5387        0.2987
## Detection Rate          0.1280           0     0.4733        0.2787
## Detection Prevalence    0.1760           0     0.5267        0.2973
## Balanced Accuracy       0.8648          NA     0.8816        0.9532
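The headline statistics can be recomputed directly from the counts in the confusion matrix, which is a useful check on how each one is defined. The sketch below uses only the counts printed above; the variable names are illustrative.

```r
classes <- c("AD", "LMCI", "MCI", "Normal")

# Counts copied from the confusion matrix (rows: prediction, columns: reference)
cm <- matrix(c(96, 0,  35,   1,
                0, 0,   0,   0,
               26, 0, 355,  14,
                0, 0,  14, 209),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)             # 750 held-out subjects
tp <- diag(cm)            # correctly predicted, per class
fn <- colSums(cm) - tp    # members of the class predicted as something else
fp <- rowSums(cm) - tp    # other classes predicted as this class
tn <- n - tp - fn - fp

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn)   # NaN for LMCI: no reference cases (0/0)
specificity <- tn / (tn + fp)

# Cohen's kappa: agreement beyond what the row and column margins alone imply
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)      # 0.88
round(sensitivity, 4)   # AD 0.7869, LMCI NaN, MCI 0.8787, Normal 0.933
round(cohen_kappa, 4)   # 0.7996
```

The LMCI class has no subjects in the held-out set and is never predicted, which is why its sensitivity is undefined (printed as NA above) while its specificity is 1.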