This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the % of missing values [misValperc], the min [Kcol_min] and max [Kcol_max] % for the FSR (Feature Sampling Range), and the min [Nrow_min] and max [Nrow_max] % for the SSR (Subject Sampling Range).
This document reports the final results, by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].
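To make the six arguments concrete, here is a minimal sketch of how one job could draw its random subset of subjects and features from the two sampling ranges. It is an illustration only, not the Cranium workflow code: the function name `sample_job_subset`, the uniform draw of the sampling percentage, and the dataset dimensions in the usage lines are all assumptions.

```r
# Hypothetical helper (not part of the CBDA code base): draw the row and
# column subset for one job, given the FSR and SSR bounds in percent.
sample_job_subset <- function(n_subjects, n_features,
                              Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  # Assumption: the sampling percentage is drawn uniformly within each range.
  k_perc <- runif(1, Kcol_min, Kcol_max)
  n_perc <- runif(1, Nrow_min, Nrow_max)
  # Convert percentages to counts, keeping at least one feature and subject.
  k <- max(1, round(n_features * k_perc / 100))
  n <- max(1, round(n_subjects * n_perc / 100))
  list(rows = sort(sample.int(n_subjects, n)),
       cols = sort(sample.int(n_features, k)))
}

# Usage with the Experiment 1 ranges (FSR 5-15%, SSR 60-80%).
# 1000 subjects and 60 features are placeholder dimensions, not the real data.
set.seed(1)
job <- sample_job_subset(n_subjects = 1000, n_features = 60,
                         Kcol_min = 5, Kcol_max = 15,
                         Nrow_min = 60, Nrow_max = 80)
length(job$cols)  # number of features sampled for this job
length(job$rows)  # number of subjects sampled for this job
```

Under these assumptions, each job sees a different small slice of the data, which is what lets feature selection frequencies be compared across many jobs.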
# Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms. For each experiment I list the top features selected by each method; the number of features listed is set to 15 here.
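In the tables that follow, the CBDA and Knockoff columns hold feature indices. For CBDA-SL, Density appears to be Frequency expressed as a percentage of all selections: in Experiment 1, a frequency of 10 at a density of 2.985075 implies 335 selections in total. A minimal sketch of that tabulation, assuming a plain vector of selected feature indices; `selected` and `top_features` are illustrative names, and the data are simulated rather than taken from the experiments:

```r
# Illustrative helper: rank features by how often they were selected.
top_features <- function(selected, top = 15) {
  counts <- sort(table(selected), decreasing = TRUE)
  out <- data.frame(
    Feature   = as.integer(names(counts)),
    Frequency = as.integer(counts),
    # Density as a percentage of all selections (assumed definition).
    Density   = 100 * as.numeric(counts) / sum(counts)
  )
  head(out, top)
}

# Simulated selections over 61 feature indices (not the ADNI results).
set.seed(123)
selected <- sample(1:61, size = 335, replace = TRUE)
top_features(selected)
```

With uniformly simulated selections no feature stands out; a real signal would show up as a few indices with clearly higher frequency than the rest.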
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   20        10 2.985075        7 2.243323
##   60         9 2.686567        6 2.176855
##    1         8 2.388060        2 2.167359
##   29         8 2.388060        4 2.136499
##   31         8 2.388060        1 2.091395
##   37         8 2.388060        9 2.091395
##   41         8 2.388060        3 2.058160
##   44         8 2.388060       10 2.053412
##   56         8 2.388060        8 2.024926
##    2         7 2.089552        5 1.984570
##   10         7 2.089552       59 1.984570
##   19         7 2.089552       42 1.956083
##   23         7 2.089552       60 1.948961
##   54         7 2.089552       51 1.846884
##   55         7 2.089552       39 1.832641
##
## [1] EXPERIMENT 7
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       40       60
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   46        12 3.571429       59 2.190557
##    4        10 2.976190        9 2.174016
##    1         9 2.678571        6 2.166927
##    3         9 2.678571        8 2.138570
##   61         9 2.678571        3 2.119665
##   11         8 2.380952        7 2.117302
##   20         8 2.380952        4 2.091309
##   34         8 2.380952        5 2.074767
##   52         8 2.380952        2 2.060589
##   56         8 2.380952        1 2.001512
##   15         7 2.083333       60 1.916442
##   18         7 2.083333       10 1.807741
##   26         7 2.083333       41 1.786474
##   31         7 2.083333       42 1.758117
##   45         7 2.083333       51 1.720308
##
## [1] EXPERIMENT 8
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30       40       60
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   52        22 3.021978        9 3.666917
##    1        19 2.609890        2 3.661548
##    9        17 2.335165        3 3.661548
##   22        17 2.335165        6 3.640073
##   45        17 2.335165        1 3.595333
##   61        16 2.197802        4 3.582805
##    4        15 2.060440        5 3.563120
##    7        15 2.060440        7 3.362683
##   34        14 1.923077        8 3.321522
##   39        14 1.923077       59 2.822220
##   51        14 1.923077       60 2.634310
##   15        13 1.785714       10 2.509038
##   16        13 1.785714       51 2.364079
##   33        13 1.785714       42 2.324707
##   42        13 1.785714       41 2.186907
##
## [1] EXPERIMENT 9
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       30       50       40       60
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##    4        29 2.187029        1 5.884280
##   19        26 1.960784        4 5.882642
##   22        26 1.960784        9 5.823668
##   57        26 1.960784        6 5.781076
##   17        25 1.885370        3 5.761418
##   28        25 1.885370        2 5.707359
##   30        25 1.885370        5 5.024245
##    8        24 1.809955        7 4.958718
##    9        24 1.809955        8 4.688421
##   24        24 1.809955       59 2.748837
##   48        24 1.809955       10 2.701330
##   58        24 1.809955       51 2.427757
##   15        23 1.734540       42 2.354040
##   18        23 1.734540       60 2.308171
##   23        23 1.734540       41 2.106677
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 4 60 1 3 41 9 2 10 7
## [1] "CDGLOBAL" "R_caudate" "subjectSex"
## [4] "GDTOTAL" "L_cuneus" "seriesIdentifier"
## [7] "MMSCORE" "Background" "subjectAge"
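The per-experiment overlap behind this list can be checked directly from the tables. The sketch below intersects the two top-15 lists of Experiment 1 as printed above. How the overlaps are pooled across all 9 experiments is not stated here, so that step is left out rather than guessed; the variable names are illustrative.

```r
# Top-15 feature indices for Experiment 1, copied from the table above.
cbda_exp1     <- c(20, 60, 1, 29, 31, 37, 41, 44, 56, 2, 10, 19, 23, 54, 55)
knockoff_exp1 <- c(7, 6, 2, 4, 1, 9, 3, 10, 8, 5, 59, 42, 60, 51, 39)

# Features ranked in the top 15 by both methods (order follows the CBDA list).
shared_exp1 <- intersect(cbda_exp1, knockoff_exp1)
shared_exp1
# [1] 60  1  2 10
```

All four of these indices (60, 1, 2, 10) also appear in the final shared list above.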
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. These are the ONLY features used in that analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined in a first stage to select the top features. The second stage then uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, etc.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD     100   10  24      0
##     LMCI     9  169  25     12
##     MCI     13   20 144      7
##     Normal   0   10   2    205
##
## Overall Statistics
##
##                Accuracy : 0.824
##                  95% CI : (0.7948, 0.8506)
##     No Information Rate : 0.2987
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.7624
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.8197      0.8086     0.7385        0.9152
## Specificity             0.9459      0.9150     0.9279        0.9772
## Pos Pred Value          0.7463      0.7860     0.7826        0.9447
## Neg Pred Value          0.9643      0.9252     0.9099        0.9644
## Prevalence              0.1627      0.2787     0.2600        0.2987
## Detection Rate          0.1333      0.2253     0.1920        0.2733
## Detection Prevalence    0.1787      0.2867     0.2453        0.2893
## Balanced Accuracy       0.8828      0.8618     0.8332        0.9462
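As a cross-check, the headline statistics above can be recomputed from the confusion matrix counts alone, using base R and the standard one-vs-rest definitions. This is a sketch for verification, not the code that produced the report; the variable names are illustrative.

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = reference class).
classes <- c("AD", "LMCI", "MCI", "Normal")
cm <- matrix(c(100,  10,  24,   0,
                 9, 169,  25,  12,
                13,  20, 144,   7,
                 0,  10,   2, 205),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)            # 750 held-out subjects
tp <- diag(cm)           # correctly predicted cases per class
fn <- colSums(cm) - tp   # reference-class cases predicted as another class
fp <- rowSums(cm) - tp   # other-class cases predicted as this class
tn <- n - tp - fn - fp

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

round(c(Accuracy = accuracy, Kappa = kappa), 4)
round(rbind(Sensitivity = sensitivity,
            Specificity = specificity,
            `Pos Pred Value` = ppv), 4)
```

The recomputed accuracy (618 correct out of 750) and per-class values agree with the reported output to four decimals.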