This report summarizes a set of 2 experiments run with a LONI pipeline workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
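To make the role of the sampling ranges concrete, here is a minimal sketch of how a single job might draw its subset of subjects and features. It assumes the percentages are drawn uniformly between the stated minimum and maximum, which this summary does not confirm. The function name `sample_job` and the dataset size (300 subjects, 100 features) are hypothetical, and `misValperc` is left out because it is 0 in these experiments.

```python
import random

def sample_job(n_subjects, n_features, kcol_min, kcol_max, nrow_min, nrow_max, rng):
    """Draw one job's random subset of subjects (rows) and features (columns).

    Assumption: the sampling percentages are drawn uniformly from [min, max].
    """
    fsr = rng.uniform(kcol_min, kcol_max)  # % of features kept (FSR)
    ssr = rng.uniform(nrow_min, nrow_max)  # % of subjects kept (SSR)
    k = max(1, round(n_features * fsr / 100))
    n = max(1, round(n_subjects * ssr / 100))
    cols = sorted(rng.sample(range(1, n_features + 1), k))
    rows = sorted(rng.sample(range(1, n_subjects + 1), n))
    return rows, cols

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Ranges taken from Experiment 1; the dataset size is hypothetical.
rows, cols = sample_job(300, 100, 5, 15, 60, 80, rng)
print(len(rows), "subjects,", len(cols), "features")
```

Repeating this draw M times would give each job a different view of the data, which is what lets selection frequencies be compared across jobs.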
This document reports the final results, by experiment. General documentation of the CBDA-SL project is available at https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true, and some of the code is on GitHub at https://github.com/SOCR/CBDA.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. The top features selected are listed for each experiment; the number of top features is set to 15 here.
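As an illustration of how such a ranking can be produced, the sketch below counts how often each feature was selected and reports the top entries with their density. It assumes density means the percentage of all selections that went to a feature. That reading is inferred from the tables rather than stated in this report: the first CBDA row of Experiment 1 (frequency 16, density 3.319502) is consistent with 482 selections in total. The function name `top_features` and the toy input are hypothetical.

```python
from collections import Counter

def top_features(selections, top=15):
    """Rank features by selection count.

    Returns (feature, frequency, density_percent) tuples, where density is
    the share of all selections that went to that feature. Ties are broken
    by ascending feature index.
    """
    counts = Counter(selections)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(f, n, 100.0 * n / total) for f, n in ranked[:top]]

# Hypothetical toy input: feature indices pooled over a few jobs.
toy = [41, 41, 41, 19, 19, 37, 7]
ranking = top_features(toy, top=3)

# Consistency check for the inferred definition of density,
# assuming 482 selections in total (a value inferred, not reported).
implied_density = 100.0 * 16 / 482
print(ranking)
print(round(implied_density, 6))
```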
## [1] "C:/Users/simeonem/Documents/CBDA-SL/ExperimentsNov2016/NULL9000/NEW"
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   41        16 3.319502       49 11.173184
##   19        15 3.112033       51  8.491620
##   37        14 2.904564       41  7.709497
##   76        10 2.074689       63  5.251397
##   26         9 1.867220        7  5.027933
##   34         9 1.867220        1  4.972067
##   50         9 1.867220       37  3.463687
##   73         9 1.867220       14  3.072626
##    2         8 1.659751       55  2.960894
##    7         8 1.659751       44  2.849162
##    8         8 1.659751       77  2.625698
##   13         8 1.659751       50  2.402235
##   49         8 1.659751        8  2.234637
##   84         8 1.659751       39  1.620112
##   90         8 1.659751       64  1.620112
##
## [1] EXPERIMENT 2
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   41        24 2.203857       49 24.142012
##   15        19 1.744720       51 11.597633
##   19        18 1.652893       41  9.704142
##    7        17 1.561065        7  8.047337
##   37        17 1.561065       63  5.443787
##   90        17 1.561065        1  5.325444
##    8        16 1.469238       77  4.260355
##   20        16 1.469238       50  3.905325
##   59        16 1.469238       55  3.431953
##   63        16 1.469238       14  2.958580
##   17        15 1.377410       44  2.603550
##   66        15 1.377410        8  2.366864
##   84        15 1.377410       37  2.248521
##    2        14 1.285583       39  1.420118
##   26        14 1.285583       33  1.183432
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
## [1] "Combined set of features selected across multiple experiments"
## [1] 1 2 7 8 13 14 15 19 26 34 37 41 44 49 50 51 55 63 73 76 77
## [1] "Top best features selected across multiple experiments"
## [1] 15
## [1] "Length of top best features selected across multiple experiments"
## [1] 21
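For readers who want to check the overlap between the two methods directly, the sketch below intersects the top-15 lists copied from the two tables above. It is only an illustration of set overlap. It does not reproduce the 21-feature combined set, whose selection rule is not stated in this summary, and the helper name `shared` is hypothetical.

```python
# Top-15 feature indices copied from the Experiment 1 and 2 tables.
cbda_exp1 = [41, 19, 37, 76, 26, 34, 50, 73, 2, 7, 8, 13, 49, 84, 90]
knockoff_exp1 = [49, 51, 41, 63, 7, 1, 37, 14, 55, 44, 77, 50, 8, 39, 64]
cbda_exp2 = [41, 15, 19, 7, 37, 90, 8, 20, 59, 63, 17, 66, 84, 2, 26]
knockoff_exp2 = [49, 51, 41, 7, 63, 1, 77, 50, 55, 14, 44, 8, 37, 39, 33]

def shared(a, b):
    """Features present in both lists, in ascending order."""
    return sorted(set(a) & set(b))

shared_exp1 = shared(cbda_exp1, knockoff_exp1)
shared_exp2 = shared(cbda_exp2, knockoff_exp2)
# Features picked by both methods in both experiments.
shared_both = sorted(set(shared_exp1) & set(shared_exp2))

print(shared_exp1)  # [7, 8, 37, 41, 49, 50]
print(shared_exp2)  # [7, 8, 37, 41, 63]
print(shared_both)  # [7, 8, 37, 41]
```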
The 21 features in the combined set are then used to run a final analysis applying both the CBDA-SL and the knockoff filter; no other features enter this analysis. The accuracy of the overall procedure is summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the procedure has two stages. The first stage combines the CBDA-SL and knockoff filter algorithms to select the top features. The second stage uses those top features in a final predictive modeling step that can then be tested for accuracy, sensitivity, ...
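The evaluation step can be sketched as follows for a binary outcome. This is a generic illustration, not the CBDA-SL code: the function names and the toy labels are hypothetical, and the report's own predictions are not reproduced here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Hypothetical held-out labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts)
print(metrics)
```

Computing the metrics from the four counts keeps the confusion matrix as the single source of truth, so any further summary can be derived from the same numbers.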