This is a summary of a set of 3 experiments using a LONI pipeline workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], percentage of missing values [misValperc], minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
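As a rough illustration of how the six arguments define an experiment, the following Python sketch stores them in a small record and draws the feature and subject subsets for one job. The argument values are the ones printed for each experiment in this report. Everything else is an assumption made for illustration: the names `ExperimentSpec` and `sample_job`, the uniform draw of each percentage within its range, sampling without replacement, and the dataset dimensions used in the usage lines. The project's own code may sample differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six input arguments that identify one experiment."""
    M: int             # number of jobs
    misValperc: float  # % of missing values
    Kcol_min: float    # FSR lower bound, % of features
    Kcol_max: float    # FSR upper bound, % of features
    Nrow_min: float    # SSR lower bound, % of subjects
    Nrow_max: float    # SSR upper bound, % of subjects


# Values copied from the parameter lines printed for each experiment.
EXPERIMENTS = {
    1: ExperimentSpec(9000, 0, 1, 5, 30, 60),
    2: ExperimentSpec(9000, 0, 5, 15, 30, 60),
    3: ExperimentSpec(9000, 0, 15, 30, 30, 60),
}


def sample_job(spec, n_features, n_subjects, rng):
    """Draw one job's feature and subject subsets (hypothetical scheme).

    Assumes each percentage is drawn uniformly within its range and that
    indices are sampled without replacement.
    """
    feat_pct = rng.uniform(spec.Kcol_min, spec.Kcol_max)
    subj_pct = rng.uniform(spec.Nrow_min, spec.Nrow_max)
    k = max(1, round(n_features * feat_pct / 100))
    n = max(1, round(n_subjects * subj_pct / 100))
    features = sorted(rng.sample(range(1, n_features + 1), k))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n))
    return features, subjects


# Usage: 300 features and 1000 subjects are illustrative dimensions only.
rng = random.Random(0)
features, subjects = sample_job(EXPERIMENTS[1], 300, 1000, rng)
```

With these illustrative dimensions, a job in Experiment 1 would see between 3 and 15 features and between 300 and 600 subjects.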
This document reports the final results, by experiment. General documentation of the CBDA-SL project is available at https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true, and some of the code is on GitHub at https://github.com/SOCR/CBDA.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. For each method, the top selected features are listed; the number of features listed is set to 15 here.
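A minimal sketch of how such a top-15 listing could be built from the pooled selections of all jobs is given below. It assumes that the Density column is the selection frequency expressed as a percentage of all selections, which is consistent with the printed values (for example, 15 out of an implied 555 selections gives 2.7027027 in Experiment 1) but is an inference from the numbers, not a documented definition. The function name `top_features` and the tie-break by feature index are likewise assumptions.

```python
from collections import Counter


def top_features(selected, top=15):
    """Rank features by how often they were selected.

    `selected` is a flat list of feature indices pooled over all jobs.
    Returns (feature, frequency, density) tuples, where density is the
    frequency as a percentage of all selections (assumed definition).
    Ties are broken by ascending feature index.
    """
    counts = Counter(selected)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(feat, freq, 100.0 * freq / total) for feat, freq in ranked[:top]]


# Usage with a tiny made-up selection list (not data from the experiments).
for feat, freq, dens in top_features([300, 300, 300, 36, 36, 151, 49], top=3):
    print(feat, freq, round(dens, 4))
```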
## [1] EXPERIMENT 1
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 1 5 30 60
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"
## CBDA Frequency   Density Knockoff    Density
##  300        15 2.7027027       60  5.5408971
##   36         7 1.2612613      160  4.9252419
##  151         7 1.2612613      300  4.3975374
##   49         6 1.0810811       30  3.2248607
##  104         6 1.0810811      130  2.4626209
##   18         5 0.9009009      200  2.3746702
##   32         5 0.9009009        1  2.0815010
##   62         5 0.9009009      100  2.0228672
##   68         5 0.9009009      260  1.7003811
##   76         5 0.9009009       56  1.5537965
##  122         5 0.9009009      230  1.4658458
##  181         5 0.9009009      183  1.1726766
##  182         5 0.9009009      232  1.0847259
##  209         5 0.9009009      203  0.9381413
##  263         5 0.9009009      107  0.9088244
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
##
## [1] EXPERIMENT 2
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 5 15 30 60
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "2"
## CBDA Frequency   Density Knockoff    Density
##  300        35 2.1021021       60 16.3090129
##   30        20 1.2012012      300 12.5751073
##  130        11 0.6606607      160 11.5879828
##   73        10 0.6006006       30  8.3690987
##  184        10 0.6006006      200  4.6351931
##  193        10 0.6006006      130  3.2618026
##  274        10 0.6006006       56  3.1759657
##   31         9 0.5405405      100  2.6609442
##   36         9 0.5405405        1  2.3605150
##   56         9 0.5405405      260  1.4592275
##   82         9 0.5405405      183  1.0729614
##   95         9 0.5405405      203  1.0729614
##  106         9 0.5405405      230  1.0729614
##  118         9 0.5405405      232  1.0300429
##  122         9 0.5405405       70  0.8154506
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
##
## [1] EXPERIMENT 3
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 15 30 30 60
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "3"
## CBDA Frequency   Density Knockoff    Density
##  300        41 1.1398388       60 21.1706975
##  223        21 0.5838198      300 15.5976165
##   39        20 0.5560189      160 13.0389064
##   59        20 0.5560189       30  7.0802664
##   38        19 0.5282180      200  4.2411497
##   82        19 0.5282180       56  2.8040659
##  100        19 0.5282180      130  2.4886085
##  207        19 0.5282180        1  2.3133544
##  245        19 0.5282180      100  2.2783035
##  266        19 0.5282180      260  1.6123379
##  279        19 0.5282180      230  1.2968805
##   17        18 0.5004170      203  0.8412198
##   36        18 0.5004170      232  0.6309148
##  231        18 0.5004170      183  0.5958640
##  260        18 0.5004170      205  0.5958640
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
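One way to read the three tables is to count how many of each method's top 15 features also appear in the "Nonzero Features" list. The Python sketch below does that with the feature indices copied from the tables above. It assumes the "Nonzero Features" list is the reference set to recover; the names `NONZERO`, `TOP15`, `overlap`, and `hit_counts` are introduced here for illustration only.

```python
# Reference set: the "Nonzero Features" list printed for every experiment.
NONZERO = {1, 30, 60, 100, 130, 160, 200, 230, 260, 300}

# Top 15 feature indices per experiment and method, copied from the tables.
TOP15 = {
    1: {
        "CBDA": [300, 36, 151, 49, 104, 18, 32, 62, 68, 76,
                 122, 181, 182, 209, 263],
        "Knockoff": [60, 160, 300, 30, 130, 200, 1, 100, 260, 56,
                     230, 183, 232, 203, 107],
    },
    2: {
        "CBDA": [300, 30, 130, 73, 184, 193, 274, 31, 36, 56,
                 82, 95, 106, 118, 122],
        "Knockoff": [60, 300, 160, 30, 200, 130, 56, 100, 1, 260,
                     183, 203, 230, 232, 70],
    },
    3: {
        "CBDA": [300, 223, 39, 59, 38, 82, 100, 207, 245, 266,
                 279, 17, 36, 231, 260],
        "Knockoff": [60, 300, 160, 30, 200, 56, 130, 1, 100, 260,
                     230, 203, 232, 183, 205],
    },
}


def overlap(top, reference):
    """Return the sorted features that appear in both lists."""
    return sorted(set(top) & set(reference))


def hit_counts(top15=TOP15, reference=NONZERO):
    """Count, per experiment and method, the top features in the reference."""
    return {
        exp: {method: len(overlap(feats, reference))
              for method, feats in methods.items()}
        for exp, methods in top15.items()
    }


for exp, counts in hit_counts().items():
    print("Experiment", exp, counts)
```

On the tables above, the knockoff top 15 contains all 10 listed features in every experiment, while the CBDA-SL top 15 contains 1, 3, and 3 of them in Experiments 1, 2, and 3 respectively.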
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter; these are the ONLY features used in that analysis. The accuracy of the overall procedure is summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the CBDA-SL and knockoff filter algorithms are combined in a first stage to select the top features, and a second stage uses those top features in a final predictive modeling step that can ultimately be tested for accuracy, sensitivity,…..
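The evaluation step can be sketched as follows, assuming a binary outcome coded as 0/1 on the held-out subjects. The labels in the usage lines are made up for illustration and are not results from these experiments. The function names are hypothetical, and specificity is included only as a common companion metric, not one named in the text.

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def summarize(cm):
    """Derive accuracy, sensitivity, and specificity from the counts."""
    total = cm["TP"] + cm["FP"] + cm["TN"] + cm["FN"]
    positives = cm["TP"] + cm["FN"]
    negatives = cm["TN"] + cm["FP"]
    return {
        "accuracy": (cm["TP"] + cm["TN"]) / total if total else float("nan"),
        "sensitivity": cm["TP"] / positives if positives else float("nan"),
        "specificity": cm["TN"] / negatives if negatives else float("nan"),
    }


# Usage with made-up held-out labels and predictions (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(summarize(cm))
```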