This is a summary of a set of experiments using a LONI pipeline workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], minimum [Kcol_min] and maximum [Kcol_max] % for the Feature Sampling Range (FSR), and minimum [Nrow_min] and maximum [Nrow_max] % for the Subject Sampling Range (SSR).
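To make the six arguments concrete, the following is a minimal Python sketch of how they could parameterize the sampling done by a single job. The names `ExperimentConfig` and `sample_job_subset` are hypothetical and not taken from the CBDA code base. The sketch assumes the FSR and SSR percentages are drawn uniformly between their minimum and maximum and that subjects and features are then sampled without replacement; the actual CBDA implementation may differ. The dataset size (300 subjects, 1500 features) is also an assumption chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the six arguments that identify an experiment."""
    M: int              # number of jobs
    misValperc: float   # % of missing values
    Kcol_min: float     # FSR lower bound, % of features
    Kcol_max: float     # FSR upper bound, % of features
    Nrow_min: float     # SSR lower bound, % of subjects
    Nrow_max: float     # SSR upper bound, % of subjects


def sample_job_subset(cfg, n_subjects, n_features, rng):
    """Draw the subject and feature indices for one job (assumed uniform sampling)."""
    feat_pct = rng.uniform(cfg.Kcol_min, cfg.Kcol_max)
    subj_pct = rng.uniform(cfg.Nrow_min, cfg.Nrow_max)
    k = max(1, round(n_features * feat_pct / 100))
    n = max(1, round(n_subjects * subj_pct / 100))
    # 1-based indices, sampled without replacement.
    features = sorted(rng.sample(range(1, n_features + 1), k))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n))
    return subjects, features


# Arguments of Experiment 1 as reported in this document.
exp1 = ExperimentConfig(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                        Nrow_min=60, Nrow_max=80)
# Dataset dimensions below are assumptions for illustration only.
subjects, features = sample_job_subset(exp1, n_subjects=300, n_features=1500,
                                       rng=random.Random(0))
print(len(subjects), len(features))
```

With these assumed dimensions, each job would see between 75 and 225 features and between 180 and 240 subjects.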
This document reports the final results for each experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. The top features selected by each method are listed; the number of top features is set to 15 here.
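As a rough illustration of how a ranked frequency list of this kind can be produced, the sketch below tallies how often each feature index is selected across jobs and keeps the most frequent ones. The function name `top_features` and the toy job outputs are hypothetical; the sketch assumes each job reports each selected feature at most once and does not reproduce the CBDA ranking code.

```python
from collections import Counter


def top_features(job_selections, top=15):
    """Tally how often each feature index is selected across jobs.

    Returns (feature, frequency) pairs, most frequent first.
    """
    counts = Counter(f for selected in job_selections for f in selected)
    return counts.most_common(top)


# Made-up selections from four jobs (feature indices are illustrative only).
toy_jobs = [[1, 400, 871], [1, 400], [400, 226], [1, 400, 800]]
ranking = top_features(toy_jobs, top=3)
print(ranking)
```

A feature that is selected in many jobs accumulates a high count, which is what shows up as a spike in a histogram of selected feature indices.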
## [1] EXPERIMENT 1
##      M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 1 9000          0        5       15       60       80
## [1] "Nonzero features - Signal"
## [1] 1 100 200 400 600 800 1000 1200 1400 1500
## [1] "TABLE with CBDA-SL and Knockoff RESULTS"
## CBDA Frequency   Density Knockoff KO_Frequency KO_Density
##    1        39 0.4757838      400          164   7.617278
##  400        22 0.2683909        1          161   7.477938
## 1161        13 0.1585946     1000          135   6.270320
##  337        12 0.1463950      800          132   6.130980
##  871        12 0.1463950      322           93   4.319554
## 1135        12 0.1463950      251           63   2.926150
## 1246        12 0.1463950     1500           61   2.833256
##  226        11 0.1341954      384           58   2.693915
##  282        11 0.1341954      537           53   2.461681
##  413        11 0.1341954       25           49   2.275894
##  476        11 0.1341954     1305           47   2.183000
##  480        11 0.1341954     1200           45   2.090107
##  505        11 0.1341954      200           42   1.950766
##  570        11 0.1341954      496           42   1.950766
##  800        11 0.1341954       83           41   1.904320
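The Density and KO_Density values in the table above are numerically consistent with each frequency expressed as a percentage of the total number of selections made by that method. This relationship is inferred from the printed numbers and is not stated in the source, so the following Python sketch should be read as a consistency check under that assumption. The helper names `implied_total` and `density` are hypothetical.

```python
def implied_total(frequency, density_pct):
    """Total selection count implied by one (frequency, density %) pair."""
    return round(100.0 * frequency / density_pct)


def density(frequency, total):
    """Frequency as a percentage of the total selection count (assumed definition)."""
    return 100.0 * frequency / total


# First row of the Experiment 1 table: CBDA-SL and knockoff columns.
cbda_total = implied_total(39, 0.4757838)
ko_total = implied_total(164, 7.617278)

# Recompute the second row from the implied totals and compare with the table.
cbda_row2 = density(22, cbda_total)   # table reports 0.2683909
ko_row2 = density(161, ko_total)      # table reports 7.477938
print(cbda_total, ko_total, round(cbda_row2, 7), round(ko_row2, 6))
```

Under this reading, the much larger knockoff densities reflect a smaller total number of knockoff selections, not a different scale.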
## [1] EXPERIMENT 2
##      M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 2 9000          0       15       30       60       80
## [1] "Nonzero features - Signal"
## [1] 1 100 200 400 600 800 1000 1200 1400 1500
## [1] "KNOCKOFF FILTER returned no feature"
## [1] "TABLE with CBDA-SL RESULTS"
## CBDA Frequency   Density
##    1        38 0.2142656
##  400        34 0.1917113
##   60        24 0.1353256
##  200        24 0.1353256
##  100        22 0.1240485
##  775        22 0.1240485
##  458        21 0.1184099
##  600        21 0.1184099
##  800        21 0.1184099
## 1174        21 0.1184099
##  361        20 0.1127714
##  588        20 0.1127714
##  716        20 0.1127714
##  992        20 0.1127714
##    2        19 0.1071328
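One way to read the tables above is to count how many of the ten known signal features appear in each top-15 list. The Python sketch below does this with set intersections, using only the feature indices printed in this document. The variable and function names are hypothetical.

```python
# Known nonzero (signal) features, as printed for both experiments.
signal = {1, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1500}

# Top-15 lists copied from the tables in this document.
exp1_cbda = [1, 400, 1161, 337, 871, 1135, 1246, 226, 282, 413, 476, 480, 505, 570, 800]
exp1_knockoff = [400, 1, 1000, 800, 322, 251, 1500, 384, 537, 25, 1305, 1200, 200, 496, 83]
exp2_cbda = [1, 400, 60, 200, 100, 775, 458, 600, 800, 1174, 361, 588, 716, 992, 2]


def recovered(top_list):
    """Signal features that appear in a top-feature list, in ascending order."""
    return sorted(signal & set(top_list))


for name, top in [("Exp 1 CBDA-SL", exp1_cbda),
                  ("Exp 1 knockoff", exp1_knockoff),
                  ("Exp 2 CBDA-SL", exp2_cbda)]:
    hits = recovered(top)
    print(f"{name}: {len(hits)} of {len(signal)} signal features: {hits}")
```

For these lists, the counts are 3 of 10 for CBDA-SL and 7 of 10 for the knockoff filter in Experiment 1, and 6 of 10 for CBDA-SL in Experiment 2, where the knockoff filter returned no feature.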
The features listed above are then used to run a final analysis that applies both the CBDA-SL and the knockoff filter. Only the features listed above are used in this analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions are then used to generate the confusion matrix. In short, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second stage uses those top features in a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, …
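The evaluation step described above can be sketched in Python as follows, assuming a binary outcome coded as 0/1. The labels and predictions are made up for illustration and are not results from these experiments. The function names are hypothetical, and the inclusion of specificity alongside accuracy and sensitivity is an assumption about which metrics are of interest.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Made-up held-out labels and predictions (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts, metrics)
```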