This is a summary of a single experiment run with a LONI Pipeline workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. I list the top features selected by each method; the number of features listed is set to 15 here.
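To make the roles of the FSR and SSR arguments concrete, the following is a minimal Python sketch of how a single job might draw its subsample. It is not the CBDA-SL code: the helper name `draw_subsample`, the uniform draw of the sampling percentage, and the data sizes (1000 subjects, 300 features) are all assumptions made for illustration. Only the percentage bounds come from Experiment 1.

```python
import random

def draw_subsample(n_subjects, n_features, kcol_min, kcol_max,
                   nrow_min, nrow_max, rng):
    """Draw one job's subject and feature subsets (hypothetical helper)."""
    # Assumption: the sampled percentage is uniform between the min and max bounds.
    feature_frac = rng.uniform(kcol_min, kcol_max) / 100.0
    subject_frac = rng.uniform(nrow_min, nrow_max) / 100.0
    n_feat = max(1, round(n_features * feature_frac))
    n_subj = max(1, round(n_subjects * subject_frac))
    # Sample without replacement; features are numbered from 1, as in the report.
    features = sorted(rng.sample(range(1, n_features + 1), n_feat))
    subjects = sorted(rng.sample(range(n_subjects), n_subj))
    return subjects, features

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Kcol and Nrow bounds are the Experiment 1 settings; the data sizes are assumed.
subjects, features = draw_subsample(1000, 300, 5, 15, 60, 80, rng)
print(len(subjects), len(features))
```

Under these assumptions each job sees between 60% and 80% of the subjects and between 5% and 15% of the features, so a given feature is evaluated in only a fraction of the jobs.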
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff    Density
##  260        30 1.9620667      100 17.2537743
##   30        23 1.5042511       30 17.2058471
##  300        22 1.4388489        1 15.5763240
##  255        12 0.7848267      260 12.9882578
##  258        12 0.7848267      200 10.1126288
##   81        11 0.7194245      300  3.9539899
##  105        11 0.7194245      190  2.0369039
##  205        11 0.7194245      299  1.9650132
##  230        11 0.7194245      192  1.3898874
##   17        10 0.6540222      130  1.1742152
##  160        10 0.6540222      138  0.9825066
##  271        10 0.6540222      258  0.9825066
##   11         9 0.5886200       43  0.8866523
##   12         9 0.5886200      160  0.8866523
##   89         9 0.5886200       57  0.6949437
## 89 9 0.5886200 57 0.6949437
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
## [1] 260 30 300 255 258 81 105 205 230 17 160 271 11 12 89 100 1
## [18] 200 190 299 192 130 138 43 57
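The relationship between the table and the combined feature list can be checked with a short Python sketch. The feature numbers and frequencies are copied from the output above. Two points are assumptions rather than statements from the report: that the CBDA-SL Density column is the frequency expressed as a percentage of all selections, and that the total number of selections is 1529, a value inferred from the first table row.

```python
# Top-15 rows copied from the table: (feature, frequency) for CBDA-SL,
# and the feature column for the knockoff filter.
cbda_top = [(260, 30), (30, 23), (300, 22), (255, 12), (258, 12), (81, 11),
            (105, 11), (205, 11), (230, 11), (17, 10), (160, 10), (271, 10),
            (11, 9), (12, 9), (89, 9)]
knockoff_top = [100, 30, 1, 260, 200, 300, 190, 299, 192, 130,
                138, 258, 43, 160, 57]
nonzero = [1, 30, 60, 100, 130, 160, 200, 230, 260, 300]

# Assumption: Density = 100 * Frequency / total selections. The total is
# inferred from the first row (30 / 1.9620667 * 100); the report does not state it.
total = round(30 / 1.9620667 * 100)
densities = {feature: 100 * freq / total for feature, freq in cbda_top}

cbda_features = [feature for feature, _ in cbda_top]
# Order-preserving union of the two top-15 lists.
union = list(dict.fromkeys(cbda_features + knockoff_top))
# Features that appear in both top-15 lists.
both = [f for f in cbda_features if f in knockoff_top]
# Nonzero features that made it into the combined list.
recovered = [f for f in nonzero if f in union]

print(len(union), both, recovered)
```

The union has 25 features and reproduces the combined list above in the same order. Five features (260, 30, 300, 258, 160) appear in both top-15 lists, and 9 of the 10 nonzero features are in the combined list; feature 60 is the one that is missing.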
The features listed above are then used to run a final analysis that applies both CBDA-SL and the knockoff filter. These are the ONLY features used in that analysis. The accuracy of the overall procedure is summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the procedure has two stages: the first combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second uses those features in a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, …
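As an illustration of that last step, here is a minimal Python sketch that builds confusion matrix counts from held-out predictions and derives accuracy and sensitivity from them. The labels and predictions are hypothetical values chosen for the example, not results from this experiment, and `confusion_counts` is a helper name introduced here.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts

# Hypothetical held-out labels and predictions (not data from the report).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

c = confusion_counts(y_true, y_pred)
accuracy = (c["tp"] + c["tn"]) / len(y_true)      # share of correct predictions
sensitivity = c["tp"] / (c["tp"] + c["fn"])       # share of positives recovered
print(dict(c), accuracy, sensitivity)
```

Any other summary derived from the confusion matrix can be computed from the same four counts.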