This is a summary of a set of experiments run with a LONI pipeline workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
This document reports the final results for each experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. I list the top features selected by each method; the number listed is set to 15 here.
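To make the role of the six arguments concrete, here is a minimal Python sketch of how one job might draw its subjects and features from the two sampling ranges. It is only an illustration: the names `ExperimentSpec` and `sample_job_subset`, the uniform draw of the sampling percentage, and the data dimensions (1000 subjects, 300 features) are assumptions made for this example, not taken from the CBDA code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six arguments that identify an experiment (names as in the report)."""
    M: int              # number of jobs
    misValperc: float   # % of missing values
    Kcol_min: float     # min % of features sampled (FSR)
    Kcol_max: float     # max % of features sampled (FSR)
    Nrow_min: float     # min % of subjects sampled (SSR)
    Nrow_max: float     # max % of subjects sampled (SSR)


def sample_job_subset(spec, n_subjects, n_features, rng):
    """Draw one job's subject and feature indices (1-based).

    Assumption: the sampling percentage is drawn uniformly within each range.
    """
    feat_pct = rng.uniform(spec.Kcol_min, spec.Kcol_max)
    subj_pct = rng.uniform(spec.Nrow_min, spec.Nrow_max)
    k = max(1, round(n_features * feat_pct / 100))
    n = max(1, round(n_subjects * subj_pct / 100))
    features = sorted(rng.sample(range(1, n_features + 1), k))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n))
    return subjects, features


# Argument values of Experiment 1; the data dimensions are hypothetical.
spec1 = ExperimentSpec(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                       Nrow_min=60, Nrow_max=80)
rng = random.Random(0)
subjects, features = sample_job_subset(spec1, n_subjects=1000,
                                       n_features=300, rng=rng)
print(len(subjects), "subjects and", len(features), "features in this job")
```

Under these assumptions, every job of Experiment 1 would see between 5% and 15% of the features and between 60% and 80% of the subjects.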
## [1] EXPERIMENT 1
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 5 15 60 80
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"
## CBDA Frequency   Density Knockoff   Density
##  130        25 1.4124294      160 17.038778
##  260        23 1.2994350      130 13.795535
##  273        15 0.8474576       30 10.975323
##  105        13 0.7344633      260  9.071680
##   95        12 0.6779661      200  9.048179
##  253        11 0.6214689      230  5.334900
##  297        11 0.6214689      300  4.770858
##    4        10 0.5649718      273  3.901293
##   18        10 0.5649718      214  3.807286
##   33        10 0.5649718      100  3.713278
##   47        10 0.5649718      222  2.655699
##   58        10 0.5649718        1  2.185664
##   87        10 0.5649718      142  1.997650
##  171        10 0.5649718       25  1.645123
##  190        10 0.5649718      129  1.480611
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
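The report does not define the Density column. The CBDA values in the Experiment 1 table are consistent with each Frequency expressed as a percentage of one common total count. The short Python check below illustrates this reading; it is an inference from the printed numbers, not a statement of how the CBDA code computes the column, and the helper name `implied_total` is made up for this example.

```python
# First five CBDA rows of the Experiment 1 table: (feature, frequency, density).
cbda_exp1 = [
    (130, 25, 1.4124294),
    (260, 23, 1.2994350),
    (273, 15, 0.8474576),
    (105, 13, 0.7344633),
    (95, 12, 0.6779661),
]


def implied_total(frequency, density_pct):
    """Total count implied by density = 100 * frequency / total."""
    return 100.0 * frequency / density_pct


# If the reading is right, every row should imply the same total.
totals = [implied_total(freq, dens) for _, freq, dens in cbda_exp1]
implied = round(totals[0])

for feature, freq, dens in cbda_exp1:
    recomputed = 100.0 * freq / implied
    print(feature, freq, dens, round(recomputed, 7))
```

If every row implies the same total and the recomputed densities match the printed ones, the percentage interpretation holds for these rows.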
## [1] EXPERIMENT 2
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 15 30 60 80
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "2"
## CBDA Frequency   Density Knockoff    Density
##  130        33 0.9780676      160 24.5901639
##  260        28 0.8298755      130 17.3355945
##  217        23 0.6816835       30 11.4565668
##  264        23 0.6816835      260  9.5534200
##  297        22 0.6520451      200  9.4592048
##  299        21 0.6224066      230  5.3891087
##   52        20 0.5927682      300  3.9947239
##  122        20 0.5927682      100  3.3163746
##  188        18 0.5334914      273  3.0902581
##  206        18 0.5334914      214  2.4684379
##  225        18 0.5334914        1  1.8277746
##   43        17 0.5038530      222  1.5828151
##  118        17 0.5038530      142  1.1871114
##  128        17 0.5038530       25  0.8290936
##  300        17 0.5038530      258  0.6595063
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
## [1] 130 260 273 105 217 264 95 297 299 253 52 122 160 30 200 230 300
## [18] 100
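One way to read the lists above is to compare them with the "Nonzero Features" vector. The Python sketch below does this with plain set operations. It assumes that the nonzero features are the ones that truly carry signal in the data, which the report does not state explicitly; the helper name `recovery` is made up for this example.

```python
# "Nonzero Features" as printed for both experiments.
nonzero = {1, 30, 60, 100, 130, 160, 200, 230, 260, 300}

# Top features selected across experiments (the combined list above).
combined = [130, 260, 273, 105, 217, 264, 95, 297, 299, 253,
            52, 122, 160, 30, 200, 230, 300, 100]

# Top-15 feature columns of the Experiment 1 table.
cbda_top15_exp1 = [130, 260, 273, 105, 95, 253, 297, 4, 18, 33,
                   47, 58, 87, 171, 190]
knockoff_top15_exp1 = [160, 130, 30, 260, 200, 230, 300, 273, 214, 100,
                       222, 1, 142, 25, 129]


def recovery(selected, truth):
    """Split a selection into hits, missed truths, and extra features."""
    chosen = set(selected)
    hits = sorted(chosen & truth)
    missed = sorted(truth - chosen)
    extra = sorted(chosen - truth)
    return hits, missed, extra


for name, selection in [("combined", combined),
                        ("CBDA-SL top 15, exp. 1", cbda_top15_exp1),
                        ("knockoff top 15, exp. 1", knockoff_top15_exp1)]:
    hits, missed, extra = recovery(selection, nonzero)
    print(f"{name}: {len(hits)} hits, missed {missed}, {len(extra)} extra")
```

On the printed numbers, the combined list contains 8 of the 10 nonzero features (1 and 60 are missing), while the knockoff top 15 of Experiment 1 contains 9 and the CBDA-SL top 15 contains 2.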
The features listed above are then used to run a final analysis applying both CBDA-SL and the knockoff filter. ONLY the features listed above are used in this analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions are then used to generate the confusion matrix. In short, the CBDA-SL and knockoff filter algorithms are combined in a first stage to select the top features; the second stage then uses those features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, …
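As an illustration of the last step, the Python sketch below builds a binary confusion matrix from held-out predictions and derives accuracy and sensitivity from it. The labels are invented for this example and are not results from any experiment; the function names are also hypothetical, and the project's own code may compute these quantities differently.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Fraction of all held-out subjects that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def sensitivity(tp, fn):
    """Fraction of truly positive subjects that were predicted positive."""
    return tp / (tp + fn)


# Hypothetical held-out outcomes and predictions (made up for illustration).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print("confusion matrix (rows = actual, cols = predicted):")
print([[tp, fn], [fp, tn]])
print("accuracy:", accuracy(tp, fn, fp, tn))
print("sensitivity:", sensitivity(tp, fn))
```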