This is a summary of a set of experiments run with a LONI Pipeline workflow file that performs 3000 independent jobs, each applying the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
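As a minimal sketch of how the six arguments can identify an experiment, the block below stores the values from the two experiment headers reported in this document in a data frame and builds one label per row. The `experiments` object and the label format are hypothetical and are not taken from the CBDA-SL code.

```r
# Parameter values copied from the EXPERIMENT 1 and EXPERIMENT 2 headers.
experiments <- data.frame(
  experiment = c(1L, 2L),
  M          = c(9000L, 9000L),  # number of jobs
  misValperc = c(0L, 0L),        # % of missing values
  Kcol_min   = c(5L, 15L),       # FSR lower bound (%)
  Kcol_max   = c(15L, 30L),      # FSR upper bound (%)
  Nrow_min   = c(60L, 60L),      # SSR lower bound (%)
  Nrow_max   = c(80L, 80L)       # SSR upper bound (%)
)

# Basic sanity checks on the sampling ranges.
stopifnot(all(experiments$Kcol_min < experiments$Kcol_max),
          all(experiments$Nrow_min < experiments$Nrow_max))

# Hypothetical label format: one string built from the six arguments.
experiments$label <- with(experiments, sprintf(
  "M%d_miss%d_FSR%d-%d_SSR%d-%d",
  M, misValperc, Kcol_min, Kcol_max, Nrow_min, Nrow_max
))

print(experiments$label)
```

Because the label is built from all six arguments, two experiments receive the same label only if every argument matches.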
This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. I list the top features selected by each method; the number of top features is set to 15 here.
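The ranking behind such a table can be sketched as follows. In the Experiment 1 table, the CBDA-SL Density values match 100 × Frequency / 1540 (for example, 100 × 25 / 1540 ≈ 1.6234), so this sketch assumes that Density is the percentage of all selections that fall on a feature. The helper `rank_features` and the toy input are hypothetical and are not part of the CBDA-SL code.

```r
# Hypothetical helper: rank features by how often they were selected.
# `selected` is a vector of feature indices pooled over all jobs.
rank_features <- function(selected, top = 15) {
  counts <- sort(table(selected), decreasing = TRUE)
  counts <- counts[seq_len(min(top, length(counts)))]
  data.frame(
    Feature   = as.integer(names(counts)),
    Frequency = as.integer(counts),
    # Assumed definition: share of all selections, in percent.
    Density   = 100 * as.integer(counts) / length(selected)
  )
}

# Toy input, made up for illustration only.
selected <- c(200, 30, 200, 100, 30, 200, 160, 100, 200, 30)
res <- rank_features(selected, top = 3)
print(res)
```

Setting `top = 15` would reproduce the table length used in this report.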
## [1] EXPERIMENT 1
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 5 15 60 80
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"
## CBDA Frequency   Density Knockoff   Density
##  200        25 1.6233766      160 17.529976
##   30        19 1.2337662      130 14.004796
##  100        18 1.1688312       30 10.767386
##  160        14 0.9090909      200  9.208633
##  157        13 0.8441558      260  9.184652
##  298        13 0.8441558      230  5.731415
##  106        12 0.7792208      100  4.532374
##  183        12 0.7792208      300  4.508393
##  142        11 0.7142857      273  4.172662
##  222        11 0.7142857      214  3.621103
##  252        11 0.7142857      222  2.661871
##  112        10 0.6493506        1  2.398082
##  187        10 0.6493506      142  1.582734
##  228        10 0.6493506       25  1.318945
##    6         9 0.5844156      258  1.318945
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] EXPERIMENT 2
## M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000 0 15 30 60 80
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "2"
## CBDA Frequency   Density Knockoff    Density
##  100        36 1.0098177      160 23.9869407
##  200        30 0.8415147      130 17.3612445
##   30        29 0.8134642       30 12.3679662
##    1        24 0.6732118      200 10.1401959
##  106        22 0.6171108      260  9.3527943
##  230        21 0.5890603      230  4.9548684
##   82        20 0.5610098      300  3.4376800
##  142        20 0.5610098      273  2.9959670
##  183        20 0.5610098      100  2.9383522
##  241        20 0.5610098      214  2.4774342
##   21        19 0.5329593      222  1.7092376
##   66        19 0.5329593        1  1.6708277
##   73        19 0.5329593      142  1.3443442
##  107        19 0.5329593       25  0.9986557
##  113        19 0.5329593      258  0.8834262
## [1] "Nonzero Features"
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
## [1] 200 30 100 160 157 298 106 183 142 222 252 1 130 260 230 300 273
## [18] 214
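As a quick check on the list above, the sketch below compares the selected top features with the "Nonzero Features" vector printed for each experiment. It assumes that vector lists the true signal features of the dataset; the variable names are hypothetical.

```r
# Both vectors are copied from the output in this document.
nonzero  <- c(1, 30, 60, 100, 130, 160, 200, 230, 260, 300)
top_feat <- c(200, 30, 100, 160, 157, 298, 106, 183, 142, 222, 252,
              1, 130, 260, 230, 300, 273, 214)

recovered <- intersect(top_feat, nonzero)  # nonzero features that were selected
missed    <- setdiff(nonzero, top_feat)    # nonzero features not selected
extra     <- setdiff(top_feat, nonzero)    # selected features not in `nonzero`

cat("Recovered:", length(recovered), "of", length(nonzero), "\n")
cat("Missed:", missed, "\n")
cat("Extra:", extra, "\n")
```

Under that assumption, 9 of the 10 nonzero features appear in the list (feature 60 is absent), and 9 of the 18 listed features are not in the nonzero set.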
The features listed above are then used to run a final analysis that applies both the CBDA-SL and the knockoff filter algorithms. Only the features listed above are used in this analysis. The accuracy of the overall procedure is summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second stage uses those top features in a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, …
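The evaluation step can be sketched as follows, assuming a binary outcome coded 0/1. The labels and predictions are made up for illustration and are not results from these experiments; in the actual workflow the predictions would come from the CBDA-SL object applied to the held-out subjects.

```r
# Made-up held-out labels and predictions, for illustration only.
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 1, 0, 1, 0, 0, 1, 0, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true labels.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

tp <- conf_mat["1", "1"]  # true positives
tn <- conf_mat["0", "0"]  # true negatives
fn <- conf_mat["0", "1"]  # false negatives

accuracy    <- unname((tp + tn) / sum(conf_mat))
sensitivity <- unname(tp / (tp + fn))

cat("Accuracy:", accuracy, "\n")
cat("Sensitivity:", sensitivity, "\n")
```

Fixing the factor levels to `c(0, 1)` keeps the table 2 × 2 even if one class never appears among the predictions.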