This is a summary of a set of 2 experiments that use a LONI pipeline workflow file to run 3000 independent jobs, each applying the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the FSR (Feature Sampling Range), and min [Nrow_min] and max [Nrow_max] % for the SSR (Subject Sampling Range).
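To make the six identifying arguments concrete, here is a minimal Python sketch of one experiment specification and of how a single job might draw its feature and subject subsets from the FSR and SSR ranges. The class name `ExperimentSpec`, the helper `draw_subsample`, the uniform draw of each percentage, and the 300 x 1000 data shape are illustrative assumptions, not part of the CBDA code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six arguments that identify an experiment (names follow the report)."""
    M: int            # number of jobs
    misValperc: int   # % of missing values
    Kcol_min: int     # FSR lower bound, % of features
    Kcol_max: int     # FSR upper bound, % of features
    Nrow_min: int     # SSR lower bound, % of subjects
    Nrow_max: int     # SSR upper bound, % of subjects


def draw_subsample(spec, n_features, n_subjects, rng):
    """Draw one job's feature and subject subsets.

    Assumption: the sampled percentage is uniform within each range.
    """
    k_pct = rng.uniform(spec.Kcol_min, spec.Kcol_max)
    n_pct = rng.uniform(spec.Nrow_min, spec.Nrow_max)
    k = max(1, round(n_features * k_pct / 100))
    n = max(1, round(n_subjects * n_pct / 100))
    features = sorted(rng.sample(range(1, n_features + 1), k))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n))
    return features, subjects


# Argument values of Experiment 1 as reported; the data shape is hypothetical.
exp1 = ExperimentSpec(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                      Nrow_min=60, Nrow_max=80)
features, subjects = draw_subsample(exp1, n_features=300, n_subjects=1000,
                                    rng=random.Random(0))
print(len(features), len(subjects))
```

With these settings each job would see between 15 and 45 of the 300 hypothetical features and between 600 and 800 of the 1000 hypothetical subjects.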
This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms shown below. I list the top selected features; the number listed is set to 15 here.
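The Frequency and Density columns in the result tables can be read as selection counts and their share of all selections. The sketch below tallies features across jobs and keeps the top entries. The function name `top_features`, the toy job outputs, and the definition of density as 100 x frequency / total selections are assumptions; the density definition is at least consistent with the reported values, since frequencies of 37 and 21 with densities of 2.3536896 and 1.3358779 both imply the same total of 1572 selections.

```python
from collections import Counter


def top_features(selections_per_job, top=15):
    """Rank features by how often they were selected across jobs.

    Returns (feature, frequency, density) tuples, where density is the
    percentage of all selections (an assumed definition).
    """
    counts = Counter(f for job in selections_per_job for f in job)
    total = sum(counts.values())
    # Sort by descending frequency, then by feature index to break ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(f, c, 100.0 * c / total) for f, c in ranked[:top]]


# Toy example: features selected by four hypothetical jobs.
jobs = [[130, 260], [130, 60], [130, 260, 300], [258]]
for feature, freq, dens in top_features(jobs, top=3):
    print(feature, freq, dens)
```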
## [1] EXPERIMENT 1
##      M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 1 9000          0        5       15       60       80
## [1] "Nonzero features - Signal"
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL and Knockoff RESULTS"
## CBDA Frequency   Density Knockoff KO_Frequency KO_Density
##  130        37 2.3536896      160          731  17.529976
##  260        21 1.3358779      130          584  14.004796
##  258        20 1.2722646       30          449  10.767386
##   60        14 0.8905852      200          384   9.208633
##  273        14 0.8905852      260          383   9.184652
##  300        14 0.8905852      230          239   5.731415
##   67        12 0.7633588      100          189   4.532374
##   96        11 0.6997455      300          188   4.508393
##  179        11 0.6997455      273          174   4.172662
##  267        11 0.6997455      214          151   3.621103
##   39        10 0.6361323      222          111   2.661871
##   55        10 0.6361323        1          100   2.398082
##   83        10 0.6361323      142           66   1.582734
##   87        10 0.6361323       25           55   1.318945
##  141        10 0.6361323      258           55   1.318945
## [1] EXPERIMENT 2
##      M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 2 9000          0       15       30       60       80
## [1] "Nonzero features - Signal"
## [1] 1 30 60 100 130 160 200 230 260 300
## [1] "TABLE with CBDA-SL and Knockoff RESULTS"
## CBDA Frequency   Density Knockoff KO_Frequency KO_Density
##  130        47 1.2929849      160         1249 23.9869407
##  260        30 0.8253095      130          904 17.3612445
##  273        27 0.7427785       30          644 12.3679662
##  300        26 0.7152682      200          528 10.1401959
##  160        23 0.6327373      260          487  9.3527943
##   65        20 0.5502063      230          258  4.9548684
##  258        20 0.5502063      300          179  3.4376800
##   83        19 0.5226960      273          156  2.9959670
##  136        19 0.5226960      100          153  2.9383522
##  293        19 0.5226960      214          129  2.4774342
##  187        18 0.4951857      222           89  1.7092376
##  201        18 0.4951857        1           87  1.6708277
##    3        17 0.4676754      142           70  1.3443442
##    7        17 0.4676754       25           52  0.9986557
##   13        17 0.4676754      258           46  0.8834262
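Because the true signal features are known, the two tables can be compared directly against them. The sketch below counts how many of the 10 signal features appear in each method's top 15 and which features both methods select. The feature lists are copied from the tables above; the helper name `summarize` is illustrative.

```python
SIGNAL = {1, 30, 60, 100, 130, 160, 200, 230, 260, 300}

# Top-15 features per experiment, copied from the result tables.
TOP15 = {
    1: {
        "CBDA": [130, 260, 258, 60, 273, 300, 67, 96, 179, 267,
                 39, 55, 83, 87, 141],
        "Knockoff": [160, 130, 30, 200, 260, 230, 100, 300, 273, 214,
                     222, 1, 142, 25, 258],
    },
    2: {
        "CBDA": [130, 260, 273, 300, 160, 65, 258, 83, 136, 293,
                 187, 201, 3, 7, 13],
        "Knockoff": [160, 130, 30, 200, 260, 230, 300, 273, 100, 214,
                     222, 1, 142, 25, 258],
    },
}


def summarize(top15, signal=SIGNAL):
    """Count recovered signal features per method and list the shared picks."""
    cbda, knockoff = set(top15["CBDA"]), set(top15["Knockoff"])
    return {
        "cbda_hits": len(cbda & signal),
        "knockoff_hits": len(knockoff & signal),
        "selected_by_both": sorted(cbda & knockoff),
    }


for exp_id, top15 in TOP15.items():
    print(exp_id, summarize(top15))
```

In both experiments the knockoff top 15 contains 9 of the 10 signal features (all but feature 60), while the CBDA-SL top 15 contains 4.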
The features listed above are then used to run a final analysis that applies both the CBDA-SL and the knockoff filter; they are the ONLY features used in that analysis. The accuracy of the overall procedure is then summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second stage uses those features in a final predictive modeling step that can be tested for accuracy, sensitivity, …
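As an illustration of the last step, the following sketch builds a confusion matrix from held-out predictions and derives accuracy, sensitivity, and specificity from it. It assumes a binary outcome coded as 0/1; the function names and the toy labels are hypothetical and are not results from these experiments.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def _ratio(num, den):
    # Undefined ratios (empty class) are reported as NaN rather than raising.
    return num / den if den else float("nan")


def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Toy held-out labels and predictions (hypothetical).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts, metrics(*counts))
```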