This report summarizes a set of experiments run with a LONI pipeline workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
This document reports the final results, by experiment. General documentation of the CBDA-SL project is available at https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true, and some of the code is on GitHub at https://github.com/SOCR/CBDA.
Features selected by both the knockoff filter and the CBDA-SL algorithm appear as spikes in the histograms. The tables below list the top features selected by each method; the number of top features is set to 15 here.
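To make the role of the FSR and SSR arguments concrete, here is a minimal Python sketch of how a single job might draw its feature and subject subsets. It rests on assumptions that this report does not confirm: that each job picks a percentage uniformly inside each range and then samples without replacement. The helper `sample_job` and the data set size are hypothetical, and the actual CBDA-SL code is written in R and may differ.

```python
import random


def sample_job(n_subjects, n_features, kcol_min, kcol_max, nrow_min, nrow_max, rng):
    """Draw one job's subject and feature subsets (hypothetical helper).

    Assumption: the job picks a percentage uniformly inside the FSR
    [kcol_min, kcol_max] and the SSR [nrow_min, nrow_max], then samples
    that share of columns and rows without replacement.
    """
    feat_pct = rng.uniform(kcol_min, kcol_max)
    subj_pct = rng.uniform(nrow_min, nrow_max)
    n_feat = max(1, round(n_features * feat_pct / 100))
    n_subj = max(1, round(n_subjects * subj_pct / 100))
    # 1-based indices, sorted for readability
    features = sorted(rng.sample(range(1, n_features + 1), n_feat))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n_subj))
    return subjects, features


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Experiment 1 settings: Kcol 5-15 %, Nrow 60-80 %.
# The size (300 subjects x 900 features) is made up for illustration.
subjects, features = sample_job(300, 900, 5, 15, 60, 80, rng)
print(len(subjects), len(features))
```

Under these assumptions, every job sees between 45 and 135 of the 900 hypothetical features and between 180 and 240 of the 300 hypothetical subjects.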
## [1] "C:/Users/simeonem/Documents/CBDA-SL/ExperimentsNov2016/NULL9000/NEW"
## [1] EXPERIMENT 1
##      M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max
##   9000          0          5         15         60         80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff  Density
##  749        16 0.3254017      749 9.806157
##  285        15 0.3050641       32 5.074116
##  722        15 0.3050641      101 3.477765
##   32        14 0.2847265      324 3.078677
##  119        13 0.2643889      348 2.793615
##   63        12 0.2440513      526 2.508552
##  216        12 0.2440513      701 2.451539
##  315        12 0.2440513      250 2.052452
##  324        12 0.2440513      772 1.995439
##  546        12 0.2440513      471 1.938426
##  612        12 0.2440513      321 1.881414
##  790        12 0.2440513      368 1.824401
##  132        11 0.2237136      527 1.767389
##  268        11 0.2237136      653 1.767389
##  320        11 0.2237136      454 1.710376
## [1] EXPERIMENT 2
##      M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max
##   9000          0         15         30         60         80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff  Density
##  661        21 0.2114591      749 7.674944
##  699        20 0.2013896      324 3.837472
##  744        20 0.2013896      701 3.611738
##  271        19 0.1913201      348 3.386005
##  318        19 0.1913201      772 3.386005
##  324        19 0.1913201       32 3.160271
##  513        19 0.1913201      101 2.934537
##  296        18 0.1812506      321 2.708804
##  334        18 0.1812506      526 2.708804
##  345        18 0.1812506      454 2.483070
##  437        18 0.1812506       13 2.031603
##  749        18 0.1812506      177 2.031603
##  758        18 0.1812506      250 2.031603
##   30        17 0.1711811      388 2.031603
##   32        17 0.1711811      471 2.031603
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
## [1] "Combined set of features selected across multiple experiments"
## [1] 32 63 101 119 132 216 250 268 285 315 320 321 324 348 368 454 471
## [18] 526 527 546 612 653 701 722 749 772 790
## [1] "Top best features selected across multiple experiments"
## [1] 15
## [1] "Length of top best features selected across multiple experiments"
## [1] 27
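The 27 features printed above coincide with the union of the two top-15 lists in the Experiment 1 table. The short Python sketch below checks this. The variable names are illustrative only, and the report's own computation was presumably done in R.

```python
# Top-15 feature indices copied from the Experiment 1 table
cbda_top = [749, 285, 722, 32, 119, 63, 216, 315, 324, 546, 612, 790, 132, 268, 320]
knockoff_top = [749, 32, 101, 324, 348, 526, 701, 250, 772, 471, 321, 368, 527, 653, 454]

# Union: every feature ranked in the top 15 by at least one method
combined = sorted(set(cbda_top) | set(knockoff_top))
# Intersection: features ranked in the top 15 by both methods
shared = sorted(set(cbda_top) & set(knockoff_top))

print(len(combined))  # 27
print(shared)         # [32, 324, 749]
```

The union has 27 entries because the two lists overlap on three features (15 + 15 - 3). Features that appear only in the Experiment 2 table, such as 661 or 699, are not part of the printed set.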
The features listed above are then used to run a final analysis that applies both CBDA-SL and the knockoff filter; no other features enter this analysis. The accuracy of the overall procedure is then assessed by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the procedure has two stages: the first combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second uses those features in a final predictive modeling step that can be tested for accuracy, sensitivity, ...
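As an illustration of the last step, the following Python sketch shows how accuracy, sensitivity, and specificity can be derived from a binary confusion matrix. The labels are toy values invented for the example and are not results from these experiments. The function names are hypothetical, and a binary outcome coded 0/1 is assumed.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary outcome."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Standard metrics computed from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy held-out labels and predictions (made up for illustration)
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
metrics = summarize(*counts)
print(counts)   # (3, 2, 4, 1)
print(metrics)
```

With these toy labels, 7 of 10 subjects are classified correctly (accuracy 0.7), 3 of the 4 positives are detected (sensitivity 0.75), and 4 of the 6 negatives are correctly rejected (specificity of about 0.67).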