This is a summary of a set of 2 experiments run with a LONI Pipeline workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the FSR (Feature Sampling Range), and min [Nrow_min] and max [Nrow_max] % for the SSR (Subject Sampling Range).
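To make the six arguments concrete, here is a minimal Python sketch of how one job might draw its feature and subject subsets from the FSR and SSR. It assumes the percentages are drawn uniformly within each range, which the report does not state; `ExperimentSpec`, `draw_job_subset`, and the data dimensions (300 features, 1000 subjects) are hypothetical names and placeholders, not part of the CBDA-SL code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six arguments that identify one experiment (names follow the report)."""
    M: int             # number of jobs
    misValperc: float  # % of missing values
    Kcol_min: float    # FSR lower bound, % of features
    Kcol_max: float    # FSR upper bound, % of features
    Nrow_min: float    # SSR lower bound, % of subjects
    Nrow_max: float    # SSR upper bound, % of subjects


def draw_job_subset(spec, n_features, n_subjects, rng):
    """Draw one job's feature and subject subsets (uniform sampling is an assumption)."""
    k_pct = rng.uniform(spec.Kcol_min, spec.Kcol_max)
    n_pct = rng.uniform(spec.Nrow_min, spec.Nrow_max)
    k = max(1, round(n_features * k_pct / 100))
    n = max(1, round(n_subjects * n_pct / 100))
    # Sample without replacement; indices are 1-based as in the report's tables.
    features = sorted(rng.sample(range(1, n_features + 1), k))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n))
    return features, subjects


# Arguments of Experiment 1; the data dimensions are placeholders.
exp1 = ExperimentSpec(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                      Nrow_min=60, Nrow_max=80)
features, subjects = draw_job_subset(exp1, n_features=300, n_subjects=1000,
                                     rng=random.Random(0))
print(len(features), len(subjects))
```

With these placeholder dimensions, each job would see between 15 and 45 features and between 600 and 800 subjects.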
This document reports the final results for each experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and https://github.com/SOCR/CBDA on GitHub for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithm appear as spikes in the corresponding histograms. For each experiment, the top selected features are listed; the number of top features is set to 15 here.
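A top-15 table of this kind can be derived by pooling the features selected across jobs and ranking them by frequency. The Python sketch below assumes that Density is the frequency expressed as a percentage of all pooled selections. The report does not state this, but the tabulated values are consistent with it: a frequency of 25 with density 1.8037518 implies 1386 pooled selections. The function name `top_features` and the toy input are hypothetical.

```python
from collections import Counter


def top_features(selected, top=15):
    """Return (feature, frequency, density %) rows for the most frequent features."""
    counts = Counter(selected)
    total = sum(counts.values())
    # Rank by descending frequency; break ties by ascending feature index.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(feat, freq, 100.0 * freq / total) for feat, freq in ranked[:top]]


# Toy input: feature indices pooled over jobs (made up, not the experiment's data).
pooled = [211] * 5 + [285] * 3 + [23] * 2 + [65]
for feat, freq, dens in top_features(pooled, top=3):
    print(feat, freq, round(dens, 4))

# Consistency check for the assumed density definition:
# 25 selections out of an inferred total of 1386.
print(round(100 * 25 / 1386, 7))  # 1.8037518
```

Breaking ties by ascending feature index matches the order in which equal-frequency features appear in the CBDA columns of the tables.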
## [1] "C:/Users/simeonem/Documents/CBDA-SL/ExperimentsNov2016/NULL9000/NEW"
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff   Density
##  211        25 1.8037518      157 13.940724
##  285        20 1.4430014       47  7.683864
##   23        13 0.9379509      245  6.915477
##   65        12 0.8658009      211  5.817783
##   62        11 0.7936508      219  4.720088
##  219        11 0.7936508       54  3.293085
##    4        10 0.7215007       97  3.293085
##    6        10 0.7215007      206  3.073546
##   10        10 0.7215007      128  2.963776
##  267        10 0.7215007      182  2.744237
##   51         9 0.6493506      299  2.744237
##   60         9 0.6493506      300  2.634468
##  133         9 0.6493506       42  2.414929
##  221         9 0.6493506      285  2.414929
##  298         9 0.6493506      236  2.085620
##
## [1] EXPERIMENT 2
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff   Density
##  285        22 0.6890072      157 20.588235
##  211        21 0.6576887      245  7.466063
##   20        19 0.5950517       47  6.674208
##   97        18 0.5637332      211  6.221719
##  157        18 0.5637332      219  4.638009
##  219        18 0.5637332       54  4.072398
##  237        18 0.5637332       97  4.072398
##   18        17 0.5324147      182  3.506787
##  137        17 0.5324147      206  3.506787
##  201        17 0.5324147      300  3.393665
##  226        17 0.5324147      299  2.941176
##  248        17 0.5324147      128  2.149321
##  296        17 0.5324147      285  1.809955
##   67        16 0.5010961      210  1.696833
##  100        16 0.5010961      296  1.696833
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] "Combined set of features selected across multiple experiments"
## [1] 4 6 10 23 42 47 51 54 60 62 65 97 128 133 157 182 206
## [18] 211 219 245 267 285 299 300
## [1] "Top best features selected across multiple experiments"
## [1] 15
## [1] "Length of top best features selected across multiple experiments"
## [1] 24
The features listed above are then used to run a final analysis that applies both CBDA-SL and the knockoff filter; no other features enter this analysis. The accuracy of the overall procedure is assessed by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the procedure has two stages: the first combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second uses those features for a final predictive modeling step that can be tested for accuracy, sensitivity, …
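The evaluation step reduces to a few ratios over the confusion matrix. The following Python sketch assumes a binary outcome coded 0/1; the labels are made up for illustration, and `confusion_counts` and `summarize` are hypothetical helper names, not functions from the CBDA-SL code.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(tp, tn, fp, fn):
    """Return accuracy, sensitivity, and specificity from the four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Illustrative held-out labels and predictions (not results from the experiments).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = summarize(*confusion_counts(y_true, y_pred))
print(metrics)
```

Computing the metrics from the four counts, rather than directly from the label vectors, keeps the sketch usable with a confusion matrix produced by any other tool.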