Some useful information

This is a summary of a set of experiments run with a LONI Pipeline workflow file that performs 3000 independent jobs, each one applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
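As an illustration of how the six arguments could drive a single job, here is a minimal Python sketch. It assumes that each job draws a percentage uniformly between the min and max of each range and then samples that share of features (FSR) and subjects (SSR) without replacement; the report does not state the exact sampling rule. The names `ExperimentSpec`, `sample_subset` and `draw_job` are hypothetical, and so are the dataset dimensions (300 features, 300 subjects).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six input arguments that identify one experiment."""
    M: int             # number of jobs
    misValperc: float  # % of missing values
    Kcol_min: float    # FSR lower bound, % of features
    Kcol_max: float    # FSR upper bound, % of features
    Nrow_min: float    # SSR lower bound, % of subjects
    Nrow_max: float    # SSR upper bound, % of subjects


def sample_subset(n_items, pct_min, pct_max, rng):
    """Pick a random share (between pct_min and pct_max %) of 1..n_items."""
    pct = rng.uniform(pct_min, pct_max)        # assumed: uniform draw of the percentage
    k = max(1, round(n_items * pct / 100))     # number of items to keep
    return sorted(rng.sample(range(1, n_items + 1), k))


def draw_job(spec, n_features, n_subjects, seed):
    """Return the feature and subject indexes used by one job."""
    rng = random.Random(seed)
    cols = sample_subset(n_features, spec.Kcol_min, spec.Kcol_max, rng)
    rows = sample_subset(n_subjects, spec.Nrow_min, spec.Nrow_max, rng)
    return cols, rows


# Arguments of Experiment 1; the dataset dimensions are assumptions.
spec = ExperimentSpec(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                      Nrow_min=60, Nrow_max=80)
cols, rows = draw_job(spec, n_features=300, n_subjects=300, seed=1)
print(len(cols), "features and", len(rows), "subjects sampled for this job")
```

Seeding each job separately keeps the jobs independent and reproducible, which suits a workflow that runs thousands of them in parallel.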

This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.

Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. For each method I list the top features selected; the number of features listed is set to 15 here.
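The CBDA Density values in the tables are consistent with each feature's share of all selections, expressed in percent. In Experiment 1, for example, 100 * 30 / 1529 ≈ 1.9620667, where the total of 1529 selections is inferred from the table and not stated in the report. The following Python sketch builds such a frequency and density table from a flat list of selected feature indexes; the function name `frequency_table` and the toy input are hypothetical.

```python
from collections import Counter


def frequency_table(selected, top=15):
    """Rank features by how often they were selected.

    Returns (feature, frequency, density) tuples, where density is the
    feature's share of all selections, in percent.
    """
    counts = Counter(selected)
    total = sum(counts.values())
    # Sort by decreasing frequency; break ties by feature index.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(feature, freq, 100.0 * freq / total) for feature, freq in ranked[:top]]


# Toy input (made up): feature indexes pooled across jobs.
selected = [260, 260, 260, 30, 30, 300]
table = frequency_table(selected)
for feature, freq, density in table:
    print(f"{feature:>4} {freq:>3} {density:10.7f}")

# Arithmetic check against the first CBDA row of Experiment 1.
density_260 = 100.0 * 30 / 1529
```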

## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80 
##  [1]  10  20  30  40  50  60  70  80  90 100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density   Knockoff Density   
##  260  30        1.9620667 100      17.2537743
##  30   23        1.5042511  30      17.2058471
##  300  22        1.4388489   1      15.5763240
##  255  12        0.7848267 260      12.9882578
##  258  12        0.7848267 200      10.1126288
##  81   11        0.7194245 300       3.9539899
##  105  11        0.7194245 190       2.0369039
##  205  11        0.7194245 299       1.9650132
##  230  11        0.7194245 192       1.3898874
##  17   10        0.6540222 130       1.1742152
##  160  10        0.6540222 138       0.9825066
##  271  10        0.6540222 258       0.9825066
##  11    9        0.5886200  43       0.8866523
##  12    9        0.5886200 160       0.8866523
##  89    9        0.5886200  57       0.6949437
## [1] "Nonzero Features"
##  [1]  10  20  30  40  50  60  70  80  90 100
## [1] EXPERIMENT 2
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         60         80 
##  [1]  10  20  30  40  50  60  70  80  90 100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density   Knockoff Density   
##  300  42        1.1462882 100      23.0740323
##  260  39        1.0644105  30      22.4915445
##  30   32        0.8733624   1      20.7065013
##  258  29        0.7914847 260      13.2093198
##  151  23        0.6277293 200       8.0420894
##  160  22        0.6004367 300       3.4009771
##  178  22        0.6004367 299       1.0334461
##  65   20        0.5458515 190       1.0146561
##  121  20        0.5458515 130       0.6388576
##  264  20        0.5458515 192       0.6200676
##  198  19        0.5185590  43       0.5261180
##  15   18        0.4912664  86       0.3570086
##  105  18        0.4912664 258       0.3570086
##  146  18        0.4912664 142       0.3382187
##  230  18        0.4912664 238       0.3194288
## [1] "Nonzero Features"
##  [1]  10  20  30  40  50  60  70  80  90 100
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
##  [1] 260  30 300 258 255  81 105 205 230  17 160 100   1 200 190 299 192
## [18] 130 138
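
The report does not state the rule used to build the shared list above. One simple possibility is an ordered union of each method's top-k features. The Python sketch below tries this with k = 11 on the Experiment 1 tables; it yields the same 19 features as the list above, though with features 255 and 258 in swapped order. This is an observation about the numbers, not a confirmed description of the procedure, and both the function name `ordered_union` and the choice of k are assumptions.

```python
def ordered_union(*rankings, k):
    """Merge the first k entries of each ranking, dropping repeats."""
    seen = set()
    merged = []
    for ranking in rankings:
        for feature in ranking[:k]:
            if feature not in seen:
                seen.add(feature)
                merged.append(feature)
    return merged


# Top-15 feature columns copied from the Experiment 1 table.
cbda_exp1 = [260, 30, 300, 255, 258, 81, 105, 205, 230, 17, 160, 271, 11, 12, 89]
knockoff_exp1 = [100, 30, 1, 260, 200, 300, 190, 299, 192, 130, 138, 258, 43, 160, 57]

# The shared list as printed in the report.
reported = [260, 30, 300, 258, 255, 81, 105, 205, 230, 17, 160,
            100, 1, 200, 190, 299, 192, 130, 138]

merged = ordered_union(cbda_exp1, knockoff_exp1, k=11)  # k = 11 is an assumption
print(merged)
print("same feature set as reported:", set(merged) == set(reported))
```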

The features listed above are then used to run a final analysis that applies both the CBDA-SL and the knockoff filter; they are the ONLY features used in that analysis. The accuracy of the overall procedure is summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the procedure has two stages. The first stage combines the CBDA-SL and knockoff filter algorithms to select the top features. The second stage uses those top features in a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, …
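
To make the final evaluation step concrete, the Python sketch below computes a binary confusion matrix with accuracy, sensitivity and specificity from held-out labels and predictions. It assumes a binary outcome coded 0/1, which the report does not confirm. The labels are made up for illustration, and the function names `confusion_matrix` and `summarize` are hypothetical.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    cm = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            cm["TP" if truth == positive else "FP"] += 1
        else:
            cm["FN" if truth == positive else "TN"] += 1
    return cm


def summarize(cm):
    """Accuracy, sensitivity and specificity from the four counts."""
    total = sum(cm.values())
    positives = cm["TP"] + cm["FN"]
    negatives = cm["TN"] + cm["FP"]
    return {
        "accuracy": (cm["TP"] + cm["TN"]) / total,
        # Sensitivity: share of true positives that were detected.
        "sensitivity": cm["TP"] / positives if positives else float("nan"),
        # Specificity: share of true negatives that were detected.
        "specificity": cm["TN"] / negatives if negatives else float("nan"),
    }


# Made-up held-out labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
metrics = summarize(cm)
print(cm)
print(metrics)
```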