Some useful information

This is a summary of a set of 2 experiments run with a LONI pipeline workflow file that performs 3000 independent jobs, each applying the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
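As a rough illustration of how the six arguments could drive a single job, the sketch below draws one subject subset and one feature subset from the SSR and FSR ranges. It is not the CBDA-SL implementation: the `ExperimentSpec` class, the `sample_job` helper, the uniform draw of each percentage, and the dataset size of 300 subjects by 300 features are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six input arguments that identify an experiment."""
    M: int             # number of jobs
    misValperc: float  # % of missing values
    Kcol_min: float    # FSR lower bound, % of features
    Kcol_max: float    # FSR upper bound, % of features
    Nrow_min: float    # SSR lower bound, % of subjects
    Nrow_max: float    # SSR upper bound, % of subjects


def sample_job(spec, n_subjects, n_features, rng):
    """Hypothetical helper: pick the rows and columns used by one job."""
    # Assumption: each percentage is drawn uniformly within its range.
    row_pct = rng.uniform(spec.Nrow_min, spec.Nrow_max)
    col_pct = rng.uniform(spec.Kcol_min, spec.Kcol_max)
    n_rows = max(1, round(n_subjects * row_pct / 100))
    n_cols = max(1, round(n_features * col_pct / 100))
    rows = sorted(rng.sample(range(n_subjects), n_rows))
    # Features are numbered from 1, as in the tables of this report.
    cols = sorted(rng.sample(range(1, n_features + 1), n_cols))
    return rows, cols


# Arguments of Experiment 1; the 300 x 300 dataset size is assumed.
spec = ExperimentSpec(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                      Nrow_min=60, Nrow_max=80)
rng = random.Random(0)
rows, cols = sample_job(spec, n_subjects=300, n_features=300, rng=rng)
print(len(rows), "subjects and", len(cols), "features in this job")
```

With these settings every job would see between 60% and 80% of the subjects and between 5% and 15% of the features, so no single job has access to the full dataset.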

This document reports the final results, by experiment. General documentation of the CBDA-SL project is available at https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true, and some of the code is on GitHub at https://github.com/SOCR/CBDA.

Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms shown below. For each experiment, the top 15 features selected by each method are listed.
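The Density column in the tables below appears to be each feature's selection count expressed as a percentage of all counts. For example, 25 selections with a density of 1.6233766 is consistent with a total of 1540 selections, a figure inferred from the table rather than stated in it. The following sketch shows one way such a ranking could be built; `top_features` is a hypothetical helper and the toy selection list is made up for the example.

```python
from collections import Counter


def top_features(selections, k=15):
    """Rank features by selection count and attach a density in percent.

    `selections` holds one feature index per time a feature was selected.
    Ties are broken by ascending feature index (an assumption that matches
    the ordering seen in the tables).
    """
    counts = Counter(selections)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(feat, cnt, 100.0 * cnt / total) for feat, cnt in ranked[:k]]


# Toy data, not taken from the experiments.
toy = [200] * 5 + [30] * 3 + [100] * 2
for feature, frequency, density in top_features(toy, k=2):
    print(feature, frequency, round(density, 4))

# Consistency check against the first CBDA row of Experiment 1,
# using the inferred total of 1540 selections.
inferred_density = 100.0 * 25 / 1540
```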

## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80 
##  [1]   1  30  60 100 130 160 200 230 260 300

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  CBDA Frequency Density   Knockoff Density  
##  200  25        1.6233766 160      17.529976
##  30   19        1.2337662 130      14.004796
##  100  18        1.1688312  30      10.767386
##  160  14        0.9090909 200       9.208633
##  157  13        0.8441558 260       9.184652
##  298  13        0.8441558 230       5.731415
##  106  12        0.7792208 100       4.532374
##  183  12        0.7792208 300       4.508393
##  142  11        0.7142857 273       4.172662
##  222  11        0.7142857 214       3.621103
##  252  11        0.7142857 222       2.661871
##  112  10        0.6493506   1       2.398082
##  187  10        0.6493506 142       1.582734
##  228  10        0.6493506  25       1.318945
##  6     9        0.5844156 258       1.318945
## [1] "Nonzero Features"
##  [1]   1  30  60 100 130 160 200 230 260 300
## 
## [1] EXPERIMENT 2
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         60         80 
##  [1]   1  30  60 100 130 160 200 230 260 300

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "2"         
##  CBDA Frequency Density   Knockoff Density   
##  100  36        1.0098177 160      23.9869407
##  200  30        0.8415147 130      17.3612445
##  30   29        0.8134642  30      12.3679662
##  1    24        0.6732118 200      10.1401959
##  106  22        0.6171108 260       9.3527943
##  230  21        0.5890603 230       4.9548684
##  82   20        0.5610098 300       3.4376800
##  142  20        0.5610098 273       2.9959670
##  183  20        0.5610098 100       2.9383522
##  241  20        0.5610098 214       2.4774342
##  21   19        0.5329593 222       1.7092376
##  66   19        0.5329593   1       1.6708277
##  73   19        0.5329593 142       1.3443442
##  107  19        0.5329593  25       0.9986557
##  113  19        0.5329593 258       0.8834262
## [1] "Nonzero Features"
##  [1]   1  30  60 100 130 160 200 230 260 300
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
##  [1] 200  30 100 160 157 298 106 183 142 222 252   1 130 260 230 300 273
## [18] 214

The features listed above are then used to run a final analysis that applies both the CBDA-SL and the knockoff filter. ONLY the features listed above are used in this analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions are then used to generate the confusion matrix. In short, the procedure is two-stage: the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second stage uses those features in a final predictive modeling step that can ultimately be tested for accuracy, sensitivity,…..
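The step from held-out predictions to summary measures can be sketched as follows. This is a minimal illustration that assumes a binary outcome coded 0/1; the helper names and the toy labels are hypothetical and are not results from these experiments.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_ratio(num, den):
    """Avoid a ZeroDivisionError when a class is absent."""
    return num / den if den else float("nan")


def summarize(y_true, y_pred):
    """Accuracy, sensitivity and specificity from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": safe_ratio(tp, tp + fn),  # true positive rate
        "specificity": safe_ratio(tn, tn + fp),  # true negative rate
    }


# Made-up labels for eight held-out subjects.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = summarize(y_true, y_pred)
print(confusion_counts(y_true, y_pred), metrics)
```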