Some useful information

This is a summary of a set of 6 experiments run with a LONI Pipeline workflow file. The workflow performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
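
As an illustration of how the six input arguments identify an experiment, the following minimal Python sketch rebuilds the parameter sets from the sampling ranges printed in the experiment headers of this report. The function name `build_experiments` and the dictionary layout are hypothetical and are not part of the CBDA code base, which is written in R.

```python
from itertools import product

# Sampling ranges (in %) read off the experiment headers in this report.
FSR_RANGES = [(1, 5), (5, 15), (15, 30)]   # (Kcol_min, Kcol_max)
SSR_RANGES = [(30, 60), (60, 80)]          # (Nrow_min, Nrow_max)


def build_experiments(m=9000, mis_val_perc=0):
    """Return one dict holding the six input arguments per experiment."""
    experiments = []
    # SSR varies slowest and FSR fastest, matching the order of the report.
    for (nrow_min, nrow_max), (kcol_min, kcol_max) in product(SSR_RANGES, FSR_RANGES):
        experiments.append({
            "M": m,
            "misValperc": mis_val_perc,
            "Kcol_min": kcol_min,
            "Kcol_max": kcol_max,
            "Nrow_min": nrow_min,
            "Nrow_max": nrow_max,
        })
    return experiments


experiments = build_experiments()
for i, exp in enumerate(experiments, start=1):
    print(f"EXPERIMENT {i}: {exp}")
```

Crossing three FSR ranges with two SSR ranges gives the six experiments reported here, with M and misValperc held fixed.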

This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.

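Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. For each experiment, the tables list the top 15 selected features under each ranking (Accuracy, MSE, and, where available, Knockoff), together with their selection Count and Density.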
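
The ranking by selection frequency can be sketched as follows. This is a minimal Python illustration, not the project's R implementation: the helper names are hypothetical, the input is toy data, and the definition of Density as a percentage of all selections is an assumption.

```python
from collections import Counter


def top_features(selected, top=15):
    """Rank feature indices by how often they were selected.

    `selected` is a flat list of feature indices pooled over all jobs.
    Returns (feature, count, density) tuples, where density is the
    percentage of all selections (an assumed definition).
    """
    counts = Counter(selected)
    total = sum(counts.values())
    # Sort by descending count, breaking ties by feature index.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(f, c, 100.0 * c / total) for f, c in ranked[:top]]


def selected_by_both(ranking_a, ranking_b):
    """Features that appear in both top lists, sorted by index."""
    return sorted({f for f, *_ in ranking_a} & {f for f, *_ in ranking_b})


# Toy data only, not taken from the experiments.
cbda = top_features([100, 100, 100, 400, 400, 863, 7], top=3)
knockoff = top_features([100, 100, 1000, 1000, 400], top=3)
print(cbda)
print(selected_by_both(cbda, knockoff))
```

Features returned by `selected_by_both` correspond to the ones that would show up under both methods.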

## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          1          5         30         60 
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  Accuracy Count Density    MSE  Count Density    Knockoff Count Density  
##  100      115   0.15737902 297  65    0.09720353  100     164   8.4318766
##  1500     103   0.14095686 565  65    0.09720353 1000     130   6.6838046
##  1000      97   0.13274578 62   63    0.09421265 1500     109   5.6041131
##  863       94   0.12864025 150  63    0.09421265 1400      84   4.3187661
##  1200      93   0.12727173 153  63    0.09421265 1200      47   2.4164524
##  1400      87   0.11906065 299  63    0.09421265  800      38   1.9537275
##  599       84   0.11495511 649  63    0.09421265 1156      31   1.5938303
##  708       81   0.11084957 309  62    0.09271721 1047      30   1.5424165
##  326       76   0.10400701 1325 62    0.09271721  694      26   1.3367609
##  400       73   0.09990147 20   61    0.09122177  138      21   1.0796915
##  1279      73   0.09990147 396  61    0.09122177  589      21   1.0796915
##  1         72   0.09853295 453  61    0.09122177  863      21   1.0796915
##  1217      72   0.09853295 1182 61    0.09122177  400      18   0.9254499
##  1439      72   0.09853295 1231 61    0.09122177 1015      18   0.9254499
##  818       71   0.09716444 74   60    0.08972633  200      15   0.7712082
## [1] "Nonzero Features"
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500
## 
## [1] EXPERIMENT 2
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         30         60 
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  Accuracy Count Density    MSE  Count Density    Knockoff Count Density  
##  100      297   0.12904627 848  195   0.08748671  100     69    6.9207623
##  1400     268   0.11644580 199  187   0.08389751 1400     59    5.9177533
##  1200     267   0.11601130 1069 186   0.08344886 1500     56    5.6168506
##  1000     257   0.11166630 893  185   0.08300021 1000     51    5.1153460
##  1500     242   0.10514882 427  181   0.08120561 1200     36    3.6108325
##  599      206   0.08950684 1370 181   0.08120561  800     27    2.7081244
##  179      203   0.08820335 907  180   0.08075696  200     16    1.6048144
##  372      193   0.08385835 956  180   0.08075696 1047     16    1.6048144
##  352      191   0.08298935 995  180   0.08075696 1404     16    1.6048144
##  852      191   0.08298935 2    179   0.08030831 1156     15    1.5045135
##  116      190   0.08255486 235  178   0.07985966  400     13    1.3039117
##  358      188   0.08168586 297  178   0.07985966  694     13    1.3039117
##  906      188   0.08168586 957  178   0.07985966 1015     13    1.3039117
##  955      187   0.08125136 1056 178   0.07985966 1270     13    1.3039117
##  996      187   0.08125136 1075 178   0.07985966  138      9    0.9027081
## [1] "Nonzero Features"
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500
## 
## [1] EXPERIMENT 3
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         30         60 
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  Accuracy Count Density    MSE  Count Density   
##  100      561   0.10748094 725  394   0.07976144
##  1200     528   0.10115853 323  386   0.07814192
##  1400     519   0.09943424 389  386   0.07814192
##  1500     506   0.09694359 1182 383   0.07753460
##  1000     505   0.09675201 264  377   0.07631996
##  599      423   0.08104178 649  376   0.07611752
##  1123     405   0.07759319 579  374   0.07571264
##  915      403   0.07721002 424  373   0.07551020
##  1396     402   0.07701843 135  372   0.07530776
##  1266     401   0.07682684 803  371   0.07510532
##  795      400   0.07663525 442  369   0.07470044
##  1367     400   0.07663525 244  368   0.07449800
##  400      397   0.07606049 817  368   0.07449800
##  112      396   0.07586890 1348 368   0.07449800
##  163      395   0.07567731 246  367   0.07429556
## [1] "Nonzero Features"
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500
## 
## [1] EXPERIMENT 4
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          1          5         60         80 
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  Accuracy Count Density    MSE  Count Density    Knockoff Count Density  
##  863      118   0.16908351 1288 71    0.10658100  100     242   10.886190
##  1000      97   0.13899238 1146 68    0.10207758 1000     228   10.256410
##  819       82   0.11749871 668  67    0.10057644 1500     209    9.401709
##  513       77   0.11033415 989  67    0.10057644 1400     190    8.547009
##  834       77   0.11033415 1298 67    0.10057644 1200     116    5.218174
##  1475      74   0.10603542 452  65    0.09757416  800     115    5.173189
##  1356      72   0.10316960 89   64    0.09607302 1156      57    2.564103
##  1200      71   0.10173669 962  64    0.09607302 1047      42    1.889339
##  1014      70   0.10030378 1001 64    0.09607302 1413      35    1.574449
##  1173      70   0.10030378 1405 64    0.09607302 1015      34    1.529465
##  1275      70   0.10030378 281  63    0.09457187  138      32    1.439496
##  304       69   0.09887087 555  63    0.09457187  589      31    1.394512
##  400       68   0.09743795 874  63    0.09457187 1266      31    1.394512
##  466       68   0.09743795 1062 63    0.09457187  400      29    1.304543
##  1340      68   0.09743795 106  62    0.09307073  694      29    1.304543
## [1] "Nonzero Features"
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500
## 
## [1] EXPERIMENT 5
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80 
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  Accuracy Count Density    MSE  Count Density    Knockoff Count Density  
##  863      274   0.11939154 503  183   0.08618132  100     272   10.755239
##  400      233   0.10152638 1090 181   0.08523945 1000     242    9.569000
##  599      230   0.10021917 1260 180   0.08476851 1500     235    9.292210
##  1000     230   0.10021917 7    179   0.08429758 1400     225    8.896797
##  100      217   0.09455461 954  177   0.08335570  800     166    6.563859
##  1200     214   0.09324741 367  176   0.08288477 1200     158    6.247529
##  1500     213   0.09281167 215  175   0.08241383 1413      68    2.688810
##  304      207   0.09019726 291  175   0.08241383 1047      58    2.293397
##  1063     202   0.08801858 1133 175   0.08241383 1156      54    2.135231
##  112      195   0.08496843 1405 175   0.08241383 1015      53    2.095690
##  326      192   0.08366122 1374 174   0.08194289  138      49    1.937525
##  523      192   0.08366122 439  173   0.08147196  589      30    1.186240
##  1413     192   0.08366122 1179 173   0.08147196  599      28    1.107157
##  519      191   0.08322549 1268 173   0.08147196  200      27    1.067616
##  800      190   0.08278975 373  172   0.08100102  400      27    1.067616
## [1] "Nonzero Features"
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500
## 
## [1] EXPERIMENT 6
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         60         80 
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  Accuracy Count Density    MSE  Count Density   
##  100      513   0.09835934 335  376   0.07775439
##  1000     506   0.09701720 1299 376   0.07775439
##  1200     504   0.09663373 1483 376   0.07775439
##  1400     476   0.09126519 756  371   0.07672042
##  1500     454   0.08704705 522  370   0.07651363
##  599      446   0.08551318 1488 369   0.07630683
##  863      439   0.08417105 169  363   0.07506607
##  400      431   0.08263718 1154 363   0.07506607
##  282      422   0.08091158 632  362   0.07485928
##  179      412   0.07899424 88   361   0.07465248
##  1019     406   0.07784384 587  361   0.07465248
##  513      405   0.07765211 642  361   0.07465248
##  1        400   0.07669344 658  361   0.07465248
##  800      399   0.07650171 670  361   0.07465248
##  1014     399   0.07650171 726  361   0.07465248
## [1] "Nonzero Features"
##  [1]    1  100  200  400  600  800 1000 1200 1400 1500

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. These are the ONLY features used in that analysis. The accuracy of the overall procedure is summarized by applying the CBDA-SL object to the subset of subjects held out for prediction, and the resulting predictions are used to generate the confusion matrix. In short, the procedure has two stages: the first combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second uses those top features in a final predictive modeling step that can then be tested for accuracy, sensitivity, …
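
To make the final evaluation step concrete, here is a minimal Python sketch that builds a confusion matrix from held-out predictions and derives accuracy, sensitivity, and specificity. It assumes a binary outcome coded as 0/1, which this report does not state explicitly. The function names and the toy labels are hypothetical and do not come from the CBDA experiments.

```python
def confusion_matrix(y_true, y_pred):
    """Counts for a binary outcome coded as 0 (negative) and 1 (positive)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def summarize(cm):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = sum(cm.values())
    positives = cm["TP"] + cm["FN"]
    negatives = cm["TN"] + cm["FP"]
    return {
        "accuracy": (cm["TP"] + cm["TN"]) / total,
        "sensitivity": cm["TP"] / positives if positives else float("nan"),
        "specificity": cm["TN"] / negatives if negatives else float("nan"),
    }


# Toy held-out labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
metrics = summarize(cm)
print(cm)
print(metrics)
```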