Some useful information

This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the % of missing values [misValperc], the min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and the min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
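As a rough illustration of how the sampling arguments could drive a single job, the R sketch below draws one feature subset and one subject subset from the FSR and SSR ranges. The helper `sample_job`, the uniform draw of the percentages, and the dataset dimensions are all assumptions made for this example; they are not taken from the CBDA-SL code.

```r
# Hypothetical helper: draw the feature and subject subsets for one job.
# Percentages are drawn uniformly within each range (an assumption).
sample_job <- function(n_subjects, n_features,
                       Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  k_perc <- runif(1, Kcol_min, Kcol_max)  # % of features to sample (FSR)
  n_perc <- runif(1, Nrow_min, Nrow_max)  # % of subjects to sample (SSR)
  k <- max(1, round(n_features * k_perc / 100))
  n <- max(1, round(n_subjects * n_perc / 100))
  list(features = sort(sample.int(n_features, k)),
       subjects = sort(sample.int(n_subjects, n)))
}

# Toy dimensions (hypothetical), with the ranges shown for EXPERIMENT 1.
set.seed(1)
job <- sample_job(n_subjects = 1000, n_features = 2000,
                  Kcol_min = 1, Kcol_max = 5, Nrow_min = 30, Nrow_max = 40)
length(job$features)  # between 1% and 5% of the features
length(job$subjects)  # between 30% and 40% of the subjects
```

Repeating this draw M times would give one independent feature/subject subsample per job.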

This document reports the final results, by experiment. In this case the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].

# # Here I load the dataset [not executed]
# ABIDE_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/ExperimentsNov2016/ABIDE/ABIDE_dataset.csv",header = TRUE)

Features selected by both the knockoff filter and the CBDA-SL algorithm appear as spikes in the histograms below. For each experiment I list the top features selected; the number of top features is set to 15 here.
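The ranking step can be sketched in base R as follows. The input vectors are toy values and `rank_features` is a hypothetical helper, not the function used in the experiments. The Density columns in the result tables look like percentages of the total selection count (for example, a frequency of 8 with a density of 0.2213 would correspond to about 3615 selections in total), but that reading is an inference from the printed numbers.

```r
# Toy inputs (hypothetical): feature indices pooled across the selected jobs.
cbda_features     <- c(576, 576, 576, 576, 840, 840, 840, 571, 571, 66)
knockoff_features <- c(66, 66, 66, 576, 576, 296)

top_n <- 3  # the report uses 15

# Hypothetical helper: count each feature, keep the top_n most frequent,
# and express each count as a percentage of all selections.
rank_features <- function(x, top_n) {
  counts <- table(x)
  ord <- order(as.vector(counts), decreasing = TRUE)
  keep <- ord[seq_len(min(top_n, length(ord)))]
  data.frame(Feature   = as.integer(names(counts)[keep]),
             Frequency = as.vector(counts)[keep],
             Density   = 100 * as.vector(counts)[keep] / length(x))
}

cbda_top     <- rank_features(cbda_features, top_n)
knockoff_top <- rank_features(knockoff_features, top_n)

# Features ranked in the top_n by both methods.
common <- intersect(cbda_top$Feature, knockoff_top$Feature)
print(cbda_top)
print(common)
```

In this toy case only feature 576 is ranked in the top three by both methods, so it is the only common feature.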

## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          1          5         30         40 
##   [1]  415  126   89  426  490  955  883  973   19  608  316  225  682  502
##  [15]  102   79   90  987  404  480  422  242  688   13   77  175  235  103
##  [29]  165  535  492  985  642  976 1038  487  425  387   75 1003  483  671
##  [43]  172 1089  915  411  640   97  895  386    7  850 1008  876   53  576
##  [57]  215  213  321  203  162  101  839    3  912  497  524  612  644  314
##  [71]  125   78  398 1086  707  423  220   25   18  836  244  903  418  222
##  [85]  507  396  659  983  397  701  207  427  390  159  209  178  395  468
##  [99]  532  645  549  475  402  841   20  161  315   84   17 1030 1094  911
## [113]  234  837   95    9  702  681  885  711  479  494  542  908  890  664
## [127]  166   81  887  680  160  403  195 1092  894  163  673  614  466  471
## [141]  954  144  584  406 1096   92  496 1037  910  975  488 1004  619  951
## [155]  705  663  889  200  100  412   11  491  948  696  668  361   42  256
## [169]  964  742   57  258  630   34  328  519  756  255  803  722  181  105
## [183]  827  831  816  537  344  965 1052  929  327  932  804  284  139  553
## [197]  253 1057  442  277  257 1024  944  137  365  453  601  107  808  136
## [211]  287  303  510  447  450  567  273  941  730 1065 1062  943   70   28
## [225]  761  440  544  599   63  286  809  806  325  343 1000 1048  147   51
## [239]  934  918  997  824 1056  736  830  995  355  767  378  278   37  763
## [253]  622  738 1071  988  280  603   27   65  917  970 1050  275  279  632
## [267]  116  566  349 1066  922   69  452  594  733  152  331   58  117  593
## [281] 1027  251  518  746  737  801 1077   73 1068 1067  288 1072   68  143
## [295]  373  947  747  451  309   38  294  623  769  820  758  766   64  939
## [309]  457  772 1026  372  762  781  817  263  189   71  598  187  621  926
## [323] 1023  334  306  564  338   41  962  514

##  [1] "thick_std_ctx_lh_G_Ins_lg_and_S_cent_ins" 
##  [2] "surf_area_ctx.lh.parstriangularis"        
##  [3] "gaus_curv_ctx_lh_G_front_sup"             
##  [4] "mean_curv_ctx_rh_G_front_inf.Triangul"    
##  [5] "surf_area_ctx.rh.isthmuscingulate"        
##  [6] "thick_std_ctx_rh_S_circular_insula_sup"   
##  [7] "surf_area_ctx_rh_S_orbital_lateral"       
##  [8] "volume_ctx.lh.superiorfrontal"            
##  [9] "volume_ctx_rh_G_Ins_lg_and_S_cent_ins"    
## [10] "volume_ctx_rh_S_suborbital"               
## [11] "surf_area_ctx_lh_G_and_S_cingul.Ant"      
## [12] "curv_ind_ctx_lh_S_intrapariet_and_P_trans"
## [13] "mean_curv_ctx.lh.transversetemporal"      
## [14] "surf_area_ctx.rh.fusiform"                
## [15] "mean_curv_ctx_rh_G_and_S_cingul.Mid.Post" 
## [16] "anat_fwhm"                                
## [17] "volume_CSF"                               
## [18] "volume_ctx.lh.entorhinal"                 
## [19] "volume_ctx.lh.precentral"                 
## [20] "volume_ctx.lh.rostralanteriorcingulate"   
## k_top_50_temp
##  576  840  571 1312 1505 1703 1799  178  207  279  462 1000 1165 1211 1235 
##    8    8    7    7    7    7    7    6    6    6    6    6    6    6    6 
##   63   75   80  143  145 
##    5    5    5    5    5 
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density   Knockoff Density  
##  576  8         0.2213001   66     3.0058651
##  840  8         0.2213001  296     1.6862170
##  571  7         0.1936376  297     1.6862170
##  1312 7         0.1936376   26     1.6495601
##  1505 7         0.1936376  583     1.0997067
##  1703 7         0.1936376    1     0.9530792
##  1799 7         0.1936376  337     0.9530792
##  178  6         0.1659751  727     0.9164223
##  207  6         0.1659751 1612     0.9164223
##  279  6         0.1659751   25     0.8797654
##  462  6         0.1659751  334     0.8064516
##  1000 6         0.1659751 2043     0.8064516
##  1165 6         0.1659751  605     0.7697947
##  1211 6         0.1659751  983     0.7331378
##  1235 6         0.1659751 1458     0.7331378
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 3
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         10         55         65 
##   [1] 1096  487  842  583  843  320  666  500  894  506  319  479 1089  851
##  [15]  891  649  231  434  396  474  534  656  670  156  576  160  171  169
##  [29]  613  428  906  312 1029  607  890  700  653  981  323  482  841  889
##  [43]  407  677  898  483  908  717  681  580  205  665  385  202  618  501
##  [57]  980   85  608  225 1045   21   90  881  975  899  468    4   19  165
##  [71]  384  837  654  673  540   77  195  238  846 1043 1041    1  393  527
##  [85]  122  243  615  982  237  405  217  854    2   16  484  986 1039  120
##  [99]  611  696   79  410  507  957  318   86  119  233  153  695   54  424
## [113]  239  158  201  644 1092  895  690  236    8  660  432  157  178  528
## [127]  124   75  420  240   12  905  658  910  542  711  423  244  640 1006
## [141]  164  391  686  480 1034  127  716  408  655  317  418  669   83  193
## [155]  430  584  467 1005  699 1007    3 1010  414  620  502 1069 1013  808
## [169]  596  856  191  183  561  294  592  444  291  553  309  960  971 1079
## [183]  447   30  142  250  924  264  179 1028 1080  857  871  784   29  141
## [197]  365  873  795  921  529  269  140  110  593  810  797  280  940   44
## [211]  796   42  117  282  552  184  832  733  188  182  741  780  556  180
## [225]  438  595  379  807  545  996  257  740  150  963  293  375   65  935
## [239]  947  296  306  863 1071  548  970  114  108  560 1068  794 1055  634
## [253]  772  766  989 1017  441  633  587  868  566  997  132   60   56  782
## [267] 1048  285  920  460  133  585  789  519  303 1023 1053  255  727  586
## [281]  302  278  803  256  525  148  745  299 1070  367  356  344  967  292
## [295]  725  817  756  597   72 1000  769  937  801  555  874  824  252  728
## [309]  554  599  748  536  286  872  861  305  859  916 1078  860  746  332
## [323]  774 1011 1024  729  448  454  297   48

##  [1] "R_superior_temporal_gyrus"             
##  [2] "volume_Left.Hippocampus"               
##  [3] "curv_ind_ctx_lh_G_insular_short"       
##  [4] "volume_ctx_lh_S_precentral.sup.part"   
##  [5] "surf_area_ctx.rh.parsopercularis"      
##  [6] "gaus_curv_lh_BA4p"                     
##  [7] "anat_efc"                              
##  [8] "volume_ctx.lh.rostralanteriorcingulate"
##  [9] "volume_Left.Accumbens.area"            
## [10] "volume_lh_BA45"                        
## [11] "fold_ind_ctx.lh.cuneus"                
## [12] "mean_curv_ctx_lh_G_front_inf.Opercular"
## [13] "fold_ind_ctx_lh_G_oc.temp_lat.fusifor" 
## [14] "thick_std_ctx_lh_G_precentral"         
## [15] "fold_ind_ctx_lh_G_temp_sup.G_T_transv" 
## [16] "thick_avg_ctx.lh.isthmuscingulate"     
## [17] "mean_curv_ctx_lh_Lat_Fis.ant.Vertical" 
## [18] "thick_avg_ctx.lh.middletemporal"       
## [19] "surf_area_ctx.lh.parsorbitalis"        
## [20] "thick_avg_ctx_lh_S_collat_transv_ant"  
## k_top_50_temp
##   36  295  587  172 1582 1978   61  145  290  308  439  542  607  660  705 
##   12   12   12   10   10   10    9    9    9    9    9    9    9    9    9 
##  750  780  806  833  953 
##    9    9    9    9    9 
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density   Knockoff Density 
##  36   12        0.1504891   66     6.787933
##  295  12        0.1504891  296     3.859805
##  587  12        0.1504891   26     3.327418
##  172  10        0.1254076  297     2.617569
##  1582 10        0.1254076  727     2.307010
##  1978 10        0.1254076    1     2.151730
##  61    9        0.1128668 2043     1.974268
##  145   9        0.1128668  583     1.708075
##  290   9        0.1128668 2081     1.663709
##  308   9        0.1128668 1081     1.574978
##  439   9        0.1128668  337     1.419698
##  542   9        0.1128668  738     1.419698
##  607   9        0.1128668  710     1.375333
##  660   9        0.1128668 2040     1.353150
##  705   9        0.1128668 1152     1.286602

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter; they are the ONLY features used in that analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined in two stages: the first stage selects the top features, and the second stage uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, ...
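A minimal base R sketch of that last evaluation step, for a multinomial outcome, is given below. The labels and predictions are invented toy values, and `truth` is a hypothetical name for the held-out outcome. Only `SL_Pred_Combined` is a name taken from the text, and its contents here are not real predictions.

```r
# Toy held-out labels and predictions (hypothetical, three outcome classes).
classes <- c("A", "B", "C")
truth            <- factor(c("A", "A", "B", "B", "C", "C", "A", "B"), levels = classes)
SL_Pred_Combined <- factor(c("A", "B", "B", "B", "C", "A", "A", "B"), levels = classes)

# Confusion matrix: rows are predictions, columns are the true classes.
conf_mat <- table(Predicted = SL_Pred_Combined, Actual = truth)

# Overall accuracy: correctly classified subjects over all subjects.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class sensitivity: correct predictions over true members of each class.
sensitivity <- diag(conf_mat) / colSums(conf_mat)

print(conf_mat)
print(accuracy)
print(sensitivity)
```

With more than two classes, sensitivity is reported per class rather than as a single number.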