Some useful information

This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
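
To make the role of the sampling ranges concrete, here is a minimal sketch of how a single job could draw its random subset of features and subjects. The helper `sample_job`, the uniform draw of the percentages, and the data dimensions are all assumptions made for illustration; this is not the code in the workflow file.

```r
# Hypothetical helper (not from the CBDA code base): draw one random
# subsample of features and subjects for a single job.
# Kcol_* and Nrow_* are percentages, as in the experiment arguments.
sample_job <- function(n_rows, n_cols, Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  # Assumption: percentages are drawn uniformly within each range
  k_perc <- runif(1, Kcol_min, Kcol_max)
  n_perc <- runif(1, Nrow_min, Nrow_max)
  # Convert percentages to counts, keeping at least one feature and one subject
  k <- max(1, round(n_cols * k_perc / 100))
  n <- max(1, round(n_rows * n_perc / 100))
  list(cols = sort(sample.int(n_cols, k)),
       rows = sort(sample.int(n_rows, n)))
}

set.seed(1)
# FSR of 5-15% and SSR of 40-60%; the 500 x 60 data size is made up
job <- sample_job(n_rows = 500, n_cols = 60,
                  Kcol_min = 5, Kcol_max = 15,
                  Nrow_min = 40, Nrow_max = 60)
length(job$cols)  # number of features sampled for this job
length(job$rows)  # number of subjects sampled for this job
```

Setting Nrow_min and Nrow_max both to 100 makes every job use all subjects, so only the feature subset varies between jobs.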

This document reports the final results by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project and GitHub https://github.com/SOCR/CBDA for some of the code [still in progress].

# # Here I load the dataset [not executed]
# ABIDE_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ABIDE_dataset.txt",header = TRUE)

Features selected by both the knockoff filter and the CBDA-SL algorithm appear as spikes in the histograms below. No false discovery rates are reported, since we don’t have information on the “true” features. For each experiment, I list the top features selected (the number of top features is set to 20 here).
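
As a rough illustration of how a frequency table like the ones below can be built, the following sketch tabulates how often each feature index is selected across jobs. The Density column in the results looks consistent with 100 × Frequency / total number of selections, but that reading is my inference from the printed numbers, and the input vector `selected` is made-up data.

```r
# Hypothetical input: feature indices selected across all jobs
# (made-up values for illustration, not the ABIDE results)
selected <- c(30, 1, 30, 4, 30, 1, 54, 4, 30, 20)

# Count selections per feature and sort from most to least frequent
freq <- sort(table(selected), decreasing = TRUE)

top <- 3  # number of top features to report
result <- data.frame(
  Feature   = as.integer(names(freq)),
  Frequency = as.integer(freq),
  # Assumed definition: share of all selections, in percent
  Density   = 100 * as.integer(freq) / sum(freq)
)[seq_len(top), ]
result
```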

## [1] EXPERIMENT 6
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         30         50        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  30   31        2.454473 30       21.482738
##  58   27        2.137767  1       16.134504
##  19   26        2.058591  5        6.405907
##  50   26        2.058591 20        5.996807
##  4    25        1.979414 55        5.607663
##  20   25        1.979414 54        3.741768
##  51   25        1.979414 34        3.641988
##  54   25        1.979414 29        3.322690
##  8    24        1.900238 24        2.913590
##  9    24        1.900238 57        2.255039
##  21   24        1.900238 31        1.347037
##  23   24        1.900238 16        1.297146
##  61   24        1.900238  6        1.287168
##  7    23        1.821061 42        1.247256
##  6    22        1.741884 50        1.247256
## 
## [1] EXPERIMENT 7
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  30   11        3.313253 30       4.927607
##  1    10        3.012048  1       4.392262
##  4    10        3.012048 54       2.688892
##  43    9        2.710843  5       2.640224
##  54    9        2.710843 29       2.457720
##  8     8        2.409639 24       2.372551
##  12    8        2.409639 20       2.056211
##  29    8        2.409639 23       1.946709
##  58    8        2.409639 55       1.885874
##  3     7        2.108434 50       1.849373
##  9     7        2.108434  2       1.837206
##  16    7        2.108434  3       1.788539
##  20    7        2.108434 16       1.776372
##  21    7        2.108434  4       1.752038
##  44    7        2.108434 51       1.727704
## 
## [1] EXPERIMENT 8
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  14   19        2.653631 30       8.207028
##  30   19        2.653631  1       7.263950
##  16   16        2.234637  5       4.535758
##  23   16        2.234637 20       3.424273
##  33   16        2.234637 54       3.109914
##  43   16        2.234637 24       3.031324
##  50   16        2.234637 29       2.683283
##  8    15        2.094972 55       2.660829
##  12   15        2.094972 34       2.469967
##  36   15        2.094972  2       2.312788
##  48   15        2.094972 57       2.020882
##  49   15        2.094972  6       1.987201
##  51   15        2.094972  3       1.852476
##  9    14        1.955307 51       1.785113
##  10   14        1.955307 23       1.706523
## 
## [1] EXPERIMENT 9
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         30         50         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  30   30        2.347418 30       10.486932
##  1    28        2.190923  1        9.207244
##  16   28        2.190923  5        5.433250
##  44   27        2.112676 20        3.849908
##  5    26        2.034429 55        3.383581
##  20   26        2.034429  2        2.830496
##  34   26        2.034429 54        2.602755
##  15   25        1.956182  6        2.591910
##  24   25        1.956182 34        2.581065
##  52   25        1.956182 24        2.375014
##  55   25        1.956182 29        2.201497
##  26   24        1.877934  4        2.114738
##  27   24        1.877934  3        1.886997
##  29   24        1.877934 57        1.886997
##  35   24        1.877934 51        1.746015
## [1] "Top Features Selected across multiple experiments, shared between CBDA-SL and Knockoff filter"
## [1] 30  4  1 54 29 20  5 34
## [1] "R_precuneus"              "PixelSpacingX"           
## [3] "subjectSex"               "R_cingulate_gyrus"       
## [5] "L_precuneus"              "R_gyrus_rectus"          
## [7] "PixelSpacingY"            "R_middle_occipital_gyrus"
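
For a single experiment, the overlap between the two top lists can be checked with `intersect()`. The sketch below uses the 15 feature indices printed for Experiment 9 in each column. It only illustrates the idea of a shared set; the cross-experiment list above may have been derived with a different rule, so the two lists need not match.

```r
# Top feature indices printed for Experiment 9 (copied from the table above)
cbda_top     <- c(30, 1, 16, 44, 5, 20, 34, 15, 24, 52, 55, 26, 27, 29, 35)
knockoff_top <- c(30, 1, 5, 20, 55, 2, 54, 6, 34, 24, 29, 4, 3, 57, 51)

# Features that appear in both lists, in CBDA-SL ranking order
shared <- intersect(cbda_top, knockoff_top)
shared
```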

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. These are the ONLY features used in that analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, we combine the CBDA-SL and knockoff filter algorithms in two stages: the first stage selects the top features, and the second stage uses the top features common to both to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, and so on.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 99 90
##          1 66 75
##                                           
##                Accuracy : 0.5273          
##                  95% CI : (0.4719, 0.5822)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.17469         
##                                           
##                   Kappa : 0.0545          
##  Mcnemar's Test P-Value : 0.06555         
##                                           
##             Sensitivity : 0.6000          
##             Specificity : 0.4545          
##          Pos Pred Value : 0.5238          
##          Neg Pred Value : 0.5319          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.5727          
##       Balanced Accuracy : 0.5273          
##                                           
##        'Positive' Class : 0               
##
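
The headline statistics can be recomputed from the four cell counts alone, which is a quick way to check how each one is defined. This is a base R sketch using only the counts printed above; it is not the call that produced the output, and the variable names are mine.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(99, 66, 90, 75), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Class "0" is the positive class, as stated in the output
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

With balanced classes (165 subjects per class), the chance agreement is 0.5, so an accuracy of 0.5273 corresponds to a kappa close to zero.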