Some useful information

This is a summary of one experiment run with a LONI pipeline workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
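
To make the six arguments concrete, here is a minimal Python sketch (the report itself was produced in R) of how one job might draw its feature and subject subsets from the FSR and SSR ranges. The uniform draw within each range, the helper name `sample_job_subset`, and the dataset sizes (64 features, 1000 subjects) are assumptions for illustration only; they are not taken from the CBDA code. The argument values are those reported for Experiment 1.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """The six arguments that identify an experiment."""
    M: int              # number of jobs
    misValperc: float   # % of missing values
    Kcol_min: float     # FSR lower bound, % of features
    Kcol_max: float     # FSR upper bound, % of features
    Nrow_min: float     # SSR lower bound, % of subjects
    Nrow_max: float     # SSR upper bound, % of subjects


def sample_job_subset(spec, n_features, n_subjects, rng):
    """Hypothetical helper: pick one job's feature and subject indices.

    Assumes the sampled percentage is drawn uniformly within each range.
    """
    feat_pct = rng.uniform(spec.Kcol_min, spec.Kcol_max)
    subj_pct = rng.uniform(spec.Nrow_min, spec.Nrow_max)
    k = max(1, round(n_features * feat_pct / 100))
    n = max(1, round(n_subjects * subj_pct / 100))
    features = sorted(rng.sample(range(1, n_features + 1), k))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n))
    return features, subjects


# Argument values reported for Experiment 1.
spec = ExperimentSpec(M=9000, misValperc=0, Kcol_min=5, Kcol_max=15,
                      Nrow_min=60, Nrow_max=80)

# Placeholder dataset sizes (assumed, not from the report).
rng = random.Random(0)
features, subjects = sample_job_subset(spec, n_features=64,
                                       n_subjects=1000, rng=rng)
print(len(features), "features,", len(subjects), "subjects")
```

With these placeholder sizes, each job would see between 3 and 10 features and between 600 and 800 subjects.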

This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project, and the GitHub repository https://github.com/SOCR/CBDA for some of the code.

Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. The top features selected by each method are listed; the number of features listed is set to 15 here.

## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"         
##  CBDA Frequency   Density Knockoff  Density
##     4        50 12.019231        5 2.613165
##    28        21  5.048077        2 2.534098
##     8        19  4.567308        7 2.126903
##     7        18  4.326923       55 2.115042
##    63        18  4.326923        1 2.093299
##    47        17  4.086538       34 2.055742
##    45        15  3.605769       35 2.035975
##    62        14  3.365385       38 1.954932
##    46        13  3.125000       59 1.939118
##    35        11  2.644231       40 1.923305
##    56        11  2.644231       60 1.919352
##    16        10  2.403846       20 1.915398
##     2         9  2.163462       58 1.895632
##    19         9  2.163462       22 1.846215
##    57         9  2.163462        8 1.816565
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
##  [1]  4 28  8  7 63 47 45 62 46 35 56 16  2 19 57  5 55  1 34 38 59 40 60
## [24] 20 58 22
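
The relationship between the table and the combined feature list can be checked with a short Python sketch (the report itself was produced in R). It suggests that the 26 features printed under "Top Features Selected" coincide with the order-preserving union of the two top-15 columns, while only four features appear in both columns. The helper name `ordered_union` is illustrative. The total of 416 selections used for the density check is inferred from the table (50 out of 416 gives 12.019231%); it is not stated in the report.

```python
# Top-15 feature IDs copied from the CBDA and Knockoff columns of the table.
cbda_top = [4, 28, 8, 7, 63, 47, 45, 62, 46, 35, 56, 16, 2, 19, 57]
knockoff_top = [5, 2, 7, 55, 1, 34, 35, 38, 59, 40, 60, 20, 58, 22, 8]


def ordered_union(first, second):
    """Concatenate two lists, keeping only the first occurrence of each item."""
    seen = set()
    merged = []
    for feature in first + second:
        if feature not in seen:
            seen.add(feature)
            merged.append(feature)
    return merged


combined = ordered_union(cbda_top, knockoff_top)
shared = sorted(set(cbda_top) & set(knockoff_top))

print(len(combined), combined)
print(shared)

# Density check for the CBDA column, assuming Density = 100 * Frequency / total.
# The total (416) is inferred from the first row, not given in the report.
inferred_total = 416
cbda_frequency = {4: 50, 28: 21, 8: 19, 57: 9}
cbda_density = {f: 100 * n / inferred_total for f, n in cbda_frequency.items()}
print({f: round(d, 6) for f, d in cbda_density.items()})
```

Features 2, 7, 8, and 35 are the only ones ranked in the top 15 by both methods.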

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions are then used to generate the confusion matrix. In short, the first stage combines the CBDA-SL and Knockoff Filter algorithms to select the top features; the second stage uses those top features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, and so on.

##       ggplot2          plyr    colorspace          grid    data.table 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##           VIM          MASS        Matrix          lme4           arm 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##       foreach        glmnet         class          nnet          mice 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##    missForest     calibrate          nnls  SuperLearner       plotrix 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
## TeachingDemos        plotmo         earth      parallel       splines 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##           gam            mi     BayesTree         e1071  randomForest 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         Hmisc         dplyr        Amelia   bartMachine      knockoff 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         caret   smotefamily           FNN 
##          TRUE          TRUE          TRUE
##  [1] 4  28 8  7  63 47 45 62 46 35 56 16 2  19 57
## Levels: 4 28 8 7 63 47 45 62 46 35 56 16 2 19 57
##   missForest iteration 1 in progress...done!
##   missForest iteration 1 in progress...done!
##   missForest iteration 1 in progress...done!
##   missForest iteration 1 in progress...done!
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  AD MCI Normal
##     AD      67  27      1
##     MCI     14 234      9
##     Normal   0   8    139
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8818          
##                  95% CI : (0.8501, 0.9088)
##     No Information Rate : 0.5391          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8041          
##  Mcnemar's Test P-Value : 0.159           
## 
## Statistics by Class:
## 
##                      Class: AD Class: MCI Class: Normal
## Sensitivity             0.8272     0.8699        0.9329
## Specificity             0.9330     0.9000        0.9771
## Pos Pred Value          0.7053     0.9105        0.9456
## Neg Pred Value          0.9653     0.8554        0.9716
## Prevalence              0.1623     0.5391        0.2986
## Detection Rate          0.1343     0.4689        0.2786
## Detection Prevalence    0.1904     0.5150        0.2946
## Balanced Accuracy       0.8801     0.8849        0.9550
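
The summary statistics above follow directly from the confusion matrix. The Python sketch below (the report itself was produced in R) recomputes accuracy, Cohen's kappa, and the per-class rates from the nine counts, using the standard one-vs-rest definitions; it is a cross-check, not the code that generated the report. Variable names are illustrative.

```python
classes = ["AD", "MCI", "Normal"]

# Rows are predictions, columns are reference labels, as in the report.
cm = [
    [67, 27, 1],
    [14, 234, 9],
    [0, 8, 139],
]

n = len(classes)
total = sum(sum(row) for row in cm)
row_totals = [sum(row) for row in cm]                              # predicted counts
col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # reference counts

# Overall accuracy and Cohen's kappa (agreement corrected for chance).
accuracy = sum(cm[i][i] for i in range(n)) / total
expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

# One-vs-rest statistics for each class.
per_class = {}
for i, name in enumerate(classes):
    tp = cm[i][i]
    fp = row_totals[i] - tp
    fn = col_totals[i] - tp
    tn = total - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    per_class[name] = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

print(f"Accuracy: {accuracy:.4f}  Kappa: {kappa:.4f}")
for name, stats in per_class.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The recomputed values agree with the reported ones to four decimal places; the No Information Rate (0.5391) is the prevalence of the largest reference class, MCI.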