Some useful information

This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
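One plausible reading of the FSR and SSR arguments is that each job draws a random subset of columns and rows whose sizes fall within the given percentage ranges. The following is a minimal sketch under that assumption, not the CBDA-SL implementation itself; the function `sample_job` and the dataset dimensions `n_subjects` and `n_features` are hypothetical and chosen only for illustration.

```r
# Hypothetical dataset dimensions (assumptions, not taken from the real data)
n_subjects <- 750
n_features <- 61

# Draw one job's feature and subject subsets from the FSR and SSR ranges.
# All *_min / *_max arguments are percentages, as in the experiment arguments.
sample_job <- function(n_subjects, n_features,
                       Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  k_perc <- runif(1, Kcol_min, Kcol_max)  # % of features for this job
  n_perc <- runif(1, Nrow_min, Nrow_max)  # % of subjects for this job
  k <- max(1, round(n_features * k_perc / 100))
  n <- max(1, round(n_subjects * n_perc / 100))
  list(features = sort(sample.int(n_features, k)),
       subjects = sort(sample.int(n_subjects, n)))
}

set.seed(1)  # reproducible illustration only
# Argument values of Experiment 1: FSR 5-15%, SSR 60-80%
job <- sample_job(n_subjects, n_features,
                  Kcol_min = 5, Kcol_max = 15,
                  Nrow_min = 60, Nrow_max = 80)
length(job$features)  # number of features sampled for this job
length(job$subjects)  # number of subjects sampled for this job
```

Repeating such a draw M times would give one feature subset and one subject subset per job.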

This document reports the final results, by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].

# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)

Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms for each experiment. For each experiment, I list the top features selected by each method; the number of top features is set to 15 here.

## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  20   10        2.985075  7       2.243323
##  60    9        2.686567  6       2.176855
##  1     8        2.388060  2       2.167359
##  29    8        2.388060  4       2.136499
##  31    8        2.388060  1       2.091395
##  37    8        2.388060  9       2.091395
##  41    8        2.388060  3       2.058160
##  44    8        2.388060 10       2.053412
##  56    8        2.388060  8       2.024926
##  2     7        2.089552  5       1.984570
##  10    7        2.089552 59       1.984570
##  19    7        2.089552 42       1.956083
##  23    7        2.089552 60       1.948961
##  54    7        2.089552 51       1.846884
##  55    7        2.089552 39       1.832641
## [1] EXPERIMENT 7
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  46   12        3.571429 59       2.190557
##  4    10        2.976190  9       2.174016
##  1     9        2.678571  6       2.166927
##  3     9        2.678571  8       2.138570
##  61    9        2.678571  3       2.119665
##  11    8        2.380952  7       2.117302
##  20    8        2.380952  4       2.091309
##  34    8        2.380952  5       2.074767
##  52    8        2.380952  2       2.060589
##  56    8        2.380952  1       2.001512
##  15    7        2.083333 60       1.916442
##  18    7        2.083333 10       1.807741
##  26    7        2.083333 41       1.786474
##  31    7        2.083333 42       1.758117
##  45    7        2.083333 51       1.720308
## [1] EXPERIMENT 8
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  52   22        3.021978  9       3.666917
##  1    19        2.609890  2       3.661548
##  9    17        2.335165  3       3.661548
##  22   17        2.335165  6       3.640073
##  45   17        2.335165  1       3.595333
##  61   16        2.197802  4       3.582805
##  4    15        2.060440  5       3.563120
##  7    15        2.060440  7       3.362683
##  34   14        1.923077  8       3.321522
##  39   14        1.923077 59       2.822220
##  51   14        1.923077 60       2.634310
##  15   13        1.785714 10       2.509038
##  16   13        1.785714 51       2.364079
##  33   13        1.785714 42       2.324707
##  42   13        1.785714 41       2.186907
## [1] EXPERIMENT 9
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         30         50         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  4    29        2.187029  1       5.884280
##  19   26        1.960784  4       5.882642
##  22   26        1.960784  9       5.823668
##  57   26        1.960784  6       5.781076
##  17   25        1.885370  3       5.761418
##  28   25        1.885370  2       5.707359
##  30   25        1.885370  5       5.024245
##  8    24        1.809955  7       4.958718
##  9    24        1.809955  8       4.688421
##  24   24        1.809955 59       2.748837
##  48   24        1.809955 10       2.701330
##  58   24        1.809955 51       2.427757
##  15   23        1.734540 42       2.354040
##  18   23        1.734540 60       2.308171
##  23   23        1.734540 41       2.106677
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1]  4 60  1  3 41  9  2 10  7
## [1] "CDGLOBAL"         "R_caudate"        "subjectSex"      
## [4] "GDTOTAL"          "L_cuneus"         "seriesIdentifier"
## [7] "MMSCORE"          "Background"       "subjectAge"
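The overlap between the two rankings within a single experiment can be sketched as a set intersection. The example below uses the top-15 feature indices from the Experiment 1 table; it is only an illustration of the per-experiment overlap and does not reproduce the cross-experiment list above, whose exact aggregation rule is not described here. The variable names are hypothetical.

```r
# Top-15 feature indices from the Experiment 1 table
cbda_top     <- c(20, 60, 1, 29, 31, 37, 41, 44, 56, 2, 10, 19, 23, 54, 55)
knockoff_top <- c(7, 6, 2, 4, 1, 9, 3, 10, 8, 5, 59, 42, 60, 51, 39)

# Features ranked in the top 15 by both methods (order follows cbda_top)
shared <- intersect(cbda_top, knockoff_top)
print(shared)
# [1] 60  1  2 10
```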

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for this analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined in a first stage to select the top features. The second stage then uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, ...

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD     100   10  24      0
##     LMCI     9  169  25     12
##     MCI     13   20 144      7
##     Normal   0   10   2    205
## 
## Overall Statistics
##                                           
##                Accuracy : 0.824           
##                  95% CI : (0.7948, 0.8506)
##     No Information Rate : 0.2987          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7624          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.8197      0.8086     0.7385        0.9152
## Specificity             0.9459      0.9150     0.9279        0.9772
## Pos Pred Value          0.7463      0.7860     0.7826        0.9447
## Neg Pred Value          0.9643      0.9252     0.9099        0.9644
## Prevalence              0.1627      0.2787     0.2600        0.2987
## Detection Rate          0.1333      0.2253     0.1920        0.2733
## Detection Prevalence    0.1787      0.2867     0.2453        0.2893
## Balanced Accuracy       0.8828      0.8618     0.8332        0.9462
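As a sanity check, the headline statistics can be recomputed directly from the confusion matrix counts. This is a minimal base-R sketch, not the code that produced the report; the variable names are hypothetical, and only the counts are taken from the table above.

```r
classes <- c("AD", "LMCI", "MCI", "Normal")

# Rows are predictions, columns are reference labels (counts from the report)
cm <- matrix(c(100,  10,  24,   0,
                 9, 169,  25,  12,
                13,  20, 144,   7,
                 0,  10,   2, 205),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all subjects
sensitivity <- diag(cm) / colSums(cm)    # per class: true positives / reference total
ppv         <- diag(cm) / rowSums(cm)    # per class: true positives / predicted total
prevalence  <- colSums(cm) / sum(cm)     # per class: share of reference labels

round(accuracy, 4)     # 0.824
round(sensitivity, 4)  # AD 0.8197, LMCI 0.8086, MCI 0.7385, Normal 0.9152
round(ppv, 4)          # AD 0.7463, LMCI 0.7860, MCI 0.7826, Normal 0.9447
```

The No Information Rate in the report (0.2987) matches the prevalence of the largest reference class, Normal (224 of 750 subjects).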