Some useful information

This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
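As a minimal sketch of how the six arguments define an experiment, the settings printed in the experiment headers of this document can be collected in a data frame, and one job's feature and subject subsets can be drawn from the FSR and SSR ranges. The helper `sample_job()` and the sizes `n_features` and `n_subjects` are hypothetical names and values chosen for illustration; the sampling scheme (uniform draw of a percentage, then sampling without replacement) is an assumption, not the actual workflow code.

```r
# Settings copied from the experiment headers printed in this document.
experiments <- data.frame(
  experiment = c(1, 2, 3, 7, 8, 9),
  M          = 9000,
  misValperc = 0,
  Kcol_min   = c(5, 5, 15, 5, 15, 30),
  Kcol_max   = c(15, 15, 30, 15, 30, 50),
  Nrow_min   = c(60, 100, 60, 40, 40, 40),
  Nrow_max   = c(80, 100, 80, 60, 60, 60)
)

# Hypothetical helper (assumption): draw the feature and subject subsets
# for one job, given one row of `experiments`.
sample_job <- function(spec, n_features, n_subjects) {
  k_pct <- runif(1, spec$Kcol_min, spec$Kcol_max)  # % of features (FSR)
  n_pct <- runif(1, spec$Nrow_min, spec$Nrow_max)  # % of subjects (SSR)
  k <- max(1, round(n_features * k_pct / 100))
  n <- max(1, round(n_subjects * n_pct / 100))
  list(
    features = sort(sample.int(n_features, k)),
    subjects = sort(sample.int(n_subjects, n))
  )
}

set.seed(1)
# Illustrative sizes only; they are not the dimensions of the ADNI dataset.
job <- sample_job(experiments[1, ], n_features = 100, n_subjects = 500)
length(job$features)  # between 5 and 15 for Experiment 1
length(job$subjects)  # between 300 and 400 for Experiment 1
```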

This document reports the final results, by experiment. In this case the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].

# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)

Features selected by both the knockoff filter and the CBDA-SL algorithm appear as spikes in the histograms below. No false discovery rates are reported, since we don’t have information on the “true” features. For each experiment I list the top features selected; the cutoff is set to 20 here.
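The Frequency and Density columns in the CBDA tables are consistent with Density = 100 × Frequency / (total number of selections); for example, in Experiment 1 a frequency of 9 with density 2.821317 implies a total of 319. A minimal sketch of such a ranking, assuming the selections are available as a plain vector of feature indices (the function name `top_features()` and the toy data are hypothetical):

```r
# Toy vector with one entry per time a feature was selected
# (illustrative values only, not the experiment output).
selected <- c(40, 7, 40, 15, 7, 40, 22, 15, 40, 7)

# Rank features by how often they were selected and report the
# frequency as a percentage of all selections.
top_features <- function(selected, top = 20) {
  freq <- sort(table(selected), decreasing = TRUE)
  out <- data.frame(
    Feature   = as.integer(names(freq)),
    Frequency = as.integer(freq),
    Density   = 100 * as.numeric(freq) / length(selected)
  )
  head(out, top)
}

res <- top_features(selected, top = 3)
print(res)
```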

## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  40   9         2.821317  9       2.801809
##  7    8         2.507837  3       2.795738
##  15   8         2.507837  6       2.771454
##  22   8         2.507837  1       2.765383
##  25   8         2.507837  7       2.756276
##  59   8         2.507837 59       2.731992
##  4    7         2.194357  4       2.710743
##  13   7         2.194357  2       2.683423
##  16   7         2.194357  8       2.595392
##  31   7         2.194357  5       2.589321
##  38   7         2.194357 60       2.492183
##  46   7         2.194357 39       2.073278
##  60   7         2.194357 51       1.927572
##  5    6         1.880878 40       1.827399
##  9    6         1.880878 36       1.772759
## 
## [1] EXPERIMENT 2
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  22   12        3.603604 59       2.842306
##  45   12        3.603604  7       2.836160
##  8     8        2.402402  8       2.826942
##  10    8        2.402402  4       2.780851
##  18    8        2.402402  5       2.753196
##  38    8        2.402402  3       2.734759
##  43    8        2.402402  1       2.725541
##  57    8        2.402402  6       2.670231
##  66    8        2.402402  9       2.664086
##  13    7        2.102102  2       2.636431
##  19    7        2.102102 60       2.531957
##  23    7        2.102102 39       2.405973
##  30    7        2.102102 51       2.172443
##  33    7        2.102102 40       2.120206
##  36    7        2.102102 15       2.043387
## 
## [1] EXPERIMENT 3
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  28   18        2.479339  4       4.909827
##  56   18        2.479339  2       4.667535
##  15   17        2.341598  7       4.653821
##  18   17        2.341598  3       4.608105
##  37   17        2.341598  6       4.605820
##  4    16        2.203857  9       4.598962
##  54   16        2.203857  1       4.576105
##  16   15        2.066116  5       4.441244
##  40   15        2.066116  8       4.118951
##  58   15        2.066116 59       3.750943
##  59   14        1.928375 60       2.758920
##  64   14        1.928375 39       2.667490
##  12   13        1.790634 51       2.018332
##  39   13        1.790634 40       1.894900
##  44   13        1.790634 61       1.618323
## 
## [1] EXPERIMENT 7
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  42   10        2.994012  3       2.817322
##  21    9        2.694611  4       2.793446
##  34    9        2.694611  6       2.760617
##  50    9        2.694611  7       2.715850
##  6     8        2.395210  2       2.703913
##  19    8        2.395210  1       2.688990
##  30    8        2.395210  9       2.677053
##  53    8        2.395210  8       2.468141
##  60    8        2.395210  5       2.462172
##  13    7        2.095808 59       2.408452
##  18    7        2.095808 60       2.312950
##  22    7        2.095808 39       1.850360
##  23    7        2.095808 55       1.766795
##  24    7        2.095808 51       1.719044
##  57    7        2.095808 15       1.680246
## 
## [1] EXPERIMENT 8
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  31   17        2.439024  6       4.990006
##  34   17        2.439024  9       4.888857
##  5    15        2.152080  4       4.864774
##  15   15        2.152080  2       4.785300
##  65   15        2.152080  1       4.633577
##  16   14        2.008608  5       4.595044
##  37   14        2.008608  3       4.563736
##  41   14        2.008608  7       4.505936
##  43   14        2.008608  8       3.925535
##  45   14        2.008608 59       3.598006
##  47   14        2.008608 60       2.947764
##  61   14        2.008608 39       2.167473
##  63   14        2.008608 40       1.762878
##  6    13        1.865136 51       1.642463
##  9    13        1.865136 36       1.594297
## 
## [1] EXPERIMENT 9
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         30         50         40         60

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  29   27        2.051672  6       7.538142
##  15   26        1.975684  4       7.369761
##  24   26        1.975684  9       7.256823
##  37   26        1.975684  1       7.248609
##  16   25        1.899696  2       7.213700
##  20   25        1.899696  3       6.513481
##  27   24        1.823708  5       6.412863
##  39   24        1.823708  7       6.189039
##  46   24        1.823708  8       4.632539
##  53   24        1.823708 59       3.308076
##  14   23        1.747720 39       2.246453
##  23   23        1.747720 60       2.024682
##  40   23        1.747720 44       1.396333
##  44   23        1.747720 51       1.392226
##  55   23        1.747720 40       1.330623
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 40  7 15 59  8 60  6  4  5
## [1] "R_inferior_occipital_gyrus" "subjectAge"                
## [3] "L_inferior_frontal_gyrus"   "L_caudate"                 
## [5] "weightKg"                   "R_caudate"                 
## [7] "FAQTOTAL"                   "CDGLOBAL"                  
## [9] "NPISCORE"
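One plausible way to derive a shared feature list is to intersect the CBDA-SL and knockoff top lists within each experiment and then take the union across experiments. The sketch below applies that rule to the Experiment 1 and Experiment 2 rows printed above. The rule itself is an assumption: the selection code is not shown in this document, and the sketch is not claimed to reproduce the list above.

```r
# Top features as printed in the Experiment 1 and 2 tables (15 rows each).
cbda_exp1     <- c(40, 7, 15, 22, 25, 59, 4, 13, 16, 31, 38, 46, 60, 5, 9)
knockoff_exp1 <- c(9, 3, 6, 1, 7, 59, 4, 2, 8, 5, 60, 39, 51, 40, 36)
cbda_exp2     <- c(22, 45, 8, 10, 18, 38, 43, 57, 66, 13, 19, 23, 30, 33, 36)
knockoff_exp2 <- c(59, 7, 8, 4, 5, 3, 1, 6, 9, 2, 60, 39, 51, 40, 15)

# Assumed rule: features in both top lists within an experiment ...
shared_by_experiment <- list(
  exp1 = intersect(cbda_exp1, knockoff_exp1),
  exp2 = intersect(cbda_exp2, knockoff_exp2)
)

# ... pooled across experiments.
shared <- Reduce(union, shared_by_experiment)
print(shared_by_experiment)
print(shared)
```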

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for this analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features. The second stage then uses the top features common to both to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, and so on.

## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(pred_temp, truth_temp): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD      96    0  14      0
##     LMCI     0    0   0      0
##     MCI     26    0 378      6
##     Normal   0    0  12    218
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9227          
##                  95% CI : (0.9012, 0.9408)
##     No Information Rate : 0.5387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8689          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.7869          NA     0.9356        0.9732
## Specificity             0.9777           1     0.9075        0.9772
## Pos Pred Value          0.8727          NA     0.9220        0.9478
## Neg Pred Value          0.9594          NA     0.9235        0.9885
## Prevalence              0.1627           0     0.5387        0.2987
## Detection Rate          0.1280           0     0.5040        0.2907
## Detection Prevalence    0.1467           0     0.5467        0.3067
## Balanced Accuracy       0.8823          NA     0.9216        0.9752
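The headline statistics can be checked directly from the counts in the confusion matrix. The following base R sketch re-enters the matrix printed above and recomputes accuracy, Cohen's kappa, and the per-class sensitivity and specificity; the variable names are illustrative and the code does not depend on the package that produced the original output.

```r
# Confusion matrix as printed above (rows = prediction, columns = reference).
classes <- c("AD", "LMCI", "MCI", "Normal")
cm <- matrix(
  c(96, 0,  14,   0,
     0, 0,   0,   0,
    26, 0, 378,   6,
     0, 0,  12, 218),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n        <- sum(cm)                                  # 750 held-out subjects
accuracy <- sum(diag(cm)) / n                        # correct / total
expected <- sum(rowSums(cm) * colSums(cm)) / n^2     # chance agreement
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class rates; LMCI has no reference cases, so its sensitivity is NaN
# (reported as NA in the output above).
sensitivity <- diag(cm) / colSums(cm)
specificity <- (n - rowSums(cm) - colSums(cm) + diag(cm)) / (n - colSums(cm))

round(c(Accuracy = accuracy, Kappa = cohen_kappa), 4)
round(rbind(Sensitivity = sensitivity, Specificity = specificity), 4)
```

The no-information rate of 0.5387 is the prevalence of the largest reference class (MCI, 404 of 750).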