Some useful information

This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).

This document reports the final results by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project and GitHub https://github.com/SOCR/CBDA for some of the code [still in progress].

# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)

Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we don’t have information on the “true” features. I list the top selected features; the number listed is set to 20 here.

## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80
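
To make the six arguments concrete, here is a minimal sketch of how one job might draw its feature and subject subsets from the Experiment 1 ranges printed above. It assumes that each job samples a uniform percentage within each range. The helper `draw_job_sample` and the dataset dimensions `n_features` and `n_subjects` are hypothetical placeholders, not taken from the CBDA-SL code.

```r
set.seed(1)

# Experiment 1 arguments, copied from the output above.
params <- c(M = 9000, misValperc = 0,
            Kcol_min = 5, Kcol_max = 15,
            Nrow_min = 60, Nrow_max = 80)

# Hypothetical dataset dimensions, for illustration only.
n_features <- 60
n_subjects <- 1000

# Hypothetical helper: draw one job's feature and subject subsets.
# Assumption: the sampling percentage is uniform within each range.
draw_job_sample <- function(params, n_features, n_subjects) {
  k_frac <- runif(1, params[["Kcol_min"]], params[["Kcol_max"]]) / 100
  n_frac <- runif(1, params[["Nrow_min"]], params[["Nrow_max"]]) / 100
  k <- max(1, round(k_frac * n_features))
  n <- max(1, round(n_frac * n_subjects))
  list(features = sort(sample.int(n_features, k)),
       subjects = sort(sample.int(n_subjects, n)))
}

job <- draw_job_sample(params, n_features, n_subjects)
length(job$features)  # between 5% and 15% of n_features
length(job$subjects)  # between 60% and 80% of n_subjects
```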

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  34   11        3.448276  7       2.885030
##  37   10        3.134796  4       2.764949
##  23    9        2.821317  1       2.752633
##  53    9        2.821317  8       2.731079
##  5     7        2.194357  6       2.728000
##  16    7        2.194357  2       2.724921
##  41    7        2.194357  5       2.687973
##  52    7        2.194357  3       2.654104
##  4     6        1.880878 57       2.620235
##  6     6        1.880878 58       2.564813
##  7     6        1.880878 37       2.346204
##  8     6        1.880878 49       2.059856
##  10    6        1.880878 13       1.915143
##  13    6        1.880878 53       1.819693
##  27    6        1.880878 38       1.795061
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
##  [1]  1  2  3  4  5  6  7  8 57 58 37 49
##  [1] "subjectSex"                 "MMSCORE"                   
##  [3] "GDTOTAL"                    "CDGLOBAL"                  
##  [5] "NPISCORE"                   "FAQTOTAL"                  
##  [7] "subjectAge"                 "weightKg"                  
##  [9] "L_caudate"                  "R_caudate"                 
## [11] "L_inferior_occipital_gyrus" "L_lingual_gyrus"
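
As a rough illustration of what "shared" means, the sketch below intersects the 15 feature indices printed for each method in the Experiment 1 table. This is only an assumption about the selection rule: the list reported above pools several experiments, so a single-experiment intersection does not have to reproduce it.

```r
# Feature indices copied from the "CBDA" and "Knockoff" columns of the
# Experiment 1 table above (the 15 rows that are printed).
cbda_top     <- c(34, 37, 23, 53, 5, 16, 41, 52, 4, 6, 7, 8, 10, 13, 27)
knockoff_top <- c(7, 4, 1, 8, 6, 2, 5, 3, 57, 58, 37, 49, 13, 53, 38)

# Features that appear in both lists for this single experiment.
shared <- intersect(cbda_top, knockoff_top)
sort(shared)
```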

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for analysis are the ones listed above. A final summary of the accuracy of the overall procedure is determined by using the CBDA-SL object on the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined: the first stage selects the top features, and the second stage uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, etc.
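
The last step, turning predictions into a confusion matrix, can be sketched as follows. The labels are toy values invented for illustration (not ADNI data), and the variable names `pred_temp` and `truth_temp` simply mirror the ones that appear in the warnings below. Giving both factors the same level order up front is one way to avoid the "Levels are not in the same order" refactoring.

```r
# Toy labels, invented for illustration only (not ADNI data).
class_levels <- c("AD", "LMCI", "MCI", "Normal")

truth_temp <- factor(c("AD", "MCI", "MCI", "Normal", "AD", "Normal"),
                     levels = class_levels)
pred_raw   <- c("AD", "MCI", "AD", "Normal", "MCI", "Normal")

# Give the predictions the same level order as the reference.
pred_temp <- factor(pred_raw, levels = levels(truth_temp))

# Cross-tabulate predictions (rows) against the reference (columns).
cm <- table(Prediction = pred_temp, Reference = truth_temp)
print(cm)

# With the caret package installed, the full report can be obtained with
# caret::confusionMatrix(data = pred_temp, reference = truth_temp).
```

The output that follows is the report's own, computed on the real held-out subjects, not on this toy example.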

## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(pred_temp, truth_temp): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD      96    0  35      1
##     LMCI     0    0   0      0
##     MCI     26    0 355     14
##     Normal   0    0  14    209
## 
## Overall Statistics
##                                           
##                Accuracy : 0.88            
##                  95% CI : (0.8546, 0.9024)
##     No Information Rate : 0.5387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7996          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.7869          NA     0.8787        0.9330
## Specificity             0.9427           1     0.8844        0.9734
## Pos Pred Value          0.7273          NA     0.8987        0.9372
## Neg Pred Value          0.9579          NA     0.8620        0.9715
## Prevalence              0.1627           0     0.5387        0.2987
## Detection Rate          0.1280           0     0.4733        0.2787
## Detection Prevalence    0.1760           0     0.5267        0.2973
## Balanced Accuracy       0.8648          NA     0.8816        0.9532
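
The summary statistics above can be checked by hand from the confusion matrix. The sketch below uses base R only and the counts copied from the table, applying the standard definitions of accuracy, sensitivity, specificity, and Cohen's kappa.

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference).
classes <- c("AD", "LMCI", "MCI", "Normal")
cm <- matrix(
  c(96, 0,  35,   1,
     0, 0,   0,   0,
    26, 0, 355,  14,
     0, 0,  14, 209),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n  <- sum(cm)                 # total number of predicted subjects
tp <- diag(cm)                # correct predictions per class
fp <- rowSums(cm) - tp        # predicted as the class, but another class
fn <- colSums(cm) - tp        # belong to the class, but predicted otherwise
tn <- n - tp - fp - fn

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn) # NaN for LMCI: no reference subjects
specificity <- tn / (tn + fp)
nir         <- max(colSums(cm)) / n  # no-information rate

# Cohen's kappa: observed agreement corrected for chance agreement.
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, nir = nir, kappa = kappa), 4)
round(sensitivity, 4)
round(specificity, 4)
```

LMCI has no subjects in the reference column, so its sensitivity is 0/0, which is why the report shows NA for that class.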