Some useful information

This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the % of missing values [misValperc], the min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and the min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
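As a rough illustration of what the FSR and SSR arguments control, the sketch below draws one job's subsample of features and subjects. It assumes the sampling fraction is drawn uniformly within each range, which this summary does not state. The helper `sample_job` and the data dimensions (1000 subjects, 60 features) are hypothetical and are not taken from the CBDA code.

```r
# Hypothetical helper, not part of the CBDA code base: draw one job's subsample.
sample_job <- function(n_subjects, n_features,
                       Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  # Assumption: the sampling fraction is uniform within each [min, max] % range.
  k_frac <- runif(1, Kcol_min / 100, Kcol_max / 100)
  n_frac <- runif(1, Nrow_min / 100, Nrow_max / 100)
  list(
    features = sort(sample(seq_len(n_features), max(1, round(k_frac * n_features)))),
    subjects = sort(sample(seq_len(n_subjects), max(1, round(n_frac * n_subjects))))
  )
}

set.seed(1)
# FSR/SSR settings of Experiment 1 in this report; the dimensions are made up.
job <- sample_job(n_subjects = 1000, n_features = 60,
                  Kcol_min = 5, Kcol_max = 15, Nrow_min = 60, Nrow_max = 80)
length(job$features)  # between 3 and 9 of the 60 features
length(job$subjects)  # between 600 and 800 of the 1000 subjects
```

Under this reading, every job sees a different random slice of the data, and the selection frequencies reported below summarize how often each feature is selected across jobs.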

This document reports the final results by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].

# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)

Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we don’t have information on the “true” features. I list the top selected features; the number of features listed is set to 20 here.

## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  40   9         2.821317  9       2.801809
##  7    8         2.507837  3       2.795738
##  15   8         2.507837  6       2.771454
##  22   8         2.507837  1       2.765383
##  25   8         2.507837  7       2.756276
##  59   8         2.507837 59       2.731992
##  4    7         2.194357  4       2.710743
##  13   7         2.194357  2       2.683423
##  16   7         2.194357  8       2.595392
##  31   7         2.194357  5       2.589321
##  38   7         2.194357 60       2.492183
##  46   7         2.194357 39       2.073278
##  60   7         2.194357 51       1.927572
##  5    6         1.880878 40       1.827399
##  9    6         1.880878 36       1.772759
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1]  7 59  4 60
## [1] "subjectAge" "L_caudate"  "CDGLOBAL"   "R_caudate"

The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. ONLY the features listed above are used in this analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined in a first stage to select the top features. The second stage then uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, etc.
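The two warnings in the output that follows suggest that the predicted and reference factors did not share the same set or order of levels. A minimal sketch of one way to avoid this is to build both factors from one explicit level vector before tabulating. The toy labels and the names `truth_chr` and `pred_chr` are made up for illustration and are not the report's data.

```r
# Explicit class levels, in the order used by the confusion matrix in this report.
class_levels <- c("AD", "LMCI", "MCI", "Normal")

# Toy labels for illustration only (hypothetical, not the ADNI data).
truth_chr <- c("AD", "MCI", "MCI",    "Normal", "AD",  "Normal")
pred_chr  <- c("AD", "MCI", "Normal", "Normal", "MCI", "Normal")

# Building both factors from the same level vector keeps unused classes
# (here LMCI) and guarantees identical level order.
truth <- factor(truth_chr, levels = class_levels)
pred  <- factor(pred_chr,  levels = class_levels)

cm <- table(Prediction = pred, Reference = truth)
cm

# If the caret package is installed, caret::confusionMatrix(pred, truth)
# can then be called on the aligned factors.
```

Because the levels are aligned up front, the table is always 4 x 4, even when a class such as LMCI never occurs in either vector.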

## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(pred_temp, truth_temp): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD     105    0  14      0
##     LMCI     0    0   0      0
##     MCI     17    0 381     18
##     Normal   0    0   9    206
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9227          
##                  95% CI : (0.9012, 0.9408)
##     No Information Rate : 0.5387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8689          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.8607          NA     0.9431        0.9196
## Specificity             0.9777           1     0.8988        0.9829
## Pos Pred Value          0.8824          NA     0.9159        0.9581
## Neg Pred Value          0.9731          NA     0.9311        0.9664
## Prevalence              0.1627           0     0.5387        0.2987
## Detection Rate          0.1400           0     0.5080        0.2747
## Detection Prevalence    0.1587           0     0.5547        0.2867
## Balanced Accuracy       0.9192          NA     0.9210        0.9513
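As a cross-check, the headline statistics above can be recomputed from the confusion matrix counts with base R alone. The sketch below re-enters the counts by hand and uses the standard definitions of accuracy, Cohen's kappa, sensitivity and specificity. The variable names are illustrative.

```r
classes <- c("AD", "LMCI", "MCI", "Normal")

# Counts copied from the confusion matrix above (rows: Prediction, cols: Reference).
cm <- matrix(c(105, 0,  14,   0,
                 0, 0,   0,   0,
                17, 0, 381,  18,
                 0, 0,   9, 206),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: observed agreement corrected for chance agreement.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity = TP / reference total; LMCI has no reference
# cases, so it comes out as NaN (shown as NA in the output above).
sensitivity <- diag(cm) / colSums(cm)

# Per-class specificity = TN / (TN + FP).
fp <- rowSums(cm) - diag(cm)
tn <- n - rowSums(cm) - colSums(cm) + diag(cm)
specificity <- tn / (tn + fp)

round(accuracy, 4)     # 0.9227
round(cohen_kappa, 4)  # 0.8689
round(sensitivity, 4)  # AD 0.8607, LMCI NaN, MCI 0.9431, Normal 0.9196
round(specificity, 4)  # AD 0.9777, LMCI 1, MCI 0.8988, Normal 0.9829
```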