This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the % of missing values [misValperc], the min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and the min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
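To make the six arguments concrete, here is a minimal sketch of how one job could draw its feature and subject subsets from the FSR and SSR percentages. It uses the argument values of Experiment 1; the helper `sample_job` and the dataset sizes `n_features` and `n_subjects` are hypothetical placeholders, not the workflow's actual code or the real dimensions of the dataset.

```r
# Experiment specification, using the six input arguments of Experiment 1.
experiment <- list(M = 9000, misValperc = 0,
                   Kcol_min = 5, Kcol_max = 15,
                   Nrow_min = 60, Nrow_max = 80)

# Hypothetical helper: draw one job's feature (FSR) and subject (SSR) subsets.
# Assumption: each percentage is drawn uniformly between its min and max.
sample_job <- function(spec, n_features, n_subjects) {
  k_pct <- runif(1, spec$Kcol_min, spec$Kcol_max)  # % of features to sample
  n_pct <- runif(1, spec$Nrow_min, spec$Nrow_max)  # % of subjects to sample
  k <- max(1, round(n_features * k_pct / 100))
  n <- max(1, round(n_subjects * n_pct / 100))
  list(features = sort(sample.int(n_features, k)),
       subjects = sort(sample.int(n_subjects, n)))
}

set.seed(1)
# Placeholder dataset sizes, chosen only for illustration.
job <- sample_job(experiment, n_features = 60, n_subjects = 1000)
length(job$features)  # between 5% and 15% of the features
length(job$subjects)  # between 60% and 80% of the subjects
```

Each of the M jobs would repeat this draw independently, so every job sees a different slice of the data.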
This document reports the final results, by experiment. In this case the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].
# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we have no information on the “true” features. I list the top features selected; the number of top features is set to 20 here.
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   40         9 2.821317        9 2.801809
##    7         8 2.507837        3 2.795738
##   15         8 2.507837        6 2.771454
##   22         8 2.507837        1 2.765383
##   25         8 2.507837        7 2.756276
##   59         8 2.507837       59 2.731992
##    4         7 2.194357        4 2.710743
##   13         7 2.194357        2 2.683423
##   16         7 2.194357        8 2.595392
##   31         7 2.194357        5 2.589321
##   38         7 2.194357       60 2.492183
##   46         7 2.194357       39 2.073278
##   60         7 2.194357       51 1.927572
##    5         6 1.880878       40 1.827399
##    9         6 1.880878       36 1.772759
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 7 59 4 60
## [1] "subjectAge" "L_caudate" "CDGLOBAL" "R_caudate"
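Finding the features shared by the two methods amounts to a set intersection of their top-ranked feature indices. The sketch below applies base R's `intersect` to the rows of the Experiment 1 table; the variable names are illustrative. Because the report labels its shared list as selected across multiple experiments, this single-table intersection is not expected to reproduce that list exactly.

```r
# Feature indices from the CBDA and Knockoff columns of the Experiment 1 table.
cbda_top     <- c(40, 7, 15, 22, 25, 59, 4, 13, 16, 31, 38, 46, 60, 5, 9)
knockoff_top <- c(9, 3, 6, 1, 7, 59, 4, 2, 8, 5, 60, 39, 51, 40, 36)

# Features ranked highly by both methods in this one table.
shared <- intersect(cbda_top, knockoff_top)
print(shared)

# The features reported above as shared are all part of this intersection.
reported <- c(7, 59, 4, 60)
all(reported %in% shared)
```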
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for this analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, we combine the CBDA-SL and knockoff filter algorithms in two stages: the first stage selects the top features, and the second stage uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, and related metrics.
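As a minimal sketch of the evaluation step, the snippet below builds a confusion matrix from held-out predictions using base R. The label vectors are toy placeholders, not the study's data. Giving both factors the same explicit level set, including classes that are never predicted, should avoid level-mismatch warnings when the same factors are passed to `caret::confusionMatrix`.

```r
# All outcome classes, in one fixed order shared by predictions and truth.
classes <- c("AD", "LMCI", "MCI", "Normal")

# Toy placeholder labels; in the report these would come from the held-out
# subjects and from SL_Pred_Combined.
truth_raw <- c("AD", "MCI", "Normal", "MCI", "AD", "Normal")
pred_raw  <- c("AD", "MCI", "Normal", "AD",  "AD", "MCI")

# Same levels for both factors, even for classes with zero observations.
truth <- factor(truth_raw, levels = classes)
pred  <- factor(pred_raw,  levels = classes)

# Rows are predictions, columns are the reference labels.
cm <- table(Prediction = pred, Reference = truth)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)

# With the caret package installed, the full summary is available via:
# caret::confusionMatrix(data = pred, reference = truth)
```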
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(pred_temp, truth_temp): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
##            Reference
## Prediction  AD LMCI MCI Normal
## AD         105    0  14      0
## LMCI         0    0   0      0
## MCI         17    0 381     18
## Normal       0    0   9    206
##
## Overall Statistics
##
## Accuracy : 0.9227
## 95% CI : (0.9012, 0.9408)
## No Information Rate : 0.5387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8689
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.8607          NA     0.9431        0.9196
## Specificity             0.9777           1     0.8988        0.9829
## Pos Pred Value          0.8824          NA     0.9159        0.9581
## Neg Pred Value          0.9731          NA     0.9311        0.9664
## Prevalence              0.1627           0     0.5387        0.2987
## Detection Rate          0.1400           0     0.5080        0.2747
## Detection Prevalence    0.1587           0     0.5547        0.2867
## Balanced Accuracy       0.9192          NA     0.9210        0.9513
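The summary statistics follow directly from the counts in the confusion matrix. As a check, the sketch below re-enters the reported counts by hand and recomputes accuracy, Cohen's kappa, and the per-class sensitivity and positive predictive value using the standard formulas; the variable names are illustrative.

```r
classes <- c("AD", "LMCI", "MCI", "Normal")

# Counts copied from the confusion matrix above (rows: prediction, cols: reference).
cm <- matrix(c(105, 0,  14,   0,
                 0, 0,   0,   0,
                17, 0, 381,  18,
                 0, 0,   9, 206),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)                        # total held-out subjects
accuracy <- sum(diag(cm)) / n       # fraction of correct predictions

# Cohen's kappa: agreement beyond what the marginals alone would produce.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

# Per-class metrics; LMCI has no reference or predicted cases, so the
# 0/0 division yields NaN, which corresponds to the NA entries in the report.
sensitivity <- diag(cm) / colSums(cm)
ppv         <- diag(cm) / rowSums(cm)
prevalence  <- colSums(cm) / n

round(c(accuracy = accuracy, kappa = kappa), 4)
round(rbind(sensitivity, ppv, prevalence), 4)
```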