This is a summary of a single experiment run with a LONI Pipeline workflow file that performs 3000 independent jobs, each applying the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
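To make the two sampling ranges concrete, here is a minimal Python sketch (not the project's R code) of how a single job could draw its random subset of subjects and features. The function name `sample_job`, the uniform draw of each percentage, and the dataset dimensions used in the example are assumptions for illustration; only the parameter names and the 5-15% and 60-80% ranges come from this report.

```python
import random


def sample_job(n_subjects, n_features,
               kcol_min=5, kcol_max=15,
               nrow_min=60, nrow_max=80,
               rng=None):
    """Draw one job's subject and feature subsets (hypothetical helper).

    Assumption: each job picks a percentage uniformly inside the FSR
    [kcol_min, kcol_max] and the SSR [nrow_min, nrow_max], then samples
    that share of features and subjects without replacement.
    """
    if rng is None:
        rng = random.Random()
    feature_pct = rng.uniform(kcol_min, kcol_max)
    subject_pct = rng.uniform(nrow_min, nrow_max)
    n_feat = max(1, round(n_features * feature_pct / 100))
    n_subj = max(1, round(n_subjects * subject_pct / 100))
    # 1-based indices, to match the feature numbering used in the report.
    features = sorted(rng.sample(range(1, n_features + 1), n_feat))
    subjects = sorted(rng.sample(range(1, n_subjects + 1), n_subj))
    return subjects, features


# Hypothetical dataset size: 500 subjects and 64 features.
subjects, features = sample_job(500, 64, rng=random.Random(1))
```

With these hypothetical dimensions, each call returns between 300 and 400 subject indices and between 3 and 10 feature indices.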
This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code.
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. I list the top features selected by each method; the number of top features is set to 15 here.
## [1] EXPERIMENT 1
##     M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
##  9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## [1] "EXPERIMENT" "1"
##  CBDA Frequency   Density Knockoff  Density
##     4        50 12.019231        5 2.613165
##    28        21  5.048077        2 2.534098
##     8        19  4.567308        7 2.126903
##     7        18  4.326923       55 2.115042
##    63        18  4.326923        1 2.093299
##    47        17  4.086538       34 2.055742
##    45        15  3.605769       35 2.035975
##    62        14  3.365385       38 1.954932
##    46        13  3.125000       59 1.939118
##    35        11  2.644231       40 1.923305
##    56        11  2.644231       60 1.919352
##    16        10  2.403846       20 1.915398
##     2         9  2.163462       58 1.895632
##    19         9  2.163462       22 1.846215
##    57         9  2.163462        8 1.816565
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 4 28 8 7 63 47 45 62 46 35 56 16 2 19 57 5 55 1 34 38 59 40 60
## [24] 20 58 22
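The 26 indices printed above coincide with the order-preserving union of the two top-15 columns in the table: the CBDA-SL features first, followed by the knockoff features not already listed. The Python sketch below reproduces that list and also computes the features that appear in both rankings. The helper name `ordered_union` is hypothetical, and this is an illustration of the set logic, not the project's R code.

```python
# Top-15 feature indices copied from the table above.
cbda_top = [4, 28, 8, 7, 63, 47, 45, 62, 46, 35, 56, 16, 2, 19, 57]
knockoff_top = [5, 2, 7, 55, 1, 34, 35, 38, 59, 40, 60, 20, 58, 22, 8]


def ordered_union(*rankings):
    """Concatenate rankings, keeping only the first occurrence of each feature."""
    seen = set()
    merged = []
    for ranking in rankings:
        for feature in ranking:
            if feature not in seen:
                seen.add(feature)
                merged.append(feature)
    return merged


# Features used for the final analysis: union of both top-15 lists.
combined = ordered_union(cbda_top, knockoff_top)

# Features ranked in the top 15 by both methods, in CBDA-SL order.
knockoff_set = set(knockoff_top)
in_both = [f for f in cbda_top if f in knockoff_set]
```

Here `combined` has 26 entries, and `in_both` contains the four features (8, 7, 35 and 2) that both methods rank in their top 15.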
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions are then used to generate the confusion matrix. In short, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features; the second stage uses those features in a final predictive modeling step that can be tested for accuracy, sensitivity, …
##       ggplot2          plyr    colorspace          grid    data.table
##          TRUE          TRUE          TRUE          TRUE          TRUE
##           VIM          MASS        Matrix          lme4           arm
##          TRUE          TRUE          TRUE          TRUE          TRUE
##       foreach        glmnet         class          nnet          mice
##          TRUE          TRUE          TRUE          TRUE          TRUE
##    missForest     calibrate          nnls  SuperLearner       plotrix
##          TRUE          TRUE          TRUE          TRUE          TRUE
## TeachingDemos        plotmo         earth      parallel       splines
##          TRUE          TRUE          TRUE          TRUE          TRUE
##           gam            mi     BayesTree         e1071  randomForest
##          TRUE          TRUE          TRUE          TRUE          TRUE
##         Hmisc         dplyr        Amelia   bartMachine      knockoff
##          TRUE          TRUE          TRUE          TRUE          TRUE
##         caret   smotefamily           FNN
##          TRUE          TRUE          TRUE
## [1] 4 28 8 7 63 47 45 62 46 35
## Levels: 4 28 8 7 63 47 45 62 46 35 56 16 2 19 57
## missForest iteration 1 in progress...done!
## missForest iteration 1 in progress...done!
## missForest iteration 1 in progress...done!
## missForest iteration 1 in progress...done!
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  AD MCI Normal
##     AD      69  17      1
##     MCI     12 243      8
##     Normal   0   9    140
##
## Overall Statistics
##
##                Accuracy : 0.9058
##                  95% CI : (0.8767, 0.93)
##     No Information Rate : 0.5391
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.8426
##  Mcnemar's Test P-Value : 0.589
##
## Statistics by Class:
##
##                      Class: AD Class: MCI Class: Normal
## Sensitivity             0.8519     0.9033        0.9396
## Specificity             0.9569     0.9130        0.9743
## Pos Pred Value          0.7931     0.9240        0.9396
## Neg Pred Value          0.9709     0.8898        0.9743
## Prevalence              0.1623     0.5391        0.2986
## Detection Rate          0.1383     0.4870        0.2806
## Detection Prevalence    0.1743     0.5271        0.2986
## Balanced Accuracy       0.9044     0.9082        0.9569
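The overall and per-class statistics above follow directly from the nine counts in the confusion matrix. As a cross-check, the Python sketch below recomputes accuracy, Cohen's kappa and the one-vs-rest class statistics from those counts. The function names are hypothetical, and this is an independent illustration using the standard definitions, not the R code that produced the report.

```python
labels = ["AD", "MCI", "Normal"]
# Rows are predictions, columns are reference classes (as in the output above).
cm = [
    [69, 17, 1],
    [12, 243, 8],
    [0, 9, 140],
]


def accuracy(cm):
    """Share of subjects on the diagonal."""
    total = sum(map(sum, cm))
    return sum(cm[k][k] for k in range(len(cm))) / total


def kappa(cm):
    """Cohen's kappa: agreement beyond what the marginals alone would give."""
    total = sum(map(sum, cm))
    observed = accuracy(cm)
    expected = sum(
        sum(cm[k]) * sum(row[k] for row in cm) for k in range(len(cm))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


def class_stats(cm, k):
    """One-vs-rest statistics for class index k."""
    total = sum(map(sum, cm))
    tp = cm[k][k]
    fp = sum(cm[k]) - tp                  # predicted k, reference differs
    fn = sum(row[k] for row in cm) - tp   # reference k, predicted differs
    tn = total - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


for k, label in enumerate(labels):
    stats = class_stats(cm, k)
    print(label, {name: round(value, 4) for name, value in stats.items()})
print("accuracy", round(accuracy(cm), 4), "kappa", round(kappa(cm), 4))
```

Rounded to four decimals, these values agree with the figures reported above, for example an accuracy of 0.9058 and a kappa of 0.8426 over the 499 held-out subjects.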