This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: the number of jobs [M], the percentage of missing values [misValperc], the minimum [Kcol_min] and maximum [Kcol_max] percentages for the Feature Sampling Range (FSR), and the minimum [Nrow_min] and maximum [Nrow_max] percentages for the Subject Sampling Range (SSR).
This document reports the final results, experiment by experiment. In this case the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project, and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].
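To make the sampling-range arguments concrete, here is a minimal R sketch of how a single job might draw its random subset of subjects and features. The helper name `sample_job`, the uniform draw inside each range, and the dataset dimensions (1000 subjects, 100 features) are assumptions for illustration only; they are not taken from the CBDA-SL code.

```r
# Hypothetical helper: draw one job's random subset of subjects (rows)
# and features (columns) from the SSR and FSR percentage ranges.
sample_job <- function(n_subjects, n_features,
                       Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  # Assumed scheme: pick one percentage uniformly inside each range
  k_perc <- runif(1, Kcol_min, Kcol_max)
  n_perc <- runif(1, Nrow_min, Nrow_max)

  # Convert percentages to counts, keeping at least one row and one column
  k <- max(1, round(n_features * k_perc / 100))
  n <- max(1, round(n_subjects * n_perc / 100))

  list(rows = sort(sample.int(n_subjects, n)),
       cols = sort(sample.int(n_features, k)))
}

# Sampling ranges of Experiment 1; the dataset dimensions are placeholders
set.seed(1)
job <- sample_job(n_subjects = 1000, n_features = 100,
                  Kcol_min = 5, Kcol_max = 15,
                  Nrow_min = 60, Nrow_max = 80)
length(job$rows)  # number of subjects sampled for this job
length(job$cols)  # number of features sampled for this job
```

When the minimum and maximum of a range coincide, as with Nrow_min = Nrow_max = 100 in Experiment 2, this sketch simply uses every subject.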
# Load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)
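Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we don’t have information on the “true” features. For each experiment I list the top features selected (the cutoff is set to 20 here). In the tables, the CBDA and Knockoff columns give feature indices (column positions in the dataset); Frequency is the CBDA-SL selection count, and each Density column reports the selection density for the method to its left.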
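As a rough illustration of how such a ranking can be built, the R sketch below counts how often each feature index is selected across jobs and keeps the most frequent ones. The function name `top_features` and the toy input are hypothetical. The density formula (100 × Frequency / total count) is inferred from the printed CBDA numbers (for example, 9 / 2.821317 × 100 ≈ 319 in Experiment 1) and is not confirmed by the CBDA-SL code.

```r
# Hypothetical toy input: feature indices selected across a handful of jobs
selected <- c(40, 7, 40, 15, 7, 40, 59, 15, 40, 7)

# Rank features by how often they were selected and keep the top_n
top_features <- function(selected, top_n = 20) {
  counts <- table(selected)
  feat   <- as.integer(names(counts))
  freq   <- as.integer(counts)

  # Order by decreasing frequency, then truncate to the requested cutoff
  ord <- head(order(freq, decreasing = TRUE), top_n)

  data.frame(
    Feature   = feat[ord],
    Frequency = freq[ord],
    # Assumed definition: share of all selections, in percent
    Density   = 100 * freq[ord] / sum(freq)
  )
}

res <- top_features(selected, top_n = 3)
print(res)
```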
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   40         9 2.821317        9 2.801809
##    7         8 2.507837        3 2.795738
##   15         8 2.507837        6 2.771454
##   22         8 2.507837        1 2.765383
##   25         8 2.507837        7 2.756276
##   59         8 2.507837       59 2.731992
##    4         7 2.194357        4 2.710743
##   13         7 2.194357        2 2.683423
##   16         7 2.194357        8 2.595392
##   31         7 2.194357        5 2.589321
##   38         7 2.194357       60 2.492183
##   46         7 2.194357       39 2.073278
##   60         7 2.194357       51 1.927572
##    5         6 1.880878       40 1.827399
##    9         6 1.880878       36 1.772759
##
## [1] EXPERIMENT 2
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   22        12 3.603604       59 2.842306
##   45        12 3.603604        7 2.836160
##    8         8 2.402402        8 2.826942
##   10         8 2.402402        4 2.780851
##   18         8 2.402402        5 2.753196
##   38         8 2.402402        3 2.734759
##   43         8 2.402402        1 2.725541
##   57         8 2.402402        6 2.670231
##   66         8 2.402402        9 2.664086
##   13         7 2.102102        2 2.636431
##   19         7 2.102102       60 2.531957
##   23         7 2.102102       39 2.405973
##   30         7 2.102102       51 2.172443
##   33         7 2.102102       40 2.120206
##   36         7 2.102102       15 2.043387
##
## [1] EXPERIMENT 3
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   28        18 2.479339        4 4.909827
##   56        18 2.479339        2 4.667535
##   15        17 2.341598        7 4.653821
##   18        17 2.341598        3 4.608105
##   37        17 2.341598        6 4.605820
##    4        16 2.203857        9 4.598962
##   54        16 2.203857        1 4.576105
##   16        15 2.066116        5 4.441244
##   40        15 2.066116        8 4.118951
##   58        15 2.066116       59 3.750943
##   59        14 1.928375       60 2.758920
##   64        14 1.928375       39 2.667490
##   12        13 1.790634       51 2.018332
##   39        13 1.790634       40 1.894900
##   44        13 1.790634       61 1.618323
##
## [1] EXPERIMENT 7
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       40       60
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   42        10 2.994012        3 2.817322
##   21         9 2.694611        4 2.793446
##   34         9 2.694611        6 2.760617
##   50         9 2.694611        7 2.715850
##    6         8 2.395210        2 2.703913
##   19         8 2.395210        1 2.688990
##   30         8 2.395210        9 2.677053
##   53         8 2.395210        8 2.468141
##   60         8 2.395210        5 2.462172
##   13         7 2.095808       59 2.408452
##   18         7 2.095808       60 2.312950
##   22         7 2.095808       39 1.850360
##   23         7 2.095808       55 1.766795
##   24         7 2.095808       51 1.719044
##   57         7 2.095808       15 1.680246
##
## [1] EXPERIMENT 8
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30       40       60
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   31        17 2.439024        6 4.990006
##   34        17 2.439024        9 4.888857
##    5        15 2.152080        4 4.864774
##   15        15 2.152080        2 4.785300
##   65        15 2.152080        1 4.633577
##   16        14 2.008608        5 4.595044
##   37        14 2.008608        3 4.563736
##   41        14 2.008608        7 4.505936
##   43        14 2.008608        8 3.925535
##   45        14 2.008608       59 3.598006
##   47        14 2.008608       60 2.947764
##   61        14 2.008608       39 2.167473
##   63        14 2.008608       40 1.762878
##    6        13 1.865136       51 1.642463
##    9        13 1.865136       36 1.594297
##
## [1] EXPERIMENT 9
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       30       50       40       60
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   29        27 2.051672        6 7.538142
##   15        26 1.975684        4 7.369761
##   24        26 1.975684        9 7.256823
##   37        26 1.975684        1 7.248609
##   16        25 1.899696        2 7.213700
##   20        25 1.899696        3 6.513481
##   27        24 1.823708        5 6.412863
##   39        24 1.823708        7 6.189039
##   46        24 1.823708        8 4.632539
##   53        24 1.823708       59 3.308076
##   14        23 1.747720       39 2.246453
##   23        23 1.747720       60 2.024682
##   40        23 1.747720       44 1.396333
##   44        23 1.747720       51 1.392226
##   55        23 1.747720       40 1.330623
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 40 7 15 59 8 60 6 4 5
## [1] "R_inferior_occipital_gyrus" "subjectAge"
## [3] "L_inferior_frontal_gyrus" "L_caudate"
## [5] "weightKg" "R_caudate"
## [7] "FAQTOTAL" "CDGLOBAL"
## [9] "NPISCORE"
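The idea of a feature being "shared" between the two methods can be illustrated with a set intersection. The R sketch below uses the Experiment 1 table only: it intersects the 15 CBDA-SL indices with the 15 knockoff filter indices. The report's final list aggregates across several experiments, and the exact aggregation rule is not stated here, so this single-experiment result is not expected to reproduce that list.

```r
# Top feature indices printed for Experiment 1 (copied from its table)
cbda_top     <- c(40, 7, 15, 22, 25, 59, 4, 13, 16, 31, 38, 46, 60, 5, 9)
knockoff_top <- c(9, 3, 6, 1, 7, 59, 4, 2, 8, 5, 60, 39, 51, 40, 36)

# Features that appear in both rankings, kept in CBDA-SL order
shared <- intersect(cbda_top, knockoff_top)
print(shared)
# [1] 40  7 59  4 60  5  9
```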
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. These are the ONLY features used for that analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second stage uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, …
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(pred_temp, truth_temp): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
##            Reference
## Prediction  AD LMCI MCI Normal
## AD          83    0  33      2
## LMCI         0    0   0      0
## MCI         39    0 360     15
## Normal       0    0  11    207
##
## Overall Statistics
##
##               Accuracy : 0.8667
##                 95% CI : (0.8402, 0.8902)
##    No Information Rate : 0.5387
##    P-Value [Acc > NIR] : < 2.2e-16
##
##                  Kappa : 0.7741
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.6803          NA     0.8911        0.9241
## Specificity             0.9443           1     0.8439        0.9791
## Pos Pred Value          0.7034          NA     0.8696        0.9495
## Neg Pred Value          0.9383          NA     0.8690        0.9680
## Prevalence              0.1627           0     0.5387        0.2987
## Detection Rate          0.1107           0     0.4800        0.2760
## Detection Prevalence    0.1573           0     0.5520        0.2907
## Balanced Accuracy       0.8123          NA     0.8675        0.9516
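As a cross-check on the summary statistics, the headline numbers can be recomputed directly from the confusion matrix counts. The base R sketch below re-enters the 4 x 4 table by hand (rows are predictions, columns are reference classes) and derives accuracy, per-class sensitivity and specificity, and Cohen's kappa from their standard definitions. It is an independent recomputation for illustration, not the code that produced the output above.

```r
# Confusion matrix counts copied from the output (rows = prediction)
classes <- c("AD", "LMCI", "MCI", "Normal")
cm <- matrix(
  c(83, 0,  33,   2,
     0, 0,   0,   0,
    39, 0, 360,  15,
     0, 0,  11, 207),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n        <- sum(cm)             # 750 held-out subjects
accuracy <- sum(diag(cm)) / n   # overall fraction classified correctly

# Per-class sensitivity: true positives / all reference cases of the class.
# LMCI has no reference cases, so 0/0 yields NaN (reported as NA above).
sensitivity <- diag(cm) / colSums(cm)

# Per-class specificity: true negatives / all reference non-cases
true_neg    <- n - rowSums(cm) - colSums(cm) + diag(cm)
specificity <- true_neg / (n - colSums(cm))

# Cohen's kappa: agreement beyond what the marginals predict by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa), 4)
round(sensitivity, 4)
round(specificity, 4)
```

The empty LMCI row and column account for the NA entries in the per-class statistics: no held-out subject carries that label and none is predicted as LMCI.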