This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
This document reports the final results, by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project, and GitHub (https://github.com/SOCR/CBDA) for some of the code [still in progress].
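To make the two sampling ranges concrete, here is a minimal sketch of how a single job could draw its random subset of subjects and features. It is not the CBDA-SL code: the helper `sample_job`, the uniform draw of the percentages, and the dataset size (1000 subjects, 60 features) are assumptions for illustration only. The default percentages are the ones used in Experiment 1. Missing-value injection (misValperc) is not covered.

```r
# Hypothetical helper (not part of CBDA-SL): pick the rows and columns
# that one job would work on, given the SSR and FSR percentage ranges.
sample_job <- function(n_subjects, n_features,
                       Kcol_min = 5,  Kcol_max = 15,   # FSR, in percent
                       Nrow_min = 60, Nrow_max = 80) { # SSR, in percent
  # Assumption: each percentage is drawn uniformly within its range.
  k_perc <- runif(1, Kcol_min, Kcol_max)
  n_perc <- runif(1, Nrow_min, Nrow_max)
  k <- max(1, round(n_features * k_perc / 100))
  n <- max(1, round(n_subjects * n_perc / 100))
  list(rows = sort(sample.int(n_subjects, n)),
       cols = sort(sample.int(n_features, k)))
}

set.seed(123)
# Illustrative sizes only: 1000 subjects, 60 features, 5 jobs
jobs <- lapply(1:5, function(i) sample_job(1000, 60))

# Number of subjects and features drawn by each job
sizes <- sapply(jobs, function(j) c(rows = length(j$rows), cols = length(j$cols)))
print(sizes)
```

Each job then sees a different slice of the data, which is what lets feature selection frequencies be accumulated across jobs.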
# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we have no information on the “true” features. I list the top features selected; the cutoff is set to 20 here.
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   20        10 2.985075        7 2.243323
##   60         9 2.686567        6 2.176855
##    1         8 2.388060        2 2.167359
##   29         8 2.388060        4 2.136499
##   31         8 2.388060        1 2.091395
##   37         8 2.388060        9 2.091395
##   41         8 2.388060        3 2.058160
##   44         8 2.388060       10 2.053412
##   56         8 2.388060        8 2.024926
##    2         7 2.089552        5 1.984570
##   10         7 2.089552       59 1.984570
##   19         7 2.089552       42 1.956083
##   23         7 2.089552       60 1.948961
##   54         7 2.089552       51 1.846884
##   55         7 2.089552       39 1.832641
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 1 2 10
## [1] "subjectSex" "MMSCORE" "Background"
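The ranking and intersection behind the output above can be sketched as follows. This is a minimal illustration on synthetic selections, not the actual CBDA-SL or knockoff code: the vectors `cbda_hits` and `knockoff_hits` and the helper `rank_features` are hypothetical. The last line uses real numbers from the Experiment 1 table: the CBDA Density column is consistent with 100 * Frequency / 335, which would correspond to 335 selections in total. That total is inferred from the numbers, not stated in the report.

```r
# Synthetic selections (NOT the real results): each entry is the index of a
# feature picked by one job. 60 features is an illustrative size.
set.seed(1)
cbda_hits     <- sample(1:60, 335, replace = TRUE)
knockoff_hits <- sample(1:60, 500, replace = TRUE)

# Hypothetical helper: rank features by how often they were selected.
rank_features <- function(hits, top = 20) {
  freq <- sort(table(hits), decreasing = TRUE)
  ranked <- data.frame(
    Feature   = as.integer(names(freq)),
    Frequency = as.integer(freq),
    Density   = 100 * as.integer(freq) / length(hits)  # percent of all selections
  )
  head(ranked, top)
}

cbda_top     <- rank_features(cbda_hits)
knockoff_top <- rank_features(knockoff_hits)

# Features that make the top list of both methods
shared <- intersect(cbda_top$Feature, knockoff_top$Feature)
print(shared)

# Check against the CBDA columns of the Experiment 1 table
# (frequencies 10, 9, 8, 7 and their printed densities)
density_check <- round(100 * c(10, 9, 8, 7) / 335, 6)
print(density_check)
```

Note that the reported shared features are pooled across experiments, so intersecting the two top lists of a single experiment need not reproduce them exactly.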
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for this analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined in two stages: the first stage selects the top features, and the second stage uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, etc.
## Confusion Matrix and Statistics
##
##            Reference
## Prediction  AD LMCI MCI Normal
## AD         112    9  16      0
## LMCI         3  182  24      5
## MCI          7   12 151      2
## Normal       0    6   4    217
##
## Overall Statistics
##
##               Accuracy : 0.8827
##                 95% CI : (0.8575, 0.9048)
##    No Information Rate : 0.2987
##    P-Value [Acc > NIR] : < 2.2e-16
##
##                  Kappa : 0.8416
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.9180      0.8708     0.7744        0.9688
## Specificity             0.9602      0.9409     0.9622        0.9810
## Pos Pred Value          0.8175      0.8505     0.8779        0.9559
## Neg Pred Value          0.9837      0.9496     0.9239        0.9866
## Prevalence              0.1627      0.2787     0.2600        0.2987
## Detection Rate          0.1493      0.2427     0.2013        0.2893
## Detection Prevalence    0.1827      0.2853     0.2293        0.3027
## Balanced Accuracy       0.9391      0.9058     0.8683        0.9749
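As a cross-check, the main statistics above can be recomputed from the confusion matrix counts alone. The sketch below uses only base R and the standard one-vs-rest definitions. It is not the code that produced the report, and the variable names are mine. Under those definitions the values agree with the printed output to four decimals.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = reference)
classes <- c("AD", "LMCI", "MCI", "Normal")
cm <- matrix(c(112,   9,  16,   0,
                 3, 182,  24,   5,
                 7,  12, 151,   2,
                 0,   6,   4, 217),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)          # 750 held-out subjects
tp <- diag(cm)         # correctly classified, per class

accuracy    <- sum(tp) / n
sensitivity <- tp / colSums(cm)                     # TP / (TP + FN)
true_neg    <- n - rowSums(cm) - colSums(cm) + tp
specificity <- true_neg / (n - colSums(cm))         # TN / (TN + FP)
ppv         <- tp / rowSums(cm)                     # Pos Pred Value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginals alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = kappa), 4))
print(round(rbind(Sensitivity = sensitivity, Specificity = specificity,
                  PosPredValue = ppv, BalancedAccuracy = balanced), 4))
```

The MCI class has the lowest sensitivity (151 of 195 reference cases), with most of its misses predicted as LMCI (24) or AD (16).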