This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that runs 3000 independent jobs, each applying both the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
This document reports the final results, by experiment. In this case, the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project, and GitHub (https://github.com/SOCR/CBDA) for some of the code [still in progress].
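To make the six input arguments concrete, here is a minimal sketch of how a single job might draw its feature and subject subsample from the FSR and SSR ranges. The helper `draw_job_sample`, the uniform draw within each range, and the dataset size (500 subjects, 60 features) are assumptions for illustration only; they are not taken from the CBDA-SL code.

```r
# Experiment 1 settings, copied from the parameter table in this report.
experiment <- list(M = 9000, misValperc = 0,
                   Kcol_min = 5, Kcol_max = 15,
                   Nrow_min = 60, Nrow_max = 80)

# Hypothetical helper (not part of the CBDA code base): draw one job's subsample.
draw_job_sample <- function(n_subjects, n_features, spec) {
  # Assumption: the sampled fraction is uniform within each [min, max] % range.
  k_frac <- runif(1, spec$Kcol_min / 100, spec$Kcol_max / 100)  # FSR
  n_frac <- runif(1, spec$Nrow_min / 100, spec$Nrow_max / 100)  # SSR
  list(
    features = sort(sample(seq_len(n_features), round(n_features * k_frac))),
    subjects = sort(sample(seq_len(n_subjects), round(n_subjects * n_frac)))
  )
}

set.seed(1)
# Assumed dataset size, for illustration only.
job <- draw_job_sample(n_subjects = 500, n_features = 60, spec = experiment)
length(job$features)  # between 3 and 9 of the 60 features
length(job$subjects)  # between 300 and 400 of the 500 subjects
```

With these settings, each job sees only 5% to 15% of the features and 60% to 80% of the subjects, so features are ranked across many small subsamples.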
# # Here I load the dataset [not executed]
# ADNI_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/ADNI_dataset.txt",header = TRUE)
Features selected by both the knockoff filter and the CBDA-SL algorithms appear as spikes in the histograms below. No False Discovery Rates are shown, since we have no information on the “true” features. The top selected features are listed; the number of top features is set to 20 here.
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff  Density
##   20        10 2.985075        7 2.243323
##   60         9 2.686567        6 2.176855
##    1         8 2.388060        2 2.167359
##   29         8 2.388060        4 2.136499
##   31         8 2.388060        1 2.091395
##   37         8 2.388060        9 2.091395
##   41         8 2.388060        3 2.058160
##   44         8 2.388060       10 2.053412
##   56         8 2.388060        8 2.024926
##    2         7 2.089552        5 1.984570
##   10         7 2.089552       59 1.984570
##   19         7 2.089552       42 1.956083
##   23         7 2.089552       60 1.948961
##   54         7 2.089552       51 1.846884
##   55         7 2.089552       39 1.832641
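The CBDA Density values in the table above are consistent with each Frequency expressed as a percentage of 335 total selections. The report does not state this total; 335 is inferred from the printed numbers and should be treated as an assumption. A minimal sketch that reproduces the column:

```r
# Top 15 CBDA-SL features and their selection counts, copied from the table above.
cbda <- data.frame(
  feature   = c(20, 60, 1, 29, 31, 37, 41, 44, 56, 2, 10, 19, 23, 54, 55),
  frequency = c(10, 9, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7)
)

# Assumption: 335 selections in total across all features
# (inferred from the printed densities, not stated in the report).
total_selections <- 335

# Density as a percentage of all selections.
cbda$density <- round(100 * cbda$frequency / total_selections, 6)
print(cbda)
```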
## [1] "Top Features Selected across multiple experiments,shared between CBDA-SL and Knockoff filter"
## [1] 1 2 10
## [1] "subjectSex" "MMSCORE" "Background"
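For a single experiment, the overlap between the two top-feature lists can be computed with `intersect()`. This is a minimal sketch using only the feature indices printed in the Experiment 1 table; the variable names are illustrative and not taken from the CBDA code.

```r
# Feature indices copied from the Experiment 1 table (top 15 per method).
cbda_top     <- c(20, 60, 1, 29, 31, 37, 41, 44, 56, 2, 10, 19, 23, 54, 55)
knockoff_top <- c(7, 6, 2, 4, 1, 9, 3, 10, 8, 5, 59, 42, 60, 51, 39)

# Features ranked in the top 15 by both methods in this experiment.
shared <- sort(intersect(cbda_top, knockoff_top))
print(shared)  # 1 2 10 60
```

Feature 60 is in both Experiment 1 lists but not in the list above, which is described as shared across multiple experiments; presumably it was not shared in the other experiments.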
The features listed above are then used to run a final analysis applying both the CBDA-SL and the knockoff filter. The ONLY features used for this analysis are the ones listed above. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the first stage combines the CBDA-SL and knockoff filter algorithms to select the top features, and the second stage uses the top common features to run a final predictive modeling step that can ultimately be tested for accuracy, sensitivity, and so on.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  AD LMCI MCI Normal
##     AD     105   54  91      7
##     LMCI     6   70  37     31
##     MCI     10   20  16      3
##     Normal   1   65  51    183
##
## Overall Statistics
##
## Accuracy : 0.4987
## 95% CI : (0.4623, 0.5351)
## No Information Rate : 0.2987
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3354
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: AD Class: LMCI Class: MCI Class: Normal
## Sensitivity             0.8607     0.33493    0.08205        0.8170
## Specificity             0.7580     0.86322    0.94054        0.7776
## Pos Pred Value          0.4086     0.48611    0.32653        0.6100
## Neg Pred Value          0.9655     0.77063    0.74465        0.9089
## Prevalence              0.1627     0.27867    0.26000        0.2987
## Detection Rate          0.1400     0.09333    0.02133        0.2440
## Detection Prevalence    0.3427     0.19200    0.06533        0.4000
## Balanced Accuracy       0.8093     0.59907    0.51130        0.7973
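The main statistics above can be recomputed directly from the confusion matrix counts. The sketch below uses base R only and copies the counts from the matrix above. It is not the code that produced this report, and the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
classes <- c("AD", "LMCI", "MCI", "Normal")
cm <- matrix(c(105, 54, 91,   7,
                 6, 70, 37,  31,
                10, 20, 16,   3,
                 1, 65, 51, 183),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)                      # 750 held-out subjects
tp <- diag(cm)                     # correct predictions per class

accuracy    <- sum(tp) / n
prevalence  <- colSums(cm) / n
nir         <- max(prevalence)     # no information rate: always predict the largest class
sensitivity <- tp / colSums(cm)
tn          <- n - rowSums(cm) - colSums(cm) + tp
specificity <- tn / (n - colSums(cm))
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement versus agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, kappa = kappa), 4)
round(rbind(sensitivity, specificity, balanced), 4)
```

Read this way, the low overall accuracy is driven mainly by the MCI class (sensitivity 0.08) and the LMCI class (sensitivity 0.33), while AD and Normal are each recovered more than 80% of the time.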