This is a summary of a set of 30 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
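To make the sampling arguments concrete, here is a minimal sketch of how a single job might draw its feature and subject subsets from them. The helper `sample_job` is hypothetical and not taken from the CBDA code. It assumes the sampling percentages are drawn uniformly within each range. How misValperc is applied in the actual pipeline is not stated here, so the sketch simply blanks that percentage of cells in the sampled block.

```r
# Hypothetical helper (not part of the CBDA code base): draws the feature and
# subject subsets for one job from the experiment's sampling arguments.
sample_job <- function(X, Kcol_min, Kcol_max, Nrow_min, Nrow_max, misValperc = 0) {
  n <- nrow(X)
  p <- ncol(X)
  # Assumption: the sampling percentage is uniform within each range.
  # A degenerate range such as 100-100 always yields 100%.
  kcol_pct <- runif(1, Kcol_min, Kcol_max)
  nrow_pct <- runif(1, Nrow_min, Nrow_max)
  k <- max(1, round(p * kcol_pct / 100))  # number of features (FSR)
  m <- max(1, round(n * nrow_pct / 100))  # number of subjects (SSR)
  cols <- sort(sample.int(p, k))
  rows <- sort(sample.int(n, m))
  Xs <- X[rows, cols, drop = FALSE]
  # Assumption: missing values are placed completely at random in the sampled block.
  n_missing <- round(length(Xs) * misValperc / 100)
  if (n_missing > 0) Xs[sample.int(length(Xs), n_missing)] <- NA
  list(rows = rows, cols = cols, X = Xs)
}

set.seed(1)
X_demo <- matrix(rnorm(300 * 100), nrow = 300, ncol = 100)
job <- sample_job(X_demo, Kcol_min = 5, Kcol_max = 15,
                  Nrow_min = 60, Nrow_max = 80, misValperc = 20)
dim(job$X)  # between 180 and 240 rows, between 5 and 15 columns
```

With 300 subjects and 100 features, an FSR of 5-15% gives 5 to 15 features per job and an SSR of 60-80% gives 180 to 240 subjects per job.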
This document reports the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and GitHub https://github.com/SOCR/CBDA for the code [still in progress]. The test dataset is defined below:
# Problem parameters
n = 300 # number of observations
p = 100 # number of variables
nonzero = c(1,seq(10,p,10)) # variables with nonzero coefficients (fixed locations)
k = length(nonzero) # number of variables with nonzero coefficients
amplitude = 3.5 # signal amplitude (for noise level = 1)
X1 = matrix(rnorm(n*p), nrow=n, ncol=p)
beta = amplitude * (1:p %in% nonzero) # coefficients: 3.5 for the nonzero variables, 0 otherwise
ztemp <- function() X1 %*% beta + rnorm(n) # linear predictor plus standard normal noise
z = ztemp()
pr = 1/(1+exp(-z)) # pass through an inv-logit function
Ytemp = rbinom(n,1,pr) # Bernoulli response variable
X2 <- cbind(Ytemp,X1)
# Here I write the data in a text file [not executed]
#write.table(X2,"C:/Users/simeonem/Documents/CBDA-SL/Cranium/Binomial_dataset.txt",sep=",")
# Here I load the dataset [not executed]
#Binomial_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/Binomial_dataset.txt",header = TRUE)
# Here the X and Y matrix/vector are set for the CBDA-SL algorithm to proceed [not executed]
#Ytemp <- Binomial_dataset[,1]
#Xtemp <- Binomial_dataset[,-1]
Thus, the features that both the knockoff filter and the CBDA-SL algorithm should extract are 1, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. That translates into spikes at these locations in the histograms shown below. The False Discovery Rates (FDRs) I list are only indicative: they depend on how many of the top selected features I list, currently set to 11 so that we can check whether all 11 nonzero features are selected. For example, with 11 true features and a top-11 list, missing 1 out of 11 returns an FDR of 9.091%, and missing 2 out of 11 returns an FDR of 18.182%.
Overall, the knockoff filter performs very well (this example is ad hoc for it). However, CBDA-SL seems to perform well too. The strength of CBDA-SL is that a potentially unlimited list of “learners” can be gradually built into it, so that it eventually returns the best predictions. For now the list is short (with some issues regarding the simultaneous use of GAM and BartMachine). I am now working on generating the same type of results with a NULL dataset (binomial outcome).
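The FDR arithmetic described above can be sketched as follows. The helper `top_list_fdr` is a hypothetical name, not the function used to produce the reported numbers. It computes the percentage of listed top features that are not among the true nonzero features, and is applied here to the two top-11 lists reported for Experiment 1.

```r
# True signal features from the simulated dataset
nonzero <- c(1, seq(10, 100, 10))

# Hypothetical helper: percentage of listed features that are not true signals
top_list_fdr <- function(selected, truth) {
  100 * sum(!(selected %in% truth)) / length(selected)
}

# Top-11 features listed for Experiment 1
cbda_top     <- c(40, 70, 1, 10, 90, 80, 24, 100, 30, 75, 89)
knockoff_top <- c(80, 70, 1, 50, 90, 100, 40, 10, 30, 24, 20)

round(top_list_fdr(cbda_top, nonzero), 3)      # 27.273 (24, 75 and 89 are false)
round(top_list_fdr(knockoff_top, nonzero), 3)  # 9.091 (24 is false)
```

Because the list length equals the number of true features, every false feature in the list also means one true feature is missing, which is why the text can describe the same quantity in terms of missed features.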
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   40        18 3.169014       80  7.891930
##   70        17 2.992958       70  6.754355
##    1        16 2.816901        1  6.683256
##   10        15 2.640845       50  6.280365
##   90        14 2.464789       90  5.699727
##   80        13 2.288732      100  4.597701
##   24        11 1.936620       40  4.538452
##  100        11 1.936620       10  4.230359
##   30        10 1.760563       30  3.282379
##   75        10 1.760563       24  2.630643
##   89        10 1.760563       20  2.180353
## [1] False Discovery Rate for CBDA = 27.273 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 2
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000         20        5       15       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   80        20 3.795066       80  7.168644
##    1        15 2.846300        1  6.059825
##   40        13 2.466793       70  5.840640
##   70        12 2.277040       50  5.711707
##   50        11 2.087287       90  5.312017
##   90        10 1.897533      100  4.396596
##  100        10 1.897533       40  3.700361
##   48         9 1.707780       10  3.610108
##   59         9 1.707780       30  2.514183
##   91         9 1.707780       20  2.385250
##    4         8 1.518027       24  2.140278
## [1] False Discovery Rate for CBDA = 36.364 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 3
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       15      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   70        23 3.872054       80  9.559905
##   80        18 3.030303       70  8.937663
##   40        15 2.525253        1  8.383301
##   90        15 2.525253       50  7.738432
##    1        13 2.188552       90  7.523476
##   10        13 2.188552      100  6.188483
##   20        11 1.851852       40  6.007467
##   50        11 1.851852       10  5.475733
##  100        11 1.851852       30  4.355696
##   28        10 1.683502       20  3.586379
##   92        10 1.683502       24  3.371422
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 4
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000         20        5       15      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   70        22 3.971119       80  9.554719
##   80        20 3.610108        1  8.775802
##   50        15 2.707581       70  8.373361
##    1        13 2.346570       50  7.568480
##  100        12 2.166065       90  6.919382
##   10        11 1.985560      100  5.984681
##   90        11 1.985560       40  5.322602
##   20        10 1.805054       10  4.569648
##   19         9 1.624549       30  3.518110
##   21         9 1.624549       24  2.907958
##   30         9 1.624549       20  2.544463
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 5
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##    1        25 2.090301       80 10.870180
##   90        24 2.006689        1  9.974545
##   50        23 1.923077       70  9.606863
##   80        23 1.923077       50  8.343547
##   10        20 1.672241       90  7.334779
##   70        20 1.672241       40  6.269445
##   59        18 1.505017      100  5.392665
##   24        17 1.421405       10  4.732724
##   40        17 1.421405       30  3.808806
##   56        17 1.421405       20  2.592628
##   66        17 1.421405       24  2.573772
## [1] False Discovery Rate for CBDA = 36.364 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 6
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000         20       15       30       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   50        27 2.213115       80 10.103186
##   80        23 1.885246        1  9.230071
##   90        22 1.803279       70  8.878558
##   70        21 1.721311       50  8.118834
##    1        20 1.639344       90  6.984919
##   84        20 1.639344      100  4.955210
##   40        18 1.475410       40  4.943871
##   43        17 1.393443       10  3.832634
##   10        16 1.311475       30  2.789432
##   11        16 1.311475       20  2.392562
##   61        16 1.311475       24  2.347205
## [1] False Discovery Rate for CBDA = 36.364 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 7
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       15       30      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   70        30 2.500000       80 13.024079
##   50        25 2.083333        1 12.362489
##   80        25 2.083333       70 11.831679
##   10        24 2.000000       50  9.993076
##   20        22 1.833333       90  9.723825
##    1        20 1.666667       40  7.200554
##   40        20 1.666667      100  6.808216
##   90        20 1.666667       10  5.985076
##   30        18 1.500000       30  4.308024
##   76        18 1.500000       24  2.961766
##   14        16 1.333333       20  2.930995
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 8
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000         20       15       30      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   80        30 2.394254       80 13.319157
##   40        28 2.234637        1 12.130292
##    1        24 1.915403       70 11.763000
##   10        22 1.755786       50  9.945873
##   20        21 1.675978       90  8.863329
##   70        21 1.675978      100  6.630582
##   90        21 1.675978       40  6.553257
##  100        20 1.596169       10  4.977769
##   50        19 1.516361       30  3.363619
##   15        18 1.436552       24  2.754688
##   84        18 1.436552       20  2.619370
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 9
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       30       50       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   90        35 1.684312       80 14.038589
##   10        34 1.636189        1 11.543580
##   40        34 1.636189       70 10.919827
##   70        34 1.636189       50  9.672322
##    1        33 1.588065       90  8.366600
##   50        32 1.539942       40  6.803061
##   30        31 1.491819      100  5.738523
##   80        29 1.395573       10  4.690619
##   92        29 1.395573       30  3.468064
##  100        29 1.395573       20  2.544910
##   37        26 1.251203       24  2.495010
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 10
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000         20       30       50       60       80
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   80        35 1.637044       80 13.468694
##   40        31 1.449953        1 11.597746
##   50        30 1.403181       70  9.992559
##   10        29 1.356408       50  8.780695
##   24        29 1.356408       90  7.473158
##   70        29 1.356408       40  6.197512
##   79        28 1.309635      100  5.644733
##   63        27 1.262862       10  3.975763
##    8        26 1.216090       30  2.774530
##   44        26 1.216090       24  2.243011
##   47        26 1.216090       20  1.998512
## [1] False Discovery Rate for CBDA = 54.545 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 11
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0       30       50      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   80        37 1.726552       80 15.113763
##    1        35 1.633224        1 14.335848
##   70        35 1.633224       70 13.417559
##   90        35 1.633224       50 11.703808
##   10        33 1.539897       90 10.048547
##   30        32 1.493234       40  7.322922
##   40        32 1.493234      100  7.018775
##   50        30 1.399907       10  5.217290
##  100        30 1.399907       30  4.252208
##   75        28 1.306580       20  2.456571
##   79        28 1.306580       24  2.333743
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %
##
## [1] EXPERIMENT 12
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000         20       30       50      100      100
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency  Density Knockoff   Density
##   50        34 1.619819       80 16.449023
##   40        33 1.572177        1 13.885960
##    1        32 1.524535       70 13.009976
##   10        32 1.524535       50 11.185011
##   70        31 1.476894       90  9.343824
##   80        31 1.476894      100  6.821316
##   60        30 1.429252       40  6.813205
##  100        30 1.429252       10  4.817909
##   30        29 1.381610       30  3.236272
##   77        29 1.381610       24  2.214292
##   59        27 1.286327       20  2.035850
## [1] False Discovery Rate for CBDA = 18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER = 9.091 %