Some useful information

This is a summary of a set of 30 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the FSR (Feature Sampling Range), and min [Nrow_min] and max [Nrow_max] % for the SSR (Subject Sampling Range).
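
To make the two sampling ranges concrete, here is a minimal sketch of how one job's subsample could be drawn. It assumes that each job picks a uniform random percentage of features within [Kcol_min, Kcol_max] and of subjects within [Nrow_min, Nrow_max]; the helper `sample_job` is hypothetical and is not taken from the CBDA code base, whose actual sampling scheme may differ.

```r
# Hypothetical helper (an assumption, not the CBDA implementation):
# draw the feature and subject indices used by a single job.
sample_job <- function(n, p, Kcol_min, Kcol_max, Nrow_min, Nrow_max) {
  # Percentages drawn uniformly within the FSR and SSR ranges
  k_frac <- runif(1, Kcol_min, Kcol_max) / 100
  n_frac <- runif(1, Nrow_min, Nrow_max) / 100
  # Convert to counts, keeping at least one column and one row
  k <- max(1, round(k_frac * p))
  m <- max(1, round(n_frac * n))
  list(cols = sort(sample.int(p, k)), rows = sort(sample.int(n, m)))
}

set.seed(1)
job <- sample_job(n = 300, p = 100,
                  Kcol_min = 5, Kcol_max = 15,
                  Nrow_min = 60, Nrow_max = 80)
length(job$cols)  # number of features sampled for this job
length(job$rows)  # number of subjects sampled for this job
```

With Nrow_min = Nrow_max = 100, as in some of the experiments, this sketch would simply keep all subjects.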

This document has the final results, by experiment. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for some general documentation of the CBDA-SL project and GitHub https://github.com/SOCR/CBDA for the code [still in progress]. The test dataset is defined below:

# Problem parameters
n = 300          # number of observations
p = 100          # number of variables
nonzero = c(1, seq(10, p, 10))  # variables with nonzero coefficients (fixed locations)
k = length(nonzero)      # number of variables with nonzero coefficients
amplitude = 3.5  # signal amplitude (for noise level = 1)

X1 = matrix(rnorm(n*p), nrow=n, ncol=p) 
beta = amplitude * (1:p %in% nonzero)  # coefficients: 3.5 at the nonzero locations, 0 elsewhere
ztemp <- function() X1 %*% beta + rnorm(n) # linear predictor plus standard normal noise
z = ztemp()
pr = 1/(1+exp(-z))         # pass through an inv-logit function
Ytemp = rbinom(n,1,pr)    # bernoulli response variable
X2 <- cbind(Ytemp,X1)
# Here I write the data in a text file [not executed]
#write.table(X2,"C:/Users/simeonem/Documents/CBDA-SL/Cranium/Binomial_dataset.txt",sep=",")
# Here I load the dataset [not executed]
#Binomial_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/Cranium/Binomial_dataset.txt",header = TRUE)
# Here the X and Y matrix/vector are set for the CBDA-SL algorithm to proceed [not executed]
#Ytemp <- Binomial_dataset[,1]
#Xtemp <- Binomial_dataset[,-1]

Thus, the features that both the knockoff filter and the CBDA-SL algorithm should extract are 1, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. That translates into spikes at these locations in the histograms shown below. I also list the False Discovery Rates, but only as an illustration: each FDR depends on how many top-ranked features I list (currently 11, so we can check whether all 11 nonzero features are selected). For example, with 11 true features and the top 11 listed, missing 1 of the 11 gives an FDR of 9.091% (1/11), and missing 2 gives 18.182% (2/11). Overall, the knockoff filter performs very well (this example is tailored to it), and CBDA-SL also performs reasonably well. The strength of CBDA-SL is that a potentially unlimited list of “learners” can be gradually built into it, which should eventually yield the best predictions. For now the list is short (with some issues regarding the simultaneous use of GAM and BartMachine). I am now working on generating the same type of results with a NULL dataset (binomial outcome).
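
The FDR arithmetic described above can be sketched as follows. This is only an illustration: the helper `fdr_top` is a hypothetical name, and the two feature lists are copied from the Experiment 1 table in this report.

```r
# True features of the simulated dataset: 1, 10, 20, ..., 100
true_features <- c(1, seq(10, 100, 10))

# Hypothetical helper: percentage of listed features that are not true features
fdr_top <- function(selected, truth) {
  100 * mean(!(selected %in% truth))
}

# Top 11 features listed for Experiment 1
top_cbda     <- c(40, 70, 1, 10, 90, 80, 24, 100, 30, 75, 89)
top_knockoff <- c(80, 70, 1, 50, 90, 100, 40, 10, 30, 24, 20)

round(fdr_top(top_cbda, true_features), 3)      # 27.273 (3 of 11 are not true features)
round(fdr_top(top_knockoff, true_features), 3)  # 9.091  (1 of 11 is not a true feature)
```

Because exactly 11 features are true and exactly 11 are listed, each false entry in the list corresponds to one true feature being missed, so the same percentage can be read either way.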

## [1] EXPERIMENT 1
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  40   18        3.169014  80      7.891930
##  70   17        2.992958  70      6.754355
##  1    16        2.816901   1      6.683256
##  10   15        2.640845  50      6.280365
##  90   14        2.464789  90      5.699727
##  80   13        2.288732 100      4.597701
##  24   11        1.936620  40      4.538452
##  100  11        1.936620  10      4.230359
##  30   10        1.760563  30      3.282379
##  75   10        1.760563  24      2.630643
##  89   10        1.760563  20      2.180353
## [1] False Discovery Rate for CBDA =  27.273 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 2
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000         20          5         15         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  80   20        3.795066  80      7.168644
##  1    15        2.846300   1      6.059825
##  40   13        2.466793  70      5.840640
##  70   12        2.277040  50      5.711707
##  50   11        2.087287  90      5.312017
##  90   10        1.897533 100      4.396596
##  100  10        1.897533  40      3.700361
##  48    9        1.707780  10      3.610108
##  59    9        1.707780  30      2.514183
##  91    9        1.707780  20      2.385250
##  4     8        1.518027  24      2.140278
## [1] False Discovery Rate for CBDA =  36.364 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 3
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0          5         15        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  70   23        3.872054  80      9.559905
##  80   18        3.030303  70      8.937663
##  40   15        2.525253   1      8.383301
##  90   15        2.525253  50      7.738432
##  1    13        2.188552  90      7.523476
##  10   13        2.188552 100      6.188483
##  20   11        1.851852  40      6.007467
##  50   11        1.851852  10      5.475733
##  100  11        1.851852  30      4.355696
##  28   10        1.683502  20      3.586379
##  92   10        1.683502  24      3.371422
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 4
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000         20          5         15        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density 
##  70   22        3.971119  80      9.554719
##  80   20        3.610108   1      8.775802
##  50   15        2.707581  70      8.373361
##  1    13        2.346570  50      7.568480
##  100  12        2.166065  90      6.919382
##  10   11        1.985560 100      5.984681
##  90   11        1.985560  40      5.322602
##  20   10        1.805054  10      4.569648
##  19    9        1.624549  30      3.518110
##  21    9        1.624549  24      2.907958
##  30    9        1.624549  20      2.544463
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 5
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  1    25        2.090301  80      10.870180
##  90   24        2.006689   1       9.974545
##  50   23        1.923077  70       9.606863
##  80   23        1.923077  50       8.343547
##  10   20        1.672241  90       7.334779
##  70   20        1.672241  40       6.269445
##  59   18        1.505017 100       5.392665
##  24   17        1.421405  10       4.732724
##  40   17        1.421405  30       3.808806
##  56   17        1.421405  20       2.592628
##  66   17        1.421405  24       2.573772
## [1] False Discovery Rate for CBDA =  36.364 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 6
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000         20         15         30         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  50   27        2.213115  80      10.103186
##  80   23        1.885246   1       9.230071
##  90   22        1.803279  70       8.878558
##  70   21        1.721311  50       8.118834
##  1    20        1.639344  90       6.984919
##  84   20        1.639344 100       4.955210
##  40   18        1.475410  40       4.943871
##  43   17        1.393443  10       3.832634
##  10   16        1.311475  30       2.789432
##  11   16        1.311475  20       2.392562
##  61   16        1.311475  24       2.347205
## [1] False Discovery Rate for CBDA =  36.364 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 7
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         15         30        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  70   30        2.500000  80      13.024079
##  50   25        2.083333   1      12.362489
##  80   25        2.083333  70      11.831679
##  10   24        2.000000  50       9.993076
##  20   22        1.833333  90       9.723825
##  1    20        1.666667  40       7.200554
##  40   20        1.666667 100       6.808216
##  90   20        1.666667  10       5.985076
##  30   18        1.500000  30       4.308024
##  76   18        1.500000  24       2.961766
##  14   16        1.333333  20       2.930995
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 8
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000         20         15         30        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  80   30        2.394254  80      13.319157
##  40   28        2.234637   1      12.130292
##  1    24        1.915403  70      11.763000
##  10   22        1.755786  50       9.945873
##  20   21        1.675978  90       8.863329
##  70   21        1.675978 100       6.630582
##  90   21        1.675978  40       6.553257
##  100  20        1.596169  10       4.977769
##  50   19        1.516361  30       3.363619
##  15   18        1.436552  24       2.754688
##  84   18        1.436552  20       2.619370
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 9
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         30         50         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  90   35        1.684312  80      14.038589
##  10   34        1.636189   1      11.543580
##  40   34        1.636189  70      10.919827
##  70   34        1.636189  50       9.672322
##  1    33        1.588065  90       8.366600
##  50   32        1.539942  40       6.803061
##  30   31        1.491819 100       5.738523
##  80   29        1.395573  10       4.690619
##  92   29        1.395573  30       3.468064
##  100  29        1.395573  20       2.544910
##  37   26        1.251203  24       2.495010
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 10
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000         20         30         50         60         80

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  80   35        1.637044  80      13.468694
##  40   31        1.449953   1      11.597746
##  50   30        1.403181  70       9.992559
##  10   29        1.356408  50       8.780695
##  24   29        1.356408  90       7.473158
##  70   29        1.356408  40       6.197512
##  79   28        1.309635 100       5.644733
##  63   27        1.262862  10       3.975763
##  8    26        1.216090  30       2.774530
##  44   26        1.216090  24       2.243011
##  47   26        1.216090  20       1.998512
## [1] False Discovery Rate for CBDA =  54.545 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 11
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000          0         30         50        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  80   37        1.726552  80      15.113763
##  1    35        1.633224   1      14.335848
##  70   35        1.633224  70      13.417559
##  90   35        1.633224  50      11.703808
##  10   33        1.539897  90      10.048547
##  30   32        1.493234  40       7.322922
##  40   32        1.493234 100       7.018775
##  50   30        1.399907  10       5.217290
##  100  30        1.399907  30       4.252208
##  75   28        1.306580  20       2.456571
##  79   28        1.306580  24       2.333743
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %
## 
## 
## 
## 
## 
## 
## [1] EXPERIMENT 12
##          M misValperc   Kcol_min   Kcol_max   Nrow_min   Nrow_max 
##       9000         20         30         50        100        100

## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
##  CBDA Frequency Density  Knockoff Density  
##  50   34        1.619819  80      16.449023
##  40   33        1.572177   1      13.885960
##  1    32        1.524535  70      13.009976
##  10   32        1.524535  50      11.185011
##  70   31        1.476894  90       9.343824
##  80   31        1.476894 100       6.821316
##  60   30        1.429252  40       6.813205
##  100  30        1.429252  10       4.817909
##  30   29        1.381610  30       3.236272
##  77   29        1.381610  24       2.214292
##  59   27        1.286327  20       2.035850
## [1] False Discovery Rate for CBDA =  18.182 %
## [1] False Discovery Rate for KNOCKOFF FILTER =  9.091 %