This is a summary of a set of 9 experiments I ran on Cranium using a single pipe workflow file that performs 3000 independent jobs, each one with the CBDA-SL and the knockoff filter feature mining strategies. Each experiment has a total of 9000 jobs and is uniquely identified by 6 input arguments: number of jobs [M], % of missing values [misValperc], min [Kcol_min] and max [Kcol_max] % for the Feature Sampling Range (FSR), and min [Nrow_min] and max [Nrow_max] % for the Subject Sampling Range (SSR).
This document reports the final results, by experiment. In this case the CBDA-SuperLearner has been adapted to a multinomial outcome distribution. See https://drive.google.com/file/d/0B5sz_T_1CNJQWmlsRTZEcjBEOEk/view?ths=true for general documentation of the CBDA-SL project and the GitHub repository https://github.com/SOCR/CBDA for some of the code [still in progress].
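As a rough illustration of how the six arguments could drive a single job, the sketch below draws one feature subset and one subject subset from the FSR and SSR ranges. The function `draw_job_sample`, the uniform draw of each percentage, and the dataset dimensions are all assumptions made for this example; they are not taken from the CBDA-SL code. The parameter values are the ones printed for Experiment 1.

```r
# Hypothetical experiment specification; the field names mirror the six
# arguments in the text and the values are those printed for Experiment 1.
spec <- list(M = 9000, misValperc = 0,
             Kcol_min = 1, Kcol_max = 5,
             Nrow_min = 30, Nrow_max = 40)

# Draw the feature (FSR) and subject (SSR) subsets for a single job.
# ASSUMPTION: each percentage is drawn uniformly within its [min, max] range.
draw_job_sample <- function(n_subjects, n_features, spec) {
  k_perc <- runif(1, spec$Kcol_min, spec$Kcol_max)
  n_perc <- runif(1, spec$Nrow_min, spec$Nrow_max)
  k <- max(1, round(n_features * k_perc / 100))
  n <- max(1, round(n_subjects * n_perc / 100))
  list(features = sort(sample.int(n_features, k)),
       subjects = sort(sample.int(n_subjects, n)))
}

# Illustrative dimensions only (not the real dataset size).
set.seed(1)
job <- draw_job_sample(n_subjects = 1000, n_features = 2000, spec = spec)
length(job$features)  # between 1% and 5% of the 2000 features
length(job$subjects)  # between 30% and 40% of the 1000 subjects
```

Repeating such a draw M times would give M independent subsamples, each of which could then be passed to the two feature mining strategies.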
# Load the dataset [not executed]
# ABIDE_dataset = read.csv("C:/Users/simeonem/Documents/CBDA-SL/ExperimentsNov2016/ABIDE/ABIDE_dataset.csv",header = TRUE)
Features selected by both the knockoff filter and the CBDA-SL algorithm appear as spikes in the histograms below. I list the top selected features; the number of top features is set to 15 here.
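The selection of common top features can be sketched in base R as follows. The input vectors are made-up toy values, `top_features` is a hypothetical helper, and the density column is assumed to be each feature's share of all selections expressed as a percentage; none of this is taken from the actual CBDA-SL scripts.

```r
# Toy selections standing in for the pooled output of all jobs (made-up values).
cbda_selected     <- c(3, 7, 7, 12, 7, 3, 20, 12, 7, 5)
knockoff_selected <- c(7, 7, 9, 12, 12, 12, 3, 9, 7, 7)

# Rank features by how often they were selected and keep the `top` most frequent.
# ASSUMPTION: density = 100 * frequency / total number of selections.
top_features <- function(selected, top = 15) {
  counts    <- table(selected)
  feature   <- as.integer(names(counts))
  frequency <- as.integer(counts)
  keep <- order(frequency, decreasing = TRUE)[seq_len(min(top, length(frequency)))]
  data.frame(feature   = feature[keep],
             frequency = frequency[keep],
             density   = 100 * frequency[keep] / length(selected))
}

cbda_top     <- top_features(cbda_selected, top = 3)
knockoff_top <- top_features(knockoff_selected, top = 3)

# Features ranked in the top set by both methods.
common <- intersect(cbda_top$feature, knockoff_top$feature)
```

With the real results, `top = 15` would match the setting used in this report, and `common` would hold the feature indices to carry into the final analysis.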
## Loading required package: lattice
## Loading required package: ggplot2
## [1] EXPERIMENT 1
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        1        5       30       40
## [1] 415 126 89 426 490 955 883 973 19 608 316 225 682 502
## [15] 102 79 90 987 404 480 422 242 688 13 77 175 235 103
## [29] 165 535 492 985 642 976 1038 487 425 387 75 1003 483 671
## [43] 172 1089 915 411 640 97 895 386 7 850 1008 876 53 576
## [57] 215 213 321 203 162 101 839 3 912 497 524 612 644 314
## [71] 125 78 398 1086 707 423 220 25 18 836 244 903 418 222
## [85] 507 396 659 983 397 701 207 427 390 159 209 178 395 468
## [99] 532 645 549 475 402 841 20 161 315 84 17 1030 1094 911
## [113] 234 837 95 9 702 681 885 711 479 494 542 908 890 664
## [127] 166 81 887 680 160 403 195 1092 894 163 673 614 466 471
## [141] 954 144 584 406 1096 92 496 1037 910 975 488 1004 619 951
## [155] 705 663 889 200 100 412 11 491 948 696 668 361 42 256
## [169] 964 742 57 258 630 34 328 519 756 255 803 722 181 105
## [183] 827 831 816 537 344 965 1052 929 327 932 804 284 139 553
## [197] 253 1057 442 277 257 1024 944 137 365 453 601 107 808 136
## [211] 287 303 510 447 450 567 273 941 730 1065 1062 943 70 28
## [225] 761 440 544 599 63 286 809 806 325 343 1000 1048 147 51
## [239] 934 918 997 824 1056 736 830 995 355 767 378 278 37 763
## [253] 622 738 1071 988 280 603 27 65 917 970 1050 275 279 632
## [267] 116 566 349 1066 922 69 452 594 733 152 331 58 117 593
## [281] 1027 251 518 746 737 801 1077 73 1068 1067 288 1072 68 143
## [295] 373 947 747 451 309 38 294 623 769 820 758 766 64 939
## [309] 457 772 1026 372 762 781 817 263 189 71 598 187 621 926
## [323] 1023 334 306 564 338 41 962 514
## [1] "thick_std_ctx_lh_G_Ins_lg_and_S_cent_ins"
## [2] "surf_area_ctx.lh.parstriangularis"
## [3] "gaus_curv_ctx_lh_G_front_sup"
## [4] "mean_curv_ctx_rh_G_front_inf.Triangul"
## [5] "surf_area_ctx.rh.isthmuscingulate"
## [6] "thick_std_ctx_rh_S_circular_insula_sup"
## [7] "surf_area_ctx_rh_S_orbital_lateral"
## [8] "volume_ctx.lh.superiorfrontal"
## [9] "volume_ctx_rh_G_Ins_lg_and_S_cent_ins"
## [10] "volume_ctx_rh_S_suborbital"
## [11] "surf_area_ctx_lh_G_and_S_cingul.Ant"
## [12] "curv_ind_ctx_lh_S_intrapariet_and_P_trans"
## [13] "mean_curv_ctx.lh.transversetemporal"
## [14] "surf_area_ctx.rh.fusiform"
## [15] "mean_curv_ctx_rh_G_and_S_cingul.Mid.Post"
## [16] "anat_fwhm"
## [17] "volume_CSF"
## [18] "volume_ctx.lh.entorhinal"
## [19] "volume_ctx.lh.precentral"
## [20] "volume_ctx.lh.rostralanteriorcingulate"
## k_top_50_temp
##  576  840  571 1312 1505 1703 1799  178  207  279  462 1000 1165 1211 1235
##    8    8    7    7    7    7    7    6    6    6    6    6    6    6    6
##   63   75   80  143  145
##    5    5    5    5    5
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff   Density
##  576         8 0.2213001       66 3.0058651
##  840         8 0.2213001      296 1.6862170
##  571         7 0.1936376      297 1.6862170
## 1312         7 0.1936376       26 1.6495601
## 1505         7 0.1936376      583 1.0997067
## 1703         7 0.1936376        1 0.9530792
## 1799         7 0.1936376      337 0.9530792
##  178         6 0.1659751      727 0.9164223
##  207         6 0.1659751     1612 0.9164223
##  279         6 0.1659751       25 0.8797654
##  462         6 0.1659751      334 0.8064516
## 1000         6 0.1659751     2043 0.8064516
## 1165         6 0.1659751      605 0.7697947
## 1211         6 0.1659751      983 0.7331378
## 1235         6 0.1659751     1458 0.7331378
## [1] EXPERIMENT 3
##    M misValperc Kcol_min Kcol_max Nrow_min Nrow_max
## 9000          0        5       10       55       65
## [1] 1096 487 842 583 843 320 666 500 894 506 319 479 1089 851
## [15] 891 649 231 434 396 474 534 656 670 156 576 160 171 169
## [29] 613 428 906 312 1029 607 890 700 653 981 323 482 841 889
## [43] 407 677 898 483 908 717 681 580 205 665 385 202 618 501
## [57] 980 85 608 225 1045 21 90 881 975 899 468 4 19 165
## [71] 384 837 654 673 540 77 195 238 846 1043 1041 1 393 527
## [85] 122 243 615 982 237 405 217 854 2 16 484 986 1039 120
## [99] 611 696 79 410 507 957 318 86 119 233 153 695 54 424
## [113] 239 158 201 644 1092 895 690 236 8 660 432 157 178 528
## [127] 124 75 420 240 12 905 658 910 542 711 423 244 640 1006
## [141] 164 391 686 480 1034 127 716 408 655 317 418 669 83 193
## [155] 430 584 467 1005 699 1007 3 1010 414 620 502 1069 1013 808
## [169] 596 856 191 183 561 294 592 444 291 553 309 960 971 1079
## [183] 447 30 142 250 924 264 179 1028 1080 857 871 784 29 141
## [197] 365 873 795 921 529 269 140 110 593 810 797 280 940 44
## [211] 796 42 117 282 552 184 832 733 188 182 741 780 556 180
## [225] 438 595 379 807 545 996 257 740 150 963 293 375 65 935
## [239] 947 296 306 863 1071 548 970 114 108 560 1068 794 1055 634
## [253] 772 766 989 1017 441 633 587 868 566 997 132 60 56 782
## [267] 1048 285 920 460 133 585 789 519 303 1023 1053 255 727 586
## [281] 302 278 803 256 525 148 745 299 1070 367 356 344 967 292
## [295] 725 817 756 597 72 1000 769 937 801 555 874 824 252 728
## [309] 554 599 748 536 286 872 861 305 859 916 1078 860 746 332
## [323] 774 1011 1024 729 448 454 297 48
## [1] "R_superior_temporal_gyrus"
## [2] "volume_Left.Hippocampus"
## [3] "curv_ind_ctx_lh_G_insular_short"
## [4] "volume_ctx_lh_S_precentral.sup.part"
## [5] "surf_area_ctx.rh.parsopercularis"
## [6] "gaus_curv_lh_BA4p"
## [7] "anat_efc"
## [8] "volume_ctx.lh.rostralanteriorcingulate"
## [9] "volume_Left.Accumbens.area"
## [10] "volume_lh_BA45"
## [11] "fold_ind_ctx.lh.cuneus"
## [12] "mean_curv_ctx_lh_G_front_inf.Opercular"
## [13] "fold_ind_ctx_lh_G_oc.temp_lat.fusifor"
## [14] "thick_std_ctx_lh_G_precentral"
## [15] "fold_ind_ctx_lh_G_temp_sup.G_T_transv"
## [16] "thick_avg_ctx.lh.isthmuscingulate"
## [17] "mean_curv_ctx_lh_Lat_Fis.ant.Vertical"
## [18] "thick_avg_ctx.lh.middletemporal"
## [19] "surf_area_ctx.lh.parsorbitalis"
## [20] "thick_avg_ctx_lh_S_collat_transv_ant"
## k_top_50_temp
##   36  295  587  172 1582 1978   61  145  290  308  439  542  607  660  705
##   12   12   12   10   10   10    9    9    9    9    9    9    9    9    9
##  750  780  806  833  953
##    9    9    9    9    9
## [1] "TABLE with CBDA-SL & KNOCKOFF FILTER RESULTS"
## CBDA Frequency   Density Knockoff  Density
##   36        12 0.1504891       66 6.787933
##  295        12 0.1504891      296 3.859805
##  587        12 0.1504891       26 3.327418
##  172        10 0.1254076      297 2.617569
## 1582        10 0.1254076      727 2.307010
## 1978        10 0.1254076        1 2.151730
##   61         9 0.1128668     2043 1.974268
##  145         9 0.1128668      583 1.708075
##  290         9 0.1128668     2081 1.663709
##  308         9 0.1128668     1081 1.574978
##  439         9 0.1128668      337 1.419698
##  542         9 0.1128668      738 1.419698
##  607         9 0.1128668      710 1.375333
##  660         9 0.1128668     2040 1.353150
##  705         9 0.1128668     1152 1.286602
The features listed above are then used to run a final analysis that applies both the CBDA-SL and the knockoff filter; these are the ONLY features used in that analysis. A final summary of the accuracy of the overall procedure is obtained by applying the CBDA-SL object to the subset of subjects held out for prediction. The predictions (SL_Pred_Combined) are then used to generate the confusion matrix. In this way, the CBDA-SL and knockoff filter algorithms are combined in two stages: the first stage selects the top features, and the second stage uses the top common features to fit a final predictive model that can ultimately be tested for accuracy, sensitivity,…..
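A minimal base R sketch of the confusion-matrix step, for a multinomial outcome, is given below. The labels are made up, `sl_pred_combined` only stands in for the SL_Pred_Combined predictions mentioned above, and the per-class sensitivity is computed one-vs-rest; the report itself may have used a dedicated package for this step.

```r
# Made-up three-class labels for held-out subjects (illustration only).
classes <- c("A", "B", "C")
truth <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
                levels = classes)
# Stand-in for SL_Pred_Combined: predicted class for each held-out subject.
sl_pred_combined <- factor(c("A", "A", "B", "B", "B", "C", "C", "C", "C", "A"),
                           levels = classes)

# Confusion matrix: rows are predicted classes, columns are actual classes.
conf_mat <- table(Predicted = sl_pred_combined, Actual = truth)

# Overall accuracy: share of subjects on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class sensitivity (recall): correct predictions over actual class size.
sensitivity <- diag(conf_mat) / colSums(conf_mat)

print(conf_mat)
print(accuracy)
print(sensitivity)
```

Fixing the factor levels up front keeps the matrix square even if a class is never predicted, which is what makes the diagonal meaningful.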