The MWAS package offers multiple ways to learn a predictive model from a training set and to output the trained model for further use.
< … SPACE for more detailed information … >
The raw input data may require preprocessing before the relationships in the data can be modeled appropriately. The function preprocess.mwas provides the following operations (Table 1: Preprocessing options); a usage sketch follows the list:
Remove non-overlapping samples across multiple input tables.
Convert OTU abundance from absolute to relative values, so that the OTU abundances of each sample sum to 1 (if the option is.relative.conversion=TRUE, or the -r option is on, which sets suppress_relative_abundance_conversion=FALSE).
Remove rare features (OTUs) whose mean abundance is smaller than the given threshold (if the option -p or min_prevalence is given a value).
Apply one of three data transformations: asin_sqrt, norm_asin_sqrt, or none (option transform_type or -t).
Collapse OTUs in the OTU table that are correlated at 0.95 or above (if the option is.collapse or -b is on).
Filter the lineage (KEGG) table (if the option is.filter.kegg or -K is on; used in the customized heatmap).
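For orientation, here is a minimal sketch of a preprocessing call. The argument names mirror the options listed above, but the exact signature of preprocess.mwas is an assumption and may differ from the package's actual interface; otu_table is a hypothetical placeholder for your loaded OTU abundance table.

## minimal preprocessing sketch; argument names mirror the options above
## and are assumptions, not the confirmed preprocess.mwas signature
preprocessed <- preprocess.mwas(otu_table,
                                is.relative.conversion = TRUE,  # absolute -> relative abundance
                                min_prevalence = 0.001,         # drop OTUs with mean abundance below threshold (-p)
                                transform_type = "asin_sqrt",   # one of asin_sqrt, norm_asin_sqrt, none (-t)
                                is.collapse = TRUE)             # collapse OTUs correlated at 0.95 (-b)
## for reference, the asin_sqrt transform applied to a relative
## abundance x in [0, 1] is asin(sqrt(x))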
Table 1: Preprocessing options
Command-line version: All learn-specific options are capital letters, except the input options shared with other modules (shown in blue in Table 2). Examples follow in the next few sections, using the data in the directory test/data/.
Table 2: learn options
Example: SVM with a linear kernel
Rscript bin/mwas_analysis.R -w learn -M SVM -C linear -i data/taxa/GG_100nt_even10k-adults_L7.biom -m data/gg-map-adults.txt -o example/svm_output -c COUNTRY -f -v FDR -s 0.05
-w: analysis mode (learn)
-M: classifier type
-C: kernel type for SVM
-i: input file
-m: map file
-c: category name
-o: output directory
-f: perform feature selection
-v: feature selection method (FDR or RF; see the conceptual sketch below)
-s: threshold for feature selection (determines the number of features)
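For readers unfamiliar with FDR-based feature selection, the following conceptual sketch (standard R, not MWAS internals) illustrates the idea behind -v FDR -s 0.05: test each feature against the category, adjust the p-values for multiple testing, and keep only features below the threshold. Here otus and labels are hypothetical placeholders for the abundance matrix and the category vector.

## conceptual sketch of FDR feature selection (not MWAS internals):
## one Kruskal-Wallis test per feature, then keep features whose
## FDR-adjusted p-value falls below the -s threshold
pvals <- apply(otus, 2, function(x) kruskal.test(x, labels)$p.value)
keep  <- which(p.adjust(pvals, method = "fdr") < 0.05)
otus_selected <- otus[, keep]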
If you are familiar with R, you can manipulate your data in a more flexible way. Here is the same example as shown in the command-line version.
## option initialization
opts <- list()
opts$mode <- "learn"           # analysis mode (-w)
opts$method <- "SVM"           # classifier type (-M)
opts$input_fp <- "data/taxa/GG_100nt_even10k-adults_L7.biom"  # input file (-i)
opts$map_fp <- "data/gg-map-adults.txt"  # map file (-m)
opts$category <- "COUNTRY"     # category name (-c)
opts$outdir <- "example/svm_learn"  # output directory (-o)
opts$nfolds <- 5               # number of cross-validation folds
opts$method_param <- "linear"  # kernel type for SVM (-C)
opts$ftMethod <- "FDR"         # feature selection method (-v)
opts$is_feat <- TRUE           # perform feature selection (-f)
opts$feat_param <- 0.05        # feature selection threshold (-s)
## build the training parameters from the options and train the model
train_params <- import.train.params(opts)
best_model <- train.mwas(train_params)
The above steps are exactly equivalent to the command-line version. Alternatively, you can call the inner functions directly rather than the wrapper functions.
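Because the point of the learn mode is to output the trained model for further use, one generic way to persist it from an R session is base-R serialization (a sketch, not an MWAS-specific interface; the command-line version writes its output to the directory given by -o):

## persist the trained model with base-R serialization (generic sketch,
## not an MWAS-specific interface)
saveRDS(best_model, file = "example/svm_learn/best_model.rds")
## reload it later for prediction or inspection
best_model <- readRDS("example/svm_learn/best_model.rds")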