The MWAS package offers multiple ways to learn a predictive model from a training set and to output the trained model for further use.



Preprocessing Raw Input Data

The raw input data may need a preprocessing step before the relationships in it can be shown appropriately. The function preprocess.mwas provides the following options (Table 1: Preprocessing options):

Table 1: Preprocessing options
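As an illustration of what a preprocessing step can look like, here is a minimal sketch in base R. It does not use preprocess.mwas, and the steps it applies (relative abundance, a prevalence filter, a square-root transform) are assumptions chosen as common examples, not the options listed in Table 1. The helper name preprocess_sketch and its min_prevalence parameter are hypothetical.

```r
# Hypothetical helper for illustration only; not part of the MWAS API.
preprocess_sketch <- function(counts, min_prevalence = 0.1) {
  # counts: samples-by-taxa matrix of non-negative counts
  rel <- counts / rowSums(counts)                 # relative abundance per sample
  keep <- colMeans(counts > 0) >= min_prevalence  # drop rarely observed taxa
  sqrt(rel[, keep, drop = FALSE])                 # dampen dominant taxa
}

# Toy count table: 6 samples, 5 taxa, with one taxon never observed.
set.seed(1)
counts <- matrix(rpois(6 * 5, lambda = 20), nrow = 6,
                 dimnames = list(paste0("sample", 1:6), paste0("taxon", 1:5)))
counts[, 5] <- 0

x <- preprocess_sketch(counts)
print(dim(x))
```

The prevalence filter is computed on the raw counts so that the decision to keep a taxon does not depend on the normalization applied afterwards.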


Learn Options and Parameters

Command-line version: all visualization-related options use capital letters, except the input options shared with other modules (shown in blue in Table 2). The examples in the next few sections use the data in the directory test/data/.

Table 2: Learn options


Train an SVM model with a linear kernel

Rscript bin/mwas_analysis.R -w learn -M SVM -C linear -i data/taxa/GG_100nt_even10k-adults_L7.biom -m data/gg-map-adults.txt -o example/svm_output -c COUNTRY -f -v FDR -s 0.05

-w: learn mode
-M: classifier type
-C: kernel type for SVM
-i: input file
-m: mapfile
-c: category name
-o: output directory
-f: perform feature selection
-v: feature selection method: fdr or rf
-s: threshold for feature selection (determines the number of features)
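To show how a threshold such as -s 0.05 can determine the number of features, here is a minimal base R sketch of FDR-based feature selection on simulated data. It is an illustration under stated assumptions, not the MWAS implementation: the per-feature Kruskal-Wallis test, the toy data, and all variable names are choices made for this example.

```r
# Toy samples-by-features table with two groups of 20 samples each.
set.seed(1)
n <- 40
group <- factor(rep(c("A", "B"), each = n / 2))
x <- matrix(rnorm(n * 20), nrow = n,
            dimnames = list(NULL, paste0("taxon", 1:20)))
# Shift the first three features in group B so that they carry signal.
x[group == "B", 1:3] <- x[group == "B", 1:3] + 3

# One Kruskal-Wallis test per feature (an assumed choice of test).
pvals <- apply(x, 2, function(v) kruskal.test(v, group)$p.value)

# Benjamini-Hochberg adjustment controls the false discovery rate.
qvals <- p.adjust(pvals, method = "fdr")

threshold <- 0.05  # plays the role of the -s value
selected <- names(qvals)[qvals <= threshold]
x_selected <- x[, selected, drop = FALSE]
print(selected)
```

Lowering the threshold keeps fewer features and raising it keeps more, which is the sense in which the threshold determines the number of features.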

If you are familiar with R, you can work with your data more flexibly from within R. Here is the same example as in the command-line version.

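## option initialization
opts <- list()
opts$mode <- "learn"            # -w: learn mode
opts$method <- "SVM"            # -M: classifier type
opts$input_fp <- "data/taxa/GG_100nt_even10k-adults_L7.biom"  # -i: input file
opts$map_fp <- "data/gg-map-adults.txt"                       # -m: mapfile
opts$category <- "COUNTRY"      # -c: category name
opts$outdir <- "example/svm_learn"  # -o: output directory
opts$nfolds <- 5                # number of folds (not set in the command-line example)
opts$method_param <- "linear"   # -C: kernel type for SVM
opts$ftMethod <- "FDR"          # -v: feature selection method
opts$is_feat <- TRUE            # -f: perform feature selection
opts$feat_param <- 0.05         # -s: threshold for feature selection

## build the training parameters from the options, then train the model
train_params <- import.train.params(opts)
best_model <- train.mwas(train_params)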

The steps above mirror the command-line example. Alternatively, you can call the internal functions directly rather than the wrapper functions.
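For readers who want to see what training a linear-kernel SVM with 5-fold cross-validation can involve, here is a minimal sketch on simulated data. It is not the MWAS internals: it assumes the separate e1071 package for the SVM itself (the fitting step is skipped if e1071 is not installed), and the stratified fold assignment, toy data, and variable names are choices made for this example.

```r
# Toy data: 50 samples, 10 features, two classes.
set.seed(1)
nfolds <- 5                                # mirrors opts$nfolds
y <- factor(rep(c("A", "B"), each = 25))
x <- matrix(rnorm(50 * 10), nrow = 50)
x[y == "B", 1:2] <- x[y == "B", 1:2] + 2   # two informative features

# Stratified fold labels: shuffle fold ids within each class.
folds <- integer(length(y))
for (cls in levels(y)) {
  idx <- which(y == cls)
  folds[idx] <- sample(rep_len(seq_len(nfolds), length(idx)))
}

# Fit on nfolds - 1 folds and score on the held-out fold (requires e1071).
if (requireNamespace("e1071", quietly = TRUE)) {
  acc <- sapply(seq_len(nfolds), function(k) {
    held_out <- folds == k
    fit <- e1071::svm(x[!held_out, ], y[!held_out], kernel = "linear")
    mean(predict(fit, x[held_out, ]) == y[held_out])
  })
  print(mean(acc))  # mean held-out accuracy across folds
}
```

Stratifying the folds keeps the class proportions the same in every fold, which matters when a category such as COUNTRY is unbalanced.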

