Loaded mixOmics 6.30.0
Thank you for using mixOmics!
Tutorials: http://mixomics.org
Bookdown vignette: https://mixomicsteam.github.io/Bookdown
Questions, issues: Follow the prompts at http://mixomics.org/contact-us
Cite us: citation('mixOmics')
N-integration refers to the process of analyzing multiple datasets that capture different aspects of the same samples. For instance, you might have transcriptomic, genetic, and proteomic data for the same group of cells. N-integrative methods are designed to simultaneously utilize the information from all these datasets.
DIABLO, part of the mixOmics framework, is an innovative approach for integrating multiple datasets while exploring their relationship with a categorical outcome variable. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies and is also known as Multiblock (s)PLS-DA.
data(breast.TCGA)
Extract training data and name each data frame & storing as list
X <-list(mRNA = breast.TCGA$data.train$mrna, miRNA = breast.TCGA$data.train$mirna, protein = breast.TCGA$data.train$protein)Y <- breast.TCGA$data.train$subtypesummary(Y)
diablo.tcga <-block.plsda(X, Y, ncomp =5, design = design)
Design matrix has changed to include Y; each block will be
linked to Y.
set.seed(123) perf.diablo.tcga =perf(diablo.tcga, validation ='Mfold', folds =10, nrepeat =10)#perf.diablo.tcga$error.rate # Lists the different types of error rates
Plot of the error rates based on weighted vote
plot(perf.diablo.tcga)
Choosing the number of components in block.plsda using perf() with 10 x 10-fold CV function in the breast.TCGA study. Classification error rates (overall and balanced, see Module 2) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA in Seciton 3.4 and detailed in Extra reading material 3 from Module 3. Bars show the standard deviation across the 10 repeated folds. The plot shows that the error rate reaches a minimum from 2 to 3 dimensions.
The performance plot indicates that two components should be sufficient in the final model, and that the centroids distance might lead to better prediction. A balanced error rate (BER) should be considered for further analysis.
The following outputs the optimal number of components according to the prediction distance and type of error rate (overall or balanced), as well as a prediction weighting scheme illustrated further below.
Choosing the optimal number of variables to select in each data set using the tune.block.splsda function. The function tune() is run with 10-fold cross validation and nrepeat = 3 in this case.
Design matrix has changed to include Y; each block will be
linked to Y.
You have provided a sequence of keepX of length: 9 for block mRNA and 11 for block miRNA and 5 for block proteomics.
This results in 495 models being fitted for each component and each nrepeat, this may take some time to run, be patient!
list.keepX <- tune.diablo.tcga$choice.keepX
list.keepX
$mRNA
[1] 10 25
$miRNA
[1] 20 5
$protein
[1] 5 5
The final multiblock sPLS-DA model includes the tuned parameters and is run as:
Design matrix has changed to include Y; each block will be
linked to Y.
diablo.tcga$design
mRNA miRNA protein Y
mRNA 0.0 0.1 0.1 1
miRNA 0.1 0.0 0.1 1
protein 0.1 0.1 0.0 1
Y 1.0 1.0 1.0 0
The selected variables can be extracted with the function selectVar(), for example in the mRNA block, along with their loading weights (not output here):
plotDiablo() is a diagnostic plot to check whether the correlations between components from each data set were maximised as specified in the design matrix. We specify the dimension to be assessed with the ncomp argument
plotDiablo(diablo.tcga, ncomp =2)
Diagnostic plot from multiblock sPLS-DA applied on the breast.TCGA study. Samples are represented based on the specified component (here ncomp = 2) for each data set (mRNA, miRNA and protein). Samples are coloured by breast cancer subtype (Basal, Her2 and LumA) and 95% confidence ellipse plots are represented. The bottom left numbers indicate the correlation coefficients between the first components from each data set. In this example, mRNA expression and miRNA concentration are highly correlated on the first dimension.
The sample plot with the plotIndiv() function projects each sample into the space spanned by the components from each block, resulting in a series of graphs corresponding to each data set. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE).
Sample plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are plotted according to their scores on the first 2 components for each data set. Samples are coloured by cancer subtype and are classified into three classes: Basal, Her2 and LumA. The plot shows the degree of agreement between the different data sets and the discriminative ability of each data set.
This type of graphic allows us to better understand the information extracted from each data set and its discriminative ability. Here we can see that the LumA group can be difficult to classify in the miRNA data.
Note:
Additional variants include the argument block = 'average' that averages the components from all blocks to produce a single plot. The argument block='weighted.average' is a weighted average of the components according to their correlation with the components associated with the outcome.
In the arrow plot, the start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of that same sample but in each block. Such graphics highlight the agreement between all data sets at the sample level when modelled with multiblock sPLS-DA.
Arrow plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are projected into the space spanned by the first two components for each data set then overlaid across data sets. The start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of the same sample in each block. Arrows further from their centroid indicate some disagreement between the data sets. Samples are coloured by cancer subtype (Basal, Her2 and LumA).
This plot shows that globally, the discrimination of all breast cancer subtypes can be extracted from all data sets, however, there are some dissimilarities at the samples level across data sets (the common informat
ion cannot be extracted in the same way across data sets).