mixomics2

library(mixOmics)
Loading required package: MASS
Loading required package: lattice
Loading required package: ggplot2

Loaded mixOmics 6.30.0
Thank you for using mixOmics!
Tutorials: http://mixomics.org
Bookdown vignette: https://mixomicsteam.github.io/Bookdown
Questions, issues: Follow the prompts at http://mixomics.org/contact-us
Cite us:  citation('mixOmics')

N-integration refers to the process of analyzing multiple datasets that capture different aspects of the same samples. For instance, you might have transcriptomic, genetic, and proteomic data for the same group of cells. N-integrative methods are designed to simultaneously utilize the information from all these datasets.

DIABLO, part of the mixOmics framework, is an innovative approach for integrating multiple datasets while exploring their relationship with a categorical outcome variable. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies and is also known as Multiblock (s)PLS-DA.

data(breast.TCGA)

Extract training data and name each data frame & storing as list

X <- list(mRNA = breast.TCGA$data.train$mrna, 
          miRNA = breast.TCGA$data.train$mirna, 
          protein = breast.TCGA$data.train$protein)


Y <- breast.TCGA$data.train$subtype
summary(Y)
Basal  Her2  LumA 
   45    30    75 
design <- matrix(0.1, ncol = length(X), nrow = length(X), 
                dimnames = list(names(X), names(X)))
diag(design) <- 0
design 
        mRNA miRNA protein
mRNA     0.0   0.1     0.1
miRNA    0.1   0.0     0.1
protein  0.1   0.1     0.0

PLS with one component and calculate the cross-correlations between components associated to each data set

res1.pls.tcga <- pls(X$mRNA, X$protein, ncomp = 1)
cor(res1.pls.tcga$variates$X, res1.pls.tcga$variates$Y)
          comp1
comp1 0.9031761
res2.pls.tcga <- pls(X$mRNA, X$miRNA, ncomp = 1)
cor(res2.pls.tcga$variates$X, res2.pls.tcga$variates$Y)
          comp1
comp1 0.8456299
res3.pls.tcga <- pls(X$protein, X$miRNA, ncomp = 1)
cor(res3.pls.tcga$variates$X, res3.pls.tcga$variates$Y)
          comp1
comp1 0.7982008
diablo.tcga <- block.plsda(X, Y, ncomp = 5, design = design)
Design matrix has changed to include Y; each block will be
            linked to Y.
set.seed(123) 

perf.diablo.tcga = perf(diablo.tcga, validation = 'Mfold', folds = 10, nrepeat = 10)

#perf.diablo.tcga$error.rate  # Lists the different types of error rates

Plot of the error rates based on weighted vote

plot(perf.diablo.tcga)

Choosing the number of components in block.plsda using perf() with 10 x 10-fold CV function in the breast.TCGA study. Classification error rates (overall and balanced, see Module 2) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA in Seciton 3.4 and detailed in Extra reading material 3 from Module 3. Bars show the standard deviation across the 10 repeated folds. The plot shows that the error rate reaches a minimum from 2 to 3 dimensions.

The performance plot indicates that two components should be sufficient in the final model, and that the centroids distance might lead to better prediction. A balanced error rate (BER) should be considered for further analysis.

The following outputs the optimal number of components according to the prediction distance and type of error rate (overall or balanced), as well as a prediction weighting scheme illustrated further below.

perf.diablo.tcga$choice.ncomp$WeightedVote
            max.dist centroids.dist mahalanobis.dist
Overall.ER         3              2                3
Overall.BER        3              2                3

So final ncomp value:

ncomp <- perf.diablo.tcga$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

Choosing the optimal number of variables to select in each data set using the tune.block.splsda function. The function tune() is run with 10-fold cross validation and nrepeat = 3 in this case.

set.seed(123) 

test.keepX <- list(mRNA = c(5:9, seq(10, 25, 5)),
                   miRNA = c(5:9, seq(10, 20, 2)),
                   proteomics = c(seq(5, 25, 5)))


tune.diablo.tcga <- tune.block.splsda(X, Y, ncomp = 2, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 3, 
                              BPPARAM = BiocParallel::SnowParam(workers = 2),
                              dist = "centroids.dist")
Design matrix has changed to include Y; each block will be
            linked to Y.

You have provided a sequence of keepX of length:  9 for block mRNA and 11 for block miRNA and  5 for block proteomics.
This results in 495 models being fitted for each component and each nrepeat, this may take some time to run, be patient!
list.keepX <- tune.diablo.tcga$choice.keepX
list.keepX
$mRNA
[1] 10 25

$miRNA
[1] 20  5

$protein
[1] 5 5

The final multiblock sPLS-DA model includes the tuned parameters and is run as:

diablo.tcga <- block.splsda(X, Y, ncomp = ncomp, 
                            keepX = list.keepX, design = design)
Design matrix has changed to include Y; each block will be
            linked to Y.
diablo.tcga$design
        mRNA miRNA protein Y
mRNA     0.0   0.1     0.1 1
miRNA    0.1   0.0     0.1 1
protein  0.1   0.1     0.0 1
Y        1.0   1.0     1.0 0

The selected variables can be extracted with the function selectVar(), for example in the mRNA block, along with their loading weights (not output here):

# mRNA variables selected on component 1
selectVar(diablo.tcga, block = 'mRNA', comp = 1)
$mRNA
$mRNA$name
 [1] "ZNF552"  "KDM4B"   "CCNA2"   "LRIG1"   "PREX1"   "FUT8"    "C4orf34"
 [8] "TTC39A"  "ASPM"    "SLC43A3"

$mRNA$value
          value.var
ZNF552  -0.60385517
KDM4B   -0.47633173
CCNA2    0.30916840
LRIG1   -0.28169298
PREX1   -0.28113199
FUT8    -0.24433772
C4orf34 -0.22022767
TTC39A  -0.20472637
ASPM     0.05578513
SLC43A3  0.03563377


$comp
[1] 1

Plotting

plotDiablo() is a diagnostic plot to check whether the correlations between components from each data set were maximised as specified in the design matrix. We specify the dimension to be assessed with the ncomp argument

plotDiablo(diablo.tcga, ncomp = 2)

Diagnostic plot from multiblock sPLS-DA applied on the breast.TCGA study. Samples are represented based on the specified component (here ncomp = 2) for each data set (mRNA, miRNA and protein). Samples are coloured by breast cancer subtype (Basal, Her2 and LumA) and 95% confidence ellipse plots are represented. The bottom left numbers indicate the correlation coefficients between the first components from each data set. In this example, mRNA expression and miRNA concentration are highly correlated on the first dimension.

The sample plot with the plotIndiv() function projects each sample into the space spanned by the components from each block, resulting in a series of graphs corresponding to each data set. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE).

plotIndiv(diablo.tcga, ind.names = FALSE, legend = TRUE, 
          title = 'TCGA, DIABLO comp 1 - 2')

Sample plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are plotted according to their scores on the first 2 components for each data set. Samples are coloured by cancer subtype and are classified into three classes: Basal, Her2 and LumA. The plot shows the degree of agreement between the different data sets and the discriminative ability of each data set.

This type of graphic allows us to better understand the information extracted from each data set and its discriminative ability. Here we can see that the LumA group can be difficult to classify in the miRNA data.

Note:

  • Additional variants include the argument block = 'average' that averages the components from all blocks to produce a single plot. The argument block='weighted.average' is a weighted average of the components according to their correlation with the components associated with the outcome.

In the arrow plot, the start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of that same sample but in each block. Such graphics highlight the agreement between all data sets at the sample level when modelled with multiblock sPLS-DA.

plotArrow(diablo.tcga, ind.names = FALSE, legend = TRUE, 
          title = 'TCGA, DIABLO comp 1 - 2')

Arrow plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are projected into the space spanned by the first two components for each data set then overlaid across data sets. The start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of the same sample in each block. Arrows further from their centroid indicate some disagreement between the data sets. Samples are coloured by cancer subtype (Basal, Her2 and LumA).

This plot shows that globally, the discrimination of all breast cancer subtypes can be extracted from all data sets, however, there are some dissimilarities at the samples level across data sets (the common informat

ion cannot be extracted in the same way across data sets).