An increasingly common use case involves a set of samples or patients that provide measurements on multiple data types, such as gene expression, genotype, and miRNA abundance. Frequently, not all samples contribute to all assays, so some sparsity in the samples \(\times\) assays table is expected.
Here are some very simple manipulations with TCGA ovarian cancer data. The data are small enough that the loadHub function can be used to deserialize all relevant data.
suppressPackageStartupMessages(library(biocMultiAssay))
#
# crude way of enumerating RDA files planted in extdata
#
ov = dir(system.file("extdata/tcga_ov",
package="biocMultiAssay"), full=TRUE, pattern="\\.rda$")
drop = grep("pheno", ov)
if (length(drop)>0) {
pdpath=ov[drop]
ov=ov[-drop]
}
#
# informal labels for constituents
#
tags = c("ov RNA-seq", "ov agilent", "ov mirna", "ov affy", "ov CNV gistic",
"ov methy 450k")
#
# construct expt instances from ExpressionSets
#
exptlist = lapply(seq_along(ov), function(x) new("expt",
serType="RData", assayPath=ov[x], tag=tags[x]))
#
# populate an eHub, with a master phenotype data frame
#
ovhub = new("eHub", hub=exptlist, masterSampleData = get(load(pdpath)))
ovhub
## eHub with 6 experiments. User-defined tags:
## ov RNA-seq
## ov agilent
## ov mirna
## ov affy
## ov CNV gistic
## ov methy 450k
## Sample level data is 580 x 29.
This is a lightweight representation: the eHub records the file path and tag of each experiment without loading the data. There is also a class that holds materializations of all the experimental data. Constructing it is currently slow.
lovhub = loadHub(ovhub)
lovhub
## loadedHub instance.
##               Features Samples feats.
## ov RNA-seq       24174     545 ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     556 A1CF, A2BP1, A2M ...
## ov mirna         12989     578 A1CF, A2M, A4GALT ...
## ov affy          17814     574 15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     554 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     261 ?, A1BG, A1CF ...
object.size(lovhub)
## 19759320 bytes
This is a heavy representation but manageable at this level of data reduction.
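The difference between the two representations can be illustrated without the package. This is a minimal sketch in base R, assuming a toy matrix in place of an assay; the `light` list and its fields only mimic the `assayPath` and `tag` slots used above and are not the package's classes.

```r
# A small matrix stands in for one assay (hypothetical toy data)
set.seed(1)
assay <- matrix(rnorm(2000), nrow = 200,
                dimnames = list(paste0("g", 1:200), paste0("s", 1:10)))

# Serialize it to a temporary .rda file
path <- tempfile(fileext = ".rda")
save(assay, file = path)

# Light representation: only a path and a tag
light <- list(assayPath = path, tag = "toy assay")

# Heavy representation: load() returns the name of the restored object
# and get() fetches it; a fresh environment avoids overwriting
# variables in the workspace
env <- new.env()
heavy <- get(load(light$assayPath, envir = env), envir = env)

# The light record is much smaller than the materialized data
print(object.size(light))
print(object.size(heavy))
```

The `get(load(path))` idiom used in this vignette works the same way, except that it restores the object into the calling environment.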
We can determine the set of sample identifiers common to all experiments.
allid = lapply(lovhub@elist, sampleNames)
commids = Reduce(intersect, allid)
length(commids)
## [1] 248
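To see which samples are missing from which assays, a membership table can be useful. The following is a minimal sketch in base R, assuming toy sample identifiers (the `ids` list and its contents are made up for illustration and are not taken from the TCGA data). It tabulates which samples contribute to which assays and recovers the common set.

```r
# Toy sample identifiers per assay (hypothetical, for illustration only)
ids <- list(
  rnaseq = c("s1", "s2", "s3", "s4"),
  mirna  = c("s2", "s3", "s4", "s5"),
  methy  = c("s3", "s4", "s6")
)

# Union of all samples seen in any assay
allsamp <- sort(unique(unlist(ids)))

# Logical samples x assays membership matrix; FALSE entries are the sparsity
member <- sapply(ids, function(x) allsamp %in% x)
rownames(member) <- allsamp

# Samples present in every assay
common <- Reduce(intersect, ids)

print(member)
print(common)
```

Rows of `member` that are all `TRUE` correspond to the entries of `common`.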
We can now generate a loadedHub instance restricted to the common samples.
locomm = lovhub
locomm@elist = lapply(locomm@elist, function(x) x[,commids])
locomm
## loadedHub instance.
##               Features Samples feats.
## ov RNA-seq       24174     248 ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     248 A1CF, A2BP1, A2M ...
## ov mirna         12989     248 A1CF, A2M, A4GALT ...
## ov affy          17814     248 15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     248 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     248 ?, A1BG, A1CF ...
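Where to put these abstractions for both the light and heavy representations is a point of discussion.
Alternatively, createHub builds a loadedHub directly from a list of in-memory objects and a master phenotype data frame, here the phenotype data of the second experiment. Judging from the messages it prints, drop=TRUE removes from each experiment the samples that are absent from the master phenotype data.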
ovlist <- lapply(ov, function(x) get(load(x)))
names(ovlist) <- tags
lovhub2 <- createHub(masterpheno=pData(ovlist[[2]]), objlist=ovlist, drop=TRUE)
## Dropping the following samples:
## ov RNA-seq :
## TCGA.36.2530
##
##
## ov mirna :
## TCGA.04.1341 TCGA.01.0630 TCGA.01.0631 TCGA.01.0633 TCGA.01.0636 TCGA.01.0637 TCGA.13.0730 TCGA.01.0628 TCGA.01.0639 TCGA.01.0642 TCGA.04.1357 TCGA.04.1360 TCGA.59.2352 TCGA.30.1861 TCGA.04.1519 TCGA.04.1353 TCGA.42.2593 TCGA.36.2539 TCGA.36.2530 TCGA.29.2429 TCGA.36.2533 TCGA.29.1699
##
##
## ov affy :
## TCGA.13.0730 TCGA.13.0760 TCGA.04.1341 TCGA.04.1353 TCGA.04.1357 TCGA.04.1360 TCGA.04.1519 TCGA.25.2390 TCGA.29.1699 TCGA.29.2429 TCGA.30.1861 TCGA.36.2530 TCGA.36.2533 TCGA.36.2539 TCGA.42.2593 TCGA.59.2352 TCGA.61.2610 TCGA.61.2611
##
##
## ov CNV gistic :
## TCGA.04.1341 TCGA.04.1357 TCGA.04.1360 TCGA.04.1519 TCGA.13.0730 TCGA.13.0760 TCGA.29.1699 TCGA.29.2429 TCGA.30.1861 TCGA.30.1869 TCGA.36.2530 TCGA.36.2533 TCGA.59.2352
##
##
## ov methy 450k :
## TCGA.04.1519 TCGA.29.1699 TCGA.13.0730 TCGA.59.2352 TCGA.04.1357
##
##
lovhub2
## loadedHub instance.
##               Features Samples feats.
## ov RNA-seq       24174     544 ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     556 A1CF, A2BP1, A2M ...
## ov mirna         12989     556 A1CF, A2M, A4GALT ...
## ov affy          17814     556 15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     541 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     256 ?, A1BG, A1CF ...
object.size(lovhub2)
## 19602448 bytes
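The drop messages and the reduced sample counts above suggest that samples without a row in the master phenotype data are removed from each experiment. A minimal sketch of that behaviour in base R follows, assuming plain matrices in place of ExpressionSets; `dropUnmatched` is a hypothetical helper written for illustration and is not the package's createHub.

```r
# Hypothetical helper: keep only columns (samples) that appear in the
# row names of the master phenotype data frame, reporting what is dropped
dropUnmatched <- function(objlist, masterpheno) {
  keepids <- rownames(masterpheno)
  lapply(objlist, function(m) {
    gone <- setdiff(colnames(m), keepids)
    if (length(gone) > 0) message("Dropping: ", paste(gone, collapse = " "))
    m[, colnames(m) %in% keepids, drop = FALSE]
  })
}

# Toy master phenotype data and two toy assays (made-up identifiers)
pheno <- data.frame(age = c(50, 61, 47), row.names = c("s1", "s2", "s3"))
toy <- list(
  a = matrix(1:8, nrow = 2,
             dimnames = list(c("g1", "g2"), c("s1", "s2", "s3", "s9"))),
  b = matrix(1:4, nrow = 2,
             dimnames = list(c("g1", "g2"), c("s2", "s8")))
)

kept <- dropUnmatched(toy, pheno)
print(sapply(kept, ncol))
```

Using `drop = FALSE` in the subsetting keeps each result a matrix even when only one sample survives.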