An increasingly common use case involves a set of samples or patients who provide measurements on multiple data types, such as gene expression, genotype, and miRNA abundance. Frequently, not all samples contribute to all assays, so some sparsity in the samples \(\times\) assays table is expected.
The following simple manipulations use TCGA ovarian cancer data. The data are small enough that the loadHub function can deserialize all of the relevant data into memory.
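As a small illustration of this sparsity, the following base R sketch builds a samples \(\times\) assays incidence table. The sample identifiers and assay names are invented for illustration and are not taken from the TCGA data.

```r
# Toy sample IDs per assay (hypothetical; not the TCGA data)
assay_samples <- list(
  expression = c("s1", "s2", "s3", "s4"),
  genotype   = c("s1", "s2", "s4"),
  mirna      = c("s2", "s3", "s4", "s5")
)

# Every sample seen on at least one assay
all_samples <- sort(unique(unlist(assay_samples)))

# Logical matrix: TRUE where a sample was measured on an assay
incidence <- sapply(assay_samples, function(ids) all_samples %in% ids)
rownames(incidence) <- all_samples

incidence
colSums(incidence)   # number of samples per assay
mean(!incidence)     # fraction of empty sample-by-assay cells
```

A table like this makes it easy to see which samples would be lost when restricting to complete cases.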
suppressPackageStartupMessages(library(biocMultiAssay))
#
# crude way of enumerating RDA files planted in extdata
#
ov = dir(system.file("extdata/tcga_ov",
   package="biocMultiAssay"), full.names=TRUE)
drop = grep("pheno", ov)
if (length(drop)>0) {
   pdpath=ov[drop]
   ov=ov[-drop]
}
#
# informal labels for constituents; tags[i] is paired with ov[i], so the
# order here must match the file order returned by dir()
#
tags = c("ov RNA-seq", "ov agilent", "ov mirna", "ov affy", "ov CNV gistic",
"ov methy 450k")
#
# construct expt instances from ExpressionSets
#
elist = lapply(seq_along(ov), function(x) new("expt",
   serType="RData", assayPath=ov[x], tag=tags[x], sampleDataPath=ov[x]))
#
# populate an eHub, with a master phenotype data frame
#
ovhub = new("eHub", hub=elist, masterSampleData = get(load(pdpath)))
ovhub
## eHub with 6 experiments. User-defined tags:
## ov RNA-seq
## ov agilent
## ov mirna
## ov affy
## ov CNV gistic
## ov methy 450k
## Sample level data is 580 x 29.
This is a lightweight representation: the eHub records which data sets are in scope without loading them. A second class, loadedHub, holds materializations of all the experimental data. Constructing it is currently slow.
lovhub = loadHub(ovhub)
lovhub
## loadedHub instance.
##               Features Samples feats.
## ov RNA-seq       24174     545 ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     556 A1CF, A2BP1, A2M ...
## ov mirna         12989     578 A1CF, A2M, A4GALT ...
## ov affy          17814     574 15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     554 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     261 ?, A1BG, A1CF ...
object.size(lovhub)
## 19760392 bytes
This is a heavy representation (about 19.8 MB in memory), but it is manageable at this level of data reduction.
We can determine the set of sample identifiers common to all six experiments.
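To make the light versus heavy distinction concrete, here is a minimal base R sketch that does not use the package classes. The toy matrix, the temporary file, and the list fields `tag`, `path`, and `data` are all assumptions introduced for illustration.

```r
# Hypothetical toy data: a 200 x 50 numeric matrix saved to a temporary file
set.seed(1)
mat <- matrix(rnorm(200 * 50), nrow = 200,
              dimnames = list(paste0("g", 1:200), paste0("s", 1:50)))
path <- tempfile(fileext = ".rda")
save(mat, file = path)

# Light representation: only a tag and a file path are held in memory
light <- list(tag = "toy assay", path = path)

# Heavy representation: the data are deserialized and kept in memory.
# Loading into a fresh environment avoids overwriting objects in the workspace.
env <- new.env()
nm <- load(path, envir = env)
heavy <- list(tag = "toy assay", data = env[[nm]])

# Compare the in-memory footprint of the two representations
format(object.size(light), units = "auto")
format(object.size(heavy), units = "auto")
```

The light object stays small no matter how large the file on disk is; the cost is paid only when the data are materialized.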
allid = lapply(lovhub@elist, sampleNames)
commids = allid[[1]]
for (i in 2:length(allid))
   commids = intersect(commids, allid[[i]])
length(commids)
## [1] 248
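The loop above can also be written as a fold with `Reduce`. The sketch below uses invented sample identifiers rather than the TCGA data. One practical difference is that `Reduce` also handles a list with a single element, whereas `2:length(x)` counts downward when the length is 1.

```r
# Toy sample IDs per assay (hypothetical; not the TCGA data)
ids <- list(
  a = c("s1", "s2", "s3", "s4"),
  b = c("s2", "s3", "s4", "s5"),
  c = c("s4", "s3", "s6")
)

# Fold intersect() over the list; the result keeps the order of the first element
common <- Reduce(intersect, ids)
common

# A single-element list is returned unchanged
Reduce(intersect, ids["a"])
```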
We can now generate a loadedHub instance restricted to the common samples.
locomm = lovhub
locomm@elist = lapply(locomm@elist, function(x) x[,commids])
locomm
## loadedHub instance.
##               Features Samples feats.
## ov RNA-seq       24174     248 ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     248 A1CF, A2BP1, A2M ...
## ov mirna         12989     248 A1CF, A2M, A4GALT ...
## ov affy          17814     248 15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     248 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     248 ?, A1BG, A1CF ...
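The same restriction can be sketched with plain matrices, assuming features in rows and samples in columns. The matrices and identifiers below are invented for illustration. Indexing columns by name both drops the unshared samples and places the remaining ones in the same order in every assay.

```r
# Hypothetical toy assays: features in rows, samples in columns
rna   <- matrix(1:12, nrow = 3,
                dimnames = list(paste0("g", 1:3), c("s1", "s2", "s3", "s4")))
mirna <- matrix(1:6, nrow = 2,
                dimnames = list(paste0("m", 1:2), c("s4", "s2", "s5")))
assays <- list(rna = rna, mirna = mirna)

# Sample IDs present in every assay
common <- Reduce(intersect, lapply(assays, colnames))

# Restrict each assay to the common samples, in one shared column order
restricted <- lapply(assays, function(m) m[, common, drop = FALSE])

sapply(restricted, dim)

# Check that every assay now has identical sample names in identical order
stopifnot(all(sapply(restricted, function(m) identical(colnames(m), common))))
```

The `drop = FALSE` argument keeps each result a matrix even if only one sample is shared.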
Where to put these abstractions for both the light and heavy representations is a point of discussion.