1 Background

An increasingly common use case involves a set of samples or patients measured on multiple data types, such as gene expression, genotype, and miRNA abundance. Frequently, not every sample contributes to every assay, so some sparsity in the samples \(\times\) assays table is expected.
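
To make the notion of sparsity concrete, the following is a minimal base R sketch that tabulates which samples contribute to which assays. The sample identifiers, the assay names, and the objects `assay_samples` and `presence` are all hypothetical toy values, not part of biocMultiAssay or of the TCGA data.

```r
# Hypothetical sample identifiers per assay (toy data for illustration only)
assay_samples <- list(
  expression = c("s1", "s2", "s3", "s4"),
  genotype   = c("s1", "s3", "s4"),
  mirna      = c("s2", "s3", "s5")
)

# Union of all samples seen in any assay
all_samples <- sort(unique(unlist(assay_samples)))

# Logical samples x assays matrix: TRUE where a sample contributed to an assay
presence <- vapply(assay_samples,
                   function(ids) all_samples %in% ids,
                   logical(length(all_samples)))
rownames(presence) <- all_samples
presence

# Fraction of sample/assay cells with no measurement
sparsity <- mean(!presence)

# Samples measured on every assay
complete <- rownames(presence)[rowSums(presence) == ncol(presence)]
```

In this toy example only one sample is present in all three assays, which is the situation the common-identifier computation later in this document has to deal with.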

2 Basic demonstrative resources

Here are some very simple manipulations with TCGA ovarian cancer data. The data are small enough that the loadHub function can be used to deserialize all relevant data into memory.

suppressPackageStartupMessages(library(biocMultiAssay))
#
# crude way of enumerating RDA files planted in extdata
#
ov = dir(system.file("extdata/tcga_ov", 
   package="biocMultiAssay"), full.names=TRUE)
# separate the phenotype file from the assay files
drop = grep("pheno", ov)
if (length(drop)>0) {
  pdpath=ov[drop]
  ov=ov[-drop]
  }
#
# informal labels for constituents
#
tags = c("ov RNA-seq", "ov agilent", "ov mirna", "ov affy", "ov CNV gistic",
  "ov methy 450k")
#
# construct expt instances from ExpressionSets
#
elist = lapply(seq_along(ov), function(x) new("expt", 
     serType="RData", assayPath=ov[x], tag=tags[x], sampleDataPath=ov[x]))
#
# populate an eHub, with a master phenotype data frame
#
ovhub = new("eHub", hub=elist, masterSampleData = get(load(pdpath)))
ovhub
## eHub with 6 experiments.  User-defined tags:
##   ov RNA-seq 
##   ov agilent 
##   ov mirna 
##   ov affy 
##   ov CNV gistic 
##   ov methy 450k 
## Sample level data is 580 x 29.

This is a lightweight representation of the scope of data registered in an eHub. There is also a class that holds materializations of all the experimental data. Constructing it is currently slow.

lovhub = loadHub(ovhub)
lovhub
## loadedHub instance.
##               Features Samples                        feats.
## ov RNA-seq       24174     545       ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     556          A1CF, A2BP1, A2M ...
## ov mirna         12989     578         A1CF, A2M, A4GALT ...
## ov affy          17814     574       15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     554 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     261             ?, A1BG, A1CF ...
object.size(lovhub)
## 19760392 bytes

This is a heavy representation but manageable at this level of data reduction.
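
The contrast between the light and heavy representations can be illustrated with a small base R sketch. It is not the biocMultiAssay implementation: the objects `light` and `heavy` and the helper `materialize` are hypothetical, and the serialized matrix is toy data written to a temporary file.

```r
# Toy assay written to a temporary RData file (hypothetical data)
tmp <- tempfile(fileext = ".rda")
toy_expr <- matrix(0, nrow = 100, ncol = 20)
save(toy_expr, file = tmp)

# Light representation: only a label and a path are held in memory
light <- list(tag = "toy expression", path = tmp)

# Heavy representation: deserialize the file contents on request
materialize <- function(x) {
  env <- new.env()
  nm <- load(x$path, envir = env)   # load() returns the names it restored
  get(nm[1], envir = env)
}
heavy <- materialize(light)

# The light object stays small regardless of how large the assay is
as.numeric(object.size(light)) < as.numeric(object.size(heavy))
```

The design trade-off is the usual one: the light form is cheap to build and print, while the heavy form pays the deserialization cost once and then supports direct subsetting.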

We can determine the set of common identifiers.

allid = lapply(lovhub@elist, sampleNames)
commids = Reduce(intersect, allid)
length(commids)
## [1] 248

We can now generate a loadedHub instance restricted to the common samples.

locomm = lovhub
locomm@elist = lapply(locomm@elist, function(x) x[,commids])
locomm
## loadedHub instance.
##               Features Samples                        feats.
## ov RNA-seq       24174     248       ACAP3, ACTRT2, AGRN ...
## ov agilent       14269     248          A1CF, A2BP1, A2M ...
## ov mirna         12989     248         A1CF, A2M, A4GALT ...
## ov affy          17814     248       15E1.2, 2'-PDE, 7A5 ...
## ov CNV gistic      799     248 ebv-miR-BART1-3p, ebv-miR ...
## ov methy 450k    20502     248             ?, A1BG, A1CF ...
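
The same restrict-to-common-samples pattern can be sketched on plain matrices, independent of the loadedHub class. Everything here is an assumption made for illustration: the helper `make_assay`, the assay names, and the sample identifiers are toy values, with features in rows and samples in columns.

```r
# Hypothetical helper: build a toy assay matrix with named features and samples
make_assay <- function(n_features, samples) {
  matrix(seq_len(n_features * length(samples)), nrow = n_features,
         dimnames = list(paste0("f", seq_len(n_features)), samples))
}

assays <- list(
  rnaseq = make_assay(3, c("s1", "s2", "s3", "s4")),
  mirna  = make_assay(2, c("s4", "s2", "s5")),
  methy  = make_assay(4, c("s2", "s4", "s6"))
)

# Samples present in every assay, in the order of the first assay
common <- Reduce(intersect, lapply(assays, colnames))

# Subset by name so that columns are also aligned across assays;
# drop = FALSE keeps a matrix even if only one sample remains
restricted <- lapply(assays, function(m) m[, common, drop = FALSE])

vapply(restricted, ncol, integer(1))
```

Subsetting by sample name rather than by position also puts the columns of every assay in the same order, which matters for any downstream analysis that pairs measurements across assays.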

Where these abstractions should live, for both the light and heavy representations, is still a point of discussion.