Merging ranges accross samples

Scenario

Imagine that you have data in the form of ranges from several samples. The ranges are non-overlapping in each sample, but you want to find the common sections across the different samples. Here I show one way of doing it.

Example

First lets build some example data.

suppressMessages(library("GenomicRanges"))
gListData <- GRangesList(GRanges(seqnames = rep("chr1", 3), ranges = IRanges(c(1, 
    100, 200), width = 30)), GRanges(seqnames = rep("chr1", 3), ranges = IRanges(start = c(20, 
    90, 300), width = 30)))
gListData
## GRangesList of length 2:
## [[1]] 
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames     ranges strand
##          <Rle>  <IRanges>  <Rle>
##   [1]     chr1 [  1,  30]      *
##   [2]     chr1 [100, 129]      *
##   [3]     chr1 [200, 229]      *
## 
## [[2]] 
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames     ranges strand
##   [1]     chr1 [ 20,  49]      *
##   [2]     chr1 [ 90, 119]      *
##   [3]     chr1 [300, 329]      *
## 
## ---
## seqlengths:
##  chr1
##    NA

Lets take a look at the ranges in the example data.

plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", 
    sep = 0.5, ...) {
    height <- 1
    if (is(xlim, "Ranges")) 
        xlim <- c(min(start(xlim)), max(end(xlim)))
    bins <- disjointBins(IRanges(start(x), end(x) + 1))
    plot.new()
    plot.window(xlim, c(0, max(bins) * (height + sep)))
    ybottom <- bins * (sep + height) - height
    rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col, 
        ...)
    title(main)
    axis(1)
}
plotRanges(ranges(gListData[[1]]), main = "Sample 1")

plot of chunk unnamed-chunk-2

plotRanges(ranges(gListData[[1]]), main = "Sample 2")

plot of chunk unnamed-chunk-2

Next I extract the ranges from the GRangesList object. Note that here I am assuming that all the ranges are from the same chromosome.

allRanges <- unlist(ranges(gListData))
allRanges
## IRanges of length 6
##     start end width
## [1]     1  30    30
## [2]   100 129    30
## [3]   200 229    30
## [4]    20  49    30
## [5]    90 119    30
## [6]   300 329    30
plotRanges(allRanges, main = "All Ranges")

plot of chunk unnamed-chunk-3

With this object we can now find all the pieces that compose it.

pieces <- disjoin(allRanges)
pieces
## IRanges of length 8
##     start end width
## [1]     1  19    19
## [2]    20  30    11
## [3]    31  49    19
## [4]    90  99    10
## [5]   100 119    20
## [6]   120 129    10
## [7]   200 229    30
## [8]   300 329    30
plotRanges(pieces, main = "Pieces (disjoint ranges)")

plot of chunk unnamed-chunk-4

Now we can proceed to find the pieces that are present in more than one sample.

ov <- findOverlaps(pieces, allRanges)
ov.mat <- as.matrix(ov)
ov.mat
##       queryHits subjectHits
##  [1,]         1           1
##  [2,]         2           1
##  [3,]         2           4
##  [4,]         3           4
##  [5,]         4           5
##  [6,]         5           2
##  [7,]         5           5
##  [8,]         6           2
##  [9,]         7           3
## [10,]         8           6

hits <- table(ov.mat[, "queryHits"])
idx <- which(hits > 1)
## If you wanted those that were present in all samples you can do something
## like this
idx2 <- which(hits == length(gListData))
## In this case they are the same because we only have 2 samples.
identical(idx, idx2)
## [1] TRUE

Finally, we can find the ranges that we are interested in.

pieces[idx]
## IRanges of length 2
##     start end width
## [1]    20  30    11
## [2]   100 119    20
plotRanges(pieces[idx], main = "Collapsed ranges")

plot of chunk unnamed-chunk-6

Once you have the ranges of interest, you might be interested in plotting some information for these genomic positions using ggbio or other visualization packages.

More resources

The IRanges vignette from the IRanges site.

Table 4 from the useR2013 Bioconductor tutorial (page 7). The materials are available here.

Reproducibility

proc.time()
##    user  system elapsed 
##   3.745   0.096   3.842
Sys.time()
## [1] "2013-08-08 14:37:37 EDT"
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] GenomicRanges_1.12.4 IRanges_1.18.2       BiocGenerics_0.6.0  
## [4] knitr_1.3           
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.3   evaluate_0.4.4 formatR_0.9    stats4_3.0.1  
## [5] stringr_0.6.2  tools_3.0.1