Imagine that you have data in the form of ranges from several samples. The ranges are non-overlapping in each sample, but you want to find the common sections across the different samples. Here I show one way of doing it.
First lets build some example data.
suppressMessages(library("GenomicRanges"))
gListData <- GRangesList(GRanges(seqnames = rep("chr1", 3), ranges = IRanges(c(1,
100, 200), width = 30)), GRanges(seqnames = rep("chr1", 3), ranges = IRanges(start = c(20,
90, 300), width = 30)))
gListData
## GRangesList of length 2:
## [[1]]
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [ 1, 30] *
## [2] chr1 [100, 129] *
## [3] chr1 [200, 229] *
##
## [[2]]
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] chr1 [ 20, 49] *
## [2] chr1 [ 90, 119] *
## [3] chr1 [300, 329] *
##
## ---
## seqlengths:
## chr1
## NA
Lets take a look at the ranges in the example data.
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black",
sep = 0.5, ...) {
height <- 1
if (is(xlim, "Ranges"))
xlim <- c(min(start(xlim)), max(end(xlim)))
bins <- disjointBins(IRanges(start(x), end(x) + 1))
plot.new()
plot.window(xlim, c(0, max(bins) * (height + sep)))
ybottom <- bins * (sep + height) - height
rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col,
...)
title(main)
axis(1)
}
plotRanges(ranges(gListData[[1]]), main = "Sample 1")
plotRanges(ranges(gListData[[1]]), main = "Sample 2")
Next I extract the ranges from the GRangesList object. Note that here I am assuming that all the ranges are from the same chromosome.
allRanges <- unlist(ranges(gListData))
allRanges
## IRanges of length 6
## start end width
## [1] 1 30 30
## [2] 100 129 30
## [3] 200 229 30
## [4] 20 49 30
## [5] 90 119 30
## [6] 300 329 30
plotRanges(allRanges, main = "All Ranges")
With this object we can now find all the pieces that compose it.
pieces <- disjoin(allRanges)
pieces
## IRanges of length 8
## start end width
## [1] 1 19 19
## [2] 20 30 11
## [3] 31 49 19
## [4] 90 99 10
## [5] 100 119 20
## [6] 120 129 10
## [7] 200 229 30
## [8] 300 329 30
plotRanges(pieces, main = "Pieces (disjoint ranges)")
Now we can proceed to find the pieces that are present in more than one sample.
ov <- findOverlaps(pieces, allRanges)
ov.mat <- as.matrix(ov)
ov.mat
## queryHits subjectHits
## [1,] 1 1
## [2,] 2 1
## [3,] 2 4
## [4,] 3 4
## [5,] 4 5
## [6,] 5 2
## [7,] 5 5
## [8,] 6 2
## [9,] 7 3
## [10,] 8 6
hits <- table(ov.mat[, "queryHits"])
idx <- which(hits > 1)
## If you wanted those that were present in all samples you can do something
## like this
idx2 <- which(hits == length(gListData))
## In this case they are the same because we only have 2 samples.
identical(idx, idx2)
## [1] TRUE
Finally, we can find the ranges that we are interested in.
pieces[idx]
## IRanges of length 2
## start end width
## [1] 20 30 11
## [2] 100 119 20
plotRanges(pieces[idx], main = "Collapsed ranges")
Once you have the ranges of interest, you might be interested in plotting some information for these genomic positions using ggbio or other visualization packages.
The IRanges vignette from the IRanges site.
Table 4 from the useR2013 Bioconductor tutorial (page 7). The materials are available here.
proc.time()
## user system elapsed
## 3.745 0.096 3.842
Sys.time()
## [1] "2013-08-08 14:37:37 EDT"
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicRanges_1.12.4 IRanges_1.18.2 BiocGenerics_0.6.0
## [4] knitr_1.3
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.3 evaluate_0.4.4 formatR_0.9 stats4_3.0.1
## [5] stringr_0.6.2 tools_3.0.1