Whole-exome sequencing library preparation involves use of “capture kits” to select the exonic regions for sequencing. There are many such kits, each selecting slightly different genomic regions. This 5-minute analysis explores the differences between the .bed files for two Illumina whole-exome sequencing capture kits, described and downloadable here. One is described as rapid and the other expanded. This analysis quickly verifies the width and number of overlapping ranges stated by Illumina, then visualizes one example of where the two kits differ by plotting their tracks for one gene on the UCSC genome browser.
This analysis confirms those of Illumina:
This download contains the Nextera Rapid Capture Exome targeted region intervals that overlap the Nextera Rapid Capture Expanded Exome targeted regions. The total overlap is 33.0 MB between the two products. 189,522 overlapping intervals across 188,621 Exome targeted regions. This was determined by base overlap only. For example, comparing an Exome product region at Chr 1 position 300 – 500 with an Expanded Exome region at Chr 1 position 305 – 495 results in a 191 bp overlap and two 5 bp “”unique regions“” for the Exome product.
GRanges
objects using rtracklayer
From the manufacturer’s website:
This download contains the Nextera Rapid Capture Exome targeted region intervals that overlap the Nextera Rapid Capture Expanded Exome targeted regions.
url1 <- "https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_exome_targetedregions.bed"
url2 <- "https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_expandedexome_targetedregions.bed"
We use the indispensible rtracklayer
Bioconductor package to import the bed files. Illumina specifies that these files provide GRCh37/hg19
genomic coordinates. We label them hg19
which is the annotation used by Bioconductor Homo.sapiens
annotations. Labelling them on import will make things simple later when viewing ranges on the UCSC genome browser.
suppressPackageStartupMessages({
library(rtracklayer)
library(GenomicRanges)
})
rapidgr <- import(url1, genome="hg19")
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
expandedgr <- import(url2, genome="hg19")
This is easily verified by calculating the intersection of the two kits, then the sum of the widths of the intersecting intervals:
sum(width(intersect(rapidgr, expandedgr))) / 1e6
## [1] 32.97991
This is the number of overlap pairs:
sum(countOverlaps(rapidgr, expandedgr))
## [1] 189522
The overlapping pairs can be identified, matched, and used in a paired format:
findOverlapPairs(rapidgr, expandedgr)
## Pairs object with 189522 pairs and 0 metadata columns:
## first second
## <UCSCData> <GRanges>
## [1] chr1:69089-70010 chr1:69091-70008
## [2] chr1:664485-665108 chr1:661140-665184
## [3] chr1:762080-762571 chr1:761587-762902
## [4] chr1:861320-861395 chr1:861302-861393
## [5] chr1:865533-865718 chr1:865535-865716
## ... ... ...
## [189518] chrY:22941393-22941554 chrY:22941395-22941552
## [189519] chrY:22942815-22942920 chrY:22942817-22942916
## [189520] chrY:24457011-24457119 chrY:24457007-24457119
## [189521] chrY:24457542-24457645 chrY:24457542-24457645
## [189522] chrY:24460640-24460912 chrY:24460640-24462350
first(findOverlapPairs(rapidgr, expandedgr))
## UCSC track 'CEX_Manifest_01242013'
## UCSCData object with 189522 ranges and 1 metadata column:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 [ 69089, 70010] * |
## [2] chr1 [664485, 665108] * |
## [3] chr1 [762080, 762571] * |
## [4] chr1 [861320, 861395] * |
## [5] chr1 [865533, 865718] * |
## ... ... ... ... .
## [189518] chrY [22941393, 22941554] * |
## [189519] chrY [22942815, 22942920] * |
## [189520] chrY [24457011, 24457119] * |
## [189521] chrY [24457542, 24457645] * |
## [189522] chrY [24460640, 24460912] * |
## name
## <character>
## [1] CEX-chr1-69089-70010
## [2] CEX-chr1-664485-665108
## [3] CEX-chr1-762080-762571
## [4] CEX-chr1-861320-861395
## [5] CEX-chr1-865533-865718
## ... ...
## [189518] CEX-chrY-22941393-22941554
## [189519] CEX-chrY-22942815-22942920
## [189520] CEX-chrY-24457011-24457119
## [189521] CEX-chrY-24457542-24457645
## [189522] CEX-chrY-24460640-24460912
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
I was curious about whether the overlaps are 1:1. Most are, but some are 1:2 and there is even a single 1:6 match!
table(countOverlaps(rapidgr, expandedgr))
##
## 0 1 2 3 4 6
## 23537 187763 823 29 5 1
So let’s look at that 1:6 match in more detail. This narrows rapidgr
to the single rapid range in that match:
(rapidgr <- rapidgr[countOverlaps(rapidgr, expandedgr) == 6])
## UCSC track 'CEX_Manifest_01242013'
## UCSCData object with 1 range and 1 metadata column:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr14 [101347048, 101351125] * |
## name
## <character>
## [1] CEX-chr14-101347048-101351125
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
And the expanded ranges:
(expandedgr <- expandedgr[expandedgr %over% rapidgr])
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr14 [101346992, 101347343] * |
## [2] chr14 [101347458, 101348222] * |
## [3] chr14 [101348316, 101349315] * |
## [4] chr14 [101349413, 101350819] * |
## [5] chr14 [101350913, 101351038] * |
## [6] chr14 [101351121, 101351184] * |
## name
## <character>
## [1] RTL1-chr14-101346992-101347343
## [2] RTL1-chr14-101347458-101348222
## [3] RTL1-chr14-101348316-101349315
## [4] RTL1-chr14-101349413-101350819
## [5] RTL1-chr14-101350913-101351038
## [6] RTL1-chr14-101351121-101351184
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Homo.sapiens
Let’s see what genes are in this region, creating a GRanges
vector of all genes (annotated with Entrez ID and HGNC gene symbols), then subset it to only ranges that overlap with the rapid kit:
suppressPackageStartupMessages(library(Homo.sapiens))
gn <- genes(Homo.sapiens, columns=c("ENTREZID", "SYMBOL"))
## 'select()' returned 1:1 mapping between keys and columns
(gn <- gn[gn %over% rapidgr])
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | SYMBOL
## <Rle> <IRanges> <Rle> | <CharacterList>
## 388015 chr14 [101346992, 101351184] - | RTL1
## 406914 chr14 [101349316, 101349412] + | MIR127
## 406927 chr14 [101351039, 101351120] + | MIR136
## 574034 chr14 [101348223, 101348315] + | MIR433
## 574038 chr14 [101347344, 101347457] + | MIR431
## 574451 chr14 [101350820, 101350913] + | MIR432
## ENTREZID
## <FactorList>
## 388015 388015
## 406914 406914
## 406927 406927
## 574034 574034
## 574038 574038
## 574451 574451
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
So this region contains the RTL1 gene on the antisense strand and five microRNA on the opposing strand. We could explore the overlaps between these and the capture kits further with GenomicRanges
, but it becomes clear immediately with one peek in UCSC…
First, create the session using rtracklayer
, and add the two tracks from this region of our two capture kits:
session <- browserSession("UCSC")
track(session, "rapid") <- rapidgr
track(session, "expanded") <- expandedgr
Upload this session to the UCSC genome browser, using *0.75
to zoom out 25%. No installations needed, this should pop up in your web browser:
browserView(session, range=rapidgr*0.75)
which makes it immediately clear that the “rapid” kit covers the full length of the RTL1 gene, whereas the “expanded” kit removes the ranges that also contain a microRNA on the opposing strand.