Summary
Download the .bed files as GRanges objects using rtracklayer
Width and number of overlapping intervals
Look up annotations in Homo.sapiens
Plot RTL1 capture regions in the UCSC genome browser

Summary

Whole-exome sequencing library preparation involves use of “capture kits” to select the exonic regions for sequencing. There are many such kits, each selecting slightly different genomic regions. This 5-minute analysis explores the differences between the .bed files for two Illumina whole-exome sequencing capture kits, described and downloadable here. One is described as rapid and the other expanded. This analysis quickly verifies the width and number of overlapping ranges stated by Illumina, then visualizes one example of where the two kits differ by plotting their tracks for one gene on the UCSC genome browser.

This analysis confirms those of Illumina:

This download contains the Nextera Rapid Capture Exome targeted region intervals that overlap the Nextera Rapid Capture Expanded Exome targeted regions. The total overlap is 33.0 MB between the two products. 189,522 overlapping intervals across 188,621 Exome targeted regions. This was determined by base overlap only. For example, comparing an Exome product region at Chr 1 position 300 – 500 with an Expanded Exome region at Chr 1 position 305 – 495 results in a 191 bp overlap and two 5 bp “”unique regions“” for the Exome product.

Download the .bed files as `GRanges` objects using `rtracklayer`

From the manufacturer’s website:

This download contains the Nextera Rapid Capture Exome targeted region intervals that overlap the Nextera Rapid Capture Expanded Exome targeted regions.

url1 <- "https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_exome_targetedregions.bed"
url2 <- "https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_expandedexome_targetedregions.bed"

We use the indispensible rtracklayer Bioconductor package to import the bed files. Illumina specifies that these files provide GRCh37/hg19 genomic coordinates. We label them hg19 which is the annotation used by Bioconductor Homo.sapiens annotations. Labelling them on import will make things simple later when viewing ranges on the UCSC genome browser.

suppressPackageStartupMessages({
  library(rtracklayer)
  library(GenomicRanges)
})
rapidgr <- import(url1, genome="hg19")

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

expandedgr <- import(url2, genome="hg19")

Width and number of overlapping intervals

From Illumina: The total overlap is 33.0 MB between the two products.

This is easily verified by calculating the intersection of the two kits, then the sum of the widths of the intersecting intervals:

sum(width(intersect(rapidgr, expandedgr))) / 1e6

## [1] 32.97991

From Illumina: there are 189,522 overlapping intervals

This is the number of overlap pairs:

sum(countOverlaps(rapidgr, expandedgr))

## [1] 189522

The overlapping pairs can be identified, matched, and used in a paired format:

findOverlapPairs(rapidgr, expandedgr)

## Pairs object with 189522 pairs and 0 metadata columns:
##                             first                 second
##                        <UCSCData>              <GRanges>
##        [1]       chr1:69089-70010       chr1:69091-70008
##        [2]     chr1:664485-665108     chr1:661140-665184
##        [3]     chr1:762080-762571     chr1:761587-762902
##        [4]     chr1:861320-861395     chr1:861302-861393
##        [5]     chr1:865533-865718     chr1:865535-865716
##        ...                    ...                    ...
##   [189518] chrY:22941393-22941554 chrY:22941395-22941552
##   [189519] chrY:22942815-22942920 chrY:22942817-22942916
##   [189520] chrY:24457011-24457119 chrY:24457007-24457119
##   [189521] chrY:24457542-24457645 chrY:24457542-24457645
##   [189522] chrY:24460640-24460912 chrY:24460640-24462350

first(findOverlapPairs(rapidgr, expandedgr))

## UCSC track 'CEX_Manifest_01242013'
## UCSCData object with 189522 ranges and 1 metadata column:
##            seqnames               ranges strand |
##               <Rle>            <IRanges>  <Rle> |
##        [1]     chr1     [ 69089,  70010]      * |
##        [2]     chr1     [664485, 665108]      * |
##        [3]     chr1     [762080, 762571]      * |
##        [4]     chr1     [861320, 861395]      * |
##        [5]     chr1     [865533, 865718]      * |
##        ...      ...                  ...    ... .
##   [189518]     chrY [22941393, 22941554]      * |
##   [189519]     chrY [22942815, 22942920]      * |
##   [189520]     chrY [24457011, 24457119]      * |
##   [189521]     chrY [24457542, 24457645]      * |
##   [189522]     chrY [24460640, 24460912]      * |
##                                  name
##                           <character>
##        [1]       CEX-chr1-69089-70010
##        [2]     CEX-chr1-664485-665108
##        [3]     CEX-chr1-762080-762571
##        [4]     CEX-chr1-861320-861395
##        [5]     CEX-chr1-865533-865718
##        ...                        ...
##   [189518] CEX-chrY-22941393-22941554
##   [189519] CEX-chrY-22942815-22942920
##   [189520] CEX-chrY-24457011-24457119
##   [189521] CEX-chrY-24457542-24457645
##   [189522] CEX-chrY-24460640-24460912
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

I was curious about whether the overlaps are 1:1. Most are, but some are 1:2 and there is even a single 1:6 match!

table(countOverlaps(rapidgr, expandedgr))

## 
##      0      1      2      3      4      6 
##  23537 187763    823     29      5      1

So let’s look at that 1:6 match in more detail. This narrows rapidgr to the single rapid range in that match:

(rapidgr <- rapidgr[countOverlaps(rapidgr, expandedgr) == 6])

## UCSC track 'CEX_Manifest_01242013'
## UCSCData object with 1 range and 1 metadata column:
##       seqnames                 ranges strand |
##          <Rle>              <IRanges>  <Rle> |
##   [1]    chr14 [101347048, 101351125]      * |
##                                name
##                         <character>
##   [1] CEX-chr14-101347048-101351125
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

And the expanded ranges:

(expandedgr <- expandedgr[expandedgr %over% rapidgr])

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames                 ranges strand |
##          <Rle>              <IRanges>  <Rle> |
##   [1]    chr14 [101346992, 101347343]      * |
##   [2]    chr14 [101347458, 101348222]      * |
##   [3]    chr14 [101348316, 101349315]      * |
##   [4]    chr14 [101349413, 101350819]      * |
##   [5]    chr14 [101350913, 101351038]      * |
##   [6]    chr14 [101351121, 101351184]      * |
##                                 name
##                          <character>
##   [1] RTL1-chr14-101346992-101347343
##   [2] RTL1-chr14-101347458-101348222
##   [3] RTL1-chr14-101348316-101349315
##   [4] RTL1-chr14-101349413-101350819
##   [5] RTL1-chr14-101350913-101351038
##   [6] RTL1-chr14-101351121-101351184
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Look up annotations in `Homo.sapiens`

Let’s see what genes are in this region, creating a GRanges vector of all genes (annotated with Entrez ID and HGNC gene symbols), then subset it to only ranges that overlap with the rapid kit:

suppressPackageStartupMessages(library(Homo.sapiens))
gn <- genes(Homo.sapiens, columns=c("ENTREZID", "SYMBOL"))

## 'select()' returned 1:1 mapping between keys and columns

(gn <- gn[gn %over% rapidgr])

## GRanges object with 6 ranges and 2 metadata columns:
##          seqnames                 ranges strand |          SYMBOL
##             <Rle>              <IRanges>  <Rle> | <CharacterList>
##   388015    chr14 [101346992, 101351184]      - |            RTL1
##   406914    chr14 [101349316, 101349412]      + |          MIR127
##   406927    chr14 [101351039, 101351120]      + |          MIR136
##   574034    chr14 [101348223, 101348315]      + |          MIR433
##   574038    chr14 [101347344, 101347457]      + |          MIR431
##   574451    chr14 [101350820, 101350913]      + |          MIR432
##              ENTREZID
##          <FactorList>
##   388015       388015
##   406914       406914
##   406927       406927
##   574034       574034
##   574038       574038
##   574451       574451
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

So this region contains the RTL1 gene on the antisense strand and five microRNA on the opposing strand. We could explore the overlaps between these and the capture kits further with GenomicRanges, but it becomes clear immediately with one peek in UCSC…

Plot RTL1 capture regions in the UCSC genome browser

First, create the session using rtracklayer, and add the two tracks from this region of our two capture kits:

session <- browserSession("UCSC")
track(session, "rapid") <- rapidgr
track(session, "expanded") <- expandedgr

Upload this session to the UCSC genome browser, using *0.75 to zoom out 25%. No installations needed, this should pop up in your web browser:

browserView(session, range=rapidgr*0.75)

which makes it immediately clear that the “rapid” kit covers the full length of the RTL1 gene, whereas the “expanded” kit removes the ranges that also contain a microRNA on the opposing strand.

Comparing two Illumina exome sequencing capture kits

Levi Waldron

2/10/2018

Summary

Download the .bed files as `GRanges` objects using `rtracklayer`

Width and number of overlapping intervals

Look up annotations in `Homo.sapiens`

Plot RTL1 capture regions in the UCSC genome browser

Comparing two Illumina exome sequencing capture kits

Levi Waldron

2/10/2018

Summary

Download the .bed files as GRanges objects using rtracklayer

Width and number of overlapping intervals

Look up annotations in Homo.sapiens

Plot RTL1 capture regions in the UCSC genome browser

Download the .bed files as `GRanges` objects using `rtracklayer`

Look up annotations in `Homo.sapiens`