Overview

valr provides tools to read and manipulate genome intervals and signals, similar to the standalone BEDtools suite. valr enables BEDtools-like analysis in the R/RStudio environment, and uses modern R tools for a terse, expressive syntax. Compute-intensive algorithms are implemented in Rcpp/C++, and many methods take advantage of the speed and grouping capability provided by dplyr.

We provide several introductions to valr:

This comprehensive [vignette][] that covers the core methods.
A tutorial that demonstrates how to use valr in “real-world” applications.
A shiny application that demonstrates interactive analysis of genome-scale data sets.

Installation

valr can be installed from github, and will be eventually deposited in CRAN.

> devtools::install_github('jayhesselberth/valr')

Comparison to other tools

Why another tool set for interval manipulations? We already have BEDtools, bedops, pybedtools, GenomeRanges, bedr and IRanges.

We were motivated to develop a toolset that:

Combines analysis and visualization in RStudio.
Can be used to generate reports with Rmarkdown.
Is highly extensible. New tools are quickly implemented on the R side.
Leverages the “modern R” syntax, using dplyr and the pipe operator from magrittr (%>%).
Maximizes speed by implementing compute-intensive algorithms in Rcpp.
Facilitates interactive visulaizations with shiny.

We anticipate valr will mainly be used for analysis of pre-processed data in BED, bedGraph and VCF formats. Most users will have processed their aligned reads from BAM format to bedGraph, so we do not foresee supporting BAM directly. We would entertain requests for GTF / GFF support if there is interest, as tidyr makes it easy to convert these to BED12.

At this point you might be expecting plots of speed / memory usage versus interval number. Certain algorithms in valr were implemented in Rcpp to be fast (including intersect, merge, subtract, closest), enabling fluid interactive analysis. See the benchmarks section for demonstrations.

Non-standard evaluation

Several of the methods in valr use NSE for an expressive syntax. Columns are referred to by name and can be used in multiple name/value expressions for summaries.

bed_map(x, y, mean = mean(value), var = var(value))
bed_merge(x, concat = concat(value), max = max(value))

Getting started

Here is an example using valr that creates

library(valr)

Reading data

valr has several methods to read interval data. These methods:

Take local files or URLs as input.
Return a data.frame in tibble::tbl_df format.
Assign consistent chrom, start and end column names.
Use readr for speed.
Coerce column types.

The methods include:

read_bed(): read a BED3+ file
read_bed12(): read a BED12 file
read_bedgraph(): read a bedGraph file
read_genome(): read a UCSC “chrom size” file
read_vcf(): read the Variant Call Format
read_bigwig(): read UCSC bigWig files

BED files

read_bed(system.file('extdata', '3fields.bed.gz', package = 'valr'))
#> # A tibble: 10 x 3
#>    chrom  start    end
#>    <chr>  <int>  <int>
#> 1   chr1  11873  14409
#> 2   chr1  14361  19759
#> 3   chr1  14406  29370
#> 4   chr1  34610  36081
#> 5   chr1  69090  70008
#> 6   chr1 134772 140566
#> 7   chr1 321083 321115
#> 8   chr1 321145 321207
#> 9   chr1 322036 326938
#> 10  chr1 327545 328439
read_bed(n_fields = 6, system.file('extdata', '6fields.bed.gz', package = 'valr'))
#> # A tibble: 10 x 6
#>    chrom  start    end         name score strand
#>    <chr>  <int>  <int>        <chr> <chr>  <chr>
#> 1   chr1  11873  14409      DDX11L1     3      +
#> 2   chr1  14361  19759       WASH7P    10      -
#> 3   chr1  14406  29370       WASH7P     7      -
#> 4   chr1  34610  36081      FAM138F     3      -
#> 5   chr1  69090  70008        OR4F5     1      +
#> 6   chr1 134772 140566    LOC729737     3      -
#> 7   chr1 321083 321115     DQ597235     1      +
#> 8   chr1 321145 321207     DQ599768     1      +
#> 9   chr1 322036 326938 LOC100133331     3      +
#> 10  chr1 327545 328439    LOC388312     1      +
read_bed12(system.file('extdata', '12fields.bed.gz', package = 'valr'))
#> # A tibble: 3 x 12
#>   chrom   start     end      name score strand cds_start cds_end item_rgb
#>   <chr>   <int>   <int>     <chr> <chr>  <chr>     <int>   <int>    <chr>
#> 1  chr1 4797973 4836816  testgene     1      +   4797973 4836816        .
#> 2 chr10 4848118 4880877 diffchrom     1      +   4848118 4880877        .
#> 3 chr20 5073253 5152630 negstrand     1      -   5073253 5152630        .
#> # ... with 3 more variables: exon_count <int>, exon_sizes <chr>,
#> #   exon_starts <chr>

Genome files

genome <- read_genome(system.file('extdata', 'hg19.chrom.sizes.gz', package = 'valr'))

bedGraph files

read_bedgraph(system.file('extdata', 'test.bg.gz', package = 'valr'))
#> # A tibble: 4 x 4
#>   chrom    start      end value
#>   <chr>    <int>    <int> <dbl>
#> 1 chr19 49302000 49302300 -1.00
#> 2 chr19 49302300 49302600 -0.75
#> 3 chr19 49302600 49302900 -0.50
#> 4 chr19 49302900 49303200 -0.25

VCF files

read_vcf() reads VCF files and assigns chrom, start and end columns to be used to downstream interval comparisons. Note the interval size is calculated as the length of the REF field in the original file.

read_vcf(system.file('extdata', 'test.vcf.gz', package = 'valr'))
#> # A tibble: 115 x 201
#>    CHROM   POS    ID                                          REF
#>    <int> <int> <dbl>                                        <chr>
#> 1      1 10172     0                                       CCCTAA
#> 2      1 10390     0 CCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA
#> 3      1 10397     0                                CCCCTAACCCTAA
#> 4      1 10403     0       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
#> 5      1 10409     0             ACCCTAACCCTAACCCTAACCCTAACCCTAAC
#> 6      1 10415     0                   ACCCTAACCCTAACCCTAACCCTAAC
#> 7      1 10421     0                         ACCCTAACCCTAACCCTAAC
#> 8      1 10428     0                                       CCCTAA
#> 9      1 10440     0                                            C
#> 10     1 10478     0                                            C
#> # ... with 105 more rows, and 197 more variables: ALT <chr>, QUAL <dbl>,
#> #   FILTER <chr>, INFO <chr>, FORMAT <chr>, 101976-101976 <chr>,
#> #   100920-100920 <chr>, 100231-100231 <chr>, 100232-100232 <chr>,
#> #   100919-100919 <chr>, 101977-101977 <chr>, 100630-100630 <chr>,
#> #   100640-100640 <chr>, 100631-100631 <chr>, 101583-101583 <chr>,
#> #   101584-101584 <chr>, 101732-101732 <chr>, 101016-101016 <chr>,
#> #   101708-101708 <chr>, 101730-101730 <chr>, 101582-101582 <chr>,
#> #   101731-101731 <chr>, 101809-101809 <chr>, 101653-101653 <chr>,
#> #   101652-101652 <chr>, 101806-101806 <chr>, 101807-101807 <chr>,
#> #   101808-101808 <chr>, 101810-101810 <chr>, 101811-101811 <chr>,
#> #   101812-101812 <chr>, 101813-101813 <chr>, 101814-101814 <chr>,
#> #   101957-101957 <chr>, 101958-101958 <chr>, 100986-100986 <chr>,
#> #   100987-100987 <chr>, 101897-101897 <chr>, 102071-102071 <chr>,
#> #   102106-102106 <chr>, 101040-101040 <chr>, 101167-101167 <chr>,
#> #   101041-101041 <chr>, 101042-101042 <chr>, 101168-101168 <chr>,
#> #   101898-101898 <chr>, 102070-102070 <chr>, 102111-102111 <chr>,
#> #   100988-100988 <chr>, 101896-101896 <chr>, 102110-102110 <chr>,
#> #   102620-102620 <chr>, 102621-102621 <chr>, 102947-102947 <chr>,
#> #   102948-102948 <chr>, 103089-103089 <chr>, 102693-102693 <chr>,
#> #   102694-102694 <chr>, 102949-102949 <chr>, 102622-102622 <chr>,
#> #   102623-102623 <chr>, 102624-102624 <chr>, 102722-102722 <chr>,
#> #   102712-102712 <chr>, 103339-103339 <chr>, 103124-103124 <chr>,
#> #   103161-103161 <chr>, 103125-103125 <chr>, 103171-103171 <chr>,
#> #   103338-103338 <chr>, 103372-103372 <chr>, 103193-103193 <chr>,
#> #   100195-100195 <chr>, 100194-100194 <chr>, 101667-101667 <chr>,
#> #   101137-101137 <chr>, 101291-101291 <chr>, 101292-101292 <chr>,
#> #   101666-101666 <chr>, 32222-32222 <chr>, 32049-32049 <chr>,
#> #   32050-32050 <chr>, 32411-32411 <chr>, 32221-32221 <chr>,
#> #   100147-100147 <chr>, 100149-100149 <chr>, 100243-100243 <chr>,
#> #   100290-100290 <chr>, 100753-100753 <chr>, 100754-100754 <chr>,
#> #   101377-101377 <chr>, 101426-101426 <chr>, 101435-101435 <chr>,
#> #   101523-101523 <chr>, 101877-101877 <chr>, 100043-100043 <chr>,
#> #   100044-100044 <chr>, 100148-100148 <chr>, 100292-100292 <chr>,
#> #   101381-101381 <chr>, ...

bigWig files

read_bigwig() uses bigwrig reads local bigWig files into a data_frame with chrom, start, end and value columns. URLs are not supported yet.

read_bigwig(system.file('extdata', 'test.bw', package = 'valr'))

Interval manipulations

valr implements several methods for manipulating sets of intervals. Some methods operate on a single set of intervals, while others compare two sets of intervals.

Many methods the same name as the corresponding BEDtool, and some commonly used BEDtools are implemented as dplyr pipes (e.g., see the group_by section).

All methods accept one or more sets of x and y intervals, which must either be created using the read methods, or have chrom, start and end columns.

Single set operations

These methods operate on a single set of intervals:

bed_sort(): order intervals
bed_cluster(): Cluster (but don’t merge) overlapping/nearby intervals.
bed_complement(): extract intervals not represented by an interval file.
bed_merge(): combine overlapping and nearby intervals into a single interval.
bed_flank(): Generate new flanking intervals
bed_slop(): Expand the size of input intervals
bed_shift(): Shift the coordinates of an input set, bounded by a genome

Sort

bed_sort orders intervals based on a specification. is_sorted asks whether a tbl is already sorted.

x <- bed_random(genome)
is_sorted(x)
#> [1] FALSE

y <- bed_sort(x)
y
#> # A tibble: 1,000,000 x 3
#>    chrom start   end
#>    <chr> <int> <int>
#> 1   chr1    91  1091
#> 2   chr1   749  1749
#> 3   chr1  2155  3155
#> 4   chr1  2165  3165
#> 5   chr1  2744  3744
#> 6   chr1  3649  4649
#> 7   chr1  6237  7237
#> 8   chr1  6316  7316
#> 9   chr1  6333  7333
#> 10  chr1 11114 12114
#> # ... with 999,990 more rows

is_sorted(y)
#> [1] TRUE

Cluster

bed_cluster identifies clustered intervals and assigns them a unique .id.

x <- bed_random(genome)
y <- bed_cluster(x)
y
#> # A tibble: 1,000,000 x 4
#>    chrom     start       end   .id
#>    <chr>     <int>     <int> <int>
#> 1  chr22  29116064  29117064  9381
#> 2   chr1 170475087 170476087 54904
#> 3   chr2  47683499  47684499 15365
#> 4   chr9  30957167  30958167 10004
#> 5   chr8 125471327 125472327 40279
#> 6   chr1  45061415  45062415 14410
#> 7  chr22  17509928  17510928  5647
#> 8  chr20  35586901  35587901 11493
#> 9  chr15  69386523  69387523 22594
#> 10  chr2 182465802 182466802 58813
#> # ... with 999,990 more rows

Complement

bed_complement identifies intervals in a genome that are not covered by an input.

x <- bed_random(genome)

bed_complement(x, genome)
#> # A tibble: 724,115 x 3
#>     chrom start   end
#>    <fctr> <int> <int>
#> 1    chr1     1  2162
#> 2    chr1  3162  6880
#> 3    chr1  7880 10679
#> 4    chr1 11679 16356
#> 5    chr1 17403 17637
#> 6    chr1 19295 19873
#> 7    chr1 22024 22833
#> 8    chr1 23833 24216
#> 9    chr1 25216 30808
#> 10   chr1 31808 33847
#> # ... with 724,105 more rows

Merge

bed_merge identifies overlapping intervals and reports new merged ones. is_merged asks whether a tbl is already merged. Values from merged intervals can be reported using name / value pairs.

# 1e6 random intervals
n <- 1e6
x <- bed_random(genome, n = n)
is_merged(x)
#> [1] FALSE
x
#> # A tibble: 1,000,000 x 3
#>    chrom     start       end
#>    <chr>     <int>     <int>
#> 1   chr5  27943564  27944564
#> 2   chr2 223883634 223884634
#> 3   chr8 107641024 107642024
#> 4   chr4  26203166  26204166
#> 5   chr3  32827287  32828287
#> 6   chr1 133146878 133147878
#> 7   chr6 121981429 121982429
#> 8   chr2   8503021   8504021
#> 9  chr15  93106998  93107998
#> 10  chr4 184553192 184554192
#> # ... with 999,990 more rows

# add some signal
x <- x %>% mutate(signal = runif(n))

y <- bed_merge(x)
is_merged(y)
#> [1] TRUE
y
#> # A tibble: 723,104 x 3
#>    chrom start   end
#>    <chr> <int> <int>
#> 1   chr1   501  1501
#> 2   chr1  1645  2645
#> 3   chr1  4062  5062
#> 4   chr1 11407 12407
#> 5   chr1 16855 17855
#> 6   chr1 19066 20900
#> 7   chr1 24519 25672
#> 8   chr1 26156 27473
#> 9   chr1 29638 30638
#> 10  chr1 32018 33018
#> # ... with 723,094 more rows

bed_merge(x, maxs = max(signal))
#> # A tibble: 723,104 x 4
#>    chrom start   end       maxs
#>    <chr> <int> <int>      <dbl>
#> 1   chr1   501  1501 0.88223255
#> 2   chr1  1645  2645 0.70457769
#> 3   chr1  4062  5062 0.76373882
#> 4   chr1 11407 12407 0.04712881
#> 5   chr1 16855 17855 0.16967085
#> 6   chr1 19066 20900 0.59457506
#> 7   chr1 24519 25672 0.58579493
#> 8   chr1 26156 27473 0.93904880
#> 9   chr1 29638 30638 0.08209918
#> 10  chr1 32018 33018 0.42858058
#> # ... with 723,094 more rows

Flank

bed_flank creates new intervals that flank - but do not contain - the input intervals.

bed_flank(x, genome, both = 100)

Slop

bed_slop pads input intervals based on a specification

bed_slop(x, genome, both = 100)

Shift

bed_shift adjusts coordinates toward start or end by a defined size. Intervals created out of bounds are removed, or trimmed.

bed_shift(x, genome, size = 100)
#> # A tibble: 999,999 x 4
#>    chrom     start       end     signal
#>    <chr>     <dbl>     <dbl>      <dbl>
#> 1   chr5  27943664  27944664 0.54661972
#> 2   chr2 223883734 223884734 0.88471469
#> 3   chr8 107641124 107642124 0.52616661
#> 4   chr4  26203266  26204266 0.32368260
#> 5   chr3  32827387  32828387 0.94392807
#> 6   chr1 133146978 133147978 0.27973158
#> 7   chr6 121981529 121982529 0.23708108
#> 8   chr2   8503121   8504121 0.57977029
#> 9  chr15  93107098  93108098 0.33198180
#> 10  chr4 184553292 184554292 0.01883723
#> # ... with 999,989 more rows

Multiple set operations

These methods compare two sets of intervals:

bed_intersect(): find overlapping intervals
bed_map(): apply a function to selected columns for overlapping intervals
bed_subtract(): Remove intervals based on overlaps between two files
bed_window(): Find overlapping intervals within a window
bed_closest(): find the closest intervals independent of overlaps

Intersection

bed_intersect is implemented using an interval tree in Rcpp. Column names in the result have .x and .y suffixes, and an .overlap column contains the size of the intersection (values of 0 indicate book-ended, or touching intervals). See the benchmarks section for timing. Though bed_intersect is pretty fast already, we intend to further improve upon this by parallization with RcppParallel.

# intersect two sets of 1e6 intervals from hg19
x <- bed_random(genome)
y <- bed_random(genome)

bed_intersect(x, y)
#> # A tibble: 646,110 x 6
#>    chrom start.x  end.x start.y  end.y .overlap
#>    <chr>   <int>  <int>   <int>  <int>    <int>
#> 1   chr1    9762  10762    9884  10884      878
#> 2   chr1   19890  20890   19655  20655      765
#> 3   chr1   19946  20946   19655  20655      709
#> 4   chr1   29590  30590   28604  29604       14
#> 5   chr1   29590  30590   28618  29618       28
#> 6   chr1   75235  76235   75546  76546      689
#> 7   chr1   92843  93843   92721  93721      878
#> 8   chr1  104230 105230  103620 104620      390
#> 9   chr1  104230 105230  104774 105774      456
#> 10  chr1  133598 134598  132750 133750      152
#> # ... with 646,100 more rows

# records in `x` with no overlaps in `y` 
bed_intersect(x, y, invert = TRUE)
#> Source: local data frame [524,270 x 3]
#> Groups: chrom [25]
#> 
#>    chrom    start      end
#>    <chr>    <int>    <int>
#> 1   chrY 59362073 59363073
#> 2   chrY 59355115 59356115
#> 3   chrY 59340287 59341287
#> 4   chrY 59339844 59340844
#> 5   chrY 59329166 59330166
#> 6   chrY 59307757 59308757
#> 7   chrY 59297407 59298407
#> 8   chrY 59291701 59292701
#> 9   chrY 59275268 59276268
#> 10  chrY 59256992 59257992
#> # ... with 524,260 more rows

One can achieve behavior similar to BEDtools by combining bed_intersect with dplyr tools.

# `x` records with overlaps (i.e., `-wa`)
bed_intersect(x, y) %>% select(chrom, start = start.x, end = end.x)
#> # A tibble: 646,110 x 3
#>    chrom  start    end
#>    <chr>  <int>  <int>
#> 1   chr1   9762  10762
#> 2   chr1  19890  20890
#> 3   chr1  19946  20946
#> 4   chr1  29590  30590
#> 5   chr1  29590  30590
#> 6   chr1  75235  76235
#> 7   chr1  92843  93843
#> 8   chr1 104230 105230
#> 9   chr1 104230 105230
#> 10  chr1 133598 134598
#> # ... with 646,100 more rows

# `y` records with overlaps (i.e., `-wb`)
bed_intersect(x, y) %>% select(chrom, start = start.y, end = end.y)
#> # A tibble: 646,110 x 3
#>    chrom  start    end
#>    <chr>  <int>  <int>
#> 1   chr1   9884  10884
#> 2   chr1  19655  20655
#> 3   chr1  19655  20655
#> 4   chr1  28604  29604
#> 5   chr1  28618  29618
#> 6   chr1  75546  76546
#> 7   chr1  92721  93721
#> 8   chr1 103620 104620
#> 9   chr1 104774 105774
#> 10  chr1 132750 133750
#> # ... with 646,100 more rows

# Unique records in `x` (i.e., `-u`)
bed_intersect(x, y) %>% select(chrom, start = start.x, end = end.x) %>% unique()
#> # A tibble: 475,656 x 3
#>    chrom  start    end
#>    <chr>  <int>  <int>
#> 1   chr1   9762  10762
#> 2   chr1  19890  20890
#> 3   chr1  19946  20946
#> 4   chr1  29590  30590
#> 5   chr1  75235  76235
#> 6   chr1  92843  93843
#> 7   chr1 104230 105230
#> 8   chr1 133598 134598
#> 9   chr1 138702 139702
#> 10  chr1 139417 140417
#> # ... with 475,646 more rows

# Count `y` intervals that intersect each `x` interval
bed_intersect(x, y) %>% group_by(chrom, start.x, end.x) %>% summarize(count = n())
#> Source: local data frame [475,656 x 4]
#> Groups: chrom, start.x [?]
#> 
#>    chrom start.x  end.x count
#>    <chr>   <int>  <int> <int>
#> 1   chr1    9762  10762     1
#> 2   chr1   19890  20890     1
#> 3   chr1   19946  20946     1
#> 4   chr1   29590  30590     2
#> 5   chr1   75235  76235     1
#> 6   chr1   92843  93843     1
#> 7   chr1  104230 105230     2
#> 8   chr1  133598 134598     1
#> 9   chr1  138702 139702     2
#> 10  chr1  139417 140417     3
#> # ... with 475,646 more rows

Map

bed_map maps signals onto intervals. Summary statistics for mapped signals can be specific using NSE with name / value pairs.

bedfile <- system.file('extdata', 'genes.hg19.chr22.bed.gz', package = 'valr')
bgfile  <- system.file('extdata', 'hela.h3k4.chip.bg.gz', package = 'valr')

x <- read_bed(bedfile, n_fields = 6)
y <- read_bedgraph(bgfile)

bed_map(x, y, means = mean(value.y), sds = sd(value.y))
#> Source: local data frame [591 x 5]
#> Groups: chrom, start.x [?]
#> 
#>    chrom  start.x    end.x    means       sds
#>    <chr>    <int>    <int>    <dbl>     <dbl>
#> 1  chr22 16150259 16193004 7.914286 7.5707309
#> 2  chr22 16162065 16172265 1.000000        NA
#> 3  chr22 16256331 16287937 1.000000 0.0000000
#> 4  chr22 17071647 17073700 1.000000 0.0000000
#> 5  chr22 17082800 17129720 1.117647 0.3270350
#> 6  chr22 17134598 17156430 1.294118 0.5878675
#> 7  chr22 17227758 17229328 1.000000        NA
#> 8  chr22 17264305 17302584 1.250000 0.5084039
#> 9  chr22 17308363 17310225 1.333333 0.5773503
#> 10 chr22 17385314 17385395 1.000000        NA
#> # ... with 581 more rows

Subtract

bed_substract() removes x intervals that intersect with y.

x <- bed_random(genome)
y <- bed_random(genome)

bed_subtract(x, y)
#> # A tibble: 723,161 x 3
#>     chrom start   end
#>    <fctr> <int> <int>
#> 1    chr1   795  1844
#> 2    chr1  4617  5617
#> 3    chr1  6350  7350
#> 4    chr1 10073 11212
#> 5    chr1 14290 15290
#> 6    chr1 17146 17605
#> 7    chr1 18758 19619
#> 8    chr1 22880 23880
#> 9    chr1 31429 34117
#> 10   chr1 38565 39565
#> # ... with 723,151 more rows

Window

bed_window() identifies y intervals that intersect an expanded window of x intervals.

x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)

# a few intersections ...
bed_intersect(x, y)
#> # A tibble: 0 x 6
#> # ... with 6 variables: chrom <chr>, start.x <int>, end.x <int>,
#> #   start.y <int>, end.y <int>, .overlap <int>

# ... can be expanded by casting a wider net
bed_window(x, y, genome, both = 1e6)
#> # A tibble: 5 x 6
#>   chrom  start.x    end.x  start.y    end.y .overlap
#>   <chr>    <int>    <int>    <int>    <int>    <int>
#> 1 chr18 64994997 64995997 64177173 64178173     1000
#> 2 chr22 17171867 17172867 17330319 17331319     1000
#> 3  chr6 73330999 73331999 74014664 74015664     1000
#> 4  chr6 85250503 85251503 85516787 85517787     1000
#> 5  chr7 11760548 11761548 11733095 11734095     1000

Closest

bed_closest() identifies y intervals that are closest to x.

x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)

bed_closest(x, y)
#> # A tibble: 157 x 7
#>    chrom   start.x     end.x   start.y     end.y .overlap .distance
#>    <chr>     <int>     <int>     <int>     <int>    <dbl>     <int>
#> 1   chr1   2082231   2083231  13672913  13673913        0  11589683
#> 2   chr1  71014696  71015696  60650968  60651968        0 -10362729
#> 3   chr1  71014696  71015696  82599956  82600956        0  11584261
#> 4   chr1  75087311  75088311  60650968  60651968        0 -14435344
#> 5   chr1  75087311  75088311  82599956  82600956        0   7511646
#> 6   chr1 120537963 120538963 102897681 102898681        0 -17639283
#> 7   chr1 120537963 120538963 134805028 134806028        0  14266066
#> 8   chr1 135634075 135635075 134805028 134806028        0   -828048
#> 9   chr1 135634075 135635075 142671350 142672350        0   7036276
#> 10  chr1 140165527 140166527 134805028 134806028        0  -5359500
#> # ... with 147 more rows

Randomzing intervals

valr provides methods for creating new random intervals or permutations of existing intervals:

bed_random generates random intervals from an input genome.
bed_shuffle shuffles coordinates given a set of input intervals.
Random sampling of input intervals is done with dplyr.

Random

bed_random generates random intervals from an input genome. The numbers of intervals from each chrom are proporitional to each chrom size.

x <- bed_random(genome, n = 1e6, length = 1e3)
x
#> # A tibble: 1,000,000 x 3
#>    chrom     start       end
#>    <chr>     <int>     <int>
#> 1  chr10  61692296  61693296
#> 2   chr8  90940249  90941249
#> 3   chr1 118873367 118874367
#> 4   chr4  49767964  49768964
#> 5   chr3 116627439 116628439
#> 6   chr2 185139966 185140966
#> 7   chr2   4886421   4887421
#> 8   chr1 230260915 230261915
#> 9  chr10  80182008  80183008
#> 10  chr5  70879026  70880026
#> # ... with 999,990 more rows

# numbers of sampled intervals are proportional to chrom size
group_by(x, chrom) %>% summarize(n = n()) %>% arrange(desc(n))
#> # A tibble: 25 x 2
#>    chrom     n
#>    <chr> <int>
#> 1   chr1 80019
#> 2   chr2 78287
#> 3   chr3 63863
#> 4   chr4 61723
#> 5   chr5 58259
#> 6   chr6 55259
#> 7   chr7 51549
#> 8   chrX 50045
#> 9   chr8 47331
#> 10  chr9 45726
#> # ... with 15 more rows

Sample

Sampling can be done using dplyr:

x <- bed_random(genome)

# sample by number
sample_n(x, 1e3, replace = FALSE)
#> # A tibble: 1,000 x 3
#>    chrom     start       end
#>    <chr>     <int>     <int>
#> 1  chr12  17069124  17070124
#> 2  chr10  31795043  31796043
#> 3   chr2  49523088  49524088
#> 4   chr5  80102671  80103671
#> 5   chr4 163573804 163574804
#> 6  chr21  36171735  36172735
#> 7  chr16  12170148  12171148
#> 8  chr13  61655971  61656971
#> 9   chr6 100866333 100867333
#> 10  chr3  97068477  97069477
#> # ... with 990 more rows

# or fraction
sample_frac(x, 0.1, replace = FALSE)
#> # A tibble: 100,000 x 3
#>    chrom     start       end
#>    <chr>     <int>     <int>
#> 1   chr7  63122310  63123310
#> 2  chr18  34178283  34179283
#> 3   chr6  26872688  26873688
#> 4   chr5 121335054 121336054
#> 5   chrX  38344551  38345551
#> 6  chr16   7495700   7496700
#> 7   chr6  11704603  11705603
#> 8  chr19   2891288   2892288
#> 9  chr14 101588608 101589608
#> 10 chr10  26708564  26709564
#> # ... with 99,990 more rows

# or sample intervals within groups
group_by(x, chrom) %>% sample_n(1)
#> Source: local data frame [25 x 3]
#> Groups: chrom [25]
#> 
#>    chrom     start       end
#>    <chr>     <int>     <int>
#> 1   chr1  31147666  31148666
#> 2  chr10  84286028  84287028
#> 3  chr11  84557236  84558236
#> 4  chr12  38171725  38172725
#> 5  chr13   3388085   3389085
#> 6  chr14 103648078 103649078
#> 7  chr15  97003992  97004992
#> 8  chr16   3709684   3710684
#> 9  chr17  75116100  75117100
#> 10 chr18  49762222  49763222
#> # ... with 15 more rows

Shuffle

bed_shuffle shuffles input intervals. Interval sizes are equal in the input and output.

y <- bed_shuffle(x)

any(x$start == y$start)

all(x$end - x$start == y$end - y$start)

Other

Calculating coverage

Instead of implementing coverage directly, coverage calculations are done with the summarize tool in dplyr.

x <- bed_random(genome)

# generate fake signals
y <- bed_random(genome) %>%
  mutate(count = floor(rnorm(n(), min=0, max=101)))

bed_intersect(x, y) %>%
  group_by(chrom, start.x, end.x) %>%
  summarize(coverage = sum(count.y))

Grouping related intervals

The group_by operation from dplyr serves a similar purpose to BEDtools groupby, and can be combined with dplyr::summarize to calculate summary statistics from groups.

x <- bed_random(genome)
y <- bed_random(genome)

bed_intersect(x, y) %>%
  group_by(chrom) %>%
  summarize(overlap.total = sum(.overlap))
#> # A tibble: 25 x 2
#>    chrom overlap.total
#>    <chr>         <int>
#> 1   chr1      25987728
#> 2  chr10      14328652
#> 3  chr11      14215152
#> 4  chr12      13935791
#> 5  chr13      11771685
#> 6  chr14      11210834
#> 7  chr15      10778166
#> 8  chr16       9058050
#> 9  chr17       8541299
#> 10 chr18       8091584
#> # ... with 15 more rows

Interval Statistics

valr provides several methods to assess statistical properties of interval sets including:

bed_fisher()
bed_absdist()
bed_reldist()
bed_jaccard()
bed_projection()

Several of these methods were described in the Genometricorr software package.

Fisher’s test

The Fisher’s test assesses whether two sets of intervals are drawn from the same background genome.

x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)

bed_fisher(x, y, genome)
#> Joining, by = "chrom"
#> Joining, by = "chrom"
#> Joining, by = "chrom"
#> Joining, by = "chrom"
#> Source: local data frame [23 x 8]
#> Groups: chrom [23]
#> 
#>    chrom   n_x   n_y   n_i size_mean      size   int_est p.value
#>    <chr> <int> <int> <int>     <dbl>     <int>     <dbl>   <dbl>
#> 1   chr1     7     1    NA      1000 249250621 249250.62      NA
#> 2  chr10     1     1    NA      1000 135534747 135534.75      NA
#> 3  chr11     7     5    NA      1000 135006516 135006.52      NA
#> 4  chr12     6     5    NA      1000 133851895 133851.89      NA
#> 5  chr13     2     4    NA      1000 115169878 115169.88      NA
#> 6  chr14     1     4    NA      1000 107349540 107349.54      NA
#> 7  chr15     8     3    NA      1000 102531392 102531.39      NA
#> 8  chr16     3     3    NA      1000  90354753  90354.75      NA
#> 9  chr17     3     4    NA      1000  81195210  81195.21      NA
#> 10 chr19     1     2    NA      1000  59128983  59128.98      NA
#> # ... with 13 more rows

Absolute distance

bed_absdist() - TBD

Relative distance

bed_reldist() tests whether two sets of (possibly non-overlapping) intervals are closer two each other in aggregate than a random set.

x <- bed_random(genome)
y <- bed_random(genome)

bed_reldist(x, y)

Jaccard similarity

bed_jaccard() quantifies the extent of overlap between to sets of intervals. The Jaccard statistic takes values of [0,1] and is measured as:

\[ J(x,y) = \frac{\mid x \bigcap y \mid} {\mid x \bigcup y \mid} = \frac{\mid x \bigcap y \mid} {\mid x \mid + \mid y \mid - \mid x \bigcap y \mid} \]

x <- bed_random(genome)
y <- bed_random(genome)

bed_jaccard(x, y)
#> # A tibble: 1 x 4
#>       len_i      len_u   jaccard      n
#>       <int>      <int>     <dbl>  <int>
#> 1 323086199 2000000000 0.1926671 646146

Projection

bed_projection() TBD

Benchmarks

# 1e6 random 1 kb intervals from hg19
x <- bed_random(genome)
small.x <- bed_random(genome, n = 100)
y <- bed_random(genome)

library(microbenchmark)
microbenchmark(
  bed_random(genome),
  bed_closest(small.x, y),
  bed_intersect(x, y),
  bed_merge(x),
  bed_subtract(x, y),
  bed_complement(x, genome),
  times = 1,
  unit = 's'
)
#> Unit: seconds
#>                       expr       min        lq      mean    median
#>         bed_random(genome) 0.1235815 0.1235815 0.1235815 0.1235815
#>    bed_closest(small.x, y) 0.7618047 0.7618047 0.7618047 0.7618047
#>        bed_intersect(x, y) 1.9671289 1.9671289 1.9671289 1.9671289
#>               bed_merge(x) 2.5055255 2.5055255 2.5055255 2.5055255
#>         bed_subtract(x, y) 1.9578817 1.9578817 1.9578817 1.9578817
#>  bed_complement(x, genome) 2.6705171 2.6705171 2.6705171 2.6705171
#>         uq       max neval
#>  0.1235815 0.1235815     1
#>  0.7618047 0.7618047     1
#>  1.9671289 1.9671289     1
#>  2.5055255 2.5055255     1
#>  1.9578817 1.9578817     1
#>  2.6705171 2.6705171     1

valr: Genome interval arithmetic in R

Jay Hesselberth jay.hesselberth@gmail.com

2016-07-29