valr provides tools to read and manipulate genome intervals and signals, similar to the standalone BEDtools suite. valr enables BEDtools-like analysis in the R/RStudio environment, and uses modern R tools for a terse, expressive syntax. Compute-intensive algorithms are implemented in Rcpp/C++, and many methods take advantage of the speed and grouping capability provided by dplyr.
We provide several introductions to valr:
vignette][] that covers the core methods.valr in “real-world” applications.shiny application that demonstrates interactive analysis of genome-scale data sets.valr can be installed from github, and will be eventually deposited in CRAN.
> devtools::install_github('jayhesselberth/valr')Why another tool set for interval manipulations? We already have BEDtools, bedops, pybedtools, GenomeRanges, bedr and IRanges.
We were motivated to develop a toolset that:
dplyr and the pipe operator from magrittr (%>%).Rcpp.shiny.We anticipate valr will mainly be used for analysis of pre-processed data in BED, bedGraph and VCF formats. Most users will have processed their aligned reads from BAM format to bedGraph, so we do not foresee supporting BAM directly. We would entertain requests for GTF / GFF support if there is interest, as tidyr makes it easy to convert these to BED12.
At this point you might be expecting plots of speed / memory usage versus interval number. Certain algorithms in valr were implemented in Rcpp to be fast (including intersect, merge, subtract, closest), enabling fluid interactive analysis. See the benchmarks section for demonstrations.
Several of the methods in valr use NSE for an expressive syntax. Columns are referred to by name and can be used in multiple name/value expressions for summaries.
bed_map(x, y, mean = mean(value), var = var(value))
bed_merge(x, concat = concat(value), max = max(value))Here is an example using valr that creates
library(valr)valr has several methods to read interval data. These methods:
data.frame in tibble::tbl_df format.chrom, start and end column names.readr for speed.The methods include:
read_bed(): read a BED3+ fileread_bed12(): read a BED12 fileread_bedgraph(): read a bedGraph fileread_genome(): read a UCSC “chrom size” fileread_vcf(): read the Variant Call Formatread_bigwig(): read UCSC bigWig filesread_bed(system.file('extdata', '3fields.bed.gz', package = 'valr'))
#> # A tibble: 10 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 11873 14409
#> 2 chr1 14361 19759
#> 3 chr1 14406 29370
#> 4 chr1 34610 36081
#> 5 chr1 69090 70008
#> 6 chr1 134772 140566
#> 7 chr1 321083 321115
#> 8 chr1 321145 321207
#> 9 chr1 322036 326938
#> 10 chr1 327545 328439
read_bed(n_fields = 6, system.file('extdata', '6fields.bed.gz', package = 'valr'))
#> # A tibble: 10 x 6
#> chrom start end name score strand
#> <chr> <int> <int> <chr> <chr> <chr>
#> 1 chr1 11873 14409 DDX11L1 3 +
#> 2 chr1 14361 19759 WASH7P 10 -
#> 3 chr1 14406 29370 WASH7P 7 -
#> 4 chr1 34610 36081 FAM138F 3 -
#> 5 chr1 69090 70008 OR4F5 1 +
#> 6 chr1 134772 140566 LOC729737 3 -
#> 7 chr1 321083 321115 DQ597235 1 +
#> 8 chr1 321145 321207 DQ599768 1 +
#> 9 chr1 322036 326938 LOC100133331 3 +
#> 10 chr1 327545 328439 LOC388312 1 +
read_bed12(system.file('extdata', '12fields.bed.gz', package = 'valr'))
#> # A tibble: 3 x 12
#> chrom start end name score strand cds_start cds_end item_rgb
#> <chr> <int> <int> <chr> <chr> <chr> <int> <int> <chr>
#> 1 chr1 4797973 4836816 testgene 1 + 4797973 4836816 .
#> 2 chr10 4848118 4880877 diffchrom 1 + 4848118 4880877 .
#> 3 chr20 5073253 5152630 negstrand 1 - 5073253 5152630 .
#> # ... with 3 more variables: exon_count <int>, exon_sizes <chr>,
#> # exon_starts <chr>genome <- read_genome(system.file('extdata', 'hg19.chrom.sizes.gz', package = 'valr'))read_bedgraph(system.file('extdata', 'test.bg.gz', package = 'valr'))
#> # A tibble: 4 x 4
#> chrom start end value
#> <chr> <int> <int> <dbl>
#> 1 chr19 49302000 49302300 -1.00
#> 2 chr19 49302300 49302600 -0.75
#> 3 chr19 49302600 49302900 -0.50
#> 4 chr19 49302900 49303200 -0.25read_vcf() reads VCF files and assigns chrom, start and end columns to be used to downstream interval comparisons. Note the interval size is calculated as the length of the REF field in the original file.
read_vcf(system.file('extdata', 'test.vcf.gz', package = 'valr'))
#> # A tibble: 115 x 201
#> CHROM POS ID REF
#> <int> <int> <dbl> <chr>
#> 1 1 10172 0 CCCTAA
#> 2 1 10390 0 CCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA
#> 3 1 10397 0 CCCCTAACCCTAA
#> 4 1 10403 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
#> 5 1 10409 0 ACCCTAACCCTAACCCTAACCCTAACCCTAAC
#> 6 1 10415 0 ACCCTAACCCTAACCCTAACCCTAAC
#> 7 1 10421 0 ACCCTAACCCTAACCCTAAC
#> 8 1 10428 0 CCCTAA
#> 9 1 10440 0 C
#> 10 1 10478 0 C
#> # ... with 105 more rows, and 197 more variables: ALT <chr>, QUAL <dbl>,
#> # FILTER <chr>, INFO <chr>, FORMAT <chr>, 101976-101976 <chr>,
#> # 100920-100920 <chr>, 100231-100231 <chr>, 100232-100232 <chr>,
#> # 100919-100919 <chr>, 101977-101977 <chr>, 100630-100630 <chr>,
#> # 100640-100640 <chr>, 100631-100631 <chr>, 101583-101583 <chr>,
#> # 101584-101584 <chr>, 101732-101732 <chr>, 101016-101016 <chr>,
#> # 101708-101708 <chr>, 101730-101730 <chr>, 101582-101582 <chr>,
#> # 101731-101731 <chr>, 101809-101809 <chr>, 101653-101653 <chr>,
#> # 101652-101652 <chr>, 101806-101806 <chr>, 101807-101807 <chr>,
#> # 101808-101808 <chr>, 101810-101810 <chr>, 101811-101811 <chr>,
#> # 101812-101812 <chr>, 101813-101813 <chr>, 101814-101814 <chr>,
#> # 101957-101957 <chr>, 101958-101958 <chr>, 100986-100986 <chr>,
#> # 100987-100987 <chr>, 101897-101897 <chr>, 102071-102071 <chr>,
#> # 102106-102106 <chr>, 101040-101040 <chr>, 101167-101167 <chr>,
#> # 101041-101041 <chr>, 101042-101042 <chr>, 101168-101168 <chr>,
#> # 101898-101898 <chr>, 102070-102070 <chr>, 102111-102111 <chr>,
#> # 100988-100988 <chr>, 101896-101896 <chr>, 102110-102110 <chr>,
#> # 102620-102620 <chr>, 102621-102621 <chr>, 102947-102947 <chr>,
#> # 102948-102948 <chr>, 103089-103089 <chr>, 102693-102693 <chr>,
#> # 102694-102694 <chr>, 102949-102949 <chr>, 102622-102622 <chr>,
#> # 102623-102623 <chr>, 102624-102624 <chr>, 102722-102722 <chr>,
#> # 102712-102712 <chr>, 103339-103339 <chr>, 103124-103124 <chr>,
#> # 103161-103161 <chr>, 103125-103125 <chr>, 103171-103171 <chr>,
#> # 103338-103338 <chr>, 103372-103372 <chr>, 103193-103193 <chr>,
#> # 100195-100195 <chr>, 100194-100194 <chr>, 101667-101667 <chr>,
#> # 101137-101137 <chr>, 101291-101291 <chr>, 101292-101292 <chr>,
#> # 101666-101666 <chr>, 32222-32222 <chr>, 32049-32049 <chr>,
#> # 32050-32050 <chr>, 32411-32411 <chr>, 32221-32221 <chr>,
#> # 100147-100147 <chr>, 100149-100149 <chr>, 100243-100243 <chr>,
#> # 100290-100290 <chr>, 100753-100753 <chr>, 100754-100754 <chr>,
#> # 101377-101377 <chr>, 101426-101426 <chr>, 101435-101435 <chr>,
#> # 101523-101523 <chr>, 101877-101877 <chr>, 100043-100043 <chr>,
#> # 100044-100044 <chr>, 100148-100148 <chr>, 100292-100292 <chr>,
#> # 101381-101381 <chr>, ...read_bigwig() uses bigwrig reads local bigWig files into a data_frame with chrom, start, end and value columns. URLs are not supported yet.
read_bigwig(system.file('extdata', 'test.bw', package = 'valr'))valr implements several methods for manipulating sets of intervals. Some methods operate on a single set of intervals, while others compare two sets of intervals.
Many methods the same name as the corresponding BEDtool, and some commonly used BEDtools are implemented as dplyr pipes (e.g., see the group_by section).
All methods accept one or more sets of x and y intervals, which must either be created using the read methods, or have chrom, start and end columns.
These methods operate on a single set of intervals:
bed_sort(): order intervalsbed_cluster(): Cluster (but don’t merge) overlapping/nearby intervals.bed_complement(): extract intervals not represented by an interval file.bed_merge(): combine overlapping and nearby intervals into a single interval.bed_flank(): Generate new flanking intervalsbed_slop(): Expand the size of input intervalsbed_shift(): Shift the coordinates of an input set, bounded by a genomebed_sort orders intervals based on a specification. is_sorted asks whether a tbl is already sorted.
x <- bed_random(genome)
is_sorted(x)
#> [1] FALSE
y <- bed_sort(x)
y
#> # A tibble: 1,000,000 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 91 1091
#> 2 chr1 749 1749
#> 3 chr1 2155 3155
#> 4 chr1 2165 3165
#> 5 chr1 2744 3744
#> 6 chr1 3649 4649
#> 7 chr1 6237 7237
#> 8 chr1 6316 7316
#> 9 chr1 6333 7333
#> 10 chr1 11114 12114
#> # ... with 999,990 more rows
is_sorted(y)
#> [1] TRUEbed_cluster identifies clustered intervals and assigns them a unique .id.
x <- bed_random(genome)
y <- bed_cluster(x)
y
#> # A tibble: 1,000,000 x 4
#> chrom start end .id
#> <chr> <int> <int> <int>
#> 1 chr22 29116064 29117064 9381
#> 2 chr1 170475087 170476087 54904
#> 3 chr2 47683499 47684499 15365
#> 4 chr9 30957167 30958167 10004
#> 5 chr8 125471327 125472327 40279
#> 6 chr1 45061415 45062415 14410
#> 7 chr22 17509928 17510928 5647
#> 8 chr20 35586901 35587901 11493
#> 9 chr15 69386523 69387523 22594
#> 10 chr2 182465802 182466802 58813
#> # ... with 999,990 more rowsbed_complement identifies intervals in a genome that are not covered by an input.
x <- bed_random(genome)
bed_complement(x, genome)
#> # A tibble: 724,115 x 3
#> chrom start end
#> <fctr> <int> <int>
#> 1 chr1 1 2162
#> 2 chr1 3162 6880
#> 3 chr1 7880 10679
#> 4 chr1 11679 16356
#> 5 chr1 17403 17637
#> 6 chr1 19295 19873
#> 7 chr1 22024 22833
#> 8 chr1 23833 24216
#> 9 chr1 25216 30808
#> 10 chr1 31808 33847
#> # ... with 724,105 more rowsbed_merge identifies overlapping intervals and reports new merged ones. is_merged asks whether a tbl is already merged. Values from merged intervals can be reported using name / value pairs.
# 1e6 random intervals
n <- 1e6
x <- bed_random(genome, n = n)
is_merged(x)
#> [1] FALSE
x
#> # A tibble: 1,000,000 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr5 27943564 27944564
#> 2 chr2 223883634 223884634
#> 3 chr8 107641024 107642024
#> 4 chr4 26203166 26204166
#> 5 chr3 32827287 32828287
#> 6 chr1 133146878 133147878
#> 7 chr6 121981429 121982429
#> 8 chr2 8503021 8504021
#> 9 chr15 93106998 93107998
#> 10 chr4 184553192 184554192
#> # ... with 999,990 more rows
# add some signal
x <- x %>% mutate(signal = runif(n))
y <- bed_merge(x)
is_merged(y)
#> [1] TRUE
y
#> # A tibble: 723,104 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 501 1501
#> 2 chr1 1645 2645
#> 3 chr1 4062 5062
#> 4 chr1 11407 12407
#> 5 chr1 16855 17855
#> 6 chr1 19066 20900
#> 7 chr1 24519 25672
#> 8 chr1 26156 27473
#> 9 chr1 29638 30638
#> 10 chr1 32018 33018
#> # ... with 723,094 more rows
bed_merge(x, maxs = max(signal))
#> # A tibble: 723,104 x 4
#> chrom start end maxs
#> <chr> <int> <int> <dbl>
#> 1 chr1 501 1501 0.88223255
#> 2 chr1 1645 2645 0.70457769
#> 3 chr1 4062 5062 0.76373882
#> 4 chr1 11407 12407 0.04712881
#> 5 chr1 16855 17855 0.16967085
#> 6 chr1 19066 20900 0.59457506
#> 7 chr1 24519 25672 0.58579493
#> 8 chr1 26156 27473 0.93904880
#> 9 chr1 29638 30638 0.08209918
#> 10 chr1 32018 33018 0.42858058
#> # ... with 723,094 more rowsbed_flank creates new intervals that flank - but do not contain - the input intervals.
bed_flank(x, genome, both = 100)bed_slop pads input intervals based on a specification
bed_slop(x, genome, both = 100)bed_shift adjusts coordinates toward start or end by a defined size. Intervals created out of bounds are removed, or trimmed.
bed_shift(x, genome, size = 100)
#> # A tibble: 999,999 x 4
#> chrom start end signal
#> <chr> <dbl> <dbl> <dbl>
#> 1 chr5 27943664 27944664 0.54661972
#> 2 chr2 223883734 223884734 0.88471469
#> 3 chr8 107641124 107642124 0.52616661
#> 4 chr4 26203266 26204266 0.32368260
#> 5 chr3 32827387 32828387 0.94392807
#> 6 chr1 133146978 133147978 0.27973158
#> 7 chr6 121981529 121982529 0.23708108
#> 8 chr2 8503121 8504121 0.57977029
#> 9 chr15 93107098 93108098 0.33198180
#> 10 chr4 184553292 184554292 0.01883723
#> # ... with 999,989 more rowsThese methods compare two sets of intervals:
bed_intersect(): find overlapping intervalsbed_map(): apply a function to selected columns for overlapping intervalsbed_subtract(): Remove intervals based on overlaps between two filesbed_window(): Find overlapping intervals within a windowbed_closest(): find the closest intervals independent of overlapsbed_intersect is implemented using an interval tree in Rcpp. Column names in the result have .x and .y suffixes, and an .overlap column contains the size of the intersection (values of 0 indicate book-ended, or touching intervals). See the benchmarks section for timing. Though bed_intersect is pretty fast already, we intend to further improve upon this by parallization with RcppParallel.
# intersect two sets of 1e6 intervals from hg19
x <- bed_random(genome)
y <- bed_random(genome)
bed_intersect(x, y)
#> # A tibble: 646,110 x 6
#> chrom start.x end.x start.y end.y .overlap
#> <chr> <int> <int> <int> <int> <int>
#> 1 chr1 9762 10762 9884 10884 878
#> 2 chr1 19890 20890 19655 20655 765
#> 3 chr1 19946 20946 19655 20655 709
#> 4 chr1 29590 30590 28604 29604 14
#> 5 chr1 29590 30590 28618 29618 28
#> 6 chr1 75235 76235 75546 76546 689
#> 7 chr1 92843 93843 92721 93721 878
#> 8 chr1 104230 105230 103620 104620 390
#> 9 chr1 104230 105230 104774 105774 456
#> 10 chr1 133598 134598 132750 133750 152
#> # ... with 646,100 more rows
# records in `x` with no overlaps in `y`
bed_intersect(x, y, invert = TRUE)
#> Source: local data frame [524,270 x 3]
#> Groups: chrom [25]
#>
#> chrom start end
#> <chr> <int> <int>
#> 1 chrY 59362073 59363073
#> 2 chrY 59355115 59356115
#> 3 chrY 59340287 59341287
#> 4 chrY 59339844 59340844
#> 5 chrY 59329166 59330166
#> 6 chrY 59307757 59308757
#> 7 chrY 59297407 59298407
#> 8 chrY 59291701 59292701
#> 9 chrY 59275268 59276268
#> 10 chrY 59256992 59257992
#> # ... with 524,260 more rowsOne can achieve behavior similar to BEDtools by combining bed_intersect with dplyr tools.
# `x` records with overlaps (i.e., `-wa`)
bed_intersect(x, y) %>% select(chrom, start = start.x, end = end.x)
#> # A tibble: 646,110 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 9762 10762
#> 2 chr1 19890 20890
#> 3 chr1 19946 20946
#> 4 chr1 29590 30590
#> 5 chr1 29590 30590
#> 6 chr1 75235 76235
#> 7 chr1 92843 93843
#> 8 chr1 104230 105230
#> 9 chr1 104230 105230
#> 10 chr1 133598 134598
#> # ... with 646,100 more rows
# `y` records with overlaps (i.e., `-wb`)
bed_intersect(x, y) %>% select(chrom, start = start.y, end = end.y)
#> # A tibble: 646,110 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 9884 10884
#> 2 chr1 19655 20655
#> 3 chr1 19655 20655
#> 4 chr1 28604 29604
#> 5 chr1 28618 29618
#> 6 chr1 75546 76546
#> 7 chr1 92721 93721
#> 8 chr1 103620 104620
#> 9 chr1 104774 105774
#> 10 chr1 132750 133750
#> # ... with 646,100 more rows
# Unique records in `x` (i.e., `-u`)
bed_intersect(x, y) %>% select(chrom, start = start.x, end = end.x) %>% unique()
#> # A tibble: 475,656 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 9762 10762
#> 2 chr1 19890 20890
#> 3 chr1 19946 20946
#> 4 chr1 29590 30590
#> 5 chr1 75235 76235
#> 6 chr1 92843 93843
#> 7 chr1 104230 105230
#> 8 chr1 133598 134598
#> 9 chr1 138702 139702
#> 10 chr1 139417 140417
#> # ... with 475,646 more rows
# Count `y` intervals that intersect each `x` interval
bed_intersect(x, y) %>% group_by(chrom, start.x, end.x) %>% summarize(count = n())
#> Source: local data frame [475,656 x 4]
#> Groups: chrom, start.x [?]
#>
#> chrom start.x end.x count
#> <chr> <int> <int> <int>
#> 1 chr1 9762 10762 1
#> 2 chr1 19890 20890 1
#> 3 chr1 19946 20946 1
#> 4 chr1 29590 30590 2
#> 5 chr1 75235 76235 1
#> 6 chr1 92843 93843 1
#> 7 chr1 104230 105230 2
#> 8 chr1 133598 134598 1
#> 9 chr1 138702 139702 2
#> 10 chr1 139417 140417 3
#> # ... with 475,646 more rowsbed_map maps signals onto intervals. Summary statistics for mapped signals can be specific using NSE with name / value pairs.
bedfile <- system.file('extdata', 'genes.hg19.chr22.bed.gz', package = 'valr')
bgfile <- system.file('extdata', 'hela.h3k4.chip.bg.gz', package = 'valr')
x <- read_bed(bedfile, n_fields = 6)
y <- read_bedgraph(bgfile)
bed_map(x, y, means = mean(value.y), sds = sd(value.y))
#> Source: local data frame [591 x 5]
#> Groups: chrom, start.x [?]
#>
#> chrom start.x end.x means sds
#> <chr> <int> <int> <dbl> <dbl>
#> 1 chr22 16150259 16193004 7.914286 7.5707309
#> 2 chr22 16162065 16172265 1.000000 NA
#> 3 chr22 16256331 16287937 1.000000 0.0000000
#> 4 chr22 17071647 17073700 1.000000 0.0000000
#> 5 chr22 17082800 17129720 1.117647 0.3270350
#> 6 chr22 17134598 17156430 1.294118 0.5878675
#> 7 chr22 17227758 17229328 1.000000 NA
#> 8 chr22 17264305 17302584 1.250000 0.5084039
#> 9 chr22 17308363 17310225 1.333333 0.5773503
#> 10 chr22 17385314 17385395 1.000000 NA
#> # ... with 581 more rowsbed_substract() removes x intervals that intersect with y.
x <- bed_random(genome)
y <- bed_random(genome)
bed_subtract(x, y)
#> # A tibble: 723,161 x 3
#> chrom start end
#> <fctr> <int> <int>
#> 1 chr1 795 1844
#> 2 chr1 4617 5617
#> 3 chr1 6350 7350
#> 4 chr1 10073 11212
#> 5 chr1 14290 15290
#> 6 chr1 17146 17605
#> 7 chr1 18758 19619
#> 8 chr1 22880 23880
#> 9 chr1 31429 34117
#> 10 chr1 38565 39565
#> # ... with 723,151 more rowsbed_window() identifies y intervals that intersect an expanded window of x intervals.
x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)
# a few intersections ...
bed_intersect(x, y)
#> # A tibble: 0 x 6
#> # ... with 6 variables: chrom <chr>, start.x <int>, end.x <int>,
#> # start.y <int>, end.y <int>, .overlap <int>
# ... can be expanded by casting a wider net
bed_window(x, y, genome, both = 1e6)
#> # A tibble: 5 x 6
#> chrom start.x end.x start.y end.y .overlap
#> <chr> <int> <int> <int> <int> <int>
#> 1 chr18 64994997 64995997 64177173 64178173 1000
#> 2 chr22 17171867 17172867 17330319 17331319 1000
#> 3 chr6 73330999 73331999 74014664 74015664 1000
#> 4 chr6 85250503 85251503 85516787 85517787 1000
#> 5 chr7 11760548 11761548 11733095 11734095 1000bed_closest() identifies y intervals that are closest to x.
x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)
bed_closest(x, y)
#> # A tibble: 157 x 7
#> chrom start.x end.x start.y end.y .overlap .distance
#> <chr> <int> <int> <int> <int> <dbl> <int>
#> 1 chr1 2082231 2083231 13672913 13673913 0 11589683
#> 2 chr1 71014696 71015696 60650968 60651968 0 -10362729
#> 3 chr1 71014696 71015696 82599956 82600956 0 11584261
#> 4 chr1 75087311 75088311 60650968 60651968 0 -14435344
#> 5 chr1 75087311 75088311 82599956 82600956 0 7511646
#> 6 chr1 120537963 120538963 102897681 102898681 0 -17639283
#> 7 chr1 120537963 120538963 134805028 134806028 0 14266066
#> 8 chr1 135634075 135635075 134805028 134806028 0 -828048
#> 9 chr1 135634075 135635075 142671350 142672350 0 7036276
#> 10 chr1 140165527 140166527 134805028 134806028 0 -5359500
#> # ... with 147 more rowsvalr provides methods for creating new random intervals or permutations of existing intervals:
bed_random generates random intervals from an input genome.bed_shuffle shuffles coordinates given a set of input intervals.dplyr.bed_random generates random intervals from an input genome. The numbers of intervals from each chrom are proporitional to each chrom size.
x <- bed_random(genome, n = 1e6, length = 1e3)
x
#> # A tibble: 1,000,000 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr10 61692296 61693296
#> 2 chr8 90940249 90941249
#> 3 chr1 118873367 118874367
#> 4 chr4 49767964 49768964
#> 5 chr3 116627439 116628439
#> 6 chr2 185139966 185140966
#> 7 chr2 4886421 4887421
#> 8 chr1 230260915 230261915
#> 9 chr10 80182008 80183008
#> 10 chr5 70879026 70880026
#> # ... with 999,990 more rows
# numbers of sampled intervals are proportional to chrom size
group_by(x, chrom) %>% summarize(n = n()) %>% arrange(desc(n))
#> # A tibble: 25 x 2
#> chrom n
#> <chr> <int>
#> 1 chr1 80019
#> 2 chr2 78287
#> 3 chr3 63863
#> 4 chr4 61723
#> 5 chr5 58259
#> 6 chr6 55259
#> 7 chr7 51549
#> 8 chrX 50045
#> 9 chr8 47331
#> 10 chr9 45726
#> # ... with 15 more rowsSampling can be done using dplyr:
x <- bed_random(genome)
# sample by number
sample_n(x, 1e3, replace = FALSE)
#> # A tibble: 1,000 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr12 17069124 17070124
#> 2 chr10 31795043 31796043
#> 3 chr2 49523088 49524088
#> 4 chr5 80102671 80103671
#> 5 chr4 163573804 163574804
#> 6 chr21 36171735 36172735
#> 7 chr16 12170148 12171148
#> 8 chr13 61655971 61656971
#> 9 chr6 100866333 100867333
#> 10 chr3 97068477 97069477
#> # ... with 990 more rows
# or fraction
sample_frac(x, 0.1, replace = FALSE)
#> # A tibble: 100,000 x 3
#> chrom start end
#> <chr> <int> <int>
#> 1 chr7 63122310 63123310
#> 2 chr18 34178283 34179283
#> 3 chr6 26872688 26873688
#> 4 chr5 121335054 121336054
#> 5 chrX 38344551 38345551
#> 6 chr16 7495700 7496700
#> 7 chr6 11704603 11705603
#> 8 chr19 2891288 2892288
#> 9 chr14 101588608 101589608
#> 10 chr10 26708564 26709564
#> # ... with 99,990 more rows
# or sample intervals within groups
group_by(x, chrom) %>% sample_n(1)
#> Source: local data frame [25 x 3]
#> Groups: chrom [25]
#>
#> chrom start end
#> <chr> <int> <int>
#> 1 chr1 31147666 31148666
#> 2 chr10 84286028 84287028
#> 3 chr11 84557236 84558236
#> 4 chr12 38171725 38172725
#> 5 chr13 3388085 3389085
#> 6 chr14 103648078 103649078
#> 7 chr15 97003992 97004992
#> 8 chr16 3709684 3710684
#> 9 chr17 75116100 75117100
#> 10 chr18 49762222 49763222
#> # ... with 15 more rowsbed_shuffle shuffles input intervals. Interval sizes are equal in the input and output.
y <- bed_shuffle(x)
any(x$start == y$start)
all(x$end - x$start == y$end - y$start)Instead of implementing coverage directly, coverage calculations are done with the summarize tool in dplyr.
x <- bed_random(genome)
# generate fake signals
y <- bed_random(genome) %>%
mutate(count = floor(rnorm(n(), min=0, max=101)))
bed_intersect(x, y) %>%
group_by(chrom, start.x, end.x) %>%
summarize(coverage = sum(count.y))valr provides several methods to assess statistical properties of interval sets including:
bed_fisher()bed_absdist()bed_reldist()bed_jaccard()bed_projection()Several of these methods were described in the Genometricorr software package.
The Fisher’s test assesses whether two sets of intervals are drawn from the same background genome.
x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)
bed_fisher(x, y, genome)
#> Joining, by = "chrom"
#> Joining, by = "chrom"
#> Joining, by = "chrom"
#> Joining, by = "chrom"
#> Source: local data frame [23 x 8]
#> Groups: chrom [23]
#>
#> chrom n_x n_y n_i size_mean size int_est p.value
#> <chr> <int> <int> <int> <dbl> <int> <dbl> <dbl>
#> 1 chr1 7 1 NA 1000 249250621 249250.62 NA
#> 2 chr10 1 1 NA 1000 135534747 135534.75 NA
#> 3 chr11 7 5 NA 1000 135006516 135006.52 NA
#> 4 chr12 6 5 NA 1000 133851895 133851.89 NA
#> 5 chr13 2 4 NA 1000 115169878 115169.88 NA
#> 6 chr14 1 4 NA 1000 107349540 107349.54 NA
#> 7 chr15 8 3 NA 1000 102531392 102531.39 NA
#> 8 chr16 3 3 NA 1000 90354753 90354.75 NA
#> 9 chr17 3 4 NA 1000 81195210 81195.21 NA
#> 10 chr19 1 2 NA 1000 59128983 59128.98 NA
#> # ... with 13 more rowsbed_absdist() - TBD
bed_reldist() tests whether two sets of (possibly non-overlapping) intervals are closer two each other in aggregate than a random set.
x <- bed_random(genome)
y <- bed_random(genome)
bed_reldist(x, y)bed_jaccard() quantifies the extent of overlap between to sets of intervals. The Jaccard statistic takes values of [0,1] and is measured as:
\[ J(x,y) = \frac{\mid x \bigcap y \mid} {\mid x \bigcup y \mid} = \frac{\mid x \bigcap y \mid} {\mid x \mid + \mid y \mid - \mid x \bigcap y \mid} \]
x <- bed_random(genome)
y <- bed_random(genome)
bed_jaccard(x, y)
#> # A tibble: 1 x 4
#> len_i len_u jaccard n
#> <int> <int> <dbl> <int>
#> 1 323086199 2000000000 0.1926671 646146bed_projection() TBD
# 1e6 random 1 kb intervals from hg19
x <- bed_random(genome)
small.x <- bed_random(genome, n = 100)
y <- bed_random(genome)
library(microbenchmark)
microbenchmark(
bed_random(genome),
bed_closest(small.x, y),
bed_intersect(x, y),
bed_merge(x),
bed_subtract(x, y),
bed_complement(x, genome),
times = 1,
unit = 's'
)
#> Unit: seconds
#> expr min lq mean median
#> bed_random(genome) 0.1235815 0.1235815 0.1235815 0.1235815
#> bed_closest(small.x, y) 0.7618047 0.7618047 0.7618047 0.7618047
#> bed_intersect(x, y) 1.9671289 1.9671289 1.9671289 1.9671289
#> bed_merge(x) 2.5055255 2.5055255 2.5055255 2.5055255
#> bed_subtract(x, y) 1.9578817 1.9578817 1.9578817 1.9578817
#> bed_complement(x, genome) 2.6705171 2.6705171 2.6705171 2.6705171
#> uq max neval
#> 0.1235815 0.1235815 1
#> 0.7618047 0.7618047 1
#> 1.9671289 1.9671289 1
#> 2.5055255 2.5055255 1
#> 1.9578817 1.9578817 1
#> 2.6705171 2.6705171 1