Considerable time in computational genomics is spent on overlapping different features of the genome. These features are what genomic intervals represent within the chromosomal coordinate system. For example they can represent exon coordinates, a transcription factor binding site, continuous scores over the genome such as read coverage, or scores that could be associated with only certain bases such as in the case of CpG methylation. The typical cases involves overlapping intervals of interest with other features of the genome, again represented as intervals, such as transcription factor binding sites overlapped with CpG islands or promoters to quantify what percentage of binding sites overlap with the regions of interest.
The plyranges package develops a grammar for reasoning about genomics data analysis, and provides a way of developing short readable analysis pipelines. The package is built on the core Bioconductor data structure GRanges, a type of object that follows tidy data principles: each row contains a single observation to a single gene and each column is a variable describing information about those genes. The illustration begins by supposing a tibble of genes from the macacus rhesus genome:
library(tidyverse)
library(plyranges, quietly = TRUE)
genes <- tibble(seqnames = "VI",
start = c(3322, 3030, 1437, 5066, 6426, 836),
end = c(3846, 3338, 2615, 5521, 7565, 1363),
strand = c("-", "-", "-", "+", "+", "+"),
gene_id=c("JV682167", "JV580426", "JV601537",
"JV528265", "JV565707", "JV490395"))
genes
## # A tibble: 6 x 5
## seqnames start end strand gene_id
## <chr> <dbl> <dbl> <chr> <chr>
## 1 VI 3322 3846 - JV682167
## 2 VI 3030 3338 - JV580426
## 3 VI 1437 2615 - JV601537
## 4 VI 5066 5521 + JV528265
## 5 VI 6426 7565 + JV565707
## 6 VI 836 1363 + JV490395
gr <- as_granges(genes)
gr
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] VI 3322-3846 - | JV682167
## [2] VI 3030-3338 - | JV580426
## [3] VI 1437-2615 - | JV601537
## [4] VI 5066-5521 + | JV528265
## [5] VI 6426-7565 + | JV565707
## [6] VI 836-1363 + | JV490395
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The core components of the object include a seqname column (representing the chromosome), a ranges column which consists of start and end coordinates for a genomic region, and a strand identifier (either positive, negative, or unstranded). Metadata are included as columns to the right of the dotted line as annotations (gene-id) or range level covariates (score).
The grammar is simply a set of verbs that define actions to be performed using the pipe operator, %>%, which can be read as ‘then’. A simple pipeline will add two columns, one corresponding to the gene_type and another with the GC content (by drawing from a uniform distribution), and then will remove genes if they have a width less than \(400bp\)
set.seed(2021-06-18)
gr2 <- gr %>%
mutate(gene_type = "mRNA",
gc_content = runif(n())) %>%
filter(width > 400)
gr2
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gene_type gc_content
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] VI 3322-3846 - | JV682167 mRNA 0.221390
## [2] VI 1437-2615 - | JV601537 mRNA 0.817291
## [3] VI 5066-5521 + | JV528265 mRNA 0.758008
## [4] VI 6426-7565 + | JV565707 mRNA 0.134232
## [5] VI 836-1363 + | JV490395 mRNA 0.628786
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The mutate() verb is used to add columns, (gene_type where all values are set to mRNA standing for messenger RNA and another called gc_content with random uniform values). The n() operator returns the number of ranges in GRanges object, but can only be evaluated inside of one of the verbs.
The filter() verbs returns ranges if the expression evaluates to TRUE. Multiple expressions can be composed together with the & operator or the | operator
gr2 %>%
filter(strand == "+" & gc_content > 0.5)
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gene_type gc_content
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] VI 5066-5521 + | JV528265 mRNA 0.758008
## [2] VI 836-1363 + | JV490395 mRNA 0.628786
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr2 %>%
filter(strand == "+" | gc_content > 0.5)
## GRanges object with 4 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gene_type gc_content
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] VI 1437-2615 - | JV601537 mRNA 0.817291
## [2] VI 5066-5521 + | JV528265 mRNA 0.758008
## [3] VI 6426-7565 + | JV565707 mRNA 0.134232
## [4] VI 836-1363 + | JV490395 mRNA 0.628786
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Performing tasks of biological interest like averages and counting are achieved with the summarise() and group_by() verbs respectively, which will return a tibble object to manipulate it’s statistical behavior.
gr2 %>%
group_by(strand) %>%
summarise(avg_gc = mean(gc_content),
n = n())
## DataFrame with 2 rows and 3 columns
## strand avg_gc n
## <Rle> <numeric> <integer>
## 1 + 0.507009 3
## 2 - 0.519340 2
And so, any verb applied to a grouped GRanges, acts on each partition:
by_strand <- gr2 %>%
group_by(strand)
by_strand %>%
mutate(avg_gc_strand = mean(gc_content))
## GRanges object with 5 ranges and 4 metadata columns:
## Groups: strand [2]
## seqnames ranges strand | gene_id gene_type gc_content
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] VI 3322-3846 - | JV682167 mRNA 0.221390
## [2] VI 1437-2615 - | JV601537 mRNA 0.817291
## [3] VI 5066-5521 + | JV528265 mRNA 0.758008
## [4] VI 6426-7565 + | JV565707 mRNA 0.134232
## [5] VI 836-1363 + | JV490395 mRNA 0.628786
## avg_gc_strand
## <numeric>
## [1] 0.519340
## [2] 0.519340
## [3] 0.507009
## [4] 0.507009
## [5] 0.507009
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
To remove grouping there is the ungroup() verb:
by_strand %>%
ungroup()
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gene_type gc_content
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] VI 3322-3846 - | JV682167 mRNA 0.221390
## [2] VI 1437-2615 - | JV601537 mRNA 0.817291
## [3] VI 5066-5521 + | JV528265 mRNA 0.758008
## [4] VI 6426-7565 + | JV565707 mRNA 0.134232
## [5] VI 836-1363 + | JV490395 mRNA 0.628786
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Finally, metadata columns can be selected using the select() verb:
gr2 %>%
select(gene_id, gene_type)
## GRanges object with 5 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id gene_type
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] VI 3322-3846 - | JV682167 mRNA
## [2] VI 1437-2615 - | JV601537 mRNA
## [3] VI 5066-5521 + | JV528265 mRNA
## [4] VI 6426-7565 + | JV565707 mRNA
## [5] VI 836-1363 + | JV490395 mRNA
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Arithmetic operations transform range coordinates, as defined by their start, end and width columns. This is achieved with the mutate verb along with anchor_* adverbs. By default, setting width will fix the starting coordinate.
gr %>%
mutate(width = width + 1)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] VI 3322-3847 - | JV682167
## [2] VI 3030-3339 - | JV580426
## [3] VI 1437-2616 - | JV601537
## [4] VI 5066-5522 + | JV528265
## [5] VI 6426-7566 + | JV565707
## [6] VI 836-1364 + | JV490395
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
By setting the verb anchor_end(), the fix will carry from start to end and viceversa with the anchor_start() verb.
gr %>%
anchor_end() %>% mutate(width = width + 1)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] VI 3321-3846 - | JV682167
## [2] VI 3029-3338 - | JV580426
## [3] VI 1436-2615 - | JV601537
## [4] VI 5065-5521 + | JV528265
## [5] VI 6425-7565 + | JV565707
## [6] VI 835-1363 + | JV490395
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr %>%
anchor_start() %>% mutate(width = width + 1)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] VI 3322-3847 - | JV682167
## [2] VI 3030-3339 - | JV580426
## [3] VI 1437-2616 - | JV601537
## [4] VI 5066-5522 + | JV528265
## [5] VI 6426-7566 + | JV565707
## [6] VI 836-1364 + | JV490395
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The reduce verb merges overlapping and neighboring ranges and aggregates them by storing the result in a List column. The disjoin verb expands the overlapped range.
gr %>% reduce_ranges()
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] VI 836-1363 *
## [2] VI 1437-2615 *
## [3] VI 3030-3846 *
## [4] VI 5066-5521 *
## [5] VI 6426-7565 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr %>%
disjoin_ranges()
## GRanges object with 7 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] VI 836-1363 *
## [2] VI 1437-2615 *
## [3] VI 3030-3321 *
## [4] VI 3322-3338 *
## [5] VI 3339-3846 *
## [6] VI 5066-5521 *
## [7] VI 6426-7565 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
So, one first trascendental idea for manipulating intervals is to compare new collaborations against benchmark genomes through overlapping procedures. With some additional measurements obtained from a new experiment on macacus rhesus from three different replicates that represent single nucleotide or insertion deletion intensities from an array, the goal is to filter these positions and see if they overlap with one of the known genes.
The three different tibbles obtained from a collaborator are:
set.seed(66+105+111+99+49+56)
pos <- sample(1:10000, size = 100)
size <- sample(1:3, size = 100, replace = TRUE)
rep1 <- tibble(chr = "VI",
pos = pos,
size = size,
X = rnorm(100, mean = 2),
Y = rnorm(100, mean = 1))
rep2 <- tibble(chrom = "VI",
st = pos,
width = size,
X = rnorm(100, mean = 0.5, sd = 3),
Y = rnorm(100, sd = 2))
rep3 <- tibble(chromosome = "VI",
start = pos,
width = size,
X = rnorm(100, mean = 2, sd = 3),
Y = rnorm(100, mean = 4, sd = 0.5))
And the GRanges objects, binded and sorted by the starting coordinate, are:
rep1 <- as_granges(rep1, seqnames = chr, start = pos, width = size)
rep2 <- as_granges(rep2, seqnames = chrom, start = st)
rep3 <- as_granges(rep3, seqnames = chromosome)
intensities <- bind_ranges(rep1, rep2, rep3, .id = "replicate")
arrange(intensities, start)
## GRanges object with 300 ranges and 3 metadata columns:
## seqnames ranges strand | X Y replicate
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <Rle>
## [1] VI 5-6 * | 2.52287 -0.369134 1
## [2] VI 5-6 * | 3.10130 -0.255471 2
## [3] VI 5-6 * | 3.85942 3.718716 3
## [4] VI 99-100 * | 3.14127 2.218141 1
## [5] VI 99-100 * | 3.72455 1.203831 2
## ... ... ... ... . ... ... ...
## [296] VI 9975-9976 * | -0.920874 2.32376606 2
## [297] VI 9975-9976 * | 1.862762 4.69020714 3
## [298] VI 9995-9996 * | 3.219007 0.00920946 1
## [299] VI 9995-9996 * | 3.658382 2.64333960 2
## [300] VI 9995-9996 * | 2.758994 3.16707578 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Therefore, the first interesting product will be a filter of positions if they overlap with one of the genes known.
olap <- filter_by_overlaps(intensities, gr)
olap
## GRanges object with 105 ranges and 3 metadata columns:
## seqnames ranges strand | X Y replicate
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <Rle>
## [1] VI 5239 * | 1.202741 0.652819 1
## [2] VI 6848-6849 * | 3.282009 0.789747 1
## [3] VI 6645 * | 1.385816 0.866810 1
## [4] VI 3366-3367 * | 2.919752 2.445765 1
## [5] VI 2552-2553 * | 0.692582 2.169994 1
## ... ... ... ... . ... ... ...
## [101] VI 6980-6981 * | 0.288208 3.24790 3
## [102] VI 2416 * | 0.960565 4.03358 3
## [103] VI 3836-3838 * | 1.931636 3.68147 3
## [104] VI 6674 * | 1.444077 4.57349 3
## [105] VI 1960-1962 * | 3.546545 4.42852 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=es_MX.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8
## [5] LC_MONETARY=es_MX.UTF-8 LC_MESSAGES=es_MX.UTF-8
## [7] LC_PAPER=es_MX.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] plyranges_1.12.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
## [4] IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
## [7] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
## [10] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [13] tibble_3.1.2 ggplot2_3.3.4 tidyverse_1.3.1
## [16] BiocStyle_2.20.2
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_0.59.0
## [3] fs_1.5.0 lubridate_1.7.10
## [5] httr_1.4.2 tools_4.1.0
## [7] backports_1.2.1 bslib_0.2.5.1
## [9] utf8_1.2.1 R6_2.5.0
## [11] DBI_1.1.1 colorspace_2.0-1
## [13] withr_2.4.2 tidyselect_1.1.1
## [15] compiler_4.1.0 cli_2.5.0
## [17] rvest_1.0.0 Biobase_2.52.0
## [19] xml2_1.3.2 DelayedArray_0.18.0
## [21] rtracklayer_1.52.0 bookdown_0.22
## [23] sass_0.4.0 scales_1.1.1
## [25] digest_0.6.27 Rsamtools_2.8.0
## [27] rmarkdown_2.9 XVector_0.32.0
## [29] pkgconfig_2.0.3 htmltools_0.5.1.1
## [31] MatrixGenerics_1.4.0 dbplyr_2.1.1
## [33] rlang_0.4.11 readxl_1.3.1
## [35] rstudioapi_0.13 BiocIO_1.2.0
## [37] jquerylib_0.1.4 generics_0.1.0
## [39] jsonlite_1.7.2 BiocParallel_1.26.0
## [41] RCurl_1.98-1.3 magrittr_2.0.1
## [43] GenomeInfoDbData_1.2.6 Matrix_1.3-4
## [45] Rcpp_1.0.6 munsell_0.5.0
## [47] fansi_0.5.0 lifecycle_1.0.0
## [49] stringi_1.6.2 yaml_2.2.1
## [51] SummarizedExperiment_1.22.0 zlibbioc_1.38.0
## [53] grid_4.1.0 crayon_1.4.1
## [55] lattice_0.20-44 Biostrings_2.60.1
## [57] haven_2.4.1 hms_1.1.0
## [59] knitr_1.33 pillar_1.6.1
## [61] rjson_0.2.20 reprex_2.0.0
## [63] XML_3.99-0.6 glue_1.4.2
## [65] evaluate_0.14 BiocManager_1.30.16
## [67] modelr_0.1.8 vctrs_0.3.8
## [69] cellranger_1.1.0 gtable_0.3.0
## [71] assertthat_0.2.1 xfun_0.24
## [73] broom_0.7.7 restfulr_0.0.13
## [75] GenomicAlignments_1.28.0 ellipsis_0.3.2