Tidy Bioinformatics

JPAG1* and Acuña-González, R. J.2

1Facultad de Ciencias, UNAM
2Instituto Nacional de Perinatología. Facultad de Medicina, UNAM

*jp_6@ciencias.unam.mx

19 junio 2021

Abstract

Considerable time in computational genomics is spent on overlapping different features of the genome. These features are what genomic intervals represent within the chromosomal coordinate system. For example they can represent exon coordinates, a transcription factor binding site, continuous scores over the genome such as read coverage, or scores that could be associated with only certain bases such as in the case of CpG methylation. The typical cases involves overlapping intervals of interest with other features of the genome, again represented as intervals, such as transcription factor binding sites overlapped with CpG islands or promoters to quantify what percentage of binding sites overlap with the regions of interest.

1 Fluent genomic intervals
Session info

1 Fluent genomic intervals

The plyranges package develops a grammar for reasoning about genomics data analysis, and provides a way of developing short readable analysis pipelines. The package is built on the core Bioconductor data structure GRanges, a type of object that follows tidy data principles: each row contains a single observation to a single gene and each column is a variable describing information about those genes. The illustration begins by supposing a tibble of genes from the macacus rhesus genome:

library(tidyverse)
library(plyranges, quietly = TRUE)
genes <- tibble(seqnames = "VI", 
                    start = c(3322, 3030, 1437,  5066, 6426, 836),
                    end = c(3846, 3338, 2615, 5521, 7565, 1363),
                    strand = c("-", "-", "-", "+", "+", "+"),
                    gene_id=c("JV682167", "JV580426", "JV601537", 
                              "JV528265", "JV565707", "JV490395"))
genes

## # A tibble: 6 x 5
##   seqnames start   end strand gene_id 
##   <chr>    <dbl> <dbl> <chr>  <chr>   
## 1 VI        3322  3846 -      JV682167
## 2 VI        3030  3338 -      JV580426
## 3 VI        1437  2615 -      JV601537
## 4 VI        5066  5521 +      JV528265
## 5 VI        6426  7565 +      JV565707
## 6 VI         836  1363 +      JV490395

gr <- as_granges(genes)
gr

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames    ranges strand |     gene_id
##          <Rle> <IRanges>  <Rle> | <character>
##   [1]       VI 3322-3846      - |    JV682167
##   [2]       VI 3030-3338      - |    JV580426
##   [3]       VI 1437-2615      - |    JV601537
##   [4]       VI 5066-5521      + |    JV528265
##   [5]       VI 6426-7565      + |    JV565707
##   [6]       VI  836-1363      + |    JV490395
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The core components of the object include a seqname column (representing the chromosome), a ranges column which consists of start and end coordinates for a genomic region, and a strand identifier (either positive, negative, or unstranded). Metadata are included as columns to the right of the dotted line as annotations (gene-id) or range level covariates (score).

1.1 Grammar core verbs

The grammar is simply a set of verbs that define actions to be performed using the pipe operator, %>%, which can be read as ‘then’. A simple pipeline will add two columns, one corresponding to the gene_type and another with the GC content (by drawing from a uniform distribution), and then will remove genes if they have a width less than \(400bp\)

set.seed(2021-06-18)
gr2 <- gr %>%
 mutate(gene_type = "mRNA", 
        gc_content = runif(n())) %>% 
  filter(width > 400)
gr2

## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames    ranges strand |     gene_id   gene_type gc_content
##          <Rle> <IRanges>  <Rle> | <character> <character>  <numeric>
##   [1]       VI 3322-3846      - |    JV682167        mRNA   0.221390
##   [2]       VI 1437-2615      - |    JV601537        mRNA   0.817291
##   [3]       VI 5066-5521      + |    JV528265        mRNA   0.758008
##   [4]       VI 6426-7565      + |    JV565707        mRNA   0.134232
##   [5]       VI  836-1363      + |    JV490395        mRNA   0.628786
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The mutate() verb is used to add columns, (gene_type where all values are set to mRNA standing for messenger RNA and another called gc_content with random uniform values). The n() operator returns the number of ranges in GRanges object, but can only be evaluated inside of one of the verbs.

The filter() verbs returns ranges if the expression evaluates to TRUE. Multiple expressions can be composed together with the & operator or the | operator

gr2 %>%
  filter(strand == "+" & gc_content > 0.5)

## GRanges object with 2 ranges and 3 metadata columns:
##       seqnames    ranges strand |     gene_id   gene_type gc_content
##          <Rle> <IRanges>  <Rle> | <character> <character>  <numeric>
##   [1]       VI 5066-5521      + |    JV528265        mRNA   0.758008
##   [2]       VI  836-1363      + |    JV490395        mRNA   0.628786
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr2 %>%
  filter(strand == "+" | gc_content > 0.5)

## GRanges object with 4 ranges and 3 metadata columns:
##       seqnames    ranges strand |     gene_id   gene_type gc_content
##          <Rle> <IRanges>  <Rle> | <character> <character>  <numeric>
##   [1]       VI 1437-2615      - |    JV601537        mRNA   0.817291
##   [2]       VI 5066-5521      + |    JV528265        mRNA   0.758008
##   [3]       VI 6426-7565      + |    JV565707        mRNA   0.134232
##   [4]       VI  836-1363      + |    JV490395        mRNA   0.628786
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Performing tasks of biological interest like averages and counting are achieved with the summarise() and group_by() verbs respectively, which will return a tibble object to manipulate it’s statistical behavior.

gr2 %>%
  group_by(strand) %>%
  summarise(avg_gc = mean(gc_content),
            n = n())

## DataFrame with 2 rows and 3 columns
##   strand    avg_gc         n
##    <Rle> <numeric> <integer>
## 1      +  0.507009         3
## 2      -  0.519340         2

And so, any verb applied to a grouped GRanges, acts on each partition:

by_strand <- gr2 %>%
  group_by(strand) 

by_strand %>%
  mutate(avg_gc_strand = mean(gc_content))

## GRanges object with 5 ranges and 4 metadata columns:
## Groups: strand [2]
##       seqnames    ranges strand |     gene_id   gene_type gc_content
##          <Rle> <IRanges>  <Rle> | <character> <character>  <numeric>
##   [1]       VI 3322-3846      - |    JV682167        mRNA   0.221390
##   [2]       VI 1437-2615      - |    JV601537        mRNA   0.817291
##   [3]       VI 5066-5521      + |    JV528265        mRNA   0.758008
##   [4]       VI 6426-7565      + |    JV565707        mRNA   0.134232
##   [5]       VI  836-1363      + |    JV490395        mRNA   0.628786
##       avg_gc_strand
##           <numeric>
##   [1]      0.519340
##   [2]      0.519340
##   [3]      0.507009
##   [4]      0.507009
##   [5]      0.507009
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

To remove grouping there is the ungroup() verb:

by_strand %>%
  ungroup()

## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames    ranges strand |     gene_id   gene_type gc_content
##          <Rle> <IRanges>  <Rle> | <character> <character>  <numeric>
##   [1]       VI 3322-3846      - |    JV682167        mRNA   0.221390
##   [2]       VI 1437-2615      - |    JV601537        mRNA   0.817291
##   [3]       VI 5066-5521      + |    JV528265        mRNA   0.758008
##   [4]       VI 6426-7565      + |    JV565707        mRNA   0.134232
##   [5]       VI  836-1363      + |    JV490395        mRNA   0.628786
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finally, metadata columns can be selected using the select() verb:

gr2 %>%
  select(gene_id, gene_type)

## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames    ranges strand |     gene_id   gene_type
##          <Rle> <IRanges>  <Rle> | <character> <character>
##   [1]       VI 3322-3846      - |    JV682167        mRNA
##   [2]       VI 1437-2615      - |    JV601537        mRNA
##   [3]       VI 5066-5521      + |    JV528265        mRNA
##   [4]       VI 6426-7565      + |    JV565707        mRNA
##   [5]       VI  836-1363      + |    JV490395        mRNA
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.2 Genome Arithmetic

Arithmetic operations transform range coordinates, as defined by their start, end and width columns. This is achieved with the mutate verb along with anchor_* adverbs. By default, setting width will fix the starting coordinate.

gr %>%
  mutate(width = width + 1)

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames    ranges strand |     gene_id
##          <Rle> <IRanges>  <Rle> | <character>
##   [1]       VI 3322-3847      - |    JV682167
##   [2]       VI 3030-3339      - |    JV580426
##   [3]       VI 1437-2616      - |    JV601537
##   [4]       VI 5066-5522      + |    JV528265
##   [5]       VI 6426-7566      + |    JV565707
##   [6]       VI  836-1364      + |    JV490395
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

By setting the verb anchor_end(), the fix will carry from start to end and viceversa with the anchor_start() verb.

gr %>%
  anchor_end() %>% mutate(width = width + 1)

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames    ranges strand |     gene_id
##          <Rle> <IRanges>  <Rle> | <character>
##   [1]       VI 3321-3846      - |    JV682167
##   [2]       VI 3029-3338      - |    JV580426
##   [3]       VI 1436-2615      - |    JV601537
##   [4]       VI 5065-5521      + |    JV528265
##   [5]       VI 6425-7565      + |    JV565707
##   [6]       VI  835-1363      + |    JV490395
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr %>%
  anchor_start() %>% mutate(width = width + 1)

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames    ranges strand |     gene_id
##          <Rle> <IRanges>  <Rle> | <character>
##   [1]       VI 3322-3847      - |    JV682167
##   [2]       VI 3030-3339      - |    JV580426
##   [3]       VI 1437-2616      - |    JV601537
##   [4]       VI 5066-5522      + |    JV528265
##   [5]       VI 6426-7566      + |    JV565707
##   [6]       VI  836-1364      + |    JV490395
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.3 Genome Aggregation

The reduce verb merges overlapping and neighboring ranges and aggregates them by storing the result in a List column. The disjoin verb expands the overlapped range.

gr %>% reduce_ranges()

## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]       VI  836-1363      *
##   [2]       VI 1437-2615      *
##   [3]       VI 3030-3846      *
##   [4]       VI 5066-5521      *
##   [5]       VI 6426-7565      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr %>% 
  disjoin_ranges()

## GRanges object with 7 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]       VI  836-1363      *
##   [2]       VI 1437-2615      *
##   [3]       VI 3030-3321      *
##   [4]       VI 3322-3338      *
##   [5]       VI 3339-3846      *
##   [6]       VI 5066-5521      *
##   [7]       VI 6426-7565      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.4 Overlap

So, one first trascendental idea for manipulating intervals is to compare new collaborations against benchmark genomes through overlapping procedures. With some additional measurements obtained from a new experiment on macacus rhesus from three different replicates that represent single nucleotide or insertion deletion intensities from an array, the goal is to filter these positions and see if they overlap with one of the known genes.

The three different tibbles obtained from a collaborator are:

set.seed(66+105+111+99+49+56)

pos <- sample(1:10000, size = 100)
size <- sample(1:3, size = 100, replace = TRUE)
rep1 <- tibble(chr = "VI", 
                  pos = pos,
                  size = size,
                  X = rnorm(100, mean = 2),
                  Y = rnorm(100, mean = 1))

rep2 <- tibble(chrom = "VI", 
                  st = pos,
                  width = size,
                  X = rnorm(100, mean = 0.5, sd = 3),
                  Y = rnorm(100, sd = 2))

rep3 <- tibble(chromosome = "VI", 
                  start = pos,
                  width = size,
                  X = rnorm(100, mean = 2, sd = 3),
                  Y = rnorm(100, mean = 4, sd = 0.5))

And the GRanges objects, binded and sorted by the starting coordinate, are:

rep1 <- as_granges(rep1, seqnames = chr, start = pos, width = size)
rep2 <- as_granges(rep2, seqnames = chrom, start = st)
rep3 <- as_granges(rep3, seqnames = chromosome)

intensities <- bind_ranges(rep1, rep2, rep3, .id = "replicate")

arrange(intensities, start)

## GRanges object with 300 ranges and 3 metadata columns:
##         seqnames    ranges strand |         X          Y replicate
##            <Rle> <IRanges>  <Rle> | <numeric>  <numeric>     <Rle>
##     [1]       VI       5-6      * |   2.52287  -0.369134         1
##     [2]       VI       5-6      * |   3.10130  -0.255471         2
##     [3]       VI       5-6      * |   3.85942   3.718716         3
##     [4]       VI    99-100      * |   3.14127   2.218141         1
##     [5]       VI    99-100      * |   3.72455   1.203831         2
##     ...      ...       ...    ... .       ...        ...       ...
##   [296]       VI 9975-9976      * | -0.920874 2.32376606         2
##   [297]       VI 9975-9976      * |  1.862762 4.69020714         3
##   [298]       VI 9995-9996      * |  3.219007 0.00920946         1
##   [299]       VI 9995-9996      * |  3.658382 2.64333960         2
##   [300]       VI 9995-9996      * |  2.758994 3.16707578         3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Therefore, the first interesting product will be a filter of positions if they overlap with one of the genes known.

olap <- filter_by_overlaps(intensities, gr)
olap

## GRanges object with 105 ranges and 3 metadata columns:
##         seqnames    ranges strand |         X         Y replicate
##            <Rle> <IRanges>  <Rle> | <numeric> <numeric>     <Rle>
##     [1]       VI      5239      * |  1.202741  0.652819         1
##     [2]       VI 6848-6849      * |  3.282009  0.789747         1
##     [3]       VI      6645      * |  1.385816  0.866810         1
##     [4]       VI 3366-3367      * |  2.919752  2.445765         1
##     [5]       VI 2552-2553      * |  0.692582  2.169994         1
##     ...      ...       ...    ... .       ...       ...       ...
##   [101]       VI 6980-6981      * |  0.288208   3.24790         3
##   [102]       VI      2416      * |  0.960565   4.03358         3
##   [103]       VI 3836-3838      * |  1.931636   3.68147         3
##   [104]       VI      6674      * |  1.444077   4.57349         3
##   [105]       VI 1960-1962      * |  3.546545   4.42852         3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Session info

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=es_MX.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_MX.UTF-8        LC_COLLATE=es_MX.UTF-8    
##  [5] LC_MONETARY=es_MX.UTF-8    LC_MESSAGES=es_MX.UTF-8   
##  [7] LC_PAPER=es_MX.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] plyranges_1.12.0     GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 
##  [4] IRanges_2.26.0       S4Vectors_0.30.0     BiocGenerics_0.38.0 
##  [7] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.6         
## [10] purrr_0.3.4          readr_1.4.0          tidyr_1.1.3         
## [13] tibble_3.1.2         ggplot2_3.3.4        tidyverse_1.3.1     
## [16] BiocStyle_2.20.2    
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                matrixStats_0.59.0         
##  [3] fs_1.5.0                    lubridate_1.7.10           
##  [5] httr_1.4.2                  tools_4.1.0                
##  [7] backports_1.2.1             bslib_0.2.5.1              
##  [9] utf8_1.2.1                  R6_2.5.0                   
## [11] DBI_1.1.1                   colorspace_2.0-1           
## [13] withr_2.4.2                 tidyselect_1.1.1           
## [15] compiler_4.1.0              cli_2.5.0                  
## [17] rvest_1.0.0                 Biobase_2.52.0             
## [19] xml2_1.3.2                  DelayedArray_0.18.0        
## [21] rtracklayer_1.52.0          bookdown_0.22              
## [23] sass_0.4.0                  scales_1.1.1               
## [25] digest_0.6.27               Rsamtools_2.8.0            
## [27] rmarkdown_2.9               XVector_0.32.0             
## [29] pkgconfig_2.0.3             htmltools_0.5.1.1          
## [31] MatrixGenerics_1.4.0        dbplyr_2.1.1               
## [33] rlang_0.4.11                readxl_1.3.1               
## [35] rstudioapi_0.13             BiocIO_1.2.0               
## [37] jquerylib_0.1.4             generics_0.1.0             
## [39] jsonlite_1.7.2              BiocParallel_1.26.0        
## [41] RCurl_1.98-1.3              magrittr_2.0.1             
## [43] GenomeInfoDbData_1.2.6      Matrix_1.3-4               
## [45] Rcpp_1.0.6                  munsell_0.5.0              
## [47] fansi_0.5.0                 lifecycle_1.0.0            
## [49] stringi_1.6.2               yaml_2.2.1                 
## [51] SummarizedExperiment_1.22.0 zlibbioc_1.38.0            
## [53] grid_4.1.0                  crayon_1.4.1               
## [55] lattice_0.20-44             Biostrings_2.60.1          
## [57] haven_2.4.1                 hms_1.1.0                  
## [59] knitr_1.33                  pillar_1.6.1               
## [61] rjson_0.2.20                reprex_2.0.0               
## [63] XML_3.99-0.6                glue_1.4.2                 
## [65] evaluate_0.14               BiocManager_1.30.16        
## [67] modelr_0.1.8                vctrs_0.3.8                
## [69] cellranger_1.1.0            gtable_0.3.0               
## [71] assertthat_0.2.1            xfun_0.24                  
## [73] broom_0.7.7                 restfulr_0.0.13            
## [75] GenomicAlignments_1.28.0    ellipsis_0.3.2