Genomic Alignments

Introduction:

The GenomicAlignments package provides efficient containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). Its functionality covers read counting, coverage computation, junction identification, and manipulation of the nucleotide content of alignments (Lawrence et al. 2013).

GenomicAlignments is a package within the Bioconductor project for R. It serves as a starting point for representing genomic alignments in R. It builds on the Bioconductor infrastructure for genomic ranges and integrates with most other Bioconductor packages. Three classes are defined: GAlignments, GAlignmentPairs, and GAlignmentsList, representing genomic alignments, pairs of genomic alignments, and groups of genomic alignments, respectively. GenomicAlignments is available from bioconductor.org and can be installed with the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GenomicAlignments")

## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'GenomicAlignments'

## Installation paths not writeable, unable to update packages
##   path: C:/Program Files/R/R-4.1.2/library
##   packages:
##     Matrix

library(GenomicAlignments)

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: GenomeInfoDb

## Loading required package: GenomicRanges

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

## Loading required package: Biostrings

## Loading required package: XVector

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## Loading required package: Rsamtools

GAlignments:

GAlignments is a container class for storing a set of genomic alignments. It is typically populated from Binary Alignment Map (BAM) files. It also supports alignments with gaps in the reference sequence, for instance junction reads from an RNA-seq experiment. It is a vector-like object in which each element describes one alignment, that is, how a given sequence aligns to a reference sequence.

A GAlignments object can be created from a BAM file, in which case each element of the object corresponds to one record in the file. Keep in mind that the object does not hold all of the information contained in the BAM/SAM records. For now, the query sequences (SEQ field), query ids (QNAME field), query qualities (QUAL), mapping qualities (MAPQ), and any other data not required for the operations described in this text are ignored. This also means that multi-reads (i.e. reads with multiple hits in the reference) receive no special treatment: the several SAM/BAM records belonging to a multi-read appear in the GAlignments object as if they came from separate, unrelated queries.

SAMtools:

The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. It supports both short and long reads (up to 128 Mbp) produced by different sequencing platforms (Li et al. 2009). SAMtools provides general-purpose utilities for processing alignments in the SAM format, including indexing, a variant caller, and an alignment viewer (Li et al. 2009).

Import BAM file into GAlignments object:

The readGAlignments function loads a BAM file into a GAlignments object.

library(GenomicAlignments)
aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln1 <- readGAlignments(aln1_file)
aln1

## GAlignments object with 3271 alignments and 0 metadata columns:
##          seqnames strand       cigar    qwidth     start       end     width
##             <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##      [1]     seq1      +         36M        36         1        36        36
##      [2]     seq1      +         35M        35         3        37        35
##      [3]     seq1      +         35M        35         5        39        35
##      [4]     seq1      +         36M        36         6        41        36
##      [5]     seq1      +         35M        35         9        43        35
##      ...      ...    ...         ...       ...       ...       ...       ...
##   [3267]     seq2      +         35M        35      1524      1558        35
##   [3268]     seq2      +         35M        35      1524      1558        35
##   [3269]     seq2      -         35M        35      1528      1562        35
##   [3270]     seq2      -         35M        35      1532      1566        35
##   [3271]     seq2      -         35M        35      1533      1567        35
##              njunc
##          <integer>
##      [1]         0
##      [2]         0
##      [3]         0
##      [4]         0
##      [5]         0
##      ...       ...
##   [3267]         0
##   [3268]         0
##   [3269]         0
##   [3270]         0
##   [3271]         0
##   -------
##   seqinfo: 2 sequences from an unspecified genome

length(aln1)

## [1] 3271

This means that 3271 BAM records were loaded into the GAlignments object.
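The fields that are dropped on import can be requested explicitly. The following is a minimal sketch, assuming the same ex1.bam example file; the variable names bam_file, param, and aln_extra are chosen here for illustration only. It uses ScanBamParam from Rsamtools to ask for the mapping quality and query sequence, and use.names = TRUE to keep the query ids.

```r
library(GenomicAlignments)  # Rsamtools, which provides ScanBamParam(), is attached with it

bam_file <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Request the mapping quality (MAPQ) and query sequence (SEQ) on top of the default fields.
param <- ScanBamParam(what = c("mapq", "seq"))

# use.names = TRUE keeps the query ids (QNAME) as the names of the object.
aln_extra <- readGAlignments(bam_file, use.names = TRUE, param = param)

# Query ids are now available as names.
head(names(aln_extra))

# The requested fields are stored as metadata columns.
mcols(aln_extra)
```

The extra fields increase the memory footprint of the object, which is presumably why they are not loaded unless asked for.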

Simple accessor methods:

There is one accessor for each field displayed by the show method, and it has the same name as the field. Each accessor returns a vector or factor of the same length as the object.

head(seqnames(aln1))

## factor-Rle of length 6 with 1 run
##   Lengths:    6
##   Values : seq1
## Levels(2): seq1 seq2

seqlevels(aln1)

## [1] "seq1" "seq2"

head(strand(aln1))

## factor-Rle of length 6 with 1 run
##   Lengths: 6
##   Values : +
## Levels(3): + - *

head(start(aln1))

## [1]  1  3  5  6  9 13

head(end(aln1))

## [1] 36 37 39 41 43 47

head(njunc(aln1))

## [1] 0 0 0 0 0 0
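The remaining fields (cigar, qwidth, and width) have accessors of the same kind, and because the object is vector-like it can be subset like an ordinary vector. The sketch below illustrates this on the same example file; the object is loaded again only so that the block runs on its own, and the name minus_strand is an illustrative choice.

```r
library(GenomicAlignments)

aln1_file <- system.file("extdata", "ex1.bam", package = "Rsamtools")
aln1 <- readGAlignments(aln1_file)

# Remaining field accessors: CIGAR string, query width, and width on the reference.
head(cigar(aln1))
head(qwidth(aln1))
head(width(aln1))

# Vector-like behaviour: subset by position ...
aln1[1:3]

# ... or with a logical vector, here keeping only minus-strand alignments.
minus_strand <- aln1[as.vector(strand(aln1) == "-")]
length(minus_strand)

# Number of alignments per reference sequence and strand.
table(as.character(seqnames(aln1)), as.character(strand(aln1)))
```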

Discussion

The Bioconductor infrastructure provides a computational environment for representing and computing on annotated genomic ranges, and for integrating genomic data with the statistical computing capabilities of R and its extensions. Three packages form the foundation: IRanges, GenomicRanges, and GenomicFeatures. These packages offer scalable data structures for describing annotated ranges on the genome, including read alignments and coverage vectors. The infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis, and visualisation. In future, the GAlignmentPairs and GAlignmentsList classes will be explored together with their specific methods. Finally, as datasets grow in size, more efficient algorithms and data structures will be needed, along with ways to take advantage of parallel processing (Lawrence et al. 2013).
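As an illustration of the coverage vectors and range-based operations mentioned above, the following sketch computes coverage and counts reads on the example BAM file. It is a minimal example, not a full workflow: the region of interest is hypothetical, with coordinates picked only for demonstration.

```r
library(GenomicAlignments)

aln1_file <- system.file("extdata", "ex1.bam", package = "Rsamtools")
aln1 <- readGAlignments(aln1_file)

# coverage() returns an RleList: one run-length encoded depth vector per reference sequence.
cvg <- coverage(aln1)
cvg

# Mean depth for each reference sequence.
mean(cvg)

# Read counting: number of alignments overlapping a region of interest.
# The region below is hypothetical and used for illustration only.
region <- GRanges("seq1", IRanges(start = 100, end = 500))
countOverlaps(region, aln1)
```

Run-length encoding keeps the coverage vectors compact wherever neighbouring positions share the same depth.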

References:

 citation("GenomicAlignments")

## 
##   Lawrence M, Huber W, Pag\`es H, Aboyoun P, Carlson M, et al. (2013)
##   Software for Computing and Annotating Genomic Ranges. PLoS Comput
##   Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Software for Computing and Annotating Genomic Ranges},
##     author = {Michael Lawrence and Wolfgang Huber and Herv\'e Pag\`es and Patrick Aboyoun and Marc Carlson and Robert Gentleman and Martin Morgan and Vincent Carey},
##     year = {2013},
##     journal = {{PLoS} Computational Biology},
##     volume = {9},
##     issue = {8},
##     doi = {10.1371/journal.pcbi.1003118},
##     url = {http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118},
##   }

Lawrence, M., Huber, W., Pagès, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M. T. and Carey, V. J. (2013) Software for Computing and Annotating Genomic Ranges. PLOS Computational Biology 9 (8), e1003118.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 (16), 2078-2079.