New abilities to read and query DataSets
Obtain “SC” data from scraps.bam files
Obtain Start frame and Pkmid information with alignments
Common data manipulation and transformation functions
Bug Fixes

This document describes recent changes and enhancements to the pbbamr package.

New abilities to read and query DataSets

pbbamr traditionally worked on single BAM files, which was mismatched with SMRTLink which uses Dataset files that represented a collection of BAM files specified in XML. The library now has new abilities to make working with these datasets much easier.

A ‘Meta’ PBI index for DataSets

loadPBI now can return a “meta” index of a collection of BAMs by passing in a dataset file name (though it still accepts BAM file names). The function now also returns a “file” column in the data frame that says which BAM file an entry refers to, as the example below shows.

library(pbbamr)
index = loadPBI("AlignmentSet/m54006_160504_020705.alignmentset.xml")
index[1:3,]

##                                                           file     hole
## 1 AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam 62325105
## 2 AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam 62325105
## 3 AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam 64422274
##   qstart  qend qual     offset flag ref tstart  tend astart  aend    rc
## 1      0 15731  0.8   49414144    2   1    790 14606   1084 15731  TRUE
## 2  15774 23769  0.8 1915492474    1   1    792  8624  15777 23761 FALSE
## 3  14714 23089  0.8 3290038272    1   1   1556  9948  14714 23075 FALSE
##   matches mismatches  mq inserts dels bcf bcr bcqual
## 1   12584        667 254    1396  565  -1  -1     -1
## 2    7226        281 254     477  325  -1  -1     -1
## 3    7667        302 254     392  423  -1  -1     -1

A new interface to obtain alignments

Previously when loading alignments, one would use the loadDataAtOffsets function which would take a file offset, a BAM file name and a fasta file name. This function is now deprecated in favor of one that can take advantage of the fact that the indexes returned by loadPBI contain the BAM file name so we don’t need to keep passing that around. The new function takes a data frame returned by loadPBI, a fasta name containing the reference, and optionally a subset of rows to obtain alignments from. Its signature is loadAlnsFromIndex(index, indexed_fasta_name, rows)

bfile = "test.aligned.bam"
ffile = "lambdaNEB.fa"
index = loadPBI(bfile)
alns = loadAlnsFromIndex(index, ffile)
head(alns[[1]])

##   read ref ipd pw     snrA    snrC     snrG     snrT
## 1    G   G 952 28 5.419003 9.99935 7.683229 11.52461
## 2    G   G   6  9 5.419003 9.99935 7.683229 11.52461
## 3    G   G  76 15 5.419003 9.99935 7.683229 11.52461
## 4    C   C  96 43 5.419003 9.99935 7.683229 11.52461
## 5    G   G  13  2 5.419003 9.99935 7.683229 11.52461
## 6    G   G   7  6 5.419003 9.99935 7.683229 11.52461

Convenience functions to examine Dataset XML files

Introduced one function to get the names of all BAMs in a dataset and another to obtain the file name of the fasta file from a reference dataset.

# View BAM files
getBAMNamesFromDatasetFile("AlignmentSet/m54006_160504_020705.alignmentset.xml")

## [1] "AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam"
## [2] "AlignmentSet/m54006_160504_020705.tiny_mapped.2.subreads.bam"

# Get Reference Fasta Name
getFastaFileNameFromDatasetFile("ReferenceSet/lambdaNEB/referenceset.xml")

## [1] "./sequence/lambdaNEB.fasta"

Obtain “SC” data from scraps.bam files

loadPBI now has a loadSC flag which allows the user to also load the type of scrap for an entry if and only if the BAM file is a *.scraps.bam file.

index = loadPBI("SubreadSet/m54006_160504_020705.tiny.scraps.bam", loadSC = TRUE)
summary(index$sc)

##  Adapter  Barcode Filtered LQRegion HQRegion 
##       19        0        1        3        0

Obtain Start frame and Pkmid information with alignments

Internal BAMs have information that is not typically available in production BAMs. pbbamr now has the ability to load more of this information, including the Start Frame and the Pkmid values.

# Get names of data files distributed with pbbamr to use as examples
ifastaname = system.file("extdata", "All4Mer.V2.11.fna", package="pbbamr")
ibamname = system.file("extdata", "internalsample.bam", package="pbbamr")
# Load the index table.
iind = loadPBI(ibamname)
allAlns = loadAlnsFromIndex(iind, ifastaname)
head(allAlns[[1]])

##   read ref ipd pw     snrA     snrC     snrG     snrT pkmid    sf
## 1    G   G 226  1 6.987833 12.82881 8.718133 12.93477   0.0 48324
## 2    C   C  28 15 6.987833 12.82881 8.718133 12.93477 111.1 48353
## 3    A   A  10  5 6.987833 12.82881 8.718133 12.93477  60.2 48378
## 4    G   G  45 11 6.987833 12.82881 8.718133 12.93477  66.6 48428
## 5    A   A  20 13 6.987833 12.82881 8.718133 12.93477  60.0 48459
## 6    C   C  12 41 6.987833 12.82881 8.718133 12.93477 109.1 48484

Common data manipulation and transformation functions

Combine datasets while adding a named factor variable

Often times we have a collection of datasets we want to combine, while also adding a factor variable so we can for example plot the data together or easily calculate summary statistics for each of the original groupings. We added a new function to enable this. It’s equivalent to the collapse function in pbutils, but should be orders of magnitude faster.

# Make some fake data to combine
df1 = data.frame(a=1:2, b=3:4)
df2 = data.frame(a=5:6, b=7:8)
dfs = list(First = df1, Second = df2)
# Now use the new function taking the factor names from the list
combineConditions(dfs)

##    a b Condition
## 1: 1 3     First
## 2: 2 4     First
## 3: 5 7    Second
## 4: 6 8    Second

# Or do it the second way, specifying a name
combineConditions(dfs, c("I want a crazy name", "No, I want a crazy name"))

##    a b               Condition
## 1: 1 3     I want a crazy name
## 2: 2 4     I want a crazy name
## 3: 5 7 No, I want a crazy name
## 4: 6 8 No, I want a crazy name

Calculate chip X, Y positions from the hole number

Added very simple functions which can be used as shown below.

hole = 500*(2^16) + 55
getHoleX(hole)

## [1] 500

getHoleY(hole)

## [1] 55

Phred conversion functions

See ?fromPhred and ?toPhred for details.

Bug Fixes

The pkmid may have been returned incorrectly on some alignments, this is now fixed.

Updates to pbbamr package

Nigel Delaney

July 12, 2016