This document describes recent changes and enhancements to the pbbamr package.
pbbamr traditionally worked on single BAM files, which was mismatched with SMRTLink which uses Dataset files that represented a collection of BAM files specified in XML. The library now has new abilities to make working with these datasets much easier.
loadPBI now can return a “meta” index of a collection of BAMs by passing in a dataset file name (though it still accepts BAM file names). The function now also returns a “file” column in the data frame that says which BAM file an entry refers to, as the example below shows.
library(pbbamr)
index = loadPBI("AlignmentSet/m54006_160504_020705.alignmentset.xml")
index[1:3,]
## file hole
## 1 AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam 62325105
## 2 AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam 62325105
## 3 AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam 64422274
## qstart qend qual offset flag ref tstart tend astart aend rc
## 1 0 15731 0.8 49414144 2 1 790 14606 1084 15731 TRUE
## 2 15774 23769 0.8 1915492474 1 1 792 8624 15777 23761 FALSE
## 3 14714 23089 0.8 3290038272 1 1 1556 9948 14714 23075 FALSE
## matches mismatches mq inserts dels bcf bcr bcqual
## 1 12584 667 254 1396 565 -1 -1 -1
## 2 7226 281 254 477 325 -1 -1 -1
## 3 7667 302 254 392 423 -1 -1 -1
Previously when loading alignments, one would use the loadDataAtOffsets function which would take a file offset, a BAM file name and a fasta file name. This function is now deprecated in favor of one that can take advantage of the fact that the indexes returned by loadPBI contain the BAM file name so we don’t need to keep passing that around. The new function takes a data frame returned by loadPBI, a fasta name containing the reference, and optionally a subset of rows to obtain alignments from. Its signature is loadAlnsFromIndex(index, indexed_fasta_name, rows)
bfile = "test.aligned.bam"
ffile = "lambdaNEB.fa"
index = loadPBI(bfile)
alns = loadAlnsFromIndex(index, ffile)
head(alns[[1]])
## read ref ipd pw snrA snrC snrG snrT
## 1 G G 952 28 5.419003 9.99935 7.683229 11.52461
## 2 G G 6 9 5.419003 9.99935 7.683229 11.52461
## 3 G G 76 15 5.419003 9.99935 7.683229 11.52461
## 4 C C 96 43 5.419003 9.99935 7.683229 11.52461
## 5 G G 13 2 5.419003 9.99935 7.683229 11.52461
## 6 G G 7 6 5.419003 9.99935 7.683229 11.52461
Introduced one function to get the names of all BAMs in a dataset and another to obtain the file name of the fasta file from a reference dataset.
# View BAM files
getBAMNamesFromDatasetFile("AlignmentSet/m54006_160504_020705.alignmentset.xml")
## [1] "AlignmentSet/m54006_160504_020705.tiny_mapped.1.subreads.bam"
## [2] "AlignmentSet/m54006_160504_020705.tiny_mapped.2.subreads.bam"
# Get Reference Fasta Name
getFastaFileNameFromDatasetFile("ReferenceSet/lambdaNEB/referenceset.xml")
## [1] "./sequence/lambdaNEB.fasta"
loadPBI now has a loadSC flag which allows the user to also load the type of scrap for an entry if and only if the BAM file is a *.scraps.bam file.
index = loadPBI("SubreadSet/m54006_160504_020705.tiny.scraps.bam", loadSC = TRUE)
summary(index$sc)
## Adapter Barcode Filtered LQRegion HQRegion
## 19 0 1 3 0
Internal BAMs have information that is not typically available in production BAMs. pbbamr now has the ability to load more of this information, including the Start Frame and the Pkmid values.
# Get names of data files distributed with pbbamr to use as examples
ifastaname = system.file("extdata", "All4Mer.V2.11.fna", package="pbbamr")
ibamname = system.file("extdata", "internalsample.bam", package="pbbamr")
# Load the index table.
iind = loadPBI(ibamname)
allAlns = loadAlnsFromIndex(iind, ifastaname)
head(allAlns[[1]])
## read ref ipd pw snrA snrC snrG snrT pkmid sf
## 1 G G 226 1 6.987833 12.82881 8.718133 12.93477 0.0 48324
## 2 C C 28 15 6.987833 12.82881 8.718133 12.93477 111.1 48353
## 3 A A 10 5 6.987833 12.82881 8.718133 12.93477 60.2 48378
## 4 G G 45 11 6.987833 12.82881 8.718133 12.93477 66.6 48428
## 5 A A 20 13 6.987833 12.82881 8.718133 12.93477 60.0 48459
## 6 C C 12 41 6.987833 12.82881 8.718133 12.93477 109.1 48484
Often times we have a collection of datasets we want to combine, while also adding a factor variable so we can for example plot the data together or easily calculate summary statistics for each of the original groupings. We added a new function to enable this. It’s equivalent to the collapse function in pbutils, but should be orders of magnitude faster.
# Make some fake data to combine
df1 = data.frame(a=1:2, b=3:4)
df2 = data.frame(a=5:6, b=7:8)
dfs = list(First = df1, Second = df2)
# Now use the new function taking the factor names from the list
combineConditions(dfs)
## a b Condition
## 1: 1 3 First
## 2: 2 4 First
## 3: 5 7 Second
## 4: 6 8 Second
# Or do it the second way, specifying a name
combineConditions(dfs, c("I want a crazy name", "No, I want a crazy name"))
## a b Condition
## 1: 1 3 I want a crazy name
## 2: 2 4 I want a crazy name
## 3: 5 7 No, I want a crazy name
## 4: 6 8 No, I want a crazy name
Added very simple functions which can be used as shown below.
hole = 500*(2^16) + 55
getHoleX(hole)
## [1] 500
getHoleY(hole)
## [1] 55
See ?fromPhred and ?toPhred for details.