BLAST (Basic Local Alignment Search Tool) is a well known web tool for searching for query sequences in databases. However, it might be useful to use this tool from a scripting interface, when multiple query sequences are being used, say. This document describes how R
can be used to do this.
There appears to be two relatively straightforward method for running a BLAST from R.
blastSequences
(some hacks needed) function in the annotate
package, available from Bioconductor
.The rest of this document will demonstrate these two suggestions.
blastSequences
functionThe blastSequences
function appears to work well in most instances. For slow web connections or for large query sequences, it appears to fail. This seem to be due to either a low number of attempts to retrieve hits from the ncbi server, or the short time the routine waits before it throws an error in R. It seem that if we increase the number of attempts to retrieve hits, generally, we can successfully run our BLAST.
The below code re-defines the blastSequences
function, with an additional argument (attempts
), to allow users to control how many times the routine should try to retrieve results from the server.
# function definition
blastSeqKK <- function (x, database = "nr", hitListSize = "10",
filter = "L", expect = "10", program = "blastn",
attempts = 10) {
baseUrl <- "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"
query <- paste("QUERY=", as.character(x), "&DATABASE=", database,
"&HITLIST_SIZE=", hitListSize, "&FILTER=", filter, "&EXPECT=",
expect, "&PROGRAM=", program, sep = "")
url0 <- sprintf("%s?%s&CMD=Put", baseUrl, query)
results <- tempfile()
Sys.sleep(5)
require(XML)
post <- htmlTreeParse(url0, useInternalNodes = TRUE)
x <- post[["string(//comment()[contains(., \"QBlastInfoBegin\")])"]]
rid <- sub(".*RID = ([[:alnum:]]+).*", "\\1", x)
rtoe <- as.integer(sub(".*RTOE = ([[:digit:]]+).*", "\\1",
x))
url1 <- sprintf("%s?RID=%s&FORMAT_TYPE=XML&CMD=Get", baseUrl,
rid)
Sys.sleep(rtoe)
.tryParseResult <- function(url, attempts){
for (i in 1:(attempts+1)) {
result <- tryCatch({
xmlTreeParse(url, useInternalNodes=TRUE,
error = xmlErrorCumulator(immediate=FALSE))
}, error=function(err) NULL)
if (!is.null(result)) return(result)
Sys.sleep(10)
}
stop(paste("no results after ", attempts,
" attempts; please try again later", sep = ""))
}
result <- .tryParseResult(url1, attempts)
qseq <- xpathApply(result, "//Hsp_qseq", xmlValue)
hseq <- xpathApply(result, "//Hsp_hseq", xmlValue)
require(Biostrings)
res <- list()
for (i in seq_len(length(qseq))) {
res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]),
rowmask = as(IRanges(), "NormalIRanges"), colmask = as(IRanges(),
"NormalIRanges"))
}
res
}
To use this function, we should load the annotate
package and its dependencies.
source("http://bioconductor.org/biocLite.R")
biocLite("annotate")
Now that the hacked function is defined, we just need a way to read our sequences (in .fasta format) into R
, and pass them, one at a time to the blastSeqKK
function. This is outlined next:
library(seqinr)
mySeq <- read.fasta("example.FASTA")
In this case, the sequences were in the file, 'example.FASTA'. The file looks as below:
>DNA_seq1
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
>DNA_seq2
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
>DNA_seq3
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
>DNA_seq4
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
In this example, as we are only demonstrating the method, these four sequences are actually identical to each other.
Now that our sequences are in R
, we can manipulate them into simple character strings (the format expected by blastSeqKK
) and BLAST them. The code below, takes each sequence (n = 4), converts it into a character string and then runs blastSeqKK
on that sequence, and saves the results in the object, res
. This object can then be probed downstream. See the annotate
documentation for a description of the additional arguments used by blastSequences.
res <- lapply(mySeq, function(x){
# collapse seq into string
seqCollapse <- paste(toupper(as.character(x)), collapse = "")
# run blast
blastRes <- blastSeqKK(x = seqCollapse, database = "nr", hitListSize = 3,
attempts = 20)
return(blastRes)
})
Loading required package: XML
Loading required package: Biostrings
Loading required package: IRanges
Loading required package: XVector
Attaching package: 'Biostrings'
The following object is masked from 'package:seqinr':
translate
Providing the above chunk completes without an error, our results should be in the res
object. The standard output of this object is as below:
$DNA_seq1
$DNA_seq1[[1]]
Loading required package: Biostrings
Loading required package: IRanges
Loading required package: XVector
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq1[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq1[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq2
$DNA_seq2[[1]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq2[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq2[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq3
$DNA_seq3[[1]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq3[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq3[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq4
$DNA_seq4[[1]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq4[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
$DNA_seq4[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
We can see that the object res
is a list of 4 DNAMultipleAlignment
objects. Because we set the argument hitListSize
to 3 in this example, each of the four elements in res
contain 3 separate alignment hits. For further information on how to proceed with this data structure, see the Biostrings
package in Bioconductor
.
R
to script BLAST+This method uses R
as a scripting interface to run the installed library BLAST+
. To install this package execute the following command from the terminal (we could do this from R
, but the security work-arounds aren't worth the hassle).
sudo apt-get install blast+
Once blast+ is installed, we can go back to R
, and run our blast pretty easily. In this example, say we have multiple files with multiple sequences in each (perhaps our query sequences are large and we don't want to exceed the ncbi server CPU limit). This example shows how we could script the blast process from within R.
# identify all fasta files in our working directory
infileNms <- list.files(pattern = "*.FASTA")
infileNms
[1] "example1.FASTA" "example2.FASTA" "example3.FASTA"
Here we can see that there are three fasta files in the working directory (for simplicity, these are just the above fasta triplicated). Now that we know the names of our files we want to create an object which will give our output file a unique name (in this example based on the input name).
outnames <- paste(unlist(sapply(infileNms, strsplit, split = "*.FASTA")),
".out", sep = "")
Now we have input and output names for each of our query files. We simply need to construct our commands and run blast (blastn in this example).
# create commands function
cmdCreate <- function(infile, outfile){
paste("blastn -db nr -query ", infile, " -remote -out ",
outfile, sep = "")
}
# create actual commands
cmds <- mapply(FUN = cmdCreate, infile = infileNms, outfile = outnames)
The code above creates the following three system commands:
[1] "blastn -db nr -query example1.FASTA -remote -out example1.out"
[2] "blastn -db nr -query example2.FASTA -remote -out example2.out"
[3] "blastn -db nr -query example3.FASTA -remote -out example3.out"
To run these commands, we simply execute the below command:
sapply(cmds, system)
blastn -db nr -query example1.FASTA -remote -out example1.out
0
blastn -db nr -query example2.FASTA -remote -out example2.out
0
blastn -db nr -query example3.FASTA -remote -out example3.out
0
This command will result in three files being written to the working directory. The names of these files should be equal to the object outnames
. We can check this as follows:
resFiles <- list.files(pattern = "*.out")
resFiles == outnames
[1] TRUE TRUE TRUE
These files contain alignment hits for all query sequences in each of our three fasta files, and can be probed for further analyses. I have not figured out a good method for parsing these files in R yet!
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Biostrings_2.30.1 XVector_0.2.0 IRanges_1.20.6
[4] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0
[7] BiocGenerics_0.8.0 knitr_1.5
loaded via a namespace (and not attached):
[1] DBI_0.2-7 digest_0.6.4 evaluate_0.5.1 formatR_0.10
[5] RSQLite_0.11.4 stats4_3.0.2 stringr_0.6.2 tools_3.0.2
[9] XML_3.98-1.1 xtable_1.7-1