Running BLAST from R

Kevin Keenan 2014

Introduction

BLAST (Basic Local Alignment Search Tool) is a well known web tool for searching for query sequences in databases. However, it might be useful to use this tool from a scripting interface, when multiple query sequences are being used, say. This document describes how R can be used to do this.

Description of methods

There appears to be two relatively straightforward method for running a BLAST from R.

The rest of this document will demonstrate these two suggestions.

Using the blastSequences function

The blastSequences function appears to work well in most instances. For slow web connections or for large query sequences, it appears to fail. This seem to be due to either a low number of attempts to retrieve hits from the ncbi server, or the short time the routine waits before it throws an error in R. It seem that if we increase the number of attempts to retrieve hits, generally, we can successfully run our BLAST.
The below code re-defines the blastSequences function, with an additional argument (attempts), to allow users to control how many times the routine should try to retrieve results from the server.

# function definition
blastSeqKK <- function (x, database = "nr", hitListSize = "10", 
                        filter = "L", expect = "10", program = "blastn",
                        attempts = 10) {
  baseUrl <- "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"
  query <- paste("QUERY=", as.character(x), "&DATABASE=", database, 
                 "&HITLIST_SIZE=", hitListSize, "&FILTER=", filter, "&EXPECT=", 
                 expect, "&PROGRAM=", program, sep = "")
  url0 <- sprintf("%s?%s&CMD=Put", baseUrl, query)
  results <- tempfile()
  Sys.sleep(5)
  require(XML)
  post <- htmlTreeParse(url0, useInternalNodes = TRUE)
  x <- post[["string(//comment()[contains(., \"QBlastInfoBegin\")])"]]
  rid <- sub(".*RID = ([[:alnum:]]+).*", "\\1", x)
  rtoe <- as.integer(sub(".*RTOE = ([[:digit:]]+).*", "\\1", 
                         x))
  url1 <- sprintf("%s?RID=%s&FORMAT_TYPE=XML&CMD=Get", baseUrl, 
                  rid)
  Sys.sleep(rtoe)
  .tryParseResult <- function(url, attempts){
    for (i in 1:(attempts+1)) {
      result <- tryCatch({
        xmlTreeParse(url, useInternalNodes=TRUE,
                     error = xmlErrorCumulator(immediate=FALSE))
      }, error=function(err) NULL)
      if (!is.null(result)) return(result)
      Sys.sleep(10)
    }
    stop(paste("no results after ", attempts, 
               " attempts; please try again later", sep = ""))
  }
  result <- .tryParseResult(url1, attempts)
  qseq <- xpathApply(result, "//Hsp_qseq", xmlValue)
  hseq <- xpathApply(result, "//Hsp_hseq", xmlValue)
  require(Biostrings)
  res <- list()
  for (i in seq_len(length(qseq))) {
    res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]), 
                                   rowmask = as(IRanges(), "NormalIRanges"), colmask = as(IRanges(), 
                                                                                          "NormalIRanges"))
  }
  res
}

To use this function, we should load the annotate package and its dependencies.

source("http://bioconductor.org/biocLite.R")
biocLite("annotate")

Now that the hacked function is defined, we just need a way to read our sequences (in .fasta format) into R, and pass them, one at a time to the blastSeqKK function. This is outlined next:

library(seqinr)
mySeq <- read.fasta("example.FASTA")

In this case, the sequences were in the file, 'example.FASTA'. The file looks as below:

>DNA_seq1
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
>DNA_seq2
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
>DNA_seq3
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
>DNA_seq4
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

In this example, as we are only demonstrating the method, these four sequences are actually identical to each other.

Now that our sequences are in R, we can manipulate them into simple character strings (the format expected by blastSeqKK) and BLAST them. The code below, takes each sequence (n = 4), converts it into a character string and then runs blastSeqKK on that sequence, and saves the results in the object, res. This object can then be probed downstream. See the annotate documentation for a description of the additional arguments used by blastSequences.

res <- lapply(mySeq, function(x){
  # collapse seq into string
  seqCollapse <- paste(toupper(as.character(x)), collapse = "")
  # run blast
  blastRes <- blastSeqKK(x = seqCollapse, database = "nr", hitListSize = 3,
                         attempts = 20)
  return(blastRes)
})
Loading required package: XML
Loading required package: Biostrings
Loading required package: IRanges
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:seqinr':

    translate

Providing the above chunk completes without an error, our results should be in the res object. The standard output of this object is as below:

$DNA_seq1
$DNA_seq1[[1]]
Loading required package: Biostrings
Loading required package: IRanges
Loading required package: XVector
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq1[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq1[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA


$DNA_seq2
$DNA_seq2[[1]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq2[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq2[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA


$DNA_seq3
$DNA_seq3[[1]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq3[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq3[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA


$DNA_seq4
$DNA_seq4[[1]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq4[[2]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

$DNA_seq4[[3]]
DNAMultipleAlignment with 2 rows and 145 columns
     aln
[1] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
[2] CTCGGGCGGGGCCGGCCGGACGGACAGGCGGACA...CCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

We can see that the object res is a list of 4 DNAMultipleAlignment objects. Because we set the argument hitListSize to 3 in this example, each of the four elements in res contain 3 separate alignment hits. For further information on how to proceed with this data structure, see the Biostrings package in Bioconductor.

Using R to script BLAST+

This method uses R as a scripting interface to run the installed library BLAST+. To install this package execute the following command from the terminal (we could do this from R, but the security work-arounds aren't worth the hassle).

sudo apt-get install blast+

Once blast+ is installed, we can go back to R, and run our blast pretty easily. In this example, say we have multiple files with multiple sequences in each (perhaps our query sequences are large and we don't want to exceed the ncbi server CPU limit). This example shows how we could script the blast process from within R.

# identify all fasta files in our working directory
infileNms <- list.files(pattern = "*.FASTA")
infileNms
[1] "example1.FASTA" "example2.FASTA" "example3.FASTA"

Here we can see that there are three fasta files in the working directory (for simplicity, these are just the above fasta triplicated). Now that we know the names of our files we want to create an object which will give our output file a unique name (in this example based on the input name).

outnames <- paste(unlist(sapply(infileNms, strsplit, split = "*.FASTA")), 
                  ".out", sep = "")

Now we have input and output names for each of our query files. We simply need to construct our commands and run blast (blastn in this example).

# create commands function
cmdCreate <- function(infile, outfile){
  paste("blastn -db nr -query ", infile, " -remote -out ",
                     outfile, sep = "")
}

# create actual commands
cmds <- mapply(FUN = cmdCreate, infile = infileNms, outfile = outnames)

The code above creates the following three system commands:

[1] "blastn -db nr -query example1.FASTA -remote -out example1.out"
[2] "blastn -db nr -query example2.FASTA -remote -out example2.out"
[3] "blastn -db nr -query example3.FASTA -remote -out example3.out"

To run these commands, we simply execute the below command:

sapply(cmds, system)
blastn -db nr -query example1.FASTA -remote -out example1.out 
                                                            0 
blastn -db nr -query example2.FASTA -remote -out example2.out 
                                                            0 
blastn -db nr -query example3.FASTA -remote -out example3.out 
                                                            0 

This command will result in three files being written to the working directory. The names of these files should be equal to the object outnames. We can check this as follows:

resFiles <- list.files(pattern = "*.out")
resFiles == outnames
[1] TRUE TRUE TRUE

These files contain alignment hits for all query sequences in each of our three fasta files, and can be probed for further analyses. I have not figured out a good method for parsing these files in R yet!

Reproducibility

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Biostrings_2.30.1    XVector_0.2.0        IRanges_1.20.6      
[4] annotate_1.40.0      AnnotationDbi_1.24.0 Biobase_2.22.0      
[7] BiocGenerics_0.8.0   knitr_1.5           

loaded via a namespace (and not attached):
 [1] DBI_0.2-7      digest_0.6.4   evaluate_0.5.1 formatR_0.10  
 [5] RSQLite_0.11.4 stats4_3.0.2   stringr_0.6.2  tools_3.0.2   
 [9] XML_3.98-1.1   xtable_1.7-1