1 Download NCBI Blast

The most current version is 2.14.0. Follow this link to make sure you are downloading the most recent version.

1.1 curl and tar

  • curl will download the software to your chosen location
  • tar will unzip your downloaded file so the software may be called and used

The link you grab for your specific operating system goes in the curl command. Adjust filename accordingly in your tar command to

cd /Applications/bioinfo/
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.0+-x64-macosx.tar.gz
tar -xf ncbi-blast-2.14.0+-x64-macosx.tar.gz

1.2 Check

Ensure your bioinfo folder contains the unzipped version of NCBI Blast. In this case, ncbi-blast-2.14.0+.

ls /Applications/bioinfo/

1.3 View Usage Information

This code runs the blastx program, which is part of the NCBI Blast program we downloaded. The -h option displays the help message for blastx, which provides information about the program’s command line options and usage.

/Applications/bioinfo/ncbi-blast-2.14.0+/bin/blastx -h

2 Make a Blast Database

UniProt is a comprehensive, high-quality protein sequence database that provides a central resource for protein sequences and functional information. We will make a blast database using UniProt.

2.1 Download Database

Download UniProtKB Swiss-Prot protein sequence database in FASTA format.

  • curl downloads zipped fasta to location specified by cd
  • mv renames file as uniprot_sprot_r2023_01.fasta.gz
  • gunzip unzips the file
  • ls lists the contents of the ../data directory to confirm that the uniprot_sprot_r2023_01.fasta file has been created
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data

2.2 Create Blast Database

Make blast database with the makeblastdb program.

  • The first line runs the makeblastdb program
  • -in specifies input file is ../data/uniprot_sprot_r2023_01.fasta, which is the UniProt protein database in FASTA format that was downloaded and unzipped in a previous step
  • -dbtype prot specifies that the database type is protein
  • -out ../blastdb/uniprot_sprot_r2023_01 specifies the name of the output database file and its directory
/Applications/bioinfo/ncbi-blast-2.14.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_01

2.3 Retrieve Query Sequence

Get the sequence we will be blasting from one of the Roberts Lab’s machines.

2.4 Download with curl

  • -k allows curl to proceed and operate even for server connections considered insecure
  • > redirects the output to a file named Ab_4denovo_CLC6_a.fa in the ../data/ directory
curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> ../data/Ab_4denovo_CLC6_a.fa

2.5 Check

Inspect the file just downloaded.

  • head prints the first few lines of the file for viewing
  • echo prints “How many sequences are there?” to the console
  • grep searches for lines in the file that start with “>”, which is the FASTA file format for sequence headers. By counting the number of matches, it determines the number of sequences in the file and prints that number to the console.
head ../data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c ">" ../data/Ab_4denovo_CLC6_a.fa
## >solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1
## ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC
## TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC
## TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC
## TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC
## CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA
## TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG
## CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA
## AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC
## TTC
## How many sequences are there?
## 5490

3 Run Blast

Perform a blastx search of the protein sequences in the input FASTA file (Ab_4denovo_CLC6_a.fa) against the UniProtKB/Swiss-Prot protein database (uniprot_sprot_r2023_01).

3.1 Run BlastX

  • The first line runs the blastx program
  • -query specifies the input query FASTA file
  • -db specifies the reference database against which to search
  • -out specifies the name and location of the output file for the blastx results
  • -evalue option sets the maximum e-value threshold for reporting matches
  • -num_threads sets the number of CPU threads to use for the blast search
  • -max_target_seqs sets the maximum number of aligned sequences to keep in the output
  • -outfmt specifies the format of the output (in this case, tabular format)
/Applications/bioinfo/ncbi-blast-2.14.0+/bin/blastx \
-query ../data/Ab_4denovo_CLC6_a.fa \
-db ../blastdb/uniprot_sprot_r2023_01 \
-out ../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 100 \
-max_target_seqs 1 \
-outfmt 6

3.2 Check

Inspect your output file.

  • head prints the first few lines for viewing
  • wc -l counts the number of lines in the file
head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.81e-103   301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.40e-28    104
##      705 ../output/Ab_4-uniprot_blastx.tab

4 Get Annotation Information

4.1 Query the Uniprot Database

To generate the API URL used with curl to successfully download the SwissProt protein database for annotating blast results following go to this Uniprot website link. Then click download > change format to TSV > click Generate URL for API > copy the link for API URL using the streaming endpoint. The link has already been placed in the following code chunk.

The database at this URL contains several parameters that specify the fields to be included in the TSV file, such as accession number, protein name, gene names, organism name, length, and various annotations (e.g., Gene Ontology terms, EC numbers, and cross-references to other databases)

  • curl downloads a tab-separated values (TSV) file from a URL specified by the given Uniprot query
  • -o specifies downloaded file is named uniprot_table_r2023_01.tab
  • -H option is used to specify the “Accept” header of the HTTP request as “text/plain; format=tsv”, indicating that the server should return a TSV file
curl -o uniprot_table_r2023_01.tab -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"

5 Join Blast and Annotation Tables

5.1 Inspect Blast Table

head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.81e-103   301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.40e-28    104
##      705 ../output/Ab_4-uniprot_blastx.tab

5.2 Process Blast Table

The Blast table needs to be converted to a tab separated table to join with the annotation table, which is already in tab separated format.

  • tr replaces the vertical bars (|) in the file Blast table (Ab_4-uniprot_blastx.tab) with tabs (
  • The resulting output is then redirected to a new file (../output/Ab_4-uniprot_blastx_sep.tab) using the > operator
  • The output file is then piped to head -2 to print the first two lines of the file with the replaced separator
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \
> ../output/Ab_4-uniprot_blastx_sep.tab

head -2 ../output/Ab_4-uniprot_blastx_sep.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp  O42248  GBLP_DANRE  82.456  171 30  0   1   513 35  205 2.81e-103   301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp  Q08013  SSRG_RAT    75.385  65  16  0   3   197 121 185 1.40e-28    104

5.3 Load Packages

These packages are necessary to join and display the two tables.

library(tidyverse)
library(readr)
library(dplyr)
library(httr)
library(kableExtra)
library(magrittr)

5.4 Read in Data Tables

The following chunk reads the Blast and annotation tables in as dataframes called bltabl and spgo. Note that the SwissProt table downloaded from UniProt website is hosted on the Gannet server and read in directly via URL.

bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
response <- GET(url, config(ssl_verifypeer = FALSE))
spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE)

5.5 Combine Blast and Annotation Tables

This code creates a table that combines the results of the BlastX and information about the SwissProt entries that were hit by the query by taking the bltabl and spgo data frames and joining them together based on the V3 and Entry columns, respectively. And then displays the results.

  • The select function is then used to keep only certain columns from the resulting data frame, specifically V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., and Gene.Ontology.IDs
  • The mutate function is used to rename the V1 column by replacing a specific string of characters within the column names with “Ab”
  • The resulting data frame is then passed to the head function to return the first few rows, and then to the kbl function to format it as a table
  • The kable_styling function is used to apply Bootstrap styling options to the resulting table
kbl(
head(
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
V1 V3 V13 Protein.names Organism Gene.Ontology..biological.process. Gene.Ontology.IDs
Ab_contig_3 O42248 0 Guanine nucleotide-binding protein subunit beta-2-like 1 (Receptor of activated protein kinase C) (RACK) Danio rerio (Zebrafish) (Brachydanio rerio) angiogenesis [GO:0001525]; convergent extension involved in gastrulation [GO:0060027]; negative regulation of Wnt signaling pathway [GO:0030178]; positive regulation of gastrulation [GO:2000543]; positive regulation of protein phosphorylation [GO:0001934]; regulation of cell division [GO:0051302]; regulation of establishment of cell polarity [GO:2000114]; regulation of protein localization [GO:0032880]; rescue of stalled ribosome [GO:0072344] GO:0001525; GO:0001934; GO:0005080; GO:0005634; GO:0005737; GO:0005829; GO:0005840; GO:0030178; GO:0032880; GO:0043022; GO:0045182; GO:0051302; GO:0060027; GO:0072344; GO:1990904; GO:2000114; GO:2000543
Ab_contig_5 Q08013 0 Translocon-associated protein subunit gamma (TRAP-gamma) (Signal sequence receptor subunit gamma) (SSR-gamma) Rattus norvegicus (Rat) SRP-dependent cotranslational protein targeting to membrane [GO:0006614] GO:0005783; GO:0005784; GO:0006614
Ab_contig_6 P12234 0 Phosphate carrier protein, mitochondrial (Phosphate transport protein) (PTP) (Solute carrier family 25 member 3) Bos taurus (Bovine) mitochondrial phosphate ion transmembrane transport [GO:1990547]; phosphate ion transmembrane transport [GO:0035435] GO:0005315; GO:0005739; GO:0005743; GO:0015293; GO:0035435; GO:0044877; GO:1990547
Ab_contig_9 Q41629 0 ADP,ATP carrier protein 1, mitochondrial (ADP/ATP translocase 1) (Adenine nucleotide translocator 1) (ANT 1) Triticum aestivum (Wheat) mitochondrial ADP transmembrane transport [GO:0140021]; mitochondrial ATP transmembrane transport [GO:1990544] GO:0005471; GO:0005743; GO:0140021; GO:1990544
Ab_contig_13 Q32NG4 0 Glutamine amidotransferase-like class 1 domain-containing protein 1 (Parkinson disease 7 domain-containing protein 1) Xenopus laevis (African clawed frog) GO:0005576
Ab_contig_23 Q9GNE2 0 60S ribosomal protein L23 (AeRpL17A) (L17A) Aedes aegypti (Yellowfever mosquito) (Culex aegypti) translation [GO:0006412] GO:0003735; GO:0005840; GO:0006412; GO:1990904

5.6 Write Table to Tab Separated File

This code writes the joined table to a tab separated file for further use and completes the same functions as the above code.

  • Takes the data from the bltabl and spgo data frames and performs a left join on them, using V3 in bltabl and Entry in spgo as the joining columns.
  • Selects columns V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., and Gene.Ontology.IDs from the joined data frame.
  • Replaces all instances of “solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed” in column V1 with “Ab” using the str_replace_all function.
  • Writes the resulting data frame to a tab-delimited file named “blast_annot_go.tab” in the “../output/” directory.
left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../output/blast_annot_go.tab", delim = '\t')

6 Final Output

The final tab file with annotated blast results can be found here.