The most current version is 2.14.0. Follow this link to make sure you are downloading the most recent version.
curl will download the software to your chosen locationtar will unzip your downloaded file so the software may be called and usedThe link you grab for your specific operating system goes in the curl command. Adjust filename accordingly in your tar command to
cd /Applications/bioinfo/
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.0+-x64-macosx.tar.gz
tar -xf ncbi-blast-2.14.0+-x64-macosx.tar.gz
Ensure your bioinfo folder contains the unzipped version of NCBI Blast. In this case, ncbi-blast-2.14.0+.
ls /Applications/bioinfo/
This code runs the blastx program, which is part of the NCBI Blast program we downloaded. The -h option displays the help message for blastx, which provides information about the program’s command line options and usage.
/Applications/bioinfo/ncbi-blast-2.14.0+/bin/blastx -h
UniProt is a comprehensive, high-quality protein sequence database that provides a central resource for protein sequences and functional information. We will make a blast database using UniProt.
Download UniProtKB Swiss-Prot protein sequence database in FASTA format.
curl downloads zipped fasta to location specified by cdmv renames file as uniprot_sprot_r2023_01.fasta.gzgunzip unzips the filels lists the contents of the ../data directory to confirm that the uniprot_sprot_r2023_01.fasta file has been createdcd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data
Make blast database with the makeblastdb program.
makeblastdb program-in specifies input file is ../data/uniprot_sprot_r2023_01.fasta, which is the UniProt protein database in FASTA format that was downloaded and unzipped in a previous step-dbtype prot specifies that the database type is protein-out ../blastdb/uniprot_sprot_r2023_01 specifies the name of the output database file and its directory/Applications/bioinfo/ncbi-blast-2.14.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_01
Get the sequence we will be blasting from one of the Roberts Lab’s machines.
curl-k allows curl to proceed and operate even for server connections considered insecure> redirects the output to a file named Ab_4denovo_CLC6_a.fa in the ../data/ directorycurl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> ../data/Ab_4denovo_CLC6_a.fa
Inspect the file just downloaded.
head prints the first few lines of the file for viewingecho prints “How many sequences are there?” to the consolegrep searches for lines in the file that start with “>”, which is the FASTA file format for sequence headers. By counting the number of matches, it determines the number of sequences in the file and prints that number to the console.head ../data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c ">" ../data/Ab_4denovo_CLC6_a.fa
## >solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1
## ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC
## TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC
## TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC
## TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC
## CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA
## TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG
## CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA
## AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC
## TTC
## How many sequences are there?
## 5490
Perform a blastx search of the protein sequences in the input FASTA file (Ab_4denovo_CLC6_a.fa) against the UniProtKB/Swiss-Prot protein database (uniprot_sprot_r2023_01).
-query specifies the input query FASTA file-db specifies the reference database against which to search-out specifies the name and location of the output file for the blastx results-evalue option sets the maximum e-value threshold for reporting matches-num_threads sets the number of CPU threads to use for the blast search-max_target_seqs sets the maximum number of aligned sequences to keep in the output-outfmt specifies the format of the output (in this case, tabular format)/Applications/bioinfo/ncbi-blast-2.14.0+/bin/blastx \
-query ../data/Ab_4denovo_CLC6_a.fa \
-db ../blastdb/uniprot_sprot_r2023_01 \
-out ../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 100 \
-max_target_seqs 1 \
-outfmt 6
Inspect your output file.
head prints the first few lines for viewingwc -l counts the number of lines in the filehead -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.81e-103 301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.40e-28 104
## 705 ../output/Ab_4-uniprot_blastx.tab
To generate the API URL used with curl to successfully download the SwissProt protein database for annotating blast results following go to this Uniprot website link. Then click download > change format to TSV > click Generate URL for API > copy the link for API URL using the streaming endpoint. The link has already been placed in the following code chunk.
The database at this URL contains several parameters that specify the fields to be included in the TSV file, such as accession number, protein name, gene names, organism name, length, and various annotations (e.g., Gene Ontology terms, EC numbers, and cross-references to other databases)
curl downloads a tab-separated values (TSV) file from a URL specified by the given Uniprot query-o specifies downloaded file is named uniprot_table_r2023_01.tab-H option is used to specify the “Accept” header of the HTTP request as “text/plain; format=tsv”, indicating that the server should return a TSV filecurl -o uniprot_table_r2023_01.tab -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"
head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.81e-103 301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.40e-28 104
## 705 ../output/Ab_4-uniprot_blastx.tab
The Blast table needs to be converted to a tab separated table to join with the annotation table, which is already in tab separated format.
tr replaces the vertical bars (|) in the file Blast table (Ab_4-uniprot_blastx.tab) with tabs (> operatorhead -2 to print the first two lines of the file with the replaced separatortr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \
> ../output/Ab_4-uniprot_blastx_sep.tab
head -2 ../output/Ab_4-uniprot_blastx_sep.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp O42248 GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.81e-103 301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp Q08013 SSRG_RAT 75.385 65 16 0 3 197 121 185 1.40e-28 104
These packages are necessary to join and display the two tables.
library(tidyverse)
library(readr)
library(dplyr)
library(httr)
library(kableExtra)
library(magrittr)
The following chunk reads the Blast and annotation tables in as dataframes called bltabl and spgo. Note that the SwissProt table downloaded from UniProt website is hosted on the Gannet server and read in directly via URL.
bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
response <- GET(url, config(ssl_verifypeer = FALSE))
spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE)
This code creates a table that combines the results of the BlastX and information about the SwissProt entries that were hit by the query by taking the bltabl and spgo data frames and joining them together based on the V3 and Entry columns, respectively. And then displays the results.
select function is then used to keep only certain columns from the resulting data frame, specifically V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., and Gene.Ontology.IDsmutate function is used to rename the V1 column by replacing a specific string of characters within the column names with “Ab”head function to return the first few rows, and then to the kbl function to format it as a tablekable_styling function is used to apply Bootstrap styling options to the resulting tablekbl(
head(
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| V1 | V3 | V13 | Protein.names | Organism | Gene.Ontology..biological.process. | Gene.Ontology.IDs |
|---|---|---|---|---|---|---|
| Ab_contig_3 | O42248 | 0 | Guanine nucleotide-binding protein subunit beta-2-like 1 (Receptor of activated protein kinase C) (RACK) | Danio rerio (Zebrafish) (Brachydanio rerio) | angiogenesis [GO:0001525]; convergent extension involved in gastrulation [GO:0060027]; negative regulation of Wnt signaling pathway [GO:0030178]; positive regulation of gastrulation [GO:2000543]; positive regulation of protein phosphorylation [GO:0001934]; regulation of cell division [GO:0051302]; regulation of establishment of cell polarity [GO:2000114]; regulation of protein localization [GO:0032880]; rescue of stalled ribosome [GO:0072344] | GO:0001525; GO:0001934; GO:0005080; GO:0005634; GO:0005737; GO:0005829; GO:0005840; GO:0030178; GO:0032880; GO:0043022; GO:0045182; GO:0051302; GO:0060027; GO:0072344; GO:1990904; GO:2000114; GO:2000543 |
| Ab_contig_5 | Q08013 | 0 | Translocon-associated protein subunit gamma (TRAP-gamma) (Signal sequence receptor subunit gamma) (SSR-gamma) | Rattus norvegicus (Rat) | SRP-dependent cotranslational protein targeting to membrane [GO:0006614] | GO:0005783; GO:0005784; GO:0006614 |
| Ab_contig_6 | P12234 | 0 | Phosphate carrier protein, mitochondrial (Phosphate transport protein) (PTP) (Solute carrier family 25 member 3) | Bos taurus (Bovine) | mitochondrial phosphate ion transmembrane transport [GO:1990547]; phosphate ion transmembrane transport [GO:0035435] | GO:0005315; GO:0005739; GO:0005743; GO:0015293; GO:0035435; GO:0044877; GO:1990547 |
| Ab_contig_9 | Q41629 | 0 | ADP,ATP carrier protein 1, mitochondrial (ADP/ATP translocase 1) (Adenine nucleotide translocator 1) (ANT 1) | Triticum aestivum (Wheat) | mitochondrial ADP transmembrane transport [GO:0140021]; mitochondrial ATP transmembrane transport [GO:1990544] | GO:0005471; GO:0005743; GO:0140021; GO:1990544 |
| Ab_contig_13 | Q32NG4 | 0 | Glutamine amidotransferase-like class 1 domain-containing protein 1 (Parkinson disease 7 domain-containing protein 1) | Xenopus laevis (African clawed frog) | GO:0005576 | |
| Ab_contig_23 | Q9GNE2 | 0 | 60S ribosomal protein L23 (AeRpL17A) (L17A) | Aedes aegypti (Yellowfever mosquito) (Culex aegypti) | translation [GO:0006412] | GO:0003735; GO:0005840; GO:0006412; GO:1990904 |
This code writes the joined table to a tab separated file for further use and completes the same functions as the above code.
bltabl and spgo data frames and performs a left join on them, using V3 in bltabl and Entry in spgo as the joining columns.V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., and Gene.Ontology.IDs from the joined data frame.V1 with “Ab” using the str_replace_all function.left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
write_delim("../output/blast_annot_go.tab", delim = '\t')
The final tab file with annotated blast results can be found here.