Step 4: Annotate the Reference Transcriptome using UniProt and BLAST

Use UniProt (a protein sequence and gene-ontology (GO) functional information database) and BLAST (a sequence search algorithm) to annotate the Montipora capitata reference genome

Author

Sarah Tanja

Published

May 30, 2023

celeste’s BLAST steps HERE

Install Packages

Code
if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')
if ("ggplot2" %in% rownames(installed.packages()) == 'FALSE') install.packages('ggplot2')

Load Libraries

Code
library(tidyverse)
library(ggplot2)

Using BLAST to annotate reference transcriptome

Downloading complete UniProt database. UniProt is a freely accessible database of protein sequence and functional information that is continuously updated every 8 weeks.

Warning

This code takes a while to run, go take a computer break

Code
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data

Using the NCBI blast software command ‘makeblastdb’ to create a blast database from the uniprot fasta file.

Code
# if blast folder doesn't already exist, make it inside the output folder
mkdir -p ../output/blast


/home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../output/blast/uniprot_sprot_r2023_01

Lets look at what’s in the reference transcriptome Montipora_capitata_HIv3.genes.cds.fna file before we BLAST it.

Code
cd ../data
head Montipora_capitata_HIv3.genes.cds.fna
echo "How many sequences are there?"
grep -c ">" Montipora_capitata_HIv3.genes.cds.fna

Blasting the reference transcriptome and creating a tab file:

Warning

Running the following code chunk takes a very long time! You can do this overnight if you start the chunk and even if you close your browser window it will continue to run thanks to multiplexing!

Blast the same reference fasta file that you used to do the pseudo-alignment in kallisto Here, we the ’Montipora_capitata_HIv3.genes.cds.fna’to align our sample sequences. We run BLAST on this to annotate it and get a large .tab output file that have functional gene annotation associated with the coding sequences

This will make a .tab file named ’Bsc-uniprot_blastx.tab’and save it in the ../output/blast folder…

Code
/home/shared/ncbi-blast-2.11.0+/bin/blastx \    # The path to the blastx executable
-query ../data/Montipora_capitata_HIv3.genes.cds.fna \    # The input query file containing nucleotide sequences
-db ../output/blast/uniprot_sprot_r2023_01 \    # The path to the blast database (UniProt) to be used for comparison
-out ../output/blast/Bsc-uniprot_blastx.tab \    # The output file to store the blastx results
-evalue 1E-20 \    # The E-value threshold for reporting significant matches
-num_threads 20 \    # The number of threads (parallel processes) to be used for the blastx search
-max_target_seqs 1 \    # The maximum number of target sequences to report for each query
-outfmt 6    # The output format of the blastx results (tabular format)

Lets take a little peak at the tab file we just created:

Code
head -5 ../output/blast/Bsc-uniprot_blastx.tab
wc -l ../output/blast/Bsc-uniprot_blastx.tab

Curl file from UniProt that contains protein data in a specified format (accession number, reviewed status, ID, protein name, gene names, organism name, length, GO function, GO, process, GO component, GO ID, interacting partners, EC number, Reactome pathway, UniPathway, and InterPro cross-references).

The file will be saved as stream.tsv.gz

Code
cd ../data
curl -H "Accept-Encoding: gzip" -o stream.tsv.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"
Code
tr '|' '\t' < ../output/blast/Bsc-uniprot_blastx.tab | head -2
# replace the "|" with "\t" in the file "Ab_4-uniprot_blastx.tab"
# and save the output to the file "Ab_4-uniprot_blastx_sep.tab"
tr '|' '\t' < ../output/blast/Bsc-uniprot_blastx.tab \
> ../output/blast/Bsc-uniprot_blastx_sep.tab

Notice that the column V3contains the UniProt accession number, which is a unique identifier for each protein entry.

Code
blasttab <- read.csv("../output/blast/Bsc-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
head(blasttab)

Read in annotation table from gannet server and save as spgo

Code
library(httr)
url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
response <- GET(url, config(ssl_verifypeer = FALSE))
spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE)

Check out the dataframe spgo. Notice that the column Entrycontains the UniProt accession number, which is a unique identifier for each protein entry.

Code
head(spgo)

Join blasttab and spgo tables by the UniProt accession number (blasttab$V3 and spgo$Entry) and save the output as blast_annot_go.tab file in the ../output/blast folder

Code
left_join(blasttab, spgo, by = c("V3" = "Entry")) %>% select(
  V1,
  V3,
  V13,
  Protein.names,
  Organism,
  Gene.Ontology..biological.process.,
  Gene.Ontology.IDs
) %>% mutate(
  V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")
) %>% write_delim("../output/blast/blast_annot_go.tab", delim = '\t')

Look at the blast_annot_go.tab file which contains the original gene ID and Gene Ontology (GO) terms.

Code
blast_go <- read.delim(file = "../output/blast/blast_annot_go.tab", header = TRUE)
head(blast_go)

This will help us understand our DEGlist from our differential gene expression analysis. ::: {callout-info} Next Steps :::