1 Introduction to BLAST

1.1 What is BLAST Searching?

BLAST stands for the Basic Local Alignment Search Tool, and was developed at the National Center for Biotechnology Information (NCBI) in 1990 (O’Neil, 2017). It rapidly aligns biological sequences based on their similarity. It can compare nucleotide or protein sequences against other sequences or whole databases, and reports the statistically significant local alignments it finds, known as ‘High-scoring Segment Pairs’ (HSPs).

(O’Neil, S.T., 2017. A Primer for Computational Biology. Oregon State University Press.)

BLAST searching can be applied to many different kinds of data, including nucleotide and protein alignments against known databases, gene identification searches, motif searches, and comparisons of similarity across multiple regions of long DNA sequences. The speed of the algorithm comes from a heuristic: rather than computing a full alignment against every sequence, BLAST first looks for short matching ‘words’ and only extends alignments around them. Therefore, BLAST searches cannot guarantee optimal alignments between query sequences and databases; instead they report the best local matches found, or ‘hits’, each with a score and a measure of statistical significance.

1.2 DNA Sequencing and BLAST Searching

DNA sequencing pipelines involve taking samples of DNA, sequencing them, and analysing the sequence output. Labs may use various methods of DNA extraction, either using extraction kits or their own routines, in which the DNA from cell nuclei is separated from other cell debris to leave a DNA sample ready for analysis. Further steps such as the Polymerase Chain Reaction (PCR) may also be carried out to obtain and amplify specific sequences from amongst the extracted DNA.

Once we have our DNA samples, we can use various sequencing methods to obtain data in the form of nucleotide sequences. Next generation sequencing (NGS) methods provide a fast, high-throughput way of generating nucleotide sequence data for many DNA samples at once. Once generated, the output nucleotide sequences can be analysed using software such as BLAST+. These nucleotide sequences are input as ‘query’ sequences against a known ‘target’ sequence or database in a BLAST search. The result of a BLAST search shows how the query sequence aligns with the given target or database, such as where the match is located and how significant the alignment is, prompting further analysis of the results.

1.3 FASTA File Format:

The input query sequences and target database files are often in FASTA format. Each sequence record in a FASTA file begins with a ‘>’ to denote the start of a new sequence, followed on the same line by the sequence name and an optional description. Beneath this header line is the DNA nucleotide sequence, which may run over several lines. Our example here shows a FASTA record with the name in the header line and the DNA sequence beneath.

Example of FASTA file format
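As a text sketch of the layout described above, the commands below write a small two-record FASTA file and inspect it. The sequences, record names and temporary file name are invented for illustration and are not part of the tutorial data.

```shell
# Write a small two-record FASTA file (made-up sequences, for illustration only).
fasta_file="$(mktemp "${TMPDIR:-/tmp}/example_fasta.XXXXXX")"
cat > "$fasta_file" <<'EOF'
>seq1 first example record
ACGTACGTACGTACGT
ACGTACGT
>seq2 second example record
TTGACCGTAA
EOF

# Each record starts with a '>' header line, so counting those lines counts sequences.
n_records=$(grep -c '^>' "$fasta_file")
echo "records: $n_records"

# The sequence name is the first word after '>' on each header line.
seq_ids=$(grep '^>' "$fasta_file" | sed 's/^>//' | awk '{print $1}')
echo "$seq_ids"
```

Counting the header lines in this way is a quick check that a downloaded FASTA file holds the number of sequences we expect before building a database from it.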

1.4 What Outputs Does BLAST Searching Produce?

BLAST searches return outputs called ‘hits’, where the input query sequence aligns with a specified target sequence or database. The highest scoring alignment hits are returned, together with the percentage identity, the score, the length of the alignment, the start and end positions of the alignment on the target sequence, the number of mismatches and the e-value.

The e-value is the ‘expect’ value, which describes the number of hits of a similar or better score that we can expect to see by chance when searching a target sequence or database of a given size. The lower the e-value (the closer it is to zero), the more ‘significant’ the match is, given the lengths of the query and target sequences. Searches can restrict the hits returned by setting an e-value threshold, so that only alignments at or below that e-value are shown. See more outputs in the ‘BLAST Output formats’ section.

Although there are many types of BLAST searches, in this document we will demonstrate blastn, which searches a ‘query’ nucleotide sequence against a specified ‘subject’ DNA target sequence or database.

2 Installing BLAST+

In order to install BLAST from the Mac terminal, start by navigating to the target directory. It is advised that you create a new folder to contain the BLAST package, though we’re using the Applications directory:

cd /Applications

cd - Change directory command.
/Applications - Insert the target directory pathway here.

Download the latest macOS BLAST+ package using the curl command and the URL of the macosx.tar.gz file from NCBI:

curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.11.0+-x64-macosx.tar.gz

curl - Command used to transfer data to and from servers.

Options:
-O - Saves the retrieved file under the same name as the original file on the server.
ftp - The ‘File Transfer Protocol’ prefix of the URL specifies the protocol used to transfer the file.

You should see an output like this whilst the software downloads.

Output:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  139M  100  139M    0     0   829k      0  0:02:51  0:02:51 --:--:-- 1385k

Once the download has completed, we can expand the .tar file:

tar xvzf ncbi-blast-2.11.0+-x64-macosx.tar.gz

tar - Command used to expand the compressed tar archive (it can also be used to create and modify compressed archives).
Options:
x - Instructs tar to extract the files from the archive.
v - ‘Verbose’: lists each file as it is processed.
z - Uncompresses the archive with gzip.
f - Specifies that the next argument is the name of the archive file to read.

This lists all of the BLAST+ files that have been extracted from the archive.

Output:

x ncbi-blast-2.11.0+/
x ncbi-blast-2.11.0+/ChangeLog
x ncbi-blast-2.11.0+/LICENSE
x ncbi-blast-2.11.0+/README
x ncbi-blast-2.11.0+/bin/
x ncbi-blast-2.11.0+/bin/blast_formatter
x ncbi-blast-2.11.0+/bin/blastdb_aliastool
x ncbi-blast-2.11.0+/bin/blastdbcheck
x ncbi-blast-2.11.0+/bin/blastdbcmd
x ncbi-blast-2.11.0+/bin/blastn
x ncbi-blast-2.11.0+/bin/blastp
x ncbi-blast-2.11.0+/bin/blastx
x ncbi-blast-2.11.0+/bin/cleanup-blastdb-volumes.py
x ncbi-blast-2.11.0+/bin/convert2blastmask
x ncbi-blast-2.11.0+/bin/deltablast
x ncbi-blast-2.11.0+/bin/dustmasker
x ncbi-blast-2.11.0+/bin/get_species_taxids.sh
x ncbi-blast-2.11.0+/bin/legacy_blast.pl
x ncbi-blast-2.11.0+/bin/makeblastdb
x ncbi-blast-2.11.0+/bin/makembindex
x ncbi-blast-2.11.0+/bin/makeprofiledb
x ncbi-blast-2.11.0+/bin/psiblast
x ncbi-blast-2.11.0+/bin/rpsblast
x ncbi-blast-2.11.0+/bin/rpstblastn
x ncbi-blast-2.11.0+/bin/segmasker
x ncbi-blast-2.11.0+/bin/tblastn
x ncbi-blast-2.11.0+/bin/tblastx
x ncbi-blast-2.11.0+/bin/update_blastdb.pl
x ncbi-blast-2.11.0+/bin/windowmasker
x ncbi-blast-2.11.0+/doc/
x ncbi-blast-2.11.0+/doc/README.txt
x ncbi-blast-2.11.0+/ncbi_package_info
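The tar options can be tried out safely on a throwaway archive before they are used on a real download. The sketch below builds a tiny archive in a temporary directory, lists it, and extracts it; the directory and file names are placeholders and not part of the BLAST+ package.

```shell
# Build a tiny throwaway archive so the flags can be tried safely.
workdir="$(mktemp -d "${TMPDIR:-/tmp}/tar_demo.XXXXXX")"
mkdir -p "$workdir/demo/bin"
echo "placeholder" > "$workdir/demo/README"
echo "placeholder" > "$workdir/demo/bin/tool"

# c = create, z = compress with gzip, f = name of the archive file to write.
tar -czf "$workdir/demo.tar.gz" -C "$workdir" demo

# t = list the contents of the archive without extracting anything.
archive_listing=$(tar -tzf "$workdir/demo.tar.gz")
echo "$archive_listing"

# x = extract; -C chooses the directory to extract into.
mkdir "$workdir/unpacked"
tar -xzf "$workdir/demo.tar.gz" -C "$workdir/unpacked"
ls "$workdir/unpacked/demo"
```

Listing an archive with t before extracting it is a simple way to see which top-level folder it will create.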

When you then look in your chosen directory, you will see a new ncbi-blast-2.11.0+ folder.

Inside this folder are two sub-directories named ‘bin’ and ‘doc’. The ‘bin’ directory holds the programs we will use to complete the BLAST search.

2.1 Configuration

Configuration ensures that the BLAST+ package installed can be located by your operating system.

To do this, two environment variables (PATH and BLASTDB) must be set so that they point to the corresponding directories on your operating system.

First, open the ‘Terminal’ program on your Mac. In our case this runs the bash shell.

The following command then appends (adds) the path to the new BLAST ‘bin’ directory to the existing PATH setting for the current terminal session.

export PATH=$PATH:/Applications/ncbi-blast-2.11.0+/bin

:/Applications/ncbi-blast-2.11.0+/bin - The path to your bin subdirectory located in the ncbi-blast-2.11.0+ parent directory.

The modified PATH can then be examined using the echo command:

echo $PATH
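If the export command is run more than once, the same directory is added to PATH repeatedly. A minimal sketch of one way to avoid this is shown below; path_append is a hypothetical helper name of our own and is not part of BLAST+ or bash.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already listed.
path_append() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, leave PATH unchanged
    *) PATH="$PATH:$1" ;;
  esac
  export PATH
}

path_append "/Applications/ncbi-blast-2.11.0+/bin"
path_append "/Applications/ncbi-blast-2.11.0+/bin"   # second call changes nothing

# command -v reports where the shell would find blastn, if anywhere.
if command -v blastn >/dev/null 2>&1; then
  echo "blastn found at: $(command -v blastn)"
else
  echo "blastn is not on PATH yet"
fi
```

The final check is a quick way to confirm that the terminal can locate blastn before moving on to the database steps.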

To then be able to manage the Drosophila melanogaster database provided by NCBI, as well as other available BLAST databases in the future, you will need to create a directory to store them.

To do this, input the following code:

mkdir /Applications/ncbi-blast-2.11.0+/blastdb

mkdir - Create directory command.
/Applications/ncbi-blast-2.11.0+ - Pathway to the desired location of the database directory.
blastdb - Name of the new directory that will hold the BLAST databases.

If the directory is created successfully, mkdir prints no output.

Then set the BLASTDB environment variable to the new directory, using the export command as we did previously for the ‘bin’ subdirectory.

export BLASTDB=/Applications/ncbi-blast-2.11.0+/blastdb

3 BLAST Searches

3.1 Downloading Reference Genomes

Navigate to the NCBI website in an internet browser and search for your target species genome.

For our target database, we knew our query sequence came from Drosophila melanogaster (fruit fly), and so downloaded the D. melanogaster database.

Download the necessary reference genome (for DNA sequences we recommend the genomic.fna.gz variants) and copy the compressed file to your newly created folder. In our case we unzipped the D. melanogaster reference genome folder ‘GCF_000001215.4’ into our newly created folder ‘blastdb’.

After downloading and unzipping this folder, our new folder contained the RefSeq D. melanogaster (fruit fly) genome, consisting of 11 .fna files containing the reference sequences for each chromosome as well as the mitochondrial genome.

chr2L.fna
chr2R.fna
chr3L.fna
chr3R.fna
chr4.fna
chrMT.fna (Mitochondrial genome)
chrX.fna
chrX.unlocalized.scaf.fna
chrY.fna
chrY.unlocalized.scaf.fna
unplaced.scaf.fna

Once our new folder contains the .fna files, we can turn them into BLAST databases.

3.2 Creating a BLAST Database

In the terminal, navigate to the blastdb directory using the cd command.

cd /Applications/ncbi-blast-2.11.0+/blastdb

We then need to create a BLAST database from each reference sequence from the downloaded genome using the code structure:

makeblastdb -in <File name>.fna -dbtype nucl -parse_seqids -out <new database name>

makeblastdb - Command to make a BLAST database from a FASTA file.
-in - The input FASTA file.
-dbtype nucl - Specifies that the database type is nucleotide.
-parse_seqids - Parses the sequence identifiers in the FASTA header lines, so that sequences in the database can later be looked up by their ID.
-out - The name of the new database.

e.g. To create a database from chr2L.fna, the code would be:

makeblastdb -in chr2L.fna -dbtype nucl -parse_seqids -out chr2L

Output:

Building a new DB, current time: 12/17/2020 20:15:37
New DB name:   /Applications/ncbi-blast-2.11.0+/blastdb/Fly_reference/ncbi_dataset/data/GCF_000001215.4/chr2L
New DB title:  chr2L.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.453587 seconds.

Repeat the above process for each of the remaining .fna files.

makeblastdb -in chr2R.fna -dbtype nucl -parse_seqids -out chr2R
makeblastdb -in chr3L.fna -dbtype nucl -parse_seqids -out chr3L 
makeblastdb -in chr3R.fna -dbtype nucl -parse_seqids -out chr3R 
makeblastdb -in chr4.fna -dbtype nucl -parse_seqids -out chr4 
makeblastdb -in chrX.fna -dbtype nucl -parse_seqids -out chrX 
makeblastdb -in chrX.unlocalized.scaf.fna -dbtype nucl -parse_seqids -out chrX.unlocalized.scaf 
makeblastdb -in chrY.fna -dbtype nucl -parse_seqids -out chrY
makeblastdb -in chrY.unlocalized.scaf.fna -dbtype nucl -parse_seqids -out chrY.unlocalized.scaf
makeblastdb -in unplaced.scaf.fna -dbtype nucl -parse_seqids -out unplaced.scaf
makeblastdb -in chrMT.fna -dbtype nucl -parse_seqids -out chrMT
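Typing one command per file works, but the same commands can also be generated with a loop. The sketch below only prints the makeblastdb commands (a dry run) so that they can be checked first; print_makeblastdb_cmds is a hypothetical helper name, and it assumes every .fna file in the given directory should become its own database.

```shell
# Print (rather than run) one makeblastdb command per .fna file in a directory.
# print_makeblastdb_cmds is a hypothetical helper name used for illustration.
print_makeblastdb_cmds() {
  dir="$1"
  for fna in "$dir"/*.fna; do
    [ -e "$fna" ] || continue            # skip if the pattern matched no files
    name=$(basename "$fna" .fna)         # chr2L.fna -> chr2L
    echo "makeblastdb -in $name.fna -dbtype nucl -parse_seqids -out $name"
  done
}

# Dry run for the current directory (prints nothing if no .fna files are present).
print_makeblastdb_cmds .
```

Once the printed commands look correct, they can be run from inside the blastdb directory, for example by piping the output to sh.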

3.3 Combining the Databases

We then need to search the Mystery sequence (query) against the target sequence files (subjects). To make this a more efficient process, we can combine the 11 databases into one:

blastdb_aliastool -dblist " chr2R chr2L chr3L chr3R chr4 chrX chrY chrMT chrX.unlocalized.scaf chrY.unlocalized.scaf unplaced.scaf " -dbtype nucl \
  -out drosophila_genome -title "Drosophila Genome"

blastdb_aliastool - Creates database alias to tie several databases together
-dblist - The list of databases to string together
-dbtype - Specifies the type of database
-out - Specifies the database name

Output:

Created nucleotide BLAST (alias) database drosophila_genome with 11 sequences

3.4 Searching the Mystery Sequence Against Databases

We can now search for our Mystery sequence against the databases. We can BLAST search the query sequence (Mystery sequence) against the database files individually, to see which chromosome the query sequence matches, or we can search it against the large combined database for efficiency.

To query the mystery sequence against each individual database file we used the structure:

blastn -db <database_name> -query <mystery_file_name> -dust no -outfmt 7 -max_target_seqs 5

blastn - Command to run a nucleotide BLAST search
-db - Specify BLAST database
-query - Specify Fasta file containing query/mystery sequence
-dust - Controls DUST filtering, which masks low complexity regions in the query nucleotide sequence. Here the ‘-dust no’ option switches this filtering off, so low complexity or repeated regions in the query sequence are not masked.
-outfmt - The output format of the BLAST search
-max_target_seqs - Maximum number of aligned sequences to keep

e.g. To query the mystery sequence against the chr2L database, the code would be:

blastn -db chr2L -query Mystery_sequence.fa -dust no -outfmt 7 -max_target_seqs 5

To find which chromosome we get a hit on, you can manually repeat this for each of the databases.

Having run the query against the individual databases, we know the correct chromosome is chr3L:

blastn -db chr3L -query Mystery_sequence.fa -dust no -outfmt 7 -max_target_seqs 5

Producing the output:

# BLASTN 2.11.0+
# Query: Mystery_sequence
# Database: chr3L
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
Mystery_sequence    NT_037436.4 100.000 139 0   0   1   139 16590549    16590411    8.13e-69    257
# BLAST processed 1 queries
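The tabular output is easy to post-process with standard tools. As a rough sketch, the awk command below skips the comment lines and reports the subject accession, the matched region and the strand for each hit; the temporary file name is our own choice, and the data line is copied from the output above.

```shell
# Tabular output (format 7) as shown above; lines starting with '#' are comments.
blast_tab="$(mktemp "${TMPDIR:-/tmp}/blast_tab.XXXXXX")"
cat > "$blast_tab" <<'EOF'
# BLASTN 2.11.0+
# Query: Mystery_sequence
# Database: chr3L
# 1 hits found
Mystery_sequence	NT_037436.4	100.000	139	0	0	1	139	16590549	16590411	8.13e-69	257
# BLAST processed 1 queries
EOF

# Columns 9 and 10 are s. start and s. end. When the start is larger than
# the end, the query matched the reverse (minus) strand of the subject.
hit_summary=$(awk '!/^#/ {
  strand = ($9 > $10) ? "minus" : "plus"
  lo = ($9 < $10) ? $9 : $10
  hi = ($9 < $10) ? $10 : $9
  print $2, lo "-" hi, strand
}' "$blast_tab")
echo "$hit_summary"
```

For our hit the subject start is larger than the subject end, so the match is reported on the minus strand of NT_037436.4.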

Alternatively, we can query the mystery sequence against the combined database ‘drosophila_genome’ using:

blastn -db drosophila_genome -query Mystery_sequence.fa -dust no -outfmt 7 -max_target_seqs 5

Output:

# BLASTN 2.11.0+
# Query: Mystery_sequence
# Database: drosophila_genome
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
Mystery_sequence    NT_037436.4 100.000 139 0   0   1   139 16590549    16590411    3.91e-68    257
# BLAST processed 1 queries

We can then take the information from this combined database search, and look up the ‘subject acc.ver’ to find the specific gene that the query sequence belongs to. See section on ‘Identifying the gene’.

4 BLAST Output formats

We can run more comprehensive BLAST searches by changing the parameters we use to search. So far we have used output format 7 (tabular with comment lines); however, the output format can be changed and extended using extra parameters. We show a few examples below, and the full list of parameters can be viewed by running ‘blastn -help’ in the terminal.

4.1 blastdbcmd -info

We can look into the information a created BLAST database holds. E.g. we can look into a specific database such as chr2L:

blastdbcmd -db chr2L -info

This shows us the database name, its length in bases, when it was created, etc.:

Database: chr2L.fna
    1 sequences; 23,513,712 total bases

Date: Nov 24, 2020  2:39 PM Longest sequence: 23,513,712 bases

BLASTDB Version: 5

Volumes:
    /Applications/ncbi-blast-2.11.0+/blastdb/chr2L

4.2 -task blastn

We can also produce BLAST searches which return more information than simply ‘hit’ or ‘no hit’.

By default, blastn runs the megablast task, which is tuned for highly similar sequences. By adding -task blastn, we run the more sensitive blastn task, which uses a shorter word size and therefore also returns hits for smaller sections of the query sequence. This is useful when there are mutations or SNPs (Single Nucleotide Polymorphisms) in a query sequence or target sequence, where there may be base changes spread about the sequences.

The -task blastn option then provides us with more hit results, showing many shorter sections where weaker matches have been found.

e.g. Running the query sequence against the combined database ‘drosophila_genome’ with the added -task blastn option:

blastn -db drosophila_genome -query Mystery_sequence.fa -task blastn -dust no -outfmt 7 -max_target_seqs 5

Output:

# BLASTN 2.11.0+
# Query: Mystery_sequence
# Database: drosophila_genome
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 14 hits found
Mystery_sequence    NT_037436.4 100.000 139 0   0   1   139 16590549    16590411    2.22e-66    251
Mystery_sequence    NT_037436.4 95.000  20  1   0   37  56  10662853    10662872    3.8 32.8
Mystery_sequence    NT_033777.3 88.889  27  3   0   37  63  27840709    27840735    0.31    36.5
Mystery_sequence    NT_033777.3 91.304  23  2   0   72  94  13902141    13902163    1.1 33.7
Mystery_sequence    NT_033777.3 75.000  52  12  1   36  86  14710265    14710214    3.8 32.8
Mystery_sequence    NT_033777.3 100.000 17  0   0   58  74  30136189    30136173    3.8 31.9
Mystery_sequence    NC_004354.4 82.857  35  5   1   12  46  19713258    19713225    1.1 33.7
Mystery_sequence    NC_004354.4 95.000  20  1   0   88  107 12830838    12830819    3.8 32.8
Mystery_sequence    NC_004354.4 85.185  27  4   0   39  65  713883  713857  3.8 31.9
Mystery_sequence    NC_004354.4 90.909  22  2   0   47  68  12830840    12830819    3.8 31.9
Mystery_sequence    NT_033779.5 85.714  28  4   0   62  89  19405023    19404996    1.1 33.7
Mystery_sequence    NT_033779.5 100.000 17  0   0   61  77  13457461    13457445    3.8 31.9
Mystery_sequence    NT_033778.4 95.000  20  1   0   65  84  17120084    17120065    3.8 32.8
Mystery_sequence    NT_033778.4 95.000  20  1   0   65  84  17124986    17125005    3.8 32.8
# BLAST processed 1 queries

4.3 Deflines

For a given query sequence and target database, we can perform a BLAST search and ask blastn to parse the query and subject ‘definition lines’, or deflines (the header lines of the FASTA records).

We can use the following input to ask for the deflines to be parsed:

blastn -db drosophila_genome -query Mystery_sequence.fa -parse_deflines

Because no -outfmt option is given, this produces the default pairwise report, which visibly aligns the query sequence to the subject database and tells us immediately that the query sequence aligns with the chr3L section of the subject genome:

BLASTN 2.11.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: Drosophila Genome
           1,870 sequences; 143,726,002 total letters



Query= Mystery_sequence

Length=139
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NT_037436.4 Drosophila melanogaster chromosome 3L                     257     4e-68


NT_037436.4 Drosophila melanogaster chromosome 3L

Length=28110227

 Score = 257 bits (139),  Expect = 4e-68
 Identities = 139/139 (100%), Gaps = 0/139 (0%)
 Strand=Plus/Minus

Query  1         GAGAGCAGGCACAGAAGGCATCGCCAGCGCTCTAGGAGCCGCAATCGCAACCGAAGTCGC  60
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  16590549  GAGAGCAGGCACAGAAGGCATCGCCAGCGCTCTAGGAGCCGCAATCGCAACCGAAGTCGC  16590490

Query  61        AGCAGTGAACGAAAACGCCGTCAACATAGCCGAAGTCGCAGCAGTGAACGAAGACGCCGT  120
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  16590489  AGCAGTGAACGAAAACGCCGTCAACATAGCCGAAGTCGCAGCAGTGAACGAAGACGCCGT  16590430

Query  121       CAACGGAGCCCGCATCGGT  139
                 |||||||||||||||||||
Sbjct  16590429  CAACGGAGCCCGCATCGGT  16590411



Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 16523329030


  Database: Drosophila Genome
    Posted date:  Dec 17, 2020  11:17 PM
  Number of letters in database: 143,726,002
  Number of sequences in database:  1,870



Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

Lambda, K and H - These are statistical parameters of the scoring system. Lambda and K are used to derive the bit score and the E-value from the raw alignment score.

4.4 E-values

The expectation value (E-value) is the number of hits of a similar or better quality (score) that we would expect to see by chance. Therefore, the lower the E-value, the more significant the score and the alignment.

E-values can be used as a quality filter for BLAST search results, keeping only hits whose E-value is equal to or lower than the threshold given by the -evalue option. BLAST results are sorted by E-value by default, meaning the best hit is the first line in the results.

blastn -db drosophila_genome -query Mystery_sequence.fa -task blastn -evalue 0.001 -dust no -outfmt 7 -max_target_seqs 5

Output:

# BLASTN 2.11.0+
# Query: Mystery_sequence
# Database: drosophila_genome
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
Mystery_sequence    NT_037436.4 100.000 139 0   0   1   139 16590549    16590411    2.22e-66    251
# BLAST processed 1 queries

Here we have limited the results to hits with an e-value of 0.001 or lower, which 2.22e-66 meets.
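The same threshold can also be applied after the search, to tabular output that has already been saved. The sketch below is an illustration with awk rather than a BLAST+ feature; the three rows are copied from the -task blastn output in section 4.2, and the temporary file and variable names are our own.

```shell
# A few rows copied from the -task blastn output in section 4.2.
hits="$(mktemp "${TMPDIR:-/tmp}/blast_hits.XXXXXX")"
cat > "$hits" <<'EOF'
Mystery_sequence NT_037436.4 100.000 139 0 0 1 139 16590549 16590411 2.22e-66 251
Mystery_sequence NT_037436.4 95.000 20 1 0 37 56 10662853 10662872 3.8 32.8
Mystery_sequence NT_033777.3 88.889 27 3 0 37 63 27840709 27840735 0.31 36.5
EOF

# Keep only rows whose E-value (column 11) is at or below the threshold.
max_evalue=0.001
kept=$(awk -v max="$max_evalue" '!/^#/ && ($11 + 0) <= (max + 0)' "$hits")
echo "$kept"
```

Only the full-length hit on NT_037436.4 passes this filter; the two short matches have E-values well above 0.001.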

4.5 Dust

Dust refers to DUST filtering, which is also available as the stand-alone dustmasker program. It masks low complexity or repeated regions in the query sequence so that they do not produce spurious hits. If multiple high scoring hits were identified by the first search, the ‘-dust no’ option could be removed to return to the default, in which DUST filtering is switched on, to reduce spurious matches.

4.6 Bit score

The bit score, S’, is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Bit scores are normalized with respect to the scoring system, and can therefore be used to compare alignment scores from different searches. This is useful when comparing the quality of BLAST searches, also potentially indicating the quality of the sequencing.

The higher the bit score, the better the query sequence similarity to the subject.

The bit score can be read as the size of the search space in which a match this good would be expected to occur once by chance. The bit score is a log2-scaled and normalized raw score: each increase of one bit doubles the required search space (2^bit-score). It does not depend on database size, giving the same value for hits in databases of different sizes, and can therefore be used for searching in a constantly growing database.
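As a rough worked example, the standard relationships S’ = (Lambda x S - ln K) / ln 2 and E = N x 2^(-S’), where N is the effective search space, can be checked against the numbers printed in the pairwise report in section 4.3. The sketch below assumes the gapped Lambda and K values apply to this hit; because the report shows these values rounded, the result only approximates the reported 257 bits and 4e-68.

```shell
# Values read from the pairwise report in section 4.3 (gapped statistics).
raw_score=139                 # raw score S, shown in brackets: "257 bits (139)"
lambda=1.28                   # gapped Lambda (rounded in the report)
K=0.460                       # gapped K (rounded in the report)
search_space=16523329030      # "Effective search space used"

stats=$(awk -v S="$raw_score" -v l="$lambda" -v K="$K" -v N="$search_space" 'BEGIN {
  bits = (l * S - log(K)) / log(2)     # bit score from the raw score
  evalue = N * exp(-bits * log(2))     # E = N * 2^(-bits)
  printf "%.1f %.2e\n", bits, evalue
}')
echo "bit score and E-value: $stats"
```

The computed values come out at about 258 bits and an E-value of roughly 4e-68, in line with the report, which illustrates how the E-value follows from the bit score and the size of the search space.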

5 Identifying the gene

Once we have found which chromosome from the Drosophila melanogaster reference genome our Mystery sequence aligns with, we can use the NCBI online tools to look up the matching region and see which gene it belongs to. By following the steps below, you can search for and identify genes from their DNA sequence.

From our BLAST search, we received the ‘subject accession version’ of NT_037436.4 from the ‘subject’ sequence, i.e. one of our reference chromosomes.

Returned from BLAST search of Mystery Sequence

We can search to find a visual output of our input query sequence along the reference genome using the NCBI online tools.

Access the NCBI BLAST search online tool. In the search box, enter the subject accession version shown in the BLAST output, e.g. NT_037436.4.
This brings up the nucleotide sequence ‘Drosophila melanogaster chromosome 3L’ from the NCBI database.
Click on this link, which brings you to a page detailing the information on chromosome 3L.
On the right of the screen, there is a drop-down menu named ‘Change region shown’. Here you enter the region 16590549-16590411, i.e. the subject start and end provided in the BLAST hit output.

The subject start and end of our alignment hit

This process informs the NCBI database of the region we are interested in. Under the ‘Drosophila melanogaster chromosome 3L’ title, click on the ‘Graphics’ view.
A graphical visualization of the region of interest is displayed. By hovering over the exon colored green, the name of the gene of interest is shown. We therefore now know that the gene is the tra gene.
Further information detailing the gene name, location and nucleotide length is also shown, which confirms that the highlighted gene matches our BLAST output.

Visual representation of the query sequence aligned with the Drosophila melanogaster chromosome 3L, in region 16590549-16590411. The green colored bars detail the exons; the blue bars identify the presence of intron reads.

We can also verify the output of our localised BLAST search by searching the FASTA sequence of our Mystery sequence using the online BLAST search tool. This provides an output of many sequences which match our query sequence, confirming we have found the correct chromosome alignment within the Drosophila genome.

The first option, ‘Drosophila melanogaster chromosome 3L’, corresponds to the reference sequence we downloaded. Selecting it shows how the query sequence aligns with this section of the chromosome.

Query sequence alignment with Chromosome 3L

6 The tra gene

6.1 What is the tra gene?

The mystery sequence was identified as part of the tra (transformer) gene, which encodes the female-specific protein Transformer that is involved in female somatic sexual differentiation. The Tra protein contains an RNA recognition motif that controls the alternative splicing of the sex determination gene doublesex (dsx), which influences somatic cell differentiation.

6.2 The role of the tra gene:

The tra gene is transcribed in both males and females, but its pre-mRNA undergoes alternative splicing using two different alternative 3’ splice sites (Sanchez & Guerrero, 2005) (Fig. 1). A non-sex-specific transcript is produced when the proximal 3’ splice site is used, resulting in the addition of a stop codon in the open reading frame and a truncated non-functional protein (Sanchez & Guerrero, 2005). In females, over 50% of the tra pre-mRNA is spliced differently due to Sxl (Verhulst et al., 2010). The distal 3’ splice site is used and the stop codon is not introduced, therefore producing a functional Tra protein (Boggs et al., 1987; Belote et al., 1989).

Sxl regulates female-specific tra splicing through a blockage mechanism in which it binds to the polypyrimidine tract of the non-sex-specific splice site (Sosnowski et al., 1989). The U2 auxiliary factor (U2AF), which is essential for the recognition of the 3’ splice site, also uses the same site (Sanchez & Guerrero, 2005). U2AF can also bind to the female-specific 3’ splice site, but with 100 times less affinity (Valcarcel et al., 1993). Sxl and U2AF compete for the non-sex-specific splice site, and binding of Sxl to this site forces U2AF to bind to the low affinity distal splice site, promoting the use of the female-specific splice site and subsequent production of the complete Tra peptide (Valcarcel et al., 1993) (Fig. 1).

Fig.1 - The tra pre-mRNA male and female specific splicing processes. The introns and exons (E1-E4) are represented by the boxes and lines. U2AF and Sxl protein binding sites are identified by an arrow; the black dot indicates the stop codon (Sanchez & Guerrero, 2005).

The functional Tra protein then interacts with the transformer2 protein (Tra2), which is the non-sex-specific counterpart of Tra in D. melanogaster (Amrein & Nothiger, 1988). The complex binds near the centre of exon 4 of the doublesex (dsx) pre-mRNA, which contains the dsx repeat element (dsxRE), a 13 nucleotide sequence repeated six times across the region (Tian & Maniatis, 1993). A purine rich enhancer (PRE) is located between elements 5 and 6 of dsxRE and functions as the specific binding site for Tra2 on the dsxRE region (Lynch & Maniatis, 1995). The interaction of Tra and Tra2 influences the binding with PRE, resulting in the retention of exon 4 in the dsx pre-mRNA. The retention of this exon constitutes the female-specific splicing of dsx at the end of the cascade, producing a female-specific Dsx protein (Verhulst et al., 2010) and subsequent downstream female-specific somatic differentiation.

An example of the downstream effects of the Tra/Tra2 dynamic is the splicing of the fruitless (fru) gene (Ryner et al., 1996). Absence of the Tra protein produces a functional Fru protein through male-specific splicing; the Fru protein is conserved among all male insects and is thought to influence male-specific behaviour (Verhulst et al., 2010). The presence of the Tra protein therefore produces a non-functional Fru protein, and male-attributed behaviour is not exhibited in females.

6.3 Conclusions

The tra gene has significant functional importance in regulating dsx splicing for female development, while the presence of a truncated Tra protein in turn directs male development. In past RNA interference experiments in early female embryos, the resulting male-specific dsx splicing produced intersexes with various degrees of masculinization, whereas when replicated in males, male development remained unaffected (Lagos et al., 2007). Only female-specific splicing of the tra pre-mRNA produces a complete transcript and, as a result, a functional Tra protein (Verhulst et al., 2010).

7 Downstream Applications

The BLASTmap application, built with Shiny in the programming language R, could be used to visualise the BLAST alignment; it allows the grouping and viewing of the BLAST output in an interactive heat map (Baker et al., 2018). This would provide a visual representation of the tra sequence alignment against the D. melanogaster reference sequence.

A cross-species BLAST analysis of the tra sequence could be completed with other insect orders such as Diptera, Hymenoptera and Coleoptera. The tra gene is essential in determining the sex-specific somatic differentiation process in D. melanogaster, so it would be interesting to see if this gene is conserved among other insect orders. By BLAST searching the tra gene against other insect species, the extent of functional and structural conservation of this gene can be analysed. The divergence of the tra sequence between species can then be compared against insect species divergence, using the existing phylogeny, to identify any correspondence between the two. Identification of conserved tra regulation across all insect species would highlight the crucial role of sex-specific alternative splicing in insects, which produces either a functional or a non-functional Tra protein.

8 References

To create this tutorial, we followed the guidelines laid out in the NCBI BLAST Guide: https://www.ncbi.nlm.nih.gov/books/NBK279690/

8.1

8.1.1 Section 1

O’Neil, S.T., 2017. A Primer for Computational Biology. Oregon State University Press.

8.1.2 Section 3

8.1.3 Section 5

8.1.4 Section 6

Amrein H, Gorman M, Nothiger R. (1988). The sex-determining gene tra2 of Drosophila encodes a putative RNA binding protein. Cell, 55, pp.1025-1035.

Belote, J.M., McKeown, M., Boggs, R.T., Ohkawa, R. and Sosnowski, B.A., 1989. Molecular genetics of transformer, a genetic switch controlling sexual differentiation in Drosophila. Developmental genetics, 10(3), pp.143-154.

Billeter JC, Rideout EJ, Dornan AJ, Goodwin SF. (2008). Control of male sexual behavior in Drosophila by the sex determination pathway. Curr Biol , pp.1-20.

Boggs, R.T., Gregor, P., Idriss, S., Belote, J.M. and McKeown, M. (1987). Regulation of sexual differentiation in D. melanogaster via alternative splicing of RNA from the transformer gene. Cell, 50(5), pp.739-747.

Gempe T, Hasselmann M, Schiøtt M, Hause G, Otte M, Beye M. (2009). Sex determination in honeybees: two separate mechanisms induce and maintain the female pathway. PLoS Biol, 7:e1000222

Hasselmann M, Gempe T, Schiott M, Nunes-Silva CG, Otte M, Beye M. (2008). Evidence for the evolutionary nascence of a novel sex determination pathway in honeybees. Nature, 454, pp.519-522.

Lagos D, Koukidou M, Savakis C, Komitopoulou K. (2007). The transformer gene in Bactrocera oleae: the genetic switch that determines its sex fate. Insect Mol Biol, 16, pp.221-230.

Lynch KW, Maniatis T. (1995). Synergistic interactions between two distinct elements of a regulated splicing enhancer. Genes Dev, 9, pp.284-293.

Ryner LC, Goodwin SF, Castrillon DH, Anand A, Villella A, Baker BS, Hall JC, Taylor BJ, Wasserman SA. (1996). Control of male sexual behavior and sexual orientation in Drosophila by the fruitless gene. Cell, 87, pp.1079-1089.

Sánchez, L., Gorfinkiel, N. and Guerrero, I. (2005). Sex determination and the development of the genital disc. Comprehensive Molecular Insect Science, 1, pp.1-38.

Sosnowski, B.A., Belote, J.M. and McKeown, M. (1989). Sex-specific alternative splicing of RNA from the transformer gene results from sequence-dependent splice site blockage. Cell, 58(3), pp.449-459.

Tian M, Maniatis T. (1993). A splicing enhancer complex controls alternative splicing of doublesex pre-mRNA. Cell, 74, pp.105-114.

Valcárcel, J., Singh, R., Zamore, P.D. and Green, M.R. (1993). The protein Sex-lethal antagonizes the splicing factor U2AF to regulate alternative splicing of transformer pre-mRNA. Nature, 362(6416), pp.171-175.

8.1.5 Section 7

Baker, K., Stephen, G., Strachan, S., Armstrong, M. and Hein, I. (2018). BLASTmap: A Shiny-Based Application to Visualize BLAST Results as Interactive Heat Maps and a Tool to Design Gene-Specific Baits for Bespoke Target Enrichment Sequencing. Plant Pathogenic Fungi and Oomycetes, pp. 199-206.

Local Nucleotide BLAST Search Tutorial on Mac OS

Jessica Mussett, Elliot Derby, Amy Grimwood
