Overview

In today’s exercises we will be looking to:

We will also see how we can:

Sequence information

Formats

FASTA

One of the common forms you find nucleotide data in is FASTA. This format was one of the early flat data formats in sequence databases. In particular, this format was used by one of the original sequence search engines (think primitive Google for DNA/protein sequences) called FASTA (or sometimes FAST-All). The FASTA program has long been obsolete but its simple data format remains common, today.

Let’s take a look at this format using the example below.

>U54469.1 Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene, complete cds, alternatively spliced
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT
TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA
GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC

The line with the > at the first position is the first line of a FASTA entry. Following the > is often some identifying information for the the sequence of nucleotides or amino acids, below.

The subsequent lines are a contiguous sequence of nucleotides or amino acids that comprise the FASTA entry. The entire sequence can be strung together reading each line left to right.

While you may often work with data in FASTA format where there is just one sequence per file, FASTA can allow for multiple sequences to be combined into one single, concatenated file. What delineates each sequence entry in this single file? Why, that > of course.

Advantages

  • Simple to create FASTA files
  • Can store multiple sequences in a single flat file
  • Most all bioinformatic software recognize FASTA format

Disadvantages

  • No location-specific information (i.e. reading frames or protein domains)
  • Requires a separate flat file for information

GenBank

The GenBank format is one adopted by most contemporary databases as a flat file format and typically is used as a text file storage format for nucleotide (primarily DNA) sequences. There are variants of this format such as the EMBL flat file used by the European Nucleotide Archive (ENA). Like FASTA, GenBank contains the full nucleotide sequence of a particular gene, genetic element, chromosome, or genome. The marked difference between GenBank and FASTA is the information contained within the file, itself.

GenBank files contain location information for important regions or nucleotides in the DNA sequence within the file as well as a variety of other useful bits of information about the sequence entry.

An example GenBank file:

LOCUS       DMU54469                2881 bp    DNA     linear   INV 20-JUN-2017
DEFINITION  Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E)
            gene, complete cds, alternatively spliced.
ACCESSION   U54469
VERSION     U54469.1
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila; Sophophora.
REFERENCE   1  (bases 1 to 2881)
  AUTHORS   Lavoie,C.A., Lachance,P.E., Sonenberg,N. and Lasko,P.
  TITLE     Alternatively spliced transcripts from the Drosophila eIF4E gene
            produce two different Cap-binding proteins
  JOURNAL   J. Biol. Chem. 271 (27), 16393-16398 (1996)
   PUBMED   8663200
REFERENCE   2  (bases 1 to 2881)
  AUTHORS   Lasko,P.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-APR-1996) Paul F. Lasko, Biology, McGill University,
            1205 Avenue Docteur Penfield, Montreal, QC H3A 1B1, Canada
FEATURES             Location/Qualifiers
     source          1..2881
                     /organism="Drosophila melanogaster"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="67A8-B2"
     gene            80..2881
                     /gene="eIF4E"
     mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
                     2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-I"
     mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-II"
     CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
                     /gene="eIF4E"
                     /note="Method: conceptual translation with partial peptide
                     sequencing"
                     /codon_start=1
                     /product="eukaryotic initiation factor 4E-II"
                     /protein_id="AAC03524.1"
                     /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETG
                     EPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTV
                     EDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDN
                     LWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDAL
                     RLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
     CDS             join(1402..1458,1550..1920,1986..2085,2317..2404,
                     2466..2629)
                     /gene="eIF4E"
                     /note="Method: conceptual translation with partial peptide
                     sequencing; two alternatively spliced transcripts both
                     encode 4E-I"
                     /codon_start=1
                     /product="eukaryotic initiation factor 4E-I"
                     /protein_id="AAC03525.1"
                     /translation="MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKD
                     VKPKEDPQETGEPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWED
                     MQNEITSFDTVEDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVIT
                     LNKSSKTDLDNLWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAA
                     LEIGHKLRDALRLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
ORIGIN      
        1 cggttgcttg ggttttataa catcagtcag tgacaggcat ttccagagtt gccctgttca
       61 acaatcgata gctgcctttg gccaccaaaa tcccaaactt aattaaagaa ttaaataatt
      121 cgaataataa ttaagcccag taacctacgc agcttgagtg cgtaaccgat atctagtata
      181 catttcgata catcgaaatc atggtagtgt tggagacgga gaaggtaaga cgatgataga
      241 cggcgagccg catgggttcg atttgcgctg agccgtggca gggaacaaca aaaacagggt
      301 tgttgcacaa gaggggaggc gatagtcgag cggaaaagag tgcagttggc gtggctacat
      361 catcattgtg ttcaccgatt attttttgca caattgctta atattaattg tacttgcacg
      421 ctattgtcta cgtcatagct atcgctcatc tctgtctgtc tctatcaagc tatctctctt
      481 tcgcggtcac tcgttctctt ttttctctcc tttcgcattt gcatacgcat accacacgtt
      541 ttcagtgttc tcgctctctc tctcttgtca agacatcgcg cgcgtgtgtg tgggtgtgtc
      601 tctagcacat atacataaat aggagagcgg agagacaaat atggaaagaa tgaaaaagag
      661 tgaattactg caattaacca gtcgcgaaca gttaaatcat atttttgtcg gccattgcag
      721 taaataaacc gttggctttc cctccttcac tttccacctc ctttcttgac gttaattttt
      781 tcagttaatc gcgccgctgc tttgaactcg aacacgaatt ttagccgcaa cataaaataa
      841 aatcaagtaa ctctttaact caatataaaa caacaatcca atcttcaaca ggcaatctgt
      901 gtttttatgt cagatacgag cgcgtgtgtg tgtgtgctgt aattccatcg cccctttcga
      961 ttccgagttc gttaggaaca gcattagttc gcctatttta gtagtagcct agtccgattt
     1021 taagtgaaac aggacactcc aacaccatat acgcaataat tagttaccac ccactcaacc
     1081 atacagcaac aacaagttta acgagttttt tgtattatca ttacttagtt ttttggttaa
     1141 taatacaaca agtgaagagc gaactgcagg ggagcgagga tatcacgaaa caatccaaaa
     1201 tccacacaca ctcaaacaga aatcaaaagc ttcgctctct cgcacacaca cgcaccaacc
     1261 aactatcaac tatcacaaac accgcgacag agagagagcg gcaagtgaat cacggcgaat
     1321 cgaaaccgat ccgaacccac tccggagccg aaaaagaact gatcctacca tcaaacgcat
     1381 ccaataaaca cggccgccaa catgcagagc gactttcaca gaatgaagaa ctttgccaat
     1441 cccaagtcca tgttcaaagt aatactctca gtgcgcctgt cgctaagcca agccaagcta
     1501 atctaatctt ctgattcccc ttcccatcca ttgccatctt ctcccgcaga ccagcgcccc
     1561 cagcaccgag cagggtcgtc cggaaccacc aacttcggct gcagcgcccg ccgaggctaa
     1621 ggatgtcaag cccaaggagg acccacagga gactggtgaa ccagcaggca acactgcaac
     1681 cactactgct cctgccggcg acgatgctgt gcgcaccgag catttataca aacacccgct
     1741 catgaatgtc tggacgctgt ggtaccttga aaacgatcgg tccaagtcct gggaggacat
     1801 gcaaaacgag atcaccagct tcgataccgt cgaggacttc tggagcctat acaaccacat
     1861 caagccccca tcagagatca agctgggtag tgactactcg ctattcaaga agaacattcg
     1921 gtgggtttgc tgttttattg caatttctac caagataacc tttactaact gatatctcat
     1981 tgcagtccca tgtgggagga tgcagccaac aaacagggcg gtcgttgggt cattaccctt
     2041 aacaaaagct ccaagaccga tctggataac ctatggctcg atgtggtaag tgcacaaaga
     2101 acgagtggtt agaggatgtc tattatagtg aatgtacatt cttgaaatgc aaaaatatag
     2161 aaataggtgt atgattttgc agtataaatt ataacttata gaaaatatca gctaaaaata
     2221 cgctagtgtt agcttttgtc ttaggaacat tcaatagtga gcttatatca taaatatctt
     2281 tcgcatatga gtaactacaa ctgttttgcc ttccagctgc tctgcctgat tggcgaggcc
     2341 ttcgatcact ccgatcagat ctgcggcgct gttataaaca ttcgcggcaa gagcaacaag
     2401 atatgtaagt tttcacgcac acccaacttc agcggaattc ctttgtttaa cattaatctt
     2461 tccagccatc tggactgccg acggaaacaa cgaggaagct gcccttgaga ttggtcacaa
     2521 gctgcgcgat gccttgcgtc tgggacgcaa caactcgctg cagtatcagt tgcacaagga
     2581 cacgatggtc aagcagggct ccaacgtgaa atcgatctac actttgtagg cggctaataa
     2641 ctggccgctc cttactcggt ccgatcccac acagattagt ttgtctttca tttatttatc
     2701 gttataagca acagtagcga ttaatcgtga ctattgtcta agacccgcgt aacgaaaccg
     2761 aaacggaacc ccctttgtta tcaaaaatcg gcataatata aaatctatcc gctttttgta
     2821 gtcactgtca ataatggatt agacggaaaa gtatattaat aaaaacctac attaaaaccg
     2881 g
//

Advantages

  • All key information about a sequence region is contained in one file
  • Most all bioinformatic software recognize GenBank format
  • It is the primary format in all NCBI databases

Disadvantages

  • More complex to create a GenBank file
  • Most GenBank files have only one nucleotide entry

QUESTION 1

Consider the GenBank file, above. There is only one gene entry, yet two mRNA and two CDS entries. What can you conclude based on this information about the eIF4E gene in D. melanogaster?

delete this text and provide your answer here


FASTQ

An additional format that is increasingly important is the FASTQ format. This is another sequence flat file. The primary role of FASTQ formatted data is to store DNA sequence data from next-generation sequencing experiments. Like FASTA, these DNA sequence entries can be combined into a single file. Each entry in a file is separated by a @ instead of a >. Let’s take a look…

@seq1
AATCGACTGTGCCGATCTAACTGCGGACACTGAGCG
+
@A??:(2326)(+(2:&5.62.)(07+01+'.6;01
@seq2
CAAAATGAAAAAAGACTTTAAATTGACAAAAAGCTG
+
@7ABCCBBB:ABAB6BC@@8==AA==AB@5;AB<@C
@seq3
ACTGCATTCTAGCGTGGGACACTGCACTCCAGCTTG
+
BB??@C33>BB@8?),@2)>B>@?(?<<9>:>@6*:
@seq4
AAGATTTTTAAATGAATGATTCTCTAACTAAGGGAC
+
1@3?<@BCA,+@@A377A>9<=AB@5-1?B@,?5<A

The first line of each entry contains an @ followed by some unique descriptor for that sequence entry. Next is the DNA sequence. The next line can contain optional information about the DNA sequence. Finally, the last line is a sequencing quality score associated with each base in the DNA sequence. As we can see, this format stores a lot of information about the DNA sequence and not a lot of information about context. We’ll revisit FASTQ DNA sequence data later in this course.


QUESTION 2

Why do you think having a quality score for each nucleotide might be important for DNA sequencing reads like those shown above?


Protein structure

Determining protein structure has been a central part of structural biology for decades. Similarly, data generated about protein structure has enabled extensive modeling of protein structure and function. So, how do biologists and biochemists store this kind of information? The most common format is CIF (crystallographic information file). The largest database of CIF information is the Protein Data Bank (PDB).

CIF files contain the coordinates of every single atom in a protein crystal structure. The coordinates provide 3-dimensional information for computer programs and modeling systems to render the 3D structure of proteins. Take a look at a CIF file in your web browser.

CLICK HERE


QUESTION 3

Examine the CIF file linked above. What is the name of protein and the organism it is from? What amino acid the carbon atom 3414 part of in this protein?


Metabolite information

Metabolomic information can be acquired by a number of different analytic methods, so there is not a true standard format for this kind of information. However, many studies of metabolism involve mass spectrometry. A common format is the CDF in metabolomics and mzML in proteomics. Our treatment now will be very thin as MS data is very system-specific.

CDF

This is a binary file format that stores mass spectra information for experiments conducted using GC-MS or LC-MS systems.

mzML

This an XML-based file format that stores a variety of information collected in LC-MS/MS analysis of protein fragments.

Network information

SBML

There are a variety of formats used to describe interaction networks (metabolic pathways, protein-protein interactions). Notably, many databases have their own proprietary or internal formats for curating this kind of data. One format that is emerging as a front-runner is the Systems Biology Markup Language (or SBML). SBML is essentially a variation on the full extensible markup language (XML) used heavily in web design, text editing software, and relational databases. We will see these formats a little later this semester when we talk about networks.

Databases

Why all the strict formatting of data files? In short, this is because the amount of biological data in existence cannot be managed using simple files. Biological data is usually stored in relational databases to all end users (you and me) to search (or query) this information. There are databases for genetic/genomic, transcriptomic, proteomic, metabolomic, and other information. Choosing the right database to answer your specific questions is challenging and will take some time/effort. Here, we’ll try and cover some large publicly-accessible databases and get some practice accessing information in them.

Major Databases

NCBI

The largest public database of biological information is the National Center for Biological Information (or NCBI) run by the National Institutes of Health (NIH) housed in the United States of America. Much of the data is unique to NCBI but many data resources are duplicated from smaller databases. In essence, NCBI aims to be a one-stop shop of biological data.

Key components of NCBI include: - GenBank - PubMed - Gene Expression Omnibus (GEO) - Online Mendelian Inheritance in Man (OMIM) - Protein structures - Single Read Archive (SRA)

Let’s check it out: https://www.ncbi.nlm.nih.gov/


QUESTION 4

In the search box at the top of the page, type in eIF4E. Open a new browser tab and direct it to the same NCBI site. Input U54469. Compare the results of each search query.

Two differences

  1. delete and input your answer
  2. delete and input your answer

Two similarities

  1. delete and input your answer
  2. delete and input your answer

UniProt

The UniProt database is focused on curating sequence, structural, and functional information of proteins across a wide variety of organisms. Like NCBI, the information in Uniprot is vast and has diversified to incorporate data from many databases.

Let’s check it out: https://www.uniprot.org/


QUESTION 5

Using the search bar, input the NCBI accession number for eIF4E and search. Where would you expect to find this protein? What mutation disrupts interaction with eIF4G?


Expasy

Expasy is a database of databases and bioinformatic tools. What does that mean? It means that there are often many disparate, smaller databases or specific tools that require some higher-level integration. There are a number of commercial software packages such as Geneious that perform a similar function, however the free Expasy server is an excellent resource. We’ll use this more later this semester.

Let’s check it out: https://www.expasy.org/


QUESTION 6

An ortholog is a gene found in different species that has evolved from a common ancestral gene by speciation. Often, orthologs retain a common function throughout their evolution. Navigate Expasy to find the OrthoDB tool. Input the UniProt ID for eIF4E into the search menu of OrthoDB.

At the level of the Drosophila genus, how many orthologs are known and from how many species?


BioCyc

This is a database of metabolic pathway information for a wide variety of organisms. For more developed genetics model systems like E. coli there is additional information about genome structure and gene regulation. Many of the metabolic pathway maps at BioCyc can be used in a wide variety of metabolic modeling software packages.

Let’s check it out: https://biocyc.org/

What about R?

Our goal is to see how we can use R to automate some of the search activities we’ve engaged in by hand. Before we begin, let’s get the necessary R packages loaded for our R-based exercises, ahead.

library(rentrez)

If you don’t have any of these packages installed, you can install them using the Packages tab in the RStudio navigator window or using the BiocManager::install() command.

Interacting with NCBI Entrez

Let’s first see what the rentrez package can do. This package allows R to interface with the NCBI Entrez system, described above.

First, let’s try a general search using the entrez_global_query function. This can be access using the following syntax in the R termal or executed from your R notebook.

rentrez::entrez_global_query('U54469')
## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion
##          pubmed             pmc            mesh           books    pubmedhealth 
##               1               0               0               2              NA 
##            omim      ncbisearch         nuccore          nucgss          nucest 
##               0               0               8               0               0 
##         protein          genome       structure        taxonomy             snp 
##               0               0               0               0               0 
##           dbvar            gene             sra      biosystems         unigene 
##               0               1               0              NA               0 
##             cdd           clone          popset     geoprofiles             gds 
##               0               0               0               0               0 
##      homologene      pccompound     pcsubstance         pcassay      nlmcatalog 
##               0               0               0               0               0 
##           probe             gap proteinclusters      bioproject       biosample 
##               0               0               0               0               0 
##  biocollections 
##               0

We can see this output is a little difficult to navigate. We can instead, send this information into a variable in the R environment and explore the contents of that variable.

x <- rentrez::entrez_global_query('U54469')
## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

Now, let’s see what is in this variable. We can see there is an array heading names and associated numbers. Go back to the Entrez search we did with NCBI using the eIF4E accession number. You’ll see these correspond to those same things! We can use a computer to do the searching for us.

x
##          pubmed             pmc            mesh           books    pubmedhealth 
##               1               0               0               2              NA 
##            omim      ncbisearch         nuccore          nucgss          nucest 
##               0               0               8               0               0 
##         protein          genome       structure        taxonomy             snp 
##               0               0               0               0               0 
##           dbvar            gene             sra      biosystems         unigene 
##               0               1               0              NA               0 
##             cdd           clone          popset     geoprofiles             gds 
##               0               0               0               0               0 
##      homologene      pccompound     pcsubstance         pcassay      nlmcatalog 
##               0               0               0               0               0 
##           probe             gap proteinclusters      bioproject       biosample 
##               0               0               0               0               0 
##  biocollections 
##               0

What about some other things? Let’s see what databases we can have rentrez search.

rentrez::entrez_dbs()
##  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
##  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
##  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
## [13] "books"           "cdd"             "clinvar"         "gap"            
## [17] "gapplus"         "grasp"           "dbvar"           "gene"           
## [21] "gds"             "geoprofiles"     "homologene"      "medgen"         
## [25] "mesh"            "nlmcatalog"      "omim"            "orgtrack"       
## [29] "pmc"             "popset"          "proteinclusters" "pcassay"        
## [33] "protfam"         "pccompound"      "pcsubstance"     "seqannot"       
## [37] "snp"             "sra"             "taxonomy"        "biocollections" 
## [41] "gtr"

Let’s search the NCBI nuccore database (the core nucleotide database) for our eIF4E gene. Note, you can select the type of data you fetch from Entrez with the rettype option. Here, we have selected the GenBank format. Using the class() command we can see what type of data was returned to the z variable.

z = rentrez::entrez_fetch('nuccore',id='U54469',rettype='gb')
class(z)
## [1] "character"

Let’s write our GenBank file to disk and take a look. We should see an intact GenBank file identical to the one above.

write(z,file="test.gb")

Now, what if we wanted to search for more than just one gene entry. What if wanted to gather many sequences that are related in some way. For example, what if we wanted to search for several eIF4E gene sequences? We can do that!

eIF4E_search <- entrez_search(db='nuccore',term='eIF4E')
eIF4E_search$ids
##  [1] "2566745950" "2566119931" "2564783923" "2566119264" "2566113554"
##  [6] "164607180"  "2565286415" "2565286413" "2561157516" "2563387986"
## [11] "2562910616" "2562910614" "2562910612" "2562702975" "2562369619"
## [16] "2544591541" "2555928226" "2559068977" "2558856598" "2558856596"

You can see, here, that the number of sequences we can pull from Entrez using search we performed is 20 of over 10,000 hits (you can see this number in the Environment window). Let’s go ahead and fetch those 20 sequences and write them to a file. We’ll use the GenBank format.

eIF4E_seqs <- entrez_fetch(db='nuccore',id=eIF4E_search$ids[1:3],rettype='gb')
write(eIF4E_seqs,file="eIF4E_sequences.gb")

QUESTION 7

Open the the eIF4E_sequences.gb file in a text editor (Notebook in Windows, TextEdit in macOS). How are the different sequences separated in this GenBank file that contains 20 entries?


We can also search NCBI for data to understand trends in submissions or literature.

search_year <- function(year, term){
    query <- paste(term, "AND (", year, "[PDAT])")
    entrez_search(db="pubmed", term=query, retmax=0)$count
}

year <- 2008:2014
papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)

plot(year, papers, type='b', main="The Rise of the Connectome")

Check out the rentrez user docs for more examples on using this package. https://cran.r-project.org/web/packages/rentrez/vignettes/rentrez_tutorial.html

Or, you can often find useful walkthroughs in package vignettes in R.

vignette("rentrez_tutorial")
## starting httpd help server ... done