In today’s exercises we will be looking to:
We will also see how we can:
One of the common forms you find nucleotide data in is FASTA. This format was one of the early flat data formats in sequence databases. In particular, this format was used by one of the original sequence search engines (think primitive Google for DNA/protein sequences) called FASTA (or sometimes FAST-All). The FASTA program has long been obsolete but its simple data format remains common, today.
Let’s take a look at this format using the example below.
>U54469.1 Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene, complete cds, alternatively spliced
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT
TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA
GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC
The line with the > at the first position is the
first line of a FASTA entry. Following the > is often
some identifying information for the the sequence of nucleotides or
amino acids, below.
The subsequent lines are a contiguous sequence of nucleotides or amino acids that comprise the FASTA entry. The entire sequence can be strung together reading each line left to right.
While you may often work with data in FASTA format where there is
just one sequence per file, FASTA can allow for multiple sequences to be
combined into one single, concatenated file. What delineates each
sequence entry in this single file? Why, that > of
course.
Advantages
Disadvantages
The GenBank format is one adopted by most contemporary databases as a flat file format and typically is used as a text file storage format for nucleotide (primarily DNA) sequences. There are variants of this format such as the EMBL flat file used by the European Nucleotide Archive (ENA). Like FASTA, GenBank contains the full nucleotide sequence of a particular gene, genetic element, chromosome, or genome. The marked difference between GenBank and FASTA is the information contained within the file, itself.
GenBank files contain location information for important regions or nucleotides in the DNA sequence within the file as well as a variety of other useful bits of information about the sequence entry.
An example GenBank file:
LOCUS DMU54469 2881 bp DNA linear INV 20-JUN-2017
DEFINITION Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E)
gene, complete cds, alternatively spliced.
ACCESSION U54469
VERSION U54469.1
KEYWORDS .
SOURCE Drosophila melanogaster (fruit fly)
ORGANISM Drosophila melanogaster
Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta;
Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
Muscomorpha; Ephydroidea; Drosophilidae; Drosophila; Sophophora.
REFERENCE 1 (bases 1 to 2881)
AUTHORS Lavoie,C.A., Lachance,P.E., Sonenberg,N. and Lasko,P.
TITLE Alternatively spliced transcripts from the Drosophila eIF4E gene
produce two different Cap-binding proteins
JOURNAL J. Biol. Chem. 271 (27), 16393-16398 (1996)
PUBMED 8663200
REFERENCE 2 (bases 1 to 2881)
AUTHORS Lasko,P.F.
TITLE Direct Submission
JOURNAL Submitted (09-APR-1996) Paul F. Lasko, Biology, McGill University,
1205 Avenue Docteur Penfield, Montreal, QC H3A 1B1, Canada
FEATURES Location/Qualifiers
source 1..2881
/organism="Drosophila melanogaster"
/mol_type="genomic DNA"
/db_xref="taxon:7227"
/chromosome="3"
/map="67A8-B2"
gene 80..2881
/gene="eIF4E"
mRNA join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
2466..2881)
/gene="eIF4E"
/product="eukaryotic initiation factor 4E-I"
mRNA join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
/gene="eIF4E"
/product="eukaryotic initiation factor 4E-II"
CDS join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
/gene="eIF4E"
/note="Method: conceptual translation with partial peptide
sequencing"
/codon_start=1
/product="eukaryotic initiation factor 4E-II"
/protein_id="AAC03524.1"
/translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETG
EPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTV
EDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDN
LWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDAL
RLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
CDS join(1402..1458,1550..1920,1986..2085,2317..2404,
2466..2629)
/gene="eIF4E"
/note="Method: conceptual translation with partial peptide
sequencing; two alternatively spliced transcripts both
encode 4E-I"
/codon_start=1
/product="eukaryotic initiation factor 4E-I"
/protein_id="AAC03525.1"
/translation="MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKD
VKPKEDPQETGEPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWED
MQNEITSFDTVEDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVIT
LNKSSKTDLDNLWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAA
LEIGHKLRDALRLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
ORIGIN
1 cggttgcttg ggttttataa catcagtcag tgacaggcat ttccagagtt gccctgttca
61 acaatcgata gctgcctttg gccaccaaaa tcccaaactt aattaaagaa ttaaataatt
121 cgaataataa ttaagcccag taacctacgc agcttgagtg cgtaaccgat atctagtata
181 catttcgata catcgaaatc atggtagtgt tggagacgga gaaggtaaga cgatgataga
241 cggcgagccg catgggttcg atttgcgctg agccgtggca gggaacaaca aaaacagggt
301 tgttgcacaa gaggggaggc gatagtcgag cggaaaagag tgcagttggc gtggctacat
361 catcattgtg ttcaccgatt attttttgca caattgctta atattaattg tacttgcacg
421 ctattgtcta cgtcatagct atcgctcatc tctgtctgtc tctatcaagc tatctctctt
481 tcgcggtcac tcgttctctt ttttctctcc tttcgcattt gcatacgcat accacacgtt
541 ttcagtgttc tcgctctctc tctcttgtca agacatcgcg cgcgtgtgtg tgggtgtgtc
601 tctagcacat atacataaat aggagagcgg agagacaaat atggaaagaa tgaaaaagag
661 tgaattactg caattaacca gtcgcgaaca gttaaatcat atttttgtcg gccattgcag
721 taaataaacc gttggctttc cctccttcac tttccacctc ctttcttgac gttaattttt
781 tcagttaatc gcgccgctgc tttgaactcg aacacgaatt ttagccgcaa cataaaataa
841 aatcaagtaa ctctttaact caatataaaa caacaatcca atcttcaaca ggcaatctgt
901 gtttttatgt cagatacgag cgcgtgtgtg tgtgtgctgt aattccatcg cccctttcga
961 ttccgagttc gttaggaaca gcattagttc gcctatttta gtagtagcct agtccgattt
1021 taagtgaaac aggacactcc aacaccatat acgcaataat tagttaccac ccactcaacc
1081 atacagcaac aacaagttta acgagttttt tgtattatca ttacttagtt ttttggttaa
1141 taatacaaca agtgaagagc gaactgcagg ggagcgagga tatcacgaaa caatccaaaa
1201 tccacacaca ctcaaacaga aatcaaaagc ttcgctctct cgcacacaca cgcaccaacc
1261 aactatcaac tatcacaaac accgcgacag agagagagcg gcaagtgaat cacggcgaat
1321 cgaaaccgat ccgaacccac tccggagccg aaaaagaact gatcctacca tcaaacgcat
1381 ccaataaaca cggccgccaa catgcagagc gactttcaca gaatgaagaa ctttgccaat
1441 cccaagtcca tgttcaaagt aatactctca gtgcgcctgt cgctaagcca agccaagcta
1501 atctaatctt ctgattcccc ttcccatcca ttgccatctt ctcccgcaga ccagcgcccc
1561 cagcaccgag cagggtcgtc cggaaccacc aacttcggct gcagcgcccg ccgaggctaa
1621 ggatgtcaag cccaaggagg acccacagga gactggtgaa ccagcaggca acactgcaac
1681 cactactgct cctgccggcg acgatgctgt gcgcaccgag catttataca aacacccgct
1741 catgaatgtc tggacgctgt ggtaccttga aaacgatcgg tccaagtcct gggaggacat
1801 gcaaaacgag atcaccagct tcgataccgt cgaggacttc tggagcctat acaaccacat
1861 caagccccca tcagagatca agctgggtag tgactactcg ctattcaaga agaacattcg
1921 gtgggtttgc tgttttattg caatttctac caagataacc tttactaact gatatctcat
1981 tgcagtccca tgtgggagga tgcagccaac aaacagggcg gtcgttgggt cattaccctt
2041 aacaaaagct ccaagaccga tctggataac ctatggctcg atgtggtaag tgcacaaaga
2101 acgagtggtt agaggatgtc tattatagtg aatgtacatt cttgaaatgc aaaaatatag
2161 aaataggtgt atgattttgc agtataaatt ataacttata gaaaatatca gctaaaaata
2221 cgctagtgtt agcttttgtc ttaggaacat tcaatagtga gcttatatca taaatatctt
2281 tcgcatatga gtaactacaa ctgttttgcc ttccagctgc tctgcctgat tggcgaggcc
2341 ttcgatcact ccgatcagat ctgcggcgct gttataaaca ttcgcggcaa gagcaacaag
2401 atatgtaagt tttcacgcac acccaacttc agcggaattc ctttgtttaa cattaatctt
2461 tccagccatc tggactgccg acggaaacaa cgaggaagct gcccttgaga ttggtcacaa
2521 gctgcgcgat gccttgcgtc tgggacgcaa caactcgctg cagtatcagt tgcacaagga
2581 cacgatggtc aagcagggct ccaacgtgaa atcgatctac actttgtagg cggctaataa
2641 ctggccgctc cttactcggt ccgatcccac acagattagt ttgtctttca tttatttatc
2701 gttataagca acagtagcga ttaatcgtga ctattgtcta agacccgcgt aacgaaaccg
2761 aaacggaacc ccctttgtta tcaaaaatcg gcataatata aaatctatcc gctttttgta
2821 gtcactgtca ataatggatt agacggaaaa gtatattaat aaaaacctac attaaaaccg
2881 g
//
Advantages
Disadvantages
QUESTION 1
Consider the GenBank file, above. There is only one gene
entry, yet two mRNA and two CDS entries. What
can you conclude based on this information about the eIF4E gene in
D. melanogaster?
delete this text and provide your answer here
An additional format that is increasingly important is the FASTQ
format. This is another sequence flat file. The primary role of FASTQ
formatted data is to store DNA sequence data from next-generation
sequencing experiments. Like FASTA, these DNA sequence entries can be
combined into a single file. Each entry in a file is separated by a
@ instead of a >. Let’s take a look…
@seq1
AATCGACTGTGCCGATCTAACTGCGGACACTGAGCG
+
@A??:(2326)(+(2:&5.62.)(07+01+'.6;01
@seq2
CAAAATGAAAAAAGACTTTAAATTGACAAAAAGCTG
+
@7ABCCBBB:ABAB6BC@@8==AA==AB@5;AB<@C
@seq3
ACTGCATTCTAGCGTGGGACACTGCACTCCAGCTTG
+
BB??@C33>BB@8?),@2)>B>@?(?<<9>:>@6*:
@seq4
AAGATTTTTAAATGAATGATTCTCTAACTAAGGGAC
+
1@3?<@BCA,+@@A377A>9<=AB@5-1?B@,?5<A
The first line of each entry contains an @ followed by
some unique descriptor for that sequence entry. Next is the DNA
sequence. The next line can contain optional information about the DNA
sequence. Finally, the last line is a sequencing quality score
associated with each base in the DNA sequence. As we can see, this
format stores a lot of information about the DNA sequence and not a lot
of information about context. We’ll revisit FASTQ DNA sequence data
later in this course.
QUESTION 2
Why do you think having a quality score for each nucleotide might be important for DNA sequencing reads like those shown above?
Determining protein structure has been a central part of structural biology for decades. Similarly, data generated about protein structure has enabled extensive modeling of protein structure and function. So, how do biologists and biochemists store this kind of information? The most common format is CIF (crystallographic information file). The largest database of CIF information is the Protein Data Bank (PDB).
CIF files contain the coordinates of every single atom in a protein crystal structure. The coordinates provide 3-dimensional information for computer programs and modeling systems to render the 3D structure of proteins. Take a look at a CIF file in your web browser.
QUESTION 3
Examine the CIF file linked above. What is the name of protein and the organism it is from? What amino acid the carbon atom 3414 part of in this protein?
Metabolomic information can be acquired by a number of different analytic methods, so there is not a true standard format for this kind of information. However, many studies of metabolism involve mass spectrometry. A common format is the CDF in metabolomics and mzML in proteomics. Our treatment now will be very thin as MS data is very system-specific.
This is a binary file format that stores mass spectra information for experiments conducted using GC-MS or LC-MS systems.
This an XML-based file format that stores a variety of information collected in LC-MS/MS analysis of protein fragments.
There are a variety of formats used to describe interaction networks (metabolic pathways, protein-protein interactions). Notably, many databases have their own proprietary or internal formats for curating this kind of data. One format that is emerging as a front-runner is the Systems Biology Markup Language (or SBML). SBML is essentially a variation on the full extensible markup language (XML) used heavily in web design, text editing software, and relational databases. We will see these formats a little later this semester when we talk about networks.
Why all the strict formatting of data files? In short, this is because the amount of biological data in existence cannot be managed using simple files. Biological data is usually stored in relational databases to all end users (you and me) to search (or query) this information. There are databases for genetic/genomic, transcriptomic, proteomic, metabolomic, and other information. Choosing the right database to answer your specific questions is challenging and will take some time/effort. Here, we’ll try and cover some large publicly-accessible databases and get some practice accessing information in them.
The largest public database of biological information is the National Center for Biological Information (or NCBI) run by the National Institutes of Health (NIH) housed in the United States of America. Much of the data is unique to NCBI but many data resources are duplicated from smaller databases. In essence, NCBI aims to be a one-stop shop of biological data.
Key components of NCBI include: - GenBank - PubMed - Gene Expression Omnibus (GEO) - Online Mendelian Inheritance in Man (OMIM) - Protein structures - Single Read Archive (SRA)
Let’s check it out: https://www.ncbi.nlm.nih.gov/
QUESTION 4
In the search box at the top of the page, type in eIF4E.
Open a new browser tab and direct it to the same NCBI site. Input
U54469. Compare the results of each search query.
Two differences
Two similarities
The UniProt database is focused on curating sequence, structural, and functional information of proteins across a wide variety of organisms. Like NCBI, the information in Uniprot is vast and has diversified to incorporate data from many databases.
Let’s check it out: https://www.uniprot.org/
QUESTION 5
Using the search bar, input the NCBI accession number for eIF4E and search. Where would you expect to find this protein? What mutation disrupts interaction with eIF4G?
Expasy is a database of databases and bioinformatic tools. What does that mean? It means that there are often many disparate, smaller databases or specific tools that require some higher-level integration. There are a number of commercial software packages such as Geneious that perform a similar function, however the free Expasy server is an excellent resource. We’ll use this more later this semester.
Let’s check it out: https://www.expasy.org/
QUESTION 6
An ortholog is a gene found in different species that has evolved
from a common ancestral gene by speciation. Often, orthologs retain a
common function throughout their evolution. Navigate Expasy to find the
OrthoDB tool. Input the UniProt ID for eIF4E into the
search menu of OrthoDB.
At the level of the Drosophila genus, how many orthologs are known and from how many species?
This is a database of metabolic pathway information for a wide variety of organisms. For more developed genetics model systems like E. coli there is additional information about genome structure and gene regulation. Many of the metabolic pathway maps at BioCyc can be used in a wide variety of metabolic modeling software packages.
Let’s check it out: https://biocyc.org/
Our goal is to see how we can use R to automate some of the search activities we’ve engaged in by hand. Before we begin, let’s get the necessary R packages loaded for our R-based exercises, ahead.
library(rentrez)
If you don’t have any of these packages installed, you can install
them using the Packages tab in the RStudio navigator window
or using the BiocManager::install() command.
Let’s first see what the rentrez package can do. This
package allows R to interface with the NCBI Entrez system, described
above.
First, let’s try a general search using the
entrez_global_query function. This can be access using the
following syntax in the R termal or executed from your R notebook.
rentrez::entrez_global_query('U54469')
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## pubmed pmc mesh books pubmedhealth
## 1 0 0 2 NA
## omim ncbisearch nuccore nucgss nucest
## 0 0 8 0 0
## protein genome structure taxonomy snp
## 0 0 0 0 0
## dbvar gene sra biosystems unigene
## 0 1 0 NA 0
## cdd clone popset geoprofiles gds
## 0 0 0 0 0
## homologene pccompound pcsubstance pcassay nlmcatalog
## 0 0 0 0 0
## probe gap proteinclusters bioproject biosample
## 0 0 0 0 0
## biocollections
## 0
We can see this output is a little difficult to navigate. We can instead, send this information into a variable in the R environment and explore the contents of that variable.
x <- rentrez::entrez_global_query('U54469')
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
Now, let’s see what is in this variable. We can see there is an array heading names and associated numbers. Go back to the Entrez search we did with NCBI using the eIF4E accession number. You’ll see these correspond to those same things! We can use a computer to do the searching for us.
x
## pubmed pmc mesh books pubmedhealth
## 1 0 0 2 NA
## omim ncbisearch nuccore nucgss nucest
## 0 0 8 0 0
## protein genome structure taxonomy snp
## 0 0 0 0 0
## dbvar gene sra biosystems unigene
## 0 1 0 NA 0
## cdd clone popset geoprofiles gds
## 0 0 0 0 0
## homologene pccompound pcsubstance pcassay nlmcatalog
## 0 0 0 0 0
## probe gap proteinclusters bioproject biosample
## 0 0 0 0 0
## biocollections
## 0
What about some other things? Let’s see what databases we can have
rentrez search.
rentrez::entrez_dbs()
## [1] "pubmed" "protein" "nuccore" "ipg"
## [5] "nucleotide" "structure" "genome" "annotinfo"
## [9] "assembly" "bioproject" "biosample" "blastdbinfo"
## [13] "books" "cdd" "clinvar" "gap"
## [17] "gapplus" "grasp" "dbvar" "gene"
## [21] "gds" "geoprofiles" "homologene" "medgen"
## [25] "mesh" "nlmcatalog" "omim" "orgtrack"
## [29] "pmc" "popset" "proteinclusters" "pcassay"
## [33] "protfam" "pccompound" "pcsubstance" "seqannot"
## [37] "snp" "sra" "taxonomy" "biocollections"
## [41] "gtr"
Let’s search the NCBI nuccore database (the core
nucleotide database) for our eIF4E gene. Note, you can select the type
of data you fetch from Entrez with the rettype option. Here, we have
selected the GenBank format. Using the class() command we
can see what type of data was returned to the z
variable.
z = rentrez::entrez_fetch('nuccore',id='U54469',rettype='gb')
class(z)
## [1] "character"
Let’s write our GenBank file to disk and take a look. We should see an intact GenBank file identical to the one above.
write(z,file="test.gb")
Now, what if we wanted to search for more than just one gene entry. What if wanted to gather many sequences that are related in some way. For example, what if we wanted to search for several eIF4E gene sequences? We can do that!
eIF4E_search <- entrez_search(db='nuccore',term='eIF4E')
eIF4E_search$ids
## [1] "2566745950" "2566119931" "2564783923" "2566119264" "2566113554"
## [6] "164607180" "2565286415" "2565286413" "2561157516" "2563387986"
## [11] "2562910616" "2562910614" "2562910612" "2562702975" "2562369619"
## [16] "2544591541" "2555928226" "2559068977" "2558856598" "2558856596"
You can see, here, that the number of sequences we can pull from
Entrez using search we performed is 20 of over 10,000 hits (you can see
this number in the Environment window). Let’s go ahead and
fetch those 20 sequences and write them to a file. We’ll use the GenBank
format.
eIF4E_seqs <- entrez_fetch(db='nuccore',id=eIF4E_search$ids[1:3],rettype='gb')
write(eIF4E_seqs,file="eIF4E_sequences.gb")
QUESTION 7
Open the the eIF4E_sequences.gb file in a text editor
(Notebook in Windows, TextEdit in macOS). How are the different
sequences separated in this GenBank file that contains 20 entries?
We can also search NCBI for data to understand trends in submissions or literature.
search_year <- function(year, term){
query <- paste(term, "AND (", year, "[PDAT])")
entrez_search(db="pubmed", term=query, retmax=0)$count
}
year <- 2008:2014
papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)
plot(year, papers, type='b', main="The Rise of the Connectome")
Check out the rentrez user docs for more examples on
using this package. https://cran.r-project.org/web/packages/rentrez/vignettes/rentrez_tutorial.html
Or, you can often find useful walkthroughs in package vignettes in R.
vignette("rentrez_tutorial")
## starting httpd help server ... done