Downloading DNA sequences as FASTA files in R

This is a modification of “DNA Sequence Statistics” from Avril Coghlan’s A little book of R for bioinformatics. Most of the text and code was originally written by Dr. Coghlan and distributed under the Creative Commons 3.0 license.

NOTE: There is some redundancy in this current draft that needs to be eliminated.

Functions

library()
help()
cat()
is()
class()
dim()
length()
nchar()
strtrim()
is.vector()
table()
write()
getwd()
seqinr::write.fasta()

Software/websites

www.ncbi.nlm.nih.gov
Text editors (e.g. Notepad++, TextWrangler)

R vocabulary

list
library
package
CRAN
wrapper
underscore _
Camel Case

File types

FASTA

Bioinformatics vocabulary

accession, accession number
NCBI
NCBI Sequence Database
EMBL Sequence Database
FASTA file
RefSeq

Learning objectives

By the end of this tutorial you will be able to

Download sequences in FASTA format
Understand their format and structure
Do basic exploration of how FASTA data is stored in a vector using is(), class(), length() and other functions
Determine the GC content using GC() and obtain other summary data with count()
Save FASTA files to your hard drive

Organisms and Sequence accessions

Dengue virus: DEN-1, DEN-2, DEN-3, and DEN-4.

The NCBI RefSeq accessions for the DNA sequences of the DEN-1, DEN-2, DEN-3, and DEN-4 Dengue viruses are NC_001477, NC_001474, NC_001475 and NC_002640, respectively.

According to Wikipedia

“Dengue virus (DENV) is the cause of dengue fever. It is a mosquito-borne, single positive-stranded RNA virus … Five serotypes of the virus have been found, all of which can cause the full spectrum of disease. Nevertheless, scientists’ understanding of dengue virus may be simplistic, as rather than distinct … groups, a continuum appears to exist.” https://en.wikipedia.org/wiki/Dengue_virus

Preliminaries

Note that the package name seqinr is all lower case, though the authors of the package like to spell it “SeqinR”.

library(rentrez)
library(seqinr)
#library(compbio4all)

DNA Sequence Statistics: Part 1

Using R for Bioinformatics

The chapter will guide you through the process of using R to carry out simple analyses that are common in bioinformatics and computational biology. In particular, the focus is on computational analysis of biological sequence data such as genome sequences and protein sequences. The programming approaches, however, are broadly generalizable to statistics and data science.

The tutorials assume that the reader has some basic knowledge of biology, but not necessarily of bioinformatics. The focus is to explain simple bioinformatics analysis, and to explain how to carry out these analyses using R.

R packages for bioinformatics: Bioconductor and SeqinR

Many authors have written R packages for performing a wide variety of analyses. These do not come with the standard R installation, but must be installed and loaded as “add-ons”.

Bioinformaticians have written numerous specialized packages for R. In this tutorial, you will learn to use some of the functions in the SeqinR package to carry out simple analyses of DNA sequences. (SeqinR can retrieve sequences from a DNA sequence database, but this has largely been replaced by the functions in the package rentrez.)

Many well-known bioinformatics packages for R are in the Bioconductor set of R packages (www.bioconductor.org), which contains packages with many R functions for analyzing biological data sets such as microarray data. The SeqinR package, which is distributed through CRAN rather than Bioconductor, contains R functions for obtaining sequences from DNA and protein sequence databases, and for analyzing DNA and protein sequences.

For instructions or a review of how to install an R package on your own, see “How to install an R package”.
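As a minimal sketch of checking which packages still need to be installed, the code below reports any of the packages used in this tutorial that are missing from your R library. The helper name missing_packages() is hypothetical and is not part of any package; the install.packages() call is left as a comment so that running the snippet changes nothing on your computer.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in this tutorial.
needed <- c("rentrez", "seqinr")
to_install <- missing_packages(needed)
print(to_install)  # character(0) means nothing is missing

# To install whatever is missing, run:
# install.packages(to_install)
```

Base R's requireNamespace() is used here instead of library() because it only checks whether a package can be loaded and returns TRUE or FALSE, rather than stopping with an error.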

We will also use functions or data from the rentrez and compbio4all packages.

Remember that you can ask for more information about a particular R command by using the help() function or the ? shortcut. For example, to ask for more information about the library() function, you can type:

help("library")

You can also do this with the ? shortcut:

?library

FASTA file format

The FASTA format is a simple and widely used format for storing biological (e.g. DNA or protein) sequences. It was first used by the FASTA program for sequence alignment in the 1980s and has been adopted as standard by many other programs.

FASTA files begin with a single-line description starting with a greater-than sign (>), followed on the next line by the sequence itself. Here is an example of a FASTA file. (If you’re looking at the source script for this lesson you’ll see the cat() command, which is just a text display function used to format the text when you run the code.)

## >A06852 183 residues
## MPRLFSYLLGVWLLLSQLPREIPGQSTNDFIKACGRELVRLWVEICGSVSWGRTALSLEEPQLETGPPAETMPSSITKDAEILKMMLEFVPNLPQELKATLSERQPSLRELQQSASKDSNLNFEEFKKIILNRQNEAEDKSLLELKNLGLDKHSRKKRLFRMTLSEKCCQVGCIRKDIARLC
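To make the header-then-sequence structure concrete, here is a small sketch in base R. The record stored in toy_fasta is made up for illustration (it is not a real database entry), and the object names are assumptions. The newline character "\n" marks where each line of the file ends.

```r
# A toy FASTA record held in one character string; "\n" marks each line break.
# The header and sequence here are made up for illustration.
toy_fasta <- ">toy_seq example record\nATGCGTAC\nGGATCC\n"

# cat() prints the string with the newlines shown as actual line breaks.
cat(toy_fasta)

# Split the record into lines, then separate the header from the sequence.
fasta_lines <- strsplit(toy_fasta, "\n")[[1]]
header      <- fasta_lines[1]
sequence    <- paste(fasta_lines[-1], collapse = "")

startsWith(header, ">")  # a well-formed header line starts with ">"
nchar(sequence)          # number of residues in the sequence
```

Note that a sequence may be wrapped over several lines in the file, which is why the lines after the header are pasted back together before counting characters.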

The NCBI sequence database

The US National Center for Biotechnology Information (NCBI) maintains the NCBI Sequence Database, a huge database of all the DNA and protein sequence data that has been collected. Similar databases exist in Europe (the European Molecular Biology Laboratory (EMBL) Sequence Database) and in Japan (the DNA Data Bank of Japan, DDBJ). These three databases exchange data every night, so at any one point in time, they contain almost identical data.

Each sequence in the NCBI Sequence Database is stored in a separate record, and is assigned a unique identifier that can be used to refer to that record. The identifier is known as an accession, and consists of a mixture of numbers and letters.

For example, Dengue fever, which is classified as a neglected tropical disease by the World Health Organization (WHO), is caused by any one of four types of Dengue virus: DEN-1, DEN-2, DEN-3, and DEN-4. The NCBI accessions for the DNA sequences of the DEN-1, DEN-2, DEN-3, and DEN-4 Dengue viruses are

NC_001477
NC_001474
NC_001475
NC_002640
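One convenient way to keep these accessions together in R is a named character vector. This is only a sketch; the object name dengue_accessions is an assumption made for illustration.

```r
# Accessions for the four Dengue virus types, as listed above.
dengue_accessions <- c("DEN-1" = "NC_001477",
                       "DEN-2" = "NC_001474",
                       "DEN-3" = "NC_001475",
                       "DEN-4" = "NC_002640")

# Look up an accession by virus type.
dengue_accessions["DEN-3"]

# Check that all four share the "NC_" prefix and count them.
all(startsWith(dengue_accessions, "NC_"))
length(dengue_accessions)
```

Any single element, for example dengue_accessions["DEN-1"], could then be supplied as the id when downloading a sequence.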

Note that because the NCBI Sequence Database, the EMBL Sequence Database, and DDBJ exchange data every night, the DEN-1 (and DEN-2, DEN-3, DEN-4) Dengue virus sequences are present in all three databases, but they have different accessions in each database, as they each use their own numbering systems for referring to their own sequence records.

Exercises

Uncomment the code chunk below to download your own sequence of interest. Change the id = to that of your sequence, and change the db = to “protein” if needed. Change the object name to a descriptive name, such as the name of the gene, e.g. shroom.fasta.

itgalseq.fasta <- rentrez::entrez_fetch(db = "nucleotide",      # set to "protein" if needed
                                        id = "NM_001114380.2",  # change accession
                                        rettype = "fasta")
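Once a sequence has been downloaded, it helps to check how R has stored it. The sketch below uses a made-up stand-in string, toy.fasta, so that it runs without an internet connection; it assumes the downloaded object is a single character string with embedded newlines, and you can substitute your own object name.

```r
# Stand-in for a downloaded FASTA record: one character string with
# embedded newlines. The content is made up for illustration.
toy.fasta <- ">toy_seq example record\nATGCGTACGGATCC\nTTAGGC\n"

class(toy.fasta)        # "character"
is.vector(toy.fasta)    # TRUE
length(toy.fasta)       # 1: the whole record is a single element
dim(toy.fasta)          # NULL: a plain vector has no dimensions
nchar(toy.fasta)        # total characters, including ">" and every "\n"
strtrim(toy.fasta, 10)  # keep only the first 10 characters
```

Because the whole record is one element, length() reports 1 no matter how long the sequence is; nchar() is the function that counts characters.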

Set your working directory, then save the FASTA file to your hard drive using write(). Be sure to change the name of the object and the file name to be appropriate to your gene.

 write(itgalseq.fasta,        # change object name, e.g.  shroom.fasta
       file="itgalseq.fasta") # change file name, e.g. shroom.fasta
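To confirm that a file was saved and to see where it went, a sketch like the following can help. It uses a made-up stand-in string and writes to R's temporary directory (an assumption made here so that nothing in your working directory is overwritten); the object and file names are illustrative only.

```r
# Stand-in FASTA text; made up for illustration.
toy.fasta <- ">toy_seq example record\nATGCGTACGGATCC\n"

# getwd() reports the working directory, which is where a bare file name
# such as "toy.fasta" would be saved.
getwd()

# Write to a temporary folder instead of the working directory.
out_file <- file.path(tempdir(), "toy.fasta")
write(toy.fasta, file = out_file)

file.exists(out_file)  # confirm the file was created
readLines(out_file)    # read it back, one element per line
```

Opening the saved file in a text editor should show the header on the first line and the sequence on the lines after it.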

Review questions

What does the nchar() function stand for?
Why does a FASTA file stored in a vector by entrez_fetch() have a length of 1 and no dimension?
What does strtrim() mean?
If a sequence is stored in object x and you run the code strtrim(x, 10), how many characters are shown?
What is the newline character in a FASTA file?