This is a modification of “DNA Sequence Statistics” from Avril Coghlan’s A little book of R for bioinformatics.. Most of the text and code was originally written by Dr. Coghlan and distributed under the Creative Commons 3.0 license.
NOTE: There is some redundancy in this current draft that needs to be eliminated.
By the end of this tutorial you will be able to
The NCBI RefeSeq accessions for the DNA sequences of the DEN-1, DEN-2, DEN-3, and DEN-4 Dengue viruses are NC_001477, NC_001474, NC_001475 and NC_002640, respectively.
According to Wikipedia
“Dengue virus (DENV) is the cause of dengue fever. It is a mosquito-borne, single positive-stranded RNA virus … Five serotypes of the virus have been found, all of which can cause the full spectrum of disease. Nevertheless, scientists’ understanding of dengue virus may be simplistic, as rather than distinct … groups, a continuum appears to exist.” https://en.wikipedia.org/wiki/Dengue_virus
Note that seqinr the package name is all lower case, though the authors of the package like to spell it “SeqinR”.
library(rentrez)
library(seqinr)
#library(compbio4all)
The chapter will guide you through the process of using R to carry out simple analyses that are common in bioinformatics and computational biology. In particular, the focus is on computational analysis of biological sequence data such as genome sequences and protein sequences. The programming approaches, however, are broadly generalizable to statistics and data science.
The tutorials assume that the reader has some basic knowledge of biology, but not necessarily of bioinformatics. The focus is to explain simple bioinformatics analysis, and to explain how to carry out these analyses using R.
Many authors have written R packages for performing a wide variety of analyses. These do not come with the standard R installation, but must be installed and loaded as “add-ons”.
Bioinformaticians have written numerous specialized packages for R. In this tutorial, you will learn to use some of the function in the SeqinR package to to carry out simple analyses of DNA sequences. (SeqinR can retrieve sequences from a DNA sequence database, but this has largely been replaced by the functions in the package rentrez)
Many well-known bioinformatics packages for R are in the Bioconductor set of R packages (www.bioconductor.org), which contains packages with many R functions for analyzing biological data sets such as microarray data. The SeqinR package is from CRAN, which contains R functions for obtaining sequences from DNA and protein sequence databases, and for analyzing DNA and protein sequences.
For instructions/review on how to install an R package on your own see How to install an R package )
We will also use functions or data from the rentrez and compbio4all packages.
Remember that you can ask for more information about a particular R command by using the help() function or ? function. For example, to ask for more information about the library(), you can type:
help("library")
You can also do this
?library
The FASTA format is a simple and widely used format for storing biological (e.g. DNA or protein) sequences. It was first used by the FASTA program for sequence alignment in the 1980s and has been adopted as standard by many other programs.
FASTA files begin with a single-line description starting with a greater-than sign > character, followed on the next line by the sequences. Here is an example of a FASTA file. (If you’re looking at the source script for this lesson you’ll see the cat() command, which is just a text display function used format the text when you run the code).
## >A06852 183 residues MPRLFSYLLGVWLLLSQLPREIPGQSTNDFIKACGRELVRLWVEICGSVSWGRTALSLEEPQLETGPPAETMPSSITKDAEILKMMLEFVPNLPQELKATLSERQPSLRELQQSASKDSNLNFEEFKKIILNRQNEAEDKSLLELKNLGLDKHSRKKRLFRMTLSEKCCQVGCIRKDIARLC
The US National Centre for Biotechnology Information (NCBI) maintains the NCBI Sequence Database, a huge database of all the DNA and protein sequence data that has been collected. There are also similar databases in Europe, the European Molecular Biology Laboratory (EMBL) Sequence Database, and Japan, the DNA Data Bank of Japan (DDBJ). These three databases exchange data every night, so at any one point in time, they contain almost identical data.
Each sequence in the NCBI Sequence Database is stored in a separate record, and is assigned a unique identifier that can be used to refer to that record. The identifier is known as an accession, and consists of a mixture of numbers and letters.
For example, Dengue virus causes Dengue fever, which is classified as a neglected tropical disease by the World Health Organization (WHO), is classified by any one of four types of Dengue virus: DEN-1, DEN-2, DEN-3, and DEN-4. The NCBI accessions for the DNA sequences of the DEN-1, DEN-2, DEN-3, and DEN-4 Dengue viruses are
Note that because the NCBI Sequence Database, the EMBL Sequence Database, and DDBJ exchange data every night, the DEN-1 (and DEN-2, DEN-3, DEN-4) Dengue virus sequence are present in all three databases, but they have different accessions in each database, as they each use their own numbering systems for referring to their own sequence records.
Uncomment the code chunk below to download your own sequence of interest. Change the id = to that of your sequence, and change the db = to “protein” if needed. Change the object name to a descriptive name, such as the name of the gene, e.g. shroom.fasta.
itgalseq.fasta <- rentrez::entrez_fetch(db = "nucleotide", # set to "protein" if needed
id = "NM_001114380.2", # change accession
rettype = "fasta")
Set your working directory then save the FASTA file to your hard drive using. Be sure to change the name of the object and the file name to be appropriate to your gene.
write(itgalseq.fasta, # change object name, e.g. shroom.fasta
file="itgalseq.fasta") # change file name, e.g. shroom.fasta