Introduction

This project demonstartes how to automate retrieval of mitochondrial CO1 gene sequences from NCBI for a chosen organism using R and the “ape” package.

Example Code from the Tutorial

Following the tutorial provided from RPubs to show that the fish sample works. Starting with creating a chr vector followed by downloading all sequntial sequences from Genbank. The data is then written into a new csv file.

seq1 <- paste("JQ", seq(983161, 983255), sep = "")

sequences <- read.GenBank(seq1,
                          seq.names = seq1,
                          species.names = TRUE,
                          as.character = TRUE)

write.dna(sequences, "fish.fasta", format = "fasta")

Chosen Organism

I chose Puma concolor (taxid 9696). I sreached NCBI for “txid969[Organism] and COX1[Gene]” and downloaded the Accession list.

Reading Accession List

ids <- read.table("sequence.seq",  stringsAsFactors = FALSE) [,1]
head(ids)
## [1] "NC_016470.1"
length(ids)
## [1] 1

Download all sequences

puma_seqs <- read.GenBank(ids)
puma_seqs
## 1 DNA sequence in binary format stored in a list.
## 
## Sequence length: 17153 
## 
## Label:
## NC_016470.1
## 
## Base composition:
##     a     c     g     t 
## 0.329 0.260 0.138 0.273 
## (Total: 17.15 kb)

Saving to FASTA

write.dna(puma_seqs, file = "puma_CO1.fasta", format = "fasta")

Quick analysis

Adding summaries of sequence lengths and GC content for visuals.

sapply(puma_seqs, length)
## NC_016470.1 
##       17153
gc_content <- function(x) {
  seq <- as.character(x)
  (sum(seq == "G" | seq == "C") / length(seq)) * 100
}
sapply(puma_seqs, gc_content)
## NC_016470.1 
##           0