This project demonstartes how to automate retrieval of mitochondrial CO1 gene sequences from NCBI for a chosen organism using R and the “ape” package.
Following the tutorial provided from RPubs to show that the fish sample works. Starting with creating a chr vector followed by downloading all sequntial sequences from Genbank. The data is then written into a new csv file.
seq1 <- paste("JQ", seq(983161, 983255), sep = "")
sequences <- read.GenBank(seq1,
seq.names = seq1,
species.names = TRUE,
as.character = TRUE)
write.dna(sequences, "fish.fasta", format = "fasta")
I chose Puma concolor (taxid 9696). I sreached NCBI for “txid969[Organism] and COX1[Gene]” and downloaded the Accession list.
ids <- read.table("sequence.seq", stringsAsFactors = FALSE) [,1]
head(ids)
## [1] "NC_016470.1"
length(ids)
## [1] 1
puma_seqs <- read.GenBank(ids)
puma_seqs
## 1 DNA sequence in binary format stored in a list.
##
## Sequence length: 17153
##
## Label:
## NC_016470.1
##
## Base composition:
## a c g t
## 0.329 0.260 0.138 0.273
## (Total: 17.15 kb)
write.dna(puma_seqs, file = "puma_CO1.fasta", format = "fasta")
Adding summaries of sequence lengths and GC content for visuals.
sapply(puma_seqs, length)
## NC_016470.1
## 17153
gc_content <- function(x) {
seq <- as.character(x)
(sum(seq == "G" | seq == "C") / length(seq)) * 100
}
sapply(puma_seqs, gc_content)
## NC_016470.1
## 0