QUESTION: How do I remove the “header” information from a FASTA-formatted sequence?

When FASTA sequences are downloaded from the NCBI database, they come with a “header” line that is not part of the actual sequence. Before the sequence can be used in data analysis, this “header” must be removed; here we do so with the function fasta_cleaner from the compbio4all package.

DATA

Load the necessary libraries

library(compbio4all)
library(rentrez)
library(seqinr)

Download the FASTA sequence for the protein of interest (in this case, human DIO1, accession NP_000783) using entrez_fetch from the rentrez package

dio1.human <- entrez_fetch(db = "protein", 
                          id = "NP_000783", 
                          rettype = "fasta")

Display the FASTA sequence. Notice that the string begins with a header line (starting with >) containing some information about the sequence, and that line breaks appear as \n.

dio1.human
## [1] ">NP_000783.2 type I iodothyronine deiodinase isoform a [Homo sapiens]\nMGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNPHFSHDNWIPTFFSTQY\nFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGNRPLVLNFGSCTUPSFMFKFDQFKRLI\nEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNHQNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYA\nALPERLYIIQEGRILYKGKSGPWNYNPEEVRAVLEKLHS\n\n"
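
Because the raw result is a single character string with embedded \n characters, it can be easier to read when printed with cat(). The following is a minimal sketch in base R; raw_fasta is a made-up stand-in for dio1.human (not real NCBI data) so that the example runs without a network connection.

```r
# Made-up single-record FASTA string standing in for the entrez_fetch() result.
raw_fasta <- ">seq1 example protein\nMGLPQ\nPGLWL\n\n"

# cat() interprets the "\n" characters, so the header appears on its own line.
cat(raw_fasta)

# The header is the first line; strsplit() on "\n" separates the lines.
fasta_lines <- strsplit(raw_fasta, "\n", fixed = TRUE)[[1]]
header_line <- fasta_lines[1]
header_line
```

The same calls should work on the string returned by entrez_fetch, since it has the same layout: a header line beginning with >, followed by the sequence lines.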

Use the fasta_cleaner function from the compbio4all package to clean the FASTA sequence. This removes the “header” information and the line breaks, resulting in one long string.

dio1.human <- fasta_cleaner(dio1.human, parse = FALSE)
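
To make the cleaning step less of a black box, here is a rough base-R sketch of the same idea: drop the header line, then delete the remaining line breaks. The function strip_fasta_header is a hypothetical helper written for illustration; it is not part of compbio4all and is not necessarily how fasta_cleaner is implemented. It assumes a single-record FASTA string.

```r
# Hypothetical helper, not part of compbio4all: strips the header from a
# single-record FASTA string such as the one returned by entrez_fetch().
strip_fasta_header <- function(fasta_string) {
  # Drop everything from the leading ">" through the first newline.
  body <- sub("^>[^\n]*\n", "", fasta_string)
  # Remove the remaining line breaks so the sequence is one continuous string.
  gsub("\n", "", body, fixed = TRUE)
}

# Made-up example input (not real NCBI data).
example_fasta <- ">seq1 example protein\nMGLPQ\nPGLWL\n\n"
cleaned <- strip_fasta_header(example_fasta)
cleaned
```

A sketch like this would not handle files containing several records; for those, a dedicated FASTA reader is a safer choice.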

Display the cleaned FASTA sequence. It no longer has the “header” information; it is just a single string of characters.

dio1.human
## [1] "MGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNPHFSHDNWIPTFFSTQYFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGNRPLVLNFGSCTUPSFMFKFDQFKRLIEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNHQNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYAALPERLYIIQEGRILYKGKSGPWNYNPEEVRAVLEKLHS"
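
Once the sequence is a plain string, ordinary base-R string functions can be applied to it. As a small illustration, the sketch below uses only the first 17 residues of the cleaned sequence (stored in clean_seq, a name chosen for this example) to count residues and tabulate amino acid frequencies.

```r
# First 17 residues of the cleaned sequence shown above, used as a short example.
clean_seq <- "MGLPQPGLWLKRLWVLL"

# Number of residues in the string.
nchar(clean_seq)

# Split into a character vector with one amino acid per element.
residues <- strsplit(clean_seq, "")[[1]]

# Count how often each amino acid occurs.
table(residues)
```

The same calls can be run on the full dio1.human string.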

Keywords

  • compbio4all
  • rentrez
  • entrez_fetch
  • fasta_cleaner