FASTA sequences downloaded from the NCBI databases include a “header” line (beginning with ">") and newline characters that are not part of the actual sequence. Before using the sequence in data analysis, we must remove this extra information using the function fasta_cleaner.
Load the necessary libraries
library(compbio4all)
library(rentrez)
library(seqinr)
Download the FASTA sequence for the protein (in this case, human DIO1) using entrez_fetch from the rentrez package
dio1.human <- entrez_fetch(db = "protein",
                           id = "NP_000783",
                           rettype = "fasta")
Display the FASTA sequence. Notice how the first line displays a header with some information about the sequence.
dio1.human
## [1] ">NP_000783.2 type I iodothyronine deiodinase isoform a [Homo sapiens]\nMGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNPHFSHDNWIPTFFSTQY\nFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGNRPLVLNFGSCTUPSFMFKFDQFKRLI\nEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNHQNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYA\nALPERLYIIQEGRILYKGKSGPWNYNPEEVRAVLEKLHS\n\n"
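The \n characters in the output above are newline characters stored inside a single string. As a minimal sketch, the following uses a short made-up record (toy.fasta is a hypothetical name and not a real sequence) so that it runs without querying NCBI. It shows how cat() renders the line breaks and how the header can be picked out as the first line.

```r
# Hypothetical toy FASTA record (not a real sequence), used so the
# example runs without an internet connection
toy.fasta <- ">toy_id example protein\nMGLPQ\nPGLWL\n\n"

# print() shows the raw string, with newlines escaped as \n
print(toy.fasta)

# cat() interprets the newlines and shows the record line by line
cat(toy.fasta)

# Split the record on newlines; the first element is the header line
toy.lines <- strsplit(toy.fasta, "\n")[[1]]
toy.lines[1]
```

The same calls should work on dio1.human, since entrez_fetch returned it as a single character string.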
Use the fasta_cleaner function from the compbio4all package to clean the FASTA sequence. This removes the “header” information and the newline characters, resulting in one long string
dio1.human <- fasta_cleaner(dio1.human, parse = FALSE)
Display the cleaned FASTA sequence. It no longer has the “header” information or line breaks; it is just a single string of characters.
dio1.human
## [1] "MGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNPHFSHDNWIPTFFSTQYFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGNRPLVLNFGSCTUPSFMFKFDQFKRLIEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNHQNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYAALPERLYIIQEGRILYKGKSGPWNYNPEEVRAVLEKLHS"
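To make the cleaning step less of a black box, here is a rough base R sketch of the same idea: drop the header line, then remove the newlines. The helper strip_fasta_header and the toy record are hypothetical names introduced for illustration. This is not the actual source code of fasta_cleaner, which may handle more cases.

```r
# Hypothetical helper, NOT the compbio4all implementation of fasta_cleaner
strip_fasta_header <- function(fasta.string) {
  # Drop the header: everything from the leading ">" through the first newline
  no.header <- sub("^>[^\n]*\n", "", fasta.string)
  # Remove the remaining newlines to get one continuous string
  gsub("\n", "", no.header)
}

# Hypothetical toy FASTA record (not a real sequence)
toy.fasta <- ">toy_id example protein\nMGLPQ\nPGLWL\n\n"

toy.clean <- strip_fasta_header(toy.fasta)
toy.clean

# Number of residues in the cleaned sequence
nchar(toy.clean)
```

Checking nchar() on the cleaned string is a quick way to confirm that no header text or newline characters were left behind.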