The RPubs file is here: https://rpubs.com/HailaSchultz/Week-07
The fasta file I am using is my reference file of COI barcode sequences for all barcoded zooplankton.
install package
install.packages("seqinr")
load package
library(seqinr)
download fasta file
cd /home/shared/8TB_HDD_02/schulh2/GitHub/haila-coursework/assignments/data
curl -O https://www.st.nmfs.noaa.gov/copepod/collaboration/metazoogene/atlas/data-src/MZGfasta-coi__MZGdbALL__o00__A.fasta
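A quick sanity check after the download is to count the records in the file, since every fasta record starts with a ">" header line. The sketch below is only an illustration; `count_fasta_records` is a hypothetical helper name, not part of any tool used in this assignment.

```shell
# Hypothetical helper: count fasta records by counting header lines.
# Every record begins with a line starting with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example call, using the file name from this notebook:
# count_fasta_records MZGfasta-coi__MZGdbALL__o00__A.fasta
```

A count of zero, or a very small file, would suggest the download failed or returned an error page instead of sequences.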
read in fasta file
# Name of the multi-sequence fasta file downloaded above (relative to the working directory)
input_file <- "MZGfasta-coi__MZGdbALL__o00__A.fasta"
sequences <- read.fasta(input_file)
# Set the seed for reproducibility (optional)
set.seed(42)
# Randomly select 10 sequences from the file
number_of_sequences_to_select <- 10
if (length(sequences) < number_of_sequences_to_select) {
  warning("There are fewer than 10 sequences in the fasta file. All sequences will be selected.")
  number_of_sequences_to_select <- length(sequences)
}
selected_indices <- sample(length(sequences), number_of_sequences_to_select)
selected_sequences <- sequences[selected_indices]
write a fasta file with the 10 selected sequences
output_file <- "/home/shared/8TB_HDD_02/schulh2/GitHub/haila-coursework/assignments/output/ten_sequences.fasta"
write.fasta(selected_sequences, names(selected_sequences), output_file, open = "w")
create index of fasta file
/home/shared/samtools-1.12/samtools faidx \
../output/ten_sequences.fasta
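The index written by samtools faidx is a tab-delimited text file whose first two columns are the sequence name and its length in bases. As a rough cross-check, the same two columns can be recomputed directly from the fasta file. This is a minimal sketch, assuming plain Unix line endings; `fasta_lengths` is a hypothetical helper name.

```shell
# Hypothetical helper: print "name<TAB>length" for each fasta record.
# The output should correspond to the first two columns of the .fai index.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # header text up to the first whitespace
            len = 0
            next
        }
        { len += length($0) }      # add up bases across wrapped lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example call, using the path from this notebook:
# fasta_lengths ../output/ten_sequences.fasta
```

The result can be compared by eye with `cut -f1,2 ../output/ten_sequences.fasta.fai`.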
Find CG motifs
fuzznuc -sequence ../output/ten_sequences.fasta -pattern CG -rformat gff -outfile ../output/CGoutput.gff
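To double-check the motif search independently of fuzznuc, the CG occurrences can also be counted with awk. The sketch below joins the wrapped lines of each record first, so a C at the end of one line followed by a G at the start of the next is still counted, and it uppercases the sequence because seqinr's read.fasta converts DNA to lowercase by default. `count_cg` is a hypothetical helper name, and the totals should only match the fuzznuc output under the assumption that fuzznuc reports one hit per literal CG on the forward strand.

```shell
# Hypothetical helper: print "name<TAB>number of CG dinucleotides" per record.
count_cg() {
    awk '
        function report() {
            if (name != "") {
                n = gsub(/CG/, "&", seq)   # gsub returns the number of matches
                print name "\t" n
            }
        }
        /^>/ { report(); name = substr($1, 2); seq = ""; next }
        { seq = seq toupper($0) }          # join wrapped lines, ignore case
        END { report() }
    ' "$1"
}

# Example call, using the path from this notebook:
# count_cg ../output/ten_sequences.fasta
```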
Since the files were small and my GitHub repo is private, I downloaded ten_sequences.fasta, ten_sequences.fasta.fai, and CGoutput.gff to my local machine. In IGV, I loaded the fasta file as the genome and the gff file as an annotation track.
change directory to location of screenshots
cd /home/shared/8TB_HDD_02/schulh2/GitHub/haila-coursework/assignments/images
This zoomed-in view of the gff track shows where a C is immediately followed by a G.
The gff track shows 14 CG motifs across this full sequence.
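The motif count read off the IGV track can also be checked from the command line by tallying the feature lines in the gff file for each sequence. This is a minimal sketch, assuming the gff output uses tab-delimited feature lines with nine columns and "#" for comment and header lines; `count_gff_features` is a hypothetical helper name.

```shell
# Hypothetical helper: print "sequence<TAB>number of features" from a gff file.
# Lines starting with "#" and lines with fewer than 9 columns are skipped.
count_gff_features() {
    awk -F '\t' '
        !/^#/ && NF >= 9 { n[$1]++ }
        END { for (s in n) print s "\t" n[s] }
    ' "$1" | sort
}

# Example call, using the path from this notebook:
# count_gff_features ../output/CGoutput.gff
```

The row for the sequence shown in the screenshot should then agree with the number of motifs visible in IGV.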