Genomes are beautifully complex; they contain an organism’s complete set of DNA, including all of its genes. From a computing perspective, genomes also can be viewed as R strings that are several million characters long. For this team project, your microbiology team members have just extracted and sequenced the 16s rRNA from the unknown bacteria to identify the species.
Now, you will work together as a team using R string manipulation and pattern matching to ‘mine’ the genome of your bacteria for useful information. The description below can serve as a foundation for your genomic exploration, but there are lots of opportunities for creativity beyond the scope of these exercises. Interesting preliminary findings may point to specific characteristics of the genome that invite further analysis.
This project is one-half of the CS125 Team Project 4. If you are the only CS student in your group with other microbiology students, you DO NOT need to complete the R portion of the project. You are only expected to do the Python portion, although you are of course welcome to complete part or all of this project as well.
Your team will turn in two files. One file will be this input document with the .rmd extension. Another will be a copy of your output file with the .html extension. Attach these files in an email to your CS125 TA’s with the subject as ‘Your Name, Team Project 4: R part’.
In addition, your BIO231/CS125 team will synthesize both lab and computational results to create a final poster on your organism. More details for this can be found on the BIO231 Collaboration Home Page
We will use the R package, seqinr
which is frequently used by biologists for sequence manipulation. Copy and paste the following lines to install this package:
install.packages('seqinr')
library('seqinr')
We will be obtaining our DNA sequences from GenBank ®, which is an annotated collection of DNA sequences available through the National Institutes of Health.
Find the accession number for your organism. It should be about an 8-digit number starting with two letters.
Include your species, its accession number, and anything else you find interesting from the NCBI genome page on your poster
*Note if you are using a local version of R downloaded on your laptop instead of RStudio, you will need to do a few extra steps to obtain the genome sequence. Go to the end of this document to find instructions
Go to the “Files” tab in the lower right corner of Rstudio and open the Genome_Data folder. Locate the .txt file for your organism containing it’s genome. To save time, they have been uploaded for you in FASTA format.
Read in the FASTA file (the standard format for biological sequence files) using the following lines of code as an example.
DNAseq = read.fasta(file="Genome_data/Your_organism.txt",as.string=TRUE)
If you are successful, a value containing the variable you stored your FASTA file as will appear in your workspace in the upper-righthand corner.
substr
, the output will look something like this[1] "atttttttct caaaaggtac gggagtccgt gatgacatat gattta"
Now write a function to calculate the GC content similar to what you did in the R Strings homework. Remember that GC content is the fraction of the sequence that consists of Gs and Cs.
If you are looking to remove the white spaces in the string, you may find gsub(" “,”“, DNAseq[[1]])
useful. It is also fine to use gregexpr()
and other R functions to simply ‘overlook’ the white spaces.
When calculating the total fraction of Gs and Cs, you will want to include in your denominator any non-A/C/T/G nucleotides (e.g. if your sequence data contains some ‘N’s’). For example:
NNAAGGCGCTGCGCTAACCNNCNAAGGGGGCANGCGACCACTGGTAGGGTCACWhat do these N’s stand for?
Write your answer here
#put your code here:
We will be plotting how the GC content changes using ggplot2. First, you wil need to install the ggplot2
package.
install.packages('ggplot2')
library('ggplot2')
Next, choose an interval length (the ‘tick width’ on your x-axis). This varies based on the size of your organism’s genome. Start with a 20000 nucleotide base pair interval, then scale your interval size up or down from there if necessary. Yout want a plot that isn’t too cluttered but still specific enough to identify regions in the genome where GC content is unusually low or high.
Use a for loop to calculate the local GC content for each base pair interval of your choice for the entire genome. Then use ggplot2 to plot the GC content versus the nucleotide position so that your final result looks something like this (only prettier, because you’ll be using ggplot). Work with your microbiology team members to analyze and understand and interpret your results.
#put your code here:
These next few exercises are totally optional if you are interested in doing more!
#put your code here:
Write your answer here:
For example, for Escherichia coli W3110 DNA the oriC RefSeq is AC000091.1, so you will need to use read.Genbank(“AC_000091”)
Write the RefSeq for your organism's oriC here:
Image Credit: Li Wen-Chao, Zhong Zhe-Jin, Zhu Pan-Pan, Deng En-Ze, Ding Hui, Chen Wei, Lin Hao. The sequence analysis of origin of replication in Saccharomyces cerevisiae genomes. Frontiers in Microbiology 5, 2014.
#put your code here:
The figure above was obtained from a Frontiers in Microbiology article, which you and your team may find helpful. Note that the article deals with Saccharomyces cerevisiae which has multiple oriCs, while your organism likely only has one.
The following analyses are even more ideas to supplement your project and final presentation. Use your judgement in terms of time spent on this project, but it may be interesting to complete some of the following:
GC Skew = (G - C)/(G + C)
Write a function that takes the reverse complement of your organism’s genome sequence.
Write a function that translates the genome from DNA to RNA to protein using the protein coding table below:
Image credit: Carnegie Mellon University Topics in Bioinformatics and Computational Biology
ape
package. Use the read.GenBank()
function from the ape
package to automatically access the GenBank database and read the genome sequence for your organism into R using accession numbers as arguments. Use the R documentation to learn how to use this function to properly read in your sequence. Make sure you test that the correct sequence has been read in using the command attr(sequence,“species”)
.# put your code here!
Error in FI[i]:LA[i] : NA/NaN argument
when using your own computer then make sure you have the latest versions of Rstudio and the ape
package installed.If you continue to have trouble when using read.GenBank()
, see this tutorial to learn how to manually download the sequence from GenBank into Rstudio. Contact your TA’s if you are still unable to obtain your sequence!
GenBankDownload=read.GenBank(accession)
write.dna(GenBankDownload,file="GenBankDownload.txt",format="fasta")
genome=read.fasta(file="GenBankDownload.txt")
*Note that the write.dna() step often takes up to 20 minutes because the file is so large!
#put your code here: