DNA Sequence Statistics using R: Team Project

Introduction

Genomes are beautifully complex; they contain an organism’s complete set of DNA, including all of its genes. From a computing perspective, genomes also can be viewed as R strings that are several million characters long. For this team project, your microbiology team members have just extracted and sequenced the 16s rRNA from the unknown bacteria to identify the species.

Now, you will work together as a team using R string manipulation and pattern matching to ‘mine’ the genome of your bacteria for useful information. The description below can serve as a foundation for your genomic exploration, but there are lots of opportunities for creativity beyond the scope of these exercises. Interesting preliminary findings may point to specific characteristics of the genome that invite further analysis.

Deliverables

This project is one-half of the CS125 Team Project 4. If you are the only CS student in your group with other microbiology students, you DO NOT need to complete the R portion of the project. You are only expected to do the Python portion, although you are of course welcome to complete part or all of this project as well.

Your team will turn in two files. One file will be this input document with the .rmd extension. Another will be a copy of your output file with the .html extension. Attach these files in an email to your CS125 TA’s with the subject as ‘Your Name, Team Project 4: R part’.

In addition, your BIO231/CS125 team will synthesize both lab and computational results to create a final poster on your organism. More details for this can be found on the BIO231 Collaboration Home Page

R Packages

We will use the R package, seqinr which is frequently used by biologists for sequence manipulation. Copy and paste the following lines to install this package:

install.packages('seqinr')
library('seqinr')

Requirements

Note before you begin: It is likely that some steps in this project, especially step 2 and step 3, maybe take time to run as you will be dealing with large DNA sequences. If you have trouble on your laptop, consider downloading RStudio and using one of the lab computers in RNS202 or 203.

We will be obtaining our DNA sequences from GenBank ®, which is an annotated collection of DNA sequences available through the National Institutes of Health.

First, find the accession number for the complete genome of your organism. Go to this document and follow the link to your organism’s genome. The link will take you to the DNA sequence available on the NCBI (National Center for Biotechnology Information) datase where you can obtain the DNA sequence for your organism.

Find the accession number for your organism. It should be about an 8-digit number starting with two letters.

 Include your species, its accession number, and anything else you find interesting from the NCBI genome page on your poster

*Note if you are using a local version of R downloaded on your laptop instead of RStudio, you will need to do a few extra steps to obtain the genome sequence. Go to the end of this document to find instructions

Go to the “Files” tab in the lower right corner of Rstudio and open the Genome_Data folder. Locate the .txt file for your organism containing it’s genome. To save time, they have been uploaded for you in FASTA format.
Read in the FASTA file (the standard format for biological sequence files) using the following lines of code as an example.

DNAseq = read.fasta(file="Genome_data/Your_organism.txt",as.string=TRUE)

If you are successful, a value containing the variable you stored your FASTA file as will appear in your workspace in the upper-righthand corner.

When you have read in the FASTA file, the string of DNA nucleotides (A, C, T, or G) are stored in the first element of the list that is returned. If you print out the first fifty chracters using substr , the output will look something like this

[1] "atttttttct caaaaggtac gggagtccgt gatgacatat gattta"

Note: When working with large genome sequences, it is advisable to create a subset containing only few hundred nucleotide base pairs to test your code on before applying to the entire genome.

Now write a function to calculate the GC content similar to what you did in the R Strings homework. Remember that GC content is the fraction of the sequence that consists of Gs and Cs.

If you are looking to remove the white spaces in the string, you may find gsub(" “,”“, DNAseq[[1]]) useful. It is also fine to use gregexpr() and other R functions to simply ‘overlook’ the white spaces.

When calculating the total fraction of Gs and Cs, you will want to include in your denominator any non-A/C/T/G nucleotides (e.g. if your sequence data contains some ‘N’s’). For example:

 NNAAGGCGCTGCGCTAACCNNCNAAGGGGGCANGCGACCACTGGTAGGGTCAC

What do these N’s stand for?

 Write your answer here

#put your code here:

As you might expect, the GC content can fluctuate significantly in different regions of the genome. We are interested in localized GC content variation because regions of the genome that have unusually low GC content may indicate high DNA mutation rates.

We will be plotting how the GC content changes using ggplot2. First, you wil need to install the ggplot2 package.

install.packages('ggplot2')
library('ggplot2')

Next, choose an interval length (the ‘tick width’ on your x-axis). This varies based on the size of your organism’s genome. Start with a 20000 nucleotide base pair interval, then scale your interval size up or down from there if necessary. Yout want a plot that isn’t too cluttered but still specific enough to identify regions in the genome where GC content is unusually low or high.

Use a for loop to calculate the local GC content for each base pair interval of your choice for the entire genome. Then use ggplot2 to plot the GC content versus the nucleotide position so that your final result looks something like this (only prettier, because you’ll be using ggplot). Work with your microbiology team members to analyze and understand and interpret your results.

Note: the following graph uses a 20000 nucleotide base pair intervals.

#put your code here:

Optional

These next few exercises are totally optional if you are interested in doing more!

If one of your group members has experience with statistics, use your graph to identify regions of the genome where the GC content is unusually low. Perform a t-test to determine whether certain regions of the genome have a “statistically significant” low GC content.

#put your code here:

Regions of the genome that have unusually low GC content may be indicate that this is the origin of replication (oriC) in your genome. Work with the microbiology members on your team to understand what the oriC is and why it’s important.

 Write your answer here:

Search the DoriC database for your organism and obtain the ‘RefSeq’ for the oriC. Do not include the ‘version’ number (decimal attached to the end). Contact your professor if your organism is not one of the >1500 bacteria listed in the database.

For example, for Escherichia coli W3110 DNA the oriC RefSeq is AC000091.1, so you will need to use read.Genbank(“AC_000091”)

 Write the RefSeq for your organism's oriC here:

Use pattern matching to determine the numbers of the start and stop nucleotides for the actual oriC in the genome of your organism. Then, create plots similar to the one pictured below showing the GC content (Figure A) for your organism as a function of distance from the oriC.

GCcontent

Image Credit: Li Wen-Chao, Zhong Zhe-Jin, Zhu Pan-Pan, Deng En-Ze, Ding Hui, Chen Wei, Lin Hao. The sequence analysis of origin of replication in Saccharomyces cerevisiae genomes. Frontiers in Microbiology 5, 2014.

#put your code here:

The figure above was obtained from a Frontiers in Microbiology article, which you and your team may find helpful. Note that the article deals with Saccharomyces cerevisiae which has multiple oriCs, while your organism likely only has one.

Additional ideas

The following analyses are even more ideas to supplement your project and final presentation. Use your judgement in terms of time spent on this project, but it may be interesting to complete some of the following:

Calculate and plot the “GC skew” for your organism’s genome. Then plot the GC skew as a function of distance (in nucleotide bp) from the oriC (similar to Figure B from the Frontiers in Microbiology article. GC skew is calculated as follows:
```
GC Skew = (G - C)/(G + C)
```
Write a function that takes the reverse complement of your organism’s genome sequence.
Write a function that translates the genome from DNA to RNA to protein using the protein coding table below:

Image credit: Carnegie Mellon University Topics in Bioinformatics and Computational Biology

Be creative! What other information can be obtained from a genome?

If you’re using a downloaded local version of R (not the Rstudio server)

Install the ape package. Use the read.GenBank() function from the ape package to automatically access the GenBank database and read the genome sequence for your organism into R using accession numbers as arguments. Use the R documentation to learn how to use this function to properly read in your sequence. Make sure you test that the correct sequence has been read in using the command attr(sequence,“species”) .

# put your code here!

Note: If you receive the following error Error in FI[i]:LA[i] : NA/NaN argument when using your own computer then make sure you have the latest versions of Rstudio and the ape package installed.

If you continue to have trouble when using read.GenBank() , see this tutorial to learn how to manually download the sequence from GenBank into Rstudio. Contact your TA’s if you are still unable to obtain your sequence!

Next, you will create a FASTA file (the standard format for biological sequence files) using the following lines of code as an example.

GenBankDownload=read.GenBank(accession)
write.dna(GenBankDownload,file="GenBankDownload.txt",format="fasta")
genome=read.fasta(file="GenBankDownload.txt")

*Note that the write.dna() step often takes up to 20 minutes because the file is so large!

#put your code here:

Return to step 4 of the original tutorial!