Preamble

Run the following code block. Biostrings is a Bioconductor package, so please install the required packages first. If you have yet to see the installation guide, follow this link: https://dl.dropboxusercontent.com/u/907375/Bioinformatics_class/installing.htm. Also, the source() function belowe loads a script with all of the functions required for this tutorial; therefore, if you clear your workspace, you should re-source the R_functoins.R script to reload said functions.

source('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/1f267222249c9b89cd3d47575e592ca0dc6a492a/R_functions.R')

load_library(Biostrings)
load_library(ggplot2)
load_library(readr)
load_library(stringr)

Introduction

This is going to be a set of class exercises that are intended to be more of a bioinformatics flavor than the accompanying programming videos You might lack some intuition as to why some of these problems are important, but you should grow to appreciate them early in the course. The goal here is to present very basic bioinformatics problems and introduce a programmatic way of tackling them (possibly multiple ways), along with a Bioconductor approach. Bioconductor is a programming environment designed for analysis and interpretation of ’omic data.

One last note: many of these problems are heavily influenced by Rosalind. I highly suggest you explore the site further for a large set of incredibly rewarding programming based bioinformatics problems: http://Rosalind.info/problems/DNA/

We’ll start with some warmup problems to ensure that you are comfortable with R and basic programming.

Organization

Each section will cover a particular topic: either a programming topic or a bioinformatics topic. No advanced understanding of biology or bioinformatics is required for these problems. They are aimed to be programming problems simply disguised as bioinformatics problems, when appicable.

You will see code blocks within each section. If the top of the code block includes a header saying #<<<>>>EXERCISE<<<>>>#, then you are expected to complete the code block for it to run correctly. Areas in the code block with the placeholder ###ENTERCODEHERE### are areas you should focus. For example, if you had a problem where you asked to assign the number 5 to the variable n and then print it, you’d see

########################
#<<<>>> EXERCISE <<<>>>#
########################

n <- ###ENTERCODEHERE###
print(n)

You’d be expected to change this block to

########################
#<<<>>> EXERCISE <<<>>>#
########################

n <- 5
print(n)

## [1] 5

The #<<<>>>EXERCISE<<<>>># placeholders aren’t intended to be the only lines you will enter code. They’ll just guide you to areas you should focus. You may still need to reread the question stem a few times to ensure you tackled everything necessary to reach the correct solution. For example, had you been given

########################
#<<<>>> EXERCISE <<<>>>#
########################

n <- ###ENTERCODEHERE###

you’d still be expected to return the same answer as before, since that is what was asked in the stem.

The code blocks are intended to nearly self contained, but a few may require some results from previous, albeit nearby, code blocks.

Lastly, one important reminder, R indexes starting from 1, not 0!

Warmup

Dataframes

Running problem_dataframe() will create 3 vectors named WEIGHT, DISEASE, and SEX to your environment:

problem_dataframe()

## Added SEX, WEIGHT, and DISEASE to environment.

head(WEIGHT)

## [1] 228.0741166 256.4275163 220.7529986 305.8348281 261.5327720 221.2836066

head(DISEASE)

## [1] 1 1 1 1 1 1

head(SEX)

## [1] "M" "M" "M" "M" "M" "M"

For the DISEASE vector, rename 1s and 0s as ‘Disease’ and ‘Healthy’, respectively. This will require indexing values that are equal to 1 or 0 and renaming them. A clever way would involve using the ifelse() function, but simply using logicals like == will do.

After you perform the renaming, convert DISEASE and SEX to factors using the as.factor() function.

########################
#<<<>>> EXERCISE <<<>>>#
########################

### ENTERCODEHERE ###

DISEASE <- as.factor(DISEASE)
SEX <- as.factor(SEX)

DISEASE should now look like

##  [1] Disease Disease Disease Disease Disease Disease Disease Disease
##  [9] Disease Disease Healthy Healthy Healthy Healthy Healthy Healthy
## [17] Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
## [25] Healthy Disease Disease Disease Disease Disease Healthy Healthy
## [33] Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
## [41] Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
## [49] Healthy Healthy
## Levels: Disease Healthy

Now, create a 3-column dataframe df using the data.frame() function that contains sex, weight, and disease (the order doesn’t matter). Ensure columns are named ‘Sex’, ‘Weight’, and ‘Disease’.

########################
#<<<>>> EXERCISE <<<>>>#
########################

### ENTERCODEHERE ###

The top of your dataframe should look like

##   Sex      Weight Disease
## 1   M 228.0741166 Disease
## 2   M 256.4275163 Disease
## 3   M 220.7529986 Disease
## 4   M 305.8348281 Disease
## 5   M 261.5327720 Disease
## 6   M 221.2836066 Disease

Run the following code.

problem_plot1(DF)

You should see the following figure (note that the points are jittered in the horizontal axis randomly, so your left-to-right point positions will be different):

Lists

Let’s work with the output from a linear regression model. Don’t worry if you’re unfamiliar with regression; the point of this is to get a list of output that we can manipulate. Run the function problem_regression(), which will create the variable OUT with our regression output.

problem_regression()

## Added OUT to environment.

OUT

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##   0.1219017    1.4544515

The variable OUT contains a list of length 12, with each slot containing different types of data – matrices, integers, characters, etc. Use the str() function to see the hierarchy and slot names of the list. You can also see the uppermost slot names using the function names().

From OUT, save the coefficients slot as its own variable called W. Within OUT, there is a list named model that will be of length 2, containing 2 matricies named x and y. Set x in model as the variable X and y in model as the variable Y.

########################
#<<<>>> EXERCISE <<<>>>#
########################

W <- ### ENTERCODEHERE ###
X <- ### ENTERCODEHERE ###
Y <- ### ENTERCODEHERE ###

After making these three variables, place X and Y into a dataframe named DF. Ensure the column names are ‘X’ and ‘Y’. If not, set them as such.

########################
#<<<>>> EXERCISE <<<>>>#
########################

DF <- ### ENTERCODEHERE ###

The top of your dataframe should look like

##               X             Y
## 1 -0.6264538107 -0.5415748357
## 2  0.1836433242 -0.3365614069
## 3 -0.8356286124 -0.9123232272
## 4  1.5952808021  1.2635581071
## 5  0.3295077718  1.9272853594
## 6 -0.8204683841  0.7496973223

Now, place the dataframe DF and the vector W into a new list and name it PLOT. Make sure to name both slots of the list. You should therefore have a list, named PLOT, of length 2, with slots named DF and W, containing a dataframe and a numeric vector, respectively. You can check this using the str() function.

########################
#<<<>>> EXERCISE <<<>>>#
########################

PLOT <- ### ENTERCODEHERE ###

If you run str(DF), you should see

## List of 2
##  $ DF:'data.frame':  50 obs. of  2 variables:
##   ..$ X: num [1:50] -0.626 0.184 -0.836 1.595 0.33 ...
##   ..$ Y: num [1:50] -0.542 -0.337 -0.912 1.264 1.927 ...
##  $ W : Named num [1:2] 0.122 1.454
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"

Place the list PLOT into the list OUT.

########################
#<<<>>> EXERCISE <<<>>>#
########################

### ENTERCODEHERE ###

Run the following code:

problem_plot2(OUT)

Assuming you were successful building your list, then you should have a nice regression plot:

Matrices

We’re going to start with 2 matrices of data named M1 and M2, representing patient characteristics. You can set these to your environment by running the following command:

problem_matrix()

## Added M1 and M2 to environment.

head(M1)

##                W           H  A   S            G
## [1,] 184.3386547 165.0246537 25 124  73.49164558
## [2,] 204.5910831 155.6997059 25 146 119.03959440
## [3,] 179.1092847 187.5773067 61 123 117.20008771
## [4,] 239.8820201 177.6734291 41 151 121.21580613
## [5,] 208.2376943 172.4973506 58 151  92.98832074
## [6,] 179.4882904 179.2677585 26 127  97.38468889

head(M2)

##                W           H  A   S            G
## [1,] 152.0244763 172.3911492 75 125  93.95555101
## [2,] 151.6789777 168.8660873 78 121  77.55822229
## [3,] 121.9383355 177.9639717 32 112  68.07547839
## [4,] 138.1609738 156.9644164 26 129 115.25985269
## [5,] 136.0409347 148.9737433 22 127  66.83513569
## [6,] 110.0602672 174.3325097 62 134  87.13669312

Merge them by row into a matrix named M. The dimensions should be \(2000 \times 5\), which you can check using dim().

########################
#<<<>>> EXERCISE <<<>>>#
########################

M <- ### ENTERCODEHERE ###

If you run dim(M), you should see

## [1] 2000    5

There are two NA values in the matrix, which will affect our ability to do our analysis. We need to replace them with their corresponding mean for that column. Run the following command to get the columns that have NA values; there will be 2.

col_idx <- which(apply(is.na(M),2,any))

For each column index, index the column in M, which will give you a vector. Calculate the mean of the vector using the mean(x, na.rm=TRUE) function, where x is the vector for which the mean is calculated. Save each mean as a variable. You should have two new variables, each containing 1 value, a mean for its respective column.

########################
#<<<>>> EXERCISE <<<>>>#
########################

vec1 <- ### ENTERCODEHERE ###
vec2 <- ### ENTERCODEHERE ###

mui <- mean(vec1,na.rm=TRUE)
muj <- mean(vec2,na.rm=TRUE)

The two means should be

## [1] 167.216274

## [1] 94.8007506

We know which columns have the NA values, so, similar to above, index one of the columns as a vector, but instead of calculating the mean, we’ll first find which rows have NA values. Then, we can use the function which() and save the output as a new variable. Then, we can repeat this for the other column that had an NA value.

########################
#<<<>>> EXERCISE <<<>>>#
########################

vec1 <- ### ENTERCODEHERE ###
vec2 <- ### ENTERCODEHERE ###
  
vec_na1 <- is.na(vec1)
vec_na2 <- is.na(vec2)

rowi <- ### ENTERCODEHERE ###
rowj <- ### ENTERCODEHERE ###

The rows indexes should be

## [1] 152

## [1] 589

Save these two row indexes as a vector named MISSING.

########################
#<<<>>> EXERCISE <<<>>>#
########################

MISSING <- ### ENTERCODEHERE ###

Now that we know which row and column these NA values are found, we can replace them with the their corresponding column means we just calculated.

########################
#<<<>>> EXERCISE <<<>>>#
########################

### ENTERCODEHERE ### <- mui
### ENTERCODEHERE ### <- muj

Finally, run the following code.

problem_plot3(M,MISSING)

If you are correct, then you should see the results of a PCA analysis. Assuming red crosses are seen within black triangles, then you correctly identified and replaced the NA values.

For loop

We’re going to use functions like seq_along() and seq_len() for generating the vector of indexes to iterate over. This approach should be helpful since it’ll be similar to xrange and range in Python. If you have a vector x of length 10, then seq_along(x) would result in a vector of integers ranging from 1 and 10 (recall that R indexes from 1 and not 0). Now, seq_len() expects an integer representing the maximum value in the range you want to iterate over. Therefore, to get the same vector from 1 to 10, you can simply enter seq_len(length(x)), which is essentially doing seq_len(10).

Problem 1

We’re going to make a loop that calculates the cumulative sum from i to j. For example, if we wanted to calculate the cumulative sum from 1 to 5, this would amount to \(1 + 2 + 3 + 4 + 5 = 15\). Let’s also have the loop print the current total after each iteration, specifically, each iteration j will print ‘Current total on iteration j: value’ on a new line. To do this, create a loop to calculate the sum and use cat() to print the message.

Note, to print text on a new line, use ‘\n’, like

cat('Some text for line:',1,'\n',
    'Some text for line:',2,'\n',
    sep='')

## Some text for line:1
## Some text for line:2

Now the loop, which we’ll iterate from 5 to 25:

i <- 5
j <- 25

iter <- 1
total <- 0
for (n in i:j){
  
  total <- ### ENTERCODEHERE ###
  cat('Current total for iteration',iter,':',total,'\n')
  iter <- ### ENTERCODEHERE ###
  
}

Your results should be the following:

## Current total for iteration 1 : 5 
## Current total for iteration 2 : 11 
## Current total for iteration 3 : 18 
## Current total for iteration 4 : 26 
## Current total for iteration 5 : 35 
## Current total for iteration 6 : 45 
## Current total for iteration 7 : 56 
## Current total for iteration 8 : 68 
## Current total for iteration 9 : 81 
## Current total for iteration 10 : 95 
## Current total for iteration 11 : 110 
## Current total for iteration 12 : 126 
## Current total for iteration 13 : 143 
## Current total for iteration 14 : 161 
## Current total for iteration 15 : 180 
## Current total for iteration 16 : 200 
## Current total for iteration 17 : 221 
## Current total for iteration 18 : 243 
## Current total for iteration 19 : 266 
## Current total for iteration 20 : 290 
## Current total for iteration 21 : 315

Problem 2

You’re given a \(25 \times 3\) matrix M. The first row of the matrix contains the values 1, 3, and 5. To place this matrix in your environment, run problem_forloop2():

problem_forloop2()

## Added M to environment.

head(M)

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    0    0    0
## [3,]    0    0    0
## [4,]    0    0    0
## [5,]    0    0    0
## [6,]    0    0    0

Let’s make a loop that does the following: starting at row 2, for each row, each element will contain the sum of the entire previous row minus the current row number and minus the current column number. Row two would therefore look like:

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    6    5    4
## [3,]    0    0    0
## [4,]    0    0    0
## [5,]    0    0    0
## [6,]    0    0    0

This will require one loop (to iterate over columns) nested within another loop (to iterate over rows). The sum function is simply sum().

########################
#<<<>>> EXERCISE <<<>>>#
########################

for (i in 2:nrow(M)){
  for (j in 1:ncol(M)){
    
    ### ENTERCODEHERE ###
    
  }
}

Your final row (25) should have the following:

## [1] 211822152376 211822152375 211822152374

Problem 3

Let’s again start with a \(25 \times 3\) matrix M of all zeros except element \(\text{M}(1,1)\), which will contain the number 1. Run problem_forloop3() to place M in your environment:

problem_forloop3()

## Added M to environment.

head(M)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    0    0
## [3,]    0    0    0
## [4,]    0    0    0
## [5,]    0    0    0
## [6,]    0    0    0

Starting at row 2:

Element 1 in the current row will equal to the sum of elements 2 and 3 from the previous row
Element 2 in the current row will equal element 1 from the previous row
Element 3 in the current row will equal element 2 from the previous row

For example, row 2 would look like:

MM <- M
MM[2,] <- c(0,1,0)
head(MM)

And then row 3 would become:

MM[3,] <- c(1,0,1)
head(MM)

Unlike the previous problem, this will require a single loop.

########################
#<<<>>> EXERCISE <<<>>>#
########################

for (i in 2:nrow(M)){
    
  ### ENTERCODEHERE ###

}

Your final row (25) should have the following:

## [1] 351 265 200

If else

Problem 1

Run problem_ifelse1() to get a very long vector of characters L in your environment.

problem_ifelse1()

## Added L to environment.

head(L)

## [1] "G" "J" "O" "X" "F" "X"

The goal here is to find the number of times the sequence of letters spells ‘BIO’, the number of time it spells ‘BIT’, and the number of times it has the sequence ‘BI’ followed by any letter other than ‘O’ or ‘T’.

Let’s iterate through the vector and do the following:

Create 3 variables, set to 0, to record the totals for the occurance of 'BIO', 'BIT',' and 'BIx' 
  (x being a placeholder).

For each position in the vector 

  If the letter is 'B', check the next letter

    If the letter is 'I', check the next letter

      If the letter is 'O', add 1 to your bio counter
      Elif the letter is 'T', add 1 to your bit counter
      Else add 1 to your biX counter.

########################
#<<<>>> EXERCISE <<<>>>#
########################

bio <- 0
bit <- 0
bix <- 0
for (i in seq_along(L)){

  ### ENTERCODEHERE ###
  
}
cat('bio:',bio,'\nbit:',bit,'\nbix:',bix)

The total of the three values should equal

## [1] 141

Problem 2

Recall problem 1 in the For Loop section. There we calculated a running total as we iterated through a sequence of numbers. Here, we’ll make the following change: if the number n being added to the total is even, add n/2; if the number is odd, add n.

For example, for the sequence 1 through 5, before the answer would be \(1 + 2 + 3 + 4 + 5 = 15\). Now, our answer should look like \(1 + 2/2 + 3 + 4/2 + 5 = 12\). Print the update message as before.

Note, use the function is_even(), which will return TRUE for even values and FALSE otherwise. We’ll iterate over values 20 through 41.

########################
#<<<>>> EXERCISE <<<>>>#
########################

a <- 20
b <- 41

iter <- 1
total <- 0
for (### ENTERCODEHERE ###){
  
  ### ENTERCODEHERE ###
  
  cat('Current total for iteration',iter,':',total,'\n')
  
  iter <- ### ENTERCODEHERE ###
}

Your final line should be

## Current total for iteration 22 : 506

Problem 3

Let’s make a loop that checks if it’s raining outside for each iteration. We’ll assume an iteration is a new day. If it is raining, we’ll print that it is raining and the day number. Use the function is_raining() to see if it’s raining. For example

for (i in seq_len(30)){
   rain <- is_raining()
   if (rain) cat("It's raining on day ",i,"!\n",sep='')
}

## It's raining on day 1!
## It's raining on day 2!
## It's raining on day 5!
## It's raining on day 6!
## It's raining on day 7!
## It's raining on day 8!
## It's raining on day 10!
## It's raining on day 16!
## It's raining on day 17!
## It's raining on day 22!
## It's raining on day 23!
## It's raining on day 26!
## It's raining on day 27!
## It's raining on day 29!

Now, let’s assume we begin in town 0. On a given day, we check if it’s raining and move to the next town (e.g., town 1) only if it is not raining. Run your loop for 60 iterations and report the town you wound up in (e.g., town 12).

########################
#<<<>>> EXERCISE <<<>>>#
########################

town <- 0
for (i in seq_len(60)){

  ### ENTERCODEHERE ###
  
}
town

Now, let’s only move to the next town if it’s not raining on the current day and didn’t rain on the previous day (in the previous town). For your first iteration, assume that it didn’t rain the day before. Run your loop for 60 iterations

########################
#<<<>>> EXERCISE <<<>>>#
########################

town <- 0
rain_prev <- FALSE
for (i in seq_len(60)){
  
   ### ENTERCODEHERE ###
    
   rain_prev <- ### ENTERCODEHERE ###
     
}
town

Finally, we’ll again start at town 0.

If it didn’t rain today but rained yesterday, we’ll head north (so add 1 to your town total);
If it rained today but didn’t rain yesterday, we’ll head south (so subtract 1 from your town total).
If it didn’t rain today and didn’t rain yesterday, we’ll speed up and head north (so add 2 to your town total);
If it rained today and rained yesterday, we speed up a lot but head south (so subtract 4 from your total);

Again assume that, for your first iteration, it didn’t rain the day before. Run your loop for 1000 iterations, but record your position for each iteration.

########################
#<<<>>> EXERCISE <<<>>>#
########################

town <- 0
rain_prev <- FALSE
position <- vector(length=1000)
for (i in seq_len(1000)){
   
   ### ENTERCODEHERE ###
   
   position[i] <- ### ENTERCODEHERE ###
   
   rain_prev <- ### ENTERCODEHERE ###
   
}
position <- c(0,position)

Run the following command to plot your results:

problem_plot4(position)

An example figure (that will look different than yours) is shown below:

Bioinformatics Problems

Base counting

Given a DNA sequence, count the number of each nucleotide (A, C, G, T). Then, calculate the GC ratio (the proportion of GC content the makes up the sequence).

Data

To place the DNA sequence named s that we’ll be working with in your environment, run the following command:

generate_dna_sequence_s()

## Added s to environment.

## [1] "CATGAGCTAGTCACCTCAGTGAAATGGGAACCAGATGGCAGATCGGAAACTATCGCCCTAGGCACTACGGATGTAGCAATGATCGTTGCGAATCTATGCCGGGTGGTAGGGGTCGAGTCTCTAAATTGATTATGGTACGCTGTGGCTGTCAATACGGTCAAGCACATCAGATTTCAGGGTCCCCCCGAGCGGGAATCGCAACTGTATGCGTGTCGTTATCGCATCGGCACTACCGCGGACTGCGACTAATATATTTTCTTACCGTTAAACTCAGAGCACCTATTTAGCCGGCTTGGTGTCCCTGTATCATACGTACCCCCGTTACCTTATTAGTTGTGGCTAAAATGACTCTTGTGGGGTCGAACTTTAAAGCTTTGGCGGACGGGCGTGCCGCAAAGGTCAATAAGGTAAATGAATGGATACTCTTTGCCTATTCCTTTGTGAACAGTCAGGATTTTATCTCGGGGAACTTTACCTGTTGGCACCGGGCAGAGACAAACGGACGATGATAAATACGTACTACGCTGCGTAAAGCAGGTACATTCCACGATGGTGTATCATGCGAGAGCGCGACTTCCTGGCTCAAACAATCAGGCGCGTGCCGTAAGCCTCTACCCAACGAACCGTGCTGACTGCTTTATCTAAGTCTTCTCTGGCTCACCACGCCCCGTGTCGTATAACTCCTGACAAATCCTGGACGATCAAGCATGAGGTTCAAAGTAAGATACTACCTATCTTGCTCGAATCCGGCCAATTAATGTTAGAAAACGCCCACTGAAACAGCCGGTCCAGTCCCCAACTAGGCGCCGCACACGTTCGCGGTCACAATCCATGGAACGCGTAGTACTGAGGCCGAAACAGACGCTGGGGGGGTCGCTGTACTAGGGAAAACGACTGAAGGATACACGGATGATCTAGCAACGATGTAAAA"

Programmatically 1

Here we won’t exploit the functions in R very much; instead, we’ll use a loop and create a running total. This is not something I’d recommend (it’s fine – just not very direct), but it will help you with writing loops, which is critical to programming. This method can most easily be transliterated to other programming languages.

We’re going to make use of a function called subseq(), which extracts a subsequence from a larger sequence. This is analogous to the function substr(), but subseq() is a Bioconductor function that works well for sequence sets, which we’ll see soon, and also has additional arguments ideal for capturing sequences from different directions. The function takes two arguments, so for subseq(s,i,j), it will capture a subsequence from s that spans position i through j. For example:

sequence <- 'bioinformatics'
subseq(sequence,start=1,end=3)

## [1] "bio"

subseq(sequence,start=4,end=7)

## [1] "info"

subseq(sequence,start=4)

## [1] "informatics"

subseq(sequence,end=-6)

## [1] "bioinform"

########################
#<<<>>> EXERCISE <<<>>>#
########################

nt_A <- 0
nt_C <- 0
nt_G <- 0
nt_T <- 0
s_len <- nchar(s) 
for (i in seq_len(s_len)){
  
  base <- ### ENTERCODEHERE ###
  
  ### ENTERCODEHERE ###
  
}

nt_total <- nt_A + nt_C + nt_G + nt_T
gc <- nt_C + nt_G
gc/nt_total

You should get

## [1] 0.5016111708

Programmatically 2

Run the following command again:

generate_dna_sequence_s()

## Added s to environment.

We’ll split the string into a vector where each element is a letter in the original string.

ss <- strsplit(s,split='') 
class(ss)

## [1] "list"

It returns a list which will force us to index the first element in the list to access the vector of letters (via ss[[1]]) or, alternatively, we can simply unlist the list:

ss <- unlist(ss)

Now, to quickly tabulate the number of times a single letter occurs, we can use the table function:

nt_counts <- table(ss)

And now we’ll count the GC content by indexing the table

########################
#<<<>>> EXERCISE <<<>>>#
########################

gc <- ### ENTERCODEHERE ###
  
nt_total <- sum(nt_counts)
gc/nt_total

You should get the following fraction of GC content

##            G 
## 0.5016111708

Bioconductor

First thing I want to stress is that I don’t recommend memorizing Bioconductor functions. Just realize that, if you have a bioinformatics problem, and need to write a little bit of code, and you are particularly worried about speed, Bioconductor exists. It’s quite popular, so simply googling a few keywords and ‘Bioconductor’ should point you in the right direction. Get acclimated with using functions and reading R documentation, and then you should be OK with using Bioconductor.

generate_dna_sequence_s()

## Added s to environment.

Bioconductor likes to work with its own classes, so the first thing we have to do is convert our DNA sequence into a Bioconductor DNAString class.

ss <- DNAString(s)
class(ss)

## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

Now, let’s count the bases.

alphabetFrequency(ss)

##   A   C   G   T   M   R   W   S   Y   K   V   H   D   B   N   -   +   . 
## 242 232 235 222   0   0   0   0   0   0   0   0   0   0   0   0   0   0

We can remove those other bases by setting baseOnly to TRUE.

alphabetFrequency(ss,baseOnly=TRUE)

##     A     C     G     T other 
##   242   232   235   222     0

And the GC content:

letterFrequency(ss, letters="GC", as.prob=TRUE)

##          G|C 
## 0.5016111708

DNA to RNA

Let’s convert that same DNA sequence above to an RNA sequence. This simply involves replacing the Ts with Us.

Programmatically 1

generate_dna_sequence_s()

## Added s to environment.

Again, we’ll take the long way with this one, using a loop.

########################
#<<<>>> EXERCISE <<<>>>#
########################

ss <- unlist(strsplit(s,split=''))

for (i in seq_along(ss)){
  
  ### ENTERCODEHERE ###

}

ss <- paste0(ss,collapse='')
ss

You should obtain:

## [1] "CAUGAGCUAGUCACCUCAGUGAAAUGGGAACCAGAUGGCAGAUCGGAAACUAUCGCCCUAGGCACUACGGAUGUAGCAAUGAUCGUUGCGAAUCUAUGCCGGGUGGUAGGGGUCGAGUCUCUAAAUUGAUUAUGGUACGCUGUGGCUGUCAAUACGGUCAAGCACAUCAGAUUUCAGGGUCCCCCCGAGCGGGAAUCGCAACUGUAUGCGUGUCGUUAUCGCAUCGGCACUACCGCGGACUGCGACUAAUAUAUUUUCUUACCGUUAAACUCAGAGCACCUAUUUAGCCGGCUUGGUGUCCCUGUAUCAUACGUACCCCCGUUACCUUAUUAGUUGUGGCUAAAAUGACUCUUGUGGGGUCGAACUUUAAAGCUUUGGCGGACGGGCGUGCCGCAAAGGUCAAUAAGGUAAAUGAAUGGAUACUCUUUGCCUAUUCCUUUGUGAACAGUCAGGAUUUUAUCUCGGGGAACUUUACCUGUUGGCACCGGGCAGAGACAAACGGACGAUGAUAAAUACGUACUACGCUGCGUAAAGCAGGUACAUUCCACGAUGGUGUAUCAUGCGAGAGCGCGACUUCCUGGCUCAAACAAUCAGGCGCGUGCCGUAAGCCUCUACCCAACGAACCGUGCUGACUGCUUUAUCUAAGUCUUCUCUGGCUCACCACGCCCCGUGUCGUAUAACUCCUGACAAAUCCUGGACGAUCAAGCAUGAGGUUCAAAGUAAGAUACUACCUAUCUUGCUCGAAUCCGGCCAAUUAAUGUUAGAAAACGCCCACUGAAACAGCCGGUCCAGUCCCCAACUAGGCGCCGCACACGUUCGCGGUCACAAUCCAUGGAACGCGUAGUACUGAGGCCGAAACAGACGCUGGGGGGGUCGCUGUACUAGGGAAAACGACUGAAGGAUACACGGAUGAUCUAGCAACGAUGUAAAA"

Programmatically 2

generate_dna_sequence_s()

## Added s to environment.

We can alternatively use regular expressions to make things very simple:

gsub(s,pattern='T',replacement='U')

## [1] "CAUGAGCUAGUCACCUCAGUGAAAUGGGAACCAGAUGGCAGAUCGGAAACUAUCGCCCUAGGCACUACGGAUGUAGCAAUGAUCGUUGCGAAUCUAUGCCGGGUGGUAGGGGUCGAGUCUCUAAAUUGAUUAUGGUACGCUGUGGCUGUCAAUACGGUCAAGCACAUCAGAUUUCAGGGUCCCCCCGAGCGGGAAUCGCAACUGUAUGCGUGUCGUUAUCGCAUCGGCACUACCGCGGACUGCGACUAAUAUAUUUUCUUACCGUUAAACUCAGAGCACCUAUUUAGCCGGCUUGGUGUCCCUGUAUCAUACGUACCCCCGUUACCUUAUUAGUUGUGGCUAAAAUGACUCUUGUGGGGUCGAACUUUAAAGCUUUGGCGGACGGGCGUGCCGCAAAGGUCAAUAAGGUAAAUGAAUGGAUACUCUUUGCCUAUUCCUUUGUGAACAGUCAGGAUUUUAUCUCGGGGAACUUUACCUGUUGGCACCGGGCAGAGACAAACGGACGAUGAUAAAUACGUACUACGCUGCGUAAAGCAGGUACAUUCCACGAUGGUGUAUCAUGCGAGAGCGCGACUUCCUGGCUCAAACAAUCAGGCGCGUGCCGUAAGCCUCUACCCAACGAACCGUGCUGACUGCUUUAUCUAAGUCUUCUCUGGCUCACCACGCCCCGUGUCGUAUAACUCCUGACAAAUCCUGGACGAUCAAGCAUGAGGUUCAAAGUAAGAUACUACCUAUCUUGCUCGAAUCCGGCCAAUUAAUGUUAGAAAACGCCCACUGAAACAGCCGGUCCAGUCCCCAACUAGGCGCCGCACACGUUCGCGGUCACAAUCCAUGGAACGCGUAGUACUGAGGCCGAAACAGACGCUGGGGGGGUCGCUGUACUAGGGAAAACGACUGAAGGAUACACGGAUGAUCUAGCAACGAUGUAAAA"

Bioconductor

ss <- DNAString(s)
ss <- RNAString(ss)
ss

##   931-letter "RNAString" instance
## seq: CAUGAGCUAGUCACCUCAGUGAAAUGGGAACCAG...AGGAUACACGGAUGAUCUAGCAACGAUGUAAAA

Reverse Complement

Let’s now assume that a DNA sequence is about to be replicated (i.e., copied). This occurs in such a way that an A is copied as a T on the replicated sequence; Cs, Gs, and Ts are copied as Gs, Cs, and As, respectively. Also, what was the head of the original DNA strand (5’) is now the the tail of the replicated strand; hence, we need to reverse the replicated sequence to get the correct reverse complement.

For example AAAACCCGGT becomes ACCGGGTTTT, and more explicitly, 5’-AAAACCCGGT-3’ becomes 5’-ACCGGGTTTT-3’.

generate_dna_sequence_s()

## Added s to environment.

Let’s get the reverse complement of the DNA sequence we used above. Note that in the example above, you may have used an if statement to check if that base is T, and if it was, you changed it to U. It may be enticing to start by doing the following:

ss <- unlist(strsplit(s,split=''))

for (i in seq_along(ss)){
  if (ss[i] == 'A') {ss[i] <- 'T'}
  if (ss[i] == 'C') {ss[i] <- 'G'}
  if (ss[i] == 'G') {ss[i] <- 'C'}
  if (ss[i] == 'T') {ss[i] <- 'A'}
}

But this would be wrong. Why?

Programmatically

Note, to reverse a vector, use rev().

########################
#<<<>>> EXERCISE <<<>>>#
########################

ss <- unlist(strsplit(s,split=''))

for (i in seq_along(ss)){

  ### ENTERCODEHERE ###
  
}

ss <- rev(ss) 
ss <- paste0(ss,collapse='')
ss

You should obtain the following:

## [1] "TTTTACATCGTTGCTAGATCATCCGTGTATCCTTCAGTCGTTTTCCCTAGTACAGCGACCCCCCCAGCGTCTGTTTCGGCCTCAGTACTACGCGTTCCATGGATTGTGACCGCGAACGTGTGCGGCGCCTAGTTGGGGACTGGACCGGCTGTTTCAGTGGGCGTTTTCTAACATTAATTGGCCGGATTCGAGCAAGATAGGTAGTATCTTACTTTGAACCTCATGCTTGATCGTCCAGGATTTGTCAGGAGTTATACGACACGGGGCGTGGTGAGCCAGAGAAGACTTAGATAAAGCAGTCAGCACGGTTCGTTGGGTAGAGGCTTACGGCACGCGCCTGATTGTTTGAGCCAGGAAGTCGCGCTCTCGCATGATACACCATCGTGGAATGTACCTGCTTTACGCAGCGTAGTACGTATTTATCATCGTCCGTTTGTCTCTGCCCGGTGCCAACAGGTAAAGTTCCCCGAGATAAAATCCTGACTGTTCACAAAGGAATAGGCAAAGAGTATCCATTCATTTACCTTATTGACCTTTGCGGCACGCCCGTCCGCCAAAGCTTTAAAGTTCGACCCCACAAGAGTCATTTTAGCCACAACTAATAAGGTAACGGGGGTACGTATGATACAGGGACACCAAGCCGGCTAAATAGGTGCTCTGAGTTTAACGGTAAGAAAATATATTAGTCGCAGTCCGCGGTAGTGCCGATGCGATAACGACACGCATACAGTTGCGATTCCCGCTCGGGGGGACCCTGAAATCTGATGTGCTTGACCGTATTGACAGCCACAGCGTACCATAATCAATTTAGAGACTCGACCCCTACCACCCGGCATAGATTCGCAACGATCATTGCTACATCCGTAGTGCCTAGGGCGATAGTTTCCGATCTGCCATCTGGTTCCCATTTCACTGAGGTGACTAGCTCATG"

Bioconductor

generate_dna_sequence_s()

## Added s to environment.

ss <- DNAString(s)
ss <- reverseComplement(ss)
ss

##   931-letter "DNAString" instance
## seq: TTTTACATCGTTGCTAGATCATCCGTGTATCCTT...TGGTTCCCATTTCACTGAGGTGACTAGCTCATG

FASTA Files

Loading

A FASTA file is a file containing multiple nucleotide or amino acid sequences, each with their own identifier, formatted as a header that starts with ‘>’. A file essentially looks like

>Sequence_1
GGCGAT
>Sequence_2
AAATCG

and so on. The structure of the content of the file is important, not necessarily the file extension. You can have a FASTA file with a .txt extension, no extension, or the common .fna extension. The trick is to know how these files are formatted to identify them. (Note that wikipedia tends to have the best information on bioinformatics file types, quality scoring, etc.)

The other file type worth noting is FASTQ, which, in addition to sequence information, also contains a quality score for each position in the sequence that measures how likely that nucleotide or protein call is correct. FASTQ files look somewhat different than FASTA:

@Sequence_1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@Sequence_2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
9C;=;=<9@4868>9:67AA<9>65<=>59-+*''))**55CCFMNO>>>>>>FFFFC65

For a given sequence, line 1 again has the header, but unlike the FASTA file, FASTQ headers begin with '@'. Line 2 is the actual sequence, followed by ‘+’ on line 3. The quality score is then found on line 4 and will be the same length as the sequence on line 1.

Let’s load a FASTA file. We’ll use a simple Bioconductor function. I could show you how to parse these files manually, but I don’t think it would be very rewarding, so we’ll skip that.

fasta <- readDNAStringSet('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/f1fb586160d12c34f29532c731066fd8912a0e0c/example.fasta',format='fasta')
fasta

##   A DNAStringSet instance of length 8
##     width seq                                          names               
## [1]   893 TGGTAGAACGTGTGGGCTCGA...CTAGTTAGCGAGGTCCATAA Sequence_9715
## [2]   860 TACTGCTTGTACAAGCCTCAT...TAGACGTTGTACCGGGGGAA Sequence_5667
## [3]   815 TATGTGTTTCATTTAGGACCT...CCATAAAACGGTCAAGCAGT Sequence_2989
## [4]   912 TGGTGCCGAGGCTCGAGTGTA...ATGGGAAATTCAACAAACAC Sequence_2049
## [5]   806 CTAGCAATGGCAAATTAGATG...TGAACAGAAAATCAACCGGA Sequence_5456
## [6]   834 AAACGAGAACGTGGAGATTTG...TCGGGAGTACTAACTGATTT Sequence_1118
## [7]   818 CAGAAAGCATGAGTCTCGCCC...TTACGCCTATTTTCCCCAGT Sequence_7043
## [8]   878 CGCCGCACTATCCACGTTAAA...AAGTAGTACCTAACAGAACA Sequence_0123

For FASTQ, it’s essentially the same except we change ‘fasta’ to ‘fastq’ for the format argument:

fastq <- readDNAStringSet('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/f1fb586160d12c34f29532c731066fd8912a0e0c/example.fastq',format='fastq')
fastq

##   A DNAStringSet instance of length 250
##       width seq                                        names               
##   [1]    31 TTTCCGGGGCACATAATCTTCAGCCGGGCGC            Sequence_2:UMI_AT...
##   [2]    31 TATCCTTGCAATACTCTCCGAACGGGAGAGC            Sequence_8:UMI_CT...
##   [3]    31 GCAGTTTAAGATCATTTTATTGAAGAGCAAG            Sequence_12:UMI_G...
##   [4]    31 GGCATTGCAAAATTTATTACACCCCCAGATC            Sequence_21:UMI_A...
##   [5]    31 CCCCCTTAAATAGCTGTTTATTTGGCCCCAG            Sequence_29:UMI_G...
##   ...   ... ...
## [246]    31 GCTGTAGGAACAGCAGTCTTGGTGGTTAGCA            Sequence_819:UMI_...
## [247]    31 CCATTATAATAGCCATCTTTATTTGTAAAAA            Sequence_823:UMI_...
## [248]    31 AGCTTTGCAACCATACTCCCCCCGGAACCCA            Sequence_824:UMI_...
## [249]    31 GCCCCCCCCCAAATCGGAAAAACACACCCCC            Sequence_828:UMI_...
## [250]    31 AGGGTGGGGGATCACATTTATTGTATTGAGG            Sequence_834:UMI_...

Creating Sequence Sets

Let’s approach a FASTA problem from a different direction now. We’ll create a DNA string set from a bunch of individual sequences, then write the set to a FASTA file.

Run the following command to add three new variables to your environment: s1, s2, and s3. Each represents a different DNA sequence or ‘read.’

problem_createsequencesets()

## Added s1, s2, and s3 to environment.

s1

## [1] "GCTGTAGGAACAGCAGTCTTGGTGGTTAGCA"

s2

## [1] "GGCATTGCAAAATTTATTACACCCCCAGATC"

s3

## [1] "TATCCTTGCAATACTCTCCGAACGGGAGAGC"

We’re going to create a DNAStringSet object, which can then be saved as a FASTA file. First, we have to combine the sequences into a vector and then pass this vector into DNAStringSet().

########################
#<<<>>> EXERCISE <<<>>>#
########################

S <- ### ENTERCODEHERE ###
  
SS <- DNAStringSet(S)
SS

##   A DNAStringSet instance of length 3
##     width seq
## [1]    31 GCTGTAGGAACAGCAGTCTTGGTGGTTAGCA
## [2]    31 GGCATTGCAAAATTTATTACACCCCCAGATC
## [3]    31 TATCCTTGCAATACTCTCCGAACGGGAGAGC

Recall that FASTA files have header names. Let’s create header names for these three sequences. We can manually do it like so

c('sequence_1','sequence_2','sequence_3')

## [1] "sequence_1" "sequence_2" "sequence_3"

but this will be far from ideal if we had, say, 100,000 sequences. Instead, we’re going to use a function called paste(), which basically pastes together vectors of text, element-wise:

DOG <- c('dog1','dog2','dog3')
CAT <- c('cat1','cat2','cat3')

paste(DOG,CAT)

## [1] "dog1 cat1" "dog2 cat2" "dog3 cat3"

paste(DOG,CAT,sep='-')

## [1] "dog1-cat1" "dog2-cat2" "dog3-cat3"

paste(DOG,CAT,sep='_')

## [1] "dog1_cat1" "dog2_cat2" "dog3_cat3"

paste(DOG,CAT,sep='')

## [1] "dog1cat1" "dog2cat2" "dog3cat3"

We can clean this up a little by simply passing a vector spanning the numbers 1 through 3, along with ‘dog’ and ‘cat’:

paste('dog','cat',1:3,sep='')

## [1] "dogcat1" "dogcat2" "dogcat3"

paste('dog','cat',1:3,sep='_')

## [1] "dog_cat_1" "dog_cat_2" "dog_cat_3"

paste('dog_','cat',1:3,sep='')

## [1] "dog_cat1" "dog_cat2" "dog_cat3"

This is how we’ll create our header names. We can grab the number of total sequences in our set using length(), which will let us create a vector to number our sequences. We’ll create a header name that includes each sequence number, the word sequence, along with a user name. We’ll also pass in the date using the date() function.

seq_names <- paste('sequence_',1:length(SS),' | User_12 | ',date(), sep='')
seq_names

## [1] "sequence_1 | User_12 | Wed Sep 28 11:32:16 2016"
## [2] "sequence_2 | User_12 | Wed Sep 28 11:32:16 2016"
## [3] "sequence_3 | User_12 | Wed Sep 28 11:32:16 2016"

Now, we’ll rename the sequences in the set with these names:

########################
#<<<>>> EXERCISE <<<>>>#
########################

### ENTERCODEHERE ###

You should now have:

##   A DNAStringSet instance of length 3
##     width seq                                          names               
## [1]    31 GCTGTAGGAACAGCAGTCTTGGTGGTTAGCA              sequence_1 | User...
## [2]    31 GGCATTGCAAAATTTATTACACCCCCAGATC              sequence_2 | User...
## [3]    31 TATCCTTGCAATACTCTCCGAACGGGAGAGC              sequence_3 | User...

Finally, we can save our sequence set as a FASTA file:

output_name <- 'seq_set_out.fasta'
writeXStringSet(SS,file=output_name,format="fasta")

Sample Metadata

Often, the sequences we’re working with have corresponding metadata. These metadata can range from information about the specific sequence (e.g., the type of sequencer used) to information about the organism from which the sequence was acquired (e.g., species, treatment, age). The way in which we can link our sequence reads in the FASTA file to the metadata is via the header name.

Load the following sequence set:

FASTA <- readDNAStringSet('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/10bc2f50d1c739827ea2ba4edb146b36a6a4c14a/problems_metadata.fasta',format='fasta')
FASTA

##   A DNAStringSet instance of length 100
##       width seq                                        names               
##   [1]  1000 AAAGACAGTGTTACAGTCGG...CCGGTAGTTCTAATCACGC Sequence_0382
##   [2]  1000 TGGAACTAGTTAGCCGGGTC...TTTCAAGTATCTCTTAGTT Sequence_1764
##   [3]  1000 TAAACATCTTATTTTGAATG...GATACTATAGATGGTATAC Sequence_4041
##   [4]  1000 GGAAAGAGCGCCGGATAGAG...TTTGACGTAGCACGGATCG Sequence_2012
##   [5]  1000 TACCTGTCAACAATTCAGCT...TTGTCGAACTTGGAATGAC Sequence_7937
##   ...   ... ...
##  [96]  1000 AGATAACCCGCTGTGTCCGA...GGTGTTGACTGTAGAACGA Sequence_2955
##  [97]  1000 CGAGGCTATACTTGGGAACG...GATAAATCTATTTGAGACA Sequence_6164
##  [98]  1000 TTTCTCTGGCAATAGCCGGG...ATCAGCCGTTGAACGTCAA Sequence_3106
##  [99]  1000 GTAAGTCGACACTGTCTCAT...CGAATGCGACTGAGTGCGC Sequence_2175
## [100]  1000 GCCCTCAGGAGGTTAGTTTT...CCACAACACTATGGCGCAC Sequence_1629

To place the metadata, named META, into your environement, run the following:

problem_metadata(FASTA)

## Added META to environment.

head(META)

##              ID Organism          Genus  Center
## 1 Sequence_0382 Bacteria    Entercoccus  Dallas
## 2 Sequence_1764 Bacteria    Entercoccus Phoenix
## 3 Sequence_4041 Bacteria Staphylococcus  Dallas
## 4 Sequence_2012 Bacteria    Haemophilus  Dallas
## 5 Sequence_7937 Bacteria    Escherichia  Austin
## 6 Sequence_1674 Bacteria    Haemophilus Houston

We’ll henceforth refer to the rows as ‘samples.’ If we want to know the row indexes of the samples that were sequenced at the Philadelphia sequencing center, we can type

which(META$Center == 'Philadelphia')

## [1]  82  84  85  89  92  93  96  98 100

If we wanted to find the sequence with the header name ‘Rosalind_6333’, we can do

FASTA['Sequence_6333']

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]  1000 TTCGCAGTATCCAGGTACAGG...GACGATGACAGTGGACATGT Sequence_6333

And if we wanted to get the sequences corresponding to rows 12, 15, and 78 in the metadata file:

header_names <- META$ID[c(12,15,78)]
FASTA[header_names]

##   A DNAStringSet instance of length 3
##     width seq                                          names               
## [1]  1000 TTGCAGGGTGGGCATGGTGGT...AGCGATTAACATGTTGGCTA Sequence_1385
## [2]  1000 CACTGAGGCGAATGAATATAA...GAATAGCTACAACAGACACT Sequence_7797
## [3]  1000 ATTGGTTTTGAAGCGACAGCG...AGATCCGGTCCATAGAAATT Sequence_8314

Using this information,

Print the sequences and header names of the samples that were sequenced in Princeton
Print the sequences and header names of the Staphylococcus samples that were sequenced in Austin.

########################
#<<<>>> EXERCISE <<<>>>#
########################

### ENTERCODEHERE ###

##   A DNAStringSet instance of length 9
##     width seq                                          names               
## [1]  1000 TCTAATTTTAGTAGAGCAAAT...TGCCGCTGGCTGAGATGCAG Sequence_6874
## [2]  1000 TGTCGATCCCTGCGTCGTACG...TTCAGTTATAGTCTGTGCTA Sequence_3553
## [3]  1000 GGCGTGTACATGCAGCTTAGA...ACGCGCGGAATCGTACTACC Sequence_6947
## [4]  1000 TTCGCAGTATCCAGGTACAGG...GACGATGACAGTGGACATGT Sequence_6333
## [5]  1000 GTCCCAAAATTTCTTGGCCGA...AATACACCTATCAGAAAGCT Sequence_4497
## [6]  1000 GTTATGGCACTTTTATACCCC...AGAGATCGTCTCCAGGCGGT Sequence_8986
## [7]  1000 ACAAGGATAACATCTTGCAGT...CTCGATCCCAGATGTATTGT Sequence_9775
## [8]  1000 CGAGGCTATACTTGGGAACGC...CGATAAATCTATTTGAGACA Sequence_6164
## [9]  1000 GTAAGTCGACACTGTCTCATC...GCGAATGCGACTGAGTGCGC Sequence_2175

##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]  1000 GTAGGCGCAACCTATTTTGTC...ACTATTAGGCATTGTCGAGG Sequence_3864
## [2]  1000 ATGTTGAATCCGTTCTCGTAC...AGCCGCCGATGCCAACTCAT Sequence_9661

GC Content

The goal here is to determine the DNA sequence with the greatest proportion of GC content. Now that we can load FASTA files, we can calculate the GC content for each sequences in a set of sequences.

First, we’ll load the sequences:

FASTA <- readDNAStringSet('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/9aaa235924c3ba097716da40dbec745f6bd41fb0/problems_gc.fasta')
FASTA

##   A DNAStringSet instance of length 8
##     width seq                                          names               
## [1]   893 TGGTAGAACGTGTGGGCTCGA...CTAGTTAGCGAGGTCCATAA Sequence_9715
## [2]   860 TACTGCTTGTACAAGCCTCAT...TAGACGTTGTACCGGGGGAA Sequence_5667
## [3]   815 TATGTGTTTCATTTAGGACCT...CCATAAAACGGTCAAGCAGT Sequence_2989
## [4]   912 TGGTGCCGAGGCTCGAGTGTA...ATGGGAAATTCAACAAACAC Sequence_2049
## [5]   806 CTAGCAATGGCAAATTAGATG...TGAACAGAAAATCAACCGGA Sequence_5456
## [6]   834 AAACGAGAACGTGGAGATTTG...TCGGGAGTACTAACTGATTT Sequence_1118
## [7]   818 CAGAAAGCATGAGTCTCGCCC...TTACGCCTATTTTCCCCAGT Sequence_7043
## [8]   878 CGCCGCACTATCCACGTTAAA...AAGTAGTACCTAACAGAACA Sequence_0123

Now let’s first tackle this a more programmatic way, first making our own function to calculate the GC content, and then determining which is greatest.

Programmatically

Let’s make that function to calculate the GC content for a given sequence:

########################
#<<<>>> EXERCISE <<<>>>#
########################

gc_content <- function(s){
  
  ss <- unlist(strsplit(as.character(s),''))
  gc <- 0
  
  for (i in seq_along(ss)){
    
    ### ENTERCODEHERE ###
  
  }
  
  ### ENTERCODEHERE ###

}

Now note that we can treat this DNAStringSet class as a list, such that FASTA[[1]] will index the first sequence, FASTA[[2]] the second, and so on.

########################
#<<<>>> EXERCISE <<<>>>#
########################

GC <- vector(length=length(FASTA))
for (i in seq_along(FASTA)){
  
  GC[i] <- ### ENTERCODEHERE ###
  
}

And now which one has the max GC?

max_ind <- which.max(GC)
FASTA[max_ind]

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]   818 CAGAAAGCATGAGTCTCGCCC...TTACGCCTATTTTCCCCAGT Sequence_7043

Bioconductor

The nice thing about Bioconductor is that many of its functions are vectorized, so we can simply do this:

GC <- letterFrequency(FASTA, letters="GC", as.prob=TRUE)
max_ind <- which.max(GC)
FASTA[max_ind]

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]   818 CAGAAAGCATGAGTCTCGCCC...TTACGCCTATTTTCCCCAGT Sequence_7043

Hamming Distance and Point Mutations

Let’s say we were given two sequences and we wanted to compare them. One approach is to calculate the Hamming distance, which is basically a running total of the mismatches between bases in the same position in their respective sequences. For example, if sequence A is AATT and sequence B is AGTT, then \(\text{hd}(A,B) = 1\). If sequence C is AGGT, then \(\text{hd}(A,C) = 2\). Let’s write a Hamming distance script using loops and vectors, and then we’ll use Bioconductor.

Programmatically 1

Add the two sequences, s1 and s2, to your environment by running the following:

problem_hamming()

## Added s1 and s2 to environment.

s1

## [1] "ACTGTACCAGAATCGCTATTAGCCCACCTTAGGCGAGTGAAATAACCAAATAAACAAGTGGTGAGGGGAATTGTCCCCACCGTTGCGTTTATGGAGGGGGTGGAAGTGGCCACGAACTGCCAGGTGTCGCCAAACGGAAGACTTCGGGCTTTAGATCCGACTTAACTAACATTTTTCCACCATGAAAGGAGCAATTCAAAGCAACGTAAGGTACTTGCCTGGCCAGGTTGATAAAAGATGCGGACGTCTGATGATGTACGATGATCTTGGCGAGTCAAACCCGGGGACCCCGAGCCGTGACCTAGAGATTGCAATACAGTAAGTAGCCAGGAAAGGAGGATACGATATAAATTAGGGTCACTGTACCCGTTCCGCCTTTCTGCGGCCAAAGACCCGCACGACACATGGACGCCACAGAGGCTATTTGGACCGATGACTCAGGATCATCAAGGGCGACGACGTTAGTCAGTTATATCTGACATTGGATATGTTATAAATAAAACTGGTAACCCACAACGATCCCGGTAGTGGGGACACTGGCCAGGCTTCTAAGCAGATGCGAGGCACAGACACAAACCGGCCGTATGTCAGAGGCAGTACTGAAGTCTAACTTTATCCACGGCAGACGCGTTACATGGCAATCTTGAGCGGGGCGAAGTTAGAGACGTTAAGCTATATGAAACACACTCGGCGTAGCCAATAGCCCATCTGCCTCATAAGGATGGCTGGTTCATTTGTAAAATACTGTATCAGGCGGGGGTAACCTCCCGCGCTCAGGTAATATAATGAGACTGGTACCCATAACACGTTTTCGTCAGTAATAAAAGCGCGATCATTCAAGGGGACGATAGCAGACCTTCAATGCGGAATGGTTTTGCGCCTCTAATAACTGAGAGCACTATAATAGAAGTGAGTGTATTGTTATGCCATCCT"

s2

## [1] "GCTGGAGCGAACTGGACATCAACCCTACTAAGGGAAGAAAATTGGAATAATCATCAGGTACTGAGAACACACGACCCCACCGTTGAGGTTTCGACAGCTTGAATTCTTACAAGGGCTAGCCGTTGCTGGCTAGCCATTTGTCAGAGTGTCTAAGAGCAGACTAAACTACCCTCGTTCCTATATAAACGAAGCTACTACCAGCAAGGTCCGGAACGCAACTCTCAGGATTGATGGGATGTGCACATTTCGTTTGGAGTTGCCGGATAGACGCTCGCAATCCTCTTCGACCACACGGAATGACGCGCCGGTCGCTCCAAATATAGTAGCCCGGGGCGGAAGAGGCGAAACTATATCAGCTTCCGGAACAGATGTCGTTTCTATGGCGCCTTCAAAGAGTGCGACGCCATGCAATCGTACATGCGCCTAGAACTCCTGGATGAGGATCTTTAAGAACGACTGGGTAAGCTAGATACATTTCAGTCTGGTTATGGTCTAAGTAGAACAGGTAACCCATGTACATTCAAATGTATGAGTGCCGGTCCATGCTTATCGGTTAACATGACCAACATCGAAACTATGGTGCGAGTTTAAGTGCAATACCAAAGGCCAAATGAGCGCACTCCATAGGCTGACCCGCATAATAATGGTCTGGTCAAAGGATTAGAAGTAAGATGCTACGATAAGCTTCCCGCGGTCACGATGACCCTTTAGCCTCACCTTATTAAATGGCTAAATTTTTTATAACTGCTGCGGGCAGGGGAACCCAGCCGAGCTCCGTGGATTTACCGAGGCCGCTTACAGTCACATGTTTATGTCAACAACTTGTGCTCGAGTATGCGAGGGCCCTATTGAATCTCGGAAATGTGGCTTGGTTGTCGACCTCTCTAGTCCCGTAGTTCTCAACTCGGAATGGATGGCTGAAACATACGTCCT"

Use a loop to calculate the Hamming distance between s1 and s2. (Recall the subseq() function):

########################
#<<<>>> EXERCISE <<<>>>#
########################

n_seq <- nchar(s1)

hd <- 0
for (i in seq_len(n_seq)){
  
  ### ENTERCODEHERE ###
  
}

hd

You should get the following

## [1] 465

Programmatically 2

And a different approach might be:

mm <- unlist(strsplit(s1,'')) != unlist(strsplit(s2,''))
hd <- sum(mm)
hd

## [1] 465

Bioconductor

seqs <- c(s1,s2)
ss <- DNAStringSet(seqs)
stringDist(ss,method='hamming')

##     1
## 2 465

Finding substrings

Given a DNA sequence, it’s often necessary to find an important subsequence within a larger sequence. If we were given a sequence, let’s locate a target subsequence and return its position. If, say, we were given AATATAGCAT, then AAT would be base positions 1, 2, 3, and so on. If then we were asked to locate TAG, we can report it as being in position 5 (i.e., just reporting the start position).

Programmatically 1

First, let’s load the data. Run problem_substrings() to add S, a DNA sequence, and s, a corresponding subsequence found within S that we which to locate.

problem_substrings()

## Added s and S to environment.

## [1] "GTTAGTACCACGGTCTATTAGGTCTATGGGTCTATAGGTCTATCACAGGTCTATTGGGGTCTATCAGGTCTATGGTCTATCTGGTCTATAGGTCTATAGGTCTATCGGGTCTATCGGTCTATTACGACACGTGCGGTCTATAGAGGTCTATACCAGGTCTATTAAGGTCTATGGCGGGAACACGGTCTATATCAAGGTCTATCCAAAAGGGTCTATTATAGGGTCTATGGTCTATGGTCTATTTTAGGTCTATTGGTCTATCGGTCTATAAGGTCTATGGTCTATCGGGTCTATATGAGAGATGTGTACAACCGGTACTTGAGGTCTATCGGTCTATAGGTCTATATGCGGTCTATGTCGGTCTATCAATCGGTCTATATAGTAACCACATGGTCTATTGGTCTATGGTCTATCTTGGTCTATCGGGTCTATGGTCTATAACAGGTCTATGTGCGGTCTATGGTCTATTGGTCTATAAGCGGGTCTATGGTCTATGGTCTATGGAAGTGGTCTATCTCAGCAGGTCTATGGTCTATGGTCTATGGTCTATACGGTCTATGAAGGGTCTATGCTAGGTCTATGGTCTATCAGGTCTATGGTCTATGGTCTATTGGTCTATAGGTCTATAGAGGTCTATGGTCTATGCGTGGTCTATGGTCTATCGTAGTTGGTCTATGGTCTATCCGGTCTATATCGGTCTATCAAGGTCTATCACTGGTCTATGGTCTATGGTCTATCCGGTCTATTTGGTCTATTAACACGCGAGGCGGTCTATGGTCTATTGGTCTATGTCGATGGTCTATGGTCTATGGTCTATGTCAGGTCTATAGGTCTATGGTCTATAGAGGGTCTATGGTCTATGGTCTATTTGGTCTATTCGTGCTGGTCTATTCGAAGGTCTATGGTCTATTAGGTCTATAGAGTGGTCTATCGATGCGTAATCGGTCTATCGCTAGGGGTCTATTATCTCTCG"

## [1] "GGTCTATGG"

We want to compare our subsequence s of length k to all sequential subsequences in S of the same length as s. For example, we will use a loop and, on the first iteration, compare s to GTTAGTACC, on the second iteration compare s to TTAGTACCA, and so on. The first thing we need to do is determine the number of iterations we need, which will amount to the number of sequential subsequences of length k there are in S. This is a kmer problem, where a kmer is a subsequence of length k present in a larger sequence. For our purposes, this is a 9-mer problem because

nchar(s)

## [1] 9

Now, let’s define the length of S as N (and we already defined k as the length of s). Given that N is 973 and k is 9,

Determine how many iterations are required to iterate over all 9-mers in S. (Note that this is essentially how many 9-mers are in S.)
Come up with a formula to calculate the number of kmers for a given S.

########################
#<<<>>> EXERCISE <<<>>>#
########################

N <- nchar(S)
k <- nchar(s)

N_kmers <- ### ENTERCODEHERE ###

pos <- NULL
for (i in seq_len(N_kmers)){
  
  ss <- subseq(S,i,i+k-1)
  
  ### ENTERCODEHERE ###
  
}

pos

You should obtain the following:

##  [1]  21  67 166 222 229 272 400 426 455 482 489 496 523 530 537 575 591
## [18] 598 631 649 670 717 724 769 797 804 830 848 855 897

Bioconductor

SS <- DNAString(S)
matchPattern(s,SS)

##   Views on a 973-letter DNAString subject
## subject: GTTAGTACCACGGTCTATTAGGTCTATGGGTC...CGGTCTATCGCTAGGGGTCTATTATCTCTCG
## views:
##      start end width
##  [1]    21  29     9 [GGTCTATGG]
##  [2]    67  75     9 [GGTCTATGG]
##  [3]   166 174     9 [GGTCTATGG]
##  [4]   222 230     9 [GGTCTATGG]
##  [5]   229 237     9 [GGTCTATGG]
##  ...   ... ...   ... ...
## [26]   804 812     9 [GGTCTATGG]
## [27]   830 838     9 [GGTCTATGG]
## [28]   848 856     9 [GGTCTATGG]
## [29]   855 863     9 [GGTCTATGG]
## [30]   897 905     9 [GGTCTATGG]

Transition:Transversion Ratio

Point mutations are changes in nucleotide bases in a DNA sequence. A transition is a point mutation between A and G (so G to A or A to G) or C and T. A transversion, on the other hand, occurs between A and T or C and G. (Note this has to do with pyrimidines and purines. If you know what they are, then this will make complete sense to you; otherwise, it will later it in the course.)

Let’s calculate the ratio of transitions to transversions given two DNA sequences. To do this, we simply compare a pair of bases at each position to see if they contain either a transition or tranversion and record a running total as we iterate along.

Programmatically 1

Run problem_ttratio() to add the two DNA sequences, s1 and s2, to your environment:

problem_ttratio()

## Added s1 and s2 to environment.

s1

## [1] "GCAACGCACAACGAAAACCCTTAGGGACTGGATTATTTCGTGATCGTTGTAGTTATTGGAAGTACGGGCATCAACCCAGTT"

s2

## [1] "TTATCTGACAAAGAAAGCCGTCAACGGCTGGATAATTTCGCGATCGTGCTGGTTACTGGCGGTACGAGTGTTCCTTTGGGT"

And now let’s compute the transition:transversion ratio:

########################
#<<<>>> EXERCISE <<<>>>#
########################

n_seq <- nchar(s1)

pu <- c('A','G')
py <- c('C','T')

ts <- 0
tv <- 0
for (i in seq_len(n_seq)){
  
  b1 <- subseq(s1,i,i)
  b2 <- subseq(s2,i,i)
  
  if (b1 != b2){
    
    ### ENTERCODEHERE ###
    
  }
  
}

ts/tv

You should get the following:

## [1] 1.214285714

Shared Motif

If we were given a set of sequences, it might be informative to find the largest subsequence shared among all sequences in the set. In other words, we may want to find a kmer of length k that each sequence contains. This requires us to not only search each sequence for a specific kmer, but we have to do this for kmers of all sizes K.

Programmatically

We’ll tackle with problem using sequences in FASTA format, so we have to load them.

FASTA <- readDNAStringSet('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/86c440446c6592f7472acf3768d33e9942837282/problems_motif.fasta')
FASTA

##   A DNAStringSet instance of length 100
##       width seq                                        names               
##   [1]  1000 AAAGACAGTGTTACAGTCGG...CCGGTAGTTCTAATCACGC Sequence_0382
##   [2]  1000 TGGAACTAGTTAGCCGGGTC...TTTCAAGTATCTCTTAGTT Sequence_1764
##   [3]  1000 TAAACATCTTATTTTGAATG...GATACTATAGATGGTATAC Sequence_4041
##   [4]  1000 GGAAAGAGCGCCGGATAGAG...TTTGACGTAGCACGGATCG Sequence_2012
##   [5]  1000 TACCTGTCAACAATTCAGCT...TTGTCGAACTTGGAATGAC Sequence_7937
##   ...   ... ...
##  [96]  1000 AGATAACCCGCTGTGTCCGA...GGTGTTGACTGTAGAACGA Sequence_2955
##  [97]  1000 CGAGGCTATACTTGGGAACG...GATAAATCTATTTGAGACA Sequence_6164
##  [98]  1000 TTTCTCTGGCAATAGCCGGG...ATCAGCCGTTGAACGTCAA Sequence_3106
##  [99]  1000 GTAAGTCGACACTGTCTCAT...CGAATGCGACTGAGTGCGC Sequence_2175
## [100]  1000 GCCCTCAGGAGGTTAGTTTT...CCACAACACTATGGCGCAC Sequence_1629

We’ll set one sequence, named R, as the sequence we compare all other sequences S to. Since we are looking for a shared subsequence, present in all M sequences, it doesn’t matter which reference sequence we choose.

R <- as.character(FASTA[1])  # reference sequence
S <- as.character(FASTA[-1]) # set of remaining sequences

Now, one thing I’ll note, we need to be somewhat careful about how we write this code. If we use too many loops, or fail to toss redundant kmers, this script will take a very long time and hence will be useless.

For instance, for a given kmer, we have to compare a kmer s in our reference sequence R to a sequence \(S_1\) in our sequence set S. If the kmer is present in both, we move on to the next sequence \(S_2\) in S, and so on. But if that kmer is not in \(S_1\), then we should break the loop and move to a different kmer. It’s a waste of time iterating through the remaining sequences in S because our target shared sequence must be shared among s and all sequences in S.

Another potential problem are redundent kmers within a given sequence. These sequences are pretty long. If we were working with, say, 10-mers, a particular 10-mer is probably present multiple times in a sequence. We want to ensure that when we compare s to a sequence in S, we only compare s to unique kmers in, say, \(S_1\).

We’ll start by making a function that calculates all of the unique kmers for a given k in a reference sequence R.

########################
#<<<>>> EXERCISE <<<>>>#
########################

get_kmers <- function(reference, k){
  N <- nchar(reference)
  
  N_kmers <- ### ENTERCODEHERE ###
  
  kmers <- vector(length=N_kmers)
  for (i in seq_len(N_kmers)){
    
    kmers[i] <- subseq(reference,i,i+k-1)
    
  }
  
  kmers <- unique(kmers)
  
  return(kmers)
}

Now, we’ll use this function in our loop. Again, the goal here is to find the largest subsequence present in all sequences in the set S, along with with our reference sequence R.

########################
#<<<>>> EXERCISE <<<>>>#
########################

N_Ref <- nchar(R)
N_Set <- length(S) # number of remaining sequences in set

motif <- NULL
for (k in seq(N_Ref-1,2)){
  
  R_kmers <- get_kmers(R,k)
  
  for (n in seq_along(S)){
   
    S_kmers <- get_kmers(S[n],k)
    
    R_idx <- which(R_kmers %in% S_kmers)
    
    if (length(R_idx) == 0){
      
      break
      
    }else{
      
      R_kmers <- R_kmers[R_idx]
      
    }
     
  }

  if (### ENTERCODEHERE ###) {
    
    motif <- R_kmers
    
    break
  
  }
  
}

motif

You should get the following result:

## [1] "GCAGGGTGGGCATGGTGGTAGTACGCACGATAATTCCTCGACAAACCGGTCCAGACACTTTCACCACT"

Translation

Let’s now deal with translating RNA to a sequence of amino acids. If we are given an RNA sequence, we simply have to lookup triplets of nucleotides, called codons, in a lookup table to indentify the corresponding amino acid. The first codon we look for is a start codon, AUG, which begins the translation process. We continue translating until we reach a stop codon.

Programmatically

First, we’ll load the amino acid table CODONS into our environment:

aa_table_translation()

## Added CODONS to environment.

head(CODONS)

##     codon aa
## UUU   UUU  F
## UUC   UUC  F
## UUA   UUA  L
## UUG   UUG  L
## UCU   UCU  S
## UCC   UCC  S

CODONS is a dataframe where each row name is a codon. Now that it’s loaded, let’s load the RNA sequence s into our environment:

problem_translation()

## Added s to environment.

We kept it simple such that the first codon is the start codon; nevertheless, the function below, regexpr(), finds AUG and returns the position. Had the start codon not been at the beginining of the sequence, and hence we had to find its location, this would have been a simple way to do it.

The plan here is to simply make a loop that iterates over each codon in the sequence, translates it using the amino acid table, and returns the amino acid. Each iteration will check for a stop codon using an if statement. If it finds one, it runs a command called break, which will terminate the loop.

########################
#<<<>>> EXERCISE <<<>>>#
########################

start <- regexpr('AUG',s)
ss <- subseq(s,start,nchar(s))

N_kmers <- ### ENTERCODEHERE ###

pp <- NULL
for (i in seq(1,N_kmers,3)){
  
  codon <- subseq(ss,i,i+3-1)
  aa <- ### ENTERCODEHERE ###
  
  if (aa == 'Stop') break
  
  pp <- c(### ENTERCODEHERE ### , ### ENTERCODEHERE ###)
  
}
paste0(pp,collapse='')

You should get the following:

## [1] "MGCPFRETDRLTSSTRPNVEGVTVIPDEPRVRLTTFWSRTRGSPPASNHRLYWWTRSAYLTPASVLLSSRHLEVVPTGDPNRSLGRSRSTSYLEVVQQVVRSERRLERSAVDSGTSRYTSSLPVLCRPVWCTTYPRFCFMPLLQARYVKLIVRSGKTSLYLAHPLLPMQFYRSSFVCACDCTYGLPSDPWITVIELSTSLNNSRHTSLLGSLAYLTTCEISFTADCGIMTKQLRRDDLGYRLAFALTRATSSAERGDRLSLMLRACDRMPDSGPTGILLVTAVRERKRSYGIEGFQQSYAIFICCATFNQVSSSTQSPLLSGLKKVQAAWPSRQDPHFFPIRLLRLEVTQTLTTASPLAVPLSRKYRCDSRIDAIQVCHQLVTARAQTIIFMARRLTKACLRVCVFSLRYNRRIQGLTFRHATCTSPRLSAGQNQVPYSQVLTPYPGANSTTSTAADGQWVALENPQGQSLLRCATCGYARTPANVSTYTTEAGSQRQSADNLALEAILYFADSNRHPYFSRCRDKPVSEETSTIIHRYSSARPVTLSETCEEPEIRLVQRARNLVTSPLNTRADVGVTAGPSSGSASKVRVSNHVLWARLMLSLLVTELFSYARSQSPATRSDADHRRRGFSISFTFSTTIPDGMFKGGGIWSLLRRLPTRSQRFSTPVTVSDVILYCYGGCPPSKEIGCEEVHLTWSMSGLDDWNPPQAGLGKTYTTPHKVELNTVVLFRSGIYPTLASEHAAMPCASTTLCLDCFTTSYVDTLRAAQSHERRVTIAKSYAHMQDLVYKSRVVSIHAFMIIVNDYLSGGSSVPAGGQSVVDCPQRITGPSKHLVKGQRTILPRPQLSRADAFMYLKSLPGTAVQKLMPRKCGNFQMGSHFNDQAGQRYDRKHLSAVDQELISKLRVIQAPCGIPRDVFPGAIVFPLYSTGSLQGRSGSNCTSPKVTTSRRAVLGLNSTNCYMGADRCERSMTDGCTTRCQRKWSRPIDKNSATTPSMTFKAKPTKTLPGVYGFVSRNQQERSIDRFNNFAGTSCIWPAYAYRKKVFLRKEPPLKYATPGRGIVLRKDILTSRAHPKAAVKRLQPEPWRVRTPTNGDIGGVFIATHLDVATKRLDHLRYTSFVIWICPREGDSRAGVALNASILTQPRSGYQAIGQAKFCSLGCSHYPPDPQWYCGNPPEGNCVDWWQTHDNQELRSDTWPRVLRFEALCSTNDRHISARMFTLVSFALGTRSSVVIALSLVLWRHPWAISHRRPFCRDHAFCGATELIHDYAQASGPVTGFPRPLCSGEHSSATTHFKRGVRWSCYPDARTRPVAPIDERKVFLPPPALWFSAVLTTLCRVSLSNNEVGGQRSILQYWCIGRFNANHLPSRRYCPICSNRKDMRTIWTILRSYPASLLLRLSLEWFHLSASSIYRLSDGAKQRTNGATSDGARRAPQSQERSSFASVEMPPILIGGELVIPVMWGKGLHSRPGLCLEAAVKYPQYRGKRSYHLTSTVPGRSRRCEYGHWFHVRRVSQPMIRPAHRWGERTPTVVSAQIFEVICQSLDKQMDAGDVLTVGYTRSNLSRASSKKYLRVPCFGGLAPPRALARPDSVRICQSGVEADRKRRCGPPKGRITQIPYCLDNAAPRGKRSLGLLSSLNTLHFYATMDRIRVSVRRGSENKRSSVFYQTIGTNSVQFTTLRGNLALLVEAMMYALTSSHPGDNTSDVTITRLQYSLLRFAYNDSMMLNPLQFVVGPGRTNTSSILARLRVKQLVHEYKGWPSFHIVTLPLIRDRGIGITPIRAKGSQALRRTEKLLSSGEMTGHGSPRHVGFYALIRCRNGPDCYTSHEQINNCESFSRIEPWSGLASALLWIAGRLRWALCRVLSSQLRYPSDNVVLANQFEPMLMSLRIVQSRPKAGFRLIPNGASIIRLPQLLLNLFLPSCAQQKAHNTKSESERFASSYTLKTICLQYPHVPKGRAKCGPEMPTRGYRWLPSKCMRSTASSRASGKSGSRLPNSRRLSTSWCHNTNPLVTHLPDFLDLDKLRTRTKKSGLLFLARNHVFLTSNTTPVTSHKKALRLRCVVKTELALSYLLHPQAASREHFIVAVCINPVRSTSLVVGSVPLRNDDFIKHVGVLRIPAAKCPGTWSSIFSARWRYRGLRYANSVYAGGRRGCSYIPRQSRSAWAYTLCSLKVRLFFVLCPSLIFKLWVPGSSRTVVLPPEPAVEAVPHDVDRLRLMNISSTRKSRGFTHSTVLIFSDRLEFGQCAKSYSKDPSLLDSNRFETLKLIPAIVPDVAWVSLGSFIMGGQRARHDYLEEDSNGPIDHPRLNDSHYVNVNRISSWGGLALYGETTWFPNPSIFPSPLGINYSTPGVAVNHRICIFIVSRSGSRRVSIAIHRRCSQKLGQGDPLYRSKRCRNLEAAFYGMICQFAEKTSTLNPLRSTRCSPVVRDPLALPFYGSWSIVVLRMDRYPIRGAYRALRFSVPRNIKPSLVAGQILCAQPLLIIPDLNKSHLACPPVPCTVPPSGSENTSKYHVSKYSNRSTLSGYAAYQYGLQTLHSVKTFIGVSSFRMYWTDIRSIRRWQVNPNCKLRILMRPVIPRIASRCNGEGGGGTWVPGRCSLAAVSRILLGDLMFTVLGTICDGVAQAYGGLGSHLVITGMSPQGRTTVIFTRDHDYMTREGTRSQVLGTLPRLGRASSWAVLLLSGKPNYISGASPRTESSLSSDQ"

Bioconductor

ss <- RNAString(s)
translate(ss)

##   2713-letter "AAString" instance
## seq: MGCPFRETDRLTSSTRPNVEGVTVIPDEPRVRLT...RASSWAVLLLSGKPNYISGASPRTESSLSSDQ*

Consensus Sequences

So we’re given a bunch of DNA sequences, all of the same length, and we want to find a sequence that best represents the set as a whole. We can think of this target sequence as an ‘average sequence.’ What is one way of going about this? Well, we can calculate the most common base at each position across all sequences. For example, for position 1, we count the number of As, Cs, Gs, and Ts across our set of DNA sequences. Whichever base is most common will be our base at that position in our consensus sequence. See below:

AAGTC
ACCTT
CTACC
AGCAT
TTTTA

The consensus sequence would be ATCTT. Let’s load the FASTA file.

FASTA <- readDNAStringSet('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/0794c27c7753a8bd1a083d18d47fb4c6c693dd51/problems_consensus.fasta')
FASTA

##   A DNAStringSet instance of length 10
##      width seq                                         names               
##  [1]   985 ACCTGCGCCCCGCGTACGTG...AGATTTTTCCACGATGGTAT Sequence_1161
##  [2]   985 AGATCGTGATTAGAAGTCTT...TTTCATCCTAGCGATCCACA Sequence_2419
##  [3]   985 TAGCCCGGGACTAATCGCTC...CGCCCCCCGGATTGAGTAAA Sequence_7476
##  [4]   985 CTCTCCAATGTAGGAGGGGT...CAGGAACACGCTAGAAAGAA Sequence_6103
##  [5]   985 TCACGAGTGCAAGAATTAGC...TTTGTCCGTGTTCAGTGTTT Sequence_1243
##  [6]   985 GTATTATTTGAAGCGGCGAC...TCGGCCGCGAGCACCCACTG Sequence_9252
##  [7]   985 TGAATTGTGTGCGCAGAGTG...TGGTTAGGATTTTCACGAGC Sequence_3322
##  [8]   985 ACTCGTCAATTAATGTAGAA...TTTGAGTGTGTTGCCGGTGC Sequence_0784
##  [9]   985 CCATTCAAGCAGCTTGCGAC...GAGCGAATCACGGTAAATCA Sequence_9241
## [10]   985 TCAACTCCAGGATCACCGTG...CACTTCTTGATCTAGTCCTA Sequence_6958

Our first step will be to create an empty matrix with 4 rows (for each base) and a column for each position in our sequences. Note that all of the sequences in the set are of the same length.

########################
#<<<>>> EXERCISE <<<>>>#
########################

L_seqs <- width(FASTA)[1]

cons <- ### ENTERCODEHERE ###

### ENTERCODEHERE ### <- ### ENTERCODEHERE ###

If you run dim(cons), you should get 4, 985.

The plan now is to create a for loop where each iteration will focus on one of the sequences in our set. For a given iteration i, we’ll split the \(\text{sequence}_i\) into a vector of characters, allowing us to easily index each base in the sequence. Then, we’ll use another loop to loop through this vector of bases and add 1 to the element in the matrix corresponding to the nucleotide and the position in the DNA strand.

########################
#<<<>>> EXERCISE <<<>>>#
########################

N <- length(FASTA)
for (i in seq_len(N)){
  
  seq_i <- as.character(FASTA[[i]])
  s <- strsplit(seq_i,'')[[1]]
  
  for (j in seq_along(s)){
    
    ### ENTERCODEHERE ###
    
  }
  
}

The final step is to identify the most common base in each position, using a loop. We can find the index of the maximum value using the which.max() function.

########################
#<<<>>> EXERCISE <<<>>>#
########################

cons_seq <- vector(length=ncol(cons))
for (i in seq_len(ncol(cons))){
  
  ### ENTERCODEHERE ###
  
}
cons_seq <- rownames(cons)[cons_seq]
cons_seq <- paste0(cons_seq,collapse='')
cons_seq

You should obtain the following:

## [1] "TCATCCGAGCAAGAAGCGTCACGTATCCGTTCCGGGACGAAGCAACGAGATAACAAACCTTGGGGGCAAGAAAAAAGGATCAACCGGTCGCAAAGCCATGCACGTCTCTAAAAATTATCCCAACATCTCAAAAAAGCTAGGCTCAACCACCCTAGTTCGTCATACAGTAAAGGGAAAGCTAGACAAGACACTCCAGAAACCAGGGACAAAGAAAAATGTCAAGGATGCGGTGCGGTAGTTAACTTGTATAGCACGCTGCCAACCAGTCCGCGAGGCAGACACAGAAGACGTAAAAATTAACCCAACTAATATGCACAGGGTAACAGAAAGTTCTGGGATCGATCGAAAGACATACCCCTACCTATGTTTAAAACCGTAGGGAGACAACATGAGTGGAGTTCAAACCGGGGGTGTAAAATTGATACCCCATACGATAGAAAAACAAGCCCAAATGAATCGAAGGAACCAAAACGCCACCCTTTTGACACATATACCAGCGTACAACGAGATAGCAAAACCAACCATAACATCTCCTAGGGAATACACGCAGACCGCAAGTTTTGCTCAGGGTATAGATAGGGAATTGGACTTGCCCACCCTTCCCAAAACCGCATAGCGGTCTGGTTTTCGCGATCTCCCGTGCCAGCCCCTCAAATAGATCAGTGAGGCAGATCAAACTTAAAACGCGTGATAAAGTAAACGCGGGCCGCCAAAGTAACGCCTTAGACGTTGCCAACCGACCTCATAGCAGCCTACCGTGAAAAGACCTGGCCAACAATGTGGACACGTAATCATTAACACCTCCCGCCGTAGCCTACTAATTAGCGACGATAAGCTGGGGAGTAAAGTGCCAGACATGGCTGAGTGTGAGTGCCAAACGGGGAATGAAGTATGGATGGCATCAACCGGCACTGCGAACTACATAGAAATACATTCGAGCCAATACCATAAAATAGACATTCAACTAGGTCCCCATTGAACGTAA"

Dynamic Programming

Let’s try and write a script that can solve a potentially computationally burdonsome problem – a problem involving reproducing rabbits. Say we start with 1 baby rabbit (age 0) at month 1. When the rabbit reaches 1 month of age, it can reproduce, producing a new baby rabbit the following month (month 3). On month 4, the baby rabbit can now reproduce, but our original rabbit will also reproduce, and so on. The rules are therefore

Rabbits age each month
Baby (age 0) rabbits cannot reproduce

And here is a diagram showing the process:

Now focus on the number of rabbits for each month: 1, 1, 2, 3, 5, 7. It’s the Fibonacci sequence, where the current months total is a sum of the two previous month’s total. We can therefore make a function that can recursively calculate the number of rabbits, using the Finonacci sequence, only requiring the number of months the process will span across.

A quick aside: recursive algorithms are hard. They take some work to get the hang of them. I would not worry about either trying to write recursive algorithms or completely understanding how the code below works. The point of showing them is that there’s often a natural way to tackle a programming problem, but it’s not necessarily always the best way.

fib <- function(n){
  
  if (n==1 || n==2){
    return(1)
  }else{
    return(fib(n-1) + fib(n-2))
  }
  
}

fib(5)

## [1] 5

for (n in seq_len(25)) cat(fib(n),' ')

## 1  1  2  3  5  8  13  21  34  55  89  144  233  377  610  987  1597  2584  4181  6765  10946  17711  28657  46368  75025

Let’s change the problem a little bit. Let’s assume now that the rabbits can die after k months. For \(k=3\), we’d have the following process:

We can write another recurssive algorithm to tackle this:

lifespan_inner <- function(n,y,Y){

  if (n==1){
    return(1)
  }else if (y==Y){
    return(lifespan_inner(n-1,y-1,Y))
  }else if(y==1){
    lifespan_inner(n-1,Y,Y)
  }else{
    return(lifespan_inner(n-1,y-1,Y) + lifespan_inner(n-1,Y,Y))
  }
  
}

lifespan <- function(n,y){
   
  return(lifespan_inner(n,y,y))
   
}

for (n in seq_len(25)) cat(lifespan(n,3),' ')

## 1  1  2  2  3  4  5  7  9  12  16  21  28  37  49  65  86  114  151  200  265  351  465  616  816

But look how much time it takes if we ramp up the number of months to 60:

t1 <- Sys.time()
for (n in seq_len(60)) cat(lifespan(n,3),' ')
t2 <- Sys.time()
cat('Time elapsed:',round(t2-t1,1),'minutes.')

## Time elapsed: 5.3 minutes.

It’s slow!

Now we’ll use a dynamic programming approach. You’re going to need dynamic programming later in the course for genomic sequence alignment, so it’s worth exploring the type of speedup one can obtain with a quite intuitive method, particularly when aimed at computationally demanding tasks often seen in genomics.

Briefly, dynamic programming saves a ton of time and resources by sweeping through a problem as a set of smaller subproblems, while storing the results of each subproblem as you go. Think about the rabbit flow chart. If we wanted to know the number of rabbits present on month 1000, we’d have to add months 999 and 998 together, which require information from months 996 through 998, and so on. A recursive algorithm would calculate the result of 999 independently of 998, and then add them together. Dynamic programming, on the other hand, would have the results from those previously months stored, simply requiring us to look them up.

The game here involves the following:

Make an \(\text{n} \times \text{y}\) matrix M, where n is the number of months and y is the rabbit’s lifespan.
Row 1 will represent month 1, column 1 will represent baby rabbits, and column y will represent the final month of life for an adult rabbit.
Each subsequent row will be a running tally of the number of rabbits in each age group. Because each month is updated sequentially, you only need the information from a previous row (month) to update a current row (month).

########################
#<<<>>> EXERCISE <<<>>>#
########################

dynprog <- function(m,y){
   
   mat <- matrix(0,m,y)
   mat[1,1] <- 1
   
   for (i in 2:m){

     ### ENTERCODEHERE ###
     
   }
   
   return(rowSums(mat))
}

Here are some example answers to check. Note that there may be some variability given the max integer number in R, which is set in the options. If you get the right answer for smaller parameterizations, then your code is correct.

dynprog(25,5)[25]

## [1] 24485

dynprog(50,5)[50]

## [1] 1085554510

dynprog(70,8)[70]

## [1] 98633427418822

Report the total rabbits after 50 months with a lifespan of 35.

Bioinformatics Programming Problem Set

SW

September 2016

Preamble

Introduction

Organization

Warmup

Dataframes

Lists

Matrices

For loop

Problem 1

Problem 2

Problem 3

If else

Problem 1

Problem 2

Problem 3

Bioinformatics Problems

Base counting

Data

Programmatically 1

Programmatically 2

Bioconductor

DNA to RNA

Programmatically 1

Programmatically 2

Bioconductor

Reverse Complement

Programmatically

Bioconductor

FASTA Files

Loading

Creating Sequence Sets

Sample Metadata

GC Content

Programmatically

Bioconductor

Hamming Distance and Point Mutations

Programmatically 1

Programmatically 2

Bioconductor

Finding substrings

Programmatically 1

Bioconductor

Transition:Transversion Ratio

Programmatically 1

Shared Motif

Programmatically

Translation

Programmatically

Bioconductor

Consensus Sequences

Dynamic Programming