Introduction

Dynamic programming makes computationally intesive problems manageable by dividing them into a set of subproblems. On its surface, this might sound like a divide-and-conquer approach, but it is in fact different. D&C solves disjoint subproblems recursively and hence independently. Each subproblem must be calculated from scratch, and their results are combined to reach a final solution. This results in more work than necessary. DP approaches, on the other hand, solve overlapping subproblems and save their results for later use; hence, each subproblem needs to be calculated just once, and overlapping subproblems can inform each other to lessen the computational burden. An interesting property of DP is that the optimal solution to these subproblems contributes to the optimal solution for the overall problem. DP algorithms have a polynomial complexity.

DP alogorithms are bottom-up and are designed as follows (from Introduction to Algorithms, 3rd ed.):

  1. Characterize the structure of an optimal solution
  2. Recusively define the value of an optimal solution.
  3. Compute the value of an optimal solution, typically in a bottom-up fashion.
  4. Construct an optimal solution from computed information.

A simple example

Given n, we have 3 operations we can perform:

  1. if n mod 3 == 0 then n/3
  2. if n mod 2 == 0 then n/2
  3. n - 1

The objective is to find the fewest number of operations needed to reach 1 from n. A recursive version looks like this:

find_val_mem <- function(n){
  
  steps <- rep(NA,n)
  
  get_steps <- function(n){
    
    if (n == 1) return(0)
    
    if (!is.na(steps[n])) return(steps[n])
    
    v <- rep(NA,3)
    
    v[1] <- get_steps(n-1)
    if (n %% 2 == 0) v[2] <- get_steps(n/2)
    if (n %% 3 == 0) v[3] <- get_steps(n/3)
    v_min <- 1 + min(v,na.rm=TRUE)
    
    steps[n] <- v_min
    
    return(v_min)
    
  }
  
  return(get_steps(n))
  
}


find_val_mem(100)
## [1] 7

It’s slow. A DP version, on the other hand, looks like this:

find_val_dp <- function(n){
  
  steps <- vector(mode='integer',length=n)
  
  for (i in seq(2,n)){
    
    d <- rep(NA,3)
    v <- rep(NA,3)
    
    if (i %% 3 == 0){
      d[3] <- i/3
      v[3] <- 1 + steps[d[3]]
    }
    
    if (i %% 2 == 0){
      d[2] <- i/2
      v[2] <- 1 + steps[d[2]]
    }
    
    d[1] <- i - 1
    v[1] <- 1 + steps[d[1]]
  
    steps[i] <- min(v,na.rm=TRUE)
    
  }

  return(steps[n])
  
}

find_val_dp(25000)
## [1] 16

It’s significantly faster.

Rod cutting

Say we have a silver rod, and a piece of length i would net us \(p_i\) dollars, such that we have the following price table:

set.seed(12)

N <- 25
C <- 50
l <- seq_len(N)
p <- sort(sample(seq_len(C),N,replace=TRUE),decreasing=FALSE)

matrix(p,nrow=1,dimnames=list('p_i',l))
##     1 2 3 4 5 6 7 8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## p_i 1 2 2 4 5 6 9 9 11 11 14 14 19 20 20 22 23 28 33 34 36 40 41 41 48

We’d like to maximize our profit, so we need to figure out a way to cut our rod such that our pieces total to the largest value. We can do this recursively:

cut_recursive <- function(prices,length){
  
  if (length == 0) return(0)
  
  q <- -Inf
  for (i in seq_len(length)){
    
    #cat(sprintf('len=%s,val=%s\n',length-i,cut_recursive(prices,length-i)))
    q <- max(q,prices[i] + cut_recursive(prices,length-i))
    
  }
  
  return(q)

}
cut_recursive(p,5)
## [1] 5
cut_recursive(p,15)
## [1] 21
cut_recursive(p,20)
## [1] 34

And the DP version:

cut_dp <- function(prices,length){
  
  r <- c(0,rep(-Inf,length))
  
  for (i in seq_len(length)){
    
    q <- -Inf
    for (j in seq_len(i))
      q <- max(q,prices[j] + r[i-j+1])
    
    r[i+1] <- q
  }
  return(r[length+1])
}
cut_dp(p,5)
## [1] 5
cut_dp(p,10)
## [1] 12

Fibonacci rabbits

Let’s try and write a script that can solve a potentially computationally burdonsome problem – a problem involving reproducing rabbits. Say we start with 1 baby rabbit (age 0) at month 1. When the rabbit reaches 1 month of age, it can reproduce, producing a new baby rabbit the following month (month 3). On month 4, the baby rabbit can now reproduce, but our original rabbit will also reproduce, and so on. The rules are therefore

And here is a diagram showing the process:

Now focus on the number of rabbits for each month: 1, 1, 2, 3, 5, 7. It’s the fibonacci sequence, where the current months total is a sum of the previous month’s total. We can therefore make a function that can recursively calculate the number of rabbits, using the finonacci sequence, only requiring the number of months the process will span across.

fib <- function(n){
  
  if (n==1 || n==2){
    return(1)
  }else{
    return(fib(n-1) + fib(n-2))
  }
  
}

fib(5)
## [1] 5
for (n in seq_len(25)) cat(fib(n),' ')
## 1  1  2  3  5  8  13  21  34  55  89  144  233  377  610  987  1597  2584  4181  6765  10946  17711  28657  46368  75025

Let’s change the problem a little bit. Let’s assume now that the rabbits can die after k months. For \(k=3\), we’d have the following process:

We can write another recurssive algorithm to tackle this:

lifespan_inner <- function(n,y,Y){

  if (n==1){
    return(1)
  }else if (y==Y){
    return(lifespan_inner(n-1,y-1,Y))
  }else if(y==1){
    lifespan_inner(n-1,Y,Y)
  }else{
    return(lifespan_inner(n-1,y-1,Y) + lifespan_inner(n-1,Y,Y))
  }
  
}

lifespan <- function(n,y){
   
  return(lifespan_inner(n,y,y))
   
}

for (n in seq_len(25)) cat(lifespan(n,3),' ')
## 1  1  2  2  3  4  5  7  9  12  16  21  28  37  49  65  86  114  151  200  265  351  465  616  816

But look how much time it takes if we ramp up the number of months to 60:

t1 <- Sys.time()
for (n in seq_len(60)) cat(lifespan(n,3),' ')
t2 <- Sys.time()
cat('Time elapsed:',round(t2-t1,1),'minutes.')
## Time elapsed: 5.3 minutes.

It’s slow!

Now we’ll use a dynamic programming approach. Again, dynamic programming saves a ton of time and resources by sweeping through a problem as a set of smaller subproblems, while storing the results of each subproblem as you go. Think about the rabbit flow chart. If we wanted to know the number of rabbits present on month 1000, we’d have to add months 999 and 998 together, which require information from months 996 through 998, and so on. A recursive algorithm would calculate the result of 999 independently of 998, and then add them together. Dynamic programming, on the other hand, would have the results from those previously months stored, simply requiring us to look them up.

The game here involves the following:

  1. Make an \(n \times y\) matrix M, where n is the number of months and y is the rabbit’s lifespan.
  2. Row 1 will represent month 1, column 1 will represent baby rabbits, and column y will represent the final month of life for an adult rabbit.
  3. Each subsequent row will be a running tally of the number of rabbits in each age group. Because each month is updated sequentially, you only need the information of a previous row (month) to update a current row (month).

Here are some example answers to check. Note that there may be some variability given the max integer number in R, which is set in the options. If you get the right answer for smaller parameterizations, then your code is correct.

dynprog(25,5,FALSE)
##  [1]     1     1     2     3     5     7    11    17    26    40    61
## [12]    94   144   221   339   520   798  1224  1878  2881  4420  6781
## [23] 10403 15960 24485
dynprog(50,5)[50]
## [1] 1085554510
dynprog(70,6)[70]
## [1] 2.685139e+13

Longest Common Subsequence

We’re going to tackle alignment programatically by first learning how to find the longest common subsequence then tackling global alignment. They build upon one another, so the order should seem natural.

A subsequence is simply an ordered set; it need not be consecutive. For example, if we had the sequence ABCDEFG, then ABC, ACF, and DFG would all be subsequences, whereas AGD would not. We aim to find the largest subsequence that two nucleotide sequences share. This boils down to essentially an alignment problem only involving insertions and deletions.

The algorithm is as follows (this is based on R syntax, so indexing starts at 1):

DEFINE LCS:

Given sequences x and y

1 Create an empty matrix S for scores and an empty matrix B for the backtracking path
2 For i in 2 to length x
..3 For j in 2 to length y
....4 Obtain the score of the upper S[i,j-1], upper-left diagonal S[i-1,j-1], 
....  and left S[i-1,j] cells.
....5 if the current nucleotides in x and y are the same, such that x[i-1] == y[i-1], 
....  set s[i,j] to the maximum score among S[i,j-1], S[i-1,j-1] + 1, and S[i-1,j], 
....  where "+1" is the bonus for a match; 
....  if x[i-1] != y[i-1], then set s[i,j] to the maximum score between S[i,j-1] and S[i-1,j]
....6 Record the position of the maximum score (upper, upper-left, or left)
..End loop
End loop

7 Perform backtrack

We’re going to create a function that computes the LCS based on the pseudocode above. It’ll take two sequences, x and y:

x <- 'AGCAGACACGTGAT'
y <- 'ATCACCGGTAT'
unlist(strsplit(x,''))
##  [1] "A" "G" "C" "A" "G" "A" "C" "A" "C" "G" "T" "G" "A" "T"
x <- unlist(strsplit(x,''))
y <- unlist(strsplit(y,''))

m <- length(x)
n <- length(y)

The actual algorithm can be written a bunch of ways, but one way to start is to preinitialize our score and backtrack matrices. Recall that each is padded by an additional row and column that is filled in with either zeros or some type of penalty:

##     A T C A C C G G T A T
##   0 0 0 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0 0 0 0 0 0
## T 0 0 0 0 0 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0 0 0 0
## T 0 0 0 0 0 0 0 0 0 0 0 0
##      A  T  C  A  C  C  G  G  T  A  T 
##   "" "" "" "" "" "" "" "" "" "" "" ""
## A "" "" "" "" "" "" "" "" "" "" "" ""
## G "" "" "" "" "" "" "" "" "" "" "" ""
## C "" "" "" "" "" "" "" "" "" "" "" ""
## A "" "" "" "" "" "" "" "" "" "" "" ""
## G "" "" "" "" "" "" "" "" "" "" "" ""
## A "" "" "" "" "" "" "" "" "" "" "" ""
## C "" "" "" "" "" "" "" "" "" "" "" ""
## A "" "" "" "" "" "" "" "" "" "" "" ""
## C "" "" "" "" "" "" "" "" "" "" "" ""
## G "" "" "" "" "" "" "" "" "" "" "" ""
## T "" "" "" "" "" "" "" "" "" "" "" ""
## G "" "" "" "" "" "" "" "" "" "" "" ""
## A "" "" "" "" "" "" "" "" "" "" "" ""
## T "" "" "" "" "" "" "" "" "" "" "" ""

Now, we have to loop through these matrices just like we would if we were doing this problem on paper. We start at the upper left corner and work our way down to the bottom right. As we move down, we pay close attention to the upper, left, and upper-left diagonal cells relative to our current position.

We specifically need to calculate our scores at each iteration, which pertains to the left, upper, and upper-left diagonal scores relative to our current position. Thus, if we are at position i, j, then we care about scores s[i-1,j], s[i,j-1], and s[i-1,j-1], respectively. We also want to add our match bonus, in which we’ll add 1 to the score of the upper-left diagonal, assuming there is an actual match in our sequence at this position (which we’ll get to next).

We also have to deal with awarding the correct score for our current position. Recall the piecewise function is as follows,

....5 if the current nucleotides in x and y are the same, such that x[i-1] == y[i-1], 
....  set s[i,j] to the maximum score among S[i,j-1], S[i-1,j-1] + 1, and S[i-1,j], 
....  where "+1" is the bonus for a match; 
....  if x[i-1] != y[i-1], then set s[i,j] to the maximum score among S[i,j-1] and S[i-1,j]

Thus, we can create a vector that stores our 2 or 3 scores, depending on whether there is a match:

We also need to know which of the 3 cells contributed to our current position’s score, so we have record the index of the max score in our scores vector. We can call it backtrack update. Our backtrack_update variable records the position in the scores vector that had the max value, whereas the score_update variable contains the actual max score. We can use the position of the max value as a way of indexing a dictionary that can contain arrows that point to the cell that contributed to the current positions score.

The final function should look something like this:

find_lcs <- function(x,y){
  
  preprocess_sequences()
  
  backtrack_key = c('|','--','\\')
  
  S <- preallocate_score_matrix()
  B <- preallocate_backtrack_matrix()
  
  for (i in outer_loop){
    for (j in inner_loop){
  
      scores = get_score_update_from_upper_left_diag()
      
      update_S(scores)
      update_B(scores,backtrack_key)

    }
  }
}

That’s the function. It will calculate both the path history and the scores for all nucleotides. Now the question is “what about the backtracking”? We’ll deal with that for global alignment, but for now, we’ll use the following recursive function to handle it. We also have to update an option to allow for very deep recursions (if that makes no sense, don’t worry about it). We’ll finish the function up with a list that contains all of the information we’d like returned.

backtrack_lcs <- function(b,x,y,m,n){
  if (m==length(x) | n==length(y)) return(NULL)
  if (b[m+1,n+1] == '\\'){
    return(c(x[m],backtrack_lcs(b,x,m-1,n-1)))
  }else if(b[m+1,n+1] == '|'){
    backtrack_lcs(b,x,m-1,n)
  }else{
    backtrack_lcs(b,x,m,n-1)
  }
}

If we toss this into the pseudocode, it should look something like this

backtrack_lcs <- function(b,x,y,m,n){
  if (m==length(x) | n==length(y)) return(NULL)
  if (b[m+1,n+1] == '\\'){
    return(c(x[m],backtrack_lcs(b,x,m-1,n-1)))
  }else if(b[m+1,n+1] == '|'){
    backtrack_lcs(b,x,m-1,n)
  }else{
    backtrack_lcs(b,x,m,n-1)
  }
}

find_lcs <- function(x,y){
  
  preprocess_sequences()
  
  backtrack_key = c('|','--','\\')
  
  S <- preallocate_score_matrix()
  B <- preallocate_backtrack_matrix()
  
  for (i in outer_loop){
    for (j in inner_loop){
  
      scores = get_score_update_from_upper_left_diag()
      
      update_S(scores)
      update_B(scores,backtrack_key)

    }
  }
  
  lcs <- backtrack_lcs(b,x,y,m,n)
  
  return(list(lcs=paste0(rev(lcs),collapse=''),
              length=s[length(s)],
              score=s,
              backtrack=b))
}

We can now test it out:

find_lcs('AGCAGACACGTGAT','ATCACCGGTAT')
## $lcs
## [1] "ACACCGTAT"
## 
## $length
## [1] 9
## 
## $score
##     A T C A C C G G T A T
##   0 0 0 0 0 0 0 0 0 0 0 0
## A 0 1 1 1 1 1 1 1 1 1 1 1
## G 0 1 1 1 1 1 1 2 2 2 2 2
## C 0 1 1 2 2 2 2 2 2 2 2 2
## A 0 1 1 2 3 3 3 3 3 3 3 3
## G 0 1 1 2 3 3 3 4 4 4 4 4
## A 0 1 1 2 3 3 3 4 4 4 5 5
## C 0 1 1 2 3 4 4 4 4 4 5 5
## A 0 1 1 2 3 4 4 4 4 4 5 5
## C 0 1 1 2 3 4 5 5 5 5 5 5
## G 0 1 1 2 3 4 5 6 6 6 6 6
## T 0 1 2 2 3 4 5 6 6 7 7 7
## G 0 1 2 2 3 4 5 6 7 7 7 7
## A 0 1 2 2 3 4 5 6 7 7 8 8
## T 0 1 2 2 3 4 5 6 7 8 8 9
## 
## $backtrack
##      A    T    C    A    C    C    G    G    T    A    T   
##   "" ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""  
## A "" "\\" "--" "--" "--" "--" "--" "--" "--" "--" "--" "--"
## G "" "|"  "|"  "|"  "|"  "|"  "|"  "\\" "--" "--" "--" "--"
## C "" "|"  "|"  "\\" "--" "--" "--" "|"  "|"  "|"  "|"  "|" 
## A "" "|"  "|"  "|"  "\\" "--" "--" "--" "--" "--" "--" "--"
## G "" "|"  "|"  "|"  "|"  "|"  "|"  "\\" "--" "--" "--" "--"
## A "" "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "\\" "--"
## C "" "|"  "|"  "|"  "|"  "\\" "--" "|"  "|"  "|"  "|"  "|" 
## A "" "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|" 
## C "" "|"  "|"  "|"  "|"  "|"  "\\" "--" "--" "--" "|"  "|" 
## G "" "|"  "|"  "|"  "|"  "|"  "|"  "\\" "--" "--" "--" "--"
## T "" "|"  "\\" "|"  "|"  "|"  "|"  "|"  "|"  "\\" "--" "--"
## G "" "|"  "|"  "|"  "|"  "|"  "|"  "|"  "\\" "|"  "|"  "|" 
## A "" "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "\\" "--"
## T "" "|"  "|"  "|"  "|"  "|"  "|"  "|"  "|"  "\\" "|"  "\\"

And now with much longer sequences:

xy <- readr::read_lines('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/6617b2ceae8765ff7b91f4a600fc5460b335279d/lcs_sequences')

find_lcs(xy[1],xy[2])$lcs
## [1] "CTCTAAGCCAATGGCTCAGGGTGGGTGTAGCGATCCCGCGACAGTAAGCGTCTTGGTAGTTTCATCGCAGCTTCGACCTCGGGTTATCCCGCACCACCCCTTAGACTATTAAGAGCGAATACCCAATAAGTCTTTGCTCACATCCCCAGGGTTAAAACCGCGTGAGAGCGTCTCGACACGTTCTCGTTTATAGGGCGTAATTCATTCTGAGTGTGCGCCAGACTCAATACATCGTAGTCTTCTGCAAACGACTATGAGGGATGGACGCTCGGTACTACTGATTAGGTTGACAGTGATACAAACCAGGATGTATATATCGGTAAAGCCAGTTTACCGACTGCATCGCCGAGTGAAGTTCCACTCATGAAGTAAAATCTTAGTATATGTTATGCCCCCCCTTTTTTATCGAAAGTAGACTAGGTCACTTGTTGTACCGAGCCGCGCTGACATATTAACGCACCACCGACCGTTTTAGTCCTGATTGTTGGGGCATATCAGTGAATTGCGTGAGAAGTGGTCAGGTGGAGCCAGCCGAGAATCGGCTCTAGCCTCAAAACCTACTTCCACGTTCTGCCCCACTGCGAGAGAGAACTCCGGAGTCCTTCGTAGGACCAGGGC"

Global Alignment

Now we can extend the algorithm above for a global alignment problem. Unlike the LCS problem where we only accounted for insertions and deletions, we’re now going to penalize for mismatches. The score we give based on a particular nucleotide-nucleotide or amino acid-amino acid pair will be defined by a scoring matrix such as the amino acid scoring matrix BLOSSUM62:

##    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
## A  4  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -2
## C  0  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
## D -2 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -3
## E -1 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -2
## F -2 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1  3
## G  0 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -3
## H -2 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2  2
## I -1 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1
## K -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -2
## L -1 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1
## M -1 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1
## N -2 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -2
## P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -3
## Q -1 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1
## R -1 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -2
## S  1 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -2
## T  0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -2
## V  0 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1
## W -3 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11  2
## Y -2 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2  7

Note that this matrix scores for mismatches and matches, hence we need not develop separate match and mismatch scoring criteria; we can simply use this matrix as a lookup table. Also, unlike before, we’re no longer going to recursively backtrack through our algorithm; instead, we’ll use a while loop to backtrack in a fashion similar to doing the problem on paper. This allows us to take full advantage of the fact we have the longest path recorded in our backtrack matrix and also permits us to shift the position of NTs or AAs in our sequences when necessary by adding ‘-’, allowing us to create our correctly oriented alignment strings.

The algorithm is as follows:

DEFINE GLOBAL_ALIGNMENT:

Given sequences x and y, penalty p, and scoring lookup table score

1 Create empty an matrix S for scores and an empty matrix B for the backtracking path
2 Preinitialize edges of S and B based on penalty
3 For i in 2 to length x
..4 For j in 2 to length y
....5 Obtain the score of the upper S[i,j-1], upper-left diagonal S[i-1,j-1], 
....  and left S[i-1,j] cells:
...... set s[i,j] to the maximum score among S[i,j-1] + p, S[i-1,j-1] + score[x[i-1],y[j-1]], 
...... and S[i-1,j] + p
....6 Record the position of the maximum score (upper, upper-left, or left).
..End loop
End loop

7 Perform backtrack

The BLOSSUM62 scoring matrix can be found here:

blosum62 <- as.matrix(read.table('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/34c4a34ce49d58b595c3fb2dc77b89a5b6a9b0af/blosum62.dat'))

And our backtracking algorithm:

DEFINE GLOBAL_ALIGNMENT_BACKTRACK:

Given sequences x and y, backtracking matrix b, penalty p, and scoring lookup table score

1 Create empty alignment vectors align_x and align_y and 
. preinitialze score s to 0, m to len(x), and n to len(y)
2 While m > 0 OR n > 0
..3 if b[m+1,n+1] == UP
.... update align_x
.... update align_y
.... update score
.... m -= 1
..4 if b[m+1,n+1] == LEFT
.... update align_x
.... update align_y
.... update score
.... n -= 1
..5 else
.... update align_x
.... update align_y
.... update score
.... n -= 1
.... m -= 1
..End loop
End loop

A sample sequence pair can be found here:

xy <- readr::read_lines('https://gist.githubusercontent.com/sw1/8870a124624f31585d5c15be72fcfc21/raw/0488087071c0b27cd7a576913d5502bdf95db6e5/global_sequences')

And the result using a -5 gap penalty:

global_alignment(x,y,blosum62,-5)
## $score
## [1] 10348
## 
## $alignment
## [1] "RWDEIRLVGDRFMHANNEMLAQSQLMV--L---M---AALSGTMIGCGRYKSGWPVVTALNPPDFSSKEKLNITGMSKNT-D-ERPSCQCEGHGFRHY---IYM--QYLTQLRNIPQQQCM-QGVDINTCWADLDPVCPHLMRHSYKLWKC--K-ILWRN-A-RNYYPEITCAYDMGGG-L--SAMECDGTCHCTFFNDGWMYCIVSMANEKWQIPWMHMNMSFGLENSYPIHHDTAMNEARIKWSFWRPDDLSTTYRKAGPHYWNNWL-VFAHLQPFRNEFFQEHNMGMVSRN-S---VTSLPHLSSYYTHEINYLSMRDHNFHNWQLLTEDHVYADEIPYSRNITTAWLWHLA-VMTEIDLCTDMMYTSISDTLRPYAFWETGNIEGAAYIRCAFYPEEDDGLDMWTYDNWEAAGCIEFCWDFAD-H-AWELYRHQTHCGKHHIWVFLILDHFIF-N--WKYL-C-LPS-GT-ELRACGIQMIYQIMEFLWTSSNLRNVY--IKTHSPVQPKPC-LA--M--T-Y--VQEYKEPQHNSV-----CQS-NMERNVVHSCPFGEWMIE----T--HDDPEQECSQCAKSHEGMPAAGHKQWMTVQISSSVNNHDEHMKIMLHVEIRDDSARMNYPK--KPRGGEQAMDGHLIIQYMFHEVPHNFHHMFAQFNKWYVTQDWVNVTWHPLFQEQH-N---L-HNQQWTSNADRCYRLPKYCTQQFRNRLWRHAWVYADIIMSPGWSKTSSDD-LS-AHTESYVALYQWKEFWQPDHSCWDTNIDPPCDKQHMAYHQFRKASPYCGMFIYRMTWSSASMLVIILPSMKMPCYAAYPLCYFIPGETSRPDFMTQMTEFAWHAMSAERCTFW-QMKTY--ICIKRKMSIQWALYRI-----ANNFAHFVPWRMKPR-PNIWPFFHVFSKNSTGVRDYRMNMFEGYTPCES------PDKMQDDRFSWWQSQHVMSCPRPQNIFPEMARRR-F-ELYVAYEIYFWMT--SGPEQ-----LS-----H--MCTRKNYLMRMD--LYFINLWRFKNSIRRWMPVKFHNSLRKHNTQDVRCCQARGSAHYFRYLVMQEIPQA-------VQ-PI--N---AQCACWKNRAESSYMPFHCGCMLLC-VILKMSCERVIHQNGSLIGKRRSNTSWFCMAQDQYDTNRYLNDIPIHPWFETIDQYWLLWVAPYRPSMIPTNIHEHKNGNLHPWRDHYEDAYFVTKQETGSVT-VLPYGKMPWLCELLQYCVPAEVDAKPGICHNREDRDFDLQCSAHREWEDVHFCSKIYDSHCECQQTCSKCWMTIKVSMEFFSLDHNTVLKPWLAHLKAFKRKWDGRVATMYYVPLLWPDKNAGSQHAIKVLCRHTISHHQLHDQEAKEDIGL--QAYGPLERDNKPGVFGAQWKA--------MYIQYKDGYLWHDDIRQRLVCTQHQELLGAQHRSFLIWPQ-KKKGFFDSCIDVKDEPKQLYIILGVV-R-QYVCPALCSLFEMWFVTSDPAWPVFSEQM-----L-KIWQRENGRHMFPIHLCIERFELHKR-T-CKSWLIDERYCQHPTVECIYVMTRVIYCGQRACED--VWFRN---PNDHWYREAND-WIHGYRM----SHRANFSIRNDKYQSCCDFVHRAMTPGTFHGVDMLMYLRMGYKPEAKNHCYDWVNLTD-VCW--RCN-H--QIKVYAVTFVAIR-E-------PIVV--H-AG---E-----E--HPYLNATFGCH----CI-----GKPM-M--TIQAKNH-------MPHKKKGQYLFTWWFKII------QVHIWVW-NM-THNIMGWAETRGMGPWLCCRQILINGPILMKIHQFT-RFCCKCNCYINPRSIEHAGCPFKGNHWLNQEGSFNGVGRPKHRGTYVDHFRVVNGMICIWPLFLHKQSWQSVLEIWITSKSKLQGQLIMWMIQKKINVHIMMDLVQYLEGWIGWCLP-MKYTDGNVAHSAYGHDVTNRFCLHFAYWTVATWILAMDSPTHWRWIHVVGYFFGFKQVHFHRQHGMIYGCARCQRKRDEAVERWKR---Y--GDNWMDV-VTIDH-CPDCMMRGMDEWNWREKGWQHSKEIQWIYIKGNEIGCSYVEYCYIWEMYCVQEGEMLVSAE-KGADYWRRLCLCFQHGFEPACGNCNWRQPVQLINLGWCSAQLHPNCIKFA----RLTWAHGSLKVELHWLHSFREMVFQKYRKGPAWDNVKVLGMETNYGGFHNCPMIVFSQSIKYFKLMHVWHIHSNSNGLAYK--KD-HQWLDQWQSDICMHRTVLVR-PSWLRCH-GSGN--EGGRQCKHAKCIS-HWTQHVYRPLQVTFAENHDPYCKHDHAQYCWDALFRWSSTELWN-CLIYGDIGYQMIWIAEEVRYWYVGNMVDGSNMISICRY-QSFCINNHQGWWFVAWWGMKHWWTTTWDWYVVCELKRRNI-WVEREVQQVFVNNPGMVKVIIWWSKYLQISATYAQEKGRNPYPTFMTITRQEAYVHLTFLWAHICICGMRRLYAAIQIGQHISPQEINGQDIPRLQFKLIGCRNVMCYPWVKSCTCYWGMEIFTTQVFTSYQG--D--PQFKHWLLNQSCPKQCMYFPHQMKVSLK-GAKHYGPVLDSFEARQYSLWWEKETTWVIQAAFLNQVHWMMIFT--C------CVNPTTCCCGV------N--QDSSSVQ-H-----FYQGSPLACENKRLCDRAMTPFSL----LQSWKDAEDFDHRTNKAIYSPFNNLVIKYKYCAYVESATKGINRA-EVPMYGIKTSVCDWRKLHVNIHSRYGDQNHKPVNTYEQVVWGESWYEMAPETLMWHPHE--FIM--V--VH-DPLC----------F-----Y--EHIDLEIHSSHAEDWAQPCRYLAMQTTCYGPQFDHFVPKYEPLMWQFYG----DLFRLDS--K-----NHWCSG-----E-M-----L--S--L--GRQA--Y-WALI----Q---MEMKDMMLTNGCEC-PYDWLMHW-KGWSIMQGLMRVK-IPPFMGW---R---FP--C--NDHMGSWNSNCLD-IRLLWLKITTELITVDHWRYCT-T-----GEHGVVDAPGRGWKKPNWNLKAKITDRMTMKFKDCWVDIMSD-----I--E-MSCFYRVNCNNKSGDRGAWHMTHQ-T-RNDTVDKCIQATFFFS-KCMWPHWVNTGSIKFLPWPDHSDIFEWKPQIVVSVYYSKTEPIPRTQMSREKNGVSRDMQVCSPVHFEEYHDHCSLVWCEPIKHHSYPVLFCMVFQQLYIMSSLVFSVICKMHLLW-DIKMK--V-PSAFIQCLQEQVKYINMPGSLIHCP---L-----MDKYRYMS----VNSLFQEGN-RHGMLGQ-NG-SYGGMWYDLVHG-K---WGER-QP-L--CGNKQIQSYERQTKGKWTLS-H-----NCVFTQ-D-SFFPEAM---N-D-F--LEAILFVCRKFAHGCEFVNRWNVWKKTEV-D--EMNMDMCY--AHD-N-N-A--TNYPTFICRVFWKFGGMPQVAHTWSK--WMDKYR-EAVLEVQCAFCMTSNAWHCMLCQVQRPHGSPDLFAQNPLIIECKIVWGCGLVMFMEPAHFLVSMDMS-TATHGLNWDFCMQLIE-IHQPI---P--H-EMFFQ--TC---CWFG----QAYEMTKESDMFKRHCLCDSFLTWYGELYKEWAYTKPSQVEQTQT-----NRMKKCVYCFNVD--LS--------QMQWHDAFFEDWCADMI--F--ENFSEYINAWRLLKQQPNTA---QYGSRMRIKTCHAMYEVANQFYKADKWIMTVNEGGAVWKIIIAPAQTDK--Y-P-ITRKS---W--Y----P------R-GYQHIAENCVARICQVVVTWQGHCVIEYTCHLEENPVMF---HM-SYWCRDNHYIKKPYPMGEPHDSERIMTRKCRAMIALARIKYH-INWTGPVRDFKDGSDDLCDCPGWDWLGSPSCVGQHCANL--QF-----N---LMGMPTPQVP-S--F----AEIPEWYMCIERPYEIGNWWQCSW--P--CYCNNFHRWIRLTCSKQPTSKNQACHRHNFMILLQECVLPLARHSLECPDPVQRPDD-M-N-FG-ET-NRWVN-TY-S-RNTCF--LP-RDNNKYKPICSYKVEDLKQNS-R-E-Q-I-L-IPWERREEGFVSRMTQEDYQDHHLWVADQEEICKMLPYMKISILSAYANAIHAFFWANRILDYTWDSGIMGEEML-Q--R----TAGLYYSLDFQSKWYTHWLICANEGT--ENWTCKHL-PIKVSIRWGTTDLWEYSFCDRCRKCTVPANHVNCHCNN-HQLSYAIQTPMMLIHPQPNVEHVLSKCGTRRVLLSPLEHYEKNYHWGTQSNRQWYLGN-CRSVCDGCTTNF------PWEFNKAY-I-N---KCQTLDSRPQF-YA-V-CMAGPEMHQTN----LR--P--WRLWTFRMWYIVACPKETHLPITASNDE--L-----R-K-N-WEWTILLIRIWVCSERFKQCVSNDVYLWTLAKSRNYQDCVQYSPPHFDKVHRRQHRIY-CV-EYLDFHNGFYPRTPPSSCIQDTGYIFKASTLMMVCYTGGFELDESICSKDPHAQEHFSKFINANSG-HYVVCIFMKQTQQYGFGCQWGMTIMQFWAPCHVKFHLSCLYLGQWQQHDPYFARMAMW---N---RHQWDFVMCEYKLTELNKN-CLWKRHKKSM-SMSIYFYFDSHPSYEPQQQDPDPRRKCRWWDYSMENDLEGAC"
## [2] "RWDEIRLVGDRFMHANNEIHAQSQLMVSQFHPECEPHAALSGTMIGCGRYKSG----TALNPPDFFSKEKLNITGTSKNTIDDERPSCQCEGHGFRHYGSYFYKYYQYLTQLRTIPQNQCMWQFTD-QTAQGDLDPVCPHLMRHSRKLWKCFLHFILWRNWAIRNYYSQ-N-G-DKGWSWMRVEAMWCDGTCHCTFFNDGWMYCIVSMANEKWEIPWKH-------ENSYPIHHATAMNEARIA-N--RNDNLSTTYRKAGPG--DTMVSLYARLQ--E-EF-QEHNHGAVSRTIAMTHVTSLPHLSSYYTEEIN------HNFHNWQLLTEDHTYADEIPYSRNITTAWLWRTSRIKGKIWQCL-CIFGDM-DTLRPYAF----NIEGAAYIRCAFYPNEDDCLDMWTYDNWEAAGCIEFCWDMADWHDAWELYRGKD-C----IWLFLV-KYLCFPNQEFKLMDIRLVACGCGNQEAKKQQMHYQIQEFLWTSSNLVNVHLCIKTHSPVQPKPCCMSWTMRPAMYWALQEYKEPSHNSVHMEASCYRWQMERTVVHSCPFGEWMIELFLQTNCHDDPEQECSQCTKSHEGMPALCY---L-VRGHKSPCTSEPLIDCCYHM-LKSCFAMLHVIRDWKPRGGEQAMDGHLI-PY-FN-TQLNFC-M--HITHMFA-Q-WVNVTWHPLFQEQDLSYVILLHNQQWTSNADRCYR---YCTQQFRNRLWRHAWVYAMIIMSCGWSKTSSDDDHNWQRKDMYAALYQWKEFTQPDHSCWDTNIDPPCDKQHMRYH-W---S-LC---------S-AR------PSMKMPCYRAYP-DH-APAR-HQ-DFMTHMLPGAWHLISSAFFSMSARRKTFWFICIKRKMSIQWALYRIFNFNLCNNFPKFYPLMMAHFVPFEDPFFHVFSNNSTGVRDYRINMFEGYTPCKPVRIFRWPDKMKDDLFSWSQWQHVMSWIE-Q-IFPEMAMRRWYVNLYVAYCIYFWMHYWSLPERTFMNVLSLEIQGHGYMCTRKNYLMRQENEQHRVTSMLMRDR-GHW-PVKFHNSLRK------RCCQARGS----RYLVMQEIPQAKFKGNNDAEFKFWYSRCYAQCACWKQRAESSYMPFHCGCMLLPHSVLIMSCERVIHQNGSLIGKRRE-T----MGEE-FNT-Q-L-D-RIHPQFETIDQYVLLWVAPYRPVMIPTNIHEHKN-N-------YEDAYFVTKQETGSVDCVLPYAKMFWLCELGE--VDAKIGTNEHDM-NREDRDFDLQISAHREWP--R-PSKIYDSYCECQQTCSSCWTTDKVSMEFFSLDHNTVLK-WLAHLKAFKRKWQGRVATMYYVPLLWPDKNAGSNHAIKVLCR-------LHDQECKEDIGLYEQAYGPLERDNKPGVFGAQWKAPQQDCTVGLYIQYKDGYFWHDQ-HMEVAIEQWFSL-GAQFLVFT-WRVIKKKGFHDSCIDVKDEPKQLYIILGVVFMTNHMIPP-C-L--L-FVTSDPAWPVFSEQMWQKDWLRKIWPRENGRHMFPI----ERFELHKRFAKCKSWLIKERYC-----ECIYVMTRVIYCGQRACEDERTCGLNFFCPQC-LYREENYYWIHGYRIYYQCSHRAN-SIRND-HRAMCPGTVD-MH--MYKMVPNAQFVRMGYKPEAKNHCED-VNLTDRVIFNVQIKVHDGQEGFYAVTFVAITWEACWTDHRPIVVQSHHRGQANENMKLGEANHPYLNATFGCQGGRVCIHPIKYGKPMHLDFTIQAKNHQRKWFGEMPQKQKGQYLFTWWFKDLGDDWRHQVHIWVTPTIYQHNIMGWGETRGMGP-LCCRQILING-IRYH-HAIALRFCCKCNCYINPRYIEAKGC----NHSLNAEGMFSGVQRPK------------NGMICIWPLFLHKQSWQSVLEI-------L----IMWMIQKKINV-A--E-A-WI-GW-NPTLPFMKYTDGNVAHSA--------F----AYWTVATWILAADSPTHWR-----------K-IH-HMQHSMIYGCAEAQRKRDEAVERWCRMHTFTFGDNVVTIDASISKKCPD-FM--M-EWNWREKGWQTSNEI---YIKGNEIGCSYVECCYIWEIFCVIEGEMLVSAEFKG--Y-------F-HGFEPFCGNCN--------NLGWCSVQLAPNCIKTAQCNARLTWAHG-LKVELVWRHGF-------------WDNLKVLGMETNYGGFHNCPMI--SQSIKYFKLMHVWHIHSNSNGECHHIFVNAYKWLLQWQCDYCMHRTPLVYIESYLRCHVDTNNDNEGGRQCKHSKCIRLRYS-NFF-P--V-FPMGMFYWAMQDKGPL--QVTFAASAFHT-NTC--YHDHA-QYCW--D---Y---GDIAE-E-VVDGCQYYQSFCDNNHQGWWFVAW--IKHWWTTTWDWYVVCELMWFNIQMLERKVQQVF---PG-V--IIWW--YLQISATYAQEKGRNPYPTFMTI-----YVH--------C--G-K-LYCAI--GQHIYIEMWCMSDIKRLQFKGIGCRNLMCYPWVKSCTCYWGMEIFTTQVFHSYQGQTESLPQFKHWLLNQCCPKQCMYFPIQMKMCHRCPSLKIAKAY-GFEARQHSLWWHKIVT---QAAFLNQVHWMQIFTHHSRKISYDCVNPTGCCCGVMIFEIIDEIQDSSGVQFEALVERFYQGSPLACEDKRLCDRAMTPFSLDCQQMQSWKDAEYFDHRFNLT-W--W--LVIKY---AYVPSATKGINRQMPVPMTGIKTSVCDWRQLHVNINSISG-V--KPVNTYEQVVW----YEMIPETLMWHPHEHCFFFNRTESLGWDPDCPEPTHEPRTPFWFGLRYTNEAYDKEIHISHAEHWAQPCRNLA-HDN-W-VMVC-Y-P-YEPLMWGFYTWTMTDHFRLDSCPQHWHICNHWCSETRTHLQWMFMVFSLNISEELAKGRQADNLLWALILNNPQFNKMEMKDMMLTNGCECGPA-CCA-WGKLWSIMQGLMRVKDIG-FMGWAWAKHSMYERICYCNDHVRDHPTLTRGWFEYTWLKIQTEIITVDHWRDCTYVQDVKHGEHGVVDAPGRCWKKPN--LKAKITDRMTMKFKDCWVDIMSDQKECWVPTELMSCFYRVCCNNHSGDRGAWHMTHQHAVRNDTWDKCIQATFFFSLMCGWPHWVNTGSIKPLPWKPQ----E-A--IVVSVYY--T--IPRTQMSRE---V----------HFEEYHDWCSLVWCEAIKHHSIPVLFCMVFQCL---SSLVFSVIDDIGYLGKECHQPWLVMTNALLWLFVEDVKYINMWGSLIHCPYCSMCHCETMDKYRYMSQMFEVNSEFQEGNMRMNMKSHGNCFTWGGMWYDLVHGSRQPNWGERCYPDVWYC-EYQIQSYERQ-KGKWQRSAHTYSEKNCVFDQFPVAWYKEMLQKWNWNYFVFLEAILFVCRKFA-------RPNVWKETEMRKTGEMN-SCCSEDADDIEFSKAGATNYPTFI--VIWKFGWMPQVAH-WREGTWVLEVRFTSFCAWSCAFC-AEDWWGCRFNVIQRCHESPDLF--------CKIVWGCGLVQFTN--YCLVSNDMMCTATHGLNWD--MQSHENSHQPITKFPYKHHEMFFMMHSCHTPCLEGCMDFQAYEMTKESDMQKNHCYGHLYKEW-ARFTHDCGGTKPSQVEQTENWAPVWGRMKKRVYCFNVHRWLSTSLVHVNPQMQWHDAFFEDW-EHRVHPFQMENFSENINAWRLLKKQPNTTNRGQYGSRIRIKRCHAD-KL-NQFYKADKWIMTVNEGGTVWKII-AVAQTLKQGYRANITRKTNKVFCHYLGNCPTADLPVNSGYQHIAENCVARICQVGVTWQGHCVIPYTCHVEENPVMFMLEHLWNVWCRDNHYLKKPYPMGEPHDSCRAMIALPRIKYTGFGPKFQLINWAGPVRDFKDGSDDLCDCPGWDWLGFPSCVYQHCANLVYKYYHWIRNPINLMGMPTTFVPRSSNFFYMFAEIPEWYMC---PYEIGEILR-SWVGQDHCYCNDIPFFHR-PI-R---NMHMACHRHNFMILLQEW--P-P--SLECPDPVHRPDDQMYDHFEVNTCSRPLDHAFHAFLELVFQCMRIRDNAKYKPICSYKVESREQGKVPWDWSMINLEIPVSRYPN-WIWKHTQEDYQDRHLSQEDQKEICKM-P----S---AYAEAIHAFYWANRILD---DSGIMEEEMLDRGMRCAHRTAGLYYSLDFQSKGYTHWLICFNHWTPIESVTDMFVVQWKWMIRWGTTDLWEYSFC---------ANHVNCHCNSFRQLSYALQ-P----Y----VEFVLTKCGYRRVLLSPL--Y-RTYMLGTLSNRTQYLDMLCRSVCDGCTTNFIMWFYVPWEFLKAYFIFAFYINIQTLDARPQEDNGKVFCMSLPEFMQTNEFAFLAMAPRMWRLWTFRMWYIVATPKETHLPITNSNDENTLIMSMARRKIECWEWTVLL-RISVCSERFRR-I-----M---C-----QDCSQYS--R--Q-HREAHWVCECPGEYLDIHNGFYPRTPPSSC-QDTGYIIY---LMMVCYTGGFELDESICSEDPHAQE-----INANSFMRERIPTFMKQTQQYGFGCQWGMTIMQIWAPCHVKFH--CLYLWQWQQHDPYFARMMMWPSQNADERHQED-VMCAYKLTELTSSYCSMSIYFDSHPSYEHLFNNDTQQQ-DPLIQNP-PSRKNRWWDYSMEVDLEGAC"