Global proteins aligments in R

By: Avril Coghlan.

Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).

Preliminaries

library(compbio4all)
library(Biostrings)

Download sequences

As we did in the previous lesson on dotplots, we’ll look at two sequences.

# Download
## sequence 1: Q9CD83
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                        id = "NP_001362987.1",
                         rettype = "fasta")
## sequence 2: OIN17619.1
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                         id = "NP_034550.2",
                         rettype = "fasta")

# clean
leprae_vector   <- fasta_cleaner(leprae_fasta)
ulcerans_vector <- fasta_cleaner(ulcerans_fasta)
ulcerans_vector[1:10]

##  [1] "M" "S" "P" "S" "L" "R" "E" "G" "A" "Q"

Pairwise global alignment of DNA sequences using the Needleman-Wunsch algorithm

If you are studying a particular pair of genes or proteins, an important question is to what extent the two sequences are similar.

To quantify similarity, it is necessary to align the two sequences, and then you can calculate a similarity score based on the alignment.

There are two types of alignment in general. A global alignment is an alignment of the full length of two sequences from beginning to end, for example, of two protein sequences or of two DNA sequences. A local alignment is an alignment of part of one sequence to part of another sequence; the parts the end up getting aligned are the most similar, and determined by the alignment algorithm.

The first step in computing a alignment (global or local) is to decide on a scoring system. For example, we may decide to give a score of +2 to a match and a penalty of -1 to a mismatch, and a penalty of -2 to a gap due to an indel. Thus, for the alignment:

## [1] "G A A T T C"

## [1] "G A T T - A"

we would compute a score of

G vs G = match = 2
A vs A = match = 2
A vs T = mismatch = -1
T vs T = match = 2
T vs - = gap = -2
C vs A = mismatch = -1

So, the scores is 2 + 2 -1 + 2 -2 - 1 = 2.

Similarly, the score for the following alignment is 2 + 2 -2 + 2 + 2 -1 = 5:

## [1] "G A A T T C"

## [1] "G A - T T A"

The scoring system above can be represented by a scoring matrix (also known as a substitution matrix). The scoring matrix has one row and one column for each possible letter in our alphabet of letters (e.g. 4 rows and 4 columns for DNA and RNA sequences, 20 x 20 for amino acids). The (i,j) element of the matrix has a value of +2 in case of a match and -1 in case of a mismatch.

We can make a scoring matrix in R by using the nucleotideSubstitutionMatrix() function in the Biostrings package. Biostrings is part of a set of R packages for bioinformatics analysis known as Bioconductor (www.bioconductor.org/).

The arguments (inputs) for the nucleotideSubstitutionMatrix() function are the score that we want to assign to a match and the score that we want to assign to a mismatch. We can also specify that we want to use only the four letters representing the four nucleotides (i.e.. A, C, G, T) by setting baseOnly=TRUE, or whether we also want to use the letters that represent ambiguous cases where we are not sure what the nucleotide is (e.g. ‘N’ = A/C/G/T; ambiguous cases occur in some sequences due to sequencing errors or ambiguities).

To make a scoring matrix which assigns a score of +2 to a match and -1 to a mismatch, and store it in the variable sigma, we type:

# make the matrix
sigma <- nucleotideSubstitutionMatrix(match = 2, 
                                      mismatch = -1, 
                                      baseOnly = TRUE)
# Print out the matrix
sigma

##    A  C  G  T
## A  2 -1 -1 -1
## C -1  2 -1 -1
## G -1 -1  2 -1
## T -1 -1 -1  2

Instead of assigning the same penalty (e.g. -8) to every gap position, it is common instead to assign a gap opening penalty to the first position in a gap (e.g. -8), and a smaller gap extension penalty to every subsequent position in the same gap.

The reason for doing this is that it is likely that adjacent gap positions were created by the same insertion or deletion event, rather than by several independent insertion or deletion events. Therefore, we don’t want to penalize a 3-letter gap (AAA—AAA) as much as we would penalize three separate 1-letter gaps (AA-A-A-AA), as the 3-letter gap may have arisen due to just one insertion or deletion event, while the 3 separate 1-letter gaps probably arose due to three independent insertion or deletion events.

For example, if we want to compute the score for a global alignment of two short DNA sequences ‘GAATTC’ and ‘GATTA’, we can use the Needleman-Wunsch algorithm to calculate the highest-scoring alignment using a particular scoring function.

The pairwiseAlignment() function in the Biostrings package finds the score for the optimal global alignment between two sequences using the Needleman-Wunsch algorithm, given a particular scoring system.

As arguments (inputs), pairwiseAlignment() takes

the two sequences that you want to align,
the scoring matrix,
the gap opening penalty, and
the gap extension penalty.

You can also tell the function that you want to just have the optimal global alignment’s score by setting scoreOnly = TRUE, or that you want to have both the optimal global alignment and its score by setting scoreOnly = FALSE.

For example, let’s find the score for the optimal global alignment between the sequences ‘GAATTC’ and ‘GATTA’.

First, we’ll store the sequences as character vectors:

s1 <- "GAATTC"
s2 <- "GATTA"

Possible ways to align these, from really bad to better include

# No overlap
"GAATTC-----"

## [1] "GAATTC-----"

"------GATTA"

## [1] "------GATTA"

# Overlab of 1 base - mismatch
"GAATTC-----"

## [1] "GAATTC-----"

"-----GATTA-"

## [1] "-----GATTA-"

# Overlap of entire sequence vs 1
## Matches = 
## Mismatches = 
## Indelx = 
"GAATTC"

## [1] "GAATTC"

"GATTA-"

## [1] "GATTA-"

# Overlap of entire sequence vs 2
## Matches = 
## Mismatches = 
## Indelx = 
"GAATTC"

## [1] "GAATTC"

"GAATTC"

## [1] "GAATTC"

"GA-TTA "

## [1] "GA-TTA "

Now we’ll align for real using pairwiseAlignment()

globalAligns1s2 <- Biostrings::pairwiseAlignment(s1, s2, 
                                                 substitutionMatrix = sigma, 
                                                 gapOpening = -8,
                                                 gapExtension = -2, 
                                                 scoreOnly = FALSE)

The output:

globalAligns1s2

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: GAATTC
## subject: GA-TTA
## score: -3

The above commands print out the optimal global alignment for the two sequences and its score.

Note we set gapOpening to be -2 and gapExtension to be -8, which means that the first position of a gap is assigned a score of -8 - 2= -10, and every subsequent position in a gap is given a score of -8. Here the alignment contains four matches, one mismatch, and one gap of length 1, so its score is (42)+(1-1)+(1*-10) = -3.

Pairwise global alignment of protein sequences using the Needleman-Wunsch algorithm

As well as DNA alignments, it is also possible to make alignments of protein sequences. In this case it is necessary to use a scoring matrix for amino acids rather than for nucleotides.

Protein scoring matrices

There are several well known scoring matrices that come with R, such as the BLOSUM series of matrices. Different BLOSUM matrices exist, named with different numbers. BLOSUM with high numbers are designed for comparing closely related sequences, while BLOSUM with low numbers are designed for comparing evolutionarily distantly related sequences. For example, BLOSUM62 is used for less divergent alignments (alignments of sequences that differ little / have less evolutionary distance between them), and BLOSUM30 is used for more divergent alignments (alignments of sequences that differ a lot / their common ancestor is further in the past). If I wanted to study the evolution of the sequence of hShroom3 in primates, I’d use BLOSUM62. If I wanted to study it in animals such as primates AND jelly fish, whose common ancestor diverged from us far in the past, I’d use BLOSUM30.

Many R packages come with example data sets or data files and you use the data() function is used to load these data files. You can use the data() function to load a data set of BLOSUM matrices that comes with Biostrings

To load the BLOSUM50 matrix, we type:

# NOTE: use data() to load the BLOSUM matrices
data(BLOSUM50)

BLOSUM50 # Print all of the  the data

##    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
## A  5 -2 -1 -2 -1 -1 -1  0 -2 -1 -2 -1 -1 -3 -1  1  0 -3 -2  0 -2 -1 -1 -5
## R -2  7 -1 -2 -4  1  0 -3  0 -4 -3  3 -2 -3 -3 -1 -1 -3 -1 -3 -1  0 -1 -5
## N -1 -1  7  2 -2  0  0  0  1 -3 -4  0 -2 -4 -2  1  0 -4 -2 -3  4  0 -1 -5
## D -2 -2  2  8 -4  0  2 -1 -1 -4 -4 -1 -4 -5 -1  0 -1 -5 -3 -4  5  1 -1 -5
## C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 -3 -3 -2 -5
## Q -1  1  0  0 -3  7  2 -2  1 -3 -2  2  0 -4 -1  0 -1 -1 -1 -3  0  4 -1 -5
## E -1  0  0  2 -3  2  6 -3  0 -4 -3  1 -2 -3 -1 -1 -1 -3 -2 -3  1  5 -1 -5
## G  0 -3  0 -1 -3 -2 -3  8 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4 -1 -2 -2 -5
## H -2  0  1 -1 -3  1  0 -2 10 -4 -3  0 -1 -1 -2 -1 -2 -3  2 -4  0  0 -1 -5
## I -1 -4 -3 -4 -2 -3 -4 -4 -4  5  2 -3  2  0 -3 -3 -1 -3 -1  4 -4 -3 -1 -5
## L -2 -3 -4 -4 -2 -2 -3 -4 -3  2  5 -3  3  1 -4 -3 -1 -2 -1  1 -4 -3 -1 -5
## K -1  3  0 -1 -3  2  1 -2  0 -3 -3  6 -2 -4 -1  0 -1 -3 -2 -3  0  1 -1 -5
## M -1 -2 -2 -4 -2  0 -2 -3 -1  2  3 -2  7  0 -3 -2 -1 -1  0  1 -3 -1 -1 -5
## F -3 -3 -4 -5 -2 -4 -3 -4 -1  0  1 -4  0  8 -4 -3 -2  1  4 -1 -4 -4 -2 -5
## P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3 -2 -1 -2 -5
## S  1 -1  1  0 -1  0 -1  0 -1 -3 -3  0 -2 -3 -1  5  2 -4 -2 -2  0  0 -1 -5
## T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0  0 -1  0 -5
## W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1  1 -4 -4 -3 15  2 -3 -5 -2 -3 -5
## Y -2 -1 -2 -3 -3 -1 -2 -3  2 -1 -1 -2  0  4 -3 -2 -2  2  8 -1 -3 -2 -1 -5
## V  0 -3 -3 -4 -1 -3 -3 -4 -4  4  1 -3  1 -1 -3 -2  0 -3 -1  5 -4 -3 -1 -5
## B -2 -1  4  5 -3  0  1 -1  0 -4 -4  0 -3 -4 -2  0  0 -5 -3 -4  5  2 -1 -5
## Z -1  0  0  1 -3  4  5 -2  0 -3 -3  1 -1 -4 -1  0 -1 -2 -2 -3  2  5 -1 -5
## X -1 -1 -1 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -1  0 -3 -1 -1 -1 -1 -1 -5
## * -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1

We can examine part of the matrix like this

# row 1, columns 1 to 10
BLOSUM50[1, 1:10]

##  A  R  N  D  C  Q  E  G  H  I 
##  5 -2 -1 -2 -1 -1 -1  0 -2 -1

You can get a list of the available scoring matrices that come with the Biostrings package by using the data() function, which takes as an argument the name of the package for which you want to know the data sets that come with it:

data(package="Biostrings")

Another well-known series of scoring matrices are the PAM matrices developed by Margaret Dayhoff and her team. These have largely been replaced by BLOSUM but are important for historical reasons because PAM matrices represent one of the first major bioinformatics, computational biology, and phyolgenetics projects ever.

Example protein alignment

Let’s find the optimal global alignment between the protein sequences “PAWHEAE” and “HEAGAWGHEE” using the Needleman-Wunsch algorithm using the BLOSUM50 matrix.

First, load the scoring matrix BLOSUM50 and make vectors for the sequence

# matrix
data(BLOSUM50)

# sequences
s3 <- "PAWHEAE"
s4 <- "HEAGAWGHEE"

Now do the alignments.

globalAligns3s4 <- pairwiseAlignment(s3, s4,
                                     substitutionMatrix = "BLOSUM50", 
                                     gapOpening = -2,
                                     gapExtension = -8, 
                                     scoreOnly = FALSE)

Look at the results:

globalAligns3s4 # Print out the optimal global alignment and its score

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: P---AWHEAE
## subject: HEAGAWGHEE
## score: -5

The output from pairwiseAlignment() is pretty simple and only indicates indels (“-”) and the sequences themselves. In other programs like Needle and BLAST you’ll get a richer output that indicates direct matches, relatively similar positions that are mismatches, and disimilar mismatches. For example, Needle uses “|” to indicate direct matches, colons “:” to indicate chemically amino acids / evolutionarily favorable mismatches, and periods “.” to indicate chemically dis-disimilar / evolutionarily less common mismatches. The output above in Needle-like format would be:

# pattern: P---AWHEAE
#          .---||:::|  
# subject: HEAGAWGHEE

We set gapOpening to be -2 and gapExtension to be -8, which means that the first position of a gap is assigned a score of -8-2=-10, and every subsequent position in a gap is given a score of -8. This means that the gap will be given a score of -10-8-8 = -26.

Aligning UniProt sequences

We discussed previously how you can search for UniProt accessions and retrieve the corresponding protein sequences, either via the UniProt website or using the rentrez package.

In the examples given above, we learned how to retrieve the sequences for the chorismate lyase proteins from Mycobacterium leprae (UniProt Q9CD83) and Mycobacterium ulcerans (UniProt A0PQ23), and read them into R, and store them as vectors lepraeseq and ulceransseq.

You can align these sequences using pairwiseAlignment() from the Biostrings package.

As its input, the pairwiseAlignment()function requires that the sequences be in the form of a single string (e.g. “ACGTA”), rather than as a vector of characters (e.g. a vector with the first element as “A”, the second element as “C”, etc.). Therefore, to align the M. leprae and M. ulcerans chorismate lyase proteins, we first need to convert the vectors lepraeeq and ulceransseq into strings. We can do this using the paste() function:

# convert leprae_vector to an object lepraeseq_string
lepraeseq_string <-paste(leprae_vector,collapse = "")    

# convert ulcerans_vector to an object ulceransseq_string
ulceransseq_string <-paste(ulcerans_vector,collapse = "")

Furthermore, pairwiseAlignment() requires that the sequences be stored as uppercase characters. Therefore, if they are not already in uppercase, we need to use the toupper() function to convert lepraeseq_string and ulceransseq_string to uppercase:

lepraeseq_string   <- toupper(lepraeseq_string)
ulceransseq_string <- toupper(ulceransseq_string)

Check the output

lepraeseq_string # Print out the content of "lepraeseq_string"

## [1] "MSPSLQEGAQLGENKPSTCSFSIERILGLDQKKDCVPLMKPHRPWADTCSSSGKDGNLCLHVPNPPSGISFPSVVDHPMPEERASKYENYFSASERLSLKRELSWYRGRRPRTAFTQNQIEVLENVFRVNCYPGIDIREDLAQKLNLEEDRIQIWFQNRRAKLKRSHRESQFLMAKKNFNTNLLE"

We can now align the the M. leprae and M. ulcerans chorismate lyase protein sequences using the pairwiseAlignment() function:

globalAlignLepraeUlcerans <- Biostrings::pairwiseAlignment(lepraeseq_string, 
                                               ulceransseq_string,
                                               substitutionMatrix = BLOSUM50, 
                                               gapOpening = -2, 
                                               gapExtension = -8, 
                                               scoreOnly = FALSE)

The output:

globalAlignLepraeUlcerans # Print out the optimal global alignment and its score

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MSPSLQEGAQLGENKPSTCSFSIERILGLDQKKD...IQIWFQNRRAKLKRSHRESQFLMAKKNFNTNLLE
## subject: MSPSLREGAQLRESKPAPCSFSIESILGLDQKKD...IQIWFQNRRAKMKRSRRESQFLMAKKPFNPDLLK
## score: 1014

As the alignment is very long, when you type globalAlignLepraeUlcerans, you only see the start and the end of the alignment. Therefore, we need to have a function to print out the whole alignment (see below).

Viewing a long pairwise alignment

If you want to view a long pairwise alignment such as that between the M. leprae and M. ulerans chorismate lyase proteins, it is convenient to print out the alignment in blocks.

The compbio4all (originally by Avril Coghlan) function print_pairwise_alignment() below will do this for you:

print_pairwise_alignment(globalAlignLepraeUlcerans, 60)

## [1] "MSPSLQEGAQLGENKPSTCSFSIERILGLDQKKDCVPLMKPHRPWADTCSSSGKDGNLCL 60"
## [1] "MSPSLREGAQLRESKPAPCSFSIESILGLDQKKDCTTSVRPHRPWTDTCGDSEKGGNPPL 60"
## [1] " "
## [1] "HVPNPPSGISFPSVVDHPMPEERASKYENYFSASERLSLKRELSWYRGRRPRTAFTQNQI 120"
## [1] "HAPDLPSETSFPCPVDHPRPEERAPKYENYFSASETRSLKRELSWYRGRRPRTAFTQNQV 120"
## [1] " "
## [1] "EVLENVFRVNCYPGIDIREDLAQKLNLEEDRIQIWFQNRRAKLKRSHRESQFLMAKKNFN 180"
## [1] "EVLENVFRVNCYPGIDIREDLAQKLNLEEDRIQIWFQNRRAKMKRSRRESQFLMAKKPFN 180"
## [1] " "
## [1] "TNLLE 240"
## [1] "PDLLK 240"
## [1] " "

The position in the protein of the amino acid that is at the end of each line of the printed alignment is shown after the end of the line. For example, the first line of the alignment above finishes at amino acid position 50 in the M. leprae protein and also at amino acid position 60 in the M. ulcerans protein. Because there as a difference of 60-50 = 10 bases, there must be 10 insertions in the M. leprae to get it to line up. Count the number of dashes in the sequence to see how many there are.

Alignment statistics

Alignment statistics can be accessed using score() and pid()

score(globalAlignLepraeUlcerans)

## [1] 1014

pid(globalAlignLepraeUlcerans)

## [1] 80.54054

There is more than one way to define PID. You can get different formulations by consulting the help file for pid() and selecting the one you want with the type = arguement, e.g.

pid(globalAlignLepraeUlcerans, type = "PID2")

## [1] 80.54054

pid(globalAlignLepraeUlcerans, type = "PID3")

## [1] 80.54054

pid(globalAlignLepraeUlcerans, type = "PID4")

## [1] 80.54054