Overview

In todays workshop we will be looking to:

We will also see how to

Simple pairwise alignment - Global and local

You’ve by now done some alignments by hand and that takes a while, even for short sequences. Fortunately, researchers have developed computer programs to do this for you. Our goal here will be to use the EBI web interface to perform a global alignment using an implementation of the Needleman-Wunsch algorithm and a local alignment using an implementation of the Smith-Waterman algorithm.

Direct your browser to the EBI Tools page for pairwise sequence alignment (PSA): https://www.ebi.ac.uk/Tools/psa/

The following exercises will use the following sequences:

>NP_005359.1 myoglobin isoform 1 [Homo sapiens]
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVL
TALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFR
KDMASNYKELGFQG
>NP_000509.1 hemoglobin subunit beta [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

Global alignment

We’ll start by aligning using the program Needle which is an implementation of the Needleman-Wunsch algorithm called EMBOSS Needle.

First, select the type of sequence being aligned (PROTEIN/DNA). After entering your two sequences, go ahead and submit your job. As you will see, this takes a minute or two since you are using public computing resources. Take a look at your output and make note. Would you say these sequences look similar?

Now, let’s go back to the input form (link at the top of the page). Input the sequences again. This time, let’s adjust some parameters and see how that might influence our alignment. Let’s change our scoring matrix to BLOSUM30 and reduce the gap open/end penalties to 1. Go ahead and submit. What is something you notice that is very different in this alignment?

Local alignment

Local alignment is a much more powerful tool to search databases, detect protein domains, and assess similarity between distant homologous genes/proteins. Here, we’ll use the EBI interface for an implementation of the Smith-Waterman algorithm called EMBOSS Water.

First, let’s input our sequences and submit our alignment request using the default parameters. You might notice that there doesn’t seem to be much difference between this alignment and our first alignment above. Why?

Next, let’s adjust some parameters. Let’s reduce our gap opening penalty to 1 but increase the gap extension penalty to 10. Select the BLOSUM90 scoring matrix. What can you conclude about the utility of this alignment?

Now, let’s try using a different local alignment method. This time, select EMBOSS Matcher. Input the sequences as before, but in the advanced options select the BLOSUM30 matrix and increase alternative matches to 5. Submit and let’s examine your output. Why are the alternative matches different sizes?

Using NCBI BLAST to search databases

The following exercise is adapted from Elizabeth A. Kellogg - Donald Danforth Plant Science Center

Both NCBI and all the model organism databases can be searched using a sequence instead of a gene name. This is handy because the vast majority of genes in the vast majority of organisms do not have names. The program that is used to search databases is known as BLAST, an acronym for Basic Local Alignment Search Tool. To generalize, we will be using three things: a) a query sequence, b) BLAST, and c) a database. In this case, the query is something we’ll pretend you’ve just sequenced in your lab, and the database is NCBI.

The NCBI web site is huge and complex. One colleague described exploring this site as similar to looking around the Smithsonian at night with a candle. Imagine that someone has just sent you some PCR primers that supposedly amplify “Your Favorite Gene”. You’ve amplified your PCR product and to check whether it’s the right thing, you’ve sequenced it. The resulting sequence is shown below in FASTA format. In the following computer exercise, you will take that sequence and run a BLAST search against GenBank. After you get the result, I’ll explain a bit more about what you just did.

>PCRseq1
AAAGAGCTTCCATCACGGACCAGATCCCGAATTCTTCTTCTGACCACCGAACGATTGAGT
TTCTCCGATCAGATCTCAGTTTCTGGGAATAATTCGAAGTGAAAAAATGGCGCAGCATCT
GTTGCACGGGACTTTACATGCTACCATCTATGAAGTTGATGCCCTCCATGGTGGTGGTGT
TAGGCAAGGCTTCCTTGGCAAGATTCTGGCAAATGTAGAAGAGACGATTGGTGTTGGTAA
AGGAGAAACACAGTTGTATGCGACGATTGATCTGCAAAAAGCTAGAGTTGGGAGAACCAG
GAAGATCAAAAATGAACCTAAGAACCCAAAGTGGTATGAGTCGTTTCATATTTACTGTGC
TCACTTGGCTTCTGATATCATCTTCACTGTTAAAGATGATAATCCCATTGGAGCTACCCT
TATCGGAAGAGCTTACATTCCTGTTGATCAAGTCATTAACGGCGAGGAAGTGGATCAGTG
GGTTGAGATCTTGGATAATGACAGAAACCCTATTCAGGGAGGATCAAAGATTCATGTCAA
GCTTCAATATTTCCATGTTGAGGAGGATCGTAACTGGAACATGGGTATCAAAAGTGCCAA
GTTCCCTGGAGTGCCATACACATTCTTCTCGCAGAGACAAGGCTGCAAAGTTTCTCTGTA
CCAAGATGCTCATATTCCAGACAACTTTGTCCCTAGAATTCCTCTCGCTGGAGGGAAGAA
CTATGAGCCTCAAAGATGTTGGGAGGATATTTTTGATGCTATTAGCAATGCAAAACACTT
GATCTACATTACTGGTTGGTCTGTTTACGCTGAGATTGCTTTAGTGAGGGACTCGAGGAG
GCCTAAGCCTGGAGGTGATGTGACCATTGGTGAGCTACTCAAGAAGAAGGCTAGTGAAGG
TGTCAGGGTTCTTTTGCTTGTTTGGGATGACAGAACTTCTGTTGATGTGCTGAAGAAAGA
TGGGCTCATGGCTACTCATGATGAAGAGACCGAGAATTTCTTCAGGGGAAGTGATGTCCA
TTGTATTCTGTGCCCTCGTAACCCGGATGACGGTGGTAGCATAGTCCAAAGTTTGCAGAT
CTCTACTATGTTCACGCATCATCAGAAAATCGTTGTTGTGGACAGCGAGATGCCAAGCAG
AGGAGGATCAGAAATGAGGAGAATTGTGAGTTTTGTTGGCGGTATTGATCTTTGTGATGG
AAGATACGACACTCCGTTCCACTCCTTGTTCAGGACATTGGACACAGTCCACCATGATGA
CTTCCATCAACCTAACTTCACTGGTGCTGCTATCACTAAAGGTGGTCCAAGGGAGCCTTG
GCATGACATTCACTCCCGTCTTGAAGGTCCAATTGCTTGGGATGTCATGTACAACTTCGA
GCAGAGATGGAGCAAGCAGGGTGGTAAAGACATTCTGGTTAAGTTGAGAGATCTTAGTGA
TATTATTATCACCCCTTCTCCTGTTATGTTCCAAGAGGACCACGATGTGTGGAATGTCCA
ATTGTTTAGGTCCATTGATGGAGGAGCTGCTGCTGGGTTTCCCGAGTCGCCTGAAGCTGC
TGCGGAAGCCGGGCTTGTAAGTGGGAAAGATAACATCATTGATAGGAGTATCCAAGATGC
TTACATTCATGCAATCAGACGTGCTAAGGATTTCATCTACGTTGAAAACCAGTACTTCCT
TGGGAGTTCTTTTGCTTGGGCAGCCGATGGTATTACTCCTGAGGACATCAATGCCCTGCA
CTTAATCCCAAAAGAGTTGTCGCTGAAGATAGTTAGCAAGATTGAGAAAGGAGAGAAGTT
CAGGGTCTATGTTGTGGTTCCAATGTGGCCAGAAGGTCTCCCAGAGAGTGGATCAGTGCA
AGCTATATTAGACTGGCAGAGGAGGACCATGGAGATGATGTACAAGGATGTGATTCAGGC
TCTCAGGGCCCAGGGTCTTGAGGAAGATCCAAGAAACTATCTGACATTCTTCTGTCTTGG
AAACCGTGAGGTCAAGAAAGATGGAGAGTATGAGCCTGCTGAGAAACCAGACCCCGACAC
TGATTACATGAGGGCGCAAGAAGCACGCCGTTTCATGATTTACGTCCACACCAAAATGAT
GATCGTTGACGATGAATACATTATCATTGGGTCTGCTAACATCAACCAGAGGTCAATGGA
CGGTGCAAGAGACTCTGAGATAGCAATGGGAGGTTATCAACCACATCACTTGTCCCATAG
ACAACCAGCTCGTGGCCAGATCCATGGGTTTCGTATGTCACTCTGGTACGAACACCTGGG
AATGCTCGATGAAACCTTCCTCGATCCATCAAGCTTGGAATGCATTGAGAAAGTTAACCG
CATTTCTGACAAGTATTGGGACTTTTACTCAAGTGAGTCACTCGAACATGACCTTCCTGG
TCACTTGCTCCGCTACCCGATCGGTGTAGCCAGCGAAGGCGACATCACTGAGCTTCCAGG
ATTTGAATTCTTCCCGGACACAAAGGCCCGTATCCTCGGCACCAAATCAGACTACCTGCC
TCCAATCCTTACAACCTAATCTCACTAAGCATGTCAAGTAATGATCTCTCTCTCCCTCTC
TGCTTTGCTGCTGTTGTAGCTTTGAATAAAACTTGAGTGTCTACCTTTAGAATTAAGAAG
TCAAATGGTTGTTATGATGATGCACTTCTTTACCCCTTTGGTTTTTATATTCGTACAATG
ACGTGGTGAGAGAATGTAGCTTTGTGATCTTGTTTTGTTGTTGTTATGTACCTTTGGACT
TATGAATCTTATATCTGATCCTTTTCTTTTTATTTAGTGTGTTTCACATC

Computer exercise 1. Really Basic BLAST Search.

Start by going to https://www.ncbi.nlm.nih.gov/ . From the NCBI home page, click on BLAST in the right column labeled Popular Resources. This will get you to a page labeled BLAST. Because the thing you’ve just amplified is a DNA sequence, you’ll want to compare it to other nucleotide sequences. Accordingly, click on the box labeled nucleotide blast. This gets you to a page called BLAST/blastn suite. Copy the PCR sequence from above and paste it in the box at the top of the BLAST page, where it says “Enter accession number, gi, or FASTA sequence”.

Next look at the box labeled “Choose Search Set”, a little farther down the page. Using the drop down menu, choose “Others” for the database and “Nucleotide collection (nr/nt)”. Continuing down the page, look at the radio buttons under Program Selection, and choose “highly similar sequences (megablast)”. (This may already be chosen, since it is the default.)

Now click on the large blue button labeled BLAST, in the lower left corner. You will be taken (automatically) to a new screen that shows your results. Under “Graphic Summary” will be the title “Distribution of the top 110 Blast Hits,” followed by a box with colored lines in it. Under “Descriptions” is a table of “Sequences producing significant alignments.” The first line gives you information on the best match to your input sequence, and is often referred to as the top blast hit (for obvious reasons). What is it? Under “Taxonomy” you will see a list of the species that had sequences similar to your query sequence; this information is displayed in three different ways. How many different organisms (species) had sequences similar to your query? If some of the names are unfamiliar to you, you can look under the Lineage Report; each Latin name is a hyperlink that will take you to a page with more information about the species.

Leave the BLAST screen open during the lecture, since we’ll come back to it.

Computer exercise 2. A little more sophisticated use of BLAST. The effect of changing the search algorithms, the database, and the algorithm parameters. Producing a distance tree.

Changing the blast algorithm

Let’s play with the BLAST search you just did. First, look again at the list of Other Reports just above the Graphic Summary. This time click on Search Summary. You should now know what is meant by the various Search Parameters listed in the upper box. What is meant by the Gapcosts numbers?

Return to the page where you input the sequence by clicking “Edit Search” at the top of the page. Change the radio button from “highly similar sequences (megablast)” to “somewhat similar sequences (blastn)”. Click on Algorithm parameters at the bottom to open up the list of General Parameters, then set the Max target sequences to 250. Click on the BLAST button. What happens to the number of Blast hits? To the Search Parameters (under Search Summary)? To the number of species (under Taxonomy Reports)?

Changing the database

We won’t do a lot of this, since it could take all day. Return to BLAST Home by clicking on “Edit Search”. Leaving the radio button on “somewhat similar sequences”, go up to the drop down menu labeled “Database”. Choose “expressed sequence tags (est)”. Run BLAST. What do you notice about the red lines in the Graphic Summary, relative to what you saw in the earlier searches? Why do you think they are different? Just so you can see what happens, redo the search, but under Database, choose “Human genomic + transcript”. How many blast hits do you get? What is their alignment score and the percent query coverage? What are the E-values, and what do they tell you about the results?

Changing the algorithm parameters

Return to BLAST Home by clicking on “Edit Search” and set the Program selection to “somewhat similar sequences (blastn)”. Click on “Algorithm parameters” so the screen expands and you can see boxes labeled “General parameters”, “Scoring parameters”, and “Filters and masking.” We will play with just a few of these.

  • General parameters. Recall that the Expect value is the number of matching sequences you would expect to find at random in a database of this size, given the length of your sequence and the scoring parameters. For sequences for which good matches are available, changing the Expect threshold value does not make a great deal of difference in the results. (You can verify this if you want by typing in a higher or lower Expect threshold into the appropriate window.) However, for distantly related sequences the Expect threshold value does matter. As an experiment, go back to BLAST Home and look at the bottom of the page where it says “BLAST Genomes”. Enter Physcomitrella patens in the search box and submit. This gets you to a BLAST page. Paste in PCRseq1, then choose Somewhat similar sequences, then open the Algorithm parameters and verify that the Expect threshold is set to 10 and the word size to 11. Click on BLAST. How many hits do you get and how much of the query sequence do the best ones cover? Now do the same search but set the Expect threshold to 100. How does the E-value affect what you retrieve? What are the circumstances under which you might want to use a higher or lower E-value?
  • Scoring parameters. Return to the standard nucleotide BLAST with the database as Nucleotide collection and paste in PCRseq1. Now look at the drop-down menu called “Match/Mismatch scores”. This controls how many points are awarded for a match in an alignment, and how many points are subtracted for a mismatch. The default depends on the Program Selection. Click the radio button for Highly similar sequences, and the default for Match/Mismatch scores are 1 point for a match, and minus 2 points for a mismatch. What happens to the Match/Mismatch scores when you click the radio button for More dissimilar sequences? What other parameters change as well? Why do you think this happened? Chose Somewhat similar sequences (blastn), for which Match/Mismatch is 2, -3. Change these values to 4 and -5. What does this do the output of the search? Note that the colored lines in the Graphic Summary are coded by alignment score, so you can see many of the differences just by looking at the colors. Return the match/mismatch values to 1, -2. We’ll play with Gap Costs in the context of multiple alignments, where it is easier to see their effects.

The distance tree

Redo the Megablast search that you did in exercise I. This should be using the nucleotide database, search on highly similar sequences, and default algorithm parameters. Under “other reports”, click on the hyperlink called “Distance tree of results”. A tree will appear in the window with your original query sequence highlighted in yellow. (Note: The tree window does not display properly and the menus do not work well in Apple Safari, should you ever be inclined to try it.)

You will notice that there are not 100 branches in the tree; this is because some of them have been “collapsed” to make the tree easier to look at. For example, you may see a branch ending in a green triangle labeled “eudicots|25 leaves”. In this case, the word “leaves” is the computer science equivalent of the word “branch.” If you mouse over the green triangle, you will get a menu that will allow you to choose “show subtree”. If you choose this, the names of the 25 eudicot sequences will appear. What species are they?

Use the browser back button to return to the original tree. Near the top of the window are several drop-down menus. The first is Tree method, and is set at Fast Minimum Evolution. This indicates the algorithm used to produce the tree; we’ll talk more about tree construction algorithms (there are many of them!) later. In the column to the right of the tree is a drop-down menu called Sequence Label, which is set at Sequence Title. This is purely cosmetic, and lets you choose how the sequences are named. Sequence Title is the most informative, but it is also quite long so can make the tree hard to read. Try the different options to see what happens. Sequence ID gives you the unique GenBank ID number for the sequence, but can be fairly hard to interpret unless you are very good at remembering intricate numbers. Taxonomic name gives you the genus, species, and variety (if there is one designated). Blast name gives you the higher taxonomic group – in this case all the sequences are from eudicots. Sequence ID (Blast name) simply gives you the ID number plus the Blast name.

Just above the tree window is a “Tools” dropdown. The layout tool allows you to choose your favorite way of drawing the tree. The rectangular and slanted formats are drawn in such a way that the tree appears to be rooted (term to be defined in the next lecture), whereas the radial and force trees are clearly unrooted. Other than that, the differences among the formats are also purely cosmetic and do not change any of the information in the tree. Click through them to see what happens. You may want to change the Sequence Label to Sequence ID just so the names are not so long.

To try a protein sequence

Choose protein BLAST with non-redundant protein sequences and blastp using the following sequence.

>proteinseq1
MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLG
LKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYT
CAGYPGSLGREEEDAQFFANNRVDYLKYDNCYNKGQFGTPEISYHRYKAMSDALNKTGRP
IFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIM
NILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLK
ASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALL
NGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGIL
YNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSS

Using R to perform sequence alignments

The following exercise is adapted from Avril Coghlan - Sanger Institute

There are a variety of packages that enable R to perform sequence alignment. We’ll be relying on a pair of packages Biostrings and seqinr. Both of these packages have extensive functionality for downloading, manipulating, and analyzing protein and DNA sequences. We’ll just be scratching the surface.

R Packages used in today’s exercises

Let’s go ahead and load the packages we need.

library(Biostrings)
library(seqinr)

Downloading our sequences from SwissProt/UniProtKB

You may recall we used rentrez to interface with the NCBI database. Like many things in the R ecosystem, there are many packages to perform similar tasks. Much of this comes down to what serves your purposes most effectively. For example, the seqinr packages allows us to connect to a variety of different databases and make queries. This is particularly useful if you know the accession numbers of sequences you are looking to download.

Let’s see just what seqinr let’s us communicate with. The choosebank command is how we initiate communication with different databases.

choosebank()
##  [1] "genbank"         "embl"            "emblwgs"         "genbankseqinr"  
##  [5] "swissprot"       "ensembl"         "hogenom7dna"     "hogenom7"       
##  [9] "hogenom"         "hogenomdna"      "hovergendna"     "hovergen"       
## [13] "hogenom5"        "hogenom5dna"     "hogenom4"        "hogenom4dna"    
## [17] "homolens"        "homolensdna"     "hobacnucl"       "hobacprot"      
## [21] "phever2"         "phever2dna"      "refseq"          "refseq16s"      
## [25] "greviews"        "bacterial"       "archaeal"        "protozoan"      
## [29] "ensprotists"     "ensfungi"        "ensmetazoa"      "ensplants"      
## [33] "ensemblbacteria" "mito"            "polymorphix"     "emglib"         
## [37] "refseqViruses"   "ribodb"          "taxodb"

We’ll be using the SwissProt/UniProtKB database download two protein sequences for our analysis. Each of these query requests opens a “socket” to the UniProt database. This needs to be active to extract our information.

choosebank("swissprot")
leprae <- query("leprae", "AC=Q9CD83")
ulcerans <- query("ulcerans", "AC=A0PQ23")

Now take a look at what we have… you’ll see this isn’t exactly helpful. This is our socket.

leprae
## 1 SQ for AC=Q9CD83

We need to ask the database for the sequence.

lepraeseq <- getSequence(leprae)
ulceransseq <- getSequence(ulcerans)

Now, look… our sequence!

lepraeseq
## [[1]]
##   [1] "M" "T" "N" "R" "T" "L" "S" "R" "E" "E" "I" "R" "K" "L" "D" "R" "D" "L"
##  [19] "R" "I" "L" "V" "A" "T" "N" "G" "T" "L" "T" "R" "V" "L" "N" "V" "V" "A"
##  [37] "N" "E" "E" "I" "V" "V" "D" "I" "I" "N" "Q" "Q" "L" "L" "D" "V" "A" "P"
##  [55] "K" "I" "P" "E" "L" "E" "N" "L" "K" "I" "G" "R" "I" "L" "Q" "R" "D" "I"
##  [73] "L" "L" "K" "G" "Q" "K" "S" "G" "I" "L" "F" "V" "A" "A" "E" "S" "L" "I"
##  [91] "V" "I" "D" "L" "L" "P" "T" "A" "I" "T" "T" "Y" "L" "T" "K" "T" "H" "H"
## [109] "P" "I" "G" "E" "I" "M" "A" "A" "S" "R" "I" "E" "T" "Y" "K" "E" "D" "A"
## [127] "Q" "V" "W" "I" "G" "D" "L" "P" "C" "W" "L" "A" "D" "Y" "G" "Y" "W" "D"
## [145] "L" "P" "K" "R" "A" "V" "G" "R" "R" "Y" "R" "I" "I" "A" "G" "G" "Q" "P"
## [163] "V" "I" "I" "T" "T" "E" "Y" "F" "L" "R" "S" "V" "F" "Q" "D" "T" "P" "R"
## [181] "E" "E" "L" "D" "R" "C" "Q" "Y" "S" "N" "D" "I" "D" "T" "R" "S" "G" "D"
## [199] "R" "F" "V" "L" "H" "G" "R" "V" "F" "K" "N" "L"

Input sequences directly from FASTA files

We can also read in FASTA files instead of connecting to databases. Note, that the location of the file in the read.fasta() command is right here on the server. This is the likely the most common method you will use.

lepraefa <- read.fasta("/opt/BIOL5436/Q9CD83.fa") #reads in the file
## Warning in readLines(file): incomplete final line found on '/opt/BIOL5436/
## Q9CD83.fa'
lepraefaseq <- getSequence(lepraefa) #extract our amino acid sequence from the data structure
ulceransfa <- read.fasta("/opt/BIOL5436/A0PQ23.fa")
## Warning in readLines(file): incomplete final line found on '/opt/BIOL5436/
## A0PQ23.fa'
ulceransfaseq <- getSequence(ulceransfa)

Creating a dotplot from primary sequence data

Often one thing you may want to do with a pairwise alignment is to make a dotplot. A dotplot is a graphical representation of a one-to-one comparison between all residues in your amino acid/DNA sequence. This is a lot like the alignment matrix we made before, except without numbers. This can be easily produced using the Biostrings package. Make the the plot and take a look in the Plots window on the lower right. This should help give us an intuitive feel for how similar our sequences may be or any local alignments that may be significant.

dotPlot(lepraefaseq[[1]], ulceransfaseq[[1]])

Simple nucleotide alignments

We’ve gotten some sequence data into R and we’ve done something with it. Let’s move forward and try and make an alignment. Our first example will look to make some DNA sequence alignments using the pairwiseAlignment() command from the Biostrings package. First, we’ll need to read in our files. Then, establish a scoring matrix. Finally, we’ll make our alignment and display it.

salseq <- getSequence(read.fasta("/opt/BIOL5436/soxS_Sal.fa")) #layered commands to input sequences faster
## Warning in readLines(file): incomplete final line found on '/opt/BIOL5436/
## soxS_Sal.fa'
colseq <- getSequence(read.fasta("/opt/BIOL5436/soxS_Ecoli.fa"))
## Warning in readLines(file): incomplete final line found on '/opt/BIOL5436/
## soxS_Ecoli.fa'
scormat <- nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE) #create our scoring matrix for use in alignment
scormat
##    A  C  G  T
## A  2 -1 -1 -1
## C -1  2 -1 -1
## G -1 -1  2 -1
## T -1 -1 -1  2

Now, let’s proceed to alignment after we’ve gotten our sequences input and scoring matrix defined. Here, we will do a global alignment using our scoring scheme with a affine gap penalty scheme. Note, first we have to convert our sequence array (all characters separate) into a string.

salseqstring <- c2s(salseq[[1]]) #converts a character array into a string
colseqstring <- c2s(colseq[[1]])

# alignment algorithm only takes string character input, use toupper() to convert all strings to capital letters to match our scoring matrix
globalSoxS <- pairwiseAlignment(toupper(salseqstring), toupper(colseqstring), substitutionMatrix=scormat, gapOpening=-2, gapExtension=-8, scoreOnly=FALSE)

# write the pairwise alignment to a file so you can visualize it
writePairwiseAlignments(globalSoxS, file="soxSAlignment.txt")

Protein sequence alignments using scoring matrices

There are a number of scoring matrices in the Biostrings package. Let’s see what is there.

data(package="Biostrings")

We’ll go ahead with using the PAM30 matrix. If using this matrix are we looking for more similar or more distant alignments?

Let’s proceed with performing a global alignment using our protein sequences that we uploaded from files. Note, we could easily make this a local alignment by changing the type of alignment to local. Here, we’ll perform a global alignment using the Needleman-Wunsch algorithm with the PAM30 matrix. We’ll write our output as a file to visualize it.

ulceransstring <- toupper( c2s(ulceransfaseq[[1]]) ) #convert our sequence array to a string and make it uppercase
lepraestring <- toupper( c2s(lepraefaseq[[1]]) )

mycoGlobalAln <- pairwiseAlignment(lepraestring, ulceransstring, substitutionMatrix="PAM30", gapOpening = -2, gapExtension = -8, scoreOnly = FALSE, type="global") # perform the alignment

writePairwiseAlignments(mycoGlobalAln, file="mycoProtAln.txt")