Overview
In todays workshop we will be looking to:
- Use some MSA programs to analyze sequence similarity
- Determine the accuracy of certain alignment programs using BAliBASE data
We will also see how to:
- Use R to perform MSA on a dataset
Multiple Sequence Alignment
Multiple sequence alignment (MSA) is tool used in a variety of downstream bioinformatics applications. We have already seen MSA used in aligning RNA molecules to constrain RNA folding and we will see MSA be used again in the coming weeks.
MSA is a heuristic approach to compare sequences. The dynamic programming algorithm that works well for pairwise comparisons between sequences fails for a variety of theoretical and practical reasons when applied to more than a few sequences. As a result, most MSA algorithms rely on a variety approaches to assemble pairwise alignments into larger MSAs.
There are a wide variety of MSA algorithms and many more programs available to view MSA output. Each software is well-suited for a variety of tasks. Today, we will focus on some widely-used web-based applications to perform MSA and view the output in meaningful ways.
MSA algorithms
There are dozens of different programs available to perform MSA. Baxevanis et al. provide a nice overview of many popular programs used by a wide variety of researchers. Today, we will focus on using a few popular programs - CLUSTAL Omega, MUSCLE, and COBALT. CLUSTAL and MUSCLE are accessible via EMBL-EBI and have easy to use interfaces. COBALT is available through NCBI and uses the BLAST alighment system as its backbone.
Let’s get started by navigating to: https://www.ebi.ac.uk/Tools/msa/
CLUSTAL Omega
One of the most commonly used programs is CLUSTAL Omega. This program is the current iteration of CLUSTAL, a program continuously under development for over 30 years. CLUSTAL Omega is a tree-based progressive aligner that uses iterative tree building and hidden Markov models to refine alignments.
Let’s get a feel for how this aligner works and how accurate it is.
Enter the following sequences into the CLUSTAL Omega interface and execute using default settings except that the output will be MSF. Note that there are a variety of settings we can change regarding the degree of iteration that CLUSTAL will use.
>O24646
MQEQATSSLAASSLPSSSERSSSSAPHLEIKEGIESDEEIRRVPEFGGEAVGKETSGRES
GSATGQERTQATVGESQRKRGRTPAEKENKRLKRLLRNRVSAQQARERKKAYLSELENRV
KDLENKNSELEERLSTLQNENQMLRHILKNTTGNKRGGGGGSNADASL
>Q9SM50
MQEQATSSIAASSLPSSSERSSSSALHHELKEGMESDDEIRRVPEMGGEATGTTSASGRD
GVSAAGQAQPSAGTQRKRGRSPADKENKRLKRLLRNRVSAQQARERKKAYLIDLEARVKE
LETKNAELEERLSTLQNENQMLRHILKNTTAGAQEGRK
>Q0JQA4
MAAQEQEQEKQQVKTSTTSSLPSSSERSSSSAPNNLKEGGGVESDEEIRRVPEMGGGGGS
ASSGAGADERQGKEDGKQQGGGGGGAAAAGGGQEQAPPARKRGRSAGDKEQNRLKRLLRN
RVSAQQARERKKAYMTELEAKAKDLELRNAELEQRVSTLQNENNTLRQILKNTTAHAGKR
GGGGGGKGGDGGGGGKKHHFTKS
>Q69XK6
MTIKRKDDGQVVKQSVKAVGGGLLERVDSDDEEIVGRVPEFGLALPGTSTSGRGSVRVAG
DAAATAAGTSSSSPAAQAGVAGSSSSGRRRGRSPADKEHRRLKRLLRNRVSAQQARERKK
AYMSELEARVKDLERSNSELEERLSTLQNENQMLRQVLKNTTANRRGPDSSAGGDS
>Q6ZHT8
MQEQATSSRPSSSERSSSSGGHHMEIKEGKEAPLRSLLLPFLDFHFTVPLSGMESDEEIG
RVPELGLEPGGASTSGRAAGGGGGGAERAQSSTAQASARRRGRSPADKEHKRLKRLLRNR
VSAQQARERKKAYLNDLEVKVKDLEKKNSELEERFSTLQNENQMLRQILKNTTVSRRGPG
STASGEGQ
Let’s take a look at the output. There are several tabs at the top of the alignment window, click Results Summary. There you will find downloadable files for all information related to your alignment, given your input parameters.
Now, let’s view our alignment. Alignment viewing is a critical part of interpreting an alignment.
MSA viewers
Here are a few popular viewers.
MSA Viewer: https://www.ncbi.nlm.nih.gov/projects/msaviewer/ Mview: https://www.ebi.ac.uk/Tools/msa/mview/ Jalview: https://www.jalview.org/
If using the EBI interface, we can pipe our alignment data straight into Mview in the web browser. If we wanted to use other stand-alone software, we would download our alignment file and open per the software’s instructions.
Go ahead and click Results Viewer. Scroll down and click View in Mview. Your submission link will now be entered into the Mview input window. Click Submit and we will walk through some of the output.
Determining the quality of an alignment program
We may need to learn how good a MSA program is at dealing with our particular scenario. This is a great place to employ the BAliBASE (http://www.lbgi.fr/balibase/). BAliBASE has a number of benchmark scenarios to test your chosen MSA programs. BAliBASE contains different sequence data types in FASTA format along with corresponding validated alignments to compare your MSA program output against.
Let’s test our output - which is actually a BAliBASE test dataset! A number of tests exist for MSA comparison. VerAlign (https://www.ibi.vu.nl/programs/veralignwww/) is a fairly conventional program to make such comparisons. Using our MSF alignment output above, and the MSF output below, let’s compare and see how CLUSTAL did with our reference dataset.
PileUp
MSF: 227 Type: P Check: 4174 ..
Name: O24646 oo Len: 227 Check: 5110 Weight: 10.0
Name: Q9SM50 oo Len: 227 Check: 9784 Weight: 10.0
Name: Q0JQA4 oo Len: 227 Check: 3471 Weight: 10.0
Name: Q69XK6 oo Len: 227 Check: 9317 Weight: 10.0
Name: Q6ZHT8 oo Len: 227 Check: 6492 Weight: 10.0
//
O24646 .......MQE QATSSLAASS LPSSSERSSS SAP.HLEIKE G.........
Q9SM50 .......MQE QATSSIAASS LPSSSERSSS SAL.HHELKE G.........
Q0JQA4 MAAQEQEQEK QQVKTSTTSS LPSSSERSSS SAPNNLKEGG G.........
Q69XK6 .......... ..MTIKRKDD GQVVKQSVKA VGGGLLERVD S.........
Q6ZHT8 .......... ..MQEQATSS RPSSSERSSS SGGHHMEIKE GKEAPLRSLL
O24646 .......... ....IESDEE IRRVPEFGGE AVG....... .KETSGRESG
Q9SM50 .......... ....MESDDE IRRVPEMGGE ATG....... TTSASGRDGV
Q0JQA4 .......... ....VESDEE IRRVPEMGGG GGSAS.SGAG ADERQGKEDG
Q69XK6 .......... .....DDEEI VGRVPEFGLA LPGTSTSGRG SVRVAGDAAA
Q6ZHT8 LPFLDFHFTV PLSGMESDEE IGRVPELGLE PGG....... ASTSGRAAGG
O24646 SATGQERTQA TVGE...... .SQRKRGRTP AEKENKRLKR LLRNRVSAQQ
Q9SM50 SAAGQ..AQP SAG....... .TQRKRGRSP ADKENKRLKR LLRNRVSAQQ
Q0JQA4 KQQGGGGGGA AAAGGGQEQA PPARKRGRSA GDKEQNRLKR LLRNRVSAQQ
Q69XK6 TAAGTSSSSP AAQAGVAGSS SSGRRRGRSP ADKEHRRLKR LLRNRVSAQQ
Q6ZHT8 GGGGAERAQS STAQ...... ASARRRGRSP ADKEHKRLKR LLRNRVSAQQ
O24646 ARERKKAYLS ELENRVKDLE NKNSELEERL STLQNENQML RHILKNTTGN
Q9SM50 ARERKKAYLI DLEARVKELE TKNAELEERL STLQNENQML RHILKNTTAG
Q0JQA4 ARERKKAYMT ELEAKAKDLE LRNAELEQRV STLQNENNTL RQILKNTTAH
Q69XK6 ARERKKAYMS ELEARVKDLE RSNSELEERL STLQNENQML RQVLKNTTAN
Q6ZHT8 ARERKKAYLN DLEVKVKDLE KKNSELEERF STLQNENQML RQILKNTTVS
O24646 K..RGGGGGS NADASL.... .......
Q9SM50 A..QEGRK.. .......... .......
Q0JQA4 AGKRGGGGGG KGGDGGGGGK KHHFTKS
Q69XK6 RRGPDSSAGG DS........ .......
Q6ZHT8 RRGPGSTASG EGQ....... .......
You will see two numbers that are important. SP and CS. SP is a metric that accounts of variable positioning of similarly aligned regions between the test and reference, while CS is a column score for each alignment. SP is a softer metric, while CS is a more stringent metric. Both scale from 0 to 1 (or can be described as percent) where 0 is complete incongruence and 1 is a perfect match. Here, we see our alignment is about 65% successful based on column scoring…
Question
How much do you think exactness of an MSA program matters?
MUSCLE
MUSCLE is another program available through EBI and is widely used in protein sequence alignment. There are some structural differences in how MUSCLE performs MSA compared to CLUSTAL. These differences are described in the textbook, if you are interested.
Let’s input the same dataset as we did for CLUSTAL Omega. You will note we cannot have MSF as an output. In the event you want to use VerAlign, choose FASTA.
Using default settings, let’s see what we get for output.
COBALT
The last aligner we will explore is COBALT. While less popular compared to the global aligners CLUSTAL and MUSCLE, COBALT is a good option as well. COBALT uses the local alignment method behind BLAST to perform alignment. In fact, it is the MSA output provided when you use BLAST to search the NCBI database!
Let’s take a look: https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi
Paste in the dataset above, as before. Note the advanced parameters available for COBALT. Go ahead and click Align.
Examining the output, we can see there is a graphical depiction of our alignment and several options to allow for realignment, etc. These features make COBALT a powerful web environment for sequence analysis.
Take a second to examine the output compared to the reference, above. Note that COBALT was able to capture some of the early alignment missed by CLUSTAL, much like MUSCLE. However, it does fail with some of the internal gaps in these sequences. This underlies a key point… no MSA aligner is perfect. You often need to make choices on the aligner used based on the circumstances you are working with. That said, most aligners should be giving you relatively similar output if they are well-established platforms.
Real life example
MSA is a powerful tool for identifying highly conserved regions in DNA, RNA, and AA sequences.So far, we’ve been covering this in a fairly abstract way. The power of MSA is really best observed through example. That said, let’s align something interesting and see what we can learn.
One place MSA has been life-changing in the last few years is the development of vaccines targeting the spike protein of SARS-CoV-2. You may have found yourself asking the following question: “How did we know what part of the spike protein to target?” Great question!
The answer to this question is an excellent use of MSA and underlying knowledge of protein structural biology. It turns out, researchers have been looking for vaccine epitopes to coronaviruses for some time. In 2017, Jesper Pallesen and coworkers developed a successful antigenic epitope of the MERS spike protein. We’re lucky they were successful then because epitope used in all vaccines used in the USA and Europe use the principles from Pallesen’s work to great effect against the SARS-CoV-2 spike protein!
Let’s take a look at how we can think about starting an inquiry like Pallesen and coworkers. We first need a series of coronavirus spike protein sequences to compare. Specifically, we need sequences from Betacoronaviruses (i.e. the family causing respiratory disorders in various mammalian hosts). You can get those here. DOWNLOAD
Enter these sequences into COBALT and align.
Now, let’s examine the alignment. Question for us to ponder: If you had to choose a region to design an epitope, which would you choose? Why?
Using the first sequence as a guide (it is SARS-CoV-2 spike), navigate to the 986/987 residue pair. These two residues, in the vaccine S-2P epitope, have been mutated to prolines. These mutations lock the spike protein into its active conformation found in the virion surface. This simple change was directed by MSA, structural comparison, and hypothesis testing in lab experiments. Without our ability to compare… we’d be far behind in our development of vaccines.
MSA in R
library(msa)
We have seen the use of the seqinr subroutines for reading in FASTA data (namely read.fasta()). Here, we can use a function in the Biostrings package to load in FASTA sequences. This package will also read in other formats, too. Like Module 5, we are creating a StringSet variable (box17). This is the standard input for most Bioconductor sequence analysis packages.
box17 <- readAAStringSet("https://drive.google.com/uc?export=download&id=1hMqmCS9ae2-Rvk2oOyiOMUZ_MWKCOeh6")
trying URL 'https://drive.google.com/uc?export=download&id=1hMqmCS9ae2-Rvk2oOyiOMUZ_MWKCOeh6'
Content type 'application/octet-stream' length 13532 bytes (13 KB)
==================================================
downloaded 13 KB
The msa package contains three subroutines for performing CLUSTALW (progressive only), CLUSTAL Omega (with iteration), and MUSCLE alignments. These will meet most needs and allow sending output data to a variety of analysis packages - some we will see next week!
# using the old CLUSTALW2 algorithm
ClustalW = msaClustalW(box17)
use default substitution matrix
# using the iterative Omega algorithm
ClustalO = msaClustalOmega(box17,
maxiters = 5)
using Gonnet
# using MUSCLE
Muscle = msaMuscle(box17)
The msa package also has some alignment analysis tools. One useful one is the msaConsensusSequence() subroutine. This will allow you to extract highly conserved motifs (i.e. 100% conservation).
msaConsensusSequence(ClustalW)
[1] "------------------------------------------------------------???????----?????????-?????---------------------??????????????????????????????????????????????????????????????????????????????????????????????????-----------------------???????????????????--??-------------------------------------------------------------------???????????????????????????????????????????????????------------------------------------------------------------------------------------------??????????--------??????????????????????????????????-???KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL??????L???N??L?Q?L?????????????????????-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
# let's use a loop and create some output we can compare
alignments <- c(ClustalW, ClustalO, Muscle)
alignmentConsensus = c()
for (i in 1:length(alignments)) {
alignmentConsensus[i] = msaConsensusSequence(alignments[[i]])
}
print(alignmentConsensus)
[1] "------------------------------------------------------------???????----?????????-?????---------------------??????????????????????????????????????????????????????????????????????????????????????????????????-----------------------???????????????????--??-------------------------------------------------------------------???????????????????????????????????????????????????------------------------------------------------------------------------------------------??????????--------??????????????????????????????????-???KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL??????L???N??L?Q?L?????????????????????-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
[2] "---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------???????????-?--------------------------------------------------------------????????????--??-----------???????---?????-------??????????????----------------------?????????????????-------------------------------------------------------------?-????????????????????------------------------------------------------???------???--------??????-----??-?--???---------???????--??----------????????--????????????????-----------------????????-----??????????????????????P?????????KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL??????L???N??L?Q?L????????????-----?--??????-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
[3] "-------------------------------------------?????????????????????????---------------------??????????????????---------------------------------??????????????????????????P------------???------------------------??????????-------????????-?????---------------------??????????-------------------?????????????????????????????????????????????????---------------------------------------??????????????????????????????????-------????-----------------------????????????????-???????P?DK??????-KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL?????-?L???N??L?Q?L????????-----------??????????--??????---?-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
Question
Comparing these consensus alignments, do they seem to capture similar motifs?
We can also use msa to output more visually accessible outputs using the msaPrettyPrint subroutine. There are several more options in this subroutine than shown here, but the basic output is reasonably informative, as is.
msaPrettyPrint(ClustalO, output="pdf")
---
title: "Module 6 Workshop Exercises"
output:
  html_notebook: default
  html_document:
    df_print: paged
  word_document: default
---

# Overview

In todays workshop we will be looking to:

- Use some MSA programs to analyze sequence similarity
- Determine the accuracy of certain alignment programs using BAliBASE data

We will also see how to:

- Use R to perform MSA on a dataset


# Multiple Sequence Alignment

Multiple sequence alignment (MSA) is tool used in a variety of downstream bioinformatics applications. We have already seen MSA used in aligning RNA molecules to constrain RNA folding and we will see MSA be used again in the coming weeks.

MSA is a heuristic approach to compare sequences. The dynamic programming algorithm that works well for pairwise comparisons between sequences fails for a variety of theoretical and practical reasons when applied to more than a few sequences. As a result, most MSA algorithms rely on a variety approaches to assemble pairwise alignments into larger MSAs.

There are a wide variety of MSA algorithms and many more programs available to view MSA output. Each software is well-suited for a variety of tasks. Today, we will focus on some widely-used web-based applications to perform MSA and view the output in meaningful ways.

# MSA algorithms

There are dozens of different programs available to perform MSA. Baxevanis et al. provide a nice overview of many popular programs used by a wide variety of researchers. Today, we will focus on using a few popular programs - CLUSTAL Omega, MUSCLE, and COBALT. CLUSTAL and MUSCLE are accessible via EMBL-EBI and have easy to use interfaces. COBALT is available through NCBI and uses the BLAST alighment system as its backbone.

Let's get started by navigating to: https://www.ebi.ac.uk/Tools/msa/

## CLUSTAL Omega

One of the most commonly used programs is CLUSTAL Omega. This program is the current iteration of CLUSTAL, a program continuously under development for over 30 years. CLUSTAL Omega is a tree-based progressive aligner that uses iterative tree building and hidden Markov models to refine alignments.

Let's get a feel for how this aligner works and how accurate it is.

Enter the following sequences into the CLUSTAL Omega interface and execute using default settings except that the output will be MSF. Note that there are a variety of settings we can change regarding the degree of iteration that CLUSTAL will use.

    >O24646
    MQEQATSSLAASSLPSSSERSSSSAPHLEIKEGIESDEEIRRVPEFGGEAVGKETSGRES
    GSATGQERTQATVGESQRKRGRTPAEKENKRLKRLLRNRVSAQQARERKKAYLSELENRV
    KDLENKNSELEERLSTLQNENQMLRHILKNTTGNKRGGGGGSNADASL
    >Q9SM50
    MQEQATSSIAASSLPSSSERSSSSALHHELKEGMESDDEIRRVPEMGGEATGTTSASGRD
    GVSAAGQAQPSAGTQRKRGRSPADKENKRLKRLLRNRVSAQQARERKKAYLIDLEARVKE
    LETKNAELEERLSTLQNENQMLRHILKNTTAGAQEGRK
    >Q0JQA4
    MAAQEQEQEKQQVKTSTTSSLPSSSERSSSSAPNNLKEGGGVESDEEIRRVPEMGGGGGS
    ASSGAGADERQGKEDGKQQGGGGGGAAAAGGGQEQAPPARKRGRSAGDKEQNRLKRLLRN
    RVSAQQARERKKAYMTELEAKAKDLELRNAELEQRVSTLQNENNTLRQILKNTTAHAGKR
    GGGGGGKGGDGGGGGKKHHFTKS
    >Q69XK6
    MTIKRKDDGQVVKQSVKAVGGGLLERVDSDDEEIVGRVPEFGLALPGTSTSGRGSVRVAG
    DAAATAAGTSSSSPAAQAGVAGSSSSGRRRGRSPADKEHRRLKRLLRNRVSAQQARERKK
    AYMSELEARVKDLERSNSELEERLSTLQNENQMLRQVLKNTTANRRGPDSSAGGDS
    >Q6ZHT8
    MQEQATSSRPSSSERSSSSGGHHMEIKEGKEAPLRSLLLPFLDFHFTVPLSGMESDEEIG
    RVPELGLEPGGASTSGRAAGGGGGGAERAQSSTAQASARRRGRSPADKEHKRLKRLLRNR
    VSAQQARERKKAYLNDLEVKVKDLEKKNSELEERFSTLQNENQMLRQILKNTTVSRRGPG
    STASGEGQ

Let's take a look at the output. There are several tabs at the top of the alignment window, click `Results Summary`. There you will find downloadable files for all information related to your alignment, given your input parameters.

Now, let's view our alignment. Alignment viewing is a critical part of interpreting an alignment.

### MSA viewers

Here are a few popular viewers.

MSA Viewer: https://www.ncbi.nlm.nih.gov/projects/msaviewer/
Mview: https://www.ebi.ac.uk/Tools/msa/mview/
Jalview: https://www.jalview.org/

If using the EBI interface, we can pipe our alignment data straight into Mview in the web browser. If we wanted to use other stand-alone software, we would download our alignment file and open per the software's instructions.

Go ahead and click `Results Viewer`. Scroll down and click `View in Mview`. Your submission link will now be entered into the Mview input window. Click `Submit` and we will walk through some of the output.

### Determining the quality of an alignment program

We may need to learn how good a MSA program is at dealing with our particular scenario. This is a great place to employ the BAliBASE (http://www.lbgi.fr/balibase/). BAliBASE has a number of benchmark scenarios to test your chosen MSA programs. BAliBASE contains different sequence data types in FASTA format along with corresponding validated alignments to compare your MSA program output against.

Let's test our output - which is actually a BAliBASE test dataset! A number of tests exist for MSA comparison. VerAlign (https://www.ibi.vu.nl/programs/veralignwww/) is a fairly conventional program to make such comparisons. Using our MSF alignment output above, and the MSF output below, let's compare and see how CLUSTAL did with our reference dataset.


    PileUp
    
    
    
       MSF:  227  Type: P    Check:  4174   .. 
    
     Name: O24646 oo  Len:  227  Check:  5110  Weight:  10.0
     Name: Q9SM50 oo  Len:  227  Check:  9784  Weight:  10.0
     Name: Q0JQA4 oo  Len:  227  Check:  3471  Weight:  10.0
     Name: Q69XK6 oo  Len:  227  Check:  9317  Weight:  10.0
     Name: Q6ZHT8 oo  Len:  227  Check:  6492  Weight:  10.0
    
    //
    
    
    
    O24646      .......MQE QATSSLAASS LPSSSERSSS SAP.HLEIKE G......... 
    Q9SM50      .......MQE QATSSIAASS LPSSSERSSS SAL.HHELKE G......... 
    Q0JQA4      MAAQEQEQEK QQVKTSTTSS LPSSSERSSS SAPNNLKEGG G......... 
    Q69XK6      .......... ..MTIKRKDD GQVVKQSVKA VGGGLLERVD S......... 
    Q6ZHT8      .......... ..MQEQATSS RPSSSERSSS SGGHHMEIKE GKEAPLRSLL 
    
    
    O24646      .......... ....IESDEE IRRVPEFGGE AVG....... .KETSGRESG 
    Q9SM50      .......... ....MESDDE IRRVPEMGGE ATG....... TTSASGRDGV 
    Q0JQA4      .......... ....VESDEE IRRVPEMGGG GGSAS.SGAG ADERQGKEDG 
    Q69XK6      .......... .....DDEEI VGRVPEFGLA LPGTSTSGRG SVRVAGDAAA 
    Q6ZHT8      LPFLDFHFTV PLSGMESDEE IGRVPELGLE PGG....... ASTSGRAAGG 
    
    
    O24646      SATGQERTQA TVGE...... .SQRKRGRTP AEKENKRLKR LLRNRVSAQQ 
    Q9SM50      SAAGQ..AQP SAG....... .TQRKRGRSP ADKENKRLKR LLRNRVSAQQ 
    Q0JQA4      KQQGGGGGGA AAAGGGQEQA PPARKRGRSA GDKEQNRLKR LLRNRVSAQQ 
    Q69XK6      TAAGTSSSSP AAQAGVAGSS SSGRRRGRSP ADKEHRRLKR LLRNRVSAQQ 
    Q6ZHT8      GGGGAERAQS STAQ...... ASARRRGRSP ADKEHKRLKR LLRNRVSAQQ 
    
    
    O24646      ARERKKAYLS ELENRVKDLE NKNSELEERL STLQNENQML RHILKNTTGN 
    Q9SM50      ARERKKAYLI DLEARVKELE TKNAELEERL STLQNENQML RHILKNTTAG 
    Q0JQA4      ARERKKAYMT ELEAKAKDLE LRNAELEQRV STLQNENNTL RQILKNTTAH 
    Q69XK6      ARERKKAYMS ELEARVKDLE RSNSELEERL STLQNENQML RQVLKNTTAN 
    Q6ZHT8      ARERKKAYLN DLEVKVKDLE KKNSELEERF STLQNENQML RQILKNTTVS 
    
    
    O24646      K..RGGGGGS NADASL.... .......
    Q9SM50      A..QEGRK.. .......... .......
    Q0JQA4      AGKRGGGGGG KGGDGGGGGK KHHFTKS
    Q69XK6      RRGPDSSAGG DS........ .......
    Q6ZHT8      RRGPGSTASG EGQ....... .......

You will see two numbers that are important. SP and CS. SP is a metric that accounts of variable positioning of similarly aligned regions between the test and reference, while CS is a column score for each alignment. SP is a softer metric, while CS is a more stringent metric. Both scale from 0 to 1 (or can be described as percent) where 0 is complete incongruence and 1 is a perfect match. Here, we see our alignment is about 65% successful based on column scoring...

---

**Question**

How much do you think exactness of an MSA program matters?

---

## MUSCLE

MUSCLE is another program available through EBI and is widely used in protein sequence alignment. There are some structural differences in how MUSCLE performs MSA compared to CLUSTAL. These differences are described in the textbook, if you are interested.

Let's input the same dataset as we did for CLUSTAL Omega. You will note we cannot have MSF as an output. In the event you want to use VerAlign, choose FASTA.

Using default settings, let's see what we get for output. 

## COBALT

The last aligner we will explore is COBALT. While less popular compared to the global aligners CLUSTAL and MUSCLE, COBALT is a good option as well. COBALT uses the local alignment method behind BLAST to perform alignment. In fact, it is the MSA output provided when you use BLAST to search the NCBI database!

Let's take a look: https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi

Paste in the dataset above, as before. Note the advanced parameters available for COBALT. Go ahead and click `Align`.

Examining the output, we can see there is a graphical depiction of our alignment and several options to allow for realignment, etc. These features make COBALT a powerful web environment for sequence analysis.

Take a second to examine the output compared to the reference, above. Note that COBALT was able to capture some of the early alignment missed by CLUSTAL, much like MUSCLE. However, it does fail with some of the internal gaps in these sequences. This underlies a key point... no MSA aligner is perfect. You often need to make choices on the aligner used based on the circumstances you are working with. That said, most aligners should be giving you relatively similar output if they are well-established platforms.

# Real life example

MSA is a powerful tool for identifying highly conserved regions in DNA, RNA, and AA sequences.So far, we've been covering this in a fairly abstract way. The power of MSA is really best observed through example. That said, let's align something interesting and see what we can learn.

One place MSA has been life-changing in the last few years is the development of vaccines targeting the spike protein of SARS-CoV-2. You may have found yourself asking the following question: "How did we know what part of the spike protein to target?" Great question!

The answer to this question is an excellent use of MSA and underlying knowledge of protein structural biology. It turns out, researchers have been looking for vaccine epitopes to coronaviruses for some time. In 2017, Jesper Pallesen and coworkers^[https://www.pnas.org/content/114/35/E7348] developed a successful antigenic epitope of the MERS spike protein. We're lucky they were successful then because epitope used in all vaccines used in the USA and Europe use the principles from Pallesen's work to great effect against the SARS-CoV-2 spike protein!

Let's take a look at how we can think about starting an inquiry like Pallesen and coworkers. We first need a series of coronavirus spike protein sequences to compare. Specifically, we need sequences from Betacoronaviruses (i.e. the family causing respiratory disorders in various mammalian hosts). You can get those here. [DOWNLOAD](https://drive.google.com/uc?export=download&id=10GTgF2j1eydjf9E9Ipx732z00WBBHRiC)

Enter these sequences into COBALT and align.

Now, let's examine the alignment. Question for us to ponder: **If you had to choose a region to design an epitope, which would you choose? Why?**

Using the first sequence as a guide (it is SARS-CoV-2 spike), navigate to the 986/987 residue pair. These two residues, in the vaccine S-2P epitope, have been mutated to prolines. These mutations lock the spike protein into its active conformation found in the virion surface. This simple change was directed by MSA, structural comparison, and hypothesis testing in lab experiments. Without our ability to compare... we'd be far behind in our development of vaccines.

# MSA in R

```{r}
library(msa)
```

We have seen the use of the `seqinr` subroutines for reading in FASTA data (namely `read.fasta()`). Here, we can use a function in the `Biostrings` package to load in FASTA sequences. This package will also read in other formats, too. Like Module 5, we are creating a StringSet variable (box17). This is the standard input for most Bioconductor sequence analysis packages.

```{r}

box17 <- readAAStringSet("https://drive.google.com/uc?export=download&id=1hMqmCS9ae2-Rvk2oOyiOMUZ_MWKCOeh6")

```

The `msa` package contains three subroutines for performing CLUSTALW (progressive only), CLUSTAL Omega (with iteration), and MUSCLE alignments. These will meet most needs and allow sending output data to a variety of analysis packages - some we will see next week!

```{r}

# using the old CLUSTALW2 algorithm
ClustalW = msaClustalW(box17)

# using the iterative Omega algorithm
ClustalO = msaClustalOmega(box17,
                           maxiters = 5 # we can change the number of iterations manually, default is 1-2.
                           )

# using MUSCLE
Muscle = msaMuscle(box17)
```

The `msa` package also has some alignment analysis tools. One useful one is the `msaConsensusSequence()` subroutine. This will allow you to extract highly conserved motifs (i.e. 100% conservation).

```{r}

msaConsensusSequence(ClustalW)

# let's use a loop and create some output we can compare

alignments <- c(ClustalW, ClustalO, Muscle)

alignmentConsensus = c()

for (i in 1:length(alignments)) {
  
  alignmentConsensus[i] = msaConsensusSequence(alignments[[i]])
  
}

print(alignmentConsensus)

```

---

**Question**

Comparing these consensus alignments, do they seem to capture similar motifs?

---


We can also use `msa` to output more visually accessible outputs using the `msaPrettyPrint` subroutine. There are several more options in this subroutine than shown here, but the basic output is reasonably informative, as is.

```{r}

msaPrettyPrint(ClustalO, output="pdf")

```

