Overview

In todays workshop we will be looking to:

We will also see how to:

Multiple Sequence Alignment

Multiple sequence alignment (MSA) is tool used in a variety of downstream bioinformatics applications. We have already seen MSA used in aligning RNA molecules to constrain RNA folding and we will see MSA be used again in the coming weeks.

MSA is a heuristic approach to compare sequences. The dynamic programming algorithm that works well for pairwise comparisons between sequences fails for a variety of theoretical and practical reasons when applied to more than a few sequences. As a result, most MSA algorithms rely on a variety approaches to assemble pairwise alignments into larger MSAs.

There are a wide variety of MSA algorithms and many more programs available to view MSA output. Each software is well-suited for a variety of tasks. Today, we will focus on some widely-used web-based applications to perform MSA and view the output in meaningful ways.

MSA algorithms

There are dozens of different programs available to perform MSA. Baxevanis et al. provide a nice overview of many popular programs used by a wide variety of researchers. Today, we will focus on using a few popular programs - CLUSTAL Omega, MUSCLE, and COBALT. CLUSTAL and MUSCLE are accessible via EMBL-EBI and have easy to use interfaces. COBALT is available through NCBI and uses the BLAST alighment system as its backbone.

Let’s get started by navigating to: https://www.ebi.ac.uk/Tools/msa/

CLUSTAL Omega

One of the most commonly used programs is CLUSTAL Omega. This program is the current iteration of CLUSTAL, a program continuously under development for over 30 years. CLUSTAL Omega is a tree-based progressive aligner that uses iterative tree building and hidden Markov models to refine alignments.

Let’s get a feel for how this aligner works and how accurate it is.

Enter the following sequences into the CLUSTAL Omega interface and execute using default settings except that the output will be MSF. Note that there are a variety of settings we can change regarding the degree of iteration that CLUSTAL will use.

>O24646
MQEQATSSLAASSLPSSSERSSSSAPHLEIKEGIESDEEIRRVPEFGGEAVGKETSGRES
GSATGQERTQATVGESQRKRGRTPAEKENKRLKRLLRNRVSAQQARERKKAYLSELENRV
KDLENKNSELEERLSTLQNENQMLRHILKNTTGNKRGGGGGSNADASL
>Q9SM50
MQEQATSSIAASSLPSSSERSSSSALHHELKEGMESDDEIRRVPEMGGEATGTTSASGRD
GVSAAGQAQPSAGTQRKRGRSPADKENKRLKRLLRNRVSAQQARERKKAYLIDLEARVKE
LETKNAELEERLSTLQNENQMLRHILKNTTAGAQEGRK
>Q0JQA4
MAAQEQEQEKQQVKTSTTSSLPSSSERSSSSAPNNLKEGGGVESDEEIRRVPEMGGGGGS
ASSGAGADERQGKEDGKQQGGGGGGAAAAGGGQEQAPPARKRGRSAGDKEQNRLKRLLRN
RVSAQQARERKKAYMTELEAKAKDLELRNAELEQRVSTLQNENNTLRQILKNTTAHAGKR
GGGGGGKGGDGGGGGKKHHFTKS
>Q69XK6
MTIKRKDDGQVVKQSVKAVGGGLLERVDSDDEEIVGRVPEFGLALPGTSTSGRGSVRVAG
DAAATAAGTSSSSPAAQAGVAGSSSSGRRRGRSPADKEHRRLKRLLRNRVSAQQARERKK
AYMSELEARVKDLERSNSELEERLSTLQNENQMLRQVLKNTTANRRGPDSSAGGDS
>Q6ZHT8
MQEQATSSRPSSSERSSSSGGHHMEIKEGKEAPLRSLLLPFLDFHFTVPLSGMESDEEIG
RVPELGLEPGGASTSGRAAGGGGGGAERAQSSTAQASARRRGRSPADKEHKRLKRLLRNR
VSAQQARERKKAYLNDLEVKVKDLEKKNSELEERFSTLQNENQMLRQILKNTTVSRRGPG
STASGEGQ

Let’s take a look at the output. There are several tabs at the top of the alignment window, click Results Summary. There you will find downloadable files for all information related to your alignment, given your input parameters.

Now, let’s view our alignment. Alignment viewing is a critical part of interpreting an alignment.

MSA viewers

Here are a few popular viewers.

MSA Viewer: https://www.ncbi.nlm.nih.gov/projects/msaviewer/ Mview: https://www.ebi.ac.uk/Tools/msa/mview/ Jalview: https://www.jalview.org/

If using the EBI interface, we can pipe our alignment data straight into Mview in the web browser. If we wanted to use other stand-alone software, we would download our alignment file and open per the software’s instructions.

Go ahead and click Results Viewer. Scroll down and click View in Mview. Your submission link will now be entered into the Mview input window. Click Submit and we will walk through some of the output.

Determining the quality of an alignment program

We may need to learn how good a MSA program is at dealing with our particular scenario. This is a great place to employ the BAliBASE (http://www.lbgi.fr/balibase/). BAliBASE has a number of benchmark scenarios to test your chosen MSA programs. BAliBASE contains different sequence data types in FASTA format along with corresponding validated alignments to compare your MSA program output against.

Let’s test our output - which is actually a BAliBASE test dataset! A number of tests exist for MSA comparison. VerAlign (https://www.ibi.vu.nl/programs/veralignwww/) is a fairly conventional program to make such comparisons. Using our MSF alignment output above, and the MSF output below, let’s compare and see how CLUSTAL did with our reference dataset.

PileUp



   MSF:  227  Type: P    Check:  4174   .. 

 Name: O24646 oo  Len:  227  Check:  5110  Weight:  10.0
 Name: Q9SM50 oo  Len:  227  Check:  9784  Weight:  10.0
 Name: Q0JQA4 oo  Len:  227  Check:  3471  Weight:  10.0
 Name: Q69XK6 oo  Len:  227  Check:  9317  Weight:  10.0
 Name: Q6ZHT8 oo  Len:  227  Check:  6492  Weight:  10.0

//



O24646      .......MQE QATSSLAASS LPSSSERSSS SAP.HLEIKE G......... 
Q9SM50      .......MQE QATSSIAASS LPSSSERSSS SAL.HHELKE G......... 
Q0JQA4      MAAQEQEQEK QQVKTSTTSS LPSSSERSSS SAPNNLKEGG G......... 
Q69XK6      .......... ..MTIKRKDD GQVVKQSVKA VGGGLLERVD S......... 
Q6ZHT8      .......... ..MQEQATSS RPSSSERSSS SGGHHMEIKE GKEAPLRSLL 


O24646      .......... ....IESDEE IRRVPEFGGE AVG....... .KETSGRESG 
Q9SM50      .......... ....MESDDE IRRVPEMGGE ATG....... TTSASGRDGV 
Q0JQA4      .......... ....VESDEE IRRVPEMGGG GGSAS.SGAG ADERQGKEDG 
Q69XK6      .......... .....DDEEI VGRVPEFGLA LPGTSTSGRG SVRVAGDAAA 
Q6ZHT8      LPFLDFHFTV PLSGMESDEE IGRVPELGLE PGG....... ASTSGRAAGG 


O24646      SATGQERTQA TVGE...... .SQRKRGRTP AEKENKRLKR LLRNRVSAQQ 
Q9SM50      SAAGQ..AQP SAG....... .TQRKRGRSP ADKENKRLKR LLRNRVSAQQ 
Q0JQA4      KQQGGGGGGA AAAGGGQEQA PPARKRGRSA GDKEQNRLKR LLRNRVSAQQ 
Q69XK6      TAAGTSSSSP AAQAGVAGSS SSGRRRGRSP ADKEHRRLKR LLRNRVSAQQ 
Q6ZHT8      GGGGAERAQS STAQ...... ASARRRGRSP ADKEHKRLKR LLRNRVSAQQ 


O24646      ARERKKAYLS ELENRVKDLE NKNSELEERL STLQNENQML RHILKNTTGN 
Q9SM50      ARERKKAYLI DLEARVKELE TKNAELEERL STLQNENQML RHILKNTTAG 
Q0JQA4      ARERKKAYMT ELEAKAKDLE LRNAELEQRV STLQNENNTL RQILKNTTAH 
Q69XK6      ARERKKAYMS ELEARVKDLE RSNSELEERL STLQNENQML RQVLKNTTAN 
Q6ZHT8      ARERKKAYLN DLEVKVKDLE KKNSELEERF STLQNENQML RQILKNTTVS 


O24646      K..RGGGGGS NADASL.... .......
Q9SM50      A..QEGRK.. .......... .......
Q0JQA4      AGKRGGGGGG KGGDGGGGGK KHHFTKS
Q69XK6      RRGPDSSAGG DS........ .......
Q6ZHT8      RRGPGSTASG EGQ....... .......

You will see two numbers that are important. SP and CS. SP is a metric that accounts of variable positioning of similarly aligned regions between the test and reference, while CS is a column score for each alignment. SP is a softer metric, while CS is a more stringent metric. Both scale from 0 to 1 (or can be described as percent) where 0 is complete incongruence and 1 is a perfect match. Here, we see our alignment is about 65% successful based on column scoring…


Question 1: In CLUSTAL Omega, increase the number of combine iterations and max HMM interations to 3. Compare this to the BAliBASE standard above. What are the SP and CS scores? Is the alignment improved compared to the default settings?

Answer here.


MUSCLE

MUSCLE is another program available through EBI and is widely used in protein sequence alignment. There are some structural differences in how MUSCLE performs MSA compared to CLUSTAL. These differences are described in the textbook, if you are interested.

Let’s input the same dataset as we did for CLUSTAL Omega. You will note we cannot have MSF as an output. In the event you want to use VerAlign, choose FASTA.

Using default settings, let’s see what we get for output.

COBALT

The last aligner we will explore is COBALT. While less popular compared to the global aligners CLUSTAL and MUSCLE, COBALT is a good option as well. COBALT uses the local alignment method behind BLAST to perform alignment. In fact, it is the MSA output provided when you use BLAST to search the NCBI database!

Let’s take a look: https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi

Paste in the dataset above, as before. Note the advanced parameters available for COBALT. Go ahead and click Align.

Examining the output, we can see there is a graphical depiction of our alignment and several options to allow for realignment, etc. These features make COBALT a powerful web environment for sequence analysis.

Take a second to examine the output compared to the reference, above. Note that COBALT was able to capture some of the early alignment missed by CLUSTAL, much like MUSCLE. However, it does fail with some of the internal gaps in these sequences. This underlies a key point… no MSA aligner is perfect. You often need to make choices on the aligner used based on the circumstances you are working with. That said, most aligners should be giving you relatively similar output if they are well-established platforms.

Real life example

MSA is a powerful tool for identifying highly conserved regions in DNA, RNA, and AA sequences.So far, we’ve been covering this in a fairly abstract way. The power of MSA is really best observed through example. That said, let’s align something interesting and see what we can learn.

One place MSA has been life-changing in the last few years is the development of vaccines targeting the spike protein of SARS-CoV-2. You may have found yourself asking the following question: “How did we know what part of the spike protein to target?” Great question!

The answer to this question is an excellent use of MSA and underlying knowledge of protein structural biology. It turns out, researchers have been looking for vaccine epitopes to coronaviruses for some time. In 2017, Jesper Pallesen and coworkers1 developed a successful antigenic epitope of the MERS spike protein. We’re lucky they were successful then because epitope used in all vaccines used in the USA and Europe use the principles from Pallesen’s work to great effect against the SARS-CoV-2 spike protein!

Let’s take a look at how we can think about starting an inquiry like Pallesen and coworkers. We first need a series of coronavirus spike protein sequences to compare. Specifically, we need sequences from Betacoronaviruses (i.e. the family causing respiratory disorders in various mammalian hosts). You can get those here. DOWNLOAD

Enter these sequences into COBALT and align.


Question 2: If you had to choose a region to design an epitope for a vaccine, which would you choose? Why? (You may need to look up what an epitope is.)

Answer here.


Using the first sequence as a guide (it is SARS-CoV-2 spike), navigate to the 986/987 residue pair. These two residues, in the vaccine S-2P epitope, have been mutated to prolines. These mutations lock the spike protein into its active conformation found in the virion surface. This simple change was directed by MSA, structural comparison, and hypothesis testing in lab experiments. Without our ability to compare… we’d be far behind in our development of vaccines.

MSA in R

library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.1.1
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit

We have seen the use of the seqinr subroutines for reading in FASTA data (namely read.fasta()). Here, we can use a function in the Biostrings package to load in FASTA sequences. This package will also read in other formats, too. Like Module 5, we are creating a StringSet variable (box17). This is the standard input for most Bioconductor sequence analysis packages.

box17 <- readAAStringSet("https://drive.google.com/uc?export=download&id=1hMqmCS9ae2-Rvk2oOyiOMUZ_MWKCOeh6")

The msa package contains three subroutines for performing CLUSTALW (progressive only), CLUSTAL Omega (with iteration), and MUSCLE alignments. These will meet most needs and allow sending output data to a variety of analysis packages - some we will see next week!

# using the old CLUSTALW2 algorithm
ClustalW = msaClustalW(box17)
## use default substitution matrix
# using the iterative Omega algorithm
ClustalO = msaClustalOmega(box17,
                           maxiters = 5 # we can change the number of iterations manually, default is 1-2.
                           )
## using Gonnet
# using MUSCLE
Muscle = msaMuscle(box17)

The msa package also has some alignment analysis tools. One useful one is the msaConsensusSequence() subroutine. This will allow you to extract highly conserved motifs (i.e. 100% conservation).

msaConsensusSequence(ClustalW)
## [1] "------------------------------------------------------------???????----?????????-?????---------------------??????????????????????????????????????????????????????????????????????????????????????????????????-----------------------???????????????????--??-------------------------------------------------------------------???????????????????????????????????????????????????------------------------------------------------------------------------------------------??????????--------??????????????????????????????????-???KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL??????L???N??L?Q?L?????????????????????-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
# let's use a loop and create some output we can compare

alignments <- c(ClustalW, ClustalO, Muscle)

alignmentConsensus = c()

for (i in 1:length(alignments)) {
  
  alignmentConsensus[i] = msaConsensusSequence(alignments[[i]])
  
}

print(alignmentConsensus)
## [1] "------------------------------------------------------------???????----?????????-?????---------------------??????????????????????????????????????????????????????????????????????????????????????????????????-----------------------???????????????????--??-------------------------------------------------------------------???????????????????????????????????????????????????------------------------------------------------------------------------------------------??????????--------??????????????????????????????????-???KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL??????L???N??L?Q?L?????????????????????-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"                                                                                                                                                                                                                                         
## [2] "---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------???????????-?--------------------------------------------------------------????????????--??-----------???????---?????-------??????????????----------------------?????????????????-------------------------------------------------------------?-????????????????????------------------------------------------------???------???--------??????-----??-?--???---------???????--??----------????????--????????????????-----------------????????-----??????????????????????P?????????KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL??????L???N??L?Q?L????????????-----?--??????-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
## [3] "-------------------------------------------?????????????????????????---------------------??????????????????---------------------------------??????????????????????????P------------???------------------------??????????-------????????-?????---------------------??????????-------------------?????????????????????????????????????????????????---------------------------------------??????????????????????????????????-------????-----------------------????????????????-???????P?DK??????-KR?RR??RNR?SAQ??R?RKK?Y???LE?RV??L???N?EL?????-?L???N??L?Q?L????????-----------??????????--??????---?-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"

Question 3: Comparing these consensus alignments, do they seem to capture similar motifs? If so, provide the regione from one of the consensus alignments, below.

Answer here.


The end!


  1. https://www.pnas.org/content/114/35/E7348↩︎