Introduction. The biological problem

  • In this class we will shift our attention to problems of variability in a more direct biological context. We will be discussing how sequence variability may affect gene regulation through the study of regulatory motifs, before we go on to see how genomic variability in general shapes phenotypic changes in individuals.

We will focus on two particular problems:

  • A. The study, analysis and interpretation of biological sequence motifs.
  • B. The concept of genomic variability and how it is directly related to phenotypical variation in a population.

Part A. Sequence Motifs

Motifs in general

What is a motif in general? How do we understand a motif?

Motifs in Biology

Motifs. A definition

We consider a motif to be a “repetitive substructure of a bigger object that is meant to convey a certain message”.

  • Most (if not all) forms of messages employ “motifs” for:
    • repetition/emphasis: e.g. “I have a dream”, MLK, speech at the “March on Washington for Jobs and Freedom”; “Quoth the Raven, ‘Nevermore’”, Edgar Allan Poe, “The Raven”
    • coherence: e.g. “Who controls the past, controls the future”, George Orwell, “1984”
    • subtextualization: e.g. “Fair is foul and foul is fair”, William Shakespeare, “Macbeth” (and almost all of “Pulp Fiction”)
    • internal reference: e.g. all “leitmotivs” in operas

Motifs in Biology

In biology we find a number of meaningful motifs that serve one (or more) of the above functions.

  • Genome: codons, transcription factor binding sites, CpG islands
  • All areas of the genome that interact with proteins in a sequence-dependent manner
  • Protein: patterns of amino acids that are related to a particular function, modules, domains etc.

Probabilistic definition of a motif

We can think of a probabilistic approach that takes into account the variability in each position. Starting with a number of sequence instances of equal length, we may calculate the probability of occurrence of each nucleotide at each position. The resulting table of probabilities is called a Position Weight Matrix (PWM).

This approach allows us to evaluate the variability of some positions against others and in addition to see which residues tend to be more important in which position.
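The calculation described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `build_pwm` and the four toy instances are made up for this example and are not real binding sites.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(instances):
    """Return a dict: nucleotide -> list of per-position probabilities."""
    length = len(instances[0])
    if any(len(seq) != length for seq in instances):
        raise ValueError("all motif instances must have the same length")
    pwm = {nt: [] for nt in ALPHABET}
    for pos in range(length):
        # Count how often each nucleotide occurs at this position
        counts = Counter(seq[pos] for seq in instances)
        for nt in ALPHABET:
            pwm[nt].append(counts[nt] / len(instances))
    return pwm

# Hypothetical toy instances, for illustration only
instances = ["GGGAA", "GGGAC", "GGAAT", "GGGTT"]
pwm = build_pwm(instances)

for nt in ALPHABET:
    print(nt, pwm[nt])
```

In this toy case the first two positions are fully conserved (G with probability 1), while the last position is the most variable one.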

Accounting for background nucleotide composition

Sequence variation does not only occur within the motif’s instances. Fluctuations in the composition of DNA sequences occur throughout the genomes of all eukaryotes, and specific regions may be enriched in some residues at the expense of others.

This is particularly evident in the fluctuation of GC% along eukaryotic chromosomes, which are divided into regions of high/low GC% that are called “isochores”.

The variability in the genomic sequences will undoubtedly interfere with the motif instances. Binding sites embedded in high GC% regions of the genome will be more GC-rich themselves. In order to account for this variability we need to normalize against what we call the background nucleotide composition.

We achieve this with a different kind of table called a Position Specific Scoring Matrix (PSSM).

A PSSM is created by dividing the observed probabilities per position in the original PWM (P) by the expected values obtained from the composition of a given sequence (Q).

A log2-transformation step, \(\log_2(P/Q)\), results in a table where high positive values correspond to preferred combinations of residue and position. A PSSM table is always specific to the given sequence in which one searches for the motif, which brings us to the next problem…
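A minimal Python sketch of this normalization is shown below. The function names, the toy PWM and the toy GC-rich sequence are all hypothetical. The small pseudocount is an assumption of this sketch (it is not part of the description above) and is only there to avoid taking the logarithm of zero.

```python
import math

def background_from_sequence(seq):
    """Expected probabilities Q: nucleotide frequencies of the searched sequence."""
    return {nt: seq.count(nt) / len(seq) for nt in "ACGT"}

def build_pssm(pwm, background, pseudo=0.01):
    """Log2 ratio of observed (P) over expected (Q) probabilities.

    `pseudo` is an assumption of this sketch: a small value added to every
    probability (followed by renormalization) so that log2(0) never occurs.
    """
    length = len(next(iter(pwm.values())))
    pssm = {nt: [] for nt in pwm}
    for pos in range(length):
        total = sum(pwm[nt][pos] + pseudo for nt in pwm)
        for nt in pwm:
            p = (pwm[nt][pos] + pseudo) / total
            pssm[nt].append(math.log2(p / background[nt]))
    return pssm

# Hypothetical two-position PWM and a toy GC-rich sequence
pwm = {"A": [0.7, 0.1], "C": [0.1, 0.1], "G": [0.1, 0.7], "T": [0.1, 0.1]}
background = background_from_sequence("GCGCGCGCAT")
pssm = build_pssm(pwm, background)

print(round(pssm["A"][0], 2), round(pssm["G"][1], 2))
```

Although A in position 1 and G in position 2 have the same probability in the PWM (0.7), the A scores higher in the PSSM because A is rare in this GC-rich background. Note that the sketch assumes every nucleotide occurs at least once in the background sequence.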

Finding a motif in a sequence with a PWM/PSSM

PWM and PSSM tables are important not only for representations but also for the search and identification of motif instances in an unknown sequence.

Given a table (be it PWM or PSSM) and a longer sequence in which we are searching for the motif we can think of a strategy with which:

  1. We start at position 1 of the sequence
  2. Extract a subsequence \(S_1\) equal in length to the PWM/PSSM
  3. Compare \(S_1\) with the PWM/PSSM by identifying the score of each nucleotide in each position
  4. Calculate a mean score of \(S_1\)
  5. Go to the next position (=2) of the sequence
  6. Repeat steps 2-5 until the end of the sequence is reached.
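The strategy above amounts to a sliding-window scan, which could be sketched in Python as follows. The function name `scan_sequence`, the toy three-position matrix and the toy sequence are hypothetical and serve only as an illustration. The sketch assumes that the sequence contains only the letters A, C, G and T.

```python
def scan_sequence(sequence, matrix):
    """Score every window of the sequence against a PWM or PSSM.

    Returns a list of (1-based start position, window, mean score).
    """
    length = len(next(iter(matrix.values())))
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        # Look up the score of each nucleotide at its position in the window
        total = sum(matrix[nt][pos] for pos, nt in enumerate(window))
        hits.append((start + 1, window, total / length))
    return hits

# Hypothetical three-position matrix (here a PWM) and toy sequence
matrix = {
    "A": [0.1, 0.0, 0.7],
    "C": [0.0, 0.1, 0.1],
    "G": [0.8, 0.9, 0.1],
    "T": [0.1, 0.0, 0.1],
}
hits = scan_sequence("TTGGACT", matrix)
best = max(hits, key=lambda hit: hit[2])
print(best)
```

The same function works with a PSSM, since it only looks up one score per nucleotide and position. The window with the highest mean score is the most likely motif instance.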

Strong vs Weak. Evaluating motifs.

Up to now we have seen how every motif can be described as a PWM/PSSM. A different question is how different PWMs may describe different patterns. The PWM for NF-kB may contain some very clear positional tendencies in the first and last positions, but this is not the case for the mid-positions. A motif for a different transcriptional regulator may have completely different preferences. We are interested in understanding how the underlying variability in each position, and in the motif as a whole, may be assessed and what it may tell us about the motif in general.

The question may be rephrased using the concept of “motif strength”. A “strong motif” is one that allows little variability. It means that it is very often found occurring in the same way. On the other hand a “weak” motif has a highly variable structure and thus may be found in a genome in many different variations.

Mathematics Interlude: Variation, Information and Entropy

  • In 1948 Claude Shannon’s pioneering work on message transmission introduced a fundamental concept and gave rise to a whole field of science called “Information Theory”.

Claude Shannon

  • The basis of information theory is the concept of Entropy, which is defined as follows:

    • Given the set \(S\) of \(n\) probable outcomes of a “source”, each of which has probability \(P[i]\)
    • The “Shannon” Entropy of this source is equal to the negative sum of the products of those probabilities and their logarithms (taken here in base 2, so that entropy is measured in bits), that is: \[H(S)=-\sum_{i=1}^{n} P[i]\log_2(P[i])\]
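As a small illustration, the entropy formula can be written as a short Python function. The name `shannon_entropy` is chosen for this example. Logarithms are taken in base 2, and outcomes with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a list of outcome probabilities."""
    # Zero-probability outcomes are skipped: they contribute nothing
    return sum(-p * math.log2(p) for p in probs if p > 0)

uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equiprobable outcomes
skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])       # one outcome is favoured
certain = shannon_entropy([1.0, 0.0, 0.0, 0.0])      # only one outcome is possible

print(uniform, round(skewed, 3), certain)
```

The three cases show how entropy behaves: it is highest for the equiprobable source, lower when one outcome is favoured, and zero when the outcome is certain.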

Mathematics Interlude: Information as Entropy

  • It follows from Shannon’s formula that Entropy is maximized when all possible outcomes have equal probability.
  • This is directly related to the notion of Entropy as you know it from Physics. Can you see how?

When all outcomes have equal probability we are in a position of maximum uncertainty. In principle, out of all the possible things that can happen, we cannot dare make a guess in favour of one or another. We are thus in the worst possible state in terms of being informed about the system under study. Thus, when all outcomes are equiprobable, uncertainty is maximum and information is minimum. This is the link between variability, uncertainty and information that allows us to assess the strength of motifs.

Evaluating a motif with Information (I)

How can we then employ this concept in the case of DNA sequence motifs? Let’s say that you want to “guess” what a motif is. You will need to guess the residue occupying each position. Since there are four possible outcomes (nucleotides) for each position, the “before” Entropy is the maximum one, with all outcomes equiprobable (P=0.25), which (with logarithms taken in base 2) gives \(H=2\):

\[H(S)_{before}=-\sum_{i=1}^{4} 0.25\log_2(0.25)=-4\cdot 0.25\cdot(-2)=2\]
This will be the same for every position in the motif.

What is then the entropy once the message has been transmitted? We will denote as “after” the entropy, which we can calculate directly from the PWM. Remember that the PWM contains the probabilities of each position:

\[H(S)_{after}=-\sum_{i=1}^{4} P[i]\log_2(P[i])\]

The information gained for a position is the reduction in entropy from “before” to “after”, and thus:

\[I(S)=2-H(S)_{after}\]

The key is that the smaller the \(H(S)_{after}\), the more information we have gained, since we are reducing the uncertainty of the message.
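A minimal Python sketch of this calculation, applied to each column of a PWM, is given below. The function names and the three-position toy PWM are hypothetical and are not the NF-kB matrix. Logarithms are taken in base 2, so that the “before” entropy of a position is 2.

```python
import math

def position_information(column, alphabet_size=4):
    """I = H_before - H_after for one column of probabilities."""
    h_before = math.log2(alphabet_size)  # equals 2 for four nucleotides
    h_after = sum(-p * math.log2(p) for p in column if p > 0)
    return h_before - h_after

def motif_information(pwm):
    """Information of every position of a PWM (dict: nucleotide -> list)."""
    length = len(next(iter(pwm.values())))
    return [position_information([pwm[nt][pos] for nt in pwm])
            for pos in range(length)]

# Hypothetical toy PWM: conserved, fully variable and two-way positions
pwm = {
    "A": [0.0, 0.25, 0.5],
    "C": [0.0, 0.25, 0.0],
    "G": [1.0, 0.25, 0.0],
    "T": [0.0, 0.25, 0.5],
}
info = motif_information(pwm)
total = sum(info)
print(info, total)
```

The fully conserved position carries the maximum information (2), the fully variable one carries none, and the position split between two residues carries 1. The sum over positions is the information of the motif as a whole.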

Below you see how this is calculated for the NF-kB case:

Notice (red boxes) how some positions have high I, with low uncertainty, while others have low I, when there is variability in the observed residues.

Plotting Information in Sequence Logos

Bioinformaticians have come up with a nice way to represent the information content of motifs. In these representations, called Sequence Logos, each residue is shown as a coloured letter whose height corresponds to its share of the overall I of the position (that is \(P_i \cdot I\), with \(P_i\) being its probability in the given position). Tall stacks of letters correspond to high information content.
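The letter heights of a logo column can be sketched in Python as follows. This assumes the common sequence-logo convention in which each letter's height is its probability multiplied by the information of the position. The function name `logo_heights` and the toy columns are hypothetical.

```python
import math

def logo_heights(column):
    """Letter heights for one logo position.

    `column` maps each nucleotide to its probability at this position.
    Assumed convention: height = probability * information of the position.
    """
    h_after = sum(-p * math.log2(p) for p in column.values() if p > 0)
    info = math.log2(len(column)) - h_after
    return {nt: p * info for nt, p in column.items()}

# Hypothetical columns: one fully conserved, one split between A and T
conserved = logo_heights({"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0})
split = logo_heights({"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5})

print(conserved)
print(split)
```

Under this convention the heights of the letters in a stack add up to the information of that position, so a conserved position gives one tall letter while a variable position gives a short stack.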

Motif representations (PWM/PSSM)

Going back to the motif search with PWM/PSSM, below you may see the logos obtained from the collection of the top 5% highest-scoring instances of the PSSM (top) and PWM (bottom) searches. Notice how the “noisier” PWM search gives rise to a less “clear”, vaguer motif with lower information content, in contrast to the much stronger motif from the PSSM search. The difference in I is considerable (13.7 for the PSSM against 8.6 for the PWM, out of a possible maximum of 20).