We will focus on two particular problems:
What is a motif in general? How do we understand a motif?
Motifs in Biology
We consider a motif a “repetitive substructure of a bigger object that is meant to convey a certain message”
In biology we find a number of meaningful motifs that serve one (or more) of the above functions.
We can think of a probabilistic approach that takes into account the variability in each position. Starting with a number of sequence instances of equal length, we calculate the probability of occurrence of each nucleotide at each position. The resulting table of probabilities is called a Position Weight Matrix, PWM.
This approach allows us to compare the variability of different positions and, in addition, to see which residues tend to be favoured at which position.
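The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `build_pwm` and the toy instances are assumptions made for the example and do not correspond to real binding sites.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def build_pwm(instances):
    """Return a PWM as a dict: nucleotide -> list of per-position probabilities."""
    length = len(instances[0])
    if any(len(seq) != length for seq in instances):
        raise ValueError("all motif instances must have the same length")
    pwm = {n: [] for n in NUCLEOTIDES}
    for pos in range(length):
        # Count how often each nucleotide occurs at this position
        counts = Counter(seq[pos] for seq in instances)
        for n in NUCLEOTIDES:
            pwm[n].append(counts[n] / len(instances))
    return pwm

# Hypothetical toy instances (illustration only)
instances = ["GGGAA", "GGGAT", "GGAAA", "GGGTA"]
pwm = build_pwm(instances)
for n in NUCLEOTIDES:
    print(n, pwm[n])
```

Each column of the resulting table sums to 1, and a fully conserved position (here the first two) has probability 1 for a single nucleotide.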
Sequence variation does not only occur within the motif’s instances. Fluctuations in the composition of DNA sequences occur throughout the genomes of all eukaryotes, and specific regions may be enriched in some residues at the expense of others.
This is particularly evident in the fluctuation of GC% along eukaryotic chromosomes, which are divided into regions of high/low GC% that are called “isochores”.
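As a rough illustration of how such fluctuations can be measured, the sketch below computes GC% in consecutive windows along a sequence. The function name `gc_percent`, the window and step sizes, and the toy sequence are all assumptions chosen for the example. Real isochore analyses use far larger windows.

```python
def gc_percent(sequence, window, step):
    """Return a list of (start, GC%) for windows of the given size along the sequence."""
    profile = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        profile.append((start, 100.0 * gc / window))
    return profile

# Hypothetical toy sequence: an AT-rich half followed by a GC-rich half
sequence = "ATATATATGCGCGCGC"
profile = gc_percent(sequence, window=8, step=8)
print(profile)
```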
The variability in the genomic sequences will undoubtedly interfere with the motif instances. Binding sites embedded in high GC% regions of the genome will be more GC-rich themselves. In order to account for this variability we need to normalize against what we call the background nucleotide composition.
We achieve this with a different kind of table called a Position Specific Scoring Matrix, PSSM.
It is created by dividing the observed probabilities per position in the original PWM (P) by the expected values that are obtained for a given sequence (Q).
A log2-transformation step results in a table where high positive values correspond to preferred residue-position combinations. A PSSM table is always specific to the given sequence in which one searches for the motif, which brings us to the next problem…
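The log2(P/Q) transformation can be sketched as follows. The function names, the toy PWM and the toy target sequence are assumptions made for illustration. The small pseudocount is also an assumption, added only so that zero probabilities do not lead to log2(0); the text above does not specify how zeros are handled.

```python
import math

NUCLEOTIDES = "ACGT"

def background(sequence):
    """Background frequencies Q of each nucleotide in the sequence to be searched."""
    return {n: sequence.count(n) / len(sequence) for n in NUCLEOTIDES}

def build_pssm(pwm, q, pseudo=0.01):
    """Return log2(P/Q) per nucleotide and position.

    `pseudo` is an assumed pseudocount that keeps P above zero.
    Assumes every nucleotide occurs in the background (Q > 0).
    """
    length = len(pwm["A"])
    pssm = {n: [] for n in NUCLEOTIDES}
    for pos in range(length):
        for n in NUCLEOTIDES:
            p = (pwm[n][pos] + pseudo) / (1 + 4 * pseudo)
            pssm[n].append(math.log2(p / q[n]))
    return pssm

# Hypothetical two-position PWM and a GC-rich toy target sequence
toy_pwm = {"A": [0.0, 0.75], "C": [0.0, 0.0], "G": [1.0, 0.25], "T": [0.0, 0.0]}
q = background("GGCCGGCCAT")
pssm = build_pssm(toy_pwm, q)
for n in NUCLEOTIDES:
    print(n, [round(score, 2) for score in pssm[n]])
```

In this toy case the A in the second position scores higher than the fully conserved G in the first, because A is rare in the GC-rich background while G is common. This is the normalization effect the PSSM is meant to capture.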
PWM and PSSM tables are important not only for representing motifs but also for searching for and identifying motif instances in an unknown sequence.
Given a table (be it PWM or PSSM) and a longer sequence in which we are searching for the motif we can think of a strategy with which:
In the graph below you may see how PWM/PSSM scores for a given transcription factor fluctuate against a complete bacterial genome.
Notice how the PWM search (grey) is much more “noisy” and to a large extent correlates with the overall GC% (green). The results of the PSSM search are, on the other hand, a lot clearer and have a greater dynamic range, which allows us to identify real maxima, corresponding to likely positions of the motif in the sequence.
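One common way to carry out such a search is a sliding-window scan: every window of motif length is scored against the table and the highest-scoring windows are kept as candidate instances. The sketch below assumes a log-scale table (a PSSM), so that scores are summed over positions. The function name `scan`, the toy matrix and the toy sequence are hypothetical.

```python
def scan(sequence, pssm):
    """Score every window of motif length; return a list of (start, score)."""
    length = len(next(iter(pssm.values())))
    scores = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        # Sum the log-odds score of the observed base at each offset
        score = sum(pssm[base][offset] for offset, base in enumerate(window))
        scores.append((start, score))
    return scores

# Hypothetical 3-position log-odds matrix favouring "GGA"
toy_pssm = {
    "A": [-2.0, -2.0, 1.5],
    "C": [-2.0, -2.0, -2.0],
    "G": [1.5, 1.5, -2.0],
    "T": [-2.0, -2.0, -1.0],
}
sequence = "ATGGACCGGATT"
hits = scan(sequence, toy_pssm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, sequence[best_start:best_start + 3], best_score)
```

Plotting the scores in `hits` against their start positions gives the kind of score profile described above.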
Up to now we have seen how every motif can be described as a PWM/PSSM. A different question is how different PWMs may describe different patterns. The PWM for NF-kB may contain some very clear positional tendencies in the first and last positions, but this is not the case for the mid-positions. A motif for a different transcriptional regulator may have completely different preferences. We are interested in understanding how the underlying variability in each position and in the motif as a whole may be assessed and what it may tell us about the motif in general.
The question may be rephrased using the concept of “motif strength”. A “strong” motif is one that allows little variability, meaning that its instances almost always occur in the same form. On the other hand, a “weak” motif has a highly variable structure and thus may be found in a genome in many different variations.
Claude Shannon
The basis of information theory is the concept of Entropy which is defined in the following:
When all outcomes have equal probability we are in a position of maximum uncertainty. In principle, out of all the possible things that can happen, we cannot dare make a guess in favour of one or another. We are thus in the worst possible state in terms of being informed about the system under study. Thus when all outcomes are equiprobable, uncertainty is maximum and information is minimum. This is the link between variability, uncertainty and information that allows us to assess the strength of motifs.
How can we then employ this concept in the case of DNA sequence motifs? Let’s say that you want to “guess” what a motif is. You will need to guess the residue occupying each position. Considering there are four different outcomes (nucleotides) for each position, we can calculate the “before” entropy as the maximum one, with all outcomes equiprobable (P=0.25), which gives \(H=2\) bits:
\[H(S)_{before}=-\sum_{i=1}^{4} 0.25 \cdot \log_2(0.25)=2\]
This will be the same for every position in the motif.
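Spelling out the arithmetic, with logarithms taken in base 2 so that the result is expressed in bits:

\[H(S)_{before}=-4 \times 0.25 \times \log_2(0.25)=-4 \times 0.25 \times (-2)=2\]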
What is then the entropy once the message has been transmitted? We will call this the “after” entropy, which we can calculate directly from the PWM. Remember that the PWM contains the probabilities of each nucleotide at each position:
\[H(S)_{after}=-\sum_{i=1}^{4} P[i] \cdot \log_2(P[i])=H\]
and thus:
\[I(S)=2-H(S)_{after}\]
The key is that the smaller the \(H(S)_{after}\), the more information we have gained, since we are reducing the uncertainty of the message.
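The per-position calculation can be sketched as below. The function names and the three example columns are assumptions made for illustration. Each column lists the probabilities of the four nucleotides at one position, and terms with zero probability are skipped, following the usual convention that they contribute nothing to the entropy.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def information(probs, h_before=2.0):
    """Information gained at one position: H_before - H_after."""
    return h_before - entropy(probs)

# Hypothetical PWM columns (probabilities of A, C, G, T at one position)
columns = [
    [0.0, 0.0, 1.0, 0.0],      # fully conserved position
    [0.5, 0.0, 0.5, 0.0],      # two equally likely residues
    [0.25, 0.25, 0.25, 0.25],  # no preference at all
]
for col in columns:
    print(col, entropy(col), information(col))

# Total information of the toy motif is the sum over its positions
total = sum(information(col) for col in columns)
print(total)
```

A fully conserved position reaches the maximum of 2 bits, while a position with no preference carries no information.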
Below you see how this is calculated for the NF-kB case:
Notice (red boxes) how some positions have high I, reflecting low uncertainty, while others have low I because of the variability in the observed residues.
Bioinformaticians have come up with a nice way to represent the information content of motifs. In these representations, called Sequence Logos, each residue is shown as a coloured letter, and the letters of each position are stacked on top of each other. The total height of the stack equals the I of the position, and the height of each letter is its share of that total (which is \(P_i \cdot I\), with \(P_i\) being its probability in the given position). High stacks correspond to high information content.
Going back to the motif search with PWM/PSSM, below you may see the logos obtained for the collection of the top 5% highest-scoring instances from the PSSM (top) and PWM (bottom) searches. Notice how the more “noisy” PWM search gives rise to less “clear” motifs that are more vague and have lower information content, in contrast to the much stronger motifs from the PSSM search. The difference in I is considerable (13.7 against 8.6 out of a possible maximum of 20).