In this article, Gibbs Sampling is used to find the Best K-mer Motifs from a collection of Dna strings. This problem appears in the Rosalind and the Stepic Bioinformatics problem-solving sites and also as an assignment in a Coursera BioInformatics Algorithms course provided by UCSD.
A Profile-randomly generated k-mer in a string is defined as follows: For each k-mer Pattern in Text, compute the probability Pr(Pattern | Profile), resulting in n = |Text| - k + 1 probabilities (p1, ., pn). These probabilities do not necessarily sum to 1, but can still form the random number generator Random(p1, ., pn) based on them.
GIBBSSAMPLER uses this random number generator to select a Profile-randomly generated k-mer at each step: if the n-sided (biased) die with probability of face-showups (p1, ., pn) rolls the number i, then the Profile-randomly generated k-mer is defined as the i-th k-mer in Text.
The algorithm is then repeated a few times with a few random starts and the best motifs are chosen in the end.
The following shows the algorithm used by to implement the Gibbs Sampler and also couple of sample input datasets used and the output best motifs computed.
The animations show the Profile matrix (with Information Content Scaled / unscaled) computed at each iteration of the Repeated GibsSampling algorithm and how it changes.