Algorithmic Complexity of Short Strings

Antonio Rueda-Toicen
March, 2016

Problem: Detecting Regularities within Strings of Data

Can you imagine a pattern analysis application where it would be useful to have a numerical measure for how “complex” (non-regular) or “simple”“(regular) a pattern of data is?

Bioinformaticians routinely search for regions of low complexity when doing structural analysis of complete genomes, as low complexity regions in DNA may be functionally important.
Cryptographers usually want to know how complex a password is. Complex passwords are harder to break than simple ones.

Complexity Intuition

Intuitively, which of these strings is more regular? Which one is more complex?

“AAAAAAAAAA”
“ATATATATAT”
“ATGCCGGCCT”

Kolmogorov Complexity

The complexity of an object can be measured by the length of its shortest description.

For instance, the string “010101010101” has the short description “6 repetitions of 01”, while the string “011001010011” has no description shorter than itself.

Formally, the Algorithmic (“Kolmogorov”) Complexity of a string s is defined as the length of the shortest program that outputs s , when the program is run on a Universal Turing Machine.

Kolmogorov Complexity Estimation in the ACSS Package

The ACSS package uses the Coding Theorem Method, which relates the Kolmogorov Complexity (K) of a string to its algorithmic probability (D).

The algorithmic probability is the frequency of production of the string by small Turing Machines using the same (or greater) alphabet size.

\[ K_{alphabetSize} = -log2(D_{alphabetSize}) \]

require(acss)

acss("AAAAAAAAAA", alphabet=4)

                K.4         D.4
AAAAAAAAAA 23.21958 1.02379e-07

acss("ATGCCGGCCT", alphabet=4)

                K.4          D.4
ATGCCGGCCT 33.11936 1.071714e-10

Try different alphabet sizes in the ACSS Shiny demo

alt text Notice how the more regular strings have lower Kolmogorov complexity.

alt text Notice that when using an alphabet of 2 symbols, ACSS can't estimate the complexity of a string written in 4 symbols, because the algorithmic probability of this string is unknown in this distribution.

More information can be found at http://complexitycalculator.com/methodology.html