Antonio Rueda-Toicen
March, 2016
Can you imagine a pattern analysis application where it would be useful to have a numerical measure for how “complex” (non-regular) or “simple”“(regular) a pattern of data is?
Intuitively, which of these strings is more regular? Which one is more complex?
The complexity of an object can be measured by the length of its shortest description.
For instance, the string “010101010101” has the short description “6 repetitions of 01”, while the string “011001010011” has no description shorter than itself.
Formally, the Algorithmic (“Kolmogorov”) Complexity of a string s is defined as the length of the shortest program that outputs s , when the program is run on a Universal Turing Machine.
The ACSS package uses the Coding Theorem Method, which relates the Kolmogorov Complexity (K) of a string to its algorithmic probability (D).
The algorithmic probability is the frequency of production of the string by small Turing Machines using the same (or greater) alphabet size.
\[ K_{alphabetSize} = -log2(D_{alphabetSize}) \]
require(acss)
acss("AAAAAAAAAA", alphabet=4)
K.4 D.4
AAAAAAAAAA 23.21958 1.02379e-07
acss("ATGCCGGCCT", alphabet=4)
K.4 D.4
ATGCCGGCCT 33.11936 1.071714e-10
Notice how the more regular strings have lower Kolmogorov complexity.
Notice that when using an alphabet of 2 symbols, ACSS can't estimate the complexity of a string written in 4 symbols, because the algorithmic probability of this string is unknown in this distribution.
More information can be found at http://complexitycalculator.com/methodology.html