# Approximations to Algorithmic Probability

**DOI:**https://doi.org/10.1007/978-3-642-27737-5_700-1

- 105 Downloads

## Keywords

Probabilistic Algorithm Small Turing Machines Universal Distribution Algorithmic Content Block Decomposition Method (BDM)## A Computational Theory of Everything

Physicists have long been searching for a so-called Theory of Everything (ToE). Just as quantum mechanics explains the smallest phenomena, from microwaves to light, and general relativity explains the classical world, from gravity to space and time, a ToE would explain everything in the universe, from the smallest to the largest phenomena, in a single formulation.

Science has operated under the assumption and in light of strong evidence that the world is highly, if not completely, algorithmic in nature. If the world were not structured, our attempts to construct a body of theories from observations, to build what we consider ordered scientific models of the world, would have failed. That they have not is spectacular vindication of the validity of the world’s orderly character. We started out believing that the world was ruled by magic, and by irrational and emotional gods. However, thinkers from ancient civilizations such as China and India and, notably, Greece made the first recorded attempts to introduce reason and logic into scholarly discourse, advancing rules that may be seen as the first steps towards a computational theory of causality, where every observation was assumed to have a cause that was something other than a metaphysical explanation.

Over the last few centuries, science has made a lot of progress explaining more with less effectively suggesting the algorithmic nature of the universe. For a long time, the Ptolemaic geocentric model was thought to explain the strange motion of planets in the sky. When it came to planetary movement, the geocentric theory had the same predictive power as the later heliocentric model. The geocentric model, however, was unable to explain how the outer planets seemed to have a period related to the Sun’s period if they had no special relation amongst themselves. Not only was the heliocentric model a better model because it required, overall, fewer constraints or features, such as the many epicycles needed in the geocentric model, but it did not need to be supplemented by additional explanations. Some anomalies remained to be accounted for (having to do with the elliptical orbit, as discovered by Kepler) but it did not need to explain odd regularities that were apparent yet unexplained by the geocentric model. We have replaced the heliocentric by much more sophisticated models, starting with Kepler’s and Newton’s, that not only endorsed and embraced the heliocentric model but placed at the center of the solar system a center of gravity common to the Sun and the rest of the planets, a center of gravity that turns out to be inside the Sun but not to be exactly the Sun or its center. In other words, under updated models the truth value of the claim that either the planets revolve around the Sun or that the Sun revolves around the planets is relative. With this update on the heliocentric model, Newton was able to link the movement of the planets to the movement of everyday objects on Earth, and to supply a reason as to why objects fall down and why we appear to be stuck to the Earth’s surface, an amazing feat where a minor increase in the sophistication of a model yielded explanations of a tremendous number of hitherto unexplained phenomena. But it was not until Einstein’s general relativity that we could explain phenomena that none of the previous classical models could, phenomena that were considered strange anomalies, as if they were the result of the caprices of ancient gods, such as the so-called Venus’ perihelion movement, the phenomenon that sees Venus’ elongated orbit change significantly over the course of a year. The only indication that Venus’ movement was not subject to the mood of an angry god was perhaps its regularity. So both regularity in phenomena such as the movement of stars and planets, and the achievement of superior models able to explain a much greater number of unexplained phenomena with little to no increase in sophistication compared to precursors, are at the center of science.

That we are able to explain more without need of many more assumptions, devising ever more encompassing scientific models, suggests that our complex world tends toward algorithmic simplicity. Indeed, the more simple physics finds the world to be, the more algorithmic it suggests it is, as the universe seems to be generated or explained by simple principles. The chances of uncovering a short computer program producing the universe are thus greater if physical phenomena can be explained by increasingly simple and shorter descriptions and models. This does not come as a surprise. If current cosmological theories are correct, the universe has a common and single origin and is thus deterministically connected from beginning to end.

It is true that laws are only models of natural phenomena and that some later models may become slightly more complex before becoming more simple, but ultimately all models in science have become simpler and simpler with time and further research, so that today we can explain a greater number of phenomena encompassing almost every aspect of the natural world using only two major models, general relativity and the standard model. So both the direction of science and the assumption that the world is algorithmic in nature are solid and definite. We can scientifically explain much more today than we could 50 years ago, and 50 years ago we could explain much more than we could 200 years ago, and so on, using simpler and shorter scientific models of incredible accuracy.

*Panel on The Limits of Understanding*Minsky pointed out (Source: https://youtu.be/DfY-DRsE86s?t=1h30m2s):

It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called, Algorithmic Probability, which is a fundamental new theory of how to make predictions given a collection of experiences, and this is a beautiful theory. Everybody should learn it, but it’s got one problem, that is, that you cannot actually calculate what this theory predicts because it is too hard, it requires an infinite amount of work. However, it should be possible to make practical approximations to the Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we have today. Everybody should learn all about that and spend the rest of their lives working on it.

World Science Festival in New York City, Dec 14, 2014.

### The Power of Mechanistic Models

So what is a causal and mechanistic model in more general terms? Newton’s model is causal and mechanistic because it provides the means to, in principle, reproduce the conditions of the solar system and generate exactly the same outcome. This basically means that Newton’s model can be emulated mechanistically to reproduce the movement of the planets and Sun. Indeed, the only difference between Kepler’s and Newton’s models, given that they are equally accurate in their prediction of planetary motion, is that Newton’s is causal and can explain many more phenomena outside the realm of planetary motion. Not all models are of the same type. For example, in a mechanistic model, the sun’s cycle may require two coupled gears of size 20 × 20 the gear cycle of the moon; the mechanical model would then suggest such a ratio between size and distance of moon to sun. However, this would have been impossible to figure out by following the epicycles model. An example of a mechanistic model is a Rube Goldberg Machine (see Fig. 5), because such a machine implements itself, that is, every action of the machine corresponds to the actual performance of a task. In other words, the best model of something is itself. But we rarely have the chance to design or control a system to include an accurate representation of itself; instead we produce hypotheses for models that we test.

Let’s take as an example a very abstract one, a rational number such as π. The constant π is a computable number and thus can be generated mechanistically. A mechanical model of π is the one suggested by the relationship between the diameter and the circumference of a circle. This, in practice, is quite difficult and is not undertaken physically, but from its abstract calculation, mathematicians have learned many properties beyond the original realm of π itself. For example, π is of low algorithmic (Kolmogorov-Chaitin) complexity because it has many short descriptions, as has been found from many relationships in mathematics.

## Algorithmic Probability

What connects all mechanistic model-driven approaches in science is none other than the computational theory of Algorithmic Probability (Levin 1974; Solomonoff 1964), a theory that elegantly connects computation with classical probability in a proper way through a theorem called the *algorithmic Coding theorem* (c.f. below), which for its part establishes that the most frequent outcome of a causal/deterministic system will also be the most simple in the sense of algorithmic complexity, that is, the estimation of the length of the shortest computer program that produces such an output.

*x*running on a universal (prefix-free (The group of valid programs forms a prefix-free set (no element is a prefix of any other, a property necessary to keep 0 <

*m*(

*x*) < 1). For example, program 2 starts with program 1, so if program 1 is a valid program then program 2 cannot be valid. The set of valid programs is said to constitute a prefix-free set, that is, no element is a prefix of any other, a property necessary to keep 0 <

*m*(

*x*) < 1.)) Turing machine

*U*. The probability measure

*m*(

*x*) is traditionally called Levin’s semi-measure, Solomonoff-Levin’s semi-measure, or

*algorithmic probability*, and it defines what is known as the

*Universal Distribution*(

*semi*from

*semi-decidable*and the fact that m(x), unlike probability measures, does not add up to 1).

*m*(

*x*) formalizes the concept of Occam’s razor (favoring simple – or shorter – hypotheses over complicated ones) (Fig. 1).

### An Alternative Method to Lossless Compression

There is a trend in statistical machine learning fueled by big data and computational power that involves the packing of a set of regression analysis methods into what is called a deep learning and deep neural network. These approaches offer black boxes with incredible capabilities but cannot be farther away from mechanistic models. In some ways, they are sophisticated epicyclic models of some data. They can be extremely useful, can even be as precise as the corresponding mechanistic models of the phenomena they try to capture, but they will fall short at scaling and abstracting general properties into models one can build upon without having to add yet another layer of black boxes. The challenge is how to put together the enormous capabilities of deep learning to synthesize data into a combinatorial set of feature variables with the abstraction capabilities of truly model-based approaches.

Some approaches are unconventional but are not meant to remain so forever. The alternatives for measuring and applying algorithmic complexity other than lossless compression and translating highly abstract theory to highly applied areas have started to produce surprising results that would not be possible had the new numerical methods been unable to go beyond previous traditional schemes. The detailed theory introduced in (Delahaye and Zenil 2012; Soler-Toscano et al. 2014) as a novel alternative to compression (Delahaye and Zenil 2012) has found applications in the study of graphs and networks (Zenil et al. 2014), in molecular biology (Zenil et al. 2016), and animal and human cognition (Gauvrit et al. 2014a, b, 2016, 2017a, b).

*K*(

*x*) and its algorithmic probability of being produced by a random computer program

*m*(

*x*) (Gregory and Chaitin 1966; Levin 1974) (whose executable code digits are the result of tossing a coin). The theorem established that:

*c*is a constant independent of

*x*.

The intuition behind the Coding theorem (2) is a beautiful relation between program-size complexity and frequency of production. If you wished to produce the digits of a mathematical constant like π by throwing digits at random, you would have to produce every digit of its infinite irrational decimal expansion. If you seated a monkey at a typewriter (with, say, 50 keys), the probability of the monkey typing an initial segment of 2400 digits of π by chance is (1/50^{2400}).

If, instead, the hypothetical monkey (simply a source of randomness) is seated at a computer, the chances of its producing a program generating the digits of π are on the order of 1/50^{158}, because it would take the monkey only 158 characters to produce the first 2400 digits of π using, for example, C language.

The less random an object, the more likely it is to be produced by a short program. There is a greater probability of producing the program that produces the object, especially if the program producing *x* is short. Therefore, if an object is compressible, then the chances of producing the compressed program that produces said object are greater than they would be for a random object, i.e., ∣*p*∣ ≪ ∣*x*∣ such that *U* (*x*) = *x*.

### An Unconventional Approach to Computing the Uncomputable

It is one thing to propose that if the world is computational then it has to be generated by some computational rule (Schmidhuber 2002), and a different thing to explore the space of such possible computational rules (Wolfram 2002). However, very little to nothing had been done to actually approximate something like Levin’s universal distribution, to plug it back into a possible algorithmic theory of everything and find applications in the sense of Minsky (c.f. quotation above). Moreover, through the algorithmic Coding theorem one would be able to approximate algorithmic (Kolmogorov-Chaitin) complexity rather than, for example, approximating it only by statistical means, by using common lossless compression algorithms that implement measures of entropy rate based on statistical regularities – repetitions – captured by a sliding window traversing a string.

*n*, 2) be the set of all Turing machines with

*n*states and 2 symbols, and let

*D*(

*n*) be the function that assigns to every finite binary string

*x*the quotient:

*D*(

*n,m*) is an algorithmic measure based on Levin’s Universal Distribution

*m*(

*s*) approximated by finite means (Soler-Toscano et al. 2014), and it can be used to approximate

*K*(

*s*), the algorithmic complexity of string

*s*by using the algorithmic Coding Theorem as follows (Fig. 3):

where CTM stands for *Coding Theorem Method.* The greater the value of *n* we use to calculate *D*(*n,m*)(*s*), the better the approximation to *K*(*s*).

*D*(*n*) can also be seen as averaging *K* over a large set of possible computer languages, each represented by a small Turing machine, in order to reduce the possible impact of the constant from the *invariance theorem*.

The CTM allows us not only to build an empirical distribution of computer program outputs from smaller to larger sizes, but once a string is generated for the first time among all the computer programs of the smallest size, we know which Turing machine is the smallest one producing a string and also the runtime of such a Turing machine. We take the runtime of such a Turing machine as an estimation of another seminal measure of complexity, Bennett’s Logical Depth (LD) approximated by CTM. LD is a measure of *sophistication* that assigns low complexity to both simple and random objects and is defined as the computation time that it takes to decompress an object from its (near) greatest compressed version.

Once the tools and methods are to hand, one can see to what extent the world really runs on a universal mechanical computer *algorithmicity* (Zenil 2011; Zenil and Delahaye 2010): how much the outcome of a process resembles the way in which outcomes would be sorted according to the universal distribution, and of *programmability* (Zenil 2013, 2014a, b, 2015); how much such a process can be reprogrammed at will. The more reprogrammable, the more causal, given that a random object cannot be reprogrammed in any practical way other than by changing every possible bit.

### A Hybrid Divide-and-Conquer Method

The Block Decomposition Method (BDM) allows us to extend the power of the Coding theorem by using a combination of Shannon entropy and algorithmic probability. The idea is to divide a piece of data into segments for which we have algorithmic probability approximations, and then count the number of times that a local regularity occurs (Fig. 5). The power of LD can also be extended by a BDM-like formula, a multiplicative variation of the BDM formula.

What would these CTM and BDM methods do with, for example, the binary digital expansion of a mathematical constant such as π that we know if of low complexity but is also random-looking? We know statistical approaches would fail to characterize it. But we need to make a distinction between *principle* and *practice.* The Coding Theorem method (CTM) can, in principle, and to some extent, in practice, find the algorithmic nature of π. CTM would eventually find that subsequences of π have lower complexity than that estimated by, e.g., Shannon Entropy by an uninformed observer (with no access to the knowledge of the deterministic nature of π). Such an observer would assign maximal entropy to the digits of π because it looks random (there are no apparent repetitions at any level of granularity) where, in contrast, CTM (and thus BDM) would asymptotically find π to be of low algorithmic complexity because it can be generated by short mathematical formulas implemented by short computer programs.

### A Sequence of Causal Models Is Itself a Causal Model

Trillions of small computer programs were run to build an empirical probability distribution (Delahaye and Zenil 2012; Soler-Toscano et al. 2014) that approximates what in the theory of algorithmic probability is called the *universal distribution* (Levin 1974), used as the prior in all sorts of Bayesian calculations that require the framing of hypotheses about generating models or mechanisms that produce an object. Take a binary string of a million 1 s. The theory tells us that the most likely mechanistic generator of 1 s is not print (1 million 1 s) but a recursive program that iterates over print (1) 1 million times, that is, in this case, the shortest possible program. For this sequence the most likely and highest causal generating program can easily be guessed, but for most sequences this is very difficult. However, the more difficult the more random, thus giving us some clues as to the causal content of the object.

BDM shows that the best solution to approximating *K* is to make the best use of both methods using each where relevant and/or practical and thus combining Entropy and algorithmic complexity (Fig. 5). One can decide to capture more algorithmic content with CTM alone but it becomes infeasible (ultimately uncomputable), and then one switches to Entropy, especially if one has already found that an object is of low Entropy, and thus low *K*. There is no need to waste resources only to rediscover that it is of low *K*.

For example, while compression algorithms may fail to find the algorithmic content of π, our method may give you a small clue that π is not truly random (maximal *K*) as Entropy would otherwise suggest. Applying these more powerful measures across several strings, and not only testing against π, provides a handle that one would never reach with statistical tools. So if your segment of π is too long, then BDM will behave asymptotically, like Entropy alone if one does not update the precomputed data from CTM. BDM is optimal for short strings, but for medium strings, from 100 to 1 K bits, it carries the CTM signal to find small signs of algorithmic content, thereby complementing lossless compression algorithms covering the full range of lengths and different information content approximations, showing that the best approaches are hybrid algorithms.

Given that the model machine we use is universal, we know that some machines will implement algorithms generating subsequences of π because π is of low *K* and thus there are very short programs that can generate it (in full, let alone in pieces). To properly compare Entropy and *K* approximated by CTM/BDM, one has to normalize max entropy with max *K* and look at the distributions (see Fig. 4 bottom). One will see then that Entropy approximates the classical Bernoulli distribution, while *K* with BDM approximates an exponential, just as the theory predicts (Levin’s universal distribution or semi measure (Levin 1974)). The distribution would not yet yield the same order of strings, but one can see the strings that have max Entropy but are not max *K* – these are the ones gained by applying our methods versus Entropy or compression. Performing a test for very long sequences of π without updating CTM eventually leads to BDM losing the algorithmic edge given by CTM (Zenil et al. 2016), unless ones reruns CTM for (exponentially) longer periods.

The small machine constitutes a mechanistic model because the Turing machine is mechanistic in nature and explains the small piece of data that it outputs upon halting. By putting all small Turing machines together, each explaining a small piece of the larger system, one can take the whole sequence and use it as a model approximation for the whole. If the small patches are causal and with small programs (Turing machines), the sequence will be shorter than the maximum *K* sequence (similar in size to the object itself), thus providing not only a local approximation of algorithmic complexity but also insight into the causal nature (or otherwise) of the larger system.

## Conclusion

We have explained how computability and algorithmic complexity are now at the center of the challenge of causality in science, and while it is more difficult to numerically evaluate, new methods have been advanced to estimate universal measures of complexity. Only universal measures of complexity can characterize any (computable) feature and can be invariant to object description and can encode, in principle, every possible computable function and therefore, potentially, every deterministic system following classical mechanics. It is therefore an extremely powerful measure tantamount to a computational theory of everything, and numerical methods to approximate it give an edge in finding key indications of causal content in a system, which is germane to understanding aspects of evolving causal systems and to developing algorithmic calculi to manipulate them.

## Bibliography

- Delahaye J-P, Zenil H (2012) Numerical evaluation of algorithmic complexity for short strings: a glance into the innermost structure of randomness. Appl Math Comput 219(1):63–77zbMATHGoogle Scholar
- Gauvrit N, Singmann H, Soler-Toscano F, Zenil H (2016) Algorithmic complexity for psychology: a user-friendly implementation of the coding theorem method. Behav Res Methods 48(1):314–329CrossRefGoogle Scholar
- Gauvrit N, Soler-Toscano F, Zenil H (2014) Natural scene statistics mediate the perception of image complexity. Vis Cogn 22(8):1084–1091CrossRefGoogle Scholar
- Gauvrit N, Zenil H, Delahaye J-P, Soler-Toscano F (2014) Algorithmic complexity for short binary strings applied to psychology: a primer. Behav Res Methods 46(3):732–744CrossRefGoogle Scholar
- Gauvrit N, Zenil H, Soler-Toscano F, Delahaye J-P, Brugger P (2017) Human behavioral complexity peaks at age 25. PLoS Comput Biol 13(4):e1005408CrossRefGoogle Scholar
- Gauvrit N, Zenil H, Tegnér J (2017) The information-theoretic and algorithmic approach to human, animal and artificial cognition. In: Dodig-Crnkovic G, Giovagnoli R (eds) Representation and reality: humans, animals and machines. Springer, New YorkGoogle Scholar
- Gregory J, Chaitin GJ (1966) On the length of programs for computing finite binary sequences. J ACM 13(4):547–569MathSciNetCrossRefzbMATHGoogle Scholar
- Kirchherr W, Li M, Vitányi P (1997) The miraculous universal distribution. Math Intell 19:7–15MathSciNetCrossRefzbMATHGoogle Scholar
- Levin LA (1974) Laws of information conservation (nongrowth) and aspects of the foundation of probability theory. Probl Peredachi Inf 10(3):30–35Google Scholar
- Schmidhuber J (2002) Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. Int J Found Comput Sci 13(4):587–612MathSciNetCrossRefzbMATHGoogle Scholar
- Shannon CE (1948) A mathematical theory of communication parts i and ii. Bell Syst Tech J 27:379–423. and 623–656CrossRefzbMATHGoogle Scholar
- Soler-Toscano F, Zenil H, Delahaye J-P, Gauvrit N (2014) Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines. PLoS One 9(5):e96223ADSCrossRefGoogle Scholar
- Solomonoff RJ (1964) A formal theory of inductive inference. Part i. Inf Control 7(1):1–22MathSciNetCrossRefzbMATHGoogle Scholar
- Tegnér J, Zenil H, Kiani NA, Ball G, Gomez-Cabrero D (2016) A perspective on bridging scales and design of models using low-dimensional manifolds and data-driven model inference. Phil Trans R Soc A 374(2080):20160144ADSCrossRefGoogle Scholar
- Wolfram S (2002) A new kind of science. Wolfram Media, ChampaignzbMATHGoogle Scholar
- Zenil H (2011) The world is either algorithmic or mostly random, 2011. Winning 3rd place in the international essay context of the FQXiGoogle Scholar
- Zenil H (2013) A behavioural foundation for natural computing and a programmability test. In: Dodig-Crnkovic G, Giovagnoli R (eds) Computing nature. Springer, New York, pp 87–113CrossRefGoogle Scholar
- Zenil H (2014a) Programmability: a Turing test approach to computation. In: L. De Mol, G. Primiero, (eds) Turing in context, Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten (Belgian Academy of Sciences and Arts), Contactforum. Belgian Academy of Sciences and ArtsGoogle Scholar
- Zenil H (2014b) What is nature-like computation? A behavioural approach and a notion of programmability. Philos Technol 27(3):399–421CrossRefGoogle Scholar
- Zenil H (2015) Algorithmicity and programmability in natural computing with the game of life as in silico case study. J Exp Theor Artif Intell 27(1):109–121CrossRefGoogle Scholar
- Zenil H, Delahaye J-P (2010) On the algorithmic nature of the world. In: Dodig-Crnkovic G, Burgin M (eds) Information and computation. World Scientific, LondonGoogle Scholar
- Zenil H, Kiani NA, Tegnér J (2016) Methods of information theory and algorithmic complexity for network biology. Semin Cell Dev Biol 51:32–43CrossRefGoogle Scholar
- Zenil H, Kiani N, Jesper T (2017) Low algorithmic complexity entropy-deceiving graphs. Phys Rev E 96(012308)Google Scholar
- Zenil H, Soler-Toscano F, Dingle K, Louis AA (2014) Correlation of automorphism group size and topological properties with program-size complexity evaluations of graphs and complex networks. Physica A Stat Mech Appl 404:341–358MathSciNetCrossRefGoogle Scholar
- Zenil H, Soler-Toscano F, Kiani NA, Hernández-Orozco S, Rueda-Toicen A (2016) A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity. arXiv preprint arXiv:1609.00110Google Scholar
- Zenil H. (2017) Algorithmic Data Analytics, Small Data Matters and Correlation versus Causation. In M. Ott, W. Pietsch, J. Wernecke (eds.), Berechenbarkeit der Welt? Philosophie und Wissenschaft im Zeitalter von Big Data (Computability of the World? Philosophy and Science in the Age of Big Data), Springer Verlag, pp. 453-475Google Scholar