Minimum description length


Minimum description length (MDL) refers to various formalizations of Occam's razor based on formal languages used to parsimoniously describe data. MDL is a model selection principle: the shortest description of the data is the best model. These descriptions are intended to provide data-driven scientific models. It is an important concept in information theory and computational learning theory. Indeed, one definition of information, algorithmic information, is that the information content of a data set is the size of the smallest program that outputs that data set. However, the definite noun phrase "the minimum description length principle" has caused much confusion, because it usually refers to Jorma Rissanen's 1978 pragmatic attempt to automatically derive short descriptions, not to the model selection principle that inspired the attempt. The MDL principle is best represented by Solomonoff's theory of inductive inference, which holds that the best model of a data set is embodied by its shortest self-extracting archive.

Overview

Selecting the minimum-length description of the available data as the best model observes the principle identified as Occam's razor. Prior to the advent of computer programming, generating such descriptions was the intellectual labor of scientific theorists, and it was far less formal than it has become in the computer age. If two scientists had a theoretical disagreement, they could rarely apply Occam's razor formally to choose between their theories: they would have different data sets and possibly different descriptive languages. Nevertheless, science advanced, with Occam's razor serving as an informal guide in deciding which model was best.
With the advent of formal languages and computer programming, Occam's razor could be defined mathematically. Models of a given set of observations, encoded as bits of data, could be created in the form of computer programs that output that data. Occam's razor could then formally select the shortest program, measured in bits of this algorithmic information, as the best model.
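For instance, the following Python sketch (purely illustrative; the particular data and descriptions are assumptions of the example, not part of any formal MDL treatment) compares two descriptions of the same 1000-symbol data set by their length:

# Two descriptions of the same data set, compared by length as a crude
# proxy for description length in bits.
data = "01" * 500                      # the observations: "0101...01", 1000 symbols

literal_description = repr(data)       # quote the data verbatim, no model at all
program_description = '"01" * 500'     # a short Python expression that regenerates the data

assert eval(program_description) == data   # both descriptions reproduce the data exactly

print(len(literal_description))        # 1002 characters
print(len(program_description))        # 10 characters, so preferred under Occam's razor / MDL

The assertion mirrors the requirement, noted below, that the description reproduce the original data set without error.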
To avoid confusion, note that there is nothing in the MDL principle that implies a machine produced the program embodying the model. It can be entirely the product of humans. The MDL principle applies regardless of whether the description to be run on a computer is the product of humans, machines or any combination thereof. The MDL principle requires only that the shortest description, when executed, produce the original data set without error.

Two-Part Codes

The distinction in computer programs between programs and literal data applies to all formal descriptions and is sometimes referred to as "two parts" of a description. In statistical MDL learning, such a description is frequently called a two-part code.

MDL in Machine Learning

MDL applies in machine learning when algorithms generate descriptions. Learning occurs when an algorithm generates a shorter description of the same data set.
The theoretic minimum description length of a data set, called its Kolmogorov complexity, cannot, however, be computed. That is to say, even if by random chance an algorithm generates the shortest program of all that outputs the data set, an automated theorem prover cannot prove there is no shorter such program. Nevertheless, given two programs that output the dataset, the MDL principle selects the shorter of the two as embodying the best model.
In practice, when finding a machine learning model that best fits the data according to the MDL principle, one selects the model that minimizes the length of a two-part code. As an example, in the supervised case of learning the best-fitting rule list, one minimizes the combined length of: 1) the code that uniquely defines a rule list among all possible rule lists that could be generated from that dataset, i.e., taking into account the number of variables and their possible combinations; and 2) the code that returns the dataset given the specific rule list as defined in the first code. The first code is specifically designed to avoid overfitting by penalizing more complex rule lists, which would naturally be more prone to overfitting. Note that using a two-part code is essential when one needs to distinguish between different models, as in the case of finding the model and its parameters that best fit the data; this contrasts with the task of merely wanting to know the size of the model needed for the best performance.
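As an illustration of the two-part idea in general (this is not the rule-list code described above; the cost functions, the candidate parameter precisions, and the data are assumptions made up for the sketch), the following Python code selects among crude Bernoulli descriptions of a binary sequence by minimizing model cost plus data cost:

import math

def data_bits_given_model(data, p_heads):
    # Codelength of a binary sequence under a Bernoulli(p_heads) model, in bits:
    # -log2 of the probability the model assigns to the data.
    bits = 0.0
    for x in data:
        p = p_heads if x == 1 else 1.0 - p_heads
        bits += -math.log2(p)
    return bits

def model_bits(precision_bits):
    # Cost of stating the model: here, just the number of bits used to
    # discretize the single parameter p_heads (a deliberately crude choice).
    return precision_bits

data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20   # 200 tosses, 60% heads

candidates = []
for precision in (1, 4, 8):
    # Parameter stated to 'precision' bits, i.e. rounded onto a grid of 2**precision values.
    grid = 2 ** precision
    p = round(sum(data) / len(data) * grid) / grid
    p = min(max(p, 1.0 / (2 * grid)), 1 - 1.0 / (2 * grid))   # keep p away from 0 and 1
    total = model_bits(precision) + data_bits_given_model(data, p)
    candidates.append((total, precision, p))

print(min(candidates))   # the candidate minimizing the two-part codelength wins

Coarser precisions cost fewer model bits but fit the data worse; the minimum of the sum is the MDL choice.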

Recent Work on Algorithmic MDL Learning

Recent machine MDL learning of algorithmic, as opposed to statistical, data models has received increasing attention with the growing availability of data, computational resources, and theoretical advances. Approaches are informed by the burgeoning field of artificial general intelligence. Shortly before his death, Marvin Minsky came out strongly in favor of this line of research.

Statistical MDL Learning

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet. The MDL principle is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e., to describe it using fewer symbols than needed to describe the data literally.
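A rough way to see this in practice is to hand the data to a general-purpose compressor, here used only as a stand-in for a description method (the strings below are arbitrary examples):

import zlib, random

# Highly regular data compresses well; data with no exploitable regularity does not.
regular = ("heads,tails," * 500).encode()
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(len(regular)))

print(len(regular), len(zlib.compress(regular)))   # compresses to a small fraction of its length
print(len(noisy), len(zlib.compress(noisy)))       # stays close to its literal length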

Based on this, in 1978, Jorma Rissanen published an MDL learning algorithm using the statistical notion of information rather than algorithmic information. It starts out with this idea: all statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in data is also the one that is able to statistically compress the data most. Like other statistical methods, it can be used for learning the parameters of a model using some data. Usually, though, standard statistical methods assume that the general form of a model is fixed. MDL's main strength is that it can also be used for selecting the general form of a model and its parameters. The quantity of interest is called a hypothesis. The basic idea is then to consider the two-stage code that encodes data $D$ with length $L(D)$ by first encoding a hypothesis $H$ in the set of considered hypotheses $\mathcal{H}$ and then coding $D$ "with the help of" $H$; in the simplest context this just means "encoding the deviations of the data from the predictions made by $H$":

$$ L(D) = \min_{H \in \mathcal{H}} \bigl( L(H) + L(D \mid H) \bigr). $$

The $H$ achieving this minimum is then viewed as the best explanation of the data $D$. As a simple example, take a regression problem: the data $D$ could consist of a sequence of points $D = (x_1, y_1), \ldots, (x_n, y_n)$, and the set $\mathcal{H}$ could be the set of all polynomials from $x$ to $y$. To describe a polynomial $H$ of degree (say) $k$, one would first have to discretize the parameters to some precision; one would then have to describe this precision (a natural number); next, one would have to describe the degree $k$ (another natural number), and in the final step, one would have to describe the $k+1$ parameters; the total length would be $L(H)$. One would then describe the points in $D$ using some fixed code for the $x$-values and then using a code for the $n$ deviations $y_i - H(x_i)$.
In practice, one often uses a probabilistic model. For example, one associates each polynomial $H$ with the corresponding conditional distribution expressing that given $x$, $y$ is normally distributed with mean $H(x)$ and some variance $\sigma^2$, which could either be fixed or added as a free parameter. Then the set of hypotheses $\mathcal{H}$ reduces to the assumption of a linear model, $y = H(x) + \epsilon$, with $H$ a polynomial.
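The following Python sketch turns the regression example into a concrete, if crude, two-part code; the fixed per-coefficient precision (PARAM_BITS), the fixed noise level (SIGMA), and the simple code for the degree are simplifying assumptions for illustration only:

import math
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.2, size=x.size)   # true degree: 2

PARAM_BITS = 16      # bits used to state each discretized coefficient (an assumption)
SIGMA = 0.2          # assumed fixed noise standard deviation

def two_part_codelength(degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    # L(H): a crude code for the degree plus (degree + 1) discretized coefficients.
    model_bits = math.log2(degree + 2) + (degree + 1) * PARAM_BITS
    # L(D|H): -log2 of the Gaussian density of the deviations (comparable across
    # degrees because the same discretization constant is shared by all candidates).
    data_bits = sum(0.5 * math.log2(2 * math.pi * SIGMA**2) + (r**2) / (2 * SIGMA**2 * math.log(2))
                    for r in residuals)
    return model_bits + data_bits

for k in range(6):
    print(k, round(two_part_codelength(k), 1))   # the minimum is typically at the true degree

Each additional degree costs a fixed number of model bits, so the data bits must drop by at least that much to justify a more complex polynomial.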
Furthermore, one is often not directly interested in specific parameter values, but just, for example, the degree of the polynomial. In that case, one sets $\mathcal{H}$ to be $\mathcal{H} = \{\mathcal{H}_1, \mathcal{H}_2, \ldots\}$, where each $\mathcal{H}_j$ represents the hypothesis that the data is best described as a $j$-th degree polynomial. One then codes data $D$ given hypothesis $\mathcal{H}_j$ using a one-part code designed such that, whenever some hypothesis $H \in \mathcal{H}_j$ fits the data well, the codelength $L(D \mid \mathcal{H}_j)$ is short. The design of such codes is called universal coding. There are various types of universal codes one could use, often giving similar lengths for long data sequences but differing for short ones. The 'best' (in the sense of a minimax optimality property) are the normalized maximum likelihood (NML) or Shtarkov codes. A quite useful class of codes are the Bayesian marginal likelihood codes. For exponential families of distributions, when Jeffreys prior is used and the parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL theory into close contact with objective Bayes model selection, in which one also sometimes adopts Jeffreys' prior, albeit for different reasons. The MDL approach to model selection "gives a selection criterion formally identical to the BIC approach" for a large number of samples.
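In standard notation, with $D$ the data, $\hat\theta(D)$ the maximum likelihood estimator in a $k$-parameter model class, $n$ the sample size, $\Theta$ the (restricted) parameter space and $I(\theta)$ the Fisher information, the NML codelength is

$$ L_{\mathrm{NML}}(D) = -\log P\bigl(D \mid \hat\theta(D)\bigr) + \log \sum_{D'} P\bigl(D' \mid \hat\theta(D')\bigr), $$

which for exponential families admits the asymptotic expansion

$$ L_{\mathrm{NML}}(D) = -\log P\bigl(D \mid \hat\theta(D)\bigr) + \frac{k}{2}\log\frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)}\,\mathrm{d}\theta + o(1). $$

BIC corresponds to the codelength $-\log P(D \mid \hat\theta(D)) + \frac{k}{2}\log n$, which agrees with this expansion up to terms that do not grow with $n$; this is the sense in which the two criteria coincide for large samples.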

Example of Statistical MDL Learning

A coin is flipped 1000 times, and the numbers of heads and tails are recorded. Consider two model classes. The first is a code that represents outcomes with a 0 for heads and a 1 for tails; it embodies the hypothesis that the coin is fair, and its codelength is always exactly 1000 bits. The second consists of all codes that are efficient for a coin with some specific bias, embodying the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails; then the codelength according to the best code in the second model class is shorter than 1000 bits.
For this reason a naive statistical method might choose the second model as a better explanation for the data. However, an MDL approach would construct a single code for the entire second model class, instead of just using its best element. This code could be the normalized maximum likelihood code or a Bayesian code. If such a code is used, then the total codelength based on the second model class would be larger than 1000 bits. Therefore, the conclusion when following an MDL approach is inevitably that there is not enough evidence to support the hypothesis of the biased coin, even though the best element of the second model class provides a better fit to the data.
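A sketch of this computation in Python, assuming the 510/490 split from the example above (the NML normalizer is computed exactly by summing over all possible head counts for the Bernoulli model class):

import math

n, heads = 1000, 510

def log2_binom(n, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

def max_loglik_bits(n, k):
    # -log2 of the Bernoulli likelihood at its maximum theta = k/n (with 0*log(0) = 0).
    bits = 0.0
    if k > 0:
        bits += -k * math.log2(k / n)
    if k < n:
        bits += -(n - k) * math.log2((n - k) / n)
    return bits

# Model class 1: the fair coin. Codelength is exactly n bits, with nothing to fit.
fair_bits = float(n)

# Model class 2: all biased coins, coded with the normalized maximum likelihood code.
# NML codelength = -log2 P(data | best theta) + log2 of the normalizing sum over all
# possible outcomes (the "parametric complexity" of the model class).
complexity = math.log2(sum(2 ** (log2_binom(n, k) - max_loglik_bits(n, k)) for k in range(n + 1)))
nml_bits = max_loglik_bits(n, heads) + complexity

print(round(fair_bits, 1), round(max_loglik_bits(n, heads), 1), round(nml_bits, 1))

The best biased-coin code needs about 999.7 bits, but the roughly five extra bits of parametric complexity push the NML total above 1000 bits, so the fair-coin model class is preferred.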

Statistical MDL Notation

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions. For any probability distribution $P$, it is possible (by the Kraft-McMillan inequality) to construct a code $C$ such that the length in bits of $C(x)$ is equal to $-\log_2 P(x)$; this code minimizes the expected code length. Conversely, given a code, one can construct a probability distribution such that the same holds. In other words, searching for an efficient code is equivalent to searching for a good probability distribution.
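A minimal numerical illustration in Python (the distribution is an arbitrary example chosen so that the codelengths come out as whole bits):

import math

# Shannon codelengths derived from a distribution, and the Kraft sum going the other way.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

lengths = {x: -math.log2(p) for x, p in P.items()}
print(lengths)                                   # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(sum(2 ** -l for l in lengths.values()))    # Kraft sum = 1.0, so a prefix code with
                                                 # exactly these lengths exists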

Limitations of Statistical MDL Learning

The description language of statistical MDL is not computationally universal. Therefore it cannot, even in principle, learn models of recursive natural processes.

Related concepts

Statistical MDL learning is very strongly connected to probability theory and statistics through the correspondence between codes and probability distributions mentioned above. This has led some researchers to view MDL as equivalent to Bayesian inference: the code length of the model and the code length of the model and data together in MDL correspond, respectively, to the prior probability and the marginal likelihood in the Bayesian framework.
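In symbols, with $P(H)$ a prior over hypotheses and $P(D \mid H)$ a likelihood, the correspondence reads

$$ L(H) \leftrightarrow -\log_2 P(H), \qquad L(D \mid H) \leftrightarrow -\log_2 P(D \mid H), \qquad L(D) \leftrightarrow -\log_2 \sum_{H} P(H)\,P(D \mid H), $$

so that minimizing the two-part codelength $L(H) + L(D \mid H)$ corresponds to selecting the maximum a posteriori hypothesis, while the one-part Bayesian mixture codelength corresponds to the negative logarithm of the marginal likelihood.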
While Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates other codes that are not Bayesian. An example is the Shtarkov normalized maximum likelihood code, which plays a central role in current MDL theory but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the true data-generating process: in practice, a model class is typically a simplification of reality and thus does not contain any code or probability distribution that is true in any objective sense. In later work, Rissanen based the mathematical underpinning of MDL on the Kolmogorov structure function.
According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe priors that would lead to poor results. The priors that are acceptable from an MDL point of view also tend to be favored in so-called objective Bayesian analysis; there, however, the motivation is usually different.

Other systems

Rissanen's approach was not the first information-theoretic approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called minimum message length (MML). The difference between MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation.