Minimum Description Length

Definition

The Minimum Description Length (MDL) Principle is method for inductive inference that provides a generic solution to the model selection problem. It is based on the following idea: If there is regularity in the data, then it can be used to compress the data i.e. the regularity of the data can be exploited to express it in fewer symbols than the number of symbols needed to describe the data literally. The more there are regularities in the data, the more it can be compressed. In this sense, we can equate learning with finding regularity and say that the more we can compress the data by knowing about its regularity, the more we have learned about the data.

Illustrative example

The following example illustrates how finding the regularities in data can be used to develop a compact representation of the data and lead to data compression. Consider the following three sequences. We assume that each sequence is 10000 bits long, and we just list the beginning and the end of each sequence.

0010100101001010010100101 : : : 00101001010010100101001010010100101 (1.1)

0111010011010010011000111 : : : 10101110101110110001011000100101000 (1.2)

0001100000101010000000010 : : : 00100010000100000010000100001100000 (1.3)

The first of these three sequences is a 2000-fold repetition of 00101. Intuitively, the sequence looks regular. The second sequence has been generated by tosses of a fair coin. It is intuitively speaking as `random as possible', and in this sense there is no regularity underlying it. The third sequence contains approximately four times as many 0s as 1s. It looks less regular, more random than the first; but less random than the second. There is still some discernible regularity in these data, but of a statistical rather than of a deterministic kind.

Any regularity detected in the data can be used to compress the data, i.e. to describe it in a short manner. Using C language, we can write a program:

for (int i = 0; i<2000; i++) {
  printf(`00101`);
}

which prints sequence (1) but is clearly a lot shorter. Thus, sequence (1) is indeed highly compressible. On the other hand, if one generates a sequence like (2) by tosses of a fair coin, then with extremely high probability, the shortest program that prints (2) will look something like this:

printf(`011101001101000010101010::::::::1010111010111011000101100010`);

This program's size is about equal to the length of the sequence. Clearly, it does nothing more than repeat the sequence. The third sequence lies in between the first two: generalizing n = 10000 to arbitrary length n, the first sequence can be compressed to O(log n) bits; with overwhelming probability, the second sequence cannot be compressed at all; and the third sequence can be compressed to some length $ \alpha n $, with $ 0<\alpha<1 $.


Source: http://homepages.cwi.nl/~pdg/ftp/mdlintro.pdf

Alumni Liaison

Basic linear algebra uncovers and clarifies very important geometry and algebra.

Dr. Paul Garrett