Minimum Message Length, MML
Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.
For a hypothesis H and data D we have, from Bayes:

P(H&D) = P(H).P(D|H) = P(D).P(H|D)
From Shannon's Mathematical Theory of Communication (1949) we know that in an optimal code, the message length of an event E, MsgLen(E), where E has probability P(E), is given by MsgLen(E) = -log2(P(E)). Applying this to the identity above:

MsgLen(H&D) = MsgLen(H) + MsgLen(D|H) = MsgLen(D) + MsgLen(H|D)
Now in inductive inference one often wants the hypothesis H with the largest posterior probability, P(H|D). MsgLen(H) can usually be estimated well, for some reasonable prior on hypotheses, and MsgLen(D|H) can also usually be calculated. Unfortunately it is often impractical to estimate P(D), which is a pity because it would yield P(H|D). However, for two rival hypotheses H and H':

MsgLen(H|D) - MsgLen(H'|D) = MsgLen(H&D) - MsgLen(H'&D)

because the term MsgLen(D) is common to both and cancels, so the difference in total message lengths gives the posterior log-odds of the two hypotheses.
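To make the comparison concrete, here is a minimal sketch in Python; the coin data, the candidate biases and the prior probabilities are invented for illustration.

```python
import math

def msg_len(p):
    """Optimal code length, in bits, for an event of probability p."""
    return -math.log2(p)

# Invented data: 8 heads and 2 tails from 10 coin flips.
heads, tails = 8, 2

# Two rival hypotheses, each with an assumed prior probability:
#   H : the coin is fair   (P(head) = 0.5), prior 0.7
#   H': the coin is biased (P(head) = 0.8), prior 0.3
hypotheses = {"H": (0.5, 0.7), "H'": (0.8, 0.3)}

totals = {}
for name, (p, prior) in hypotheses.items():
    len_h = msg_len(prior)                                 # MsgLen(H)
    len_d_h = heads * msg_len(p) + tails * msg_len(1 - p)  # MsgLen(D|H)
    totals[name] = len_h + len_d_h                         # MsgLen(H&D)
    print(f"{name}: {len_h:.2f} + {len_d_h:.2f} = {totals[name]:.2f} bits")

# MsgLen(D) is common to both totals, so their difference equals
# MsgLen(H|D) - MsgLen(H'|D): the shorter explanation is the more
# probable hypothesis a posteriori.
```

Here the biased hypothesis pays for its smaller prior but saves many bits when encoding the skewed data, so its total message is shorter.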
Consider a transmitter T and a receiver R connected by one of Shannon's communication channels. T must transmit some data D to R. T and R may have previously agreed on a code book for hypotheses, using common knowledge and prior expectations. If T can find a good hypothesis, H (theory, structure, pattern, ...), to fit the data then she may be able to transmit the data economically.
An explanation is a two-part message: the first part states the hypothesis H, taking MsgLen(H) bits, and the second part states the data D encoded under the assumption that H is true, taking MsgLen(D|H) bits.
The message paradigm keeps us "honest": any information that is not common knowledge must be included in the message for it to be decipherable by the receiver; there can be no hidden parameters.
In general, a more complex hypothesis fits the data better than a simpler one, so MML encoding gives a trade-off between hypothesis complexity, MsgLen(H), and goodness of fit to the data, MsgLen(D|H). The MML principle is one way to justify and realise Occam's razor.
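The trade-off can be seen in a toy sketch: a patterned binary string encoded under a zero-parameter uniform model and under a first-order Markov model whose transition probability must be stated. The data, the 4-bit parameter precision and the convention that the uniform model costs nothing to state are all assumptions for illustration.

```python
import math

def bits(p):
    """Code length in bits for an event of probability p."""
    return -math.log2(p)

data = "01" * 11      # strongly patterned binary data, 22 symbols

# H0: uniform i.i.d. symbols.  Nothing to state (assumed common
# knowledge), so MsgLen(H0) = 0 and MsgLen(D|H0) = 1 bit per symbol.
len0 = len(data) * bits(0.5)

# H1: a first-order Markov model.  We must state P(next != current),
# here to an assumed 4 bits of precision, then encode the first symbol
# and each transition.
p_flip = 0.9375       # = 15/16, representable in 4 bits
len1 = 4 + bits(0.5)  # MsgLen(H1) + first symbol
for prev, cur in zip(data, data[1:]):
    len1 += bits(p_flip if prev != cur else 1 - p_flip)

print(f"H0 (uniform): {len0:.1f} bits")
print(f"H1 (Markov) : {len1:.1f} bits")
```

On this data the extra MsgLen(H) of the Markov model buys a much shorter MsgLen(D|H); on genuinely random data its stated parameter would be pure overhead and the simpler H0 would win.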
Continuous Real-Valued Parameters
When a model has one or more continuous, real-valued parameters they must be stated to an "appropriate" level of precision. The parameter must be stated in the explanation, and only a finite number of bits can be used for the purpose, as part of MsgLen(H). The stated value will often be close to the maximum-likelihood value which minimises MsgLen(D|H). If the -log likelihood, MsgLen(D|H), varies rapidly for small changes in the parameter, the parameter should be stated to high precision. If the -log likelihood varies only slowly with changes in the parameter, the parameter should be stated to low precision.
The simplest case is the multi-state or multinomial distribution, where the data is a sequence of independent values from such a distribution. The hypothesis, H, is an estimate of the probabilities of the various states (e.g. the bias of a coin or a die). The estimate must be stated to an "appropriate" precision, i.e. in an appropriate number of bits.
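A sketch of choosing that precision, using a deliberately crude coding scheme assumed here for illustration (the bias is stated on a uniform grid of 2^b values, and the cost of stating b itself is ignored): more bits buy a stated value closer to the maximum-likelihood estimate, shortening MsgLen(D|H) at the expense of MsgLen(H).

```python
import math

def nll(p, heads, tails):
    """MsgLen(D|H) in bits: -log2 likelihood of the coin data."""
    return -(heads * math.log2(p) + tails * math.log2(1 - p))

heads, tails = 60, 40          # invented coin-flip data
ml = heads / (heads + tails)   # maximum-likelihood bias, 0.6

best = None
for b in range(1, 12):         # candidate precisions, in bits
    grid = [(i + 0.5) / 2 ** b for i in range(2 ** b)]
    p_hat = min(grid, key=lambda p: abs(p - ml))  # nearest stated value
    total = b + nll(p_hat, heads, tails)          # MsgLen(H) + MsgLen(D|H)
    print(f"{b:2d} bits: state p = {p_hat:.4f}, total = {total:.2f} bits")
    if best is None or total < best[1]:
        best = (b, total)

print(f"cheapest explanation states the bias to {best[0]} bits")
```

For this data a coarse statement wins: the -log likelihood varies slowly near 0.6, so extra digits of the bias cost more to state than they save.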
Applications and Related Areas