See also:
- C. S. Wallace and M. P. Georgeff. A General
Objective for
Inductive Inference. [TR
32], Dept.
Computer Science, Monash
University, March
1983.
- J. J. Oliver and D. J. Hand, Introduction
to Minimum
Encoding Inference, [TR
4-94] Dept. Stats. Open Univ. and also TR 94/205
Dept. Comp. Sci. Monash Univ.
- J. J. Oliver and R. A. Baxter, MML and
Bayesianism:
Similarities and Differences, [TR
94/206]
- R. A. Baxter and J. J. Oliver, MDL and MML:
Similarities
and Differences, [TR
94/207]
- The Computer Journal, special
issue 42(4),
1999, includes:
- C. S. Wallace & D. L. Dowe, Minimum
Message Length
and Kolmogorov Complexity. pp270-283, [bib']
also ...
- Refinements of MDL and MML Coding,
pp330-337
- and discussion on MML and MDL by various
authors.
- G. E. Farr & C. S. Wallace, The
complexity of strict
minimum message length inference. Computer Journal 45(3)
pp285-292 2002 [link]
- C. S. Wallace, A
Brief History of
MML, 20/11/2003.
For a hypothesis H
and data D we
have from Bayes:
P(H&D) =
P(H).P(D|H) =
P(D).P(H|D)
P(H) prior probability of hypothesis H
P(H|D) posterior probability of hypothesis H
P(D|H) likelihood of the hypothesis, actually a
function of the data given H.
From Shannon's Mathematical
Theory of Communication
(1949) we know
that in an optimal code, the message length of an event E,
MsgLen(E),
where E has probability P(E), is given by
MsgLen(E) = -log2(P(E)):
MsgLen(H&D) =
MsgLen(H)+MsgLen(D|H)
= MsgLen(D)+MsgLen(H|D)
Now in inductive
inference
one often wants the
hypothesis H
with the largest posterior probability.
MsgLen(H) can usually be estimated well,
for some reasonable prior on hypotheses.
MsgLen(D|H) can also usually be calculated.
Unfortunately it is often impractical to estimate P(D)
which is a pity because it would yield P(H|D).
However, for two rival hypotheses, H and H'
MsgLen(H|D)-MsgLen(H'|D) =
MsgLen(H)+MsgLen(D|H) - MsgLen(H')-MsgLen(D|H')
=
posterior -log odds ratio
Consider
a transmitter T and a receiver R
connected by one of Shannon's communication channels.
T must transmit some data D to R.
T and R may have previously agreed on a code book for hypotheses,
using common knowledge and prior expectations.
If T can find a good hypothesis, H, (theory, structure, pattern, ...)
to fit
the data then she may be able to transmit the data economically.
An explanation is a two part message:
(i) transmit H taking MsgLen(H) bits, and (ii)
transmit D given H taking MsgLen(D|H)
bits.
The message paradigm
keeps us
"honest":
Any information that is not common knowledge
must be included in the message for it to be decipherable by the
receiver;
there can be no hidden parameters. This issue extends to inferring (and stating) real-valued parameters
to the "appropriate" level of precision. The method is "safe":
If we use an inefficient code it can only make the hypothesis look less
attractive than otherwise. There is a natural hypothesis test:
The null-theory corresponds to transmitting the data "as is".
(That does not necessarily mean in 8-bit ascii;
the language must be efficient.)
If a hypothesis cannot better the null-theory then it is not
acceptable.
A more complex
hypothesis fits the
data better than a simpler
model,
in general.
We see that MML encoding gives a trade-off between hypothesis
complexity,
MsgLen(H), and the goodness of fit to the data, MsgLen(D|H).
The MML principle is one way to justify and realise Occam's razor.
Continuous
Real-Valued
Parameters
When a model has one
or more
continuous, real-valued
parameters
they must be stated to an "appropriate" level of precision.
The parameter must be stated in the explanation, and only a finite
number
of bits can be used for the purpose, as part of MsgLen(H).
The stated value will often be close to the maximum-likelihood value
which minimises MsgLen(D|H).
If the -log likelihood, MsgLen(D|H), varies rapidly for small changes
in the parameter, the parameter should be stated to high precision.
If the -log likelihood varies only slowly with changes in the
parameter,
the parameter should be stated to low precision.
The simplest case is
the multi-state
or multinomial distribution where the data is a sequence of independent
values from such a distribution.
The hypothesis, H, is an estimate of the probabilities of the various
states (eg. the bias of a coin or a dice).
The estimate must be stated to an "appropriate" precision,
ie. in an appropriate number of bits.
|