The basis of this thesis is Wallace and Freeman's 1987 paper [#!Wallace.Freeman:1987!#] where a second-order MML approximation technique was presented. This second-order estimator is mathematically simple, computationally efficient and accurate enough for most purposes. Nonetheless, it relies on certain assumptions, which causes the breakdown of the estimator when they are not met [#!Grunwald.Kontkanen.Myllymaki.Silander.Tirri:1998!#,#!Grunwald:1998!#].
In this section, the application of the MML principle will be explained and the derivation of Wallace and Freeman's second-order estimator [#!Wallace.Freeman:1987!#] repeated. In the process, the approximations taken by the second-order estimator will be revealed. Where possible, new or improved approximation techniques will be suggested, to eliminate or minimise the effects of any assumptions.
As described previously, MML attempts to minimise the combined message length
of the model and data. The model encoded within the first part of the message
is labelled
which comes from the model-space of
.
This
model-space can either be discrete or continuous, although this thesis will be
mostly concerned with continuous model-spaces. A model-space is commonly called
a distribution (which is also called a density if it is a continuous
distribution), as each model forms part of a probability distribution. In turn,
the function
is the probability of each model, or simply called the
``prior'' (which is different from the prior knowledge). MML seeks to derive an
estimate (
)
of the original model (
).
The second part of the message contains the data x from the data-space Xencoded with the aid of the model (
). This is the likelihood function
and its message length is the
``negative-log-likelihood-function''
.
The practicalities of optimally encoding the model will not be discussed in this thesis. However, Shannon [#!Shannon:1948!#] has shown that it takes
to optimally encode any information. Arithmetic coding can encode the data in the above length rounded up to the nearest bit in the worse case.
Naively, the message length of a problem with a discrete model-space
can be describe as
Therefore, minimising the message length is equivalent to maximising
.
This is the same as maximising the posterior
probability [#!Farr:1999!#]. Indeed, this is true in the worse case. Examples
include a discrete model-space with no structure other than being a set
[#!Farr:1999!#]. Another example is a discrete multi-dimensional problem with
n-independent boolean attributes resulting in a total of 2n models. Using
diagonal counting, any discrete multi-dimensional problem can be converted into
a discrete single-dimensional problem.
However, whenever there are correlations between discrete models, there exists
an encoding that produces a shorter message length than equation
(
). To understand why, it
is useful to discuss the situation when the parameter (i.e.
)
describing the model is a real-valued variable taking a continuum (uncountable
set) of values. Then the probability of any single model (
)
is
infinitesimally small. This will cause the message length of the first part
(
)
to explode, such that it takes an infinite amount of
information to encode any model. Clearly, it would be better in such a case to
encode no model. Alternatively, a model could be specified only partially, such
that the decoder has some idea about the global properties of the data so the
message length is still minimised. That is, MML would like to break up the
infinite continuum of models into a discrete and countable (although possibly
still infinite) set of models, MML groups similar models together.
The important difference that distinguishes between the worse-case discrete
problem and the continuous problem is not whether the models are a countable
set. The important factor is whether the models in the model-space have a
concept of a local neighbourhood. That is, if the model
is encoded
slightly wrong (ie
instead of
), there is still a
fairly good encoding for the data (i.e.
). For continuous
distributions, this is always true. For discrete distributions, this is not
true in the worse case. In such situations, getting a model slightly wrong is
as bad as getting it wrong by a lot. A mathematical definition for
continuous distributions uses the standard
definition.
However, it is not so simple for discrete distributions and a more exact
mathematical definition will not be forthcoming in this thesis.
As mentioned previiously, when re-parameterising a MML problem, an isomorphic transform may be used to guarantee invariance. The reason why an isomorphic transform is required is because nearby models in the original model-space must correspond to nearby models in the transformed model-space. This does not mean that invariance cannot be obtained using a non-continuous transform. In fact, unless the optimal coding region encompasses the whole domain (i.e. there is no model of the data or the data is random), there always exists a non-continuous transform which preserves invariance. It is merely that there exists non-continuous isomorphisms which do not preserve invariance. On the other hand, every isomorphic transform preserves invariance.