next up previous contents
Next: Fundamental MML approximation Up: No Title Previous: Invariance of Minimum Message

   
Derivation of the second-order MML estimator

The basis of this thesis is Wallace and Freeman's 1987 paper [#!Wallace.Freeman:1987!#] where a second-order MML approximation technique was presented. This second-order estimator is mathematically simple, computationally efficient and accurate enough for most purposes. Nonetheless, it relies on certain assumptions, which causes the breakdown of the estimator when they are not met [#!Grunwald.Kontkanen.Myllymaki.Silander.Tirri:1998!#,#!Grunwald:1998!#].

In this section, the application of the MML principle will be explained and the derivation of Wallace and Freeman's second-order estimator [#!Wallace.Freeman:1987!#] repeated. In the process, the approximations taken by the second-order estimator will be revealed. Where possible, new or improved approximation techniques will be suggested, to eliminate or minimise the effects of any assumptions.

As described previously, MML attempts to minimise the combined message length of the model and data. The model encoded within the first part of the message is labelled $\theta$ which comes from the model-space of $\Theta$. This model-space can either be discrete or continuous, although this thesis will be mostly concerned with continuous model-spaces. A model-space is commonly called a distribution (which is also called a density if it is a continuous distribution), as each model forms part of a probability distribution. In turn, the function $h(\theta)$ is the probability of each model, or simply called the ``prior'' (which is different from the prior knowledge). MML seeks to derive an estimate ( $\hat{\theta}$) of the original model ($\theta$).

The second part of the message contains the data x from the data-space Xencoded with the aid of the model ($\theta$). This is the likelihood function $f(x\vert\theta)$ and its message length is the ``negative-log-likelihood-function'' $L = -\log f(x\vert\theta)$.

The practicalities of optimally encoding the model will not be discussed in this thesis. However, Shannon [#!Shannon:1948!#] has shown that it takes


 \begin{displaymath}
-\log_2 \textrm{Probability(Information)} \textrm{~~bits}
\end{displaymath} (1)

to optimally encode any information. Arithmetic coding can encode the data in the above length rounded up to the nearest bit in the worse case.

Naively, the message length of a problem with a discrete model-space $\Theta$can be describe as


 \begin{displaymath}
\begin{split}
\textrm{MesgLen~} &= \textrm{Total Message Length} \\
&= -\log_2 h(\theta) -\log_2 f(x\vert\theta)
\end{split}\end{displaymath} (2)

Therefore, minimising the message length is equivalent to maximising $h(\theta)f(x\vert\theta)$. This is the same as maximising the posterior probability [#!Farr:1999!#]. Indeed, this is true in the worse case. Examples include a discrete model-space with no structure other than being a set [#!Farr:1999!#]. Another example is a discrete multi-dimensional problem with n-independent boolean attributes resulting in a total of 2n models. Using diagonal counting, any discrete multi-dimensional problem can be converted into a discrete single-dimensional problem.

However, whenever there are correlations between discrete models, there exists an encoding that produces a shorter message length than equation ([*]). To understand why, it is useful to discuss the situation when the parameter (i.e. $\theta$) describing the model is a real-valued variable taking a continuum (uncountable set) of values. Then the probability of any single model ($h(\theta)$) is infinitesimally small. This will cause the message length of the first part ( $-\log_2 h(\theta)$) to explode, such that it takes an infinite amount of information to encode any model. Clearly, it would be better in such a case to encode no model. Alternatively, a model could be specified only partially, such that the decoder has some idea about the global properties of the data so the message length is still minimised. That is, MML would like to break up the infinite continuum of models into a discrete and countable (although possibly still infinite) set of models, MML groups similar models together.

The important difference that distinguishes between the worse-case discrete problem and the continuous problem is not whether the models are a countable set. The important factor is whether the models in the model-space have a concept of a local neighbourhood. That is, if the model $\theta$ is encoded slightly wrong (ie $\theta + \epsilon$ instead of $\theta$), there is still a fairly good encoding for the data (i.e. $f(x\vert\theta+\epsilon)$). For continuous distributions, this is always true. For discrete distributions, this is not true in the worse case. In such situations, getting a model slightly wrong is as bad as getting it wrong by a lot. A mathematical definition for continuous distributions uses the standard $\epsilon-\delta$ definition. However, it is not so simple for discrete distributions and a more exact mathematical definition will not be forthcoming in this thesis.

As mentioned previiously, when re-parameterising a MML problem, an isomorphic transform may be used to guarantee invariance. The reason why an isomorphic transform is required is because nearby models in the original model-space must correspond to nearby models in the transformed model-space. This does not mean that invariance cannot be obtained using a non-continuous transform. In fact, unless the optimal coding region encompasses the whole domain (i.e. there is no model of the data or the data is random), there always exists a non-continuous transform which preserves invariance. It is merely that there exists non-continuous isomorphisms which do not preserve invariance. On the other hand, every isomorphic transform preserves invariance.



 
next up previous contents
Next: Fundamental MML approximation Up: No Title Previous: Invariance of Minimum Message
Edmund Lam
2000-12-04