|
Well I'm not exactly sure why I've been asked
to give a talk
on the history of Minimum Message Length, perhaps to get a little bit
of it on some sort of record before I drop dead which won't be all that
long. And it is a fairly muddled history which involves a lot of people
other than myself and indeed in some respects you would have to say
that the earliest steps in the field were taken before I had ever done
anything about it, and indeed I didn't even know about them until some
years later -- but I will come to that.
For me the whole business I suppose started to
present itself
to me as a bit of a problem when I was an honours student studying
physics. An experimental group in the school had got very excited by a
recent result that they had got. They had been measuring the intensity
of a certain type of cosmic ray and discovered that the intensity
seemed to go up and down according to the sidereal time. That is to say
according to the attitude of the earth with respect to the distant
stars. Now if that was so, it would have indicated a preferred
direction for the cosmic rays to be coming from, and they were of a
type which, it was thought, would not have a preferred direction. So
this was quite an exciting result. And the statistical analysis of
their intensity records showed that the results they had were,
according to the standard statistical tests, significant to about the
1% level. That is to say, you would have expected to get such results
by chance only 1% of the time. Now that is a reasonably good
significance level. Most studies in medicine and psychology and the
more difficult sciences jump up and down if they get a result which is
even 5% significant. But it occurred to me listening to their
presentation, at a seminar rather similar to this one, that the fit
which their presumed sidereal-time variation gave to their data
required them to say where the peak of this wave was coming to within
about one hour, plus or minus one hour, say two hours. But if you
fitted a curve with its peak outside this range the significance level
dropped like a stone. So what?
Well then I said to myself, well they would
have got just as
excited had they found the peak at a different time of the day. And if
the peak was sort of about two hours wide, well that would mean
anywhere in twelve different positions round the clock (24-hour clock)
would have been just as exciting.
[...]
So the chances of their getting a result by
pure chance that
would have made them excited wasn't one in a hundred, it was one in a
hundred divided by twelve which is only about one in eight. And then it
further occurred to me, well they would have got just as excited had
they discovered something which showed two peaks every day, or a peak
every year, or whatever. So there were really rather a lot of
hypotheses that they would have embraced if they found that they gave a
fit to the data with a significance of 1%. And when you throw that in,
the real significance of the result wasn't one in a hundred, or even
one in eight, but something probably more like one in five. In other
words there would be at least a one in five chance that purely random
data would have thrown up something that would have got them excited.
Well, while accidents of the order of one in a
hundred don't
happen often, accidents of the order of chances one in five happen at
least once a week. So I didn't think they had a decent result. I raised
my little voice in the seminar and tried to explain my reservations
but, as usual, was howled down; remember I was only an honours student
and that's the way you've got to treat honours students. And they went
on collecting data. But over the next year the effect disappeared. I
didn't say I told you so because by that time I had become a research
student and they have got to be even more careful about what they say.
But it got me thinking, [...] if you are
talking
about the fit of a hypothesis to some data, and the fit looks good, you
really have to discount this by how many hypotheses you would have
contemplated before choosing the one that fitted the data best. Put it
another way, the more precisely you have to specify your hypothesis out
of a range of possibilities in order to get a good fit to the data, the
less plausible that hypothesis really becomes given the data.
So anyway, I put that thought to the back of
my mind and
forgot all about it. Later on as a research student I got involved in
some rather complex statistical analyses of data from a different sort
of cosmic ray and got introduced to Bayesian statistics because that
seemed to be necessary to get more or less unbiased results. So I did
that and put that on the side. Well OK, by the time I had assumed the
full and unquestionable authority of a junior lecturer I had had the
odd thought about how one should go about weighing evidence and had had
a little bit of experience with statistics. [...]
It happened that I got a research student
coming to do a
masters. He was an electrical engineer named David Boulton who came
from New Zealand where all good things come from -- like my parents --
and for some reason he decided he would do his masters under my
supervision. (No he wasn't my first, he was my second student.) And at
that time, which is quite a while back, it was about 1966, one of the
very promising techniques for building computer logic was a thing
called a tunnel-diode which could switch far faster than the then
available transistors. So his masters was to be on making use of tunnel
diodes in logic -- logic circuits. [...] This went on
and we got some results but after about six months it became clear to
both of us that this was going to be a dead-end, as it indeed proved to
be. The things were just too difficult to handle; the results were
difficult to make reliable and by the time you had made them reliable
they weren't all that fast anyway. So here he was, poor student, come
over from New Zealand to do a masters and half way through it was
obvious that it just wasn't going anywhere. So what to do?
[...] A year or two before I
had come across a
problem which was presented to me by a geologist friend of how to
classify samples of data from a large population. They were in fact
samples of sediments from oil wells and as one put down oil wells in
different places one encounters different sorts of sediments. The
origins and nature of these wasn't entirely understood and he thought
it would be a nice idea to get some sort of classification as to what
sort of sediments were what. [...] At the time I had
done nothing much more than to keep him happy more or less with
something which today would be called a k-means description of the
data. That worked, more or less. But then I forgot about it. Anyway, it
occurred to me when David Boulton ran out of things to do that maybe it
would be a good idea to go back to that classification problem and see
if we could put it in a proper statistical framework: Given data from
some large population is it reasonable to regard the samples that
you've got (there may be a thousand different samples) as representing
different kinds of thing? Is the population, in other words,
made up of several different classes of thing?
Now David was an engineer with a good decent
science
background and he also knew something about information theory. And he
had the bright idea that he had always thought of theories as a way of
concisely representing data and his classic example was the periodic
table of the elements, something which some of you may know about.
[...] It is a way of arranging the 92 elements (or as it was originally
done [when] only about 70 elements [were] known), in a
table in a way which reveals a lot of relationships and similarities
among the chemical elements. Ones in the same column of the table have
very similar chemistries. Ones in the same row of the table have rather
similar masses. And so if you know the table, you already know a lot
about these elements, just represented in a simple geometric
arrangement of where you have listed the element names.
So his cut on doing classification was
that if you do
a decent classification and describe what each class looks like and
then what class each thing in your sample belongs to, you have already
said a lot about the thing because if you've said it belongs to a
particular class then you know it's got to look pretty much like the
description of the things in the class. And so his take on it was that
it was all a matter of getting the data and using some classification
in some way to encode the data more concisely -- data compression. Now
I thought well yeah, well maybe, but look mate this is statistics, what
one really wants to do is to do a proper Bayesian statistical analysis
here, but taking account of how precisely you have to specify the
properties of each class and thing. And the more classes you bung in,
the more properties you'll have to describe, so the more precise, or
the more complex, the whole description of the population becomes.
Let's go and do this properly with Bayesian statistics, putting on
proper prior distributions on numbers of classes and prior
probabilities on parameters and so forth.
[...] We argued about this for
some weeks,
waving arms and shouting at each other, and then we decided well look
this was no good, we will have to prove our points by going away and
doing the maths and writing the algorithms and seeing what happens. So
peace fell on the subterranean rooms where we were working for a week
or two, then we came together again and I looked at his maths and he
looked at my maths and we discovered that we had ended up at the same
place. There really wasn't any difference between doing Bayesian
analyses the way I thought one ought to do Bayesian analyses and doing
data compression the way he thought you ought to do data compression.
Great light dawned -- wrote a program, got some data from somebody who
had been looking at seal skulls from around the world, whacked it in,
cranked the handle, and said oh there's six species of seal here. So we
went to an expert and said how many species of seal are there and which
ones were supposed to be which, and it was a pretty good match except
that we had confused two seal species, which are now incidentally
regarded as being the same species, they are just two different
populations, and unfortunately split up some of the species into two
classes which on inspection turned out to be male and female of the
species. So we were really chuffed, wrote a paper, got it published and
David got his masters.
Then in '68 I came down here to Monash (I've
been here a long,
long time) and David, for his sins, decided to follow me and do a PhD
-- on elaborating the classification algorithm we had so that it could
build a hierarchic tree of classes, not just one class, two class,
three class, four class,... but super-classes and these
dividing into smaller classes and whatever -- the kind of things that
taxonomists do in the biological arena. And so he slaved away at it and
we both knew what was going on by then, so I was able to give him some
help, and [he] finally got a decent PhD out of it. And there the matter
rested for a while. Except that towards the end of this exercise it had
begun to dawn on me that this approach to working out models, or
forming hypotheses from data, while it did seem to work in the problem
of forming a classification of a population, was not necessarily
restricted to that arena and could have wider
applications.
So David went off and did analogue computers
and such but I,
in spare time, and even heads of department occasionally had spare time
in those days, pursued this idea of seeing if I could generalise this
approach of data compression, or Bayesian analysis done right, to
attack in principle more or less any problem of inductive inference. By
inductive inference here I mean specifically the kind of thing that
scientists try to do, which is by taking a whole lot of data try to
come up with general propositions about the source of this data and
which are true of lots of the data, even data that you haven't seen yet
-- the discovery of at least approximations to general laws, general
propositions, by looking at specific instances. And I did come up with
a formalism which has now entered the literature as what is called the
Strict Minimum Message Length formalism which at least gives a formal
answer to the question of how to do it, at least in arenas where you
have a fairly well defined body of hypotheses from which you wish to
select. It might be quite broad, like the set of all possible ways of
classifying a population or, as Mike Georgeff later suggested, the set
of all possible finite-state automata which could represent the grammar
of a language, but the formalism was there. I could prove a few things
about it, but unfortunately the mathematical complexity of it made it
almost unusable. So after a while I developed, using the work that
Boulton and I had done in Snob, at getting tractable approximations to
this general idea and ended up with things which are pretty usable.
Had very little success in getting any of this
work published
because the statisticians did not want to know on the whole and indeed
one paper, which I still consider to be a pretty good paper, was
rejected by a notable statistical journal on the basis of a referee's
comment that he found it objectionable that you could somehow improve
maximum likelihood by adding obscure terms to it. He didn't say it was
wrong, he didn't say it didn't work, it was just objectionable. So that
was that. Right. I went on to other things.
But then out of the blue a mad Irishman
descended on us armed
with survey data from piles of rubble which he had found in Ireland,
and he was persuaded that these were in fact megalithic stone circles,
like Stonehenge only being Irish not quite as big. And an equally mad
engineer in Britain called Thom had achieved considerable fame by
proposing that the shapes of these megalithic circles were in fact
carefully constructed geometric shapes. Although roughly circular, they
weren't truly circular and they were actually constructed from
Pythagorean triangles (like 3, 4, 5, and 5, 12, 13, whatever, the right
angled triangles with integer numbered sides) together with a whole lot
of other stuff. And our mad Irishman didn't really think this was very
likely and he wanted to do a proper statistical analysis. Now
statistical analyses of Thom's results had been attempted but were
inconclusive -- basically conventional statistics just is not equipped
to resolve questions of that complexity. So John Patrick came with all
of his survey data. He'd read something of the work on classification
and thought this was maybe a promising approach. Anyway he did a PhD on
it, in the course of which we developed and refined some of the
approximations to minimum message length and out of it came some pretty
reasonable results, which actually did lend evidence to at least one of
the geometric shapes that Thom had proposed but could not really find
any evidence for any of the others. So this was duly published and
excited the attention of a couple of the statisticians who had
themselves attempted to analyse Thom's data. One of them, Peter
Freeman, then Professor of Statistics at Leicester, got sufficiently
interested in it to come out here for a sabbatical year and work with
me to see if there really was anything in it. With his assistance, we
together ended up writing a paper which actually made sense to the
statisticians, at least enough for it to get published. Phew! After
quite a number of years this was. So I got interested in it again, did
some more work with Peter Freeman on some rather more complex models,
again got published, and this was great.
Now if we can stop there and just go back and
think about what
was really happening in the rest of the world. David Boulton and I had
got started in '67, 1967. But we had been well gazumped by a chap in
the United States, Ray Solomonoff, in I think [.../1964].
And he had come up with the notion that he wasn't really interested so
much in getting theories out of data. His emphasis was, if you get some
data from some source, can we predict what future data might
come from the same source? So his emphasis was on prediction. And his
take on the subject was, well, let us suppose that we can represent the
data that we've got in binary with a very long binary string. And now
let us take a universal Turing machine, and I'm simplifying here a
little bit but that is to say an honest to God computer which will
accept a binary input. And feed just random binary bits into that
computer and look at what comes out. Well what comes out will also be a
binary string and you throw away all of the inputs which made it
produce garbage which wasn't very interesting, or failed, or halt and
catch fire, or loop forever, or all of the other things that computers
can do. And you look only at those random strings which, when input,
made the computer output a binary string which was precisely
the data that you had. And then you said OK, these random input binary
strings are somehow encoding the data because when we bung them in this
computer it makes the computer output the data that we've got. So let
us put those strings in again but tack some more random binary digits
on the end and see what we can get. Well we'll get our original data in
each case plus some further binary strings which look like maybe future
data from the same source, and by noticing how often [in] this process,
this gives us future data of a particular kind, we get a
probability over what might be the future data that emerges. Now this
sounds completely bizarre but Solomonoff managed to prove that if the
data comes from a source which produces data from any
computable probability distribution then his prediction technique,
given enough data to work on, will converge to making predictions based
on precisely that computable probability distribution. In other words,
he proves it works. And it doesn't matter how complicated the source,
what we're talking about, whether it's the incidence of rabies in
French foxes or the, I don't know, the motion of the planets, whatever,
if this data displays any computable regularity this process will
detect it, pick it up, and use it to make as accurate a prediction as
it is possible to make of what is going to come next.
What has this to do with minimum message
length? Well it turns
out that when you look at it, the strings which went into the computer
which have most effect on making the predictions of future data are the
strings which encoded the available data most concisely. In other words
data compression. And the better the compression the more weight was
given to that string in making the predictions as to what's
happening next. So although he was not using this technique to pick on
a model of the source of the data, he was using it to average over all
possible models of the data and in effect assigning to each possible
model a posterior probability which turned out to be precisely the
posterior probability that you end up giving to a model found by
minimum message length.
So there is a very, very close correspondence
between the
work. The mathematical pursuit was quite different: Turing machines are
quite awful things to work with on a theoretical basis, whereas I'd
stuck to, [...] if not finite, at least countable sets
of hypotheses which you can enumerate. The hypotheses which can be
somehow used by a Turing machine are not countable unfortunately; well
they are countable but they cannot be enumerated because you can't tell
by looking at it whether a computer program makes any sort of sense at
all -- whether it will ever give any output, in general.
OK, so in terms of practical application our
development had
immediate applications, and has been applied, and quite a lot of people
around the world are using some of the techniques that we've developed.
Solomonoff's work gives a very strong theoretical respectability to the
whole approach because he can prove very powerful theorems about what
is going to happen of a quite different nature to the kind of thing
that we can prove about MML from statistical theory. But it is much
harder to get anything practically useful out of Solomonoff's approach.
Now we didn't become aware of Solomonoff's work until we were fairly
well down the track, it would have been about '72 or '73, something
like that although he had published, I think, originally in
['64]. I mean he'd published in an IEEE journal on information theory [Information
and Control], I think, which was not the kind of thing that your
average statistician would read. I certainly hadn't read it. And I only
became aware of it through reading a Scientific American article
written by [yet] another bloke.
[...]
Solomonoff's work had been read by a [...]
very famous Russian statistician, Kolmogorov, who'd seen in it a basis
for coming up with a definition of what is randomness. I mean
statisticians and probability people all talk about randomness
but nobody knew really what it meant. Well Kolmogorov latched onto the
idea of Solomonoff's work that if you had a binary string for given
data that Solomonoff worked with then you could find some input to a
Turing machine which made it output this string, but that the input
that you had to put in was shorter than the string which came out, the
only way this could happen would be if the string wasn't really random.
So he latched onto the idea that the real definition of randomness is,
at least for binary strings and by extension to any sort of data, if
you can compress it, it ain't random.
Unfortunately, there was a bit of a defect in
the way he
developed this theory. It turns out to be quite trivial, but he never
spotted it and he never really succeeded in getting his project to
work. He thought it would lead to a theory of statistical inference but
he couldn't make it go because of this defect in his theory. The defect
had the horrible result that if you took any random string of digits
and applied his criterion of randomness to it then, whatever the string
was, you would find infinitely many prefixes of the string (initial
sequences of the string) which were non-random according to his
measure. Which is bizarre.
[...] Another chap,
American this time,
Chaitin, who gave a talk here a couple of years ago roughly. He spotted
what was wrong, set it right. What was wrong was Kolomogorov hadn't
thought of the point that if you have got a finite binary string and
you want to show that it's compressible, you have to get an input for a
Turing machine which will make it output that string and then stop! In
other words the input has got to know how long the output is meant to
be. Add that little bit in and the theory now works, well Chaitin made
it work.
[...] So he wrote the article
in Scientific
American [1975] mentioning Solomonoff's work. I looked at Chaitin's
work and it was all to do with randomness, and frankly I wasn't all
that interested in random data. I wanted to know what you could do with
something which isn't random, like data from the real world, but it did
put me onto Solomonoff's work.
Again round about much the same time I'd been
getting IBM
technical bulletins and one of them had an article by a chap called
Rissanen who had come up with a bright idea for looking at data coming
out of, the output of say a chemical plant, the inputs and the outputs
of this chemical plant, or any sort of plant, and just from that data
working out a model of what was going on in this plant. Basically if
you like, looking at a very complicated sort of Markov process and
making an estimate of what the Markov process is. And lo and behold
when I looked at the maths it looked remarkably like the maths for
minimum message length, except that one crucial term looked to me to be
the wrong way round -- he had a minus sign where he should have had a
plus sign. So I wrote to him and suggested he have another look at this
and furnishing some of my work. [...]
He went on to invent and popularise what is called minimum description
length which is basically very much the same idea. And one has to
concede that anybody can mistake a plus for a minus, I'm always doing
it myself; he had come up with the same idea independently. [...]
Once he got the sign
right it all clicked and he began talking sensibly about it.
So what have we got here? We've got
Solomonoff, to some extent
Chaitin and Kolmogorov, certainly Chaitin, and David Boulton and myself
all coming up with more or less the same idea at more or less the same
time, which I guess might mean that the idea's time had come, except
that obviously it hadn't because thirty years has gone by and the world
still doesn't know about it! But we're working on that.
Anyway back home, we kept beavering away. Well
Mike Georgeff
came along and, as I've mentioned, had this idea that maybe you could
use this scheme actually to estimate the structure of something, namely
the structure of a grammar just given sentences from the grammar, at
least in simple formal languages. This idea had actually been suggested
by a chap called Gaines who had suggested [LA:1976]
that one could use a model of a grammar
which was essentially a finite-state automaton of a particular sort.
And Gaines had given some examples of strings, simple strings in a
simple grammar, and shown that his approach could recover the
finite-state machine which represented the grammar from these strings
but he didn't know how to "stop". He kept on making models which were
more and more and more complex until you ended up with a model which
could produce precisely, but only, the given examples. In other words a
grammar which says these are the sentences and there are no others. And
obviously that wasn't what he was looking for. But he said, well look
if you stop it right at the right spot, you get the grammar which
represents the given sentences but could also generate other sentences
which look sort of similar. So Mike suggested well why don't we have a
go at this. So we did the not very complicated maths and wrote some
fairly horrible programs and lo and behold it all worked [see TR32], very slowly,
but it worked, and that resulted in a free trip to Pisa where we gave a
talk at an artificial intelligence conference there. And this was sort
of the first introduction of these ideas to the artificial intelligence
community.
Since then of course we have gone a lot
further, or a fair bit
further, and with Kevin
[Korb] here and a
bunch of students we have pursued learning other sorts of structures,
in particular causal networks and binary networks, from just raw data.
And that seems to work pretty well too. With Trevor
Dix and Lloyd Allison they
developed some calculations making some use of this theory to answer
questions about the relatedness, or otherwise, of DNA strings, and in
Trevor's case how fragments of sequences could be put together to make
a coherent whole. I think I've got that right. And Ingrid [Zukerman]
has been making a bit of use
of it in looking at the structure of discourse where one of the parties
is a computer trying desperately to understand what the human on the
end of a telephone is actually talking about. So there has been a fair
bit of progress and there is still an awful lot more progress that
could be made, I think. [David
Dowe has also pushed the idea along dealing with less usual
statistical models.]
The artificial intelligence community, in
particular machine
learning because really that's what we are trying to do, is beginning
to take some notice of these techniques; I think it should be taking a
bit more notice. I have had some vehement opposition from a certain
quarter but he hasn't actually yet managed to prove us wrong. No doubt
he's working on it. But certainly in so far as the basic theory goes,
whether you put it in the framework of a la Solomonoff of using a
simple universal Turing machine to decode encoded descriptions of the
data, or whether we put it in the Bayesian statistical framework of
choosing from a possibly large but still not unlimited set of
hypotheses which might fit the data, choosing the one that really does,
there are lots of good theorems which say this is the right way to go,
one of the most powerful actually produced by a theoretical
statistician at Stanford who has managed to prove that if you use this
technique and the data is indeed coming from a probability distribution
which is either computable, or computable given a few uncomputable
constants, then (and you go the route that we have of looking for the
shortest possible encoding) [...] within a finite time you
will get the right answer. You only need a finite amount of data to
find the real honest to God probability distribution from which this
data is coming. That is a very powerful theorem, it is very general,
it's got rates of convergence and all sorts of things coming out of it.
I don't understand it myself -- it uses mathematics that I'm not
familiar with -- but it's nice to know that apparently we are doing
something respectable. And nobody's really been able to show that it
doesn't work -- come up with any convincing example where you get the
wrong answer by using this technique.
So anyway, that's the end of my history. It's
still
open-ended. There's still a great deal of work to be done. You may say
what the hell has it got to do with computing but it is indeed very
significant, I think, that the people that got things out of this, or
came up with these ideas, were an engineer working with RCA and using
computers all the time, that was Solomonoff, and who knew a lot about
Turing machines (still does), another engineer, David Boulton, who knew
a lot about coding and information theory, a lapsed physicist, me, who
knew a bit of statistics and a fair bit about the workings of machines,
and Rissanen who was also an engineer. Statisticians just didn't come
into it. Learning theorists don't come into it. I think you had to have
the background in dealing with real numbers crunching through real
processes in a machine and coming up with real answers and thinking
along those terms, rather than elegant mathematical theories, to be led
down this path. It is a very simple-minded path. What it's saying is
that the most convincing story about some data is the shortest one.
Full stop. OK, it's not easy to turn that precept quantitatively into
something that you can use, but that's basically what we've done. The
fundamental idea is trivial, but very strong and I think it still has a
lot in it which remains to be found out. I have been thoroughly
subverted by Kevin Korb and Charles Twardy -- don't go around talking
to philosophers, they muck your brain around -- but I am now persuaded
that, for instance, we can show that although given present knowledge,
the present state of the universe, we can make predictions, deduce
things about the future, we cannot actually deduce things about the
past. Well you can but you get stupid answers. When we are reasoning
about the past we actually have to use inductive logic, which is the
basis of statistical and inductive inference. So OK, this is amusing if
true. If it is true it solves a couple of conundrums which have been
puzzling philosophers of science for a while, but it remains to be seen
whether it is. So anyway, that's the kind of thing I'm thinking about
at the moment, utterly pointless, but there it is. I'm retired. I'm
allowed to do that. Any questions?
Q: inaudible.
A: The point that it is the same as the
Bayesian approach,
even that hasn't sunk in. Because until you come to realize the
importance of considering the precision of statement of
parameters the ordinary Bayesian approach doesn't get you there. You
can't equate posterior probability density with posterior probability.
Until you do the business about Fisher informations and precisions of
estimation you can't use a Bayesian technique to give you, to lead you
to a hypothesis which you can plausibly say, is the hypothesis of
highest posterior probability, because you can't assign any posterior
probability to any hypothesis if they've got real parameters -- in the
conventional Bayesian sense. So that message still has got to get
across.
What is happening is that people who
wouldn't touch
this sort of analysis with a barge-pole are in fact ending up doing
things which amount to the same thing. There's a hot topic in work on
using neural nets to learn supervised classification which is going for
"wide margin" classifiers. If you look at the mathematics of what
they're actually doing, what the wide margin is doing is precisely
taking account of how precisely you have to specify the parameters of
the network. So they are putting that into their things.
If you look at structural risk minimisation,
another
apparently unrelated thing of Vapnik's which is based on VC dimension,
you find that in certain cases which are pretty general, there are at
least three ways of expressing the VC dimension. If you look at it
closely they correspond to three different ways of encoding the data.
And the different ways depend on,... well OK one can be better
than the others and which comes best depends on how precisely you have
to specify estimates.
So the actual practice of this, the
mathematical consequences,
are getting into peoples' heads, but they're getting there by all sorts
of funny routes which is encouraging -- it means there are all sorts of
justifications for doing it.
Q: inaudible.
A: Well it [Snob]
was produced by David Boulton and myself to do classification, that's
it. If you use what we... It was a very specific application
and it wasn't until after we'd done it that we realized the principles
behind it were capable of being generalized to lots of other problems.
We just saw it as, if you want to do
classification this is
the way you damn well ought to do it. Well the principle is: Find that
classification of the sample which allows you to encode the data in the
sample, most concisely. Full stop. That's what drove it. That's what
the algorithm,... I mean you can look at the algorithm and
that's obviously what it's doing. Or you can look at the algorithm and
also say what it's obviously doing is picking the model of highest
Bayesian posterior probability. So it's very obvious.
Q: inaudible.
A: Oh my message is I made a big
mistake by getting
into academic work.
[general laughter]
No, I mean that. And to be a little
pessimistic, I don't see a
huge future for people wanting to do the kind of academic work which
I've been doing all my life and which most of you have been, well the
older ones among us, have been doing all their lives. I don't think
it's respected, I don't think it's going to be resourced, and one has
to begin to wonder what the hell the point was. Well on that unhappy
note, I'll stop.
[applause]
References mentioned in the text, in
publication order
(Added later by LA)
- R. Solomonoff. A formal theory of
inductive inference,
I and II. Information and Control 7, pp.1-22 and
pp.224-254, 1964.
- A. N. Kolmogorov. Three approaches to
the quantitative
definition of information. Problems of Information and Transmission
1(1), pp.1-7, 1965.
- G. J. Chaitin. On the length of
programs for computing
finite binary sequences. J. A.C.M. 13(4), pp.547-569,
October 1966.
- A. Thom. Megalithic sites in Britain.
Clarendon
Press, Oxford, 1967.
- C. S. Wallace & D. M. Boulton. An
information measure for classification. Computer J. 11(2),
pp.185-194, August 1968 (also [html]).
- D. M. Boulton & C. S. Wallace. The
information
content of a multistate distribution. J. Theor. Biol. 23,
pp.269-278, 1969.
- D. M. Boulton & C. S. Wallace. A
program for numerical classification. Computer J. 13(1),
pp.63-69, February 1970.
- D. M. Boulton & C. S. Wallace. An
information measure for hierarchic classification. Computer J. 16(3),
pp.254-261, August 1973.
- D. M. Boulton. The Information Measure
Criterion for
Intrinsic Classification. PhD
thesis, Monash University, 1975.
- G. J. Chaitin. Randomness and
mathematical proof.
Scientific American 232, pp.47-52, May 1975. (And
also: Randomness in arithmetic. Scientific American, pp.80-85,
July 1985.)
- B. R. Gaines. Behaviour structure
transformations under
uncertainty. Int. J. Man-Machine Studies 8, pp.337-365,
1976.
- J. Rissanen. Modeling by shortest data
description.
Automatica 14, pp.465-471, 1978.
- J. D. Patrick. An information measure
comparative
analysis of megalithic geometries. PhD thesis, Dept. Computer
Science, Monash University, 1979.
- J. Patrick & C. S. Wallace. Stone
circle
geometries: An information theory approach. In "Archaeoastronomy in
the New World", D. Heggie (ed), CUP, 1982.
- M. P. Georgeff & C. S. Wallace. A
general selection
criterion for inductive inference. European Conf. on Artificial
Intelligence, Pisa, Elsevier Science Publ., pp.473-482, September 1984.
(Also [TR32]
Dept. Comp. Sci., Monash University, March 1983.) - C. S. Wallace & P. R. Freeman. Estimation
and inference by compact coding. J. Royal Stat. Soc. B, 49(3),
pp.240-265, 1987.
- C. S. Wallace & P. R. Freeman. Single-factor
analysis by minimum message length estimation. J. Royal Stat.
Soc. B, 54(1), pp.195-209, 1992.
- L. Allison, C. S. Wallace & C. N. Yee. Finite-state
models in the alignment of macro-molecules. J. Mol. Evol., 35(1),
pp.77-89, July 1992.
- T. MacKenzie, D. Platt & T. Dix. Modelling
errors
in restriction mapping. Hawaii Int. Conf. Sys. Sci. 26,
January 1993.
- C. S. Wallace, K. Korb & H. Dai. Causal
discovery
via MML. Proc. 13th Int. Conf. on Machine Learning, L. Saitta (ed),
Morgan Kaufmann, pp.516-524, 1996.
|