|
Q: What is the difference between a hypothesis and a theory?
A: Think of a hypothesis as a card. A theory is a house made of hypotheses.
(From rec.humor.funny, attributed to Marilyn vos Savant.)
Introduction
People often distinguish between
- selecting a model class
- selecting a model from a given class and
- estimating the parameter values of a given model
to fit some data.
It is argued that although these distinctions
are sometimes of practical convenience, that's all:
They are all really one and the same process of inference.
A (parameter estimate of a) model (class) is not a prediction,
at least not a prediction of future data, although
it might be used to predict future data.
It is an explanation of, a hypothesis about,
the process that generated the data.
It can be a good explanation or a bad explanation.
Naturally we prefer good explanations.
Model Class
e.g. polynomial models form a model class for
sequences of points
{<xi,yi>} s.t. xi<xi+1}.
The class includes constants, straight lines, quadratics, cubics, etc.
and note that these have increasing numbers of parameters;
they grow more complex.
Note also that the class does not include functions based on
trigonometric functions; they form another class.
Model
e.g. The general cubic equation
y = ax3+bx2+cx+d
is a model for
{<xi,yi>} s.t. xi<xi+1}.
It has four parameters a, b, c & d
which can be estimated
to fit the model to some given data.
It is usually the case that a model has a fixed number
of parameters (e.g. four above), but this can become blurred
if hierarchical parameters or dependent parameters crop up.
Some writers reserve model for a model (as above) where
all the parameters have fixed, e.g. inferred, values;
if there is any ambiguity I will (try to) use model instance
for the latter.
e.g. The normal distribution N(mu,sigma) is a model and
N(0,1) is a model instance.
Parameter Estimation
e.g. If we estimate the parameters of a cubic to be
a=1, b=2, c=3 & d=4,
we get the particular cubic polynomial
y = x3+2x2+3x+4.
Over-fitting
Over-fitting often appears as selecting
a too complex model for the data.
e.g. Given ten data points from a physics experiment,
a 9th-degree polynomial could be fitted through them, exactly.
This would almost certainly be a ridiculous thing to do.
That small amount of data is probably better described by a straight line
or by a quadratic with any minor variations explained as
"noise" and experimental error.
Parameter estimation provides another manifestation of over-fitting:
stating a parameter value too accurately is also over-fitting.
- e.g. I toss a coin three times.
It lands heads once and tails twice.
What do we infer? p=0.5, p=0.3, p=0.33, p=0.333 or p=0.3333, etc.?
In fact we learn almost nothing, except that the coin does
have a head on one side and a tail on the other; p>0 and p<1.
- e.g. I toss a coin 30 times.
It lands heads 10 times and tails 20 times.
We are probably justified in starting to suspect a bias,
perhaps 0.2<p<0.4.
- The accuracy with which the bias of a coin can be estimated
can be made precise; see the
[2-State / Binomial Distribution].
Classical statistics has developed a variety of
significance tests to judge whether a model
is justified by the data. An alternative is described below.
MML
Attempts to minimise the discrepancy between given data, D, and
values implied by a hypothesis, H, almost always results in over-fitting,
i.e. a too complex hypothesis (model, parameter estimate,...).
e.g. If a quadratic gives a certain root mean squared (RMS) error,
then a cubic will in general give a smaller RMS value.
Some penalty for the complexity of H is needed
to give teeth to the so-called "law of diminishing returns".
The minimum message length (MML) criterion is to consider a two-part message
(remember [Bayes]):
2-part message: H; (D|H) probabilities: P(H&D) = P(H).P(D|H) message length: msgLen(H&D) = msgLen(H) + msgLen(D|H)
A complex hypothesis has a small prior probability, P(H),
a large msgLen(H);
it had better make a big saving on msgLen(D|H) to pay for its msgLen(H).
The name minimum message length is after Shannon's mathematical
theory of communication.
The first part of the two-part message can be considered to be a
"header", as in data compression or data communication.
Many file compression algorithms produce a header in the compressed
file which states a number of parameter values etc., which are necessary
for the data to be decoded.
The use of a prior, P(H), is considered
to be controversial in classical statistics.
Notes
The idea of using compression to guide inference seems to
have started in the 1960s.
- R. J. Solomonoff. A Formal Theory of Inductive Inference, I and II.
Information and Control 7 pp1-22 and pp224-254, 1964
- A. N. Kolmogorov. Three approaches to the
Quantitaive Definition of Information.1(1) pp1-7, 1965
Problems of Information and Transmission
- G. J. Chaitin. On the Length of Programs for Computing
Finite Binary Sequences.
JACM 13(4) pp547-569, Oct' 1966
- C. S. Wallace and D. M. Boulton.
An Information Measure for Classification.
CACM 11(2) pp185-194, Aug' 1968
Seems to be the first(?) application of the principle
to a real inference problem resulting in a practical and useful computer
program.
- Theta: variously the parameter-space, model class etc.
- theta or H: variously a particular parameter value, hypothesis, model, etc
- P(theta) or P(H): prior probability of parameter value, hypothesis etc.
- X: data space, set of all possible observations
- D: data, an observation, D:X, often x:X
- P(D|H): the likelihood of the data D, is also written as f(D|H)
- P(H|D) or P(theta|D): the posterior probability of an estimate
or hypothesis etc. given observed data
- P(D, H) = P(D & H): joint probability of D and H
- m(D) :X->Theta, function mapping observed data D
onto an inferred parameter estimate, hypothesis, etc. as appropriate
Maximum Likelihood
The maximum likelihood principle is to choose H
so as to maximise P(D|H)
e.g. Binomial Distribution (Bernouilli Trials)
A coin is tossed N times, landing heads, #head-times, and
landing tails, #tail=N-#head times.
We want to infer p=P(head).
The likelihood of <#head,#tail> is:
P(#head|p) = p#head.(1-p)#tail
Take the -ve log because (it's easier and) maximising the likelihood
is equivalent to minimising the negative log likelihood:
-log2(P(#head|p)) = -#head.log2(p) -#tail.log2(1-p)
differentiate with respect to p and set to zero:
-#head/p + #tail/(1-p) = 0 #head/p = #tail/(1-p) #head.(1-p) = (N-#head).p #head = N.p p = #head / N, q = 1-p = #tail / N
So the maximum-likelihood inference for the bias of the coin is #head/N.
To sow some seeds of doubt, note that if the coin is thrown just once,
the estimate for p must be either 0.0 or 1.0,
which seems rather silly, although one could argue
that such a small number of trials is itself rather silly.
Still, if the coin were thrown 10 times and happened to
land heads 10 times, which is conceivable, an estimate of 1.0 is not sensible.
e.g. Normal Distribution
Given N data points,
the maximum likelihood estimators for the parameters of
a normal distribution, mu', sigma' are given by:
-
N
mu' = {SUM xi} / N --the sample mean
i=1
N
sigma'2 = {SUM (xi-mu')2}/N --the sample variance
i=1
Note that sigma'2 is biased,
e.g. if there are just two data values
it is implicitly assumed that they lie on opposite sides
of the mean which plainly is not necessarily the case,
i.e. the variance is under-estimated.
Notes
This section follows Farr (p17..., 1999)
- G. Farr. Information Theory and MML Inference.
School of Computer Science and Software Engineering,
Monash University, 1999
|