[01]
>>
1: Introduction
- Learning: Given data, D,
learn (or fit, or estimate) a
hypothesis, H (or model, or parameter(s)), for D.
- Prediction:
Given input attribute(s) (or variable(s)), possibly trivial,
predict output attribute(s).
- Data:
Observations (values) drawn from some (sample|data) space.
- Typically
- a statistical model
is a formal mathematical model,
- a machine learning method
may be a complex model & use approximations,
- data mining emphasises (very) large data sets,
efficient & maybe ad hoc methods.
Much overlap, and different terminology!
Notation and terminology . . .
CSE454, 2002:
This document is online at
http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/
and contains hyper-links to other resources
- Lloyd Allison ©.
<<
[02]
>>
- Sample space
is set of possible outcomes of some experiment
- e.g. examine 1st element of a gene;
sample space = Base = {A, C, G, T}
- An event is a subset, possibly a singleton,
of the sample space
- e.g. purine = {A, G}
NB. The term data space is often used in machine learning.
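A minimal Python sketch (an added illustration, not from the original notes):
the sample space and an event as sets.

    base = {'A', 'C', 'G', 'T'}   # sample space for the 1st element of a gene
    purine = {'A', 'G'}           # an event is a subset of the sample space
    assert purine <= base         # subset test
    assert {'A'} <= base          # a singleton event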
<<
[03]
>>
- A random variable, X,
takes values, with probabilities, from the sample space
- Write P(X=A) or just P(A) etc.
- e.g. P(X=A) = 0.4
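A minimal Python sketch (an added illustration; only P(X=A) = 0.4 is from the
notes, the other probabilities are assumed for the example):

    import random

    outcomes = ['A', 'C', 'G', 'T']
    probs = [0.4, 0.2, 0.2, 0.2]  # P(X=A) = 0.4; the rest are assumed
    # draw ten values of the random variable X
    print(random.choices(outcomes, weights=probs, k=10))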
<<
[04]
>>
People often distinguish between
- selecting a model class,
- selecting a model from a class,
- estimating the parameters of a model.
e.g.
- model class = polynomials
- model = quadratic
- fully-parametrized model = 3x^2 - 4.5x + 7.2
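The three levels in a minimal Python sketch (an added illustration, not from
the original notes):

    def quadratic(a, b, c):
        # a model (quadratic) from the class (polynomials);
        # a, b and c are its parameters
        return lambda x: a * x**2 + b * x + c

    f = quadratic(3, -4.5, 7.2)   # the fully-parametrized model
    print(f(2))                   # 3*4 - 4.5*2 + 7.2 = 10.2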
<<
[05]
>>
Bayes
If B1, B2, ..., Bk is a partition of a set B (of causes) then

P(Bi|A) = P(A|Bi) P(Bi) / SUM_{j=1..k} P(A|Bj) P(Bj),   i = 1, 2, ..., k
<<
[06]
>>
. . . applied to data D and hypotheses Hi:

P(D) = SUM_{j=1..k} P(D|Hj) P(Hj)

P(Hi|D) = P(D|Hi) P(Hi) / P(D)                          posterior

P(Hi|D) / P(Hj|D) = (P(D|Hi) P(Hi)) / (P(D|Hj) P(Hj))   posterior odds-ratio
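A minimal Python sketch of the update (an added illustration; the function
name is mine, not from the notes), using exact arithmetic:

    from fractions import Fraction

    def posteriors(priors, likelihoods):
        # priors[i] = P(Hi), likelihoods[i] = P(D|Hi);
        # returns [P(H1|D), ..., P(Hk|D)]
        p_d = sum(p * l for p, l in zip(priors, likelihoods))   # P(D)
        return [p * l / p_d for p, l in zip(priors, likelihoods)]

    # e.g. the two-coin example of the following slides:
    print(posteriors([Fraction(1, 2), Fraction(1, 2)],
                     [Fraction(1, 16), Fraction(4, 81)]))
    # [Fraction(81, 145), Fraction(64, 145)]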
<<
[07]
>>
- P(Hi)
prior probability of Hi
- P(Hi|D)
posterior probability of Hi
- P(D|Hi) likelihood
NB. Can ignore P(Hi) in posterior odds-ratio
if, and only if, P(Hi)=P(Hj).
Maximum likelihood can cause problems when the priors are unequal.
<<
[08]
>>
Example
C1, a fair coin, P(H) = P(T) = 0.5.
C2, a biased coin, P(H) = 2/3, P(T) = 1/3.
One of the coins is thrown 4 times, giving H, T, T, H.
Which coin was thrown?
H1 : was C1.
H2 : was C2.
<<
[09]
>>
Prior, P(C1) = P(C2) = 0.5.
Likelihood, P(HTTH | C1) = 1/16
and P(HTTH | C2) = (2/3)^2 . (1/3)^2 = 4/81.
Posterior odds-ratio,
P(C1|HTTH)/P(C2|HTTH) =
(1/16 . 1/2) / (4/81 . 1/2) =
81/64.
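Checking the arithmetic with exact fractions (an added Python sketch):

    from fractions import Fraction

    like_c1 = Fraction(1, 2) ** 4                        # P(HTTH|C1) = 1/16
    like_c2 = Fraction(2, 3)**2 * Fraction(1, 3)**2      # P(HTTH|C2) = 4/81
    half = Fraction(1, 2)                                # prior of each coin
    print((like_c1 * half) / (like_c2 * half))           # 81/64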
<<
[10]
>>
Now, P(C1|HTTH) + P(C2|HTTH) = 1
and if x/(1-x) = 81/64, then
64.x = 81 - 81.x,
145.x = 81,
x = 81/145, so
P(C1|HTTH) = 81/145.
This case is simple because the model space is discrete,
in fact finite (just two models).
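Solving for the posterior exactly (an added Python sketch): if the odds are
o = x/(1-x) then x = o/(1+o).

    from fractions import Fraction

    odds = Fraction(81, 64)
    x = odds / (1 + odds)         # P(C1|HTTH)
    print(x, 1 - x)               # 81/145 64/145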
<<
[11]
>>
e.g. prediction
Know P(C1) = 81/145, P(C2) = 64/145.
The more likely coin is C1.
If we assumed the coin really was C1, we would
predict P(H) = 0.5 in future.
But the coin might be C2.
Should predict P(H) =
81/145 . 1/2 + 64/145 . 2/3 =
(243 + 256) / (145 . 6) =
499/870 ≈ 0.57,
i.e. use a weighted average of the hypotheses.
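The weighted-average prediction, exactly (an added Python sketch):

    from fractions import Fraction

    p_c1, p_c2 = Fraction(81, 145), Fraction(64, 145)        # posteriors
    p_h = p_c1 * Fraction(1, 2) + p_c2 * Fraction(2, 3)      # next throw
    print(p_h, float(p_h))                                   # 499/870 ~ 0.574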
<<
[12]
>>
Conclusion
We have looked at
- data
- models, parameters
- priors, likelihood, posterior
- inference
- prediction
simple examples!
<<