[01]
>>
Clustering + a Markov Model
On work by Tim(e) Edgoose
Other debts (alphabetic order):
Rohan Baxter, David Dowe, Chris Wallace.
© 2001 Lloyd Allison,
Computer Science and Software Engineering, Monash University,
Victoria, Australia 3800
This
document is online at
www.csse.monash.edu.au/~lloyd/Seminars/Mixture-MM/index.shtml
and contains hyper-links to more detailed notes on certain topics
and to some of the references.
<<
[02]
>>
The Problem: Clustering, (unsupervised-) Classification,
Mixture Modelling, Numerical Taxonomy,
in a Time-Series.
Data: A sequence of multi-variate Observations.
- Algorithm based on
[Snob]
which works on independent data.
See
C. S. Wallace & D. M. Boulton.
An Information Measure for Classification.
Computer Journal, 11, pp.185-195, 1968,
and more recently the special issue of the
Computer Journal 42(4) 1999
- Extended:
Observation_{k} can depend on Observation_{k-1}
- i.e. Hidden Markov Model of order 1
<<
[03]
>>
The Model
- (a) The number of classes (N)
- (b) The relative abundance of each class
- (c) For each class
- Distribution parameters for each significant attribute
(e.g. mu, sigma)
- Relative abundance of next class conditional on current class
- (d) For each Observation
- An assigned class (but see twist later)
- Attribute values given this class
<<
[04]
>>
Model Complexity:
- Bayes (1763)
- P(H|D) = P(H) . P(D|H) / P(D)
- Fisher
- Shannon (c1948)
- msgLen(E) = -log_{2}(P(E)) in an optimal code, so . . .
- msgLen(H&D) = msgLen(H) + msgLen(D|H)
              = msgLen(D) + msgLen(H|D)
- msgLen(H)   = model complexity
- msgLen(D|H) = data complexity given H
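As a toy illustration of the additivity above (the probabilities are made up for the example, not from the talk), message lengths in bits simply add:

```python
import math

def msg_len(p):
    """Message length, in bits, of an event with probability p (Shannon)."""
    return -math.log2(p)

# Hypothetical prior and likelihood for some hypothesis H and data D.
p_H = 0.25          # P(H)
p_D_given_H = 0.5   # P(D|H)

# msgLen(H & D) = msgLen(H) + msgLen(D|H)
total = msg_len(p_H) + msg_len(p_D_given_H)
print(total)  # 3.0 bits: 2 bits for H plus 1 bit for D given H
```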
<<
[05]
>>
Model Transitions for 2 Classes
<<
[06]
>>
Estimation (1)
- Given an optimal path of transitions through the model, one can use
transition frequencies to estimate transition probabilities etc.
(stated to the optimum degree of accuracy).
- New estimates may change the optimal path through the model.
- Re-estimate, loop, . . . converges.
- (Still have to consider splitting / merging classes.)
But this gives biased estimates of the model parameters!
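The first step, estimating transition probabilities by counting along a hard class-assignment path, might be sketched as follows (function and variable names are mine, not the talk's):

```python
from collections import Counter

def estimate_transitions(path, n_classes):
    """Estimate transition probabilities from a hard class assignment.
    path: sequence of class indices, one per observation in order."""
    counts = Counter(zip(path, path[1:]))   # (current, next) pair frequencies
    totals = Counter(path[:-1])             # how often each class is left
    return [[counts[(j, i)] / totals[j] if totals[j] else 1.0 / n_classes
             for i in range(n_classes)]
            for j in range(n_classes)]

# A hard path through a 2-class model.
path = [0, 0, 1, 1, 1, 0]
P = estimate_transitions(path, 2)
print(P)
```

Re-estimating from a single hard path like this is exactly what biases the estimates; the "twist" on the next slide is the remedy.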
<<
[07]
>>
Estimation (2) - the twist
- Class memberships of Observations are nuisance parameters
if interested in best class description only.
- Solution: Sum probabilities of all paths through model.
- Legitimate: different paths are mutually exclusive sub-hypotheses.
more...
<<
[08]
>>
Forward:
Consider all paths leading to o_{k} | c_{i}:
- F(o_{1} in c_{i}) = P(c_{i}).P(o_{1}|c_{i})
- F(o_{k} in c_{i}) = SUM_{j=1..N} F(o_{k-1} in c_{j}).P(c_{i}|c_{j}).P(o_{k}|c_{i}),   1 < k <= K
- P(D|H) = SUM_{i=1..N} F(o_{K} in c_{i})
<<
[09]
>>
Backwards
Consider all paths leading from o_{k} | c_{i}:
- B(o_{K} in c_{i}) = 1
- B(o_{k} in c_{i}) = SUM_{j=1..N} P(c_{j}|c_{i}).P(o_{k+1}|c_{j}).B(o_{k+1} in c_{j}),   1 <= k < K
- Now
- P(D|H, o_{k} in c_{i}) = F(o_{k} in c_{i}).B(o_{k} in c_{i})
- and
- P(D|H) = SUM_{i=1..N} P(D|H, o_{k} in c_{i}),   for 1 <= k <= K
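A matching sketch of the backward pass (my names; toy numbers, not from the talk). The forward values are recomputed inline, using the same recurrence as the previous slide, so the snippet stands alone; the check at the end is the identity above, that F.B summed over classes gives the same P(D|H) at every position k:

```python
def backward(A, obs_lik):
    """Backward pass: B[k][i] = P(o_{k+1}..o_K | o_k in c_i); B[K-1][i] = 1."""
    N, K = len(A), len(obs_lik)
    B = [[1.0] * N for _ in range(K)]
    for k in range(K - 2, -1, -1):
        for i in range(N):
            B[k][i] = sum(A[i][j] * obs_lik[k + 1][j] * B[k + 1][j]
                          for j in range(N))
    return B

# Toy 2-class model (made-up numbers).
pi  = [0.6, 0.4]
A   = [[0.7, 0.3], [0.4, 0.6]]            # A[i][j] = P(c_j | c_i)
lik = [[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]]  # lik[k][i] = P(o_k | c_i)

# Forward values, as in the previous recurrence.
F = [[pi[i] * lik[0][i] for i in range(2)]]
for k in range(1, 3):
    F.append([sum(F[k - 1][j] * A[j][i] for j in range(2)) * lik[k][i]
              for i in range(2)])

B = backward(A, lik)
# P(D|H, o_k in c_i) = F[k][i].B[k][i]; summing over i gives P(D|H) at any k.
pD = [sum(F[k][i] * B[k][i] for i in range(2)) for k in range(3)]
print(pD)  # the same value three times
```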
<<
[10]
>>
. . . this gives the probability that o_{k} is in c_{i},
i.e. we can deal with fractional
membership of classes.
This results in unbiased estimates of
class parameters and transition probabilities.
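For instance, a class mean can be re-estimated with the fractional memberships as weights, rather than from a hard assignment (a minimal sketch; the membership values below are hypothetical, standing in for F.B / P(D|H)):

```python
def weighted_mean(xs, weights):
    """Class mean estimated with fractional memberships as weights."""
    w = sum(weights)
    return sum(x * wt for x, wt in zip(xs, weights)) / w

# Toy: 4 observations and the fractional membership of one class at each step.
xs = [1.0, 2.0, 10.0, 11.0]
m1 = [0.9, 0.8, 0.1, 0.2]   # hypothetical P(o_k in c_1 | D, H) values
mu1 = weighted_mean(xs, m1)
print(mu1)  # 2.85
```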
<<
[11]
>>
How Many Classes?
Can split class c_{i} into c_{i}' and c_{i}'',
in ratio w:1-w, e.g. w=0.5.
New transition prob's initialised (rows: current class; columns: next class):

curr' \ next> | . . . | c_{i}'           | c_{i}''              | . . . | c_{j}          | . . .
c_{i}'        |       | P(c_{i}|c_{i}).w | P(c_{i}|c_{i}).(1-w) |       | P(c_{j}|c_{i}) |
c_{i}''       |       | P(c_{i}|c_{i}).w | P(c_{i}|c_{i}).(1-w) |       | P(c_{j}|c_{i}) |
c_{j}         |       | P(c_{i}|c_{j}).w | P(c_{i}|c_{j}).(1-w) |       | P(c_{j}|c_{j}) |
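The initialisation in the table might be coded as follows (a sketch with my own names, assuming the transition matrix is indexed rows = current, columns = next; c_{i}' keeps index i and c_{i}'' is appended as a new row/column):

```python
def split_class(A, i, w=0.5):
    """Split class i into i' and i'' in ratio w : 1-w, initialising the
    new transition matrix as in the table above."""
    N = len(A)
    new = []
    for r in range(N):
        row = A[r][:]
        row.append(row[i] * (1 - w))   # incoming prob routed to c_i''
        row[i] *= w                    # remainder stays with c_i'
        new.append(row)
    new.append(new[i][:])              # c_i'' starts with c_i's outgoing row
    return new

A = [[0.7, 0.3], [0.4, 0.6]]
A2 = split_class(A, 0, w=0.5)
print(A2)  # rows still sum to 1; the two halves start identical
```

Starting the two halves identical is deliberate: the subsequent re-estimation loop pulls them apart only if the data support two classes.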
<<
[12]
>>
Can also merge two classes.
Splitting or merging takes place if there is a net reduction
in message length.
A good heuristic, and fast.
Must converge.
Could get stuck in a local optimum, but this is unlikely.
<<
[13]
>>
Tests, e.g. 2-classes:
varying class auto transition from 0.0 to 1.0
<<
[14]
>>
varying data set size, t=0.8
1C: Snob, 1-class; 2C: Snob, 2-classes; M2C: Markov Model 2-classes.
M2C wins if log_{10}(K)>2.3
<<
[15]
>>
e.g. protein dihedral angles:
<<
[16]
>>
Conclusion
- If there is sequential correlation,
the algorithm can better infer class structure from less data
because it models the correlation.
- Time complexity ~ (length of sequence) × (number of classes)^{2}
- T. Edgoose, L. Allison & D. L. Dowe.
An MML Classification of Protein Sequences that knows about
Angles and Sequences.
Pacific Symposium on Biocomputing (PSB98), pp.585-596, 1998.
[paper]
- T. Edgoose & L. Allison.
Minimum Message Length Hidden Markov Modelling.
Data Compression Conference (DCC), pp.169-178, 1998.
[paper]
- T. Edgoose & L. Allison.
MML Markov classification of sequential data.
Statistics and Computing, 9, pp.269-278, 1999.
[paper]
<<