^MML^

Clustering + a Markov Model

On work by Tim(e) Edgoose

Other debts (alphabetic order): Rohan Baxter, David Dowe, Chris Wallace.

© 2001 Lloyd Allison, Computer Science and Software Engineering, Monash University, Victoria, Australia 3800

This document is online at users.monash.edu/~lloyd/Seminars/Mixture-MM/index.shtml and contains hyper-links to more detailed notes on certain topics and to some of the references.

<< [02] >>

The Problem: Clustering, (unsupervised-) Classification, Mixture Modelling, Numerical Taxonomy, in a Time-Series.

Data: A sequence of multi-variate Observations.

Algorithm based on [Snob] which works on independent data.
See C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11, pp.185-195, 1968, and more recently the special issue of the Computer Journal 42(4) 1999
Extended: Observation_k can depend on Observation_k-1
i.e. Hidden Markov Model of order 1

<< [03] >>

The Model

(a) The number of classes (N)

(b) The relative abundance of each class

Distribution parameters for each significant attribute (e.g. mu, sigma)
Relative abundance of next class conditional on current class

(d) For each Observation

An assigned class (but see twist later)
Attribute values given this class

<< [04] >>

Model Complexity:

Bayes (1763)
- P(H|D) = P(H) . P(D|H) / P(D)
Fisher
- [Fisher Information]: sensitivity of P(D|H) to (continuous) parameters
Shannon (c1948)
- msgLen(E) = -log₂(P(E)) in optimal code, so . . .
- msgLen(H&D)= msgLen(H) + msgLen(D|H)
  = msgLen(D) +msgLen(H|D)
- msgLen(H)= model complexity
- msgLen(D|H) = data complexity | H

<< [05] >>

Model Transitions for 2 Classes

<< [06] >>

Estimation (1)

Given an optimal path of transitions through the model, can use frequencies to estimate transition probabilities etc.
(State to optimum degree of accuracy.)
New estimates may change optimal path through model.
Re-estimate, loop, . . . converges.
(Still have to consider splitting/ merging classes.)

But this gives biased estimates of model parameters!

<< [07] >>

Estimation (2) - the twist

Class memberships of Observations are nuisance parameters if interested in best class description only.
Solution: Sum probabilities of all paths through model.
Legit': Different paths are exclusive sub-hypotheses.

more...

<< [08] >>

Forward:

Consider all paths leading to o_k | c_j:

F(o₁ in c_i) = P(c_i).P(o₁|c_i)

F(o_k in c_i) = SUM_j=1..N F(o_k-1 in c_j). P(c_i|c_j). P(o_k|c_i)

1 < k <= K

P(D|H) = SUM_i=1..N F(o_K in c_i)

<< [09] >>

Backwards

Consider all paths leading from o_k | c_j:

B(o_K in c_i) = 1; B(o_k = SUM_j=1..N P(c_j|c_i). P(o_k+1|c_j). B(o_k+1 in c_j); 1 <= k < K
Now: P(D|H, o_k in c_i) = F(o_k in c_i). B(o_k in c_i).
and: P(D|H) = SUM_i=1..N P(D|H, o_k in c_i); for 1 <= k <= K

<< [10] >>

. . . this gives the probability that o_k is in c_j. i.e. We can deal with fractional membership of classes.
Results in unbiased estimates of class parameters and transition probabilities.

Hidden Markov Models path conditional in ok in c1

<< [11] >>

How Many Classes?

Can split class c_i into c_i' and c_i'', in ratio w:1-w, e.g. w=0.5. New transition prob's initialised:

next> v. curr'	c_i'	c_i''	c_j
. . .
c_i'	P(c_i\|c_i).w	P(c_i\|c_i).(1-w)	P(c_j\|c_i)
. . .
c_i''	P(c_i\|c_i).w	P(c_i\|c_i).(1-w)	P(c_j\|c_i)
. . .
c_j	P(c_i\|c_j).w	P(c_i\|c_j).(1-w)
. . .

<< [12] >>

Can also merge two classes.

Splitting or merging takes place if there is a nett reduction in message length.

Good heuristic, fast.

Must converge.

Could get stuck in local optimum, but unlikely.

<< [13] >>
Tests, e.g. 2-classes:

varying class auto transition from 0.0 to 1.0
Hidden Markov Model Msg Len v. Auto-Correlation parameter

<< [14] >>

varying data set size, t=0.8
Hidden Markov Model Msg Len v. data set size, auto-correlation=0.8, s=2

1C: Snob, 1-class; 2C: Snob, 2-classes; M2C: Markov Model 2-classes.
M2C wins if log₁₀(K)>2.3

<< [15] >>
e.g. protein dihedral angles:

class structure by Hidden Markov Model on Protein secondary structure dihedral angles

Left: 17 classes

[big pic']
[paper]
[Bioinformatics]

<< [16]

Conclusion

If there is sequential correlation, the algorithm can better infer class structure from less data because it models the correlation.
Time complexity ~ length of sequence × #classes²
T. Edgoose, L. Allison & D. L. Dowe. An MML Classification of Protein Sequences that knows about Angles and Sequences. Pacific Symposium on Biocomputing, PSB98, pp.585-596, 1998 [paper]
Tim Edgoose & Lloyd Allison. Minimum Message Length Hidden Markov Modelling Data Compression Conf (DCC), pp.169-178, 1998 [paper]
T. Edgoose & L. Allison. MML Markov classification of sequential data. Statistics and Computing 9, pp.269-278, 1999 [paper]

Created with "vi (Linux & IRIX)", charset=iso-8859-1