[01] >>

# Segmentation of Ordered Data

### © 2001, Lloyd Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168

See: L. J. Fitzgibbon, L. Allison & D. L. Dowe. Minimum message length grouping of ordered data. Proc. 11th Int. Workshop on Algorithmic Learning Theory (ALT'2000) pp56-70, Sydney, LNCS V1968, Springer-Verlag, December 2000

This document is online at  www.csse.monash.edu.au/~lloyd/Seminars/Series-Seg01/index.shtml   and contains hyper-links to more detailed notes on certain topics and to some of the references.

<< [02] >>

# Problem

• Given a sequence, s1, ..., sn, of multivariate data,
• divide it into k segments (i.e. k-1 cut-points)
• that best describe the data

- a difficult problem if the data ranges of neighbouring segments overlap.

Simplest model: One segment.

Most complex model: One segment per element (over-fitting).

<< [03] >>
e.g. Generated data, 4 segments, 3 cut-points, means & S.D.s shown
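Data of this kind can be generated with a sketch like the following (the segment lengths, means and S.D.s below are illustrative assumptions, not the values used in the figure):

```python
import random

random.seed(1)

# Four normal segments: (length, mean, standard deviation).
# These parameters are made up for illustration; the slide's actual
# generating parameters are not given here.
segments = [(25, 0.0, 1.0), (25, 3.0, 0.5), (25, 1.0, 2.0), (25, 4.0, 1.0)]

s = [random.gauss(mu, sd) for (length, mu, sd) in segments for _ in range(length)]

cut_points = [25, 50, 75]  # 3 cut-points => 4 segments
```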

<< [04] >>

# Fisher's Algorithm (1958)

Dynamic Programming:
• Given k,
• if c1, c2, ..., ck-1 are the optimal cut-points dividing s1, ..., sn into k segments,
• then c1, c2, ..., ck-2 are the optimal cut-points dividing the prefix s1, ..., sc into k-1 segments, where c = ck-1,
so
• compute the optimal 1-segmentations of s1, ..., si, for all i = 1..n,
• then iteratively
• compute the optimal k-segmentations from the optimal (k-1)-segmentations,
• up to some maximum value K.
O(K·n²)-time.
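A minimal Python sketch of this dynamic programme, using a least-squares segment cost (the function and variable names are my own; this is one possible implementation, not Fisher's original):

```python
def segment(s, K):
    """Optimal K-segmentation of s under least-squares cost, O(K*n^2) time."""
    n = len(s)
    # Prefix sums of x and x^2, so any segment's cost is O(1).
    S = [0.0] * (n + 1)
    S2 = [0.0] * (n + 1)
    for i, x in enumerate(s):
        S[i + 1] = S[i] + x
        S2[i + 1] = S2[i] + x * x

    def cost(i, j):  # sum of squared deviations of s[i:j] from its mean
        tot = S[j] - S[i]
        return (S2[j] - S2[i]) - tot * tot / (j - i)

    INF = float("inf")
    # best[k][i]: optimal cost of splitting the prefix s[0:i] into k segments;
    # back[k][i]: position of the last cut achieving that optimum.
    best = [[INF] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for i in range(k, n + 1):
            for c in range(k - 1, i):  # last segment is s[c:i]
                v = best[k - 1][c] + cost(c, i)
                if v < best[k][i]:
                    best[k][i], back[k][i] = v, c

    # Recover the K-1 cut-points by following the back-pointers.
    cuts, i = [], n
    for k in range(K, 1, -1):
        i = back[k][i]
        cuts.append(i)
    return best[K][n], sorted(cuts)
```

With `s = [0, 0, 0, 10, 10, 10]` and `K = 2`, the single cut is placed at position 3 with zero residual cost.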

<< [05] >>
2 segments, Least Squares

finds one cut point well

<< [06] >>
3 segments LSq

2nd cut point is close

<< [07] >>
4 segments LSq

two small segments inappropriately inferred

<< [08] >>
5 segments LSq

two cut points poor, one fair, one good

<< [09] >>

# Stopping

But Fisher's algorithm (1958) has no stopping criterion: the least-squares cost (say) decreases monotonically as k --> n.

Solution: Model has a message length (cost) . . .
• [Integer] # segments
• Cut-point positions to appropriate accuracy
• For each segment: its parameters to appropriate accuracy
• NB. The ``sufficient statistics'' for a segment i..j can be obtained from those for 1..i-1 and for 1..j by subtraction.
This prevents over-fitting.
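The sufficient-statistics remark can be made concrete: for normal segments, the statistics (count, Σx, Σx²) of any run follow by subtracting one prefix statistic from another, so each candidate segment is costed in O(1) time. A sketch, using 0-based half-open runs s[i:j] (function names are my own):

```python
def prefix_stats(s):
    """stats[i] = (count, sum, sum of squares) of the prefix s[:i]."""
    stats = [(0, 0.0, 0.0)]
    for x in s:
        n, t, t2 = stats[-1]
        stats.append((n + 1, t + x, t2 + x * x))
    return stats

def run_stats(stats, i, j):
    """Sufficient statistics of the run s[i:j], by subtraction of prefixes."""
    return tuple(b - a for a, b in zip(stats[i], stats[j]))
```

For example, with `s = [1.0, 2.0, 3.0, 4.0]`, `run_stats(prefix_stats(s), 1, 3)` gives `(2, 5.0, 13.0)` for the run `[2.0, 3.0]`.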

<< [10] >>
MML fit:

essentially correct

<< [11] >>
Real-World Data, MML:

W. Fisher's data: 96 years of Lake Michigan-Huron levels

<< [12] >>
Real-World Data, MML:

<< [13] >>
MML:

140 years of lake levels . . .

<< [14] >>

# Summary

• Minimum Message Length gives a stopping criterion (no overfitting).
• Time complexity: O(K·n²), where K is the maximum number of segments considered.

### References

• W. D. Fisher. On grouping for maximum homogeneity. J. Am. Stat. Assoc. 53, pp789-798, 1958.
Dynamic programming algorithm, no stopping criterion.
• R. A. Baxter & J. J. Oliver. The kindest cut: minimum message length segmentation. Proc. 7th Int. Workshop on Algorithmic Learning Theory, pp83-90, LNCS V1160, Springer-Verlag, 1996.
Binary data, investigates the precision of stating cut-points.
• L. J. Fitzgibbon, L. Allison & D. L. Dowe. Minimum message length grouping of ordered data. Proc. 11th Int. Workshop on Algorithmic Learning Theory (ALT'2000), pp56-70, Sydney, LNCS V1968, Springer-Verlag, December 2000.
Multivariate data, dynamic programming algorithm, avoids stating cut-points!
• Data: [1860-1955] ++ [1918-1999] ==> [1860-1999]
