Minimum Message Length Segmentation

Segmentation

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

FP
II
Ver.1.1
Seg

DPA

Given a series of data, [d₀, d₁, ..., d_n-1], does it consist of one, two [[d₀, ..., d_i], [d_i+1, ..., d_n-1]], or more segments, where each segment is homogeneous in some way? This problem is a special case of unsupervised classification because here the members of a segment ("class") must be contiguous. The number of segments and their boundaries must be learned.

Given an estimator, est, we can estimate a model for each (hypothetical) segment, calculate the message-length of each segment and its model, include the encoding of the segment "cut-points", and thus cost a segmentation. The best segmentation is the one having the shortest total message length.

W.D.Fisher (1958) gave an O(K.n²)-time algorithm to find the best segmentation having exactly K segments, but gave no "stopping criterion" to determine the best K.

If we can break the total segmentation cost into the sum of the cost of each segment, [d_i+1, ..., d_j] where j>i, we can use the 1-dimensional [dynamic programming algorithm] (DPA). Such a break-down requires stating: the length of each segment, the parameters of each segment (to appropriate precision), and the segment's members (the data) given those parameter estimates. This break-down is an approximation to the true cost of the segmentation because it omits the number of segments (small), and states the lengths of all segments which is redundant if the length of the series is common knowledge^[*]. However these discrepancies are small. (And they can be removed by using Fisher's algorithm with our costs but then we have to try K=1, 2, ..., K_max.)

edgeCost est dataSet =
 let n = length dataSet
     lrCosts = ...
      ... cost = nlPr wallaceIntModel (r-l) + msg segModel seg
      ...                    -- NB seg is elts l..r of dataSet
 in \i -> \j -> lrCosts !! (i+1) !! (j-i-1)

segment est dataSet =
 let ...
     (path,c) = dpa1D (-1) (length dataSet - 1) (edgeCost est dataSet)
     bounds   = zip (map ((+)1) path) (drop 1 path)
     segments = map (\(l,r) -> take (r-l+1) (drop l dataSet)) bounds
 ...

7/12/2005

The natural null-hypothesis is to apply the estimator to fit a single model to all of the data in the whole series.

The segmentation algorithm can be applied to any type of data -- univariate, multivariate, discrete, continuous, mixed, correlated or not -- provided only that the estimator, est, and the dataSet agree about that type.

The DPA takes O(n²)-time if the cost of each segment can be calculated in constant time.

Also see

M. B. Dale, L. Allison, P. E. R. Dale. Segmentation and clustering as complementary sources of information. Acta Oecologica, Elsevier, 31(2), pp.193-202, March-April 2007, doi:10.1016/j.actao.2006.09.002,
and more [refs].

[*] A way to "balance the books" is for the null-hypothesis to include the cost of stating the length of the series.

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Tuesday, 23-Apr-2024 17:16:30 AEST.