As per lecture #2, Thursday 29 July:
Step 1:
Write a program to generate test data for the problem.
The program should accept as parameters:
the number of data points N,
the number of segments k,
the range for the segment-means [lo..hi],
the range for the segment-standard-deviations (0..sd_hi].
Make it so that
the cut-points are drawn uniformly from [1..N-1]+0.5,
the means are distributed uniformly over [lo..hi], and
the standard deviations are distributed uniformly over (0..sd_hi].
Run the program (a lot) and look at its output
to see how you would try to identify segment boundaries by hand.
Think about nasty cases such as two adjacent segments with similar means,
short segments, etc.
It is always a good idea to write a generator for your model(s) because then you have test data where you know the answer. The answer is rarely known with absolute certainty for real data. Writing a generator may also quickly show up problems with your model.
Step 2: You should have a copy of W. Fisher's (1958) paper on minimum-RMS* segmentation of a sequence (thanks to Dean McKenzie for finding this paper, and note that it is by Fisher W., not R.A.). Amongst other related problems, Fisher describes a method of finding the minimum-RMS s-segmentation of the sequence into s=1, 2, ..., and L segments for some given upper limit L. An optimal s-segmentation of [1..j] (NB. j) must have a last segment [i..j] for some i<j. Deleting this last segment must leave an optimal (s-1)-segmentation of [1..i] (otherwise the original could be improved upon). Therefore optimal s-segmentations of [1..j], for j=1..N, can be found by extending optimal (s-1)-segmentations of [1..i] for i=1..N. This implies an O(L.N²) dynamic programming algorithm for finding minimum-RMS segmentations of up to L segments. The problem is that as L increases, the RMS value decreases to zero, certainly by the time that s=N, yet it seems obvious that s<<N for any plausible hypothesis. When to stop? Anyway, step 2 is to implement Fisher's algorithm.
Note that the mean and standard deviation of a sequence of numbers can be found in just one pass through the sequence, accumulating the sum and also the sum of the squares. If these values are kept, the mean and standard deviation of any segment [i..j] can be calculated quickly on demand.
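Combining the prefix sums with the dynamic programming recurrence, one way the O(L.N²) algorithm might look in C (the name fisher and the convention that cost[s][j] is the best s-segmentation of x[0..j-1] are this sketch's choices, not Fisher's notation):

```c
#include <stdlib.h>
#include <float.h>

/* Minimum total within-segment sum of squared deviations for splitting
   x[0..N-1] into exactly s segments, s = 1..L.  Returns an (L+1) x (N+1)
   table; cost[s][j] is the best s-segmentation of the first j points.
   O(L*N^2) time, as in Fisher (1958); the caller frees the table. */
double **fisher(const double *x, int N, int L) {
    /* one pass: S[j] = x[0]+...+x[j-1], S2[j] the same for squares */
    double *S  = calloc(N + 1, sizeof *S);
    double *S2 = calloc(N + 1, sizeof *S2);
    for (int j = 0; j < N; j++) {
        S[j + 1]  = S[j]  + x[j];
        S2[j + 1] = S2[j] + x[j] * x[j];
    }
    double **cost = malloc((L + 1) * sizeof *cost);
    for (int s = 0; s <= L; s++) {
        cost[s] = malloc((N + 1) * sizeof **cost);
        for (int j = 0; j <= N; j++) cost[s][j] = DBL_MAX;
    }
    cost[0][0] = 0.0;
    for (int s = 1; s <= L; s++)
        for (int j = s; j <= N; j++)            /* s segments need >= s points */
            for (int i = s - 1; i < j; i++) {   /* last segment is x[i..j-1] */
                if (cost[s - 1][i] == DBL_MAX) continue;
                int    n   = j - i;
                double sum = S[j] - S[i];
                /* sum of squared deviations from the segment mean, on demand */
                double ssd = (S2[j] - S2[i]) - sum * sum / n;
                if (cost[s - 1][i] + ssd < cost[s][j])
                    cost[s][j] = cost[s - 1][i] + ssd;
            }
    free(S); free(S2);
    return cost;
}
```

Keeping a parallel table of the minimising i for each (s, j) lets the actual cut points be recovered by tracing back from (s, N).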
Step 3: Use the MML theory for the normal distribution to include a cost for the hypothesis, i.e. stating k, the cut points, and the mean and standard deviation of each segment. This will provide a way of choosing a best value for k. As k increases the second part (data) of the message shrinks, but the first part (hypothesis) grows. Somewhere there is an optimum trade-off. We can also tell if the best is conclusive or marginal.
Just a thought: Let D[i,j], where i<j, be the cost of stating the mean & sd and the data values in the segment [i,j], plus one cut point. Consider D[i,j] to be the adjacency matrix of a graph. A shortest path from vertex 1 to vertex N implies a very good segmentation (there is a little inaccuracy if the costs of stating `k' and of stating a cut-point vary with k). Note that Dijkstra's single-source shortest-paths algorithm runs in O(N²) time on a dense graph.
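That shortest-path formulation might be sketched like this: an array-based O(N²) Dijkstra over vertices 0..N, where every forward edge (i,j) costs whatever the caller's edge_cost returns (the callback and demo_cost below are toy stand-ins for the real D[i,j]):

```c
#include <stdlib.h>
#include <float.h>

/* Shortest path from vertex 0 to vertex N in the complete forward
   graph whose edge (u,v), u<v, costs edge_cost(u, v, ctx).  Dense
   O(N^2) Dijkstra, no heap needed.  pred[] records the path, so the
   chosen vertices are the cut points of the segmentation. */
double shortest_segmentation(int N,
                             double (*edge_cost)(int, int, void *),
                             void *ctx, int *pred) {
    double *dist = malloc((N + 1) * sizeof *dist);
    int    *done = calloc(N + 1, sizeof *done);
    for (int v = 0; v <= N; v++) { dist[v] = DBL_MAX; pred[v] = -1; }
    dist[0] = 0.0;
    for (int iter = 0; iter <= N; iter++) {
        int u = -1;                              /* closest unfinished vertex */
        for (int v = 0; v <= N; v++)
            if (!done[v] && (u < 0 || dist[v] < dist[u])) u = v;
        if (dist[u] == DBL_MAX) break;
        done[u] = 1;
        for (int v = u + 1; v <= N; v++) {       /* edges only go forward */
            double w = edge_cost(u, v, ctx);
            if (dist[u] + w < dist[v]) { dist[v] = dist[u] + w; pred[v] = u; }
        }
    }
    double best = dist[N];
    free(dist); free(done);
    return best;
}

/* Toy edge cost for trying the routine out: jumps of 2 are cheap. */
static double demo_cost(int u, int v, void *ctx) {
    (void)ctx;
    return (v - u == 2) ? 1.0 : 10.0;
}
```

In the real problem edge_cost would return D[i,j] computed from the prefix sums, so the whole search stays O(N²) once the sums are in hand.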
You can do the prac' in C, C++, Java or JavaScript (the last would be slow but can draw on the existing demonstration in the normal-distribution page and it would be very nice to have as a new web demo').
Test your final program in a reasonably thorough way.
e.g. If one generates some data (step 1) and analyses it,
there are three sets of figures:
(i) the true #segments, cut-points, means and sd's,
(ii) the sample means and sd's, and
(iii) the inferred #segments, cut-points and means and sd's.
How often is your program correct in #segments?
Is it biased one way or the other?
How is it influenced by the number of data points and the
number of segments?
What about the overlap between values in neighbouring segments,
e.g. if the SDs' range is not small compared to the means' range?
How close are the inferred cut-points to true ones?
How do the various differences true:inferred compare with true:sample?
Note that stating the cut points with full accuracy is almost certainly over-fitting! e.g. consider, ..., 1.0, 1.1, 0.9, 1.0, 1.5, 2.1, 1.9, 2.0, 2.0, ... If there is a segment boundary here, on which side of the 1.5 does it fall? One really cannot say. Using less accuracy for cut points would save some cost in the hypothesis. However, you do not have to consider this extra sophistication!
The above segmentation problem amounts to approximating a function by a piece-wise constant function. It is closely related to approximation by piece-wise linear functions, to polygon fitting, etc. One could also contemplate other distributions on cut-points and/or segment lengths, other distributions within segments, multi-variate data, etc. Murli & co. have examined segmenting binary sequences, e.g. 011000001010001111011011111000100000110000100. Multi-state sequences might be of interest in biology.
* RMS - root mean squared difference from the mean within a segment.
D. Dowe & L. Allison © 1999, Comp Sci and SWE, Monash University