
CSE423 Prac' #1 1999

As per lecture #2, Thursday 29 July:

  1. Data consist of a sequence of floating point numbers, x1, x2, ..., xN.
    e.g. average annual water levels in a lake, average annual rainfall, daily change in a stock-exchange index, sun-spot activity, mean summer temperatures somewhere, results of a long-running physics experiment etc.
  2. The sequence is thought to be of `k' segments, where k = 1, 2, ..., or N, is not known in advance. Segment boundaries can be specified by giving k-1 ``cut points''.
  3. Values in a (potential) segment, [i,j], are thought to be described by a normal distribution.
  4. Write a program to determine the best hypothesis of the following form: k, the k-1 cut points, mean1, sd1, ..., meank, sdk.
  5. Due: end of week 6, semester 2, 1999
  6. Assessment: 20% of course mark

L. Allison

Step 1: Write a program to generate test data for the problem. The program should accept as parameters: the number of data points N, the number of segments k, the range for the segment-means [lo..hi], the range for the segment-standard-deviations (0..sd_hi].
Make it so that the cut-points are drawn uniformly from [1..N-1]+0.5, the means are distributed uniformly over [lo..hi], and the standard deviations are distributed uniformly over (0..sd_hi].
Run the program (a lot) and look at its output to see how you would try to identify segment boundaries by hand. Think about nasty cases such as two adjacent segments with similar means, short segments, etc.

It is always a good idea to write a generator for your model(s) because then you have test data where you know the answer. The answer is rarely known with absolute certainty for real data. Writing a generator may also quickly show up problems with your model.

Step 2: You should have a copy of W. Fisher's (1958) paper on minimum-RMS* segmentation of a sequence (thanks to Dean McKenzie for finding this paper, and note that it is by Fisher W., not R.A.). Amongst other related problems, Fisher describes a method of finding the minimum-RMS s-segmentation of the sequence for s=1, 2, ..., L, for some given upper limit L. An optimal s-segmentation of [1..j] (NB. j) must have a last segment [i+1..j] for some i<j. Deleting this last segment must leave an optimum (s-1)-segmentation of [1..i] (otherwise the original could be improved upon). Therefore optimum s-segmentations of [1..j], for j=1..N, can be found by extending optimum (s-1)-segmentations of [1..i] for i=1..N. This implies an O(L.N^2) dynamic programming algorithm for finding minimum-RMS segmentations of up to L segments. The problem is that as L increases, the RMS value decreases to zero, certainly by the time that s=N, yet it seems obvious that s<<N for any plausible hypothesis. When to stop? Anyway, step 2 is to implement Fisher's algorithm.
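
The recurrence can be sketched as below, minimising the within-segment sum of squared deviations (minimising that also minimises RMS). A minimal, unoptimised sketch; the table names `best' and `prev' are mine, not Fisher's.

```c
/* Fisher-style dynamic programming for minimum-SSQ segmentation.
 * best[s][j] = cost of the optimal s-segmentation of x[0..j-1];
 * prev[s][j] = start of the last segment, for traceback. */
#include <float.h>

#define MAXN 1000
#define MAXL 10

static double S[MAXN+1], S2[MAXN+1];    /* prefix sums of x and x^2 */

/* Sum of squared deviations of x[i..j-1] (0-based, half-open), O(1). */
static double ssq(int i, int j)
{
    double n = j - i, sum = S[j] - S[i], sum2 = S2[j] - S2[i];
    return sum2 - sum * sum / n;
}

void fisher(const double x[], int N, int L,
            double best[][MAXN+1], int prev[][MAXN+1])
{
    int s, i, j;
    S[0] = S2[0] = 0;
    for (i = 0; i < N; i++) {           /* one pass to build prefix sums */
        S[i+1]  = S[i]  + x[i];
        S2[i+1] = S2[i] + x[i] * x[i];
    }
    for (j = 1; j <= N; j++) { best[1][j] = ssq(0, j); prev[1][j] = 0; }
    for (s = 2; s <= L; s++)
        for (j = s; j <= N; j++) {      /* need >= s points for s segments */
            best[s][j] = DBL_MAX;
            for (i = s - 1; i < j; i++) {   /* last segment is x[i..j-1] */
                double c = best[s-1][i] + ssq(i, j);
                if (c < best[s][j]) { best[s][j] = c; prev[s][j] = i; }
            }
        }
}
```

The three nested loops give the O(L.N^2) running time claimed above; following `prev' back from best[s][N] recovers the cut points.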

Note that the mean and standard deviation of a sequence of numbers can be found in just one pass through the sequence, accumulating the sum and also the sum of the squares. If these values are kept, the mean and standard deviation of any segment [i..j] can be calculated quickly on demand.

Step 3: Use the MML theory for the normal distribution to include a cost for the hypothesis, i.e. stating k, the cut points, and the mean and standard deviation of each segment. This will provide a way of choosing a best value for k. As k increases the second part (data) of the message shrinks, but the first part (hypothesis) grows. Somewhere there is an optimum trade-off. We can also tell if the best is conclusive or marginal.
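
The trade-off has the usual two-part shape, with hypothesis H = (k; cut points; mean_s, sd_s). This is only the shape of the message; each term is to be costed using the MML theory for the normal distribution referred to above:

\[ \mathrm{MsgLen}(H,D) = \mathrm{MsgLen}(H) + \mathrm{MsgLen}(D \mid H) \]
\[ \mathrm{MsgLen}(H) = \mathrm{MsgLen}(k) + \mathrm{MsgLen}(c_1,\ldots,c_{k-1}) + \sum_{s=1}^{k} \mathrm{MsgLen}(\mu_s,\sigma_s) \]
\[ \mathrm{MsgLen}(D \mid H) = \sum_{s=1}^{k} \mathrm{MsgLen}(\text{data in segment } s \mid \mu_s,\sigma_s) \]

As k grows, the second line grows while the third shrinks; the best k minimises the total.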

Just a thought: Let D[i,j], where i<j, be the cost of stating the mean & sd and the data values in the segment [i,j], plus one cut point. Consider D[i,j] to be the adjacency matrix of a graph. A shortest path from vertex 1 to vertex N implies a very good segmentation (there is a little inaccuracy if the costs of stating `k' and of stating a cut-point vary with k). Note that Dijkstra's single-source shortest paths algorithm runs in O(N^2)-time.
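
A sketch of that graph view: vertices 0..N, an edge i->j for i<j weighted by the cost of stating the segment between them plus one cut point, and a plain O(N^2) Dijkstra from vertex 0. The cost matrix D here is just an input; computing it is the MML costing of step 3.

```c
/* O(N^2) Dijkstra on the segmentation graph.  dist[v] = cheapest way to
 * segment the first v points; via[v] allows traceback of the cut points. */
#include <float.h>

#define NV 101                          /* supports N up to 100 */

void shortest(double D[NV][NV], int N, double dist[], int via[])
{
    int done[NV] = {0};
    for (int v = 0; v <= N; v++) { dist[v] = DBL_MAX; via[v] = -1; }
    dist[0] = 0;
    for (int iter = 0; iter <= N; iter++) {
        int u = 0;
        for (int v = 1; v <= N; v++)    /* closest unfinished vertex */
            if (!done[v] && (done[u] || dist[v] < dist[u])) u = v;
        if (done[u] || dist[u] == DBL_MAX) break;
        done[u] = 1;
        for (int v = u + 1; v <= N; v++)        /* forward edges only */
            if (!done[v] && dist[u] + D[u][v] < dist[v]) {
                dist[v] = dist[u] + D[u][v];
                via[v] = u;
            }
    }
}
```

Because all edges point forward, a single left-to-right dynamic-programming sweep would also do; Dijkstra is used here simply because the text suggests it.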

You can do the prac' in C, C++, Java or JavaScript (the last would be slow but can draw on the existing demonstration in the normal-distribution page and it would be very nice to have as a new web demo').

Test your final program in a reasonably thorough way. e.g. If one generates some data (step 1) and analyses it, there are three sets of figures: (i) the true #segments, cut-points, means and sd's, (ii) the sample means and sd's, and (iii) the inferred #segments, cut-points and means and sd's.
How often is your program correct in #segments? Is it biased one way or the other? How is it influenced by the number of data points and the number of segments? What about the overlap between values in neighbouring segments, e.g. if the SDs' range is not small compared to the means' range? How close are the inferred cut-points to true ones? How do the various differences true:inferred compare with true:sample?

Note that stating the cut points with full accuracy is almost certainly over-fitting! e.g. consider, ..., 1.0, 1.1, 0.9, 1.0, 1.5, 2.1, 1.9, 2.0, 2.0, ... If there is a segment boundary here, on which side of the 1.5 does it fall? One really cannot say. Using less accuracy for cut points would save some cost in the hypothesis. However, you do not have to consider this extra sophistication!

The above segmentation problem amounts to approximating a function by a piece-wise constant function. It is closely related to approximation by piece-wise linear functions, and to polygon fitting etc. One could also contemplate other distributions on cut-points and/or segment lengths, other distributions within segments, multi-variate data, etc. Murli & co. have examined segmenting binary-sequences e.g. 011000001010001111011011111000100000110000100. Multi-state sequences might be of interest in biology.



* RMS - root mean squared differences from mean within a segment.

D. Dowe & L. Allison © 1999, Comp Sci and SWE, Monash University