CSE454 Prac Notes 2004


The correct spellings are Gauss, Gaussian and Poisson.
For each experiment, graph, table, etc., it is important to define what data set was used, and what attributes were collected and used -- so that someone else can replicate exactly what you did.
On IP addresses, the full address usually identifies a single session by a single individual. Truncating the address to its first k components (k=3 or thereabouts, say) will often identify an organisation, e.g. au edu monash, although Canadian universities seem to be just ca institution. I would be surprised if k<3 were very interesting.
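
The truncation might be sketched as below, assuming the addresses are available as dotted host names (not numeric addresses) and that components are counted from the most significant end, as in the "au edu monash" example. The function name truncate_host and the host names are hypothetical, for illustration only.

```python
def truncate_host(host, k):
    """Keep the k most significant components of a dotted host name.

    Assumes 'host' is a resolved name such as 'pc42.csse.monash.edu.au';
    components are returned most significant first, space separated.
    """
    labels = host.lower().strip(".").split(".")
    labels.reverse()              # country code (or top-level domain) first
    return " ".join(labels[:k])


# Hypothetical host names, for illustration only.
print(truncate_host("pc42.csse.monash.edu.au", 3))   # au edu monash
print(truncate_host("www.someuni.ca", 2))            # ca someuni
```

The truncated string can then be used as a single discrete attribute in place of the full address.
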
I believe that a .de CS department uses λ a lot at one time of year, and that a us CS department may use prolog a lot at another time of year.
Are λ or prolog users the more thoughtful?-)
Unfortunately, message lengths are not directly comparable across different runs using different data transformations and/or different distributions. In theory they should be, but, e.g., the measurement accuracy of the data is also transformed, and the program does not account for this. (It is an awkward software engineering problem.)
If you did manage to run a model on the "other" data set, KL distance ~ (increase in message length) / (number of data); you need to take the second-part message length, i.e. that of the data given the model.
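
As a rough sketch of that arithmetic, assuming you have the second-part message length of one data set under its own model and under the "other" model, both in the same units: the function name and the figures below are made up for illustration, not output from any program.

```python
def kl_estimate(len_own_model, len_other_model, n_data):
    """Rough KL distance per datum.

    len_own_model   -- 2nd-part message length of the data under its own model
    len_other_model -- 2nd-part message length of the same data under the
                       "other" model
    n_data          -- number of data items in the data set
    The result is in the units of the message lengths (bits or nits).
    """
    if n_data <= 0:
        raise ValueError("n_data must be positive")
    return (len_other_model - len_own_model) / n_data


# Made-up figures: 1200 data items, 5000 units under the data set's own
# model, 5600 units under the other model.
print(kl_estimate(5000.0, 5600.0, 1200))   # 0.5
```

The estimate is only as good as the assumption that the data set's own model is close to the true source of the data.
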

Be aware that Snob and C5 solve very different problems, despite the small edit distance of "unsupervised classification" and "supervised classification". In particular a classification tree's "input" attributes are "common knowledge", and there is only a slight penalty for including extra input attributes (slightly more expensive to identify which is being tested). A classification tree only tries to "compress" the output attribute(s).
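
To see why the penalty for extra input attributes is slight, here is a minimal sketch under a simplifying assumption: each test node names its attribute with a uniform code over the n input attributes, costing log2(n) bits. Real tree codings differ in detail; the function name is hypothetical.

```python
import math


def attribute_choice_bits(n_attributes):
    """Bits to say which of n_attributes a node tests (uniform code assumed)."""
    return math.log2(n_attributes)


# Doubling the number of input attributes from 8 to 16 adds only one bit
# per test node under this assumption.
print(attribute_choice_bits(8))    # 3.0
print(attribute_choice_bits(16))   # 4.0
```

The input attribute values themselves are never encoded, which is why the cost grows only with the number of test nodes and the logarithm of the number of attributes.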
