CSE454 Prac Notes 2004

#1

The correct forms are Gauss, Gaussian and Poisson.
For each experiment, graph, table, etc., it is important to define what data set was used, and what attributes were collected and used -- so that someone else can replicate exactly what you did.
On IP addresses, the full address usually identifies a single session by a single individual. Truncating the address (to k=3± components, say) will often identify an organisation, e.g. au edu monash, although Canadian universities seem to be just ca institution. I would be surprised if k<3 were very interesting.
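The truncation can be sketched in a few lines of Python. This is only an illustration: it assumes the addresses have already been resolved to dotted host names, the function name truncate_host is hypothetical, and the host names in the usage lines are made up.

```python
from collections import Counter


def truncate_host(host: str, k: int = 3) -> str:
    """Keep the k most significant labels of a dotted host name.

    Labels are returned most significant first, e.g. "au edu monash".
    Assumes `host` is a resolved name, not a numeric address.
    """
    labels = host.lower().strip(".").split(".")
    labels.reverse()              # country or top-level label comes first
    return " ".join(labels[:k])   # a short name simply yields fewer labels


# Made-up host names, for illustration only.
hosts = ["pc12.csse.monash.edu.au", "pc40.csse.monash.edu.au", "www.ubc.ca"]
per_organisation = Counter(truncate_host(h, 3) for h in hosts)
```

Here per_organisation counts sessions per truncated name, so the two Monash hosts fall under the one key "au edu monash". Whether the truncated name is then used as an attribute, or only to group the data, is a modelling decision that should be stated alongside the results.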
I believe that a .de CS department uses λ a lot at one time of year, and that a us CS department may use prolog a lot at another time of year.
Are λ or prolog users the more thoughtful?-)
Unfortunately, message lengths are not directly comparable across different runs using different data transformations and/or different distributions. "In theory" they should be, but transforming the data also transforms, e.g., its measurement accuracy, and the program does not account for this. (It is an awkward s/w engineering problem.)
If you did manage to run a model on the "other" data set, KL distance ~ (increase in message length) / #data; you need to take the 2nd-part message length, i.e. that of the data given the model.
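A small worked example of that estimate, under stated assumptions: the model is a single Gaussian, the 2nd-part message length is taken as the negative log-likelihood in nits, and the measurement-accuracy term is left out because it is the same for both models on the same data and cancels in the difference. The function names are hypothetical and the data are made up.

```python
import math


def second_part_length(data, mu, sigma):
    """Negative log-likelihood (nits) of data under N(mu, sigma^2).

    The constant measurement-accuracy term is omitted (an assumption).
    """
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2)
        + (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def kl_estimate(len_own, len_other, n_data):
    """Increase in 2nd-part message length per datum."""
    if n_data <= 0:
        raise ValueError("need at least one datum")
    return (len_other - len_own) / n_data


data = [-1.0, 1.0] * 50                          # mean 0, variance 1
own = second_part_length(data, 0.0, 1.0)         # model fitted to this data
other = second_part_length(data, 1.0, 1.0)       # model from the "other" data set
kl = kl_estimate(own, other, len(data))
```

For these numbers kl comes out as 0.5 nits per datum, which agrees with the closed-form KL distance between N(0,1) and N(1,1), namely (difference in means)^2 / 2.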

#2

Be aware that Snob and C5 solve very different problems, despite the small edit distance between "unsupervised classification" and "supervised classification". In particular, a classification tree's "input" attributes are "common knowledge", and there is only a slight penalty for including extra input attributes (it becomes slightly more expensive to identify which one is being tested). A classification tree only tries to "compress" the output attribute(s).
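To get a feel for how slight that penalty is, here is a rough sketch. It assumes that the attribute tested at a split is named with a uniform code over the input attributes, which is my simplification for illustration and not necessarily how any particular tree program encodes it; the function names are hypothetical.

```python
import math


def attribute_choice_cost(n_inputs: int) -> float:
    """Bits to name one of n_inputs attributes under a uniform code."""
    if n_inputs < 1:
        raise ValueError("need at least one input attribute")
    return math.log2(n_inputs)


def extra_cost_per_split(n_inputs: int, n_extra: int) -> float:
    """Extra bits per split node after adding n_extra input attributes."""
    return attribute_choice_cost(n_inputs + n_extra) - attribute_choice_cost(n_inputs)


penalty = extra_cost_per_split(8, 8)   # going from 8 to 16 input attributes
```

Under this assumption, doubling the number of input attributes from 8 to 16 costs one extra bit per split node, and nothing at all is spent on encoding the input attributes' values.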


© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.