CSE454 Prac Notes 2004
#1
- The spellings are Gauss, Gaussian and Poisson.
- For each experiment, graph, table, etc., it is important to define
what data set was used, and what attributes were collected and used --
so that someone else can replicate exactly what you did.
- On IP addresses,
the full address usually identifies a single session by
a single individual.
Truncating the address (to about k=3 components, say) will often identify
an organisation, e.g. au edu monash,
although Canadian universities
seem to be just ca institution.
I would be surprised if k<3 were very interesting.
- I believe that
a .de CS department uses λ
a lot at one time of year, and that
a US CS department may use Prolog
a lot at another time of year.
- Are λ or Prolog users the more thoughtful?-)
- Unfortunately, message lengths are not directly comparable across
different runs using different data transformations and/or
different distributions.
In theory they should be, but, e.g., the data measurement accuracy
is also transformed and the program does not account for this.
(It is an awkward software engineering problem.)
- If you did manage to run a model on the "other" data set,
KL distance ~ (increase in message length) / (number of data);
you need to take the 2nd-part message length,
i.e. that of the data given the model.
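
Two of the points above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not part of Snob
or of any prac software: the function names are made up, addresses are
assumed to be stored as space-separated components with the most
significant first (as in "au edu monash"), and the message lengths are
assumed to be 2nd-part lengths that you have read off the program output
yourself, in whatever units (bits or nits) it reports.

```python
def truncate_address(address, k=3):
    """Keep the first k components of an address.

    Assumes components are space-separated, most significant first,
    e.g. "au edu monash csse www" (hypothetical example value).
    """
    parts = address.split()
    return " ".join(parts[:k])


def kl_estimate(own_part2_len, other_part2_len, n_data):
    """Rough KL distance per datum.

    own_part2_len   : 2nd-part message length of the data under the
                      model that was fitted to that data
    other_part2_len : 2nd-part message length of the same data under
                      the model fitted to the "other" data set
    n_data          : number of data items encoded
    The result is in the same units as the message lengths.
    """
    if n_data <= 0:
        raise ValueError("n_data must be positive")
    return (other_part2_len - own_part2_len) / n_data


if __name__ == "__main__":
    print(truncate_address("au edu monash csse www", k=3))
    # Invented numbers, purely for illustration.
    print(kl_estimate(5000.0, 5400.0, 1000))
```

The estimate is only meaningful if both message lengths come from runs
that used the same data transformations and measurement accuracy,
for the reasons given in the previous point.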
#2
- Be aware that Snob and C5 solve very different problems,
despite the small edit distance of
"unsupervised classification" and
"supervised classification".
In particular a classification tree's "input" attributes are
"common knowledge",
and there is only a slight penalty for including extra input attributes
(it is slightly more expensive to state which attribute is being tested).
A classification tree only tries to "compress"
the output attribute(s).
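
The point that only the output attribute is "compressed" can be
illustrated with a small sketch. It is not how C5 (or any particular
program) works: it is a simplified cost that charges each leaf for its
output labels under that leaf's own label frequencies, and it ignores the
cost of stating the tree and the frequencies. The function names and the
example labels are hypothetical. Note that the input attribute values
never appear in the calculation at all.

```python
import math
from collections import Counter


def leaf_label_cost(labels):
    """Bits needed to state the output labels at one leaf, using the
    leaf's observed label frequencies (simplified: the cost of stating
    the frequencies themselves is ignored)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum(c * math.log2(c / n) for c in counts.values())


def tree_data_cost(leaves):
    """Total cost of the output attribute over all leaves.

    `leaves` is a list of lists of output labels, one list per leaf.
    Input attributes only decide which leaf a datum falls into;
    their values are not encoded.
    """
    return sum(leaf_label_cost(labels) for labels in leaves)


if __name__ == "__main__":
    mixed_leaf = ["y", "y", "n", "n"]   # evenly split: 1 bit per label
    pure_leaf = ["y"] * 5               # pure: labels cost nothing
    print(tree_data_cost([mixed_leaf, pure_leaf]))
```

Snob, by contrast, has to encode every attribute of every datum, which is
one way to see that the two programs solve different problems.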
© L. Allison,
School of Computer Science and Software Engineering,
Monash University, Australia 3800.