
Prac' 2 CSE454 CSSE Monash Semester 1, 2004

Due (CSSE office) noon, Thursday, week 12, 27 May 2004.

Supervised Classification: This prac' involves using the [C5 (on nexus)] classification-tree (decision-tree) program; see the [tutorial].

  1. This first part is for you to become familiar with using C5 on a ``tame'' data set. Use C5 to analyse one of the data sets in the .../c5/Data/ directory as indicated here: In one page, draw the tree (nicely!), or the top levels if the whole tree is too big, and, referring to your diagram, describe what the tree means for the data set.

  2. (a) Use your own judgment* to choose your ``best'' classification of the cgi-bin data from [prac_1] (e.g. considering `k', attribute selection, distributions used, etc.).
    Take the most probable class, `C', output by Snob for each observation as the attribute to be predicted by C5.
    (If your answer to prac 1 ``didn't go too well'', you should probably ``improve'' it now.)

    (b) Use C5 to form a classification-tree to predict `C' from the other attributes.
    If `C' has ``too many'' values* (that is, too high an arity), you might need to reduce the number of values, i.e. force Snob to produce fewer classes, e.g. by stopping it after fewer adjust cycles than normal, or by merging some classes.

    (c) Draw the tree (or the top levels if large) and write a short(!) report describing what the tree means for the data set.

  3. Use the ``domain-type'', i.e. at least {commercial, educational, other} but possibly divided more finely, as the attribute to be predicted, from other attributes, by C5. Use C5 to form a classification-tree to predict the domain-type. Compare its performance with that of Snob, prac_1, Q4 (it is not really a fair race because Snob does not claim to do supervised classification).
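
For part 2, the step of taking the most probable class for each observation, and one possible way of reducing the arity of `C' by merging, might be sketched as follows. This is only an illustration under stated assumptions: the membership probabilities are made up, the function names are hypothetical, Snob's actual output format is not shown here, and folding the least frequent classes into a single catch-all value is just one way of ``merging some classes''.

```python
from collections import Counter

def most_probable_class(memberships):
    """For each observation, return the index of its largest class probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

def merge_small_classes(labels, max_classes, other="other"):
    """Keep the (max_classes - 1) most frequent labels; relabel the rest as `other`.

    Assumption: rare classes are the ones worth merging. Merging classes
    that are similar to each other may be more meaningful in practice.
    """
    keep = {label for label, _ in Counter(labels).most_common(max_classes - 1)}
    return [label if label in keep else other for label in labels]

# Hypothetical class-membership probabilities: four observations, three classes.
memberships = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

labels = most_probable_class(memberships)
reduced = merge_small_classes(labels, max_classes=2)
print(labels)
print(reduced)
```

The resulting list would then be attached to each observation as the attribute `C' before the data are passed to C5.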

Write a short report (2-6 pages of text not counting any graphs or diagrams), possibly with graphs and/or diagrams (it must be clear but need not be an artwork, pen and pencil are usually clearer than Excel), to summarize and to explain your findings.
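
When summarizing how well a classifier predicts the domain-type, a small table of (actual, predicted) counts and an overall accuracy figure can be useful. The following is a minimal sketch, assuming the actual and predicted domain-types are available as two parallel lists; the data values and function names are invented for illustration, and the sketch says nothing about whether the figures should come from the training data or from held-out cases.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of observations whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no observations given")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical domain-types for four observations.
actual = ["commercial", "educational", "other", "commercial"]
predicted = ["commercial", "other", "other", "commercial"]

table = confusion_counts(actual, predicted)
acc = accuracy(actual, predicted)

for (a, p), n in sorted(table.items()):
    print(f"actual={a:12s} predicted={p:12s} count={n}")
print(f"accuracy = {acc:.2f}")
```

The same two functions could be applied to any labelling of the observations, provided each predicted label has first been expressed as a domain-type.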

[1: 5-marks;   2: 12-marks;   3: 3-marks;   total: 20;   this may be varied, depending on what you find.]



© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.