
Re Prac' 1 CSE454 CSSE Monash Semester 1, 12/5/2003

[Snob][3/2003], [*] & [Snob][30/4/2003].

Not all of the following necessarily applies directly to prac-1.

{5/2003: I have just checked the honspc's and there is [i.e. was] a version of snob installed under /usr/local/bin/snob. [-J.M.]}

  1. Be specific so that the reader can repeat exactly what you did. (This applies to honours reports too.) If something is ``frequent'', say how frequent (e.g. 60%?). If something is ``large'' say how large (e.g. h=2.08m). Give values of variables (e.g. K=3) and parameters (e.g. μ=3.6), state what attributes were active, when, what you did with missing or invalid values if any, etc..
  2. Using a plotting package is often a hazard to clarity; perhaps it should be banned in favour of paper and pencil. Particularly:
    1. No one can identify 8 different colours in grey-scale. :-)
    2. If almost all the data points are near the y-axis, try a log-plot for x. If almost all the points are near the x-axis, try a log plot for y. If almost all the points are near the origin, try a log-log plot.
    3. Classes can be drawn (in 2D) by plotting the centre (μx, μy) with error bars +/-σx and +/-σy (or an ellipse, or a rectangle, or...). This shows position, variability and overlap.
  3. Like many clustering programs, (basic) Snob assumes that attributes are independent, that is uncorrelated. If attributes are correlated (remember the rgb pixel-values example) then the program will do the best it can to describe the data, i.e. it will place several classes along the hidden ``factor'' (brightness for rgb) that explains the correlation. A simple data transformation will sometimes (not always) remove such correlation: E.g. we might reasonably expect prac-1's @1 (# 1-use sessions) and @2 (# multi-use sessions) to be correlated, as many did discover. If @1~@2 then transforming (@1,@2) ---> (@1+@2,@1-@2), or similar, possibly weighted, may remove the correlation.
  4. If a continuous attribute has a ``long tail'', e.g. 1.4, 3.2, 7.3, 53.1, 17.2, 142.6, 11.0, 1.1, 0.6, 1.0, 90.8, 25.7, 2.7, it may be better to transform it with the log function and use the normal distribution to model the transformed attribute data.
  5. Long tails and correlation may even combine -- try log(@i)+log(@j) and log(@i)-log(@j) in this case. (Note that (log(x)+log(y))/2 is the log of the geometric mean of x and y.)
  6. There seems to be a temptation to use the normal distribution to model everything, but the Poisson can be appropriate for ``counts'', frequencies, etc..
  7. If N(μ,σ) is used on an integer attribute it might work well, but beware of setting the `measurement accuracy' too high, e.g. 0.1 or even 0.01. The program may say: Oh look at this, lots of values (peaks) at 1.0, ditto at 2.0 and 3.0, and nothing in between! There must be classes N(1.0, 0.01), N(2.0, 0.01), N(3.0, 0.01) etc.. Better to try a measurement accuracy of +/-0.5 or even +/-1.0 if you must use N(μ,σ) on such data.
  8. People often state too high a measurement accuracy for continuous attributes. This can cause artifacts, i.e. false classes, in the results. More ``accurate'' sounds ``better'' somehow, but is it realistic?
  9. What to do about 0/0's for @3, @4, @5 (mean uses, multi-use sessions), i.e. when there are zero multi-use sessions? One could represent such cases by some default value, e.g. 0?, 1?, -1?, but that might skew the ``real'' results. Or one could represent such cases as ``missing values''.
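The sum/difference transformation of points 3 and 5 can be sketched as follows. (Function and variable names are illustrative, not part of Snob; note that (log(x)+log(y))/2 is the log of the geometric mean of x and y.)

```python
import math

def decorrelate(a1, a2):
    """Rotate two correlated attributes onto sum/difference axes,
    as in (@1,@2) ---> (@1+@2, @1-@2)."""
    return (a1 + a2, a1 - a2)

def log_decorrelate(a1, a2):
    """For long-tailed, correlated counts (point 5): take logs first,
    then form log(@i)+log(@j) and log(@i)-log(@j)."""
    lx, ly = math.log(a1), math.log(a2)
    return (lx + ly, lx - ly)
```

The transformed attributes would then be given to the clustering program in place of the originals, possibly after re-weighting.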
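Point 6 can be tried on toy data: fit both a Poisson and a (unit-accuracy) normal to a small set of counts and compare log-likelihoods. This is only an illustrative sketch of why a count distribution can fit counts better; it is not Snob's actual message-length criterion, and the data are made up.

```python
import math
from statistics import mean, pstdev

counts = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]  # illustrative "count" data

def poisson_loglik(data):
    """Log-likelihood of counts under Poisson(lambda = sample mean)."""
    lam = mean(data)
    return sum(k * math.log(lam) - lam - math.log(math.factorial(k))
               for k in data)

def normal_loglik(data, acc=1.0):
    """Approximate log-likelihood under N(mu, sd), each value recorded
    to measurement accuracy `acc` (probability ~ density * acc)."""
    mu, sd = mean(data), pstdev(data)
    return sum(math.log(acc)
               - 0.5 * math.log(2 * math.pi * sd * sd)
               - (x - mu) ** 2 / (2 * sd * sd)
               for x in data)
```

On this small example the Poisson gives the higher log-likelihood, as one might hope for count data.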
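Points 7 and 8 can be made concrete with a rough sketch of how stated measurement accuracy enters the cost of coding a value: a value recorded to within about +/- acc/2 costs roughly -log(density × acc). The function below is an assumption-laden illustration, not Snob's actual computation.

```python
import math

def code_length(x, mu, sd, acc):
    """Approximate code length (in nats) of one value, recorded to
    measurement accuracy `acc`, under the model N(mu, sd)."""
    density = (math.exp(-(x - mu) ** 2 / (2 * sd * sd))
               / (sd * math.sqrt(2 * math.pi)))
    return -math.log(density * acc)
```

With an over-stated accuracy of 0.01, a spuriously tight class N(1.0, 0.01) codes the integer value 1.0 more cheaply than a sensible broad class does at accuracy 0.5 — which is exactly how the false "peak" classes of point 7 pay for themselves.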
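For point 9, one might pre-process the data as follows, using a missing-value marker rather than an arbitrary default. (A sketch only; how a missing value is actually flagged depends on the program's input format.)

```python
def mean_multi_uses(total_multi_uses, num_multi_sessions):
    """Mean uses per multi-use session; None (missing) if there were
    no multi-use sessions, avoiding a skewing default like 0 or -1."""
    if num_multi_sessions == 0:
        return None
    return total_multi_uses / num_multi_sessions
```

A default such as 0 would pull the class means towards 0 for users with no multi-use sessions, whereas a missing value lets the program ignore the attribute for those cases.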

    5/2003 © L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.