CSE454 Prac 1, 2004

Due: noon, Thursday week 8 (29 April), at the CSSE general office.
This is an exercise in unsupervised classification (mixture modelling, clustering), using the MML-based program (either vanilla- or Fortran-) [Snob][*].
In [../cgi/2003-lambda] and [../cgi/2003-prolog] are records of the use of two cgi-bin programs (lambda and prolog.toy) by www clients.
A session is a sequence of two or more uses of lambda (or a sequence of two or more uses of prolog.toy) from the same client address, where the time interval between two successive uses of the program is no more than `limit' seconds e.g. limit=3600.
A single use, isolated by more than the `limit', is not a session.
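
The session definition above can be sketched as follows. This is only a minimal illustration, not a required design: it assumes the log records have already been parsed into hypothetical (address, time-in-seconds) pairs, since the record format of the data files is not described here, and the function name `find_sessions` is made up for the example.

```python
from collections import defaultdict

def find_sessions(records, limit=3600):
    """Group (address, time_in_seconds) pairs into sessions.

    A session is a run of two or more uses from the same address in which
    successive uses are no more than `limit` seconds apart.
    The input format is an assumption made for this sketch.
    """
    by_address = defaultdict(list)
    for address, t in records:
        by_address[address].append(t)

    sessions = []
    for address, times in by_address.items():
        times.sort()
        run = [times[0]]
        for t in times[1:]:
            if t - run[-1] <= limit:
                run.append(t)            # still within the same session
            else:
                if len(run) >= 2:        # an isolated single use is not a session
                    sessions.append((address, run))
                run = [t]
        if len(run) >= 2:
            sessions.append((address, run))
    return sessions
```

For example, uses by one address at times 0, 100, 5000 and 5100 with limit=3600 give two sessions, because the gap of 4900 seconds exceeds the limit.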
Note that a session might have been controlled by one person (or a 'bot).
Some client addresses are numeric (IP numbers), e.g. 202.185.108.225 but most are symbolic, e.g. chripso.static.otenet.gr (note that, e.g. 217-126-189-201.uc.nombres.ttd.es is symbolic).
Symbolic addresses which end in .biz, .com, .edu, .gov, .inf, .mil, .net, .org (¿any others?), are considered to have an implicit country .xx added at the end, e.g. pd953adbd.dip.t-dialin.net --> pd953adbd.dip.t-dialin.net.xx.
Symbolic addresses should be reversed, read from the right, e.g. xx.net.t-dialin.dip.pd953adbd. Numeric addresses (IP numbers) are not to be reversed.
Sometimes we will consider addresses to be truncated after at most `k' parts, e.g. k=3, xx.net.t-dialin, which affects the notion of "same client address" in the definition of a session.
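
One possible way to implement the address rules above is sketched below. The function name `transform_address` and the regular expression for IP numbers are assumptions made for illustration; the list of endings is the one given above, and whether numeric addresses should also be truncated to `k` parts is not stated, so this sketch leaves them untouched.

```python
import re

# Endings that get an implicit country ".xx" (list taken from the handout).
GENERIC_ENDINGS = {"biz", "com", "edu", "gov", "inf", "mil", "net", "org"}

# Assumption: a numeric address is four dot-separated groups of digits.
IP_NUMBER = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

def transform_address(address, k=None):
    """Reverse a symbolic address, adding an implicit .xx where needed.

    If k is given, keep at most the first k parts of the reversed address.
    Numeric addresses are returned unchanged (not reversed, not truncated).
    """
    if IP_NUMBER.fullmatch(address):
        return address
    parts = address.split(".")
    if parts[-1].lower() in GENERIC_ENDINGS:
        parts.append("xx")               # implicit country
    parts.reverse()                      # read from the right
    if k is not None:
        parts = parts[:k]
    return ".".join(parts)
```

With this sketch, `transform_address("pd953adbd.dip.t-dialin.net", k=3)` gives `xx.net.t-dialin`, matching the example above.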
Write some simple program(s) (e.g. use shell-scripts, C, C++, Java, or Perl) to gather the following kinds of attribute values for each session: the number of uses of lambda (or of prolog.toy), the duration (seconds), the mean number of seconds between uses in the session, some property(s) of the address, and possibly other attributes. Make the program(s) flexible -- you may need to change what you collect.
You might find it convenient to transform the client addresses, as above, and then sort the data on the transformed addresses (and within that on time, as is).
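
As a rough sketch of the per-session attributes, assuming a session is available as a list of use times in seconds (a hypothetical representation; the name `session_attributes` is also made up here), the mean gap is taken as the duration divided by the number of gaps between successive uses:

```python
def session_attributes(times):
    """Return basic attributes of one session given its use times (seconds)."""
    times = sorted(times)
    n_uses = len(times)
    if n_uses < 2:
        raise ValueError("a session needs at least two uses")
    duration = times[-1] - times[0]      # seconds from first to last use
    mean_gap = duration / (n_uses - 1)   # mean seconds between successive uses
    return {"uses": n_uses, "duration": duration, "mean_gap": mean_gap}
```

Returning a dictionary keeps it easy to add or drop attributes later, which fits the advice to keep the program(s) flexible.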
Now, you must perform at least the following analyses using Snob. We want to see if there are different kinds of client behaviour. Classify data derived from 2003-lambda, where "thing" = "session" (varying k=2,3,4,...? Note that k affects the address, which in turn affects the session), and the input attributes are (a) the number of uses (>=2) in the session and (b) the mean number of seconds between uses in the session. For "count" attributes, try the Poisson distribution and the normal distribution (with data accuracy >= ±0.5), and possibly the normal distribution on the log of the attribute. Process 2003-prolog in a similar way. Try using different (combinations of) attributes of your own choosing to find the most interesting result(s).
Is there any interesting way to compare the mixtures inferred for the two data sets? Can one mixture be used on the other data set? Can you get KL-distances for the two mixtures?
The data give us some very rough and ready "supervised" classes based on the address controlling a session, e.g. educational = {*.edu. ..., *.ac. ...}, commercial = {*.com. ..., *.co. ...}, other. Perhaps the divisions can be finer? Is there any genuine correlation between the classes that Snob found and these "supervised" classes? In MML (compression) terms, do the latter behave "much the same" across Snob's classes (apart from random variation), or not? Can the Snob classification of a session be used to predict the "supervised" class, or vice versa, better than chance? Obviously Snob should not be given the "supervised" class in this case!
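
A first look at the relationship between Snob's classes and the "supervised" classes could start from a simple table of counts. The sketch below is an illustration under several assumptions: addresses have already been reversed so that the country part comes first, numeric addresses fall into "other", and the Snob class of each session is supplied as a plain list of labels (how those labels are extracted from Snob's output is not covered here). Both function names are hypothetical.

```python
from collections import Counter

def supervised_class(reversed_address):
    """Rough class from a reversed address such as 'xx.edu.mit' or 'uk.co.bbc'."""
    parts = reversed_address.split(".")
    second = parts[1] if len(parts) > 1 else ""
    if second in ("edu", "ac"):
        return "educational"
    if second in ("com", "co"):
        return "commercial"
    return "other"                       # includes numeric addresses

def cross_tabulate(snob_classes, reversed_addresses):
    """Count sessions for each (Snob class, supervised class) pair."""
    return Counter(
        (snob, supervised_class(address))
        for snob, address in zip(snob_classes, reversed_addresses)
    )
```

The resulting counts are only raw material; whether they show more than random variation still has to be argued, for instance in MML terms as asked above.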
Write a short report (2-6 pages of text not counting any graphs or diagrams), possibly with graphs and/or diagrams (it must be clear but need not be an artwork, pen and pencil are usually clearer than Excel), to summarize and to explain your findings.

[1&2: 10 marks; 3: 5 marks; 4: 5 marks; total: 20; this may be varied, depending on what you find.]
