Prac 1, CSE454, CSSE, Monash, Semester 1, 2003
Due noon Thursday, week 1, 1 May 2003, at the CSSE general office.
This is an exercise in unsupervised classification
(mixture modelling, clustering), using the MML-based program
[Snob][3/2003],
[*] &
[Snob][30/4/2003].
- In [../cgi/2002]
is a log file (about 600K, 10K lines)
of uses of three cgi-bin programs during 2002:
prolog.toy, lambda and cmprsDNA98
(incidentally, the last is MML-based,
e.g. see L.Stern, L.Allison, R.L.Coppel & T.I.Dix,
Discovering patterns in Plasmodium falciparum genomic DNA,
Molec. & Biochem. Parasitology 118(2) pp175-186, 2001).
The exercise is to perform statistical analysis of this data.
A session is one or more uses of one or more
of these programs, from the same address,
at times T = {t_1, t_2, t_3, ...}
where t_i - t_(i-1) <= maxGap for each pair of consecutive uses,
e.g. maxGap = 1 hour.
A gap longer than maxGap separates two sessions of the address.
The size, |T|, is the number of uses in the set of times T.
NB. Addresses such as *.com, *.edu, *.net, etc.
(e.g. p149.as1.adl.dublin.eircom.net)
are understood to actually be *.com.xx, *.edu.xx, *.net.xx, etc.
where `xx' stands in for the "missing" country,
e.g. p149.as1.adl.dublin.eircom.net.xx
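The session definition and the `xx' convention above can be sketched as follows. This is only an illustration in Python (the prac itself asks for shell-scripts, C, C++, Java or Perl), under some assumptions that are not part of the handout: times are taken to be numbers of seconds, the names normalise, split_sessions and GENERIC_DOMAINS are made up for the example, and the list of generic domains is just one reading of "etc.".

```python
# Assumed reading of "*.com, *.edu, *.net, etc."; adjust as needed.
GENERIC_DOMAINS = {"com", "edu", "net", "org", "gov", "mil", "int"}


def normalise(address):
    """Append the stand-in country 'xx' to an address ending in a generic domain."""
    address = address.strip().lower().rstrip(".")
    last_name = address.rsplit(".", 1)[-1]
    return address + ".xx" if last_name in GENERIC_DOMAINS else address


def split_sessions(times, max_gap=3600):
    """Split the use times (in seconds) of one address into sessions.

    Consecutive uses at most max_gap apart share a session;
    a longer gap starts a new session.
    """
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)      # within max_gap of the previous use
        else:
            sessions.append([t])        # first use, or gap longer than max_gap
    return sessions


if __name__ == "__main__":
    print(normalise("p149.as1.adl.dublin.eircom.net"))
    print([len(s) for s in split_sessions([0, 100, 3700, 10000])])
```

Keeping max_gap as a parameter, rather than a constant buried in the loop, is what makes it easy to change later.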
- Write some simple program(s)
(use shell-scripts, C, C++, Java, or Perl)
to gather the following attribute values for each address:
- 1. the number of sessions of size one,
- 2. the number of sessions of size >one,
- 3. the mean number of uses of lambda in sessions of size >one,
- 4. the mean number of uses of prolog.toy in sessions of size >one,
- 5. the mean number of uses of cmprsDNA98 in sessions of size >one.
- The program should be flexible, e.g. we may want to
truncate addresses to drop all but the `k' right-hand names
of each address.
e.g. with k=4, p149.as1.adl.dublin.eircom.net.xx becomes dublin.eircom.net.xx
- Make sure that maxGap can be changed easily.
[5 marks]
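One possible shape for such a program is sketched below, again in Python purely for illustration. It assumes the log has already been parsed into (address, time in seconds, program) records, which is a hypothetical intermediate form and not the log's actual layout; the function names truncate and attributes are likewise invented here. Reporting a mean of 0.0 for an address with no sessions of size >one is also an assumption, and treating those values as missing may be more appropriate.

```python
from collections import defaultdict

PROGRAMS = ("lambda", "prolog.toy", "cmprsDNA98")   # order of attributes 3, 4, 5


def truncate(address, k=None):
    """Keep only the k right-hand names of an address (k=None keeps them all)."""
    if k is None:
        return address
    return ".".join(address.split(".")[-k:])


def attributes(records, k=None, max_gap=3600):
    """Compute the five attributes for each (possibly truncated) address.

    records: iterable of (address, time_in_seconds, program) tuples, where
    addresses already carry any '.xx' suffix.
    Returns {address: [sessions of size one, sessions of size >one,
                       mean lambda, mean prolog.toy, mean cmprsDNA98]}.
    """
    uses = defaultdict(list)
    for address, t, program in records:
        uses[truncate(address, k)].append((t, program))

    table = {}
    for address, events in uses.items():
        events.sort()                       # order the uses by time
        sessions, previous = [], None
        for t, program in events:
            if previous is None or t - previous > max_gap:
                sessions.append([])         # gap too long: start a new session
            sessions[-1].append(program)
            previous = t
        multi = [s for s in sessions if len(s) > 1]
        # Assumption: the means are reported as 0.0 when there is no such session.
        means = [sum(s.count(p) for s in multi) / len(multi) if multi else 0.0
                 for p in PROGRAMS]
        table[address] = [len(sessions) - len(multi), len(multi)] + means
    return table
```

Note that truncation is applied before the uses are grouped, so with k=4 all hosts under dublin.eircom.net.xx are pooled and their sessions are computed over the pooled times.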
- Use Snob to cluster the addresses
using the attributes collected above.
[5 marks]
Try varying `k'.
As you consider appropriate,
do some further analysis,
e.g. (a) on different attributes, or
e.g. (b) are there any systematic differences between
.com.* and .edu.* addresses,
etc.?
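As a starting point for a comparison such as (b), the attribute table can be summarised per class of address. The sketch below is illustrative Python only; the names domain_class and compare_classes are invented for it, and it assumes a table mapping each address to its list of attribute values. It gives per-class counts and attribute means, which is a first look and not a test of whether any difference is systematic.

```python
from statistics import mean


def domain_class(address):
    """Second name from the right, e.g. 'net' for dublin.eircom.net.xx."""
    names = address.split(".")
    return names[-2] if len(names) >= 2 else None


def compare_classes(table, classes=("com", "edu")):
    """Summarise attribute values per class of address.

    table: {address: [attribute values]}.
    Returns {class: (number of addresses, [mean of each attribute])}
    for every class that has at least one address.
    """
    summary = {}
    for c in classes:
        rows = [row for address, row in table.items()
                if domain_class(address) == c]
        if rows:
            # zip(*rows) turns the rows into one column per attribute
            summary[c] = (len(rows), [mean(column) for column in zip(*rows)])
    return summary
```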
- Write a short report (2-5 pages),
possibly including graphs and/or diagrams
to summarize and to explain your findings.
[10 marks]
- Can we make money with this?-)
© L. Allison,
School of Computer Science and Software Engineering,
Monash University, Australia 3800.