
Prac' 1 CSE454 CSSE Monash Semester 1, 2003

Due noon Thursday, week 1, 1 May 2003, at the CSSE general office.

This is an exercise in unsupervised classification (mixture modelling, clustering), using the MML-based program [Snob][3/2003], [*] & [Snob][30/4/2003].

  1. In [../cgi/2002] is a log file (about 600K, 10K lines) of uses of three cgi-bin programs during 2002: prolog.toy, lambda and cmprsDNA98 (incidentally, the last is MML-based, e.g. see L.Stern, L.Allison, R.L.Coppel & T.I.Dix, Discovering patterns in Plasmodium falciparum genomic DNA, Molec. & Biochem. Parasitology 118(2) pp175-186, 2001). The exercise is to perform statistical analysis of this data.
    A session is one or more uses of one or more of these programs, from the same address, at times T = {t_1, t_2, t_3, ...} where t_i - t_{i-1} <= maxGap, e.g. maxGap = 1 hour, say. A gap longer than maxGap separates two sessions of the address.
    The size, |T|, is the number of uses in the set of times T.
    NB. Addresses such as *.com, *.edu, *.net, etc. (e.g. p149.as1.adl.dublin.eircom.net) are understood to actually be *.com.xx, *.edu.xx, *.net.xx, etc. where `xx' stands in for the "missing" country, e.g. p149.as1.adl.dublin.eircom.net.xx
    Write some simple program(s) (use shell-scripts, C, C++, Java, or Perl) to gather the following attribute values for each address:
    1. the number of sessions of size one,
    2. the number of sessions of size >one,
    3. the mean number of uses of lambda in sessions of size >one,
    4. the mean number of uses of prolog.toy in sessions of size >one,
    5. the mean number of uses of cmprsDNA98 in sessions of size >one.
    The program should be flexible, e.g. we may want to truncate addresses, keeping only the `k' right-hand names of each address; with k=4 the address above becomes dublin.eircom.net.xx
    Make sure that maxGap can be changed easily.
    [5 marks]
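
The session definition and the five attributes above can be sketched as follows. This is a minimal illustration under stated assumptions, not a model solution: Python is used only for brevity (the exercise itself lists shell-scripts, C, C++, Java or Perl), the log is assumed to have been parsed already into (address, time-in-seconds, program) records because the log format is not described here, and the names `split_sessions` and `attributes` are hypothetical. The means are reported as 0.0 for an address with no session of size >one; that is also an assumption, and treating them as missing values may be more appropriate.

```python
from collections import defaultdict

# Order matches attributes 3, 4 and 5 in the list above.
PROGRAMS = ("lambda", "prolog.toy", "cmprsDNA98")


def split_sessions(uses, max_gap):
    """Split one address's (time, program) uses into sessions.

    A use joins the current session if it follows the previous use by at
    most max_gap; a longer gap starts a new session.
    """
    sessions = []
    for t, prog in sorted(uses):
        if sessions and t - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((t, prog))
        else:
            sessions.append([(t, prog)])
    return sessions


def attributes(records, max_gap=3600):
    """records: iterable of (address, time_in_seconds, program) -- assumed format.

    Returns {address: (n_size_one, n_size_gt_one,
                       mean_lambda, mean_prolog_toy, mean_cmprsDNA98)}.
    """
    by_addr = defaultdict(list)
    for addr, t, prog in records:
        by_addr[addr].append((t, prog))

    result = {}
    for addr, uses in by_addr.items():
        sessions = split_sessions(uses, max_gap)
        multi = [s for s in sessions if len(s) > 1]
        n_single = len(sessions) - len(multi)
        means = []
        for prog in PROGRAMS:
            total = sum(1 for s in multi for _, p in s if p == prog)
            # Assumption: 0.0 when the address has no session of size > one.
            means.append(total / len(multi) if multi else 0.0)
        result[addr] = (n_single, len(multi), *means)
    return result


# Made-up records, for illustration only.
example = [
    ("a.edu.xx", 0, "lambda"),
    ("a.edu.xx", 600, "lambda"),
    ("a.edu.xx", 1200, "prolog.toy"),
    ("a.edu.xx", 10000, "lambda"),
    ("b.com.xx", 50, "cmprsDNA98"),
]
for addr, row in sorted(attributes(example, max_gap=3600).items()):
    print(addr, row)
```

Keeping maxGap as a parameter (`max_gap`, in seconds here) means it can be changed without touching the session logic.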
  2. Use Snob to cluster the addresses using the attributes collected above.
    [5 marks]
    Try varying `k'. As you consider appropriate, do some further analysis, e.g. (a) use different attributes, or (b) look for systematic differences between .com.* and .edu.* addresses, etc.
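
Varying `k' and comparing .com.* with .edu.* addresses both rest on the address convention given in part 1. The sketch below shows one possible way to apply it, again in Python purely for brevity. The function names are hypothetical, and the set of generic names beyond com, edu and net is an assumption, since part 1 only says "etc.".

```python
# Generic right-hand names that get the placeholder country 'xx'.
# com, edu and net come from the exercise text; the rest are assumed.
GENERIC = {"com", "edu", "net", "org", "gov", "mil", "int"}


def normalise(address):
    """Return the address as a list of names, with 'xx' appended when the
    right-most name is generic (i.e. the country is "missing")."""
    names = address.lower().rstrip(".").split(".")
    if names[-1] in GENERIC:
        names.append("xx")
    return names


def truncate(address, k):
    """Keep only the k right-hand names of the normalised address."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return ".".join(normalise(address)[-k:])


def kind(address):
    """Second name from the right after normalisation, e.g. 'com' or 'edu';
    None if the address has fewer than two names."""
    names = normalise(address)
    return names[-2] if len(names) >= 2 else None


print(truncate("p149.as1.adl.dublin.eircom.net", 4))  # dublin.eircom.net.xx
print(kind("p149.as1.adl.dublin.eircom.net"))         # net
```

Applying `truncate` to every address before the attributes are gathered lets `k' be varied without changing the rest of the analysis, and `kind` gives a simple label for grouping addresses when comparing classes. Numeric addresses, if the log contains any, are passed through unchanged by this sketch and would need separate handling.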

  3. Write a short report (2-5 pages), possibly including graphs and/or diagrams to summarize and to explain your findings.
    [10 marks]
  4. Can we make money with this?-)


© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.