[Monash University]
School of Computer Science and Software Engineering

CSE5230: Assessment 2004


Overview

Data mining is a multidisciplinary field which brings together a wide variety of techniques from areas of research and development with longer histories: machine learning, pattern recognition, statistics, databases and visualisation. The aim is to extract knowledge from the raw information stored in large databases, in order to better describe or understand the existing data, or to predict how new data will be generated in the future. Over the course of this semester, you will work in groups on papers focusing on the theory and application of these techniques. You will also each implement one of the basic data mining techniques yourself.

The marks for this unit will be allocated as follows:

Component                                                     Weighting  Due
Individual literature survey document and tutorial sheets     15%        Week 6
A group paper on an agreed topic of approximately 5000 words  50%        Week 12
Group presentation of the paper to the class                  15%        Weeks 12 and 13
Individual implementation of a data mining algorithm          20%        Week 10

Individual Literature Survey (15%): due week 6

The literature survey is due in week 6 and should consist of a discussion of the papers read, including the problems addressed, the techniques used, and their advantages and disadvantages. This is intended to serve as individual preparation for the work on the group paper. Students must discuss at least five (preferably more) articles covering the topic of their group paper. These papers must include the set reading from the lecturer, as well as papers located by the students themselves. The majority of papers surveyed must be academic papers, published in peer-reviewed conferences or journals, not magazine articles.

Assessment
The individual literature survey will be marked as follows:
  • Understanding of techniques/algorithms (or issues) and their advantages and disadvantages: 10 marks
  • Organization and clarity: 3 marks
  • Accuracy of referencing: 2 marks

There will also be tutorial tasks related to the literature survey.

Submission
The Individual Literature Survey is due by 5:00pm on Friday 27 August. You must submit a hard copy with an assignment cover sheet, either to my office or to my mailbox on level 5 of building B. You must also submit a soft copy of the assignment, either by email to me or on disk.

Group Paper (50%): due week 12

The majority of the assessment for this unit is based on a group paper, and tasks associated with its production and presentation. Each group will prepare a research paper on a particular data mining technique and its applications. At the end of the semester, each group will present their findings to the whole class.

Students will form groups of four or five (depending on final enrolment numbers), and submit a Group Registration Form to the lecturer by the end of week 3. Forms may be submitted to the lecturer's letter box on level 5 of building B. Students having difficulty finding group members are encouraged to use the Feedback Forum on the unit website to seek others in the same situation.

Each group will be assigned a data mining technique or issue as a topic from a list provided by the lecturer. The number of groups assigned to each topic will be minimised (for most topics, there will be two groups). Assignments of topics to groups will be decided on the basis of preferences expressed via a form on the web.

One or two group members must take responsibility for researching and writing each of these parts of the paper:

  • A literature survey giving the research background for the technique and brief accounts of how and where it is applied.
  • An explanation of how the technique and the algorithms implementing it actually work, preferably with a worked example.
  • Two or more detailed case studies showing how the technique has been applied in business, industrial or scientific applications.

Students should make use of the Faculty Guide to Writing Assignments, paying particular attention to section 4, "Citations", and section 5, "Quotations and Paraphrases".

Papers are to be approximately 5,000 words. A list of allowed paper topics will be available from the unit web site. Here is a preliminary list of possible topics:

  • Association Rule Discovery
  • Back-propagation Neural Networks
  • Self-Organising Maps
  • Decision Trees
  • Clustering
  • Bayesian Networks
  • Hidden Markov Models
  • Information Filtering (e.g. for "spam" email)
  • Visualisation for Data Mining
  • Ethics and Data Mining
Your group can specify its topic preferences using a web form.

Assessment
The group research paper will be marked as follows:
  • Understanding of technique/algorithm (or issue): 20 marks
  • Case studies: 20 marks
  • Organization and clarity: 5 marks
  • Accuracy of referencing: 5 marks

Group Presentation (15%): weeks 12 and 13

The presentation will last at least 20 minutes with 5 minutes for questions. Groups should provide copies of their overheads. All group members must participate in the presentation. Depending on the final number of groups, time for presentations may be extended.

Assessment
The group presentation will be marked as follows:
  • Content: 10 marks
  • Structure: 3 marks
  • Presentation: 2 marks

Individual Implementation of a Data Mining Algorithm (20%): due week 10

Students will choose one of the fundamental data mining algorithms (e.g. k-means clustering, Naive Bayes classification, ID3 decision tree, etc.) from a list provided by the lecturer in week 2. Students are to implement this algorithm using the language and platform of their choice. Sample datasets will be provided for use during development. Assessment will consist of demonstrating the code to a tutor on a newly provided test dataset and explaining the code to the tutor. Students must demonstrate their understanding of the algorithms and data structures they have used.

You can download full details of the Algorithm Implementation Assignment.

Datasets

The following datasets can be used in testing your algorithms:

Classifiers (Naive Bayes and ID3)
The user should be able to specify one of the nominal attributes as the output attribute, i.e. the class that is to be predicted using the other attributes.
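
One way to support a user-selected output attribute is to separate it from the other attributes before learning. The following is only a sketch: the function name `split_output` and the sample rows are hypothetical, and the rows are not the contents of any of the course datasets.

```python
def split_output(names, rows, output_name):
    """Separate the chosen output attribute from the remaining attributes."""
    idx = names.index(output_name)  # raises ValueError if the name is unknown
    feature_names = names[:idx] + names[idx + 1:]
    features = [row[:idx] + row[idx + 1:] for row in rows]
    labels = [row[idx] for row in rows]
    return feature_names, features, labels


# Hypothetical rows for illustration only.
names = ["outlook", "windy", "play"]
rows = [
    ["sunny", "false", "no"],
    ["rainy", "true", "no"],
    ["overcast", "false", "yes"],
]
feature_names, features, labels = split_output(names, rows, "play")
```

Because the output attribute is looked up by name, the same data can be reused with a different nominal attribute as the class.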

The Weather Data
This dataset is the small "toy" dataset used in the handouts on Naive Bayes and ID3. It should be useful for initial testing and debugging.
weather.dat ("play" is usually selected as the output attribute)
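
When debugging an ID3 implementation on a small dataset, it can help to check the attribute-selection step in isolation. ID3 chooses the attribute with the highest information gain; the sketch below computes that quantity for one attribute. The function names are illustrative, and the values in the usage lines are made up rather than taken from weather.dat.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on one nominal attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up values: the first attribute separates the classes perfectly,
# the second carries no information about the class.
labels = ["yes", "yes", "no", "no"]
gain_informative = information_gain(["a", "a", "b", "b"], labels)
gain_uninformative = information_gain(["a", "b", "a", "b"], labels)
```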

The Mushroom Data
This dataset represents information about American mushrooms. The task is to classify mushrooms as either poisonous or edible. These datasets are derived from the one available at the mushroom directory at the UCI Machine Learning Repository. To see what the attribute values actually mean, see this information file.

To simplify the task, I have removed data instances with missing values, and split the dataset into two randomly selected subsets. One can be used for learning the classifier, the other for testing it. Note that the first attribute, "edibility", is the standard output attribute.
agaricus-lepiota-1.dat (for learning: 4644 data points)
agaricus-lepiota-2.dat (for testing: 1000 data points)
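
The learn-then-test workflow for two such subsets might look like the following sketch of a Naive Bayes classifier for nominal attributes. It assumes the class labels have already been separated from the other attributes, uses add-one (Laplace) smoothing as one possible way to handle unseen values, and runs on made-up rows rather than the actual mushroom files.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(features, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    distinct = defaultdict(set)          # attribute index -> values seen in training
    for row, y in zip(features, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            distinct[i].add(v)
    return class_counts, value_counts, distinct


def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for y, n in class_counts.items():
        score = log(n / total)
        for i, v in enumerate(row):
            # Add-one smoothing keeps unseen values from giving log(0).
            score += log((value_counts[(i, y)][v] + 1) / (n + len(distinct[i])))
        if score > best_score:
            best_class, best_score = y, score
    return best_class


def accuracy(model, features, labels):
    """Fraction of rows whose predicted class matches the true class."""
    correct = sum(predict(model, row) == y for row, y in zip(features, labels))
    return correct / len(labels)


# Made-up learning and testing rows (hypothetical attribute values).
train_x = [["flat", "none"], ["convex", "none"], ["flat", "foul"], ["convex", "foul"]]
train_y = ["e", "e", "p", "p"]
test_x = [["flat", "none"], ["convex", "foul"]]
test_y = ["e", "p"]

model = train_naive_bayes(train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
```

Keeping the testing rows out of `train_naive_bayes` is what makes the reported accuracy an estimate of performance on unseen data.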

Clustering (k-means)

Three Clusters Data
This dataset has only two numerical attributes. The data was generated to have three clusters, centred around (0.75, 0.75), (-0.6, 0.6) and (0, -0.5). There is also some uniformly distributed "noise" data. It should be useful for initial testing and debugging.
three-clusters.dat (1000 data points)

It turns out that the 40% noise in this data meant that the "true" cluster centres could not be found by k-means. Consequently I have created a version with only 10% noise, which works better:
three-clusters-lownoise.dat (1000 data points)
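
For orientation, here is a minimal sketch of k-means (Lloyd's algorithm) on two-dimensional points. The cluster centres in the usage lines are the ones quoted above, but the twelve points are constructed by hand and contain no noise, so they are not the contents of either data file. The optional `init` parameter is an assumption added here to make runs repeatable; by default the initial centres are sampled from the data.

```python
import random


def kmeans(points, k, max_iter=100, seed=0, init=None):
    """Cluster points into k groups; returns (centres, assignment)."""
    rng = random.Random(seed)
    centres = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centres are stable
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment


# Hand-made, noise-free points placed symmetrically around each centre.
true_centres = [(0.75, 0.75), (-0.6, 0.6), (0.0, -0.5)]
offsets = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
points = [(cx + dx, cy + dy) for cx, cy in true_centres for dx, dy in offsets]

centres, assignment = kmeans(points, 3, init=[points[0], points[4], points[8]])
```

With randomly sampled initial centres, k-means can converge to different solutions on different runs, so it is worth trying several seeds when testing on the real data.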

Bank Data
This dataset is adapted from one at DePaul University. It has a mixture of nominal and numerical attributes, so it will need some preprocessing. The data represent customers of a bank. Your task is to do cluster analysis to see if you can identify any market segments.
bank.dat (600 data points)
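
One possible preprocessing approach for mixed data, sketched below, is to one-hot encode each nominal attribute and rescale each numerical attribute to the range 0 to 1, so that every attribute contributes on a comparable scale to a Euclidean distance. This is an assumption about what suitable preprocessing could look like, not a prescribed method, and the column layout and values are hypothetical rather than taken from bank.dat.

```python
def preprocess(rows, nominal_columns):
    """One-hot encode nominal columns and min-max scale numerical columns."""
    encoded = [[] for _ in rows]
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        if col in nominal_columns:
            categories = sorted(set(values))
            for out, v in zip(encoded, values):
                out.extend(1.0 if v == c else 0.0 for c in categories)
        else:
            numbers = [float(v) for v in values]
            lo, hi = min(numbers), max(numbers)
            span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
            for out, x in zip(encoded, numbers):
                out.append((x - lo) / span)
    return encoded


# Hypothetical rows: numerical, nominal, numerical.
rows = [
    ["30", "FEMALE", "20000"],
    ["50", "MALE", "40000"],
    ["40", "FEMALE", "30000"],
]
vectors = preprocess(rows, nominal_columns={1})
```

The resulting numeric vectors can then be passed to a k-means implementation that expects purely numerical input.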

University Policy on Cheating and Plagiarism

Students should consult the University's materials on cheating and plagiarism. It is the students' responsibility to make themselves familiar with the contents of these documents. All work submitted in this unit will be exhaustively checked for plagiarism and cheating, both manually and using automated plagiarism detection tools, such as Damocles.

By David Squire
