CSE5230: Assessment 2004
Data Mining is a multidisciplinary field which brings together a wide variety of techniques from areas of research and development with longer histories: machine learning, pattern recognition, statistics, databases and visualisation. The aim is to extract knowledge from the raw information stored in large databases, in order to better describe or understand the existing data, or to predict how new data will be generated in the future. Over the course of this semester, you will work in groups on papers focusing on the theory and application of these techniques. You will also each implement one of the basic data mining techniques yourself. The marks for this unit will be allocated as follows:
Individual Literature Survey (15%): due week 6
The literature survey should consist of a discussion of the papers read, including the problems addressed, the techniques used, and their advantages and disadvantages. It is intended to serve as individual preparation for the work on the group paper. Students must discuss at least five (preferably more) articles covering the topic of their group paper. These must include the set reading from the lecturer, as well as papers located by the students themselves. The majority of papers surveyed must be academic papers published in peer-reviewed conferences or journals, not magazine articles.
Assessment
There will also be tutorial tasks related to the literature survey.
Submission
The Individual Literature Survey is due by 5:00pm on Friday, August the 27th. You must submit a hard copy, with an assignment coversheet, to my office or to my mailbox on level 5 of building B. You are also required to submit a soft copy of the assignment, either by email to me or on disk.
Group Paper (50%): due week 12
The majority of the assessment for this unit is based on a group paper, and tasks associated with its production and presentation. Each group will prepare a research paper on a particular data mining technique and its applications. At the end of the semester, each group will present its findings to the whole class.
Students will form groups of four or five (depending on final enrolment numbers), and submit a Group Registration Form to the lecturer by the end of week 3. Forms may be submitted to the lecturer's letter box on level 5 of building B. Students having difficulty finding group members are encouraged to use the Feedback Forum on the unit website to seek others in the same situation.
Each group will be assigned a data mining technique or issue as a topic from a list provided by the lecturer. The number of groups assigned to each topic will be minimised (for most topics, there will be two groups). Assignments of topics to groups will be decided on the basis of preferences expressed via a form on the web.
One or two group members must take responsibility for researching and writing each of these parts of the paper:
Students should make use of the Faculty Guide to Writing Assignments, paying particular attention to section 4, "Citations", and section 5, "Quotations and Paraphrases". Papers are to be approximately 5,000 words. A list of allowed paper topics will be available from the unit web site. Here is a preliminary list of possible topics:
Assessment
Group Presentation (15%): weeks 12 and 13
The presentation will last at least 20 minutes, with 5 minutes for questions. Groups should provide copies of their overheads. All group members must participate in the presentation. Depending on the final number of groups, time for presentations may be extended.
Assessment
Individual Implementation of a Data Mining Algorithm (20%): due week 10
Students will choose one of the fundamental data mining algorithms (e.g. k-means clustering, Naive Bayes classification, ID3 decision tree, etc.) from a list provided by the lecturer in week 2. Students are to implement this algorithm using the language and platform of their choice. Sample datasets will be provided to students for use when developing their algorithms. Assessment will be via the demonstration of their code to a tutor using a newly provided test dataset, and the explanation of the code to the tutor. Students must demonstrate their understanding of the algorithms and data structures they have used. You can download full details of the Algorithm Implementation Assignment.
Datasets
The following datasets can be used in testing your algorithms:
Classifiers (Naive Bayes and ID3)
The user should be able to specify one of the nominal attributes as the output attribute, i.e. the class that is to be predicted using the other attributes.
The Weather Data
The Mushroom Data
To simplify the task, I have removed data instances
with missing values, and split the dataset into two randomly selected
subsets. One can be used for learning the classifier, the other for
testing it. Note that the first attribute, "edibility", is the standard
output attribute.
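The preparation described above (dropping instances with missing values, choosing an output attribute, and splitting the data into two random subsets) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `prepare`, the "?" marker for missing values, the 50/50 split, the fixed seed and the toy rows are all hypothetical, and the provided files may have been produced differently.

```python
import random

def prepare(rows, class_index=0, missing="?", train_fraction=0.5, seed=0):
    """Drop rows with missing values, shuffle, and split into two subsets.

    Returns (train, test); each is a list of (features, label) pairs, where
    label is the value of the attribute at class_index.
    All parameter defaults are assumptions made for this sketch.
    """
    # Keep only complete instances (assumes missing values are marked "?").
    complete = [list(r) for r in rows if missing not in r]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(complete)

    # Separate the chosen output attribute from the remaining attributes.
    pairs = [(r[:class_index] + r[class_index + 1:], r[class_index])
             for r in complete]

    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

# Toy rows, invented for illustration: edibility first, then two attributes.
rows = [
    ["edible", "convex", "brown"],
    ["poisonous", "flat", "white"],
    ["edible", "?", "brown"],
    ["poisonous", "convex", "white"],
    ["edible", "flat", "brown"],
]
train, test = prepare(rows)
print(len(train), "training instances,", len(test), "test instances")
```

Passing a different `class_index` selects another nominal attribute as the class to be predicted, which is the choice the user is meant to be able to make.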
Clustering (k-means)
Three Clusters Data
It turns out that the 40% noise in this data meant that the "true"
cluster centres could not be found by k-means. Consequently I have
created a version with only 10% noise, which works better:
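To experiment with the effect of noise mentioned above, a small sketch like the following may help. It is not the procedure used to build the provided files: the cluster centres, the spread, and the noise model (a fraction of points drawn uniformly over the whole area) are assumptions made purely for illustration, as are the names `kmeans` and `make_data`. The k-means routine is the plain version with randomly chosen initial centres.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means on a list of (x, y) tuples; returns k centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres: k randomly chosen data points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its members.
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                new_centres.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centres.append(centre)  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # converged: no centre moved
        centres = new_centres
    return centres

def make_data(noise_fraction, n=300, seed=1):
    """Three Gaussian clusters plus uniform background noise (assumed model)."""
    rng = random.Random(seed)
    true_centres = [(2.0, 2.0), (8.0, 2.0), (5.0, 8.0)]  # hypothetical values
    points = []
    for _ in range(n):
        if rng.random() < noise_fraction:
            points.append((rng.uniform(0, 10), rng.uniform(0, 10)))
        else:
            cx, cy = rng.choice(true_centres)
            points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
    return points

# Compare the centres found with 40% and with 10% noise.
for fraction in (0.4, 0.1):
    found = kmeans(make_data(fraction), k=3)
    print(fraction, sorted((round(x, 2), round(y, 2)) for x, y in found))
```

Comparing the printed centres with the hypothetical true centres for each noise level gives a feel for how much the background points pull the cluster means away from where the clusters were generated.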
Bank Data
University Policy on Cheating and Plagiarism
Students should consult University materials on cheating, in particular:
Generated from XML source and an XSL stylesheet, using xsltproc. By David Squire.