Honours projects for the Clayton School of IT, 2009
Note that some BSE projects are 12-pts, and that BCS/BBIS projects are 24-pts. Some projects come in 24- and 12-pt versions; the latter involves less work.
Consider the problem of determining the amount of soil contamination given several samples in a contaminated site. If you also have a geostatistic model of how the contamination varies in the site, then you could use simulation to estimate the distribution of the soil contamination. In this project we will investigate various methods of doing simulation in geostatistics.
Geostatistics is an area of research concerned with the study of data distributed in space or in time and space. Examples include:
One standard approach to model the variation of the data is the use of variograms. In this project we will investigate different approaches to fit variograms to data.
Strict Minimum Message Length (SMML) inference is an information-theoretic criterion for machine learning. It was introduced by Wallace and Boulton and has several desirable statistical properties. The SMML code-book problem is to create an optimal code-book that gives the greatest compression of future data. There is a quadratic-time algorithm for data drawn from a binomial distribution, and a good heuristic for the trinomial distribution, but the code-book problem is NP-hard (very probably requires exponential time) in general [FW02]. The project is to devise fast approximation-algorithms, that is algorithms that run quickly and create good, but not necessarily optimal, code books. The first cases to consider will be essentially unbounded one-dimensional in character such as the geometric and Poisson distributions. Optimisations may be possible if the prior on the parameter(s) is "nice". The Gaussian (normal) and related distributions are important for continuous data. The project involves and relates to algorithm analysis, complexity, data compression, and machine learning. You should have good results in mathematics, algorithms and data structures, and formal methods, and must have discussed the project with LA or GF before submitting "the form".
[FW02] G. E. Farr and C. S. Wallace.
The complexity of strict minimum message length inference.
Computer Journal, 45(3), pp.285-292, 2002 [doi:10.1093/comjnl/45.3.285]
Given K DNA sequences, or protein sequences, the multiple-alignment problem and the evolutionary-tree problem are "chicken and egg" problems: Given an optimal evolutionary-tree one can search for an optimal multiple-alignment of the K sequences (and we know how to do that [LW94]), and given an optimal multiple-alignment of the K sequences one can search for an optimal evolutionary-tree.
An evolutionary-tree is a special case of a Bayesian net [KN04]: a tree rather than a general DAG, with discrete variables (over {A,C,G,T} if DNA) at each node, known values at all leaves, and missing values at all internal (ancestral) nodes, e.g., see fig.4 [LW94].
The project is to modify the CaMML algorithm for Bayesian nets to infer evolutionary trees given a multiple alignment. If that is successful, the feasibility of simultaneously solving the multiple-alignment and evolutionary-tree problems will be investigated.
The project involves algorithms, probability and machine learning. You should have good results in mathematics, algorithms and data structures, and formal methods, and must have discussed the project with LA or KK before submitting "the form".
[LW94] L. Allison and C. S. Wallace.
The Posterior Probability Distribution of Alignments and
its Application to Parameter Estimation of Evolutionary Trees and
to Optimisation of Multiple Alignments.
Jrnl. Molec. Evol., 39(4), pp.418-430, 1994
http://www.csse.monash.edu.au/~lloyd/tildeStrings/Multiple/94.JME/
or
http://dx.doi.org/10.1007/BF00160274
[KN04] K. B. Korb and A. E. Nicholson.
Bayesian Artificial Intelligence.
Chapman and Hall / CRC, 2004.
Vision is indispensable. Building automated machines with human vision capabilities is a formidable challenge in both industry and academia. Much of human learning can be viewed as an unsupervised incremental learning process. Investigating this gives a significant contribution to machine learning. In this project our main focus is to investigate how humans accumulate knowledge incrementally by a series of objects, induce a hierarchy of concepts that summarize and organize such observations and use them in classifying future experiences.
Pre-labour fetal health is vital for a successful birth. Despite the maternal involvement in assuring fetal health, there are the risky deliveries where intervention is mandatory. The only external measurement that relates directly to fetal health is fetal heart rate patterns. Analyses of this property can be used to determine ominous instances. However, these pattens are largely misinterpreted by human observers leading to unnecessary surgical procedures. This project presents a cognitive approach to fetal heart rate interpretation. Artificial knowledge acquisition means are revamped with emphasis on understanding rather than enforcing knowledge.
Bioinformatics research involves retrieval and analysis of a large amount of biological data such as gene expressions, DNA, RNA, and proteins. Biological data is characterised by its high volume, high dimensionality and also complexity -in terms of structure, evolving nature and biological processes it involves in. Machine learning algorithms have been widely used for solving problems in bioinformatics such as sequence alignment, gene finding and protein domain analysis. This research is an attempt to develop a connectionist system that identifies dimensional changes in biological data.
Biometric information can serve not only for access or authetication, but also for data protection. It is possible to generate a key from the biometric information that can be used with some cryptographic algorithm to encipher and decipher data. In this project we shall study methods of generating key from finger prints or iris codes, addressing the fundamental problems of biometric cryptography of key change and distortion tolerance. Given the biometrics, a set of reliable features are extracted. In this project we shall design a system in which we do not need to store a biometric template, but only a string of error-correction data from which the biometric cannot be derived, and from which the key cannot be derived either unless the biometric is present.
The name Slime Mould doesn't give an organism a very appealing ring to it... but these "creatures" are really fascinating! There are two types of slime moulds: plasmodial slime moulds and cellular slime moulds. This project is interested in the latter. The trait that makes these slimes so amazing is their diverse range of behaviours during a complex life cycle. They have several "phases" within a motile (moving) period that give the mould the appearance of an animal, and an immotile (static) period that gives them the appearance of a plant or fungus.
Individual cells of the mould start their lives scattered around an area. At a chemical signal triggered by starvation, they all coagulate into a single blob. This then becomes a slug called a plasmodium that may be a few inches long. This slug then wanders around seeking nutrients. When it finds a good spot, the slug turns into a reproductive structure called a fruiting body. This is a fungus-like structure with a stalk and a ball on top that produces spores. These spores are then scattered... and the life cycle starts over again.
The purpose of this project is to build an agent-based simulation of the slime mould's life cycle (to some level of abstraction) and to visualise it using computer graphics techniques. The look of the model is important! The aim is to produce an attractive but true-to-life representation of the organism's appearance and behaviour. A movie of the slime mould plasmodium is online: http://www.youtube.com/watch?v=96U-6iU8W_A Some decent illustrations of the various phases of the life cycle are online too: http://universe-review.ca/R10-18-slimemoulds.htm
See also Korb and Dorin.
Ratings are used in chess, tennis, golf, other sports, etc. in order
to rank both teams and individual players. A variety of systems are
used to do the ratings and rankings - including Elo, Glicko, Sonas and
systems more concerned with prize money (over the past 12 months).
We can re-visit and improve these systems using the Minimum Message
Length (MML) principle (Wallace & Boulton, 1968)(Wallace & Freeman,
1987)(Wallace & Dowe, 1999a, which has been the Computer Journal's
most
downloaded article)(Wallace, posthumous, 2005)(Comley and Dowe, 2005)
to arrive at a comparatively simple model with an improved and
near-optimal predictive accuracy on future games and contests.
This project will require strong mathematics - calculus (partial
derivatives, second-order partial derivatives, integration by parts,
determinants of matrices, etc.), etc.
ComleyDowe2005 Comley, Joshua W. and D.L. Dowe (2005). Minimum
Message Length, MDL and Generalised Bayesian Networks with Asymmetric
Languages, Chapter 11 (pp265-294) in "Advances in Minimum Description
Length: Theory and Applications", M.I.T. Press, April 2005.
[Final camera-ready copy submitted Oct. 2003.]
DoweGardnerOppy2007 Dowe, D. L., S. Gardner and G. Oppy (2007).
Bayes not bust! Why simplicity is no problem for Bayesians. British
J. Philos. Sci., pp709-754.
Wallace2005
WallaceBoulton1968
WallaceDowe1999a
WallaceFreeman1987
The existence of the Higgs boson is central to the structure of the standard
model of elementary particles in Physics. The current search for the Higgs
boson involves experiments at the Fermilab Tevatron and at the CERN Large
Hadron Collider (LHC).
Simulation software is available for the outputs of the Tevatron and the
LHC
for the standard particle model and for various other competing models.
This
enables us to statistically analyze the signals of the presence of a Higgs
boson in the data, in advance.
The main statistical approach considered would be the Minimum Message
Length (MML) principle, an approach from Bayesian information theory which
quantitatively trades off the simplicity off a hypothesis against its
goodness of fit to the observed data. Our search problem is partly a problem
in mixture modelling (or clustering), where we wish to identify whether
or not
there is sufficient evidence that there is a component of the observed
data
which is best explained by the presence of the Higgs boson.
The student should have a reasonably strong background in mathematics
(partial
derivatives, integration, matrix determinants, etc.) and advanced quantum
physics.
References:
D. L. Dowe, S. Gardner and G.R. Oppy (2007) "Bayes not Bust! Why Simplicity
is no Problem for Bayesians", British Journal for the Philosophy of
Science
(BJPS), Dec. 2007, pp709-754.
Wallace2005
The optimal reconstruction of images from traditional computer
tomography (CT) projections in the presence of real noise is a mature
problem with several practical solutions. This project proposes use
of a combination of symmetric (Finite Radon Transform) and asymmetric
(Mojette Transform) discrete projection techniques to reconstruct
images of test data and real images, and uses Minimum Message Length
techniques to optimise the image reconstruction for a variety of assumed
noise models. It also aims to use MML-related techniques to guide the
selection of the best views to capture the minimal amount of
sufficient real data required to reconstruct variable resolution images.
References:
Comley, J. W. and D. L. Dowe (2005). Minimum Message Length and
Generalized Bayesian Networks with Asymmetric Languages, Chapter 11
(pp265-294) in "Advances in Minimum Description Length: Theory and
Applications", M.I.T. Press, April 2005.
[Final camera-ready copy submitted Oct. 2003.]
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why
Simplicity is no Problem for Bayesians", Brit. Journal Philos. Sci.
(BJPS), Dec. 2007, pp709-754.
Guedon, JP and Normand, N., The mojette transform: the first ten years,
LNCS 3429 (2005) 79-91
Matus, F. and Flusser, J., Image representation via a finite Radon
transform, IEEE PAMI 15(10) (1993) 996-1006
Wallace2005
Support Vector Machines (SVMs) are a popular approach to classification
in machine learning and "data mining". They are usually only
used to
divide between two classes ("yes"/"no" or "positive"/"negative")
and
nor are they typically able to give probabilities with their
predictions. They also have some arbitrariness in the choice of "kernel"
functions for specifying non-linear boundaries. Using Minimum Message
Length (MML) approaches such as those in Tan & Dowe (2004), other notions
intimated in Dowe (2007) and some previously overlooked coding
inefficiencies, we will be able to overcome all these shortcomings.
This will enable us to come up with comparatively simple SVMs which give
excellent (probabilistic) predictions on multi-class problems, possibly
using non-linear cuts.
The mathematics in this project will not be trivial.
References:
ComleyDowe2005
D. L. Dowe (2007), Discussion following "Hedging Predictions in
Machine
Learning, A. Gammerman and V. Vovk", Computer Journal, Vol. 50, No.
2,
March 2007, pp167-168
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why
Simplicity
is no Problem for Bayesians", Brit. Journal Philos. Sci. (BJPS), Dec.
2007,
pp709-754.
P. J. Tan and D. L. Dowe (2004). MML Inference of Oblique Decision Trees,
Proc. 17th Australian Joint Conf. on Artificial Intelligence (AI'04),
Dec. 2004, Lecture Notes in Artificial Intelligence (LNAI) 3339, Springer,
pp1082-1088.
Wallace2005
In games such as "rock, paper, scissors", we typically first
seek
to maximise our returns against an opponent's best strategy, and then
we seek to exploit any systematic sub-optimalities in our opponents'
play. There is a balance between trying simultaneously to play an
optimal strategy and learn a model for the opponents' strategies.
The project will start with just one other player and then build
up to games with several other players and games where the number of
other players is not known. We shall eventually seek to infer the
total number of players and their strategies. This has relevance to
financial markets. The ability of MML to infer when the amount of
data per parameter is very limited (e.g., Dowe & Wallace (1997)) will
be central here.
The mathematics in this project will not be trivial. An interest
in game theory is essential, a knowledge of game theory is desirable.
References:
ComleyDowe2005
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why
Simplicity
is no Problem for Bayesians", Brit. Journal Philos. Sci. (BJPS), Dec.
2007,
pp709-754.
Dowe, D. L. and C. S. Wallace (1997). Resolving the Neyman-Scott Problem
by
Minimum Message Length. Computing Science and Statistics (Vol. 28), Proc.
28th Symposium on the interface, Sydney, Australia, pp614-618, 1997.
Wallace2005
WallaceDowe1999a
Many real-world systems (aerodynamic, biological, economic, financial,
meteorological, etc.) are modelled by systems of one or more
differential equations - such as Bernoulli, Navier-Stokes, etc.
Such systems are often well-known, but even then they are typically
inexact. One such example is turbulence and the difficulty in
properly accounting for it.
This project will start off simply with data from which we will infer
a differential equation with only one unknown. We will then introduce
a noise term which will enable us to model data which we can not fit
exactly - due to any number of effects, ranging from quantum mechanical
to measurement inaccuracies to the fact that the system is simply too
complicated for our model.
Our modelling will include Minimum Message Length (MML) - variously
because of a variety of optimality properties and its ability to use
simple models that fit more accurately than the more complex models
of other approaches.
The mathematics in this project will not be trivial.
References:
Comley, J. W. and D. L. Dowe (2005). Minimum Message Length and
Generalized Bayesian Networks with Asymmetric Languages, Chapter 11
(pp265-294) in "Advances in Minimum Description Length: Theory and
Applications", M.I.T. Press, April 2005.
[Final camera-ready copy submitted Oct. 2003.]
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why
Simplicity
is no Problem for Bayesians", Brit. Journal Philos. Sci. (BJPS), Dec.
2007,
pp709-754.
Wallace2005
WallaceDowe1999a
Languages spoken by humans change over time. For example, French, Italian
and
many European languages originate from Latin, but each diverges from Latin
in
rather different ways. English is a more complex case, with influences from
Latin and French overlaying its basic Germanic nature. Representing language
by the sounds of spoken words, we use Minimum Message Length (MML) to model
how languages might have developed. Our aim is to reach a model which best
approximates the known history of a language group by running simulations
which assign different probabilities to different types of change, e.g.
internal variation, development of regularity by analogy with existing forms,
and external influences (borrowing). This project will be quite mathematical.
References:
Cangelosi, Angelo and Domenico Parisi (eds.) (2002). Simulating the Evolution
of Language. Berlin: Springer-Verlag.
Dowe, Gardner & Oppy (2007)
Kosmidis, Kosmas, John M. Halley, and Panos Argyrakis (2005). Language
evolution and population dynamics in a system of two interacting species.
Physica A 353: 595-612.
Nowak, Martin A., Natalia Komarova, and Partha Niyogi (2002). Computational
and evolutionary aspects of language. Nature 417: 611-617
Ooi, J.N. and D. L. Dowe (2005). Inferring Phylogenetic Graphs of
Natural Languages using Minimum Message Length, Proc. CAEPIA
2005 (11th Conf. Spanish Assoc. for Artificial Intelligence), Vol. 1,
pp I:143 - I:152, Nov. 2005; ISBN 84-96474-13-5
Researchers in graph theory often want to determine various important properties of specific graphs of modest size: chromatic number, size of maximum clique or independent set, shortest/longest cycle, diameter, girth, genus, automorphism group, etc (most of which you will not have met yet). Such information can be used to formulate or disprove hypotheses, and is very valuable to workers in the field. However, there are huge numbers of different graphs, even of modest size, and there are also very many different parameters which are of interest. Some of these parameters may be very hard to compute. Technology can help a lot here, in storage of large numbers of graphs and their parameters, giving access to this information from anywhere in the world, and computing the frequently NP-hard (or worse) parameters, if they are not already available.
The aim of this project is to build a system which:
As well as system-building, the project will involve doing a survey of existing systems that do parts of the job required, as well as a literature review.
Many decisions will need to be made in the course of the project. Some of these will relate to the relative emphasis given to (a) data management, and using the data on existing graphs to answer user queries about those graphs, and (b) computation, so that queries about new graphs can be answered.
We anticipate giving more emphasis to (a). We also need to decide what graph parameters are to be stored with a graph in the database, and how sophisticated the front end will be (probably primitive, to begin with).
The platform development principles of the Message Lab will be a feature of this project: http://www.messagelab.monash.edu.au/
Prerequisites: good results in: mathematics subjects; Algorithms and Data Structures; Analysis and Design of Algorithms. Some knowledge of databases.
You must talk to Graham Farr before nominating this project.
One of the most crucial phases when chairing a reviewed conference is to assign reviewers to the submitted papers. This is usually done by asking reviewers to indicate, for each paper, whether they want it, could cope with it, do not want it, and have a conflict of interests with its authors. With this information (and perhaps some extra info such as a maximum/minimum number of papers to be assigned to a given reviewer), most modern conference-support tools (such as Easy-Chair) perform an initial assignment. Initial because it is usually so bad that the Chair has to perform extensive modifications to it, often based on information that might not even have been taken into account by the tool, such as the specific research area of the reviewer, its quality, etc. Such modifications tend to be extremely painful and time consuming due, mainly, to the lack of support from the tool.
The aim of this project is twofold. The first is to apply optimisation techniques that yield a better result than those in current state of the art conference-support tools. In particular, the tool should allow the chair to choose among different kind of objective functions and visualise the resulting assignments. The second aim is to provide a graphical interface that somehow helps Chairs modify the initial assignment. Since we do not know yet what is the best optimisation function not how to support Chairs when modifying the tool, the project should be a lot of fun!
It would be extremely useful (though not absolutely necessary) for the student to have taken FIT3082 Programming Languages and Paradigms (and obtained a good mark).
GN is a highly-scalable associative memory algorithm1 capable of handling
multiple streams of input, which are processed and matched with the historical
data (available within the network) in real time. It is capable of utilising
a large number of low performance processors in connected configurations.
Hence GN is highly suited for resource constrained devices such as Berkeley
motes. The project will involve generalising the existing GN implementation,
which handles 1-dimensional sensory inputs, to a multi-dimensional version
capable of analysing 3 or higher dimensional sensory inputs to a WSN. The
24 point project will also investigate whether the algorithm suffers from
curse of dimensionality. This project is available to students with programming
background in C/C++ or Java.
1. Nasution and Khan, A Hierarchical Graph Neuron Scheme
The Walnut password capability operating system has demonstrated the utility and unique access control and security properties inherent in this class of operating system. The aim of this project is to exploit earlier research effort in this area to implement and test an alternative filesystem for Unix (Linux) platforms. This project is suitable for a student with prior Linux kernel experience. Needs a good Linux-internals background. BCS/BDS student, or a strong BSE student.
The Suburban Ad Hoc Networking group focusses its research activities
on techniques for implementing Suburban Ad Hoc Networks. These are self
organising, quasi-static ad hoc (typically wireless) networks which provide
an alternative technology for providing high speed digital connectivity
to households, small businesses and distributed campuses. Specific areas
of research interest include security, low level routing protocols, access
controls and propagation behaviour. Given the broad scope of the research
performed in this area, there is considerable choice in project topics.
Students should consult Dr Ronald Pose or Dr Carlo Kopp for suitable project
topics.
See [/research/san/].
The topic has the potential to also produce a good journal paper if done properly. The project requires some theoretical work to survey the plethora of FEC codes and codecs, and modification of existing open source PPP code to include LCP changes and the FEC codec. C language skills and some comms understanding required. The main requirement is a smart student. Potential for code release into the GPL domain - PPP is widely used in various open source OS'.
This project will explore the possibilities for implementing creative evolution in silico, responding to the challenge by Bedau et al (2000) to reproduce "open-ended evolution". We will begin by examining previous attempts and Bedau's evolutionary activity statistics (Bedau et al 1998) for assessing them. We will develop our own complexity-oriented measures, especially measures sensitive to the growth of organic and ecological function, and apply them to the paleontological record, to the record of ALIFE simulations, and to new simulations of our own designed to put the measure through its paces.
Students are requested to do some preliminary reading (e.g. at least (Levy 1997) and (Bedau et al 2000)) prior to commencing the project. Enrollment in the Honours unit FIT4012 is strongly encouraged as it will provide background material of direct relevance to the topic.
References:
See also: Alison and Korb and Nicholson and Korb
Jon McCormack's projects are in the area of computer graphics, human-computer interaction, adaptive intelligence and evolutionary computing. Current project offerings:
Density estimation is the process of building approximations to an unknown
probability density function on the basis of some observed data. This
includes such familiar approaches as histogram estimation, as well as more
complicated 'smoothing' methods based on kernels. The task of density
estimation is an important topic and finds common use in fields ranging
from
physics to the social sciences.
This project entails the application of Wallace's Minimum Message Length
(MML)
principle, a state of the art model selection criterion developed at Monash
University, to the process of `learning' the underlying probability density
in the form of histograms or smooth curves. The MML principle is grounded
in
information theory and presents the process of learning from data as one
of
communicating the learnable information in the form of a message.
The potential student should have good results in mathematics and must
have
discussed the project with Daniel Schmidt or Enes Makalic before
submitting "the form".
References:
[1] Rissanen, J.; Speed, T. P. & Yu, B.
Density estimation by stochastic complexity
IEEE Transactions on Information Theory, 1992, 38, 315-323
[2] Kontkanen, P. & Myllymaki, P. Meila, M. & Shen, X. (ed.)
MDL Histogram Density Estimation
Proceedings of AISTATS 2007, 2007
[3] Wallace, C. S.
Statistical and Inductive Inference by Minimum Message Length
Springer, 2005
Organisation charts are one of the most common kind of diagrams. However, comparatively little is known about how people understand or use them. In this project eye-tracking equipment will be used to determine how the focus of attention moves around the elements in an organisation chart when it is first seen, and then used in subsequent tasks. Based on this the project will be to develop a model for focus of attention changes when viewing organisation charts and to test the model with eye tracking data. [The project can also be offered for other sorts of visual notations such as UML notations, advanced mathematics, etc]
Automatic document layout is increasingly important topic because of the need to adapt a document's layout to different viewing environments and to dynamically generated content. An important part of technical document layout is how to layout program code, i.e. where to break lines of code if the line is too long. Based on this the project will be to develop a function for measuring the quality of layout of the code and incorporate this into an automatic document layout tool.
Automatic document layout is increasingly important topic because of the need to adapt a document's layout to different viewing environments and to dynamically generated content. An important part of technical document layout is how to layout equations and in particular where to break equations if the line is too long. The project will be to develop a function for measuring the quality of layout of the equation and incorporate this into an automatic document layout tool.
When understanding organizational structure it is often useful to visualize the entities responsible for some particular function in the organization. Unlike a traditional organizational chart the structure is typically not a tree but rather a more complex directed graph in which entities are often grouped into "clusters" and wide variety of relationships exist between the entities. An example of such a organization structure are the research groups and centers of Monash University which can be clustered by Faculty. The aim of this project is to develop a tool for visualizing the research groups and centers of Monash. If successful the tool will be used by Monash Research Services.
The World Wide Web Consortium (W3C) is the non-profit international body responsible for guiding the development of the web. One of its primary tasks is the specification of internet standards such as HTML. As part of their operation W3C organizes a large number of meetings and workshops. They would like a program to help them schedule the weekly meetings using a limited number of rooms. This is a difficult problem and must take into account capacity of rooms and preferences for meetings to be held on particular days and in particular rooms. Experience in a programming in a functional or logic language would be useful.
Bayesian Networks (BNs) are graphical models for probabilistic reasoning, which are now widely accepted in the AI community as intuitively appealing and practical representations for reasoning under uncertainty. One use of BNs is for prediction, and within that general task, for the problem of risk assessment. We have done two recent projects in ecological risk assessment: (a) assessing the risk for native fish populations of water management interventions and (b) risk management of tropical seagrass. However, to date, the BN modeling in these projects has been done with so-called "static" BNs, where there is no explicit representation of time. Clearly, temporal modelling in such prediction tasks is very important, and it could (and should) be done using an existing extension to BNs, called Dynamic Bayesian networks. The aim of this project is to develop knowledge engineering techniques and methodologies for Dynamic Bayesian networks, using the ecological risk assessment domains as case studies.
(If taken as a 24 point project, this may also combine elements of the following Causal Discovery project, by extending CaMML to learn DBNs).
Causal discovery algorithms learn causal Bayesian networks from data.
The oldest of them dates from 1991. At Monash we have developed CaMML (Causal
discovery via Minimum Message Length), which "data mines" observational
data to find the causal model most probable in light of the data. In this
honours project you may choose any one of a number of possible specific
problems to investigate, including:
- Extending CaMML to learn Dynamic Bayesian Network (DBN) structures
- Incorporating expert knowledge as priors for the causal discovery algorithm.c
- The proper evaluation of causal discovery algorithms. The common uses
of predictive accuracy, unweighted edit distance and Kullback-Leibler divergence
are all demonstrably inadequate. We have an alternative "Causal Kullback-Leibler"
which is superior, but needs further work. This is also related to metrics
of causal power that we are developing, which assess how much causal influence
one variable has over another.
- Integrating latent (unobserved) variables into our causal discovery algorithms.
- The philosophy of token causality explicated in terms of causal models
(Twardy & Korb "A Criterion of Probabilistic Causation" Phil
of Sci, 2004; Korb, Twardy, Handfield and Oppy "Causal Reasoning with
Causal Models" Synthese, submitted)
- The mathematics of causal intervention (Korb & Nyberg "The Power
of Intervention"
Minds and Machines, 2006; Korb, Hope, Nicholson and Axnick "Varieties
of Causal Intervention" PRICAI, 2004)
Working primarily with mobile robots and systems, the aim of the research is to establish methodologies for autonomous multi-agent systems. Multi-agent systems would work in a complex interactive network to achieve a common task in an optimized manner. Applications include: cooperative cleaning, exploration of unknown areas, security patrol, search for dangerous targets, search-and-rescue, etc.
These on-going projects focus on research and development of autonomous flying drones - i.e. fully (autonomous) unmanned aerial vehicles. These projects aim to establish methodologies for integrated sensing capabilities with inertial measurement unit (IMU), global positioning system (GPS), and vision, together with optimized intelligent control, navigation and mission planning methodologies for such aerial vehicles.
The Monash Bayesian poker project was started by Kevin Korb in 1993. Since then 5 honours students have worked on various aspects of the project. Our Bayesian poker player (BPP) was originally developed to play 5-card stud poker, using Bayesian network technology. Most recently, BPP has been converted to play Texas Hold'em Poker, the main online form of poker, and re-written in python. BPP has been entered in the inaugural world automated poker playing competition at the main American Artificial Intelligence conference AAAI-2006 and we plan to compete again in 2007. We have developed a simple GUI interface that will allow people to play against BPP via the internet. This project offers lots of options for making BPP a better poker player including improving its bluffing strategies and its opponent modelling. We have access to the AAAI-2006 tournament dgame logs, which provide a rich source of data for automated learning of opponent modelling. Or you might be interested in developing a much more interesting playing interface.
See http://www.csse.monash.edu.au/bai/poker/poker.html for links to Honours projects, online versions of research papers, and to play against the latest version of BPP online.
Artificial Cortical Networks (ACN) also known as Multimodal Self-Organizing Networks are used to model mapping of the multidimensional sensory information e.g. speach onto two-dimensional areas as in human cortex. Recent applications of ACNs are presented in:
The modelling software is written in MATLAB and the first objective of the project is to add GUI in Simulink. Subsequently a student is expected to complete a specific modelling task using ACNs.
3D modelling is the process of constructing a virtual form for use in diverse fields such as film, computer games, computer aided engineering, architecture, physical simulation, or other interactive software. Many different packages and methods exist for modelling, one such package is Blender [1], a free and open-source modelling, animation and compositing tool.
It is often desirable to create organic forms, for example a giant CG kraken for a pirate movie. However, this is usually a very complex process, as 3d modelling tools are very general and have limited support for the kinds of patterns and shapes that can characterise organic quality. The aim of this project is to take a commonly occurring feature of organic shape and design and implement a tool to support the easy modelling of that feature.
One such recurring motif of many organic forms is covering symmetry. This occurs when a surface is literally covered in a repeated object, such as a tentacle being covered by suckers. A tool that supports covering symmetry would ideally take as input: an arbitrary surface S, an arbitrary shape C, and some parameters (such as covering density, arrangement style, etc.) and then would output the shape of C covering S.
The covering operation can be implemented in a multitude of ways and this project would investigate some of these in a very hands on and practical way. The desired outcome would be an easy to use tool integrated into a 3d modelling package (such as Blender), along with a set of case studies (3d models) resulting from the use of tool. As such, experience in (and enthusiasm for) computer graphics and programming is a requirement for this project.
For more information see this web page.
As Service-oriented architecture (SOA) is being considered increasingly in critical applications by distributed enterprises, the provision of dependable services is becoming an important area of research. In this project, we consider dependability attributes of a software service in a sustainable manufacturing domain. Sustainable manufacturing is the employment of eco-efficient technologies and industry standards for engineering manufacturing systems to minimise environmental burdens of green house emissions and energy use. Dependability attributes can be expressed in terms of timeliness, availability and reliability of the correct service, and trustworthiness attributes such as confidentiality, integrity and maintainability of deployed services.
Providing a predictable level of dependability for services which have been composed from services from various providers is an important requirement. Standards such as SOAP, UDDI and WSDL [1] are being adopted by major web service providers. These services need to establish and adhere to standards and hence, the dependability of service will become a differentiating point for services. Existing standards do not provide sufficient information to make decisions about how dependable and available the various services & components are, and how they fail. Some of the challenges that we will be dealing with respect to component interactions and composition of services in the sustainable manufacturing domain are: how to ensure trust in correctness when dealing with global partners and services composed using their services how to ensure reliability when composing services from these partners who are outside the control of the system ...[see SR for more information]
The purpose of a Content-Based Image Retrieval (CBIR) system is to retrieve images from a database such that the visual contents of retrieved images are similar to that of a query image. Often the image similarities are computed from the representation of visual contents such as colour, texture and shape, in terms of a multi-dimensional feature vector. Understandably, the features cannot be assumed to have equal weights. More important features deserve more weights and less important features deserve less weights. These weights can be assigned fully automatically from feature variation statistics in the database images. Better still is to use the so called relevance feedback (RF) mechanism in which the weights are modified iteratively based on users' response regarding the relevance of the retrieved images. The aim of this project is to investigate the feature weighting problem for both cases, namely, without RF and with RF.
In recent years Content-Based Image Retrieval (CBIR) has become one of the most active areas of research in image data mining. In studies involving the design of a CBIR system often the question arises whether the data set sizes used are big enough to rely on the accuracy achieved. The aim of this second project in CBIR is to study the impact, on accuracy, of the number of samples in the database, the number of samples per semantic category, the number of retrieved images returned to user for RF, the number of features used, and their relative sizes.
In statistical pattern recognition, often patterns are represented by a large number of numerical features. Although there is no conceptual justification in reducing the number of features to a small number, in practical problem solving, this becomes a necessary step due to the wellknown phenomenon of the 'curse of dimensionality' of the feature vector on the complexity of the pattern classifier. The aim of the proposed project is to develop a hierarhical feature selection paradigm that is well-suited to multi-class pattern recognition problems.
The project will comprise (i) study of some existing feature selection criteria followed by an experimental investigation of them, (ii) analysis of the above results leading to, hopefully, a hierarchical feature selection paradigm, and (iii) development of an interactive software tool for the above paradigm.
The methodology developed will be such that depending on the classification accuracy arrived at a certain stage, the user will have the option of increasing or decreasing the dimensionality value. The software tool will include procedures for displaying the distribution of pattern samples in different feature spaces obtained by different feature selection methods.
Texture plays an important role in both human interpretation of visual scenes and computer analysis of images. Textural cues are of particular relevance in two different, but related, image analysis problems, namely, the problems of segmentation and classification of images. The proposed project will deal with both of these problems. It will involve (i) investigating existing texture analysis methods from the point of view of their theoretical soundness as textural measures and (ii) investigating their practical applicability.
Port Phillip Bay, Victoria, is home to a small and genetically unique population of bottlenose dolphins, Tursiops sp. Individuals within this population have been identified via photo-identification of the dolphins' dorsal fin. Dorsal fins show a unique shape and through time develop permanent marks and notches. Currently, the digital images are being assessed for these permanent marks and assigned an identity by human 'eye'. This can be incredibly time consuming, requires permanent notches on the fin and a trained eye to pick-up small notches. The aim of the project is to create a methodology/software that not only uses the notches but shape of the fin to assign identity with high levels of accuracy.
Some references:Independent Component Analysis (ICA) is a statistical technique for discovering hidden factors underlying a set of random variables. ICA can be used to discover filters that can be used to characterize visual textures in a set of images. The filters produced by ICA are adapted to the statistics of the set of images used to derive them. It should thus be possible to use ICA to derive a set of filters specifically tuned to the textures in a collection of images from a restricted domain. We will investigate the discovery and selection of a set of ICA filters for image collections from domains such as dermatology.
The system will be developed within the framework of the GNU Image Finding Tool (GIFT) (http://www.gnu.org/software/gift/). The GIFT is an open framework for content-based image retrieval. Use of the GIFT framework will means that researchers working on a project can focus on the problems of immediate interest and importance to the project, rather than having to develop and entire image retrieval system, including user interface, feature extractors, indexing tools, database accessor, user interface, and feedback/learning system from scratch.
In Scientific Visualization we are often interested in representing higher dimensional data in some lower number of dimensions such as 2,3 or 4. Minimal Cost Spanning Trees (MCSTs) are simple to compute and are a way of preserving closeness information in representing higher dimensional data. Current methods for carrying out projections for visualizing higher dimensional data tend to be computationally expensive with results that are sensitive to parameters within the algorithm that are difficult to tune. These methods are also sensitive to the dimension to which the projection is being made. MCST-based projection does have these drawbacks and can be engineered to preserve properties of the original higher dimensional data. This project will involve implementing a package that uses MCSTs for visualizing higher dimensional data in lower dimensions such as 2, 3 or 4 and comparing the performance of this package against packages incorporating techniques like SOMs (Self Organising Maps).
Mixture modeling, also known as clustering, involves taking a set of observations about things and deciding whether the things all belong to one population or whether the things come from a number of different sub-populations. DNA microarray data is a 2D array where the rows or columns may represent either a particular gene or represent the subject of an experiment. The entries in the microarray represent the activity of that gene for that subject. For instance, the entries in the first row might represent gene activities for the first subject while the entries in the fourth column might represent gene activities for the fourth gene. With bi-clustering we are interested in finding groups of genes and groups of people for what the group of genes behaves similarly. If we could re-order the rows and columns of a DNA micro-array, a bi-cluster would appear as a region of cells that show the same gene activity.
The MML (Minimal Message Length) criterion was proposed by Professor Chris Wallace foundation professor of Computer Science at Monash University. MML has been used to create a clustering program called 'snob'. 'snob' has proven to be very effective in identifying the correct number of clusters but it is not used widely in bioinformatics where people continue to use a variety of techniques that are generally not able to identify correctly the number of clusters.
The aim of this project is to apply the MML approach and snob to the mixture modelling problems that arise in the analysis of DNA microarray data. It would be of particular interest to extend the snob approach to bi-clustering.
Minimal Cost Spanning Trees (MCSTs) are a simple tree structure that is easy to compute and which can be very usefully in a wide variety of applications such as networking, finding nearest-neighbours in higher-dimensional spaces, scientific visualisation, hierarchical clustering and image segmentation. Minimal Cost Spanning Graphs (MCSGs) are a generalisation of MCSTs. MCSGs are also simple to compute but are connected graphs, rather than trees. The aim of the project is to survey a number of areas for which MCSTs are already proven to be useful and to explore additional benefits that might result when MCSGs are applied to those areas.
Though transport schedules are precisely timed in advance, the day of operation brings a variety of delays and disruptions. The challenge is to minimize the overall impact of these disruptions on the passengers. A set of transportation disruption problem instances have been designed by the international company Amadeus, and published by the French OR society as the 2009 Roadef Challenge (http://challenge.roadef.org/2009/index.en.htm). In this project we will explore the smaller benchmark set (for which optimal solution are known) using a techniques called "Large Neighbourhood Search" (LNS). If it works well, we will try applying the technique to the larger instances. The objective of the project is to explore techniques for choosing large neighbourhoods. After choosing neighbourhoods by hand for the first instance, it is hoped we can develop a method for automatically choosing "good" neighbourhoods to achieve high quality rescheduling quickly.
This project will involve working with a team of computer scientists and biologists to analyse biological data about proteins and how they have evolved in order to infer how they function. This is an advanced data mining project that will provide a stepping stone toward a career in commercial data mining (such as developing next generation web search technologies), or in data mining or bioinformatics research.
The aim of this project is to analyse a large dataset of proteins that are used in refolding experiments, in order to improve our understanding of the refolding process. We hope to discover relationships between a protein's characteristics and its behaviour in the laboratory. We are also very interested using this information to provide scientists with real experimental protocols in a completely automated fashion. See http://refold.med.monash.edu.au. This is an advanced data mining project and will provide a stepping stone toward a career in commercial data mining (such as developing next generation web search technologies), data mining or bioinformatics research.
Last update: Tuesday, February 24, 2009 10:23 PM