Honours projects for the Clayton School of IT, 2008
Note that some BSE projects are 12-pts, and that BCS/BBIS projects are 24-pts. Some projects come in 24- and 12-pt versions; the latter involves less work.
Consider the problem of determining the amount of soil contamination
given several samples in a contaminated site. If you also have a
geostatistic model of how the contamination varies in the site, then
you could use simulation to estimate the distribution of the soil
contamination. In this project we will investigate various methods of
doing simulation in geostatistics.
Geostatistics is an area of research concerned with the study of
data distributed in space or in time and space. Examples
include:
* ore grades in a mineral deposit,
* depth and thickness of geological layer,
* density of trees in a forest,
* rainfall over a catchment area, and
* concentrations of pollutants in a contaminated site.
One standard approach to model the variation of the data is the use of
variograms. In this project we will investigate different approaches to
fit variograms to data.
[FW02] G. E. Farr and C. S. Wallace.
The complexity of strict minimum message length inference.
Computer Journal, 45(3), pp.285-292, 2002 [doi:10.1093/comjnl/45.3.285]
Given K DNA sequences, or protein sequences, the multiple-alignment problem and the evolutionary-tree problem are "chicken and egg" problems: Given an optimal evolutionary-tree one can search for an optimal multiple-alignment of the K sequences (and we know how to do that [LW94]), and given an optimal multiple-alignment of the K sequences one can search for an optimal evolutionary-tree.
An evolutionary-tree is a special case of a Bayesian net [KN04]: a tree rather than a general DAG, with discrete variables (over {A,C,G,T} if DNA) at each node, known values at all leaves, and missing values at all internal (ancestral) nodes, e.g., see fig.4 [LW94].
The project is to modify the CaMML algorithm for Bayesian nets to infer evolutionary trees given a multiple alignment. If that is successful, the feasibility of simultaneously solving the multiple-alignment and evolutionary-tree problems will be investigated.
The project involves algorithms, probability and machine learning. You should have good results in mathematics, algorithms and data structures, and formal methods, and must have discussed the project with LA or KK before submitting "the form".
[LW94] L. Allison and C. S. Wallace.
The Posterior Probability Distribution of Alignments and
its Application to Parameter Estimation of Evolutionary Trees and
to Optimisation of Multiple Alignments.
Jrnl. Molec. Evol., 39(4), pp.418-430, 1994
http://www.csse.monash.edu.au/~lloyd/tildeStrings/Multiple/94.JME/
or
http://dx.doi.org/10.1007/BF00160274
[KN04] K. B. Korb and A. E. Nicholson.
Bayesian Artificial Intelligence.
Chapman and Hall / CRC, 2004.
The name Slime Mould doesn't give an organism a very appealing ring to it... but these "creatures" are really fascinating! There are two types of slime moulds: plasmodial slime moulds and cellular slime moulds. This project is interested in the latter. The trait that makes these slimes so amazing is their diverse range of behaviours during a complex life cycle. They have several "phases" within a motile (moving) period that give the mould the appearance of an animal, and an immotile (static) period that gives them the appearance of a plant or fungus.
Individual cells of the mould start their lives scattered around an area. At a chemical signal triggered by starvation, they all coagulate into a single blob. This then becomes a slug called a plasmodium that may be a few inches long. This slug then wanders around seeking nutrients. When it finds a good spot, the slug turns into a reproductive structure called a fruiting body. This is a fungus-like structure with a stalk and a ball on top that produces spores. These spores are then scattered... and the life cycle starts over again.
The purpose of this project is to build an agent-based simulation of the slime mould's life cycle (to some level of abstraction) and to visualise it using computer graphics techniques. The look of the model is important! The aim is to produce an attractive but true-to-life representation of the organism's appearance and behaviour. A movie of the slime mould plasmodium is online: http://www.youtube.com/watch?v=96U-6iU8W_A Some decent illustrations of the various phases of the life cycle are online too: http://universe-review.ca/R10-18-slimemoulds.htm
Ratings are used in chess, tennis, golf, other sports, etc. in order
to rank both teams and individual players. A variety of systems are
used to do the ratings and rankings - including Elo, Glicko, Sonas and
systems more concerned with prize money (over the past 12 months).
We can re-visit and improve these systems using the Minimum Message
Length (MML) principle (Wallace & Boulton, 1968)(Wallace & Freeman,
1987)(Wallace & Dowe, 1999a, which has been the Computer Journal's most
downloaded article)(Wallace, posthumous, 2005)(Comley and Dowe, 2005)
to arrive at a comparatively simple model with an improved and
near-optimal predictive accuracy on future games and contests.
This project will require strong mathematics - calculus (partial
derivatives, second-order partial derivatives, integration by parts,
determinants of matrices, etc.), etc.
ComleyDowe2005 Comley, Joshua W. and D.L. Dowe (2005). Minimum
Message Length, MDL and Generalised Bayesian Networks with Asymmetric
Languages, Chapter 11 (pp265-294) in "Advances in Minimum Description
Length: Theory and Applications", M.I.T. Press, April 2005.
[Final camera-ready copy submitted Oct. 2003.]
DoweGardnerOppy2007 Dowe, D. L., S. Gardner and G. Oppy (2007).
Bayes not bust! Why simplicity is no problem for Bayesians. British
J. Philos. Sci., pp709-754.
Wallace2005
WallaceBoulton1968
WallaceDowe1999a
WallaceFreeman1987
The existence of the Higgs boson is central to the structure of the standard
model of elementary particles in Physics. The current search for the Higgs
boson involves experiments at the Fermilab Tevatron and at the CERN Large
Hadron Collider (LHC).
Simulation software is available for the outputs of the Tevatron and the LHC
for the standard particle model and for various other competing models. This
enables us to statistically analyze the signals of the presence of a Higgs
boson in the data, in advance.
The main statistical approach considered would be the Minimum Message
Length (MML) principle, an approach from Bayesian information theory which
quantitatively trades off the simplicity off a hypothesis against its
goodness of fit to the observed data. Our search problem is partly a problem
in mixture modelling (or clustering), where we wish to identify whether or
not
there is sufficient evidence that there is a component of the observed data
which is best explained by the presence of the Higgs boson.
The student should have a reasonably strong background in mathematics (partial
derivatives, integration, matrix determinants, etc.) and advanced quantum
physics.
References:
D. L. Dowe, S. Gardner and G.R. Oppy (2007) "Bayes not Bust! Why Simplicity
is no Problem for Bayesians", British Journal for the Philosophy of Science
(BJPS), Dec. 2007, pp709-754.
Wallace2005
The optimal reconstruction of images from traditional computer
tomography (CT) projections in the presence of real noise is a mature
problem with several practical solutions. This project proposes use
of a combination of symmetric (Finite Radon Transform) and asymmetric
(Mojette Transform) discrete projection techniques to reconstruct
images of test data and real images, and uses Minimum Message Length
techniques to optimise the image reconstruction for a variety of assumed
noise models. It also aims to use MML-related techniques to guide the
selection of the best views to capture the minimal amount of
sufficient real data required to reconstruct variable resolution images.
References:
Comley, J. W. and D. L. Dowe (2005). Minimum Message Length and
Generalized Bayesian Networks with Asymmetric Languages, Chapter 11
(pp265-294) in "Advances in Minimum Description Length: Theory and
Applications", M.I.T. Press, April 2005.
[Final camera-ready copy submitted Oct. 2003.]
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why
Simplicity is no Problem for Bayesians", Brit. Journal Philos. Sci.
(BJPS), Dec. 2007, pp709-754.
Guedon, JP and Normand, N., The mojette transform: the first ten years,
LNCS 3429 (2005) 79-91
Matus, F. and Flusser, J., Image representation via a finite Radon
transform, IEEE PAMI 15(10) (1993) 996-1006
Wallace2005
Support Vector Machines (SVMs) are a popular approach to classification
in machine learning and "data mining". They are usually only used
to
divide between two classes ("yes"/"no" or "positive"/"negative")
and
nor are they typically able to give probabilities with their
predictions. They also have some arbitrariness in the choice of "kernel"
functions for specifying non-linear boundaries. Using Minimum Message
Length (MML) approaches such as those in Tan & Dowe (2004), other notions
intimated in Dowe (2007) and some previously overlooked coding
inefficiencies, we will be able to overcome all these shortcomings.
This will enable us to come up with comparatively simple SVMs which give
excellent (probabilistic) predictions on multi-class problems, possibly
using non-linear cuts.
The mathematics in this project will not be trivial.
References:
ComleyDowe2005
D. L. Dowe (2007), Discussion following "Hedging Predictions in Machine
Learning, A. Gammerman and V. Vovk", Computer Journal, Vol. 50, No. 2,
March 2007, pp167-168
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why Simplicity
is no Problem for Bayesians", Brit. Journal Philos. Sci. (BJPS), Dec.
2007,
pp709-754.
P. J. Tan and D. L. Dowe (2004). MML Inference of Oblique Decision Trees,
Proc. 17th Australian Joint Conf. on Artificial Intelligence (AI'04),
Dec. 2004, Lecture Notes in Artificial Intelligence (LNAI) 3339, Springer,
pp1082-1088.
Wallace2005
In games such as "rock, paper, scissors", we typically first seek
to maximise our returns against an opponent's best strategy, and then
we seek to exploit any systematic sub-optimalities in our opponents'
play. There is a balance between trying simultaneously to play an
optimal strategy and learn a model for the opponents' strategies.
The project will start with just one other player and then build
up to games with several other players and games where the number of
other players is not known. We shall eventually seek to infer the
total number of players and their strategies. This has relevance to
financial markets. The ability of MML to infer when the amount of
data per parameter is very limited (e.g., Dowe & Wallace (1997)) will
be central here.
The mathematics in this project will not be trivial. An interest
in game theory is essential, a knowledge of game theory is desirable.
References:
ComleyDowe2005
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why Simplicity
is no Problem for Bayesians", Brit. Journal Philos. Sci. (BJPS), Dec.
2007,
pp709-754.
Dowe, D. L. and C. S. Wallace (1997). Resolving the Neyman-Scott Problem by
Minimum Message Length. Computing Science and Statistics (Vol. 28), Proc.
28th Symposium on the interface, Sydney, Australia, pp614-618, 1997.
Wallace2005
WallaceDowe1999a
Many real-world systems (aerodynamic, biological, economic, financial,
meteorological, etc.) are modelled by systems of one or more
differential equations - such as Bernoulli, Navier-Stokes, etc.
Such systems are often well-known, but even then they are typically
inexact. One such example is turbulence and the difficulty in
properly accounting for it.
This project will start off simply with data from which we will infer
a differential equation with only one unknown. We will then introduce
a noise term which will enable us to model data which we can not fit
exactly - due to any number of effects, ranging from quantum mechanical
to measurement inaccuracies to the fact that the system is simply too
complicated for our model.
Our modelling will include Minimum Message Length (MML) - variously
because of a variety of optimality properties and its ability to use
simple models that fit more accurately than the more complex models
of other approaches.
The mathematics in this project will not be trivial.
References:
Comley, J. W. and D. L. Dowe (2005). Minimum Message Length and
Generalized Bayesian Networks with Asymmetric Languages, Chapter 11
(pp265-294) in "Advances in Minimum Description Length: Theory and
Applications", M.I.T. Press, April 2005.
[Final camera-ready copy submitted Oct. 2003.]
D. L. Dowe, S. Gardner and G. R. Oppy (2007) "Bayes not Bust! Why Simplicity
is no Problem for Bayesians", Brit. Journal Philos. Sci. (BJPS), Dec.
2007,
pp709-754.
Wallace2005
WallaceDowe1999a
Modelling language change (24 points)
Project-id: dld-ling
Supervisors:
A./Prof. David Dowe (IT) and Dr Simon Musgrave (Linguistics)
(24 points)
Languages spoken by humans change over time. For example, French, Italian
and
many European languages originate from Latin, but each diverges from Latin
in
rather different ways. English is a more complex case, with influences from
Latin and French overlaying its basic Germanic nature. Representing language
by the sounds of spoken words, we use Minimum Message Length (MML) to model
how languages might have developed. Our aim is to reach a model which best
approximates the known history of a language group by running simulations
which assign different probabilities to different types of change, e.g.
internal variation, development of regularity by analogy with existing forms,
and external influences (borrowing). This project will be quite mathematical.
References:
Cangelosi, Angelo and Domenico Parisi (eds.) (2002). Simulating the Evolution
of Language. Berlin: Springer-Verlag.
Dowe, Gardner & Oppy (2007)
Kosmidis, Kosmas, John M. Halley, and Panos Argyrakis (2005). Language
evolution and population dynamics in a system of two interacting species.
Physica A 353: 595-612.
Nowak, Martin A., Natalia Komarova, and Partha Niyogi (2002). Computational
and evolutionary aspects of language. Nature 417: 611-617
Ooi, J.N. and D. L. Dowe (2005). Inferring Phylogenetic Graphs of
Natural Languages using Minimum Message Length, Proc. CAEPIA
2005 (11th Conf. Spanish Assoc. for Artificial Intelligence), Vol. 1,
pp I:143 - I:152, Nov. 2005; ISBN 84-96474-13-5
(24 point or 12 point)
A./Prof. David Dowe and Dr Karen Smith (MAS)
Title: Determination of predictors of emergency patient outcome via MML
[dld-ambo]
Ambulance patients and their vital signs are often regularly monitored while
en route to hospital. Depending on the length of the trip, there might be
one
to several to many such measurements of indicators such as blood pressure
(diastolic and systolic), etc. Other measured variables include respiratory
rate, pulse rate, neurological status and temperature. We will use this data
to predictively model the impact of patient physiological, demographic and
treatment factors on outcome for key clinical sub-groups such as cardiac
arrest, stroke, trauma and respiratory distress. Like many real-world
problems, there are many potentially relevant variables, but there is not
as
much data as we would like for individual patients, and measurements are
often
noisy. Such data is ideally suited to Minimum Message Length (MML) inference,
which we will use to arrive at relatively simple models which fit the available
data about as well as is possible.
References:
Comley, J.W. and D.L. Dowe (2005). Minimum Message Length and Generalized
Bayesian Networks with Asymmetric Languages, Chapter 11 (pp265-294) in
"Advances in Minimum Description Length: Theory and Applications",
M.I.T.
Press, April 2005. [Final camera-ready copy submitted Oct. 2003.]
Dowe, D. L., S. Gardner and G. Oppy (2007). Bayes not bust! Why simplicity
is no problem for Bayesians. Brit. J. Philos. Sci., pp709-754.
Wallace2005
WallaceBoulton1968
WallaceDowe1999a
WallaceFreeman1987
References
Le Provost, T., Wallace, M., Domain independent propagation.
Proc. Int.Conf. on 5th Gen. Comp. Systems, pages 1004--1011, 1992
http://citeseer.ist.psu.edu/leprovost92domain.html M. Maher. Propagation completeness
of reactive constraints.
Proc. ICLP'02, pages 148-162, 2002
Scheduling problems appear in all walks of life: from airlines to hospitals, and from offices to sports leagues. They involve performing a set of jobs, each requiring a sequence of tasks to be completed using a set of resources under some time constraints. The specific resource and duration for each task is fixed, and two tasks can't use the same resource at the same time. As a simple example, consider a computer game in which the components of a gravity gun, a radiation suit and a bullet-time disruptor need to be built according to the following schedule:
| Task_1 | Task_2 | Task_3
-------------------------------------------------------------
Grav. Gun | 6 hours at M1 | 50 hours at M4 | 17 hours at M3
Rad. Suit | 25 hours at M2 | 15 hours at M1 | 10 hours at M4
Disruptor | 16 hours at M1 | 5 hours at M2 | 20 hours at M3
where M1,M2,M3, and M4 are four machines. The aim is to schedule the tasks to complete all jobs in the minimum amount of time. These problems can be modelled by associating a variable with the start time of each task (9 variables in our example above), and searching for the best compatible start times. Alternatively, they can be modelled by associating a variable with each pair of tasks on each resource (6 variables), and searching for the best task-orders. This project addresses the assumption that underlies many scheduling algorithms: that deciding task-orders is better. When, if ever, is this assumption false?
As part of an on-going process to better utilize the advantages of information technology in a university learning environment, a recent student project looked at ways of extracting course and unit information from publically available handbooks, and turning this into an "expert system" that could provide course advice to students.
One outcome from this project was the automatic rendition of a Prolog program that was tailored to a given student at some arbitrary point in his or her course. Knowing what units had been completed, what units the course structure required, and what units might be undertaken in the next semester, the program would not only supply the student with a list of possible units that could be taken over the next semester, but also would show what units had to be taken in order to complete the course with (say) a specialization in a particular field.
The project showed the feasibility of this approach, but did not deliver a workable prototype. This project would be to take the existing work, and (re)engineer it to the point of completing a workable prototype that could be employed via say a web interface, or faculty kiosk.
Knowledge of Prolog would be useful, although not essential. A copy of the thesis may be found at http://www.csse.monash.edu.au/~ajh/research/PfreyThesis.pdf
GN is a highly-scalable associative memory algorithm1 capable of handling
multiple streams of input, which are processed and matched with the historical
data (available within the network) in real time. It is capable of utilising
a large number of low performance processors in connected configurations.
Hence GN is highly suited for resource constrained devices such as Berkeley
motes. The project will involve generalising the existing GN implementation,
which handles 1-dimensional sensory inputs, to a multi-dimensional version
capable of analysing 3 or higher dimensional sensory inputs to a WSN. The
24 point project will also investigate whether the algorithm suffers from
curse of dimensionality. This project is available to students with programming
background in C/C++ or Java.
1. Nasution and Khan, A Hierarchical Graph Neuron Scheme
Density estimation is the process of building approximations to an unknown
probability density function on the basis of some observed data. This
includes such familiar approaches as histogram estimation, as well as more
complicated 'smoothing' methods based on kernels. The task of density
estimation is an important topic and finds common use in fields ranging from
physics to the social sciences.
This project entails the application of Wallace's Minimum Message Length
(MML)
principle, a state of the art model selection criterion developed at Monash
University, to the process of `learning' the underlying probability density
in the form of histograms or smooth curves. The MML principle is grounded
in
information theory and presents the process of learning from data as one
of
communicating the learnable information in the form of a message.
The potential student should have good results in mathematics and must have
discussed the project with Daniel Schmidt or Enes Makalic before
submitting "the form".
References:
[1] Rissanen, J.; Speed, T. P. & Yu, B.
Density estimation by stochastic complexity
IEEE Transactions on Information Theory, 1992, 38, 315-323
[2] Kontkanen, P. & Myllymaki, P. Meila, M. & Shen, X. (ed.)
MDL Histogram Density Estimation
Proceedings of AISTATS 2007, 2007
[3] Wallace, C. S.
Statistical and Inductive Inference by Minimum Message Length
Springer, 2005
See http://www.csse.monash.edu.au/bai/poker/poker.html for links to Honours projects, online versions of research papers, and to play against the latest version of BPP online.
2. J. D. Adams, T. Speakman, E. Zolman, and L. H. Schwacke, "Automatic
Image Matching, Cataloging, and Analysis for Photo-Identification
Research," Aquatic Mammals, Vol. 32, No. 3, pp. 374-384, 2006.
(pdf file available from Sid on request)
3. C. Gope, N. Kehtarnavaz, G. Hillman, and B. Wursig, "An Affine
Invariant curve matching method for photo-identification of marine
mammals, Pattern Recognition, Vol. 38, pp. 125-132, 2005.
(pdf file available from Sid on request)
The system will be developed within the framework of the GNU Image Finding Tool (GIFT) (http://www.gnu.org/software/gift/). The GIFT is an open framework for content-based image retrieval. Use of the GIFT framework will means that researchers working on a project can focus on the problems of immediate interest and importance to the project, rather than having to develop and entire image retrieval system, including user interface, feature extractors, indexing tools, database accessor, user interface, and feedback/learning system from scratch.
The fundamental problem in getting computer programs to process images
is to recognise which groups of pixels belong together because they
represent some important element in the image. We term such a group of
pixels, a cluster. If the pixels in a cluster are spatially connected,
we term such a cluster, a segment. In a hierarchical segmentation,
we recognise that some segments may be merged with neighbouring segments
to form a larger segment which may be merged with other segments in turn.
Often when an image is to be processed, the user is interested in specifying
different regions in the image and actions whose effects will be confined to
those regions. The aim of this project is to allow a user to change part of
an image using a hierarchical segmentation of the image to identify
different regions within the image. This kind of approach can be applied to
identifying possible sites for skin cancer lesions in images of skin
and to problems like identifying how much of an image of the sea
is covered by sea-ice or open water.
In Scientific Visualisation we are interested in mapping large amounts of
high-dimensional data to a lower dimensional space so that a human observer
can look for interesting patterns in the original data. The aim of this
project is to use an approach which takes a small number of points in
original space and maps them to a lower dimensional space in such a way that
relationships between the points in the original space are preserved.
In particular, with triangle projection a group of 3 separate points
in a high dimensional space that do not form a straight line, form a
triangle. We can project those three points to a 2D space so that the
projections of the points form a triangle that is congruent to the triangle
formed using the original three points.
With triangle projection, each point is projected by forming a new triangle
with two points which have already been projected. The aim of this project
is to investigate the use of a number of different schemes that use triangle
projection in exploring high dimensional data sets. Although the rendering
of the projected data will involve some use of 2D, and possibly, 3D computer
graphics, it will not be necessary to have taken a graphics unit in order
to carry out this project.
Minimal Cost Spanning trees (MCSTs) have many uses among them include their
use for hierarchical clustering of data and for hierarchical segmentation of
images. MCSTs can be computed efficiently enough for them to be used in
interactive image segmentation. Minimal Cost Spanning Graphs (MCSGs) are a
generalisation of MCSTs and they share many of the useful properties that
MCSTs have. In particular, they can be computed quickly.
This project will investigate the use of MCSGs for hierarchical clustering
and for interactive image segmentation.
In DNA microarrays the data is a two-dimensional array of values.
The values along a row might all belong to one subject while those along a
column might be for a particular gene. Thus, the brightness of a spot in a
certain row and column represents the activity of a particular gene
for a particular person. The aim of a bi-clustering is to recognise clusters
of people who have clusters of genes that show the same kind of behaviour.
One way to approach this problem is to re-order the rows and columns
so that the value in a particular row and column is most likely to be similar
to values in adjacent rows and columns.
Bi-clustering is a new kind of clustering problem and research in this area
is still at a preliminary stage. The project will involve investigating
the literature to survey existing methods and implementing an approach
for re-ordering rows and columns.
A segment in an image is a group of spatially connected pixels that share
some homogeneity property. Often segments represent objects, or parts of
objects, in a scene. A segment map is something that allows us to determine
for each pixel, which segment it is in. Often a segment map can be regarded
as a 2D array with as many rows or columns as the image and where the entry
in a particular row or column gives the identity of the segment to which the
corresponding pixel in the image belongs.
In a number of areas of image processing such as fractal image coding or
content-based image retrieval, it is necessary to store segment maps.
The aim of this project is to investigate a way of storing segment maps
which retains all the segment information but takes as few bits as possible.
Last update: Wednesday, March 5, 2008 2:36 PM