Project Information

Aims and Significance

Integer optimisation is found in many industries. It involves the process of minimisation of some cost function subject to a variety of real world constraints. The variables in these problems can only take integer values. Examples include scheduling, path planning and resource management. Providing optimal decisions in a timely manner can amount to millions of dollars to many industries, and can have a dramatic effect on the competitiveness of the business.

Over the years there have been many algorithms developed to solve such problems. These algorithms can be classified as either generic or specific to particular problem domains. Generic ones have the advantage that they can be applied to a wide range of problems, however, they are often much slower than those which are tailored. Specific ones are capable of solving certain problems very rapidly, however, can take many years to develop. Their performance can also be very sensitive to changes in the input data. Thus, high speed generic algorithms are of considerable interest because of their wide applicability. Two examples of commonly used generic algorithms are simulated annealing and branch-and-bound, both of which have been shown to suffer from performance problems. These two algorithms have been widely used for solving real world problems for many years. With improved performance, they should become competitive with tailored algorithms whilst maintaining their generic properties.

Effort has been focused on methods to accelerate search techniques using general purpose vector and parallel super computers. Unfortunately, in many cases, the speedup has been disappointing. Various researchers report vector speedups of the order of only 1.5 times [7, 9, 13, 16]. Further, even when reasonable speedup is achieved, the hardware cost may be so high that it is not possible to dedicate the computer to the one task for sufficient amounts of time. In [17] Thomborson argues that the excessive cost of modern supercomputing does not warrant the relatively small improvements in execution speed (20 times is regarded as good for many optimisation codes).

Unlike general purpose computers, Application Specific Architectures are computers that only solve a narrow range of problems. Their specific nature means that they are often much simpler than general purpose computers, and more importantly, they can provide very fast solutions to problems which require large amounts of computation. Thus, they can offer a large improvement in the price/performance characteristics of general purpose computers. Application specific architectures are an ideal platform for acceleration of optimisation algorithms because knowledge of the algorithm and data structure can be incorporated into the machine. However, using traditional implementation techniques, Application Specific Architectures have been difficult to build because they require real hardware to be constructed for each new problem. Thus, using conventional hardware, it is difficult to build a machine which can provide high performance for both simulated annealing and branch-and-bound algorithms.

Technology is now available in the form of Field Programmable Gate Arrays (FPGAs) which allows hardware to be reconfigured without any physical modification. This offers enormous potential because it should now be possible to build a reconfigurable special purpose architecture which can be used to solve a range of problems. Thus, a reconfigurable processor can be attached to a conventional workstation and be loaded with the correct logic for accelerating the current task. This technology makes it possible to build one machine which can be tailored for both simulated annealing and branch-and-bound by reconfiguring the internal architecture accordingly. It is expected that, using FPGAs, it will be possible to attach a range of inexpensive special purpose machines to general purpose ones. This development is consistent with networks of heterogeneous computing machines, which are becoming more common.

The aim of this project is to develop a class of computer architectures which deliver very high performance on solving integer optimisation problems. These architectures will be designed to support both simulated annealing and branch-and-bound algorithms through system reconfiguration. They will be based on FPGA technology.

The work is significant because it will allow a wide range of integer optimisation problems to be solved very rapidly using one inexpensive hardware and software platform. Thus, it will be possible to solve a number of important practical problems using one system. The system will provide the advantages of generic algorithms with the speed of specific ones. The work will also lead to a conceptual advance in the role of special purpose computers.

Research Plan, Methods and Techniques

Background - why not use a supercomputer?

The two algorithms which have been chosen for this project are branch-and-bound and generalised simulated annealing. There are two traditional ways of improving the performance of these algorithms, namely vectorisation and parallelisation. Neither algorithm gains much through vectorisation unless the cost function evaluation itself can be vectorised. Even if this were the case it is unlikely to improve the total execution speed of the program by more than 2 or three times because it is unlikely to occupy more than 50% of the run time. It is very difficult to vectorise tree searching algorithms or monte-carlo algorithms, which are the core of branch-and-bound and annealing respectively.

At first inspection the branch-and-bound algorithm appears to be an ideal parallel application, because it involves searching a tree of potential solutions. Performance of this algorithm on parallel machines is variable, depending on the nature of the pruning and workload allocation heuristics. Some anomalies arise in parallel solutions in which more nodes of the tree are searched than in their sequential counterparts. In [8] a number of studies have shown excellent speedup, however there are still many difficulties reported [6]. In [11] a new machine design is proposed to account for some of these difficulties. In general, the better the pruning scheme, and thus the faster the sequential execution, the poorer the parallel efficiency.

Similarly, simulated annealing is an inherently parallel algorithm because it simulates the motion of particles found in nature. However, in most practical problems multiple moves cannot be freely made in parallel because they are dependent on each other. There has been quite a deal of research to try and decouple the moves to facilitate parallel execution [14], however, the effective speedup has been quite low in practice. In both branch-and-bound and annealing, even when good parallel speedup is achieved, it requires the use of very expensive hardware platforms.

In 1991 Abramson described a special purpose machine for solving an integer programming problem using simulated annealing [5]. The hardware which was built executed the code about 100 times faster than the same program running on a workstation and cost less than a personal computer to build. Interestingly, this machine gained its performance from two main sources. First, it utilised very low level concurrency which cannot be extracted by vector and parallel computers. Second, it avoided all address arithmetic normally required for matrix manipulation. Complete details of the implementation are available in [5] and [3]. The board was implemented using conventional logic devices and was hosted by a PC or workstation. It was controlled by a simple finite state machine which implemented the annealing algorithm, and contained no fast logic or pipelined stages. The latter techniques would have enhanced the performance by another order of magnitude.

The specific architecture described in [5] is not general enough to handle more than one problem class, and certainly cannot execute other search algorithms like branch-and-bound. It was also constructed from conventional logic devices, and thus cannot be modified easily. Since that time, Abramson and Dang [12, 2] have developed a number of enhancements to the original algorithm. This new algorithm, called Generalised Simulated Annealing has the ability to solve arbitrary 0-1 problems with both linear and non-linear cost functions. It also differs from normal annealing in that it prunes the search space as it proceeds in the same way as branch-and-bound. This means that for certain problems it behaves more like a exact solution method [1]. It also incorporates a number of more efficient cooling schemes than normally used [4]. Because it has many similarities to the one already implemented in [5], we believe it would be highly amenable to execution in hardware.

Postula has been working in the area of automatic synthesis for a number of years, and this work is of direct relevance to the project described in this proposal. His work has been targeted at taking very high level descriptions of architectures, using such languages as VHDL, and producing compilers which automatically translate such specifications into hardware. It is planned that a number of different optimisation architectures will be produced, as opposed to one architecture than can be used to execute a number of different algorithms. High level specification languages will allow these architectures to be written in a concise form, simulated and then compiled into logic for the FPGAs. The architectures will also be specified using conventional schematic circuit techniques, and the quality of the two forms can be compared.

Research Plan

The plan for this project is to design and implement application specific architectures which can implement both the simulated annealing and branch-and-bound algorithms. We plan to make extensive use of Field Programmable Gate Arrays as the core technology, and will implement the system in two stages. In the first stage we will purchase a board manufactured by Aptix containing a number of Xilinx FPGAs and switching chips. This board provides a totally reconfigurable machine because it is not only possible to reconfigure the connections within the FPGAs but also between them. The boards do not contain any memory, so we will need to add memory via extension circuit boards. The Aptix system will allow us to experiment with the types of constructs required by the optimisation algorithms under consideration. It is likely that we will only be able to add sufficient memory to solve small problems with this system.

During stage 2 we will develop a board more specifically tailored for optimisation. Whilst it will be built from reconfigurable logic, it will have an architecture more closely matched to the problem. It will also contain sufficient memory to solve realistic problems. We plan to reuse some of the parts purchased during stage 1. The board will be interfaced to a general purpose processor like a SUN Sparc Station, probably through its S-bus interface. This will allow the main application program to execute on the Sparc Station, whilst using the special purpose processor to perform the appropriate acceleration.

An important component of this research is that the performance achieved on the algorithms under consideration will be compared to equivalent code running on workstations and super computers (including the Brisbane MIMD supercomputer funded by ARC Mechanism C). From this, it will be possible to determine the cost effectiveness of the solution and consider its generalisation for other problems.

Relationship to other work

In the past few years the capacity of FPGAs has grown substantially, and there are now many projects around the world in which they are being proposed as the building blocks for special purpose architectures [15]. Below is a small table (from [10]) showing a representative sample of these, and their performance.
Application	                         Performance

RSA Cryptography                         200 kbits per second
String Matching                          200 kwords per second
Heat and Laplace Equation Solution       5 GFLOPS
N-body evolution                         2.5 GFLOPS
Binary 2-d Convolution                   200 MIPS
Binary Neural Net                        500 Mega Synapses per second
3D Geometry Graphics Accelerator         300 MIPS
Discrete Cosine Transform                15000 MIPS

This small summary shows that enormous performance is possible using FPGAs. The systems which were developed for these projects are similar to that being proposed in this project in that they use a number of FPGAs, memory and possibly floating point processors.

Bibliography

1. Abramson D., Dang H., and Krishnamoorthy, M. "Enhanced Simulated Annealing Through Linear Programming Preprocessing", The 12th National Conference of the Australian Society for Operations Research. Adelaide, July 7-9, 1993.

2. Abramson D.,Dang H., and Krishnamoorthy, M. "A Comparison of Two Methods for Solving 0-1 Integer Programs Using a General Purpose Simulated Annealing", submitted to Annals of Operations Research.

3. Abramson, D. A. and Dang, H. "School Timetables: A Case Study in Simulated Annealing", Applied Simulated Annealing, Lecture Notes in Economics and Mathematics Systems, Springer-Verlag, Chapter 5, ISBN 3-540-56229-X.

4. Abramson, D.A. , Dang, H. and Krisnamoorthy, M., (1994), Cooling Schedules for Simulated Annealing Based Scheduling Algorithms, Proceeding of the 17th Australian Computer Science Conference, Univ of Caterbury, Christschurch, NZ, Jan 1994.

5. Abramson, D.A., "A Very High Speed Architecture to Support Simulated Annealing", IEEE Computer, May 1992.

6. Altman E, Marsland T and Breitkreutz, "Accounting for Parallel Tree Search Overheads", International Conference on Parallel Processing, 1988, Vol 3, pp 198- 201

7. Arthur, J. L., Frendewey J. O. and Schumichrast, R. T. "Experience with Vectorizing a Linear Programming Code" (manuscript(, Oregon State University, Corvallis, OR, 1987.

8. Arvindam S, Kumar V and Rao, N, "Efficient Parallel Algorithms for Search Problems: Applications in VLSI CAD", The Third Symposium on the Frontiers of Massively Parallel Computation, October 1990, University of Maryland, pp 166-169.

9. Beasley, J. E. "Linear Programming on CRAY Supercomputers" (manuscript), Derpartment of Management Science, Imperial College London, England, Septemper 1987.

10. Bertin, P, Roncin and Vuillemin, "Programmable Active Memories: A Performance Assessment", Technical Report, DEC Paris Laboratories, 1990.

11. Cheng K and Wang Q, "An Asynchronous Multiprocessor Design for Branch-and- Bound Algorithms", The Third Symposium on the Frontiers of Massively Parallel Computation, October 1990, University of Maryland, pp 65-68.10

12. Connolly, D. "General Purpose Simulated Annealing", J. Operations Research Soc. Vol 43, Number 5, pp 495-505.

13. Edwards J. Tomlin, J. "Parallel Cholesky Factorization", Proceedings of 5th Australian Supercompting Conference, pp 105 - 114, December 1992.

14. Greening, D. "A Taxonomy of Parallel Simulated Annealing Techniques", UCLA Computer Science Department technical report number CSD-890050, August 21, 1989, Los Angeles, California.

15. IEEE Computer Society, "Proceedings of IEEE Workshop in FPGAs for Custom Computing Machines", April 5-7, 1993, Napa, California.

16. James M. Ortega, Introduction to Parallel and Vector Solutions of Linear Systems, Frontiers of Computer Science Series, Plenum Press, 1988.

17. Thomborson, C. "Does your workstation computation belong on a supercomputer?", Comms. of A.C.M., November 1993, Vol 36, No. 11, pp 41-49.

Equipment and Resources

Guess (Griffith University Engine for Search Strategies) concerns the construction of architectures for solving large integer optimisation problems. It is currently supported by an ARC large grant for 1995-97.

Guess makes use of an Aptix AP4 reconfigurable logic board. This contains up to 16 Xilinx 4010 Field Programmable Gate Arrays, plus a number of Aptyx swtich chips. The switches make it possible to connect the pins of the Xilinx parts together, and thus the board is totally reconfigurable. Below is a photo and artwork of the Aptyx AP4 board:


The work will be carried out in the Schools of Computing and Information Technology and Micro-Electronics at Griffith University. Below is a photo of our laboratory:


A Sintering Machine

In 1996 we designed a specialised processor for the Monte-Carlo simulation of metallurgical sintering. Several possible architectures were considered. We demonstrated that such a specialised processor using commercially available gate array technology can solve the same problem more than 100 times faster than a modern high-end workstation. Even using slower FPGAs it is possible to achieve a speedup of over 50 times faster than a workstation, which supports the concept of programmable special purpose computers attached to general purpose machines. A prototype of the processor is now being built using Xilinx FPGA and Aptix FPIC switch technology. The work is further desribed in a paper listed below.

We made an MPEG movie of the output of the sintering machine. This version of the machine was implemented on the Aptix hardware. In this movie, the red regions are a cross sectional view of three particles. The green region is the pore in between the particles. As the particles are heated, the pore shrinks as a result of movement of atoms. The current implementation runs this model about 20 times faster than a Sun Sparcstation 5 workstation.

Further Reading

This page only has limited information about Guess. For more details, please consult the following publications: