Nimrod: Tools for Distributed Parametric Modelling

 

Nimrod Background

Nimrod is a tool that manages the execution of parametric studies across distributed computers. It takes responsibility for the overall management of an experiment, as well as the low-level issues of distributing files to remote systems, performing the remote computation and gathering the results. EnFuzion is a commercial version of the research system Nimrod. When a user describes an experiment to Nimrod, they develop a declarative plan file which describes the parameters, their default values, and the commands necessary for performing the work. The system then uses this information to transport the necessary files and schedule the work on the first available machine.

A plan file is composed of two main sections: the parameter section and the tasks section. Figure 1 shows a sample plan used for some of the simulation work discussed in this paper. The experiment varies the thickness parameter, and each execution receives a different seed value. Nimrod generates one job for each unique combination of parameter values by taking the cross product of all values. The plan in Figure 1 generates 400 jobs for the computational physics case study: 40 values of iseed times 10 values of thick, with jseed computed from thick rather than varied independently.
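The job count can be checked with a minimal Python sketch of the cross product. This is an illustration only, not Nimrod's generator: the function name `generate_jobs` is hypothetical, and rounding `thick*1000` to the nearest integer is an assumption, since the plan does not say how the computed value is converted to an integer.

```python
from itertools import product

# Values mirror the two ranged parameter lines of the sample plan.
iseed_values = list(range(100, 4001, 100))                    # 100, 200, ..., 4000
thick_values = [round(1.1 + 0.1 * i, 1) for i in range(10)]   # 1.1, 1.2, ..., 2.0

def generate_jobs(iseeds, thicks):
    """Return one job (a dict of parameter values) per combination."""
    jobs = []
    for iseed, thick in product(iseeds, thicks):
        # jseed is derived from thick, so it adds no extra combinations.
        # Rounding to the nearest integer is an assumption of this sketch.
        jobs.append({"iseed": iseed, "thick": thick, "jseed": round(thick * 1000)})
    return jobs

jobs = generate_jobs(iseed_values, thick_values)
print(len(jobs))  # 400
```

Adding another ranged parameter would multiply the job count by the number of values in its range, whereas another computed parameter would leave it unchanged.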

When the user invokes Nimrod on a workstation, the machine becomes known as the root machine because it controls the experiment. When the dispatcher executes code on remote platforms, each of these is known as a computational node. Thus a given experiment can be conducted with one root and multiple nodes, each of a different architecture if required. Nimrod supports five phases of a computational experiment. Phases 1 and 5 are performed once per experiment, while Phases 2, 3 and 4 are run for each distinct parameter set.

  1. Experiment pre-processing, when data is set up for the experiment;
  2. Execution pre-processing, when data is prepared for a particular execution;
  3. Execution, when the program is executed for a given set of parameter values;
  4. Execution post-processing, when data from a particular execution is reduced;
  5. Experiment post-processing, when results are processed, for example by running data interpretation or visualization software.
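
The ordering of the five phases can be sketched in Python as follows. The function `run_experiment` is hypothetical and only records which phase would run for which parameter set. It walks the parameter sets one after another for readability, whereas in a real experiment different parameter sets would be handled on different nodes.

```python
def run_experiment(parameter_sets):
    """Return the ordered list of (phase, parameter_set) steps for an experiment."""
    steps = [(1, None)]                  # experiment pre-processing, once
    for params in parameter_sets:
        steps.append((2, params))        # execution pre-processing
        steps.append((3, params))        # execution
        steps.append((4, params))        # execution post-processing
    steps.append((5, None))              # experiment post-processing, once
    return steps

steps = run_experiment([{"thick": 1.1}, {"thick": 1.2}])
print([phase for phase, _ in steps])  # [1, 2, 3, 4, 2, 3, 4, 5]
```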

parameter iseed integer range from 100 to 4000 step 100;
parameter thick label "BUC thickness" float range from 1.1 to 2.0 step 0.1;
parameter jseed integer compute thick*1000;

task nodestart
    copy ccal.$OS node:./ccal
    copy dummy node:.
    copy ccal.dat node:.
    copy skel.inp node:.
endtask

task main
    node:substitute skel.inp ccal.inp
    node:execute ./ccal
    copy node:ccal.op ccalout.$jobname
endtask

Figure 1. Sample plan file


During phases 2 and 4, files may be moved between the root machine and the computational nodes. Unlike general parallel computing, this is the only communication that occurs between tasks. In this example there are two tasks, main and nodestart. The main task is executed for each set of parameters. It runs a simulation, called ccal, on a node, passing the parameter values to the program via a parameter file. It then copies the result file back to the root machine, appending a unique job identifier to its name. This task corresponds to the 3rd phase discussed in the previous section. The nodestart task executes once per experiment and corresponds to the 1st phase described in the previous section. It copies to the remote node a file that is common across all simulations and a skeleton of the input parameter file. The skeleton is later processed by the substitute command in the main task, which replaces placeholders with the actual parameter values. The nodestart task also copies the correct binary (ccal) for the target operating system.
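
The effect of the substitute step can be illustrated with a short Python sketch. It is not Nimrod's implementation: the `$name` placeholder syntax and the contents of the skeleton are assumptions made for this example, since the text does not show skel.inp.

```python
from string import Template

def substitute(skeleton_text, params):
    """Fill every $name placeholder in the skeleton with the job's value."""
    # Template.substitute raises KeyError if a placeholder has no value.
    return Template(skeleton_text).substitute(params)

# Hypothetical skeleton; the real skel.inp is not shown in the text.
skeleton = "thickness = $thick\nseed1 = $iseed\nseed2 = $jseed\n"
result = substitute(skeleton, {"thick": 1.1, "iseed": 100, "jseed": 1100})
print(result)
```

Each job would produce its own filled-in copy of the skeleton (ccal.inp in the sample plan), so the simulation binary itself never needs to know about the experiment's parameter ranges.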

The plan file is processed by a tool called the generator. The generator takes the parameter definitions and lets the user choose the actual values. It then builds a run file, which contains a description of each job. The run file is processed by another tool called the dispatcher, which is responsible for managing the computation across the nodes. The dispatcher implements the file transfer commands, as well as the execution of the model on the remote node. In Nimrod and EnFuzion, the dispatcher allocates jobs to machines as they become available, without any attempt to schedule their execution.
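
A first-available allocation policy of this kind can be sketched as a small Python simulation. The function `dispatch`, the node names, and the job durations are all hypothetical. Durations are supplied only so the simulation can advance time; a real dispatcher would not know them in advance and would simply react when a node reports that it is idle.

```python
import heapq

def dispatch(job_durations, node_names):
    """Assign each job, in run-file order, to whichever node becomes free first.

    Returns a list of (job_index, node_name, start_time) tuples.
    """
    # Heap of (time the node becomes free, node name); all nodes start idle.
    free_at = [(0.0, name) for name in node_names]
    heapq.heapify(free_at)
    assignments = []
    for job_index, duration in enumerate(job_durations):
        start, node = heapq.heappop(free_at)       # earliest available node
        assignments.append((job_index, node, start))
        heapq.heappush(free_at, (start + duration, node))
    return assignments

assignments = dispatch([3.0, 1.0, 2.0, 1.0], ["nodeA", "nodeB"])
print(assignments)
# [(0, 'nodeA', 0.0), (1, 'nodeB', 0.0), (2, 'nodeB', 1.0), (3, 'nodeA', 3.0)]
```

No attempt is made to match jobs to nodes or to reorder the queue, which is the sense in which the allocation involves no scheduling.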

There are two main versions of Nimrod: Nimrod/G and Nimrod/O.

 

 

