Administration

April 2004: Hathor is no longer available for general use; it is now being used for Globus.


Figure 1 is a diagram detailing how this network is set up.

Ischus is the disk and accounts server. It has 300GB of disk space for user accounts, which is available on all the Linux machines via NFS. A 40GB shared scratch partition, which is not backed up, is also exported from ischus to all the Linux machines; it can be used for large temporary files that need to be available on all machines. Ischus is also an NIS server, providing account information to all the machines. Mahar is the main login server and is used for starting and controlling jobs on the nodes; it is also the only machine available from the Monash network.
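A quick way to see whether a given directory is served over the network or lives on local disk is df: an NFS mount from the file server shows ischus:... in the Filesystem column, while a local disk shows a /dev/... device. The paths below are illustrative only:

  df ~            # home directories are served by ischus over NFS
  df /scratch     # on a node, this is the node's own local disk (see below)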

The cluster is made up of the new machines (ischus, mahar and node01-50, purchased in December 2003), which are all on a gigabit network, and the old machines (hathor and node51-64, purchased between 1999 and 2002), which are on a 100 megabit network. The new machines have 3GHz Pentium 4 processors with 1GB RAM and 36GB hard disks; the older ones have dual 700/800MHz Pentium III processors.

Mahar is connected to the University's GrangeNet network (managed by ITS), with all the nodes and the file server having private IP addresses in the 172.16.65.0/255.255.255.0 subnet. Only mahar is fully accessible from within the university network.

Figure 1: Network Diagram

  Number of Nodes   Speed & Type of Processor
  4                 2 x 700MHz Pentium III
  3                 2 x 800MHz Pentium III
  4                 1.2GHz AMD Athlon
  50                3GHz Pentium 4

When you are using enFuzion, try to use local disk storage on the node (the /scratch directory) to store commonly used files such as executable programs and data files. This can significantly reduce unnecessary network traffic. For example, consider an enFuzion run file containing the following instructions. Note that "executableprogram" reads the two data files ("datafile1" and "datafile2") as input and writes its output to a file called "output".

  copy executableprogram node:.
  copy datafile1 node:.
  copy datafile2 node:.
  node:execute executableprogram $i $j
  copy node:output output.$jobname

If the run file is in your home directory and you run it on hathor, each job will generate the following network traffic:

  copy executableprogram node:.
    Hathor will copy executableprogram to the node.  To do this it
    first has to fetch the file from the file server, ischus, and
    then send it to the node.  When the node receives the file, it
    writes it somewhere under your home directory, which means
    sending it back to ischus, which writes it onto its disk.
  copy datafile1 node:.
    Hathor will copy datafile1 to the node.  Again it has to fetch
    the file from ischus first, then send it to the node; the node
    writes it under your home directory, which sends it back to
    ischus to be written to disk.
  copy datafile2 node:.
    you're probably beginning to get the idea!
  node:execute executableprogram $i $j
    If we're lucky, NFS on the node will have cached executableprogram;
    otherwise it has to be brought across the network from ischus.
    Likewise for the data files.  The output file will be written to
    your home directory, which means it has to be sent across the
    network to ischus.
  copy node:output output.$jobname
    Hathor will send a request to the node for this file.  The node
    will then get the file from ischus and send it to Hathor, which
    will then write it to a directory under your home directory,
    which again involves sending it back to ischus.

As you can see, there is an incredible amount of unnecessary traffic. Each of the nodes actually has a large amount of local scratch disk which you can use to eliminate almost all of this network traffic, except for the final copying back of results. Even that step can easily be reduced to half the traffic: if the output file is written to the node's local disk rather than to your home directory, it does not have to make the extra round trip through ischus before it is copied back.

So how can this be accomplished? On each of the nodes, create a directory in /scratch and pre-copy all the files into it. Then, when you run a job, don't copy all the files across to the node; use the ones in your /scratch directory instead. Remember that any file access in your home directory really goes to ischus, whereas file accesses under /scratch on a node stay on that node's local disk, never touch the network, and are therefore much faster. One way of setting this up is sketched below.
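The following is only a sketch: the directory /scratch/yourname, the wrapper script runjob.sh and the use of ssh/scp for the one-off pre-copy are assumed names and tools, not part of enFuzion, so adapt them to your own account and to however you normally reach the nodes.

First, copy the executable and the data files onto each node's local scratch disk. This is done once, before the run:

  #!/bin/sh
  # precopy.sh (assumed name): one-off setup.  Copies the program and its
  # data files into a per-user directory on each node's local disk.
  for n in node01 node02 node03        # ...list every node you intend to use
  do
      ssh "$n" mkdir -p /scratch/yourname
      scp executableprogram datafile1 datafile2 "$n":/scratch/yourname/
  done

Then use a small wrapper so that each job reads those local copies instead of pulling everything over the network:

  #!/bin/sh
  # runjob.sh (assumed name): per-job wrapper.  Runs the program from local
  # /scratch so the executable and data files are read from the node's own
  # disk, then moves the result back to the directory enFuzion started us
  # in, so that the run file's "copy node:output ..." line still finds it.
  cd /scratch/yourname || exit 1
  ./executableprogram "$1" "$2"
  mv output "$OLDPWD/output"

The run file now only has to ship the tiny wrapper for each job (invoking it via sh avoids having to worry about the execute permission after the copy):

  copy runjob.sh node:.
  node:execute sh runjob.sh $i $j
  copy node:output output.$jobname

If enFuzion's copy command can fetch a file from an absolute path on the node, the mv step can be dropped and the output copied straight out of /scratch; check the enFuzion Manual for the exact syntax.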


Copyright © 1998-2005 Monash University