April 2004: Hathor is no longer available for general use; it is now being used for Globus.

Figure 1 is a diagram detailing how this network is set up.

Ischus is the disk and accounts server. It has 300GB of disk space for user accounts, which is available via NFS on all the machines running Linux. A 40GB shared scratch partition, which is not backed up, is also exported from ischus to all Linux machines. It can be used for large temporary files that need to be available on all machines. Ischus is also an NIS server, providing account information to all the machines.

Mahar is the main login server, used for starting and controlling jobs on the nodes. It is also the only machine available from the Monash network.

The cluster is made up of the new machines (ischus, mahar and node01-50, purchased in December 2003), all on a Gigabit network, and the old machines (hathor and node51-64, purchased between 1999 and 2002) on a 100 Megabit network. The new machines have 3GHz Pentium 4 processors with 1GB RAM and 36GB hard disk drives. The older ones are dual 700/800 MHz Pentium IIIs.

Mahar is connected to the University's Grangenet network (managed by ITS), while all nodes and the file server have private IP addresses in the 172.16.65.0/255.255.255.0 subnet. Only mahar is fully accessible from within the university network.
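As a small illustration of the addressing described above, the following Python sketch checks whether an address falls inside the 172.16.65.0/255.255.255.0 subnet. Only the subnet itself comes from the text; the function name `is_cluster_private` and the sample addresses are made up for the example and do not correspond to known machines.

```python
import ipaddress

# Private subnet quoted in the text: 172.16.65.0 with netmask 255.255.255.0.
CLUSTER_SUBNET = ipaddress.ip_network("172.16.65.0/255.255.255.0")


def is_cluster_private(address: str) -> bool:
    """Return True if the address lies inside the cluster's private subnet."""
    return ipaddress.ip_address(address) in CLUSTER_SUBNET


if __name__ == "__main__":
    # Hypothetical sample addresses, chosen only to show both outcomes.
    for sample in ("172.16.65.10", "172.16.66.10"):
        print(sample, is_cluster_private(sample))
```

An address inside this range is not routable from outside the cluster, which is consistent with mahar being the only machine reachable from the university network.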
When you are using enFuzion, try to use local disk storage on the node. Consider a run file containing the following commands:

    copy executableprogram node:.
    copy datafile1 node:.
    copy datafile2 node:.
    node:execute executableprogram $i $j
    copy node:output output.$jobname

If the run file is in your home directory and you run it on hathor, each job will generate the following network traffic:
copy executableprogram node:.
Hathor will copy executableprogram to the node. To do this it
has to first of all copy it from the file server, ischus. Then
it sends it to the node. When the node receives the file, it
will write it to somewhere under your home directory. This will
involve sending it to ischus which will copy it onto the hard disk.
copy datafile1 node:.
Hathor will copy datafile1 to the node. To do this it has to
first of all copy it from the file server, ischus. Then it sends
it to the node. When the node receives the file, it will write it
to somewhere under your home directory. This will involve sending
it to ischus which will copy it onto the hard disk.
copy datafile2 node:.
You're probably beginning to get the idea!
node:execute executableprogram $i $j
If we're lucky, NFS on the node will have cached executableprogram,
otherwise it has to be brought across the network from ischus.
Likewise for the data files. The output file is written to your
home directory, so it also has to be sent across the network to
ischus.
copy node:output output.$jobname
Hathor will send a request to the node for this file. The node
will then get the file from ischus and send it to hathor. Hathor
will then write it to a directory which involves sending it back
to ischus.
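The walkthrough above can be tallied with a rough Python model. This is only a sketch of the description, not a measurement: each tuple stands for one file transfer between two machines, the small request message hathor sends to the node is ignored, and all function names are invented for the illustration.

```python
# Each tuple is one file transfer (source, destination) from the walkthrough.

def copy_to_node():
    # hathor reads the file from ischus, sends it to the node, and the node
    # writes it into the NFS home directory, which means back to ischus.
    return [("ischus", "hathor"), ("hathor", "node"), ("node", "ischus")]


def execute(nfs_cached, input_files=3):
    # Uncached inputs (the executable and two data files) are read from
    # ischus; the output file is always written back to ischus.
    reads = [] if nfs_cached else [("ischus", "node")] * input_files
    return reads + [("node", "ischus")]


def copy_from_node():
    # The node fetches the output from ischus, sends it to hathor, and
    # hathor writes it to the home directory, which is on ischus again.
    return [("ischus", "node"), ("node", "hathor"), ("hathor", "ischus")]


def job_transfers(nfs_cached):
    transfers = []
    for _ in range(3):  # executableprogram, datafile1, datafile2
        transfers += copy_to_node()
    transfers += execute(nfs_cached)
    transfers += copy_from_node()
    return transfers


def count_involving(transfers, host):
    # Number of transfers that have the given host at either end.
    return sum(1 for src, dst in transfers if host in (src, dst))


if __name__ == "__main__":
    for cached in (True, False):
        t = job_transfers(cached)
        print("NFS cached:", cached, "total:", len(t),
              "via ischus:", count_involving(t, "ischus"))
```

Under these assumptions a single job causes 13 transfers when NFS has everything cached and 16 when it does not, and most of them pass through ischus.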
As you can see, there is an incredible amount of unnecessary traffic. Each of the nodes has a large amount of scratch disk, which you can use to eliminate almost all of the network traffic except for the final copying back of results. Even that step can easily be reduced to half the amount of network traffic.

So how can this be accomplished? On each of the nodes, create a directory in /scratch and precopy all the files into it. Then, when you run a job, don't copy all the files across to the node; use the ones in your /scratch directory instead.

Remember that any file access in your home directory really goes to ischus. File accesses in /scratch on the nodes do not go over the network and hence are much faster.

Copyright © 1998-2005 Monash University
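The precopy step could be scripted along the following lines. This is a minimal Python sketch under stated assumptions: the helper name `stage_files` and the subdirectory name `myjob` are hypothetical, the demo uses temporary directories as stand-ins for a home directory and for /scratch, and how the script gets run on each node is not covered here.

```python
import shutil
import tempfile
from pathlib import Path


def stage_files(files, scratch_root, subdir):
    """Copy the given files into scratch_root/subdir and return the new paths."""
    target = Path(scratch_root) / subdir
    target.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    staged = []
    for name in files:
        source = Path(name)
        destination = target / source.name
        shutil.copy2(source, destination)  # copies contents and timestamps
        staged.append(destination)
    return staged


if __name__ == "__main__":
    # Temporary directories stand in for the home directory and /scratch.
    with tempfile.TemporaryDirectory() as home, \
            tempfile.TemporaryDirectory() as scratch:
        names = ["executableprogram", "datafile1", "datafile2"]
        for name in names:
            (Path(home) / name).write_text("placeholder\n")
        staged = stage_files([Path(home) / n for n in names], scratch, "myjob")
        print([p.name for p in staged])
```

On a node, `scratch_root` would presumably be /scratch, and the staging would need to happen once per node before any jobs are started there.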