EnFuzion's root and node processes communicate using the standard TCP/IP network protocol. TCP/IP allows EnFuzion to seamlessly combine platforms of different types, such as Linux and Windows, to work on the same problem within a single cluster.
The Dispatcher process on the root host is the central EnFuzion process. It starts and terminates all other EnFuzion processes, including processes on node hosts. If the Dispatcher terminates, all root and node processes are terminated, and all user files on node hosts are deleted. Node processes can be configured so that they do not terminate with the Dispatcher but are instead suspended until another instance of the Dispatcher is started. User processes are terminated and user files are deleted in all cases.
Each node host executes a node process, called a node server, or simply a node. The node maintains a permanent connection with the Dispatcher on the root, exchanges heartbeat information with the root, monitors the load on the local host, and handles the execution of all jobs on that node. A single host can run more than one node. Running several nodes on a single host is uncommon in production environments, but it can be useful for testing.
The Dispatcher provides facilities to simplify the management of node processes. It is able to handle processes on node hosts in a manner that is transparent to the user. It can start and stop node processes as specified by configuration files or through EnFuzion API commands. Nodes can also be started independently, in which case they initiate the connection with the Dispatcher.
On each node host, all EnFuzion processes execute under the same local user name. The name is specified during EnFuzion configuration. This user name can be different for each node and is fully configurable. Linux/Unix EnFuzion nodes can also be configured to execute jobs under user-specified accounts instead of the common EnFuzion account.
The following sections describe the starting of node processes on node hosts and the handling of network errors.
EnFuzion provides many mechanisms to start and manage EnFuzion node processes, which makes EnFuzion suitable for a wide range of environments. EnFuzion nodes can either be started and managed by the Dispatcher or they can be started independently of the Dispatcher.
EnFuzion provides several options for starting EnFuzion node processes from the Dispatcher. In the simplest case, standard methods for remote host access are used to start the nodes. These are described in more detail in the following sections on Windows NT/2000/XP and Linux/Unix. Alternatively, users can completely customize the node starting process by providing their own script instead of using the standard method.
Another option is to start EnFuzion nodes independently of the Dispatcher. These nodes either connect to an already executing Dispatcher, or wait for a connection request from a Dispatcher.
On Windows NT/2000/XP, standard Internet protocols for remote execution are not generally provided. EnFuzion supplies its own service, called the EnFuzion Starter Service, to start processes on remote nodes. The Starter Service handles initiation of EnFuzion processes on the local host and provides additional system management functionality to the EnFuzion root. See the Section called Starter Service in Chapter 3. As with telnet on Linux/Unix platforms, a user login name and password must be provided in a configuration file or through the API. EnFuzion provides mechanisms to avoid clear text passwords in configuration files. See the Section called Encrypted Passwords in enfuzion.nodes in Chapter 6.
Local user access is sufficient to use EnFuzion on Windows. However, administrative rights are required to install EnFuzion on Windows NT/2000/XP. For users with increased security requirements, EnFuzion provides additional security features, such as removing the need for a user account with local access on each host. See Chapter 6.
On Linux/Unix, ssh, rsh and telnet are the standard methods to start an EnFuzion node. The use of ssh is recommended, since it provides the simplest and the most secure way to start a node.
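As an illustration of the ssh method, the following minimal sketch builds the argument list for a remote command. It is not EnFuzion's actual start procedure: the function name `build_ssh_command` and the `node_command` string are hypothetical placeholders, and the real node start command comes from the EnFuzion installation and configuration. The only ssh feature assumed is the OpenSSH `BatchMode` option, which makes ssh fail rather than prompt for a password.

```python
def build_ssh_command(host, user, node_command):
    """Return an argv list that runs node_command on host through ssh.

    node_command is a placeholder (assumption); substitute the command
    that starts a node in your own installation.
    """
    # BatchMode=yes disables interactive password prompts, so a host
    # without working key-based access fails fast instead of hanging.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", node_command]


# Example: the resulting list can be passed to subprocess.run(argv).
argv = build_ssh_command("node01", "enfuzion", "echo node-start-placeholder")
print(argv)
```

Passing the command as a list avoids local shell quoting issues; the remote side still interprets `node_command` with the login shell of the remote account.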
If the telnet protocol is chosen to start a node process, telnet access must be enabled on the node. A user login name and password must be provided in a configuration file or through the API. Although clear text passwords can be used in configuration files, EnFuzion provides more secure alternatives that encrypt the passwords stored in these files. See the Section called Encrypted Passwords in enfuzion.nodes in Chapter 6.
Although EnFuzion may use the standard ftp protocol to speed up node startup over telnet, ftp is not required for successful EnFuzion operation. The only exception is when the initial EnFuzion installation step uses telnet, which requires ftp to copy files to node hosts. Alternatively, nodes can be installed without ftp by copying the files to the nodes manually. After EnFuzion is operational, ftp is not necessary.
Besides login access, no special privileges are required to install and use EnFuzion on Linux/Unix. The exception is when nodes are configured to allow EnFuzion users to specify the node accounts under which their programs are executed. EnFuzion processes do not require root access and can be run under any user account. Running EnFuzion under a regular user account strengthens security on nodes, since the privileges of EnFuzion processes are limited to those of the account under which they execute.
EnFuzion detects network failures and provides a wide range of features to deal with them. It handles failed nodes and automatically resubmits any jobs that were executing on a failed or disconnected node to an operational node.
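The resubmission behavior can be pictured with a small sketch. This is an illustrative model only, not EnFuzion's internal implementation: the function `resubmit_jobs`, the `assignments` mapping, and the choice to place recovered jobs at the front of the queue are all assumptions made for the example.

```python
from collections import deque


def resubmit_jobs(assignments, failed_node, queue):
    """Return jobs from failed_node to the waiting queue.

    assignments maps a node name to the list of jobs running on it.
    Recovered jobs go to the front of the queue (an assumption), so
    they are the next ones handed to an operational node.
    """
    lost = assignments.pop(failed_node, [])
    # appendleft reverses order, so iterate in reverse to preserve it.
    for job in reversed(lost):
        queue.appendleft(job)
    return lost


assignments = {"node-a": ["job1", "job2"], "node-b": ["job3"]}
queue = deque(["job4"])
lost = resubmit_jobs(assignments, "node-a", queue)
print(list(queue))
```

After the call, the failed node no longer appears in `assignments`, and its jobs wait in the queue ahead of work that had not yet started.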
At the basic level, EnFuzion detects when a network connection is closed, relying on the error handling capabilities of the underlying TCP/IP networking protocol. Unfortunately, these capabilities are not sufficient on their own. For example, if a network cable is simply pulled out, the TCP/IP protocol itself does not detect the failure, so it must be handled at a higher level.
EnFuzion implements a higher level of error detection through heartbeats exchanged between the root and node computers. If a heartbeat is not received within a specified time period, the node is declared down. The heartbeat interval is usually set to several minutes in order to reduce network traffic. Heartbeats work well for jobs that execute for several minutes or more. Short jobs that need a few seconds or less to execute require much faster error detection than heartbeats can provide.
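The heartbeat check described above amounts to comparing the time of the last heartbeat from each node against a timeout. The sketch below shows that idea under stated assumptions: the class `HeartbeatMonitor`, its method names, and the 300-second timeout are hypothetical and do not reflect EnFuzion's actual code or configuration settings.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per node and report nodes that timed out."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node name -> time of last heartbeat

    def record(self, node, now):
        """Store the arrival time (in seconds) of a heartbeat from node."""
        self.last_seen[node] = now

    def down_nodes(self, now):
        """Return, sorted by name, nodes silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=300)
monitor.record("node-a", now=0)
monitor.record("node-b", now=200)
print(monitor.down_nodes(now=400))
```

With a timeout of several minutes, a failure can go unnoticed for that long, which is why this mechanism alone does not suit jobs that finish in seconds.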
To handle network failures for very short jobs and to assure maximum throughput for this type of job, EnFuzion provides an additional mechanism that allows multiple executions of a single job. If a node becomes available for job execution and no other jobs in the run are waiting to be executed, an additional copy of an already executing job is started. As soon as one of the copies completes, the other copies are terminated. Users can specify the maximum number of executions of a single job through a predefined run variable, ENFMAX_JOB_COPIES. By default, ENFMAX_JOB_COPIES is set to 1, so only one copy of a job executes at any time; that is, this feature is turned off.
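The selection rule for an idle node can be sketched as follows. This is a simplified model, not EnFuzion's scheduler: the function `pick_job_for_idle_node` and its data structures are hypothetical, the `max_copies` argument stands in for the value of ENFMAX_JOB_COPIES, and duplicating the job with the fewest running copies is an assumption that the source does not specify.

```python
def pick_job_for_idle_node(waiting, running_copies, max_copies):
    """Choose a job for a node that has just become idle.

    waiting: list of job ids not yet started.
    running_copies: dict mapping a job id to its number of running copies.
    max_copies: upper limit on copies per job (stand-in for ENFMAX_JOB_COPIES).
    Returns the chosen job id, or None if nothing may be started.
    """
    if waiting:
        # Waiting jobs always take priority over duplicate copies.
        job = waiting.pop(0)
        running_copies[job] = 1
        return job
    candidates = [j for j, c in running_copies.items() if 0 < c < max_copies]
    if not candidates:
        return None
    # Assumption: duplicate the job with the fewest running copies.
    job = min(candidates, key=lambda j: running_copies[j])
    running_copies[job] += 1
    return job


waiting = ["job2"]
running = {"job1": 1}
first = pick_job_for_idle_node(waiting, running, max_copies=2)
second = pick_job_for_idle_node(waiting, running, max_copies=2)
print(first, second, running)
```

With `max_copies=1` the candidate list is always empty once the waiting list is drained, which matches the default behavior of the feature being turned off.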
EnFuzion works with security mechanisms provided by the underlying computing platforms. It also includes several enhancements, which strengthen standard system security. These enhancements provide additional security in accessing remote hosts and dealing with sensitive security related information. See the Section called Root Based Security Features in Chapter 6 for EnFuzion root based security features and the Section called Node Based Security Features in Chapter 7 for EnFuzion node based security features.