When a new job is started on a node, the node creates a job server process, which is responsible for the execution of a single job. The job server interprets and executes the task commands that have been specified for the run. Each job has a separate job server. When a job completes, its job server is terminated.
Executables for user applications that EnFuzion nodes run on remote hosts must be available on those hosts and included in the execution search path. If EnFuzion is unable to locate an executable file, the job returns an error. Executables can be either preinstalled or copied as part of the job execution.
Each job executes in its own unique job directory on the node, and all relative file names start from this directory. For details on directory handling, see the Section called Directory Layout above. The unique directory makes it possible to run multiple concurrent jobs on the same computer or on the same shared file system without conflicts between the file names of different jobs. For example, each job can write to a file called output. Although all jobs use the same file name, the files are distinct, because they reside in different directories.
The job directory contains files specific to the job. Its parent directory belongs to the run and contains files common to all jobs from that run. During job initialization, files from the parent run directory are made available as local files in the job directory. If file links are supported, links are established from the job directory to the run directory. On file systems that do not support links, such as some Windows-based file systems, files are copied from the run directory to the job directory. The job directory is deleted after the job completes in order to make disk space available for other jobs.
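The link-or-copy behavior described above can be illustrated with a minimal Python sketch. It is not EnFuzion's implementation: the function name populate_job_dir and the choice of symbolic links are assumptions made for illustration, and the source does not say which kind of link EnFuzion uses.

```python
import os
import shutil
from pathlib import Path


def populate_job_dir(run_dir: Path, job_dir: Path) -> dict[str, str]:
    """Make each regular file in run_dir available in job_dir.

    Hypothetical helper: tries a symbolic link first and falls back to a
    copy when the file system refuses to create links.
    Returns a mapping of file name -> "link" or "copy".
    """
    job_dir.mkdir(parents=True, exist_ok=True)
    result = {}
    for src in sorted(run_dir.iterdir()):
        if not src.is_file():
            continue  # skip job subdirectories and other non-file entries
        dst = job_dir / src.name
        if dst.exists() or dst.is_symlink():
            continue  # a job-specific file with this name is already present
        try:
            os.symlink(src.resolve(), dst)
            result[src.name] = "link"
        except (OSError, NotImplementedError):
            # Links unsupported or not permitted: copy the file instead.
            shutil.copy2(src, dst)
            result[src.name] = "copy"
    return result
```

Either way, the job sees the common run files under the same relative names in its own directory.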
The handling of job directories is different if the job is executing under a user-specified account. In that case, the job directory is set to the account home directory, common job files are not copied to the job directory, and the directory is not deleted after the job completes. Users are responsible for deleting obsolete files.
A job server may need to execute certain commands on the root host. For that purpose, it maintains a connection with the root host. This connection is separate from the connection between the node and the Dispatcher on the root host. The connection can be either permanent or temporary. A permanent connection is established when the job starts and disconnected when it ends. A temporary connection is established only when commands are issued that require access to the root host. The user selects the mode by changing the predefined run variable ENFPERMANENT. The default value is false, which selects a temporary connection.
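The difference between the two modes can be modeled with a small Python sketch. The class RootConnection and the function count_connections are hypothetical and only count how often a connection would be established; the sketch assumes that a temporary connection is closed again after each root command, which the text above implies but does not state.

```python
class RootConnection:
    """Toy stand-in for a job server's link to the root host (hypothetical)."""

    def __init__(self, permanent: bool = False):
        # Mirrors the documented default of ENFPERMANENT (false = temporary).
        self.permanent = permanent
        self.is_open = False
        self.opens = 0  # number of times a connection was established

    def _open(self):
        self.is_open = True
        self.opens += 1

    def _close(self):
        self.is_open = False

    def job_started(self):
        if self.permanent:
            self._open()  # permanent mode connects once, at job start

    def root_command(self, command: str) -> str:
        if not self.is_open:
            self._open()  # temporary mode connects on demand
        reply = f"ran {command!r} on root"
        if not self.permanent:
            self._close()  # assumption: closed again after the command
        return reply

    def job_finished(self):
        if self.is_open:
            self._close()


def count_connections(permanent: bool, commands: list[str]) -> int:
    """Return how many connections one job would establish."""
    conn = RootConnection(permanent)
    conn.job_started()
    for command in commands:
        conn.root_command(command)
    conn.job_finished()
    return conn.opens
```

In this model, a job that issues no root commands never connects in temporary mode, while a job that issues many of them connects only once in permanent mode.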
All file references on the root host are relative to the run directory. When a file is copied from a node to the root host, users usually extend its name with a unique identifier to distinguish files from different jobs. This extension can be constructed from a combination of parameter values or can simply be the system-defined parameter ENFJOBNAME, which is unique for each job.
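The two naming approaches can be sketched in Python as follows. The helper names, the dot separator, and the sample values are assumptions for illustration only; the actual format of ENFJOBNAME values is not specified here.

```python
def name_from_job_name(base: str, job_name: str) -> str:
    """Extend a file name with the job's unique name (the ENFJOBNAME value)."""
    return f"{base}.{job_name}"


def name_from_parameters(base: str, params: dict[str, object]) -> str:
    """Extend a file name with a suffix built from parameter values.

    The result is unique only if no two jobs share the same combination
    of parameter values.
    """
    suffix = "_".join(f"{key}{value}" for key, value in sorted(params.items()))
    return f"{base}.{suffix}"
```

For example, name_from_job_name("output", "j1") gives "output.j1", and name_from_parameters("output", {"x": 1, "y": 2}) gives "output.x1_y2".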
EnFuzion distinguishes two types of errors: system errors and user errors.
System errors are caused by computer or network failures. If a node fails, jobs executing on that node are automatically restarted elsewhere. No user intervention is required.
User errors are caused by missing files or by user commands that return a nonzero exit status. The handling of these errors can be specified with the onerror command. Depending on the user-specified option, a job with a user error can fail, be repeated on a different computer, or continue with the execution.
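The three outcomes can be modeled with a short Python sketch. It does not show the syntax of the onerror command; the enum OnError, the function run_commands, and the returned strings are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Callable


class OnError(Enum):
    """Hypothetical labels for the three user-selectable outcomes."""
    FAIL = "fail"
    REPEAT = "repeat"
    CONTINUE = "continue"


def run_commands(commands: list[Callable[[], int]], policy: OnError) -> str:
    """Run commands in order; each returns an exit status (0 = success)."""
    for command in commands:
        status = command()
        if status == 0:
            continue
        if policy is OnError.FAIL:
            return "failed"
        if policy is OnError.REPEAT:
            # The caller would rerun the whole job on a different computer.
            return "reschedule"
        # OnError.CONTINUE: ignore the error and move on to the next command.
    return "completed"
```

With a command list in which the second command returns 1, the FAIL policy yields "failed", REPEAT yields "reschedule", and CONTINUE yields "completed".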
If a job fails on a node computer, either because the user application fails or one of the EnFuzion commands detects an error, then EnFuzion automatically copies the entire job directory from the node to the root. The directory is named error.<job_name>, where <job_name> is replaced with the name of the job. The ability to inspect the contents of the remote directory significantly simplifies error diagnosis.
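As a rough local model of this step, the Python sketch below copies a job directory into a directory named error.<job_name>. The function preserve_failed_job is hypothetical, the copy here is between two local paths rather than across the network, and placing the result inside the run directory on the root is an assumption.

```python
import shutil
from pathlib import Path


def preserve_failed_job(job_dir: Path, root_run_dir: Path, job_name: str) -> Path:
    """Copy the whole job directory to error.<job_name> for later inspection."""
    target = root_run_dir / f"error.{job_name}"
    # copytree copies the directory with all of its files and subdirectories;
    # it raises FileExistsError if the target already exists.
    shutil.copytree(job_dir, target)
    return target
```

After the copy, partial output files and logs from the failed job can be examined on the root host without logging in to the node.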
Job execution errors are described in more detail in the Section called Timeouts/Error Handling in Chapter 8.