Timeouts/Error Handling

EnFuzion provides several mechanisms that control how job execution errors are handled. This section summarizes them.

User Errors

With the onerror command, users can specify how user errors during execution are handled. Such errors can be caused by, for example, missing files or user programs that return a non-zero exit status. By default, a job with a user error fails. Alternatively, user errors can be ignored, in which case the job completes successfully, or the job can be rescheduled for execution. See the Section called Command onerror for details on the onerror task command.
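The three outcomes described above can be modeled in a few lines of Python. This is only a conceptual sketch, not EnFuzion syntax: the names OnError and resolve_job are hypothetical and serve to illustrate the default (fail), ignore, and reschedule behaviors.

```python
from enum import Enum


class OnError(Enum):
    """Hypothetical names for the three handling policies described above."""
    FAIL = "fail"              # default: a job with a user error fails
    IGNORE = "ignore"          # the error is ignored, the job completes successfully
    RESCHEDULE = "reschedule"  # the job is scheduled for execution again


def resolve_job(exit_status: int, policy: OnError = OnError.FAIL) -> str:
    """Return the job outcome for a user program's exit status."""
    if exit_status == 0:
        return "completed"      # no user error occurred
    if policy is OnError.IGNORE:
        return "completed"
    if policy is OnError.RESCHEDULE:
        return "rescheduled"
    return "failed"
```

For example, resolve_job(1) yields "failed", while resolve_job(1, OnError.IGNORE) yields "completed".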

Timeout for Run Execution

An execution limit can be specified for a run. If the execution of a run exceeds this limit, the run is terminated.

By default, the run execution limit is infinite. The limit is stored, in seconds, in the run variable ENFEXECUTION_LIMIT.

Timeout for Job Execution

An execution limit can be specified for jobs. The limit is independent of the run execution limit. If a job execution exceeds this limit, it is terminated with failure.

By default, the job execution limit is infinite. The limit is stored, in seconds, in the run variable ENFJOB_EXECUTION_LIMIT. It is valid for all jobs in the run. This limit does not apply to individual datajobs.

Timeout for User Programs

An execution limit can be specified for user programs. While the ENFJOB_EXECUTION_LIMIT is valid for the entire job, the time limit for user programs specifies how long an individual user command within a job can execute. If a user program execution exceeds this limit, it is terminated with failure.

By default, the timeout for user programs is infinite. The timeout is specified with the task command limit, parameter complete. See the Section called Command limit.
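The effect of a per-command time limit can be illustrated with the Python standard library. The sketch below does not use EnFuzion itself. The function name run_with_limit is hypothetical, and the return strings are chosen for illustration only. A limit of None stands for the infinite default.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_limit(command: List[str], limit_seconds: Optional[float] = None) -> str:
    """Run one user command; None means no time limit."""
    try:
        completed = subprocess.run(command, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return "terminated"
    return "completed" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # A command that sleeps for 5 seconds is cut off by a 0.5 second limit.
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_limit(sleeper, limit_seconds=0.5))
```

Each command gets its own limit in this model, which mirrors the distinction made above between a limit for the entire job and a limit for an individual user command.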

Multiple Job Executions

The same job can be executed concurrently on several nodes. This capability is useful when hosts differ widely in their computing speed. In such cases, the slowest host can significantly delay run completion. With multiple executions, a job is started concurrently on several machines, provided that there are idle nodes and that the run uses fewer nodes than it has been allocated. The job completes when the first execution is completed. The remaining executions are terminated or ignored.

By default, only one copy of a job is executed concurrently. The maximum number of concurrent job executions is stored in the run variable ENFMAX_JOB_COPIES with a default value of 1.
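The "first execution wins" behavior can be sketched with Python threads standing in for nodes. This is a simulation under stated assumptions, not the Dispatcher's implementation: run_copies and simulated_job are hypothetical names, and each delay represents the time a node of a given speed would need for the job.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def simulated_job(delay):
    """Stand-in for a job; `delay` models the speed of the node running it."""
    time.sleep(delay)
    return delay


def run_copies(job, node_delays):
    """Start one copy of `job` per simulated node and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(node_delays))
    futures = [pool.submit(job, delay) for delay in node_delays]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        # Copies that already started cannot be cancelled here;
        # their results are simply ignored.
        future.cancel()
    pool.shutdown(wait=False)
    return next(iter(done)).result()


if __name__ == "__main__":
    # Three nodes of different speed: the fastest one determines completion.
    print(run_copies(simulated_job, [0.5, 0.05, 0.3]))
```

In this model the slowest node no longer determines when the job completes, which is the motivation given above for running multiple copies.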

Timeout for Datajob Execution

An execution limit can be specified for datajobs. The limit is independent of the job and run execution limits. If a datajob execution exceeds this limit, it is restarted on the next available node.

By default, the datajob execution limit is infinite. The limit is stored, in seconds, in the run variable ENFDATASTREAM_EXECUTION_LIMIT. It is valid for all datajobs in the run.
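Unlike the job limit, which ends a job with failure, the datajob limit causes a restart on another node. The following sketch models that difference. The function run_datajob is hypothetical, the durations are simulated values rather than measurements, and the result when no node finishes within the limit (None here) is an assumption, since that case is not described in this section.

```python
from typing import Optional, Sequence


def run_datajob(node_durations: Sequence[float],
                limit: Optional[float] = None) -> Optional[int]:
    """Return the index of the node that finishes the datajob.

    node_durations[i] is the simulated time node i would need.
    A limit of None models the infinite default.
    """
    for index, duration in enumerate(node_durations):
        if limit is None or duration <= limit:
            return index
        # Limit exceeded: the datajob is restarted on the next available node.
    return None  # assumption: no node finished within the limit
```

With a limit of 10 seconds and nodes that would need 30 and 5 seconds, run_datajob([30, 5], limit=10) returns 1: the datajob is abandoned on the first node and completes on the second.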

Timeouts for Persistent User Programs

Various time limits can be specified for persistent user programs that execute datajobs. These values can limit the initialization time for user programs, the time for the initial connection with the persistent user program, the time to process one datajob, and the total time.

By default, all timeouts are infinite. Timeouts are specified with the task command limit. See the Section called Command limit.

Completed Run Directories

After a run is completed, its directory is kept until the user deletes it or the Dispatcher removes it as obsolete. To prevent run directories from accumulating, the Dispatcher automatically deletes obsolete run directories. The time limit after which a directory is considered obsolete is specified by the cluster variable ENFCLEANUP_LIMIT. Its default value is 604800 seconds (7 days).
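The default value can be checked with a short calculation, and the same arithmetic is useful when choosing a different limit. In the sketch below, is_obsolete is a hypothetical helper. It assumes that the age of a directory is measured from the time the run completed, which this section does not state explicitly.

```python
from datetime import timedelta

# 7 days expressed in seconds, matching the documented default.
DEFAULT_CLEANUP_LIMIT = int(timedelta(days=7).total_seconds())


def is_obsolete(completed_at: float, now: float,
                limit: int = DEFAULT_CLEANUP_LIMIT) -> bool:
    """Return True if a run directory is older than the cleanup limit.

    Both timestamps are in seconds; measuring from run completion
    is an assumption made for this illustration.
    """
    return now - completed_at > limit
```

A limit of 3 days, for instance, would be int(timedelta(days=3).total_seconds()), that is, 259200 seconds.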