Monitoring Execution

EnFuzion provides several ways to monitor job execution: extensive Dispatcher logs, web-based monitoring, command-line monitoring, and monitoring from custom programs. Details are provided in the sections below.

Dispatcher Logs

EnFuzion produces extensive logs which provide detailed information on EnFuzion operation. Logs record important events about all major objects in EnFuzion: the cluster, nodes, runs, jobs and datastreams.

The main log is called enfuzion.log. It is created in the main cluster directory. By default, enfuzion.log contains all events. Run-specific events can be turned off to reduce overhead and increase performance.

Each run has its own log, called enfuzion-run.log. The log is created in the run directory. Run logs contain run, job, and datastream events. Datastream events can be turned off to reduce overhead and increase performance.

The enfuzion.log File

During execution, the Dispatcher produces a log that describes major execution events. The log is saved to the file enfuzion.log. Whenever the log grows too large, it is renamed to enfuzion-%d.log, where %d is the smallest integer for which no file of that name exists yet.
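
The naming rule for rotated logs can be illustrated with a short Python sketch. It is not part of EnFuzion; the function name next_rotated_log_name is hypothetical, and the starting value of the counter (0 here) is an assumption, since the rule above only says that the smallest unused integer is chosen.

```python
import os


def next_rotated_log_name(directory, base="enfuzion", start=0):
    """Return the first <base>-<n>.log path in `directory` that does not exist.

    `start` is an assumed first value for the counter; adjust it if the
    Dispatcher numbers its rotated logs differently.
    """
    n = start
    # Walk upwards until a name is found that is not yet taken.
    while os.path.exists(os.path.join(directory, f"{base}-{n}.log")):
        n += 1
    return os.path.join(directory, f"{base}-{n}.log")
```

Because the smallest free integer is used, a gap left by a deleted file is filled before higher numbers are tried, so the numeric suffix does not necessarily reflect the age of a rotated log.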

The size of the Dispatcher log is controlled through the root option logsizelimit. The default value of logsizelimit is 10 MB. Root options are described in the section Specifying Root Configuration Options in Chapter 6.

When a new Dispatcher is started, the existing files enfuzion-%d.log and enfuzion.log are renamed to enfuzion-%08x-%d.log and enfuzion-%08X.log, where the eight-digit hexadecimal field is a unique suffix. This preserves all old Dispatcher logs.

The Dispatcher log records all cluster events and execution statistics. Run events, job events and datastream events are recorded in the run log as well. The run log is created in the home directory of the run.

Execution statistics are provided when the root goes to a non-active state, when a node goes down, and when the run is done. Each report begins with the event report "======= execution report: =======".
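
As a rough illustration, the marker text can be used to locate execution reports in a log. The Python sketch below is not an EnFuzion tool; the function name is hypothetical, and it only finds where reports begin, because the layout of the report body is not described here.

```python
REPORT_MARKER = "======= execution report: ======="


def report_line_numbers(lines):
    """Return the 1-based numbers of the lines that contain the report marker."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if REPORT_MARKER in line
    ]
```

The function accepts any iterable of strings, so an open log file can be passed to it directly.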

Description of Log Events

Cluster Log Events


    <time> <event_id> cluster <cluster_name> create port <port_number>
    <time> <event_id> cluster <cluster_name> cleanup <statistics>
    <time> <event_id> cluster <cluster_name> start
    <time> <event_id> cluster <cluster_name> down <statistics>
    <time> <event_id> cluster <cluster_name> message <text>
    <time> <event_id> cluster <cluster_name> report <text>
    <time> <event_id> cluster <cluster_name> add run <run_name>
    <time> <event_id> cluster <cluster_name> remove run <run_name>
    <time> <event_id> cluster <cluster_name> add node <node_name> on host <host_name>
    <time> <event_id> cluster <cluster_name> remove node <node_name>
    <time> <event_id> cluster <cluster_name> set <variable_name> <value>
    <time> <event_id> cluster <cluster_name> unset <variable_name> 

Node Log Events


    <time> <event_id> node <node_name> start
    <time> <event_id> node <node_name> active
    <time> <event_id> node <node_name> terminate
    <time> <event_id> node <node_name> down <statistics>
    <time> <event_id> node <node_name> idle
    <time> <event_id> node <node_name> executing
    <time> <event_id> node <node_name> busy <message>
    <time> <event_id> node <node_name> message <text>
    <time> <event_id> node <node_name> report <text>
    <time> <event_id> node <node_name> set <variable_name> <value>
    <time> <event_id> node <node_name> unset <variable_name>

Run Log Events


    <time> <event_id> run <run_name> create <data>
    <time> <event_id> run <run_name> cleanup <statistics>
    <time> <event_id> run <run_name> start
    <time> <event_id> run <run_name> stop
    <time> <event_id> run <run_name> continue
    <time> <event_id> run <run_name> abort
    <time> <event_id> run <run_name> stage <run_stage> 
    <time> <event_id> run <run_name> done
    <time> <event_id> run <run_name> fail
    <time> <event_id> run <run_name> add job <job_name>
    <time> <event_id> run <run_name> remove job <job_name>
    <time> <event_id> run <run_name> add task <task_name>
    <time> <event_id> run <run_name> change task <task_name>
    <time> <event_id> run <run_name> datain <filename>
    <time> <event_id> run <run_name> message <text>
    <time> <event_id> run <run_name> report <text>
    <time> <event_id> run <run_name> set <variable_name> <value>
    <time> <event_id> run <run_name> unset <variable_name>

Job Log Events


    <time> <event_id> job <run_name> <job_name> start <node> <host>
    <time> <event_id> job <run_name> <job_name> reschedule <node> <host>
    <time> <event_id> job <run_name> <job_name> done <node> <host>
    <time> <event_id> job <run_name> <job_name> ignore <node> <host>
    <time> <event_id> job <run_name> <job_name> fail <node> <host>
    <time> <event_id> job <run_name> <job_name> abort
    <time> <event_id> job <run_name> <job_name> message <text>
    <time> <event_id> job <run_name> <job_name> set <variable_name> <value>
    <time> <event_id> job <run_name> <job_name> unset <variable_name> 

Datastream Log Events


    <time> <event_id> datastream <run_name> <job_name> start <node> <host>
    <time> <event_id> datastream <run_name> <job_name> done <node> <host>
    <time> <event_id> datastream <run_name> <job_name> reschedule <node> <host>
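
The event formats above share a common shape: a time, an event identifier, an object type, one or two object names, and an event keyword followed by optional arguments. The following Python sketch shows one way a script might split such lines into fields. It is an illustration under stated assumptions, not an EnFuzion utility: the names LogEvent and parse_event are hypothetical, the exact format of <time> is not specified here (so everything before the event identifier is kept as the time), and the object type is assumed to be the first token, after the first two, that matches one of the five object types.

```python
from typing import NamedTuple, Optional, Tuple

# Object types that appear in the event formats listed above.
KINDS = {"cluster", "node", "run", "job", "datastream"}


class LogEvent(NamedTuple):
    time: str                # everything before the event id (format assumed)
    event_id: str
    kind: str                # cluster, node, run, job or datastream
    names: Tuple[str, ...]   # (name,) or (run_name, job_name)
    action: str              # e.g. start, done, add, set
    args: Tuple[str, ...]    # remaining tokens, e.g. node and host


def parse_event(line: str) -> Optional[LogEvent]:
    """Split one log line into fields; return None if it is not an event line."""
    tokens = line.split()
    # Leave room for at least <time> and <event_id> before the object type.
    for i in range(2, len(tokens)):
        if tokens[i] in KINDS:
            break
    else:
        return None
    kind = tokens[i]
    # Job and datastream events carry a run name and a job name.
    n_names = 2 if kind in ("job", "datastream") else 1
    rest = tokens[i + 1:]
    if len(rest) < n_names + 1:
        return None
    return LogEvent(
        time=" ".join(tokens[:i - 1]),
        event_id=tokens[i - 1],
        kind=kind,
        names=tuple(rest[:n_names]),
        action=rest[n_names],
        args=tuple(rest[n_names + 1:]),
    )
```

For example, with a made-up line such as "12:00:01 17 job run1 job5 start node3 hostA", the sketch yields the kind job, the names run1 and job5, the action start, and the arguments node3 and hostA. Lines that do not match the event shape, such as the body of an execution report, are returned as None.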

Monitoring from a Web Browser

EnFuzion monitoring is available by connecting a standard web browser to the Eye program on the EnFuzion root host. The default port for the Eye is 10101.

The Eye provides several different monitoring pages. There is the main page for the overall EnFuzion cluster, a page with summary information for all the nodes, a page with details about each node, a page with summary information for all runs and a page with details about each run.

Cluster Page

The main EnFuzion monitoring page can be reached through the Cluster Monitoring link on the Eye home page. An alternative Cluster link is available in the header of all pages. This page gives basic information about the EnFuzion cluster status, uptime, nodes and runs. It also contains the log and any messages from the Dispatcher.

Node List Page

The node summary page can be reached through the Nodes link on the main monitoring (Cluster) page. An alternative Nodes link is available in the header of all pages. This page gives each node's name, status, uptime, distribution of time, and summary of job execution.

Single Node Page

A detailed node page can be reached through the link in the Name field of the node summary (Node List) page. This page gives node details, including the node name, host, user, operating system, node start parameters, status, uptime, time distribution, and job execution.

Run List Page

The run summary page can be reached through the Runs link on the main monitoring (Cluster) page. An alternative Runs link is available in the header of all pages. This page gives each run's name, status, uptime, scheduling parameters, and summary of job execution.

Single Run Page

A detailed run page can be reached through the link in the ID field of the run summary (Run List) page. This page gives run details, including the run name, scheduling parameters, status, uptime, job execution, initialized nodes, results, log, and control.

Monitoring from a Command Line

EnFuzion provides a command-line tool, enfcmd, that can be used to monitor the Dispatcher. The enfcmd command connects to the Dispatcher API port, which the Dispatcher reports in its main log file, enfuzion.log.

The main enfcmd option for monitoring is the show option. It provides detailed information about the entire EnFuzion cluster, or its individual components.

The syntax for the show option is:


    show [ detailed ] [ ( cluster | node <node_id> | run <run_id> ) ]
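
To make the grammar concrete, the Python sketch below assembles valid show command strings from their parts. It is illustrative only: build_show_command is a hypothetical helper, and the sketch covers only the text of the show option, not how that text is passed to enfcmd.

```python
def build_show_command(target=None, ident=None, detailed=False):
    """Build a show command following the syntax given above.

    target: None, "cluster", "node" or "run"
    ident:  node or run identifier, required for "node" and "run"
    """
    parts = ["show"]
    if detailed:
        parts.append("detailed")
    if target is None:
        if ident is not None:
            raise ValueError("an identifier needs a target of 'node' or 'run'")
    elif target == "cluster":
        if ident is not None:
            raise ValueError("'cluster' takes no identifier")
        parts.append("cluster")
    elif target in ("node", "run"):
        if ident is None:
            raise ValueError(f"'{target}' requires an identifier")
        parts.extend([target, str(ident)])
    else:
        raise ValueError(f"unknown target: {target!r}")
    return " ".join(parts)
```

With this helper, build_show_command() gives "show", build_show_command("cluster") gives "show cluster", and build_show_command("node", "n1", detailed=True) gives "show detailed node n1", where n1 is a placeholder node identifier.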

Monitoring from a Custom Program

The Dispatcher provides a set of socket-based commands, called API commands, which any program can use to monitor and control the Dispatcher.

A custom program connects to the Dispatcher as follows:

The Dispatcher is now ready to accept commands from the custom program. The monitoring commands are: