When a cluster is disconnected from the multi-site EnFuzion client, unallocated jobs cannot be allocated to that cluster. Currently, the cluster only finishes its executing jobs and its waiting jobs (approximately the number of active nodes). There is no reason why the cluster could not process more jobs, even without a connection to the multi-site EnFuzion client. This is hard to do, however, because the cluster cannot communicate with the multi-site EnFuzion client and so does not know which jobs it can process without duplicating work. The cluster cannot simply choose a job from the unallocated jobs, because another cluster might choose the same job.
It is possible to give a cluster enough jobs so that its completion time is the same as that of all the other clusters. That way, if the connection is lost, the cluster will continue to run, and the user can collect all the data once the run has finished and a connection can be re-established. Currently, if a cluster has fewer jobs than the remaining clusters, it finishes earlier than the rest, leaving the other clusters to work through the larger number of remaining jobs. This leads to a longer overall completion time.
Predicting how quickly jobs are processed on a cluster requires cluster processing capability information. That is, the multi-site EnFuzion client needs a cluster to complete at least one job before it can begin to estimate how long that cluster takes to process jobs. Even then, cluster processing capability information is not useful until two clusters have reported it. Until then, job allocation is done as described in Job Allocator #1.
Once two clusters report cluster processing capability information, it can be used to balance the jobs to optimise completion time. One method is to determine which cluster will finish earliest and which will finish last, and to move a job from the slowest cluster to the fastest. This was implemented and balanced the clusters correctly. However, moving one job from the slowest cluster to the fastest would swap their positions, making the fastest cluster the slowest and vice versa. The multi-site EnFuzion client would then continuously move a job between the two clusters and would never determine which cluster would finish earliest with the last job.
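The back-and-forth behaviour can be reproduced with a minimal sketch. This is not the EnFuzion implementation: the function name `rebalance_once`, the cluster names and the timings are all hypothetical, and each cluster is reduced to a job count and an assumed average time per job.

```python
def rebalance_once(jobs, secs_per_job):
    """Move one job from the cluster predicted to finish last
    to the cluster predicted to finish first (hypothetical helper)."""
    # Predicted finishing time: allocated jobs x average seconds per job.
    finish = {name: jobs[name] * secs_per_job[name] for name in jobs}
    last = max(finish, key=finish.get)
    first = min(finish, key=finish.get)
    moved = dict(jobs)
    if last != first and moved[last] > 0:
        moved[last] -= 1
        moved[first] += 1
    return moved


# Assumed figures: two equally fast clusters and an odd number of jobs.
secs_per_job = {"A": 10.0, "B": 10.0}
start = {"A": 5, "B": 6}

after_one = rebalance_once(start, secs_per_job)      # B gives a job to A
after_two = rebalance_once(after_one, secs_per_job)  # A gives it straight back

print(start, after_one, after_two)
```

With these assumed numbers the second call undoes the first, so repeating the rule never settles on a final allocation.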
To solve this problem, we decided to use a simulation loop. Using the average time a cluster takes to process a job, the client can predict how each cluster will perform if jobs are allocated to it. The average time a cluster takes to process a job is calculated as the average time its nodes take to finish a job divided by the number of active nodes. Including the number of active nodes in the calculation has the advantage that the number of allocated jobs is adjusted if a node on the cluster vanishes: the multi-site EnFuzion client automatically adjusts its estimate of how powerful the cluster is. The average time a cluster takes to process a job is then used in the simulation. Each cluster’s estimated finishing time is calculated by multiplying the average time the cluster takes to finish a job by the number of jobs allocated to it.
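As a rough illustration of these two calculations, the sketch below assumes times are measured in seconds. The function names and the example figures are hypothetical and are not taken from EnFuzion.

```python
def cluster_secs_per_job(avg_node_secs_per_job, active_nodes):
    """Average time the cluster as a whole takes per job:
    average node time per job divided by the number of active nodes."""
    if active_nodes <= 0:
        raise ValueError("a cluster needs at least one active node")
    return avg_node_secs_per_job / active_nodes


def estimated_finish_secs(avg_node_secs_per_job, active_nodes, allocated_jobs):
    """Estimated finishing time: cluster time per job x jobs allocated."""
    return cluster_secs_per_job(avg_node_secs_per_job, active_nodes) * allocated_jobs


# Assumed figures: nodes average 120 s per job, 40 jobs are allocated.
with_8_nodes = estimated_finish_secs(120.0, 8, 40)  # 15 s per job overall
with_7_nodes = estimated_finish_secs(120.0, 7, 40)  # one node has vanished

print(with_8_nodes, with_7_nodes)
```

Losing a node raises the estimate for the same number of jobs, which is what lets the allocator shift work away from a weakened cluster.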

In step 1, the completion time for each cluster is calculated and displayed as the allocated jobs on a timeline. This is not strictly accurate, as jobs are processed in parallel, but it is effectively how the job allocator views clusters, that is, by the rate at which each processes jobs. In step 2, the new job (Job 112) is temporarily added to all the clusters. In step 3, the job is left with the cluster that has the earliest estimated completion time. This loop repeats until there are no unallocated jobs left. Job identifiers are ignored, and each cluster is passed the number of jobs it needs to satisfy the simulation. At this stage, if a cluster has more waiting jobs allocated to it than the simulation requires, the excess jobs are removed.
Prior to this simulation, clusters may already have jobs allocated to them in the simulation pool from the previous ‘node discovery’ rule of having enough waiting jobs to satisfy all active jobs finishing. This will not affect the simulation, as those jobs are used in calculating the cluster’s estimated finishing time. The simulation will still favour those clusters that are expected to finish earlier. Clusters that do not have any processing capability information are not used in the simulation. They are still included in the previous ‘node discovery’ rule if it applies (i.e., enough jobs to satisfy every node across all clusters).
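The three steps can be sketched as a simple greedy loop. This is a minimal illustration under stated assumptions, not the actual allocator: `simulate_allocation`, the cluster names and the timings are hypothetical, job identifiers are ignored (only counts are tracked), and a cluster without capability information is represented by `None` and skipped.

```python
def simulate_allocation(allocated, secs_per_job, unallocated):
    """Return the job count each cluster should hold after the simulation.

    allocated:    cluster name -> jobs already allocated to it
    secs_per_job: cluster name -> average seconds per job, or None if unknown
    unallocated:  number of jobs still to hand out
    """
    # Only clusters with processing capability information take part.
    known = {name: t for name, t in secs_per_job.items() if t is not None}
    counts = {name: allocated.get(name, 0) for name in known}
    if not counts:
        return counts
    for _ in range(unallocated):
        # Step 2: tentatively add the job to every cluster.
        # Step 3: keep it on the cluster that would then finish earliest.
        best = min(counts, key=lambda name: (counts[name] + 1) * known[name])
        counts[best] += 1
    return counts


# Assumed figures: "new" has not completed a job yet, so it is left out.
secs_per_job = {"fast": 5.0, "slow": 10.0, "new": None}
allocated = {"fast": 2, "slow": 2, "new": 3}

result = simulate_allocation(allocated, secs_per_job, unallocated=6)
print(result)
```

Because the job is always placed where the resulting completion time is lowest, the loop ends when the jobs run out and cannot swap a job back and forth between two clusters.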
This helps to reduce the problem of a cluster becoming idle too soon after a network disconnection. After all the jobs are finished, the user can collect the remaining jobs from the disconnected clusters.