GPGrid Efficiency Policy

Efficiency Threshold Reference Table

Role         Memory   CPU   Success rate
Analysis     15%      35%   50%
Production   15%      35%   50%
POMS         15%      35%   50%

Job clusters with efficiency below any of these thresholds will be tagged as inefficient, and the submitter will be contacted by email to diagnose and, if necessary, modify the workflow. A warning email is generated only if the total wall time for all jobs in the cluster is greater than 500 hours.

Fermilab provides the General Purpose Grid (GPGrid) cluster as a computing resource for the production and analysis workflows of its experiments. Because GPGrid has a fixed capacity, using it efficiently is critical to ensuring that all experiments can accomplish their physics goals. Within the next few months, the FIFE group will be implementing an efficiency policy to ensure that resources are used effectively and that inefficient workflows are quickly identified and corrected.

The efficiency of a user’s jobs will be calculated for each cluster of jobs submitted to the FIFE batch server. Three resource-usage metrics will be monitored: CPU efficiency, memory allocation efficiency, and scratch disk allocation efficiency. The CPU efficiency of each individual job is its CPU time divided by its wall time; the cluster average weights each job's efficiency by that job's wall time. The memory allocation efficiency is the maximum RSS (resident set size) reported by HTCondor for the job divided by the memory requested and allocated to the job. The scratch disk allocation efficiency is the maximum amount of data resident on the scratch disk divided by the scratch disk storage allocated to the job. Email notifications listing all of the reported efficiencies are now being sent to all users at their Fermilab email address (i.e., username@fnal.gov). Initially, only CPU efficiency, memory efficiency, and job success rate will be used to identify workflows that need to be corrected.
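As an illustration only, the metric definitions above can be sketched in Python. This is not the FIFE group's implementation: the `Job` record, its field names, and the units are assumptions chosen for the example. The policy text does not say how per-job memory or scratch disk efficiencies are combined across a cluster, so only the per-job ratios are shown for those two.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical per-job record; field names and units are assumptions.
    cpu_hours: float            # CPU time used by the job
    wall_hours: float           # wall time of the job
    max_rss_mb: float           # maximum RSS reported for the job
    requested_memory_mb: float  # memory requested and allocated
    max_disk_mb: float          # peak data resident on scratch disk
    requested_disk_mb: float    # scratch disk allocated


def cluster_cpu_efficiency(jobs: List[Job]) -> float:
    """Wall-time-weighted average of per-job (CPU time / wall time)."""
    total_wall = sum(j.wall_hours for j in jobs)
    if total_wall == 0:
        return 0.0
    weighted = sum(
        (j.cpu_hours / j.wall_hours) * j.wall_hours
        for j in jobs
        if j.wall_hours > 0
    )
    return weighted / total_wall


def memory_efficiency(job: Job) -> float:
    """Maximum RSS divided by the memory allocated to the job."""
    return job.max_rss_mb / job.requested_memory_mb


def disk_efficiency(job: Job) -> float:
    """Peak scratch disk usage divided by the scratch disk allocated."""
    return job.max_disk_mb / job.requested_disk_mb


# Example cluster with two jobs (made-up numbers).
jobs = [
    Job(1.0, 2.0, 500.0, 2000.0, 1000.0, 10000.0),
    Job(6.0, 8.0, 1500.0, 2000.0, 5000.0, 10000.0),
]
cpu_eff = cluster_cpu_efficiency(jobs)
```

Weighting each job's CPU efficiency by its own wall time is equivalent to dividing the cluster's total CPU time by its total wall time, so long jobs dominate the cluster average.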

If a user’s workflow is found to violate one of the efficiency thresholds listed in the table above, the user will receive a warning email that includes links to standard tutorials on how to improve workflow efficiency. Small job clusters submitted for testing will not trigger warning emails, so please scale your testing: 1 job, 10 jobs, 100 jobs, etc. If a user submits 5 job clusters without correcting the inefficient workflow, they will be contacted by the FIFE support group. If the submitter does not respond to FIFE support emails and continues to submit inefficient jobs, their priority will be adjusted so that their jobs are placed at the end of the queue.
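The warning rule can be sketched as a simple check against the threshold table and the 500-hour minimum wall time. The function and variable names below are hypothetical, and the sketch assumes efficiencies are given as fractions between 0 and 1; the threshold values are the ones in the table, which are currently the same for every role.

```python
from typing import Dict, List

# Thresholds from the reference table, as fractions (identical for all roles).
THRESHOLDS: Dict[str, float] = {
    "memory": 0.15,
    "cpu": 0.35,
    "success_rate": 0.50,
}

# A cluster's total wall time must be greater than this to trigger a warning.
MIN_TOTAL_WALL_HOURS = 500.0


def violated_metrics(
    efficiencies: Dict[str, float], total_wall_hours: float
) -> List[str]:
    """Return the metrics below threshold, or [] for clusters too small to warn on."""
    if total_wall_hours <= MIN_TOTAL_WALL_HOURS:
        return []
    return [
        name
        for name, minimum in THRESHOLDS.items()
        if efficiencies[name] < minimum
    ]


# Made-up example: low memory efficiency on a large cluster.
example = {"memory": 0.10, "cpu": 0.40, "success_rate": 0.90}
flagged = violated_metrics(example, total_wall_hours=600.0)
```

With these numbers only the memory metric is flagged; the same efficiencies on a cluster with 400 total wall hours would not produce a warning.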

The full details of the efficiency policy will be discussed at the FIFE Workshop on June 21-22.

–Mike Kirby