Interpreting efficiency plots on FIFEMON

There was an interesting trend in the US nuclear power industry through the 1990s and 2000s: despite no new plants being built, net electricity generation increased almost continuously, averaging around 2% growth per year even as some plants were shut down. Instead of building new plants, utilities were finding ways to generate more from their existing infrastructure. In much the same way, scientific computing is working to make better use of the grid computing resources we already have, so even as budget constraints limit how much new physical hardware we can purchase, we can continue to expand our effective capacity and support the increasing demand for computing.

How well grid computing resources are being used is best monitored through metrics we term “efficiency”: the amount of a resource actually used divided by the amount of that resource claimed (read: requested). When you submit a job cluster, you request, either explicitly or implicitly through JobSub defaults¹, a certain number of CPU cores and amount of memory and disk that you expect each job in that cluster to require. When each job starts running, those requested resources are allocated for that job’s exclusive use. The batch system and Fifemon then track how much of those resources your job actually uses. If your job tries to exceed a requested amount, it will be held, which stops execution immediately and returns the job to the batch queue for diagnostics.
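To make the definition concrete, here is a minimal sketch of the efficiency calculation; the numbers are invented for illustration and are not real job data:

```python
# Efficiency as defined above: resource actually used divided by
# resource requested. Values here are hypothetical, for illustration.

def efficiency(used, requested):
    """Fraction of the requested resource that was actually used."""
    return used / requested

# A job that requested 2000 MiB of memory but only ever used 500 MiB:
mem_eff = efficiency(used=500, requested=2000)
print(f"memory efficiency: {mem_eff:.0%}")  # 25% -- 1500 MiB sat idle
```

The idle 1500 MiB in this example is allocated exclusively to the job and cannot be given to anyone else while it runs.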

Grossly incorrect resource requests can waste significant computing resources, both by limiting the number of jobs that can run concurrently and through results lost when jobs are held.

Historically, the number of CPU cores available has been the limiting factor in how many jobs can run simultaneously², so typically “efficiency” means “CPU efficiency,” notably on many Fifemon dashboards such as “User Efficiency Details” and “Experiment Efficiency Details.” The efficiency graphs on Fifemon represent the “instantaneous cumulative CPU efficiency,” that is, total CPU time divided by total wall time for all running jobs. Due to time lag in data collection, this data is best considered an approximation, particularly when there is a high rate of job churn.
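The metric can be sketched in a few lines; the job list below is made-up sample data, not real Fifemon output:

```python
# "Instantaneous cumulative CPU efficiency" as described above:
# total CPU time divided by total wall time across all running jobs.

def cumulative_cpu_efficiency(jobs):
    """jobs: iterable of (cpu_time_seconds, wall_time_seconds) pairs."""
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return total_cpu / total_wall if total_wall else 0.0

# Example: three jobs that have each been running for an hour,
# with varying amounts of time spent waiting on I/O.
running_jobs = [
    (3600, 3600),  # fully CPU-bound: 100% efficient
    (1800, 3600),  # half the time waiting on I/O
    (900, 3600),   # mostly waiting, e.g. on tape staging
]

print(f"{cumulative_cpu_efficiency(running_jobs):.0%}")  # 58%
```

Note that a single mostly-idle job drags the aggregate down even when other jobs are fully CPU-bound, which is why a few stalled workflows can dominate an experiment’s efficiency graph.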

The main cause of low CPU efficiency is I/O bottlenecks: database connections, low-bandwidth connections (offsite), waiting for files to be restored from tape, clogged/slow dCache doors, etc.

Finally, actual resource utilization for each job is collected after the job is terminated. As this issue of FIFE Notes goes to print, we are preparing to (re-)deploy another tool, Fifemail, to help you better understand the resources utilized by your batch jobs. Fifemail sends an email at the completion of each job cluster and optionally a daily digest that summarizes the resources requested, the actual resources used and the efficiency. We will also be sending warnings when the efficiency is below the thresholds set in the FIFE Efficiency Policy. Please read these warnings and try to understand why the efficiency was low and what can be changed to improve efficiency for future job submissions.

Every 1% increase in CPU efficiency across FermiGrid is like adding five new servers, and with overall efficiency often below 75%, and sometimes below 50%, there is considerable room to improve.

CPU and memory efficiency are not the whole story, of course. If a job does not actually complete any useful work or return useful results, then all of the resources it used were wasted. So we also track job success rate, where “success” is defined as the job exit code being zero. Often, however, we have found that the job exit code does not represent the actual failure or success of the job, so please ensure that the job exit code (that is, the exit code of the last command in the job script, or one set by an explicit exit statement) reflects whether or not the job worked as expected and successfully returned its results.
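A job-script skeleton illustrating the point might look like the following; `run_analysis` and `output_is_valid` are hypothetical stand-ins for your own code, and the same pattern applies equally to a shell script:

```python
# Make the process exit code reflect whether the work actually
# succeeded, rather than whatever the last incidental command
# happened to return.
import sys

def run_analysis():
    ...  # stand-in for your actual work

def output_is_valid():
    ...  # stand-in: e.g. check the output file exists and is well-formed
    return True

def main():
    try:
        run_analysis()
    except Exception as err:
        print(f"analysis failed: {err}", file=sys.stderr)
        return 1  # nonzero: the batch system records a failure
    if not output_is_valid():
        print("output failed validation", file=sys.stderr)
        return 2  # a distinct nonzero code for bad output
    return 0      # zero only when results really came back good

if __name__ == "__main__":
    sys.exit(main())  # explicit exit status, not an accident
```

With this structure, a held or failed job reports a meaningful nonzero code instead of silently exiting zero after a failed step.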

Without correct exit codes from jobs, we lose critical information that helps quickly identify and mitigate many potential grid issues, such as faulty nodes, site problems, dCache issues, etc.

While FermiGrid and the OSG provide substantial resources for scientific computing, they are not unlimited and are shared by thousands of users from dozens of collaborations. By minimizing resource waste (maximizing efficiency) you can help improve job throughput for all users and decrease the time waiting for your jobs to start.

–Kevin Retzke

¹JobSub defaults are 1 CPU core, 2000 MiB memory, 35 GiB disk, and 8 hours wall time

²As we see more higher-memory workflows, we are starting to see memory become more of a limiting factor, so we have been and will continue adding more metrics and graphs tracking that as well.