Upgrading GPGrid to FermiGrid

Experiments need ever-increasing computing capacity, and this trend is expected to continue. The HEPCloud project is dedicated to meeting these needs as efficiently and cost-effectively as possible. Recently, GPGrid and Fifebatch underwent a transition to better align the computing cluster with HEPCloud’s efforts.

Previously, Fifebatch existed as a glideinWMS pool separate from GPGrid. glideinWMS requests resources in discrete combinations of CPU cores and memory, which meant that jobs could not access any excess memory on GPGrid workers. Many GPGrid workers have higher memory-per-core ratios than the standard glidein requests.
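As a rough illustration (this is a generic HTCondor query, not the exact FermiGrid configuration), you can see the CPU/memory combinations a pool's slots advertise by asking the collector with condor_status:

    # List each slot's advertised core count and memory (in MB).
    # Cpus and Memory are standard HTCondor machine ad attributes.
    condor_status -autoformat Name Cpus Memory

Under the old model, a job was confined to whatever discrete combination its glidein advertised, even when the physical worker had memory to spare.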

Figure: Memory utilization before and after the GPGrid/FermiGrid refactor. Claimed memory and utilized memory track each other much more closely now because Docker controls the amount of memory assigned to each job, and that assigned memory value is no longer discrete.

Fifebatch and GPGrid were combined into what is now called FermiGrid, merging aspects of both infrastructures. Jobs submitted through Jobsub now run directly on the worker nodes, and all of the resources a FermiGrid worker node offers are available to jobs. Offsite jobs still request glideins from OSG, and those glideins now join FermiGrid as part of the cluster.
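For reference, a typical Jobsub submission still looks roughly like the sketch below; the experiment group, script path, and resource values are placeholders, and the exact flags depend on your jobsub_client version:

    # Illustrative jobsub submission; group, script, and sizes are placeholders.
    # On FermiGrid the requested memory no longer has to fit a fixed glidein size.
    jobsub_submit -G <experiment> --memory=3500MB --cpu=1 \
        --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC \
        file:///path/to/my_job.sh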

In addition, a new capability was added as part of the transition: jobs on FermiGrid workers now run in Docker containers. Docker provides better resource usage enforcement, which in turn protects both the worker nodes and the other jobs running on them. From the job’s perspective, nothing has really changed. Docker also abstracts the job’s environment from the worker node OS, so the environment a job runs in is no longer tied to the worker. Currently, the worker nodes run SLF7 while the jobs run in SLF6 Docker containers. If the Docker container is missing required libraries, please open a ServiceNow ticket with Distributed Computing Services.
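The mechanism can be sketched with a generic docker run invocation (the image name and memory cap here are illustrative; the batch system sets the real values per job):

    # Run an SLF6-style userland on an SLF7 host with a hard memory cap.
    # 'sl:6' is the Scientific Linux image on Docker Hub; the 2g limit is
    # an example, not the FermiGrid default.
    docker run --rm --memory=2g sl:6 cat /etc/redhat-release

The kernel enforces the memory cap, so a runaway job is contained instead of starving its neighbors, and the container sees an SLF6 userland regardless of the host OS.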

In the near future, we plan to allow experimenters to run their jobs in experiment-approved Singularity images. Singularity is a container technology that runs in user space. Its goal is portability, which makes it a favored technology for bringing your own environment to OSG and other sites.
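A rough sketch of what that might look like (the image path is hypothetical; the actual workflow will be announced when the capability is available):

    # Run a command inside an experiment-approved image with Singularity.
    # The image path is a placeholder.
    singularity exec /path/to/experiment_image.img cat /etc/redhat-release

Because the image is just a file and Singularity runs in user space, the same environment can travel with the job to OSG and other sites.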

Two issues are currently being worked on with the HTCondor team: HTCondor reports zero CPU usage, and memory usage is incorrectly reported as RSS + cache. Both issues have been reported, and we expect patches from the HTCondor team soon.
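In the meantime, you can see what HTCondor currently reports for a job with a query along these lines (the job ID is a placeholder; RemoteUserCpu, MemoryUsage, and ResidentSetSize are standard HTCondor job ad attributes):

    # Inspect the usage HTCondor has recorded for a job. Until the patches land,
    # RemoteUserCpu may read 0 and MemoryUsage may include page cache.
    condor_q <jobid> -autoformat RemoteUserCpu MemoryUsage ResidentSetSize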

–Anthony Tiradani