Singularity on the OSG

Have you heard about Singularity? If you’re thinking of the technological kind, you should probably wait until 2040 to see it, but meanwhile, the OSG and FIFE teams are working hard to introduce Singularity, a container platform, to improve users’ experience on the grid.  When running jobs on the grid, one issue users encounter is that their test environment, for example on an interactive node, and the grid environment may differ enough that jobs that worked in testing fail on the grid.  Once those jobs do get running, Fermilab and other sites currently use gLExec to make sure the jobs run as the user who submitted them; however, gLExec itself can cause numerous issues that put jobs into a held state.

Container technology has become popular over the past few years as a method of deploying standardized software stacks together with the environments in which they run.  Containers allow users and experiments to control their runtime environment, including any needed libraries and dependencies, and then ship that environment to any other machine running the same container software.  This includes controlling the OS that runs within the container, a feature that benefits both resource owners, who can upgrade their systems while still letting grid jobs run on older OSes, and users and experiments whose workflows still depend on those OSes.  Resource owners can also control which container images may be used to launch containers on their resources.
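
To make the idea concrete, here is a minimal sketch (in Python) of running the same command on the host and inside a container, so you can see the container carrying its own OS userland.  The image path is a hypothetical placeholder, not a real site path; the `singularity exec` command itself is the standard way to run a single command inside a Singularity image.

```python
#!/usr/bin/env python3
"""A minimal sketch: run the same command on the host and inside a
container to show that the container supplies its own OS userland.
The image path below is a hypothetical placeholder."""

import subprocess

IMAGE = "/path/to/sl6-image.img"  # hypothetical: a site-provided SL6 image
CMD = ["cat", "/etc/redhat-release"]

# On the host (say, an SL7 worker node)...
host = subprocess.check_output(CMD).decode().strip()
print("host:      " + host)

# ...and inside the container, via `singularity exec`
inside = subprocess.check_output(["singularity", "exec", IMAGE] + CMD).decode().strip()
print("container: " + inside)
```

On an SL7 node with an SL6 image, the two lines would report different OS releases, which is exactly the portability containers are after.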

Recently, a number of sites on the Open Science Grid (OSG), including Fermilab, have discussed or implemented support for one such container solution: Singularity.  Singularity, developed and maintained at Berkeley Lab, is a container solution that, like others, allows control of the environment in which code runs, and provides the process isolation needed for grid jobs to run properly.  One of the main features distinguishing Singularity from other solutions such as Docker is that privilege escalation is blocked within Singularity containers.  Essentially, once a Singularity container is started by a particular user, all processes within that container run as that user, and the container only has access to files that user can already access.  Because of this, Singularity, combined with extra logging, should also render unnecessary the authorization functionality currently provided by gLExec.
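
That user-identity guarantee is easy to check for yourself.  The following sketch compares the UID of the launching user on the host with the UID seen inside the container; under Singularity the two match.  The image path is again a placeholder for whatever approved image a site provides.

```python
#!/usr/bin/env python3
"""A sketch checking that a process inside a Singularity container runs
as the same user who launched it. The image path is a placeholder."""

import os
import subprocess

IMAGE = "/path/to/approved-image.img"  # placeholder for a site-approved image

host_uid = os.getuid()
container_uid = int(
    subprocess.check_output(["singularity", "exec", IMAGE, "id", "-u"]).decode()
)

# Under Singularity the two UIDs match: the contained process is simply
# the launching user's own process, with no route to root.
assert host_uid == container_uid
print("UID inside and outside the container: %d" % host_uid)
```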

One site that has been supporting Singularity containers is Nebraska.  At the most recent HTCondor Week, Brian Bockelman discussed Nebraska’s experience running Singularity containers inside Glideins over the past few months.  In a pilot-based grid setup, such as the OSG’s, the pilots are responsible for checking whether a site has Singularity and, if so, for starting jobs within a Singularity container using an image specified by the job (or job wrapper).  Pilots will only be allowed to start containers from images approved by the OSG (and, on GPGrid, by FIFE).  So far, Nebraska has seen very few issues stemming from this setup.  Syracuse is also planning to run grid jobs within Singularity containers to support SL6 grid jobs on worker nodes that will be upgraded to SL7, and FIFE is currently testing upgrades to our Glidein setup to allow FIFE jobs to run inside these containers.  Eventually, all sites that support CMS plan to run Singularity containers.
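
For a rough sense of the pilot-side logic, here is a simplified, hypothetical sketch: detect whether Singularity is available on the worker node, check the requested image against an approved list, and wrap the payload in `singularity exec` when both conditions hold.  The environment variable, image paths, and approved list below are all illustrative; the real glideinWMS wrapper scripts differ.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of pilot-side logic: run the payload inside a
Singularity container when the site supports it and the image is
approved, falling back to a direct launch otherwise. Names and paths
are illustrative, not the actual glideinWMS implementation."""

import os
import shutil
import subprocess
import sys

# Image requested by the job (e.g., via a job attribute); placeholder default.
requested_image = os.environ.get("JOB_SINGULARITY_IMAGE", "/path/to/default-sl6.img")

# Only images on the site's approved list may be used (illustrative list).
APPROVED_IMAGES = {"/path/to/default-sl6.img", "/path/to/sl7.img"}

payload = sys.argv[1:]  # the user job's command line

if shutil.which("singularity") and requested_image in APPROVED_IMAGES:
    # Site has Singularity and the image is approved: contain the job.
    cmd = ["singularity", "exec", requested_image] + payload
else:
    # Otherwise, run the payload directly on the host as before.
    cmd = payload

sys.exit(subprocess.call(cmd))
```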

The adoption of Singularity across OSG sites is still in its early stages, but there is considerable interest and momentum in that direction.  Stay tuned for more updates about FIFE support for running in containers offsite, and about any Singularity-related changes on GPGrid.  We will also discuss this further at the upcoming FIFE workshop on June 21-22.

–Shreyas Bhat and Ken Herner