CVMFS as a software distribution source

Because of the planned removal of Network Attached Storage (Bluearc) mounts from worker nodes, all experiments and projects will be expected to distribute their software to worker nodes with CVMFS. Many already do; the remaining ones now need to make the transition, and this article is for them.

FIFE currently supports two primary methods of publishing software in CVMFS:

  1. Projects that are too small to be registered with the OSG as their own Virtual Organization, typically because their software is shared between more than one experiment, publish their files in the fermilab.opensciencegrid.org repository.
  2. Virtual Organizations get their own opensciencegrid.org repository.
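
Either way, jobs see the published files under the conventional /cvmfs/<repository name> path. As a minimal sketch, a job wrapper might check that the repository it needs is reachable before doing any real work. The function name and the `root` parameter below are hypothetical, introduced only for illustration; they are not part of any FIFE or CVMFS tool.

```python
import os

CVMFS_ROOT = "/cvmfs"  # conventional mount point for CVMFS repositories


def repository_available(repo_name, root=CVMFS_ROOT):
    """Return True if the repository directory exists and can be listed.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(root, repo_name)
    try:
        # Listing the directory also triggers the automounter where one is used.
        os.listdir(path)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    repo = "fermilab.opensciencegrid.org"
    print(repo, "available:", repository_available(repo))
```

Failing early with a clear message is usually easier to debug than a job that dies later on a missing library.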

The two methods are managed differently; see the FIFE documentation for details. In both cases, it is important to understand that although CVMFS appears to the user as a POSIX filesystem similar to Bluearc, it has some limitations. The primary thing to keep in mind is that CVMFS is optimized to distribute executable software to worker nodes, where large numbers of jobs read the same files. It can efficiently distribute any type of file that is used by all grid jobs in a batch, provided the files are similar in size to shared object libraries (less than roughly 100 MB). There is a separate repository type for data files that are used by some but not all jobs.

VOs that have their own repository also need to tell CVMFS how to split up the files efficiently by setting wildcard patterns in a “.cvmfsdirtab” file. This is explained in detail in the CVMFS documentation on maintaining repositories.
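
The size guideline above can be checked before publishing. The following is a minimal sketch, assuming the software is first staged in an ordinary local directory; the function name and the exact threshold of 100 MB (taken as 100,000,000 bytes) are illustrative assumptions, since the guideline itself is only approximate.

```python
import os

# Roughly 100 MB; an approximate guideline, not a hard CVMFS limit.
SIZE_LIMIT_BYTES = 100 * 1000 * 1000


def find_large_files(top, limit=SIZE_LIMIT_BYTES):
    """Yield (path, size) for regular files under `top` larger than `limit` bytes.

    Hypothetical helper for illustration only.
    """
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks; only real files are measured
            size = os.path.getsize(path)
            if size > limit:
                yield path, size


if __name__ == "__main__":
    import sys

    top = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_large_files(top):
        print(f"{size / 1e6:8.1f} MB  {path}")
```

Any file this reports is a candidate for the separate data repository type, or for leaving out of CVMFS altogether.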

There are no quotas on space used in CVMFS, but space is not unlimited, so please publish only files that are used by worker nodes. Also, files are normally not deleted from CVMFS storage even when the user deletes them, because file de-duplication makes it difficult to tell whether the same content is still referenced from other directories. For applications with a lot of churn, meaning they frequently publish and delete files (as with nightly builds), VOs can request a separate repository with CVMFS garbage collection enabled so that removed files can be cleaned up.

Metadata operations (such as listing directories) scale much better with CVMFS than with Network Attached Storage, so migrating everyone to CVMFS is expected to improve performance significantly.

–Dave Dykstra