StashCache speeds up data access

StashCache is an OSG service that aims to provide more efficient access to certain types of data across the Grid. Most jobs end up copying their input files all the way from Fermilab every time they run, which can be slow and inefficient. In some cases the same files are reused many times. One example is the flux files used as input to GENIE simulations, where each individual job uses a random sub-selection of the entire dataset. When these jobs run opportunistically on Grid sites, they would be more efficient if the data could be fetched from somewhere close by. The StashCache project aims to help with this.

StashCache is built on top of the existing xrootd and cvmfs products. Data stored in dCache is indexed by cvmfs. When a file is accessed from a Grid site, it is automatically copied over xrootd to a regional cache server near the site. Subsequent accesses to the same file are much faster because the cached copy is reused.
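
The caching behaviour described above can be illustrated with a toy read-through cache in Python. This is only a sketch of the general idea, not the StashCache implementation: the RegionalCache class, the fake_origin function, and the file names are all hypothetical, and the real service moves data over xrootd rather than calling a Python function.

```python
from typing import Callable, Dict


class RegionalCache:
    """Toy read-through cache: fetch from the origin on a miss, reuse on a hit."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch_from_origin = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        # Hit: the file is already held near the site, so no long-distance copy.
        if path in self._store:
            self.hits += 1
            return self._store[path]
        # Miss: copy the file from central storage once and keep it for later jobs.
        self.misses += 1
        data = self._fetch_from_origin(path)
        self._store[path] = data
        return data


def fake_origin(path: str) -> bytes:
    # Hypothetical stand-in for a slow wide-area transfer from central storage.
    return f"contents of {path}".encode()


cache = RegionalCache(fake_origin)
for path in ["flux/a.root", "flux/b.root", "flux/a.root", "flux/a.root"]:
    cache.read(path)

print(cache.hits, cache.misses)  # prints "2 2"
```

Only the first request for each file reaches the origin. The two repeat requests for the same file are served from the local copy, which is where the saving comes from when many jobs draw on the same dataset.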


So far, NOvA and DES have put data into StashCache, and we’re now looking to expand to other experiments. Suitable datasets would be in the 100 GB to 10 TB range and likely to be accessed multiple times by Grid jobs. If you have such datasets and would like to make them available via StashCache, please make a request via the ServiceDesk.

— Robert Illingworth