Experience in production services

Huge amounts of computing resources are needed to process the data coming out of Intensity Frontier detectors. Although they address different questions, most experiments have similar workflows and computing needs. The OPOS team and the FIFE project capitalize on these similarities with a set of tools and practices that incorporate lessons learned from previous experiments.

I will briefly describe some of what I have witnessed during my time at Fermilab.

Jobsub and sam started as initiatives inside experiments of the last decade. MINOS produced the first incarnation of jobsub, while sam was first crafted during the Tevatron Run II period (CDF and D0). These experiments needed to manage their files better (sam) and to make job submission easier and friendlier (jobsub). They came up with solutions that worked effectively and could be generalized to other experiments. These solutions evolved over time, adjusting to new challenges, and eventually became the jobsub and samweb tools we use on a daily basis. So far, jobsub has had three incarnations, as shown in Figure 1. It started as a bash script at MINOS+, was then refactored as a Python script, and came into use by many experiments. At that point, the maximum number of concurrent jobs it could handle was 5K. It then evolved into a client-server architecture in order to handle a higher job load by scaling horizontally and to manage job log files better. Sam followed a similar path.

Figure 1 Jobsub evolution over time and experiments.

Back in the summer of 2014, when I joined Computing, most of the production work was performed on local grids. The need to serve peak demands and a forecasted increase in processing needs drove some of the experiments to make their workflows Open Science Grid (OSG) ready. Configuring CVMFS (CernVM File System) areas, checking data access in the scripts, using the right data transfer protocols, and adjusting network configurations, among other preparations, allowed the experiments to run their jobs offsite and benefit opportunistically from the vast number of resources available. In 2015, Mu2e was the first experiment to use OSG resources. Around the same time, the HEP cloud project, featured in the last edition of this newsletter, was delivered. Thousands of nodes from a public cloud were made available to the experiments in an elastic and scalable way. Now, experimenters can potentially reach resources all around the world, from grids or from public cloud providers such as AWS (Amazon Web Services).

Joining OPOS meant working side by side with experiments to understand their workflows and to help tune them so that they run better.

Working at the edge of knowledge is always an awe-inspiring duty. Natural human curiosity, combined with the need to answer our deepest questions and to understand our environment, has driven our research efforts throughout history. Not all questions have been answered, and history continues to be written every single day. On the physics front, Fermilab experiments work on some of the most challenging questions that remain unsolved. The universe remains a mystery at both the macroscale and the microscale. Delving into its nature is no easy job. It requires approaches never tried before and teamwork in its fullest expression to tackle these highly complex challenges. At Fermilab, we can see how engineers and scientists make this real and make the whole greater than the sum of its parts.

— Paola Buitrago