FIFE Notes – one list

  • Interpreting efficiency plots on FIFEMON
    There was an interesting trend in the US nuclear power industry through the 1990s and 2000s: despite no new plants being built, net electricity generation increased almost continuously, averaging around 2% gained per year even as some plants were being shut down. Instead of building new plants, utilities were finding ways to generate more from the existing infrastructure. In much the same way, scientific computing is working to make better use of the existing grid computing resources we have, so even as budget constraints limit how much new physical hardware we can purchase, we can continue to expand our effective ...
  • g-2 optimizes DAQ with GPUs on the OSG
    With the Muon g-2 experiment now taking data, it’s important to optimize its collection based on the physics at hand. g-2 has written a GPU-based simulation of the muon precession component of the muon anomalous magnetic moment. The simulation is used to test the Q-method, or charge integration, analysis of ωa. The Q-method is performed by summing multiple fills into a single flush, and then summing the flushes to produce a positron time spectrum. The CUDA-based code simulates this quickly by generating a large number of flushes in parallel on a GPU. The code has been used to test methodologies for ...
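    The fill-to-flush-to-spectrum summation described in the g-2 excerpt can be sketched on a CPU as follows. This is a minimal serial Python illustration, not the experiment’s CUDA code: the signal model (an exponential decay modulated by a cosine), the Gaussian noise, and every constant and function name are assumptions chosen for illustration.

```python
import math
import random

# Illustrative constants (assumptions, not taken from the article).
TAU_US = 64.4        # approximate time-dilated muon lifetime, microseconds
OMEGA_A = 1.439      # approximate anomalous precession frequency, rad/microsecond
ASYMMETRY = 0.2      # hypothetical amplitude of the precession wiggle
BIN_WIDTH_US = 1.0   # hypothetical time-bin width, microseconds
N_BINS = 200         # hypothetical number of time bins per fill


def expected_signal(t_us, n0=100.0):
    """Mean signal at time t: decay envelope times precession wiggle."""
    wiggle = 1.0 + ASYMMETRY * math.cos(OMEGA_A * t_us)
    return n0 * math.exp(-t_us / TAU_US) * wiggle


def simulate_fill(rng, noise_sigma=1.0):
    """One fill: the expected signal in each time bin plus Gaussian noise."""
    return [
        expected_signal(i * BIN_WIDTH_US) + rng.gauss(0.0, noise_sigma)
        for i in range(N_BINS)
    ]


def sum_traces(traces):
    """Add equal-length traces bin by bin."""
    return [sum(bins) for bins in zip(*traces)]


def simulate_spectrum(n_flushes, fills_per_flush, seed=0):
    """Sum fills into flushes, then sum the flushes into one time spectrum."""
    rng = random.Random(seed)
    flushes = [
        sum_traces([simulate_fill(rng) for _ in range(fills_per_flush)])
        for _ in range(n_flushes)
    ]
    return sum_traces(flushes)


spectrum = simulate_spectrum(n_flushes=4, fills_per_flush=8)
```

    Each flush in this sketch is built independently of the others, which is the property that lets a GPU generate many flushes in parallel.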
  • CVMFS as a software distribution source
    Because of the planned removal of Network Attached Storage (Bluearc) mounts from worker nodes, all experiments and projects will be expected to distribute their software to worker nodes with CVMFS. Many already do, but now the remaining ones will need to transition to CVMFS. This article is for them. FIFE currently supports two primary methods of doing that: Projects that are too small to be registered with the OSG as their own Virtual Organization, typically because their software is shared between more than one experiment, publish their files in the repository. Virtual Organizations get their own repository. The two different methods are managed ...
  • Experiment with the most opportunistic hours October-December 2017
    The experiment with the most opportunistic hours on OSG between Oct. 1, 2017 and Dec. 1, 2017 was NOvA with 1.765 million hours.  
  • Continuous Integration updates
    The Continuous Integration system continues to be improved, with new features added to meet users’ code-testing needs. There are different ways for users to keep testing their code to make sure that changes integrate into the existing code without breaking anything. The most common practice is to test the whole software suite of the experiment for each commit to verify that the code builds properly and can run a suite of quick CI tests to provide quick feedback on the status of the code. This configuration is really useful when there are many developers contributing to their experiment ...
  • HEPCloud program progress
    The HEPCloud program had a very productive 2017, successfully delivering several milestones targeted for this year! Since early this year, the team has been working on designing a new Decision Engine (DE) based on a framework architecture that can be extended to support future needs. The DE is an intelligent decision support system and is the heart of provisioning in the HEPCloud. The DE uses information from several sources such as job queues, monitoring systems like graphite, finances, allocations at NERSC, etc. to make intelligent decisions on facility expansion. In November, the HEPCloud team successfully demonstrated the newly developed DE. As part of ...
  • Most efficient big non-production users October-December 2017
    The most efficient big non-production user on GPGrid who used more than 100,000 hours for successful jobs since Oct 1, 2017 is Igor Tropin with 99.7% efficiency.

    Rank  Experiment  Name              Wall Hours  Efficiency
    1     mars        Igor Tropin       170,853     99.7%
    2     mars        Sergei Striganov  145,068     94.4%
    3     nova        Anna Holin        155,103     93.1%
    4     mu2e        Iuri Oksuzian     408,159     91.3%
    5     mars        Igor Rakhno       425,270     90.7%
  • Most efficient experiments October-December 2017
    The most efficient experiments on GPGrid that used more than 100,000 hours since Oct. 1, 2017 were MARS (98%) and MINOS (95%). For detailed information about Wall Hour usage and Experiment efficiency, see Fifemon. –Shreyas Bhat and Tanya Levshina
  • Semiannual FIFE Roadmap Workshop announced: Dec. 5, 2017
    Following a very successful discussion at the previous FIFE Workshop, the FIFE Group has decided to start holding semiannual Roadmap discussions with experiment Offline and Production Coordinators. The goal of the Roadmap discussion is to both inform experiments and gather feedback about strategic infrastructure changes and computing service modifications. These workshops will replace the annual FIFE Workshop. The half-day FIFE Roadmap Workshop will be held at Fermilab in the morning on Dec. 5, 2017 with Zoom connections for remote attendees available. Details about the workshop will be posted at the Indico link below when available, but mark this important discussion ...
  • Upgrading GPGrid to FermiGrid
    Experiments need ever-increasing computing capabilities and this trend is expected to continue.  The HEPCloud project is dedicated to meeting these needs as efficiently and cost-effectively as possible. Recently, GPGrid and Fifebatch went through a transition to better align the computing cluster with HEPCloud’s efforts. Previously Fifebatch existed as a glideinWMS pool separate from GPGrid. The glideinWMS requests resources in discrete combinations of CPU cores and memory.  Unfortunately, this means that jobs could not access any excess memory on GPGrid workers.  Many GPGrid workers have larger memory per core ratios than the standard glidein requests. Fifebatch and GPGrid were transitioned to combine aspects of ...
  • PNFS Dos and Don’ts
    A while back a user, let’s call him “Ken”, was trying to get some work finished on a very compressed timescale. It involved running a script that would generate some job scripts and stage files to dCache, and then submit jobs that take about one hour each. It was a well-tested workflow that followed FIFE best practices, but on this particular day Ken was seeing lots of errors like hanging dCache staging commands and stuck IFDH transfers causing jobs to go held. With the jaws of doom closing around him, Ken opened a Service Desk ticket with the dCache experts, ...
  • GRACC replacing Gratia as grid accounting system
    Earlier this year the OSG deployed its new grid accounting system, GRACC. Developed by Fermilab with OSG collaborators, GRACC aims to provide a more flexible and scalable (and faster!) accounting system than its predecessor, Gratia. More details about the motivation and design of GRACC are in the CHEP 2016 paper. Why does accounting matter to FIFE users when there are great monitoring tools like Fifemon available? Monitoring is more focused on the current state of the batch system and jobs, while accounting is meant to be a reference of what has happened. Who ran jobs where and how well did they use ...
  • Best practices for experiment database access
    A well-designed database can be a strong workhorse for an experiment.  However, if not built for the future that workhorse will age and become a detriment to analysis. As experiments scale up production during their lifecycles, adding more and faster CPUs, they require the same level of performance from the database. But that database can only carry so large a load after which performance drops and drops fast.  Planning must be done at the start on how to maintain the performance of the database throughout its life. Here are a few basic ways to help achieve this. Database indexes are a ...
  • FIFE Workshop 2017 lookback
    For the fifth year, experimenters and members of Scientific Computing Division (SCD) gathered for the annual FIFE Workshop. The workshop focus was divided between discussions of the FIFE roadmap on the first day and extensive tutorials on the second day. The workshop had more than 65 attendees from across all Frontiers (Intensity, Cosmic, and Energy) and from all departments within SCD. The agenda and contributions to the workshop can be found on the FIFE Workshop Indico page. While the series had very productive discussions, the FIFE group has decided that this will be the last summer in the series for ...
  • Experiment with the most opportunistic hours August-October 2017
    The experiment with the most opportunistic hours on OSG between Aug. 1, 2017 and Oct. 1, 2017 was NOvA with 3,254,356 hours.
  • Most efficient big non-production users August-October 2017
    Currently, the batch system is not properly reporting CPU time used by jobs. Due to this, efficiency metrics for jobs are unavailable. We will update this post as soon as the issue is resolved.
  • Most efficient experiments August-October 2017
    Currently, the batch system is not properly reporting CPU time used by jobs. Due to this, efficiency metrics for jobs are unavailable. We will update this post as soon as the issue is resolved.
  • FIFE Notes on Vacation
    FIFE Notes is on vacation. We’ll see you again in October!
  • Experiment with the most opportunistic hours April-June 2017
    The experiment with the most opportunistic hours on OSG between April 1, 2017 and June 1, 2017 was NOvA with 3,158,273 hours. — Shreyas Bhat and Tanya Levshina  
  • GPGrid Efficiency Policy
    Efficiency Threshold Reference Table

    Role        Memory  CPU  Success rate
    Analysis    15%     35%  50%
    Production  15%     35%  50%
    POMS        15%     35%  50%

    Job clusters with efficiency below these thresholds will be tagged as inefficient and the submitter will be contacted through email to diagnose and potentially modify their workflow. Total wall time for all jobs in a cluster must be greater than 500 hours to generate a warning email. Fermilab provides the General Purpose Grid cluster as a computing resource to experiments in order to perform computing for production and analysis workflows. With fixed capacity, the efficient utilization of GPGrid is critical to ensure that all experiments can accomplish their physics goals. Within the next few months, the FIFE group ...
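    The threshold rule quoted in the GPGrid Efficiency Policy excerpt can be sketched as a small check. This is only an illustration under stated assumptions: the ClusterStats record and the function names are hypothetical, and the sketch assumes a cluster is tagged when any one metric falls below its threshold, which the excerpt does not spell out.

```python
from dataclasses import dataclass

# Thresholds from the policy table; the three roles share the same values.
_LIMITS = {"memory": 0.15, "cpu": 0.35, "success_rate": 0.50}
THRESHOLDS = {role: dict(_LIMITS) for role in ("Analysis", "Production", "POMS")}

# A warning email requires more than this total wall time across the cluster.
MIN_TOTAL_WALL_HOURS = 500.0


@dataclass
class ClusterStats:
    """Hypothetical summary of one job cluster (fractions between 0 and 1)."""
    role: str
    memory_eff: float
    cpu_eff: float
    success_rate: float
    total_wall_hours: float


def inefficient_metrics(stats):
    """Return the names of the metrics that fall below the role's thresholds."""
    measured = {
        "memory": stats.memory_eff,
        "cpu": stats.cpu_eff,
        "success_rate": stats.success_rate,
    }
    limits = THRESHOLDS[stats.role]
    return [name for name, limit in limits.items() if measured[name] < limit]


def should_warn(stats):
    """Assumption: warn when any metric is low and wall time exceeds 500 hours."""
    big_enough = stats.total_wall_hours > MIN_TOTAL_WALL_HOURS
    return big_enough and bool(inefficient_metrics(stats))
```

    For example, a 1,200-hour Analysis cluster with 10% memory efficiency would be flagged on memory alone, while the same cluster at 100 total wall hours would not trigger an email.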
  • Most efficient big non-production users April-June 2017
    The most efficient big non-production user on GPGrid who used more than 100,000 hours for successful jobs since April 1, 2017 is Jacob Todd with 98.3% efficiency.

    Rank  Experiment  Name                    Wall Hours  Efficiency
    1     minos       Jacob Todd              1,602,390   98.3%
    2     mu2e        Kaitlin Ragosta         196,620     97.3%
    3     minerva     Jeffrey Kleykamp        184,118     96.5%
    4     cdf         Ashutosh Kotwal         196,491     96.3%
    5     minos       Adam Schreckenberger    282,450     96.2%
    6     mu2e        Yaqian Wang             580,724     95.6%
    7     minerva     Manuel Ramirez Delgado  138,277     95.4%
    8     minos       Dung Phan               636,221     92.8%
    9     mars        Vitaly Pronskikh        371,930     92.3%

    – Shreyas Bhat and Tanya Levshina
  • Reminder: Bluearc unmounting
    Nearly all forms of scientific computing at Fermilab require some form of non-volatile storage. While the primary storage format for scientific data at Fermilab is tape-backed mass storage systems (MSS, consisting of Enstore and dCache), there are a variety of other storage solutions available, depending on the type of scientific computing that needs to be accomplished. Network attached storage (NAS), which at Fermilab is primarily BlueArc systems, provides a good platform of POSIX-compliant storage for interactive computing. It does not, however, provide a robust platform for large-scale parallel access from grid jobs. Furthermore, NAS space is not easily accessible from ...
  • DUNE Workshop Review
    As membership of the DUNE collaboration approaches a thousand scientists from around the world, one of the challenges that the experiment faces is how to simulate the DUNE and ProtoDUNE detectors, and to analyze the data that these simulations will produce. But if you are a new student or postdoc who has just joined DUNE, where do you get started? The DUNE simulation code is daunting even for veteran scientists, let alone for students who only have a few hot summer months in Chicago to make a difference on the world’s leading neutrino experiment before returning to their quiet university ...
  • Singularity on the OSG
    Have you heard about Singularity? You should probably wait until 2040 to see it, but meanwhile, OSG and FIFE teams are working hard to introduce Singularity to improve users’ experience on the grid.  When running jobs on the grid, one issue that users encounter is that their test environment, for example on an interactive node, and the grid environment may differ enough that their jobs that worked in testing might fail on the grid.  Once these jobs do get running though, Fermilab and other sites currently use gLExec to make sure the jobs are running as the user who submitted ...
  • FERRY – Frontier Experiments RegistRY
    Have you ever wondered what happens when a new postdoc joins an experiment, or when someone you’re collaborating with wants to run a production workflow? By now, you’re probably used to accessing ServiceNow, navigating through pretty complicated choices, selecting an appropriate form, and submitting the request. Do you want to know what happens next? Probably not… Behind the scenes there is a pretty complicated and somewhat convoluted process that creates your user name, assigns you to appropriate groups, sets up the right ACLs on BlueArc and dCache, creates your user directory on the interactive nodes, and populates VOMS and GUMS. Over ...
  • Most efficient experiments April-June 2017
    The most efficient experiments on GPGrid that used more than 100,000 hours since April 1, 2017 were MARS (94%) and MINOS (93%). For detailed information about Wall Hour usage and Experiment efficiency, see Fifemon. –Tanya Levshina
  • Experiment with the most opportunistic hours February-April 2017
    The experiment with the most opportunistic hours on OSG between February 1, 2017 and April 1, 2017 was NOvA with 1,361,998 hours. –Tanya Levshina
  • Most efficient big non-production users February-April 2017
    The most efficient big non-production user on GPGrid who used more than 100,000 hours for successful jobs since February 1, 2017 is James Sinclair with 97.4% efficiency.

    Experiment  User             Wall Time      Efficiency
    DUNE        James Sinclair   109,382.227    97.4%
    MU2E        Iuri Oksuzian    232,699.789    96.3%
    MU2E        Peixin Liu       104,227.185    96.1%
    DARKSIDE    Chengliang Zhu   212,049.632    95.4%
    MU2E        Andrew Edmonds   112,237.518    95.0%
    MU2E        Ralf Ehrlich     231,795.86     93.8%
    MINERVA     Aaron Bercellie  101,910.009    91.6%
    MINOS       Jacob Todd       2,261,329.553  91.1%

    – Tanya Levshina
  • Most efficient experiments February-April 2017
    The most efficient experiments on GPGrid that used more than 100,000 hours since February 1, 2017 were DarkSide (95.13%) and SBND (91.13%). — Tanya Levshina
  • Great expectations: SC-PMT review 2017
    The Scientific Computing – Portfolio Management Team (SC-PMT) 2017 review was held at the end of February. SC-PMT is the annual review for the computing divisions and experiments to ensure computing resources (both hardware and people) are aligned with both P5 and FNAL objectives. In preparation for the review, each experiment provides spreadsheets to indicate their computing needs projected over three years: tape, disk, cpu, and services. Major stakeholders also prepare short presentations that highlight those needs. Additionally, the division prepares summary slides of those requests and how to best map those onto available resources. This year’s review resulted in ...
  • Git-‘R-Done: New OSG resources from the 2017 AHM
    In early March, over 120 people gathered at the San Diego Supercomputer Center for the annual Open Science Grid All-Hands Meeting. The meeting brings researchers from all fields of science and distributed computing technicians together to learn about the numerous areas of science that the OSG impacts, both within and outside of the particle physics community, and new tools and techniques to improve the performance of the available computing resources. All talks are available from the FNAL Indico server: The meeting featured a number of technical improvements to OSG that will be of interest to FIFE users. These include container ...
  • Ways to improve your life: POMS updates
    With an increasing demand from the production groups, the Production Operations Management System (POMS) is being extended to meet the Intensity Frontier (IF) experiments’ requirements for high scale production and distributed analysis processing. Several experiments are using or have expressed interest in using POMS. NOvA is extensively tracking their entire production. LArIAT and MicroBooNE have adopted POMS for some of their data processing. g-2 just started with Monte Carlo tests; Adam Lyon, quadrant head of the Scientific Computing Division and senior scientist of the g-2 collaboration, says: “Muon g-2 is excited to be about to launch a major simulation generation effort with ...
  • How to make datasets and influence storage: SAM4Users
    SAM4Users is a toolset designed to help analyzers create and manage datasets that are of interest to their analysis. It helps a common user to leverage all the great features SAM provides on their personal data files. Creating, relocating and retiring datasets of data files are no longer tasks that can only be done by a few experts in the experiment’s production group. With the SAM4Users toolset, these tasks all become as simple as the use of one command. Since the release of the toolset, it has attracted interest from many of the experiments that currently use SAM. Users from the NOvA experiment have been ...
  • What to expect when you’re registered for the fifth annual FIFE workshop
    For the fifth year running, the FIFE group is holding an early summer workshop for experiment analyzers, offline coordinators and Scientific Computing Division service providers. The FIFE workshop will take place June 21-22, 2017 in the Building 327 video conference room (a.k.a. the CDF Big Room). The goal of the workshop is to help experiments implement and integrate common SCD services into their computing workflows at all levels along with introducing new service features and newly deployed services. Based on feedback from last year’s attendees, we have made some changes and placed more emphasis on tutorials. The first day of ...
  • The art of efficiency
    In the near future, the FIFE group will implement a Grid Computing efficiency policy to help ensure maximal utilization of computing resources. In the coming weeks, the FIFE Group will be configuring jobsub servers to send email notifications to all FIFEBatch users informing them of the efficiency of clusters in terms of CPU time, memory utilization and scratch disk requested. These notifications are designed to help users understand where their jobs may experience inefficiencies and allow them to optimize the resources requested when submitting grid jobs. After a trial period of 1-2 months and gathering feedback from stakeholders, the FIFE ...
  • Fifemon Tips – April 2017
    Thanks to recent advances in deep learning, we are able to distill the thousands of monitoring inputs received every second into a single, targeted heuristic that tells you the state of the scientific computing systems, batch systems, and your jobs right now. This saves you from having to drill down into all the graphs and tables to figure out if everything is okay. Instead, just sit back and watch the blinking lights! Link: – Kevin Retzke
  • Test
    Come learn about the FIFE services available from Scientific Computing Division:
      • How to efficiently access data on distributed computing
      • How to get access to offsite computing resources
      • New services and features in the coming months
      • Improved documentation and tutorials
      • Learn about access to GPU clusters and Supercomputers/HPC
    Tuesday will focus on infrastructure and production processing. Wednesday will focus on analyzer tutorials and workflows. Everyone is welcome to attend both days, but we suggest that Tuesday may not be as useful as the tutorials on Wednesday. Registration will be available shortly, and as always, registration is FREE!!! Meeting page link in Indico:
  • FIFE Workshop 2017
    Come learn about the FIFE services available from Scientific Computing Division:
      • How to efficiently access data on distributed computing
      • How to get access to offsite computing resources
      • New services and features in the coming months
      • Improved documentation and tutorials
      • Learn about access to GPU clusters and Supercomputers/HPC
    Wednesday will focus on infrastructure and production processing. Thursday will focus on analyzer tutorials and workflows. Registration is now open from the link below and as always, registration is FREE!!! FIFE Workshop 2017 website in Indico:
  • Security basics for scientists and anyone who uses the scientific tools
    On Dec. 7, during Computer Security Awareness Day 2016,  Jeny Teheran presented the talk titled “Security Basics for scientists and anyone who uses the scientific tools”. The focus of this presentation was to explain basic security concepts that scientists encounter every day while working at Fermilab, such as Kerberos tickets, certificates and proxies. Along with other specialized presentations about security topics of current interest, Fermilab users learned about authentication, authorization and what happens behind the scenes when they request certificates and how the proxies are useful for job submission and data movement. A webinar and the slides are available at — ...
  • SCPMT17 is just around the corner
    The annual Scientific Computing Portfolio Management Team (SCPMT) review is scheduled for Feb. 23 and 24. This review helps develop the M&S purchases and SCD service directives for the following fiscal year(s), e.g., SCPMT17 helps to develop the FY18 and FY19 plans. In preparation, all experiments that use computing resources fill out a resource request spreadsheet. Experiments also register for the services they intend to use. The figure included shows the number of SCD services each experiment is using. This registration process has gone through a major improvement this year and is being tracked in the ServiceNow software infrastructure (also known ...
  • Jobsub Status and Future Plans
    The Jobsub high-availability servers have recently completed their third year as FIFE’s  preferred batch submission systems to the OSG.  We have worked to achieve a good balance between user convenience, security, service availability and resource utilization.  Experience and high user load have taught us where improvements are needed. The Jobsub servers are fairly ‘monolithic’, meaning each server houses a dedicated condor schedd daemon and ‘silos’ of logically distinct functionality.  A resource intensive request in one ‘silo’, such as ‘retrieve 300 GB  of log files for  job X from experiment Y’, can cause delays or failures to unrelated tasks of other experiments on the ...
  • GrafanaCon 2016
      Last December, Kevin Retzke traveled to New York to speak about Fermilab and Fifemon at GrafanaCon, the annual gathering of Grafana users and developers. Everyone was excited to learn more about Fermilab’s scientific mission, and how Grafana is being used to monitor scientific computing. The Grafana developers in particular love that Grafana is being used at places like Fermilab, CERN and SpaceX, and they made some great posters and stickers to celebrate this. Poster and sticker images are available at A video of Kevin’s talk is available on YouTube (length: 35 minutes) — Kevin Retzke
  • Come learn about the FIFE services available from Scientific Computing Division:
      • How to efficiently access data on distributed computing
      • How to get access to offsite computing resources
      • New services and features in the coming months
      • Improved documentation and tutorials
      • Learn about access to GPU clusters and Supercomputers/HPC
    Tuesday will focus on infrastructure and production processing. Wednesday will focus on analyzer tutorials and workflows. Registration is now open from the link below and as always, registration is FREE!!! Meeting page link in Indico:
  • Experiment with the most opportunistic hours December 2016 – February 2017
    The experiment with the most opportunistic hours on OSG between Dec. 1, 2016 and Feb. 1, 2017 was Mu2e with 2,066,794 hours. – Tanya Levshina
  • Most efficient big non-production users December 2016 – February 2017
    The most efficient big non-production users on GPGrid who used more than 100,000 hours since Dec. 1, 2016 are listed in the included table.

    Experiment  User                    Wall Time  Efficiency
    MINERVA     Leonidas Aliaga Soplin  189,447    98.4%
    MU2E        Hoai Nam Tran           1,026,781  97.8%
    LARIAT      Brandon Soubasis        359,649    97.4%
    MU2E        Yilong Zhang            101,499    96.7%
    MU2E        David Hedin             119,600    96.2%
    MU2E        Andrei Gaponenko        208,845    95.9%
    MU2E        Peixin Liu              734,964    95.8%
    NOVA        Ranjan Dharmapalan      193,017    95.4%
    MU2E        Robert Bernstein        525,470    94.4%
    NOVA        Siva Kasetti            128,727    94.3%
    DUNE        Paul Lebrun             147,925    94.2%
    MINERVA     Jiyeon Han              322,694    93.6%
    MU2E        Ralf Ehrlich            2,632,159  93.5%
    MARS        Sergei Striganov        258,022    93%
    MARS        Anthony Leveling        203,567    92.3%

    – Tanya Levshina
  • Most efficient experiments December 2016 – February 2017
    The most efficient experiments on GPGrid that used more than 100,000 hours since Dec. 1, 2016 were LArIAT (92.73%) and Mu2e (90.03%).   –Tanya Levshina
  • How to make the most of your holiday break
    How to get a whole bunch of jobs going while everyone else is sipping eggnog

    While everyone enjoys a break from work this time of year, one thing that won’t be taking a break is grid computing. GPGrid will run at full capacity at all times, as will many of the usual offsite computing clusters. We encourage users to continue to submit jobs so that they can run over the holidays. Support will also be at its normal levels on all days that are not Fermilab Holidays. Enjoy the holidays and, as always, take advantage of all available computing resources. With universities ...
  • HEPCloud doubles the size of CMS computing
    High-energy physics experiments have an ever-growing need for computing, but not every experiment needs all the cycles all the time. The need is driven by machine performance, experiment and conference schedules, and even new physics ideas. Computing facilities are purchased with the intention to meet peak workload rather than the average, which impacts the overall computing cost for the facility and the experiment. The HEPCloud program enables facility owners to seamlessly expand their resources from Fermilab to other grid and cloud resources as transparently to the user as possible. At the heart is a decision engine that chooses when to ...
  • MINOS running on Stampede
    The HEP computing model is constantly evolving, and one change that is currently taking place is increased use of High Performance Computing (HPC) resources. Some of these HPC resources include supercomputing sites such as NERSC, as well as the EXtreme Science and Engineering Discovery Environment (XSEDE). XSEDE is actually a collection of several HPC resources, including the Stampede cluster at the Texas Advanced Computing Center. The challenge of utilizing HPC resources is to interface with their job submission systems in a way that does not require any additional work by FIFE users, and to make sure that all usage is accounted ...
  • New developments in Continuous Integration (CI)
    Since the first article appeared in the August 2016 edition of FIFE Notes, the Continuous Integration (CI) project has been implementing new features and on-boarding new experiments and collaborations. DUNE, GlideinWMS, MINERvA and GENIE are ready to try it out. NOvA is using CI extensively for their production releases. Alex Himmel, the NOvA production coordinator, discussed their experience at the October CS liaison meeting. “Continuous integration has been a major benefit to NOvA — it allows us to catch issues one-by-one as they happen instead of all at once during an official production campaign. In just the last few months it ...
  • Fifemon monitoring of data transfers
    Back in about 2012, when we were designing the IFDH layer to insulate experimenters’ code from the gory details of data handling and operating on the grid, I drew a diagram that included the ifdh copy utility logging all the copies to a central logging facility and an agent of the monitoring system scraping those logs to provide counts, transfer rates, etc. While this never really got off the ground in the early versions of Fifemon, the current implementation, which uses Elasticsearch tools to collect statistics from logging data, has finally made it a reality. There are now two dashboards in Fifemon that provide ...
  • Most efficient experiments October – December 2016
    The most efficient experiments on GPGrid that used more than 100,000 hours since October 1, 2016 were LArIAT (96.11%) and MINOS (95.5%). –Tanya Levshina
  • Most efficient big non-production users October – December 2016
    The most efficient big non-production user on GPGrid who used more than 100,000 hours since October 1, 2016 was Konstantinos Vellidis with 98.7% efficiency.

    Experiment  User                    Wall Time  Efficiency
    MU2E        Konstantinos Vellidis   223,202    98.7%
    MU2E        Andrei Gaponenko        215,731    98.5%
    MINERVA     Leonidas Aliaga Soplin  195,286    98.4%
    MU2E        Ralf Ehrlich            1,838,030  98%
    MINERVA     Benjamin Messerly       176,768    97.8%
    MARS        Sergei Striganov        289,998    96.9%
    NOVA        Robert Hatcher          110,231    96.7%
    MARS        Diane Reitzner          221,340    96.3%
    NOVA        Ranjan Dharmapalan      105,014    96.1%
    LARIAT      Brandon Soubasis        284,794    96.1%
    MINOS       Adam Schreckenberger    142,396    96%
    MU2E        Zhengyun You            547,144    95.5%
    MINOS       Jacob Todd              169,156    95.1%
    MARS        Igor Tropin             334,364    95%
    NOVA        Stefano Germani         500,109    94.3%
    NOVA        Siva Kasetti            137,952    94%
    NOVA        Biao Wang               126,429    93.4%
    MARS        Vitaly Pronskikh        193,215    91.7%
    MARS        Nikolai Mokhov          192,886    90.9%

    –Tanya Levshina
  • Experiment with the most opportunistic hours October – December 2016
    The experiment with the most opportunistic hours on OSG between October 1, 2016 and November 30, 2016 was Mu2e with 3,607,577 hours. –Tanya Levshina
  • What’s new in Fifemon
    There are some new dCache dashboards in Fifemon.

    Dcache Transfer Overview

    The Dcache Transfer Overview dashboard gives a quick view of the current status of the Public dCache system. It also has drop-downs to limit data to particular dCache logical pool groups. If transfers seem to be hanging, this page may point out a higher than normal queue. As with all Fifemon pages, you can select a time range in the upper right of the page. Comparing the current queue sizes to the recent past will tell you if the system is unusually busy at the moment. The “Transfer rate test statistics” plots ...
  • DCAFI phase I close out and phase II prospective
    The first phase of the Distributed Computing Access with Federated Identities (DCAFI) Project was successfully completed in August 2016. All Fermilab users and experiments have been transitioned to the new certificate service provided by CILogon Basic Certificate Authority (CA). Thanks to the hard work of FIFE support personnel and the DCAFI project team, all of the activities in Phase 1 were completed on schedule and with minimal impact to the VOs’ scientific tasks.  During the transition phase, Fermilab VOs stopped using certificates from Fermilab Kerberized Certificate Authority (KCA) and started using certificates from CILogon Basic CA to access computing services. Some ...
  • Most efficient experiments August – September 2016
    The most efficient experiments on GPGrid that used more than 100,000 hours since August 1, 2016 were CDMS (99.15%) and LArIAT (98.53%).
  • Most efficient big non-production users August – September 2016
    The most efficient big non-production user on GPGrid who used more than 100,000 hours since August 1, 2016 was Tommaso Pajero with 99.1% efficiency.

    Experiment  User                        Wall Time  Efficiency
    CDMS        Tommaso Pajero              122,251    99.1%
    MU2E        Elisabetta Spadaro Norella  193,063    98.6%
    MU2E        Iuri Oksuzian               135,712    98.6%
    LArIAT      Brandon Soubasis            657,713    98.5%
    SBND        Davio A. Cianci             313,858    97.9%
    MARS        Igor Rakhno                 349,868    97.4%
    MARS        Sergei I. Striganov         115,616    96.2%
    MARS        Igor Tropin                 231,437    95.7%
    DARKSIDE    Chengliang Zhu              314,492    95.7%
    MARS        Diane Reitzner              155,674    95.2%
    MU2E        Zhengyun You                613,000    95.0%
    MARS        Daniel Ruterbories          127,721    94.3%
    SeaQuest    Kun Liu                     496,840    92.4%
    NOvA        Stefano Germani             148,504    91.9%
    MINERvA     Minerba Betancourt          170,350    91.4%
    DUNE        Tyler Johnson               100,518    91.3%

    – Tanya Levshina
  • Experiment with the most opportunistic hours August–September 2016
    The experiment with the most opportunistic hours on OSG between August 1, 2016 and September 30, 2016 was Mu2e with 2,252,507 hours. — Tanya Levshina
  • Batch computing direct access to BlueArc ending
    We discussed plans for unmounting the BlueArc data areas from FermiGrid worker nodes in the December 2015 issue of the FIFE Notes. As noted in that article, the overall data rates needed on FermiGrid exceed the capacity of the current BlueArc NFS servers. We are removing all access to the BlueArc /*/data and /*/ana areas from FermiGrid worker nodes. Both direct NFS mounts and access via GridFTP with ifdh cp will be removed. On request, we will retain the ifdh cp path for a limited, specified time during the transition. New experiments like DUNE are being deployed to FermiGrid without worker node ...
  • POMS: handing control over to experiments
    The Production Operations Management System (POMS) was initially developed for the OPOS group to help them effectively manage job submissions for multiple experiments. We are now working to make POMS into a tool that the experiments can use directly to help track their production computing. POMS: Lets experiment production users define “Campaign layers” of specific types of work and group them together into larger Campaigns as needed. Tracks job submissions of multiple jobs for those Campaign layers. Automatically performs job submissions if so configured. Can launch “recovery” jobs for files that didn’t process properly the last time. Can trigger launches in dependent Campaign layers to process output of ...
  • StashCache speeds up data access
    StashCache is an OSG service that aims to provide more efficient access to certain types of data across the Grid. Most jobs end up copying their input files all the way from Fermilab every time they run, which can be slow and inefficient. In some cases, the files get reused multiple times – an example  of this is the flux files used as input to GENIE simulations, where each individual job uses a random sub-selection from the entire dataset. When these jobs run opportunistically on grid sites, they would be more efficient if the data could be fetched from somewhere close by. The StashCache project aims to help with this. StashCache is ...
  • News from ICHEP and CHEP
    This past August saw a record number of physicists in Chicago for the International Conference on High Energy Physics. The 38th installment of this biennial conference featured several presentations by SCD members, not only in the Computing and Data Handling track but also in the Astroparticle, Detector R&D, Higgs, and Neutrino tracks. FIFE was especially visible at the conference, with presentations and posters from nearly every experiment that uses some or all of the FIFE tools. More information is available at The conference cycle continues October 10-14, with SLAC and LBNL hosting the Computing in High Energy and Nuclear Physics ...
  • Fifemon tips
    Interested in seeing what the batch jobs for your SAM project are doing? Go to the SAM Project Summary dashboard and select your project name from the dropdown.  We recently introduced a User Overview dashboard, which shows you at a glance the recent status of your batch jobs, SAM projects, and IFDH file transfers. Is the batch system down? There are several resources that provide you the latest news on any known outages or service degradations: The FIFE Summary dashboard has notes with known outages. Service Now maintains updated knowledge base articles for maintenance outages, and other news.
  • FIFE workshop report
    For the fourth year in a row, the FIFE Project hosted a two-day workshop dedicated to improving scientific computing for Intensity and Cosmic Frontier experiments at Fermilab. The first day focused on new tools, resources and a roadmap (including a new logo) for the future of the FIFE project. The second day consisted mostly of tutorials and best-practice talks and concluded with one-on-one expert consultations. All presentations are publicly available at With more than 60 attendees present, the discussion was lively and included ideas about access to high performance computing, GPU, and other new architectures. Mike Kirby ...
  • What’s new in Fifemon
    Since the last FIFE Newsletter there have been two Fifemon updates, v3.2 and v3.3. Notable new features include: SAM project monitoring, Grafana update, batch history and much more. SAM Project Monitoring Many FIFE users submit jobs to operate on files in a SAM project. To aid these users in better understanding the state of their SAM projects and associated batch jobs, we’ve started integrating samweb information into Fifemon. For starters, there’s the SAM Project Summary dashboard, from which you can select a SAM project and see the status of all associated batch jobs (and then drill down to further details on the ...
  • Most efficient experiments June – July 2016
    The most efficient experiments on GPGrid that used more than 100,000 hours since June 1, 2016 were LArIAT (98.70%) and MINOS (95.96%). — Tanya Levshina  
  • Most efficient big non-production users June – July 2016
    The most efficient big non-production user on GPGrid who used more than 100,000 hours since June 1, 2016 was Jacob R. Todd with 99.6% efficiency.
    User                     Experiment  Wall Hours  Efficiency
    Jacob R. Todd            MINOS       444,579     99.6%
    Gregory Pawloski         MINOS       615,563     98.7%
    Stefano Germani          MINOS       136,248     98.6%
    Adam P. Schreckenberger  MINOS       163,016     96.6%
    Rui Chen                 MINOS       680,125     95.4%
    Kuldeep Maan             NOVA        139,136     95.1%
    Kevin R. Lynch           MU2E        482,051     90.9%
    Peter S. Madigan         DUNE        172,394     90.9%
    Brandon J. Soubasis      LARIAT      185,452     90.6%
    — Tanya Levshina
  • Experiment with the most opportunistic hours June – July 2016
    The experiment with the most opportunistic hours on OSG between June 1, 2016 and July 31, 2016 was NOvA with 659,139 hours. — Tanya Levshina
  • Continuous Integration avoids unpleasant surprises
    The Continuous Integration (CI) project’s goal is to reduce the amount of human effort needed to verify each code release, and thus to reduce the frequency of wasted computing and human resources. The CI system is a set of tools, applications and machines that allows users to execute their validation tests with minimal effort. It is based on the open source Jenkins toolkit, which offers a powerful tool for complex software, and associated database, testing interfaces, and web facilities. More information (including links to detailed instructions on how to run CI tests) can be found at Experiments/collaborations already on-boarded are: uBooNE, DUNE (35T), LArIAT, ...
  • ProtoDUNE WMS workshop
    Every physics experiment plans several years in advance. Accurately understanding the needs and defining the computational requirements is fundamental to the discovery and success of the experiment. This planning process needs to account for several unknowns, project the technological advancements with a reasonable level of approximation and incorporate them in the planning process. This process is even more challenging for experiments like DUNE and protoDUNE with their wide international collaborations that are still far from the data-taking phase. Amir Farbin, along with his DUNE/protoDUNE colleagues, is working on defining the computational requirements for the distributed data and workflow management for ...
  • DCAFI transitioning status
    The Distributed Computing Access with Federated Identities (DCAFI) Project has been moving full steam ahead this summer. The short-term goal of the project is to move all Fermilab users from Fermi KCA, which is scheduled to be shut down at the end of September, to the new certificate service provided by the CILogon Basic CA. However, the long-term goal is more ambitious than that: making access to Fermilab easy and convenient for all Fermilab users, even for those without Fermilab accounts. In the first phase of DCAFI, we allowed our users to get access to Fermilab by just using their ...
  • Fifemon tips
    Did you know Grafana has two themes – light (white background) and dark (black background)? You can pick your default theme in your User Profile. Overwhelmed by the number of jobs showing up in a table?  Check out the list of filters in the drop-down above the table to help narrow it down. Not sure why your jobs are getting held? With new limits enforcement, you probably need to increase the resources requested (including runtime). The Why Are My Jobs Held? dashboard will show you the reason. Fifemon Tips is a regular column in FIFE Notes that aims to bring you useful tips, ...
  • MINOS computing on the OSG
    Computing in the MINOS/MINOS+ experiment has evolved greatly in the eleven years since it started taking data in the NuMI beam (April 2005). The scale has increased from the 50 core FNALU batch system to the 15000 cores of Fermigrid/GPGrid. As MINOS prepares to stop taking data at the 2016 Fermilab summer shutdown, another change will be the use of Fermibatch jobsub tools for opportunistic use of the Open Science Grid offsite. Use of remote resources is not new to MINOS. Monte Carlo data has always been generated by collaborating institutions outside Fermilab with eight sites participating over the years. Tarfiles of ...
  • FIFE workshop focuses on services and tutorials
    The annual FIFE Workshop will take place on June 20 and 21 this year with a focus on introducing new services and tutorials for current services. The talks on Monday are directed toward experiment Offline Coordinators and Production groups, and the talks on Tuesday are directed toward analyzers. The structure was chosen to allow attendees to more efficiently identify the talks they have the most interest in, but everyone is welcome to join and contribute to all parts of the workshop. The morning session on Monday will include talks about the current status of computing facilities along with plans for batch ...
  • Most efficient experiments April – May 2016
    The most efficient experiments on GPGrid that used more than 100,000 hours since April 1, 2016 were CDMS (98.27%) and CDF (97.67%). This can be seen in the above diagram where CDMS and CDF are the brightest green. — Tanya Levshina
  • Fifemon
    “There’s a dashboard for that” is the unofficial motto of Fifemon, and to that end we are constantly collecting more data and producing new dashboards. Since the last update, we have added nearly 20 new dashboards, including high-level computing summaries, dCache and SAM monitoring, and troubleshooting guides. In addition to these new dashboards, we have made many improvements to the existing dashboards. Read more for a look at upcoming changes, features and upgrades, and to learn how Fifemon is influencing, and how we are working with, the scientific computing community outside of Fermilab. Fifemon is continuing to evolve to provide better monitoring for ...
  • MicroBooNE data processing for Neutrino 2016
    MicroBooNE began collecting Booster Neutrino Beam data on Oct. 15, 2015. The optical trigger system was commissioned on Feb. 10, 2016, and MicroBooNE has been collecting optically triggered data since then. Fig. 1 shows the volume of data in SAM, showing an increased rate of data storage in early April corresponding to the reprocessing campaign. MicroBooNE has recently been engaged in various data processing campaigns for data reconstruction and Monte Carlo generation aimed at producing results for the Neutrino 2016 conference (July 4-9, 2016). Monte Carlo generation for Neutrino 2016 (MCC7) began in early February 2016. A new version of the reconstruction program was released in early April. Over the subsequent weeks, all raw ...
  • Most efficient big non-production users April – May 2016
    The most efficient big non-production user on GPGrid who used more than 100,000 hours since April 1, 2016 was Willis K. Sakumoto with 100% efficiency. The number of users with efficiency above 90% has doubled since March!
    Experiment  User                     Wall Hours  Efficiency
    CDF         Willis K. Sakumoto       510,351     100%
    MINOS       Jacob R. Todd            248,042     98.8%
    MINERVA     Benjamin Messerly        120,476     98.5%
    CDMS        Ben M. Loer              265,840     98.3%
    MU2E        Anthony Palladino Jr.    374,647     98.2%
    DUNE        Laura Fields             296,843     95.5%
    MU2E        Federica Bradascio       134,137     95.4%
    MARS        Vitaly Pronskikh         692,789     95.4%
    NOvA        Linda Cremonesi          102,330     94.4%
    MINERVA     Minerba Betancourt       239,606     92.8%
    MINERVA     Jeffrey Kleykamp         171,949     91.1%
    MU2E        Iuri A. Oksuzian         406,241     90.7%
    NOvA        Biao Wang                104,476     90.7%
    MINOS       Adam P. Schreckenberger  722,630     90%
    — Tanya Levshina
  • Docker on HPC
    The use of containers, like Docker, could substantially reduce the effort required to create and validate new software product releases, since one build could be suitable for use on both grid machines (both FermiGrid and OSG) as well as any machine capable of running the Docker container. Docker provides a means of delivering common, standard software releases using containers. With Docker technology, these containers can then be run on a variety of operating system flavors without change.  Early in 2016, we produced a series of Docker images, layered to reflect the dependencies of the software products that we use, ending at an experiment-specific ...
  • Experiment with the most opportunistic hours April – May 2016
    The experiment with the most opportunistic hours on OSG between April 1, 2016 and May 31, 2016 was NOvA with 1,362,980 hours. — Tanya Levshina
  • Components in experiment’s workflow management systems infrastructure
    Recently, a group in SCD identified and mapped different components typically found in the Workflow Management Infrastructure (WMS) of HEP experiments. The fact finding exercise resulted in a document that can be found in the CD DOCDB: Beyond its initial goal of setting a common vocabulary, this document is also useful for identifying gaps in the functionality provided by the infrastructure and/or identifying potential services that can be enhanced to provide new or missing functionality. The initial goal for this exercise was to leverage in-house expertise to come up with a common vocabulary based on the WMS components used by ATLAS, CMS and FIFE. ...
  • Experience in production services
    Huge amounts of computing resources are needed to process the data coming out of Intensity Frontier detectors. Although addressing different questions, most experiments have similarities in their workflows and computing needs. The OPOS team and the FIFE project capitalize on similarities with a set of tools and practices that incorporate lessons learned from previous experiments. I will briefly describe some of what I have witnessed during my time at Fermilab. Jobsub and SAM started as initiatives inside last-decade experiments. MINOS produced the first incarnation of jobsub, while SAM was first crafted during the Tevatron Run II period (CDF and D0). These experiments needed ...
  • 2016 Open Science Grid all-hands meeting
    Every spring, the entire Open Science Grid (OSG) community–consisting of resource owners and operators, users, and staff–gathers at the annual OSG all-hands meeting. The 2016 OSG all-hands meeting was held between Monday, March 14 and Thursday, March 17 at Clemson University in Clemson, SC, thanks in large part to Jim Bottum, the CIO and vice provost for technology at Clemson. The OSG is, as befitting a vehicle for distributed high-throughput computing, a highly distributed organization, with a community spread out across the US. As such, the all-hands meeting offers one of the few opportunities for face-to-face interaction for this community. Some of ...
  • Experiment with the most opportunistic hours Feb. – March 2016
    The experiment with the most opportunistic hours on OSG between Feb. 1, 2016 and March 31, 2016 was Mu2e with 4,804,996 hours. — Tanya Levshina
  • Most efficient big non-production users Feb. – March 2016
    The most efficient big non-production user on FermiGrid who used more than 100,000 hours since Feb. 1, 2016 was Willis K. Sakumoto with 100% efficiency.
    Experiment  User                   Wall Hours  Efficiency
    CDF         Willis K. Sakumoto     593,455     100%
    MARS        Vitaly Pronskikh       274,953     96.93%
    MU2E        Anthony Palladino Jr.  489,684     92.74%
    MINERvA     Benjamin Messerly      358,353     91.81%
    MARS        Nikolai Mokhov         105,269     91.41%
    — Tanya Levshina
  • Most efficient experiments Feb. – March 2016
    The most efficient experiments on FermiGrid that used more than 100,000 hours since Feb. 1, 2016 were CDF (100%) and MU2E (85.75%). — Tanya Levshina
  • HEP Cloud: How to add thousands of computers to your data center in a day
    Throughout any given year, the need of the HEP community to consume computing resources is not constant. It follows cycles of peaks and valleys driven by holiday schedules, conference dates and other factors. Because of this, the classical method of provisioning these resources at providing facilities has drawbacks, such as potential over-provisioning. Grid federations like Open Science Grid offer opportunistic access to the excess capacity so that no cycle goes unused. However, as the appetite for computing increases, so does the need to maximize cost efficiency by developing a model for dynamically provisioning resources only when they’re needed. To address this ...
  • DCAFI moving forward
    The Distributed Computing Access with Federated Identities (DCAFI) project is moving forward on schedule and should be ready to start migrating the first experiment in June. For those of you who are unfamiliar with it, these are the motivations for the project: Dependency on Kerberos makes it difficult for non-Fermilab scientists to access our grid resources remotely, obstructing our lab’s goal of being an international laboratory. Fermilab’s Kerberized Certificate Authority (KCA) server is losing its support starting September 2016, forcing us to find a replacement Certificate Authority for grid access. Asking users to manage their own certificates is a burden on them we avoided with KCA-based grid access, and we want ...
  • AFS transition
    We will be turning off the Fermilab AFS servers in early May this year because the sort of worldwide file sharing once unique to AFS is now provided by the Web. Click here for details of the migration shutdown. The /afs/ file system has served Fermilab well since about 1992. The primary services were: web server content Unix account login areas shared code and data areas for interactive use The advantages of AFS included: Kerberos authentication for flexible network sharing Access Control Lists (ACL’s) for good control of web content worldwide file sharing ( /afs/ and such ) thanks to good security. compatibility with many systems (AIX, IRIX, Linux, OSF1, SunOS, ...
  • Some useful pages for WordPress administration
    Here are links to pages that were developed by Katherine Lato for that give information on things like how to add a user, minor tweaks to the HTML, adding pictures, etc. Another useful source of information is the Using WordPress site at:    This is a useful place to record things that you find out that might be useful to other people. Here are some useful videos on WordPress: When watching videos or reading books about WordPress, bear in mind that at Fermilab there is one theme, plug-ins must be approved, and other restrictions that make much of the how-to general information not ...
  • New sites for MicroBooNE
    The MicroBooNE collaboration operates a 170 ton Liquid Argon Time Projection Chamber (LAr TPC) located on the Booster neutrino beam line at Fermilab. The experiment first started collecting neutrino data in October 2015. MicroBooNE measures low-energy neutrino cross sections and investigates the low-energy excess observed by the MiniBooNE experiment. The detector also serves as a next step in a phased program towards the construction of massive kiloton scale LAr TPC detectors for future long-baseline neutrino physics (DUNE) and is the first detector in the short-baseline neutrino program at Fermilab. In past LAr TPC experiments, event selection was not fully automated. Candidate event ...
  • Recent Open Science Grid milestones
    The Open Science Grid (OSG) has recently achieved a number of milestones and continues to provide distributed grid computing resources to scientists around the world at a record scale. 2015 marked the first year since the OSG’s inception a decade earlier that over 1 billion computational hours were consumed by OSG users. Additionally, November 2015 marked the first calendar month where OSG usage exceeded 100 million computational hours putting the OSG on track to breaking last year’s records in 2016. While the LHC experiments of ATLAS and CMS continue to be the cornerstones of both usage and provided resources on the OSG, ...
  • Most efficient experiments on FermiGrid that used more than 500,000 hours
    Most efficient experiments on FermiGrid that used more than 500,000 hours since Dec. 1, 2015 –  MINOS (98.72%) and MINERvA (85.80%) — Tanya Levshina
  • Most efficient big non-production users on FermiGrid
    The most efficient big non-production user on FermiGrid who used more than 100,000 hours since Dec. 1, 2015 was Iuri A. Oksuzian from Mu2e with 98.9% efficiency.
    Experiment  User               Wall Hours  Efficiency
    Mu2e        Iuri A. Oksuzian   102,314     98.9%
    Minos       Adam J. Aurisano   4,643,042   98.7%
    Mu2e        Anthony Palladino  999,170     95.9%
    – Tanya Levshina
  • Optimizing job submissions
    Carefully tailoring your resource requests will increase your job throughput. With partitionable slots now the norm on GPGrid, it’s important to have a good understanding of resource requirements. The memory, disk, and maximum runtime available in free job slots change as users submit jobs. As a result, there may be free slots that have fewer resources available to them than the defaults, as they are leftovers from the way the cluster was partitioned at a given moment. There’s nothing inherently wrong with these free slots, and any job that can fit into those requirements can run without problem. When you submit jobs ...
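    The effect of tailoring a request can be illustrated with a small sketch. This is not how the batch system performs matchmaking; it only shows the idea that a job can start in any free slot whose leftover memory, disk, and runtime all cover what the job asked for. The slot names, the resource figures, and the "default" request below are hypothetical values chosen for illustration, not GPGrid's actual defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Memory (MB), disk (MB), and maximum runtime (seconds)."""
    memory_mb: int
    disk_mb: int
    runtime_s: int


def fits(request: Resources, free: Resources) -> bool:
    """A job fits a slot only if every requested resource is available."""
    return (
        request.memory_mb <= free.memory_mb
        and request.disk_mb <= free.disk_mb
        and request.runtime_s <= free.runtime_s
    )


def matching_slots(request: Resources, free_slots: dict) -> list:
    """Return the names of the free slots that could run the job."""
    return [name for name, free in free_slots.items() if fits(request, free)]


# Hypothetical leftovers after other users' jobs carved up the cluster.
FREE_SLOTS = {
    "slot_a": Resources(memory_mb=2000, disk_mb=35000, runtime_s=8 * 3600),
    "slot_b": Resources(memory_mb=1200, disk_mb=10000, runtime_s=3 * 3600),
    "slot_c": Resources(memory_mb=500, disk_mb=5000, runtime_s=1 * 3600),
}

# Hypothetical "default-sized" request versus one tailored to the job.
default_request = Resources(memory_mb=2000, disk_mb=35000, runtime_s=8 * 3600)
tailored_request = Resources(memory_mb=1000, disk_mb=8000, runtime_s=2 * 3600)

if __name__ == "__main__":
    print("default request fits: ", matching_slots(default_request, FREE_SLOTS))
    print("tailored request fits:", matching_slots(tailored_request, FREE_SLOTS))
```

    In this toy setup the smaller request matches one more slot than the default-sized one, which is the sense in which a well-sized request can start sooner.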
  • Coming soon to Fifemon: job resource monitoring
    Wondering why your jobs have been put on hold? Want to better set your resource requests to make more efficient use of the grid (and to get your jobs starting faster)? This information, and more, will be available in the FIFE monitoring application, Fifemon, soon, and is already available for testing in pre-production. Starting with your User Batch Details dashboard, you can see what jobs have been put on hold and why, as well as a complete listing of job clusters currently in the system. Included in this table are the maximum resources used along with how much was requested. If ...
  • Know before you go (on the OSG)
    FIFE has a Wiki page to help match resource requirements and availability at remote sites. Previous editions of FIFE Notes have shown glimpses of the tremendous computing resources available on the Open Science Grid. These resources come from a large number of remote sites, each of which has its own limitations and policies regarding opportunistic access. Matching job requirements to sites can be a daunting task for users. In most situations, the FIFE Group recommends that, to get the most resources, users not specify sites explicitly when submitting jobs to OSG locations. Instead, sit ...
  • OPOS helping MINERvA with offline production
    MINERvA has been using offline production from OPOS for over a year. The following picture shows almost 200,000 jobs from MINERvA between June and December of 2015. (Note: DUNE’s part of the graph only covers December since they started with OPOS on November 30, 2015.) MINERvA started by using just monitoring services from OPOS, but now the team is running Monte Carlo (MC) simulation as well. It is anticipated that more services from OPOS will be used in the future. The following quote is from the MINERvA spokespeople to the team. “OPOS takes on the task of submitting and monitoring grid jobs for MINERvA.  ...