HEPCloud program progress

The HEPCloud program had a very productive 2017, successfully delivering several milestones targeted for this year!

Since early this year, the team has been designing a new Decision Engine (DE) based on a framework architecture that can be extended to support future needs. The DE is an intelligent decision support system and is the heart of provisioning in HEPCloud. It draws on information from several sources, such as job queues, monitoring systems like Graphite, finances and allocations at NERSC, to make intelligent decisions on facility expansion. In November, the HEPCloud team successfully demonstrated the newly developed DE. As part of the demo, the DE expanded the HEPCloud facility by provisioning over 1400 resources on AWS and at NERSC. Currently, we are focusing on adding more functionality to the DE and hardening it to run at production scale.

The team has been working hard on identifying and documenting security risks, providing guidelines and mitigation steps, documenting various processes, and developing the material required to train new members and users alike. The project presented a critical subset of these controls, the Interim Security Controls, to the Computer Security Board in early December, and the board has approved them. The HEPCloud project also benefits from the monitoring dashboard provided by the Landscape project, which is used to track resource usage, allocations at HPC sites and budget utilization in AWS. The monitoring can raise alarms on different conditions and automatically open ServiceNow tickets. This improved monitoring and alarming will be critical for efficient operation of the HEPCloud facility when it goes into production.

This year there were some changes to the HEPCloud project team. Eileen Berman joined Parag Mhashilkar as HEPCloud co-project manager. Eileen has extensive experience and expertise in managing complex projects.

The year 2018 will be crucial for the HEPCloud program. We will be working on obtaining Authority to Operate (ATO) from the Fermilab DOE Site Office and on moving the HEPCloud facility to production. The work done earlier this year on modernizing GPGrid and Fifebatch into a single infrastructure, FermiGrid, is one of the many necessary steps toward that goal. There is a lot of work to be done before moving to production, and I am hopeful that we will have another success story to share next year.

–Parag Mhashilkar