HEP Cloud: How to add thousands of computers to your data center in a day


Throughout any given year, the HEP community's demand for computing resources is not constant. It follows cycles of peaks and valleys driven by holiday schedules, conference dates and other factors. Because of this, the classical method of provisioning these resources at dedicated facilities has drawbacks, such as over-provisioning for the peaks. Grid federations like the Open Science Grid offer opportunistic access to the excess capacity so that no cycle goes unused. However, as the appetite for computing grows, so does the need to maximize cost efficiency by developing a model for dynamically provisioning resources only when they are needed.

To address this issue, the HEP Cloud project was launched by the Scientific Computing Division in June 2015. Its goal is to develop a virtual facility that provides a common interface to access a variety of physical computing resources, including local clusters, grids, high-performance computers, and community and commercial clouds. Now in its first phase, the project is evaluating the use of the “elastic” provisioning model offered by commercial clouds such as Amazon Web Services. In this model, resources are rented and provisioned dynamically over the Internet as needed.
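The economics of the elastic model can be sketched with a few lines of Python. Everything here is illustrative: the slot counts are taken from the plateau figures quoted in this article, but the `cloud_slots_needed` helper and the local-capacity number are hypothetical simplifications, not HEP Cloud's actual provisioning interface.

```python
# Illustrative sketch of elastic provisioning: rent cloud slots only for
# demand that exceeds the fixed local capacity, and release them when the
# peak passes. HEP Cloud's real provisioner is far more involved.

LOCAL_SLOTS = 2000  # hypothetical always-available local capacity


def cloud_slots_needed(demand: int) -> int:
    """Slots to rent from the cloud: only the excess over local capacity."""
    return max(0, demand - LOCAL_SLOTS)


# A hypothetical week of demand, peaking before a conference deadline.
demand_trace = [1500, 2500, 6000, 58000, 58000, 9000, 1200]

rented = [cloud_slots_needed(d) for d in demand_trace]
print(rented)  # → [0, 500, 4000, 56000, 56000, 7000, 0]
```

The point of the sketch is that the cloud bill tracks the peaks rather than the plateau: during the valleys nothing is rented, which is exactly the over-provisioning that a statically sized facility cannot avoid.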

The HEP Cloud project team successfully demonstrated this elastic model for CMS in January and February using Amazon Web Services, as shown below:

This plot shows the number of single core slots instantiated at Amazon over a period of just over two weeks, starting on Jan. 16. The slots have been obtained from Virtual Machines of different “types” (combinations of memory configuration, disk capacity and number of cores), instantiated at different Amazon Regions and Availability Zones. The figure shows the ability to sustain a plateau of 58,000 slots.

In March, the project team demonstrated that HEP Cloud is also a viable solution for the Intensity Frontier community. During the month, OPOS ran three production activities for NOvA on HEP Cloud, consisting of Monte Carlo and data event processing. The campaign contributed to the large computing "crunch" necessary to produce results for the Neutrino 2016 conference. It processed more than 90 TB of input data with 550,000 hours of computation, producing more than 150 TB of output. Data I/O was handled efficiently using S3, Amazon's highly scalable storage service. The team demonstrated that NOvA could sustain 7,300 concurrent cores, a burst of almost four times the slots allocated at Fermilab for NOvA. And thanks to the project's integration activities, NOvA used the same familiar services it uses for local computing, such as data handling and job submission.
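The campaign figures above can be sanity-checked with back-of-envelope arithmetic. This is only a sketch: it assumes the entire computation ran at the peak rate of 7,300 cores, whereas the real campaign was spread over the month.

```python
# Back-of-envelope check on the NOvA campaign figures quoted above.
cpu_hours = 550_000   # total computation
peak_cores = 7_300    # sustained concurrent cores
input_tb = 90         # input data processed

# Had the whole campaign run at the peak rate, the wall time would be:
days_at_peak = cpu_hours / peak_cores / 24
print(f"{days_at_peak:.1f} days at peak rate")  # → 3.1 days at peak rate

# Average input data consumed per CPU-hour (1 TB = 1e6 MB):
mb_per_cpu_hour = input_tb / cpu_hours * 1e6
print(f"{mb_per_cpu_hour:.0f} MB of input per CPU-hour")  # → 164 MB
```

In other words, the month-long campaign amounts to only a few days of wall time at the peak burst rate, which is the kind of short, intense demand the elastic model is designed to absorb.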

This plot shows the number of single core slots instantiated at Amazon for the NOvA experiment. Before the full set of 10,000 jobs in the system began to finish, HEP Cloud sustained 7,300 cores for a day, almost four times the experiment's local resource allocation.

A version of this article originally appeared on Feb. 15, 2016, at news.fnal.gov.

— Gabriele Garzoglio and Burt Holzman