The Jobsub high-availability servers have recently completed their third year as FIFE's preferred batch submission systems to the OSG. We have worked to achieve a good balance between user convenience, security, service availability, and resource utilization. Experience and high user load have taught us where improvements are needed.
The Jobsub servers are fairly ‘monolithic’, meaning each server houses a dedicated HTCondor schedd daemon and ‘silos’ of logically distinct functionality. A resource-intensive request in one ‘silo’, such as ‘retrieve 300 GB of log files for job X from experiment Y’, can cause delays or failures in unrelated tasks of other experiments on the same server. The monolithic nature of the servers also makes it harder to spin up new ones when demand is high.
Monitoring helps to pinpoint the problem. In the plot above you can see a spike in jobs on Jan. 10. This can be attributed to a deluge of job submissions from one user over a short period of time, which eventually crashed the HTCondor scheduler. Future releases of Jobsub will be restructured to prevent this from happening. In the meantime, please don’t submit more than 1,000 jobs per minute (e.g., if you have just submitted 10,000 jobs, wait 10 minutes before submitting more).
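For users scripting large submissions, the rate guideline above can be enforced with a simple batching wrapper. This is an illustrative sketch, not part of Jobsub itself: `submit` stands in for whatever submission call your experiment uses, and the batch size and pause encode the 1,000-jobs-per-minute guideline.

```python
import time

MAX_PER_MINUTE = 1000  # guideline above: no more than 1K submissions per minute


def batches(n_jobs, batch_size=MAX_PER_MINUTE):
    """Split a submission of n_jobs into chunks of at most batch_size."""
    full, rem = divmod(n_jobs, batch_size)
    return [batch_size] * full + ([rem] if rem else [])


def submit_throttled(n_jobs, submit, pause=60.0):
    """Call submit(count) for each chunk, sleeping between chunks.

    `submit` is a placeholder for your experiment's actual submission
    command; only the pacing logic is shown here.
    """
    chunks = batches(n_jobs)
    for i, count in enumerate(chunks):
        submit(count)
        if i < len(chunks) - 1:  # no need to sleep after the last chunk
            time.sleep(pause)
```

For example, `submit_throttled(10_000, my_submit)` would issue ten batches of 1,000 jobs with a one-minute pause between each, matching the guideline above.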
The path for separating the ‘silos’ and HTCondor daemons onto a more resilient, scalable architecture of independent machines is fairly straightforward. All Jobsub service requests will come through a REST API, which by design categorizes requests by experiment and type. Open-source load balancers that dispatch requests by URL category are mature, well-understood technology.
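As a sketch of what URL-category dispatch could look like, the HAProxy fragment below routes requests to separate backend pools by path. The path patterns, backend names, and addresses are hypothetical, chosen only to illustrate the idea of isolating one experiment's schedds and log retrieval from everything else.

```
# Hypothetical HAProxy frontend: route Jobsub REST requests by the
# experiment and request type encoded in the URL path.
frontend jobsub_api
    bind *:8443
    # Illustrative path patterns, not the actual Jobsub API routes.
    acl is_nova  path_beg /jobsub/acctgroups/nova
    acl is_logs  path_sub /sandbox
    use_backend log_retrieval   if is_logs
    use_backend nova_schedds    if is_nova
    default_backend shared_schedds

# One experiment's schedd pool, isolated from log retrieval traffic.
backend nova_schedds
    balance leastconn
    server schedd1 192.0.2.10:8443 check
    server schedd2 192.0.2.11:8443 check
```

With this kind of split, a heavy log-retrieval request lands on the `log_retrieval` pool and cannot starve the schedds serving submissions, and new backend servers can be added to a pool without touching the others.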
The changes described above have been prototyped using the open-source HAProxy load balancer with ‘stock’ Jobsub servers and HTCondor schedds on FermiCloud machines. Expect to see them rolled out in upcoming releases some time this year.