A while back a user, let’s call him “Ken”, was trying to finish some work on a very compressed timescale. The work involved running a script that would generate some job scripts, stage files to dCache, and then submit jobs that take about one hour each. It was a well-tested workflow that followed FIFE best practices, but on this particular day Ken was seeing lots of errors, such as hanging dCache staging commands and stuck IFDH transfers that caused jobs to go held. With the jaws of doom closing around him, Ken opened a Service Desk ticket with the dCache experts and learned that others were being affected too. In fact, the experts already understood what the problem was.
These errors were happening because some users were running ls commands on directories in /pnfs space that contained thousands, and in some cases tens of thousands, of files. Not only did these ls commands take an extremely long time to run, they also hurt other users’ ability to access their own files. This is because of the way dCache access works: there is a database behind the scenes, and these ls commands were overloading it.
The PNFS filesystem, like all filesystems, suffers from performance issues if too many files or subdirectories are placed within a single directory. While it is difficult to give a hard and fast limit, a good rule of thumb is to keep each directory to around 1,000 files. In some cases we do see acceptable performance with a few thousand files, but if you have directories with many thousands of files, especially more than 10,000, it’s a good idea to rethink your directory structure and move files into subdirectories. The same applies if a single directory holds thousands of subdirectories.
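As an illustration of the rule of thumb above, here is a minimal Python sketch of one way to split a large set of files into subdirectories of at most 1,000 files each. It is only a sketch under stated assumptions: the function names plan_buckets and apply_plan and the "part" subdirectory prefix are hypothetical, and whether a plain rename is the right way to move files in your /pnfs area is an assumption you should confirm with the dCache experts first.

```python
import os
from pathlib import Path

# Rule of thumb from the text above, not a hard limit.
MAX_FILES_PER_DIR = 1000


def plan_buckets(filenames, max_per_dir=MAX_FILES_PER_DIR, prefix="part"):
    """Map each file name to a numbered subdirectory name.

    Files are assigned in sorted order, so no subdirectory receives
    more than max_per_dir files. Nothing is moved here; this only
    builds the plan so it can be reviewed first.
    """
    if max_per_dir < 1:
        raise ValueError("max_per_dir must be at least 1")
    plan = {}
    for index, name in enumerate(sorted(filenames)):
        # Hypothetical naming scheme: part_0000, part_0001, ...
        plan[name] = f"{prefix}_{index // max_per_dir:04d}"
    return plan


def apply_plan(directory, plan):
    """Move each file into its planned subdirectory of `directory`.

    Assumes a rename within the same filesystem is acceptable for
    your storage area; check before running this on /pnfs.
    """
    base = Path(directory)
    for name, bucket in plan.items():
        target_dir = base / bucket
        target_dir.mkdir(exist_ok=True)
        os.rename(base / name, target_dir / name)
```

If possible, build the list of file names from your own bookkeeping (for example, the script that created the files) rather than by repeatedly listing the oversized directory. Review the plan before applying it, and choose subdirectory names that make sense for your workflow instead of the placeholder numbering used here.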
FIFE is always happy to consult on these matters, so don’t hesitate to contact us if you’d like to discuss making changes like these. Adhering to these limits gives both you and your collaborators better performance from dCache. Since our experiments share many of the lab’s resources, it’s important that each of us follows best practices.