AthenaTriton: A Tool for running Machine Learning Inference as a Service in Athena
Date:
Description
Machine Learning (ML)-based algorithms play increasingly important roles in almost all aspects of the data analyses in ATLAS. Diverse ML models are used in detector simulations, event reconstructions, and data analyses. They are being deployed in the ATLAS software framework, Athena. The primary approach to perform ML inference in Athena is to use the ONNXRuntime. However, some ML models could not be converted to ONNXRuntime because certain ML operations, such as the MultiAggregation in pyG as of writing, are not supported. Furthermore, a scalable inference strategy that maximises the event processing throughput is needed to cope with the ever-increasing simulation and collision data. A key element in that strategy should be enabling these ML algorithms to run on coprocessors like GPUs because not all computing sites have coprocessors. To that end, we introduce AthenaTriton, a tool that runs ML inference as a service based on the NVIDIA Triton Inference Server. With AthenaTriton, we give Athena the capability to act as a Triton client that sends requests to a remote or local server that performs the model inference. We will present the AthenaTriton design and its scalability in running ML-based algorithms. We emphasise that AthenaTriton can be used in both online and offline computing.