The paper "TurboTransformers: An Efficient GPU Serving System For Transformer Models" by Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou presents a new system aimed at optimizing the speed and memory utilization of Transformer models, particularly when deployed on GPUs.
The authors start by acknowledging the significant role Transformers play in current NLP tasks. Unlike Recurrent Neural Network (RNN) models, which must process a sequence token by token, Transformers can process all positions in a sequence in parallel and capture long-range dependencies more effectively, which typically yields higher accuracy on long sequences.
However, they note that efficiently deploying Transformers for online services in GPU-equipped data centers is challenging: the Transformer structure introduces far more computation than RNNs, making it difficult to meet latency and throughput requirements, and the models' larger memory footprint raises the cost of deployment.
To address these issues, the authors propose TurboTransformers, a system designed to accelerate Transformer inference and reduce memory usage. The key techniques used in TurboTransformers include:
A layer-wise adaptive scheduler: The scheduler adapts the execution order of the Transformer's layers so that computation can overlap with data transfer, improving execution efficiency.
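The scheduler itself is not reproduced in this summary, but the general idea of overlapping data transfer with computation can be sketched with CUDA streams: work is split into chunks, each chunk's host-to-device copy and kernel launch are issued on its own stream, and the copy for one chunk proceeds while another chunk is being computed. Everything below (the scale_kernel, chunk sizes) is illustrative rather than taken from TurboTransformers.

```cuda
// Illustrative only: overlap host-to-device copies with kernel execution
// by splitting work into chunks and issuing each chunk on its own stream.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kChunks = 4;
    const int kChunkElems = 1 << 20;               // 1M floats per chunk
    const int kTotal = kChunks * kChunkElems;

    float* h_buf;                                  // pinned host memory enables async copies
    cudaMallocHost(&h_buf, kTotal * sizeof(float));
    for (int i = 0; i < kTotal; ++i) h_buf[i] = 1.0f;

    float* d_buf;
    cudaMalloc(&d_buf, kTotal * sizeof(float));

    cudaStream_t streams[kChunks];
    for (int s = 0; s < kChunks; ++s) cudaStreamCreate(&streams[s]);

    // While chunk s is still being copied, the kernel for an earlier chunk can already run.
    for (int s = 0; s < kChunks; ++s) {
        size_t off = static_cast<size_t>(s) * kChunkElems;
        cudaMemcpyAsync(d_buf + off, h_buf + off, kChunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale_kernel<<<(kChunkElems + 255) / 256, 256, 0, streams[s]>>>(
            d_buf + off, kChunkElems, 2.0f);
        cudaMemcpyAsync(h_buf + off, d_buf + off, kChunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    printf("h_buf[0] = %f (expect 2.0)\n", h_buf[0]);

    for (int s = 0; s < kChunks; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Note that the pinned host buffer (cudaMallocHost) is what makes cudaMemcpyAsync truly asynchronous, which is the precondition for any such overlap.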
FP16-INT8 mixed precision strategy: The authors propose a strategy that mixes FP16 and INT8 precision during inference. Storing and computing parts of the model in lower precision reduces memory usage and speeds up inference without sacrificing model accuracy.
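The paper's exact quantization scheme is not detailed here; as a rough illustration of what an FP16-INT8 mix can look like, the sketch below stores a weight matrix in INT8 with a per-tensor scale, dequantizes it on the fly inside the kernel, and multiplies it with FP16 activations while accumulating in FP32. The kernel name, sizes, and scale value are made up for the example.

```cuda
// Illustrative only: INT8 weights with a per-tensor scale are dequantized on
// the fly and combined with FP16 activations; accumulation stays in FP32.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// One thread per output row: out[r] = sum_c dequant(w[r][c]) * x[c]
__global__ void int8_fp16_matvec(const int8_t* w, const __half* x,
                                 float* out, int rows, int cols, float scale) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) {
        float wv = scale * static_cast<float>(w[r * cols + c]); // dequantize INT8 weight
        acc += wv * __half2float(x[c]);                         // FP16 activation
    }
    out[r] = acc;
}

int main() {
    const int rows = 256, cols = 512;
    const float scale = 0.02f;   // per-tensor quantization scale (illustrative)

    int8_t* h_w = new int8_t[rows * cols];
    __half* h_x = new __half[cols];
    for (int i = 0; i < rows * cols; ++i) h_w[i] = static_cast<int8_t>(i % 7 - 3);
    for (int c = 0; c < cols; ++c) h_x[c] = __float2half(0.5f);

    int8_t* d_w; __half* d_x; float* d_out;
    cudaMalloc(&d_w, rows * cols * sizeof(int8_t));
    cudaMalloc(&d_x, cols * sizeof(__half));
    cudaMalloc(&d_out, rows * sizeof(float));
    cudaMemcpy(d_w, h_w, rows * cols * sizeof(int8_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, cols * sizeof(__half), cudaMemcpyHostToDevice);

    int8_fp16_matvec<<<(rows + 127) / 128, 128>>>(d_w, d_x, d_out, rows, cols, scale);

    float h_out[4];
    cudaMemcpy(h_out, d_out, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0..3] = %f %f %f %f\n", h_out[0], h_out[1], h_out[2], h_out[3]);

    cudaFree(d_w); cudaFree(d_x); cudaFree(d_out);
    delete[] h_w; delete[] h_x;
    return 0;
}
```

Storing the weights as INT8 halves (versus FP16) or quarters (versus FP32) their memory footprint, which is where most of the savings in such a strategy come from.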
Kernel fusion and specialization: Several small GPU kernels are fused into one larger kernel, which cuts kernel-launch overhead and avoids extra round trips through global memory, and kernels are specialized for specific problem sizes so they can be tuned more tightly. Both techniques improve inference speed.
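The specific fused kernels in TurboTransformers are not listed in this summary, so the sketch below uses a generic example of the technique: a bias add followed by a GELU activation, first as two separate element-wise kernels and then as a single fused kernel that performs both in one pass over the data.

```cuda
// Illustrative only: fusing two element-wise steps (bias add + GELU) into one
// kernel removes a kernel launch and one extra pass over global memory.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

__device__ float gelu(float x) {
    // tanh approximation of GELU
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Unfused version: two kernels, data written to and read back from global memory twice.
__global__ void add_bias(float* y, const float* bias, int n, int hidden) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += bias[i % hidden];
}
__global__ void apply_gelu(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu(y[i]);
}

// Fused version: one kernel, one pass over the data.
__global__ void add_bias_gelu(float* y, const float* bias, int n, int hidden) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu(y[i] + bias[i % hidden]);
}

int main() {
    const int hidden = 1024, tokens = 4096, n = hidden * tokens;
    float *d_y, *d_bias;
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_bias, hidden * sizeof(float));
    cudaMemset(d_y, 0, n * sizeof(float));
    cudaMemset(d_bias, 0, hidden * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    // Unfused: two launches.
    add_bias<<<grid, block>>>(d_y, d_bias, n, hidden);
    apply_gelu<<<grid, block>>>(d_y, n);
    // Fused: one launch doing the same work.
    add_bias_gelu<<<grid, block>>>(d_y, d_bias, n, hidden);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_y); cudaFree(d_bias);
    return 0;
}
```

Size specialization would go a step further, for example compiling a variant with the hidden dimension fixed at compile time so the compiler can fully unroll the inner indexing arithmetic.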
Cache-friendly algorithm design: The algorithms are organized so that data is reused from fast on-chip memory (caches and shared memory) instead of being repeatedly fetched from global memory, further speeding up inference.
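Again as an illustration rather than the authors' actual kernels, the sketch below shows the standard pattern such cache-friendly designs rely on: a tile of data is staged in fast on-chip shared memory so that both the reads from and the writes to global memory stay coalesced, here in a tiled matrix transpose.

```cuda
// Illustrative only: a tiled matrix transpose that stages data in on-chip
// shared memory so that global-memory reads and writes are both coalesced.
#include <cuda_runtime.h>
#include <cstdio>

#define TILE 32

__global__ void transpose_tiled(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x; // column in the input
    int y = blockIdx.y * TILE + threadIdx.y; // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x; // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y; // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    const int rows = 1024, cols = 2048;
    float *d_in, *d_out;
    cudaMalloc(&d_in, rows * cols * sizeof(float));
    cudaMalloc(&d_out, rows * cols * sizeof(float));
    cudaMemset(d_in, 0, rows * cols * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);
    cudaDeviceSynchronize();
    printf("transpose: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```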
The authors present an extensive experimental evaluation of TurboTransformers, comparing it against several other Transformer optimization systems, including TensorRT and TVM. The results show that TurboTransformers achieves higher throughput and lower latency than these systems while also reducing memory usage significantly.
In terms of implications, TurboTransformers can help deploy Transformer models in real-world, resource-constrained environments such as data centers. By reducing memory usage and improving speed, it lowers the cost and raises the efficiency of serving these models. Techniques such as kernel fusion, mixed precision, and overlapping computation with data transfer are not specific to Transformers, so the work is broadly applicable to other models and tasks.