Author: Sachin Mathew Varghese
Generative AI model inference in language processing tasks is based on token decoding. A token is the smallest unit into which text data can be broken down. With large autoregressive model architectures such as transformers[1], each token decoding step requires a computationally heavy forward pass. As model parameter counts grow, each iteration demands more computation time and resources, driving up operational costs. Recently, a number of optimization techniques have emerged that can improve decoding latency for generative AI model deployments and thereby lower overall costs. This article explains some of these advanced lossless optimization techniques for generative language model deployments.
Speculative Decoding with a Draft Model
Speculative sampling is a technique designed to reduce the average per-token decoding latency in scenarios where hardware is underutilized due to small inference batch sizes. The paper by Leviathan et al.[2] introduces speculative decoding, which can generate more than one correct token per forward pass. This is achieved by running a much smaller or computationally faster draft model, sharing the same vocabulary as the primary large model, in tandem with it.
Speculative execution is a common optimization technique in computer processors, where increasing concurrency speeds up task execution. In the context of generative models, the core idea is that complex language-modeling tasks often contain simpler subtasks that a smaller model can approximate well. This smaller draft model generates a predefined number of draft tokens. The larger target model then computes the probabilities of these draft tokens in a single parallel forward pass to verify them. Verification involves sampling one additional token: either a replacement for the first draft token that was rejected, or a new token appended to the sequence if all draft tokens were accepted. In this way, each iteration of the target model produces at least one new token, ensuring that the number of serial runs never exceeds the worst case of using the target model directly.
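To make the draft-then-verify loop concrete, here is a minimal Python sketch of a single speculative decoding iteration in the spirit of [2]. The `draft_probs_fn` and `target_probs_fn` callables are placeholders for models that return a next-token probability vector; in a real implementation the target model would score all positions in one batched forward pass rather than in a Python loop.

```python
import numpy as np

def speculative_decode_step(target_probs_fn, draft_probs_fn, prefix, gamma=4, rng=None):
    """One speculative decoding iteration, sketched after Leviathan et al. [2].

    `draft_probs_fn(tokens)` and `target_probs_fn(tokens)` are assumed to return
    a numpy probability vector over the next token; they are placeholders,
    not a real library API.
    """
    rng = rng or np.random.default_rng()

    # 1. The small draft model proposes `gamma` tokens autoregressively.
    draft_tokens, draft_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_probs_fn(ctx)                  # draft distribution q(x | ctx)
        tok = rng.choice(len(q), p=q)
        draft_tokens.append(int(tok))
        draft_dists.append(q)
        ctx.append(int(tok))

    # 2. The target model scores all gamma+1 positions; target_dists[i] is
    #    p(x | prefix + draft_tokens[:i]). A loop stands in for one batched pass.
    target_dists = [target_probs_fn(list(prefix) + draft_tokens[:i])
                    for i in range(gamma + 1)]

    # 3. Verify drafts left to right: accept token x with probability min(1, p(x)/q(x)).
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
            accepted.append(tok)
        else:
            # On rejection, resample from the adjusted distribution max(0, p - q).
            residual = np.clip(p - q, 0.0, None)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted                      # at least one new token is produced

    # 4. If every draft token was accepted, sample one extra token from the target.
    p_final = target_dists[gamma]
    accepted.append(int(rng.choice(len(p_final), p=p_final)))
    return accepted
```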
A limitation of the speculative decoding method is that it improves latency at the cost of additional computation. It is not feasible for deployment configurations that lack the extra resources needed to load the draft model and to run the parallel verification pass on the target model. Ultimately, the effectiveness of this method depends on how well the draft model approximates the target model.
Medusa Approach with Multiple Decoding Heads
The speculative decoding method described above requires training and maintaining a separate, operational draft model. The Medusa paper from Cai et al.[3] aims to eliminate this overhead. The idea is to enhance generative model inference by adding extra decoding heads that predict multiple tokens in parallel as part of the forward pass. These heads are fine-tuned in a parameter-efficient manner and can be added to an existing target model. The paper proposes training these Medusa heads either independently or together with the target model to increase the prediction accuracy of the overall system.
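A rough sketch of what such decoding heads can look like, assuming the residual feed-forward form described in [3]; the weight names and helper functions below are illustrative, not a real library API.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def medusa_head_logits(hidden, heads):
    """Hedged sketch of Medusa-style decoding heads, after [3].

    `hidden` is the target model's last hidden state at the current position;
    `heads` is a list of (W1, W2) weight pairs, one per lookahead position.
    Each head is a small residual feed-forward block followed by a vocabulary
    projection, fine-tuned while the target model is frozen or trained jointly.
    Shapes and initialization details are omitted for brevity.
    """
    logits_per_position = []
    for W1, W2 in heads:
        h = silu(hidden @ W1) + hidden      # residual feed-forward block
        logits_per_position.append(h @ W2)  # per-head vocabulary logits
    return logits_per_position              # one prediction per future position
```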
During inference, each Medusa head produces top-k predictions for its designated position, so the candidates form a tree rather than the single linear sequence of the draft-model approach. Candidate sequences can then be verified through the standard rejection scheme, as in the draft-model approach, or through a proposed typical acceptance method. The typical acceptance scheme admits plausible candidates from the heads’ output instead of using rejection sampling, which is more efficient when a higher temperature parameter is used to increase the “creativity” of a language model’s responses. Lastly, the Medusa approach promises easy integration with existing setups.
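The sketch below illustrates how per-head top-k predictions can be combined into candidate continuations and how a typical-acceptance test might look. In practice, [3] uses a pruned candidate tree processed with tree attention rather than the full Cartesian product, and the threshold values here are purely illustrative.

```python
import itertools
import numpy as np

def build_candidate_tree(head_topk_tokens):
    """Form candidate continuations from per-head top-k predictions.

    `head_topk_tokens[i]` is the list of top-k tokens proposed by head i for
    position i+1; the Cartesian product is the unpruned candidate tree.
    """
    return list(itertools.product(*head_topk_tokens))

def typical_acceptance(target_dist, token, epsilon=0.3, delta=0.1):
    """Hedged sketch of a typical-acceptance test in the spirit of [3]: accept a
    candidate token if its target probability clears a threshold that tightens
    when the target distribution is confident (low entropy). `epsilon` and
    `delta` are illustrative values, not the paper's settings."""
    p = np.asarray(target_dist)
    entropy = -np.sum(p * np.log(np.clip(p, 1e-12, None)))
    threshold = min(epsilon, delta * np.exp(-entropy))
    return p[token] >= threshold
```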
Recurrent Drafter for Fast Speculative Decoding
The ReDrafter paper from A. Zhang et al.[4] further extends the Medusa line of speculative decoding, again aiming to improve the latency of serving generative models. Unlike Medusa, which requires multiple draft heads with distinct sets of parameters, the recurrent drafter introduces dependencies among predictions at different positions through a recurrent neural network, so that a single set of draft-head parameters is shared across all predictive positions.
Another optimization in the ReDrafter approach is the use of beam search to filter out low-quality candidates from an otherwise large set of feasible candidate token sequences before verification by the target model. Finally, the authors present an efficient tree attention algorithm that is constructed dynamically at runtime from the beam search results.
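The following sketch illustrates the idea of a single recurrent draft head whose parameters are shared across positions, combined with beam search over partial drafts; `embed`, `rnn_step`, and `logits_proj` are placeholder callables, not ReDrafter’s actual interfaces.

```python
import numpy as np

def rnn_draft_beam_search(h0, embed, rnn_step, logits_proj, steps=4, beam_width=4):
    """Hedged sketch of a recurrent draft head with beam search, in the spirit of [4].

    A single set of parameters (`rnn_step`, `logits_proj`) is reused at every
    draft position; the hidden state carries the dependency between positions.
    """
    # Each beam entry: (cumulative log-prob, drafted tokens, hidden state).
    beams = [(0.0, [], h0)]
    for _ in range(steps):
        expanded = []
        for score, tokens, h in beams:
            logits = logits_proj(h)                       # shared output head
            logp = logits - np.logaddexp.reduce(logits)   # log-softmax
            for tok in np.argsort(logp)[-beam_width:]:    # expand top candidates
                h_next = rnn_step(h, embed(int(tok)))     # shared recurrent update
                expanded.append((score + logp[tok], tokens + [int(tok)], h_next))
        # Keep only the highest-scoring partial drafts before the next step.
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]
    # The surviving candidate sequences are then verified by the target model.
    return [(tokens, score) for score, tokens, _ in beams]
```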
Conclusion
These techniques are three examples of large language model inference performance optimization using speculative decoding[5]. An important assumption in all of them is that processing and validating multiple draft tokens in parallel is roughly as fast as processing a single token, so the latency of speculative decoding is no worse than that of the standard approach. In addition, multiple draft tokens must be validated successfully throughout the generation, allowing output token generation to progress by more than one token per forward pass on average. As long as these two conditions hold, speculative decoding speeds up token generation in a lossless manner, reducing operational deployment costs and improving performance.
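To put a number on “more than one token per forward pass,” the analysis in [2] gives a closed form for the expected number of tokens produced per target iteration, under the simplifying assumption that each draft token is accepted independently with the same probability; the short calculation below is illustrative.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens produced per target forward pass under the i.i.d.
    acceptance analysis in [2]: (1 - alpha**(gamma + 1)) / (1 - alpha),
    where alpha is the per-token acceptance rate and gamma the number
    of draft tokens proposed each iteration."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Example: with a 70% acceptance rate and 4 draft tokens per iteration,
# roughly 2.8 tokens are produced per target forward pass on average.
print(round(expected_tokens_per_pass(0.7, 4), 2))   # -> 2.77
```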
References
[1] A. Vaswani et al., “Attention Is All You Need,” August 2023. [Online]. Available: https://arxiv.org/pdf/1706.03762 [Accessed: 13 October 2024]
[2] Y. Leviathan et al., “Fast Inference from Transformers via Speculative Decoding,” May 2023. [Online]. Available: https://arxiv.org/pdf/2211.17192 [Accessed: 13 October 2024]
[3] T. Cai et al., “MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” June 2024. [Online]. Available: https://arxiv.org/pdf/2401.10774 [Accessed: 13 October 2024]
[4] A. Zhang et al., “Recurrent Drafter for Fast Speculative Decoding in Large Language Models,” March 2024. [Online]. Available: https://arxiv.org/pdf/2403.09919v1 [Accessed: 13 October 2024]
[5] NVIDIA, “Speculative Sampling,” TensorRT-LLM documentation, nvidia.github.io. [Online]. Available: https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html [Accessed: 13 October 2024]
Author Bio:
Sachin Mathew Varghese is the workstream lead for the LF AI & Data Gen AI Commons group, focusing on models, applications, and data. He is an AI engineer with extensive experience building enterprise platforms and software products for organizations across India, the United Kingdom, and the United States. His main areas of expertise and interest are ML model serving at scale, generative AI deployments, inference optimization, operational monitoring, compliance, and management operations. Sachin is also an open-source advocate and a co-author of the LF AI & Data Model Openness Framework paper.