We are thrilled to introduce vLLM, LF AI & Data’s latest incubation project. vLLM is an open-source, high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs). LLMs promise to fundamentally change how we use AI across all industries, but serving these models is challenging and can be surprisingly slow, even on expensive hardware. vLLM tackles exactly this problem: making LLM inference and serving fast and affordable.
vLLM is built around PagedAttention, a new attention algorithm that efficiently manages attention key and value memory. Equipped with PagedAttention, vLLM redefines the state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers without requiring any changes to the model architecture.
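To give a sense of the developer experience, here is a minimal sketch of offline batched inference with vLLM’s Python API; the model name and prompts are illustrative only, and any HuggingFace-compatible model can be substituted:

```python
# Minimal sketch of offline batched inference with vLLM.
# Assumes the vllm package is installed and the chosen model fits on one GPU.
from vllm import LLM, SamplingParams

prompts = [
    "The future of open-source AI is",
    "Efficient LLM serving matters because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Illustrative model; any HuggingFace-compatible checkpoint can be used.
llm = LLM(model="facebook/opt-125m")

# Requests are batched automatically; results come back in prompt order.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```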
“vLLM has already brought incredible value to the open source community and the advancement of AI,” said Bill Higgins, VP of watsonx Platform Engineering and Open Innovation at IBM. “IBM is excited to see how vLLM’s vibrant, inclusive community will continue to grow and build even more confidence for enterprise adoption as a project in LF AI & Data.”
vLLM sets a new standard for LLM serving performance. It optimizes attention key and value memory through PagedAttention, handles incoming requests with continuous batching, and uses CUDA/HIP graphs for fast model execution. vLLM also supports a range of quantization options, including GPTQ, AWQ, SqueezeLLM, and FP8 KV cache, and ships with optimized CUDA kernels for peak performance.
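As an illustration, quantized checkpoints are enabled through a single constructor argument; the sketch below assumes an AWQ-quantized model is available, and the checkpoint name is a placeholder:

```python
# Sketch of loading an AWQ-quantized model in vLLM.
# The checkpoint name is illustrative; substitute any AWQ-quantized model.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # illustrative quantized checkpoint
    quantization="awq",                    # "gptq" and "squeezellm" are also accepted
)

print(llm.generate("Quantization helps serving because")[0].outputs[0].text)
```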
vLLM integrates with popular HuggingFace models and supports high-throughput serving with a variety of decoding algorithms, including parallel sampling and beam search. It enables distributed inference with tensor parallelism and handles real-time workloads with streaming outputs. The library provides an OpenAI-compatible API server for a familiar interface and runs on both NVIDIA and AMD GPUs. vLLM also includes experimental features such as prefix caching and multi-LoRA support, further enhancing flexibility and usability.
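For example, the OpenAI-compatible server can be started with `python -m vllm.entrypoints.openai.api_server --model <model>` (adding `--tensor-parallel-size` for multi-GPU inference), after which any OpenAI client can talk to it. The snippet below is a sketch assuming such a server is already running locally on the default port with an illustrative model loaded:

```python
# Sketch of querying a locally running vLLM OpenAI-compatible server.
# Assumes the server was started on the default port 8000 with the model below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no key is required unless the server sets one
)

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",     # must match the model the server loaded
    prompt="Open source AI infrastructure matters because",
    max_tokens=64,
)
print(completion.choices[0].text)
```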
Real-World Applications
vLLM’s capabilities have been tested and proven in real-world applications. In Chatbot Arena and the Vicuna demo, vLLM has demonstrated its ability to deliver fast, efficient LLM inference and serving, making it an ideal solution for research teams and organizations looking to leverage LLMs without incurring prohibitive costs.
“Under a neutral host with open, transparent technical governance, LF AI & Data stands as a beacon of neutrality, empowering projects like vLLM to lead and innovate autonomously. We welcome vLLM to our foundation, where collaborative efforts leverage extensive resources, expertise, and community support. Together, we are pioneering new standards in AI, pushing the boundaries of what’s possible, and fostering a vibrant ecosystem of sustainable, open-source AI projects,” said Ibrahim Haddad, Executive Director of LF AI & Data.
Get Involved
For more information and to contribute to the vLLM project, visit the GitHub repository. Join us in revolutionizing the future of AI with vLLM.
LF AI & Data Resources
- Learn about membership opportunities
- Explore the interactive landscape
- Check out our technical projects
- Join us at upcoming events
- Read the latest announcements on the blog
- Subscribe to the mailing lists
- Follow us on Twitter or LinkedIn