vLLM is an open-source library for fast LLM inference and serving. It is built around PagedAttention, an attention algorithm that manages attention key and value (KV) cache memory efficiently by storing it in fixed-size blocks, much like virtual-memory paging in operating systems. With PagedAttention, vLLM sets a new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.
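The core idea behind PagedAttention's memory management can be illustrated with a minimal Python sketch: each sequence keeps a block table mapping logical token positions to fixed-size physical blocks drawn from a shared pool, so its KV cache need not be contiguous. The names (`BlockAllocator`, `Sequence`) and the block size here are illustrative assumptions, not vLLM's actual implementation, which lives in optimized CUDA/C++ kernels.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM defaults to 16)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

@dataclass
class Sequence:
    """Maps logical token positions to physical blocks via a block table,
    so a sequence's KV memory need not be contiguous."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # Grab a fresh block only when the current one is full
        # (or the sequence is empty): memory grows on demand.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(6):  # generate 6 tokens
    seq.append_token(allocator)

# 6 tokens with a block size of 4 occupy 2 physical blocks.
print(len(seq.block_table))
```

Because blocks are allocated on demand and returned to the pool when a sequence finishes, this scheme avoids the large contiguous pre-allocations (and resulting fragmentation) that make conventional KV caching wasteful.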

The University of California, Berkeley donated vLLM to the LF AI & Data Foundation as an incubation-stage project in July 2024.