- cross-posted to:
- [email protected]
https://github.com/vllm-project/vllm
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
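If you want a quick feel for the API, here's a minimal sketch of offline batched inference with vLLM's Python library. The model name (facebook/opt-125m) and sampling values are just placeholders; swap in any supported HuggingFace model ID.

```python
# Minimal sketch of offline batched inference with vLLM.
# The model name below is only an example, not a recommendation.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling settings; setting n > 1 would request parallel sampling per prompt.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Loads the model and allocates the PagedAttention KV cache.
llm = LLM(model="facebook/opt-125m")

# Prompts are batched together under the hood.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The OpenAI-compatible server mentioned above is a separate entrypoint (something along the lines of `python -m vllm.entrypoints.openai.api_server --model <model>`, depending on the vLLM version); see the repo's docs for the exact command.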
YouTube video describing it: https://youtu.be/1RxOYLa69Vw