- cross-posted to:
- [email protected]
https://github.com/vllm-project/vllm
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
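If you want a quick feel for the API, here's a minimal sketch of offline batched inference with vLLM's Python library. The model name (facebook/opt-125m) and sampling values are just placeholders; swap in any supported HuggingFace model ID.

```python
# Minimal sketch of offline batched inference with vLLM.
# The model name below is only an example, not a recommendation.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling settings; setting n > 1 would request parallel sampling per prompt.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Loads the model and allocates the PagedAttention KV cache.
llm = LLM(model="facebook/opt-125m")

# Prompts are batched together under the hood.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The OpenAI-compatible server mentioned above is a separate entrypoint (something along the lines of `python -m vllm.entrypoints.openai.api_server --model <model>`, depending on the vLLM version); see the repo's docs for the exact command.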
YouTube video describing it: https://youtu.be/1RxOYLa69Vw