This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest thing, though, is that the PR claims it works for starcoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I’m extremely curious to see what this brings…