This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest thing, though, is that the PR claims it works for starcoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I’m extremely curious to see what this brings…