What is wrong with LLM benchmarks, and why are we still using them?

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

AsAnAILanguageModel · 2 years ago

I just started saving a list of prompts to test models with. It's not exhaustive, but a few of them help me cull new models quickly. Naturally, I can't share them because I don't want them to leak into training data. :)

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I have a similar list of prompts/test cases that I use.

However, my experience has been that all fine-tuned LLaMA models give pretty much the same results. I haven't found a model that passes any of my "test cases" that the others fail, and until OpenOrca preview 2, none had failed a test case that the others passed. All the models feel pretty much the same in terms of actual abilities; the only noticeable difference is that they phrase their answers slightly differently.
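
For anyone wanting to check the same thing on their own private prompt list, here is a rough sketch of how the comparison could be tabulated. Everything in it is an assumption for illustration: a "model" is just any function from prompt to reply, the stub models and the two test cases are made up, and `run_suite` / `discriminating_cases` are hypothetical helper names, not part of any library.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a model maps a prompt to a reply,
# and a check decides whether a reply counts as a pass.
Model = Callable[[str], str]
Check = Callable[[str], bool]


def run_suite(
    models: Dict[str, Model],
    cases: Dict[str, Tuple[str, Check]],
) -> Dict[str, Dict[str, bool]]:
    """Return results[case][model] = True if that model passed that case."""
    results: Dict[str, Dict[str, bool]] = {}
    for case_name, (prompt, check) in cases.items():
        results[case_name] = {
            model_name: check(model(prompt))
            for model_name, model in models.items()
        }
    return results


def discriminating_cases(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Cases where at least one model passed and at least one failed."""
    return [
        case
        for case, by_model in results.items()
        if len(set(by_model.values())) > 1
    ]


# Stub models standing in for real ones (assumption: swap in real calls).
models = {
    "model_a": lambda prompt: "4",
    "model_b": lambda prompt: "5",
}

# Made-up test cases: (prompt, pass/fail check on the reply).
cases = {
    "arithmetic": ("What is 2 + 2?", lambda reply: "4" in reply),
    "greeting": ("Say hello.", lambda reply: "hello" in reply.lower()),
}

results = run_suite(models, cases)
print(discriminating_cases(results))  # prints ['arithmetic']
```

If the list of discriminating cases comes back empty, every model passed and failed exactly the same prompts, which is the "they all feel the same" situation described above.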