What is wrong with LLM benchmarks, and why are we still using them?

@[email protected] · 11 months ago

@[email protected] · 11 months ago

I’ve read all kinds of claims, from very enthusiastic statements to papers like “The False Promise of Imitating Proprietary LLMs”.

I think the main problem is that it is next to impossible to benchmark something like intelligence. We can’t even assess it properly in humans. It depends on many different skills, from knowledge to reasoning, and knowledge in turn depends on which topic you’re talking about. So any overall intelligence score depends on the exact task you’re probing for.

My main problem with LLMs and benchmarks is that it is difficult to evaluate the output automatically, because the output is natural language. If you constrain the output enough to make automatic scoring possible, the test drifts too far from real-world scenarios. The second thing is that people often measure reasoning skills. I like storytelling and (role-playing) chatbots. That’s a very different task, and models that are good at answering questions sometimes just don’t excel at writing dialogue with a good flow, or at describing things vividly when asked to write a novel.
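
To make the automatic-evaluation point concrete, here is a minimal sketch of an exact-match scorer, the kind of metric that only works once answers are tightly constrained. The function names and the example data are made up for illustration and are not taken from any particular benchmark.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


# Hypothetical question: "What is the capital of France?"
reference = "Paris"
answers = {
    "constrained": "Paris",
    "free-form": "The capital of France is Paris.",
    "wrong": "Lyon",
}

scores = {name: exact_match(ans, reference) for name, ans in answers.items()}
print(scores)
# {'constrained': True, 'free-form': False, 'wrong': False}
```

The free-form answer is correct, but it gets the same score as the wrong one. You can force the model to reply with a single word so the metric works, but then you are no longer testing how it writes in a normal conversation.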

FYI: Someone compiled a list of papers discussing LLM evaluation.