What is wrong with LLM benchmarks, and why are we still using them?

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

What is wrong with LLM benchmarks, and why are we still using them?

Kerfuffle · 2 years ago

But I can’t accept such an immediately-noticeable decline in real-world performance (model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.

Your criticisms are at least partially true and benchmarks like “x% of ChatGPT” should be looked at with extreme skepticism. In my experience as well, parameter size is extremely important. Actually, even with the benchmarks it’s very important: if you look at the ones that collect results you’ll see, for example, there are no 33B models that have a MMLU score in the 70s.

However, I wonder if all of the criticism is entirely fair. Just for example, I believe MMLU is 5-shot, ARC is 10-shot. That means there are a bunch of examples of that type of question and the correct answer before the one the LLM has to answer. If you’re just asking it a question, that’s 1-shot: it has to get it right the first time, without any examples of correct question/answer pairs. Seeing a high MMLU score doesn’t necessarily directly translate to 1-shot performance, so your expectations might not be in line with reality.

Also, different models have different prompt formats. For these fine-tuned models, it won’t necessarily just say “ERROR” if you use the wrong prompt form but the results can be a lot worse. Are you making sure you’re using exactly the prompt that was used when benchmarking?

Finally, sampling settings can also make a really big difference too. A relatively high temperature setting when generating creative output can be good but not when generating source code. Stuff like repetition, frequency/presence penalties can be good in some situations but maybe not when generating source code. Having the wrong sampler settings can force a random token to be picked, even if it’s not valid for the language, or ban/reduce the probability of tokens that would be necessary to produce valid output.

You may or may not already know, but LLMs don’t produce any specific answer after evaluation. You get back an array of probabilities, one for every token ID the model understands (~32,000 for LLaMA models). So sampling can be extremely important.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate), sometimes I try it on a deterministic setting.

I wasn’t aware that the benchmarks are multi-shot. I haven’t looked so much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.

Kerfuffle · 2 years ago

For sampling I normally use the llama-cpp-python defaults

Most default settings have the temperature around 0.8-0.9 which is likely way too high for code generation. Default settings also frequently include stuff like a repetition penalty. Imagine the LLM is trying to generate Python, it has to produce a bunch of spaces before every line but something like a repetition penalty can severely reduce the probability of the tokens it basically must select for the result to be valid. With code, there’s often very little leeway for choosing what to write.

So you said:

I’m aware of how sampling and prompt format affect models.

But judging the model by what it outputs with the default settings (I checked and it looks like for llama-cpp-python it has both a pretty high temperature setting and a repetition penalty enabled) kind of contradicts that.

By the way, you might also want to look into the grammar sampling stuff that recently got added to llama.cpp. This can force the model to generate tokens that conform to some grammar which is pretty useful for code and some other stuff where the output has to conform to something. You should still carefully look at the other settings to ensure they conform to the type of result you want to generate though, the defaults are not suitable for every use case.

micheal65536@lemmy.micheal65536.duckdns.org · 2 years ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Kerfuffle · 2 years ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Well, you said you sometimes did that so it’s not entirely clear what conclusions you came to are based on deterministic sampling and which aren’t. Anyway, like I said, it’s not just temperature that may be causing issues.

I want to be clear I’m not criticizing you personally or anything like that. I’m not trying to catch you out and you don’t have to justify anything about your decisions or approach to me. The only thing I’m trying to do here is provide information that might help you and potentially other people get better results or understand why the results with a certain approach may be better or worse.