Yes, we know (there are papers about it) that for LLMs, every increase in capability requires exponentially more training data. But don’t worry, we’ve only consumed half the world’s data to train LLMs, so there’s still plenty of room to go ;).
That doesn’t appear to actually be the case, though. LLMs have been improving greatly through the use of a smaller amount of higher-quality data, some of it synthetic data that’s been generated in part by other LLMs. Turns out simply dumping giant piles of random nonsense from the Internet on a neural net doesn’t produce the best results. Do you have references to any of those papers you mention?
Necro-edit: NVIDIA just released an LLM that’s specifically designed to generate training data for other LLMs, as a concrete example of what I’m talking about.