• @noneabove1182OPM
    38 months ago

    I think the implication is more that this dataset is even more useful if you don’t jam the whole thing into your training run, but instead filter it further down to a reasonable number of tokens, around 5T, and train on that subset.

    I could be wrong, because they do explicitly say deduplicating, but it’s phrased oddly either way.