HUGE dataset released for open source use

noneabove1182 · 1 year ago

noneabove1182 · 1 year ago

I think the implication is more that this dataset is even more useful if you don’t jam the whole thing into your training run, but instead filter it down further to a reasonable number of tokens, around 5T, and train on that subset

I could be wrong, since they do explicitly mention deduplicating, but it’s phrased oddly either way
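
To make the "filter it down and train on the subset" idea concrete, here is a rough sketch of what I mean. Everything in it is an assumption for illustration: the `text` and `quality` fields, the `select_subset` helper, and the whitespace token count are made up and are not the dataset's actual schema or tooling. It just drops exact duplicates, keeps documents above a quality threshold, and stops adding once a token budget is hit.

```python
import hashlib


def select_subset(docs, token_budget, min_quality):
    """Pick a deduplicated, quality-filtered subset that fits a token budget.

    `docs` is a list of dicts with hypothetical keys "text" and "quality".
    Returns (kept_docs, total_tokens).
    """
    seen = set()
    kept = []
    total_tokens = 0
    # Visit the highest-quality documents first so the budget goes to them.
    for doc in sorted(docs, key=lambda d: d["quality"], reverse=True):
        if doc["quality"] < min_quality:
            break  # everything after this point scores lower still
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        # Whitespace split is a crude stand-in for a real tokenizer.
        n_tokens = len(doc["text"].split())
        if total_tokens + n_tokens > token_budget:
            continue  # does not fit; a later, shorter document still might
        seen.add(digest)
        kept.append(doc)
        total_tokens += n_tokens
    return kept, total_tokens
```

At real scale you would stream the data and use fuzzy dedup rather than exact hashes, but the shape of the idea is the same: the budget (something like 5T) is the target, and the 30T is the pool you select from.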

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI