A list of references to the training data, plus whatever they added on top, would be the bare minimum to call it open source, in my opinion, but a lot of people take a stricter view of this than I do.
None of the flagship models publish their training data because they’re all trained on less-than-legal datasets.
It’s a little like complaining that Jellyfin doesn’t publish any media with its code - not only would that be illegal, it’s implied that you’re responsible for obtaining your own.
If you’re someone who can and does retrain your own 64B-parameter LLMs, you almost certainly have your own dataset for that purpose (in fact, Hugging Face hosts plenty of them).
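For what it’s worth, here’s a rough sketch of what pulling one of those public datasets looks like, using the datasets library and the allenai/c4 repo on Hugging Face (a real open pretraining corpus; streaming mode here is just so you don’t have to download hundreds of GB):

```python
# A minimal sketch: streaming one document from an openly published
# pretraining dataset (allenai/c4) on Hugging Face.
from datasets import load_dataset

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
sample = next(iter(ds))  # one document from the corpus
print(sample["url"], sample["text"][:200])
```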
That still doesn’t magically make it open source.
Debian would probably split such a package into a non-free part and an open-source part for exactly this reason.