What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.
But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?
They pulled a very public and out in the open data heist and got away with it.
They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.
You’re thinking of licensing as a person putting something online WITH a license.
The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.
Here’s an example: I write a story and post it online, and it is specifically about a toothbrush and a toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. If somebody then prompts OpenAI for a similar story and it kicks out the exact same story with the same names, I would win that case, because it would clearly be plagiarism beyond a doubt.
Unless you as OpenAI can prove these are all completely random (which they aren’t, because it’s trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.
Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies, etc.
The product of that analysis does not contain the data itself, and so is not a violation of copyright.
That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so they are positively considered derivative works. However, whether they’re considered sufficiently transformative, or whether they pass the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)
The courts have yet to come to a conclusion; the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.
The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it were compressing images and storing them inside the model, it’d somehow need to fit ~2.7 images per byte (5 billion images divided by roughly 1.83 billion bytes). This is, simply, impossible.
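For anyone who wants to check that arithmetic, here’s a quick sketch in Python. It only uses the two figures quoted above (5 billion images, a 1.83 gigabyte model) and assumes 1 GB = 10^9 bytes; the variable names are just mine for illustration.

```python
# Back-of-the-envelope check using only the figures quoted in the comment.
NUM_IMAGES = 5_000_000_000   # the "5B" in the dataset name, as stated above
MODEL_SIZE_BYTES = 1.83e9    # 1.83 GB, assuming 1 GB = 10**9 bytes

# How many training images would have to share each byte of the model.
images_per_byte = NUM_IMAGES / MODEL_SIZE_BYTES

# The flip side: how many bits of model there are per training image.
bits_per_image = 8 / images_per_byte

print(f"{images_per_byte:.2f} images per byte")
print(f"{bits_per_image:.2f} bits per image")
```

That works out to about 2.73 images per byte, or a bit under 3 bits of model per training image.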
You’ve got your definition of “derivative work” wrong. It does indeed need to contain copyrightable elements of another work for it to be a derivative work.
If I took a copy of Harry Potter, reduced it to a fine slurry, and then made a paper mache sculpture out of it, that’s not a derivative work. None of the copyrightable elements of the book survived.
Did you not read my original comment before responding?
You said:
But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?
They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.
You’re thinking of licensing as a person putting something online WITH a license.
The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.
Here’s an example: I write a story and post it online, and it is specifically about a toothbrush and a toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. If somebody then prompts OpenAI for a similar story and it kicks out the exact same story with the same names, I would win that case, because it would clearly be plagiarism beyond a doubt.
Unless you as OpenAI can prove these are all completely random (which they aren’t, because it’s trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.
Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies, etc.
That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so they are positively considered derivative works. However, whether they’re considered sufficiently transformative, or whether they pass the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)
The courts have yet to come to a conclusion; the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.
The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it were compressing images and storing them inside the model, it’d somehow need to fit ~2.7 images per byte (5 billion images divided by roughly 1.83 billion bytes). This is, simply, impossible.
That’s not in question. It doesn’t need to contain the training data to be a derivative work, and therefore a potential infringement.
You’ve got your definition of “derivative work” wrong. It does indeed need to contain copyrightable elements of another work for it to be a derivative work.
If I took a copy of Harry Potter, reduced it to a fine slurry, and then made a paper mache sculpture out of it, that’s not a derivative work. None of the copyrightable elements of the book survived.