Meta admits using pirated books to train AI, but won't pay for it

Alfonso Maruccia

A hot potato: Training advanced AI models with proprietary material has become a controversial issue. Many companies now face legal challenges from authors and media organizations in court. Meta admitted to using the well-known "pirate" dataset, Books3, yet the company is reluctant to compensate writers adequately.

A group of authors filed a lawsuit against Meta, alleging the unlawful use of copyrighted material in developing its Llama 1 and Llama 2 large language models. In a court filing responding to writer and comedian Sarah Silverman, author Richard Kadrey, and the other rights holders spearheading the legal action, Meta acknowledged that its LLMs were trained on copyrighted books.

Meta has admitted to using the Books3 dataset, among many other materials, to train its Llama 1 and Llama 2 LLMs. Books3 is a well-known plaintext collection of over 195,000 books totaling nearly 37 GB. AI researcher Shawn Presser created the archive in 2020 to provide a better data source for improving machine learning algorithms.

The widespread availability of the Books3 dataset has led to its extensive use in AI training by many researchers, and Big Tech companies, including Meta, have utilized Books3 and other contentious datasets for their commercial AI products. In a related dispute, The New York Times has sued OpenAI and Microsoft for allegedly using millions of copyrighted articles to develop the ChatGPT chatbot.

OpenAI has openly declared that training AI models without using copyrighted material is "impossible," arguing that judges and courts should dismiss compensation lawsuits brought by rights holders. Echoing this stance, Meta admitted to using Books3 but denied any intentional misconduct.

Meta has acknowledged using parts of the Books3 dataset but argued that its use of copyrighted works to train LLMs did not require "consent, credit, or compensation." The company rejects claims that it infringed the plaintiffs' "alleged" copyrights, contending that any unauthorized copies of copyrighted works in Books3 should be considered fair use.

Furthermore, Meta disputes the validity of maintaining the legal action as a class action lawsuit and refuses to provide any monetary "relief" to the suing authors or others involved in the Books3 controversy. The dataset, which includes copyrighted material sourced from the pirate site Bibliotik, was targeted in 2023 by the Danish anti-piracy group Rights Alliance, which demanded a ban on digital archiving of Books3 and has used DMCA takedown notices to enforce its removal.


 
"OpenAI has openly declared that training AI models without using copyrighted material is impossible,"
That's not a defense, but the admission of guilt. It should be treated as such.

Also it's time to hold leaders of large companies criminally liable for such willful disregard of others' rights, especially when done at such a massive scale.
 
"OpenAI has openly declared that training AI models without using copyrighted material is impossible,"
That's not a defense, but the admission of guilt. It should be treated as such.

Also it's time to hold leaders of large companies criminally liable for such willful disregard of others' rights, especially when done at such a massive scale.
Well, I do agree that it is impossible to train AI without the use of copyrighted works, but whether or not that constitutes fair use is up for debate. Also, I believe that the DMCA gives copyright holders too much power, and modern copyright law is excessive and outdated.
 
"It's impossible to do this without breaking the law" - therefore we should be allowed to do this.

It's like saying "it's impossible to drink and drive without breaking the law, therefore we should be allowed to drink and drive".

Also it isn't impossible to do; you just have to pay all the rightsholders for the material you use to build these models rather than stealing it. That's expensive and difficult, I expect, but too bad. I think these LLMs should, by law, have to fully disclose all the data they were built using.
 
"It's impossible to do this without breaking the law" - therefore we should be allowed to do this.

It's like saying "it's impossible to drink and drive without breaking the law, therefore we should be allowed to drink and drive".

Also it isn't impossible to do; you just have to pay all the rightsholders for the material you use to build these models rather than stealing it. That's expensive and difficult, I expect, but too bad. I think these LLMs should, by law, have to fully disclose all the data they were built using.
Not to mention that they are possibly in the process of greatly reducing or eliminating the jobs of the very people they steal from. This is hilarious on so many levels.
And above all, I've started to doubt that a lot of these people have any idea what they are actually doing.
This stuff could do something amazing for humanity. It still needs and must have limits.
 
This stuff could do something amazing for humanity. It still needs and must have limits.
Unfortunately, the out-of-control mega-corporations that now rule America, and which the broken US legal and political system seems completely unable to police, have snapped it all up and are now going to use it for all the wrong purposes. MS is bad, but Google and Facebook - that's some next-level bad news for us all.
 