A hot potato: Training advanced AI models with proprietary material has become a controversial issue, and many companies now face lawsuits from authors and media organizations. Meta admitted to using the well-known "pirate" dataset, Books3, yet the company is reluctant to compensate writers adequately.
A group of authors filed a lawsuit against Meta, alleging the unlawful use of copyrighted material in developing its Llama 1 and Llama 2 large language models. In its response to writer and comedian Sarah Silverman, author Richard Kadrey, and the other rights holders spearheading the legal action, Meta acknowledged that its LLMs were trained using copyrighted books.
Meta has admitted to using the Books3 dataset, among many other materials, to train its Llama 1 and Llama 2 LLMs. Books3 is a well-known plaintext collection of more than 195,000 books totaling nearly 37GB. AI researcher Shawn Presser created the archive in 2020 to provide a better data source for improving machine learning algorithms.
The widespread availability of the Books3 dataset has led to its extensive use in AI training by many researchers. Big Tech companies, including Meta, have used Books3 and other contentious datasets for their commercial AI products. In a separate dispute over training data, the New York Times has sued OpenAI and Microsoft for allegedly using millions of copyrighted articles to develop the ChatGPT chatbot.
OpenAI has openly declared that training AI models without using copyrighted material is "impossible," arguing that judges and courts should dismiss compensation lawsuits brought by rights holders. Echoing this stance, Meta admitted to using Books3 but denied any intentional misconduct.
Meta has acknowledged using parts of the Books3 dataset but argued that its use of copyrighted works to train LLMs did not require "consent, credit, or compensation." The company denies infringing the plaintiffs' "alleged" copyrights, contending that any unauthorized copies of copyrighted works in Books3 should be considered fair use.
Furthermore, Meta is disputing whether the legal action can be maintained as a class action lawsuit, and it refuses to provide any monetary "relief" to the suing authors or others involved in the Books3 controversy. The dataset, which includes copyrighted material sourced from the pirate site Bibliotik, was targeted in 2023 by the Danish anti-piracy group Rights Alliance, which demanded that digital archiving of Books3 be banned and has been using DMCA notices to enforce takedowns.