June 11

Meta Allegedly Pirated Millions Of Books To Train Its AI

AI, Authors

Court filings in a lawsuit against Meta revealed that the company likely trained its flagship AI model Llama 3 on millions of pirated books and articles. What's more — conversations among Meta employees appear to suggest they knowingly did this with approval from Meta CEO and co-founder Mark Zuckerberg. 

The filings reveal the company accessed the Library Genesis (aka LibGen) database in order to feed books and articles into its model. According to at least one senior member on the Llama 3 team, they wanted access to books because "books are actually more important than web data."

An article by Alex Reisner in The Atlantic dives further into what Meta may have accessed and when.

What Is LibGen?

LibGen is a massive database of pirated books, scholarly articles, audiobooks, comics, and more. Technically speaking, it's a shadow library that enables peer-to-peer sharing of pirated materials via torrent websites, similar to the original file-sharing sites used for pirated music in the early 2000s.

LibGen has roots in Soviet Russia and was created by Russian scientists in 2008, allegedly first as a means of disseminating otherwise paywalled scientific research to people in poorer countries. Since then, the network has ballooned into a source of pirated material of all kinds for, well, anybody who wants to access it.

Estimates of LibGen's full library size vary. In 2019 it allegedly contained 33 terabytes of content; according to Reisner, it now holds at least 81 million scientific articles and 7.5 million books. While LibGen has been sued and fined millions of dollars (most recently $30 million), those fines have gone unpaid, since nobody can track down who would be the one to pay them. And while authorities continually make efforts to thwart the sharing of pirated materials through LibGen, so far the library only continues to grow.

How Did Meta Access And Use These Pirated Materials?

One of the biggest companies in the world accessing the entire LibGen library to train its AI products is obviously a far cry from providing scientific research to impoverished communities. But that is, apparently, exactly what happened when Meta decided that properly licensing material to train its products would take too much time and money. 

An internal chat among Meta employees showed they were worried that properly licensing materials could "take 4+ weeks to deliver data." Another said that an alleged price was "unreasonably expensive," while somebody else noted that "books do not cost $300" — an apparent failure to recognize the difference between purchasing a copy of a book and purchasing the licensing rights to one. 

Others on the team noted that if they properly licensed a book, they wouldn't be able to "lean in" to the "fair use" strategy, a reference to claiming that taking material to train models is legal because the models create something new with it. This is the go-to defense for AI companies, but a February 2025 ruling (which came after these Meta conversations) found that training AI models does not fit under the doctrine of fair use.

Eventually, Meta obtained permission from "MZ" (presumably Mark Zuckerberg) to download illegally pirated materials from LibGen in order to train Llama 3.

What This Means For Authors

Piracy has always been an issue for creators and rights holders. The lawsuit that revealed these internal conversations came from authors like Sarah Silverman and Junot Díaz. And Meta certainly isn't the only infringer — OpenAI allegedly used LibGen at points, too. 

But these court filings, particularly the ones involving internal conversations, reveal the extent to which teams at Meta were aware of their potential violation of copyright law and their willingness to "ask for forgiveness, not permission." They even went so far as to discuss potential penalties they might pay to the European Union.

There is no shortage of potentially landscape-changing lawsuits against these companies. But they move much more slowly than the pace at which AI models seemingly absorb content. Could this mean we eventually see major settlements paid to rights holders (similar to the recent NIL settlement for student athletes)?

It's possible.

But more than anything, creators need to stay vocal and shine a light on how some of the biggest companies in the world are openly stealing their content to create more products. 
