June 11

Meta Allegedly Pirated Millions Of Books To Train Its AI

AI, Authors

Court filings in a lawsuit against Meta revealed that the company likely trained its flagship AI model Llama 3 on millions of pirated books and articles. What's more, conversations among Meta employees suggest they did so knowingly, with approval from Meta CEO and co-founder Mark Zuckerberg.

The filings reveal the company accessed the Library Genesis (aka LibGen) database in order to feed books and articles into its model. According to at least one senior member on the Llama 3 team, they wanted access to books because "books are actually more important than web data."

An article by Alex Reisner in The Atlantic dives further into what Meta may have accessed and when.

What Is LibGen?

LibGen is a massive database of pirated books, scholarly articles, audiobooks, comics, and more. Technically speaking, it's a shadow library that enables peer-to-peer sharing of pirated materials via torrent websites, similar to the file-sharing sites used for pirated music in the early 2000s.

LibGen has roots in Soviet-era Russia and was created by Russian scientists in 2008, allegedly first as a means to share otherwise paywalled scientific research with readers in poorer countries. Since then, the network has ballooned into a source of pirated material of all kinds for, well, anybody who wants to access it.

Details on the full size of LibGen's library vary. In 2019 it allegedly held 33 terabytes' worth of content, and it now contains at least 7.5 million books and 81 million scientific articles, according to Reisner. While LibGen has been sued and fined millions of dollars, most recently $30 million, those fines have gone unpaid because nobody can track down who would actually pay them. And while authorities continually try to thwart the sharing of pirated materials through LibGen, so far the library only continues to grow.

How Did Meta Access And Use These Pirated Materials?

One of the biggest companies in the world accessing the entire LibGen library to train its AI products is obviously a far cry from providing scientific research to impoverished communities. But that is, apparently, exactly what happened when Meta decided that properly licensing material to train its products would take too much time and money. 

An internal chat among Meta employees showed they were worried that properly licensing materials could "take 4+ weeks to deliver data." Another said that an alleged price was "unreasonably expensive," while somebody else noted that "books do not cost $300" — an apparent failure to recognize the difference between purchasing a copy of a book and purchasing the licensing rights to one. 

Others on the team noted that if they properly licensed a book, they wouldn't be able to "lean in" to the "fair use" strategy, a reference to claiming that taking material to train models is legal because the models create something new with it. This is the go-to defense for AI companies, but a February 2025 ruling (which came after these Meta conversations) found that one company's use of copyrighted material to train an AI model did not qualify as fair use.

Eventually, Meta obtained permission from "MZ" (presumably Mark Zuckerberg) to download pirated materials from LibGen in order to train Llama 3.

What This Means For Authors

Piracy has always been an issue for creators and rights holders. The lawsuit that revealed these internal conversations came from authors like Sarah Silverman and Junot Díaz. And Meta certainly isn't the only infringer — OpenAI allegedly used LibGen at points, too. 

But these court filings, particularly the ones containing internal conversations, reveal the extent to which teams at Meta were aware they might be violating copyright law, and their preference to "ask for forgiveness, not permission." They even went so far as to discuss potential penalties they might have to pay to the European Union.

There is no shortage of potentially landscape-changing lawsuits against these companies. But the pace at which they move is much slower than the pace at which AI models seemingly steal content. Could this mean we eventually see major settlements paid to rights holders (similar to the recent NIL settlement for student athletes)?

It's possible.

But more than anything, creators need to stay vocal and shine a light on how some of the biggest companies in the world are openly stealing their content to create more products. 

