AI Training & Fair Use — What the Law Actually Says

How do AI companies ingest billions of copyrighted images and texts without permission? The answer lies in the highly debated doctrine of "Fair Use." We break down the legal arguments on both sides.

The most consequential legal battle in the history of artificial intelligence isn't about the output—it's about the input. Companies like OpenAI, Anthropic, Midjourney, and Stability AI have trained their massive models by scraping billions of words and images from the internet, often without the permission of or compensation to the original creators.

When sued for copyright infringement by authors, artists, and publishers, these AI companies rely almost entirely on one legal defense: Fair Use.

What is Fair Use?

Fair use is a doctrine in United States copyright law (Section 107 of the Copyright Act) that permits limited use of copyrighted material without having to first acquire permission from the copyright holder. It is a critical defense that protects freedom of expression, allowing for commentary, criticism, news reporting, research, and parody.

Fair use is not a hard-and-fast rule; it is a flexible standard determined on a case-by-case basis. Judges evaluate claims based on four statutory factors.

The Four Fair Use Factors Applied to AI

When courts analyze whether scraping data to train an AI model constitutes fair use, they weigh the following four factors. Here is how both sides are arguing them.

Factor 1: The Purpose and Character of the Use

This is arguably the most critical factor. Courts ask whether the new use is "transformative." Does it add something new, with a further purpose or different character, rather than merely substituting for the original?

  • The AI Companies' Argument: The use is highly transformative. They argue they are not reproducing the original works for human consumption. Instead, they are analyzing the statistical relationships between words or pixels to build a new tool (the AI model). This is functionally similar to the large-scale text analysis behind Google Books, which the Second Circuit held to be transformative in Authors Guild v. Google (2015).
  • The Creators' Argument: The use is highly commercial, not for public research. Furthermore, the resulting models directly compete with the creators. If an AI can generate "a story in the style of George R.R. Martin," it serves the exact same purpose as the original author's work.

Factor 2: The Nature of the Copyrighted Work

This factor considers whether the original work is factual/informational or highly creative/expressive. Creative works receive stronger protection.

  • The AI Companies' Argument: While acknowledging that many scraped works are creative, they argue that the AI is only extracting unprotected facts—the mathematical patterns and statistical structures—not the protected expressive elements.
  • The Creators' Argument: The datasets (like LAION-5B or Books3) are filled with highly expressive, creative works—novels, paintings, and poems. Training on these works strikes at the core of what copyright is meant to protect.

Factor 3: The Amount and Substantiality of the Portion Used

Courts look at how much of the original work was copied, and whether it was the "heart" of the work.

  • The AI Companies' Argument: While they copy entire works to train the model, they argue this is necessary. They point to the Authors Guild v. Google case, where copying entire books to create a searchable database was deemed fair use because the output only showed snippets.
  • The Creators' Argument: AI companies make exact copies of millions of works in their entirety. Furthermore, models can sometimes "memorize" and regurgitate exact chunks of the training data, which creators say shows that the copies are substantial and retained.

Factor 4: The Effect of the Use on the Potential Market

Does the new use harm the existing or potential market for the original work?

  • The AI Companies' Argument: AI models don't serve as direct substitutes for specific training works. In their view, no one uses ChatGPT specifically to avoid buying a copy of a Stephen King novel.
  • The Creators' Argument: This is where creators are most vocal. They argue that generative AI tools directly undercut their markets: illustrators report losing work to tools like Midjourney, and copywriters report being replaced by ChatGPT. In their view, the AI companies have built a massive, competing product using the creators' own labor for free.

The U.S. Copyright Office "Part 3" Inquiry

The U.S. Copyright Office (USCO) has been conducting a multi-part study on copyright and AI. One component (often referred to as Part 3) focuses on the use of copyrighted works to train generative AI models. The Office collected more than 10,000 public comments for the study and released a pre-publication version of Part 3 in May 2025, advising Congress on whether new legislation or licensing regimes are required.

Major Lawsuits Shaping the Future

The theoretical arguments over fair use are currently being tested in high-stakes federal litigation. The outcomes of these cases will likely dictate the future of the AI industry.

| Case Name | Plaintiffs | Defendant(s) | Core Issue |
|---|---|---|---|
| The New York Times v. OpenAI & Microsoft | The New York Times | OpenAI, Microsoft | NYT alleges that ChatGPT was trained on millions of its articles and now acts as a direct substitute, even regurgitating paywalled articles verbatim when prompted. |
| Andersen v. Stability AI | Sarah Andersen, Kelly McKernan, Karla Ortiz | Stability AI, Midjourney, DeviantArt | Class action by visual artists alleging that image generators unlawfully scraped their portfolios and can now generate competing artworks "in the style of" the plaintiffs. |
| Getty Images v. Stability AI | Getty Images | Stability AI | Getty alleges Stability scraped millions of images, including its proprietary watermarks, which occasionally appear distorted in Stable Diffusion outputs. |
| Authors Guild v. OpenAI | George R.R. Martin, John Grisham, Jodi Picoult, et al. | OpenAI | Prominent authors allege their pirated books (via the "Books3" dataset) were used to train language models that can now generate unauthorized sequels and derivatives. |

Potential Outcomes and Alternatives

If courts rule that training AI on copyrighted works without permission is not fair use, the industry will face a massive reckoning. Potential alternatives include:

  • Opt-in / Opt-out Systems: A standardized technical protocol (more robust than robots.txt) allowing creators to flag their content as off-limits for AI scraping. Check our guide on How to Protect Your Content From AI Scraping.
  • Collective Licensing: Similar to how radio stations pay ASCAP or BMI to play music, AI companies might pay into a central fund that distributes royalties to creators whose works are used in training datasets.
  • Clean Data Models: A shift toward AI models trained exclusively on public domain works, properly licensed content (like Adobe Firefly), or synthetic data.
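
For context on the opt-out idea above, here is a minimal sketch of how the existing robots.txt mechanism works in practice, using Python's standard-library `urllib.robotparser`. The robots.txt content, the `allowed_agents` helper, and the `example.com` URL are hypothetical and for illustration only; the crawler names are examples, and each vendor's documentation should be checked for the user-agent tokens it actually honors. Compliance with robots.txt is voluntary on the crawler's side, which is one reason a more robust protocol is being discussed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out of AI crawlers
# while leaving the site open to everyone else. Crawler names are examples.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        ROBOTS_TXT,
        ["GPTBot", "CCBot", "ExampleBrowserBot"],  # last name is made up
        "https://example.com/portfolio/",
    )
    for agent, ok in result.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

This only reports what the file requests. It cannot verify that a given crawler respects the request, and it does nothing about content that was already collected.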

Conclusion

The fair use defense is the legal shield protecting the current generative AI boom. Whether that shield holds up against the unified pushback from the creative industries and news media is the defining legal question of this technological era.