May 15, 2025

United States Copyright Office Weighs in on Fair Use Defense for Generative AI Training


On Friday, May 9, the United States Copyright Office (“USCO”) released a “pre-publication version” of its long-anticipated third and final report in a series of guidance on copyright and artificial intelligence. This report, which followed a first (published in July 2024) focusing on “digital replicas” or deepfakes and a second (published in January 2025) on the copyrightability of works created with the aid of artificial intelligence, focuses on the training of generative AI models. This report is the only one of the three to be released in “pre-publication” form; this uncommon step came one day after the dismissal of Dr. Carla Hayden, Librarian of Congress, and one day prior to the dismissal of Register of Copyrights Shira Perlmutter. The USCO stated that it released the pre-publication version “in response to congressional inquiries and expressions of interest from stakeholders,” noting that the “final version will be published in the near future, without any substantive changes expected in the analysis or conclusions.”1

As with other reports in the series, the USCO does not recommend any government intervention at this time; however, it does offer a detailed analysis of the potential applicability of a fair use defense to the training of generative AI models, as well as a strong endorsement for the further development of the voluntary licensing market for training data.

While the report emphasizes that a fair use analysis requires a case-by-case, fact-specific inquiry, it also offers specific guidance on certain factual circumstances that are likely to cut for or against a finding of fair use. The USCO notes that, while there is no set formula, “the first and fourth factors can be expected to assume considerable weight in the analysis.”2 Although the report does not foreclose the possibility of a successful fair use defense in some circumstances, its analysis of the factors—particularly the more heavily weighted first and fourth factors—tends to disfavor a finding of fair use.

In its analysis of the first factor—the purpose and character of the use—the USCO focuses largely on the transformative nature of the use. Although the report concludes that “training a generative AI foundation model on a large and diverse dataset will often be transformative,”3 it also cautions that “transformativeness is a matter of degree, and how transformative or justified a use is will depend on the functionality of the model and how it is deployed.”4 The report expresses a more favorable view of uses such as research and content moderation and is more critical of models that “generate outputs that are substantially similar to copyrighted works in the dataset,” noting that “[m]any uses fall somewhere in between.”5 Regarding the commerciality prong, the report confirms that the inquiry is not whether the user is a for-profit or not-for-profit entity but rather the specific purpose of the use itself.6

Additionally, the report notes that unlawful access—such as pirating works or circumventing paywalls—will weigh strongly against a finding of fair use.7 Although the report states that this factor alone is not determinative, it emphasizes that such conduct “goes a step further” than intentionally using a work despite denial of permission and “bears on the character of the use.”8 The report also brings this up in its analysis of other fair use factors, making clear the USCO’s position that training a model on pirated or paywalled content—particularly without appropriate guardrails to ensure the outputs do not include portions of the copyrighted work—is highly detrimental to a fair use defense.9

The USCO’s analysis of the second factor—the nature of the copyrighted work—notes that this prong requires a fact-specific analysis that “will vary depending on the model and the works at issue,” commenting that “[w]here the works involved are more expressive, or previously unpublished, the second factor will disfavor fair use.”10

Similarly, in its analysis of the third factor—the amount and substantiality of the use—the USCO recommends a case-by-case assessment, leaving open the possibility that, while the use of an entire work would weigh against a finding of fair use, “the use of entire works appears to be practically necessary for some forms of training for many generative AI models.”11 The report emphasizes the importance of effective safeguards to avoid “memorized” works and prevent infringing outputs.12

The report devotes significant analysis to the fourth factor—the effect on the potential market for or value of the copyrighted work—noting that “[t]he Supreme Court has twice described this factor as ‘undoubtedly the single most important element of fair use,’ although its importance ‘will vary, not only with the amount of harm, but also with the relative strength of the showing of the other factors.’”13 This section of the report balances the potential public benefits of unlicensed training against a wide range of potential impacts on the market value of the copyrighted works, “including through lost sales, market dilution, and lost licensing opportunities.”14

The USCO expresses concern over outputs that could act as direct substitutes for copyrighted works: “If thousands of AI-generated romance novels are put on the market, fewer of the human-authored romance novels that the AI was trained on are likely to be sold.”15 The report warns of “significant potential harm to the market for or value of copyrighted works.”16 While a typical fair use analysis necessarily examines the overall impact the allegedly infringing use has on the market, this theory of market dilution expands into “uncharted territory”17 and is generally aligned with an overall prioritization of the interests of the owners of copyrighted works. In addition to concerns about market dilution based on the outputs of generative AI, the report also highlights concerns about the market for data sets that could be licensed to train AI models and encourages the licensing of training data wherever possible: “Where licensing options exist or are likely to be feasible, this consideration will disfavor fair use under the fourth factor.”18 

Although the report does not directly address any of the pending litigation regarding the use of copyrighted works in the training of AI models, it outlines a general spectrum of potential outcomes:

On one end of the spectrum, uses for purposes of noncommercial research or analysis that do not enable portions of the works to be reproduced in the outputs are likely to be fair. On the other end, the copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available, is unlikely to qualify as fair use. Many uses, however, will fall somewhere in between.19

The report also discusses in detail the varying comments it received on potential concerns and considerations regarding the licensing of copyrighted works for AI training, including the potential for a compulsory licensing regime or an opt-out mechanism—both of which the report ultimately advises against.20 The report does express a somewhat favorable view of an extended collective licensing (“ECL”) system, with licensing rights administered by a collective management organization (“CMO”), similar to ASCAP and BMI in the music industry. Although the report ultimately “recommends allowing the licensing market to continue to develop without government intervention,” it also suggests the consideration of “targeted intervention such as ECL” in the event of market failures.21

In addition to its analysis of the applicability of the fair use factors and its encouragement of licensing training data where possible, the report also discusses potential means through which infringement may occur. Two of these discussions are worth highlighting. The report discusses the model’s “weights”—or the numerical parameters that encode what it has learned—to examine whether these weights can constitute a copy and, thus, whether subsequent reproduction or use of the model weights may amount to copyright infringement.22 The report concludes that “[w]hether a model’s weights implicate the reproduction or derivative work rights turns on whether the model has retained or memorized substantial protectable expression from the work(s) at issue,” focusing primarily on the outputs and whether the ultimate content generated is substantially similar to the copyrighted work.23 This argument has the potential to expand the risk that users of AI models may face from content owners objecting to the use of their works as training data.

The report also discusses retrieval-augmented generation (“RAG”), which typically involves the generation of a prompt or search through which works or material responsive to the prompt may be retrieved, and notes that this activity involves the reproduction of copyrighted works.24 The report notes the importance of these features in certain models, particularly those involved in news media, but it also cautions that such uses are unlikely to be transformative.25

By providing examples and analysis of the types of factual patterns most likely to support or cut against a finding of fair use, the USCO offers long-awaited guidance on the manner in which a fair use defense may be applied to cases involving generative AI. Given the USCO’s well-established position that works generated wholly by AI are not eligible for copyright protection, it is not surprising that significant portions of the report are protective of the interests of copyright holders—particularly with regard to the heavily weighted fourth factor of the fair use analysis. However, the report ultimately provides ammunition for both sides in many of the pending lawsuits regarding copyright and generative AI.

 


 

1 Copyright and Artificial Intelligence, Part 3: Generative AI Training (Pre-Publication Version), at i.

2 Id. at 74.

3 Id. at 45.

4 Id. at 46.

5 Id.

6 Id. at 51.

7 Id. at 51-52.

8 Id. at 52.

9 Id. at 62, 74.

10 Id. at 54.

11 Id. at 57.

12 Id. at 59.

13 Id. at 61 (internal citations omitted).

14 Id. at 61.

15 Id. at 65.

16 Id. at 73.

17 Id. at 65.

18 Id. at 73.

19 Id. at 74.

20 Id. at 103-104.

21 Id. at 106.

22 Id. at 28-29.

23 Id. at 30.

24 Id. at 30.

25 Id. at 31, 47.
