MARC HOAG AI LAW.

AI copyright infringement, licensing deals for AI training, and the NYT v. OpenAI lawsuit

This is an opinion piece and reflects my opinions alone. It is decidedly not legal advice in this unsettled area of law at the intersection of generative AI training data and copyright; rather, it is merely educational in nature, and no attorney/client relationship is created by this article, so you should contact and engage an attorney if you have any legal questions. Also, I have no relationship, professional, personal, or otherwise, with any of the persons or companies discussed herein.

In general, and indeed at a technical level, LLMs do not infringe copyright, unless courts decide that the mere scraping of data, absent verbatim regurgitation of the scraped content or adequate “fair use” transformation, is itself an infringement (or unless the end user has intentionally created infringing output, as discussed below).

First, that's not a tenable rule: it is the creation of output, and whether that output copies and reproduces existing works without qualifying as “fair use,” that determines whether there is infringement; not the mere reading, scanning, or scraping of such material.

The test for fair use considers the following:

  1. Transformative Use: Transformative uses such as criticism, commentary, or parody support fair use, as do nonprofit, educational, or personal uses. Commercial uses typically weigh against fair use but can still qualify if they substantially transform the original work, using it in new ways or for new markets.

  2. Nature of the Work: Factual works lean more toward fair use, while highly creative works are less likely to be fair use. This distinction is important because the degree of protection afforded under copyright law varies with the work's nature.

  3. Amount and Substantiality: The more of the original work used, the less likely it's fair use, but this also depends on whether the use is transformative and the nature of the work used. Extensive use of a work in a transformative or non-commercial or educational setting could still favor fair use.

  4. Effect on Market Value: If the new use negatively impacts the market value of the original work, it's less likely to be considered fair use. This factor assesses whether the new use usurps demand for the original.

The question of whether an LLM infringes copyrighted content, therefore, cannot and must not hinge merely on whether copyrighted content was scraped; rather, the analysis must proceed through each prong of the fair use test above.

Alternatively, or rather additionally, copyright law needs to be updated to reflect this new era of AI. There is a strong policy reason for this, too: if courts rule that the mere training of LLMs, without more, is itself copyright infringement, then generative AI will be effectively cut off at the knees, and humanity risks losing the greatest innovation since, arguably, the Gutenberg press. Such a ruling would also effectively preclude a future no longer merely in the realm of science fiction: that of generative AI-powered robots, which would be rendered little more than walking (or rolling, with all due respect to a certain R2 unit), talking, copyright-infringing automatons.

An even more practicable solution is the typical "UGC" (user-generated content) clause that pretty much every platform incorporates into its Terms of Service anyway: if you use the platform to do anything illegal, like infringing copyright, then the burden of liability falls on you, the end user, and not the platform.

Example: If you use ChatGPT to create scripts for a physics course (say, as an alternative to Udemy), but you generate those scripts based solely on transcripts of other writers' YouTube videos, then that would be a blatantly infringing use of ChatGPT, and you, the user, should be liable, and not ChatGPT.

Midjourney says as much in its Terms of Service (Paragraph 10), so this isn't an outlandish idea. To use a somewhat blunter example: if you intentionally crash a car, or use a cleaning spray in an unlawful manner or not as intended, you are likewise liable, and not the car or cleaning spray manufacturer. That a thing can be used illegally does not mean that the thing itself is illegal.

Net-net, the issue shouldn't be whether generative AI language models can produce output that infringes copyright. Of course they can, and they have, and they will. The question should be two-pronged: (1) is the output indeed an infringement per the four-factor fair use test above; and/or (2) was the infringement caused with intent by the end user?

Assuming (1) is true, then if (2) is also true, the end user is liable; but if (2) is false, then in that particular case, arguably the company may indeed be liable. So if, say, the New York Times’ allegations are true, and ChatGPT was producing regurgitated content en masse, on its own, without any intentional “hacking” or clever prompt engineering aimed at reproducing otherwise copyrighted content in a non-transformative form, then yes, OpenAI could indeed be liable.

Furthermore, with respect to The New York Times v. OpenAI, the fact that OpenAI was willing to partner with (pay) other platforms for access to their content (Axel Springer, et al.) implicitly suggests that, absent such partnership (payment), it would indeed be at risk of copyright infringement, which rather shoots OpenAI, and indeed all generative AI companies, in the foot. And now Reddit has just struck a deal with both Google and another undisclosed AI company to license its content for $60 million per year each, further undermining the argument that copyrighted content is otherwise fair game; in other words, why pay for content that is supposedly free to use?

What I'm suggesting can't be that outrageous: Japan has decided that training AI on copyrighted works doesn't, in general, infringe copyright; and almost as a corollary to that decision, China just declared that AI-generated output can itself be protected by copyright (something I have argued for quite some time as well). The AI researcher Stephen Thaler, too, has been trying for years to get his “autonomous” AI recognized as a creator whose output can be copyrighted.

It is possible to reconcile this seeming contradiction: the decision to enter into licensing agreements doesn't necessarily contradict the fair use or non-infringement arguments generally. These licensing deals can be seen as strategic, mitigating legal risks and fostering goodwill with content creators. It’s a proactive move for business and public relations, not an admission that scraping is legally untenable. Moreover, these partnerships might signal a commitment to ethical AI development, ensuring higher quality data obtained through consensual means. This dual strategy balances legal, business, and ethical considerations, navigating the complexities of copyright law in the age of AI.

I realize I am opening the door to plenty of criticism and rebuttal, but such is the nature of this fascinating new realm of law. If you’re a company leveraging or thinking of incorporating generative AI into your workflow, please don’t hesitate to reach out if you have any questions.