The Gray (Lady) Area of AI Copyright Law

Credit: DALL-E

In the days since The New York Times sued OpenAI and Microsoft over copyright infringement, it seems everyone in tech and media has given their take. (Disclosure: Members of The Media Copilot team do consulting work for several companies, including Microsoft.)

Judging from the chatter, two camps have emerged, lining up probably just how you’d predict. Many on the media side see the case as potentially precedent-setting if not industry-saving. Large language models are nothing without good information for their training data, and OpenAI harvested the Times archive without asking permission.

On the pro-tech side, there’s been an equal and opposite push that insists the outputs that OpenAI’s LLMs create are “transformative” — meaning that the text that ChatGPT produces, while informed by the training data, isn’t a simple copy. That would tick one of the key boxes to qualify it for fair use, which would seem to put OpenAI in the clear.

Before I go any further, I should qualify that I’m not a media lawyer, though I’ve worked with plenty in various editorial roles I’ve held. I’ve also read just about every media lawyer take on the lawsuit (here’s a pretty good one) and now have a better understanding of why the Times points to certain answers from ChatGPT and not others.

The Focus on Copying

Looking at the official complaint, the Times seems to do a good job of attacking the idea that ChatGPT’s outputs are transformative, dedicating many pages to direct violations of copyright. In example after example, the Times lays out situations where, if you prompt ChatGPT in the right way, it will spit back Times articles verbatim. It’s damning stuff, and it would seem to be a clear copyright violation.

The Times suit, in this respect, is in stark contrast to the lawsuit brought by the comedian Sarah Silverman, who sued OpenAI over harvesting her work for training. In that case the judge quickly dismissed the case because the mere fact of ingestion wasn’t enough to constitute a copyright violation of her books.

The big difference: Silverman didn’t have an output to point to. The Times does.

Hold on a second, though: The pro-tech side argues it isn’t that simple. The prompts the Times used weren’t generic queries for information; they were deliberate attempts to coax specific articles out of the tool. And while ChatGPT’s propensity to regurgitate copyrighted articles is still not good, it’s:

Only relevant to the (presumably very small) set of ChatGPT users who are determined to bypass the Times paywall.

Easily fixed, and it looks like OpenAI has already done so. People (including me) who’ve tried some of the allegedly violative prompts have found they no longer work.

All that said, a copyright violation is a copyright violation, even if it’s relatively minor and temporary, so if the court agrees, there will still be damages. However, judged on these copying instances alone, those mitigating factors suggest the violations carry less weight than the pro-media camp might like.

The Real Issue

What the Times really wants (or should really want) is to go back to first principles of copyright law. The whole point of it is to ensure creators stay incentivized to create new works by giving them a monopoly on those works. What ChatGPT does — ingest the entire Times archive, among other sources, to produce original answers — may meet the bar of being “transformative” in the context of copyright law, but that’s just one of four tests for fair use.

Another is how the allegedly infringing work affects the market. The dawn of generative search (AKA search generative experience, or SGE) is upon us, where AI algorithms are increasingly giving us “the answer” without the need to click through to a source. That directly threatens the traffic media companies depend on. Zero-click searches have already been having an effect.

Generative search can only accelerate that decline. For a long time, there was a de facto bargain between news publishers and search engines like Google: “We’ll let you index our information, and you give us traffic from search results.” Generative search reneges on the second part, but still takes the information.

One way to rebalance the equation is to give the publishers money instead. Indeed, this is exactly what OpenAI is doing with Axel Springer, and I suspect that deal is why the Times rushed out its lawsuit before the end of the year: Better to strike early, before such deals become the norm, lest you look like you’re just trying to get a better bag at the bargaining table.

Hopefully money isn’t the only thing the Times cares about. But it’s surely not looking merely to needle OpenAI over a few parroted articles from very specific prompts, even though that’s what the lawsuit concentrates on.

The sad truth is copyright law probably isn’t adequate to deal with what LLMs do, which is to produce ostensibly original work based on ingesting massive archives. If GPT-4 has read everything the Times has written about, say, Norman Mailer, and the answers it gives to general questions about Mailer are of much better quality than they would otherwise be (difficult to prove, but probably true), then should the Times be at least credited, if not compensated, for that information?

To neglect doing so may not technically constitute a copyright violation, but if this transformative work is done at scale, across all topics in a general-purpose tool made available to the public, it will clearly alter the market for journalism that the Times exists in. That may be a novel, even revolutionary use of the information. But it doesn’t seem very fair.
