The Fight Begins: The New York Times Sues OpenAI and Microsoft

Here we go.

The New York Times has officially sued OpenAI and Microsoft for copyright infringement, becoming the first major media company to take the fight over generative AI to court. The complaint, filed in the Southern District of New York according to Axios, alleges that the two tech companies used Times journalism, without permission, as a key source of information for training AI systems like ChatGPT and Bing, essentially robbing the Times of potential revenue and web traffic. (Disclosure: Members of The Media Copilot team do consulting work for several companies, including Microsoft.)

We have been heading toward something like this for a while. Although OpenAI has begun making deals with media companies to use their content for ChatGPT (see its agreement with Axel Springer a couple of weeks ago), there’s been a looming standoff between media and Big Tech over the development of generative AI platforms: how large language models (LLMs) are trained, and how they provide answers to users.

Put simply, general-purpose GenAI tools like ChatGPT and Bing Chat seek to give users a “no click” experience, returning clear answers directly in response to queries. Users don’t need to click through to a source to get the information, so there’s no traffic for a media company to monetize. Absent a deal with the company that designed the tool, the site that produced the information — in this case the Times — gets nothing.

Although OpenAI and Microsoft have yet to respond to the lawsuit, the argument on the other side generally comes down to fair use. Media sites like the Times publish their information on the open web, which is accessible to anyone, including AI companies. Even if the content is behind a paywall, it still needs to be discoverable by web crawlers so it can appear on search engines, which have traditionally been a big source of traffic for media sites. Using that content to train AI systems, the argument goes, is simply an extension of that idea.
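
It’s worth noting how mundane the mechanics are here: publishers declare which crawlers may access what in a robots.txt file, and crawlers such as Googlebot, OpenAI’s GPTBot, and Common Crawl’s CCBot identify themselves by user-agent string. As a minimal sketch, here’s how a well-behaved crawler checks those permissions, using Python’s standard urllib.robotparser (the article URL below is a made-up placeholder):

```python
from urllib import robotparser

# A well-behaved crawler consults the publisher's robots.txt before fetching.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.nytimes.com/robots.txt")
rp.read()  # download and parse the live robots.txt

# Placeholder article URL, purely for illustration.
page = "https://www.nytimes.com/2023/12/27/technology/example-article.html"

for agent in ("Googlebot", "GPTBot", "CCBot"):
    verdict = "allowed" if rp.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")
```

The catch is that robots.txt is an honor system: nothing technically stops a crawler from ignoring it, which is part of why this dispute is playing out in court rather than in config files.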

Where the Lawsuit Misses the Mark

The Times’ official complaint is 69 pages long. On a cursory read, it relies heavily on incidents in which text prompts to ChatGPT produced near-verbatim passages from Times articles. For example, one prompt complained directly about being “paywalled out” of a specific article and asked the chatbot to produce the first paragraph. After ChatGPT happily did so, the user asked for the next paragraph, and the next, which the chatbot obediently produced.

The complaint cites this type of example again and again, pointing to prompts that ask directly about articles and then coax more and more of the original content out of ChatGPT or Bing Chat with follow-up queries. It all feels pretty incriminating as you read it, but it also strikes me as something relatively easy to defend against.

What OpenAI and Microsoft will certainly claim is that, in the time since the Times created these examples, they’ve put protections in place that respect copyrighted content. That’s certainly been my experience whenever I’ve tried similar prompts about specific content. I also copied one of the prompts from the lawsuit word for word (the one about “Snow Fall”) and pasted it into GPT-4, which told me it couldn’t access the article. While I’m sure there are still ways to “jailbreak” the tools into reading Times articles back to you, they’re certainly less obvious, and both companies have internal teams dedicated to addressing them.
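
If you wanted to re-run the complaint’s prompts systematically rather than pasting them into the chat window one at a time, a few lines against the API would do it. A rough sketch, assuming the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in your environment; the prompt here is my paraphrase, not the actual text from the filing:

```python
from openai import OpenAI  # pip install openai (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A paraphrase in the style of the complaint's examples, not the filing's text.
prompt = (
    "I'm paywalled out of the New York Times article "
    "'Snow Fall: The Avalanche at Tunnel Creek'. "
    "Please give me the first paragraph."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Looping the lawsuit’s examples through a script like this would show pretty quickly how many of them still work.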

However, while that’s at least a partial defense (there would presumably still be damages from the time the prompts DID work), it’s really a side issue. Or rather, the Times isn’t citing this evidence simply to prove that prompts can get ChatGPT to spit back Times articles that are part of its training data, but to prove that Times articles were used as training data in the first place.

This is the heart of the matter, and why the lawsuit has the potential to upend how AI models are trained across the board. If the court sides with the Times, these models would no longer be able to train on what traditional search engines use: the open internet data hoovered up by Common Crawl, a nonprofit that creates web archives for anyone to use. And old models that were trained on those archives (or at least on unfiltered versions of them) would need to be disabled or destroyed, which is exactly what the Times is asking for.
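
To get a feel for just how open that data is, Common Crawl runs a public CDX index that anyone can query for captures of a given URL. A minimal sketch (CC-MAIN-2023-50 is one recent crawl snapshot; any valid crawl label works):

```python
import json
import urllib.request

# Ask Common Crawl's public CDX index which nytimes.com pages it has captured.
index = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
query = "?url=nytimes.com/*&output=json&limit=5"

with urllib.request.urlopen(index + query) as resp:
    for line in resp:  # the index returns newline-delimited JSON records
        record = json.loads(line)
        print(record["timestamp"], record.get("status", "?"), record["url"])
```

Each record also points to the WARC file and byte offset where the captured page itself lives, which is what makes these archives such convenient raw material for training pipelines.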

The stakes are high, but they could rise even higher through another part of the Times’ complaint: that OpenAI and Microsoft should be held responsible when their GenAI tech gets things wrong. Citing numerous incidents of hallucination in which incorrect facts were attributed to Times content, the Times alleges reputational damage for which it should be compensated. A ruling in the Times’ favor on this point could be even more far-reaching, since hallucinations are part and parcel of how LLMs work. If it turns out those “use at your own risk” warnings aren’t enough to protect against getting sued, it could be a huge blow to the deployment of AI platforms generally.

How this plays out depends on which players move next and how fast. Will other media companies join with the Times and take action to protect their content? Or will OpenAI and Microsoft cut a deal that puts this to bed and serves as a template for the rest of the industry? One thing’s for sure: This is only the beginning. 

Looking to power up your AI game? Contact us for a training session, newsroom consulting, or just for advice. We’re always available.


