Copyright? What Copyright?

Credit: DALL-E

How will artists and content creators make a living in an AI-driven information ecosystem? It’s a question AI companies didn’t even want to think about until they were forced to.

That appears to be the key takeaway of a 3,000-word story The New York Times published over the weekend on how Big Tech giants deliberately threw copyright concerns to the wind as they created the generative systems that can conjure up content on a whim. According to the report, executives at OpenAI, Google, and Meta all knew they were likely violating copyright, and even their own policies, as they harvested vast datasets without permission, but they did it anyway.

While that revelation is surprising to precisely no one, it helps to see the evidence laid out so clearly. The tech giants had motive (the existential need for ever-more data to train their models), the means (software to interpret and ingest media), and the opportunity (the data being publicly available on the internet).

A big chunk of the piece zeroes in on OpenAI’s harvesting of YouTube videos, the content of which is more difficult for large language models (LLMs) to consume than a text-based webpage. OpenAI needed to develop custom software and processes to hoover up over 1 million hours of videos to train GPT-4, according to the Times. The article is essentially a cartoon smoking gun you could put in a thought balloon over the head of OpenAI CTO Mira Murati during her much-maligned interview with The Wall Street Journal’s Joanna Stern.

Similarly, the investigation reveals Google sneakily made adjustments to its own terms of service to ensure it could train its AI models on user data (at least some of it). And Meta, too, looked for any path to AI that allowed the company to ignore copyright concerns around the data it needed to make its models competitive with OpenAI. Negotiations with artists and writers would take too long? “Don’t even try” was the attitude, the report claims.

There used to be a saying in the digital economy: data is the new oil. The analogy is obvious: just as oil empowered the robber barons of the industrial revolution, data — especially user-generated data — was fuel for the algorithm-powered internet of Web 2.0. That, in turn, inspired the idea of Web3, where every netizen has the power to erect a data derrick in their own backyard.

But what happens when Big Tech siphons all the data away before you even start digging? 

Public = Free?

Over the past year, the way AI companies have harvested copyrighted data has come under increased scrutiny, particularly at OpenAI, since it’s the leader in the field. Between the lawsuits, the interviews, and investigations like the Times’, it’s come out just how contradictory the company’s position is: It believes anything on the publicly available internet is fair use, but it also says it will honor “do not train.” It doesn’t think it needs to pay for publishers’ content, yet it’s paying for publishers’ content. It says the value of any individual data source is negligible, but it obviously can’t get enough of them.

As ever, the heart of the issue is how content creators are fairly compensated. When working properly, AI creates wholly transformative content from its training data, but it couldn’t do that without the training data. What a healthy AI-mediated information ecosystem looks like, one that respects and incentivizes content creators, is still being figured out.

If the Times article is any guide, Big Tech’s vision is a data economy where publicly available content is essentially free. While that’s probably not the best choice for a thriving media industry, that may be what the future looks like if it’s shaped entirely on their terms.

Outside pressure could change that, and it’s building. The New York Times famously sued OpenAI back in December, and it was joined by a few other publishers in February. But for the most part, publishing hasn’t been able to muster collective action to bargain with the AI companies.

Why? Medium CEO Tony Stubblebine gave a first-hand account on a recent People vs. Algorithms podcast. Medium apparently tested the waters on a publishers’ coalition with other content platforms, including Wikipedia, to better negotiate a far-reaching licensing deal with OpenAI et al.

It didn’t get off the ground.

“We’re blocking [our content from] these companies because they screwed up,” Stubblebine said. “Just exchange of value — no consent, no credit, no compensation. And then I used that to go to all the other platforms and say, ‘Look, we’re going to fail unless we form a coalition.’ And every single one of them had their own plan.”

Stubblebine pointed to mismatched incentives as the culprit that killed the idea of cooperation: Wikipedia wants its information available everywhere, even via AI summarization. Reddit negotiated a deal with OpenAI on its own. An executive at another, unspecified company told Stubblebine they wanted to get in the AI business themselves.

I would add a key factor to that list: The Times lawsuit actually disincentivizes other publishers from suing OpenAI. With the Times taking on the fight, those publishers can just grab some popcorn and wait for a winner. If the Times wins, they benefit from the ruling no matter what. And if OpenAI wins, they won’t be made to sit in the corner when they come to the bargaining table.

In the end, AI companies clearly believe they have the right to harvest data from the entire internet, but are afraid to say so openly. Media companies want to be compensated for their data, but they’re terrified of being shut out by Big Tech.

Cowardice is easy. It’s much harder to work together to chart a way forward in good faith that enables content creators to thrive while preserving innovation. We at The Media Copilot are doing our part by creating new rules for journalists and creatives in an AI world via our upcoming manifesto. But given the feelings of existential threat on both sides, the clarity that’s sorely needed likely won’t come from the industry, but from a judge and jury.
