Inside the AI Scraping Economy Hitting Media Publishers

The copyright fight between publishers and AI companies has many fronts, but the trickiest one comes down to a single word: outputs. Even if scraping feels indefensible, courts generally aren’t interested in punishing the scrapers unless the resulting product is doing measurable damage to the people whose work was taken. Civil claims especially need a clear line from the act to the injury.

The 2023 Sarah Silverman case is the textbook example. A group of authors including the comedian sued OpenAI for using their books without permission, and a judge later tossed several of the claims because the plaintiffs couldn’t point to specific outputs that were direct copies of their work. Knowing a large language model (LLM) ingested your writing isn’t enough on its own. You have to show the model is producing something that eats into your business.

Why outputs matter more than scraping in court

That evidentiary burden is part of why these cases struggle. Scraping happens silently, at machine speed, behind layers of infrastructure most publishers never see. The outputs of public-facing tools like ChatGPT, Gemini, and Perplexity are easy enough to inspect, but a much larger scraping economy operates outside that view.

It’s been an open secret for a while that AI companies pull data from third-party brokers, and media analyst Matthew Scott Goldstein recently put numbers to it. His report, covered in Digiday, identifies at least 21 companies, several backed by hundreds of millions of dollars, that routinely scrape publisher content without paying for it and sell their “data services” to customers that include OpenAI, Amazon, and even publishers like The Telegraph.

The report is essentially a map of what scraping looks like when no one stops it. Multimillion-dollar businesses, most of them obscure to readers, exist for the sole purpose of indexing publisher content and reselling it to bots and agents. The names won’t ring bells: Parallel AI, Exa, and Bright Data. And they aren’t hiding what they do. A recent Wall Street Journal profile describes Parallel AI as a platform “dedicated to servicing AI agents.” Goldstein calls it a “scraper company with better branding.”

Charlie Munger’s old line—show me the incentives, and I’ll show you the outcome—applies cleanly here. Between the losing streak in court and an administration that has openly waved off copyright concerns, the signal to AI companies and the brokers feeding them is unmistakable. Unauthorized scraping carries little risk, and the default settings of the system push toward more access, not less.

The bot-blocking decision every publisher faces

That setup leaves publishers between a rock and a hard place. Either you block bots as aggressively as your stack will allow, or you let them in. Letting them in feels like surrender, but it also ends the constant whack-a-mole and clears space to build a business that assumes AI will ingest and repurpose your work no matter what.

I’d argue those two stances aren’t as opposed as they look. Publishers should defend their copyright, but they also have to plan for a world in which AI engines are baked into how content reaches anyone. AI is now a distribution channel, a middle layer, and an audience all at once.

So what does a serious response to all this look like? Five components, in my view. Not every publisher will have the resources for all of them.

Get better at blocking bots. IP protection takes both legal and technical effort. Most large publishers are nominally blocking bots, but doing it for real means going past the robots exclusion protocol, the polite instructions that sites give bots and that bots regularly ignore. People Inc. CEO Neil Vogel has said his company has needed to become highly sophisticated at blocking unauthorized bots.

Smaller publishers won’t have that level of resourcing, but technical partners exist, and infrastructure providers like Cloudflare have started shipping copyright-protecting defaults. Even when sophisticated blocking is out of reach, intel is not. Look at your bot traffic, but also audit the AI services themselves to see where your content has surfaced without permission.

Practice good GEO. This one feels backwards at first. Whether or not bots have your permission, your content should still be readable to them. Access is a separate question, and it’s binary: on or off. Ignoring generative engine optimization (GEO) just means your work is harder for every bot to parse, including the ones you’d want to let in.

The case for GEO is practical. Scraping is happening, so you may as well compete inside the summaries and pick up whatever qualified traffic results. It also generates a paper trail for the audits in the previous bullet, which can support any future legal claim. And it becomes foundational if you ever build an in-house agent or Model Context Protocol (MCP) server on top of your content.

Shift your business model. I’ve covered this at length before, so here is the short version. The Google-era model is shrinking, and any business built on monetizing anonymous traffic is shrinking with it. New revenue streams (events, subscriptions, data products, licensing) have to be cultivated. Easier said than done. Diversification has to become a religion for ad-dependent publishers, not a side project.

Sue. Not realistic for every publisher. Going after OpenAI or Perplexity requires resources most newsrooms don’t have. But the Goldstein report effectively introduces a new set of potential defendants who have been mostly invisible until now. Given what they’re openly doing and the size of the market involved, it would be strange if more legal action didn’t follow.

Lobby for regulation. Federal action looks unlikely in the current climate, but states are moving on AI policy, including transparency and disclosure rules around training data. Real progress may not require rewriting copyright law from scratch. Even something as simple as requiring bots to properly identify themselves would curb much of the impersonation that makes the current scraping economy possible.
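
To make the first item on that list (blocking bots and looking at your bot traffic) a little more concrete, here is a minimal sketch of the kind of audit I have in mind. It reads a robots.txt file and a few access-log lines, then counts requests from self-identified AI crawlers to paths the robots.txt disallows for them. Everything specific in it is an assumption for illustration: the sample robots.txt, the sample log lines, the `audit_crawlers` function name, and the short list of crawler tokens, which should be checked against each vendor's current documentation. It also assumes logs in the common "combined" format.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Illustrative list of AI crawler user-agent tokens (an assumption; verify
# against each vendor's published documentation before relying on it).
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"]

# Matches the request path and user agent in a "combined" format log line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical robots.txt: two AI crawlers are disallowed, everyone else allowed.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical log lines, using documentation-reserved IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET /news/story HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Oct/2025:13:55:37 +0000] "GET /news/other HTTP/1.1" 200 1810 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Oct/2025:13:56:01 +0000] "GET /news/story HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.44 - - [10/Oct/2025:13:57:12 +0000] "GET /news/story HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/130.0"',
]


def audit_crawlers(robots_txt: str, log_lines: list[str]) -> Counter:
    """Count requests by known AI crawlers to paths robots.txt disallows for them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    ignored = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for token in AI_CRAWLER_TOKENS:
            # Check the bare token, not the full user-agent string, because
            # RobotFileParser matches on the part before the first "/".
            if token.lower() in user_agent and not parser.can_fetch(
                token, match.group("path")
            ):
                ignored[token] += 1
    return ignored


print(audit_crawlers(SAMPLE_ROBOTS, SAMPLE_LOG))  # → Counter({'GPTBot': 2})
```

The limitation is the same one the last item on the list points at. A check like this only catches bots that announce themselves honestly in the user-agent string. A scraper impersonating a regular browser would sail through, which is why the audit is a starting point for intel rather than a defense.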

Why agency matters more than victory

As bots keep “eating the internet,” it’s tempting to treat scraping as one more thing publishers just have to live with. Some of that resignation is earned. But inevitability is not the same as paralysis. In a world increasingly run by agents, publishers have to claim back some agency of their own. Protect what’s protectable, adapt where adaptation is the only path, and refuse to let the same companies that scraped your work also write the rules for what happens to it next.

A version of this column appears in Fast Company.
