The Media Copilot

How AI is changing media, journalism, and content creation

Inside the AI scraping economy nobody wants to talk about

A shadow market of data middlemen is converting publisher work into fuel for AI agents, and the legal system is doing little to stop them.

AI content scraping
Publishers are stuck between blocking AI bots and building businesses that assume the bots will win. (Credit: Google Gemini)
May 19, 2026

By Pete Pachal

The copyright fight between publishers and AI companies has many fronts, but the trickiest one comes down to a single word: outputs. Even if scraping feels indefensible, courts generally aren’t interested in punishing the scrapers unless the resulting product is doing measurable damage to the people whose work was taken. Civil claims especially need a clear line from the act to the injury.

The 2023 Sarah Silverman case is the textbook example. A group of authors including the comedian sued OpenAI for using their books without permission, and a judge later tossed several of the claims because the plaintiffs couldn’t point to specific outputs that were direct copies of their work. Knowing a large language model (LLM) ingested your writing isn’t enough on its own. You have to show the model is producing something that eats into your business.

Why outputs matter more than scraping in court

That evidentiary burden is part of why these cases struggle. Scraping happens silently, at machine speed, behind layers of infrastructure most publishers never see. The outputs of public-facing tools like ChatGPT, Gemini, and Perplexity are easy enough to inspect, but a much larger scraping economy operates outside that view.

It’s been an open secret for a while that AI companies pull data from third-party brokers, and media analyst Matthew Scott Goldstein recently put numbers to it. His report, covered in Digiday, identifies at least 21 companies, several backed by hundreds of millions of dollars, that routinely scrape publisher content without paying for it and sell their “data services” to customers that include OpenAI, Amazon, and even publishers like The Telegraph.

The report is essentially a map of what scraping looks like when no one stops it. Multimillion-dollar businesses, most of them obscure to readers, exist for the sole purpose of indexing publisher content and reselling it to bots and agents. The names, which include Parallel AI, Exa, and Bright Data, won’t ring bells. And they aren’t hiding what they do. A recent Wall Street Journal profile describes Parallel AI as a platform “dedicated to servicing AI agents.” Goldstein calls it a “scraper company with better branding.”

Charlie Munger’s old line—show me the incentives, and I’ll show you the outcome—applies cleanly here. Between the losing streak in court and an administration that has openly waved off copyright concerns, the signal to AI companies and the brokers feeding them is unmistakable. Unauthorized scraping carries little risk, and the default settings of the system push toward more access, not less.

The bot-blocking decision every publisher faces

That setup leaves publishers between a rock and a hard place. Either you block bots as aggressively as your stack will allow, or you let them in. Letting them in feels like surrender, but it also ends the constant whack-a-mole and clears space to build a business that assumes AI will ingest and repurpose your work no matter what.

I’d argue those two stances aren’t as opposed as they look. Publishers should defend their copyright, but they also have to plan for a world in which AI engines are baked into how content reaches anyone. AI is now a distribution channel, a middle layer, and an audience all at once.

So what does a serious response to all this look like? Five components, in my view. Not every publisher will have the resources for all of them.

  • Get better at blocking bots. IP protection takes both legal and technical effort. Most large publishers are nominally blocking bots, but doing it for real means going past the robots exclusion protocol, the polite instructions that sites give bots and that bots regularly ignore. People Inc. CEO Neil Vogel has said his company has needed to become highly sophisticated at blocking unauthorized bots.

    Smaller publishers won’t have that level of resourcing, but technical partners exist, and infrastructure providers like Cloudflare have started shipping copyright-protecting defaults. Even when sophisticated blocking is out of reach, intel is not. Look at your bot traffic, but also audit the AI services themselves to see where your content has surfaced without permission.
  • Practice good GEO. This one feels backwards at first. Whether or not bots have your permission, your content should still be readable to them. Access is a binary switch, on or off, and it is separate from readability. Ignoring generative engine optimization (GEO) just means your work is harder for every bot to parse, including the ones you’d want to let in.

    The case for GEO is practical. Scraping is happening, so you may as well compete inside the summaries and pick up whatever qualified traffic results. It also generates a paper trail for the audits in the previous bullet, which can support any future legal claim. And it becomes foundational if you ever build an in-house agent or MCP server on top of your content.
  • Shift your business model. I’ve covered this at length before, so here is the short version. The Google-era model is shrinking, and any business built on monetizing anonymous traffic is shrinking with it. New revenue streams (events, subscriptions, data products, licensing) have to be cultivated. That is easier said than done. Diversification has to become a religion for ad-dependent publishers, not a side project.
  • Sue. Not realistic for every publisher. Going after OpenAI or Perplexity requires resources most newsrooms don’t have. But the Goldstein report effectively introduces a new set of potential defendants who have been mostly invisible until now. Given what they’re openly doing and the size of the market involved, it would be strange if more legal action didn’t follow.
  • Lobby for regulation. Federal action looks unlikely in the current climate, but states are moving on AI policy, including transparency and disclosure rules around training data. Real progress may not require rewriting copyright law from scratch. Even something as simple as requiring bots to properly identify themselves would stop the impersonation that makes the current scraping economy possible.
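
For publishers who want to start on the intel side of the first item, here is a minimal Python sketch of a bot-traffic audit. It tallies requests from self-identified AI crawlers and flags the ones that hit paths your robots.txt disallows. Everything specific in it is an assumption for illustration: the robots.txt rules, the example.com domain, the sample log entries, and the `audit_bot_traffic` helper are hypothetical, and the crawler tokens (GPTBot, CCBot, PerplexityBot) are a sample watchlist, not a complete one.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; a real audit would load the site's own file.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /premium/

User-agent: *
Allow: /
"""

# Illustrative watchlist of crawler tokens (an assumption, not a complete list).
AI_BOT_TOKENS = ("GPTBot", "CCBot", "PerplexityBot")

# Hypothetical (user_agent, path) pairs, as they might be pulled from an access log.
SAMPLE_LOG = [
    ("Mozilla/5.0 (compatible; GPTBot/1.2)", "/news/story"),
    ("CCBot/2.0", "/news/story"),
    ("CCBot/2.0", "/premium/report"),
    ("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "/news/story"),
    ("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0", "/news/story"),
]


def audit_bot_traffic(log_entries, robots_txt=ROBOTS_TXT, site="https://example.com"):
    """Count hits per watched bot and how many of them robots.txt disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    hits = Counter()
    violations = Counter()
    for user_agent, path in log_entries:
        # Match on the self-reported user agent; spoofed agents will slip past this.
        token = next(
            (t for t in AI_BOT_TOKENS if t.lower() in user_agent.lower()), None
        )
        if token is None:
            continue  # not a bot on the watchlist
        hits[token] += 1
        if not parser.can_fetch(token, site + path):
            violations[token] += 1  # the bot fetched a path the rules told it to skip
    return hits, violations


hits, violations = audit_bot_traffic(SAMPLE_LOG)
print("Hits by bot:", dict(hits))
print("Requests that ignored robots.txt:", dict(violations))
```

Because it matches on the user-agent string, a check like this only catches bots that announce themselves. Crawlers that impersonate ordinary browsers would need other signals, such as IP ranges or request patterns, which is where the technical partners come in.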

Why agency matters more than victory

As bots keep “eating the internet,” it’s tempting to treat scraping as one more thing publishers just have to live with. Some of that resignation is earned. But inevitability is not the same as paralysis. In a world increasingly run by agents, publishers have to claim back some agency of their own. Protect what’s protectable, adapt where adaptation is the only path, and refuse to let the same companies that scraped your work also write the rules for what happens to it next.

A version of this column appears in Fast Company.

Contributors

  • Pete Pachal: Author

    Pete Pachal is the founder of The Media Copilot. In addition to producing the site’s newsletter and podcast, he also teaches courses on how journalists and communications professionals can apply AI tools to their work. Pete has a long career in journalism, previously holding senior roles in global newsrooms such as CoinDesk and Mashable. He’s appeared on Fox Business, CNN, and The Today Show as a thought leader in tech and AI. Pete also puts his encyclopedic knowledge of Doctor Who to good use on the popular podcast, Pull To Open.

Category: AI media analysis
Tags: GEO | bot blocking | webscraping