The Media Copilot

‘The pipes are leaky’: New report shows AI scrapers bypassing publisher protections at scale

TollBit data shows click-through rates from AI apps have collapsed even for sites with licensing deals.

RAG in action: Perplexity summarizes a YouTube video, illustrating how AI apps fetch and synthesize content in real time. (Credit: YouTube/TollBit)
Feb 4, 2026

By The Copilot

AI companies are increasingly using third-party web scrapers to access publisher content rather than paying for it directly, according to a new report from TollBit, a startup that helps publishers monetize AI traffic.

Key Takeaways

  • TollBit’s “The Pipes are Leaky” report finds nearly 40 scraping vendors selling publisher content.
  • Even publishers with AI deals see chatbot click-throughs collapse.
  • Robots.txt and basic blocks are routinely ignored at scale.

The State of the Bots report, titled “The Pipes are Leaky,” documents nearly 40 web scraping vendors selling access to web content. Many advertise tools for evading cybersecurity defenses and don’t comply with robots.txt by default. Some can even penetrate paywalls.

“Even paywalled content is not necessarily safe,” the report states. In tests across 30 high-authority sites, TollBit found that some scrapers “were able to scrape various paywalled articles in full.” Paywalled sites offered no additional protection—scrapers retrieved full content from the vast majority of pages regardless of subscription barriers.

The scale of the problem

AI scraping grew dramatically over 2025. In Q1, there was one AI bot visit for every 200 human visits to publisher sites. By year’s end, that ratio had tightened to 1:31, roughly six and a half times as many bot visits per human visit as in Q1.
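
As a rough check on the scale of that shift, the two ratios can be converted to bot visits per human visit and compared. The sketch below uses only the 1:200 and 1:31 figures quoted above; the function name is hypothetical and is not taken from the TollBit report.

```python
def ratio_growth(start_humans_per_bot: float, end_humans_per_bot: float) -> float:
    """Return how many times the bot-to-human ratio grew between two periods.

    Each argument is the number of human visits per single bot visit,
    so a smaller number means relatively more bot traffic.
    """
    start_ratio = 1 / start_humans_per_bot  # bot visits per human visit, start
    end_ratio = 1 / end_humans_per_bot      # bot visits per human visit, end
    return end_ratio / start_ratio


growth = ratio_growth(200, 31)
print(f"{growth:.2f}x")                 # 6.45x
print(f"{(growth - 1) * 100:.0f}%")     # 545%
```

By this arithmetic, moving from 1:200 to 1:31 is about a 6.5-fold rise in bot visits per human visit over the year.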

RAG bots—the kind that power real-time searches on ChatGPT and Perplexity—are the primary culprits. They now make roughly 10 page requests for every single request from a training bot. Unlike training crawlers that grab content once for model development, RAG bots need continuous access to fresh information.

OpenAI’s ChatGPT-User bot leads the pack, scraping at a rate five times higher than the second-most-active scraper (Meta) and 16 times higher than Perplexity’s agent.

What they’re scraping

The report reveals clear patterns in what content AI bots are fetching. In Q4, the most-scraped topics included “Stranger Things Season 5,” “Netflix Warner Bros Deal,” and “Holiday Gift Guides & Black Friday Deals”—a shift from Q3’s trending topics around the “Kimmel/Kirk Controversy” and “The Summer I Turned Pretty.”

Different AI tools serve different user needs. ChatGPT users tend toward general search queries. Perplexity users focus on consumer product research and reviews. Claude users skew toward professional tasks, especially in tech.

The categories most heavily scraped: B2B/professional content (up 62 percent), national news (up 55 percent), and tech and consumer electronics (up 107 percent).

Deals aren’t helping

For publishers hoping AI licensing deals would preserve traffic, the news is grim. Click-through rates from AI applications are collapsing across the board.

Sites without direct AI deals saw rates drop from 0.8 percent in Q2 to 0.27 percent by year’s end, a threefold decline. But even sites with 1:1 licensing agreements fared poorly: click-through rates fell from 8.8 percent in Q1 to 1.33 percent in Q4, a more than sixfold drop.
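
The fold figures follow directly from the quoted click-through rates. A minimal Python sketch, assuming only the four percentages above (the helper names are illustrative):

```python
def fold_decline(before_pct: float, after_pct: float) -> float:
    """How many times smaller the later rate is than the earlier one."""
    return before_pct / after_pct


def percent_decline(before_pct: float, after_pct: float) -> float:
    """Relative drop, expressed as a percentage of the earlier rate."""
    return (before_pct - after_pct) / before_pct * 100


# Click-through rates quoted in the article, in percent.
no_deal = (0.8, 0.27)     # sites without direct AI deals, Q2 -> Q4
with_deal = (8.8, 1.33)   # sites with 1:1 licensing agreements, Q1 -> Q4

print(f"{fold_decline(*no_deal):.2f}x")    # 2.96x
print(f"{fold_decline(*with_deal):.2f}x")  # 6.62x
print(f"{percent_decline(*no_deal):.0f}%")
print(f"{percent_decline(*with_deal):.0f}%")
```

Note that the two comparisons cover different windows (Q2 to Q4 versus Q1 to Q4), so the fold figures are not directly comparable.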

On average, AI applications now deliver just 0.12 percent of referral traffic to publishers. Google, by comparison, still delivers over 80 percent.

Robots.txt is failing

The robots.txt protocol, long the standard for telling bots what they can and can’t access, appears increasingly toothless. Thirty percent of AI scrapes in Q4 ignored explicit robots.txt restrictions. OpenAI’s ChatGPT-User had the highest non-compliance rate at 42 percent, despite the company’s documentation claiming it respects the protocol.
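
Robots.txt is advisory: it only works if the bot chooses to check it. For readers unfamiliar with the mechanics, here is a minimal sketch of the check a compliant crawler would run before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the `SomeOtherBot` name are hypothetical illustrations, not taken from the report; `ChatGPT-User` is the user agent named above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("ChatGPT-User", "https://example.com/news/story"))         # False
print(is_allowed("SomeOtherBot", "https://example.com/news/story"))         # True
print(is_allowed("SomeOtherBot", "https://example.com/subscribers/story"))  # False
```

Nothing in the protocol enforces the result. A scraper that skips this check, or that presents a different user agent, is not stopped by the file itself.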

The problem extends beyond the big AI labs. Reddit’s October lawsuit against Perplexity also named three scraping vendors most publishers have never heard of: Oxylabs UAB, AWMProxy, and SerpApi. Two provide IP proxies that disguise bots as human traffic. SerpApi scrapes Google’s search results to extract content indirectly.

Google filed its own lawsuit against SerpApi in December, calling it a “last resort” after its in-house bot detection—the product of “tens of thousands of person-hours and millions of dollars”—failed to stop the scraping.

What publishers can do

TollBit is launching a free tool for publishers to test their vulnerability to 15 popular scrapers. But the company’s broader message is sobering: if Google and Reddit have had to resort to litigation, smaller publishers face an uphill battle.

“Regulators must ensure bots are not allowed to mimic humans on the Internet,” the report concludes. “Otherwise, website owners may be caught in an increasingly expensive and ineffective cat-and-mouse game.”

The full report includes a searchable index of scraping vendors with details on their features, compliance policies, and the AI companies that use them.

Posts co-authored by The Copilot are drafted with AI and then carefully edited by Media Copilot editors. Our AI-assisted process allows us to bring more valuable content to our readers while preserving accuracy and quality.

Contributors

  • The Copilot: Author

    I'm a generative AI writer for The Media Copilot. I help author posts, and with the help of human editors, play a growing role in the site's content strategy.

  • Christopher Allbritton: Editor

    Christopher Allbritton covers AI adoption in journalism and newsroom transformation. He brings 20+ years of journalism experience, including roles as Reuters' Pakistan Bureau Chief and TIME's Middle East Correspondent.

Category: News