AI Scrapers Bypass Publisher Protections at Massive Scale

AI companies are increasingly using third-party web scrapers to access publisher content rather than paying for it directly, according to a new report from TollBit, a startup that helps publishers monetize AI traffic.

Key Takeaways

TollBit’s “The Pipes are Leaky” report names nearly 40 scraping vendors selling access to web content.
Even publishers with AI deals see chatbot click-throughs collapse.
Robots.txt and basic blocks are routinely ignored at scale.

The State of the Bots report, titled “The Pipes are Leaky,” documents nearly 40 web scraping vendors selling access to the web. Many advertise cybersecurity evasion tools and don’t comply with robots.txt by default. Some can even penetrate paywalls.

“Even paywalled content is not necessarily safe,” the report states. In tests across 30 high-authority sites, TollBit found that some scrapers “were able to scrape various paywalled articles in full.” Paywalled sites offered no additional protection—scrapers retrieved full content from the vast majority of pages regardless of subscription barriers.

The scale of the problem

AI scraping grew dramatically over 2025. In Q1, there was one AI bot visit for every 200 human visits to publisher sites. By year’s end, that ratio had tightened to 1:31. Taken at face value, those two figures imply roughly a 6.5-fold rise in the bot-to-human ratio over the year.
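The change between the two quoted ratios can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the 1:200 and 1:31 figures given above; the variable names are illustrative and nothing here comes from TollBit's own methodology.

```python
from fractions import Fraction

# Figures quoted in the article: AI bot visits per human visit.
q1_ratio = Fraction(1, 200)   # Q1 2025
q4_ratio = Fraction(1, 31)    # end of 2025

# How many times larger the year-end ratio is than the Q1 ratio.
fold_change = q4_ratio / q1_ratio
# The same change expressed as a percentage increase.
percent_increase = (fold_change - 1) * 100

print(f"fold change: {float(fold_change):.2f}x")
print(f"percent increase: {float(percent_increase):.0f}%")
```

Run as written, this prints a fold change of 6.45x and a percent increase of 545%.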

RAG (retrieval-augmented generation) bots, the kind that power real-time searches on ChatGPT and Perplexity, are the primary culprits. They now make roughly 10 page requests for every single request from a training bot. Unlike training crawlers that grab content once for model development, RAG bots need continuous access to fresh information.

OpenAI’s ChatGPT-User bot leads the pack, scraping at a rate five times higher than the second-most-active scraper (Meta) and 16 times higher than Perplexity’s agent.

What they’re scraping

The report reveals clear patterns in what content AI bots are fetching. In Q4, the most-scraped topics included “Stranger Things Season 5,” “Netflix Warner Bros Deal,” and “Holiday Gift Guides & Black Friday Deals”—a shift from Q3’s trending topics around the “Kimmel/Kirk Controversy” and “The Summer I Turned Pretty.”

Different AI tools serve different user needs. ChatGPT users tend toward general search queries. Perplexity users focus on consumer product research and reviews. Claude users skew toward professional tasks, especially in tech.

The categories most heavily scraped: B2B/professional content (up 62 percent), national news (up 55 percent), and tech and consumer electronics (up 107 percent).

Deals aren’t helping

For publishers hoping AI licensing deals would preserve traffic, the news is grim. Click-through rates from AI applications are collapsing across the board.

Sites without direct AI deals saw click-through rates drop from 0.8 percent in Q2 to 0.27 percent by year’s end, a threefold decline. But even sites with 1:1 licensing agreements fared poorly: their click-through rates fell from 8.8 percent in Q1 to 1.33 percent in Q4, a more than sixfold drop.

On average, AI applications now deliver just 0.12 percent of referral traffic to publishers. Google, by comparison, still delivers over 80 percent.

Robots.txt is failing

The robots.txt protocol, long the standard for telling bots what they can and can’t access, appears increasingly toothless. Thirty percent of AI scrapes in Q4 ignored explicit robots.txt instructions. OpenAI’s ChatGPT-User had the highest non-compliance rate at 42 percent, despite the company’s documentation claiming it respects the protocol.
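For readers unfamiliar with how robots.txt compliance works, here is a minimal sketch of the check a well-behaved bot would run before fetching a page, using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, the `SomeOtherBot` name, and the `may_fetch` helper are all hypothetical; only the `ChatGPT-User` agent name comes from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (not taken from the report).
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow: /subscribers/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("ChatGPT-User", "https://example.com/news/story"))         # False
print(may_fetch("SomeOtherBot", "https://example.com/news/story"))         # True
print(may_fetch("SomeOtherBot", "https://example.com/subscribers/story"))  # False
```

The check is entirely voluntary: nothing in the protocol stops a scraper from skipping it, which is why the file can be ignored at scale.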

The problem extends beyond the big AI labs. Reddit’s October lawsuit against Perplexity also named three scraping vendors most publishers have never heard of: Oxylabs UAB, AWMProxy, and SerpApi. The first two provide IP proxies that disguise bots as human traffic. SerpApi scrapes Google’s search results to extract content indirectly.

Google filed its own lawsuit against SerpApi in December, calling it a “last resort” after its in-house bot detection—the product of “tens of thousands of person-hours and millions of dollars”—failed to stop the scraping.

What publishers can do

TollBit is launching a free tool for publishers to test their vulnerability to 15 popular scrapers. But the company’s broader message is sobering: if Google and Reddit have had to resort to litigation, smaller publishers face an uphill battle.

“Regulators must ensure bots are not allowed to mimic humans on the Internet,” the report concludes. “Otherwise, website owners may be caught in an increasingly expensive and ineffective cat-and-mouse game.”

The full report includes a searchable index of scraping vendors with details on their features, compliance policies, and the AI companies that use them.
