AI companies are increasingly using third-party web scrapers to access publisher content rather than paying for it directly, according to a new report from TollBit, a startup that helps publishers monetize AI traffic.
Key Takeaways
- TollBit’s “The Pipes are Leaky” report finds nearly 40 scraping vendors selling access to publisher content.
- Even publishers with AI deals see chatbot click-throughs collapse.
- Robots.txt and basic blocks are routinely ignored at scale.
The State of the Bots report, titled “The Pipes are Leaky,” documents nearly 40 web scraping vendors selling access to the web. Many advertise cybersecurity evasion tools and don’t comply with robots.txt by default. Some can even penetrate paywalls.
“Even paywalled content is not necessarily safe,” the report states. In tests across 30 high-authority sites, TollBit found that some scrapers “were able to scrape various paywalled articles in full.” Paywalled sites offered no additional protection—scrapers retrieved full content from the vast majority of pages regardless of subscription barriers.
The scale of the problem
AI scraping grew dramatically over 2025. In Q1, there was one AI bot visit for every 200 human visits to publisher sites. By year’s end, that ratio had tightened to 1:31, which works out to more than a sixfold increase in the bot-to-human ratio.
RAG bots—the kind that power real-time searches on ChatGPT and Perplexity—are the primary culprits. They now make roughly 10 page requests for every single request from a training bot. Unlike training crawlers that grab content once for model development, RAG bots need continuous access to fresh information.
OpenAI’s ChatGPT-User bot leads the pack, scraping at a rate five times higher than the second-most-active scraper (Meta) and 16 times higher than Perplexity’s agent.
What they’re scraping
The report reveals clear patterns in what content AI bots are fetching. In Q4, the most-scraped topics included “Stranger Things Season 5,” “Netflix Warner Bros Deal,” and “Holiday Gift Guides & Black Friday Deals”—a shift from Q3’s trending topics around the “Kimmel/Kirk Controversy” and “The Summer I Turned Pretty.”
Different AI tools serve different user needs. ChatGPT users tend toward general search queries. Perplexity users focus on consumer product research and reviews. Claude users skew toward professional tasks, especially in tech.
The categories most heavily scraped: B2B/professional content (up 62 percent), national news (up 55 percent), and tech and consumer electronics (up 107 percent).

Deals aren’t helping
For publishers hoping AI licensing deals would preserve traffic, the news is grim. Click-through rates from AI applications are collapsing across the board.
Sites without direct AI deals saw click-through rates drop from 0.8 percent in Q2 to 0.27 percent by year’s end, a roughly threefold decline. But even sites with 1:1 licensing agreements fared poorly: their click-through rates fell from 8.8 percent in Q1 to 1.33 percent in Q4, a more than sixfold drop.
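For readers who want to check the arithmetic, here is a minimal sketch that recomputes the declines from the four click-through rates quoted above. The function names are illustrative and do not come from the report.

```python
def fold_decline(start_rate: float, end_rate: float) -> float:
    """How many times smaller the end rate is than the start rate."""
    return start_rate / end_rate


def percent_drop(start_rate: float, end_rate: float) -> float:
    """Relative decline, expressed as a percentage of the start rate."""
    return (start_rate - end_rate) / start_rate * 100


# Click-through rates (in percent) as quoted in the article.
no_deal_fold = fold_decline(0.8, 0.27)      # sites without direct AI deals
with_deal_fold = fold_decline(8.8, 1.33)    # sites with 1:1 licensing deals

print(f"No deal:   {no_deal_fold:.2f}x lower, {percent_drop(0.8, 0.27):.0f}% drop")
print(f"With deal: {with_deal_fold:.2f}x lower, {percent_drop(8.8, 1.33):.0f}% drop")
```

The two series start in different quarters (Q2 and Q1), so the fold figures describe each group’s own decline rather than a like-for-like comparison.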
On average, AI applications now deliver just 0.12 percent of referral traffic to publishers. Google, by comparison, still delivers over 80 percent.
Robots.txt is failing
The robots.txt protocol, long the standard for telling bots what they can and can’t access, appears increasingly toothless. Thirty percent of AI scrapes in Q4 ignored the permissions sites had explicitly set. OpenAI’s ChatGPT-User had the highest non-compliance rate at 42 percent, despite the company’s documentation claiming it respects the protocol.
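To make the idea of non-compliance concrete, here is a minimal sketch of how a publisher might check logged bot requests against its own robots.txt, using Python’s standard urllib.robotparser module. The robots.txt contents, the sample log, the ExampleBot agent, and the definition of “non-compliance” used here are all assumptions for illustration; the report’s own methodology may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (not taken from the report).
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow: /subscribers/
"""

# Hypothetical access-log entries: (user agent, requested URL).
SAMPLE_LOG = [
    ("ChatGPT-User", "https://example.com/news/story"),
    ("ChatGPT-User", "https://example.com/"),
    ("ExampleBot", "https://example.com/news/story"),
    ("ExampleBot", "https://example.com/subscribers/report"),
]


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this agent to fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def noncompliance_rate(robots_txt: str, requests: list[tuple[str, str]]) -> float:
    """Share of logged requests that robots.txt disallowed for that agent."""
    if not requests:
        return 0.0
    violations = sum(
        1 for agent, url in requests if not is_allowed(robots_txt, agent, url)
    )
    return violations / len(requests)


rate = noncompliance_rate(ROBOTS_TXT, SAMPLE_LOG)
print(f"Requests ignoring robots.txt: {rate:.0%}")
```

A check like this only catches bots that identify themselves honestly. Scrapers that disguise their traffic as human visits would not appear under a bot user agent at all.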
The problem extends beyond the big AI labs. Reddit’s October lawsuit against Perplexity also named three scraping vendors most publishers have never heard of: Oxylabs UAB, AWMProxy, and SerpApi. Two provide IP proxies that disguise bots as human traffic. SerpApi scrapes Google’s search results to extract content indirectly.
Google filed its own lawsuit against SerpApi in December, calling it a “last resort” after its in-house bot detection—the product of “tens of thousands of person-hours and millions of dollars”—failed to stop the scraping.
What publishers can do
TollBit is launching a free tool for publishers to test their vulnerability to 15 popular scrapers. But the company’s broader message is sobering: if Google and Reddit have had to resort to litigation, smaller publishers face an uphill battle.
“Regulators must ensure bots are not allowed to mimic humans on the Internet,” the report concludes. “Otherwise, website owners may be caught in an increasingly expensive and ineffective cat-and-mouse game.”
The full report includes a searchable index of scraping vendors with details on their features, compliance policies, and the AI companies that use them.







