webscraping Archives - The Media Copilot

Publishers Turn to AI ‘Honeypots’ to Fight Content Scraping

Romy Abu-Fadel — Tue, 21 Jul 2026 15:34:07 +0000

As AI companies continue scraping the web for training data, some publishers are experimenting with a new tactic that doesn’t just block unwanted bots but tries to waste their time and money.

Digiday describes the approach, known as LLM honeypotting, as a form of deception that lures AI crawlers into consuming plausible-looking but ultimately worthless content. The aim is to make large-scale data collection so computationally expensive that it becomes less economically viable.

Who’s doing this? A small number of publishers and e-commerce companies are looking for alternatives to traditional bot-blocking. And while the technique remains early and experimental, it’s one way for media companies to protect their content amid a fight to create standards for how AI companies track, value and compensate journalism.

A cybersecurity tactic adapted for AI

Cyberhoneypotting is a long-standing cybersecurity strategy. The technique is to build a decoy target for attackers (LLM bots, in this case) that can lure bad actors away and even gather intelligence on their capabilities and methods. Here, the goal is to make scraping content without compensation more costly than it’s worth.

Simon Wistow, co-founder of content delivery network provider Fastly, describes the philosophy as one of “chang[ing] the economics of attacking.” If abusing a system becomes significantly more expensive than the value gained, the entire model will become unsustainable, he argues.

Applied to AI crawlers, the strategy targets all automated visitors, regardless of whether they are operated by major AI companies or smaller third-party scraping firms.

Publishers can implement the tactic in several ways. They can introduce subtle delays, require visitors to solve problems that are difficult for computers before granting access, or build endless mazes of pages and files filled with meaningless AI-generated text. Some honeypotting techniques go even further and try to inject bad data into the bots’ training datasets, “poisoning” the AI results.
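To make two of those tactics concrete, here is a minimal Python sketch, not any vendor’s actual implementation. It assumes a hash-based puzzle as the “difficult problem” and a link maze whose pages are derived from the URL path alone. The function names, the SHA-256 puzzle and the `/maze` path are all illustrative assumptions.

```python
import hashlib
import itertools


def solve_challenge(seed: str, difficulty: int) -> int:
    """Visitor side: search for a nonce whose SHA-256 digest starts with
    `difficulty` zero hex digits. Each extra digit multiplies the work by 16."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{seed}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify_challenge(seed: str, nonce: int, difficulty: int) -> bool:
    """Site side: one hash verifies work that cost the visitor many hashes."""
    digest = hashlib.sha256(f"{seed}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def maze_links(path: str, fanout: int = 3) -> list[str]:
    """Derive child links from the current path alone (hypothetical scheme).
    Every page links to `fanout` new pages, so the maze never ends and the
    site stores nothing."""
    base = path.rstrip("/")
    links = []
    for i in range(fanout):
        token = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:8]
        links.append(f"{base}/{token}")
    return links


if __name__ == "__main__":
    nonce = solve_challenge("demo-seed", difficulty=3)
    print("challenge verified:", verify_challenge("demo-seed", nonce, 3))
    print("maze links:", maze_links("/maze"))
```

The asymmetry is the point of both pieces: the site spends one hash to check a puzzle or to produce a page of links, while a crawler that plays along spends far more compute and bandwidth.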

The tactic’s adoption

Large e-commerce brands are already testing the technology successfully, said Wistow. News publishers are also showing increased interest, although he declined to identify specific customers.

Even so, skepticism towards the strategy remains.

Frederick Jahn, co-founder of AI company Centennal, argues sophisticated scrapers can often detect or avoid honeypots altogether.

“I think it’s a good concept, but more on a marketing level, and like a gimmick,” Jahn said. He argues that publishers would be better served by creating real barriers to stealth crawlers, which are often not shown maze pages.

Supporters of honeypotting maintain that, even if most scrapers adapt, increasing operational costs across thousands or millions of requests could make smaller scraping businesses financially unsustainable.

“If they could burn through that 10 million funding in one crawl then suddenly those businesses aren’t viable and suddenly the whole market collapses, and that’s kind of what you’re going for,” said Wistow.

Costs and limitations for publishers

The strategy isn’t free. Generating and serving millions of fake pages is more expensive than simply blocking unwanted traffic. And larger publishers with more resources are better able to plan and implement the strategy.

Wistow said the approach is unlikely to become widespread, in part because of the consequences of filling the internet with yet more intentionally deceptive content.

“Hallucinations happen even with good data, just because of the way LLMs work,” said Wistow. “This is about changing the economics for the people abusing your site, not running some giant disinformation campaign.”

The post Publishers Turn to AI ‘Honeypots’ to Fight Content Scraping appeared first on The Media Copilot.

Cloudflare will block AI training crawlers by default on ad-supported sites

Romy Abu-Fadel — Wed, 01 Jul 2026 17:36:10 +0000

Cloudflare said Wednesday it will begin blocking AI training and agent crawlers by default on ad-supported websites, a change that could force companies such as Google, Apple and Microsoft to more clearly separate search indexing from AI training if they want continued access to large parts of the web.

The policy, scheduled to take effect Sept. 15, applies to new Cloudflare customers, new sites added by existing customers and existing Free-tier customers who have not changed their settings. Search crawlers will remain allowed by default, but training and agent crawlers will be blocked on pages that display ads.

The company said the changes are designed to help publishers remain visible in AI-powered search results while preventing their content from being used for AI training or autonomous agents without permission or compensation.

“Now that the majority of traffic is non-human, we must go further and act faster so that a sustainable ecosystem can emerge,” said Matthew Prince, Cloudflare’s co-founder and CEO.

Splitting up mixed use crawlers

The Web giant said bots that combine search, AI training and agent activity—known as mixed use crawlers—without letting site owners choose among those uses will be blocked on ad-supported pages when training or agent access is blocked. In a company blog post, Cloudflare named Googlebot, Applebot and BingBot as multi-purpose crawlers that could be affected by the most restrictive applicable rules.

“We hope that our proposed default changes encourage mixed use crawlers to separate out search from agent use and training,” Prince said.

Cloudflare said customers will be able to manage three categories of AI traffic: Search, which indexes content for later retrieval; Agent, which accesses a site on behalf of a user in real time; and Training, which collects content to train or fine-tune models. The controls are available to all Cloudflare customers, including those on the Free tier.
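The default described above can be read as a small decision rule. The following Python sketch is an illustration only, not Cloudflare’s configuration format or API. The crawler names are hypothetical, and treating pages without ads as allowed is an assumption, since the article only describes the default for ad-supported pages.

```python
from dataclasses import dataclass

# Purposes blocked by default on pages that display ads, per the article.
BLOCKED_ON_AD_PAGES = frozenset({"training", "agent"})


@dataclass(frozen=True)
class Crawler:
    name: str
    purposes: frozenset  # subset of {"search", "agent", "training"}


def is_allowed(crawler: Crawler, page_has_ads: bool) -> bool:
    """Apply the most restrictive applicable rule: a crawler that mixes
    purposes is blocked if any one of those purposes is blocked."""
    if not page_has_ads:
        # Assumption: the default block only applies to ad-supported pages.
        return True
    return not (crawler.purposes & BLOCKED_ON_AD_PAGES)


# Hypothetical crawlers for illustration.
search_only = Crawler("ExampleSearchBot", frozenset({"search"}))
mixed_use = Crawler("ExampleMixedBot", frozenset({"search", "training"}))

print(is_allowed(search_only, page_has_ads=True))  # True
print(is_allowed(mixed_use, page_has_ads=True))    # False
```

Under this reading, a mixed use crawler regains access to ad-supported pages only by splitting its search activity into a separately identifiable bot.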

That distinction matters for smaller sites. A spokesperson for Cloudflare said the new controls are intended to give all website owners more options for managing AI traffic, not only publishers with ads or subscriptions. But the default blocking policy is tied to pages with advertising, and Cloudflare’s compensation plans remain focused on commercial use cases where AI systems access or surface publisher content.

Alongside the new crawler controls, Cloudflare is expanding analytics to show publishers how bots interact with their content and how much traffic AI platforms send back. The company is also pushing into what it calls Answer Engine Optimization, or AEO, offering tools it says will help customers understand how often their content is cited or surfaced in AI-generated answers.

Cloudflare also announced efforts to reduce unnecessary AI crawling. According to the company, more than half of AI crawler traffic is spent repeatedly checking web pages that have not changed. Because Cloudflare sits between websites and online traffic, it says it can signal to AI companies when pages have been updated and are worth revisiting. The company said it is testing those signals with AI firms and plans a broader rollout later this year.

New compensation model

The company is also expanding its publisher compensation strategy by evolving its Pay Per Crawl program into a new system called Pay Per Use. Rather than paying publishers when content is crawled, the new model is designed to compensate them when their content is actually used in AI products. Cloudflare said it is working with AI companies including Ceramic.ai and You.com on the initiative. Under the arrangements, publishers could be paid when their content appears in AI search results or when AI agents access premium content on demand.

But the model does not yet answer the hardest compensation question: what happens when a publisher’s work is used for model training but never appears in a cited answer? Asked whether Pay Per Use compensates publishers in that scenario, the spokesperson said the program is aimed at “programmatic, real-time access and discovery,” and described Pay Per Crawl and Pay Per Use as only two possible economic frameworks.

“The digital landscape is evolving rapidly,” said Marrissa Holloway of Cloudflare. “We welcome ideas from publishers, creators, and AI companies alike on how to build a thriving agentic Internet.”

Holloway did not directly say what Cloudflare’s cut of any revenue generated would be. “It has always been our philosophy that our customers derive many multiples of value more than they pay us,” she said.

The Media Copilot’s take

Cloudflare is not solving AI compensation for the whole Web. It’s building a bargaining layer for larger publishers with enough traffic and revenue to measure, block and negotiate. That helps the larger content outlets, but smaller sites and independent publishers will get switches to turn on and off. That’s useful, but switches don’t mean they have leverage. The long tail of the Web (the indie blogs, community web pages and hobby sites) can say “no” more clearly, but there is still no obvious way for them to get paid when their work is used for an AI’s training data and never comes back with a citation or link.

Can AI deliver trustworthy news? NewsGuard thinks its new Chatbot has the answer

Romy Abu-Fadel — Thu, 25 Jun 2026 18:58:36 +0000

NewsGuard, a company best known for rating the reliability of online news sources, on Tuesday launched NewsGuard AI, a chatbot that draws exclusively from a database of journalist-vetted stories instead of the open web.

The launch comes as concerns persist over the accuracy of AI-generated responses. NewsGuard said a yearlong audit of leading AI models found they repeated false or misleading claims on controversial news topics 35% of the time. The company argues that limiting responses to vetted sources can help reduce the spread of misinformation through AI systems.

Unlike other chatbots such as ChatGPT, Claude, Gemini, or Perplexity, NewsGuard AI attributes information directly to the publishers whose reporting is used in its responses.

Participating publishers include The Atlantic as well as regional newspapers, opinion journals, and public media organizations. Readers, subscribers and members of some participating outlets will receive a free trial of NewsGuard AI followed by an offer for 33% off the chatbot’s standard $6 monthly subscription.

The company also says it will share revenue with participating publishers through a 50-50 revenue-sharing model and affiliate-style subscription referrals, though it has not publicly disclosed the formula used to calculate payouts.

NewsGuard says its journalists have reviewed more than 36,000 sources since 2018, including newspapers, magazines, opinion publications, local news outlets, independent newsletters, government websites, think tanks, hospitals and research universities. Of these, roughly 12,000 have been rated reliable and are eligible to be cited by NewsGuard AI.

The new service enters a rapidly evolving market in which publishers are negotiating licensing agreements with AI companies while also challenging the unauthorized use of their reporting. Media organizations have struck content deals with companies including OpenAI, Amazon and Meta, even as lawsuits and public disputes over AI scraping continue across the industry.

Chris Richmond, CEO of the fact-checking website Snopes, said the arrangement addressed concerns his organization has had with other AI products.

“Snopes has restricted most AI chatbots from scraping our content,” Richmond said. “But we’re happy to partner with NewsGuard on a model that does this the right way.”

In addition to drawing from vetted sources, NewsGuard says its chatbot incorporates 41 editorial safeguards. These include access to NewsGuard’s database of 64,000 debunked false claims circulating online, which the company says helps prevent the chatbot from repeating known misinformation. Users can also access detailed explanations debunking false claims and share them with others.

“Few things will matter more in the near future than the ability of humans to figure out what’s real, what’s false, and what’s confabulated nonsense,” said Nicholas Thompson, CEO of The Atlantic. “This is particularly true when it comes to news.”

NewsGuard is also targeting educational institutions. Students at participating schools and universities will receive free access while enrolled. The company says the chatbot has been designed to refuse requests to write essays or reports for users.

“NewsGuard AI can provide reliable research while not substituting for students doing their own writing and thinking,” said NewsGuard’s Chief Operating Officer Matt Skibinski.

Local-language versions of NewsGuard AI will be available in French, German and Italian in September.

Inside the AI scraping economy nobody wants to talk about

Pete Pachal — Tue, 19 May 2026 12:00:00 +0000

The copyright fight between publishers and AI companies has many fronts, but the trickiest one comes down to a single word: outputs. Even if scraping feels indefensible, courts generally aren’t interested in punishing the scrapers unless the resulting product is doing measurable damage to the people whose work was taken. Civil claims especially need a clear line from the act to the injury.

The 2023 Sarah Silverman case is the textbook example. A group of authors including the comedian sued OpenAI for using their books without permission, and a judge later tossed several of the claims because the plaintiffs couldn’t point to specific outputs that were direct copies of their work. Knowing a large language model (LLM) ingested your writing isn’t enough on its own. You have to show the model is producing something that eats into your business.

Why outputs matter more than scraping in court

That evidentiary burden is part of why these cases struggle. Scraping happens silently, at machine speed, behind layers of infrastructure most publishers never see. The outputs of public-facing tools like ChatGPT, Gemini, and Perplexity are easy enough to inspect, but a much larger scraping economy operates outside that view.

It’s been an open secret for a while that AI companies pull data from third-party brokers, and media analyst Matthew Scott Goldstein recently put numbers to it. His report, covered in Digiday, identifies at least 21 companies, several backed by hundreds of millions of dollars, that routinely scrape publisher content without paying for it and sell their “data services” to customers that include OpenAI, Amazon, and even publishers like The Telegraph.

The report is essentially a map of what scraping looks like when no one stops it. Multimillion-dollar businesses, most of them obscure to readers, exist for the sole purpose of indexing publisher content and reselling it to bots and agents. The names won’t ring bells: Parallel AI, Exa, and Bright Data. And they aren’t hiding what they do. A recent Wall Street Journal profile describes Parallel AI as a platform “dedicated to servicing AI agents.” Goldstein calls it a “scraper company with better branding.”

Charlie Munger’s old line—show me the incentives, and I’ll show you the outcome—applies cleanly here. Between the losing streak in court and an administration that has openly waved off copyright concerns, the signal to AI companies and the brokers feeding them is unmistakable. Unauthorized scraping carries little risk, and the default settings of the system push toward more access, not less.

The bot-blocking decision every publisher faces

That setup leaves publishers between a rock and a hard place. Either you block bots as aggressively as your stack will allow, or you let them in. Letting them in feels like surrender, but it also ends the constant whack-a-mole and clears space to build a business that assumes AI will ingest and repurpose your work no matter what.

I’d argue those two stances aren’t as opposed as they look. Publishers should defend their copyright, but they also have to plan for a world in which AI engines are baked into how content reaches anyone. AI is now a distribution channel, a middle layer, and an audience all at once.

So what does a serious response to all this look like? Five components, in my view. Not every publisher will have the resources for all of them.

Get better at blocking bots. IP protection takes both legal and technical effort. Most large publishers are nominally blocking bots, but doing it for real means going past the robots exclusion protocol, the polite instructions that sites give bots and that bots regularly ignore. People Inc. CEO Neil Vogel has said his company has needed to become highly sophisticated at blocking unauthorized bots.

Smaller publishers won’t have that level of resourcing, but technical partners exist, and infrastructure providers like Cloudflare have started shipping copyright-protecting defaults. Even when sophisticated blocking is out of reach, intel is not. Look at your bot traffic, but also audit the AI services themselves to see where your content has surfaced without permission.

Practice good GEO. This one feels backwards at first. Whether or not bots have your permission, your content should still be readable to them. Access is binary, on or off. Ignoring generative engine optimization (GEO) just means your work is harder for every bot to parse, including the ones you’d want to let in.

The case for GEO is practical. Scraping is happening, so you may as well compete inside the summaries and pick up whatever qualified traffic results. It also generates a paper trail for the audits in the previous bullet, which can support any future legal claim. And it becomes foundational if you ever build an in-house agent or MCP server on top of your content.

Shift your business model. I’ve covered this at length before, so the short version. The Google-era model is shrinking, and any business built on monetizing anonymous traffic is shrinking with it. New revenue streams (events, subscriptions, data products, licensing) have to be cultivated. Easier said than done. Diversification has to become a religion for ad-dependent publishers, not a side project.

Sue. Not realistic for every publisher. Going after OpenAI or Perplexity requires resources most newsrooms don’t have. But the Goldstein report effectively introduces a new set of potential defendants who have been mostly invisible until now. Given what they’re openly doing and the size of the market involved, it would be strange if more legal action didn’t follow.

Lobby for regulation. Federal action looks unlikely in the current climate, but states are moving on AI policy, including transparency and disclosure rules around training data. Real progress may not require rewriting copyright law from scratch. Even something as simple as requiring bots to properly identify themselves would stop the impersonation that makes the current scraping economy possible.
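A short illustration of why the robots exclusion protocol only goes so far, and why self-identification matters. This is a minimal Python sketch using the standard library’s `urllib.robotparser`. The bot names and the `example.com` URL are made up for the example. The check below is one that a well-behaved crawler runs on itself; nothing on the publisher’s side enforces it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that turns away one training bot and allows the rest.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """What a polite crawler would conclude before requesting `url`."""
    return parser.can_fetch(user_agent, url)


url = "https://example.com/story"
print(may_fetch("ExampleTrainingBot", url))  # False: the bot is asked to stay out
print(may_fetch("ExampleSearchBot", url))    # True: same crawler, different name
```

The rules key entirely on the name a bot chooses to announce. A scraper that skips the check, or announces a different name, sees no barrier at all, which is the gap that stronger technical blocking and identification requirements are meant to close.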

Why agency matters more than victory

As bots keep “eating the internet,” it’s tempting to treat scraping as one more thing publishers just have to live with. Some of that resignation is earned. But inevitability is not the same as paralysis. In a world increasingly run by agents, publishers have to claim back some agency of their own. Protect what’s protectable, adapt where adaptation is the only path, and refuse to let the same companies that scraped your work also write the rules for what happens to it next.

A version of this column appears in Fast Company.

UK media giants launch coalition to demand AI licensing standards

The Copilot — Thu, 26 Feb 2026 16:17:03 +0000

On Thursday, the BBC, Sky News, The Guardian, The Telegraph, and the Financial Times announced SPUR—the Standards for Publisher Usage Rights coalition—with an open letter calling on media companies worldwide to join the fight for AI content licensing frameworks.

Key Takeaways

Five major UK publishers formed SPUR to push for AI licensing rights.
The coalition uses collective bargaining to strengthen publisher power.
Standards must be set before AI access norms become too entrenched.

“Our reporting, our archives, our original content, have become foundational training material for AI systems,” the letter states. “This material has been scraped, copied and reused with no common standards to enable permission or payment, weakening the economic model that supports journalism.”

The coalition’s five signatories—BBC director-general Tim Davie, Sky News executive chairman David Rhodes, Guardian CEO Anna Bateson, Telegraph CEO Anna Jones, and Financial Times CEO Jon Slade—argue that AI systems built on journalistic content lack transparency about how they generate answers. That opacity, they say, risks eroding public trust in both news and the AI tools people use to access it.

SPUR’s mission is explicit: establish shared technical standards and licensing frameworks that let AI developers access journalism legitimately while guaranteeing publishers retain control of their content and receive compensation.

This isn’t just a negotiating tactic. The coalition positions itself as a bridge between media companies and AI labs, promising to create “rights-cleared, accountable channels” for content access: essentially, a middle ground between total lockdown and unrestricted scraping. Interested publishers can contact the coalition to join.

For newsrooms already investing in AI tools, SPUR’s emergence matters. The coalition is explicitly positioning this as a global challenge, not a UK-only issue. That means the frameworks they develop could influence how AI training operates everywhere.

The open letter doesn’t name specific AI companies, but the timing is pointed: OpenAI has been sued by The New York Times over alleged copyright infringement related to training data. Anthropic and Google face similar legal pressure. SPUR appears designed to create a negotiated alternative to courtroom battles.
