webscraping Archives - The Media Copilot
https://mediacopilot.ai/tag/webscraping/
How AI is changing Media, journalism and content creation

Can AI deliver trustworthy news? NewsGuard thinks its new Chatbot has the answer
https://mediacopilot.ai/newsguard-ai-chatbot-vetted-journalism-publisher-revenue-sharing/
Thu, 25 Jun 2026 18:58:36 +0000

Company says answers come from 12,000 vetted outlets, not web scraped.

The post Can AI deliver trustworthy news? NewsGuard thinks its new Chatbot has the answer appeared first on The Media Copilot.

NewsGuard, a company best known for rating the reliability of online news sources, on Tuesday launched NewsGuard AI, a chatbot that draws exclusively from a database of journalist-vetted stories instead of the open web.

The launch comes as concerns persist over the accuracy of AI-generated responses. NewsGuard said a yearlong audit of leading AI models found they repeated false or misleading claims on controversial news topics 35% of the time. The company argues that limiting responses to vetted sources can help reduce the spread of misinformation through AI systems.

NewsGuard AI attributes information directly to the publishers whose reporting is used in its responses, unlike other chatbots such as ChatGPT, Claude, Gemini, or Perplexity.

Participating publishers include The Atlantic, along with regional newspapers, opinion journals, and public media organizations. Readers, subscribers and members of some participating outlets will receive a free trial of NewsGuard AI followed by an offer for 33% off the chatbot’s standard $6 monthly subscription.

The company also says it will share revenue with participating publishers through a 50-50 revenue-sharing model and affiliate-style subscription referrals, though it has not publicly disclosed the formula used to calculate payouts.

NewsGuard says its journalists have reviewed more than 36,000 sources since 2018, including newspapers, magazines, opinion publications, local news outlets, independent newsletters, government websites, think tanks, hospitals and research universities. Of these, roughly 12,000 have been rated reliable and are eligible to be cited by NewsGuard AI. 

The new service enters a rapidly evolving market in which publishers are negotiating licensing agreements with AI companies while also challenging the unauthorized use of their reporting. Media organizations have struck content deals with companies including OpenAI, Amazon and Meta, even as lawsuits and public disputes over AI scraping continue across the industry.

Chris Richmond, CEO of the fact-checking website Snopes, said the arrangement addressed concerns his organization has had with other AI products.

“Snopes has restricted most AI chatbots from scraping our content,” Richmond said. “But we’re happy to partner with NewsGuard on a model that does this the right way.” 

In addition to drawing from vetted sources, NewsGuard says the chatbot incorporates 41 editorial safeguards. These include access to NewsGuard’s database of 64,000 debunked false claims circulating online, which the company says helps prevent the chatbot from repeating known misinformation. Users can also access detailed explanations debunking false claims and share them with others.

“Few things will matter more in the near future than the ability of humans to figure out what’s real, what’s false, and what’s confabulated nonsense,” said Nicholas Thompson, CEO of The Atlantic. “This is particularly true when it comes to news.”

NewsGuard is also targeting educational institutions. Students at participating schools and universities will receive free access while enrolled. The company says the chatbot has been designed to refuse requests to write essays or reports for users. 

“NewsGuard AI can provide reliable research while not substituting for students doing their own writing and thinking,” said NewsGuard’s Chief Operating Officer Matt Skibinski.

Local language versions of NewsGuard AI will be available in French, German and Italian in September. 

Inside the AI scraping economy nobody wants to talk about
https://mediacopilot.ai/inside-the-ai-scraping-economy-nobody-wants-to-talk-about/
Tue, 19 May 2026 12:00:00 +0000

A shadow market of data middlemen is converting publisher work into fuel for AI agents, and the legal system is doing little to stop them.

The post Inside the AI scraping economy nobody wants to talk about appeared first on The Media Copilot.

The copyright fight between publishers and AI companies has many fronts, but the trickiest one comes down to a single word: outputs. Even if scraping feels indefensible, courts generally aren’t interested in punishing the scrapers unless the resulting product is doing measurable damage to the people whose work was taken. Civil claims especially need a clear line from the act to the injury.

The 2023 Sarah Silverman case is the textbook example. A group of authors including the comedian sued OpenAI for using their books without permission, and a judge later tossed several of the claims because the plaintiffs couldn’t point to specific outputs that were direct copies of their work. Knowing a large language model (LLM) ingested your writing isn’t enough on its own. You have to show the model is producing something that eats into your business.

Why outputs matter more than scraping in court

That evidentiary burden is part of why these cases struggle. Scraping happens silently, at machine speed, behind layers of infrastructure most publishers never see. The outputs of public-facing tools like ChatGPT, Gemini, and Perplexity are easy enough to inspect, but a much larger scraping economy operates outside that view.

It’s been an open secret for a while that AI companies pull data from third-party brokers, and media analyst Matthew Scott Goldstein recently put numbers to it. His report, covered in Digiday, identifies at least 21 companies, several backed by hundreds of millions of dollars, that routinely scrape publisher content without paying for it and sell their “data services” to customers that include OpenAI, Amazon, and even publishers like The Telegraph.

The report is essentially a map of what scraping looks like when no one stops it. Multimillion-dollar businesses, most of them obscure to readers, exist for the sole purpose of indexing publisher content and reselling it to bots and agents. The names won’t ring bells: Parallel AI, Exa, and Bright Data. And they aren’t hiding what they do. A recent Wall Street Journal profile describes Parallel AI as a platform “dedicated to servicing AI agents.” Goldstein calls it a “scraper company with better branding.”

Charlie Munger’s old line—show me the incentives, and I’ll show you the outcome—applies cleanly here. Between the losing streak in court and an administration that has openly waved off copyright concerns, the signal to AI companies and the brokers feeding them is unmistakable. Unauthorized scraping carries little risk, and the default settings of the system push toward more access, not less.

The bot-blocking decision every publisher faces

That setup leaves publishers between a rock and a hard place. Either you block bots as aggressively as your stack will allow, or you let them in. Letting them in feels like surrender, but it also ends the constant whack-a-mole and clears space to build a business that assumes AI will ingest and repurpose your work no matter what.

I’d argue those two stances aren’t as opposed as they look. Publishers should defend their copyright, but they also have to plan for a world in which AI engines are baked into how content reaches anyone. AI is now a distribution channel, a middle layer, and an audience all at once.

So what does a serious response to all this look like? Five components, in my view. Not every publisher will have the resources for all of them.

  • Get better at blocking bots. IP protection takes both legal and technical effort. Most large publishers are nominally blocking bots, but doing it for real means going past the robots exclusion protocol, the polite instructions that sites give bots and that bots regularly ignore. People Inc. CEO Neil Vogel has said his company has needed to become highly sophisticated at blocking unauthorized bots.

    Smaller publishers won’t have that level of resourcing, but technical partners exist, and infrastructure providers like Cloudflare have started shipping copyright-protecting defaults. Even when sophisticated blocking is out of reach, intel is not. Look at your bot traffic, but also audit the AI services themselves to see where your content has surfaced without permission.
  • Practice good GEO. This one feels backwards at first. Whether or not bots have your permission, your content should still be readable to them. Access is binary, on or off. Ignoring generative engine optimization (GEO) just means your work is harder for every bot to parse, including the ones you’d want to let in.

    The case for GEO is practical. Scraping is happening, so you may as well compete inside the summaries and pick up whatever qualified traffic results. It also generates a paper trail for the audits in the previous bullet, which can support any future legal claim. And it becomes foundational if you ever build an in-house agent or MCP server on top of your content.
  • Shift your business model. I’ve covered this at length before, so the short version. The Google-era model is shrinking, and any business built on monetizing anonymous traffic is shrinking with it. New revenue streams (events, subscriptions, data products, licensing) have to be cultivated. Easier said than done. Diversification has to become a religion for ad-dependent publishers, not a side project.
  • Sue. Not realistic for every publisher. Going after OpenAI or Perplexity requires resources most newsrooms don’t have. But the Goldstein report effectively introduces a new set of potential defendants who have been mostly invisible until now. Given what they’re openly doing and the size of the market involved, it would be strange if more legal action didn’t follow.
  • Lobby for regulation. Federal action looks unlikely in the current climate, but states are moving on AI policy, including transparency and disclosure rules around training data. Real progress may not require rewriting copyright law from scratch. Even something as simple as requiring bots to properly identify themselves would stop the impersonation that makes the current scraping economy possible.
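
For the first item above, the "look at your bot traffic" step can be sketched in a few lines of Python. This is a minimal illustration, not a blocking system: the user-agent tokens, the sample robots.txt, and the sample log lines are all assumptions made for the example, and the function names are hypothetical. It compares what a robots.txt file asks of AI crawlers with what shows up in an access log, so requests from bots that were told to stay out become visible.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Illustrative user-agent tokens (an assumption for this sketch); confirm the
# current tokens in each crawler operator's own documentation.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
SAMPLE_ROBOTS = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

# Hypothetical access-log lines in the common "combined" format.
SAMPLE_LOG = [
    '203.0.113.5 - - [19/May/2026:12:00:00 +0000] "GET /story HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [19/May/2026:12:00:02 +0000] "GET /story HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [19/May/2026:12:00:03 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.5 - - [19/May/2026:12:00:04 +0000] "GET /archive HTTP/1.1" 200 700 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]


def disallowed_bots(robots_txt, path="/", tokens=AI_BOT_TOKENS):
    """Return the bot tokens that robots.txt forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, path)]


def count_bot_hits(log_lines, tokens=AI_BOT_TOKENS):
    """Count log lines that mention each bot token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


def ignored_blocks(robots_txt, log_lines):
    """Map each bot that robots.txt blocks to the requests it made anyway."""
    blocked = set(disallowed_bots(robots_txt))
    hits = count_bot_hits(log_lines)
    return {bot: n for bot, n in hits.items() if bot in blocked}


if __name__ == "__main__":
    print(ignored_blocks(SAMPLE_ROBOTS, SAMPLE_LOG))
```

A check like this only catches bots that identify themselves honestly. User-agent strings are easy to fake, which is the impersonation problem raised in the last item above, so a real audit would need additional signals beyond the log's user-agent field.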

Why agency matters more than victory

As bots keep “eating the internet,” it’s tempting to treat scraping as one more thing publishers just have to live with. Some of that resignation is earned. But inevitability is not the same as paralysis. In a world increasingly run by agents, publishers have to claim back some agency of their own. Protect what’s protectable, adapt where adaptation is the only path, and refuse to let the same companies that scraped your work also write the rules for what happens to it next.

A version of this column appears in Fast Company.

UK media giants launch coalition to demand AI licensing standards
https://mediacopilot.ai/spur-coalition-uk-media-ai-licensing-rights/
Thu, 26 Feb 2026 16:17:03 +0000

Five of Britain's largest news organizations just issued a warning: Your journalism is being used to train AI systems without your permission.

The post UK media giants launch coalition to demand AI licensing standards appeared first on The Media Copilot.

On Thursday, the BBC, Sky News, The Guardian, The Telegraph, and the Financial Times announced SPUR—the Standards for Publisher Usage Rights coalition—with an open letter calling on media companies worldwide to join the fight for AI content licensing frameworks.

Key Takeaways

  • Five major UK publishers formed SPUR to push for AI licensing rights.
  • The coalition uses collective bargaining to strengthen publisher power.
  • Standards must be set before AI access norms become too entrenched.

“Our reporting, our archives, our original content, have become foundational training material for AI systems,” the letter states. “This material has been scraped, copied and reused with no common standards to enable permission or payment, weakening the economic model that supports journalism.”

The coalition’s five signatories—BBC director-general Tim Davie, Sky News executive chairman David Rhodes, Guardian CEO Anna Bateson, Telegraph CEO Anna Jones, and Financial Times CEO Jon Slade—argue that AI systems built on journalistic content lack transparency about how they generate answers. That opacity, they say, risks eroding public trust in both news and the AI tools people use to access it.

SPUR’s mission is explicit: establish shared technical standards and licensing frameworks that let AI developers access journalism legitimately while guaranteeing publishers retain control of their content and receive compensation.

This isn’t just a negotiating tactic. The coalition positions itself as a bridge between media companies and AI labs, promising to create “rights-cleared, accountable channels” for content access—essentially, a middle ground between total lockdown and unrestricted scraping. Interested publishers can contact [email protected] to join.

For newsrooms already investing in AI tools, SPUR’s emergence matters. The coalition is explicitly positioning this as a global challenge, not a UK-only issue. That means the frameworks they develop could influence how AI training operates everywhere.

The open letter doesn’t name specific AI companies, but the timing is pointed: OpenAI has been sued by The New York Times over alleged copyright infringement related to training data. Anthropic and Google face similar legal pressure. SPUR appears designed to create a negotiated alternative to courtroom battles.
