This post originally appeared in The Media Copilot newsletter.
In the struggle between content creators and the AI builders who scrape their content, a new front has opened up. Instead of targeting the tech companies that build and operate the AI models, creators are going after the data itself.
AI companies rely on public data sets: huge compilations of content scraped from the internet that were previously used mostly for research. The most popular is Common Crawl. Although training AI wasn’t the original purpose of these troves, they quickly became go-to sources of training data for most of the major models.
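If you’re curious whether your own pages are in there, Common Crawl’s index is itself public. Here’s a minimal Python sketch that checks one domain against one snapshot, assuming you have the requests package installed (the snapshot label is just an example; the current list lives at index.commoncrawl.org/collinfo.json):

    import json
    import requests

    CRAWL = "CC-MAIN-2024-18"   # example snapshot label; substitute a current one
    DOMAIN = "example.com"      # your site here

    # Query the snapshot's public CDX index for every capture of the domain.
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": f"{DOMAIN}/*", "output": "json"},
        timeout=30,
    )

    if resp.status_code == 404:
        # The index returns 404 when a domain has no captures in the snapshot.
        print(f"No captures of {DOMAIN} in {CRAWL}")
    else:
        resp.raise_for_status()
        records = [json.loads(line) for line in resp.text.splitlines()]
        print(f"{len(records)} captures of {DOMAIN} in {CRAWL}")
        for rec in records[:5]:
            print(rec["timestamp"], rec["url"])

Each record is one capture of one of your URLs, which gives you a rough sense of how much of your site is sitting in a given snapshot.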
The thing is, if you’re a content creator, you probably don’t mind your work being used for research, and you probably see the benefit of having your data crawled, since it’s the same process Google uses to index the web and then link people to your content in Google Search.
AI changes that calculation. Now, if the AI gets access to your content but just ends up summarizing it for its users, there’s essentially no benefit to you in having your data crawled. You may as well opt out.
Hiding Content From AI
That seems to be the logic of publishers such as The New York Times, a whole bunch of media outlets in Denmark, and other publications, all of which have asked for their data to be removed from Common Crawl, according to Wired. Apparently, the data harvester had never received a removal request prior to 2023, but it’s now fielding a bunch of them. Although there’s a case to be made that the public nature of the data makes Common Crawl’s actions legally defensible, the organization is a nonprofit that can’t withstand lawsuits, so it’s simply complying with every deletion request.
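For anyone tempted to follow suit, the mechanics are less exotic than they sound: Common Crawl’s crawler identifies itself as CCBot and respects robots.txt, so keeping it out of future crawls is a two-line addition to that file (getting content scrubbed from past snapshots still takes a deletion request like the ones above):

    User-agent: CCBot
    Disallow: /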
It’s an understandable move on the part of the publishers. However, outside of big organizations like the Times, which has massive brand recognition and a very successful business, it strikes me as a bit suicidal. Common Crawl serves all kinds of purposes besides providing AI with training data, and it helps power several search engines besides Google. Opting out to end-run the threat of AI is a classic “cutting off your nose to spite your face” situation.
It’s also a dead end, in my view. The fact is, AI is a part of our media ecosystem now, and there’s a simple reason why: Summarization is an attractive product for many news consumers.
The whole act of searching for a topic, getting some blue links, and then clicking on a bunch to form your impression of the right answer was never the most efficient process. AI is actually an excellent intermediary here because it does that work for you, removing friction along the way. Even though, yes, AI sometimes hallucinates, it’s a lot less likely to do so when it’s adapting existing text instead of reaching into its knowledge base to write something “original.”
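That distinction is easy to see in practice. Here’s a rough sketch of a grounded summarizer using OpenAI’s Python client; the model name, prompt, and file are just examples, not a recommendation:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical local copy of the article to be summarized.
    with open("article.txt") as f:
        article_text = f.read()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; substitute whatever you use
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize the article the user provides in three "
                    "sentences. Use only facts stated in the article; "
                    "if something isn't in the text, leave it out."
                ),
            },
            {"role": "user", "content": article_text},
        ],
    )
    print(response.choices[0].message.content)

Because the model is handed the full text and told to stay inside it, it’s adapting rather than recalling, which is exactly the lower-risk mode described above.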
Summarization Realization
But the point is that summarization, whether done on demand by a chatbot or automatically by an aggregator, isn’t going away and can only grow. Artifact might have left the building, but its generative takeaways live on in the new Yahoo News app, not to mention news aggregators like Otherweb and a new player on the scene, Particle. Perplexity, Google’s AI Overviews, and the coming ChatGPT Search are all moving in this direction. A large and growing portion of news consumption will inevitably happen at the AI-summary level.
Publishers need to understand that if they’re going to adapt their content strategy to this new world. Does that mean handing your content over essentially for free to AI summarizers in order to be a part of that world? Maybe not in every case. But isolating your content from the AI summary-industrial complex by opting out of data sets like Common Crawl is a defensive move, and it won’t work without also playing offense.
What does that mean? For larger publishers, we already know: signing deals with AI builders like OpenAI, which in turn alters incentives in the marketplace. If you’re a small to midsize publisher, that’s probably not an option, but there are other ways to take control of your own destiny.
Rather than throwing yourself on the mercy of AI models and aggregators, you can start building AI experiences into your own platform. While that won’t change any external forces, it will start pointing your content strategy in the right direction and orient your operations toward the unique value your content provides in the marketplace.
What would really move the needle, though, is a marketplace where publishers can get fair value for their content from the companies that want to summarize it for audiences. That’s exactly what TollBit is trying to build. It’s slow going, but with more publisher participation it could become the offensive complement to the defensive move of opting out of data sets like Common Crawl.
Right now there’s a lot of fear, uncertainty, and doubt around what AI summarization will do to the media world. One thing is not in doubt, though: It’s happening, and content providers need to adapt. Standing still isn’t an option.