The Scraper Economy Is Here. Publishers Aren't Paid.

This episode is sponsored by: Adobe Acrobat

On this episode of The Media Copilot podcast, host Pete Pachal sits down with Jonathan Woahn to zero in on a part of the AI content ecosystem that’s just out of sight.

The conversation explores the fast-growing “scraper economy,” where data brokers, indexing companies, and AI infrastructure providers are quietly monetizing access to the web at massive scale while traditional publishers struggle to establish sustainable licensing models. With this gray market of internet data growing, how can publishers both protect their content and take advantage of the now billion-dollar demand for it?

Pete and Jonathan also explore:
• Why the social contract between Google and publishers has fundamentally changed
• The rise of ethically sourced data and whether AI companies will eventually care where content comes from
• Why inference markets may become far more valuable than model training
• How publishers should think about MCPs, AI infrastructure, and product strategy
• Whether a legitimate marketplace for AI content licensing can actually emerge before scraper economics dominate the ecosystem

Along the way, Jonathan shares how his company, Cashmere, is helping publishers structure, license, and deploy content for AI systems while quietly brokering relationships between content owners and companies looking for legal, high-quality access to trusted information.

Listen or watch:

YouTube

Spotify

Apple Podcasts

Why this matters:

As generative AI continues reshaping how audiences consume information, the future of publishing may depend on whether media companies can establish sustainable economic models around their content before gray-market scraping ecosystems become the default infrastructure layer of the internet.

This conversation goes beyond AI hype and digs into the economics, legal gray areas, and technical realities quietly redefining the relationship between publishers, platforms, and information itself.

About the 👤 Guest

LinkedIn: jonathanwoahn

Website: cashmere.io

LinkedIn: company/cashmereio

Manifesto: cashmere.io/manifesto

Subscribe to our newsletter

How AI is changing media, journalism, and content creation.

Learn More

The new Adobe productivity agent orchestrates tools and models to generate images, text and rich content like presentations, podcasts and social posts, while also powering conversational PDF editing in Acrobat.

With new PDF Spaces capabilities, users can combine files, links and notes into interactive, shareable spaces for research, collaboration and content creation. VICE News, Kid Cudi and celebrity event planner Mindy Weiss are already using these tools to build trust and deeper engagement with their audiences.

Link: Do that with Acrobat: AI-Powered PDF workspaces | Adobe Acrobat

Enjoyed this episode?

Subscribe to The Media Copilot on Substack, Apple Podcasts, Spotify, or your favorite app. On YouTube? Tap the Like button and Subscribe to the YouTube channel. For more AI tools and resources built for media professionals, visit mediacopilot.ai.

Produced by Pete Pachal and Executive Producer Michele Musso
Edited by the Musso Media Team

Music: “Favorite” by Alexander Nakarada, licensed under CC BY 4.0

TRANSCRIPT

Pete Pachal (00:20.341)

Hi, welcome to the Media Copilot. It’s a podcast about how AI is changing media, news, and communication. I’m your host, Pete Pachal. I covered tech for a long time as a journalist, and now I talk with the media leaders, the builders, and the creators, all trying to answer the question, how will we get information in the future, and how will that transform journalism and the business of media? Quick note.

If you’re listening on Apple or Spotify, please leave a five-star review, maybe a nice comment. And if you’re watching on YouTube, please like the video and subscribe to the channel if you don’t mind. Those things really do help more people find the show. This week, we’re talking about one of the biggest unresolved questions in media and AI, which is if AI systems are going to use publisher content to answer people’s questions, what does a real marketplace for that content look like?

Because the uncomfortable answer may be that the marketplace already exists. It just doesn’t include publishers. A growing network of scraping companies and data brokers and AI infrastructure players is helping companies access the web at scale. And some of that is powering AI search, RAG systems, and even enterprise research tools. Now publishers are trying to stop some of it, maybe use some of it, and certainly get paid for more of it.

My guest this week is Jonathan Woahn, co-founder and CXO of Cashmere. Cashmere is a company building infrastructure for premium publishers to manage and license and protect and monetize their content in AI systems. Cashmere has worked around deals involving publishers and content providers like Wiley and AI platforms like Perplexity. And Jonathan’s been thinking deeply about this and the difference between a scraper-driven market and a legitimate content marketplace.

So we’re going to talk about this. We’re going to talk about the scraper economy, why publishers may be losing, what little leverage they have, what it could take to build a cleaner licensed market, and whether the future of media depends less on blocking bots and more on building just good pipes. Jonathan Woahn, welcome to the Media Copilot.

Jonathan Woahn (02:31.714)

Pete, it’s great to be here, thank you.

Pete Pachal (02:34.389)

So before we get into all those big, heady topics I was just describing there, I’d love to learn just like a little bit more about you and your background and how you came to be connected to this, the business of connections, which is connecting like publishers to good marketplaces. So tell us a little bit about your background.

Jonathan Woahn (02:38.434)

Yes.

Jonathan Woahn (02:55.436)

Yeah, thanks. I’m excited to be here, Pete. So my background, I’m a serial entrepreneur. This is the fifth company that I’ve actually started or been an early member of. And in the previous company that I was at, it was called Book Club. And what we were doing at Book Club is we were working with publishers and authors to create professional development programs. And we started out doing these bespoke manually. It was very time intensive. It was very expensive.

And when AI and ChatGPT came around, we saw we could actually use AI to create very custom and very bespoke professional development programs. And so we started trying to do that and we ran into two big problems. We found, first of all, the process of actually licensing the content was extremely time consuming and difficult. And then the second was, even once we were able to license it, the structure of the data that we got was itself very difficult for AI to work with.

The anecdote that I use with people is, you know, we would create, um, you know, a program around seven habits of highly effective people. And early days, the AI would create the 26 habits of highly effective people. Like it just didn’t know how to like, kind of pull the information out of the content. Right. And so we just looked at this and said, you know, AI is not going anywhere. Premium content publishers need better rails to be able to interface and interact with AI and

Pete Pachal (03:57.653)

Hmm.

Jonathan Woahn (04:24.79)

We’ve had to build a lot of technology to understand how to make this work and how to facilitate it and how to get AI to work well with it. And so we ended up launching Cashmere to be that infrastructure to help support publishers and building, making it so their content can be used with AI.

Pete Pachal (04:41.141)

Yeah, like what you mentioned there really resonates. It’s like the frustrating thing when you get into these AI systems and you expect it to do something semi-deterministic, like surely you can adhere to a character count. And in the same way, like surely you know what the number seven is. And it kind of doesn’t, you know, like it kind of does and kind of doesn’t. Now, to be fair, I think we’re both kind of talking about AI circa 2024, 2023 probably.

Jonathan Woahn (05:08.238)

Yeah. 2022. Much better. 100%.

Pete Pachal (05:08.789)

The systems that we use today are better at this. But still, it really hammers home that the AI interpretation alone is not the thing you should be 100% reliant on. Cleaning data and putting in guardrails, whether they are within prompts or in the systems themselves, just has to be a part of this process or it’s just not gonna be reliable.

Jonathan Woahn (05:38.382)

Correct. Yeah. Yeah. Especially as the publisher, like if you’re wanting the AI to represent the content in the way that you want it to be represented. Definitely. Yeah. It’s still.

Pete Pachal (05:49.065)

You got to have a voice here and that voice requires some knowledge about your data and treating yourself like a data company in many ways,

Jonathan Woahn (05:56.578)

Yep, 100%. Yep, that’s exactly right.

Pete Pachal (05:59.605)

Cool. So honestly, I’d love to just get right into it. Like the serious stuff I was talking about at the beginning. Like, I know you’re on LinkedIn and you’re writing about things going on, and there’s some good research out there from Matthew Goldstein, and you were writing about how the AI content marketplace is like already there, but the publishers are barely a part of it. Like, it’s not like they’re not there, but it’s like something like a 1.6 billion, I forget what the exact numbers are, versus something that is like less than a 10th of that size in terms of the actual licensing money sort of being exchanged.

So it’s like 14 to one, I think was the ratio you mentioned in terms of like the size of the market versus like the cuts publishers are actually getting in this market. So tell us, like, what’s going on? How did this happen? How did we get here where there’s like just mass scraping of content that’s being sold at scale and publishers are just like, I wouldn’t even say they’re like,

Jonathan Woahn (06:41.474)

Yeah.

Pete Pachal (06:59.017)

barely keeping up, they’re like drowning in this.

Jonathan Woahn (07:01.528)

Totally. And it’s interesting, since then, the number is actually even worse. It’s actually 20 to 1, based off some more recent information that we’ve looked at. I mean, look, we’ve all grown up in this world that Google has kind of curated for us for the last 25, 30 years, where it’s like we’ve been like, we’ve reached some degree of equilibrium where

Pete Pachal (07:08.02)

Wow.

Jonathan Woahn (07:27.672)

You know, we’ve been comfortable with playing Google’s game on SEO and trying to figure out how to get content, you know, how to get it ranked and how to get it discovered. And one of the things that has happened with the advent of AI has been the democratization of search. And so like, what I mean by that is, you know, I mean, Google was the default place that we would all go to search and look for information. I mean, it was just kind of like the first stop. And now what has happened is,

now we’ve got ChatGPT, we’ve got Gemini, we’ve got Copilot, we’ve got Claude, we’ve got Perplexity, and on and on. And some of these platforms are building their own indexes, and they’re scraping the internet on their own and building their own index so that they can serve their own platform. But what is happening more and more is that, you know, as you and I are standing up, you know, Claude or Copilot, or standing up our own agents to help with our own research or things that we’re doing on our own machines, like,

they need access to content. They need access to the internet. And so what has happened as a result of this is, you know, you used to be able to use Google’s API for search, Bing’s API for search. They have, to my knowledge, since shut both of them down. I remember when Bing did it two years ago, it was like this big deal. It was like, they saw this competitive asset that they had built, this internet-scale index, and said, we’re not going to make this available for everyone. We need to use this internally.

Pete Pachal (08:54.421)

Hmm.

Jonathan Woahn (08:54.53)

But then what that created was this huge vacuum for a lot of other players to come in and to start scraping the internet and to start building their own indices and to start selling, you know, the work that they’re doing. Cause I mean, candidly building a great crawler and building a great index, it’s not an easy problem to solve. And it’s not a cheap problem to solve. It’s, it’s very technically complicated and it’s very expensive. and so there are definitely opportunities for like economies of scale to come in and for some of these.

Pete Pachal (09:11.912)

Hmm.

Pete Pachal (09:17.012)

Hmm.

Jonathan Woahn (09:23.534)

crawling and scraping platforms to be able to provide solutions. And agents need access to that internet, that content. And so what’s happening now is you’re starting to see a lot of these guys that are popping up and selling content. And there’s like this race to the bottom on pricing because they’re trying to figure out how do we get, you know, how do we get as many agents using our search as possible? But the thing that they are not doing is they are not licensing that content from the publishers that they are scraping it from, and they are not sharing revenue with anybody.

Pete Pachal (09:53.459)

Yeah, you’re really zeroing in on like, kind of like the sore spot here. It’s really more of a vacuum of, I don’t know if it’s law or best practices or a few other things, but the way the internet evolved, this indexing, this act of indexing was treated as, you know, very benign in the sense that you have an index and that’s just going to help people find your content online. And that was all anyone ever thought that was ever going to be used for. And now

Pete Pachal (10:20.741)

you know, obviously that still is part of it, but now with this content layer, this interpretive layer of AI put on top of it, that is owned and operated by these players that aren’t publishers, you know, suddenly that bargain is gone, right? But the infrastructure still exists and it’s sort of treated as almost like a given that, yeah, indexing, let’s just do indexing. But it’s less about the infrastructure and more about, quote unquote, the outputs, whereas in the old days it was like links and now it’s actual content. And this is obviously where the contention is, and nothing’s really been decided, it feels like, as to what you can and can’t do. So everyone just defaults to whatever they can do. There’s no should.

Jonathan Woahn (10:59.618)

Yeah.

Jonathan Woahn (11:11.406)

Yeah. Yeah. I mean, there was a, you know, there was this, you know, social contract between Google and all every website where it was like, I’m going to scrape your content and I’m going to show blue links and I’m going to redirect traffic. We’re going to get you, we’re going to get people to your site. Right. And these days, I mean, there is, there is no contract. A lot of these scrapers are just scraping the content and they are serving it up directly to the agents. And.

They may include the source link, they might not, but there’s nothing that effectively demands that the agentic interface or where it’s using or referencing any of that content for inference, there’s nothing that says that they have to point back to the source or that they have to redirect people. So a lot of that social contract has just been totally upended through a lot of the way that the agents are accessing these indexes right now.

Pete Pachal (12:07.679)

So I know the report that we’re talking about from Matthew Goldstein, is great. It’s been sort of going around. But it makes a distinction between training and grounding. Does that distinction matter that much to publishers?

Jonathan Woahn (12:22.408)

Well, yeah, let’s, yeah, go ahead. Go ahead. Yes. The shorter answer is yes. Publishers should absolutely care about the difference between inference and grounding. And let me talk about, you know, let me define that. So, or, sorry, not inference and grounding. I mean, inference and training, or grounding and training, are kind of like the two terms, for people familiar with those.

Pete Pachal (12:25.459)

I guess we should define those terms.

No, you go ahead. You’re the expert,

Pete Pachal (12:50.057)

Right, sure.

Jonathan Woahn (12:52.056)

So training, the analogy that I’ve used in the past is like training is like as a student, you’re going to school, you’re learning your field of trade, an engineer, right? You go, you’re learning how the equations have and how they work, you’re paying to be there. And that’s like training of the model. It’s given all the information and learns how to do what it’s gonna do, right?

Inference is like, and grounding is like when you’re actually applying the skills. So now as an engineer, you graduate and then you go out and it’s like, okay, they’re going to hire me to build this machine. Now you’re actually applying what you’ve learned. And that’s like what inference does with these models, right? It’s like, now we’ve taken all the things that they’ve learned and like using it to generate output, output tokens back to, you know, respond to whatever the user request is. And in, in the world of publishing,

there are a lot of reasons why we need to understand the difference between these two and why it’s really important monetarily to grok the difference. On the training side, Rob Kelly has tracked a lot of this stuff around the training deals that have been taking place and the different deals between publishers. And a lot of the announced deals have largely been training deals. And so,

From a publisher perspective, it’s pretty easy because you just package your stuff, send it over and you’re getting paid, right? Like you just align on like what those terms are and now you can start making money from training. And you know, the big question around training is like, you know, where does this land from a copyright perspective? Like, is this transformative, is it something that’s going to be permitted or not? Right. And like the courts are starting to get some kind of clarity around that. The one that is very much not clear right now is inference.

And my perspective on this, this is one of the things I was talking with some publishers about yesterday, was it’s like, whether training comes down on the side of copyright infringement or not, I don’t know that it totally matters a lot just because the financial opportunity around inference is so much greater than I believe that the opportunity around training is.

Pete Pachal (15:06.729)

Yeah, exactly. Again, just to redefine so that everyone knows, like inference is like accessing stuff in real time. Cause training takes a long time. It’s very compute-intensive. I don’t know, it’s like every few months or so the models are retrained or something, maybe even longer than that. Obviously it’s still an important issue, but it’s like accessing like the news that happened today or even an hour ago.

Jonathan Woahn (15:28.61)

Yes.

Pete Pachal (15:30.537)

That’s the opportunity because people want that information. I mean, you know, there’s a whole, it’s called the media industry. It’s based around this. You need current information often up to the second, if you’re talking about market movements. Yeah.

Jonathan Woahn (15:42.2)

That’s exactly right. And you can’t train it on that stuff. It has to happen all at runtime. It has to be at grounding. And Pete, it’s beyond just the currency of this, right? But there’s also just the historic record. And so for media publishers, it is going to be about what are the breaking stories, what’s happening right now? And for, say, an academic publisher, the academic record is

Pete Pachal (15:55.797)

Hmm.

Jonathan Woahn (16:07.854)

perpetually changing and we’re learning things that change over time. so publishers need to have the ability to say, here’s what I want available. Here’s what I want discoverable. Here’s what I want in the public record versus like, this is no longer relevant. We need to be able to retract this. We need to be able to pull this out. And you cannot do that from a training perspective. You sell your content, it’s baked into the soup, baked into the cake, right?

Pete Pachal (16:25.333)

Mm-hmm.

Pete Pachal (16:30.085)

All right. Snippets don’t matter so much on the training. This is a really good point. I think early on, there was too much focus on the training because there just wasn’t a lot of inference, at least in terms of outputs. Now there’s a lot in terms of all the big AIs have some kind of search connected to them. And there’s a lot of third party systems, legit and shady, shall we say, involved.

Jonathan Woahn (16:32.897)

Right.

Pete Pachal (16:58.549)

So how much has that, damage is probably overstating it, but there’s kind of a bit of a distraction on training, in that, not that that’s been resolved, but its relevance to what’s actually going to be a sustainable future seems minimal, is what we’re talking about. Is that causing sort of an education problem among publishers almost?

Jonathan Woahn (17:23.63)

I do think so. The dollars, I was just in with a publisher earlier this week and they were just talking about the impact of some of the training deals they’ve done on their budget for the year. And the training deals are some of the things that have kind of helped to kind of close the gap on the work that they’re doing. And so it’s like, they look at these and see like there is real dollars here and it is making a real impact to their top and to their bottom line.

Pete Pachal (17:44.66)

Interesting.

Jonathan Woahn (17:54.146)

but it is episodic, it’s not predictable, the training market doesn’t have the signals of a scalable, sustainable market, right? It’s harder to see, and I think it does distract from the inference-based opportunities. And candidly, I think a lot of these scraper platforms are part of the reason why it’s getting distracted, because publishers are not seeing the dollars, because those dollars are being diverted to other companies that are not the publishers themselves.

Pete Pachal (18:28.883)

Is this a part of like why, it feels like the deals have kind of dried up? Like there was a bunch for a while there and a good deal of them were OpenAI, although some of the other ones weren’t necessarily as public. So again, I guess I’m asking you in terms of what you’re hearing, because you’re probably talking to tons of people all the time. But I do feel like the deals have gotten fewer and farther between. And I think maybe that’s partly a,

demand issue, if you know what I mean. So in the sense of like, um, we’ve been talking about these sort of scraper companies that have risen up in the last year and a half, two years. Um, and they’re, they’re in this legal gray area, certainly, but it’s rather than making a deal with a publisher, like maybe you could just go over there and it’s, you know, to this sort of gray market and get your stuff. And at the same time,

Jonathan Woahn (19:22.786)

get access to it.

Pete Pachal (19:26.621)

as we were just talking about, maybe the training data itself has also been devalued a bit, regardless of where you’re going for it. So it’s kind of like a mix of those things has that caused like less demand on that side. But basically, core question is why have the deals kind of dried up, assuming that’s what you’re hearing too.

Jonathan Woahn (19:43.614)

Well, I think the training deals are still happening. I know they’re still happening. I mean, we talk to publishers all the time about the deals that they’re doing right now. They’re smaller. They’re not as big dollar. It seems like they tend to be very, how do I put it? Kind of like topically focused. So it’s like, we’re building an agent that is focused on, you know,

Pete Pachal (19:46.954)

Okay.

Pete Pachal (20:05.192)

Okay.

Jonathan Woahn (20:11.724)

medical research and so we’re looking for journals that kind of address this particular domain, right? So it’s not, I feel like we’re definitely seeing a lot less of like the big headline ones that we’ve seen over the last couple of years, but they’re still happening. They’re definitely still happening. It’s just, I think they’re smaller and more targeted and people just aren’t being noisy about it like they were in the early days.

Pete Pachal (20:32.073)

Yeah, maybe it’s just not as much of a news event too. I guess there’s sort of a, on the news side of it. Once you land on the moon once, you know, it’s like everything becomes a little rote. So I want to tie this back to sort of what we were talking about at the beginning, right? Cause we started out talking about good licensing rails and clean data and having, you know, publisher defined systems for, you know, playing nice with legit operators. And because a lot of these

Jonathan Woahn (20:41.004)

Yeah, exactly.

Pete Pachal (21:02.109)

these gray market companies are presumably doing this, you know, I think it was even in Matthew’s research. He’s sort of like, it can be messy, right? You know, you gotta get past paywalls and there’s weird stuff when you’re scraping that gets hoovered up as well. And, you know, just generally it’s not as reliable as something that is more legit. Does that provide some hope? You know what I mean? Is that like an opening for better data, cleaner systems, you know, with publisher MCPs and stuff like that? That the demand for that’s just going to naturally rise? Or I don’t know, maybe that’s a little too hopeful.

Jonathan Woahn (21:40.174)

Well, I…

Let me see, I’m trying to think of like how to best answer this question because there’s like some assumptions that we’re kind of building on that I kind of like, I think we need to take a step back and kind of question here, which is like, like one of the questions is like, why are people using, why are these platforms growing? And like, what is it that people are using them for? And like part of it is just because the incumbent alternatives have closed their doors. And so now there’s opportunity, right?

Pete Pachal (21:51.359)

Sure, take them apart.

Jonathan Woahn (22:12.088)

But we’ve, I mean, we’ve started talking with a number of people on the buy side and people who are looking to get access to content legitimately, like they recognize that like candidly, there is some risk that is associated with getting content through some of the gray, you know, the gray platforms. And so there are people who are looking to get legitimate access. And part of the challenge that they face right now is just like, how do you do that? Like, how do you get access to it?

And, um, and I think, I think there’s this, I think there’s this opportunity right now to raise awareness around like the source of where your content’s coming from and is it legitimate and is it creating a sustainable market? And, and so like, just as an example, uh, or an analogy, you know, 20 years ago, going to the grocery store was you just like, I just had to make sure I have food to put on the table.

Like I don’t really care where it comes from. I just need food, right? And what’s been, I think, very positive is in the last 10 years is people have become much more conscious of like, where’s my food coming from? Is it organic? How was it grown? Is this a Georgia peach or is this like a California peach? It’s like, people care a lot more about this. And I think that’s a good thing. And the same thing needs to, sorry.

Pete Pachal (23:33.174)

Yeah, I think it. No, I was just going to say like, there’s a, I don’t know if it’s a sophistication that’s created by supply side, you know, like just having like a lot of supply. I don’t know. I’m not a market genius, so tell me, like, is this an indicator of like an evolving market? And I guess that’s what my question had to do with, like, how do we

How is this going to evolve as informational systems get more sophisticated?

Jonathan Woahn (24:01.901)

Yes.

Jonathan Woahn (24:06.828)

And I think that’s my point here. Like, I think people need to start thinking about, like, the term I’ve used is ethically sourced data. Like is this content that I’m working with ethically sourced and is it sustainable? And like, if I’m buying content from a scraper, and it’s not going back to the publisher, how in the world is that publisher even going to be able to continue to create content? Cause if the publisher can’t create content, the scraper has nobody to scrape from. And then you can’t, they have nothing to buy from. And so I think, I think.

Pete Pachal (24:14.069)

All right. Good term.

Jonathan Woahn (24:36.002)

the first step is like, there’s gotta be this awareness on the buy side, particularly around like, where’s my content coming from? And like, how is this creating a sustainable ecosystem? Because I think as that awareness rises, then the question starts to become, okay, well, let’s assume that I actually wanna get my content ethically and I wanna get ethically sourced data, where do I go to do that? And there’s…

not a lot of options at this point in how to go about doing that.

Pete Pachal (25:07.509)

Well, that brings me to like what Cashmere does, right? So can you give me a little bit more on like exactly what your role is in creating? Like, are you someone who cleans up data, creates systems, creates a marketplace? Like what is Cashmere’s role in all this?

Jonathan Woahn (25:24.268)

Yeah. So our role where we started was ingesting publisher content, getting it ready for AI, structuring it in a way that like the AI can now know what the seven habits are instead of 27 habits. And helping to get publisher content cleaned up and ready for use with AI, right? It’s messy. We help clean it up. So that’s the first part. Now, once it’s in place like that, the second piece was how do you manage deployment of that content? How do you get people access to it? How do you help them?

get visibility to it. How do you manage security, authentication, entitlements? Like there’s a lot of kind of moving pieces to like, how do you make that content actually, you know, be consumable? And so what we’ve done is we’ve built all the infrastructure pieces to help connect publisher content with agentic systems. And so our focus has historically just been on like just building that infrastructure and then helping our publishing partners pursue.

the AI opportunities they wanna pursue. What has been interesting though, is as our publisher base has continued to grow, as we’ve started to get more contact, more use cases, more applications, we’ve started to get a lot more inbound interest from people who are looking to ethically source their content and their data on the buy side. And so we hadn’t done a lot of work on this side because we’ve been very focused on supporting the publishers.

Pete Pachal (26:41.269)

So on the buy side.

Jonathan Woahn (26:50.562)

But now as these opportunities are starting to come in, we are starting to think about, what can we do to help facilitate and expedite getting these applications access to ethically sourced content? So we’ve started to do a lot more work on this front. And we’ve got, I don’t know, 15 of these opportunities that we’re running right now. I mean, it’s been really fun, to be honest.

Pete Pachal (27:17.205)

And it’s kind of like you’re learning a lot and doing these kind of, I would imagine, content to content buyer sort of handshakes, I guess, from a technical and commercial aspect. And at what point does that become a marketplace or does this become something that is scalable? Like, is that the vision? Maybe it’s a little too early to tell on your side since this isn’t quite the direction you thought you were going. But if that is the case, like,

Like, who do you end up competing with in that space? I know other companies have sort of tried this, but no one’s done it, done it, if you know what I mean.

Jonathan Woahn (27:56.802)

Yeah, it’s, I mean, Pete, starting a marketplace is really, really hard. And you’ve got to, like, I mean, I think there’s a marketing component to it. There’s a technical component to it. There’s a lot of luck, I think, that sits in being in the right place at the right time with the right people. And so I think there’s a version or a future where what we’re doing with Cashmere could become a marketplace. But at the moment, what it feels more like we’re doing at

Pete Pachal (28:02.74)

Mm-hmm.

Jonathan Woahn (28:25.878)

like, this very instant, is it feels more like we’re brokering relationships, right? Like, we’ve got this super powerful platform, this super powerful technology that sits under the hood of what we’ve built. And now we’re just trying to help publishers monetize their content. That is our goal, right? And if we can help do that by saying, Hey, we’ve got people who want your content. And what’s great is now, you know, we have someone that comes to us and says,

As an example, we were talking to a South Korean hardware manufacturer who, not South Korean, South Asia, a hardware manufacturer who has like a Siri competitor. And they’re looking for access at inference time to news content. And they’re looking for kind of like, you know, some kind of lifestyle type content. And so we’ve got a handful of publishers who we’re working with who do this. And so now what we can say is like, well, publisher, we were working on this other thing with you to start with.

But now we’ve had someone coming and asking us for this. Is this something you would be interested in getting involved in? And so now we’re able to start taking this network of publishers that we have and start bringing them together to bring that content to the application side. So it does feel a lot more like brokering arrangements right now than like a marketplace, but, you know, maybe at some point it could turn into that. But that’s not what we set out to build at the moment.

Pete Pachal (29:51.446)

All right. Yeah, the future is long. Let me ask you like, I guess, probably a pretty important question, which is like, what does fair pricing look like in an AI content marketplace? What are the factors? Obviously it’s gonna be different depending on the content and the people involved in the brokering, but like, again, what are the dimensions that would govern pricing?

Jonathan Woahn (29:54.734)

The future is long. Yeah.

Jonathan Woahn (30:16.012)

Yeah, this is a fantastic question. I’m actually working on writing up some content around this right now, because I’ve been doing a lot of research on this.

I think there’s fundamentally two factors to think about. The first is the intrinsic value of the content itself. And then the second is the particular use case of how that content is being deployed. And so like, if you kind of break each of those apart, like at a high level, on the content side, the intrinsic value, you know, there are different categories of content that we can look at. So,

Content that is from user generated content, like something you might find on Reddit doesn’t have the same intrinsic value as like a market research report that one of the sell side brokerage firms has put two years of research into, right? Like the value on a per token basis is not like dollars to dollars. Did you have a question? Sorry, I was just looking at.

Pete Pachal (31:17.427)

No, no, no. I’m just following along. I’m right with you. It’s not dollars to dollars.

Jonathan Woahn (31:21.194)

Okay. So there’s these different kinds of categories of content, right? So you’ve got, like, open web content. You might have news, you might have lifestyle. You might have books. You might have market research, market intelligence, scholarly research, right? Like, there’s different vertical kind of categories of content, and the value depends on each of those. And then even within that, there’s a difference in how you think about front shelf content versus back shelf content, right? The stuff that is driving all the attention to your website. And then once you get them in, then you’ve got this other stuff that you might be able to perform to keep them there. So there are some metrics and, like, I can get pretty deep on this if it’s helpful, but like.

Pete Pachal (32:06.815)

So far, I’ll let you know when we want to get back to the surface. But go, this is great.

Jonathan Woahn (32:10.486)

Okay. So there’s this intrinsic value of the content itself. That is one of the factors in how to think about pricing. So then if we kind of look at the second factor, which is like the use case, there’s also use cases that dictate the value of that particular content. And so when I say use case, it’s like, how is the AI wanting, what are they wanting to do with that content? And the analogy,

Pete Pachal (32:35.764)

Right.

Jonathan Woahn (32:39.054)

kind of, that I’ve drawn for this has come from the music industry. And so if you go to ASCAP, you can go to their website and literally, like, right on the homepage, it’s like, what is your use case? Like, what are you using the music for? And within that use case, you can say, I have a restaurant and I want to, you know, have music in my restaurant. So then you click on this and it shows you, it’s like, well, are you a karaoke bar?

Are you wanting to play it in the elevator? Are you wanting to play it as background music? Like, what is your actual use case? And then depending on that use case, the way that they structure and price that license looks very different, right? Like, it’s going to cost more if the music is much more of a focus of what you’re doing versus if it’s just background music. And so in AI, we can think of it very similarly to say, well, what is the use case that you’re wanting to use this content for? And.

If you’re just doing, like, a simple RAG chat application, the value of that content is probably pretty substitutable, you know, if it’s just like a generic search engine. But if you’re doing a verticalized search engine where now you’re focusing on, as an example, OpenEvidence, I’m not sure if you’re familiar with this, but OpenEvidence is pulling in all of this academic research.

Now what you’re doing is you’re providing a use case that has a very specific audience with a very specific need and a very specific kind of outcome they’re looking to drive. And so, you know, what you charge for a ChatGPT at $20 a month and what you get access to there is very different than what you charge, you know, $200 a month for, like a deep research agent that’s very vertically specialized, right? And so you’ve got to kind of look at these two things to say, what is the intrinsic value of the content and what is the particular use case?

Pete Pachal (34:23.165)

Hmm.

Jonathan Woahn (34:30.7)

And that gives you an idea of what the value is for that content, for that particular use case, and how you think about pricing.
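
The two-factor framing in this part of the conversation can be sketched in a few lines of Python. This is only an illustration, not Cashmere's pricing model: the category and use-case names loosely follow the discussion, every rate and multiplier is a made-up placeholder, and multiplying the two factors together is an assumption about how they might combine.

```python
# Hypothetical sketch of two-factor content pricing.
# Every number below is an illustrative placeholder, not a real market rate.

# Factor 1: intrinsic value of the content, as a base rate per 1,000 tokens.
BASE_RATE_PER_1K_TOKENS = {
    "user_generated": 0.001,
    "news": 0.01,
    "market_research": 0.50,
}

# Factor 2: the use case, as a multiplier on the base rate.
USE_CASE_MULTIPLIER = {
    "generic_rag_chat": 1.0,
    "vertical_search": 3.0,
    "deep_research_agent": 10.0,
}


def price_quote(category: str, use_case: str, tokens: int) -> float:
    """Return an illustrative price for `tokens` tokens of content."""
    base = BASE_RATE_PER_1K_TOKENS[category]
    multiplier = USE_CASE_MULTIPLIER[use_case]
    return base * multiplier * tokens / 1000


if __name__ == "__main__":
    cheap = price_quote("user_generated", "generic_rag_chat", 2000)
    premium = price_quote("market_research", "deep_research_agent", 2000)
    print(f"user-generated content in generic chat: ${cheap:.4f}")
    print(f"market research in a deep research agent: ${premium:.2f}")
```

The point of the sketch is only that the same number of tokens can be priced very differently depending on both what the content is and what it is used for.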

Pete Pachal (34:38.239)

Got it. So a lot of it matters just in terms of like the outputs. It feels like the outputs is always kind of like a big determining factor in all this.

Jonathan Woahn (34:46.264)

Correct.

Pete Pachal (34:48.767)

Cool, okay, so we talked a little bit once just over email about the agent layer for publishers, which I think is an interesting way to sort of think about all this. And what basically, I’m curious what you think about what good product strategy, I guess, looks like for a publisher as they think about that agent layer.

And I know we’ve sort of like touched on a lot of this in terms of just clean data and et cetera, but how do you make sure it’s something well-designed that’s gonna both serve you, the people you broker with, but also the ultimate users, right? Whether they’re readers or analysts or what have you.

Jonathan Woahn (35:35.724)

Yeah. So is the question here just like, are, as you think about designing, like if I’m a publisher and how I’m designing, like what am I doing from a product perspective?

Pete Pachal (35:46.666)

Yeah, because it’s like, I feel like when we talk about AI, most people default to, it’s a chatbot, right? Like, you have a chatbot in front of a public site. And honestly, that’s kind of what a lot of experiences are now from a user perspective. But, like, obviously having an MCP and that data available just from a prompt, whether you deliberately ask for the data from this publisher or it’s just implied,

Jonathan Woahn (35:51.374)

Mm-hmm.

Pete Pachal (36:16.629)

sort of gets it. You know, I just feel like I’m not sure maybe if this is not, I don’t know if this is more incumbent on the publisher or the person using it, but I guess if you are the publisher, like how are you thinking about that? And I guess it can vary widely depending on what the person is doing with it, but how do you make sure that it’s all those things are available?

Jonathan Woahn (36:36.46)

Yeah, it’s,

Pete Pachal (36:40.373)

Like what are the top three things you would recommend someone to get started with before you even get in there?

Jonathan Woahn (36:49.166)

Yeah. Well, I think the first question is kind of, I mean, it’s kind of like, what is your AI strategy here? And do you want to host your own infrastructure and do you want to host your own content? And then the question that I have that falls behind that is, like, why? What is it you’re wanting to do with it? And so then, you know, if we go back to that first question of, you know, what is your AI strategy? Um,

Pete Pachal (36:58.099)

Right.

Pete Pachal (37:09.235)

Yeah.

Jonathan Woahn (37:19.224)

Your strategy might just be, we’re going to license our content on a consumption basis and we’re going to, you know, deploy it through our website and use some of the kind of gateway platform infrastructure that currently exists. Right. And so in that case, if that’s your strategy, then a lot of that comes down to making sure that you are structuring your content in such a way that makes it easier for the agents to consume it and understand, you know, how your content is actually structured.

So you hear a lot about, you know, people talk about having, you know, there’s like the HTML version that people see and then you can have like a markdown version that people don’t see, this is behind the scenes, that an agent might get access to through your website.
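
One way to picture the HTML-versus-markdown idea is HTTP content negotiation, where the same URL returns a different representation depending on the request's Accept header. The Python sketch below is a simplified illustration under stated assumptions: `pick_representation` is a hypothetical helper, it ignores quality values, and whether a given agent actually asks for `text/markdown` is not something the conversation confirms. A publisher could just as well expose the markdown at a separate URL.

```python
def pick_representation(accept_header: str) -> str:
    """Return 'markdown' if the client lists text/markdown, otherwise 'html'.

    Hypothetical helper: quality values (q=...) are ignored for simplicity.
    """
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    return "markdown" if "text/markdown" in media_types else "html"


if __name__ == "__main__":
    # A browser-style request gets the HTML page people see.
    print(pick_representation("text/html,application/xhtml+xml"))
    # An agent that asks for markdown gets the behind-the-scenes version.
    print(pick_representation("text/markdown, text/html;q=0.8"))
```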

Pete Pachal (38:00.822)

Right.

Pete Pachal (38:06.803)

Yeah, totally. No, no, no, I think it is. But I think it’s sort of like thinking of, as a publisher thinks about playing in this world of AI, you know, a lot of it has to do with defense. And that’s kind of sort of trying to get not quite offense, but like, I want to have a presence in this legit, you know, quasi marketplace or forming marketplace.

Jonathan Woahn (38:10.264)

I don’t know if I’m answering your question, so I’m like.

Pete Pachal (38:33.523)

What does that look like in terms of both approach and execution? That’s kind of what I was thinking about. But I also, you know, like, I haven’t seen too many pages, or maybe these exist and I just haven’t seen them, but, like, a page on a publisher, it’s like, Hey, if you’re interested in our agent and our MCP, you know, call us, or, you know, put them in this application, et cetera.

Jonathan Woahn (38:55.63)

Yeah, here’s how you get legit access to this.

Pete Pachal (38:57.769)

You know, yeah, like kind of just putting it out there. Maybe there should be more of that. I don’t know. I haven’t seen that. Maybe that exists. You tell me.

Jonathan Woahn (39:09.356)

Well, I think it does exist. I mean, this is like what TollBit is doing. Like, anybody who’s using TollBit, you get an MCP server by default by hooking up with them, right? I think what happens on the other side of this, that’s a little bit challenging though, is just around the discovery layer, if you are a legit buyer who’s wanting to access this content. So just as an example, there’s a consulting group that we’re talking to who wants, again, legitimate access to high quality content.

Pete Pachal (39:11.881)

Yeah. True.

Jonathan Woahn (39:39.306)

They don’t want to buy it through the scrapers, but they have hundreds of sources of content that they’re wanting to get. And so you can imagine that, you know, if they want to get it from all of these different publishers, what are they going to do? Go around to every single one of them individually and sign up for it and integrate it? Because now you’ve got hundreds of data processing agreements you’ve got to go through. You’ve got to make sure they all have information security covered. You’ve got to figure out, like, what is the pricing going to look like? And to your point, I don’t see a lot of websites, a lot of these publishers, that are saying, Hey, you know, we might have our MCP, but here’s how our content is, here’s how you get access to it. Right. Like, there starts to become this kind of fan out, you know, power scale problem here, where if you want legitimate access, how do you actually go about doing this? And you’re right, as it is today, there’s not an easy way to go about solving this.

Pete Pachal (40:41.939)

Yeah, it just seems like it’s definitely a manual kind of roadblock, or at least a manual process right now that tends to be a roadblock in many cases anyway. This has been great. Just to wrap up, I always try to end on the same question, which is that you’re looking, projecting forward. There’s got to be things you’re worried about and hopeful about. Pick one from each of those columns and give it to me in any order.

Jonathan Woahn (41:09.198)

I’ll start with what I’m worried about. I think the thing that I’m worried about is just, I worry that the kind of incumbent experience is going to be severely negatively impactful for publishers. And what I mean by this is like, you know, I like, there’s a lot of analogies and parallels we can pull from the music industry, but

You know, when Steve Jobs came out and said, we’re going to sell a single track for 99 cents and here’s how the economics are going to work, and here’s, you know, how we’re going to deploy it, it reshaped the industry to kind of think in a different way. And I think it expedited the ability to adopt a new business model. And what I worry about with AI and with a lot of publishers is just, you know, there’s a lot of incumbent ways that we’ve done business historically. And there’s not a really clean, unified way that says, Hey, this is how it works with AI.

And I think what ends up happening in that instance is I think the buyers end up dictating a lot of the terms and then publishers end up becoming price takers. And that worries me because I think we have an opportunity here as publishers to effectively reset the table and say, this is how we want it to work and this is how it’s going to be sustainable for us and for you going forward into the future. So that’s what I worry about. What I am hopeful about is…

We continue to see that AI needs access to fantastic content. And there’s nobody better at creating that fantastic content than publishers; they’ve proven that over decades and centuries. And there’s a lot of publishers who are really keen and really looking forward on trying to figure out how to do this. And so it’s very encouraging to be working with some of those, to see the strides that they’re taking and how forward looking they are on this. And it gives me hope. It gives me hope that,

If we can help to show that these use cases work and help to provide some case studies that like this is sustainable and this is a big opportunity, my hope is that it will address my worry and it will bring the rest of the publishers along.

Pete Pachal (43:15.219)

Nice. It’s quite a vision. We’ll leave it there. Jonathan, thanks so much for stopping by and sharing your thoughts.

Jonathan Woahn (43:21.752)

Thanks Pete for having me, appreciate it.
