How AI Made Data Privacy Everybody’s Business

Credit: DALL-E

If there’s a single topic within AI that everybody has concerns about, it’s data privacy. It comes up in every class I teach, every casual conversation about how AI is affecting the media business, and almost all the interviews I have with decision-makers at media companies. The fear of losing control of your data, and the outrage toward tech companies who act as if they are entitled to take it, is palpable.

A recent article in The New York Times has renewed fears about data privacy in the age of AI, pointing to a set of recent changes in the Terms of Service for various software products, including those from Google, Snap and Meta. In each case, the company altered its language to include provisions for leveraging user data to help power or train AI systems.

While most of the companies who’ve done this would no doubt prefer the changes were simply treated as incidental, users have not responded with nonchalance. Customers of Adobe, for instance, openly revolted when the company quietly altered its terms of service a few months ago, and executives had to do multiple rounds of damage control. AI, it seems, has everyone on edge.

Tech companies have given users good reason to fear. Another New York Times report from earlier this year laid out how both OpenAI and Google sought to deliberately ignore YouTube’s terms of service by training their AI models on YouTube videos (yes, Google owns YouTube, greatly complicating the matter). Even before that, a detailed report from IEEE Spectrum proved that popular AI image generator Midjourney was trained on copyrighted content, including images from Marvel movies.

Generally, AI companies have hoovered up the majority of public data on the internet to power their models, without clarity on whether that was in any sense OK. Several lawsuits are now before the courts that may help chart a path to a definitive answer to that question.

As the recent Times piece points out, many of the big tech companies, Meta and Google in particular, don’t just host public data; they’re also sitting on mountains of private data: information that users don’t share publicly. With virtually no more public data left to train on, the builders of these AIs would find this private data immensely valuable.

Are these alterations to terms of service a precursor to some kind of retroactive harvesting of that private data to train AI? There isn’t evidence of that, but given Silicon Valley’s record, you can see why people might be concerned.

If you are, what should you do? How can you adjust your approach to AI in a way that maximizes your data privacy?

Breaking Down the Privacy Problem

To answer that question, it’s helpful to unpack why people find the practice of data harvesting so objectionable in the first place. By doing that, we can better find ways to address specific concerns. For many aspects of this, there aren’t easy solutions. But there are ways to adjust thinking and approach to put yourself in the best possible position.

As I see it, concerns about data privacy tend to fall into two buckets:

Exploitation: “You’re taking my data and leveraging or monetizing it without giving me anything in return.”

Control: “By granting access to my data, I no longer have control of it.”

Let’s address each of these in turn.


When ChatGPT exploded into existence in late 2022, we were all so blown away by what it could do that few at the time stopped to think about the training data needed to create that experience. The gigantic data sets from Common Crawl et al. had been used, essentially for free, by search engines for years, and this seemed like a logical extension of that norm.

But as time has gone on and we have clarity on how “answer engines” like Perplexity and Google AI Overviews work, public attitudes have shifted. There’s now a general consensus that information sources should be compensated for the information they provide — a recent poll from a think tank called the Artificial Intelligence Policy Institute showed that 74% of respondents said, “AI companies should compensate creators for using their data.”

We’re seeing this shift play out in the business world as OpenAI and others have begun to sign deals with publishers like News Corp and Axel Springer as well as platforms like Reddit to give LLMs access to their content. In the meantime, various challenges are slowly making their way through the courts in the hope of getting a final ruling from a legal perspective.

Today, any site that wants to guard against tech companies harvesting training data can set their site’s preferences (the robots.txt file) to forbid the practice. There are also ways to tag your content at the article level, giving you more control over how those articles are crawled and used. Intaglio, co-created by Media Copilot co-founder John Biggs, is such a solution.
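For readers who manage a site, here is a minimal sketch of what that robots.txt approach can look like, and how to check what a given set of rules allows. The crawler names (GPTBot, CCBot, Google-Extended) are examples of published AI-related crawler tokens, the URL is a placeholder, and the helper name crawler_access is made up for illustration; check each vendor’s current documentation before relying on any of it. Keep in mind that robots.txt is a request that well-behaved crawlers honor, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Example rules: ask a few AI-related crawlers to stay out, allow everyone else.
# The user-agent tokens are illustrative; confirm them against vendor docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt: str, user_agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether these rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


if __name__ == "__main__":
    # Placeholder URL; substitute a page from your own site.
    page = "https://example.com/articles/some-story"
    agents = ["GPTBot", "CCBot", "Google-Extended", "Googlebot"]
    for agent, allowed in crawler_access(ROBOTS_TXT, agents, page).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these rules, the three AI-related tokens are asked to stay out while an ordinary search crawler is still allowed, which is the trade-off most publishers are after.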

Adding insult to injury, AI systems typically don’t just train on content for profit — their output also acts as a replacement for the content for many users. While people often have a visceral reaction to this reality, it mostly adds a dimension of urgency to resolving the situation in the legal and regulatory realms.


When you interact with a chatbot like ChatGPT, the requests, documents, and other data you feed into it will generally be used as training data. What that means is there’s a chance that, at some point in the future, another user might be able to coax some or all of that information from the chatbot just by asking.

This obviously means you should not feed sensitive or non-public information into a chatbot. If you want to use an LLM privately, you should use the APIs that AI companies provide, which don’t keep the data for training, according to those same companies.
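One practical habit is to scrub obvious identifiers from text before pasting it into any chatbot. The sketch below is only an illustration: the patterns and the redact helper are hypothetical, cover just email addresses and phone-like numbers, and will miss plenty of sensitive material (names, addresses, confidential business details). It is no substitute for simply keeping non-public information out of the prompt.

```python
import re

# Deliberately simple, illustrative patterns; they will not catch everything.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    prompt = "Contact jane.doe@example.com or 212-555-0147 about the draft."
    print(redact(prompt))
```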

However, given the track record of Big Tech and the strong tendencies of AI builders to hoover up data however and whenever they can, there are some companies that forbid the use of commercial AI completely — even through an API. That’s pretty extreme, but it doesn’t mean they have to cut themselves off from LLMs: You can still run AI locally, on a server or private cloud.

The Silver Lining

The lesson in all this for everyone, even individuals, is awareness. It’d be unrealistic for most digital citizens to simply “opt out” of using AI or digital platforms that want to harvest data. But an app called Tosless can help you better understand what you’re agreeing to when you do. Feed it any Terms of Service agreement, and it’ll tell you which parts are most concerning. It may not defuse any privacy land mines, but at least it’ll let you know when you’re about to step on them.

There’s a silver lining in all this consternation about privacy: Given AI’s insatiable need for more data, the value of everyone’s information has effectively gone up. But for a fair economy to arise around that reality, the owners of that information need to be aware of that value. Over the last decade, Big Tech has done its best to convince us that exchanging free services for limitless data harvesting was a fair bargain. Ironically, it could be their most innovative creation — AI — that gets us to finally push back on that idea.

The Media Copilot is a reader-supported publication. To receive new posts and support our work, consider becoming a free or paid subscriber.
