Models All the Way Down

Image via Knowing Machines

This newsletter will be a little shorter today simply because I want you to head over to KnowingMachines.org to check out Models All the Way Down, a very clever and thorough examination of popular image generation training sets.

The results aren’t pretty.

The TL;DR is simple: image training sets are inherently biased, poorly labeled, and contain porn and Child Sexual Abuse Material (CSAM).

“If you want to make a really big AI model — the kind that can generate images or do your homework, or build this website, or fake a moon landing — you start by finding a really big training set,” write the authors. “Images and words, harvested by the billions from the internet, material to build the world that your AI model will reflect back to you.”

“What this training set contains is extremely important. More than any other thing, it will influence what your model can do and how well it does it,” they write.

The article goes on to describe how these training sets are built and what they typically contain. The images come largely from sites like Pinterest and Shopify, along with a huge crawl of the open web. Models train on each image paired with the ALT tag text associated with it, which means an image of a woman in sunglasses ends up described as “Ray Ban Wayfarers Cheap Free Shipping” instead of what it actually shows. It also means most of the training data is in English.
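To make that concrete, here is a minimal sketch (not the Knowing Machines authors' code) of what a LAION-style training pair amounts to: an image URL plus whatever ALT text the crawler found next to it. The field names and sample rows are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class TrainingPair:
    image_url: str  # where the crawler found the image
    caption: str    # the page's ALT text, taken as-is as the "description"


# Whatever the ALT tag said becomes the caption the model learns from,
# so SEO spam or filenames are treated as faithful descriptions.
rows = [
    TrainingPair("https://example.com/sunglasses.jpg",
                 "Ray Ban Wayfarers Cheap Free Shipping"),
    TrainingPair("https://example.com/dog.jpg",
                 "IMG_0231"),
]

for pair in rows:
    print(f"model sees: {pair.image_url} described as {pair.caption!r}")
```

Nothing in that pipeline checks whether the caption actually describes the image, which is exactly the problem the article documents.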

Further, because these training sets trawl the Internet like baleen whales, they can contain almost anything. That means they contain material nobody wants to see. What that means from a legal standpoint isn’t clear, but it is dangerous territory either way.

I encourage you to check out the whole thing. The creators of the corpus have said that it isn’t for commercial use and, importantly, that they aren’t responsible for its contents. In short, one of the most widely used collections of image-generation training data in the world is biased, contains lots of poorly classified porn or worse, and is still treated as the gold standard.

And people wonder why we say that GenAI is catastrophic if not used thoughtfully.

Anyway, check it out.


