Artificial Intelligence tools like Midjourney and Stable Diffusion use machine learning techniques to synthesize new images from users’ text inputs. They are trained to do this by analyzing billions of text captions and their associated images from the internet.
AI neural networks can assign values to that training data for a wide variety of criteria; they may assign an image or text string a “cuteness” value, for example, or “animalness” or “landscapeness” values. They also assign values for aesthetic appeal. Using these values, they infer additional latent values for new criteria on the fly. Much like X, Y, and Z values define a position in three-dimensional physical space, these values define the reference images’ and users’ text prompts’ positions within a hyperdimensional virtual space, which in machine learning parlance is referred to as “latent space.”
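To make the coordinate analogy concrete, here is a toy sketch in Python. The criteria, scores, and three-dimensional space are invented for illustration only; real models learn hundreds of dimensions whose meanings are not hand-labeled like this:

```python
import math

# Toy "latent space": each item is scored along three invented criteria.
# The values are made up purely to illustrate the coordinate analogy.
items = {
    "photo_of_kitten.jpg":      {"cuteness": 0.90, "animalness": 0.95, "landscapeness": 0.05},
    "misty_valley.jpg":         {"cuteness": 0.20, "animalness": 0.10, "landscapeness": 0.90},
    "prompt: 'adorable puppy'": {"cuteness": 0.95, "animalness": 0.90, "landscapeness": 0.05},
}

def as_vector(scores):
    # Fix an ordering of criteria so every item becomes a point (x, y, z).
    return [scores["cuteness"], scores["animalness"], scores["landscapeness"]]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

prompt = as_vector(items["prompt: 'adorable puppy'"])
for name in ("photo_of_kitten.jpg", "misty_valley.jpg"):
    sim = cosine_similarity(prompt, as_vector(items[name]))
    print(f"{name}: similarity {sim:.2f}")
# The kitten photo lands near the prompt; the landscape sits far away from it.
```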
Midjourney users typically use natural language text prompts to describe the subjects, action, materials, mood, lighting, style, and atmosphere they want in their images. The Midjourney AI finds a region in latent space defined by the text prompt. It then generates an image of pure random noise and alters it, step by step, to conform to the visual properties of the reference images dominating that region. As a creative imaging tool, that’s the direction the system is meant to follow: using text criteria to ping the latent space for appropriate image properties and generating a new image that relates to the user’s text.
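That text-to-image loop can be caricatured in a few lines of code. This is only a schematic sketch of the idea, not Midjourney’s actual pipeline or real diffusion math: it starts from pure noise and repeatedly nudges the result toward the region of a toy latent space that the prompt points to:

```python
import random

random.seed(0)

# Toy stand-ins: the "prompt embedding" and the "image" are just short vectors.
prompt_target = [0.8, 0.1, 0.6]                       # where the text prompt lands in latent space
image = [random.gauss(0, 1) for _ in prompt_target]   # start from pure random noise

steps = 50
for t in range(steps):
    pull = (t + 1) / steps          # later steps commit harder to the prompt's region
    noise = (1 - pull) * 0.1        # and reintroduce less and less randomness
    image = [
        (1 - pull) * value + pull * target + random.gauss(0, 1) * noise
        for value, target in zip(image, prompt_target)
    ]

print([round(v, 2) for v in image])  # ends up near the prompt's region of the toy space
```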
What happens if we turn the latent space inside out and explore its image references in reverse? What if instead of using language prompts, we experiment with terms that may directly point to interesting regions of the latent space, or point directly to individual reference images that Midjourney has been trained on? What if instead of language, we use random photograph filenames as text prompts?
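Midjourney itself is a closed system, but Stable Diffusion, which was trained on the same LAION data, is open and freely available, so the filename-as-prompt experiment can be approximated at home. Here is a minimal sketch using Hugging Face’s diffusers library; the model checkpoint and settings are ordinary public defaults, not anything used for the images in this article:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an open text-to-image model (one common public checkpoint; others would do).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Use a bare camera-style filename as the entire prompt, as in the experiments below.
prompt = "DSC007654.JPG"
for i in range(4):
    image = pipe(prompt).images[0]
    image.save(f"filename_prompt_{i}.png")
```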
ONE FILENAME, A VARIETY OF SYNTHESIZED IMAGES
The random photograph filename DSC007654.JPG was the only text instruction used for these images:
The resulting images are a visual and thematic synthesis of the subjects, styles, compositions, moods, and color palettes from images in Midjourney’s training data that are somehow associated with that random photo filename, or adjacent to it in the latent space. There may be one reference image by that name in Midjourney’s training set or hundreds. There may be no filename matches at all, only images that have that filename—or something close to it—embedded in their metadata.
Whatever associations are driving the AI, in the latent space’s DSC007654.JPG region, random noise resolves to figures, streetlights, and trees in foggy twilight; a shrouded figure; close-up nature imagery; landscapes at sunset; and more people and mist.
SYNTHESIZING IN RAW
Does the RAW file extension take us to a region of latent space that draws more heavily on images by professional photographers, who tend to use the RAW file format? DSC00563.RAW gives us a fiery forest floor and twisted tree limbs and bodies:
DSC00564.RAW resolves to trees, fabric, and tangled masses and fields and plains, all red and orange, as well as people and amorphous body parts:
DSC057436.RAW drifts from fused and interlocking human forms to smoke-filled skies, vague flameness, and trees. DSC057437.RAW transitions from colorful leaf-like shapes to bodies and bare branches against red skies:
EXPLORING AND EXPERIMENTING
DSC00764.JPG produces people, misty autumn trees, organic folds, and electric, stormy skies. Note the pseudo signatures and watermarks. Some compositions synthesized in this region of latent space seem to need to signal their originality and quality—and to convey a sense of authorship:
DSC34077.JPG gives us sleek moody interiors and fleeting, mysterious figures; photoreal and painterly landscapes; and humanoid masses:
Any intuition that the synthesized images have been merely copied from pre-existing reference images cannot account for the following syntheses; each of these images’ prompts combined multiple filenames, for example DSC0368.jpg_DSC04873.jpg_DSC04537.jpg:
VERSION 4 ALGORITHM
All the examples above were generated with Midjourney’s Version 3 algorithm. The images below were generated with the prompt DSC006758.JPG in Version 4:
These Version 4 results are not at all like the preceding Version 3 examples. They look more like computer renders, or highly stylized illustrations, or heavily photoshopped images than photographs. Version 4 appears to have a strong inherent style influence that dominates the results, even with a random photo filename as the only text prompt.
Because these Version 4 results are not photographs, we know that they are almost certainly not close copies of any individual photographs that may inhabit Midjourney’s latent space, whatever photo filename may have been used to generate them.
INSIDE OUT AND BACKWARDS: REFERRING TO KNOWN REFERENCE IMAGES DIRECTLY
Midjourney and other AI image synthesis systems like Stable Diffusion were trained on the LAION-5B dataset, which is an index of nearly 6 billion image URLs and associated text descriptions, each of which has been independently published online to freely accessible websites. (That is, LAION-5B does not itself contain any images; it contains links to images on the web.)
It is possible to search through the LAION-5B index. In my own searches for the photo filenames I used in my experiments above, I was unable to identify any individual images in the database that appeared to directly account for any of my synthesized images. [Note: that search tool is limited to images with an “aesthetic” value of 6 and higher. A separate tool that searches the entire LAION-5B index yielded similar results in my tests, but that tool appears to have been taken offline since this story’s publication.]
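For readers who want to repeat this kind of search offline, the LAION metadata is also distributed as parquet files that can be filtered directly. A rough sketch with pandas follows; the file path is a placeholder, and the exact column names used here (URL, TEXT, and an aesthetic score) are assumptions that vary between LAION releases:

```python
import pandas as pd

# Placeholder path: one shard of the LAION metadata, downloaded separately.
shard = pd.read_parquet("laion5b_metadata_shard_0000.parquet")

# Look for captions that mention a camera-style filename, as in the prompts above,
# and keep only rows scored as highly "aesthetic" (column name assumed).
matches = shard[
    shard["TEXT"].str.contains("DSC007654", case=False, na=False)
    & (shard["aesthetic"] >= 6)
]

print(matches[["URL", "TEXT"]].head(20))
```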
What happens if we twist the normal Midjourney workflow even further than before? What happens if we start with specific images that we know were part of Midjourney’s training data, and use their associated text strings as prompts?
Taking an image’s complete, unique text string from LAION-5B and entering it as a prompt in Midjourney can yield images that are apparently thematically related, even to uncanny effect, but which would still not likely be mistaken for copies. Moreover, to the casual viewer, it can even be difficult to distinguish images made this way from actual reference photos.
For example, the image above, by photographer Steve McCurry, is indexed in LAION-5B, which links directly to the image on McCurry’s own website. LAION-5B includes the following text string for this image, which was taken directly from the image’s metadata as the photographer himself appears to have published it in 2014, and again in 2015: DSC_7400, Omo Valley, Ethiopia, 08/2013, ETHIOPIA-10319NF. Child with wreath of leaves around head. retouched_Kate Daigneault 08/20/2013
When given that complete text string as a prompt, Midjourney generated these new images:
These synthesized images appear thematically similar to the reference image from which the text string prompt was taken, but they also are so different and varied that they could easily be mistaken for additional, original reference images. And they are more similar to each other than to their text prompt’s parent reference photo.
AI AMBITION AND ACHIEVEMENT
As excited as the public’s response to these tools has been, in my view it still greatly underestimates the ambition of AI image synthesis and its achievement, even in these early days.
All the preceding examples demonstrate that when a user enters text prompts in an attempt to refer Midjourney to specific images in its training data—even by expressly referring to singular images’ unique metadata text strings—Midjourney can still produce novel images. To my mind, that indicates something simple but quite extraordinary: “AI image synthesis” is no misnomer.
These tools are not copying reference images, at least not in toto. That would be redundant, of course, as we already have the ability to make perfect copies of digital images. The process and its product are much more than any particular image in the training data. The ambition appears to be to truly synthesize new, visually compelling images from text and reference images.
This is a mind-boggling, monumental achievement that will forever affect the arts. It is perhaps no surprise, though, that one of the comforting ways to make sense of it is to minimize it and frame it as some kind of cheat.
CONTROVERSY
There are several popular objections to AI image synthesizing systems’ use of training data, and to these new tools in general:
Some artists complain—ahistorically, in my opinion—that their styles are being stolen. While artistic styles cannot be copyrighted, and a culture of copyism and emulation is obviously integral and critical to the arts, large, powerful IP-holding corporations may nonetheless soon adopt this highly subjective complaint themselves and throw their weight (and lawyers) behind business practices and new laws that suppress supposed style theft; laws that may encroach on the public domain and burden and limit independent and traditional artists and the arts for decades.
Some artists react with surprise and anger to the fact that the LAION-5B index links to images the artists themselves put online. This response is likely heartfelt but is simply not compatible with even a rudimentary understanding of what the internet is or how it works, or what it means in purely mechanical terms to publish images and text online for people to view and access with machines. Accommodating this naiveté and confusion by allowing people and corporations to block the names of specific artists, descriptions of styles, and intellectual properties from training datasets may further advantage well-lawyered, entrenched, and influential IP holders and disadvantage independent artists and the public.
Some artists seem to simply object to the public appreciating, categorizing, and deconstructing their work in ways they did not anticipate. We are asked, in effect, to act as though the internet and popular culture itself are a sort of art gallery; artists welcome the public to view their work, but not too closely, and not to analyze it or try to figure out what makes it tick. We’re asked to not lean too far over the velvet rope. This is another sincere and emotionally charged objection that may foreshadow more legalistic debates to come over fair use, what it means to publish a work—what the purpose of publication is and why we encourage it by granting limited copyright terms—and whether some artists and IP owners should be empowered to, in effect, claw back elements of publication.
No artists have followed through on vague claims that AI synthesis tools are producing literal copies of their works; to date no one has exhibited an AI-synthesized image in a courtroom to challenge its originality in relation to copyright. And it seems obvious that AI synthesis does not aspire to create copies in the first place; we already have perfect copies thanks to the internet and digital image files.
Some critics specifically oppose the use of mechanical proxies in image analysis and creation because of AI’s expected impact on visual artists’ future income. This concern seems to privilege their hypothetical losses over the prospect of AI empowering laypeople to communicate with compelling visual rhetoric in valuable ways and at a scale that has never before been possible. As large and diffuse as the benefits of AI image synthesis may be, at the moment there’s no straightforward way to evaluate some of those economic tradeoffs. Hard evidence likely will be recorded on both sides of the ledger very soon.
My intuition is that AI image synthesis is an important development and will be a boon to the arts. In any case, experiments like my inside out image synthesis explorations may help clarify some artists' concerns about their work being analyzed in the latent space.
Cosmo Wenman is CEO of Concept Realizations. He can be contacted at twitter.com/CosmoWenman and cosmo.wenman@gmail.com
If you enjoyed this story, please share it and consider subscribing. I’ll soon be publishing results from an experimental art collaboration creating AI imagery with several of my blind colleagues. I’ll also post occasional stories here about universal access technology and design, 3D scanning and replicating artwork, and, soon, important updates about my freedom of information lawsuit in Paris against musée Rodin, in which I’m seeking to establish public access to all French national museums’ 3D scans of public domain works.