The Ghost Stays in the Picture, Part 1: Archives, Datasets, and Infrastructures

Eryk Salvaggio is a 2024 Flickr Foundation Research Fellow, diving into the relationships between images, their archives, and datasets through a creative research lens. This three-part series focuses on the ways archives such as Flickr can shape the outputs of generative AI in ways akin to a haunting.

“The Absence Becomes the Thing.”
– Rindon Johnson,
from The Law of Large Numbers

Every image generated by AI calls up a line of ghosts. They haunt the training data, where the contexts of photographs are reduced to the simplest of descriptions. They linger in the decisions engineers and designers make about which labels to use. The ghosts that haunt the generated image are hidden by design, but we can find them through their traces. We just need to know how to look.

As an artist, I rarely find the images created by AI systems interesting solely as photographs. I find the absences that structure these images, and the stories told in the gaps, to be far more compelling. The images themselves recycle the tropes of their training data. By design, they lean into the most common patterns, changing the details like a lazy student changing the words of a plagiarized essay.

I don’t turn to generative AI for beautiful images. I look for evidence of ghosts.

What exactly is a ghost in an AI system? It’s a structure or decision that haunts information in barely discernible, even invisible, ways. Datasets are shaped by absences, and those absences shape the image. As a diffusion model seeks the path to an image, the absence of pathways constrains what is possible. We can read these paths by looking at AI images critically, addressing the negative space of what appears on our screens. Who are the people we don’t see? What are the stories these images cannot tell?

This can mean absences in representation. When we have thousands of photographs of white children tagged as “girls,” but few of black children, black girls are absent from the images. Absence haunts the generated image, shaping it: we will see mostly white girls because black girls have been pushed to the edges. This is not a glib example. It is precisely what I found in 2019, when I analyzed a dataset used to train image generation tools and automated surveillance systems. The pattern holds today. Victorian-era portraits of white girls are prevalent in the training data for generative AI systems such as Stable Diffusion. Black girls are absent, with highly sexualized images of adult women taking their place.

Infrastructure makes ghosts, too. We build complex systems one step at a time, like a set of intersecting hallways. Artificial intelligence is, at its heart, a means of automating decisions: it carries decisions from the past into the future. Once we inscribe these decisions into code, the code becomes infrastructure, subsumed into a labyrinth assembled from the code of others and code yet to be written. As we renovate these structures through new code or system upgrades, the logic of a particular path is lost, and we may build new walls around it. But when we bury code, we bury decisions beneath a million lines of if/then statements and the weights and biases of machine learning. Unchallenged, the world that has slipped past us shapes the predictions of these systems in ways we cannot understand.
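To make that concrete, here is a toy sketch of a buried decision. The names and values are invented for illustration, not drawn from any real system, but the shape is familiar: a single cutoff, chosen once, keeps shaping outputs long after its rationale is forgotten.

```python
# A hypothetical "buried decision": a cutoff chosen during some long-past
# calibration run, still filtering every image that passes through.
LEGACY_CONFIDENCE_CUTOFF = 0.82  # why 0.82? Nobody remembers.

def keep_image(label_confidence: float) -> bool:
    # Every pipeline built on top of this filter silently inherits the
    # old cutoff; images scoring below it never appear downstream.
    return label_confidence >= LEGACY_CONFIDENCE_CUTOFF

print(keep_image(0.81))  # False: excluded, invisibly, by a forgotten choice
```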

This is true of most data-driven, automated systems, whether we are talking about resume filters or parole decisions. For the generated photograph, these decisions include how we test and calibrate image recognition systems, and how we iterate on those systems with every new model launch and interface.

Diffusion models, at the core of image generation systems, are themselves an entanglement of systems. They rely on one system to label images, examining how pixels are clustered and matching those clusters with human descriptions. Underpaid human workers tested these systems, comparing the tool’s results against what they saw with their own eyes. Those comparisons were recorded and integrated into the memory of the model. The actions of those people were fused into the infrastructure of the model, shaping decisions long after they stopped working on the dataset.
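To give a sense of what that labeling step looks like in practice, here is a minimal sketch using CLIP, a common image-text matching model; this illustrates the technique, not Flickr’s or any specific vendor’s pipeline. It scores how well candidate captions “fit” an image, and scores like these are used to caption and filter images at scale.

```python
# A minimal sketch of image-text matching with CLIP, via Hugging Face
# transformers. The file name and captions are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local photograph
captions = ["a girl at a birthday party", "a city street at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image's similarity to each caption; in a
# pipeline, the highest-scoring caption becomes the image's "description".
scores = outputs.logits_per_image.softmax(dim=1)[0]
for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.2f}  {caption}")
```

The point is not the particular model but the reduction: a photograph’s whole context collapses into whichever short string scores highest.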

We tend to make up stories about synthetic images based on what’s depicted. That is the work of the human imagination: a way of making sense of media based on references we’ve seen before. That is a ghost story, too. But if we want to meet the ghosts that shape AI-generated images, we have to dig deeper into the systems that produce them. The AI ghost story is a story of the past reaching into the present, and to understand it, it helps to know the lineage of those decisions.

Image synthesis has a history, and that history lingers in the black boxes of neural nets as they shape noisy pixels into something recognizable. Part of that story is the datasets, but data is put to a vast number of uses. One of those uses is training larger systems to sort and handle ever greater volumes of data.

Data shapes data infrastructure. Patterns found in small sets of data are applied to larger ones. These patterns are invoked again and again whenever we call upon these systems in the future. The source data is always an incomplete model of things. Nonetheless, it is applied to larger datasets, which inherit and amplify the gaps, absences, and decisions of the past.
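A toy sketch of that inheritance, using entirely synthetic data, might look like this: a classifier trained on a small, skewed seed set is used to label a far larger pool, and the seed set’s absence is carried forward, here amplified into near-total absence.

```python
# Synthetic demonstration: patterns from a small dataset applied to a
# larger one inherit, and can amplify, the small set's gaps.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A small "seed" set: 200 labeled examples, only ~3% in the minority
# class, with features that carry no real signal about the label.
seed_features = rng.normal(size=(200, 8))
seed_labels = np.zeros(200, dtype=int)
seed_labels[:6] = 1  # the absence in the source data

model = LogisticRegression().fit(seed_features, seed_labels)

# The learned "pattern" is then applied to a far larger pool of data.
large_pool = rng.normal(size=(100_000, 8))
auto_labels = model.predict(large_pool)

# With so few positives to learn from, the model predicts the majority
# class almost everywhere: the gap is inherited and amplified.
print(f"Share labeled positive: {auto_labels.mean():.2%}")
```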

This is part of my creative research work on the seance of the digital archive. It focuses not only on data, but on the lineage of data and the decisions made using that data to shape larger systems. A key piece of this lineage, and one that merits deeper exploration, is Flickr.

The Archive and the Dataset

With the rise of generative AI, vast troves of visual, textual, and sonic cultural heritage data have been folded into models that create new images, text, even music. But images are host to a special kind of spectral illusion. Most images shared online were never intended to become “data,” and in many ways, this transformation into data is at odds with the real value at the heart of what these archives preserve.

What is the difference between an archive and a dataset? We are dealing with many levels of abstraction here: an archive consists of individual objects designed to serve some human purpose. These objects may then be curated into a collection. It may be a collection of pamphlets, political cartoons, or documentary photographs. It may be an amateur photographer’s snapshot of a birthday party where a daughter and granddaughter celebrated alongside one another. Flickr, as a photo-sharing website, is host to all of these. The miracle of data, compression, and the world wide web is that the same infrastructures can be shared by moments important to “history” and moments important to the individual. Flickr preserves images from cultural heritage institutions and family beach outings alike.

Flickr is three things at once: an archive and a dataset, most famously. But it is also a kind of data infrastructure. Let’s explore these one by one.

Flickr is an archive. It is a website that preserves history. It holds digital copies of historical artifacts for individual reflection and context. Flickr is a website for memories, stored in its copies of images, snapshots, aids to the remembrance of personal stories. These are assembled into an archive, a collective photo album. Flickr as an archive is a place where the context of an individual item is preserved. But we make sense of this archive socially. Meanings change as users sort these images, tag them, and reuse them (with permission) across the web. The archive is a collection of images with their own history beyond the website itself.

Flickr is a dataset. Flickr images can be described, at scale, in pure numbers. In 2011, the website claimed to have 6 billion images; more recently it has boasted of holding “tens of billions” of photos, with estimates of 25 million uploads per day. By contrast, the largest image dataset widely used in machine learning, LAION-5B, contains roughly 5.85 billion image-text pairs. Flickr as a massive, expanding dataset poses a particular set of challenges in thinking about its future. One of these is the daunting task of sorting and understanding all of those images. The dataset, then, is really just the archive viewed through the abstraction of scale: billions of images seen as one dataset, with each image merely a piece of the collective whole. Viewed as a dataset, our focus shifts to the ways the entirety of that set can be preserved and understood.
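A rough back-of-the-envelope comparison, using only the figures quoted above (estimates, not official counts), shows the scale involved:

```python
# Comparing Flickr's estimated intake to LAION-5B, using the figures
# cited in this article rather than any official statistics.
uploads_per_day = 25_000_000
laion_5b_pairs = 5_850_000_000  # ~5.85 billion image-text pairs

uploads_per_year = uploads_per_day * 365
print(f"{uploads_per_year:,} uploads per year")  # 9,125,000,000
print(f"{uploads_per_year / laion_5b_pairs:.1f}x LAION-5B, every year")
```

By these numbers, a single year of uploads would be roughly one and a half times the size of all of LAION-5B.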

But in shifting our lens from archive to dataset, individual details become less important. In changing scales this way, it’s important to move fluidly between them, much as we close one eye, then the other, to look at the letters on an eye exam chart. If we want to tackle the myopia of design decisions, we must get used to shifting between these two views, rather than treating one as the sole way we see the world.

What does it mean for Flickr to be “infrastructure” for AI? It helps to define this slippery term, so I turn to a definition used by the Initiative for Public Digital Infrastructure at UMass Amherst:

“Infrastructures are fundamental systems that allow us to build other systems—new houses and businesses rely on the infrastructures of electric power lines, water mains, and roads—and infrastructures are often invisible so long as they work well.”

On the relationship to images in particular, Katrina Sluis describes the shift in meaning attributed to images as their context moves from archives to data infrastructures:

“Photographic culture is now being sustained by a variety of agents who sit outside the traditional scope of cultural institutions. I’m thinking here of the computer scientist, web designer, Silicon Valley entrepreneur or Amazon Mechanical Turker. And of course, these are actors who are not engaged with photographic culture and the politics of representation, the history of photography or the inherent polysemy of the image. In the computer science lab, the photograph remains relatively uncomplicated – it is ultimately a blob of information – whether materialized as a “picture” or left latent as data.”

Flickr’s images played an important role in shaping image recognition systems at the outset, and in turn, image generation systems. As a result of this entrenchment of images into AI, many Flickr images have become a form of “accidental infrastructure” for AI. I should be clear that Flickr has not trained a generative AI model of its own for the production of new images, nor has it arranged, as of this writing, for the sale of images for use in image training.

When we examine Flickr as infrastructure, we will see that these two worlds, archive and dataset, have come to occupy the same space, complicating our understanding of both. Flickr’s movement from archive to dataset in the eyes of AI training isn’t permanent. It reflects a historical shift in the ways people understand and relate to images. So it is worth exploring how that shift changes what we see, and how ghosts from the archive come to “haunt” the dataset. In establishing these two lenses of focus, we might find strategies for shifting between them. This can help us better articulate the context of the images that have built, and will likely continue to build, the foundations of generative AI systems and the images those systems produce.

Flickers in the Infrastructure

How did Flickr’s transition from archive to dataset allow it to become a piece of AI infrastructure?

It started with one of the first breakthroughs in AI image generation: StyleGAN 2, which could produce images of human faces that were nearly photorealistic. That capability rested on the FFHQ dataset, which NVIDIA assembled from 70,000 portraits of faces drawn from Flickr. Notably, NVIDIA warned that the dataset would inherit Flickr’s biases. FFHQ also went on to be used in image labeling and face recognition technologies.

We can easily trace the influence of that dataset on the faces StyleGAN 2 produced. In 2019, I did my own analysis of the dataset, looking at the collection image by image. In doing so, I examined the dataset through the lens of an archive: as a collection of individual photographs, and individual people. I discovered that fewer than 3% of the faces sampled from the dataset were those of black women. As a result, the model was less likely to generate faces of black women, and when it did, they were less photorealistic than other faces. The absences were shaping what we saw.
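The arithmetic behind a figure like that is simple; the labor is in the looking. Here is a minimal sketch of the tally, where the file and column names are hypothetical stand-ins rather than the actual format of the 2019 analysis:

```python
# Tallying hand-made annotations of a dataset sample. The CSV file and
# its column names are hypothetical placeholders for illustration.
import csv
from collections import Counter

counts = Counter()
with open("ffhq_sample_annotations.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[(row["perceived_race"], row["perceived_gender"])] += 1

total = sum(counts.values())
for group, n in counts.most_common():
    print(f"{group}: {n} ({n / total:.1%} of sample)")
```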

If datasets are haunted, then the synthetic image is a seance, a way of conjuring a specter from the dataset. The word specter refers both to the appearance of a spirit and to the appearance of an image, deriving from the Latin spectrum. The synthetic image is a specter: an image that appears from an unknown place, one slice from a spectrum of possible images associated with a prompt. Absences in the dataset constrained the output of possible images. Without black women in the dataset, black women were not in the images. This is one way absences can haunt the synthetic image.

But there is another case study worth exploring: the ways that Flickr haunts the infrastructures of AI. How did the dataset shape automated decision-making processes that were then folded into longer, more complex systems of image generation?

In part two of this blog post, we’ll look at YFCC100M, a dataset of 99.2 million photos (and roughly 800,000 videos) released in June 2014. We’ll trace the path this collection of Flickr images has taken, from images, to archive, to dataset. Along the way, we’ll see how that dataset, by becoming a go-to reference for calibrating and testing image recognition and synthesis systems, became infused into the infrastructures of generated images.