Data Lifeboat 5: Prototypes and policy

We are now past the midpoint of our first project stage, and have our three basic prototype Data Lifeboats. At the moment, they run locally via the command line and generate rough versions of what Data Lifeboats will eventually contain—data and pictures.

The last step for those prototypes is to move them into a clicky web prototype showing the full workflow—something we will share with our working group (but may not put online publicly). We are working towards completing this first prototyping stage around the end of June and writing up the project in July.

We’ve made a few key decisions since we last posted an update, namely about who we’re designing for and what other expertise we need to bring in. We still have more questions than answers, but really, that’s what prototyping is for.

Who might do which bit

It took us a while to get to this decision, but once we had gone through the initial discovery phase, it became clear that we need to concentrate our efforts on three key user groups:

  1. Flickr members – People who’ve uploaded pictures to Flickr, have set licenses and permissions, and may either be happy or not happy for their pictures to be put into Data Lifeboats.
  2. Data Lifeboat creators – Could be archivists or other curatorial types looking to gather sets of pictures to copy into archives elsewhere, whether that be an institution like The Library of Congress, or a family archivist with a DropBox account.
  3. Dock operators – This group is a bit more speculative, but we envision that Data Lifeboats could actually land (or dock) in specific destinations and be treated with special care there. Our ideal scenario would be to develop a network of docks, something we’ve been calling a “Safe Harbor Network,” made up of our great and good cultural organizations: they are already really good at keeping things safe over the long term.

It’ll be good to flesh the needs and wants of these three groups out in more detail in our next stage. If you are a Flickr member reading this, and want to share your story about what your Flickr account means to you, we’d love to hear it.

Web archive vs object archive

Some digital/web preservation experts take the opinion that it’s archivally important to also archive the user interface of a digital property in order to fully understand a digital object’s context. This has arguably resulted in web archives containing a whole lot more information and structural stuff than is useful or necessary. It’s sort of like archiving the entire house within which the shoebox of photos was found.

We have decided that archiving the interface itself is not necessary for a Data Lifeboat, and we will be designing a special viewer that will live inside each Data Lifeboat to help people explore its contents.

Analysing the need for new policy

The Data Lifeboat idea is about so much more than technology. Even though the technology is certainly challenging, the more we think about it, the more challenging the social and ethical aspects are. It’s gritty, complex stuff, made more so by the delicate socio-technical settings available to Flickr members, like privacy, search settings, and licensing. The crosshatch of these three vectors makes managing stable permissions over time harder than weaving a complicated textile!

Once we narrowed down our focus to these specific user groups it also became clear that we need to address the (very) complex legal landscape surrounding the potential for archiving of Flickr images external to the service. It’s particularly gnarly when you start considering how permissions might change over time, or how access might shift for different scales of audience. For example, a Flickr member might be happy for Data Lifeboats containing their images to be shared with friends of friends, but a little apprehensive about them being shared with a recognized cultural institution that would use them for research. They may be much less happy for their Flickr pictures to be fully archived and available to anyone in perpetuity.

To help us explore these questions, and begin prototyping policies for each type of user group we foresee, we have enlisted the help of Dr. Andrea Wallace of the Law School at the University of Exeter. She is working with us to develop legal and policy frameworks tailored to the needs of each of these three groups, and to study how the current Flickr Terms of Service may be suitable for, or need adaptation around, this idea of a Data Lifeboat. This may include drafting terms and conditions needed to create a Data Lifeboat, how we might be able to enhance rights management, and exploring how to manage expiration or decay of privacy or licensing into the future.

Data Lifeboat prototypes

We have generated three different prototype Data Lifeboats to think with, and show to our working group:

  1. Photos tagged with “Flickrhq”: This prototype includes thousands of tagged images of ‘life working at Flickr’, which is useful to explore the tricky aspects of collating other people’s pictures into a Data Lifeboat. Creating it revealed a search foible, whereby the result set that is delivered by searching via a tag is not consistent. Many of the pictures are also marked as All Rights Reserved, with 33% having downloads disabled. This raises juicy questions about licensing and permissions that need further discussion.
  2. Two photos from each Flickr Commons Member: We picked this subset because Flickr Commons photos are earmarked with the ‘no known copyright restrictions’ assertion, so questions about copying or reusing are theoretically simpler. 
  3. All photos from the Library of Congress (LoC) account: Comprising roughly 42,000 photos also marked as “no known copyright restrictions,” this prototype contains a set that is simpler to manage as all images have a uniform license setting. It was also useful to generate a Data Lifeboat of this size as it allowed us to do some very early benchmarking on questions like how long it takes to create one and where changes to our APIs might be helpful.

Preparing these prototypes has underscored the challenges of balancing the legal, social, and technical aspects of this kind of social media archiving, making clear the need for a special set of terms & conditions for Data Lifeboat creation. They also reveal the limitations of tags in capturing all relevant content (which, to some extent, we were expecting) and the user-imposed restrictions set on images in the Flickr context, like ‘can be downloaded.’
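
As a rough illustration of the kind of tally behind observations like these, here is a minimal Python sketch. It is not our prototype code: the record fields (`license`, `can_download`) and the sample values are hypothetical, invented for illustration, and are not the Flickr API's field names or our real data.

```python
from collections import Counter

# Hypothetical photo records; field names and values are made up for illustration.
photos = [
    {"id": "1", "license": "All Rights Reserved", "can_download": False},
    {"id": "2", "license": "All Rights Reserved", "can_download": True},
    {"id": "3", "license": "CC BY 2.0", "can_download": True},
    {"id": "4", "license": "No known copyright restrictions", "can_download": True},
]


def summarise_permissions(records):
    """Count photos per license and the share with downloads disabled."""
    by_license = Counter(r["license"] for r in records)
    disabled = sum(1 for r in records if not r["can_download"])
    # Guard against an empty result set (tag searches can be inconsistent).
    share_disabled = disabled / len(records) if records else 0.0
    return by_license, share_disabled


by_license, share_disabled = summarise_permissions(photos)
print(by_license.most_common())
print(f"{share_disabled:.0%} of photos have downloads disabled")
```

A summary like this could be generated alongside each Data Lifeboat, so a creator can see at a glance how mixed the licences and download settings are before deciding what to include.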

Remaining questions?

OMG, so many. Although the prototypes are still in progress, they have already stimulated great discussion and raised some key questions, such as:

  • How might user intentions or permissions change over time and how could software represent them?
  • How could the scope or scale of sharing influence how shared images are perceived, updated, and utilized?
  • How can we understand the different use cases, and how archivists/librarians could engage with the Data Lifeboats?
  • How important is it to make sure Data Lifeboats are launched with embedded rights information, and how might those decay over time?
  • How should we be considering the descriptive or social contexts that accompany images, and how should they inform subsequent decisions about expiration dates?
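
To make the first and fourth questions a little more concrete, one option is to store each permission as a dated grant rather than a single on/off flag. The Python sketch below is a thought experiment only, not a design decision: the class name, the audience labels, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PermissionGrant:
    """Hypothetical record: one audience may see a photo for a period."""
    audience: str                       # e.g. "friends", "institution", "public"
    valid_from: date
    valid_until: Optional[date] = None  # None means no expiry was set

    def is_active(self, on: date) -> bool:
        """True if the grant covers the given day (end date inclusive)."""
        if on < self.valid_from:
            return False
        return self.valid_until is None or on <= self.valid_until


def audiences_allowed(grants, on):
    """Return the set of audiences whose grants are active on a given day."""
    return {g.audience for g in grants if g.is_active(on)}


# Illustrative grants: open-ended for friends, ten years for an institution.
grants = [
    PermissionGrant("friends", date(2024, 1, 1)),
    PermissionGrant("institution", date(2024, 1, 1), date(2034, 1, 1)),
]

print(audiences_allowed(grants, date(2025, 6, 1)))
print(audiences_allowed(grants, date(2040, 1, 1)))
```

Keeping grants as dated records means a change of mind can be recorded as a new grant or an end date rather than overwriting history, which may matter if a Data Lifeboat needs to show what was permitted when it launched.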

Long term sustainability and funding models

It’s really so early to be talking about this – and we’re definitely not ready to present any actual, reasonable, viable models here because we don’t know enough yet about how Data Lifeboats could be used or under what circumstances. We did do a first pass review of some obvious potential business models, for example:

  • A premium subscription service that allows users to create personalized Data Lifeboats for their own collections.
  • A consulting service for institutions and individuals who want to create Data Lifeboats for specific archival purposes.
  • Training and certification programs for digital archiving that use Data Lifeboats as the foundation.
  • Membership fees for members of the Safe Harbor Network, or charging fees for access to the Data Lifeboat archives.

While there were aspects to each that appealed to our partners, there were also significant flaws so overall, we’re still a long way from having an answer. This is something else we’re planning to explore more broadly in partnership with the wider Flickr Commons membership in subsequent phases of this project.

Next steps

This month we’ll be wrapping up this first prototyping phase supported by the National Endowment for the Humanities. After we’ve completed the required reporting, we’ll move into the next phase in earnest, reaching out to those three user groups more deliberately to learn more about how Data Lifeboats could operate for them and what they would need them to do. 

Two upcoming in-person events!

We’re also very happy to be able to tell you the Mellon Foundation has awarded us a grant to support this next stage, and we’re especially looking forward to running two small events later in the year to gather people from our Flickr Commons partner institutions, as well as other birds of a feather, to discuss these key challenges together.

If you’d like to register your interest in attending one of these meetings, please let us know via this short Registration of Interest form. Please note, these will be small, maybe 20ish people at each, and registering interest does not guarantee a spot. We’ve only just begun planning in earnest.


A millions-of-things pile: Why we need a Collection Development Policy for Flickr Commons

Flickr is a photo-sharing website and has always been about connecting people through photography. It is different from a generic image-hosting service. Flickr Commons, the program launched in 2008 for museums, libraries, and archives to share their photography collections, is different again: it’s about sharing photography collections with a very big audience, and providing tools to help people to contribute information and knowledge about the pictures, ideally to supplement whatever catalogue information already exists.

A collection development policy is a framework for information institutions like libraries, archives and museums to define what they collect, and importantly, what they don’t collect. It’s an important part of maintaining a coherent and valuable collection while trends and technologies change and advance around the organisation. We think it’s time for the Flickr Commons to have a policy like this.

As the Flickr Commons collection grows, we’re seeing all kinds of images in there: photographs, maps, documents, drawings, museum objects, book scans, and more. Therefore, one aspect of the policy is to ask our members to use Flickr’s “Content-Type” field to improve the way their images can be categorised and found in search.

Why are we asking Flickr Commons members to categorise their images?

Since the program launched in 2008, the Flickr Commons has grown to also include illustrations, maps, letters, book scans, and other imagery. The default setting for uploads across all accounts is content_type=Photo, so if you don’t alter that default for new uploads, every image is classified as a photo. This starts to break down if you upload, say, the Engrossed Declaration of Independence, or, a wood engraving of Bloodletting Instruments.

One of the largest Flickr Commons accounts is the great and good British Library, which famously published 1 million illustrations into the program in 2013, announcing:

“The images themselves cover a startling mix of subjects: There are maps, geological diagrams, beautiful illustrations, comical satire, illuminated and decorative letters, colourful illustrations, landscapes, wall-paintings and so much more that even we are not aware of… We are looking for new, inventive ways to navigate, find and display these ‘unseen illustrations’.”

A million first steps by Ben O’Steen, 12 December 2013

Because the default setting for uploads is content_type=Photo, every search on Flickr Commons was inundated with “the beige 19th Century.” Those images had, by default, been categorised as Photos, but were in fact millions of pictures from 17th-, 18th-, and 19th-century books.

Earlier this year, the British Library team adjusted the images in their account to set them as “Illustration/Art” and not Photos. But, that had the effect of “hiding” their content from general, default-set searches. This unintentional hiding raised a little alarm with their followers (who were used to seeing the book scans in their searching), some of whom wrote in to ask what had happened. And rightly so, because it had yet to be explained to them by us or by the search interface.
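
To see why recategorising had this effect, here is a toy Python model of a search that filters on content type by default. It is not Flickr's search code: the function, the default filter, and the sample records are assumptions made purely for illustration.

```python
# Toy records; titles and content types are invented for illustration.
images = [
    {"title": "Harbour at dusk", "content_type": "Photo"},
    {"title": "Botanical plate from a scanned book", "content_type": "Illustration/Art"},
    {"title": "Street market", "content_type": "Photo"},
]


def search(records, content_types=("Photo",)):
    """Return records whose content type is in the filter.

    The Photo-only default is an assumption about how a
    'default-set' search behaves, not a documented rule.
    """
    return [r for r in records if r["content_type"] in content_types]


default_results = search(images)
widened_results = search(images, content_types=("Photo", "Illustration/Art"))

print(len(default_results))  # the illustration is left out by default
print(len(widened_results))  # widening the filter brings it back
```

In this model nothing is deleted when an image is recategorised; it simply stops matching the default filter, which is why the change looked like content disappearing to anyone who had not adjusted their search settings.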

The Backstory

In any aggregated system of cultural materials, you get colossal variegation. Humans describe things differently, no matter how many professional standards we try to implement. Last year, in 2022, the Flickr Commons was mostly a vast swathe of images from scanned book pages. Not photographs, per se, or things created first as photographs. 

There have been two uploads into Flickr Commons of over one million things. The first one was in 2013, by the British Library, whose intention was to ask the community to help describe the million or so book illustrations they had carefully organised with book structure metadata and described using clever machine tags. The BL team also took care not to annoy the Flickr API spirits, pacing their uploading so as not to cause any alerts. They have now spent over a decade building a community around the collection, cultivating creative reuse, inspiration and research in the imagery, primarily through the British Library Labs initiative.

The second gigantic upload, in 2014, was (also) mostly images cropped by a computer program. Created by a solo developer working in a Yahoo Research fellowship, the code was run over an extensive collection of content in the Internet Archive (IA) book digitization program to crop out images on scanned book pages. Those were shoved into Flickr using the API. The developer immediately reached the free account limits, so they negotiated through Yahoo senior management that these millions of images should become part of the Flickr Commons program in an Internet Archive Book Images (IABI) account. Since the developer was also loosely associated with IA, IA agreed to be the institutional partner in the Flickr Commons. It is a requirement of joining the program that the account is held by an organisation, not an individual.

These two uploads utterly overwhelmed the smaller Flickr Commons photography collections, even as the two approaches were so different. 

Here’s a graph from April 2022 data that shows all Commons members on the x-axis, and their upload counts on the y-axis.

The IABI account is 5x larger than all the other accounts combined. If you remove the two giants from the data, the average upload per account is just under 3,000 pictures.

These whopper accounts both have billions of views overall. These view counts are unsurprising, given that they completely dominated all search results in Flickr Commons. While the Flickr Commons’ first goal has always been to “increase public access to photography collections”, its secondary—and in my opinion, much more interesting—goal is to “provide a way for the public to contribute information.”

You can see from the two following graphs that a big photo count doesn’t imply deeper engagement. In fact, we’ve seen the opposite is true, and the Flickr Commons members who enjoy the strongest engagement are those who spend time and effort to engage. Drip-feeding content—and not dumping it all at once—will also help viewers to keep up and get a good view of what is being published.

The fifth account in the most-faved data is the fabulous National Library of Ireland, with about 3,000 photos then, which excels at community engagement, demonstrated by its 181,000 faves.

In the comments data, IABI ranks 21st (~3,000), and British Library 27th (~2,000). The top-commented accounts are all in a groove of stellar community engagement.

Employees working in small archives (or large ones, for that matter) simply cannot compete with a content production software program that auto-generates a crop of an image in a book scan and its associated automated many-word metadata. At the Flickr Foundation, we have a place in our hearts for the smaller cultural organisations and want to actively support their online engagements through the Flickr Commons program.

I remember when the IABI account went live. Even though I wasn’t working at Flickr or at the Flickr Foundation at the time, I thought it was a mistake to allow such a vast blast of not-photographs into the Flickr Commons, particularly the second massive collection, mainly because it had been so broadly described, meaning it would turn up content in every search.

Fast forward to last year, in April, when—as my strange first step as Executive Director—I decided in consultation and agreement with the staff at IA to act. We agreed to delete the gargantuan Internet Archive Book Images (IABI) account.

A couple of weeks later, people realised it had happened, and a riot of “Flickr is destroying the public domain” posts popped up. I had not prepared for this reaction, which is the opposite of the tone I want the Flickr Foundation to set! I’d consulted with the Internet Archive, and a consensus had been reached. But I was also ignorant of the community enjoying the IABI account. I had presumed there was no community engagement, since nobody had logged into the IABI account since just after the giant upload happened in 2014. That was a mistake, I readily admit, but in my defence, the IA team echoed that same impression when we discussed it. The lone developer (who didn’t work at IA) had uploaded the millions of book images and did not engage with the community. The images were generated from lots of different institutions’ collections digitised through the Internet Archive’s wonderful book scanning initiative. Unfortunately, correct attribution for each institution had not been included in the initial metadata produced for each image. (This was later rectified by a code rewrite by Smithsonian Libraries and Archives, with support from Flickr engineering.) In some cases the content was known to have no copyright, so it didn’t fit the Flickr Commons’ “no known copyright restrictions” assertion and could/should have been declared public domain material. Add to that the content_type=Photo declaration and the broad, auto-generated metadata (along with some tagging to group images into their books, for example). In other words, a millions-of-things mess.

Despite my hesitation, we decided to restore the entire account. This scale of restoration is an incredible engineering feat and an indication of the world-class team working behind the scenes at Flickr. We also set the correct content type designation and adjusted the licences on the restored images to CC0 as Internet Archive does not claim any rights for them. This has the benefit of making them more clearly classified for reuse. 

What we are doing about it

We need to be more restrained when it comes to digital commonses. These huge piles of stuff sound great, but they are not often made with care by people. They’re generated en masse by computers and thrown online. (As a related aside, look to the millions of licensed pieces of content that are mined and inhaled to improve AI programs as their licences are ignored.) 

The British Library acknowledged this, asking for interaction and effort from interested people, and stated explicitly that their 1 million images were “wholly uncurated.” People ultimately enjoyed hunting around in a millions-of-things pile for illustrations of things and made some beautiful responses to them. Indeed, one person managed to add 45,000 tags to the British Library’s Flickr Commons content. 45,000!

Perhaps I’m about to contradict myself again and say this scale of access at a base level was good, at least for computers and computation. But, it wasn’t good inside the Flickr Commons program, and that’s why we need the Collection Development Policy so we can encourage and nurture the seeing, enjoyment and contributions to our shared photographic history we always wanted.

And that’s why we’re drafting the new policy in collaboration with the membership, so we can help Flickr Commons members know how to hold the shape of the container we’ve created instead of bursting it. 

With thanks to Josh Hadro, Martin Kalfatovic, Nora McGregor, Mia Ridge, Alexis Rossi, and Jessamyn West for your time and feedback on this post.