Making some marvelous maps

This week we added maps to our Commons Explorer, and it’s proving to be a fun new way to find photos.

There are over 50,000 photos in the Flickr Commons collection which have location information telling us where the photo was taken. We can plot those locations on a map of the world, so you can get a sense of the geographical spread:

This map is interactive, so you can zoom in and move around to focus on a specific place. As you do, we’ll show you a selection of photos from the area you’ve selected.

You can also filter the map, so you see photos from just a single Commons member. For smaller members the map points can tell a story in themselves, and give you a sense of where a collection is and what it’s about:

These maps are available now, and know about the location of every geotagged photo in Flickr Commons.

Give them a try!

How can you add a location to a Flickr Commons photo?

For the first version of this map, we use the geotag added by the photo’s owner.

If you’re a Flickr Commons member, you can add locations to your photos and they’ll automatically show up on this map. The Flickr Help Center has instructions for how to do that.

It’s possible for other Flickr members to add machine tags to photos, and there are already thousands of crowdsourced tags that have location-related information. We don’t show those on the map right now, but we’re thinking about how we might do that in future!

How does the map work?

There are three technologies that make these maps possible.

The first is SQLite, the database engine we use to power the Commons Explorer. We have a table which contains every photo in the Flickr Commons, and it includes any latitude and longitude information. SQLite is wicked fast and our collection is small potatoes, so it can get the data to draw these maps very quickly.

I’d love to tell you about some deeply nerdy piece of work to hyper-optimize our queries, but it wasn’t necessary. I wrote the naïve query, added a couple of column indexes, and that first attempt was plenty fast. (Tallying the locations for the entire Flickr Commons collection takes ~45ms; tallying the locations for an individual member is often under a millisecond.)
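
For the curious, the tally query is roughly this shape – a minimal sketch using Python’s built-in sqlite3 module against a hypothetical photos table; the table and column names are illustrative, not the Commons Explorer’s actual schema:

import sqlite3

con = sqlite3.connect("commons_explorer.db")

# Count geotagged photos per rounded coordinate, so the map can draw
# clustered markers.  Assumes a `photos` table with nullable
# latitude/longitude columns and an indexed owner_id column.
rows = con.execute(
    """
    SELECT round(latitude, 2) AS lat,
           round(longitude, 2) AS lon,
           count(*) AS photo_count
    FROM photos
    WHERE latitude IS NOT NULL
      AND longitude IS NOT NULL
      AND owner_id = ?
    GROUP BY lat, lon
    """,
    ("<member-nsid>",),  # placeholder for a single Commons member's NSID
).fetchall()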

The second is Leaflet.js, a JavaScript library for interactive maps. This is a popular and feature-rich library that made it easy for us to add a map to the site. Combined with a marker clustering plugin, we had a lot of options for configuring the map to behave exactly as we wanted, and to connect it to Flickr Commons data.

The third is OpenStreetMap. This is a world map maintained by a community of volunteers, and we use their map tiles as the backdrop for our map.

Plus ça change

To help us track changes to the Commons Explorer, we’ve added another page: the changelog.

This is part of our broader goal of archiving the organization. Even in the six months since we launched the Explorer, it’s easy to forget what happened when, and new features quickly feel normal. The changelog is a place for us to remember what’s changed and what the site used to look like, as we continue to make changes and improvements.

Working with snapshots of structured data on Wikimedia Commons

In my previous post about Flickypedia Backfillr Bot, I mentioned that Flickypedia uses snapshots of structured data on Wikimedia Commons to spot possible duplicates:

We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

As we’ve been working on Flickypedia, we’ve developed a few tactics for working with these snapshots, which we thought might be useful for other people working with Wikimedia Commons data.

What are these snapshots?

Files on Wikimedia Commons can contain structured data—machine-readable metadata saying where the file came from, the license of the file, when it was created, and so on. For a longer explanation of structured data, read my previous post.

The structured data snapshots are JSON files that contain the structured data statements for all the files on Wikimedia Commons. (One of many public dumps of Wikimedia content.) These snapshots are extremely useful if you have a task that involves searching the database en masse – for example, finding all the Flickr photos on Commons.

All the snapshots we worked with are available for download from https://dumps.wikimedia.org/commonswiki/entities/, and new snapshots are typically created a few times a week.

Do you need snapshots?

Snapshots can be cumbersome, so if you need a quick answer, there may be better ways to get data out of Wikimedia Commons, like Special:MediaSearch and the Commons Query Service, which both support querying on structured data. But if you need to look at Wikimedia Commons as a whole, or run some sort of complex query or analysis that doesn’t fit into an existing tool, the structured snapshots can be very useful.

We’ve already found several use cases for them at the Flickr Foundation:

  • Finding every Flickr photo on Wikimedia Commons. As discussed in previous posts, the many variants of Flickr URL make it difficult to run a query for Flickr photos on Commons – but we can do this analysis easily with a snapshot. We can parse the data in the snapshot with our Flickr URL parser and store the normalised information in a new database.
  • Seeing how structured data is already being used. When we were designing the Flickypedia data model, part of our research involved looking at how structured data was already being used for Flickr photos. Using the snapshots, we could look for examples we could mimic, and compare our ideas to the existing data. Was our proposal following a popular, well-established approach, or was it novel and perhaps more controversial?
  • Verifying our assumptions about structured data. By doing an exhaustive search of the structured data, we could check if our assumptions were correct – and sometimes we’d find counterexamples that forced us to rethink our approach. For example, “every Wikimedia Commons file comes from zero or one Flickr photos”. Looking at the snapshots told us this was false – there are some files which link to multiple Flickr photos, because the same photo was uploaded to Flickr multiple times.

How do you download a snapshot?

The snapshots are fairly large: the latest snapshots are over 30GB, and that’s only getting bigger as more structured data is created. It takes me multiple hours to download a snapshot, and that can be annoying if the connection drops partway through.

Fortunately, Wikimedia Commons has a well-behaved HTTP server that supports resumable downloads. There are lots of download managers that can resume the download when it gets interrupted, so you can download a snapshot over multiple sessions. I like curl because it’s so ubiquitous – there’s a good chance it’s already installed on whatever computer I’m using.

This is an example of the curl command I run:

curl \
  --location \
  --remote-name \
  --continue-at - \
  "https://dumps.wikimedia.org/commonswiki/entities/20240617/commons-20240617-mediainfo.json.gz"

I usually have to run it multiple times to get a complete download, but it does eventually succeed. The important flag here is --continue-at -, which tells curl to resume a previous download.

Which format should you download?

The snapshots are available in two formats: bzip2-compressed JSON, and gzip-compressed JSON. They have identical contents, just compressed differently. Which should you pick?

I wasn’t sure which format was right, so when I was getting started, I downloaded both and ran some experiments to see which was a better fit for our use case. We iterate through every file in a snapshot as part of Flickypedia, so we wanted a format we could read quickly.

The file sizes are similar: 33.6GB for bzip2, 43.4GB for gzip. Both of these are manageable downloads for us, so file size wasn’t a deciding factor.

Then I ran a benchmark on my laptop to see how long it took to read each format. This command is just uncompressing each file, and measuring the time it takes:

$ time bzcat commons-20240617-mediainfo.json.bz2 >/dev/null
Executed in 113.48 mins

$ time gzcat commons-20240617-mediainfo.json.gz >/dev/null
Executed in 324.17 secs

That’s not a small difference: gzip is 21 times faster to uncompress than bzip2. Even accounting for the fairly unscientific test conditions, it was the clear winner. For Flickypedia, we use the gzip-compressed snapshots.

What’s inside a snapshot?

An uncompressed snapshot is big – the latest snapshot contains nearly 400GB of JSON.

The file contains a single, massive JSON array:

[
   { … data for the first file … },
   { … data for the second file … },
   …
   { … data for the last file … }
]

Aside from the opening and closing square brackets, each line has a JSON object that contains the data for a single file on Wikimedia Commons. This makes it fairly easy to stream data from this file, without trying to parse the entire snapshot at once.

If you’re curious about the structure of the data, we have some type definitions in Flickypedia: one for the top-level snapshot entries, one for the Wikidata data model which is used for structured data statements. Unfortunately I haven’t been able to find a lot of documentation for these types on Wikimedia Commons itself.

How to read snapshots

The one-file-per-line structure of the snapshot JSON allows us to write a streaming parser in Python. This function will read one file at a time, which is more efficient than reading the entire file at once:

import gzip
import json


def get_entries_from_snapshot(path):
    with gzip.open(path) as uncompressed_file:
        for line in uncompressed_file:

            # Skip the square brackets at the beginning/end of the file
            # which open/close the JSON array
            if line.strip() in {b"[", b"]"}:
                continue

            # Strip the trailing comma at the end of each line
            line = line.rstrip(b",\n")

            # Parse the line as JSON, and yield it to the caller
            entry = json.loads(line)
            yield entry


path = "commons-20240617-mediainfo.json.gz"

for entry in get_entries_from_snapshot(path):
    print(entry)

# {'type': 'mediainfo', 'id': 'M76', … }
# …

This does take a while – on my machine, it takes around 45 minutes just to read the snapshot, with no other processing.

To avoid having to do this too often, my next step is to extend this script to extract the key information I want from the snapshot.

For example, for Flickypedia, we’re only really interested in P12120 (Flickr Photo ID) and P7482 (Source of File) when we’re looking for Flickr photos which are already on Commons. A script which extracts just those two fields can reduce the size of the data substantially, and give me a file that’s easier to work with.
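
Here’s a rough sketch of that extraction step, reusing the get_entries_from_snapshot() generator above. I’m assuming each entry keeps its statements under a top-level "statements" key, keyed by property ID (as in the API responses) – treat the exact paths as illustrative rather than gospel:

import json

INTERESTING_PROPERTIES = {"P12120", "P7482"}

with open("flickr_statements.ndjson", "w") as out_file:
    for entry in get_entries_from_snapshot("commons-20240617-mediainfo.json.gz"):
        statements = entry.get("statements", {})

        # Be defensive: entries with no statements may serialise as an
        # empty list rather than an empty dict
        if not isinstance(statements, dict):
            continue

        matching = {
            prop: statements[prop]
            for prop in INTERESTING_PROPERTIES
            if prop in statements
        }

        # Only keep entries that use at least one property we care about
        if matching:
            json.dump({"id": entry["id"], "statements": matching}, out_file)
            out_file.write("\n")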

Flickypedia Backfillr Bot

Last year, we built Flickypedia, a new tool for copying photos from Flickr to Wikimedia Commons. As part of our planning, we asked for feedback on Flickr2Commons and analysed other tools. We spotted two consistent themes in the community’s responses:

  • Write more structured data for Flickr photos
  • Do a better job of detecting duplicate files

We tried to tackle both of these in Flickypedia, and initially, we were just trying to make our uploader better. Only later did we realize that we could take our work a lot further, and retroactively apply it to improve the metadata of the millions of Flickr photos already on Wikimedia Commons. At that moment, Flickypedia Backfillr Bot was born. Last week, the bot completed its millionth update, and we guesstimate we will be able to operate on another 13 million files.

The main goals of the Backfillr Bot are to improve the structured data for Flickr photos on Wikimedia Commons and to make it easier to find out which photos have been copied across. In this post, I’ll talk about what the bot does, and how it came to be.

Write more structured data for Flickr photos

There are two ways to add metadata to a file on Wikimedia Commons: by writing Wikitext or by creating structured data statements.

When you write Wikitext, you write your metadata in a MediaWiki-specific markup language that gets rendered as HTML. This markup can be written and edited by people, and the rendered HTML is designed to be read by people as well. Here’s a small example, which adds some metadata to a file, linking it back to the original Flickr photo:

== {{int:filedesc}} ==
{{Information
|Description={{en|1=Red-whiskered Bulbul photographed in Karnataka, India.}}
|Source=https://www.flickr.com/photos/shivanayak/12448637/
|Author=[[:en:User:Shivanayak|Shiva shankar]]
|Date=2005-05-04
|Permission=
|other_versions=
}}

and here’s what that Wikitext looks like when rendered as HTML:

A table with four rows: Description (Red-whiskered Bulbul photographed in Karnataka, India), Date (4 May 2005), Source (a Flickr URL) and Author (Shiva shankar)

This syntax is convenient for humans, but it’s fiddly for computers – it can be tricky to extract key information from Wikitext, especially when things get more complicated.

Starting in 2017, Wikimedia Commons added support for structured data, which allows editors to add metadata in a machine-readable format. This makes it much easier to edit metadata programmatically, and there’s a strong desire from the community for new tools to write high-quality structured metadata that other tools can use.

When you add structured data to a file, you create “statements” which are attached to properties. The list of properties is chosen by the volunteers in the Wikimedia community.

For example, there’s a property called “source of file” which is used to indicate where a file came from. The file in our example has a single statement for this property, which says the file is available on the Internet, and points to the original Flickr URL:

Structured data is exposed via an API, and you can retrieve this information in nice machine-readable XML or JSON:

$ curl 'https://commons.wikimedia.org/w/api.php?action=wbgetentities&sites=commonswiki&titles=File%3ARed-whiskered%20Bulbul-web.jpg&format=xml'
<?xml version="1.0"?>
<api success="1">
  …
  <P7482>
    …
    <P973>
      <_v snaktype="value" property="P973">
        <datavalue
          value="https://www.flickr.com/photos/shivanayak/12448637/"
          type="string"/>
      </_v>
    </P973>
    …
  </P7482>
</api>

(Here “P7482” means “source of file” and “P973” is “described at URL”.)
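
If you ask for format=json instead, you can pull that URL out with a few lines of Python. This is a sketch using httpx; it assumes the JSON mirrors the XML above, with the Flickr URL stored as a “described at URL” qualifier on the “source of file” statement – check the live response before relying on the exact paths:

import httpx

resp = httpx.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "wbgetentities",
        "sites": "commonswiki",
        "titles": "File:Red-whiskered Bulbul-web.jpg",
        "format": "json",
    },
)
resp.raise_for_status()

# The response has one entity, keyed by its M-prefixed ID
entity = next(iter(resp.json()["entities"].values()))

# P7482 = "source of file"; the Flickr URL lives in the P973
# ("described at URL") qualifiers of that statement
for statement in entity.get("statements", {}).get("P7482", []):
    for qualifier in statement.get("qualifiers", {}).get("P973", []):
        print(qualifier["datavalue"]["value"])

# https://www.flickr.com/photos/shivanayak/12448637/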

Part of being a good structured data citizen is following the community’s established patterns for writing structured data. Ideally every tool would create statements in the same way, so the data is consistent across files – this makes it easier to work with later.

We spent a long time discussing how Flickypedia should use structured data, and we got a lot of helpful community feedback. We’ve documented our current data model as part of our Wikimedia project page.

Do a better job of detecting duplicate files

If a photo has already been copied from Flickr onto Wikimedia Commons, nobody wants to copy it a second time.

This sounds simple – just check whether the photo is already on Commons, and don’t offer to copy it if it’s already there. In practice, it’s quite tricky to tell if a given Flickr photo is on Commons. There are two big challenges:

  1. Files on Wikimedia Commons aren’t consistent in where they record the URL of the original Flickr photo. Newer files put the URL in structured data; older files only put the URL in Wikitext or the revision descriptions. You have to look in multiple places.
  2. Files on Wikimedia Commons aren’t consistent about which form of the Flickr URL they use – with and without a trailing slash, with the user NSID or their path alias, or the myriad other URL patterns that have been used in Flickr’s twenty-year history.

Here’s a sample of just some of the different URLs we saw in Wikimedia Commons:

https://www.flickr.com/photos/joyoflife//44627174
https://farm5.staticflickr.com/4586/37767087695_bb4ecff5f4_o.jpg
www.flickr.com/photo_edit.gne?id=3435827496
https://www.flickr.com/photo.gne?short=2ouuqFT

There’s no easy way to query Wikimedia Commons and see if a Flickr photo is already there. You can’t, for example, do a search for the current Flickr URL and be sure you’ll find a match – it wouldn’t find any of the examples above. You can combine various approaches that will improve your chances of finding an existing duplicate, if there is one, but it’s a lot of work and you get varying results.

For the first version of Flickypedia, we took a different approach. We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

This gave us a SQLite database that mapped Flickr photo IDs to Wikimedia Commons filenames. We could use this database to do fast queries to find copies of a Flickr photo that already exist on Commons. This proved the concept, but it had a couple of issues:

  • It was an incomplete list – we only looked in the structured data, and not the Wikitext. We estimate we were missing at least a million photos.
  • Nobody else can use this database; it only lives on the Flickypedia server. Theoretically somebody else could create it themselves – the snapshots are public, and the code is open source – but it seems unlikely.
  • This database is only as up-to-date as the latest snapshot we’ve downloaded – it could easily fall behind what’s on Wikimedia Commons.

We wanted to make this process easier – both for ourselves, and anybody else building Flickr–Wikimedia Commons integrations.

Adding the Flickr Photo ID property

Every photo on Flickr has a unique numeric ID, so we proposed a new Flickr photo ID property to add to structured data on Wikimedia Commons. This proposal was discussed and accepted by the Wikimedia Commons community, and gives us a better way to match files on Wikimedia Commons to photos on Flickr:

This is a single field that you can query, and there’s an unambiguous, canonical way that values should be stored in this field – you don’t need to worry about the different variants of Flickr URL.
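
For example, you can now ask the Commons search API for files carrying a particular Flickr photo ID, using the haswbstatement search keyword. A quick sketch (the photo ID below is just an illustration):

import httpx

flickr_photo_id = "12448637"  # illustrative value

resp = httpx.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:P12120={flickr_photo_id}",
        "srnamespace": "6",  # the File: namespace
        "format": "json",
    },
)
resp.raise_for_status()

for result in resp.json()["query"]["search"]:
    print(result["title"])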

We added this field to Flickypedia, so any files uploaded with our tool will get this new field, and we hope that other Flickr upload tools will consider adding this field as well. But what about the millions of Flickr photos already on Wikimedia Commons? This is where Flickypedia Backfillr Bot was born.

Updating millions of files

Flickypedia Backfillr Bot applies our structured data mapping to every Flickr photo it can find on Wikimedia Commons – whether or not it was uploaded with Flickypedia. For every photo which was copied from Flickr, it compares the structured data to the live Flickr metadata, and updates the structured data if the two don’t match. This includes the Flickr Photo ID.

It reuses code from our duplicate detector: it goes through a snapshot looking for any files that come from Flickr photos. Then it gets metadata from Flickr, checks if the structured data matches that metadata, and if not, it updates the file on Wikimedia Commons.

Here’s a brief sketch of the process:
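
In heavily simplified Python, the decision the bot makes for each structured data property looks something like this – a sketch of the logic described below, not the bot’s real code:

def plan_edit(property_id, expected_value, existing_value):
    """Decide what to do for one structured data property on one file.

    `expected_value` is what we derive from the live Flickr metadata;
    `existing_value` is what's already on Wikimedia Commons (or None).
    """
    if existing_value is None:
        return ("write", property_id, expected_value)

    if existing_value == expected_value:
        return ("skip", property_id, None)

    # Conflicting data: do nothing, and flag it for a human to review
    return ("flag_for_review", property_id, expected_value)


# e.g. a file with no Flickr photo ID yet gets one written:
print(plan_edit("P12120", "12448637", None))
# ('write', 'P12120', '12448637')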

Most of the time this logic is fairly straightforward, but occasionally the bot will get confused – this is when the bot wants to write a structured data statement, but there’s already a statement with a different value. In this case, the bot will do nothing and flag it for manual review. There are edge cases and unusual files in Wikimedia Commons, and it’s better for the bot to do nothing than write incorrect or misleading data that will need to be reverted later.

Here are two examples:

  • Sometimes Wikimedia Commons has more specific metadata than Flickr. For example, this Flickr photo was posted by the Donostia Kultura account, and the description identifies Leire Cano as the photographer.

    Flickypedia Backfillr Bot wants to add a creator statement for “Donostia Kultura”, because it can’t understand the description – but when this file was copied to Wikimedia Commons, somebody added a more specific creator statement for “Leire Cano”.

    The bot isn’t sure which statement is correct, so it does nothing and flags this for manual review – and in this case, we’ve left the existing statement as-is.

  • Sometimes existing data on Wikimedia Commons has been mapped incorrectly. For example, this Flickr photo was taken “circa 1943”, but when it was copied to Wikimedia Commons somebody added an overly precise “date taken” statement claiming it was taken on “1 Jan 1943”.

    This bug probably occurred because of a misunderstanding of the Flickr API. The Flickr API will always return a complete timestamp in the “date” field, and then return a separate granularity value telling you how accurate it is. If you ignored that granularity value, you’d create an incorrect statement of what the date is.

    The bot isn’t sure which statement is correct, so it does nothing and flags this for manual review – and in this case, we made a manual edit to replace the statement with the correct date.

What next?

We’re going to keep going! There were a few teething problems when we started running the bot, but the Wikimedia community helped us fix our mistakes. It’s now been running for a month or so, and processed over a million files.

All the Flickypedia code is open source on GitHub, and a lot of it isn’t specific to Flickr – it’s general-purpose code for working with structured data on Wikimedia Commons, and could be adapted to build similar bots. We’ve already had conversations with a few people about other use cases, and we’ve got some sketches for how that code could be extracted into a standalone library.

We estimate that at least 14 million files on Wikimedia Commons are photos that were originally uploaded to Flickr – more than 10% of all the files on Commons. There’s plenty more to do. Onwards and upwards!

The surprising utility of a Flickr URL parser

In my first week at the Flickr Foundation, we made a toy called Flinumeratr. This is a small web app that takes a Flickr URL as input, and shows you all the photos which are present at that URL.

As part of this toy, I made a Python library which parses Flickr URLs, and tells you what the URL points to – a single photo, an album, a gallery, and so on. Initially it just handled fairly common patterns, the sort of URLs that you’d encounter if you use Flickr today, but it’s grown to handle more complicated URLs.

$ flickr_url_parser "https://www.flickr.com/photos/sdasmarchives/50567413447"
{"type": "single_photo", "photo_id": "50567413447"}

$ flickr_url_parser "https://www.flickr.com/photos/aljazeeraenglish/albums/72157626164453131"
{"type": "album", "user_url": "https://www.flickr.com/photos/aljazeeraenglish", "album_id": "72157626164453131", "page": 1}

$ flickr_url_parser "https://www.flickr.com/photos/blueminds/page3"
{"type": "user", "user_url": "https://www.flickr.com/photos/blueminds"}

The implementation is fairly straightforward: I use the hyperlink library to parse the URL text into a structured object, then I compare that object to a list of known patterns. Does it look like this type of URL? Or this type of URL? Or this type of URL? And so on.
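
Here’s the flavour of that approach – a much-simplified sketch that only handles the single-photo case, not the library’s actual code:

import hyperlink


def parse_flickr_url(text):
    url = hyperlink.URL.from_text(text)

    # Does it look like https://www.flickr.com/photos/{user}/{photo_id}?
    if (
        url.host in {"www.flickr.com", "flickr.com"}
        and len(url.path) >= 3
        and url.path[0] == "photos"
        and url.path[2].isdigit()
    ):
        return {"type": "single_photo", "photo_id": url.path[2]}

    raise ValueError(f"Not a recognised Flickr URL: {text}")


print(parse_flickr_url("https://www.flickr.com/photos/sdasmarchives/50567413447"))
# {'type': 'single_photo', 'photo_id': '50567413447'}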

You can run this library as a command-line tool, or call it from Python – there are instructions in the GitHub README.

There are lots of URL variants

In my second week and beyond, I started to discover more variants, which should probably be expected in 20-year-old software! I’ve been looking into collections of Flickr URLs that have been built up over multiple years, and although most of these URLs follow common patterns, there are lots of unusual variants in the long tail.

Some of these are pretty simple. For example, the URL to a user’s photostream can be formed using your Flickr user NSID or your path alias, so flickr.com/photos/197130754@N07/ and flickr.com/photos/flickrfoundation/ point to the same page.

Others are more complicated, and you can trace the history of Flickr through some of the older URLs. Some of my favorites include:

  • Raw JPEG files, on live.staticflickr.com, farm1.static.flickr.com, and several other subdomains.

  • Links with a .gne suffix, like www.flickr.com/photo_edit.gne?id=3435827496 (from Wikimedia Commons). This acronym stands for Game Neverending, the online game out of which Flickr was born.

  • A Flash video player called stewart.swf, which might be a reference to Stewart Butterfield, one of the cofounders of Flickr.

I’ve added support for every variant of Flickr URL to the parsing library – if you want to see a complete list, check out the tests. I need over a hundred tests to check all the variants are parsed correctly.

Where we’re using it

I’ve been able to reuse this parsing code in a bunch of different projects, including:

  • Building a similar “get photos at this URL” interface in Flickypedia.

  • Looking for Flickr photo URLs in Wikimedia Commons. This is for detecting Flickr photos which have already been uploaded to Commons, which I’ll describe more in another post.

  • Finding Flickr pages which have been captured in the Wayback Machine – I can get a list of saved Flickr URLs, and then see what sort of pages have actually been saved.

When I created the library, I wasn’t sure if this code was actually worth extracting as a standalone package – would I use it again, or was this a premature abstraction?

Now that I’ve seen more of the diversity of Flickr URLs and found more uses for this code, I’m much happier with the decision to abstract it into a standalone library. Now we only need to add support for each new URL variant once, and then all our projects can benefit.

If you want to try the Flickr URL parser yourself, all the code is open source on GitHub.

Data Lifeboat Update 4: What a service architecture could be like

We’re starting to write code for our Data Lifeboat, and that’s pushed us to decide what the technical architecture looks like. What are the different systems and pieces involved in creating a Data Lifeboat? In this article I’m going to outline what we imagine that might look like.

We’re still very early in the prototyping stage of this work. Our next step is going to be building an end-to-end prototype of this design, and seeing how well it works.

Here’s the diagram we drew on the whiteboard last week:

Let’s step through it in detail.

First somebody has to initiate the creation of a Data Lifeboat, and choose the photos they want to include. There could be a number of ways to start this process: a command-line tool, a graphical web app, a REST API.

We’re starting to think about what those interfaces will look like, and how they’ll work. When somebody creates a Data Lifeboat, we need more information than just a list of photos. We know we’re going to need things like legal agreements, permission statements, and a description of why the Lifeboat was created. All this information needs to be collected at this stage.

However these interfaces work, it all ends in the same way: with a request to create a Data Lifeboat for a list of photos and their metadata from Flickr.

To take a list of photos and create a Data Lifeboat, we’ll have a new Data Lifeboat Creator service. This will call the Flickr API to fetch all the data from Flickr.com, and package it up into a new file. This could take a long time, because we need to make a lot of API calls! (Minutes, if not hours.)

We already have the skeleton of this service in the Commons Explorer, and we expect to reuse that code for the Data Lifeboat.

We are also considering creating an index of all the Data Lifeboats we’ve created – for example, “Photo X was added to Data Lifeboat Y on date Z”. This would be a useful tool for people wanting to look up Flickr URLs if the site ever goes away. “I have a reference to photo X, where did that end up after Flickr?”

When all the API calls are done, this service will eventually produce a complete, standalone Data Lifeboat which is ready to be stored!

When we create the Data Lifeboat, we’re imagining we’ll keep it on some temporary storage owned by the Flickr Foundation. Once the packaging is complete, the person or organization who requested it can download it to their permanent storage. Then it becomes their responsibility to make sure it’s kept safely – for example, creating backups or storing it in multiple geographic locations.

The Flickr Foundation isn’t going to run a single, permanent store of all Data Lifeboats ever created. That would turn us into another Single Point of Failure, which is something we’re keen to avoid!

There are still lots of details to hammer out at every step of this process, but thinking about the broad shape of the Data Lifeboat service has already been useful. It’s helped us get a consistent understanding of what the steps are, and exposed more questions for us to ponder as we keep building.

How does the Commons Explorer work?

Last week we wrote an introductory post about our new Commons Explorer; today we’re diving into some of the technical details. How does it work under the hood?

When we were designing the Commons Explorer, we knew we wanted to look across the Commons collection – we love seeing a mix of photos from different members, not just one account at a time. We wanted to build more views that emphasize the breadth of the collection, and help people find more photos from more members.

We knew we’d need the Flickr API, but it wasn’t immediately obvious how to use it for this task. The API exposes a lot of data, but it can only query the data in certain ways.

For example, we wanted the homepage to show a list of recent uploads from every Flickr Commons member. You can make an API call to get the recent uploads for a single user, but there’s no way to get all the uploads for multiple users in a single API call. We could make an API call for every member, but with over 100 members we’d be making a lot of API calls just to render one component of one page!

It would be impractical to fetch data from the API every time we render a page – but we don’t need to. We know that there isn’t that much activity in Flickr Commons – it isn’t a social media network with thousands of updates a second – so rather than get data from the API every time somebody loads a page, we decided it’s good enough to get it once a day. We trade off a bit of “freshness” for a much faster and more reliable website.

We’ve built a Commons crawler that runs every night, and makes thousands of Flickr API calls (within the API’s limits) to populate a SQLite database with all the data we need to power the Commons Explorer. SQLite is a great fit for this sort of data – it’s easy to run, it gives us lots of flexibility in how we query the data, and it’s wicked fast with the size of our collection.
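
The crawler itself is mostly a loop over paginated API responses. Here’s a compressed sketch of the idea using httpx – the API key, the table schema, and the error handling are all simplified or made up for illustration:

import httpx
import sqlite3

API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder

con = sqlite3.connect("commons_explorer.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS photos (id TEXT PRIMARY KEY, owner TEXT, title TEXT)"
)


def call_flickr(method, **params):
    resp = httpx.get(
        "https://api.flickr.com/services/rest/",
        params={
            "method": method,
            "api_key": API_KEY,
            "format": "json",
            "nojsoncallback": "1",
            **params,
        },
    )
    resp.raise_for_status()
    return resp.json()


# Fetch every page of one member's photostream, 500 photos at a time
page = 1
while True:
    data = call_flickr(
        "flickr.people.getPhotos",
        user_id="<member-nsid>",  # placeholder for a Commons member's NSID
        per_page="500",
        page=str(page),
    )

    for photo in data["photos"]["photo"]:
        con.execute(
            "INSERT OR REPLACE INTO photos (id, owner, title) VALUES (?, ?, ?)",
            (photo["id"], photo["owner"], photo["title"]),
        )
    con.commit()

    if page >= data["photos"]["pages"]:
        break
    page += 1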

There are three main tables in the database:

  • The members
  • The photos uploaded by all the members
  • The comments on all those photos

We’re using a couple of different APIs to get this information:

  • The flickr.commons.getInstitutions API gives us a list of all the current Commons members. We combine this with the flickr.people.getInfo API to get more detailed information about each member (like their profile page description).
  • The flickr.people.getPhotos API gives us a list of all the photos in each member’s photostream. This takes quite a while to run – it returns up to 500 photos per call, but there are over 1.8 million photos in Flickr Commons.
  • The flickr.photos.comments.getList API gives us a list of all the comments on a single photo. To save us calling this 1.8 million times, we have some logic to check if there are any (new) comments since the last crawl – we don’t need to call this API if nothing has changed.

We can then write SQL queries to query this data in interesting ways, including searching photos and comments from every member at once.
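
For example, a cross-Commons search over photo titles and comments can be a single query joining those tables. A sketch against a hypothetical schema – the real table and column names may differ:

import sqlite3

con = sqlite3.connect("commons_explorer.db")

search_term = "%lighthouse%"

rows = con.execute(
    """
    SELECT members.name, photos.id, photos.title
    FROM photos
    JOIN members ON members.id = photos.owner
    LEFT JOIN comments ON comments.photo_id = photos.id
    WHERE photos.title LIKE :q OR comments.text LIKE :q
    GROUP BY photos.id
    LIMIT 25
    """,
    {"q": search_term},
).fetchall()

for member_name, photo_id, title in rows:
    print(f"{member_name}: {title} ({photo_id})")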

We have a lightweight Flask web app that queries the SQLite database and renders the results as nice HTML pages. This is what you see when you browse the website at https://commons.flickr.org/.

We have a couple of pages where we call the Flickr API to get the most up-to-date data (on individual member pages and the cross-Commons search), but most of the site is coming from the SQLite database. After fine-tuning the database with a couple of indexes, it’s now plenty fast, and gives us a bunch of exciting new ways to explore the Commons.

Having all the data in our own database also allows us to learn new stuff about the Flickr Commons collection that we can’t see on Flickr itself – like the fact that it has 1.8 million photos, or that Flickr Commons as a whole has had 4.4 billion views.

This crawling code has been an interesting test bed for another project – we’ll be doing something very similar to populate a Data Lifeboat, but we’ll talk more about that in a separate post.

Data Lifeboat Update 2: More questions than answers

By Ewa Spohn

Thanks to the Digital Humanities Advancement Grant we were awarded by the National Endowment for the Humanities, our Data Lifeboat project (which is part of the Content Mobility Program) is now well and truly underway. The Data Lifeboat is our response to the challenge of archiving the 50 billion or so images currently on Flickr, should the service go down. It’s simply too big to archive as a whole, and we think that these shared histories should be available for the long term, so we’re exploring a decentralized approach. Find out more about the context for this work in our first blog post.

So, after our kick-off last month, we were left with a long list of open questions. That list became longer thanks to our first all-hands meeting that took place shortly afterwards! It grew again once we had met with the project user group – staff from the British Library, San Diego Air & Space Museum, and Congregation of Sisters of St Joseph – a small group representing the diversity of Flickr Commons members. Rather than being overwhelmed, we were buoyed by the obvious enthusiasm and encouragement across the group, all of whom agreed that this is very much an idea worth pursuing. 

As Mia Ridge from the British Library put it: “we need ephemeral collections to tell the story of now and give people who don’t currently think they have a role in preservation a different way of thinking about it”. And as Mary Grace of the Congregation of Sisters of St. Joseph in Canada put it: “we [the smaller institutions] don’t want to be the 3rd class passengers who drown first”.

Software sketching

We’ve begun working on the software approach to create a Data Lifeboat, focussing on the data model and assessing existing protocols we may use to help package it. Alex and George started creating some small prototypes to test how we should include metadata, and have begun exploring what “social metadata” could be like – that’s the kind of metadata that can only be created on Flickr, and is therefore a required element in any Data Lifeboat (as you’ll see from the diagram below, it’s complex). 


Feb 2024: An early sketch of a Data Lifeboat’s metadata graph structure.

Thanks to our first set of tools, Flinumeratr and Flickypedia, we have robust, reusable code for getting photos and metadata from Flickr. We’ve done some experiments with JSON, XML, and METS as possible ways to store the metadata, and started to imagine what a small viewer included in each Data Lifeboat might look like.

Complexity of long-term licensing

Alongside the technical development we have started developing our understanding of the legal issues that a Data Lifeboat is going to have to navigate to avoid unintended consequences of long-term preservation colliding with licenses set in the present. We discussed how we could build care and informed participation into the infrastructure, and what the pitfalls might be. There are fiddly questions around creating a Data Lifeboat containing photos from other Flickr members. 

  • As the image creator, would you need to be notified if one of your images has been added to a Data Lifeboat? 
  • Conversely, how would you go about removing an image from a Data Lifeboat? 
  • What happens if there’s a copyright dispute regarding images in a Data Lifeboat that is docked somewhere else? 

We discussed which aspects of other legal and licensing models might apply to Data Lifeboats, given the need to maintain stewardship and access over the long term (100 years at least!), as well as the need for the software to remain usable over this kind of time horizon. This isn’t something that the world of software has ready answers for. 

  • Could Flickr.org offer this kind of service? 
  • How would we notify future users of the conditions of the license, let alone monitor the decay of licenses in existing Data Lifeboats over this kind of timescale? 

So many standards to choose from

We had planned to do a deep dive into the various digital asset management systems used by cultural institutions, but this turned out to be a trickier subject than we thought as there are simply too many approaches, tools, and cobbled-together hacks being used in cultural institutions. Everyone seems to be struggling with this, so it’s not clear (yet) how best to approach this. If you have any ideas, let us know!

This work is supported by the National Endowment for the Humanities.

NEH logo

Introducing Flickypedia, our first tool

Building a new bridge between Flickr and Wikimedia Commons

For the past four months, we’ve been working with the Culture & Heritage team at the Wikimedia Foundation — the non-profit that operates Wikipedia, Wikimedia Commons, and other Wikimedia free knowledge projects — to build Flickypedia, a new tool for bridging the gap between photos on Flickr and files on Wikimedia Commons. Wikimedia Commons is a free-to-use library of illustrations, photos, drawings, videos, and music. By contributing their photos to Wikimedia Commons, Flickr photographers help to illustrate Wikipedia, a free, collaborative encyclopedia written in over 300 languages. More than 1.7 billion unique devices visit Wikimedia projects every month.

We demoed the initial version at GLAM Wiki 2023 in Uruguay, and now that we’ve incorporated some useful feedback from the Wikimedia community, we’re ready to launch it. Flickypedia is now available at https://www.flickr.org/tools/flickypedia/, and we’re really pleased with the result. Our goal was to create higher quality records on Wikimedia Commons, with better connected data and descriptive information, and to make it easier for Flickr photographers to see how their photos are being used.

This project has achieved our original goals – and a couple of new ones we discovered along the way.

So what is Flickypedia?

An easy way to copy photos from Flickr to Wikimedia Commons

The original vision of Flickypedia was a new tool for copying photos from Flickr to Wikimedia Commons, a re-envisioning of the popular Flickr2Commons tool, which copied around 5.4M photos.

This new upload tool is what we built first, leveraging ideas from Flinumeratr, a toy we built for exploring Flickr photos. You start by entering a Flickr URL:

And then Flickypedia will find all photos at that URL, and show you the ones which are suitable for copying to Wikimedia Commons. You can choose which photos you want to upload:

Then you enter a title, a short description, and any categories you want to add to the photo(s):

Then you click “Upload”, and the photo(s) are copied to Wikimedia Commons. Once it’s done, you can leave a comment on the original Flickr photo, so the photographer can see the photo in its new home:

As well as the title and caption written by the uploader, we automatically populate a series of machine-readable metadata fields (“Structured Data on Commons” or “SDC”) based on the Flickr information – the original photographer, date taken, a link to the original, and so on. You can see the exact list of fields in our data modeling document. This should make it easier for Commons users to find the photos they need, and maintain the link to the original photo on Flickr.

This flow has a little more friction than some other Flickr uploading tools, which is by design. We want to enable high-quality descriptions and metadata for carefully selected photos, not just bulk copying for the sake of copying. Our goal is to get high quality photos on Wikimedia Commons, with rich metadata which enables them to be discovered and used – and that’s what Flickypedia enables.

Reducing risk and responsible licensing

Flickr photographers can choose from a variety of licenses, and only some of them can be used on Wikimedia Commons: CC0, Public Domain, CC BY and CC BY-SA. If it’s any other license, the photo shouldn’t be on Wikimedia Commons, according to its licensing policy.

As we were building the Flickypedia uploader, we took the opportunity to emphasize the need for responsible licensing – when you select your photographs, it checks the licenses, and doesn’t allow you to copy anything that doesn’t have a Commons-compatible license:

This helps to reduce risk for everyone involved with Flickr and Wikimedia Commons.
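
The gate itself is conceptually tiny – something like this, assuming we match on license names as Flickr reports them (the real Flickypedia code may key off Flickr’s numeric license IDs instead):

# Licenses that Wikimedia Commons will accept, per its licensing policy
ALLOWED_LICENSES = {"CC0", "Public Domain", "CC BY", "CC BY-SA"}


def can_be_copied_to_commons(photo):
    """`photo` is a dict of Flickr metadata with a human-readable license name."""
    return photo["license"] in ALLOWED_LICENSES


print(can_be_copied_to_commons({"license": "CC BY"}))     # True
print(can_be_copied_to_commons({"license": "CC BY-NC"}))  # False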

Better duplicate detection

When we looked at the feedback on existing Flickr upload tools, there was one bit of overwhelming feedback: people want better duplicate detection. There are already over 11 million Flickr photos on Wikimedia Commons, and if a photo has already been copied, it doesn’t need to be copied again.

Wikimedia Commons already has some duplicate detection. It’ll spot if you upload a byte-for-byte identical file, but it can’t detect duplicates if the photo has been subtly altered – say, converted to a different file format, or a small border cropped out.

It turns out that there’s no easy way to find out if a given Flickr photo is in Wikimedia Commons. Although most Flickr upload tools will embed that metadata somewhere, they’re not consistent about it. We found at least four ways to spot possible duplicates:

  • You could look for a Flickr URL in the structured data (the machine-readable metadata)
  • You could look for a Flickr URL in the Wikitext (the human-readable description)
  • You could look for a Flickr ID in the filename
  • Or Flickypedia could know that it had already uploaded the photo

And even looking for matching Flickr URLs can be difficult, because there are so many forms of Flickr URLs – here are just some of the varieties of Flickr URLs we found in the existing Wikimedia Commons data:

(And this is without some of the smaller variations, like trailing slashes and http/https.)

We’d already built a Flickr URL parser as part of Flinumeratr, so we were able to write code to recognise these URLs – but it’s a fairly complex component, and that only benefits Flickypedia. We wanted to make it easier for everyone.

So we did!

We proposed a new Flickr Photo ID property, and it was accepted. This is a new field in the machine-readable structured data, which can contain the photo’s numeric ID – a clean, unambiguous pointer to the original photo that dramatically simplifies the process of looking for existing Flickr photos.

When Flickypedia uploads a new photo to Wikimedia Commons, it adds this new property. This should make it easier for other tools to find Flickr photos uploaded with Flickypedia, and skip re-uploading them.

Backfillr Bot: Making Flickr metadata better for all Flickr photos on Commons

That’s great for new photos uploaded with Flickypedia – but what about photos uploaded with other tools, tools that don’t use this field? What about the 10M+ Flickr photos already on Wikimedia Commons? How do we find them?

To fix this problem, we created a new Wikimedia Commons bot: Flickypedia Backfillr Bot. It goes back and fills in structured data on Flickr photos on Commons, including the Flickr Photo ID property. It uses our URL parser to identify all the different forms of Flickr URLs.

This bot is still in a preliminary stage, waiting for approval from the Wikimedia Commons community. Once that approval is granted, we’ll be able to improve the metadata for every Flickr photo on Wikimedia Commons, and create a hook that other tools can use – either to fill in more metadata, or to search for Flickr photos.

Sydney Harbour Bridge, from the Museums of History New South Wales. No known copyright restrictions.

Flickypedia started as a tool for copying photos from Flickr to Wikimedia Commons. From the very start, we had ideas about creating stronger links between the two – the “say thanks” feature, where uploaders could leave a comment for the original Flickr photographer – but that was only for new photos.

Along the way, we realized we could build a proper two-way bridge, and strengthen the connection between all Flickr photos on Wikimedia Commons, not just those uploaded with Flickypedia.

We think this ability to follow a photo around the web is really important – to see where it’s come from, and to see where it’s going. A Flickr photo isn’t just an image, it comes with a social context and history, and being uploaded to Wikimedia Commons is the next step in its journey. You can’t separate an image from its context.

As we start to focus on Data Lifeboat, we’ll spend even more time looking at how to preserve the history of a photo – and Flickypedia has given us plenty to think about.

If you want to use Flickypedia to upload some photos to Wikimedia Commons, visit www.flickr.org/tools/flickypedia.

If you want to look at the source code, go to github.com/Flickr-Foundation/flickypedia.

Introducing flinumeratr, our first toy

by Alex

Today we’re pleased to release Flinumeratr, our first toy. You enter a Flickr URL, and it shows you a list of photos that you’d see at that URL:

This is the first engineering step towards what we’ll be building for the rest of this quarter: Flickypedia, a new tool for copying Creative Commons-licensed photos from Flickr to Wikimedia Commons.

As part of Flickypedia, we want to make it easy to select photos from Flickr that are suitable for Wikimedia Commons. You enter a Flickr URL, and Flickypedia will work out what photos are available. This “Flickr URL enumerator”, or “Flinumeratr”, is a proof-of-concept of that idea. It knows how to recognise a variety of URL types, including individual photos, albums, galleries, and a member’s photostream.

We call it a “toy” quite deliberately – it’s a quick thing, not a full-featured app. Keeping it small means we can experiment, try things quickly, and learn a lot in a short amount of time. We’ll build more toys as we have more ideas. Some of those ideas will be reused in bigger projects, and others will be dropped.

Flinumeratr is a playground for an idea for Flickypedia, but it’s also been a context for starting to develop our approach to software development. We’ve been able to move quickly – this is only my fourth day! – but starting a brand new project is always the easy bit. Maintaining that pace is the hard part.

We’re all learning how to work together, I’m dusting off my knowledge of the Flickr API, and we’re establishing some basic coding practices. Things like a test suite, documentation, checks on pull requests, and other guard rails that will help us keep moving. Setting those up now will be much easier than trying to retrofit them later. There’s plenty more we have to decide, but we’re off to a good start.

Under the hood, Flinumeratr is a Python web app written in Flask. We’re calling the Flickr API with the httpx library, and testing everything with pytest and vcrpy. The latter in particular has been so helpful – it “records” interactions with the Flickr API so I can replay them later in our test suite. If you’d like to see more, all our source code is on GitHub.
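
Here’s the shape of a vcrpy-backed test – a hedged sketch rather than one of Flinumeratr’s real tests. The cassette path is made up, and flickr.test.echo is a real Flickr API method that answers even without a valid key (with an error payload), which is enough to show the record-and-replay pattern:

import httpx
import vcr


@vcr.use_cassette("tests/fixtures/flickr_echo.yml")
def test_flickr_api_is_reachable():
    # The first run makes a real HTTP call and records it to the cassette;
    # later runs replay the recorded response, so the test is fast and
    # deterministic even without network access.
    resp = httpx.get(
        "https://api.flickr.com/services/rest/",
        params={"method": "flickr.test.echo", "format": "json", "nojsoncallback": "1"},
    )

    assert resp.status_code == 200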

You can try Flinumeratr at https://flinumeratr.glitch.me. Please let us know what you think!