Making some marvelous maps

This week we added maps to our Commons Explorer, and it’s proving to be a fun new way to find photos.

There are over 50,000 photos in the Flickr Commons collection which have location information telling us where the photo was taken. We can plot those locations on a map of the world, so you can get a sense of the geographical spread:

This map is interactive, so you can zoom in and move around to focus on a specific place. As you do, we’ll show you a selection of photos from the area you’ve selected.

You can also filter the map, so you see photos from just a single Commons member. For smaller members the map points can tell a story in themselves, and give you a sense of where a collection is and what it’s about:

These maps are available now, and know about the location of every geotagged photo in Flickr Commons.

Give them a try!

How can you add a location to a Flickr Commons photo?

For the first version of this map, we use the geotag added by the photo’s owner.

If you’re a Flickr Commons member, you can add locations to your photos and they’ll automatically show up on this map. The Flickr Help Center has instructions for how to do that.

It’s possible for other Flickr members to add machine tags to photos, and there are already thousands of crowdsourced tags that have location-related information. We don’t show those on the map right now, but we’re thinking about how we might do that in future!

How does the map work?

There are three technologies that make these maps possible.

The first is SQLite, the database engine we use to power the Commons Explorer. We have a table which contains every photo in the Flickr Commons, and it includes any latitude and longitude information. SQLite is wicked fast and our collection is small potatoes, so it can get the data to draw these maps very quickly.

I’d love to tell you about some deeply nerdy piece of work to hyper-optimize our queries, but it wasn’t necessary. I wrote the naïve query, added a couple of column indexes, and that first attempt was plenty fast. Tallying the locations for the entire Flickr Commons collection takes ~45ms; tallying the locations for an individual member is often under a millisecond.
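To give a flavour of what a naïve query plus a couple of column indexes can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are all assumptions made up for illustration; the post doesn't describe the real Commons Explorer schema.

```python
import sqlite3

# Hypothetical schema: the real Commons Explorer table and column names
# are not given in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE photos (
        id TEXT PRIMARY KEY,
        owner_id TEXT NOT NULL,
        latitude REAL,
        longitude REAL
    )
    """
)

# A couple of column indexes: one for filtering by member, one for the
# coordinates we group by.
conn.execute("CREATE INDEX idx_photos_owner ON photos (owner_id)")
conn.execute("CREATE INDEX idx_photos_location ON photos (latitude, longitude)")

# Made-up sample rows
conn.executemany(
    "INSERT INTO photos VALUES (?, ?, ?, ?)",
    [
        ("1", "member_a", 51.5, -0.1),
        ("2", "member_a", 51.5, -0.1),
        ("3", "member_b", -33.9, 151.2),
        ("4", "member_b", None, None),  # not geotagged
    ],
)


def tally_locations(conn, owner_id=None):
    """Count geotagged photos at each (latitude, longitude) pair."""
    query = (
        "SELECT latitude, longitude, COUNT(*) FROM photos "
        "WHERE latitude IS NOT NULL AND longitude IS NOT NULL"
    )
    params = []
    if owner_id is not None:
        # Optionally restrict the tally to a single member
        query += " AND owner_id = ?"
        params.append(owner_id)
    query += " GROUP BY latitude, longitude ORDER BY latitude, longitude"
    return conn.execute(query, params).fetchall()
```

Calling tally_locations(conn) tallies the whole table, while tally_locations(conn, "member_b") restricts it to one member, which mirrors the two cases timed above.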

The second is Leaflet.js, a JavaScript library for interactive maps. This is a popular and feature-rich library that made it easy for us to add a map to the site. Combined with a marker clustering plugin, we had a lot of options for configuring the map to behave exactly as we wanted, and to connect it to Flickr Commons data.

The third is OpenStreetMap. This is a world map maintained by a community of volunteers, and we use their map tiles as the backdrop for our map.

Plus ça Change

To help us track changes to the Commons Explorer, we’ve added another page: the changelog.

This is part of our broader goal of archiving the organization. Even in the six months since we launched the Explorer, it’s easy to forget what happened when, and new features quickly feel normal. The changelog is a place for us to remember what’s changed and what the site used to look like, as we continue to make changes and improvements.

Working with snapshots of structured data on Wikimedia Commons

In my previous post about Flickypedia Backfillr Bot, I mentioned that Flickypedia uses snapshots of structured data on Wikimedia Commons to spot possible duplicates:

We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

As we’ve been working on Flickypedia, we’ve developed a few tactics for working with these snapshots, which we thought might be useful for other people working with Wikimedia Commons data.

What are these snapshots?

Files on Wikimedia Commons can contain structured data—machine-readable metadata saying where the file came from, the license of the file, when it was created, and so on. For a longer explanation of structured data, read my previous post.

The structured data snapshots are JSON files that contain the structured data statements for all the files on Wikimedia Commons. (One of many public dumps of Wikimedia content.) These snapshots are extremely useful if you have a task that involves searching the database en masse – for example, finding all the Flickr photos on Commons.

All the snapshots we worked with are available for download from https://dumps.wikimedia.org/commonswiki/entities/, and new snapshots are typically created a few times a week.

Do you need snapshots?

Snapshots can be cumbersome, so if you need a quick answer, there may be better ways to get data out of Wikimedia Commons, like Special:MediaSearch and the Commons Query Service, which both support querying on structured data. But if you need to look at Wikimedia Commons as a whole, or run some sort of complex query or analysis that doesn’t fit into an existing tool, the structured snapshots can be very useful.

We’ve already found several use cases for them at the Flickr Foundation:

  • Finding every Flickr photo on Wikimedia Commons. As discussed in previous posts, the many variants of Flickr URL make it difficult to run a query for Flickr photos on Commons – but we can do this analysis easily with a snapshot. We can parse the data in the snapshot with our Flickr URL parser and store the normalised information in a new database.
  • Seeing how structured data is already being used. When we were designing the Flickypedia data model, part of our research involved looking at how structured data was already being used for Flickr photos. Using the snapshots, we could look for examples we could mimic, and compare our ideas to the existing data. Was our proposal following a popular, well-established approach, or was it novel and perhaps more controversial?
  • Verifying our assumptions about structured data. By doing an exhaustive search of the structured data, we could check if our assumptions were correct – and sometimes we’d find counterexamples that forced us to rethink our approach. For example, we assumed that “every Wikimedia Commons file comes from zero or one Flickr photos”. Looking at the snapshots told us this was false – there are some files which link to multiple Flickr photos, because the same photo was uploaded to Flickr multiple times.
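As an illustration of the last use case, checking the "zero or one Flickr photos" assumption boils down to grouping links by file and looking for any file with more than one photo. This is a minimal sketch, assuming the links have already been extracted from a snapshot as (filename, photo ID) pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_files_with_multiple_photos(links):
    """Return {filename: sorted photo IDs} for files linked to 2+ Flickr photos.

    `links` is an iterable of (commons_filename, flickr_photo_id) pairs.
    """
    photos_by_file = defaultdict(set)
    for filename, photo_id in links:
        photos_by_file[filename].add(photo_id)

    # Any file with more than one distinct photo ID is a counterexample
    return {
        filename: sorted(photo_ids)
        for filename, photo_ids in photos_by_file.items()
        if len(photo_ids) > 1
    }


# Made-up sample data for illustration
links = [
    ("File:Bridge.jpg", "111"),
    ("File:Bridge.jpg", "222"),
    ("File:Harbour.jpg", "333"),
    ("File:Harbour.jpg", "333"),  # the same link seen twice is fine
]
counterexamples = find_files_with_multiple_photos(links)
```

An empty result would support the assumption; anything else is a list of files worth inspecting by hand.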

How do you download a snapshot?

The snapshots are fairly large: the latest snapshots are over 30GB, and that’s only getting bigger as more structured data is created. It takes me multiple hours to download a snapshot, and that can be annoying if the connection drops partway through.

Fortunately, Wikimedia Commons has a well-behaved HTTP server that supports resumable downloads. There are lots of download managers that can resume the download when it gets interrupted, so you can download a snapshot over multiple sessions. I like curl because it’s so ubiquitous – there’s a good chance it’s already installed on whatever computer I’m using.

This is an example of the curl command I run:

curl \
  --location \
  --remote-name \
  --continue-at - \
  "https://dumps.wikimedia.org/commonswiki/entities/20240617/commons-20240617-mediainfo.json.gz"

I usually have to run it multiple times to get a complete download, but it does eventually succeed. The important flag here is --continue-at -, which tells curl to resume a previous download.
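If you're curious what resuming involves at the HTTP level: the client asks for only the bytes it doesn't already have, using a Range header. Here is a minimal sketch of that idea in Python's standard library. It isn't a replacement for curl, and the function name is hypothetical; it only builds the request and doesn't touch the network.

```python
import os
import urllib.request


def build_resume_request(url, local_path):
    """Build a request that asks only for the bytes we don't already have."""
    if os.path.exists(local_path):
        already_downloaded = os.path.getsize(local_path)
    else:
        already_downloaded = 0

    request = urllib.request.Request(url)
    if already_downloaded > 0:
        # An open-ended byte range: "from this offset to the end of the file"
        request.add_header("Range", f"bytes={already_downloaded}-")
    return request
```

A server that supports resumable downloads answers a request like this with a 206 Partial Content response, and the client appends the body to the partial file it already has.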

Which format should you download?

The snapshots are available in two formats: bzip2-compressed JSON, and gzip-compressed JSON. They have identical contents, just compressed differently. Which should you pick?

I wasn’t sure which format was right, so when I was getting started, I downloaded both and ran some experiments to see which was a better fit for our use case. We iterate through every file in a snapshot as part of Flickypedia, so we wanted a format we could read quickly.

The file sizes are similar: 33.6GB for bzip2, 43.4GB for gzip. Both of these are manageable downloads for us, so file size wasn’t a deciding factor.

Then I ran a benchmark on my laptop to see how long it took to read each format. This command is just uncompressing each file, and measuring the time it takes:

$ time bzcat commons-20240617-mediainfo.json.bz2 >/dev/null
Executed in 113.48 mins

$ time gzcat commons-20240617-mediainfo.json.gz >/dev/null
Executed in 324.17 secs

That’s not a small difference: gzip is 21 times faster to uncompress than bzip2. Even accounting for the fairly unscientific test conditions, it was the clear winner. For Flickypedia, we use the gzip-compressed snapshots.
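The 21× figure comes straight from the two timings above, once they're converted to the same unit:

```python
bzip2_seconds = 113.48 * 60  # bzcat took 113.48 minutes
gzip_seconds = 324.17        # gzcat took 324.17 seconds

speedup = bzip2_seconds / gzip_seconds
print(f"gzip is about {speedup:.0f}x faster to uncompress")
```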

What’s inside a snapshot?

An uncompressed snapshot is big – the latest snapshot contains nearly 400GB of JSON.

The file contains a single, massive JSON array:

[
   { … data for the first file … },
   { … data for the second file … },
   …
   { … data for the last file … }
]

Aside from the opening and closing square brackets, each line has a JSON object that contains the data for a single file on Wikimedia Commons. This makes it fairly easy to stream data from this file, without trying to parse the entire snapshot at once.

If you’re curious about the structure of the data, we have some type definitions in Flickypedia: one for the top-level snapshot entries, one for the Wikidata data model which is used for structured data statements. Unfortunately I haven’t been able to find a lot of documentation for these types on Wikimedia Commons itself.

How to read snapshots

The one-file-per-line structure of the snapshot JSON allows us to write a streaming parser in Python. This function reads one entry at a time, which is more efficient than loading the entire snapshot into memory at once:

import gzip
import json


def get_entries_from_snapshot(path):
    with gzip.open(path) as uncompressed_file:
        for line in uncompressed_file:

            # Skip the square brackets at the beginning/end of the file
            # which open/close the JSON array
            if line.strip() in {b"[", b"]"}:
                continue

            # Strip the trailing comma at the end of each line
            line = line.rstrip(b",\n")

            # Parse the line as JSON, and yield it to the caller
            entry = json.loads(line)
            yield entry


path = "commons-20240617-mediainfo.json.gz"

for entry in get_entries_from_snapshot(path):
    print(entry)

# {'type': 'mediainfo', 'id': 'M76', … }
# …

This does take a while – on my machine, it takes around 45 minutes just to read the snapshot, with no other processing.

To avoid having to do this too often, my next step is to extend this script to extract the key information I want from the snapshot.

For example, for Flickypedia, we’re only really interested in P12120 (Flickr Photo ID) and P7482 (Source of File) when we’re looking for Flickr photos which are already on Commons. A script which extracts just those two fields can reduce the size of the data substantially, and give me a file that’s easier to work with.
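Here is a minimal sketch of what such an extraction script might look like. It assumes each snapshot entry keeps its statements in a dictionary keyed by property ID under a "statements" key, which is worth checking against a real entry before relying on it. The function names and the one-JSON-object-per-line output format are choices made for this example, not a description of the Flickypedia code.

```python
import json

# Assumption: each entry keeps its statements in a dict keyed by property ID
# under a "statements" key. Check this against a real snapshot entry.
WANTED_PROPERTIES = ("P12120", "P7482")


def extract_flickr_statements(entry):
    """Return a slimmed-down copy of an entry, or None if it has no wanted data."""
    statements = entry.get("statements")
    if not isinstance(statements, dict):
        return None

    wanted = {
        prop: statements[prop] for prop in WANTED_PROPERTIES if prop in statements
    }
    if not wanted:
        return None

    return {"id": entry.get("id"), "statements": wanted}


def write_extract(entries, out_path):
    """Write one slimmed-down JSON object per line; return how many were kept."""
    kept = 0
    with open(out_path, "w") as out_file:
        for entry in entries:
            slim = extract_flickr_statements(entry)
            if slim is not None:
                out_file.write(json.dumps(slim) + "\n")
                kept += 1
    return kept
```

You could feed this from the streaming parser above, for example write_extract(get_entries_from_snapshot(path), "flickr_statements.jsonl"), so the slow pass over the full snapshot only has to happen once.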

The surprising utility of a Flickr URL parser

In my first week at the Flickr Foundation, we made a toy called Flinumeratr. This is a small web app that takes a Flickr URL as input, and shows you all the photos which are present at that URL.

As part of this toy, I made a Python library which parses Flickr URLs, and tells you what the URL points to – a single photo, an album, a gallery, and so on. Initially it just handled fairly common patterns, the sort of URLs that you’d encounter if you use Flickr today, but it’s grown to handle more complicated URLs.

$ flickr_url_parser "https://www.flickr.com/photos/sdasmarchives/50567413447"
{"type": "single_photo", "photo_id": "50567413447"}

$ flickr_url_parser "https://www.flickr.com/photos/aljazeeraenglish/albums/72157626164453131"
{"type": "album", "user_url": "https://www.flickr.com/photos/aljazeeraenglish", "album_id": "72157626164453131", "page": 1}

$ flickr_url_parser "https://www.flickr.com/photos/blueminds/page3"
{"type": "user", "user_url": "https://www.flickr.com/photos/blueminds"}

The implementation is fairly straightforward: I use the hyperlink library to parse the URL text into a structured object, then I compare that object to a list of known patterns. Does it look like this type of URL? Or this type of URL? Or this type of URL? And so on.
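To make the "compare against a list of known patterns" idea concrete, here is a heavily simplified sketch. It uses urllib.parse from the Python standard library rather than hyperlink, and it only knows three patterns, so treat it as an illustration of the approach rather than the library's actual code. The function name is hypothetical.

```python
from urllib.parse import urlparse


def parse_flickr_url(url):
    """Classify a small subset of Flickr URLs; raise ValueError otherwise."""
    parsed = urlparse(url)
    if parsed.netloc not in {"flickr.com", "www.flickr.com"}:
        raise ValueError(f"Not a Flickr URL: {url}")

    # Split the path into its non-empty components
    parts = [p for p in parsed.path.split("/") if p]

    # Does it look like a single photo?
    # e.g. /photos/sdasmarchives/50567413447
    if len(parts) == 3 and parts[0] == "photos" and parts[2].isdigit():
        return {"type": "single_photo", "photo_id": parts[2]}

    # Does it look like an album?
    # e.g. /photos/aljazeeraenglish/albums/72157626164453131
    if (
        len(parts) == 4
        and parts[0] == "photos"
        and parts[2] == "albums"
        and parts[3].isdigit()
    ):
        return {
            "type": "album",
            "user_url": f"https://www.flickr.com/photos/{parts[1]}",
            "album_id": parts[3],
            "page": 1,
        }

    # Does it look like a user's photostream?
    # e.g. /photos/flickrfoundation
    if len(parts) == 2 and parts[0] == "photos":
        return {
            "type": "user",
            "user_url": f"https://www.flickr.com/photos/{parts[1]}",
        }

    raise ValueError(f"Unrecognised Flickr URL: {url}")
```

Each new URL variant becomes one more "does it look like this?" check, which is why a long list of test cases is the natural companion to this style of parser.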

You can run this library as a command-line tool, or call it from Python – there are instructions in the GitHub README.

There are lots of URL variants

In my second week and beyond, I started to discover more variants, which should probably be expected in 20-year-old software! I’ve been looking into collections of Flickr URLs that have been built up over multiple years, and although most of these URLs follow common patterns, there are lots of unusual variants in the long tail.

Some of these are pretty simple. For example, the URL to a user’s photostream can be formed using your Flickr user NSID or your path alias, so flickr.com/photos/197130754@N07/ and flickr.com/photos/flickrfoundation/ point to the same page.

Others are more complicated, and you can trace the history of Flickr through some of the older URLs. Some of my favorites include:

  • Raw JPEG files, on live.staticflickr.com, farm1.static.flickr.com, and several other subdomains.

  • Links with a .gne suffix, like www.flickr.com/photo_edit.gne?id=3435827496 (from Wikimedia Commons). This acronym stands for Game Neverending, the online game out of which Flickr was born.

  • A Flash video player called stewart.swf, which might be a reference to Stewart Butterfield, one of the cofounders of Flickr.

I’ve added support for every variant of Flickr URL I’ve found to the parsing library – if you want to see a complete list, check out the tests. I need over a hundred tests to check all the variants are parsed correctly.

Where we’re using it

I’ve been able to reuse this parsing code in a bunch of different projects, including:

  • Building a similar “get photos at this URL” interface in Flickypedia.

  • Looking for Flickr photo URLs in Wikimedia Commons. This is for detecting Flickr photos which have already been uploaded to Commons, which I’ll describe more in another post.

  • Finding Flickr pages which have been captured in the Wayback Machine – I can get a list of saved Flickr URLs, and then see what sort of pages have actually been saved.

When I created the library, I wasn’t sure if this code was actually worth extracting as a standalone package – would I use it again, or was this a premature abstraction?

Now that I’ve seen more of the diversity of Flickr URLs and found more uses for this code, I’m much happier with the decision to abstract it into a standalone library. Now we only need to add support for each new URL variant once, and then all our projects can benefit.

If you want to try the Flickr URL parser yourself, all the code is open source on GitHub.

Data Lifeboat Update 4: What a service architecture could be like

We’re starting to write code for our Data Lifeboat, and that’s pushed us to decide what the technical architecture looks like. What are the different systems and pieces involved in creating a Data Lifeboat? In this article I’m going to outline what we imagine that might look like.

We’re still very early in the prototyping stage of this work. Our next step is going to be building an end-to-end prototype of this design, and seeing how well it works.

Here’s the diagram we drew on the whiteboard last week:

Let’s step through it in detail.

First, somebody has to initiate the creation of a Data Lifeboat, and choose the photos they want to include. There could be a number of ways to start this process: a command-line tool, a graphical web app, a REST API.

We’re starting to think about what those interfaces will look like, and how they’ll work. When somebody creates a Data Lifeboat, we need more information than just a list of photos. We know we’re going to need things like legal agreements, permission statements, and a description of why the Lifeboat was created. All this information needs to be collected at this stage.

However these interfaces work, it all ends in the same way: with a request to create a Data Lifeboat for a list of photos and their metadata from Flickr.

To take a list of photos and create a Data Lifeboat, we’ll have a new Data Lifeboat Creator service. This will call the Flickr API to fetch all the data from Flickr.com, and package it up into a new file. This could take a long time, because we need to make a lot of API calls! (Minutes, if not hours.)

We already have the skeleton of this service in the Commons Explorer, and we expect to reuse that code for the Data Lifeboat.

We are also considering creating an index of all the Data Lifeboats we’ve created – for example, “Photo X was added to Data Lifeboat Y on date Z”. This would be a useful tool for people wanting to look up Flickr URLs if the site ever goes away. “I have a reference to photo X, where did that end up after Flickr?”
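As a rough illustration of what such an index could record, here is a minimal sketch using SQLite. Nothing about the index has been designed yet, so the table name, columns, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical schema for "Photo X was added to Data Lifeboat Y on date Z"
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lifeboat_index (
        photo_id TEXT NOT NULL,
        lifeboat_id TEXT NOT NULL,
        date_added TEXT NOT NULL,  -- ISO 8601 date, e.g. 2024-07-01
        PRIMARY KEY (photo_id, lifeboat_id)
    )
    """
)


def record_photo(conn, photo_id, lifeboat_id, date_added):
    """Record that a photo was added to a Data Lifeboat on a given date."""
    conn.execute(
        "INSERT INTO lifeboat_index VALUES (?, ?, ?)",
        (photo_id, lifeboat_id, date_added),
    )


def find_lifeboats(conn, photo_id):
    """Answer "where did photo X end up?" as (lifeboat_id, date_added) rows."""
    return conn.execute(
        "SELECT lifeboat_id, date_added FROM lifeboat_index "
        "WHERE photo_id = ? ORDER BY date_added",
        (photo_id,),
    ).fetchall()
```

Keying on the photo ID keeps the lookup simple: somebody holding an old Flickr URL would first turn it into a photo ID, then ask the index which Data Lifeboats contain it.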

When all the API calls are done, this service will eventually produce a complete, standalone Data Lifeboat which is ready to be stored!

When we create the Data Lifeboat, we’re imagining we’ll keep it on some temporary storage owned by the Flickr Foundation. Once the packaging is complete, the person or organization who requested it can download it to their permanent storage. Then it becomes their responsibility to make sure it’s kept safely – for example, creating backups or storing it in multiple geographic locations.

The Flickr Foundation isn’t going to run a single, permanent store of all Data Lifeboats ever created. That would turn us into another Single Point of Failure, which is something we’re keen to avoid!

There are still lots of details to hammer out at every step of this process, but thinking about the broad shape of the Data Lifeboat service has already been useful. It’s helped us get a consistent understanding of what the steps are, and exposed more questions for us to ponder as we keep building.

How does the Commons Explorer work?

Last week we wrote an introductory post about our new Commons Explorer; today we’re diving into some of the technical details. How does it work under the hood?

When we were designing the Commons Explorer, we knew we wanted to look across the Commons collection – we love seeing a mix of photos from different members, not just one account at a time. We wanted to build more views that emphasize the breadth of the collection, and help people find more photos from more members.

We knew we’d need the Flickr API, but it wasn’t immediately obvious how to use it for this task. The API exposes a lot of data, but it can only query the data in certain ways.

For example, we wanted the homepage to show a list of recent uploads from every Flickr Commons member. You can make an API call to get the recent uploads for a single user, but there’s no way to get all the uploads for multiple users in a single API call. We could make an API call for every member, but with over 100 members we’d be making a lot of API calls just to render one component of one page!

It would be impractical to fetch data from the API every time we render a page – but we don’t need to. We know that there isn’t that much activity in Flickr Commons – it isn’t a social media network with thousands of updates a second – so rather than get data from the API every time somebody loads a page, we decided it’s good enough to get it once a day. We trade off a bit of “freshness” for a much faster and more reliable website.

We’ve built a Commons crawler that runs every night, and makes thousands of Flickr API calls (within the API’s limits) to populate a SQLite database with all the data we need to power the Commons Explorer. SQLite is a great fit for this sort of data – it’s easy to run, it gives us lots of flexibility in how we query the data, and it’s wicked fast with the size of our collection.

There are three main tables in the database:

  • The members
  • The photos uploaded by all the members
  • The comments on all those photos

We’re using a couple of different APIs to get this information:

  • The flickr.commons.getInstitutions API gives us a list of all the current Commons members. We combine this with the flickr.people.getInfo API to get more detailed information about each member (like their profile page description).
  • The flickr.people.getPhotos API gives us a list of all the photos in each member’s photostream. This takes quite a while to run – it returns up to 500 photos per call, but there are over 1.8 million photos in Flickr Commons.
  • The flickr.photos.comments.getList API gives us a list of all the comments on a single photo. To save us calling this 1.8 million times, we have some logic to check if there are any (new) comments since the last crawl – we don’t need to call this API if nothing has changed.
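To put rough numbers on why the crawl takes a while, here is the back-of-the-envelope arithmetic for the photo listing alone. It's a lower bound: it uses the rounded figure of 1.8 million photos and ignores the partially filled last page for each member.

```python
import math

total_photos = 1_800_000  # "over 1.8 million photos in Flickr Commons"
photos_per_call = 500     # flickr.people.getPhotos returns up to 500 per call

min_calls = math.ceil(total_photos / photos_per_call)
print(f"At least {min_calls:,} calls just to list the photos")
```

That comes to at least 3,600 calls for the listing, compared with 1.8 million calls if we fetched comments for every photo on every crawl, which is why skipping unchanged photos matters so much.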

We can then write SQL queries to query this data in interesting ways, including searching photos and comments from every member at once.
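For example, the "recent uploads from every member" list that was awkward to build from the API becomes a single query once the data is local. This is a minimal sketch with a made-up three-table schema and made-up sample rows; the real table and column names aren't given in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema for the three main tables
conn.executescript(
    """
    CREATE TABLE members (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE photos (
        id TEXT PRIMARY KEY,
        owner_id TEXT NOT NULL REFERENCES members(id),
        title TEXT,
        date_uploaded TEXT NOT NULL  -- ISO 8601, so it sorts as text
    );
    CREATE TABLE comments (
        id TEXT PRIMARY KEY,
        photo_id TEXT NOT NULL REFERENCES photos(id),
        text TEXT NOT NULL
    );
    CREATE INDEX idx_photos_date_uploaded ON photos (date_uploaded);
    """
)

# Made-up sample rows
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [("m1", "Member One"), ("m2", "Member Two")],
)
conn.executemany(
    "INSERT INTO photos VALUES (?, ?, ?, ?)",
    [
        ("p1", "m1", "Bridge", "2024-06-01"),
        ("p2", "m2", "Harbour", "2024-06-03"),
        ("p3", "m1", "Ferry", "2024-06-02"),
    ],
)


def recent_uploads(conn, limit=10):
    """Most recent uploads across all members, newest first."""
    return conn.execute(
        "SELECT photos.id, members.name, photos.title "
        "FROM photos JOIN members ON members.id = photos.owner_id "
        "ORDER BY photos.date_uploaded DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

The query doesn't care how many members there are, so the cost of rendering the component no longer grows with one API call per member.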

We have a lightweight Flask web app that queries the SQLite database and renders the results as nice HTML pages. This is what you see when you browse the website at https://commons.flickr.org/.

We have a couple of pages where we call the Flickr API to get the most up-to-date data (on individual member pages and the cross-Commons search), but most of the site is coming from the SQLite database. After fine-tuning the database with a couple of indexes, it’s now plenty fast, and gives us a bunch of exciting new ways to explore the Commons.

Having all the data in our own database also allows us to learn new stuff about the Flickr Commons collection that we can’t see on Flickr itself – like the fact that it has 1.8 million photos, or that Flickr Commons as a whole has had 4.4 billion views.

This crawling code has been an interesting test bed for another project – we’ll be doing something very similar to populate a Data Lifeboat, but we’ll talk more about that in a separate post.

Introducing Flickypedia, our first tool

Building a new bridge between Flickr and Wikimedia Commons

For the past four months, we’ve been working with the Culture & Heritage team at the Wikimedia Foundation — the non-profit that operates Wikipedia, Wikimedia Commons, and other Wikimedia free knowledge projects — to build Flickypedia, a new tool for bridging the gap between photos on Flickr and files on Wikimedia Commons. Wikimedia Commons is a free-to-use library of illustrations, photos, drawings, videos, and music. By contributing their photos to Wikimedia Commons, Flickr photographers help to illustrate Wikipedia, a free, collaborative encyclopedia written in over 300 languages. More than 1.7 billion unique devices visit Wikimedia projects every month.

We demoed the initial version at GLAM Wiki 2023 in Uruguay, and now that we’ve incorporated some useful feedback from the Wikimedia community, we’re ready to launch it. Flickypedia is now available at https://www.flickr.org/tools/flickypedia/, and we’re really pleased with the result. Our goal was to create higher quality records on Wikimedia Commons, with better connected data and descriptive information, and to make it easier for Flickr photographers to see how their photos are being used.

This project has achieved our original goals – and a couple of new ones we discovered along the way.

So what is Flickypedia?

An easy way to copy photos from Flickr to Wikimedia Commons

The original vision of Flickypedia was a new tool for copying photos from Flickr to Wikimedia Commons, a re-envisioning of the popular Flickr2Commons tool, which copied around 5.4M photos.

This new upload tool is what we built first, leveraging ideas from Flinumeratr, a toy we built for exploring Flickr photos. You start by entering a Flickr URL:

And then Flickypedia will find all photos at that URL, and show you the ones which are suitable for copying to Wikimedia Commons. You can choose which photos you want to upload:

Then you enter a title, a short description, and any categories you want to add to the photo(s):

Then you click “Upload”, and the photo(s) are copied to Wikimedia Commons. Once it’s done, you can leave a comment on the original Flickr photo, so the photographer can see the photo in its new home:

As well as the title and caption written by the uploader, we automatically populate a series of machine-readable metadata fields (“Structured Data on Commons” or “SDC”) based on the Flickr information – the original photographer, date taken, a link to the original, and so on. You can see the exact list of fields in our data modeling document. This should make it easier for Commons users to find the photos they need, and maintain the link to the original photo on Flickr.

This flow has a little more friction than some other Flickr uploading tools, which is by design. We want to enable high-quality descriptions and metadata for carefully selected photos; not just bulk copying for the sake of copying. Our goal is to get high quality photos on Wikimedia Commons, with rich metadata which enables them to be discovered and used – and that’s what Flickypedia enables.

Reducing risk and responsible licensing

Flickr photographers can choose from a variety of licenses, and only some of them can be used on Wikimedia Commons: CC0, Public Domain, CC BY and CC BY-SA. If it’s any other license, the photo shouldn’t be on Wikimedia Commons, according to its licensing policy.

As we were building the Flickypedia uploader, we took the opportunity to emphasize the need for responsible licensing – when you select your photographs, it checks the licenses, and doesn’t allow you to copy anything that doesn’t have a Commons-compatible license:

This helps to reduce risk for everyone involved with Flickr and Wikimedia Commons.
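The core of a check like this is an allow-list. Here is a minimal sketch, assuming each photo is a dictionary with a "license" label. The labels are simplified for illustration, and the function name and sample photos are hypothetical; this isn't a description of how Flickypedia represents licenses internally.

```python
# The four licenses named above as usable on Wikimedia Commons
COMMONS_COMPATIBLE_LICENSES = {"CC0", "Public Domain", "CC BY", "CC BY-SA"}


def split_by_license(photos):
    """Split photos into (allowed, blocked) lists based on their license."""
    allowed = []
    blocked = []
    for photo in photos:
        if photo["license"] in COMMONS_COMPATIBLE_LICENSES:
            allowed.append(photo)
        else:
            blocked.append(photo)
    return allowed, blocked


# Made-up sample photos
photos = [
    {"id": "1", "license": "CC BY"},
    {"id": "2", "license": "All Rights Reserved"},
    {"id": "3", "license": "CC BY-NC"},
    {"id": "4", "license": "CC0"},
]
allowed, blocked = split_by_license(photos)
```

Using an allow-list rather than a block-list means any license the code doesn't recognise is rejected by default, which is the safer failure mode here.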

Better duplicate detection

When we looked at the feedback on existing Flickr upload tools, there was one bit of overwhelming feedback: people want better duplicate detection. There are already over 10 million Flickr photos on Wikimedia Commons, and if a photo has already been copied, it doesn’t need to be copied again.

Wikimedia Commons already has some duplicate detection. It’ll spot if you upload a byte-for-byte identical file, but it can’t detect duplicates if the photo has been subtly altered – say, converted to a different file format, or a small border cropped out.

It turns out that there’s no easy way to find out if a given Flickr photo is in Wikimedia Commons. Although most Flickr upload tools will embed that metadata somewhere, they’re not consistent about it. We found at least four ways to spot possible duplicates:

  • You could look for a Flickr URL in the structured data (the machine-readable metadata)
  • You could look for a Flickr URL in the Wikitext (the human-readable description)
  • You could look for a Flickr ID in the filename
  • Or Flickypedia could know that it had already uploaded the photo

And even looking for matching Flickr URLs can be difficult, because there are so many forms of Flickr URLs – here are just some of the varieties of Flickr URLs we found in the existing Wikimedia Commons data:

(And this is without some of the smaller variations, like trailing slashes and http/https.)

We’d already built a Flickr URL parser as part of Flinumeratr, so we were able to write code to recognise these URLs – but it’s a fairly complex component, and that only benefits Flickypedia. We wanted to make it easier for everyone.

So we did!

We proposed a new Flickr Photo ID property, and it was accepted. This is a new field in the machine-readable structured data, which contains the photo’s numeric Flickr ID. This is a clean, unambiguous pointer to the original photo, and dramatically simplifies the process of looking for existing Flickr photos.

When Flickypedia uploads a new photo to Wikimedia Commons, it adds this new property. This should make it easier for other tools to find Flickr photos uploaded with Flickypedia, and skip re-uploading them.

Backfillr Bot: Making Flickr metadata better for all Flickr photos on Commons

That’s great for new photos uploaded with Flickypedia – but what about photos uploaded with other tools, tools that don’t use this field? What about the 10M+ Flickr photos already on Wikimedia Commons? How do we find them?

To fix this problem, we created a new Wikimedia Commons bot: Flickypedia Backfillr Bot. It goes back and fills in structured data on Flickr photos on Commons, including the Flickr Photo ID property. It uses our URL parser to identify all the different forms of Flickr URLs.

This bot is still in a preliminary stage, waiting for approval from the Wikimedia Commons community, but once that’s granted, we’ll be able to improve the metadata for every Flickr photo on Wikimedia Commons. It will also create a hook that other tools can use – either to fill in more metadata, or to search for Flickr photos.

Sydney Harbour Bridge, from the Museums of History New South Wales. No known copyright restrictions.

Flickypedia started as a tool for copying photos from Flickr to Wikimedia Commons. From the very start, we had ideas about creating stronger links between the two – the “say thanks” feature, where uploaders could leave a comment for the original Flickr photographer – but that was only for new photos.

Along the way, we realized we could build a proper two-way bridge, and strengthen the connection between all Flickr photos on Wikimedia Commons, not just those uploaded with Flickypedia.

We think this ability to follow a photo around the web is really important – to see where it’s come from, and to see where it’s going. A Flickr photo isn’t just an image, it comes with a social context and history, and being uploaded to Wikimedia Commons is the next step in its journey. You can’t separate an image from its context.

As we start to focus on Data Lifeboat, we’ll spend even more time looking at how to preserve the history of a photo – and Flickypedia has given us plenty to think about.

If you want to use Flickypedia to upload some photos to Wikimedia Commons, visit www.flickr.org/tools/flickypedia.

If you want to look at the source code, go to github.com/Flickr-Foundation/flickypedia.