How does the Commons Explorer work?

Last week we wrote an introductory post about our new Commons Explorer; today we’re diving into some of the technical details. How does it work under the hood?

When we were designing the Commons Explorer, we knew we wanted to look across the Commons collection – we love seeing a mix of photos from different members, not just one account at a time. We wanted to build more views that emphasize the breadth of the collection, and help people find more photos from more members.

We knew we’d need the Flickr API, but it wasn’t immediately obvious how to use it for this task. The API exposes a lot of data, but it can only query the data in certain ways.

For example, we wanted the homepage to show a list of recent uploads from every Flickr Commons member. You can make an API call to get the recent uploads for a single user, but there’s no way to get all the uploads for multiple users in a single API call. We could make an API call for every member, but with over 100 members we’d be making a lot of API calls just to render one component of one page!

It would be impractical to fetch data from the API every time we render a page – but we don’t need to. We know that there isn’t that much activity in Flickr Commons – it isn’t a social media network with thousands of updates a second – so rather than get data from the API every time somebody loads a page, we decided it’s good enough to get it once a day. We trade off a bit of “freshness” for a much faster and more reliable website.

We’ve built a Commons crawler that runs every night, and makes thousands of Flickr API calls (within the API’s limits) to populate a SQLite database with all the data we need to power the Commons Explorer. SQLite is a great fit for this sort of data – it’s easy to run, it gives us lots of flexibility in how we query the data, and it’s wicked fast with the size of our collection.

There are three main tables in the database:

  • The members
  • The photos uploaded by all the members
  • The comments on all those photos

We’re using a couple of different APIs to get this information:

  • The flickr.commons.getInstitutions API gives us a list of all the current Commons members. We combine this with the flickr.people.getInfo API to get more detailed information about each member (like their profile page description).
  • The flickr.people.getPhotos API gives us a list of all the photos in each member’s photostream. This takes quite a while to run – it returns up to 500 photos per call, but there are over 1.8 million photos in Flickr Commons.
  • The flickr.photos.comments.getList API gives us a list of all the comments on a single photo. To save us calling this 1.8 million times, we have some logic to check if there are any (new) comments since the last crawl – we don’t need to call this API if nothing has changed.

We can then write SQL queries to query this data in interesting ways, including searching photos and comments from every member at once.

We have a lightweight Flask web app that queries the SQLite database and renders them as nice HTML pages. This is what you see when you browse the website at https://commons.flickr.org/.

We have a couple of pages where we call the Flickr API to get the most up-to-date data (on individual member pages and the cross-Commons search), but most of the site is coming from the SQLite database. After fine-tuning the database with a couple of indexes, it’s now plenty fast, and gives us a bunch of exciting new ways to explore the Commons.

Having all the data in our own database also allows us to learn new stuff about the Flickr Commons collection that we can’t see on Flickr itself – like the fact that it has 1.8 million photos, or that together Flickr Commons as a whole has had 4.4 billion views.

This crawling code has been an interesting test bed for another project – we’ll be doing something very similar to populate a Data Lifeboat, but we’ll talk more about that in a separate post.