Thursday, 10 April 2014

I Know Where You Were Last Summer: London's public bike data is telling everyone where you've been

This article is about a publicly available dataset of bicycle journey data that contains enough information to track the movements of individual cyclists across London over a six-month period just over a year ago.

I'll also explore how this dataset could be linked with other datasets to identify the actual people who made each of these journeys, and the privacy concerns this kind of linking raises.

--

It probably won't surprise you to learn that there is a publicly available Transport for London dataset that contains records of bike journeys for London's bicycle hire scheme. What may surprise you is that each record includes a unique customer identifier, as well as the location and date/time for the start and end of the journey. The public dataset currently covers a period of six months between 2012 and 2013.

What are the consequences of this? It means that anyone with access to the data can extract and analyse the journeys made by individual cyclists within London during that time and, with a little effort, find the actual people who made those journeys.
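As a rough sketch of what extracting individual journeys involves, the Python below groups journey rows by customer record number. The column names follow the headings reported for the published spreadsheets; the rows are made-up sample values rather than real data, and `journeys_by_customer` is a hypothetical helper, not anything TFL provides.

```python
from collections import defaultdict

# Column heading as reported for the published spreadsheets.
CUSTOMER = "Unique ID/Customer Record Number"

# Made-up sample rows for illustration only (not taken from the real dataset).
SAMPLE_ROWS = [
    {CUSTOMER: "1001", "StartStation Name": "Station A", "EndStation Name": "Station B"},
    {CUSTOMER: "1002", "StartStation Name": "Station C", "EndStation Name": "Station A"},
    {CUSTOMER: "1001", "StartStation Name": "Station B", "EndStation Name": "Station A"},
]


def journeys_by_customer(rows):
    """Group journey rows by customer record number (hypothetical helper)."""
    profiles = defaultdict(list)
    for row in rows:
        profiles[row[CUSTOMER]].append(row)
    return dict(profiles)


profiles = journeys_by_customer(SAMPLE_ROWS)
for customer, journeys in profiles.items():
    print(customer, len(journeys))
```

Once the rows are grouped like this, each entry is effectively a travel profile for one customer, which is exactly what makes the identifier column risky.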

To show what's possible with this data, I built an interactive map to visualise a handful of selected profiles.

Please note: the purpose of this article is to expose the risks that can come with open datasets. However, I've held off from actually trying to find the people behind this data, mostly because of the privacy concerns but also because (thankfully) it requires a fair bit of effort to actually identify individuals from the data...

Below, you'll find a map of all journeys made by one specific cyclist (commuter X), selected because they're one of the top users of a familiar bicycle hire station near where I work:

Bike journeys map - commuter X [interactive version]


Each line represents a particular journey, with the thickness of the line showing the number of times that journey was made. The size of each circle represents the number of different destinations that the cyclist has travelled to and from that bike station. Purple lines indicate there were journeys in both directions, while orange lines (with arrows) indicate journeys that were one-way only.

Bigger, therefore, implies the route or station has greater significance for the person.
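The encoding described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code behind the maps: the station names and journeys are invented, and `summarise_routes` is a hypothetical helper. It counts journeys per route (line thickness), records whether a route was used in both directions (purple versus orange), and counts the distinct stations connected to each station (circle size).

```python
from collections import Counter

# Made-up (start station, end station) pairs for one hypothetical cyclist.
SAMPLE_JOURNEYS = [
    ("Home", "Work"),
    ("Work", "Home"),
    ("Home", "Work"),
    ("Home", "Friend"),
]


def summarise_routes(journeys):
    """Return per-route counts and direction, plus per-station connection counts."""
    directed = Counter(journeys)
    routes = {}
    neighbours = {}
    for (start, end), count in directed.items():
        pair = frozenset((start, end))  # same key for A->B and B->A
        route = routes.setdefault(pair, {"count": 0, "directions": set()})
        route["count"] += count
        route["directions"].add((start, end))
        neighbours.setdefault(start, set()).add(end)
        neighbours.setdefault(end, set()).add(start)
    for route in routes.values():
        # Two distinct directions seen means a two-way (purple) route.
        route["two_way"] = len(route["directions"]) == 2
    station_sizes = {station: len(others) for station, others in neighbours.items()}
    return routes, station_sizes


routes, station_sizes = summarise_routes(SAMPLE_JOURNEYS)
```

In this sample, the Home/Work route would be drawn as a thick purple line and the Home/Friend route as a thin orange arrow.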

NOTE: if you think you might be this person, and you're unhappy having your personal journey data here, please contact me and I will remove the offending map. Then contact TFL (as I have) and tell them to remove customer record numbers from the data.

So what can we tell about this person?

First impressions suggest that they probably live near Limehouse, work in Kings Cross, and have friends or family in the Bethnal Green / Mile End areas of London. This story is strengthened if we filter down to journeys made between 4.00am and 10.00am:

Commuter X - morning journeys [interactive version]


We can see that this person only travels to Kings Cross in the morning, when departing from the Limehouse area or from Bethnal Green. So a morning commute from home, and/or a partner's abode? Applying a similar filter for the afternoon and evening shows return journeys, so the commuting hypothesis becomes stronger still.
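The time-of-day filter is simple to express. Here is a minimal Python sketch, assuming each journey carries a start timestamp; the dates, station names and the `in_window` helper are all invented for illustration.

```python
from datetime import datetime, time

# Made-up (start timestamp, start station, end station) tuples.
SAMPLE = [
    (datetime(2012, 9, 3, 8, 15), "Home", "Work"),
    (datetime(2012, 9, 3, 17, 40), "Work", "Home"),
    (datetime(2012, 9, 4, 9, 59), "Friend", "Work"),
]


def in_window(journeys, earliest=time(4, 0), latest=time(10, 0)):
    """Keep journeys whose start time of day falls in [earliest, latest)."""
    return [j for j in journeys if earliest <= j[0].time() < latest]


morning = in_window(SAMPLE)
evening = in_window(SAMPLE, earliest=time(16, 0), latest=time(22, 0))
```

Changing the two bounds gives the afternoon and evening view in the same way.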

Like me, you're probably starting to feel a bit uncomfortable at this point. After all, I'm putting a story to this person's data, and it's starting to sound quite personal.

What's more interesting (and worrying) is that:

  1. I'm not really trying very hard, and a deeper inspection of dates, times, locations etc. can reveal far more detail
  2. There's enough here to start thinking about putting a name to the data.

All that's needed to work out who this profile belongs to is one bit of connecting information.

A Foursquare check-in could be connected to a bike journey, though it would be difficult to connect it to the cycle scheme. More likely would be a time-stamped Facebook comment or tweet, saying that the Kings Cross Boris bike station is full. Or a geo-coded Flickr photograph, showing someone riding one of the bikes...

Any seemingly innocuous personal signal would be enough to get a detailed record for someone's life in London ... travelling to work, meeting up with friends, secret trysts, drug deals - details of any of these supposedly private aspects of our lives can be exposed.
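To make the risk concrete, here is a hedged Python sketch of the linking step on invented data. It assumes an outside signal that places someone at a known station at a known time, and simply lists the customer numbers whose journeys ended there within a few minutes. The ten-minute window, the sample rows and the `candidates` helper are my own assumptions for illustration.

```python
from datetime import datetime, timedelta

# Made-up (customer id, end station, end timestamp) tuples.
SAMPLE = [
    ("1001", "Station A", datetime(2012, 9, 3, 8, 58)),
    ("1002", "Station A", datetime(2012, 9, 3, 12, 30)),
    ("1003", "Station B", datetime(2012, 9, 3, 9, 1)),
]


def candidates(journeys, station, seen_at, window=timedelta(minutes=10)):
    """Customer ids whose journey ended at `station` within `window` of `seen_at`."""
    return {
        customer
        for customer, end_station, ended in journeys
        if end_station == station and abs(ended - seen_at) <= window
    }


matches = candidates(SAMPLE, "Station A", datetime(2012, 9, 3, 9, 0))
```

On real data a single signal would probably leave several candidates, but because the customer number persists across journeys, each further signal narrows the set. Without that column the same lookup would only ever match an isolated journey.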

Here's another profile, chosen because of the volume of journeys made:


Complex bike journey map [interactive version]


Hopefully you can see the richness of the information that is available in the TFL dataset. Every connection on the map represents something of significance to the cyclist, each bike station has some meaning. As well as being a digital fingerprint that can be linked to personally identifiable information, the journey data is a window on this person's life.

--

On a final note, I'd like to point out that there are positives to releasing such data, which can be seen (for example) in the following map:

Commuter destinations around Victoria [interactive version]


The above map shows commuter journeys from a bike station near Embankment to various stations around Victoria. These are journeys made between approximately 4.00pm and 5.30pm - so return commutes from work, presumably followed by a train journey from Victoria southwards. Here, there is one point of departure but three destinations, probably because Victoria Rail Station is a major transport hub, so the bike stations nearby will be popular and may often fill up.

The point is that there are benign insights that can be made by looking at individual profiles - but the question remains whether these kinds of insights justify the risks to privacy that come with releasing journey data that can be associated with individual profiles.

Credits

Leaflet.js - web mapping library
Cloudmade - map tiles
Transport for London - datasets of Boris Bike data



11 comments:

  1. Um, what? The dataset contains a Bike ID, not a customer ID. You are tracking bikes not individual customers.

    1. I think Siddle's point is that given enough overlapping data, it might be possible to identify certain bike users. He mentions, for instance, time-stamped and geo-coded photos.

    2. The actual bike data that you download from the TFL website contains customer record numbers - the maps really are showing profiles for people.

      It may of course be a mistake, and I've tried telling this to TFL. In the meantime, it's possible for someone to download and analyse your movements - if they can identify your profile.

  2. Really good work James. Now if you can overlay ''twitter'' locations facebook updates you can open a dot.com and be bought out in 5 years for a Billion dollars. GCHQ and the NSA are hiring as well.

  3. Indeed, though a billion dollars seems a bit low IMO.

  4. This is a really interesting analysis JS. The Open Data and Privacy project (of OKF+ORG) is raising similar questions of whether privacy risks are being sufficiently managed if individuals can make the sorts of inferences from open datasets such as you have done here. We propose to outline some basic guideline which data publishers must follow in order to ensure such privacy lapses are minimized. Follow us on Twitter to stay updated @OpenDataPrivacy

  5. I just downloaded the dataset. Bunch of xlsx spreadsheets, and I see the "Unique ID/Customer Record Number" field in there. It's not shown on the list of fields on this doc site. I'm reliably informed (by Ollie Obrien, creator of this awesome bikeshare map), that this field wasn't in earlier versions of the dataset. Maybe they've added it by mistake.

    I wouldn't say it's a disastrous breach of privacy. As I tried to imagine the more nefarious use case, I thought about a stalker. He spots his target docking her boris bike and makes a note of the time. Bingo! he can correlate that to find her customer ID and see where she's been ...except the data's only published for a 6 month period in the past right? And ultimately it only reveals where somebody's been boris-biking. It's not like you have their home address, as you would if you just followed them home for example.

    Even so it's pretty interesting how big databases involving people's location can be more personal than one might at first imagine, and only by adding one seemingly anonymous numeric column.

  6. Assuming you are speaking about the data here: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
    Under the Network statistics -> Barclays Cycle Hire statistics
    The documentation states:

    Details of all Barclays Cycle Hire journeys.
    The journey information includes:
    Journey ID, Bike ID, Start date, Start time, End date, End time, Start docking station, Start docking station ID, End docking station, End docking station ID

    Yet in the Excel file we see the headings:
    Rental Id, Billable Duration, Duration, Unique ID/Customer Record Number, Subscription Id, Bike Id, End Date, EndStation Id, EndStation Logical Terminal, EndStation Name, endStationPriority_id, Start Date, StartStation Id, StartStation Logical Terminal, StartStation Name, startStationPriority_id, EndHourCategory Id, StartHourCategory Id, BikeUserType Id

  7. Interesting
    Can you lend me use it?

  8. few things that can change the Open Data. Have a look at it. Open data has the potential to improve the economy, environment and our society

  9. Please continue to write more because it’s unusual that someone has something interesting to say about this.
    Will be waiting for more! I have some relevant information you can review below.
    Bike shop Hove
    Bike shop Brighton
    Wind Tunnel Fitting
