Thursday, 10 April 2014

I Know Where You Were Last Summer: London's public bike data is telling everyone where you've been

This article is about a publicly available dataset of bicycle journey data that contains enough information to track the movements of individual cyclists across London, for a six-month period just over a year ago.

I'll also explore how this dataset could be linked with other datasets to identify the actual people who made each of these journeys, and the privacy concerns this kind of linking raises.

--

It probably won't surprise you to learn that there is a publicly available Transport For London dataset that contains records of bike journeys for London's bicycle hire scheme. What may surprise you is that each record includes a unique customer identifier, as well as the location and date/time for the start and end of the journey. The public dataset currently covers a period of six months between 2012 and 2013.

What are the consequences of this? It means that someone who has access to the data can extract and analyse the journeys made by individual cyclists within London during that time, and with a little effort, it's possible to find the actual people who have made the journeys. 
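
To make that concrete, here's a minimal R sketch of the kind of query involved - the file name and columns (CustomerID, StartStation, EndStation) are hypothetical stand-ins for whatever the actual TFL extract uses:

# Hypothetical sketch: pull out one customer's journeys from the TFL extract
journeys <- read.csv("tfl-journeys.csv", stringsAsFactors=FALSE)
x <- subset(journeys, CustomerID == 123456)            # one cyclist
sort(table(paste(x$StartStation, "->", x$EndStation)), # their most-used routes
     decreasing=TRUE)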

To show what's possible with this data, I built an interactive map to visualize a handful of selected profiles.

Please note: the purpose of this article is to expose the risks that can come with open datasets. However, I've held off from actually trying to find the people behind this data, mostly because of the privacy concerns, but also because (thankfully) it requires a fair bit of effort to actually identify individuals from the data...

Below, you'll find a map of all journeys made by one specific cyclist (commuter X), selected because they're one of the top users of a familiar bicycle hire station near where I work:

Bike journeys map - commuter X [interactive version]


Each line represents a particular journey, with its thickness showing the number of times that journey was made. The size of each circle represents the number of different stations that the cyclist has travelled to or from at that bike station. Purple lines indicate there were journeys in both directions, while orange lines (with arrows) indicate journeys that were one-way only.

Bigger, therefore, implies the route or station has greater significance for the person.

NOTE: if you think you might be this person, and you're unhappy having your personal journey data here, please contact me and I will remove the offending map. Then contact TFL (as I have) and tell them to remove customer record numbers from the data.

So what can we tell about this person?

First impressions suggest that they probably live near Limehouse, work in Kings Cross, and have friends or family in the Bethnal Green / Mile End areas of London. This story is strengthened if we filter down to journeys made between 4.00am and 10.00am:

Commuter X - morning journeys [interactive version]


We can see that this person only travels to Kings Cross in the morning, when departing from the Limehouse area or from Bethnal Green. So a morning commute from home, and/or a partner's abode? Applying a similar filter for the afternoon and evening shows return journeys, so the commuting hypothesis becomes stronger still.
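
Continuing the hypothetical sketch above, the morning filter is a one-liner once the start time has been parsed (assuming a StartDate column holding a timestamp like "2013-01-07 08:15"):

x$hour  <- as.integer(format(as.POSIXct(x$StartDate), "%H"))
morning <- subset(x, hour >= 4 & hour < 10)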

Like me, you're probably starting to feel a bit uncomfortable at this point - after all, I'm putting a story to this person's data, and it's starting to sound quite personal.

What's more interesting (and worrying) is that:

  1. I'm not really trying very hard - a deeper inspection of dates, times, locations, etc. could reveal far more detail.
  2. There's enough here to start thinking about putting a name to the data.

All that's needed to work out who this profile belongs to is one bit of connecting information.

A Foursquare check-in could be connected to a bike journey, though it would be difficult to connect it to the cycle scheme. More likely would be a time-stamped Facebook comment or tweet saying that the Kings Cross Boris Bike station is full. Or a geo-coded Flickr photograph, showing someone riding one of the bikes...

Any seemingly innocuous personal signal would be enough to get a detailed record for someone's life in London ... travelling to work, meeting up with friends, secret trysts, drug deals - details of any of these supposedly private aspects of our lives can be exposed.

Here's another profile, chosen because of the volume of journeys made:


Complex bike journey map [interactive version]


Hopefully you can see the richness of the information that is available in the TFL dataset. Every connection on the map represents something of significance to the cyclist, each bike station has some meaning. As well as being a digital fingerprint that can be linked to personally identifiable information, the journey data is a window on this person's life.

--

On a final note, I'd like to point out that there are positives to releasing such data, which can be seen (for example) in the following map:

Commuter destinations around Victoria [interactive version]


The above map shows commuter journeys from a bike station near Embankment to various stations around Victoria. These are journeys made between approximately 4.00pm and 5.30pm - so return commutes from work, presumably followed by a train journey from Victoria southwards. Here, there is one point of departure but three destinations, probably because Victoria Rail Station is a major transport hub, so the bike stations nearby will be popular and may often fill up.

The point is that there are benign insights to be had by looking at individual profiles - but the question remains whether these kinds of insights justify the risks to privacy that come with releasing journey data that can be associated with individual profiles.

Credits

Leaflet.js - web mapping library
Cloudmade - map tiles
Transport For London - datasets of Boris Bike data



Sunday, 2 March 2014

London maps and bike rental communities, according to Boris Bike journey data

Every time someone in London makes a journey on a Boris Bike (officially, the Barclays Cycle Hire Scheme), the local government body Transport For London (TFL) record that journey. TFL make some of this data available for download, to allow further analysis and experimentation.

Below, you'll find maps of the most popular bike stations and routes in London, created from the TFL data using Gephi, plus a few simple data processing scripts that I threw together. The idea for these maps originated within a project group at a course on Data Visualisation, held at the Guardian last year. We're working on a more publisher-friendly form, so thank you to my course mates for giving me the go-ahead to include them here.

First, here's a map showing all bike stations and all popular journeys.


Popular Boris Bike journeys and stations. Full version.

The first map shows the most popular routes and bike stations, those with more than ~150 journeys made during the six months of data that TFL make available. The size of each bike station in this map is based on the number of popular journeys that start or end at that station, a measure of the connectedness of the location. Note: the labels just show the rental area, not the specific station name.
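
For the curious, the thresholding step is straightforward in R; as in the previous post, the file and column names are assumptions rather than the extract's actual headers:

# Count journeys per (start, end) pair, then keep the popular ones
journeys <- read.csv("tfl-journeys.csv", stringsAsFactors=FALSE)
routes <- aggregate(list(n=rep(1, nrow(journeys))),
                    by=list(from=journeys$StartStation, to=journeys$EndStation),
                    FUN=sum)
popular <- subset(routes, n > 150)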

Next, a map where the stations have been grouped together into rental areas, as allocated by TFL:


Rental areas and traffic volumes in the Boris Bike network. Full version | Alternative.

The second map is a version of the first map where related bike stations have been grouped together, and the volume of journeys between areas determines the weight of each connection. Colours in the second map are related to distinct communities in the network - more on this later. The position of the rental areas is approximate and calculated by Gephi. So please don't blame me for any geographical inaccuracies in this map ;)

Some interpretation, along with inspection of the underlying data, shows that:
  • Major entry points for Boris Bike use are via Kings Cross and Waterloo, more than likely due to commuters arriving from the North and South then heading deeper into London for work.
  • The most popular journeys are those around Hyde Park, corresponding to a popular tourist activity. 
  • The most popular journey (by a long way) is from Hyde Park Corner ... to Hyde Park Corner, presumably a nice trip round the park.
  • The most popular commuter route is between Waterloo (station 3) and Holborn, probably via Waterloo Bridge.
Of course that's just scratching the surface, and just one example of how to visualize the data. There's much more that can be done, and similar maps have been created before. Here are a couple of my favourites, plus another of my own, afterwards:
  • This delightful video by visualisation specialist Jo Wood at City University London, published by New Scientist, also shows popular routes in the network.
  • A recent BMJ article included a street-level map showing predicted routes and volumes; the focus there is on the health impact of bike-sharing schemes.
A bit of experimentation with Gephi's community detection tool results in this map:


Rental communities in the Boris Bike network. Full version.

Here, major connected clusters of bike stations are shown in the same colour (red for Waterloo and environs, green for around Hyde Park, etc.). The communities are detected using Gephi's implementation of the Louvain Method, which finds local communities within large networks. This algorithm has a random element, and generates slightly different communities on each run. However, it's clear from repeated runs that distinct local communities exist in the network, in particular around Hyde Park, Kings Cross, Waterloo, and Canary Wharf.
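
Gephi does all of this through its UI, but if you'd rather script it, igraph's implementation of the same Louvain Method should give comparable results. A sketch, reusing the hypothetical popular-routes frame from above:

library(igraph)
g <- graph_from_data_frame(popular, directed=FALSE)  # from, to, n columns
E(g)$weight <- E(g)$n                                # journey volume as edge weight
comms <- cluster_louvain(g)                          # random element, as with Gephi
membership(comms)                                    # community ID per bike station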

The map that shows bike rental areas (rather than stations) was coloured according to these communities, with journeys between different communities having a colour that's a mixture of the two that are involved.

If you fancy playing with the data or trying out some other visualizations, you can find everything in this GitHub repository.

Saturday, 7 September 2013

Visualizing The Wheel of Time: Reader Sentiment for an Epic Fantasy Series

In the following blog post, I explore reader sentiment for the Epic Fantasy book series The Wheel of Time, as expressed in user-submitted ratings on Amazon and GoodReads.

If you're a data scientist (or similar), you'll probably be interested in the data analysis, which includes some interesting observations about the usefulness - or otherwise - of Amazon reviews. If you're an Epic Fantasy series reader, you'll be interested in the outcome of my analysis: I've decided to go ahead and read all fourteen books.

Note - This is a spoiler free zone
Updated 17th Sept: charts now more accurate, not quite as pretty

*

Recently, I was looking for a good book to read, and a friend recommended The Wheel of Time series by Robert Jordan. But I'd heard from a few sources that the later volumes were harder going than the earlier volumes. Having struggled with later volumes of the Game of Thrones series (sorry George), I was wary of starting a mammoth fourteen-volume series unless I was confident I could make it to the end.

So I did some research. First, I checked the Amazon reader-submitted ratings for the books, which are on a scale of 1 to 5 stars. Here's what they look like for the whole series:


Uh oh, that doesn't look good. Books eight to eleven get some pretty poor scores, though the later books seem to pick up again. It looks like the middle of the series could be hard work.

What's going on?

Next, I checked the GoodReads ratings for the series. GoodReads is a site designed "to help people find and share books they love... [and] to improve the process of reading and learning throughout the world." Here's how the GoodReads ratings (which are also submitted by readers and go from 1 to 5) stack up against the Amazon ratings:



So that's a little different. There's still a hump in the middle, but it's nowhere near as pronounced ... in fact the lowest aggregate rating is 3.84, far higher than the 1.8 for the same book on Amazon!

Let's look at the number of reviewers for the two systems, corresponding to the number of people who've read the book and recorded their review and/or star rating. First, GoodReads:



Well that seems reasonable - the number of ratings tails off in the middle, then picks up towards the end, as you'd probably expect given the ratings we've seen. And generally speaking, each book has lots of ratings - the lowest count is for the final book, probably a reflection of the fact that A Memory of Light has only been out for a year compared to the other books.

Oddly, though, there are more ratings for book twelve (The Gathering Storm) than for quite a few of the earlier volumes.

This likely reflects the sad fact that volume eleven (Knife of Dreams) was the final volume completed by the original author Robert Jordan, who passed away in 2007. Volume twelve is the first volume written by Brandon Sanderson, the author Robert Jordan chose to finish the series. RIP Robert.

How about Amazon reviewer counts then?



On Amazon, there are far more reviewers for the books that received the really low scores. This suggests that the really low scores are actually a result of frustrated readers motivated to express their concerns, rather than a reflection of relative enjoyability or quality per se.

GoodReads makes it extremely easy to submit a rating for a book - one click is all it takes. Amazon seems almost to discourage reviews - the "Write a Review" button is halfway down the page, and you must provide a title and description for your review. The net result is that the input of everyday browsing users won't be captured on Amazon - only motivated reviewers (such as frustrated readers) will bother to jump through all the hoops.

*

Overall, therefore, it seems sensible to expect a dip in the enjoyability of The Wheel of Time series, from book eight to about book eleven.

But perhaps that dip isn't as severe as suggested by Amazon, whose ratings are likely skewed by frustrated readers. My guess is that many readers reach the later volumes and are frustrated by a change of pace; this certainly matches my experience with the Game of Thrones series where events seemed to slow to a crawl in the most recent books. The problem is compounded when there are long gaps between books being published, making it harder to pick up the story.

Thankfully, the final few books get much higher ratings across the board, so I'm expecting that it's worth getting through the slower books to reach the finale. At least, that's the story I'm telling myself ...

Only Time will tell.

Friday, 4 January 2013

Personal Data Hacks: Visualizing Data from OpenFlights.org

A friend recently told me about OpenFlights.org, a website that allows you to record, analyze, and share personal flight data. He showed me his dataset, which contained a record of every flight he'd taken over the past 10+ years. I was keen to investigate the dataset further, and my friend was happy to provide me with a copy so I could have a play (thank you Luigi!).

The end result is the following collection of visualizations created in Gephi, with a little help from R. They show key transport hubs and routes for airports, countries, and continents that my friend has visited, and demonstrate some of the fun, insightful ways you can use such personal data.

If you're interested in how the visualizations were created, check out the section at the end of this blog posting where I briefly describe the technologies required and steps involved.

Note: OpenFlights.org is free to use, and supported by advertising and donations. You can join me in supporting OpenFlights.org via this link.

Hub Airports and Key Routes

The first visualization below shows the primary airports and routes used by Luigi. Each airport has been ranked in size and colour according to the number of other connected airports, while each connection has been weighted according to the number of times that route was flown. The layout here was generated in Gephi, ensuring (simply put) that related nodes are co-located:


As you can see, PSA (Pisa) and STN (London Stansted) are far and away the most used airports. Not only that, but the return journey between the two airports has been taken many times. These two facts make perfect sense given that Luigi is from Pisa, but moved to the UK a few years ago. Other significant hubs are London Heathrow, London Gatwick, and Rome - not too surprising.

Key Countries and Connections 

Given that many airports are within the same country, is it possible to reflect that in the visualization? One way to achieve this is to partition airports by colours corresponding to different countries, as follows:


So that's kind of OK - the predominance of Italy (yellowish-green) and the UK (blue) is starting to show - but it's quite confusing.

A better approach is to group the airports and connections by country, and to lay out the nodes according to (approximate) geographical positions. The following graph also has a few graphical tweaks for readability:


We're now getting to something approximating a worldwide travel heatmap for my friend. The key travel hubs of the UK and Italy are obvious, and key routes now jump out more clearly: between Italy and the UK, Italy and Germany / France, and the UK and Spain. The significance of the other routes also becomes a bit more apparent - countries further afield corresponding to occasional holiday travel, for instance.

Continental Travel

What about different continents? If we return to the original graph and partition the airports by continent, a European bias becomes very clear:


It's also nice to see the groupings of continental airports jumping out - in particular the green nodes in the bottom right corresponding to African airports. Note that I avoided grouping by continent here, because the resulting node for Europe dwarfed all the others, which didn't make for a good visualization.

Creating the Visualizations

The flight data is downloadable from OpenFlights.org as comma-separated values (CSV). I used a little command-line manipulation (awk, sort, and uniq) to compress the data into a list of unique flights, with a count corresponding to the number of times each flight was taken.

Next, I loaded the data into R, then converted it into a graph which could be easily exported to GML (Graph Modelling Language), then loaded into Gephi and visualized.
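
In case it's useful, here's roughly what that pipeline looks like end to end in R with igraph; the flights.csv name and From/To headers are assumptions about the OpenFlights export:

library(igraph)
flights <- read.csv("flights.csv", stringsAsFactors=FALSE)
# The awk/sort/uniq step, done in R: one row per unique route, with a count
routes <- aggregate(list(n=rep(1, nrow(flights))),
                    by=list(from=flights$From, to=flights$To), FUN=sum)
g <- graph_from_data_frame(routes)       # n becomes an edge attribute
write_graph(g, "flights.gml", format="gml")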

The downloaded dataset didn't contain city, country, or continent data. Adding this required an export of nodes from Gephi, followed by a merge with the OpenFlights.org Airport dataset (spreadsheet magic), and a re-import into Gephi.
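
The "spreadsheet magic" could equally be a merge() in R. Assuming the Gephi node export has an Id column holding IATA codes, and using the airports.dat layout as I recall it (column 4 = country, column 5 = IATA code - do check against the current file):

nodes    <- read.csv("nodes.csv", stringsAsFactors=FALSE)  # Gephi node export
airports <- read.csv("airports.dat", header=FALSE, stringsAsFactors=FALSE)
nodes <- merge(nodes, airports[, c("V5", "V4")], by.x="Id", by.y="V5")
write.csv(nodes, "nodes-with-country.csv", row.names=FALSE)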

Saturday, 8 December 2012

Visualizing Gamer Achievement Profiles using R

In this post, I'll describe how to go about visualising and interpreting gamer achievement data using R, the open source tool for statistical computing. Specifically, I'll show how you can create gamer achievement profiles based on publicly available achievement records from the Steam community API.

The visualisations and data interpretation will hopefully be of interest to a general audience, but for the more technically inclined reader I've included the steps required to create the visualisations. If you're mainly interested in the analysis and interpretation, you might want to skip ahead to the Achievement Rate Distributions section.

If you're not a coder, don't be put off - R really is straightforward. The following histogram, for example, can be created from a data set using just two lines of code:


This histogram shows global achievement rates (in percentage points) for all Steam achievements - more on this below.
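
For reference, the two lines in question are the same read.table and hist calls explained in detail later in the post:

achrates <- read.table("ach-rates-full.txt", header=T, quote="|")
hist(achrates$Rate)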

Achievement Data

So what gamer data are we talking about?

The Steam community API provides both individual and global achievement records. For individual gamers, you can retrieve the lists of achievements they hold on a game-by-game basis. For the community as a whole, the API provides access to the global achievement rate - that is, the percentage of players who hold that particular achievement.

Using the approach described in a previous blog post, it's relatively easy to obtain these data sets, though a little time consuming when it comes to reading the global achievement rates for all games.

The global achievement data set that I created looks like this:


The data is simply one line per game achievement, with three whitespace-delimited columns corresponding to the Game ID, the Achievement ID, and the global achievement rate. You'll notice that the achievement IDs are quoted using the pipe character, which is necessary because some achievement IDs include spaces or quote characters.
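
With the screenshot unavailable, a line of the file looks something like this (illustrative values; only the Game and Rate column names are confirmed by the code below - the middle one is a guess):

Game Ach Rate
220 |SOME_ACHIEVEMENT_ID| 23.4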

The achievement data for specific gamers is quite similar:


Here, each line corresponds to one achievement held by a gamer, whose identity is indicated in the first column. I also chose a different quote character here because some gamer IDs happened to include the pipe character.

R Basics

R is a popular tool among data miners because, among other things, it provides an easy way to generate "publication ready" charts such as histograms and scatter plots.

Getting up and running with R is simple. You can download an installation image via the R project homepage. Once installed and started, R provides a console for issuing commands, as shown below:



To load a data set, you can use read.table:

achrates <- read.table("ach-rates-full.txt", header=T, quote="|")

The above reads the contents of a data file (ach-rates-full.txt) in table format into memory, accessible via the variable name achrates in this case. The parameters indicate that the file includes a header line, and that column values are quoted using the pipe character.

To view the data, simply type the name of the variable followed by a carriage return, and R will print out the contents. Use dim to obtain the dimensions of the data, e.g.:

> dim(achrates)
[1] 30081     3

I also found the subset function to be handy. You can use it to create a new dataset, based on some criteria such as user name or game ID. For example to obtain all global achievement rates for Half Life 2, you can type:

ar.hl2 <- subset(achrates, Game == 220)

That's all you need in order to read a data set, view the contents, and to select a subset. But let's move onto something more interesting, and generate a few histograms...

Achievement Rate Distributions

To generate a histogram of values from your data, use the hist function. The histogram shown at the start of this post (and repeated just below) was generated from the global achievement data as follows:

hist(achrates$Rate)

This generates a simple, no-frills histogram of the global achievement rates for every achievement in Steam.
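
hist accepts optional arguments if you want something a little more presentable, e.g.:

hist(achrates$Rate, breaks=50,
     xlab="Global achievement rate (%)",
     main="Steam global achievement rates")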


How to interpret the data? Fundamentally, the data appears to show that the vast majority of achievements in Steam are held by only a small percentage of players for each game. This isn't so surprising, given that many games on Steam are for casual gamers. Also many games can be bought in bundles, which can lead to many games either being left unplayed, or played just once or twice - certainly that's the case for the games in my Steam account. It's also worth noting that a few achievements seem to have been created for test purposes, so will naturally only be held by a tiny proportion of gamers (i.e. the game developers, more than likely).

Digging into the data a little deeper provides further insight into the playing habits of Steam gamers. The following lines generate a histogram for a particular user, based on individual achievement data:

gamerdata <- read.table("user-ach-rates.txt", header=T, quote="~")
gd.user <- subset(gamerdata, User == "SomeUser")
hist(gd.user$Rate)

Two of the gamers in my social circle (let's call them Mario and Luigi) have quite distinct profiles of the type of achievements they tend to get.

Mario has over a thousand achievements, coming from a total of 23 games. The histogram of global rates for his achievements looks similar to the overall distribution:


So Mario holds many achievements that are not typically held by other gamers for the games he plays. Luigi on the other hand has about 450 achievements, coming from 35 games. His histogram looks like this:



The difference is quite apparent: the achievements that Luigi gets tend to be those held by a good proportion of other gamers, and he has fewer of the hard to get achievements.


Interpretation

Broadly speaking, the above profiles describe two quite distinct types of gamer.

The first - Mario - has a few games that he plays all the time. Mario gets most or all of the achievements, clocks up lots of game hours, and perhaps tends to the e-sports end of the gaming spectrum - playing multi-player games with friends or adversaries over the net.

The second - Luigi - has more games, and tends to dip in and out of them. This type of gamer is perhaps more interested in the game experience or story, rather than obtaining every achievement or exploring every area of a game. A Luigi gamer fits more into the category of casual gamer.

Of course these are my interpretations of the data from a few simple data plots, and would need to be backed up with further data capture and analysis to hold any serious weight.

But hopefully they hint at what might be possible with such data. One can imagine, for example, building classification systems that are able to categorise gamers based on their achievement profiles. Such categorisations could be used to generate recommendations, targeted adverts, friend suggestions, etc. There may also be other rich sources of related data available to further enhance the gaming ecosystem.

Note on data quality

In a previous blog post, I drew attention to a few issues present in data retrieved from the Steam community API, and some of these cropped up again while I was creating the visualisations here. As such, the set of global achievement rates may not be complete, and spurious entries from test achievements may slightly increase the skew towards low achievement rates.

Friday, 16 November 2012

Anomalies in Steam Community data

In a recent post I introduced the Steam Community API, and showed how to retrieve gamer data and perform a few simple but fun analyses.

While writing the posting, I came across several problems associated with the data that's returned. If you're thinking about using Steam Community data, it's worth bearing these anomalies in mind because of the impact they'll have on downstream processing and further analysis.

Frustratingly, the quality of the data available through the Steam Community API is quite variable - in particular, there are many discrepancies between global achievement data and achievement data for individual players. I also came across several global achievement rates that were clearly invalid, and in some cases found that global achievement records for games were missing entirely.

The net result: it's hard to trust the data that's returned. It's still possible to analyze it, but you're going to need strong validation, normalization (e.g. of player achievements against a 'gold standard'), and potentially multiple attempts to retrieve equivalent data to ensure what you have is accurate.

Below, you'll find a (non-exhaustive) list of data quality issues I came across, along with some examples, and a little discussion about the problems they introduced and workarounds I used.

Disclaimer: the issues described here reflect my experience while using the Steam API to retrieve data in bulk, to allow me to analyse data for a large number of gamers. Your experience may differ - if so, please let me know.

"Test" achievements

One of the most obvious problems you might find is spurious achievements associated with games. These include a few that can easily be filtered ("null" or empty names) as well as others that are more problematic - such as the many 'test' achievements. For example:

TEST_ACHIEVEMENT
TestAchievement
Achievement_Test

Those readers familiar with the Valve games Portal and Portal 2 will realize why these can't be easily filtered - many valid achievement names include some variation of the word "test", e.g. portal_escape_testchambers.

I only spotted these when querying global achievement statistics, so it's possible the problem is just a filtering issue on that particular API endpoint.

Another example can be seen when comparing my (woeful) achievements for AaAaAA!!! - A Reckless Disregard for Gravity with the global achievement data. You should see one extra achievement, testo2, which a vanishingly small number of people have achieved - more than likely because it's a leftover artifact from when the game was integrated into Steam.

Out of date achievement lists

Another closely related issue is that personal achievement lists can get out of step with global achievement lists, casting doubt on the reliability of any comparisons made between player achievements.

For example while processing the global record for The Legend of Grimrock, I noticed two additional achievements in comparison to my record:

FIND_ALL_TREASURES, complete_game_normal

It's worth noting that the FIND_ALL_TREASURES achievement appears in lower case in my record, but the complete_game_normal entry was missing completely. As a result, it's necessary to normalize all achievement records for players before making any comparisons, which unfortunately means making assumptions about why entries are missing (e.g. that the game hasn't been played recently) and how to fix the data.
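
A minimal normalization sketch, assuming the player and global records have been loaded into data frames with a (hypothetical) Ach column:

player$Ach <- tolower(player$Ach)   # case-normalize before comparing
global$Ach <- tolower(global$Ach)
setdiff(global$Ach, player$Ach)     # achievements absent from the player record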

Interestingly, since I last viewed the global stats for this game, the data has become one of the missing global records described in the next section ...

Missing global records

A more severe, though in some ways easier to handle, issue is that global stats for some games are simply missing - though this seems to be an intermittent problem.

The aforementioned Legend of Grimrock is currently one of them, as was Civilization V a couple of weeks ago. This seems to be an API specific issue, because the equivalent website for The Legend of Grimrock shows many achievements as I write this.

It seems that obtaining global stats for games is a bit of a hit-and-miss affair, so be careful with any apparently empty achievement lists you may see, and don't assume that such responses are correct before processing them further.

Achievements with huge percentages

The final issue I've come across is that some of the global stats for achievements are simply incorrect. For example, RAGE by id Software has several achievements held by over 730,000% of players.

Thankfully this particular issue is easily detected, and offending achievements can be filtered easily.
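
Using the achrates data frame from the previous post, for example, one subset call is enough:

achrates.clean <- subset(achrates, Rate >= 0 & Rate <= 100)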

Photo credit: wilhei55 / Foter / CC BY

Sunday, 4 November 2012

Harvesting Data from the Steam Community API

Introduction

The Steam community API is a web service that provides public access to information about Steam users, their games, achievements, and other related information. In this blog posting I'll describe some of the interesting data you can access, as well as how to model, retrieve, and process that data. I'll also show you how to generate a few fun, simple rankings and statistics for a group of Steam gamers.
 
This is primarily a technical article, but it concludes with the results of a simple analysis performed over a small number of friends and acquaintances on Steam, which may be of interest to the non-technically inclined.

The examples shown here can be reproduced using the sample code found in this GitHub repository. It's a work in progress, but hopefully provides enough insight so you can either repeat the results or build your own equivalent.

Accessing the API

The first thing to know is that Steam community data is accessed using a RESTful web service, through a number of related endpoints. Many of the endpoints don't require authentication, but some require you to register for a key which you then provide as a parameter when interacting with the API.

You'll find links to the API documentation below - see the first link for details on how to get a key:

Steam Web API Documentation (high level)
Steam Web API Reference
Steam Web API Self documenting API endpoint
How to access Community Data 

The "Web API" supports both XML and JSON formats, while the closely related "Community Data" endpoints only support XML - it appears the latter are just public pages with an additional parameter of xml=1. In the rest of this posting, I provide XML examples for consistency, but the JSON resources seem to be equivalent. All of the URIs described below can be accessed using the HTTP GET verb, and in all cases appear to be browser-friendly (try clicking the examples).

There are also one or two client libraries available for different languages, notably steam condenser, which is available for Java, PHP, and Ruby. Unfortunately, I hit a bug caused (I believe) by changes to the behaviour of the Steam API, and ultimately decided to use HTTP directly, given that the API is relatively straightforward.

Available data

What kind of data can be accessed via the API? Some of the most interesting types of data are user profiles and user game lists, along with user achievements, which many users choose to make public. It's also possible to retrieve global achievement lists for games, which include percentages showing the proportion of players with the game who have a given achievement.
There's actually lots more information available, such as friend lists, statistics on play time, etc. But for now, let's focus on the above.

To make it easier to work with the data, it helps to establish a core domain model - that is, a set of concepts and relationships describing the problem domain. This helps with understanding the data, reasoning about how to process it, and, further down the line, describing the data in code.

Given that we're interested in users, games, and achievements, the domain model is fairly simple:


Simple UML domain model for player data. Diagram courtesy of ObjectAid.

The above diagram was generated from core domain classes in the sample project, using a view-only UML modeling tool called ObjectAid. Aside from the three main concepts, you'll see relationships representing the fact that users have games, that games have achievements associated with them, and that users have achievements either held or yet to be achieved. You'll also see a few attributes for key data such as steam ID, game name, etc.

Data retrieval

Before retrieving data from the various API endpoints, you'll need to find one or more Steam IDs. There are a few different ways of referring to Steam users; these include personas (nicknames), login account names, identifiers reported by game servers that start with STEAM_, and 64-bit community IDs.

We're interested in the 64-bit Steam community variants, which unfortunately require a little effort to obtain. A good starting point is your profile page, which can be accessed via an "id" or via the unique profile ID - the behaviour of the "id" endpoint is ambiguous, but it appears to attempt to resolve users by their registered nicknames. I ended up viewing the page source on my friend list page or on specific profiles in order to obtain community IDs. It's also worth noting that the API endpoint for retrieving friend lists may be the most reliable method.

If you have a Steam ID in one of the other formats, you might look into one of the sites dedicated to converting between the various ID formats (example). However none of the sites I found generated 64 bit community IDs so friend lists may be the best option.

Once you have an ID or two, you can start to pull down some data. Below, you'll find an outline of key API interactions needed to get player data, game lists, and achievement information.

User profiles
First up, user profile data. You can get individual player summaries using a simple variant on the user profile link - just add xml=1 and you'll receive a computer readable version (example). Alternatively, use the Steam Web API to retrieve player summaries in batch, as follows:

URI:
http://api.steampowered.com/
ISteamUser/GetPlayerSummaries/v0002/?
key=[YOUR KEY HERE]&steamids=[STEAM 64 IDS]&format=[xml OR json]



For the sample stats shown at the end of this posting, I just needed the user's persona name which I retrieved using the second method shown above.
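
If you're scripting this in R, a hedged sketch using jsonlite looks like the following; the response nesting (response -> players -> personaname) matches what I've seen, but treat it as an assumption:

library(jsonlite)
key     <- "YOUR_KEY_HERE"        # from the registration link above
steamid <- "76561197960435530"    # the example 64-bit ID from Valve's docs
uri  <- paste0("http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/",
               "?key=", key, "&steamids=", steamid, "&format=json")
resp <- fromJSON(uri)
resp$response$players$personaname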

User game lists
A user's game list can also be retrieved using a simple variation on a web page URI. In this case, add /games?xml=1 to the end of a profile page URI (example) and you should have a complete list of games owned by the player.

The key data you'll need to pull out of the response is the appID - that is, the unique identifier for the game. The response also includes other player data you might find useful, including game names, play time, etc.
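
As an illustration, a hedged XML-parsing sketch in R - the XPath assumes each game sits in a <game> element with an <appID> child, which is what the response looked like to me:

library(XML)
doc    <- xmlParse("http://steamcommunity.com/id/yourprofile/games?xml=1")  # hypothetical profile
appids <- xpathSApply(doc, "//games/game/appID", xmlValue)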

User achievement lists
Once you've retrieved a list of unique game IDs for a particular player, it's possible to start retrieving something a little more interesting - individual achievements for those games.

Again, you can obtain a player's achievements for a game in two ways: using a profile page URI, and using the Web API. Helpfully, you'll find links to a player's game achievements in the game list response. Adapt these by adding xml=1 and you're away (example - warning: possible game spoilers for XCOM: Enemy Unknown). I actually used the alternative provided by the Web API, as follows:

URI:
http://api.steampowered.com/
ISteamUserStats/GetPlayerAchievements/v0001/?
appid=[GAME ID]&steamid=[STEAM 64 ID]&key=[YOUR KEY HERE]&format=[xml OR json]

The key information you'll need here is the unique identifier for each achievement (apiname) and the flag indicating whether or not the player holds it (achieved, with values 1 or 0).
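
In the same hedged R style as above, extracting those two fields from the JSON variant might look like this (assumed nesting again):

appid <- 220   # Half-Life 2, for example
uri <- paste0("http://api.steampowered.com/ISteamUserStats/GetPlayerAchievements/v0001/",
              "?appid=", appid, "&steamid=", steamid, "&key=", key, "&format=json")
ach  <- fromJSON(uri)$playerstats$achievements
held <- ach$apiname[ach$achieved == 1]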

Global achievement data
The final endpoint that's worth reviewing retrieves global achievement lists and percentages for games. This data only appears to be available via an unauthenticated endpoint through the Web API (example), as follows:

URI:
http://api.steampowered.com/
ISteamUserStats/GetGlobalAchievementPercentagesForApp/v0002/?
gameid=[GAME ID]&format=[xml OR json]
  
Example response

This achievement data is useful both for cross-referencing with user achievement data, and for comparing individual achievements with global levels. The code in the sample project pulls down this data and uses it to validate and normalize user achievement lists.

Putting it all together

So by now, it's hopefully clear what kind of data you can access via the Steam Community API, and how to retrieve it using the various HTTP endpoints. But that's not quite enough to be able to start working with the data.

Below, you'll find an outline of the steps required to put together a cohesive data model that can then be analysed, persisted, and processed further:

For one or more Steam 64 identifiers:
- Retrieve user profile data, create a user record.
- Read the list of user games (capturing game IDs), associate them with the user.
- Read user achievements per game (capturing game ID, plus achievement ID and status), and associate them with the user.

The end result should be a collection of users, each with an associated set of games, and for each user/game pair, a set of achievements both held and yet to be achieved. In the sample project, I use Java classes to hold user, game, and achievement entities, along with a few Collection objects to record game lists and achievement outcomes. The sample code also retrieves global achievement data for validation and normalization.
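
Pulling the earlier sketches together, a hypothetical harvesting loop in R (same assumed response shapes; error handling - e.g. for games without achievements - omitted):

library(XML)
library(jsonlite)

harvest_user <- function(steamid, key) {
  # Game list via the community pages (xml=1), achievements via the Web API
  doc    <- xmlParse(paste0("http://steamcommunity.com/profiles/", steamid, "/games?xml=1"))
  appids <- xpathSApply(doc, "//games/game/appID", xmlValue)
  achievements <- lapply(appids, function(appid) {
    uri <- paste0("http://api.steampowered.com/ISteamUserStats/",
                  "GetPlayerAchievements/v0001/?appid=", appid,
                  "&steamid=", steamid, "&key=", key, "&format=json")
    fromJSON(uri)$playerstats$achievements
  })
  setNames(achievements, appids)
}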

Popular games, Biggest Achievers

Using the above approach, I harvested data for eight Steam friends and acquaintances, then generated a list of the top games for the social group, along with a ranking of the most accomplished players.

Image courtesy of smarnad / FreeDigitalPhotos.net
First, the top 10 games, according to popularity:

Pop.  Game                              Ach'ments
8     Portal                            15
7     Portal 2                          50
6     Amnesia: The Dark Descent         0
6     Counter-Strike: Global Offensive  193
6     Counter-Strike: Source            148
6     Counter-Strike: Source Beta       154
6     Half-Life 2: Deathmatch           0
6     Half-Life 2: Lost Coast           0
6     Left 4 Dead 2                     69
6     Super Meat Boy                    49

No surprises there then. Portal and Portal 2 are highly popular, followed by a few of the top indie games and several stock Valve games. The only slightly puzzling thing is that Half-Life 3 - sorry, I mean 2 - appears at position 11 (not shown) and is only shared by five players, while six have a copy of HL2: Deathmatch. The likely reason is that one person owns the Source Multiplayer pack, which includes HL2: Deathmatch.

Next, who are the biggest achievers? Names have been anonymized to protect the innocent:

Player               Achievements  Ratio
Chas                 1000/3949     25%
Olly at home DOTT    789/6550      12%
Shreddies            574/3245      18%
gryffindorpotential  433/3697      12%
cryptoGoat           193/1989      10%
HappyKittens         40/657        6%
unpronouncable       29/1118       3%
Cuppa                11/2174       1%

Congratulations to Chas, the runaway winner with one thousand achievements and the highest proportion of possible achievements held. Coming in close behind are Olly and Shreddies, Olly holding the higher number of achievements but Shreddies having achieved a higher relative proportion. Bringing up the rear, the wooden spoon award goes to Cuppa, who has both the lowest number of achievements overall and the lowest overall proportion.

These particular stats are just for fun and shouldn't be taken too seriously, but they do hint at some more compelling uses of the data. For example, it would be interesting to go beyond pure rankings and further analyse the achievements held by gamers from a particular social group. But that's the topic of a future post.

A final note on data reliability

One final thing to mention is that my experience with the quality of data exposed by the Steam Web API has been mixed. The data exposed by the community pages (i.e. the public pages, with xml=1 added) seems more reliable than the data provided by the Steam Web API. One reason might be that the Web API provides less filtering, while the profile pages are designed to be human readable and thus may be subject to greater filtering or curation.

Next time, I'll be discussing those issues in more detail as well as discussing the importance of, and problems associated with, obtaining reliable data.