Friday, 4 January 2013

Personal Data Hacks: Visualizing Data from

A friend recently told me about, a website that allows you to record, analyze, and share personal flight data. He showed me his dataset, which contained a record of every flight he'd taken over the past 10+ years. I was keen to investigate the dataset further, and my friend was happy to provide me with a copy so I could have a play (thank you Luigi!).

The end result is the following collection of visualizations created in Gephi, with a little help from R. They show key transport hubs and routes for airports, countries, and continents that my friend has visited, and demonstrate some of the fun, insightful ways you can use such personal data.

If you're interested in how the visualizations were created, check out the section at the end of this blog posting where I briefly describe the technologies required and steps involved.

Note: is free to use, and supported by advertising and donations. You can join me in supporting via this link.

Hub Airports and Key Routes

The first visualization below shows the primary airports and routes used by Luigi. Each airport has been ranked in size and colour according to the number of other connected airports, while each connection has been weighted according to the number of times that route was flown. The layout here was generated in Gephi, ensuring (simply put) that related nodes are co-located:

As you can see, PSA (Pisa) and STN (London Stansted) are far and away the most used airports. Not only that, but the return journey between the two airports has been taken many times. These two facts make perfect sense given that Luigi is from Pisa, but moved to the UK a few years ago. Other significant hubs are London Heathrow, London Gatwick, and Rome - not too surprising.

Key Countries and Connections 

Given that many airports are within the same country, is it possible to reflect that in the visualization? One way to achieve this is to partition airports by colours corresponding to different countries, as follows:

So that's kind of OK - the predominance of Italy (yellowish-green) and the UK (blue) - is starting to show, but it's quite confusing.

A better approach is to group the airports and connections by country, and to layout the nodes according to (approximate) geographical positions. The following graph also has a few graphical tweaks for readability:

We're now getting to something approximating a worldwide travel heatmap for my friend. The key travel hubs of the UK and Italy are obvious, also key routes are also jumping out more: between Italy and the UK, Italy and Germany / France, and the UK and Spain. The significance of the other routes also becomes a bit more apparent - further afield countries corresponding to occasional holiday travel (for instance).

Continental Travel

What about different continents? If we return to the original graph and partition the airports by continent, a European bias becomes very clear:

It's also nice to see the groupings of continental airports jumping out - in particular the Green nodes in the bottom right corresponding to African airports. Note that I avoided grouping by continent here because the resulting node for Europe dwarfed all the other nodes, which didn't make for a good visualization.

Creating the Visualizations

The flight data is downloadable from as Comma Separated Values. I used a little command-line manipulation (awk, sort, and uniq) to compress the data into a list of unique flights, with a count corresponding to the number of times that flight was taken.

Next, I loaded the data into R, then converted it into a graph which could be easily exported to GML (Graph Modelling Language), then loaded into Gephi and visualized.

The downloaded dataset didn't contain city, country, or continent data. Adding this required an export of nodes from Gephi, followed by a merge with the Airport dataset (spreadsheet magic), and a re-import into Gephi.