Tuesday, 5 May 2015

Visualizing Rebel Alliances in the UK Government

The UK will shortly go to the polls for the 2015 General Election. However there's currently no clear front-runner, and in fact no clear coalition on the cards for a new government. The "new normal" of hung parliaments and coalition forming as part of UK politics appears to be here to stay.

Click here for full size version
As such, I decided to take a look at the open dataset provided by The Public Whip project, with a view to visualizing the relationships between MPs (members of parliament) in the 2010 to 2015 UK government, using a tool called Gephi. The idea was to analyse how MPs are related through their voting patterns in the house of commons, and in particular how they are related through agreement or rebelliousness.

Also I'll admit it: I wanted to write an article with "Rebel Alliance" in the title because I like Star Wars.

In the rest of this article, I'll describe several visualizations that were created from that public whip dataset. These show various aspects of MP relationships, and provide some interesting insights into the UK government.

MP Agreement

First, let's look at a social graph of MPs with a high level of agreement.

To create this visualization I established relationships between MPs if their votes agree more than 85% of the time. This threshold is based on a histogram of agreement rates from the Public Whip data, which shows a clear cut-off at around 85%.
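As a rough illustration of the agreement threshold (a sketch on made-up data, not the code from the repository), the pairwise calculation could look something like this. The vote layout, MP names, and function names are all assumptions for the example:

```python
from itertools import combinations

# Hypothetical toy data: votes[mp][division_id] = "aye" or "no".
votes = {
    "MP A": {1: "aye", 2: "aye", 3: "no", 4: "aye"},
    "MP B": {1: "aye", 2: "aye", 3: "no", 4: "no"},
    "MP C": {1: "no", 2: "no", 3: "aye"},
}

def agreement_rate(a, b):
    """Fraction of shared divisions in which two MPs voted the same way."""
    shared = votes[a].keys() & votes[b].keys()
    if not shared:
        return 0.0
    same = sum(1 for d in shared if votes[a][d] == votes[b][d])
    return same / len(shared)

def agreement_edges(threshold=0.85):
    """Return (mp1, mp2, rate) for every pair agreeing more than the threshold."""
    edges = []
    for a, b in combinations(sorted(votes), 2):
        rate = agreement_rate(a, b)
        if rate > threshold:
            edges.append((a, b, rate))
    return edges
```

Each returned tuple would become one edge in the graph. With real data the number of pairs grows quadratically with the number of MPs, which is still manageable for a few hundred members.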

Click here for full size version

The nodes (i.e. circles) in the above diagram represent individual MPs, coloured by their political affiliation. The edges (lines) connect MPs if they tend to agree with each other. The graph is laid out using a force-directed graph algorithm, which places related nodes next to each other in a visually pleasing way.

How can we interpret this diagram?

Unsurprisingly, the majority coalition (formed of Conservatives and Liberal Democrats, in blue and orange-yellow) tend to agree with each other, probably when voting "yes" to new bills, legislation, etc. The minority parties (predominantly Labour) also tend to agree with each other, again unsurprisingly.

Essentially, this is a picture of a typical government in the UK, with a governing majority. That's all very nice - but what else can we learn?

Rebellious Relationships

Things are more interesting if we look at rebelliousness.

The social graph below shows relationships between MPs if they voted the same way (yes or no), and they were rebelling against their respective parties in the process.

For example, you might see two members of the majority coalition going against the party line, voting "no" to a bill that they strongly disagree with. These members are connected in rebelliousness!

Another example would be a bill with strong cross-party support - perhaps a security or policing bill - where the majority of all parties vote "yes". In this example, if member A (Labour) and member B (Conservative) both vote "no" then again they are connected in rebelliousness. The party they belong to is not important in this example - they are rebels either way.

Click here for full size version

Above, the edges of the graph are sized according to the number of times the two MPs have agreed in rebelling - a bigger edge implies stronger agreement between the MPs in disagreeing with their parties. The MP nodes and names are bigger if they have more connections - so the more rebellious MPs will appear larger.
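To make the idea of being "connected in rebelliousness" concrete, here is a minimal sketch on invented data. It assumes a party's position in a division is simply whichever way most of its members voted (official positions aren't available), and the record layout and names are hypothetical:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: (division, mp, party, vote).
records = [
    (1, "A1", "Lab", "aye"), (1, "A2", "Lab", "aye"), (1, "A3", "Lab", "no"),
    (1, "B1", "Con", "aye"), (1, "B2", "Con", "aye"), (1, "B3", "Con", "no"),
    (2, "A1", "Lab", "no"), (2, "A2", "Lab", "no"), (2, "A3", "Lab", "aye"),
    (2, "B1", "Con", "aye"), (2, "B2", "Con", "aye"), (2, "B3", "Con", "aye"),
]

def rebel_edges(records):
    """Count, for each pair of MPs, how often they rebelled the same way."""
    # Tally votes per (division, party) to infer each party's majority position.
    tallies = defaultdict(Counter)
    for division, _mp, party, vote in records:
        tallies[(division, party)][vote] += 1

    # An MP is treated as a rebel when their vote differs from that majority.
    # Ties are broken arbitrarily here, which real data would need to handle.
    rebels = defaultdict(list)
    for division, mp, party, vote in records:
        majority_vote = tallies[(division, party)].most_common(1)[0][0]
        if vote != majority_vote:
            rebels[(division, vote)].append(mp)

    # Rebels who voted the same way in the same division share an edge;
    # the count is the edge weight.
    edges = Counter()
    for mps in rebels.values():
        for a, b in combinations(sorted(mps), 2):
            edges[(a, b)] += 1
    return edges
```

In the toy data only A3 (Labour) and B3 (Conservative) rebel the same way in the same division, so they end up with a single edge of weight one regardless of party.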

So what can we say about this graph?

First, it seems that rebelliousness broadly follows party lines, because Labour, Conservative, and Liberal Democrat rebels tend to be placed near to each other by the layout algorithm. That is, there are broad groups of rebels associated with each party.

However there are relationships between members of different parties too, which could imply pending defections or the seeds of new political parties. Also, well known, notorious characters in UK politics such as Dennis Skinner, Philip Hollobone, and Mark Reckless make strong appearances and are well connected within the graph.

So this is quite an interesting view of MP relationships, and there is more that could be said here. However, the graph shows all cases where two MPs have rebelled together, so the resulting visualization is quite noisy. Fortunately, it can be cleaned up by setting a threshold to detect more interesting relationships.

Significant Rebellions

The following diagram only connects MPs when they have rebelled together on more than 1% of the votes they have both participated in - this threshold filters out a lot of the noise:

Click here for full size version
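The 1% filter itself is simple to express. The sketch below assumes two hypothetical dictionaries keyed by MP pair: one with the number of joint rebellions and one with the number of votes both MPs took part in:

```python
def significant_edges(co_rebellions, shared_votes, threshold=0.01):
    """Keep pairs that rebelled together in more than `threshold` of shared votes."""
    kept = {}
    for pair, count in co_rebellions.items():
        shared = shared_votes.get(pair, 0)
        if shared and count / shared > threshold:
            kept[pair] = count / shared
    return kept

# Hypothetical counts for three pairs of MPs.
co_rebellions = {("A", "B"): 30, ("A", "C"): 2, ("B", "C"): 4}
shared_votes = {("A", "B"): 1000, ("A", "C"): 900, ("B", "C"): 500}
```

Here only the pair A and B (30 joint rebellions in 1000 shared votes) clears the threshold; the other two pairs are dropped as noise.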

Now, more structure becomes apparent, again thanks to the force-directed layout algorithm which attempts to place related nodes close to each other.

Three distinct clusters seem to have appeared.

One cluster is a fairly large group from the Conservative party, and the other two groups are mostly composed of Labour MPs with a sprinkling of Liberal Democrats. These groups could be indicators of alliances within parliament, focused around particular agendas - though a lot depends on the specifics of the voting that has taken place. Certainly, it's interesting to see that some fairly cohesive groups of MPs have a tendency to rebel together.

Rebel Alliances

The previous graph provides some nice insights, but there's a further technique we can apply to more clearly partition the clusters that seem to have emerged. Gephi (the tool used to build these visualizations) has a built-in algorithm for detecting communities, and applying that algorithm consistently partitions the MPs in the following way:

Click here for full size version

The three groups that were suggested by the previous visualization are clearly demarcated by the algorithm, confirming that there are cohesive groups that can be detected.

But interestingly, Zac Goldsmith (another well known, and somewhat controversial, figure in UK politics) appears in a separate, smaller community from the larger group of rebels making up the Conservative cluster. This may be because Mr. Goldsmith has looser ties to the other Conservative rebels, and a stronger connection to Marsha Singh from Labour.

At least - that's what the data suggests...


The public whip data is a fascinating resource for understanding UK politics.

The voting records of MPs are a strong statement of their role in government, and of their political views. By analyzing those voting records, it's possible to derive relationships between MPs, and to identify interesting clusters - for example, of MPs that have a tendency to rebel together.

Of course an analysis is only as good as the underlying data, and it's important to note that official voting positions of UK political parties are not publicly available - so it's not possible to know for sure if an MP has rebelled, or has simply voted according to their conscience in a free vote. So rebelliousness in this context should be interpreted as "significantly misaligned with the rest of the party".

Also please note that the above analysis is meant to demonstrate the kind of insights that can be gained from open data. I am not a political pundit, so if you feel that the information is incorrect or misleading, let me know and I will address your comments in updates to this blog post.

Technical Notes

You can find the public whip data here. I used a combination of MySQL, Python, and Gephi to create the visualizations in this blog. The code, graphs, and images can be found in this GitHub repository.

Sunday, 25 January 2015

How to create your own replica of the SureChEMBL patent-chemistry dataset

Introduction: Why replicate SureChEMBL?

SureChEMBL is a patent chemistry dataset and set of web services that provides a rich source of information to the drug discovery research community. It was previously owned, developed, and sold by Macmillan, but was recently handed over to the European Bioinformatics Institute (EMBL/EBI) and is now free for everyone to use.

SureChEMBL can already be accessed online, so why would a locally hosted replica be needed?

To answer that question, I'll give the reasons provided by a pharmaceutical company who recently commissioned me to develop a SureChEMBL data replication facility:
  • 1) Firewall restrictions can be avoided - companies involved in drug discovery are often working with substructures or other related search queries which may lead to highly lucrative discoveries. As such, researchers are often prohibited from using external web services, even secure services such as SureChEMBL, as a risk-mitigation strategy. Downloading data files - e.g. exported patent chemistry data - via standard web access points such as FTP is more likely to conform to corporate policy.
  • 2) It's far easier to integrate a replicated database into proprietary data processing pipelines, with dedicated search strategies and analytical processes. The SureChEMBL web site provides some chemical search capabilities, but cheminformatics is a rich and diverse field, and only a fragment of possible search capabilities are exposed via the general-purpose web site.
  • 3) Dedicated resources can be provided for chemical search and analysis, avoiding the need to share public systems with other users. Chemistry searching is a resource-intensive activity, so taking complete control of the hosting is desirable.

Essentially a local replica provides far more flexibility, control, and safety to power users than can be offered by the SureChEMBL web site, hence my client's need.

As SureChEMBL is an open resource, EMBL/EBI stipulated that the result of this project be made available to everyone. As such, the resulting data loading mechanism is available here, under the MIT open source license. This means that anyone can now build their own local replica of the data.

Please note however that the data loading mechanism is provided as-is, with no warranty from myself, or from EBI who have provided access to the underlying data for this project.

The remainder of this article discusses the components and associated data flows, some pros and cons to hosting a replica, as well as potential enhancements that could be made to the data loading mechanism, and to SureChEMBL as a whole.

Components and Data Flows

The following diagram shows the key components, systems, and data flows required to create a local replica of SureChEMBL:

SureChEMBL data client architecture

Components in orange are hosted and maintained by EMBL/EBI. The components in blue must be provisioned and/or developed by the local system administrators or cheminformatics practitioners. The light blue component provides the bridge between the EMBL/EBI systems and the local RDBMS, and can be installed and run according to these instructions.

Patent documents are processed by the SureChEMBL pipeline as they are published, and cause updates to the master database. Then, every night, newly detected chemical annotations are extracted and made available in flat file format via an FTP server. These files are downloaded by the data client, and loaded into the RDBMS using a pre-defined schema.

Once the data has been loaded into the database, it can be queried, searched, extracted, or processed as needed. For example a particular user may decide to add new compounds to a fully-fledged chemistry search engine, or to filter the compounds according to pre-defined criteria and generate alerts if any interesting new compounds are detected.
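To give a feel for the download-then-load step, here is a deliberately simplified sketch. It is not the data client's code: the file layout, column names, two-table schema, and identifiers are all invented for the example, and SQLite stands in for the Oracle or MySQL database you would actually use:

```python
import csv
import io
import sqlite3

# Hypothetical tab-separated extract; the real SureChEMBL files and schema differ.
SAMPLE = (
    "compound_id\tsmiles\tpatent_id\tfrequency\n"
    "1\tCCO\tUS-0000001-A1\t3\n"
    "2\tc1ccccc1\tUS-0000001-A1\t1\n"
    "1\tCCO\tEP-0000002-A1\t2\n"
)

def load_extract(conn, handle):
    """Load one flat file into a compound table and a compound-to-patent mapping."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS compound (
            id INTEGER PRIMARY KEY, smiles TEXT NOT NULL);
        CREATE TABLE IF NOT EXISTS compound_patent (
            compound_id INTEGER NOT NULL REFERENCES compound(id),
            patent_id TEXT NOT NULL,
            frequency INTEGER NOT NULL,
            PRIMARY KEY (compound_id, patent_id));
    """)
    rows = list(csv.DictReader(handle, delimiter="\t"))
    with conn:  # commit the whole file as a single transaction
        conn.executemany(
            "INSERT OR IGNORE INTO compound (id, smiles) VALUES (?, ?)",
            [(int(r["compound_id"]), r["smiles"]) for r in rows])
        conn.executemany(
            "INSERT OR REPLACE INTO compound_patent VALUES (?, ?, ?)",
            [(int(r["compound_id"]), r["patent_id"], int(r["frequency"]))
             for r in rows])
    return len(rows)

conn = sqlite3.connect(":memory:")
load_extract(conn, io.StringIO(SAMPLE))
```

Once loaded, a query such as `SELECT patent_id, frequency FROM compound_patent WHERE compound_id = 1` returns the patents in which a given compound appears.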

The database schema has been designed for portability, and tested on both Oracle and MySQL. The following diagram shows the tables and fields in the schema - click the link in the caption to zoom:

SureChEMBL data client schema.

Pros and Cons

If you've read this far, you're probably considering replicating SureChEMBL data in your local environment. If so, there are a few other things to bear in mind.

First, make sure you dedicate enough resources for the RDBMS. The resulting database will be in the order of hundreds of megabytes, and will take days to build rather than hours. The SureChEMBL data client uses an RDBMS because patent-chemistry data is strongly relational. The tradeoff for representing data in this way is typically lower performance and more complex querying and management - this is because the RDBMS manages consistency for you through foreign key relationships, uniqueness relationships, etc.

Second, the resulting dataset is not a one-to-one replica of the data hosted on the SureChEMBL servers, rather it's a "client-facing" version of the data that provides (a) A mapping between chemical compounds and patents, with frequency of occurrence, (b) Chemical representations (SMILES, etc), (c) Chemical attributes such as molecular weight and medicinal chemistry alerts, and (d) Patent metadata such as titles and classifications. You may find that you need to augment the SureChEMBL data from other sources, or request addition of other fields from EMBL/EBI.

Third, bear in mind that the local replica provides a snapshot of the SureChEMBL data, so any data quality improvements made to historic patents won't be reflected in your database; that said, an update or patching mechanism may be provided in the future.

Finally, remember that the SureChEMBL data client can be modified to suit your needs. The client includes, for example, a filtering mechanism that flags certain patents as "life science relevant", based on patent classifications. You may wish to change this based on your own definition of relevance, or to prevent irrelevant patents from being loaded at all.
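A classification-based relevance flag can be as simple as a prefix check. The sketch below is only an illustration of the idea: the function name and the list of classification prefixes are my own assumptions, not the rules used by the data client:

```python
# Illustrative classification prefixes only; the data client's actual rules may differ.
LIFE_SCIENCE_PREFIXES = ("A61K", "A61P", "C07")

def is_life_science_relevant(classifications, prefixes=LIFE_SCIENCE_PREFIXES):
    """Return True if any classification code starts with a relevant prefix."""
    return any(
        code.strip().upper().startswith(prefixes) for code in classifications
    )
```

Changing your definition of relevance then becomes a matter of passing a different tuple of prefixes, and the same check could be used to skip irrelevant patents before they are loaded.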

Where next?

The data loading mechanism is complete as it stands, but there are a few useful extensions that may be added in the future.

In particular, a data patching mechanism would be useful, as it's currently easier to rebuild the replica than to apply changes to the loaded data. It would also be beneficial if there were richer patent metadata, as well as further chemical properties. Finally, it would be helpful if there was a way to seamlessly integrate chemical search such as the JChem cartridge - as it stands there is no default facility in place for searching.

There are also several extensions that could be made to SureChEMBL. These include new types of annotation (such as other biological entities), addition of new patent authorities, as well as data filtering and other quality improvements.

If you have questions regarding the data loading mechanism described in this article, please see my website for contact details. If you have questions related to the wider SureChEMBL system, please contact the SureChEMBL team.

Friday, 17 October 2014

Anatomy of an Emerging Knowledge Network: The Zapnito Graph Vizualized

In this article, I take a high-level look at Zapnito, a multi-tenant "Networked Knowledge" platform designed around small, expert communities.

Zapnito is a knowledge sharing platform that allows organizations to create branded networks of experts. It's aimed at publishers, consultancies, media companies, and other corporations. Zapnito includes some social features (such as follow relationships, collaboration), but its focus is knowledge sharing rather than social networking.

As the founder puts it: "Zapnito is a white label platform that offers knowledge network capabilities for publishers. We provide both highly private and open networks, and we own neither publisher content or associated data - both of these are retained by publishers." 

The aim of this article is to show some of the interesting insights that can be gained from basic Social Network Analysis (SNA) of Zapnito. I'll be showing visualizations (such as that on the right) built from an anonymized subset of the Zapnito database, and discussing what can be learned from these.

Note: If you're a more SNA-savvy reader, I won't be diving into metrics such as Average Path Length or Clustering Coefficients - please look out for a future article on these topics.

Users and Followers

What does the core social network look like?

The graphs in this article were built using Gephi, a desktop application designed specifically for Social Network Analysis. All that's needed is a list of nodes (Zapnito users) and a list of edges (follow relationships) in flat file format, and you have a network.
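For illustration, the two flat files could be produced with a few lines of Python. This is a sketch rather than the script I used: the user identifiers are invented, and the column names (Id, Label, Source, Target, Type) are the ones I'd expect Gephi's spreadsheet import to recognise, so check them against your Gephi version:

```python
import csv
import io

def write_gephi_tables(users, follows, nodes_out, edges_out):
    """Write a node table and a directed edge table as CSV."""
    nodes = csv.writer(nodes_out, lineterminator="\n")
    nodes.writerow(["Id", "Label"])
    for user_id, label in users:
        nodes.writerow([user_id, label])

    edges = csv.writer(edges_out, lineterminator="\n")
    edges.writerow(["Source", "Target", "Type"])
    for follower, followed in follows:
        # A follow relationship points from the follower to the followed user.
        edges.writerow([follower, followed, "Directed"])

# Hypothetical anonymized users and follow relationships.
nodes_file, edges_file = io.StringIO(), io.StringIO()
write_gephi_tables(
    users=[(1, "user-1"), (2, "user-2"), (3, "user-3")],
    follows=[(2, 1), (3, 1)],
    nodes_out=nodes_file,
    edges_out=edges_file,
)
```

In practice you would pass real file handles opened with `newline=""` instead of the in-memory buffers used here.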

The Zapnito team kindly provided me with an anonymized extract of their database covering several representative customers, so with a little data processing the network could be imported as a directed graph, then visualized:

The core Zapnito network

Here, each node represents a Zapnito user, and each edge represents a follow relationship. Nodes are scaled according to the number of followers, which is a typical measure of influence in a social network.

The graph is organized using a built-in algorithm, one that simulates a physical model in order to find an aesthetically pleasing layout. Another effect of the layout algorithm is that related users tend to be close to each other, while unrelated users are further apart.

There are a couple of observations we can make about the above graph.

  • First, there's one user who appears to be central to the network, someone with numerous connections to the rest of the graph, and relationships to many less influential users. This user is in fact the founder of Zapnito, who has had a long time to build up connections and is motivated to connect with as many Zapnito users as possible to encourage use of the network.
  • Second, you may notice several clumps or clusters - there's a large one to the left, one underneath, and one or two to the right. Apart from a small amount of adjustment by me, the graph as you see it represents the output of the layout algorithm, so what's going on? 
To understand further, we need to look at Zapnito's grouping mechanism.

Group Membership

Zapnito is designed to serve the needs of expert communities, so an essential feature is the set of communities that users can be part of.

These range from invitation-only, private communities with exclusive membership, to open communities that encourage public participation around selected contributors. Examples of public communities include the LifeLabs network, and Zapnito itself.

Note that Zapnito typically uses the term network to refer to the expert groups they host; here I'll use the term community to differentiate from the social network that's being analysed.

So what happens if we partition the nodes by community?

The Zapnito network, partitioned by community

We can now start to see the reason for the clumping generated by the layout algorithm: there is fairly high cohesion between members of a community. This is a nice result, and it's interesting to see the network of networks manifested in this way.

However, Zapnito users can actually be members of many different communities, which you can see above as dark grey nodes. It's important to know who these users are, because they can act as bridges in the network and may be instrumental in disseminating information between communities. Again it's understandable that the founder is a bridge, though there are several others worth noting.

As well as the bridge users, there are some interesting anomalies in the graph deserving of further analysis - but that's out of scope here.

Automatically Detected Communities

So we've seen the communities as defined by the Zapnito administrators, but there's another perspective we can take. Gephi has a built-in feature to detect communities, using the Louvain algorithm. This detects the most strongly connected nodes within the network, and assigns them to groups.
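The Louvain algorithm works by searching for a partition with high modularity, a score that rewards groups with more internal edges than would be expected by chance. The sketch below does not implement Louvain itself; it only computes the modularity of a given partition, treating the graph as undirected for simplicity. The function name and toy graph are my own:

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )

# Toy graph: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_group = {n: "a" for n in range(1, 7)}
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one group, which is the kind of comparison a modularity-based method makes many times over while it searches.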

Here's what it finds in the Zapnito graph:

Automatically detected communities

Here, the algorithmically detected communities are quite similar to the real communities, but with some notable differences:

  • First, there's a distinct community around the founder. Again this simply reinforces the fact that the founder plays a central role in establishing and promoting the network.
  • Second, some smaller communities which are visible in the previous graph have been folded into the larger communities. 
This second point is worth bearing in mind if you're considering using community detection to provide social network features: you may offend your users if you assign them to the wrong group.

Of course the opposite may be true - a user's follow relationships may reveal the truth of their interests (or allegiance), and may be a better indicator of community membership than the set of pre-configured communities on offer.

Contributions and Impact

So far, we've looked at the overall network structure, as well as communities within the network. But Zapnito is a content distribution system for experts, so what insights can we gain here?

The Zapnito database provides counts of video submissions, articles, and comments made by each user. By extracting this data we can highlight the users in the network who make the biggest contribution. Contributions can be counted individually by type, but it's more interesting to look at an aggregate view.

Below, users are shaded relatively according to an overall contribution score - where video submissions scored ten points, articles five points, and comments one point:

Biggest contributors

Here we can see that most users have modest numbers of contributions compared to a handful of very active users. Given the nature of expert communities, this is expected: apart from a small number of prolific content producers, most users will generate high quality submissions, but infrequently.
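The scoring scheme is easy to reproduce. In this sketch the weights are the ones given above, while the dictionary layout, user names, and the relative shading step (score divided by the top score) are assumptions about how you might prepare the values before handing them to Gephi:

```python
# Points per contribution type, as described in the text.
WEIGHTS = {"videos": 10, "articles": 5, "comments": 1}

def contribution_score(counts):
    """Aggregate score for one user from their per-type contribution counts."""
    return sum(WEIGHTS[kind] * counts.get(kind, 0) for kind in WEIGHTS)

def relative_shading(scores):
    """Scale scores to the range 0..1 relative to the biggest contributor."""
    top = max(scores.values(), default=0)
    return {user: (score / top if top else 0.0) for user, score in scores.items()}

# Hypothetical users and counts.
counts_by_user = {
    "user-1": {"videos": 2, "articles": 1, "comments": 4},
    "user-2": {},
    "user-3": {"videos": 5, "articles": 1, "comments": 3},
}
scores = {user: contribution_score(c) for user, c in counts_by_user.items()}
```

The weights are a judgment call: they express the view that a video takes more effort than an article, and an article more than a comment.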

It's also worth noting that the largest contributor is not the most influential, at least in terms of followers. This is a useful thing to know - it may be beneficial, for example, to promote this user to increase their reach.

We may also want to find users who make little or no contribution, but have influence within the network. We can find these users by modifying the shading in Gephi to give more weight to users who have made at least a small contribution; this brings out the lurkers!

Lurkers (shown in orange)

Above, the highlighted nodes represent users who have made no contributions - comments, posts or otherwise. These individuals, especially those with reasonable numbers of followers, are prime targets to encourage greater participation.
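Outside of Gephi, the same lurkers can be listed directly from the data. This is a small sketch with invented numbers; the `min_followers` cut-off is a hypothetical parameter standing in for "reasonable numbers of followers":

```python
def influential_lurkers(followers, scores, min_followers=5):
    """Users with no contributions but at least `min_followers` followers."""
    lurkers = [
        user for user, count in followers.items()
        if scores.get(user, 0) == 0 and count >= min_followers
    ]
    # Most-followed first, with the user name as a tie-breaker.
    return sorted(lurkers, key=lambda user: (-followers[user], user))

# Hypothetical follower counts and contribution scores.
followers = {"user-1": 12, "user-2": 8, "user-3": 2, "user-4": 20}
scores = {"user-1": 0, "user-2": 0, "user-3": 0, "user-4": 45}
```

With these numbers user-1 and user-2 are returned, in that order: user-3 has too few followers to count as influential, and user-4 is already contributing.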

We can use a similar principle to bias the shading to the highest scoring contributors only:

Heroes (in purple)

Here again we're showing the heroes of the system - this is just an alternative view to the graph showing the overall contribution score, but here the biggest hitters in terms of contribution are emphasized.


There are a few key conclusions to take from the above analysis.

First, it's clear that Zapnito's founder has an important role to play in the emerging network, as a well-connected influencer and as a bridge between different communities. However the centrality of the founder's node in the graph is mostly related to his activities in promoting Zapnito and encouraging participation by following and engaging with other Zapnito users, and it will be interesting to see how this changes over time as the network grows.

Next, the difference between official and detected communities suggests that group membership is not clear cut, and is likely to shift over time. This may provide opportunities in the form of emergent groups that were not originally foreseen, as well as potential issues such as split loyalties or schisms in existing communities.

Finally, the process of scoring contributions to build an aggregate score is a useful technique for identifying key contributors, and contrasting such a score with a measure of reputation or impact helps identify influential lurkers, as well as major contributors with limited reach. The former can be encouraged to contribute, while the latter can be supported in building their network of followers, both of which will support dissemination of quality content across the network.

Thursday, 10 April 2014

I Know Where You Were Last Summer: London's public bike data is telling everyone where you've been

This article is about a publicly available dataset of bicycle journey data that contains enough information to track the movements of individual cyclists across London, for a six month period just over a year ago.

I'll also explore how this dataset could be linked with other datasets to identify the actual people who made each of these journeys, and the privacy concerns this kind of linking raises.


It probably won't surprise you to learn that there is a publicly available Transport For London dataset that contains records of bike journeys for London's bicycle hire scheme. What may surprise you is that this record includes unique customer identifiers, as well as the location and date/time for the start and end of each journey. The public dataset currently covers a period of six months between 2012 and 2013.

What are the consequences of this? It means that someone who has access to the data can extract and analyse the journeys made by individual cyclists within London during that time, and with a little effort, it's possible to find the actual people who have made the journeys. 

To show what's possible with this data, I built an interactive map to visualize a handful of selected profiles.

Please note: the purpose of this article is to expose the risks that can come with open datasets. However I've held off from actually trying to find the people behind this data, mostly because of the privacy concerns but also because (thankfully) it requires a fair bit of effort to actually identify individuals from the data...

Below, you'll find a map of all journeys made by one specific cyclist (commuter X), selected because they're one of the top users of a familiar bicycle hire station near where I work:

Bike journeys map - commuter X [interactive version]

Each line represents a particular journey, the size of the line showing the number of times that journey was made. The size of the circle represents the number of different destinations that the cyclist has travelled to and from that bike station. Purple lines indicate there were journeys in both directions, while orange lines (with arrows) indicate journeys that were one-way only.

Bigger, therefore, implies the route or station has greater significance for the person.
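To show how little processing is needed, here is a sketch of the aggregation behind a map like this, run on made-up records. The record layout and customer numbers are hypothetical; the output gives, for each pair of stations, the number of journeys (line size) and whether journeys were made in both directions or one way only (line colour):

```python
from collections import Counter

# Hypothetical records: (customer_id, start_station, end_station).
journeys = [
    (7, "Limehouse", "Kings Cross"),
    (7, "Limehouse", "Kings Cross"),
    (7, "Kings Cross", "Limehouse"),
    (7, "Limehouse", "Mile End"),
    (9, "Waterloo", "Holborn"),
]

def route_summary(journeys, customer_id):
    """Summarise one customer's journeys per station pair."""
    counts = Counter(
        (start, end) for cust, start, end in journeys if cust == customer_id
    )
    routes = {}
    for (start, end), n in counts.items():
        # Treat A->B and B->A as the same line on the map.
        key = tuple(sorted((start, end)))
        entry = routes.setdefault(key, {"journeys": 0, "directions": set()})
        entry["journeys"] += n
        entry["directions"].add((start, end))
    return {
        key: (entry["journeys"],
              "both" if len(entry["directions"]) == 2 else "one-way")
        for key, entry in routes.items()
    }
```

The whole profile falls out of a single pass over the records for one customer identifier, which is exactly why that identifier is the problem.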

NOTE: if you think you might be this person, and you're unhappy having your personal journey data here, please contact me and I will remove the offending map. Then contact TFL (as I have) and tell them to remove customer record numbers from the data.

So what can we tell about this person?

First impressions suggest that they probably live near Limehouse, work in Kings Cross, and have friends or family in the Bethnal Green / Mile End areas of London. This story is strengthened if we filter down to journeys made between 4.00am and 10.00am:

Commuter X - morning journeys [interactive version]

We can see that this person only travels to Kings Cross in the morning, when departing from the Limehouse area or from Bethnal Green. So a morning commute from home, and/or a partner's abode? Applying a similar filter for the afternoon and evening shows return journeys, so the commuting hypothesis becomes stronger still.
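The time-of-day filter is equally straightforward. In this sketch the records and dates are invented, and treating the window as including 4.00am but excluding 10.00am exactly is my own choice:

```python
from datetime import datetime, time

def in_window(journeys, start=time(4, 0), end=time(10, 0)):
    """Keep journeys whose start time falls within [start, end)."""
    return [j for j in journeys if start <= j[0].time() < end]

# Hypothetical records: (start_datetime, start_station, end_station).
journeys = [
    (datetime(2012, 8, 1, 8, 15), "Limehouse", "Kings Cross"),
    (datetime(2012, 8, 1, 18, 5), "Kings Cross", "Limehouse"),
    (datetime(2012, 8, 2, 10, 0), "Limehouse", "Mile End"),
]
morning = in_window(journeys)
evening = in_window(journeys, start=time(16, 0), end=time(22, 0))
```

Here the morning window picks out the Limehouse to Kings Cross trip, and the evening window picks out the return leg.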

Like me, you're probably starting to feel a bit uncomfortable at this point - after all I'm putting a story to this person's data, and it's starting to sound quite personal.

What's more interesting (and worrying) is that:

  1. I'm not really trying very hard, and a deeper inspection of dates, times, locations etc. can reveal far more detail
  2. There's enough here to start thinking about putting a name to the data.

All that's needed to work out who this profile belongs to is one bit of connecting information.

A Foursquare check-in could be connected to a bike journey, though it would be difficult to connect it to the cycle scheme. More likely would be a time-stamped Facebook comment or tweet, saying that the Kings Cross Boris Bike station is full. Or a geo-coded Flickr photograph, showing someone riding one of the bikes...

Any seemingly innocuous personal signal would be enough to get a detailed record for someone's life in London ... travelling to work, meeting up with friends, secret trysts, drug deals - details of any of these supposedly private aspects of our lives can be exposed.

Here's another profile, chosen because of the volume of journeys made:

Complex bike journey map [interactive version]

Hopefully you can see the richness of the information that is available in the TFL dataset. Every connection on the map represents something of significance to the cyclist, each bike station has some meaning. As well as being a digital fingerprint that can be linked to personally identifiable information, the journey data is a window on this person's life.


On a final note, I'd like to point out that there are positives to releasing such data, which can be seen (for example) in the following map:

Commuter destinations around Victoria [interactive version]

The above map shows commuter journeys from a bike station near Embankment to various stations around Victoria. These are journeys made between approximately 4.00pm and 5.30pm - so return commutes from work, presumably followed by a train journey from Victoria southwards. Here, there is one point of departure but three destinations, probably because Victoria Rail Station is a major transport hub, so the bike stations nearby will be popular and may often fill up.

The point is that there are benign insights that can be made by looking at individual profiles - but the question remains whether these kinds of insights justify the risks to privacy that come with releasing journey data that can be associated with individual profiles.


Leaflet.js - web mapping library
Cloudmade - map tiles
Transport For London - datasets of Boris Bike data

Sunday, 2 March 2014

London maps and bike rental communities, according to Boris Bike journey data

Every time someone in London makes a journey on a Boris Bike (officially, the Barclays Cycle Hire Scheme), the local government body Transport For London (TFL) record that journey. TFL make some of this data available for download, to allow further analysis and experimentation.

Below, you'll find maps of the most popular bike stations and routes in London, created from the TFL data using Gephi, plus a few simple data processing scripts that I threw together. The idea for these maps originated within a project group at a course on Data Visualisation, held at the Guardian last year. We're working on a more publisher friendly form, so thank you to my course mates for giving me the go ahead to include them here.

First, here's a map showing all bike stations and all popular journeys.

Popular Boris Bike journeys and stations. Full version.

The first map shows the most popular routes and bike stations, those with more than ~150 journeys made during the six months of data that TFL make available. The size of each bike station in this map is based on the number of popular journeys that start or end at that station, a measure of the connectedness of the location. Note: the labels just show the rental area, not the specific station name.
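As a sketch of the filtering behind this map (not the scripts I actually used), the popular routes and the station sizes could be derived as follows. It assumes each journey is a (start, end) pair of station names and that a route is directional; both are simplifications:

```python
from collections import Counter

def popular_routes(journeys, min_journeys=150):
    """Routes (start, end) made more than `min_journeys` times."""
    counts = Counter(journeys)
    return {route: n for route, n in counts.items() if n > min_journeys}

def station_connectedness(routes):
    """Number of popular routes that start or end at each station."""
    sizes = Counter()
    for start, end in routes:
        sizes[start] += 1
        if end != start:  # count a round trip once for its station
            sizes[end] += 1
    return sizes

# Hypothetical journeys, with a tiny threshold so the example is readable.
journeys = [("A", "B")] * 3 + [("B", "C")] * 2 + [("A", "A")] * 4 + [("C", "A")]
routes = popular_routes(journeys, min_journeys=1)
sizes = station_connectedness(routes)
```

Note that the station size counts routes rather than journeys, so a station with many moderately popular connections appears larger than one with a single very busy route.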

Next, a map where the stations have been grouped together into rental areas, as allocated by TFL:

Rental areas and traffic volumes in the Boris Bike network. Full version | Alternative.

The second map is a version of the first map where related bike stations have been grouped together, and the volume of journeys between areas determines the weight of each connection. Colours in the second map are related to distinct communities in the network - more on this later. The position of the rental areas is approximate and calculated by Gephi. So please don't blame me for any geographical inaccuracies in this map ;)
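Grouping stations into rental areas amounts to relabelling each route and summing the volumes. The sketch below assumes a hypothetical lookup from station name to rental area, with invented station names and counts:

```python
from collections import Counter

def area_volumes(route_counts, area_of):
    """Sum journey counts between rental areas instead of individual stations."""
    volumes = Counter()
    for (start, end), n in route_counts.items():
        volumes[(area_of[start], area_of[end])] += n
    return volumes

# Hypothetical stations, areas, and journey counts.
area_of = {"Waterloo 1": "Waterloo", "Waterloo 3": "Waterloo", "Holborn 1": "Holborn"}
route_counts = {
    ("Waterloo 1", "Holborn 1"): 200,
    ("Waterloo 3", "Holborn 1"): 350,
    ("Waterloo 1", "Waterloo 3"): 160,
}
volumes = area_volumes(route_counts, area_of)
```

The summed volume then becomes the weight of the connection between two areas, and journeys within a single area show up as a loop on that area.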

Some interpretation, along with inspection of the underlying data, shows that:
  • Major entry points for Boris Bike use are via Kings Cross and Waterloo, more than likely due to commuters arriving from the North and South then heading deeper into London for work.
  • The most popular journeys are those around Hyde Park, corresponding to a popular tourist activity. 
  • The most popular journey (by a long way) is from Hyde Park Corner ... to Hyde Park Corner, presumably a nice trip round the park.
  • The most popular commuter route is between Waterloo (station 3) and Holborn, probably via Waterloo Bridge.
Of course that's just scratching the surface, and just one example of how to visualize the data. There's much more that can be done, and similar maps have been created before. Here are a couple of my favourites, plus another of my own, afterwards:
  • This delightful video by specialist Jo Wood at City University in London, published by New Scientist, also shows popular routes in the network.
  • A recent BMJ article included a street-level map showing predicted routes and volumes; the focus here is on the health impact of bike sharing schemes.
A bit of experimentation with Gephi's community detection tool results in this map:

Rental communities in the Boris Bike network. Full version.

Here, major connected clusters of bike stations are shown in the same colour (red for Waterloo and environs, green for around Hyde Park, etc). The communities are detected using Gephi's implementation of the Louvain Method, which finds local communities within large networks. This algorithm has a random element, and generates slightly different communities on each run. However, it's clear from repeated runs that distinct local communities exist in the network, in particular around Hyde Park, Kings Cross, Waterloo, and Canary Wharf.
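The Louvain Method works by repeatedly moving nodes between communities to increase a score called modularity, which compares the weight of connections inside each community with what would be expected by chance. The sketch below is not the Louvain algorithm itself, and not Gephi's code; it only computes the modularity of a given partition of an undirected, weighted graph, which is one way to compare the slightly different results from repeated runs. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, weighted graph.

    edges: dict mapping (u, v) -> weight, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    total = sum(edges.values())       # m: total edge weight
    internal = defaultdict(float)     # edge weight inside each community
    degree = defaultdict(float)       # summed node degree per community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)


# Toy network: two triangles joined by a single edge.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
         ("c", "d"): 1}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Splitting the toy network into its two triangles scores higher than lumping everything together, which is the kind of structure the community detection picks out around Hyde Park or Waterloo.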

The map that shows bike rental areas (rather than stations) was coloured according to these communities, with journeys between different communities having a colour that's a mixture of the two that are involved.
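One simple way to get a mixed colour for a connection is to average the two community colours channel by channel. This is a sketch under that assumption; the palette and function names are hypothetical, and Gephi's own blending may work differently.

```python
def mix_colours(rgb_a, rgb_b):
    """Average two (r, g, b) colours channel by channel."""
    return tuple((a + b) // 2 for a, b in zip(rgb_a, rgb_b))


def edge_colour(community_a, community_b, palette):
    """Colour for a journey between two communities."""
    return mix_colours(palette[community_a], palette[community_b])


# Hypothetical palette: red for Waterloo, green for Hyde Park.
palette = {"Waterloo": (255, 0, 0), "Hyde Park": (0, 128, 0)}
between = edge_colour("Waterloo", "Hyde Park", palette)
within = edge_colour("Waterloo", "Waterloo", palette)
```

Journeys inside a single community keep that community's colour, while journeys between two communities come out as a blend of both.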

If you fancy playing with the data or trying out some other visualizations, you can find everything in this GitHub repository.

Saturday, 7 September 2013

Visualizing The Wheel of Time: Reader Sentiment for an Epic Fantasy Series

In the following blog post, I explore reader sentiment for the Epic Fantasy book series The Wheel of Time, as expressed in user-submitted ratings on Amazon and GoodReads.

If you're a data scientist (or similar), you'll probably be interested in the data analysis, which includes some interesting observations about the usefulness - or otherwise - of Amazon reviews. If you're an Epic Fantasy Series reader, you'll be interested in the outcome of my analysis: I've decided to go ahead and read all fourteen books.

Note - This is a spoiler free zone
Updated 17th Sept: charts now more accurate, not quite as pretty


Recently, I was looking for a good book to read, and a friend recommended The Wheel of Time series by Robert Jordan. But I'd heard from a few sources that the later volumes were harder going than the earlier volumes. Having struggled with later volumes of the Game of Thrones series (sorry George) I was wary of starting a mammoth fourteen volume series unless I was confident I could make it to the end.

So I did some research. First, I checked the Amazon reader-submitted ratings for the books, which are on a scale of 1 to 5 stars. Here's what they look like for the whole series:

Uh oh, that doesn't look good. Books eight to eleven get some pretty poor scores, though the later books seem to pick up again. It looks like the middle of the series could be hard work.

What's going on?

Next, I checked the GoodReads ratings for the series. GoodReads is a site designed "to help people find and share books they love... [and] to improve the process of reading and learning throughout the world." Here's how the GoodReads ratings (which are also submitted by readers and go from 1 to 5) stack up against the Amazon ratings:

So that's a little different. There's still a hump in the middle, but it's nowhere near as pronounced ... in fact the lowest aggregate rating is 3.84, far higher than the 1.8 for the same book on Amazon!

Let's look at the number of reviewers for the two systems, corresponding to the number of people who've read the book and recorded their review and/or star rating. First, GoodReads:

Well that seems reasonable - the number of ratings tails off in the middle, then picks up towards the end as you'd probably expect given the ratings we've seen. And generally speaking, each book has lots of ratings - the lowest count is for the final book, probably a reflection that A Memory of Light has only been out for a year compared to the other books.

Oddly though there are more ratings for book twelve (The Gathering Storm) than for quite a few of the earlier books ... more GoodReads users rated that book than earlier volumes.

This likely reflects the sad fact that volume eleven (Knife of Dreams) was the final volume completed by the original author Robert Jordan, who passed away in 2007. Volume twelve is the first volume written by Brandon Sanderson, the author Robert Jordan chose to finish the series. RIP Robert.

How about Amazon reviewer counts then?

On Amazon, there are far more reviewers for the books that received the really low scores. This suggests that the really low scores are actually a result of frustrated readers motivated to express their concerns, rather than a reflection of relative enjoyability or quality per se.

GoodReads makes it extremely easy to submit a rating for a book - one click is all it takes. Amazon seems almost to discourage reviews - the "Write a Review" button is halfway down the page, and you must provide a title and description for your review. The net result is that the input of everyday browsing users won't be captured on Amazon - only motivated reviewers (such as the frustrated reader) will bother to jump through all the hoops.


Overall, therefore, it seems sensible to expect a dip in the enjoyability of The Wheel of Time series, from book eight to about book eleven.

But perhaps that dip isn't as severe as suggested by Amazon, whose ratings are likely skewed by frustrated readers. My guess is that many readers reach the later volumes and are frustrated by a change of pace; this certainly matches my experience with the Game of Thrones series where events seemed to slow to a crawl in the most recent books. The problem is compounded when there are long gaps between books being published, making it harder to pick up the story.

Thankfully, the final few books get much higher ratings across the board, so I'm expecting that it's worth getting through the slower books to reach the finale. At least, that's the story I'm telling myself ...

Only Time will tell.

Friday, 4 January 2013

Personal Data Hacks: Visualizing Data from OpenFlights.org

A friend recently told me about OpenFlights.org, a website that allows you to record, analyze, and share personal flight data. He showed me his dataset, which contained a record of every flight he'd taken over the past 10+ years. I was keen to investigate the dataset further, and my friend was happy to provide me with a copy so I could have a play (thank you Luigi!).

The end result is the following collection of visualizations created in Gephi, with a little help from R. They show key transport hubs and routes for airports, countries, and continents that my friend has visited, and demonstrate some of the fun, insightful ways you can use such personal data.

If you're interested in how the visualizations were created, check out the section at the end of this blog posting where I briefly describe the technologies required and steps involved.

Note: OpenFlights.org is free to use, and supported by advertising and donations. You can join me in supporting OpenFlights.org via this link.

Hub Airports and Key Routes

The first visualization below shows the primary airports and routes used by Luigi. Each airport has been ranked in size and colour according to the number of other connected airports, while each connection has been weighted according to the number of times that route was flown. The layout here was generated in Gephi, ensuring (simply put) that related nodes are co-located:

As you can see, PSA (Pisa) and STN (London Stansted) are far and away the most used airports. Not only that, but the return journey between the two airports has been taken many times. These two facts make perfect sense given that Luigi is from Pisa, but moved to the UK a few years ago. Other significant hubs are London Heathrow, London Gatwick, and Rome - not too surprising.

Key Countries and Connections 

Given that many airports are within the same country, is it possible to reflect that in the visualization? One way to achieve this is to partition airports by colours corresponding to different countries, as follows:

So that's kind of OK - the predominance of Italy (yellowish-green) and the UK (blue) is starting to show - but it's quite confusing.

A better approach is to group the airports and connections by country, and to lay out the nodes according to (approximate) geographical positions. The following graph also has a few graphical tweaks for readability:

We're now getting to something approximating a worldwide travel heatmap for my friend. The key travel hubs of the UK and Italy are obvious, and key routes are also jumping out more: between Italy and the UK, Italy and Germany / France, and the UK and Spain. The significance of the other routes also becomes a bit more apparent - further afield countries corresponding to occasional holiday travel (for instance).

Continental Travel

What about different continents? If we return to the original graph and partition the airports by continent, a European bias becomes very clear:

It's also nice to see the groupings of continental airports jumping out - in particular the green nodes in the bottom right corresponding to African airports. Note that I avoided grouping by continent here because the resulting node for Europe dwarfed all the other nodes, which didn't make for a good visualization.

Creating the Visualizations

The flight data is downloadable from OpenFlights.org as Comma Separated Values. I used a little command-line manipulation (awk, sort, and uniq) to compress the data into a list of unique flights, with a count corresponding to the number of times that flight was taken.
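For anyone who prefers a script to the command line, the same "unique flights with a count" step can be sketched in Python. This is an equivalent, not the awk/sort/uniq pipeline itself, and the column names `From` and `To` are an assumption about the export format; adjust them to match the actual file.

```python
import csv
import io
from collections import Counter


def count_flights(csv_file, origin_col="From", dest_col="To"):
    """Return a Counter of (origin, destination) -> times flown."""
    reader = csv.DictReader(csv_file)
    return Counter((row[origin_col], row[dest_col]) for row in reader)


# Made-up sample standing in for a real export.
sample = io.StringIO(
    "Date,From,To\n"
    "2010-01-05,PSA,STN\n"
    "2010-01-12,STN,PSA\n"
    "2011-03-02,PSA,STN\n"
)
flights = count_flights(sample)
```

Each direction is counted separately here, so PSA to STN and STN to PSA are two distinct routes.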

Next, I loaded the data into R, then converted it into a graph which could be easily exported to GML (Graph Modelling Language), then loaded into Gephi and visualized.
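GML is a plain-text format, so the conversion can also be sketched without R. The Python below is a stand-in for the R step, not a reproduction of it: the function name is hypothetical, and storing the flight count in an edge attribute called `weight` is an assumption, so check how Gephi interprets it on import.

```python
def to_gml(flight_counts):
    """Render (origin, destination) -> count pairs as a directed GML graph."""
    airports = sorted({code for route in flight_counts for code in route})
    ids = {code: i for i, code in enumerate(airports)}
    lines = ["graph [", "  directed 1"]
    for code in airports:
        lines += ["  node [", f"    id {ids[code]}", f'    label "{code}"', "  ]"]
    for (origin, dest), n in sorted(flight_counts.items()):
        lines += ["  edge [",
                  f"    source {ids[origin]}",
                  f"    target {ids[dest]}",
                  f"    weight {n}",  # assumed attribute name for times flown
                  "  ]"]
    lines.append("]")
    return "\n".join(lines) + "\n"


gml_text = to_gml({("PSA", "STN"): 2, ("STN", "PSA"): 1})
```

Writing `gml_text` to a file with a `.gml` extension gives something that can be opened as a graph file.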

The downloaded dataset didn't contain city, country, or continent data. Adding this required an export of nodes from Gephi, followed by a merge with the OpenFlights.org Airport dataset (spreadsheet magic), and a re-import into Gephi.
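The merge itself is a lookup on airport code, and could be scripted instead of done in a spreadsheet. A minimal Python sketch, assuming both files are CSV with header rows and that the columns are called `Label`, `IATA`, `City` and `Country`; those names are hypothetical and the real exports may differ.

```python
import csv
import io


def enrich_nodes(nodes_file, airports_file, key="Label", airport_key="IATA"):
    """Left-join exported node rows with airport details on the airport code."""
    airports = {row[airport_key]: row for row in csv.DictReader(airports_file)}
    enriched = []
    for node in csv.DictReader(nodes_file):
        details = airports.get(node[key], {})
        enriched.append({
            **node,
            "City": details.get("City", ""),       # blank if code not found
            "Country": details.get("Country", ""),
        })
    return enriched


# Made-up samples standing in for the Gephi export and the airport dataset.
nodes = io.StringIO("Id,Label\n0,PSA\n1,STN\n2,XXX\n")
airports = io.StringIO("IATA,City,Country\nPSA,Pisa,Italy\nSTN,London,United Kingdom\n")
enriched = enrich_nodes(nodes, airports)
```

Airports missing from the lookup keep blank fields rather than being dropped, so the node list stays aligned with the graph when it is re-imported.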