Tuesday, 5 May 2015

Visualizing Rebel Alliances in the UK Government

The UK will shortly go to the polls for the 2015 General Election. However there's currently no clear front-runner, and in fact no clear coalition on the cards for a new government. The "new normal" of hung parliaments and coalition forming as part of UK politics appears to be here to stay.

Click here for full size version
As such, I decided to take a look at the open dataset provided by The Public Whip project, with a view to visualizing the relationships between MPs (members of parliament) in the 2010 to 2015 UK government, using a tool called Gephi. The idea was to analyse how MPs are related through their voting patterns in the House of Commons, and in particular how they are related through agreement or rebelliousness.

Also I'll admit it: I wanted to write an article with "Rebel Alliance" in the title because I like Star Wars.

In the rest of this article, I'll describe several visualizations that were created from that public whip dataset. These show various aspects of MP relationships, and provide some interesting insights into the UK government.

MP Agreement

First, let's look at a social graph of MPs with a high level of agreement.

To create this visualization I established relationships between MPs if their votes agree more than 85% of the time - this threshold is based on a histogram of agreement rates from the public whip data, which shows a clear threshold of agreement at around 85%.
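As a rough sketch of how such edges could be derived: the snippet below computes the agreement rate for every pair of MPs and keeps the pairs above the 85% threshold. The data layout (a dict of division id to vote per MP), the function names, and the toy votes are assumptions for illustration; they are not the Public Whip schema or the code from the repository.

```python
from itertools import combinations


def agreement_rate(votes_a, votes_b):
    """Fraction of shared divisions in which two MPs voted the same way.

    Returns None when the two MPs never voted in the same division.
    """
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return None
    same = sum(1 for division in shared if votes_a[division] == votes_b[division])
    return same / len(shared)


def agreement_edges(votes_by_mp, threshold=0.85):
    """Yield (mp_a, mp_b, rate) for every pair agreeing more than `threshold`."""
    for a, b in combinations(sorted(votes_by_mp), 2):
        rate = agreement_rate(votes_by_mp[a], votes_by_mp[b])
        if rate is not None and rate > threshold:
            yield a, b, rate


# Toy data: hypothetical MPs and division ids.
votes_by_mp = {
    "MP A": {1: "aye", 2: "no", 3: "aye", 4: "aye"},
    "MP B": {1: "aye", 2: "no", 3: "aye", 4: "aye"},
    "MP C": {1: "no", 2: "aye", 3: "aye", 4: "no"},
}
edges = list(agreement_edges(votes_by_mp))
```

Only divisions that both MPs voted in are counted, so an MP with a thin voting record is not penalised for absences.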

Click here for full size version

The nodes (i.e. circles) in the above diagram represent individual MPs, coloured by their political affiliation. The edges (lines) connect MPs if they tend to agree with each other. The graph is laid out using a force-directed graph algorithm, which places related nodes next to each other in a visually pleasing way.

How can we interpret this diagram?

Unsurprisingly, the majority coalition (formed of Conservatives and Liberal Democrats, in blue and orange-yellow) tend to agree with each other, probably when voting "yes" to new bills, legislation, etc. The minority parties (predominantly Labour) also tend to agree with each other, again unsurprisingly.

Essentially, this is a picture of a typical government in the UK, with a governing majority. That's all very nice - but what else can we learn?

Rebellious Relationships

Things are more interesting if we look at rebelliousness.

The social graph below shows relationships between MPs if they voted the same way (yes or no), and they were rebelling against their respective parties in the process.

For example, you might see two members of the majority coalition going against the party line, voting "no" to a bill that they strongly disagree with. These members are connected in rebelliousness!

Another example would be a bill with strong cross-party support - perhaps a security or policing bill - where the majority of all parties vote "yes". In this example, if member A (Labour) and member B (Conservative) both vote "no" then again they are connected in rebelliousness. The party they belong to is not important in this example - they are rebels either way.
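The two examples above can be sketched in code. Since official party positions are not published, this sketch assumes that the "party line" in a division is simply the way most of that party's voting MPs voted. That assumption, the data layout, and all names below are illustrative rather than taken from the original analysis.

```python
from collections import Counter, defaultdict
from itertools import combinations


def find_rebels(division_votes, party_of):
    """Return {mp: vote} for MPs who voted against their party's majority.

    division_votes: {mp: "aye" | "no"} for a single division.
    party_of: {mp: party name}.
    A party whose vote is tied has no majority position, so none of its
    members are counted as rebels in that division.
    """
    tallies = defaultdict(Counter)
    for mp, vote in division_votes.items():
        tallies[party_of[mp]][vote] += 1

    rebels = {}
    for mp, vote in division_votes.items():
        ranked = tallies[party_of[mp]].most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        if not tied and vote != ranked[0][0]:
            rebels[mp] = vote
    return rebels


def co_rebellions(divisions, party_of):
    """Count, for each pair of MPs, how often they rebelled the same way."""
    counts = Counter()
    for division_votes in divisions:
        rebels = find_rebels(division_votes, party_of)
        for a, b in combinations(sorted(rebels), 2):
            if rebels[a] == rebels[b]:
                counts[(a, b)] += 1
    return counts


# Toy data: a bill with cross-party support and one dissenter per party.
party_of = {"A1": "Labour", "A2": "Labour", "A3": "Labour",
            "B1": "Conservative", "B2": "Conservative", "B3": "Conservative"}
division = {"A1": "aye", "A2": "aye", "A3": "no",
            "B1": "aye", "B2": "aye", "B3": "no"}
rebels = find_rebels(division, party_of)
pair_counts = co_rebellions([division, division], party_of)
```

Here A3 and B3 end up connected even though they sit in different parties, which matches the cross-party case described above.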

Click here for full size version

Above, the edges of the graph are sized according to the number of times the two MPs have agreed in rebelling - a bigger edge implies stronger agreement between the MPs in disagreeing with their parties. The MP nodes and names are bigger if they have more connections - so the more rebellious MPs will appear larger.

So what can we say about this graph?

First, it seems that rebelliousness broadly follows party lines, because Labour, Conservative, and Liberal Democrat rebels tend to be placed near to each other by the layout algorithm. That is, there are broad groups of rebels associated with each party.

However there are relationships between members of different parties too, which could imply pending defections or the seeds of new political parties. Also, well known, notorious characters in UK politics such as Dennis Skinner, Philip Hollobone, and Mark Reckless make strong appearances and are well connected within the graph.

This is quite an interesting view of MP relationships, and there is more that could be said here. However, the graph shows every case in which two MPs have rebelled together, so the resulting visualization is quite noisy. Fortunately, it can be cleaned up by setting a threshold to detect more interesting relationships.

Significant Rebellions

The following diagram only connects MPs when they have rebelled together on more than 1% of the votes they have both participated in - this threshold filters out a lot of the noise:
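A minimal sketch of that filter, assuming the number of joint rebellions per pair and the set of divisions each MP voted in are already available. The function name, argument layout, and toy numbers are hypothetical.

```python
def significant_rebel_edges(co_rebellion_counts, divisions_voted, min_share=0.01):
    """Keep pairs whose joint rebellions exceed `min_share` of shared divisions.

    co_rebellion_counts: {(mp_a, mp_b): times the pair rebelled together}
    divisions_voted: {mp: set of division ids the MP voted in}
    Returns {(mp_a, mp_b): share of shared divisions spent rebelling together}.
    """
    edges = {}
    for (a, b), count in co_rebellion_counts.items():
        shared = len(divisions_voted[a] & divisions_voted[b])
        if shared and count / shared > min_share:
            edges[(a, b)] = count / shared
    return edges


# Toy data: X and Y rebel together often, X and Z only occasionally.
divisions_voted = {
    "X": set(range(1000)),
    "Y": set(range(1000)),
    "Z": set(range(500)),
}
counts = {("X", "Y"): 25, ("X", "Z"): 3}
kept = significant_rebel_edges(counts, divisions_voted)
```

Dividing by the number of shared divisions, rather than using a raw count, stops MPs with long voting records from dominating the graph.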

Click here for full size version


Now, more structure becomes apparent, again thanks to the force-directed layout algorithm which attempts to place related nodes close to each other.

Three distinct clusters seem to have appeared.

One cluster is a fairly large group from the Conservative party, and the other two groups are mostly composed of Labour MPs with a sprinkling of Liberal Democrats. These groups could be indicators of alliances within parliament, focused around particular agendas - though a lot depends on the specifics of the voting that has taken place. Certainly, it's interesting to see that some fairly cohesive groups of MPs have a tendency to rebel together.

Rebel Alliances

The previous graph provides some nice insights, but there's a further technique we can apply to more clearly partition the clusters that seem to have emerged. Gephi (the tool used to build these visualizations) has a built-in algorithm for detecting communities, and applying that algorithm consistently partitions the MPs in the following way:

Click here for full size version

The three groups that were suggested by the previous visualization are clearly demarcated by the algorithm, confirming that there are cohesive groups that can be detected.
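To give a feel for what community detection does, here is a small stand-alone sketch. It does not reproduce Gephi's algorithm; it swaps in a much simpler technique, label propagation, in which each node repeatedly adopts the label carrying the most edge weight among its neighbours. The function name and toy graph are made up for illustration.

```python
from collections import defaultdict


def label_propagation(weighted_edges, max_rounds=100):
    """Group nodes by repeatedly adopting the heaviest label among neighbours.

    weighted_edges: {(node_a, node_b): weight} for an undirected graph.
    Nodes are visited in sorted order and ties go to the smallest label,
    which keeps the result deterministic.
    """
    neighbours = defaultdict(dict)
    for (a, b), weight in weighted_edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    label = {node: node for node in neighbours}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            totals = defaultdict(float)
            for other, weight in neighbours[node].items():
                totals[label[other]] += weight
            best = min(totals, key=lambda lab: (-totals[lab], lab))
            if best != label[node]:
                label[node] = best
                changed = True
        if not changed:
            break

    communities = defaultdict(set)
    for node, lab in label.items():
        communities[lab].add(node)
    return sorted(communities.values(), key=min)


# Toy graph: two tightly knit triangles joined by one weak edge.
toy_edges = {
    ("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
    ("d", "e"): 5, ("d", "f"): 5, ("e", "f"): 5,
    ("c", "d"): 1,
}
groups = label_propagation(toy_edges)
```

The weak edge between the triangles is outweighed by the strong edges inside them, so the two groups stay separate, much as loosely tied rebels can land in different communities.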

But interestingly, Zac Goldsmith (another well-known and somewhat controversial figure in UK politics) appears in a separate, smaller community from the larger group of rebels making up the Conservative cluster. This may be because Mr. Goldsmith has looser ties to the other Conservative rebels, and a stronger connection to Marsha Singh from Labour.

At least - that's what the data suggests...

Conclusion

The public whip data is a fascinating resource for understanding UK politics.

The voting records of MPs are a strong statement of their role in government, and of their political views. By analyzing those voting records, it's possible to derive relationships between MPs, and to identify interesting clusters - for example, of MPs that have a tendency to rebel together.

Of course an analysis is only as good as the underlying data, and it's important to note that official voting positions of UK political parties are not publicly available - so it's not possible to know for sure if an MP has rebelled, or has simply voted according to their conscience in a free vote. So rebelliousness in this context should be interpreted as "significantly misaligned with the rest of the party".

Also please note that the above analysis is meant to demonstrate the kind of insights that can be gained from open data. I am not a political pundit, so if you feel that the information is incorrect or misleading, let me know and I will address your comments in updates to this blog post.

Technical Notes

You can find the public whip data here. I used a combination of MySQL, Python, and Gephi to create the visualizations in this blog. The code, graphs, and images can be found in this GitHub repository.
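For readers who want to try something similar, one way to hand a graph from Python to Gephi is a CSV edge table. The sketch below is an assumption about a workable hand-off, not a description of the code in the repository; it relies on Gephi's spreadsheet import reading the Source, Target, Type, and Weight columns of an edge table.

```python
import csv
import io


def write_gephi_edges(edges, out):
    """Write {(source, target): weight} as a CSV edge table."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, "Undirected", weight])


# Toy example written to an in-memory buffer; pass an open file to save it.
buffer = io.StringIO()
write_gephi_edges({("MP A", "MP B"): 3, ("MP A", "MP C"): 1}, buffer)
edge_csv = buffer.getvalue()
```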

Sunday, 25 January 2015

How to create your own replica of the SureChEMBL patent-chemistry dataset

Introduction: Why replicate SureChEMBL?

SureChEMBL is a patent chemistry dataset and set of web services that provides a rich source of information to the drug discovery research community. It was previously owned, developed, and sold by Macmillan, but was recently handed over to the European Bioinformatics Institute (EMBL/EBI) and is now free for everyone to use.


SureChEMBL can already be accessed online, so why would a locally hosted replica be needed?

To answer that question, I'll give the reasons provided by a pharmaceutical company who recently commissioned me to develop a SureChEMBL data replication facility:
  • 1) Firewall restrictions can be avoided - companies involved in drug discovery are often working with substructures or other related search queries which may lead to highly lucrative discoveries. As such, researchers are often prohibited from using external web services, even secure services such as SureChEMBL, as a risk-mitigation strategy. Downloading data files - e.g. exported patent chemistry data - via standard web access points such as FTP is more likely to conform to corporate policy.
  • 2) It's far easier to integrate a replicated database into proprietary data processing pipelines, with dedicated search strategies and analytical processes. The SureChEMBL web site provides some chemical search capabilities, but cheminformatics is a rich and diverse field, and only a fraction of the possible search capabilities is exposed via the general-purpose web site.
  • 3) Dedicated resources can be provided for chemical search and analysis, avoiding the need to share public systems with other users. Chemistry searching is a resource-intensive activity, so taking complete control of the hosting is desirable.

Essentially a local replica provides far more flexibility, control, and safety to power users than can be offered by the SureChEMBL web site, hence my client's need.

As SureChEMBL is an open resource, EMBL/EBI stipulated that the result of this project be made available to everyone. As such, the resulting data loading mechanism is available here, under the MIT open source license. This means that anyone can now build their own local replica of the data.

Please note however that the data loading mechanism is provided as-is, with no warranty from myself, or from EBI who have provided access to the underlying data for this project.

The remainder of this article discusses the components and associated data flows, some pros and cons to hosting a replica, as well as potential enhancements that could be made to the data loading mechanism, and to SureChEMBL as a whole.

Components and Data Flows

The following diagram shows the key components, systems, and data flows required to create a local replica of SureChEMBL:

SureChEMBL data client architecture


Components in orange are hosted and maintained by EMBL/EBI. The components in blue must be provisioned and/or developed by the local system administrators or cheminformatics practitioners. The light blue component provides the bridge between the EMBL/EBI systems and the local RDBMS, and can be installed and run according to these instructions.

Patent documents are processed by the SureChEMBL pipeline as they are published, and cause updates to the master database. Then, every night, newly detected chemical annotations are extracted and made available in flat file format via an FTP server. These files are downloaded by the data client, and loaded into the RDBMS using a pre-defined schema.
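The load step can be sketched as follows. This is a toy illustration only: it uses an in-memory SQLite database in place of Oracle or MySQL, a made-up two-table schema, a made-up tab-separated file layout, and invented patent numbers. The real data client defines its own schema and file format, and the download from the FTP server is left out.

```python
import csv
import io
import sqlite3

# Hypothetical, heavily simplified schema; the real client defines its own.
SCHEMA = """
CREATE TABLE compound (
    compound_id INTEGER PRIMARY KEY,
    smiles      TEXT NOT NULL
);
CREATE TABLE patent_compound (
    patent_number TEXT NOT NULL,
    compound_id   INTEGER NOT NULL REFERENCES compound (compound_id),
    frequency     INTEGER NOT NULL,
    PRIMARY KEY (patent_number, compound_id)
);
"""


def load_annotations(conn, lines):
    """Load tab-separated rows: patent_number, compound_id, smiles, frequency."""
    rows = list(csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))
    with conn:  # commit on success, roll back on error
        # INSERT OR IGNORE / OR REPLACE are SQLite syntax; other databases
        # need their own upsert statements.
        conn.executemany(
            "INSERT OR IGNORE INTO compound (compound_id, smiles) VALUES (?, ?)",
            [(int(r[1]), r[2]) for r in rows],
        )
        conn.executemany(
            "INSERT OR REPLACE INTO patent_compound VALUES (?, ?, ?)",
            [(r[0], int(r[1]), int(r[3])) for r in rows],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample = (
    "US-0000001-A1\t1\tCCO\t3\n"
    "US-0000001-A1\t2\tc1ccccc1\t1\n"
    "EP-0000002-A1\t1\tCCO\t2\n"
)
load_annotations(conn, io.StringIO(sample))
load_annotations(conn, io.StringIO(sample))  # re-running a file adds no duplicates
```

Making the load idempotent is a design choice worth copying in any nightly job: if a download is retried or a file is processed twice, the database ends up in the same state.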

Once the data has been loaded into the database, it can be queried, searched, extracted, or processed as needed. For example a particular user may decide to add new compounds to a fully-fledged chemistry search engine, or to filter the compounds according to pre-defined criteria and generate alerts if any interesting new compounds are detected.

The database schema has been designed for portability, and tested on both Oracle and MySQL. The following diagram shows the tables and fields in the schema - click the link in the caption to zoom:

SureChEMBL data client schema.

Pros and Cons

If you've read this far, you're probably considering replicating SureChEMBL data in your local environment. If so, there are a few other things to bear in mind.

First, make sure you dedicate enough resources for the RDBMS. The resulting database will be in the order of hundreds of megabytes, and will take days to build rather than hours. The SureChEMBL data client uses an RDBMS because patent-chemistry data is strongly relational. The tradeoff for representing data in this way is typically lower performance and more complex querying and management - this is because the RDBMS manages consistency for you through foreign key relationships, uniqueness relationships, etc.

Second, the resulting dataset is not a one-to-one replica of the data hosted on the SureChEMBL servers. Rather, it's a "client-facing" version of the data that provides (a) a mapping between chemical compounds and patents, with frequency of occurrence, (b) chemical representations (SMILES, etc.), (c) chemical attributes such as molecular weight and medicinal chemistry alerts, and (d) patent metadata such as titles and classifications. You may find that you need to augment the SureChEMBL data from other sources, or request addition of other fields from EMBL/EBI.

Third, bear in mind that the local replica provides a snapshot of the SureChEMBL data, so any data quality improvements made to historic patents won't be reflected in your database; that said, an update or patching mechanism may be provided in the future.

Finally, remember that the SureChEMBL data client can be modified to suit your needs. The client includes, for example, a filtering mechanism that flags certain patents as "life science relevant", based on patent classifications. You may wish to change this based on your own definition of relevance, or to prevent irrelevant patents from being loaded at all.
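A classification-based relevance flag of this kind might look roughly like the sketch below. The prefix list is purely illustrative (a handful of IPC subclasses covering medicinal preparations, therapeutic activity, heterocyclic compounds, and microorganisms or enzymes), and the function name is hypothetical; the data client's actual rules may differ.

```python
# Illustrative IPC prefixes only; the client's actual list may differ.
LIFE_SCIENCE_PREFIXES = ("A61K", "A61P", "C07D", "C12N")


def is_life_science_relevant(classifications, prefixes=LIFE_SCIENCE_PREFIXES):
    """True if any classification code starts with one of the prefixes."""
    return any(
        code.replace(" ", "").upper().startswith(prefixes)
        for code in classifications
    )


# Toy patents with made-up classification lists.
relevant = is_life_science_relevant(["A61K 31/00", "G06F 17/30"])
irrelevant = is_life_science_relevant(["G06F 17/30"])
```

Swapping in your own prefix tuple changes the definition of relevance, and applying the same check before loading is one way to keep irrelevant patents out of the database altogether.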

Where next?

The data loading mechanism is complete as it stands, but there are a few useful extensions that may be added in the future.

In particular, a data patching mechanism would be useful, as it's currently easier to rebuild the replica than to apply changes to the loaded data. It would also be beneficial if there were richer patent metadata, as well as further chemical properties. Finally, it would be helpful if there were a way to seamlessly integrate a chemical search facility such as the JChem cartridge - as it stands there is no default facility in place for searching.

There are also several extensions that could be made to SureChEMBL. These include new types of annotation (such as other biological entities), addition of new patent authorities, as well as data filtering and other quality improvements.

If you have questions regarding the data loading mechanism described in this article, please see my website for contact details. If you have questions related to the wider SureChEMBL system, please contact the SureChEMBL team.