Saturday, 7 September 2013

Visualizing The Wheel of Time: Reader Sentiment for an Epic Fantasy Series

In the following blog post, I explore reader sentiment for the Epic Fantasy book series The Wheel of Time, as expressed in user-submitted ratings on Amazon and GoodReads.

If you're a data scientist (or similar), you'll probably be interested in the data analysis, which includes some interesting observations about the usefulness - or otherwise - of Amazon reviews. If you're an epic fantasy reader, you'll be interested in the outcome of my analysis: I've decided to go ahead and read all fourteen books.

Note - This is a spoiler-free zone
Updated 17th Sept: charts now more accurate, not quite as pretty

*

Recently, I was looking for a good book to read, and a friend recommended The Wheel of Time series by Robert Jordan. But I'd heard from a few sources that the later volumes were harder going than the earlier ones. Having struggled with the later volumes of the Game of Thrones series (sorry George), I was wary of starting a mammoth fourteen-volume series unless I was confident I could make it to the end.

So I did some research. First, I checked the Amazon reader-submitted ratings for the books, which are on a scale of 1 to 5 stars. Here's what they look like for the whole series:


Uh oh, that doesn't look good. Books eight to eleven get some pretty poor scores, though the later books seem to pick up again. It looks like the middle of the series could be hard work.

What's going on?

Next, I checked the GoodReads ratings for the series. GoodReads is a site designed "to help people find and share books they love... [and] to improve the process of reading and learning throughout the world." Here's how the GoodReads ratings (which are also submitted by readers and go from 1 to 5) stack up against the Amazon ratings:



So that's a little different. There's still a dip in the middle, but it's nowhere near as pronounced ... in fact the lowest aggregate rating is 3.84, far higher than the 1.8 for the same book on Amazon!

Let's look at the number of reviewers for the two systems, corresponding to the number of people who've read the book and recorded their review and/or star rating. First, GoodReads:



Well, that seems reasonable - the number of ratings tails off in the middle, then picks up towards the end, as you'd probably expect given the ratings we've seen. And generally speaking, each book has lots of ratings - the lowest count is for the final book, probably because A Memory of Light has been out for less than a year, far less time than the other books.

Oddly, though, more GoodReads users rated book twelve (The Gathering Storm) than quite a few of the earlier volumes.

This likely reflects the sad fact that volume eleven (Knife of Dreams) was the final volume completed by the original author Robert Jordan, who passed away in 2007. Volume twelve is the first volume written by Brandon Sanderson, the author chosen by Jordan's widow and editor, Harriet McDougal, to finish the series. RIP Robert.

How about Amazon reviewer counts then?



On Amazon, there are far more reviewers for the books that received the really low scores. This suggests that the really low scores are the result of frustrated readers motivated to express their concerns, rather than a reflection of relative enjoyability or quality per se.

GoodReads makes it extremely easy to submit a rating for a book - one click is all it takes. Amazon seems almost to discourage reviews - the "Write a Review" button is halfway down the page, and you must provide a title and description for your review. The net result is that the input of everyday browsing users isn't captured on Amazon - only motivated reviewers (such as the frustrated reader) will bother to jump through all the hoops.

*

Overall, therefore, it seems sensible to expect a dip in the enjoyability of The Wheel of Time series, from book eight to about book eleven.

But perhaps that dip isn't as severe as suggested by Amazon, whose ratings are likely skewed by frustrated readers. My guess is that many readers reach the later volumes and are frustrated by a change of pace; this certainly matches my experience with the Game of Thrones series, where events seemed to slow to a crawl in the most recent books. The problem is compounded when there are long gaps between books, which make it harder to pick up the story.

Thankfully, the final few books get much higher ratings across the board, so I'm expecting that it's worth getting through the slower books to reach the finale. At least, that's the story I'm telling myself ...

Only Time will tell.

Friday, 4 January 2013

Personal Data Hacks: Visualizing Data from OpenFlights.org

A friend recently told me about OpenFlights.org, a website that allows you to record, analyze, and share personal flight data. He showed me his dataset, which contained a record of every flight he'd taken over the past 10+ years. I was keen to investigate the dataset further, and my friend was happy to provide me with a copy so I could have a play (thank you Luigi!).

The end result is the following collection of visualizations created in Gephi, with a little help from R. They show key transport hubs and routes for airports, countries, and continents that my friend has visited, and demonstrate some of the fun, insightful ways you can use such personal data.

If you're interested in how the visualizations were created, check out the section at the end of this blog posting where I briefly describe the technologies required and steps involved.

Note: OpenFlights.org is free to use, and supported by advertising and donations. You can join me in supporting OpenFlights.org via this link.

Hub Airports and Key Routes

The first visualization below shows the primary airports and routes used by Luigi. Each airport has been sized and coloured according to the number of other airports it is connected to, while each connection has been weighted according to the number of times that route was flown. The layout here was generated in Gephi, ensuring (simply put) that related nodes are co-located:


As you can see, PSA (Pisa) and STN (London Stansted) are far and away the most used airports. Not only that, but the return journey between the two airports has been taken many times. These two facts make perfect sense given that Luigi is from Pisa, but moved to the UK a few years ago. Other significant hubs are London Heathrow, London Gatwick, and Rome - not too surprising.
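
The sizing rule behind the graph above (the number of other airports each airport is connected to) can be sketched in a few lines of Python. This is only an illustration: Gephi computes this itself, the `connection_counts` helper is hypothetical, and the route counts in the sample are made up.

```python
from collections import defaultdict

def connection_counts(routes):
    """Map each airport to the number of distinct airports it connects to.

    `routes` maps (origin, destination) pairs to flight counts. Direction is
    ignored here, so PSA->STN and STN->PSA count as one connection.
    """
    neighbours = defaultdict(set)
    for origin, destination in routes:
        if origin != destination:
            neighbours[origin].add(destination)
            neighbours[destination].add(origin)
    return {airport: len(others) for airport, others in neighbours.items()}

# Made-up sample data, for illustration only.
routes = {
    ("PSA", "STN"): 5,
    ("STN", "PSA"): 4,
    ("PSA", "FCO"): 1,
    ("STN", "LHR"): 1,
}

counts = connection_counts(routes)
for airport in sorted(counts, key=counts.get, reverse=True):
    print(airport, counts[airport])
```

The flight counts themselves are not used for node size; they only matter for the thickness of each connection.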

Key Countries and Connections

Given that many airports are within the same country, is it possible to reflect that in the visualization? One way to achieve this is to partition airports by colours corresponding to different countries, as follows:


So that's kind of OK - the predominance of Italy (yellowish-green) and the UK (blue) is starting to show - but it's quite confusing.

A better approach is to group the airports and connections by country, and to lay out the nodes according to their (approximate) geographical positions. The following graph also has a few graphical tweaks for readability:


We're now getting to something approximating a worldwide travel heatmap for my friend. The key travel hubs of the UK and Italy are obvious, and the key routes are jumping out more too: between Italy and the UK, Italy and Germany / France, and the UK and Spain. The significance of the other routes also becomes a bit more apparent - countries further afield correspond to occasional holiday travel (for instance).

Continental Travel

What about different continents? If we return to the original graph and partition the airports by continent, a European bias becomes very clear:


It's also nice to see the groupings of continental airports jumping out - in particular the green nodes in the bottom right corresponding to African airports. Note that I avoided grouping by continent here because the resulting node for Europe dwarfed all the other nodes, which didn't make for a good visualization.

Creating the Visualizations

The flight data is downloadable from OpenFlights.org as Comma Separated Values. I used a little command-line manipulation (awk, sort, and uniq) to compress the data into a list of unique flights, with a count corresponding to the number of times that flight was taken.
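
For anyone who would rather avoid the command line, here is a rough Python equivalent of that counting step. It is a sketch, not the awk, sort, and uniq commands I actually ran. The column names `From` and `To` and the sample rows are assumptions for illustration, so check them against the header of your own export.

```python
import csv
import io
from collections import Counter

def count_routes(csv_text, src_col="From", dst_col="To"):
    """Return a Counter mapping (origin, destination) to the number of flights."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[src_col], row[dst_col]) for row in reader)

# Made-up sample rows; a real export has many more columns.
sample = """Date,From,To
2010-01-05,PSA,STN
2010-01-12,STN,PSA
2010-03-01,PSA,STN
2011-06-20,LHR,FCO
"""

routes = count_routes(sample)
for (origin, destination), flights in routes.most_common():
    print(origin, destination, flights)
```

Note that PSA to STN and STN to PSA are counted separately, which keeps the direction of travel available for a directed graph later on.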

Next, I loaded the data into R, then converted it into a graph which could be easily exported to GML (Graph Modelling Language), then loaded into Gephi and visualized.
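
GML is a simple text format, so the same conversion can be sketched without R at all. The Python below is a stand-in for the R step, not the code I used. It writes one node per airport and one weighted edge per route. The `routes_to_gml` helper is hypothetical, and the use of `weight` as the edge attribute name is an assumption, so it is worth checking how your graph tool maps it on import.

```python
def routes_to_gml(routes):
    """Render {(origin, destination): count} as a directed graph in GML text."""
    airports = sorted({code for pair in routes for code in pair})
    ids = {code: index for index, code in enumerate(airports)}

    lines = ["graph [", "  directed 1"]
    for code in airports:
        lines += ["  node [", f"    id {ids[code]}", f'    label "{code}"', "  ]"]
    for (origin, destination), count in sorted(routes.items()):
        lines += [
            "  edge [",
            f"    source {ids[origin]}",
            f"    target {ids[destination]}",
            f"    weight {count}",  # assumed attribute name for the flight count
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines) + "\n"

# Made-up route counts, for illustration only.
routes = {("PSA", "STN"): 2, ("STN", "PSA"): 1, ("LHR", "FCO"): 1}
gml = routes_to_gml(routes)
print(gml)
```

Saving the returned string to a file with a `.gml` extension gives something that can be opened in a graph tool for layout and styling.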

The downloaded dataset didn't contain city, country, or continent data. Adding this required an export of nodes from Gephi, followed by a merge with the OpenFlights.org Airport dataset (spreadsheet magic), and a re-import into Gephi.
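
The merge itself is a plain lookup by airport code, and could be scripted rather than done in a spreadsheet. The sketch below assumes a lookup table with `IATA`, `City`, and `Country` columns. Those column names, the `Id` output column, and the `enrich_nodes` helper are all hypothetical, so adapt them to whatever your node export and airport file actually contain.

```python
import csv
import io

def enrich_nodes(node_codes, airports_csv):
    """Attach city and country to each airport code; unknown codes get blanks."""
    lookup = {row["IATA"]: row for row in csv.DictReader(io.StringIO(airports_csv))}
    enriched = []
    for code in node_codes:
        info = lookup.get(code, {})
        enriched.append({
            "Id": code,
            "City": info.get("City", ""),
            "Country": info.get("Country", ""),
        })
    return enriched

# A tiny stand-in for the airport dataset, with assumed column names.
airports_csv = """IATA,City,Country
PSA,Pisa,Italy
STN,London,United Kingdom
"""

nodes = enrich_nodes(["PSA", "STN", "XXX"], airports_csv)
for node in nodes:
    print(node)
```

Leaving unmatched codes blank rather than dropping them keeps every node in the graph, and makes gaps in the lookup table easy to spot before re-importing.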